
Traffic prediction is highly significant for intelligent transportation systems and traffic management. eXtreme Gradient Boosting (XGBoost), a scalable tree boosting algorithm, is proposed and improved to predict traffic state at higher resolution by utilizing the origin-destination (OD) relationship of segment flow data between upstream and downstream sections of the highway. To achieve fine-grained prediction, a generalized extended-segment data acquisition mode is added: information from the Automatic Number Plate Recognition System (ANPRS) at the exits and entrances of toll stations is incorporated, and flows for segments without cameras are derived indirectly by mathematical OD calculation. Abnormal-data preprocessing and spatio-temporal relationship matching are conducted to ensure the effectiveness of prediction. Pearson analysis of spatial correlation is performed to find the relevance between adjacent roads, and the relative importance of input modes is verified by comparing spatial lag input with ordinary input. Two improved models, independent XGBoost (XGBoost-I), with parameters tuned individually for each section, and static XGBoost (XGBoost-S), with parameters tuned globally, are constructed and combined with temporally relevant intervals and spatially staggered sectional lag. An early_stopping_rounds adjustment mechanism (EAM) is introduced to improve the performance of the XGBoost model. The prediction accuracy of XGBoost-I-lag is generally higher than that of XGBoost-I, XGBoost-S-lag, XGBoost-S, and other baseline methods for both short-term and long-term multistep-ahead prediction. Additionally, XGBoost-I-lag performs well under nonrecurrent conditions and with missing data, at reasonable running time. The experimental results indicate that the proposed framework is convincing, accurate, and computationally reasonable.

As a key technological component of intelligent transportation systems (ITS), traffic flow prediction has become an extensively researched topic. To support the dynamic application of ITS, traffic forecasting models usually predict traffic fluctuation, ranging from seconds to hours [

Based on the issues above, and given the rapid emergence of cities and their ever-closer connections, traffic prediction problems, particularly for highway segments, have attracted considerable attention. Accurate highway prediction has an obvious influence on logistics, trade, and commuting. At present, numerous sensors and cameras help us obtain traffic data. The Automatic Number Plate Recognition System (ANPRS), installed along mainlines and at the exits and entrances of highway toll stations, is a popular form of expert system that has been applied in many countries. The system is much needed for detecting vehicles and optimizing functions including monitoring, control, problem solving, fine-grained management, and compliance. Based on the ANPRS, travel time, traffic volume, travel route, and other traffic data can be acquired. Existing research utilizes historical ANPRS data on mainlines to predict traffic state and traffic congestion, for example via iterative tensor decomposition (ITD) [

Plenty of machine learning (ML) models are used to predict traffic flow, but existing ML methods still face many challenges in dealing with big data [

At present, a large number of studies [

First, segment data can be obtained directly where cameras cover the road section. Second, based on the ANPRS, we propose a section-flow calculation method for the highway to predict the traffic state finely and microcosmically. For segment flows that cannot be obtained directly, we use the OD relationship between entrances and exits of toll stations together with the license-plate matching relationship between upstream and downstream roads for mathematical calculation. The established calculation method can handle all similar cases and can be extended to the same scenarios for data acquisition and road-section division, so as to further manage and prevent traffic congestion. Third, compared with other prediction methods, XGBoost has advantages in scalability, efficiency, low computational cost, support for parallelization, and regularization. Here, we propose an improved XGBoost-based spatio-temporal method with the EAM optimization mode to predict the traffic flow of the segmented highway, considering multistep short-term and long-term prediction, the influence of nonrecurrent incidents, and the spatial interaction of sophisticated staggered sections.

The remainder of the paper is organized as follows. The next section summarizes the related literature.

Traffic prediction is mainly divided into parametric and nonparametric methods [

Due to the uncertainty of the traffic data structure and the nonlinear relationships hidden behind datasets, nonparametric methods are more flexible and expressive enough to capture such nonlinearity. Methods such as the support vector machine (SVM) have been applied to predict traffic flow [

Recently, XGBoost, a successful prediction method, has been applied to many Kaggle competition problems and other applications with excellent results, such as Didi products. It is a decision-tree-based method developed by Chen and Guestrin [

Figure

Methodology framework.

XGBoost is based on the GBDT model and improves the calculation speed of the algorithm while optimizing its performance and efficiency, striking a balance between the two. Compared with GBDT, XGBoost explicitly adds the complexity of the tree structure as a regularization term and uses second-derivative information in deriving the optimization objective, whereas GBDT uses only first-order information. XGBoost implements an approximate algorithm for the split-node search to speed up computation and reduce memory consumption. The node-splitting algorithm automatically exploits feature sparseness, and the data are sorted in advance and stored in blocks, which is conducive to parallel computing.

The core idea of XGBoost [

XGBoost [

Here,

Here,

In equation (

During training, a new

Next, we expand the objective function by a second-order Taylor expansion, keeping the first three terms and discarding the higher-order infinitesimal terms. Finally, our objective function is transformed into equation (

According to equations (

In general, we cannot enumerate all possible tree structures and choose the optimal one; hence, we use a greedy algorithm, which greatly improves computing efficiency. We start with a single leaf node and iteratively split it to add nodes to the tree. By enumerating the feasible split points and selecting the partition with the minimum objective and maximum gain, the gain equation becomes
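In the standard notation of Chen and Guestrin, with $G_L$, $H_L$ ($G_R$, $H_R$) denoting the sums of first- and second-order gradients over the left (right) child, this split gain is:

```latex
% Standard XGBoost split gain; the three bracketed terms are the scores
% of the left, right, and original (unsplit) leaves, and lambda, gamma
% are the regularization weights.
\mathrm{Gain} = \frac{1}{2}\left[
    \frac{G_L^{2}}{H_L+\lambda}
  + \frac{G_R^{2}}{H_R+\lambda}
  - \frac{(G_L+G_R)^{2}}{H_L+H_R+\lambda}
\right] - \gamma
```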

The above equation evaluates the reduction in the loss function after a split and, in practice, is used to score candidate splits. The XGBoost model produces many simple trees whose leaf-node scores are assessed during splitting. The first, second, and third terms of the equation represent the scores on the left, right, and original leaves, respectively. In addition,

Our data is derived from Shaoxing, Zhejiang Province, China, as shown in Figure

Target highway.

We divide the entire segment into 7 sections from northwest to southeast, which are labeled as Sections

First, each vehicle is mapped to the up-direction or down-direction road accordingly. Upstream and downstream traffic data from the respective ANPRS cameras are counted according to license plate matching, and the number of vehicles in every 5-minute interval, in chronological order, is counted as segment data; one license plate number corresponds to one vehicle. For toll stations, when vehicles pass entrances or exits, their license plate numbers are captured and recorded by cameras. In the studied highway, there are four cameras. If we obtained traffic data only for the four sections covered directly by camera locations, it would not be conducive to understanding the traffic situation and the OD law involving toll stations and ramps. Therefore, the ANPRS data acquisition mode not only divides the whole segment more finely but also captures more traffic information, so that future segmented traffic flow can be predicted accurately. The method is applicable to any highway with the same structure, including service areas and roundabouts; moreover, toll stations with entrances and exits are universal on highways in any country. Accordingly, we propose a generalized method for accurately acquiring highway segment data.

To illustrate, for the different sections there are two ways to obtain the accurate traffic flow. The first is to obtain segment data directly through the cameras capturing the number of vehicles in the corresponding sections. In our target highway, Sections

The generalized extended-segment data calculation mode is as follows. According to Figure

Schematic diagram of traffic flow calculation for Section X1.

After vehicles enter through the descending entrance of the toll station

Part of the vehicles pass through the camera

For our target highway, traffic flow data of Section
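The flow-conservation idea behind the extended-segment calculation can be sketched as follows. This is our own illustrative simplification, not the paper's exact formula: the function name and the simple additive relation are assumptions, whereas the actual method matches license plates between specific cameras and toll-station gates.

```python
# Illustrative sketch (our own simplification, not the paper's exact formula):
# the 5-min flow on a camera-free segment is inferred from the nearest
# upstream camera count plus toll-station entries minus exits in between.
def extended_segment_flow(upstream_count, toll_entries, toll_exits):
    """Infer flow on a camera-free segment by flow conservation."""
    flow = upstream_count + toll_entries - toll_exits
    return max(flow, 0)  # a vehicle count can never be negative
```

For example, if 120 vehicles passed the upstream camera while 30 entered and 15 left at the intervening toll station, the inferred segment flow is 135.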

Outliers in the datasets far exceed the ground truth and greatly affect the accuracy of prediction. To suppress their influence, we apply winsorization to preprocess the data [

Here,
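A minimal winsorization sketch follows; the percentile thresholds are our assumption, since the paper does not state its exact cut-offs:

```python
import numpy as np

def winsorize(x, lower=0.01, upper=0.99):
    """Clip values below the 1st and above the 99th percentile.
    Threshold choice is illustrative, not the paper's setting."""
    lo, hi = np.quantile(x, [lower, upper])
    return np.clip(x, lo, hi)
```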

We divide the datasets of 80 days into 75 days for the training set and 5 days for the testing set. Moreover, numerous training iterations are conducted to find the recursive relationship in the traffic flow and attain more accurate prediction. The entire dataset uses the past six 5-min flow observations to predict the future. The Python library Keras, built on TensorFlow, is used to build our models. All experiments are performed on a PC server with the following configuration: Intel(R) Xeon(R) CPU E5-1650, 3.50 GHz, and 64 GB of memory.
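The sliding-window construction described above (six past 5-min observations as input) can be sketched as follows; the function and parameter names are illustrative:

```python
def make_supervised(series, n_lags=6, horizon=1):
    """Turn a flow series into (X, y) pairs: each X row holds the
    past n_lags observations, y the value `horizon` steps ahead."""
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])  # past six 5-min flows
        y.append(series[t + horizon - 1])  # target flow
    return X, y
```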

There is a spatial transmission correlation between the highway traffic flows of different sections. To verify this, we use the Pearson correlation test given in the following equation to test that

We calculate

Pearson correlation results of up-direction and down-direction.
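The Pearson coefficient used here is the standard sample correlation; a dependency-free sketch:

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```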

We adjust the XGBoost parameters independently for each of the 7 sections in the up-direction and down-direction, yielding XGBoost-I. During training iterations, an early_stopping_rounds adjustment mechanism (EAM) for parameter tuning is introduced to improve the XGBoost method. When the iteration with the lowest error so far appears, the model continues to iterate 100 more times; if no lower error is found, iteration terminates, otherwise the model repeats the EAM procedure. This avoids missing the optimal parameters before the best case is reached.
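In the xgboost Python API this corresponds to passing `early_stopping_rounds=100` to `xgb.train`; the patience logic itself can be sketched independently of the library (names are ours):

```python
def eam_best_round(val_errors, patience=100):
    """Scan per-round validation errors; stop once `patience` rounds
    pass without improvement, and return (best_round, best_error)."""
    best_err, best_round = float("inf"), -1
    for i, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_round = err, i
        elif i - best_round >= patience:
            break  # no improvement within the patience window
    return best_round, best_err
```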

num_boost_round, which refers to the number of boosting trees, represents the number of training iterations. Too small a value results in underfitting, while too large a value causes overfitting. num_boost_round and learning_rates are generally adjusted together, where learning_rates is a list giving the learning rate of each iteration. The adjustment of the num_boost_round and learning_rates parameters of XGBoost-I for the up-direction and down-direction is shown in Figure

Num_boost_round and learning_rates up-direction and down-direction parameters’ adjustment results.

num_boost_round results of 14 roads.

Dir. | Section | num_boost_round | Dir. | Section | num_boost_round
---|---|---|---|---|---
Up | Road 1 | 358 | Down | Road 8 | 219
Up | Road 2 | 453 | Down | Road 9 | 531
Up | Road 3 | 437 | Down | Road 10 | 571
Up | Road 4 | 301 | Down | Road 11 | 299
Up | Road 5 | 271 | Down | Road 12 | 373
Up | Road 6 | 344 | Down | Road 13 | 424
Up | Road 7 | 340 | Down | Road 14 | 280

During model training, the other parameters also need to be determined. max_depth is the maximum depth of a tree; increasing its value makes the model more complex and more prone to overfitting. min_child_weight determines the minimum sample weight of the leaf nodes and is likewise used to avoid overfitting: when its value is large, the model avoids learning overly local samples. We adjust max_depth and min_child_weight synchronously, and the adjustments for the up-direction and down-direction are shown in Figure

max_depth and min_child_weight up-direction and down-direction parameter adjustment results.

Reg_alpha is the

reg_alpha and reg_lambda up-direction and down-direction parameter adjustment results.

gamma specifies the minimum loss reduction required for node splitting. subsample controls the ratio of random sampling of the training data used for each tree. Setting scale_pos_weight enables the algorithm to converge faster. evals is a list of evaluation elements monitored during training, allowing the effect of the validation set to be observed. General parameters control the macro behavior of XGBoost, and learning-objective parameters control the optimization goal and the metric measured at each step.

XGBoost has over thirty hyperparameters; hence, we choose the parameters with the greatest impact on optimization performance. The best parameter settings of the XGBoost-I model are listed in Table

Parameter settings of XGBoost-I.

Type | Parameter | Settings
---|---|---
Booster | max_depth | 3
Booster | min_child_weight | 10
Booster | gamma | 0
Booster | subsample | 1
Booster | reg_alpha | 0.1
Booster | reg_lambda | 0.05
Booster | scale_pos_weight | 1
General | Booster | gbtree
General | Silent | 0
General | Nthread | max
Learning target | Objective | reg:gamma
Learning target | eval_metric | Depending on objective
Learning target | Seed | 0
Learning target | learning_rates | 0.04
Learning target | eval_metric | rmse
Learning target | evals | evallist
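Expressed as a parameter dictionary for the xgboost learning API, the table above might look as follows; the commented `xgb.train` call shows intended usage, where `dtrain` and `evallist` are placeholders and the num_boost_round value is the one tuned for Road 1:

```python
# Table settings of XGBoost-I as an xgboost parameter dict (sketch).
params = {
    "booster": "gbtree",
    "max_depth": 3,
    "min_child_weight": 10,
    "gamma": 0,
    "subsample": 1,
    "reg_alpha": 0.1,
    "reg_lambda": 0.05,
    "scale_pos_weight": 1,
    "objective": "reg:gamma",
    "eval_metric": "rmse",
    "seed": 0,
    "silent": 0,
    "nthread": -1,   # "max" in the table: use all available threads
    "eta": 0.04,     # per-round learning rate (learning_rates in the table)
}
# model = xgb.train(params, dtrain, num_boost_round=358,
#                   evals=evallist, early_stopping_rounds=100)  # EAM
```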

Here, we build another, static XGBoost model (XGBoost-S), in which parameters are adjusted globally across all sections of the up-direction and down-direction. Specifically, the overall optimal parameters are adopted for all 14 roads; the settings are shown in Table

Parameter settings of XGBoost-S.

Type | Parameter | Settings
---|---|---
Booster | max_depth | 5
Booster | min_child_weight | 10
Booster | gamma | 0
Booster | subsample | 1
Booster | reg_alpha | 0.01
Booster | reg_lambda | 0.05
Booster | scale_pos_weight | 1
General | Booster | gbtree
General | Silent | 0
General | Nthread | max
Learning target | Objective | reg:gamma
Learning target | eval_metric | Depending on objective
Learning target | Seed | 0
Learning target | learning_rates | 0.25
Learning target | num_boost_round | 80

To evaluate the different prediction methods, we employ Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) as evaluation indices. Given the predicted value
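The three indices follow their usual definitions; a small dependency-free sketch:

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (assumes no zero ground truth)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```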

This model is divided according to the datasets in

Spatial lag of up-direction and down-direction.

For the XGBoost models examined in this study (XGBoost-I and XGBoost-S take ordinary input without spatial lag), input modes are calibrated according to the analyses above; the spatial-lag variants are called XGBoost-I-lag and XGBoost-S-lag, respectively. Comparisons are made for highway traffic prediction on the seven sections (14 roads). The corresponding prediction results for the up-direction and down-direction are given in Tables

Up-direction-related XGBoost models’ prediction results.

Methods | Error | Road 1 | Road 2 | Road 3 | Road 4 | Road 5 | Road 6 | Road 7 |
---|---|---|---|---|---|---|---|---|

XGBoost-I | RMSE | 33.6726 | 17.5608 | 20.2317 | 19.4625 | 20.8586 | ||

MAE | 23.8850 | 13.0644 | 15.1233 | 14.4374 | 14.0289 | |||

MAPE | 13.9770 | 12.1249 | 13.7570 | 12.2136 | ||||

XGBoost-I-lag | RMSE | 26.2245 | 30.4392 | |||||

MAE | 18.4053 | 20.6569 | ||||||

MAPE | 13.4027 | 10.9286 | 12.5402 | 15.2520 | ||||

XGBoost-S | RMSE | 34.5123 | 18.3918 | 22.1078 | 26.1455 | 21.2394 | 21.7922 | 29.5969 |

MAE | 24.1088 | 13.3061 | 15.4927 | 18.4941 | 15.0558 | 14.1222 | 19.6536 | |

MAPE | 14.1671 | 12.1188 | 13.7593 | 10.8694 | 12.6695 | 15.9026 | 14.4065 | |

XGBoost-S-lag | RMSE | 32.2435 | 17.2803 | 20.9974 | 26.6617 | 20.2683 | 21.0335 | 31.4545 |

MAE | 22.2236 | 12.4192 | 14.5295 | 18.5168 | 14.4424 | 13.7168 | 20.8345 | |

MAPE | 13.5993 | 11.5780 | 11.1660 | 12.5718 | 13.7145 | 15.4466 |

Down-direction-related XGBoost models’ prediction results.

Methods | Error | Road 8 | Road 9 | Road 10 | Road 11 | Road 12 | Road 13 | Road 14 |
---|---|---|---|---|---|---|---|---|

XGBoost-I | RMSE | 27.6826 | 19.5893 | 18.5488 | 31.1616 | 18.1484 | 17.9374 | |

MAE | 19.7969 | 14.3576 | 13.3581 | 21.9113 | 13.2407 | 11.7431 | ||

MAPE | 11.0912 | 12.6743 | 12.8101 | 13.7022 | 12.1961 | 14.6207 | ||

XGBoost-I-lag | RMSE | 31.9440 | ||||||

MAE | 22.2563 | |||||||

MAPE | 14.1016 | |||||||

XGBoost-S | RMSE | 28.2182 | 20.2276 | 18.9394 | 32.1894 | 18.7528 | 19.9912 | 32.6606 |

MAE | 19.5314 | 14.3086 | 13.3500 | 22.1994 | 13.3074 | 11.8262 | 22.4856 | |

MAPE | 11.0098 | 12.6626 | 12.8490 | 13.8508 | 12.3134 | 14.9592 | 14.9703 | |

XGBoost-S-lag | RMSE | 25.2074 | 18.7069 | 18.1568 | 30.3098 | 19.1776 | 20.2706 | 32.8841 |

MAE | 17.3920 | 13.0054 | 12.7606 | 20.9290 | 13.6335 | 11.9992 | 22.6842 | |

MAPE | 9.8163 | 11.4915 | 12.2118 | 12.7907 | 12.4502 | 14.7001 | 14.8923 |

On the whole, across the fourteen roads of the up-direction and down-direction, RMSE and MAE of the XGBoost-I-lag model are optimal except on Roads 4, 7, and 14. For MAPE, XGBoost-I-lag is the best on the other 9 roads, the exceptions being Roads 3, 4, 6, 7, and 13. Averaged over all roads, RMSE, MAE, and MAPE of the XGBoost-I-lag model are better than those of the XGBoost-I model by 3.33%, 3.99%, and 2.87%, respectively; they outperform the XGBoost-S model by 7.14%, 4.68%, and 5.97%, respectively, and the XGBoost-S-lag model by 4.33%, 1.29%, and 2.38%, respectively. Notably, the three errors of the XGBoost-S-lag model are better than those of the XGBoost-S model by 3.02%, 3.56%, and 3.81%, respectively. Overall, the XGBoost-I-lag prediction is the most accurate, owing to the separate adjustment of parameters for each road. Moreover, the results of the fourteen roads individually tuned under the XGBoost-I parameter model are better than those of the globally tuned XGBoost-S parameter model; that is, the separate optimal parameter structure of each road is evidently better than a single overall optimal parameter structure. In addition, the spatial lag input results of both XGBoost-I and XGBoost-S are better than the ordinary input. Furthermore, concerning the different segment features of the fourteen roads, errors of Section

Figure

The four XGBoost models on the first-day predicted results and ground truth.

The accuracy of the proposed XGBoost methods is verified using the three error types for traffic prediction, superimposed over various periods of twenty-four hours. Figure

Three errors for the up-direction and down-direction of twenty-four hours’ prediction.

Figure

Figures

Special sample (traffic accident) traffic flow prediction results of the XGBoost-I-lag method.

Special sample (weather factor) traffic flow prediction results of the XGBoost-I-lag method.

In intelligent transportation systems, missing data is an inevitable and widespread phenomenon, although many studies [

RMSE of up-direction and down-direction with different missing rates.

MAE of up-direction and down-direction with different missing rates.

Judging from the average results of the up-direction and down-direction, the error increases sharply as the missing rate grows. When the missing rate reaches 40%, RMSE and MAE increase by more than 70%. It is obvious that missing data have a great impact on the XGBoost-I-lag traffic flow prediction results. Although the model performs well on complete data, data preprocessing evidently has a substantial influence on model accuracy and the prediction results.
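The missing-rate experiment can be emulated by randomly masking entries of the flow series; the mask value and seed below are illustrative assumptions:

```python
import random

def mask_missing(series, rate, seed=0):
    """Replace a `rate` fraction of entries with None to emulate sensor dropout."""
    rng = random.Random(seed)
    return [None if rng.random() < rate else v for v in series]
```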

The proposed method in this study is compared with the following baselines:

CNN network structure.

Layer (type) | Output shape | Param
---|---|---
input_1 (InputLayer) | (None, 6, 7) | 0
conv1d_1 (Conv1D) | (None, 6, 7) | 56
conv1d_2 (Conv1D) | (None, 6, 7) | 56
conv1d_3 (Conv1D) | (None, 6, 7) | 56
conv1d_4 (Conv1D) | (None, 6, 7) | 56
conv1d_5 (Conv1D) | (None, 6, 7) | 56
conv1d_6 (Conv1D) | (None, 6, 7) | 56
flatten_1 (Flatten) | (None, 42) | 0
dense_1 (Dense) | (None, 12) | 516

Each LSTM network structure.

Layer (type) | Output shape | Param
---|---|---
lstm_1 (LSTM) | (None, 6, 7) | 252
dense_1 (Dense) | (None, 64) | 3200
dense_2 (Dense) | (None, 12) | 780

We compare the performance of XGBoost-I and XGBoost-S with the four baseline methods (SARIMA, CNN, RF, and LSTM) based on the datasets. Figure

Comparison of three errors in different methods of up-direction and down-direction.

Compared to traditional prediction methods, XGBoost performs better. Except for SARIMA, the spatial lag input of the other methods is better than the ordinary input; SARIMA performs individual prediction on each road without reflecting the characteristics of lag input, so, owing to misalignment of the data, its prediction effect is evidently reduced. In addition, calculation time is based on the training and testing time; the longer a program runs, the more CPU resources it uses. XGBoost-I provides the best performance with a runtime of 162 s, while XGBoost-S has the shortest running time of only 123 s. This is because XGBoost-I uses a different number of trees for each road to maximize optimization, so additional branches are explored and its time is longer than that of XGBoost-S. RF completes in 214 s and is also a reasonable method in view of its results: random forests support parallel training, which speeds up training, and suit high-dimensional data processing. Although the running time of CNN (225 s) is close to that of RF, its prediction is much worse. SARIMA (383 s) and LSTM (2827 s) have longer running times; although LSTM obtains satisfactory results, its time cost and system consumption are too high. Therefore, XGBoost-I is considered the best choice among these six common methods for highway traffic prediction.

We utilize historical data for the next 60 minutes to predict highway traffic in the next 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, and 60 minutes. Figure

Comparison of traffic prediction performance of different methods in short-term 5-minute time steps of mean RMSE, MAE, and MAPE.

We compare the XGBoost family models with the baseline models (SARIMA, CNN, RF, and LSTM) under the corresponding input modes. The short-term traffic flow prediction results show that XGBoost-I-lag is the most accurate in RMSE, MAE, and MAPE, because XGBoost adopts different parameter adjustments and tree structures for different sections while considering temporal and spatial characteristics. Moreover, in short-term prediction the spatial lag input of each method is better than the ordinary input. CNN shows the worst prediction ability, while RF and LSTM show similar accuracy, indicating that spatio-temporal characteristics play vital roles in short-term traffic prediction. The prediction errors of each method increase as the prediction range increases, and the different XGBoost methods have more stable prediction trends than the other methods.
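Multistep-ahead forecasts like these can be produced recursively by feeding each one-step prediction back into the input window. This sketch uses our own names and the recursive strategy as an assumption; direct multi-output training is an equally plausible reading of the paper:

```python
def multistep_forecast(predict_one, history, n_lags=6, steps=12):
    """Roll a one-step model forward: each 5-min prediction is fed back
    into the input window to produce the next step (recursive strategy)."""
    window = list(history[-n_lags:])
    preds = []
    for _ in range(steps):
        yhat = predict_one(window)
        preds.append(yhat)
        window = window[1:] + [yhat]  # slide the window forward
    return preds
```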

Long-term forecasting mainly benefits travelers planning longer trips and is considered more challenging than short-term forecasting. We predict the traffic flow for the next 35, 40, 45, 50, 55, and 60 minutes based on the historical data. Figure

Comparison of traffic prediction performance of different methods in long-term 5-minute time steps of mean RMSE, MAE, and MAPE.

Regarding CNN-lag, the spatial lag input, which better highlights spatial features, is evidently far superior to CNN for long-term traffic flow prediction. For long-term traffic prediction, spatial information contributes more than temporal characteristics, confirming the advantage of CNN in utilizing the spatial characteristics of the traffic network. The spatial lag results of the other methods are also apparently better. Similar to short-term prediction, CNN shows the worst performance, while RF still performs similarly to LSTM, and the errors increase as the prediction range increases; however, the long-term errors grow marginally faster than the short-term ones. Compared with the other models, XGBoost-I-lag achieves the best accuracy in both short-term and long-term highway traffic flow prediction and obtains the most stable trend. These results demonstrate the superiority and feasibility of the improved XGBoost model with the proposed EAM optimization mode and tree structures, and the model is able to capture the traffic features and the regularity of highway traffic flow.

The ability to predict highway traffic flow accurately is important for proactive traffic management strategies, as it provides reliable travel information for commuters. In this paper, improved XGBoost traffic flow prediction methods and a generalized segmented-data acquisition mode are proposed. We then introduce an optimization approach based on the EAM mode and a lag strategy involving spatio-temporal delivery. For computing and processing the datasets, the XGBoost-I parameter structures are adjusted separately for the up-direction and down-direction roads. XGBoost-I-lag achieves the best performance compared with the XGBoost-S series models and other baseline models. Multistep performance is evaluated, and the model is examined on the prediction of segment data and ANPRS data to verify its accuracy. It is confirmed that missing data greatly affect the traffic flow prediction results of XGBoost-I-lag. Except for SARIMA, the spatial lag input of all methods is better than the ordinary input. It is also observed that the identified spatio-temporal lag strategy is extremely necessary for highway traffic prediction.

In the near future, we plan to improve the prediction accuracy of the improved XGBoost framework in two directions: (1) exploring and adjusting more effective XGBoost parameters and further expanding the usability of the EAM optimization mode; (2) extending the segmented-data calculation mode to more practical scenarios for subtle section division, and broadening this research to estimate wider highway networks.

The ANPRS data used to support the findings of this study are available from the corresponding author upon request.

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This research was funded by National Natural Science Foundation of China (51578040), Beijing Natural Science Foundation (8162013), and Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (CIT&TCD20180324).