Traffic Incident Clearance Time Prediction and Influencing Factor Analysis Using Extreme Gradient Boosting Model

Accurate prediction and reliable significant-factor analysis of incident clearance time are two main objectives of a traffic incident management (TIM) system, as they can help relieve the traffic congestion caused by traffic incidents. This study applies the extreme gradient boosting algorithm (XGBoost) to predict incident clearance time on freeways and to analyze the significant factors of clearance time. XGBoost integrates the strengths of statistical and machine learning methods: it can flexibly handle nonlinear data in high-dimensional space and quantify the relative importance of the explanatory variables. The data collected from the Washington Incident Tracking System in 2011 are used in this research. To investigate the patterns hidden in the data, K-means is chosen to cluster the data into two clusters, and an XGBoost model is built for each cluster. Bayesian optimization is used to tune the parameters of XGBoost, and the mean absolute percent error (MAPE) is the indicator used to evaluate prediction performance. A comparative study confirms that XGBoost outperforms other models. In addition, response time, AADT (annual average daily traffic), incident type, and lane closure type are identified as the significant explanatory variables for clearance time.


Introduction
According to Lindley [1], traffic incidents cause about 60% of nonrecurrent traffic congestion. Such congestion can have many adverse effects, such as reducing roadway capacity, increasing the likelihood of secondary incidents [2], and creating unfavorable social and economic consequences [3]. When a traffic incident occurs, timely and reliable incident duration prediction plays an important role in helping traffic authorities design strategies for traffic guidance. According to the Highway Capacity Manual, traffic incident duration consists of four phases [4]: detection time (the time from incident occurrence to detection), response time (the time from incident detection to verification), clearance time (the time from incident verification to clearance), and recovery time (the time from incident clearance to the return of normal traffic conditions). Severe incidents that are not cleared in time may lead to a duration two or even three times longer [5]. Compared with the other phases, clearance time is the most important and time-consuming phase of the incident process. Thus, the aims of this paper are to effectively predict clearance time and to investigate its significant influencing factors.
In summary, conventional incident clearance time prediction studies rely on either statistical models with prior assumptions or machine learning models with poor interpretability [43]. To address these issues, we apply the extreme gradient boosting (XGBoost) method to predict clearance time and then investigate the significant influencing factors of traffic incident clearance time. XGBoost inherits the advantages of both statistical and machine learning models: it can handle nonlinear, high-dimensional data while computing the relative importance of the variables.
In this study, the prediction performance of XGBoost is examined using data from the Washington Incident Tracking System in 2011. To better explore the patterns hidden in the original data, we cluster the data according to their inherent properties and then build an XGBoost model for each cluster. The framework of the proposed method is detailed in Section 3.5. The remainder of this paper is organized as follows. The data source is described in Section 2. Section 3 presents the K-means algorithm, the XGBoost algorithm, the Bayesian optimization algorithm, the evaluation indicator, and the framework of the proposed method. The model results and discussion are given in Section 4. The last section concludes the paper.

Data Description
Traffic incident data were collected from the Washington Incident Tracking System (WITS) for incidents that occurred on the section from Boeing Access Road (Milepost 157) to the Seattle Central Business District (Milepost 165). This segment is not only a high incident-occurrence area but also carries heavy traffic demand [44]; therefore, it was chosen as the research object. The annual average daily traffic (AADT) comes from the Highway Safety Information System (HSIS) database, and the historical weather data were obtained from the National Oceanic and Atmospheric Administration (NOAA)'s weather stations in the region. The components of the data are detailed in Table 1. There are 14 discrete explanatory variables and 2 continuous explanatory variables in this dataset. According to their properties, they are divided into six categories: incident, temporal, geographical, environment, traffic, and operational. The detailed value sets of the variables are presented in the third column of Table 1. To equalize the variability of the independent variables, both the response time and AADT variables are normalized [41, 43-46].
In total, 2565 incident records were retrieved from the WITS database for the period from 1 January to 31 December 2011. The mean and standard deviation of clearance time are 13.10 minutes and 14.63 minutes, respectively. A large standard deviation (14.63 min) means that most clearance time values deviate considerably from the average. That is, the original data should be processed so that they are better organized.

Methodology
3.1. K-Means Algorithm. The K-means algorithm, developed by MacQueen [47], is one of the most widely used dataset clustering methods. Samples with similar characteristics can be clustered into the same class by using K-means [48]. The data used in this research are expressed as {x_i = [x_{i1}, x_{i2}, ..., x_{im}], y_i}, i = 1, 2, ..., n, where n represents the number of incidents, m is the number of explanatory variables, and y_i denotes the actual clearance time.
The detailed steps of the K-means algorithm are as follows:

Step 1: assuming the number of clusters (K clusters) and choosing the cluster centers randomly from the dataset.
Step 2: assigning every remaining sample to the cluster with the nearest center, according to the distance function

x_i ∈ C_a if ‖x_i − O_a‖ ≤ ‖x_i − O_b‖ for all b ≠ a. (1)

Here, O_a and O_b are the centers of cluster a and cluster b, and C_a denotes cluster a.
Step 3: after all samples have been clustered, recalculating the center of each cluster by using the following equation:

O_j = (1 / N_{C_j}) Σ_{x_i ∈ C_j} x_i, (2)

where N_{C_j} is the number of samples in cluster j.
Step 4: repeating Step 2 and Step 3 until the change in the cluster centers is within the permitted tolerance. Accordingly, the value of K and the initial cluster centers are important to the clustering performance, as K-means strongly depends on the selection of the initial centers and the number of clusters K. To obtain a reasonable K, we use the silhouette coefficient as the evaluation index, which was proposed by Rousseeuw [49] and is defined as follows:

s(i) = (b(i) − a(i)) / max{a(i), b(i)}. (3)

Journal of Advanced Transportation
Here, a(i) is the average distance between sample i and the other samples within the same cluster, and b(i) is the lowest average distance from sample i to the samples of any other cluster.
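The clustering steps and the silhouette index above can be sketched in plain Python. This is a minimal illustration on synthetic two-blob data; the toy data and variable names are ours, not the paper's implementation:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                      # Step 1: random initial centers
    for _ in range(iters):
        # Step 2: assign each sample to its nearest center
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Step 3: recompute each center as the mean of its cluster
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new.append(tuple(sum(v) / len(members) for v in zip(*members)) if members else centers[c])
        if new == centers:                               # Step 4: stop when the centers settle
            break
        centers = new
    return labels

def silhouette(points, labels):
    # mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all samples
    score = 0.0
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own) if own else 0.0
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        score += (b - a) / max(a, b)
    return score / len(points)

rng = random.Random(42)
pts = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(60)] + \
      [(rng.gauss(4, 0.3), rng.gauss(4, 0.3)) for _ in range(60)]
scores = {k: silhouette(pts, kmeans(pts, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)                     # the K with the highest silhouette
```

On two well-separated blobs the silhouette peaks at K = 2, mirroring how the best K is chosen for the incident data.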

3.2. Extreme Gradient Boosting Machine Algorithm.
Chen and Guestrin [50] proposed the extreme gradient boosting (XGBoost) algorithm. It can be regarded as an advanced variant of the gradient boosting decision tree (GBDT) and adopts decision trees as base learners for classification and regression. Boosting is an ensemble approach that corrects the prediction error of the current model by adding new models to it [41]. The prediction of a boosting model is the sum of the scores of all its models. Accordingly, the prediction of XGBoost is the sum of the scores of K boosted trees, as shown in the following equation:

ŷ_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F, (4)

where x_i is the i-th sample, f_k(x_i) is the score of x_i on the k-th boosted tree, and F is the space of boosted trees.
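The idea that the final score is the sum of the scores of many small trees can be illustrated with a from-scratch sketch that boosts one-feature regression stumps against the residuals, a simplified stand-in for XGBoost's trees (the toy data, learning rate, and helper names are ours):

```python
import random

def fit_stump(xs, residuals):
    # exhaustive search for the best single-split regression stump
    best = None
    for cut in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= cut]
        right = [r for x, r in zip(xs, residuals) if x > cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    _, cut, lm, rm = best
    return lambda x: lm if x <= cut else rm

random.seed(1)
xs = [i / 20 for i in range(40)]
ys = [10 * x * x + random.gauss(0, 0.1) for x in xs]     # noisy nonlinear target

trees, lr, pred = [], 0.3, [0.0] * len(xs)
for _ in range(50):
    resid = [y - p for y, p in zip(ys, pred)]            # each new tree fits what is left over
    tree = fit_stump(xs, resid)
    trees.append(tree)
    pred = [p + lr * tree(x) for p, x in zip(pred, xs)]

def predict(x):
    # the model's output is the sum of the scores of all boosted trees
    return sum(lr * t(x) for t in trees)

mse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
var = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
```

After 50 rounds the ensemble's error is a small fraction of the target variance, even though each stump alone is a very weak learner.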
To decrease the fitting error of XGBoost, its objective adds a regularization term that GBDT lacks:

obj(Θ) = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k), (5)

where y_i and ŷ_i are the actual and predicted values of the i-th sample, the first term is the loss function, which needs to be a differentiable convex function, and the second term is the penalty on model complexity that helps avoid overfitting. The second term of equation (5) can be detailed as follows:

Ω(f) = γT + (1/2) λ Σ_{j=1}^{T} w_j², (6)

where γ and λ are constants, T denotes the total number of leaves, and w_j is the score of the j-th leaf. When equation (6) equals zero, obj(Θ) reduces to the conventional GBDT objective. According to equations (5) and (6), the training error and the model complexity are the two main components of the XGBoost objective. Once the previous trees have been trained, the current tree is trained by the additive training method: when the t-th boosted tree is trained, the parameters of the previous trees (from the first tree to the (t−1)-th tree) are fixed and their corresponding terms are constant. Taking the t-th boosted tree as an example, the loss can be expressed as follows:

L^(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t)) + Σ_{k=1}^{t} Ω(f_k), (7)

whose two terms decompose as

ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i), (8)

Σ_{k=1}^{t} Ω(f_k) = Σ_{k=1}^{t−1} Ω(f_k) + Ω(f_t). (9)

The first parts of equations (8) and (9) are the accumulated score and regularization of the former (t−1) trees, and the second parts are the score and regularization of the t-th boosted tree; ŷ_i^(t) is the predicted value at the t-th iteration.
Equations (8) and (9) are substituted into equation (7), and equation (7) is then expanded by using the following second-order Taylor formula:

f(x + Δx) ≈ f(x) + f′(x)Δx + (1/2) f″(x)Δx². (10)

Then, equation (7) is transformed as follows:

L^(t) ≈ Σ_{i=1}^{n} [l(y_i, ŷ_i^(t−1)) + g_i f_t(x_i) + (1/2) h_i f_t²(x_i)] + Ω(f_t), (11)

where g_i and h_i are the first-order and second-order gradient statistics of the loss. As Chen and Guestrin [50] suggested, f_t(x) can also be written as

f_t(x) = w_{q(x)}, q: R^m → {1, 2, ..., T}, (12)

where q(x) maps x to its leaf node and w_{q(x)} indicates the weight of that leaf, which can be regarded as the predicted value at the t-th iteration. Then, with I_j = {i | q(x_i) = j} denoting the samples in leaf j, G_j = Σ_{i∈I_j} g_i, and H_j = Σ_{i∈I_j} h_i, equation (11) can be expressed as follows:

L̃^(t) = Σ_{j=1}^{T} [G_j w_j + (1/2)(H_j + λ) w_j²] + γT. (13)

When the structure q(x) is fixed, the optimal leaf weight and the metric function used to measure the quality of the tree structure q(x) can be calculated:

w_j* = −G_j / (H_j + λ), L̃^(t)* = −(1/2) Σ_{j=1}^{T} G_j² / (H_j + λ) + γT. (14)

3.3. Bayesian Optimization Algorithm. The Bayesian optimization algorithm (BOA), one of the best-known extensions of the Bayesian network, is based on the construction of a probabilistic model. This model defines the distribution of the objective function from input data to output data. In the Bayesian optimization process, global statistical characteristics are obtained from the optimal solutions and modeled by using the Bayesian network [51].
That is why the BOA shows its advantage for machine learning models: these models need accurately tuned parameters to flexibly handle nonlinear, high-dimensional data [52]. In this study, the BOA is applied to optimize the parameters of XGBoost with the aim of accurately predicting traffic incident clearance time.
Bayesian optimization has two core components: the prior function (PF) and the acquisition function (AC), the latter also called the utility function [51]. A Gaussian process (GP) is generally used as the PF, and the AC is used to balance exploration and exploitation. The framework of Bayesian optimization is presented in Figure 1, and the main steps are as follows: (1) The data are split into training data and validation data by using the k-fold cross-validation method, and the initial parameters of the target model are defined as θ_1, θ_2, ..., θ_n. (2) The target model with the current parameters is evaluated on the validation data, and the resulting validation error is recorded; the goal of the optimization is to minimize this validation error. (3) A Gaussian process is fitted to the recorded results. (4) The parameters of the target model are updated according to the GP result; the maximum of the AC is used to select the next point, as the optimization proceeds by choosing the next point to evaluate. Probability of improvement, expected improvement, and information gain are the three widely used ACs [51]. In this study, expected improvement is chosen as the AC.
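The loop in steps (1)-(4) can be sketched with a tiny GP surrogate and an expected-improvement AC on a one-dimensional toy "validation error" surface. The kernel, candidate grid, and toy loss here are our assumptions for illustration, not the paper's setup:

```python
import math
import numpy as np

def rbf(a, b, ls=1.0):
    # squared-exponential kernel between two 1-D sample arrays
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # GP posterior mean/std at candidates Xs given observations (X, y): the PF
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    alpha = np.linalg.solve(K, y)
    mu = Ks.T @ alpha
    v = np.linalg.solve(K, Ks)
    var = np.clip(np.diag(rbf(Xs, Xs)) - np.sum(Ks * v, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    # EI for minimisation: expected amount by which a point beats the incumbent
    z = (best - mu) / sigma
    Phi = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (best - mu) * Phi + sigma * phi

def loss(theta):
    # toy validation-error surface standing in for the XGBoost CV loss
    return (theta - 0.3) ** 2 + 0.05 * math.sin(8 * theta)

X = np.array([0.0, 0.5, 1.0])                 # step (1): initial evaluated parameters
y = np.array([loss(t) for t in X])            # step (2): recorded validation errors
cand = np.linspace(0.0, 1.0, 201)
for _ in range(10):                           # steps (3)-(4): fit GP, evaluate argmax EI
    mu, sd = gp_posterior(X, y, cand)
    nxt = cand[np.argmax(expected_improvement(mu, sd, y.min()))]
    X = np.append(X, nxt)
    y = np.append(y, loss(nxt))
best_theta = X[np.argmin(y)]
```

Each iteration spends the next evaluation where the EI is largest, which trades off exploring uncertain regions against exploiting the current minimum.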
Then, the best parameter setting is mathematically written as follows:

θ* = arg min_θ E[L | θ] = arg min_θ ∫ L · P_GP(L | θ) dL, (15)

where L is the validation error and P_GP(L | θ) is the probability of L given θ, as computed by the GP.

3.4. Evaluation Indicator.
In general, the mean absolute percent error (MAPE) is a commonly used indicator to evaluate the prediction performance of a regression model. As mentioned above, the data are described as {x_i, y_i}, i = 1, 2, ..., n, which can be considered as a matrix of size n × m. Specifically, n is the number of incidents and y_i represents the actual value of the i-th incident. Let p_i be the predicted value of the i-th incident. Then, the MAPE can be expressed as follows:

MAPE = (1/n) Σ_{i=1}^{n} |y_i − p_i| / y_i × 100%. (16)

In terms of this formula, the MAPE is a relative indicator that measures the prediction performance of a model based on the actual and predicted values.
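As a minimal illustration of the formula (toy numbers, not the paper's data):

```python
def mape(actual, predicted):
    # MAPE = (1/n) * sum(|y_i - p_i| / y_i) * 100%
    return sum(abs(y - p) / y for y, p in zip(actual, predicted)) / len(actual) * 100

# per-incident relative errors: 0.2, 0.1, 0.0 -> mean 0.1 -> about 10%
error = mape([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
```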

3.5. Framework of the Proposed Method.
As introduced in Section 2, a suitable way of organizing the original dataset is needed to make it easier to explore the patterns hidden in the data. To this end, we use the K-means algorithm to cluster the original dataset into several categories within which the data are highly similar. Then, an XGBoost model is built for each category to perform prediction. The main steps of the proposed method are as follows:

Step 1: clustering the original data into several categories by using the K-means algorithm. The number of clusters is determined by the optimal silhouette coefficient (detailed in Section 3.1).
Step 2: splitting the clustered data into training data and testing data for each category and using the training data to construct the XGBoost model.
Step 3: the BOA is used to optimize parameters for each constructed XGBoost model.
Step 4: inputting the testing data into the trained XGBoost, whose predicted clearance times are then output and recorded.
Step 5: calculating the predictive indicator (MAPE) and the relative importance of the explanatory factors. Note that as the number of traffic incidents grows, the dataset will be updated continuously, and thus the XGBoost model should be retrained.
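The steps above can be sketched end to end on synthetic data, using scikit-learn's KMeans and its GradientBoostingRegressor as a stand-in for the tuned XGBoost (the feature stand-ins and data generator are our assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
X = rng.uniform(0, 1, size=(n, 4))          # stand-ins for response time, AADT, ...
# two regimes of clearance time, echoing the short/long clusters in the paper
y = np.where(X[:, 0] > 0.5, 30 + 20 * X[:, 1], 8 + 4 * X[:, 1]) + rng.normal(0, 0.5, n)

# Step 1: cluster the records (K would be chosen by silhouette; K=2 here)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.c_[X, y])

mapes = {}
for c in (0, 1):
    Xc, yc = X[labels == c], y[labels == c]
    # Step 2: 70/30 train/test split within each cluster
    Xtr, Xte, ytr, yte = train_test_split(Xc, yc, test_size=0.3, random_state=0)
    # Steps 3-4: fit a boosted-tree regressor and predict the held-out records
    model = GradientBoostingRegressor(random_state=0).fit(Xtr, ytr)
    pred = model.predict(Xte)
    # Step 5: MAPE per cluster
    mapes[c] = float(np.mean(np.abs(yte - pred) / yte))
```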

Prediction Result and Discussion
This study has two objectives: (a) examining the performance of the XGBoost model in predicting clearance time and (b) investigating the significant factors of clearance time. We first process the original data, including data clustering and clustering evaluation. Next, the data are split into training data and testing data at a ratio of 7:3. The XGBoost is trained on the training data, and the testing data are used for model evaluation. Then, a comparative study examines the prediction performance of XGBoost, with MAPE chosen as the predictive measure. Finally, the relative importance of all explanatory variables is calculated, and the significant explanatory variables of incident clearance time are analyzed. The proposed model is implemented and executed in Python.

4.1. Data Preprocessing.
Before modeling, the original dataset is processed by the K-means algorithm. As described in Section 3.1, the number of clusters (K) is the key parameter of K-means. To find the best K, values of K from 2 to 10 are used to calculate the corresponding silhouette coefficients, and the results are shown in Table 2. The search is assumed to stop when the silhouette coefficient has not improved for 5 consecutive values; this occurs at K = 7, as the silhouette coefficients of 5 consecutive values are decreasing. In terms of equation (3), a higher silhouette coefficient indicates better clustering performance. According to Table 2, the silhouette coefficient reaches its largest value (0.613) at K = 2, so K is set to 2 and the original data are clustered into two clusters in this study. To present each cluster clearly, we draw scatter plots of the target variable against one randomly chosen explanatory variable, shown in Figure 2. The x-axis is clearance time and the y-axis is response time. Figure 2(a) shows the scatter plot of these two variables in the original data, while Figure 2(b) shows the clustered data. As shown in Figure 2(b), cluster 1, marked in purple, represents relatively shorter clearance times, and cluster 2, marked in yellow, indicates longer clearance times.
To understand the characteristics of the two clusters clearly, several essential indexes are calculated and presented in Table 3. In total, there are 2246 incidents in cluster 1 and 319 incidents in cluster 2. For cluster 1, the mean, standard deviation, median, and range of clearance time are 9 minutes, 5.44 minutes, 7.00 minutes, and 22 minutes, respectively; for cluster 2, these values are 39.25 minutes, 15.25 minutes, 35 minutes, and 75 minutes. Comparing the median with the mean within each cluster, the median is smaller than the mean in both clusters, which indicates that the clearance time distributions are skewed rather than normal. We then calculate the skewness of the two clearance time distributions: 0.92 in cluster 1 and 1.59 in cluster 2. Both are right-skewed, which is consistent with previous studies [26, 39, 41]. Distributions of clearance time in the two clusters are shown in Figures 3(a) and 3(b); both present long-tail distributions, with ranges of 22 and 75.
It is difficult to handle data with such a wide value range [53]. Therefore, to make the distribution of clearance time closer to normal, we apply data transformations to the clearance time in the two clusters. For cluster 1, the skewness of clearance time is 0.92, which lies between 0.5 and 1 and indicates a moderately skewed distribution; according to the empirical rule, we therefore apply the square root transformation to the clearance time in cluster 1. For cluster 2, the skewness is 1.59, which is larger than 1 and indicates a highly skewed distribution, so the log transformation is used to convert the clearance time in cluster 2. Distributions of the transformed clearance time are presented in Figures 3(c) and 3(d). In Figure 3, the blue line is the fitted curve of the clustered data and the black line denotes the normal distribution curve fitted with the calculated mean and standard deviation. As shown in Figures 3(c) and 3(d), the distributions of the transformed data are closer to normal.
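This skewness-based rule can be sketched in plain Python on a synthetic right-skewed sample (the lognormal toy data and thresholds written into the function are our illustration of the empirical rule, not the paper's code):

```python
import math
import random

def skewness(xs):
    # sample skewness: third central moment over variance**1.5
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def empirical_transform(xs):
    # moderate right skew (0.5-1): square root; high skew (>1): log
    s = skewness(xs)
    if s > 1:
        return [math.log(x) for x in xs]
    if s > 0.5:
        return [math.sqrt(x) for x in xs]
    return list(xs)

rng = random.Random(7)
clearance = [rng.lognormvariate(2.5, 0.8) for _ in range(2000)]  # long-tailed toy times
before = skewness(clearance)
after = skewness(empirical_transform(clearance))
```

For a lognormal sample the rule selects the log transformation, pulling the skewness back toward zero, as in Figures 3(c) and 3(d).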

4.2. Parameter Optimization.
In general, there are three approaches to parameter optimization: systematic grid search, random search, and Bayesian optimization. Grid search works well because it systematically covers the entire search space, but it is time-consuming. In contrast, random search runs fast but may miss the best value because it samples the search space randomly. Bayesian optimization is a process of continuously sampling, calculating, and updating the model. Overall, we apply the Bayesian optimization method to find the optimal parameters of XGBoost. These parameters include the maximum depth of a tree (max_depth), the number of trees (n_estimators), the learning rate (learning_rate), the fraction of samples drawn for each tree (subsample), the minimum sum of instance weights in a leaf node (min_child_weight), and the fraction of features sampled for each tree (colsample_bytree). Increasing n_estimators may improve the accuracy of XGBoost but also increases the computing time. The max_depth parameter is used to avoid overfitting, whereas a larger min_child_weight makes the model more conservative and can lead to underfitting. The subsample and colsample_bytree parameters control row and column sampling, respectively. The learning rate helps avoid overfitting and increases the robustness of the model [54]. Therefore, all these parameters should be optimized to achieve the best model performance.
The Bayesian optimization is carried out with a Python package called Hyperopt [55].
The XGBoost model reaches its best prediction performance with these optimal parameters, and the MAPE values of the optimized XGBoost for the two clusters are 0.348 and 0.221, respectively.

4.3. Comparison Analysis.
To examine the prediction performance of XGBoost for clearance time, we select several commonly used models for comparison: the support vector regression (SVR) model, the random forest (RF) model, and the Adaboost model. To ensure a fair comparison, all models use the same testing data and the same parameter-tuning method (BOA). For the SVR model, the radial basis function (RBF) is selected as the kernel; its two key parameters, gamma and the penalty C, are set to 0.1 and 64 for cluster 1 and 0.15 and 32 for cluster 2. For the RF model, the number of trees (n_estimators), the maximum depth of a tree (max_depth), the minimum number of samples required to split an internal node (min_samples_split), and the minimum number of samples in a leaf node (min_samples_leaf) are the four key parameters; they are set to 195, 8, 11, and 23 in cluster 1 and 100, 13, 18, and 12 in cluster 2. For the Adaboost model, as with the RF model, n_estimators, max_depth, and min_samples_split must be identified; in addition, the learning_rate and the maximum number of features considered when splitting (max_features) also need to be optimized. These parameters in the two clusters are set to 470, 6, 25, 0.05, 7 and 425, 9, 30, 0.11. The MAPE values for the four candidates are shown in Table 5, with the smallest value for each cluster marked in bold.
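A comparison of this kind can be sketched with scikit-learn on synthetic data, with GradientBoostingRegressor standing in for XGBoost. The data generator and most hyperparameter values here are illustrative choices of ours, not the tuned values reported in Table 5:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 3))
y = 10 + 30 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(0, 1, 300)   # toy clearance times
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=1)

# same training/testing data for every candidate, as in the paper's protocol
models = {
    "XGBoost-like": GradientBoostingRegressor(random_state=1),
    "SVR": SVR(kernel="rbf", gamma=0.1, C=64),
    "RF": RandomForestRegressor(n_estimators=195, max_depth=8, random_state=1),
    "Adaboost": AdaBoostRegressor(n_estimators=100, random_state=1),
}
mape = {name: float(np.mean(np.abs(yte - m.fit(Xtr, ytr).predict(Xte)) / yte))
        for name, m in models.items()}
```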

4.4. Importance Evaluation for Explanatory Factors.
Different explanatory variables have different effects on the target variable [56, 57]. To investigate the significant factors of clearance time, the relative importance of each explanatory factor is calculated using the XGBoost with optimal parameters for each of the two clusters. An explanatory factor with higher relative importance has a stronger effect on clearance time [41]. In this study, factors with relative importance greater than 8.0% are defined as significant explanatory factors, factors with relative importance from 2.5% to 8.0% are general factors, and the remaining explanatory factors are considered insignificant. The explanatory factors and their importance are shown in Table 6.
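The importance-ranking step can be sketched with a boosted-tree regressor's built-in gain-based importances; the three toy factors, their data generator, and the regressor (a scikit-learn stand-in for XGBoost) are our assumptions, while the thresholds come from the text above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 1000
response_time = rng.uniform(1, 30, n)            # strong driver of the toy target
aadt = rng.uniform(0.2, 1.0, n)                  # weaker driver
weather = rng.integers(0, 3, n).astype(float)    # pure noise factor
y = 5 + 1.0 * response_time + 10 * aadt + rng.normal(0, 1, n)

X = np.c_[response_time, aadt, weather]
model = GradientBoostingRegressor(random_state=0).fit(X, y)
imp = model.feature_importances_                 # relative importances, summing to 1

def category(share):
    # thresholds from the text: >8% significant, 2.5-8% general, else insignificant
    if share > 0.08:
        return "significant"
    if share > 0.025:
        return "general"
    return "insignificant"

labels = {name: category(v)
          for name, v in zip(["response time", "AADT", "weather"], imp)}
```

Factors that actually drive the target absorb most of the split gain, while the noise factor falls below the significance threshold.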
For cluster 1, AADT (17.70%), incident type (17.30%), response time (15.10%), and lane closure type (8.00%) are categorized as significant explanatory factors of clearance time, as their relative importance is at least 8.00%. The general factors of clearance time include six explanatory factors: WSP involved (7.60%), month of year (6.10%), traffic control (5.00%), weather (4.70%), day of week (4.60%), and peak hours (3.10%). The remaining factors, HOV (2.50%), time of day (2.10%), heavy truck involved (1.70%), injury involved (1.70%), and work zone involved, are insignificant [58]. In detail, AADT is the greatest contributor to shorter clearance time in cluster 1 and has the second largest impact on longer clearance time in cluster 2, with relative importance of 17.70% and 14.00%, respectively. Generally speaking, AADT represents the traffic demand of a segment [59, 60]; that is, traffic congestion under a high AADT may make an incident difficult to clear, leading to longer clearance time. Incident type contributes 17.30% and 12.80% to short and long clearance time, respectively, ranking second in cluster 1 and third in cluster 2. As shown in Table 1, the incident type factor consists of disabled vehicles, debris, abandoned vehicles, collisions, and others. These incidents may block normal traffic [61, 62], and the transportation authorities may adopt a series of strategies to deal with the problems they cause [63, 64]. Interestingly, longer clearance time seems less sensitive to incident type than shorter clearance time, perhaps because a long clearance time implies a high crash severity. With relative importance of 15.10% and 22.30%, response time is the third contributor to shorter clearance time in cluster 1 and has the biggest impact on longer clearance time in cluster 2.
This result shows that longer clearance time is more sensitive to response time than shorter clearance time, which is consistent with previous studies [18, 19]: for every minute the response time increases, the clearance time increases by one percent [18, 19]. The lane closure type factor is the fourth largest contributor in both clusters; it indicates the severity of an incident by restricting vehicles from entering the incident site [41].

Conclusions
In this study, XGBoost is applied to predict the clearance time of incidents occurring on the freeway and to investigate the significant factors of clearance time, using data collected from the Washington Incident Tracking System in 2011. We first introduce the original data and the proposed method briefly. The original data are clustered by the K-means algorithm to better explore the underlying relationships.
Then, we build an XGBoost model for each cluster. Each cluster's data are divided into 70% training data and 30% testing data. The training data are used to fit XGBoost and to optimize its parameters with the BOA on the basis of 5-fold cross-validation, and the testing data are used to measure the prediction performance, with MAPE as the predictive indicator in this paper. To examine the model performance of XGBoost, support vector regression (SVR), random forest (RF), and Adaboost are also used to predict clearance time. The comparative study shows that XGBoost outperforms the other three models, with the lowest MAPE in both clusters. To obtain the significant factors of clearance time, we calculate the relative importance of each explanatory factor and then define quantitative thresholds for significant, general, and insignificant explanatory factors. The result is that response time, AADT, incident type, and lane closure type are the significant explanatory factors of clearance time.
It is worth noting that a traffic incident is a time-sequential process [65], and almost all incident information is acquired during that process [66]. Modeling based only on the acquired incident information is a limitation of the proposed method: during the initial stage of an incident, the prediction may not be accurate because the acquired information is incomplete. Multistage updating of information is therefore a promising direction for future research. In addition, strategies for dealing with the unobserved heterogeneity of dependent variables, especially in the traffic incident field, may be a hot topic, because omitted variables (e.g., driving behavior) may have potential impacts on the target variable.
Data Availability

The traffic incident data used to support the findings of this study are available from the corresponding author and first author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.