Modeling of Merging Decision during Execution Period Based on Random Forest

This study aims to investigate the key feature variables and build an accurate decision model for merging behavior during the execution period by using a data-driven method called random forest (RF). To comprehensively explore the feature variables during the merging execution period, nineteen candidate variables including speeds, relative speeds, gaps, time-to-collisions (TTCs), and locations are extracted from a dataset of 375 noise-filtered vehicle trajectories. After the variable selection process, an RF model with 9 key feature variables is built. Results show that the gap between the merging vehicle and its putative following vehicle and the ratio of this gap to the total accepted gap are the two most important feature variables. A likely reason is that merging drivers can easily observe the putative leading vehicles and control their relative speeds and positions with respect to them, so they tend to leave more space for their putative following vehicles. The relative speed between the merging vehicle and its following vehicle in the auxiliary lane is the only selected variable related to vehicles in the auxiliary lane, which means merging vehicles mainly focus on the traffic condition in the adjacent main lane. Evaluation of the performance in comparison with the state-of-the-art method reveals that the proposed method obtains much more accurate results on both training and testing datasets, which means RF is practical for predicting the merging decision during the execution period and has better transferability.


Introduction
As a basic driving task, lane changing has drawn great attention recently. Lane changing behavior was considered to be an important reason for traffic oscillations and accidents [1][2][3][4]. It was estimated that lane change crashes account for 4 to 10% of all crashes in the US [5]. Lane-changing behavior is complicated and risky because it is influenced by vehicles in both the current lane and the target lane. Several factors such as velocities and gaps should be taken into account during the lane changing process.
Luckily, with the rapid development of communication technology, driving assistance systems have been developed to help drivers to make safer decisions [6,7]. Lane-changing decision assistance is one of the key functions of driving assistance systems. It can help drivers make safer decisions to start a lane change.
Through the Vehicular Ad-hoc Network (VANET), vehicles can communicate with the surrounding vehicles and roadside units [8][9][10]. Lane-changing decision assistance systems can deal well with discretionary lane changing by using the data from surrounding vehicles and roadside units. However, for merging areas on freeways, the judgment rules might not be applicable [11]. In merging areas, drivers need to change to the adjacent main lane within a limited distance, which may result in traffic congestion and even breakdowns [12][13][14][15][16][17].
As a sequential decision process, the whole merging process can be simplified as a two-step model (gap searching and merging execution) or a three-step model (gap searching, merging position searching, and merging execution) [18][19][20][21]. However, most previous studies focused on the gap searching process and neglected the merging execution period. Several seconds are needed to execute the merging maneuver, and the traffic condition may change dynamically during the whole merging execution period. Ignoring the merging execution process would reduce the accuracy of traffic simulation and autonomous driving. Thus, there is a critical need to model the merging decision behavior during the execution period. During the merging execution period, the merging vehicles interact with the putative leading (PL) and putative following (PF) vehicles in the adjacent main lane and with the leading (L) and following (F) vehicles in the auxiliary lane. Various influencing factors might be considered for the merging decision and should be analyzed in depth. However, previous studies [17] showed that there is multicollinearity between the variables. Balal et al. [22] pointed out that most of the lane-changing-related variables are highly correlated, implying that only a few representative or key variables might be sufficient to describe the interactions of vehicles. However, selecting the key variables is not easy. Therefore, a variable selection process should be conducted before building parametric models such as the logit model. Improper selection of the key variables might degrade the performance of the model so seriously that it cannot be applied to merging assistance systems.
Recently, data mining techniques have received a lot of attention in transportation fields due to their ability to deal with large-scale data. Some of them can naturally overcome the multicollinearity problem and make full use of the training data. Thus, this study uses a well-known machine learning technique, random forest (RF), to model the merging decision behavior during the execution period. It can not only produce more accurate prediction results but also uncover hidden information in the data. More importantly, RF can effectively select the key variables. The main contributions can be summarized as follows. First, this study gives a comprehensive analysis of the influencing variables of the merging decision. Second, the proposed RF method can accurately predict the merging decision during the execution period, which could improve the safety and comfort of driving assistance systems if it were incorporated into a lane-changing assistance system. Third, a key feature selection process is conducted to investigate the influencing factors. These contributions can not only help understand the diverse influences of different variables on the merging decision but also offer new insights for driver assistance systems and autonomous driving.
The remainder of the paper is organized as follows. Section 2 provides a state-of-the-art review of the existing studies, followed by Section 3, which gives the methodology to build an RF model. Section 4 describes the NGSIM data used in this paper and comprehensively analyzes the influencing variables. Results and discussions are presented in Section 5. Finally, the concluding remarks are presented in Section 6.

Literature Review
Predicting the merging decision has always been one of the focuses of transportation research. A great number of models have been developed based on different theories. The first comprehensive lane-changing framework was developed by Gipps [23] based on gap acceptance theory. Then, similar frameworks were adopted in other studies [24][25][26][27]. However, the gap acceptance theory has been criticized because it cannot reflect the real behavior of drivers. To overcome this deficiency, logistic and logit models were introduced by some researchers [15,28,29]. To account for the heterogeneity among drivers, mixed models were proposed by Weng et al. [30] and Li [31]. Game theory models were also developed to model the merging behavior [32,33]. However, the prediction accuracy of the parametric models is barely satisfactory, and the collinearity of influencing variables makes it difficult for researchers to choose appropriate variables to build accurate models [22].
Recently, data-driven methods, such as classification and regression tree (CART), Bayesian network, and fuzzy logic models, were used in building merging models or lane-changing models and achieved promising results [16][34][35][36][37][38]. CART was applied by Weng et al. [11] to model the merging decision in work zone areas during the execution period, in which time-to-collision (TTC) was considered as a risk factor. Considering the difference between cars and heavy vehicles, Moridpour et al. [39] presented a lane-changing model based on fuzzy logic for heavy vehicles. A cooperative merging strategy was developed by Xu et al. [40] for vehicles with V2V and V2I networks, which is applicable to cooperative merging operations under saturated traffic conditions. However, the majority of previous studies considered speeds, relative speeds, and gaps separately as the influencing variables and ignored the interaction of variables. In addition, considering the complexity of merging behavior, a comprehensive analysis of all possible influencing factors should be conducted to better understand the merging decision during the execution period.
Previous studies showed that the variables of lane-changing behaviour were highly correlated with each other [17,22,31]. Thus, selecting some representative or key variables might better describe the interactions of vehicles. However, feature selection has never been easy. Feature selection methods can be classified into statistics-based [41], information-theory-based [42], manifold-based [43], and rough-set-based [44] methods. Besides, data-driven methods are also widely used for feature selection [34,45,46]. In this study, a popular data-driven method called random forest is applied to model the merging decision during the execution period. Compared with other models in the literature, RF has several unique features and advantages. First, it is able to handle multisource heterogeneous data without lengthy data processing. Second, as an ensemble machine learning technique based on CART, RF inherits CART's ability to automatically accommodate missing data in the independent variables. Third, RF overcomes a deficiency of CART: it resists outliers and is not easily affected by small perturbations in the training data. Finally, RF can select the key variables from high-dimensional data by ranking the importance of all independent variables [45,47]. RF has been successfully used in traffic prediction and produced promising results [48][49][50][51].

Methodology
Predicting the merging decision can be simplified as a classification problem. Some classical machine learning techniques, such as CART, are very suitable for modeling the merging decision. Though CART is efficient and easy to use, it is also easily affected by small perturbations in the training data [52]. To improve the robustness and generalization capacity of CART, an ensemble learning technique called random forest, which combines the bagging technique, CART, and the random subspace method, was proposed by Breiman [45]. RF is an ensemble classifier composed of a group of decision tree classifiers and obtains the prediction result by a simple majority vote. The RF model can improve the prediction accuracy of the merging decision as well as help connected and autonomous vehicles (CAVs) make safer decisions during the merging process. A brief description of random forest is given in this section; for the detailed mathematical fundamentals, see Breiman [45].
In RF, bootstrap aggregating (bagging) is the most basic technique. Suppose we have a training dataset $(X, Y) = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ and $y_i$ represent the feature vector and the response variable of sample $i$, respectively. Through bagging, RF generates $B$ new training sets $(X_b, Y_b)$, $b = 1, \ldots, B$, each obtained by sampling from $(X, Y)$ uniformly and with replacement $N$ times. Because the sampling is with replacement, some observations may be repeated in a data set $(X_b, Y_b)$ and some may not appear at all. The probability that a given sample is not selected in any of the $N$ draws is $(1 - 1/N)^N$. Then, we can get

$$\lim_{N \to \infty} \left(1 - \frac{1}{N}\right)^{N} = \frac{1}{e} \approx 0.368. \qquad (1)$$

Equation (1) indicates that about 36.8% of the samples are not used in training a given tree; these samples are called OOB (out-of-bag) data and can be used for validation. Thus, unlike other machine learning methods, RF does not require cross-validation or separate test data. In RF, the OOB error has been proved to be an unbiased estimate of the generalization error.
The random subspace method is also used in RF. It is also called attribute bagging or feature bagging, which means each tree is constructed from a random subset of the feature variables. This method is designed to reduce the correlation between the trees and improve the generalization accuracy, because RF uses a simple majority vote of all the trees.
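The 36.8% figure can be checked numerically. The sketch below is only illustrative: the function name `oob_fraction` and the sample size of 1583 (borrowed from the number of observations reported later in the paper) are assumptions made for the example, not part of the original method.

```python
import random


def oob_fraction(n_samples, seed=0):
    """Draw one bootstrap sample of size n_samples and return the share
    of original indices that were never drawn (the OOB share)."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(drawn) / n_samples


n = 1583  # illustrative sample size (assumption)
analytic = (1.0 - 1.0 / n) ** n  # probability of never being drawn
simulated = oob_fraction(n)      # share observed in one bootstrap sample

print(f"analytic OOB share:  {analytic:.4f}")
print(f"simulated OOB share: {simulated:.4f}")
```

Both values should sit close to $1/e$; the simulated share fluctuates slightly from one bootstrap sample to the next.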
Combining the above two methods and CART, the basic steps of RF are shown in Figure 1 and summarized as follows:
(I) Initialize the algorithm and set $b = 1$.
(II) Use the bootstrap sampling method to obtain a new data set $(X_b, Y_b)$ by random sampling with replacement $N$ times; the data that are not sampled form a set called the OOB set.
(III) Randomly select $m$ feature variables ($m < J$) and use the selected variables for splitting to train a decision tree $T_b$ on the new sample set $(X_b, Y_b)$. The decision tree is grown to its maximum depth and is not pruned.
(IV) For $b = 2, \ldots, B$, repeat steps II-III.
The importance of the variables can be ranked using the OOB data. RF can screen out important variables in a complex feature variable space, which helps deepen the understanding of the research object. With the sample subsets obtained by the bootstrap method indexed by $b = 1, 2, \ldots, B$, the process of using RF to calculate the importance of variable $x_j$ is as follows:

Journal of Advanced Transportation
[Equation (2): importance of variable $x_j$, computed from the OOB data]

Previous studies have shown that the merging decision could be influenced by a number of highly correlated variables [22,35]. Thus, a feature selection process must be conducted before building parametric merging decision models. Through bagging and the random subspace method, RF can naturally overcome the collinearity of influencing variables. Furthermore, the importance values can be used to rank the influencing variables and select the key feature variables through a forward stepwise or backward stepwise elimination process, which will be described in Section 5.3.
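For orientation, a widely used way to obtain an OOB-based importance is permutation importance: measure the accuracy of a tree on its OOB samples, shuffle the values of one variable among those samples, and record the drop in accuracy. The sketch below shows this for a single tree under stated assumptions; it is not necessarily the exact formulation behind equation (2), and `permutation_importance`, `toy_tree`, and the toy data are hypothetical names and values introduced only for illustration.

```python
import random


def permutation_importance(predict, X_oob, y_oob, j, seed=0):
    """Drop in OOB accuracy of one tree after shuffling column j.

    predict: callable mapping one feature row (a list) to a class label.
    In a forest this quantity would be averaged over all trees.
    """
    rng = random.Random(seed)

    def accuracy(rows):
        hits = sum(predict(row) == label for row, label in zip(rows, y_oob))
        return hits / len(y_oob)

    baseline = accuracy(X_oob)
    column = [row[j] for row in X_oob]
    rng.shuffle(column)  # break the link between variable j and the response
    permuted = [row[:j] + [value] + row[j + 1:]
                for row, value in zip(X_oob, column)]
    return baseline - accuracy(permuted)


def toy_tree(row):
    # Hypothetical one-split tree: predict "merge" when feature 0 exceeds 12.
    return int(row[0] > 12.0)


X_oob = [[5.0, 1.0], [20.0, 2.0], [8.0, 3.0],
         [15.0, 4.0], [3.0, 5.0], [30.0, 6.0]]
y_oob = [toy_tree(row) for row in X_oob]

importance_used = permutation_importance(toy_tree, X_oob, y_oob, j=0)
importance_unused = permutation_importance(toy_tree, X_oob, y_oob, j=1)
```

A variable the tree never splits on (feature 1 here) gets an importance of exactly zero, because shuffling it cannot change any prediction.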

Data Description and Processing.
In this section, vehicle trajectory data collected by the Federal Highway Administration (FHWA) in the NGSIM project are adopted to verify the proposed RF model. As an open-source dataset, the NGSIM dataset can provide rich and accurate vehicle trajectory data collected on both freeway and urban road [14]. It has been widely used in traffic studies such as traffic flow analysis and driving behavior modeling [18,37,53,54].
Previous studies have shown that the US-101 dataset had the best accuracy and consistency [18,55]. Thus, this dataset is chosen in this study. Figure 2 shows a schematic diagram of the data collection site. The chosen 640-meter-long segment is located between an on-ramp and an off-ramp, with five main lanes and one auxiliary lane. Videos were captured from 7:50 a.m. to 8:35 a.m. on June 15, 2005, which was a sunny day. The trajectories are recorded at a resolution of 10 fps (frames per second), and the dataset contains three subsets, each covering 15 minutes of trajectory data [56]. Table 1 shows the aggregate statistics of speed and volume for every subset. The coordinates, speed, and acceleration of every vehicle at any instant can be easily obtained from the NGSIM dataset. Previous studies have shown that some random noise exists in the NGSIM data [55,57], so filtering and smoothing techniques should be applied before use. In this study, a data smoothing technique called the symmetric exponential moving average filter (sEMA), proposed by Thiemann et al. [57], is applied before further data analysis. In addition, the local coordinates of the three subsets are unified to remove inconsistencies among them. Detailed steps of data processing can be found in Li and Sun [17], Li [31], and Li and Cheng [15]. After processing, the trajectories of 375 merging vehicles are extracted from the dataset. All of the vehicles are passenger cars with lengths from 2.5 m to 7.8 m.

Data Extraction.
After selecting the accepted gap, a merging vehicle needs several seconds to find the right time to merge into the adjacent lane, and the driver may keep adjusting the speed and relative position through acceleration and deceleration during the execution period. At any time, a merging driver can choose either to continue merging or to complete merging, as shown in Figure 3. Let $y_n^t$ denote the $n$th merging vehicle's decision at time $t$. Obviously, $y_n^t$ is a binary variable, shown in the following equation:

$$y_n^t = \begin{cases} 1, & \text{if vehicle } n \text{ completes merging at time } t, \\ 0, & \text{if vehicle } n \text{ continues merging at time } t, \end{cases} \qquad t = 1, 2, \ldots, T_n.$$

Previous studies showed that one second is a suitable interval for a driver to make decisions [11,28,34,37]. Thus, a one-second interval is also used in this study. Then, $T_n$ represents the total time to complete merging for vehicle $n$. Obviously, a merging vehicle can have several observations of $y_n^t = 0$ but only one observation of $y_n^t = 1$. By extracting the trajectory data of the 375 merging vehicles, 1583 observations are obtained in this paper: 375 observations in which the driver chooses to complete merging ($y_n^t = 1$) and 1208 in which the driver does not ($y_n^t = 0$). It means that it takes 3.23 seconds on average for a vehicle to complete merging after making the decision of gap selection.
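The way one merging trajectory expands into several binary observations can be sketched as follows. This is a minimal illustration that assumes the decision is sampled once per second at $t = 1, \ldots, T_n$; the function `decision_sequence`, the vehicle identifiers, and the durations are hypothetical and are not taken from the NGSIM data.

```python
def decision_sequence(total_seconds):
    """Binary decisions for one vehicle sampled once per second:
    0 while the driver continues merging, 1 at the completion step."""
    if total_seconds < 1:
        raise ValueError("a merge needs at least one decision step")
    return [0] * (total_seconds - 1) + [1]


# Hypothetical merge durations T_n in seconds for three vehicles.
durations = {"veh_a": 3, "veh_b": 5, "veh_c": 1}

# One (vehicle, time, decision) record per one-second decision step.
observations = [
    (vehicle, t, y)
    for vehicle, total in durations.items()
    for t, y in enumerate(decision_sequence(total), start=1)
]

n_complete = sum(y for _, _, y in observations)
n_continue = len(observations) - n_complete
```

Each vehicle contributes exactly one record with a decision of 1, which is why the number of positive observations equals the number of merging vehicles.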
The merging execution process has a certain influence on both the auxiliary lane and the main lane. At the same time, the merging behavior is also affected by the traffic flow state of the two lanes and by the surrounding vehicles. Therefore, the main factors that affect the decision-making of merging vehicles are the speeds, relative speeds, and gaps in the adjacent main lane and the auxiliary lane.
However, previous models considered the above variables separately and ignored the interaction between variables. Some studies showed that the gap between the merging vehicle and the PF vehicle in the adjacent main lane was linearly related to the total gap during the merging process [20]. Figure 4 shows the scatter plots of the PF gaps and the accepted gaps according to the dataset used in this study. A strong linear relationship can be found in Figure 4. One can also find that the range of the ratio of the PF gap to the accepted gap for $y_n^t = 1$ is considerably smaller than that for $y_n^t = 0$, indicating that this ratio might be an important factor for the merging decision. Therefore, the ratio of the PF gap to the accepted gap is also considered as an influencing variable in this paper.
In addition, a surrogate safety measure that combines vehicle speeds and space gaps, time-to-collision (TTC), is also considered, because the merging driver needs to control the vehicle to avoid rear-end accidents with the surrounding vehicles. TTC is defined as

$$\mathrm{TTC} = \frac{x_L - x_F - L}{V_F - V_L},$$

where $x_L$ and $x_F$ are the longitudinal position coordinates of the front bumper of the leading and following vehicle, respectively; $V_L$ and $V_F$ are the speeds of the leading and following vehicle, respectively; and $L$ is the length of the leading vehicle. Figure 5 shows the interactions between a merging vehicle and its surrounding vehicles. Table 2 shows the candidate variables and their explanations. It should be pointed out that TTC is negative when the following vehicle moves slower than the leading vehicle, which means that a collision would never occur. In addition, when the speed of the following vehicle is equal to or slightly larger than that of the leading vehicle, TTC will be infinite or very large. To restrict these situations, the TTC range is set to (0, 100 s]; that is, when TTC is negative or greater than 100 s, it is set to 100 s.
Table 3 shows the main statistical characteristics of the candidate variables for merging behavior. One can find that the merging vehicles move faster than both PF and PL vehicles and that the PF vehicles have the lowest average speed. Both the leading and following vehicles in the auxiliary lane move faster than the merging vehicles. Additionally, the average speed of merging vehicles decreases from 12.477 m/s to 12.086 m/s during the merging process to adapt to the mainline traffic speed, which is also reflected by the changes in the average $\Delta V_{PL}$ and $\Delta V_{PF}$. It is interesting to find that $Gap_{PF}$ increases from 9.616 m to 16.081 m while $Gap_{PL}$ does not change much. It means that $Gap_{PF}$ plays an important role and that the PF vehicles tend to yield to the merging vehicles during the merging execution period.
One can also find that $TTC_{PL}$ has the lowest average value during the merging process, indicating that the traffic conflicts between the merging vehicles and PL vehicles might be the most serious. A Pearson's correlation analysis is conducted to obtain the correlation coefficients between the dependent variable and the independent variables, as shown in Table 4. Bold values are the correlation coefficients that are not significant at the 0.95 confidence level. One can find that the dependent variable $y_n^t$ has significant correlations with several independent variables, such as $V_{PL}$ and $Gap_{PF}$. It is interesting to find that there is no significant correlation between $Gap_{PL}$ and $y_n^t$. $(Gap_{PF}/Gap)$ has the strongest correlation with $y_n^t$.
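The TTC computation with the (0, 100 s] restriction described in this section can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name, argument names, and the example positions and speeds are assumptions, and treating equal speeds as the capped value is how the "infinite TTC" case is interpreted here.

```python
TTC_CAP_S = 100.0  # upper bound of the TTC range, in seconds


def time_to_collision(x_lead, x_follow, v_lead, v_follow, lead_length,
                      cap=TTC_CAP_S):
    """TTC between a follower and its leader, capped as in the text.

    Positions are front-bumper coordinates in metres, speeds in m/s.
    """
    closing_speed = v_follow - v_lead
    if closing_speed == 0.0:
        return cap  # equal speeds: TTC is infinite, so use the cap
    ttc = (x_lead - x_follow - lead_length) / closing_speed
    if ttc < 0.0 or ttc > cap:
        return cap  # negative or very large TTC is set to the cap
    return ttc


# Follower closing at 5 m/s on a 25 m net gap.
ttc_closing = time_to_collision(50.0, 20.0, 10.0, 15.0, 5.0)
# Follower slower than the leader: raw TTC is negative, so it is capped.
ttc_opening = time_to_collision(50.0, 20.0, 15.0, 10.0, 5.0)
```

Mapping both the "never collides" and the "collides only in the far future" cases to the same value keeps the variable bounded, which is convenient for tree-based splits.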

Modelling Results
After extracting enough data, the RF model is trained and tested in this section to verify its effectiveness. A data mining software package called Salford Predictive Modeler is used in this study [16]. The data are randomly divided into two parts: 80% of the lane change cases are randomly selected as the training data, and the remaining 20% are used as the test data for validation. Though RF can use the OOB data for validation, we still do this for comparison with the state-of-the-art methods.
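A random 80/20 split of this kind can be sketched with the standard library alone. The sketch assumes the split is made at the level of individual observations, which the text does not state explicitly; the function name, the seed, and the use of bare labels in place of full feature records are likewise illustrative assumptions.

```python
import random


def split_observations(observations, train_fraction=0.8, seed=42):
    """Shuffle a copy of the observations and cut it into train and test."""
    rng = random.Random(seed)
    shuffled = list(observations)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Stand-in labels matching the class counts reported in the data section.
labels = [1] * 375 + [0] * 1208
train, test = split_observations(labels)

print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so that competing models can be trained and tested on identical partitions.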

Parameter Determination.
The number of decision trees $B$ is an important parameter of RF. When building decision trees, RF does not prune them. Thus, the modeling accuracy of RF will increase rapidly with the number of decision trees at first. However, after reaching a certain number, generating more trees would not improve the model accuracy but would increase the computational burden. Previous studies showed that the total number of trees should be set at 200-500 [45,50]. To ensure the reliability of the modeling results, this paper sets the number of trees at 500. In RF, a randomly selected subset of features is used to build each single tree. Reducing the number of sampled features $m$ lowers the correlation among decision trees, leading to less generalization error. However, too small an $m$ would also make each single tree suffer from a large prediction error. Different values of $m$ have been used in different studies [49,58]; thus, the number of sampled features $m$ should be selected carefully. To select the best $m$, RF models are trained with $m$ increasing from 1 to 10. Table 5 shows the OOB errors for different values of $m$. One can find that the OOB error has the lowest value when $m$ is 3. Thus, the number of randomly sampled features $m$ is set at 3 in this study.
Figure 5: Schematic diagram of candidate variables.

Variable Importance.
The variable importance can be easily obtained by RF according to equation (2). The rank and importance values of the independent variables are shown in Table 6. According to Table 6, $Gap_{PF}$ and $(Gap_{PF}/Gap)$ are the two most important variables, whose importance values are much greater than those of the other variables. The reason is probably that merging vehicle drivers can easily observe the PL vehicles and control the relative speeds and positions with them. Thus, they tend to leave more space for their PF vehicles. This finding is consistent with that of the previous studies [20].

Feature Variable Selection.
From Table 6, one can find that the relative importance values of several variables are rather low, such as $TTC_L$ (0.18%), indicating that there are some redundant or irrelevant variables in the RF model. Therefore, a feature variable selection process introduced by Genuer et al. [59] is applied in this study. The basic steps are shown as follows:
(1) Build an RF model with all candidate variables and rank the variables by their relative importance values in descending order.
(2) Delete the variable with the lowest relative importance value and create a new variable set.
(3) Build a new RF model with the new variable set and rank the variables by their relative importance values in descending order.
(4) Repeat steps (2) and (3) until only one variable remains.
(5) Rank all the RF models established in steps (1) to (4) according to the OOB error, and select the model and feature variable set with the lowest error.
After feature variable selection, nine feature variables remain and the OOB error is reduced from 9.1% to 8.9%, indicating that reducing the number of feature variables will not reduce the prediction performance. The values of variable importance in the model are shown in Table 7. Table 7 shows that $Gap_{PF}$ and $(Gap_{PF}/Gap)$ are still the two most important factors. $\Delta V_F$ is the only variable related to the vehicles in the auxiliary lane, which means merging vehicle drivers mainly focus on the traffic condition in the main lane. Table 8 shows the prediction accuracy for training data and testing data. For comparison, a binary logit model and a CART model are also built based on the same dataset. The results show that the prediction accuracy of the RF model is much better than that of the binary logit model for both training data and test data. One can also find that CART has the highest prediction accuracy on the training data. However, the performance of CART on the testing data is much poorer than that of RF, indicating that RF deals with the problem of overfitting better than CART. In addition, due to the influence of collinearity of variables, only six variables are included in the binary logit model. Some variables that may affect the merging decision behavior in a certain range are ignored by the binary logit model, such as $TTC_{PL}$ and $\Delta V_F$.
It is clear that RF can overcome the collinearity problem and deeply explore the complicated nonlinear relationships between the merging decision and the influencing variables. One can also find that the drop in accuracy from the training dataset to the testing dataset is much smaller for RF than for the logit model and the CART model, showing that RF is practical for predicting the merging decision during the execution period and has better transferability.
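The five-step backward elimination used for feature variable selection in this section can be sketched generically. In the sketch, `fit` stands for "train an RF on these variables and return its OOB error and importance values"; here it is replaced by a hypothetical toy function, and all variable names, importance values, and error values are invented for illustration rather than taken from Tables 6 to 8.

```python
def backward_elimination(features, fit):
    """Drop the least important variable one at a time and return the
    (oob_error, ranked_features) pair with the lowest OOB error.

    fit(features) must return (oob_error, {feature: importance}).
    """
    current = list(features)
    history = []
    while current:
        oob_error, importance = fit(current)
        ranked = sorted(current, key=lambda f: importance[f], reverse=True)
        history.append((oob_error, ranked))
        if len(ranked) == 1:
            break
        current = ranked[:-1]  # delete the lowest-ranked variable
    return min(history, key=lambda item: item[0])


# Hypothetical stand-in for training an RF (toy values only).
TOY_IMPORTANCE = {"gap_pf": 0.50, "gap_ratio": 0.30, "dv_f": 0.15,
                  "noise_a": 0.03, "noise_b": 0.02}
USEFUL = ("gap_pf", "gap_ratio", "dv_f")


def toy_fit(features):
    n_noise = sum(f.startswith("noise") for f in features)
    n_missing = sum(f not in features for f in USEFUL)
    error = 0.05 + 0.01 * n_noise + 0.10 * n_missing
    return error, {f: TOY_IMPORTANCE[f] for f in features}


best_error, best_features = backward_elimination(TOY_IMPORTANCE, toy_fit)
```

In the toy setting the procedure discards the two noise variables and stops improving once a useful variable would have to be removed, which mirrors the idea of keeping the variable set with the lowest OOB error.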

Conclusions
This study conducts a comprehensive analysis of the influencing variables of the merging decision and employs the random forest (RF) to model the merging decision behavior during the execution period. The proposed RF method can accurately predict the merging decision during the execution period and identify important influencing factors. The US-101 vehicle trajectory data are used to train and validate the RF model. To comprehensively explore the influencing factors during merging execution, 19 candidate variables are extracted, including speeds, relative speeds, gaps, time-to-collisions (TTCs), and locations. The modeling results show that $Gap_{PF}$ and $(Gap_{PF}/Gap)$ are the two most important variables, whose importance values are much greater than those of the other variables. This is probably because merging vehicle drivers can easily observe the PL vehicles and control the relative speeds and positions with them, and thus they tend to leave more space for their PF vehicles. To select the effective variables, a feature variable selection process is adopted, and 9 variables are finally selected in the RF model. $Gap_{PF}$ and $(Gap_{PF}/Gap)$ are still the two most important feature variables. $\Delta V_F$ is the only variable related to the vehicles in the auxiliary lane, which means merging vehicles mainly focus on the traffic condition in the adjacent main lane. Evaluation of the performance in comparison with the state-of-the-art method reveals that the proposed method can obtain much more accurate results on both training and testing datasets. The drop in accuracy from the training dataset to the testing dataset is also much smaller than that of the logit model, showing that RF is practical for predicting the merging decision behavior during the execution period and has better transferability.
Furthermore, it is obvious that merging drivers face more challenges and may make improper decisions under congested traffic conditions, which might cause long delays. In the future, if vehicles can receive real-time information about the traffic environment via VANETs, the proposed RF models can help the merging vehicles make safer decisions. Thus, the results of this study can also improve the safety and comfort of driving assistance systems and autonomous driving systems.

Data Availability
The NGSIM data used to support the findings of this study have been deposited at the website: https://catalog.data.gov/dataset/next-generation-simulation-ngsim-vehicle-trajectories.

Conflicts of Interest
The authors declare that they have no conflicts of interest.