Analysis of Factors Affecting the Over-Representation of Sequential Crashes in Freeway Tunnels: Using Rule-Based Data Mining Method

The paper provides an empirical analysis of road/tunnel design, traffic volume, and environmental factors associated with the increased likelihood of sequential crashes in freeway tunnels. Association rule mining and decision tree methods are employed because both can identify complicated interactions among variables and express them in the form of rules. Results show that tunnel length, traffic congestion, time of day, season, and vehicle type are the significant factors influencing the likelihood of sequential crashes in freeway tunnels. More importantly, association rule mining and decision tree analysis reveal that combinations of road/tunnel design, traffic, and environmental factors produce an even higher likelihood of sequential crashes, leading to a series of hazardous situations. For example, when long tunnel, grade ≤ 2%, fourth congestion level, and winter are combined, the proportion of sequential crashes is more than twice the average proportion of sequential crashes in the complete tunnel crash database. Traffic safety management should pay more attention to monitoring these hazardous situations, which are more likely to be linked to sequential crashes.


Introduction
Tunnels are regarded as a key element of the mountainous freeway system because they shorten travel distances and have less impact on the ecological environment. Furthermore, progress in construction technology has made road tunnels a cost-effective infrastructure for developing new freeway networks. Therefore, the proportion of tunnel length in freeway networks, especially in mountainous freeways, has increased in recent years [1,2]. In spite of the travel time savings and environmental benefits, the increasing number of tunnel traffic crashes has become a major concern for freeway traffic safety [3]. Although the overall crash rate in tunnels appears lower than that on open sections, the severity and consequences of tunnel crashes are commonly higher [3,4].
A sequential crash is defined in this study as a crash where one vehicle hits a vehicle or object first and then another vehicle collides afterward. Because lane changing is not allowed in Chinese freeway tunnels, emergency braking is commonly the driver's first reaction to a hazard ahead, which creates new dangers that spread to the vehicles behind [5]. In addition, the complex interactions of light conditions, increased driving anxiety, and the increased difficulty of rescue and evacuation due to the limited inner space and semi-enclosed structure of the tunnel may greatly elevate the likelihood of sequential crashes [6]. Moreover, sequential crashes are likely to cause serious consequences [5]. A comprehensive understanding of the risk factors affecting sequential crashes in freeway tunnels is essential, since traffic professionals could use this information to reduce the occurrence of these crashes.
Various studies exist in the literature on tunnel traffic crashes. Some studies focus on developing safety performance functions, which associate the tunnel crash rate/frequency with risk factors [7-10]. These revealed significant factors including tunnel length, distance between two adjacent tunnels, traffic volume, surface friction coefficient, and tunnel alignment such as grade and curvature. Other studies investigate the severity of tunnel crashes and their influencing factors [1,11]. For example, Huang et al. [1] examined the interactive effect of mountainous freeway alignment, driving behavior, vehicle characteristics, and environmental factors on crash severity at freeway tunnel groups. In addition, several studies [2,12-15] were devoted to the analysis of crash characteristics (crash rate, crash severity, collision type, etc.) at different tunnel zones, i.e., access zone, entrance zone, interior zone, and exit zone, so as to understand the spatial distribution of tunnel crashes. Nonetheless, to the best of our knowledge, no previous study has investigated how risk factors influence sequential crashes in freeway tunnels.
From a methodological standpoint, statistical regression models such as logistic/probit regressions have been commonly applied to model the severity or consequence of traffic crashes [11,16,17]. However, statistical regression models make predefined assumptions about the distribution of the independent and dependent variables, and they have non-negligible shortcomings in untangling complex interactions between different variables [18,19]. Data mining techniques or machine learning methods are considered one way to overcome the assumption issues associated with statistical models [20-22]. Moreover, data mining techniques offer systematic ways of extracting useful rules and mining the complex relationships that are hidden in the data. This is particularly important for studying crashes because traffic safety analysts can use these rules to understand the events causing a crash and identify the interactions of variables influencing the consequence of a crash.
Among these techniques, the decision tree (DT) method is a preferred tool for rule extraction due to its simplicity and interpretability. The DT method has a simple hierarchical tree structure, which permits the extraction of decision rules in the form of "IF A THEN B" rules. Moreover, the DT method is a white-box model, which makes it easy to interpret. Many other techniques such as support vector machines (SVMs) and artificial neural networks (ANNs) can also be employed in crash studies for rule extraction [23,24]. However, both SVMs and ANNs generate black-box models whose prediction results are difficult to explain [25]. When understanding the internal knowledge of the learning process is more important than the prediction results, white-box models are more appropriate. As a representative white-box model, DT allows interactions between conditional predictors to be read from the tree structure to explain the response variable, and it can quantify the importance of each variable for the prediction results. Because of these advantages, DT has been widely used in traffic crash analysis [1,26-29].
However, rules extracted by the DT method also have certain limitations. The DT method usually employs a greedy approach; therefore, it is difficult to obtain a globally optimal solution [30]. Specifically, every rule extracted from a tree starts from the root node, which is where the rule (IF) begins. However, there could be other important rules that do not start from the root node, and these would not be detected by the DT method. The association rule (AR) method is another commonly used rule-based data mining technique for discovering interesting relations or rules among variables in large databases [18,31-35]. The AR method can be regarded as a process of looking through all possible multidimensional contingency tables and extracting the interesting rules and patterns [32,36]. Unlike the DT method, the AR method uses a brute-force, exhaustive global search and thus may yield richer rules than decision trees. But the AR method also has its limitations, since many rules are redundant or unrelated, which adversely affects the results. In addition, the AR method is a descriptive data mining approach (that is, it is an unsupervised learning technique) that cannot be used for prediction.
Therefore, this paper applies data mining techniques to identify the combined effect of contributing factors (road/tunnel design, traffic volume, and environmental factors) affecting the likelihood of sequential crashes in freeway tunnels. Two rule-based data mining methods, i.e., association rule (AR) and decision tree (DT), are applied together in this study. Comparing them shows how two different techniques can generate different results for this particular task, and combining the results from both yields more comprehensive rules and more useful information underlying the data. The framework of this paper is shown in Figure 1. First, data on tunnel geometry, traffic crashes, and traffic volume are extracted and cleaned in order to set up the combined dataset. Second, two types of data mining techniques, i.e., an improved association rule mining method (combined with statistical testing) and a decision tree, are used to find the interesting rules linking sequential crashes in freeway tunnels with risk factors covering tunnel length, tunnel geometry, traffic volume, and weather factors. Finally, based on the obtained rules, evidence-based safety interventions to prevent sequential crashes in freeway tunnels are recommended.

Data
Twelve freeway tunnels are selected for this study. All selected tunnels are located on a freeway section with a length of 40 km in the west of Hunan Province, China. These tunnels are four-lane twin-tube tunnels with two lanes in each direction (that is, 12 westbound one-way traffic tunnels and 12 eastbound one-way traffic tunnels). The length of the tunnels ranges from 155 m to approximately 7000 m. Some of the spacing distances between adjacent tunnels are less than 1000 m, indicating that these tunnels constitute a typical freeway tunnel group [2]. The sketch of these selected tunnels can be seen in our previous studies [1,2].
Journal of Advanced Transportation
Table 1 ... and environmental factors such as weather conditions, the precise crash time, and location of the crash. According to the definition of a sequential crash, the sequential crashes in this study include two types: (1) a crash in which one vehicle hits a vehicle and then another vehicle collides afterward; (2) a crash in which one vehicle hits an object and then another vehicle collides afterward. This can be determined by using multiple crash contributory factors recorded in the crash data report, such as the cause of the crash, the rear-end collision, and the number of vehicles involved in a crash.
When drivers approach and drive away from the tunnel, the variation in light environment between the inside and outside of the tunnel may affect traffic safety. It is therefore reasonable to extend the tunnel beyond its portals and treat the result as a tunnel affected area. According to the Chinese tunnel lighting guideline [37] and previous studies on tunnel zoning methods [2,12,14], the tunnel affected area in this study is divided into four distinct zones: access zone, entrance zone, interior zone, and exit zone (shown in Figure 2).
In this study, the access zone is the first 300 m of the open road before the tunnel portal in the traffic direction (approximately equal to the driver's fixation point distance at a driving speed of 70 km/h). The entrance zone is the first 300 m of the tunnel directly after the portal. According to the Chinese tunnel lighting guideline [37], the entrance zone includes the threshold zone and the transition zone. The length of the threshold zone is approximately equal to the stopping sight distance (about 100 m at a driving speed of 80 km/h), and the length of the transition zone varies from 0 to a 14-second driving distance (about 200 m for a 10-second driving distance at 80 km/h). The exit zone includes the last 100 m of the tunnel and the first 200 m of the open section directly after the exit portal (approximately equal to the light adaptation distance at a driving speed of 80 km/h).
Data for traffic crashes between January 2013 (start date of the ATSPM system operations) and December 2019 that occurred in the selected 40 km freeway section are extracted from the crash report. There are a total of 2285 crashes, including 2140 property damage only crashes, 107 injury crashes, and 38 fatal crashes. Among the 2285 crashes, 985 crashes occurred in the tunnel affected area, and 1273 crashes occurred on the open road (not in the tunnel affected area). Of these 985 tunnel crashes, 156 are sequential crashes. The proportion of sequential crashes in the tunnel affected area (15.8%) is significantly higher than the proportion on the open road (7.2%). This result is consistent with the study of Zhou et al. [5]. Based on crash data for the freeway network of Guizhou Province, their comparative study found that the proportion of sequential crashes in freeway tunnels (19.6%) is over three times the proportion on general freeways (5.9%). These crash statistics indicate a significant over-representation of sequential crashes in freeway tunnels compared to general freeways. Thus, this study analyzes the factors affecting the over-representation of sequential crashes in freeway tunnels.

Tunnel Geometric Design.
Previous studies have demonstrated that road/tunnel geometric design influences both crash occurrence and the injury severity of a crash [3,8,9]. Here, tunnel geometric design is likewise assumed to contribute to sequential crashes. Thus, tunnel geometric features such as longitudinal grade, horizontal curvature, and tunnel length are obtained from the Freeway Management Bureau of Hunan Province for this analysis. The steepest longitudinal grade of these tunnels is about 4.0%, and the smallest horizontal curve radius is about 800 m. To facilitate the model analysis, the grade is transformed into two categories: steep grade (larger than 2.0%) and non-steep grade (lower than 2.0%). The horizontal alignment is also divided into two types: straight and curve.
Based on the tunnel length, these tunnels can be divided into two categories: long tunnels (longer than 1000 m) and short/medium tunnels [37]. According to the distance between two adjacent tunnels, these tunnels can be divided into two categories: tunnel groups (less than 1000 m) and single tunnels. More detailed information about the definition of tunnel groups can be found in [2]. In addition, there are significant differences in light conditions between tunnel zones, which may affect sequential crashes. Four tunnel zones (Figure 2), i.e., access zone, entrance zone, interior zone, and exit zone, are selected as categorical variables for model development. For each tunnel crash, the corresponding roadway geometry (grade and curve), tunnel length type, tunnel group or single tunnel, and type of tunnel zone are identified and matched.

Traffic Volume.
Traffic volume data from January 2013 to December 2019 for the selected freeways are obtained from the Hunan Highway Traffic Investigation System 2.0 (HTIS), which is operated by the Hunan Provincial Department of Traffic Police (https://jd.hnjt.gov.cn:8080/jd/login.html). The HTIS detectors are installed close to ramp gore areas; the average spacing between detectors is about 10 km, and at least one HTIS detector is installed between two adjacent off-ramps. Because data at a shorter measurement interval (5 min or 10 min) are not available, the normalized hourly volume (NHV) before a crash occurs is selected as a fairly good proxy for the real-time traffic volume.
There are five vehicle classes defined by the Chinese Freeway Network Toll System with respect to vehicles' head height, axle number, wheel number, and wheelbase. The weights for classes 1 to 5 are 1, 1.5, 2, 3, and 3.5, respectively. Normalized hourly volume is the weighted sum of hourly volume for each vehicle class. Detailed information can be found in [17]. Combined with the design traffic capacity of the freeway, the NHV is transformed into a more reasonable categorical variable, i.e., the hourly congestive level. According to the Chinese technical standard of highway engineering (JTG B01-2014) [38], we classify the hourly congestive level into four levels based on the ratio between normalized hourly volume and normalized hourly capacity (or V/C ratio for short). Specifically, the hourly congestive level is divided into four levels as follows:
$$
\text{hourly congestive level} =
\begin{cases}
\text{first level}, & V/C \le a_1, \\
\text{second level}, & a_1 < V/C \le a_2, \\
\text{third level}, & a_2 < V/C \le a_3, \\
\text{fourth level}, & V/C > a_3,
\end{cases}
\tag{1}
$$
where, according to JTG B01-2014, the hourly capacity of a freeway with 80 km/h design speed is 1900 pcu for one lane, and the values of $a_1$, $a_2$, and $a_3$ are 0.67, 0.83, and 1.00, respectively. For each tunnel crash, the hourly congestive level before the crash occurs is identified and matched with the crash-related and geometric design variables. Table 2 illustrates the descriptive statistics of all dependent variables for sequential crashes and non-sequential crashes in selected freeway tunnels.
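The weighting and classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the normalized hourly capacity is taken to be the per-lane capacity times the number of lanes in one direction (two lanes per tube), each threshold is treated as an inclusive upper bound, and the function names and vehicle counts are hypothetical.

```python
# Weights for vehicle classes 1-5 and thresholds a1, a2, a3, as stated in the text.
CLASS_WEIGHTS = {1: 1.0, 2: 1.5, 3: 2.0, 4: 3.0, 5: 3.5}
LANE_CAPACITY_PCU = 1900          # per lane, 80 km/h design speed (JTG B01-2014)
THRESHOLDS = (0.67, 0.83, 1.00)   # a1, a2, a3


def normalized_hourly_volume(counts):
    """Weighted sum of hourly vehicle counts, keyed by vehicle class."""
    return sum(CLASS_WEIGHTS[cls] * n for cls, n in counts.items())


def congestion_level(nhv, lanes=2):
    """Return the congestion level (1-4) from the V/C ratio.

    Assumption: capacity = lanes * per-lane capacity, and each threshold
    is an inclusive upper bound for its level.
    """
    vc = nhv / (LANE_CAPACITY_PCU * lanes)
    for level, upper in enumerate(THRESHOLDS, start=1):
        if vc <= upper:
            return level
    return 4


# Hypothetical hourly counts per vehicle class (not taken from the paper).
example_counts = {1: 1500, 2: 200, 3: 100, 4: 150, 5: 100}
example_nhv = normalized_hourly_volume(example_counts)
example_level = congestion_level(example_nhv)
```

For the hypothetical counts, the weighted volume is 2800 pcu against a two-lane capacity of 3800 pcu, which falls between $a_1$ and $a_2$.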

Association Rule Mining.
The association rule is a commonly used rule-based machine learning method for finding interesting relationships among items in a database. Generally, a rule takes the form $A \to B$ with the two restrictions $A, B \subset I$ and $A \cap B = \emptyset$, where $I$ represents a set of items. $A$ and $B$ are called, respectively, the antecedent and consequent of the rule. The rule $A \to B$ can be read as follows: in a database where item $A$ occurs, there is a high probability that item $B$ occurs as well. Note that each rule can have multiple items, i.e., a set of items, as antecedent and consequent. In this study, the association rule is utilized to investigate interesting relationships between multiple crash-contributing factors (antecedent of the rule) and sequential crashes (consequent of the rule) in freeway tunnel crashes. The Apriori algorithm, proposed by Agrawal et al. [39], is utilized to obtain strong association rules among items of the crash data collected here. The Apriori algorithm applies a level-wise search for mining frequent itemsets, and the extraction of association rules is based on three indicators, namely, support, confidence, and lift. The support of an association rule, denoted by $S_{A \to B}$, measures the probability of item $A$ and item $B$ occurring together in the entire crash database, expressed as
$$
S_{A \to B} = \frac{N(A \cap B)}{N}, \tag{2}
$$
where $N(A \cap B)$ is the number of crashes where both item $A$ (antecedent) and item $B$ (consequent) occurred together and $N$ is the total number of crashes. Similarly, the supports of item $A$ and item $B$ are, respectively, expressed as
$$
S_A = \frac{N(A)}{N}, \tag{3}
$$
$$
S_B = \frac{N(B)}{N}, \tag{4}
$$
where $N(A)$ is the number of crashes where item $A$ occurs and $N(B)$ is the number of crashes where item $B$ occurs. The confidence of an association rule, denoted by $C_{A \to B}$, is the probability that item $B$ occurs given that item $A$ occurs.
The confidence can be interpreted as an estimate of the conditional probability $P(B \mid A)$, expressed as
$$
C_{A \to B} = \frac{S_{A \to B}}{S_A} = \frac{N(A \cap B)}{N(A)}. \tag{5}
$$
The lift of an association rule, denoted by $L_{A \to B}$, relates the frequency of co-occurrence of $A$ and $B$ to the expected frequency of co-occurrence under the assumption of independence. The lift measures the correlation between $A$ and $B$ and is expressed as follows:
$$
L_{A \to B} = \frac{S_{A \to B}}{S_A \cdot S_B} = \frac{C_{A \to B}}{S_B}. \tag{6}
$$
A lift value smaller than 1 indicates negative interdependence between antecedent and consequent, and a value equal to 1 indicates independence, while a value greater than 1 indicates positive interdependence.
According to Han et al. [20], an association rule $A \to B$ classified as interesting will satisfy
$$
S_{A \to B} \ge \delta, \qquad C_{A \to B} \ge \varepsilon, \qquad L_{A \to B} \ge \gamma, \tag{7}
$$
where $\delta$, $\varepsilon$, and $\gamma$ are the predefined minimum thresholds for support, confidence, and lift. There is no established criterion for selecting threshold values for support and confidence. Different studies have used different support and confidence thresholds depending on the number of available data points and on whether strong rules could be obtained [31,40].
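To make the three indicators concrete, the following minimal sketch computes support, confidence, and lift for a single rule by direct counting, and checks the rule against minimum thresholds. The function names and the ten toy crash records are hypothetical, and the default thresholds are the values this study adopts for its own data.

```python
def rule_metrics(records, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent.

    records: list of item sets, one per crash.
    antecedent, consequent: disjoint item sets.
    """
    n = len(records)
    n_a = sum(antecedent <= r for r in records)                  # N(A)
    n_b = sum(consequent <= r for r in records)                  # N(B)
    n_ab = sum((antecedent | consequent) <= r for r in records)  # N(A and B)
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    lift = confidence / (n_b / n) if n_b else 0.0
    return support, confidence, lift


def is_interesting(metrics, min_support=0.01, min_confidence=0.16, min_lift=1.05):
    """Check a rule against minimum support, confidence, and lift."""
    support, confidence, lift = metrics
    return support >= min_support and confidence >= min_confidence and lift >= min_lift


# Ten hypothetical crash records (illustrative only).
toy_records = [
    {"bus", "seq"}, {"bus", "seq"}, {"bus"}, {"bus"}, {"seq"},
    set(), set(), set(), set(), set(),
]
toy_metrics = rule_metrics(toy_records, {"bus"}, {"seq"})
```

In the toy data, 2 of the 4 bus crashes are sequential while 3 of all 10 crashes are, so the confidence (0.5) exceeds the baseline proportion (0.3) and the lift is greater than 1.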
The rules generated by minimum support-confidence-lift constraints are often too numerous to be utilized efficiently. Moreover, some rules are redundant, random, or coincidental, which adversely affects the results [2,41]. Thus, a redundancy check is conducted for each rule to address the issue that rules with the same consequent but different antecedents may imply nearly the same knowledge. Following previous studies [42,43], a lift increase criterion (LIC) is employed to ensure that the generated rules are not redundant. It compares the lift of an (n + 1)-item rule with the lift of its sub-set n-item rule (with the same consequent), given by
$$
\mathrm{LIC} = \frac{L(R_{n+1})}{L(R_n)}, \tag{8}
$$
where $R_{n+1}$ is the (n + 1)-item rule and $R_n$ is its sub-set n-item rule with the same consequent. In addition, to ensure that the rules obtained are not due to randomness or coincidence, the chi-squared test is used to determine whether there is a statistically significant correlation between the antecedent and consequent of an association rule $A \to B$. Rules without statistical significance are removed.
The proposed association rule mining for detecting strong associations between the over-representation of sequential crashes and risk factors can be summarized in three steps:
(i) Step 1. Generate initial rules based on minimum support, minimum confidence, and minimum lift by employing the Apriori algorithm. Specifically, the Apriori algorithm is performed by keeping crash type = sequential crash as the consequent. Crash-contributing factors and crash characteristics (listed in Table 1) are kept as the antecedents. Considering the proportion of sequential crashes in the total samples (0.158) and the reliability of the rules, the minimum support, minimum confidence, and minimum lift are set to 0.01, 0.16, and 1.05, respectively.
(ii) Step 2. Remove statistically insignificant rules by the chi-squared test. Specifically, rules whose test statistic has a P value > 0.10 are removed from the rules generated in Step 1.
(iii) Step 3. Check the redundancy of each rule by minimum improvement. In this study, to obtain a more refined and significant rule set, we remove those rules whose lift increase is less than 1.05, which is consistent with previous studies [31].
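The three steps can be sketched end to end on toy data. This is an illustration under explicit assumptions rather than the procedure actually run in the study: antecedents are enumerated by brute force with `itertools.combinations` instead of Apriori's level-wise pruning, the significance test is a Pearson chi-squared test on the 2x2 antecedent/consequent table without continuity correction, and a rule is treated as redundant when its lift divided by the largest lift among its one-item-shorter sub-rules is below the threshold. All names and records are hypothetical.

```python
from itertools import combinations
from math import erfc, sqrt


def counts(records, antecedent, target):
    n = len(records)
    n_a = sum(antecedent <= r for r in records)
    n_b = sum(target in r for r in records)
    n_ab = sum(antecedent <= r and target in r for r in records)
    return n, n_a, n_b, n_ab


def lift(records, antecedent, target):
    n, n_a, n_b, n_ab = counts(records, antecedent, target)
    return (n_ab / n_a) / (n_b / n) if n_a and n_b else 0.0


def chi2_p_value(records, antecedent, target):
    """P value of the Pearson chi-squared test (1 d.f.) on the 2x2 table."""
    n, n_a, n_b, n_ab = counts(records, antecedent, target)
    observed = [n_ab, n_a - n_ab, n_b - n_ab, n - n_a - n_b + n_ab]
    expected = [n_a * n_b / n, n_a * (n - n_b) / n,
                (n - n_a) * n_b / n, (n - n_a) * (n - n_b) / n]
    if min(expected) == 0:
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return erfc(sqrt(chi2 / 2))  # chi-squared survival function for 1 d.f.


def mine_rules(records, target, max_len=3, min_sup=0.01, min_conf=0.16,
               min_lift=1.05, alpha=0.10, min_lic=1.05):
    items = sorted({i for r in records for i in r} - {target})
    kept = []
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            a = frozenset(combo)
            n, n_a, n_b, n_ab = counts(records, a, target)
            if n_a == 0:
                continue
            sup, conf, lft = n_ab / n, n_ab / n_a, lift(records, a, target)
            # Step 1: minimum support, confidence, and lift.
            if sup < min_sup or conf < min_conf or lft < min_lift:
                continue
            # Step 2: drop statistically insignificant rules.
            if chi2_p_value(records, a, target) > alpha:
                continue
            # Step 3: redundancy check against one-item-shorter sub-rules.
            if k > 1:
                best_sub = max(lift(records, a - {i}, target) for i in a)
                if best_sub > 0 and lft / best_sub < min_lic:
                    continue
            kept.append((a, sup, conf, lft))
    return kept


# Hypothetical records: "day" is attached to every other crash in each group,
# so it carries no information about the target "seq".
groups = [({"long", "winter", "seq"}, 20), ({"long", "winter"}, 4),
          ({"long", "seq"}, 10), ({"long"}, 40), ({"winter", "seq"}, 6),
          ({"winter"}, 20), ({"seq"}, 4), (set(), 96)]
records = [frozenset(items | ({"day"} if i % 2 == 0 else set()))
           for items, count in groups for i in range(count)]
rules = mine_rules(records, "seq")
```

On this toy data, the rule with antecedent "day" fails the lift threshold in Step 1, and every rule that merely appends "day" to an informative antecedent is dropped in Step 3 because its lift does not improve on the shorter rule.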

Classification and Regression Tree (CART).
A decision tree (DT) is usually employed to classify data or to fit a regression model by extracting the logical structure contained in the dataset. Among DT algorithms such as ID3, CHAID, CART, and C4.5, the CART method is known to be one of the most successful techniques and is widely used to explain the causes of traffic crashes [1,44,45]. Its advantages include the following. (1) CART can use the same parameters more than once in the process of generating the tree, and thus the algorithm is flexible and powerful. (2) CART always yields binary trees, which are efficient and very simple. Thus, the CART method is selected in the present study. There are two main types of CART, namely, the classification tree and the regression tree. The algorithm generates one of these two trees depending on the type of dependent variable. That is, if the variable is categorical, a classification tree is generated; if the variable is numerical, a regression tree is generated. The dependent variable used in the present study is categorical: sequential crash or non-sequential crash. Thus, the tree generated is a classification tree. The construction of a classification tree is a top-down approach: the split starts from the top node, and at each step further splits are made using certain metrics so as to produce the most homogeneous sub-sets. The CART algorithm uses the Gini index to measure the contribution of each split towards maximizing homogeneity. The Gini index measures how often an element of the sub-set would be labeled incorrectly (for example, a sequential crash labeled as a non-sequential crash). The heterogeneity at any node t can be evaluated in terms of the impurity index Gini(t), which is calculated by the following formulas.
$$
\mathrm{Gini}(t) = 1 - \sum_{j} p(j \mid t)^{2}, \tag{9}
$$
where
$$
p(j \mid t) = \frac{p(j, t)}{p(t)}, \qquad p(j, t) = \frac{\pi(j)\, m_j(t)}{m_j}, \tag{10}
$$
$$
p(t) = \sum_{j} p(j, t), \tag{11}
$$
where $p(j \mid t)$ denotes the probability that a crash in node $t$ belongs to class $j$; $p(j, t)$ denotes the probability that a crash is both of class $j$ and in node $t$; $p(t)$ denotes the probability that a crash falls in node $t$; $\pi(j)$ denotes the prior probability of class $j$; $m_j(t)$ denotes the number of crashes of class $j$ at node $t$; and $m_j$ represents the number of crashes of class $j$ in the tree. A smaller $\mathrm{Gini}(t)$ indicates a higher purity of the samples in node $t$. Thus, the criterion for attribute splitting is to minimize the weighted average of the Gini index in the child nodes, i.e., to maximize $\Delta\mathrm{Gini}(t)$ as defined by the following formula:
$$
\Delta\mathrm{Gini}(t) = \mathrm{Gini}(t) - W_l\, \mathrm{Gini}(t_l) - W_r\, \mathrm{Gini}(t_r), \tag{12}
$$
where $\Delta\mathrm{Gini}(t)$ represents the impurity difference of sample set $S$ before and after a split, $t_l$ and $t_r$ represent the sample sets of the left and right child nodes, respectively, and $W_l$ and $W_r$ represent the proportions of $t_l$ and $t_r$, respectively. Furthermore, CART allows the importance of each independent variable $x_j$ to be evaluated through the variable importance measure $\mathrm{VIM}(x_j)$. $\mathrm{VIM}(x_j)$ measures the reduction in total impurity of the tree achieved by splitting on variable $x_j$, which can be calculated by the following formula:
$$
\mathrm{VIM}(x_j) = \sum_{t=1}^{T} \frac{N_t}{N}\, \Delta\mathrm{Gini}(x_j, t), \tag{13}
$$
where $T$ denotes the total number of nodes, $N_t$ stands for the number of observations in the dataset that belong to node $t$, $N$ represents the total number of observations, and $\Delta\mathrm{Gini}(x_j, t)$ denotes the Gini reduction at node $t$ after splitting node $t$ into two child nodes according to the variable $x_j$, which is calculated by formula (12).
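As a small numerical illustration of the impurity and split criteria, the sketch below computes the Gini index of a node, the impurity reduction of a binary split, and an importance score accumulated over splits. It assumes that the priors equal the observed class frequencies, so that $p(j \mid t)$ reduces to the class share within the node. The function names and the ten-crash example are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node, using within-node class shares as p(j|t)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, left, right):
    """Impurity reduction of a binary split of `parent` into `left` and `right`."""
    w_l = len(left) / len(parent)
    w_r = len(right) / len(parent)
    return gini(parent) - w_l * gini(left) - w_r * gini(right)


def variable_importance(splits, n_total):
    """Sum (N_t / N) * gain over the nodes split on each variable.

    splits: iterable of (variable_name, n_node, gain) tuples.
    """
    vim = {}
    for variable, n_node, gain in splits:
        vim[variable] = vim.get(variable, 0.0) + n_node / n_total * gain
    return vim


# Hypothetical node: 4 sequential ("seq") and 6 non-sequential ("non") crashes.
left = ["seq"] * 3 + ["non"] * 1
right = ["seq"] * 1 + ["non"] * 5
parent = left + right
gain = gini_gain(parent, left, right)
vim = variable_importance([("tunnel_length", len(parent), gain)], n_total=len(parent))
```

Here the parent impurity is 0.48, and the split concentrates sequential crashes in the left child, so the weighted child impurity is lower and the gain is positive.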

Cost-Sensitive CART Model.
In this study, the crash samples are highly unbalanced between sequential and non-sequential crashes, since the proportion of sequential crashes is much smaller than that of non-sequential crashes. To address the imbalance of the dependent variable in the CART model, we introduce a cost-sensitive learning process into the basic CART model, following Zhu and Meng [45]. Specifically, a misclassification cost matrix $C$ of size $J \times J$ is specified, where the column is the actual class of a crash (sequential or non-sequential), the row is the predicted class, and $c(i, j)$ is the cost of misclassifying class $j$ as class $i$. The diagonal values are zero ($c(i, j) = 0, \forall i = j$), while the off-diagonal values are positive and represent the cost of misclassification ($c(i, j) > 0, \forall i \ne j$). In the basic CART model, the off-diagonal values are all set to 1 ($c(i, j) = 1, \forall i \ne j$), which means that the misclassification cost is considered identical for different classes. The cost-sensitive CART model, on the other hand, assigns a higher penalty cost for misclassifying a sample from the minority class. Importantly, the misclassification cost can be embedded into the cost-sensitive CART via adjusted priors $\pi'(j)$. That is, the prior probability $\pi(j)$ in formula (10) is replaced with
$$
\pi'(j) = \frac{C(j)\, \pi(j)}{\sum_{i} C(i)\, \pi(i)}, \tag{14}
$$
where
$$
C(j) = \sum_{i} c(i, j), \qquad \pi(j) = \frac{m_j}{m},
$$
where $\pi(j)$ represents the prior and $m$ denotes the total number of tunnel crashes. With formula (14), putting a larger prior on the minority class tends to reduce its misclassification rate. To demonstrate the cost-sensitive learning process in addressing the class imbalance issue, the class-weighted ratio ($c^{+}/c^{-}$) is introduced. We set the minority class $j$ (sequential crashes) as the positive class and class $i$ (non-sequential crashes) as the negative class. $c^{+} = c(i, j)$ is the cost of misclassifying class $j$ (i.e., sequential crashes) as class $i$ (non-sequential crashes), while $c^{-} = c(j, i)$ is the cost of misclassifying non-sequential crashes as sequential crashes. Thereby, the value of $c^{+}$ should be higher than $c^{-}$ in order to control the balance between sequential and non-sequential crashes.
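The effect of the cost matrix on the priors can be illustrated numerically. The sketch below assumes the altered-priors form commonly used with CART, in which each class prior is multiplied by the total cost of misclassifying that class and the result is renormalized; whether this matches the study's formula term for term is an assumption. The class counts are the 156 sequential crashes among 985 tunnel crashes reported in the data section, the cost ratio of 4 is one of the ratios examined later, and the function and variable names are hypothetical.

```python
def adjusted_priors(class_counts, cost):
    """Cost-adjusted class priors.

    class_counts: {class: number of crashes}.
    cost[i][j]: cost of predicting class i when the true class is j.
    Assumption: pi'(j) is proportional to C(j) * pi(j), with C(j) = sum_i cost[i][j].
    """
    m = sum(class_counts.values())
    priors = {j: n / m for j, n in class_counts.items()}
    column_cost = {j: sum(cost[i][j] for i in class_counts) for j in class_counts}
    norm = sum(column_cost[j] * priors[j] for j in class_counts)
    return {j: column_cost[j] * priors[j] / norm for j in class_counts}


class_counts = {"seq": 156, "non": 985 - 156}
# c+ = cost["non"]["seq"] = 4 (sequential misclassified as non-sequential);
# c- = cost["seq"]["non"] = 1 (non-sequential misclassified as sequential).
cost = {
    "seq": {"seq": 0, "non": 1},
    "non": {"seq": 4, "non": 0},
}
new_priors = adjusted_priors(class_counts, cost)
```

With these inputs the prior of the sequential class rises from about 0.16 to about 0.43, which is how a higher $c^{+}$ shifts the splits toward detecting the minority class.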
To assess the performance of the cost-sensitive CART, classic metrics for classification problems are adopted, including accuracy, sensitivity, specificity, and AUC. Accuracy is defined as the percentage of cases correctly classified by the classifier. Sensitivity and specificity indicate the performance of the model in predicting sequential crashes and non-sequential crashes, respectively. AUC represents the area under the receiver operating characteristic curve, where a value of 1 describes a perfect prediction. The equations that define these indicators are
$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
$$
$$
\text{sensitivity} = \frac{TP}{TP + FN},
$$
$$
\text{specificity} = \frac{TN}{TN + FP},
$$
where true positive (TP), false positive (FP), true negative (TN), and false negative (FN) are components of the confusion matrix. TP represents the number of positive cases (sequential crashes) that are predicted as positive. FP represents the number of non-sequential crashes that are classified as sequential crashes, while FN represents the number of sequential crashes that are classified as non-sequential crashes.
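A minimal sketch of these metrics, computed from confusion-matrix counts, is given below. Sequential crashes are the positive class; the function name and the counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # sequential crashes found
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # non-sequential crashes found
    }


# Hypothetical test-set counts (illustrative only).
metrics = classification_metrics(tp=30, fp=80, tn=169, fn=17)

# A classifier that predicts "non-sequential" for every crash has high accuracy
# on unbalanced data but zero sensitivity.
all_negative = classification_metrics(tp=0, fp=0, tn=249, fn=47)
```

The second call shows why accuracy alone is misleading for unbalanced classes: the all-negative classifier is correct for most crashes yet identifies no sequential crash.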

Converting the Tree Structure to Decision Rules.
After the tree is constructed, it is easy to convert the tree into a rule set by deriving a rule for each path in the tree that starts at the root and ends at a leaf node. Specifically, the rules are configured from the terminal nodes, with all the splits of the parent nodes being the antecedent and the class of the terminal node being the consequent. For each terminal node t, support, confidence, and lift are calculated: (a) the support is the ratio between the number of crashes belonging to class j (i.e., sequential crash) in node t and the total number of crashes, which corresponds to formula (2); (b) the confidence is the proportion of crashes in node t that belong to class j (i.e., sequential crashes), which corresponds to formula (5); and (c) the lift is the ratio between the proportion of crashes belonging to class j in node t and the baseline proportion in the total samples, which corresponds to formula (6). In this study, the cost-sensitive CART is selected to analyze factors influencing the likelihood of sequential crashes in freeway tunnels. The process can be summarized in the following steps:
(i) Step 1. Generate the tree based on the Gini index by using the cost-sensitive CART algorithm. The crash type, i.e., sequential crash or non-sequential crash, represented by a dummy variable, is used as the dependent variable, while crash-related factors (in Table 1) are specified as independent variables.
(ii) Step 2. Calculate the VIM of each independent variable in order to understand the relative importance of each factor for the likelihood of sequential crashes.
(iii) Step 3. Obtain rules from the tree and calculate the support, confidence, and lift values of each rule. Furthermore, rules obtained by the decision tree are compared with the rules from the association rule mining.
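The conversion of one terminal node into a rule, as described above, can be sketched as follows. The path conditions and leaf counts are hypothetical; only the totals (985 tunnel crashes, 156 of them sequential) come from the data description, and the function name is an illustrative choice.

```python
def leaf_to_rule(path_conditions, n_leaf, n_leaf_target, n_total, n_target):
    """Turn a root-to-leaf path into an IF-THEN rule with its three indicators.

    n_leaf: crashes in the terminal node; n_leaf_target: sequential crashes in it;
    n_total, n_target: crashes and sequential crashes in the whole sample.
    """
    support = n_leaf_target / n_total
    confidence = n_leaf_target / n_leaf
    lift = confidence / (n_target / n_total)
    text = "IF " + " AND ".join(path_conditions) + " THEN sequential crash"
    return text, support, confidence, lift


# Hypothetical leaf: 50 crashes, 16 of them sequential.
rule_text, rule_support, rule_confidence, rule_lift = leaf_to_rule(
    ["tunnel length = long", "congestion level = fourth"],
    n_leaf=50, n_leaf_target=16, n_total=985, n_target=156,
)
```

Because the indicators are defined identically for tree rules and association rules, the two rule sets can be compared directly on support, confidence, and lift.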

Association Rule Analysis.
Association rule analysis is conducted for 985 crashes that occurred in freeway tunnels, which include 156 sequential crashes and 829 non-sequential crashes. We implement the association rule mining in three steps (as described in Section 3.1): Apriori algorithm, statistical significance test, and redundancy check. By employing the Apriori algorithm, we obtain 50 rules with the sequential crash as consequent that satisfy the minimum support-confidence-lift constraints (support ≥ 0.01, confidence ≥ 0.16, and lift ≥ 1.05). Then, a chi-square test is performed to test the significance of the rules, and five statistically insignificant rules (P value > 0.10) are removed. Finally, the lift increase criterion is applied to the statistically significant rules, and twenty-one rules (lift increase criterion < 1.05) are removed because they fail the redundancy check. After completion of Step 3, we obtain 24 meaningful and significant rules with the sequential crash as consequent, as shown in Table 3. Rules with a higher lift value indicate stronger associations between antecedent and consequent. There are six two-itemset rules with a lift value higher than 1.05 (rule 1, rule 10, rule 18, rule 22, rule 23, and rule 24). This indicates that sequential crashes are positively associated with these six factors: long tunnel, fourth congestion level, afternoon, bus, winter, and grade ≤ 2%. The values of support, confidence, and lift can provide useful insights. For example, the two-itemset rule with the highest lift is bus → sequential crash (support = 0.011, confidence = 0.268, and lift = 1.696). The support value indicates that 1.1% of freeway tunnel crashes are sequential crashes that involved buses. The confidence value indicates that, out of all freeway tunnel crashes that involved buses, the proportion of sequential crashes is 26.8%. The lift value indicates that the proportion of sequential crashes among bus-involved crashes in freeway tunnels is 1.696 times the proportion of sequential crashes in the complete crash database.
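To make the significance step concrete, the following Python sketch applies a Pearson chi-square test to the bus rule. The totals (985 crashes, 156 sequential) come from the text; the bus counts (41 bus-involved crashes, 11 of them sequential) are assumptions chosen only because they reproduce the reported support (0.011) and confidence (0.268), and the test is computed without continuity correction, which may differ from the authors' procedure.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table
    [[a, b], [c, d]], where rows are antecedent present / absent and
    columns are sequential / non-sequential crashes."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


N_TOTAL, N_SEQ = 985, 156     # from the text
BUS_ALL, BUS_SEQ = 41, 11     # ASSUMED counts, consistent with support 0.011 and confidence 0.268

chi2 = chi_square_2x2(
    BUS_SEQ,                                   # bus, sequential
    BUS_ALL - BUS_SEQ,                         # bus, non-sequential
    N_SEQ - BUS_SEQ,                           # no bus, sequential
    (N_TOTAL - N_SEQ) - (BUS_ALL - BUS_SEQ),   # no bus, non-sequential
)

CRITICAL_010 = 2.706  # chi-square critical value for 1 degree of freedom at alpha = 0.10
print(f"chi-square = {chi2:.3f}; significant at the 0.10 level: {chi2 > CRITICAL_010}")
```

Under these assumed counts the statistic exceeds the 0.10 critical value, which is consistent with the bus rule being retained after the significance test.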
The lift values of the 3-item and 4-item association rules are generally greater than those of the 2-item association rules, indicating that sequential crashes in freeway tunnels result from a combination of multiple factors related to road/tunnel design, traffic, and the environment. For example, the rule with the highest lift value (2.188) is the 4-item rule 9 (long tunnel and grade ≤ 2% and fourth level and winter → sequential crash). This indicates that when long tunnel and grade ≤ 2% (road/tunnel design factors), fourth level (traffic factor), and winter (environmental factor) are combined, the proportion of sequential crashes is 2.188 times the average proportion.

Decision Tree Analysis.
To test the generalization ability of the proposed CART method and obtain the optimal tree, we adopt the 10-fold cross-validation approach and randomly split the samples into training (70%) and testing (30%) sets. The performance indices of the 10-fold cross-validation approach are obtained by averaging the results from 10 iterations. Following the study by de Ona et al. [46], overfitting of the tree model is controlled by requiring that all nodes contain at least 1% of the observations. Furthermore, we assess the performance indices of the cost-sensitive CART model under different misclassification costs. As shown in Table 4, the effects of different class-weighted ratios (c+/c−) on performance indices including AUC, accuracy, sensitivity, and specificity have been analyzed. Table 4 shows that the accuracy and specificity are high when cost-sensitive learning is not considered (c+/c− = 1). However, the sensitivity is zero, indicating that none of the sequential crashes have been correctly classified. When the class-weighted ratio is increased, a higher sensitivity can be achieved, although the specificity decreases. That is to say, the accuracy of classifying sequential crashes improves, while the accuracy of classifying non-sequential crashes declines. A well-performing model should balance sensitivity and specificity. The highest AUC value is 0.703 when the class-weighted ratio is 4, demonstrating a good separability between sequential crashes and non-sequential crashes. Thereby, the class-weighted ratio c+/c− = 4 is adopted to obtain the optimal tree and decision rules in our study.
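The reason accuracy can be high while sensitivity is zero follows directly from the class imbalance, as the short Python sketch below shows. It is only an illustration: the `performance` helper is a name introduced here, and the counts are the full-sample totals (156 sequential crashes out of 985) rather than the test-set counts behind Table 4, so the resulting values are not expected to match that table.

```python
def performance(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts
    (positive class = sequential crash)."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # share of sequential crashes correctly classified
    specificity = tn / (tn + fp)   # share of non-sequential crashes correctly classified
    return accuracy, sensitivity, specificity


n_total, n_seq = 985, 156          # full-sample totals from the text
n_nonseq = n_total - n_seq

# A classifier that labels every crash as non-sequential, which is what an
# unweighted tree tends towards when the positive class is rare.
acc, sens, spec = performance(tp=0, fn=n_seq, tn=n_nonseq, fp=0)
print(f"accuracy={acc:.3f}, sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Such a classifier is right for every non-sequential crash and wrong for every sequential crash, which is why accuracy alone is a poor criterion here and the class-weighted ratio is tuned against AUC, sensitivity, and specificity together.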
By employing the cost-sensitive CART model with class-weighted ratio c+/c− = 4, we obtain the tree structure for predicting sequential crashes. Figure 3 shows that time, season, congestion level, grade, weather, horizontal alignment, tunnel length, and bus are the main splitting variables in the tree. This implies that these variables play an important role in classifying sequential crashes and non-sequential crashes in freeway tunnels, which is generally consistent with the results obtained from association rule mining. The tree has eight splitters and nine terminal nodes. The proportion of sequential crashes in the terminal nodes varies from 4.5% to 77.8%.
The tree is first split by time, indicating that time is the best variable for classifying sequential crashes in freeway tunnels. CART directs the crashes that occurred in the afternoon to the left, forming node 1, and directs the crashes at all other times to the right, forming node 2. Crashes in the afternoon (12:00-17:59) are associated with a higher proportion of sequential crashes (22.5%) than those at other times (10.4%).
To the left, CART continues to split node 1 based on season, forming node 3, which contains the crashes that occurred in spring, summer, and autumn, and node 4, which contains the crashes that occurred in winter. Crashes in winter have a higher proportion of sequential crashes than those in the other three seasons. Further down the tree, node 3 is split by weather, forming terminal nodes 8 and 9. Node 9 has a higher proportion of sequential crashes (25.8%), indicating that crashes in rainy/snowy weather have a higher proportion of sequential crashes. Back at node 4, crashes in winter are further divided by horizontal alignment, forming node 6 and terminal node 13. Node 6 is further divided by bus, forming node 7 and terminal node 12. Terminal node 12 has a higher proportion of sequential crashes, indicating that collisions involving a bus are more likely to be associated with sequential crashes. Node 7 is split according to tunnel length. Crashes occurring in a long tunnel have a higher probability of being sequential crashes.

Turning to the right, node 2 is split into node 5 and terminal node 14 based on traffic congestion level. Sequential crashes are more likely to occur at the third or fourth level of traffic congestion. Further down the tree, node 5 is split by grade, forming terminal nodes 15 and 16. This shows that crashes on segments with a grade of less than 2% have a higher likelihood of being sequential crashes.
Based on the cost-sensitive tree in Figure 3, we identify five rules with the sequential crash as the consequent, which are configured from terminal nodes 9, 11, 12, 13, and 15. The support, confidence, and lift values of these five rules are also calculated, as shown in Table 5.
(i) Rule 101 (Node 9). It identifies crashes in the afternoon under rainy or snowy weather conditions when the season is spring, summer, or autumn, which are classified as sequential crashes by the cost-sensitive CART model. The probability of sequential crashes in these cases is 25.8%, and the lift is 1.633.
(ii) Rule 102 (Node 11). It identifies crashes in the afternoon in a long tunnel on a curved road when the season is winter and no bus is involved in the collision. The probability of sequential crashes in these cases is 24.5%.
(iii) Rule 103 (Node 12). It identifies bus-involved crashes in the afternoon on a curved road when the season is winter. This rule has the highest probability of sequential crashes, 77.8%.
(iv) Rule 104 (Node 13). It identifies crashes in the afternoon on a straight road when the season is winter. The probability of sequential crashes in these cases is 29.2%.
(v) Rule 105 (Node 15). It identifies crashes in the morning/night/evening (not afternoon) on a road with a grade of less than 2% at the third or fourth congestion level. The probability of sequential crashes is 21.4%.
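One way to see why nodes with probabilities well below 50% are still classified as sequential crashes is the minimum expected cost rule, sketched below in Python. This is an assumption about how the class-weighted ratio acts at a terminal node, not a description of the authors' software: if missing a sequential crash costs c+/c− times as much as a false alarm, a node is labelled sequential whenever its proportion p of sequential crashes satisfies (c+/c−)·p > 1 − p, i.e., p > 1/(1 + c+/c−), which is 0.20 for a ratio of 4. The function name `classify_node` is hypothetical; the probabilities are those reported for Rules 101-105.

```python
def classify_node(p_sequential, cost_ratio):
    """Label a terminal node by minimum expected misclassification cost.
    Missing a sequential crash costs `cost_ratio`; a false alarm costs 1."""
    cost_if_labelled_nonseq = cost_ratio * p_sequential   # sequential crashes would be missed
    cost_if_labelled_seq = 1.0 - p_sequential             # non-sequential crashes would be flagged
    if cost_if_labelled_seq < cost_if_labelled_nonseq:
        return "sequential"
    return "non-sequential"


# Proportion of sequential crashes in the five rule nodes, as reported in the text.
rule_nodes = {101: 0.258, 102: 0.245, 103: 0.778, 104: 0.292, 105: 0.214}

for ratio in (1, 4):
    labels = {rule: classify_node(p, ratio) for rule, p in rule_nodes.items()}
    print(f"ratio {ratio}: {labels}")
```

Under this assumed decision rule, all five nodes exceed the 0.20 threshold implied by a ratio of 4, whereas with equal costs only the node behind Rule 103 would be labelled as sequential.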
The relative importance of variables (VIM) can also be obtained; it is calculated from the Gini reduction index (formulas (9)-(11)). The most important variables for classifying sequential crashes are ranked as crash time, traffic congestion, season, horizontal alignment, longitudinal grade, tunnel length, weather, and bus. The VIM values for these eight variables are 1.00, 0.86, 0.66, 0.56, 0.34, 0.30, 0.24, and 0.16, respectively.
Comparing the results of the Apriori algorithm and the decision tree yields several meaningful findings. (1) Both methods identify some common factors associated with an increased likelihood of sequential crashes, including afternoon, winter, long tunnel, grade less than 2%, and traffic congestion. (2) Since the decision tree is built recursively in a top-down manner, all of the rules obtained involve the first splitter, which in this study is the time variable (afternoon versus other times). On the other hand, the Apriori algorithm uses an exhaustive global search and produces a richer set of rules than the decision tree. In this study, the Apriori algorithm identified 24 significant rules associated with sequential crashes, while only 5 significant rules are identified by the decision tree. (3) The decision tree produces an easily interpretable, well-organized knowledge structure and quantifies the relative importance of variables, in which respect it is superior to the Apriori algorithm. To sum up, combining these results provides deeper insight into the risk factors associated with sequential crashes in freeway tunnels.

Discussion
Results from both association rule mining and the decision tree show that tunnel length is an important predictor of sequential crashes. A previous study by Zhou et al. [5] found that a crash in a long tunnel was more likely to cause severe or fatal injury than one in a short tunnel. The present study adds to the literature the finding that the long tunnel (tunnel length > 1000 m) is associated with the overrepresentation of sequential crashes, especially in the afternoon and in the winter. The most likely explanation for this phenomenon, as suggested in previous research, is that when driving in a semi-enclosed monotonous environment for a long time, the driver easily experiences visual fatigue and anxiety, which lead to a decreased ability to perceive and avoid risk [6,13]. Moreover, once a collision occurs in a long tunnel, the difficulty of rescue and evacuation can further induce a sequential or secondary crash. The variable of grade less than 2% is associated with an increased likelihood of sequential crashes. This counterintuitive safety effect may result from the positive correlation between grade less than 2% and long tunnel (the correlation coefficient between the two variables is 0.235 and statistically significant with p < 0.01). These results imply that the long tunnel is the critical location for decreasing the likelihood of sequential crashes. To enhance driving safety in long tunnels, advanced speed limit and safe-distance-keeping technologies might be effective solutions. In addition, more abnormal event detectors should be installed in long tunnels to ensure that abnormal events are detected in time to avoid sequential crashes. The involvement of a bus is found to be associated with an increased likelihood of sequential crashes. This could be explained by the large size and heavy weight of a bus, which could result in a long braking distance due to inertia and thus easily cause sequential crashes. This result is complementary to previous findings that large passenger vehicles have a greater risk of causing severe crashes in tunnels [47]. These results imply that buses should be a focus when implementing tunnel safety policies, for example through driver safety education and advanced vehicle technology that helps maintain safe car-following [48].
The present study finds that sequential crashes tend to happen in a congested traffic environment. This is not surprising: more congested traffic is associated with smaller headways. In this case, once a crash happens, it is more difficult for the following driver to avoid the hazard ahead, which contributes to a higher risk of a sequential or secondary crash [5]. In addition, results from both association rule mining and the decision tree show a higher proportion of sequential crashes in the winter (from December to February) compared with other periods of the year and in the afternoon compared with other periods of the day. The main reason may be the increased traffic volume in the winter and in the afternoon. The monthly and hourly distributions of sequential crashes and traffic volume in the selected tunnels are shown in Figures 4 and 5. From Figure 4, we can see that the sequential crashes in January and February account for nearly three-quarters of the total sequential crashes. In China, the most important traditional festival, the Spring Festival, is usually in February and sometimes in January. This leads to a sharp increase in traffic volume in January and February and thus induces a significant increase in sequential crashes. Moreover, we believe that adverse weather conditions such as cold and ice in these specific months play a key role in increasing the likelihood of sequential crashes.
In addition, from Figure 5, we can see that most sequential crashes occur in the period of 6:00-17:59, and the highest number of sequential crashes happens around 12:00-17:59. These time periods are also related to a higher level of traffic volume. This agrees with previous findings that crashes occurring in the daytime period (06:00-19:00) are more likely to cause secondary collisions [49,50]. The aforementioned findings imply that traffic control policies should be strengthened at specific times, including traffic congestion periods, the afternoon, and the winter.
More importantly, results from association rule mining and decision tree analysis confirm that the combination of road/tunnel design, environmental, and traffic factors produces an even higher likelihood of sequential crashes, leading to a series of hazardous situations. As an example, the combination of long tunnel and grade ≤ 2% (road/tunnel design factors), fourth congestion level (traffic factor), and winter (environmental factor) is associated with a significantly higher likelihood of sequential crashes. More hazardous situations can be seen in Tables 3 and 5, which list the rules with high lift values. To effectively reduce the risk of sequential crashes in freeway tunnels, traffic safety management should pay more attention to monitoring these hazardous situations.

Conclusion
The issue of tunnel traffic safety is crucial, as crashes occurring in tunnels often lead to serious consequences. This study aims to investigate important road/tunnel design, traffic, and environmental factors contributing to the overrepresentation of sequential crashes in freeway tunnels. Using a comprehensive dataset for twelve freeway tunnels in China, the association rule mining and decision tree models are employed. We select these two models because they are capable of identifying complicated interactions among the independent variables associated with the dependent variable and expressing them as rules. Results show that tunnel length, traffic congestion, time of day, season, and vehicle type are the important factors influencing the likelihood of sequential crashes. A long tunnel is associated with higher probabilities of sequential crashes. This may be due to visual fatigue and increased anxiety among drivers who spend a long time in a semi-enclosed monotonous environment. Moreover, the difficulty of rescue and evacuation might increase the risk of sequential or secondary crashes. A high level of traffic congestion increases the likelihood of sequential crashes, which might be due to the smaller headways under congested conditions. Afternoon and winter are found to be associated with the overrepresentation of sequential crashes. Bus involvement is associated with a higher likelihood of a sequential crash. In addition, the interactive effects of multiple factors (roadway, traffic volume, and environmental factors) on the likelihood of sequential crashes are revealed.
It is noteworthy that drivers' individual-level factors, such as fatigue state and aggressive behavior prior to crash occurrence, are not included in this study; though these factors are deemed important, their effects cannot be examined. In addition, the analysis of this study is limited to crash cases only, and non-crash cases are not considered. A natural extension of this study is to incorporate police-reported data and other emerging data sources (such as high-resolution trajectory data) to achieve a more explicit understanding of the causal mechanism underlying sequential crashes in freeway tunnels.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.