Crash/Near-Crash Analysis of Naturalistic Driving Data Using Association Rule Mining

This study explores the associations between crash/near-crash (C/NC) events and roadway, driver-related, and environmental factors in naturalistic driving studies (NDS). We used the Naturalistic Engagement in Secondary Tasks (NEST) dataset, which is massive and detailed and contains 50 million miles of naturalistic driving data resulting from the Strategic Highway Research Program 2 (SHRP2). Association rule mining (ARM) is applied to extract the rules for frequently occurring events. The generated association rules are filtered by four metrics (support, confidence, lift, and conviction) and validated by the lift increase criterion. A three-step analysis is performed to obtain a comprehensive understanding of the rules of C/NC events. The 20 most frequent items are first selected to investigate their relationship with the C/NC events. Subsequently, the association rules are used to identify the factors contributing to C/NC events. Finally, correlations between contributing factors and different severities of crashes (I-most severe, II-police-reportable, III-minor crash, and IV-low-risk tire strike) are analyzed by ARM. The results demonstrate that C/NC events occur most frequently on straight and level road segments with no controlled intersections or traffic control devices when drivers are performing secondary tasks. Thus, the reasons for these crashes are carelessness and overconfidence. In addition, a median strip or barrier and a wider road can significantly reduce the frequency and severity of crash events. Moreover, gender, age, average annual mileage, and secondary tasks are highly correlated with the frequency and severity of C/NC events. Drivers with visual-spatial disabilities or crash records are more likely to be involved in the most severe crash events. Near-crash events occur more frequently at higher traffic density and on roads with traffic control devices and controlled intersections. These conditions may keep drivers alert, preventing crashes.


Introduction
The National Highway Traffic Safety Administration (NHTSA) data [1] show that approximately 38,680 people died in traffic crashes in the United States in 2020, representing an increase of almost 7.2% compared to the 36,096 fatalities reported in 2019 and the largest number of fatalities since 2007. The increase in traffic crashes has harmed many families, although most of the injuries and deaths could have been averted. Thus, it is essential to determine the correlations between the contributing factors and crash/near-crash (C/NC) events to minimize their occurrence. However, many factors contribute to C/NC events, with latent correlations hidden in the C/NC data. Thus, it is challenging to extract the correlations between the contributing factors and the causes of C/NC events to prevent them. Consequently, traffic safety has become an urgent and crucial topic in transportation research.
Data acquisition is a critical prerequisite for traffic safety studies. Many safety studies [2][3][4] have focused on extracting associations between C/NC events and roadway features using police report data due to easy accessibility. However, the lack of available factors, such as driving behavior and driver characteristics, has limited the comprehensiveness of these studies. Therefore, several experimental studies [5][6][7] have analyzed the impacts of different driving behaviors on C/NC events in a simulated environment. In experimental studies, dozens of drivers were recruited for experiments. For example, in secondary task engagement experiments, participants are asked to perform certain secondary tasks under specific C/NC conditions. Eye movement, heart rate, and vehicle kinetic data are simultaneously recorded during the experiments [8]. Although experimental studies can extract valuable information because of their ability to simulate C/NC conditions, they may not be able to mine the latent rules of C/NC events for two main reasons [9][10][11][12]: (1) The participants are equipped with eye-tracking glasses, galvanic skin resistance (GSR) electrodes, wearable sensors, optical probes, and photoplethysmography (PPG) sensors to obtain data from multiple sources. The participants may not feel comfortable in the simulated driving environment due to the equipment. Therefore, the applicability of the experiment's results is questionable. (2) Obtaining instructions from a computer screen rather than responding to traffic conditions is common in driving simulations. This situation does not accurately represent the real-world driving experience.
Many studies used observational data to ensure the transferability of the results to real-life conditions [13][14][15][16]. Observational studies or naturalistic driving studies (NDS) [17] provide realistic conditions to gather C/NC data for accident analysis and prevention. Multichannel video, sensor, kinematic, and vehicle network data can be obtained from vehicles equipped with a data acquisition system (DAS) in a naturalistic driving setting. The highly detailed and comprehensive dataset is suitable for traffic safety studies and many other research fields.
Detailed and comprehensive datasets have been obtained, representing a solid foundation for traffic safety analysis. Researchers used these datasets and different methods to analyze different aspects of traffic safety. Some researchers used statistical models to reveal the correlations between variables and the occurrence of C/NC events using NDS. For instance, Papazikou et al. [18] investigated vehicle kinematics during crashes to obtain reliable indicators of the time to collision (TTC). Kreusslein et al. [19] focused on the characteristics of mobile phone calls, including the call duration, glance behavior, call type, and mobile phone location, to determine the influence of making mobile phone calls. Schlick et al. [20] used hierarchical regression models to determine the associations between motor vehicle crashes and different contributing factors.
Driving behavior analysis and machine learning methods have been used to identify the cause of C/NC events. Zou et al. [21] predicted vehicle acceleration using behavioral semantic analysis to prevent accidents caused by rapid acceleration. Guo et al. [22] utilized SHapley Additive exPlanation (SHAP) to analyze the importance of features related to crash events; sharp deceleration was the most important feature.
Association rule mining (ARM) has been proposed for crash analysis [23,24]. ARM is widely used in the traffic safety field because it can reveal the intrinsic relationships between the contributing factors and the accidents without assumptions and significantly outperforms traditional modelling techniques. A summary of the applications of ARM for crash analysis is presented in Table 1.
Several studies [33][34][35][36][37] used ARM for crash analysis under different conditions, such as truck crashes or near crashes. Unlike these studies, we propose a three-step method using the frequent pattern (FP) growth algorithm [38] to mine the correlations between different categorical variables and C/NC events using the Naturalistic Engagement in Secondary Tasks (NEST) dataset [39]. The 20 most frequent items are first selected to determine which features are associated with C/NC events. The association rules describing the factors contributing to C/NC events are then identified. Finally, association rules are used to analyze crash events of different severities. Suggestions for practical applications are provided. The flowchart of the proposed approach is illustrated in Figure 1.
The remainder of this paper is organized as follows. Section 2.1 presents the dataset and preprocessing steps. The methodology is described in Section 2.2, focusing on the principles of the FP growth algorithm and the formulations of four metrics: support, confidence, lift, and conviction. The results are presented and discussed in Section 3, the findings and discussions are drawn in Section 4, and conclusions are summarized in Section 5.

Data Description
2.1.1. Dataset Overview. We used C/NC data from the NEST dataset [39], which is a subset of the Strategic Highway Research Program 2 (SHRP2) database produced under the collaboration between the Virginia Tech Transportation Institute (VTTI) and the Toyota Collaborative Safety Research Center (Toyota CSRC). This dataset contains high-level data and detailed time-series data on secondary task engagement and distraction-related safety-critical events (SCEs) during real-world driving. The summary data provide information at the event level, and the time-series data provide frame-by-frame detailed information at the millisecond level. We only used the summary data in this study.
The summary data contain information on the event severity of baseline, crash, and near-crash events, with a total of 1080 samples. We did not consider the baseline data because they contain no C/NC events. The duration of the C/NC events was 30 s, including 20 s prior to the event and 10 s following it. The summary data comprised 36 items. The subtasks and environmental conditions were split into three fractions for each 10 s duration, while the driver information and other information were not. After deleting samples with too many missing values, we obtained 699 C/NC event samples.
2.1.2. Variables. The raw summary data of the C/NC events contains 36 categorical variables. Twenty of them were chosen to analyze the patterns of the C/NC events. The remaining 16 variables were not chosen for the following three reasons: (1) a large percentage of missing values, (2) heavily skewed distribution, and (3) overlap in meaning. For example, the stop sign, merge sign, yield sign, slow or other warning signs, and railroad crossing sign variables are included in the raw summary data. However, most of the values are blank because these signs do not occur frequently; thus, the distribution is skewed. In addition, the traffic control variable represents these signs at a higher level. Therefore, these variables were deleted, and only the traffic control variable was used. Note that crucial variables were retained even if they had a skewed distribution or an overlap in meaning.
Some of the chosen variables required aggregation because they contained many attributes, skewing the distribution. Therefore, the attributes of these variables were aggregated into higher-level categories, such as secondary task, traffic density, locality, age group, and annual miles. For example, different secondary tasks (including no secondary task) were aggregated into secondary tasks (yes) and no secondary tasks (no). This approach was different from a previous study [40] because all C/NC events were analyzed comprehensively in this paper rather than focusing on one aspect. More details on the variables are presented in Table 2.
2.1.3. Distribution of Attributes. The distribution of attributes is important for selecting hyperparameters, such as the support value, and influences the association rules generated by ARM. For example, an attribute that occurs infrequently might not be considered when the support value is high; it would then be filtered out by ARM and excluded from the association rules, resulting in errors in evaluating the attribute's contribution to C/NC events. Figure 2 describes the distribution of attributes for the crash and near-crash events. There were 447 crash events and 252 near-crash events. Figure 2 shows that (1) most proportions are greater than 0.05, indicating that 0.05 might be a suitable initial support value; (2) some attributes are associated with a higher proportion of crash events than near-crash events, such as no lanes, lane number ≤ 2, improper driver behavior, and teenager driving. This implies a correlation between the severity of events and these attributes.
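The aggregation of attributes and the check against an initial support value can be sketched as follows. This is a minimal illustration rather than the preprocessing code used in the study: the raw labels, the helper names (`aggregate_secondary_task`, `item_shares`), and the toy events are assumptions; only the yes/no aggregation and the value 0.05 come from the text.

```python
from collections import Counter

MIN_SUPPORT = 0.05  # initial support value suggested by Figure 2


def aggregate_secondary_task(raw_label):
    """Collapse a detailed secondary-task label into yes/no (labels are hypothetical)."""
    return "no" if raw_label == "no secondary task" else "yes"


def item_shares(events):
    """Return the share of events that contain each 'variable=attribute' item."""
    counts = Counter(item for event in events for item in set(event))
    return {item: count / len(events) for item, count in counts.items()}


# Toy events with made-up values, for illustration only.
raw_events = [
    {"secondary task": "cell phone, texting", "travel lanes": "lanes <= 2"},
    {"secondary task": "eating", "travel lanes": "2 < lanes <= 7"},
    {"secondary task": "no secondary task", "travel lanes": "lanes <= 2"},
    {"secondary task": "adjusting radio", "travel lanes": "lanes <= 2"},
]

events = []
for raw in raw_events:
    coded = dict(raw)
    coded["secondary task"] = aggregate_secondary_task(raw["secondary task"])
    # Each (variable, attribute) pair becomes one item of the transaction.
    events.append([f"{var}={val}" for var, val in coded.items()])

shares = item_shares(events)
# Items below the support threshold could never appear in a mined rule.
rare_items = sorted(item for item, share in shares.items() if share < MIN_SUPPORT)
print(shares)
print(rare_items)
```

Listing the items that fall below the threshold before mining makes it explicit which attributes the rules cannot say anything about.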

Methodology.
Recent studies used various techniques to conduct pattern mining using large amounts of crash data, such as ARM [36], Bayesian networks [41], neural networks [42], linear regression networks [43], cluster analysis [44], random forests [45], and support vector machines [46]. ARM has the advantage of finding meaningful associations and providing valuable insights into the interdependence between roadway, environmental, and driver-related factors and the frequency and severity of crashes [29]. Besides, ARM is more suitable for discovering patterns in large data volumes than confirming hypotheses [36] and is not influenced by missing values. Thus, it is preferable to machine learning and linear regression methods. Therefore, ARM was chosen to analyze C/NC data. The Apriori algorithm [23] is considered the most popular and efficient ARM method compared to the weighted classification based on association rule (WCBA) method [47], fast classification based on association rule (FCBA) method [48], and the maximal frequent itemset algorithm (MAFIA) [49]. However, it repeatedly scans the entire dataset for frequent items, resulting in high computational complexity, especially for a large dataset. The FP growth algorithm [50] is an improvement of the Apriori algorithm that requires only two scans of the database to develop the FP tree. Thus, it can identify frequent items in a large database with a low execution time. Due to the advantages of the FP growth algorithm, it is used here to extract frequent items.
Figure 1: Flowchart of the proposed approach.

Journal of Advanced Transportation
In this study, the association rules are mined in two steps: (1) the FP growth algorithm is used to detect frequent item sets and (2) association rules are mined from the frequent item sets.
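The two steps can be illustrated with a compact sketch. It is not the implementation used in this study: for brevity, it grows patterns over projected transaction lists instead of building an explicit FP tree, which returns the same frequent item sets but without the tree's compression. The function names and the toy transactions are assumptions made for illustration.

```python
from collections import Counter


def frequent_itemsets(transactions, min_support):
    """Step 1: return {frozenset(items): support} for all frequent item sets."""
    n = len(transactions)
    found = {}

    def grow(prefix, projected):
        # Count every item in the transactions that already contain `prefix`.
        counts = Counter(item for items in projected for item in items)
        for item in sorted(counts):
            if counts[item] / n < min_support:
                continue
            itemset = prefix + (item,)
            found[frozenset(itemset)] = counts[item] / n
            # Keep only transactions with `item`, and only the items after it,
            # so that each item set is generated exactly once.
            suffixes = [
                tuple(i for i in items if i > item)
                for items in projected
                if item in items
            ]
            grow(itemset, [s for s in suffixes if s])

    grow((), [tuple(sorted(set(t))) for t in transactions])
    return found


def candidate_rules(itemsets, consequent):
    """Step 2: split each frequent item set containing `consequent` into X -> Y."""
    rules = []
    for itemset, support in itemsets.items():
        if consequent in itemset and len(itemset) > 1:
            rules.append((itemset - {consequent}, consequent, support))
    return rules


# Toy transactions with made-up items, for illustration only.
transactions = [
    ["event=crash", "lanes<=2", "secondary task=yes"],
    ["event=crash", "lanes<=2"],
    ["event=crash", "secondary task=yes"],
    ["event=near-crash", "lanes<=2", "secondary task=yes"],
    ["event=crash", "lanes<=2", "secondary task=yes"],
]

itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = candidate_rules(itemsets, consequent="event=crash")
for antecedent, consequent, support in rules:
    print(sorted(antecedent), "->", consequent, f"support={support:.2f}")
```

Fixing the consequent to the event type mirrors the single-consequent rules discussed in this paper; a general miner would enumerate every split of each item set.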
It is assumed that $I = \{i_1, i_2, \ldots, i_m\}$ is a collection of categorical variables (item sets), and $T = \{t_1, t_2, \ldots, t_n\}$ is a collection of C/NC events (transactions), where $m$ is the number of item sets, which is much greater than $n$, the number of transactions. All association rules are generated based on $I$ and $T$. However, not all the association rules are needed. For example, {traffic flow = no lanes} ⟶ {traffic density = free flow} may be an association rule with a high support value, but it may not provide any new or meaningful information because a road with no lanes implies a low-grade road unsuitable for high traffic density. Thus, these types of rules should be discarded. $X$ is defined as the antecedent (e.g., {traffic flow = no lanes}), and $Y$ is defined as the consequent (e.g., {event = near-crash event}). The antecedent and consequent are used to discard meaningless association rules. However, a rule does not indicate that $X$ is the cause of $Y$, that $Y$ is the result of $X$, or that $X$ and $Y$ have a causal relationship.
Four performance metrics are typically used to test the model performance and validity: support, confidence, lift, and conviction. The support indicates how frequently the item set appears in the dataset; it is the ratio of the number of transactions containing the item set to the total number of transactions. The confidence is the percentage of all transactions satisfying $X$ that also satisfy $Y$; it is the ratio of the number of transactions including both $X$ and $Y$ to the number of transactions including $X$. The lift compares the frequency with which $X$ and $Y$ occur together in a transaction with the frequency expected from the individual frequencies of $X$ and $Y$; it therefore reflects the correlation between $X$ and $Y$ in the association rules. When the lift value is greater than 1, the higher the value, the stronger the positive correlation between $X$ and $Y$ is. When the lift value is less than 1, the lower the value, the stronger the negative correlation between $X$ and $Y$ is. When the lift value is equal to 1, there is no correlation between $X$ and $Y$. A rule with a single antecedent and a single consequent is referred to as a 2-item rule. Similarly, a rule with $k-1$ antecedents and a single consequent is denoted as a $k$-item rule, where $k$ is the sum of the number of antecedents and the number of consequents. The support, confidence, lift, and conviction are computed as follows:
$$\mathrm{support}(X \rightarrow Y) = P(X \cap Y),$$
$$\mathrm{confidence}(X \rightarrow Y) = \frac{P(X \cap Y)}{P(X)},$$
$$\mathrm{lift}(X \rightarrow Y) = \frac{P(X \cap Y)}{P(X)\,P(Y)},$$
$$\mathrm{conviction}(X \rightarrow Y) = \frac{1 - P(Y)}{1 - \mathrm{confidence}(X \rightarrow Y)},$$
where $X$ is the antecedent, $Y$ is the consequent, $P(X)$ is the percentage or probability of a transaction containing item $X$, $P(X \cap Y)$ is the probability of a transaction containing both $X$ and $Y$, $\mathrm{support}(X \rightarrow Y)$ is the support value of the association rule $X \rightarrow Y$, $\mathrm{confidence}(X \rightarrow Y)$ is the confidence value of the association rule $X \rightarrow Y$, $\mathrm{lift}(X \rightarrow Y)$ is the lift value of the association rule $X \rightarrow Y$, and $\mathrm{conviction}(X \rightarrow Y)$ is the conviction value of the association rule $X \rightarrow Y$.
The "mlxtend" package in Python 3.7 is used to implement the FP growth algorithm for frequent items and mine the association rules with a minimum support value of 0.05 and a minimum confidence value of 0.05 as hyperparameters.
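As a worked illustration of the four metrics, the sketch below evaluates one rule on a handful of toy transactions using the standard definitions of support, confidence, lift, and conviction. The helper name `rule_metrics` and the data are assumptions made for this example; the sketch does not reproduce the mlxtend code used in the study.

```python
import math


def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence, lift, and conviction for the rule X -> Y."""
    x, y = set(antecedent), set(consequent)
    n = len(transactions)
    p_x = sum(x <= set(t) for t in transactions) / n         # P(X)
    p_y = sum(y <= set(t) for t in transactions) / n         # P(Y)
    p_xy = sum((x | y) <= set(t) for t in transactions) / n  # P(X and Y)

    confidence = p_xy / p_x
    lift = confidence / p_y
    # Conviction is infinite when the rule never fails (confidence = 1).
    conviction = math.inf if confidence == 1 else (1 - p_y) / (1 - confidence)
    return {
        "support": p_xy,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


# Toy transactions with made-up values, for illustration only.
transactions = [
    ["no lanes", "crash"],
    ["no lanes", "crash"],
    ["no lanes", "near-crash"],
    ["divided", "crash"],
    ["divided", "near-crash"],
]

metrics = rule_metrics(transactions, antecedent=["no lanes"], consequent=["crash"])
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```

In this toy case, the lift slightly above 1 and the conviction slightly above 1 both indicate a weak positive association between the antecedent and the consequent.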

Results
3.1. Frequency Analysis. The 20 most frequent items were selected to determine which features the C/NC events are associated with. As shown in Figure 3, the most frequent item is no driver impairment, and the second most frequent item is secondary tasks, indicating that most drivers are driving normally, and secondary tasks are highly associated with crash events. In addition, the most frequent items related to the road are a straight road, level road, and no controlled intersections. It can also be deduced from Figure 3 that the C/NC events are highly associated with driving normally and are associated with performing secondary tasks on straight and level road segments with no controlled intersections. These conditions are common in real life and have the highest probability of crashes.
Figures 4(a) and 4(b) show the frequency plots for crash events and near-crash events, respectively. Several differences are observed in these two plots: (1) The secondary task is the most frequent item for crash events, with a frequency of 94.85%, whereas this item ranks fourth for near-crash events, with a frequency of 90.47%, indicating that secondary tasks are more frequently associated with crash events. (2) The number of travel lanes less than or equal to 2 ranks eighth for crash events (frequency of 62.64%), and the number of travel lanes between 2 and 7 ranks seventh for near-crash events (frequency of 71.43%), indicating that the probability of a crash is higher on roads with fewer lanes. (3) Free flow ranks 12th for crash events, with a frequency of 66.67%. This result suggests that free traffic flow may make drivers overconfident, contributing to crashes. (4) Improper behavior ranks 13th for crash events and is not correlated with near-crash events. Thus, improper behavior occurs more frequently in crash events. (5) An annual mileage of less than 10,000 miles is associated with crash events, and an annual mileage greater than 15,000 miles is more frequently associated with near-crash events, indicating that drivers with more driving experience are less likely to be involved in crashes.
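The ranking behind such frequency plots can be sketched as follows. The helper `top_items` and the toy events are assumptions made for illustration; the sketch only shows how the relative frequency of each item is obtained separately for crash and near-crash events.

```python
from collections import Counter


def top_items(events, k=20):
    """Return the k most frequent items with their relative frequency in percent."""
    counts = Counter()
    for event in events:
        counts.update(set(event))  # count each item at most once per event
    return [(item, 100 * count / len(events)) for item, count in counts.most_common(k)]


# Toy events with made-up items, for illustration only.
crash_events = [
    ["secondary task=yes", "lanes<=2", "free flow"],
    ["secondary task=yes", "lanes<=2"],
    ["secondary task=yes", "free flow"],
    ["secondary task=yes", "lanes<=2"],
]
near_crash_events = [
    ["secondary task=yes", "2<lanes<=7"],
    ["2<lanes<=7", "stable flow"],
]

crash_top = top_items(crash_events, k=2)
near_crash_top = top_items(near_crash_events, k=1)
print(crash_top)
print(near_crash_top)
```

Computing the ranking per event type, rather than on the pooled data, is what allows the ranks of the same item to be compared between the two plots.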

Model Performance and Descriptive Statistics of the Parameters
We created two-key plots [30] to visualize the patterns extracted from the association rules of the C/NC events. The FP growth algorithm generated 142,794 rules for crash events and 18,759 rules for near-crash events, with a minimum support value of 0.05 and a minimum confidence value of 0.05. Because there are numerous association rules, we randomly selected some to show the pattern. We merged the 3-item rules and 4-item rules as well as the 5-item rules and 6-item rules. In Figure 5, the range of support values for the 2-item rules is 0.05 to 0.6, and the confidence values of these rules exceed 0.4. For the 3-4-item rules, the range of support values is 0.05 to 0.5, and the confidence values also exceed 0.4. The 5-6-item rules have a similar trend, but the maximum support value is less than 0.25. Figure 6 shows the two-key plots for the rules of the near-crash events. The range of the support values is 20% smaller, and the confidence value range for the majority of rules of the near-crash events is 80% lower than in Figure 5.
3.3. Obtaining the Patterns from the Association Rules of the C/NC Events
3.3.1. Crash Event Patterns. Table 3 presents the 25 top rules selected from 142,794 rules according to the lift value (from high to low) for crash events. The 6-item rule {traffic flow = not divided + travel lanes = lanes ≤ 2 + NUMVIOL = 0 + gender = F + driver behavior = improper behavior} is used as an example. A female driver on an undivided road with two or fewer lanes is more likely to be involved in a crash when performing improper behavior, such as aggressive driving, even if she has no violations. The corresponding metrics are support = 0.053, confidence = 1, lift = 1.564, and conviction = inf. This can be interpreted as follows: the support value indicates that only 5.3% of the events contain these five items together with the crash outcome. The confidence value indicates that every event containing the five items is a crash event. The lift value shows that events containing these five items are crash events 1.564 times as often as events in the dataset overall. The conviction indicates the relationship between antecedents and consequents; the higher the conviction, the stronger the relationship is. Here, the conviction is infinite because the confidence equals 1.
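The reported lift of the example crash rule can be checked against the event counts given in Section 2.1.3. The sketch below assumes, which the text does not state explicitly, that the rule was mined from the pooled 699 C/NC events with the crash outcome as the consequent; the variable names are chosen for illustration.

```python
import math

N_CRASH, N_NEAR_CRASH = 447, 252               # event counts from Section 2.1.3
P_CRASH = N_CRASH / (N_CRASH + N_NEAR_CRASH)   # share of crash events (assumed P(Y))

confidence = 1.0                               # reported for the example rule
lift = confidence / P_CRASH                    # lift = confidence / P(Y)
conviction = math.inf if confidence == 1 else (1 - P_CRASH) / (1 - confidence)

print(round(lift, 3))
print(conviction)
```

Under this assumption, the lift of any rule with a confidence of 1 and a crash consequent equals 699/447, so the rules at the top of a lift ranking are distinguished mainly by their antecedents rather than by their lift values.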
The likely reasons for these results are as follows. Undivided roads or roads with two or fewer lanes are typically low-grade roads. Young drivers have less driving experience and are more likely to underestimate the danger of driving on these road segments, especially when there are no vehicles, traffic control, or intersections to interrupt driving. Under these conditions, drivers can be involved in crashes when they suffer from fatigue or perform secondary tasks or improper behavior.
Table 4 presents the 25 top rules selected from 18,759 rules according to the lift value (from high to low) for near-crash events. The first 6-item rule {age group = 20-24 + locality = business/industrial + traffic density = stable flow + travel lanes = 2 < lanes ≤ 7 + secondary task = secondary task observed} is used as an example. When a driver is affected by the interactions with others in traffic, the driver's speed is influenced. In addition, maneuvering in stable flow requires substantial vigilance by the driver, and the general comfort level declines. A young driver on a wide road in a business/industrial area is more likely to be involved in a near-crash event when performing secondary tasks. The corresponding metrics are support = 0.05, confidence = 0.946, lift = 2.264, and conviction = 11.83. This can be interpreted as follows: the support value indicates that only 5% of the events contain these five items together with the near-crash outcome. The confidence value shows that an event containing the five items has a 94.6% probability of being a near-crash event. The lift value demonstrates that events containing these five items are near-crash events 2.264 times as often as events in the dataset overall. The consequent depends significantly on the antecedent because the conviction value is higher (11.83) than the others.
The likely reasons for these results are as follows. Divided roads and roads with more lanes have fewer crashes. However, the high traffic density limits the drivers' freedom to maneuver, making them irritable in a stable or unstable/forced traffic flow. The drivers are inclined to overtake and accelerate frequently under these conditions and underestimate the danger, especially older drivers with higher confidence in their driving experience. If they perform secondary tasks and their attention is distracted, near-crash events are likely to occur.
The crash and near-crash patterns differ as follows: (1) road: divided roads, roads with no lanes, and the number of lanes are the main differences between the C/NC patterns. Crash events are less likely to occur on divided roads with more than 2 lanes. (2) Driver: the age group and annual miles are two significant factors in C/NC events. Drivers associated with crash events are predominantly young drivers aged 16-24 with relatively little driving experience, whereas drivers involved in near-crash events are more likely older people with more driving experience. In addition, drivers are more likely to be associated with crash events when performing improper behaviors, such as aggressive driving and drunk driving, whereas secondary tasks are more influential in near-crash events. (3) Environment: crash events are more likely in free flow, when the comfort level of drivers is high, in areas without traffic control or controlled intersections, and in residential or business/industrial areas. Near-crash events are more common in stable traffic flow or unstable/forced flow in business/industrial areas. The likely reason is that high traffic density keeps drivers alert, preventing crashes. Near-crash events occur due to a combination of factors (i.e., traffic density levels, secondary tasks, and improper driving behavior).
Although near-crash events do not result in economic loss or casualties, some risk factors can turn near-crash events into crash events. Thus, it is necessary to discuss the relationship between crash and near-crash events and determine which conditions change near-crash events to crash events: (1) road: crash events are more likely to occur on narrow roads, whereas near-crash events are more likely to occur on wide roads. Thus, we assume near-crash events may change into crash events because of changes in the road features from urban to rural area roads or from main roads to bypasses. (2) Driver: older drivers are more likely to be involved in near-crash events rather than crash events; however, if they perform improper driving behavior, a near-crash event may become a crash event. (3) Environment: Bernat et al. [51] found that night-time single vehicle crashes (SVCs) were strongly related to drunk driving, and improper driving behavior was more likely when there were no vehicles nearby. Thus, improper driving behavior might increase the probability of turning near-crash events into crash events in free flow.

Table 6: Key findings.

Our study
Road: (1) Wider roads and median strips can reduce the frequency and severity of crashes. (2) Level roadways and straight alignments are related to C/NC events only in combination with other factors.
Driver: (1) Females are likely to be linked with less severe crashes, while males are likely to be linked with severe crashes and near crashes. (2) Young age and fewer annual miles are more linked to crashes. However, age does not show a strong correlation with the severity of C/NC events. (3) Improper behavior and secondary tasks are correlated with crashes. (4) Crash records and minor visual-spatial disabilities are associated with the most severe events, while age, driver impairments, and improper behaviors do not strongly correlate with the severity of crashes.
Environment: (1) In free flow, crashes are more likely to occur. (2) Traffic control and intersections are associated with C/NC events; this is more common in residential or business/industrial areas.

Kong et al. [31]
Road: (1) Small radius curves are linked with run-off-the-road (ROR) crashes. (2)

Kong et al. [30]
Road: (1) Interstate highways or divided highways are highly associated with near crashes because of overconfidence and secondary tasks.
Driver: (1) When drivers perform secondary tasks, the main cause of near-crash events is the leading vehicle suddenly slowing or stopping. When not performing secondary tasks, lane-changing behavior is the main cause. (2) When drivers perform secondary tasks, the most common evasive maneuver for avoiding the near crash is braking only. When not performing secondary tasks, the evasive maneuver is either steering or braking and steering.
Environment: (1) Drivers are more concentrated in bad environmental conditions.

Yu et al. [25]
Road: (1) Crashes mostly occurred in urban areas with no physical separation. (2) Crashes are more likely to occur on straight roads.
Driver: (1) Male drivers are more prone to be associated with property damage than female drivers. (2) Drivers aged 16-25 are most likely to be involved in crashes. (3) Male drivers are more prone to fail to keep the vehicle under control.
Environment: (1) Crashes are more likely to occur at an intersection.

Hong et al. [27]
Road: (1) Single-vehicle crashes are more likely induced by straight alignment.
Driver: (1) Male and older drivers are highly linked to crashes involving hazardous material vehicles.
Environment: (1) Dark conditions and poor visibility are two main contributing factors.

Patterns of Four Types of Crash Events
The association rules between different categorical variables and the severity of crash events are analyzed, and crash events are categorized into severity levels: I-most severe, II-police-reportable, III-minor crash, and IV-low-risk tire strike. Note that the definition of the four severity levels of crash events is derived from the NEST [39] dataset. Forty association rules are considered according to the lift value (Table 5).
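Selecting rules by lift for each severity level can be sketched as follows. Whether the forty rules were chosen per level or from a single global ranking is not stated, so the per-level grouping is an assumption; the helper `top_rules_per_level` and the rule values are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def top_rules_per_level(rules, per_level):
    """Group rules by severity level (consequent) and keep the highest-lift ones."""
    grouped = defaultdict(list)
    for rule in rules:
        grouped[rule["consequent"]].append(rule)
    return {
        level: sorted(group, key=lambda r: r["lift"], reverse=True)[:per_level]
        for level, group in grouped.items()
    }


# Hypothetical rules with made-up lift values, for illustration only.
rules = [
    {"antecedent": ("traffic flow = not divided",), "consequent": "IV", "lift": 1.8},
    {"antecedent": ("locality = business/industrial",), "consequent": "IV", "lift": 1.3},
    {"antecedent": ("gender = M",), "consequent": "I", "lift": 2.1},
    {"antecedent": ("traffic density = free flow",), "consequent": "II", "lift": 1.5},
    {"antecedent": ("travel lanes = lanes <= 2",), "consequent": "II", "lift": 1.7},
]

selected = top_rules_per_level(rules, per_level=1)
for level in sorted(selected):
    for rule in selected[level]:
        print(level, rule["antecedent"], rule["lift"])
```

Ranking within each level prevents the levels with many high-lift rules from crowding out the rarer severity levels.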
Undivided roadways (rules 15, 23, 24, 31, 32, 39, and 40) are strongly associated with IV-low-risk tire strike events. However, this does not indicate that a low-risk tire strike causes severe crash events. Straight roads (rules 14, 21, 29, 30, 31, 33, 35, 36, 37, 38, and 40) rarely appear in 2-item, 3-item, or 4-item rules but appear more commonly in 5-item and 6-item rules. It is assumed that crashes rarely occur on straight road segments. However, crash events are more likely when a straight road is combined with other antecedents. Similar to the straight road segment, level road segments (rules 17, 21, 22, 23, 24, 25, 26, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, and 40) combined with other factors have an increased likelihood of crash events. Police-reportable events (II) are more likely on roads with two or fewer lanes (rules 12, 19, 27, and 28). Minor crash events (III) (rule 13) are more likely on roads with more than two lanes, indicating that widening the roadway can reduce the frequency and severity of crash events.
Male drivers (rules 2, 10, 11, and 20) are more likely to be associated with I-most severe events and II-police-reportable events. Drivers with one crash record during the past five years (rules 1 and 9) are more likely to be associated with I-most severe events. The age group (rule 4) does not show a strong correlation with the crash severity. Drivers with annual miles greater than 15,000 miles (rules 5, 14, 21, 22, 29, 30, 37, and 38) have a low correlation with severe crash events, indicating that drivers with more driving experience drive more safely. Minor visual-spatial disabilities do not show a strong correlation with crash events. However, they are strongly associated with I-most severe events. We speculate that minor visual-spatial disabilities do not affect driving significantly. However, when a crash event is about to occur, drivers with visual-spatial disabilities (rules 10, 17, 18, 25, 26, 33, and 34) may have more difficulty responding. Thus, the crash events are typically more severe. Driver impairments (rules 9, 18, 20, 25, 26, 27, 32, 33, 34, 35, 36, 38, 39, and 40) and improper behavior (rules 26 and 33) are not strongly correlated with the severity of crash events, whereas performing secondary tasks (rules 28, 30, 35, 37, and 38) is associated with more frequent II-police-reportable crash events and III-minor crash events.
Driving in residential areas and other areas (rules 3 and 6) is more likely associated with level II or III crash events. However, driving in business/industrial areas (rules 15, 23, 31, and 40) is more likely associated with IV-low-risk tire strike crash events. I-most severe events (rules 17, 18, and 25) and IV-low-risk tire strike events (rules 8 and 16) are more likely when the traffic flow is stable. II-police-reportable crash events are more likely in free flow (rules 12, 19, 27, and 28). Interruptions due to traffic control (rules 13, 24, 32, 34, 36, and 39) or controlled intersections (rules 19, 22, 27, 28, 29, 34, 36, 37, and 39) do not affect the severity of crash events.

Findings and Discussion
The key findings are summarized as follows: (1) Road (a) Undivided roadways are more likely to be associated with crash events, especially IV-low-risk tire strike events. In contrast, divided roadways are more likely to be associated with near-crash events. It is assumed that a median strip or barrier could prevent crashes.
The key findings of a comparison of our results and three similar studies are summarized in Table 6.
We analyzed the associations between various factors and C/NC events and crash severity. The following was observed: (1) Road: Kong et al. [30] found associations between near-crash events and roads with median strips. Yu et al. [25] observed that most crashes occurred in urban areas on undivided roads. We also found that a median strip reduced the frequency and severity of crash events. Yu et al. [25] reported that crashes were more likely on straight road sections, similar to our study. However, we found that crashes were associated with straight road sections in combination with other factors. (2) Driver: similar to most other studies, we found that gender, age, improper driving behavior, and secondary tasks were correlated with C/NC events. In contrast to other studies, we observed that only severe crashes were correlated with minor visual-spatial disabilities. Thus, we speculate that minor visual-spatial disabilities do not affect routine driving. However, in a serious crash, drivers with visual-spatial disabilities may be more likely to lose control. (3) Environment: Kong et al. [30] found that drivers had shorter reaction times in inclement weather, and clear weather was associated with killed or seriously injured (KSI) crashes. Similarly, we observed that crash events were more likely to occur on road sections without traffic control and intersections in residential or business/industrial areas, suggesting that accidents often occur under the most common road conditions.

Conclusions
This study investigated the correlations between C/NC events and driver, road, and environment-related categorical variables, such as secondary tasks, road conditions, and traffic density. We used the FP-growth ARM algorithm to obtain new insights into C/NC events. The patterns of C/NC events were analyzed to determine which variables were associated with C/NC events. This paper provides two major contributions. First, we used a large dataset containing categorical variables collected from naturalistic driving studies, including driver, vehicle, and environment-related data. Therefore, it is believed that our results are robust and unbiased. Second, a framework was developed to mine the association rules of the C/NC events and crash events with different severities. In many cases, multiple variables were associated with C/NC events. We used the support, confidence, lift, and conviction metrics to measure the strength of association between the rules and outcomes.
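As an illustration of the four metrics named above, the following minimal sketch computes support, confidence, lift, and conviction for a single rule using their standard textbook definitions. It is not the FP-growth mining step and not the code used in this study. The toy events, the item labels, and the function names `support` and `rule_metrics` are hypothetical and chosen only for the example.

```python
from fractions import Fraction

# Hypothetical toy events; each event is a set of "variable=value" items.
# These are NOT taken from the NEST dataset.
events = [
    {"lanes<=2", "locality=business", "outcome=crash"},
    {"lanes<=2", "locality=business", "outcome=crash"},
    {"lanes<=2", "locality=residential", "outcome=near-crash"},
    {"lanes>2", "locality=business", "outcome=near-crash"},
    {"lanes>2", "locality=business", "outcome=near-crash"},
    {"lanes<=2", "locality=business", "outcome=near-crash"},
    {"lanes>2", "locality=residential", "outcome=crash"},
    {"lanes>2", "locality=residential", "outcome=near-crash"},
]


def support(itemset, events):
    """Fraction of events that contain every item in `itemset`."""
    hits = sum(1 for event in events if itemset <= event)
    return Fraction(hits, len(events))


def rule_metrics(antecedent, consequent, events):
    """Standard metrics for the rule antecedent -> consequent."""
    s_a = support(antecedent, events)
    s_c = support(consequent, events)
    s_ac = support(antecedent | consequent, events)
    if s_a == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    confidence = s_ac / s_a      # P(consequent | antecedent)
    lift = confidence / s_c      # > 1 means a positive association
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - s_c) / (1 - confidence)
    return {
        "support": s_ac,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


metrics = rule_metrics(
    {"lanes<=2", "locality=business"}, {"outcome=crash"}, events
)
for name, value in metrics.items():
    print(f"{name}: {float(value):.3f}")
```

In this toy example, the lift and conviction are both greater than 1, which is the pattern a rule would need to show before being considered a positive association; the thresholds actually applied in the study are those described in the methodology.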
Interesting correlations were observed between the categorical variables and C/NC events, and differences were revealed between crash and near-crash events. The top 5-item rules for crash events {NUMVIOL = 0 + travel lanes = lanes ≤ 2 + driver behavior = improper behavior + locality = business/industrial} and near-crash events {travel lanes = 2 < lanes ≤ 7 + traffic density = stable flow + age group = 20 − 24 + locality = business/industrial} are used as examples. In these two association rules, travel lanes and locality were significantly correlated with the occurrence of C/NC events. However, the correlation strength differed for different categorical variables. Drivers with an aggressive driving style were more likely to be involved in a crash when driving on roads with two or fewer lanes in a business/industrial area. Drivers driving in a business/industrial area on roads with more than two lanes in stable traffic were more likely to be involved in near-crash events.
This study is expected to provide useful information for future research on C/NC events using ARM methods and suggestions for traffic engineers to improve road safety and prevent accidents. However, this study has three limitations. First, we did not include all rules in the analysis due to the large number of generated rules. Second, although we included a large range of categorical variables and extracted the association rules between the variables and C/NC events, we did not evaluate the correlations between the categorical variables. For example, many researchers have found that performing secondary tasks, such as using a phone or talking to passengers while driving, significantly increased driving risks. However, we aggregated all secondary tasks into one category. Third, some important categorical variables were discarded for the reasons described in Section 2.2, although they may have influenced the C/NC events. These limitations will be addressed in future studies.

Data Availability
The Naturalistic Engagement in Secondary Tasks (NEST) data used to support the findings of this study have been deposited in the SHRP2 Naturalistic Driving Study repository (doi:10.15787/VTT1/OZQ6BL).