A traffic crash is a complex phenomenon that involves coupled interdependency among multiple influencing factors. Because accounting for this interdependency is critical for predicting crash risk accurately and also helps reveal the underlying mechanism of crash occurrence, the present study builds a Real-Time Crash Prediction Model (RTCPM) for urban elevated expressways that accounts for the dynamicity and coupling interdependency among traffic flow characteristics before crash occurrence, and identifies the most probable risk propagation path and the most significant contributors to crash risk. A Dynamic Bayesian Network (DBN) served as the framework of the RTCPM. The Random Forest (RF) method was employed to identify the most important variables, which were used to build the DBN-based RTCPMs. The PC algorithm combined with expert experience was then applied to investigate the coupling interdependency among traffic flow characteristics in the DBN model. A comparative analysis was conducted among the improved DBN-based RTCPM considering the interdependency, the original DBN-based RTCPM without the interdependency, and a Multilayer Perceptron (MLP). In addition, sensitivity and strength of influence analyses were used to identify the most probable risk propagation path and the most significant contributors to crash risk. The results showed that the improved DBN-based RTCPM had better prediction performance than the original DBN-based RTCPM and the MLP-based RTCPM. The most probable risk propagation path was identified as follows: speed on current segment (V) (time slice 2)⟶V (time slice 1)⟶speed on upstream segment (U_V) (time slice 1)⟶Traffic Performance Index (TPI) (time slice 1)⟶crash risk on current segment. The most sensitive contributor to crash risk in this path was V (time slice 2), followed by TPI (time slice 1), V (time slice 1), and U_V (time slice 1).
These results indicate that the improved DBN-based RTCPM has the potential to predict crashes in real time for urban elevated expressways. It also helps reveal the underlying mechanism of crashes and supports the formulation of real-time risk control measures.

Predicting road crashes in real time has been a hotspot in road safety research in the context of active traffic management (ATM) over the past two decades. Real-time crash prediction rests on the assumption that the occurrence probability of a crash on a specific road segment can be predicted within a very short precrash time interval by using instantaneous traffic flow characteristics [

In general, numerous RTCPMs studies establish a direct connection between traffic flow data (i.e., volume, speed, occupancy and their combinations) and crash data. In these models, the collinearity and correlation among dependent variables are avoided; thus, the independence of variables is guaranteed [

DBN, a particular form of Bayesian Network (BN), represents the dynamic evolution of some state space model through time [

Furthermore, considering the interdependency among influencing factors also helps to reveal the underlying mechanism of crash occurrences. The present study estimates crash risk by quantifying the probability of crash occurrence. The model is intended to support real-time countermeasures that mitigate risk when the probability of a crash is high. The formulation of countermeasures should be based on the identification of the risk propagation path and the significant risk contributors. However, there has been a dilemma between predictive and explanatory models: the models specialized in prediction are not the best in knowledge discovery, and vice versa [

The existing DBN-based RTCPMs mainly emphasize the dynamicity of traffic flow characteristics, lacking investigation on the coupling interdependency among traffic flow characteristics. The main contributions of this study are (1) to apply the DBN structure learning algorithm in an example to predict road crashes; (2) to compare the performances of two DBN-based RTCPMs (considering the interdependency and not considering the interdependency) and the MLP-based RTCPM; and (3) to identify the most sensitive risk contributors in the propagation path by the use of the sensitivity and strength of influence analyses.

The manuscript is organized into five sections. The remainder is organized as follows. In Section

The 40 segments of the Yan'an elevated expressway in Shanghai, China, sequentially linking up to each other along the westbound and eastbound expressway, were selected as the study areas (see Figure

Yan'an elevated expressway, Shanghai, China.

The existing dual-loop detectors in the study area provide the average speed (km/h) and the average volume of a single lane (pcu/h) for each segment. Hourly weather variables, including visibility (km) and weather type (rainy or sunny), were collected from the Shanghai Xujiahui Observatory, which is 7.5 km from the Yan'an expressway. In this study, the Traffic Performance Index (TPI), which varies between 0 and 1, was applied as an indicator of the degree of congestion, where 1 is a traffic jam state and 0 is a free flow state (equation (_{max} is the maximum speed and _{i} is the average speed at the

The average speed data on the current, upstream, and downstream segments of the crash location and the TPI of the whole expressway were aggregated in 5-minute intervals. The evolution of traffic flow leading to a crash is a dynamic process; thus, the traffic flow characteristics of several time intervals before the crash should be combined to build the model. The intervals of 0–5 min (time slice 0), 5–10 min (time slice 1), and 10–15 min (time slice 2) prior to the crash were considered. Time slice 0 was excluded because the crash warning system needs some time to recognise crash states, and the actual crash occurrence time and the recorded time are not always consistent. Because the raw weather data were updated only once an hour, the weather condition was regarded as a stable influencing variable across the time slices. Finally, the traffic flow and weather data corresponding to 82 crash cases and 246 noncrash cases were generated. In total, nine variables combining traffic flow characteristics on the current, upstream, and downstream segments of the crash location with weather conditions are shown in Table

Information of the nine alternative variables.

Variable | Description |
---|---|
TPI | The average TPI of the whole expressway within 5 min interval |
V | The average speed of current segment within 5 min interval (km/h) |
U_V | The average speed of upstream segment within 5 min interval (km/h) |
D_V | The average speed of downstream segment within 5 min interval (km/h) |
Q | The volume of current segment within 5 min interval (pcu/h) |
U_Q | The volume of upstream segment within 5 min interval (pcu/h) |
D_Q | The volume of downstream segment within 5 min interval (pcu/h) |
Visibility | The horizontal visibility within one hour (km) |
Weather | The weather type in one hour, rainy or sunny |

The main purpose of constructing an RTCPM is to evaluate crash risk in real time. A high-dimensional variable space increases the processing complexity of the RTCPM. Thus, Random Forest (RF), a widely used variable selection model, was implemented in this study to select influencing variables and reduce the redundancy of variables. The variable importance (VI) metric was used as the criterion to pick the most relevant variables [

Sample

Each tree classifier produced a classification result by voting for the binary target (crash or noncrash) based on the OOB samples, and the classification error rate $e_i$ was calculated accordingly.

Random noise disturbance was added to the values of a given variable in the OOB sample to produce a new OOB sample. Each tree was then tested on crash/noncrash classification with the new OOB sample to calculate the classification error rate $e'_i$.

VI was calculated as the mean increase in the classification error rate of the trees after adding the random noise disturbance, as shown in the following equation:

$$VI = \frac{1}{N} \sum_{i=1}^{N} \left( e'_i - e_i \right)$$

where $N$ is the number of trees.
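The importance computation described above can be sketched in a few lines of Python. This is a minimal illustration and not the study's implementation: the `trees` are hypothetical stand-in classifiers (plain callables), the out-of-bag rows are made-up `(speed, visibility)` pairs, and the random disturbance is implemented as a permutation of the variable's OOB values, which is one common choice and an assumption here.

```python
import random

def error_rate(tree, rows, labels):
    """Share of rows that the tree misclassifies."""
    wrong = sum(1 for row, y in zip(rows, labels) if tree(row) != y)
    return wrong / len(rows)

def variable_importance(trees, oob_sets, feature, seed=0):
    """Mean increase in OOB error after disturbing one variable.

    trees    -- callables mapping a row (tuple) to a class label
    oob_sets -- one (rows, labels) pair per tree
    feature  -- index of the variable to disturb
    """
    rng = random.Random(seed)
    increases = []
    for tree, (rows, labels) in zip(trees, oob_sets):
        base = error_rate(tree, rows, labels)
        # Disturb the chosen variable by permuting its OOB values (assumption).
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        noisy = [row[:feature] + (v,) + row[feature + 1:]
                 for row, v in zip(rows, column)]
        increases.append(error_rate(tree, noisy, labels) - base)
    return sum(increases) / len(increases)

# Hypothetical stump: predicts a crash (1) when speed is below 40 km/h.
def speed_stump(row):
    return int(row[0] < 40)

# Made-up OOB rows: (speed in km/h, visibility in km).
rows = [(25.0, 8.0), (30.0, 2.0), (35.0, 9.0),
        (60.0, 3.0), (70.0, 7.0), (80.0, 1.0)]
labels = [1, 1, 1, 0, 0, 0]
trees = [speed_stump, speed_stump]
oob_sets = [(rows, labels), (rows, labels)]

vi_speed = variable_importance(trees, oob_sets, feature=0)
vi_visibility = variable_importance(trees, oob_sets, feature=1)
```

Because the stand-in trees never look at visibility, disturbing it leaves the error unchanged and its importance is zero, whereas disturbing speed can only raise the error of these trees.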

The Bayesian Network (BN) is a probabilistic graphical model that expresses the probabilistic relationships among a set of variables by connecting those variables in a directed acyclic graph (DAG). The BN has advantages in learning causal relationships, predicting the consequences of intervention, and analyzing the most probable explanations of consequences. Some researchers have adopted BN to evaluate and analyze traffic accident risk [ observed variables $\{Y_1, Y_2, \ldots, Y_t\}$ and hidden variables $\{X_1, X_2, \ldots, X_t\}$, which were traffic state variables and crash likelihood, respectively. When a Markov model and a BN were integrated to construct a DBN model, there were a transition model $P(X_k \mid X_{k-1})$, an observation model $P(Y_k \mid X_k)$, and an initial state distribution $P(X_1)$. The joint probability distribution can be expressed as follows:

$$P(X_{1:t}, Y_{1:t}) = P(X_1) \prod_{k=2}^{t} P(X_k \mid X_{k-1}) \prod_{k=1}^{t} P(Y_k \mid X_k)$$
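As a worked illustration of how an initial state distribution, a transition model, and an observation model combine into a joint probability, the sketch below multiplies the three factors along a sequence of time slices. All state names and probability values are hypothetical and chosen only for the example; they are not estimates from this study.

```python
from itertools import product

def joint_probability(states, observations, initial, transition, emission):
    """Joint probability of a hidden-state sequence and an observation sequence.

    Multiplies the initial distribution, one transition factor per step,
    and one observation factor per time slice.
    """
    p = initial[states[0]] * emission[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= transition[prev][cur] * emission[cur][obs]
    return p

# Hypothetical parameters (illustrative only).
initial = {"noncrash": 0.75, "crash": 0.25}
transition = {
    "noncrash": {"noncrash": 0.9, "crash": 0.1},
    "crash": {"noncrash": 0.5, "crash": 0.5},
}
emission = {
    "noncrash": {"fast": 0.8, "slow": 0.2},
    "crash": {"fast": 0.3, "slow": 0.7},
}

# 0.75 * 0.8 (slice 1) * 0.1 * 0.7 (slice 2)
example = joint_probability(["noncrash", "crash"], ["fast", "slow"],
                            initial, transition, emission)

# The joint distribution sums to one over all two-slice sequences.
total = sum(
    joint_probability(list(xs), list(ys), initial, transition, emission)
    for xs in product(initial, repeat=2)
    for ys in product(["fast", "slow"], repeat=2)
)
```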

There were three key steps to initialize a DBN model: (1) The ChiMerge algorithm was adopted to discretize the continuous variables. (2) Structure learning was applied to represent the graphical dependencies among variables. In this step, the DBN estimated the dependencies between variables not only within one time slice but also across different time slices. The PC algorithm was used to build the structure of the BN within one slice among the traffic state variables and crash likelihood. Then, the same variables in different slices were connected to build the structure of the DBN. (3) Parameter learning was conducted to learn the conditional probability distributions of variables within one time slice and across time slices. Parameter estimation was performed with the Expectation Maximization (EM) algorithm.

Continuous variables are usually problematic in DBNs because the model fails to capture the relationships between the continuous variables [$\chi^2$ statistic to test whether there are significant differences or similarities of relative class frequencies between adjacent intervals.

The ChiMerge algorithm mainly consists of the following steps.

Sort the samples according to their values.

Calculate the $\chi^2$ value for each pair of adjacent intervals with the following equation:

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{\left( A_{ij} - E_{ij} \right)^2}{E_{ij}}$$

where $k$ = number of classes, $A_{ij}$ = number of samples in the $i$th interval and $j$th class, and $E_{ij}$ = expected frequency of $A_{ij}$.

Merge the pair of adjacent intervals with the lowest $\chi^2$ value, and repeat until the $\chi^2$ values of all pairs of adjacent intervals exceed the $\chi^2$ threshold. The threshold is determined by the desired significance level (the 0.95 percentile level in this study) and the number of degrees of freedom (one less than the number of classes). There are two classes (crash and noncrash); thus, there is one degree of freedom, and the $\chi^2$ threshold is 3.841.
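The merging procedure above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: each distinct value starts as its own interval, the expected frequency is taken as row total times column total divided by the pair total, cells with zero expected frequency are skipped, and `max_intervals` is a hypothetical parameter mirroring a cap on the number of states.

```python
from collections import Counter

def chi2_pair(a, b, classes):
    """Chi-square statistic for two adjacent intervals (class-count Counters)."""
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:  # skip classes absent from both intervals
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, threshold=3.841, max_intervals=10):
    """Return the lower bounds of the merged intervals."""
    classes = sorted(set(labels))
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, Counter())[y] += 1
    # One initial interval per distinct value: (lower bound, class counts).
    intervals = [(v, counts[v]) for v in sorted(counts)]
    while len(intervals) > 1:
        chis = [chi2_pair(intervals[i][1], intervals[i + 1][1], classes)
                for i in range(len(intervals) - 1)]
        i = min(range(len(chis)), key=chis.__getitem__)
        # Stop once every adjacent pair differs significantly and the
        # number of intervals respects the cap.
        if chis[i] > threshold and len(intervals) <= max_intervals:
            break
        lower, left = intervals[i]
        intervals[i:i + 2] = [(lower, left + intervals[i + 1][1])]
    return [lower for lower, _ in intervals]

# Made-up speeds (km/h) with crash (1) / noncrash (0) labels.
speeds = [1, 2, 3, 4, 10, 11, 12, 13]
crash = [0, 0, 0, 0, 1, 1, 1, 1]
cut_points = chimerge(speeds, crash)
```

In this toy data the two classes occupy disjoint value ranges, so the procedure collapses the eight initial intervals into two, one starting at 1 and one starting at 10.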

The PC algorithm is an efficient and classical algorithm used for structural learning in BN [

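Determine the skeleton of the graph by conditional independence tests. Let $\{X_1, X_2, \ldots, X_k\}$ be a set of random variables, and test the conditional independence between $X_i$ and $X_j$ given a conditioning set $X_\gamma$ in the graph by calculating the cross entropy $CE(X_i, X_j \mid X_\gamma)$:

$$CE(X_i, X_j \mid X_\gamma) = \sum_{x_i, x_j, x_\gamma} P(x_i, x_j, x_\gamma) \log \frac{P(x_i, x_j \mid x_\gamma)}{P(x_i \mid x_\gamma)\, P(x_j \mid x_\gamma)}$$

The PC algorithm adopts the $\chi^2$ test statistic, which equals $2N \cdot CE(X_i, X_j \mid X_\gamma)$, where $N$ is the number of samples.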

Search the _{i} and _{j} are not conditionally independent with given _{γ}, then

Confirm the directions of the rest of the edges. Combining with expert experience, some undirected edges between nodes are specified based on the principles where any cycle and any other
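The conditional independence test used in the skeleton step can be illustrated with a small sketch that estimates the cross entropy from discrete samples and scales it by twice the sample size. The data layout (tuples of discrete values) and the function names are assumptions made for the example; comparing the statistic with a chi-square critical value is left to the caller.

```python
import math
from collections import Counter

def cross_entropy(samples, i, j, cond=()):
    """Empirical conditional cross entropy CE(X_i, X_j | X_cond), natural log.

    samples -- list of tuples of discrete values
    i, j    -- column indices of the two variables under test
    cond    -- column indices of the conditioning set (may be empty)
    """
    n = len(samples)
    key = lambda s: tuple(s[k] for k in cond)
    n_ijz = Counter((s[i], s[j], key(s)) for s in samples)
    n_iz = Counter((s[i], key(s)) for s in samples)
    n_jz = Counter((s[j], key(s)) for s in samples)
    n_z = Counter(key(s) for s in samples)
    ce = 0.0
    for (xi, xj, z), c in n_ijz.items():
        # P(xi, xj | z) / (P(xi | z) P(xj | z)) written with raw counts.
        ratio = c * n_z[z] / (n_iz[(xi, z)] * n_jz[(xj, z)])
        ce += (c / n) * math.log(ratio)
    return ce

def g_statistic(samples, i, j, cond=()):
    """Test statistic 2 * N * CE."""
    return 2 * len(samples) * cross_entropy(samples, i, j, cond)

# Made-up samples: independent pair, dependent pair, and a pair that is
# dependent only through a common third variable.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
dependent = [(0, 0), (1, 1)] * 10
common_cause = [(0, 0, 0), (1, 1, 1)] * 10

ce_independent = cross_entropy(independent, 0, 1)
ce_dependent = cross_entropy(dependent, 0, 1)
ce_given_cause = cross_entropy(common_cause, 0, 1, cond=(2,))
```

The third case shows why conditioning matters: the two variables are strongly associated marginally, yet the cross entropy vanishes once the common variable is in the conditioning set, so the edge between them would be removed from the skeleton.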

The EM algorithm is a general algorithm to calculate maximal log likelihood and the performance has been proved to be effective in parameter learning of BN [

Initialize

Introduce a distribution

E-Step: Calculate the distribution

M-Step: Optimize the parameters based on the estimated joint probability distribution.
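The alternation between the E-step and the M-step can be illustrated on a deliberately small model: a hidden binary node (which of two coins was used) with one observed child (the number of heads in a fixed number of flips). This toy model, its data, the starting values, and the fixed 0.5 prior on the hidden node are all assumptions for illustration and are unrelated to the parameters of the DBN in this study.

```python
import math

def em_two_coins(heads, n_flips, theta_a=0.6, theta_b=0.5, n_iter=20):
    """EM for two coins with unknown head probabilities and a hidden coin label.

    Returns the two estimates and the log likelihood recorded before each update
    (binomial coefficients are constant and omitted).
    """
    log_likelihoods = []
    for _ in range(n_iter):
        heads_a = tails_a = heads_b = tails_b = 0.0
        log_lik = 0.0
        # E-step: posterior weight of each coin for every trial.
        for h in heads:
            t = n_flips - h
            like_a = theta_a ** h * (1 - theta_a) ** t
            like_b = theta_b ** h * (1 - theta_b) ** t
            log_lik += math.log(0.5 * like_a + 0.5 * like_b)
            w_a = like_a / (like_a + like_b)
            w_b = 1.0 - w_a
            heads_a += w_a * h
            tails_a += w_a * t
            heads_b += w_b * h
            tails_b += w_b * t
        log_likelihoods.append(log_lik)
        # M-step: re-estimate the parameters from the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b, log_likelihoods

# Made-up data: heads observed in five trials of ten flips each.
theta_a, theta_b, history = em_two_coins([5, 9, 8, 4, 7], n_flips=10)
```

The recorded log likelihood never decreases from one iteration to the next, which is the property that makes EM suitable for maximizing the likelihood when some nodes are unobserved.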

The neural network, an effective function approximator, is often used to solve regression and prediction problems in various fields. A general multilayer perceptron model can be trained in the following three steps.

Initialize the MLP model. Assume that the original function can be approximated by a set of basic functions:

where

Load the sample point pair (

where

Adjust network synapse weights according to error feedback. The general calculation formula of the adjustment is

where
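The three steps (initialization, forward evaluation on sample pairs, and weight adjustment from the error feedback) can be sketched with a single-hidden-layer perceptron trained by plain gradient descent. The architecture, sigmoid activations, squared-error loss, learning rate, and toy data are assumptions chosen for a compact example; they are not the configuration of the MLP-based RTCPM.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_mlp(X, y, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer MLP; returns (predict function, loss per epoch)."""
    rng = random.Random(seed)
    n_in = len(X[0])
    # Step 1: initialize weights; the last entry of each row is the bias.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in W1]
        out = sigmoid(sum(w * v for w, v in zip(W2, h)) + W2[-1])
        return h, out

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in zip(X, y):
            # Step 2: load a sample pair and compute the output error.
            h, out = forward(x)
            err = out - target
            total += 0.5 * err ** 2
            # Step 3: adjust the weights according to the error feedback.
            d_out = err * out * (1 - out)
            d_hid = [d_out * W2[j] * h[j] * (1 - h[j]) for j in range(n_hidden)]
            for j in range(n_hidden):
                W2[j] -= lr * d_out * h[j]
            W2[-1] -= lr * d_out
            for j in range(n_hidden):
                for k in range(n_in):
                    W1[j][k] -= lr * d_hid[j] * x[k]
                W1[j][-1] -= lr * d_hid[j]
        losses.append(total / len(X))

    return (lambda x: forward(x)[1]), losses

# Made-up binary inputs (e.g., "low speed", "high TPI"); label 1 only if both hold.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
predict, losses = train_mlp(X, y)
```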

The DBN-based RTCPM was constructed on the training dataset (264 crash and noncrash cases) and validated on the validation dataset (64 crash and noncrash cases).

Figure

Variable importance ranking determined by Random Forest.

The DBN models with and without the interdependence among traffic flow characteristics (TPI, V, and U_V) were both constructed based on the training dataset. The former model (the improved DBN-based RTCPM) was the main focus of this study, and the latter (the original DBN-based RTCPM) was developed for comparison. Before constructing the graphical structure of the improved DBN-based RTCPM, the three traffic flow characteristics were discretized according to their corresponding crash cases using the ChiMerge algorithm. The number of discretization states of each variable was limited to 10 to reduce the computational complexity of the DBN models. The discretization ranges of TPI, V, and U_V are presented in Figures

Discretization ranges of TPI.

Discretization ranges of V.

Discretization ranges of U_V.

After discretization, the PC algorithm and expert assessment were utilized to investigate the interdependency among traffic flow characteristics within one time slice. The dynamicity of traffic flow characteristics was reflected by connecting the same variables from time slice 2 to time slice 1. The dynamicity and interdependency determined the graphical structure of the improved DBN-based RTCPM (Figure

Graphical structure of the improved DBN-based RTCPM.

Graphical structure of the original DBN-based RTCPM.

Afterwards the parameter learning process was implemented using the EM algorithm. The initial states of the improved DBN-based RTCPM and the original DBN-based RTCPM are presented in Figures

Initial state of the improved DBN-based RTCPM.

Initial state of the original DBN-based RTCPM.

The validation dataset was used to validate the DBN models. When no new evidence was entered into the DBN, the marginal probability of the crash risk node in the initial state of the DBN model was set as the classification threshold for evaluating model performance. Then, each validation case was entered individually into the models. The crash risks, i.e., the posterior probabilities of the crash risk node, relating to the crash-prone traffic conditions, were calculated based on the prior probabilities. Several evaluation metrics based on the confusion matrix (Table

Confusion matrix.

 | Predicted crashes | Predicted noncrashes |
---|---|---|
Actual crashes | $T_{crash}$ | $F_{noncrash}$ |
Actual noncrashes | $F_{crash}$ | $T_{noncrash}$ |

Besides the overall accuracy from equation (

Performance comparison of two types of DBNs and MLP.

Model | Overall accuracy | Sensitivity | F1-score | G-mean |
---|---|---|---|---|
Original DBN-based RTCPM | 0.750 | 0.313 | 0.385 | 0.529 |
Improved DBN-based RTCPM | 0.766 | 0.688 | 0.595 | 0.738 |
MLP-based RTCPM | 0.725 | 0.556 | 0.476 | 0.656 |
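The metrics in the table can all be derived from the four cells of a confusion matrix, as the sketch below shows. The counts used here are hypothetical: they are not reported in the text and were chosen only because, assuming 16 crash and 48 noncrash validation cases, they are consistent with the values listed for the improved DBN-based RTCPM. Reading the last two columns as the F1-score and the geometric mean of sensitivity and specificity is likewise an assumption.

```python
import math

def classification_metrics(tp, fn, fp, tn):
    """Metrics from a binary confusion matrix (crash = positive class)."""
    sensitivity = tp / (tp + fn)   # share of actual crashes predicted as crashes
    specificity = tn / (tn + fp)   # share of actual noncrashes predicted as noncrashes
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical counts (assumption): 11 of 16 crashes and 38 of 48 noncrashes correct.
metrics = classification_metrics(tp=11, fn=5, fp=10, tn=38)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```

Because crash cases are the minority class, sensitivity and the two balanced measures are more informative than overall accuracy, which a model could keep high simply by predicting noncrash most of the time.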

As illustrated by Table

Investigating the interdependency among the traffic flow characteristics also helps reveal the underlying mechanism of crash occurrence, which is helpful for formulating real-time risk control measures. The sensitivity and strength of influence analyses were implemented in GeNIe, a professional DBN analysis software package, to identify the most significant contributors to crash risk and the most probable risk propagation path.

The sensitivity analysis in GeNIe can be used to identify which nodes contribute most to the target node in the DBN. With crash risk set as the target node, sensitivity analysis was conducted, and the contribution degrees of the traffic flow characteristics to crash risk are presented in Figure

Results of sensitivity analysis.

The strength of influences analysis was utilized to identify the most probable risk propagation path based on the improved interdependency structure. The strength of influence is always calculated from the distance between the probability distributions of the child node conditional on the state of its parent node. As shown in Figure

Results of strength of influences analysis.
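The idea of measuring influence as a distance between the child's conditional distributions can be made concrete with a short sketch. The choice of Euclidean distance, the normalization by the square root of two, and the averaging over all pairs of parent states are assumptions made for illustration and should not be read as the exact computation performed by the software; the conditional probability tables are made up.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two discrete distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def strength_of_influence(cpt):
    """Average normalized distance between the child's conditional distributions.

    cpt -- dict mapping each parent state to the child's distribution (a list)
    Returns a value in [0, 1]; 0 means the parent state does not change the child.
    """
    pairs = list(combinations(cpt.values(), 2))
    # sqrt(2) is the largest Euclidean distance between two distributions.
    return sum(euclidean(p, q) for p, q in pairs) / (len(pairs) * math.sqrt(2))

# Made-up tables: child distribution over (noncrash, crash) per parent state.
no_influence = {"low": [0.8, 0.2], "high": [0.8, 0.2]}
full_influence = {"low": [1.0, 0.0], "high": [0.0, 1.0]}
partial_influence = {"low": [0.9, 0.1], "medium": [0.7, 0.3], "high": [0.4, 0.6]}

weak = strength_of_influence(no_influence)
strong = strength_of_influence(full_influence)
middle = strength_of_influence(partial_influence)
```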

Synthesizing the results of the sensitivity and strength of influence analyses makes it possible to identify the most probable risk propagation path and to determine the most sensitive contributors along that path. The results suggested that real-time risk countermeasures should be prioritized in the following order: V (time slice 2), TPI (time slice 1), V (time slice 1), and U_V (time slice 1).

This study aimed to build an RTCPM for urban elevated expressways by using the DBN model to capture the dynamicity and coupling interdependency among traffic flow characteristics before crash occurrence. The model was built and validated using traffic flow data collected on the Yan'an elevated expressway. Based on the DBN-based RTCPM, sensitivity and strength of influence analyses were used to identify the most probable risk propagation path and the most sensitive contributors to crash risk. The main conclusions are as follows:

In the model construction process, the interdependency in the DBN model was determined by the PC algorithm and expert experience, and the dynamicity of traffic flow characteristics was expressed by using data in time slices. In validation, the improved DBN-based RTCPM achieved an overall accuracy of 76.6%, with a crash prediction accuracy of 68.8% and a crash/noncrash balanced classification accuracy of 73.8%. The results indicated that the model can achieve effective crash prediction for urban elevated expressways.

Comparisons with the original DBN-based RTCPM and the MLP-based RTCPM suggested that the improved DBN-based RTCPM can capture the interdependency among traffic flow characteristics before crash occurrences. The comparison results also indicated that the improved DBN-based RTCPM was more suitable for predicting real-time crashes on urban elevated expressways.

According to the results of the sensitivity and strength of influence analyses, the most probable risk propagation path is V (time slice 2)⟶V (time slice 1)⟶U_V (time slice 1)⟶TPI (time slice 1)⟶crash risk on current segment, and the most sensitive contributor to crash risk in this path is V (time slice 2), followed by TPI (time slice 1), V (time slice 1), and U_V (time slice 1). The results suggested that real-time risk countermeasures should address these contributors in this order.

Two extensions are left for future research. On the one hand, the model was built and validated on the same urban elevated expressway; thus, its transferability to other urban elevated expressways has not been examined. On the other hand, specific real-time risk countermeasures, such as variable speed limits (VSL), can be investigated as a means of reducing crash risk.

The research data are stored in CSV format and are available from the corresponding author upon request.

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This study was supported by the National Natural Science Foundation of China (Grant no. 52072071).