Predicting Severity and Duration of Road Traffic Accident

This paper presents a model system to predict the severity and duration of traffic accidents by employing an Ordered Probit model and a Hazard model, respectively. The models are estimated using traffic accident data collected in Jilin province, China, in 2010. With the developed models, three severity indicators, namely, number of fatalities, number of injuries, and property damage, as well as accident duration, are predicted, and the important influences of related variables are identified. The results indicate that the goodness-of-fit of the Ordered Probit model is higher than that of the SVC model in severity modeling. In addition, accident severity is shown to be an important determinant of duration; that is, more fatalities and injuries in an accident lead to longer duration. The study results can be applied to predictions of accident severity and duration, which are two essential steps in the accident management process. By recognizing these key influences, this study also provides suggestive results for government to take effective measures to reduce accident impacts and improve traffic safety.


Introduction
Traffic accidents are a significant source of deaths, injuries, and property damage, and a major concern for public health and traffic safety. Accidents are also a major cause of traffic congestion and delay. Effective management of accidents is crucial to mitigating accident impacts and improving traffic safety and transportation system efficiency. As two major steps of the accident response program (shown in Figure 1), severity prediction and duration estimation are, therefore, of great importance. Accurate predictions of severity and duration can provide crucial information for emergency responders to evaluate the severity level of accidents, estimate the potential impacts, and implement efficient accident management procedures.
To the authors' knowledge, most previous studies examined accident severity and duration separately, although the two were found to be correlated. Moreover, existing studies investigated only one or two of the three aspects of accident severity, that is, number of fatalities, number of injuries, and property damage. Therefore, the present study aims to develop a model system to estimate both accident severity and duration. Furthermore, three indicators for accident severity are set, which represent number of fatalities, number of injuries, and property damage, respectively. In doing so, we provide crucial information for emergency responders to take effective management measures.
The remainder of this paper is organized as follows. In Section 2, we present the literature review on predictions of severity and duration in general. The data are described in Section 3. Accident severity modeling follows in Section 4 and duration forecasting in Section 5. The paper concludes with a summary and directions for future research.

Existing Literature
As two major factors in accident analysis, severity and duration have long been important topics for research. Most previous studies examined only one of the two. For example, with respect to severity analysis, Chang and Mannering [1] studied the relationship between injury severity and vehicle occupancy using Washington State accident data. Mannera and Wünsch-Ziegler [2] investigated accident severity and determined the important effects of related factors. As for duration, Chung [3] modeled accident duration with freeway accident data collected in Korea. Anastasopoulos et al. [4] presented a Bayesian network model that can be used to learn emerging patterns and predict accident clearance time. Nevertheless, some researchers found that accident severity influences duration. For instance, Nam and Mannering [5] revealed that the presence of a fatality or injury in an accident affects accident duration. Besides, as shown in Figure 1, severity prediction and duration estimation are connected procedures in the accident management system. Therefore, the two indicators should be considered together and combined in one model system.
Concerning severity analysis, which mainly includes three aspects, that is, number of fatalities, number of injuries, and property damage, most existing researchers investigated it as one comprehensive indicator; for example, Mannera and Wünsch-Ziegler [2] took accident severity as one dependent variable with four alternatives, namely, fatal, severe injury, light injury, and property damage. Milton et al. [6] defined severity levels as property damage only, possible injury, and injury. Malyshkina and Mannering [7] modeled severity by using three alternatives, that is, fatality, injury, and property damage only. In addition, a number of researchers considered only one or two of the three aspects of severity. For instance, Stone and Broughton [8] and Sze and Wong [9] considered only the aspect of fatality by defining two levels of severity, that is, fatal and nonfatal accident. Delen et al. [10] defined injury severity levels as no injury, probable injury, nonincapacitating, incapacitating, and fatality. Similarly, Ballesteros et al. [11] and Roudsari et al. [12] considered only number of fatalities and injuries but not property damage. In fact, different types and amounts of losses lead to different response measures and may last for different amounts of time. For example, either an accident resulting in $167-5000 property damage or an accident leading to 1-3 injuries will be defined as a level 2 accident in Zhang's study [13]. However, the latter needs rescue services but the former does not. This indicates that each of the three indicators, that is, number of fatalities, number of injuries, and property damage, is crucial to making accident response decisions and should therefore be modeled separately in order to provide more detailed information for accident management.
As mentioned above, most previous studies examined accident severity and duration separately, although the two were found to be correlated. Moreover, existing studies investigated only one or two of the three aspects of accident severity, that is, number of fatalities, number of injuries, and property damage. Therefore, the present work aims to develop a model system to estimate both accident severity and duration. Furthermore, three indicators for accident severity are investigated, which represent number of fatalities, number of injuries, and property damage, respectively.

Data and Modeling Framework
The dataset for the study contains police-reported traffic accident records for Jilin province, China, in 2010. With records containing missing values eliminated, our final dataset consists of 3,914 cases, of which 1,280 (32.70%) were pedestrian-involved accidents and 387 (9.89%) were non-motor-vehicle-involved accidents. In addition to severity information, the data contain information regarding accident duration, accident characteristics (vehicle fire, crash type, accident occurrence time, and number of lanes affected), emergency services (police services, fire and rescue services, tow services, and emergency medical services), vehicle characteristics (vehicle type involved, debris involved, hazardous material involved, and disabled vehicles involved), environmental factors (weather conditions and visibility distance), and road conditions (number of lanes, pavement condition, road geometrics, roadway surface condition, etc.).
Based on a preliminary correlation test, 4 dependent variables and 26 candidate independent variables were selected from the dataset, as shown in Table 1.
With Nof (number of fatalities), Noi (number of injuries), and Pd (property damage) as dependent variables, three separate severity prediction models are developed. Then, duration modeling is conducted by taking accident severity as input. The modeling framework is shown in Figure 2.

Severity Modeling
Besides the Ordered Probit model [14], which is often used in discrete choice modeling, a support vector machine (SVM) model is introduced in this paper and compared with the Ordered Probit model in terms of prediction accuracy.

Ordered Probit Model. As shown in Table 1, the alternatives of the severity-related dependent variables are all ordered. Since the multinomial logit (MNL) model, which is commonly used in discrete choice modeling, would fail to account for the ordinal nature of the dependent variable and suffers from the independence from irrelevant alternatives (IIA) problem [15], this study employs an ordered multiple choice model for severity modeling.
The ordered multiple choice model assumes the relationship
$$\sum_{i=1}^{j} P_n(i) = F(\alpha_j - \beta' X_n;\ \theta), \quad j = 1, \ldots, J-1, \tag{1}$$
where $P_n(j)$ is the probability that alternative $j$ happens in accident $n$ ($j = 1, \ldots, J$), $\alpha_j$ is an alternative specific constant, $X_n$ is a vector of the attributes of accident $n$, $\beta$ is a vector of estimable coefficients, and $\theta$ is a parameter that controls the shape of the probability distribution $F$. Therefore, $F$ can take various shapes depending on the value of $\theta$.
The Ordered Probit model, which assumes a standard normal distribution for $F$, is the most commonly used ordered multiple choice model [16]. The Ordered Probit model has the following form:
$$\sum_{i=1}^{j} P_n(i) = G(\alpha_j - \beta' X_n), \quad j = 1, \ldots, J-1, \tag{2}$$
where $G(\cdot)$ is the cumulative standard normal distribution function. For all the probabilities to be positive, we must have
$$\alpha_1 < \alpha_2 < \cdots < \alpha_{J-1}. \tag{3}$$

Support Vector Machine Model. Support vector machine (SVM) is a type of learning algorithm based on statistical learning theory, which can be adjusted to map the input-output relationship of a nonlinear system [17][18][19]. SVM has been widely used in transportation modeling; for example, Bolbol et al. [20] employed SVM classification in travel behavior analysis, Apatean et al. [21] used it in road obstacle classification, and Abdel-Aty and Haleem [22] applied it to analyze angle crashes at unsignalized intersections. Previous studies indicate that SVM can conduct discrete choice modeling with acceptable accuracy. Therefore, it is employed to model accident severity in this paper.
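As a brief illustration of the Ordered Probit form described earlier in this section, the following minimal sketch computes the probability of each ordered severity level from a linear index and a set of thresholds. The function name, the index value, and the thresholds are hypothetical and are not estimates from this study; the sketch also assumes the common convention in which the cumulative probability is the standard normal distribution function evaluated at the threshold minus the index.

```python
from statistics import NormalDist


def ordered_probit_probs(xb, cutpoints):
    """Return P(y = j), j = 1..J, for linear index xb and J-1 cutpoints.

    Assumes cumulative probabilities P(y <= j) = CDF(cutpoint_j - xb).
    """
    # Strictly increasing cutpoints keep every category probability positive.
    if any(a >= b for a, b in zip(cutpoints, cutpoints[1:])):
        raise ValueError("cutpoints must be strictly increasing")
    cdf = NormalDist().cdf
    # Cumulative probabilities; the highest category closes the sum at 1.
    cumulative = [cdf(a - xb) for a in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


# Hypothetical index and thresholds for a four-level severity indicator.
probs = ordered_probit_probs(xb=0.4, cutpoints=[-0.5, 0.8, 1.9])
# The predicted level is the one with the largest probability (1-based).
predicted_level = max(range(len(probs)), key=probs.__getitem__) + 1
```

Taking the most probable level as the prediction is one plausible rule for turning the probabilities into a single predicted category.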
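Given a set of input-output data pairs $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$ ($x_i \in X \subseteq \mathbb{R}^n$, $y_i \in Y \subseteq \mathbb{R}$, with $l$ being the number of training samples) that are randomly and independently generated from an unknown function, SVM estimates the function using the following equation [23]:
$$f(x) = (w \cdot \Phi(x)) + b, \tag{4}$$
where $\Phi(x)$ represents the high-dimensional feature space that is nonlinearly mapped from the input space $x$, $w$ denotes a parameter vector, and $b$ is the threshold [24,25].
If the domain of the output space $Y$ only takes category values, that is, −1 and +1, the learning problem is referred to as support vector classification (SVC) [26].
To classify the training data $D$, SVM's linear soft-margin algorithm is used to solve the following primal quadratic programming problem:
$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i((w \cdot x_i) + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \ldots, l, \tag{5}$$
where $C$ is a penalty parameter and $\xi_i$ are the slack variables. The goal is to find an optimal separating hyperplane $(w \cdot x) + b = 0$, where $x \in \mathbb{R}^n$. The Wolfe dual of (5) can be expressed as
$$\max_{\alpha} \ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0, \ \ 0 \le \alpha_i \le C, \ \ i = 1, \ldots, l, \tag{6}$$
where $\alpha \in \mathbb{R}^l$ are Lagrangian multipliers. The optimal separating hyperplane of (5) can be given by
$$w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i, \qquad b^* = \frac{1}{N_{sv}} \sum_{j:\, 0 < \alpha_j^* < C} \Big( y_j - \sum_{i=1}^{l} \alpha_i^* y_i (x_i \cdot x_j) \Big), \tag{7}$$
where $\alpha^*$ is the solution of (6) and $N_{sv}$ represents the number of support vectors such that $0 < \alpha < C$. A new sample is classified as +1 or −1 according to the final decision function $f(x) = \operatorname{sgn}((w \cdot x) + b)$.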
In order to conduct multiclass classification (as the SVC model is originally designed for binary classification), the one-against-one method is employed in this paper [27,28].
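The one-against-one idea can be sketched as follows: one binary classifier is kept for every pair of classes, each classifier casts a vote through the sign decision function, and the class with the most votes is returned. The hyperplanes below are hand-picked, hypothetical values rather than trained models, and the tie-breaking rule is an assumption made only for this illustration.

```python
from collections import Counter
from itertools import combinations


def binary_decision(w, b, x):
    """Linear SVC decision f(x) = sgn((w . x) + b), returned as +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def one_against_one(classifiers, classes, x):
    """Vote over all class pairs; classifiers[(a, b)] = (w, bias).

    A decision of +1 votes for class a, and -1 votes for class b.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        w, bias = classifiers[(a, b)]
        winner = a if binary_decision(w, bias, x) == 1 else b
        votes[winner] += 1
    # Assumption: ties go to the lowest class label.
    return min(classes, key=lambda c: (-votes[c], c))


# Hypothetical hyperplanes for three ordered severity levels.
classifiers = {
    (1, 2): ([-1.0, 0.0], 1.0),
    (1, 3): ([-1.0, 0.0], 2.0),
    (2, 3): ([-1.0, 0.0], 3.0),
}
label = one_against_one(classifiers, [1, 2, 3], [2.5, 0.0])
```

With k classes this scheme needs k(k − 1)/2 binary classifiers, which stays small for the three- or four-level severity indicators considered here.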

Estimation Results. Using Stata and Matlab, the severity prediction models based on the Ordered Probit model and SVM were estimated, respectively. The estimation results as well as the prediction accuracies are shown in Table 2.
The last row shows the hit ratio for all the models. In general, a higher hit ratio represents a higher goodness-of-fit of the model. As all the hit ratio values of the Ordered Probit models are higher than those of the SVM models, the Ordered Probit-based models are chosen as the severity prediction models.
The results indicate that hazardous material involved in the accident, weather, and accident location are significant in all three models. According to the estimation results, the involvement of hazardous material increases the probability of high property damage. The reason is that hazardous material increases the probability of fire or even explosion, which leads to high damage to the vehicles and goods.
Some of the variables have an impact on only one or two indicators. For example, bus involved, truck involved, time of day, and traffic signal control are crucial to number of fatalities and injuries. The more buses or trucks are involved, the more fatalities and injuries the accident will cause. In addition, the factors of road geometrics, vehicle fire, and vehicle rollover are important for number of fatalities, while roadway surface condition has an effect on number of injuries. The results also indicate that disabled vehicles involved, debris involved, visibility distance, pavement condition, and motor-vehicle-only accident are significant for property damage. The more disabled vehicles or debris are involved in the accident, the more property damage the accident will lead to. As for motor-vehicle-only accidents, the results reveal that accidents with only vehicles involved cause more property damage than those with pedestrians or non-motor-vehicles involved.
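For reference, a hit ratio of this kind can be computed as in the minimal sketch below, assuming it is defined as the share of cases whose predicted category equals the observed category (the definition is not spelled out here). The data are made up for illustration.

```python
def hit_ratio(observed, predicted):
    """Share of cases whose predicted category equals the observed one."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(o == p for o, p in zip(observed, predicted))
    return hits / len(observed)


# Hypothetical observed and predicted severity levels for six accidents.
observed = [1, 2, 2, 3, 1, 1]
predicted = [1, 2, 3, 3, 1, 2]
ratio = hit_ratio(observed, predicted)  # four of the six cases match
```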

AFT Model and KM Estimator. As suggested by Nam and Mannering [5] and Stathopoulos and Karlaftis [29], hazard-based duration models have an advantage in that they allow the explicit study of duration effects of accidents (i.e., the relationship between how long an accident has lasted and the likelihood of it ending soon). Thus, hazard-based duration models, in particular the accelerated failure time (AFT) metric, were utilized in this study to model the accident duration. The AFT model was chosen because, compared with other forms of hazard-based models, it is predominantly fully parametric (that is, a probability distribution is specified), it is less affected by the choice of probability distribution [30,31], and its results are easily interpreted [32].
Let $T$ be a nonnegative random variable representing the accident duration. The hazard at time $t$ on the continuous time-scale, $h(t)$, is defined as the instantaneous probability that the duration under study will end in an infinitesimal time period $\Delta t$ after time $t$, given that the duration has not elapsed until time $t$. A mathematical definition for the hazard function is as follows:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}. \tag{8}$$
Let $f(\cdot)$ and $F(\cdot)$ be the density and cumulative distribution function for $T$, respectively. Then the probability of ending in an infinitesimal interval of range $\Delta t$ after time $t$ is $f(t)\Delta t$, and the probability that the process lasts for at least $t$ is given by the survival equation
$$S(t) = P(T \ge t) = 1 - F(t). \tag{9}$$
Thus, the hazard function can be further expressed as
$$h(t) = \frac{f(t)}{S(t)} = \frac{f(t)}{1 - F(t)}. \tag{10}$$
The distribution of the hazard can be assumed to be one of many parametric forms or to be nonparametric. Because the distribution of the accident duration is unknown, one of the nonparametric methods, the Kaplan-Meier (KM) product limit estimator, is used to explore the covariate effects and the potential distribution.
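The relation between the hazard, the density, and the survival function described above can be checked numerically. The sketch below does so for a Weibull duration, assuming the parameterization S(t) = exp(−(λt)^p); the rate and shape values are hypothetical and unrelated to the estimates in this study.

```python
import math


def weibull_pdf(t, lam, p):
    """Density of a Weibull duration with S(t) = exp(-(lam * t) ** p)."""
    return lam * p * (lam * t) ** (p - 1) * math.exp(-((lam * t) ** p))


def weibull_survival(t, lam, p):
    """Probability that the duration lasts at least t."""
    return math.exp(-((lam * t) ** p))


def hazard(t, lam, p):
    """Hazard computed as density divided by survival."""
    return weibull_pdf(t, lam, p) / weibull_survival(t, lam, p)


lam, p = 0.02, 1.5  # hypothetical rate and shape parameters
h30 = hazard(30.0, lam, p)
# Closed form of the Weibull hazard for comparison.
closed_form = lam * p * (lam * 30.0) ** (p - 1)
```

With p = 1 the Weibull reduces to the Exponential distribution and the hazard is constant at λ, while p > 1 gives a hazard that increases with elapsed duration.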
As a nonparametric method, the KM estimator produces an empirical approximation of survival and hazard but hardly takes any covariate effects into consideration. It is similar to an exploratory data analysis. Denoting the distinct failure times of individuals as $t_1 < t_2 < \cdots < t_k$, the KM estimator of survival at time $t_j$ is computed as the product of the conditional survival proportions:
$$\hat{S}(t_j) = \prod_{i=1}^{j} \frac{r(t_i) - d(t_i)}{r(t_i)}, \tag{11}$$
where $r(t_i)$ is the total number of accidents at risk for ending at $t_i$ and $d(t_i)$ is the number of accidents stopping at $t_i$. By using the KM estimator, the survival function curves of the accident duration are estimated, which are shown in Figure 3. The results indicate that the survival probability decreases with duration, which implies that an accelerated failure time model with a Weibull or Exponential distribution should be employed. Therefore, the AFT model is developed to examine the linkages between duration and covariates related to accident information.
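The product-limit computation described above can be sketched in a few lines. This minimal version assumes every accident duration is fully observed (no censoring), which is a simplifying assumption of the example; the durations are made-up values in minutes.

```python
from collections import Counter


def kaplan_meier(durations):
    """Return [(t_j, S(t_j))] at each distinct end time, without censoring."""
    ended = Counter(durations)      # number of accidents ending at each time
    at_risk = len(durations)        # accidents still ongoing just before t
    survival, curve = 1.0, []
    for t in sorted(ended):
        d = ended[t]
        survival *= (at_risk - d) / at_risk  # conditional survival proportion
        curve.append((t, survival))
        at_risk -= d
    return curve


# Hypothetical durations (minutes) of five accidents.
curve = kaplan_meier([20, 35, 35, 50, 90])
```

Plotting the second element of each pair against the first gives a step-shaped survival curve of the kind usually drawn for KM estimates.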
The AFT model permits the covariates to affect the duration dependence. Its survival function is given as
$$S(t \mid X) = S_0\big(t \exp(-\beta' X)\big), \tag{12}$$
where $S_0(\cdot)$ is the baseline survival function. The corresponding hazard function is
$$h(t \mid X) = h_0\big(t \exp(-\beta' X)\big) \exp(-\beta' X). \tag{13}$$
The AFT model can be expressed as a log-linear model:
$$\ln T = \beta' X + \varepsilon. \tag{14}$$
Assuming that the random error $\varepsilon$ follows either a Weibull distribution or an Exponential distribution, one can get two kinds of AFT models, and both of them are often used in duration analysis.
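To illustrate how covariates rescale time in an AFT model, the sketch below evaluates a Weibull AFT survival function and its median duration. It assumes the sign convention in which a larger linear index β'X lengthens duration and a baseline survival of exp(−u^p); the index and shape values are hypothetical.

```python
import math


def weibull_aft_survival(t, xb, p):
    """S(t | X) = S0(t * exp(-xb)) with Weibull baseline S0(u) = exp(-u ** p)."""
    u = t * math.exp(-xb)  # covariates stretch or shrink the time scale
    return math.exp(-(u ** p))


def weibull_aft_median(xb, p):
    """Duration at which the survival probability equals 0.5."""
    return math.exp(xb) * math.log(2.0) ** (1.0 / p)


p = 1.3  # hypothetical shape parameter
base = weibull_aft_median(xb=3.5, p=p)
shifted = weibull_aft_median(xb=3.5 + 0.25, p=p)
ratio = shifted / base  # equals exp(0.25) regardless of the shape p
```

Under these assumptions, raising the index by 0.25 multiplies every quantile of the duration, including the median, by exp(0.25).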

Estimation Results.
Assuming that the random error in (14) follows a Weibull distribution and an Exponential distribution, respectively, the accident duration models are established. The models are estimated by employing maximum likelihood estimation (MLE), and the estimation results are shown in Table 3.
The mean absolute percentage error (MAPE), which measures the average percentage difference between predicted and observed values, is adopted to examine the accuracy of the developed duration prediction model. MAPE is calculated as
$$\text{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{A_i - P_i}{A_i} \right|, \tag{15}$$
where $A_i$ is the observed value, $P_i$ is the predicted value for observation $i$, and $N$ is the number of observations.
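A minimal sketch of the MAPE computation is given below, expressed as a fraction rather than multiplied by 100. The observed and predicted durations are made-up values, and the function name is only illustrative.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, returned as a fraction."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in observed):
        raise ValueError("observed values must be nonzero")
    errors = [abs((a - p) / a) for a, p in zip(observed, predicted)]
    return sum(errors) / len(errors)


# Hypothetical observed and predicted durations in minutes.
value = mape([40.0, 60.0, 100.0], [50.0, 54.0, 80.0])
```

Here the three relative errors are 0.25, 0.10, and 0.20, so the result is their average.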
The MAPE value of the Weibull distribution (0.22) is lower than that of the Exponential distribution (0.23), indicating that the values predicted by the AFT model with the Weibull distribution are closer to the actual accident duration [3]. Therefore, the Weibull distribution function is chosen.
The estimation results are mostly consistent with theoretical expectations. According to the results, the variables related to accident severity significantly affect accident duration: the more fatalities and injuries occur in the accident, the longer the duration. This supports the necessity of combining predictions of accident severity and duration in one model system. Besides, accident type is revealed to be crucial to duration: compared with other types of accidents, the duration of rear-end collisions is 37% shorter, while that of rollovers is 28% longer. The results also show that the duration of accidents involving a bus, truck, debris, or hazardous material is 60%, 58%, 55%,  the results reveal that accidents occurring at a regular road section or 4-way intersection last longer than those occurring at other locations. The reason may be that the traffic volume is higher at regular road sections or intersections. Regarding emergency services, accidents that need tow services have longer duration. Moreover, as the number of lanes occupied in the accident increases, duration increases. By using the accident duration model, the survival curve of duration is estimated, which is shown in Figure 4. The prediction accuracy of the accident duration model, compared with the observed values, is shown in Table 4.
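Percentage effects such as those quoted above are commonly read off AFT coefficients as exp(β) − 1. How the percentages in this study were derived is not stated, so the sketch below is only an assumption about the usual interpretation; the coefficient is back-calculated for illustration and is not a value from Table 3.

```python
import math


def percent_change(beta):
    """Percentage change in expected duration for a one-unit covariate change."""
    return (math.exp(beta) - 1.0) * 100.0


def coefficient_for(percent):
    """AFT coefficient that would produce a given percentage change."""
    return math.log(1.0 + percent / 100.0)


# Hypothetical coefficient implied by a 37% shorter duration.
beta_rear_end = coefficient_for(-37.0)
```

Under this reading, a negative coefficient shortens duration and a positive one lengthens it, with the size of the effect growing faster than linearly in the coefficient.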

Conclusions
In this paper, a severity prediction model system was constructed by employing the Ordered Probit model, and a duration prediction model was established by applying the Hazard model. Accident severity, including number of fatalities, number of injuries, and property damage, as well as accident duration, was forecasted with the models.

Mathematical Problems in Engineering
The study results can be applied to severity and duration prediction, which are essential steps in the accident response process. By comparing SVM and the Ordered Probit model, the study also makes a methodological contribution toward enhancing the prediction accuracy of severity estimation. In addition, by identifying the key effects of related factors on accident severity and duration, the results provide useful clues for government to take effective measures in order to reduce accident impacts and improve traffic safety.
One limitation of the current study is that some factors, such as characteristics of the driver, passengers, and pedestrians, and traffic conditions, which have potential effects on accident severity and duration, are not considered because of the lack of suitable data. Further study should collect the related information and investigate the impacts of these factors.

Figure 2: Accident severity and duration modeling framework.

Table 1: Variables and statistics based on survey data.

Table 4: Goodness of fit index and estimated distribution statistics of accident duration model.