A Crash Severity Prediction Method Based on Improved Neural Network and Factor Analysis



Crash Severity.
Traffic safety remains a challenging problem, and crash hotspots have been identified around the world. The total number of fatal crashes in the U.S. increased to around 35,000 in 2016. In addition, according to the Washington State Collision Summary report, a total of 117,053 crashes were identified in Washington State, including 499 fatal collisions, 36,531 injury collisions, and 77,358 property-damage-only collisions, indicating that a crash occurred every 4.5 minutes and a person died in a crash every 16 hours [1]. Figure 1 compares the traffic fatality rates of the U.S. and Washington State [1]. Billions of dollars in personal and property damage are lost in traffic crashes each year around the world [2].
To achieve the intrinsic goal of exploring the numerous factors that trigger crashes, crash severity is often used in crash analyses to represent the degree of injury. The KABCO scale was proposed by WSDOT to represent the level of injury: K, fatal injury; A, incapacitating injury; B, nonincapacitating injury; C, minor injury; and O, property-damage-only injury.

Crash Severity Prediction.
Crash prediction problems have long been a popular area of study around the world. Numerous studies have conducted prediction analyses based on classic statistical models, e.g., the linear, nonlinear, generalized linear model (GLM), generalized estimating equation (GEE), negative binomial (NB), and Poisson regression models, which are regarded as a good attempt at thoroughly formulating the relationship between tens or hundreds of explanatory variables. However, it should be noted that the traditional statistical methods have their limitations. The artificial intelligence (AI) technique, particularly the deep learning methodology [6], as an emerging but promising tool for addressing the problems faced in the traditional statistical domain, deserves more attention and exploration.
Safety performance functions (SPFs) are frequently utilized to demonstrate the relationships between different crashes and crash impact parameters. Such functions usually use the crash frequency as the target variable. The Highway Safety Manual [7] has several chapters demonstrating the average crash frequency of an entire network, facility, or individual site. Elvik et al. [8] developed six power functions to demonstrate the relationship between speed and road safety; these six equations are slightly adjusted according to the road characteristics under the following categories: fatal injuries, fatal and serious injuries, all injuries, fatal accidents, fatal and serious accidents, and all injury accidents. Russo et al. [9] used a negative binomial regression model to develop four sets of SPFs based on the crash frequency and conducted a residual analysis to prove the accuracy. Park and Abdel-Aty [10] developed several crash modification factors (CMFs) for a combination of traffic and roadway cross-sectional elements on noncurved and curved roadway sections using a cross-sectional method and found that the CMFs for increasing the lane and shoulder width decreased as the annual average daily traffic (AADT) level increased. Using a Bayesian ranking technique, Ahmed et al. [11] examined the safety effects of roadway geometry on the crash occurrence rate along a freeway section featuring mountainous terrain and adverse weather conditions and confirmed that segments with steep downgrades are more crash-prone along the studied section.
A number of researchers have been eager to dig into the use of statistical analysis for traffic safety research, such as linear, nonlinear, GLM, GEE, NB, or Poisson regression models. Such models have performed well when the number of explanatory variables is constrained. Debrabant et al. [12] applied an autoregressive Poisson-Tweedie model to mine spatially and temporally aggregated hospital records of traffic accidents and confirmed that this method is very accurate when applied to a black spot identification problem. Pei et al. [13] developed a joint probability model combined with a Markov chain Monte Carlo (MCMC) approach and a full Bayesian framework to estimate the effects of explanatory factors, and their results indicated that the model achieves a good statistical fit and provides an accurate analysis of the influences of the explanatory factors. El-Basyouny et al. [14] applied a multivariate model based on MCMC simulation to address the impact of weather elements, and their results showed that temperature and snowfall are statistically significant with intuitive signs for all crash types, whereas rainfall is mostly insignificant, as is the maximum wind gust with a few exceptions that are positively related to the crash type.
To address the shortcomings of the traditional statistical models, scholars in the crash analysis field are at present more willing to use AI methods. Karlaftis and Vlahogianni [15] discussed the differences and similarities between statistical methods and neural networks (NNs), and their results showed that the goals of the analysis are more important than the tools used, and that all modeling approaches carry assumptions. Specific to traffic accident analysis, although classic statistical models such as NB, Poisson, and Bayesian networks can achieve a good identification of a broad range of risk factors, they are limited to a finite-factor assumption as compared with a deep-learning method [16]. Aided by the powerful hardware and software of modern computers, deep-learning methods are becoming powerful tools for many aspects of our daily lives [6]. For example, in a crash analysis, Zeng and Huang [17] used a pruned NN for crash severity, adopting a convex combination (CC) training algorithm and an NN pruning for function approximation (N2PFA) structure optimization method, and found that CC outperforms the backpropagation (BP) method in both convergence ability and training speed; in addition, simplifying the nodes of an NN structure can yield better performance. Huang et al. [18] developed an optimized radial basis function neural network (RBFNN) model to analyze the relationships between crash frequency and the relevant risk factors, and their comparative work showed that RBFNN models outperform negative binomial models and backpropagation neural network (BPNN) models. Li et al. [5] developed a data-driven method combining the nondominated sorting genetic algorithm (NSGA-II) with an NN to identify the key factors in a fatal highway crash analysis.
All of the abovementioned studies have focused more on the NN structure itself, using complicated mathematical equations to illustrate abstract concepts; nevertheless, they have seldom given a thoughtful, detailed, and general procedure for a complete crash severity analysis. In short, both the statistical and AI methods of previous studies still face challenges; in particular, traditional NNs easily become stuck in a local optimum because of their random weight initialization at the very beginning. Although some studies [5, 18] were carried out to address these problems, the efficiency of the global optimum search in particle swarm optimization (PSO) and of the local optimum search in other optimization methods still needs improvement.
To solve the aforementioned problems, the purpose of this paper is as follows: (1) to provide a BPNN algorithm integrating PSO with adaptive inertial weights for the establishment of the crash severity prediction model; (2) to conduct a detailed factor analysis (FA) based on the refined model to quantify the internal relationships and heterogeneity of the different variables that trigger crashes of distinguished severity. The prediction target in the first phase is the crash severity. It should be noted that the novelty of the paper lies in the fact that an integrated method incorporating an emerging AI technique and a traditional statistical model is provided for crash severity analysis. Besides, using FA and PSO for the calculation of the parameters of the crash triggers is itself somewhat original.
This paper is organized as follows: in the first part, the background and severity levels are discussed to demonstrate the reason for conducting this research. In the second part, classic statistical models and NN models dealing with crash analysis are reviewed to illustrate their advantages and limitations. The following section demonstrates the processing of the dataset, including its description and simplification. In the fourth part, the entire methodology for the developed model and data incorporation is presented to highlight the severity prediction process. Finally, the results are presented with the conclusion.

Data Description
The dataset used in this study consisted of data acquired from the Highway Safety Information System (HSIS), which provides data from nine U.S. states, including California, Washington, Minnesota, Michigan, Maine, Ohio, North Carolina, and Illinois. Considering the authors' time studying in Washington State during the period of 2016 to 2017, the crash data from this state were selected as the target data.
The HSIS data roughly comprise two groups of tables. The first is related to the Accident, Vehicle, and Occupant files, which involve variables such as TIME; the second is related to the roadway grade/curve files. The crash data from I-5 in Washington State covering the years 2011-2015 were extracted. The data from the first four years, 2011-2014, were used to fit the model, and the data from the final year, 2015, were used as the prediction validation set. The total crashes were 9926, 10083, 10127, 11628, and 12804 from 2011 to 2015. Based on these raw samples, the following steps needed to be conducted before digging into the model input procedure: (1) Exclude apparently irrelevant variables. More than 40 features were requested from the HSIS database system, and some features, such as "CASENO" (accident case number), "MILEPOST" (milepost), "RD_INV" (a linkage variable on the Accident file used in the merging operation), and "RTE_NBR" (route number), are not related to crash severity and were omitted for simplicity. (2) Samples whose features such as "LIGHT," "WEATHER," and "ACCTYPE" (accident type) held values such as "UNKNOWN," "NAN," "UNSTATED," and "NULL" were also omitted for simplicity. (3) Some nominal variables that cannot be denoted by a continuous number, such as "DIR_CURV" (the horizontal curve direction) and "DIR_GRAD" (the vertical curve grade direction), both representing the relative direction of left or right, were transformed into discrete scale values ("1" or "0"). (4) The vehicle-related and driver-related variables such as "DRV_AGE" (driver age) and "DRV_SEX" (driver sex) were incorporated with the accident-related data files through the "CASENO" label, whereas the grade/curve-related variables were incorporated with the accident-related data through "MILEPOST"; here, a data processing computer program written in MATLAB was developed to locate the "MILEPOST" between "BEGPOST" and "ENDPOST" in the grade/curve files.
(5) "VEHYR," which indicates the vehicle model year, was transformed into the vehicle operation year through the following formula:

VehOpYear_i = Year_i − VEHYR_i,

where Year_i is the year in which crash i occurred and VEHYR_i is the raw vehicle model year; for example, a 2011-model vehicle involved in a 2013 crash gives 13 − 11 = 2. (6) The output file contains the vectors derived from the "SEVERITY" variable in the raw dataset. In detail, noninjury was derived from "1, No Injury"; injury was derived from "6, Nondisabling Injury" and "7, Possible Injury"; and incapacitating injury was derived from the remaining codes. The vectors are described through the following formula:

Out_i = [B_Injury, B_Non-Injury, B_Incapacitating-Injury]^T,

where Out_i refers to the output of crash item i, and B_Injury, B_Non-Injury, and B_Incapacitating-Injury refer to the Boolean indices of the crash severity for an injury, noninjury, and incapacitating injury, respectively.
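Steps (5) and (6) can be sketched in Python as follows (the paper's processing program was written in MATLAB; the function names here are illustrative, while the severity codes are those quoted above):

```python
def vehicle_operation_year(crash_year, vehyr):
    """Step (5): operation years = crash year - vehicle model year,
    e.g. a 2011-model vehicle crashing in 2013 gives 13 - 11 = 2."""
    return crash_year - vehyr

def severity_vector(severity_code):
    """Step (6): map the raw SEVERITY code onto the Boolean output
    vector [injury, noninjury, incapacitating injury]."""
    if severity_code == 1:           # "1, No Injury"
        return [0, 1, 0]
    if severity_code in (6, 7):      # nondisabling / possible injury
        return [1, 0, 0]
    return [0, 0, 1]                 # all remaining codes

print(vehicle_operation_year(2013, 2011))   # -> 2
print(severity_vector(7))                   # -> [1, 0, 0]
```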
After the processing was conducted through the abovementioned steps, a total of 4310, 4494, 4436, 4666, and 4984 samples from 2011 to 2015 were used for model fitting and validation. Glancing at the data, crashes seemed slightly more prone to occur during the winter (cold season) and on work days, whereas younger drivers contributed a significant number of accidents (Figure 2).
After the selection of the 20 features (Table 1) available from the raw file, the next step in the data cleaning and processing work was variable standardization, which was carried out using the min-max method through the following formula:

x_i' = (x_i − x_min) / (x_max − x_min),

where x_min and x_max are the minimum and maximum values of the feature. From Table 1, we can see that, among all 20 features, there are 15 categorical and 5 continuous variables. In addition, from a research perspective, we divided all variables into two large categories: road-related and nonroad-related.
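The min-max standardization can be sketched in a few lines (a minimal Python illustration of the formula above):

```python
def min_max(values):
    """Min-max standardization: x' = (x - x_min) / (x_max - x_min),
    mapping each feature into [0, 1] before it enters the network."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([18, 25, 60, 81]))   # e.g. driver ages mapped into [0, 1]
```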

NN with the BP Method.
An artificial neural network mimics human neurons using information technology and can process complicated connections between the input, hidden, and output layers. Among multiple-layer neural networks, the simple three-layer NN has proven to be the most widely adopted and effective in previous research [19].
In the forward-propagation three-layer NN, the input variables can be defined as an input vector X:

X = [x_1, x_2, ..., x_I]^T,

where x_i refers to the ith input variable, I = 20, and T refers to the transpose in the matrix calculation. Similarly, the expected crash severity level output vector Ψ can be written as

Ψ = [ψ_Injury, ψ_Noninjury, ψ_Incapacitating-Injury]^T,

where ψ_Injury refers to the Boolean index for an injury-related crash, ψ_Noninjury refers to the Boolean index for a noninjury-related crash, and ψ_Incapacitating-Injury refers to the Boolean index for an incapacitating injury.
In addition, the weight matrix between the input and hidden layers, W_1, should be

W_1 = [W(1)_{j,i}]_{J×I},

where W(1)_{j,i} denotes the weight between the ith input node and the jth hidden node, J refers to the total number of nodes in the hidden layer, and I = 20. The weight matrix between the hidden and output layers, W_2, should be

W_2 = [W(2)_{k,j}]_{K×J},

where W(2)_{k,j} denotes the weight between the kth output node and the jth hidden node, and the index k runs over the three outputs indicating injuries, noninjuries, and incapacitating injuries. The structure of a general three-layer neural network is shown in Figure 3.
Using the forward calculation method [17], the outputs of the nodes in the hidden layer, H_j(m), can be depicted as

H_j(m) = g_j( Σ_{i=1}^{I} W(1)_{j,i} x_i(m) + β_j ),   j = 1, ..., J,

where m indexes the training pattern, W(1)_{j,i} denotes the weight between the ith input node and the jth hidden node, J refers to the total number of nodes in the hidden layer, g_j is the activation function between the input and hidden layers, and β_j is the bias term.
Similarly, the outputs calculated from the hidden layer, ψ_k(m), are

ψ_k(m) = g_k( Σ_{j=1}^{J} W(2)_{k,j} H_j(m) + θ_k ),

where W(2)_{k,j} denotes the weight between the kth output node and the jth hidden node, J refers to the total number of nodes in the hidden layer, the index k runs over injuries, noninjuries, and incapacitating injuries, g_k is the activation function between the hidden and output layers, and θ_k is the bias term.
Generally, activation functions such as the sigmoid or tanh are selected; they transform the input signal into a certain range. If the network adopts the sigmoid function as the output activation function, the output is narrowed into the interval (0, 1); between the input and hidden layers, however, the tanh function is usually adopted because it tends to converge faster.
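One forward pass under these definitions can be sketched as follows (a minimal NumPy illustration with randomly initialized weights, standing in for the network described above):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 20, 12, 3     # inputs, hidden nodes, severity outputs in this paper

W1, b1 = rng.normal(size=(J, I)), rng.normal(size=J)   # input -> hidden
W2, b2 = rng.normal(size=(K, J)), rng.normal(size=K)   # hidden -> output

def forward(x):
    """H_j = tanh(sum_i W1[j, i] x_i + beta_j) for the hidden layer,
    then a sigmoid output, so the three class scores land in (0, 1)."""
    h = np.tanh(W1 @ x + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

psi = forward(rng.uniform(size=I))   # one standardized 20-feature record
print(psi)                           # three severity scores in (0, 1)
```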
Another aspect of building an NN is the definition of the number of nodes in the hidden layer, and there is no easy and complete mathematical way of defining this number; however, based on the experience of former research [20], the empirical equation used is as follows:

n = sqrt(n_i + n_0) + α,

where n is the number of hidden nodes, n_i is the number of input nodes, n_0 is the number of output nodes, and α is a constant varying from 1 to 10.
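A quick check of this rule for the 20 inputs and 3 outputs used here (truncating the fractional part is an assumption):

```python
import math

def hidden_nodes(n_i, n_0, alpha):
    """Empirical rule n = sqrt(n_i + n_0) + alpha, truncated to an
    integer; alpha is a constant varying from 1 to 10."""
    return int(math.sqrt(n_i + n_0) + alpha)

candidates = [hidden_nodes(20, 3, a) for a in range(1, 11)]
print(candidates)   # -> [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
```

This reproduces the range of 5 through 14 hidden nodes tested later in the paper.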

Discrete Dynamics in Nature and Society
Based on the BP method [17], the local gradients δ(2)_k(m) and δ(1)_j(m) of the output and hidden layer neurons and the correction values Δw(2)_{k,j}(m) and Δw(1)_{j,i}(m) of their connection weights are as follows:

δ(2)_k(m) = [ψ̂_k(m) − ψ_k(m)] g_k'(·),
δ(1)_j(m) = g_j'(·) Σ_k δ(2)_k(m) W(2)_{k,j},
Δw(2)_{k,j}(m) = a(m) Δw(2)_{k,j}(m − 1) + η(m) δ(2)_k(m) H_j(m),
Δw(1)_{j,i}(m) = a(m) Δw(1)_{j,i}(m − 1) + η(m) δ(1)_j(m) x_i(m),

where ψ̂_k(m) is the target output and a(m) and η(m) are the momentum and step size, respectively. The weights in the network can then be updated as

W(2)_{k,j}(m + 1) = W(2)_{k,j}(m) + Δw(2)_{k,j}(m),
W(1)_{j,i}(m + 1) = W(1)_{j,i}(m) + Δw(1)_{j,i}(m).

Other than the general BP method, there are other modified training functions, including resilient backpropagation (RPROP) and conjugate gradient backpropagation. In the traditional BP method, the initialized weights in formulas (6) and (7) are randomly drawn from two different uniform distributions. However, this implementation can easily cause the whole model solution to stop at a local optimum. To address this problem, the authors propose a particle swarm optimization method with refined adaptive inertial weights to enhance both the local and global search for the optimal initial BPNN weights. The details are provided in the following section.

PSO with Adaptive Inertial Weights.
The particle swarm optimization method is a well-known metaheuristic computation method proposed in 1995 [21] and is easy to apply to different optimization problems [22]. The standard and original formulas for this method are

V^{r+1}_s = V^r_s + c_1 λ_1 (pbest_s − U^r_s) + c_2 λ_2 (gbest − U^r_s),
U^{r+1}_s = U^r_s + V^{r+1}_s,

where U^r_s is the sth particle at the rth generation and V^{r+1}_s denotes this particle's velocity toward the (r + 1)th generation. c_1 and c_2 are two constants usually set to around 2, λ_1 and λ_2 are two uniform random numbers in the range [0, 1], pbest is the best position experienced by the particle itself, and gbest is the best position experienced by the particle swarm. In this paper, the initial weights of the BPNN are treated as the particles in the PSO algorithm, and the optimization problem can be described as mapping a decision space X to Y; for a typical three-layer neural network, the particles are encoded as

min Y = MSE(W_1, B_1, W_2, B_2),

where W_1 and B_1 refer to the weights and biases connecting the input and hidden layers, while W_2 and B_2 refer to the weights and biases connecting the hidden and output layers. The objective function Y is the neural network mean squared error (MSE). The standard or original PSO can easily solve nonlinear or nondifferentiable problems, but the search space of a particular particle is almost fixed during each generation, which means the model quickly finds a solution that may lie near a local optimum. Thus, to handle the tradeoff between the local and global search abilities [23], an inertial weight is introduced into formula (14), which could be written as follows:

V^{r+1}_s = w_s V^r_s + c_1 λ_1 (pbest_s − U^r_s) + c_2 λ_2 (gbest − U^r_s),

where w_s is the inertial weight controlling the global and local optimal search speed; it is iterated during each generation in a linear or nonlinear form.
Thus, in this paper, different inertial weight settings [23-25], including linear and nonlinear forms, are compared as follows:

w_s = w_end + (w_start − w_end) (S_max − s)/S_max,
w_s = w_end + (w_start − w_end) [(S_max − s)/S_max]^2,
w_s = w_end + (w_start − w_end) [(S_max − s)/S_max]^{1/2},

where w_start and w_end refer to the weights at the start and the end of the generations and are usually set to 0.9 and 0.4, respectively, s is the current iteration, and S_max is the maximum generation. The function graphs for formulas (18)-(20) are depicted in Figure 4. From Figure 4, we can see that the weight has a relatively high value at first in order to expand the search space for the global optimum; toward the end of the iterations, it converges slowly to enhance the local optimum search.
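The PSO loop with an adaptive inertial weight can be sketched as follows; the quadratic decay schedule and the sphere-function stand-in for the BPNN MSE are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def fitness(u):
    """Stand-in objective; in the paper this is the BPNN mean squared
    error evaluated at the encoded weight/bias vector u."""
    return float(np.sum(u ** 2))

S, D, S_max = 20, 10, 100                 # swarm size, dimensions, generations
c1 = c2 = 2.0
w_start, w_end = 0.9, 0.4

U = rng.uniform(-5.0, 5.0, size=(S, D))   # particles = candidate weight vectors
V = np.zeros((S, D))
pbest, pbest_val = U.copy(), np.array([fitness(u) for u in U])
gbest = pbest[pbest_val.argmin()].copy()

for s in range(S_max):
    # Adaptive inertia: large early (global search), decaying toward
    # w_end (local search); quadratic decay is used here for illustration.
    w = w_end + (w_start - w_end) * ((S_max - s) / S_max) ** 2
    l1 = rng.uniform(size=(S, D))
    l2 = rng.uniform(size=(S, D))
    V = w * V + c1 * l1 * (pbest - U) + c2 * l2 * (gbest - U)
    U = U + V
    vals = np.array([fitness(u) for u in U])
    better = vals < pbest_val
    pbest[better], pbest_val[better] = U[better], vals[better]
    gbest = pbest[pbest_val.argmin()].copy()

print(fitness(gbest))   # far below the initial swarm values
```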
In conclusion, the pseudocode for the whole procedure in Sections 3.1 and 3.2 is formulated as follows:

Figure 3: General structure of a typical NN used in this study (risk factors as inputs, a hidden layer, and severity levels as outputs).

The calculation process is given in Figure 5.

Factor Analysis.
To carry out a factor analysis (FA), the factor importance index (FII) is introduced in this paper. Given the nonlinear classification function used in practice, the n-dimensional input vectors construct the whole input space. Thus, from the perspective of engineering mathematics, the first-order partial derivative of the outcome Y with respect to the ith variable x_i explains the connecting weight of that input variable, according to the chain rule in calculus:

∂ψ_k/∂x_i = ∂g(α)/∂x_i,

where α is the linear output value from the hidden layer to the output layer in the BPNN, as described in formula (9); thus, equation (21) is transformed to

∂ψ_k/∂x_i = g'(α) Σ_j W(2)_{k,j} ∂H_j/∂x_i,

where j indexes the nodes in the hidden layer, W(2)_{k,j} refers to the weights from the hidden layer to the output layer, and g'(α) denotes the derivative of the activation function between the output and hidden layers.
Considering formula (8) combined with the chain rule in calculus, formula (22) can be written as

∂ψ_k/∂x_i = g'(α) Σ_j W(2)_{k,j} f'(β_j) W(1)_{j,i},

where β is shorthand for the hidden-layer linear combination in formula (8), W(1)_{j,i} refers to the weights from the input layer to the hidden layer, and f'(β) denotes the derivative of the activation function between the input and hidden layers.
Through formula (24), we can see that, for a fixed input vector, the values of f'(β) and g'(α) are fixed with respect to all x_i; thus, considering the remaining part, the FII for the ith variable can be written as

FII_i = Σ_j W(2)_{k,j} W(1)_{j,i},

where the values of W(1)_{j,i} and W(2)_{k,j} can be calculated through formulas (12) and (13). These values are stored in a dictionary in the program code during the calculation process.
Finally, in order to ease the simulation variance of the model training process, the FII expectation is introduced by running the model k times:

E(FII_i) = (1/k) Σ_{t=1}^{k} FII_i^{(t)}.
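The FII computation can be sketched as follows; aggregating over the output classes by absolute value, and substituting random matrices for trained weights, are illustrative assumptions:

```python
import numpy as np

def factor_importance(W1, W2):
    """FII_i = sum_j W2[k, j] * W1[j, i], aggregated over the output
    classes k by absolute value; one score per input variable."""
    return np.sum(np.abs(W2 @ W1), axis=0)

rng = np.random.default_rng(3)
# Each run would come from one trained network; random J x I and K x J
# weight matrices stand in here for the 20-12-3 network.
runs = [factor_importance(rng.normal(size=(12, 20)), rng.normal(size=(3, 12)))
        for _ in range(100)]
fii = np.mean(runs, axis=0)          # E(FII_i) over the k repeated runs
print(np.argsort(fii)[::-1][:5])     # indices of the five top-ranked inputs
```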

NN Model Structure Test.
Based on the theory discussed in the previous section, the primary steps in building an NN are to define a good number of hidden layer nodes and a better training function. Usually, the model performance (mean squared error, MSE) combined with the total number of iterations until convergence is used to test the structure. For the number of hidden layer nodes, based on formula (10), the consecutive numbers 5 through 14 were candidates; for the training function, one of the functions in Table 2 is applied. The last two methods in Table 2 usually provide fast calculation but tend to be challenging and inefficient when dealing with big data, especially on GPU hardware with a low configuration. For the present study, considering the sample size, GPU support is not a problem, and thus this minor difference is not a significant concern. Theoretically, the number of nodes in the hidden layer should be within the range of 5 to 14 based on formula (10); usually, however, the number of hidden layer nodes does not fall below 10, and thus a combined test using the popular training functions and 10-14 hidden layer nodes was carried out in a loop.
We randomly separated the sample data from 2011 to 2014 into training, testing, and validation sets in proportions of 70%, 15%, and 15%. In detail, a total of 17839 samples were divided into 12487, 2676, and 2676 for training, testing, and validation, respectively. The outcome is shown in Table 3.
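The random split can be sketched as follows (the exact rounding that produced the 12487/2676/2676 split is an assumption):

```python
import numpy as np

def split_70_15_15(n, seed=0):
    """Shuffle n sample indices and cut them 70% / 15% / 15% for
    training, testing, and validation."""
    idx = np.random.default_rng(seed).permutation(n)
    a, b = int(0.70 * n), int(0.85 * n)
    return idx[:a], idx[a:b], idx[b:]

train, test, val = split_70_15_15(17839)
print(len(train), len(test), len(val))   # -> 12487 2676 2676
```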
It can be seen from Table 3 that 12 and 14 hidden layer nodes achieve the best performance (in other words, the lowest MSE) and that the BR training function has the lowest MSE. However, the gradient values of GDA, GDX (with an adaptive learning rate), and LM decrease quickly during the very early validation stage and thus converge quickly to the preset goal, whereas methods such as BR, GDM, and GD converge slowly, requiring more validation than usual. Thus, in conclusion, GDA, GDX, or LM is the better choice, and considering the lack of GPU support in the simulation environment, we selected Levenberg-Marquardt backpropagation (LM) as the training function. During the node number test, LM with 12 hidden layer nodes showed the best performance.

Results for PSO Optimization.
After setting the adaptive inertial weights for the PSO optimizer, the performance through each iteration can be summarized as follows.
To eliminate the variance from the random data separation, a 100-time simulation was conducted, and the average performance of each method is described in Table 4.
In Figure 6 and Table 4, the performance refers to the MSE of the training model, the training accuracy refers to the average classification accuracy on the training set (17839 samples), and the prediction accuracy refers to the classification accuracy on the 4984 samples from 2015. Although the fitness (MSE) of PSO with w_1 drops fast in the early stage (starting at almost 22), it converges poorly and shows poor performance and training/testing accuracy, which means it lacks a certain ability for local optimum searching. On the other hand, PSO with w_2 and w_3 (the two nonlinear formulas) outperforms the other methods not only in the global optimum search (both start from 28, while the NN with standard PSO drops from nearly 40) but also in the local optimum search (with good performance and accuracy), while the plain NN and the NN with standard PSO are at a middle level with acceptable results. It can be assumed that, with more data flushing in, the PSO with nonlinear weights will show a dominant advantage over the other methods. As shown in Table 4, the training accuracy is better than the prediction results; the relatively low sample number may be a major contributor.

Table 2 (excerpt): Conjugate gradient backpropagation with Polak-Ribiere updates (CGP); 5. Gradient descent backpropagation (GD); 6. Gradient descent with adaptive learning rate backpropagation (GDA); 7. Gradient descent with momentum (GDM); 8. Gradient descent with momentum and adaptive learning rate backpropagation (GDX); 9. One-step secant backpropagation (OSS); 10. RPROP backpropagation (RP); 11. Scaled conjugate gradient backpropagation (SCG); 12. Levenberg-Marquardt backpropagation (LM); 13. Bayesian regulation backpropagation (BR).

Figure 5: Calculation process for the method used in this study.
Notably, the performance of all categories scales similarly with the sample size, and it can be concluded that, although applying the model to other samples could result in marginally worse prediction accuracy, the model explains the variance and the contributing factors well.

Results for Factor Analysis.
The final results for the two PSO variants with nonlinear adaptive inertial weights are described in Table 5.
From Figure 7 and Table 5, we can see that the two methods (PSO with w_2 and PSO with w_3) provide almost the same ranking for each variable, despite some minor differences; for example, MONTH and LIGHT are ranked slightly higher by the second method, while DRIVER SEX is ranked slightly lower. Seven or eight of the 20 variables have a relative importance value exceeding 50%, including the vehicle year, driver age, accident type, month, location type, and road functional class (also LIGHT in the second method), with the vehicle year and driver age taking the top two values. Additionally, the road-related factors such as the curve and gradient variables have the least importance, and the only important road factor contributing to the severity prediction is the road functional class, which indicates the type of road on which a crash occurred. Thus, it can be concluded that the road-related variables contribute less than the nonroad-related variables, particularly compared with the vehicle factors (vehicle year and type).
Therefore, policy makers should pay more attention to vehicle and driver regulation rules, as well as road design, to reduce the possible severity level in the future.
Driver age and month are two other important factors in predicting crash severity. From the sample data, we can see that the most severe crashes occur during the winter in Washington State (December, January, and February), and that drivers below the age of 25 and above the age of 60 are more prone to severe injury crashes. The month effect may reflect the rainy season in the mountainous Seattle area, whereas the age effect may derive from the fact that younger and older people are more prone to making severe mistakes.

Conclusions
In this study, a thorough artificial neural network (ANN) approach was developed to address the problems of crash severity modeling and factor analysis (FA). Besides testing different training structures and methods, more importantly, a nonlinear adaptive PSO optimization method was proposed to solve the tradeoff between the global and local search abilities noted in previous studies. The detailed tests of the different algorithms confirmed our hypothesis. The additional contributing factor analysis also offers a different point of view compared with former statistical analyses. The main conclusions are as follows: (1) twelve hidden layer nodes fit the model developed in this paper well, and the BP method (Levenberg-Marquardt) can be better utilized when aided by fast hardware; (2) the simulation results showed that the PSO optimizer with nonlinear adaptive inertial weights outperforms the standard PSO and the PSO with linear adaptive inertial weights; (3) through the factor analysis (FA), it was found that, among all 20 variables, the nonroad-related variables account for most of the severity prediction variance; the rainy mountainous area around Seattle may explain the importance of the month factor, as may the impact of driver age, where younger and older people are more prone to severe crashes. The main innovations are as follows: (1) traditional studies often used statistical methods such as Poisson regression, negative binomial regression, and generalized logit or probit models for the identification and mathematical quantification of the internal triggers and their impact on crash severity, whereas this paper utilized FA as the analytical tool, which is unusual in current crash severity research; we believe this attempt extends the methods of crash severity analysis, and more research could be conducted in future work.
(2) FA, as a traditional statistical instrument, can also serve as a powerful explanatory tool in the last stage of the model, as our work has demonstrated. The application of FA in this paper indicates that basic statistical methods are still useful and efficient, whereas AI methods sometimes do not offer an agreeable explanation of the inner mechanism of the data.
The method developed in this study can be applied to big data analyses of traffic accidents and can serve as a fast, useful tool for policy makers and traffic safety researchers. The authors recognize that much can be further investigated. In this paper, only crash severity was discussed; further research could be conducted from the perspective of the collision type (e.g., head-on and rear-end collisions). Besides, the dataset could be enlarged in future research to improve the accuracy.

Data Availability
The dataset used in this study was made up of data requested from the Highway Safety Information System (HSIS); requests for access to these data should be made by filling out the form at the following link: https://www.hsisinfo.org/datarequest.cfm.

Conflicts of Interest
The authors declare no conflicts of interest.