The Application of Data Mining Technology to Build a Forecasting Model for Classification of Road Traffic Accidents

With the ever-increasing number of vehicles on the road, traffic accidents have also increased, resulting in the loss of lives and property, as well as immeasurable social costs. The environment, time, and region influence the occurrence of traffic accidents. The loss of life and property is expected to be reduced by improving traffic engineering, education, and administration of law and advocacy. This study observed 2,471 traffic accidents which occurred in central Taiwan from January to December 2011 and used Recursive Feature Elimination (RFE), a Feature Selection method, to screen the important factors affecting traffic accidents. It then established models to analyze traffic accidents with various methods, such as Fuzzy Robust Principal Component Analysis (FRPCA), Backpropagation Neural Network (BPNN), and Logistic Regression (LR). The proposed model aims to probe into the environments of traffic accidents, as well as the relationships between the variables of road designs, rule-violation items, and accident types. The results showed that the accuracy rates of the FRPCA-BPNN (85.89%) and FRPCA-LR (85.14%) classifiers are higher than those of BPNN (84.37%) and LR (85.06%) by 1.52 and 0.08 percentage points, respectively, indicating that combining FRPCA with BPNN or LR improves classification prediction.


Introduction
As the demand for vehicles rises, the number of vehicles on the road increases greatly and traffic jams worsen, especially during rush hours; thus, traffic accidents are more likely to occur. As accidents have become more severe, the traffic problem has become a topic of concern in Taiwan. The statistics of the Ministry of Health and Welfare (2013) indicated that accidental injury is the sixth major cause of death in Taiwan, with 6,873 deaths from accidental injuries. Most traffic accidents are caused by improper driving behaviors, and one of the major reasons is that drivers fail to pay attention to the road ahead (Ministry of Transportation, 2008).
According to the statistical data of the National Police Agency (2012), the number of fatal road traffic accidents in Taichung City was second only to that in Kaohsiung City. In 2012, the number of traffic accidents causing death was 208, and the death toll was 210. In 2012, the number of traffic accidents causing death was 198, the death toll was 203, and the rate of change in the number of accidents was −3.41%. Due to the increase in urban population, at a growth rate of 0.76% in 2012, the occurrence rate of road traffic accidents increased accordingly. In 2011, accidental injury ranked sixth among the ten major causes of death in Taiwan, and the death toll from motor vehicle accidents was about 30, accounting for 17.3% of the death rate per 100,000 persons (MOHW, 2012). This study uses the traffic accident data from the NPA of the region from January to December 2012 as the data source. The data content includes 17 items, such as weather, lighting, road category, speed limit, road type, accident site, road conditions, and roadblocks. There are 2,471 original observations.
According to previous transportation research [1][2][3][4][5], the causes of traffic accidents are mostly human factors, such as speeding, violation of signals, and drunk driving, as well as the interaction between road environments and traffic engineering facilities. This study identifies the key factors that affect traffic accidents using Feature Selection and establishes models to analyze traffic accidents and their types with various methods, such as Fuzzy Robust Principal Component Analysis (FRPCA), Backpropagation Neural Network (BPNN), and Logistic Regression (LR). The environments of traffic accidents and the variables of road designs identified by the model could serve as a reference for the police force and regulatory authorities to design plans and improve traffic safety, thus decreasing the ratio of traffic accidents, damage to property, and loss of lives.
With the advancement of information technology, data mining has matured, and useful information can be found in databases without preconditions. Relational models can be built to determine the correlation between characterization factors of traffic accidents and casualties. This study uses the Recursive Feature Elimination (RFE) method of Feature Selection, together with FRPCA, BPNN, and LR, to determine important factors influencing traffic accidents. The results can provide suggestions for reducing the occurrence of traffic accidents. Finally, the model was statistically evaluated.
The remainder of this paper is organized as follows. Section 2 reviews the literature concerning the severity of injuries in traffic accidents; Section 3 presents the FRPCA; Section 4 discusses the research data; Section 5 offers conclusions.

Material and Methods
Many studies have focused on forecasting and modeling traffic accidents and analyzed the results. The results suggest that the significant factors influencing the occurrence of accidents must be eliminated or controlled in order to prevent traffic accidents and reduce injuries and deaths.
In terms of research methods, most studies use BPNN or LR to forecast or model analysis results [6][7][8][9]. Gang and Zhuping [10] suggested that the PSO-SVM is better than BPNN in traffic safety forecasting. Chang et al. [6] used the established modeling method and LR to discuss the contributing factors and conditions of driving after drinking. The analysis results showed that law enforcement, drivers' drinking habits, and regulatory knowledge of drunk driving clearly influenced drivers' choice to drive drunk. Kong and Yang [8] applied LR to casualties and driving speed in traffic accident survey data and found that, in collisions between vehicles and pedestrians, the risk of pedestrian death was 26% when the vehicle's speed was 50 km/h, 50% when the speed was 58 km/h, and 82% when the speed was 70 km/h. However, the analysis result showed that age was not a major risk factor in death. Fu and Zhou [7] pointed out that the traditional BPNN has some defects, such as local minima, too many iterations, and slow training. Therefore, the improved LM-BP neural network was used for forecasting. The forecast results of traffic accidents, death toll, and amount of direct economic loss were significant; thus, the BP network is applicable to traffic accident forecasting.
A number of recent studies have used data mining or statistical methods [11][12][13][14][15]. Karacasua and Er [14] used chi-square significance testing to analyze whether people of the same age and gender have similar traffic accidents, as well as the correlation among education, age, gender, and psychology. The findings showed that (1) males were more prone to traffic accidents than females; (2) driving while intoxicated and speeding were major causes. Kanchan et al. [13] used statistical software for analysis and found that the injured were mostly male, and the major causes of death were head and abdominal injuries. Traffic accidents are a significant public health hazard; thus, first aid should be strengthened, and traffic regulations and health education should be strictly implemented. Kashani et al. [16] used classification and regression trees (CART) to analyze traffic collision data. The results showed that improper passing and not using seat belts were the most important factors influencing the severity of injuries. De Oña et al. [15] used Latent Class Cluster (LCC) to reduce the heterogeneity of traffic accident data and combined it with Bayesian Networks (BNs) to recognize major factors. The results indicated that weather factors, pavement markings, and road width were significant factors.
Based on the above discussion, this study uses BPNN, LR, and statistical methods in a way that differs from previous studies, focusing on accident patterns and types. The FRPCA is used for data preprocessing and is combined with the BPNN and LR models in order to compare the forecasting performance of four classification models (BPNN, LR, FRPCA-BPNN, and FRPCA-LR).

Data Preprocessing of Feature Selection.
This study uses RFE as the Feature Selection method, following the principle proposed by Guyon et al. [17]. Guyon et al. used RFE to select the key feature set, which not only shortens classification computing time but also improves the classification accuracy rate. RFE calculates a weight for each feature and orders the features by these weights as the basis of classification. RFE is an iterative process that eliminates features backwards, and its feature set screening procedure is as follows:
(1) Use the current data set to train the classifier.
(2) Calculate the weight of each feature.
(3) Delete the feature with the minimum weight.
The iterative process ends when one feature remains. The result is a list of features ordered by weight: unimportant or uncorrelated features are eliminated first and are therefore listed at the end, whereas the most important features are eliminated last and are listed at the front [18,19]. In each cycle, a feature ordering is obtained, the feature with the smallest squared weight is removed, and the classifier is retrained on the remaining features to obtain a new ordering. RFE repeats this process until the complete feature order list is obtained [20]. It is noteworthy that a single top-ranked feature does not necessarily give the classifier its best classification performance; rather, a combination of multiple features enables the classifier to obtain the optimum classification performance. Therefore, the RFE algorithm can select the most complementary feature combination [4].
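The three-step loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study does not say which classifier supplied the feature weights, so a simple perceptron stands in for it here, and the function names `train_linear` and `rfe_ranking` as well as the toy data are hypothetical.

```python
# Minimal sketch of Recursive Feature Elimination (RFE), standard library only.
# Assumption: a perceptron stands in for the classifier that supplies the
# feature weights; the study does not state which classifier was used.

def train_linear(X, y, epochs=50, lr=0.1):
    """Train a perceptron on rows X with labels y in {-1, +1}; return weights."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified sample: move the boundary
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w


def rfe_ranking(X, y):
    """Return feature indices ordered from most to least important."""
    remaining = list(range(len(X[0])))
    eliminated = []  # least important feature is appended first
    while len(remaining) > 1:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear(X_sub, y)                      # (1) train the classifier
        scores = [wj * wj for wj in w]                  # (2) squared weight per feature
        worst = min(range(len(scores)), key=scores.__getitem__)
        eliminated.append(remaining.pop(worst))         # (3) delete the minimum
    eliminated.append(remaining[0])
    return eliminated[::-1]


# Hypothetical toy data: feature 0 carries the label, feature 1 is weak noise,
# and feature 2 is constant.
X = [[2.0, 0.1, 0.0], [1.5, -0.2, 0.0], [-2.0, 0.2, 0.0], [-1.0, -0.1, 0.0]]
y = [1, 1, -1, -1]
ranking = rfe_ranking(X, y)
```

Retraining after every deletion is what distinguishes RFE from a one-shot ranking: the weight of a feature can change once a correlated feature has been removed.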

Backpropagation Neural Network (BPNN).
BPNN is the most frequently used supervised learning method among the neural networks and is highly effective in classification problems [21]. The parameters are divided into structural parameters and learning parameters. The structural parameters include the number of hidden layers, while the learning parameters include the learning rate, initial weight range, and momentum term. Generally, trial and error is used to determine the optimal values of the structural and learning parameters. The most commonly used nonlinear transfer function in the hidden layer of BPNN is the log-sigmoid transfer function, which maps neuron inputs ranging from negative infinity to positive infinity to an output between 0 and 1.
An alternative for the hidden layer is the tangent sigmoid transfer function "tansig." The linear transfer function "purelin" is used in the output layer. If the sigmoid transfer function is used in the output layer, the network output is restricted to a very small range. If the linear transfer function is used in the output layer, the network output can be an arbitrary value.
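The following sketch illustrates the three transfer functions named above and a single forward pass through one hidden layer with a linear output. The function `forward` and its argument names are illustrative assumptions, not the network configuration used in the study.

```python
import math


def logsig(x):
    """Log-sigmoid: maps any real input into the open interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # stable branch for large negative inputs
    return z / (1.0 + z)


def tansig(x):
    """Tangent sigmoid: maps any real input into the open interval (-1, 1)."""
    return math.tanh(x)


def purelin(x):
    """Linear transfer: the output equals the input, so it is unbounded."""
    return x


def forward(x, hidden_w, hidden_b, out_w, out_b, hidden_fn=logsig):
    """One hidden layer with `hidden_fn`, followed by a single purelin output."""
    hidden = [
        hidden_fn(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    return purelin(sum(w * h for w, h in zip(out_w, hidden)) + out_b)


# Hypothetical weights: zero hidden weights make every hidden unit output 0.5.
out = forward([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [2.0, 4.0], 1.0)
```

With a purelin output the result is not squashed, which is why the example output can exceed 1 even though every hidden activation lies in (0, 1).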

LR.
LR can be used to analyze one or several predictor variables when the outcome is binary (e.g., existence or nonexistence of an event) [22,23]. LR is derived from the cumulative probability function of the logistic model and is a linear probability model, which is similar to a linear regression model. The difference is that LR can handle a dependent variable on a nominal scale, where the dependent variable is discrete, especially in binary classification. The purpose of LR is to establish the simplest and best-fitting model. Furthermore, it can be used in a practical model to forecast the relationships between a dependent variable and a set of predictor variables, where each explanatory variable can be categorical or continuous.
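For reference, the standard binary logistic model can be written as follows. The symbols $p$, $\beta_0$, $\beta_k$, and $x_k$ are generic textbook notation introduced here for illustration and are not taken from the study.

```latex
% Probability of the event given K explanatory variables x_1, ..., x_K
p = P(y = 1 \mid x) = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{k=1}^{K} \beta_k x_k\right)\right)}

% Equivalent linear form on the log-odds (logit) scale
\ln\frac{p}{1 - p} = \beta_0 + \sum_{k=1}^{K} \beta_k x_k

% A one-unit increase in x_k multiplies the odds p/(1-p) by the odds ratio
\mathrm{OR}_k = e^{\beta_k}
```

The second line shows why the model resembles linear regression: it is linear in the coefficients once the probability is expressed as log-odds.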

FRPCA.
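The nonlinear FRPCA algorithm is derived from the linear fuzzy principal component analysis algorithm introduced by Yang and Wang [24] and the nonlinear criteria in blind source separation of Karhunen et al. [25]. The robust principal component analysis proposed by Yang and Wang builds on the principal component analysis learning rule and energy function proposed by Xu and Yuille [26], to which a fuzzy objective function is added. These methods are briefly introduced as follows. Xu and Yuille [26] proposed the following objective function subject to the constraint $u_i \in \{0, 1\}$:

$$E(U, w) = \sum_{i=1}^{n} u_i\, e(x_i) + \eta \sum_{i=1}^{n} (1 - u_i).$$

The objective is to minimize $E(U, w)$, where $X = \{x_1, x_2, \ldots, x_n\}$ is the data set, $U = \{u_i \mid i = 1, \ldots, n\}$ is the membership set, and $\eta$ is the threshold. Because $u_i$ is a binary variable and $w$ is a continuous variable, the problem is difficult to solve with the gradient descent method; thus, they transformed the minimization problem into the maximization of the Gibbs distribution

$$P(U, w) = \frac{\exp(-\gamma E(U, w))}{Z},$$

where $Z$ is the partition function ensuring $\sum_{U} \int_{w} P(U, w) = 1$. The error measure $e(x_i)$ can be one of the following functions:

$$e_1(x_i) = \left\| x_i - w^{T} x_i\, w \right\|^2,$$

$$e_2(x_i) = \| x_i \|^2 - \frac{\| w^{T} x_i \|^2}{\| w \|^2} = x_i^{T} x_i - \frac{w^{T} x_i x_i^{T} w}{w^{T} w}.$$

The gradient descent rules for minimizing $E_1 = \sum_{i=1}^{n} e_1(x_i)$ and $E_2 = \sum_{i=1}^{n} e_2(x_i)$ are

$$w^{\mathrm{new}} = w^{\mathrm{old}} + \alpha_t \left[ y (x_i - u) + (y - v) x_i \right],$$

$$w^{\mathrm{new}} = w^{\mathrm{old}} + \alpha_t \left( x_i y - \frac{w}{w^{T} w} y^2 \right),$$

where $\alpha_t$ is the learning rate, $y = w^{T} x_i$, $u = y w$, and $v = w^{T} u$. Therefore, Yang and Wang [24] proposed a new objective function:

$$E = \sum_{i=1}^{n} u_i^{m_1} e(x_i) + \eta \sum_{i=1}^{n} (1 - u_i)^{m_1}. \qquad (2)$$

The constraints are $u_i \in [0, 1]$ and $m_1 \in [1, \infty)$, where $u_i$ is the membership value of $x_i$ in the data cluster, $(1 - u_i)$ is the membership value of $x_i$ in the noise cluster, and $m_1$ is the fuzziness variable. In this case, $e(x_i)$ is the error between the measured $x_i$ and the cluster center, which is similar to the $c$-means algorithm [27].
As $u_i$ is a continuous variable, the difficulty of mixing discrete and continuous optimization is avoided; thus, the gradient descent method can be used. First, the gradient of (2) with respect to $u_i$ is set to zero, which gives

$$u_i = \frac{1}{1 + \left( e(x_i)/\eta \right)^{1/(m_1 - 1)}}.$$

Substituting this membership into (2), the following equation is obtained:

$$E = \sum_{i=1}^{n} \left( \frac{1}{1 + \left( e(x_i)/\eta \right)^{1/(m_1 - 1)}} \right)^{m_1 - 1} e(x_i).$$

On the other hand, the gradient with respect to $w$ is

$$\frac{\partial E}{\partial w} = \beta(x_i)\, \frac{\partial e(x_i)}{\partial w}, \qquad \beta(x_i) = \left( \frac{1}{1 + \left( e(x_i)/\eta \right)^{1/(m_1 - 1)}} \right)^{m_1},$$

where $m_1$ is the fuzziness variable. If $m_1 = 1$, the fuzzy membership reduces to a hard membership and can be determined by the following rule:

$$u_i = \begin{cases} 1, & e(x_i) < \eta, \\ 0, & \text{otherwise}. \end{cases}$$

In this case, $\eta$ is the hard threshold. There is no fixed rule for setting $m_1$, but $m_1 = 2$ in most studies. Yang and Wang [24] derived the following optimization algorithm.
(1) The iteration count is set as $t = 1$, the iteration bound is $T$, the learning coefficient is $\alpha_0 \in (0, 1]$, the soft threshold $\eta$ is a small positive value, and the weight $w$ is randomly initialized.
(2) While $t$ is less than $T$, execute step (3) to step (9).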
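The role of the fuzziness variable and the threshold can be illustrated with a short sketch of the membership rule. It assumes the closed form obtained by setting the gradient of the fuzzy objective with respect to the membership to zero, namely $u_i = 1 / (1 + (e(x_i)/\eta)^{1/(m_1 - 1)})$, together with the hard-threshold rule for $m_1 = 1$; the function names are hypothetical, and the sketch does not reproduce the remaining steps of the algorithm.

```python
def fuzzy_membership(error, eta, m1=2.0):
    """Membership of a sample in the data cluster, given its error e(x_i).

    Assumed closed form: u = 1 / (1 + (error / eta) ** (1 / (m1 - 1))).
    For m1 == 1 the rule degenerates to a hard threshold at eta.
    """
    if error < 0 or eta <= 0 or m1 < 1:
        raise ValueError("expected error >= 0, eta > 0, m1 >= 1")
    if m1 == 1.0:
        return 1.0 if error < eta else 0.0
    return 1.0 / (1.0 + (error / eta) ** (1.0 / (m1 - 1.0)))


def gradient_weight(error, eta, m1=2.0):
    """Factor beta(x_i) = u_i ** m1 that scales a sample's gradient w.r.t. w."""
    return fuzzy_membership(error, eta, m1) ** m1


# A sample whose error equals the threshold is half data, half noise.
u_at_threshold = fuzzy_membership(1.0, eta=1.0)
# Samples with a large error (likely outliers) contribute little to the update.
beta_outlier = gradient_weight(100.0, eta=1.0)
```

Because the gradient factor shrinks as the error grows, samples far from the principal direction are down-weighted rather than discarded, which is the source of the robustness.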

Case Study
This section is divided into three parts: collection of traffic accident data, preprocessing of the research data, and substituting the data after Feature Selection in FRPCA, BPNN, and LR. Four groups of results are obtained, for BPNN, LR, FRPCA-BPNN, and FRPCA-LR.
Table 2 (excerpt), variable and code index: Signal type, 13; Signal action, 14; Separation facility, 15; In fast or general lane, 16; In fast and slow lanes, 17.

Empirical Research Result
Each experiment is repeated 5 times, with the results of the three experiments shown in Table 5. Experimental combination 2: the FRPCA-BPNN forecasting model differs from experimental combination 1 in that the data are first converted into principal component scores by executing FRPCA, and then the BPNN forecasting network is built. The experimental results show the learning rate of the lr = empirical results of BPNN versus FRPCA-BPNN forecasting models.

LR Analysis.
Experimental combination 3: the LR classifier is constructed, and the data are imported into LR for classification forecasting according to the aforesaid test data set 1. Experimental combination 4: the FRPCA-LR classifier is constructed as in experimental combination 2; test data set 1 is converted into principal component scores by FRPCA, and then the LR classifier is constructed. Each experiment is conducted five times; the experimental results are shown in Table 6. The experimental results show that the average accuracy rate and standard deviation of the FRPCA-LR model are 0.8514 ± 0.0031, which is better than the 0.8506 ± 0.0021 of the LR classification model. The LR model investigates the impacts of the various factors on the types and patterns of traffic accidents. The optimal model is shown in Table 7. This section discusses the correlation between the vehicles and the environment, based on 2,325 observations. After the deletion of 146 observations involving persons and vehicles, the dependent variable is divided into the two categories of "vehicle to vehicle" and "vehicle in itself" according to the types and patterns of traffic accidents. The odds ratio is adopted to represent the influence of Event A on the occurrence of Event B. In Table 7, the odds ratio of crossroads among the road types is 3.01, meaning that the risk of traffic accidents on "crossroads" is higher than that of one-way roads. In other words, the ratio of traffic accidents at intersections is 3.24 times that of one-way roads. Likewise, the ratio of traffic accidents on "intersections" under the category of "traffic locations" is also the highest, at 7.24 times that of general roads: $e^{5.72 - 3.74} = e^{1.98} \approx 7.24$. Table 7 shows the importance of traffic accident environments and road design to the ratio of traffic accidents.
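The odds ratio quoted for intersections can be reproduced with a short calculation. The sketch assumes that 5.72 and 3.74 are the LR coefficients of the "intersections" and "general roads" categories, as the exponent 5.72 − 3.74 = 1.98 suggests; the function name `odds_ratio` is illustrative.

```python
import math


def odds_ratio(coef, reference_coef=0.0):
    """Odds ratio of one category relative to another in a logistic model.

    The ratio is exp(coef - reference_coef); with the default reference of 0
    it reduces to exp(coef), the odds ratio against the baseline category.
    """
    return math.exp(coef - reference_coef)


# Assumed coefficients: 5.72 for intersections, 3.74 for general roads.
intersection_vs_general = odds_ratio(5.72, 3.74)
print(f"odds ratio = {intersection_vs_general:.2f}")
```

Because the coefficients live on the log-odds scale, a difference of 1.98 between two categories translates into a multiplicative factor on the odds, not an additive change in probability.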

Conclusions
Traffic safety depends on road design, road configuration, vehicle performance, traffic regulations, and the effectiveness of their implementation. The main means of transport in middle and low income countries include walking, bicycles, motorcycles, and buses, while that of high income countries is the automobile. Therefore, the traffic safety control measures of high income countries are not completely applicable to middle and low income countries and thus should be adapted to fit local transportation and road usage conditions when they are imported [29].
The report of the World Health Organization (WHO) indicated that about 1.2 million people die from traffic accidents in the world annually; about 3,400 people die from traffic accidents per day; approximately 1,000 people are injured or disabled; children, pedestrians, cyclists, and the elderly are the most vulnerable road users; and 85% of fatalities and 90% of the disabled live in middle and low income countries. The scientific analysis of accident data, as well as the implementation of relevant safety measures, can prevent the occurrence of traffic accidents, thereby reducing the severity of injuries.
At present, with the rapid development of cities, efficient forecasting is necessary so that decision makers can take preventive measures in advance and reduce the death rate. This study uses RFE, FRPCA, BPNN, and LR to analyze the classification accuracy rate of the traffic accident data of the region. According to the experimental results, the classification accuracy rates of BPNN, LR, FRPCA-BPNN, and FRPCA-LR are all higher than 80%; thus, forecast performance is significant. Further analysis shows that the FRPCA-BPNN and FRPCA-LR classifiers perform better than BPNN and LR. According to Tables 5 and 6, the accuracy rates of FRPCA-BPNN (85.89%) and FRPCA-LR (85.14%) are higher than those of BPNN (84.37%) and LR (85.06%) by 1.52 and 0.08 percentage points, respectively, meaning that FRPCA-BPNN and FRPCA-LR have better classification forecast ability.
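The reported gains follow directly from the four accuracy rates. The short check below recomputes them; the dictionary layout and the helper name `gain` are illustrative only.

```python
# Accuracy rates (%) reported for the four classifiers.
accuracy = {
    "BPNN": 84.37,
    "LR": 85.06,
    "FRPCA-BPNN": 85.89,
    "FRPCA-LR": 85.14,
}


def gain(base):
    """Accuracy gain, in percentage points, from adding FRPCA to `base`."""
    return round(accuracy["FRPCA-" + base] - accuracy[base], 2)


for base in ("BPNN", "LR"):
    print(f"FRPCA-{base} vs {base}: +{gain(base):.2f} percentage points")
```

The differences are absolute differences between percentages, so they are best read as percentage points rather than relative improvements.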
In traffic accident analysis or verification results, the human factor is mostly regarded as the first cause of traffic accidents. However, the road environment is also correlated with accidents, and improper intersection design or planning is likely to cause traffic accidents. In comparison to other traffic accident sites, a forked road intersection is the most probable accident site. This study used RFE to select 7 input variables from 17 input variables. Based on the 7 input variables, the environmental factor and road design are found to be causes of road traffic accidents in the region. According to the statistical data of the Taichung Police Station, the 4 main causes among the 67 causes of accidents are as follows: (1) not allowing other vehicles to pass as per regulations, (2) not being aware of the situation ahead, (3) violating specific sign (line) bans, and (4) not maintaining a safe driving distance, accounting for 23.72%, 13.54%, 7.51%, and 8.22%, respectively, of the total number of traffic accidents, for a combined proportion of 52.99%. The road authorities may refer to the 7 variables of traffic accidents and road design proposed in this study in future road designs and plans. The 4 main causes of accidents on road sections involving these seven variables should be the priorities of future enforcement actions by the police force. If improvements and preventive measures are made, road safety can be substantially increased, thereby reducing traffic accidents and fatalities. The findings can serve as a reference for the police force and management authorities in improving roads, as well as in the assessment and management models for the elimination of traffic offences.

2.1. Research Method.
This study uses four constructs, namely, (1) natural factors; (2) environmental factors; (3) road design; and (4) accident types and patterns of road traffic accident cases in Taichung City, to discuss the factors influencing the occurrence of road traffic accidents. The research structure is shown in Figure 1.

Table 1: Number of traffic accidents and output variable codes in the region in 2012.
This study uses RFE for data preprocessing. The 17 variables of the database are coded before Feature Selection, as shown in Table 2. The 17 variables are sequenced according to importance, as shown in Table 3.
There are 2,096 observations of Category (1) vehicle-vehicle; there are 146 observations of Category (2) person-automobile/motorcycle; and there are 229 observations of Category (3) automobile/motorcycle. The aforesaid categories are coded as output variables 1, 2, and 3, respectively, as listed in Table 1.
This study uses the first 7 variables in order of importance, as obtained by the Feature Selection of RFE, as input variables, while person-automobile/motorcycle, vehicle-vehicle, and automobile/motorcycle are the output variables, as shown in Table 4. The 7 input variables and 3 output variables are substituted in FRPCA, BPNN, and LR, respectively, in order to obtain the BPNN, LR, FRPCA-BPNN, and FRPCA-LR classifier models. The experimental procedure is as follows:
Step (1): 2,471 observations are used as test data, the 17 variables are sequenced according to their importance by using the RFE data preprocessing method, and the first 7 variables are rearranged as test data set 1.
Step (2): classifier construction.
The quantities involved in the principal component scores are the loading of the $i$th variable on the $j$th principal component, the weight of the $i$th variable on the $j$th principal component, the eigenvalue of the $j$th principal component (i.e., its variance), and the standard deviation $\hat{s}$ of the $i$th variable.
Experimental combination 1: the experimental parameters of the BPNN forecasting model are set as follows: epochs = 1000, and learning rate lr is 0.1, 0.3, and 0.5, respectively.

Table 3: Sequence of feature attributes of environmental factors.

Table 7: The impacts on the types and patterns of traffic accidents, as imposed by the various factors and represented in the LR model.