Because roadway and development factors are identified as the most influential contributors to road traffic accidents, investigating them could help reduce the accident frequency rate. However, previous works focused on the effect of roadway factors on the accident frequency rate using statistical analysis. The present study first aimed to evaluate the effect of roadway and development factors on the accident frequency rate on a rural road using ANOVA and Chi-square tests. Second, it aimed to develop a rural road safety risk index based on K-means clustering and Gaussian models. The findings indicated that the operating speed and the difference between the posted speed limit and the operating speed are the pivotal factors influencing the accident frequency rate. Moreover, clustering analysis of the roadway and development factors on the two-lane, two-way Borujerd-Khorramabad road yielded six clusters, identified as high-risk, relatively high-risk, moderate-risk, relatively low-risk, low-risk, and not risky (safe). Across the clusters, the accident frequency rate increased as the difference between the posted speed limit and the operating speed decreased from its value in the safe cluster. In addition, the risk index model based on the Gaussian model showed that the average reducing factor of the accident frequency rate was 0.99 per km/hr increase in the difference between the posted speed limit and the operating speed among the low-risk and safe clusters, while it was 1.17 in the risky and unsafe clusters. The comparison of the clusters revealed that the probability of accident occurrence was higher in the risky clusters than in the low-risk or safe clusters. Accordingly, the maximum and minimum values of the safety risk index were observed in the sixth and third clusters, respectively.
Road traffic accidents cost most countries 3% of their gross domestic product [
Other research studies focused on evaluating geometric variables such as the lane and shoulder width, pavement type, skid resistance, annual average daily traffic, spiral transitions, and passing behavior [
Therefore, this study aims to evaluate the effect of roadway and development factors on accident frequency on a rural road using ANOVA and Chi-square tests. Moreover, it develops a rural road safety risk index based on K-means clustering and Gaussian models to provide a technique that supports road safety analysis.
The organization of the remaining parts of the study is as follows. In Section
Several studies have examined how various factors affect driving safety and investigated the relationship between these factors and road accidents. Road accident data are classified as big data and include many attributes of each accident, such as driver attributes, environmental causes, traffic, vehicle, and geometric characteristics, the nature of the location, and the time of day. In addition, road accident data are collected over long periods and are available as datasets, statistical tables and reports, or even Global Positioning System data. According to several studies, statistical and data mining techniques are suitable for analyzing road accident data [
Some researchers investigated the effect of roadway factors on the number of road accidents on urban highways. They applied different techniques to establish a relationship between these factors and the accident frequency rate [
Shirmohammadi et al. [
Among accident data analysis methods, clustering analysis is the best way to uncover correlations within the data that would otherwise probably remain unknown [
To the best of our knowledge, no study has investigated the effect of roadway and development factors, especially the operating speed and the difference between posted speed limits and operating speed, on the accident frequency rate on rural roads. Furthermore, we did not find any previous study on developing a rural safety risk index using roadway and development factors. In addition, previous studies only used clustering analysis for drivers’ behavioral characteristics concerning the accidents. Given this, the novelty of the present study is, first, that it investigates the effects of roadway and development factors on the accident frequency rate. Second, it applies clustering analysis and the Gaussian model to develop a rural risk index of the clusters with respect to roadway and development factors. Moreover, finding the factors that contribute to accidents plays an important role in collision statistics, which is another reason for developing the subjective and driver-based evaluation of road safety risk. Finally, SPSS 17.0 and MATLAB R2013a were used to obtain the results.
The process of evaluating the effect of roadway and development factors on the accident frequency rate for the development of a rural road safety risk index is performed as follows (see Figure
The proposed model.
Lorestan Province has an area of 29,308 km² and a population of about 1.76 million. The capital city, Khorramabad, is located in the southern part of Lorestan. The province is widely known as a popular tourist destination. Since the Borujerd-Khorramabad road forms part of the north-south transit route of Iran, it is the most densely populated part of the Lorestan road network, and the number of motor vehicle accidents rose steadily from 2013 to 2016. A comparison of the motor vehicle accidents from 2013–2016 along the Borujerd-Khorramabad road revealed that the mortality rate reached up to 67% and the injury rate was up to 30%. In total, there were 1409 accidents during this period.
The accident frequency rate used in this study is the number of accidents normalized by the segment length, computed for the accidents that occurred during three years (2013–2016). Regarding roadway and development factors in previous studies [
Definition of variables included in the link-based dataset and descriptive statistics analysis.
Variable | Link-based dataset | No. of samples | Minimum | Maximum | Mean | Std. deviation |
---|---|---|---|---|---|---|
Accident frequency rate | No. of accidents/segment length | 106 | 0.00 | 12.66 | 0.40 | 1.44 |
Roadway and development factors | | | | | | |
Operating speed | — | 106 | 50.10 | 106.00 | 83.87 | 12.39 |
Difference between posted speed and operating speed | — | 106 | −29.40 | 59.20 | 23.48 | 16.81 |
Segment length | — | 106 | 0.10 | 5.20 | 1.62 | 0.88 |
Volume | (AADT) | 106 | 10.84 | 22.27 | 17.58 | 4.52 |
Presence or absence of a speed control camera | Binary variables: | 106 | 0.00 | 1.00 | 0.03 | 0.17 |
Homogeneous sections | Binary variables: | 106 | 1.00 | 6.00 | 4.08 | 1.97 |
Gradient | G1: links with a median gradient below | 106 | 0.00 | 2.00 | 1.17 | 0.93 |
Dominant land uses along the roadways | Binary variables: | 106 | 0.00 | 2.00 | 0.94 | 0.92 |
Number of accessibility | — | 106 | 0.00 | 10.00 | 2.31 | 2.38 |
Schematic map of the case study (source: Google map). (a) Plan of the Borujerd-Khorramabad road. (b) Location map of the study area.
The Borujerd-Khorramabad road is a two-lane, two-way road whose lane and shoulder widths are constant along its whole length at 3.65 and 1.85 meters, respectively. The pavement is in relatively good condition, and the road sections have a present serviceability index (PSI) of 3. The road sections lie outside the zone of influence of intersections, towns, and so on. In addition, the side friction value is taken as 0.35 for the road sections according to AASHTO [
Therefore, based on the output of this approach, each road section was assigned a number of accidents ranging from 0 to 13. Considering the dynamic nature of traffic variables (i.e., operating speed and volume), traffic conditions were expressed by annual averages, while road geometry was represented by categorical variables. The final dataset included 106 road sections (total length = 172 km) after the exclusion of sections with missing traffic or geometry data.
The ANOVA test is one of the most applicable methods in transportation data analysis [
H0: there are no associations between roadway and development factors and the accident frequency rate.
H1: there are associations between roadway and development factors and the accident frequency rate.
Therefore, the hypothesis H0 was rejected, while the hypothesis H1 was accepted when the
Clustering technique is one of the most commonly used data mining methods, and there are many clustering algorithms such as K-means and K-modes [
Clustering techniques raise the problem of determining the best number of clusters, because the K-means algorithm requires the number of clusters, K, as an input. According to the framework of this method, the optimal number of clusters is determined by the Elbow method [
(1) Computing the clustering algorithm (i.e., K-means) for different values of K.
(2) Calculating the total within-cluster sum of squares (wss) for each K.
(3) Plotting the curve of wss against the number of clusters K.
(4) Taking the location of a bend (knee) in the plot as a general indicator of the appropriate number of clusters.
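The Elbow steps can be sketched in a few lines of plain Python. This is only an illustrative sketch: the points are synthetic (three well-separated groups, not the study data), and the deterministic farthest-point initialization is an assumption made to keep the example reproducible; it is not the initialization used in the study.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    # Assumption for reproducibility: start from the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids

def assign(points, centroids):
    # Index of the nearest centroid for every point.
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

def kmeans(points, k, max_iter=100):
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            updated.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

def total_wss(points, centroids, labels):
    # Total within-cluster sum of squares (wss).
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))

def elbow_curve(points, k_values):
    curve = {}
    for k in k_values:
        centroids, labels = kmeans(points, k)
        curve[k] = total_wss(points, centroids, labels)
    return curve

# Synthetic data: three tight groups around (0, 0), (10, 0), and (0, 10).
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5)]
points = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]

curve = elbow_curve(points, range(1, 6))
for k, wss in curve.items():
    print(k, round(wss, 3))
```

On this synthetic data the wss drops sharply up to K = 3 and only marginally afterwards, which is the bend that step (4) looks for.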
In developing a risk index, it is vital to consider the fundamental elements that can contribute to road safety [
To examine the effect of roadway and development factors on the accident frequency rate, the ANOVA test was run, the results of which are presented in Table
ANOVA test for the examination of the effect of roadway and development factors on the accident frequency rate.
Results of significance analysis of roadway factors | | | | | |
---|---|---|---|---|---|
Results of significance analysis (operating speed) | | | | | |
 | Sum of squares | df | Mean square | F | Sig. |
Between groups | 218.24 | 93 | 2.35 | 625.77 | 0.000 |
Within groups | 0.05 | 12 | 0.00 | | |
Total | 218.28 | 105 | | | |
Results of significance analysis (the difference between posted speed limits and operating speed) | | | | | |
 | Sum of squares | df | Mean square | F | Sig. |
Between groups | 218.24 | 96 | 2.27 | 454.66 | 0.000 |
Within groups | 0.05 | 9 | 0.01 | | |
Total | 218.28 | 105 | | | |
Results of significance analysis (segment length) | | | | | |
Segment length | Sum of squares | df | Mean square | F | Sig. |
Between groups | 35.41 | 33 | 1.07 | 0.42 | 0.996 |
Within groups | 182.87 | 72 | 2.54 | | |
Total | 218.28 | 105 | | | |
Results of significance analysis (volume) | | | | | |
Volume | Sum of squares | df | Mean square | F | Sig. |
Between groups | 5.900 | 3 | 1.967 | 0.945 | 0.422 |
Within groups | 212.38 | 102 | 2.082 | | |
Total | 218.28 | 105 | | | |
Results of significance analysis (the presence or absence of a speed control camera) | | | | | |
Presence or absence of a speed control camera | Sum of squares | df | Mean square | F | Sig. |
Between groups | 0.49 | 1 | 0.49 | 0.23 | 0.629 |
Within groups | 217.79 | 104 | 2.09 | | |
Total | 218.28 | 105 | | | |
Results of significance analysis (homogeneous sections) | | | | | |
Homogeneous sections | Sum of squares | df | Mean square | F | Sig. |
Between groups | 5.71 | 5 | 1.14 | 0.53 | 0.751 |
Within groups | 212.41 | 99 | 2.15 | | |
Results of significance analysis (gradient) | | | | | |
Grade | Sum of squares | df | Mean square | F | Sig. |
Between groups | 0.69 | 2 | 0.35 | 0.17 | 0.849 |
Within groups | 217.59 | 103 | 2.11 | | |
Results of significance analysis of development factors | | | | | |
Results of significance analysis (dominant land uses along the roadways) | | | | | |
Dominant land uses along the roadways | Sum of squares | df | Mean square | F | Sig. |
Between groups | 4.45 | 2 | 2.22 | 1.07 | 0.346 |
Within groups | 213.84 | 103 | 2.08 | | |
Total | 218.28 | 105 | | | |
Results of significance analysis (the number of accessibility) | | | | | |
Number of accessibility | Sum of squares | df | Mean square | F | Sig. |
Between groups | 9.77 | 9 | 1.086 | 0.50 | 0.871 |
Within groups | 208.51 | 96 | 2.172 | | |
Total | 218.28 | 105 | | | |
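In each block of the table above, the test statistic reported next to the significance level is the ratio of the between-groups mean square to the within-groups mean square. The following sketch shows that calculation in plain Python. The function names are hypothetical, and the three small groups are made-up numbers used only to exercise the code; the only values taken from the table are the sums of squares and degrees of freedom of the volume block.

```python
def f_from_sums(ss_between, df_between, ss_within, df_within):
    # F = (between-groups mean square) / (within-groups mean square)
    return (ss_between / df_between) / (ss_within / df_within)

def one_way_anova(groups):
    # One-way ANOVA computed directly from raw observations.
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_from_sums(ss_between, df_between, ss_within, df_within),
    }

# Volume block of the table: SS = 5.900 (df 3) between, 212.38 (df 102) within.
f_volume = f_from_sums(5.900, 3, 212.38, 102)
print(round(f_volume, 3))

# Made-up groups, only to show the raw-data path.
toy = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(toy["F"])
```

The ratio computed for the volume block agrees with the tabulated value of 0.945 to three decimals. Converting F to a significance level additionally requires the F distribution, which is left to a statistics package such as the SPSS software named in the text.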
Average linkage hierarchical clustering was used to determine the number of clusters, although it has some limitations in identifying the optimal heterogeneous clusters. Because of these limitations, K-means clustering is applied once the number of clusters has been determined. In this clustering method, the centroids (i.e., the cluster center means) generated by the average linkage hierarchical clustering serve as the starting point [
Cluster analysis applies algorithms to collate individual variables with similar scores [
The standardized scores (Z-scores) of variables are used to avoid the problem of comparing Euclidean distances based on different measurement scales [
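The effect of standardization on Euclidean distances can be illustrated with a short sketch. The three road sections below are hypothetical values chosen for easy arithmetic, and the use of the sample standard deviation is an assumption; the text does not state which form was used.

```python
import math
import statistics

def z_scores(values):
    # Standardize to zero mean and unit (sample) standard deviation.
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical sections: operating speed (km/hr) and segment length (km).
speed = [60.0, 80.0, 100.0]
length = [0.5, 2.5, 1.5]

raw = list(zip(speed, length))
standardized = list(zip(z_scores(speed), z_scores(length)))

# On the raw scale the 20 km/hr speed gap dominates the 2 km length gap.
raw_distance = euclidean(raw[0], raw[1])
# After standardization both variables contribute on a comparable scale.
z_distance = euclidean(standardized[0], standardized[1])
print(round(raw_distance, 3), round(z_distance, 3))
```

Without standardization, the variable with the largest numeric range (here, speed) would largely decide which sections are grouped together.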
Number of clusters.
Final cluster centers for independent and dependent variables in this study.
Final cluster centers | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
---|---|---|---|---|---|---|
Independent variables | | | | | | |
Operating speed | 56.66 | 62.65 | 91.79 | 81.80 | 97.53 | 88.00 |
Difference between posted speed limits and operating speed | −17.34 | 8.10 | 41.02 | 27.77 | 18.69 | −0.75 |
Segment length | 1.26 | 1.56 | 1.90 | 1.42 | 1.84 | 1.56 |
Volume | 16.42 | 16.87 | 15.75 | 18.95 | 17.73 | 18.56 |
Presence or absence of a speed control camera | 0.00 | 0.09 | 0.04 | 0.03 | 0.00 | 0.00 |
Homogeneous sections | 3.00 | 3.64 | 4.43 | 4.26 | 3.86 | 3.67 |
Gradient | 1.20 | 1.09 | 1.07 | 1.23 | 1.21 | 1.22 |
Dominant land uses along the roadways | 1.20 | 1.27 | 0.75 | 0.69 | 1.29 | 1.56 |
Number of accessibility | 0.60 | 1.36 | 3.46 | 2.13 | 2.93 | 0.67 |
Dependent variable | | | | | | |
Accident frequency rate | 0.80 | 0.53 | 0.01 | 0.12 | 0.75 | 1.87 |
When the ANOVA test of the variables in the clusters is evaluated to find the factors that most affect the accident frequency rate, only the difference between posted speed limits and operating speed emerges as the most effective variable among the roadway and development factors, because it has the largest F-statistic in Tables
Examination of the ANOVA test of variables in the clusters.
Variables | Cluster mean square | Cluster df | Error mean square | Error df | F | Sig. |
---|---|---|---|---|---|---|
Operating speed | 2669.37 | 5 | 27.74 | 100 | 96.21 | 0.000 |
Difference between posted speed and operating speed | 5174.64 | 5 | 37.82 | 100 | 136.81 | 0.000 |
Segment length | 1.04 | 5 | 0.76 | 100 | 1.37 | 0.241 |
Volume | 37.55 | 5 | 19.57 | 100 | 1.92 | 0.098 |
Presence or absence of a speed control camera | 0.01 | 5 | 0.03 | 100 | 0.47 | 0.796 |
Homogeneous sections | 2.97 | 5 | 3.93 | 100 | 0.76 | 0.583 |
Gradient | 0.11 | 5 | 0.90 | 100 | 0.12 | 0.988 |
Dominant land uses along the roadways | 2.01 | 5 | 0.80 | 100 | 2.52 | 0.034 |
Number of accessibility | 18.55 | 5 | 5.00 | 100 | 3.71 | 0.004 |
Accident frequency rate | 5.88 | 5 | 1.89 | 100 | 3.10 | 0.012 |
Results of the chi-square test for examining the most effective factors in relation to the accident frequency rate.
Variables | Chi-square | Sig. (alpha) |
---|---|---|
Operating speed | 1051.77 | 0.003 |
Difference between posted speed limits and operating speed | 1060.00 | 0.043 |
Segment length | 218.32 | 1.000 |
Volume | 34.85 | 0.248 |
Presence or absence of a speed control camera | 0.95 | 1.000 |
Homogeneous sections | 64.38 | 0.083 |
Gradient | 21.38 | 0.375 |
Dominant land uses along the roadways | 19.16 | 0.511 |
Number of accessibility | 44.35 | 1.000 |
(a) Relationship between the cluster and accident frequency rate. (b) Relationship between the cluster and the difference between posted speed limits and the operating speed. (c) Relationship between the cluster and probability.
The F tests should be used only for descriptive purposes because the clusters are chosen to maximize the differences among the cases in different clusters. However, the observed significance levels are not corrected for this and, thus, cannot be interpreted as the tests of the hypothesis that the cluster means are equal.
Similarly, based on the results of the Chi-square (
To understand the effect of the difference between posted speed limits and the operating speed on the accident frequency rate, the probability of the occurrence was obtained for each cluster. Based on Figure
As shown, the first cluster, “relatively high risk,” ranks second by accident frequency rate, and its probability risk value is less than 10%. Hence, the occurrence of an accident is relatively low in this cluster.
The second cluster, “relatively low risk,” ranks fourth by observed accident frequency rate, and its probability risk value is less than 5%; thus, a high accident frequency rate is very unlikely in this cluster.
Likewise, the third cluster, the “safe cluster,” ranks sixth by accident frequency rate. Its probability risk value is also less than 5%, which demonstrates that accident occurrence is very low in this cluster.
The fourth cluster, “low risk,” ranks fifth by accident frequency rate. Comparing its probability risk value with that of the safe cluster shows that the probability of accident occurrence in this cluster is 10%, which might lead to a lower accident rate.
In addition, the fifth cluster, “moderate risk,” ranks third by accident frequency rate. The accident occurrence probability of this risky cluster, compared with that of the other clusters, is 85%, which is high; thus, the accident frequency rate is expected to increase significantly.
Finally, the sixth cluster, “high risk,” ranks first by accident frequency rate. The probability of accident occurrence in this cluster is less than 5%, indicating that accidents in this kind of cluster might happen less often than in the other risky clusters.
Therefore, the probability of occurrence of the moderate-risk cluster is higher than that of the other clusters, and more accident frequency rates occur in this cluster. Furthermore, the difference between the posted speed limits and operating speed in this cluster is nearly 18.69 km/hr, which is close to the mean difference between the posted speed limits and operating speed. As a result, the accident frequency rate increases significantly as the difference between the posted speed limits and operating speed decreases from its value in the safe cluster (Figure
Examination of the effect of the difference between posted speed limits and the operating speed on the accident frequency rate.
The relationship between the accident frequency rate and the difference between posted speed limits and the operating speed, as well as the behavior of the frequency of risky and non-risky clusters, was evaluated using the Gaussian function. The findings (Figure
The relationship between the difference of posted speed limits and the operating speed and the accident frequency rate based on the proposed model.
Proposed model for the analysis of the accident frequency rate for the clusters.
General model Gaussian: coefficients (with 95% confidence bounds) | | | Goodness of fit | | | |
---|---|---|---|---|---|---|
 | | | SSE | R-square | Adjusted R-square | RMSE |
1.93 | −4.82 | 13.70 | 0.51 | 0.77 | 0.61 | 0.41 |
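As a rough check of the fitted model, the sketch below evaluates a single-term Gaussian at the six cluster centers. The functional form a·exp(−((x − b)/c)²) is an assumption (it is the usual single-term Gaussian form in curve-fitting tools, and the fitted equation itself is not shown in the text above), as is the reading of 1.93, −4.82, and 13.70 as a, b, and c in that order. The x and y values are the cluster centers for the speed difference and the accident frequency rate from the final cluster centers table.

```python
import math

# Assumed order of the three reported coefficients: amplitude, center, width.
A, B, C = 1.93, -4.82, 13.70

def gaussian(x, a=A, b=B, c=C):
    # Assumed single-term Gaussian: a * exp(-((x - b) / c)^2)
    return a * math.exp(-(((x - b) / c) ** 2))

# (difference between posted and operating speed, accident frequency rate)
# for clusters 1 to 6, from the final cluster centers table.
cluster_centers = [
    (-17.34, 0.80),
    (8.10, 0.53),
    (41.02, 0.01),
    (27.77, 0.12),
    (18.69, 0.75),
    (-0.75, 1.87),
]

predicted = [gaussian(x) for x, _ in cluster_centers]
sse = sum((y - p) ** 2 for (_, y), p in zip(cluster_centers, predicted))
# Three fitted coefficients leave 6 - 3 = 3 residual degrees of freedom.
rmse = math.sqrt(sse / (len(cluster_centers) - 3))

for (x, y), p in zip(cluster_centers, predicted):
    print(f"x={x:7.2f}  observed={y:4.2f}  predicted={p:5.3f}")
print(round(sse, 2), round(rmse, 2))
```

Under these assumptions the computed SSE and RMSE come out close to the tabulated 0.51 and 0.41, and most of the error comes from the fifth cluster, whose observed rate is well above the curve.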
As an example in Figure
Examination of the obtained risk value based on the proposed risk index and its comparison with that of the other studies for each cluster.
Cluster | Proposed study based on collected data | | | | | Leur and Sayed’s study [ | | | |
---|---|---|---|---|---|---|---|---|---|
 | Operating speed | Difference between posted speed limits and the operating speed | Risk degree | Accident frequency rate | Risk index (predicted value) (Eq. 2) | Accident frequency rate (consequences) | Probability | Exposure | Final risk |
1 | 56.66 | −17.34 | Relatively high risk | 0.80 | 1.62 | 12.12 | 0.013 | 3 | 0.476 |
2 | 62.65 | 8.10 | Relatively low risk | 0.53 | 2.02 | 10.12 | 0.027 | 1 | 0.275 |
3 | 91.79 | 41.02 | Not risky (safe) | 0.01 | 0.14 | 7.83 | 0.058 | 0 | 0.00 |
4 | 81.80 | 27.77 | Low risk | 0.12 | 0.59 | 12.4 | 0.012 | 1 | 0.146 |
5 | 97.53 | 18.69 | Moderate risk | 0.75 | 1.98 | 9.06 | 0.039 | 2 | 0.708 |
6 | 88.00 | −0.75 | High risk | 1.87 | 14.81 | 11.1 | 0.019 | 3 | 0.636 |
Based on the findings of the ANOVA and Chi-square tests, among the roadway and development factors and their effects on the accident frequency rate, only the operating speed and the difference between posted speed limits and the operating speed were used in the safety risk index model in equation (
ANOVA test for the proposed safety risk index.
Parameter estimates | ||||
---|---|---|---|---|
Parameter | Estimate | Std. error | 95% confidence interval, lower bound | 95% confidence interval, upper bound |
b1 | 1.01 | 0.07 | 0.84 | 1.18 |
ANOVAa | ||||
Source | Sum of squares | df | Mean squares | |
Regression | 231.74 | 1 | 231.74 | |
Residual | 1.26 | 5 | 0.25 | |
Uncorrected total | 233.00 | 6 | ||
Corrected total | 150.20 | 5 | ||
Dependent variable: risk value | ||||
a. |
Moreover, the Chi-Square distribution probability function was used as a probability generator for obtaining the probability of each cluster, the results of which are presented in Table
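The use of the chi-square density as a probability generator can be sketched as follows. Two points here are assumptions rather than statements from the text: the degrees of freedom (5, i.e., the number of clusters minus one) and the final risk being the product of probability, exposure, and consequences. Both were chosen because they appear to reproduce the Probability and Final risk columns of the risk comparison table, whose consequence, exposure, probability, and risk values are copied below.

```python
import math

def chi_square_pdf(x, k):
    # Chi-square probability density with k degrees of freedom.
    if x <= 0:
        return 0.0
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

DF = 5  # assumption: number of clusters minus one

# (cluster, consequences, exposure, tabulated probability, tabulated final risk)
table_rows = [
    (1, 12.12, 3, 0.013, 0.476),
    (2, 10.12, 1, 0.027, 0.275),
    (3, 7.83, 0, 0.058, 0.00),
    (4, 12.4, 1, 0.012, 0.146),
    (5, 9.06, 2, 0.039, 0.708),
    (6, 11.1, 3, 0.019, 0.636),
]

results = []
for cluster, consequences, exposure, _, _ in table_rows:
    probability = chi_square_pdf(consequences, DF)
    # Assumed combination: risk = probability * exposure * consequences.
    risk = probability * exposure * consequences
    results.append((cluster, probability, risk))
    print(cluster, round(probability, 3), round(risk, 3))
```

If either assumption differs from the procedure actually used, the agreement with the table would be coincidental, so the sketch should be read as a consistency check only.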
Future work might investigate the effect of geometric factors such as road width, as well as weather and lighting conditions, on accident frequency and on the development of the rural risk index. In addition, data mining and multicriteria decision-making approaches, including decision tree techniques, fuzzy AHP, and fuzzy COPRAS, could be used to extend this risk index for drivers on rural roads based on databases and experts’ opinions in the field.
To examine the reliability of the safety risk index for clusters, a sensitivity analysis was performed between the results of the proposed model and the findings of Leur and Sayed [
Sensitivity analysis between the two studies.
Because roadway and development factors are known as the most influential parameters contributing to road traffic accidents, applying them in safety analysis could be instrumental in reducing the accident frequency rate and preventing the growth of fatality and injury rates on rural roads. Therefore, this study evaluated the effect of roadway and development factors on accident frequency in order to develop a rural road safety risk index using K-means clustering and the Gaussian model. Relying on the obtained data and the results of the analysis, the main findings of the study and the evaluation of the rural accident risk index among roadway and development factors are summarized below, based on the ANOVA test and the clustering and risk analyses. Based on the results of the ANOVA test, among roadway and development factors, only the operating speed and the difference between posted speed limits and the operating speed had significant effects on the accident frequency rate. Furthermore, the results of the Chi-square test demonstrated that, in the risk index, the operating speed has a lower effect on the accident frequency rate than the difference between posted speed limits and the operating speed, which had the maximum chi-square value. Based on the K-means clustering analysis of roadway and development factors with respect to the accident frequency rate, six easily understandable clusters were identified: high-risk, relatively high-risk, moderate-risk, relatively low-risk, low-risk, and not risky (safe). The comparison of the clusters by accident frequency rate revealed that the sixth cluster was the high-risk cluster, whereas the third cluster was the safe cluster. The risk index model was proposed based on the Gaussian model to analyze the behavior of the accident frequency rate across the clusters and to obtain the risk value.
Accordingly, the average reducing factor of the accident frequency rate was 0.99 per km/hr increase in the difference between the posted speed limits and the operating speed among the safe clusters. In the unsafe clusters, however, the average increasing factor of the accident frequency rate was 1.17. The growth factor in the risky and unsafe clusters was therefore 1.18 times that in the low-risk and safe clusters (1.17/0.99 ≈ 1.18). Based on the comparison of the difference between posted speed limits and the operating speed with the probability of accident occurrence, it is concluded that, as this difference decreases from its value in the safe cluster, the probability of accident occurrence in each cluster increases, followed by an increase in the accident frequency rate. As a result, the maximum probability of accident occurrence, 85%, was observed in the fifth cluster. The probability of accidents in the fifth cluster increased as well. Sensitivity analysis showed that the proposed safety risk index performs better at predicting the risk values for the clusters than the other study. The proposed risk index model is considered a useful tool for obtaining the safety risk value in studies concerning the accident rate and clustering analysis of drivers on rural roads. Finally, this study can be useful for safety research organizations such as governmental institutes and police centers, which can consider the maximum risk value in order to present their plans and strategies for minimizing accidents accurately.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare no conflicts of interest.