This paper proposes a clustering approach to predict the probability of a collision occurring in the proximity of planned road maintenance operations (i.e., work zones). The proposed method is applied to over 54,000 short-term work zones in the state of Maryland and demonstrates an ability to predict work zone collision probabilities. One of the key applications of this work is using the predicted probabilities at the operational level to help allocate highway response teams. To this end, a two-stage stochastic program is used to locate response vehicles on the Maryland highway network in order to minimize expected response times.

Work zone collisions accounted for approximately 1.2 to 1.7 percent of all 2006-2013 crashes in the United States, amounting to 67,523 incidents in 2013 [

Both of these applications depend on the ability to

Existing work zone research tends to be oriented towards mobility or safety applications. The mobility-oriented research generally focuses on the effect of work zones on traffic capacity, queuing delays, and other performance metrics [

Another class of advanced statistical techniques includes random parameter [

The models described thus far are primarily useful for predicting the number of crashes that will occur in a work zone over a period of time. As Yang, Ozbay, Xie, and Bartin [

Despite many different modeling approaches and perspectives, the majority of literature focuses on either work zones or crash frequency independently, with comparatively little research dedicated to modeling collision frequency/risk based on work zone characteristics. Within this niche body of research, most papers conclude that increased vehicle demand, work zone length, and duration all increase the number of collisions that occur at a work zone over a period of time [

We propose a scalable, unsupervised learning approach to predict the probability of a collision occurring at a short-term work zone. In contrast to most classical regression approaches, which do not explicitly calculate crash risk/probability and are not appropriate for short-term durations, this machine learning approach clusters all work zones based on salient characteristics, calculates a collision probability for each cluster, and assigns new work zones to the existing clusters with the most similar features. Furthermore, it yields collision predictions without access to current traffic volumes or work zone lengths and would easily scale to larger work zone datasets (e.g., see [

We present an integrated approach that combines work zone collision risk predictions with actionable response recommendations and is intuitive and applicable for practitioners. This approach takes historic information about work zones and returns the optimal allocation of highway response teams, which could be readily implemented by an agency such as the Coordinated Highways Action Response Team [

The remainder of this paper is organized as follows. We begin by discussing the historical work zone and collision dataset that is used for model development in subsequent sections. Next, we describe clustering methods and explain how we determine the number of clusters and quantify model performance. We then apply this clustering framework to the work zone dataset, noting how different approaches affect model performance. Afterwards, we discuss example applications, focusing on optimally locating highway response vehicles by using the work zone incident probabilities as an input to a stochastic optimization model. Finally, we draw conclusions and suggest future steps to extend the research.

Data describing work zones (WZs) and collisions that take place in their proximity (i.e., within a distance of 1 mile) were collected from the Regional Integrated Transportation Information System [

Heat map showing locations of over 54,000 WZs that took place in Maryland during 2010-2015. The pie chart indicates that the majority of WZs were set up along MD and Interstate roads.

Most WZs had one or two lanes closed, while closures of more than four lanes were very rare and typically occurred close to toll plazas. For the most part, closures affected main and shoulder lanes.

Duration of WZs and split in total work zone hours in day/night and peak/off-peak. Seventeen continuous distributions were fitted to WZ duration [

Most maintenance work was carried out on weekdays and during summer and fall.

The observed 380 WZ collisions occurred at a very small subset of all the recorded WZs. There were hardly any WZs with more than one collision; however, a few observed as many as three collisions (Figure

Input variables considered in the analysis of WZ collisions.

Heat map showing locations of 359 WZs within which 380 collisions occurred. The pie chart indicates that very few WZs had more than one collision.

Most collisions took place on weekdays and during summer and fall, when high maintenance activity was observed.

Visual representation of AADT in 2013 for the road links in Maryland.

Numeric values indicate correlation between (continuous and ordinal) input variables and number of WZ collisions. Yellow and magenta denote positive and negative correlation, respectively. Brighter links indicate greater correlation, while less visible links imply low correlation. Unfilled tokens indicate that a variable has low total absolute correlation with other covariates [

Data preprocessing can significantly improve the performance of the clustering algorithm presented in the following section. Categorical variables were encoded as binary indicator variables, and the peak and day durations of a WZ were divided by its overall duration because we are interested in predicting the one-hour probability of a collision. The off-peak and night durations were removed as redundant (e.g., the relative peak and relative off-peak durations add up to one), and all variables were normalized. The WZ features used for clustering are presented in Table
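The preprocessing steps described above can be illustrated with a minimal sketch. The record field names (`road_type`, `weekend`, `lanes_closed`, `aadt`, `peak_hours`, `day_hours`, `total_hours`), the helper names, and the toy values are assumptions made for illustration and do not come from the study's dataset. Min-max scaling is assumed as the normalization, and the season indicator columns are omitted for brevity.

```python
ROAD_TYPES = ["I", "US", "MD"]  # road categories named in the text


def one_hot(value, categories):
    """Encode a categorical value as binary indicator columns."""
    return [1.0 if value == category else 0.0 for category in categories]


def share(part_hours, total_hours):
    """Fraction of the WZ duration that falls in a period (e.g., peak)."""
    return part_hours / total_hours if total_hours > 0 else 0.0


def build_row(wz):
    """Turn one WZ record (hypothetical field names) into a numeric row."""
    return one_hot(wz["road_type"], ROAD_TYPES) + [
        1.0 if wz["weekend"] else 0.0,
        float(wz["lanes_closed"]),
        float(wz["aadt"]),
        share(wz["peak_hours"], wz["total_hours"]),
        share(wz["day_hours"], wz["total_hours"]),
    ]


def min_max_normalize(rows):
    """Rescale every column to [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Toy records, invented for illustration only.
work_zones = [
    {"road_type": "I", "weekend": False, "lanes_closed": 2, "aadt": 120000,
     "peak_hours": 2.0, "day_hours": 6.0, "total_hours": 8.0},
    {"road_type": "MD", "weekend": True, "lanes_closed": 1, "aadt": 20000,
     "peak_hours": 0.0, "day_hours": 1.0, "total_hours": 4.0},
    {"road_type": "US", "weekend": False, "lanes_closed": 1, "aadt": 70000,
     "peak_hours": 1.0, "day_hours": 5.0, "total_hours": 5.0},
]
features = min_max_normalize([build_row(wz) for wz in work_zones])
```

Only the peak and day shares are kept because the off-peak and night shares are their complements and would add no information to the distance computation.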

Input variables used for WZ clustering.

| Variable | Description | Columns |
|---|---|---|
| | Season of year | 3 |
| | Type of road (I, US, MD) | 3 |
| | Equals 1 for Saturday or Sunday, otherwise 0 | 1 |
| | The number of all, main, shoulder and median lanes | 1 |
| | Average annual daily traffic | 1 |
| | Peak duration / overall duration | 1 |
| | Day duration / overall duration | 1 |

Note that WZ lengths were not available for the 54,000 WZs considered in this study, but since WZ lengths have been shown to influence crash frequency [

The objective is to use historical data in order to predict the probability of a collision occurring within a future WZ (e.g., road maintenance work scheduled for the next week), based on the underlying assumption that WZs with similar features should have comparable safety levels. Accordingly, we take the set of 54,000 WZs observed in the past (Section

Graphical description of the methodology via a trivial example including 2 dimensions, 3 clusters, and 41 data points (i.e., WZs). After clustering historical WZs, for each cluster, we compute the number of WZs with at least one collision over the total number of WZs within the cluster. A newly scheduled WZ is attributed to the cluster with the most similar features. The predicted probability of a collision occurring at this WZ corresponds to the ratio computed for the cluster it was attributed to.

Panels: historical (unclustered) work zones; clustered WZs and corresponding collision probabilities; newly scheduled WZ attributed to the closest cluster.
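The prediction step described in the caption above can be sketched as follows: compute, for each cluster, the share of WZs with at least one collision, then attribute a new WZ to the nearest cluster center. The cluster labels, centers, and collision flags below are toy values, and the function names are hypothetical. Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import defaultdict


def cluster_collision_probabilities(labels, had_collision):
    """Share of WZs with at least one collision, per cluster label."""
    totals = defaultdict(int)
    with_collision = defaultdict(int)
    for label, collided in zip(labels, had_collision):
        totals[label] += 1
        if collided:
            with_collision[label] += 1
    return {label: with_collision[label] / totals[label] for label in totals}


def nearest_cluster(point, centers):
    """Label of the cluster whose center is closest to the point."""
    return min(centers, key=lambda label: math.dist(point, centers[label]))


# Toy example with 2 features and 3 clusters (values are illustrative).
centers = {0: (0.2, 0.2), 1: (0.8, 0.3), 2: (0.5, 0.9)}
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
had_collision = [False, False, False, True,
                 True, False, True, False, False,
                 False, False]

probabilities = cluster_collision_probabilities(labels, had_collision)

new_wz = (0.75, 0.35)  # features of a newly scheduled WZ
assigned = nearest_cluster(new_wz, centers)
predicted = probabilities[assigned]
```

In this toy case the new WZ falls closest to cluster 1, where two of five historical WZs had a collision, so its predicted probability is 0.4.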

The historical data can be partitioned into clusters of WZs with similar features via classical

Select centers for

Assign each point to the closest cluster center

Compute new cluster centers using assigned points
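The three steps listed above correspond to a Lloyd-style iteration. The following is a minimal sketch under stated assumptions: Euclidean distance, initial centers drawn at random from the data unless supplied, and a stop when the centers no longer change. The function names and toy points are hypothetical, and this is not the implementation used in the study.

```python
import math
import random


def assign(points, centers):
    """Step 2: index of the closest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def lloyd_clustering(points, n_clusters, initial_centers=None,
                     max_iterations=100, seed=0):
    """Alternate assignment and center updates until centers stabilize."""
    # Step 1: select initial centers (random data points by default).
    if initial_centers is None:
        centers = random.Random(seed).sample(points, n_clusters)
    else:
        centers = list(initial_centers)
    for _ in range(max_iterations):
        labels = assign(points, centers)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = []
        for j in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
centers, labels = lloyd_clustering(
    points, 2, initial_centers=[(0.0, 0.0), (1.0, 1.0)])
```

The result depends on the initial centers, since the iteration can stop at a local optimum. In practice it is common to repeat the procedure from several starting points and keep the partition with the smallest within-cluster distance.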

To evaluate the model, the dataset is first separated into training and testing datasets. Upon selecting the number of clusters (discussed below), the

In order to determine the number of clusters to use, we employ three different methods: Elbow, Silhouette, and Cross-Validation. In the Elbow method, the number of clusters is increased until the improvement in the objective function becomes marginal. The Silhouette method [

Additionally, these three methods are helpful in determining the relevant model scenarios. In the base scenario, we assume that all the variables are of equal importance and thus are normalized from 0 to 1. To explore the influence of individual variables, we inflate their normalized values and study how this affects the clustering results. Thus, the Elbow, Silhouette, and Cross-Validation methods help provide insight into the number of clusters and appropriate variables to use.
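As an illustration of the Silhouette criterion mentioned above, the sketch below computes the mean silhouette coefficient of a given partition using the standard definition: for each point, a is the mean distance to the other members of its cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b). Euclidean distance and the toy data are assumptions, and the function name is hypothetical.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient of a partition (higher is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # Mean distance to the other members (distance to itself is 0).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of another cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
compact = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
mixed = mean_silhouette(points, [0, 1, 0, 1, 0, 1])
```

To choose the number of clusters, one would cluster the data for each candidate number and keep the one whose partition has the highest mean silhouette.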

The proposed clustering approach may overfit the data (i.e., memorize data points rather than detect patterns). In order to check for possible overfitting, it is helpful to train models with various sizes of training data and then check the behavior of the test error. When overfitting occurs, the error should significantly decrease as the training dataset size increases, but if the model is able to generalize well (i.e., does not overfit), further changes in training dataset size should not result in a meaningful reduction of test error. The results for the base scenario are presented in Figure

Predicted and actual mean collision probability for the base scenario.

Selecting the number of clusters is essential for obtaining satisfactory results. After some preliminary experiments, the lower and upper bounds on the number of clusters were set to 8 and 21, respectively. If a model includes few clusters, the WZs in each cluster may be quite diverse, which results in relatively small differences among the mean collision probabilities of the clusters. Consequently, even if the accuracy of clustering is very high, the model cannot be employed for predictions. On the other hand, having many clusters should (in theory) improve the prediction accuracy. However, if some clusters do not include enough data points with collisions (due to relatively few collisions in the entire dataset), then the estimated collision probabilities may be inaccurate for these clusters. Therefore, the upper bound on the number of clusters should correspond to the size of the training set and the number of collisions in the entire dataset.

As noted above, we considered three methods to select the number of clusters: Elbow, Silhouette, and Cross-Validation. The accuracy of each method was estimated as the relative difference between the error of the model indicated by the method and the error of the best model selected using a test set. The results for each method were computed for different scenarios and training dataset sizes (based on the overfitting analysis, only training dataset sizes greater than or equal to 50% of the entire dataset size were considered) and are summarized in Table

Aggregated accuracy metrics of various methods for selecting the best model (smaller numbers indicate better accuracy).

Metric | Silhouette | Cross-Validation | Elbow |
---|---|---|---|
Mean | 0.36365 | 0.40575 | 0.44222 |
Std. dev. | 0.28405 | 0.42113 | 0.43249 |

The results indicate that the Silhouette method has the lowest mean error and the smallest standard deviation among the three methods, meaning that it is less sensitive to changes in scenario or training dataset size and, consequently, more reliable. Accordingly, the Silhouette method was chosen to determine the number of clusters.

The base scenario assumes that all the features are equally important, but in actuality some may have a larger impact on the predicted values than others. In order to verify this, different scenarios were created and tested. In each scenario some features are more (or less) important than others, and the proximity in dimensions corresponding to these features has a greater (or smaller) impact on the attribution of WZs to certain clusters. A list of tested scenarios is presented in Table

List of tested scenarios.

Scenario | Description |
---|---|
0 | Base scenario, all features equally important |
1 | Importance of seasons increased |
2 | Importance of road type increased |
3 | Importance of weekday/weekend increased |
4 | Importance of number of lanes increased |
5 | Importance of AADT increased |
6 | Importance of peak/off-peak increased |
7 | Importance of day/night increased |
8 | Importance of AADT decreased |
9 | AADT data removed |
10 | Importance of peak/off-peak and day/night increased |
11 | Importance of peak/off-peak and day/night increased, AADT data removed |
12 | Uses only peak/off-peak and day/night data |
13 | Importance of number of lanes, peak/off-peak and day/night increased |
14 | Uses only number of lanes, peak/off-peak and day/night data |
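One way to picture the scenarios in the table above is as per-feature weights applied to the normalized values before clustering: a weight above 1 inflates a feature's influence on distances, a weight below 1 reduces it, and a weight of 0 removes the feature. The sketch below uses two hypothetical features (normalized AADT and peak share). The weight values are illustrative assumptions, since the factors used in the study are not stated here.

```python
import math


def apply_weights(rows, weights):
    """Scale each normalized feature column by its importance weight."""
    return [[value * weight for value, weight in zip(row, weights)]
            for row in rows]


def closest_to_first(rows):
    """Index (1 or 2) of the row that is closest to rows[0]."""
    return min((1, 2), key=lambda i: math.dist(rows[0], rows[i]))


# Columns: [normalized AADT, peak share]; values are illustrative.
rows = [
    [0.2, 0.9],  # reference WZ
    [0.8, 0.9],  # same peak share, different AADT
    [0.2, 0.1],  # same AADT, different peak share
]

base = closest_to_first(apply_weights(rows, [1.0, 1.0]))
aadt_increased = closest_to_first(apply_weights(rows, [2.0, 1.0]))
aadt_removed = closest_to_first(apply_weights(rows, [0.0, 1.0]))
```

With equal weights the reference WZ is closest to the WZ with the same peak share. Doubling the AADT weight makes the WZ with the same AADT the closest one, which shows how a scenario can change cluster membership without changing the data.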

Errors for different sizes of training data

Mean errors

Scenarios

The error for the model whose specifications are determined using the Silhouette method for 3 quantiles is 2.95%, indicating a high level of model accuracy. The accuracy for each quantile is shown in Figure

Accuracy and

Accuracy for 3 quantiles

Using the previously proposed clustering approach to predict the probability of a collision occurring within a WZ, we now provide an illustrative application of the proposed model. This hypothetical case study pertains to the jurisdiction of the Coordinated Highways Action Response Team, whose objective is to improve operations of Maryland’s highway system. Suppose that this agency has a list of planned maintenance work for the following day and is interested in deploying a fixed number of response units to tackle collisions that may happen within these WZs. Clearly, we can assign these WZs to the clusters derived in the previous section and consequently estimate collision probability for each of the WZs scheduled for the following day (Figure

In this illustrative example, we randomly sample 40 of the WZs that were used to test the clustering methodology and treat them as the maintenance work scheduled for the following day. We then assign these 40 WZs to the clusters built from historical WZs (2010-2015) in order to compute the collision probability associated with each of the 40 newly scheduled WZs (Figure

Collision probabilities at 40 future WZs are estimated via the proposed clustering procedure. These probabilities are used as an input for the two-stage stochastic model to optimally allocate 5 highway response teams.

Panels: collision probabilities predicted via clustering; optimal response team allocation.
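To make the link between predicted probabilities and unit allocation concrete, the sketch below uses a simplified scenario-based stand-in for a two-stage stochastic program. It is not the formulation used in this paper. The assumptions are: collisions at different WZs occur independently with the predicted probabilities, response time is approximated by straight-line distance, each collision is served by the nearest unit with no capacity limit, and the first-stage decision is found by brute-force enumeration, which is viable only for very small instances. All names and coordinates are hypothetical.

```python
import itertools
import math
import random


def sample_scenarios(probabilities, n_scenarios, seed=0):
    """Each scenario lists the indices of WZs where a collision occurs."""
    rng = random.Random(seed)
    return [
        [i for i, p in enumerate(probabilities) if rng.random() < p]
        for _ in range(n_scenarios)
    ]


def expected_response_distance(unit_sites, wz_sites, scenarios):
    """Second stage: nearest unit serves each collision; average over scenarios."""
    total = 0.0
    for scenario in scenarios:
        for wz_index in scenario:
            total += min(math.dist(wz_sites[wz_index], unit)
                         for unit in unit_sites)
    return total / len(scenarios)


def best_allocation(candidate_sites, n_units, wz_sites, scenarios):
    """First stage: enumerate unit locations and keep the best expected value."""
    return min(
        itertools.combinations(candidate_sites, n_units),
        key=lambda units: expected_response_distance(
            units, wz_sites, scenarios),
    )


# Toy instance (coordinates and probabilities are illustrative).
wz_sites = [(0.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
collision_probabilities = [0.05, 0.30, 0.20]
candidate_sites = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]

scenarios = sample_scenarios(collision_probabilities, n_scenarios=1000)
units = best_allocation(candidate_sites, 1, wz_sites, scenarios)
print(units)
```

A realistic instance would use network travel times and an integer programming solver instead of enumeration, because the number of candidate allocations grows combinatorially with the number of sites and units.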

Sensitivity analysis showing the optimal allocation of response teams when the number of available units is perturbed from

Panels: 3, 4, 6, 7, 8, and 9 response units.

In addition to optimal allocation of highway response units, the proposed clustering method can be used to determine or adjust WZ parameters. Specifically, the presented model could help modify WZ parameters (e.g., lanes closed, day/night, and peak/off-peak duration) in order to meet certain safety levels (e.g., keep the collision probability below a specified threshold). For example, the easternmost WZ in Figure

This paper proposed a clustering approach to predict the probability of a collision occurring in the proximity of planned maintenance work, which is important for allocation of highway response units. We presented the first application of clustering in the analysis of WZ collisions, which involved a large dataset of over 54,000 WZs in Maryland. The model showed good prediction accuracy, and its potential application was illustrated by optimally allocating response units in the Maryland highway network. Namely, collision probabilities determined via clustering were used as inputs to a two-stage stochastic program to optimally deploy highway response teams. Additionally, the proposed clustering approach can be used to adjust features of WZs to meet specified safety levels.

The proposed clustering method has certain limitations corresponding to the number of quantiles, so in some cases it may be used for classification of WZs rather than to predict the exact collision probabilities. Including more data would allow for additional quantiles to be used (i.e., preparation of higher-resolution models) and including additional WZ features may be useful as well. It would also be interesting to test clustering algorithms other than the

The problem of allocating highway response units is modeled as a two-stage stochastic integer program, with the goal to locate

We formulate the first-stage problem as

The data used to support the findings of this study are available from the corresponding author upon request.

This research did not receive specific funding, but was performed as part of employment at the Center for Advanced Transportation Technology, University of Maryland.

The authors declare that they have no conflicts of interest.