Developing a Clustering-Based Empirical Bayes Analysis Method for Hotspot Identification

Hotspotidentification(HSID)isacriticalpartofnetwork-widesafetyevaluations.Typicalmethodsforrankingsitesareoftenrooted inusingtheEmpiricalBayes(EB)methodtoestimatesafetyfrombothobservedcrashrecordsandpredictedcrashfrequency basedonsimilarsites.TheperformanceoftheEBmethodishighlyrelatedtotheselectionofareferencegroupofsites(i


Introduction
Network screening to identify sites (i.e., roadway segments or intersections) with promise for safety treatments is an important task in road safety management [1][2][3][4][5][6][7].The identification of sites with promise, also known as crash hotspots or hazardous locations, is the first task in the overall safety management process [8].One widely applied approach to this task is the popular Empirical Bayes (EB) method.The EB method is described and recommended in the Highway Safety Manual [9] for roadway safety management.This method is relatively insensitive to random fluctuations in the frequency of accidents with two clues combined, the observed crash frequency of the site and the expected number of crashes calculated from a safety performance function (SPF) for homogeneous sites (or the reference group) [10,11].The EB method can correct for regression-to-the-mean bias and refine the predicted mean of an entity [12].Further, it is relatively simple to implement compared to the fully Bayesian approach.
Although the EB method has several advantages, there are a few issues associated with the methodology which may limit its widespread application.First, the selection of the reference population (i.e., similar sites) influences the accuracy of the EB method.When estimating the safety performance function, the crash data are usually obtained from distinct geographic sites to ensure a sufficient sample size for valid statistical estimation [10].As a result, the aggregated crash data often contain heterogeneity.When conducting an EB analysis, the reference group must be similar to the target 2 Journal of Advanced Transportation group in terms of geometric design, traffic volume, and so on.Manually identifying such a reference group is a rather time consuming task for transportation safety analysts whose time could be better spent elsewhere.Second, the EB procedure is relatively complicated and requires a transportation safety analyst with considerable training and experience to implement it for a safety evaluation.Thus, the training investment required to prepare analysts to undertake EB evaluations can be a barrier.As a result, some quick and dirty conventional evaluation methods may be applied as a compromise of convenience, which may produce questionable results.
Given that the specification of correct reference groups is critical for the accuracy of the EB methodology, the primary objective of this research will examine different clustering algorithms (e.g., centroid-based clustering, connectivitybased clustering, and distribution-based clustering) and develop a procedure to identify appropriate reference groups for the EB analysis.

Used in Comparison
2.1.Conventional Hotspot Identification Methods.One common HSID method is the accident frequency (AF) method.
Sites are ranked based on AF and hotspots are defined as sites whose accident frequency exceeds some threshold value [13].
The problem of accounting for exposure in the AF method can be accommodated by considering accident rate (AR) instead of accident frequency.As such, AR methods have been used by some analysts and normalize accident frequency by traffic count.While AF and AR methods are easy to implement, they have difficulty accounting for randomness in crash data.As such, another popular HSID method was developed, that being the Empirical Bayes method presented by Abbess et al. [14].Since its introduction decades ago, the EB method has been used numerous times in many safety studies [15][16][17][18][19][20][21][22][23].One of the key advantages of using the EB method is that it accounts for the regression-to-the-mean (RTM) bias.The EB method can also help improve precision in cases where limited amounts of historical accident data are available for analysis at a given site.At its core, the EB method forecasts the expected crash count at a particular site as a weighted combination of (1) the accident count at the site based on historical data and (2) the estimated number of accidents at similar locations as determined from a regression model [24].The regression model is generally referred to as a SPF and typically takes into account roadway and traffic characteristics (e.g., average daily traffic) at similar sites.To date, the most popular choice for the SPF has been a Negative Binomial regression model [25][26][27].In terms of HSID via the EB method, EB estimates are computed for each site and then sites are ranked according to such estimates.Sites exceeding some thresholds are then considered as hotspots.Besides the EB method, another relatively common HSID method is rooted in so-called "accident reduction potential" (ARP).
The ARP metric used for ranking sites was computed by subtracting the estimated accident count from the observed accident count at a given site, where the estimated accident count comes from a regression model developed from data at similar sites to the target.Among different HSID methods, the EB method is probably the most widely applied approach for screening sites with potential for safety treatment.

Clustering for Selection of Similar
Sites.In the following section, we present three methods that can be used to group data into different clusters.As aforementioned, crash data often exhibit heterogeneity that can affect model estimates if not properly accounted for.The idea here is to cluster crash data into different groups that hopefully align to some degree with the underlying subpopulations from which the crash data are generated.Then, separate Negative Binomial (NB) regression models (i.e., SPFs) can be developed based on each cluster and EB estimates can then be computed using an SPF that hopefully considers sites that truly are "similar" to the site in question.
For GFMNB- models, the equation for developing the weight parameter is shown in (4).By using a function of the covariates, the GFMNB- model makes it possible for each site to have different weights for each component that depends on the site-specific values of the covariates.Zou et al. [28] demonstrated how this additional flexibility can lead to better classification results where  푖 푗 is the estimated weight for component  at segment ;  푗 = ( 0푗 ,  1푗 ,  2푗 , . . .,  푚푗 ) 耠 are the estimated coefficients of component  and  is the number of unknown coefficients; and x 푖 is a vector of covariates.

K-Means
Clustering.The -means clustering algorithm is often attributed to Lloyd [29] and Anderson [30], and it is one of the most popular clustering algorithms in use today.Inputs to the algorithm are the data points; here, each data point can be viewed as one of the road segments in the crash data set and its corresponding descriptive variables (e.g., lane width and average daily traffic (ADT)).With the data in hand,  cluster centers are initialized.Cluster centers can be chosen as random points in the feature space (i.e., points that do not exist in the data set could be selected) and random data points in the feature space (i.e., only points in the dataset can be selected) or through a variety of other methods.For this project, the initialization using  random data points in the dataset was used.The algorithm then proceeds in an iterative process until it converges, where convergence is defined as the point at which the cluster assignments do not change.The first step in the iteration assigns each data point to the cluster such that the distance between that cluster center and the data point itself is smallest; the distance metric used for this work is Euclidean distance defined as shown in (5).Then, the second step recalculates the center for each cluster.Pseudocode for the algorithm is shown in the following: where (⋅) is Euclidean distance between two points;  is data point index, ranging from 1 : ;  is variable index, ranging from 1 :  for  variables; and ‖ ⋅ ‖ is the two norms of two data points.

K-Means Algorithm Cluster-Assignment
Step where () is cluster assignment for data point  푖 ;  푘 is center of cluster ; and all other variables are defined as previous.

Center-Update
Step where |()| is cardinality (number of data points) in cluster () and all other variables are defined as previous.

Hierarchical Clustering.
Hierarchical clustering methods differ from -means clustering in that the results do not depend on the number of clusters used (i.e., the results will always be the same for a given number of clusters) nor an initialization.Rather, they are rooted in the use of a dissimilarity measure defined between clusters that is defined in terms of all possible pairwise combinations of data points within two given clusters.In this research, agglomerative (i.e., bottom-up) hierarchical clustering in the form of complete linkage clustering was considered.Agglomerative clustering methods (e.g., complete linkage, single linkage, and average linkage) take the data points (i.e., road segments and their corresponding descriptors) as inputs and begin with each data point as its own cluster; a lone data point forming its own cluster is also known as a singleton.For complete linkage clustering, the algorithm proceeds in a total of  − 1 steps (i.e., one step less than the total number of data points in the dataset) and at each step, the two clusters with the smallest intergroup dissimilarity measure are joined to form a new cluster.Thus, in each successive step, the number of clusters is reduced by one.For complete linkage clustering the intergroup dissimilarity is defined as follows [31]: where ,  are two arbitrary clusters and Thus, for each step of the complete linkage clustering algorithm, the two clusters with the smallest value of the maximum between-cluster distance are joined.

Classification-Based EB Methods.
At this point it is important to clarify the main contribution of this work.It is well-known that aggregated crash likely has some degree of heterogeneity, as if they are generated from multiple distinct subpopulations.As such, if one were able to try to capture this heterogeneity and group the data into different units, ideally based upon the subpopulations from which they were generated, better estimates of safety and HSID rankings could likely be obtained.Thus, three types of clustering algorithms (GFMNB- model based, -means clustering, and hierarchical clustering with complete linkage) are proposed to cluster the data into distinct subgroups that hopefully correspond to the subpopulations from which the data were generated.The main idea/application of clustering is to define groups (i.e., clusters) of data points such that all points assigned to/belonging to a given cluster are closer/more similar to the points in that cluster than any other cluster [31].
Clustering methods present an ideal means to represent/describe heterogeneity within crash data.As such, we apply clustering-based EB methods in this study as a new means of hotspot identification.For these methods, three types of clustering as aforementioned are considered, and the classification method for HSID purposes has four main steps as follows.First, the full set of input crash data is clustered into  clusters via the GFMNB- model, -means clustering algorithm, or hierarchical clustering algorithm.In this study, the number of clusters considered is set equal to the number of components selected for the GFMNB- model, which was itself selected on the basis of the Bayesian information criterion (BIC).Ultimately, however, the choice for selection of both number of clusters and number of components in the GFMNB- model is up to the analyst.The second step of the algorithm involves splitting the data into  groups based on the results of the applied clustering algorithm.The third step of the algorithm calls for estimation of an NB regression model (i.e., SPF) for each of the  subgroups/clusters from the data and using these SPFs in further generation of EB estimates for each site.For example, if  = 2, two SPFs will be estimated and the data in each of the two groups will have EB estimates calculated through application of the corresponding SPF.Fourth and finally, the EB estimates for all sites across all  subgroups are aggregated and ranked, after which hotspots identification is based on threshold values or other methods.From this point forward, the classificationbased HSID methods aforementioned will be referred to as follows: GFMNB-based EB method, the -means-based EB method, and hierarchical-based EB method, respectively.A summary of the classification-based EB method for HSID is shown in Table 1.

Evaluation Criteria for Hotspot Identification Methods.
For the purpose of evaluating the performance of HSID methods, some kind of standardized test procedures are needed.Ultimately, analysts will be concerned with an HSID method's capability to find accident prone sites and properly rank sites according to risk.These concerns are directly related to the overarching objective of prioritizing safety treatments at hotspots in a limited-funding environment.While a multitude of tests are available and determining which test is optimal may not be clear, one might argue that "good" performance (to be described in the forthcoming test descriptions) across multiple tests could be a reasonable indicator of a method's overall performance in HSID.As such, we consider three commonly used tests attributed to Cheng and Washington [32]: the Site Consistency Test (SCT), the Method Consistency Test (MCT), and the Total Rank Difference Test (TRDT).For more information about these three tests, interested readers are referred to Cheng and Washington [32].

Data and Analysis
3.1.Data Description.In order to examine the effectiveness of the methodology presented herein, the research team chose to work with a dataset used in many previous safety studies, which being the Texas rural undivided highway crash dataset.The dataset contains crash counts collected over 1,499 rural undivided highway segments over a span of five years, 1997-2001, for the National Cooperative Highway Research Program (NCHRP) 17-29 project [33].Since the aforementioned tests for evaluation of the hotspot identification methods require data from different time periods for comparison purposes, the dataset comprised of 1,499 observations was broken down into two temporal subsets.The first subset, called "Time Period 1" henceforth, contains the data from the original dataset recorded for 1997 and 1998.The second subset, called "Time Period 2" henceforth, contains the data from the original dataset recorded for 1999, 2000, and 2001.Thus, the union of these two subsets is the original dataset with 1,499 points.Variables collected to describe the segments and be considered as independent variables in the analysis include average daily traffic during the analysis period (), lane width (LW, in feet), total shoulder width (i.e., the sum of shoulder width on both sides of the roadway in feet, SW in feet), and curve density (i.e., the number of curves per mile, CD).The dependent variable in the analysis is the number of crashes observed on each segment over the analysis period, and another variable, segment length (, in miles), was considered as an offset in the regression.Summary statistics on the dataset are presented in Table 2.

Modeling Results.
To study classification-based EB methods, GFMNB- models were developed from the crash records in Time Periods 1 and 2, respectively.Data in each time period was used to estimate the finite mixture models with  components; that is,  separate NB models were estimated for each type of mixture model that are combined together to form a weighted estimate.Then, the GMFNB- model was used as the basis for the clustering-based EB methods.That is to say, the number of components used in the model was selected as the basis for the number of clusters to use for grouping the crash data under each of the aforementioned clustering methods.
When estimating the GFMNB- models, perhaps the main problem is to determine how many components should be used in the model (i.e., to select ).In order to select the number of components for each model in each time period, the method presented in Park et al. [34] was applied in this study.Under this approach, the analyst builds finite mixture models with increasing numbers of components (from two upwards) and selects the final model (and number of components) through goodness-of-fit metrics, such as the Akaike Information Criterion (AIC) or the previously mentioned BIC, which balance the number of components and overall model fit (measured via log-likelihood).Eluru et al. [35] noted that BIC is more stringent than AIC in terms of applying a penalty based on number of components, and thus it may be more robust in terms of preventing overfitting.As such, BIC was selected as the means of choosing the number of components for the finite mixture models in each of the two time periods: where loglikelihood 푗 is the log-likelihood of model ;  is the number of components in finite mixture model ; and  is the sample size (i.e., number of sites in dataset).
In this study, GFMNB- models were developed from the crash data in Time Periods 1 and 2 with increasing numbers of components  = 2, 3, or 4. Figure 1 indicates that use of  = 2 (i.e., finite mixture models with two components) leads to the best goodness-of-fit as indicated by the lowest value of BIC.Hence, the number of components is selected as  = 2 and  the GFMNB- models can now be indicated as GFMNB-2 models.It is important to note that, in general,  = 2 may not always be the optimal number of components and the choice will depend on the data.That said, BIC is a reasonable method to use to select .By examining Figure 1, one can see that the BIC values reported for the GFMNB-2 models are not as large as those for the regular NB model ( = 1) in the corresponding time period suggesting that the mixture models have better goodness-of-fit.Further, the choice of  = 2 based on BIC seems to suggest the existence of two distinct subpopulations within the crash data corresponding to each time period instead of a lone data population.

Grouping Results.
According to the results of the GFMNB- fitting procedure, it was determined that a GFMNB model with two components fit the data best.Thus, for each of the clustering-based-EB procedures for HSID, the full set of crash data was split into two groups for each time period (i.e., four groups total) from which NB models were estimated and corresponding EB estimates were calculated.That is to say, given the crash data for Time Periods 1 and 2, the three aforementioned clustering algorithms (i.e., -means, hierarchical with complete linkage, and estimation of a GFMNB- model) were applied to group the data from each time period into two clusters, for which EB estimates were computed.
Table 3 shows grouping results for each component, under each clustering method for both time periods considered in the study.For each component, the sample size along with the mean and standard deviation (SD) for each variable in the dataset (as described previously) is presented.
From the table, it can be seen that, in general, mean values for the lane width, shoulder width, and segment length do not differ much between components.That said, in some cases, particularly for the groupings based on hierarchical clustering for Time Period 2, the mean number of crashes differs dramatically between components.Additionally, there is a substantial difference in the mean values of average daily traffic () between components for all clustering methods considered in both time periods.Such a trend suggests that the data considered here may come from underlying subpopulations where traffic volume is a defining characteristic for subpopulation membership and thus a good descriptor of the heterogeneity in the data.
With crash data clustered for each time period according to the three aforementioned clustering methods, EB estimates were then obtained after estimating an NB regression model for each of the two components corresponding to a given clustering method for a given time period.When interpreting results henceforth, one should consider the sample sizes used to estimate the NB models.For example, the sample size of "Component 1" (i.e., one grouping) for Time Period 2 as defined via hierarchical clustering with complete linkage has only 66 data points.Thus, modeling results associated with this group (namely, the results of the SPF and corresponding EB estimates) and the overall EB estimates for Time Period 2 as determined via hierarchical clustering (i.e., the aggregation of the EB estimates for components 1 and 2) should be interpreted with caution.
EB (here, all data are considered as being from one population), (4) GFMNB-based EB method, (5)-means-based EB method, and (6) hierarchicalbased-EB method, are conducted using the three main tests from Cheng and Washington [32].As all test procedures involve comparison across two different time periods, we use the time periods as defined in Table 2. Further, we consider three different scenarios in terms of the number of high-risk sites selected for consideration under each HSID method.These scenarios correspond to considering 1%, 5%, and 10% of all sites as high-risk (i.e.,  = 0.01, 0.05, and 0.10).For example, in this study, when  = 0.10, a total of approximately 150 sites (i.e., ∼10% of the 1,499 total sites) will be considered as high-risk, and their data will be used in calculation of the test statistics for the various HSID methods.Table 4 shows the results of the six HSID methods considered under the Site Consistency Test.As aforementioned, the goal of the SCT is to measure consistency of a method in identifying sites as high-risk over time.The underlying principle is that high-risk sites should show consistently high crash counts over time, and thus the higher the value for the SCT statistic, the better performing the HSID method.From the table, it can be seen that the worst performing method across all cutoff levels for high-risk site identification (i.e., all  values) is the AR method.When one percent of sites are considered as high-risk, the conventional EB method, means-based EB method, and hierarchical-based EB method all perform equally well.For the cases in which 5% and 10% of sites are considered as high-risk, the -means-based EB method is identified as the best performing HSID method according to the SCT.That said, in both of these cases, the value of the SCT test statistic for the hierarchical-based-EB method gives a value quite close to those obtained by the means-based method, indicating that it also seems to perform nearly as well in HSID.
The results of the six HSID methods being evaluated in terms of the Method Consistency Test are shown in Table 4.The MCT is designed to assess consistent identification of the same high-risk sites across different time periods.As such, the higher the value of the MCT test statistic, the better the performance of the HSID method (i.e., higher values imply that more sites were identified as high-risk in both time periods considered).From Table 4, one can see that, across all three cutoff levels for proportions of sites to consider as high-risk, the GFMNB-based EB method performs the best.That said, for the case in which 10% of sites are considered as high-risk, the -means-based method performs just as well.Additionally, for all cutoff levels, the results of all clustering-based EB methods (e.g., GFMNB-, -means-, and hierarchical-based) exhibit quite similar performance.As was the case for the SCT, the AR method consistently performs the worst across all three cutoff levels for proportions of sites to be considered as high-risk.Table 4 presents the results of the HSID-procedure evaluation under the Total Rank Difference Test.Again, this test is based on consistent identification of high-risk sites across time periods, but here, the rankings of sites identified as highrisk in one time period are compared to the rankings of the same sites in another time period.Hence, the smaller the value of the TRDT test statistic, the better the performance of the method in HSID.From the table, it can be seen that the GFMNB-based EB method yields the best HSID performance across all three cutoff levels of proportions of sites to be considered as high-risk.Under this test, the other clusteringbased EB methods (e.g., -means-based and hierarchicalbased) outperform the naïve AF and AR methods across all cutoff values and also outperform the EB method for the 5% and 10% cutoffs.As was the case for the preceding two HSID performance tests, the AR method of HSID consistently performs the worst across all three cutoff levels of proportions of sites to be considered as high-risk.
Overall, the preceding tests indicate that the GFMNBbased EB method appears to exhibit the strongest HSID performance in all three tests and across the different cutoff levels of proportion of sites to be considered as high-risk.That said, the results obtained from the other clusteringbased EB methods (e.g., -means-based and hierarchicalbased) are usually close and tend to outperform the AF, AR, and standard EB methods.From all tests, it appears that the AR method performs the worst.One possible explanation for this behavior may be that since the test sites are rural road segments, many may exhibit low ADT values and thus, as aforementioned, low-volume sites may be overrepresented as high-risk since the AR calculation normalizes by traffic count.Ultimately, HSID methods that themselves make use of the EB method when computing safety estimates prior to site ranking appear to perform better than the naïve AF and AR methods.This finding is consistent with many previous studies including [17,36,37].
3.5.Discussion.From the preceding analysis, it appears that the GFMNB-based EB procedure for HSID performs the best when evaluated with the three aforementioned test procedures [32] on the Texas rural undivided highway crash dataset.That said, it seems that all EB-based methods typically outperform the naïve methods, especially the AR HSID method.One possible reason the EB-based HSID methods may perform better is due their use of both the observed historical accident data and predicted accident count from similar sites with the SPF.Further, the EB methods are able to adjust for RTM bias.That said, the conventional EB method is not without its limitations for HSID.The main limitation, perhaps, arises when there is a substantial degree of heterogeneity in the crash data making it such that the crash data seem to arise from different subpopulations.Such heterogeneity could arise when large amounts of crash data collected from areas that differ dramatically geographically and with respect to a variety of other site-specific conditions.Oftentimes, crash data are aggregated in an effort to ensure that sufficient sample sizes are available for model estimates (i.e., in an effort to reduce the standard error value of regression coefficients).In order to remedy this issue of not accounting for heterogeneity in the data, three clusteringbased EB methods were proposed in this report.The idea behind these methods was to group the overall set of crash data (i.e., full list of study sites) into smaller subsets such that the site in each subset was more similar to sites within their groups than sites in other groups according to features, such as traffic volume, lane width, and other predictors.Further, it was hoped that such clustering could potentially help uncover the underlying groups/subpopulations from which the data could have been generated.Indeed, it appears that the clustering-based EB methods that applied -means-, hierarchical-, and GFMNB-based clustering were able to analyze heterogeneous data and outperform more conventional methods in terms of HSID.
While the clustering-based EB methods for HSID have several benefits, they are not without their limitations.Perhaps the largest limitation of clustering-based EB methods is that, in some cases, they can cluster data into groups with relatively small sample sizes.Then, regression models (i.e., SPFs) developed from these small samples are more likely to exhibit their own issues such as biases in their coefficient estimates.This issue can further be compounded when analysts interpret the biased results and have the potential to make erroneous inferences/conclusions.As such, it is important that one be cognizant of the sample sizes of the clusters and what impacts they may have on model estimates and resulting inference [38].Ultimately, as always, analysts are encouraged to interpret all results, especially those corresponding to regression models developed from small samples (e.g., 100 or less sites) with caution.

Conclusions
This study introduced three clustering-based EB methods for hotspot identification purposes.The clustering methods considered were the GFMNB-g model, -means clustering, and hierarchical clustering with complete linkage.The newly developed clustering-based EB methods for HSID were compared in terms of performance to conventional HSID methods, including the EB method, as well as the naïve AF and AR method, with three methods comparing performance in HSID across different time periods as developed by Cheng and Washington [32].When studying the HSID results based on applying the methodology to Texas undivided rural highway crash data, the results suggest that all three clustering-based EB analysis methods are preferred over the conventional statistical methods.Additionally, it seems that the accuracy of HSID can be enhanced by appropriately classifying roadway segments according to the heterogeneity of the crash data (i.e., clustering the data before developing SPFs for use in EB estimates).That said, one should always be cautious when classifying roadway segments into clusters as inappropriate classification of roadway segments can lead to erroneous results (e.g., biased coefficient estimates from SPFs developed from small sample sizes).Although the proposed clustering-based EB method is not yet ready for practical application, transportation safety analysts may use the clustering-based EB method to calculate the EB estimates and avoid manually identifying similar groups within the heterogeneous crash dataset, a task that may be difficult as the underlying subpopulations in the data are usually unknown.Future work could evaluate development of a performance measure to evaluate the overall HSID performance of the three clustering-based EB methods (i.e., to determine which clustering method is best and when it is best to use each).

Table 1 :
(2)ssification-based EB method for HSID.Use the GFMNB- model, -means algorithm, or hierarchical clustering algorithm to cluster the data into  groups(2)Separate the data into  groups based on the results of clustering (3)Estimate  NB regression models, one for each of the  subgroups, and use the corresponding SPF to get EB estimates for each site (4)Aggregate the EB estimates for all sites, rank the sites, and identify hotspots

Table 2 :
Summary statistics for road segments in Texas rural undivided highways dataset.

Table 3 :
Characteristics of each component.

Table 4 :
Results for various methods using three different tests.
Note.Bold number indicates the best result.