Scan Statistics for Detecting High-Variance Clusters

Scan statistics are mostly used to detect spatial areas or time intervals in which the mean level of a given variable is more important. Sometimes, when this variable is continuous, there is an interest in looking for clusters in which its variability is more important. In this paper, two scan statistics are introduced for identifying clusters of values exhibiting higher variance. Like many classical scan statistics, the first one relies on a generalized likelihood ratio test whereas the second one is based on ratios of empirical variances. These methods are useful to identify spatial areas or time intervals in which the variability of a given variable is more important. In an application of the new methods, I look for geographical clusters of high-variability income in France and then for residuals exhibiting higher variance in a linear regression context.


Introduction
Cluster detection has become a very fruitful research subject since the earlier work of Naus [1] looking for an unusual clustering of random points on the line.Thereafter, it was extended to identify time intervals or spatial areas where the observations of a given random variable are different than elsewhere.These methods are nowadays very popular in disease surveillance for the detection of disease clusters, and they are also used in many other fields, such as forestry, astronomy, and criminology.A thorough review is given by Glaz et al. [2].
Most of the cluster detection methods are designed for count data, that is, point processes made of the random coordinates of  events observed in , a bounded subset of R  : the goal is to identify, if they exist, the areas in which the concentration of events is abnormally high.According to the article by Cressie [3], the scan statistic denotes the maximal concentration observed on a collection of potential clusters.Originally, the size of all the potential clusters had to be the same, so that the scan statistic was just the maximum number of events in a window of size ,  being fixed a priori.This major drawback vanished when Kulldorff [4] introduced the scan statistic based on generalized likelihood ratio in either Poisson model or Bernoulli model.This scan statistic is defined to analyse point processes with binary marks, such as case/control data.Later on, Kulldorff et al. [5] introduced the Gaussian model scan statistic which allows analysing point processes with continuous marks.
This scan method is useful to detect high-mean clusters, that is, areas where the marks are significantly higher than elsewhere.However, when analysing income inequalities, for example, it might be useful to look for high-variance clusters: in these areas, inequalities are more obvious and this could generate more violence or more crimes.Thus, existing scan methods need to be modified in order to detect such clusters.
In this paper I introduce two scan statistics for detecting high-variance clusters.Section 2 describes the scan statistics and their computational aspects in the framework of marked point processes.Their performances are compared through a simulation study in Section 3. In Section 4, they are applied to two real data sets: on the first one, which describes the spatial distribution of incomes in France, the high-variance scan statistics are directly applied; on the second one, which illustrates the link between expenditure on public schools and per capita income, a linear regression model is fitted and its residuals are analysed through high-variance scan statistics.The paper is concluded with a discussion.

Two High-Variance Scan Statistics
Let {(  ,   ),  = 1, . . ., } denote the realization of a marked point process, where   ∈  is the location of the event and   ∈ R its associated mark.The area  ⊂ R  is the observation domain and the locations can be either one-dimensional ( = 1) or spatial ( = 2 or  = 3).Our goal is to detect the area  ⊂  in which the marks exhibit a significantly higher variance than elsewhere.
From now on, let us consider that the locations are spatial.The setting of one-dimensional locations will be handled in Section 4.2.Most of the spatial cluster detection methods rely on likelihood ratio between two hypotheses depending on a potential cluster: the scan statistic is nothing but the maximum of all these likelihood ratios.Thus the two questions to answer are how to choose the potential clusters and which likelihood ratio should be used?
Concerning the potential clusters, I will focus on circular clusters, such as Kulldorff [4].The set of potential clusters, denoted by D, is the set of discs (if  = 2) or balls (if  = 3) centered on a location and passing through another one: where  , is the disc (or the ball) centered on   and passing through   : the number of potential clusters is  2 .

A Generalized Likelihood Ratio Scan Statistic.
As said in the Introduction, Kulldorff et al. [5] introduced a Gaussianbased scan statistic to detect clusters exhibiting higher means than elsewhere.Their concentration index is based on the likelihood ratio between two hypotheses.In order to detect high-variance clusters instead of high-mean clusters, I decided to modify their alternative hypothesis.
Let  1 , . . .,   denote the random variables associated with the marks, which are assumed to be independent.Under the null hypothesis  0 , corresponding to the absence of any cluster, all the marks come from the same Gaussian distribution: where ( * ,  * 2 ) is the maximum-likelihood (ML) estimate associated with this distribution; that is, (, 2 ) (⋅) being the density function associated with the Gaussian distribution with mean  and variance  2 .Let the spatial area  ⊂  be a potential cluster and  =  \  its complement.Under the alternative hypothesis  1, , corresponding to a high-variance cluster in , the marks come from a mixture Gaussian distribution with equal means but different variances: where ( ) (  ) . ( Both of these ML estimates have closed forms.Let denote, respectively, the number and the mean of marks and the mean of squares of marks in .The ML estimate under  0 is the vector containing the mean and the (biased) empirical variance of the marks in the whole domain.As shown in the Appendix, the ML estimate under and  * , is the root of a cubic function which can be obtained using, for example, Cardano's method [6].The log-likelihood under the null hypothesis is and the one under the alternative hypothesis is so that the log-likelihood ratio between both hypotheses reduces to The next step is only the maximization of this log-likelihood ratio on the set of potential clusters previously defined.
The generalized likelihood ratio (GLR) high-variance scan statistic is and the potential cluster for which this maximum is obtained, is called the most likely cluster.
Once the scan statistic is computed, I need to evaluate its significance.Unfortunately, the null distribution of  is untractable due to the dependence between the loglikelihood ratios: indeed, the log-likelihood ratios LLR  and LLR   associated with two potential clusters  and   are independent only if  ∩  = 0. Another solution, chosen by Kulldorff [4], would be to simulate random data sets under the null hypothesis.However, by doing the randomization this way, the correct alpha level will not be maintained if the marks do not truly come from a normal distribution.Thus I decided to run a technique used by Kulldorff et al. [5] called random labelling: a simulated data set is obtained by randomly associating the marks with the spatial locations.Let  denote the number of simulated data sets and let  (1) , . . .,  () be the observations of the scan statistic associated with these data sets: these simulated scan statistics must be compared to , the scan statistic observed on the real data set.According to Dwass [7], the Monte Carlo based  value of the scan statistic  is /( + 1), where  is the rank of  in the ( + 1)-sample ( (1) , . . .,  () , ).Note that this  value is unbiased in the sense that under the null hypothesis, the probability of observing a  value less than or equal to  is exactly .According to the classical test theory, the most likely cluster Ĉ is said to be significant if the associated  value is less than type I error .

A Generalized Variance-Ratio Scan Statistic.
Even if the likelihood-based scan statistics are broadly used, Cucala [8] proposed, in a different context, an alternative scan statistic showing better results than the one based on likelihood ratio: the power was slightly higher, such as the ability to detect small clusters.Here I propose a method to build a highvariance scan statistic not relying on likelihood ratio.
A classical test for equality of variances between the marks in  and in , usually known as -test, relies on the ratio of (unbiased) empirical variances where Note that, since the   's are assumed to be independent, the empirical variances  * 2  and  * 2  are also independent.Moreover, under Gaussian assumption, the equality of variances of marks in  and  ensures that   follows, by definition, a Fisher distribution (  − 1,   − 1), as mentioned by Saporta [9].Let me introduce the high-variance index for potential cluster where    −1,  −1 (⋅) denotes the cumulative distribution function associated with the Fisher distribution (  −1,   −1).If the variances are equal, the distribution of   is uniform (0, 1) and does not depend anymore on   : this method is called the probability integral transform [10].Moreover, a value of   close to 1 shows that the variance in  is significantly larger than in .Therefore, the generalized variance-ratio (GVR) scan statistic is The way I evaluate the significance of λ is completely similar to the method for .

A Simulation Study
I decided to run a simulation study in order to compare the results of the tests based on the scan statistics previously introduced,  and λ.I generate artificial data sets according to three different models.Whatever the model, the geographical locations are the locations of the 94 French departments.Note that the location associated with each department is the location of its capital city.The true cluster, called , is a set of eight departments around Paris called Ile-de-France: in this area, the variance of the marks is larger than elsewhere.In the first model, denoted by homogeneous Gaussian model, the associated marks are independent and follow a Gaussian distribution with equal means and different variances: In the second model, denoted by Exponential model, the associated marks are independent and are derived from exponentially distributed random variables with different rate parameters,   =   − E(  ), where so that the means are all equal to 0. In the third model, denoted by heterogeneous Gaussian model, the associated Note that, for every model, the parameter  is the ratio between the variance inside the cluster and the variance outside the cluster: I should call it the cluster intensity.For the third model, the parameter , which denotes the difference of means inside the cluster and outside of the cluster, is called the mean lag.
For each model and each value of the cluster intensity , I generated 1000 simulated data sets.The high-variance scan statistics  and λ have been computed, using the set of circular potential clusters described in Section 2, and their  values are estimated based on  = 999 permutations.In both methods, the most likely cluster Ĉ is said to be significant if the associated  value is less than  = 5%.
When applied to such data sets exhibiting a true cluster, scan methods are useful when they manage simultaneously to identify that there is a significant cluster and to recover as precisely as possible the true cluster.Therefore, in order to compare the different scan methods, I computed three different criterions.The first one is the power of the method, that is, the proportion of data sets exhibiting a significant cluster.The second one is the mean number of true positive (TP) departments, that is, the departments included both in the most likely cluster Ĉ and in the true cluster .The third one is the mean number of false positive (FP) departments, that is, the departments included in the most likely cluster Ĉ but not in the true cluster .Note that the sum of TP and FP is the mean number of departments in the most likely cluster Ĉ.Table 1 gives the results obtained with the homogeneous Gaussian model, Table 2 the results obtained with the Exponential model, and Table 3 the results obtained with the heterogeneous Gaussian model.
The bold values are the best results obtained in each procedure.As expected, the power of the methods increases with the cluster intensity.Under the homogeneous Gaussian model, the GVR method is always more powerful than the GLR one, whatever the cluster intensity is.These results are very similar to the ones obtained by Cucala [8] stating that, for detecting clusters in count data, GLR methods are not always the most powerful ones.However, although these power results are quite satisfactory for the homogeneous Gaussian model, they are quite poor when the marks are exponentially distributed: even when the variance is 2.5 times larger in the cluster, both methods hardly detect it.The results in Table 3 also reveal that, even if those scan statistics are designed for detecting high-variance clusters, both are sensitive to a difference of means.Indeed, the cluster intensity being equal, the powers of both methods increases when there is a mean lag.Finally, looking at the FP results, I also conclude that the GLR method tends to exhibit larger significant clusters than the GVR one, which leads to a larger number of departments exhibited by mistake.

Applications
4.1.An Application to Economic Data.For many years now, the spatial analysis of income inequality is an interesting scheme for many economists; for example, Shelnutt and Yao [11] investigated the income inequalities in the different counties of Arkansas and the interaction with economic growth, whereas Atems [12] analysed the Gini index of the incomes in the 3109 American counties.Among others, Deutsch et al. [13] compared the spatial distribution of crime and economic inequalities in the US and found a correlation.These studies confirm that knowing precisely the geographic areas in which inequalities are more significant is of great interest.However, this search is usually done in a nonmathematical way.In this subsection, I show how the high-variance scan methods introduced later on can be useful for such a purpose.I apply the scan methods to an economic data set provided by a French institute, l'Institut National de la Statistique et des Etudes Economiques [14].For each of the 94 departments in France, the median income for year 2009 has been computed and Figure 1 illustrates the results.This figure clearly exhibits a cluster of wealthy departments in the North of France, next to Paris (whose location is given by the light square), whose significance has been assessed by Cucala [15] using the classical Gaussianbased scan statistic.However, we may wonder whether there is any high-variance cluster, enlightening a region where inequalities between departments are significant.To answer this question, I applied both high-variance scan statistics to this data set, once again using the set of circular potential clusters described in Section 2. Their associated  values are also estimated based on  = 9999 permuted samples.The most likely clusters are given by Figure 2. Note that, for convenience, the clusters I highlighted on this figure are not the circular areas exhibited by the scan methods, but the set of departments whose capital cities are included in those circular areas.
The most likely cluster exhibited by GLR scan statistic  is exactly the area around Paris called Ile-de-France.This is not surprising since it includes 7 wealthy departments but also the department called Seine-Saint-Denis which is one of the French departments in which the incomes are much lower.This cluster is quite significant since its  value equals 0.0001: none of the 9999 likelihood-based scan statistics computed from the permuted samples was larger than the likelihoodbased scan statistic computed from the observed sample.The cluster exhibited by GVR scan statistic λ is a bit less significant: its  value equals 0.0075.It contains 7 departments in Ile-de-France but also 6 neighboring departments in which the incomes are much lower.

An Application to Residuals Analysis.
Even if the highvariance scan statistics have been introduced in a spatial setting, they can also be adapted to data with one-dimensional locations, like high-mean scan statistics.This includes data indexed by time but also any kind of data indexed along an axis.In order to illustrate this, I apply the high-variance scan methods to a set of residuals in a regression framework.
The data set I use results from an investigation of public school spending in the United States: the sample consists of 51 observations of per capita expenditure on public schools and per capita income for each state and the District of Columbia in 1979 [16].As recommended by Greene [17], a linear regression model is fitted to explain the expenditure: the explanatory variables are the income and the squared income.A classical way to check the homoscedasticity assumption, that is, that the errors variance is constant across all the observations, is the analysis of the plot of the studentized residuals versus one of the explanatory variables, or alternatively the fitted values.In Figure 3, the studentized residuals are plotted versus the income.Note that, since the other explanatory variable of the model is the squared income, the plot of the residuals versus this explanatory variable or versus the fitted values would have been quite similar since the ordering of the residuals would have been exactly the same.
The variance of the residuals seems to be constant, except for larger values of the income.A few methods have been introduced in order to check that the residuals have equal variance: the Breusch-Pagan [18] test and the White test [19] are powerful to do so but they do not mention which of the residuals exhibit higher variance.The use of high-variance scan methods is of great interest to obtain such information.
Let  1 , . . .,   denote the studentized residuals, ordered such that   corresponds to the th observation of the income in ascending order.For any one-dimensional cluster detection method, the potential clusters are all the sets of observations {  , . . .,   }, 1 ≤  ≤  ≤ .Here, since the variance of the marks needs to be estimated inside and outside of the potential cluster, intervals containing one observation ( =  + 1) or  − 1 observations ( =  +  − 2) are excluded.I computed both high-variance scan statistics on these data.The most likely cluster given by  corresponds to the four last residuals, from solid line in Figure 3.The most likely cluster given by λ only contains the two last residuals, from dotted line in Figure 3.This matches the conclusion of the simulation study: on this specific data set, the most likely cluster exhibited by the likelihood-based scan statistic is smaller than the one exhibited by the variance-ratio scan statistic.Based on  = 9999 permuted samples, their  values are, respectively, 0.0535 and 0.0426: this clearly indicates that the regression model is not valid when the income overcomes a certain level.Note that the Breusch-Pagan test and the White test both gave the same conclusion.

Discussion
The scan statistics introduced in this paper allow one to detect high-variance clusters in marked point processes without any parameter to set up.According to the simulation study, the GVR scan statistic is slightly more powerful than the GLR one against any homogeneous clustering alternative: this result confirms that GLR scan statistics are not always the most efficient, even when the distribution of the marks is known.It also seems that, such as in the regression residuals analysis, the GLR scan statistic hardly detects clusters containing very few observations.Note that the GLR scan methods compare likelihood ratios which are, under the null hypothesis, not exactly equally distributed but only asymptotically when  tends to infinity, as stated by Wilks [20].On the other hand, the GVR scan method compares the   's which are equally distributed under the null hypothesis.Therefore, I should recommend using the GVR scan statistic instead of the GLR one.
In Section 4, I gave two examples of real data analysis through high-variance scan statistics: the search for inequalities hotspots on economic data and a homoscedasticity checking procedure on regression residuals.In the first example, the scan statistics clearly indicate a significant high-variance cluster, which is not very surprising, but also indicate precisely where social programs should be focused on in order to reduce the inequalities.Note also that a finer analysis could be done using the median income in each city instead of each department.In the second example, a new possibility is given by the high-variance scan methods to analyse regression residuals: they lead to the rejection of homoscedasticity hypothesis and focus on the individuals for which the linear regression model seems to be nonvalid.Let me mention that, among others, another application possibility would be the analysis of meteorological time series such as annual rainfall records in a specific station.Indeed, a high-variance scan test could determine whether the amplitude of rainfall increased in the last decades, because of global warming, in certain regions, leading to a higher frequency of very dry and very wet years.
The GVR scan statistic I introduce is based on the most famous test for equality of variances, the -test.However, many other tests are available in the same purpose, as mentioned by Gartside [21].For example, a scan statistic derived from the squared rank test introduced by Conover [22] could be set up, similar to the Mann-Whitney scan statistic for high-mean clusters defined by Cucala [23].It might be more suitable to marks whose distribution is not Gaussian.
The randomization procedure used to estimate the significance of both high-variance scan statistics is the most basic one, that is, random labelling.I focused on this procedure because it is the only one for which type I error remains equal to  whatever the underlying distribution of the data is.Since the GVR scan statistic itself is also distributionfree, this choice sounds natural.However, if the labels are spatially autocorrelated, this may lead to overestimating the significance of the detected clusters.Note that the same problem arises with other GLR based scan statistics when the significance is estimated through Monte Carlo simulation.As mentioned by Haining [24], restricted randomization procedures taking into account this spatial autocorrelation are usually applied to global clustering tests such as Ripley's  and derived methods.On the other hand, for local cluster detection tests such as scan statistics, this approach is much less frequent, except in a few articles including the ones by Loh and Zhu [25] and Zhang et al. [26].Since the methods they introduced are designed for high-mean cluster detection, it could be interesting to adapt them for high-variance cluster detection: this might be the subject of a future work.
In the last few years, there have been more and more papers dealing with spatiotemporal cluster detection, such as Viel et al. [27].It should be noted that the high-variance scan statistics can also be applied to spatiotemporal data using the spatiotemporal distance introduced by Demattei and Cucala [28].
In this work I only focused on the most likely cluster but the detection of secondary clusters is straightforward using the method proposed by Zhang et al. [29]: once a significant cluster is found, remove the data included in that cluster and restart the analysis.
Finally, I should emphasize that the high-variance scan tests could be adjusted for any continuous covariate adjustment as proposed by Klassen et al. [30], such as the age of an underlying population.This could be done by modeling a regression function of the marks depending on the adjusted covariates and then analysing the corresponding residuals.