Goodness-of-Fit Tests Based on Sample Space Partitions: A Unifying Overview

Recently the authors have proposed tests for the one-sample and the k-sample problem, and a test for independence. All three tests are based on sample space partitions, but they were originally developed in different papers. Here we give an overview of the construction of these tests, stressing the common underlying concept of "sample space partitions."


Introduction
In a series of papers the authors (Thas and Ottoy [18], [19], [20]) presented new tests for one-sample goodness-of-fit, k-sample goodness-of-fit and independence between two continuous variables. All tests are based on sample space partitions (SSPs) of an arbitrary size. They are referred to as the SSPc, SSPkc and SSPrc tests, respectively. The aim of this paper is to give an overview of the construction of the test statistics. The stress is laid on the common underlying idea of observation-based SSPs, which categorize the data into classes on which Pearson χ² statistics are calculated.
Nowadays it is generally known that the k-sample problem and the problem of independence are both special cases of the traditional goodness-of-fit problem. One of the oldest solutions to the one-sample goodness-of-fit problem is to group, i.e. to categorize, the continuous data into classes (or cells), and next to apply a Pearson χ² goodness-of-fit test to the induced discrete data set. Major questions with this approach are where to place the cell boundaries and how many cells must be constructed. Starting with Fisher [4], up to the late 1980's many papers have been devoted to these questions. A summary is given by e.g. Moore [7]. Despite the intuitive justification of this approach, it is recommended these days to apply only goodness-of-fit methods that are specifically developed for continuous data, e.g. a Kolmogorov-Smirnov or a Shapiro-Wilk test for testing normality. Still there are good methods that rely heavily on the Pearson statistic for discrete data. For example, for the one-sample and the k-sample problem, the Anderson-Darling type test statistics (Anderson and Darling [1]; Pettitt [8]; Scholz and Stephens [11]) consist of averages of one-degree-of-freedom Pearson statistics, calculated on all one- or two-way tables that are induced by observation-centred sample space partitions. For the k-sample and the independence problem, the other extreme solution has recently been proposed by Rayner and Best [10]. Their approach consists of identifying a cell boundary with each observation in the sample, resulting in a maximal degrees-of-freedom Pearson statistic, of which they consider the first few components as interesting test statistics.
Our approach is situated between the Anderson-Darling and the Rayner and Best methods. In particular, we will repeatedly group the data into an arbitrary number of classes (fixed a priori) according to observation-based cell boundaries. Later we will show that this classification is related to sample space partitions, and that the number of groups is closely related to the SSP-size. The number of different possible groupings depends on the SSP-size. On each differently categorized sample a Pearson statistic is calculated, and the SSP test statistics are basically the averages of all such possible Pearson statistics. In the next section the construction is explained in detail. The small sample power characteristics are discussed in Section 3. Finally, in Section 4 an example of the one-sample SSPc test is presented.

The Sample Space Partition Tests
The Three Goodness-of-Fit Problems

In general all tests may be considered in some sense as goodness-of-fit tests, i.e. they test a null hypothesis that may be written as

H0 : F(x) = G(x) for all x ∈ S,   (1)

where S is the sample space of X, and F and G are the true and the hypothesized distribution function, respectively. Depending on how G is determined, different specific types of goodness-of-fit problems arise. In particular, the null hypothesis in (1) is the null hypothesis of the one-sample goodness-of-fit problem. A distinction must be made between the case where G is completely known, and the case where G is known up to some unknown nuisance parameters θ. These cases are typically referred to as a simple and a composite null hypothesis, respectively. In the latter situation the hypothesized distribution is often denoted by G_θ, where the unknown finite dimensional parameter θ is estimated from the data, say by means of a consistent estimator θ̂n, resulting in Ĝ = G_θ̂n. For the sake of generality, we will always denote the hypothesized distribution as Ĝ, even when the null hypothesis is simple.
For the k-sample problem, the null hypothesis is

H0 : F_1(x) = · · · = F_k(x) for all x ∈ S,

where the F_j are the true distribution functions of the k populations. Under this null hypothesis the common distribution function is denoted by G = F_1 = · · · = F_k. The common distribution G is not known and is typically estimated by the empirical distribution function Ĝ, obtained by pooling all k samples.

Suppose that X = (Y, Z) denotes a bivariate variable. The null hypothesis of independence between Y and Z implies the restriction on the distribution function F_yz that it is the product of the univariate marginal distribution functions F_y and F_z of Y and Z, respectively, i.e. G(y, z) = F_y(y)F_z(z) for all (y, z) in the sample space of X. The univariate marginal distributions are not specified a priori, and thus they must be estimated from the data. The corresponding empirical distribution functions are denoted by F̂_y and F̂_z, resulting in the estimated bivariate distribution Ĝ(y, z) = F̂_y(y)F̂_z(z) under the null hypothesis.

The Observation-Based Sample Space Partitions
Although all three types of SSP tests are based on partitions of the sample space S, we still have to make a distinction between these three types with respect to the way the partitions are constructed. First some notation is introduced. Let S_n denote a sample of n observations. In the case of the k-sample problem and the independence problem, the sample is considered as the pooled sample and the sample of bivariate observations, respectively. For the former problem, the k independent samples are denoted by S_j,nj (j = 1, ..., k), where n_j is the corresponding sample size, and n = n_1 + · · · + n_k. For the independence problem we introduce S^y_n and S^z_n as the samples of the univariate Y and Z components of the sample S_n. Finally, F̂_n and F̂_j,nj denote the empirical distribution functions as estimators of F and F_j, respectively.
• One-sample problem. Apart from the specification of Ĝ, there is at this point no need to treat the simple and the composite null hypothesis case separately. Suppose that D_c is a subset of the sample S_n with c − 1 elements. Let 𝒟_c denote the collection of all such subsets and d_c,n = #𝒟_c the number of its elements. Let x_(i) denote the i-th order statistic of D_c (i = 1, ..., c − 1), and define x_(0) = inf S and x_(c) = sup S. We then define a SSP of size c as [A](D_c) = {A_1, ..., A_c}, with A_i = {x ∈ S : x_(i−1) < x ≤ x_(i)} (i = 1, ..., c). The partition [A](D_c) implies on the sample S_n a discretization into a table with c cells, of which the counts are given by N_i(D_c) = #(A_i ∩ S_n). The construction of the partitions and the induced table of counts is illustrated in Figure 1.
• k-sample problem. The subsets D_c and the partitions [A](D_c) are defined as before, except that D_c is now drawn from the restricted sample R_n = {x ∈ S_n : x ≠ max S_n}. Let 𝒟_c denote the collection of all subsets D_c of the restricted sample R_n, and let d_c,n denote the number of its elements, i.e. d_c,n = #𝒟_c = (n − 1 choose c − 1). Each subset D_c determines a SSP [A](D_c), which further induces a c × k contingency table {N_ij(D_c)} (i = 1, ..., c; j = 1, ..., k), of which the k columns correspond to the k samples. In particular, the counts are given by N_ij(D_c) = #(A_i ∩ S_j,nj).

• The independence problem. The size of the SSP is now given by (r, c). Without loss of generality, suppose that r = max(r, c). We define subsets D^yz_c ⊂ S_n and D^z_r,c ⊂ S_n. Let y_(1), ..., y_(c−1) and z_(1), ..., z_(r−1) denote the ordered Y and Z components of the elements of D^yz_c and D^z_r,c, respectively. Next we define y_(0) = inf S^y, y_(c) = sup S^y, z_(0) = inf S^z and z_(r) = sup S^z, where S^y and S^z denote the (univariate) sample spaces of Y and Z, respectively. An observation-based SSP of size r × c is then denoted by [A](D^yz_c, D^z_r,c). For notational comfort we will drop the dependence on the sets D^yz_c and D^z_r,c where possible. The r × c SSP is defined as [A] = {A_11, ..., A_r1, ..., A_1c, ..., A_rc}, with elements A_kl = A^z_k × A^y_l (k = 1, ..., r; l = 1, ..., c), where A^z_k = {z ∈ S^z : z_(k−1) < z ≤ z_(k)} and A^y_l = {y ∈ S^y : y_(l−1) < y ≤ y_(l)}. Further, we define the restricted samples R_n = {x = (y, z) ∈ S_n : y ≠ max S^y_n and z ≠ max S^z_n} and R^z_n = {x = (y, z) ∈ S_n : z ≠ max S^z_n}. Finally, let 𝒟_r,c denote the set of all possible sets D^yz_c and D^z_r,c containing observations of the restricted samples R_n and R^z_n, respectively, and let d_r,c,n = #𝒟_r,c. Each r × c SSP [A](D^yz_c, D^z_r,c) induces an r × c contingency table {N_kl(D^yz_c, D^z_r,c)}, where the counts are given by N_kl = #(A_kl ∩ S_n) = n ∫_A_kl dF̂_n.
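As an illustration of the observation-based construction for the one-sample problem, the following sketch enumerates all subsets D_c of c − 1 observations and the table of cell counts each of them induces (the function names are ours; ±infinity stands in for inf S and sup S):

```python
from itertools import combinations

def ssp_counts(sample, d):
    """Cell counts induced on `sample` by the partition built from the
    subset `d`: cell i is the half-open interval (x_(i-1), x_(i)], with
    -inf and +inf standing in for inf S and sup S."""
    bounds = [float("-inf")] + sorted(d) + [float("inf")]
    return [sum(1 for x in sample if lo < x <= hi)
            for lo, hi in zip(bounds, bounds[1:])]

def all_partitions(sample, c):
    """Yield every subset D_c of c - 1 observations with its induced table."""
    for d in combinations(sample, c - 1):
        yield d, ssp_counts(sample, d)

sample = [0.2, 0.5, 0.9, 1.4]
for d, counts in all_partitions(sample, 2):
    print(d, counts)
```

For c = 2 each subset is a single observation, and the induced table simply counts how many observations fall at or below, respectively above, that boundary.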

The Sample Space Partition Test Statistic
The core of the SSP-based test is the Pearson χ² goodness-of-fit statistic, which is computed on all contingency tables that are induced by the partitions. In particular, in the preceding paragraphs it was explained how d subsets D are derived from the sample S_n, how each subset D determines a partition [A](D), and how this partition further induces a frequency table. On each such table a Pearson statistic may be computed. The Pearson statistic, denoted by P²_n(D), is a function of both the observed frequencies and the frequencies expected under the null hypothesis (Table 1).
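On any induced table the Pearson statistic is simply the sum of (observed − expected)²/expected over the cells; a minimal sketch (the function name is ours):

```python
def pearson_statistic(observed, expected):
    """Pearson chi-squared statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# n = 10 observations in c = 2 cells; under the hypothesized distribution
# both cells are equiprobable, so the expected counts are 5 each.
print(pearson_statistic([7, 3], [5.0, 5.0]))  # (4 + 4) / 5 = 1.6
```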
The SSP test statistic has the general form T_n = (1/d) Σ_D P²_n(D), i.e. the average of the Pearson statistics over all d subsets D. The specific forms of the test statistic for the three goodness-of-fit problems are given in the following paragraphs. Also some properties are given.
• one-sample problem: the SSPc test. The test statistic is given by T_c,n = (1/d_c,n) Σ_{D_c} P²_n(D_c). (Figure 1 further illustrates the calculation of the test statistic.) For c = 2 its asymptotic null distribution has been proven (Thas and Ottoy [18]) for both the simple and the composite null hypothesis. Note that in this simplest case the statistic is closely related to the Anderson-Darling statistic (Anderson and Darling [1]). When c > 2 the asymptotic null distribution is conjectured, but approximations are available (Thas [17]).
• k-sample problem: the SSPkc test. The test statistic is given by T_k,c,n = (1/d_c,n) Σ_{D_c} P²_n(D_c), which is a rank statistic. When c = 2 the test statistic reduces exactly to the k-sample Anderson-Darling statistic of Pettitt [8] and Scholz and Stephens [11]. The SSPkc test may thus be interpreted as an extension of the Anderson-Darling test. When c > 2 its exact permutation distribution may be enumerated or approximated by means of Monte Carlo simulation.
• independence problem: the SSPrc test. The test statistic is given by T_r,c,n = (1/d_r,c,n) Σ_{(D^yz_c, D^z_r,c)} P²_n(D^yz_c, D^z_r,c), which is again a rank statistic. For r = c = 2 the asymptotic null distribution of T_r,c,n has been proven (Thas and Ottoy [20]). In this simplest case the SSPrc test is an Anderson-Darling type test for independence which, however, had not been described previously in the literature. The statistic may also be viewed as Hoeffding's statistic (Hoeffding [5]) with an Anderson-Darling type weight function. Also here, when c > 2 the permutation distribution can be enumerated.
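As a sketch of how such a permutation distribution can be approximated by Monte Carlo simulation, the following code implements an unweighted c = 2 version of the SSPkc-type statistic (each subset D_c is then a single observation of the pooled sample, excluding its maximum) together with a permutation p-value. The function names, the absence of the Anderson-Darling weighting, and the choice of 500 permutation runs are ours, not the authors' exact implementation:

```python
import random

def pearson_table(table, expected):
    """Pearson statistic for a two-way table of observed vs expected counts."""
    return sum((o - e) ** 2 / e
               for row_o, row_e in zip(table, expected)
               for o, e in zip(row_o, row_e))

def sspk2_stat(samples):
    """Average the Pearson statistic over the 2 x k tables obtained by
    splitting the pooled sample at each observation except the pooled
    maximum (a split there would leave an empty row)."""
    pooled = sorted(x for s in samples for x in s)
    n = len(pooled)
    stats = []
    for b in pooled[:-1]:
        table = [[sum(1 for x in s if x <= b) for s in samples],
                 [sum(1 for x in s if x > b) for s in samples]]
        # expected count in cell (i, j): row total i times n_j / n
        expected = [[sum(row) * len(s) / n for s in samples] for row in table]
        stats.append(pearson_table(table, expected))
    return sum(stats) / len(stats)

def permutation_pvalue(samples, stat, runs=500, seed=1):
    """Monte Carlo permutation p-value: reshuffle the pooled observations
    over groups of the original sizes and count statistics at least as
    large as the observed one."""
    rng = random.Random(seed)
    sizes = [len(s) for s in samples]
    pooled = [x for s in samples for x in s]
    observed = stat(samples)
    hits = 0
    for _ in range(runs):
        rng.shuffle(pooled)
        perm, i = [], 0
        for m in sizes:
            perm.append(pooled[i:i + m])
            i += m
        if stat(perm) >= observed:
            hits += 1
    return (hits + 1) / (runs + 1)

a = [1.1, 2.0, 2.7, 3.4, 4.2]
b = [5.0, 5.9, 6.3, 7.1, 8.0]
print(round(permutation_pvalue([a, b], sspk2_stat), 3))
```

Because the statistic is a rank statistic, permuting the pooled observations over the groups reproduces its exact null distribution as the number of runs grows.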
Thus the SSPkc and the SSPrc tests are distribution-free rank tests. It has also been proven that all three tests are omnibus consistent for any finite SSP-size. In the next section the power characteristics are briefly summarized.

Power Characteristics
From extensive simulation studies (Thas and Ottoy [18], [19]) it is concluded that the SSPkc and the SSPrc tests are very powerful for many alternatives. In particular, they did not show any power breakdown and their powers were in most cases larger than those of many other tests. Even under the few alternatives for which other tests had a larger power, the SSP-based tests still competed very well.
Thas [17] performed simulation studies in which the power of the one-sample SSP test is compared to the power of many other tests. In this section we present one of these studies. In particular, we test the composite null hypothesis of normality. As an alternative to the normal distribution a family of mixtures of two normal distributions is considered. Its density, indexed by (δ, γ), is given by f(x; δ, γ) = (1 − γ)φ(x) + γφ(x − δ), where φ is the density of a standard normal distribution. Note that γ = 0 and γ = 1 correspond to normal distributions. The SSP test is compared to five other tests. The best known test is probably the Shapiro-Wilk test (SW) (Shapiro and Wilk [12]), which is implemented as the modified statistic of Weisberg and Bingham [23]. Also the Kolmogorov-Smirnov (KS) test is implemented as a modified statistic (Stephens [14], [15]). Since the SSP test closely resembles the Anderson-Darling test (AD), the latter is also included, again as a modified statistic (Stephens [14], [15]). The fourth test is the D'Agostino-Pearson K test (K) (D'Agostino and Pearson [3]), which is actually only sensitive to deviations in the third and fourth moment. Finally, the Rao-Robson test (RR) (Rao and Robson [9]) is considered. This test is basically a Pearson χ² test on equiprobable groups of data. The difference with the SSP-based test is that here a grouping is performed only once. We adopted the rule of Mann and Wald [6] to determine the number of groups. In particular, 6 groups were used for n = 20 and 9 groups for n = 50. We included the SSPc tests with c = 2, 3 and c = 4.
The critical values at the 5% level of significance are listed in Table 2. The results of the Monte Carlo simulation study are presented in Table 3. All powers were estimated by means of 10,000 simulation runs. These results suggest that under this particular mixture alternative the SSP tests outperform the other tests, sometimes by a considerable margin. Among the SSP tests the power seems to increase with increasing SSP-size. Under different alternatives, however, Thas [17] shows that the power may just as well decrease with increasing SSP-size. This clearly illustrates the importance of choosing a good size. In practice, though, the user most often does not know the optimal SSP-size in advance. As a solution the authors have proposed data-driven SSP tests (Thas and Ottoy [21]), which avoid the problem by estimating the SSP-size from the data by means of a selection rule.
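The Monte Carlo procedure behind such power estimates can be sketched as follows. Since the SSP statistics are not available in standard libraries, this sketch uses a simple kurtosis-based statistic as a stand-in normality test, with its 5% critical value itself estimated from null simulations; the mixture is taken as (1 − γ)N(0, 1) + γN(δ, 1), and all function names, parameter values and the number of runs are ours:

```python
import random
import statistics

def rmixture(n, delta, gamma, rng):
    """Draw n values from the mixture (1 - gamma)*N(0, 1) + gamma*N(delta, 1)."""
    return [rng.gauss(delta if rng.random() < gamma else 0.0, 1.0)
            for _ in range(n)]

def kurtosis_stat(x):
    """Standardized fourth moment m4 / m2^2: close to 3 for normal data
    and smaller for flat, bimodal shapes."""
    m = statistics.fmean(x)
    m2 = statistics.fmean([(v - m) ** 2 for v in x])
    m4 = statistics.fmean([(v - m) ** 4 for v in x])
    return m4 / m2 ** 2

def estimate_power(n=50, delta=3.0, gamma=0.5, runs=1000, seed=7):
    """Estimate the power of a left-tailed 5% kurtosis test against the
    mixture alternative, with the critical value simulated under the null."""
    rng = random.Random(seed)
    null = sorted(kurtosis_stat(rmixture(n, 0.0, 0.0, rng)) for _ in range(runs))
    crit = null[int(0.05 * runs)]          # empirical 5% quantile under H0
    hits = sum(kurtosis_stat(rmixture(n, delta, gamma, rng)) < crit
               for _ in range(runs))
    return hits / runs

power = estimate_power()
print(round(power, 3))
```

The equal-weights mixture with a large shift is bimodal and platykurtic, so the rejection region sits in the left tail of the null kurtosis distribution; the estimated power is the rejection fraction over the alternative runs.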
Although the results presented in Table 3 may seem very convincing, simulations under other alternatives have indicated that the power of the SSP test is sometimes considerably lower than that of the other tests (Thas [17]). This is unfortunately a characteristic of all omnibus consistent tests.

Example of the One-Sample SSP Test
The Singer dataset, which was used by Cleveland [2] to demonstrate Trellis graphs, consists of the heights of singers in the New York Choral Society. Here we only consider the group of 35 altos. The Shapiro-Wilk test and the Kolmogorov-Smirnov test resulted in p = 0.379 and p = 0.309, respectively. Since neither p-value is small, there does not seem to be much evidence against the hypothesis of normality.
The computed test statistics (p-values) obtained with the SSPc tests with c = 2, 3 and c = 4 are t2 = 0.452 (p = 0.374), t3 = 306.883 (p < 0.0001) and t4 = 951.133 (p < 0.0001), respectively. The result of the SSP2 test agrees quite well with the two classical tests, but when larger partition sizes are used one immediately notices the extremely large values of the test statistics, which correspond to p-values smaller than 0.0001. Thus both the SSP3 and the SSP4 test reject the null hypothesis of normality. A reason for this tremendous difference between the SSP2 test, on the one hand, and the SSP3 and SSP4 tests, on the other hand, might be that the data actually show bimodality. Simulation results have indeed pointed out that extremely high powers of the SSP3 and SSP4 tests are especially observed under bimodal alternatives. Figure 2 shows the histogram of the data and a Gaussian kernel density estimate with a bandwidth manually set to 2.75, suggesting a bimodal distribution. Had the unbiased cross-validation bandwidth selection method been used instead (Silverman [13]; Venables and Ripley [22]), the data would have been oversmoothed, in the sense that the bumps in the density are all flattened out; this is a typical characteristic of such a method in small samples.
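The bandwidth effect described above can be reproduced with a minimal fixed-bandwidth Gaussian kernel density estimator (the height values below are hypothetical bimodal data, not the actual alto heights; the function names are ours):

```python
import math

def gaussian_kde(sample, bandwidth):
    """Fixed-bandwidth Gaussian kernel density estimate:
    f(x) = (1 / (n * h)) * sum_i phi((x - x_i) / h)."""
    n, h = len(sample), bandwidth
    norm = n * h * math.sqrt(2.0 * math.pi)
    def f(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm
    return f

# Hypothetical bimodal heights (illustration only, not the alto data).
heights = [60, 61, 62, 62, 63, 66, 67, 67, 68, 69]
f_small = gaussian_kde(heights, 1.0)   # small bandwidth: keeps both bumps
f_large = gaussian_kde(heights, 6.0)   # large bandwidth: flattens the dip
```

With the small bandwidth the estimate dips between the two clusters, while the large bandwidth smooths the two bumps into a single mode, mirroring the oversmoothing discussed above.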

Figure 2. A histogram and a Gaussian kernel density estimate (bandwidth = 2.75) of the Singer data.

Table 1. A summary of the observed and the expected frequencies for the three goodness-of-fit problems.

Table 3. The estimated powers for some alternatives of the contaminated normal family, based on 10,000 simulation runs.