Power of the Neyman Smooth Tests for the Uniform Distribution

This paper investigates and compares the generalised Neyman smooth test, its components, and the classical chi-squared test with a variety of equiprobable classes. Each test is evaluated in terms of its power to reject a wavelike alternative to the uniform distribution, indexed by a parameter chosen to quantify the complexity of the alternative. Results indicate that if broadly focused tests (rather than strongly directional or weakly omnibus tests) are sought, then smooth tests of order about four, or the chi-squared test with between five and ten classes, will perform well.


Introduction
Neyman (1937) constructed his smooth tests specifically to test for the continuous uniform distribution. Uniformity testing is important in a range of applications, including the assessment of random number generators. Moreover, any goodness of fit test for a completely specified distribution reduces, via the probability integral transformation, to testing for uniformity. Neyman's construction has been generalised and, of interest here, evaluated several times. See, for example, Quesenberry and Miller (1977) and Miller and Quesenberry (1979). The articles involving Quesenberry and Miller do not consider the tests based on the components of Neyman's smooth tests. Moreover, we are now in a position to take advantage of modern fast computers to make assessments that were not previously so easily carried out.
Neyman's smooth tests have been generalised to testing for arbitrary distributions. See, for example, Rayner and Best (1989). Recent advice on such testing recommends using the sum of the squares of the first two, three or four components of the appropriate Neyman smooth test, augmented by using the components themselves in a data analytic fashion. See Rayner and Best (2000). The manner in which the components give information about the alternatives to uniformity can perhaps be best interpreted in terms of the parameter space spanned by the alternatives. Use of orthonormal functions means this space is decomposed into orthogonal one dimensional spaces. The r-th component assesses differences between the data and the hypothesised distribution in, by definition, the r-th order, and this may be thought of as in the r-th moment. Although this correspondence isn't exact, it leads to a useful and insightful interpretation of the data. See, for example, Rayner, Best and Mathews (1995). Carolan and Rayner (2000) looked at the smooth tests for normality, and showed that even when differences from the hypothesised distribution are generated from an order r alternative (see Carolan and Rayner, 2000, for the precise meaning), earlier components may be significant. What seems to be happening is similar to a polynomial of degree six, say, being reasonably well approximated over a specified domain by a combination of polynomials of degrees one, two and five. Rayner, Best and Dodds (1985) looked at what Pearson's chi-squared test of equiprobability and its components best detect, and it seems timely to look at the Neyman smooth tests for the uniform distribution and their components.
The study by Kallenberg et al. (1985) into how to construct the classes for the Pearson chi-squared test characterised the alternatives in part by tail weight (heavy or light). We do not look for an answer in terms of tail weight, but in terms of how complicated the alternative may be.
The order k smooth alternative to U(0, 1), the uniform continuous distribution on (0, 1), is

f(y; θ) = C(θ) exp{θ_1 π_1(y) + . . . + θ_k π_k(y)}, for 0 < y < 1, zero otherwise, (2)

in which C(θ) is a normalising constant that ensures the probability density function integrates to one, and the π_r(y) are the orthonormal polynomials on (0, 1). The smooth test of order k for uniformity is based on the statistic

S_k = V²_1 + . . . + V²_k, (3)

which has components

V_r = {π_r(y_1) + . . . + π_r(y_n)}/√n, r = 1, . . . , k. (4)

To calculate Pearson's chi-squared test statistic X²_P, we assume the n data points are categorised into m classes, with class probabilities p_1, . . . , p_m and class counts N_1, . . . , N_m. Then

X²_P = Σ_j (N_j − np_j)²/(np_j). (5)

The Pearson chi-squared test statistic X²_P based on m equiprobable classes (p_j = 1/m for all j) will be denoted by X²_Pm. The alternatives to uniformity that we will consider are not the ones for which the test was constructed to be optimal. We take

f_Y(y; ω) = C(ω){1 + sin(2πωy)}, for 0 < y < 1, zero otherwise, (6)

where the 'complexity' parameter ω > 0 and C(ω) is a normalising constant. Note that this distribution is U(0, 1) in the limit as ω approaches zero. See Figure 1 for a plot of some of these alternatives. It seems to us that probability density functions of this form cannot be characterised in terms of tail weight as envisaged by Kallenberg et al. (1985). We expect that small values of ω will reflect low order alternatives to uniformity, and will be better detected by the earlier components, while larger values of ω will reflect higher order alternatives, and will be better detected by the later components. For a discussion of order and effective order, see Rayner, Best and Dodds (1985) or Rayner and Best (1989, section 4.3).
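As an illustration, the statistics above are straightforward to compute. The following is a minimal sketch in Python, assuming the π_r are the orthonormal (shifted Legendre) polynomials on (0, 1); the function names are ours:

```python
import math

# Orthonormal (shifted Legendre) polynomials on (0, 1): each pi_r integrates
# to 0 and pi_r^2 integrates to 1 over (0, 1).
PI = [
    lambda y: math.sqrt(3.0) * (2.0 * y - 1.0),
    lambda y: math.sqrt(5.0) * (6.0 * y * y - 6.0 * y + 1.0),
    lambda y: math.sqrt(7.0) * (20.0 * y**3 - 30.0 * y**2 + 12.0 * y - 1.0),
    lambda y: 3.0 * (70.0 * y**4 - 140.0 * y**3 + 90.0 * y**2 - 20.0 * y + 1.0),
]

def components(data, k=4):
    """V_r = {pi_r(y_1) + ... + pi_r(y_n)} / sqrt(n), r = 1, ..., k."""
    n = len(data)
    return [sum(p(y) for y in data) / math.sqrt(n) for p in PI[:k]]

def smooth_stat(data, k=4):
    """S_k = V_1^2 + ... + V_k^2; referred to chi-squared with k df."""
    return sum(v * v for v in components(data, k))

def pearson_x2(data, m):
    """X^2_Pm with m equiprobable classes on (0, 1)."""
    counts = [0] * m
    for y in data:
        counts[min(int(y * m), m - 1)] += 1
    e = len(data) / m
    return sum((c - e) ** 2 / e for c in counts)
```

Under the null, S_k and X²_Pm are referred to their asymptotic chi-squared distributions with k and m − 1 degrees of freedom respectively.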

Size and Power Study
Here we present powers, estimated using 100,000 simulations, of the tests based on V²_1, . . . , V²_4, S_4, X²_P2, X²_P5, X²_P10, X²_P20 and the Anderson-Darling test (D'Agostino and Stephens, 1986, pp. 101, 104-105), based on the statistic AD. The alternatives used are the distributions with probability density function f_Y(y; ω) for 0 < ω < 3. Random samples of size n = 25 are taken, and a significance level of 5% is used. We also looked at a significance level of 1% and random samples of size n = 50. The results presented here are typical. The critical values used, except for the Anderson-Darling test, were from the asymptotic χ² distribution, as these are what will be used in practice and (judging by the ω = 0 results) don't seem to unduly advantage any statistic. For the Anderson-Darling statistic, critical values of 2.492 (5%) and 3.857 (1%) were used (see D'Agostino and Stephens, 1986, Table 4.2, p. 105). Especially for n = 50, it was expected that the difference between actual and nominal sizes would be minimal. This was largely the case. When there was a discrepancy, it reflects what would happen in practice.
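For reference, the Anderson-Darling statistic for a completely specified null (here U(0, 1)) can be computed directly from the order statistics via the usual formula AD = −n − (1/n) Σ (2i − 1){ln u_(i) + ln(1 − u_(n+1−i))}. A brief sketch (the function name is ours):

```python
import math

def anderson_darling(data):
    """Anderson-Darling statistic for testing U(0, 1) on data in (0, 1)."""
    u = sorted(data)
    n = len(u)
    # AD = -n - (1/n) * sum over i of (2i - 1) * {ln u_(i) + ln(1 - u_(n+1-i))}
    s = sum((2 * i - 1) * (math.log(u[i - 1]) + math.log(1.0 - u[n - i]))
            for i in range(1, n + 1))
    return -n - s / n
```

Values exceeding 2.492 are significant at the 5% level under the asymptotic null distribution (D'Agostino and Stephens, 1986).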
To allow efficient simulation, rather than transforming U(0, 1) variates using the inverse of the alternative cumulative distribution function (which requires computationally expensive numerical root finding), random variates from the probability density function f_Y(y; ω) are generated using the acceptance-rejection method. We follow the treatment given in Lange (1999, pp. 272-276) with c = 2 and g(y) the probability density function of the uniform U(0, 1) distribution, since f_Y(y; ω) ≤ 2g(y). Here, a U(0, 1) random variate X is accepted as an observation from the probability density function f_Y(y; ω) if 2U ≤ f_Y(X; ω), where U is an independent U(0, 1) variate. For 0 < ω < 0.25, f_Y(y; ω) is strictly increasing, and may be thought of as crudely linear. For 0.25 < ω < 0.75, f_Y(y; ω) increases and then decreases, and may be thought of as crudely quadratic. For 0.75 < ω < 1.25, f_Y(y; ω) increases, then decreases, and then increases again, and may be thought of as crudely cubic. And so on. If more complicated alternatives arise than may be crudely modelled by ω < 3, we would hope that this could have been anticipated from the context, and smooth tests based on trigonometric or some other functions used. Except for the tests based on X²_P10 and X²_P20, the powers of the tests considered here decrease for ω > 3. Some powers are given in Table 1. Figures 2 and 3 give plots of the power functions based on a finer grid of ω values.
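The acceptance-rejection step can be sketched as follows. The sinusoidal shape used for f_Y here is our assumption for illustration (any density on (0, 1) bounded above by 2 works identically); only the bound f_Y ≤ 2g matters for the algorithm, and acceptance-rejection is insensitive to the normalising constant:

```python
import math
import random

def f_unnorm(y, omega):
    # Assumed wavelike shape (our illustration): bounded above by 2,
    # constant (uniform) when omega = 0.
    return 1.0 + math.sin(2.0 * math.pi * omega * y)

def sample(omega, n, rng=None):
    """Draw n variates by acceptance-rejection with envelope 2*g, g = U(0, 1)."""
    rng = rng or random.Random(12345)
    out = []
    while len(out) < n:
        x = rng.random()                      # candidate from g
        u = rng.random()                      # independent acceptance uniform
        if 2.0 * u <= f_unnorm(x, omega):     # accept iff c*u*g(x) <= f(x), c = 2
            out.append(x)
    return out
```

With c = 2 the expected acceptance rate is about one half, so roughly 2n candidates are needed per sample of size n.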
The simulations permit several conclusions.
• The tests based on V²_1 and V²_3 have higher powers when f_Y(y; ω) may be thought of as crudely a polynomial of odd degree. Their powers are greatest when f_Y(y; ω) may be thought of as crudely cubic.
• The tests based on V²_2 and V²_4 have higher powers when f_Y(y; ω) may be thought of as crudely a polynomial of even degree. Their powers are greatest when f_Y(y; ω) may be thought of as crudely quartic.
• The test based on S_4 often has power greater than the best of its component tests.
• The test based on X²_P2 divides the domain into two equal parts. It achieves its greatest power when ω = 1, when the discrepancy between the two parts is greatest.
• For the X²_Pm tests, too few classes means the test cannot detect more complicated alternatives, while too many classes 'dilutes' the test, in that its ability to detect more complicated alternatives diminishes its ability to detect less complicated alternatives. Of these X²_Pm tests, those based on 5 and 10 classes both give good results.
• The Anderson-Darling test performs reasonably well for most values of ω but usually does not give the most powerful test.
Quesenberry and Miller (1977) and Miller and Quesenberry (1979) both consider testing for uniformity, but neither considered the tests based on the components V_r, r = 1, . . . , 4. Of several tests (including X²_Pm with m = 10 and m = 20) they ultimately recommend the test based on S_4, in part because it "would have better power performance against further, perhaps more complicated, alternative classes"; see Miller and Quesenberry (1979, p. 288). Several of our conclusions, such as the first two dot points immediately above, weren't addressed or could not be addressed in the Quesenberry and Miller studies. In addition, the conclusion that the low order tests considered cannot detect the more complicated alternatives, the key conclusion of the Quesenberry and Miller studies, isn't well supported in their papers. If "more complicated" is taken to mean "ω > 2", then the current study does address their key conclusion. Unlike the alternatives used here, the alternatives used in the power studies of Quesenberry and Miller (1977), Miller and Quesenberry (1979) and most other authors are indexed discretely rather than continuously. Also, the range of alternatives assessed is not as broad as we are able to consider using our family.
[Figure legend fragment: . . . (2=dashed), X²_P5 (3=dotted), X²_P10 (4=dotdash), and X²_P20 (5=longdash).]

Discussion
An informed practicing statistician almost always has a contextual expectation of some of the basic characteristics of the data before it is seen. That is how good experiments are designed and tests of sensible hypotheses decided upon. More powerful Neyman smooth tests result from using fewer components; the ability to successfully use fewer components depends on the orthonormal system chosen. When you choose a particular orthonormal system (based on your contextual expectations about the data), you are choosing the alternatives you will best be able to detect, even though the data should not yet have been sighted. The results of our study demonstrate the unsurprising fact that when using a polynomial orthonormal system, more complicated alternatives (here, higher frequency or larger ω) require many components in order to have reasonable power. Ultimately, fewer components result in more powerful tests (by appropriately selecting a particular orthonormal system), and the statistician's contextual expectations about the data are the basis on which this is achieved.
The statistic V_r optimally detects a particular order r polynomial alternative to uniformity. It is thus the basis of a very directional test, and could not be expected to detect more complex alternatives well. The statistic S_4 optimally detects polynomial alternatives to uniformity of degree up to four. It is thus the basis of a broadly focused test, being able to detect interesting and relatively complex alternatives. Attempting to detect even more complex alternatives results in less power for detecting alternatives up to degree four. This cost often comes with little gain, as a four dimensional parameter space is usually rich enough to detect most alternatives that arise in practice.
Generally, we recommend using a polynomial orthonormal system unless there is a reason not to (for example, if the context suggests periodic alternatives may be more suitable). The advantage of the polynomial orthonormal systems is that the components may be interpreted as (roughly) detecting moment departures of the data from the null distribution. The interpretation for other systems is more problematic.
There may well be a loss if we have chosen the wrong orthonormal system. So here, if we had a contextual expectation of periodic alternatives, it would be appropriate to use alternatives based on something like the orthonormal series √2 sin(iπy). Using such periodic orthonormal functions would probably give good protection against such alternatives (though no doubt these would produce poorer power against the alternatives the polynomial orthonormal system components have good power in detecting). Our study assumes there are no such contextual expectations of periodic alternatives, so the alternatives of interest here are only weakly periodic (0 < ω < 3).
In the size and power study here, it seems that the test based on V²_1 outperforms the tests based on V²_r with larger r for smaller values of ω, although this is not uniformly true. Tests based on V²_r with larger r are more powerful for larger values of ω, but again, this is not uniformly true. The test based on S_4 is sometimes more powerful than all the V²_r tests, and is always a good compromise.
The Pearson tests are tests for discrete alternatives. The test based on X²_Pm may be thought of as optimally detecting 'order' m − 1 alternatives; see Rayner and Best (1989, Chapter 5). Order in this sense reflects the complexity of the alternative. From the simulations it is clear that the Pearson test with m = 2 classes is unable to detect the more complex alternatives, while that with m = 20 protects against quite complex alternatives (which we do not have here) at the cost of a loss of power for the less complex alternatives. The X²_Pm tests with m = 5 and m = 10 outperform S_4 for ω ≥ 2. Presumably the X²_Pm test with m = 5 is able to detect alternatives of similar complexity to the S_4 test, and so will sometimes do better and sometimes worse. The X²_Pm test with m = 10 is able to protect against more complex alternatives than the S_4 test, but is clearly inferior when the alternatives are less complex, that is, for ω < 2.
We can predict the outcome of assessing the idea of looking at the residual from X²_P20. The residual will always include higher order components. If the alternative is not complex (say 0 < ω < 1), the tests based on these components will have little power, as will a test based on a residual involving these components. If the alternative is complex, we probably should be using a different orthonormal family. For example, the more apparently periodic alternatives that occur for ω > 3 would suggest the use of something like the periodic orthonormal series given above.
If we are looking at residuals from X²_P5 and X²_P10, what may be of interest is to combine later order components. The chi-squared components will be similar to the smooth test components, and corresponding to residuals of the chi-squared tests are tests based on sums of V²_r such as V²_{r+1} + . . . + V²_s. Again we can predict what will happen. If S_r is powerful, we would expect a residual like V²_{r+1} + . . . + V²_s not to be, and conversely. A good question here is which smooth test residual we should use: V²_3 + V²_4 + V²_5, or perhaps V²_5 + . . . + V²_10? Consideration could also be given to sums of squares of odd and sums of squares of even components.
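The residual and odd/even combinations just described are simple sums of squared components. A small sketch (the helper names are ours, taking a precomputed list of the components V_1, . . . , V_s):

```python
def residual_stat(V, r):
    """V_{r+1}^2 + ... + V_s^2 from components V = [V_1, ..., V_s];
    refer to chi-squared with s - r degrees of freedom."""
    return sum(v * v for v in V[r:])

def odd_even_stats(V):
    """Sums of squares of the odd- and even-indexed components."""
    odd = sum(v * v for i, v in enumerate(V, start=1) if i % 2 == 1)
    even = sum(v * v for i, v in enumerate(V, start=1) if i % 2 == 0)
    return odd, even
```

Each sum of q squared components is referred to the asymptotic chi-squared distribution with q degrees of freedom.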
To some extent the chi-squared components duplicate the smooth test components, and since we are testing for a continuous null, the smooth test is more appropriate. We are still advocating looking at the components.
The key point from this study is that if we seek broadly focused tests rather than strongly directional or weakly omnibus tests, then the tests based on S_r with r about 4, or on X²_Pm with m in the range 5 to 10, will perform well. With the S_r tests, the orthonormal system should be chosen so that relatively few components are required to detect important alternatives. Given that the Pearson tests can perform well, it would be useful to look again at the class formation options. Can the equiprobable class construction used here be improved upon? We will consider this question in a subsequent paper.