Reprints Available directly from the Editor. Printed in New Zealand. Analysis of Singly Ordered Two-way Contingency Tables

Pearson's statistic is investigated for nominal-ordinal two-way contingency tables in which we wish to test for identical rows. The statistic is expressed as a sum, the rst summand of which is a statistic given by Yates (1948), and examines location eects for the nominal category. The second and subsequent summands reect the corresponding moments: for example, dispersion, skewness, kurtosis etc. The summands are shown to be weakly optimal in that they are score statistics.


Introduction
In the social sciences and in more specialized areas such as sensory evaluation, it is common to obtain categorized ratings for a number of items. For example, if eyesight is under consideration, one might have five categories of eyesight ranging from poor to excellent, with observations on both males and females, giving a two by five contingency table of responses. As one of the categorizations is ordered, it is possible to do a more thorough analysis than that given by the usual X 2 P test for a two-way contingency table. Sometimes, although the X 2 P test may not be significant, an effect may be suggested by one of a number of analyses suggested in the literature, including (i) giving the categories equi-spaced scores and using a regression analysis as in Yates (1948); (ii) using nonequi-spaced scores based on mid-rank values as in Bross (1958), Conover (1998, p.281) and Nair (1986); (iii) linear logistic models as in McCullagh (1980); (iv) log-linear models and user defined assigned scores as in Agresti (1984, p.84); (v) the cumulative chi-square method of Taguchi (1966); see also Nair (1986) and Hamada and Wu (1990) for a discussion of this method and reasons for not using it; and (vi) analysis of variance and user defined assigned scores as in Box and Jones (1986) or Nair (1990).
We demonstrate here that a method using the summands or components of X 2 P compares well with more recent methods and has the appeal of being weakly optimal and simple both conceptually and arithmetically. The method generalizes readily to other models, and to multi-way tables; see Beh and Davy (1998) and Beh and Davy (1999). Our approach involves a family of simple parametric distributions. With sufficiently many parameters our models will fit the data exactly, but in practice highly parametrised models are not needed. Our parameters are related to moments, and we usually find it is sufficient to include only location and dispersion parameters for each row; skewness and kurtosis parameters could be included, but rarely would it be necessary to include more parameters. The test statistics we recommend for testing are asymptotically independent, assessing if the rows agree with regard to their location, dispersion, skewness etc. We suggest simultaneously assessing location and dispersion, and combining the remainder into a residual unless there are a priori reasons for doing otherwise.

One-Way Tables
We now describe some results for one-way tables because our two-way results are strongly motivated by the corresponding one-way results. Best and Rayner (1987) gave formulae for obtaining components of the usual X 2 P goodness of fit statistic for the multinomial, where there are n observations categorized into c classes with known class probabilities p 1 , p 2 , . . . , p c . These components may be correlated in small samples but are asymptotically uncorrelated and have the sum of their squares equal to X 2 P . If the categories are ordered and the components are based on orthogonal polynomials, then, for example, the first two components identify linear and quadratic effects, i.e. loosely location and dispersion effects. Subsequently we suppose the numbers of observations in the c classes are N 1 , N 2 , . . . , N c , where n = N 1 + N 2 + . . . + N c . 'Asymptotic' results mean n → ∞.
Both the linear and the quadratic components are asymptotically distributed as standard normal variables, and power studies in Best and Rayner (1987) indicated these components compete well with a variety of other statistics when alternatives involve location and dispersion effects. For completeness, we give the relevant formulae. The orthogonal polynomials have, for j = 1, . . . , c, The components are given explicitly by These components depend on the orthogonal polynomials, which are most conveniently given by using the explicit formulae for the g 1 and g 2 , and then the recurrence relations of Emerson (1968). TheV 2 u are score statistics in their own right, and hence provide weakly optimal directional tests (each seeks to detect alternatives in a one dimensional parameter space), so supplementing the omnibus nature of the X 2 P test (that seeks to detect alternatives in a c − 1 dimensional parameter space). The latter is based on the statistic Lancaster (1969, p.134) demonstrated such a partition of X 2 P into components for the particular case p 1 = . . . = p c .

Two Way Tables
Consider the following product multinomial model. We have an r by c contingency table with cell probabilities p ij , i = 1, . . . , r, and j = 1, . . . , c, such that p i1 + . . . + p ic = 1 for i = 1, . . . , r. Observations N i1 , . . . , N ic are taken on the cells of each of the i rows, yielding row totals n i. , i = 1, . . . , r that were known before the data were collected. Column totals are random variables and are denoted by N .j , j = 1, . . . , c; the total count is n .. . Suppose that the columns are ordered categories, and it is of interest to compare rows for similarity of location and dispersion effects. The null hypothesis is equality of the corresponding row probabilities. If p .j = (p 1j + . . . + p rj )/r for j = 1, . . . , c, we test the null hypothesis p ij = p .j for i = 1, ..., r, and j = 1, ..., c, against the alternative hypothesis, not the null. As the conditional probability p j|i = p ij /p i. = p ij since p i. = 1 for all i, the null hypothesis could also be written as p j|i = p .j for i = 1, . . . , r and j = 1, . . . , c: the conditional row probabilities are the same as the marginal probabilities. The usual X 2 P statistic is derived in, for example, Conover (1998, section 4.2) to be wherep ij = (n i. /n .. )(N .j /n .. ). This X 2 P statistic examines all deviations from what is expected under a homogeneity model. As in the one-way table, it is appropriate to decompose X 2 P to obtain more informative directional tests. The summands of the decomposition (V ui subsequently) are not quite components as we usually use the term, as they are not independent, not even asymptotically.
To decompose X 2 P statistics for the two-way table,V 1 andV 2 defined earlier can be calculated for each row (provided c ≥ 3), yieldingV 1i andV 2i , for 1 ≤ i ≤ r. In theseV ui the p j are now taken to be (N .j /n .. ) for 1 ≤ j ≤ c, and n is taken as n i. , 1 ≤ i ≤ r, and N j becomes N ij . Using the subsequent g u (x j ), further statisticŝ V ui can be defined bŷ where the hats indicate that the maximum likelihood estimatorsp ij = N .j /n .. , j = 1, . . . , c have been used in the construction of the orthonormal polynomials. We will show that and that this is an alternative decomposition of X 2 P to that given by Lancaster (1969, Theorem 6.2). We also show that theV ui are score statistics for an appropriate model; this implies weak optimality. See Rayner and Best (1989, section 3.4) for a discussion of this optimality.
A measure of the location effect for the whole table isV 2 11 + . . . +V 2 1r , which, when x j = j for j = 1, . . . , c, is just the statistic Q of Yates (1948). Similarly the overall dispersion effect can be assessed byV 2 21 + . . . +V 2 2r , provided c ≥ 3. Provided c ≥ u + 1, a measure of the uth moment departure from the null hypothesis iŝ V 2 u1 + . . . +V 2 ur . This interpretation could be called diagnostic; see Rayner and Best (1999) for a discussion and references. It should be noted though that the departure from the null could be due to moments between the u + 1 th to the 2u th. However, if the model is correct then in large samples significance will be due to moments up to the u th, and we believe that is where most attention should focus.
IfV 2 u1 + . . . +V 2 ur , is significant then 2 by c tables could be examined in a multiple comparisons fashion. These statistics are asymptotically independent and each is well approximated by the χ 2 r−1 distribution. "Asymptotic" means n .. → ∞. Effectively we have a decomposition of X 2 P into components that assesses, under the hypothesis of independence, the agreement of the rows of the table in regard to specific moment effects, up to the c − 1 th moment. The statistics are asymptotically independent, and hence so are the assessments. By analogy with our goodness of fit work, we expect that the most significant effects will be in the first two to four moments.

Partitioning X 2 P Using Score Statistics
We test for equality of the corresponding row probabilities by first setting, for j = 1, . . . , c and i = 1, . . . , r, In (4.1) we note the following. • The θ ui are real valued parameters.
• The {g uj } is taken to be orthonormal: where δ uv is the Kronecker delta, δ uv = 1 for u = v, and = 0 for u = v. Typically the g uj depend on the p .j ; we require that they do not depend on the row, since otherwise row comparison would be virtually impossible. A number of choices for g uj are possible, but for ordered categories a ready interpretation is available if we use the g u (x j ) of section 2. Then each θ ui reflects the u th moment shift of the distribution defined by the i th row from that defined on the {p .j }.
• In the goodness of fit context we call k the order of the model. It can be at most c − 1, when the model becomes saturated, and an identity similar to Fisher's identity (given, for example, by Lancaster 1969, Theorem 2.1, Corollary 2) would result. Normally k would be chosen to be at most four, and more usually two.
Again in the goodness of fit context, we note that in Rayner and Best (1989) we say that a statistic is partitioned into components if the sum (or sum of squares) of the components gives that statistic, and the components are at least asymptotically independent. Our partition of X 2 P using theV 2 ui does not have even asymptotic independence, and could be thought of as an arithmetic rather than a statistical partition. On the other hand, the sumsV 2 u1 + . . . +V 2 ur are asymptotically independent and provide a partition in our usual sense.
In using the {g uj }, we are effectively assigning scores {j} to the ordered categories. The derivations generalise to user-assigned scores {x j } as anticipated by the discussion in section 3. It should, however, be emphasised that our approach assumes user-assigned and not estimated scores.
Colleagues have commented that although models of the form (4.1) are well-known to sometimes give excellent results, they can also produce negative probabilities. However (4.1) is asymptotically equivalent to where θ = (θ 11 , . . . , θ 1r , . . . , θ k1 , . . . , θ kr ) T . Of course, this model cannot produce estimates of probabilities that are negative. The score tests from the two different models are asymptotically equivalent, but the derivations for (4.1) given here, messy as they are, are simpler than for the exponential model above. Whatever the asymptotic optimality probabilities of the test statistics from both models, they will ultimately be judged on their practicality, convenience and small sample properties. By these standards, the statistics derived here are not inferior to any of the available possibilities! To test for equality of the corresponding row probabilities, take θ ui to be the (u − 1)r + i th element of a vector θ. We test H 0 : θ = 0 against K: θ = 0 with p .1 , . . . , p .(c−1) as nuisance parameters; p .c is omitted from the set of nuisance parameters because the constraints p i1 + . . . Subsequently . ) T and I n for the n by n identity matrix. Proofs of the three results that follow are in the Appendix. These derivations are anticipated in Best, Rayner and Stephens (1998).

Theorem 1:
For the model (4.1), the information matrix evaluated atp .j = N .j /n .. , j = 1, . . . , c, is given by the direct sum of k matrices I r − f f T /n .. . This information matrix is singular.
The score statistic involves the inverse of the information matrix. One way to overcome the information matrix being singular is to omit θ 1r , . . . , θ kr from the model. In modelling terms, the reason for doing this is that the θ's model differences between the row distributions and the average (fitted) distribution {N .j /n .. }. If there are no differences for the first r − 1 rows, then there will be no difference for the r th row.
Theorem 2: For u = 1, . . . , k and i = 1, . . . , r defineV ui = Σ j N ijĝuj / √ n i. . The score statistic for the model for i = 1, . . . , r − 1 (not r as in (4.1)) and j = 1, . . . , c − 1, with . TheV u are asymptotically independent. Note that the hats indicate that the maximum likelihood estimatorsp .j = N .j /n .. , j = 1, . . . , c, have been used in the construction of the orthonormal functions. Alsô V u gives information about the deviations of order u from the fitted distribution {N .j /n .. }; there are contributions to this information from all r rows. The asymptotic covariance matrix of (V u1 , . . . ,V u(r−1) ) is derived incidentally in the proof of Theorem 2 to be I r−1 + f * f * T /n r. . It follows that theV ui are correlated and hence are not components. Also, sinceV u is asymptotically r − 1 variate normal with mean zero and covariance matrix I r−1 + f * f * T /n r. , the contribution toŜ k from the u th order terms,V 2 u1 + . . . +V 2 ur , asymptotically has the χ 2 r−1 distribution. This uses a result for multivariate normal random variables, given for example, in Stuart and Ord (1987, 15.10). Given the asymptotic independence of theV u ,Ŝ k has asymptotic distribution χ 2 k(r−1) .
Theorem 3:Ŝ (c−1) = X 2 P . Although dependent, theV ui partition X 2 P arithmetically in that the sum of the squares of all r(c − 1)V ui add to give X 2 P .
As remarked before the statement of Theorem 2, there are only (r − 1)(c − 1) functionally independent θ's, for one row is determined from the average multinomial distribution by the other r − 1 row distributions. EachV ua corresponds to a θ ua , and assesses the deviation of the u th moment of the i th row distribution {p i1 , . . . , p ic } from the u th moment of the distribution defined by the {N .1 /n .. , . . . , N .c /n .. }. In factV 2 ua could be derived as the score statistic for the model: p aj = {1 + θ ua g uj / √ n a. }p .j , and p ij = p .j for all (r − 2)(c − 1) other p ij in the first r − 1 rows. Therefore eachV ua is the basis of a strongly directional test, with one dimensional parameter space {θ ua }. In the same vein we confirm thatV 2 u1 + . . . +V 2 ur has the χ 2 r−1 distribution by observing that it is the score statistic for the model p ij = {1 + θ ui g uj / √ n i. }p .j , i = 1, . . . , r − 1, j = 1, . . . , c − 1. It has r − 1 dimensional parameter space {θ u1 , . . . , θ u(r−1) }. It thus plays a useful intermediate role, being "more directional" than the (r − 1)(c − 1) dimensional X 2 P , and "more omnibus" than each of theV 2 u1 , . . . ,V 2 ur singly.
Although we have not done a thorough analysis, we suspect that orthogonal polynomials can also be used as part of a log-linear model approach. In that case, the log-likelihood ratio statistic would be partitioned rather than Pearson's X 2 P . We would expect such test statistics to perform very similarly to those we have just introduced. Everitt (1992, section 7.3) for example, indicates how to proceed.

Nair's Method
In simulation studies in Best, Rayner and Stephens (1998), we found that the statisticsV 2 11 + . . . +V 2 1r andV 2 21 + . . . +V 2 2r are identical in numerical value to the location and dispersion statistics described by Nair (1986). Nair's location statistic is just the Kruskal-Wallis statistic adjusted for ties which is often applied to ranked one-way analysis of variance data. Eubank et al. (1987) showed that a number of commonly used statistics like the one on which the Kruskal-Wallis test is based, are components of Pearson's X 2 P . Nair (1986) defined location scores , for 1 ≤ k ≤ c, and also dispersion scores then τ i / √ n i. and ω i / √ n i. are analogous toV 1i andV 2i . Nair's location scores are proportional to the midrank for category k. Graubard and Korn (1987) criticized the use of rank scores for contingency table analysis on the grounds that they may not give enough weight to extreme categories. The same sort of criticism may also apply to other data-dependent or estimated scores such as those given in the next section. Nair's statistics can also be derived as in section 3 if we use mid-rank scores rather than the "natural" scores 1, 2, . . . , c. So his statistics are special cases of our partition of X 2 P statistics.

Logistic Models
The partition of X 2 P given in Theorem 3 is relevant when either the rows (or columns) have ordered categories and where columns (or rows) have nominal categories. Another model suggested for use in this situation is the logistic model. McCullagh (1980) suggested the model: Iterative methods are needed for maximum likelihood estimation of the parameters. Agresti (1984, Appendix B.3) gave details. Notice that the parameters α j , 1 ≤ j ≤ c − 1, estimate the scores which are not arbitrarily assigned while the τ i and ω i are location and dispersion parameters with 1 ≤ i ≤ r. However it should be noted that there is some evidence, for example Agresti (1984, p.225), Newell (1986) and Box and Jones (1986), that in some cases there is little difference in both location and dispersion effects for assigned scores and estimated scores.
The logistic method we have just described requires iteration. However our X 2 P method, which includes Nair's midrank scores method, does not. Further, if we We now have In this sense, the τ i , τ i , and τ * i are all contrasts. For completeness we also define ω * i =V 2i √ n i. , i = 1, . . . , r.

ANOVA Analysis
Box and Jones (1986) and Nair (1990) suggested the use of analysis of variance (ANOVA) methods, and user defined assigned scores, to analyse ordered categorical data. However such an analysis relies on more assumptions to justify its use, and these additional assumptions may be difficult to justify. Sometimes the ANOVA method can give a non-orthogonal analysis, which is less convenient from many standpoints. Further, Brown (1988) has done a small simulation study which indicates that, when compared to X 2 P tests, the ANOVA tests have actual sizes further from the nominal sizes. For these reasons we do not consider ANOVA methods further, although we have often used them in the past for the analysis of ordered categorical data. It may be worth extending the simulation study given by Brown (1988).

Comparison
The X 2 P method, that includes Nair's method, and McCullagh's method both partition the total X 2 P value into location, dispersion and residual effects. The location and dispersion test statistics are each associated with r − 1 degrees of freedom, and have asymptotic χ 2 r−1 distributions. A simulation study reported in Best, Rayner and Stephens (1998) observed that, for the limited range of alternatives considered, the location and location plus dispersion tests for all methods considered have power almost exactly the same. Example 1 below demonstrates a difficulty with the use of the logistic model/McCullagh analysis. Our preference is for the X 2 P method. TheV ui are easily interpreted, give a complete and detailed scrutiny of the data, do not involve iterative calculations, and give an orthogonal partition of X 2 P .

Examples
Example 1. A taste-test experiment form Bradley et al. (1962) gave the response frequencies shown in Table 1. This is a somewhat unusual taste-test as it appears the judges did not taste each product and so cannot be eliminated as blocks. However, it could be claimed that presenting one product per judge gives a more realistic consumer assessment of the products (McBride, 1986). For this contingency table, recalling that we have √ n i. , we obtain the summaries shown in Table  2.
The analyses are very similar. All indicate that treatment 5 is most liked, as τ 5 , τ 5 and τ * 5 are the highest τ values, and that the judges agree on this, as ω 5 , ω 5 and ω * 5 are the smallest ω values. The high ω 4 , ω 4 and ω * 4 values indicate the judges responses were most spread for treatment 4. Reference to the data indicates this spread is real, and not a spurious dispersion effect due to a large location effect, as discussed in Hamada and Wu (1990). Treatment 3 was least liked, as τ 3 , τ 3 and τ * 3 are the smallest τ values, and the judges were in fair agreement about this, as ω 3 , ω 3 and ω * 3 are the second or third smallest of the ω values. Notice the agreement between the analyses and, in particular, the closeness of the X 2 P and Nair results. We can also partition the value of the usual X 2 P statistic as in the analysis of variance. For the X 2 P analysis we get the analysis shown in Table 3. However, because the logistic analysis is not orthogonal we get different analyses depending on whether location or dispersion effects are removed first. In fact we have either of the partitions shown in Table 4. In this case the conclusions from either logistic analysis are the same, but it is not clear that this would always be so.
Example 2. The overall X 2 P value of 73.84 for the taste test data of Example 1 was highly significant. However, it can be the case that the overall X 2 P is not significant but that the examination of location and/or dispersion statistics will indicate a significant effect. To illustrate this point consider the data in Table 5 which are taken from Armitage (1955) and which are concerned with two treatments for ulcers and subsequent categorization of ulcer severity. For this ulcer data X 2 P = 5.91 on three degrees of freedom implying a p-value in excess of 10% if the usual χ 2 approximation for X 2 P is assumed. However the location statisticV 2 11 +V 2 21 = 5.26, with a p-value between 1% and 5%. Treatment B is superior to treatment A, tending to have more responses in the number healed/mostly healed categories. The residual is 0.65 on two degrees of freedom, indicating there are no dispersion and no skewness effects. This example emphasizes the need to look at theV ui and not just X 2 P , as in X 2 P the insignificant dispersion and skewness effects have masked a significant location effect. Asymptotically V 0 has a multivariate normal distribution with mean zero and covariance matrix M 0 (see, for example, Cox and Hinkley, 1974, Chapter 9). Subsequently we will need to find the inverse of I pp for our model. For this purpose the following lemma will be needed; it may be easily verified.
We call these the zero mean conditions, and they follow from our choice of orthonormal functions.
Subsequently we write 1 n for the n by 1 vector with every element 1. In the derivations that follow, the p ic , i = 1, . . . , r, will be treated as dependent variables.
g uj p .j = g uc and c−1 j=1 (g uj − g uc )p .j (g ws − g wc ) = c j=1 {g uj g ws − g uc g ws − g wc g uj + g uc g wc }p .s = c j=1 g uj g ws p .j + g uc g wc = δ uw + g uc g wc , we find that I θp I −1 pp I pθ is the direct sum of k equal matrices I r − f f T /n .. . The stated information matrix now follows. It is singular because f is a latent vector with zero latent root.
Theorem 2 Proof: After omitting θ 1r , . . . , θ kr from the model, the information matrix becomes M * , the direct sum of k matrices I r−1 − f * f * T /n .. , where f * is the (r − 1) by 1 vector formed from f by omitting √ n r. . From the lemma this matrix has inverse I r−1 +f * f * T /n r. , and the inverse of the information matrix is the direct sum of k such matrices. The efficient score iŝ V * = (V 11 , . . . ,V 1(r−1) , . . . ,V k1 , . . . ,V k(r−1) ) T , where the hats indicate that the maximum likelihood estimators of the nuisance parameters are required, namelyp .j = N .j /n .. , j = 1, . . . , c. If we put V * w = (V w1 , . . . ,V w(r−1) ) T , w = 1, . . . , k, thenV * can also be expressed aŝ Substitution gives the score statistic aŝ wjp.j = 0, using N .j = n ..p.j and the zero mean conditions. In terms of theV * w , this is √ n 1.Vw1 + . . . + √ n r.Vwr = 0, w = 1, . . . , k. This is of the form of Pearson's X 2 P , and is clearly the contribution to X 2 P from the i th row. Summing over rows gives The non-zero covariance matrix establishes the dependence.