Conditional and Unconditional Tests (and Sample Size) Based on Multiple Comparisons for Stratified 2 × 2 Tables

The Mantel-Haenszel test is the asymptotic test most frequently used for analyzing stratified 2 × 2 tables. Its exact alternative is the test of Birch, which has recently been reconsidered by Jung. Both tests have a conditional origin: Pearson's chi-squared test and Fisher's exact test, respectively. But both tests have the same drawback: the result of the global test (the stratified test) may not be compatible with the results of the individual tests (the test for each stratum). In this paper, we propose to carry out the global test using a multiple comparisons method (MC method) which does not have this disadvantage. By refining the method (MCB method) an alternative to the Mantel-Haenszel and Birch tests may be obtained. The new MC and MCB methods have the advantage that they may be applied from an unconditional view, a methodology which until now has not been applied to this problem. We also propose some sample size calculation methods.


Introduction
In statistics it is very common to have to verify whether an association exists between two dichotomous qualities. This is especially frequent in medicine, for example, where the aim is to assess whether the presence or absence of a risk factor conditions the presence or absence of a disease, or to compare two treatments whose responses are success or failure, and so forth. In all these cases the problem produces data whose frequencies are presented in a 2 × 2 table: the two levels of one of the qualities are set out in the rows, the two levels of the other quality in the columns, and the observed frequencies are set out inside the table.
The exact and the asymptotic analyses of a 2 × 2 table have their roots in the origins of statistics, and hundreds of papers have been devoted to the problem [1]. It is traditional to carry out the exact independence test using the Fisher exact test, which is a conditional test (because it assumes that the marginals of the rows and columns are previously fixed). More than thirty years have passed since the situation changed, and it is well known that the unconditional exact test tends to be less conservative and more powerful than the conditional test [2][3][4], because the loss of information as a result of conditioning may be as high as 26% [5]. The unconditional tests assume that only the values that really were previously fixed are fixed: the marginal of the rows, the marginal of the columns, or the total data in the table. This gives rise to two types of unconditional test: that of the double binomial model (the first two cases) and that of the multinomial model (the third case). The same can be said of the asymptotic tests, generally based on Pearson's chi-squared statistic with different corrections for continuity (cc). However, the unconditional exact tests have the great disadvantage of being very laborious to compute. An overall view of the problem can be seen in Martín Andrés [1,6].
Frequently the individuals who take part in the study are stratified in groups based on a covariate such as sex or age, which gives rise to several 2 × 2 tables. In this case the aim is to contrast the independence of both the original dichotomous qualities, bearing in mind the heterogeneity of the populations defined by the strata. To this end, the most frequent approach is to suggest a test under the null hypothesis of Mantel-Haenszel for which the odds ratio (or the risk ratio) for all the strata is equal to unity. For this purpose the most frequent asymptotic tests are those of Cochran [7] and Mantel and Haenszel [8], both of which are very similar; the exact version of the test is due to Birch [9] (and has recently been reconsidered by Jung [10]). In all these cases the proposed tests are conditional and, when there is only one stratum, the test for the case of only one 2 × 2 table is obtained (Fisher's exact test or Pearson's chi-squared test). Moreover, Jung [10] and Jung et al. [11] propose a sample size calculation method, asymptotic in the first and exact in the second.
The procedures indicated have the drawback of almost all tests for a global null hypothesis like the one in question: the result of the global (stratified) test may not be compatible with that of the individual tests (the test for each stratum). In this paper, we propose a global test (MC test) which does not have this disadvantage because it is based on a multiple comparisons method: the global test is significant if and only if at least one of the individual tests is significant. In return, the MC test has the drawback of being less powerful, given that it must control both the alpha error of the global test and the alpha errors in the individual tests. Because of this, another procedure is proposed (MCB test) which only controls the alpha error of the global test (just as in the classic stratified tests), although the alpha error in the individual tests will only exceed the nominal value on a few occasions (and generally by very little). The two procedures are applicable from both the conditional and the unconditional point of view, and for either an asymptotic or an exact test. The advantage of applying them in the form of an unconditional test is that the loss of power mentioned above is reduced with regard to the classic global tests. In addition, this paper shows that the asymptotic tests function well, even for small samples, if they are carried out with the appropriate continuity correction. Finally, the sample size is determined for almost all the cases studied (exact or asymptotic tests, conditional or unconditional tests).

Hypothesis Test
2.1. Notation, Models, and Example. In the following (without loss of generality) it will be assumed that each 2 × 2 table refers to the successes or failures in two treatments which are applied to $n_{1j}$ and $n_{2j}$ individuals, respectively. Let $J$ be the number of strata, $n_j = n_{1j} + n_{2j}$ ($j = 1, \ldots, J$) the total of individuals in stratum $j$, $n = \sum_j n_j$ the total sample size, $\{x_{1j}, y_{1j} = n_{1j} - x_{1j}\}$ and $\{x_{2j}, y_{2j} = n_{2j} - x_{2j}\}$ the number of successes and the number of failures with treatments 1 and 2, respectively, and $a_{1j} = x_{1j} + x_{2j}$ and $a_{2j} = y_{1j} + y_{2j}$ the total number of successes and failures in stratum $j$, respectively. These data may be summarized as shown in Table 1. Once the experiment has been performed, the values obtained will be written with an extra subindex "0". Let $p_{1j}$ and $p_{2j}$ ($q_{1j} = 1 - p_{1j}$ and $q_{2j} = 1 - p_{2j}$) be the probabilities of success (failure) with treatments 1 and 2 in stratum $j$, respectively. The odds ratio for each stratum is $\theta_j = p_{1j} q_{2j} / (p_{2j} q_{1j})$, and the aim is to contrast the null hypothesis $H: \theta_1 = \cdots = \theta_J = 1$ against an alternative hypothesis with one tail ($K: \theta_j > 1$ for some $j$) or with two tails ($K: \theta_j \ne 1$ for some $j$). This paper addresses only the case of the one-sided test; for the two-tailed test the procedure is similar.
In the previous description it was assumed that the data $(x_{1j}, x_{2j})$ of each stratum $j$ proceed from a double binomial distribution of sizes $n_{1j}$ and $n_{2j}$ and probabilities $p_{1j}$ and $p_{2j}$ in groups 1 and 2, respectively. Because in each stratum there are two previously fixed values ($n_{1j}$ and $n_{2j}$), the model will be referred to as Model 2; the model is very frequently used in practice, so it will serve here as a basis for defining and illustrating the procedures MC and MCB. If in each stratum there is conditioning on the observed value $a_{1j} = x_{1j} + x_{2j}$, then one has Model 3; now the three values $n_{1j}$, $n_{2j}$, and $a_{1j}$ are previously fixed in each stratum and the only variable, $x_{1j}$, arises from a hypergeometric distribution. If only the values of $n_j$ are fixed in each stratum $j$, one will get Model 1: $(x_{1j}, x_{2j}, y_{1j})$ proceeding from a multinomial distribution. Finally, if only the global sample size $n$ is fixed (so that now even the values for $n_j$ are obtained at random), one will have Model 0. With conditioning on the appropriate marginal, Model $M$ leads to Model $(M + 1)$. Therefore, whatever the initial model (i.e., whatever the sampling method for the data obtained), by conditioning on all the nonfixed marginals one always obtains Model 3 (which is the one covered by Birch and Mantel and Haenszel).
Each model produces a different sample space, which is formed by the set of all possible values of the set of variables involved in it. For example, the sample space of stratum $j$ under Model 2 consists of the $(n_{1j} + 1) \times (n_{2j} + 1)$ possible values of $(x_{1j}, x_{2j})$. Each transition from a Model $M$ to Model $(M + 1)$ constitutes a loss of information, because the number of points of the new sample space is very much smaller than that of the previous one. Probably the most dramatic transition is that of Models 2 to 3, a transition in which the loss of information may reach 26% for a single stratum ($J = 1$) [5]. In addition, each transition implies using a conditional rather than an unconditional method of eliminating nuisance parameters, something which is generally not advisable [13].
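As a rough illustration of how much smaller the conditional sample space is, the following Python sketch counts the points of each stratum under Model 2 and under Model 3. The tuples in `STRATA` are an assumption: the group sizes are those quoted for the example of Li et al. [12] later in the text, and the success totals are inferred from the conditional ranges quoted there, not copied from Table 2. The function names are illustrative only.

```python
from math import prod

# Assumed data (reconstructed, see lead-in): (n1, n2, a1) per stratum, i.e.
# size of group 1, size of group 2, total number of successes.
STRATA = [(11, 13, 22), (9, 12, 20), (8, 10, 15)]


def model2_points(n1, n2):
    """Number of (x1, x2) pairs when only the two group sizes are fixed."""
    return (n1 + 1) * (n2 + 1)


def model3_points(n1, n2, a1):
    """Number of attainable x1 values when the success total a1 is also fixed."""
    lowest = max(0, a1 - n2)
    highest = min(n1, a1)
    return highest - lowest + 1


model2_counts = [model2_points(n1, n2) for n1, n2, _ in STRATA]
model3_counts = [model3_points(n1, n2, a1) for n1, n2, a1 in STRATA]

print("Model 2:", model2_counts, "global:", prod(model2_counts))
print("Model 3:", model3_counts, "global:", prod(model3_counts))
```

Under these assumed counts the global sample space shrinks from over two million points to a few dozen, which is the loss of information the paragraph above refers to.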
The data in Table 2, which are given by Li et al. [12], are taken from the preliminary analysis of an experiment with three groups to evaluate whether thymosin (treatment 1), compared to a placebo (treatment 2), has any effect on the treatment of bronchogenic carcinoma patients receiving radiotherapy. The one-sided p values are $P_{\text{Birch}} = 0.1563$ by the global conditional stratified exact test and $P_1 = 0.80073$, $P_2 = 0.57143$, and $P_3 = 0.14706$ by the individual exact tests in each stratum. For a global error $\alpha = 0.1563$ we conclude $K$, so that now $\theta_j > 1$ at least once. However, no individual test is significant if these are carried out at an alpha error that respects the former global error; for example, by using Bonferroni's method, the smallest of the three p values is $P_3 = 0.14706 > 0.1563/3$. The same thing occurs if asymptotic tests are used. Our aim is to define procedures in which these incompatibilities will not occur.

Conditional Tests Obtained by Using Classic Methods (Model 3). The p value of the exact test is $P_{\text{Birch}} = 0.1563$. Table 3 shows this value and the remaining p values in this paper. This result is based on determining the probability of all the configurations $(x_{1j} \mid a_{1j}, n_{1j}, n_{2j})$, $j = 1, 2, \ldots, J$, such that

$$S = \sum_j x_{1j} \ge S_0,$$

where $S_0$ is the observed value of $S$.
Here $S$ is a test statistic determining the order in which the points of the sample space $(x_{11}, x_{12}, x_{13})$ enter the region $R$, a region whose probability under $H$ yields the value of $P_{\text{Birch}}$. Note that as the sample spaces in each stratum are $9 \le x_{11} \le 11$, $8 \le x_{12} \le 9$, and $5 \le x_{13} \le 8$, the possible values of $(x_{11}, x_{12}, x_{13})$ will be 3 × 2 × 4 = 24, which is the total number of points in the global sample space; of these, four belong to $R$ (three with $S = 27$ and one with $S = 28$), a fraction 4/24 = 0.1667 of the sample space. Moreover note that, under the original Model 2, the number of points in the sample space of strata 1, 2, and 3 are $(n_{1j} + 1) \times (n_{2j} + 1)$ = (11 + 1) × (13 + 1), (9 + 1) × (12 + 1), and (8 + 1) × (10 + 1), respectively. The total points for the global sample space will be 168 × 130 × 99: more than two million, compared to only 24 in Model 3. Various programs have been developed to determine the value $P_{\text{Birch}}$ (see references in [14]); an easy way to get it is through http://www.openepi.com/Menu/OE Menu.htm (option "Two by Two Table").
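The conditional calculation described above can be sketched in a few lines of Python: enumerate the 24 points of the Model 3 sample space, weight each one by a product of hypergeometric probabilities, and add up the points with a sum of successes at least as large as the observed one. The counts in `STRATA` are an assumption reconstructed from the ranges, group sizes, and p values quoted in the text (Table 2 is not reproduced here), and all function names are illustrative.

```python
from itertools import product
from math import comb

# Assumed data (reconstructed, see lead-in): (n1, n2, a1, observed x1).
STRATA = [(11, 13, 22, 10), (9, 12, 20, 9), (8, 10, 15, 8)]


def hypergeom_pmf(x, n1, n2, a1):
    """Pr{x1 = x | a1} under H (hypergeometric distribution)."""
    return comb(n1, x) * comb(n2, a1 - x) / comb(n1 + n2, a1)


def support(n1, n2, a1):
    """Attainable values of x1 once n1, n2 and a1 are fixed."""
    return range(max(0, a1 - n2), min(n1, a1) + 1)


def fisher_one_sided(n1, n2, a1, x_obs):
    """Individual one-sided exact p value, Pr{x1 >= x_obs | a1}."""
    return sum(hypergeom_pmf(x, n1, n2, a1)
               for x in support(n1, n2, a1) if x >= x_obs)


def stratified_exact_p(strata):
    """Global conditional p value: Pr{sum of x1 >= observed sum}."""
    s_obs = sum(x_obs for *_, x_obs in strata)
    supports = [support(n1, n2, a1) for n1, n2, a1, _ in strata]
    p_value, n_points, n_region = 0.0, 0, 0
    for point in product(*supports):
        n_points += 1
        if sum(point) >= s_obs:
            n_region += 1
            prob = 1.0
            for x, (n1, n2, a1, _) in zip(point, strata):
                prob *= hypergeom_pmf(x, n1, n2, a1)
            p_value += prob
    return p_value, n_points, n_region


individual = [fisher_one_sided(*s) for s in STRATA]
p_global, n_points, n_region = stratified_exact_p(STRATA)
print([round(p, 5) for p in individual], round(p_global, 4), n_points, n_region)
```

Full enumeration is only practical because the conditional sample space is tiny; it is used here for transparency rather than efficiency.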
The asymptotic test of Mantel-Haenszel is based on the fact that $S = \sum_j x_{1j}$ is asymptotically normal with mean $\sum_j E_j = \sum_j a_{1j} n_{1j} / n_j$ and variance $\sum_j V_j = \sum_j a_{1j} a_{2j} n_{1j} n_{2j} / [n_j^2 (n_j - 1)]$. Therefore the contrast statistic is $z_{MH} = (S - \sum_j E_j) / (\sum_j V_j)^{0.5}$, whose p value $P_{MH} = 0.0760$ patently does not agree with $P_{\text{Jung}} = 0.1563$. However, because the variable $S$ is discrete, it is convenient to carry out a continuity correction [15]. As $S$ jumps one unit at a time, the cc should be 0.5, and so the statistic with cc will be $z_{MHc} = (S - \sum_j E_j - 0.5) / (\sum_j V_j)^{0.5}$ [8]. The new p value $P_{MHc} = 0.1573$ is now compatible with the exact value.
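The effect of the continuity correction can be checked numerically with the short Python sketch below, which evaluates the Mantel-Haenszel statistic with cc = 0 and cc = 0.5. The counts in `STRATA` are an assumption reconstructed from the figures quoted in the text (Table 2 is not reproduced here), and the function names are illustrative.

```python
from math import erfc, sqrt

# Assumed data (reconstructed, see lead-in): (n1, n2, a1, observed x1).
STRATA = [(11, 13, 22, 10), (9, 12, 20, 9), (8, 10, 15, 8)]


def upper_tail(z):
    """One-sided p value Pr{Z >= z} for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))


def mantel_haenszel_p(strata, cc=0.0):
    """One-sided Mantel-Haenszel p value with continuity correction cc."""
    s = sum(x1 for *_, x1 in strata)
    mean = 0.0
    var = 0.0
    for n1, n2, a1, _ in strata:
        n = n1 + n2
        a2 = n - a1                      # total number of failures
        mean += a1 * n1 / n
        var += a1 * a2 * n1 * n2 / (n ** 2 * (n - 1))
    z = (s - mean - cc) / sqrt(var)
    return upper_tail(z)


p_mh = mantel_haenszel_p(STRATA)            # without cc
p_mhc = mantel_haenszel_p(STRATA, cc=0.5)   # with cc
print(round(p_mh, 4), round(p_mhc, 4))
```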

MC and MCB Tests Based on the Criterion of the Multiple Comparisons: General Observations. Let us suppose that in each stratum $j$ the hypotheses $H_j: \theta_j = 1$ versus $K_j: \theta_j > 1$ are contrasted at error $\alpha_j$. Thereby $H = \cap H_j$ and $K = \cup K_j$. If $K$ is concluded when at least one of the individual tests concludes $K_j$, then, because the strata are independent, the global alpha error will be

$$\alpha = 1 - \prod_j (1 - \alpha_j). \quad (1)$$

Table 3: p values obtained by various methods for the data in the example of Li et al. [12]. Each asymptotic method is placed directly below the exact method from which it proceeds.

In particular, if $\alpha_j = \alpha'$ ($\forall j$) method MC is obtained (the "method of the multiple comparisons"), and its global alpha error will be

$$\alpha = 1 - (1 - \alpha')^J. \quad (2)$$

Method MC guarantees the compatibility of the results of the global test and of the individual tests, because the global test is significant if and only if at least one of the individual tests is so. When $J = 1$, the global test is the same as the individual test.
On the basis of the above, in general the test can be defined as follows. In each stratum $j$ an order statistic will have been defined which allows the p value for each one of its points to be determined. If the points from all strata are mixed, they are ordered from the lowest to the highest p value and are introduced one by one into the global critical region $R$ until a given condition (stopping rule) has been verified; then $R = \cup R_j$, with $R_j$ the critical region formed by the points in stratum $j$ which belong to $R$. Let $\alpha_j$ be the largest of the p values of the points in $R_j$. The real global alpha error of the test constructed thus will be given by expression (1).
When the stopping rule is "stop introducing points into $R$ when the maximum of the $\alpha_j$ is as close as possible to $\alpha'$ (but less than or equal to $\alpha'$)," with $\alpha'$ given by

$$\alpha' = 1 - (1 - \alpha)^{1/J}, \quad (3)$$

then method MC is obtained, and this method simultaneously controls the global error $\alpha$ and the individual error $\alpha'$. Now, the critical region $R_j = R_{jMC}$ of each stratum consists of all the points whose p value is smaller than or equal to $\alpha'$, $\alpha_j = \alpha_{jMC} \le \alpha'$, $R = R_{MC} = \cup R_{jMC}$, and the real global error will be $\alpha_{MC} = 1 - \prod_j (1 - \alpha_{jMC}) \le \alpha$.

It is a simpler process to obtain the p value $P_{MC}$ of some observed data. Let $P_j$ be the p value of the individual test in stratum $j$. The first individual alpha error for which $K$ is concluded will be $\alpha' = P_0 = \min_j P_j$, so that by expression (2) the p value of the global test will be

$$P_{MC} = 1 - (1 - P_0)^J. \quad (4)$$

When the stopping rule is "stop introducing points into $R$ when $1 - \prod_j (1 - \alpha_j)$ is the closest possible to $\alpha$ (but smaller than or equal to $\alpha$)," method MCB is obtained (the method "based on the multiple comparisons"). Because now only the global error is controlled, its goal is similar to that of Jung's method [10]. With method MCB, $R_j = R_{jMCB}$, $\alpha_j = \alpha_{jMCB}$, $R = R_{MCB} = \cup R_{jMCB}$, and the real global error is $\alpha_{MCB} = 1 - \prod_j (1 - \alpha_{jMCB}) \le \alpha$. Note that $R_{MC} \subseteq R_{MCB}$, since $\alpha_{MC} \le \alpha$, something to be expected given that method MC controls two errors and the MCB method controls only one of these.
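The two MC formulas are simple enough to check directly. The Python sketch below computes the individual error that corresponds to a global error, and the MC p value that corresponds to the smallest individual p value; the relationship used is the one implied by the p values quoted elsewhere in this paper for the three-stratum example. The function names are illustrative, not taken from any of the programs cited.

```python
def individual_error(alpha, n_strata):
    """Individual alpha error for which the global error equals alpha."""
    return 1 - (1 - alpha) ** (1 / n_strata)


def mc_p_value(p_min, n_strata):
    """Global MC p value given the smallest individual p value."""
    return 1 - (1 - p_min) ** n_strata


# Global error 0.10 with three strata.
print(round(individual_error(0.10, 3), 5))

# Smallest individual p values quoted in the text for the three-stratum example.
for p_min in (0.15132, 0.05653, 0.04472, 0.05317):
    print(p_min, round(mc_p_value(p_min, 3), 4))
```

With a single stratum both functions reduce to the identity, which matches the remark that the global and individual tests then coincide.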
Let us see how we can obtain the p value $P_{MCB}$ of some observed data in which, for example, $P_0 = P_1$. The region $R_{MCB}$ which yields the first significance of the global test is obtained when the observed point in stratum 1 is the last introduced into $R_{MCB}$, that is, when $\alpha_{1MCB} = P_0$; in the other strata it should be $\alpha_{jMCB} \le P_0$, but as close as possible to $P_0$. Thus the p value will be $P_{MCB} = 1 - \prod_j (1 - \alpha_{jMCB})$. It can now be seen that $\alpha_{jMCB} = \alpha_{jMC}$, where $\alpha_{jMC}$ are the values of the MC test when this is carried out at the error $\alpha' = P_0$. Therefore $P_{MCB} \le P_{MC}$ and, for the purpose of calculating the p value $P_{MCB}$, the values $\alpha_{jMCB} = \alpha_{jMC}$ and the regions $R_{jMC} = R_{jMCB}$ will be written just as $P_j^*$ and $R_j^*$, respectively. Thus, if $P_j^*$ is the largest p value in stratum $j$ which is smaller than or equal to $P_0$,

$$P_{MCB} = 1 - \prod_j (1 - P_j^*). \quad (5)$$

Methods MC and MCB may be applied with exact methods or with asymptotic methods and to any of the three models, as illustrated in the following sections.

For the data in the example, $P_0 = P_3 = 0.14706$ and the p value $P_{MC}$ is given by expression (4). In order to apply method MCB, the critical regions $R_j^*$ ($j = 1$ and 2) must be determined at the objective error $\alpha' = 0.14706 = P_0 = P_3^*$. For $j = 1$, $9 \le x_{11} \le 11$ with $\Pr\{x_{11} = 11 \mid H_1\} = 0.2826 > P_0$; thus $R_1^* = \emptyset$ and $P_1^* = 0$. The same occurs for $j = 2$ ($P_2^* = 0$). By expression (5), $P_{MCB} = 0.1471$ (smaller than $P_{\text{Jung}}$). Generally speaking, the critical region of Birch [9] and Jung [10] has the form $S = \sum_j x_{1j} \ge S_0$, with $S_0$ the observed value of $S$, while that of method MCB has the form $\cup \{x_{1j} \ge x_{1j}^*\}$, with $x_{1j}^*$ greater than or equal to the observed value of $x_{1j}$. It can be proved that this generally implies that the Birch method will yield a p value smaller than or equal to that of method MCB when the p values $P_j$ are similar or when the observed values of $x_{1j}$ are the highest possible.
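The MCB calculation for the conditional exact case can be sketched as follows: list the attainable one-sided p values in each stratum, keep in each stratum the largest one that does not exceed the smallest observed p value (or 0 when there is none), and combine them as a product. The counts in `STRATA` are an assumption reconstructed from the figures quoted in the text (Table 2 is not reproduced here), and the function names are illustrative.

```python
from math import comb, prod

# Assumed data (reconstructed, see lead-in): (n1, n2, a1, observed x1).
STRATA = [(11, 13, 22, 10), (9, 12, 20, 9), (8, 10, 15, 8)]


def upper_tail_p_values(n1, n2, a1):
    """Map each attainable x1 to Pr{X1 >= x1 | a1} under H."""
    lowest, highest = max(0, a1 - n2), min(n1, a1)
    total = comb(n1 + n2, a1)
    tail, tails = 0.0, {}
    for x in range(highest, lowest - 1, -1):
        tail += comb(n1, x) * comb(n2, a1 - x) / total
        tails[x] = tail
    return tails


def mcb_p_value(p_stars):
    """Combine the per-stratum values P*_j into the global MCB p value."""
    return 1 - prod(1 - p for p in p_stars)


def conditional_mcb(strata, tol=1e-12):
    """Return (P_0, [P*_j], P_MCB) for the conditional exact test."""
    tails = [upper_tail_p_values(n1, n2, a1) for n1, n2, a1, _ in strata]
    observed = [t[x_obs] for t, (*_, x_obs) in zip(tails, strata)]
    p0 = min(observed)
    # Largest attainable p value not exceeding P_0; 0 if the region is empty.
    p_stars = [max((p for p in t.values() if p <= p0 + tol), default=0.0)
               for t in tails]
    return p0, p_stars, mcb_p_value(p_stars)


p0, p_stars, p_mcb = conditional_mcb(STRATA)
print(round(p0, 5), [round(p, 5) for p in p_stars], round(p_mcb, 4))
```

The helper `mcb_p_value` only needs the values $P_j^*$, so it can be reused with p values produced by any other individual test, exact or asymptotic.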

MC and MCB Tests under
Let us now apply an asymptotic test. In general, whatever the model is, the appropriate statistic is the chi-squared statistic with continuity correction $c_j$ given by expression (6) [6]. The appropriate value for the continuity correction depends on the assumed model, and that value is what causes the results of the three models to be different. When $c_j = 0$ ($\forall j$) Pearson's classic chi-squared statistic is obtained. In the case here of Model 3, by making $c_j = n_j/2$ the classic statistic $z_{3j}$ (or the Yates chi-squared statistic) is obtained. Its maximum value is reached in stratum 3 ($z_{33} = 1.0308$), which yields the p values $P_0 = 0.15132$ and $P_{MC} = 0.3887$. In order to apply method MCB, one must obtain in the other two strata the first value $z_{3j}^*$ of $z_{3j}$ which is larger than or equal to $z_{33}$. As there is none, $P_1^* = P_2^* = 0$, $P_3^* = 0.15132$, and $P_{MCB} = 0.1513$. Note that the asymptotic p values are similar to the exact ones, both with method MC and with method MCB. Despite the small size of the samples, the asymptotic methods function well (something which also occurs with the rest of the methods, as will be seen).

The last step, (4), is to determine the p value as $P = \max_{p_j} \alpha(p_j)$, where $p_j$ is the nuisance parameter that is eliminated by maximization (the most complicated step). Note that $p_j$ is the marginal probability of columns under $H$. In the case of Model 3 there is only one order statistic possible [17], because the convexity of the region must be verified and the points ordered "from the largest to the smallest value of $x_{1j}$." In the case of Model 2 there are many possible test statistics. One of these is the order of Boschloo [18]: order the points from the smaller to the larger value of their one-tailed p value obtained using the Fisher exact test. It is already known [19] that the unconditional test based on this order is uniformly more powerful (UMP) than Fisher's own exact test. Although no unconditional order is UMP compared to the rest, the generally most powerful order is [3] the complex statistic of Barnard [20].

MC and MCB Tests under
As far as we know, the only program that carries out the above calculations for Barnard's statistic is SMP.EXE, which may be obtained free of charge at http://www.ugr.es/local/bioest/software.htm. The program also gives the solution for other simpler test statistics. Using this program, because the minimum p value is $P_3 = 0.05653$, then $P_{MC} = 0.1602$. In order to obtain $P_{MCB}$ one has to proceed as in the previous section, although the process is now somewhat more difficult. In stratum 1, the table $(x_{11}, x_{21}) = (11, 10)$ is the one that gives the largest p value, $P_1^* = 0.05462$, that is smaller than or equal to $P_3^* = 0.05653$. In stratum 2 the results are $(x_{12}, x_{22}) = (4, 1)$ and $P_2^* = 0.05069$. So $P_{MCB} = 0.1533$, a value which is similar to $P_{\text{Birch}}$ (the results are alike if other order statistics of the program SMP.EXE are used). It can be seen that the use of the unconditional method allows the inherent conservatism in the definitions of methods MC and MCB to be reduced.

MC and MCB Tests under Models 1 and 0.
Let us suppose now that the data contained in the example in Table 2 proceed from Model 1. The determination of the p value of an observed table $(x_{1j}, x_{2j}, y_{1j} \mid n_j)$ is the same as in Model 2, but now the calculations are more complicated because the nuisance parameters must be eliminated (the marginal probabilities of rows and columns under $H$). Again there are many possible test statistics [1,21], although none of them is UMP compared to the others. The generally more powerful statistic is again Barnard's statistic [22] and, as far as we know, the only program to apply it is TMP.EXE, which may be obtained free of charge at http://www.ugr.es/local/bioest/software.htm. The program also gives the solution using other simpler test statistics. Using this program, the minimum p value is $P_3 = 0.04472$ and from this $P_{MC} = 0.1282$ (substantially smaller than $P_{\text{Birch}}$).
In order to carry out the asymptotic test we shall use the optimal version of expression (6) for Model 1: $z_{1j}$ is the value of expression (6) when $c_j = 0.5$ ($\forall j$) [6]. The statistic is given by Pirie and Hamdan [23]. Now the maximum value is $z_{13} = 1.6149$, with the result that $P_3 = 0.05317$ and $P_{MC} = 0.1512$.
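The asymptotic values quoted for Models 3 and 1 can be reproduced with the Python sketch below. Expression (6) is not legible in this copy, so the form of the statistic used here, including the factor $n_j - 1$, is an assumption chosen because it reproduces the quoted values $z_{33} = 1.0308$ and $z_{13} = 1.6149$; it should not be read as the authors' exact expression. The tables are likewise reconstructed from the figures in the text, and the function names are illustrative.

```python
from math import erfc, sqrt

# Assumed data (reconstructed, see lead-in): (x1, y1, x2, y2) per stratum.
TABLES = [(10, 1, 12, 1), (9, 0, 11, 1), (8, 0, 7, 3)]


def z_with_cc(x1, y1, x2, y2, c):
    """One-sided statistic with continuity correction c (assumed form)."""
    n1, n2 = x1 + y1, x2 + y2
    a1, a2 = x1 + x2, y1 + y2
    n = n1 + n2
    return (x1 * y2 - x2 * y1 - c) * sqrt((n - 1) / (a1 * a2 * n1 * n2))


def upper_tail(z):
    """One-sided p value Pr{Z >= z} for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))


def largest_statistic(tables, cc_rule):
    """Return (stratum number, z, p) for the stratum with the largest z."""
    zs = [z_with_cc(*t, cc_rule(sum(t))) for t in tables]
    j = max(range(len(zs)), key=zs.__getitem__)
    return j + 1, zs[j], upper_tail(zs[j])


model3 = largest_statistic(TABLES, lambda n: n / 2)   # Yates: c = n/2
model1 = largest_statistic(TABLES, lambda n: 0.5)     # Pirie-Hamdan: c = 0.5
print(model3)
print(model1)
```

In both cases the largest statistic comes from stratum 3, and the resulting p value is the smallest individual p value used for the MC and MCB p values.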

Example and Conditional Solutions Obtained by Classic Methods. Jung [10] proposes a sample size calculation for his stratified exact test. For the example described in Section 2.1, he accepts Model 2 and sets out a case study with $n_j = n/3$ and $n_{1j} = n/6$. The aim is to determine the value of $n$ for the alternative hypotheses of that case study. Let us suppose that generally $n_{2j} = r_j n_{1j}$, with $r_j$ known values, and that the aim is to determine the values $n_{1j}$ which guarantee the desired power, which implies using Model 2. The reasoning that follows is the same as that with which Casagrande et al. [24] and Fleiss et al. [25] obtained the classic formula for sample size in the comparison of two independent proportions; it leads to expression (7), with solutions without and with cc. If the solution is restricted to the case of $n_{1j} = m$ ($\forall j$), by making $z_{1-\beta}$ equal to the fraction of expression (7), expression (8) is obtained. The solutions $m_0$ and $m_c$ are those of the tests MH and MHc, respectively. Frequently $r_j = 1$ ($\forall j$); in this case expression (8) takes the explicit form of expression (9). For the example at the beginning of this section (in which $r_j = 1$), if at first we restrict the solution to $n_{11} = n_{12} = n_{13} = m$, expression (9) indicates that $m_0 = 8.27$ and $m_c = 11.3$. Assuming that in this example the values of $n_{1j}$ are allowed to differ at most by 1, then the solution that is sought must be $8 \le n_{1j} \le 9$ ($\forall j$) without cc or $11 \le n_{1j} \le 12$ ($\forall j$) with cc. In the second phase, expression (7) indicates that $n_{11} = n_{12} = 11$ and $n_{13} = 12$ is the first configuration for which $\beta_{MHc}$ (= 0.183) ≤ 0.2, so that this is the solution with cc that was being sought ($n = 68$). The solution without cc is obtained in the same way ($n_{11} = n_{12} = 8$, $n_{13} = 9$, and $n = 50$), but it is too liberal.

Solution Using the Exact Method MC.
For fixed values of the global error $\alpha$ and the sample sizes $(n_{1j}, n_{2j})$, the method MC described in Section 2.3 allows one to obtain the critical region $R_{MC}$ and the real type 1 error $\alpha_{MC} \le \alpha$. Moreover, let $\beta_{jMC}$ be the error beta for each individual test, with $1 - \beta_{jMC}$ equal to the probability of the region $R_{jMC}$ under $K_j$. Because of the way method MC was defined, the real global error beta will be $\beta_{MC} = \prod_j \beta_{jMC}$. If $\beta_{MC} \le \beta$, these values $\{(n_{1j}, n_{2j})\}$ guarantee the desired power. If $\beta_{MC} > \beta$, it is necessary to increase some values of $n_{1j}$ and/or $n_{2j}$ and to repeat the previous procedure. Let us initially assume that $n_{2j} = n_{1j}$. The process for determining the sample sizes may be shortened if it begins with a value $n_{1j} = m$ ($\forall j$) like that of expression (8). With the method MC, one obtains that $m = 12$ is not a solution because $\beta_{MC} = 0.2262 > 0.2$, but $m = 13$ is a solution because $\beta_{MC} = 0.1723 \le 0.2$. The solution can now be refined by allowing the values $n_{1j}$ to differ by a maximum of one. The final solution is $n_{11} = n_{12} = 12$, $n_{13} = 13$ ($n = 74$), $\alpha_{MC} = 0.0881$, and $\beta_{MC} = 0.1880$. Unconditional tests are more powerful when the sample sizes are slightly different [3], since the number of ties produced by whatever statistic is used is reduced. By planning $n_{2j} = n_{1j} + 1$ and making the values of $n_{1j}$ consecutive, the solution $n_{11} = 10$, $n_{12} = 11$, and $n_{13} = 12$ ($n = 69$) is obtained, with $\alpha_{MC} = 0.0924$ and $\beta_{MC} = 0.1821$ (the solution based on $n_{2j} = n_{1j} - 1$ is worse). Actually, stratum 1 is of virtually no interest since in it $p_{11} = p_{21}$. Despite everything, if it is introduced, the configuration $n_{2j} = n_{1j} + 1$, $n_{11} = 1$, $n_{12} = 11$, and $n_{13} = 12$ ($n = 51$) is correct because $\alpha_{MC} = 0.0602$ and $\beta_{MC} = 0.1833$.

Solution Using the Asymptotic Method MC Based on the Chi-Square Test with cc. In the following the procedure is the same as in Section 2.1, assuming for the moment that $n_{1j}$ and $n_{2j}$ can be any values. The numerator of $z_{2j}$ may be written as $\hat{d}_j - c_j$, where $c_j$ is the cc of Model 2 ($c_j = 2$ or 1 depending on whether $n_{1j}$ and $n_{2j}$ are equal or different, respectively) and $\hat{d}_j = x_{1j} y_{2j} - x_{2j} y_{1j}$ (the base statistic for the test) is asymptotically normal with mean $\mu_j = n_{1j} n_{2j} (p_{1j} - p_{2j})$ and variance $\sigma_j^2 = n_{1j} n_{2j} (n_{2j} p_{1j} q_{1j} + n_{1j} p_{2j} q_{2j})$.
Under $H$, $p_{1j} = p_{2j} = p_j$ and $\hat{d}_j$ is asymptotically normal with mean 0 and variance $\sigma_{0j}^2 = n_{1j} n_{2j} n_j p_j q_j$, with $q_j = 1 - p_j$. Because under $K$ the nuisance parameter $p_j$ is estimated by $a_{1j}/n_j$, it is usual to substitute it by its average value under $K$, that is, by $\bar{p}_j = (n_{1j} p_{1j} + n_{2j} p_{2j}) / n_j$. For the data in the example, $\alpha' = 1 - 0.9^{1/3} = 0.03451$ and, by making $n_{1j} = m$ ($\forall j$), the solution based on expression (12) is $m = 12$. This solution can be refined by allowing the values of $n_{1j}$ to differ by a maximum of one, in which case the new solution, now based on expression (11), is