Estimating the Proportion of True Null Hypotheses in Multiple Testing Problems

The problem of estimating the proportion, π0, of true null hypotheses in a multiple testing problem is important in cases where large-scale parallel hypothesis tests are performed independently. While π0 is a quantity of interest in its own right in applications, an estimate of π0 can also be used for assessing or controlling an overall false discovery rate. In this article, we develop a nonparametric maximum likelihood approach to estimating π0. The nonparametric likelihood is restricted to multinomial models, and an EM algorithm is developed to approximate the estimate of π0. Simulation studies show that the proposed method outperforms other existing methods. Using experimental microarray datasets, we demonstrate that the new method provides satisfactory estimates in practice.


Introduction
Estimating the proportion π0 of true null hypotheses in a multiple testing setup is crucial for assessing and/or controlling the false discovery rate, which is of particular significance in genomics, disease discovery, and cancer research. Langaas et al. [1] remarked: "An important reason for wanting to estimate π0 is that it is a quantity of interest in its own right. In addition, a reliable estimate of π0 is important when we want to assess or control multiple error rates, such as the false discovery rate FDR of Benjamini and Hochberg [2]." In the case of testing for differential expression in DNA microarrays, the proportion of differentially expressed genes is 1 − π0, and it is important to know whether, for example, 5% or 35% of the genes are differentially expressed, even if we cannot identify these genes (see Langaas et al. [1]). Multiple testing refers to any instance that involves the simultaneous testing of several hypotheses. A common feature in genomic studies is the analysis of a large number of simultaneous measurements on a small number of samples. One must decide whether the findings are truly causative correlations or just the byproducts of multiple hypothesis testing (Gyorffy et al. [3]). If one does not take the multiplicity of tests into account, then the probability that some of the true null hypotheses are rejected by chance alone may be unduly large.
In a multiple hypothesis testing problem, m null hypotheses are tested simultaneously; that is, we test H0i versus H1i, for i = 1, 2, ..., m, simultaneously. Assume that the m tests are constructed based on the observed p values, p1, ..., pm, respectively. The unknown quantity π0 to be estimated is the proportion of true null hypotheses among H01, ..., H0m. Introduce the i.i.d. Bernoulli random variables H1, ..., Hm with P(Hi = 0) = π0. Then Hi can be interpreted in terms of the multiple testing problem as follows: Hi = 0 if the ith null hypothesis is true and Hi = 1 otherwise, for i = 1, ..., m. We assume the p values, p1, ..., pm, are continuous and independent random variables, so that the p values are independently and identically distributed as Unif(0, 1) when the null hypotheses are all true. One chooses to reject or fail to reject each null hypothesis based on the corresponding p value. The outcomes of the tests are summarized in Table 1.
In Table 1, m0 is the number of true null hypotheses and R is the observable random variable representing the number of hypotheses rejected. Note that all the other random variables in Table 1, the counts U, V, T, and S of correct and incorrect decisions, are unobservable.
The problem of estimating the proportion π0 arises naturally in assessing or controlling an overall false rejection rate in simultaneous hypothesis testing problems. A reliable estimate of π0 is crucial when we want to control and/or assess the false discovery rate (FDR) proposed by Benjamini and Hochberg [2], defined as

FDR = E(Q), where Q = V/R if R > 0 and Q = 0 otherwise.

Benjamini and Hochberg [2] prove that Simes' procedure (Simes [4]) has the FDR controlled at level α if the underlying test statistics and the corresponding p values are continuous and independently and identically distributed. Specifically, they show

FDR ≤ (m0/m) α = π0 α ≤ α.

Consequently, if γ is the FDR level that one wishes to achieve and if π0 can be estimated efficiently, say by π̂0, then α can be chosen to be α = γ/π̂0 to gain additional testing power in the multiple testing problem while keeping the FDR under control. If π̂0 overestimates π0 substantially, however, the value of α is underestimated substantially, leading to significantly narrower simultaneous confidence intervals for multiple comparisons and a significant reduction of testing power for the multiple testing problems. On the other hand, if α is chosen via some other procedure, say the Bonferroni method in the multiple comparison problem, then assessing the FDR accurately via an efficient estimate of π0 is of great interest.

The paper is organized as follows. Section 2 reviews existing estimation methods; Section 3 introduces the new estimating procedure; Section 4 contains simulation results; and Section 5 presents applications of the new estimating procedure to real-life examples.
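To make the role of π̂0 concrete: running the Benjamini-Hochberg step-up procedure at the inflated level γ/π̂0 rejects at least as many hypotheses as running it at γ, while the bound FDR ≤ π0 α keeps the realized rate at or below γ. The following Python sketch is ours, not from the article; the synthetic p values, the Beta(0.5, 8) alternative, and the assumed value of π̂0 are illustrative assumptions.

```python
import numpy as np

def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up procedure: boolean mask of rejected
    hypotheses at nominal FDR level alpha."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        kmax = np.max(np.nonzero(below)[0])  # largest rank passing its threshold
        reject[order[: kmax + 1]] = True
    return reject

# When pi0 can be estimated, BH at the inflated level gamma / pi0_hat
# rejects at least as many hypotheses while the FDR stays below gamma.
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=800),            # true nulls
                    rng.beta(0.5, 8.0, size=200)])    # alternatives: small p values
gamma, pi0_hat = 0.05, 0.8
plain = int(bh_reject(p, gamma).sum())
adjusted = int(bh_reject(p, gamma / pi0_hat).sum())
print(plain, adjusted)  # the adjusted procedure rejects at least as many
```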

Mixture Model Framework.
A mixture model can be used to fit the p values in order to estimate the proportion, π0, of true null hypotheses when large-scale parallel hypothesis tests are performed independently. The estimate of π0 can be based on the following mixture model for the common density f of the p values:

f(p) = π0 + (1 − π0) h(p), 0 ≤ p ≤ 1, (6)

where h(p) is the conditional probability density function of the p value under an alternative (see Langaas et al. [1]); the null p values are uniformly distributed on [0, 1] and so contribute the constant density 1. Using this mixture representation, we are able to characterize the maximum likelihood estimate for f. The estimation method is derived under the assumption of independently and identically distributed p values. Since h(p) describes the configuration of the true alternative populations among the m underlying populations, a nonparametric approach to the estimation of π0 is more appealing. Without loss of generality, we define H = 1 as the alternative hypothesis, a convention used throughout this section. Three nonparametric estimators proposed recently by other authors are described and discussed in the following subsections.

Storey's Method.
Consider the common marginal mixture density of the p values and, for any λ ∈ (0, 1), the tail probability

P(p > λ) = π0 (1 − λ) + (1 − π0) P(p > λ | H = 1).

On the basis that P(p > λ | H = 1) is typically small, the large majority of the p values in the interval (λ, 1] should correspond to true null hypotheses and thus be uniformly distributed, so that (1 − π0) P(p > λ | H = 1) is approximately zero. Let W(λ) = #{p_i : p_i > λ}. Then W(λ) should be approximately equal to m π0 times the length of the interval (λ, 1]; that is, E(W(λ)) ≈ m π0 (1 − λ). Hence, Storey [6] proposes to estimate the proportion of true null hypotheses, π0, by (with an appropriately chosen λ)

π̂0(λ) = W(λ) / ((1 − λ) m).

The value of λ affects the behavior of π̂0(λ): π̂0(λ) has a large bias and small variance when λ is small, and a small bias and large variance when λ is large. Because of this bias-variance trade-off, Storey et al. [7] propose bootstrapping, a resampling technique, to choose λ so as to minimize the mean squared error of π̂0(λ). The resulting estimator is denoted by π̂0^S. Note that the idea leading to Storey's estimator is to treat the additional term (1 − π0) P(p > λ | H = 1) as zero, in order to avoid the complication caused by the unknown alternative distribution. As a consequence, Storey's estimator tends to overestimate π0 by approximately (1 − π0) P(p > λ | H = 1)/(1 − λ), at least theoretically. The anticipated size of the overestimate becomes less significant when π0 is closer to 1, that is, when 1 − π0 is closer to 0. The bias can be dominating when π0 is not close to 1, as is evident in the simulation results in Section 4 and as observed by other authors as well, for example, Langaas et al. [1].
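Storey's estimator π̂0(λ) = W(λ)/((1 − λ)m) can be sketched in a few lines. This is our illustration, not the article's code; the synthetic data (uniform nulls plus a Beta(0.5, 8) alternative) and the function name are assumptions.

```python
import numpy as np

def storey_pi0(pvalues, lam=0.5):
    """Storey's estimator: fraction of p values above lam, rescaled by
    the length (1 - lam) of the interval (lam, 1], capped at 1."""
    p = np.asarray(pvalues)
    w = np.sum(p > lam)                       # W(lam) = #{p_i > lam}
    return min(1.0, w / ((1.0 - lam) * len(p)))

# Sketch on synthetic p values: 70% uniform nulls, 30% alternatives.
rng = np.random.default_rng(1)
p = np.concatenate([rng.uniform(size=700),
                    rng.beta(0.5, 8.0, size=300)])
print(round(storey_pi0(p, 0.5), 3))  # close to, and typically above, 0.7
```

The cap at 1 reflects that a proportion cannot exceed 1, which matters for nearly pure-null data where the raw ratio can overshoot.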

Convest Method.
Let f and h be defined as in (6). Langaas et al. [1] prove that if h(p) is twice differentiable, convex, and decreasing, then f can be expressed as

f(p) = ∫ f_θ(p) dμ(θ),

where f_θ is a known kernel density and μ is a probability measure on [0, 1]. Thus Langaas et al. [1] are able to characterize the nonparametric maximum likelihood estimate of f as

f̂(p) = ∫ f_θ(p) dμ̂(θ),

where μ̂ is the nonparametric maximum likelihood estimate of μ, computed from the order statistics p(1), ..., p(m) of the p values. Langaas et al. [1] then propose to estimate the proportion π0 by

π̂0^C = f̂(1).

Note that since the estimator π̂0^C is constructed via the density estimate f̂ at the upper boundary of the support, namely f̂(1), it can be conservative and overestimate π0 when the assumption h(1) = 0 is questionable or when h(p) → 0 slowly as p → 1. The overestimation can be even more severe when π0 is not large, that is, when 1 − π0 is not small. The remarks at the end of Section 2.2 apply to π̂0^C as well.

Average Estimate Approach.
The average estimate approach is motivated by Storey's method. Jiang and Doerge [8] observe (as many other authors have pointed out as well) that Storey's estimator has a large bias and small variance when λ is small, and a small bias and large variance when λ is large. Given this bias-variance trade-off at both extremes, Jiang and Doerge [8] propose to combine Storey's estimates over a set of values of λ varying from a small extreme to a large extreme.
Jiang and Doerge [8] develop a bootstrapping algorithm to pick the optimal set of λ values. It should be noted that, as an average of Storey's estimates, this estimator, denoted by π̂0^A, is expected to inherit conservativeness from Storey's method.
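The averaging step itself is simple to sketch. The grid of λ values below is an assumption (the bootstrap selection of the grid is omitted), and the function names are ours.

```python
import numpy as np

def storey_pi0(pvalues, lam):
    """Storey's estimator at a single lambda, capped at 1."""
    p = np.asarray(pvalues)
    return min(1.0, np.sum(p > lam) / ((1.0 - lam) * len(p)))

def average_pi0(pvalues, lams=None):
    """Average Storey's estimates over a grid of lambda values, trading
    off the small-lambda bias against the large-lambda variance."""
    if lams is None:
        lams = np.arange(0.05, 0.95, 0.05)    # assumed grid
    return float(np.mean([storey_pi0(pvalues, l) for l in lams]))

# On purely null (uniform) p values the average should sit near 1.
rng = np.random.default_rng(2)
p_null = rng.uniform(size=1000)
print(round(average_pi0(p_null), 3))
```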

New Method
We propose a finite mixture model of the uniform distribution and a multinomial distribution M(k; q1, ..., qk), with mixing proportion π0, to fit the p values. Denote this finite mixture distribution by UM(π0; k, q1, ..., qk). In this approach, the alternative distribution h defined in (6) is restricted to the multinomial distribution family M(k; q1, ..., qk). Alternatively, the multinomial distribution can be viewed as a parametric approximation to the unknown density h, similar in spirit to the empirical likelihood method (see Owen [9]); in this sense the procedure can be considered nonparametric.
To apply the approach, we need to settle two things first: (a) convert the continuous-type observations p1, ..., pm into discrete data with k categories and (b) select the integer k.

Selection of 𝑘.
It is often the case in applications that the p values are highly skewed (see Storey and Tibshirani [10], Zhao et al. [11], and Markitsis and Lai [12]). In this case, we recommend Doane's modification of Sturges' rule for the selection of k, to account for the skewness of the mixture distribution (6) (see Doane [13] and Sturges [14]):

k = 1 + log2(m) + log2(1 + |γ̂|/σ_γ̂),

where γ̂ is an estimate of the skewness coefficient and σ_γ̂ = sqrt(6(m − 2)/((m + 1)(m + 3))). In the case of symmetry, we recommend adapting Sturges' rule to determine k:

k = 1 + log2(m).

3.2. Transformation of the p_i's. To transform the p_i's, partition the unit interval into k subintervals of equal width 1/k. Define

x_ij = 1 if p_i ∈ ((j − 1)/k, j/k] and x_ij = 0 otherwise, (21)

for i = 1, ..., m and j = 1, ..., k. From Storey's Bayesian interpretation of the multiple testing problem, given that the alternative is true (which happens with probability 1 − π0), the p value follows the distribution h. In the same way, the transformed data can be interpreted as follows: given that the alternative is true, x_i = (x_i1, ..., x_ik)' is a multinomial random vector distributed as M(k; q1, ..., qk), for i = 1, ..., m, where Σ_{j=1}^k q_j = 1. Therefore x_1, ..., x_m are independently and identically distributed according to the finite mixture distribution UM(π0; k, q1, ..., qk), and the maximum likelihood estimate π̂0 of π0 results from the transformed data. Explicitly, π̂0 together with (q̂1, ..., q̂k) maximizes the log-likelihood function

l(π0, q1, ..., qk) = Σ_{j=1}^k x_{·j} log(π0/k + (1 − π0) q_j), (22)

where x_{·j} = Σ_{i=1}^m x_ij is the number of p values falling in the jth subinterval.

3.3. EM Algorithm. Maximizing the nonlinear log-likelihood function (22) directly can be complicated. The EM algorithm can be used to obtain an approximation to π̂0 easily. To do so, we introduce a latent Bernoulli variable z that indicates the component membership in the finite mixture distribution UM(π0; k, q1, ..., qk). That is, P(z = 1) = π0; given z = 1, x_1 is a discretized uniform vector with cell probabilities 1/k, and given z = 0, x_1 follows M(k; q1, ..., qk). Note that (z, x_1) has the distribution

P(z, x_1) = [π0 Π_{j=1}^k (1/k)^{x_1j}]^z [(1 − π0) Π_{j=1}^k q_j^{x_1j}]^{1−z}, (23)

for z = 0 or 1 and x_1j = 0 or 1, j = 1, ..., k. Let (z_i; x_i1, ..., x_ik), i = 1, ..., m, be a random sample of size m from model (23). In the present problem, the x_i's are the only data available for analysis; the z_i's are unobservable and so are treated as missing values. This defines a missing-value model. The log-likelihood with the complete data is given by

l_c = z log(π0/k) + (m − z) log(1 − π0) + Σ_{j=1}^k [Σ_{i=1}^m (1 − z_i) x_ij] log q_j,

where z = Σ_{i=1}^m z_i and x_{·j} = Σ_{i=1}^m x_ij, for j = 1, ..., k. We are now ready to describe the EM algorithm.
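The two preprocessing steps, choosing k and binning the p values, can be sketched as follows. This is our Python sketch, assuming Doane's formula in its usual form and the binning convention above; the function names are hypothetical.

```python
import math
import numpy as np

def choose_k(pvalues):
    """Number of bins via Doane's modification of Sturges' rule."""
    p = np.asarray(pvalues)
    m = len(p)
    g = float(((p - p.mean()) ** 3).mean() / p.std() ** 3)  # sample skewness
    sigma_g = math.sqrt(6.0 * (m - 2) / ((m + 1) * (m + 3)))
    return int(round(1 + math.log2(m) + math.log2(1 + abs(g) / sigma_g)))

def bin_counts(pvalues, k):
    """x_{.j}: number of p values in each of k equal-width subintervals."""
    idx = np.minimum((np.asarray(pvalues) * k).astype(int), k - 1)
    return np.bincount(idx, minlength=k)

def loglik(counts, pi0, q):
    """Log-likelihood of the uniform-multinomial mixture, as in (22)."""
    k = len(counts)
    cell = pi0 / k + (1.0 - pi0) * np.asarray(q)
    return float(np.sum(np.asarray(counts) * np.log(cell)))

rng = np.random.default_rng(4)
p = rng.uniform(size=500)
k = choose_k(p)
counts = bin_counts(p, 10)
print(k, counts.sum())
```

A useful sanity check: with q_j = 1/k every cell probability collapses to 1/k regardless of π0, so the log-likelihood equals m·log(1/k).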

E-Step.
Let π0^(t) and q^(t) = (q1^(t), ..., qk^(t)) be the current approximations to the maximum likelihood estimates under the model UM(π0; k, q1, ..., qk). For the next approximation, the E-step computes the conditional expectations of the missing indicators,

w_i^(t) = E(z_i | x_i) = (π0^(t)/k) / (π0^(t)/k + (1 − π0^(t)) q_{j(i)}^(t)),

where j(i) denotes the subinterval containing p_i.

M-Step.
Setting the partial derivatives of the expected complete-data log-likelihood to zero yields the updates. To sum up, let π0^(t), q1^(t), ..., qk^(t) be the tth approximations to the maximizers of the log-likelihood function defined in (22). Then the (t + 1)th approximation given by the EM algorithm is

π0^(t+1) = (1/m) Σ_{i=1}^m w_i^(t),

q_j^(t+1) = Σ_{i=1}^m (1 − w_i^(t)) x_ij / Σ_{i=1}^m (1 − w_i^(t)), for j = 1, ..., k.

The EM iteration process then proceeds as shown in Algorithm 1.
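The E- and M-steps above can be sketched in Python. This is a sketch under the stated model, not the article's implementation; the initialization and stopping details are our own choices, and since every p value in the same bin shares the same posterior weight, the updates are written per bin.

```python
import numpy as np

def em_pi0(pvalues, k, max_iter=1000, tol=1e-8):
    """EM for the uniform-multinomial mixture UM(pi0; k, q_1, ..., q_k).
    Returns the approximate MLE (pi0_hat, q_hat)."""
    p = np.asarray(pvalues)
    m = len(p)
    j = np.minimum((p * k).astype(int), k - 1)      # bin index of each p value
    counts = np.bincount(j, minlength=k).astype(float)
    pi0 = 0.5
    q = counts / counts.sum()        # non-uniform start avoids a degenerate fixed point
    prev = -np.inf
    for _ in range(max_iter):
        # E-step: posterior probability that a p value in bin j is null
        null_mass = pi0 / k
        alt_mass = (1.0 - pi0) * q
        w = null_mass / (null_mass + alt_mass)
        # M-step: update the mixing proportion and the multinomial cells
        pi0 = float(np.sum(counts * w) / m)
        rest = counts * (1.0 - w)
        if rest.sum() == 0.0:        # all mass assigned to the null component
            q = np.full(k, 1.0 / k)
            break
        q = rest / rest.sum()
        ll = float(np.sum(counts * np.log(pi0 / k + (1.0 - pi0) * q)))
        if ll - prev < tol:
            break
        prev = ll
    return pi0, q

rng = np.random.default_rng(3)
p = np.concatenate([rng.uniform(size=750), rng.beta(0.5, 8.0, size=250)])
pi0_hat, q_hat = em_pi0(p, k=10)
print(round(pi0_hat, 2))
```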
It is well known that each EM iteration increases the log-likelihood, but only at a linear convergence rate. If the components have similar densities, the convergence is extremely slow. The convergence will also be slow when the maximum likelihood solution requires some of the weight parameters to be zero, because the algorithm can never reach such a boundary point. An additional and related problem is deciding when to stop the algorithm. One risk for a naive user is the natural tendency to stop the algorithm when the changes in the parameters or in the likelihood become sufficiently small; taking small steps, however, does not indicate that we are close to the solution.
To combat this problem, Lindsay [15] exploited the regularity of the EM iteration to predict, via the device known as Aitken acceleration, the value of the log-likelihood at the maximum likelihood solution. Aitken's acceleration rule is recommended for predicting the maximum efficiently whenever one is using a linearly convergent algorithm with a slow rate of convergence. If we let l_{t−2}, l_{t−1}, and l_t be the log-likelihood values of three consecutive iterations, then the predicted final value is

l_∞^(t) = l_{t−1} + (l_t − l_{t−1}) / (1 − c_t),

where c_t = (l_t − l_{t−1}) / (l_{t−1} − l_{t−2}). Terminate the EM iteration when l_∞^(t) − l_t is sufficiently small.
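The prediction can be coded directly from the formula. The geometric toy sequence below is an assumption used only to check that the rule recovers the limit of an exactly linearly convergent sequence.

```python
def aitken_predict(l_prev2, l_prev1, l_cur):
    """Aitken-accelerated prediction of the limiting log-likelihood from
    three consecutive EM values l_{t-2}, l_{t-1}, l_t."""
    c = (l_cur - l_prev1) / (l_prev1 - l_prev2)
    return l_prev1 + (l_cur - l_prev1) / (1.0 - c)

# A linearly convergent toy sequence l_t = 10 - 0.5**t has limit 10;
# three consecutive values are enough to recover it exactly.
l = [10.0 - 0.5 ** t for t in range(3)]   # 9.0, 9.5, 9.75
print(aitken_predict(*l))                 # 10.0
```

For a geometric sequence the ratio c_t equals the convergence rate r, so the prediction is exact; for EM the rate is only approximately constant near the maximum, which is why the rule is applied iteratively until l_∞^(t) − l_t is small.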

Simulation Studies
In order to investigate the properties of the estimators described in Section 2 and compare their performance to that of the newly developed estimating procedure described in Section 3, we conducted simulation experiments. The generation of the simulated data and the calculation of the estimates were both done in the language R. Simulation studies were conducted with p values based on a one-sided z-test in the finite normal mixture model X ~ π0 N(0, 1) + (1 − π0) N(1, 1), with various values of m and π0, for performance comparisons. Monte Carlo data x_1, ..., x_m were simulated independently from X ~ π0 N(0, 1) + (1 − π0) N(1, 1), and each p value was computed as p_i = 1 − Φ(x_i), where Φ is the cumulative distribution function of the standard normal N(0, 1). Three true values of π0 were considered, namely, 0.25, 0.5, and 0.75, with sample sizes m = 200, 500, and 1,000. In each case, 1,000 Monte Carlo trials were performed. In the implementation of the EM algorithm to compute the proposed new estimate π̂0, the termination rule is l_∞^(t) − l_t ≤ 10^{-4}. The EM algorithm converged for all the simulated datasets. In each case, the performance of the proposed new estimate π̂0 was compared with those of several existing procedures on the same sets of Monte Carlo data. Specifically, the simulations considered the existing methods reviewed in Section 2: Storey's estimator, the convest estimator, and the average estimate approach.
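The data-generating step can be sketched as follows (in Python rather than the article's code; the seed and sample sizes are arbitrary choices for illustration).

```python
import math
import numpy as np

def simulate_pvalues(m, pi0, seed=0):
    """Draw X from pi0*N(0,1) + (1-pi0)*N(1,1) and convert each draw to a
    one-sided p value p_i = 1 - Phi(x_i)."""
    rng = np.random.default_rng(seed)
    null = rng.random(m) < pi0                       # component membership
    x = rng.normal(loc=np.where(null, 0.0, 1.0), scale=1.0)
    phi = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])
    return 1.0 - phi, null

p, null = simulate_pvalues(1000, 0.75, seed=42)
# Null p values are Unif(0,1), so their mean sits near 0.5, while the
# alternative p values concentrate near 0.
print(round(float(p[null].mean()), 2), round(float(p[~null].mean()), 2))
```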

Application to Real Life Microarray Data
To further evaluate the performance of the new method in comparison with the three existing methods, consider the real-life data from the DNA microarray experiments reported in Golub et al. [5], which can be downloaded from the R package multtest. The dataset has been used by many authors (see [11, 16]). For each of the 3,051 genes, a two-sample Welch t-statistic is computed and its two-sided p value is obtained from a central Student t distribution with 36 degrees of freedom. A histogram of these p values is displayed in Figure 1. It is clear that the p values are highly skewed to the right. The four estimates of the proportion π0 of nondifferentially expressed genes among the 3,051 genes, π̂0, π̂0^S, π̂0^C, and π̂0^A, are given in Table 3. The four estimates are markedly different, varying from as low as 0.4150 to as high as 0.4913. Noting that many authors have commented that the estimates π̂0^S, π̂0^C, and π̂0^A are usually conservative, we conclude that the lower estimate of 0.42 by the proposed new estimating procedure appears as expected and might be closer to the true value of π0.

Table 3: The estimates of the proportion of true null hypotheses with the microarray data of Golub et al. [5].
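The per-gene test here is the standard Welch two-sample t-test. A Python sketch on a hypothetical 3,051 × 38 expression matrix (random numbers here, with the 27/11 group split of the leukemia data) illustrates how the vector of p values is produced; with real data the matrix would come from the multtest package instead.

```python
import numpy as np
from scipy import stats

# Hypothetical genes-by-samples matrix standing in for the leukemia data:
# 3,051 genes, 38 samples, first 27 in group 1 (ALL), last 11 in group 2 (AML).
rng = np.random.default_rng(7)
expr = rng.normal(size=(3051, 38))
group1, group2 = expr[:, :27], expr[:, 27:]

# Two-sided Welch t-test per gene (unequal variances).
tstat, pvals = stats.ttest_ind(group1, group2, axis=1, equal_var=False)
print(pvals.shape)
```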

Figure 1 :
Figure 1: Histogram of the p values for the leukemia data reported in Golub et al. [5].

Table 1 :
Outcomes from testing m null hypotheses.

Table 2 :
Monte Carlo empirical estimates with standard deviations (in parentheses) by the new approach along with the existing estimators. The Monte Carlo size is 1,000 in each case. Here π̂0 denotes the new estimator, π̂0^S Storey's estimator, π̂0(0.5) Storey's estimator with λ = 0.5, π̂0^C the convest estimator, and π̂0^A the average estimate approach; π0 is the true proportion of true null hypotheses.

The dataset consists of 38 bone marrow samples, 27 acute lymphoblastic leukemia (ALL) samples and 11 acute myeloid leukemia (AML) samples, obtained from acute leukemia patients at the time of diagnosis, before chemotherapy. RNA prepared from bone marrow mononuclear cells was hybridized to high-density oligonucleotide microarrays, produced by Affymetrix and containing probes for 7,129 human genes. After preprocessing, only 3,051 genes were accurately readable, resulting in a 3,051 × 38 expression matrix. For the comparison of gene expression between the two groups, let μ1 and μ2 be the true mean intensities for each gene in groups 1 and 2, respectively; to determine whether a gene is differentially expressed, we test H0: μ1 = μ2 versus H1: μ1 ≠ μ2.