An Empirical Bayesian Method for Detecting Differentially Expressed Genes Using EST Data

Detection of differentially expressed genes from expressed sequence tag (EST) data has received much attention. An empirical Bayesian method is introduced in which gene expression patterns are estimated and used to define detection statistics. Significantly differentially expressed genes are then declared on the basis of these detection statistics. Simulation is done to evaluate the performance of the proposed method. Two real applications are studied.


INTRODUCTION
It is important to detect differentially expressed genes, for example, when exploring the key genes related to certain diseases. As EST sequencing technology develops, a large number of EST databases from a variety of tissues have become available. Enormous EST collections provide opportunities to quantify gene expression levels [1]. Efficient statistical methods are in great demand.
Several methods have been proposed to detect significantly differentially expressed (SDE) genes from EST data [2]. Fisher's exact test was used by the Cancer Genome Anatomy Project [3]. Audic and Claverie [4] developed a Bayesian method. The GT statistic [5] and the R statistic [6] were proposed for multilibrary comparison. In each method, gene-specific detection statistics quantify differences in gene expression levels, and SDE genes are declared by their rankings.
An empirical Bayesian method is proposed to detect SDE genes. The relative gene expression abundances are estimated in each library, and a new detection statistic is derived for each gene. In Section 2, simulation experiments suggest that the proposed method outperforms the existing methods. Real applications are also studied in Section 2. Statistical methods are described in Section 3. The possibility of extending the method to multiple libraries is indicated in Section 4.
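As an illustration of the kind of gene-specific statistic mentioned above, the following is a minimal sketch of a two-sided Fisher's exact test for one gene in two libraries. The function name and the 2×2 layout (ESTs from the gene versus all other ESTs, library 1 versus library 2) are assumptions made for this example; the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is one common convention and not necessarily the one used in [3].

```python
from math import comb


def fisher_exact_two_sided(x1, n1, x2, n2):
    """Two-sided Fisher's exact p-value for one gene (illustrative sketch).

    x1, x2: numbers of ESTs from the gene in libraries 1 and 2.
    n1, n2: total numbers of ESTs sampled from libraries 1 and 2.
    """
    total = n1 + n2          # all ESTs in both samples
    k_total = x1 + x2        # all ESTs from this gene
    denom = comb(total, n1)

    # Hypergeometric probability that k of the gene's ESTs fall in library 1.
    def prob(k):
        return comb(k_total, k) * comb(total - k_total, n1 - k) / denom

    lo = max(0, n1 - (total - k_total))
    hi = min(k_total, n1)
    p_obs = prob(x1)
    # Sum over tables at least as extreme as the observed one; the small
    # relative tolerance guards against floating-point ties.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)
```

Genes would then be ranked by increasing p-value, with the smallest p-values indicating the strongest evidence of differential expression.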

RESULTS
Let $(\pi_{11}, \pi_{12}, \ldots, \pi_{1c})$ and $(\pi_{21}, \pi_{22}, \ldots, \pi_{2c})$ be the gene expression patterns in two libraries, where $\pi_{ji}$ is the relative abundance of gene $i$ in library $j$. The absolute difference between relative abundances is $D_i = |\pi_{1i} - \pi_{2i}|$. Given a sample of ESTs from library $j$, an empirical Bayes estimator $\hat{\pi}_{ji}$ for $\pi_{ji}$ is defined in Section 3. If gene $i$ is seen in both samples, define $\hat{D}_i = |\hat{\pi}_{1i} - \hat{\pi}_{2i}|$. If gene $i$ is seen in only one sample, for example, sample 2, define $\hat{D}_i = |\hat{\pi}_{1i} - \hat{\pi}_{2i}|$ if $\hat{\pi}_{1i} < \hat{\pi}_{2i}$ and $\hat{D}_i = 0$ otherwise, which is conservative in the sense that $\hat{D}_i$ possibly underestimates $D_i$. Gene $i$ is declared to be SDE if $\hat{D}_i$ is relatively large.
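The rule for the detection statistic can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the estimated relative abundances are taken as given inputs (their empirical Bayes construction is described in Section 3 and is not reproduced here), the case of a gene seen only in sample 1 is handled by symmetry with the sample 2 case, and genes seen in neither sample are assigned zero as an assumption.

```python
def detection_statistics(pi1_hat, pi2_hat, x1, x2):
    """Estimated absolute differences in relative abundance, one per gene.

    pi1_hat, pi2_hat: estimated relative abundances in libraries 1 and 2.
    x1, x2: observed EST counts per gene in samples 1 and 2.
    """
    stats = []
    for p1, p2, c1, c2 in zip(pi1_hat, pi2_hat, x1, x2):
        diff = abs(p1 - p2)
        if c1 > 0 and c2 > 0:
            d = diff                        # gene seen in both samples
        elif c2 > 0:
            d = diff if p1 < p2 else 0.0    # seen only in sample 2
        elif c1 > 0:
            d = diff if p2 < p1 else 0.0    # seen only in sample 1 (by symmetry)
        else:
            d = 0.0                         # unseen in both samples (assumption)
        stats.append(d)
    return stats


def rank_genes(stats):
    """Gene indices ordered from largest to smallest detection statistic."""
    return sorted(range(len(stats)), key=lambda i: stats[i], reverse=True)
```

Taking the first 20 indices returned by `rank_genes` corresponds to reporting the first 20 SDE genes.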

Real applications
One example concerns the Chinese spring wheat drought stressed leaf cDNA library (7235) and root cDNA library (#ASP), available at the TIGR gene indexes database (downloaded from http://www.tigr.org/tdb/tgi, 01/06/2006). The two EST samples contain 790 and 1306 sequenced ESTs, respectively. After removing the 103 and 194 unannotated ESTs, the annotated ESTs are clustered into 465 and 804 groups, with each group associated with a unique gene. Only the well-annotated ESTs are used. The first 20 SDE genes by the proposed method are listed in Table 1, among which 7, 7, 7, and 7 genes are in the set of the first 20 SDE genes by Fisher's exact test, the χ² test, the AC statistic, and the R statistic, respectively.
Another example compares Pinus gene expression levels between the root gravitropism April 2003 test library (#FH3) and the root control 2 (late) library (#FH4), also from TIGR, in which 2513 and 1132 ESTs associated with 1211 and 605 genes are well annotated and clustered. Table 2 lists the first 20 SDE genes by the proposed method, among which 4, 4, 5, and 3 genes are in the set of the first 20 SDE genes by Fisher's exact test, the χ² test, the AC statistic, and the R statistic, respectively.

METHODS
Suppose that there are $c$ genes in a library. Let $x_i$ be the number of ESTs from gene $i$, a Poisson variable with mean $\lambda_i$. Given a prior distribution $G$ on the $\lambda_i$, the posterior mean […] Let $\theta(Q) = h_G(0)/\{1 - h_G(0)\}$ be the odds that a gene is un- […] Let $n_x$ denote the number of genes with exactly $x$ ESTs in the sample. The nonparametric maximum likelihood estimator $\hat{Q}$ for $Q$ is […] whose calculation is discussed in [7]. It is difficult to estimate $\theta(Q)$ well [8]. There are lower bound estimators, for example, $\hat{\theta} = n_1(n_1 - 1)/\{2n(n_2 + 1)\}$ [9], where $n = \sum_{x \ge 1} n_x$ is the number of observed expressed genes. An empirical Bayes estimator for $\lambda_i$ is […] As the relative abundance $\pi_i$ satisfies $\pi_i = \lambda_i / \sum_{k=1}^{c} \lambda_k$, let $\hat{\pi}_i = \hat{\lambda}_i / s$, where […]
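The lower bound estimator of the odds quoted above from [9] depends only on the frequency counts, so it is easy to compute from a vector of per-gene EST counts. The sketch below is illustrative; the function name is hypothetical, and it covers only this one estimator, not the nonparametric maximum likelihood step of [7].

```python
from collections import Counter


def lower_bound_odds(counts):
    """Lower bound estimate n1 (n1 - 1) / {2 n (n2 + 1)} of the odds.

    counts: EST counts per gene; zero counts are ignored, since only
    observed genes contribute to the frequency counts n_x.
    """
    observed = [x for x in counts if x >= 1]
    if not observed:
        raise ValueError("at least one gene must be observed")
    freq = Counter(observed)     # freq[x] = number of genes with exactly x ESTs
    n = len(observed)            # number of observed expressed genes
    n1, n2 = freq[1], freq[2]    # singleton and doubleton genes
    return n1 * (n1 - 1) / (2 * n * (n2 + 1))
```

For example, with three genes seen once, one gene seen twice, and one gene seen three times, the estimate is 3·2/(2·5·2) = 0.3.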

DISCUSSION
A new statistical method is proposed to compare the gene expression patterns in two cDNA libraries. It can be extended to multilibrary comparison, for example, considering all pairwise comparisons among multiple libraries [3].