A Bayesian Approach for Decision Making on the Identification of Genes with Different Expression Levels: An Application to Escherichia coli Bacterium Data

A common interest in gene expression data analysis is to identify from a large pool of candidate genes the genes that present significant changes in expression levels between a treatment and a control biological condition. Usually, it is done using a statistic value and a cutoff value that are used to separate the genes differentially and nondifferentially expressed. In this paper, we propose a Bayesian approach to identify genes differentially expressed calculating sequentially credibility intervals from predictive densities which are constructed using the sampled mean treatment effect from all genes in study excluding the treatment effect of genes previously identified with statistical evidence for difference. We compare our Bayesian approach with the standard ones based on the use of the t-test and modified t-tests via a simulation study, using small sample sizes which are common in gene expression data analysis. Results obtained report evidence that the proposed approach performs better than standard ones, especially for cases with mean differences and increases in treatment variance in relation to control variance. We also apply the methodologies to a well-known publicly available data set on Escherichia coli bacterium.


Introduction
The DNA array technology is capable of providing gene expression levels measurements for thousands of genes simultaneously under different biological experimental conditions. In these experiments, total RNA is reversetranscribed to create either radioactive or fluorescent-labeled cDNA which is hybridized with a large DNA library of gene fragments attached to a glass or membrane support [1]. After this, a scanner of high resolution is used to obtain the color intensity from each spot. So, the color intensities are normalized in order to obtain the expression level of genes. For further discussion and additional references on DNA array technology see [2][3][4][5][6][7][8][9].
Obtaining the expression levels, a common objective is to identify genes that present significant changes in gene expression levels between treatment and control experimental condition. The identification of these genes is important because it may bring to light new biological discoveries, such as which genes may be involved in the origin and/or evolution of the same disease of genetic origin or which genes react to a drug stimulus. Thus, we aim to establish the use of these experiments as tools in medicine [10].
As the observed expression levels incorporate different sources of variability present in the process of obtaining fluorescent intensity measurements [11], statistical methods are important to identify the genes differentially expressed. One of the first approaches proposed to identify genes differentially expressed was the fold-change approach [2,3]. In this approach, a gene is considered differentially expressed if the average of the logarithm of the observed expression levels in treatment and control varies more than a cutoff point, R c , which is previously prefixed. This approach Table 1: True positive rate, n c = n t = 4, p = 5% (3% over and 2% under).     however is not adequate to yield good results, once a cutoff value R c may have different significance for different observed expression levels. Besides, this approach does not consider the variability of the observed expression levels from treatment and control. Another method commonly used for gene expression data analysis is the so-called two-sample t-test (TT) for the log transformed data [1,8]. The problem with the application of TT to this kind of data is the usual small size of treatment and control samples in genetic studies, which may lead to underestimated variances and small power of test. To avoid such limitations, some TT modifications were proposed, such as the Cyber-t (CT) proposed by [1] and the Bayesian t-test (BTT) proposed by [12]. Basically, the main idea is to consider modifications of the standard error estimate of the two-sample difference present in the denominator of the standard t-statistics.
In this paper, we propose a Bayesian approach to identify genes differentially expressed by calculating sequentially credibility intervals from predictive densities which are constructed using all treatment effects excluding the treatment effect of genes previously identified with statistical evidence for difference. This procedure avoids the small sample size problem, usual in gene expression data analysis, and allows us to use the normality assumption for observed data [9].
In order to verify the performance of the proposed approach and compare it with the conventional ones based on the use of the t-test and modified t-tests, we present a    simulation study. The comparison is done in terms of the true positive rate, false positive rate, and true discovery rate.
Results obtained report evidence that our proposed approach performs better than t-test and modified t-tests, especially for cases with mean differences and increases in treatment variance in relation to control variance. We also apply the methods to a real dataset, obtained from the experiment carried through with Escherichia coli bacterium, described in [5]. The paper is organized as follows. In Section 2, we develop our Bayesian approach constructing the predictive density and describing our criteria to identify the genes differentially expressed. In Section 3, the method is compared with the t-test and modified t-tests using simulated datasets and a real dataset. In Section 4, we conclude the paper with final remarks on the proposed method.

Predictive Modeling for Gene Expression
Consider a DNA array experiment with n genes and two experimental conditions which we name by control (c) and treatment (t). Suppose that control and treatment are replicated n c and n t times, respectively. Denote by x igh the ith observed expression level (or its logarithm) for gene g in experimental condition h, h ∈ {c, t}, and g = 1, . . . , n. Let x gh = {x 1gh , . . . , x nhgh } be realizations of independent random variables X gh = {X 1gh , . . . , X nhgh }, for g = 1, . . . , n and h ∈ {c, t}.
Consider that     is the sampled mean treatment effect for gene g, g = 1, . . . , n, and Y = {Y 1 , . . . , Y n } is the set of all sampled mean treatment effects. Thus, considering Y, we can determine the predictive density for a new observation Y n+1 , given Y, and build a 100(1 − α)% credibility interval for Y n+1 . In order to develop our idea and as often found in gene expression data analysis [1,8,9,[11][12][13], we assume that Y is an independent sample generated from a normal distribution with mean μ and variance σ 2 , The likelihood function is given by where y = (1/n) n g=1 y g and s 2 = (1/(n − 1)) n g=1 (y g − y) 2 are the sample mean and variance of y, respectively.
Since parameters μ and σ 2 have a direct interpretation in the context of the gene expression data analysis, so we may express expert opinions in terms of prior distributions for parameters. In order to explore the fully conjugation, consider that joint prior distribution for parameters μ and σ 2 is given by where μ 0 , λ, τ, and β are known hyperparameters, and IG(·) represent the inverse gamma distribution with mean (β/2)/(τ/2) − 1.

Predictive Approach Criterion. Let y ord = {y
In order to identify the genes differentially expressed, we fix E(μ | y ord (−g) ) = μ * = 0 and set up I threshold   , then gene g does not presents statistical evidence for differences; (iii) if |y (g) | > I threshold (g) , then gene g present statistical evidence for differences. Do y dif = y dif ∪ y (g) .

Data Analysis
In this section, we illustrate the predictive approach (PA) applied to artificial and real datasets. The real data set was extracted from the site (www.jbc.org) and refers to an experiment realized with Escherichia Coli bacterium using nylon membranes, described by [5].
Moreover, we compare the PA results with the results obtained by considering three well-known methods to identify differentially expressed genes: the two-sample t-test (TT) and Cyber-t test (CT) proposed by [1] and with the Bayesian t-test (BTT) proposed by [12].
In the TT, the hypothesis test is based on the statistics which follows a Student's t-distribution with df = [s 2 gc /n c + s 2 gt /n t ] 2 /[(s 2 gc /n c ) 2 /(n c − 1) + (s 2 gt /n t ) 2 /(n t − 1)], degrees of freedom, where x g h and s 2 gh are the sample mean and variance for gene g in experimental condition h = {c, t}. Fixing a significance level α, if |t g | is greater than a threshold t 1−α/2,df (quantile 1−α/2 of Student's t distribution with df degrees of freedom), then the test conclude for difference of expression levels.
[1] proposed a two-sample t-test replacing the denominator of (1) by a pooled variance estimated via a Bayesian approach. So, the authors implement the Cyber-t software using the statistics t g = x gt − x gc σ 2 gt /n gt + σ 2 gc /n gc (10) and the degrees of freedom df = ν 0 + n gc + n gt − 2, where σ 2 gh = ν 0 σ 2 0 + (n h − 1)s 2 gh /(ν 0 + n h − 2), for h ∈ {c, t}, where ν 0 and σ 2 0 are hyperparameters. The authors assume that k > 2 points are needed to properly estimate the standard deviation and keep n g + ν 0 = k, where n g = n gc + n gt . They suggest  to fix k = 10 and so ν 0 = 10 − n g . To fix a value for σ 2 0 , the authors say "one could use the standard deviation of the entire set of observations or, depending on the situation, of particularly categories of genes." Using only the information from observations of the gene g, we fix σ 2 0 = ((n g − 1)/n g )s 2 g , as suggested by [1], where s 2 g is the sample variance of the set Based on [1,12], develop a Bayesian approach and show that where Δx = x gt − x gc , ν n σ 2 n = ν 0 σ 2 0 + (n gc − 1)s 2 gc + (n gt − 1)s 2 gt , ν n = ν 0 + n gc + n gt − 2, and t νn represent the Student's t distribution with ν n degrees of freedom, for g = 1, . . . , n. As suggested by authors, we fix ν 0 = n g and σ 2 0 = s 2 g .

Artificial Data.
To generate the artificial data sets, we consider that observations from control group are generated from a normal distribution with mean μ c and variance σ 2 c , X igc ∼ N (μ c , σ 2 c ), for i = 1, . . . , n c and g = 1, . . . , n. We fix μ gc = −14 and σ 2 gc = 0.8. These values are the average of the observed mean and variance of the expression levels (log transformed) from control group of the Escherichia coli bacterium dataset. We fix n = 1.000, and the sample sizes n c and n t were fixed at 4 and 8.
To record the cases identified with difference by PA, we consider an indicator variable I PA g = 1 for cases, so that |y (g) | > I threshold (g) . Otherwise, I PA g = 0. Analogously, for TT, CT, and BTT, we consider I method g = 1 (method = {TT, CT, BTT}) for cases with Pvalue g < 0.05 and I method g = 0, otherwise.
In order to compare the performance of methods, we calculate the true positive rate given by where method = {PA, TT, CT, BTT}. We calculate the true positive rate for S = 100 different datasets generated according to steps (i) to (iii) described above, and we present the results using the mean of the true positive rate, that is given by P method = S s=1 P (s) method /S, where P (s) method is the true positive rate calculated for sth generated dataset for method = {PA, TT, CT, BTT}.
Tables 1, 2, and 3 present the P method value for n c = n t = 4 and Tables 4, 5, and 6 present the P method value for n c = n t = 8, for method = {PA, TT, CT, BTT}.
As we move from the left to the right side of the tables, in each line, we have the distances between control and treatment means, which are increasing. As we move from top to down in columns of the tables, we have the distance between the treatment and control variances, which are increasing.
Increasing the sample sizes from 4 to 8, all methods increase its performance.
For n c = 4 and all values of δ and γ used, the PA present better performance than TT, CT, and BTT. Moving away the treatment distribution from the control distribution (increasing δ and γ), the true positive rate obtained by PA is greater than t-tests.
For n c = 8 and γ = 1 fixed the TT, CT, and BTT present better performance than PA for the same values of δ used. For example, for p = 5 and δ = {1.5, 1.75, 2} the t tests present greater true positive rate than PA. But increasing the value of γ, γ = {2, 3}, the PA presents greater true positive rate.
Besides, note that TT, CT, and BTT present similar results with a slight advantage for BTT, that is, greater true positive rate than TT and CT. Also note that true positive rate obtained by CT is smaller than TT for all cases simulated. It happens because we use only the information from observations from gene g to fix the hyperparameter σ 2 0 . In order to obtain better results, [1,12] suggest to fix σ 0 as the standard deviation estimated by pooling together all the neighboring genes contained in a window of size w. But, the authors do not discuss how to define a good value w to lead to satisfactory results.
We also compare the performance of the methods using the mean of the false positive rate and the mean of the true discovery rate. The mean of the false positive rates is presented in Tables 7 to 9 and in Tables 10 to 12. All methods present a small false positive rate.
The mean of the true discovery rates is presented in Tables 13 to 15 and in Tables 16 to 18. The PA presents greater true discovery rate than t-tests for all values of δ and γ used. Besides, note that increasing the value of δ and γ, the true discovery rate increases in both directions for PA. But the same does not happen with the t-tests, in which the proportion of identification increases only as the value of δ increases, that is, when the mean of the treatment distribution moves away from the mean of the control distribution. Increasing the variance of treatment (increasing the value of γ), the t-tests present a reduction of its performance, in opposite to PA which presents an improvement in its performance.
Results show a better performance of the PA in relation to TT, CT, and BTT, specially, when the difference refers to variance of the variable involved. From the biological practical point of view, it shows us that PA may identify with differences genes which are not identified by TT, CT, and BTT, specially, genes with differences in means and variances.

Escherichia coli Data.
In this section consider the gene expression data set on Escherichia coli bacterium, composed by n = 4290 genes [5]. Figure 1 shows the treatment and control observed means and variances for all genes of this dataset.
Results for PA are presented in Figure 2. Results for TT, CT, and BTT are presented in Figure 3. These figures show the observed treatment and control means and variances of genes identified with evidence for difference by considering PA, TT, CT, and BTT, respectively. The PA identifies 340 genes with evidences for difference, while TT identifies 222, CT 219, and BTT 288 genes.
Note that genes with means well apart are better identified by PA than by the other methods. Moreover, genes with mean and variances well apart are identified by PA and not identified by TT, CT, and BTT, as can be noted in Figure 2. Examples are genes 2766 (b1326(f262)) and 3254 (dbpA) that are highlighted in Figures 2(a) and 2(b). Genes with means well apart and similar variances are however identified by TT, CT, and BTT. An example is the gene 10 (hdeB) that is highlighted in Figures 2(a) and 2(b). One possible reason for this is the low performance of TT, CT and BTT in situations with differences in means and variances, as observed in the artificial data sets. Besides, note that PA is capable to identify differentially expressed genes which are not identified by TT, CT, and BTT, specially, genes with differences in means and variances.

Discussion
Identifying genes with difference, in what concerns gene expression, may help biologists to study and understand some function of genes and infer possible relationships among genes and proteins. In this paper we propose a Bayesian approach to identify differentially expressed genes based on predictive density.
In order to verify the performance of the PA approach and compare it with TT, CT, and BTT, we considered artificial and real datasets. Results show a better performance of PA in relation to the t-tests in identifying difference, mainly, in presence of different variances. The main advantage of the proposed method is that it is easy to use like a usual twosample t-test but presents better performance in situations with small sample size.
The biological interest in this fact is that PA may bring to light genes that are not identified when we use only the TT or the modified t-test ones. Moreover, the PA can be easily implemented in usual softwares such as the software R (the Comprehensive R Archive Network, http://cran.rproject.org). The source code used for the data analysis was implemented in software R and can be obtained by emailing the authors.
According to [1], gene expression data can be analyzed in at least three levels of increasing complexity. In the first level, each gene is analyzed separately, where the objective is to verify whether the observed expression in treatment experimental condition is significantly different from observed expression in control experimental condition. In the second level, clusters of genes are analyzed in terms of common functionalities and interactions. In the third level, the objective is to infer and understand the relationship among genes. As it should be clear by now, in this paper, we focus on the first level of analysis. However, as future work we intent to present the adjustment of the proposed method to control the false discovery rate for multiple testing hypotheses, when thousands of hypotheses are realized simultaneously, as proposed by [14], as well as to make a systematic comparison with his methodology. Besides, we also will development a multivariate approach in order to consider dependence among genes.