A Case Study on Choosing Normalization Methods and Test Statistics for Two-Channel Microarray Data

DNA microarray analysis is a biological technology which permits the whole genome to be monitored simultaneously on a single slide. Microarray technology not only opens an exciting research area for biologists, but also provides significant new challenges to statisticians. Two very common questions in the analysis of microarray data are, first, should we normalize arrays to remove potential systematic biases, and if so, what normalization method should we use? Second, how should we then implement tests of statistical significance? Straightforward and uniform answers to these questions remain elusive. In this paper, we use a real data example to illustrate a practical approach to addressing these questions. Our data is taken from a DNA–protein binding microarray experiment aimed at furthering our understanding of transcription regulation mechanisms, one of the most important issues in biology. For the purpose of preprocessing data, we suggest looking at descriptive plots first to decide whether we need preliminary normalization and, if so, how this should be accomplished. For subsequent comparative inference, we recommend use of an empirical Bayes method (the B statistic), since it performs much better than traditional methods, such as the sample mean (M statistic) and Student's t statistic, and it is also relatively easy to compute and explain compared to the others. The false discovery rate (FDR) is used to evaluate the different methods, and our comparative results lend support to our above suggestions.


Biological background
Gene expression is a process of 'the full use of the information in a gene via transcription and translation, leading to production of a protein and hence the appearance of the phenotype determined by that gene' (Lackie and Dow, 1999). The gene expression process determines the intracellular concentration of proteins, which play an important role in many biological systems. On the other hand, the gene expression procedure is controlled by certain proteins (regulators) in an organized way. As a result, the knowledge of gene expression and transcription regulation are two key questions in biology. Answers to these questions will facilitate basic biology and medical research, leading to applications in clinical diagnosis, disease classification and finding new treatments for diseases.
DNA microarray experimentation is a biological technology which permits the whole genome to be monitored on a single slide, so that a better picture of the interaction among thousands of genes can be observed simultaneously (Brazma et al., 2000). As a result of microarray applications, the research focus of biologists has shifted from individual genes to multiple genes, and their interaction and cooperation in a complicated way to maintain life. One important microarray application is to compare the expression levels of genes in samples drawn from different tissues or conditions through a transcriptional profiling experiment (Tani et al., 2002;Spellman et al., 1998). Another ingenious application of microarray technology is to find the DNA-binding sites for proteins across the entire genome through a chromatin immunoprecipitation (CHIP) experiment (Ren et al., 2000;Iyer et al., 2001), where the interest is to detect the target DNA sequences that are bound by specific proteins.

Statistical issues
Microarray technology not only opens an exciting research area for biologists, it also provides significant challenges to statisticians. A common and difficult question before starting any data analysis is whether we need to preprocess the data, and if so how. Generally, there are some systematic biases and variations in microarray experiments, such as label or dye effects and slide or spatial effects, which may affect the measurements of gene expression levels, and thus the conclusion of an experiment Quackenbush, 2002). In order to remove these biases and make the multiple arrays comparable, various normalization methods have been proposed. For a chromatin immunoprecipitation experiment using genomic DNA as control samples, intensities of the control genomic DNA samples should be around a constant, while for a transcriptional profiling experiment, the intensities in the reference channel will have gene-specific means. This different property may require different normalization methods. This paper briefly describes several normalization methods and aims to provide a practical approach to decision making in this regard. We will illustrate our approach for both chromatin immunoprecipitation and transcriptional profiling experiments.
Another challenging question is how to properly carry out statistical data analysis. There are several overriding statistical issues here. One has to do with the 'large p, small n paradigm' (West, 2000).
Thousands of genes will be tested at the same time, but generally we only have a few replications of each. Empirical Bayes (EB) methods provide a natural approach to addressing this problem because they can effectively borrow information across genes (Efron and Morris, 1973;Carlin and Louis, 2000), and so EB methods for microarray have recently been implemented by a variety of researchers (Newton et al., 2001(Newton et al., , 2004Lonnstedt and Speed, 2002;Kendziorski et al., 2003;Lin et al., 2003). A relevant statistical issue is how to determine the statistical significance for testing a null hypothesis. In many situations, it is not straightforward to obtain the null distribution of a chosen test statistic, so it is not easy to determine the cut-off for rejection region or p value for the null hypothesis. However, because microarray experiments are largely exploratory in nature, investigators generally care little about precise p values, and are willing to accept some false positives among the identified genes. Thus, the best method is one that can lower both type I and type II errors over a range of cut-offs (Lonnstedt and Speed, 2002). This paper considers four traditional or new test statistics used for microarray data analysis, especially focusing on one EB method, and compares the methods based on plots and their false discovery rates.
This paper is organized as follows. 'Data and Methods' contains a brief description of the dataset used in the paper, as well as the various normalization methods, test statistics, evaluation methods and their implementation. The 'Results' section then illustrates in the context of our microarray experiment; in particular, firstly, we show the effects of normalization methods on the location and scale of the intensity measurements. In the following subsection plots are used to illustrate the utility of an empirical Bayes method and how it differs from other methods. After that we compare the results of using different normalization methods and statistics by looking at their overlap rates of identified genes and their false discovery rates. In the last subsection, we apply our proposed approach to a transcriptional profiling experiment. Finally, there is a discussion of our findings and open questions for future investigation.

Data
Data from a study of transcription regulation mechanisms is used in this project. The overall goal of 434 Y. Xie et al. the study is to find the DNA binding sites in vivo of a broad transcription regulator, leucine responsive regulatory protein (Lrp) (Tani et al., 2002), through analysing the genome-wide distribution of Lrp. Detailed descriptions of the method of genomewide location and function of DNA binding proteins can be found elsewhere (Ren et al., 2000;Iyer et al., 2001).
Briefly, to identify the target binding sites of Lrp, the combination of chromatin immunoprecipitation and microarray hybridization was used. DNA samples from wild-type Escherichia coli were labelled with red (Cy5) fluorophore following cross-linking, immunoprecipitation and amplification; we call these samples 'test samples'. Genomic DNA samples were also prepared and labelled with green (Cy3) fluorophore; we call these samples as 'control samples'. We identify the genomic target loci by comparative hybridization of test and control samples to a DNA microarray. The ratio of Cy5 to Cy3 fluorescence intensities measured at each spot in the microarray provided a measure of the extent of binding of Lrp to the corresponding genomic locus. If there is no binding of Lrp, this ratio should be a constant across all of the genes. So the specific purpose of this microarray analysis is to identify the spots or genes with intensity ratios between the test samples and the control samples that are different from this constant. Each DNA microarray includes 4221 ORFs of E. coli. After dealing with missing data and background correction, 4011 genes with five replicated arrays are used to do the analysis. In this paper, we define M ij as the log ratio of the background-corrected intensity levels in test and control samples for gene i on array j : We also define A ij as the log of geometric mean of the two channel intensities; that is, A ij = (log 2 R ij + log 2 G ij )/2. We denote the sample mean of A ij by A i and will use generic notation M and A for M i and A i .

Normalization methods
In microarray experiments, the purpose of normalization is to remove the systematic variation, such as the differences in labelling efficiency between two fluorescent dyes used. There are various sources of biases, including experimental variability in the processing procedures and the scanner settings at the data collection step. Some of these factors lead to biases that depend on the spot's intensity or its location on the array, often referred to as spatial or 'print-tip' effects. Locally weighted smoother (lowess) normalization and print-tip group lowess normalization methods were proposed to correct these kinds of biases . The lowess normalization method is a within-slide location normalization method. It assumes that the (dye) bias depends on the spot intensity, so it adjusts the log-ratios M ij by an intensity-dependent mean curve c(A), the lowess fit in a scatterplot of the log-ratio M vs. overall spot intensity A. The print-tip group lowess normalization method assumes that there are systematic differences between the print-tips, so it adjusts the log-ratios by both the intensity and print-tip effects c k (A), the lowess fit in an M vs. A plot for the k th grid only. Both of these methods are used after a summary measurement of gene intensity level [typically log(R/G)] is obtained for two-channel arrays. Normalization can be applied for the purpose of constructing an expression value using physical and biological properties, as well as for standardizing expression value for withinand between-sample variability. The purpose of our normalization is the latter. Irizarry et al. (2003) address some of the normalization issues in a coherent way.
The preceding discussion notwithstanding, the questions of whether and how to normalize the data do arise in practice. We address these questions by looking at some descriptive plots. First, a Choosing normalization methods and test statistics 435 scatterplot of M vs. A is checked for each slide to see whether there is any systematic pattern between the log-ratio intensity and the overall intensity; if there is, we may need to do normalization. Then within print-tip lowess curves are fitted for each print-tip group and the spatial plots are checked to see whether there are spatial effects. If the lowess curves are different for each print-tip group and the intensities are disproportionately distributed among the print-tip groups, then we may need to consider the print-tip group lowess normalization.

Test statistics
In this project, we use four different statistical methods to analyse the data. The first one is the M -statistic, M i , which corresponds to the early practice of simply using twofold change as a significance indicator. M does not take account of possibly different variations of M ij for different genes, and effectively it treats a highly variable gene in the same way as a stable one. A second possibility is the Student's t statistic, The t i statistic can be regarded as a standardized version of M i .
Because we have thousands of genes but only a small number of replicates for each gene, it is quite possible that for some genes, just by chance, their SE estimates (based on sample variances) can be very small, leading to a huge t statistic. In order to address this problem, another statistic, S , was proposed by Tusher et al. (2001). This statistic is a modified t statistic that adds a constant a 0 into the denominator, i.e. S i = M i /(SE i + a 0 ). As suggested by Tusher et al. (2001), we use the median of standard errors of all the genes as a 0 . The motivation of the S -statistic is intuitive, but it does not have a rigorous justification (although a connection exists between the S -statistic and a Bayesianregularized t statistic; Baldi and Long, 2001).
The last method we consider is the B statistic (Lonnstedt and Speed, 2002), which is an empirical Bayes estimate of the log posterior odds of µ i = 0. We assume that genes are independent and the measurement M ij is a random variable from a normal distribution with mean µ i and variance σ 2 i : Most genes have the same mean intensity level between the two samples, corresponding to µ i = 0.
Only a small proportion (say, p) of genes have different mean intensities, leading to µ i = 0. An indicator function i is defined as 0 if µ i = 0 and as 1 if µ i = 0. By definition and Bayes rule, we can calculate the log posterior odds for gene i having i = 1: We use conjugate prior distributions for mean µ i and variance σ 2 i . For n arrays, a degrees of freedom parameter v , and scale parameters a > 0 and c > 0, we set τ i = na/2σ 2 i and assume that τ i ∼ Gamma(v , 1), and: Because this is a conjugate prior, we can easily calculate the joint distribution of {M ij , j = 1, . . . , n}, µ i , and τ i , and then integrate to get the marginal distributions of Pr(M ij | i = 1) and Pr(M ij | i = 0). The final expression for B is then: where s 2 i is the sum of squared errors over n arrays for gene i . From this formula, we can find that the only gene-specific part lies in the last ratio, which is always greater or equal to 1, since 1 + nc ≥ 1. Thus, we can deduce a monotonically increasing relationship between B i and M 2 i or relative gene intensity levels, and the relationship is stronger if the variance for the gene is smaller. (A similar relationship exists between t i and M i .) B has four hyperparameters: p, v , a and c. Since there are no consistent estimates for them and appropriate hyperpriors are not clear, we use an EB approach to estimate them. First, we fix p and estimate v , a and c. The methods of moments is used to get a and v , and the least squares method to get c. There are no satisfactory estimates for p, but in most of cases that will not affect the shape of the B vs. M plot (Lonnstedt and Speed, 2002).

Evaluation methods
Three plots with numbers indicating whether the genes are identified as being bound to Lrp by different statistics are used to compare the false positive and false negative rates among the M , t and B statistics: average M vs. variance of M , t vs. average intensity A, and B vs. M . The overlap rates of top genes identified by different normalization methods and statistics are calculated to indicate the agreement between two different methods. The higher the overlap rate between two methods, the better agreement between them. We also use the false discovery rate (Efron et al., 2001;Benjamini and Hochberg, 1995) to compare the three normalization methods and the four statistics. FDR is an alternative to controlling the false positive rate (type I error), and is defined as the expected proportion of false positive genes (FP) among total positive genes (TP); the observed FP : TP ratio is often used to estimate FDR. When we use FDR to compare various statistical methods, we prefer the method that gives the lowest FDR while giving the same number of top (i.e. positive) genes as that of all other methods.

Implementation
We implemented the methods in the R software package (www.r-project.org). In particular, we used the SMA (Statistics for Microarray Analysis) package developed by Dudoit et al. (2002) (statwww.berkeley.edu/users/sandrine/software. html). We used SMA to do lowess normalization and print-tip group normalization, creating the M vs. A plots, boxplots and spatial plots, and calculating B statistics.
To calculate FDR, we used a permutation method to estimate the false positive number FP . Under the null hypothesis, H 0 :µ i = 0, we can generate a permuted dataset as follows: multiplying each M ij by either 1 or −1 randomly.
For example, suppose that the original parameters M ij for gene i are: 0.2, 0.4, −0.3, −0.5, 0.1. We generated five random numbers, say, −1, 1, −1, 1, −1. Then our permuted data will be: −0.2, 0.4, 0.3, −0.5, −0.1. We do this permutation 50 times for each genes. The false positive number from each permutation is the number of genes that counted as significant genes from the permuted data. The average of the false positive numbers over 50 permutations is calculated as FP. The number of genes that counted as significant genes from the original data is regarded as TP, and we estimate FDR = FP/TP (Pan, 2003). Note that a more elaborate estimator of FDR, namely FDR = π 0 FP /TP , with π 0 as the prior probability of null hypothesis being true, has been proposed (Storey and Tibshirani, 2003). Since π 0 is a constant, independent of whatever normalization or test statistic is used, using this estimator will not influence our final results.

Effects of normalization
M vs. A plots, spatial plots, and boxplots of the measurements in the first slide are displayed to compare the within-slide normalization methods; the corresponding plots for the other four slides are similar. Figure 1 shows that, before normalization, the intensity ratio increases as the average intensity increases, possibly indicating a systematic pattern. After the lowess normalization, this pattern disappears. Figure 2 displays 16 within-print-tip lowess lines, one for each print-tip group. This plot may indicate the existence of spatial effects, since six lowess curves seem to stand out from each other. Figure 3 shows a spatial plot. There are disproportionately large numbers of extreme log-ratios in the upper four grids if we do not use any normalization method, possibly indicating spatial effects for the experiment and the need for within-printtip group lowess location normalization. Finally, the within-print-tip boxplots in Figure 4 also indicate that both the mean and variances are different among the 16 print-tip groups. After within-printtip group lowess normalization, the mean of logratio of each print-tip group is adjusted to zero. So we may conclude from these plots that there seem to be spatial effects and we need to use the print-tip group normalization.

Effects of using various statistics
In this subsection, we illustrate the utility of B and how it differs from M and t, based on plots following the idea of Lonnstedt and Speed (2002). Corresponding illustration of the S statistic can be found elsewhere Efron et al., 2000). All statistics were calculated after applying the print-tip group normalization method to the data. The various plots relating the statistics M , t and B are shown in Figure 5. Genes that are identified as 'extreme' by at least one of these three statistics are plotted not as dots but as numbers in this figure; Table 1 provides the key to identifying to which of the 2 3 − 1 = 7 possible sets these extreme genes belong. We selected the cut-off points so that each statistic will identify its top 150 genes as being extreme. Specifically, the genes are selected as extreme for M if |M i | > 1.43, for t if |t i | > 4.62, Table 1. Number of genes falling in the corresponding sets, 1-7. In columns 2-4, a '1' indicates that the genes in this set are 'extreme' for the given statistic and for B if B i > −2.11. After setting these cut-off points, all of the seven possible sets in Table 1 are non-empty except Set 6. From the M mean vs. log variance plot in Figure 5, we see that at the left end of the plot, the  Table 1 for key) genes have very small variance and their means are not large. Almost all of these genes fall into Set 2; i.e. the genes are identified only by t but not by M and B statistics, which is consistent with our previous description of t: it can be inflated by a small variance. It is likely that these genes are false positives from using the t statistic.
It is reassuring to see that these genes are not identified by the B statistic. When we look at the right end of the M vs. log variance plot, some genes have a large mean but their variances are also large. All of these genes fall into Set 4; i.e. the genes are identified only by M but not by t and B . This phenomenon is consistent with our previous description of M : a large M does not take account of its possibly large variation. We may consider these genes as false positives from using M .
The t vs. A plot indicates that most of the genes with extreme t values fall into Set 2 and are not identified by M and B , while the B vs. M plot shows that the genes with extreme B values fall into Set 7; that is, these genes are identified by all of M , t and B . Most of the genes falling in Set 1 that are identified only by B have only moderately high B values. Hence it appears that B is more stable and reliable than t. Set 7 includes the genes with high values for all the three statistics, and this shows up clearly in the plots. Set 3 can be detected by t and B , but not by M . Set 1 and Set 5 can be detected by B , but not by t. Set 6 is the set of genes that can be detected by M and t but not by B . We note that the number of genes falling into Sets 1, 3, and 5 are between 40 and 50, but there are no genes falling into Set 6, which confirms that the genes high in both M and t are also high in B . Table 2 shows the overlap rates of the top 150 genes identified by different statistics to compare the agreement among these statistics. The results are consistent across normalization method: the overlap rates among M , t and S statistics are below 50%, but there is more than an 80% overlap rate between S and B . This appears related to the fact that both B and S can be justified from a Bayesian point of view. Table 3 compares estimated FDRs for the genes identified as extreme by the four statistics before and after applying two normalization methods. The FDRs for non-normalized data are in the range 70-80% for the various statistics, while this range for the lowess normalization is 30-56%, and just 17-43% for the print-tip group normalization method. When we compare the FDRs for each test statistic using the same normalization method, we find that M always has the highest FDR, and the FDR for t is also rather high. By contrast, the B statistic has the lowest FDR, with the S statistic close behind. Since B with print-tip normalization offers the lowest FDR (17%), we conclude that for this dataset, preprocessing to account for spatial effects and subsequent use of the B statistic is the best approach.

Application to transcription profiling data
To assess the performance and versatility of our approach, we also applied it to a transcription  profiling dataset. Transcriptional profiling was carried out to comprehensively define a family of genes whose transcription depends on the activity of leucine-responsive regulatory protein, Lrp. Specifically, researchers set out to identify genes differentially expressed in Lrp + and Lrp − strains. Two-colour hybridization of the cDNA microarray, as described in Tani et al. (2002), was used in the experiment. Here we use six arrays of 4281 genes each to illustrate our proposed procedure of selecting normalization methods and test statistics. Figure 6 displays 16 within-print-tip lowess lines for the Lrp expression data. From this plot, we can see that there is no obvious pattern between intensity ratio and average intensity, and the 16 lowess curves seem to stay close to each other, indicating that no spatial effects exist and hence any normalization (especially the within print-tip normalization) may not help much. Table 4 compares estimated FDRs for the top 150 genes identified by the four statistics before and after applying lowess and within print-tip normalization methods. As expected from Figure 6, the normalization does not have much effect on the FDR, and lowess normalization seems to be slightly better than no normalization and within-print-tip normalization.
When we compare the four statistics, we find that M always has the highest FDR and the other three statistics, t, S and B , have much lower FDR, with B having consistently the lowest FDR among these four. So, based on the plot and FDR, we conclude that for this dataset, normalization does not have much effect on results, and the B statistic seems to offer the best approach.

Discussion
In this paper, we have compared three normalization methods and four test statistics to identify target genes bound by a protein called Lrp. Different normalization methods were developed to reduce the systematic biases and variations across microarray experiments. Before doing any data analysis, we must decide whether we need to normalize the data, and if so which normalization method to use.
Here we have illustrated the preliminary use of M vs. A plots, spatial plots and boxplots to decide whether there are any patterns between M and A and whether there are spatial effects. If so, normalization is needed, with the print-tip group normalization method being most appropriate if there is evidence of a within-tip spatial effect. This should also result in a reduction of the false discovery rate in subsequent analysis. For our data, both methodologies (plot inspection and FDR calculation) point to the use of print-tip group normalization being preferable. Hence, we suggest preliminary exploration of the various descriptive plots discussed to guide selection of the appropriate normalization method. Although this suggestion is practical and easy to use, some concerns about normalization still linger. Because normalization methods assume that the mean log ratio intensity for each slide is close to zero, there should be a small proportion of genes with different intensities. But it can be difficult to check whether a dataset satisfies these assumptions, and whether the normalization methods will still work if the data violate the assumptions. If not, can more robust methods be found? Tseng et al. (2001) suggested a rank invariant method and Reilly et al. (2003) proposed probability models fit using the Gibbs sampler (see e.g. Carlin and Louis, 2000) to select non-differentially expressed genes to do normalization. ANOVA methods proposed by Kerr et al. (2000) are also widely used in normalization and testing for microarray data. One might also compare ANOVA methods with these other methods to see how they perform.
The B statistic uses information from all the genes to estimate the posterior odds of rejecting the null hypothesis. Although the normality and independence assumptions were used to derive the formula for B , we did not use a formal test based on these assumptions. Furthermore, the conclusion of 'extremeness' for a gene is often based solely on the rank of its test statistic. Because we cannot estimate the prior probability of extremeness satisfactorily and the scale of B depends on this probability, we cannot use a predetermined cutoff value (such as B = 0) for gene selection. Fortunately, the ranks of the various B statistics do not depend on p, so we can select the top genes with the most extreme B values based on their ranks. Based on the analysis results for the Lrp data, we conclude that the B statistic performs much better than the M and t statistics, since B yields much smaller false positive rates than the other two. The performance of B , in terms of the overlap rate and false discovery rate, is quite similar to that of S , but the former is more explicitly model-based. Also, because B has a closed form, it is easy to compute and convenient to use (e.g. it is directly available within the SMA package). Hence, we conclude that, although the M and t statistics are the most commonly used statistics for microarray data analysis, based on our results, we instead recommend the B statistic. It will be interesting to see whether the B statistic can be incorporated into other statistical packages and applications in the detection of differential gene expression (Pan, 2002;Kendziorski et al., 2003;Newton and Kendziorski, 2003).
In this paper, we used the FDR results from the top 150 identified genes to compare the different methods. In order to check whether the results are sensitive to the number of genes so selected, we repeated our analysis using different cut-off points; specifically, using the top 20, 50 and 200 identified genes. The results were consistent with our previous findings: the FDR was highest for nonnormalization and lowest for print-tip normalization, and the B and S statistics have lower FDRs than M and t. Thus, our results appear robust with respect to the choice of extremeness cut-off point.
Our data analysis followed the common approach of using background subtracted measurement intensity. But Kooperberg et al. (2002) pointed out that this method may lead to a much larger variance than needed when the expression levels are low. Qin and Kerr (2003) also showed that background subtraction can increase the variability of gene expression and worsens one's ability to detect the expressed genes. To investigate this issue in our setting, we redid our analyses without background subtraction. Our results were again consistent with our previous conclusions: print-tip normalization and the B statistic enjoy the best FDR performance. We also found that the FDR is consistently lower when using data without background subtraction across different normalization methods and test statistics. Thus, based on our results, avoiding