Robust Microarray Meta-Analysis Identifies Differentially Expressed Genes for Clinical Prediction

Combining multiple microarray datasets increases sample size and leads to improved reproducibility in identification of informative genes and subsequent clinical prediction. Although microarrays have increased the rate of genomic data collection, sample size is still a major issue when identifying informative genetic biomarkers. Because of this, feature selection methods often suffer from false discoveries, resulting in poorly performing predictive models. We develop a simple meta-analysis-based feature selection method that captures the knowledge in each individual dataset and combines the results using a simple rank average. In a comprehensive study that measures robustness in terms of clinical application (i.e., breast, renal, and pancreatic cancer), microarray platform heterogeneity, and classifier (i.e., logistic regression, diagonal LDA, and linear SVM), we compare the rank average meta-analysis method to five other meta-analysis methods. Results indicate that rank average meta-analysis consistently performs well compared to five other meta-analysis methods.


Introduction
We develop a simple, yet robust meta-analysis-based feature selection (FS) method for microarrays that ranks genes by differential expression within several independent datasets,then combines the ranks using a simple average to produce a final list of rank-ordered genes. Such meta-analysis methods can increase the power of microarray data analysis by increasing sample size [1]. The subsequent improvement to differentially expressed gene (DEG) detection, or to FS is essential for downstream clinical applications. Many of these applications, such as disease diagnosis and disease subtyping, are predictive in nature and are important for guiding therapy. However, DEG detection can be difficult due to technical and biological noise or due to small sample sizes relative to large feature sizes [2]. These properties are typical of many microarray datasets. Despite small sample sizes, the number of gene expression datasets available to the research community has grown [3]. Thus, it is important to develop methods that can use all available knowledge by simultaneously analyzing several microarray datasets of similar clinical focus. However, combining high-throughput gene expression datasets can be difficult due to technological variability. Differences in microarray platform [4] or normalization and preprocessing methods [5] affect the comparability of gene expression values. Laboratory batch effects can also affect reproducibility [6]. Numerous studies have proposed novel strategies to remove batch effects [7]. However, in some cases, batch effect correction can have undesirable consequences [8]. In light of these challenges, several studies have proposed novel methods for metaanalysis of multiple microarray datasets.
Existing microarray meta-analysis methods either combine separate statistics for each gene expression dataset or aggregate samples into a single large dataset to estimate global gene expression. The study by Park et al. used analysis of variance to identify unwanted effects (e.g., the effect of different laboratories) and modeled these effects to detect DEGs [9]. Choi et al. used a similar approach to compute an "effect size" quantity, representing a measure of precision for  Figure 1: Study design diagram. We compare the predictive performance of meta-analysis-based feature selection (FS) methods by designing a study that considers five components: (1) basic FS methods that are the building blocks of some of the metaanalysis methods, (2) meta-analysis-based FS methods, (3) clinical application, (4) microarray data platform, and (5) classifier (logistic regression, diagonal LDA and linear SVM). Since the "best" metaanalysis-based FS method may be dataset-or application-specific, assessing performance over a wide variety of factors enables an evaluation of the method's robustness.
each study, and used this "effect size" to directly compare and combine microarray datasets [10]. Wang et al. combined the fold change of genes between classes from three microarray datasets and weighted each dataset by its variance such that datasets with higher variance contribute less to the final statistic [11]. Yoon et al. conducted a large-scale study of gene expression by examining the variation of genes across multiple microarray datasets, regardless of the clinical focus [12]. Breitling and Herzyk ranked fold changes between all interclass pairs of samples and computed the product of all ranks for each gene [13]. More recently, Campain and Yang reviewed several meta-analysis methods and assessed their performance using both classification accuracy and synthetic data [14]. Research has shifted towards methods that consider multiple FS methods, reflecting the fact that no single FS method performs well for all datasets [15].
Although several meta-analysis methods exist, except for the study by Campain and Yang, the literature rarely compares these methods in a comprehensive manner. We develop the rank average method, a simple metaanalysis-based FS method, for identifying DEGs from multiple microarray datasets and design a study (Figure 1) to compare rank average to five other meta-analysis-based FS methods. We focus on the predictive ability of genes emerging from meta-analysis and show that rank average meta-analysis is robust with respect to three factors. These three factors are (1) clinical application (i.e., breast, renal, and pancreatic cancer diagnosis or subtyping), (2) data platform heterogeneity (i.e., combining different microarray platforms), and (3) classifier. Using a comprehensive factorial analysis, we rate each meta-analysis-based FS method relative to its peers. In terms of identifying genetic features with reproducible predictive performance and in terms of robustness to multiple factors, results indicate that rank average meta-analysis performs consistently well in comparison to five other meta-analysis-based FS methods.

Microarray Datasets.
We use six breast cancer, five renal cancer, and five pancreatic cancer gene expression datasets (Table 1) to compare meta-analysis-based FS methods. Each renal cancer dataset examines patient samples from several subtypes of tumors: clear cell (CC), oncocytoma (ONC), chromophobe (CHR), and papillary (PAP). We are interested in identifying genes differentially expressed between the CC subtype and all other subtypes, that is, CC versus ONC/CHR/PAP. These renal cancer datasets share a similar clinical focus. However, they are heterogeneous in terms of microarray platform [16][17][18][19][20][21]. Similarly, the breast cancer datasets are heterogeneous in both platform and clinical focus [22][23][24][25][26]. Although patient samples from each dataset have undergone different treatment for breast cancer and have been extracted at different stages of the disease, each sample is labeled as either estrogen receptor positive (ER+) or negative (ER−). Thus, we assess the performance of classifiers that predict the estrogen receptor status. The pancreatic cancer datasets also include a variety of platforms and clinical focuses [27][28][29][30][31]. We identify genes to discriminate pancreatic cancer versus noncancer patient samples. These datasets contain different numbers of probes (or probesets in the case of Affymetrix datasets) due to differences in microarray platform. Within each dataset group, we reduce the number of probes in each dataset to a common shared set based on probe sequence similarity.

Rank Average Meta-Analysis.
The meta-analysis-based FS method proposed in this paper ranks genes individually in each dataset and computes the average rank of each gene. Gene rank order is determined by a measure of differential expression (which can be any of a number of basic FS methods such as fold change or t-test) and we assume that this rank order is invariant to batch effects. Using the average rank of a gene across several datasets to obtain the final multidataset rank order, we can infer (1) the relative strength of that gene in differentiating the patient samples of interest and (2) the consistency of the gene's differential expression across multiple studies.
The remainder of this section uses the following mathematical notation. K is the total number of datasets, M is the total number of genes in each dataset, and N k is the number of samples in dataset k, where k = 1 · · · K and N is the total number of samples in all datasets. We denote a gene i in dataset k as a vector where x i,k j is the expression value of gene i of sample j in dataset k. In the case of sample aggregation (i.e., the naive method of meta-analysis), we denote a gene i across all datasets with (2) Using this notation, we can define a function, r i,k,ϑ = R ϑ (g i,k ), to compute the rank, r i,k,ϑ , of a gene, g i,k , using a ranking algorithm denoted by ϑ. A smaller rank indicates a greater degree of differential expression. In the case of sample aggregation, the ranking function takes the form r i,•,ϑ = R ϑ (g i,• ). The average rank, r i , of a gene i across all datasets, weighted by number of samples in each dataset, N k , is Weighting gives preference to ranks from datasets with larger sample sizes.
We consider several basic FS, or gene ranking, methods as follows: fold change (FC), t-test (T), significance analysis of microarrays (SAM) [32], rank-sum (RS), minimum redundancy maximum relevance using the difference formulation (mRMRD), and mRMR using the quotient formulation (mRMRQ) [33]. We explicitly define the rank algorithm for the kth dataset as For each dataset and each basic FS method, we use threefold cross-validation to compute an estimate of classification performance (measured using AUC) averaged over 20 feature sizes (ranging from the top single feature to the top twenty features). We then choose the basic FS method, ϑ k , with highest estimated classification performance for each dataset. Because each basic FS method makes different assumptions about DEGs and the correctness of these assumptions varies from dataset to dataset, allowing a different basic FS method for each dataset can improve performance.

Predictive Performance.
We use classification performance to assess meta-analysis-based FS methods with the assumption that improved FS leads to higher prediction performance when classifying samples from an independent    dataset. We assess prediction performance using independent training and testing datasets because of the small sample size of some of the datasets and because we want to reflect clinical scenarios in which predictive models would likely be derived from data collected from a separate batch of patients. We compare our proposed rank average metaanalysis method to other meta-analysis methods including: (1) the rank products method [13], (2) the mDEDS method [14], (3) Choi et al.'s method of interstudy variability [10], (4) Wang et al.'s method of weighting differential expression by variance [11], and (5) a naive method that aggregates samples from multiple datasets. The rank products, mDEDS, Choi, and Wang methods can be applied to multiple datasets as well as to single datasets. For each method and each dataset group, we compute single-dataset performance, combined homogeneous-dataset performance (from two to four datasets combined), and combined heterogeneous-dataset performance (Figure 2(a)). Classification performance depends on both feature selection and number of samples available for training. We are interested in performance gains due to meta-analysisbased FS alone. We isolate this performance gain by training classifiers with samples from a single dataset only, while allowing the features used for training to come from multiple datasets. Thus, any improvement (or degradation) in classification performance of a meta-analysis-based FS method in comparison to the baseline single-dataset FS is due to features selected rather than to increases in training sample size. We assess classification performance using a separate validation dataset and permute the datasets such that each individual dataset in each dataset group-renal, breast, and pancreatic cancer-is used at least once for validation. Moreover, for each permutation, we use 100 iterations of bootstrap sampling from the training datasets to estimate classification performance. Figure 2(b) is an example of the permutations possible with a five-dataset group (datasets 1-4 are the same platform while dataset 5 is a different platform), in which the prediction performance of two-dataset combination is assessed. This procedure can be expanded to handle threedataset, four-dataset, or higher combinations for FS.  (3) RC (3) BC (4) PC (3) RC (3) BC (4) PC (3) RC (3) BC (4) PC (3) RC (3) BC (4) PC (3) RC (3) BC (4) PC (3) RC (3) BC (4) Rank average mDEDS Rank products Choi Wang Naive Figure 3: Rating meta-analysis methods by prediction performance when combining all available datasets. Each meta-analysis method (rank average, rank products, Wang, mDEDS, Choi, and naive) is rated relative to its peers. We assess performance rating across three factors: (1) clinical application (breast cancer: BC, renal cancer: RC, and pancreatic cancer: PC), (2) data platform heterogeneity (homogeneous: orange, heterogeneous: blue), and (3) classifier (logistic regression: LR, diagonal LDA: DLDA and linear SVM). For each combination of factors, the rating of each meta-analysis method is represented by an additive bar. Methods with higher absolute prediction performance receive higher ratings (and longer bars). When considering absolute prediction performance, rank average, with a mean overall rating of 4.56, performs consistently well compared to its peers.
The procedure for measuring predictive performance of heterogeneous-dataset combination is slightly different. Each dataset group contains several one-channel Affymetrix datasets and one two-channel dataset (either cDNA or Agilent). Gene expression values of the two-channel datasets are computed as log ratios, resulting in different dynamic ranges compared to the one-channel datasets. We assess the robustness of each meta-analysis-based FS method to heterogeneous data platforms by first determining the performance of the method when combining only Affymetrix data (Figure 2(b), homogeneous data), then comparing to results obtained when combining a mixture of Affymetrix and two-channel arrays (Figure 2(b), Heterogeneous Data). For example, we compute heterogeneous combination performance by combining one or more Affymetrix datasets to the two-channel dataset, then training a classifier using one of the Affymetrix datasets, and testing samples from an independent dataset (again Affymetrix). Thus, not only should a good meta-analysis-based FS method perform well with respect to single dataset FS, but also the method should exhibit minimal performance degradation, if any, when combining heterogeneous data platforms.

Robustness of Rank Average Meta-Analysis.
We rate each meta-analysis method by absolute prediction performance (Figure 3). Based on this criterion, we find that rank average meta-analysis, with the highest overall mean rating of 4.56, performs consistently well compared to five other metaanalysis methods including the mDEDS, rank products, Choi, Wang, and naive methods. This analysis answers the question: which meta-analysis-based FS method consistently exhibits the largest prediction performance when combining 6 The Scientific World Journal all available datasets? We assign a rating to each metaanalysis method for every combination of three factors that include (1) clinical application or dataset group, (2) data platform heterogeneity (combining similar or different microarray platforms), and (3) classifiers (logistic regression: LR, diagonal linear discriminant: DLDA, and linear SVM). Ratings for each meta-analysis method are relative to its peers, with higher ratings indicating better prediction performance under the same combination of factors. In Figure 3, bars are proportional to performance ratings. Using pancreatic cancer (PC) as an example, the rank average meta-analysis method has a rating of five (corresponding to a predictive performance AUC of 81.5, See Supplemental Table S1 available online at doi:10.1100/2012/989637) when analyzing homogeneous datasets and when using the logistic regression classifier. This means that its absolute prediction performance is higher than that of four other metaanalysis methods compared under the same conditions (i.e., homogeneous data, logistic regression classifier). The results illustrated in Figure 3 and obtained through a comprehensive analysis of three factors suggest that, relative to its peers, rank average meta-analysis is robust when considering absolute prediction performance.

Rank Average Identifies Biologically Sensible Genes.
For each dataset group, we combine all available microarray datasets and use the rank average meta-analysis method to identify DEGs. Assessing DEG detection performance by examining the genes is difficult unless we know, via validation, whether or not these genes are truly differentially expressed. However, because of the sheer number of genes in high-throughput datasets, the validation process is often time and resource intensive. Despite this, we examine the top ranked genes from each dataset group to verify that the rank average meta-analysis method is identifying genes that are biologically sensible. Table 2 lists the top 20 genes selected from meta-analysis of each of the three dataset groups: six breast cancer, five renal cancer, and five pancreatic cancer datasets. We optimize the FS method for each individual dataset using three-fold cross validation and the diagonal LDA classifier. The optimal FS method for each dataset differs. We compare ER+ and ER− samples for each breast cancer dataset and find, not surprisingly, that the ESR1 gene (estrogen receptor) is the top ranked gene for all but one dataset. Accordingly, the weighted average rank of ESR1 places it at the top of the combined list. Among the other genes in the list, NAT1 [34], DNALI1, SCUBE2 [35], and TFF1 [36] have been implicated in breast cancer. Although the individual dataset ranks of these genes vary from low to relatively high ranks (e.g., 200 to 300), it is the consistency of selecting these genes from multiple datasets that places them at the top of the combined list. In Table 2, we include the number of individual datasets in which the gene is ranked in the top 20. We compare the renal cancer clear cell subtype to three other subtypes (i.e., chromophobe, oncocytoma, and papillary) to identify DEGs. The top gene we identify is LOX, which is an oncogene implicated in clear cell renal cancer [37]. The ADFP gene, ranked at #3 in the combined list, is especially interesting because it may be a potential urinary biomarker for detecting renal cancer [38]. ADFP is ranked favorably in all but the Higgins dataset, in which it is ranked at number 75.
The rank average meta-analysis method identifies S100P as the top pancreatic cancer gene, which has been implicated in several studies [39,40]. The S100P gene has a relatively favorable ranking in the Pilarsky and Pei datasets and moderate to un-favorable rankings in the other datasets, indicating that analysis of individual datasets may not readily identify the gene. Another example, LAMC2, is ranked favorably in the Ishikawa and Pei datasets, but relatively higher in the other datasets. Overall, LAMC2 is ranked second in the combined results and is, according the to literature, a purported pancreatic cancer gene [41]. Weighted average ranks for the pancreatic cancer results increase quickly compared to the breast and renal cancer results, indicating increased heterogeneity among the ranks of the individual datasets. One explanation for this is the slight difference in dataset subtype comparisons. For example, one of the datasets, Ishikawa, extracted RNA samples from pancreatic juice rather than from solid tumors.
The degree of differential expression (and consequently, the rank) of a gene can vary significantly from dataset to dataset. Combining DEG detection results by averaging ranks across datasets reduces variability and improves statistical confidence. Analysis of a single microarray dataset may result in errors during DEG detection-for example, false positives and false negatives (genes that should be differentially expressed, but not favorably ranked). In general, these errors can be reduced by increasing sample size. Combining microarray datasets by averaging ranks effectively increases sample size while enabling robust analysis of heterogeneous data.

Discussion
In order to understand the differences in performance among the six meta-analysis-based FS methods, we identify and list the differences and similarities in Table 3. We focus on three properties: (a) basic FS methods forming the basis of metaanalysis, (b) the manner in which these basic FS methods are chosen and applied to individual microarray datasets, and (c) the use of ranks.
Among the five meta-analysis methods (not including the naive control method) rank average and mDEDS are the only methods that consider multiple basic FS methods-for example, fold change, t-statistic, SAM, and rank sum-for detecting DEGs (Table 3, row 1). The rank products, Choi and Wang methods use modified forms of basic FS methods. Moreover, rank average is the only method that chooses one basic FS method for each dataset to maximize prediction performance (Table 3, row 2). In contrast, mDEDS uses all of the available basic FS methods for each dataset. Finally, rank average and rank products are the only meta-analysis methods that are rank-based (Table 3, row 3).
Among the basic FS methods, no method can be considered the best because of the data-dependent nature of microarray analysis. Thus, rank average and mDEDS benefit by considering multiple basic FS methods. However, some basic FS methods can produce erroneous results when inappropriately applied (e.g., using a t-statistic with gene expression data that is not normally distributed). Rank average meta-analysis further benefits from selecting a single basic FS that optimizes prediction performance. On the other hand, the performance of mDEDS meta-analysis can degrade if it includes a basic FS method that is incompatible with the data. Likewise, the performance of rank products can degrade when the fold change FS method is not appropriate for the data. The Choi and Wang methods may also suffer from this problem. However, they seem to perform fairly well when applied to the datasets in this study (see Figure 3). Finally, rank-based meta-analysis methods that consider multiple basic FS methods allow a fair comparison among the basic FS methods. In light of these results, for microarray meta-analysis, we recommend (1) to use rankbased methods, (2) to consider a wide variety of basic FS methods, and (3) to optimize the FS method for each individual dataset based on application-specific criteria (e.g., prediction performance for diagnostic applications).
Despite the benefits summarized in Table 3, rank average meta-analysis and the evaluation criteria presented in this study are not without limitations. The limitations of this study include (1) the scope of data and classifiers considered, (2) the criterion for measuring performance of a metaanalysis method, and (3) normalization and pre-processing of gene expression data. First, the results of this study may be dataset-specific. Although we have strived to provide a wide range of scenarios to allow adequate assessment of 8 The Scientific World Journal these meta-analysis methods, results may differ when applied to other dataset groups. Second, we use prediction AUC as the performance criterion. However, microarray-based clinical prediction is only one possible application. Other applications may need to identify genes based on biological relevance [15]. It is unclear which meta-analysis methods would perform well in such applications. The rank average meta-analysis method benefits from choosing a basic FS method for each individual dataset that optimizes (via crossvalidation) prediction performance. Thus, there is a potential bias in the performance of rank average meta-analysis. On the other hand, the ability to choose basic FS methods that perform well for a particular application, such as prediction, could be considered a favorable property of rank average meta-analysis. Finally, it is possible that normalization of gene expression datasets (e.g., quantile normalization) can improve the performance of meta-analysis by reducing batch effect. Specifically, removal of batch effects (1) can improve prediction performance when training and testing are applied to independent, heterogeneous datasets and (2) can improve the performance of simple meta-analysis methods that aggregate samples from multiple heterogeneous datasets. However, we do not consider any batch-effect normalization procedures in this study.

Conclusions
In order to address the sample-size problem in gene expression analysis as well as the need for accurate solutions for clinical prediction problems, we proposed the rank average meta-analysis-based FS method. Rank average metaanalysis identifies differentially expressed genes from multiple microarray datasets. We used a comprehensive study of multiple factors and found that rank average performs consistently well compared to five other meta-analysis methods in terms of prediction performance. This comprehensive study enabled us to measure the robustness of rank average to three factors that are often encountered in clinical prediction applications. These factors include clinical application (e.g., breast, renal, and pancreatic cancer), microarray data platform heterogeneity, and classifier model (logistic regression, diagonal LDA, and SVM). Rank average metaanalysis, performs well because it selects dataset-specific basic FS methods and then averages the ranks across all individual datasets to produce a final robust gene ranking. In comparison to five other meta-analysis methods the rank average method is not always the best method for some factor combinations. However, it is consistently among the best performing in terms of its ability to identify predictive genes. Although we presented results from analysis of microarray gene expression data, the proposed methods may be generalized for other bioinformatics problems that require feature selection.