Combining multiple microarray datasets increases sample size and improves the reproducibility of informative-gene identification and subsequent clinical prediction. Although microarrays have increased the rate of genomic data collection, sample size remains a major issue when identifying informative genetic biomarkers. As a result, feature selection methods often suffer from false discoveries, which lead to poorly performing predictive models. We develop a simple meta-analysis-based feature selection method that captures the knowledge in each individual dataset and combines the results using a simple rank average. In a comprehensive study that measures robustness across clinical application (breast, renal, and pancreatic cancer), microarray platform heterogeneity, and classifier (logistic regression, diagonal LDA, and linear SVM), we compare the rank average meta-analysis method to five other meta-analysis methods. Results indicate that rank average meta-analysis consistently performs well relative to these methods.
We develop a simple yet robust meta-analysis-based feature selection (FS) method for microarrays that ranks genes by differential expression within several independent datasets, then combines the ranks using a simple average to produce a final list of rank-ordered genes. Such meta-analysis methods can increase the power of microarray data analysis by increasing sample size [
Existing microarray meta-analysis methods either combine separate statistics for each gene expression dataset or aggregate samples into a single large dataset to estimate global gene expression. The study by Park et al. used analysis of variance to identify unwanted effects (e.g., the effect of different laboratories) and modeled these effects to detect differentially expressed genes (DEGs) [
We develop the rank average method, a simple meta-analysis-based FS method, for identifying DEGs from multiple microarray datasets and design a study (Figure
Study design diagram. We compare the predictive performance of meta-analysis-based feature selection (FS) methods by designing a study that considers five components: (1) basic FS methods that are the building blocks of some of the meta-analysis methods, (2) meta-analysis-based FS methods, (3) clinical application, (4) microarray data platform, and (5) classifier (logistic regression, diagonal LDA and linear SVM). Since the “best” meta-analysis-based FS method may be dataset- or application-specific, assessing performance over a wide variety of factors enables an evaluation of the method’s robustness.
We use six breast cancer, five renal cancer, and five pancreatic cancer gene expression datasets (Table
Microarray datasets.
Breast cancer estrogen receptor status
Dataset | ER+ | ER− | Platform | No. of probes |
---|---|---|---|---|
MDACC Train | 80 | 50 | Affy HG-U133A | 22283 |
MDACC Test | 60 | 40 | Affy HG-U133A | 22283 |
Miller | 213 | 34 | Affy HG-U133A | 22283 |
Sotiriou | 72 | 24 | Affy HG-U133A | 22283 |
Minn | 57 | 42 | Affy HG-U133A | 22283 |
Van't Veer | 226 | 69 | Agilent 2-Color | 24496 |
Common probes: 8953.
Renal cancer subtype
Dataset | CC | Other | Platform | No. of probes |
---|---|---|---|---|
Schuetz | 13 | 12 | Affy HG-Focus | 8793 |
Jones | 32 | 29 | Affy HG-U133A | 22283 |
Kort | 10 | 30 | Affy HG-U133+2.0 | 54675 |
Yusenko | 26 | 27 | Affy HG-U133+2.0 | 54675 |
Higgins | 26 | 9 | cDNA 2-Color | 22689 |
Common probes: 946.
Pancreatic cancer diagnosis
Dataset | Normal | Cancer | Platform | No. of probes |
---|---|---|---|---|
Badea | 39 | 39 | Affy HG-U133+2.0 | 54675 |
Ishikawa | 25 | 24 | Affy HG-U133A/B | 44928 |
Pei | 16 | 36 | Affy HG-U133+2.0 | 54675 |
Pilarsky | 18 | 27 | Affy HG-U133A/B | 44928 |
Iacobuzio-Donahue | 5 | 17 | cDNA 2-Color | 43910 |
Common probes: 4530.
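The "common probes" counts above reflect the features shared by every dataset in a group. A minimal sketch of that intersection step follows, assuming each dataset is held as a dictionary mapping a feature identifier to its expression values. The helper name `common_features` and the toy identifiers and values are hypothetical, and the actual probe-matching procedure used in the study is not shown here.

```python
from functools import reduce


def common_features(datasets):
    """Return the sorted identifiers present in every dataset.

    `datasets` maps a dataset name to a dict of {feature_id: values}.
    """
    id_sets = [set(features) for features in datasets.values()]
    if not id_sets:
        return []
    return sorted(reduce(set.intersection, id_sets))


# Toy example with made-up values (illustration only).
toy = {
    "A": {"ESR1": [1.0, 2.0], "NAT1": [0.5, 0.7], "MYB": [3.1, 2.9]},
    "B": {"ESR1": [0.2, 0.1], "MYB": [1.1, 1.4]},
    "C": {"MYB": [5.0, 4.8], "ESR1": [2.2, 2.5], "TFF1": [0.9, 1.0]},
}
shared = common_features(toy)
```

Restricting every dataset to the shared identifiers is what allows per-dataset gene ranks to be compared and combined across platforms with different probe counts.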
The meta-analysis-based FS method proposed in this paper ranks genes individually in each dataset and computes the average rank of each gene. Gene rank order is determined by a measure of differential expression (which can be any of a number of basic FS methods such as fold change or
The remainder of this section uses the following mathematical notation.
We consider several basic FS, or gene ranking, methods as follows: fold change (FC),
We use classification performance to assess meta-analysis-based FS methods with the assumption that improved FS leads to higher prediction performance when classifying samples from an independent dataset. We assess prediction performance using independent training and testing datasets because of the small sample size of some of the datasets and because we want to reflect clinical scenarios in which predictive models would likely be derived from data collected from a separate batch of patients. We compare our proposed rank average meta-analysis method to other meta-analysis methods including: (1) the rank products method [
Procedure for comparing the predictive performance of six microarray meta-analysis-based FS methods. (a) Features are selected from microarray datasets using the rank average meta-analysis method (pink box), several other meta-analysis methods (orange boxes: mDEDS, rank products, Choi, and Wang), and a naive method (blue box) that aggregates samples into a larger dataset. Rank average meta-analysis chooses a single feature selection (FS) method from among several basic FS methods (SAM, fold change, rank sum,
Selecting features from multiple microarray datasets using six meta-analysis-based methods
Example of dataset permutations for evaluating meta-analysis predictive performance
Classification performance depends on both feature selection and number of samples available for training. We are interested in performance gains due to meta-analysis-based FS alone. We isolate this performance gain by training classifiers with samples from a single dataset only, while allowing the features used for training to come from multiple datasets. Thus, any improvement (or degradation) in classification performance of a meta-analysis-based FS method in comparison to the baseline single-dataset FS is due to features selected rather than to increases in training sample size. We assess classification performance using a separate validation dataset and permute the datasets such that each individual dataset in each dataset group—renal, breast, and pancreatic cancer—is used at least once for validation. Moreover, for each permutation, we use 100 iterations of bootstrap sampling from the training datasets to estimate classification performance. Figure
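The evaluation design described above can be sketched as follows. This is an illustration under stated assumptions, not the study's exact procedure: it assumes features are selected from every dataset except the validation dataset, that the classifier is trained on one of those datasets at a time, and that bootstrap resampling is applied to the training samples. The function names `evaluation_splits` and `bootstrap_indices` are hypothetical.

```python
import numpy as np


def evaluation_splits(dataset_names):
    """Yield (fs_datasets, train_dataset, validation_dataset) triples.

    Assumption: features come from all datasets except the validation
    dataset, while the classifier is trained on a single dataset.
    """
    for validation in dataset_names:
        fs_datasets = [d for d in dataset_names if d != validation]
        for train in fs_datasets:
            yield fs_datasets, train, validation


def bootstrap_indices(n_samples, n_iterations=100, seed=0):
    """Yield index arrays that resample n_samples rows with replacement."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iterations):
        yield rng.integers(0, n_samples, size=n_samples)


renal = ["Schuetz", "Jones", "Kort", "Yusenko", "Higgins"]
splits = list(evaluation_splits(renal))
resamples = list(bootstrap_indices(n_samples=25))
```

Because the training sample size is fixed by the single training dataset in each split, any change in validation performance across FS methods can be attributed to the selected features.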
The procedure for measuring predictive performance of heterogeneous-dataset combination is slightly different. Each dataset group contains several one-channel Affymetrix datasets and one two-channel dataset (either cDNA or Agilent). Gene expression values of the two-channel datasets are computed as log ratios, resulting in different dynamic ranges compared to the one-channel datasets. We assess the robustness of each meta-analysis-based FS method to heterogeneous data platforms by first determining the performance of the method when combining only Affymetrix data (Figure
We rate each meta-analysis method by absolute prediction performance (Figure
Rating meta-analysis methods by prediction performance when combining all available datasets. Each meta-analysis method (rank average, rank products, Wang, mDEDS, Choi, and naive) is rated relative to its peers. We assess performance rating across three factors: (1) clinical application (breast cancer: BC, renal cancer: RC, and pancreatic cancer: PC), (2) data platform heterogeneity (homogeneous: orange, heterogeneous: blue), and (3) classifier (logistic regression: LR, diagonal LDA: DLDA and linear SVM). For each combination of factors, the rating of each meta-analysis method is represented by an additive bar. Methods with higher absolute prediction performance receive higher ratings (and longer bars). When considering absolute prediction performance, rank average, with a mean overall rating of 4.56, performs consistently well compared to its peers.
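One way to turn prediction performance into peer-relative ratings is sketched below. It assumes, as an illustration only, that within each factor combination the six methods are rated 1 (lowest performance) to 6 (highest performance) and that the overall rating is the mean over combinations. The exact rating scheme behind the figure may differ, the function name `rate_methods` is hypothetical, and the performance values are made up.

```python
import numpy as np

methods = ["rank average", "rank products", "Wang", "mDEDS", "Choi", "naive"]


def rate_methods(performance):
    """Rate methods within each factor combination.

    `performance` has shape (n_combinations, n_methods); higher is better.
    The best method in a row receives n_methods and the worst receives 1.
    Ties are broken by column order.
    """
    performance = np.asarray(performance, dtype=float)
    order = np.argsort(performance, axis=1, kind="stable")
    ratings = np.empty_like(order)
    rows = np.arange(performance.shape[0])[:, None]
    ratings[rows, order] = np.arange(1, performance.shape[1] + 1)
    return ratings


# Made-up performance values for two factor combinations (illustration only).
toy_performance = [
    [0.90, 0.85, 0.80, 0.88, 0.70, 0.75],
    [0.82, 0.84, 0.79, 0.81, 0.77, 0.60],
]
ratings = rate_methods(toy_performance)
mean_rating = ratings.mean(axis=0)
```

Summing a method's ratings over all factor combinations gives the length of an additive bar, and dividing by the number of combinations gives a mean overall rating.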
For each dataset group, we combine all available microarray datasets and use the rank average meta-analysis method to identify DEGs. Assessing DEG detection performance by examining the genes is difficult unless we know, via validation, whether these genes are truly differentially expressed. However, because of the sheer number of genes in high-throughput datasets, the validation process is often time- and resource-intensive. Nevertheless, we examine the top-ranked genes from each dataset group to verify that the rank average meta-analysis method identifies biologically sensible genes.
Table
Differentially expressed genes identified from rank average meta-analysis of multiple microarray datasets.
Breast cancer | | | Renal cancer | | | Pancreatic cancer | | |
---|---|---|---|---|---|---|---|---|
Gene | Weighted average rank | Top 20 in # of datasets | Gene | Weighted average rank | Top 20 in # of datasets | Gene | Weighted average rank | Top 20 in # of datasets |
ESR1 | 0.20 | 6 | LOX | 13.65 | 4 | S100P | 31.42 | 2 |
NAT1 | 33.99 | 3 | COL5A2 | 16.86 | 3 | LAMC2 | 51.44 | 2 |
DNALI1 | 48.46 | 1 | ADFP | 19.08 | 4 | PHLDA2 | 201.93 | 1 |
SCUBE2 | 69.27 | 1 | SCNN1A | 19.25 | 2 | S100A2 | 233.07 | 0 |
TFF1 | 76.74 | 1 | LOXL2 | 21.37 | 3 | MSLN | 234.39 | 1 |
MYB | 82.17 | 0 | ELTD1 | 27.17 | 4 | WFDC2 | 236.00 | 1 |
CYP2B7P1 | 86.93 | 1 | PPARGC1A | 30.73 | 1 | ITGB6 | 238.13 | 0 |
PDZK1 | 98.81 | 0 | IFITM1 | 31.19 | 2 | HK2 | 239.87 | 2 |
PADI2 | 114.44 | 0 | RALGPS1 | 37.17 | 2 | R88990* | 244.34 | 0 |
DNAJC12 | 123.83 | 0 | VWF | 37.85 | 2 | ANO1 | 252.57 | 1 |
TSPAN1 | 126.87 | 0 | CD70 | 41.65 | 0 | MXRA5 | 261.28 | 0 |
CDH3 | 127.46 | 1 | ARHGDIB | 42.60 | 1 | PLEK2 | 264.09 | 0 |
XBP1 | 134.70 | 0 | P4HA1 | 48.91 | 2 | CDC2 | 279.79 | 2 |
KRT18 | 136.35 | 0 | BST2 | 50.56 | 2 | VCAN | 285.59 | 0 |
EEF1A2 | 138.25 | 0 | F2R | 52.22 | 1 | FERMT1 | 286.92 | 1 |
SLC16A6 | 140.73 | 1 | SPARC | 52.86 | 1 | MCOLN3 | 309.32 | 0 |
ACADSB | 142.55 | 1 | LDB2 | 56.29 | 2 | TNFRSF21 | 315.68 | 1 |
SRD5A1 | 159.99 | 1 | GJA1 | 58.54 | 0 | KYNU | 324.78 | 0 |
CHAD | 164.19 | 0 | PLAG1 | 60.29 | 1 | TACC3 | 333.27 | 0 |
P4HTM | 165.08 | 1 | DSG2 | 68.03 | 1 | TMC5 | 336.72 | 0 |
*Gene symbol not available, using accession number instead.
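The two summary columns in the table above can be illustrated with a short sketch. It assumes that each gene has a differential expression score per dataset, that rank 1 marks the largest absolute score, and that the dataset weights are proportional to sample size. The weighting scheme is an assumption made for this example, the function names are hypothetical, and the toy data use a top-2 cutoff in place of the top-20 cutoff.

```python
import numpy as np


def ranks_from_scores(scores):
    """Rank genes within each dataset; rank 1 = largest absolute score.

    `scores` has shape (n_genes, n_datasets).
    """
    magnitude = np.abs(np.asarray(scores, dtype=float))
    order = np.argsort(-magnitude, axis=0, kind="stable")
    ranks = np.empty_like(order)
    cols = np.arange(magnitude.shape[1])[None, :]
    ranks[order, cols] = np.arange(1, magnitude.shape[0] + 1)[:, None]
    return ranks


def summarize(ranks, weights, top_k=20):
    """Return the weighted average rank and the top-k count per gene."""
    weights = np.asarray(weights, dtype=float)
    weighted_avg = ranks @ weights / weights.sum()
    top_k_count = (ranks <= top_k).sum(axis=1)
    return weighted_avg, top_k_count


# Toy scores: 4 genes (rows) in 3 datasets (columns), made-up values.
scores = [
    [3.0, 2.5, 4.0],
    [-1.0, 0.2, 1.5],
    [0.5, -3.0, 0.1],
    [2.0, 1.0, -2.0],
]
ranks = ranks_from_scores(scores)
# Assumed weights: hypothetical sample sizes of the three datasets.
weighted_avg, top_count = summarize(ranks, weights=[50, 100, 50], top_k=2)
```

Sorting genes by `weighted_avg` in ascending order yields the final gene list, and `top_count` shows how many individual datasets support each gene.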
We compare the renal cancer clear cell subtype to three other subtypes (i.e., chromophobe, oncocytoma, and papillary) to identify DEGs. The top gene we identify is LOX, which is an oncogene implicated in clear cell renal cancer [
The rank average meta-analysis method identifies S100P as the top pancreatic cancer gene, which has been implicated in several studies [
The degree of differential expression (and consequently, the rank) of a gene can vary significantly from dataset to dataset. Combining DEG detection results by averaging ranks across datasets reduces variability and improves statistical confidence. Analysis of a single microarray dataset may produce errors during DEG detection: false positives (genes that are favorably ranked but not truly differentially expressed) and false negatives (genes that are truly differentially expressed but not favorably ranked). In general, these errors can be reduced by increasing sample size. Combining microarray datasets by averaging ranks effectively increases sample size while enabling robust analysis of heterogeneous data.
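The variance-reduction argument can be checked with a small simulation. This sketch uses synthetic scores only: it assumes independent datasets, standard normal noise, and a single truly differential gene with a hypothetical effect size. It is not a reanalysis of the study data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_datasets, n_trials = 200, 5, 300
effect = 2.0  # assumed effect size of gene 0, in noise standard deviations

single_ranks, averaged_ranks = [], []
for _ in range(n_trials):
    # Simulated differential expression scores: noise everywhere, signal in gene 0.
    scores = rng.normal(size=(n_genes, n_datasets))
    scores[0] += effect
    # Rank 1 = highest score within each dataset (double argsort gives ranks).
    ranks = (-scores).argsort(axis=0).argsort(axis=0) + 1
    single_ranks.append(ranks[0, 0])         # rank of gene 0 in one dataset
    averaged_ranks.append(ranks[0].mean())   # rank of gene 0 averaged over datasets

single_sd = float(np.std(single_ranks))
averaged_sd = float(np.std(averaged_ranks))
```

Under these assumptions, the spread of the averaged rank (`averaged_sd`) is smaller than the spread of the single-dataset rank (`single_sd`), which is the sense in which rank averaging stabilizes the position of a truly differential gene.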
In order to understand the differences in performance among the six meta-analysis-based FS methods, we identify and list the differences and similarities in Table
Properties of six microarray meta-analysis methods.
Property | Rank average | mDEDS | Rank products | Choi | Wang | Naive (control) |
---|---|---|---|---|---|---|
Basic FS methods considered | FC, | FC, | FC1 | | FC3 | FC, |
Chooses data-specific basic FS method(s) | | No | No | No | No | |
Rank-based | | No | | No | No | No |
1Fold change between all interclass pairs of samples. 2Most similar to a
Among the five meta-analysis methods (not including the naive control method) rank average and mDEDS are the only methods that consider multiple basic FS methods—for example, fold change,
Among the basic FS methods, no method can be considered the best because of the data-dependent nature of microarray analysis. Thus, rank average and mDEDS benefit by considering multiple basic FS methods. However, some basic FS methods can produce erroneous results when inappropriately applied (e.g., using a
Despite the benefits summarized in Table
To address the sample-size problem in gene expression analysis, as well as the need for accurate solutions to clinical prediction problems, we proposed the rank average meta-analysis-based FS method. Rank average meta-analysis identifies differentially expressed genes from multiple microarray datasets. Using a comprehensive study of multiple factors, we found that rank average performs consistently well compared to five other meta-analysis methods in terms of prediction performance. This study enabled us to measure the robustness of rank average to three factors that are often encountered in clinical prediction applications: clinical application (breast, renal, and pancreatic cancer), microarray data platform heterogeneity, and classifier model (logistic regression, diagonal LDA, and linear SVM). Rank average meta-analysis performs well because it selects dataset-specific basic FS methods and then averages the ranks across all individual datasets to produce a final robust gene ranking. The rank average method is not the best of the six methods for every factor combination. However, it is consistently among the best performing in terms of its ability to identify predictive genes. Although we presented results from analysis of microarray gene expression data, the proposed methods may be generalized to other bioinformatics problems that require feature selection.
This work was supported in part by grants from the National Institutes of Health (Bioengineering Research Partnership R01CA108468, Center for Cancer Nanotechnology Excellence U54CA119338); the Georgia Cancer Coalition (Distinguished Cancer Scholar Award to M. D. Wang); Hewlett Packard; and Microsoft Research. The funding sources listed here have supported this multiyear investigation of microarray meta-analysis for clinical prediction, including the stipends and salaries of multiple coauthors, computing hardware and software licenses, travel expenses to technical meetings to present this work, and publication expenses.