Robust Significance Analysis of Microarrays by Minimum β-Divergence Method

Identification of differentially expressed (DE) genes with two or more conditions is an important task for discovery of few biomarker genes. Significance Analysis of Microarrays (SAM) is a popular statistical approach for identification of DE genes for both small- and large-sample cases. However, it is sensitive to outlying gene expressions and produces low power in presence of outliers. Therefore, in this paper, an attempt is made to robustify the SAM approach using the minimum β-divergence estimators instead of the maximum likelihood estimators of the parameters. We demonstrated the performance of the proposed method in a comparison of some other popular statistical methods such as ANOVA, SAM, LIMMA, KW, EBarrays, GaGa, and BRIDGE using both simulated and real gene expression datasets. We observe that all methods show good and almost equal performance in absence of outliers for the large-sample cases, while in the small-sample cases only three methods (SAM, LIMMA, and proposed) show almost equal and better performance than others with two or more conditions. However, in the presence of outliers, on an average, only the proposed method performs better than others for both small- and large-sample cases with each condition.


Introduction
Microarray experiments are usually conducted with expressions of huge number of genes ( ) and a small number of experimental samples ( ). This unique data structure has been discovered as a completely new promising area for the researchers. At the same time, it provides a challenge to the researchers because of high dimensionality and its complexity with small sample size. Among this huge number of genes, discovery of few biomarker genes those are differentially expressed (DE) between two or more experimental conditions with multiple patterns is one of the main objectives of this experiments. These biomarker genes are important in the diagnosis of different types and subtypes of diseases for patient prognosis and treatment [1][2][3]. Nowadays, researchers are also interested in exploring the gene coexpression network or interaction of DE genes to predict the hub genes that are associated with different types and subtypes of cancer [4]. The most commonly used statistical tests for the discovery of DE genes between two or more conditions are t-test or ANOVA (F-test). However, both testing procedures sometimes produce misleading results to discover few biomarker genes, because both of them suffer from small-sample sizes and normality assumptions, and they do not share the information of all genes [5]. Therefore, a gene-specific t-statistic or F-statistic becomes large even for low differential expressions of genes between two or more conditions. Thus, the false discovery rate (FDR) may increase. Tusher et al. [6] introduced a popular statistical technique to detect the DE genes by assimilating a set of gene-specific t-tests. This approach is known as Significance Analysis of Microarrays (SAM). It controls the FDR by 2 BioMed Research International sharing information of all genes. It does not suffer from the small-sample sizes and normality assumptions. There is a close relation between SAM and FDR-based multiple testing correction approach of Benjamini and Hochberg [7]. Because of low expense and advancement of microarray technology, SAM has been extensively used in gene expression data analysis. For example, Li et al. [8] performed an integrative data analysis to identify the breast cancer subtype-specific biomarkers by combining copy number aberrations and miRNA-mRNA dual expression profiling data. Tarhini et al. [9] identified some immune-related genes that are associated with neoadjuvant ipilimumab clinical benefit. Wu et al. [10] detected some potential biomarkers for the diagnosis and prediction of preeclampsia. Ren et al. [11] identified some subtype-specific novel biomarkers using colon cancer gene expression data. However, SAM is very sensitive to outliers. It produces larger FDR and lower power in presence of outliers. Therefore, in this paper, an attempt is made to robustify the SAM [6] approach using the minimum -divergence method [12]. Then we investigate the performance of the proposed method in a comparison of traditional SAM approach as well as some other popular methods such as ANOVA [13,14], LIMMA [15], Kruskal-Wallis (KW) [16], empirical Bayes (EB) [17][18][19], Gama-Gama model (GaGa) [20], and Bridge [21] using both simulated and real gene expression datasets. The paper is organized as follows: Section 2 contains the formulation of traditional SAM algorithm and proposed robust SAM algorithm with detailed description. Simulation and real microarray data analysis are carried out in Section 3. Finally, we end up with a conclusion.

Methods
Let be the th gene expression for the th samples ( = 1, 2, . . . , ; = 1, 2, . . . , ; = 1, 2). Also let denote the mean of the th gene for kth condition. Then we would like to test the hypothesis H 0 : 1 = 2 versus H 1 : 1 ̸ = 2 which implies that H 0 : 1 − 2 = 0 versus H 0 is not true. A gene is said to be equally expressed (EE) if H 0 is accepted; otherwise it is DE. If̂denotes the sample mean of th gene for th condition and 2 denotes the pooled within-class sample variance then the formula of two-sample t-test to test the above null-hypothesis is as follows: where Here, = 1/ ∑ ( − 1) ⋅ (∑ (1/ )), The t-statistic given in (1) follows the t-distribution with ( 1 + 2 − 2) degrees of freedom. As early mentioned this test statistic increases the FDR for small-sample cases.
To overcome this problem Tusher et al. [6] proposed a modification of the t-statistic by adding a constant 0 to the denominator, which is known as the test statistic of Significance Analysis of Microarrays (SAM) algorithm. This statistic is defined as follows: where 0 is the percentile of the distribution of . For > 2 conditions, the modified t-statistic in (4) is defined in terms of Fisher's linear discriminant, assuming samples consist of nonoverlapping subsets, such that the response parameter ∈ {1, 2, . . . , }, = { : = }, and is the number of expressions in . Then the scores and standard deviation in (4) are replaced by the following two equations, respectively: wherê= ∑̂/ ∑ . For the details about SAM procedure visit http://statweb.stanford.edu/∼tibs/SAM/. However, the test statistic given in (4) produces misleading results in presence of outliers, sincêand̂2 in (2) and (5) are sensitive to outliers. Therefore, in this paper, an attempt is made to robustify the test statistic, SAM , in (4) by minimum -divergence method. The minimumdivergence estimatorŝ, = (̂, ,̂2 , ) of the parameters = ( , 2 ) are computed iteratively as follows: Here which is known as -weight function. This weight function was first introduced in [12] for the robustification of prewhitening procedure to improve the performance of independent component analysis (ICA) algorithms for blind source separation (BSS). It was generalized in [22] for the robust extraction of local principal components. Then it was used in [23] for the robustification of empirical Bayes approach [18] to identify DE genes between two conditions.
It was also used in [24] to improve one-way ANOVA for the robust and efficient estimation of DE genes with multiple patterns. Note that if = 0, the minimum -divergence estimatorŝ, = (̂, ,̂2 , ) reduce to classical maximum likelihood estimators (MLEs)̂= (̂,̂2 ). Since, in absence of outliers, the MLEs of Gaussian distribution are consistent and asymptotically efficient [25], therefore, in this paper, the MLEs are used in absence of outliers and in presence of outliers, the minimum -divergence estimators are used for the estimation of in SAM approach. We can apply our proposed SAM approach in two ways. One way is to select the using the cross validation (CV) which was discussed in detail in [12,22]. The CV approach produces = 0 in absence of outliers, while > 0 in presence of outliers. The minimum -divergence estimators in (6) with = 0 are equivalent to MLEs in (3) as mentioned previously. Thus the minimum -divergence estimators with an appropriate selection by CV produce both robust and efficient estimates for the parameters. However, in our current problem, it would be time-consuming, since CV approach needs to be applied for each gene of the entire genome in each condition separately to select the appropriate . To overcome this problem, in this paper we consider outlier detection approach based on -weight function with a fixed > 0. The value of -weight function lies between 0 and 1. It produces larger weights with the usual gene expressions and smaller weights with the unusual/outlying gene expressions for a wide range of > 0 [12,22]. By assigning low weights to outliers, the estimators become robust. The larger increases the robustness of estimators but decreases the efficiency, while the smaller increases the efficiency but decreases the robustness. Thus the value of controls the balance between the robustness and efficiency of the estimators. Therefore, in this paper, we fix = 0.2 for outlier detection using -weight function which was also used in [24]. A value ( ) of gene expression is classified as usual or unusual based on thisweight function as follows: If th expression for th gene is contaminated by outlier in the th condition where we fix the cutoff value = min(0.2, 0 ), sinceweights lie between 0 and 1 and smaller weights occur with unusual/outlying gene expressions. Here 0 is calculated by the following equation: which was also used in [22]. Here, is a smaller quantity; in our analysis we consider = 0.1. After convergence of (6), we obtained the robust estimateŝ, = (̂, ,̂2 , ) of the parameters = ( , 2 ). Then we combine the MLEs and minimum -divergence estimators to estimate the parameters = ( , 2 ) as follows: , If th gene expression is contaminated by outlier in the th condition , otherwise.

Results and Discussion
To demonstrate the performance of the proposed method in a comparison of other popular methods (ANOVA, SAM, LIMMA, KW, EB, GAGA, and BRIDGE), we used both simulated and real microarray gene expression datasets. We used five R packages of other methods such as samr, limma, EBarrays, gaga, and bridge. The performance measures AUC and pAUC were computed for each of the methods using ROC package. All R packages are available in the comprehensive R archive network (cran) or bioconductor.

Performance Evaluation.
In order to investigate the performance of the proposed method in a comparison of some other popular methods for binary class prediction such as DE or EE, we use different performance measures including the receiving operating characteristic (ROC) curve, area under the ROC curve (AUC), and partial AUC (pAUC) derived through the confusion matrix as shown in Table 1.

Simulation Study 1.
To investigate the performance of the proposed robust SAM in a comparison with the traditional SAM approach, we generated a dataset using (11) with = 1, 000 genes, and = 20 samples, 10 each in condition 1 and condition 2. There are 100 DE genes in this dataset (50 upregulated and 50 downregulated). Then we employed SAM and proposed method in this dataset to determine the DE genes. Figures 1(a) and 1(b) represent the Q-Q plots for this dataset using SAM and proposed method. In this figure, the number of genes above the band in the upper right (red color) and below the band in the bottom left (green color) indicates the number of upregulated and downregulated DE genes, respectively. We observed that both SAM and our proposed method identified 88 true DE genes with Δ = 0.2 considering 12 false positives on an average. To evaluate the performance of these two methods in presence of outliers, we generate the outlying dataset using (12). We consider one outlier in each of 10% genes. Then we employed these methods in the outlying dataset to identify the DE genes. Figures 1(c) and 1(d) show the Q-Q plots using this outlying dataset. We can clearly see from Figure 1(c) that there are only 19 true DE genes identified by the SAM with Δ = 0.1 considering 42 false positives on an average, whereas in Figure 1(d) the proposed method identified 86 true DE genes with Δ = 0.2 considering 14 false positives on an average. The plots of smallest -weight for each of 1000 genes are displayed in Figures S1(a) and S1(b) in the supporting file (see Supplementary Material available online at https://doi.org/10.1155/2017/5310198) in absence and presence of outliers, respectively. Cleary we observe that, in absence of outliers, -weight function produces larger weights and in presence of outliers, it produces smaller weights (almost close to zero) for unusual/outlying gene expressions. The outlier genes are indicated in red color (see Figure S1(b)). So we may conclude that our proposed method performs better in both situations, in absence and presence of outliers.

Simulation Study 2.
To investigate the performance of the proposed method in a comparison of the other seven popular methods as early mentioned for = 2 conditions, we performed 100 simulations to generate 100 datasets for both small-( 1 = 2 = 3) and large-( 1 = 2 = 25) sample cases using (11). We set the arbitrary values ( 1 , 2 ) ∈ (3, 5) and 2 = 0.3 for datasets generated from each simulation. Each dataset for each case represented the gene expression profiles of = 10, 000 genes, with = ( 1 + 2 ) samples. The proportions of DE (pde) gene were set to 0.02, 0.04, 0.06, 0.08, and 0.1 for each of the 100 datasets. For these values, the theoretical numbers of DE genes are, respectively, 200, 400, 600, 800, and 1,000. To evaluate the performance of all the methods in presence of outliers, we generated 100 outlying datasets from each of the original datasets using (12). We consider one or two outliers in each of 10%, 20%, and 50% genes for each datasets. We computed average values of different performance measures such as TPR, TNR, FPR, FNR, FDR, AUC, and power based on 200, 400, 600, 800, and 1,000 estimated DE genes by eight methods (ANOVA, SAM, LIMMA, KW, EB, GAGA, BRIDGE, and proposed) for each of 100 datasets. Figure 2 and Figure S2 show the ROC curve based on 400 estimated DE genes by each of the methods, in absence and presence of one or two outliers in each of 10%, 20%, and 50% genes for both small-and large-sample cases, respectively. We observe that all the eight methods performed almost similarly in absence of outliers for both small-and large-sample cases, except ANOVA for small-sample case (see Figure 2

Simulation Study 3.
To demonstrate the performance of the proposed method in a comparison of other popular methods for > 2 conditions with multiple patterns, we generated 100 datasets for both small-( 1 = 2 = 3 = 4 = 3) and large-( 1 = 2 = 3 = 4 = 25) sample cases using (11). Each dataset was generated for ( 1 , 2 , 3 , 4 ) ∈ (3, 5) and 2 = 0.1. Each dataset contains the expression values for = 10, 000 genes with The proportion of DE gene was fixed at 0.02 for each of the datasets. We generated the outlying gene expression datasets using (12) as before. We investigated the performance of the proposed method in a comparison of the other popular methods that are suitable for multiplecomparison tests (ANOVA, KW, SAM, and LIMMA). We first applied these methods to classify DE or EE genes and estimated different performance measures such as AUC, pAUC, MER, and FDR by these methods. Results obtained from these methods are presented in Table 3.
From this table we observe that, for small-sample case in absence of outliers, three methods (SAM, LIMMA, and proposed) exhibited better performance (AUC > 80%) than  ANOVA and KW. KW performs worse in this case. But in presence of outliers, the proposed method outperforms other methods, producing more stable and consistent results (lower FDR and higher AUC, pAUC values). On the other hand, for large-sample case KW and proposed method performed well compared to the other methods (ANOVA, SAM, and LIMMA), in presence of outliers. Our proposed method exhibited slightly better performance than KW in this case, whereas in absence of outliers, they performed similarly. For both cases the Benjamini-Hochberg (BH) method was used to adjust the values for each of the methods. Figure S3 represents the boxplots of MER values estimated by these five methods based on 200 DE genes in absence and presence of outliers for small-sample case. This figure also supports the results of Table 3. To demonstrate the pattern-detection performance of these methods, we again generated 100 datasets using (11). This time we consider the gene expression profiles for = 1, 000 genes with 300 DE genes for sample size of 3 in each condition. These 300 DE genes consist of four different patterns. These patterns are shown in Figure S4. Table 4 represents the performance of different methods for detection of up-and downregulated genes in genes. We observed that, in absence of outliers, three methods (SAM, LIMMA, and proposed) perform well for detecting of number of (UP, DR) genes in different pairs. KW performed badly in this case compared to ANOVA, whereas, in presence of outliers, the proposed method performed better than other methods (ANOVA, KW, SAM, and LIMMA). So, from this simulated study, we may conclude that the proposed method outperforms other methods in presence of outliers and in    In absence of outliers One outlier with each of 10% genes One outlier with each of 20% genes One outlier with each of 50% genes  In this table the values within the  absence of outliers it keeps equal performance with other methods.

Real Microarray Data.
To evaluate the performance of the proposed method in a comparison of the other seven methods as mentioned earlier, we used four microarray datasets. The first dataset is the Colon Cancer dataset [26] which consists of 22 control and 40 colon cancer samples. The second dataset is the Leukemia dataset [27]. The third dataset is the Platinum Spike dataset [28], which consists of 18 spike-in samples (9 controls versus 9 tests). The last dataset is the Breast Cancer dataset [29], which included 3226 genes measured on 22 breast cancer samples (7 sporadic, 7 BRCA-1, and 8 BRCA-2).

Colon Cancer Microarray Dataset.
We used the colon cancer gene expression dataset. The dataset was downloaded from http://microarray.princeton.edu/oncology and was also used in the study [26]. The number of genes in this dataset is 2000. Figure 6(a) represents the Venn diagram of top 100 genes estimated by ANOVA, SAM, LIMMA, and proposed method. From this Venn diagram, we clearly observe that our proposed method shares more genes with other methods.
We further compared the performance of proposed method with two robust methods (KW and BRIDGE) and SAM (see Figure 6(b)). This comparison also revealed that proposed method shares more genes with SAM than KW or BRIDGE methods. There were 57 genes detected as common by these four methods. The proposed method also shared 18 genes with the SAM and KW methods, which were not detected by the BRIDGE method.

Leukemia Microarray Dataset.
This dataset was used in the study [27] and contains 7129 gene expressions for 72 leukemia samples in which 47 samples are acute lymphoblastic leukemia and 25 samples are acute myeloblastic leukemia. The results obtained from ANOVA, SAM, LIMMA, and proposed method based on top 100 estimated DE genes are presented in a Venn diagram in Figure 6(c). This figure shows that larger number of genes (86) is detected by these 4 methods. SAM and proposed method shared more genes (7) in this case. When comparing the proposed method with other robust methods (KW and BRIDGE) and SAM (see Figure 6(d)), we observed that 45 genes are common with these methods. The proposed method shares more genes with SAM than KW or BRIDGE methods. The proposed method also shared 20 genes with the SAM and KW methods, which were not detected by the BRIDGE method.  Figure 6(e) shows the M-A plot for this dataset. The red asterisk, blue triangle, and black circle are used for 312, 9, and 40 genes detected by proposed method, LIMMA, and SAM, respectively, that are common with valid 1944 DE genes' set (see Figure S5(c)).

BRCA Microarray
Dataset. This data comes from the breast cancer cDNA microarray experiment [29]. This dataset consists of 3226 genes from 22 breast cancer samples, which are also divided into three classes (7 sporadic, 7 BRCA-1, and 8 BRCA-2) according to their mutational status. Figure 7(a) represents the Venn diagram of top 100 genes identified by ANOVA, KW, SAM, LIMMA, and proposed method. From this Venn diagram we clearly observe that proposed method identified three genes that were not detected by the other methods (ANOVA, KW, SAM, and LIMMA). Then we explored the biological functions of these three genes (CTNNA1, NFKB1, and TM4SF1) using [30]. From this website we obtained GO (Gene Ontology), KEGG pathway, and disease association results. Using the GO database, we found that these genes are involved in biological processes and different molecular functions like negative regulation of programmed cell death, positive regulation of signal transduction, negative regulation of cell death, protein binding, protein complex, and molecular function (see S1 File (xls)). Figure S6 shows the Gene Ontology (GO) categories of  Figure  S5(c)).   these three (3) genes using directed acyclic graph (DAG). Using the KEGG database, we found that these genes are involved in cancer pathways with adjusted value = 0.0002 (see Table 6 and S2 File (.xls)). Table 7 represents the disease association results of these genes. In both tables the hypergeometric test is used to calculate the values and adjusted by Benjamini-Hochberg method for multiple testing corrections. Figure 7(b) represents the functional interactions (gene network) of these 3 genes were analyzed by GeneMANIA web server [31].

Conclusion
Differentially expressed (DE) genes identification to discover the disease biomarkers is one of the important tasks in microarray data analysis. Significance Analysis of Microarrays (SAM) is a popular statistical approach for identification of DE genes for both small-and large-sample cases. However, it is sensitive to outlying gene expressions and produces low power in presence of outliers. Therefore, in this paper, an attempt is made to robustify the SAM approach using the minimum -divergence method. We used MLEs, in absence of outliers with = 0 and in presence of outliers, the minimum -divergence estimators are used to calculate the SAM statistic. We revealed from the simulated and real gene expression data analysis with = 2 conditions that all the eight methods behave almost similar in absence of outliers, for both small-and large-sample cases. For large-sample case, in presence of outliers, three methods (KW, BRIDGE, and proposed) performed well compared to the other five methods (ANOVA, SAM, LIMMA, EB, and GAGA). However, the proposed method outperforms other seven methods for small-sample case, in presence of outliers. From the simulated dataset with = 4 conditions with multiple patterns, we observed that five methods (ANOVA, KW, SAM, LIMMA, and proposed) perform well in absence of outliers, both small-and large-sample cases, except KW for small-sample case, whereas in presence of outliers for large-sample case two methods (KW and proposed) perform well. However, the proposed method outperforms the other four methods (ANOVA, KW, SAM, and LIMMA) for smallsample case, in presence of outliers. Similar results were found from real gene expression data analysis. Therefore, we may conclude that, on an average, the proposed method performs better than the other methods.

Conflicts of Interest
The authors have no conflicts of interest to state.