MAVTgsa: An R Package for Gene Set (Enrichment) Analysis

Gene set analysis methods aim to determine whether an a priori defined set of genes shows statistically significant difference in expression on either categorical or continuous outcomes. Although many methods for gene set analysis have been proposed, a systematic analysis tool for identification of different types of gene set significance modules has not been developed previously. This work presents an R package, called MAVTgsa, which includes three different methods for integrated gene set enrichment analysis. (1) The one-sided OLS (ordinary least squares) test detects coordinated changes of genes in gene set in one direction, either up- or downregulation. (2) The two-sided MANOVA (multivariate analysis variance) detects changes both up- and downregulation for studying two or more experimental conditions. (3) A random forests-based procedure is to identify gene sets that can accurately predict samples from different experimental conditions or are associated with the continuous phenotypes. MAVTgsa computes the P values and FDR (false discovery rate) q-value for all gene sets in the study. Furthermore, MAVTgsa provides several visualization outputs to support and interpret the enrichment results. This package is available online.


Introduction
DNA microarray technology enables simultaneous monitoring of the expression level of a large number of genes for a given experimental study. Much initial research on methods for data analysis has focused on the techniques to identify a list of differentially expressed genes. After selection of a list of differentially expressed gene, the list is then examined with biologically predefined gene sets to determine whether any sets are overrepresented in the list compared with the whole list ( [1][2][3]). Mootha et al. [4] proposed gene set enrichment analysis (GSEA), which considers the entire distribution of a predefined gene set rather than a subset from the differential expression list. GSEA provides a direct approach to the analysis of gene sets of interest and the results are relatively easy to interpret. Furthermore, microarray experiments inherit various sources of biological and technical variability, and analysis of a gene set is expected to be more reproducible than an individual gene analysis.
GSEA is a statistical approach to determine whether a functionally related set of genes expresses differently (enrichment and/or deletion) under different experimental conditions. The GSEA approach has inspired the development of various statistical tests for identifying differentially expressed gene sets [5][6][7][8][9][10][11][12][13][14][15][16]. There are two fundamental hypotheses for GSEA: competitive hypothesis and self-contained hypothesis [8]. The competitive hypothesis tests if the association of a gene set with the phenotype is equal to those of the other gene sets. The self-contained hypothesis tests if the expression of a gene set differs by the experimental condition. In either test, resampling methods are typically used to generate the null distribution of test statistics. The null distributions of statistic under the competitive hypothesis are generated by gene sampling; the null distributions under the self-contained hypothesis are generated by subject-sampling. Various GSEA statistics have been proposed for testing either the competitive hypothesis or self-contained hypothesis. There are also hybrid-type methods that utilize gene and sample sampling (e.g., [5,9]). The differences of the two hypotheses were evaluated and summarized in Nam and Kim [14] and Dinu et al. [15]. Gene sampling under the competitive hypothesis simply reassigns the genes into different gene sets. As a result, the sampled distributions are not independent. On the other hand, subject sampling is consistent with the null hypothesis, which the null distributions of statistical tests are identically and independently distributed. In addition, the self-contained hypothesis is a generalization from identification of differentially expressed genes to identification of differentially expressed gene sets. Therefore, we consider self-contained hypothesis and corresponding permutation values in this paper.
In this paper, a statistical test to determine whether some functionally predefined classes of genes express differently under different experimental conditions is referred to as gene set analysis (GSA). When there are two experimental conditions, a GSA can be a one-sided or two-sided test. A one-sided test is to detect the changes of gene expressions in the gene set in one direction, either up-or downregulation. The two-sided test is to detect the changes in both upand downregulation [6,15]. When the goal is to detect coordinated changes in one direction, the one-sided hypothesis is appropriate. However, in an exploratory context, it is impossible to prespecify how individual genes in a gene set will respond in different conditions. Hence, two-sided hypothesis is generally suggested.
The GSEA software only performs one-sided test in two conditions at a time [4]. Another software provides either one-sided or two-sided test; for example, BRB-Array tools [18] provided LS (KS) statistics, Goeman's global test, and MaxMean test. When the null hypothesis is that no genes in the gene set are differentially expressed, the two-sided is more appropriate than other methods.
When the purpose of study is to build up prediction rule based on gene expression profile, utilizing the existing biological knowledge, such as biological pathway or cellular function information, has been showed to improve the classification accuracy (e.g., [19][20][21]). The selection of predictors can be either preselected by testing or chosen based on classification algorithm, and the final identified predictors are also considered as differential expressed. Hsueh et al. [22] and Pang et al. [23] proposed a random forests-based differential analysis of gene set data in terms of predictive performance of gene sets. The analysis contained not only a classifier but also the feature importance of the input gene sets.
The gene set analysis has been considered to accommodate continuous phenotype. Linear combination test [17] has been extended from binary to continuous phenotype (LCT) by utilizing linear regression function [24]. Significance analysis of microarrays for gene sets (SAM-GS) [15] and global test [11] have also been extended in a generalized linear model (GLM) framework. Dinu et al. [24] compared the type I error rates and powers for the three methods and concluded that LCT approach is powerful and computationally attractive. The random forests-based approach could be also applied to continuous phenotype by modifying random forest classification to random forest regression. A small simulation experiment to compare the random forests-based and LCT is reported in this paper.
The main purpose of this paper is to present the MAVTgsa R package tool for GSA for study with categorical phenotypes or continuous phenotypes. The OLS one-sided test, MANOVA test, and random forest-based analysis are implemented in MAVTgsa. When categorical phenotypes involve more than two classes, the three multiple comparison procedures are implemented: Dunnett, Tukey, and sequential pairwise comparison. The program provides two visualization plots: GSA plot, a value plot of GSA for all gene sets in the study, and GST, a plot of the empirical distribution function of the ranked test statistics of a selected gene set. The researcher is able to summarize and visualize the gene expression data in gene set analysis. More importantly, the adjusted value of family-wise error rate (FWER) [25] and FDR step-up procedure [26] are computed in this package. When the outcome of interest is a continuous phenotype, the random-forests based analysis is applied in the regression context. Figure 1 describes the procedure of how to implement the MAVTgsa package for gene set enrichment analysis.
Section 2 describes the methods implemented in the MAVTgsa package. In Section 3, we present a description of the MAVTgsa package including data input, optional parameters, output, and result visualization. In Section 4, we apply to two real data sets, P53, and breast cancer, for illustration. This paper is concluded by a summary.

The OLS Test.
The OLS test was developed to detect changes in one direction, either up-or downregulation, for a study with two experimental conditions. Consider a gene set consisting of m genes with two conditions of sample size n 1 and n 2 . Let = ( 1 , . . . , ) be the -vector of intensities for simple ( = 1, . . . , ) in th condition ( = 1, 2). Denote the standardized variable * = ( − )/ , where is the overall sample mean for the th gene and is the pooled standard deviation. Let = Σ * / be the mean of the standardized variable for the th in the th condition. Denote the z = ( 1 , . . . , ) as the -vector of the standardized where 1 is a × 1 vector of 1's and V is the pooled sample covariance matrix of d . If d is a multivariate normal, then the OLS statistic ols has an approximately distribution with The one-sided OLS statistic is the most widely used global test for the analysis of multiple clinical endpoints [27]. This test is very powerful when the changes in the gene expression are in the same direction. The direction of changes of significant gene set can be checked from the OLS statistic The OLS statistics account for gene set size and correlation structure [6]. In addition, the permutation values for the OLS test are suggested to compute as the level of significance. ) be the -vector of intensities for simple ( = 1, . . . , ) in th condition ( = 1, . . . , ). The MANOVA model [16] can be expressed as = + , where is -vector of residuals with Var( ) = Σ and is the -vector of means for the th condition. MAVTgsa occupied the Wilks' Λ as the statistic for MANOVA test. The formula of Wilks' Λ is given as where 's are the eigenvalues of the matrix (= −1 ), and is within sum of squares matrix (sample covariance matrix) and is between sum of squares matrix. The number of eigenvalues is equal to the minimum of the number of genes (m) and the number of conditions minus 1 (c − 1). When the number of genes in the gene set is greater than the number of samples, the matrix is singular and illconditioned. The shrinkage covariance matrix estimator ( * ) proposed by Schafer and Strimmer [28] is applied to estimate the sample covariance matrix and given as and * = min{1, max(0, 1 −̂ * )}, where and , respectively, denote the empirical sample variance and sample correlation, and the optimal shrinkage intensitŷ * is estimated bŷ * The distribution of Wilks' Λ under the null hypothesis of no difference in responses among the conditions was estimated by the permutation method. The Wilks' Λ test is equivalent to Hotelling's 2 test when there are only two conditions. The Hotelling's 2 statistic is where and denote the sample mean vector and sample covariance matrix of the th group ( = 1, 2), respectively, and  [16] have compared the MANOVA test with several GSEA tests including the two one-sided tests: GSEA [5] and MaxMean [9], and four two-sided tests: ANCOVA [7], Global [11], PCA [13], and SAM-GS [15]. Their simulation showed that MANOVA test performs well in terms of controlling the type I error and power as compared to other tests, except that the ANCOVA [7] was more powerful when the variances were equal across all genes in the gene set.

Random-Forests Based
Analysis. The random forests (RF) [29] is a popular classification and regression algorithm based on an ensemble of many classification/regression trees combined with a bootstrap sample out of the original dataset. The error rate of a classification tree or the mean of squared error (MSE) of a regression tree is calculated by the observations outside the bootstrap sample called the out-of-bag (OOB) data. Lower error rate and MSE indicates that the corresponding gene set is with higher prediction accuracy for a categorical phenotype and higher percent variance explained for a continuous phenotype, respectively. The variable selection for each split at the nodes of trees is conducted only from a small random subset of the predictors, so that there is no need to deal with the "small large " problem. A permutation based value can be obtained as where 0 is the observed score from the random forests analysis, ( ) is the score in the th permutation, and is the total number of permutations. The score is calculated as the OOB error rate and MSE for the categorical and continuous phenotypes, respectively. The gene set is considered as significantly expressed if the correspondent value is less than or equal to the significance level . To measure the importance of each predictor variable, the random forests algorithm assesses the importance of a variable by looking at how much prediction error increases or MSE decreases when that variable of the (OOB) data is permuted, while remaining variables are left unchanged. Here, the mean decrease in the Gini index (MDG) and the mean decrease in MSE are employed as the importance measures of genes for categorical and continuous phenotypes, respectively.

Multiple Comparisons Methods.
In MANOVA test, the null hypothesis is rejected if one or more of the mean differences or some combination of mean differences among the genes in gene set differs from zero. If the null hypothesis is rejected, three multiple comparison procedures are implemented to identify which two conditions differ in expression between gene sets for studies with more than two conditions. Three multiple comparisons methods are Dunnett, Tukey, and sequential pairwise comparison. Dunnett test is specifically designed for situations where all groups are to be compared against the "reference" group. Tukey test is for all pairwise comparisons. When the conditions are in order, sequential pairwise comparison is appropriate to be occupied for comparing with previous condition.

MAVTGSA Package Description
MAVTgsa is a software tool to evaluate the expressions of a priori defined gene sets under different experimental conditions. The package implements essentially the method described in the previous section and the main functions are MAVTn() and MAVTp().

MAVTn.
For experiments with two or more than two conditions, the input parameters to perform the testing are described as follows.
MAVTn(DATA, GS, Alpha, nbPerm, MCP) (i) DATA is a gene expression data matrix with samples in columns. The first row contains the information of the experimental condition of each sample. The remaining rows contain gene expression.
(ii) GS is a binary matrix coded 0 or 1 with genes in rows. Each column represents a predefined gene set, with row equal to 1 indicating that the gene is in the gene set, and 0 otherwise.
(iii) Alpha is the significance level.
(iv) nbPerm specifies the number of permutations, and at least 5,000 is recommended.
(v) MCP specifies one of three multiple comparison methods, Dunnett = 1, Tukey = 2, and sequential pairwise comparison = 3. MCP can be ignored when the number of experimental conditions is 2.
The output of the MAVTn() function is a list of objects which contains the following. In addition, a plot of values for all gene sets is drawn to examine the adequacy of the assumptions on which the distributions of the test statistics are based.

MAVTp.
For experiments with categorical or continuous phenotypes, the input parameters to perform the random forests analysis are described as follows.

Visualization Plots.
Two visualization plots are implemented, GSA and GST. The GSA plot ( Figure 2) provides a value plot of all gene sets considered in the study. The value plot is the plot of ordered value versus its rank. The value plot can provide an overall assessment of differences in expression among conditions for all gene sets considered in the study. Under the null hypothesis of no differences, the values should be uniformly distributed on the interval (0, 1); the value plot should be a straight line. If a null hypothesis is not true, then its value will tend to be small. The GST plot displays the relative direction (in two conditions) and statistics ranking for genes in a gene set. The GST plot is derived from the SAFE plot which provides the empirical distribution function for the ranked statistics of a given gene category [30]. The GST plot displays the ranked test statistics (red line) and empirical cumulative distribution function of these test statistics for expressed genes in a gene set (solid line). Tick marks above the plot display the location of genes with gene names. The shaded regions are set to represent the statistics that the values below the given alpha value (Figures 3-5). The input parameters to perform the analysis are described as follows. Two examples are presented in next section to show the output of GST plot.

P53 Study.
The MAVTgsa was applied to a P53 dataset. The P53 dataset is from a study to identify targets of the transcription factor P53 from 10,100 gene expression profiles in the NCI-60 collection of cancer cell lines. There are 308 6 BioMed Research International  gene sets in the P53 study. The mutation status of the P53 gene has been reported for 50 cell lines included 17 wild-type and 33 mutation samples. The dataset is publicly available at the GSEA website (http://www.broad.mit.edu/gsea/). Table 1 shows  IL3RA  IGF1  BAD  BCL2L1  YWHAH  PRKACB  PRKAR2B  PIK3R1  KITLG  AKT1  IGF1R  KIT  IL3  CSF2RB  ADCY1  PRKACG  by OLS test, but they are not significant in Hotelling's 2 test. In contrary, the gene set badPathway is highly significant in Hotelling's 2 test but not in OLS test. Figure 2 is the GSA-plot for two groups. Figures 3 and 4 are the GST plots for the gene set rasPathway and badPathway, respectively. In Figure 3, two of 22 genes are underexpressed with the value less than 0.01. On the other hand, one of the 21 genes in badPathway shows underexpressed and one shows overexpressed in Figure 4.
The results indicate that the power of methods to detect differential expressed gene set depends on the global pattern of genes within gene set. Combining the information from the two tests and GSA-plot will be useful to get biological meaningful interpretation.

Breast Cancer Dataset.
We applied the MAVTgsa to a breast cancer dataset [31] and illustrate the RF, MANOVA, and a multiple comparison analysis. The dataset consisted of three conditions with 1,113 genes and 96 samples. Three conditions were tumor grades 1, 2, and 3 with the sample sizes of 11, 25, and 60, respectively. There were nine cancer related pathways for gene set analysis. Table 2 shows the values of the MANOVA and the post hoc Dunnett's analysis for each of the nine pathways. Figure 5 is the GST plots for the 31 genes in the gene set cell cycle control. Five of the 31 genes are differential expressed with value less than 0.01 in the breast cancer samples. In this analysis the total computation time used to perform the analysis was approximately four hours and 10 minutes with nbPerm = 10,000. A classification rule to classify the three tumor grades is also constructed. The error rates of 10-fold cross-validation are given in Table 2.

Simulation Study.
In order to understand how well the random forests-based analysis performs for continuous phenotypes, we conducted a number of simulations to evaluate the performance in terms of type I error and power and compare to the LCT method [24]. The simulation design was similar to that considered by Dinu et al. [24]. For each gene set of size 24, we generated a by gene expression matrix × with sample size of and a linear model was used to simulate the phenotype data associated with the gene set . The gene expressions matrix was generated from a multivariate normal distribution with mean from a uniform (0, 10) distribution, variance from a uniform (1,5) distribution, and a mixed intragene set correlation structure as follows: where the correlation was set at 0, 0.3, 0.5, and 0.9 in this study. For each gene set, a continuous phenotype vector ×1 is generated from a multivariate normal distribution MVN( , I), where is the effect size vector with length and I the identity covariance matrix. In the null model, with no association of gene expressions on the phenotype, we set to be 0 to investigate the type I error and the simulation scenarios varied according to sample sizes ( = 10, 20, or 50), gene set sizes ( = 20, 100, or 200), and the levels of correlation among genes ( = 0, 0.3, 0.5, or 0.9) within a gene set at different number of genes ( 1 = 5, 20, or 40). In the alternative model, the sample size and gene set size were, respectively, fixed at 20 and 100, as well as only 10 of genes were simulated to be associated with the phenotype to investigate power of two gene set analysis methods. First, we randomly generated 5 of the first 20 components of from normal distribution (], |]|) and another five of the next 20 components of from normal distribution (−], |]|) as up-and downregulated genes, respectively. The rest of components of were set to be 0. The effect sizes (]) for the associations were set at 0.2, 0.6, 1.0, 1.4, and 1.8. For each scenario, the simulation data were replicated 1000 times to estimate type I error rate or power. The values were based on 1000 permutations. Power was then estimated as the proportion of significance using the nominal level of 0.05. Table 3 showed the empirical type I errors using the nominal level of 0.05 for each scenario. The type I errors from the random forests method were reasonably close to or below the nominal level, while the LCT method appeared to have an inflated type I error rate in most cases. It indicates that the RF method gives a conservative conclusion. Such conservativeness may lead to power loss in detecting a difference. Figure 6 illustrated the empirical powers using the nominal level of 0.05 for = 0.0, 0.3, 0.6, and 0.9. As expected, the RF method was slightly inferior to the LCT method, while LCT was unable to adequately control the type I error rate. However, with increasing of correlations among genes, both methods appeared to be equivalent. The powers of both methods increased gradually with increasing correlations.
In addition, to explore the robustness of the RF method with regard to nonlinear association data, the continuous phenotype vector ×1 was generated from a multivariate normal distribution MVN(exp( ), I). Figure 7 showed the average power over 1000 simulations for each method using the nominal level of 0.05. As a result, the RF provided a more powerful test than the LCT method to detect the non-linear association between gene sets and continuous phenotypes.

Conclusion
The MAVTgsa package performs a systematic gene set analysis for identification of different types of gene set significance modules. The user can select the most appropriate analysis or combine them to provide insight into gene sets that respond in a similar manner to varying phenotypes and that might therefore be coregulated. For studies with more than two conditions, MAVTgsa not only provides the MANOVA test to identify gene sets consisting of differentially expressed genes but also implements three multiple comparison methods for post hoc analysis. The method implemented in MAVTgsa 8 BioMed Research International  package has the advantage of real application. First, the MAVTgsa package provides the adjusted values using both family-wise error rate (FWER) [25] and the FDR step-up procedure [26] using permutation method. Second, MAVTgsa is a permutation-based method to compute values and adjusted values. Third, MAVTgsa displays the results for individual genes test of significant gene sets. Finally, MAVTgsa draws the GST plot to display the empirical cumulative function for the ranked test statistics of a given gene set with the gene location and gene name above the plot. These allow the user to analyze the interesting gene set of the data easily. In addition, MAVTgsa provides a random forestsbased procedure to identify gene sets in terms of predictive performance or in association with the continuous phenotypes. Random forests method has been proved to perform well in comparison with the other classification methods and successfully applied to various problems. Most importantly, it can accommodate multiclass and continuous phenotypes for the GSA, even if the associations between gene sets and phenotypes are nonlinear and involve complex high-order interaction effects.

Hard Ware and Software Specifications
The implementation and examples run of this package were conducted on a laptop computer with 2.8 GHz CPU and 3.0 GB RAM under the Microsoft Windows XP Professional SP3 using the R software version 2.14.1.

Disclaimer
The views presented in this paper are those of the authors and do not necessarily represent those of the U.S. Food and Drug Administration.