Selecting Genes by Test Statistics

Gene selection is an important issue in analyzing multiclass microarray data. Among many proposed selection methods, the traditional ANOVA F test statistic has been employed to identify informative genes for both class prediction (classification) and discovery problems. However, the F test statistic assumes an equal variance. This assumption may not be realistic for gene expression data. This paper explores other alternative test statistics which can handle heterogeneity of the variances. We study five such test statistics, which include Brown-Forsythe test statistic and Welch test statistic. Their performance is evaluated and compared with that of F statistic over different classification methods applied to publicly available microarray datasets.


INTRODUCTION
Microarrays provide information about the expression level of the genes represented on the array. Such gene expression profiling has been successfully applied to class prediction, where the purpose is to classify and predict the diagnostic category of a sample by its gene expression profile [1,2,3,4]. Various machine learning methods are currently used for class prediction. However, the task of prediction by microarrays is challenging, due to a large number of genes (features) and a small number of samples involved in the problem. As a consequence, one has to identify a small subset of informative genes contributing most to the classification task. Performing feature selection is essential for microarray prediction problems, since high-dimensional problems usually involve higher computational complexity and bigger prediction errors.
Many methods have been proposed to select informative genes. One category of such work depends on the tra- ditional t test statistic [5,6,7] and analysis of variance (ANOVA) F test statistic [8,9]. While t is used for twoclass prediction problems, F is used for multiclass problems. The statistics t and F are not only used in class prediction, they also apply to the class discovery [10,11]. The main goal of class discovery is to identify subtypes of diseases. The major difference between class prediction and class discovery is that the former uses labeled samples while the latter uses unlabeled samples.
Although t and F have been commonly used in the analysis of gene expression data, there exists a misunderstanding on the roles of t and F. The test statistic t is used to detect the difference between the means of two populations and it has two versions depending on whether or not the two variances of the two populations are equal. The test statistic F is often used to detect the difference among the means of three or more populations under the assumption that the variances of the involved populations are equal. Of course, the F statistic can be used to detect the difference between the means of two populations. In doing this, one can show that the F statistic is equivalent to the t statistic based on the equal variance, that is, one procedure rejects the null hypothesis that the two populations have the same mean if and only if the other procedure rejects the null hypothesis. In analyzing gene expression data, the t statistic is based on unequal variances so that its extension will never reach the ANOVA F. Therefore, for multiclass prediction problems, it is natural to explore other statistics which do not assume equal variances.
In this paper, we study the effect on multiclass prediction results of gene selection from six test statistics: ANOVA F test statistic, Brown-Forsythe test statistic, Welch test statistic, adjusted Welch test statistic, Cochran test statistic, and Kruskal-Wallis test statistic. The five last test statistics can be viewed as extensions of the t statistic used in two-class prediction problems. Their performance will be compared with that of the F statistic.
This paper is organized as follows. In "models and methods," we describe the statistical model for gene expression levels, test statistics, and our method to select genes. In "experimental results," we investigate the effect of test statistics on the classification results by using our gene selection approach and different machine learning techniques, applied to five publicly available microarray datasets. Our conclusion is given in "conclusion."

MODELS AND METHODS
In this section, we will first introduce a general statistical model for gene expression values and describe test statistics for testing the equality of the class means. We then present our approach to select genes using power and correlation.

Statistical model
Assume there are k (≥ 2) distinct tumor tissue classes for the problem under consideration and there are p genes (inputs) and n tumor mRNA samples (observations). Suppose X gs is the measurement of the expression level of gene g from sample s for g = 1, . . . , p and s = 1, . . . , n. In terms of an expression matrix G, we may write It is seen that the columns and rows of the expression matrix G correspond to samples and genes, respectively. Note that G is a matrix consisting of data highly processed through preprocessing techniques that include image analysis and normalization and often logarithmic transformations. We assume that the data G are standardized so that the genes have mean 0 and variance 1 across samples. Given a fixed gene, let Y i j be the expression level from the jth sample of the ith class. Note that these Y i j come from the corresponding row of G. For example, for gene 1, Y i j are a rearrangement of the first row of G. We consider the following general model for Y i j : with n 1 + n 2 + · · · + n k = n. In the model, µ i is a parameter representing the mean expression level of the gene in class i, i j are the error terms such that i j are independent normal random variables, and for i = 1, 2, . . . , k; j = 1, 2, . . . , n i . Schematically, the expression levels Y i j look like the following: Note that if the variances are equal, that is, σ 2 1 = σ 2 2 = · · · = σ 2 k , then the above model is simply the commonly used one-way ANOVA model. For the microarray data, we believe that heterogeneity in the variances is more realistic, since different σ i may describe different variations of the gene expression across classes.
One of the main tasks associated with the above model is to detect whether or not there is some difference among the means µ 1 , µ 2 , . . . , µ k . For the case of homogeneity of variances, the well-known ANOVA F test is the optimal test to accomplish the task [12,13]. However, with heterogeneity of the variances, the task is challenging and is closely related to the well-known Behrens-Fisher problem [14]. When the sample sizes in all classes are equal, that is, n 1 = n 2 = · · · = n k , the presence of heterogeneous variances of the errors only slightly affects the F test. When the sample sizes are unequal, the effect is serious [15]. The actual type-I error is inflated if smaller sizes n i are associated with larger variances σ 2 i . In addition, the significance levels are smaller than anticipated if larger sizes n i are associated with larger variances σ 2 i . The above indicates that for our model, the F test may not be appropriate for testing H 0 : µ 1 = µ 2 = · · · = µ k versus H 1 : not all the µ i are equal. Therefore some alternatives to the F test are worthy of investigating.

Test statistics
After introducing the statistical model for gene expression values, we now turn to the test statistics used to test the equality of the class means for a fixed gene. We will consider the following six test statistics. The first five are parametric test statistics, while the last one is nonparametric.
(a) ANOVA F test statistic. The definition of this test is . For simplicity, we use to indicate the sum is taken over the index i. Under H 0 and assuming variance homogeneity, this well-known test statistic has a distribution of F k−1,n−k [13]. 2005:2 (2005) (b) Brown-Forsythe test statistic [16]. This is given by (c) Welch test statistic [17]. This is defined as (d) Adjusted Welch test statistic [18]. It is similar to the Welch test statistic and defined to be where In this paper, we choose φ i = (n i + 2)/(n i + 1), since this choice provides reliable results for small sample sizes n i and a large number (k) of populations [18].
(e) Cochran test statistic [19]. This test statistic is simply the quantity appearing in the numerator of the Welch test statistic W, that is, where w i and h i are given in (c). Under H 0 , C has an approximate distribution of χ 2 k−1 . (f) Kruskal-Wallis test statistic. This is the well-known nonparametric test and is given by where R i is the rank sum for the ith class. The ranks assigned to Y i j are those obtained from ranking the entire set of Y i j (use the average rank in case of tied values). Assuming each n i ≥ 5, then under H 0 , H has an approximate distribution of χ 2 k−1 [20].

Gene selection
With the test statistics introduced above, we are able to discuss the issue of gene selection. It has been well demonstrated in the literature that gene selection is an important issue in microarray data analysis. It is also known that with a large number of genes (usually in thousands) present, no practical method is available to locate the best set of genes, that is, the smallest subset of genes that offer optimal prediction accuracy. In this paper, the focus lies in comparing the performance of different test statistics in selecting genes for the classification of tumors based on gene expression profiles. Identifying a gene selection process to achieve good classification results is not the purpose of this paper. To make the comparison straightforward, we adopt the simplest gene selection approach as follows. First, we formulate the expression levels of a given gene by a one-way ANOVA model, as shown in "statistical model." We then use the test statistics in "test statistics" to determine the power of genes in discriminating between tumor types. Given a test statistic F , we define the discrimination power of a gene as the value of F evaluated at the n expression levels of the gene. This definition is based on the fact that with larger F the null hypothesis H 0 : µ 1 = µ 2 = · · · = µ k will be more likely rejected. Therefore, the higher the discrimination power is, the more powerful the gene is in discriminating between tumor types. Finally, we choose as informative genes those genes having high power of discrimination.
We note that the discrimination power of genes could be determined equally well by the p value from F . However, due to small sizes n i , it is hard to justify the approximation of the known distribution to F . Therefore the p values may not reflect the actual functionality of F . This drawback is overcome by using the value of F to determine the power of discrimination. Another obvious benefit is that using the value of F will greatly simplify the calculation.
In [18], extensive simulations have been conducted to examine the behavior of some test statistics for testing the equality of population means. The test statistics studied include B, W, W * , F, and C. The results show that with homogeneity of the variances, the ANOVA F test is the optimal test, as stated in "statistical model." However, this assumption of homogeneity is rarely met in practice. Under heterogeneity of variances, the simulation results in [18] show that the test statistics B, W, and W * provide acceptable control of type I errors. This implies that the genes identified by B, W, and W * are more likely to be powerful than those by F and C in discriminating between tumor types, and thus the prediction errors resulting from B, W, and W * are expected to be lower than those from F and C. The nonparametric test statistic H can be applied to data with less restriction, for example, ordinal data, and thus is expected to perform worse than test statistics such as B, W, W * , and C. The above discussion will be further verified by our experiments on gene expression data conducted in "experimental results."

EXPERIMENTAL RESULTS
In this section we investigate the effect on gene selection of the six test statistics introduced in "test statistics." Five gene expression datasets and five prediction methods are used for this purpose. The performances of the test statistics are evaluated in terms of class prediction errors.

Comparison of test statistics
The gene selection procedure described above depends on the test statistics. Given a gene selection process from a test statistic, different classification methods may lead to different prediction errors. In our experiments, we used the following five prediction methods: naive Bayes, nearest neighbor, linear perceptron, multilayer perceptron neural network with 5 nodes in the middle layer, and support vector machines with a second-order polynomial kernel. All the algorithms are from Matlab PRTools 3.01 by Robert P. W. Duin.
To calculate the overall prediction error, we used leave one out (LOO) cross-validation. For a dataset with n samples, this method involves n separate runs. For each of the runs, n−1 data points are used to train the model and then prediction is performed on the remaining data point. The overall prediction error is the sum of the errors on all n runs. Table 2 presents a comparison of the six test statistics when 50 informative genes were used. In the table, F, B, W, W * , C, and H represent the ANOVA F test statistic, Brown-Forsythe test statistic, Welch test statistic, adjusted Welch test statistic, Cochran test statistic, and Kruskal-Wallis test statistic, respectively. The first number in each cell denotes the average of 5 prediction errors from 5 dif- ferent classification methods. The second number in each cell is the median of the 5 prediction errors. The results in the table suggest that B, W, W * , and, C perform better than F and H. Similar to Table 2, Tables 3 and 4 present comparison results with 100 and 200 informative genes, respectively. Results in Tables 2, 3, and 4 may be summarized in a way by figures. Consider the average errors in the tables. For a fixed dataset and fixed number of informative genes, the performances of the six test statistics can be ranked. The fifteen ranks achieved by a test statistic could be used to obtain a 95% confidence interval of the mean rank for the test statistic. The corresponding bar chart plotting six confidence intervals is given in Figure 1. The bar chart based on the median errors in Tables 2, 3, and 4 is presented in Figure 2. Clearly, both figures show that B, W, W * , and C outperform F and H. These results indicate that the proposed models in "statistical model" without assuming equal variances are preferred to those assuming equal variances.
We note that in the above experiments, the performance of C is comparable to those of B, W, and W * . This does not look consistent with the discussion in "gene selection." One reason might be that we only examined 5 datasets in this paper. Our opinion is that if more data sets are explored, the overall performance of C should be worse than that of B, W, or W * . We leave this as our future work.
Before concluding, we point out that it is useful to assess the importance of genes selected by the test statistics from the biological perspective. Since this is not the focus of our research work in this paper, below we only provide a simple example to examine some genes selected by the Brown-Forsythe test statistic for the leukemia dataset. This dataset was also studied by Getz et al [26]. They extracted the stable clusters of genes by the coupled twoway clustering analysis and concluded that those genes grouped into the same cluster share certain biological significance such as on the same pathway. Among the top 50    informative genes from the Brown-Forsythe test statistic, 12 were mapped to the clusters of genes of interest given in [26]. Table 5 shows the information about the gene names, access numbers, corresponding clusters as well as the values of the Brown-Forsythe statistic. For details on the explanation of biological significance of clusters LG1, LG5, and LG6, readers are referred to [26].

CONCLUSION
In this paper, we have compared the performance of different test statistics in selecting genes for multiclassification of tumors using gene expression data. Experiments show (a) the model for gene expression values without assuming equal variances is more appropriate than that assuming equal variances; (b) Brown-Forsythe test statistic, Welch test statistic, adjusted Welch test statistic, and Cochran test statistic perform much better than ANOVA F test statistic and Kruskal-Wallis test statistic.

DISCLAIMER
The opinions expressed herein are those of the authors and do not necessarily represent those of the Uniformed Services University of the Health Sciences and the Department of Defense.