Applications of Bayesian Gene Selection and Classification with Mixtures of Generalized Singular g-Priors

Recent advancement in microarray technologies has led to a collection of an enormous number of genetic markers in disease association studies, and yet scientists are interested in selecting a smaller set of genes to explore the relation between genes and disease. Current approaches either adopt a single marker test which ignores the possible interaction among genes or consider a multistage procedure that reduces the large size of genes before evaluation of the association. Among the latter, Bayesian analysis can further accommodate the correlation between genes through the specification of a multivariate prior distribution and estimate the probabilities of association through latent variables. The covariance matrix, however, depends on an unknown parameter. In this research, we suggested a reference hyperprior distribution for such uncertainty, outlined the implementation of its computation, and illustrated this fully Bayesian approach with a colon and leukemia cancer study. Comparison with other existing methods was also conducted. The classification accuracy of our proposed model is higher with a smaller set of selected genes. The results not only replicated findings in several earlier studies, but also provided the strength of association with posterior probabilities.


Introduction
Recent advancement in oligonucleotide microarray technologies has resulted in production of thousands of gene expression levels in a single experiment. With such vast amount of data, one major task for researchers is to develop classification rules for prediction of cancers or cancer subtypes based on gene expression levels of tissue samples. The accuracy of such classification rules may be crucial for diagnosis and treatment, since different cancer subtypes may require different target-specific therapies. However, the development of good and efficient classification rules has not been straightforward, either because of the huge number of genes collected from a relatively small number of tissue samples or because of the model complexity associated with the biological mechanism. The identification of a smaller set of relevant genes to characterize different disease classes, therefore, has been a challenging task.
Procedures which are efficient in gene selection as well as in classification do play an important role in cancer research.
Many approaches have been proposed for classes classification. For example, several analyses identified a subset of classifying genes with -statistics, regression model approach, mixture model, Wilcoxon score test, or the between-within classes sum of squares (BSS/WSS) [1][2][3][4][5][6][7]. These methods are univariate in the sense that each gene is tested individually. Others started with an initial step of dimension reduction before classification procedures, such as the principle components analysis (PCA) [8][9][10] and the partial least squares algorithm (PLS algorithm) [11][12][13][14][15]. These methods may reduce dimension (the number of genes) effectively but may not be biologically interpretable. To capture the gene-gene correlations, researchers proposed the pair-based method [16], correlation-based feature selection [17], and the Markov random field prior [18]. Although these methods can model the gene-gene interaction, they can be computationally timeconsuming.
Bayesian approach can accommodate naturally the interplay between genes via prior distributions, under the setting of regression models. Examples included the Bayesian hierarchical mixture model [19][20][21] and a logistic or probit link with latent variables and stochastic search variable selection (SSVS) procedure for binary and multicategorical phenotypes [22][23][24][25]. To consider all genes simultaneously, most Bayesian approaches adopt a multivariate analysis with a natural conjugate prior (0, (X X) −1 ), calledprior, for the regression parameters [26]. This a priori distribution utilizes the design matrix as the prior covariance matrix of and can lead to a relatively simple posterior distribution. However, if the number of genes is much larger than the number of samples available, the dimension of X becomes large and a high degree of multicollinearity may occur. In that case, the covariance matrix of Zellner's -prior becomes nearly singular. Modifications included the -prior distribution with the Moore-Penrose generalized inverse matrix [27] and use of a ridge parameter [28,29]. Alternatively, other researchers focused on the scalar in (X X) −1 which controls the expected size of the nonzero regression coefficients. For instance, it was reported that the final results are insensitive to the values of between 10 and 100, and the value = 100 has been suggested after extensive examinations [30]. Instead of fixing at a constant, George and Foster [31] proposed an empirical Bayes estimate for , while Liang and colleagues [32] suggested a hyper-prior, a special case of the incomplete inverse-gamma prior in Cui and George [33].
The main purpose of this research is the application of fully Bayesian approaches with a hyperprior on . Specifically we adopted an inverse-gamma prior IG(1/2, /2) which was commented earlier that it could lead to computational difficulty. Therefore, we outlined a MCMC algorithm and demonstrated its implementation. In this paper, we considered a probit regression model for classification with SSVS to identify the influential genes, augmented the response variables 1 , 2 , . . . , with latent variables 1 , 2 , . . . , , and converted the probit model to a Gaussian regression problem with the generalized singular -prior ( -prior). For the choice of , we assigned a hyperprior for the uncertainty in . This hyperprior is intuitive and differs from those in [32,33]. Finally, we defined an indicator variable for the th gene and perform MCMC methods to generate posterior samples for gene selection and class classification. The rest of the paper is arranged as follows. In Section 2, we briefly described the model specification including the data augmentation approach and SSVS methods. Under this hyperprior on , we also demonstrated the implementation of the Bayesian inference. Applications of three cancer studies, acute leukemia, colon cancer, and large B-cell lymphoma (DLBCL), were presented in Section 3. Conclusion and discussion were given in Section 4.

Model and Notation
Let (X, Y) indicate the observed data, where denotes the expression level of the th gene from the th sample and Y = ( 1 , 2 , . . . , ) denotes the response vector, where = 1 indicates that sample is a cancer tissue and = 0 for normal tissue. Assume that 1 , 2 , . . . , are independent random variables with = Pr( = 1).

Probit Model with Latent
Variable. The gene expression measurements can be linked to the response outcome with a probit regression model: where represents the intercept, X is the th row in the × design matrix X, = ( 1 , . . . , ) is the vector of regression coefficients, and Φ is the standard normal cumulative distribution function.
To perform statistical inference under this probit regression model, we first adopt independent latent variables 1 , 2 , . . . , , where = + X + , ∼ (0, 1) , = 1, . . . , , and the corresponds to the disease status as The use of such latent variables helps to determine which category the th sample is to be classified. Note that multiplying a constant on both sides in (3) does not change the model; thus a unit variance is considered for . If a noninformative prior is assumed for , then the posterior covariance matrix of given Z ≡ ( 1 , 2 , . . . , ) becomes (X X) −1 . However, due to the enormous size of microarray data, (X X) −1 may be nearly singular, and variable selection for dimension reduction is needed. We define for variable selection the vector ≡ ( 1 , 2 , . . . , ) whose elements are all binary, where Given , we denote as the number of 1's in and a × 1 reduced vector containing the regression coefficients if its corresponding is 1. Accordingly, for all = 1, the corresponding columns in X are collected to build X , an × reduced gene expression matrix. Given , the probit regression model in (3) can be written as = + X + , ∼ (0, 1) , = 1, . . . , , where X is the th row in X .
Computational and Mathematical Methods in Medicine 3

Choice of Prior Distributions.
To complete the model specification, we assign a normal (0, ℎ) prior for the intercept with a large ℎ indicating no a priori information.
For the regression parameters, the commonly applied -prior | , ∼ (0, (X X ) −1 ) may not work if the sample size is less than the number , leading to the results that X X is not of full rank and (X X ) −1 does not exist. Therefore, we consider the -prior distribution with (X X ) + as the pseudoinverse of X X for conditioning on ( , ), | , ∼ (0, (X X ) + ). This would solve the singularity problem. Next, we assign for and the priors and assume that are independent for = 1, . . . , . Note that here the 's are of small values, implying a small set of influential genes.
We now complete the model specification: Note that = 1 if the th sample is a cancer tissue, is the intercept, = ( 1 , . . . , ) is the vector of regression coefficients, Φ is the standard normal cumulative distribution function, and X is the design matrix: And ≡ ( 1 , 2 , . . . , ) contains the binary , where = 1 if the th gene is selected ( ̸ = 0), is a ×1 reduced vector containing the regression coefficients if its corresponding is 1, is the number of 1's in , and X is the th row in X .

Computation and Posterior
Inference. Based on the prior distributions specified in previous sections, the joint posterior distribution can be derived as and 1 , 2 , . . . , ( ≤ ) are the nonzero eigenvalues of (X X ) + . From (10), given (Z, , , , Y, X) is a multivariate normal distribution with a covariance matrix (X X ) + /( + 1). In the case where X is not of full column rank, the problem of convergence may occur in the MCMC algorithm because the covariance matrix is not positive definite and the multivariate normal distribution becomes degenerated. To avoid this problem and speed up the computations, we integrate out and in (10) following Yang and Song's [27] suggestion and derive where Σ = I + ℎ11 + X (X X ) + X . As the posterior distribution is not available in an explicit form, we use the MCMC technique to obtain posterior sample observations. The computational sampling scheme is as follows.
The conditional distribution of Z given (Y, X, , ) is a multivariate truncated normal. Since it is difficult to directly sample Z from this distribution, we draw where Z (− ) is the vector of Z without the th element [34].
The above distribution does not belong to any standard distribution, so we will use Metropolis-Hastings algorithm to sample .
The iteration therefore starts with initial values of Z (0) , (0) , and (0) , and our MCMC procedures at the th iteration are as follows.

Classification.
To assess the performance of our procedures, testing data sets are considered. For example, a testing set ( new , new ) is available, and the predictive probability of new given new is Based on the MCMC samples, we estimate the probability witĥ( new | Y, X, new ) When there are no testing sets available, we adopt the leave-one-out cross-validation (LOOCV) method to evaluate Computational and Mathematical Methods in Medicine 5 a Gene also identified in Ben-Dor et al. [38]. b Gene also identified in Furlanello et al. [39]. c Gene also identified in Chu et al. [40].
the performance with the training data. Because the predictive probability for is where Y (− ) denotes the vector of Y without the th element. We estimate this probability based on the generated MCMC samples,

Applications
In this section, we applied the fully Bayesian approach and the reference prior to three cancer studies: colon cancer, leukemia, and a large B-cell lymphoma (DLBCL) study [35][36][37]. We also compared the performance of this approach with other existing gene selection and classification methods. These data have been extensively studied with various methods but we only included a limited set of them. Others can be found in the reference lists of the work cited here.

Colon Cancer Study.
The data of the colon cancer study contained 2000 expression levels from 40 tumor and 22 normal colon tissues. These expression levels were first transformed with a base 10 logarithmic function and then standardized to zero mean and unit variance for each gene. We then performed the MCMC sampler fixing the ℎ in Σ at 100 and = Pr( = 1) = 0.005 for all = 1, . . . , . We burned in the first 12000 iterations, collected every 30th sample, and obtained 6700 posterior points in total for further analysis. The leading 20 genes with the largest posterior inclusion probabilities were presented in Table 1. This list was compared with the findings in three other studies [38][39][40] and similar findings were denoted in Table 1. The first 19 genes were identified in at least one of the three studies. For reference, Figure 1 displays the 100 largest posterior probabilities of the 100 corresponding genes. For classification, we adopted the external leave-oneout cross-validation (LOOCV) procedure to evaluate the performance of classification with the selected genes. The procedures were the following: (i) removing one sample from the training set; (ii) ranking the genes in terms of -statistics using the remaining samples and retaining the top 50 genes as the starting set to reduce computational burden; (iii) selecting the * most influential genes from the 50 genes based on our Bayesian method; and (iv) using these * genes to classify the previously removed sample. The procedures were repeated for each sample in the dataset. With different choices of * like * = 6, * = 10, and * = 14, the error rates were 0.1452, 0.1452, and 0.1129, respectively. The performance of other   methods, including SVM [41]; classification tree followed by 1-Nearest-neighbor and LogitBoost with 100 iterations [42]; MAVE-LD [43]; IRWPLS [44]; supervised group Lasso (SGLasso, [45]) and MRMS [46]; and -test for single markers in probit regression was summarized in Table 2. SVM had the smallest error rate, but it apparently included too many genes (1000 in this set). One other method MRMS+SVM+D1 performed better, with one more correct classification, than our proposed procedure when 6 or 10 genes were selected.

Leukemia Study.
Next we considered the leukemia study with gene expression levels from 72 tissues including 47 acute lymphoblastic leukemia (ALL) patients and 25 acute myeloid leukemia (AML) subjects. These data contained 38 training and 34 testing samples. The training data contained 27 ALL cases and 11 AML cases, whereas the testing data were with 20 ALL cases and 14 AML cases. As described in other studies [2], the preprocessing steps such as thresholding and filtering were applied first and then followed by a base 10 logarithmic transformation. A total of 3571 genes were left for analysis. Next, we standardized the data across samples, and we ranked these genes by the same MCMC procedures described earlier. The top 20 genes with the largest posterior inclusion probabilities were presented in Table 3, and genes identified by other studies [36,41,47,48] were also noted. For reference, Figure 2 displays the 100 largest posterior probabilities of the 100 corresponding genes.
For the classification procedure, similar to the procedures for colon cancer study, we selected * most influential genes from a starting set of 50 genes and next used them to examine the testing data. With * = 6, 10, or 14 genes, only the 61st and 66th observations were misclassified by our procedure. We also compared the results with weighted voting machine [36], MAVE-LD [43], two-step EBM [47], KIGP + PK [48], and -test for single markers with probit regression, as summarized in Table 4. Note that although MAVE-LD and two-step EBM methods performed better than our proposed a Gene also identified in Golub et al. [36]. b Gene also identified in Ben-Dor et al. [38]. c Gene also identified in in Lee et al. [22].  procedure, both methods used more genes (50 and 512) and yet achieved only one less misclassification. Among this list, our procedure apparently considered a smaller set of genes with a satisfactory performance.

Diffuse Large B-Cell Lymphoma (DLBCL) Study.
This study collected 58 samples from DLBCL patients and 19 samples from follicular lymphoma [37]. The original dataset contained 7129 genes. After the preprocessing steps such as thresholding and filtering were applied and a base 10 logarithmic transformation was conducted, a total of 6285 genes were left for analysis. Next, we standardized the data across samples and ranked these genes by the same MCMC procedures described in earlier sections. The error rates for * = 6, 10, or 14 under LOOCV were 0.0519, 0.0649, and 0.0779, and the accuracy was between 0.92 and 0.95, as listed in Table 5. To achieve a smaller error rate, we considered * = 5 and obtained a smaller rate 0.0390, the same rate achieved by the hyperbox enclosure (HBE) method [49]. Similar to the discussion in the previous two applications, our proposed model can achieve the same or smaller error rate with a smaller set of genes.

Conclusion and Discussion
In this Bayesian framework, we considered a mixture ofprior to complete a fully Bayesian analysis for gene selection and cancer classification. Different from other existing methods that treated the as a fixed value, we incorporated its uncertainty by assuming a reference inverse-gamma prior distribution. Earlier studies mentioned this prior, but considered it difficult to derive posterior inference. We therefore 8 Computational and Mathematical Methods in Medicine  In the application section, we listed only the results from * = 6, 10, and 14 selected genes. Other values for * have been tried and the performance remains good. For instance, the pink line in Figures 3 and 4 displays the accuracy of the proposed procedure when the number of selected genes * varies between 5 and 20 for the colon cancer and leukemia study, respectively. For the colon cancer study, the largest accuracy 0.8871 occurs at * = 14, while other values of * lead to the accuracy between 0.8387 and 0.8871. These  correspond to at least 52 correctly identified subjects out of 62. For the leukemia study, the largest accuracy 0.9706 occurs at * = 15. Other values of * all lead to an accuracy larger than 90% except when * = 20 (accuracy is 0.8824 = 30/34). In addition, we compared the results under the proposed generalized -prior with fixed at a constant. The colored lines in Figures 3 and 4 are for fixed at 5 (red line), 10 (blue), or 20 (black), respectively. Again, results under the prior distribution assumption lead to a higher accuracy with a less number of selected genes. Another issue is related to the choice of the number of genes in the starting set. We have considered 50 in all three applications. This value can certainly be changed. However, the computational complexity increased as the value becomes larger. This cost in computation remains a research topic for future research.
To compare the performance of a stochastic and a constant , we also conducted a small simulation study to investigate the effect of assigning a prior on versus fixing at different constant values. We used the R package penalizedSVM [50,51] to simulate three data sets; each contains 500 genes with 15 genes associated with the disease. The numbers of training and testing sample were 200 and 40, respectively. We then conducted the gene selection procedures with a prior on , = 5, = 50, and = 500 at * = 1, 2, . . . , 15 and recorded the accuracy under each setting. Figure 5 plots the average accuracy with the pink line standing for the accuracy under the mixtures of -priors on , the black line for = 5, the red line for = 50, and the blue line for = 500. It can be observed that only when is assigned with a very large number like 500, the corresponding accuracy can be slightly better than that under a prior for the uncertainty in . This again supports the use of the mixtures of -priors for a better and robust result. Here in this paper we have focused on the analysis of binary data. However, the probit regression model can be extended to a multinomial probit model to solve the multiclass problems, and the Bayesian inference can be carried out similarly. Such analysis will involve a larger computational load and further research in this direction is needed. Another point worth mentioning is the inclusion of interactions between genes. Further research can incorporate a power prior into the prior of [52] or include information on genegene network structure [18] to complete the procedure for variable selection.