Bayesian Gene Selection Based on Pathway Information and Network-Constrained Regularization

High-throughput data make it possible to study expression levels of thousands of genes simultaneously under a particular condition. However, only few of the genes are discriminatively expressed. How to identify these biomarkers precisely is significant for disease diagnosis, prognosis, and therapy. Many studies utilized pathway information to identify the biomarkers. However, most of these studies only incorporate the group information while the pathway structural information is ignored. In this paper, we proposed a Bayesian gene selection with a network-constrained regularization method, which can incorporate the pathway structural information as priors to perform gene selection. All the priors are conjugated; thus, the parameters can be estimated effectively through Gibbs sampling. We present the application of our method on 6 microarray datasets, comparing with Bayesian Lasso, Bayesian Elastic Net, and Bayesian Fused Lasso. The results show that our method performs better than other Bayesian methods and pathway structural information can improve the result.


Introduction
Identifying disease-associated genes, which can be treated as diagnostic biomarkers, can bring a significant effect on disease diagnosis, prognosis, and treatments [1,2]. With the development of high-throughput technologies in recent years, gene expression profiling has provided a useful way to find biomarkers. Researchers can identify the genes which are differentially expressed between two groups of samples. These genes are regarded as disease-associated genes. However, gene expression data usually contains a large number of genes and a relatively small sample size [3,4]. And many of the genes are also redundant or irrelevant to the prediction [5,6]. Furthermore, there are also noises in the experiment procedures which will influence the gene expression values. Therefore, identifying the biomarkers from gene expression data is challenging.
During the last decades, a number of gene selection methods have been developed to tackle this problem. Feature selection and feature extraction are two major methods (we treat gene and feature equally in this paper). On the one hand, the aim of feature selection is to select relevant features and do not change the form of the features. On the other hand, feature extraction will extract the feature from the original data and may alter the form of the features. Here, we focus on the feature selection methods since the results of such methods could be interpreted easily. Feature selection methods can be generally organized into three categories: filter, wrapper, and embedded methods. Both the wrapper and embedded methods are classifier-dependent methods; thus, they are always time consuming and easy to overfitting. However, the filter methods are usually based on statistic approaches [7] such as mRMR [5], PLSRFE [8], lasso [9], and elastic net [10], which are relatively efficient in terms of computation and can derive a score of each of the genes which represents the significance of the gene. Therefore, we focus on the filter methods in this paper.
Although these methods are successful in many applications, they usually obtain suboptimal solutions. Therefore, the prediction accuracies are not satisfied and the disease-associated genes selected from different methods have few overlaps [11]. This is partly due to the fact that the discriminatory power of many biomarkers is similar. Furthermore, some genes which have low discriminatory powers play important roles in cellular functions. Their combinations are highly discriminative while they are usually ignored [12,13].
Recently, with a large amount of biological information accumulated, there is an increased interest in gene selection with incorporating information on pathways, which can partially compensate for the lack of reliable expression data [14]. Pathways depict a series of chemical interactions in living cells; genes that interact with one another usually mean that they function together concertedly. Therefore, these genes should be highly correlated and have dependence structures. However, many studies only utilize the information that pathways cluster genes into the natural group; the pathway structural information is neglected. Li and Li have overcome this disadvantage by incorporating pathway structure information through a Laplacian matrix of a global graph [15,16] and combined with lasso penalty to perform network-constrained penalty which can select subgroups of correlated features in the network. This penalty is based on the assumption that genes belonging to the same pathway have similar functions and therefore smoothed regression coefficients. And this penalty has been successfully applied in many studies [17][18][19].
The Bayesian approach has three major advantages over Bayesian selection methods [20]. Firstly, hyperparameters can be estimated automatically through fulfilling stochastic draws; thus, 10-fold cross-validation for estimating penalized parameters is not required. Secondly, the Bayesian framework can utilize the pathway information naturally by integrating it in the model as prior knowledge. Finally, the Bayesian estimation with the posterior distributions can provide credible intervals for the regression coefficients, which is a great advantage over frequentist methods.
In this paper, we work with a Bayesian framework to perform gene selection through network-constrained regularization. Similar to the Bayesian Lasso [21], Bayesian Elastic Net [22], and Bayesian Fused Lasso [23], we use shrinkage priors to perform regularization. We show that all the conditional posteriors of the proposed model are available in closed form and proper. Thus, parameter estimation can be performed through Gibbs sampling easily. The pathway information is obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [24], which is the most popular pathway public database, especially pathways associated with several types of cancer could be obtained in the model. Furthermore, following Held and Holmes [25], we extend the regression model to binary regression which can perform binary classification through an auxiliary variable. This method is assessed by applying it to several microarray datasets.

Method
2.1. The Bayesian Network-Constrained Model for Gene Selection. Considering an N × P matrix X, where P is the number of genes and N is the number of the samples, with a response vector y = ðy 1 , ⋯, y n Þ T , we normalize the values of each feature as the tradition in variable selection; thus, the mean and standard deviation of each feature are 0 and 1. We assume the likelihood function of the continuous response is Gaussian function: which can be also expressed as Following Li and Li's work [16], we incorporate the pathway information through its normalized Laplacian matrix. Consider an undirected graph G = ðV, E, WÞ. In this graph, genes are represented by a set of nodes V, and the interactions between genes are represented by a set of edges E = fu ∼ vg, and W is the weights of the edges, where wðu, vÞ represents the weight of edge e = ðu ∼ vÞ which indicates the uncertainty of the edge between the vertices u and v. The degree of each vertex is defined as d v = ∑ u∼v wðu, vÞ. Then, the normalized Laplacian matrix L for graph G with the uth and vth elements can be defined by , if u and v are adjacent, 0, otherwise: Here, we let wðu, vÞ = 1 if there exists an interaction between gene u and v, and wðu, vÞ = 0, otherwise.
To form the network-constrained regularization, we assign the prior distribution for β as follows: where Λ is taking the form: Note that Λ only contains hyperparameter τ.

Computational and Mathematical Methods in Medicine
To eliminate the jΛj 1/2 in the prior distribution of β, we assign the prior distribution for τ as follows: where C τ is the normalizing constant. The prior distribution defined in (6) is proper, due to the following analysis: Let A = Λ − I n , and A is a symmetric and positive semidefinite matrix. Let Since A is the symmetric and positive semidefinite, there exists an orthonormal matrix Q. Hence, the eigendecomposition of matrix A can be written as where the integrand is kernels of the gamma density that indicates the integral is finite. Therefore, the prior distribution is proper. Since Λ is positive semidefinite. The joint posterior distribution can be written as Integrating out τ 2 , we have Applying the fact as follows to the above equation: we have Thus, maximizing the posterior distribution is equivalent to minimizing the following equation: which has the same regularization term as the method proposed in [19]. We assign the prior distribution for σ 2 as follows: And we assign the following prior for the hyperparameters r and λ: Then, the hierarchical Bayesian model is According to the above hierarchical model and the likelihood, the joint posterior distribution on data is Due to the fact that all the prior distributions are conjugated, the full conditional posterior distributions for the parameters have closed forms.
Let μ = ðX ′ X + rΛÞ −1 X ′ Y, Σ = σ 2 ðX ′ X + rΛÞ −1 , we have This implies that τ 2 follows a generalized inverse Gaussian distribution: The Gibbs sampling scheme iterates as follows: (1) Update β by sampling from (20) (2) Update σ 2 by sampling form (22) (3) Update τ 2 by sampling from (24) (4) Update r by sampling from (26) 4 Computational and Mathematical Methods in Medicine (5) Update λ by sampling from (28) 2.3. The Binary Response Case. Binary data such as absence or presence or different types of a disease are often used as response variables in gene selection problems. To perform binary classification, we use probit regression using auxiliary variables. Then, the model can be represented as follows: where X i is the ith sample and Pðy i = 1Þ is the probability of y i = 1. Here, latent variables Z = ðz 1 , z 2 , ⋯, z n Þ are defined as Then, the full conditional posterior distribution for each z i is truncated normal: And Z follows a multivariate truncated normal distribution: Sampling from this distribution directly is difficult. We use the method proposed in [26] to sample this latent variable.
Then, the hierarchical Bayesian model is To derive the Gibbs sampling scheme, we only need to replace Y with Z in the Gibbs sampling scheme defined in Section 2.2. And the latent variables Z are sampled from (32).

Datasets and Preprocessing.
To demonstrate the effectiveness of our methods, a regression microarray dataset and 5 real-life binary classification microarray datasets were tested in this paper, which are described as follows. The pathway information was obtained from the KEGG database.
A breast cancer dataset was used to predict the survival time of patients [27]. We used gene expression profiles of 76 patients. Each patient was measured with 24481 probes. 3592 genes were found in the KEGG database from this dataset. We used the logarithm of survival times of patients as the response variable in this dataset.
The other 5 binary classification microarray datasets are shown in Table 1. No. genes mean the genes we found both existing in the microarray dataset and KEGG pathway database.
Lastly, the gene expression values were normalized; thus, its mean and standard deviation are 0 and 1.

Parameter Settings.
In the procedure of Bayesian network-constrained regularization, we recommend small values for a, b, c, d, e, f in (16) and we set these values to 0:01 in our experiments. The Gibbs sampling iteration was conducted 6000 times, and we chose the second half of the samples to estimate the regression parameters. The posterior estimates of all parameters were obtained through the posterior averages of the chains. For the classification problem, the classifiers were built by a support vector machine (SVM). In this paper, we used the radial basic function as the kernel function in SVM. And the regularization parameter and the kernel width parameter were optimized by a grid search approach. We used Libsvm [32] to model the SVM.

Results and Analysis.
In this section, we will describe the results on 6 microarray gene expression datasets (Table 1) to evaluate the performance of the proposed method. Our method was compared with the other three Bayesian regularized regression methods, including Bayesian Lasso, Bayesian Elastic Net, and Bayesian Fused Lasso. A comprehensive review of these methods can be found in [23]. When L = I, which means we know nothing about the pathway structure, the Bayesian network-constrained regularization is equivalent to Bayesian Elastic Net. And when L = O, our method is equivalent to Bayesian Lasso. These three methods can also be extended to perform binary classification through an auxiliary variable. We also used Gibbs sampling to perform parameter estimation. Previous review [23] also shows that these three Bayesian methods' performances are similar to 5 Computational and Mathematical Methods in Medicine and in some cases better than the frequentist methods. Prediction mean square error was used to evaluate the performance on regression problem. Meanwhile, ACC and AUC were used as the evaluation criteria for binary classification problem. According to previous studies, the number of important genes is probably about 50 [28]; thus, we selected the top 50 genes based on the absolute value of their regression coefficient for the binary classification problem. Figure 1 shows the performance of all the four methods on the regression microarray dataset. And the classification performances on the five binary classification microarray datasets are summarized in Table 2. In the binary classification datasets, the first three datasets are usually treated as easy classification datasets, while the other two datasets are relatively hard to classify. From Figure 1, we can see that the PMSE of our method is lower than other Bayesian methods. Table 2 also shows that on the four easy binary classification datasets, our method achieves the highest ACC and AUC. In the other two hard classification datasets, our method achieves the highest ACC and AUC on GSE412. Although the AUC of Bayesian Elastic Net is higher than our method on GSE4922, our method achieves the highest ACC. In general, Bayesian network-constrained regularization shows better prediction and classification ability than other three Bayesian methods, which is similar to the results implied by [15]. Since our method can be transferred to   [33,34], all the Bayesian regularization regression methods could classify Leukemia, DLBCL, Prostate, and GSE412 dataset accurately. However, the performances of all the methods were poor on GSE 4922 dataset. Therefore, we demonstrate the effectiveness of our method by selecting the top 18 genes which make the prediction accuracy to achieve the highest value and most of those genes are associated with breast cancer (Table 3).

Conclusion
In this paper, we propose a Bayesian approach to perform gene selection, which can incorporate the pathway information as prior biological knowledge through networkconstrained regularization to improve the accuracy of gene selection. All the prior distributions we propose are strictly conjugated; thus, all the conditional posteriors of the model are available in closed form. An auxiliary variable is also introduced to extend the regression model to perform binary classification. An efficient Gibbs sampling method is used to estimate regression coefficients and tune parameters simultaneously, which can perform feature filter feasible for high dimensional microarray datasets. The performance of the proposed method is demonstrated by applying it to a regression microarray dataset and five binary classification microarray datasets. The results show that compared with Bayesian Lasso, Bayesian Elastic Net, and Bayesian Fused Lasso, our method performs better both in prediction and classification. And the pathway information indeed improves the accuracy of gene selection.

Data Availability
The breast cancer dataset could be obtained from the R package breast cancer NKI. Leukemia, DLBCL, and Prostate datasets are available on the website http://portals.broadinstitute .org/cgi-bin/cancer/. GSE412 and GSE4922 datasets are available in the GEO of NCBI under accession GSE412 and GSE4922.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.