Identifying biomarkers and signaling pathways is a critical step in genomic studies, in which regularization is a widely used feature extraction approach. However, most of the regularizers are based on
Identifying the molecular biomarkers or signaling pathways involved in a phenotype is a particularly important problem in genomic studies. Logistic regression is a powerful discriminative method with an explicit statistical interpretation: it provides classification probabilities for the class labels.
A key challenge in identifying diagnostic or prognostic biomarkers using the logistic regression model is that, in most genomic studies, the number of observations is much smaller than the number of measured biomarkers. This limitation causes instability in the algorithms used to select gene markers. Regularization methods have been widely used to deal with this problem of high dimensionality. For example, Shevade and Keerthi proposed sparse logistic regression based on Lasso regularization [
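To make the idea of Lasso-regularized logistic regression concrete, the following is a minimal sketch in plain Python. It uses proximal gradient descent with soft thresholding, which is a generic textbook solver and not necessarily the algorithm of Shevade and Keerthi or the solver proposed in this paper. The function names, the step size, the penalty weight `lam`, and the toy data (4 samples, 6 features, so fewer observations than features) are all illustrative assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal operator of t * |v|, i.e. the Lasso shrinkage step."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logistic(X, y, lam, step=0.1, n_iter=500):
    """Proximal gradient descent for L1-penalized logistic regression.

    Hypothetical helper for illustration: minimizes the average logistic
    loss plus lam * sum(|beta_j|); the intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals (predicted probability minus label) under the current model.
        resid = [sigmoid(beta0 + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        # Gradient of the average logistic loss with respect to each coefficient.
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        # Plain gradient step on the intercept.
        beta0 -= step * sum(resid) / n
        # Gradient step followed by shrinkage on each coefficient.
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta0, beta

# Toy data (assumed): feature 0 separates the classes, features 1-5 are noise.
X = [
    [ 1.0,  1.0,  1.0,  0.5,  2.0,  0.0],
    [ 1.0, -1.0, -1.0, -0.5, -2.0,  0.0],
    [-1.0,  1.0, -1.0,  0.2,  0.0,  1.0],
    [-1.0, -1.0,  1.0, -0.2,  0.0, -1.0],
]
y = [1, 1, 0, 0]

beta0, beta = fit_l1_logistic(X, y, lam=0.1)
print("selected features:", [j for j, b in enumerate(beta) if b != 0.0])
```

On this toy dataset the penalty keeps the informative feature and shrinks the noise coefficients exactly to zero, which is the feature-selection behavior that motivates sparse regularizers when observations are scarce.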
So far, dense molecular interaction information about disease-related biological processes has been observed and gathered in databases focused on many aspects of biological systems. For example, BioGRID records various biological interactions collected from more than 43,468 publications [
Inspired by the aforementioned methods and ideas, here, we define a network-constrained logistic regression model with
The rest of the paper is organized as follows. In Section
Generally, assuming that dataset
Directly computing (
In this paper, we present an enhanced
The Laplacian matrix
One-term Taylor series expansion for (
We consider the following.
If
We evaluate the performance of four methods: the network-constrained logistic regression models with
Model 2 was defined similarly to Model 1, except that we considered the case in which a TF can have positive and negative effects on its regulated genes at the same time:
In these two models, 10-fold cross-validation was conducted on the training datasets to tune the regularization parameters of the enhanced
Table
Simulation results of the enhanced
| Model | Misclassification errors (%) | | | | Sensitivity (%) | | | | Specificity (%) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Eh_ | | | Elastic | Eh_ | | | Elastic | Eh_ | | | Elastic |
| 1 | | 9.85 | 11.81 | 13.12 | | 0.971 | 0.968 | 0.873 | 0.969 | 0.970 | 0.962 | 0.981 |
| | (0.36) | (0.31) | (0.41) | (0.12) | (0.00) | (0.00) | (0.02) | (0.00) | (0.00) | (0.01) | (0.01) | (0.00) |
| 2 | | 10.83 | 13.21 | 14.14 | 0.939 | 0.939 | 0.943 | 0.835 | | 0.981 | 0.987 | 0.980 |
| | (0.33) | (0.36) | (0.24) | (0.23) | (0.00) | (0.00) | (0.01) | (0.00) | (0.02) | (0.01) | (0.01) | (0.00) |
Simulation results (averaged over 100 runs) for comparison of misclassification errors, sensitivity, and specificity using the enhanced
In this section, we merged the protein-protein interaction (PPI) network (see
Figures
The results of the enhanced
| | Selected genes | Connected genes | Connected edges | Cross-validation error | Test error |
|---|---|---|---|---|---|
| Eh_ | 171 | 54 | 41 | 6/70 | 5/37 |
| | 193 | 61 | 47 | 6/70 | 6/37 |
| | 500 | 150 | 121 | 7/70 | 6/37 |
| | 636 | 337 | 510 | 6/70 | 6/37 |
Results of the analysis of the LC gene expression dataset by the four procedures, including the number of genes selected, the number of linked PPI network genes, the number of linked PPI network edges, the CV error, and the test error.
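The connected-gene and connected-edge counts for a set of selected genes can be computed along the following lines. This is a minimal sketch under a stated assumption: a selected gene counts as "connected" if the network links it to at least one other selected gene, and a "connected edge" is a network edge whose two endpoints are both selected. The function name `network_summary` and the gene identifiers `G1` to `G9` are hypothetical placeholders, not genes or interactions from the study.

```python
def network_summary(selected_genes, ppi_edges):
    """Count selected genes, and the genes and edges linked within the selection.

    Assumption for illustration: the network is undirected, so (a, b) and
    (b, a) are the same edge; self-loops are ignored.
    """
    selected = set(selected_genes)
    # Keep only edges whose two endpoints were both selected by the model.
    linked_edges = {
        frozenset((a, b))
        for a, b in ppi_edges
        if a in selected and b in selected and a != b
    }
    # A selected gene is "connected" if it appears in at least one kept edge.
    linked_genes = set().union(*linked_edges)
    return {
        "selected_genes": len(selected),
        "connected_genes": len(linked_genes),
        "connected_edges": len(linked_edges),
    }

# Toy input with placeholder gene names (not real interaction data).
selected = ["G1", "G2", "G3", "G4"]
edges = [("G1", "G2"), ("G2", "G1"), ("G2", "G3"), ("G4", "G9"), ("G7", "G8")]

summary = network_summary(selected, edges)
print(summary)
```

Here the duplicate edge between `G1` and `G2` is counted once, and `G4` is selected but not connected because its only neighbor lies outside the selection.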
The solution paths of the enhanced
The solution paths of
The solution paths of
The solution paths of the Elastic net for the lung cancer dataset in one sample run.
As shown in Table
To further evaluate the performance of the enhanced
Subnetworks identified by the enhanced
Besides identifying these two significant biomarkers (EGFR and Nkx2-1), the enhanced
All these results reveal that the enhanced
In molecular biology research, especially for cancer, combining biological pathway information with gene-expression data may play an important role in the search for new drug targets. In this paper, we use the enhanced
We successfully identified several important clinical biomarkers and subnetworks that drive lung cancer. The proposed method provides new information to investigators in biological studies and can be an efficient tool for identifying cancer-related biomarkers and subnetworks.
The authors declare no conflict of interests.
This work was supported by the Macau Science and Technology Development Funds (Grant no. 099/2013/A3) of Macau SAR of China.