A Risk Stratification Model for Lung Cancer Based on Gene Coexpression Network and Deep Learning

Risk stratification model for lung cancer with gene expression profile is of great interest. Instead of previous models based on individual prognostic genes, we aimed to develop a novel system-level risk stratification model for lung adenocarcinoma based on gene coexpression network. Using multiple microarray, gene coexpression network analysis was performed to identify survival-related networks. A deep learning based risk stratification model was constructed with representative genes of these networks. The model was validated in two test sets. Survival analysis was performed using the output of the model to evaluate whether it could predict patients' survival independent of clinicopathological variables. Five networks were significantly associated with patients' survival. Considering prognostic significance and representativeness, genes of the two survival-related networks were selected for input of the model. The output of the model was significantly associated with patients' survival in two test sets and training set (p < 0.00001, p < 0.0001 and p = 0.02 for training and test sets 1 and 2, resp.). In multivariate analyses, the model was associated with patients' prognosis independent of other clinicopathological features. Our study presents a new perspective on incorporating gene coexpression networks into the gene expression signature and clinical application of deep learning in genomic data science for prognosis prediction.


Public data collection and preprocessing
Microarray data sets were searched from the National Center for Biotechnology Information Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) [1] using keywords 'lung cancer', 'lung adenocarcinoma', or 'adenocarcinoma'. We searched for studies analyzed by single platform (Affymetrix HG-U133A Plus 2.0) in order to obtain high proportion of overlapping genes. In total, eleven microarray data sets were included and raw gene expression data were downloaded from the GEO data repository for preprocessing step [2][3][4][5][6][7][8][9][10]. We selected two microarray data sets with survival information (accession number GSE31210 [9] and GSE30219 [10]) as independent test sets, and the others [2][3][4][5][6][7][8] with or without survival information as the training set. For those microarray data sets containing multiple histologic types of lung cancer, only the samples from adenocarcinoma were extracted. Detailed information of data source used in this study can be found in Supplementary Table 1. All available clinico-pathological variables (age, sex, smoking status, stage, and molecular subtypes) and survival information (survival status and duration) were compiled from each microarray data sets using 'GEOquery' package [11] (Table 1).
We generated the training set by assembling nine microarray data sets through stepwise preprocessing method described below. The raw gene expression data from microarray data sets were called and normalized using robust multichip average method using the 'affy' [12] package. On a study-by-study basis, we removed invalid and duplicated probe sets by 'featureFilter' function in 'genefilter' package [13] and mapped array probe sets for the respective gene symbols. In addition, to remove poor quality probes, we filtered out probe sets with low expression level (signal intensity < log2(100) in at least 5% of samples within at least one study) and low variability (interquartile range < 0.5). As we combined microarray data from different studies, we performed additional normalization using Combat algorithm [14] in order to eliminate potential batch effects. Lastly, we detected the outliers by calculating the inter-array correlation based on Pearson's correlation coefficient for all samples, and removed them. As a result, the training set contained 4615 probe sets from 510 lung adenocarcinoma samples including 273 samples with available survival information.
The raw gene expression data of both test sets were called and normalized as the same method with the training set. One outlier sample was removed from the test set 2; consequently, the test set 1 and 2 included 226 and 84 lung adenocarcinoma samples respectively.

Functional annotation and network visualization of survival-related network modules
The enrichment of the gene ontology terms in each module were evaluated based on the hypergeometric test using 'clusterProfiler' [15] package. The gene ontology biological process terms at false discovery rate under < 0.05 in each survival-related module were regarded as significantly enriched terms. The network of two common survival-related network modules (red and turquoise) was visualized with Cytoscape Software 3.4.0 [16].

Representative genes selection for risk stratification model construction
Representative genes of the survival-related network modules were selected to construct risk stratification model. Degree of representativeness of genes in each module was calculated by gene module membership (GMM), a correlation coefficient between gene expression profile and module eigengene. Additionally, the relationship between GMM and prognostic significance (p-value) of an individual gene was tested. Prognostic significance of gene was measured by univariate Cox regression analysis for overall survival. Pearson correlation analysis was performed between GMM and prognostic significance for every gene. We selected top 10 genes according to the GMM from the modules which showed significant correlation between GMM and prognostic significance. Accordingly, expression levels of the selected genes in the same network module were highly correlated to each other, and they could be also highly associated with prognosis because of strong correlation between GMM and prognostic significance. The expression levels of selected genes were used for risk stratification model based on deep learning (DL).

Comparison of predictability between DL-based model and conventional Cox proportional hazard model
Expression level of all selected genes was fitted into multivariate Cox regression model and the predictive value of the Cox model was evaluated by C-index as in DL-based model. C-index of Cox model was measured by 5-fold cross validation in the training set, and it was calculated in two test sets. C-index of Cox model was compared with that of DL-based model in each cohort [17].

Convolutional neural network for risk stratification
DL framework was based on a nonlinear proportional hazard model, which assumed hazard function ( ), a product of a time-dependent baseline hazard function ( 0 ) and a risk function determined by covariates: ( , ) = 0 ( ) × ℎ( ) . Conventional Cox model for the risk stratification using multiple covariates ( 1 , 2 , … , ) estimates the risk function ℎ( ) by a combination of linear functions. ℎ ( ) = 1 1 + 2 2 + ⋯ + DL-based risk stratification modeling also adopts proportional hazard model, however, replaces linear risk function with the output of neural network [18]. We designed a simple convolutional neural network (CNN) to estimate risk function, ℎ ( ). Firstly, 1-dimensional convolutional filters were applied. Filter size was same as the input length, 10. Thus, the number of the output of the first layer was same as the number of convolutional filters. Genes in different modules were inputted as different channels. We set the number of filters were 24.
The outputs of convolutional layer were hierarchically connected to three fully-connected (FC) layers. Each FC layer had 24 nodes except final output layer. For FC layers, a dropout function was applied to reduce overfitting and learn more robust features. This function randomly drops the connections with predefined probability. We set the probability as 0.5.
The final output of CNN, ℎ ( ), was a single node.
The CNN model was trained by the RMSprop algorithm [19]. The model was optimized to minimize the loss function, negative log partial likelihood.