A risk stratification model for lung cancer based on gene coexpression network

Risk stratification model for lung cancer with gene expression profile is of great interest. Instead of the previously reported models based on individual prognostic genes, we aimed to develop a novel system-level risk stratification model for lung adenocarcinoma based on gene coexpression network. Using multiple microarray datasets obtained from lung adenocarcinoma, gene coexpression network analysis was performed to identify survival-related network modules. Representative genes of these network modules were selected and then, risk stratification model was constructed exploiting deep learning algorithm. The model was validated in two independent test cohorts. Survival analysis using univariate and multivariate Cox regression was performed using the output of the model to evaluate whether the model could predict patients’ overall survival independent of clinicopathological variables. Five network modules were significantly associated with patients’ survival. Considering prognostic significance and representativeness, genes of the two survival-related modules were selected for input data of the risk stratification model. The output of the model was significantly associated with patients’ overall survival in the two independent test sets as well as training set (p < 0.00001, p < 0.0001 and p = 0.02 for training set, test set 1 and 2, respectively). In multivariate analyses, the model was associated with patients’ prognosis independent of other clinical and pathological features. Our study presents a new perspective on incorporating gene coexpression networks into the gene expression signature, and the clinical application of deep learning in genomic data science for prognosis prediction.


Introduction
Recently, risk stratification based on gene expression profiles is of major biomedical interest in lung cancer research (1-6). Previous studies developed risk stratification models that mostly focused on individual prognostic genes. However, these studies have not fully considered the nature of biological networks and their systematic properties. Since it is more evident that biological processes are derived from numerous interactions between many cellular components, gene network analysis could provide valuable information about cancer pathogenesis and therapeutic interventions (7). Among the various biological networks, gene co-expression network has some strengths: not relying on prior information about genes, avoiding biologically wrong assumptions about independence of gene expression levels, and alleviating multiple testing problems (8).
Lung cancer -mainly, non-small cell lung cancer -is one of the most common cancers and is the leading cause of cancer-related death worldwide (9,10). Currently, TNM staging system is a universal guideline for prognosis prediction and treatment decision. However, heterogeneous molecular features of lung cancer require diverse adjuvant treatment options and lead to different prognosis even in the same stage (11). Hence, there has been a constant need for developing better risk stratification models to predict accurate prognosis and to improve cancer-related survival.
The main objectives of this study were 1) to identify survival-related gene co-expression network modules, 2) and to propose a deep learning (DL)-based risk stratification model reflecting survival-related network modules. Using public microarray datasets from the Gene Expression Omnibus (GEO), we identified survivalrelated network modules of lung adenocarcinoma. Subsequently, we constructed DL-based prognostic score using representative genes of survival-related network modules and it showed great prognostic property in all cohorts.

Gene expression data and preprocessing
Microarray data sets were searched from the National Center for Biotechnology Information GEO database (https://www.ncbi.nlm.nih.gov/geo/) (12) using keywords 'lung cancer', 'lung adenocarcinoma', or 'adenocarcinoma'. We searched for studies analyzed by single platform (Affymetrix HG-U133A Plus 2.0) in order to obtain high proportion of overlapping genes. In total, eleven microarray data sets were included and raw gene expression data were downloaded from the GEO data repository for preprocessing step (13)(14)(15)(16)(17)(18)(19)(20)(21). We selected two microarray data sets with survival information (accession number GSE31210 (20) and GSE30219 (21)) as independent test sets, and the others (13)(14)(15)(16)(17)(18)(19) with or without survival information as the training set. For those microarray data sets containing multiple histologic types of lung cancer, only the samples from adenocarcinoma were extracted. Detailed information of data source used in this study can be found in Supplementary Table 1. All available clinico-pathological variables (age, sex, smoking status, stage, and molecular subtypes) and survival information (survival status and duration) were compiled from each microarray data sets using 'GEOquery' package (22) ( Table 1).
We generated the training set by assembling nine microarray data sets through stepwise preprocessing method described below. The raw gene expression data from microarray data sets were called and normalized using robust multichip average method using the 'affy'(23) package. On a study-by-study basis, we removed invalid and duplicated probe sets by 'featureFilter' function in 'genefilter' package (24) and mapped array probe sets for the respective gene symbols. In addition, to remove poor quality probes, we filtered out probe sets with low expression level (signal intensity < log2(100) in at least 5% of samples within at least one study) and low variability (interquartile range < 0.5). As we combined microarray data from different studies, we performed additional normalization using Combat algorithm (25) in order to eliminate potential batch effects. Lastly, we detected the outliers by calculating the inter-array correlation based on Pearson's correlation coefficient for all samples, and removed them. As a result, the training set contained 4615 probe sets from 510 lung adenocarcinoma samples including 273 samples with available survival information.
The raw gene expression data of both test sets were called and normalized as the same method with the training set. One outlier sample was removed from the test set 2; consequently, the test set 1 and 2 included 226 and 84 lung adenocarcinoma samples respectively.

Weighted gene co-expression network construction from the training set
We used weighted gene co-expression network analysis (WGCNA) package (26,27) to build a weighted gene co-expression network from the training set. We created a correlation matrix on the basis of Pearson's correlation coefficient for all pair-wise genes across all samples. The power -the key parameter for weighted network-was selected to optimize both the scale-free topology and sufficient node connectivity and we chose a threshold of 6 in this study (Supplementary Fig. 1). The correlation matrix was transformed into adjacency matrix (matrix of connection strength) using the power function, and pair-wise topological overlap (TO) between genes were calculated. We identified network modules using hierarchical clustering method with TO dissimilarity as the distance measure. The modules were detected using dynamic tree cut algorithm (28) in WGCNA package, defining height cutoff value of 0.99, deep split as 2, and minimum module size cutoff value of 30. Genes that were not assigned to any module were classified to color gray ( Fig. 1).

Figure 1. Gene co-expression network construction and survival-related modules identification. (A)
A schematic diagram summarizing our risk stratification modeling strategy. Gene co-expression network was constructed from the training set. Gene network modules were extracted based on topological overlap. Survival-related modules were identified from the training set and validated in the two test sets. We selected representative genes from survival-related modules, and built network-based prognostic scoring system using deep learning. (B) Gene dendrogram and modules identified by weighted gene co-expression network analysis from the training set. Modules were labeled with different colors. (C) Univariate Cox regression analysis of module eigengene in the training set was performed. Module eigengene is a representative expression value of genes of each module calculated by the principal component analysis. The dotted line represents cutoff value (p-value = 0.05) for significance, and five modules were identified as survival-related network modules. (D) Survival-related network modules were validated in the two test sets using Cox regression analysis. Three modules from test set 1, and two modules from test set 2 were significantly associated with overall survival.

Identification and validation of survival-related network modules
For each module, we summarized the module expression profile by one representative gene, module eigengene (ME), which is the first principal component of the expression matrix of the corresponding module.
We used ME as the representative of each module to evaluate association with overall survival (OS). The survival-related network modules were identified using Cox regression analysis in the training set. For validation, the same genes included in the network construction were extracted from each test set, and assigned to the modules identified in the previous step. ME was calculated based on the expression profile of each test set, and the association between ME and OS was evaluated using Cox regression analysis to see whether the modules identified from the training set are also associated with OS in each test set. The modules with uncorrected p-value under 0.05 were regarded as significant survival-related network modules.

Functional annotation and network visualization of survival-related network modules
The enrichment of the gene ontology terms in each module were evaluated based on the hypergeometric test using 'clusterProfiler' (29) package. The gene ontology biological process terms at false discovery rate under < 0.05 in each survival-related module were regarded as significantly enriched terms. The network of two common survival-related network modules (red and turquoise) was visualized with Cytoscape Software 3.4.0 (30).

Representative genes selection for risk stratification model construction
Representative genes of the survival-related network modules were selected to construct risk stratification model. Degree of representativeness of genes in each module was calculated by gene module membership (GMM), a correlation coefficient between gene expression profile and module eigengene. Additionally, the relationship between GMM and prognostic significance (p-value) of an individual gene was tested. Prognostic significance of gene was measured by univariate Cox regression analysis for overall survival. Pearson correlation analysis was performed between GMM and prognostic significance for every gene. We selected top 10 genes according to the GMM from the modules which showed significant correlation between GMM and prognostic significance. Accordingly, expression levels of the selected genes in the same network module were highly correlated to each other, and they could be also highly associated with prognosis because of strong correlation between GMM and prognostic significance. The expression levels of selected genes were used for risk stratification model based on DL.

DL-based risk stratification model
Expression profiles of representative genes were used for the input of the deep learning because they were expected to preserve co-expression patterns and to reflect the systematic properties of survival-related network modules. Convolutional neural network (CNN) was specifically used to extract gene expression patterns of modules. It finally produced gene network prognostic score (NetScore).
DL framework was based on a nonlinear proportional hazard model, which assumed hazard function ( ), a product of a time-dependent baseline hazard function ( 0 ) and a risk function determined by covariates: ( , ) = 0 ( ) × ℎ( ) . Conventional Cox model for the risk stratification using multiple covariates ( 1 , 2 , … , ) estimates the risk function ℎ( ) by a combination of linear functions. ℎ ( ) = 1 1 + 2 2 + ⋯ + DL-based risk stratification modeling also adopts proportional hazard model, however, replaces linear risk function with the output of neural network (31). We designed a simple CNN to estimate risk function, ℎ ( ).
Firstly, 1-dimensional convolutional filters were applied. Filter size was same as the input length, 10. Thus, the number of the output of the first layer was same as the number of convolutional filters. Genes in different modules were inputted as different channels. We set the number of filters were 24. The outputs of convolutional layer were hierarchically connected to three fully-connected (FC) layers. Each FC layer had 24 nodes except final output layer. For FC layers, a dropout function was applied to reduce overfitting and learn more robust features. This function randomly drops the connections with predefined probability. We set the probability as 0.5. The final output of CNN, ℎ ( ), was a single node.
The CNN model was trained by the RMSprop algorithm (32). The model was optimized to minimize the loss function, negative log partial likelihood.
=1 represents that the event has occurred in individual i at event time Ti. ∈ ( ) represents that another patient j is still at risk of the event at time Ti.
Our framework was trained by initial learning rate with 1x10 -4 and took 500 epochs for the training. The CNN was implemented using a deep learning library, Keras (ver. 1.0.4) with the Theano (ver. 0.8.2) backend (33).
Parameters related to training of the neural network including number of layers, nodes, training epoch and learning rate were determined by 5-fold cross-validation. Training set was randomly divided into 5 subsets. At each step, a single subset was left for testing and other four subsets were used for training. The performance of the model was measured by Harrell's C-index of the final output score of the model (34). The optimal parameters were selected according to the maximum average C-index across the 5-fold of the loop. The predictive value of NetScore was independently validated in two test sets. C-index for each test set was also evaluated.

Survival analysis using NetScore in all cohorts
Prognostic property of NetScore as continuous variable was evaluated by univariate Cox analysis. To define risk groups, NetScore was dichotomized using the median value in each cohort. Kaplan-Meier method was used to assess survival rates according to the risk groups and survival rate differences were assessed with the log-rank test. Additionally, independent prognostic value of NetScore was assessed by multivariate and subgroup analysis. Multivariate Cox analysis was performed using clinical and pathological variables as well as NetScore. Subgroups were divided on the basis of clinical and pathological features, and univariate Cox analysis of NetScore was performed in each subgroup.

Gene co-expression network modules from the training set
We aimed at developing a risk stratification model based on gene co-expression networks (Fig. 1A) Fig. 2).

Identification of survival-related modules from the training set and validation in test sets
Total five modules were significantly associated with OS ( Fig. 1C)  To construct risk stratification model, representative genes were selected according to the gene module membership. Gene module membership was correlated with the significance of association between individual gene expression and survival. Y-axis represents statistical significance calculated by univariate Cox analysis of individual genes. A strong correlation was found in the red and turquoise modules (r = 0.53 and p < 1x10 -19 for red module; r = 0.35 and p < 1x10 -23 for turquoise module). and turquoise (p = 0.011) and red (p = 0.049) modules in test set 2 were significantly associated with OS (Fig.   1D).
The networks of two common survival-related network modules (red and turquoise) are presented in Fig. 2A and 2B. The significantly enriched gene ontology terms of the red module included 'organic acid catabolic process', 'carboxylic acid catabolic process', 'small molecule catabolic process', and the turquoise module included 'DNA strand elongation involved in DNA replication', 'mitotic cell cycle phase transition', 'DNAdependent DNA replication' (Supplementary Table 2, Supplementary Data 2).

DL-based risk stratification model using representative genes of survival-related module
By measuring the correlation between gene significance for OS (p-value) and GMM in each survival-related module, we identified two modules demonstrating high correlation with statistical significance (r = 0.53, p < 1x10 -19 and r = 0.35, p < 1 x 10 -26 for red and turquoise module, respectively; Fig. 2C). Based on the strong correlation, we could assume that the genes with high representativeness measured by GMM has high significance for OS and are the most important elements of the module; therefore, we selected top 10 genes according to GMM from the red and turquoise modules for the DL-based risk stratification model construction (Supplementary Fig. 3).
The expression profiles of selected 20 genes were used as input data of the risk stratification model (Fig.   3A). NetScore, the final output of our model, was significantly associated with OS in the training and two test sets (Fig. 3B) (p < 0.00001, p < 0.0001 and p = 0.02 for training set, test set 1 and 2, respectively). Subjects were divided into two groups, high-and low-risk groups, according to the median value of NetScore in each cohort. The high-risk group was significantly associated with OS in the training set (p < 0.0001; Fig. 3C) and in test set 1 (p < 0.0001; Fig. 3D). A trend of the association was also shown in test set 2 (p = 0.054; Fig. 3E).

NetScore as an independent predictive factor for prognosis
Cox multivariate analysis revealed that the risk group was associated with OS independent of stage as well as other clinico-pathological features in the training set and test set 1 ( Table 2). The independent predictive factors for OS in Cox multivariate analysis were the risk group (p = 0.001) and T-stage 3 (p = 0.030) in training set, and the risk group (p = 0.01) and EGFR mutation status (p = 0.005) in test set 1. In the test set 2, there was no feature significantly associated with OS in univariate Cox analysis, though the high risk group showed a trend of unfavorable prognosis (p = 0.06). We also evaluated the prognostic value of NetScore in subgroups divided by clinical and pathological features. In the training set, the high risk group was significantly associated with poor prognosis in subgroups regardless of age and T-stage. In all subgroups, a trend of close relationship between the risk group and OS was found except never-smoking subgroup (Fig. 4A, Supplementary Fig. 4).
According to subgroup analysis in test set 1, the risk group was closely associated with OS in male, old-aged, ever/never smokers, stage IA/IB, EGFR positive and all negative mutation subgroups (Fig. 4B, Supplementary   Fig. 5). A trend of association between the risk group and OS was also revealed in each subgroup of test set 2, regardless of clinical features including sex, age and T stage (Fig. 4C, Supplementary Fig. 6). To construct risk stratification model, deep convolutional neural network was used. Input data were expression value of top 10 genes from each of red and turquoise module. The first layer consists of one-dimensional convolutional filters which extract gene expression patterns of each module. Three additional fully-connected (FC) layers were followed and connected to the output score gene network prognostic score (NetScore). (B) Univariate Cox regression analysis of NetScore as a continuous variable was performed in the training and two test sets. It shows significant association between the score and overall survival in all sets. The blue line represents hazard ratio for overall survival and the blue area represents 95% confidence interval. (C-E) Overall survival of dichotomized group according to NetScore was depicted by Kaplan-Meier survival curve. The statistical difference was tested by log-rank test. The high risk group showed worse survival in the training set (C) and test set 1 (D) with statistical significance. The high risk group of the test set 2 (E) also showed worse prognosis though the difference did not reach statistical significance.

Discussion
In this study, we developed a risk stratification  Regardless of subgroups, a trend of poor prognosis in high risk group was also found in the test set 2 selection of individual significant genes has a substantial problem of multiple statistical testing (37). Instead of these previous approaches, systemic approach integrating gene interaction as well as individual genes would be a breakthrough for robust risk stratification modeling because variation patterns of their expression levels can be associated with prognosis.
Functional annotation elucidated the role of specific gene network in lung adenocarcinoma pathophysiology.
The red module was functionally associated with catabolic process of organic acid and carboxylic acid. As cancer cells rely on aerobic glycolysis and facilitate the metabolic process of amino acid, nucleotides and lipid for rapid proliferation, genes related to fatty-acid and amino acid metabolism could reflect progression of cancer cell (38). The turquoise module was related to DNA replication and cell-cycle. A previous gene co-expression analysis also revealed that cell-cycle related genes were closely associated with lung cancer prognosis (39).
Furthermore, overexpression of cyclins has repeatedly been associated with poor prognosis in lung cancer (40).
Our result also emphasizes the importance of cell-cycle genes in lung cancer prognosis.
Recently, DL has dramatically improved data analysis in genomics and imaging fields (41,42). The main contribution of DL for our risk stratification model is firstly applying convolutional neural network to gene expression data. It was used for extracting multiple gene expression patterns by applying convolutional filters.
Another contribution is to solve regression problems of survival data by using a specialized loss function (31).
We compared predictive accuracy of DL-based model and conventional Cox proportional hazard model obtained from the expression level of selected 20 genes. Predictability of the DL-based model was significantly higher than that of the Cox model in test set 1 (C-index= 0.709±0.042 and 0.608±0.046, respectively; p = 0.004). It was also higher in the training set and test set 2 though the difference did not reach statistical significance (Supplementary Fig. 7). To our knowledge, NetScore is the first study that apply deep convolutional neural network to high-dimensional gene expression data for predicting prognosis. By applying this novel approach to various genomic data, risk stratification and survival prediction could be improved compared with conventional Cox model. NetScore was trained by various samples with different clinico-pathological characteristics. We found NetScore was associated with sex, smoking status, stage and molecular subtypes (Supplementary Fig. 8).
Briefly, a trend of high NetScore was found in male, smokers, late stage and KRAS mutation positive samples.
Nonetheless, NetScore was significantly associated with OS independent of clinico-pathological variables according to multivariate and subgroup analyses. Of note, NetScore was significant predictor in early stage subgroups (stage IA/IB). This finding could be important because the new risk stratification could identify patients who might need adjuvant chemotherapy. For example, a recent clinical trial using 15-gene signature based on individual prognostic genes showed successful selection of patients with stage IB and II NSCLC who would most likely benefit from adjuvant chemotherapy (43). In the future, as a new prognostic biomarker based on gene network, the usefulness of NetScore should be tested whether it could affect clinical decision, and compared with the previous prognostic models using individual genes.

Conclusion
We developed a risk stratification model for lung adenocarcinoma using gene co-expression network. A future extension of our work would be to apply this approach to the co-expression networks of other cancer types. In terms of technical improvement, modification of DL architecture and selection process of representative genes could improve the prediction accuracy. Finally, we expected that a prospectively designed clinical trial with well-controlled clinico-pathological variables would help find clinical application of our new risk stratification model.