Integrating Radiomics with Genomics for Non-Small Cell Lung Cancer Survival Analysis

Purpose The objectives of our study were to assess the association of radiological imaging and gene expression with patient outcomes in non-small cell lung cancer (NSCLC) and construct a nomogram by combining selected radiomic, genomic, and clinical risk factors to improve the performance of the risk model. Methods A total of 116 cases of NSCLC with CT images, gene expression, and clinical factors were studied, wherein 87 patients were used as the training cohort, and 29 patients were used as an independent testing cohort. Handcrafted radiomic features and deep-learning genomic features were extracted and selected from CT images and gene expression analysis, respectively. Two risk scores were calculated through Cox regression models for each patient based on radiomic features and genomic features to predict overall survival (OS). Finally, a fusion survival model was constructed by incorporating these two risk scores and clinical factors. Results The fusion model that combined CT images, gene expression data, and clinical factors effectively stratified patients into low- and high-risk groups. The C-indexes for OS prediction were 0.85 and 0.736 in the training and testing cohorts, respectively, which was better than that based on unimodal data. Conclusions Combining radiomics and genomics can effectively improve OS prediction for NSCLC patients.


Introduction
Non-small cell lung cancer (NSCLC) is one of the deadliest diseases in humans. NSCLC occurs in approximately 80%-85% of lung cancer patients. Surgery is still the only potential cure for patients with early-stage NSCLC. However, 30% to 55% of NSCLC patients will relapse and die of the disease [1]. erefore, effective prognostic tools are needed to help predict and improve overall survival in patients with NSCLC and to provide specific treatments to improve their quality of life.
Radiomics refers to the extraction of many quantitative image features from radiological data and data mining of these features for clinical tasks, such as disease diagnosis and prognostic analysis [2]. Image features used for extraction and analysis in radiomics include intensity patterns, shape, and a range of texture features [3]. e image analysis is completely computerized, and these features are extracted automatically and with high-throughput [4]. Radiomics in lung cancer has aroused great interest, including application in diagnosis [5] and prognosis analysis [6]. Yang et al. developed a radiomics nomogram by combining the optimized radiomics signatures of CT images and clinical predictors to assess the overall survival of patients with NSCLC [7]. Wang et al. demonstrated that a radiomics signature with multiregional features could help to stratify the survival risk of patients with clinical stage and pathologic stage IA pure-solid NSCLC [8].
With the advancement of high-throughput sequencing technology, using microarray gene expression profiling to analyze gene expression characteristics and establish prognostic gene signatures has also attracted significant research interest. Li et al. established a gene signature closely related to the tumor immune microenvironment that can effectively predict clinical outcomes [9]. e above studies have greatly promoted the discovery of biomarkers for the prognosis of NSCLC. However, most methods are limited to a single-data model, and cross-modal comprehensive methods are relatively underdeveloped. By integrating multimodal data, such as gene expression, radiological and histological imaging, and clinical data, more detailed insights into the development and occurrence of disease can be provided [10]. erefore, by fusing information from radiological imaging, gene expression, and clinical data, it is possible to significantly improve survival prediction. Recently, this has been demonstrated for recurrence prediction in NSCLC by combining radiological imaging and gene expression data [11].
Multimodal approaches can extract more meaningful features from multiple perspectives, leading to more reliable characterization of tumors, and thus have great potential to address the limitations of single-modality models. According to this assumption, in this study, we tried to provide a complete view of NSCLC characteristics for survival prediction by integrating information from radiological imaging, gene expression, and clinical data. To this end, we first used CT images and gene expression data to develop their respective risk scores and combined them with clinical characteristics to construct the final survival analysis model. We then investigated whether the fusion model could improve the overall survival of NSCLC patients compared with the models based on unimodal data [12].

Materials and Methods
An overview of our method is depicted in Figure 1. With paired CT and gene expression data, our objective is to develop multimodal representation from both modalities that would outperform unimodal representations in NSCLC survival prediction. For CT images, we followed the conventional radiomics research pipeline, which mainly includes feature extraction, feature selection, and model analysis. For gene expression data, we adopted a deep learning architecture to extract the latent features that best represent the data. For CT and gene expression data, we calculated their respective risk scores. Finally, we fused these risk scores with clinical data to obtain the final prognostic analysis model.

Problem Formulation.
Let (x i , t i , δ i ) denote each patient, where x i corresponds to the features extracted from CT or gene expression, t i is the survival time, and δ i denotes the censoring indicator. δ i � 1 means a noncensored instance and vice versa. e purpose of survival analysis is to predict the time duration until a specific event occurs. In this study, the event was the death of a patient with NSCLC. e Cox proportional hazard model is one of the most popular methods to model the hazards function [13]. It is built based on the hypothesis that the hazard ratio between two instances is time-independent and is defined as where h 0 (t) corresponds to the baseline hazard and f(x) � α T x is the risk function. α � (α 1 , α 2 , . . . α υ ) is the regression parameter that can be estimated by minimizing the following log partial likelihood function: where N denotes the number of patients and R(t i ) is the risk set at time t i , which represents the set of patients who are still at risk before time t i .

Data Source.
We used the NSCLC Radiogenomics dataset, a public dataset of 211 patients who underwent surgery for NSCLC between 2008 and 2012 [14,15]. It includes two cohorts from Stanford University School of Medicine (AMC) and Palo Alto Veterans Affairs Healthcare System (R01). CT scans were performed using different scanners and imaging protocols, with slice thicknesses ranging from 0.625 to 3.0 mm (median, 1.5 mm), an X-ray tube current of 124-699 mA (mean, 220 mA), tension of 80-140 kV (mean, 120 kV), and a field of view after manual shortening ranging from 71 to 124 cm (mean 89 cm) [16]. e corresponding gene expression data were downloaded from Gene Expression Omnibus datasets (GEO; GSE103584), which are composed of RNA sequences (RNA-seq) described by fragments per kilobase of transcript per million mapped reads (FPKM) of 130 NSCLC patients. We selected samples with paired CT scans and gene expression data, and a total of 116 samples were included in the study.

Construction of the Radiomics Model.
For each CT image, we extracted a wide range of features from the segmented cancer region according to the radiomic features described by the Imaging Biomarker Standardization Initiative (IBSI), including intensity features, shape features, texture features, and wavelet features [8]. Intensity features use first-order statistics, such as energy and entropy, to quantify the tumor intensity characteristics. Shape features describe the shape characteristics of the tumor, such as sphericity or compactness. Texture features can quantify intratumor heterogeneity by taking the spatial location of each voxel compared with the surrounding voxels into account. Commonly used texture features include gray level cooccurrence matrix (GLCM), gray level run length matrix (GLRLM), gray level size zone matrix (GLSZM), neighbouring gray tone difference matrix (NGTDM), and gray level dependence matrix (GLDM). Wavelet features calculate the intensity and textural features from wavelet decompositions to represent the tumor characteristics at different frequencies. All feature extraction algorithms were implemented in the Pyradiomics toolkit [17]. A detailed description of these features can be found in the Supplementary Appendix. Journal of Oncology To eliminate the differences in the value scales of the radiomic features, feature normalization was performed after feature extraction. For the extracted features in the training cohort, each feature for a specific patient was subtracted by the mean value and divided by the standard deviation value from this cohort. e same normalization method was applied to features in the testing cohort using the mean and standard deviation values calculated based on the training cohort.
Including too many features will increase the computational cost, and redundancy between features will reduce the accuracy of the model. Furthermore, the number of features is more than the number of samples in this work, which will increase the probability of overfitting. erefore, we needed to select a small number of informative radiomics features to predict the survival risk of patients. In this study, we used the least absolute shrinkage and selection operator (Lasso) [18] for feature selection.
e Lasso can shrink some regression coefficients to zero and select important variables by adding an L i regularization term to l(α), so the Lasso-Cox model can be formulated as where p k�1 |α k | is the L 1 regularization term and λ is a parameter to balance the two parts. We used 5-fold crossvalidation to determine the optimal value of λ.

Construction of Genomics Model.
We chose the autoencoder framework for processing gene expression data. Autoencoders aim to reconstruct the input values using combinations of nonlinear functions, and the bottleneck layers are considered as the representation of the inputs [19]. Usually, the bottleneck layer has significantly fewer neural units than the input layer, and thus can be regarded as a compressed representation of the original input. Autoencoders have already been shown to be efficient for dimension reduction of high-dimensional genomics data [20].
An autoencoder can be divided into an encoder and a decoder. Let the original input be X ∈ R N×p , with N samples and p features. We used a multilayer neural network with parameter Φ e as the encoder: H is regarded as a compressed representation of input X. e encoder maps N samples from p-dimension space to kdimension space. Another multilayer r neural network with parameter Φ d is regarded as the decoder: where is X reconstruction representation which has the same shape as X. e decoder maps N samples from kdimensional space back to p-dimensional space. e whole process of the autoencoder can be expressed as e parameters of the autoencoder can be estimated by minimizing the following reconstruction error: en, the latent representation H can be regarded as the compressed form of X that can be used in subsequent tasks.

Validation of Radiomics and Genomics Models.
For the features selected from CT and gene expression data, we constructed Cox models. erefore, a risk score of each patient can be computed by k i�1 β i × f i , where f i denotes the selected radiomic features or the compressed gene features, and β i represents the estimated coefficient of the corresponding features. According to this formula, the risk score of each patient was computed.
en, all patients in the training cohort were divided into high-and low-risk groups with the median of the risk score as the cutoff. en, the survival differences between these two groups were evaluated by a Kaplan-Meier (KM) survival curve. We used the same category for the testing cohort. Accordingly, the samples in the testing cohort were also divided into two risk groups. e difference in the survival curves of the high-risk and low-risk groups was evaluated by the log-rank test.

Construction of a Nomogram Combining Two Risk Scores and Clinical Factors.
e candidate prognostic indicators included age, sex, stage, grade, and the above two risk scores. We first used a LASSO Cox regression model to select the final features for constructing the fusion model. en, a nomogram was built using a multivariate Cox proportional Journal of Oncology 3 hazard model. We used the same validation methods as described in the validation of the radiomics and genomics models.

Statistical Analysis.
e statistical analysis was performed with R software, version 4.1.2, and Python software, version 3.8.5. Cox regression was performed using the "scikit-survival" package [21]. Nomograms were generated using the "rms" package. e differences in clinical factors between the training and testing cohorts were assessed using the "tableone" package [22].

Clinical Characteristics.
We randomly split the dataset, including 87 training and 29 testing samples. e clinical characteristics in the training and testing cohorts are summarized in Table 1. ere was no significant difference in age, gender, N stage, M stage, or grade between the training and independent testing cohorts (P > 0.05).

Construction and Validation of Radiomics Model.
ere were 8 features with nonzero coefficients in the LASSO Cox regression model. e optimal λ selection in the LASSO Cox regression model is shown in Figure 2. e radiomic signature based on the eight features and the weight for each feature are given in Table 2.
e radiomics signature achieved a C-index of 0.79 for the training cohort, and 0.643 for the testing cohort, demonstrating the predictive performance of the model. Based on the risk score of patients in the training cohort, the optimal cutoff was −0.150. en, patients in both the training and testing cohorts were stratified into low-risk and highrisk groups, as shown in Figure 3. e association of the radiomics signature with OS was shown in the training cohort (P < 0.001) and confirmed in the testing cohort (P � 0.045).

Construction and Validation of Genomics Model.
In the autoencoder structure, the number of latent features is an important parameter. To determine this parameter, we preset the number of latent features to 5-12. For each set of potential features, we further compressed it into two dimensions using multidimensional scaling (MSD) [23], as well as the original gene, and then calculated the Euclidean distance between the two projections.
e experimental results showed that 10 dimensions had the lowest Euclidean distance, so we set the number of hidden features to 10. e C-index of the genomics risk score was 0.716 for the training cohort and 0.581 for the testing cohort. Based on the risk score of patients in the training cohort, the optimal cutoff was −0.006. en, patients in both the training and testing cohorts were stratified into low-risk and high-risk groups, as shown in Figure 4. e association of the radiomics signature with OS was shown in the training cohort (P < 0.001) and confirmed in the testing cohort (P � 0.083).

Nomogram that Combines Multimodality.
We used LASSO analysis to select the optimal feature combination in the multimodal analysis. A combination of the radiomics risk score, genomics risk score, age, N grade, and stage was finally selected. e combination nomogram for the prediction of 2-year survival is displayed in Figure 5. e multimodal-based model obtained a C-index of 0.85 in the training cohort and 0.736 in the testing cohort. It is worth noting that the C-index of the fusion model was the highest compared to that of the other models in all cohorts. Furthermore, the fusion model obtained the best prognostic ability in stratifying patients into high-risk and low-risk groups with p � 0.0081, as shown in Figure 6. In Table 3, we list the results of all models.
To further verify the effectiveness of our model, we performed 5-fold cross-validation for all data. All patients were randomly divided into 5 subsamples of equal size, and in each fold, 4 subsamples were used as training data and the remaining 1 subsample was used as validation data for testing. For each fold, we used the method presented in this paper separately. e average and standard deviation of the C-indexes over the 5-fold cross-validation are listed in Table 3. As can be seen from the results, the C-indexes of the 5-fold cross-validation were consistent with the results of the testing cohort, proving the stability of our model.

Discussion
Gene expression data, imaging data and clinical factors each play an important role in the diagnosis and prognosis of diseases. Merging these multimodal data may lead to new prognostic cancer models and provide new support for patient treatment strategies. Accordingly, increased attention is given to statistical methods and algorithms to analyze and correlate multivariate imaging, clinical and gene expression data for disease diagnosis and prognosis. Our research demonstrated the ability to integrate radiomics data, genomics data, and clinical features for the stratification of       modalities can help detect more effective biomarkers and improve lung cancer clinical decision-making. In future work, we will solve the above limitations and conduct further multicenter, large-scale researches to promote the application of multimodal fusion in the management of lung cancer patients.

Conclusions
In this study, the risk score developed based on multimodal data has great potential to improve the determination of overall survival of NSCLC patients compared with the models based on unimodal data. We demonstrated that we could gain more significant insights into cancer prognosis by fusing imaging and genomic data. More studies that combine more data sources, such as multiomics data and digital pathology, are required to confirm the advantages of multimodal fusion.

Data Availability
Public data can be freely accessed and downloaded at https:// wiki.cancerimagingarchive.net/display/Public/ NSCLC+Radiogenomics.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.  Journal of Oncology 7