A Deep Learning Radiomics Analysis for Survival Prediction in Esophageal Cancer

The purpose of this study was to develop a deep learning radiomics (DLR) nomogram to predict 3-year overall survival after chemoradiotherapy in patients with esophageal cancer. Data from 154 patients were used and randomly split into training (116) and validation (38) sets. Deep learning and handcrafted features were extracted from preprocessed diagnostic computed tomography (CT) images. Radiomics signatures were constructed from features selected by maximizing relevance while minimizing redundancy, followed by least absolute shrinkage and selection operator (LASSO) regression. The DLR signature, the handcrafted features' radiomics (HCR) signature, and clinical factors were incorporated to develop a DLR nomogram. The DLR nomogram was evaluated in terms of discrimination and calibration and compared with the HCR signature-based radiomics model. The DLR nomogram showed better discrimination than the HCR model, with Harrell's concordance indices of 0.76 and 0.784 for the training and validation sets, respectively. The DLR nomogram also calibrated and classified better than the HCR model, with AUCs of 0.984 (vs. 0.797) and 0.942 (vs. 0.665) for the training and validation sets, respectively. Furthermore, the Kaplan–Meier survival (KMS) curves of the nomogram-predicted survival and nonsurvival groups differed significantly in the log-rank test (p value <0.05). The proposed DLR model based on conventional CT images outperformed the HCR signature model in noninvasive, individualized prediction of the 3-year survival rate in esophageal cancer patients. The proposed model can potentially provide prognostic information that guides clinical decisions between observation and treatment.


Introduction
Esophageal cancer (EC) is the eighth most common malignancy and the sixth most common disease-related cause of death worldwide [1,2]. The incidence of esophageal cancer is notably high in Asia and Iceland, as well as the United Kingdom and the United States [3][4][5]. In order to systemically control the disease, radiotherapy and neoadjuvant chemotherapy are commonly combined with surgery [6][7][8]. Regardless of its benefit for severe patients with a low survival rate, an aggressive treatment plan, including multiple cycles of treatment and adjuvant chemotherapy, is not suitable for the other patients with esophageal cancer [9]. Pre-identification of patients having a low survival rate before surgery can help provide other suitable treatment regimens for these patients [10]. Therefore, identification of patients having a lower survival rate is vital to take benefit from additional treatment.
Radiomics features [11,12] have been widely used as an extremely useful tool in quantitative analysis of medical imaging and in medical diagnosis [13,14]. The traditional radiomics utilized handcrafted features, such as tumor shape and texture, obtained from medical images [11]. However, such handcrafted low-order features are not suitable to define intrinsic characteristics of intratumor imaging heterogeneity, limiting the applicability of the radiomics model [11][12][13]. Furthermore, the construction of handcrafted features is limited within the known knowledge of medical imaging.
However, unlike handcrafted radiomics studied widely for radiological diagnosis and prediction [28], the application of deep learning in predicting overall survival in esophageal cancer has not been thoroughly explored yet. Hence, this study aims to develop and validate the deep survival prediction model based on a radiomics nomogram for individualized prediction of three years' overall survival in patients with esophageal cancer.
It is worthwhile to highlight three aspects of the contributions here. (1)

Patients.
Esophageal cancer patients at Shanxi Cancer Hospital were the subject of our retrospective study. The patients were included according to the following inclusion criteria: (a) pathologically confirmed esophageal cancer, (b) a standard CT scan performed before any treatment, and (c) clinical characteristics available. The patients were excluded according to the following criteria: (a) CT image quality too poor, which may affect the diagnosis of the patient, (b) chemotherapy treatment received at another institution, and (c) concurrent other cancers. The survival group includes patients who survived more than three years after the treatment, whereas the nonsurvival group includes patients who died within three years. A total of 154 esophageal cancer patients diagnosed from November 2012 to February 2015 were included in our retrospective study. Those data were grouped into two sets, training (116) and validation (38), at a ratio of 3 : 1.
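The random split described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the function name `split_patients`, the seed, and the patient identifiers are hypothetical, and only the group sizes (116 and 38) come from the text.

```python
import random


def split_patients(patient_ids, n_train, seed=0):
    """Shuffle patient IDs reproducibly and split them into two lists."""
    if not 0 < n_train < len(patient_ids):
        raise ValueError("n_train must be between 1 and len(patient_ids) - 1")
    shuffled = list(patient_ids)            # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)   # seeded generator for reproducibility
    return shuffled[:n_train], shuffled[n_train:]


# 154 hypothetical patient identifiers (placeholders, not study data)
patient_ids = [f"P{i:03d}" for i in range(1, 155)]
train_ids, val_ids = split_patients(patient_ids, n_train=116, seed=42)
print(len(train_ids), len(val_ids))  # 116 38
```

Splitting at the patient level, rather than at the image level, keeps all slices of one patient in the same set.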
Baseline clinical data were collected via the electronic medical record system (EMRS) [29], including gender, BMI, age, M-stage, N-stage, T-stage, overall stage, and planning target volume (PTV). The picture archiving and communication system (PACS) was used to obtain CT images. The dataset was constructed and evaluated in April 2019, and all enrolled patients were followed for at least 3 years. The Institutional Review Board approved the study.

CT Image and Region of Interest (ROI) Acquisition.
General Electric Light Speed RT16 was used for scanning, with a CT slice thickness of 5 mm. The primary tumor volumes for radiotherapy planning were set as the ROI to quantitatively analyze the images. Two skilled radiologists manually selected the three-dimensional tumor ROI using the software package 3D Slicer [30].
Training CT images were preprocessed by resampling, rescaling, and voxel normalization to avoid the accuracy degradation of DL models caused by noise introduced with the interval change.
Those CT images were reconstructed with a matrix of 512 × 512 and a pixel size of 0.5 × 0.5 mm², and resampling with cubic interpolation to 1 × 1 × 1 mm³ voxels was conducted to minimize the variability of the CT images [31]. The tumor area was located with a rectangular bounding box that covers the primary tumor area. The ROI for each patient was obtained from three cropped consecutive slices to avoid the bias of manual segmentation that affects the location of a bounding box. Lastly, the tumor image was resized to 224 × 224 × 3 voxels.
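The geometry of the resampling step can be checked with a short calculation. The sketch below is only an illustration under stated assumptions: the slice count of 40 is hypothetical, the helper names are not from the study, and z-score normalization is assumed because the text does not say which voxel normalization was applied. The interpolation itself is not shown.

```python
import statistics


def resampled_shape(shape, spacing_mm, new_spacing_mm=(1.0, 1.0, 1.0)):
    """Grid size after resampling so that the physical extent is preserved."""
    return tuple(
        max(1, round(n * old / new))
        for n, old, new in zip(shape, spacing_mm, new_spacing_mm)
    )


def zscore(values):
    """Z-score normalization (an assumed choice of voxel normalization)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant input carries no contrast
    return [(v - mean) / sd for v in values]


# 512 x 512 matrix, 0.5 mm pixels, 5 mm slices; 40 slices is a made-up count
print(resampled_shape((512, 512, 40), (0.5, 0.5, 5.0)))  # (256, 256, 200)
```

A 512 × 512 slice at 0.5 mm spacing covers 256 mm per side, so the 1 mm grid has 256 samples per side.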

Radiomics Feature Extraction.
Phenotypic differences between tumors can be captured by a large number of quantitative radiomics features. In this study, deep learning features and handcrafted features were extracted to quantify tumor phenotype and to enhance the learning efficiency of the radiomics model. Those two feature sets have complementary advantages that can be combined to improve the model. Expert knowledge on the esophageal cancer lesion can be reflected with shape and texture features. In contrast, the high-level DLR features can represent complex spatial features from both global and local perspectives. The handcrafted feature extraction algorithm was standardized by referring to the Image Biomarker Standardization Initiative (IBSI) [32][33][34] and Radiomics Ontology [35]. For each CT ROI, 1,670 handcrafted features were extracted using a Python implementation, including 18 first-order statistics, 16 geometric, and 1,564 texture features. The textural features include 14 gray-level dependence [36], 23 gray-level co-occurrence [37], 16 gray-level run-length [38], 16 gray-level size-zone [39], and 5 neighborhood gray-tone difference [40] matrices. Refer to the supplementary appendix of Lambing [41] for mathematical definitions of those features.
The DenseNet-169, designed for the image classification task, was adopted to extract DLR features. In the training cohort, data augmentation approaches including random rotation, random shear, and random zoom were employed before the training procedure. The deep learning model was pretrained on the ImageNet dataset, one of the largest image datasets, and then fine-tuned in a transfer learning strategy to avoid the overfitting problem [42]. The network was trained with the cross-entropy loss function and the Adam optimizer with a learning rate of 0.0001, a batch size of 16, and a regularization weight of 0.0001. The network was implemented in Keras (https://keras.io/) with the TensorFlow library as the backend (https://www.tensorflow.org/). As depicted in Figure 1, the tumor ROI was fed into the DenseNet-169, and the outputs of hidden layers were collected to obtain 1,664 features in total.

DLR Signature Building.
In order to obtain the most effective features, three stages of feature selection were carried out. First, features with p < 0.05 in the Mann-Whitney U (MWU) test were retained.
Then, the features were sorted based on the mutual information (MI) between features and the survival status using the minimum redundancy maximum relevance (mRMR) scheme [43]. It should be noted that, in this study, only the top 50 features in mRMR were retained. Lastly, the dimension of the features was reduced by LASSO regression to obtain the optimal features [44]: the survival-related features were retained, while the other features were removed. The 10-fold cross validation was conducted with 100 iterations in LASSO regression. The obtained features were used to construct the DLR signature, and the HCR signature was constructed in a similar way for comparison.
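The first selection stage can be sketched as a univariate filter. The code below is a simplified illustration, not the study's implementation: it uses the normal approximation of the MWU test with tie correction and without continuity correction, the function names and toy feature values are hypothetical, and the later mRMR and LASSO stages are not covered.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_p(x, y):
    """Two-sided p value from the normal approximation of the U statistic."""
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))


def univariate_filter(features, labels, alpha=0.05):
    """Keep the names of features whose MWU p value is below alpha."""
    kept = []
    for name, values in features.items():
        pos = [v for v, lab in zip(values, labels) if lab == 1]
        neg = [v for v, lab in zip(values, labels) if lab == 0]
        if mann_whitney_p(pos, neg) < alpha:
            kept.append(name)
    return kept


labels = [1] * 8 + [0] * 8  # toy survival status, not study data
features = {
    "separating": [10, 11, 12, 13, 14, 15, 16, 17, 1, 2, 3, 4, 5, 6, 7, 8],
    "noise": [1, 3, 5, 7, 9, 11, 13, 15, 2, 4, 6, 8, 10, 12, 14, 16],
}
print(univariate_filter(features, labels))  # ['separating']
```

For small groups an exact test would be preferable to the normal approximation; the approximation is used here only to keep the sketch dependency-free.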

DLR Nomogram Construction.
A DLR nomogram was built by integrating the DLR signature, HCR signature, and clinical features with a multivariable logistic regression model. Backward stepwise selection was used with the Akaike information criterion as the stopping rule [45]. The multicollinearity of the variables in the multivariable logistic regression model was checked by the variance inflation factor (VIF), where VIF > 10 indicates high multicollinearity [46]. A DLR nomogram was then built based on the multivariable logistic analysis, predicting the individual probability of survival in the training dataset.
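The multicollinearity check can be illustrated with a small, dependency-free calculation. It relies on the standard identity that the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix, which is equivalent to 1 / (1 − R_j²). The function names and toy columns are hypothetical; the study itself reports using the R "car" package.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(p)]
        for i in range(p)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


# Two toy predictors with correlation 0.8, so VIF = 1 / (1 - 0.64)
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vifs = variance_inflation_factors([x1, x2])
print([round(v, 2) for v in vifs])  # approximately [2.78, 2.78]
```

Values well below 10, as in this toy case, would not be flagged under the threshold stated above.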

Evaluation of the DLR Nomogram.
Harrell's C-index was employed to evaluate the discrimination ability of the DLR nomogram in both the training and validation datasets. The bootstrap method was used with 1,000 resamples, and the C-index in both cohorts was calculated with 95% confidence intervals. The AUC, accuracy, specificity, and sensitivity were calculated from the plotted ROC curves. The calibration ability of the DLR nomogram was evaluated using the calibration curve that depicts the consistency between predicted and actual survival probabilities. The Hosmer-Lemeshow (HL) test [47] and decision curve analysis (DCA) [48] were utilized to evaluate the fitting accuracy and robustness of the DLR nomogram, respectively. Furthermore, KMS curves were constructed to predict survival status. Accordingly, the patients were predicted as survival or nonsurvival, and then the difference in survival curves between the two groups was evaluated using the log-rank test.
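For readers unfamiliar with the discrimination metric, Harrell's C-index can be computed as the fraction of comparable patient pairs in which the patient with the higher predicted risk has the shorter observed survival. The sketch below is a simplified version under stated assumptions: pairs with tied survival times are skipped, ties in the risk score count as half, and the function name and toy data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: observed follow-up times
    events: 1 if death was observed, 0 if censored
    risk_scores: higher score means higher predicted risk of death
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:  # patient i died before j was last seen
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (months, event flag, predicted risk); not study data
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.2, 0.4]
print(harrell_c_index(times, events, risk))  # 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. A bootstrap confidence interval, as described above, would repeat this calculation on patients resampled with replacement.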

Statistical Analysis.
All the statistical analyses were conducted with R software (version 4.0.3; http://www.Rproject.org). MWU and chi-square tests were adopted for univariate analysis, and Spearman's rank correlation was employed for correlation. The penalty parameter (λ) was tuned in the LASSO logistic regression model. This study used the following packages for each analysis: the "glmnet" package for LASSO logistic regression, the "rms" package for nomograms and calibration plots, the "ResourceSelection" package for the HL test, the "car" package for VIF calculation, the "survivalROC" package for AUC analysis, the "survminer" package for KMS analysis, and the "dca.R" function for DCA. A two-sided p value <0.05 was considered statistically significant.
The results show that the scores of the survival group were significantly higher than those of the nonsurvival group in terms of DLR signatures (1.10 ± 0.75 vs. −2.22 ± 0.75 in the training cohort and 0.15 ± 1.11 vs. −1.90 ± 1.12 in the validation cohort; MWU test, p value <.001). Also, a significant correlation between the DLR signature and survival status was found (C index: 0.729, p = 0.035 in the training data, and C index: 0.766, all p < 0.001 in the validation data). The LASSO algorithm selected 18 handcrafted features to build HCR signatures. HCR signatures were also significantly different between survival and nonsurvival groups. In the training data, 0.68 ± 1.12 vs. −0.50 ± 1.03, p value <.001,

DLR Nomogram.
The DLR signature, HCR signature, and BMI were combined to construct a DLR nomogram, as shown in Figure 3. The VIFs of the DLR signature, HCR signature, and BMI were 1.45, 1.41, and 1.07, respectively, indicating no severe collinearity in the regression model. Figure 4    The results indicate that the DLR nomogram model provides better discrimination ability (Harrell's concordance index, 0.76 and 0.784, for the training and validation data, respectively). Figure 5 depicts the calibration curves, showing the consistency between predicted and actual survival rates. A nonsignificant statistic in the training cohort (p value = .563, HL test) showed no deviation from the ideal fit. In the validation cohort, the 3-year survival rate was also well calibrated (p value = .648, HL test). The DCA examined the clinical outcomes based on the threshold probability at which a net benefit could be derived. Figure 6 depicts the DCA of the DLR nomogram, showing that the DLR nomogram obtained higher net benefits than the other strategies: the treat-all-patients and treat-none strategies. A significant difference (p value <.05, log-rank test) between the predicted survival and nonsurvival groups was found in the KMS curves (Figure 7).
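The net benefit compared in a decision curve analysis follows the standard definition NB(p_t) = TP/n − (FP/n) · p_t / (1 − p_t), where p_t is the threshold probability. The sketch below illustrates this formula on toy data. The function names are hypothetical, and treating label 1 as the outcome that triggers treatment is an assumption made for the example, not something the text specifies.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted probability >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: every patient is treated."""
    return net_benefit(y_true, [1.0] * len(y_true), threshold)


# Toy outcomes and predicted probabilities; not study data
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.1]
print(net_benefit(y_true, y_prob, 0.5), net_benefit_treat_all(y_true, 0.5))  # 0.5 0.0
```

The treat-none strategy has a net benefit of zero at every threshold, so a model is useful at a given threshold only if its net benefit exceeds both zero and the treat-all value.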

Discussion
Treatment planning can be further individualized via preoperative prediction of three-year survival. In previous studies, handcrafted features were analyzed to predict survival rates. However, due to the limited feature extraction ability, the prediction accuracy was not high enough. In order to overcome such a limitation, this study investigated DLR features in the survival prediction of patients with esophageal cancer. Unlike the traditional handcrafted features, clinical target-oriented DLR features can be automatically learned from data [49].
Intratumor heterogeneity has been considered a potential prognostic factor. The DLR feature extraction can robustly characterize the intratumor heterogeneity noninvasively from the medical images [26]. The experimental results showed that the use of DLR features contributed to the performance of the model, which is also supported by recent studies showing that high-dimensional features can preserve more detailed cancer information, making them more sensitive when assessing survival status [24]. Therefore, by
This study has several limitations, described as follows. First, only 154 patients were available for a three-year follow-up analysis. A larger amount of data is required to improve the performance of the model. Second, all the patients were collected from a single center, thereby limiting the generalizability of the DLR model. A more diverse dataset is required to validate the robustness and reproducibility of the DLR model. Third, our study did not consider genetic markers. Multiple factors should be considered for more personalized treatment, including biology, pathology, genomics [24,26,[42][43][44][45][46][47][48][49][50][51][52], and imaging biomarkers [53]. In addition, this study was limited to CT images despite the importance of MRI images in surgical planning due to their excellent resolution for soft tissues. Future work should, therefore, focus on developing an additional model combining CT and MRI image features. Finally, the primary tumor volumes were manually delineated for feature extraction. Even though such delineations are commonly used with confirmation by another radiation oncologist in radiotherapy planning, previous studies showed that semiautomatic tumor segmentation could reduce interobserver variability and therefore is more suitable for radiomics studies [54].

Conclusions
This study details the development of a noninvasive predictive model that combines deep learning-based radiomics features, handcrafted features, and clinical factors to predict survival rates within three years at diagnosis of esophageal cancer. The performance of the proposed DLR nomogram is superior to that of the traditional radiomics model in terms of Harrell's concordance index and AUC. The calibration curves show the good prediction performance of the nomogram. The Kaplan-Meier survival (KMS) curves of the nomogram-predicted survival and nonsurvival groups differed significantly in the log-rank test (p value <0.05). The proposed model can provide a basis for clinicians to make better treatment decisions and personalized diagnoses. Future work will include improving the model based on larger data and complementary clinical factors.

Data Availability
The data used to support this research are included within the article.

Supplementary Materials
(I) HCR feature selection using the least absolute shrinkage and selection operator (LASSO) logistic regression model. Figure S1. HCR feature selection using the LASSO logistic regression algorithm. (a) The penalization coefficient λ in the LASSO model was tuned by the binomial deviance minimization criteria. The binomial deviance metrics (the y-axis) were plotted against log(λ) (the bottom x-axis). The top x-axis indicates the number of predictors with the given log(λ). Red dots indicate average binomial deviance for each model at the given λ. Vertical bars through the red dots show the upper and lower values of the binomial deviance. The vertical black lines represent the optimal λ, where the model provides the best fit to the data. As a result, the optimal λ of 0.02373184 was selected. (b) LASSO coefficient profiles of the 50 radiomics features. For the optimal λ, eighteen features with nonzero coefficients were selected.
(II) DLR feature selection using the least absolute shrinkage and selection operator (LASSO) logistic regression model. Figure S2. DLR feature selection using the LASSO logistic regression algorithm. (a) The penalization coefficient λ in the LASSO model was tuned by the binomial deviance minimization criteria. As a result, the optimal λ of 0.0445107 was selected. (b) LASSO coefficient profiles of the 50 radiomics features. For the optimal λ, thirty-three features with nonzero coefficients were selected. (Supplementary Materials)
Journal of Healthcare Engineering