Development and Validation of the Random Forest Model via Combining CT-PET Image Features and Demographic Data for Distant Metastases among Lung Cancer Patients

This work aimed to develop and validate a random forest model that combines CT-PET image features with demographic data to diagnose distant metastases among lung cancer patients. The study involved lung cancer patients from The Cancer Genome Atlas lung adenocarcinoma (TCGA-LUAD) dataset, the lung PET-CT dataset, the lung squamous cell carcinoma (LSCC) dataset, and the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium lung adenocarcinoma (CPTAC-LUAD) dataset, and collected 178 CT images, 178 PET images, and the patients' age, history of smoking, and gender. After image processing and feature extraction, 4 computed tomography (CT) image features and 2 positron emission tomography (PET) image features were retained. Four prediction models based on CT image features, PET image features, and demographic data were developed, and the area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate their performance. A total of 178 eligible samples were randomly divided into a training set (n = 134) and a testing set (n = 44) at a ratio of 3 : 1, with 2021 as the random seed. ROC analyses showed that the AUC of the model combining CT-PET image features and demographic data for predicting distant metastases was 0.923 (95% confidence interval (CI): 0.873–0.973) in the training set and 0.873 (95% CI: 0.757–0.990) in the testing set. In addition, the predictive performance of the combined model in the testing set was significantly better than that of the CT-demographic data model (0.716, 95% CI: 0.531–0.902), the PET-demographic data model (0.802, 95% CI: 0.633–0.970), and the CT-PET model (0.797, 95% CI: 0.666–0.928). The random forest model combining CT-PET image features and demographic data may perform well in predicting distant metastases among lung cancer patients.


Introduction
Lung cancer, one of the most common cancers, is the leading cause of cancer-related deaths worldwide, with an estimated 2.20 million new cases and 1.76 million deaths each year [1]. Several studies have shown that metastasis is a major cause of death in most patients with cancer [2,3]. The distant metastases of advanced lung cancer are very extensive and can spread to various parts of the body, such as the lungs, liver, brain, and bone, posing a serious threat to the life and health of patients [4]. Moreover, the choice of treatment for lung cancer patients depends on the staging of distant metastases [5]. Therefore, accurate determination of the distant metastasis stage of lung cancer plays an important role in evaluating patient prognosis.
Nowadays, the diagnosis of distant metastasis in lung cancer is mainly based on digital imaging, such as computed tomography (CT) [4], magnetic resonance imaging (MRI) [6], positron emission tomography (PET) [7], and 18F-fluorodeoxyglucose PET/CT (18F-FDG PET/CT) [8]. Aided diagnosis has important research significance and application value in accurately screening lesions, reducing the rate of missed diagnosis and misdiagnosis, reducing labor intensity, and improving the efficiency of reading films [9]. At present, several studies have developed risk prediction models to estimate the risk of lung cancer [10,11] and have pointed out sociodemographic factors related to this risk, including age, history of smoking, and gender.
As far as we know, most studies to date have used only these digital imaging technologies to diagnose distant metastases in lung cancer patients [7,12], and few have combined them with demographic information. In addition, in the early diagnosis and evaluation of curative effect in lung cancer, PET/CT fusion imaging, a "positive" whole-body imaging method with high sensitivity and specificity, not only fuses molecular and anatomical imaging but also reflects the pathological changes and tissue structure of the lesion area [13,14]. Herein, this study aimed to develop and validate a random forest model of PET/CT image features combined with demographic data to diagnose distant metastases among lung cancer patients.

Data Sources and Collection.
The patients' information in this study was obtained from The Cancer Genome Atlas Lung Adenocarcinoma (TCGA-LUAD) dataset (n = 12), the lung PET-CT dataset (n = 132), the lung squamous cell carcinoma (LSCC) dataset (n = 6), and the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium lung adenocarcinoma (CPTAC-LUAD) dataset (n = 2). A total of 178 CT and 178 PET images were collected in this study. The patients' demographic information, such as age, history of smoking, and gender, was also extracted from these datasets.

Feature Extraction.
In this study, all images were resized to 512 × 512 (Figure 1), and lung masks were then generated by thresholding at Hounsfield unit (HU) values of −900 to −700. SciPy's ndimage module was used to fill possible gaps in the lungs, and skimage morphology was used to dilate the masks to ensure coverage of all lung areas. Features were extracted from the CT images, PET images, and their corresponding masks with pyradiomics, including 18 first-order statistics, 10 two-dimensional shape features, and 75 texture features (gray-level co-occurrence matrix (GLCM), 24; gray-level run length matrix (GLRLM), 16; gray-level size zone matrix (GLSZM), 16; neighborhood gray-tone difference matrix (NGTDM), 5; and gray-level dependence matrix (GLDM), 14). The features of CT and PET images were screened by Lasso regression, and finally, 4 CT image features and 2 PET image features were retained.
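The mask-generation steps described above (HU thresholding, gap filling, dilation) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `lung_mask`, the number of dilation iterations, and the synthetic slice are assumptions, and `scipy.ndimage.binary_dilation` is used here in place of the skimage morphology call so that the example depends only on NumPy and SciPy.

```python
import numpy as np
from scipy import ndimage


def lung_mask(hu_slice, low=-900, high=-700, dilate_iter=2):
    """Build a binary lung mask from a 2D slice in Hounsfield units.

    low/high follow the HU range stated in the text; dilate_iter is a
    hypothetical value chosen only for illustration.
    """
    # 1. Keep pixels whose HU value lies in the lung range.
    mask = (hu_slice >= low) & (hu_slice <= high)
    # 2. Fill enclosed gaps (e.g. vessels) inside the thresholded region.
    mask = ndimage.binary_fill_holes(mask)
    # 3. Dilate so the mask covers the whole lung area.
    mask = ndimage.binary_dilation(mask, iterations=dilate_iter)
    return mask


# Synthetic slice: soft tissue (40 HU) with a lung-like block (-800 HU)
# that contains a small dense structure (0 HU).
slice_hu = np.full((64, 64), 40.0)
slice_hu[20:44, 20:44] = -800.0
slice_hu[30:34, 30:34] = 0.0

thresholded = (slice_hu >= -900) & (slice_hu <= -700)
mask = lung_mask(slice_hu)
```

The resulting boolean array could then be passed, together with the image, to a radiomics extractor; that step is omitted here because the extractor settings are not given in the text.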

Statistical Analysis.
Measurement data were tested for normality with the Shapiro-Wilk test. Normally distributed continuous variables were expressed as the mean ± SD and compared between groups with the t test. Non-normally distributed measurement data were expressed as the median and interquartile range and compared between groups with the Mann-Whitney U test. Counting data were described as the number of cases and the composition ratio N (%), and comparisons between groups used the χ2 test or Fisher's exact test. Missing data were handled by multiple imputation (Supplementary Table 1). SAS 9.4 statistical analysis software and Python were used for the analyses, and all statistical tests were two-sided. Differences were considered statistically significant at P < 0.05.
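As a rough illustration of the test-selection logic in this section, the sketch below picks a test from the outcome of a normality check. The helper names, the 0.05 normality threshold, and the expected-count rule for switching to Fisher's exact test are assumptions; the text does not state which rule was used. The 2 × 2 table is derived from the smoking counts reported for the training set (78 of 109 in M0, 10 of 25 in M1), and the continuous samples are synthetic.

```python
import numpy as np
from scipy import stats


def compare_continuous(a, b, alpha=0.05):
    """Return (test name, two-sided P) for two independent samples."""
    normal = (stats.shapiro(a).pvalue > alpha
              and stats.shapiro(b).pvalue > alpha)
    if normal:
        return "t test", stats.ttest_ind(a, b).pvalue
    return "Mann-Whitney U", stats.mannwhitneyu(
        a, b, alternative="two-sided").pvalue


def compare_categorical(table):
    """Return (test name, P) for a contingency table of counts."""
    table = np.asarray(table)
    chi2, p, dof, expected = stats.chi2_contingency(table)
    # Assumed rule: use Fisher's exact test for sparse 2 x 2 tables.
    if table.shape == (2, 2) and expected.min() < 5:
        return "Fisher exact", stats.fisher_exact(table)[1]
    return "chi-square", p


rng = np.random.default_rng(2021)
# Clearly skewed synthetic samples, so the non-parametric branch is taken.
skewed_a = rng.exponential(1.0, 100)
skewed_b = rng.exponential(2.0, 100)
cont_name, cont_p = compare_continuous(skewed_a, skewed_b)

# Rows: M0, M1; columns: smokers, non-smokers.
smoking_table = [[78, 31], [10, 15]]
cat_name, cat_p = compare_categorical(smoking_table)
```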
In the present study, 178 samples were randomly divided into a training set for developing prediction models and a test set for validating the models at a ratio of 3 : 1, with 2021 as the random seed. Then, sensitivity analysis of demographic data was carried out. We constructed four prediction models based on CT image features, PET image features, and demographic data: (1) the random forest model of CT image features combined with demographic data, (2) the random forest model of PET image features combined with demographic data, (3) the random forest model of CT combined with PET, and (4) the random forest model of CT-PET image features combined with demographic data. The flowchart of the analysis method is shown in Figure 2. The area under the receiver operating characteristic (ROC) curve was used to evaluate the performance of the prediction models.
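The split and the four model variants described above might be organized as in the following sketch, which uses scikit-learn on synthetic data. The feature values, the outcome, the number of trees, and the variable names are all hypothetical; only the sample sizes (134/44), the seed 2021, and the four feature combinations come from the text. Passing `test_size=44` as an integer reproduces the stated 134/44 split, since a fraction of 0.25 would be rounded up to 45 test samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 2021
n = 178
rng = np.random.default_rng(SEED)

# Synthetic stand-ins: 4 CT features, 2 PET features, 3 demographic variables.
ct = rng.normal(size=(n, 4))
pet = rng.normal(size=(n, 2))
demo = np.column_stack([
    rng.normal(63, 10, n),    # age
    rng.integers(0, 2, n),    # history of smoking (0/1)
    rng.integers(0, 2, n),    # gender (0/1)
])
# Synthetic M stage (1 = distant metastasis), for illustration only.
y = (ct[:, 0] + pet[:, 0] + rng.normal(size=n) > 1).astype(int)

feature_sets = {
    "CT + demographic": np.hstack([ct, demo]),
    "PET + demographic": np.hstack([pet, demo]),
    "CT + PET": np.hstack([ct, pet]),
    "CT + PET + demographic": np.hstack([ct, pet, demo]),
}

# One shared split so that all four models see the same patients.
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=44, random_state=SEED)

aucs = {}
for name, X in feature_sets.items():
    model = RandomForestClassifier(n_estimators=200, random_state=SEED)
    model.fit(X[idx_train], y[idx_train])
    prob = model.predict_proba(X[idx_test])[:, 1]
    aucs[name] = roc_auc_score(y[idx_test], prob)
```

After fitting, `model.feature_importances_` gives an impurity-based ranking of the inputs, which is one way a variable-importance plot could be produced.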

Patient Characteristics.
A total of 178 eligible samples were enrolled, with 134 samples in the training set and 44 samples in the testing set. The average age was 64.56 and 62.41 years in the training and test datasets, respectively. There was no significant difference in demographic data between the training set and the testing set, indicating that the two sets were comparable (Supplementary Table 2).
After screening by Lasso regression, 4 CT and 2 PET image features were retained. The 4 CT image features were first-order 10th percentile, first-order robust mean absolute deviation, GLCM joint average, and NGTDM strength; the 2 PET image features were GLDM low gray-level emphasis and NGTDM strength. Subsequently, we compared demographic data, the 4 CT image features, and the 2 PET image features between groups in the training set (Table 1). The proportion of smokers was higher in the M0 stage (78 (71.56%)) than in the M1 stage (10 (40.00%)). Among the image features, the value of low gray-level emphasis extracted from PET was greater in M1 (0.30 (0.23, 0.33)) than in M0 (0.18 (0.07, 0.32)).
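The Lasso screening mentioned at the start of this section could look roughly like the sketch below: features are standardized, a Lasso model is fitted, and features with non-zero coefficients are kept. The penalty `alpha`, the helper name `lasso_screen`, and the synthetic data are assumptions, since the text does not report how the penalty was chosen. The 103 columns mirror the 18 + 10 + 75 features extracted per modality.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def lasso_screen(X, y, alpha=0.05):
    """Return column indices whose Lasso coefficient is non-zero."""
    X_std = StandardScaler().fit_transform(X)
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_std, y)
    return np.flatnonzero(model.coef_)


rng = np.random.default_rng(2021)
# Synthetic training matrix: 134 samples, 103 candidate features.
X = rng.normal(size=(134, 103))
# Synthetic outcome driven by columns 0 and 3 only.
signal = 2.0 * X[:, 0] - 2.0 * X[:, 3] + 0.5 * rng.normal(size=134)
y = (signal > 0).astype(float)

selected = lasso_screen(X, y)
```

A larger `alpha` keeps fewer features; in practice the penalty is often chosen by cross-validation, for example with `LassoCV`.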

Development and Visualization of the Prediction Model.
Demographic data (age, history of smoking, and gender) were included in three of the four prediction models. The CT-related random forest model included the CT image features (first-order 10th percentile, first-order robust mean absolute deviation, GLCM joint average, and NGTDM strength) and demographic data; the PET-related random forest model included GLDM low gray-level emphasis, NGTDM strength, and demographic data. The CT-PET random forest model included the CT and PET image features only. Additionally, we established a random forest model combining CT-PET image features with demographic data.
The performance of the random forest models in the training set and testing set is shown in Table 2. The area under the curve (AUC) of the CT-PET-demographic data model was 0.923 (95% CI: 0.873–0.973) in the training set, which was higher than that of the other three models. Similar results were observed in the testing set. These results indicated that the model combining CT-PET features with demographic data was the best of the four random forest models for predicting the M stage. The ROC curves and the confusion matrices of the four models are shown in Figures 3 and 4, respectively. Furthermore, Figure 5 depicts the variable importance of features for the random forest model of CT-PET-demographic data. GLDM low gray-level emphasis, which was extracted from PET, was the most important of the nine factors, followed by age and history of smoking among the demographic data.
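For readers who want to reproduce an AUC with a 95% CI of the kind reported above, the sketch below computes the AUC from pairwise comparisons of scores and estimates the interval with a percentile bootstrap. The bootstrap is an assumption made for illustration; the text does not state how the intervals were obtained, and the function names, number of resamples, and example scores are hypothetical.

```python
import numpy as np


def auc_pairwise(y_true, score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=float)
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, score, n_boot=2000, alpha=0.05, seed=2021):
    """Return (AUC, lower, upper) using a percentile bootstrap."""
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    resampled = []
    while len(resampled) < n_boot:
        idx = rng.integers(0, n, n)
        # Skip resamples that contain only one class.
        if y_true[idx].min() == y_true[idx].max():
            continue
        resampled.append(auc_pairwise(y_true[idx], score[idx]))
    lower, upper = np.quantile(resampled, [alpha / 2, 1 - alpha / 2])
    return auc_pairwise(y_true, score), lower, upper


# Small worked example: 3 of the 4 positive-negative pairs are ordered correctly.
toy_auc = auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Synthetic test-set-sized example (44 samples).
rng = np.random.default_rng(0)
labels = np.array([0] * 30 + [1] * 14)
scores = labels * 0.8 + rng.normal(size=44)
auc, ci_low, ci_high = bootstrap_auc_ci(labels, scores)
```

With only 44 test samples, intervals of this kind are wide, which is consistent with the broad testing-set intervals reported in Table 2.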

Discussion
Lung cancer is still one of the leading causes of cancer death worldwide, with distant metastasis accounting for the majority of deaths [15]. Moreover, early detection of distant metastases in lung cancer patients could provide an effective basis for the formulation of clinical treatment, which is also a focus of current clinical research [16]. In the present study, we aimed to develop a random forest model to diagnose distant metastases in lung cancer patients by combining the features of CT-PET images with demographic data. The performance of this prediction model was internally validated at the same time. Our results indicated that the random forest model created by combining CT-PET images and demographic data was effective in predicting distant metastases in lung cancer patients. Age, history of smoking, gender, first-order 10th percentile, first-order robust mean absolute deviation, GLCM joint average, NGTDM strength (CT), GLDM low gray-level emphasis, and NGTDM strength (PET) were important factors for predicting distant metastases in lung cancer patients.
In recent years, with the rapid development of imaging technology, CT, PET, and PET/CT examinations have been gradually applied to the diagnosis of malignant tumors; they can provide more information related to anatomical structure and tissue metabolism and have higher value in the qualitative diagnosis of lesions [17–19]. Although CT examination can intuitively reflect the morphological characteristics of lesions and allow accurate anatomical positioning, it has great limitations in diagnosing distant metastases based on morphology alone [20].
Journal of Healthcare Engineering
Several studies have pointed out that PET/CT, as a tool that can effectively integrate anatomical, morphological, and biological information of lesions, has gradually become widely used in clinical practice [21,22]. Yu et al. confirmed that 18F-FDG PET/CT has a good diagnostic performance for distant metastasis staging in patients with non-small-cell lung cancer (NSCLC) at initial staging [23]. More importantly, some demographic data were also considered, as these could influence the morbidity and distant metastasis of lung cancer. In this study, we extracted the features of CT images (first-order 10th percentile, first-order robust mean absolute deviation, GLCM joint average, and NGTDM strength) and PET images (GLDM low gray-level emphasis and NGTDM strength) and collected the patients' demographic data (age, history of smoking, and gender). We then constructed four prediction models: combining CT image features and demographic data; combining PET image features and demographic data; combining CT and PET image features; and combining CT image features, PET image features, and demographic data. Through ROC curve analysis, we found that the model combining CT and PET image features and demographic data may have a better performance in predicting distant metastases in lung cancer patients than the other three models.
Previous studies have revealed that age, history of smoking, and gender are associated with the distant metastasis risk of lung cancer patients [24,25]. In a study examining the effects of smoking on brain metastases in lung cancer patients, Wu et al. showed that the incidence of brain metastases was significantly higher in smokers than in those who had never smoked [26]. The potential reason was that nicotine, as the main cigarette ingredient, might promote brain metastases by altering the polarity of M2 microglia, which in turn accelerates the growth of metastatic tumors. Our results were consistent with those of previous studies: the variable importance of features in the random forest model showed that age, gender, and history of smoking were important risk factors. To the best of our knowledge, this is the first study to assess the value of CT-PET imaging combined with demographic data for the diagnosis of distant metastases in lung cancer patients. We believe that this finding could help clinicians diagnose distant metastases of lung cancer and make personalized treatment strategies.
The study has several limitations. First, the study had a relatively small sample size, which might have limited statistical power. Second, because all patients' information was derived from the TCGA-LUAD, lung PET-CT, LSCC, and CPTAC-LUAD datasets, we did not collect data on race and types of lung cancer, which might be associated with the distant metastases of lung cancer [27]. More research is needed to explore this association. Third, there was no external validation, and larger datasets are needed to further confirm our findings.

Conclusion
In conclusion, our study showed that the random forest model created by combining CT-PET image features and demographic data may perform well in predicting distant metastases in lung cancer patients. The developed model may provide supplementary guidance for clinicians in the choice of therapeutic strategies and personalized monitoring for lung cancer patients.

Data Availability
The data utilized to support the findings are available from the corresponding authors upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.