Dynamic Predictive Models with Visualized Machine Learning for Assessing the Risk of Lung Metastasis in Kidney Cancer Patients
Journal of Oncology

Objective To establish and validate a clinical prediction model of lung metastasis (LM) in renal cancer patients. Method Kidney cancer patients registered in the SEER database from January 1, 2010, to December 31, 2017, were enrolled in this study. In the first section, the LASSO method was adopted to select variables, and independent influencing factors were identified by multivariate logistic regression analysis. In the second section, machine learning (ML) algorithms were implemented to establish models, and 10-fold cross-validation was used to train the models. Finally, receiver operating characteristic (ROC) curves, probability density functions, and clinical utility curves were applied to estimate model performance. The final model was presented as a website calculator. Result Lung metastasis was confirmed in 7.43% (3171 out of 42650) of the study population. In multivariate logistic regression, bone metastasis, brain metastasis, grade, liver metastasis, N stage, T stage, and tumor size were independent risk factors of LM in renal cancer patients. Primary site and sequence number were independent protective factors of LM in renal cancer patients. These 9 factors were used to develop the prediction models, which included random forest (RF), naive Bayes classifier (NBC), decision tree (DT), xgboost (XGB), gradient boosting machine (GBM), and logistic regression (LR). In 10-fold cross-validation, the average area under the curve (AUC) ranged from 0.907 to 0.934. In ROC curve analysis, AUC ranged from 0.879 to 0.922. The XGB model performed best, and a Web-based calculator was built from it. Conclusion This study provided preliminary evidence that ML algorithms can be used to predict lung metastases in patients with kidney cancer. This low-cost, noninvasive, and easy-to-implement diagnostic method is useful for clinical work. The model still needs further real-world validation.


Introduction
Kidney cancer, accounting for 5% of all cancers, originates from the renal tubular and collecting tubular epithelial system [1]. The incidence has been gradually increasing in recent years, resulting in a huge medical burden. The prevalence in men is approximately twice that in women [1]. Additionally, obesity, diabetes, hypertension, smoking, kidney injury, and drugs are major risk factors for kidney cancer. The principal manifestations of kidney cancer are hematuria, renal pain, and a mass [2,3]. In the early stage of the disease, the symptoms are not noticeable. As a result, by the time patients seek medical care, the cancer may already be metastatic and they may be suffering from the corresponding complications. The five-year survival rates of stage I and II were about 88% to 95%, and cancer-specific survival (CSS) rates were 84% to 95% [4]. Renal cell carcinoma (RCC), which makes up 90% of kidney cancer, was the sixth and eighth most common cancer among American men and women, respectively, in 2021 [5]. RCC is mainly composed of clear-cell RCC, papillary RCC, and chromophobe RCC [6,7]. Renal clear cell carcinoma, which accounts for about 70% of RCC, is invasive and has a poor prognosis. The survival time of renal clear cell carcinoma ranges from 3 months to 5 years, and 60% of these patients die within 1 to 2 years after diagnosis [5,8-11].
Metastasis from kidney cancer is not rare. High vascularization can lead to local progression and increase the chance of distant spread [6].
There have been relevant studies on the occurrence, development, and metastasis of kidney cancer. Hypoxia-inducible factor (HIF) and epithelial-mesenchymal transition (EMT), among others, are important molecular events [6,11]. Nishida et al. indicated that amplification of cancer-cell-intrinsic inflammation can trigger neutrophil-dependent lung metastasis during RCC progression [8]. Lung and bone are common metastatic sites of kidney cancer [12]. At the time of initial diagnosis, 18%-40% of patients have already developed systemic metastases. In addition, metastasis is common during long-term follow-up after nephrectomy [4,7,9,13,14]. Jianxin Xue and colleagues reported that 2931 of 33449 RCC patients had distant metastasis and that lung (6.19%) was the most common site of metastasis [7]. Pulmonary metastases present as multiple nodules with bilateral distribution or as solitary masses, and the lower lobes of the lung are common sites. Immune checkpoint inhibitors (ICI), anti-programmed death-1 (PD-1) antibody, and anti-cytotoxic T lymphocyte-associated antigen 4 (CTLA-4) antibody have been accepted as treatments for metastatic RCC [15]. However, the survival rate of metastatic kidney cancer is only about 20% [6].
Clinical models of kidney cancer have been established, but their main focus is predicting prognosis. The UCLA (University of California, Los Angeles) integrated staging system (UISS) and the risk model of the International Metastatic RCC Database Consortium (IMDC) are examples [16]. Machine learning is a subfield of artificial intelligence. It has many applications in kidney cancer, such as identifying pathological variants, grading, and differentiating benign from malignant renal tumors [17].
At present, there are few reports of machine learning models that predict lung metastasis of kidney cancer. In this study, we collected data from the SEER database to establish models. After checking model performance, we built a Web calculator to assist clinicians in predicting lung metastasis from kidney cancer.

Patient Population.
Patients with kidney cancer registered in the SEER database from January 1, 2010, to December 31, 2017, were enrolled in this study. The inclusion criteria were as follows: (1) patients definitively diagnosed with primary kidney cancer while alive, with ICD-O (International Classification of Diseases for Oncology) codes of 8120/3, 8130/3, 8260/3, 8310/3, 8312/3, and 8317/3; (2) histological subtypes of kidney cancer were clear cell RCC, papillary, chromophobe, and any others. The exclusion criteria were as follows: (1) patients younger than 18 years; (2) patients with other primary tumors at diagnosis; and (3) incomplete clinicopathological results.
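The inclusion and exclusion criteria above can be expressed as a simple record filter. The following is a minimal sketch, not the authors' extraction code: the `Record` fields are hypothetical names chosen for illustration (real SEER exports use different column labels), and only three clinicopathological fields are checked for completeness.

```python
from dataclasses import dataclass
from typing import Optional

# ICD-O histology/behavior codes listed in the inclusion criteria.
INCLUDED_ICD_O = {"8120/3", "8130/3", "8260/3", "8310/3", "8312/3", "8317/3"}


@dataclass
class Record:
    # Hypothetical field names, assumed for this sketch only.
    age: int
    icd_o: str
    other_primary_at_diagnosis: bool
    t_stage: Optional[str]          # None marks a missing value
    n_stage: Optional[str]
    tumor_size_mm: Optional[float]


def is_eligible(r: Record) -> bool:
    """Apply the stated inclusion and exclusion criteria to one record."""
    if r.icd_o not in INCLUDED_ICD_O:
        return False                # histology code not on the inclusion list
    if r.age < 18:
        return False                # exclusion (1): younger than 18
    if r.other_primary_at_diagnosis:
        return False                # exclusion (2): other primary tumor
    # exclusion (3): incomplete clinicopathological results
    return None not in (r.t_stage, r.n_stage, r.tumor_size_mm)
```

A cohort would then be built with something like `[r for r in records if is_eligible(r)]`.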

Data Collection.
Marital status, age, race, sequence number, survival time, status, sex, primary site, grade, laterality, pathological type, T stage, N stage, tumor size, bone metastasis, brain metastasis, liver metastasis, and lung metastasis were collected retrospectively. Data were extracted from the SEER database with SEER*Stat software (version 8.3.5). The extraction was carried out by two independent data collectors. If there was any disagreement, a third collector was brought in to assist with the final decision.

Statistical Methods.
The mean was used to describe continuous variables following a normal distribution. Counts and proportions were used to describe categorical variables. We compared groups using chi-squared tests, t-tests, and logistic regression analysis. Variables with nonzero coefficients in the least absolute shrinkage and selection operator (LASSO) analysis were chosen for further analysis. Variables with p < 0.05 in univariate logistic regression analysis were entered into multivariate logistic regression analysis, which was used to determine the independent risk factors. ML algorithms, namely RF, NBC, DT, XGB, GBM, and LR, were implemented to establish models. We ranked the importance of the variables for each model. XGB is an ensemble algorithm based on boosting. It is a typical ensemble of CART trees and an improvement on gradient tree boosting.
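The objective function that the next sentences describe is not reproduced in the text. Assuming it is the standard regularized objective of the XGBoost formulation, it can be written as follows, where the sum over i runs over patients and the sum over k runs over the trees f_k of the ensemble. The form of Ω, with T the number of leaves, w the leaf weights, and γ and λ regularization constants, is likewise taken from the standard formulation rather than from this article.

```latex
\mathcal{L}(\phi) = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2}
```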
In the XGB objective function, l is a differentiable convex loss function that measures the difference between the prediction ŷ_i and the target y_i, and the second term, Ω, penalizes the complexity of the model. The probabilistic outputs are evaluated using the receiver operating characteristic (ROC) curve. 10-fold cross-validation and ROC curve analysis were conducted to evaluate the performance of the models. The maximum AUC was the basis for determining the best model. A heatmap showed the correlation between the variables in the models. The number in each grid of the heatmap represented the correlation coefficient, and the color depth was negatively correlated with the strength of the correlation between variables. According to the results of the best model, a Web calculator was established.
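The evaluation procedure described above, 10-fold cross-validation scored by AUC, can be sketched in plain Python. This is an illustrative outline under stated assumptions rather than the pipeline used in the study: the function names are hypothetical, the folds are not stratified, and `fit` stands in for any of the six model-training routines.

```python
import random
from statistics import mean, stdev


def roc_auc(labels, scores):
    """AUC as the probability that a random positive case outscores a
    random negative case; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_auc(X, y, fit, k=10, seed=0):
    """Return (mean AUC, standard deviation) over k held-out folds.

    fit(X_train, y_train) must return a function that maps one row of
    features to a risk score.
    """
    aucs = []
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        aucs.append(roc_auc([y[i] for i in fold],
                            [predict(X[i]) for i in fold]))
    return mean(aucs), stdev(aucs)
```

Running `cross_validated_auc` once per candidate model and keeping the model with the largest mean AUC mirrors the selection rule stated above. With an outcome as imbalanced as this one, stratified folds would be a sensible refinement.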

Basic Characteristics.
A total of 42650 kidney cancer patients from the SEER database were enrolled in this study. Of these, 25058 (58.75%) were married, and the median age was 64 [55, 73] years. Marital status, race, primary site, grade, laterality, pathological type, T stage, N stage, bone metastasis, and liver metastasis were variables with statistically significant differences (p < 0.05). White male patients made up the largest group.
As shown in Table 1, there were 3171 kidney cancer patients with lung metastasis and 39479 kidney cancer patients without lung metastasis. Comparing these two groups, the differences in all variables were statistically significant (p < 0.05). As shown in Figure 1, nine variables with nonzero coefficients in the LASSO analysis were selected for logistic regression. As shown in Table 2, bone metastasis, brain metastasis, grade, liver metastasis, N stage, primary site, sequence number, T stage, and tumor size were factors with p < 0.05 in univariate logistic regression analysis. After multivariate regression analysis, we identified that bone metastasis (yes, OR = 4.83, 95% CI = 4.27-5.46, p < 0.001), brain metastasis (yes, OR = 8.41, 95% CI = 6.72-10.51, p < 0.001; unknown, OR = 6. 13 .94, p < 0.001), and tumor size (OR = 1.01, 95% CI = 1-1.01, p < 0.001) were independent risk factors of LM in renal cancer patients. Furthermore, we found that primary site (C65.9-Renal pelvis, OR = 0.38, 95% CI = 0.3-0.49, p < 0.001) and sequence number (more, OR = 0.62, 95% CI = 0.56-0.69, p < 0.001) were independent protective factors. As shown in Figure 2, each grid in the heatmap shows the correlation coefficient between a pair of variables, encoded by color depth.
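For readers who want to relate the reported odds ratios to the underlying regression coefficients, the sketch below converts between the two. It assumes Wald-type 95% confidence intervals (coefficient ± 1.96 standard errors on the log-odds scale), which the article does not state explicitly. The function names are illustrative.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a logistic-regression coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coefficient_from_reported(or_value, lower, upper, z=1.96):
    """Recover (beta, se) from a reported OR and its confidence interval,
    assuming the interval is symmetric on the log-odds scale."""
    beta = math.log(or_value)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# Reported above for bone metastasis: OR = 4.83, 95% CI = 4.27-5.46.
beta, se = coefficient_from_reported(4.83, 4.27, 5.46)
or_value, lower, upper = odds_ratio_with_ci(beta, se)
```

If the reported interval is indeed a Wald interval, the round trip reproduces the published bounds to within rounding. An OR below 1, as reported for renal pelvis primary site and for sequence number, corresponds to a negative coefficient.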

Development and Validation of Predictive Models.
To develop the ML models, the nine independent predictors with p < 0.05 in the multivariate regression analysis were used as inputs, and lung metastasis status was the outcome variable. Figure 3 shows the relative importance ranking of each input variable in the models. The ranking of variables differed considerably between models. Bone metastasis and T stage ranked relatively high in all models, whereas primary site and sequence number ranked relatively low in all models. For XGB, the variables ranked from high to low were bone metastasis, tumor size, T stage, N stage, grade, liver metastasis, brain metastasis, primary site, and sequence number. We applied six ML algorithms (RF, NBC, DT, XGB, GBM, and LR) to establish models. The results of 10-fold cross-validation (Figure 4) show that the average AUC of all models was above 0.9, and all six ML models fitted well over the ten iterations. The XGB model's average AUC was 0.934 (std = 0.001), so the XGB model was selected as the final prediction model.

Web-Based Calculator.
To facilitate clinical application, a Web-based calculator was established on the basis of the XGB model (https://share.streamlit.io/liuwencai4/renal_lung/main/renal_lung.py). As shown in Figure 5, users can enter the value of each variable by clicking and selecting. The calculator then displays the risk group and the probability of LM in renal cancer.

Discussion
The lung is the most common metastatic site of kidney cancer [7]. Early diagnosis of metastasis can improve the feasibility of surgery and increase the chance of survival. The profile of kidney cancer patients is complex and involves multidisciplinary treatment issues. Artificial intelligence is well suited to this field because of its powerful information extraction and processing ability [16]. Therefore, this study aimed to develop a highly accurate model capable of predicting lung metastasis from kidney cancer.
We identified nine influencing factors: bone metastasis, brain metastasis, grade, liver metastasis, T stage, N stage, primary site, sequence number, and tumor size. In addition, 10-fold cross-validation was adopted to check the performance of the models. Finally, the model with the highest AUC was presented as a Web calculator for application.
Our study found that organ metastases were important influencing factors. Many patients develop multiple organ metastases. In the study of Wei Xi, metastases at two or more sites accounted for 33% [18]. Jianxin Xue's study also found that 8.76% of patients with clear-cell RCC had distant metastases at the time of diagnosis, and 35.01% (1026/2931) of metastatic patients had multiple metastases [7]. This finding was consistent with the results of the present study. Furthermore, organ metastases have also been reported as predictors in previous studies. Shengtao Dong et al. constructed a bone metastasis risk prediction model with brain metastasis, liver metastasis, and lung metastasis as predictors [19]. Bone metastasis, liver metastasis, and brain metastasis were strong predictors in the models of our study.
As shown in Figure 3, the factors used to construct the XGB, RF, and NBC models for predicting lung metastasis from kidney cancer were ranked by importance.
Variables including T stage, N stage, and pathological grade were associated with the development of LM in renal cell carcinoma [20]. These risk factors were also important in other distant metastases of kidney cancer [1,7]. This highlights the significance of stage and grade in predicting renal cell carcinoma organ metastasis. In addition, N stage and T stage were used not only to predict kidney cancer metastasis but also as important parameters in prognostic models. For example, the University of California School of Medicine used the stage to predict five-year survival in metastatic and nonmetastatic patients [21].
Tumor size was an independent predictor of overall survival [4]. The pseudocapsule (PS) in kidney cancer is the fibrous interface between the tumor and the renal parenchyma [22]. There is a richer blood supply around the PS. As the PS progresses from being infiltrated to being penetrated, the incidence of venous tumor thrombus (VTT) and microvascular invasion (MVI) increases [23]. Thus, further distant metastasis occurs. The probability of distant metastasis may increase as the primary lesion expands, owing to the increase in PS surface area.
Our study revealed that primary site was a protective factor: renal pelvic cancer was less likely to metastasize to the lung than renal cancer. Renal cancer originates from the epithelium of the proximal tubules, whereas renal pelvic cancer originates from the urothelium. Renal pelvic cancer is more likely to be diagnosed and treated at an early stage because of the high incidence of hematuria. Vascularization is an important condition for tumor growth, invasion, and metastasis [24]. The blood supply of renal pelvic cancer may be less than that of renal cancer [25]. Sequence number was another independent protective factor of LM in kidney cancer. We found that the cancer was less likely to spread to the lung in patients with >1 primary tumor. One possible explanation is that patients with multiple tumors may have insufficient time to form LM because of their poor prognosis. Another is that more symptoms could promote early diagnosis and medical treatment. The exact mechanism needs to be further explored.
Few studies have been performed to predict LM in patients with renal cell carcinoma. Although some studies have reported biomarkers other than the above predictors for LM prediction [26], few of these markers have been applied. Previously, Xinyu Sheng's team at the Zhejiang University School predicted LM in kidney cancer patients based on patient data from the SEER database, developing a nomogram and a model based on TNM stage with AUCs of 0.780 and 0.618, respectively; that study was not externally validated [20,27]. Although the AUCs of those models in the training set are greater than 0.50, there is still room for improvement. In contrast, the AUCs of the six machine learning models constructed in this study are all above 0.9, which reflects the good robustness of the models. We expect that the Web calculator constructed with the XGB model in this study can be applied or tested in the future.
This study also had some limitations. First, the indicators in the SEER database, including metastasis sites and some serological data, are not comprehensive [7,12]. Second, further multicenter validation is needed in the future.

Conclusion
This study provided preliminary evidence that ML algorithms can be used to predict lung metastases in patients with kidney cancer. The prediction model cannot specify the genetic characteristics of these patients. Nevertheless, this low-cost, noninvasive, and easy-to-implement diagnostic method is useful for clinical work. The model still needs further real-world validation.

Data Availability
The data used in this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Wenle Li, Qian Zhou, Wencai Liu, and Shengtao Dong have contributed equally to this work. CLY, JFC, and YLJ designed the study. WLL and ZQ collected and evaluated the data and wrote the first draft of the manuscript. All authors contributed to the interpretation of the results and the final draft of the manuscript.