Risk Assessment of Liver Metastasis in Pancreatic Cancer Patients Using Multiple Models Based on Machine Learning: A Large Population-Based Study

Background. A more accurate prediction of liver metastasis (LM) in pancreatic cancer (PC) would help improve clinical therapeutic effects and follow-up strategies for the management of this disease. This study aimed to assess various prediction models based on machine learning algorithms for evaluating the risk of LM. Methods. We retrospectively reviewed the clinicopathological characteristics of PC patients from the Surveillance, Epidemiology, and End Results database from 2010 to 2018. The logistic regression, extreme gradient boosting, support vector machine, random forest (RF), and deep neural network machine learning algorithms were used to establish models to predict the risk of LM in PC patients. Specificity, sensitivity, and receiver operating characteristic (ROC) curves were used to determine the discriminatory capacity of the prediction models. Results. A total of 47,919 PC patients were identified, of whom 15,909 (33.2%) developed LM. After iterative filtering, nine features were included to establish the machine learning risk model for LM. Among the models, the RF showed the most promising results in the prediction of LM (ROC 0.871 for the training set and 0.832 for the test set). In risk stratification analysis, the LM rate and 5-year cancer-specific survival (CSS) in the high-risk group were worse than those in the intermediate- and low-risk groups. Surgery, radiotherapy, and chemotherapy were found to significantly improve the CSS in the high- and intermediate-risk groups. Conclusion. In this study, the constructed RF model could accurately predict the risk of LM in PC patients and has the potential to provide clinicians with more personalized clinical decision-making recommendations.


Introduction
Pancreatic cancer (PC) is the fourth leading cause of cancer-related mortality in the USA, and it causes an estimated 25,270 deaths per year worldwide, accounting for 8% of the total cancer death toll [1]. Pancreatic cancer has a 5-year survival rate of <8%, and up to 80% of patients with PC already have distant organ metastasis at the time of diagnosis, which significantly reduces survival benefits from surgical resection of the primary tumor [2]. Thus, an accurate assessment of locoregional and/or distant metastases in patients with PC is essential to determine whether these patients should undergo additional surgical resection or other combination therapies.
The liver is the most common metastasis site, accounting for 37-41.9% of the initially diagnosed cases [3,4]. Moreover, more than 60% of the patients who undergo tumor resection relapse with distant liver recurrence within the first 24 months after surgery [5]. Magnetic resonance imaging, computed tomography, and ultrasonography are currently the most commonly used inspection methods. However, their use is constrained by cost, operator expertise, and other factors, which can substantially affect the judgment of clinicians. Thus, a better model for the prediction of liver metastasis (LM) in PC is critical to improve treatment and patient outcomes.
The dismal outcomes of PC partly result from its aggressive metastatic nature, but applying appropriate treatment options according to different disease processes can improve the survival rate of patients. In this study, we aimed to establish a novel, highly reliable prediction model for liver metastasis based on clinical parameters and simple histopathological features, which could help to improve patient risk stratification in early PC.

Data Source and Study Population
This retrospective study was carried out based on the Surveillance, Epidemiology, and End Results (SEER) database. The publicly available data were collected from 18 cancer registries between January 1, 2010, and December 31, 2018, using SEER*Stat software (ver. 8.3.5). The patients' files from the SEER database were accessed with official permission, and patients' records were anonymized. The study was approved by the Ethics Committee of the National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences.

Main Outcomes and Selected Variables.
Patients with primary pancreatic cancer were included in this cohort study. The target outcome was hepatic metastasis of pancreatic cancer. The cancer diagnosis was based on the classification of the topography or histology according to the International Classification of Diseases for Oncology-3 (ICD-O-3)/WHO 2008 guidelines. The primary pancreatic tumor locations included C25.0-head of the pancreas, C25.1-body of the pancreas, C25.2-tail of the pancreas, C25.3-pancreatic duct, C25.4-islets of Langerhans, C25.7-other specified parts of the pancreas, C25.8-overlapping lesion of the pancreas, and C25.9-pancreas, NOS (not otherwise specified). The exclusion criteria were as follows: (1) the presence or absence of metastasis at diagnosis was unknown; (2) pancreatic cancer patients without pathohistological diagnosis; (3) patients younger than 20 years; (4) patients with benign or borderline tumors; and (5) patients lacking information on race, histological type, or treatment strategy. The derived American Joint Committee on Cancer (AJCC) 6th and SEER combined stage (2016+) TNM staging was used in this study. Patient demographics included gender, age, year at initial diagnosis, and race. Tumor characteristics included lymph biopsy, surgery, tumor size, marital status, survival status, survival time, the presence of distant metastasis, TNM staging (tumor, lymph node metastasis, and distant metastasis), insurance status, and radiation and chemotherapy records. A flowchart of the data collection process is presented in Figure 1.
2.3. Feature Engineering and Data Transformation. These readily available clinical and demographic variables from the SEER database were processed using feature engineering techniques to establish the models. The continuous variables (age and year at initial diagnosis, tumor size, and number of positive lymph biopsies) were converted into categorical variables according to their clinical characteristics or median.
To improve the applicability of the prediction model, we employed cross-validation (CV) and recursive feature elimination to iteratively filter variables using the random forest (RF) classifier. CV was used for internal validation as a robust method for evaluating machine learning models and improving their performance [6]. The variables were evaluated based on their relative importance to the receiver operating characteristic (ROC) performance of the models.
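The elimination loop described above can be outlined in a few lines. This is a minimal Python sketch of generic recursive feature elimination, not the authors' R code: the function name `recursive_feature_elimination`, the callback `rank_features`, and the importance values in `TOY_IMPORTANCE` are all hypothetical, and in practice the callback would refit a random forest under cross-validation on the remaining features at each step.

```python
def recursive_feature_elimination(features, rank_features, n_keep):
    """Drop the least important feature repeatedly until n_keep remain.

    rank_features(remaining) must return a dict mapping each remaining
    feature to an importance score (for example, from a refitted model).
    """
    remaining = list(features)
    while len(remaining) > n_keep:
        importances = rank_features(remaining)
        weakest = min(remaining, key=lambda name: importances[name])
        remaining.remove(weakest)
    return remaining


# Hypothetical importance values, for illustration only (not from the paper).
TOY_IMPORTANCE = {
    "surgery": 0.30,
    "radiotherapy": 0.25,
    "tumor_size": 0.20,
    "race": 0.05,
    "insurance": 0.02,
}


def toy_ranker(remaining):
    # A real ranker would refit the classifier on `remaining`;
    # this toy version simply looks up fixed scores.
    return {name: TOY_IMPORTANCE[name] for name in remaining}


selected = recursive_feature_elimination(list(TOY_IMPORTANCE), toy_ranker, n_keep=3)
print(selected)
```

Refitting at every step matters because the importance of a feature can change once a correlated feature has been removed.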
2.4. Risk Model Establishment and Risk Stratification. All of the patients included in this study were randomly divided into independent training (80%) and testing (20%) sets using R [7]. The prediction models were built based on the training set, after which they were evaluated and validated based on the test set. The extreme gradient boosting (XGBoost), RF, support vector machine (SVM) [8], deep neural network (DNN), and logistic regression (LR) algorithms were trained by performing 10-fold CV on the training set. Univariate and multivariate logistic regression analyses were employed to evaluate the features significantly correlated with the risk of hepatic metastasis. In addition, correlation analysis was performed on the features included in this study to evaluate their mutual relationships. The machine learning models were established and evaluated using the caret package in R.
According to our preliminary findings, the performance of these different machine learning algorithms was roughly the same for predicting LM, but RF tended to perform better on both the training and testing sets. To further evaluate the risk of LM for PC patients, we calculated the risk score for every patient based on the RF and then sorted the patients by risk score from high to low. The pancreatic cancer patients were divided into three risk groups of equal size: the high-risk group, intermediate-risk group, and low-risk group, which can inform the selection of a suitable treatment strategy [9].
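The data partitioning described above can be illustrated as follows. This is a minimal Python sketch under stated assumptions, not the authors' R/caret code: the helper names `split_train_test` and `k_fold`, the seed, and the rounding-down of the test set size are illustrative choices.

```python
import random


def split_train_test(n_samples, test_fraction=0.2, seed=0):
    """Randomly split sample indices into training and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)  # assumption: round down
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


train_idx, test_idx = split_train_test(47919)
print(len(train_idx), len(test_idx))

for fold_train, fold_val in k_fold(train_idx, k=10):
    # A model would be fitted on fold_train and scored on fold_val here.
    pass
```

The test set is held out before cross-validation begins, so the 10 folds are drawn from the training indices only and the test patients never influence model tuning.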
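The stratification step above amounts to ranking patients by predicted risk and cutting the ranking into thirds. The following is a minimal Python sketch of that idea; the function name `stratify_into_tertiles`, the group labels, and the toy scores are hypothetical, and ties or group sizes not divisible by three are handled by simple integer cut points, which is an assumption.

```python
def stratify_into_tertiles(risk_scores):
    """Label each patient 'high', 'intermediate', or 'low' by risk-score rank."""
    n = len(risk_scores)
    # Patient positions sorted from highest to lowest risk score.
    order = sorted(range(n), key=lambda i: risk_scores[i], reverse=True)
    first_cut, second_cut = n // 3, (2 * n) // 3
    labels = [None] * n
    for rank, patient in enumerate(order):
        if rank < first_cut:
            labels[patient] = "high"
        elif rank < second_cut:
            labels[patient] = "intermediate"
        else:
            labels[patient] = "low"
    return labels


# Toy risk scores for six hypothetical patients.
toy_scores = [0.9, 0.1, 0.5, 0.7, 0.2, 0.4]
toy_labels = stratify_into_tertiles(toy_scores)
print(toy_labels)
```

With 47,919 patients, these cut points give three groups of exactly 15,973 each.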
2.5. Statistical Analysis. The chi-squared test was employed to assess the significance of differences among categorical variables in the training set and test set, while the Mann-Whitney U test was used for continuous variables. The Kaplan-Meier method and log-rank test were used to evaluate the differences among subgroups in univariate survival analysis. The cancer-specific survival (CSS) and the survival time were the main evaluation indices. Propensity score matching (PSM) was used to balance patients at a ratio of 1 : 1 between PC patients with and without treatment. To measure the performance of the models, the sensitivity, specificity, Gini, and area under the ROC curve, as well as the 95% confidence intervals (CIs), were calculated based on the number of correctly classified TP (true positive) cases and the number of incorrectly classified FP (false positive) cases. The DeLong test was employed to compare model performance in identifying liver metastasis (P < 0.05). All analyses were performed using R version 3.6.1.
The patients were divided into a training set and an internal test set (n = 9,583) at a ratio of 8 : 2 (Figure 1). All demographic and clinicopathological variables of these patients are detailed in Table 1.
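The performance measures named in the statistical analysis can be computed from labels and predicted scores as sketched below. This is a minimal Python illustration, not the authors' R code. Two assumptions are made: the area under the ROC curve is computed by pairwise comparison of positive and negative scores (equivalent to the Mann-Whitney statistic), and "Gini" is taken to mean the common convention 2 × AUC − 1, which the text does not define. The function names and toy data are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient, assuming the convention Gini = 2 * AUC - 1."""
    return 2 * auc(y_true, scores) - 1


# Toy data: 1 = liver metastasis, 0 = no liver metastasis.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec = sensitivity_specificity(labels, predictions)
print(sens, spec, auc(labels, scores), gini(labels, scores))
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC and Gini summarize ranking quality across all thresholds.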

Variable Feature Importance of Liver Metastasis Prediction
To evaluate the association between these features and the risk of liver metastasis, univariate and multivariate logistic regression analyses were performed for linear correlation analysis. Five risk models were established based on the selected features. We evaluated the importance of the selected features by the size of the gain value for predicting liver metastasis in the five models (Figure 2). Although the importance of features varied slightly among the models, the overall results showed that surgery, radiotherapy, primary tumor site, and tumor size ranked at the top of the list. The tumor treatments (including surgery, radiotherapy, and chemotherapy) were closely associated with liver metastasis.
The specificity, sensitivity, ROC value, and Gini score were calculated to assess the reliability of each model (Table 3).
3.4. Risk Stratification for Patients. We calculated the risk score of each pancreatic cancer patient for predicting liver metastasis with the RF classifier. The PC patients were divided evenly into three risk groups according to their risk scores ranked from high to low, with about 15,973 (33.3%) patients in each risk group (Figure 3(a)); the patients had the highest risk scores in the high-risk group and the lowest in the low-risk group. Liver metastasis was present in 11,905 (74.5%) patients in the high-risk group, 3898 (24.4%) patients in the middle-risk group, and 106 (0.7%) patients in the low-risk group. The proportions of liver metastasis differed significantly among the three groups (P < 0.001). We then compared the pancreatic cancer 5-year CSS among the three groups (Figure 3(b)); the survival probabilities were significantly different among the three groups; the 5-year CSS was 2.6% in the high-risk group, 4.8% in the middle-risk group, and 26.2% in the low-risk group. The univariate Cox regression analysis showed the following: low-risk group vs. middle-risk group, HR, 2.98; 95% CI, 2.91-3.07; P < 0.001; low-risk group vs. high-risk group, HR, 3.99; 95% CI, 3.88-4.11; P < 0.001; and middle-risk group vs. high-risk group, HR, 1.32; 95% CI, 1.28-1.35; P < 0.001. Thus, the pancreatic cancer patients with higher risk scores had worse survival.
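The liver metastasis proportions reported for the three risk groups can be checked against the cohort totals with simple arithmetic. The short Python sketch below uses only the counts given in the text; the variable names are illustrative.

```python
GROUP_SIZE = 15973          # patients per risk group, as reported
TOTAL_PATIENTS = 47919      # whole cohort, as reported

# Patients with liver metastasis in each risk group, as reported.
lm_counts = {"high": 11905, "middle": 3898, "low": 106}

total_lm = sum(lm_counts.values())
rates = {group: round(100 * count / GROUP_SIZE, 1) for group, count in lm_counts.items()}
overall_rate = round(100 * total_lm / TOTAL_PATIENTS, 1)

print(total_lm, rates, overall_rate)
```

The group counts sum to the 15,909 patients with liver metastasis in the full cohort, and three groups of 15,973 account for all 47,919 patients.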
3.5. The Treatment for Three Risk Groups. To evaluate the therapeutic effect of surgery, chemotherapy, and radiotherapy for pancreatic cancer patients in the different risk score groups, we balanced the demographic and clinicopathological characteristics of patients receiving and not receiving each treatment using propensity score matching at a ratio of 1 : 1, based on the age at PC diagnosis, race, gender, T stage, N stage, year of PC diagnosis, tumor size, and histology. In the high-risk group, the patients receiving surgery, chemotherapy, or radiotherapy had better CSS than patients not receiving treatment (Figures 4(a)-4(c)). In the middle-risk group, the patients receiving surgery (HR, 0.31; 95% CI, 0.28-0.35; P < 0.001), chemotherapy (HR, 0.53; 95% CI, 0.51-0.56; P < 0.001), and radiotherapy (HR, 0.72; 95% CI, 0.60-0.78; P < 0.001) had better CSS than patients not receiving treatment (Figures 4(d)-4(f)). In the low-risk group, the patients receiving surgery (HR, 0.29; 95% CI, 0.27-0.32; P < 0.001) had better survival than patients not receiving surgery (Figure 4(g)). However, chemotherapy and radiotherapy may not improve the survival and prognosis of pancreatic cancer patients in the low-risk group (Figures 4(h) and 4(i)).
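The 1 : 1 matching step can be illustrated with a simple greedy nearest-neighbor procedure on precomputed propensity scores. This is a minimal Python sketch under stated assumptions, not the authors' procedure: the text does not specify the matching algorithm or whether a caliper was used, so greedy matching without replacement and the caliper of 0.05 are illustrative choices, and the function name `match_one_to_one` and the toy scores are hypothetical.

```python
def match_one_to_one(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, control: dicts mapping patient id -> propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        # Closest remaining control patient by propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs


# Toy propensity scores (probability of receiving treatment given covariates).
treated_scores = {"t1": 0.30, "t2": 0.62, "t3": 0.95}
control_scores = {"c1": 0.28, "c2": 0.33, "c3": 0.60, "c4": 0.10}

matched = match_one_to_one(treated_scores, control_scores)
print(matched)
```

In this toy example the treated patient with a score of 0.95 has no control within the caliper and is left unmatched, which is how matching trades sample size for covariate balance.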

Discussion
In this study, we collected data from the SEER database, covering 47,919 patients with PC. The trends in this dataset are therefore likely to be broadly representative. We described the clinical characteristics of PC patients with or without LM and the factors that predict the risk of LM in these patients. The univariate and multivariate logistic regression analyses showed that the age at PC diagnosis, gender, race, primary tumor site, T and N stage, tumor histology, size, surgery, chemotherapy, and radiotherapy were significantly correlated with the risk of liver metastasis in PC. This result was consistent with similar studies. Compared with elderly patients, metastases are more often observed in younger patients, who usually have more malignant tumors with more aggressive histological features, which may lead to higher rates of liver metastasis or other forms of distant metastasis [10,11]. Gender is related to liver metastases, which are less frequent in female patients [12]. Tumor site, grade, size, and LN metastasis were all previously identified as independent predictors of liver metastasis in patients with PC [13].
Figure 3: Risk levels for predicting liver metastasis in pancreatic cancer by using random forest. (a) The risk scores of developing liver metastasis based on random forest. Sorted by risk score from high to low, pancreatic cancer patients were divided into three risk groups of equal size: high-risk group, middle-risk group, and low-risk group. The liver metastasis rates were significantly higher in the high-risk group than in the others (***P < 0.001). (b) The survival comparison among the three risk groups. The CSS in the high-risk group was significantly worse than in the others. Abbreviations: CSS: cancer-specific survival; HR: hazard ratio; CI: confidence interval.
Disease Markers
Studies have shown that primary tumors located in the body and tail of the pancreas are more prone to liver metastases than primary tumors that occur in the head of the pancreas. Compared with tumors located in the head of the pancreas, PC in the body and tail is larger or more frequently diagnosed at an advanced stage, which may increase the risk of liver metastases in these patients [14].
Since patient counseling and decisions are based on estimates from individual risk profiles, these risk factors may help customize liver monitoring and clinical decision-making. Distant metastasis is a sign of advanced cancer, indicating a poor prognosis for PC patients. Approximately 60% of pancreatic cancer patients are diagnosed with metastasis, especially liver metastasis [15]. Surgery is considered to be the best potentially curative treatment for PC patients, but the indications for tumor resection remain controversial. Although a few scholars disagree [16,17], most studies advocate that surgical resection of the primary tumor and liver metastases should be the preferred choice for patients with resectable PC with liver metastases [18-20]. Surgical removal of the primary tumor and metastases can improve the quality of life and prolong survival, especially in patients with oligometastatic PC [21-23]. Timely diagnosis of LM is therefore crucial, since it can provide evidence and recommendations for oncologists to make appropriate clinical treatment decisions. Unfortunately, conventional imaging tests for the diagnosis of liver metastases such as Doppler ultrasound, magnetic resonance imaging, or computed tomography have not shown high sensitivity and specificity in PC [24,25]. Moreover, multiple imaging examinations will also increase the financial burden on patients. Therefore, it is important to establish a model that can accurately predict the probability of LM in PC patients. In this study, we assessed available predictive models using the SEER dataset, which demonstrated significant discrimination and calibration and can provide a basis for formulating an optimal surgical plan. Using this approach, PC patients can be divided into different risk grades to formulate different LM review plans according to the level of risk. Effective clinical decision-making can save large amounts of time and economic costs for patients.
Figure 4, panels (f)-(i): (f) Survival comparison between PC patients who received radiotherapy and those who did not in the middle-risk group (after PSM). (g) Survival comparison between pancreatic cancer (PC) patients who received surgery and those who did not in the low-risk group (after PSM). (h) Survival comparison between PC patients who received chemotherapy and those who did not in the low-risk group (after PSM). (i) Survival comparison between PC patients who received radiotherapy and those who did not in the low-risk group (after PSM). Note: PC patients who received treatment (including surgery, chemotherapy, and radiotherapy) and those who did not were matched by PSM at a ratio of 1 : 1. (a, d, g) PC patients who received surgery and PC patients who did not receive surgery were matched by PSM at a ratio of 1 : 1. (b, e, h) PC patients who received chemotherapy and PC patients who did not receive chemotherapy were matched by PSM at a ratio of 1 : 1. (c, f, i) PC patients who received radiotherapy and PC patients who did not receive radiotherapy were matched by PSM at a ratio of 1 : 1. The matched variables for propensity score matching (PSM) included the age at PC diagnosis, race, gender, T stage, N stage, year of PC diagnosis, tumor size, and histology. Abbreviations: CSS: cancer-specific survival; HR: hazard ratio; CI: confidence interval; PSM: propensity score matching; PC: pancreatic cancer.
In spite of its promising results, this study still has several limitations. First, this is a retrospective study. Second, due to intrinsic limitations of the database, nonunified selection criteria were employed for patients, and detailed information about the treatment, such as operation details, chemotherapy plan, and radiation therapy plan, was not recorded. Third, the major limitation of our study is the lack of important variables, such as time-to-treatment, type of surgery, patient status, and tumor burden at the surgical margin. Finally, further validation based on a large-scale external cohort is needed.

Conclusion
The RF model constructed in this study could accurately predict the risk of LM in PC patients, which may provide clinicians with more personalized clinical decision-making recommendations. The therapeutic effect of treatment is expected to be different for pancreatic cancer patients in the three risk groups based on the RF model. Machine learning technology has the potential to provide reliable individual PC treatment recommendations.