Machine Learning for Predicting Distant Metastasis of Medullary Thyroid Carcinoma Using the SEER Database

Objectives. We aimed to establish an effective machine learning (ML) model for predicting the risk of distant metastasis (DM) in medullary thyroid carcinoma (MTC). Methods. Demographic data of MTC patients were extracted from the Surveillance, Epidemiology, and End Results (SEER) database of the National Institutes of Health between 2004 and 2015 to develop six ML algorithm models. Models were evaluated based on accuracy, precision, recall rate, F1-score, and area under the receiver operating characteristic curve (AUC). The association between clinicopathological characteristics and target variables was interpreted. Analyses were also performed using traditional logistic regression (LR). Results. In total, 2049 patients were included and 138 developed DM. Multivariable LR showed that age, sex, tumor size, extrathyroidal extension, and lymph node metastasis were predictive features for DM in MTC. Among the six ML models, the random forest (RF) had the best predictability in assessing the risk of DM in MTC, with an accuracy, precision, recall rate, F1-score, and AUC higher than those of the traditional binary LR model. Conclusion. RF was superior to traditional LR in predicting the risk of DM in MTC and can provide a valuable reference for clinicians in decision-making.


Introduction
As a result of changes in living environments, heightened health awareness, and advances in detection technology, the incidence of thyroid cancer has increased considerably in most parts of the world [1]. Medullary thyroid carcinoma (MTC) is a relatively rare malignancy, constituting approximately 5% of all thyroid malignancies. Patients with MTC generally exhibit a poorer prognosis than those with differentiated thyroid cancer (DTC), with MTC accounting for approximately 13% of all thyroid cancer-related fatalities [2,3]. Roughly 75% of MTC cases are sporadic, while around 25% are inherited in an autosomal dominant manner [4]. Research has demonstrated that mutations in RET, a proto-oncogene, are present in approximately 6% of sporadic MTC patients and up to 98% of familial-inherited MTC patients [5]. Studies have indicated that extrathyroidal extension and distant metastasis (DM) are significant predictors of poor prognosis in patients [6,7]. At the time of initial diagnosis, 10%-15% of MTC patients present with DM [8]. DM of MTC may involve the bones, lungs, and liver [9]. The American Thyroid Association's guidelines for the management of medullary thyroid cancer recommend various imaging examinations for MTC potentially involving DM, including enhanced CT, MRI, abdominal ultrasound, and bone scans [10]. These diagnostic methods have a sensitivity of approximately 50%-80% for metastatic disease. In recent years, drugs targeting RET proto-oncogene mutations have proven effective in treating MTC patients with RET mutations [11]. Consequently, early diagnosis of MTC with DM and early intervention for high-risk patients may significantly improve patient survival.
Machine learning (ML) is a subfield of artificial intelligence. Compared to traditional predictive models, ML can enhance the accuracy of models by uncovering nonlinear relationships in large datasets [12,13]. Medical treatment generates vast amounts of patient data. Processing and analyzing these data using ML can therefore offer a reliable reference for clinicians to diagnose diseases and prognosticate outcomes. Thus, our study aimed to develop a model based on the Surveillance, Epidemiology, and End Results (SEER) database to predict the occurrence of DM in patients with MTC.

Data Sources and Study Population
Data for this study were acquired from the SEER public databases, utilizing SEER*Stat 8.4.0.1 software for data extraction. Our study focused on patients diagnosed with MTC in the United States between 2004 and 2015. We excluded patients with missing data, unclear clinical and pathological conditions, uncertain histological classifications, or other types of thyroid cancer (TC). The histological types were restricted to medullary carcinomas. According to the International Classification of Diseases for Oncology, third edition (ICD-O-3), the histological codes were 8345/3 and 8510/3, and staging followed the AJCC 7th edition TNM system. Variables included age, sex (male or female), race (White, Black, and others), year of diagnosis, Spanish-Hispanic origin, laterality (unilateral and bilateral), multifocality (solitary and multifocal), tumor size, extrathyroidal extension, lymph node metastasis, MTC subtypes, and DM. Distant metastasis was defined as tumor invasion of at least one target organ, such as the brain, bone, liver, or lung. As the SEER database contains public data, neither informed consent from the relevant patients nor ethical approval was required for its use in research. Our request for access to the SEER data was approved by the National Cancer Institute, USA (reference number 19238-Nov2021).

Screening for Risk Factors and Model Construction
Statistical analysis was conducted using SPSS software (version 26.0; IBM Corporation). In the univariable analysis, we employed Pearson's correlation analysis to examine the association between predictor variables, with results presented as heat maps. The predictive factors related to DM were initially screened through univariable analysis (p < 0.05), and the variables that met the criteria were incorporated into a multivariable logistic regression (LR) analysis. The receiver operating characteristic (ROC) curve was plotted and analyzed based on the results. An area under the ROC curve (AUC) greater than 0.5 was considered meaningful. All computed p values were two-sided, and statistical significance was accepted at p < 0.05.
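The pairwise correlation underlying such a heat map can be illustrated with a minimal Python sketch. This is not the SPSS procedure used in the study; the function name `pearson_r` and the toy 0/1 codings are hypothetical and serve only to show the calculation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy codings (1 = present, 0 = absent); these are NOT SEER data.
toy = {
    "extrathyroidal_extension": [1, 1, 0, 0, 1, 0, 0, 1],
    "lymph_node_metastasis":    [1, 1, 0, 0, 1, 1, 0, 1],
    "multifocality":            [0, 1, 0, 1, 0, 1, 0, 1],
}

# Full correlation matrix: the numbers a heat map would colour.
names = list(toy)
corr_matrix = {a: {b: pearson_r(toy[a], toy[b]) for b in names} for a in names}
```

Each cell of `corr_matrix` lies between -1 and 1, and the diagonal equals 1 because every variable is perfectly correlated with itself.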
The rate of DM among patients with MTC in the SEER database was low, resulting in an unbalanced original dataset. To establish a more accurate prediction model, it is essential to address this imbalance. In this study, we employed two techniques for processing the original dataset: oversampling and undersampling. We then used a correlation matrix to analyze the original and processed data. The synthetic minority oversampling technique (SMOTE) and undersampling are standard approaches for balancing class distribution in imbalanced datasets and are widely used to improve prediction models [14]. The distribution of the target variables after the sampling process is illustrated in Figure 1. After data processing, the correlation between variables became more apparent, as demonstrated in Figure 2.
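The two balancing strategies can be sketched in plain Python as follows. This is a simplified illustration of the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) and of random undersampling, not the implementation used in the study. The function names, the two continuous toy features, and the random seeds are assumptions; only the class counts (1911 without DM, 138 with DM) come from the study.

```python
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class."""
    return random.Random(seed).sample(majority, n_keep)

# Toy continuous features (hypothetical); class sizes match the study cohort.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1911)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(138)]

# Oversampling: grow the minority class up to the majority size.
oversampled_minority = minority + smote_like(minority, len(majority) - len(minority))
# Undersampling: shrink the majority class down to the minority size.
undersampled_majority = random_undersample(majority, len(minority))
```

Oversampling yields 1911 cases per class, whereas undersampling leaves only 138 per class. Because most variables in this study are categorical, a real application would need a SMOTE variant that handles categorical features rather than the plain interpolation shown here.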
We used Python software (version 3.9.12, Python Software Foundation) to construct the prediction models; all variables were included in the ML models. The processed data (oversampled and undersampled) were randomly divided into a training set (80%) and a test set (20%). Six commonly used ML algorithms were trained on the training set: decision tree (DT), support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), extreme gradient boosting (XGBoost), and gradient boosting machine (GBM). Model evaluation was primarily based on accuracy, precision, recall, F1-score, and AUC value. The model with the highest AUC value was selected as the optimal model.
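For reference, the 80/20 split and the threshold-based evaluation metrics can be written out in a few lines of Python. This is a minimal sketch with hypothetical helper names (`train_test_split_80_20`, `classification_metrics`) and toy labels; it is not the code used in the study.

```python
import random

def train_test_split_80_20(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into training and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score for binary labels (1 = DM)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy example (hypothetical labels and predictions, not study data).
train_rows, test_rows = train_test_split_80_20(range(100))
metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 0],
    y_pred=[1, 1, 0, 1, 0, 0, 0, 0],
)
```

Precision is the share of predicted DM cases that are true DM cases, recall is the share of true DM cases that the model finds, and the F1-score is their harmonic mean.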

Analysis of Patient Information
This study included a total of 2049 MTC patients, of whom 138 (6.7%) developed DM and the remaining 1911 (93.3%) did not. The baseline characteristics of all patients are presented in Table 1.
In the univariable LR analysis, DM was significantly associated with age, sex, multifocality, tumor size, extrathyroidal extension, and lymph node metastasis (p < 0.05) (Table 2). These variables were incorporated into the multivariable LR analysis.
In the multivariable LR analysis, age [15], sex, extrathyroidal extension, lymph node metastasis, and tumor size were identified as independent predictors of DM in MTC. However, multifocality was not an independent predictive factor for the occurrence of DM in MTC. Further details can be found in Table 2. The ROC curve was plotted based on the traditional multivariable LR results (AUC = 0.838, 95% confidence interval (CI): 0.808-0.868, p < 0.001). Detailed information is summarized in Figure 3.
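The AUC has a simple probabilistic reading: it is the probability that a randomly chosen patient with DM receives a higher predicted score than a randomly chosen patient without DM, with ties counted as one half. The following minimal sketch computes it directly from that definition; the function name `auc_from_scores` and the toy scores are hypothetical and unrelated to the study data.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting tied scores as half a correct pair."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_from_scores(toy_labels, toy_scores)  # 3 / 4 = 0.75
```

Under this reading, an AUC of 0.5 corresponds to random ranking, which is why values above 0.5 were treated as meaningful.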
For the analysis of the ML algorithms, six ML models were constructed and evaluated based on accuracy, precision, recall rate, F1-score, and AUC value. ML models constructed after data oversampling outperformed those constructed after undersampling. Tables 3 and 4 provide details on the six ML models constructed from the over- and undersampled data. The ROC curves of the six ML models, constructed by oversampling and undersampling in the training and test sets, are depicted in Figure 4. In the models established using oversampled data, the AUC of all models was greater than [...] 5, revealed that lymph node metastasis was the most critical factor in determining whether MTC patients also have DM.
This study developed an online web calculator for evaluating the risk of distant metastasis in MTC patients, which can be applied to clinical patients (https://121.43.117.60:8000/).

Discussion
Patients with MTC account for only 5% of individuals newly diagnosed with TC, yet the global incidence of MTC is rising rapidly. Deaths from MTC comprise approximately 13% of all TC deaths, and the 10-year overall survival rate of MTC ranges between 65% and 71%. However, when MTC occurs with DM, the 10-year overall survival rate can decrease to 40%-44% [15,16]. MTC neither concentrates radioactive iodine nor is it inhibited by thyroxine [17]. Total thyroidectomy is the primary treatment method for MTC, with the decision to perform lymph node dissection depending on the specific situation. Adjuvant radiation therapy can be considered for MTC patients with incomplete resection, a high risk of local recurrence, or DM [10]. Radiotherapy can provide continuous control in patients with DM and prevent further progression [18]. However, the impact of radiotherapy on patients' survival rates remains controversial. In patients without DM, radiotherapy may cause more harm than good [19]. Some perspectives suggest that the role of radiation therapy in MTC is limited to patients who are ineligible or have contraindications for surgical treatment or targeted drugs [20]. Targeted drugs are recommended for patients with DM, particularly because studies have demonstrated [11,21] that RET-specific inhibitors (selpercatinib and pralsetinib) are effective and promising therapies for MTC patients with DM and progression. The prognosis and treatment effectiveness of MTC are largely related to tumor staging; therefore, early diagnosis is a crucial objective in the management of MTC patients [22]. Previous research on MTC has mostly focused on prognosis and analysis of survival [23,24]. However, there are few studies on the DM of MTC. Utilizing independent predictors to predict DM can help physicians better evaluate patients with MTC and provide them with more effective individualized treatment options.
Univariable analysis showed that age, sex, multifocality, tumor size, extrathyroidal extension, and lymph node metastasis were significantly associated with DM. However, multivariable analysis indicated that multifocality could not serve as an independent predictor of DM in patients with MTC. This finding is consistent with the result of the RF feature selection, although it is generally believed that multifocality has an independent predictive effect on cervical lymph node metastasis in MTC [25]. Multifocality had a relatively small impact on predicting the occurrence of DM in patients with MTC, which aligns with the findings of previous research [25,26]. RF feature selection revealed that extrathyroidal extension was a key factor in predicting DM, while lymph node metastasis was the most important predictor of DM, consistent with a previous study [26]. We also identified tumor size as an important predictor. Compared with tumors larger than 4 cm, the odds ratios (ORs) for tumors of 2-4 cm and ≤2 cm were 0.555 and 0.287, respectively. As tumor size increases, the risk of DM in MTC also increases. Tumor size significantly impacts the recurrence and long-term rates of MTC [24]. Extrathyroidal extension and tumor size are also crucial predictive factors for lymph node metastasis and DM in MTC [6,16]. Meanwhile, extrathyroidal extension and tumor size are directly related to T staging in TNM staging, suggesting that tumor stage can also serve as a predictive factor for DM. Contrary to a previous study [27], sex was found to be an independent predictor of DM. We also discovered that female sex was a protective factor for DM. This conclusion is similar to that of a previous study [26]. In our study, 55 years of age was used as the cutoff age [27], and older patients were more likely to develop DM than younger patients. Therefore, older patients should be actively followed up and regularly examined. In this study, race could not independently predict DM in patients with MTC, which is consistent with the results of previous research [26,27]. In traditional LR, MTC subtypes and Spanish-Hispanic origin could not be used as independent predictors, and their influence on the feature selection of RF was also small.
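To make the reported odds ratios concrete, recall that in logistic regression an OR is the exponential of the fitted coefficient, and that it multiplies the odds (not the probability) of DM relative to the reference category, other covariates being held fixed. The sketch below converts the two ORs for tumor size into coefficients and into absolute risks under an assumed baseline. The 20% baseline risk for tumors larger than 4 cm is purely hypothetical and is not a result of this study; the helper name is likewise illustrative.

```python
import math

def risk_after_odds_ratio(baseline_risk, odds_ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)

# Odds ratios reported in the text (reference category: tumors > 4 cm).
REPORTED_OR = {"2-4 cm": 0.555, "<=2 cm": 0.287}

# Logistic regression coefficient for each category: beta = ln(OR).
coefficients = {size: math.log(or_) for size, or_ in REPORTED_OR.items()}

# HYPOTHETICAL baseline risk for the reference category, for illustration only.
ASSUMED_BASELINE_RISK = 0.20

risks = {size: risk_after_odds_ratio(ASSUMED_BASELINE_RISK, or_)
         for size, or_ in REPORTED_OR.items()}
```

Both coefficients are negative because both ORs are below 1, and the implied risk falls as tumor size decreases, which matches the direction described above.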
We constructed six predictive models based on the SEER database to predict DM in patients with MTC and evaluated them based on accuracy, precision, recall rate, F1-score, and AUC value. We employed the SMOTE technique to address unbalanced datasets and concluded that, for unbalanced datasets used to build ML models, SMOTE is superior to undersampling [14]. Both oversampling and undersampling enhanced the performance of the model, and the prediction model established by oversampling outperformed the one established by undersampling. This may be because few MTC patients have DM, so that undersampling leaves the model with limited ability to identify key predictive factors for patients with DM. This study established six ML models, among which RF demonstrated excellent predictive performance (AUC = 0.946), surpassing that of the traditional LR model (AUC = 0.838). Therefore, RF was the best model for predicting DM in MTC patients using the SEER database.

Limitations
This study has some limitations. First, as it is based on North American demographic data, other populations should be used for validation in future research. Second, the predictive performance of the model warrants further optimization, and additional predictive factors potentially related to DM should be incorporated into the prediction model in future studies. Finally, due to the limitations of the database, tumor markers such as CEA and AFP were not available for the MTC patients. We will continue to improve and supplement the model in future studies.

Conclusions
In conclusion, this study aimed to identify independent predictors of DM in patients with MTC and to develop a prediction model utilizing ML algorithms. Our analysis, based on the SEER database, demonstrated that age, sex, tumor size, extrathyroidal extension, and lymph node metastasis were significant independent predictors of DM in MTC patients. The RF ML algorithm outperformed the traditional LR model in predicting DM, providing a more accurate and reliable tool for clinical use. The application of the SMOTE technique for addressing unbalanced datasets proved effective in enhancing the performance of the prediction model. Our findings underscore the importance of early diagnosis and individualized treatment plans for MTC patients, ultimately contributing to improved patient outcomes.

Figure 1: The distribution of the target variables after the sampling process. (a) Oversampling data, (b) undersampling data, and (c) target variable distribution of original data.

Figure 2: Heatmaps of the correlation between characteristic features of the patients in different datasets. (a) Oversampling data, (b) undersampling data, and (c) original data.

Figure 3: ROC curve of the LR model for predicting distant metastasis in MTC patients. Diagonal segments are produced by ties.

Table 1: The detailed demographic information of the patients with MTC.

Table 2: Univariable analysis and multivariable analysis of variables related to distant metastasis.

Table 3: Comparison of prediction performance between different models constructed from oversampling data.

Figure 5: Feature importance derived from the RF model. The plot shows the relative importance of the variables in the RF model. MTC, medullary thyroid carcinoma.