Application of an Interpretable Machine Learning Model to Predict Lymph Node Metastasis in Patients with Laryngeal Carcinoma

Objectives A more accurate preoperative prediction of lymph node metastasis (LNM) plays a decisive role in the selection of treatment in patients with laryngeal carcinoma (LC). This study aimed to develop a machine learning (ML) prediction model for predicting LNM in patients with LC. Methods We collected and retrospectively analysed 4887 LC patients with detailed demographical characteristics including age at diagnosis, race, sex, primary site, histology, number of tumours, T-stage, grade, and tumour size in the National Institutes of Health (NIH) Surveillance, Epidemiology, and End Results (SEER) database from 2005 to 2015. A correlation analysis of all variables was evaluated by the Pearson correlation. Independent risk factors for LC patients with LNM were identified by univariate and multivariate logistic regression analyses. Afterward, patients were randomly divided into training and test sets in a ratio of 8 to 2. On this basis, we established logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), extreme gradient boosting (XGBoost), random forest (RF), and light gradient boosting machine (LightGBM) algorithm models based on ML. The area under the receiver operating characteristic curve (AUC) value, accuracy, precision, recall rate, F1-score, specificity, and Brier score was adopted to evaluate and compare the prediction performance of the models. Finally, the Shapley additive explanation (SHAP) method was used to interpret the association between each feature variable and target variables based on the best model. Results Of the 4887 total LC patients, 3409 were without LNM (69.76%), and 1478 had LNM (30.24%). The result of the Pearson correlation showed that variables were weakly correlated with each other. The independent risk factors for LC patients with LNM were age at diagnosis, race, primary site, number of tumours, tumour size, grade, and T-stage. Among six models, XGBoost displayed a better performance for predicting LNM, with five performance metrics outperforming other models in the training set (AUC: 0.791 (95% CI: 0.776–0.806), accuracy: 0.739, recall rate: 0.638, F1-score: 0.663, and Brier score: 0.165), and similar results were observed in the test set. Moreover, the SHAP value of XGBoost was calculated, and the result showed that the three features, T-stage, primary site, and grade, had the greatest impact on predicting the outcomes. Conclusions The XGBoost model performed better and can be applied to forecast the LNM of LC, offering a valuable and significant reference for clinicians in advanced decision-making.


Introduction
Laryngeal carcinoma (LC) is one of the most common primary malignancies in the head and neck region, with an increasing incidence annually [1]. Te incidence and mortality of LC have reached 2.76/10 5 and 1.66/10 5 in the world, which pose a serious threat to human health [2]. Currently, the treatment of LC is dominated by surgery and supplemented by chemotherapy, radiotherapy, targeted therapy, and immunotherapy [3]. Despite multiple strategies and interventions, the prognosis for LC is still unsatisfactory, with a 5-year survival rate of only 50% to 60% in patients and a third of the cases relapsing [4].
For patients with LC, lymph node metastasis (LNM) is one of the most important factors in their treatment and prognosis [5]. In clinical diagnosis and treatment, the palpation of neck nodes using ultrasonography (US), computed tomography (CT), magnetic resonance imaging (MRI), or positron emission tomography-computed tomography (PET-CT) is frequently used to evaluate lymph node status [6,7]. However, several defciencies remain with respect to sensitivity and specifcity in the abovementioned methods [8]. Previous studies have reported that a large proportion of LC patients diagnosed with LNM preoperatively who underwent intraoperative cervical lymph node dissection (LND) have a negative pathological diagnosis after surgery [9]. Conversely, some patients with clinically negative cervical lymph nodes have positive pathology results after surgery, leading to delayed treatment or even secondary surgery [10]. Terefore, the development of new diagnostic tools for accurately determining preoperative cervical lymph node status is highly essential for selecting appropriate therapy and determining prognosis.
Te machine learning (ML) algorithm has been widely used in developing disease prediction models in recent years [11]. Compared with conventional statistical methods, ML can more accurately predict outcomes from multiple unrelated datasets by using high-level computing to construct algorithms for automated data-driven predictions or decisions [12,13]. Nevertheless, there is no relevant research on the ML to predict the LNM of LC. In this study, we aimed to identify risk factors associated with LNM in LC patients and developed multiple ML-based models for the preoperative prediction of LNM by using clinical and histopathological parameters in Surveillance, Epidemiology, and End Results (SEER) public data. In addition, we chose the best ML model for predicting the risk of LNM in LC patients by comparing the assessment indicators of predictive performance, which aim to identify an accurate prediction method and guide the selection of clinical diagnoses and treatment plans. Meanwhile, the correlation between LNM and clinicopathological characteristics in LC patients was interpreted through the Shapley additive explanation (SHAP) value to help clinicians understand the output of the model.

Study Population.
Te SEER database, managed by the National Cancer Institute, is a publicly available large population-based cancer registry database that covers almost 30% of the population of the United States [14]. All patient data in this study were downloaded from the SEER database after receiving approval and permission from SEER. Patient data with a confrmed diagnosis of LC were screened from "Incidence-SEER 18 Regs Research Data, Nov 2020 Sub (2000-2018)." Te study was limited to the period between 2005 and 2015. Te inclusion criteria are as follows: (1) the primary tumour site was in the larynx; (2) the pathological diagnosis was classifed as "positive"; (3) the histology of the tumour was categorized as "malignant." Te exclusion criteria are as follows: (1) unknown information about race, T-stage, grade, tumour size, regional lymph nodes, and surgery of the primary site; (2) the M stage was not "M0." Overall, 4887 patients fulflled the selection criteria and were chosen for further analysis. Te International Classifcation of Diseases for Oncology, version 3 (ICD-O-3), was used to determine tumour location, grade, and histology. Tumour staging was determined based on the 6th edition of the American Joint Committee on Cancer (AJCC) staging. Te patient screening procedure is displayed in Figure 1.

Data Classifcation.
In this study, nine demographics and clinicopathological variables from the SEER database that may afect LNM in LC patients were selected, including age at diagnosis, race, sex, primary site, histology, number of tumours, T-stage, grade, and tumour size.
Patients were divided into groups according to their age at diagnosis: <50 years, 50-65 years, and ≥65 years. Patients were divided into groups according to race: white, black, and others. Patients were classifed into groups based on sex: male and female. According to the primary site, the patients were divided as follows: supraglottis, glottis, subglottis, larynx, and others. According to histology, the patients were classifed into squamous cell carcinoma and nonsquamous cell carcinoma groups. Based on the number of tumours, the patients were classifed into 1 and >1 groups. According to tumour size, the patients were classifed into ≤3 cm and >3 cm groups. Te grades were categorized as follows: grade I, grade II, grade III, and grade IV. Te T-stage was categorized as follows: T1, T2, T3, and T4.

Statistical Analyses.
In this study, all statistical analyses were carried out using SPSS (version 22.0, IBM). Patients involved in this study were separated into two groups: nonlymph node metastasis (NLNM) and LNM. Te chisquare test was performed to compare the diferences between the two groups. A p value less than 0.05 was considered to indicate that the identical attributes were signifcantly diferent between the two groups. A correlation analysis of all variables was evaluated by the Pearson correlation, and the results are displayed as a heatmap. In addition, univariate and multivariate logistic regression analyses were used to identify independent risk factors for LNM. In the univariate analyses, variables with p values less than 0.05 were regarded as statistically signifcant and chosen for multivariate analyses. Ten, variables with p values below 0.05 in the multivariable logistic regression analysis were taken as candidate variables for the establishment of ML models.

Model Establishment.
For this study, Python software was used to establish the ML models. All patients were divided into a training set (n � 3909) and a test set (n � 978) at a ratio of 8 : 2 by random sampling, until no signifcant diferences were observed in baseline clinical characteristics (Supplementary Table S1). Te training set was used to establish the model, and then the model was validated on the test set. A total of six diferent ML algorithms were used to model the data in the training set, including logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), extreme gradient boosting (XGBoost), random forest (RF), and light gradient boosting machine (LightGBM). Furthermore, 10-fold cross validation was performed to gauge the stability of the model and determine whether the model was overftted. Accuracy, precision, recall rate, F1-score, area under the ROC curve (AUC), specifcity, and calibration plots were used to evaluate ML models. Te Brier score, which ranges from 0 to 1, was used to quantify the calibration plots [15]. Te AUCs were compared by the DeLong test, and a p value less than 0.05 was considered to indicate that there was a signifcant diference between the two models.
Te Shapley additive explanation (SHAP) framework, a game theory approach, has been used to interpret the output of any ML model [16]. Tis approach interprets each feature variable by assigning a specifc prediction weight and calculating its importance value. In this study, we used the SHAP value to improve the interpretation of the best model [17].

Demographic and Pathological Characteristics.
A total of 4887 cases were available in this study, including 3409 cases without LNM (69.76%) and 1478 cases with LNM (30.24%) ( Table 1). In the comparison of the two groups, all variables except histology were considered to be signifcantly diferent (p < 0.05), including age at diagnosis, race, sex, primary site, number of tumours, T-stage, grade, and tumour size. Details for the groups are summarized in Table 1. In addition, all variables were entered into a Pearson correlation analysis, of which the result shows that variables were weakly correlated with each other and had good independence ( Figure 2).

Analysis of Risk Factors for Lymph Node Metastases.
Based on the univariate logistic regression analysis, age at diagnosis, race, sex, primary site, number of tumours, Tstage, grade, and tumour size were signifcantly correlated with LNM in LC patients (p < 0.05) ( Table 2). Further multivariate logistic regression analysis showed that age at diagnosis, race, primary site, number of tumours, T-stage, grade, and tumour size were identifed as independent risk factors for the LNM of LC (Table 3).

Model Performance.
In this study, we constructed 6 prediction models by using the 7 aforementioned independent risk factors. As shown in Figure 3, all six models had good stability, and there was no obvious overftting or underftting in each model (the performance in 10-fold cross validation was shown in Supplementary Table S2, and the model parameters were shown in Supplementary Table S3). Te AUC, F1-score, accuracy, precision, recall rate, specifcity, and Brier score were the main evaluation metrics used to evaluate and compare the model performances. However, the higher the value of AUC, F1-score, accuracy, precision, specifcity, and recall rate, the better the model performances, but the Brier score was just the opposite [18].  Table 4). Te DeLong test showed a signifcant diference in the AUC value of the XGBoost model compared with others in the  (Table 4). Tus, overall, the abovementioned results show that the overall performance of XGBoost is better than other models, so we chose the XGBoost model as the best prediction model.

Feature Importance Analysis.
We further analysed information provided by the XGBoost model about the importance of features. Te mean value of the absolute SHAP values of 7 feature variables represents the degree of infuence on the fnal predicted probability (Figure 6(a)); the higher the SHAP value, the stronger the efect of the feature variable on the model output [19]. Te SHAP summary plot shows the positive or negative impact of feature variables on the predicted probability through diferent colors ( Figure 6(b)). We found that T-stage, primary site, and grade were the three most important feature variables in the XGBoost model for predicting LNM, and the higher the stage of T-stage and grade, the higher the risk of LNM in patients.
Unlike T-stage and grade, the colour distribution was irregular for the "primary site," which indicated that the values of the feature were not linearly correlated with the SHAP values (Figure 6(b)). To explore the reason, we further Journal of Oncology 5 plotted the SHAP dependence plot of this ( Figure 6(c)). In the XGBoost model for this study, primary sites 1, 2, 3, 4, and 5 represent "supraglottis," "glottis," "subglottis," "larynx," and "other," respectively. Since "larynx" and "other" cannot represent the exact location of the primary tumour anatomically, we concluded that "supraglottis" had a higher risk of LNM than "subglottis" and "glottis."

Discussion
LNM is one of the vitally important hallmarks of LC distant metastasis. Tere is an extensively rich lymphatic vascular network in the neck, which makes it easier for LC patients to present with LNM [20]. Currently, the presence of LNM is mainly determined by cervical palpation and preoperative imaging, which largely depend on the clinical experience of the doctor and the ability of the human eye to identify imaging [21,22]. As reported, the presence of occult metastasis was confrmed by postsurgery histological examination in 4%-40% of cN0 LC patients [23][24][25]. Although prophylactic cervical LND can decrease the risk of LNM [26], it puts LC patients at a higher risk for operations, such as major bleeds, lymphatic fstulas, or impairments in the vagus, brachial plexus, and recurrent laryngeal nerve [27]. In the traditional methods for detecting LNM, the sensitivity and specifcity of cervical palpation are low, and both falsepositive and false-negative rates reach approximately 15%-25% [28]. Although the accuracy of imaging is higher than that of cervical palpation, the nodal size, shape, and presence of central necrosis taken as the criteria to assess LNM status are not reliable [29]. Some enlarged lymph nodes could be mediated by infammatory hyperplasia, and LNM <10 mm in diameter usually do not exhibit irregular enhancement or central necrosis [30]. Terefore, identifying risk factors associated with LNM and developing a good predictive performance model of LNM for LC is extremely important for clinicians to select appropriate treatment and improve the prognosis of patients.   Journal of Oncology In this study, we found that more than 90% of LC patients were over 50 years old or had squamous cell carcinoma (Table 1), which was consistent with some previous studies [31]. Moreover, LNM was more common among males than females (Table 1), which matches the overall sex-based incidence of LC and may be associated with higher smoking and drinking rates in males [32]. Mutlu et al. found that the incidence of cervical metastasis occurring in supraglottic tumours was signifcantly higher than that in transglottic tumours (55.2% and 35.1%, respectively) [33], which was similar to our results that the occurrence proportion of LNM was highest in the supraglottic region (Table 1). Te   abovementioned results indicate that the cases included in our study are in accordance with the epidemiological characteristics of LC, with good representation.
In addition, we also found that age at diagnosis, race, primary site, number of tumours, tumour size, grade, and Tstage were closely associated with the occurrence of LNM of LC by univariate and multivariate logistic regression analyses (Tables 2 and 3). Among them, T-stage, grade, and tumour size have been previously reported as independent risk factors for LNM in LC patients, which is consistent with our study fndings, indicating that these variables play an important role in promoting LNM in LC patients [34]. Our multivariate logistic regression analysis further demonstrated that the primary site was a risk factor for LNM (Table 3). Tis suggested that the primary tumour site was strongly associated with LNM, which may be related to the supraglottic special anatomy that contains an abundant and extensive submucosal lymphatic plexus [35]. Furthermore, similar to previous studies evaluating the risk factors for LNM in head and neck tumours [8], LC patients with a younger age at diagnosis presented a higher risk of LNM in this study. Interestingly, a single tumour presents    Journal of Oncology a signifcantly higher risk of LNM than multiple tumours in our study. Multifocal tumours have been considered a feature with more aggressiveness and were more likely to develop LNM than single tumours, which is contrary to the results of our fndings [36]. However, the reason for the diference remains unclear and will require further study. In recent years, various risk factors for LNM of LC have been reported, and prediction models have been established [8,34]. However, due to the complexity and large size of the various factors in the data and the diferences among the calculation methods of the models, the prediction performance was also signifcantly diferent. Heng et al. [37] developed a nomogram to predict the occult lymph node metastases of glottic squamous cells with an AUC value of 0.716. Chen et al. [34] used a nomogram to predict cervical LNM in laryngeal squamous cell carcinoma with an AUC value of 0.809. Furthermore, Song et al. used a nomogram to predict the risk of LNM in newly diagnosed supraglottic laryngeal squamous cell carcinoma, in which the AUC value of the nomogram model was only 0.731 and 0.707 in their training set and test set, respectively [8]. All three studies used the same logic calculation method, but the prediction performance of the models varied widely. To more accurately predict LNM in LC patients, we established prediction models using six diferent ML algorithms for the frst time and selected the prediction model with the best performance by comparing the performance diferences among various prediction models. Te accuracy, precision, recall rate, F1score, AUC value, specifcity, and calibration plots were performed to evaluate and compare the ML model performances, and we found that XGBoost was better than the other models with respect to the AUC value, Brier score, F1score, recall rate, and accuracy, whether it was in the training set or in the test set (Table 4, Figures 4(a), 4(b), 5(a), and 5(b)). In addition, the AUC value of XGBoost was also higher than that of the model constructed in previous studies [8,34,37]. Terefore, we believe that XGBoost was the best predictive model to predict the LNM in LC patients. XGBoost, as an ensemble ML algorithm based on a decision tree, has the advantages of fast computation, maximizing predictive performance, minimizing model complexity, and low overftting [38]. Terefore, XGBoost has been widely used in prediction model construction and risk identifcation in the medical feld [39]. However, this algorithm also has the following shortcomings. First, there are too many hyperparameters in this algorithm, which would greatly infuence the training time and performance of the XGBoost model [40], and secondly, this algorithm is difcult to visualize and explain models, which results in the limitation of the ML model in daily applications to some extent [41]. Based on the abovementioned features, the SHAP method was introduced to explain the importance and contribution of variables in the XGBoost model to help more clinicians understand the mechanism behind ML. We found that T-stage, primary site, and grade were the three most important feature variables in the XGBoost model for predicting LNM, and race was the least important variable (Figure 6(a)). In addition, given that the values of the primary site feature were not linearly correlated with the SHAP values in the SHAP summary plot (Figure 6(b)), the additional SHAP dependence plot revealed that the supraglottis was an important feature leading to LNM, whereas the subglottis and glottis were the opposite (Figure 6(c)). Tis fnding was consistent with previous studies showing that the supraglottic type of LC is more likely to have LNM [42].
Although T-stage and primary site are important bases for preoperative judgement of LNM at present, it remains controversial to perform cervical LND only on these bases. Te National Comprehensive Cancer Network (NCCN) considered that all patients with supraglottic lesions or T3-T4 stages should have neck LNM [43]. However, a study by Sessions et al. found that patients with N0 disease may be safely observed with no loss of survival advantage [44]. Furthermore,Ömer et al. showed that a watchful waiting strategy can be applied to T1-T2 and selected T3 cases with well-diferentiated tumours [23]. Tese data further suggested that the LNM of LC is infuenced by several factors, and that it is not accurate to judge LNM solely by T-stage and primary site. Based on our XGBoost model, the individual risk of LNM can be identifed more accurately, and the LND strategies for LC patients can also be determined by doctors directly and accurately, thereby avoiding overtreatment and reducing the risk of complications related to neck dissection. Some limitations exist in this study. First, our study is a retrospective study and sufers from some possible selection bias. Second, since the patients in the study were predominantly from the North American population, there may be defciencies in the applicability of the population, so including a wider population is necessary for future research. Finally, all patients in this study were from a single database, and a multicentre study is required for external validation of our model.

Conclusion
In this study, we found that age at diagnosis, race, primary site, number of tumours, tumour size, grade, and T-stage were independent risk factors for LNM of LC. In addition, we developed six ML models to predict LNM in LC patients based on this information. All models performed well, but the XGBoost model had better predictive power. Finally, through the SHAP method, we determined that T-stage, primary site, and grade were the three most important feature variables in the XGBoost model for predicting LNM. In the future, we will validate our fndings through a prospective multicenter study using a completely independent external dataset.

Conflicts of Interest
Te authors declare that they have no conficts of interest.