Construction of a Diagnostic Model for Lymph Node Metastasis of the Papillary Thyroid Carcinoma Using Preoperative Ultrasound Features and Imaging Omics

In this paper, we studied lymph node metastasis in a sample population of 337 patients who had undergone surgery for papillary thyroid carcinoma (PTC). To provide a clinical reference for intelligent decision-making in treatment planning and for improving prognosis, we constructed five early diagnosis models based on ultrasound features, imaging features, and combined features. The model integrated with broad learning system (BLS) showed the best performance, with an area under the curve (AUC) of 0.857 (95% confidence interval (CI): 0.811–0.902) and an accuracy of 0.805 (95% CI: 0.759–0.850). For demographic and clinical features, the prediction effect was also good, with an AUC of more than 0.700.


Introduction
Papillary thyroid carcinoma (PTC) is one of the most common pathologic types of thyroid cancer [1]. The current clinical problem is to find regions where lymph node metastasis is prone to occur [2]. This problem is usually addressed with ultrasound, which is also the first choice for thyroid cancer examination. Ultrasound can determine whether the patient has cervical lymph node metastasis before surgery, which is of great significance for the selection of surgical methods, radiotherapy and chemotherapy, and the judgment of prognosis [3]. The major advantage of machine learning is that the learned model can improve treatment decisions for cancers and provide clinical references to improve the prognosis [4]. Deep learning models have been used in previous studies, but they take a lot of time in the training stage [5][6][7].
As an effective and efficient incremental learning system, the broad learning system (BLS) can provide value for prediction models by largely reducing the time cost of model training [5]. If combined with imaging omics, broad learning features can then be utilized in establishing the lymph node metastasis model [6][7][8]. Imaging omics is mainly based on the extraction and analysis of image features from CT, MRI, PET, and other medical images to quantitatively evaluate diseases such as thyroid papillary carcinoma and lymph nodes [9]. It can be used to diagnose diseases, predict prognosis, and analyze the biological behavior of diseases [10]. Imaging omics has proved to be objective in extracting image features of lymph nodes in PTC and has important implications for the prediction of clinical outcome [11][12][13][14][15]. Since imaging omics has been successfully applied to the diagnosis of thyroid cancer, lung cancer, liver cancer, breast cancer, and other diseases [16][17][18][19][20][21][22], it is also employed in the present study.
To combine imaging omics with broad learning features, random forest, which is a combination of decision trees, is employed to develop the basic analytic models [23]. Each decision tree is trained on a new data set randomly generated from the original data set. The result of the random forest is the decision of the majority of decision trees [24][25][26][27][28]. A single-model classification method is often prone to overfitting, so many scholars improve prediction accuracy by combining multiple single models, which is called the classifier combination method. Random forest is an algorithm that was proposed to solve the overfitting problem of a single decision tree model [29]. Random forest uses the bootstrap resampling method to extract multiple samples from the original samples, conducts decision tree modeling for each bootstrap sample, synthesizes the multiple decision trees for prediction, and obtains the final prediction result through voting [30,31]. The organization of this article is as follows. We use preoperative ultrasound features and image analyses to construct an early diagnosis model for lymph node metastasis in PTC in Section 2. These models are run, evaluated, and then integrated with BLS in Section 3.

Study Design and Population.
This study was a cross-sectional study approved by the Institutional Review Board of The Affiliated Changzhou No. 2 People's Hospital with Nanjing Medical University (approval number: [2021]KY021-01). The sample population was 337 patients who had undergone PTC surgery in Changzhou Second People's Hospital, after the inclusion and exclusion criteria were applied. The inclusion criteria were as follows: (1) patients aged ≥18 years old; (2) PTC patients who received fine needle biopsy before operation and were confirmed; (3) patients without benign lesions or single malignant lesions; (4) patients who underwent extensive neck lymph node dissection; (5) patients with complete clinical data. The exclusion criteria were as follows: (1) patients who received anticancer treatment such as radiotherapy and chemotherapy before operation; (2) patients who did not undergo ultrasound examination before operation.

Missing Data Assessment.
There were 428 nodules in 337 patients. Because each nodule had two or more ultrasound images from different angles, there were a total of 973 ultrasound images for the 428 nodules. In other words, a total of 428 data records and 973 representative ultrasound images were collected. Missing values in the data were filled by random interpolation. Sensitivity analysis before and after gap-filling is shown in Table 1.
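The gap-filling step can be illustrated with a minimal sketch. It assumes that "random interpolation" means replacing each missing value with a value drawn at random from the observed values of the same variable; the source does not specify the exact procedure, and the function name and the example values are hypothetical.

```python
import random

def random_impute(column, rng):
    """Replace None entries with values drawn at random from the observed entries."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to draw from")
    return [v if v is not None else rng.choice(observed) for v in column]

rng = random.Random(0)               # fixed seed so the fill is reproducible
ages = [45, None, 38, 52, None, 61]  # hypothetical values, not study data
filled = random_impute(ages, rng)
```

Comparing summary statistics of `ages` (observed values only) and `filled` is one simple way to run a before-and-after sensitivity check.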

Image Preprocessing and Classification.
In the present study, Lasso regression filtering is used for image processing [32,33]. The process was to draw n samples from the original sample data of size N, with each observation having an equal probability, 1/N, of being selected. The original sample was regarded as the population, and the drawn subsamples were regarded as samples from it. Such a subsample was called a bootstrap sample. The sampling process can be formulated as follows. Let H(x) represent the random forest classification result, h_i(x) the classification result of a single decision tree, Y the classification target, and I(·) the indicator function; the random forest classification model adopts a simple voting strategy to complete the final classification.
(1) Each decision tree was generated from the training sample X with sample size K and a random vector θ_k.
(2) The random vector sequence θ_k, k = 1, ..., K, was independently and identically distributed.
(3) The random forest was the set of all decision trees h(X, θ_k), k = 1, 2, ..., K.
Among these processes, each decision tree model h(X, θ_k) had one vote to select the classification result of the input variable X.
The remaining imaging feature was gray-level size zone matrix (GLSZM) entropy. The remaining three demographic and clinical variables were gender, age, and carcinoembryonic antigen, and the remaining four ultrasound features were the maximum diameter of the nodule, aspect ratio, calcification, and relative capsule position. BLS was established for image classification by learning the variables in the model to obtain the output variables. In the process of image classification, broad learning mapped the input data to construct the mapping features, activated the mapping features to obtain the enhancement features, and output the two parts together. We screened out the new features by using the 1-norm loss function in Lasso regression, and the new features were merged into the random forest.
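With the symbols defined above, a simple voting strategy is commonly written as H(x) = argmax_Y Σ_i I(h_i(x) = Y). This standard majority-vote form is an assumption here, since the original equation is not preserved in the text. The sketch below illustrates equal-probability bootstrap sampling and this vote; the function names and example values are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, n, rng):
    """Draw n items with replacement; each of the N items has probability 1/N per draw."""
    size = len(data)
    return [data[rng.randrange(size)] for _ in range(n)]

def majority_vote(tree_predictions):
    """Return the label Y that maximizes sum_i I(h_i(x) = Y)."""
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]

rng = random.Random(0)
original = list(range(10))                    # stand-in for N = 10 observations
sample = bootstrap_sample(original, 10, rng)  # one bootstrap sample
votes = [1, 0, 1, 1, 0]                       # hypothetical outputs h_i(x) of five trees
prediction = majority_vote(votes)             # label 1 receives three of the five votes
```

With an even number of trees a tie is possible; in this sketch the label that appears first among the tied ones wins, which is an arbitrary choice.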
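The broad-learning step described above (map the inputs, activate the mapped features into enhancement features, and output both parts together) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the random weights, the linear mapping, the tanh activation, and the node counts are all hypothetical choices.

```python
import math
import random

def random_layer(n_in, n_out, rng):
    """Random weights (n_in x n_out) and biases (n_out) for one group of nodes."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, bias

def dense(rows, weights, bias, activation):
    """Apply activation(x . W + b) to every input row x."""
    return [
        [activation(sum(xi * w[j] for xi, w in zip(x, weights)) + bias[j])
         for j in range(len(bias))]
        for x in rows
    ]

def bls_features(rows, n_mapped, n_enhance, rng):
    """Concatenate mapped features with their activated enhancement features."""
    w_map, b_map = random_layer(len(rows[0]), n_mapped, rng)
    mapped = dense(rows, w_map, b_map, lambda v: v)    # linear mapping features
    w_enh, b_enh = random_layer(n_mapped, n_enhance, rng)
    enhanced = dense(mapped, w_enh, b_enh, math.tanh)  # activated enhancement features
    return [m + e for m, e in zip(mapped, enhanced)]

rng = random.Random(0)
# Three hypothetical nodules with eight already-normalized input variables each.
nodules = [[rng.uniform(0, 1) for _ in range(8)] for _ in range(3)]
features = bls_features(nodules, n_mapped=6, n_enhance=10, rng=rng)
```

Here each nodule ends up with 6 + 10 = 16 expanded features; these counts are illustrative only and are not the configuration used in the study.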

Establishment of the Diagnostic Models.
For each nodule, the ROI was delineated according to the gray image. The focus area of PTC was framed by the clinician, and then the imaging features of the focus area were extracted by the pyradiomics algorithm.
Journal of Healthcare Engineering
The strategy used to construct each of the five diagnostic models was different. In Model 1, only demographic information and clinical data were used. In Model 2, we combined demographic information, clinical data, and ultrasound features. Model 3 combined demographic information, clinical data, and imaging features. Model 4 combined the demographic information, clinical data, ultrasound features, and imaging features. Broad learning was used to learn the variables in Model 4, and the new variables obtained were incorporated into the random forest model to obtain Model 5. The data set was randomly divided into a training set and a testing set at a ratio of 7:3, and the two sets were then normalized separately.
Lasso regression was used to filter features in the training set, and then the prediction model was constructed.
The area under the curve (AUC), accuracy, sensitivity, and specificity were used to evaluate the model. Then, the importance of features was expressed by using the feature importance map.
As shown in Table 2, because the data were randomly divided into training sets and testing sets, the ultrasonic features were compared for balance. Because their P values were all >0.05, the difference between training sets and testing sets was not statistically significant. This confirmed that the training sets and testing sets were comparable. As shown in Table 3, the prediction results indicate that Model 4, which combined ultrasound features and imaging features in our data set, performed best in the testing set, with an AUC of 0.857 (95% confidence interval (CI): 0.811-0.902) and an accuracy of 0.805 (95% CI: 0.759-0.850). The receiver operating characteristic (ROC) curves of the four models are shown in Figure 2.
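The random 7:3 split and per-set normalization described above can be sketched as follows. The z-score form of normalization is an assumption, since the source does not name the method, and the helper names and toy data are hypothetical.

```python
import random
import statistics

def split_7_3(rows, rng):
    """Shuffle a copy of the rows and cut it into 70% training and 30% testing."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def zscore_columns(rows):
    """Standardize each column using the mean and standard deviation of these rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

rng = random.Random(0)
rows = [[float(i), float(2 * i + 1)] for i in range(10)]  # toy data, not study data
train, test = split_7_3(rows, rng)
train_norm = zscore_columns(train)  # each set is normalized on its own,
test_norm = zscore_columns(test)    # following the wording of the text
```

A common alternative is to reuse the training-set means and standard deviations when scaling the testing set, so that both sets share one scale.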
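The four evaluation metrics named above can be computed as in the sketch below. It uses the rank-based definition of AUC (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case); the labels, scores, and the 0.5 threshold are hypothetical and are not study data.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy_sensitivity_specificity(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)  # share of metastasis cases that were detected
    specificity = tn / (tn + fp)  # share of non-metastasis cases correctly cleared
    return accuracy, sensitivity, specificity

def auc(y_true, scores):
    """Fraction of positive-negative pairs ranked correctly; ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]              # 1 = lymph node metastasis (hypothetical)
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # hypothetical model probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]
acc, sens, spec = accuracy_sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
```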

Prediction Results after Integration with Broad Learning.
Eight features in Model 4 were included in the BLS model to obtain 106 features, and then 5 features were screened out by Lasso using α = 0.004. The five features entered into the random forest prediction model to predict whether lymph node metastasis occurred are shown in Table 4. Figure 3 shows the ROC curves of Model 4 and Model 5 in the training and testing sets.
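For reference, the 1-norm screening step can be written in the standard Lasso form below. The 1/(2n) scaling of the squared-error term is an assumption; the source does not state which scaling or software produced α = 0.004. Here X is taken to be the n × 106 matrix of BLS features, y the vector of lymph node metastasis labels, and w the coefficient vector, all symbols introduced for illustration.

```latex
\hat{w} = \arg\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```

Features whose fitted coefficients are nonzero are the ones retained; a larger α drives more coefficients exactly to zero.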

Discussion on the Importance of Model Features.
Since the prediction results of Model 4 and Model 5 were relatively close, and the difference was not statistically significant, we finally chose Model 4 because of its high interpretability. From the map of feature importance (Figure 4), it can be found that the most important variable was the maximum diameter of nodules, followed by GLSZM zone entropy in imaging features, and the third was carcinoembryonic antigen.
Overall, carcinoembryonic antigen and age were the best predictors of demographic and clinical features. Among ultrasonic features, the maximum diameter of the nodule was the best predictor. Imaging features also predict well, as seen in Figure 5.
As the most common thyroid malignancy, papillary thyroid cancer is associated with cervical lymph node metastases in 30% to 90% of patients [34]. Lymph node dissection (LND) is the mainstay treatment for clinically evident cervical lymph node metastases [35]. So far, surgical treatment options in the literature include the traditional radical LND, the modified radical LND, the selective LND, and a "berry picking" resection in which only the grossly abnormal lymph nodes are excised [36][37][38][39][40][41][42]. The selective LND represents a compartment-based resection based on documented lymph node metastases [43,44].
This study constructed diagnostic models by integrating random forest with BLS, which proved to be a successful attempt and may help break the related bottlenecks in the future.
Before constructing the diagnostic models for lymph node metastasis, we considered using the ultrasound features of lymph nodes as input. However, cervical lymph nodes are widely distributed (mainly in 6 regions), and ultrasound has limitations in recognizing the features of cervical lymph nodes, especially those in the central area; special anatomical structures such as the posterior trachea, posterior esophagus, retropharyngeal area, and mediastinum also cannot be displayed well by ultrasound [45]. Meanwhile, studies in modeling, diagnosis, and treatment have confirmed that some ultrasonic features of primary lesions are related to lymph node metastasis [46,47]. Our experiment also demonstrated that lymph node metastasis in different regions can be better identified through imaging features of primary lesions. This was the reason why no lymph node features were used as input in the present study.
Unlike the analysis of normal cancer [48,49], lymph node metastasis (LNM) is detected through postoperative pathology (the gold standard) [50]. This study included only patients who underwent extensive neck lymph node dissection, to ensure the accurate diagnosis of LNM. The potential risk of lymph node metastasis has led to many PTC patients receiving total thyroidectomy, lymph node dissection, and other treatments, resulting in widespread overtreatment. Therefore, we hope to build a preoperative diagnostic model of LNM that helps accurately identify high-risk patients in this population and thereby reduces overtreatment.
We constructed five early diagnosis models of LNM by combining the preoperative ultrasound features and ultrasound image features. These models were chosen because the imaging features of the focus area of PTC are convenient to extract. The strategy used to construct each of the five diagnostic models was different, and finally, we found that the top three parameters are more important than the others. These results present further evidence for a systematic review and meta-analysis in previous studies, which indicated that patient gender is a factor associated with lymph node metastasis in T1 colorectal cancer [51]. The clinical significance lies in helping clinicians with early diagnosis, which not only reduces the workload of clinicians but also reduces the suffering of patients [52][53][54]. Previous studies utilized deep learning algorithms for the detection of lymph node metastases, while broad learning algorithms were rarely utilized [54]. Deep learning models spend too much time in the training stage, and BLS can greatly reduce the time cost of training the model [53,54]. This was also the major innovation of our study.

Conclusion
In this paper, five early diagnostic models were developed from random forest and integrated with the BLS to obtain experimental results with demographic information, clinical data, and ultrasonic characteristics. There was no significant difference between the combination of BLS with random forest and random forest alone, so random forest was chosen to make predictions for the model. The feature importance map shows that the maximum diameter of the nodule was the most important variable, followed by the GLSZM zone entropy; both should therefore be employed in subsequent studies [55][56][57][58][59][60].

Data Availability
All the data to support the experiments and findings in this study are available from the corresponding authors upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.