Prompt and accurate diagnosis of benign and malignant thyroid nodules has always been a core issue in clinical practice. Ultrasound imaging is one of the most common visualization tools used by radiologists to determine the nature of thyroid nodules. However, visual assessment of nodules is difficult and often affected by inter- and intraobserver variability. This paper proposes a novel hybrid approach based on machine learning and information fusion to discriminate the nature of thyroid nodules. Statistical features are extracted from the B-mode ultrasound image, while deep features are extracted from the shear-wave elastography image. Classifiers including logistic regression, Naive Bayes, and support vector machine are adopted to train classification models with statistical features and deep features, respectively, for comparison. A voting system with several decision criteria is then used to combine the two classification results to obtain better performance. Experimental and comparison results demonstrate that the proposed method classifies thyroid nodules correctly and efficiently.
Thyroid nodules are exceedingly common, with a reported prevalence of more than twenty percent in men and fifty percent in women over 50 years old on high-resolution ultrasound [
Several imaging techniques, including CT, magnetic resonance imaging (MRI), and ultrasound, have been used to discriminate the nature of thyroid nodules in the clinic. Ultrasound is the most commonly used because it is convenient, efficient, inexpensive, noninvasive, and radiation-free [
Examples of visualization about B-US image (bottom, grayscale) and SWE-US image (top, color). The regions of interest are marked with rectangles. The color bar on the right indicates the elastic modulus of nodules, which decreases from red to blue.
Tremendous advances have been made in medical imaging and artificial intelligence technologies, which have made computer-assisted diagnosis (CAD) increasingly widespread. CAD can help solve diagnostic problems that traditionally depend on the subjective experience of radiologists by applying objective criteria. CAD on ultrasound imaging commonly uses statistical features (SFs), also called radiomics data, of medical images, including morphological parameters, intensity statistics, and texture features quantifying heterogeneity [
Recently, deep learning models, especially convolutional neural networks (CNNs), have received great attention in image classification and target recognition [
In our study, we propose a hybrid approach that combines models trained with traditional features extracted from B-US images and deep features extracted from SWE-US images for the thyroid nodule classification task. First, we employ a pretrained CNN model, transfer-learned from ImageNet, as a feature extractor to obtain deep features (DFs) from the SWE-US image dataset. To obtain better performance, we compare classifiers trained with features extracted from each layer of the CNN to find the most discriminative one for the classification task. Then, traditional features are extracted from the corresponding B-US image dataset. Different classifiers are trained with the SFs and DFs for comparison. Finally, a voting system with pessimistic, optimistic, and compromise criteria is designed to combine the predictive results from the different classifiers to obtain better classification performance.
The main contributions of this work are as follows: We propose a novel hybrid framework combining multimodality features for thyroid nodule classification. Classifiers trained with features extracted from each layer of the CNN are compared to find the most discriminative one for the nodule classification task. The performance of different decision-making strategies on the classification results is compared and analyzed, and reasonable suggestions are put forward.
The remainder of this paper is organized as follows. Background knowledge is summarized and related literature is reviewed in Section
Ultrasound is a combination of acoustics, medicine, optics, and electronics. It covers a wide range of applications, including ultrasound diagnosis, ultrasound therapy, and biomedical ultrasound engineering, and is of great value in the prevention, diagnosis, and treatment of diseases. Ultrasound imaging uses an ultrasound beam to scan the human body and receives and processes the reflected signal to obtain an image of the internal organs. Ultrasound commonly used in medical imaging diagnosis includes B-mode ultrasound (B-US, Figure
B-US can describe the size, shape, location, and texture of nodules so as to distinguish malignant nodules from benign nodules. In the literature [
Buda et al. have trained a multitask deep convolutional neural network and compared it with a consensus of three ACR TI-RADS committee experts and nine other radiologists, and the results show the performance of deep learning algorithm is similar to the diagnosis of experts [
SWE-US is a newer technology based on the elasticity or hardness of biological tissue; its measurement results are not affected by the operator, and it has excellent repeatability. Elastography can significantly improve the differential diagnosis of benign and malignant thyroid nodules [
CNNs are the most commonly used deep learning model, from which high-level characteristics can be extracted. CNNs can be used for tasks such as object detection, classification, and feature extraction. A CNN is a kind of feed-forward neural network, a multilayer perceptron inspired by biological vision. A CNN has different layers, and the working methods and functions of each layer differ [
Transfer learning refers to the ability of a system to recognize and apply knowledge learned in previous tasks to a novel task [
Consider the following two facts: first, the size of an ultrasound image dataset (hundreds or thousands of images) is much smaller than that of a natural image dataset (millions of images); second, the two datasets consist of images from completely different domains, so their data distributions are inconsistent. There are two typical uses of transfer learning in medical image classification: one is to remove the last fully connected layer on top of the pretrained deep model and treat the rest of the network as a fixed feature extractor for the current dataset; the other is to fix most of the earlier layers to preserve generic information and retrain from scratch only the last fully connected layer of the pretrained deep model to capture domain-specific features.
In this paper, we propose to evaluate the hybrid approach with multiple modalities of ultrasound imaging in discriminating the nature of thyroid nodules. The algorithm deals with the two modalities separately. For deep features, we compare the classifiers trained with features extracted from each layer of the CNN to find the most discriminative one for the task. For statistical features, the process generally contains 3 steps: image preprocessing, feature extraction, and feature selection. Then, different classifiers are adopted to train classification models with statistical features and deep features for comparison. In the end, the two classification models are hybridized with a voting system employing three kinds of decision criteria. The hybrid model is observed to obtain better performance. The overall framework of this research is shown in Figure
Schematic diagram of our model architecture for thyroid nodule classification.
The B-US image is a grayscale image (Figure
After the ultrasound images are denoised with a median filter, the regions of interest (ROIs) are manually segmented along the nodule contour on each transverse section using the open-source imaging platform ITK-SNAP. To minimize inter-observer differences, all segmentations are carried out by a single radiologist with more than 5 years of experience within a continuous period. Figures
Region of interest selection. (a, b) Benign and malignant nodule images. The nodule contours are outlined with red lines in (c) and (d).
A Python radiomics package named “Pyradiomics” is used to automatically extract the statistical features from the nodule region outlined by the radiologist. A total of 104 statistical features, including first-order statistics, shape features, gray level co-occurrence matrix- (GLCM-) based features, gray level run length matrix- (GLRLM-) based features, gray level size zone matrix- (GLSZM-) based features, neighboring gray tone difference matrix- (NGTDM-) based features, and gray level dependence matrix- (GLDM-) based features, are obtained. The dimensions of all feature types are shown in Table
Summary of the statistical features.
Id | Feature type | Dimension |
---|---|---|
1 | First-order statistics | 19 |
2 | Shape | 10 |
3 | Gray level co-occurrence matrix (GLCM) | 24 |
4 | Gray level run length matrix (GLRLM) | 16 |
5 | Gray level size zone matrix (GLSZM) | 16 |
6 | Neighboring gray tone difference matrix (NGTDM) | 5 |
7 | Gray level dependence matrix (GLDM) | 14 |
Total | | 104 |
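To illustrate how one class of these features is computed, the minimal sketch below builds a single-offset gray level co-occurrence matrix and derives two common texture measures. This is a simplified, hypothetical example: Pyradiomics aggregates many offsets, angles, and dozens of features per matrix type.

```python
import numpy as np

def glcm_features(img, levels=4):
    """Single horizontal-offset GLCM plus two texture features.

    Simplified sketch; Pyradiomics aggregates multiple offsets/angles.
    """
    glcm = np.zeros((levels, levels))
    for r in range(img.shape[0]):
        for c in range(img.shape[1] - 1):
            glcm[img[r, c], img[r, c + 1]] += 1   # count co-occurring pairs
    glcm /= glcm.sum()                            # normalize to probabilities
    i, j = np.indices(glcm.shape)
    contrast = np.sum(glcm * (i - j) ** 2)        # local intensity variation
    homogeneity = np.sum(glcm / (1 + np.abs(i - j)))
    return contrast, homogeneity

# Toy 4-level image standing in for a quantized nodule ROI.
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
contrast, homogeneity = glcm_features(img)
print(contrast, homogeneity)
```

Heterogeneous textures push contrast up and homogeneity down, which is why such measures help separate malignant from benign nodule appearance.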
The purpose of feature selection is to select a minimal subset of features from the original dataset by removing irrelevant and redundant attributes, which would otherwise increase the complexity of classification and may even degrade classifier performance [
When using PCA, the number of retained components is gradually increased at an interval of 10 to determine the optimal input dimension, so that the dimensionality can be reduced while preserving the structural information of the data as much as possible, as elaborated in Algorithm
Input:
Output:
(1) Centralize all samples.
(2) Calculate the sample covariance matrix.
(3) Perform eigenvalue decomposition of the covariance matrix.
(4) Get the eigenvectors corresponding to the largest eigenvalues.
(5) Normalize all eigenvectors to form an eigenvector matrix.
(6) Project the samples with the eigenvector matrix.
(7) Get the output sample set.
End
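The steps above can be sketched in NumPy as follows (an illustrative implementation, not the paper's code); the search over the number of retained components described earlier would then loop over, e.g., 10, 20, ..., 100 components.

```python
import numpy as np

def pca(X, n_components):
    """Project samples onto the top principal components.

    X: (n_samples, n_features) array; returns (n_samples, n_components).
    """
    Xc = X - X.mean(axis=0)                   # (1) center the samples
    cov = np.cov(Xc, rowvar=False)            # (2) sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # (3) eigenvalue decomposition
    order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
    W = eigvecs[:, order[:n_components]]      # (4)-(5) projection matrix
    return Xc @ W                             # (6) project the samples

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))                # stand-in for the 104-d SF vectors
Z = pca(X, 10)
print(Z.shape)  # (100, 10)
```

The first projected column always carries the largest variance, so truncating the eigenvector matrix keeps the most structural information per retained dimension.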
When using the
SWE-US image is considered to be a composite color image (Figure
As mentioned above, the transfer learning method applied for medical image classification in our research is feature extraction: the last fully connected layer of the pretrained deep model is removed, because its output is the class score for a multiclass classification task such as ImageNet. The rest of the network acts as a feature extractor for the given dataset. A variety of pretrained models such as ResNet-50, Inception-V3, and VGGNet-16 have been used for transfer learning [
VGGNet-16 network structure diagram. Prior to conv5, the yellow block is the output of the convolutional layer and the orange block is the output of the pooling layer. FC is the output of the fully connected layer.
However, the features output from the second fully connected layer may be the result of various functional combinations. Therefore, to obtain the best performance on the nodule classification task, we compare the features extracted from the Pool1–Pool5, FC6, and FC7 layers of the VGG network. Features from each layer are used as a set of inputs for a classifier after zero-variance removal. It is worth noting that the ROI of SWE-US has three channels to accommodate the original architecture of VGGNet-16, which is designed for color images. In addition, the sampling rate of ROIs is reduced by half to 224
The results are shown in Figure
Response of certain convolution layers.
The diagnosis of benign and malignant thyroid nodules in this paper is a typical two-class problem. Many classifiers, including logistic regression (LR), k-nearest neighbor (KNN), random forest (RF), and support vector machine (SVM), have been used to discriminate the nature of nodules based on features extracted from ultrasound or CT images. We conduct LR, Naive Bayes, and SVM algorithms for comparison in the classification process because they can output probability values, which are used as inputs to the voting system for decision fusion. Suppose
Logistic regression is a machine learning method used to solve binary classification problems and is used to estimate the probability. The hypothetical function of logistic regression is as follows:
The Naive Bayes method is a set of supervised learning algorithms based on Bayes’ theorem, and it is assumed that each pair of features is independent of each other. The principle is as follows:
SVM is a convex quadratic programming problem, which can be expressed by the following mathematical formula:
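The three classifiers can be trained side by side with scikit-learn. The snippet below is an illustrative sketch on synthetic data (the feature matrix and labels are stand-ins, not the study's dataset); `probability=True` enables the Platt-scaled probability outputs the voting system needs from the SVM.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix (SFs or DFs);
# labels: 0 = benign, 1 = malignant.
X, y = make_classification(n_samples=245, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    # probability=True so the SVM can feed probabilities to the voting system
    "SVM": SVC(kernel="linear", probability=True, random_state=0),
}
probs = {}
for name, clf in models.items():
    clf.fit(Xtr, ytr)
    probs[name] = clf.predict_proba(Xte)[:, 1]  # P(malignant) per nodule
```

Each model's `P(malignant)` column is exactly the quantity passed on to the decision-fusion stage described next.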
The voting system receives the probabilities of benignity and malignancy computed by the two classifiers. The combination of the two outputs aims to increase the final accuracy beyond that of each single modality. The voting system proposed in our research is an adaptation of uncertainty decision theory [
For a thyroid nodule:
(1) Define the pessimistic criterion
(2) Define the optimistic criterion
(3) Define the compromise criterion
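A minimal voting rule along these lines is sketched below. The mapping of the criteria to max/min/average is an assumption consistent with the reported trade-offs (the pessimistic criterion maximizing sensitivity, the optimistic criterion maximizing specificity), not necessarily the paper's exact formulas.

```python
def vote(p_deep, p_stat, criterion="compromise", threshold=0.5):
    """Fuse two classifiers' malignancy probabilities into one decision.

    Assumed mapping (not the paper's exact formulas):
      pessimistic -- worst case: malignant if EITHER model leans malignant
      optimistic  -- best case: malignant only if BOTH models agree
      compromise  -- average the two probabilities
    Returns 1 for malignant, 0 for benign.
    """
    if criterion == "pessimistic":
        score = max(p_deep, p_stat)
    elif criterion == "optimistic":
        score = min(p_deep, p_stat)
    elif criterion == "compromise":
        score = (p_deep + p_stat) / 2
    else:
        raise ValueError(criterion)
    return int(score > threshold)

# Disagreeing classifiers: the pessimistic rule flags malignancy, the
# optimistic rule does not, and the compromise averages to 0.515 > 0.5.
print(vote(0.62, 0.41, "pessimistic"))  # 1
print(vote(0.62, 0.41, "optimistic"))   # 0
print(vote(0.62, 0.41, "compromise"))   # 1
```

Under this reading, the pessimistic rule trades specificity for sensitivity (fewer missed malignancies), while the optimistic rule does the opposite.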
Herein, the experimental data are obtained from the Department of Ultrasound, First Affiliated Hospital of Nanjing Medical University. The study population comprises 245 patients. Both B-US and SWE-US examinations are performed by experienced radiologists. Images are acquired and stored in the DICOM standard. The gold standard for nodule type is pathological analysis, including excisional biopsy, core needle biopsy, or FNA biopsy. When a patient undergoes multiple biopsies, the gold standard for the final diagnosis is determined according to the following priority: excisional biopsy, core needle biopsy, and FNA biopsy. There are 490 images in total (B-US and SWE-US each account for half), covering 145 benign nodules and 100 malignant nodules. This retrospective study was approved by the institutional review board, and informed consent was obtained from all patients.
To improve the generalization ability of the model on the dataset, five-fold cross validation is executed during the training and testing process. The original data are evenly divided into 5 groups; each subset in turn is used as the validation set, and the remaining 4 subsets are used as the training set. Repeating this 5 times yields 5 models. The average classification accuracy of the five models is used as the performance index of the classifier. As listed in Table
Partitioning and statistical summary of cross-validated datasets.
Subset | Number | Malignancy | Radius (mm) |
---|---|---|---|
Subset 1 | 49 | 21 | 0.51 ± 0.26 |
Subset 2 | 49 | 19 | 0.56 ± 0.17 |
Subset 3 | 49 | 17 | 0.53 ± 0.24 |
Subset 4 | 49 | 20 | 0.58 ± 0.21 |
Subset 5 | 49 | 23 | 0.52 ± 0.22 |
Total | 245 | 100 | 0.54 ± 0.20 |
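The five-fold procedure can be expressed compactly with scikit-learn, as in the sketch below. It runs on synthetic data mimicking the class balance (145 benign, 100 malignant) rather than the actual study data, and uses stratified folds for illustration so that each fold preserves the benign/malignant ratio.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: 245 samples, ~145 negative (benign) / 100 positive.
X, y = make_classification(n_samples=245, n_features=20,
                           weights=[145 / 245], random_state=0)

# 5 folds of 49 samples each; every fold serves once as the validation set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv,
                         scoring="accuracy")
print(scores.mean())  # average accuracy over the 5 held-out folds
```

The mean of `scores` corresponds to the per-classifier performance index reported in the results tables.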
Herein, quantitative evaluation indexes such as accuracy, sensitivity, and specificity, which are usually used in medical diagnosis, are adopted to evaluate the classification quality. Accuracy is computed by equation (
Features extracted from different layers of the CNN are compared to find the most discriminative one for the nodule classification task in our research. The results of feature extraction and processing are shown in Figure
A block diagram of how features are extracted and trained from each layer of a pretrained VGGNet-16.
Considering that the features are high-dimensional vectors and training samples are limited, we choose a linear-kernel SVM as the classifier. In total, 7 SVM classification models are trained. Table
Experimental results with features extracted from different layers of CNNs.
State of CNNs | Dimension | Accuracy | Sensitivity | Specificity | Training time (s) | Testing time (s) |
---|---|---|---|---|---|---|
Pool1 | 783522 | 0.735 | 0.72 | 0.745 | 4219.3 | 63.7 |
Pool2 | 370436 | 0.755 | 0.74 | 0.766 | 2395.6 | 53.3 |
Pool3 | 177255 | 0.784 | 0.76 | 0.793 | 1500.4 | 41.4 |
Pool4 | 81547 | 0.804 | 0.78 | 0.821 | 957.6 | 35.1 |
Pool5 | 21342 | 0.823 | 0.8 | 0.848 | 510.9 | 27.3 |
FC6 | 3795 | 0.812 | 0.79 | 0.828 | 109.1 | 12.7 |
FC7 | 3740 | 0.653 | 0.59 | 0.697 | 107.3 | 11.5 |
Based on the results in Table
As a result, considering the balance between high predictive performance and relatively low dimensionality, we choose the FC6 layer as the optimal layer for extracting deep features: the feature vectors of the pooling layers are one to two orders of magnitude higher in dimension than those of the fully connected layers, which greatly increases the computational cost.
Finally, multiple sets of comparisons with various techniques are made in our work. First, different types of features are compared, and the SFs from B-US and the DFs from SWE-US are input to the same classifiers for comparison. Second, different feature selection methods are compared. The results of direct training without feature selection are compared with the results of training after feature selection using the PCA and
Experimental results of various techniques.
Features | Methods | Accuracy | Sensitivity | Specificity | Training time (s) | Testing time (s) |
---|---|---|---|---|---|---|
SF-TTEST | LR | 0.78 | 0.76 | 0.793 | 6.5 | 1.8 |
SF-TTEST | Naive Bayes | 0.764 | 0.75 | 0.772 | 6.2 | 1.5 |
SF-TTEST | SVM | 0.784 | 0.77 | 0.793 | 17.8 | 5.2 |
SF-PCA | LR | 0.795 | 0.77 | 0.807 | 8.6 | 1.9 |
SF-PCA | Naive Bayes | 0.784 | 0.76 | 0.793 | 8.5 | 1.8 |
SF-PCA | SVM | 0.804 | 0.79 | 0.814 | 18.7 | 5.3 |
SF | LR | 0.755 | 0.75 | 0.758 | 9.4 | 2.1 |
SF | Naive Bayes | 0.733 | 0.71 | 0.751 | 8.9 | 1.9 |
SF | SVM | 0.776 | 0.76 | 0.786 | 19.8 | 5.7 |
FC6 | LR | 0.784 | 0.77 | 0.793 | 57.9 | 11.2 |
FC6 | Naive Bayes | 0.755 | 0.74 | 0.766 | 48.2 | 10.6 |
FC6 | SVM | 0.812 | 0.79 | 0.828 | 109.1 | 12.7 |
Experimental results show that the best performance of models trained with DFs is better than that of models trained with statistical features. According to Table
It should be noted that when using PCA or
As previously mentioned, classification models of different modalities are trained, respectively, to compare the performance. According to the experimental results, we choose the predictive results from FC6-SVM and SF-PCA-SVM as the inputs of the voting system which is explained in Section
Experimental results of thyroid classification with voting system.
Methods | Accuracy | Sensitivity | Specificity |
---|---|---|---|
FC6-SVM | 0.812 | 0.79 | 0.828 |
SF-PCA-SVM | 0.804 | 0.79 | 0.814 |
Pessimistic | 0.824 | 0.89 | 0.779 |
Optimistic | 0.832 | 0.69 | 0.931 |
Compromise | 0.865 | 0.82 | 0.897 |
The best performing results are shown in bold.
ROCs and AUCs.
The performances of the hybrid approach based on multiple modalities are clearly better than those of a single modality such as FC6-SVM or SF-PCA-SVM. The performance after the voting system is greatly improved, achieving an accuracy of 0.865 (compromise criterion), sensitivity of 0.89 (pessimistic criterion), specificity of 0.931 (optimistic criterion), and AUC of 0.921 (compromise criterion). The accuracy rates of the pessimistic and optimistic criteria are relatively close, slightly better than those of FC6-SVM and SF-PCA-SVM. However, the accuracy of the compromise criterion increases by more than 5 percentage points, and the AUC improves by nearly 6 percentage points.
Differences in visual imaging can be quantified by algorithms such as statistical machine learning and deep learning to train a CAD system that automatically differentiates the nature of thyroid nodules. Many studies use ultrasound images as the dataset to train CAD systems for predicting benign and malignant nodules [
From the experimental results, it is obvious that the specificity scores are higher than the sensitivity scores for each model. There may be two reasons for this. One is that in our dataset, the number of benign samples is nearly 1.5 times the number of malignant samples; on the other hand, as shown in the ROIs of Figure
The deep features generated by CNNs can better represent the inherent features of the image, regardless of the field and type of the image. Therefore, it is feasible to transfer the pretrained VGGNet-16 model to the ultrasound domain, as our experiments confirm. In addition, models trained with features extracted from different CNN layers perform differently. According to the experiments, we find that, before the fully connected layers, the deeper the layer, the better the performance, and that the fully connected layers are weaker than the pooling layers. One possible explanation is that operations such as convolution and pooling map the original data into a hidden feature space. Higher layers capture different kinds of common features, while lower layers only compute low-level features and cannot represent high-level semantics. The main role of the fully connected layer is to map the hidden features to the sample label space; however, the target domain and the source domain differ considerably, which makes the performance of the fully connected layers relatively weaker.
To the best of our knowledge, this is the first attempt to train models with statistical features from B-US and deep features from SWE-US, respectively, and fuse them together by a voting system, which can improve the accuracy of thyroid nodule diagnosis. Sun et al. [
It needs to be emphasized that sensitivity is more important in clinical practice. Sensitivity indicates the proportion of malignant nodules correctly identified, while specificity indicates the proportion of benign nodules correctly identified. The higher the sensitivity, the lower the rate of missed diagnosis. In our research, a voting system is employed to combine the results of different classifiers to improve prediction accuracy. Although the accuracy of the compromise criterion is better than that of the pessimistic criterion, the sensitivity of the pessimistic criterion is higher, reaching 89%, which is 7 percentage points above that of the compromise criterion. We strongly recommend using the pessimistic criterion in clinical practice because high sensitivity can reduce missed diagnoses.
A large percentage (about 70%) of FNA results of thyroid nodules turn out to be benign [
Further research could be performed in this area to overcome limitations of this study. Deep models in medical intelligent diagnosis require large datasets, especially multicenter large datasets, to mine the hidden information of predictions to avoid overfitting the model. Additionally, a new hybrid approach including feature hybridization and classification result hybridization needs to be proposed and applied to improve the accuracy of the model.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.
Hongjun Sun and Feihong Yu contributed equally to this study.
This study was financially supported by the Research and Innovation Program for Graduate Education of Jiangsu Province (KYZZ15_0110) and the National Natural Science Foundation of China (NSFC) (71971115 and 71471087).
ROI segmentation: introduces the software and methods used to segment the ROI of the ultrasound image. Statistical feature extraction: contains the Python code that performs the statistical feature extraction described in the article. Statistical feature extraction methodology: describes the statistical features and their calculation methods in detail.