Differentiation of the Follicular Neoplasm on the Gray-Scale US by Image Selection Subsampling along with the Marginal Outline Using Convolutional Neural Network

We differentiated thyroid follicular adenoma from carcinoma on 8-bit bitmap ultrasonography (US) images using a deep-learning approach. For the data sets, we gathered small box images selected along the marginal outline of each nodule and applied a convolutional neural network (CNN) to them, differentiating each nodule by a statistical aggregation, that is, a decision by majority. Implementing the method with a newly devised, scalable, parameterized normalization treatment, we observed meaningful aspects in various experiments and collected evidence that features are retained on the margin of thyroid nodules. From 230 benign adenoma and 77 carcinoma US images, we used only 39 benign adenomas and 39 carcinomas to train the CNN model; with this extremely small training set, we tested 191 benign adenomas and 38 carcinomas and obtained an overall differentiation accuracy of 89.51% on the test data, with 93.19% accuracy for benign adenoma and 71.05% for carcinoma. We present numerical results including the area under the receiver operating characteristic curve (AUROC).


Introduction
Thyroid cancer has been one of the most frequently diagnosed cancers worldwide over the past few decades [1]. Follicular thyroid cancer is the second most common thyroid cancer after papillary thyroid cancer, comprising 10-20% of thyroid cancers. Follicular thyroid cancer has a higher incidence of distant metastasis and thus a worse prognosis than the more common papillary thyroid carcinoma [2][3][4]. Therefore, it is important to recognize this entity preoperatively for prompt management.
Follicular neoplasm of the thyroid gland comprises follicular adenoma and carcinoma. Differentiating these two entities preoperatively is challenging, and much clinical effort has been devoted to it. Overlapping clinical presentations, ultrasound (US) features, and molecular biology limit the diagnostic power of preoperative evaluation with US, fine-needle aspiration cytology, and immunohistochemistry [5][6][7][8]. Therefore, a differential diagnosis of these two entities is currently obtained by identifying capsular or vascular invasion at the periphery of the lesion on pathologic examination following diagnostic thyroidectomy [9].
In computer-aided diagnosis (CAD), many researchers have developed methods to detect thyroid nodules, or automated diagnosis assistance systems, mainly to differentiate benign from malignant thyroid nodules, to overcome the difficulties in the definitive diagnosis of nodule lesions, and to assist radiologists in developing a plan of action [10][11][12].
BioMed Research International
countries in various fields of our life, even in the area of medical sciences [13][14][15][16]. In this article, we develop and demonstrate new techniques and report meaningful aspects observed in various experiments, such as scaling a parameterized normalization, to draw reasonable evidence that features are retained on the margin of thyroid follicular neoplasms. Such evidence could help identify capsular or vascular invasion occurring at the margin of the lesion, or inspire an efficient numerical method to differentiate malignant from benign follicular neoplasms on US images with a convolutional neural network (CNN) [17].
In this paper, after reviewing other machine-learning methodologies in Section 2, we introduce our model training schemes in Section 3. They focus on a technique that disregards features of the interior area of thyroid nodule images; that is, we concentrate our image recognition model on capturing the features of the boundary region of thyroid follicular neoplasms, because the previously mentioned differential diagnosis, based on pathologic examination after diagnostic thyroidectomy, depends considerably on the properties of the boundary region of the nodules. In Section 4, we present numerical results obtained with a newly devised parameterized normalization treatment, including the area under the receiver operating characteristic curve (AUROC) and the corresponding curves, as well as the overall differentiation accuracy. Finally, in Section 5, we discuss the existence of features on the boundary of US thyroid follicular neoplasms that could be learned by our proposed CNN-based inference model, its efficiency, and our future work.

Technical Issues in US Classification Experiments Using Artificial Neural Network
From the viewpoint of machine learning or artificial intelligence techniques for differentiating malignant from benign thyroid nodules, there are many methods and treatments of sample data sets that extract efficient features for training a given machine learning or artificial neural network (ANN) model [10][11][18][19][20]. For the support vector machine (SVM), remarkable feature extraction techniques and image subsampling treatments have been used to train classification models efficiently, such as those found in [10][20][21][22][23]. For ANN-type methods, the methodologies found in [10][19][24][25][26][27] mostly rely on preprocessed training with feature extraction techniques, including pathological reports or patient information such as age, sex, health condition, and the results of various medical tests or cytological data. In other words, most of the ANN methods found there are trained not directly on US images but on nonimage input data extracted from the original US image information.
In our implementation of CNN model training for differentiating between thyroid follicular adenoma and carcinoma on US thyroid images, we feed US images of a fixed pixel resolution directly to the input nodes without extracting features. The work in [21], which differentiates risky hypoechoic thyroid nodules, tries to capture the features found in the boundary region of thyroid nodules by setting up a data set comprising 131 medium-risk hypoechoic nodules characterized by regular boundaries and 42 high-risk hypoechoic nodules characterized by irregular boundaries. However, since the morphological shapes of those boundary regions are so distinctive that even human eyes may easily recognize the risky nodules, one cannot be sure that such a model would fit the ambiguously shaped general cases of thyroid follicular adenoma and carcinoma (refer to Figure 1). Below we describe our own collection of sample thyroid nodule images for our CNN classification models, and we then introduce and define the training methodology in Section 3.
Our own collection of sample thyroid images comprises 250 cases of follicular adenoma and 83 cases of follicular carcinoma, visualized as gray-scale 8-bit bitmap US thyroid nodule images. The data sets were obtained from 2 different US clinics, identified as Hospital A (= HA) and Hospital B (= HB) (refer to Table 1). For clinic HA, in total, 230 patients with 230 thyroid nodules were included in this study. Of the 230 patients, 51 (22.174%) were men and 179 (77.826%) were women. The mean age of the 230 patients was 48.72 years. The mean size of the 230 thyroid nodules was 29.84 mm, and the mean pixel intensity of the gray-scale 8-bit bitmap US images is 63.819, where the mean of the maximum intensity is 176.1475 and the mean of the minimum intensity is 7.1230. For HB, in total, 103 patients with 103 thyroid nodules were included, of whom 22 (21.359%) were men, 71 (68.933%) were women, and 10 (9.708%) had no recorded sex; the mean age was 43.90 years. The mean size of the 103 thyroid nodules was 32.81 mm, and the mean pixel intensity of the gray-scale 8-bit bitmap US images is 82.07, where the mean of the maximum intensity is 192.1154 and the mean of the minimum intensity is 6.6827. Both data sets come from the institutional databases, which were reviewed from January 2003 onward for patients diagnosed with follicular adenoma or follicular carcinoma after surgical excision. Table 1 lists the numbers of our sample cases of US thyroid images.

US Differentiation Applying CNN
We make use of CNN to differentiate US images of follicular neoplasms between the adenoma and the carcinoma. We demonstrate experiments with the data set given in Table 1 to train a CNN model to infer the differentiation.

Making Subsets.
Here, we aim to derive a data-invariant numerical result on the fine image features, retained on the margin of thyroid follicular neoplasms, that our CNN model captures across as many examinations as possible. To this end, we organize the data set given in Table 1 into 6 disjoint subsets (see Table 2).
After removing contaminated US images, whose marginal area is tainted by extraneous marks such as the radiologist's diagnostic marking signs, we reduced the data sets shown in Table 2 to the refined sets listed in Table 3, in which each refined set, marked with an asterisk (Set *), corresponds to the subset with the same label in Table 2.

Training Data and Test Data.
To train our model, we use Set * as training data and each of the other subsets as test data, based on the data sets given in Table 3; that is, an extremely small training set is paired with small test sets, so that we can run various examinations and deduce the existence of data-invariant, fine common features captured by our nodule-boundary-based CNN modeling. To set up the practical training and test data sets based on the boundary of each nodule, we select small 2D box images (here 50 × 50 pixels in size) aligned on the contour of each thyroid follicular neoplasm's margin (see Figure 2). For the training data, we followed the contour of the nodule's margin and manually chose somewhat distinctive box images, while for the test data we select box images centered at every pixel on the manually drawn, closed virtual contour line of the thyroid nodule's margin. This yields the training and test data sets given in Table 4, in which each set marked with a circle (Set ∘) corresponds to the refined set with the same label (Set *) in Table 3.
With the data sets of Table 3 and the training and test data organization given in Table 4, we examine the differentiation by applying a decision by majority, judging each follicular neoplasm from the subsampled data taken from its own boundary region. As a simple illustration of our CNN-based statistical inference with the decision by majority, assume that 500 subsampled images are selected from the boundary of a nodule and that our trained CNN model classifies 255 of them as carcinoma and 245 as adenoma; we then determine that the nodule is carcinoma, because the carcinoma count exceeds the adenoma count (see Figure 3).
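The box selection for test data and the decision by majority described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `extract_boxes` and `majority_vote`, the label coding (1 = carcinoma, 0 = adenoma), the skipping of contour points whose box would leave the image, and the resolution of ties in favor of adenoma are all assumptions made here.

```python
import numpy as np

BOX = 50  # box side in pixels, as stated in the text


def extract_boxes(image, contour, box=BOX):
    """Cut one box-by-box patch centred on each contour pixel.

    `image` is a 2D uint8 array and `contour` a sequence of (row, col)
    points on the manually drawn margin. Points whose box would leave
    the image are skipped (an assumption; the text does not say how
    such points are handled).
    """
    half = box // 2
    rows, cols = image.shape
    patches = []
    for r, c in contour:
        top, left = r - half, c - half
        if top < 0 or left < 0 or top + box > rows or left + box > cols:
            continue
        patches.append(image[top:top + box, left:left + box])
    if not patches:
        return np.empty((0, box, box), image.dtype)
    return np.stack(patches)


def majority_vote(patch_labels):
    """Nodule-level decision from per-patch labels (1 = carcinoma, 0 = adenoma).

    The nodule is called carcinoma only if carcinoma votes strictly
    exceed adenoma votes; a tie falls to adenoma (an assumption).
    """
    labels = np.asarray(patch_labels)
    carcinoma = int(np.sum(labels == 1))
    adenoma = int(np.sum(labels == 0))
    return "carcinoma" if carcinoma > adenoma else "adenoma"
```

With the counts of the example in the text, `majority_vote([1] * 255 + [0] * 245)` gives the carcinoma decision.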

The Structure of Convolutional Neural Network as a CNN Model.
We apply an AlexNet type of CNN structure [28] to train on the data sets; it comprises 5 convolutional layers and 2 pooling layers, the details of which are described in Table 5 and Figure 4. (In Table 5, the two symbols represent, respectively, the size of the convolution kernel for each input channel and the total number of kernels applied to each layer.)

Overview.
In view of the setup, the data set is organized under the assumption that every margin of a thyroid follicular neoplasm may contain certain obvious features that help differentiate between adenoma and carcinoma, and that those features can be detected and learned even from a small number of thyroid nodule images [9]. The contour of each thyroid follicular neoplasm was outlined by the medical specialists of both clinics, Samsung Medical Centre and Yonsei University Medical Centre in Seoul, South Korea, who are coauthors of this article.
Table 5: Training structure of the convolutional neural net (5-conv, 2-pool, 2-fully-conn structure).

Numerical Results
In this section, we present numerical results on differentiating thyroid follicular neoplasms into adenoma and carcinoma, together with some observations on CNN feature recognition under a newly developed data normalization method based on a parameterized scaling treatment.
For the numerical results in this section, we train the CNN model described in Table 5 and Figure 4 with 380 training epochs, a batch size of 400, a learning rate of 0.0001, and a dropout rate of 0.5, using a standard backpropagation algorithm [17,28,29]. We customized the popular TensorFlow library (version 1.0.0) in Python 3.x for the main programs of the experiments. Training each experimental model took several minutes, and inference on the test data sets took a few seconds, on two NVIDIA Pascal Titan X 12 GB GPUs.

Training Aspects of the Parameterized Scaling Treatment in Data Normalization.
Here, we give training results of the CNN with regard to the data normalization, applying a parameterized scaling treatment. For the normalization of training data in our experiments, we apply a mean-zero based min-max normalization, which transforms all the values of the input data into a common range [0, 1] and then subtracts the mean of the input data set. We let a pair of indices (i, j) represent the pixel located at the i-th position along the x-axis and the j-th position along the y-axis of each input image. The mean-zero based min-max normalization (1) for training data is then the min-max scaled pixel value at (i, j) minus the mean value taken at position (i, j).
The test data are normalized in the same manner but with a scaling parameter, as given in (2), where the mean value is again taken over the pixel values of the test data at position (i, j). Note that if the scaling parameter is set to 0 in (2), the result is the plain min-max normalization [30].
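One form of (1) and (2) that is consistent with the verbal description can be written out as follows. It is a sketch under stated assumptions and not necessarily the exact published form: the symbols $x_{ij}$ and $y_{ij}$ (training and test pixel values at position $(i, j)$), the hat notation for the min-max scaled value, the symbol $\alpha$ for the scaling parameter, the minimum and maximum being taken over the respective data set, and the parameter entering as a multiplier of the mean term are all choices made here.

```latex
\begin{align}
\hat{x}_{ij} &= \frac{x_{ij} - \min x}{\max x - \min x}, &
v_{ij} &= \hat{x}_{ij} - E\!\left[\hat{x}_{ij}\right], \tag{1}\\
\hat{y}_{ij} &= \frac{y_{ij} - \min y}{\max y - \min y}, &
\tilde{v}_{ij}(\alpha) &= \hat{y}_{ij} - \alpha\, E\!\left[\hat{y}_{ij}\right]. \tag{2}
\end{align}
```

Under this reading, $\alpha = 0$ reduces (2) to the plain min-max normalization, in agreement with the remark in the text, and $\alpha = 1$ gives the test data the same mean-zero form as (1).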
Here we examine the CNN model on the test data. We let the scaling parameter in (2) range over [−1.5, 1.5] in increments of 0.3. For the results obtained with the test sets listed in Table 4, we present the accuracy of differentiation in percentage (%), and for each test set we draw the plots given in Figures 5(a)-5(g): plots of the true benignancy of adenoma for the five adenoma test sets and of the false benignancy of carcinoma for the two carcinoma test sets. In Figure 5, each curve represents the tendency of differentiation for a single follicular nodule; for example, a test set with 30 nodules (refer to Table 3) gives 30 curves in Figure 5(a), and, for a given value of the scaling parameter, each point on the corresponding vertical line indicates the percentage (%) classified as benign, one point per nodule. Summarizing the plots in Figure 5, we then plot the mean cumulative percentage (%) against the scaling parameter for the true benignancy of adenoma test data and for the false benignancy of carcinoma data, and observe the slopes of these plots, which represent the tendency to be classified as benign adenoma. We provide the plots to compare those slopes in Figure 6.
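The parameter sweep just described can be sketched as follows. The code is illustrative only: the name `alpha` for the scaling parameter, the reading of (2) as min-max scaling followed by subtraction of `alpha` times the per-position mean, the use of the data set's global minimum and maximum, and the `predict` callable standing in for the trained CNN are assumptions, not details confirmed by the text.

```python
import numpy as np


def minmax(images):
    """Rescale pixel values to [0, 1] using the data set's min and max."""
    images = np.asarray(images, dtype=np.float64)
    lo, hi = images.min(), images.max()
    return (images - lo) / (hi - lo)


def normalize(images, alpha=1.0):
    """Min-max scale, then subtract alpha times the per-position mean.

    alpha = 1 gives a mean-zero normalization (assumed form of (1));
    alpha = 0 gives the plain min-max normalization.
    """
    scaled = minmax(images)
    return scaled - alpha * scaled.mean(axis=0)


def sweep_values(start=-1.5, stop=1.5, step=0.3):
    """Grid of scaling parameters: -1.5, -1.2, ..., 1.5."""
    count = int(round((stop - start) / step)) + 1
    return np.round(start + step * np.arange(count), 10)


def percent_benign(patches, alpha, predict):
    """Percentage of one nodule's patches that `predict` labels benign (0).

    `predict` is a hypothetical stand-in for the trained CNN: it takes
    normalized patches and returns one label per patch.
    """
    labels = np.asarray(predict(normalize(patches, alpha)))
    return 100.0 * float(np.mean(labels == 0))
```

Evaluating `percent_benign` for every value in `sweep_values()` yields one curve per nodule, which is the kind of curve the text describes for Figure 5; averaging those curves over the nodules of a test set gives a summary curve of the kind described for Figure 6.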
In the plots of Figure 6, the slopes of the mean cumulative percentage (%) against the scaling parameter are positive for all the plots where the parameter is at least −0.5. This behavior can raise the total differentiation accuracy for true benign data but can also lower it for carcinoma data, which suggests fine-tuning through the control of the scaling parameter.
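Because raising the benign rate helps adenoma accuracy and hurts carcinoma accuracy at the same time, a threshold-free summary such as the AUROC is a natural companion to these plots. The sketch below computes it in the pairwise (Mann-Whitney) form. The text does not state how the AUROC was computed, so the choice of score (for instance, the fraction of a nodule's boxes voted carcinoma) and the function name `auroc` are assumptions.

```python
import numpy as np


def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    `scores`: higher means more likely carcinoma; `labels`: 1 = carcinoma,
    0 = adenoma. A tie between a carcinoma and an adenoma score counts
    as one half.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one nodule of each class")
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins) / (len(pos) * len(neg))
```

The value is the probability that a randomly chosen carcinoma nodule receives a higher score than a randomly chosen adenoma nodule, so it does not depend on where the majority threshold is placed.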

Fine-Tuning Effect of the Parameterized Data Normalization.
Besides the fact that controlling the scaling parameter can increase the total differentiation accuracy, a demonstration on one set of test data suggests that a good choice of the parameter yields a well-tuned CNN differentiation model. The result of this demonstration, conducted on one test set with the scaling parameter set to 0.15, is given in Table 6. Figure 7 plots the differentiation percentage (%) against the scaling parameter for false benignancy and true benignancy on this test set. In Figure 7(a), around the parameter value 0.15, about 19 of the points on the vertical line have values below 50%.
On the other hand, since three of the test sets are derived from the data set HA and two from HB, we apply a different normalizing parameter in (2) for the sets from HA and for those from HB: 1.5 for HA and 0.15 for HB. The differentiation results for both HA and HB are given in Table 7.
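The class-wise and overall accuracies reported for the test data are related by simple counting, which the short sketch below makes explicit. The counts used (178 of 191 adenomas and 27 of 38 carcinomas classified correctly) are inferred here from the percentages 93.19% and 71.05% given in the abstract and are not taken from a table; the names `ALPHA_BY_SOURCE` and `accuracy_summary` are likewise illustrative.

```python
# Per-hospital scaling parameters as stated in the text.
ALPHA_BY_SOURCE = {"HA": 1.5, "HB": 0.15}


def accuracy_summary(correct_benign, n_benign, correct_malignant, n_malignant):
    """Return class-wise and overall accuracy in percent."""
    benign = 100.0 * correct_benign / n_benign
    malignant = 100.0 * correct_malignant / n_malignant
    overall = 100.0 * (correct_benign + correct_malignant) / (n_benign + n_malignant)
    return {"benign": benign, "malignant": malignant, "overall": overall}


# Counts inferred from the reported percentages (an assumption).
summary = accuracy_summary(178, 191, 27, 38)
```

With these counts the overall accuracy is 205 of 229, about 89.5%, which agrees with the reported 89.51% up to rounding in the last digit.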

Discussion
In our experiments on CNN inference modeling to differentiate follicular adenoma from carcinoma on gray-scale 8-bit bitmap US thyroid images, we implemented the mean-zero based min-max normalization defined in (1) for the input data used to train the CNN architecture, and rescaled it with the scaling parameter in (2) for the test data. Regarding the training setup, Table 3 shows that our training and test data come from two different clinic centres and that the number of samples available for training is very limited: all the follicular carcinoma images from clinic HA are used as training data, and the sample images from HB are used as test data, so the fixed partitioning scheme followed naturally. From the experiments in which the normalization parameter was varied over the real interval [−1.5, 1.5], we found that the slopes of the mean cumulative percentage (%) against the parameter are positive for all the plots where the parameter is at least −0.5. This behavior increased the total differentiation accuracy for true adenoma data but decreased it for carcinoma data, providing a means of fine-tuning through the control of the parameter. Although the training data are chosen among the subsets of HA, by choosing a different normalizing parameter for each of the two hospital data sets, HA and HB, we could differentiate the images in HB, and the resulting overall test accuracy of over 89% supports the applicability of our inference model.
Furthermore, in the test results shown in Figure 6, no two data sets whose original hospital databases differ have plots that cross each other where the scaling parameter is nonnegative. This behavior weakly suggests that the two hospital databases each have their own distinctive image characteristics, so that it makes sense to apply a different normalizing parameter to each hospital data set. One may attribute this to the configuration of pixel intensities, which differs between the two data sets HA and HB. (Recall that, for HA, the mean pixel intensity of the gray-scale 8-bit bitmap US images is 63.819, the mean of the maximum intensity is 176.1475, and the mean of the minimum intensity is 7.1230, whereas, for HB, the mean pixel intensity is 82.07, the mean of the maximum intensity is 192.1154, and the mean of the minimum intensity is 6.6827, as noted before.) On the other hand, with so few data, one would hardly expect good performance in inferring a diagnostic determination, compared with relatively plentiful data sets such as MNIST and ILSVRC [32]. Hence, to tackle our small-data-set problem, we mainly seek to develop inference methodologies and to overcome the extremely harsh task of inference with a small data set through a kind of ensemble-like neural-network method. Moreover, as with other machine-learning-based technology, we cannot yet be sure of the robustness of our methodology: like most vision-based deep-learning architectures, it depends heavily on how the sample data sets are organized and on how many samples are available for a specific inference task, so the proposed methodology may or may not suffer from those kinds of problems.
In this article, we have not offered any mathematical proof of the theoretical issues related to our numerical results; rather, we have given experimental evidence for the possible utility of the method. The experiments in [5] likewise show that, although the samples are scarce, reasonable research insights into the diagnostic differentiation of follicular neoplasm lesions of the thyroid can be drawn. We hope that this opens the way for readers with more plentiful sample data to apply methods similar to ours successfully.
Regarding sample data acquisition, the two health centres, Hospital A (= HA) and Hospital B (= HB) (see Table 1), have different protocols for acquiring ultrasound images, depending on the apparatus used; that is, the machines that take the ultrasound images and the related mechanical conditions are different. It is therefore difficult to adjust the two clinics' data sets to the same ultrasound intensity depth and resolution, and we believe that the differences in those parameters influence the results of the inference model. This is reflected in the classification results: data sets from the same clinic show similar up-and-down slopes of differentiation, that is, their plots lie closer to one another than to those of the other clinic's data sets (see Figure 6).
Regarding sample data organization, the critical point in deciding how many data to assign to training and to testing is largely the number of follicular carcinoma images. To balance the number of samples for training the model, we chose, without loss of generality, the clinic with more samples (here HA, see Table 3) over the other clinic (here HB, see Table 3) as the source of training data. The total number of follicular carcinoma sample images available for developing our inference model is smaller than that of follicular adenoma images, so we took the training data from the sample images of HA, which has more sample data than HB, especially follicular carcinoma images. Moreover, mixing data acquired under the different protocols of the two clinic centres could confuse the training of the inference model; to avoid such an ill-conditioned data organization and its training results, we mainly took the training data from one clinic and the test data from the other. Finally, we organized the training data and the test data as given in Table 3. We now explain our choice of hyperparameters for the proposed neural network. Figures 5 and 6 show that the differentiation results change as the proposed normalization parameter moves, and these differentiation trends proved to be consistent across models trained with some variation of the network's parameters, such as batch size and learning rate.
Consequently, our proposed values of the neural network's parameters are one good choice, which enabled us to obtain numerical results that demonstrate the effectiveness of our methodology for inferring the differentiation under our organization of the data sets. In our experiments, we observed overfitting or underfitting on the validation sets when training ran beyond several hundred epochs, and a similar phenomenon often occurred for some variations of the learning rate, and so on. For the dropout rate, we follow the value given in [32], which deals with AlexNet. (The technique called "dropout" [29] consists of setting to zero the output of each hidden neuron with probability 0.5; the neurons that are "dropped out" in this way do not contribute to the forward pass and do not participate in backpropagation.) For the structure of the CNN, in our experiments, CNNs with many heavy layers showed no prominent advantage over the popular AlexNet type of CNN architecture. For the 2D box image of size 50 × 50 pixels, as illustrated in Figure 9, the raw contour ROI of the US images taken from both clinic centres has a resolution of about 200∼600 × 200∼600 pixels, and we reasoned that the resampled 2D box image, shown as the red square in Figure 9 (from which the full US image is differentiated by our ensemble-like voting system of CNN), should be neither too small nor too large, so that the inference model does not lose the critical morphological, vision-based features that may reside in the boundary region of the thyroid lesion. Of course, we cannot claim that our choice of box size is the best one; it is one good choice for the inference model.
Unfortunately, as with most deep-learning models, especially vision-based models such as CNNs, each model has its own distinctive inference behavior, and one may regard it as a black box that resists mathematical analysis.
On the other hand, the choice of our neural network's parameters does not guarantee absolute superiority for our applied AlexNet type of neural network; it depends only on one's own data sets and experimental experience, and the proposed method and its numerical results are meant only to give readers insight into the possibility and effectiveness of our proposed inference model.
In our experimental experience, we have tried various kinds of examinations with SVM, K-NN, simple ANN, and so on. Unfortunately, these experiments did not yield any acceptable inference models. Finally, when we applied our proposed methodology, we observed breakthrough results, although one may still doubt its performance on real big data. These results of our proposed method suggest that ensemble-like methods may be superior to generally known classical inference methodologies for this classification problem.
Figure 9: An example of a raw contour ROI of a US thyroid image with a resolution ranging over 200∼600 × 200∼600 pixels. The red square represents an example of the 2D box image we selected to set up the data sets for developing our deep-learning inference model, as described in Section 3.1.
Table 8 [33,34]. The readers may well compare the results to those in Table 7.

Comparison with the Benchmark Thyroid
Even in preliminary experiments with CNN inference on full US images (not resampled along the contour), we found a total accuracy of ∼75%, but many follicular carcinoma images still failed to be differentiated.

Comparison with USFNA Based Differentiation for a Follicular Thyroid Neoplasm US Images.
To compare the performance of our differentiation method for US images of follicular thyroid neoplasm, we refer to USFNA (ultrasound-guided fine-needle aspiration) and the experimental results in [5], where the FNA accuracy ranges over 51∼67%, which is inferior to our proposed methodology, as given in Table 9. On the other hand, general benchmark computer-aided systems are listed in [35], where the author collected sample images from the open database proposed by Pedraza et al. [36]. That work applied transfer learning from a pretrained GoogLeNet network and achieved excellent classification performance: 98.29% classification accuracy, 99.10% sensitivity, and 93.90% specificity. Although the various computer-aided differentiation systems for US thyroid images found in [21][22][23][35] present excellent performance, their models mostly deal with papillary thyroid carcinoma. There are also many reports that USFNA, which is widely used to discriminate between benign and malignant thyroid lesions, shows excellent performance (sensitivity 65%-98% and specificity 72%-100%) for papillary thyroid carcinoma [5].

Conclusion
Although our data sets are not plentiful compared with those of some well-known big-data-based machine-learning models, and although, according to the concurrent research in the cited references, follicular thyroid neoplasm US images are still not well studied with deep-learning-based inference technology, we conclude that our proposed CNN method, with data sets built by image selection subsampling along the boundary of thyroid follicular neoplasms, may detect morphological features reflected in the boundary region of nodules. This is supported by background knowledge of the known US image features that serve as criteria for diagnosing carcinoma among thyroid follicular neoplasms in clinical reports, especially concerning the characteristics of the marginal contour region of thyroid follicular neoplasms.

Future Works
Meanwhile, these results also suggest that some image features, which can be revealed by scaling the normalization parameter, exist on the boundary of nodules, so that a CNN inference model recognizes and learns them. These conjectures about the existence of learnable image features adjacent to the boundary of nodules need to be tested for our CNN model with a variety of fine-tuning techniques, including standardization (z-score normalization), tanh-estimators, and other data normalizing techniques [37], as well as by adjusting batch training modes, learning rate, convolution layers, and so on. Moreover, although we fixed the pixel resolution in this article to 50 × 50 for the subsampling image selection near the boundary of nodules, one may choose other subsampling image sizes to train the CNN and compare their efficiencies.
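For orientation, two of the alternative normalizations named above can be sketched as follows. This is a minimal illustration, not part of the reported experiments: the tanh-estimator is given in a commonly used simplified form that takes the plain mean and standard deviation in place of the Hampel estimators of the original formulation, and the constant 0.01 and the function names are assumptions.

```python
import numpy as np


def z_score(x):
    """Standardization: zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / x.std()


def tanh_estimator(x, scale=0.01):
    """Simplified tanh-estimator mapping values into the open interval (0, 1).

    Uses the plain mean and standard deviation instead of Hampel
    estimators (a simplification made for this sketch).
    """
    x = np.asarray(x, dtype=np.float64)
    return 0.5 * (np.tanh(scale * (x - x.mean()) / x.std()) + 1.0)
```

Either function could replace the min-max step when comparing how sensitive the boundary-based differentiation is to the choice of normalization.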