Ensemble Learning with Multiclassifiers on Pediatric Hand Radiograph Segmentation for Bone Age Assessment

In pediatric automatic bone age assessment (BAA) in clinical practice, extracting the object area from hand radiographs is an important step that directly affects the prediction accuracy of BAA, yet no fully satisfactory segmentation solution has been found. This work aims to develop an automatic hand radiograph segmentation method with high precision and efficiency. We treated the hand segmentation task as a classification problem in which the optimal segmentation threshold of each image is the prediction target. The normalized histogram, mean value, and variance of each image were used as input features to train a classification model based on ensemble learning with multiple classifiers. The dataset included 600 left-hand radiographs with bone ages ranging from 1 to 18 years. Compared with traditional segmentation methods and the state-of-the-art U-Net network, the proposed method achieved higher precision with less computational load, with an average PSNR of 52.43 dB, SSIM of 0.97, DSC of 0.97, and JSI of 0.91, which makes it more suitable for clinical application. Furthermore, the experimental results verified that hand radiograph segmentation could bring an average improvement in BAA performance of at least 13%.


Introduction
Automatic bone age assessment (BAA) based on hand radiographs is a crucial diagnostic technique for evaluating growth disorders and endocrine abnormalities in pediatric and adolescent patients. In clinical practice, it is usually performed by radiological examination of the left hand and wrist to assess skeletal maturity [1][2][3]. The Greulich and Pyle (G&P) method [4] and the Tanner-Whitehouse (TW3) method [5] are the two most widely used traditional methods for bone age estimation, but both are time-consuming and subjective. Therefore, automatic evaluation of bone age based on computing power and machine learning techniques, especially deep convolutional neural networks (CNNs), has been studied and has promoted the development of BAA [6].
In the processing pipeline of automated BAA, image preprocessing, segmentation, and normalization have been shown to improve the robustness and performance of BAA models [7][8][9][10]. The most important of these is hand bone segmentation [11], which can seriously affect the prediction accuracy. Hand bone segmentation removes all extraneous objects, such as radioactive markers, impurities, and noise, and extracts the whole hand. Medical image segmentation is a necessary but challenging step in most image analysis and classification problems. During digital radiograph acquisition, an intrinsic effect arises when the radiation intensity is unevenly distributed over the examined subject [12,13]. Owing to the uneven radiation intensity and various man-made factors, most hand radiographs contain motion artifacts, noise, and asymmetric illumination. As the representative examples in Figure 1 show, most images have low contrast and blurred edges, which makes it difficult to extract the entire hand bone region from the background.
In previous studies, the most widely used traditional segmentation techniques fall into several kinds according to the image characteristics they exploit: thresholding, clustering, edge-based, region-based, deformable models, and hybrid techniques [14][15][16]. However, these methods frequently result in oversegmentation, especially of the distal phalanx, and their robustness limitations become even more pronounced on large datasets. Therefore, deep learning techniques were introduced to medical image segmentation [17], and patch-based CNN pixel classification is one of the most popular segmentation methods [18]. The first published application of a patch-based CNN to hand and wrist segmentation used the LeNet-5 network [19]. In that study, 1000 radiographs were split into sample patches to train the detection network, but because the patches overlapped, the network was very time-consuming. The U-Net network [20], which was originally proposed for medical image segmentation and can be used for segmentation problems with limited amounts of data [21], was applied to predict hand masks [22]. In another work, VGG-16 [23] was integrated with U-Net as an encoder-decoder structure to obtain the hand mask [24]. However, the U-Net network required multiple rounds of training for binary image segmentation, and most predicted label maps had false-positive regions assigned to the hand class. Manual labor was needed to clean these masks, and the model was retrained six times. Deep CNNs have gradually been applied to medical image segmentation, but they have shown weak efficiency for automatic hand radiograph segmentation in recent research. Moreover, creating labels for the CNN training datasets required a great amount of work. As a result, it is necessary to design a segmentation method with low complexity and strong processing capability.
To address the problems mentioned above, we proposed a model that predicts the optimal segmentation threshold for hand mask segmentation, trained with multiple classifiers based on ensemble learning. We also compared the proposed method with representative traditional segmentation techniques and the U-Net network.

Materials and Methods
In this section, we describe our approach for hand radiograph segmentation; the whole procedure is illustrated in Figure 2. The proposed method consists of four stages: (1) image enhancement using histogram equalization, (2) label making of the optimal segmentation threshold, (3) 2-level ensemble learning for classification model training based on multiple classifiers, and (4) postprocessing through region growing to obtain clean hand masks.
2.1. Image Enhancement. Image enhancement is essential for improving the segmentation performance and the robustness of image processing [25]. Histogram equalization is an efficient way to enhance the contrast and smooth the histogram of hand radiographs [26,27]. Generally, a histogram with two obvious peaks is well suited to selecting the optimal threshold of an image, whereas histograms with several small peaks need to be processed to restore contrast. After this processing, the optimal threshold value can be found easily to create the hand mask.
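As an illustration of the enhancement step, the following is a minimal sketch of global histogram equalization on an 8-bit grayscale image, written with the Python standard library only. The function name `equalize_histogram`, the list-of-rows image layout, and the toy input are assumptions made for this example; the text does not state which implementation was used.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image given as a list of rows of ints."""
    # Count how often each gray level occurs.
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1

    # Cumulative distribution of the gray levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)  # CDF at the darkest level present
    if total == cdf_min:
        # Constant image: there is no contrast to stretch.
        return [row[:] for row in image]

    # Map each level so that the output CDF is approximately linear.
    scale = (levels - 1) / (total - cdf_min)
    lookup = [min(levels - 1, max(0, round((c - cdf_min) * scale))) for c in cdf]
    return [[lookup[value] for value in row] for row in image]


# Hypothetical low-contrast patch: all gray values lie between 100 and 103.
low_contrast = [
    [100, 100, 101, 102],
    [100, 101, 102, 103],
    [101, 102, 103, 103],
    [102, 103, 103, 103],
]
enhanced = equalize_histogram(low_contrast)
```

After equalization the four gray levels of the toy patch are spread over the full 0 to 255 range, which is the contrast-restoring effect described above.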

2.2. Label Making. The main steps to find the optimal segmentation threshold can be summarized as follows:
Step 1: Choose 40 candidate threshold values at intervals of 2 below the average and binarize the image with each of them. The selection range of the threshold value can be extended to a limited extent if no correct value is available.
Step 2: Select the threshold value with the best segmentation result as the training label. The optimal threshold should meet two conditions: first, the background is completely separated from the palm, and second, the details of the hand mask are well preserved. Where several thresholds qualify, prefer the larger value as the label; as shown in Figure 3, the threshold value of 190 was marked as the label. In particular, the selected optimal segmentation thresholds were corrected, without interference, by three people with professional backgrounds.
Step 3: Calculate the histogram, mean, and variance of the gray values of each image as the features, and use the features and labels as the training dataset.
Step 4: To eliminate the adverse effect caused by outliers, the training set was standardized by min-max normalization. Here $I_i$ is the pixel value, $f_i$ is the feature, and $l_i$ is the label of each image. In this way, the deviation is finally normalized to (0, 1).
International Journal of Biomedical Imaging
Figure 4 shows the distribution of the sample labels chosen manually from the training set. As can be seen, almost all the optimal threshold values lay between 150 and 200 after enhancement. This suggests that the enhancement made the distribution of the labels more uniform, which makes model fitting less susceptible to imbalanced samples.
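The label-making steps can be sketched as follows. This is an illustrative sketch only: the function names, the reading of Step 1 as "40 values descending in steps of 2 from a reference value", the choice of treating pixels above the threshold as hand, and the per-column form of min-max normalization, x' = (x - min) / (max - min), are assumptions, since the normalization equation is not given in the text.

```python
def candidate_thresholds(reference, count=40, step=2):
    """One reading of Step 1: `count` values spaced `step` apart below `reference`."""
    return [reference - step * k for k in range(1, count + 1)]


def binarize(image, threshold):
    """Binary mask: 1 where the gray value exceeds the threshold (assumed bright hand)."""
    return [[1 if value > threshold else 0 for value in row] for row in image]


def image_features(image, levels=256):
    """Step 3: normalized histogram followed by the mean and variance of the gray values."""
    pixels = [value for row in image for value in row]
    n = len(pixels)
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1
    mean = sum(pixels) / n
    variance = sum((value - mean) ** 2 for value in pixels) / n
    return [count / n for count in hist] + [mean, variance]


def min_max_normalize(rows):
    """Step 4: rescale every feature column to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical 2 x 2 image: one dark pixel and three saturated pixels.
features = image_features([[0, 255], [255, 255]])
normalized = min_max_normalize([[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]])
```

Each image then contributes one feature vector of length 258 (256 histogram bins plus mean and variance), paired with its manually selected threshold as the label.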

2.3. Ensemble Learning Framework. An individual classifier may not be able to learn enough information, whereas ensemble learning can improve on the performance of a single classifier by combining several of them [28]. Ensemble learning is one of the most useful strategies for improving the generalization performance of a prediction model, and its core is the training strategy for the base classifiers, such as bagging, boosting, and stacking [29]. Bagging and boosting build the base learners from a single dataset, which has an impact on diversity, whereas stacking uses multiple classifiers and takes the predictions of the previous level as the input variables for the next level [30]. Therefore, the stacking strategy was adopted to construct the ensemble learning framework for hand segmentation. A simplified flow diagram of the stacking algorithm is shown in Figure 5.
Considering the computing cost, we selected the top five classifiers, RandomForest, ExtraTrees, Bagging, AdaBoost, and SVC, as the base learners of the stacked model. To increase the diversity of the base classifiers, we applied 5-fold cross validation in the training process, as illustrated in Figure 6.
Step 1: The training set $D$ was randomly divided into five subsets $D_1, D_2, \ldots, D_5$ of similar size. For base model training, $\bar{D}_i = D \setminus D_i$ was defined as the training set and $D_i$ as the testing set, where $i \in \{1, 2, 3, 4, 5\}$ and $D = \{D_1, D_2, D_3, D_4, D_5\}$. The whole testing set was denoted as $T$.
Step 2: Model 1 was trained on $\bar{D}_1$ and made a prediction $P_{11}$ on $D_1$. This operation was repeated five times, once per fold, which gave a new training set $P_1$ formed by joining $P_{11}, P_{12}, P_{13}, P_{14}, P_{15}$.
Step 3: In each of the five folds, the whole testing set $T$ was predicted by base model 1 trained on $\bar{D}_i$, giving the predictions $T_{11}, T_{12}, T_{13}, T_{14}, T_{15}$, respectively. A new testing set $T_1$ was then obtained by averaging these five predictions, $T_1 = \frac{1}{5}\sum_{i=1}^{5} T_{1i}$.
For base models 2 to 5, steps 2 and 3 were repeated until the training sets $P_2, P_3, P_4, P_5$ and the testing sets $T_2, T_3, T_4, T_5$ were obtained. The predictions provided by each base model were combined into a new training set $P = (P_1, \ldots, P_5)$ and a new testing set $T' = (T_1, \ldots, T_5)$. $P$ and $T'$ were then passed to the second-level model, with the labels remaining the same as in the original dataset.
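The first-level procedure in Steps 1 to 3 can be sketched as below. This is a minimal illustration, not the implementation used in the study: the two toy learners `NearestMean` and `OneFeatureNeighbor` are hypothetical stand-ins for the five base classifiers, the helper names are invented for this example, and the second-level model that would be fit on `P` is omitted.

```python
import random


class NearestMean:
    """Toy base learner: predict the label whose mean feature vector is closest."""

    def fit(self, X, y):
        groups = {}
        for features, label in zip(X, y):
            groups.setdefault(label, []).append(features)
        self.means = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        def distance(features, mean):
            return sum((a - b) ** 2 for a, b in zip(features, mean))
        return [min(self.means, key=lambda lab: distance(f, self.means[lab])) for f in X]


class OneFeatureNeighbor:
    """Toy base learner: 1-nearest-neighbour on a single feature."""

    def __init__(self, index=0):
        self.index = index

    def fit(self, X, y):
        self.points = [(f[self.index], label) for f, label in zip(X, y)]
        return self

    def predict(self, X):
        return [min(self.points, key=lambda p: abs(p[0] - f[self.index]))[1] for f in X]


def k_fold_indices(n, k, seed=0):
    """Step 1: shuffle the sample indices and split them into k folds of similar size."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def stack_first_level(factories, X, y, X_test, k=5, seed=0):
    """Steps 2 and 3: out-of-fold predictions P and fold-averaged test predictions T'."""
    folds = k_fold_indices(len(X), k, seed)
    P = [[0.0] * len(factories) for _ in X]
    T_prime = [[0.0] * len(factories) for _ in X_test]
    for m, make_model in enumerate(factories):
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = make_model().fit([X[i] for i in train], [y[i] for i in train])
            # Join the predictions on the held-out fold into column m of P.
            for i, pred in zip(held_out, model.predict([X[i] for i in held_out])):
                P[i][m] = pred
            # Average the k predictions of the whole testing set into column m of T'.
            for row, pred in zip(T_prime, model.predict(X_test)):
                row[m] += pred / k
    return P, T_prime


# Hypothetical data: two features per image, labels are threshold values.
X = [[0.1 * i, 1.0 - 0.1 * i] for i in range(10)]
y = [150 if i < 5 else 190 for i in range(10)]
X_test = [[0.05, 0.95], [0.85, 0.15]]
P, T_prime = stack_first_level([NearestMean, OneFeatureNeighbor], X, y, X_test)
```

Every row of `P` is predicted by models that never saw that sample during training, which is what keeps the second-level model from simply learning the base models' training-set fit.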
The choice of the second-level model is equally important. Compared with other classifiers, logistic regression is the most common choice. To obtain the best performance, we also considered Softmax regression and RandomForest, the best-performing base classifier in Table 1, for comparison; the results are shown in Table 2. Table 2 shows that Softmax regression, with an RMSE of 6.47, performed better than logistic regression and RandomForest. Therefore, Softmax regression was chosen as the second-level model of the stacking ensemble.

2.4. Postprocessing. False-positive pixels could appear in the hand label maps predicted by the stacked model, so we extracted the hand area through region growing. The center of the image was taken as the seed, and the growth stopped at the edge of the hand shape. As a result, a clean mask could be created for the hand radiograph. The postprocessing of the hand mask is shown in Figure 7.
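The postprocessing step can be illustrated with a small region-growing sketch on a binary mask. The function name `grow_region`, the use of 4-connectivity, and the breadth-first traversal are assumptions made for this example; the text only specifies that the seed is the image center and that growth stops at the edge of the hand.

```python
from collections import deque


def grow_region(mask, seed=None):
    """Keep only the foreground pixels 4-connected to the seed (default: image center)."""
    height, width = len(mask), len(mask[0])
    if seed is None:
        seed = (height // 2, width // 2)
    cleaned = [[0] * width for _ in range(height)]
    if not mask[seed[0]][seed[1]]:
        return cleaned  # the seed lies on background, so there is nothing to grow

    queue = deque([seed])
    cleaned[seed[0]][seed[1]] = 1
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < height and 0 <= nc < width
            if inside and mask[nr][nc] and not cleaned[nr][nc]:
                cleaned[nr][nc] = 1
                queue.append((nr, nc))
    return cleaned


# Hypothetical thresholded mask: a central blob plus two isolated false positives.
noisy = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
clean = grow_region(noisy)
```

The two corner pixels are not connected to the central region and are therefore dropped, which mirrors how markers and noise away from the hand are removed from the mask.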

2.5. Evaluation Metrics. The objective evaluation of the proposed method relies on a series of quantitative parameters. Peak signal-to-noise ratio (PSNR) [31], structural similarity (SSIM) [32], Dice similarity coefficient (DSC), and Jaccard similarity index (JSI) [33] are commonly used to calculate the errors between the segmented images and the ground truth. PSNR and SSIM are image quality evaluation indexes, while DSC and JSI are segmentation accuracy assessment indexes.
PSNR can be computed as
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),$$
where MAX is the maximum possible pixel value and MSE is defined as
$$\mathrm{MSE} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left(S(i,j) - G(i,j)\right)^2,$$
where $S$ stands for the segmented image and $G$ for the ground truth of the segmented image, both of size $m \times n$.
SSIM, which measures the similarity of two images, is defined as
$$\mathrm{SSIM}(S, G) = \frac{(2\mu_S\mu_G + C_1)(2\sigma_{SG} + C_2)}{(\mu_S^2 + \mu_G^2 + C_1)(\sigma_S^2 + \sigma_G^2 + C_2)},$$
where $\mu_S$ and $\sigma_S^2$ are the mean and the variance of the segmented image, respectively. Likewise, $\mu_G$ and $\sigma_G^2$ are the mean and the variance of the ground truth mask, respectively, and $\sigma_{SG}$ is the covariance of the predicted mask and the ground truth mask. $C_1$ and $C_2$ are constants that keep the numerator and denominator stable.
DSC is defined as
$$\mathrm{DSC} = \frac{2\,|S \cap G|}{|S| + |G|},$$
and JSI is given by
$$\mathrm{JSI} = \frac{|S \cap G|}{|S \cup G|}.$$
Except for PSNR, the values of the metrics range from 0 to 1, where 1 indicates a perfect segmentation result.
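The four metrics can be computed directly from their standard definitions, as in the sketch below. Several details are assumptions for illustration: masks are stored as lists of rows with 0 for background and 255 for hand, SSIM is evaluated once over the whole image rather than over local windows, and the constants use the conventional choice C1 = (0.01 MAX)^2 and C2 = (0.03 MAX)^2, none of which is specified in the text.

```python
import math


def _flatten(image):
    return [float(value) for row in image for value in row]


def psnr(segmented, truth, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    s, g = _flatten(segmented), _flatten(truth)
    mse = sum((a - b) ** 2 for a, b in zip(s, g)) / len(s)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


def ssim_global(segmented, truth, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM computed from whole-image statistics (single window)."""
    s, g = _flatten(segmented), _flatten(truth)
    n = len(s)
    mu_s, mu_g = sum(s) / n, sum(g) / n
    var_s = sum((a - mu_s) ** 2 for a in s) / n
    var_g = sum((b - mu_g) ** 2 for b in g) / n
    cov = sum((a - mu_s) * (b - mu_g) for a, b in zip(s, g)) / n
    c1, c2 = (k1 * max_value) ** 2, (k2 * max_value) ** 2
    numerator = (2 * mu_s * mu_g + c1) * (2 * cov + c2)
    denominator = (mu_s ** 2 + mu_g ** 2 + c1) * (var_s + var_g + c2)
    return numerator / denominator


def dice(segmented, truth):
    """Dice similarity coefficient of two binary masks (nonzero means hand)."""
    s, g = _flatten(segmented), _flatten(truth)
    intersection = sum(1 for a, b in zip(s, g) if a and b)
    total = sum(1 for a in s if a) + sum(1 for b in g if b)
    return 1.0 if total == 0 else 2.0 * intersection / total


def jaccard(segmented, truth):
    """Jaccard similarity index of two binary masks (nonzero means hand)."""
    s, g = _flatten(segmented), _flatten(truth)
    intersection = sum(1 for a, b in zip(s, g) if a and b)
    union = sum(1 for a, b in zip(s, g) if a or b)
    return 1.0 if union == 0 else intersection / union


# Hypothetical masks: the prediction misses one of the four hand pixels.
truth = [[0, 255, 255], [0, 255, 255]]
predicted = [[0, 255, 255], [0, 0, 255]]
scores = {
    "PSNR": psnr(predicted, truth),
    "SSIM": ssim_global(predicted, truth),
    "DSC": dice(predicted, truth),
    "JSI": jaccard(predicted, truth),
}
```

For this toy pair the overlap is 3 pixels out of 3 predicted and 4 true hand pixels, so DSC is 6/7 and JSI is 3/4, which shows why JSI is always the stricter of the two overlap scores.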

Experiments and Results
A set of hand radiograph segmentation experiments was designed to verify the effectiveness of the proposed method. To ensure the fairness of the experiments, histogram equalization was applied to the training and testing datasets for all compared methods. All experiments were performed in a CPU environment with Python 3.6 and TensorFlow 1.11.0.

3.1. Datasets. In this study, a total of 600 hand radiographs with skeletal ages ranging from 1 to 18 years were included in the dataset. We randomly selected 500 images as the training set and 100 images as the testing set. The ground truth hand masks of all 600 images were manually labeled by professional radiologists. All data were collected from the radiology department of Children's Hospital Affiliated to Chongqing Medical University and anonymized.

3.2. Strategy Testing. To verify the effectiveness of stacking 5 independent classifiers, it is necessary to evaluate whether the number of models in the ensemble affects the accuracy of the stacked model. Therefore, we designed several experiments to measure the prediction accuracy and the inference time for ensembles of different sizes. For each ensemble size, we selected the best-performing classifiers as the base learners for stacking. For example, when the number was set to 2, the top two classifiers, RandomForest and ExtraTrees, were combined; when the number was set to 3, the RandomForest, ExtraTrees, and Bagging classifiers were selected as base learners; and so on. The results are detailed in Figure 8, where RMSE measures the prediction accuracy. As shown in Figure 8, both the fitting accuracy and the inference time were strongly affected by the number of component classifiers in the ensemble. As the number of classifiers increased, the prediction accuracy improved significantly, but it improved more slowly once the number exceeded five and even became worse when the number approached ten. Moreover, the inference time became longer as the number of component classifiers increased, especially beyond 5, and the 35-minute inference time for 5 classifiers was reasonable compared with the other configurations. Therefore, the ensemble of 5 classifiers for optimal segmentation threshold prediction based on stacking proposed in this research proved effective in terms of both model performance and computational complexity.
Figure 6: Ensemble learning using stacking technology based on 5-fold cross validation.
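For reference, the RMSE used to score threshold prediction in the strategy test has the standard form below. The symbols are assumptions introduced for this note: $t_i$ is the manually selected threshold label of test image $i$, $\hat{t}_i$ is the threshold predicted by the stacked model, and $N$ is the number of test images.

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{t}_i - t_i\right)^2}
\]
```

Under this reading, an RMSE of about 6.5 means that the predicted threshold typically deviates from the labeled one by roughly 6 to 7 gray levels.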

3.3. Qualitative Analysis. To verify the effectiveness of the proposed approach, we compared its performance with that of three representative traditional segmentation approaches from previous research, Otsu thresholding [34], K-means clustering [35], and GrabCut [36], as well as the U-Net network, which is the most common deep learning method for hand bone segmentation. We used Labelme, an open-source annotation tool, to make the ground truth images of the hand radiographs, and each image took approximately 3 minutes to delineate. U-Net was trained with the binary_crossentropy loss function and the Adam optimizer. We used 500 images to train the network for 20 epochs. The learning rate was set to 1e-4, and each step used a batch size of 2 images.
The segmentation results are shown in Figure 9. As can be seen from Figure 9(a), the classical traditional segmentation algorithms, Otsu, K-means, and GrabCut, resulted in undersegmentation of the phalanges, especially Otsu thresholding and K-means clustering. The hand masks predicted by the U-Net network, shown in Figure 9(b), were slightly worse than those of our method, because clean hand masks could not always be extracted from the label maps predicted by the network. Figure 9(c) demonstrates the effectiveness of the entire proposed segmentation pipeline. The hand masks were separated from the black background by applying the predicted optimal threshold and then extracting the connected region through region growing. We also cropped and resized the segmented images to 512 × 512, as shown in the last row. As a result, we obtained the final segmented hand radiographs using the generated clean hand masks.

3.4. Quantitative Analysis. As shown in Table 3, the proposed method outperformed the three representative traditional methods as well as the U-Net network in segmentation accuracy, achieving a DSC of 0.97 and a JSI of 0.93. The SSIM, which measures image quality, also showed the best result, with an average value of 0.97. Although the PSNR of our method, with an average value of 54.37 dB, was slightly lower than that of U-Net, which had the highest value of 55.92 dB, it was significantly better than those of the traditional methods Otsu, K-means, and GrabCut, with average values of 42.54, 41.62, and 46.87 dB, respectively. More importantly, in terms of time complexity, the U-Net network had the longest runtime of all the tests performed in a CPU environment, at 3400 minutes. The Otsu algorithm had the shortest runtime, 8 minutes, but its other index values were the least satisfactory. Consequently, the 20-minute computing time of our method was comparatively acceptable.
3.5. Impact of Hand Segmentation for BAA. To demonstrate that the proposed hand segmentation method can improve the accuracy of BAA, we chose VGG16, one of the most commonly used models in BAA research, as the training model for comparison. We labeled the bone ages of the 500 images in the dataset in years; hence, there were 19 classes overall. Because the dataset was small, weights pretrained on ImageNet were used to initialize the network, and VGG16 was then fine-tuned from these weights. We also used data augmentation, including rotation, translation, scaling, and shifting, with Keras 2.2.4. Softmax cross entropy was applied as the loss function, and the model was optimized with the Adam optimizer. The training data contained 90% of the original dataset, while the validation set contained the rest. The learning rate was set to 0.01, and each step used a batch size of 2 images. The bone age assessment results under the different configurations are shown in Figure 10. Compared with the BAA model built on the original images, the BAA model built on the segmented images improved RMSE by 13% on average: RMSE decreased from 2.12 years to 1.85 years, which suggests that the proposed hand segmentation method can effectively improve the accuracy of BAA. We expect that the accuracy improvement brought by hand segmentation will become more apparent with more hand radiograph images.
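The reported improvement can be checked from the two RMSE values given above, assuming it is computed as the relative reduction in RMSE with respect to the model trained on the original images:

```latex
\[
\frac{\mathrm{RMSE}_{\text{original}} - \mathrm{RMSE}_{\text{segmented}}}{\mathrm{RMSE}_{\text{original}}}
= \frac{2.12 - 1.85}{2.12}
= \frac{0.27}{2.12}
\approx 0.127
\]
```

That is, a reduction of about 12.7%, which rounds to the 13% quoted in the text.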

Discussion
We have proposed an effective method for hand radiograph segmentation with good adaptability and generalization. We find that (1) the proposed approach outperforms commonly used traditional methods and the state-of-the-art U-Net architecture on a small dataset, and (2) hand segmentation can effectively improve the prediction accuracy of bone age assessment. To this end, various experiments were carried out to validate the effectiveness and practicability of our method. In the strategy testing, shown in Figure 8, emphasis was placed on the number of component classifiers, which governs both the execution speed and the generalization capacity of the ensemble. The predictive ability of a single model is not as strong as that of an ensemble. When the number of component classifiers was set to 5, the RMSE was 6.47; increasing the number to 7 lowered the RMSE from 6.47 to 6.39, whereas setting it to 10 raised the RMSE to 6.51. Therefore, model performance improved only slowly when the ensemble number exceeded 5 and declined (the RMSE increased) when the number exceeded 7. An excessively large ensemble might cause generalization error and overfitting. Moreover, the inference time with 10 classifiers was nearly 9 times that with 5 classifiers. Therefore, both the segmentation accuracy and the execution speed should be taken into consideration in ensemble learning. Figure 9 and Table 3 compare the segmentation results of the proposed method with those of the other methods. The proposed method performs better in hand bone segmentation, with robust and the highest segmentation accuracy. The traditional methods, though simple, gave unsatisfactory segmentation accuracy. Compared with the U-Net network, our method had the advantage in both segmentation accuracy and time complexity. On a CPU, U-Net took about 170 times the computing time of our method (3400 versus 20 minutes). As for the impact of hand segmentation on BAA, the RMSE was clearly better with the hand-segmented images, which suggests that the hand image segmentation step is important for the generalizability of the BAA model.
In short, the traditional segmentation methods, with their weak robustness and low precision, are not suitable for hand mask segmentation. Although U-Net is unstable in recognizing the edge of the hand and requires a great deal of training time, it is still the most popular technique for many segmentation tasks. However, deep learning requires the massive and complicated task of manually annotating ground truth images for model training, and repeated training is required to get better prediction results from a small dataset. More importantly, deep neural networks require powerful computing hardware, such as a GPU. By contrast, our method can be trained on a small dataset at an acceptable computational cost on a CPU. In terms of both the quality and the accuracy of the segmented images, our method obtained satisfactory results that are clearly superior to those of the traditional segmentation methods and the U-Net network.
The study still has some limitations even though good segmentation results were obtained. The input features for the multiple classifiers (the normalized histogram, mean value, and variance of each image) could be further optimized to improve the fitting ability of the model and boost efficiency. In addition, our experiments were conducted only on hand radiographs, and other types of images could be used to test the generalization ability of the method. Furthermore, some special hand radiographs with variable collimation configurations, digitized from traditional film Digital Radiography (DR), could not be segmented satisfactorily by either our method or deep learning. These issues will be the main topics of our future work.

Conclusions
In this work, we have proposed an automatic hand radiograph segmentation method based on ensemble learning with multiple classifiers, which can effectively improve the overall performance of BAA. We converted the search for the optimal threshold into a classification task, and ensemble learning with a 5-fold stacking strategy was used to train the classification model. The experimental results and analysis demonstrate that the proposed method greatly improved the prediction of the optimal segmentation threshold, resulting in better accuracy for hand mask segmentation on a small dataset, which makes it more effective in clinical application.

Data Availability
All hand radiographs used in this work are available from the corresponding author on request.

Conflicts of Interest
The authors declare no conflict of interest.

Acknowledgments
We would like to acknowledge the Department of Radiology, Children's Hospital Affiliated to Chongqing Medical University, and its radiologists. We are grateful to the doctors for freely contributing the X-ray hand-wrist images and for their professional guidance, which enabled this research to proceed successfully. This research was