Bone age assessment (BAA) is an essential task in the clinical practice of evaluating the biological maturity of children. Because the manual method is time-consuming and prone to observer variability, it is attractive to develop computer-aided, automated methods for BAA. In this paper, we present a fully automatic BAA method. To eliminate noise in a raw X-ray image, we start by using U-Net to precisely segment the hand mask from the raw radiograph. Although U-Net can perform the segmentation with high precision, it requires a large annotated dataset. To alleviate the annotation burden, we propose to use deep active learning (AL) to deliberately select the unlabeled samples that carry the most information. These samples are given to the Oracle for annotation and are then used for subsequent training. Starting from only 300 manually annotated images, the improved U-Net within the AL framework can robustly segment all 12,611 images in the RSNA dataset; the AL segmentation model achieved a Dice score of 0.95 on the annotated test set. To optimize the learning process, we employ six off-the-shelf deep Convolutional Neural Networks (CNNs) with weights pretrained on ImageNet and use them to extract features from the preprocessed hand images via transfer learning. Finally, a variety of ensemble regression algorithms are applied to perform BAA. In addition, we identify which CNN is best suited to feature extraction and explain why. Experimental results show that the proposed approach achieves a discrepancy between manual and predicted bone age of about 6.96 and 7.35 months for the male and female cohorts, respectively, on the RSNA dataset. These accuracies are comparable to state-of-the-art performance.
Bone age assessment (BAA) may provide important clinical information for skeletal maturation estimation, especially for the diagnosis of endocrinological problems and growth disorders [
As is well known, deep learning has been applied to computer vision tasks and has achieved dramatic performance improvements. In this paper, we propose a method that learns the latent features of hand X-ray images and exploits them to perform BAA. At the beginning of the pipeline, we train U-Net networks, within an active learning framework, to precisely segment the hand from the radiograph and eliminate the irrelevant content of the raw X-ray image. Then, we use pretrained deep Convolutional Neural Networks (CNNs) to extract high-level features via transfer learning. After that, ensemble learning with different base regressors is employed to perform BAA. Finally, we evaluate the overall performance of our approach across different models. The proposed pipeline is shown in Figure
Overview of the proposed automated BAA deep learning pipeline.
The conventional BAA approaches can be categorized into GP and TW methods. Traditional machine learning methods have also been applied to BAA, such as the support vector machine (SVM) [
Recently, motivated by the success of deep Convolutional Neural Networks (DCNNs) in image classification, studies in medical imaging have been exploring such methods. Rucci et al. use an attention focuser and a bone classifier within a neural network to extract features of the carpal bones and perform BAA [
Although some methods yield very accurate results, most existing methods suffer from two main limitations:
(i) Most of the above methods operate on a coarse segmentation of the hand image, which might mislead BAA toward focusing on irrelevant ROIs.
(ii) Most of the proposed approaches use hand-crafted features, such as HOG, LBP, and Haar features, thus constraining the regressor or classifier to low-level X-ray image features rather than higher and deeper latent features. This semantic gap limits the generalization capability of BAA systems.
To perform BAA, we first extract a precise region of interest (ROI), a hand mask, from the raw X-ray image. We then remove all irrelevant objects which may mislead model training. It is necessary to establish a nonlinear mapping from original X-ray images to hand ROIs for eliminating noise in raw X-ray images. Recently, deep learning solutions have been successfully used in a multitude of medical image semantic segmentation tasks [
In this paper, we propose a new framework for hand radiograph segmentation using the AL strategy with a limited amount of labeled training data. The flowchart of the proposed framework is shown in Figure
Overview of proposed deep AL framework for hand radiograph segmentation.
The main hypothesis of the AL framework is that active learning can judge which data contain the most information and should therefore be annotated by the Oracle. The process of AL is like a pupil learning a curriculum: along the way, the pupil spontaneously determines which samples are hard and asks the teacher about those, rather than about the samples that are already well understood. In this setting, AL does not require a human to annotate all the training data, but only the data that are most uncertain during training.
In practice, query by committee (QBC) is a common strategy in the field of AL [
Now, we define an uncertainty measure based on the level of disagreement within the committee. Since we train a set of committee members, each member can extract features from the unlabeled data. We use a different random seed to initialize the parameters of each member, so that in each training iteration the features extracted by the members differ. In practice, we flatten each feature map into a vector, and the feature similarity between two members is defined as follows:
The unlabeled datum with the lowest feature similarity carries the most significant information and is expected to be the most helpful for model training. Therefore, the ground truth of such an unlabeled datum is annotated by the Oracle and then added to the labeled dataset for the following training epochs.
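As a concrete illustration of this query step, the sketch below scores each unlabeled image by the average pairwise similarity of the feature vectors produced by the committee members and selects the images with the lowest agreement. The helper names and the choice of cosine similarity are our own assumptions for illustration, not necessarily the exact measure defined above.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two flattened feature vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def committee_agreement(features_per_member):
    """Average pairwise similarity of the features extracted by all members
    for one unlabeled image; low agreement means high uncertainty."""
    sims, m = [], len(features_per_member)
    for i in range(m):
        for j in range(i + 1, m):
            sims.append(cosine_similarity(features_per_member[i], features_per_member[j]))
    return float(np.mean(sims))

def select_most_uncertain(unlabeled_features, k=10):
    """unlabeled_features: one list of per-member feature vectors per image.
    Returns the indices of the k images with the lowest agreement."""
    scores = [committee_agreement(f) for f in unlabeled_features]
    return list(np.argsort(scores)[:k])
```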
As demonstrated in [
The overview of hand segmentation AL model.
In Figure
In addition, we optimize the loss function used to train each committee member, i.e., each U-Net. In image segmentation, a pixel-wise loss function is usually used to penalize the distance between the ground truth and the predicted probability map. We define the pixel-wise loss function with a cross-entropy formulation:
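In the standard formulation (the notation here is ours, given for illustration), with \(g_i \in \{0,1\}\) the ground-truth label of pixel \(i\), \(p_i\) the predicted foreground probability, and \(N\) the number of pixels in an image, the pixel-wise cross-entropy loss is

\[
\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,g_i \log p_i + (1 - g_i)\log(1 - p_i)\,\Big].
\]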
In this section, we summarize the algorithm for hand image segmentation.
Input: a small initial labeled set and the pool of unlabeled hand radiographs.
Output: a committee of trained U-Nets and hand masks for all radiographs.
Repeat:
1. Train every U-Net in the committee on the current labeled set.
2. Compute the uncertainty of each unlabeled datum from the disagreement between the U-Nets and select the data with the largest uncertainty.
3. Have the Oracle annotate the selected data and add them to the labeled set.
Until: the hand segmentation is satisfactory on the remaining images.
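The loop below is a minimal sketch of this procedure. The helpers train_committee, extract_features, oracle_annotate, and segmentation_is_satisfactory are hypothetical placeholders for the U-Net training, feature extraction, manual annotation, and stopping-check steps; the selection criterion reuses the disagreement score sketched earlier.

```python
def active_learning_segmentation(labeled, unlabeled, committee,
                                 epochs_per_round=1, query_size=10, max_rounds=20):
    """Query-by-committee active learning loop for hand mask segmentation.
    `labeled` and `unlabeled` are image collections; `committee` is a list of
    U-Nets initialized with different random seeds (helpers are hypothetical)."""
    for _ in range(max_rounds):
        # 1. Train every committee member on the current labeled set.
        train_committee(committee, labeled, epochs=epochs_per_round)

        # 2. Measure disagreement between members on the unlabeled pool.
        pool_features = [[extract_features(m, x) for m in committee] for x in unlabeled]
        query_idx = select_most_uncertain(pool_features, k=query_size)

        # 3. Ask the Oracle to annotate the selected images and move them
        #    into the labeled set for the next round.
        for i in sorted(query_idx, reverse=True):
            labeled.append(oracle_annotate(unlabeled.pop(i)))

        if segmentation_is_satisfactory(committee, unlabeled):
            break
    return committee
```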
Even though CNNs are more commonly used for image classification, BAA is in fact a regression task. Indeed, the essence of a CNN is to extract features at different levels with various convolutional filters. The extracted features are usually fed into several Fully Connected (FC) layers followed by a softmax classifier to classify the input images. Inspired by this classification model, we aim to use deep CNNs together with traditional regression algorithms to perform BAA.
The key point in BAA is to extract distinctive features from the preprocessed hand images. Usually, a large training dataset is necessary to fine-tune a deep CNN. However, the RSNA dataset provides only 12,611 images, a tiny amount of data compared with the ImageNet dataset, which contains nearly 15 million images. Consequently, training a high-level feature extractor on the RSNA BAA dataset alone is difficult.
In this situation, we use transfer learning and a variety of models with pretrained weights to acquire features. Transfer learning has previously been applied to datasets that are similar to the large-scale ImageNet dataset, such as [
Since the CNN was first proposed, researchers have designed various deep CNN architectures. Several state-of-the-art examples are VGG-16 [
By using transfer learning, the high-level feature maps, i.e., a high-level 3-dimensional tensor, of a hand radiograph can be acquired from the last convolutional layer of the network. We use Global Average Pooling (GAP) to flatten the feature maps into a 1-dimensional vector, and this vector constitutes the high-level feature of the image.
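A minimal Keras sketch of this feature-extraction step is shown below, assuming TensorFlow/Keras and Inception-V3 as the backbone (any of the other pretrained CNNs could be substituted); the input size and function name are illustrative.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Pretrained backbone without the classification head; pooling="avg" applies
# Global Average Pooling to the last feature maps, giving one vector per image.
backbone = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_hand_features(images):
    """images: array of segmented hand radiographs resized to 299x299x3."""
    x = preprocess_input(np.asarray(images, dtype=np.float32))
    return backbone.predict(x, verbose=0)   # shape: (n_images, 2048)
```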
After extracting the features of hand radiographs, we decompose them into 2-dimensional features by incremental PCA [
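As an illustration of this decomposition step, a short sketch using scikit-learn's IncrementalPCA (the batch size is an assumption):

```python
from sklearn.decomposition import IncrementalPCA

def decompose_to_2d(features, batch_size=256):
    """Reduce high-dimensional CNN features to 2-D points for visualization."""
    ipca = IncrementalPCA(n_components=2, batch_size=batch_size)
    return ipca.fit_transform(features)
```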
Visualizing features from different models and different decomposing methods.
In Figure
From the first column in Figure
A further conclusion is that a linear kernel function may differentiate the data better than an RBF kernel function.
With the analysis in Section
We obtained the hand bone radiographs from the 2017 Pediatric Bone Age Challenge organized by the Radiological Society of North America (RSNA) [
Bone age distribution of (a) full dataset, (b) male, and (c) female. The unit of the horizontal axis is the month.
The X-ray data provided by RSNA vary considerably in intensity, contrast, and brightness. A randomly selected part of the dataset is shown in Figure
A close-up of part of data in RSNA dataset. Different radiographs vary in size and height-width ratio.
Taking the computational capability and memory of our hardware into consideration, we set the committee size to
In practice, in addition to the initial 100 annotated hand radiographs, we annotated another 200 images within the first 20 training epochs: after every training epoch, we annotated 10 radiographs and added them to the training dataset. We then trained the committee for another 80 epochs. The loss converged to a satisfactory value, and we visually inspected all the predicted masks and kept all of them. The segmentation results are shown in Figure
Examples at each stage of preprocessing in the segmentation pipeline.
Comparison of model performance for hand segmentation.
Strategy | Number of annotated samples | Sensitivity | Specificity | Dice
---|---|---|---|---
FSL | 300 | 0.869 | 0.854 | 0.869
AL (… | 150 | 0.864 | 0.845 | 0.863
AL (… | 200 | 0.902 | 0.895 | 0.905
AL (… | 300 | 0.903 | 0.942 | 0.939
AL (… | 150 | 0.896 | 0.909 | 0.888
AL (… | 200 | 0.904 | 0.925 | 0.916
AL (… | 300 | 0.935 | 0.946 | 0.931
AL (… | 150 | 0.879 | 0.902 | 0.899
AL (… | 200 | 0.932 | 0.934 | 0.926
AL (… | 300 | | |
From Table
From Figure
With the analysis in Section
We use Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Concordance Correlation Coefficient (CCC) to evaluate the proposed methods. The MAE and RMSE intuitively represent the distance between the real and the predicted bone age (lower is better). The CCC evaluates the correlation between the real bone age and the prediction (higher is better) better than
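For reference, with \(y_i\) the ground-truth bone age, \(\hat{y}_i\) the prediction, and \(n\) the number of test images, the three metrics follow their standard definitions:

\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad
\mathrm{CCC} = \frac{2\rho\,\sigma_{y}\sigma_{\hat{y}}}{\sigma_{y}^{2} + \sigma_{\hat{y}}^{2} + \left(\mu_{y} - \mu_{\hat{y}}\right)^{2}},
\]

where \(\mu\), \(\sigma\), and \(\rho\) denote the means, standard deviations, and Pearson correlation of the two series.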
Performance of different regression methods on different transferred data.
Model | Sex | Inception-V3 | | | Xception | | | Inception-ResNet-V2 | |
---|---|---|---|---|---|---|---|---|---|---
 | | MAE | RMSE | CCC | MAE | RMSE | CCC | MAE | RMSE | CCC
SVR (linear) | All | 16.4688 | 21.1794 | 0.7139 | 15.6739 | 20.3728 | 0.7029 | 14.2175 | 18.0785 | 0.7143
 | Male | 12.8732 | 17.7263 | 0.5987 | 11.9983 | 13.2222 | 0.6319 | 11.7378 | 14.8372 | 0.6417
 | Female | 13.2739 | 17.9381 | 0.6163 | 13.6930 | 14.8271 | 0.6184 | 13.0116 | 17.3823 | 0.6371
KRR (linear) | All | 15.1232 | 18.2813 | 0.7004 | 15.2830 | 17.7362 | 0.7793 | | |
 | Male | 13.0293 | 14.2521 | 0.6313 | 12.2321 | 14.9382 | 0.6098 | | |
 | Female | 14.7421 | 19.0855 | 0.6277 | 13.3361 | 17.3211 | 0.6176 | | |
From Table
To enhance model performance, we employed ensemble learning to lower the regression error further. Ensemble modeling is a powerful way to improve the performance of a model with limited generalization by combining a diverse set of learners and adjusting the data weights during training. From Table
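A minimal scikit-learn sketch of this ensemble regression step is given below; the base learners and hyperparameters are illustrative assumptions rather than the exact configuration used in our experiments.

```python
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def fit_ensemble_regressors(train_features, train_ages, test_features, test_ages):
    """Fit AdaBoost and Bagging regressors on CNN features; report MAE in months."""
    models = {
        "AdaBoost": AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                                      n_estimators=200, random_state=0),
        "Bagging": BaggingRegressor(DecisionTreeRegressor(max_depth=8),
                                    n_estimators=200, random_state=0),
    }
    results = {}
    for name, model in models.items():
        model.fit(train_features, train_ages)
        results[name] = mean_absolute_error(test_ages, model.predict(test_features))
    return results
```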
Performance of different ensemble regression methods on data transferred by Inception-ResNet-V2.
In Figure
Performance of different ensemble regression methods.
Ensemble method | Dataset | MAE | CCC
---|---|---|---
AdaBoost | All | 9.31 (21) | 0.94 (14)
 | Male | 7.62 (14) | 0.94 (17)
 | Female | 7.60 (19) | 0.95 (19)
Bagging | All | |
 | Male | |
 | Female | |
Table
Comparison of approaches in BAA in RSNA dataset.
Method | MAE (m)
---|---
Iglovikov et al. [ | 8.08
Iglovikov et al. [ | 7.52
Wu et al. [ | 7.38
Han et al. [ | 8.40
Tajmir et al. [ | 7.93
Proposed |
Using our proposed BAA approach, we achieved an MAE of 8.59, 6.96, and 7.35 months on the all, male, and female cohorts of the dataset, respectively.
Since AL queries unlabeled data and asks the Oracle to annotate them, the amount of labeled training data grows under the AL strategy, and more labeled data benefits the training of the neural networks. More importantly, AL tends to pick the most uncertain and informative data for the next training epoch, so the active learner concentrates on the most crucial data during training. In essence, AL boosts the training process so that the trained model reaches a better solution.
A further significant contribution is the proposed framework for medical image segmentation that relieves the annotation burden on human experts via deep active learning. Differences between the feature vectors of the committee members are taken into consideration; the members work cooperatively to determine which datum is crucial for the training procedure and then ask the Oracle to annotate it. In the segmentation stage, benefitting from deep active learning, we annotated only 300 images, about 2.3% of the whole dataset, to obtain precise hand segmentation.
With the annotated hand X-ray images, our results support the finding by others demonstrating the effectiveness and applicability of transferring deep-learning weights to data from different domains [
Although the proposed BAA approach achieved state-of-the-art performance, there are also limitations and several issues that need to be discussed:
Number of members in the committee. Although we found the model performance of segmentation networks are enhanced with the increment of the number of members in the committee, the number of members is hard to determine. Besides, we did not ensemble the well-trained active learners to inference the segmentation results simultaneously The computational complexity of the proposed model. As demonstrated in [
In this paper, we have investigated the application of deep transfer learning to medical images, especially for automated bone age assessment using hand radiographs. We tested several popular off-the-shelf deep CNNs on the RSNA dataset of 12,611 X-ray images and showed that transfer learning can cope effectively with the bone age assessment task. By using an ensemble technique, our model achieved an MAE of 8.59, 6.96, and 7.35 months on the all, male, and female cohorts of the dataset, respectively, which is comparable to state-of-the-art performance. Furthermore, we explained which pretrained CNN is better suited to BAA.
In summary, we have created a fully automated, deep learning-based pipeline that detects and segments the hand and wrist, standardizes the images, and performs BAA with pretrained deep CNNs and an efficient regression model. In practice, our system can easily be deployed in a clinical environment on a computer with a single GPU.
The investigation presented in this paper leaves many challenges and issues for future research. We summarize the future work as follows:
(i) The proposed BAA framework, which contains image segmentation, feature extraction, and ensemble modules, should be validated on other medical image decision problems.
(ii) The effectiveness of the AL framework should be proved theoretically; only then can we determine how many active learners are enough to form a committee.
(iii) The well-trained active learners should be ensembled to generate the segmentation results simultaneously with AdaBoost or other ensemble learning algorithms.
The X-ray imaging data used to support the findings of this paper have been deposited in the RSNA repository at doi:
The authors declare no conflict of interest.