CNN-Based Medical Ultrasound Image Quality Assessment

The quality of an ultrasound image is key information in medical applications. It is also an important index for evaluating the performance of ultrasonic imaging equipment and image processing algorithms. Yet there is still no recognized quantitative standard for medical image quality assessment (IQA), because IQA is traditionally regarded as a subjective issue, especially in the case of medical ultrasound images. As such, medical ultrasound IQA based on a convolutional neural network (CNN) is quantitatively studied in this paper. Firstly, a dataset with 1063 ultrasound images is established by degrading a certain number of original high-quality images. Subsequently, the dataset is scored and screened for abnormal values. Then, 478 ultrasonic images are selected as the training and testing examples. The label of each example is obtained by averaging the scores of different doctors. Afterwards, a deep CNN and a residual network are used to establish the IQA models. Meanwhile, a transfer learning strategy is introduced to accelerate the training and improve the robustness of the models, considering that ultrasound image samples are not abundant. At last, several tests are conducted to evaluate the IQA models. They show that CNN-based IQA is feasible and effective.


Introduction and Motivation
Image quality assessment (IQA) quantitatively evaluates an image, and it remains a hot topic in the image processing field because it serves as a benchmark for image processing systems and algorithms Tang et al. [1], Kim et al. [2], and Ma et al. [3]. Like many pattern recognition problems, IQA tries to simulate the human perception process, which is easily influenced by image content, the mathematical and psychological state of the observer, and many other complex factors Krasula et al. [4]. So far, most IQA methods and studies focus on optical images rather than medical images. One reason is that medical image quality is highly related to its specific application. For example, a medical image may have overall low contrast and be noisy, but it is still acceptable to the doctor if it is effective for judging the state of a certain tissue. Another is that medical images, including MR, CT, and US, usually contain artifacts caused by tissue movement during imaging or by the scattering of beams Krupa and Bekiesińska-Figatowska [5], Boas and Fleischmann [6], and Prabhu et al. [7]. Most artifacts cannot be removed. In fact, doctors can grasp useful information from noisy images; that is, a noisy medical image does not always mean low quality. By contrast, noisy optical images are usually seen as low-quality images. Thus, medical image assessment should be conducted from a different viewpoint. Traditional IQA can be divided into two kinds: subjective assessment and quantitative assessment Hemmsen et al. [8], Kang et al. [9], Bosse et al. [10], and Kim and Lee [11]. In the former, the image is scored by observers. According to whether IQA relies on another image, subjective assessment can be divided into single and double excitation cases. In double excitation assessment, the observers score the image after observing both the considered image and a related high-quality one.
This implies that the considered image has comparably lower quality. The International Telecommunications Union (ITU) provided a standard for double excitation IQA [12], and it has been widely used in fast MR imaging; Shiao et al. [13], Loizou et al. [14], and Hemmsen et al. [8] attempted to apply it to ultrasound IQA. In single excitation assessment, the observers score the image relying only on their experience, so it is adaptable to more situations. For a given image, the scores given by different observers may differ somewhat, so a certain number of observers should take part in the assessment. By comparison, quantitative assessment scores the image by automatically computing some indexes for the considered image.
Generally, the quantitative assessment algorithms can be divided into three kinds: full reference, reduced reference, and no reference IQA Kang et al. [9] and Zhu et al. [15]. Full reference IQA relies on the original high-quality image when evaluating the considered image. Reduced reference IQA relies on part of the high-quality image. There is no reference at all for no reference IQA, which is also named Blind Image Quality Assessment (BIQA) Kim and Lee [11], Ma et al. [16], and Ma et al. [3]. Generally speaking, current quantitative image assessment algorithms mostly belong to full reference and reduced reference assessment. Traditional image quality indexes include the peak signal-to-noise ratio (PSNR), mean squared error (MSE), and structural similarity index (SSIM), all of which need the original high-quality image Bianco et al. [17]. Besides, some indexes based on image gradient maps are also considered. Human visual system- (HVS-) based methods transfer the image into a different space and simulate the responses of human visual cortex neurons to low-quality and high-quality images Litjens et al. [18]. The perceptual difference model (PDM) Daly [19] is widely used in medical image quality assessment; it models the ability of humans to perceive a visual difference between a degraded "fast" MRI image with subsampling of k-space and a "gold standard" image mimicking full acquisition Huo et al. [20]. Mittal et al. estimate the possible information loss between the considered image and the original high-quality one on the basis of normalized intensity coefficients in the spatial domain [21]. Full or reduced reference IQA relies on such information to judge the dissimilarity between high-quality and low-quality images, and it is effective. However, in most situations, no reference is available.
Former research studies in no reference IQA mainly dealt with special image deformations such as noise, blur, and image compression. These methods perform feature detection and statistical computation on the considered image, which is computationally complex. Woodard evaluated the MR image according to the variances of the considered image and its degraded one. Mortamet et al. evaluated the MR image relying on the air background of the image, because about 40% of a structural brain MR image is air background Mortamet et al. [22]. Nakhaie and Shokouhi evaluated the image through the wavelet transform [23]. Recently, Eck provided a new rule.
That is, a good-quality image is one that effectively helps the doctor detect changes in tissue Eck et al. [24]. Thus, the standard of IQA is whether the image enables the doctor to make an accurate judgement.
In recent years, CNNs have been used in IQA to simulate the process of human evaluation of optical images, inspired by the success of CNNs in image processing and pattern recognition De Angelis et al. [25] and Bosse et al. [10]. Kang first proposed no reference IQA on the basis of a CNN, in which the CNN is trained using the samples in the LIVE IQA database Kang et al. [9], Zhang et al. [26], and Zhang et al. [27]. Later, Bianco proposed DeepBIQ, which trains a classification CNN on another kind of data and then transfers the network to image quality assessment [17]. Kim et al. estimated the quality of an image by adding up the scores of its patches on the basis of a CNN without any reference. After that, they designed a deep CNN IQA model that contains two steps: error map learning and score estimation Kim et al. [2]. These methods attempt to replace the human visual perception process with a CNN trained on a certain number of labeled images. To the best of our knowledge, the related reports are mainly about MR or CT IQA. So far, few researchers have used deep learning to assess the quality of ultrasound images. In Wu et al.'s work [28], a computerized fetal US image quality assessment scheme based on two deep CNNs is proposed to assist the implementation of US image quality control in the clinical obstetric examination. It is adapted to a special application, whereas we intend to provide a universal method to evaluate image quality. The remainder of this paper is organized as follows. Section 2 introduces the way the training samples are collected. The CNN is designed in Section 3, and the results are shown in Section 4. Finally, we conclude in Section 5.

Optical Image Samples.
To train an IQA CNN, a large number of image samples must first be collected. In optical image quality assessment, LIVE IQA [23] is one of the most popular and widely used datasets. It contains 29 original high-resolution reference images. Each original image is degraded into several low-quality images through JPEG compression, Gaussian blurring, fast Rayleigh fading, or the addition of white noise. Then, a total of 982 images including the original ones are obtained. Each of these images is scored by at least 20 observers, and the average score for each image is taken as its label. In our algorithm, the images in LIVE are used to pretrain the IQA CNN. Afterwards, the CNN is fine-tuned with ultrasound images.

The Ultrasound Samples.
The ultrasound image samples in this paper come from two sources. The first is downloading images from publicly accessible websites. The other is collecting 700 ultrasound images from Tongji Hospital, affiliated with Huazhong University of Science and Technology. These images were captured in the departments of gynaecology, ophthalmology, internal medicine, and surgery and were strictly screened by experienced doctors.
Then, 95 high-quality ultrasound images remained, corresponding to tissues of the liver, lymph node, bladder, breast, kidney, heart, blood vessel, pancreas, and uterus. All images are in 8-bit bmp format. Some of the original high-quality ultrasonic images are shown in Figure 1.

Complexity
In order to create an ultrasound image database containing different quality levels, and similar to many sample-generation methods, we degrade the high-quality ultrasound images through JPEG compression, Gaussian blurring, and the addition of white and speckle noise. Speckle noise is common in ultrasound images; the model in Li et al.'s work [29] is used to generate it because this model has been validated to describe speckle noise effectively. Meanwhile, the commonly used degeneration methods are also applied to generate the low-quality images. Specifically, the function "imwrite" in Matlab is used to perform JPEG compression, in which the quality factor is a random integer between 0 and 100. When adding white noise, the original image is first normalized to [0, 1]. Then, Gaussian noise with standard deviation δ (0 < δ < 2) is added.
Afterwards, the noisy image is restored to the range 0–255. In Gaussian blurring, a 7 × 7 template is used to convolute the original image; the standard deviation of the Gaussian template is randomly selected between 0 and 1. In generating the speckle-noisy images, we used the exponential-form noise model in Li et al.'s work [29],

v(x) = u(x) + u(x)^c η(x),

where u(x) and v(x) are the original and observed images, respectively, and η(x) ∼ N(0, δ²) is Gaussian distributed noise whose mean is zero and whose standard deviation is randomly chosen between 0 and 6. The simulation effect is better when c = 5. In generating the speckle noise images, 7 different standard deviations are used. For each of the remaining three methods, 6 levels of degeneration are applied to the original images. For each method, we randomly choose half of the 95 images to degrade. At last, 1063 ultrasound images are generated.
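The degradation steps above can be sketched in a few NumPy functions. This is a minimal illustration only: the function names are placeholders, the exponential-form speckle model v = u + u^c η is taken as described above (the exact form follows [29]), and the loop-based blur stands in for whatever convolution routine the authors used in Matlab.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma):
    """Normalize to [0, 1], add zero-mean Gaussian noise, restore to 0-255."""
    u = img.astype(np.float64) / 255.0
    v = u + rng.normal(0.0, sigma, size=u.shape)
    return np.clip(np.round(v * 255.0), 0, 255).astype(np.uint8)

def add_speckle_noise(img, sigma, c=5):
    """Signal-dependent speckle sketch: v = u + u^c * eta (model from [29])."""
    u = img.astype(np.float64) / 255.0
    eta = rng.normal(0.0, sigma, size=u.shape)
    v = u + np.power(u, c) * eta
    return np.clip(np.round(v * 255.0), 0, 255).astype(np.uint8)

def gaussian_blur(img, sigma, size=7):
    """Convolve with a size x size Gaussian template (edge-padded, direct loop)."""
    ax = np.arange(size) - size // 2
    g = np.exp(-ax**2 / (2 * sigma**2))
    kernel = np.outer(g, g)
    kernel /= kernel.sum()
    pad = size // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + size, j:j + size] * kernel)
    return np.clip(np.round(out), 0, 255).astype(np.uint8)
```

In practice the degradation level (quality factor, δ, blur σ) would be drawn at random per image, as described in the text.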

Scoring the Ultrasound Samples.
Four doctors from Tongji Hospital scored the acquired ultrasound images with the single excitation method. This choice follows from two considerations: one is that the number of images to be scored is relatively large, and double excitation is more prone to visual fatigue; the other is that double excitation only fits limited situations. The four doctors all major in biomedical imaging. The scores range from 0 to 100. It should be mentioned that the images are shuffled before scoring to keep the scoring objective. After obtaining the scores of all images, we remove samples with inconsistent scores. For each image, the absolute deviation between each score and the mean of the four observers' scores can easily be computed. When a deviation is greater than a given threshold, the sample is regarded as an outlier and discarded. For accepted samples, the mean opinion score (MOS) value is taken as the label. Clearly, different thresholds lead to different numbers of samples: the lower the threshold is, the more consistent the samples' scores are, but the fewer samples remain. In our experiment, the threshold is set to 10, which balances the consistency of the samples' scores against the number of samples. After scoring, 478 images remained.

Outliers' Screening and Distribution Balance.
Based on the simple selection of abnormal samples in Section 2.3, a CNN is used to conduct further outlier screening on the basis of the principle of random consistency. The screening process is shown in Figure 2; the network mentioned in the flowchart is described in Section 3. The basic idea is first to randomly select some images in the database to train an initial evaluation model. Then, we input all data into the model to predict score values and detect abnormal samples according to the difference between the predicted score and the actual label. This process (Figure 2, dotted box) is carried out several times, so each image obtains multiple groups of predicted values. If an image shows anomalies across multiple predicted values, it is screened out. The above screening process is repeated several times, and 78 images are discarded.
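The random-consistency rule can be sketched as below. The predictor functions stand in for CNN models trained on different random subsets of the database, and the tolerance `tol` and the flag count `min_flags` are illustrative parameters not specified in the text.

```python
def consistency_screen(labels, predict_fns, tol=15.0, min_flags=2):
    """Random-consistency screening sketch: each model in `predict_fns`
    predicts every image's score; an image flagged as inconsistent
    (|prediction - label| > tol) by at least `min_flags` models is discarded."""
    kept = {}
    for image_id, label in labels.items():
        flags = sum(1 for f in predict_fns if abs(f(image_id) - label) > tol)
        if flags < min_flags:
            kept[image_id] = label
    return kept
```

Repeating this loop with freshly trained models, as in Figure 2, progressively removes samples whose labels no model can reproduce.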
After scoring and outlier screening, the distribution of the average subjective scores of the remaining images is not balanced, so a distribution balance method is needed. In this paper, image expansion is performed on the intervals with fewer samples by rotating the original images at three angles (90°, 180°, and 270°); the rotated images take the labels of the corresponding unrotated images. The image dataset after rotation expansion contains 478 ultrasound images. It should be mentioned that the subjective evaluation is conducted under a common electric incandescent lamp, the monitor is a 4K LED monitor, and the viewing distance is about sixty centimeters. In short, the subjective evaluation is conducted in a common office environment by the experienced doctors.
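The rotation-based expansion can be sketched with NumPy. In practice only the under-represented score intervals are expanded; for brevity this sketch (with a hypothetical function name) rotates every input and copies its label.

```python
import numpy as np

def expand_by_rotation(images, labels):
    """Expand the dataset by rotating each image 90, 180, and 270 degrees;
    rotated copies inherit the original image's label (k = 0 keeps the original)."""
    out_imgs, out_labels = [], []
    for img, label in zip(images, labels):
        for k in range(4):
            out_imgs.append(np.rot90(img, k))
            out_labels.append(label)
    return out_imgs, out_labels
```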

CNN for IQA
3.1. Deep CNN for IQA. Due to the complexity of medical ultrasound image content, a shallow neural network may not be able to simulate well the perception of the HVS in evaluating the image. As such, this work adopts a deep convolutional neural network to assess the quality of medical ultrasound images. Research on the HVS shows that humans are sensitive to the deformation between images. Therefore, we first train a CNN to learn the difference between a distorted image and the related undistorted one. After that, the objective scoring of each image is carried out on the basis of the estimated distortion. This simulates the human perceptual process.
Accordingly, this paper designs a deep CNN, called DCNN-IQA-14, to do IQA. This network adds six convolution layers to the DCNN-IQA-8 in Kim et al.'s work [2]; their structures are shown in Figures 3 and 4. The objective error map is predicted in the first stage. The whole network is a fully convolutional network; that is, it contains only convolution layers. A zero-padding strategy is used in each convolution layer so that the convolution retains the pixel information, and two downsampling operations are used to reduce the data dimension. Except for the last layer, each layer has a 3 × 3 convolution kernel activated by ReLU. In the last layer, a 1 × 1 convolution kernel is used to output the error-map prediction. If this error map were used directly in the second stage to do quality assessment, the prediction would be coarse, because the predicted error map has only one channel and much information about the differences among images is lost. Therefore, the feature map of the penultimate layer is used in the second training stage. In the second stage, the output of the fourteenth convolution layer is regressed to the final subjective score through two fully connected layers.
Before training, all samples should be normalized. The normalization includes the following steps. First, the image is reduced to a quarter of its original size and then enlarged back to the original size. Then, a Gaussian low-pass filter is applied. Subsequently, the filtered image is subtracted from the original image. Finally, we divide the images into nonoverlapping patches of size 112 × 112. Each patch's label is the MOS value of the related image. Let I_d represent the deformed image, I_r the corresponding high-quality reference image, I_d^low and I_r^low the corresponding images after the zoom-out/zoom-in low-pass filtering, and Î_d = I_d − I_d^low and Î_r = I_r − I_r^low the preprocessed images; then the error map is computed as

e_gt = |Î_r − Î_d|,

where e_gt is the label for the first training stage. The stochastic gradient descent (SGD) algorithm is used to minimize the objective function

L_1 = Σ_(i,j) r(i, j) (ê(i, j) − e_gt(i, j))²,

where (i, j) indexes the pixels in the image, ê is the predicted difference between the distorted image and its corresponding original reference image, and r is the reliability map obtained by measuring the texture intensity.
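The preprocessing and the error-map target can be sketched as follows. This is a rough approximation under stated assumptions: 2×2 block averaging stands in for the quarter-size resize plus Gaussian low-pass filtering, even image dimensions are assumed, and the absolute-difference target follows the DIQA-style scheme of Kim et al. [2].

```python
import numpy as np

def low_pass(img):
    """Crude low-pass proxy: 2x2 block averaging then nearest-neighbour
    upscaling (assumes even image dimensions)."""
    h, w = img.shape
    small = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)

def normalize(img):
    """Preprocessing step: subtract the low-pass version, i.e. I_hat = I - I_low."""
    img = img.astype(np.float64)
    return img - low_pass(img)

def error_map(i_ref, i_dist):
    """Ground-truth error map e_gt = |I_hat_r - I_hat_d| for the first stage."""
    return np.abs(normalize(i_ref) - normalize(i_dist))
```

Note that a uniform brightness shift is removed by the low-pass subtraction, so it produces a zero error map, consistent with the idea that the target encodes structural distortion rather than raw intensity difference.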
In the second training stage, besides the feature map obtained in the first stage, two manually extracted features are added to the fully connected layer FC1: δ^low_{I_d} and μ_r, respectively the standard deviation of the low-pass distorted image and the mean of the reliability map. By using the subjective score of each distorted image as the label and minimizing equation (4), the final score-predictive model can be trained. It should be mentioned that the two CNNs used here are just examples to illustrate the effectiveness of the proposed solution; certainly, many other networks can obtain the same or even better results. As this paper mainly intends to pose this issue and offer an effective solution, the comparison of different networks is not our main topic.

Classical Classification Network for IQA.
In the experiment, we also used a 34-layer deep residual network (ResNet). The last layer of the network is replaced by a fully connected layer that outputs a single score. This network is called ResNet-IQA, and its structure is shown in Figure 5. Except for the first layer, each layer has a 3 × 3 convolution kernel activated by ReLU. The ultrasound images are divided into nonoverlapping patches of size 112 × 112 as the input. Each patch's label is the MOS value of the related image, and each patch is normalized according to

Î(i, j) = (I(i, j) − μ(i, j)) / (δ(i, j) + C),

where I(i, j) represents the original gray value at point (i, j) and Î(i, j) is the value after normalization (equation (5)). μ(i, j) and δ(i, j) are the mean value and the standard deviation of the window centered at the considered point, respectively. P and Q are the width and height of the window; they are generally set to 3. C is a constant. In training the CNN, the samples are divided into three kinds: training data, validation data, and testing data, with ratios of 0.6, 0.2, and 0.2, respectively. The loss function is

L(w) = (1/N) Σ_n (f(x_n; w) − y_n)²,

where N denotes the number of image patches, x_n is the input patch, w represents the weights, f(x_n; w) denotes the score computed by the network, and y_n is the label of the input patch. SGD is used to train the CNN to compute the optimized w by minimizing the difference between f(x_n; w) and y_n.
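The patch normalization of equation (5) can be sketched directly. Border handling (edge padding here) and the hypothetical function name are assumptions of this sketch.

```python
import numpy as np

def local_normalize(img, P=3, Q=3, C=1.0):
    """Normalize each pixel by the mean and standard deviation of the
    P x Q window centred on it: I_hat = (I - mu) / (delta + C)."""
    img = img.astype(np.float64)
    pad_h, pad_w = P // 2, Q // 2
    padded = np.pad(img, ((pad_h, pad_h), (pad_w, pad_w)), mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            win = padded[i:i + P, j:j + Q]
            out[i, j] = (img[i, j] - win.mean()) / (win.std() + C)
    return out
```

On a constant image the numerator vanishes, so the output is all zeros; the constant C keeps the division well defined in such flat regions.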

DCNN-IQA of Different Convolution Layers Trained from Scratch.
In this experiment, we used ultrasound images to train DCNN-IQA with 8 and 14 convolution layers, respectively. The ultrasound images come from the database established in Section 2, with a total of 478 images, and all images are divided into three kinds: 60% training data, 20% validation data, and 20% testing data, with no repetition among the three kinds. After that, we use the linear correlation coefficient (LCC) and Spearman's rank-order correlation coefficient (SROCC) to evaluate the accuracy of the IQA results.
The two criteria are computed as

LCC = Σ_i (X_i − X̄)(Y_i − Ȳ) / sqrt(Σ_i (X_i − X̄)² Σ_i (Y_i − Ȳ)²),   (8)
SROCC = 1 − 6 Σ_i d_i² / (n(n² − 1)).   (9)

In equations (8) and (9), X and Y denote the subjective scores and the scores predicted by the DCNN, X̄ and Ȳ are their means, n is the number of testing samples, and d_i denotes the difference between the ranks of image i after sorting the computed and original scores.
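Equations (8) and (9) take only a few lines of plain Python; this sketch does not handle tied ranks in the SROCC.

```python
def lcc(x, y):
    """Pearson linear correlation coefficient between subjective scores x
    and predicted scores y (equation (8))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def srocc(x, y):
    """Spearman rank-order correlation 1 - 6*sum(d_i^2)/(n(n^2-1))
    (equation (9)); assumes no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Both criteria reach 1 for a perfectly monotone agreement between subjective and predicted scores and −1 for a perfectly reversed ordering.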
The results are shown in Figure 6. It can be seen that increasing the number of network layers has no significant effect on the final fitting quality of the model. However, when the number of training epochs is small, the deeper network learns image information faster: in the first 40 epochs, the fitting quality of DCNN-IQA-14 is better than that of DCNN-IQA-8, which is consistent with the results in most cases. But, owing to the increased computation, each training epoch takes longer in the deeper network. In addition, because of the limited amount of data, the more network layers there are, the more easily overfitting appears. Based on these factors, we choose the 8-layer network, DCNN-IQA-8, in the follow-up studies.

DCNN-IQA Trained by Transfer Learning.
As mentioned above, many samples are needed to train the DCNN, but medical image samples are usually scarce. Moreover, our experiments show that the DCNN may overfit when there are few samples. Inspired by the use of transfer learning in many other applications, we introduce it here. Specifically, DCNN-IQA is first trained using the labeled optical images in the LIVE dataset. During this training, we still randomly take 60% of the data as training data, 20% as validation data, and the remaining 20% as testing data. After that, the second training stage of the DCNN is fine-tuned with the ultrasound images as mentioned in Section 4.1.
The results are shown in Figure 7. They show that transfer learning significantly accelerates the training process: 20 epochs of transfer learning reach the LCC value that direct learning attains only after 100 epochs. For the SROCC, transfer learning not only improves the learning speed significantly but also achieves results that direct learning cannot. These results indicate that pretraining on natural image quality assessment significantly improves the final fitting quality of ultrasonic image quality evaluation. Transfer learning can effectively alleviate the overfitting caused by the limited number of ultrasound images and improve network performance to a certain extent.

Comparison with Other Assessment Metrics.
In this experiment, we use the ResNet-IQA described in Section 3.2 to assess ultrasound image quality. We randomly generate 10 groups of training data. Each group contains the 478 images from the database, and all images are divided into three kinds: 60% training data, 20% validation data, and 20% testing data. The results are shown in Table 1.

Conclusion
Ultrasound images play a vital role in medical applications, yet how to quantitatively assess the quality of an ultrasound image remains a largely unaddressed issue. In this paper, CNNs are used to evaluate the quality of ultrasound images. Based on the study of optical IQA, we introduce the CNN into the assessment of ultrasound images. We collected ultrasound images from hospitals and websites and established a medical ultrasound image database with subjective score tags. The ultrasound images are captured by different types of equipment. These images are scored by four experienced doctors, and the average score is used as the gold standard in IQA. A deep CNN is applied to the ultrasound IQA task, and the network adjustment and training strategy are designed for ultrasound images. The transfer learning strategy is borrowed to overcome the scarcity of labeled ultrasound samples; it also speeds up training and improves IQA accuracy. Meanwhile, we modified a classic classification network for ultrasound IQA. These methods are compared with traditional evaluation methods. The results show that the deep-CNN-based methods are more reliable than the traditional metrics, and the results of transfer learning and ResNet are better than those of the deep CNN trained from scratch.
There is still a long way to go. First, more ultrasound images should be collected from more channels. Moreover, more experienced doctors will join in scoring the images to refine the gold standard. Another future direction is to design a more applicable CNN.

Data Availability
All data will be provided upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.