Improved 3D U-Net for COVID-19 Chest CT Image Segmentation

Coronavirus disease 2019 (COVID-19) has spread rapidly worldwide. The rapid and accurate automatic segmentation of COVID-19 infected areas using chest computed tomography (CT) scans is critical for assessing disease progression. However, infected areas have irregular sizes and shapes. Furthermore, there are large differences between image features. We propose a convolutional neural network, named 3D CU-Net, to automatically identify COVID-19 infected areas from 3D chest CT images by extracting rich features and fusing multiscale global information. 3D CU-Net is based on the architecture of 3D U-Net. We propose an attention mechanism for 3D CU-Net to achieve local cross-channel information interaction in the encoder to enhance different levels of the feature representation. At the end of the encoder, we design a pyramid fusion module with dilated convolutions to fuse multiscale context information from high-level features. The Tversky loss is used to resolve the problems of the irregular size and uneven distribution of lesions. Experimental results show that 3D CU-Net achieves excellent segmentation performance, with Dice similarity coefficients of 96.3% and 77.8% in the lung and COVID-19 infected areas, respectively. 3D CU-Net has high potential to be used for diagnosing COVID-19.


Introduction
Coronavirus disease 2019 (COVID-19) has rapidly spread worldwide since its outbreak in December 2019 [1,2]. In March 2020, the World Health Organization declared COVID-19 a global pandemic [3]. The reverse transcription polymerase chain reaction (RT-PCR) test is the standard for COVID-19 detection. However, this test has a high false negative rate, and it cannot accurately detect the initial infection. Hence, infected patients cannot be diagnosed on time [4,5]. Compared with the RT-PCR test, chest computed tomography (CT) provides higher sensitivity in the diagnosis of COVID-19; therefore, it can be used as one of the main clinical detection methods [6,7]. The chest CT scans of patients with COVID-19 show characteristic imaging features, such as ground-glass opacity and occasional consolidation plaques in the lungs [8-10], which are considerably useful for diagnosing COVID-19 and evaluating the severity of a patient's condition. However, owing to a significant increase in the number of patients, it [...] high-accuracy network (COVID-SegNet) to segment COVID-19 lesions from chest CT images. COVID-SegNet used multiscale feature fusion and enhanced features to segment lung and COVID-19 lesions accurately and automatically. Although these methods play an important role in the diagnosis and analysis of COVID-19, they are based on CT slices. These methods frequently neglect the correlation between continuous CT slices and cannot fully utilise the spatial information of CT scans.
It is challenging to automatically segment the lesions of COVID-19 pneumonia because of the complexity of CT spatial imaging, the difficulty of marking infected areas, and the differences between medical image characteristics. First, infections may have different characteristic appearances, such as ground-glass opacity and consolidation plaques. Second, lesions have irregular shapes and fuzzy boundaries, and a few lesions have a lower contrast than the surrounding areas. Third, manually annotating pulmonary infection is tedious and time-consuming, and the result is frequently influenced by doctors' knowledge and clinical experience of lesions [9,10,12,16].
We propose a deep learning method, named 3D CU-Net, to improve the segmentation performance of neural network models for COVID-19. In addition, we propose a new feature encoding module (residual channel attention, Res_CA) for 3D CU-Net. In the feature extraction stage, a channel attention mechanism with local cross-channel information interaction is used to recalibrate feature weights under the guidance of global information and enhance the feature representation. We also propose a pyramid fusion module with multiscale global information interaction in the bottom encoder, which fuses feature information at different scales and thereby improves the performance of the network for lesion area segmentation.

U-Net
Structure. U-Net [17] was proposed by Ronneberger et al. in 2015 for medical cell segmentation. It consists of a contracting path to obtain context information and an expanding path to recover a feature map. Because high-level and low-level semantic information are equally important in image segmentation, U-Net combines the high-resolution features of the encoder with the high-level semantic features of the decoder stage to help restore the details of a target and obtain an accurate output.

Variants of U-Net.
Numerous methods based on U-Net have achieved better results in different medical image segmentation tasks by integrating new network design concepts. Oktay et al. [18] added an attention mechanism to U-Net for targets with different shapes and sizes and used an attention gate to highlight the salient features of a skip connection. Xiao et al. proposed a model, named Res U-Net [19], with a weighted attention mechanism to deal with extreme changes in the ocular vascular background. Feng et al. proposed CPFNet [20], which improved the segmentation performance by utilising two pyramid modules to fuse multiscale context information.
Wang et al. proposed a new cross-channel information interaction method, named ECA-Net [21], to recalibrate features. They prevented the adverse effect of dimensionality reduction in SE-Net on channel attention. However, most U-shaped networks use only abstract features, neglect certain details, and cannot effectively use multiscale context information [20].

Network Overview.
We propose an automatic segmentation model, named 3D CU-Net, for COVID-19 lesions. The model is based on the 3D U-Net architecture, as shown in Figure 1. The network structure of 3D CU-Net is composed of a feature encoding module (Res_CA) with an attention mechanism, a pyramid dilated convolution module (PDS block) for extracting and fusing multiscale information at different resolutions, and a feature decoding module for segmentation. A fixed-size 3D slice extracted from a 3D CT image is used as the input of the network. The predicted segmentation result is obtained after a series of downsampling and upsampling operations for feature encoding and decoding. The model can ensure continuity between CT images and retain a certain amount of interlayer information. Thus, the 3D input contains more contextual information than a 2D image.
In the feature encoding part, an efficient channel attention mechanism [21] is used to reallocate feature weights under the guidance of global information, and residual networks are used to mitigate problems such as gradient vanishing. In the PDS module, global average pooling is used to obtain multiscale global information under different receptive fields to enhance the feature representation, thereby improving the segmentation performance of the network for lesions with irregular shapes and sizes. Finally, segmentation results are obtained by a feature decoding module, which includes two consecutive 3 × 3 × 3 convolutions and a residual connection with a 1 × 1 × 1 convolution.
As shown in Figure 2, the feature encoding module mainly consists of the following two parts.

Feature Extraction.
In each encoding module, except for the bottom encoder, two consecutive 3 × 3 × 3 convolutions are used to extract deeper feature information. This expands the receptive field, extracts more feature information, increases the representational capacity of the network, and reduces the amount of computation and the number of parameters. After each 3 × 3 × 3 convolution, we add the ReLU activation function and batch normalisation to alleviate the vanishing gradient problem and increase the speed of network learning.
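The parameter saving mentioned above can be checked with simple arithmetic. The sketch below compares two stacked 3 × 3 × 3 convolutions with a single 5 × 5 × 5 convolution that covers the same receptive field. The 5 × 5 × 5 baseline and the channel width of 32 are assumptions made only for illustration; the text does not state which alternative the authors had in mind.

```python
def conv3d_params(kernel, in_ch, out_ch, bias=True):
    """Number of learnable parameters in one 3D convolution with a cubic kernel."""
    return kernel ** 3 * in_ch * out_ch + (out_ch if bias else 0)


def stacked_receptive_field(kernel, n_layers):
    """Receptive field per axis of n stacked stride-1, undilated convolutions."""
    return 1 + n_layers * (kernel - 1)


channels = 32  # hypothetical channel width, not taken from the paper

# Two stacked 3x3x3 convolutions see a 5x5x5 neighbourhood.
rf_two_small = stacked_receptive_field(kernel=3, n_layers=2)

# Parameter counts for the two designs at equal receptive field.
two_small = 2 * conv3d_params(3, channels, channels)
one_large = conv3d_params(5, channels, channels)

print("receptive field of two 3x3x3 layers:", rf_two_small)
print("parameters, two 3x3x3 layers:", two_small)
print("parameters, one 5x5x5 layer: ", one_large)
```

Under these assumptions the stacked design needs fewer than half the parameters of the single large kernel, and it also gains an additional nonlinearity between the two layers.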

Feature Calibration Block.
We introduce a channel attention mechanism to obtain representative features and highlight useful information. According to the correlation between adjacent channels, cross-channel interactive fusion methods are used to recalibrate the weights of the extracted features. Cross-channel information communication can effectively prevent the influence of the reduction in dimensions on channel attention and enhance the feature representation of lesion areas.
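As a rough illustration of this kind of recalibration, the NumPy sketch below follows the general ECA-style recipe: global average pooling per channel, a small 1D convolution across neighbouring channels without any dimensionality reduction, a sigmoid, and channel-wise rescaling. The function name, the kernel size of 3, and the uniform kernel (a stand-in for learned weights) are all assumptions; the exact layout of the Res_CA block is given in Figure 2 and is not reproduced here.

```python
import numpy as np


def eca_channel_attention(features, kernel_size=3):
    """ECA-style channel recalibration for a feature map of shape (C, D, H, W).

    Returns the reweighted features and the per-channel weights.
    """
    c = features.shape[0]
    # Squeeze: global average pooling gives one descriptor per channel.
    descriptor = features.reshape(c, -1).mean(axis=1)
    # Local cross-channel interaction: 1D convolution over the channel axis,
    # zero-padded so that every channel keeps its own weight.
    pad = kernel_size // 2
    padded = np.pad(descriptor, pad)
    kernel = np.full(kernel_size, 1.0 / kernel_size)  # placeholder for learned weights
    mixed = np.array([padded[i:i + kernel_size] @ kernel for i in range(c)])
    # Sigmoid gate in (0, 1), then rescale each channel.
    weights = 1.0 / (1.0 + np.exp(-mixed))
    return features * weights[:, None, None, None], weights


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 16, 16))  # toy feature map: 8 channels
y, w = eca_channel_attention(x)
print(y.shape, w.shape)
```

Because the 1D convolution mixes each channel only with its neighbours, the number of extra parameters is just the kernel size, which is the property that lets this approach avoid the channel bottleneck used in SE-Net.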

Pyramid Fusion Module for Dilated Convolutional Global Information Interaction.
Multiscale context information helps improve the performance of semantic segmentation. Thus, we propose a pyramid fusion module that transfers small-scale global information to larger-scale features. As shown in Figure 3, a residual block is used to deepen the network and extract feature information. Then, parallel dilated convolutions with dilation rates of 1, 2, and 4 are used to obtain the multiscale information of high-level features. Next, according to the correlation between feature channel information at different scales, global average pooling is used to obtain the global channel features and their weights at different scales. Thus, the global information obtained in a small receptive field is used to enhance the feature expression ability of a large receptive field. Finally, the features at different scales are fused by concatenation.
In the last part of this module, we concatenate the multiscale feature information that has been recalibrated with feature weights, normalise it using a 1 × 1 × 1 convolution, and then fuse it with the original high-level features.
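To make the effect of the three dilation rates concrete, the sketch below computes the effective kernel extent of a 3-tap kernel at rates 1, 2, and 4 and verifies it with a small 1D dilated convolution on an impulse. This is a one-dimensional toy with a hypothetical all-ones kernel, intended only to show how the receptive field grows; it does not reproduce the 3D module in Figure 3.

```python
import numpy as np


def effective_kernel(kernel, dilation):
    """Extent along one axis covered by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def dilated_conv1d(signal, weights, dilation):
    """Zero-padded 'same' 1D dilated convolution (cross-correlation form)."""
    k = len(weights)
    pad = dilation * (k - 1) // 2
    padded = np.pad(signal, pad)
    return np.array([
        sum(weights[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ])


impulse = np.zeros(17)
impulse[8] = 1.0
weights = np.ones(3)  # hypothetical kernel, chosen so the footprint is easy to see

for rate in (1, 2, 4):
    response = dilated_conv1d(impulse, weights, rate)
    touched = np.nonzero(response)[0]
    span = touched.max() - touched.min() + 1
    print(f"dilation {rate}: effective kernel {effective_kernel(3, rate)}, measured span {span}")
```

With a kernel size of 3, the three branches cover 3, 5, and 9 positions per axis while each still uses only three weights per axis, which is why parallel dilated branches are a cheap way to collect context at several scales.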

Feature Decoding Block.
As shown in the decoding block in Figure 1, two 3 × 3 × 3 convolutions and a residual connection with a 1 × 1 × 1 convolution are applied to the feature map after concatenation, and a feature map with the same size as the original input image is obtained.

Loss Function.
In the medical image segmentation task (lesion detection), the high imbalance of the training data leads [...]

$$\mathrm{Loss}_{\mathrm{Total}} = \mathrm{Loss}_{\mathrm{CE}} + \mathrm{Loss}_{\mathrm{Tversky}},$$

where $x$ is the input value, $y_i$ is the true label corresponding to category $i$, and $f(x_i)$ is the model output value. TP, FN, and FP represent true positive, false negative, and false positive, respectively.
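A minimal NumPy sketch of a combined cross-entropy and Tversky loss is given below. Several points are assumptions rather than details confirmed by the text: the loss is written for a single binary lesion mask, the Tversky index is taken as TP / (TP + α·FP + β·FN) so that β weights false negatives, and the default α = β = 0.5 is chosen only because it reduces the index to the Dice coefficient. All function names are hypothetical.

```python
import numpy as np


def tversky_loss(prob, target, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index for a binary mask.

    ASSUMPTION: alpha weights false positives and beta weights false negatives.
    """
    tp = np.sum(prob * target)
    fp = np.sum(prob * (1.0 - target))
    fn = np.sum((1.0 - prob) * target)
    return float(1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps))


def cross_entropy_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy (a simplification of the multi-class form)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)))


def total_loss(prob, target, alpha=0.5, beta=0.5):
    """Loss_Total = Loss_CE + Loss_Tversky."""
    return cross_entropy_loss(prob, target) + tversky_loss(prob, target, alpha, beta)


target = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
missed_voxel = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)  # one false negative
extra_voxel = np.array([1, 1, 1, 1, 1, 0, 0, 0], dtype=float)   # one false positive

print("FN case:", tversky_loss(missed_voxel, target, alpha=0.3, beta=0.7))
print("FP case:", tversky_loss(extra_voxel, target, alpha=0.3, beta=0.7))
```

Under the assumed convention, setting β larger than α penalises a missed lesion voxel more heavily than a spurious one, which is one way such a loss can trade a little overlap accuracy for higher sensitivity on small lesions.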

Experimental Data.
We train and evaluate 3D CU-Net using the open COVID-19 CT dataset provided by Jun et al. [22]. In addition, MosMedData [23] provided by Forrest is used as an independent test dataset to further verify the performance of the model. The COVID-19 CT dataset consists of the chest CT scans of 20 COVID-19 patients with annotations validated by a senior radiologist, in which the left lung, right lung, and lesion areas are annotated. The dataset contains 3250 CT slices with sizes of 630 × 630, 512 × 512, and 401 × 630. The lesion area accounts for only 2.12% of the CT slice area, and the slices with lesion markers account for 52.86% of the total slices. MosMedData was provided by Moscow Municipal Hospital. It consists of the chest CT scans of 50 confirmed COVID-19 patients, with lesion areas annotated by a few experts.
The Dice coefficients of the COVID-19 CT dataset and MosMedData are 0.673 and 0.588, respectively.
We clip the pixel values of the training data using −250 and 1250 as thresholds and normalise them to the range [0, 1]. In addition, 3D CT images are resampled to a fixed isotropic resolution so that all scans share the same voxel spacing. We use random elastic deformation, random rotation, random scaling, Gaussian noise, and other common medical image data augmentation methods to augment the training data and prevent the overfitting problem caused by a small amount of training data.
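The intensity normalisation step can be sketched as a clip followed by a linear rescale. The function name is hypothetical, and the linear min-max mapping is an assumption; the text gives only the two thresholds and the target range.

```python
import numpy as np


def normalise_ct(volume, lower=-250.0, upper=1250.0):
    """Clip CT intensities to [lower, upper] and rescale linearly to [0, 1]."""
    clipped = np.clip(np.asarray(volume, dtype=np.float32), lower, upper)
    return (clipped - lower) / (upper - lower)


# Toy example: values below/above the thresholds saturate at 0 and 1.
sample = np.array([-1000.0, -250.0, 500.0, 1250.0, 3000.0])
print(normalise_ct(sample))
```

Clipping before rescaling keeps extreme values, such as air or metal, from compressing the intensity range that carries the lung and lesion contrast.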

Experimental Details and Evaluation Metrics.
3D CU-Net is compared with standard 3D U-Net [25] in terms of segmentation results, and the performance of 3D CU-Net is further analysed using MosMedData.
We build an operating environment on a Linux server. An NVIDIA Tesla P100 GPU is used, and the TensorFlow 2.0 deep learning framework is adopted. The installation environment comprises CUDA 10.0, cuDNN 7.6.5, Python 3.6, OpenCV, GCC, etc. In the fitting process, we set the batch size to 2. We use Adam optimisation with an initial learning rate of 0.001 and a minimum learning rate of 0.00001. We multiply the learning rate by 0.1 when the loss does not decrease for 15 epochs. The training process ends when the loss does not decrease for 50 epochs. We employ 5-fold cross-validation for model fitting. Sixteen sets of CT images are used for fitting and the remaining four sets for validation. After each fitting process, the model is evaluated using the validation data.
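The learning-rate and stopping rules described above can be written down as a small framework-independent helper. This is a sketch under stated assumptions: the class name is hypothetical, the text does not say whether the monitored loss is the training or validation loss, and the exact counter-reset behaviour of the authors' TensorFlow callbacks is not specified, so the bookkeeping below is only one plausible reading.

```python
class PlateauSchedule:
    """Lowers the learning rate on loss plateaus and signals early stopping."""

    def __init__(self, lr=1e-3, min_lr=1e-5, factor=0.1,
                 lr_patience=15, stop_patience=50):
        self.lr = lr
        self.min_lr = min_lr
        self.factor = factor
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.stale_epochs = 0  # epochs since the best loss (drives early stopping)
        self.lr_wait = 0       # epochs since the best loss or the last LR reduction

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best:
            self.best = loss
            self.stale_epochs = 0
            self.lr_wait = 0
        else:
            self.stale_epochs += 1
            self.lr_wait += 1
            if self.lr_wait >= self.lr_patience:
                # Multiply by the factor, but never go below the minimum rate.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.lr_wait = 0
        return self.stale_epochs >= self.stop_patience


# Toy loss curve: three improving epochs followed by a long plateau.
schedule = PlateauSchedule()
toy_losses = [1.0, 0.9, 0.8] + [0.8] * 60
for epoch, loss in enumerate(toy_losses):
    if schedule.step(loss):
        print(f"stopping at epoch {epoch} with lr={schedule.lr:.0e}")
        break
```

In this toy run the rate drops after 15 and 30 stagnant epochs, is held at the minimum afterwards, and training stops once 50 epochs have passed without improvement.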
In addition, after fitting, we use three metrics that are widely used in medical image analysis to evaluate the segmentation performance of the model for the left and right lungs and COVID-19 lesion areas. These metrics are the Dice similarity coefficient (DSC), sensitivity (Sens), and specificity (Spec).

Experimental Results.
As shown in Table 1, the DSC is 77.8% and Sens is 73.8% for 3D CU-Net; compared with 3D U-Net, the DSC and Sens increase by 7.3% and 3.1%, respectively. By adjusting the parameters of the Tversky loss (α = 0.3 and β = 0.7), Sens for 3D CU-Net increases to 83.7% with only a small loss in DSC. In addition, the accuracy of overall lung segmentation improves.
Figure 4 shows the segmentation results obtained using 3D CU-Net and 3D U-Net for five different slices. The images from left to right are the original CT image, ground truth, segmentation results of 3D U-Net, results of 3D CU-Net, and results of 3D CU-Net with the Tversky loss parameters set to α = 0.3 and β = 0.7.
Figure 5 shows the local details of the CT image slice segmentation results shown in the third row of Figure 4. In the first column of Figure 5, rows (a)-(e) show the original CT image, ground truth, segmentation result of 3D U-Net, segmentation result of 3D CU-Net, and segmentation result of 3D CU-Net with the Tversky loss parameters set to α = 0.3 and β = 0.7, respectively. The second column shows the details of the area enclosed by the red box in the first column. 3D U-Net shows poor segmentation performance, and a large infected area is not identified, as shown in row (c). In contrast, 3D CU-Net provides better segmentation performance, and most infected areas are accurately identified, as shown in row (d). The area enclosed by the blue box in the second column of Figure 5 shows that setting α = 0.3 and β = 0.7 effectively reduces the false positive rate of 3D CU-Net and improves the sensitivity of infection region segmentation.
Furthermore, we compare the performance of the model in terms of infected area segmentation on the basis of MosMedData. As shown in Table 2, the performance of 3D CU-Net is better than that of 3D U-Net, with a 5.9% improvement in the DSC and a 15% increase in Sens. The experiments performed using the COVID-19 CT dataset and MosMedData show that 3D CU-Net provides excellent segmentation performance. For the left lung, right lung, and lesion areas, the DSC is 0.960, 0.963, and 0.771, Sens is 0.969, 0.966, and 0.837, and Spec is 0.998, 0.998, and 0.998, respectively.
3D CU-Net therefore has great potential for evaluating COVID-19 infection. The above results suggest that the 3D CU-Net model performs well in COVID-19 lesion segmentation.
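For reference, the three reported metrics can be computed from a pair of binary masks as sketched below. The standard voxel-wise definitions are assumed (DSC = 2TP / (2TP + FP + FN), Sens = TP / (TP + FN), Spec = TN / (TN + FP)), since the paper's own formulas are not reproduced in the text; the function names are hypothetical.

```python
import numpy as np


def _safe_div(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return float(numerator) / float(denominator) if denominator else float("nan")


def segmentation_metrics(pred, truth):
    """Voxel-wise DSC, sensitivity, and specificity for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return {
        "DSC": _safe_div(2 * tp, 2 * tp + fp + fn),
        "Sens": _safe_div(tp, tp + fn),
        "Spec": _safe_div(tn, tn + fp),
    }


# Toy masks: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
print(segmentation_metrics(pred, truth))
```

Because lesions occupy only a small fraction of each scan, specificity is dominated by the many background voxels and tends to sit close to 1, so DSC and sensitivity are the more discriminating figures when comparing models.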

Conclusion
We proposed a deep learning segmentation network (3D CU-Net) for detecting COVID-19 pulmonary infection. The proposed network was based on 3D U-Net. An attention mechanism was introduced for channel features in the encoding stage to enhance the representation ability of features. The full utilisation of the multiscale global information of high-level features extracted from the bottom encoder improved the accuracy of COVID-19 detection. The proposed network has high potential to be used for diagnosing COVID-19.
However, 3D CU-Net has certain limitations. Its accuracy must be improved for the irregular shapes and different sizes of lesions. In addition, the segmentation performance can be improved via further research and by utilising high-quality medical imaging data.

Data Availability
The labeled datasets used to support the findings of this study are publicly available.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.