R2AU-Net: Attention Recurrent Residual Convolutional Neural Network for Multimodal Medical Image Segmentation

In recent years, semantic segmentation methods based on deep learning have achieved strong performance in medical image segmentation. As one of the typical segmentation networks, U-Net has been successfully applied to multimodal medical image segmentation. This paper proposes a recurrent residual convolutional neural network with attention gate connections (R2AU-Net) based on U-Net. It enhances the capability of integrating contextual information by replacing the basic convolutional units in U-Net with recurrent residual convolutional units. Furthermore, R2AU-Net adopts attention gates instead of the original skip connections. Experiments are performed on three multimodal datasets: ISIC 2018, DRIVE, and the public dataset used in LUNA and the Kaggle Data Science Bowl 2017. Experimental results show that R2AU-Net achieves better performance than other improved U-Net algorithms for multimodal medical image segmentation.


Introduction
Medical images play a key role in medical treatment. Computer-aided diagnosis (CAD) is designed to provide doctors with a systematic and accurate interpretation of medical images so that patients can be treated better. Manual segmentation not only relies heavily on the doctor's own knowledge and clinical experience for recognition accuracy, but is also very inefficient. Therefore, the application of deep learning to medical image segmentation has attracted widespread attention. Because medical image labeling requires experts to spend considerable time and effort, it is difficult to acquire thousands of training images for medical image segmentation tasks. Ciresan et al. [1] trained a network in a sliding-window setup to predict the class label of each pixel from the local area (patch) around it. However, this approach must run the network independently for each patch, and overlapping patches introduce a great deal of redundancy. Furthermore, larger patches require more max-pooling layers, which reduces localization accuracy. The fully convolutional network [2] is one of the earliest deep neural networks applied to image segmentation. Without the traditional fully connected layer, it uses deconvolution at the last layer of the network to restore the original image size. Ronneberger et al. [3] extended this system and proposed U-Net, which includes an encoding path and a decoding path. The encoder uses its output feature map to characterize the original image.
Through the information output from the encoder, the decoder restores the details and size of the original image. U-Net adds multiple skip connections between the encoder and decoder, which transfer the features of the shallow layers to the deep layers. Thus, they help the decoding path recover the details of the image better. Since then, U-Net has become a very popular segmentation network and has been applied to medical image segmentation tasks including cardiac MRI [4], cardiac CT [5], and abdominal CT [6] segmentation, pulmonary nodule detection [7], and liver segmentation [8]. However, target organs vary greatly among patients, so U-Net relies heavily on multistage cascaded CNNs. A cascade framework makes dense predictions of the ROI, which leads to repeated extraction of similar low-level features, wasting computational resources and increasing the number of model parameters. Therefore, the design of an efficient deep CNN structure is very important.
So far, many improved versions of U-Net have been proposed. Azad et al. [9] proposed BCDU-Net. The most important changes concern the feature extraction method and the skip connections. The original U-Net relies on multistage cascaded CNNs, which wastes computing resources and increases the number of parameters. U-Net simply concatenates shallow features and deep features through skip connections. In this paper, an extended version of U-Net is proposed, which uses a recurrent residual convolutional neural network with attention gate connections (R2AU-Net) for medical image segmentation. The contributions of this paper can be summarized as follows. Firstly, R2AU-Net uses attention gates (AGs) to combine deep features and shallow features. The AGs use the deep feature map in the decoding path as a gating signal to modify the feature map generated in the encoding path and suppress feature responses in irrelevant background areas, so as to highlight features that are useful for a specific task [10]. Secondly, R2AU-Net substitutes a recurrent residual convolutional unit for the basic convolutional unit of U-Net. In the recurrent residual convolutional unit, recurrent connections and residual connections [11] are added to each convolutional layer [12] without increasing the number of network parameters. The recurrent connections enhance the ability to integrate context information, and the residual connections help train deeper networks [13,14]. In addition, batch normalization [15] is used to accelerate the convergence of the network. R2AU-Net is evaluated on three datasets: retinal vascular segmentation (DRIVE dataset), skin lesion segmentation (ISIC 2018 dataset), and lung nodule segmentation (lung dataset).

Proposed Method
Multiple deep learning models are usually taken as functional modules to construct a new network. Inspired by U-Net, R2AU-Net is proposed in this paper. The network structure is shown in Figure 1, which takes advantage of four recently developed deep learning models. There are three differences between R2AU-Net and U-Net. First, a recurrent convolutional block with a residual unit is used in the encoding and decoding paths. Second, the skip connections are replaced by AGs, which correct low-resolution features using deep features. Third, batch normalization (BN) [15] is used in the upsampling process to increase the stability of the neural network and speed up its convergence. BN standardizes the data, provides a slight regularization effect, reduces generalization error, and improves network performance [15].
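The standardization performed by BN can be illustrated with a minimal NumPy sketch. It assumes feature maps stored as an array of shape (N, C, H, W) and per-channel statistics; the function name `batch_norm` and the scalar `gamma`, `beta`, and `eps` parameters are illustrative choices, not taken from the paper, and running statistics for inference are omitted.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode batch normalization of feature maps.

    x has shape (N, C, H, W); mean and variance are computed per channel
    over the batch and spatial axes. gamma and beta are the learnable
    scale and shift (scalars here for brevity, an assumption).
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # subtract mean, divide by std
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 16, 16))
normalized = batch_norm(features)
print(normalized.mean(axis=(0, 2, 3)))  # close to 0 for every channel
print(normalized.std(axis=(0, 2, 3)))   # close to 1 for every channel
```

After normalization every channel has approximately zero mean and unit variance regardless of the statistics of the incoming batch, which is the property the text relies on for stable training.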

Encoding Path.
The encoding path of R2AU-Net contains four steps. Every step contains a recurrent residual convolutional unit, which consists of two 3 × 3 convolutions and adds recurrent connections to each convolutional layer to enhance the model's capability of integrating contextual information. In addition, residual connections are added to develop more efficient and deeper models. Each time a recurrent residual convolutional unit is passed, the number of feature maps is doubled and the size is halved. The R2AU-Net model applies concatenation to the feature maps passed from the encoding unit to the decoding unit. The recurrent convolutional layers (RCL) in the recurrent residual convolutional unit (R2CL) are performed according to the discrete time steps expressed by RCNN [12]. Consider an input sample $u_l$ in the $l$th layer of the R2CL block and a pixel located at $(i, j)$ in an input sample on the $k$th feature map in the RCL. $M^{l}_{ijk}(t)$ denotes the output at time step $t$ and can be expressed as follows:

$$M^{l}_{ijk}(t) = (w_k^{f})^{T} u_l^{f(i,j)}(t) + (w_k^{r})^{T} u_l^{r(i,j)}(t-1) + b_k, \qquad (1)$$

where $u_l^{f(i,j)}(t)$ and $u_l^{r(i,j)}(t-1)$ represent the input to the standard convolutional layer and the input sample of the $l$th RCL, respectively. The standard convolutional layer and the RCL of the $k$th feature map are weighted by $w_k^{f}$ and $w_k^{r}$, respectively, and $b_k$ is the bias. The output of the RCL is activated by the standard ReLU function $f$ as follows:

$$F(u_l, w_l) = f\left(M^{l}_{ijk}(t)\right) = \max\left(0, M^{l}_{ijk}(t)\right). \qquad (2)$$

The output of the R2CL unit can be calculated as follows:

$$u_{l+1} = u_l + F(u_l, w_l), \qquad (3)$$

where $u_l$ is an input sample of the R2CL layer. $u_{l+1}$ is the output of the downsampling layer in the encoding path and of the upsampling layer in the decoding path. The basic convolutional unit of U-Net is shown in Figure 2(a), and the structure of the R2CL block is shown in Figure 2. Formulas (1) and (2) describe the dynamic characteristics of the RCL. When the RCL is unfolded for $T$ time steps, a feedforward subnetwork of depth $T + 1$ is obtained. In this paper, the RCL is unfolded for two time steps; namely, $T = 2$. The RCL thus includes a single convolutional layer and two subsequent recurrent convolutional layers. The unfolded RCL structure is shown in Figure 3.
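The recurrence and the residual sum described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions rather than the authors' implementation: the number of channels is kept equal at input and output so that the residual addition is well defined, normalization layers are left out, and the helper names `conv3x3`, `rcl`, and `r2cl` are hypothetical.

```python
import numpy as np

def conv3x3(x, w, b):
    """Same-padded 3x3 convolution (cross-correlation, as in deep learning libraries).

    x: (C_in, H, W), w: (C_out, C_in, 3, 3), b: (C_out,)
    """
    _, h, wd = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for di in range(3):
        for dj in range(3):
            window = padded[:, di:di + h, dj:dj + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, di, dj], window)
    return out + b[:, None, None]

def rcl(u, w_f, w_r, b, steps=2):
    """Recurrent convolutional layer unfolded for `steps` time steps (T = 2 in the text)."""
    feedforward = conv3x3(u, w_f, b)       # feed-forward term plus bias, fixed over time
    state = np.maximum(0.0, feedforward)   # t = 0: no recurrent input yet
    no_bias = np.zeros_like(b)
    for _ in range(steps):
        recurrent = conv3x3(state, w_r, no_bias)
        state = np.maximum(0.0, feedforward + recurrent)  # ReLU of the summed terms
    return state

def r2cl(u, params, steps=2):
    """Two stacked RCLs followed by the residual sum u + F(u, w)."""
    out = u
    for w_f, w_r, b in params:
        out = rcl(out, w_f, w_r, b, steps)
    return u + out

rng = np.random.default_rng(0)
channels = 4
u_l = rng.normal(size=(channels, 8, 8))
params = [
    (0.1 * rng.normal(size=(channels, channels, 3, 3)),
     0.1 * rng.normal(size=(channels, channels, 3, 3)),
     np.zeros(channels))
    for _ in range(2)
]
u_next = r2cl(u_l, params)
print(u_next.shape)  # same shape as the input
```

Because the feed-forward term is computed once and reused at every time step, unfolding the layer adds depth without adding weights beyond `w_f` and `w_r`.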

Decoding Path.
Each step of the decoding path performs the upsampling operation on the output of the R2CL unit of the previous layer. With each upsampling operation, the number of feature maps is halved and the size is doubled. At the last layer of the decoding path, the feature map is restored to the original size of the input image. The LRN (local response normalization) layer in R2CL is replaced by a BN layer, so that the input of each layer keeps the same distribution. During training, shifts in the distribution of activations in each layer of the neural network slow down training. Therefore, BN [15] is used to enhance the stability of the neural network after upsampling at each step. It improves stability by subtracting the batch mean from the inputs and dividing by the batch standard deviation. BN accelerates training and improves the performance of the network model. The output $u^{up}_l$ of the BN layer is sent to the AGs. R2AU-Net uses AGs to readjust the output features of the encoder before concatenating the features at each resolution of the encoder with the corresponding features in the decoder. This module generates a gating signal that controls the importance of features at different locations. AGs gradually suppress feature responses in irrelevant background regions without cropping ROI regions between networks. The input feature map and the gating signal are denoted as $u_l$ and $g_l$, respectively. The gating signal $g_l$ determines the focus area per pixel. In order to obtain higher accuracy, additive attention [16] is used to obtain the attention coefficient. The additive formula is as follows:

$$q^{l}_{att} = \psi^{T} \sigma_1\left(W_u^{T} u_l + W_g^{T} g_l + b_g\right) + b_\psi, \qquad \alpha_l = \sigma_2\left(q^{l}_{att}\right), \qquad (4)$$

where $\sigma_1$ and $\sigma_2$ denote the ReLU and sigmoid activation functions, respectively, $W_u$, $W_g$, and $\psi$ are the weights, and $b_g$ and $b_\psi$ are the biases. Wang et al. [17] used attention based on vector concatenation. Linear transformations are calculated using a 1 × 1 × 1 convolution along the channel direction of the tensor. Grid resampling of attention coefficients is achieved using trilinear interpolation.
The AG parameters are updated by backpropagation instead of a sampling-based update method [18]. Finally, the output of the AGs is the element-wise multiplication of the feature map and the attention coefficient, as shown in formula (5):

$$\hat{u}^{up}_l = u_l \times \alpha_l. \qquad (5)$$
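A minimal NumPy sketch of an additive attention gate of this kind is given below. It assumes 2D feature maps, a gating signal that has already been resampled to the spatial size of the feature map, and 1 × 1 convolutions written as per-pixel linear maps over the channel axis. The parameter names `W_u`, `W_g`, `psi`, `b_g`, and `b_psi` and the intermediate channel count are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gate(u, g, W_u, W_g, b_g, psi, b_psi):
    """Additive attention gate.

    u: encoder feature map (C_u, H, W); g: gating signal (C_g, H, W).
    W_u: (F, C_u), W_g: (F, C_g), b_g: (F,), psi: (F,), b_psi: scalar.
    Returns the gated features and the attention map alpha of shape (H, W).
    """
    theta = np.einsum("fc,chw->fhw", W_u, u)  # 1x1 convolution of the features
    phi = np.einsum("fc,chw->fhw", W_g, g)    # 1x1 convolution of the gating signal
    hidden = np.maximum(0.0, theta + phi + b_g[:, None, None])  # ReLU
    q = np.einsum("f,fhw->hw", psi, hidden) + b_psi
    alpha = sigmoid(q)                         # one coefficient per pixel, in (0, 1)
    return u * alpha[None, :, :], alpha        # element-wise rescaling of u

rng = np.random.default_rng(0)
c_u, c_g, f_int, h, w = 8, 16, 4, 32, 32
u = rng.normal(size=(c_u, h, w))
g = rng.normal(size=(c_g, h, w))
gated, alpha = attention_gate(
    u, g,
    W_u=rng.normal(size=(f_int, c_u)),
    W_g=rng.normal(size=(f_int, c_g)),
    b_g=np.zeros(f_int),
    psi=rng.normal(size=f_int),
    b_psi=0.0,
)
print(gated.shape, alpha.shape)
```

Since `alpha` is broadcast over the channel axis, every channel of `u` is attenuated by the same spatial mask, so pixels with coefficients near zero contribute little to the concatenated decoder input.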
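Attention coefficients tend to take large values in target organ regions and relatively small values in background regions, which can improve the accuracy of image segmentation. The accuracy rate is used to evaluate the accuracy of pixel classification and is obtained by the following formula, where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative pixels, respectively:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Experimental Results
Sensitivity represents the proportion of actual positive samples that are correctly predicted as positive. It reflects how well the positive samples are detected. Sensitivity is calculated by the following formula:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}.$$

Specificity is the corresponding proportion of actual negative samples that are correctly predicted as negative and is calculated by the following formula:

$$\mathrm{Specificity} = \frac{TN}{TN + FP}.$$

The F1-score is used to measure the accuracy of a binary classification model. It considers both the precision and the recall of the model and can be regarded as their harmonic mean. With $\mathrm{Precision} = TP / (TP + FP)$ and $\mathrm{Recall} = TP / (TP + FN)$, the F1-score is calculated by the following formula:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

In addition, the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve are used to compare the performance of each network more intuitively. The area under the curve (AUC) of both the ROC curve and the PR curve is calculated for each network in this paper.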
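The pixel-wise metrics used in the evaluation can be computed directly from the confusion counts of a predicted mask and its ground truth. The sketch below uses the standard definitions; the function name `segmentation_metrics`, the zero-division convention, and the toy masks are illustrative and not part of the paper.

```python
import numpy as np

def _ratio(numerator, denominator):
    # Convention for this sketch: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0

def segmentation_metrics(pred, truth):
    """Accuracy, sensitivity, specificity, and F1-score of a binary mask."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": recall,
        "specificity": _ratio(tn, tn + fp),
        "f1": _ratio(2 * precision * recall, precision + recall),
    }

truth = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0]])
pred = np.array([[1, 0, 0, 0],
                 [1, 1, 0, 0]])
metrics = segmentation_metrics(pred, truth)
print(metrics)  # accuracy 0.75, sensitivity 2/3, specificity 0.8, f1 2/3
```

In this toy example there are 2 true positives, 4 true negatives, 1 false positive, and 1 false negative, which gives the values noted in the final comment.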

Skin Lesion Segmentation.
The ISIC dataset is published by the International Skin Imaging Collaboration (ISIC) and contains 2594 dermoscopy images of common pigmented skin lesions. All the images have been annotated by recognized skin cancer specialists. These annotations include dermoscopic features, that is, known global and local morphological elements in the image that are used to identify the type of skin lesion. In the experiment, 1815 images are used for training, 259 images for validation, and 520 images for testing. The size of each image in ISIC is 700 × 900. The input images are first resized to 256 × 256. The training data include the original images and the ground truth images labeled by professional physicians. Figure 5 shows the segmentation results on ISIC. The first column shows the original images, the second column the ground truth images, and the third column the segmentation results of R2AU-Net. The first row shows that R2AU-Net can accurately segment dark skin lesions without being affected by the hair around the lesions. For the less obvious light-colored lesions in the second row, R2AU-Net also segments the lesions well. R2AU-Net segments the skin lesion images accurately, and its output is almost identical to the ground truth image. Table 1 compares the segmentation results of R2AU-Net and other improved versions of U-Net in terms of F1-score, sensitivity, specificity, accuracy, and AUC. R2AU-Net performs well on all of these indicators. For a binary classification experiment, the ROC curve and the PR curve allow the performance of each classifier to be compared intuitively.
The AUC values of the ROC curve and PR curve are shown in Figures 6 and 7, respectively. The ROC curve tends to the upper left and the PR curve tends to the upper right, which shows the strong performance of the segmentation model.
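To make the ROC-based comparison concrete, the following sketch computes ROC points by sweeping a threshold over per-pixel scores and integrates them with the trapezoidal rule. The function names `roc_points` and `area_under_curve` and the toy scores are illustrative assumptions; the paper does not state how its curves were computed.

```python
import numpy as np

def roc_points(scores, labels):
    """False and true positive rates for every distinct threshold, highest first."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing predicted positive
    for threshold in np.unique(scores)[::-1]:
        predicted = scores >= threshold
        tpr.append(np.sum(predicted & labels) / n_pos)
        fpr.append(np.sum(predicted & ~labels) / n_neg)
    return np.array(fpr), np.array(tpr)

def area_under_curve(x, y):
    """Trapezoidal integration of y over x."""
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # predicted foreground probabilities
labels = [1, 1, 0, 1, 0, 0]              # ground truth (1 = foreground)
fpr, tpr = roc_points(scores, labels)
roc_auc = area_under_curve(fpr, tpr)
print(round(roc_auc, 4))  # 0.8889
```

An AUC of 1.0 corresponds to a curve that passes through the upper-left corner; here 8 of the 9 positive-negative score pairs are ranked correctly, giving 8/9.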

Retinal Vascular Segmentation.
The images of the DRIVE dataset were obtained from a diabetic retinopathy screening program in the Netherlands. The screening population consisted of 400 diabetic subjects between 25 and 90 years of age, from which 40 color retina images were randomly selected. By segmenting the blood vessels in retinal images and examining their morphological properties, a doctor can diagnose, screen, treat, and evaluate a variety of cardiovascular and ophthalmic diseases, such as diabetes, hypertension, arteriosclerosis, and choroidal neovascularization. In the experiment, 20 samples are used for training and 20 samples are used for testing. The size of the original images is 565 × 584. This number of samples is not enough to train a deep neural network model. Therefore, 190000 patches are randomly sampled from the 20 training images. Among them, 171000 patches are used for training and 19000 patches are held out for validation. The size of the patches fed to the network is 64 × 64. The segmentation result for an input image is shown in Figure 8. The first image is the original color image, the second image is the ground truth mask, and the third image is the segmentation result output by R2AU-Net; most of the fine vessel endings are still segmented. Table 2 shows the results of the comparative experiments on the DRIVE dataset, including F1-score, sensitivity, specificity, accuracy, and AUC. According to the experimental results, R2AU-Net performs better than the traditional methods and the original U-Net. Figures 9 and 10 show the AUC of the ROC curve and the PR curve of each network.
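The patch-based sampling described above can be sketched as follows. The paper does not specify the sampling procedure, so uniform random positions and an equal number of patches per image (190000 / 20 = 9500) are assumptions here, and the helper name `sample_patches` is hypothetical. The demo uses only two synthetic images and 50 patches each to stay small; the 9:1 split mirrors the 171000 / 19000 division.

```python
import numpy as np

def sample_patches(image, n_patches, patch_size, rng):
    """Crop n_patches square patches at uniformly random positions."""
    h, w = image.shape[:2]
    tops = rng.integers(0, h - patch_size + 1, size=n_patches)
    lefts = rng.integers(0, w - patch_size + 1, size=n_patches)
    return np.stack([
        image[top:top + patch_size, left:left + patch_size]
        for top, left in zip(tops, lefts)
    ])

rng = np.random.default_rng(0)
# Synthetic stand-ins for DRIVE images (584 rows by 565 columns).
images = [rng.random((584, 565)) for _ in range(2)]
patches = np.concatenate([sample_patches(img, 50, 64, rng) for img in images])
rng.shuffle(patches)                   # shuffle along the first axis, in place
n_train = len(patches) * 9 // 10       # 9:1 split, as in 171000 : 19000
train, held_out = patches[:n_train], patches[n_train:]
print(train.shape, held_out.shape)     # (90, 64, 64) (10, 64, 64)
```

Random cropping lets a small number of full-size images yield a large training set, at the cost of strong overlap between patches from the same image.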

Lung Segmentation.
The public datasets used in LUNA and the Kaggle Data Science Bowl 2017 are provided by the National Cancer Research Center of the United States. This dataset consists of 2D and 3D images. It contains 267 lung CT images with an original size of 512 × 512. Of these, 134 images are used for training, 54 images for validation, and 79 images for testing. Figure 11 shows the segmentation results of R2AU-Net on the lung dataset. The first column is the input image, the second column is the ground truth mask, and the third column is the lung segmentation output of R2AU-Net. The third image of the first row shows that even a very small lung region in a CT image can be segmented. The segmentation results of R2AU-Net are essentially the same as the ground truth images. Table 3 shows the performance comparison of R2AU-Net and other improved versions of U-Net. Figures 12 and 13 show the AUC of the ROC curve and the PR curve of each network.

Figure 11: Results of the lung region segmented by R2AU-Net (columns: input image, ground truth mask, output of R2AU-Net).

Conclusion
In this paper, R2AU-Net is proposed for medical image segmentation. The recurrent residual convolutional block is used to enhance the ability to capture context information, and AGs are added to the skip connections. The attention gates use deep features of the decoding path as a gating signal to modify shallow features and suppress feature responses in background areas, so that the network can obtain more accurate segmentation results. Moreover, BN is used in the upsampling process to accelerate convergence and improve the stability of the network. The experimental results on three datasets show that R2AU-Net performs well in medical image segmentation.

Conflicts of Interest
The authors declare that they have no conflicts of interest.