MC-UNet: Multimodule Concatenation Based on U-Shape Network for Retinal Blood Vessels Segmentation

Accurate retinal blood vessel segmentation is an important step in the clinical diagnosis of ophthalmic diseases. Many deep learning frameworks have been proposed for retinal blood vessel segmentation, but complex vascular structures and uncertain pathological features still make the task very challenging. This paper proposes a novel multimodule concatenation U-shaped network for retinal vessel segmentation based on atrous convolution and multikernel pooling. The proposed network retains three layers of the essential U-Net structure, in which atrous convolution combined with multikernel pooling blocks is designed to obtain more contextual information. A spatial attention module is concatenated with the dense atrous convolution module and the multikernel pooling module to form a multimodule concatenation, and different dilation rates are cascaded to acquire a larger receptive field in the atrous convolution. Extensive comparative experiments are conducted on the public retinal datasets DRIVE, STARE, and CHASE_DB1. The results show that the proposed method is effective, especially for microvessels. The code will be released at https://github.com/rocklijun/MC-UNet.


Introduction
The retina is one of the most important parts of the eye [1]. According to data published by the WHO, a growing number of people around the world are suffering from eye diseases [2]. The morphological characteristics of retinal blood vessels, such as branching patterns, angles, curvatures, widths, and lengths, are very helpful for ophthalmologists in diagnosing and assessing eye diseases [3,4]. By examining and analyzing the shape and structure of retinal blood vessels, an ophthalmologist can effectively screen and diagnose fundus-related diseases. Therefore, fundus examination is an important part of the ophthalmic examination, and extracting the shape and structure of retinal blood vessels is its most pivotal procedure for identifying diseases. In traditional medical practice, the retinal vascular area has to be manually segmented by experienced specialists, which is time-consuming and labor-intensive. Furthermore, blood vessels in retinal images are irregular and densely distributed; in particular, many small vessels have low contrast and are easily confused with the background. Although many retinal image segmentation methods have been presented, these issues still make blood vessel segmentation very challenging.
Retinal vessel segmentation methods comprise unsupervised methods and supervised learning methods; the difference between them is whether the input data carry manually segmented labels. Oliveira et al. [5] used two different combination algorithms, median ranking and weighted mean, to combine the Frangi filter, the matched filter, and the Gabor wavelet filter for blood vessel segmentation. Alhussein et al. [6] extracted enhanced images of thin and thick blood vessels, respectively, based on a Hessian matrix and an intensity transformation method. Azzopardi et al. [7] presented a selective-response vascular filter called B-COSFIRE for vascular segmentation. Safarzadeh et al. [8] used a multiscale line operator to detect blood vessels and K-means to segment them. These methods are efficient and fast in retinal vessel segmentation, but their performance depends on the choice of feature extractors. Supervised learning methods, in contrast, learn features from the original images and segmentation labels, which makes them more effective in segmentation tasks because they capture the input-output relationship. Supervised learning methods can be subdivided into deep learning methods and traditional machine learning methods. SVM and random forest, which belong to traditional machine learning models, need manually constructed features mapped to the target space. Wang [9] combined the characteristics of Gaussian scale space with the divergence characteristics of a vector field and used an SVM classifier to segment blood vessels. Zhu et al. [10] used CART and AdaBoost classifiers to classify pixels. Although traditional machine learning methods are easy to understand and interpret, they require hand-crafted feature types and feature selection methods, which limits their feature representation ability.
During the past few years, convolutional neural networks (CNNs) have made outstanding achievements in the automatic segmentation of retinal vessels. Compared with traditional machine learning, the many layers of a deep neural network give it strong nonlinear modelling and feature representation abilities. In particular, since U-Net [11] was proposed, various U-shaped networks based on encoder-decoder structures have achieved more accurate segmentation of biomedical images, and several excellent U-Net-based retinal vessel segmentation methods have been proposed. Li et al. [12] proposed a method that uses structural redundancy in the vascular network to find fuzzy vascular details in the segmented vessel images and expands the depth of the model through multiple iterations. Alom et al. [13] proposed two U-Net-based models, one recurrent and the other recurrent residual, using the functions of residual networks and RCNN. Zhuang [14] proposed a multi-U-Net chain containing multiple encoder-decoder branches. Yuan et al. [15] fused a multilevel attention module with U-Net to obtain fused low- and high-level information, alleviating network overfitting and improving generalization ability. Wang et al. [16] designed a dual-encoding U-Net with outstanding performance in improving retinal vessel segmentation. Guo et al. [17] added a spatial attention module to SA-UNet (Spatial Attention U-Net for Retinal Vessel Segmentation) to obtain more spatial-dimension features. IterMiUnet [18] was designed to alleviate the heavy parameterization of U-Net, inspired by IterNet [12] and MiUnet [19]. Zhang et al. [20] designed Bridge-Net to learn context-involved and noncontextual features and obtain superior segmentation results.
Although these U-Nets and their improved variants have been widely used for retinal vessel segmentation, they suffer from limitations and deficiencies. The encoder-decoder structure passes feature information within the same layer through skip connections, which may cause the loss of small and fragile vessels owing to limited comprehensive features. To alleviate these problems, we propose a multimodule concatenation network based on a U-shape network, called MC-UNet, for retinal vessel segmentation, which retains local and global information about both the main retinal vessels and the capillaries. The contributions of this paper are summarized as follows: (1) We propose a multikernel pooling U-shape network that retains three layers of the essential U-Net structure, in which atrous convolution combined with multikernel pooling blocks is designed to obtain more contextual information. (2) We design a multimodule concatenation network that contains local and global information to retain small vessels and high-level features. (3) The spatial attention module in the network is concatenated with the dense atrous convolution module and the multikernel pooling module, which further enhances the saliency of the target. (4) We evaluate and analyze the proposed MC-UNet on the challenging task of retinal blood vessel segmentation; the experimental results show that our method reaches the state-of-the-art level on the public datasets.

Methods
In this section, we introduce the proposed MC-UNet, shown in Figure 1. Our network retains three layers of the essential U-Net structure with a spatial attention module, as in SA-UNet [17]. There are three skip connections and a four-layer network structure in our method, different from the five-layer structure of the original U-Net. DropBlock and BN modules replace the convolution blocks of the original U-Net, which effectively prevents overfitting of the network and improves training speed; consequently, the network is better suited to small datasets. The main improvement of our method is to combine the dense atrous convolution module (DAC) and the multikernel pooling module (MKP), which join local and global information to a certain extent. Then, the spatial attention module in the network is concatenated with DAC and MKP. Each layer includes a 3 × 3 convolution, DropBlock, BN, ReLU, and a 2 × 2 max-pooling. We elaborate on MC-UNet in the following subsections.
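One encoder stage described above (3 × 3 convolution, DropBlock, BN, ReLU, followed by 2 × 2 max-pooling) can be sketched in PyTorch as follows. This is a hypothetical illustration, not the authors' released code; `nn.Dropout2d` stands in for DropBlock, which is not part of the PyTorch core.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder stage: (Conv3x3 -> Dropout2d -> BN -> ReLU) x 2, then 2x2 max-pool.
    Dropout2d is a stand-in for the paper's DropBlock regularization."""
    def __init__(self, in_ch, out_ch, drop_p=0.1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.Dropout2d(drop_p),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.Dropout2d(drop_p),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.features(x)      # full-resolution features for the skip connection
        return self.pool(skip), skip # halved resolution goes to the next stage

x = torch.randn(1, 3, 64, 64)
down, skip = EncoderBlock(3, 32)(x)
```

Returning both tensors mirrors the U-shape design: `skip` feeds the same-level decoder stage, while `down` continues along the encoder path.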

Spatial Attention Module.
The spatial attention module [21] generates a spatial attention map using max-pooling and average-pooling operations, selectively attending to the salient feature information in the image and ignoring other background information. The output feature SA is obtained by multiplying the input feature F by the attention map, as shown in the following formula:

SA = F ⊗ σ(f⁷([AvgPool(F); MaxPool(F)])),     (1)

where f⁷ and σ represent a 7 × 7 convolution operation and the sigmoid function, respectively. The spatial attention module is illustrated in Figure 2.
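Formula (1) can be sketched directly in PyTorch. This is a minimal illustration of the standard spatial attention design of [21], not the authors' implementation: the channel dimension is reduced by average- and max-pooling, the two maps are concatenated, passed through a 7 × 7 convolution, and squashed by a sigmoid to reweight the input.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """SA = F * sigmoid(conv7x7([AvgPool(F); MaxPool(F)])), pooling over channels."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)        # channel-wise average pool -> (N,1,H,W)
        mx, _ = f.max(dim=1, keepdim=True)       # channel-wise max pool -> (N,1,H,W)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return f * attn                          # broadcast the map over all channels

f = torch.randn(2, 16, 32, 32)
sa = SpatialAttention()(f)
```

The output keeps the input shape, so the module can be dropped into any point of the network without changing surrounding layers.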

Dense Atrous Convolution Module.
Atrous convolution is widely applied in semantic segmentation, target detection, and other tasks by many classical networks, such as DeepLab [22,23]. In deep learning algorithms, the pooling and convolution layers increase the receptive field of the feature map while reducing its size, and upsampling is then used to restore the image size. However, this process of shrinking and enlarging the feature map loses accuracy. Atrous convolution can increase the receptive field while maintaining the size of the feature map and reducing the computation of the network, so it can be used to replace downsampling and upsampling. The dilation rate of the atrous convolution can be set to different values, by which different receptive fields can be achieved for multiscale information:
y[i] = Σ_{k=1}^{K} x[i + r · k] · w[k],     (2)

where r represents the dilation rate and K is the length of the filter w. In particular, when r = 1, formula (2) is the standard convolution. The input feature map x is convolved with a filter w to obtain the output y. Figure 3 shows the schematic diagram of atrous convolution with dilation rates of 1, 3, and 5. Small dilation rates capture local information and large ones capture global information, allowing the network to extract both and thereby retain small vessels and high-level features. Compared with downsampling, atrous convolution can both enlarge the receptive field and accurately locate the target while reducing the loss of spatial resolution. The dense atrous convolution module [24], shown in Figure 4, is generated by integrating atrous convolutions with different dilation rates, which captures context information at different scales to achieve local or global information. By combining atrous convolutions with different dilation rates r_k, the output D of the atrous convolution module is obtained.
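The receptive-field growth claimed above is easy to verify with a little arithmetic: a k × k convolution with dilation rate r behaves like an effective kernel of size k + (k − 1)(r − 1), and stacking stride-1 convolutions adds their effective sizes. The following sketch works this out for the dilation rates 1, 3, and 5 shown in Figure 3:

```python
def effective_kernel(k, r):
    """Effective kernel size of a k x k convolution with dilation rate r."""
    return k + (k - 1) * (r - 1)

def stacked_receptive_field(kernel_dilation_pairs):
    """Receptive field of stride-1 convolutions applied in sequence."""
    rf = 1
    for k, r in kernel_dilation_pairs:
        rf += effective_kernel(k, r) - 1
    return rf

# A 3x3 convolution at dilation rates 1, 3, and 5:
sizes = [effective_kernel(3, r) for r in (1, 3, 5)]          # [3, 7, 11]
# Cascading all three, as in a dense atrous branch:
cascade = stacked_receptive_field([(3, 1), (3, 3), (3, 5)])  # 19
```

So a single dilated 3 × 3 kernel sees up to an 11 × 11 window, and the cascaded branch sees a 19 × 19 window, all without any loss of spatial resolution.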

Multikernel Pooling Module.
The multikernel pooling [24] module is adapted from the spatial pyramid [25]; it reduces the redundant information of the feature map as well as the amount of computation. According to the different kernel sizes, feature information from receptive fields of different sizes is extracted to increase the segmentation performance of the model. The multikernel pooling module is introduced into SA-UNet, relying on multiple different kernels to detect targets of different sizes.
Multikernel pooling can exploit more context information by combining standard max-pooling operations of different kernel sizes, as shown in Figure 5. The global context information is encoded into four receptive fields of different sizes: 2 × 2, 3 × 3, 5 × 5, and 6 × 6. Then, a 1 × 1 convolution is carried out to reduce the dimension of the feature maps, and upsampling restores features of the same size as the original feature map. Lastly, we concatenate the original features with the upsampled feature maps to obtain the output feature MKP of the multikernel pooling module:
MKP = [D; Up(f¹(P_{k₁}(D))); Up(f¹(P_{k₂}(D))); Up(f¹(P_{k₃}(D))); Up(f¹(P_{k₄}(D)))],     (3)

where f¹ and k_i denote the 1 × 1 convolution and the i-th kernel size, P_{k_i} denotes max-pooling with kernel k_i, Up denotes upsampling, and D is the output feature map of the dense atrous convolution module. The encoder-decoder structure passes feature information within the same layer through skip connections, which may cause the loss of small and fragile vessels owing to limited comprehensive features. The spatial attention module, the multikernel pooling module, and the dense atrous convolution module are complementary in the ability and scope of feature acquisition. Inspired by this, we propose a multimodule concatenation network for accurate retinal vessel segmentation. The output feature map F is obtained by concatenating the output features of the spatial attention module SA and the multikernel pooling module MKP.
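The pool-reduce-upsample-concatenate pipeline of formula (3) can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the released code; the per-branch channel count (`branch_ch`) and bilinear upsampling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiKernelPooling(nn.Module):
    """Max-pool D with kernels 2, 3, 5, 6; reduce each branch with a 1x1 conv;
    upsample back to D's spatial size; concatenate the branches with D."""
    def __init__(self, in_ch, branch_ch=1, kernels=(2, 3, 5, 6)):
        super().__init__()
        self.kernels = kernels
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, kernel_size=1) for _ in kernels)

    def forward(self, d):
        h, w = d.shape[2:]
        branches = [d]                                   # keep the original features
        for k, conv in zip(self.kernels, self.convs):
            p = F.max_pool2d(d, kernel_size=k, stride=k) # shrink by the kernel size
            p = F.interpolate(conv(p), size=(h, w),
                              mode="bilinear", align_corners=False)
            branches.append(p)
        return torch.cat(branches, dim=1)                # in_ch + 4 * branch_ch channels

d = torch.randn(1, 32, 48, 48)
mkp = MultiKernelPooling(32)(d)
```

Each branch summarizes the DAC output D at a different granularity, so the concatenated result carries both the original resolution and several coarser context views.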

Datasets.
We use three publicly available fundus datasets to evaluate the segmentation performance of MC-UNet: DRIVE [26] (Digital Retinal Images for Vessel Extraction), CHASE_DB1 [27] (Child Heart and Health Study in England), and STARE [28] (Structured Analysis of the Retina). The STARE dataset includes both pathologically abnormal and healthy retinal images, so it can be used to evaluate the model on abnormal fundus images. The specific information of the three datasets is shown in Table 1.

Evaluation Criteria.
The aim of retinal vessel binary segmentation is to divide each pixel of the input image into two categories: vessel (positive) and background (negative). By comparing the segmentation maps with the ground-truth labels, four indexes can be obtained: TP, TN, FP, and FN, where P denotes the number of white (vessel) pixels in the ground-truth image, N denotes the number of black (background) pixels, T stands for true, and F for false. TP is the number of white pixels correctly predicted as vessels, while TN is the number of black pixels correctly predicted as background. The values of TP, TN, FP, and FN are calculated over the total number of pixels in the ground-truth images.
On the basis of these four basic indexes, the accuracy (ACC), sensitivity (SEN), specificity (SP), area under the ROC curve (AUC), and F1-score can be calculated [17]. In our experiments, almost all of the above indicators are used. The calculation formulas are as follows:

ACC = (TP + TN)/(TP + TN + FP + FN),
SEN = TP/(TP + FN),
SP = TN/(TN + FP),
F1 = 2TP/(2TP + FP + FN).
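The four counting-based formulas above translate directly into code. The following minimal sketch computes them on flattened binary masks (1 = vessel, 0 = background); AUC is omitted because it requires probability scores rather than hard labels.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise ACC, SEN, SP, and F1 for flattened binary vessel maps."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),
        "SP":  tn / (tn + fp),
        "F1":  2 * tp / (2 * tp + fp + fn),
    }

# Toy 8-pixel example: 3 TP, 3 TN, 1 FP, 1 FN.
pred  = [1, 1, 0, 0, 1, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1, 1, 0]
m = segmentation_metrics(pred, truth)
```

In practice the same counts are accumulated over every pixel of every test image before the ratios are taken.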

Results
On the three datasets, we train and evaluate our method using the manual annotations of the first expert. Example segmentation results on the DRIVE, STARE, and CHASE_DB1 datasets are shown in Figure 6, which compares the segmentation results of several methods, including some based on U-Net, on the three datasets. Figures 6(a)-6(g) show the original color retinal image, the ground truth, and the segmentation maps of U-Net [11], CE-Net [24], LadderNet [14], SA-UNet [17], and the proposed method, respectively. All experiments were carried out on an NVIDIA Quadro M5000 and a 3.00 GHz PC. It can be observed in Figure 6 that the proposed MC-UNet performs better than the others, recovering more vessels in a representative patch (green disc) in the terminal regions of the vascular tree. We also compare the segmentation results on the three datasets with other methods using the five evaluation criteria shown in Table 2. Notably, MC-UNet achieves the best performance on DRIVE and CHASE_DB1, and it outperforms the backbone, which illustrates that the proposed framework is effective for vascular segmentation. Specifically, the SEN and AUC of our framework on the three datasets are higher than those of the backbone SA-UNet. Our method has the highest ACC, SP, and AUC on DRIVE and the highest ACC, SEN, and AUC on CHASE_DB1. Owing to the many lesion images in the STARE dataset, the sensitivity of MC-UNet there is not satisfactory; however, compared with the backbone network, the proposed MC-UNet still obtains better performance, which also verifies that our method is effective. Table 3 shows the ablation experiments of the proposed model, where MC-UNet is compared with the backbone network (SA-UNet), SA-UNet + DAC, and SA-UNet + MKP.
It is observed that the DAC module can effectively enhance specificity, reducing the false positive rate of blood vessels in the fundus image and thus the cost of misdiagnosis. The MKP module improves the AUC of the segmentation algorithm, making it more robust. Integrating the DAC and MKP modules into SA-UNet improves the overall segmentation effect, reduces the misdiagnosis rate, and improves the algorithm's ability to predict blood vessels. Figure 7 shows the change of ACC in the ablation experiments more intuitively, and Figure 8 compares the ROC curves of five different methods on the three datasets. The results show that our method achieves the best effect.
Table 4 compares the numbers of parameters to justify the MKP and DAC modules, showing that our method has far fewer parameters than the 7.76 M parameters of the original U-Net.

Discussions
There are three skip connections and four layers in our method, compared with four skip connections and five layers in the original U-Net. Although our network adds multiple integrated modules, it has far fewer parameters than the original U-Net with its 23 convolutional layers and is therefore a lightweight network. By integrating the DAC and MKP modules into SA-UNet, the proposed network effectively enhances specificity and reduces the false positive rate of blood vessels in the fundus image. However, the limited number of images available in the datasets restricts the performance of the algorithm. In the experiments, we set a fixed number of iterations to avoid overfitting, and we only consider data from the same domain. A domain adaptation method could be introduced to handle domain shift in cross-training and verification and to improve the robustness of the algorithm.

Conclusions
To address the limited comprehensive features extracted by the encoder-decoder structure of U-shaped networks, which may lead to the segmentation loss of small, fragile capillaries, a novel U-shape network named multimodule concatenation U-Net (MC-UNet), based on atrous convolution and multikernel pooling, is proposed for retinal vessel segmentation. The network retains local and global information about the main retinal vessels and capillaries. The DAC and MKP modules are introduced to increase the receptive field, improving the sensitivity of the algorithm, and to retain more detailed feature information, improving the accuracy of retinal vessel segmentation. Experimental results prove the effectiveness of the method, especially for microvessels. However, for image data with more severe lesions, a more robust framework still needs to be studied and discussed.