Retinal Vessel Segmentation by Deep Residual Learning with Wide Activation

Purpose. Retinal blood vessel segmentation is an important step in ophthalmological analysis. However, small vessels are difficult to segment accurately because of their low contrast and complex feature information. The objective of this study is to develop an improved retinal blood vessel segmentation structure (WA-Net) to overcome these challenges. Methods. This paper focuses mainly on the width of deep networks. The channels of the ResNet block were broadened to propagate more low-level features, and the identity mapping pathway was slimmed to maintain parameter complexity. A residual atrous spatial pyramid module was used to capture the retinal vessels at various scales. We applied weight normalization to eliminate the impact of the mini-batch and improve segmentation accuracy. The experiments were performed on the DRIVE and STARE datasets. To show the generalizability of WA-Net, we performed cross-training between datasets. Results. Within datasets, the global accuracy was 95.66% on DRIVE and 96.45% on STARE, and the specificity was 98.13% and 98.71%, respectively. The accuracy and area under the curve obtained between datasets differed by only 1% to 2% from the corresponding within-dataset performance. Conclusion. All the results show that WA-Net extracts more detailed blood vessels and achieves superior performance on retinal blood vessel segmentation tasks.


Introduction
Retinal blood vessel segmentation has important medical applications [1]. Analyzing changes in the retinal vascular structure is a key part of diagnosing diseases such as diabetic retinopathy (DR) [2], hypertension, and arteriosclerosis. However, retinal blood vessels form complex network structures with low contrast, which makes manual segmentation difficult.
Many methods have been proposed in recent years by researchers in the field. Broadly, these methods can be divided into two groups: unsupervised and supervised approaches [3]. The main unsupervised algorithms include matched filtering [4], morphological processing [5], vascular tracking [6], and model-based methods [7]. However, a severe limitation of these unsupervised methods is that they rely heavily on handcrafted features for vessel representation and segmentation. Besides, their parameters need to be elaborately designed. Therefore, unsupervised methods are inferior to supervised approaches in terms of segmentation accuracy.
Supervised learning, unlike unsupervised methods, requires manual annotations to build an optimal predictive model. Two components are needed: a feature extractor and a classifier. The features of the retinal vessels can be extracted by a Gabor filter [8], a Gaussian filter [9], and so on. In traditional machine learning, K-nearest neighbor (K-NN), SVM, AdaBoost, etc. are often used to train the classifier [10][11][12]. In the study conducted by Zhu et al. [13], filters, morphological operators, and linear detection operators were used to extract retinal vascular morphological features as prior information for the classifier. Based on a fully connected conditional random field (CRF) model, Orlando et al. [14] used a structured output support vector machine to learn model parameters and perform retinal vessel segmentation. For these traditional supervised methods, the final predictions are greatly influenced by the features used for classification. However, these features are often defined empirically, which limits generalization to unknown geometric transformations of blood vessels.
Given the aforementioned limitations, effective and automatic segmentation of retinal blood vessels remains a challenging task that warrants further study using novel techniques. Deep learning, with its powerful data representation capabilities, has been widely used in retinal image segmentation tasks. Fu et al. [15] formulated vessel segmentation as a holistically-nested edge detection (HED) problem and utilized a fully convolutional network (FCN) to generate the vessel probability map. Mo and Zhang [16] developed a deeply supervised FCN by leveraging multilevel hierarchical features. Liskowski and Krawiec [17] regarded blood vessel segmentation as a binary classification task, using a convolutional neural network (CNN) to perform pixel-by-pixel classification, and combined structured prediction methods to classify multiple pixels at the same time. The results obtained are better than those of most traditional algorithms. However, the network has tens of millions of parameters. More recently, the challenge has been to develop a simpler yet effective model that segments small vessels accurately. Jin et al. [18] proposed the deformable U-Net (DUNet), which integrates deformable convolution blocks to detect tiny vessels. It has fewer parameters but a longer computation time. Gu et al. [19] developed the context encoder network (CE-Net) to preserve spatial information for medical image segmentation. However, its accuracy on the vessel segmentation task is unsatisfactory. Azad et al. [20] took full advantage of U-Net [21], dense convolutions [22], and bidirectional ConvLSTM (BConvLSTM) [23] to segment vessels. They achieved a higher accuracy, but the excessive use of dense blocks (DB) consumes a great deal of memory. Budak et al. [24], in their recent work, showed that weak and thin vessels are better segmented by a cascaded CNN of lesser depth as long as feature utilization is improved.
For this reason, and motivated by the fact that wide activation, which focuses on the width of deep networks, propagates more low-level features with the same parameter complexity [25], we develop a retinal vessel segmentation method using deep residual learning based on wide activation. The remainder of this paper is organized as follows: Section 2 presents the proposed method, Section 3 provides implementation details, Section 4 analyzes the experimental results, and the discussion and conclusion are given in Section 5 and Section 6, respectively.

Proposed Method
An overview of the proposed segmentation model is shown in Figure 1. The original retinal image is preprocessed and input to WA-Net, and then the segmented image is output after network mapping.
This framework has two main modules, denoted as the wide activated module and the LASPP module, which are introduced in the following parts.

Wide Activated Module.
In a traditional convolutional neural network, the output of the $l$th layer is calculated by the following equation:
$$x_l = H(x_{l-1}),$$
where $x_l$ denotes the output of the $l$th layer, $x_{l-1}$ denotes the output of the preceding $(l-1)$th layer, and $H$ denotes a convolution, often activated with ReLU, where ReLU is defined as $f(x_{l-1}) = \max(0, x_{l-1})$.
In order to make full use of features, He et al. [26] proposed residual mapping to build deeper networks. Residual mapping is defined as follows:
$$Y(x_l) = F(x_l) + x_l,$$
where $x_l$ is the input, $Y(x_l)$ is the output mapping, and $F(x_l)$ is the residual mapping. The wide activated module (WDSR-A) [25] is a residual module based on wide activation, as shown in Figure 2. It mainly consists of four layers: two convolution layers, an activation layer (ReLU), and an identity layer. In practice, a 1 × 1 convolution is used in the identity layer when the channels do not match.

Original Residual Block.
For the original residual block (Figure 3(a)), suppose the width of the identity mapping pathway is $c_1$ and the width before activation in the residual block is $c_2$; then, $c_2 = c_1$, so the number of parameters in each original residual block is $2 \times c_1^2 \times k^2$, where $k$ is the kernel size. Figure 3(b) shows the residual block with wide activation. It is described by the following equation:
$$\hat{c}_2 = r \times \hat{c}_1,$$
where $r$ is the expansion factor before activation and $\hat{c}_1$ and $\hat{c}_2$ are the width of the identity mapping pathway in WDSR-A and the width before ReLU, respectively.

Wide Activated Residual Block.
When the patch size of the input is fixed, the computational complexity is proportional to the number of parameters. In order for WDSR-A to maintain the same complexity as the original residual block, the following equation must hold:
$$2 \times r \times \hat{c}_1^2 \times k^2 = 2 \times c_1^2 \times k^2.$$
According to the above equation, the width of the identity mapping pathway in WDSR-A needs to be slimmed by a factor of $\sqrt{r}$, that is, $\hat{c}_1 = c_1 / \sqrt{r}$, and the width before activation should be expanded by $\sqrt{r}$, that is, $\hat{c}_2 = \sqrt{r} \times c_1$. This paper takes $r = 4$.
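The parameter-matching argument above can be checked numerically. The following is a minimal sketch, assuming two k × k convolutions per block and ignoring biases; the function names are hypothetical and are not taken from the paper or from the WDSR code.

```python
import math


def residual_block_params(c1, k=3):
    """Weights in a plain residual block: two k x k convs, c1 -> c1 channels."""
    return 2 * c1 * c1 * k * k


def wide_block_params(c1_hat, r, k=3):
    """Weights in a wide-activation block: c1_hat -> r * c1_hat -> c1_hat."""
    c2_hat = r * c1_hat  # width before ReLU
    return 2 * c1_hat * c2_hat * k * k


def slimmed_widths(c1, r):
    """Return (c1_hat, c2_hat) that keep the parameter count of the plain block.

    Assumes r is a perfect square and c1 is divisible by sqrt(r),
    so that both widths are whole numbers of channels.
    """
    root = math.isqrt(r)
    if root * root != r or c1 % root != 0:
        raise ValueError("r must be a perfect square and sqrt(r) must divide c1")
    return c1 // root, c1 * root


if __name__ == "__main__":
    c1_hat, c2_hat = slimmed_widths(64, 4)
    print(c1_hat, c2_hat)
    print(residual_block_params(64), wide_block_params(c1_hat, 4))
```

With r = 4 and a plain width of 64, the identity pathway shrinks to 32 channels and the pre-activation width grows to 128, while the weight count is unchanged.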

LASPP Module.
A major challenge in vessel segmentation is how to capture the features of tiny blood vessels. This problem can be addressed by atrous convolution [27], which enlarges the receptive field without increasing the number of parameters. The principle of atrous convolution is to insert zeros between adjacent elements of the traditional convolutional kernel, i.e., to increase the dilation rate $d$. For each pixel $i$ of the output $y$, atrous convolution is expressed as
$$y[i] = \sum_{j=1}^{k} x[i + d \cdot j] \, w[j],$$
where $w$ denotes a filter, $k$ is the filter size, $x$ denotes the input, and $d$ is the dilation rate. As shown in Figure 4, the size of the convolution kernel is 3 × 3, and Figures 4(a)-4(c) correspond to atrous convolutions with $d = 1$, $d = 2$, and $d = 3$, respectively. Different dilation rates can be used to change the receptive field. For an atrous convolutional layer with dilation rate $d$ and kernel size $k$, the receptive field size $RF$ is calculated by the following equation:
$$RF = k + (k - 1)(d - 1).$$
For example, if the convolutional kernel size is 3 × 3 with dilation rate $d = 3$, the corresponding receptive field size is 7. Stacking multiple convolutional layers yields an even larger receptive field.
Atrous spatial pyramid pooling (ASPP) [28] is a model based on atrous convolution. It adopts several parallel convolutional layers with different dilation rates to improve segmentation performance. Inspired by this, at the bottom of the network, we designed an ASPP-like module (LASPP) to preserve multiscale features of blood vessels, as shown in Figure 5.
The dilation rates used in the four convolutional layers are $d = 2^i$, $i = 0, 1, 2, 3$, with kernel size 3 × 3 and ReLU activation. Finally, the features extracted with different dilation rates are added to generate the fusion result, which is sent to the decoding structure.
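To make the receptive-field arithmetic concrete, here is a minimal one-dimensional sketch in plain Python. It assumes the standard relation RF = k + (k - 1)(d - 1), which agrees with the worked example in the text (k = 3, d = 3 gives 7), uses zero-based filter indices, and computes only "valid" output positions; the function names are illustrative.

```python
def receptive_field(k, d):
    """Receptive field of one atrous layer with kernel size k and dilation d."""
    return k + (k - 1) * (d - 1)


def atrous_conv1d(x, w, d):
    """1D atrous convolution: y[i] = sum_j x[i + d * j] * w[j].

    Only positions where the dilated filter fits entirely inside x are
    computed, so the output is shorter than the input.
    """
    span = (len(w) - 1) * d  # distance between first and last tap
    return [
        sum(x[i + d * j] * w[j] for j in range(len(w)))
        for i in range(len(x) - span)
    ]


if __name__ == "__main__":
    print(receptive_field(3, 3))
    # The three taps land on x[0], x[3], and x[6].
    print(atrous_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], 3))
```

The filter still has three weights for every dilation rate; only the spacing of the taps changes, which is why the receptive field grows without extra parameters.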
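The fusion step can be sketched in one dimension as follows. This is only an illustration under stated assumptions: the four branches are taken to run in parallel on the same input (as in ASPP), each branch is a zero-padded 3-tap atrous convolution followed by ReLU, and the branch outputs are summed element-wise. The exact wiring of LASPP is given by Figure 5 of the paper, which is not reproduced here, and all names below are hypothetical.

```python
def atrous_branch(x, w, d):
    """Centered atrous convolution with zero padding, followed by ReLU.

    The output has the same length as x so that branches can be added.
    """
    half = len(w) // 2
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for j, wj in enumerate(w):
            idx = i + (j - half) * d
            if 0 <= idx < n:  # taps outside the signal read zero
                acc += wj * x[idx]
        out.append(max(acc, 0.0))  # ReLU
    return out


def laspp_1d(x, kernels):
    """Sum four branches with dilation rates d = 2**i, i = 0..3 (assumed parallel)."""
    rates = [2 ** i for i in range(4)]
    branches = [atrous_branch(x, w, d) for w, d in zip(kernels, rates)]
    return [sum(values) for values in zip(*branches)]


if __name__ == "__main__":
    impulse = [0] * 8 + [1] + [0] * 8
    fused = laspp_1d(impulse, [[1.0, 1.0, 1.0]] * 4)
    print(fused)
```

Feeding a single impulse shows how each branch spreads the same feature over a different distance (1, 2, 4, and 8 samples) before the results are added.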

Weight Normalization.
In training deep neural networks, batch normalization (BN) is frequently used after each convolution to address the internal covariate shift problem [29]. However, BN has the drawback of depending on the data in the mini-batch [25]. Therefore, we prefer to use weight normalization (WN) to speed up the convergence of the network and improve the accuracy of training and testing [30].
Weight normalization, in simple terms, is a reparameterization of the weight vectors in the CNN. Assume the output $y$ has the following form:
$$y = w \cdot x + b,$$
where $w$ is a $k$-dimensional weight vector, $b$ denotes a scalar bias term, and $x$ is a $k$-dimensional vector of input features. WN reparameterizes the weight vectors with new parameters by the following equation:
$$w = \frac{g}{\|N\|} N,$$
where $N$ is a $k$-dimensional vector, $g$ denotes a scalar, and $\|N\|$ is the Euclidean norm of $N$. With this formulation, we have $\|w\| = g$, which is independent of the parameter $N$ [30].
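The property that the norm of w equals g regardless of N can be verified with a few lines of plain Python. This is a minimal sketch of the reparameterization only, not of a training procedure; the function names are illustrative.

```python
import math


def euclidean_norm(vec):
    """Euclidean (L2) norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def weight_norm(n_vec, g):
    """Reparameterize a weight vector as w = g * N / ||N||."""
    norm = euclidean_norm(n_vec)
    if norm == 0.0:
        raise ValueError("N must be a nonzero vector")
    return [g * v / norm for v in n_vec]


def neuron_output(w, x, b):
    """y = w . x + b for a single neuron."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


if __name__ == "__main__":
    w = weight_norm([3.0, 4.0], g=2.0)
    print(w, euclidean_norm(w))
    # Rescaling N changes nothing: only its direction matters.
    print(weight_norm([30.0, 40.0], g=2.0))
```

Because the length of w is carried entirely by g and the direction entirely by N, the normalization does not depend on any mini-batch statistics.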

Network Structure: WA-Net.
Figure 6 illustrates the wide activation network (WA-Net). The architecture consists of two parts: an encoder and a decoder. In the encoder, the image patches are input to the network and batch-normalized (only once). Then, each WDSR-A module is followed by a 2 × 2 max pooling layer. In the decoder, Up + conv means that upsampling is followed by a 3 × 3 convolution without ReLU. The dashed line in Figure 6 represents a global shortcut with a 1 × 1 convolution and leaky ReLU. WDSR-A1 in the decoder is followed by another leaky ReLU. We add the outputs of the two leaky ReLUs and feed the sum into a 1 × 1 × 2 convolution layer with softmax for classification.
Figure 2: Schematic of WDSR-A. Identity represents the identity mapping layer, and Add denotes the element-wise summation over channels. In the dashed box, Conv stands for convolution, ReLU indicates the activation layer, C is the abbreviation of channel, and the two conical shapes are the convolution layers that expand and slim the channels, respectively. The principle of wide activation is to expand features before the activation layer without increasing computation. A further explanation is shown in Figure 3.
The motivation of leaky ReLU is to avoid zero gradients. It is defined as follows:
$$f(x_i) = \begin{cases} x_i, & x_i > 0, \\ a_i x_i, & x_i \le 0, \end{cases}$$
where $x_i$ is the input on the $i$th channel and $a_i$ is a coefficient controlling the slope of the negative part. It degenerates into ReLU when $a_i = 0$. This paper sets $a_i = 0.3$ for leaky ReLU. The specific parameter settings of each module are shown in Table 1.
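As a small illustration of the leaky ReLU described above, the sketch below applies the piecewise definition to scalar inputs with the slope a = 0.3 used in this paper. The function name is illustrative, and a single shared slope is assumed rather than one slope per channel.

```python
def leaky_relu(x, a=0.3):
    """Leaky ReLU: identity for positive inputs, slope a for the rest."""
    return x if x > 0 else a * x


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 1.5):
        print(value, leaky_relu(value), leaky_relu(value, a=0.0))
```

Setting a = 0 reproduces the ordinary ReLU, while any positive a keeps a nonzero gradient for negative inputs.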

Loss Function.
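In retinal images, the distribution of vessel and non-vessel pixels is unbalanced. In order to address this problem, a combined loss function is adopted. It is defined by the following equation:
$$L(y, \hat{y}) = L_{CE}(y, \hat{y}) + L_{DICE}(y, \hat{y}),$$
where $L_{CE}(y, \hat{y})$ denotes the cross-entropy, which is defined as follows:
$$L_{CE}(y, \hat{y}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right],$$
and $L_{DICE}(y, \hat{y})$ denotes a loss function based on the Dice coefficient [31]. It is defined by the following equation:
$$L_{DICE}(y, \hat{y}) = 1 - \frac{2 \sum_{i} y_i \hat{y}_i + k}{\sum_{i} y_i + \sum_{i} \hat{y}_i + k},$$
where $i$ indexes the pixels, $n$ is the number of pixels, $y_i$ is the ground truth, and $\hat{y}_i$ is the predicted result. $k$ is a smoothing value set to 1.0 to keep the function well defined.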
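A combined loss of this kind can be sketched in plain Python as below. The exact form used in the paper is not recoverable from the text, so this sketch assumes an unweighted sum of a mean binary cross-entropy and a Dice loss with smoothing value k = 1.0; the clipping constant `eps` and all function names are additions for illustration.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def dice_loss(y_true, y_pred, k=1.0):
    """1 - smoothed Dice coefficient between ground truth and prediction."""
    intersection = sum(y * p for y, p in zip(y_true, y_pred))
    return 1.0 - (2.0 * intersection + k) / (sum(y_true) + sum(y_pred) + k)


def combined_loss(y_true, y_pred, k=1.0):
    """Assumed combination: cross-entropy plus Dice loss, equal weights."""
    return cross_entropy(y_true, y_pred) + dice_loss(y_true, y_pred, k)


if __name__ == "__main__":
    truth = [1, 0, 1, 0]
    print(combined_loss(truth, [1.0, 0.0, 1.0, 0.0]))  # near zero
    print(combined_loss(truth, [0.5, 0.5, 0.5, 0.5]))  # uninformative guess
```

The Dice term is driven by the overlap with the vessel class only, which is why it helps when vessel pixels are a small fraction of the image.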

Implementation
The network model was implemented in PyCharm using Python 3.6 with TensorFlow 1.13. All experiments were conducted on a 64-bit Windows 10 laptop with an Intel Core i7-8750H CPU @ 2.20 GHz 2.21 GHz, 16 GB RAM, and an NVIDIA GeForce GTX 1050Ti GPU.

Data and Data Preprocessing.
The fundus images used in the experiment are from two public datasets: DRIVE (Digital Retinal Images for Vessel Extraction) [10] and STARE (Structured Analysis of the Retina) [32]. There are 40 images in the DRIVE dataset with a resolution of 565 × 584, while the STARE dataset consists of 20 images with a size of 605 × 700 pixels. Experts' manual labels are available as the ground truth in both datasets.
Figure 3: Principle of wide activation [25]. $C_i$ and $\hat{C}_i$, $i = 1, 2$, stand for channels, $r$ is the expansion factor, and $k \times k$ below Conv is the kernel size. In deep learning, the width refers to the number of channels. Wide activation focuses on the width (the C parameter) to improve feature utilization.
In deep convolutional neural networks, appropriate preprocessing allows better segmentation of retinal blood vessels. In this study, all images were preprocessed as follows: first, the RGB images were converted into grayscale and standardized. Then, contrast-limited adaptive histogram equalization (CLAHE), an image enhancement method, was used to improve the brightness and contrast of the image [33]. To further improve image quality, we applied gamma correction with $\gamma = 1.2$. The preprocessed images were divided into overlapping local patches using a 48 × 48 sliding window with a stride of 5. Randomly sampled, preprocessed patches are shown in Figure 7.

Training Details.
Adam [34] is a common optimizer that speeds up network convergence. The proposed network used Adam as the optimizer with an initial learning rate of 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\varepsilon = 10^{-8}$. The batch size and the number of epochs were set to 32 and 100, respectively. The He normal initialization method, proposed by He et al. [35], was used to initialize the kernel weights of the WDSR-A modules.
As for the division of datasets, the DRIVE dataset comes with a fixed split into a training set (20 images) and a testing set (the remaining 20 images). Since the STARE dataset does not explicitly provide training and testing sets, this paper selects the first 10 images for training and the remaining 10 for testing, following Wang et al. [36]. In this study, we used a patch-based strategy to reduce overfitting. During training, the DRIVE and STARE datasets were divided into 160,000 and 200,000 patches, respectively, of which 90% were selected for training and the other 10% were used for validation.
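The patch geometry and the 90/10 split can be illustrated with a short sketch. It enumerates every position of a 48 × 48 window moved with a stride of 5 over an image of DRIVE size (565 pixels wide, 584 pixels high) and computes the split sizes for the patch totals reported above. The exhaustive grid is an assumption for illustration only: the reported totals of 160,000 and 200,000 patches suggest the patches were sampled rather than enumerated, and the function names are hypothetical.

```python
def patch_origins(height, width, patch=48, stride=5):
    """Top-left (row, col) of every patch that fits fully inside the image."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


def split_counts(n_patches, train_percent=90):
    """Number of training and validation patches for a percentage split."""
    n_train = n_patches * train_percent // 100
    return n_train, n_patches - n_train


if __name__ == "__main__":
    origins = patch_origins(height=584, width=565)
    print(len(origins), origins[0], origins[-1])
    print(split_counts(160_000))  # DRIVE
    print(split_counts(200_000))  # STARE
```

A stride much smaller than the window size means neighbouring patches overlap heavily, which multiplies the number of training samples obtained from only 20 or 10 images.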

Performance Evaluation.
In order to quantitatively evaluate the blood vessel segmentation results, 5 evaluation metrics were used: accuracy (Acc), sensitivity (Sens), specificity (Spec), F1 score (F1), and area under the curve (AUC). The first 4 metrics are defined as
$$Acc = \frac{TP + TN}{TP + TN + FP + FN}, \quad Sens = \frac{TP}{TP + FN}, \quad Spec = \frac{TN}{TN + FP}, \quad F1 = \frac{2 \times TP}{2 \times TP + FP + FN},$$
where $TP$ is true positive, indicating correctly segmented vascular pixels, $TN$ is true negative, indicating correctly segmented nonvascular pixels, $FP$ is false positive, denoting nonvascular pixels incorrectly segmented as vascular, and $FN$ is false negative, denoting vascular pixels incorrectly segmented as nonvascular. The receiver operating characteristic (ROC) curve is also an important tool for measuring the quality of retinal blood vessel segmentation. The closer the area under the curve (AUC) is to 1, the more accurate the model is.
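The four count-based metrics follow directly from the confusion-matrix entries, as in the minimal sketch below. It uses the standard definitions of accuracy, sensitivity, specificity, and F1; the pixel counts in the usage example are made up for illustration and are not results from the paper.

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and F1 from pixel counts."""
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sens": tp / (tp + fn),          # recall on vessel pixels
        "Spec": tn / (tn + fp),          # recall on background pixels
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts: 90 vessel pixels, 910 background pixels.
    for name, value in segmentation_metrics(tp=80, tn=900, fp=10, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Because background pixels dominate, accuracy and specificity stay high even when a noticeable share of vessel pixels is missed, which is why sensitivity and F1 are reported alongside them.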

Vessel Segmentation Results.
The proposed WA-Net was compared with a custom U-Net on the DRIVE and STARE datasets. Figures 8 and 9 show the segmentation results. In these figures, the first row shows healthy retinal images, while the second shows unhealthy retinal images. As can be seen from Figures 8 and 9, the segmentation results were consistent with those of experts and even better on some small vascular branches. In addition, noise and erroneous segmentation were reduced. Furthermore, WA-Net obtained desirable segmentation results on weak vessels. On the DRIVE ...
Generally speaking, unsupervised methods do not perform as well as supervised approaches. As can be seen from Table 2, on the DRIVE dataset, F1, Sens, Spec, Acc, and AUC of WA-Net were 0.8222, 0.7875, 0.9813, 0.9566, and 0.9794, respectively, which were superior to most algorithms. F1 was the highest among the compared methods, and Sens was only slightly lower than that of CE-Net. For Acc and AUC, DCCMED-Net and WA-Net achieved the best results, which further verifies the effectiveness of increasing feature utilization. Acc of WA-Net was slightly lower than that of DCCMED-Net because the latter cascades three encoder-decoder modules, whereas WA-Net uses one. However, the Sens of DCCMED-Net was 6.11% lower than that of WA-Net. This phenomenon also appeared for WA-Net in Table 3. A possible explanation is the extreme imbalance between vessel and non-vessel pixels. In retinal images, vessel pixels occupy only a small proportion of the entire image. When a high global accuracy is achieved, a small number of mis-segmented pixels has a greater impact on the vessel pixels but less impact on the background pixels. Therefore, the sensitivity is relatively low.
As shown in Table 3, WA-Net achieved the highest Spec and Acc on the STARE dataset. Dense U-Net outperformed WA-Net by 1.74% in Sens, but Spec, Acc, and AUC of WA-Net were 1.49%, 1.07%, and 1.21% higher than those of Dense U-Net. To sum up, comparing all the listed results, WA-Net obtained competitive performance on the DRIVE and STARE datasets.

Segmentation Results between Datasets.
In our work, we also evaluated the generalization ability of the proposed network. Considering that the DRIVE and STARE datasets were acquired with two devices of clearly different physical resolutions, this paper examined a more demanding scenario in which one dataset was used for training and the other for testing. Verification was performed on both the DRIVE and STARE datasets, with the verification curves shown in Figure 10. The AUC and Acc obtained with cross-training are summarized in Table 4. As listed in Table 4, when tested on the STARE dataset, we obtained the highest AUC, while Acc was slightly lower than that of Yan et al. [42]. However, Acc and AUC tested on the DRIVE dataset were less satisfactory. One possible explanation is that the DRIVE dataset comprises many thin blood vessels, while the training set (STARE) mainly contains thick blood vessels [18]. On the other hand, we compared these results with the performance within datasets (Tables 2 and 3) and found that Acc and AUC between datasets differed by only 1% to 2% from the within-dataset values. All these results demonstrate the generalizability of WA-Net.

The Influence of Network Structures.
This section investigates the effects of the wide activation structure and atrous convolution. We modified WA-Net and report the performance indicators on the DRIVE dataset. In Network_1, the wide activated module in WA-Net was replaced with the preactivated residual module [46], BN-ReLU-Conv ⟶ BN-ReLU-Conv, and WN was removed from the LASPP module. Network_2 is based on Network_1, with the channels set to 32-64-128-256-128-64-32 as in a traditional CNN. Network_3 is based on WA-Net, with the number of layers in the original LASPP module reduced to 3. Network_4 is based on WA-Net, with the number of LASPP layers increased to 5, where the 5th layer uses $d = 16$.
The details of these networks are presented in Figure 11. Furthermore, the ROC curves of the different structures are shown in Figure 12. The closer the ROC curve is to the top-left corner of the ROC coordinates, the more accurate the model is. Figure 11 shows that the thick blood vessel branches could be segmented by all the modified networks, but WA-Net greatly reduced mis-segmentation. For complicated vessels, it is difficult for a network to separate small blood vessels when thick and thin vessels are close to each other. At the junctions of thick blood vessels, more information can be extracted by WA-Net, while Network_3 did not detect them at all. On multiconnected small blood vessels, it is difficult for the segmentation algorithms to proceed precisely because of local lesions. In summary, WA-Net is able to distinguish different vessels and performs better than the other networks.

Discussion
The difficulty in segmenting retinal blood vessels accurately from RGB images mainly arises from their low pixel intensity, which makes them (especially tiny vessels) similar to the background. To address this issue, we proposed WA-Net based on wide activation. With the help of the widened channels, more vessel information was transmitted to the subsequent atrous convolutional layers. Then, convolutional layers with different dilation rates were combined to capture contextual information on weak blood vessels of different sizes more effectively. Because some low-level information is difficult to recover, skip connections are used to directly fuse low-level and high-level information in the network structure. A comparison of the proposed algorithm with several other methods on the DRIVE and STARE datasets shows that WA-Net raises segmentation accuracy to a higher level, as shown in Tables 2 and 3. An algorithm of this kind is also expected to be robust and accurate. Therefore, cross-training between datasets was performed (Table 4), which verified the generalization performance of the model. Because the number of feature channels is doubled at each downsampling step, the slimmed identity mapping path reduced the parameters of WA-Net.
Despite the improvement achieved in this study, there are still several limitations. Deep learning segmentation methods produce the most accurate results when sufficient labeled data are available, but labeled retinal images are scarce; more effective data augmentation could be achieved using a generative adversarial network (GAN). In addition, although the parameters of WA-Net are reduced, the computation time is still long because of the introduction of weight normalization. Further investigation on how to effectively reduce the computation time is required. The segmentation of retinal blood vessels is the first and a critical step in automated vessel analysis. After blood vessel segmentation, more advanced analysis can be performed, for example, investigating its diagnostic and prognostic value for diseases such as arteriosclerosis and hypertension. Besides, in order to obtain more accurate results in medical image segmentation tasks, we plan to extend the WA-Net structure to three dimensions (3D), because three-dimensional images are becoming widely used in healthcare settings. This would be a fruitful area for further work. Beyond this, in the future, we can also consider adding diagnostic text to images and building new models to automatically diagnose diseases.

Conclusion
The proposed WA-Net showed excellent performance in capturing detailed blood vessels and superior performance on retinal vessel segmentation tasks, with demonstrated generalization ability. It is a general, high-performance framework that does not require any handcrafted features. Consequently, it has practical clinical value in automatic diagnosis systems and the potential to assist doctors in the diagnosis of fundus diseases.

Conflicts of Interest
The authors declare no conflicts of interest.