Semantic Segmentation Algorithm Based on Attention Mechanism and Transfer Learning

In this paper, we propose a semantic segmentation algorithm (RoadNet) that combines an auxiliary edge detection task with an attention mechanism. RoadNet improves the dispersion of the low-level features of the network model and further enhances the performance and applicability of the semantic segmentation algorithm. In RoadNet, a fully convolutional neural network is used as the basic model and is combined with the auxiliary loss used in image classification, multitask learning from machine learning, and the attention mechanism from natural language processing. To improve the generalization of the model, we select and analyze a proper domain difference measure. Subsequently, the context semantic distribution module and the annotation distribution loss are designed based on the context semantic encoding structure. The domain discriminator based on adversarial training and the adversarial training algorithm based on transfer learning are then integrated to provide a transfer learning-based semantic segmentation algorithm (TransRoadNet). The experimental results indicate that the proposed TransRoadNet and RoadNet outperform their equivalent comparison models.


Introduction
The application of deep learning methods to image classification has achieved remarkable results; see, e.g., [1][2][3][4]. Deep learning is also extensively applied in image semantic segmentation. For instance, E. Shelhamer et al. [5] present a fully convolutional neural network (FCN) for image segmentation. The combination of transfer learning and deep learning [6] is also used to introduce the concepts and methods of deep learning into a variety of research fields in the social and natural sciences [7][8][9]. However, in practical applications, the efficacy of deep learning based methods is challenged by the availability of enriched datasets and by the inference accuracy and generalization performance of the deep learning models. In this paper, we address these three challenges: (1) Although PASCAL VOC2012 [10], CIFAR-10/100 [11], and the Cityscapes dataset [12] provide powerful training data sources, multi-view observation scenes still need to be constructed for complex urban road images. In this paper, a high-altitude road monitoring dataset is built from Eagle Eye surveillance footage, and virtual images are collected through the communication between the graphics library and the monitored game to build the virtual dataset.
Hereafter, we refer to these enriched datasets as Surv-Cityscapes and Virt-Cityscapes. (2) Regarding the attention mechanism, and to improve the dispersion of the low-level features of the network, the auxiliary task branch for edge detection is designed based on the shape and edge information of objects. Meanwhile, an auxiliary task learning module and an attention-constant residual network are constructed to form a semantic segmentation model, namely, RoadNet. To enlarge the receptive field of the semantic segmentation task, global pooling and comprehensive cascading are used to improve the atrous spatial pyramid pooling (ASPP), resulting in a cascaded atrous spatial pyramid pooling.
(3) We further investigate the transfer learning algorithm for RoadNet from the perspectives of domain difference measurement, semantic distribution loss, and adversarial learning and then design a semantic segmentation model based on transfer learning, namely, TransRoadNet. TransRoadNet effectively reduces the performance loss of the basic model, RoadNet, when it is migrated and deployed on different data (i.e., Cityscapes to Virt-Cityscapes and/or Surv-Cityscapes to Cityscapes).

Attention Mechanism.
Chen et al. [13] introduce conditional random fields into the FCN as a post-processing algorithm. Zhao et al. [14] design a pyramid pooling module that aggregates the context information of different regions by combining four feature maps of different scales, improving the ability of the neural network to obtain global information. The Regional-Convolutional Neural Network (R-CNN) in [15] triggers the application of convolutional neural networks to target detection based on candidate regions. He et al. [16] suggest using shared convolutions to speed up the calculation of R-CNN. The region-of-interest pooling layer is designed by Girshick [17] based on spatial pyramid pooling and is able to pool the considered regions of different sizes into a fixed-size feature vector. Ren et al. [18] suggest handing over the task of finding candidate target areas to a deep convolutional neural network and propose the Region Proposal Network (RPN). Further, He et al. [19] add a network branch based on RPN to predict the segmentation mask of the target object. They thereby extend their method from simple target recognition to instance segmentation. Current research results are greatly influenced by these ideas [20,21].
In this line of work, the attention mechanism is utilized to explicitly model the interdependence between the semantic features of $F(x_t, W_t)$ and $x_t$. This is done in the attention residual module (ARM), which combines the residual module with the self-attention mechanism. Because the channel maps of relevant semantics are adaptively enhanced, the ARM can replace the feature fusion in the original residual network and further enhance the ability of the residual module to express the relevant semantics.

Receptive Fields and Auxiliary Tasks.
From the multitasking perspective, Badrinarayanan et al. [22] introduce the encoder-decoder structure into the FCN, where the pooling layer indices are retained in the encoding stage to store more image information. In the decoding stage, the pooling layer indices are used to restore the lost image information.
Holmstrom [23] indicated that occasionally adding noise during training can enhance the generalization capability of the network model. In contrast to other methods, which focus on enhancing the training effect of the auxiliary tasks, RoadNet focuses on enhancing the training effect of the main task. In this context, edge detection is considered as the auxiliary task. The low-level shared network mainly considers the edge and shape information of the object; hence, it can obtain more features regarding the differences between object categories. The annotated images required for edge detection can be obtained directly from the annotated images for semantic segmentation.
Regarding the receptive field, Fisher and Koltun [24] showed that FCN upsampling is unable to restore, without loss, the information discarded by the downsampling of the pooling layers. To address this issue, they suggest atrous convolution, where the original convolution range is extended, thus increasing the receptive field of the network. ASPP is also applied by several researchers; see, e.g., [25].
Here, we combine the cascading idea in DenseASPP with the global pooling branch in ASPPv2 and propose the cascade atrous spatial pyramid pooling (CascadeASPP). In our proposed design, atrous convolutions with multiple atrous rates are connected step by step. This provides a larger receptive field, improves the pixel sampling density of the atrous convolution, and covers more receptive field sizes, which gives a higher level of size invariance. Moreover, to tackle the degradation problem of atrous convolution, the global context information is obtained through the global pooling branch.

Transfer Learning and GAN.
To find a suitable domain difference measure, we train the following three methods on the basic FCN network and RoadNet with the abovementioned three transfer learning datasets: (1) Correlation Alignment (CORAL), proposed by Sun et al., which is an unsupervised domain adaptation algorithm [26]; (2) Maximum Mean Discrepancy (MMD), one of the most commonly used distance measures in transfer learning [27]; and (3) Contrastive Domain Discrepancy (CDD), which adds category information to MMD and hence measures the intraclass and interclass differences across domains [28].
In feature-representation-based transfer, the best representation is found through feature transformation. The context semantic encoding (CSE) [29] captures the global context information of the scene and improves the scene-related feature maps. Nevertheless, the context semantic encoding only predicts the existence of the categories as prior knowledge of the scene, which is an obvious defect. Hence, a semantic distribution loss is proposed to replace the semantic encoding loss. In particular, the proposed semantic distribution loss requires the model to predict both the existence of the categories and the proportion of each category in the image, adding more prior knowledge of scenes and of the relationships between categories to the model.
The generative adversarial network (GAN) is a network model proposed by Goodfellow et al. [30]. Compared with the direct use of a loss function, its discriminator network can better capture global information. TransRoadNet integrates the domain adversarial idea of GAN and replaces the image generation network in GAN with source and target domain feature extractors that extract the image features. The task of the discriminator network is to determine whether the extracted image features come from source-domain or target-domain images.
The encoder extracts domain-invariant features as far as possible so that the discriminator cannot distinguish between the two domains. Meanwhile, the discriminator tries to distinguish the two domains as well as possible, which results in adversarial training.
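As an illustration of the MMD measure discussed in this section, the following minimal sketch computes a biased estimate of the squared MMD between two small sets of feature vectors. The RBF kernel, the bandwidth `gamma`, the function names, and the toy feature values are all assumptions made for this example; the paper does not state which kernel is used.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (kernel choice is an assumption)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two samples of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Hypothetical 2-D features standing in for encoder outputs.
source_feats = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
near_target = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.1]]
far_target = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

mmd_near = mmd_squared(source_feats, near_target)
mmd_far = mmd_squared(source_feats, far_target)
```

A target sample that lies close to the source sample gives a small value and a distant one gives a larger value, which is why such a measure can be added to the training loss as a domain difference term.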

Attention Residual Module.
The original residual module is shown in Figure 1(a) and is expressed as
$$y_l = F(x_l, W_l), \qquad x_{l+1} = f\big(h(x_l) + y_l\big), \tag{1}$$
where $x_l$ and $x_{l+1}$ represent the input and output of the $l$-th layer, respectively, $F$ is the residual function with weights $W_l$, $h$ denotes the identity mapping function, and $f$ is the rectified linear activation function. Although the identity mapping function in the residual module can ensure no loss in the information flow, the information flow of the entire network still suffers loss due to the activation function. Therefore, $f$ is also made an identity mapping function to obtain an enhanced residual module, namely, the identity residual module [31], which ensures that information flows between the layers without loss (Figure 1(b)). The mathematical expression is as follows:
$$x_{l+1} = x_l + F(x_l, W_l). \tag{2}$$
Applying Equation (2) recursively, the input of any deeper layer $L$ is $x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$. Based on the backpropagation chain rule, the following partial derivative of the loss $\varepsilon$ is obtained:
$$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right). \tag{3}$$
Equation (3) indicates that the loss gradient can be transferred to any residual module without loss. Moreover, the loss gradient of any residual module can be propagated without loss to the remaining residual modules; hence, the probability of vanishing gradients is reduced.
Nevertheless, if each channel of the feature map is regarded as the semantic feature response map of a segmentation target, there must be a correlation between the response maps of the semantic features of different segmentation targets in the image. The semantic features of $x_l$ and $y_l$ in the residual module are not consistent and should not be added directly. Hence, the self-attention mechanism is inserted into the fusion of $x_l$ and $y_l$ in the identity residual module to explicitly model the interdependence between semantic features. Using the interdependence between the channels, it is possible to emphasize the interdependent features and improve the representation of specific semantic features. The input feature map of the attention residual module (Figure 2) is $X \in \mathbb{R}^{C \times H \times W}$. A new feature map, $Y \in \mathbb{R}^{C \times H \times W}$, is obtained after two rounds of batch normalization, convolution, and activation function. $X$ and $Y$ are then reshaped into $X' \in \mathbb{R}^{C \times N}$ and $Y' \in \mathbb{R}^{C \times N}$, respectively, where $N = H \times W$. Matrix multiplication is performed between $X'$ and the transpose of $Y'$. After exponential normalization (softmax), the channel attention map $A \in \mathbb{R}^{C \times C}$ is finally obtained:
$$a_{i,j} = \frac{\exp(X'_i \cdot Y'_j)}{\sum_{k=1}^{C} \exp(X'_i \cdot Y'_k)},$$
where $a_{i,j}$ represents the influence factor of the $i$-th channel of $X$ on the $j$-th channel of $Y$. Matrix multiplication is conducted on $A$ and $Y'$, and the result is reshaped into $E \in \mathbb{R}^{C \times H \times W}$ as the improved feature map. The final output feature map, $O \in \mathbb{R}^{C \times H \times W}$, is then obtained by the element-wise addition of $E$ and $X$.
Figure 3 represents the RoadNet structure. In particular, the training signal of the auxiliary task carries specific domain information that improves the generalization of the main task. Following the pyramid network structure of FPN [32,33], a semantic segmentation network model is designed to test the auxiliary tasks, including a top-down basic network, a horizontal connection, and a bottom-up edge detection auxiliary network.
The accurate edge detail information is obtained from the shallow features, and the semantic information is obtained from the deep features. Consequently, the lack of image detail information in the original semantic segmentation network is remedied. The network takes an image of any size as input and calculates feature maps at multiple scaling ratios using the basic network. The network is divided into five stages based on the size of the feature map. The relative scale of the feature map output by the last residual module to the input image in each stage is 4, 8, 16, and 32, respectively.
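Returning to the attention residual module described earlier in this section, the channel attention fusion of the two feature maps can be sketched as follows. This is a minimal illustration and not the authors' implementation: the feature maps are assumed to be already flattened to C × N nested lists, the softmax is assumed to be taken over each row of the channel affinity matrix, and the function names and toy values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable exponential normalization of one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention_fusion(x, y):
    """Fuse x and y (both C x N, with N = H * W) through a C x C channel attention map.

    Returns the attention map A and the output O = A @ y + x.
    The row-wise normalization axis is an assumption of this sketch.
    """
    channels = len(x)
    positions = len(x[0])
    # Channel affinity: dot product between channel i of x and channel j of y.
    energy = [[sum(a * b for a, b in zip(x[i], y[j])) for j in range(channels)]
              for i in range(channels)]
    attention = [softmax(row) for row in energy]
    # Re-weight the channels of y with the attention map.
    enhanced = [[sum(attention[i][j] * y[j][p] for j in range(channels))
                 for p in range(positions)]
                for i in range(channels)]
    # Element-wise addition with the shortcut input x.
    output = [[enhanced[i][p] + x[i][p] for p in range(positions)]
              for i in range(channels)]
    return attention, output


# Toy feature maps with C = 2 channels and N = 3 spatial positions.
x_flat = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
y_flat = [[0.2, 0.1, 0.4], [1.0, 0.0, 0.3]]
attention_map, fused = channel_attention_fusion(x_flat, y_flat)
```

Each row of `attention_map` sums to one, so every output channel is the shortcut input plus a convex combination of the channels of `y_flat`.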

RoadNet and Auxiliary Edge Detection Tasks.
By upsampling the feature maps of the high-level feature pyramid, the edge detection auxiliary network restores their resolution. The basic network is connected with the edge detection auxiliary network through horizontal connections to merge the feature maps of the same size. Furthermore, using the Canny algorithm, the annotated image of the edge detection auxiliary network is obtained from the annotated image of the semantic segmentation [34]. The loss function of the edge detection network is the multi-class empirical cross-entropy, which first normalizes the predicted feature map exponentially. The calculation formula is
$$D_{i,j,n} = \frac{\exp(X_{i,j,n})}{\sum_{c=1}^{C} \exp(X_{i,j,c})}.$$
Then, the cross-entropy is calculated as
$$L_{\mathrm{edge}} = -\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \log D_{i,j,Y_{i,j}},$$
where $L_{\mathrm{edge}}$ denotes the edge detection loss, $Y_{i,j}$ is the label of pixel $(i, j)$ in the annotated image, $D_{i,j,n}$ represents pixel $(i, j)$ of the $n$-th channel of the image after the exponential normalization, $X_{i,j,n}$ is pixel $(i, j)$ of the $n$-th channel of the image, $M$ is the image length, $N$ is the image width, and $C$ is the category number.
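The two steps above, deriving edge annotations from a segmentation annotation and scoring a prediction with a per-pixel cross-entropy, can be sketched as follows. Note that the edge extraction here is a simple neighbour-difference rule used as a stand-in for the Canny algorithm named in the text, and that averaging the loss over all pixels is an assumption; the function names and toy inputs are hypothetical.

```python
import math


def edge_labels(mask):
    """Binary edge map: 1 where a pixel's class differs from its right or lower neighbour.

    This neighbour-difference rule is a simplified substitute for Canny edge detection.
    """
    height, width = len(mask), len(mask[0])
    edges = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            differs_right = j + 1 < width and mask[i][j] != mask[i][j + 1]
            differs_down = i + 1 < height and mask[i][j] != mask[i + 1][j]
            if differs_right or differs_down:
                edges[i][j] = 1
    return edges


def pixel_cross_entropy(logits, labels):
    """Mean multi-class cross-entropy; logits[i][j] holds C scores, labels[i][j] a class index."""
    total = 0.0
    count = 0
    for logit_row, label_row in zip(logits, labels):
        for scores, label in zip(logit_row, label_row):
            peak = max(scores)
            log_normalizer = peak + math.log(sum(math.exp(s - peak) for s in scores))
            total += log_normalizer - scores[label]  # equals -log softmax(scores)[label]
            count += 1
    return total / count


# Toy 3 x 3 segmentation annotation with three categories.
segmentation = [[0, 0, 1],
                [0, 0, 1],
                [2, 2, 2]]
edges = edge_labels(segmentation)

# An uninformative two-channel (edge / non-edge) prediction.
uniform_logits = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
loss = pixel_cross_entropy(uniform_logits, edges)
```

With uniform scores over two classes the loss equals log 2, which is a convenient sanity check for the exponential normalization.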

CascadeASPP.
ASPP obtains a large receptive field; however, a great deal of image information is lost in the calculation because of the low pixel sampling rate. For example, the receptive field size is 13 for a 3 × 3 atrous convolution with an atrous rate of 6; however, only 9 pixels are sampled for calculation, so the pixel sampling rate is about 0.05 (9 of 13 × 13 = 169 pixels). By connecting two convolutions with an atrous rate of 3 in series, the receptive field size is also 13, while 25 pixels are sampled for calculation. The pixel sampling rate is then about 0.15, which is nearly three times that of the former. With a higher atrous rate, this loss becomes more pronounced, and the proposed model effectively overcomes it.
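The arithmetic in this example can be checked with a short sketch that enumerates which input offsets a stack of atrous convolutions actually touches. The helper names are hypothetical, and the sketch assumes square kernels with stride 1, so that the 2-D sampling pattern is the Cartesian product of the 1-D pattern.

```python
def cascade_offsets_1d(rates, kernel_size=3):
    """Input offsets (relative to the output pixel) reached by stacked atrous convolutions in 1-D."""
    half = kernel_size // 2
    offsets = {0}
    for rate in rates:
        taps = [k * rate for k in range(-half, half + 1)]
        offsets = {offset + tap for offset in offsets for tap in taps}
    return offsets


def sampling_stats(rates, kernel_size=3):
    """Return (receptive field side, sampled pixel count, sampling rate) for a 2-D cascade."""
    offsets = cascade_offsets_1d(rates, kernel_size)
    field = max(offsets) - min(offsets) + 1
    sampled = len(offsets) ** 2  # square grid of distinct sampled positions
    return field, sampled, sampled / field ** 2


single = sampling_stats([6])      # one 3 x 3 convolution with atrous rate 6
cascade = sampling_stats([3, 3])  # two 3 x 3 convolutions with atrous rate 3 in series
```

Here `single` evaluates to a receptive field of 13 with 9 sampled pixels and `cascade` to a receptive field of 13 with 25 sampled pixels, matching the figures quoted in the text.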
The global pooling branch and all atrous convolutions are cascaded in CascadeASPP (Figure 4). After 1 × 1 convolution and batch normalization, the output of the global pooling branch is upsampled to the required spatial dimension. For feature fusion, it is then merged with the outputs of the atrous convolutions with different atrous rates. Through the cascading of different atrous rates, 13 sizes of receptive fields are covered. Meanwhile, the pixels within the receptive field are sampled with a higher density.

Transfer Learning Mechanism.
Using the feature-representation transfer technique, the difference between the target domain and the source domain is added to the loss function of the network model. Thus, the difference between the features of the target domain and those of the source domain is minimized through model training. After comparative testing, the MMD difference measure is selected as the loss function to design a context semantic distribution (CSD) module. The structure is illustrated in Figure 5. The input feature map of the context semantic distribution module passes through two fully connected layers. One fully connected layer outputs the proportion of each category in the scene, i.e., the category distribution information. The semantic distribution loss is calculated between this category distribution information and the annotated image. The other fully connected layer outputs the scaling factors of the input feature map, and the input feature map is multiplied by the scaling factors channel by channel to give the output of the module. The aim is to strengthen the feature maps related to the current scene based on the prior knowledge of the scene and to weaken the feature maps that are not related to the current scene. The category distribution information of the model inference graph is then determined, and the semantic distribution loss is calculated against the category distribution information of the annotated image. The semantic distribution information of the annotated image is denoted as the feature vector $p$ with length $C$. Each value $p_c$ indicates the ratio of the pixels occupied by category $c$ in the annotated image to the total pixels of the image (i.e., the percentage of the image occupied by each category).
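The quantities handled by the CSD module described above can be sketched in a few lines: the category proportion vector of a label map, a cross-entropy between two such distributions, and the channel-wise scaling of a feature map. This is a minimal sketch under stated assumptions; the function names, the small `eps` added inside the logarithm, and the toy values are hypothetical and not taken from the paper.

```python
import math


def category_distribution(label_map, num_classes):
    """Fraction of pixels belonging to each category in an H x W label map."""
    counts = [0] * num_classes
    total = 0
    for row in label_map:
        for label in row:
            counts[label] += 1
            total += 1
    return [count / total for count in counts]


def distribution_cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between the annotated distribution p and the predicted distribution q."""
    return -sum(pc * math.log(qc + eps) for pc, qc in zip(p, q))


def scale_channels(feature_map, scales):
    """Multiply every channel of a C x N feature map by its scaling factor."""
    return [[scale * value for value in channel]
            for channel, scale in zip(feature_map, scales)]


# Toy annotated image with three categories.
annotation = [[0, 0, 1],
              [0, 2, 1]]
p = category_distribution(annotation, num_classes=3)

# A hypothetical uniform prediction, as if output by the category branch.
q_uniform = [1 / 3, 1 / 3, 1 / 3]
loss_uniform = distribution_cross_entropy(p, q_uniform)
loss_exact = distribution_cross_entropy(p, p)

# Hypothetical scaling factors, as if output by the second fully connected layer.
scaled = scale_channels([[1.0, 2.0], [3.0, 4.0]], [0.5, 2.0])
```

The loss is smallest when the predicted proportions match the annotated ones, so a prediction that only gets category existence right is still penalized.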
The calculation formula is stated as
$$p_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \delta(Y_{i,j}, c),$$
where
$$\delta(a, b) = \begin{cases} 1, & a = b, \\ 0, & a \neq b. \end{cases}$$
The semantic distribution information of the inference graph is then determined as
$$p'_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \delta(Y'_{i,j}, c),$$
where $c = 1, \ldots, C$, $C$ is the total number of categories in the source domain dataset, $Y$ is the annotated image, $Y'$ is the model inference graph, $H$ is the pixel height of the annotated image, and $W$ is the pixel width of the annotated image. Using a multi-class cross-entropy loss function, the semantic distribution loss is determined for the semantic distribution information of the annotated image and the inference graph, which is
$$L_{\mathrm{SD}} = -\sum_{c=1}^{C} p_c \log p'_c.$$

TransRoadNet.
To simplify the training process, a gradient inversion layer is added between the domain discriminator and the basic network as a connection layer. In forward propagation, the gradient inversion layer corresponds to the identity mapping: it performs no other operation, and the input is passed directly to the next layer. During backpropagation, the gradient inversion layer multiplies the gradient obtained from the next layer by −1 before passing it to the previous layer. The structure of TransRoadNet is illustrated in Figure 6.
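The behaviour of the gradient inversion layer can be sketched without any deep learning framework. The class below is a hypothetical stand-in written for illustration only: it handles plain lists of numbers, and the optional `scale` argument is an assumption of this sketch (a value of 1.0 reproduces the multiplication by −1 described above).

```python
class GradientReversal:
    """Identity in the forward pass; flips the sign of the gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # scale = 1.0 corresponds to the plain multiplication by -1.
        self.scale = scale

    def forward(self, features):
        # No operation other than passing the input to the next layer.
        return list(features)

    def backward(self, grad_output):
        # Gradient arriving from the domain discriminator, negated for the basic network.
        return [-self.scale * g for g in grad_output]


layer = GradientReversal()
features = [0.3, -1.2, 0.8]
passed_on = layer.forward(features)

grad_from_discriminator = [0.5, -0.25, 1.0]
grad_to_encoder = layer.backward(grad_from_discriminator)
```

Because the encoder receives the negated gradient, one ordinary backward pass updates the discriminator to separate the domains while pushing the encoder towards features that the discriminator cannot separate.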

Datasets.
Following the Cityscapes dataset [12], surveillance video from an Eagle Eye camera is used as the source for collecting urban road traffic images. Frames are extracted from the video at fixed intervals, and a total of 400 images are collected. Subsequently, the dataset is divided into a training set of 200 images and a test set of 200 images (i.e., a ratio of 5 : 5). This dataset is referred to as Surv-Cityscapes.
Grand Theft Auto V (GTA5) is selected as the virtual environment for collecting the virtual dataset. GTA5 is launched from RenderDoc at a resolution of 1920 × 1080. The character is controlled to drive a vehicle, and the first-person view is selected. In total, 4000 images are collected and randomly divided into 2000 training images and 2000 test images (a ratio of 5 : 5); this dataset is referred to as Virt-Cityscapes.
Principal component (PCA) jittering, random image cropping, and other algorithms are also used to augment the datasets and to create transfer learning datasets from these three city image datasets.
There are 50,000 training images and 10,000 test images in the CIFAR dataset. The CIFAR-100 dataset includes 100 classes, each containing 600 images. Each category includes 500 training images and 100 test images. The PASCAL VOC2012 dataset supports image recognition tasks such as classification, target detection, and semantic segmentation. In our experiment, we use the semantic segmentation sub-dataset of PASCAL VOC2012.

Attention Residual Module Testing.
For the first residual module in the attention residual network, the channel correlation between $F(x_1, W_1)$ and $x_1$ is obtained. The channel correlation heatmap is visualized in Figure 7. This figure illustrates the correlation between any two channels of the two feature maps, represented by the color of the corresponding cell. A higher (lower) correlation is shown by a lighter (darker) color. As seen in Figure 7, the color of the correlation heatmap becomes significantly lighter after the attention mechanism is applied. This means that the correlation between the feature map channels is significantly enhanced; therefore, the correlation between features is improved by the attention residual module. The original residual modules are stacked to obtain ResNet, the identity residual modules are stacked to obtain IdentityResNet, and the attention residual modules are stacked to obtain AttentionResNet. The weights are initialized using the Xavier algorithm [35], and the ResNet model trained on the ImageNet dataset is used as the pretraining model. The test results are presented in Table 1.
On the CIFAR-100 dataset, the 50-layer attention residual network is 1.45% lower than the original residual network, and the 101-layer attention residual network is 1.49% lower than the original residual network (Table 1). It is also seen that, using the attention residual module, the probability of convergence problems and degradation problems is greatly reduced.

CascadeASPP Testing.
Here, we compare the FCN basic model, FCN-ASPP, and FCN-CascadeASPP on multiple datasets. The results of these comparisons are shown in Table 2. As can be seen, the model evaluation metric with the ASPP structure is greatly enhanced in comparison with the basic model on all datasets and is further enhanced by using CascadeASPP. These results suggest that the context mechanism is an important factor and that a larger receptive field is essential for capturing further contextual information and prior knowledge of the scene.

CSD Testing.
The encoder in RoadNet is the top-down basic network, while the bottom-up semantic segmentation main network is the decoder. The context semantic encoding module and the context semantic distribution module are, respectively, added to the FCN and RoadNet, and the new models are referred to as FCN-CSD, FCN-CSE, RoadNet-CSD, and RoadNet-CSE. We examine these models on the three transfer learning datasets. The test results shown in Table 3 suggest the following: (1) The context semantic distribution module yields only 0.2% and 0.4% performance improvement on the Surv-Cityscapes transfer learning dataset and around 3% performance improvement on both the Cityscapes and Virt-Cityscapes transfer learning datasets. The reason is the fixed recording position and angle of the Surv-Cityscapes surveillance camera, which results in a fixed image scene whose prior knowledge is relatively simple. Nevertheless, the model performance is improved by the context semantic distribution module as it adds more scene prior information to the model. (2) For the three transfer learning datasets and the proposed network model, a performance improvement of about 0.2% to 3% is achieved by adding the context semantic distribution module and transferring the model. The results also indicate a further performance improvement over the context semantic encoding module. This validates the effectiveness of the context semantic distribution module.

RoadNet Testing.
To obtain a larger receptive field, CascadeASPP is added to the skip connection of RoadNet.
To compare with EncNet, the number of training epochs is kept consistent and set to 62500. The mean intersection over union (mIoU) of TransRoadNet is 62.7%, 30.6%, and 35.8%, respectively (Table 5), which is 4.1%, 24.4%, and 12.7% higher than the model without transfer learning and 1.9%, 3.7%, and 4.4% higher than the common transfer learning algorithm. For the Cityscapes transfer learning dataset, the performance gain from the transfer learning algorithm is only about 4%.
This is owing to the limitations of the urban transferred dataset itself. The deviation of the dataset and the performance loss are small; hence, the effect of the transfer learning algorithm is not as clear as it is on the remaining datasets. Figure 8 shows the inference results of the semantic segmentation models based on transfer learning, where panel (a) shows the original image, panel (b) shows the annotated image, and panels (c), (d), and (e) show the inference graphs of the semantic segmentation models based on transfer learning. It is observed that TransRoadNet is noticeably better than the other transfer learning-based semantic segmentation models in terms of edge segmentation and target classification accuracy.
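The segmentation scores above are reported as a ratio averaged over categories. Assuming this corresponds to the common mean intersection-over-union (mIoU) metric, it can be computed from a predicted label map and an annotated label map as in the following sketch. The function name and the toy label maps are hypothetical, and skipping categories that appear in neither map is a choice made for this example.

```python
def mean_iou(prediction, annotation, num_classes):
    """Mean intersection over union across the categories present in either label map."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, true_row in zip(prediction, annotation):
        for pred, true in zip(pred_row, true_row):
            if pred == true:
                intersection[pred] += 1
                union[pred] += 1
            else:
                # A mismatched pixel enlarges the union of both categories involved.
                union[pred] += 1
                union[true] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


annotated = [[0, 0],
             [1, 1]]
predicted = [[0, 1],
             [1, 1]]
score = mean_iou(predicted, annotated, num_classes=2)
```

In this toy case category 0 has an IoU of 1/2 and category 1 an IoU of 2/3, so the mean is 7/12.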

Conclusion
Based on Cityscapes, two datasets with different perspectives of urban roads and their transferred datasets are constructed. RoadNet, designed based on ARM and CascadeASPP, possesses good portability and performance. TransRoadNet, based on CSD, shows a higher performance in the experiments than the un-transferred RoadNet and the compared transfer learning algorithms.

Data Availability
All data included in this study are available upon request to the corresponding author (e-mail address: hmwang@nuaa.edu.cn).

Conflicts of Interest
The authors declare that they have no conflicts of interest.