An Encoder-Decoder Network Based FCN Architecture for Semantic Segmentation

In recent years, convolutional neural networks (CNNs) have made remarkable achievements in semantic segmentation, a technique with promising application prospects. Most current methods use an encoder-decoder architecture to generate pixel-by-pixel segmentation predictions: the encoder extracts feature maps, and the decoder recovers the feature map resolution. In this paper, we propose an improved semantic segmentation method based on the encoder-decoder architecture. By modifying the backbone and applying several refining techniques, we obtain better segmentation accuracy on several hard classes and significantly reduce the computational complexity. With this processing, the framework achieves good performance on several datasets. Compared with the traditional architecture, our architecture does not need an additional decoding layer and further reuses the encoder weights, thus reducing the total number of parameters needed for processing. We also put forward a modified focal loss function as a replacement for the cross-entropy function to better handle the imbalance of the training data. In addition, more context information is added to the decoder module to improve the segmentation results. Experiments show that the presented method achieves better segmentation results. Multimedia information plays an important role as an integral part of a smart city, and semantic segmentation is an important basic technology for building one.


Introduction
Convolutional neural networks are an essential part of image recognition, detection, and segmentation. Image semantic segmentation can provide a strong foundation for the construction of a smart city and has received much attention and research in recent years. Semantic segmentation aims to classify every pixel in the image into a specific category, which is commonly referred to as dense prediction. It differs from image classification because we do not assign the entire image to one class but classify each pixel. Thus, we have a set of predefined categories and need to assign a label to every pixel of the image according to the context of the various objects in the image [1]. Deep neural networks have driven innovation in computer vision, particularly in image classification, where they have surpassed their predecessors by a large margin since 2012. In fact, artificial intelligence can now exceed human accuracy in image classification. Naturally, the same technology has been adopted for semantic segmentation. Therefore, we put forward a network structure based on the encoder-decoder design and atrous spatial pyramid pooling [2]. At the same time, a combination of multiple loss functions is used as the final loss function.
A relatively naive approach to constructing the neural network architecture is simply to stack several convolution layers, using "same" padding so that the spatial dimensions are preserved, and then output the final segmentation map. Through a series of feature mapping transformations, the mapping from the input image to the segmentation result can be learned directly. However, keeping the full resolution throughout the whole network is computationally very expensive. This architecture is illustrated in Figure 1.
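To make the cost argument concrete, the following minimal sketch counts the multiply-accumulate operations of a single "same"-padded convolution layer at full resolution and at a reduced resolution. The layer shape (3 × 3 kernel, 64 input and 64 output channels, 512 × 512 input) and the 8× reduction are illustrative assumptions, not values taken from this paper.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulate operations of one 'same'-padded convolution layer."""
    return height * width * c_in * c_out * kernel * kernel


# Hypothetical layer: 3x3 kernel, 64 -> 64 channels, 512 x 512 feature map.
full_res = conv_macs(512, 512, 64, 64)
# The same layer applied after the resolution is reduced 8x along each side.
reduced = conv_macs(512 // 8, 512 // 8, 64, 64)

print(full_res)             # 9663676416
print(full_res // reduced)  # 64
```

Because the cost scales with the number of spatial positions, every layer kept at full resolution pays this factor again, which is why downsampling inside the network is attractive.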

Related Works
In deep convolutional networks, the early layers learn low-level concepts and the later layers learn high-level feature mappings. To maintain expressive power, the number of feature maps (channels) is usually increased as the network deepens. Unlike image classification, which only needs the target category, image segmentation needs the location information of each pixel, so it cannot use pooling or strided convolutions to reduce computation as safely as the classification task can. Image segmentation needs a full-resolution semantic prediction. A popular family of image segmentation models is based on an encoder-decoder structure. In the encoder, downsampling reduces the spatial resolution of the input to produce lower-resolution feature maps (which are computationally efficient and can effectively distinguish different categories); in the decoder, these feature representations are upsampled and restored to a full-resolution segmentation map.
2.1. Fully Convolutional Network. At the end of 2014, Long et al. introduced end-to-end, pixel-to-pixel image segmentation trained with a fully convolutional network. The authors propose to use an existing, well-studied image classification network as the encoder module, add transposed convolution layers in the decoder module, and upsample the coarse feature maps to a full-resolution segmentation map [3]. Fully convolutional networks (FCNs) have achieved great success in dense pixel prediction for semantic segmentation. The task requires predicting a label for every pixel of the input image, a basic task in high-level computer vision understanding [1,3]. Some of the most attractive applications include autonomous driving [4], human-computer interaction [2,5,6], intelligent transportation systems [7], auxiliary photo processing [8], and medical imaging [9]. The great achievements of FCNs come from the powerful features learned by CNNs. Importantly, the convolutional computation makes both training and inference highly efficient.
2.2. Encoder-Decoder. The encoder-decoder structure is a common architecture of current semantic segmentation algorithms. The structure is composed of an encoder and decoder. Classic image semantic segmentation algorithms such as FCN, U-net, and DeepLab all adopt this structure.
The encoder is usually a classification network (VGG, ResNet, Xception, etc.), while the decoder consists of deconvolution layers and upsampling layers. Downsampling is aimed at capturing semantic or context information, while upsampling is aimed at recovering spatial information. Common decoders include bilinear interpolation, deconvolution, and dense upsampling convolution.

Dilated Convolution.
In FCNs, repeated max pooling and downsampling operations greatly reduce the feature resolution, so the feature maps recovered by upsampling lose sensitivity to the details of the input image. Replacing standard convolution with dilated convolution in a fully convolutional network allows the network to control precisely the resolution at which feature responses are computed [10]. At the same time, the receptive field of the filter is effectively expanded without increasing the number of parameters or the amount of computation. Many experiments show that this approach uses more context information to obtain denser features, thus improving semantic segmentation accuracy. Figure 2 shows dilated convolution filters with three different dilation rates: (a) a 1-dilated convolution, where each element has a 3 × 3 receptive field; (b) a 2-dilated convolution, where each element has a 7 × 7 receptive field; and (c) a 4-dilated convolution, where each element has a 15 × 15 receptive field. The number of parameters in each layer is the same. The receptive field grows exponentially while the number of parameters grows linearly [11].
With the same convolution kernel size, the receptive field of the kernel can be enlarged by increasing the input stride, as shown in Figure 3.
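The receptive field growth can be checked with a few lines of arithmetic. The sketch below assumes stride-1 layers with 3 × 3 kernels stacked one after another and doubling dilation rates of 1, 2, and 4, as in the original dilated-convolution design; the function name is hypothetical.

```python
def stacked_receptive_fields(rates, kernel=3):
    """Receptive field (per side) after each layer of a stride-1 dilated stack."""
    field = 1
    fields = []
    for rate in rates:
        # Each layer widens the receptive field by (kernel - 1) * rate pixels.
        field += (kernel - 1) * rate
        fields.append(field)
    return fields


print(stacked_receptive_fields([1, 2, 4]))  # [3, 7, 15]
```

Every layer still holds only 3 × 3 weights, so the parameter count grows linearly with depth while the receptive field roughly doubles at each layer.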
FCNs are a kind of deep convolutional neural network that has achieved good performance in pixel-level recognition tasks, but they still face challenges in a changing and complex world. An FCN contains no fully connected layers. The original approach was to stack convolution layers of the same size to map the input image to the output image. It produced strong results but was very expensive, because it could not use any subsampling or pooling layers, which would disturb the location information of the instances. To maintain the image resolution, many layers had to be added to learn both low-level and high-level features, which is inefficient. To address this problem, an encoder-decoder architecture was introduced. The encoder is a typical pretrained convolutional network, while the decoder consists of deconvolution layers and upsampling layers. Downsampling is aimed at capturing semantic or context information, while upsampling is aimed at recovering spatial information. Because the encoder reduces the image resolution, the segmentation lacks well-defined edges; that is, the boundaries between objects are not clearly delineated.
Wireless Communications and Mobile Computing
In [8], the final image prediction is usually reduced by a factor of 32 through several stages of strided convolution and spatial pooling, resulting in the loss of fine image structure information and inaccurate predictions, especially at object boundaries. DeepLab [12, 14, 15, 16] uses atrous (also called dilated) convolution to expand the receptive field while maintaining a high-resolution feature map, or uses the encoder-decoder architecture to solve this problem. It regards the backbone network as an encoder responsible for encoding the original input image as a low-resolution feature map.
2.4. Atrous Spatial Pyramid Pooling (ASPP). The ASPP module was first proposed in [17] and further revised in [12]. In the ASPP module, as shown in Figure 4, different atrous rates are used to extract information at multiple scales. Specifically, one 1 × 1 convolution block, three 3 × 3 convolution blocks with different atrous rates (6, 12, and 18, respectively), and one global average pooling (GAP) block are employed in parallel. With different sampling rates and multiple fields of view, ASPP can capture objects at multiple scales.
It can be seen that the receptive field grows from 3 to 5, approximately doubling, while the convolution kernel size is still 3 × 3; the input stride of 2 is what is now called the dilation rate [12,14].
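The parallel-branch structure can be illustrated with a one-dimensional toy version. This is only a sketch under strong simplifications: the signal is 1D, each 3 × 3 block is replaced by a 3-tap averaging filter with fixed weights, and the learned weights, normalization, and final fusion convolution of a real ASPP module are omitted. The function names are hypothetical.

```python
def dilated_same_conv1d(signal, weights, rate):
    """1D dilated convolution, zero padded so output length equals input length."""
    half = (len(weights) - 1) // 2
    out = []
    for pos in range(len(signal)):
        total = 0.0
        for i, w in enumerate(weights):
            idx = pos + (i - half) * rate
            if 0 <= idx < len(signal):   # positions outside the signal count as zero
                total += w * signal[idx]
        out.append(total)
    return out


def aspp_1d(signal, rates=(6, 12, 18)):
    """Toy ASPP: a 1x1 branch, one dilated branch per rate, and a global-average branch."""
    branches = [dilated_same_conv1d(signal, [1.0], 1)]             # 1x1 convolution
    for rate in rates:                                              # dilated 3-tap branches
        branches.append(dilated_same_conv1d(signal, [1 / 3, 1 / 3, 1 / 3], rate))
    mean = sum(signal) / len(signal)                                # global average pooling
    branches.append([mean] * len(signal))                           # broadcast to every position
    # Concatenate along the channel axis: one feature vector per position.
    return [list(features) for features in zip(*branches)]


features = aspp_1d([float(i) for i in range(40)])
print(len(features), len(features[0]))  # 40 5
```

Each position ends up with five features that summarize the same location at different scales, from a single sample up to the whole signal.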
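The relation between kernel size, dilation rate, and receptive field can be written as k + (k − 1)(r − 1). The sketch below checks the 3-to-5 case and applies a dilated filter to a short 1D signal; the helper names and the example signal are illustrative assumptions.

```python
def effective_kernel_size(kernel, rate):
    """Span covered by a kernel whose taps are spaced `rate` samples apart."""
    return kernel + (kernel - 1) * (rate - 1)


def dilated_conv1d(signal, weights, rate=1):
    """'Valid' 1D dilated convolution (cross-correlation form, as used in CNNs)."""
    span = effective_kernel_size(len(weights), rate)
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * rate] for i, w in enumerate(weights)))
    return out


print(effective_kernel_size(3, 1))  # 3
print(effective_kernel_size(3, 2))  # 5
print(dilated_conv1d([1, 2, 3, 4, 5, 6], [1, 1, 1], rate=2))  # [9, 12]
```

The filter still has three weights at rate 2; it simply skips every other input sample, which is how the receptive field grows without extra parameters.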

Our Approach
In this part, we introduce the proposed network architecture and then explain each module in detail. We also propose a loss function to further improve the performance of semantic segmentation. Figure 5 shows the network architecture, which has two parts: the encoder extracts the feature maps, and the decoder recovers the feature map resolution. The backbone network and the ASPP module together constitute the encoder module of the network, which takes an image of any size and produces the corresponding high-level feature map. The decoder module is then formed by bilinear upsampling combined with a low-level feature map from one layer of the encoder module. Finally, the result is upsampled back to the original image size, and the corresponding segmentation map is obtained through the softmax classification layer. The number of parameters in the ASPP part and the decoder part is also large. Therefore, all ordinary convolutions are replaced by depthwise separable convolutions, which decouple spatial information from depth information, and the number of channels in the ASPP and the decoder is also decreased. We also found that detail segmentation was poor; fusing the feature map at 1/2 size with the decoder features finally gives good results.
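The parameter saving from depthwise separable convolution follows from simple counting. The sketch below compares a standard convolution with a depthwise convolution followed by a 1 × 1 pointwise convolution; the channel counts (256 in, 256 out) are an illustrative assumption rather than the configuration used in this paper, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, kernel=3):
    """Weights of an ordinary convolution: one k x k filter per (input, output) channel pair."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(c_in, c_out, kernel=3):
    """Weights of a depthwise separable convolution."""
    depthwise = kernel * kernel * c_in   # one k x k spatial filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


print(standard_conv_params(256, 256))   # 589824
print(separable_conv_params(256, 256))  # 67840
```

For this shape the separable form needs roughly one ninth of the weights, because the spatial filtering and the channel mixing are handled by two small operations instead of one large one.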

Backbone Network.
Over the past few years, CNN backbone networks have made great progress in vision tasks and represent the state of the art. A CNN is a stack of convolutional layers, pooling layers, activation function layers, and fully connected layers; given an input image, it outputs the classification scores for that image. In 2012, AlexNet [18] won ILSVRC [19]. AlexNet addressed the problem of image classification and opened a new era in computer vision. Later, top competitors put forward various CNN architectures, such as GoogLeNet [8], ResNet [20], and DenseNet [21] [22]. These network structures extract image features well, which lays a solid foundation for semantic segmentation [23,24]. Our network architecture uses Xception as the feature extractor. Some common classification networks are shown in Table 1 [25]. From our experiments, we conclude that recognition accuracy can still be low when the computational complexity is high, and can still be low when the number of parameters is large. Good network structure design is therefore very important, since different models use their parameters with different efficiency.

Cross-Entropy Loss and Focal Loss.
The common loss function for classification problems is the cross-entropy loss, which measures the distance between two probability distributions: the smaller the cross-entropy, the closer the two distributions are. The related cross-entropy method is a general approach to combinatorial optimization, multiextremal optimization, and rare-event simulation. Cross-entropy is the standard loss for binary classification. In image segmentation we sometimes face seriously imbalanced datasets, in which the background accounts for a large proportion of the pixels and the object for only a small proportion. In this case, the loss function must be chosen carefully. The most commonly used loss function is as follows, where $y = y_{\text{truth}}$ and $p = y_{\text{pred}}$:

\[ \mathrm{CE}(p, y) = \begin{cases} -\log(p), & y = 1, \\ -\log(1 - p), & y = 0. \end{cases} \]

From the above formula, we can draw a conclusion: when $y = 1$, the larger $p$ is, the closer it is to $y$; that is, the more accurate the prediction, the smaller the loss. When $y = 0$, the smaller $p$ is, the closer it is to $y$, and again the more accurate the prediction, the smaller the loss. The final loss is the sum of the $y = 0$ and $y = 1$ terms. This method has one obvious drawback. When the number of positive samples is far smaller than the number of negative samples, that is, when the samples with $y = 0$ far outnumber those with $y = 1$, the $y = 0$ terms dominate the loss function, and the model is heavily biased towards the background.
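The dominance of the background terms can be seen in a small numerical sketch. The pixel counts and predicted probabilities below are made up for illustration: 990 background pixels that are already predicted fairly well, and 10 object pixels that are predicted poorly.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for a predicted foreground probability p and label y."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


# Hypothetical image: 990 background pixels (y = 0) predicted with p = 0.1,
# and 10 object pixels (y = 1) predicted with p = 0.6.
background = 990 * cross_entropy(0.1, 0)
foreground = 10 * cross_entropy(0.6, 1)

print(background > foreground)  # True
print(round(background / (background + foreground), 2))
```

Even though each background pixel contributes a small loss, their sheer number makes them account for most of the total, so gradient updates mainly serve the background class.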
We define $p_t$ as

\[ p_t = \begin{cases} p, & y = 1, \\ 1 - p, & y = 0, \end{cases} \]

so that $\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)$. First of all, the proportion of positive and negative samples should be balanced. Rather than using negative sample mining or similar means, in this paper we directly multiply the CE loss by a parameter $\alpha$, so that we can easily control the relative weight of negative and positive samples.
We get the balanced cross-entropy loss as

\[ \mathrm{CE}(p_t) = -\alpha_t \log(p_t), \]

where $\alpha_t = \alpha$ for $y = 1$ and $\alpha_t = 1 - \alpha$ for $y = 0$. In practice, $\alpha$ is a fixed value in $[0, 1]$ and does not participate in training.
Although the above formula can control the weight of positive and negative samples, it cannot control the weight of easy samples and hard samples.
The focal loss therefore adds a modulating factor to the cross-entropy loss:

\[ \mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t). \]

Here $\gamma \geq 0$ is called the focusing parameter, and $(1 - p_t)^{\gamma}$ is called the modulating factor. In practice, we usually add the parameter $\alpha$ in front of the focal loss:

\[ \mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t). \]

In semantic segmentation, there are more categories than in the two-class problem of object detection. If the selected parameters $\alpha$ and $\gamma$ are not suitable, the cross-entropy loss weight of these pixels will be reduced. Combining the above analysis, we propose to increase the weight of difficult samples and keep the weight of simple samples almost unchanged. We find that the best results are obtained by setting $\alpha = 0.5$ and $\gamma = 2$ in our experimental network.
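A minimal sketch of the α-weighted focal loss for a single pixel is given below, using the values α = 0.5 and γ = 2 reported above as defaults. The per-class weighting (α for positives, 1 − α for negatives) follows the usual RetinaNet formulation and is an assumption about the exact form used here; with α = 0.5 both classes receive the same weight.

```python
import math


def focal_loss(p, y, alpha=0.5, gamma=2.0):
    """Alpha-weighted focal loss for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p               # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class weight (assumed RetinaNet form)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # well-classified positive, modulating factor 0.1 ** 2
hard = focal_loss(0.1, 1)   # badly classified positive, modulating factor 0.9 ** 2
print(easy < hard)          # True
```

Setting `gamma=0` removes the modulating factor and recovers the balanced cross-entropy loss, which is a convenient sanity check when tuning the two parameters.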
Focal loss was first proposed in the RetinaNet model [26] to address class imbalance and the differing difficulty of samples during training. In practical applications, combining focal loss with dice loss usually requires scaling them to the same order of magnitude: −log is used to enlarge the dice loss, and alpha is used to reduce the focal loss.
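One possible reading of this scaling rule is sketched below. The exact combination is not specified in the text, so the form `focal_scale * focal - log(dice)`, the name `focal_scale`, and the smoothing constant `eps` are all assumptions made for illustration.

```python
import math


def dice_coefficient(pred, target, eps=1e-7):
    """Soft dice coefficient between predicted probabilities and binary labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def combined_loss(pred, target, focal_value, focal_scale=0.5):
    """Hypothetical combination: scaled focal loss plus the negative log of the dice score."""
    return focal_scale * focal_value - math.log(dice_coefficient(pred, target))


perfect = dice_coefficient([1, 1, 0, 0], [1, 1, 0, 0])
partial = dice_coefficient([1, 0, 0, 0], [1, 1, 0, 0])
print(perfect > partial)  # True
```

Taking −log of the dice score stretches values near zero overlap into large penalties, while the scale factor keeps the focal term from overwhelming it.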

Experiments and Results
To demonstrate the effectiveness of the proposed framework, we evaluated it on the benchmark dataset PASCAL VOC 2012 and compared it with the latest methods. In this paper, we report experimental results on three mainstream semantic segmentation datasets: PASCAL VOC 2012, CamVid [27], and Cityscapes [28].
The mean intersection over union (MIoU) is the standard measure for semantic segmentation. It is based on the ratio of the intersection to the union of two sets, which in semantic segmentation are the ground truth $A$ and the predicted segmentation $B$. This ratio can be rewritten as TP (the intersection) over the sum of TP, FP, and FN (the union). The IoU of each class is calculated, and the average is taken.

\[ \mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \]

is equivalent to

\[ \mathrm{IoU} = \frac{TP}{TP + FP + FN}. \]

First, calculate the intersection over union of each category, and then take the average over the $k$ categories:

\[ \mathrm{MIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i + FN_i}. \]

TP is a positive sample that is classified correctly, FN is a positive sample that is classified incorrectly, and FP is a negative sample that is classified incorrectly. TP can be understood as the intersection of the prediction results and the labels, while TP + FN + FP is the union of the prediction results and the labels. The closer the intersection is to the union, the more accurate the segmentation is.
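The metric can be computed directly from flattened prediction and label arrays, as in the following sketch. The function name and the tiny six-pixel example are illustrative; classes that appear in neither the prediction nor the label are skipped, which is one common convention and an assumption here.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union from flat lists of predicted and true class ids."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        union = tp + fp + fn
        if union > 0:                 # skip classes absent from both prediction and label
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2
print(mean_iou(pred, target, num_classes=3))  # approximately 0.5
```

Because each class contributes equally to the average, a rare class that is segmented badly lowers the score as much as a frequent one, which is why MIoU is sensitive to the imbalance discussed earlier.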
We also used several widely used data augmentation strategies in training: horizontal flipping with 50% probability; random scaling of images with a scale factor between 0.5 and 2.0 in steps of 0.25; and padding and randomly cropping the scaled image to 513 × 513. Finally, the model is fine-tuned with a learning rate of 2e-4. When segmenting small target parts, we found that detail segmentation was very poor. To improve the details, the feature map at 1/2 size is fused with the decoder features, and good results are obtained. The loss function used in training is the improved focal loss. The results show that the improved focal loss raises the segmentation scores: the segmentation accuracy improves, and the sample imbalance is alleviated.
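The augmentation recipe can be expressed as a small parameter sampler. The sketch below only draws the random choices (flip, scale, crop window) and leaves the actual image resampling to whatever image library is used; the function name, the returned dictionary layout, and the example image size are assumptions.

```python
import random

SCALES = [0.5 + 0.25 * i for i in range(7)]   # 0.5, 0.75, ..., 2.0
CROP = 513


def sample_augmentation(height, width, rng):
    """Draw flip, scale, and crop-window parameters for one training image."""
    flip = rng.random() < 0.5                          # horizontal flip with 50% probability
    scale = rng.choice(SCALES)                         # random scale in steps of 0.25
    new_h, new_w = round(height * scale), round(width * scale)
    # Pad the scaled image up to the crop size if it is smaller, then crop randomly.
    pad_h, pad_w = max(CROP, new_h), max(CROP, new_w)
    top = rng.randint(0, pad_h - CROP)
    left = rng.randint(0, pad_w - CROP)
    return {
        "flip": flip,
        "scale": scale,
        "scaled_size": (new_h, new_w),
        "crop_box": (top, left, top + CROP, left + CROP),
    }


rng = random.Random(0)   # fixed seed so the sketch is reproducible
params = sample_augmentation(375, 500, rng)
print(params["crop_box"][2] - params["crop_box"][0])  # 513
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs, which helps when comparing loss functions on the same training schedule.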

PASCAL VOC 2012.
PASCAL VOC 2012 includes 20 foreground object classes and one background class, with photos drawn from private collections. There are six indoor classes, seven vehicle classes, and seven classes of living creatures. The dataset contains 1464 training, 1449 validation, and 1456 test images of variable size. We use 512 × 512 crops and divide the learning rate of the pretrained weights by 8. All other hyperparameters are the same as in the experiments of [16]. Table 2 shows the performance of our algorithm on VOC 2012, and a detailed comparison with other methods is given in Table 3.
From the evaluation samples on the PASCAL VOC 2012 validation set, we can see that the proposed method works well for animals, people, and objects. The edges of such targets are segmented carefully, which improves the classification accuracy for classes such as stool, animal, and bicycle. The evaluation indices above show that the method outperforms many segmentation methods, as shown in Figure 6. Note that we do not use CRFs for postprocessing; they can smooth the output but are too slow in practice, especially for large images.

Cityscapes.
The Cityscapes dataset is a very large image dataset focused on the semantic understanding of street scenes. It contains road driving images from 50 cities in spring, summer, and autumn. The dataset has 19 classes and covers good and moderate weather, many dynamic objects, different scene layouts, and different backgrounds. We carried out experiments on the 5000 finely labeled images with high-quality pixel-level annotations, which are divided into 2975 training images, 500 validation images, and 1525 test images. The resolution of all images is 1024 × 2048.
As shown in Figure 7, the method achieves 81.79% MIoU on the Cityscapes test set with 1024 × 2048 images. Table 4 shows the performance of our algorithm on the Cityscapes test set.

CamVid.
To further demonstrate the effectiveness and robustness of this method, we also assess its performance on the CamVid dataset. The Cambridge-driving Labeled Video Database (CamVid) is the first video collection with object class semantic labels. The ground truth labels provided by the database associate each pixel with one of 32 semantic classes. The CamVid dataset contains images of city road driving scenes. We use 11 classes and a split of 367 training, 101 validation, and 233 test images. The resolution of all images is 720 × 960.
We train all models from random initialization, except that the parameters pretrained on ImageNet are fine-tuned. During training, the random crop size is 512 × 512, and the batch size is 16. All other hyperparameters are the same as in the PASCAL VOC 2012 experiment. After 30000 iterations on the training set, the model in this paper achieves 77.61% MIoU on the validation set and 69.39% MIoU on the test set.
We can see that the models in this paper produce very accurate semantic segmentation results. Whether the target is small or is occluded or overlapped by other targets, the method in this paper can segment it accurately.
Figure 6: The visualization results on the PASCAL VOC2012 validation set using our methods.

Conclusion
We introduce a simpler yet robust network for semantic segmentation. Combining ASPP with a classical encoder-decoder structure, we also propose an improved loss function more suitable for this application. The experimental results show the superiority of this method: it not only effectively improves segmentation performance but also significantly alleviates the imbalance of the training data. To improve the learning ability of this method, we will focus on weakly supervised learning and meta-learning in future work. We believe that semantic segmentation can provide a good foundation for future smart city construction.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding this work.

Figure 7: The visualization results on the Cityscapes data using our methods. Columns: image, ours, ground truth.