A Real-Time Image Semantic Segmentation Method Based on Multilabel Classification

Image semantic segmentation has been playing a crucial part in intelligent driving, medical image analysis, video surveillance, and augmented reality (AR). However, as scenes require more semantics to be inferred from video and audio clips and real-time requirements become stricter, neither the single-label classification methods commonly used before nor regular manual labeling can meet these needs. Given the excellent performance of deep learning algorithms across a wide range of applications, image semantic segmentation algorithms built on deep learning frameworks have come under the spotlight. This paper improves ESPNet (Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation) based on the multilabel classification method in the following steps. First, the standard convolution in the convolution layer is replaced by applying receptive fields in a deep convolutional neural network, so that every pixel in the covered area contributes to the final feature response. Second, the ASPP (Atrous Spatial Pyramid Pooling) module is improved based on atrous convolution, and DB-ASPP (Delete Batch Normalization-ASPP) is proposed to reduce the gridding artifacts caused by multilayer atrous convolution, acquire multiscale information, and integrate image-level feature information. Finally, the proposed model and regular models are subjected to extensive tests and comparisons on multiple datasets. Results show that the proposed model achieves good segmentation accuracy with the smallest number of network parameters (0.3 M) and the fastest segmentation speed (25 FPS).


Introduction
Multilabel classification evolved as single-label classification gradually became unable to satisfy present needs. At first, it mainly took the form of text classification [1, 2]. Given their characteristics, these methods can be roughly categorized into methods based on region classification and methods based on pixel classification. The region-based methods divide an image into several blocks, extract image features with a Convolutional Neural Network (CNN), and classify the image blocks. This approach can be subdivided into methods based on candidate regions and methods based on segmentation masks. In general, the category to which a pixel belongs is marked according to the highest-scoring region. Such methods may use Visual Geometry Group Network (VGGNet), GoogLeNet, ResNet (Residual Neural Network), and other networks as the backbone for classifying the image blocks. Since their classification networks contain a Fully Connected Layer (FCL), the input image size must be fixed and the models generally have a higher memory cost, resulting in computational inefficiency and unsatisfactory segmentation. At present, there are also extensions on this basis, such as composite segmentation methods based on an encoder-decoder combined with Dense Residual Blocks (DRB) and FCN [3-5]. Against this background, FCN was put forward in 2014 and has been one of the most popular pixel-based classification methods. The spatial size of the feature map extracted by the CNN can be adjusted by upsampling until it matches the original image. For image segmentation tasks, FCN is superior to conventional CNNs because the input image does not have to be of fixed size and the network has higher computational efficiency.
The stricter demands for real-time performance and the huge computational cost have put lightweight semantic segmentation under the spotlight. In 2017, Andrew Howard et al. proposed MobileNet [6]; ENet [8], introduced in 2016, improved the pooling operation by outputting pooling masks during downsampling and used them to improve recognition accuracy during upsampling. In 2020, Tan et al. [9] from Google proposed EfficientDet, which capitalizes on a two-way weighted feature pyramid structure for feature fusion and uses a compound scaling method to uniformly scale the resolution, depth, and width of the backbone network, feature network, and prediction network.

Multilabel Classification.
Multilabel classification is a classification problem in which a sample may be assigned multiple target labels concurrently. For example, an image may contain urban buildings, vehicles, and people; a song may be both lyrical and sentimental. Accordingly, a data sample (picture or music) may carry several different labels at once, which characterize its attributes. What makes multilabel learning hard is the explosive growth of the output space: with 10 labels available, the output space contains 2^10 = 1,024 possible label sets. Effectively mining label-to-label correlations is the key to taming this huge output space and underpins the success of multilabel learning. By the intensity of correlation mining, multilabel algorithms can be divided into three categories. First-order strategies neglect the correlation between labels; second-order strategies consider pairwise correlations among labels; high-order strategies consider correlations among several labels. Multilabel classification can be solved with three options. The first is problem transformation, including label transformation and instance transformation, e.g., binary relevance (BR) [10]. The second is algorithm adaptation, that is, modifying existing learning algorithms so that they acquire multilabel learning capability, e.g., Multilabel K-Nearest Neighbor (ML-KNN) [11]. The third is the ensemble method, which evolved from problem transformation or algorithm adaptation. The most famous ensembles of problem transformation are the RAKEL system [12] proposed by Tsoumakas et al., the Ensemble of Pruned Sets (EPS) [13], and the Ensemble of Classifier Chains (ECC) [14]. Further details about these options are available in Figure 1.
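As a concrete illustration of the first option, the first-order binary relevance idea can be sketched in a few lines of Python: one independent binary classifier per label, ignoring label correlations. The per-label threshold "classifier" below is a hypothetical stand-in for a real base learner (BR itself does not prescribe one); this is illustrative, not the implementation used in [10].

```python
def train_binary_relevance(samples, label_sets, all_labels):
    """Fit one independent threshold 'classifier' per label.

    samples: list of scalar features; label_sets: list of label sets,
    one per sample. Each classifier is a (threshold, direction) pair.
    """
    classifiers = {}
    for label in all_labels:
        pos = [x for x, ys in zip(samples, label_sets) if label in ys]
        neg = [x for x, ys in zip(samples, label_sets) if label not in ys]
        pos_mean = sum(pos) / len(pos)
        neg_mean = sum(neg) / len(neg)
        threshold = (pos_mean + neg_mean) / 2.0
        classifiers[label] = (threshold, pos_mean >= neg_mean)
    return classifiers

def predict_binary_relevance(classifiers, x):
    """Predict each label independently; the output is a label set."""
    predicted = set()
    for label, (threshold, pos_is_high) in classifiers.items():
        if (x >= threshold) if pos_is_high else (x <= threshold):
            predicted.add(label)
    return predicted
```

Because every label is handled in isolation, any correlation between labels (e.g., "bright" images also tending to be "busy") is simply not modeled, which is exactly the limitation that second- and high-order strategies address.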

ESPNet.
ESPNet was introduced by Mehta et al. [15], who presented in detail a semantic segmentation network architecture featuring fast computation and excellent segmentation quality. ESPNet can process data at 112 FPS on a GPU under ideal conditions and up to 9 FPS on edge devices, faster than the well-known lightweight networks MobileNet [6], ENet [8], and ShuffleNet [7]. At the cost of only an 8% drop in classification accuracy, ESPNet uses only 1/180 of the parameters of PSPNet, regarded as the best architecture at that time, while running 22 times faster. The paper introduced a convolution module referred to as the "Efficient Spatial Pyramid" as a core part of ESPNet. The resulting architecture is characterized by high speed, low power consumption, and low latency, which makes it well suited for deployment on resource-limited edge devices. Figure 2 shows the basic network architecture of ESPNet. In this model, pointwise convolution reduces the number of channels before the features are sent to the dilated convolution pyramid. A larger receptive field is obtained from different scales of dilated convolution, along with feature fusion; because the number of channels is reduced first, each dilated convolution carries very few parameters. Figure 3 presents the numbers of channels, the dilation ratios, and the merging strategy. The feature fusion in the merging strategy contrasts sharply with that of regular dilated convolution: a stepwise addition strategy is used to avoid gridding artifacts.
ESPNetv2 was introduced by Mehta et al. in 2019. With the increased network depth in the EESP module, each convolution layer is improved by using the PReLU activation function, while the activation function is removed from the final group-wise convolution layer. Dilated convolution is used so that the receptive field is enlarged, the number of network parameters is reduced, and the running speed is increased; Figure 4 illustrates this structure. In the past two years, several scholars have carried out useful research based on ESPNet [16-19]. Kim [16] proposed ESCNet, based on the ESPNet architecture, one of the state-of-the-art real-time semantic segmentation networks that can be easily deployed on edge devices. Nuechterlein [17] extended ESPNet, a fast and efficient network designed for vanilla 2D semantic segmentation, to challenging 3D data in the medical imaging domain.

The Proposed Algorithm
ESPNet is built around the Efficient Spatial Pyramid (ESP) module, in which a pointwise 1 × 1 convolution maps high-dimensional features into a low-dimensional space. In this section, ESPNet is improved by integrating and tuning the technical methods mentioned earlier, and its core constituent modules are described here. Figure 5 shows the process flow of the improved model. The spatial pyramid of dilated convolutions applies K parallel n × n dilated convolution kernels to resample these low-dimensional feature maps, the dilation rate of the kth kernel being 2^(k−1). This decomposition sharply reduces the number of parameters and the memory required by the ESP module while retaining a large effective receptive field of [(n − 1)2^(K−1) + 1] × [(n − 1)2^(K−1) + 1]. This pyramid of convolution operations is also referred to as a "spatial dilated convolution pyramid." Each dilated convolution kernel learns the weights of its own receptive field, so the structure resembles a spatial pyramid. Since ESPNet outperforms the high-efficiency CNNs currently available, this model is chosen as the basis for design and improvement. Figure 6 shows the first step of the ESPNet improvement, based on convolution factor decomposition.
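The savings from this decomposition can be checked with a little arithmetic. The sketch below assumes a bias-free count, a pointwise reduction from M input channels to d = N/K channels, and K parallel n × n dilated kernels each mapping d to d channels; the exact bookkeeping in the ESPNet paper may differ slightly, so treat this as a rough estimate rather than the paper's formula.

```python
def esp_effective_rf(n, K):
    """Side length of the effective receptive field of an ESP unit:
    (n - 1) * 2**(K - 1) + 1, set by the largest dilation rate 2**(K-1)
    among the K parallel n x n kernels."""
    return (n - 1) * 2 ** (K - 1) + 1

def esp_params(M, N, n, K):
    """Rough parameter count of one ESP unit (no biases): a 1x1
    reduction from M to d = N // K channels, then K parallel n x n
    dilated convolutions each mapping d -> d channels."""
    d = N // K
    return M * d + K * (n * n * d * d)

def standard_conv_params(M, N, n):
    """Parameters of a plain n x n convolution mapping M -> N channels."""
    return n * n * M * N
```

For example, with M = N = 128, n = 3, and K = 4, the ESP unit needs roughly 41 k parameters against 147 k for the standard convolution, while its effective receptive field grows to 17 × 17.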
Provided that the number of parameters stays the same, atrous convolution guarantees a larger receptive field, but it can hurt the recognition of small objects. Finally, the improved model generates segmented images by exploiting the deconvolution principle of the decoding part in an encoding-decoding structure. The segmented images are fused with the original images in the merging module, which gives an intuitive impression of the segmentation accuracy of the model. Figure 7 shows the working principle of the improved model. The algorithm in the proposed model uses an image pyramid during training, as expressed in equation (1): P_n^out = Resize(P_n^in), where P_n^out is the feature prediction output of the nth layer, P_n^in is the feature input of the nth layer, and Resize(·) adjusts the image size.
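Equation (1) amounts to resizing features level by level. A minimal pyramid sketch, assuming 2 × 2 average pooling plays the role of Resize(·) (the actual model may use a different resampling operator), is:

```python
def downsample_2x(img):
    """Average-pool a 2D grid (list of lists) by a factor of 2."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1] +
              img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)] for i in range(h)]

def image_pyramid(img, levels):
    """Each level's output is a resized copy of that level's input,
    in the spirit of P_n^out = Resize(P_n^in) in equation (1)."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid
```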

Depthwise Separable Convolution.
It can be inferred from the semantic segmentation analysis of CNNs and encoding-decoding that the convolution layer is the core component. A matching convolution method should be available for different kinds of environments; otherwise gridding artifacts and other undesirable phenomena would be aggravated, and the model would not achieve a good semantic segmentation effect. Given this, the convolution layer is improved by making depthwise separable convolution its core, using a set of dilation rates and joining them following the method used in ResNet. Figure 8 describes how it works.
As seen from Figure 8, the input to the layer-by-layer (depthwise) convolution consists of M channel feature maps, which are convolved with M filters, one per channel, to output M feature maps. In contrast with the conventional convolution method, what makes this convolution significantly different is that the channel and spatial correlations are learned asynchronously rather than jointly, as in conventional convolution. Comparing the regular convolution in equation (2) with the depthwise separable convolution in equation (3) shows that this can speed up network training and widen the network, so the network can accommodate and transmit more feature information, leading to improved efficiency.
In the first step, the depthwise separable convolution performs channel-wise convolution according to equation (3) (in this paper, Θ denotes element-wise multiplication), and then pointwise convolution is performed according to equation (4). Substituting equation (3) into equation (4) yields equation (5) for the depthwise separable convolution, where W is the convolution kernel, y is the input feature map, i and j index the resolution of the input feature map, k and l index the resolution of the output feature map, and m is the number of channels.
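The two-step factorization of equations (3)-(5) can be sketched directly in Python, assuming "valid" padding, stride 1, and no bias (the shapes and index names here are illustrative, not the paper's exact notation). The parameter-count comparison at the end shows where the speedup comes from.

```python
def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """x: M x H x W input (nested lists); dw_kernels: M kernels of size
    k x k, one per channel (the channel-wise step, eq. (3));
    pw_weights: N x M pointwise mixing weights (eq. (4))."""
    M, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(dw_kernels[0])
    Ho, Wo = H - k + 1, W - k + 1
    # Step 1 -- depthwise: each channel is convolved with its own kernel.
    dw = [[[sum(x[m][i + a][j + b] * dw_kernels[m][a][b]
                for a in range(k) for b in range(k))
            for j in range(Wo)] for i in range(Ho)] for m in range(M)]
    # Step 2 -- pointwise: a 1x1 convolution mixes the M channels into N.
    N = len(pw_weights)
    return [[[sum(pw_weights[n][m] * dw[m][i][j] for m in range(M))
              for j in range(Wo)] for i in range(Ho)] for n in range(N)]

def params_standard(k, M, N):
    """k x k kernel over M input and N output channels (eq. (2))."""
    return k * k * M * N

def params_separable(k, M, N):
    """M depthwise k x k kernels plus an M -> N pointwise layer."""
    return k * k * M + M * N
```

With k = 3, M = 32, N = 64, the separable form needs 2,336 weights against 18,432 for the standard convolution, roughly an 8x reduction.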

DB-ASPP.
In this paper, HDC is introduced into the ASPP module to collect multiscale information, and image-level feature information is integrated into the existing ASPP module. Considering the needs of fusion, the batch normalization (BN) layer is filtered out. Ablation experiments show that adding BN layers before the PReLU activation function improves accuracy by approximately 1.4%, but the benefit of removing them is that the parallel branch outputs need no postprocessing; in other words, the network parameters are reduced and the speed is increased. Accordingly, the improved DB-ASPP module based on ASPP is proposed here. Atrous convolution has two functions. First, the receptive field is enlarged; for example, a convolution with r = 1 becomes, after dilation, one with r = 2. The deficiency is the reduced spatial resolution: if the compression level is high, the subsequent upsampling or deconvolution that restores the original image size becomes more difficult. Moreover, continuous downsampling layers cause a serious reduction in the spatial resolution of the feature map, whereas atrous convolution can extract more context information. Figure 9 is the schematic diagram of how atrous convolution works. When r = 1, the receptive field is 3 × 3; after dilation to r = 2, the receptive field becomes 5 × 5. It is apparent that as the atrous rate increases, the area recognizable by the original convolution kernel increases significantly.
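The receptive-field growth described above follows the standard relation k_eff = k + (k − 1)(r − 1) for a k × k kernel at atrous rate r, and stride-1 layers grow the receptive field additively; a few lines of Python can verify the 3 × 3 → 5 × 5 example:

```python
def atrous_effective_kernel(k, r):
    """Effective kernel side of a k x k convolution at atrous rate r:
    inserting r - 1 zeros between taps gives k + (k - 1) * (r - 1)."""
    return k + (k - 1) * (r - 1)

def stacked_receptive_field(k, rates):
    """Receptive field after applying stride-1 atrous convolutions
    with the given rates in sequence (each layer adds k_eff - 1)."""
    rf = 1
    for r in rates:
        rf += atrous_effective_kernel(k, r) - 1
    return rf
```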
Atrous convolution can increase the receptive field while controlling the resolution, but current atrous convolution methods are still vulnerable to an inherent problem: the gridding issue. If atrous convolution is applied repeatedly and the atrous rates are chosen improperly, certain pixels may never be involved in the calculation. For example, for a pixel p in a certain atrous convolution layer, its value depends only on a neighborhood of size k_size × k_size in the layer above, centered on p. Assume the atrous rate is r = 2 and k_size = 3; the pixel p is marked by the red points in Figure 10, the blue area denotes the range captured by the convolution, and the lower image of Figure 10 is obtained after two such operations (r = 2).
As the white spots in Figure 10 show, many adjacent pixels are overlooked, and only a small subset is reused across the repeated atrous convolution calculations. In addition, since atrous convolution is constructed by inserting zeros between the parameters of the convolution kernel, when the atrous rate increases, the distance between nonzero values also increases and the relevance of local information is destroyed, leading to a more serious loss of local information and aggravating the gridding effect in the generated feature maps.
Accordingly, Wang et al. proposed Hybrid Dilated Convolution (HDC), in which atrous convolutions with different atrous rates are used continuously and alternately to reduce the gridding problem. In one dimension, HDC is defined by the recurrence M_i = max[M_{i+1} − 2r_i, 2r_i − M_{i+1}, r_i], with M_n = r_n, where r_i is the atrous rate of layer i and M_i is the maximum distance between two nonzero values seen by layer i; a rate series avoids gridding for kernel size k when M_2 ≤ k. Figure 11 is the schematic diagram of the receptive field for r = {1, 2, 5}. It confirms that all pixels take part in the convolution operation, which suggests that HDC solves the gridding issue well.
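The HDC criterion can be checked programmatically. The sketch below implements the recurrence as stated in Wang et al.'s paper; the rate series {1, 2, 5} passes for a 3 × 3 kernel, while {1, 2, 9}, a failing example from that paper, does not.

```python
def hdc_max_distances(rates):
    """Wang et al.'s HDC recurrence: M_i = max(M_{i+1} - 2 r_i,
    2 r_i - M_{i+1}, r_i), computed backwards from M_n = r_n."""
    M = [0] * len(rates)
    M[-1] = rates[-1]
    for i in range(len(rates) - 2, -1, -1):
        M[i] = max(M[i + 1] - 2 * rates[i],
                   2 * rates[i] - M[i + 1],
                   rates[i])
    return M

def satisfies_hdc(rates, k):
    """A rate series avoids gridding for kernel size k when M_2 <= k
    (index 1 here, since the list is 0-based)."""
    return hdc_max_distances(rates)[1] <= k
```

Note that this necessary condition does not replace the additional rule of thumb in the same paper that rates within a group should not share a common factor (e.g., {2, 4, 8}).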
Based on the above HDC, the ASPP module within ESPNet is improved here by introducing HDC and removing the BN layer. Figure 12 presents the functional architecture.
The dilated convolutions with four dilation rates capture multiscale information in parallel on the top-level feature response of the backbone network. The improved ASPP module confers a larger receptive field on the neurons, and the Pyramid Pooling Module (PPM) is introduced into the proposed ESP. As a result, contextual semantic information from different regions can be aggregated to attain a better segmentation effect.
In addition, to control the model size and prevent an oversized network, a 1 × 1 convolution layer is added in front of each atrous convolution layer in DB-ASPP, with reference to DenseNet and DenseASPP, in order to reduce the depth of the feature maps to a specified size and thus control the output size. Assume that each atrous convolution layer outputs n feature maps, DB-ASPP receives C_0 feature maps as input, and the lth 1 × 1 convolution layer, placed in front of the lth atrous convolution layer, has C_l input feature maps. C_l is computed from C_0, n, and l as in equation (8): C_l = C_0 + n(l − 1). In DB-ASPP, each 1 × 1 convolution layer in front of an atrous convolution layer reduces the depth of its input feature maps to C_0/4, and every atrous convolution layer outputs C_0/4 feature maps. The parameters of DB-ASPP can then be computed as in equation (9), where L is the number of atrous convolution layers in DB-ASPP and k is the size of the convolution kernel. Ablation experiments are conducted to validate the effectiveness of DB-ASPP.
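The channel and parameter bookkeeping can be sketched as follows. This assumes DenseASPP-style concatenation, so that the lth layer's input is C_0 plus the n maps appended by each of the l − 1 earlier layers, with n = C_0/4 and bias terms ignored; the exact counts in the paper's equations (8) and (9) may differ in detail.

```python
def db_aspp_input_channels(C0, n, l):
    """Input depth of the l-th 1x1 layer (1-indexed), assuming each of
    the l-1 earlier atrous layers appended n feature maps to the
    original C0 inputs (DenseASPP-style concatenation)."""
    return C0 + n * (l - 1)

def db_aspp_params(C0, L, k):
    """Rough parameter count: each 1x1 layer squeezes its input down to
    C0 // 4 channels, and each k x k atrous layer maps C0 // 4 to
    C0 // 4 channels (so every layer appends n = C0 // 4 maps)."""
    n = C0 // 4
    total = 0
    for l in range(1, L + 1):
        C_l = db_aspp_input_channels(C0, n, l)
        total += C_l * n        # 1x1 reduction in front of layer l
        total += k * k * n * n  # the atrous convolution itself
    return total
```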

Parameter Setting and Criteria for Evaluation
The proposed network model is trained with the SGD algorithm, and its parameters are given in Table 1. Following the experimental comparison described above, the PReLU activation function and max pooling, which proved most effective, are selected. To assess the generalization ability in transfer learning, the loss function is recorded over 4,200 iterations during testing so that the numerical results of the optimization process can be observed.
The Mean Intersection over Union (MIoU), Params, and FPS are used to evaluate model performance. MIoU is one of the most important evaluation indexes for semantic segmentation models; it measures the quality of an algorithm by calculating the ratio of the intersection to the union of the prediction and the ground truth (that is, the ratio TP/(TP + FN + FP)), as shown in formula (10): MIoU = (1/k) Σ_i [P_ii / (Σ_j P_ij + Σ_j P_ji − P_ii)], where P_ij is the number of pixels of class i misjudged as class j, P_ii is the number of correctly predicted pixels of class i, and k is the number of classes. Params is the number of parameters; a smaller value means a more lightweight model with lower dependence on high-performance equipment. FPS is the number of frames transmitted and recognized per second in semantic segmentation; the higher the FPS value, the faster the model.
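Formula (10) can be computed directly from a confusion matrix; a minimal sketch (rows as ground-truth classes, columns as predicted classes) is:

```python
def mean_iou(confusion):
    """MIoU from a square confusion matrix: confusion[i][j] counts
    pixels of true class i predicted as class j. For class i, the
    diagonal entry is TP, the rest of row i is FN, and the rest of
    column i is FP, giving IoU = TP / (TP + FN + FP)."""
    k = len(confusion)
    ious = []
    for i in range(k):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(confusion[j][i] for j in range(k)) - tp
        ious.append(tp / (tp + fn + fp))
    return sum(ious) / k
```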

Self-Built Datasets.
The experimental data are built on the basis of the Pascal VOC dataset, adding road images taken by the authors around the campus and removing some small-category images in Pascal VOC, such as potted plant and chair. The classes of the self-built datasets are shown in Table 2.

Experiment Results.
We conduct three kinds of comparative experiments to fully demonstrate the performance of the proposed algorithm. The first is the ablation experiment on the DB-ASPP proposed in this paper. The second compares the experimental results of ESPNet and the improved model. The third compares the improved model with seven other models, such as SegNet (a deep convolutional encoder-decoder architecture for image segmentation).

Comparison of ESPNet and Improved ESPNet Model.
In this section, the segmentation results of ESPNet and the improved ESPNet on the self-built datasets are shown, together with the loss function of the improved model during numerical optimization, in Figures 13-15, respectively. It can be seen from Figures 13 and 14 that the segmentation images output by the improved model are almost consistent with the segmentation ground truth and also fuse well with the original images. This shows that the improved model has good segmentation accuracy and a good semantic segmentation effect. There are 6 loss curves in Figure 15, intended to analyze the different loss terms in full so that the experimental results can be accurately optimized.
In addition to the loss functions on the training and validation sets, the attention-mechanism loss curve (loss_att) and the time-correlation loss function (loss_ctc) on the training and test sets are configured to probe the model's ability to solve and generalize real-time problems.

Comparison of the Proposed Model and Common Models.
The proposed model is validated on the self-built datasets. Given the same memory and computation conditions, its performance is superior to several efficient convolutional neural networks under both the standard metrics and the introduced performance metrics, with test results given in Tables 4 and 5. Table 4 summarizes the recognition ability of the proposed model and seven other models on different object categories of the self-built dataset, where the bold figures denote the highest accuracy in the respective category. The MIoU here is the mean overlap rate between the target windows generated by the proposed model and the previously marked windows; a higher value means higher recognition accuracy. Table 4 shows that the recognition accuracy of the proposed model is high in most categories. However, for Sky and Pedestrian the values are 82.0 and 42.6, respectively, both ranking second to last, which is an obvious shortcoming of the proposed model. A preliminary analysis attributes this to blurred or missing boundary information in the ablation experiments and the data training stage.
In addition, the models are compared in terms of the number of parameters and real-time performance, as given in Table 5.
Referring to Table 5, the model proposed in this paper has very few parameters, and its recognition and segmentation are fast. This suggests that, while maintaining good accuracy, the proposed model achieves high real-time performance without the support of strong computing power.

Conclusion
In this paper, a real-time image semantic segmentation model based on multilabel classification is proposed. The ESPNet model is improved with reference to the characteristics of multilabel classification learning by the following steps: first, the standard convolution in the convolution layer is replaced by applying receptive fields in a deep convolutional neural network, so that every pixel in the covered area contributes to the final feature response; second, the ASPP module is improved based on atrous convolution, and DB-ASPP is proposed to reduce the gridding artifacts caused by multilayer atrous convolution, acquire multiscale information, and integrate image-level feature information; finally, extensive tests and comparisons show that the proposed model has fewer parameters, faster segmentation, and higher accuracy than the other models. Although the proposed model has improved in real-time performance and accuracy, a gap remains compared with the accuracy of non-real-time image semantic segmentation models. Future work will focus on improving accuracy, mainly by integrating shallow-network feature information and optimizing the collection and processing of boundary information.
Data Availability

The experimental datasets used in this work are publicly available, and the bundled data and code of this work are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.