Semantic segmentation with convolutional neural networks under a complex background using the encoder-decoder network increases the overall performance of online machine vision detection and identification. To maximize the accuracy of semantic segmentation under a complex background, it is necessary to consider the semantic response values of objects and components and their mutually exclusive relationship. In this study, we attempt to improve the low accuracy of component segmentation. A relatively better basic encoder-decoder network is selected for the semantic segmentation, and UPerNet is modified based on an improved component analysis module. The experimental results show that the accuracy of the proposed method improves from 48.89% to 55.62% while the segmentation time decreases from 721 to 496 ms. The method also shows good performance in the vision-based detection of 2019 Chinese Yuan anticounterfeiting features.
1. Introduction
As one of the primary tasks of machine vision, semantic segmentation differs from image classification and object detection. Image classification recognizes the type of object but cannot provide position information [1], whereas object detection can detect the bounding box and type of the object but cannot provide its actual boundary [2]. On the other hand, semantic segmentation can recognize the type of object and delineate its actual area at the pixel level, as well as implement certain machine vision detection functions, such as positioning and recognition [3]. As we move from image classification to object detection and finally to semantic segmentation, the accuracy of the output range and position information improves [4]. In the same manner, the recognition precision increases from the image level to the pixel level. Semantic segmentation achieves the best recognition accuracy; therefore, it is useful for (1) distinguishing the entity from the background, (2) obtaining clearly physically defined position information (e.g., the centroid) by indirect calculation, and (3) performing machine vision detection and identification tasks that require high spatial resolution and reliability [5, 6].
Online semantic segmentation with convolutional neural networks (CNNs) under a complex background is effective for improving the overall performance of online machine vision detection and identification [7]: by keeping the architecture of the encoder-decoder network with its convolutional and pooling layers and equivalently transforming the fully connected layer, broad generalization is obtained. In recent years, ResNet has been used to replace shallow CNNs, which significantly optimizes the semantic segmentation results [8]. For machine vision detection and identification under a random-texture complex background, it is necessary to eliminate the complex background to extract the object without affecting its original features [9]. The difficulty lies in the randomness of the textured background, which prevents the use of typical periodic-texture elimination techniques, such as frequency-domain filtering and image matrix methods [10, 11]. In contrast, the encoder-decoder semantic segmentation network retains the classification components in the network backbone, thus exhibiting larger receptive fields and better pixel recognition ability [12, 13], as depicted in Figure 1. If the component analysis module is unreasonably selected and consequently used incorrectly, the foreground range becomes excessively small, resulting in the misjudgment of component pixels; if the component analysis module is too sensitive, the foreground range becomes too broad, and it is difficult to remove misjudged pixels [14]. Therefore, in the process of semantic segmentation under a complex background, it is necessary to consider the semantic response values of objects and components and their mutually exclusive relationship while maximizing the accuracy of the semantic segmentation under the complex background using the encoder-decoder network.
Figure 1: Flowchart of semantic segmentation under the complex background using the encoder-decoder network.
Figure 2 shows a flowchart of the semantic segmentation under a complex background using the encoder-decoder network. The process can be described as follows: the component classifier of the encoder-decoder network recognizes the pixel-level component semantics and response of the pixels in the image; the object classifier recognizes the pixel-level object semantics and response and extracts the misjudged foreground-object pixels in the semantic segmentation; finally, the mutually exclusive relationship between component semantics and object semantics is considered, and the semantics that are not independent of the background are determined, so that effective semantic segmentation under a complex background is achieved and the model accuracy is improved [15].
Figure 2: Flowchart of the semantic segmentation under a complex background using the encoder-decoder network.
In this study, we focus on online semantic segmentation under a complex background using the encoder-decoder network to solve the above-described mutual exclusion problem between component semantics and object semantics. The main contributions of this study are threefold:
We attempted to improve the low accuracy of component segmentation and selected the superior basic encoder-decoder network according to the performance.
We modified the UPerNet based on the component analysis module to maximize the accuracy of the semantic segmentation under a complex background using the encoder-decoder network while maintaining an appropriate segmentation time.
We show that the proposed method is superior to previous encoder-decoder networks and has satisfactory accuracy and segmentation time. We also show the application of the proposed method in banknote anticounterfeiting identification.
The rest of this paper is organized as follows. In Section 2, we outline related work. In Section 3, we introduce the method for semantic segmentation under a complex background using the encoder-decoder network and verify it experimentally. In Section 4, we present the conclusions.
2. Related Work

2.1. Evaluation of the Semantic Segmentation Performance
We can generally evaluate CNN semantic segmentation performance in terms of accuracy and running speed. The accuracy indicators usually include the pixel accuracy [16], the mean intersection over union [16], and the mean average precision [17]. The pixel accuracy PA is defined as the proportion of correctly segmented pixels among all image pixels; the mean intersection over union IoU¯ is defined as the degree of coincidence between the segmentation results and their ground truth; the mean average precision APIoUT is the average precision of the segmentation results whose intersection over union is no less than IoUT, averaged over all classes.
If the objects detected by machine vision fall into k categories, the semantic segmentation model requires k+1 category labels, denoted as L = {l0, l1, …, lk}, where the extra label is the background. Let pij denote the number of pixels of class li recognized as class lj (so pii, with i = j, counts correctly recognized pixels), and let Nij and Nii denote the corresponding counts of detected objects. The three indicators are then calculated as follows:
\[
\mathrm{PA}=\frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}},\qquad
\overline{\mathrm{IoU}}=\frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}+\sum_{j=0}^{k} p_{ji}-p_{ii}},
\tag{1}
\]
\[
\mathrm{AP}_{\mathrm{IoU}_{T}}=\frac{\sum_{i=0}^{k} N_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} N_{ij}}.
\tag{2}
\]
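To make equations (1) and (2) concrete, the following sketch (our own NumPy illustration, not code from the paper) computes PA and the mean IoU from a (k+1) × (k+1) pixel confusion matrix; the class-wise averaging of IoU is assumed, and APIoUT is omitted because it requires per-instance matching.

```python
# Pixel accuracy and mean IoU from a confusion matrix whose entry [i, j]
# counts pixels of class l_i predicted as class l_j (illustrative values).
import numpy as np

def pixel_accuracy(conf: np.ndarray) -> float:
    # PA = sum_i p_ii / sum_i sum_j p_ij
    return np.trace(conf) / conf.sum()

def mean_iou(conf: np.ndarray) -> float:
    # IoU_i = p_ii / (sum_j p_ij + sum_j p_ji - p_ii), averaged over classes
    tp = np.diag(conf).astype(float)
    denom = conf.sum(axis=1) + conf.sum(axis=0) - tp
    iou = tp / np.maximum(denom, 1)
    return float(iou.mean())

# Example: 3 classes (background + 2 object classes)
conf = np.array([[50, 2, 1],
                 [3, 40, 2],
                 [1, 1, 30]])
print(pixel_accuracy(conf), mean_iou(conf))
```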
The running speed of CNN semantic segmentation can be measured by indicators including the segmentation time Tseg [18], which is defined as the time needed to segment the image by running the algorithm. The theoretically shortest possible time required to segment the image is also labeled as the theoretical segmentation time Tseg−t, and the time required for the algorithm to actually segment the image is known as the actual segmentation time Tseg−a. If not otherwise specified, Tseg−a is denoted as Tseg.
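In practice, the actual segmentation time Tseg−a is obtained by timing inference. The following minimal sketch (our own illustration in PyTorch; the model and input are placeholders) shows one way it could be measured on a GPU, where torch.cuda.synchronize() is needed because CUDA kernels run asynchronously.

```python
# Hedged sketch of measuring the actual segmentation time in milliseconds.
import time
import torch

def measure_segmentation_time(model: torch.nn.Module, image: torch.Tensor, repeats: int = 10) -> float:
    model.eval()
    with torch.no_grad():
        model(image)                      # warm-up run so lazy initialization does not distort the timing
        if image.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(repeats):
            model(image)
        if image.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats * 1000.0
```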
Although CNN semantic segmentation is performed as a single end-to-end process rather than being explicitly divided into multiple modules, the way the constituent modules are connected directly affects the CNN. The end-to-end semantic segmentation framework based on the encoder-decoder enables the CNN to accept images of any resolution and to output prediction maps at the same resolution. Typical networks include the fully convolutional network (FCN) [19], SegNet [20], and U-Net [21].
Figure 3 shows a schematic of the FCN model. The FCN is an end-to-end semantic segmentation framework proposed by Jonathan Long et al. (University of California, Berkeley) in 2014. The main idea is as follows: the operation of a fully connected layer is equivalent to convolving a feature map with a kernel of identical size, so the fully connected layer is converted into a convolutional layer, turning the CNN into a fully convolutional network consisting only of convolutional and pooling layers that can process images of any resolution. In this manner, the limitation of the fully connected layer, i.e., the fixed input resolution, is overcome. Taking the pooling layers as the encoder, a cross-layer superimposed architecture is designed as the decoder: the coarse output feature map of the network is upsampled and added to the output feature map of each pooling layer (namely, the encoder) to obtain feature maps of progressively higher resolution, and the original resolution is restored by a final eightfold bilinear upsampling. Because end-to-end semantic segmentation is achieved purely through fully convolutional operation and cross-layer superimposition, various CNNs can adopt this framework. Using the framework described, the IoU¯ reached 62.2% on the VOC2012 semantic segmentation test set, which is 10.6% higher than classic methods and 12.2% higher than SDS [22] (whose IoU¯ is 50.0%), which combines CNN object detection with classical segmentation.
Figure 3: Schematic of the FCN model.
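The two FCN ideas just described can be illustrated with a minimal PyTorch sketch (our own simplified illustration under assumed layer sizes, not the original FCN implementation): the classifier head is a convolution instead of a fully connected layer, and coarse score maps are upsampled and fused with scores from an earlier pooling stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, num_classes: int = 21):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))    # 1/2
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))  # 1/4
        # "Convolutionalized" classifier: plays the role of the former fully connected layer.
        self.score4 = nn.Conv2d(128, num_classes, 1)
        self.score2 = nn.Conv2d(64, num_classes, 1)     # skip connection from the shallower stage

    def forward(self, x):
        f2 = self.stage1(x)                # 1/2 resolution features
        f4 = self.stage2(f2)               # 1/4 resolution features
        s4 = self.score4(f4)
        s2 = self.score2(f2)
        # Upsample the coarse scores and add them to the finer skip scores (cross-layer superimposition).
        s = F.interpolate(s4, size=s2.shape[-2:], mode="bilinear", align_corners=False) + s2
        # Bilinear upsampling back to the input resolution.
        return F.interpolate(s, size=x.shape[-2:], mode="bilinear", align_corners=False)

# Accepts any input resolution, e.g. 1 x 3 x 240 x 320:
print(TinyFCN()(torch.randn(1, 3, 240, 320)).shape)  # -> torch.Size([1, 21, 240, 320])
```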
ResNet, proposed by He et al. at Microsoft Research, serves as a basic network for constructing FCNs for semantic segmentation; the IoU¯ in VOC2012 reaches 8.6% [23]. The prediction results of the FCN are obtained by eightfold bilinear interpolation of the feature map, which leads to loss of detail, smoothing of complex boundaries, and poor detection sensitivity for small objects. The results also ignore the global scale of the image, possibly exhibiting regional discontinuity for large objects that exceed the receptive field. Moreover, incorporating the full connection and upsampling increases the size of the network and introduces a large number of parameters to be learned.
Figure 4 shows a schematic of the SegNet model, an efficient, real-time end-to-end semantic segmentation network proposed by Alex Kendall et al. (University of Cambridge) in 2015. The idea is that the encoder and the decoder have a one-to-one correspondence: the decoder reuses the pooling indices from the encoder's max pooling to perform nonlinear upsampling, forming a sparse feature map, and then applies convolution to generate a dense feature map. SegNet defines the basic network of the encoder-decoder and deletes the fully connected layer used to generate global semantic information. Because the decoder utilizes encoder information without additional training, the required number of training parameters is 21.7% of that of the FCN. For prediction, SegNet and FCN occupy 1052 and 1806 MB of GPU memory, respectively; on a GTX 980 GPU (4096 MB of video memory), the occupancy is 25.68% and 44.09%, respectively, so the occupancy of SegNet is 18.41% lower than that of FCN. In [20], a SegNet design based on ResNet was described, and the IoU¯ in VOC2012 reached 80.4% [24]. The IoU¯ of SegNet tested on VOC2012 was reported to be 59.9%, which is 2.3% lower than that of FCN; furthermore, there is the problem of false detection at object boundaries.
Figure 4: Schematic of the SegNet model.
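The pooled-index mechanism described above can be sketched as follows (our own illustration with placeholder channel sizes, not SegNet's actual layers): the encoder's max-pooling indices are reused by the decoder for nonlinear upsampling, and a convolution densifies the resulting sparse map.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
densify = nn.Conv2d(8, 8, kernel_size=3, padding=1)  # turns the sparse map into a dense feature map

x = torch.randn(1, 8, 16, 16)
pooled, indices = pool(x)          # encoder stage: keep the argmax positions
sparse = unpool(pooled, indices)   # decoder stage: parameter-free nonlinear upsampling
dense = densify(sparse)
print(pooled.shape, sparse.shape, dense.shape)
```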
Figure 5 shows a schematic of the U-Net model, which was proposed by Olaf Ronneberger (University of Freiburg, Germany) in 2015. The idea was to design a basic network that can be trained directly on semantic segmentation images and to modify the FCN cross-layer overlay architecture: the high-resolution feature map channels are retained in the upsampling path and concatenated with the decoder output feature map along the channel dimension. Furthermore, a tiling strategy not limited by GPU memory was proposed, with which seamless semantic segmentation of arbitrarily high-resolution images is achieved. With U-Net, an IoU¯ of 92.0% and 77.6% was achieved on the grayscale semantic segmentation datasets PhC-U373 and DIC-HeLa, respectively. The skip connection was also used in the ResNet framework to improve U-Net, and an IoU¯ of 82.7% was achieved on VOC2012 [25]. There are two key problems with the application of U-Net: the basic network needs to be trained, and it can only be applied to specific tasks, i.e., it has poor universality.
Figure 5: Schematic of the U-Net model.
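The channel-concatenation skip connection described above can be sketched as follows (our own simplified block with assumed channel counts, not the original U-Net): the coarse decoder features are upsampled and concatenated with the high-resolution encoder features before further convolution.

```python
import torch
import torch.nn as nn

class UNetBlock(nn.Module):
    def __init__(self, enc_ch: int = 64, dec_ch: int = 128, out_ch: int = 64):
        super().__init__()
        self.up = nn.ConvTranspose2d(dec_ch, dec_ch // 2, kernel_size=2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv2d(dec_ch // 2 + enc_ch, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, decoder_feat, encoder_feat):
        up = self.up(decoder_feat)                    # upsample the coarse decoder features
        fused = torch.cat([encoder_feat, up], dim=1)  # channel-wise concatenation (the skip connection)
        return self.fuse(fused)

block = UNetBlock()
out = block(torch.randn(1, 128, 32, 32), torch.randn(1, 64, 64, 64))
print(out.shape)  # -> torch.Size([1, 64, 64, 64])
```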
Figure 6 shows a schematic of the UPerNet model, which was proposed by Tete Xiao (Peking University, China) in 2018. In the UPerNet framework, the decoder is a feature pyramid network (FPN) combined with a pyramid pooling module (PPM): the PPM is appended to the last layer of the backbone network before the resulting features are fed into the top-down branch of the FPN. Object and part heads are attached to the feature map fused from all the layers output by the FPN.
Figure 6: Schematic of the UPerNet model.
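The PPM mentioned above can be sketched in simplified form (our own illustration with assumed channel sizes and pooling bins): the last backbone feature map is average-pooled to several grid sizes, projected, upsampled, and concatenated with the original features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    def __init__(self, in_ch: int = 2048, branch_ch: int = 512, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU())
            for b in bins
        )
        self.project = nn.Conv2d(in_ch + branch_ch * len(bins), branch_ch, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        # Pool at several scales, upsample back, and concatenate with the input features.
        pyramids = [x] + [
            F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
            for branch in self.branches
        ]
        return self.project(torch.cat(pyramids, dim=1))

print(PPM()(torch.randn(1, 2048, 16, 16)).shape)  # -> torch.Size([1, 512, 16, 16])
```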
3. Material and Methods
Semantic segmentation under a complex background based on the encoder-decoder network is formulated as an optimization model in terms of the segmentation time budget Tseg−min, the segmentation time Tseg, and the accuracy PA. Within the encoder-decoder network, the backbone network ηmain and its depth dmain form the encoder, and the decoder is ηdecoder. After selecting the relatively better ηmain and ηdecoder for the basic network, a component analysis module is proposed to improve the architecture, yielding an encoder-decoder network with optimized PA for semantic segmentation under a complex background. In the encoder-decoder network, the encoder transforms a color image (three 2D arrays) into 2048 2D feature maps. The encoder is composed of convolutional and pooling layers and can be pretrained on large-scale classification datasets, such as ImageNet, to gain stronger feature extraction capability.
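The encoder role described above can be illustrated with a small sketch (our own, assuming torchvision ≥ 0.13): a ResNet-50 backbone with its classification head removed maps a 3-channel color image to 2048 feature maps at 1/32 resolution.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)                          # random init here; ImageNet pretraining is optional
encoder = nn.Sequential(*list(backbone.children())[:-2])   # drop the global pooling and fully connected layers

image = torch.randn(1, 3, 512, 512)                        # one color image (three 2D arrays)
features = encoder(image)
print(features.shape)                                      # -> torch.Size([1, 2048, 16, 16]), i.e. 2048 2D arrays
```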
3.1. Modeling of semantic segmentation under a complex background using the encoder-decoder network and selection of ηmain and ηdecoder.
The encoder-decoder network is determined by the backbone network ηmain, the depth dmain, and the decoder ηdecoder. The segmentation time Tseg and accuracy PA depend on ηdecoder, ηmain, and dmain and can be expressed as PA(ηdecoder, ηmain, dmain) and Tseg(ηdecoder, ηmain, dmain). Denoting the segmentation time budget as Tseg−min (the recommended value is 600 ms), the mathematical model of the optimization for semantic segmentation under a complex background based on the encoder-decoder network is
\[
\max\;\mathrm{PA}(\eta_{\mathrm{decoder}},\eta_{\mathrm{main}},d_{\mathrm{main}})\quad
\text{s.t.}\;\;T_{\mathrm{seg}}(\eta_{\mathrm{decoder}},\eta_{\mathrm{main}},d_{\mathrm{main}})\le T_{\mathrm{seg\text{-}min}}.
\tag{3}
\]
The parameters of the model to be optimized are dmain, ηmain, and ηdecoder.
First, dmain, ηmain, and ηdecoder are combined. Then, the object segmentation accuracy PAobj, component segmentation accuracy PAcomp, and Tseg are compared to select the relatively better ηmain and ηdecoder for the basic network.
The ADE20K dataset, which has diverse annotations of scenes, objects, parts of objects, and parts of parts [26], is selected. In this paper, we refer to parts of objects as components. Using a GeForce GTX 1080Ti GPU and the training method described in [27], we obtained PAobj and PAcomp for the improved FCN [19], PSPNet [28], UPerNet [29], and other major encoder-decoder networks for semantic segmentation on the ADE20K [26] object/component segmentation dataset. We evaluated Tseg of the different networks on the ADE20K test set, which consists of 3000 images of different resolutions with an average size of 1.3 million pixels. Table 1 displays the pixel accuracy and segmentation time of the main network architectures on the ADE20K object/component segmentation task.
Table 1: Pixel accuracy and segmentation time of the main network architectures on the ADE20K object/component segmentation task.

No. | Network | Backbone ηmain | Backbone depth dmain | Decoder ηdecoder | Object segmentation accuracy PAobj (%) | Component segmentation accuracy PAcomp (%) | Segmentation time Tseg (ms)
1 | FCN [30] | ResNet | 50 | FCN | 71.32 | 40.81 | 333
2 | PSPNet [28] | ResNet | 50 | PPM | 80.04 | 47.23 | 483
3 | UPerNet [29] | ResNet | 50 | PPM + FPN | 80.23 | 48.30 | 496
4 | UPerNet [29] | ResNet | 101 | PPM + FPN | 81.01 | 48.71 | 604
From Table 1, the following observations can be made. ① In all networks, PAcomp is lower than PAobj by about 30%. ② ηmain and dmain are equal in networks 1, 2, and 3; PAcomp and PAobj are better with ηdecoder = PPM + FPN than with ηdecoder = FCN or ηdecoder = PPM. ③ ηmain and ηdecoder are equal in networks 3 and 4; when dmain is roughly doubled, PAcomp improves only slightly while Tseg increases significantly. After a comprehensive consideration, we selected the UPerNet [29] encoder-decoder network, with ηmain = ResNet, dmain = 50, and ηdecoder = PPM + FPN.
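This selection can be made explicit as the constrained maximization of equation (3). The following toy sketch (our own illustration, using the Table 1 values and the 600 ms budget) keeps the configurations that meet the time constraint and picks the one with the highest PAcomp.

```python
# Candidate (backbone, depth, decoder) combinations with measured accuracy and time (from Table 1).
candidates = [
    {"net": "FCN",     "backbone": "ResNet", "depth": 50,  "decoder": "FCN",     "PA_comp": 40.81, "T_seg": 333},
    {"net": "PSPNet",  "backbone": "ResNet", "depth": 50,  "decoder": "PPM",     "PA_comp": 47.23, "T_seg": 483},
    {"net": "UPerNet", "backbone": "ResNet", "depth": 50,  "decoder": "PPM+FPN", "PA_comp": 48.30, "T_seg": 496},
    {"net": "UPerNet", "backbone": "ResNet", "depth": 101, "decoder": "PPM+FPN", "PA_comp": 48.71, "T_seg": 604},
]

T_seg_min = 600  # recommended time budget in milliseconds
feasible = [c for c in candidates if c["T_seg"] <= T_seg_min]
best = max(feasible, key=lambda c: c["PA_comp"])
print(best)  # -> the ResNet-50 + PPM + FPN configuration
```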
Figure 7 shows the architecture of semantic segmentation under a complex background implemented by UPerNet [29]. The encoder ResNet halves the feature map resolution at each stage; the output feature maps of the five stages have resolutions of 1/2, 1/4, 1/8, 1/16, and 1/32, respectively. The decoder is PPM + FPN. Through pooling layers with different strides, the PPM analyzes the feature maps in a multiscale manner. Through three transposed convolution layers, the feature map resolution is doubled step by step to 1/16, 1/8, and 1/4, and the final upsampling restores it to 1/1 (the input resolution). The component analysis module recognizes the feature map and outputs both the object and component segmentation results.
Figure 7: Flowchart of semantic segmentation under a complex background implemented by UPerNet.
Figure 8 shows the component analysis module of UPerNet, which works together with the object classifier and the component classifier. The input of each classifier is the full-resolution (1:1) feature map. The object classifier implements the semantic recognition of NObj kinds of objects and outputs the object probability vector pObju,v and the object label CObju,v. The component classifier implements the semantic recognition of NComp kinds of components and outputs the component probability vector pCompu,v and the component label CCompu,v. According to CObju,v and the set of objects that have components, ℂObj−Things, the component analysis module keeps only the CCompu,v that satisfies CObju,v∈ℂObj−Things and outputs the valid component label C^Compu,v. UPerNet outputs the object segmentation result (the object label CObju,v) and the component segmentation result (the valid component label C^Compu,v).
Figure 8: Flowchart of the component analysis module of UPerNet.
The component analysis module of UPerNet can be expressed as follows:
\[
\hat{C}_{\mathrm{Comp}}(u,v)=f_{\mathrm{Op}}\left(C_{\mathrm{Obj}}(u,v),C_{\mathrm{Comp}}(u,v),\mathbb{C}_{\mathrm{Obj\text{-}Things}}\right)=
\begin{cases}
1\times C_{\mathrm{Comp}}(u,v), & C_{\mathrm{Obj}}(u,v)\in\mathbb{C}_{\mathrm{Obj\text{-}Things}},\\
0\times C_{\mathrm{Comp}}(u,v), & C_{\mathrm{Obj}}(u,v)\notin\mathbb{C}_{\mathrm{Obj\text{-}Things}}.
\end{cases}
\tag{4}
\]
A greater PAcomp of C^Compu,v leads to a higher component segmentation efficiency.
Equation (4) outputs C^Compu,v only where CObju,v∈ℂObj−Things. By using the relationship between CCompu,v and CObju,v to identify deviations of CObju,v, an optimized component analysis module can improve the efficiency of component segmentation: it both meets the requirement of Tseg−min and improves PAcomp.
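The baseline rule of equation (4) can be sketched as follows (our own implementation with illustrative label ids): a component label is kept only where the object label at the same pixel belongs to ℂObj−Things, and is set to background elsewhere.

```python
import torch

def baseline_component_analysis(obj_label: torch.Tensor,
                                comp_label: torch.Tensor,
                                obj_things: set) -> torch.Tensor:
    # obj_label, comp_label: integer label maps of shape (H, W); 0 is the background component.
    things_mask = torch.zeros_like(obj_label, dtype=torch.bool)
    for obj_id in obj_things:
        things_mask |= obj_label == obj_id          # pixels whose object has components
    return torch.where(things_mask, comp_label, torch.zeros_like(comp_label))

# Toy example: object ids {1: person, 2: sky}; only "person" has components.
obj = torch.tensor([[1, 1], [2, 2]])
comp = torch.tensor([[5, 6], [7, 8]])
print(baseline_component_analysis(obj, comp, obj_things={1}))
# -> tensor([[5, 6], [0, 0]])
```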
3.2. Improvements of UPerNet for semantic segmentation under a complex background based on the component analysis module.
In this subsection, we describe the derivation of the component analysis module, the optimization of the function expression of the module, and the construction of the architecture of the component analysis module.
As shown in Figure 8, the component classifier recognizes NComp component semantics and outputs, for the pixel at image position u,v, the component label CCompu,v and the probability vector pCompu,v over the component labels. The relationship between CCompu,v and pCompu,v [31] is as follows:
\[
C_{\mathrm{Comp}}(u,v)=\arg\max_{k}\,p_{\mathrm{Comp}\text{-}k},\qquad k=1,2,\ldots,N_{\mathrm{Comp}}.
\tag{5}
\]
From equations (4) and (5), we obtain
\[
\hat{C}_{\mathrm{Comp}}(u,v)=f_{\mathrm{Op}}\left(C_{\mathrm{Obj}}(u,v),C_{\mathrm{Comp}}(u,v),\mathbb{C}_{\mathrm{Obj\text{-}Things}}\right)=
\begin{cases}
1\times\arg\max_{k}\,p_{\mathrm{Comp}\text{-}k}, & C_{\mathrm{Obj}}(u,v)\in\mathbb{C}_{\mathrm{Obj\text{-}Things}},\\
0\times\arg\max_{k}\,p_{\mathrm{Comp}\text{-}k}, & C_{\mathrm{Obj}}(u,v)\notin\mathbb{C}_{\mathrm{Obj\text{-}Things}},
\end{cases}
\qquad k=1,2,\ldots,N_{\mathrm{Comp}},
\tag{6}
\]
where pObj−j is the probability of object label j at the pixel. Weighting pComp−k by pObj−j to obtain p^Comp−k, instead of using 1×argmaxkpComp−k, reduces the weight of component labels associated with low-probability object labels and thereby increases PAcomp. Moreover, when CObju,v∈ℂObj−Things but CCompu,v∉ℂComp−ObjCObj, setting p^Comp−k=0 increases the detection rate of background pixels. Therefore, the module can be expressed as follows:
\[
\hat{C}_{\mathrm{Comp}}(u,v)=f_{\mathrm{Op}}\left(C_{\mathrm{Obj}}(u,v),C_{\mathrm{Comp}}(u,v),\mathbb{C}_{\mathrm{Obj\text{-}Things}},\mathbb{C}_{\mathrm{Comp\text{-}Obj}}(C_{\mathrm{Obj}})\right)=\arg\max_{k}\,\hat{p}_{\mathrm{Comp}\text{-}k},\qquad k=1,2,\ldots,N_{\mathrm{Comp}},
\tag{7}
\]
\[
\hat{p}_{\mathrm{Comp}\text{-}k}=
\begin{cases}
\left(\sum_{j\in\mathbb{C}_{\mathrm{Obj\text{-}Things}}\,\wedge\,k\in\mathbb{C}_{\mathrm{Comp\text{-}Obj}}(j)} p_{\mathrm{Obj}\text{-}j}\right)p_{\mathrm{Comp}\text{-}k}, & k<N_{\mathrm{Comp}},\\[4pt]
\left(1-\sum_{j\notin\mathbb{C}_{\mathrm{Obj\text{-}Things}}\,\vee\,k\notin\mathbb{C}_{\mathrm{Comp\text{-}Obj}}(j)} p_{\mathrm{Obj}\text{-}j}\right)p_{\mathrm{Comp}\text{-}k}, & k=N_{\mathrm{Comp}},
\end{cases}
\qquad j=1,2,\ldots,N_{\mathrm{Obj}},
\]
which is the component analysis module obtained by replacing 1×argmaxkpComp−k with argmaxkp^Comp−k and by considering CCompu,v∉ℂComp−ObjCObj.
The optimized architecture of the UPerNet component analysis module is proposed based on equation (7). Figures 9(a)–9(c) show the optimized architectures obtained by (a) replacing 1×argmaxkpComp−k with argmaxkp^Comp−k, (b) considering CCompu,v∉ℂComp−ObjCObj, and (c) applying both modifications in the component analysis module, respectively.
Figure 9: Optimized architectures with the component analysis module. (a) Replacing 1×argmaxkpComp−k with argmaxkp^Comp−k. (b) Analyzing CCompu,v∉ℂComp−ObjCObj. (c) Both replacing 1×argmaxkpComp−k with argmaxkp^Comp−k and analyzing CCompu,v∉ℂComp−ObjCObj.
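The re-weighting of equation (7) can be sketched as follows (our own simplified interpretation: each component probability is weighted by the summed probability of the "thing" objects that can contain it, and the background component takes the remaining object probability mass; names such as comp_of_obj are illustrative only, not from the paper).

```python
import torch

def optimized_component_analysis(p_obj, p_comp, obj_things, comp_of_obj):
    """p_obj: (N_obj, H, W) object probabilities; p_comp: (N_comp, H, W) component
    probabilities; the last component index is treated as the background component."""
    n_comp = p_comp.shape[0]
    weights = torch.zeros_like(p_comp)
    thing_mass = torch.zeros_like(p_obj[0])
    for j in obj_things:
        thing_mass += p_obj[j]
        for k in comp_of_obj.get(j, ()):
            weights[k] += p_obj[j]            # k < N_comp: weight by the objects that can contain component k
    weights[n_comp - 1] = 1.0 - thing_mass    # background component takes the remaining probability mass
    p_hat = weights * p_comp                  # re-weighted component probabilities (p^Comp-k)
    return p_hat.argmax(dim=0)                # valid component label per pixel

# Toy example: 2 objects (0 = person, a "thing"; 1 = sky) and 3 components
# (0 = head, 1 = torso, 2 = background component) on a 1 x 1 image.
p_obj = torch.tensor([[[0.9]], [[0.1]]])
p_comp = torch.tensor([[[0.5]], [[0.3]], [[0.2]]])
label = optimized_component_analysis(p_obj, p_comp, obj_things={0}, comp_of_obj={0: [0, 1]})
print(label)  # -> tensor([[0]])
```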
For the UPerNet model, the backbone network of the encoder was ResNet with dmain = 50, and the decoders were PPM + FPN + component analysis modules (before/after modification). We trained each network on the object/component segmentation dataset ADE20K [26] to evaluate the pixel accuracy PAPart^ and segmentation time Tseg. The experiments were run on a GeForce GTX 1080Ti GPU.
Table 2 reports PAPart^ and Tseg of the UPerNet obtained with different component analysis modules on the ADE20K component segmentation task. From the results, the following observations can be made:
The pixel accuracy of ResNet (dmain=50) + PPM + FPN + the proposed modified component analysis modules with different settings increased from 48.30% (without component analysis modules) to 54.03%, 55.13%, and 55.62% while the segmentation time lengthened marginally from 483 to 492, 486, and 496 ms, respectively.
Table 2: Pixel accuracy and segmentation time of UPerNet with different component analysis modules (CAMs) on the ADE20K component segmentation task.

No. | Backbone ηmain | Backbone depth dmain | Decoder ηdecoder | Component analysis module | Component segmentation accuracy PAPart^ (%) | Segmentation time Tseg (ms)
1 | ResNet | 50 | PPM + FPN | — | 48.30 | 483
2 | ResNet | 101 | PPM + FPN | — | 48.71 | 598
3 | ResNet | 152 | PPM + FPN | — | 48.89 | 721
4 | ResNet | 50 | PPM + FPN + CAM | 1×argmaxkpComp−k [29] | 53.62 | 490
5 | ResNet | 101 | PPM + FPN + CAM | 1×argmaxkpComp−k [29] | 53.96 | 604
6 | ResNet | 152 | PPM + FPN + CAM | 1×argmaxkpComp−k [29] | 54.18 | 726
7 | ResNet | 50 | PPM + FPN + CAM | argmaxkp^Comp−k | 54.03 | 492
8 | ResNet | 50 | PPM + FPN + CAM | CCompu,v∉ℂComp−ObjCObj | 55.13 | 486
9 | ResNet | 50 | PPM + FPN + CAM | argmaxkp^Comp−k + CCompu,v∉ℂComp−ObjCObj | 55.62 | 496
The UPerNet with the modified component analysis module showed significantly higher segmentation performance. Both PAPart^ and Tseg outperformed those of the UPerNet with a deeper dmain: the architecture of Figure 9(c) with dmain = 50 achieves 55.62% and 496 ms, whereas the unmodified architectures with dmain = 101 and 152 achieve 48.71% at 598 ms and 48.89% at 721 ms, respectively.
We trained each UPerNet (with/without the component analysis module) on the instance-level semantic labeling task of the CITYSCAPES dataset [32]. To assess the instance-level performance, CITYSCAPES uses the mean average precision AP and the average precision at 50% overlap AP0.50 [32]. We also report the segmentation time of each network run on a GeForce GTX 1080Ti GPU and an Intel i7-5960X CPU. Table 3 presents the performance of the different methods on the CITYSCAPES instance-level semantic labeling task, and Table 4 presents the class-level average precision AP of the UPerNet with/without the component analysis module on the same task. It can be seen that the modified component analysis module effectively improves the performance of the UPerNet: both AP and AP0.50 are improved, the segmentation time Tseg increases only slightly from 447 to 451 ms, and most class-level AP values of the UPerNet are improved. Figure 10 shows some CITYSCAPES instance-level semantic labeling results obtained by the UPerNet with/without the component analysis module.
Table 3: Performance of different methods on the CITYSCAPES instance-level semantic labeling task. CAM: component analysis module.

Method | AP (%) | AP0.50 (%) | Segmentation time (ms)
SegNet | 29.5 | 55.6 | —
Mask R-CNN | 32.0 | 58.1 | —
UPerNet | 32.0 | 57.3 | 447
UPerNet + CAM | 36.5 | 62.2 | 451
Table 4: Class-level average precision AP of the UPerNet with/without CAM on the CITYSCAPES instance-level semantic labeling task.

Method | Person (%) | Rider (%) | Car (%) | Truck (%) | Bus (%) | Train (%) | Motorcycle (%) | Bicycle (%)
UPerNet | 36.0 | 28.8 | 51.6 | 30.0 | 38.7 | 27.3 | 23.9 | 19.4
UPerNet + CAM | 36.0 | 28.8 | 53.0 | 34.3 | 57.0 | 37.5 | 22.3 | 23.8
Figure 10: CITYSCAPES instance-level semantic labeling by UPerNet.
Taking banknote detection as an example, we applied the semantic segmentation models with the component analysis module (before/after modification) to the vision-based detection of 2019 Chinese Yuan (CNY) anticounterfeiting features under backlight to demonstrate the segmentation performance of the proposed method.
The vision-based detection system consisted of an MV-CA013-10 GC industrial camera, an MVL-HF2528M-6MP lens, and an LED strip light. The field of view was 18.33°, and the resolution was 1280 × 1024. Under the backlight, we collected 25 CNY images of the fronts and backs of various denominations at random angles. We then marked four types of light-transmitting anticounterfeiting features, namely, security lines, pattern watermarks, denomination watermarks, and Yin-Yang denominations. All four features were annotated in the CNY images to generate our dataset (200 images). We trained the models with the different component analysis modules on our dataset to evaluate PAPart^ and Tseg. Table 5 presents the pixel accuracy and segmentation time of UPerNet with different component analysis modules for CNY anticounterfeiting features via vision-based detection, and Figure 11 shows the segmentation results of the anticounterfeiting features detected by the UPerNet with/without the component analysis module.
Figure 11: Anticounterfeiting features detected by the UPerNet with/without the component analysis module.
From Table 5, it can be seen that the proposed method improved PAPart^ from 90.38% to 95.29% while Tseg increased only slightly from 490 to 496 ms. Moreover, APIoUT=0.5 increased from 96.1% to 100%, detecting all the light-transmitting anticounterfeiting features without false detection, missed detection, or repeated detection.
Table 5: Pixel accuracy and segmentation time of UPerNet with different component analysis modules (CAM) for CNY anticounterfeiting features via vision-based detection.

No. | Backbone ηmain | Depth dmain | Decoder ηdecoder | Component analysis module | PAPart^ (%) | APIoUT=0.5 (%) | Tseg (ms)
1 | ResNet | 50 | PPM + FPN | — | 88.50 | 85.3 | 483
2 | ResNet | 50 | PPM + FPN + CAM | 1×argmaxkpComp−k [29] | 90.38 | 96.1 | 490
3 | ResNet | 50 | PPM + FPN + CAM | argmaxkp^Comp−k + CCompu,v∉ℂComp−ObjCObj | 95.29 | 100 | 496
4. Conclusions
In this study, we performed semantic segmentation under a complex background using the encoder-decoder network, addressing the mutually exclusive relationship between the semantic response values and the semantics of objects and components in online machine vision detection. The following conclusions can be drawn from this study.
Considering the mutually exclusive relationship between the semantic response value and the semantics of the object/component, we established the mathematical optimization model of semantic segmentation under a complex background based on the encoder-decoder network. It was found that ηmain = ResNet with dmain = 50 is the best encoder and that ηdecoder = PPM + FPN is the best decoder.
We replaced 1×argmaxkpComp−k with argmaxkp^Comp−k and considered CCompu,v∉ℂComp−ObjCObj in the component analysis module of UPerNet to improve the performance of the encoder-decoder network.
The experimental results show that the component analysis module improves the performance of semantic segmentation under a complex background. Both PAPart^ and Tseg of the proposed model were better than those of the UPerNet with a deeper dmain; specifically, the accuracy improved from 48.89% to 55.62% and Tseg decreased from 721 to 496 ms. By performing vision-based detection of the 2019 CNY features, we showed that the proposed method improved PAPart^ from 90.38% to 95.29% while Tseg increased only slightly from 490 to 496 ms; APIoUT=0.5 also increased from 96.1% to 100%, detecting all the light-transmitting anticounterfeiting features without false detection, missed detection, or repeated detection.
The model in which 1×argmaxkpComp−k was replaced with argmaxkp^Comp−k together with the corresponding component analysis module improved the performance of the UPerNet encoder-decoder network. However, the efficiency improvement is affected by the accuracy of object segmentation. In our next study, we will investigate the applicability of machine learning to the component analysis module to achieve higher performance in different applications.
Data Availability
The ADE20K dataset used to support the findings of this study is available at http://groups.csail.mit.edu/vision/datasets/. The CITYSCAPES dataset used to support the findings of this study is available at https://www.cityscapes-dataset.com. The pretrained models and code are released at https://github.com/CSAILVision/semantic-segmentation-pytorch.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research was funded by the Key-Area Research and Development Program of Guangdong Province (Grant no. 2019B010154003) and the Guangzhou Science and Technology Plan Project (Grant no. 201802030006).
References

[1] Szegedy C., Liu W., Jia Y., "Going deeper with convolutions," Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, June 2015, IEEE, pp. 1–9.
[2] He K., Zhang X., Ren S., "Deep residual learning for image recognition," Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, June 2016, IEEE, pp. 770–778.
[3] He K., Gkioxari G., Dollar P., Girshick R., "Mask R-CNN," 2020, vol. 42, no. 2, pp. 386–397, doi: 10.1109/tpami.2018.2844175.
[4] Liu G., He B., Liu S., "Chassis assembly detection and identification based on deep learning component instance segmentation," 2019, vol. 11, no. 8, doi: 10.3390/sym11081001.
[5] Manish R., Venkatesh A., Denis Ashok S., "Machine vision based image processing techniques for surface finish and defect inspection in a grinding process," 2018, vol. 5, no. 5, pp. 12792–12802, doi: 10.1016/j.matpr.2018.02.263.
[6] Geng L., Wen Y., Zhang F., "Machine vision detection method for surface defects of automobile stamping parts," 2019, vol. 53, no. 1, pp. 128–144.
[7] Islam M. M., Kim J., "Vision-based autonomous crack detection of concrete structures using a fully convolutional encoder-decoder network," 2019, vol. 19, no. 19, doi: 10.3390/s19194251.
[8] Zhou S., Nie D., Adeli E., Yin J., Lian J., Shen D., "High-resolution encoder-decoder networks for low-contrast medical image segmentation," 2020, vol. 29, pp. 461–475, doi: 10.1109/tip.2019.2919937.
[9] Liu S., Huang J., Liu G., "Technology of multi-category legal currency identification under multi-light conditions based on AleNet," 2019, vol. 45, no. 9, pp. 118–122 (in Chinese).
[10] Kang H., Chen C., "Fruit detection and segmentation for apple harvesting using visual sensor in orchards," 2019, vol. 19, no. 20, p. 4599, doi: 10.3390/s19204599.
[11] Pardo E., Morgado J. M. T., Malpica N., "Semantic segmentation of mFISH images using convolutional networks," 2018, vol. 93, no. 6, pp. 620–627, doi: 10.1002/cyto.a.23375.
[12] Liu G., Liu S., Wu J., "Machine vision object detection algorithm based on deep learning and application in banknote detection," 2019, vol. 45, no. 5, pp. 1–9 (in Chinese).
[13] Gao H., Yuan H., Wang Z., "Pixel transposed convolutional networks," 2019, vol. 42, no. 5, pp. 1218–1227.
[14] Vu Q. D., Kwak J. T., "A dense multi-path decoder for tissue segmentation in histopathology images," 2019, vol. 173, pp. 119–129, doi: 10.1016/j.cmpb.2019.03.007.
[15] Huang J., Liu G., "The development of CNN-based semantic segmentation method," 2019, vol. 40, no. 5, pp. 10–16 (in Chinese).
[16] Nowozin S., "Optimal decisions from probabilistic models: the intersection-over-union case," Proceedings of the Computer Vision and Pattern Recognition, Columbus, OH, USA, June 2014, IEEE, pp. 548–555.
[17] Hoiem D., Chodpathumwan Y., Dai Q., "Diagnosing error in object detectors," Proceedings of the European Conference on Computer Vision, Florence, Italy, October 2012, pp. 340–353, doi: 10.1007/978-3-642-33712-3_25.
[18] He K., Sun J., "Convolutional neural networks at constrained time cost," Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, June 2015, IEEE, pp. 5353–5360.
[19] Long J., Shelhamer E., Darrell T., "Fully convolutional networks for semantic segmentation," Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, June 2015, IEEE, pp. 3431–3440.
[20] Badrinarayanan V., Kendall A., Cipolla R., "SegNet: a deep convolutional encoder-decoder architecture for image segmentation," 2017, vol. 39, no. 12, pp. 2481–2495, doi: 10.1109/tpami.2016.2644615.
[21] Ronneberger O., Fischer P., Brox T., "U-net: convolutional networks for biomedical image segmentation," Proceedings of Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, October 2015, pp. 234–241, doi: 10.1007/978-3-319-24574-4_28.
[22] Hariharan B., Arbeláez P., Girshick R., Malik J., "Simultaneous detection and segmentation," Proceedings of the European Conference on Computer Vision, Zürich, Switzerland, 2014, pp. 297–312, doi: 10.1007/978-3-319-10584-0_20.
[23] Lane N. D., Warden P., "The deep (learning) transformation of mobile and embedded computing," 2018, vol. 51, no. 5, pp. 12–16, doi: 10.1109/mc.2018.2381129.
[24] Qing C., Ruan J., Xu X., Ren J., Zabalza J., "Spatial-spectral classification of hyperspectral images: a deep learning framework with Markov Random fields based modelling," 2019, vol. 13, no. 2, pp. 235–245, doi: 10.1049/iet-ipr.2018.5727.
[25] Li X., Liu Z., Luo P., "Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade," Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, July 2017, IEEE, pp. 6459–6468.
[26] Zhou B., Zhao H., Puig X., "Semantic understanding of scenes through the ADE20K dataset," 2019, vol. 127, no. 3, pp. 302–321, doi: 10.1007/s11263-018-1140-0.
[27] Chen L.-C., "Rethinking atrous convolution for semantic image segmentation," 2017, https://arxiv.org/abs/1706.05587.
[28] Zhao H., Shi J., Qi X., "Pyramid scene parsing network," Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, July 2017, IEEE, pp. 6230–6239.
[29] Xiao T., Liu Y., Zhou B., Jiang Y., Sun J., "Unified perceptual parsing for scene understanding," Proceedings of the European Conference on Computer Vision, Munich, Germany, September 2018, pp. 432–448, doi: 10.1007/978-3-030-01228-1_26.
[30] Kim D., Kwon J., Kim J., "Low-complexity online model selection with Lyapunov control for reward maximization in stabilized real-time deep learning platforms," Proceedings of Systems, Man and Cybernetics, Miyazaki, Japan, 2018, pp. 4363–4368.
[31] Krizhevsky A., Sutskever I., Hinton G. E., "ImageNet classification with deep convolutional neural networks," 2017, vol. 60, no. 6, pp. 84–90, doi: 10.1145/3065386.
[32] Cordts M., Mohamed O., Ramos S., "The cityscapes dataset for semantic urban scene understanding," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016, IEEE, pp. 3213–3223.