Feature-Enhanced Occlusion Perception Object Detection for Smart Cities

Object detection is widely used in smart cities, including safety monitoring, traffic control, and autonomous driving. However, in smart city scenarios, many objects suffer from occlusion, and most popular object detectors are sensitive to various real-world occlusions. This paper proposes a feature-enhanced occlusion perception object detector that detects occluded objects while fully utilizing spatial information. To generate hard examples with occlusions, a mask generator localizes and masks discriminative regions using weakly supervised methods. To obtain enriched feature representations, we design a multiscale representation fusion module that combines hierarchical feature maps. Moreover, the method exploits contextual information by stacking representations from different regions of the feature maps. The model is trained end-to-end by minimizing a multitask loss. Our model obtains superior performance compared to previous object detectors: 77.4% mAP and 74.3% mAP on PASCAL VOC 2007 and PASCAL VOC 2012, respectively. It also achieves 24.6% mAP on MS COCO. Experiments demonstrate that the proposed method improves the effectiveness of object detection, making it highly suitable for smart city applications that need to discover key objects under occlusion.


Introduction
The development of smart cities is inseparable from two key technologies: the Internet of Things (IoT) and artificial intelligence (AI). Although IoT technology [1][2][3] has developed rapidly in recent years, its effectiveness still needs improvement. Therefore, the effective combination of IoT [4][5][6] and AI technology has become a major challenge today. Object detection based on neurocomputing, one of the tasks of smart cities, has been well studied in recent years, since it is a biologically inspired AI application. The goal of object detection is to localize objects of predefined categories. Recent state-of-the-art object detectors can be split into two main categories: region-based detectors [7][8][9] and regression-based detectors [10,11]. These models have contributed greatly to the development of object detection.
Nevertheless, the robustness of object detection is still worth studying. In practical smart city applications, the network needs to detect objects in perturbed images. We can classify these images into two categories: (1) some parts of the object are occluded (Figure 1(a)), and (2) the object extends beyond the picture boundary (Figure 1(b)). Occlusion occurs commonly in multiple-object images, where foreground objects mask some features of the objects behind them. In the other case, some object features are lost because the object extends beyond the picture boundary. These images are called hard examples in this paper.
Because hard examples have stronger transfer ability, it is difficult for the network to learn discriminative features for detection. Therefore, it is necessary to enhance the network's ability to mine useful information from perturbed images. However, it is difficult to train a robust model given only a normal dataset. One solution is to integrate hard example mining into the training stage [12,13], but this does not solve the essential problem. A more effective method is to generate hard examples from the detection dataset. Some works [14][15][16][17][18] have been devoted to the example generation problem. One useful solution is to generate realistic-looking images with generative adversarial networks [14][15][16][17]. Another is to generate masks directly on the original images; for instance, [18] generates hard occlusion examples for object detection during training.
In this paper, we propose a novel approach that better addresses the above issues. Our goal is to tackle the lack of hard examples and exploit more feature representation for the object with occlusions as far as possible.
There are three motivations for our study. First, to improve network robustness on a balanced dataset, we propose a deep mask generator that produces hard positive examples. Second, weakly supervised object localization is crucial for the mask generator to obtain masks accurately; locating the discriminative region lets the generator know which part of an object is likely to be occluded in real life. Third, the richness of the feature map is important for the mask generator to obtain hard examples; a multiscale feature map contains far more information and detail. To sum up, our main contributions are as follows: (1) we introduce an end-to-end approach that improves the robustness of the object detector and achieves competitive performance on the object detection task; (2) we propose a mask generator that uses weakly supervised localization to generate pixel-wise masks, and we show that it helps obtain more realistic hard examples during training; (3) we design a multiscale feature fusion module and a context-aware information module to exploit abundant spatial information, and we demonstrate that they improve the richness of the feature map.

Related Work
In recent years, many region-based object detectors have performed classification and bounding box regression on each proposal region. Compared with regression-based detectors, region-based detectors achieve superior detection and localization accuracy. Following the pioneering region-based object detector R-CNN, Fast R-CNN [7] increases accuracy by adding a RoI-pooling layer. In Faster R-CNN [9], the region proposal network (RPN) generates more precise proposals than selective search. Our work builds on Faster R-CNN, a remarkable end-to-end method.

Multiscale Representation Concatenation.
Recently, many significant works have shown that multiscale feature concatenation is vital for object detection [19,20]. For example, ION [21] extracts several feature descriptors and combines them after the RoI-pooling layer [8]. HON [22] aggregates high-level features, and [23] incorporates hierarchical feature maps and compresses them into a fixed-size space. To perform detection at multiple scales, RON [24] uses reverse connections to predict objects at different layers, and FPN [25] presents a clean and simple framework for building feature pyramids inside ConvNets, achieving good results when trained on the COCO trainval35k dataset. EfficientDet [26] proposes a weighted bidirectional feature pyramid network (BiFPN), which allows easy and fast multiscale feature fusion. AugFPN [27] incorporates consistent supervision, residual feature augmentation, and soft RoI selection, which significantly improve the baseline on the challenging MS COCO dataset. Chu et al. [28] use an ensemble object detection system to combine the relationships between objects and context features based on global scenes. Res2Net [29] constructs hierarchical residual-like connections that represent multiscale features at a granular level. [30] proposes a multiconnection module that fuses multigrained information to enhance feature representation. VDNets [31] uses an attention-based feature fusion method to make full use of multiscale feature information. In summary, multiscale feature concatenation improves the richness of the feature map and hence the detection accuracy of the detector. Different from these works, our method uses a new convolution layer with higher semantic information and is trained on the COCO trainval dataset.

Hard Example Mining.
Some works focus on how to better use data to improve model performance. [32] enhances the representations of small objects using a perceptual GAN. One direction is to insert hard examples into the training stage. For example, inspired by bootstrapping, OHEM [13] improves object detection by reranking and training on hard examples. This work is further extended by Wang et al. [18], who generate hard examples by adversarial learning for object detection. AOFD [33] increases the capacity of face detection by generating occlusion-like face features and proposes a multitask training method. [34] studies online selection of hard examples for minibatch SGD methods. [35] independently selects positive and negative examples from the training set with a stochastic strategy, and [36] uses a ranking loss function to find hard negative patches from a large set. C-RPNs [37] adopt multiple stages to mine hard samples. Similar to these works, our method improves detector capacity by adding hard examples during training. However, adversarial approaches may suffer from collapse and instability. Therefore, our model generates hard examples by masking the discriminative parts of objects rather than by adversarial learning; these discriminative parts are accurately located by weakly supervised learning.

Weakly Supervised Object Localization.
CNNs have been proved to perform well on object localization. Recently, many works have explored weakly supervised localization (WSL) using CNNs. Huang et al. [38] focus on improving the quality of initialized object locations. [39] proposes a self-taught model that localizes the more strongly responsive regions under artificial masking. [40] contributes a multifold multiple instance learning procedure to localize objects with CNN features. To solve the WSL problem and improve detection, [41] proposes a deep self-taught approach that localizes more positive samples by retraining itself. [42] proposes to integrate the feature pyramid network (FPN) with a convolutional neural network (CNN) for weakly supervised object localization. Hide-CAM [43] uses a strategy to locate the most discriminative and the complementary regions of the object. Oquab et al. [44] propose global max pooling and demonstrate that object localization can be completed using the output of CNNs. Different from [44], [45] localizes the discriminative image regions based on global average pooling. [46] trains a weakly supervised framework to mine the entire region of an object by randomly hiding patches. [46] is closely related to the adversarial erasing method [47], which also localizes dense regions by iteratively erasing discriminative parts.
Similar to [45,46], our work localizes regions of discriminative features and masks them to generate hard examples, without requiring any bounding box annotations to locate those regions.

Network Architecture
As shown in Figure 2, we use a multiscale feature concatenation architecture in the feature extraction network. A mask generation branch is added after the input layer, followed by an element-wise product layer. The aim of this branch is to let the network know which region of an object is likely to be occluded in real life. The generated feature representation is then fed into the RoI-pooling layer. Two fully connected (Fc) layers process each descriptor and produce two outputs: a class prediction and a bounding box.
In this section, we first introduce the feature extraction network, which extracts richer information. We then describe the mask generator, which produces masks in a weakly supervised way.

Multiscale Representation Concatenation.
For the feature extraction stage, multiscale representation concatenation is necessary to obtain a more detailed feature map. Observing that hierarchical feature maps have different characteristics in convolutional neural networks (CNNs), we use these feature maps in different ways.
In our model, convolution layers 4, 5, and 6 are combined to capture more feature detail (see Figure 3). Layer 6 is obtained by pooling and convolving convolution layer 5. The size of convolution layer 4 is double that of layer 5, and layer 6 is half the size of layer 5. To keep the multilevel maps at the same resolution, layer 4 is resized to match layer 5 through a max pooling layer, and layer 6 is resized to match layer 5 through a deconvolution layer for upsampling. Because the amplitudes at different levels vary, L2 normalization is applied to each feature representation before concatenation. The scale of the final representation is 1/16 of the original image scale, which is suitable for the RoI-pooling layer.
Our method uses VGG-16 [48] as the pretrained network. To meet the output shape of the RoI-pooling layer, the final map's scale should be 7 × 7 pixels with 512 channels; this is the standard input for the next detection network (fc6). Therefore, the representation map is fed into the RoI-pooling layer without any special operation, and this process guarantees that the feature map retains more detail.
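The fusion step above can be illustrated with a minimal numpy sketch. The function names, the 2×2 max pool, and the nearest-neighbour upsampling (a stand-in for the deconvolution layer) are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """L2-normalize each spatial position across channels (last axis)."""
    norm = np.sqrt((x ** 2).sum(axis=-1, keepdims=True))
    return x / (norm + eps)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 over an (H, W, C) map (resizes layer 4)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbour 2x upsampling, a stand-in for the deconvolution layer."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse_multiscale(conv4, conv5, conv6):
    """Resize conv4/conv6 to conv5's resolution, L2-normalize each level,
    and concatenate along the channel axis."""
    parts = [max_pool_2x2(conv4), conv5, upsample_2x(conv6)]
    return np.concatenate([l2_normalize(p) for p in parts], axis=-1)
```

In the real network a 1×1 convolution would typically follow to compress the concatenated map back to 512 channels before RoI pooling.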

Weakly Supervised Mask Generator. Even in large-scale datasets, it is difficult to sample all possible hard examples.
We take a flexible approach to generating hard examples, rather than relying on data augmentation. The mask generator finds discriminative areas via weakly supervised object localization and generates various realistic masks. More specifically, it masks only the discriminative part of the object in the training dataset. This effectively forces the model to learn features that look like the object even when the object is incomplete. Note that we apply the mask generator only during training, not during testing.

Weakly Supervised Object Location.
To localize these discriminative image regions, we use the network of Zhou et al. [45] to generate a class activation map (CAM). After learning a classification network, the CAM represents the discriminative image regions for a particular class. In general, the classification network is initialized from AlexNet [49], GoogLeNet [50], or VGGNet [48]; in our work, we learn a classification network based on the VGG-16 architecture. To generate a CAM for an image, global average pooling (GAP) is performed on the last convolutional feature maps: for each unit, the GAP output aggregates that unit's activation map, and the CAM for a class is the weighted sum of these outputs, with weights taken from the classification layer. Our final output is generated from the top 5 predicted categories for the input image.

Given an image I, denote by f_k(x, y) the activation of unit k in the last convolutional layer at spatial location (x, y). Performing global average pooling, the output for unit k is

F_k = Σ_{x,y} f_k(x, y).

For class c, w_k^c is the weight of the last classification layer corresponding to unit k. The weights w_k^c form an N × M matrix of the classification layer, where M is the number of feature maps in the last convolutional layer and N is the number of categories. Thus, the class score is

S_c = Σ_k w_k^c F_k.

It is obvious that w_k^c reflects the importance of F_k for class c. Then, the class activation map for class c is

CAM(c, I)(x, y) = Σ_k w_k^c f_k(x, y).

Hence, for an image and a class c, CAM(c, I) indicates the importance of the activation at each spatial location (x, y).
Some examples of weakly supervised object location are shown in Figure 4.
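The CAM computation can be sketched in a few lines of numpy. Here the names are hypothetical and the feature maps stand in for the last convolutional activations of the trained VGG-16 classifier; we use the summed (rather than averaged) pooling, matching the Σ formulation above:

```python
import numpy as np

def class_activation_map(feature_maps, weights, c):
    """feature_maps: (K, H, W) last-conv activations f_k(x, y).
    weights: (N, K) classification-layer weights w_k^c for N classes.
    Returns the CAM for class c: M_c(x, y) = sum_k w_k^c * f_k(x, y)."""
    return np.tensordot(weights[c], feature_maps, axes=1)

def class_score(feature_maps, weights, c):
    """S_c = sum_k w_k^c * F_k, where F_k = sum_{x,y} f_k(x, y) (summed GAP)."""
    F = feature_maps.sum(axis=(1, 2))  # (K,)
    return float(weights[c] @ F)
```

Note that by linearity the class score equals the CAM summed over all spatial locations, which is the identity the CAM derivation relies on.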

Masking Strategy.
We generate a mask map X for an image with size w × h × c, where w and h are the width and height of the concatenation map and c is the number of channels. X_{x,y} is the pixel value at location (x, y) of the mask map, and each pixel value of X is compressed to 0 or 1. The values of the mask map are obtained by applying a hard threshold O to the CAM. If the value of the activation map M_i^c at location (x, y) is greater than O, that location belongs to a discriminative region; thus X_{x,y} = 0, and the value at the corresponding spatial location is dropped in all channels. On the contrary, the feature values of general regions are retained. Our strategy is to mine the strongly responsive areas in the feature map and mask these discriminative regions precisely, which is more accurate than dropping pixels randomly. The occluded samples become hard examples for training. Some examples are shown in Figure 4. We now describe the mask generator more formally. Let D = {(I_i, Y_i)}_{i=1}^N be a training set of N images, and let P_i be the mask regions for image i. Denote by M_i^c the activation map of image i for class c, generated by CAM(c, I_i), where c ∈ Y_i. Let p_{i,x,y} be the pixel value at location (x, y) of the activation map M_i^c. Once p_{i,x,y} is greater than the hard threshold O_m, the region at location (x, y) is mined. We then mask the mined regions to obtain the new training set D′. The procedure is detailed in Algorithm 1.
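The masking strategy can be sketched as follows. This is a simplified numpy version in which the CAM is rescaled to the 0-255 range before thresholding (the scaling convention and function names are our assumptions):

```python
import numpy as np

def generate_mask(cam, threshold=170):
    """Binary mask X: 0 where the (0-255 scaled) CAM exceeds the hard
    threshold O (the discriminative region is dropped), 1 elsewhere."""
    cam = cam - cam.min()
    scaled = np.zeros_like(cam) if cam.max() == 0 else cam * (255.0 / cam.max())
    return (scaled <= threshold).astype(np.float32)

def apply_mask(image, mask):
    """Element-wise product: drop the masked locations in every channel."""
    return image * mask[..., None]
```

During training, `apply_mask` would be applied to produce the occluded hard example; at test time the input passes through unmasked.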

3.5. Context-Aware Information.
Work [21] uses a recurrent neural network (RNN) to extract contextual information. To connect features from different contextual regions, Zeng et al. [51] pass them through a gated bidirectional network for feature expression. Influenced by [52], our method extracts contextual information from different regions of the feature map; the difference is that we extract it from a feature representation with richer information.
After the RoI-pooling layer, we stack features from the object region and its contextual information. By default, the context region is one and a half times the size of the region of interest (RoI). The context region of the fusion map is fed to the RoI-pooling layer to generate a fixed-length feature descriptor of size 7 × 7 × 512; the object region's descriptor is obtained in the same way. Our method combines these two descriptors by adding corresponding values at the pixel level. This omits additional dimension-reduction layers, improving model efficiency and reducing extra runtime.
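A minimal sketch of the context-region construction and the pixel-level fusion (the clipping bounds and function names are illustrative assumptions):

```python
import numpy as np

def enlarge_roi(box, scale=1.5, img_w=1000, img_h=600):
    """Grow an (x1, y1, x2, y2) RoI to `scale` times its size about its
    centre, clipped to the image, to obtain the context region."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    hw, hh = (x2 - x1) * scale / 2.0, (y2 - y1) * scale / 2.0
    return (max(0.0, cx - hw), max(0.0, cy - hh),
            min(float(img_w), cx + hw), min(float(img_h), cy + hh))

def fuse_descriptors(obj_desc, ctx_desc):
    """Combine the 7x7x512 object and context descriptors by element-wise
    addition, avoiding extra dimension-reduction layers."""
    return obj_desc + ctx_desc
```

The element-wise sum keeps the descriptor at 7 × 7 × 512, so the downstream fully connected layers need no change.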
3.6. Detection and Training

3.6.1. Region Generating. For region generation, the region proposal network (RPN) [9] is used to generate candidate boxes. To cover objects of various sizes, we use 3 scales and 3 aspect ratios. However, the RPN always generates many redundant region proposals. To reduce redundancy, nonmaximum suppression (NMS) is used to filter the proposals: a proposal is deleted when its intersection-over-union (IoU) with a higher-scoring proposal exceeds a threshold. We set the threshold to 0.7, and the top-ranked 300 proposals are used in the next stage.
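The greedy NMS filtering described above can be sketched as follows (a standard formulation; the paper's exact implementation may differ):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7, top_k=300):
    """Greedy NMS: keep the highest-scoring box, drop remaining boxes whose
    IoU with it exceeds iou_thresh, and repeat; return up to top_k indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices sorted by score, descending
    keep = []
    while order.size and len(keep) < top_k:
        i = order[0]
        keep.append(int(i))
        # Intersection of box i with every remaining box.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # suppress high-overlap boxes
    return keep
```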
3.6.2. Object Detection. After generating proposal regions, the detection module classifies them into K + 1 categories (K = 20 for the PASCAL VOC dataset) and performs bounding box regression. The previous module outputs an abundant pooled feature map. We make maximum use of the pooled features with two fully connected layers, then compute per-class scores with Softmax and output an adjustment to the bounding box.
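A toy sketch of this detection head: a fully connected layer followed by a Softmax over the K + 1 classes and a parallel box-regression output. Weight shapes and names are illustrative only, not the paper's architecture:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def detection_head(pooled, w_fc, w_cls, w_reg):
    """Toy dense head: flattened pooled features -> fc + ReLU -> per-class
    probabilities over K + 1 categories and per-class box deltas."""
    h = np.maximum(pooled @ w_fc, 0)
    return softmax(h @ w_cls), h @ w_reg
```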
Algorithm 1: Weakly supervised mask generator.

3.6.3. Joint Training. This paper adopts an end-to-end approach that jointly optimizes the loss function. During training, the detection network and the RPN are combined into one network. In each training iteration, the RPN generates a set of region proposals for the detection network to predict classification scores and regress locations; this constitutes the forward propagation. In the RPN stage, we assign a positive label to a box whose intersection-over-union (IoU) with a ground-truth box is higher than 0.7, or which has the highest IoU with a ground-truth box. Conversely, a box whose IoU is lower than 0.3 is assigned a negative label. In backward propagation, the losses of the two networks generate the gradient signal. To achieve this, the multitask loss function is defined as

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*),

where p_i is the predicted probability that anchor i is an object and p_i* is the ground-truth label of anchor i. L_cls(p_i, p_i*) is the classification loss, and L_reg(t_i, t_i*) = R(t_i − t_i*) is the regression loss, where R is the robust smooth L1 loss [7]:

R(x) = 0.5 x^2 if |x| < 1, and |x| − 0.5 otherwise.
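The smooth L1 loss and the multitask objective above can be sketched as follows. The binary cross-entropy form of L_cls and the simplified normalization constants are our assumptions, made so the sketch is self-contained:

```python
import numpy as np

def smooth_l1(x):
    """R(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5 (element-wise)."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(p, p_star, t, t_star, lam=1.0):
    """Classification term (binary cross-entropy over anchors) plus the
    regression term, which only counts positively labelled anchors."""
    eps = 1e-12
    l_cls = -(p_star * np.log(p + eps)
              + (1 - p_star) * np.log(1 - p + eps)).mean()
    l_reg = (p_star[:, None] * smooth_l1(t - t_star)).sum() / max(p_star.sum(), 1)
    return l_cls + lam * l_reg
```

The `p_star` factor in the regression term mirrors the p_i* L_reg term above: only positive anchors contribute box-regression gradients.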

Experiments
We conduct experiments on three detection datasets: PASCAL VOC 2007, PASCAL VOC 2012 [53], and MS COCO [54]. For PASCAL VOC, the union of the PASCAL VOC 2007 trainval and 2012 trainval sets is used to train all networks, and the PASCAL VOC 2007 and PASCAL VOC 2012 test sets are used for evaluation, respectively. For MS COCO, we train networks on the trainval set and test on test-dev via the evaluation server. Results are measured by mean average precision (mAP).

Experimental Setup.
Our networks are designed based on the VGG-16 framework [48] and the Fast R-CNN baseline. The maximum size of the longest image side is 1024 pixels, and test images use the same scale as training images. Stochastic gradient descent (SGD) is used to optimize the objective function. We set the initial learning rate to 0.001 and decrease it by a factor of 10 after every 50,000 iterations. The weight decay is set to 0.0005 and the momentum to 0.9, so the learning rate is 0.001 for the first 50k minibatches and 0.0001 for the next 20k. VGG-16 serves as the pretrained model: it was first pretrained on the ImageNet benchmark and then fine-tuned on the detection benchmark.
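The step learning-rate schedule described above can be expressed as a one-line helper (a sketch of the stated schedule, not the authors' training code):

```python
def learning_rate(iteration, base_lr=0.001, step=50000, gamma=0.1):
    """Step schedule: lr = base_lr * gamma ** (iteration // step), i.e.
    0.001 for the first 50k minibatches and 0.0001 for the next span."""
    return base_lr * gamma ** (iteration // step)
```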

PASCAL VOC 2007 Test Set.
For the PASCAL VOC 2007 detection task, we compare our models with state-of-the-art detectors (see Table 1). All parameters are set as in Faster R-CNN except for the image size. Our full model with all three modules improves performance to 77.4% mAP, a 4.2% boost over Faster R-CNN. Bounding box voting [55] is also a useful mechanism for improving detection performance.
To understand the performance of our model in detail, we use the detection analysis tool from [56]. As shown in Figure 5, the top row shows that our model detects various object categories with high quality (large white area); the majority of its confident detections are correct. The solid and dashed red lines show how recall changes under the strong and weak criteria, respectively. The bottom row shows the distribution of the top-ranked false positive types. Figure 6 demonstrates that our model is robust to different object sizes and aspect ratios. Compared with other state-of-the-art detectors, our model achieves better performance in three aspects: (1) the localization error (Loc) of our model is lower, meaning that it localizes objects better.

PASCAL VOC 2012 Test Set.
We also test our networks on PASCAL VOC 2012 and submit the results to the public evaluation server (anonymous URL: http://host.robots.ox.ac.uk:8080/anonymous/NG67QK.html). Our models are trained on the union of VOC 2007 and VOC 2012 trainval, without the VOC 2007 test set. Table 2 shows that our network obtains 74.3% mAP.

MS COCO Test Set.
In addition to PASCAL VOC, we present results on Microsoft COCO, obtained from the public evaluation server (anonymous URL: https://competitions.codalab.org/my/competition/submission/461101/stdout.txt). As shown in Table 3, our network achieves 24.6% mAP, higher than Faster R-CNN. Note that at IoU = 0.5:0.95, the mAP of our network is lower than that of DSSD321, SSD300, and ION, but for small objects the result is better. Thus, our network is good at detecting small objects, owing to the multiscale feature fusion module. Note also that DSSD321 extracts features with ResNet-101, whereas our network uses VGG-16.

Ablation Analysis
To study the impact of multiscale representation and context-aware information, we conduct several comparative experiments. We find that the detector is obstructed by the mask if the masked region is too large, because the network then sees few discriminative pixels; conversely, the mask is useless if it is too small. To find a suitable hard threshold O, we conduct a series of experiments with only the mask generator branch. Table 5 summarizes the results. Let R be the pixel value in the 256-level class activation map; regions with a high value of R are highlighted as discriminative. For a location in the class activation map, the greater the value of R, the greater the response for the class. A pixel is masked provided that R is greater than O, so setting a high threshold means a lower degree of masking. When O is 170, our detector achieves competitive results (76.9% mAP), but when we set O to 160 or 180, the results are less competitive.
The reason is that the mask generator cannot produce useful hard examples when O is too high, whereas the detector breaks down when O is too low. From these results, two points can be summarized: (1) the hard threshold O is vital for generating useful masks, and (2) occluding about one-third of the feature map area (O = 170) produces a reasonable mask.

Does the Mask Generator Help?
To prove that the mask generator is useful in the object detection network, we conduct a set of experiments comparing it with the baseline. As shown in Figure 7(a), our method achieves a better result. Furthermore, with the multiscale representation concatenation and contextual information modules, performance improves further (see Figure 7(b)). We also use three mask areas to obtain different hard examples for training; the performance for each mask area is shown in Figure 7(c). Our method performs best when the hard threshold O is 170.

Analysis for Multiscale Representation Concatenation.
To validate the effectiveness of multiscale representation concatenation, we design a series of experiments and study how representation concatenation affects detection performance. To better isolate the importance of multiscale feature fusion, we remove the mask generator branch.
Our network obtains high-level semantic information by fusing the higher convolution layer 6. But does convolution layer 6 really help? We design a set of experiments to verify this. First, we train a model that detects from layer 5 alone, which achieves 70.0% mAP. Second, we train models that detect from layers 3, 4, and 5 and from layers 4, 5, and 6, respectively. Table 6 evaluates detection performance for the different layer combinations with 100 region proposals. Fusing layers 3, 4, and 5 gives 73.1% mAP, and fusing layers 4, 5, and 6 gives the best result (75.4% mAP). These results confirm the effectiveness of convolution layer 6: the new layer 6 is useful in the fused feature map because it carries richer semantic information than layer 5.
As also shown in Table 6, we compare two methods for normalizing the feature map: L2 normalization and local response normalization (LRN) [49]. The model achieves 75.4% mAP with L2 normalization and 67.3% mAP with LRN, so L2 normalization is more effective.

Analysis for Context Information.
Context information is very important for feature extraction. Therefore, we design a set of experiments to verify its necessity. As shown in Figure 7, our model with contextual information achieves a better result than the baseline. Two points can be concluded: (1) embedding contextual information is a good way to improve detection performance, and (2) the pixel-level sum operation is vital to the embedding.

Conclusion
This paper proposed a novel architecture to address the occlusion problem in object detection. We aim to learn an object detector that is robust to different occlusions. To achieve this goal, we propose an end-to-end framework that generates hard examples during training and achieves competitive performance on the object detection task.
To learn object models that are invariant to occlusions, we proposed a mask generator that uses weakly supervised localization to generate pixel-wise masks, and we showed that the mask generator helps obtain more realistic hard examples during training. To exploit more spatial information and improve the richness of the feature map, we designed a novel multiscale representation concatenation module for the feature extraction stage and added the context-aware module to the region proposal network. Our method obtains comparable results: 77.4% mAP and 74.3% mAP on PASCAL VOC 2007 and VOC 2012, respectively. It also achieves 24.6% mAP on MS COCO. Our studies demonstrate that hard examples and rich spatial information are vital for object detection, helping smart cities solve the occlusion problem.

Data Availability
The data (PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO) supporting this study are from previously reported studies and datasets, which have been cited at the relevant places within the text as references [53,54].

Conflicts of Interest
The authors declare that they have no conflicts of interest.