Towards Pedestrian Target Detection with Optimized Mask R-CNN

To address the low accuracy of pedestrian target detection, we propose a detection algorithm based on an optimized Mask R-CNN that draws on recent deep learning research to improve both detection accuracy and speed. Pedestrian targets in natural scene images are affected by illumination, posture, background, and other factors, so the target information is highly complex. To extract features better, SKNet is used to replace part of the convolution modules in the deep residual network so that the model can adaptively select the best convolution kernel during training. In addition, according to statistics on pedestrian proportions, the aspect ratio of the anchor box is modified to better match the natural shape of pedestrian targets. Finally, a pedestrian target dataset is built by selecting suitable pedestrian images from the COCO dataset and is expanded by adding noise and applying median filtering. The optimized algorithm is compared with the original algorithm and several other mainstream target detection algorithms on this dataset. The experimental results show that the optimized algorithm improves both detection accuracy and detection speed over the original algorithm, and its detection accuracy is higher than that of the other mainstream target detection algorithms.


Introduction
The advancement of science and technology has given machine vision broad application prospects in video surveillance, intelligent transportation, unmanned driving, and other fields. With the popularity of high-performance camera equipment and the surge in demand for automated analysis of video content, extracting targets from video accurately and efficiently has become an urgent problem, and pedestrian targets in particular are a hot topic in machine vision research. Pedestrian detection is the basis of most pedestrian dynamic analyses. The accuracy of the detection result determines whether follow-up tasks such as tracking, segmentation, and estimation can be completed accurately and efficiently.
There are two main branches of target detection algorithms: motion detection algorithms based on the difference between video frames, and algorithms based on machine learning. The first type of method is fast to compute but has poor anti-interference ability. When the environment changes, the targets are dense, or the target does not move, it easily produces a large number of missed and false detections, so its robustness is poor. Common methods of this kind include the frame difference method, the background difference method, the ViBe algorithm [1], and the ViBe+ algorithm [2]. The second type of method is divided into traditional machine learning and deep learning. Deep learning methods, which use multilayer self-learning neural networks, have repeatedly achieved excellent results in world-class target detection competitions.
Target detection based on deep learning can be divided into anchor-based and anchor-free methods. The most important difference between the two is whether anchor boxes are used to extract the candidate target frames of the image during the learning process. Compared with the anchor-based method, the anchor-free method has a simpler network structure and stronger model migration ability. However, the anchor-free method is based on the complete feature pyramid, which requires a huge amount of computation, while the anchor-based method reduces the number of pyramid layers, which greatly reduces the amount of computation, so its detection speed is faster and its detection accuracy is higher. Currently, mainstream detection algorithms such as YOLOv2 [3], YOLOv3 [4], Faster R-CNN [5], and Mask R-CNN [6] are anchor-based methods.

Related Work
Target detection algorithms based on deep learning have developed substantially over more than ten years, and the field now has many branches. The deep learning target detection algorithms based on region proposals are represented by R-CNN [7,8]. Subsequently, Fast R-CNN [9] optimized the serial feature extraction of R-CNN so that only one CNN pass is used for each image, which greatly improved the detection speed. Faster R-CNN [5] then made a further optimization: instead of using the selective search algorithm to generate candidate regions, it extracts the regions to be detected through a region proposal network (RPN), so the complete detection process runs within a single neural network. This further improves detection accuracy and speed and realizes a truly end-to-end target detection framework. Mask R-CNN [6] is a further extension of this series of deep learning target detection algorithms. It adds a segmentation branch to the Faster R-CNN detection branch, and the segmentation task is performed simultaneously with the classification and regression tasks. The detection task can thus be extended more easily to other tasks, and the detection results are also better.
SENet (Squeeze-and-Excitation Network) [10] was proposed by Hu's team and won the ImageNet classification competition in 2017. It is a lightweight network module that applies an attention mechanism over channels so that the network can adaptively select appropriate feature channels. On this basis, Li et al. proposed SKNet (Selective Kernel Network) [11] in 2019, which applies an attention mechanism over convolution kernels so that the network can adaptively select the appropriate convolution kernel.
Target detection has continued to develop in recent work. Hsu et al. [12] proposed two strategies that enable the detector to detect OOD (out-of-distribution) samples without training on OOD data. Wang et al. [13] introduced the intersection of the human body and object into training to improve the detection performance. Zhang et al. [14] proposed a method to automatically select positive and negative samples based on the statistical characteristics of the object and showed that simply increasing the number of anchor boxes cannot improve the detection accuracy.
Based on the Mask R-CNN target detection algorithm, we make several optimizations to improve the accuracy of pedestrian target detection. The main work of this article consists of the following three parts:
(1) In the dataset preprocessing stage, noise is added to each image, and the noisy image is then smoothed by filtering. The three kinds of images (original, noisy, and filtered) are used together as the pretraining set, so the amount of data is tripled without relabeling, which realizes data enhancement.
(2) In the RPN, the anchor box is optimized for pedestrian targets. Aspect ratios better suited to pedestrian targets are used, which makes the training results more reasonable, raises detection accuracy, adds no computation, and speeds up convergence.
(3) In the ResNet, the lightweight SKNet module replaces part of the convolution modules so that the model can adaptively select the best convolution kernel during training, which improves the quality of the feature representation and the detection accuracy.

Mask R-CNN Algorithm.
The Mask R-CNN algorithm is an improvement on the Faster R-CNN detection algorithm that introduces a fully convolutional network (FCN) to generate masks. In the real-time target detection process, the pixels of the target are classified accurately, and the contour of the target is then determined. The framework of the algorithm is shown in Figure 1.
The image is first fed into the backbone network composed of the ResNet and the FPN. The backbone network extracts shared feature maps that combine the position information of the detected target with its appearance and texture information.
Then, the region proposal network (RPN) uses a sliding window to traverse these feature maps and generates several anchor frames from combinations of fixed scales and aspect ratios. These anchor frames are candidate areas. In the proposal layer, the anchor frames that are more likely to contain the detected target are selected as candidate areas. Specifically, anchor frames that extend beyond the image boundary, have a high overlap rate, or have a low confidence level are excluded, and nonmaximum suppression (NMS) is then used to keep the anchor boxes with the higher scores [15].
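The proposal filtering described above can be sketched in plain Python. This is only an illustration under stated assumptions: boxes are (x1, y1, x2, y2) tuples, and the function names, the score threshold of 0.5, and the IoU threshold of 0.7 are hypothetical choices that are not taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def inside_image(box, width, height):
    """True if the box lies completely inside the image."""
    return box[0] >= 0 and box[1] >= 0 and box[2] <= width and box[3] <= height


def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: visit boxes by descending score, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def select_proposals(boxes, scores, width, height,
                     min_score=0.5, iou_threshold=0.7):
    """Drop out-of-bounds and low-confidence anchors, then apply NMS.

    Returns indices into the original `boxes` list, best score first.
    """
    valid = [i for i, b in enumerate(boxes)
             if inside_image(b, width, height) and scores[i] >= min_score]
    kept = nms([boxes[i] for i in valid], [scores[i] for i in valid],
               iou_threshold)
    return [valid[k] for k in kept]
```

For example, given two nearly identical boxes, one distant box, and one box that crosses the left image border, `select_proposals` keeps only the higher-scoring box of the overlapping pair and the distant box.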
In the RoIAlign layer of the Mask R-CNN algorithm, the quantization operation in the feature aggregation process is replaced by bilinear interpolation, which avoids the misalignment problem and improves the accuracy of detection and segmentation. The Mask R-CNN algorithm shares the convolutional layers with the candidate region generation network for the classification and regression problems, which improves the efficiency of the algorithm. The Mask R-CNN algorithm uses the softmax function and the multitask loss function to obtain the classification value and the regression box parameter values. In the FCN, the sigmoid function is used to output the mask value to realize pixel-level instance segmentation. During the training process, the Mask R-CNN algorithm defines the multitask loss function for each sampled region of interest (RoI) as
\[ L = L_{cls} + L_{box} + L_{mask}, \]
where $L_{cls}$ is the classification error, $L_{box}$ is the detection error, and $L_{mask}$ is the segmentation error.
$L_{cls}$ and $L_{box}$ in Mask R-CNN follow the Faster R-CNN formulation [5]:
\[ L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*), \]
where the first term is the classification loss and the second term is the box regression loss. Here $p_i$ is the predicted probability that the $i$-th anchor contains a target, and $p_i^*$ is determined by the label of the anchor sample: when the anchor sample is positive, $p_i^*$ is 1; otherwise, it is 0. Both $t_i$ and $t_i^*$ are vectors composed of four translation and scaling parameters, which measure the change of the positive sample anchor relative to the prediction area and the label area, respectively. The weights $N_{cls}$, $N_{reg}$, and $\lambda$ keep the two losses balanced.
The per-anchor classification loss and regression loss are defined as
\[ L_{cls}(p_i, p_i^*) = -\log\left[ p_i^* p_i + (1 - p_i^*)(1 - p_i) \right], \]
\[ L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L_1}(t_i - t_i^*), \]
where $\mathrm{smooth}_{L_1}(x)$ is the robust loss and $x$ is the difference between corresponding elements of $t_i$ and $t_i^*$ (for example, the horizontal translation of the corrected frame at the anchor point). It is defined as
\[ \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2, & |x| < 1, \\ |x| - 0.5, & \text{otherwise}. \end{cases} \]
$L_{mask}$ in Mask R-CNN is an average binary cross-entropy function that describes the loss of the segmentation branch. In the mask branch, the input feature map is processed into an output of size $k \times m \times m$, where $k$ and $m$ control the dimension and the scale of the feature map, respectively. The cross-entropy is obtained by applying a pixel-by-pixel sigmoid to the output feature map, and the average cross-entropy error is $L_{mask}$.
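The loss terms described above can be illustrated with a small standard-library sketch. It assumes the standard Faster R-CNN forms of the classification and smooth L1 regression losses and a mean binary cross-entropy for a single m × m mask; the function names, the default normalizers (the number of anchors), and the clipping constant are assumptions made for illustration, not the authors' implementation.

```python
import math

EPS = 1e-12  # assumed clipping constant to avoid log(0)


def smooth_l1(x):
    """Robust loss: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def cls_loss(p, p_star):
    """Log loss for one anchor; p_star is 1 (positive) or 0 (negative)."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -math.log(p_star * p + (1 - p_star) * (1 - p))


def reg_loss(t, t_star):
    """Smooth L1 summed over the four box parameters."""
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))


def rpn_loss(p, p_star, t, t_star, lam=1.0, n_cls=None, n_reg=None):
    """Return (L_cls, L_box); only positive anchors contribute to L_box."""
    n_cls = n_cls or len(p)  # assumption: normalize by anchor count
    n_reg = n_reg or len(p)
    l_cls = sum(cls_loss(a, b) for a, b in zip(p, p_star)) / n_cls
    l_box = lam * sum(ps * reg_loss(ti, ts)
                      for ps, ti, ts in zip(p_star, t, t_star)) / n_reg
    return l_cls, l_box


def mask_loss(logits, target):
    """Mean binary cross-entropy of a per-pixel sigmoid over an m x m mask.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    total, count = 0.0, 0
    for row_z, row_y in zip(logits, target):
        for z, y in zip(row_z, row_y):
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count
```

The total loss for one RoI is then the sum of the three returned terms, for example `sum(rpn_loss(...)) + mask_loss(...)`.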

Optimized Mask R-CNN.
We optimize the RPN by modifying the aspect ratio of the anchor frame. We also modify the network structure of the ResNet.

Optimization of RPN.
In the training process of the original RPN, the anchor frames in the sliding window are composed of three areas ($128^2$, $256^2$, and $512^2$) and three aspect ratios (1 : 1, 1 : 2, and 2 : 1), giving 9 kinds of anchor frames in total [6]. However, if only pedestrian targets are detected, this setting is unreasonable: it slows the convergence of training and reduces the detection accuracy. According to statistics, the average aspect ratio (width to height) of the human body when standing and walking is about 0.41 [16]. Therefore, the RPN is optimized for pedestrian targets: the anchor frame with a width-to-height ratio of 2 : 1 is removed and replaced with an anchor frame with a ratio of 2 : 5. The anchor frame aspect ratios thus become 1 : 1, 2 : 5, and 1 : 2, the original three areas are unchanged, and the number of anchor frame types is still 9. For each image, the total number of anchor frames during training remains the same as in the original Mask R-CNN algorithm.
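To make the anchor arithmetic concrete, the following sketch computes the 9 anchor shapes from the three areas and the three width-to-height ratios given above. The function and constant names are hypothetical, and the sketch assumes each anchor keeps its nominal area exactly while its width-to-height ratio is set to the given value.

```python
import math

AREAS = (128 ** 2, 256 ** 2, 512 ** 2)
ORIGINAL_RATIOS = (1 / 1, 1 / 2, 2 / 1)    # width : height, original RPN
PEDESTRIAN_RATIOS = (1 / 1, 2 / 5, 1 / 2)  # width : height, 2:1 replaced by 2:5


def anchor_shapes(areas, ratios):
    """Return (width, height) pairs with w * h = area and w / h = ratio."""
    shapes = []
    for area in areas:
        for ratio in ratios:
            height = math.sqrt(area / ratio)
            width = ratio * height
            shapes.append((width, height))
    return shapes


def anchors_at(cx, cy, shapes):
    """Place every anchor shape at a sliding-window center (cx, cy)."""
    return [(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
            for w, h in shapes]


if __name__ == "__main__":
    for w, h in anchor_shapes(AREAS, PEDESTRIAN_RATIOS):
        print(f"{w:7.1f} x {h:7.1f}  (w/h = {w / h:.2f})")
```

With the 2 : 5 ratio, the smallest area yields an anchor roughly 81 pixels wide and 202 pixels tall, which is close to the 0.41 average aspect ratio cited above, while the count of anchor types per position stays at 9.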

Optimization of ResNet.
For Mask R-CNN, the most commonly used deep residual network models are ResNet50 and ResNet101. Compared with ResNet50, ResNet101 has higher accuracy, so we use the ResNet101 network model as the basis for optimization and improvement. The network structure of ResNet101 is shown in Figure 2.
SKNet is a lightweight embedded module that can adaptively change the size of the convolution kernel as the scale of the information changes, thereby controlling the receptive field of the network and better capturing the feature information of the target. As shown in Figure 3 [11], SKNet consists of three parts: split, fuse, and select. In the split process, the input feature map is passed through a convolution with a 3 × 3 kernel and a convolution with a 5 × 5 kernel to generate the feature maps $\tilde{U}$ and $\hat{U}$, respectively. In the fuse process, $\tilde{U}$ and $\hat{U}$ are added to obtain the feature map $U$. $U$ first goes through global average pooling and then passes through two fully connected layers, which first reduce and then restore the dimension, to obtain the weight matrices $a$ and $b$. In the select process, the final feature map $V$ is obtained as the sum of $\tilde{U}$ and $\hat{U}$ weighted by $a$ and $b$.
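The fuse and select steps can be sketched with nested lists, taking the two branch outputs (from the 3 × 3 and 5 × 5 convolutions) as given inputs. This is a simplified illustration, not the SKNet implementation: the convolutions and batch normalization are omitted, the weight matrices `w_reduce`, `w_a`, and `w_b` are hypothetical parameters supplied by the caller, and the branch weights are assumed to come from a softmax across the two branches for each channel.

```python
import math


def global_average_pool(u):
    """u: list of C channels, each an H x W nested list -> list of C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]


def matvec(matrix, vector):
    """Dense layer without bias: matrix (rows x cols) times vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sk_fuse_select(u_small, u_large, w_reduce, w_a, w_b):
    """Fuse two branch outputs and select between them per channel.

    u_small, u_large: C x H x W branch outputs (3x3 and 5x5 kernels).
    w_reduce: d x C matrix (dimension reduction, followed by ReLU).
    w_a, w_b: C x d matrices producing per-channel logits for each branch.
    Returns (V, weights), where weights[c] = (a_c, b_c) and a_c + b_c = 1.
    """
    # Fuse: element-wise sum of the two branches
    fused = [[[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
             for c1, c2 in zip(u_small, u_large)]
    s = global_average_pool(fused)                      # channel statistics
    z = [max(0.0, x) for x in matvec(w_reduce, s)]      # reduce dimension
    logit_a, logit_b = matvec(w_a, z), matvec(w_b, z)   # restore dimension

    v, weights = [], []
    for c, (c1, c2) in enumerate(zip(u_small, u_large)):
        top = max(logit_a[c], logit_b[c])               # stable softmax
        ea, eb = math.exp(logit_a[c] - top), math.exp(logit_b[c] - top)
        a, b = ea / (ea + eb), eb / (ea + eb)
        weights.append((a, b))
        # Select: weighted sum of the two branches for this channel
        v.append([[a * x + b * y for x, y in zip(r1, r2)]
                  for r1, r2 in zip(c1, c2)])
    return v, weights
```

When both branches receive equal logits, each channel of the output is simply the average of the two branch outputs; training would move the weights toward whichever kernel size suits the target scale.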
In this article, the SKNet module is embedded into the ResNet101 network. The convolution module with a 3 × 3 kernel is replaced by a convolution module consisting of two different kernels and a fully connected layer that produces the feature channel weights.
The new feature extraction network is named SKNet-101. The optimized ResNet can better represent the characteristics of the target, thereby further improving the detection accuracy. The optimized network structure of SKNet-101 is shown in Figure 4.

Experimental Results and Analysis
The program runs on the Windows 10 operating system with Python 3.6 integrated in PyCharm 2019.3.3. The runtime libraries include Keras 2.1.6, matplotlib 3.2.2, tensorflow 1.14.0, numpy 1.19.0, and opencv 4.2.0.

Dataset Enhancement Processing.
The classic COCO 2014 dataset [17] is used as the source of the training and test sets. The COCO dataset is a target detection dataset released by Microsoft with rich detection types. It contains 80 different types of targets and more than 200,000 labeled images, and many scholars use it for target detection training and learning. We selected 1,000 pedestrian images from the "person" category, covering as many different angles, lighting conditions, and pedestrian densities as possible to increase the complexity of the data. Of these 1,000 images, 900 are used as the training set and 100 are used as the test set. The training set contains 892 positive sample images with 3,262 pedestrian targets, and the test set contains 99 positive sample images with 478 pedestrian targets.
To achieve data enhancement, we added salt-and-pepper noise to the 900 images in the training set and then applied a median filter with a kernel size of 3 to each noisy image, as shown in Figure 5. The three kinds of images were used together as the optimized training set and compared with the original training set without data enhancement. The comparison shows that a reasonable expansion of the dataset helps the model learn the characteristics of pedestrian images fully and improves the detection performance.
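The augmentation pipeline described above can be sketched with the standard library on a grayscale image stored as a nested list of 0-255 values. The noise density of 5%, the edge-replicating border handling, and all function names are assumptions for illustration; the paper does not state the noise density, and in practice a library routine such as OpenCV's median blur would normally be used on color images.

```python
import random
from statistics import median


def add_salt_and_pepper(image, amount=0.05, seed=None):
    """Set roughly `amount` of the pixels to 0 (pepper) or 255 (salt)."""
    rng = random.Random(seed)
    noisy = [row[:] for row in image]
    for y in range(len(noisy)):
        for x in range(len(noisy[0])):
            r = rng.random()
            if r < amount / 2:
                noisy[y][x] = 0
            elif r < amount:
                noisy[y][x] = 255
    return noisy


def median_filter(image, ksize=3):
    """Median filter with an odd kernel size; borders replicate edge pixels."""
    pad = ksize // 2
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                      for dy in range(-pad, pad + 1)
                      for dx in range(-pad, pad + 1)]
            out[y][x] = int(median(window))
    return out


def augment(image, amount=0.05, seed=None):
    """Return [original, noisy, filtered]; all three share one annotation."""
    noisy = add_salt_and_pepper(image, amount, seed)
    filtered = median_filter(noisy, ksize=3)
    return [image, noisy, filtered]
```

Because neither step moves any pixel, the bounding boxes and masks of the original image remain valid for the two derived images, which is why the training set can be tripled without relabeling.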

Parameter Setting.
The Mask R-CNN optimized for pedestrian targets is used as the model for pedestrian detection training, and several hyperparameters were set before training to speed up convergence and prevent overfitting.
There are three important parameters in the SKNet module. Since a two-branch model is used, the number of branches M was set to 2. To achieve the optimal feature representation, the number of groups G was set to 32, and the fc scaling ratio R was set to 16. As shown in Figure 6, we first recorded the changes in training loss under different learning rates (LRs) for the overall Mask R-CNN training network. The training loss is the smallest when LR is set to 0.01.
With LR set to 0.01, a comparative experiment was conducted on the influence of the number of training iterations on test accuracy. The experimental results are shown in Table 1. Overall, the test accuracy rises as the number of training iterations (and the training time) increases, and it reaches its peak after 15,000 iterations. After 20,000 iterations, overfitting occurs and the test accuracy drops slightly. Therefore, we finally selected 15,000 iterations for training. The specific values of the training hyperparameters of the overall model are shown in Table 2.

Experimental Results and Analysis.
We compared the learning results of the following five configurations:
(1) The original Mask R-CNN algorithm on the training set without data expansion.
(2) The original Mask R-CNN algorithm on the training set after data expansion.
(3) The Mask R-CNN algorithm with the optimized RPN on the training set without data expansion.
(4) The Mask R-CNN algorithm with the optimized ResNet on the training set without data expansion.
(5) The Mask R-CNN algorithm with both the optimized RPN and the optimized ResNet on the training set after data expansion.
There are two main comparative experimental indicators, namely, AP (average precision) and FPS (frames per second). The specific comparative experimental results are shown in Table 3.
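As a rough illustration of the two indicators, the sketch below computes AP as the area under an interpolated precision-recall curve and FPS as frames divided by elapsed time. The paper does not state which AP variant or IoU matching rule it uses, so the all-point interpolation here, the precomputed true-positive flags, and the function names are assumptions.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve.

    scores: confidence of each detection.
    is_true_positive: whether each detection matched a ground-truth target.
    num_ground_truth: total number of ground-truth targets in the test set.
    """
    if num_ground_truth == 0 or not scores:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolate: precision at each rank becomes the best precision
    # achieved at that rank or any later (higher-recall) rank.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def frames_per_second(num_frames, elapsed_seconds):
    """Detection speed: images processed per second of wall-clock time."""
    return num_frames / elapsed_seconds
```

Under these assumptions, a detector that processes 100 test images in 20 seconds runs at 5 FPS, and its AP rises when true detections are ranked above false ones and fewer ground-truth targets are missed.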
It can be seen from Table 3 that data expansion increases the AP of the detector by 6.15%, while the FPS is almost unchanged, still at 4.99. Selecting anchor frames suited to pedestrians at each position during the training stage of the RPN increases the AP by 3.91% and the FPS slightly, by 0.04. Using the SKNet-101 network structure increases the AP by 8.74% but decreases the FPS slightly. Combining the three methods increases the AP of the detector by 10.46% and the FPS by 0.28 when detecting pedestrian targets. This shows that the optimization method can significantly improve the detection accuracy of pedestrian targets and slightly increase the detection speed.
We also compare the optimized detector with several mainstream target detection algorithms on the test set. The experimental results are shown in Table 4.
It can be seen in Table 4 that the AP of the optimized detector is higher than that of the other mainstream algorithms in pedestrian target detection, and the detection accuracy has been significantly improved.

Conclusion
We optimize the RPN of Mask R-CNN and generate a new network structure named SKNet-101 by introducing the SKNet module in the feature extraction stage so that the network can adaptively select the appropriate convolution kernels. We also optimize the representation of the target by modifying the aspect ratio of the anchor frame in the region proposal stage. The training set is expanded to improve the accuracy of the algorithm when detecting pedestrian targets. However, the optimization method has certain limitations. The optimization of the RPN improves the detection accuracy only for pedestrian targets; when other targets are detected, the detection accuracy may be reduced. Moreover, the relatively slow detection speed of the R-CNN series has not been solved well. In future research, the detection speed needs to be improved.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.