Double Mask R-CNN for Pedestrian Detection in a Crowd

Aiming at the difficulty of feature extraction and the limitation of NMS (non-maximum suppression) in crowded pedestrian detection, a new detection network named Double Mask R-CNN, based on Mask R-CNN with FPN (Feature Pyramid Network), is proposed in this article. The algorithm has two improvements. Firstly, we add a semantic segmentation branch on the FPN to strengthen the feature extraction of crowded pedestrians. Secondly, we design a rule that estimates the pedestrian visibility of a detected image from the human keypoints information; this rule applies a binary mask to any image whose pedestrian visibility is less than a certain threshold. We then input the masked image into the network to locate occluded pedestrians. Experimental results on the CrowdHuman dataset show that the log-average miss rate (MR) of Double Mask R-CNN is 13.12% lower than the best results of other mainstream networks. Similar improvements on the WiderPerson dataset are also achieved by Double Mask R-CNN.


Introduction
Pedestrian detection is a classical problem in the field of computer vision, with a wide range of applications such as unmanned driving, robotics, intelligent monitoring, human behaviour analysis, and amblyopia assistive technology [1,2]. Traditional pedestrian detection methods mainly used HOG (Histogram of Oriented Gradients) to extract pedestrian features and then used an SVM (Support Vector Machine) for classification [3]. However, HOG can only describe pedestrian features through gradients or texture, with poor discrimination [4], and the SVM is also not suitable for the increasing scale of pedestrian detection datasets. In recent years, the accuracy of pedestrian detection has been greatly improved [5][6][7] with the development of deep convolutional neural networks. However, pedestrian detection in crowded scenes is still difficult.
Crowded pedestrian detection primarily involves two problems. Firstly, the similarity between pedestrians is high, while current detection models focus on extracting overall features. This makes it difficult to distinguish between highly overlapped pedestrians. Secondly, there are limitations in the postprocessing of prediction boxes. Traditional detection models collect samples from feature maps to generate dense prediction boxes, and non-maximum suppression (NMS) is adopted to remove overlapped prediction boxes. However, it is very difficult to set an NMS threshold when this method is applied to crowded pedestrian scenarios: if the NMS threshold is too low, a large number of missed detections will be generated; if it is too high, a large number of false detections will be generated. The above two difficulties are mainly addressed from two aspects. The first aspect is to strengthen the extraction of crowded pedestrian features. Zhang et al. and Wang et al. added additional loss terms to the loss function to make the proposal boxes of the same object closer to each other and the proposal boxes of different objects more distinct from each other, thus strengthening the feature extraction of crowded pedestrians [8,9]. Ge et al. first used P-RCNN to detect less crowded pedestrians, artificially constructed binary masks to cover these pedestrians, and then used S-RCNN to detect the remaining crowded targets (P-RCNN and S-RCNN both use Faster R-CNN as the basic structure); the model is forced to pay attention to crowded targets by constructing a mask [10], but constructing a mask for every detected image greatly increases detection time. Lin et al. proposed a pedestrian attention mechanism that encodes fine-grained attention masks into convolutional feature maps to enhance the extraction of pedestrian features [11]. Zhou et al.
proposed a discriminative feature transformation branch to strengthen the discriminability of the network between pedestrians and nonpedestrians [12]. Chi et al. predicted an extra head mask of pedestrians during the training stage to enhance the extraction of pedestrian features [13]. The second aspect is to change the postprocessing of the prediction boxes. Soft NMS [14] retains closely spaced prediction boxes through linear weighting of adjacent prediction boxes, but this method generates a large number of false detections for highly overlapping objects. Adaptive NMS [15] added a branch to the detection network to predict the density around each box and used the predicted density as the NMS threshold, achieving dynamic adjustment of the threshold. However, density prediction remains difficult, and it is doubtful whether the density represents the best NMS threshold setting. In addition, the prediction box often does not fully match the ground truth box, which may lead to inconsistency between the IoU (intersection over union) of prediction boxes and the predicted density. Joint-NMS [16] simultaneously detects the full-body bounding box and the head bounding box, which is not easily occluded, and then performs NMS on them jointly; this method requires that pedestrian head labels exist in the dataset. R²NMS [17] has a similar idea: both the visible-body box and the full-body box are used for NMS jointly, which requires that the dataset has visible-body labels.
Current crowded pedestrian detection methods tend to focus on only one of these difficulties (strengthening feature extraction or changing the postprocessing method). In order to further improve pedestrian detection performance in crowded scenarios, it is necessary to deal with both difficulties simultaneously.
In this article, Double Mask R-CNN is proposed to handle both difficulties simultaneously. Double Mask R-CNN builds on Mask R-CNN with FPN. For the first difficulty, in order to enhance the edge feature extraction ability for crowded pedestrians, we improve FPN by adding a semantic segmentation branch; the result is named SFPN. In order to train SFPN, we use the head bounding box and body bounding box of pedestrians to construct more elaborate pseudolabels, unlike [11,12], which use all pixels of the pedestrian bounding box as pseudosemantic labels; labels obtained that way are relatively coarse. For the second difficulty, in order to avoid the limitation of NMS, we adopt a method similar to PS-RCNN [10]: construct a binary mask over the detected pedestrians and then reinput the detected image into the network to obtain the occluded pedestrians. The difference is that we design an effective rule to estimate the visibility of pedestrians in the image according to human keypoints. Only detected images whose pedestrian visibility is less than a certain threshold are covered with a mask at the locations of detected pedestrians and reinput into the network. This method greatly reduces detection time and is more in line with actual needs; constructing masks for all detected images is inefficient.
To summarize, our contributions are as follows: (1) we propose the SFPN module to strengthen the edge feature extraction ability for crowded pedestrians; (2) we combine the head bounding box and body bounding box to construct more refined pseudosemantic segmentation labels; (3) we design a rule that estimates image pedestrian visibility from human keypoints, so that only images with low pedestrian visibility are reinput into the network to detect occluded pedestrians, which significantly reduces detection time.

Generic Object Detection.
Object detection models based on deep convolutional neural networks can be divided into one-stage and two-stage models according to the presence of an RPN (Region Proposal Network). The purpose of one-stage detectors [18,19] is to accelerate the detection process so as to meet the time efficiency requirements of various practical applications. Two-stage detectors [20] refine detection results by adding a second-stage classification and regression network to obtain higher accuracy. Our work is based on the typical two-stage detector, Faster R-CNN. Faster R-CNN [20] first generates a certain number of proposal boxes through the RPN and then refines them through a second-stage classification and regression network. FPN [21] extended Faster R-CNN by introducing a top-down feature pyramid network to deal with scale changes of detected objects. Mask R-CNN [22] proposed RoIAlign to solve the misalignment between proposal boxes and their corresponding features in RoI Pooling [20].

Occlusion Handling.
Zhang et al. [23] employed an attention mechanism across channels to represent various occlusion patterns. Song et al. [24] used somatic topological line localization to reduce ambiguity. Stewart et al. [25] proposed a recurrent LSTM layer for sequence generation, trained with a Hungarian loss function. Hu et al. [26] proposed an object relation module that handles a set of objects simultaneously through interaction of appearance features and geometry. Goldman et al. [27] proposed a layer that estimates the Jaccard index as a detection quality score, together with a novel EM merging unit, and used these quality scores to resolve detection overlap ambiguities.

Method
The structure of Double Mask R-CNN proposed in this article is shown in Figure 1. Double Mask R-CNN consists of the following phases: (1) Input the detected image into the SFPN module to obtain the feature map and the semantic segmentation map. The latter is used only in the training stage, to improve the network's ability to extract edge features of crowded pedestrians; the semantic branch is turned off during the evaluation stage.
(2) Input the feature map into the Region Proposal Network (RPN) to obtain the proposal boxes; the RPN plays the same role as in Faster R-CNN. (3) Input the proposal boxes into the MKFRCNN (Mask and Keypoint with Fast RCNN) module to obtain the prediction boxes with the corresponding instance segmentation map and human keypoints; the detailed structure is described later. Only the prediction boxes and the instance segmentation map are needed in the training stage, to calculate the loss and update the weights of the model; the human keypoints information is needed only in the evaluation stage, to calculate the pedestrian visibility of the detected image. (4) According to the number of detected human keypoints, calculate the pedestrian visibility of the detected image, α. If α is less than the threshold T, the pedestrian density of the detected image is high; a binary mask is then added over the pedestrians detected in the first pass, according to the instance segmentation map. After covering the masks, the detected image is reinput into the detection network to obtain the positions of the occluded pedestrians. Note that the instance segmentation and human keypoints branches are closed during the second detection to reduce detection time. (5) Obtain the final output by merging the two detection results.
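The phases above can be sketched as the following control flow. This is a minimal illustration, not the authors' implementation: `detect` stands in for the full SFPN + RPN + MKFRCNN forward pass, simple (x, y, w, h) boxes are zeroed out in place of the instance segmentation masks, and all function names are our assumptions.

```python
def pedestrian_visibility(keypoint_scores, total_keypoints):
    """alpha: mean fraction of detected keypoints over all detected people.
    A keypoint counts as detected when its activation score is > 0."""
    if not keypoint_scores:
        return 1.0  # no detections: nothing occluded, skip the second pass
    ratios = [sum(1 for s in scores if s > 0) / total_keypoints
              for scores in keypoint_scores]
    return sum(ratios) / len(ratios)


def double_detect(image, detect, T=0.6, total_keypoints=17):
    """First detection pass; if visibility alpha < T, black out the detected
    pedestrians and run a second pass to recover occluded ones."""
    boxes, keypoint_scores = detect(image)
    alpha = pedestrian_visibility(keypoint_scores, total_keypoints)
    if alpha >= T:
        return boxes
    masked = [row[:] for row in image]   # copy, keep the input image intact
    for (x, y, w, h) in boxes:           # crude binary mask over each
        for r in range(y, y + h):        # detected region
            for c in range(x, x + w):
                masked[r][c] = 0
    second_boxes, _ = detect(masked)     # second pass on the masked image
    return boxes + second_boxes          # phase (5): merge both results
```

In practice the paper masks using the predicted instance segmentation map rather than whole boxes, and disables the mask/keypoint branches on the second pass.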
Next, the specific structures of SFPN and MKFRCNN and the calculation process of pedestrian body visibility are described. Finally, we describe the loss function used to train the detection network.

SFPN Module.
SFPN is a feature pyramid network with a semantic segmentation branch, an extension of the FPN proposed in 2017. Because the FPN structure is similar to the encoder-decoder structure of U-Net, the semantic segmentation branch can be constructed easily. The construction process of SFPN is shown in Figure 2; the number above each bar is the number of channels. Firstly, we select ResNet50 [28], pretrained on the ImageNet [29] dataset, as the backbone. Then we extract the feature maps produced by conv1 (7 × 7), conv2, conv3, conv4, and conv5 and name them {C1, C2, C3, C4, C5}, respectively. M5 is obtained by a 1 × 1 convolution of C5; M5 is then upsampled (bilinear interpolation) to the same resolution as C4. We obtain M4 by adding the 1 × 1 convolution of C4 to the upsampled M5; M3 and M2 are obtained by the same process. Then we obtain P5, P4, P3, and P2 by 3 × 3 convolution of M5, M4, M3, and M2, which are used to generate proposal boxes in the RPN stage. The semantic segmentation branch starts from P2: S1 is obtained by upsampling P2; S2 is obtained from S1 through a 3 × 3 convolution and the ReLU function, and S2 has the same number of channels as C1; S3 is obtained by a 1 × 1 convolution of S2 plus C1. Finally, the per-pixel probability map is obtained through the Sigmoid function. We do not apply a 1 × 1 convolution to C1 to expand its channels to 256, because that would require more GPU memory during backpropagation; the approach in this article reduces GPU memory consumption and saves computing resources.
Training of SFPN requires pixel-level annotations in the training dataset. Since pixel-level annotations do not exist in the CrowdHuman [30] dataset, we directly use the pedestrian annotation boxes to build pseudosemantic segmentation annotations on the basis of [30]. The conventional literature uses all pixels in the box as the pseudosemantic segmentation annotation; in this article, we instead combine the pedestrian head box with the visible-body box to construct a more accurate annotation. The two methods are shown in Figure 3. The construction process of our pseudosemantic segmentation annotations is as follows: assume that the top-left corner and the width and height of the head bounding box of a pedestrian are (X1, Y1) and (W1, H1), respectively, and that the top-left corner and the width and height of the visible-body box are (X2, Y2) and (W2, H2), respectively. The polygon consists of 8 coordinates; the pseudosemantic segmentation annotation and the horizontal and vertical coordinates are represented by P and Q, respectively.

3.2. MKFRCNN.
MKFRCNN refers to the Fast R-CNN structure with instance segmentation and human keypoints detection branches. The branch structure is improved from [22]: the upsampling method in the instance segmentation branch is changed from deconvolution to bilinear interpolation, after which feature aggregation is carried out through normal convolution. This is because the labeling pattern of the pseudo-instance segmentation annotation is relatively fixed, and deconvolution may cause overfitting; moreover, the spatial structure of the object is more easily preserved through bilinear interpolation. The pseudo-instance segmentation annotation used to train MKFRCNN is converted from the pseudosemantic segmentation annotation constructed in Section 3.1; we only need to give different values to the pixels in each annotation box.
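The eight-vertex construction for the Section 3.1 pseudo-label polygon appears to have been lost from the text. A plausible reading, consistent with the description, is a "T"-shaped polygon: the head box stacked on top of the visible-body box. The sketch below is our reconstruction, not the authors' exact formula; the head height H1 is implicitly clipped at the body's top edge Y2.

```python
def pseudo_label_polygon(head_box, visible_box):
    """head_box, visible_box: (x, y, w, h) with (x, y) the top-left corner.
    Returns the 8 vertices of the pseudo-label polygon P, clockwise from
    the head's top-left corner."""
    x1, y1, w1, _h1 = head_box      # head height is clipped at y2
    x2, y2, w2, h2 = visible_box
    return [
        (x1, y1),             # head top-left
        (x1 + w1, y1),        # head top-right
        (x1 + w1, y2),        # head right side meets body top edge
        (x2 + w2, y2),        # body top-right
        (x2 + w2, y2 + h2),   # body bottom-right
        (x2, y2 + h2),        # body bottom-left
        (x2, y2),             # body top-left
        (x1, y2),             # head left side meets body top edge
    ]

polygon = pseudo_label_polygon((20, 0, 20, 25), (0, 20, 60, 120))
print(len(polygon))  # 8
```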
The structure of MKFRCNN is shown in Figure 4. It has three branches in total: box, mask, and keypoint, which predict the positions of pedestrians, the instance segmentation map, and the keypoints of the human body, respectively. The number in a square represents the resolution and number of channels of a feature map; for example, 7 × 7 × 256 means the resolution is 7 × 7 and the number of channels is 256. The number in a rectangle represents the number of nodes in a fully connected layer. The number on an arrow represents the size and number of convolution kernels; for example, 4 × 3 × 3 represents 4 kernels of 3 × 3 convolution, and K represents the number of human keypoints, which is determined by the annotations of the pretraining dataset. Only the box and mask branches are opened during training, and all three branches are opened during testing; however, the mask and keypoint branches are closed after the construction of the binary mask to reduce detection time.
Since the CrowdHuman pedestrian dataset used for training and evaluation is not labeled with human keypoints, the keypoints detection branch is the branch pretrained on the COCO keypoints dataset. The pretrained branch can accurately detect human keypoints and thus accurately estimate the visibility of pedestrians in detected images. The COCO keypoints dataset marks up to 17 human keypoints for each person: nose, left/right eye, left/right ear, left/right shoulder, left/right elbow, left/right wrist, left/right hip, left/right knee, and left/right ankle. Figure 5 shows the detection of human keypoints. Green numbers represent the index of a keypoint on the human body; for example, 0 denotes the nose and 1 denotes the left eye. The red number represents the activation value produced by the keypoints detection branch. It can be seen that the score of an occluded or incorrectly positioned keypoint is usually less than or equal to 0. Therefore, when calculating pedestrian visibility, we count a keypoint as successfully detected when its activation value is greater than 0; otherwise, its detection fails.

Pedestrian Visibility.
Before adding binary masks to pedestrian images, we need to calculate the visibility of pedestrian bodies in order to reduce detection time and improve detection efficiency. This is because not all images are severely occluded, and adding binary masks to all detected images would significantly increase detection time. The calculation process of pedestrian body visibility α is as follows: (1) Calculate the ratio of the number of detected human keypoints to the total number of keypoints, R_i = k_i / K, where i denotes the index of a detected pedestrian, k_i denotes the number of detected keypoints of that pedestrian, and K denotes the number of labeled human keypoints in the training dataset. The detection results give a score for each keypoint; if the score of a keypoint is greater than 0, the keypoint is counted as successfully detected. (2) Sum the ratios over all detected pedestrians to obtain M = Σ_i R_i.
(3) Divide M by the number of detected pedestrians N to obtain the pedestrian visibility α = M / N.
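A worked example of the visibility calculation, with K = 17 as in COCO and made-up keypoint counts for three detected pedestrians:

```python
# Three detected pedestrians; k_i counts the keypoints whose activation
# score is > 0 (these counts are illustrative, not from the paper).
K = 17
detected_keypoints = [17, 9, 5]                 # k_i for N = 3 pedestrians

ratios = [k / K for k in detected_keypoints]    # R_i = k_i / K
M = sum(ratios)                                 # M = sum of the R_i
alpha = M / len(detected_keypoints)             # alpha = M / N

print(round(alpha, 3))  # 0.608: just above T = 0.6, so no second pass
```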

Loss Function.
As a semantic segmentation branch is added to the basic model, the loss function needs to be redefined. The loss function is composed of classification loss, bounding box regression loss, instance segmentation loss, and semantic segmentation loss. The classification, instance segmentation, and semantic segmentation losses all use the cross-entropy loss function [34]; the regression loss uses the smooth L1 loss function [7]. In the loss formula, N denotes the number of proposal boxes, y_i denotes the label of the proposal box, and p_i denotes the predicted pedestrian probability. If the proposal box is labeled positive, p*_i is set to 1; otherwise p*_i is set to 0. M represents the number of pixels in the semantic segmentation map, l_j represents the label of a pixel, and s_j represents the Sigmoid score of that pixel. t*_i denotes the offset between the proposal box and the ground truth box, and t_i denotes the offset between the prediction box and the ground truth box.
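The loss formula itself appears to have been lost in extraction. A reconstruction consistent with the component descriptions (cross-entropy for classification and both segmentation branches, smooth L1 for regression) would be the following; treat it as a plausible sketch rather than the authors' exact equation:

```latex
L = \underbrace{-\frac{1}{N}\sum_{i}\bigl[y_i\log p_i + (1-y_i)\log(1-p_i)\bigr]}_{\text{classification}}
  + \underbrace{\frac{1}{N}\sum_{i} p_i^{*}\,\operatorname{smooth}_{L_1}\!\bigl(t_i - t_i^{*}\bigr)}_{\text{box regression}}
  + L_{\text{mask}}
  + \underbrace{-\frac{1}{M}\sum_{j}\bigl[l_j\log s_j + (1-l_j)\log(1-s_j)\bigr]}_{\text{semantic segmentation}}
```

where L_mask denotes the per-pixel binary cross-entropy instance segmentation loss of Mask R-CNN [22].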

Experiments
In order to verify the effectiveness of Double Mask R-CNN, we conducted experiments on CrowdHuman and WiderPerson, two crowded pedestrian detection datasets.

CrowdHuman Dataset.
We use the CrowdHuman dataset, which is designed specifically for crowded pedestrian detection.
The CrowdHuman dataset has become the standard benchmark for crowded pedestrian detection algorithms; its training set contains 15000 images and its validation set contains 4370 images. CrowdHuman provides three categories of bounding box annotations for each human instance: the visible-region bounding box, the full-body bounding box, and the head bounding box. Detecting the visible region is more difficult, since its aspect ratios are more diverse than those of the full-body annotations. We evaluate our method only with the visible-region annotations. The comparison between the training set of CrowdHuman and other common pedestrian detection datasets is shown in Table 1.
It can be seen from Table 1 that the CrowdHuman dataset has the largest number of pedestrians among all the datasets, as well as the highest pedestrian density (22.64 per image) and the most pairwise overlaps between two pedestrians (IoU > 0.5: 2.40 per image). Therefore, this dataset better reflects the detection performance of a network in crowded scenarios. WiderPerson [35] is another crowded pedestrian detection dataset collected from multiple scenarios; its training set contains 8000 images and its validation set contains 1000 images, with five types of annotations: pedestrians, riders, partially visible persons, crowd, and ignored regions. In our experiments, we merged the former three categories into one category during the training and evaluation stages.

Evaluation Metric.
The evaluation criterion is adopted from the literature [36]: the standard log-average miss rate (MR) is calculated over false positives per image (FPPI) in the range [10⁻², 10⁰]. Besides, AP50 is also evaluated following the standard COCO evaluation metric. For the CrowdHuman dataset, the validation set is divided into seven subsets. The Small, Medium, and Large subsets, defined according to pedestrian height, verify the robustness of our method in detecting objects of different scales; the Heavy, Partial, and Bare subsets, defined according to occlusion ratio, verify the robustness in detecting objects with different occlusion ratios; the Reasonable subset follows the same standard used in [37] and serves as the general evaluation subset. All experiments are evaluated at an IoU (intersection over union) threshold of 0.5. The subsets of CrowdHuman are shown in Table 2. The occlusion rate is defined as occ = 1 − A_v / A_f, where A_v denotes the area of the visible-body box and A_f denotes the area of the full-body box.
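Assuming the standard visible-over-full-body definition sketched above (occ = 1 − A_v/A_f; the exact formula was lost in extraction), the occlusion rate can be computed directly from the two boxes. The (x, y, w, h) box format is our assumption; only the areas matter here.

```python
def occlusion_rate(visible_box, full_box):
    """occ = 1 - A_v / A_f, with boxes given as (x, y, w, h)."""
    _, _, wv, hv = visible_box
    _, _, wf, hf = full_box
    return 1.0 - (wv * hv) / (wf * hf)

# A pedestrian whose visible box covers half of the full-body box area:
occ = occlusion_rate((0, 0, 50, 100), (0, 0, 50, 200))
print(occ)  # 0.5
```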

Implementation Details.
This article is based on the PyTorch deep learning framework, and the GPU is a single RTX 2070. In order to fairly compare the performance of different detection networks, input image preprocessing is consistent with the literature [28]: the short edge is resized to 800 pixels, while the long edge is capped at 1400 pixels. The anchor aspect ratios are set to {0.5, 1.0, 2.0}, and no data augmentation techniques are used. The batch size is set to 1, and the initial learning rate is 1e−3, with a total of 150 K iterations; after 105 K and 135 K iterations, the learning rate decreases to 1e−4 and 1e−5, respectively. Momentum is set to 0.9, weight decay is set to 5e−4, and the threshold T on pedestrian visibility α is set to 0.6 in the evaluation stage. In order to make the model capable of detecting human keypoints, we first trained for 10 epochs on the COCO keypoints dataset to obtain a pretrained model and then trained on the CrowdHuman dataset; during training, the weights of the keypoint detection branch are not updated.
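The step learning-rate schedule described above can be written as a plain function of the iteration count (150 K iterations total, with drops after 105 K and 135 K):

```python
def learning_rate(iteration):
    """Step schedule: 1e-3 until 105 K, 1e-4 until 135 K, then 1e-5."""
    if iteration < 105_000:
        return 1e-3
    if iteration < 135_000:
        return 1e-4
    return 1e-5

print(learning_rate(0), learning_rate(110_000), learning_rate(149_999))
# 0.001 0.0001 1e-05
```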

Overall Performance.
In order to show that the detection performance of Double Mask R-CNN is better than that of other detection networks, its detection results are compared with those of other detection networks on CrowdHuman. All detection results are obtained on the Reasonable subset, and the comparison is shown in Table 3.
As shown in Table 3 (* stands for our reimplemented results), Double Mask R-CNN has the lowest MR, 39.07%, which is 16.87% lower than the baseline. Compared with the previous best result, PedHunter (39.5% MR), our method is 0.43% lower.

Upsample Influence.
In Section 3.2, we mentioned that the upsampling method of the instance segmentation branch in the MKFRCNN module is changed from deconvolution to bilinear interpolation: because the annotation pattern of the pedestrian pseudo-instance segmentation is relatively fixed, deconvolution may overfit and thus hurt performance. Table 4 gives the experimental results obtained with the different upsampling methods for Double Mask R-CNN. As shown in Table 4, using bilinear interpolation instead of deconvolution in the instance segmentation branch achieves better detection performance: MR decreases by 0.62% and AP increases by 0.42%.
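For reference, bilinear interpolation is a fixed, parameter-free operation, so unlike deconvolution it has no weights that can overfit to the fixed pseudo-label shapes. A minimal pure-Python version (align-corners convention; real implementations would use a library routine) looks like this:

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2D list of numbers to (out_h, out_w) with bilinear
    interpolation, align-corners style."""
    in_h, in_w = len(img), len(img[0])
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for i in range(out_h):
        y = i * sy
        y0 = min(int(y), in_h - 1)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * sx
            x0 = min(int(x), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # interpolate horizontally on the two bracketing rows, then
            # vertically between the results
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out

print(bilinear_resize([[0, 1], [2, 3]], 3, 3))
# [[0.0, 0.5, 1.0], [1.0, 1.5, 2.0], [2.0, 2.5, 3.0]]
```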

Ablation Study.
In order to verify the effectiveness of each module in Double Mask R-CNN, we gradually add the SFPN, MKFRCNN, and α modules (α is the module that adds a binary mask according to pedestrian visibility) to the model. The experimental results are shown in Table 5, where Baseline* stands for our reimplemented results for comparison.
As shown in Table 5, after the addition of the SFPN module, the MR on Reasonable decreases by 12.24%; Small, Medium, and Large decrease by 8.18%, 9.05%, and 11.19%; and Heavy, Partial, and Bare decrease by 7.68%, 10.72%, and 10.69%. This significant improvement means that adding semantic information strengthens the feature extraction of crowded pedestrians and effectively improves detection performance across pedestrian scales and occlusion rates. The addition of the MKFRCNN and α modules brings further decreases. The maximum decrease occurring on the Heavy and Partial subsets indicates that the second detection pass, after adding a binary mask, can effectively locate occluded pedestrians in crowded scenes. In addition, another reason for the significant decrease of MR on the different subsets is that the detection model is pretrained on the COCO keypoints dataset; this indicates that human keypoints information can also effectively improve the network's ability to extract features of crowded pedestrians.
As shown in Table 6, with the addition of the SFPN, MKFRCNN, and α modules, the AP on all subsets keeps rising compared with Baseline*: Reasonable AP increases by 5.54%; Small, Medium, and Large improve by 13.24%, 7.83%, and 4.73%; and Heavy, Partial, and Bare improve by 12.75%, 8.98%, and 6.11%, respectively. The largest improvements, on the Small and Heavy subsets, indicate that semantic information and human keypoints information effectively improve the model's feature extraction for small and occluded objects, and further indicate the robustness of Double Mask R-CNN in detecting pedestrians at different scales and occlusion ratios. Table 7 shows the MR and per-image detection time when the visibility threshold is set to 0.6 and to 1.0 (covering all detected images with masks). When the threshold is 1.0, MR is 2.3% higher than when it is 0.6, and the inference time is four times that at 0.6, showing that covering all detected images with masks significantly increases inference time and introduces more false positives. Filtering high-pedestrian-density images based on pedestrian visibility α is more efficient and performs better.

Qualitative Analysis.
In order to visualize the advantages and disadvantages of Double Mask R-CNN in locating crowded pedestrians, some detection results are shown in Figure 6. The picture in the left column shows the bounding box detection results, and the picture in the right column shows the human keypoints detection results. The green boxes show the results of the first detection, and the red boxes show the second detection results obtained by adding a binary mask to the image. As can be seen from the six detection results in Figure 6, the proposed method is robust to pedestrians of different scales and achieves good detection results even for most small pedestrians, as shown in Figure 6(a). Moreover, our method is also robust to pedestrians at different occlusion ratios: as shown in Figures 6(a) and 6(b), the images filtered according to pedestrian visibility have high pedestrian density, and mutual occlusion is common, yet most of the occluded pedestrians are effectively detected after adding a binary mask. The calculation of pedestrian visibility depends on the detection accuracy of human keypoints; observing the keypoints detection results in each group of images shows that the keypoints branch pretrained on the COCO human keypoints dataset detects most of the visible keypoints of pedestrians. The method in this article is robust to pedestrians of different scales and occlusion ratios, but there are still some false detections, as shown in Figure 6(c), where objects similar to pedestrians are detected; hard negative example mining may alleviate this problem.

Table 3: Comparison of different detection networks.

Methods                  MR (%)   AP50 (%)
Baseline [30]            55.94    85.60
Soft NMS [14,16]         60.05    -
Repulsion loss [9,16]    54.64    -
R²NMS [17]               52.19    85.50
Joint-NMS [16]           51.79    -
Adaptive NMS [15]        49.73    84.71
Mask R-CNN* [22]         43.17    87.09
PedHunter [13]           39.5     -
Double Mask R-CNN        39.07    86.80

Experiments on WiderPerson.
In order to verify the generalization performance of our method, we also conducted experiments on the WiderPerson dataset. Since the WiderPerson dataset only provides the pedestrian visible-body box, we cannot divide the dataset according to occlusion ratio. Therefore, we only present experimental results on the All (unrestricted height and occlusion rate), Small, Medium, and Large subsets; the other settings follow our experiments on CrowdHuman. The experimental results are shown in Tables 8 and 9: with the addition of the SFPN, MKFRCNN, and α modules, the MR of all subsets keeps decreasing and AP50 keeps increasing. Compared with the Baseline, the MR of All decreased by 8.81%, and Small, Medium, and Large decreased by 8.13%, 15.46%, and 13.3%, respectively; the AP of All increased by 5.11%, and Small, Medium, and Large increased by 7.94%, 3.98%, and 2.43%, respectively.
This proves the generalization ability of our method.

Conclusions
In this article, we propose Double Mask R-CNN to improve pedestrian detection performance in crowded scenarios. Experimental results on the CrowdHuman and WiderPerson datasets show that the model combining semantic segmentation and instance segmentation (the SFPN and MKFRCNN modules) effectively strengthens the feature extraction of crowded pedestrians and improves detection accuracy across pedestrian scales and occlusion ratios. In addition, images with high pedestrian density can be filtered out by the proposed rule, which estimates pedestrian body visibility from human keypoints information, so that occluded pedestrians can be detected by adding a binary mask and reinputting the masked image into the detection network; this significantly reduces MR in crowded scenarios.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.