Pose-Guided Part-Based Adaptive Pyramid Features for Occluded Person Reidentification

Reidentifying an occluded person across nonoverlapping cameras is still a challenging task. In this work, we propose a novel pose-guided part-based adaptive pyramid neural network for occluded person reidentification. Firstly, to alleviate the impact of occlusion, we utilize pose landmarks to generate pose-guided attention maps. The attention maps help the model focus on the nonoccluded regions. Secondly, we use pyramid pooling to extract multiscale features in order to address the scale variation problem. The generated pyramid features are then multiplied by the attention maps to achieve pose-guided adaptive pyramid features. Thirdly, we propose a pose-guided body part partition scheme to deal with the alignment problem. Accordingly, the adaptive pyramid features are divided into partitions and fed into individual fully connected layers. In the end, all the part-based matching scores are fused with a weighted sum rule for person reidentification. The effectiveness of our method is clearly validated by the experimental results on two popular occluded and holistic datasets, i.e., Occluded-DukeMTMC and Market-1501.


Introduction
Person reidentification (Re-ID) aims to retrieve a probe person/pedestrian from nonoverlapping cameras [1][2][3]. It has become increasingly popular in the community due to its application potential in video surveillance, smart retailing, activity analysis, and so on. Although person Re-ID has achieved great progress in recent years, it is still a challenge to reidentify a person who is partially occluded. For example, in the surveillance scenario, a person may be occluded by walls, transformations, and other pedestrians on the road. The occlusions lead not only to the loss of target information but also to interference from the occluding content [4][5][6]. Holistic person Re-ID methods that consider only the global person feature are easily misled by the occluded body part. As shown in Figure 1, a holistic method may mistakenly match a person when a person image in another view shares a similar obstacle with the probe person image. Therefore, it is important to seek efficient ways to solve this occluded reidentification problem. One well-known solution is to use part-based models. Part-based models perform part-to-part matching and are much more robust to occlusions than holistic models [7,8]. However, they partition a person into a fixed number of parts and depend heavily on precise person detection.
In recent years, there has been a trend to leverage external cues like person masks and pose estimation for occluded person Re-ID [9][10][11][12]. Mask-guided models use a person mask to help remove background clutter, including the occluded body parts, and then perform matching using the clean body parts. A major limitation of mask-guided models is that currently even the state-of-the-art person mask segmentation models perform poorly in scenarios with person occlusion, especially when the target person is occluded by other persons [13,14]. On the other hand, owing to the advancement of pose estimation models [5,6], pose-guided models have attracted considerable attention recently [4,5]. Pose-guided models utilize the skeleton as an external cue to effectively relieve the part misalignment problem by locating each part using person landmarks.
In this paper, we propose a novel pose-guided part-based adaptive neural network for occluded person reidentification. Firstly, we use the human pose estimator [40] to generate pose landmarks, which are then used to create pixel-wise attention maps. Compared with mask-guided methods [14][15][16][17], the pose-guided method can better detect a pedestrian's location in a crowd. Then, we use pyramid pooling to extract multiscale features in order to address the scale variation problem. To adapt to the pyramid features, we also generate multiscale attention maps. The pyramid features are finally multiplied by the attention maps of the corresponding scales to achieve pose-guided adaptive pyramid features.
In the matching stage, we propose a pose-guided dynamic body part partition scheme to deal with the alignment problem, in which the human parts are defined dynamically based on pose landmarks. In the end, all the part-based matching scores are fused with a weighted sum rule for reidentification. Experimental results on occluded and holistic Re-ID datasets, i.e., Occluded-DukeMTMC and the Market-1501, validate the effectiveness of our proposed method.
The main contributions of our work are summarized as follows: (1) we propose a pose-guided part-based adaptive pyramid neural network (PPAPN) that uses pose-guided attention maps and pyramid pooling to deal with the occlusion and scale problems; (2) we propose a pose-guided dynamic body part partition scheme to tackle the alignment problem; and (3) we compare the proposed method with many well-known methods on two popular occluded and holistic Re-ID datasets, and the experimental results validate that our proposed method is effective. The rest of the paper is organized as follows. Related works are reviewed first. The structure of our proposed PPAPN and implementation details are presented next. We then present the experimental results in detail, and the conclusions of our work are described at the end.

Person Reidentification.
Person reidentification aims to match a probe pedestrian image against candidate pedestrian images across disjoint cameras [1][2][3]. In recent years, deep learning algorithms [18][19][20][21][22][23] for person reidentification have shown clear superiority in matching accuracy. Most of these methods focus on extracting distinguishable features from holistic pedestrian images [7][8][9][22][23][24][25]. The methods in [5-7, 9, 24-26] employ part-based feature learning and significantly improve on global-feature person Re-ID methods on the holistic person Re-ID datasets. However, they do not take occlusions into consideration. Besides, since human pose estimation and landmark detection have achieved impressive progress, several recent Re-ID works employ these tools to acquire aligned subregions of person images [6,7,24,25]. However, occlusion cannot be ignored, especially in crowded scenes like schools or metro stations.

Occluded Person Reidentification.
Occluded person Re-ID [26][27][28] aims to retrieve occluded probe images among occluded gallery pedestrian images from disjoint cameras. Because the images suffer from occlusion noise, occluded person Re-ID is a more challenging task than holistic person Re-ID. To address this problem, Zhuo et al. [26] defined the occluded reidentification problem for the first time; they employed multitask losses that lead the model to distinguish simulated occluded samples from nonoccluded samples, thereby learning a robust feature representation from occluded person images. In their following work, the model employs a cosaliency network that strives to pay attention to the visible parts of a person. More recently, Miao et al. [5] utilized semantic key-points to extract useful information from occluded person images, using a predefined threshold on key-point confidence to determine whether a part is occluded. He et al. [29] used a spatial transform module to transform the holistic image to align with the partial ones and then calculated the distance between the aligned pairs.

Partial Person Reidentification.
Compared with occluded person Re-ID [30], partial person images often arise from imperfect detection and the limits of camera views. Like occluded person Re-ID, partial person Re-ID aims to match partial probe images against holistic gallery images. Zheng et al. [30] proposed a local matching strategy based on dictionary learning, called the ambiguity-sensitive matching classifier (AMC), and introduced a sliding window matching (SWM) model to solve global-to-part matching problems. He et al. [28] proposed a matching model based on sparse reconstruction learning, called deep spatial feature reconstruction (DSR). DSR can automatically match images of different sizes, thus avoiding the time-consuming spatial alignment step. He et al. [29] further proposed a spatial feature reconstruction model, which generates multiscale features with a fully convolutional network to deal with the scale change of feature maps. Sun et al. [31] proposed a visibility-aware part model (VPM), which perceives visible areas through self-supervised learning to avoid the noise effect of the occluded regions.

Methodology
As shown in Figure 2, our framework consists of three components: a backbone with pyramid pooling, a pose-guided adaptive pyramid features module, and a pose-guided body part partition. Firstly, the backbone with pyramid pooling handles person images at multiple scales in occluded scenarios and produces multiscale pyramid features. Secondly, we use the human pose estimator OpenPose [32] to generate pose landmarks, which are used to create pose-guided attention maps of pedestrians. Then, we multiply the features from pyramid pooling by the pose-guided attention maps of the corresponding scales to obtain the pose-guided adaptive pyramid features. Finally, the pose-guided body part partition splits the pose-guided adaptive pyramid features into three local region features for the matching stage, and all the part-based matching scores are fused with a weighted sum rule for reidentification.
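The weighted-sum fusion of the part-based matching scores can be sketched as follows. Note this is a minimal illustration: the weight values and the normalization by the weight sum are assumptions for the example, not the paper's tuned parameters.

```python
def fuse_scores(part_scores, weights):
    """Weighted-sum fusion of per-part matching scores.

    part_scores: similarity score for each body part (higher = more similar).
    weights: per-part importance (illustrative values, not from the paper).
    """
    assert len(part_scores) == len(weights)
    total = sum(weights)
    # Normalizing by the weight sum keeps the fused score in the same range
    # as the inputs; an unnormalized weighted sum would also be a valid choice.
    return sum(s * w for s, w in zip(part_scores, weights)) / total

# Example: three part scores (R1, R2, R3) fused with equal weights
final_score = fuse_scores([0.8, 0.6, 0.4], [1.0, 1.0, 1.0])
```

A visible part with a reliable attention response would typically get a larger weight than a heavily occluded one, so that occluded regions contribute little to the final ranking.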

Backbone and Pyramid Pooling.
Our proposed PPAPN uses ResNet-50 [33] pretrained on ImageNet as the backbone to extract global features from given images, with a minor modification: we remove the average pooling layer and the fully connected layer of ResNet-50 to obtain the refined ResNet-50. The details of the backbone are shown in Table 1; it contains 1 convolutional layer, namely, conv1, and 4 residual block layers (conv2_x, conv3_x, conv4_x, and conv5_x). The output sizes of these layers are 384 × 128, 192 × 64, 96 × 32, 48 × 16, and 24 × 8, respectively. The last residual block conv5_x outputs the spatial feature map F ∈ R^(h×w×c), in which h, w, and c denote the height, width, and number of channels, respectively. Then, the feature map F is fed into the pyramid pooling.
Due to the different distances between pedestrians and cameras, the detected pedestrians may appear at various scales. On the one hand, if we resize the person images to the same size, their spatial features become misaligned, which introduces errors into the distance measure. On the other hand, if we feed the original multiscale person images from the pedestrian detector, the ResNet-50 backbone outputs features of variable sizes, whereas the fully connected layers require fixed-size features. In this work, we therefore apply pyramid pooling after the backbone to address the scale variation problem. The pyramid pooling contains several max-pooling layers with different output sizes that generate multiscale pyramid features. Each max-pooling layer produces a fixed-size output for any size of input, and together the layers preserve global information at different scales as much as possible. Specifically, a feature map of any size is first divided into blocks at several scales (e.g., into 1 × 1, 2 × 2, and 4 × 4 blocks), and max pooling is then performed within each block to obtain the multiscale pyramid pooling features. In this work, we divide the feature map into 4 × 2, 12 × 8, and 24 × 8 blocks. As shown in Figure 2, we finally obtain the basic multiscale pyramid features, denoted as F_pi ∈ {F_4×2, F_12×8, F_24×8}, with output sizes 4 × 2 × 1024, 12 × 8 × 1024, and 24 × 8 × 1024. The features from pooling layers with small output sizes capture the appearance of large local regions: in F_4×2, the feature map is divided into 4 × 2 blocks, so each block clearly covers a larger region than in F_12×8. Conversely, the features from pooling layers with large output sizes, like F_24×8, capture the appearance of relatively small regions in the image.
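The block-wise max pooling above can be sketched in a few lines of NumPy. A full SPP-style layer would use adaptive bin boundaries so that any input size maps to the fixed grids; this sketch assumes, for simplicity, that the feature map dimensions divide evenly by each grid.

```python
import numpy as np

def pyramid_max_pool(feat, grids=((4, 2), (12, 8), (24, 8))):
    """Block-wise max pooling of an (h, w, c) feature map onto fixed grids.

    Assumes h and w are divisible by each grid size (e.g., the 24 x 8
    conv5_x output from the paper); adaptive binning would be needed
    for arbitrary input sizes.
    """
    h, w, c = feat.shape
    pooled_maps = []
    for gh, gw in grids:
        bh, bw = h // gh, w // gw          # block height / width
        blocks = feat.reshape(gh, bh, gw, bw, c)
        pooled_maps.append(blocks.max(axis=(1, 3)))  # max within each block
    return pooled_maps
```

For the 24 × 8 grid the blocks are 1 × 1, so that scale simply passes the feature map through, while the 4 × 2 grid summarizes large local regions, matching the coarse-to-fine behavior described above.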
Thus, the scale variation problem has been well addressed.

Pose-Guided Adaptive Pyramid Features Module.
In occluded person Re-ID, it is necessary to pay more attention to the features that are more closely related to the pedestrian. We therefore use pose landmarks to generate attention maps, which separate the foreground information of the pedestrian from the occlusion noise and dig out useful background information, such as pedestrian attachments.

Pose-Guided Attention Maps Generator.
To alleviate the impact of occlusion, we design a pose-guided attention map generator to guide the model to extract more useful and robust features. Firstly, we employ the human pose estimator OpenPose [32] to detect N pose landmarks from an input person image, where N is 18 in this work. Each landmark contains its coordinates and a confidence score. Inspired by [34], we employ a sigmoid-form decay function to better distinguish occluded regions from nonoccluded regions. The decay function increases with the confidence of the landmark and is defined as follows:

LM_j = 1 / (1 + e^(−β(C_j − α))),  (1)

where LM_j denotes the sigmoid-form weight of the j-th landmark and C_j is its confidence score. The threshold α suppresses values less than α; in this paper, α is fixed at 0.5. As β increases, the decay function becomes steeper near 0.5. Similar to [5], we use the landmarks to generate the pose-guided attention maps M, each consisting of a 2D Gaussian centred on the ground-truth landmark location:

M_j(x, y) = LM_j · exp(−((x − c_xj)² + (y − c_yj)²) / (2σ²)),  (2)

where M_j denotes the generated pose-guided attention map of the j-th landmark, j = 1, 2, . . ., N, c_xj and c_yj denote the coordinates of the j-th landmark, and σ controls the spread of the Gaussian. Each attention map M_j is set to the same size h × w. To adapt to the multiscale pyramid features, we also employ bilinear interpolation to obtain multiscale pose-guided attention maps.
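The attention map generator can be sketched as follows. The sigmoid decay and Gaussian forms follow equations (1) and (2); the concrete values of β and σ are illustrative assumptions, since the paper does not fix them in this excerpt.

```python
import numpy as np

def landmark_weight(conf, alpha=0.5, beta=10.0):
    """Sigmoid-form decay of eq. (1): near 0 below threshold alpha,
    near 1 above it; beta (assumed value) controls the steepness."""
    return 1.0 / (1.0 + np.exp(-beta * (conf - alpha)))

def attention_map(cx, cy, conf, h, w, sigma=6.0, alpha=0.5, beta=10.0):
    """2D Gaussian attention map of eq. (2) for one landmark.

    (cx, cy): landmark coordinates; conf: its confidence score;
    sigma (assumed value) controls the spatial spread.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return landmark_weight(conf, alpha, beta) * gauss
```

A low-confidence (likely occluded) landmark thus produces a near-zero map, so the corresponding region contributes little after the element-wise multiplication with the pyramid features.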

Adaptive Pyramid Features.
The pose-guided adaptive pyramid feature module integrates the feature map information with the pose information of the target person. As shown in Figure 2, the feature map F is first pyramid pooled into the multiscale features F_pi. Then, the pose-guided attention maps M_i′, i = 1, 2, 3, are multiplied element-wise with the feature maps F_pi to output the pose-guided adaptive pyramid feature maps F_i′. Since each attention map M_i′ explicitly encodes the information of different regions on the target person, i.e., which region is occluded, the pose-guided attention maps M_i′ focus on nonoccluded parts of the target person and suppress the information from occluded regions.

Pose-Guided Body Part Partition.
Previous methods [22,23,31] cut the probe and gallery images into horizontal stripes and extract stripe features, as demonstrated in Figure 3(a). Obviously, it is difficult to match body parts across gallery images when the pedestrian's posture changes or occlusion occurs. We therefore propose a pose-guided body part partition. As shown in Figure 3(b), we employ the landmarks detected by the human pose estimator to explicitly localize the body parts, and we define three body parts (R1, R2, and R3) based on these landmarks. In order to extract robust features, adjacent parts share an overlapping region. Specifically, R1 contains the landmarks p0, p1, p2, p5, p14, p15, p16, p17; R2 contains the landmarks p1, p2, p5, p3, p4, p6, p7, p8, p11; and R3 contains the landmarks p8, p11, p9, p12, p10, p13. The corresponding body part bounding box B_i ∈ {B_1, B_2, B_3} is obtained from the location coordinates of the joints in each part: B_j = [x_min − 10, x_max + 10, y_min − 10, y_max + 10].

Given the landmarks from the pose estimator, we first obtain the original body part bounding boxes from the coordinates of the body joints. Second, to handle the different sizes of the adaptive pyramid features, we rescale the original bounding boxes to match each adaptive pyramid feature size. We follow the part-aligned pooling method [10] to extract these part features on query and gallery images. The output features from the pose-guided body part partition are fed into fully connected layers to obtain the partial features F_pi ∈ {F_11, F_12, F_13, F_21, F_22, F_23, F_31, F_32, F_33}. For basic discriminative learning, we regard the identification task as a multiclass classification problem. The softmax loss over the learned features is formulated as follows:

L_id = − Σ_{i=1}^{N} λ_s Σ_{l=1}^{L} y_l log p_i(l),

where N denotes the number of branches (we set N to 9 in our method), the parameter λ_s (0 < λ_s < 1) is a coefficient that balances the branches, L denotes the number of classes in the training dataset, y_l is the one-hot ground-truth label, and p_i(l) is the predicted probability of class l from branch i.
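The bounding-box construction above can be sketched directly from the landmark groups. The joint groupings and the ±10 pixel margin follow the text; the dictionary-based interface and the handling of fully occluded parts are illustrative assumptions.

```python
# OpenPose 18-keypoint joint groups as listed in the text:
# R1 = head region, R2 = torso/arms, R3 = legs (adjacent parts overlap).
PARTS = {
    "R1": (0, 1, 2, 5, 14, 15, 16, 17),
    "R2": (1, 2, 5, 3, 4, 6, 7, 8, 11),
    "R3": (8, 11, 9, 12, 10, 13),
}

def part_bbox(landmarks, part_ids, margin=10):
    """Bounding box [x_min - margin, x_max + margin, y_min - margin, y_max + margin]
    over the detected landmarks of one part.

    landmarks: {joint_id: (x, y, confidence)}; missing ids are treated as
    undetected (assumed behavior for occluded joints).
    """
    xs = [landmarks[j][0] for j in part_ids if j in landmarks]
    ys = [landmarks[j][1] for j in part_ids if j in landmarks]
    if not xs:
        return None  # no visible landmark: the part is fully occluded
    return (min(xs) - margin, max(xs) + margin,
            min(ys) - margin, max(ys) + margin)
```

Each box would then be rescaled to the resolution of every pyramid level before part-aligned pooling, so the same body region is cropped consistently across scales.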

Datasets and Evaluation Metrics.
In our experiments, an occluded person Re-ID dataset and a holistic person Re-ID dataset are employed, i.e., Occluded-DukeMTMC [18] and Market-1501. For performance evaluation, we use the standard metrics of most person Re-ID literature: the cumulative matching characteristic (CMC) curve and the mean Average Precision (mAP). For the CMC curve, we report the Rank-1 accuracy.
(i) Occluded-DukeMTMC. This dataset is a subset of DukeMTMC [15] used for image-based person reidentification. It contains 15618 training images, 17661 gallery images, and 2210 occluded query images.
Occluded-DukeMTMC is selected from DukeMTMC-Re-ID by keeping occluded images and filtering out some overlapping images.
(ii) Market-1501. This dataset includes images of 1501 persons captured by 6 different cameras. The pedestrians are cropped with bounding boxes predicted by the DPM detector [21]. The whole dataset is divided into a training set with 12936 images of 751 persons and a testing set with 3368 query images and 19732 gallery images of 750 persons.

Training Details.
We use ResNet-50 [33] as our backbone and make a minor modification to it: removing the average pooling layer and the fully connected layer and setting the stride of conv4_1 to 1. We initialize our model with the ImageNet [36] pretrained weights. In our experimental setting, we did not resize the images to a fixed size, but we augmented the data with random flipping and random erasing. We set the batch size to 32 and the number of training epochs to 60. The base learning rate is initialized at 0.1 and decayed to 0.01 after 40 epochs. The coefficient λ_s is empirically set to 0.2.
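The step learning-rate schedule above can be written as a small helper; this is a minimal sketch of the stated schedule (0.1 for the first 40 epochs, then 0.01), with the function name and interface as assumptions.

```python
def learning_rate(epoch, base_lr=0.1, decay_epoch=40, gamma=0.1):
    """Step schedule from the training setup: base_lr until decay_epoch,
    then base_lr * gamma for the remaining epochs."""
    return base_lr * gamma if epoch >= decay_epoch else base_lr
```

In a typical training loop this value would be assigned to the optimizer at the start of each epoch (e.g., via a LambdaLR-style scheduler in common deep learning frameworks).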

Results on Occluded-DukeMTMC.
We evaluate our method PPAPN on Occluded-DukeMTMC first. As shown in Table 2, six kinds of methods are compared. As we can see, there is no significant gap among the holistic Re-ID methods. For example, PCB [31] and Part-Aligned [9] both achieve approximately 43% Rank-1 score on the Occluded-DukeMTMC dataset, showing that simply using landmark information may not significantly improve the occluded Re-ID task. Our method PPAPN achieves 53.6% Rank-1 accuracy and 38.36% mAP, which outperforms all the competing methods. Compared with the state-of-the-art method PGFA [5], PPAPN surpasses it by +2.2% Rank-1 accuracy and +1.06% mAP.

Results on Market-1501.
Although some occluded Re-ID methods obtain improvements on occluded datasets, they may not achieve satisfying performance on holistic datasets; this is caused by the noise introduced during feature learning and alignment. In this part, we also evaluate our method PPAPN on the holistic person Re-ID dataset Market-1501. As shown in Table 3, our method still outperforms the state-of-the-art methods in terms of both metrics. This result demonstrates that our pose-guided adaptive techniques bring no negative effect. Figure 4 shows some retrieval examples of the PGFA [5] method and our PPAPN method on Occluded-DukeMTMC, with the top-10 retrieval results shown. The retrieval results show that PGFA is prone to mixing the information of the target person and obstacles, resulting in retrieving a wrong person with a similar obstacle. In contrast, our PPAPN makes no mistakes in the same situation.

Conclusion
In this paper, we propose a novel pose-guided part-based adaptive pyramid neural network for occluded person reidentification. Our foremost purpose is to let the model focus on nonoccluded body regions and meanwhile alleviate the negative impact of occlusion. We utilize pose landmarks to generate multiscale attention maps. Pyramid pooling is used to extract multiscale features in order to address the scale variation problem. The generated multiscale pyramid features are then multiplied by the attention maps to achieve pose-guided adaptive pyramid features. Last but not least, we propose a part-based matching approach to deal with the alignment problem. The human parts are defined dynamically based on pose landmarks, and the adaptive pyramid features are divided accordingly and fed into individual fully connected layers. In the end, all the part-based matching scores are fused with a weighted sum rule for reidentification. Our method is finally evaluated on the Occluded-DukeMTMC and Market-1501 datasets and compared with the state-of-the-art methods. Experimental results demonstrate that our method can boost person Re-ID performance in both occluded and nonoccluded scenarios.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.