Multistage Polymerization Network for Multiperson Pose Estimation



Introduction
Human pose estimation (HPE) can be understood as the position estimation of human skeletal joints, such as those in the head, left hand, and right foot. It is a fundamental yet challenging task in computer vision and has applications in many fields, such as human-computer interaction, action understanding, and autonomous driving. In recent years, great progress on HPE has been made with deep learning methods.
To obtain information that is beneficial for locating and classifying skeleton joints, existing methods mainly perform interlevel or intralevel fusion of features. In interlevel fusion, the features of different layers of the neural network are fused, as shown in Figure 1(a). Conversely, intralevel fusion refers to the fusion of feature maps of different channels in the same layer, as shown in Figure 1(b). For example, the stacked hourglass network [1] extracts features of different levels for fusion and utilizes skip connections to effectively capture various spatial relationships of keypoints. The high-resolution network (HR-Net) [2] maintained the spatial information of high-resolution features and enabled high-resolution subnets to continuously obtain semantic information provided by low-resolution features through dense connections. In the residual steps network (RSN) [3], the intralevel pyramid features were integrated to extract more detailed local spatial information, obtain delicate local representations, and accurately locate keypoints.
Although some feature fusion methods have achieved improved performance, they only used one of the two fusion methods. The fusion of intralevel features can extract much more delicate local representations, thereby retaining more precise spatial information, which is critical to the localization of keypoints. However, much unrecoverable information is lost in the downsampling and upsampling processes of intralevel fusion. Conversely, interlevel fusion can increase the capacity of the downsampling unit and thus reduce the loss of information. Therefore, combining these two complementary fusion methods is an effective way to improve the accuracy of HPE. In existing multiperson pose estimation methods, there is little work employing intralevel fusion and interlevel fusion simultaneously. To improve the accuracy of HPE, this paper explores how to combine these two feature fusion methods.
To solve this problem, this paper proposes a novel multistage polymerization network (MPN). The framework of the MPN is shown in Figure 2. In the MPN, we use the same intralevel fusion strategy as in the residual steps network (RSN). In the RSN, after channel splitting, the feature map is downsampled to different scales for intralevel fusion. On this basis, feature maps from different layers but of the same scale are fused by element-wise sum. To enhance intralevel fusion, feature connections are added between layers, and a cross-stage feature aggregation strategy is adopted to effectively propagate multiscale features from early stages to the current stage, further enriching the information contained in the current stage's features.
We notice that the network's output features usually directly enter into the attention mechanism for weighting, and thus, the network may ignore the cross-channel communication between high-level feature maps and low-level feature maps.
To solve this problem, this paper proposes a new attention module, the shuffle attention mechanism (SAM). The SAM uses a channel shuffle to enhance the cross-channel information exchange between the low-level and high-level information, thereby recalibrating the interdependence between the low-level and high-level feature maps. Experimental results also verify that the SAM can adaptively respond to important parts of the feature map.
The main contributions of this work can be summarized as follows.
(1) We propose a new MPN for HPE. The MPN enhances the image features by combining intralevel feature fusion with interlevel feature fusion, thereby improving the accuracy of HPE.
(2) We propose a new attention mechanism, the SAM, which can strengthen the communication between different levels of feature maps and highlight the response of feature maps in the spatial and channel dimensions.
The remainder of this paper is organized as follows. Section 2 introduces the related works, Section 3 describes the algorithms used to implement the proposed method, Section 4 presents and discusses the experimental results, and Section 5 concludes the paper.

Related Work
Early research in human pose estimation was built on part-based models, which use different configurations of parts to represent a person [4]. Current methods of human pose estimation can be divided into two categories: top-down approaches [1][2][3][5][6][7][8][9][10][11][12][13][14] and bottom-up approaches [15][16][17][18][19][20]. Top-down approaches first obtain the bounding box of each human body with a detector such as you only look once (YOLO) [21] or the single shot multibox detector (SSD) [22] and then detect the position of keypoints in the human region. In bottom-up approaches, all of the human keypoints in an image are detected directly, and then these parts are grouped into human instances. We mainly focus on feature fusion and strengthening feature connections in these methods and discuss the feature fusion issue from the aspect of efficient feature representation. Attention mechanisms are also widely used in these methods. However, these methods directly input the feature map into the attention mechanism for weighting, without considering the communication between different semantic layers. We design the shuffle attention mechanism (SAM) module to strengthen the connection between different semantic layers through shuffling. Therefore, we also discuss the commonly used attention mechanisms in human pose estimation (HPE).

Human Pose Estimation Method.
In recent years, heat map regression networks have been applied to achieve multiperson pose estimation. The heat map of skeleton joints was first introduced in [23], which was designed to solve the problem of the coordinate prediction of skeleton joints in traditional HPE methods. The spatial and context information of keypoints is lost in direct coordinate prediction, whereas the heat map overcomes this limitation well and has become the most common form of skeleton representation. The key of heat map based methods is to design a network architecture that regresses the heat map more effectively. GRAPH-PCNN [24] proposed a two-stage, model-agnostic framework based on the graph structure. This method added a localization subnet and a graph structure pose optimization module to the original heat map regression framework, in which the heat map was regressed by the network for rough positioning of the keypoints and for providing a keypoint candidate set. The localization subnet was used to extract visual features of each keypoint in the candidate set and predict the final keypoint coordinates. Due to the resolution reduction of the heat map, there is a quantization error in ground-truth heat maps, which leads to inaccurate model training and poor inference performance. To solve these problems, Zhang et al. [25] proposed a new distribution-aware coordinate representation of keypoints (DARK) for HPE. In DARK, a Taylor expansion was applied to decode coordinates efficiently, and unbiased heat maps were generated. Huang et al. [26] used the encoding-decoding process to generate keypoint heat maps and regarded discrete pixel points as a metric. However, this method had deviations in the data augmentation process. Therefore, a continuous measurement standard, unbiased data processing (UDP), was proposed in [27].
The continuous measurement standard was used as an image size measurement standard, which was defined as the distance between adjacent pixels in a particular space, thereby suppressing the positioning deviation caused by discrete measurement. Occlusion also affects the regression of the heat map; considering this, Qiu et al. [28] proposed an image-guided GCN network (IGP-GCN) with cascaded feature adaptation. The IGP-GCN integrated human structure and image context to optimize the estimation results and learned the pose displacement in a progressive manner. This allowed the IGP-GCN to capture the pose structure information and the image context information simultaneously. In the IGP-GCN, occluded joints can be inferred from the context information of the image and the pose structure clues.

Feature Fusion.
Most previous work on multiperson pose estimation obtained rich feature information through interlayer connections or intralayer connections. The sequential architecture of convolutional pose machines (CPM) [14] used various connection strategies to implicitly capture spatial relations between keypoints and obtained a large receptive field through a larger estimator, thereby achieving a more refined spatial representation. The pyramid residual module (PRM) proposed by Cai et al. [3] enhances the scale invariance of human components and shows great performance when using interlevel feature fusion. Newell et al. [17] proposed a U-shaped stacked hourglass network to obtain spatial connections between features of different resolutions through downsampling and skip connections. In addition, Chen et al. [5] used RefineNet combined with a cascaded pyramid network of interlayer features to maintain high-level and low-level information from multiscale feature maps. In the high-resolution network (HR-Net) [2], four subnets were connected in parallel, and repeated cross-parallel convolution was used to perform multiscale fusion and enhance the high-resolution representation. Meanwhile, the residual steps network (RSN) repeatedly enhanced the intralevel feature fusion to learn refined local representations. While these methods have verified the effectiveness of interlayer feature fusion and intralayer feature fusion, the combination of the two has rarely been explored in human pose estimation.

Attention Mechanism.
The performance of attention mechanisms in computer vision is remarkable. Channel attention, spatial attention, and spatial attention combined with channel attention are the most used attention mechanisms at present.

Channel Attention.
The Squeeze-and-Excitation Network (SE-Net) [29] adaptively highlights channel-wise feature maps by modeling channel-wise statistics through its "Squeeze-and-Excitation" block. The discriminative feature network (DFNet) [30] used global average pooling to introduce global context information and included a smooth network with global information and a channel attention model to improve intraclass consistency.

Spatial Attention.
Kligvasser et al. [31] proposed a spatial activation function with depth-wise separable convolution. Zhao et al. [32] studied the spatial attention mechanism from the perspective of information flow. However, these methods considered only the channel dimension or only the spatial dimension, while ignoring the combination of spatial attention and channel attention.
Spatial attention combined with channel attention: Spatial and Channel-wise Attention in Convolutional Networks (SCA-CNN) [33] proposed spatial and channel attention. Attention was applied not only along the channel dimension but also along the spatial dimension to indicate which part of the feature map needed to be attended to.
Chu et al. [34] proposed a multiscale attention model, multicontext attention (MCA), that improved the performance of pose estimation. Su et al. [12] proposed the Spatial and Channel-wise Attention Residual Bottleneck (SCARB) for multiperson pose estimation and studied the modeling order of space and channels. Meanwhile, Woo et al. [35] proposed the Convolutional Block Attention Module (CBAM), whose channel attention module uses both global average pooling and max pooling. The dual attention network (DANet) [36] was proposed to adaptively integrate local features and global dependencies; it models semantic dependencies with two parallel attention modules, one in the channel dimension and one in the spatial dimension.

Method
The overall framework of the multistage polymerization network (MPN) is shown in Figure 2. It is a cascade of several multistage polymerization block (MPB) modules. The shuffle attention mechanism (SAM) is used in the final stage. In this section, we describe these modules in detail.

MPN: Multistage Polymerization Network.
For the input image, a convolutional layer is applied to compute the feature maps. This layer has a total of 104 convolution kernels. It is followed by the MPB network, which is designed to achieve intralayer fusion and interlayer fusion. The input feature maps of the MPB network are regularly sliced into four parts $F = \{f_1, f_2, f_3, f_4\}$ along the channel dimension.
The MPB network is a cascade system of MPB modules. The sliced feature maps are fed into the first MPB module. Each MPB module consists of two operation blocks, which are designed according to the RSN [3] and are shown in Figure 3. In each block, the input feature maps are fed into the convolutional network. Four convolutional networks with different numbers of convolutional kernels are applied to generate features with different scales from the four input features, respectively. As shown in Figure 3, the number of the convolutional layers in these four convolutional networks is 4, 3, 2, and 1, respectively. All of these convolutional layers are built via a convolution operation.
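The channel slicing and the four branches of decreasing depth can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `placeholder_conv` is a hypothetical stand-in for a convolutional layer, the 64 × 48 spatial size is an assumed value, equal-width slices are assumed, and the cross connections between branches used in the RSN are omitted.

```python
import numpy as np

def placeholder_conv(x):
    # Hypothetical stand-in for one convolutional layer: it keeps the shape
    # and applies a ReLU so the sketch runs without a deep learning library.
    return np.maximum(x, 0.0)

def intralevel_block(features, depths=(4, 3, 2, 1)):
    # features: array of shape (C, H, W); C must be divisible by len(depths).
    # Slice the channels into equal parts, F = {f1, f2, f3, f4}.
    slices = np.split(features, len(depths), axis=0)
    outputs = []
    for f, depth in zip(slices, depths):
        out = f
        # Each branch stacks a different number of layers (4, 3, 2, 1).
        for _ in range(depth):
            out = placeholder_conv(out)
        outputs.append(out)
    # Concatenate the branch outputs back along the channel axis.
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
feats = rng.standard_normal((104, 64, 48))  # 104 channels, assumed 64 x 48 map
block_out = intralevel_block(feats)
```

With 104 input channels, each of the four slices carries 26 channels, and the concatenated output has the same shape as the input.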
Suppose that $x^i_1, x^i_2, x^i_3, x^i_4$ are the outputs of the first block in the i-th MPB module. For intralevel fusion, these output features are concatenated to generate the block features $X^i$. These block features are upsampled by ½x and fed into the second operation block of the i-th MPB module. For the second block, the same operation as in the first block is applied, and its outputs are defined as $y^i_1, y^i_2, y^i_3, y^i_4$. Finally, the interlevel fusion between these blocks is applied to output the feature $M^i$ of each MPB, as defined in the following equation.
The MPB module follows the idea of intermediate supervision and performs a loss calculation for each MPB module. First, we use a Gaussian kernel to spread the labels of all keypoints onto the heat map $Y$, as defined in equation (2), where $\sigma$ is the object size-adaptive standard deviation, $(x, y)$ is a location on the heat map, and $(\tilde{x}, \tilde{y})$ is the ground-truth label coordinate. In this work, we build the heat map for each keypoint of the human skeleton independently. For the output feature $M^i$ of each MPB module, a keypoint prediction network, consisting of an upsampling operation and two convolution operations, is applied to map the feature to the skeleton prediction heat map. Finally, the mean squared error (MSE) function is used to compute the prediction error of each MPB module, and the overall loss of the MPB network is defined as in equation (3).
Here, $N$ is the number of MPBs in the MPN, and $K$ is the number of keypoints of the human skeleton. $Y^i_j$ is the heat map of the j-th keypoint predicted by the i-th MPB module, and $\tilde{Y}_j$ is the ground-truth heat map of the j-th keypoint.
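The heat map construction and the per-module supervision can be illustrated with a short NumPy sketch. It assumes the standard unnormalized Gaussian form $\exp(-((x-\tilde{x})^2 + (y-\tilde{y})^2)/(2\sigma^2))$ and sums the per-map MSE over all $N$ modules and $K$ keypoints; the exact normalization of equations (2) and (3) is not recoverable from the text, so both choices are assumptions, and the function names are hypothetical.

```python
import numpy as np

def gaussian_heatmap(height, width, center_x, center_y, sigma):
    # Ground-truth heat map for one keypoint located at (center_x, center_y).
    ys, xs = np.mgrid[0:height, 0:width]
    dist2 = (xs - center_x) ** 2 + (ys - center_y) ** 2
    return np.exp(-dist2 / (2.0 * sigma ** 2))

def mpb_loss(predictions, targets):
    # predictions: (N, K, H, W) heat maps predicted by the N MPB modules.
    # targets:     (K, H, W) ground-truth heat maps shared by all modules.
    diff = predictions - targets[None]
    per_map_mse = np.mean(diff ** 2, axis=(2, 3))  # shape (N, K)
    # Assumed reduction: plain sum over modules and keypoints.
    return float(per_map_mse.sum())
```

Because every MPB module is compared against the same ground-truth maps, early modules receive a direct training signal instead of relying only on gradients from the final stage.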
The multistage polymerization block (MPB) module draws on the residual steps network (RSN) for intralevel fusion and uses cross-stage connections for interlevel fusion. The gradient gap between features formed by the tightly connected structure is very narrow. In addition, channel information with different characteristics at different levels can complement and strengthen each other to obtain more precise spatial and semantic information.
As shown in Figure 4, the first module of the SAM is the channel shuffle with residual connections. After shuffling, a 1 × 1 convolutional operation and a Sigmoid activation function are applied to obtain the spatial attention $\alpha$. The last part of the SAM is the channel attention, which consists of a global pooling, two 1 × 1 convolutional operations, a ReLU activation function, and a Sigmoid activation function to obtain the channel attention vector $\beta$.

Channel Shuffle Operation.
To achieve feature communication, we use a channel shuffle instead of dense pointwise convolution. As shown in Figure 4(a), the channel shuffle operation can be modelled as a process composed of "reshape-transpose-reshape" operations. Assuming that the input layer is divided into $G$ groups, the input feature is reshaped into $(G, N)$ dimensions, where $N$ is the number of channels in each group. Then, the features are transposed into $(N, G)$ dimensions to ensure that the input of the following group convolution operation comes from different groups. Finally, it is reshaped into dimensions $(G, N)$ so that the information can flow between different groups. The shuffled feature is merged with the original by element-wise sum to form the output of the channel shuffle module. Suppose the input of the SAM is $f_{in}$, which is also the output of the last MPB module. The channel shuffle can be formulated as in the following equation.
Here, $CS(\cdot)$ represents the channel shuffle operation, and $f_{out}^{CS}$ is the output of the channel shuffle module.
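The reshape-transpose-reshape procedure and the residual sum can be sketched in NumPy as follows. The sketch assumes a single feature map of shape (C, H, W) and flattens the channel axis back to C after the transpose, which is the usual way the final reshape is implemented; the function names are hypothetical.

```python
import numpy as np

def channel_shuffle(x, groups):
    # x: feature map of shape (C, H, W); C must be divisible by groups.
    c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    n = c // groups
    x = x.reshape(groups, n, h, w)   # reshape: (G, N, H, W)
    x = x.transpose(1, 0, 2, 3)      # transpose: (N, G, H, W)
    return x.reshape(c, h, w)        # flatten the channel axes back to C

def shuffle_with_residual(f_in, groups):
    # Element-wise sum of the shuffled feature and the original input.
    return channel_shuffle(f_in, groups) + f_in
```

For six channels indexed 0 to 5 and two groups, the shuffled channel order is 0, 3, 1, 4, 2, 5, so each consecutive pair of channels now mixes both groups.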

Attention Mechanism.
Spatial attention: areas of the feature map that are unrelated to keypoints lead to undesirable keypoint localization results. The function of the spatial attention mechanism is to weight the feature map, reduce the interference of irrelevant areas, and adaptively highlight the areas related to the localization task. The spatial-wise attention weight $\alpha$ is generated by a convolutional operation followed by a Sigmoid function on the input. The spatial attention can be formulated as in the following equation.
Here, $\mathrm{Conv}(\cdot)$ denotes the convolution operation, and $W$ is the learnable weight of the convolution operation. $\mathrm{Sigmoid}(\cdot)$ is the Sigmoid activation function. Finally, the feature is rescaled by the learned spatial attention weight $\alpha$, and the output is defined as in equation (6). $f_{out}^{at}$ is the output of the spatial attention mechanism.
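A minimal NumPy sketch of this spatial attention step is given below. It assumes that the 1 × 1 convolution has a single output channel, so that $\alpha$ is one weight per spatial location, and that rescaling means element-wise multiplication broadcast over the channels; both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(f_cs, weight, bias=0.0):
    # f_cs:   (C, H, W) output of the channel shuffle module.
    # weight: (C,) kernel of a 1 x 1 convolution with one output channel.
    # A 1 x 1 convolution is a weighted sum over channels at each location.
    logits = np.tensordot(weight, f_cs, axes=([0], [0])) + bias  # (H, W)
    alpha = sigmoid(logits)
    # Rescale every channel by the same spatial weight map.
    return alpha, f_cs * alpha[None]
```

Locations with a strongly negative response are suppressed toward zero, while locations with a strongly positive response pass through almost unchanged.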

Channel Attention.
Each channel of the feature map is the feature activation of the corresponding convolutional layer. Since a convolution only operates in a local space, it is difficult to obtain enough information to extract the relationship between channels. Inspired by the Squeeze-and-Excitation Network (SE-Net) [29], which used an excitation module to learn the weight of the feature map of each convolutional layer, we regard channel attention as the process of adaptively selecting the convolutional layer.
In the squeeze step, the output feature of the spatial attention mechanism $f_{out}^{at}$ is used as the input of the channel attention. We encode the entire spatial feature on a channel as a global feature and use global average pooling on $f_{out}^{at}$ to generate the channel statistics $Z \in \mathbb{R}^{C}$, as defined in the following equation.
Here, $Z_t$ is the t-th element of $Z$, and $U_t$ represents the output of the t-th convolution kernel in the channel attention network.
The squeeze operation obtains a global description of the features, but another operation is needed to capture the relationship between channels. It must be able to learn the nonlinear relationship between channels. Moreover, the learned relationship must not be mutually exclusive, because multiple channels are allowed to be emphasized rather than a single one-hot activation. Therefore, a Sigmoid gating mechanism is applied to the channel statistics $Z$, as defined in the following equation.
Here, $W_1 \in \mathbb{R}^{C \times C}$ and $W_2 \in \mathbb{R}^{C \times C}$ denote the learnable parameters of the two fully connected layers, and $\mathrm{ReLU}(\cdot)$ denotes the ReLU activation function.
Finally, the channel attention weight $\beta$ is learned by the SAM. The output of the SAM is generated by the following equation.
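The squeeze, gating, and rescaling steps can be sketched in NumPy as below. The sketch assumes the SE-style form $\beta = \mathrm{Sigmoid}(W_2 \, \mathrm{ReLU}(W_1 Z))$ without bias terms and assumes that the SAM output is the channel-wise product of $\beta$ and the spatial attention output; the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f_at, w1, w2):
    # f_at: (C, H, W) output of the spatial attention step.
    # w1, w2: (C, C) weights of the two fully connected layers.
    z = f_at.mean(axis=(1, 2))            # squeeze: global average pooling
    hidden = np.maximum(w1 @ z, 0.0)      # first layer followed by ReLU
    beta = sigmoid(w2 @ hidden)           # Sigmoid gate, one weight per channel
    # Rescale each channel by its attention weight (assumed SAM output).
    return beta, f_at * beta[:, None, None]
```

Because the gate is a Sigmoid rather than a softmax, several channels can receive weights close to one at the same time.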
As with the feature of the MPB, we turn the output feature of the SAM, $f_{out}^{SAM}$, into estimated keypoint heat maps $Y_j^{SAM}$. The loss of the SAM module is defined in the following equation.
Here, $Y_j^{SAM}$ is the heat map of the j-th keypoint predicted from the feature of the SAM. Finally, the overall loss function of the MPN is defined as in equation (11), which consists of the losses of the MPB and the SAM. In the training stage, the weights of the proposed method are obtained by minimizing the overall loss function.

The COCO [37] train2017 set, which includes 57 K images and 150 K person instances, is used to train the proposed model; the COCO minival dataset is used as the testing set. The input image is resized to 256 × 192.

MPII Dataset.
The MPII human pose dataset is a state-of-the-art benchmark for the evaluation of human pose estimation. The dataset includes around 25 K images containing over 40 K people with annotated body joints. In this experiment, the data augmentation and the training strategy are set to be the same as in the COCO dataset, except that the input image size is 256 × 192.

Training Details.
We implement the proposed MPN model in PyTorch, using 2 NVIDIA RTX 2080 Ti GPUs; the mini-batch size on each GPU is 8. The Adam optimizer is adopted, and the learning rate is linearly decayed from 5e-4 to 0. The weight decay is 1e-5. All images are randomly rotated and scaled. The rotation range is set from -45 degrees to +45 degrees, and the scale range is set from 0.7 to 1.35.
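The learning rate schedule and the augmentation ranges can be expressed as a small standard-library sketch. The total number of training steps is not stated in the text, so `total_steps` is a hypothetical parameter, and uniform sampling of the rotation and scale is an assumption.

```python
import random

BASE_LR = 5e-4  # initial learning rate stated in the training details

def linear_lr(step, total_steps, base_lr=BASE_LR):
    # Learning rate that decays linearly from base_lr at step 0
    # to 0 at step total_steps (total_steps is a hypothetical value).
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps)

def sample_augmentation(rng):
    # Draw one rotation angle (degrees) and one scale factor.
    # Uniform sampling is an assumption; only the ranges come from the text.
    rotation = rng.uniform(-45.0, 45.0)
    scale = rng.uniform(0.7, 1.35)
    return rotation, scale

rotation, scale = sample_augmentation(random.Random(0))
```

Halfway through training, this schedule gives a learning rate of 2.5e-4.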

Testing Details.
We smooth the estimated heat maps with a Gaussian filter. We average the predicted heat maps of the original image with the results of the corresponding input image. A quarter offset in the direction from the highest response to the second-highest response is used to obtain the final keypoints. As in the cascaded pyramid network (CPN) [5], the pose score is the product of the average score of the keypoints and the bounding box score.

Figure 5: Ablation study on different numbers of MPBs (multistage polymerization blocks).

COCO Evaluation Metric.
The OKS-based mean average precision (mAP) is used as the evaluation metric for the COCO dataset. Based on the squared Euclidean distance $d^2$ between a detected keypoint and the corresponding ground truth, the object keypoint similarity (OKS) is defined in the following equation.
Here, $P$ represents the ID of a person in the ground truth, $pi$ represents the i-th keypoint of person $P$, $v_{pi} = 1$ indicates that the i-th keypoint is visible, and $S_P$ represents the square root of the area occupied by this person, which is calculated from the bounding box of person $P$. $\sigma_i$ is the normalization factor of the i-th keypoint, and $d_{pi}^2$ represents the squared Euclidean distance between the predicted value and the ground truth of keypoint $pi$.
For a predicted person $P$, if the OKS value of this person, $\mathrm{OKS}_P$, is higher than the threshold $T$, the prediction is regarded as correct. The average precision is defined as in the following equation.
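The OKS and the thresholded precision can be sketched in NumPy from the symbol definitions above. The sketch assumes the form $\mathrm{OKS}_P = \sum_i \exp(-d_{pi}^2 / (2 S_P^2 \sigma_i^2)) \, \delta(v_{pi} = 1) / \sum_i \delta(v_{pi} = 1)$ and takes the precision at threshold $T$ as the fraction of persons with $\mathrm{OKS}_P > T$; the official COCO toolkit differs in details such as the per-keypoint constants, and the function names here are hypothetical.

```python
import numpy as np

def oks(pred, gt, visible, area, sigmas):
    # pred, gt: (K, 2) keypoint coordinates; visible: (K,) boolean mask.
    # area: area occupied by the person, i.e. S_P squared.
    # sigmas: (K,) per-keypoint normalization factors.
    d2 = np.sum((pred - gt) ** 2, axis=1)
    similarity = np.exp(-d2 / (2.0 * area * sigmas ** 2))
    if not visible.any():
        return 0.0
    # Average the similarity over visible keypoints only.
    return float(similarity[visible].mean())

def average_precision(oks_values, threshold):
    # Fraction of persons whose OKS exceeds the threshold T.
    oks_values = np.asarray(oks_values, dtype=float)
    return float(np.mean(oks_values > threshold))
```

A perfect prediction gives an OKS of 1, and the value decays toward 0 as the keypoints move away from the ground truth relative to the person's size.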

MPII Evaluation Metric.
The percentage of correct keypoints (PCK) reports the percentage of keypoint detections falling within a normalized distance of the ground truth. PCK is defined in the following equation.
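A minimal NumPy sketch of a per-keypoint PCK computation is shown below. It assumes that a detection counts as correct when its distance to the ground truth, divided by the person's scale factor, does not exceed the threshold; the array layout and the function name are hypothetical.

```python
import numpy as np

def pck(pred, gt, scales, threshold=0.5):
    # pred, gt: (P, K, 2) keypoint coordinates for P persons and K keypoints.
    # scales:   (P,) per-person scale factor used for normalization.
    dist = np.linalg.norm(pred - gt, axis=2)   # (P, K) Euclidean distances
    normalized = dist / scales[:, None]
    # Per-keypoint fraction of persons within the normalized threshold.
    return (normalized <= threshold).mean(axis=0)
```

The result holds one value per keypoint type, which can be averaged to obtain a single score.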
Here, $\mathrm{PCK}_i$ is the PCK value of the predicted results for the i-th keypoint, $d_p^{\mathrm{def}}$ represents the scale factor of the p-th person, and $T_k$ is a threshold set to 0.5.

Table 3: Results on the COCO test-dev dataset. "*" denotes using ensembled models. AP50 and AP75 indicate that the threshold is set to 0.5 and 0.75, respectively. APM indicates that the size of the detected target in the image ranges from $32^2$ to $96^2$, and APL indicates that the target size is greater than $96^2$.

The results are shown in Figure 5, where the number of MPB modules is set to 8, 16, 24, and 32. When the number of MPB modules reaches 32, the proposed method achieves the best performance, with a mAP of 70.5. Adding further modules would increase the number of parameters and thus the computational cost, so we choose 32 as the number of MPBs.

Cascade Connection and Skip Connection.
To verify the effectiveness of the connection strategy in the MPB, we compare the cascade connection and skip connection. The comparison results are shown in Figure 6, and it is clear that the cascade connection produces better performance.

Ablation Study of the SAM.
To verify the effectiveness of our SAM module, we compare it to existing modules: the Spatial and Channel-wise Attention Residual Bottleneck (SCARB) and the Pose Refine Machine (PRM). The input size is the default, 256 × 192. The results are shown in Table 1. Our SAM yields a mAP improvement of 0.4 relative to SCARB and of 0.6 relative to PRM. We also analyze the impact of different shuffle positions on the performance of the SAM module. SAM-A puts the shuffle operation between the spatial attention and the channel attention, and SAM-B puts the shuffle operation in front of the spatial attention and the channel attention, as shown in Table 2. SAM-B results in the best mAP of 72.3, which is an improvement of 0.1 over SAM-A.

The results on the COCO test-dev dataset are shown in Table 3. Without extra training data, our single model with the MPN backbone reaches a mAP of 70.5, and adding the Spatial and Channel-wise Attention Residual Bottleneck (SCARB) raises the mAP to 71.9, which is higher than CSM by a mAP of 0.1. The results with the SAM are higher than those with SCARB by a mAP of 0.6. These results show that our method is more effective.
We also validate the MPN on the MPII test set. As shown in Table 4, adding the SAM module yields an improvement in mAP of 2.9, which further demonstrates the generalizability of our method.
Finally, Figure 7 shows the prediction results obtained by our MPN on the MPII and COCO datasets.

Conclusions
In this paper, we propose a top-down multistage polymerization network (MPN) to handle multiperson pose estimation. The MPN learns refined keypoint representations through effective intralayer fusion and interlayer fusion. We also design a shuffle attention mechanism (SAM) module. The shuffle promotes the cross-channel information exchange between pyramid feature maps, while the attention makes a trade-off between the low-level and high-level representations of the output features. Overall, the proposed method achieves good results on two keypoint benchmarks, COCO and MPII.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.