Cascaded Hierarchical CNN for RGB-Based 3D Hand Pose Estimation

School of Software, Jiangxi Agricultural University, Nanchang 330045, China
School of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang 330045, China
State Key Lab of CAD & CG, Zhejiang University, Hangzhou 310058, China
School of Information Engineering, Nanchang University, Nanchang 330031, China
School of Foreign Languages, Huazhong University of Science and Technology, Wuhan 430074, China


Introduction
The hand is one of the most active parts of the human body. Consequently, gestures are among the main means of human expression and account for the largest proportion of all human postures. With the rapid development of computer vision technology, 3D hand pose estimation is gradually being applied in Human-Machine Interaction (HMI), Virtual Reality (VR), and Augmented Reality (AR) [1][2][3]. Vision-based 3D hand pose estimation has therefore become an active research area [4] and has made great progress after years of research [5][6][7][8][9][10][11][12][13]. However, it remains very challenging because of the diversity of gestures, the high flexibility of finger joints, the strong similarity between fingers, and severe self-occlusion. In recent years, research on 3D hand pose estimation from depth images has progressed rapidly with the development of depth cameras [14][15][16].
Firstly, the depth information in a depth image is directly beneficial for 3D hand pose estimation. Secondly, the emergence of cheap depth cameras has significantly reduced the difficulty of obtaining depth data and hence the cost of producing it. As a result, depth-based 3D hand pose estimation has achieved a great many results [17][18][19][20][21] during this period. Compared with depth images, RGB images lack depth information, which makes it difficult to estimate the 3D hand pose directly from a 2D RGB image. Therefore, the results of current RGB-based 3D hand pose estimation are not yet ideal. However, RGB-based 3D hand pose estimation is more practical, since applications based on RGB images are more widespread and RGB cameras have far more users. In this paper, we present a four-stage cascaded hierarchical CNN (4CHNet) for RGB-based 3D hand pose estimation. We cascade the four stages of the network for end-to-end training. The four stages are the hand mask estimation stage, the 2D hand pose estimation stage, the hierarchical estimation stage, and the 3D hand pose estimation stage. Through the back-propagation mechanism of the neural network, the stages promote each other and progress together. The hierarchical estimation stage processes the extracted hand features hierarchically to obtain more effective, deeper, and more representative feature information and finally fuses the features of all layers to estimate the 3D hand pose, improving the estimation accuracy of the 3D gesture. Our contributions can be summarized as follows: (1) We propose 4CHNet for RGB-based 3D hand pose estimation, in which hand pose estimation is divided into two subtasks following hierarchical thinking, namely, finger pose estimation and palm pose estimation.
More representative finger features and palm features are extracted, respectively, and finally fused to estimate the 3D hand pose, which improves the estimation accuracy of 3D gestures. (2) We propose four-stage cascaded training, which cascades the hand mask estimation stage, 2D hand pose estimation stage, hierarchical estimation stage, and 3D hand pose estimation stage for end-to-end training. Through the back-propagation mechanism, the stages benefit each other and progress together during training, achieving global optimization and refining the models. (3) Based on the hierarchical network, 2D finger heatmaps and 2D palm heatmaps are estimated. These two constraints enable the hierarchical network to stratify features and further estimate the 3D finger pose and 3D palm pose. By introducing four new constraints, the network performs better in feature extraction and 3D hand pose estimation. (4) We conduct experiments on two public datasets, and the results show that our 4CHNet achieves better 3D hand pose estimation accuracy than previous works.

Estimation Method Based on RGB Images.
Estimating the 3D hand pose directly from a single RGB image is far more challenging because of the absence of depth information. Researchers have accordingly presented different estimation methods. Zimmermann and Brox [23] first applied a deep neural network to 3D hand pose estimation from single RGB images. They used three deep networks to cover the important subtasks on the way to the 3D pose: a hand localization segmentation network, a 2D hand pose estimation network, and a 3D hand pose estimation network. Spurr et al. [33] extended the VAE framework by training several pairs of encoders and decoders to form a joint cross-modal latent space representation and estimated the 3D hand pose from input depth images and RGB images. Since a full 3D mesh of the hand surface determines the shape of the hand, it is of great help for 3D hand pose estimation, and using 3D meshes to estimate the 3D hand pose has been extensively studied recently. Ge et al. [28] added a 3D hand mesh estimation stage in which a Graph CNN [34] takes heatmaps and hand features as input and estimates the full 3D mesh of the hand surface, which is further used to regress the 3D gesture. Boukhayma et al. [30] leveraged a deep convolutional encoder to estimate hand shape parameters and gesture parameters, fed these parameters to a pretrained hand mesh model to estimate the mesh of the hand surface, and further estimated the 3D hand pose after obtaining the hand shape. Although an accurate hand mesh greatly improves the estimation accuracy of the 3D gesture, mesh-based estimation methods are hard to generalize because hand surface mesh labels are difficult to obtain. Our early work [35] proposed a three-stage cascaded CNN, mask-2d-3d, which cascaded a mask estimation stage, a 2D hand pose estimation stage, and a 3D hand pose estimation stage to estimate the 3D hand pose.
Here we need to emphasize the differences between our proposed method and the earlier work mask-2d-3d. Firstly, we add a hierarchical network to form a four-stage cascaded network, which divides the 21 key points into 15 finger-layer key points and 6 palm-layer key points to extract deeper finger features and palm features and then fuses them to estimate more accurate 3D gestures. Secondly, we add 2D palm heatmap, 2D finger heatmap, 3D palm pose, and 3D finger pose constraints to train the network effectively. We also need to emphasize the differences from Zimmermann and Brox [23]; their method was proposed earlier and has some defects. They trained the network of each estimation stage separately, so each stage reaches a local optimum rather than the global optimum. To overcome this shortcoming, we use 4CHNet, whose stages affect each other and progress together to achieve a globally optimized 3D hand pose estimation. The second difference is that Zimmermann and Brox [23] only used two simple constraints, 2D hand heatmaps and 3D gestures, and with only these two constraints it is difficult to extract deeper features. In contrast, we show that the estimation accuracy is dramatically improved by adding 2D finger heatmap, 2D palm heatmap, 3D finger pose, and 3D palm pose constraints via a hierarchical network, while also introducing hand masks and employing hand masks and 2D heatmaps to further guide feature extraction.

Estimation Method Based on Hierarchical Thinking.
The hierarchical network is spurred by the multitask sharing mechanism. In machine learning, multitask sharing has the advantage of preserving more intrinsic information than single-task learning [36]. A hierarchical network divides the hand pose estimation task into several subtasks according to the structure of the hand, extracts more intrinsic information through the multiple subtasks, and finally shares this information to estimate the 3D hand pose. Guo et al. [37] proposed a region ensemble network, which simply divided the extracted feature maps into a 2 × 2 grid of four regions and fed the features of each region into FC layers for the ensemble; the method effectively improves performance without extra heavy computational cost. Madadi et al. [38] first divided the hand features into six layers, of which five layers modeled each finger and the remaining layer modeled palm orientation features. Then, the six layers were combined to estimate all joint positions. Zhou et al. [39] divided the five fingers into three layers according to the sensitivity and function of the fingers, where one layer was correlated with the thumb, one layer modeled the index finger, and the final layer represented the remaining three fingers. Finally, the three layers were combined to estimate the hand pose. Du et al. [40] divided the hand features into two layers, that is, finger features and palm features, used a cross-connected network to refine the two-layer features, and finally fused them to estimate the hand pose. Our 4CHNet is closest to Du et al. [40], so here we also need to emphasize the differences. Firstly, our method estimates the 3D hand pose from RGB images, whereas the method proposed by Du et al. [40] is based on depth images.
Secondly, we use 4CHNet, jointly exploiting hand mask estimation, 2D hand pose estimation, hierarchical estimation, and 3D hand pose estimation to estimate the 3D gesture, which is essentially different from the network architecture of Du et al. [40].

Four-Stage Cascaded Hierarchical CNN
3.1. Overview. We propose 4CHNet for estimating the 3D hand pose from a single RGB image, as illustrated in Figure 1. Firstly, we use a localization segmentation network to localize and crop the hand in the RGB image as preprocessing. The cropped RGB image is used as the input of 4CHNet to estimate hand masks, 2D hand heatmaps, 2D finger heatmaps, 2D palm heatmaps, 3D finger poses, and 3D palm poses, and then the full 3D hand pose is estimated by fusing the 3D poses of the fingers and the palm.

Localization and Segmentation Network.
The localization segmentation network is used to determine the location of the hand; the low-resolution hand region is then obtained and enlarged, which is the basis for subsequent gesture estimation. Without an appropriate localization segmentation network, even accurate 3D hand pose estimation lacks practical significance. We use a simplified version of Convolutional Pose Machines [41] as the localization segmentation network and extract the spatial features of the hand by estimating a two-channel hand mask. The loss is calculated against hand mask labels to provide feedback for training the localization segmentation network. Through the estimated hand mask, we locate the hand in the RGB image and then crop and resize it to 256 × 256.
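As an illustration, the mask-driven crop-and-resize step might look as follows (a minimal NumPy sketch; the function name, the margin, and the nearest-neighbour resize are our own assumptions, not the paper's implementation):

```python
import numpy as np

def crop_hand(image, mask, out_size=256, margin=8):
    """Crop the hand region indicated by a binary mask and resize it.

    `image` is HxWx3, `mask` is HxW with nonzero pixels on the hand.
    Resizing uses simple nearest-neighbour indexing to stay dependency-free.
    """
    ys, xs = np.nonzero(mask)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin, mask.shape[0] - 1)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin, mask.shape[1] - 1)
    crop = image[y0:y1 + 1, x0:x1 + 1]
    # nearest-neighbour resize to out_size x out_size
    ry = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    rx = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[ry][:, rx]
```

In practice a learned segmentation network provides `mask`; the sketch only shows how a fixed-size 256 × 256 input is derived from it.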

4CHNet.
We apply the cascade principle to our overall network, cascading four stages for end-to-end training. The four stages are the hand mask estimation stage, the 2D hand pose estimation stage, the hierarchical estimation stage, and the 3D hand pose estimation stage. The four stages benefit each other and progress together, thereby achieving global optimization and improving the accuracy of 3D hand pose estimation.

Hand Mask Estimation Stage.
In the hand mask estimation stage, we use a simplified version of the VGG-19 network [42]. Both the 128-channel image feature F1 and the 2-channel spatial feature, namely, the hand mask M, are extracted by convolution, and the mask labels of the dataset are used to train the network. Hands can be better tracked through the spatial feature, which is helpful for subsequent hand pose estimation.

2D Hand Pose Estimation Stage.
The 2D hand pose estimation stage consists of five substages. The first substage takes the 130-channel feature S as input, which consists of the 128-channel image feature F1 and the 2-channel spatial feature M extracted in the mask estimation stage, and outputs 21-channel heatmaps. In each of the last four substages, the 21-channel hand heatmaps estimated in the previous substage are concatenated with the 130-channel feature S to form a 151-channel feature, which is taken as the input for estimating that substage's 2D hand heatmaps. We use the hand heatmaps of the final substage as the final output and use the 2D labels of the datasets to train the network.
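The channel bookkeeping of these substages can be sketched with plain array concatenation (NumPy arrays stand in for the network's feature maps; the 32 × 32 spatial size is an arbitrary illustrative choice):

```python
import numpy as np

# Feature tensors in HxWxC layout (single example, hypothetical 32x32 maps).
F1 = np.zeros((32, 32, 128))   # image features from the mask estimation stage
M  = np.zeros((32, 32, 2))     # two-channel spatial feature (hand mask)

# 130-channel input S of the first substage: image features + mask channels.
S = np.concatenate([F1, M], axis=-1)

# 21-channel heatmaps produced by a substage, fed back into the next one.
H = np.zeros((32, 32, 21))

# 151-channel input of substages 2-5: previous heatmaps + the feature S.
S_next = np.concatenate([S, H], axis=-1)
```

The same concatenation pattern recurs in the hierarchical stage, where the 151-channel full hand feature F is formed from S and the 21-channel heatmaps.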

Hierarchical Estimation Stage.
The hierarchical estimation stage is similar to the 2D hand pose estimation stage in that both estimate 2D heatmaps, but the hierarchical estimation stage divides the hand features into two layers: finger features and palm features. The 21 key points of the hand are shown in Figure 2(a). We assign 6 key points to the palm and the remaining 15 key points to the fingers. The key-point division on the real dataset STB is shown in Figure 2(b), and that on the synthetic dataset RHD is shown in Figure 2(c). The left side of each demonstration is an example of finger key points, and the right side is an example of palm key points. The hierarchical network estimates 2D finger heatmaps and 2D palm heatmaps independently, which helps to further estimate the 3D finger pose and 3D palm pose (see Figure 3).
There are three substages in each layer of this stage. Taking the finger layer as an example, the first substage concatenates the 130-channel feature S and the 21-channel hand heatmaps output by the previous stage to form the 151-channel full hand feature F, which serves as the input for estimating the 15-channel finger heatmaps F_f1. Then, each of the last two substages takes as input the 15-channel finger heatmaps obtained from the previous substage concatenated with the 151-channel full hand feature F. In total, three substages of finger heatmaps are estimated, and the finger heatmaps F_f3 of the final substage are taken as the output. The same principle applies to the palm layer. Here, we use the 2D finger and 2D palm labels of the datasets to train the hierarchical network. F represents the full hand features, F_f the finger features, and F_p the palm features, while C15 and C6 represent the convolutional networks employed to extract the features of the fingers and the palm, respectively: F_f = C15(F), F_p = C6(F).

3D Hand Pose Estimation Stage.
The 3D hand pose estimation stage takes the 2D finger heatmaps and 2D palm heatmaps output by the hierarchical network as inputs, estimates the 3D finger poses and 3D palm poses, and fuses them to estimate the 3D hand pose. We employ the method proposed by Zimmermann and Brox [23] to represent the 3D pose. To estimate the relative normalized coordinates x_rel of the key points, the length of the first bone of the index finger, ‖x_{k+1} − x_k‖, is selected as the standard length, where x_{k+1} and x_k are the two endpoints of that bone, and the palm point x_r is taken as the origin: x_rel = (x − x_r)/‖x_{k+1} − x_k‖. To facilitate the estimation of hands with different poses, the relative normalized coordinates x_rel are rotated by a 3D rotation matrix R to obtain the canonical coordinates x_c = R · x_rel. The gesture directions of these canonical coordinates are consistent, which is convenient for 3D hand pose estimation. We estimate the canonical coordinates x_c and the 3D rotation matrix R to indirectly estimate the relative normalized 3D coordinates x_rel of the 21 key points.
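The normalization step described above can be sketched as follows (a NumPy sketch; the root and bone joint indices are hypothetical placeholders, since the paper does not list its key-point numbering):

```python
import numpy as np

def normalize_pose(x, root=0, bone=(12, 11)):
    """Relative normalized coordinates of a (21, 3) key-point array.

    Following the representation of [23]: subtract the root joint x_r so the
    palm point becomes the origin, then divide by the length of one reference
    bone ||x_{k+1} - x_k|| (here the first bone of the index finger). The
    index choices `root` and `bone` are illustrative, not the paper's.
    """
    rel = x - x[root]                                 # palm point as origin
    scale = np.linalg.norm(x[bone[0]] - x[bone[1]])   # reference bone length
    return rel / scale
```

After this normalization the reference bone has unit length in every sample, so poses of differently sized hands become directly comparable.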

Estimation Loss of Mask.
The mask estimation loss loss_mask is the standard softmax cross-entropy loss, where y is the label, s_u is the output score of the u-th class in the mask estimation stage, and the mask is a binary map, u ∈ {0, 1}: loss_mask = −Σ log(e^{s_y}/(e^{s_0} + e^{s_1})), summed over the pixels of the mask.
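A minimal NumPy sketch of this per-pixel softmax cross-entropy (the array shapes are illustrative; a framework implementation would normally be used instead):

```python
import numpy as np

def mask_loss(scores, labels):
    """Standard softmax cross-entropy over a two-channel mask.

    `scores` is HxWx2 (per-pixel scores s_u for u in {0, 1}),
    `labels` is HxW with the ground-truth class y per pixel.
    Returns the mean negative log-probability of the true class.
    """
    s = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    log_p = s - np.log(np.exp(s).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    return -log_p[np.arange(h)[:, None], np.arange(w)[None, :], labels].mean()
```

With uniform scores the loss equals ln 2, the entropy of a fair binary guess, which is a convenient sanity check.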

Estimation Loss of 2D Hand Pose.
A squared L2 loss is imposed on the 2D heatmaps of the 21 key points to compute the 2D hand pose estimation loss loss_2d, where pre_j is the estimated 2D hand heatmap, gt_j is its corresponding label, and j is the key-point index: loss_2d = Σ_j ‖pre_j − gt_j‖².
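A sketch of this heatmap loss, together with a hypothetical Gaussian ground-truth heatmap of the kind commonly rendered from 2D key-point labels (the map size and σ are illustrative assumptions, not values from the paper):

```python
import numpy as np

def gaussian_heatmap(center, size=64, sigma=2.0):
    """Hypothetical ground-truth heatmap: a 2D Gaussian at key point (cx, cy)."""
    y, x = np.mgrid[:size, :size]
    return np.exp(-((x - center[0]) ** 2 + (y - center[1]) ** 2)
                  / (2 * sigma ** 2))

def heatmap_loss(pre, gt):
    """Squared L2 loss summed over key-point heatmaps: sum_j ||pre_j - gt_j||^2."""
    return ((pre - gt) ** 2).sum()
```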

Estimation Loss of Hierarchical Stage.
The hierarchical estimation loss loss_L is the sum of the 2D finger heatmap loss loss_2d_finger and the 2D palm heatmap loss loss_2d_palm, both calculated with the squared L2 loss, where pre_{j_f} and pre_{j_p} are the estimated 2D finger and palm heatmaps, gt_{j_f} and gt_{j_p} are their corresponding 2D key-point labels, and j_f and j_p index the finger and palm key points, respectively: loss_2d_finger = Σ_{j_f} ‖pre_{j_f} − gt_{j_f}‖², loss_2d_palm = Σ_{j_p} ‖pre_{j_p} − gt_{j_p}‖², and loss_L = loss_2d_finger + loss_2d_palm.

Estimation Loss of 3D Hand Pose.
The 3D hand pose estimation loss loss_3d comprises the 3D finger pose loss loss_3d_f, the 3D palm pose loss loss_3d_p, and the full hand pose loss loss_3d_h, each computed with the squared L2 loss on the canonical coordinates x_c and the 3D rotation matrix R, respectively: loss_3d_f = ‖x_c^{f,pre} − x_c^{f,gt}‖² + ‖R^{pre} − R^{gt}‖², with loss_3d_p and loss_3d_h defined analogously for the palm and the full hand. The sum of the 3D estimation losses is loss_3d = loss_3d_f + loss_3d_p + loss_3d_h. The total loss of 3D hand pose estimation is loss = v · loss_mask + loss_2d + loss_L + loss_3d.
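Putting the stage losses together is a simple scalar combination (a sketch with hypothetical scalar stage losses; v = 0.05 is the weight the paper reports):

```python
def total_loss(loss_mask, loss_2d, loss_L, loss_3d, v=0.05):
    """Weighted total training loss: v * loss_mask + loss_2d + loss_L + loss_3d.

    The weight v down-scales the mask loss, whose raw value is much larger
    than the other terms.
    """
    return v * loss_mask + loss_2d + loss_L + loss_3d
```

For example, a raw mask loss of 100 contributes only 5 to the total, keeping the four terms at comparable magnitudes.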
Because the loss value of loss_mask is large, we add a weight ratio v to this term to reduce its contribution. A large number of experiments showed that v = 0.05 achieves the best result.

OHK.
The dataset of [27] is a single-hand RGB-based dataset, hereinafter referred to as OHK. Images in OHK are real images captured under different backgrounds and lighting conditions, including 10000 images for training and the remaining 1703 images for testing. Each RGB image has a corresponding mask label and 2D labels for the 21 key points. In this work, we use the hand mask labels of the real dataset OHK to train the localization segmentation network so as to enhance its adaptability to the real world; we then employ the localization segmentation network to localize the hand in an RGB image and crop and enlarge the hand to obtain the cropped RGB image, facilitating subsequent accurate 3D hand pose estimation. Because the image resolution of this dataset is not uniform, we adjust and fill the OHK data. The unified OHK image size is 320 × 320, and the adjustment ratio is m = 320/max(w, h), where w and h are the original width and height of the image. After scaling by this ratio, we fill the lower right corner of the RGB image with the gray value (128, 128, 128), zero-fill the lower right corner of the mask, and finally output the RGB image with a resolution of 320 × 320 and its corresponding mask.
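The resizing and filling procedure can be sketched as follows (a NumPy sketch; the ratio m = 320/max(w, h) and the nearest-neighbour resize are our own reading of the description, not the authors' code):

```python
import numpy as np

def pad_to_square(rgb, mask, size=320):
    """Scale so the longer side equals `size`, then pad the lower-right corner.

    Padding follows the paper's description: gray value (128, 128, 128) for
    the RGB image, zeros for the mask. The adjustment ratio m = size / max(w, h)
    is an assumption inferred from the unified 320 x 320 output.
    """
    h, w = mask.shape
    m = size / max(h, w)
    nh, nw = int(round(h * m)), int(round(w * m))
    # nearest-neighbour resize, dependency-free
    ry = (np.arange(nh) / m).astype(int).clip(0, h - 1)
    rx = (np.arange(nw) / m).astype(int).clip(0, w - 1)
    rgb_r, mask_r = rgb[ry][:, rx], mask[ry][:, rx]
    rgb_out = np.full((size, size, 3), 128, dtype=rgb.dtype)
    mask_out = np.zeros((size, size), dtype=mask.dtype)
    rgb_out[:nh, :nw] = rgb_r
    mask_out[:nh, :nw] = mask_r
    return rgb_out, mask_out
```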

RHD. Rendered Hand Pose Dataset (RHD) [23]
is a synthetic RGB hand dataset composed of 41258 training images and 2728 testing images with a resolution of 320 × 320, obtained by requiring 20 different human models to randomly perform 39 different actions in front of randomly generated arbitrary backgrounds. The dataset is considerably challenging due to large variations in viewpoint and hand proportion, as well as the large visual diversity induced by random noise and the ambiguity of the images. For each RGB image, it provides the corresponding depth image, mask label, and 2D and 3D labels of the 21 hand key points. We use the mask labels, 2D labels, and 3D labels to train the entire network. However, because of the gap between synthetic and real data, a network trained on synthetic data can hardly adapt directly to the real world, so real data must be used later for adaptive adjustment.

STB. The STB dataset [43] is a real RGB hand dataset containing two subsets: the stereo subset STB-BB captured with a stereo vision camera and the color-depth subset STB-SK captured with an Intel active depth camera. Since no depth data is used in our method, we only use the subset STB-BB. STB-BB has a total of 36000 images divided into 12 parts. Following the same protocol as [23], we use 10 parts of 30000 images as the training set and the remaining 2 parts of 6000 images as the testing set. Each RGB image in this dataset has 2D and 3D labels of the 21 hand key points and a corresponding depth map, but we only use the 2D and 3D labels. After pretraining on the synthetic dataset RHD, we use the real dataset STB to refine the model and adapt it to the real world.

Evaluation Metric.
We evaluate our proposed 4CHNet on the two public datasets RHD and STB with two evaluation metrics: (1) the endpoint error (EPE), including the average endpoint error (EPE mean) and the median endpoint error (EPE median), and (2) the area under the curve (AUC) of the percentage of correct key points (PCK). Our evaluation fully adopts the same metrics as [23].

4.3. Experimental Details. Our 4CHNet is implemented in TensorFlow [44] on a single server with a single Nvidia RTX 2080Ti GPU for training and testing.
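Both evaluation metrics are straightforward to compute from per-joint errors (a NumPy sketch of EPE and PCK-AUC as we understand them from [23]; the 20-50 mm threshold range matches the RHD comparison later in the paper):

```python
import numpy as np

def epe(pred, gt):
    """Per-key-point Euclidean errors for (N, 21, 3) predictions, in mm.

    EPE mean and EPE median are simply np.mean / np.median of the result.
    """
    return np.linalg.norm(pred - gt, axis=-1).ravel()

def pck_auc(errors, lo=20.0, hi=50.0, steps=100):
    """Area under the PCK curve over [lo, hi] mm, normalized to [0, 1].

    PCK at threshold t is the fraction of key points with error <= t;
    the area is integrated with the trapezoidal rule.
    """
    thresholds = np.linspace(lo, hi, steps)
    pck = np.array([(errors <= t).mean() for t in thresholds])
    area = ((pck[1:] + pck[:-1]) / 2 * np.diff(thresholds)).sum()
    return area / (hi - lo)
```

A perfect predictor gives PCK = 1 at every threshold and hence an AUC of exactly 1.0.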

Localization Segmentation Network Training Details.
We use the real dataset OHK with mask labels to train the localization segmentation network. A batch size of 8 and an initial learning rate of 1 × 10^−5 are employed for training 40 K iterations. To prevent overfitting, we set the decay ratio to 0.1: the learning rate decays to 1 × 10^−6 after the first 20 K iterations and then decays again every further 10 K iterations.
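The schedule can be read as a simple step decay (a sketch of one possible interpretation of the reported settings; the exact schedule in the authors' code may differ):

```python
def step_decay(step, base_lr=1e-5, decay=0.1, first=20_000, every=10_000):
    """Step-wise learning-rate schedule: hold `base_lr` for the first 20 K
    iterations, then multiply by the decay ratio 0.1 every further 10 K
    iterations (an interpretation of the paper's description)."""
    if step < first:
        return base_lr
    return base_lr * decay ** (1 + (step - first) // every)
```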

Training Details of 4CHNet
(1) Pretraining on the Synthetic Dataset RHD. We adopt the synthetic dataset RHD to pretrain 4CHNet and use the mask labels and the 2D and 3D labels of the dataset to supervise the training. The training batch size is 8 and the initial learning rate is 5 × 10^−5 for training 300 K iterations, with a learning rate decay ratio of 0.3 applied every 50 K iterations.
(2) Refinement on the Real Dataset. Starting from the RHD-pretrained network, in order to adapt the model to the real world, we use the real dataset STB to refine the model, training the network with its 2D and 3D labels for 250 K iterations. The remaining training parameters are consistent with those of the pretraining stage.

Self-Comparison Experiment.
Our early work [35] experimented on a three-stage cascaded network and compared it with other methods in ablation experiments, demonstrating the effectiveness of the newly added mask estimation stage and the cascaded network. On this basis, we propose the four-stage cascaded network and compare it with the three-stage cascaded network to demonstrate the effectiveness of the newly added hierarchical network. In this experiment, we also designed four other network training schemes: 2d means that the 2D and 3D networks are trained separately without a mask estimation stage; mask-2d means that mask estimation and 2D hand pose estimation are trained jointly while the 3D estimation stage is trained alone; 2d-3d represents the cascaded training of the 2D and 3D estimation stages without a mask estimation stage; mask-2d-3d represents the three-stage cascaded network; and Ours is the proposed 4CHNet. Previous work [35] verified the superiority of OHK for training segmentation networks, so our experiment uses the localization segmentation network trained on OHK, fuses RHD and STB to train the networks, and keeps the parameters consistent. Figure 4 and Table 1 show the experimental results. The AUC of the four-stage cascaded network, denoted by Ours, reaches 0.720 and 0.822 within the error thresholds of 0-30 mm and 0-50 mm, which is higher than the 0.706 and 0.811 of the three-stage cascaded network mask-2d-3d and far higher than those of the other network structures. The average endpoint error of our four-stage cascaded network is reduced to 8.878 mm, a reduction of 5.53% compared with the 9.398 mm of the three-stage cascaded network, while the median endpoint errors of the two networks are similar. This self-comparison experiment verifies the superiority of the proposed 4CHNet over the three-stage cascaded network.
Owing to the newly added hierarchical network, the 2D finger heatmap, 2D palm heatmap, 3D finger pose, and 3D palm pose constraints, and the four-stage cascade, the estimation accuracy of the 3D key points is greatly improved.

Comparison with Other Methods.
We compare our 4CHNet with most state-of-the-art methods on the two public datasets: [23,35] on RHD and [23,25,33,35,43,45] on STB, adopting the same evaluation metrics as [23]. Notably, we use a localization segmentation network to locate the hand in the image instead of directly processing the original image; therefore, in addition to the pose estimation error, part of our total error also comes from hand localization. The methods involved in the comparison likewise include localization errors if they also use a localization segmentation network. The comparison results on the synthetic dataset RHD are shown in Figure 5: 4CHNet achieves an AUC of 0.770 within the error threshold of 20-50 mm, which is significantly better than that of the state-of-the-art methods. Figure 6 shows the comparison on the STB dataset. Ours and Ours (without OHK) both represent 4CHNet, and both fuse the synthetic dataset RHD and the real dataset STB for training; Ours uses OHK to train the localization segmentation network for more accurate hand localization in the real world, while Ours (without OHK) uses the localization segmentation network model of [23], which was trained only on the synthetic dataset RHD. The mask-2d-3d and mask-2d-3d (without OHK) entries represent the three-stage cascaded network, where the latter uses the localization segmentation network model of [23]. The experimental results show that the AUC of Ours reaches 0.988, a significant improvement over the 0.948 of Zimmermann and Brox [23] and the 0.977 of the three-stage cascaded network, and it is also better than the state-of-the-art result on the STB dataset, which verifies the effectiveness of our method. Qualitative results on STB and RHD are shown in Figures 9 and 10, respectively.
The first column shows the original RGB images; the second and fourth columns show the estimated full hand pose in 2D and 3D, respectively; and the third and fifth columns show the corresponding labels. As can be seen from Figure 9, the 2D and 3D estimation results of 4CHNet are basically consistent with the labels on the real dataset STB. Only for a few gestures with complex motions and severe occlusions are the estimation results slightly biased, which indicates that 4CHNet generalizes well to the real world. From Figure 10, we find that, on the synthetic dataset RHD, the estimated results are close to the labels but a gap remains. This is because the synthetic dataset RHD contains a lot of noise and ambiguity and the proportion of the hand in the image is small, which makes the estimation highly difficult.

Conclusions
Based on cascaded CNNs and hierarchical CNNs, we have proposed a novel four-stage cascaded hierarchical CNN (4CHNet) for estimating the 3D hand pose from a single RGB image. The four stages are the mask estimation stage, the 2D hand pose estimation stage, the hierarchical estimation stage, and the 3D hand pose estimation stage, cascaded for end-to-end training so that they progress together to mutual benefit. At the same time, the extracted hand features are divided into a finger layer and a palm layer in the hierarchical estimation stage to estimate the corresponding finger pose and palm pose, respectively. Finally, we concatenate them to estimate the full 3D hand pose. This hierarchical network leverages finger and palm constraints to extract deeper and more representative feature information and thereby improve the accuracy of 3D hand pose estimation. In this work, we have experimented on two public datasets and compared 4CHNet with the state-of-the-art methods on both. The experimental results verify the significant improvement and conspicuous advantages of our proposed method.

Data Availability
Previously reported data were used to support this study and are available at 10