Human Pose Recognition Based on Depth Image Multifeature Fusion

Machine-vision-based human pose recognition usually suffers from a low recognition rate, low robustness, and low operating efficiency, mainly caused by the complexity of the background as well as the diversity of human poses, occlusion, and self-occlusion. To solve this problem, a feature extraction method combining the directional gradient of depth feature (DGoD) and the local difference of depth feature (LDoD) is proposed in this paper, which uses a novel strategy that compares the eight neighborhood points around a pixel with one another to calculate the differences between pixels. A new data set is then established to train a random forest classifier, and a random forest two-way voting mechanism is adopted to classify the pixels of the different parts of the human body depth image. Finally, the gravity center of each part is calculated, and a reasonable point is selected as the joint to extract the human skeleton. The experimental results on the proposed data set show that robustness and accuracy are significantly improved while maintaining competitive operating efficiency.


Introduction
Human perception of the external world occurs mainly through the sense organs of sight, touch, hearing, and smell, of which about 80% of information is obtained through vision. It is important for the next generation of intelligent computers to be equipped with visual functions so that they can automatically recognize and analyze the activities of people in the surrounding environment [1][2][3].
At present, pose and action recognition is widely used in many fields such as advanced human-computer interaction, intelligent monitoring systems, motion analysis, and medical rehabilitation [4][5][6]. Pose recognition is a challenging research topic in motion analysis. The core goal is to infer the posture parameters of the various parts of the human body from the image sequence, such as the actual position in three-dimensional space or the angles between the joints. Human body motion can then be reconstructed in three-dimensional space through the posture parameters mentioned above. At present, human pose recognition algorithms based on machine vision are mainly divided into two categories: one is based on traditional RGB images and the other is based on depth images. The biggest difference between them is that pixels in an RGB image record the color information of the object, while pixels in a depth image record the distance between the object and the camera. Human pose recognition based on RGB images mainly utilizes the apparent features of the image, such as HOG (histogram of oriented gradients) features [7] and contour features [8]. However, these methods are usually affected by the external environment and are particularly vulnerable to lighting, resulting in low detection accuracy. In addition, due to the large differences in the size of the human body, these algorithms are only suitable for limited environments and people. In recent years, with the development of depth sensors, especially the Kinect developed by Microsoft, which provides color and depth information (RGB-D), the recognition rate of human pose has greatly improved compared with ordinary sensors [9][10][11][12][13]. The main reason is that depth images have many advantages over RGB images.
First, depth images are robust to changes in color and illumination. Also, a depth image, being 3D, carries more information than an RGB image. Human pose recognition methods can be divided into two categories: model-based methods and feature learning. In model-based human pose detection, the human body is divided into multiple components that are combined into a model, and the human pose is then estimated by inverse kinematics or by solving optimization problems. Pishchulin et al. proposed a new articulated posture model based on image morphology [14]. Sun and Savarese proposed the APM (articulated part-based model) based on joint detection [15], and Sharma et al. proposed the EPM (expanded parts model) based on a collection of body parts [16]. Siddiqui and Medioni used a Markov chain Monte Carlo (MCMC) framework with head, hand, and forearm detectors to estimate the body pose [17].
Feature learning tries to obtain high-level features from depth images by analyzing each pixel and uses various machine learning algorithms to realize human pose recognition [12,[18][19][20][21][22][23]. Shotton et al. proposed two different methods for estimating human body poses [18]. One method uses a random forest to classify each pixel in the depth image; the other predicts the positions of the human joints. Both methods are based on random forest classifiers trained on a large number of synthetic and real human depth images. Hernández-Vela et al. proposed a graph-cut optimization based on Shotton's method [24]. Kim et al. proposed another human pose estimation method based on an SVM (support vector machine) and superpixels [25]. In addition, deep learning algorithms have also been used to solve target pose estimation [26][27][28], and convolutional neural networks (CNNs) are used for large-scale data set processing [29][30][31][32].
In general, the advantage of model-based human pose recognition is that there is no need to build a large data set; only some models must be established. It achieves a high recognition rate when the pose matches the model. However, this method also has disadvantages. For example, it is difficult to construct complex human body models, mainly because of the diversity of human postures in actual situations.
The main merit of feature learning is that it does not need a complex human body model, so it is not restricted to a model and can be applied in various situations. However, this method also has disadvantages. On the one hand, it has to build a huge data set to cover different environments. On the other hand, many feature extraction methods have high complexity and cannot meet real-time requirements. Therefore, a human pose recognition method based on depth image multifeature fusion is proposed in this paper. First, the body parts in the depth images are encoded with numbers and a data set is constructed. Afterwards, the LDoD and DGoD features are extracted and used to train a random forest classifier. Finally, gravity centers are calculated and candidate joints are screened out. LDoD and DGoD have lower computational complexity than other algorithms, so they satisfy the real-time requirement. Moreover, the recognition rate of human pose is improved by combining LDoD and DGoD.
The rest of this paper is organized as follows: Section 2 introduces the algorithm flow of depth image multifeature fusion for recognizing human pose. Section 3 details each step of pose recognition and the related algorithms. Section 4 constructs a random forest classifier. Section 5 describes the positioning of the joints in the human body image. Section 6 analyzes the experimental results. Section 7 is the conclusion.

Algorithm Overview
The flowchart of human pose recognition based on depth image multifeature fusion is shown in Figure 1. First, the original depth image is segmented to extract the human target so that the different parts of the segmented body can easily be tagged with specific codes. Then, LDoD and DGoD features are extracted to train multiple decision trees and obtain a random forest classifier. The classifier is used to classify the body parts of the test samples. Finally, the positions of the joints in the human body image are calculated.

Human Pose Recognition
3.1. Depth Image Segmentation. In image processing, we often focus on special areas called ROIs (regions of interest) [33][34][35]. Usually, these areas carry rich feature information. Therefore, in order to identify and analyze the target, the area where the target is located needs to be separated from the background. On this basis, feature extraction and human body recognition can be performed.

Complexity
Since the actual scene is fixed in this paper, depth background difference is used to segment the human body. The depth value is quantized to a grayscale space of 0-255, such that a small gray value corresponds to a large depth. In this way, the 3D image can be displayed as a 2D gray image, where the pixel values carry a different meaning from those of a conventional RGB image.
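The quantization step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the inverted mapping (nearer points appear brighter, so a small gray value corresponds to a large depth) follows the text, and the clipping to the valid depth range is an assumption:

```python
import numpy as np

def depth_to_gray(depth, d_near, d_far):
    """Quantize raw depth values to a 0-255 grayscale image.

    Values are inverted so that a small gray level corresponds to a
    large depth, as described in the text. Depths outside
    [d_near, d_far] are clipped to the range boundaries.
    """
    depth = np.clip(depth.astype(np.float64), d_near, d_far)
    # Normalize to [0, 1], then invert and scale to [0, 255].
    norm = (depth - d_near) / (d_far - d_near)
    return np.round((1.0 - norm) * 255).astype(np.uint8)
```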
Because the camera shoots downward from above the head, the leg information of the human body is not considered. The depth range is restricted to [D_near, D_far]. First, Gaussian filtering is performed on the original depth data to filter out noise and suppress the drift of the depth data. Then, the original depth image is subtracted from the background image, and the foreground target is extracted according to the threshold T, shown as follows:

T(x, y) = 1, if |B(x, y) − I(x, y)| > T; 0, otherwise, (1)

where B(x, y) is the background image, I(x, y) is the original image, and T(x, y) is the binary image. Then, the depth image of the corresponding area is extracted, shown as follows:

S(x, y) = I(x, y) · T(x, y), (2)

where S(x, y) is the effective depth area and S(x, y) ⊆ [D_near, D_far].
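The background-difference segmentation above can be sketched as follows. This is a minimal sketch under the assumption that formulas (1) and (2) take the usual form for depth background subtraction (a thresholded absolute difference followed by masking); it omits the Gaussian prefiltering step:

```python
import numpy as np

def segment_foreground(image, background, thresh):
    """Depth background difference: binary mask marking pixels where
    the current depth frame deviates from the static background image
    by more than `thresh` (formula (1))."""
    diff = np.abs(image.astype(np.int32) - background.astype(np.int32))
    return (diff > thresh).astype(np.uint8)

def extract_depth_region(image, mask):
    """Keep depth values only inside the foreground mask (formula (2))."""
    return image * mask
```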

3.2. Tagging Body Parts.
Since there is no standard human pose depth image library, we build a data set including common human actions such as running, jumping, lifting, bending, knee flexion, and interaction. The random forest learning algorithm is a supervised learning method; the data samples belong to known categories, and these samples need to be tagged [36][37][38][39]. The tagging method divides the human body into 11 parts, with the rest treated as background; the approximate position of each part of the human body in the depth image is observed, and that position is then tagged with the corresponding color. As shown in Figure 2, the valid points inside the rectangle of the head area are all marked in red. The tagging result is shown in Table 1. This paper divides the human body above the waist into the head, the left shoulder, the right shoulder, the left upper arm, the right upper arm, the left lower arm, the right lower arm, the left hand, the right hand, the left body, the right body, and the background.

3.3. LDoD Feature Extraction.
According to the manually tagged depth image of the human body, the features of the 12 parts need to be extracted. This paper uses the local difference feature as the feature representation of a pixel, which reflects the neighborhood information of the pixel. It uses the differences between pairs of the eight neighborhood points to represent the characteristics of the pixel. The locations of the eight neighborhood pixels are shown in Figure 3.
The LDoD feature can be represented as

T_{i,j}(S, p) = d_S(p_i) − d_S(p_j), (3)

where i, j ∈ {1, 2, 3, 4, 5, 6, 7, 8}, i ≠ j, and d_S(p_i) is the depth value of p_i. Letting θ = (θ_1, θ_2, θ_3, ..., θ_28), T_{θ_k}(S, p) stands for T_{i,j}(S, p) with k ∈ {1, 2, 3, ..., 28}. The feature vector of a point can then be expressed as

F(p) = [T_{θ_1}(S, p), T_{θ_2}(S, p), ..., T_{θ_28}(S, p)]. (4)

According to the LDoD feature, the features of pixels of the same type are mostly similar, while the features of pixels of different types differ greatly. Therefore, this feature discriminates well between the various parts of the human body. Figure 4(a) shows the divided depth image, and Figure 4(b) is an enlarged image of the left lower arm. As can be seen from the figures, pixels p_6 and p_7 are in the body area, while pixel p_4 is out of the body area and its value is 0. Therefore, the value of T_{6,7}(S, p) is smaller and T_{4,7}(S, p) is larger. Figure 4(c) is an enlarged image of the right lower arm, where the value of T_{6,7}(S, p) is larger and the value of T_{4,7}(S, p) is smaller. Therefore, these two values can distinguish the left and right lower arms of the human body.
The computational complexity of this feature is very low: formula (3) only uses subtraction between values. In addition, it is invariant to spatial translation and rotation and can accommodate changes in people's postures.
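The LDoD computation can be sketched as follows: for each pixel, take the depth values of its eight neighbors and form the 28 pairwise differences, giving the feature vector of formulas (3) and (4). Note that the ordering of p_1 through p_8 in Figure 3 is not reproduced in this text, so the neighbor offsets below are an assumed 8-connected ordering:

```python
import itertools
import numpy as np

# Assumed 8-neighborhood offsets (row, col); the exact ordering of
# p1..p8 used in Figure 3 is not reproduced here.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def ldod(depth, r, c):
    """28-dimensional LDoD vector at pixel (r, c): the differences
    d_S(p_i) - d_S(p_j) over all pairs i < j of the 8 neighbors."""
    d = [int(depth[r + dr, c + dc]) for dr, dc in OFFSETS]
    return [d[i] - d[j] for i, j in itertools.combinations(range(8), 2)]
```

The 28-dimensional length comes from the number of unordered pairs of eight neighbors, C(8, 2) = 28, matching the θ_1 ... θ_28 parameterization in the text.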
3.4. DGoD Feature Extraction. Because the depth information represents the distance between the object and the depth camera, the angle between the plane in which a pixel lies and the plane of the depth camera can be obtained simply by calculating the arctangent of the depth gradient; this is the DGoD feature, which can be calculated as

G_{θ_i}(S, x, y) = arctan(∇_{θ_i} S(x, y)), i = 1, 2, 3, (5)

where ∇_{θ_i} denotes the depth gradient along the direction θ_i. Three DGoD features are selected, represented as G_{θ_1}(S, x, y), G_{θ_2}(S, x, y), and G_{θ_3}(S, x, y). The range of the directional gradients is [0°, 360°]. When the pixel points lie on the same plane, the directional gradients are equal.
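A gradient-direction computation of this kind can be sketched as follows. Since the three specific directions θ_1 to θ_3 are not specified in the text, this sketch computes a single gradient direction from central differences and maps it to [0°, 360°); it is an illustration of the arctangent-of-gradient idea, not the paper's exact DGoD formula:

```python
import math

def dgod(depth, r, c):
    """Directional gradient of depth at (r, c): the arctangent of the
    central-difference depth gradient, mapped to [0, 360) degrees."""
    gx = float(depth[r, c + 1]) - float(depth[r, c - 1])
    gy = float(depth[r + 1, c]) - float(depth[r - 1, c])
    angle = math.degrees(math.atan2(gy, gx))
    return angle % 360.0
```

On a planar depth region the gradient is constant, so neighboring pixels yield equal angles, consistent with the observation in the text.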

The Design of the Random Forest Classifier
4.1. Random Forest Model Construction. The decision tree is one of the most widely used inductive inference algorithms at present. Its generated rules are simple and easy to understand. Pixels of depth images can be classified quickly and efficiently by a decision tree, so it is widely used in target detection and recognition. However, a single decision tree easily overfits, causing wrong classifications. A random forest is composed of multiple decision trees [40,41], each trained with different data sets and parameters, which not only reduces overfitting but also improves classification accuracy, because the output is voted on by multiple decision trees.
The classification performance of a random forest classifier is affected by many factors, including the size of the training set D, the dimension of the sample feature vector N, the number of decision trees K, the maximum depth of each tree d, the number of candidate features per split n, and the termination condition for the growth of each tree.
In the previous sections, the human body was divided into 12 different parts, and the LDoD and DGoD features were extracted as the input of the random forest classifier. All of this preliminary work prepares for the design of the classifier model. The set of attributes can be represented as

S = {T_{θ_1}(I, p), T_{θ_2}(I, p), ..., T_{θ_28}(I, p), DGoD(S(x, y))}. (6)

The ID3 decision tree algorithm is used to train each decision tree in the random forest. The training sample set can be expressed as

D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, y_i ∈ {1, 2, ..., 12}, (7)

where {1, 2, ..., 12} is the set of categories to which a pixel can belong, that is, the 12 parts of the human body. The set of parameters can be expressed as

φ = (θ, τ_1, τ_2), (8)

where θ is the attribute parameter and τ_1 and τ_2 are the thresholds.
The flow chart of the construction of a single decision tree is shown in Figure 6. First, sampling with replacement is used to draw a training set D_i of the same size as D from D, yielding K subsets. Then, a tree node is created; if the termination condition is reached, the process stops and the current node is set as a leaf node. Otherwise, n features are drawn from the N-dimensional feature set by fixed-scale sampling without replacement. The splitting feature is determined according to the metric of the feature attribute, and the current node is split into the left subset D_l(φ) and the right subset D_r(φ):

D_l(φ) = {p | τ_1 < T_θ(S, p) < τ_2}, D_r(φ) = D \ D_l(φ). (9)

Information gain is used to select the partitioning attribute of the decision tree, which can be calculated as follows:

Gain(D, φ) = Ent(D) − Σ_{v ∈ {l, r}} (|D_v(φ)| / |D|) Ent(D_v(φ)), (10)

where Ent(D) is the information entropy.
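The split-selection criterion of formula (10) can be sketched as follows. This is a simplified sketch: classical ID3 splits on categorical attribute values, whereas the depth features here are continuous, so this version (like the paper's thresholds τ_1, τ_2) uses a binary threshold split; the single-threshold form is an assumption for brevity:

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy Ent(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(samples, labels, feature, threshold):
    """Gain of splitting (samples, labels) into a left subset
    (feature value <= threshold) and a right subset (> threshold),
    as in formula (10): Ent(D) - sum(|D_v|/|D| * Ent(D_v))."""
    left = [y for x, y in zip(samples, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(samples, labels) if x[feature] > threshold]
    n = len(labels)
    cond = sum(len(s) / n * entropy(s) for s in (left, right) if s)
    return entropy(labels) - cond
```

At each node, the (feature, threshold) pair maximizing this gain is chosen; growth stops when the termination condition (e.g., a minimum leaf size) is met.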

4.2. Random Forest Two-Way Voting. In the traditional random forest classification [42][43][44], the sample is judged and voted on by every decision tree, and every tree has an equal decision right. In this paper, a random forest two-way voting mechanism with unequal decision rights is adopted. The data set is divided into in-bag and out-of-bag data: a data subset used to build the random forest is called in-bag data; otherwise, it is called out-of-bag data. The decision right of a tree is obtained from its results on the out-of-bag data: each time a tree classifies an out-of-bag sample correctly, the tree receives a vote, and a decision tree with more votes receives a higher weight. The basic steps of two-way voting are as follows.
Step 1. Create K decision trees, generating in-bag and out-of-bag data for every tree.
Step 2. Perform a performance evaluation: each tree is evaluated on a certain amount of out-of-bag data, and whenever the decision tree's classification result is correct, the tree receives a vote.
Step 3. Assign each decision tree's total number of votes as its weight, and normalize the weights of all decision trees.
Step 4. Input the test sample to the trained random forest model; each tree's classification result is multiplied by its weight to obtain the final classification result, shown as follows:

R(x) = Σ_{i=1}^{K} T_i · r_i(x), (11)

where R(x) is the final classification result, T_i is the weight coefficient corresponding to the i-th decision tree with T_1 + T_2 + T_3 + ⋯ + T_K = 1, and r_i(x) is the classification result of the i-th decision tree.
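The four steps above can be sketched as follows. This is a minimal sketch of the weighting and voting logic only (tree training is omitted); the argmax-over-weighted-class-scores reading of formula (11) is an assumption for discrete class labels:

```python
from collections import defaultdict

def tree_weights(oob_correct_counts):
    """Normalize each tree's out-of-bag vote count into a weight
    so that T_1 + ... + T_K = 1 (Step 3)."""
    total = sum(oob_correct_counts)
    return [c / total for c in oob_correct_counts]

def weighted_vote(predictions, weights):
    """Final classification (Step 4): the class whose supporting
    trees have the largest total weight."""
    score = defaultdict(float)
    for label, w in zip(predictions, weights):
        score[label] += w
    return max(score, key=score.get)
```

A tree that classified more out-of-bag samples correctly thus contributes more to the final decision than one that performed poorly, which is the point of the two-way mechanism.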

Human Joint Positioning
Determining the human body joints is the final step in human pose recognition [45]. The previous sections used the random forest classifier to classify the 12 parts in the human body image, but the joint positions have not yet been determined. In this paper, the joints are determined by calculating the gravity centers of the 12 body parts.
For a depth image of size M × N, the (p + q)-order moment m_pq and the central moment μ_pq at pixel I(x, y) can be calculated by formulas (12) and (13), respectively:

m_pq = Σ_x Σ_y x^p y^q · I(x, y), (12)

μ_pq = Σ_x Σ_y (x − x_c)^p (y − y_c)^q · I(x, y), (13)

where (x_c, y_c) is the gravity center, which can be calculated by formulas (14) and (15):

x_c = m_10 / m_00, (14)

y_c = m_01 / m_00. (15)

The gravity center of the upper arm and the gravity center of the lower arm are calculated to obtain the joint of the left elbow or the right elbow, as shown in formula (16):

(x_e, y_e) = ((x_c^up + x_c^low) / 2 + Δx, (y_c^up + y_c^low) / 2 + Δy), (16)

where the area of the upper arm has size M_up × N_up, the area of the lower arm has size M_low × N_low, and Δx and Δy are the offsets.
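The centroid computation of formulas (12), (14), and (15) can be sketched as follows for a binary part mask. The elbow helper implements a midpoint-plus-offset rule; since the exact form of formula (16) did not survive extraction, that midpoint form is an assumption:

```python
import numpy as np

def gravity_center(mask):
    """Gravity center (x_c, y_c) of a binary body-part mask via the
    image moments: x_c = m10/m00, y_c = m01/m00."""
    ys, xs = np.nonzero(mask)
    m00 = len(xs)  # zeroth moment = number of part pixels
    return xs.sum() / m00, ys.sum() / m00

def elbow_joint(center_up, center_low, dx=0.0, dy=0.0):
    """Approximate elbow position from the upper- and lower-arm
    gravity centers as their offset midpoint (assumed form of (16))."""
    (xu, yu), (xl, yl) = center_up, center_low
    return (xu + xl) / 2 + dx, (yu + yl) / 2 + dy
```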

Experimental Results and Analysis
In this paper, 1000 depth images are used for training the classifier model and 100 images are used for testing, covering the poses of 10 different people. The algorithm is programmed in C++ and compiled in Visual Studio 2013. The test computer uses an Intel Core i5-4570 processor clocked at 3.20 GHz. A ToF (time-of-flight) depth camera with a resolution of 320 × 240 is used in this paper.
6.1. Qualitative Analysis. The results of human body part recognition and joint positioning for 6 postures are shown in Figure 7. The first column shows the segmented depth images, the second column the outputs of the random forest classifier, the third column the gravity center of each part, and the last column the skeletons composed of the joints. As can be seen from Figure 7, the random forest classifier correctly classifies most of the pixels in the human body image, such as the body and the head. Incorrect classifications mostly occur at the intersection of two parts. Fortunately, the joints are almost always positioned accurately, and a reasonable human skeleton can be obtained. Finally, in the sixth picture, one of the hands blocks the body; according to the positioning result, the approach based on the fusion of DGoD and LDoD features proposed in this paper can handle such occlusion and self-occlusion.

From Figure 8, we can see that, with the other parameters fixed, both the training time and the classification accuracy show an increasing trend as the number of decision trees K increases. When K is 20, the classification accuracy on the test samples reaches 77.2% and the training time is 100 s. When K is 25, the classification accuracy increases by only 1%, but the required training time increases to 140 s. Therefore, the optimal K in this paper is 20.
As shown in Figure 9, when the other parameters are fixed, the greater the depth of the tree d, the higher the accuracy. When d reaches 30, the accuracy attains its maximum and then remains almost constant as the depth increases further. So the optimal depth is 30.
The minimum number of samples in a leaf node can be used as the termination condition for the growth of the decision tree. When it is too large, the growth of the tree stops prematurely, which affects the classification accuracy. When it is too small, the structure of the tree becomes more complicated and consumes too much time. In Figure 10, with the other parameters fixed, the classification accuracy on the test samples reaches 78.4% when N_node = 40. When N_node = 80, the test classification accuracy drops to 77.6%. So, in this paper, N_node = 40.

6.2.2. Comparison of the Recognition Rates of Various Algorithms.
This paper compares the recognition rate of each part using the single feature LDoD with the recognition rate using the combination of the DGoD and LDoD features, as shown in Figure 11. The recognition rate of the random forest algorithm with multifeature fusion is obviously improved, reaching about 80%. Among the 12 parts, the recognition rates of the left and right arms are lower, mainly because of the complex movements of the upper limbs. In addition, as the number of collected samples increases, the recognition rate will increase.
The traditional voting mechanism of random forests and the two-way voting mechanism are compared in this paper, as shown in Figure 12. It can be seen from the figure that the classification accuracy of the random forest two-way voting mechanism is significantly higher than that of the traditional one-way voting mechanism.
Finally, we also compare our algorithm with popular algorithms from the literature, as shown in Table 2; our classification method is superior to those of Shotton and Kim. In addition, the computation time is about 54.9% of that of Shotton's algorithm. Therefore, the proposed method is better suited to occasions demanding high real-time performance and high recognition rates.

Conclusion
In this work, we propose a human pose recognition algorithm based on the fusion of LDoD and DGoD features. For human pose recognition, we first establish our own sample data set of depth images tagged with specific part codes. Then, we extract the LDoD and DGoD features from the samples. These two features are simple to calculate, so the computation is greatly reduced. Next, the two features are used to train the random forest classifier. In order to improve the classification accuracy, a random forest two-way voting mechanism is used to detect and classify the different parts of the human body. Finally, according to the classification results, the gravity centers of the different body parts are calculated so that accurate joints and a skeleton can be obtained.
The experimental results show that the random forest classifier has high classification accuracy and robustness. In addition, our method has a low computation cost compared with the other methods and meets real-time requirements. However, no method is perfect in terms of human

Figure 1: Flow chart of human pose recognition based on depth image multifeature fusion.

Figure 3: Neighborhood distribution of a pixel.

Figure 4: (a) Depth image of human body. (b) LDoD characteristic component of the left lower arm. (c) LDoD characteristic component of the right lower arm.

Figure 6: Flow chart of the construction of a single decision tree.

Figure 7: Results of human body part recognition and joint point positioning in different postures.

6.2. Quantitative Analysis

6.2.1. Comparison of Experimental Results for Different Classifier Parameters. When constructing a random forest model, the number of decision trees K, the maximum tree depth d, and the minimum number of samples in the leaf nodes N_node affect the classifier performance. The experiments first determine the optimal classifier parameters by training on five sample images. Figures 8-10 compare the results for these parameters.

Figure 8: Influence of the number of decision trees on the experimental results.

Figure 9: Influence of decision tree depth on experimental results.

Figure 10: Influence of the minimum number of samples in the leaf nodes on the experimental results.

Figure 11: Recognition rates of different parts under different algorithms.

Table 1: The tagging result.