In this paper, a fusion method based on multiple features and a hidden Markov model (HMM) is proposed for recognizing dynamic hand gestures corresponding to an operator’s instructions in robot teleoperation. First, a valid dynamic hand gesture is separated from the continuously acquired data according to the velocity of the moving hand. Second, a feature set is introduced for dynamic hand gesture expression, which includes four sorts of features: palm posture, bending angle of the fingers, opening angle of the fingers, and gesture trajectory. Finally, HMM classifiers based on these features are built, and a weighted calculation model fusing the probabilities of the four sorts of features is presented. The proposed method is evaluated on dynamic hand gestures acquired by leap motion (LM), reaching recognition rates of about 90.63% on the LM-Gesture3D dataset created for this paper and 93.3% on the letter-gesture dataset, respectively.
National Natural Science Foundation of China (61672396)

1. Introduction
Dynamic hand gesture recognition has been an intriguing problem in recent years; if efficiently solved, hand gestures could become one of the richest means of communication available. For this reason, many scholars around the world have carried out extensive theoretical and practical research [1]. Compared with static gestures, dynamic gestures convey more abundant meaning and offer a more common and natural way to interact. At the same time, however, the information of a dynamic hand gesture, such as shape and location, varies with time, which increases the difficulty of recognition.
At present, there are two main types of sensors capable of sensing hand gestures: wearable sensors and vision-based sensors [2, 3]. The former can capture the movement of hands and fingers and extract rich hand information, but at the expense of convenience and cost: it places an additional burden on users and can feel unnatural when performing hand gestures. A vision-based sensor, by contrast, is less cumbersome and supports more natural interaction because it requires no physical contact with users. However, its computational complexity is quite high for hand detection, tracking, and feature extraction [4]. For instance, a hand must be separated from the background before the final recognition, a step that can be significantly affected by external environmental factors such as ambient light. Moreover, because of the complex 3D movements of hands and fingers, it is difficult to properly infer the performed hand pose from information extracted from 2D images [5]. Besides, when the palm surface is not parallel to the camera, for example, recognition becomes even harder.
Classification is a crucial step in recognizing hand gestures. Five main classification methods for hand gestures based on 3D vision can be identified: support vector machines (SVMs), artificial neural networks (ANNs), template matching (TM), HMMs, and dynamic time warping (DTW) [4]. The SVM is a popular classifier for hand gesture recognition, in which support vectors are used to determine the hyperplane that maximally separates the hand gesture classes [6]. In vision-based hand gesture recognition systems, the ANN is used as a classifier to handle only fundamental and limited hand gestures [7]. When high-level discriminative 3D hand features are available, TM is an excellent choice for recognizing hand gestures and works quite well with contour- or boundary-based hand features [8]. As a hand gesture is a continuous pattern over time, the HMM is found to be the most suitable pattern recognition tool for testing on a moderately large dataset [9]. DTW is an indirect continuous hand gesture recognition approach that automatically aligns sequences of different lengths and returns the proper distance [10].
Martin Sagayam and Jude Hemanth [11] develop a probabilistic model based on the state sequence analysis in the HMM to recognize hand gestures taken from the Cambridge hand dataset. The experimental results show that the proposed method achieves a 0.98% reduction in error rate and a 1.55% improvement in the recognition rate over that of the Viterbi prediction. Some work combines HMM with other methods for gesture recognition. Zhou et al. [12] use HMM to model the different information sequences of dynamic hand gestures and use BP neural network (BPNN) as a classifier to process the resulting hand gestures modeled by HMM, which achieves a satisfactory real-time performance and an accuracy above 84%. Martin Sagayam and Jude Hemanth [13] propose a hybrid 1D HMM model with artificial bee colony (ABC) optimization. The method is carried out with nine different classes of hand gestures that are used for virtual reality applications. The experimental results show that the average value of the recognition rate with ABC optimization increases by 2.72%, and the average value of the error rate is decreased by 0.47%.
With the emergence and development of deep learning technology, some scholars have applied it to hand gesture recognition. Oyedotun and Khashman [14] apply a convolutional neural network (CNN) and a stacked denoising autoencoder (SDAE) to recognize 24 American Sign Language (ASL) hand gestures obtained from a public database, achieving recognition rates of 91.33% and 92.83%, respectively. Bao et al. [15] propose a deep CNN that can classify hand gestures from the whole image without any segmentation or detection stage. The method can classify seven sorts of hand gestures in a user-independent manner and achieves an accuracy of 97.1% on a dataset with simple backgrounds and 85.3% on a dataset with complex backgrounds.
In recent years, 3D sensors, such as binocular cameras, Kinect, and LM, have been applied to hand gesture recognition with excellent performance. LM can detect and track hands and fingers with an accuracy of about 0.01 mm and feed back gesture information in real time at a sampling rate of 120 fps [16]. Because of this superior performance, many researchers consider it a promising 3D sensor that is particularly suitable for hand gesture recognition. For instance, Chen et al. [17] extract directional codes of the 3D motion trajectory as the feature and exploit an SVM-based classifier to classify letter and number gestures. Ameur et al. [18] extract the positions of the fingertips and palm center as features, which are then trained with an SVM classifier; their method reaches an average recognition rate of about 81% on 11 kinds of dynamic gestures. Xu et al. [19] and Zeng et al. [20] have conducted similar studies. Besides, some researchers are working on dynamic gesture recognition. Lu et al. [21] build two kinds of features and feed them into a hidden conditional neural field classifier to recognize dynamic gestures. Avola et al. [22] propose long short-term memory (LSTM) recurrent neural networks (RNNs) combined with an effective set of discriminative features based on both joint angles and fingertip positions to recognize sign language and semaphoric hand gestures, achieving an accuracy of over 96%. Vamsikrishna et al. [9] propose a low-cost computer-vision-assisted setup based on LM to detect precise movements of the palm or fingers within the field of view of the sensor, and then present a set of discrete HMMs for classifying the gesture sequences performed during rehabilitation.
This paper is aimed at recognizing the hand gestures corresponding to an operator’s hand commands in robot teleoperation. For this problem, the paper develops four feature vectors and their extraction models, based on the 3D information acquired by LM, to describe the hand gestures. It then establishes HMMs to calculate the occurrence probabilities of the four feature sequences of an unknown hand gesture, respectively. Lastly, the paper uses a weighted algorithm to fuse the occurrence probabilities of the four features; the gesture with the largest fused probability is taken as the recognition result. The rest of the paper is organized as follows. Preliminary work on hand gesture recognition is introduced in Section 2. The methods of feature extraction are presented in Section 3, including valid dynamic gesture judgment, feature definition, and feature sequence clustering. The HMM training model and hand gesture recognition by fusing the feature probabilities are proposed in Section 4. Section 5 comprises the experiments, results, and discussion. Conclusions and possible future extensions are given in Section 6.
2. Preliminary Work on Gesture Recognition

2.1. Leap Motion and Data Acquisition
LM mainly consists of three infrared LEDs and two infrared cameras, which take photos from different directions to obtain gesture information in 3D space [16]. LM has a field of view of about 150 degrees and an effective range of approximately 0.03 to 0.6 meters above the device. LM feeds back data frames that consist of the positions and velocities of key points, rotation information, and a frame timestamp.
When collecting gestures, LM establishes a right-hand coordinate system, as shown in Figure 1, for all obtained data such as the position, speed, and posture of the human hand. As shown in Figure 1, the five fingertips are denoted by $F_i (i = 1, \dots, 5)$, and the palm center is denoted by C. We mainly focus on the following data: (1) the palm normal vector $\vec{n}$ and the palm direction vector $\vec{h}$, which are unit vectors perpendicular to the palm plane and pointing from the palm position toward the fingers, respectively; (2) the finger direction vector $\vec{f}_i$ and the visible finger length $d_i$, which are the unit vector pointing toward the fingertip $F_i$ and the distance between the two points, respectively; (3) the instantaneous velocities $v_i$ of the five fingertips and the instantaneous velocity $v_C$ of the palm center; and (4) the coordinate $p_t(x_t, y_t, z_t)$ of the palm position in frame t.
Data acquisition from leap motion.
2.2. Dynamic Gesture Definition
There are relatively few publicly available hand gesture datasets created from LM-sampled data, especially for dynamic hand gestures in robot teleoperation. We analyze the movement characteristics of the operator’s hand commands in robot teleoperation, such as translation and rotation in three degrees of freedom, and create a gesture dataset named LM-Gesture3D, which contains eight different dynamic gestures, as shown in Table 1. All these gestures, collected by LM, represent practical operations or command signs and can be performed easily and naturally. Besides, there are similarities among the gestures in some respects, which will be discussed in more detail later.
Despite its many merits, LM mainly acts as a gesture data collector, similar to a wearable device or camera. Hence, conditions for judging the beginning and end of a valid dynamic gesture must be given first. Taking LM-Gesture3D as an example, the fingertips and palm center inevitably produce rapid and continuous displacement when any gesture is performed; even the simplest dynamic gesture, a click, for example, is no exception. Based on the above analysis, a simple discriminant is established as follows:

$$v = \max\{v_C, v_i \mid i = 1, \dots, 5\} > v_\tau, \tag{1}$$

where $v_C$ and $v_i$ are the instantaneous velocities of the palm center and fingertips, respectively, and $v_\tau$ is a predefined velocity threshold.
When $v_C$ and $v_i$ satisfy discriminant (1) for at least 60 continuous frames, those data frames are regarded as the original data of a valid dynamic gesture.
As LM is quite sensitive, discriminant (1) can be satisfied for a few consecutive frames both when the hand shakes slightly at rest and when the obtained data contain noise. The minimum frame count (i.e., 60 frames) is therefore set to eliminate such useless data. In addition, a dynamic gesture performed at low speed is judged invalid by discriminant (1), which leaves a degree of freedom for ordinary hand movement.
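The segmentation rule above can be sketched as follows. The dict-based frame representation and the threshold value are assumptions for illustration, since the paper does not specify $v_\tau$:

```python
# Sketch of the valid-gesture segmentation rule of discriminant (1).
# The frame representation and V_TAU are assumed for illustration.
V_TAU = 80.0       # assumed velocity threshold (mm/s); not given in the paper
MIN_FRAMES = 60    # consecutive frames required for a valid gesture

def segment_valid_gestures(frames):
    """Split a frame stream into valid dynamic gestures.

    Each frame holds 'v_palm' (speed of the palm centre) and 'v_tips'
    (speeds of the five fingertips).  A frame is "moving" when
    max(v_C, v_1..v_5) > v_tau; a run of at least 60 moving frames is
    accepted as one valid gesture, shorter runs are discarded as noise.
    """
    gestures, current = [], []
    for frame in frames:
        if max(frame['v_palm'], *frame['v_tips']) > V_TAU:
            current.append(frame)
        else:
            if len(current) >= MIN_FRAMES:   # movement stopped: flush the run
                gestures.append(current)
            current = []
    if len(current) >= MIN_FRAMES:
        gestures.append(current)
    return gestures
```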
3.2. Feature Definition
To effectively recognize dynamic gestures, changes in hand posture and hand position are analyzed separately. The former can be further divided into the bending angle of the fingers, the opening angle between the fingers, and the palm posture; the latter can be represented by the gesture trajectory. Therefore, the paper describes the changes in gestures through these four features.
The specific extraction process and expression of the four features are as follows.
3.2.1. Palm Attitude Feature
If the palm shape changes little during a dynamic gesture, the change in palm posture can be regarded as the attitude angle calculation of a rigid body. The paper draws on the 3D attitude measurement method presented in [23].
As shown in Figure 1, the palm posture in 3D space at any time can be uniquely determined by the palm normal vector $\vec{n}$ and the palm direction vector $\vec{h}$. Let $\vec{m} = \vec{h} \times \vec{n}$; then a new coordinate system $(\vec{h}_t, \vec{n}_t, \vec{m}_t)$ can be obtained to represent the palm posture in frame t. We take the initial data frame of the dynamic gesture as the fixed coordinate system and denote it as $(\vec{h}_1, \vec{n}_1, \vec{m}_1)$. The change in palm posture between the current frame and the first frame can then be represented by three Euler angles:

$$\psi_t = \arctan\frac{-\vec{n}_1 \cdot \vec{h}_t}{\vec{n}_1 \cdot \vec{n}_t}, \quad \phi_t = \arcsin\left(\vec{n}_1 \cdot \vec{m}_t\right), \quad \theta_t = \arctan\frac{-\vec{h}_1 \cdot \vec{m}_t}{\vec{m}_1 \cdot \vec{m}_t}. \tag{2}$$
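Equation (2) can be computed directly from the frame vectors. A minimal sketch follows; the function name and the use of `arctan2` (which keeps the correct quadrant, where the paper writes plain `arctan`) are our choices:

```python
import numpy as np

def palm_euler_angles(h1, n1, m1, ht, nt, mt):
    """Euler angles of equation (2): palm posture in frame t relative to
    the first frame.  Inputs are the unit vectors h, n and m = h x n of
    the first frame and of frame t."""
    psi = np.arctan2(-np.dot(n1, ht), np.dot(n1, nt))
    phi = np.arcsin(np.clip(np.dot(n1, mt), -1.0, 1.0))  # clip guards rounding
    theta = np.arctan2(-np.dot(h1, mt), np.dot(m1, mt))
    return psi, phi, theta
```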
3.2.2. Bending Angle of Fingers
As we mainly focus on the bending angle of a finger, the thickness of the finger can be disregarded, and each finger can then be simplified to a planar model, as shown in Figure 2. Based on these two models, Hong et al. [24] propose a method to estimate the hand’s attitude, namely, the bending angles of the fingers and the coordinates of the joint points. Their method requires merely the total length of the finger $l_i$, the visible length of the finger $d_i$, and several constraint constants. Building on their work, we define the finger bending angle as

$$\omega_i = 100 \times \frac{d_i}{l_i}, \quad i = 1, \dots, 5, \tag{3}$$

where $d_i$ can be obtained directly from LM and $l_i$ equals $d_i$ when the finger is straight.
Simplified planar model of fingers: (a) triangular model for the thumb; (b) quadrilateral model for the other four fingers. MCP, PIP, DIP, and TIP represent metacarpophalangeal point, proximal interphalangeal point, distal interphalangeal point, and fingertip, respectively.
In equation (3), $l_i$ is used for normalization in order to make the approach robust to hands of different sizes. A simple method is proposed to calibrate $l_i$ before data acquisition. The user keeps his/her palm plane parallel to LM with the fingers opened as straight as possible. When the data of at least 30 continuous frames satisfy (1) $n_y \le -0.94$ and (2) $\vec{f}_i \cdot \vec{n} < 0.008, i = 1, \dots, 5$, the obtained visible lengths of the five fingers are recorded as the total lengths, where $n_y$ is the component of the normal vector $\vec{n}$ along the Y-axis of the LM coordinate system.
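The normalized feature of equation (3) is then a one-liner over the five fingers; the function name is an assumption:

```python
import numpy as np

def finger_bending_angles(visible_lengths, total_lengths):
    """Finger bending feature of equation (3): omega_i = 100 * d_i / l_i.

    visible_lengths d_i come from LM in every frame; total_lengths l_i
    are calibrated once with the fingers held straight, so omega_i is
    100 for a straight finger and decreases as the finger bends.
    """
    d = np.asarray(visible_lengths, dtype=float)
    l = np.asarray(total_lengths, dtype=float)
    return 100.0 * d / l
```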
3.2.3. Opening Angle of Fingers
The other finger descriptor is the opening angle between fingers. As mentioned above, each finger can be modeled on a plane, so the problem of computing the angle between two fingers reduces to calculating the angle between two planes. Here, the plane spanned by $\vec{h}$ and $\vec{n}$ is taken as the benchmark plane, and $\vec{h} \times \vec{n}$ and $\vec{f}_i \times \vec{n}$ are the normal vectors of the benchmark plane and the finger planes, respectively. The opening angle can then be calculated as follows:

$$\gamma_i = \arccos\frac{\left(\vec{f}_i \times \vec{n}\right) \cdot \left(\vec{h} \times \vec{n}\right)}{\left\|\vec{f}_i \times \vec{n}\right\| \times \left\|\vec{h} \times \vec{n}\right\|}, \quad i = 1, \dots, 5. \tag{4}$$
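Equation (4) maps directly to cross and dot products; a minimal sketch (function name assumed):

```python
import numpy as np

def opening_angles(finger_dirs, h, n):
    """Opening angles of equation (4): the angle between each finger
    plane (normal f_i x n) and the benchmark palm plane (normal h x n)."""
    bench = np.cross(h, n)                    # normal of the benchmark plane
    angles = []
    for f in finger_dirs:
        fn = np.cross(f, n)                   # normal of the finger plane
        c = np.dot(fn, bench) / (np.linalg.norm(fn) * np.linalg.norm(bench))
        angles.append(np.arccos(np.clip(c, -1.0, 1.0)))  # clip guards rounding
    return np.array(angles)
```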
3.2.4. Trajectory Feature
Some dynamic gestures are usually accompanied by a specific and meaningful trajectory, such as circling with a finger (as in G5). So the paper considers the path of the dynamic gesture and extracts a simplified feature for gesture recognition. When LM works, it returns the space coordinates of the palm center with high accuracy and stability, so the moving trajectory of a hand can be expressed by a series of discrete points. The paper projects the gesture trajectory onto LM’s principal gesture plane, i.e., the XOZ plane. The detailed feature extraction process is as follows:
Let $(x_1, z_1), \dots, (x_T, z_T)$ be the discrete points of the 2D gesture trajectory; then the central point $p_o(x_o, z_o)$ of these points can be expressed as follows:

$$p_o(x_o, z_o) = \left(\frac{1}{T}\sum_{t=1}^{T} x_t, \; \frac{1}{T}\sum_{t=1}^{T} z_t\right). \tag{5}$$

Any point $p_t(x_t, z_t)$ forms a vector $\overrightarrow{p_o p_t}$ with the central point $p_o$ as the starting point. The norm of $\overrightarrow{p_o p_t}$ and the direction angle between $\overrightarrow{p_o p_t}$ and the X-axis can then be represented as follows:

$$d_t(p_t, p_o) = \sqrt{(x_t - x_o)^2 + (z_t - z_o)^2}, \quad \varphi_t(p_t, p_o) = \arctan\frac{z_t - z_o}{x_t - x_o}. \tag{6}$$
The norm $d_t$ of each vector is normalized by the maximum norm $d_{\max}$, giving $\delta_t$. Besides, the direction angle $\varphi_t$ of each vector is converted into a code $\psi_t$ according to the angular regions shown in Figure 3. $\delta_t$ and $\psi_t$ are computed as follows:

$$\delta_t = 20 \times \frac{d_t(p_t, p_o)}{d_{\max}}, \quad \psi_t = \left\lfloor \frac{\varphi_t(p_t, p_o)}{18°} \right\rfloor + 1. \tag{7}$$
Before coding the direction angle $\varphi_t$, we change the coordinate system from the original LM one to the coordinate system shown in Figure 3(a), whose $z'$ axis always points from the central point $p_o$ to the first point $p_1$. The obtained trajectory features $\delta_t$ and $\psi_t$ are therefore scale and rotation invariant with respect to the operation plane.
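Equations (5)-(7), including the rotation to the first-point-aligned frame, can be sketched as one routine; the function name and array layout are assumptions:

```python
import numpy as np

def trajectory_features(points):
    """Trajectory feature of equations (5)-(7).

    points: (T, 2) array of (x, z) palm positions projected onto the
    XOZ plane.  Returns the normalised distances delta_t (scaled to a
    maximum of 20) and the direction codes psi_t (18-degree sectors,
    codes 1..20), both measured from the centroid p_o.  The angle
    origin is aligned with the first trajectory point, which makes the
    codes rotation invariant.
    """
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)                  # p_o, equation (5)
    vecs = pts - center
    d = np.linalg.norm(vecs, axis=1)           # d_t, equation (6)
    delta = 20.0 * d / d.max()                 # delta_t, equation (7)
    ang = np.arctan2(vecs[:, 1], vecs[:, 0])   # phi_t, angle to the X-axis
    ang = (ang - ang[0]) % (2.0 * np.pi)       # rotate so p_1 lies at code 1
    psi = (ang / np.deg2rad(18.0)).astype(int) + 1
    return delta, psi
```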
Schematic diagram of normalization process: (a) norm and direction angle of the vectors; (b) angular regions in the XOZ plane.
Typical data are selected once for each gesture in the LM-Gesture3D dataset, and their feature diagrams are built, as shown in Figure 4. Each row in Figure 4 corresponds sequentially to one of the gestures in LM-Gesture3D; the four diagrams in each row, from left to right, show palm posture, finger bending angle, finger opening angle, and trajectory, respectively. It is not hard to see that each feature diagram depicts nicely how its corresponding gesture is performed. A gesture with complicated changes usually corresponds to complex feature curves, and vice versa. Different gestures may have similar features: for example, the palm posture features of G1–G3 are similar to those of G6–G8, and the finger bending angle and finger opening angle features of G6–G8 are similar to each other. Therefore, it is not easy to distinguish these gestures with a single feature alone. Of course, some gestures have significantly different features, such as G1 and G2, so there is no misrecognition between G1 and G2.
Features of the gestures in the LM-Gesture3D training dataset.
For a given gesture, there may be more distinguishing features that could improve the recognition rate and reduce the computation cost. However, considering that the eight kinds of gestures in LM-Gesture3D have obvious similarities, we prefer a feature set with completeness and some redundancy that meets the requirements of unified modeling and recognition. As described for Figure 4, some features in the defined feature set are similar across different gestures, but distinctly different features are also included. So, on the whole, the collected LM-Gesture3D gestures, or other kinds of dynamic gestures, can be adequately represented and distinguished by the four defined feature types.
Of the four features, the finger bending angle and finger opening angle are not affected by the acquisition direction. To verify whether the remaining two kinds of features are rotation invariant, we obtain the hand data of gesture G6 from an experimenter, who is asked to make gesture G6 twice during the collection period. We then extract the posture feature and trajectory feature from the collected hand data and draw the feature curves, as shown in Figure 5.
Palm posture and trajectory feature diagrams of gesture G6. a1 and a2 are the feature diagrams of palm posture corresponding to two gestures; a3 and a4 are the feature diagrams of trajectory corresponding to two gestures.
3.3. Feature Sequence Clustering
As shown in Table 2, in a single data frame of a gesture, the four features can be represented by $m_i (i = 1, 2, 3, 4)$ dimensional vectors, respectively. Accordingly, each feature over the T data frames of a dynamic gesture forms a $T \times m_i$ dimensional vector sequence. In order to build a discrete HMM, the K-means algorithm [25] is used to cluster the feature vectors in each sequence. After clustering a feature vector into q classes, the feature vector sequence can be expressed as $O = (o_1, \dots, o_t, \dots, o_T)$, where $o_t \in \{1, \dots, q\}$ indicates that the feature vector is closest to the cluster center numbered $o_t$. The cluster number q of each of the four features is shown in Table 2.
Number of cluster centers of four features.

Feature | Feature vector | q
Palm posture | $(\psi_t, \phi_t, \theta_t)$ | 16
Finger bending angle | $(\omega_1, \omega_2, \omega_3, \omega_4, \omega_5)$ | 14
Finger opening angle | $(\gamma_1, \gamma_2, \gamma_3, \gamma_4, \gamma_5)$ | 10
Trajectory | $(\delta_t, \psi_t)$ | 10
In short, we take the discrete feature sequences composed of cluster tags as the inputs of the discrete HMMs. Therefore, both the sample data in the HMM training stage and the gesture data in the HMM recognition stage go through the steps of feature extraction and clustering.
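The clustering and quantization steps can be sketched with a plain Lloyd's K-means; the farthest-point initialisation is our choice for determinism, since the paper only states that K-means [25] is used:

```python
import numpy as np

def kmeans(X, q, iters=50):
    """Plain Lloyd's K-means used to build the symbol alphabet of the
    discrete HMM.  Farthest-point seeding replaces random init (our choice)."""
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, q):                      # farthest-point seeding
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(q):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def quantize(seq, centers):
    """Map a T x m feature sequence to discrete symbols o_t in 1..q by
    nearest cluster centre."""
    seq = np.asarray(seq, dtype=float)
    return ((seq[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1) + 1
```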
4. Gesture Modeling and Recognition

4.1. Recognition Flow
The recognition process, shown in Figure 6, can be divided into two parts. The first part deals with accurate gesture segmentation and the extraction and quantization of the four features. The second part comprises HMM training and gesture recognition, both of which build on the extracted feature sequences.
Implementation flow of the proposed method for recognizing dynamic gesture.
Formally, an HMM can be expressed as a 5-tuple $(\Omega_X, \Omega_O, A, B, \pi)$, where $\Omega_X = \{q_1, \dots, q_N\}$ is the finite set of Markov chain states and N is the number of states; $\Omega_O = \{V_1, \dots, V_M\}$ is the finite set of observation symbols and M is the number of symbols; $A = [a_{ij}]_{N \times N}$ is the state transition probability matrix; $B = [b_{ij}]_{N \times M}$ is the observation probability matrix; and $\pi = (\pi_1, \dots, \pi_N)$ is the initial state probability distribution.
4.2. HMM Training
Unlike the common pattern of one HMM per gesture, we build one HMM per feature, which means that 4 HMMs are used to recognize each performed unknown gesture. Taking LM-Gesture3D as an example, the eight designed gestures are denoted by $g_u, u = 1, \dots, 8$; then, for each feature sequence $S_{uv} (v = 1, \dots, 4)$ of gesture $g_u$, the following HMM modeling process is carried out:
HMM initialization: according to Table 1, N is set to 6 in this paper. The number of observation symbols M is set to the number of cluster centers shown in Table 2. The initial model parameters are denoted as $\lambda_{uv} = (A, B, \pi)$.
HMM parameter reestimation: assume that the feature sequence $S_{uv}$ consists of K observation sequences $O^k$, where $k = 1, \dots, K$, and each observation sequence can be represented as $O^k = (O_1^k, \dots, O_T^k)$.
To compute $\bar{\pi}_i$, $\bar{a}_{ij}$, and $\bar{b}_j(s)$, the observation sequences $O^k$ and the original model parameters $\lambda_{uv}$ are substituted into the reestimation equations as follows:

$$\bar{\pi}_i = \frac{1}{K}\sum_{k=1}^{K} \frac{\alpha_1^k(i)\,\beta_1^k(i)}{P(O^k \mid \lambda)},$$

$$\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, a_{ij}\, b_j(O_{t+1}^k)\, \beta_{t+1}^k(j) / P(O^k \mid \lambda)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, \beta_t^k(i) / P(O^k \mid \lambda)},$$

$$\bar{b}_j(s) = \frac{\sum_{k=1}^{K} \sum_{t:\, o_t^k = v_s} \alpha_t^k(j)\, \beta_t^k(j) / P(O^k \mid \lambda)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k} \alpha_t^k(j)\, \beta_t^k(j) / P(O^k \mid \lambda)}, \tag{8}$$

where $1 \le i, j \le N$ and $\alpha_t^k$ and $\beta_t^k$ are the forward and backward variables of sequence $O^k$.
Thus, a new model $\bar{\lambda}_{uv} = (\bar{\pi}, \bar{A}, \bar{B})$ is obtained. The above process is repeated until the parameters of two adjacent iterations satisfy

$$\log P(O \mid \bar{\lambda}_{uv}) - \log P(O \mid \lambda_{uv}) < \varepsilon, \tag{9}$$

where $P(O \mid \lambda)$, calculated with the forward-backward algorithm, is the occurrence probability of the observation sequence O under the parameters $\lambda$, and $\varepsilon$ is a predefined convergence threshold.
The final model parameters $\bar{\lambda}_{uv}$ are the optimal parameters for the feature sequence $S_{uv}$, that is, the single-feature HMM of its corresponding gesture. By repeating the above modeling process for each feature sequence $S_{uv}$ of the 8 dynamic gestures, we obtain 32 single-feature HMMs in total.
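The multi-sequence reestimation of equation (8) can be sketched compactly. This is an unscaled toy version adequate only for short sequences; the random initialisation, the fixed iteration count used in place of the $\varepsilon$ test of equation (9), and the probability floor that keeps B strictly positive are our assumptions:

```python
import numpy as np

def loglik(O, pi, A, B):
    """log P(O | lambda) via the scaled forward pass (symbols in 0..M-1)."""
    alpha = pi * B[:, O[0]]
    c = alpha.sum(); alpha = alpha / c
    logp = np.log(c)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum(); alpha = alpha / c
        logp += np.log(c)
    return logp

def baum_welch(seqs, N, M, iters=30, seed=0):
    """Unscaled Baum-Welch over K short observation sequences, following
    the reestimation equations (8).  A toy sketch, not the paper's code."""
    rng = np.random.default_rng(seed)
    pi = np.full(N, 1.0 / N)
    A = rng.random((N, N)); A /= A.sum(1, keepdims=True)
    B = rng.random((N, M)); B /= B.sum(1, keepdims=True)
    for _ in range(iters):
        pi_num = np.zeros(N)
        A_num = np.zeros((N, N)); A_den = np.zeros(N)
        B_num = np.zeros((N, M)); B_den = np.zeros(N)
        for O in seqs:
            T = len(O)
            alpha = np.zeros((T, N)); beta = np.zeros((T, N))
            alpha[0] = pi * B[:, O[0]]
            for t in range(1, T):                    # forward pass
                alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
            beta[-1] = 1.0
            for t in range(T - 2, -1, -1):           # backward pass
                beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
            p = alpha[-1].sum()                      # P(O^k | lambda)
            gamma = alpha * beta / p                 # state posteriors
            pi_num += gamma[0]
            for t in range(T - 1):                   # transition posteriors xi
                A_num += alpha[t, :, None] * A * (B[:, O[t + 1]] * beta[t + 1]) / p
            A_den += gamma[:-1].sum(0)
            for t in range(T):
                B_num[:, O[t]] += gamma[t]
            B_den += gamma.sum(0)
        pi = pi_num / len(seqs)
        A = A_num / A_den[:, None]
        B = np.maximum(B_num / B_den[:, None], 1e-12)  # floor avoids log(0)
        B /= B.sum(1, keepdims=True)
    return pi, A, B
```

`loglik` doubles as the forward-algorithm evaluator used later in recognition.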
4.3. Gesture Recognition with HMM Fusion
In the gesture recognition stage, once the original data of an unknown, valid dynamic gesture are obtained, they are first converted into 4 observation sequences $O_1, O_2, O_3, O_4$. Then, the forward-backward algorithm is used to calculate the occurrence probability $P(O_1 \mid \bar{\lambda}_{u1}), u = 1, \dots, 8$, of the first observation sequence under the 8 single-feature HMMs $\bar{\lambda}_{u1}$. Similarly, the occurrence probabilities of the observation sequences $O_2$, $O_3$, and $O_4$ under their corresponding HMMs $\bar{\lambda}_{u2}$, $\bar{\lambda}_{u3}$, and $\bar{\lambda}_{u4}$ are obtained. For brevity, we write the occurrence probabilities $P(O_v \mid \bar{\lambda}_{uv})$ as $P_{uv} (v = 1, 2, 3, 4)$.
We present a weighted probability fusion algorithm to compute the probability that an unknown gesture belongs to gesture u in LM-Gesture3D:

$$P_u^F = \sum_{v=1}^{4} \omega_{uv} P_{uv}, \tag{10}$$

where $\omega_{uv}$ ($0 \le \omega_{uv} \le 1$ and $\sum_{v=1}^{4} \omega_{uv} = 1$) is the weight of feature v for gesture u.
According to equation (10), there are 8 calculation results, of which the maximum is regarded as the recognition result of the unknown gesture.
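The fusion and the final argmax decision can be sketched as follows (function name assumed):

```python
import numpy as np

def fuse_and_classify(P, W):
    """Weighted probability fusion of equation (10).

    P[u][v]: occurrence probability of feature sequence O_v under the
    HMM of gesture u; W[u][v]: fusion weights (each row sums to 1).
    Returns the index of the gesture with the largest fused probability
    together with all fused scores.
    """
    P = np.asarray(P, dtype=float)
    W = np.asarray(W, dtype=float)
    fused = (W * P).sum(axis=1)         # P_u^F = sum_v w_uv * P_uv
    return int(fused.argmax()), fused
```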
The paper employs the least squares method (LSM) to determine $\omega_{uv}$ in equation (10). A brief introduction to the LSM weighting follows. First, we calculate the probabilities of the four features for all samples in the training dataset, obtaining $P^m = (P_1^m, P_2^m, \dots, P_8^m), m = 1, \dots, L$, where L is the number of samples and $P_u^m = (P_{u1}^m, P_{u2}^m, P_{u3}^m, P_{u4}^m)$. Second, for gesture u in LM-Gesture3D, if sample m belongs to it, we set the target probability of sample m for gesture u as

$$P_{m,u}^F = \sum_{v=1}^{4} \omega_{uv} P_{uv}^m = \omega_u \left(P_u^m\right)^T = p_s. \tag{11}$$
Otherwise, the target probability of sample m for gesture u is set as

$$P_{m,u}^F = \sum_{v=1}^{4} \omega_{uv} P_{uv}^m = \omega_u \left(P_u^m\right)^T = \frac{1 - p_s}{7}, \tag{12}$$

where $p_s$ ($0.5 \le p_s \le 1$) is a preset probability.
Writing out the target probabilities of all L samples for gesture u, we obtain

$$\omega_u \begin{bmatrix} \left(P_u^1\right)^T & \left(P_u^2\right)^T & \cdots & \left(P_u^L\right)^T \end{bmatrix} = \begin{bmatrix} P_{1,u}^F & P_{2,u}^F & \cdots & P_{L,u}^F \end{bmatrix}. \tag{13}$$
We use the LSM to compute $\omega_u = (\omega_{u1}, \omega_{u2}, \omega_{u3}, \omega_{u4})$ in equation (13). Finally, $\omega_u$ is normalized to $\omega_u' = (\omega_{u1}', \omega_{u2}', \omega_{u3}', \omega_{u4}')$, which is the weight vector of the fusion model.
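The LSM weighting of equations (11)-(13) reduces to one least-squares solve per gesture. In this sketch, the unconstrained solve (weights may fall outside [0, 1] before normalisation) and the example value $p_s = 0.9$ are assumptions, since the paper does not report its $p_s$:

```python
import numpy as np

def lsm_weights(P_samples, labels, u, p_s=0.9):
    """Least-squares weights of equations (11)-(13) for gesture u.

    P_samples: (L, 4) matrix whose row m holds the four feature
    probabilities P_uv^m of sample m under gesture u's HMMs; labels:
    gesture index of each sample.  The target is p_s for samples of
    gesture u and (1 - p_s) / 7 otherwise; the solved weight vector is
    normalised to sum to 1.
    """
    P = np.asarray(P_samples, dtype=float)
    y = np.where(np.asarray(labels) == u, p_s, (1.0 - p_s) / 7.0)
    w, *_ = np.linalg.lstsq(P, y, rcond=None)   # min-norm LSM solution
    return w / w.sum()
```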
5. Experiments
To test the performance of the proposed method, several experiments are carried out on a desktop PC with an Intel Core i5-3230M processor and 4 GB of RAM; the software environment consists of Visual Studio 2013, Leap Motion SDK 2.3.1 + 3154, and MATLAB 2012a.
5.1. LM-Gesture3D Recognition Experiment
We select four participants with some experience in robot teleoperation for the experiment. Each participant is asked to perform each gesture in LM-Gesture3D 40 times, and LM samples the gestures, yielding 160 samples per gesture.
To verify the feasibility of the proposed method, we define the recognition rate as

$$RR = \frac{N_{Rec}}{M_{Sam}} \times 100\%, \tag{14}$$

where $N_{Rec}$ is the number of gestures correctly recognized and $M_{Sam}$ is the total number of gestures tested.
First, we use K-fold cross-validation to evaluate the recognition performance and stability of the proposed method, with K set to 10, so each subset has 128 samples. Figure 7 shows the cross-validation result: the recognition rates of the 10 trained models range from 89.8% to 92.9%, a fluctuation of about 3%, which shows that the proposed method generalizes well. The average recognition rate of the 10 trained models is about 90.8%, which indicates good recognition performance.
Recognition accuracies of LM-Gesture3D by using k-fold cross-validation.
Furthermore, we analyze the recognition performance of the proposed method for the different gesture types in LM-Gesture3D. We randomly select 60 samples of each gesture as the testing set and use the remaining samples as the training set. Table 3 shows the recognition results. From the table, we can see that our method represents the 8 dynamic gestures well, with an average recognition rate of about 90.6%; the recognition rates of all gestures fluctuate slightly between 88.3% and 91.7%. The recognition rates of G4 and G6–G8 are higher than those of G1–G3 and G5. The reason is that these gestures are relatively simpler and easier for different users to repeat, whereas G1–G3 and G5 are more easily influenced by a participant’s individual habits. In addition, gestures G1–G3 are easily confused with G6–G8, respectively.
Recognition rate of the proposed method for the gestures in LM-Gesture3D (60 test samples per gesture).

Gesture | Correctly recognized | Misrecognized | RR (%)
G1 | 54 | 3 + 3 | 90.0
G2 | 54 | 6 | 90.0
G3 | 53 | 3 + 4 | 88.3
G4 | 55 | 2 + 3 | 91.7
G5 | 55 | 3 + 3 | 90.0
G6 | 55 | 5 | 91.7
G7 | 54 | 6 | 90.0
G8 | 55 | 3 + 2 | 91.7
In general, the recognition results are jointly determined by four kinds of features, and our method based on multiple features and HMM can represent most kinds of complex gestures, which proves that our method is effective.
5.2. Dynamic Gesture Recognition Experiments
This experiment mainly tests our method's recognition rate on two relatively simple kinds of dynamic gestures, collected in the letter-gesture dataset and the waving-gesture dataset. As shown in Figure 8(a), the letter-gesture set consists of 6 gestures, numbered 1 to 6, which are similar to each other. The waving-gesture dataset contains the remaining 6 gestures, shown in Figure 8(b). The dominant features of the two gesture sets are the trajectory feature and the palm posture feature, respectively.
Gestures in letter-gesture set and waving-gesture set: (a) 6 kinds of gestures in letter-gesture set; (b) 6 kinds of gestures in waving-gesture set.
The gestures in this experiment are sampled from four participants, each of whom is asked to repeat each gesture 50 times. When collecting the letter-gesture dataset, each participant keeps the hand shape as unchanged as possible and parallel to the horizontal plane of LM. The obtained data for each gesture are divided into 120 training samples and 80 testing samples.
Chen et al. [17] propose a rapid early-recognition system based on SVM to achieve multiclass classification among 36 dynamic gestures (the 3D motion trajectories of the numbers and the alphabet). Like our method, Chen’s method uses LM to capture the 3D motion trajectories of the gestures. In Chen’s method, the orientation angle of the gesture trajectory projected onto the XOZ plane is used as the sole feature; it is quantized by dividing it by 45° and coded from 1 to 9, which is similar to our coding scheme. Chen’s method is also applied to recognize the gestures in the letter-gesture dataset.
Figure 9 shows the recognition results of our method and Chen’s method, which achieve average recognition rates of 96.0% and 93.5%, respectively. The two approaches have very similar recognition rates, but the fluctuation of our method's recognition rate with LSM weights is smaller than that of Chen’s approach, indicating that our method has better recognition stability.
Results of using our method and Chen’s method to recognize gestures from letter-gesture dataset.
In addition, the directional code extracted by Chen’s method is determined by two neighboring points on the trajectory, whereas that of our method is determined by a trajectory point and the central point; we also introduce a distance feature. Therefore, the trajectory feature extracted by our method is not affected by the amplitude of the gesture and is rotation invariant.
Based on the above analysis, we believe that our method performs better than Chen’s method.
The waving directions of gestures 7–10 in the waving-gesture dataset are from upper right to lower left, from upper left to lower right, from top to bottom, and from bottom to top, respectively, while gestures 11 and 12 make roughly 90° clockwise and counterclockwise rotations, respectively. This kind of dynamic gesture can be distinguished easily using palm posture features. We carry out an experiment to test the recognition performance of our method on these 6 gestures, with the same data acquisition and processing as in the letter-gesture experiment.
Pan et al. [26] present a combined method based on rule-based classification and SVM to recognize gestures. It also uses LM to capture real-time frame data of hand motion and defines a 14-dimensional feature set comprising the absolute pose of the hand in the 3D coordinate system and the pose changes of the hand between two consecutive frames. We also apply Pan’s method to recognize the gestures in the waving-gesture dataset.
Figure 10 shows the recognition results of our method and Pan’s method. The recognition rates of the two methods for gestures 7–12 are all over 90%, and the average recognition rates are 90.4% and 90.8%, respectively; the average recognition rate of Pan’s method is slightly higher than that of our method.
Results of using our method and Pan’s method to recognize the waving-gesture dataset.
Compared with our method, Pan’s method incurs higher computational cost because it selects high-dimensional features and adopts a two-step recognition strategy. Our method not only achieves a high recognition rate but is also rotation invariant, since it selects rotation angles relative to the initial hand posture as features. It is therefore well suited to recognizing waving or rotation gestures such as those in the waving-gesture dataset.
In addition, all three methods above use LM to sample the gestures, and the data for the features they define can be obtained quickly and accurately with LM. With a camera-based approach, by contrast, recognition would have to rely on hand-area features, which is more complex and challenging. Hence, we can conclude that LM brings substantial benefits to our research.
5.3. Generalization Experiment
A generalization experiment is carried out to verify the adaptability of our method to nonstandard gestures. Four inexperienced participants are selected, and each is asked to repeat every gesture in LM-Gesture3D 40 times. The 1280 sampled gestures are recognized by the trained HMM models with the same weights as in the LM-Gesture3D recognition experiment. The average recognition rate of 90.5%, shown in Figure 11, is very close to that of the LM-Gesture3D recognition experiment, so the method adapts well to nonstandard gestures and has good generalization ability.
Recognition rate on nontrainers’ gestures in LM-Gesture3D.
We define the positive prediction value (PPV) and accuracy (ACC) of gesture $G_i$ ($i = 1, 2, \ldots, 8$) as follows:

$$\text{PPV} = \frac{T_{G_i}}{T_{G_i} + F_{G_i}}, \tag{15}$$

where $T_{G_i}$ is the number of samples of gesture $G_i$ correctly recognized and $F_{G_i}$ is the number of samples of the other seven gestures incorrectly recognized as $G_i$;

$$\text{ACC} = \frac{T_{G_i}}{T_{G_i} + \sum_{1 \le j \le 8,\, j \ne i} F_{G_j}}, \tag{16}$$

where $F_{G_j}$ is the number of samples of gesture $G_i$ incorrectly recognized as gesture $G_j$.
Table 4 shows the confusion matrix of the generalization experiment using the proposed method. Except for G5, whose PPV is about 0.96, the PPVs of the other seven gestures differ little, ranging from 0.89 to 0.91.
Confusion matrix of the recognition results.
| Recognition results \ Gesture samples | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | PPV |
|---|---|---|---|---|---|---|---|---|---|
| G1 | 140 | 1 | 3 | 5 | 2 | 2 | 1 | 1 | 0.90 |
| G2 | 2 | 146 | 3 | 2 | 3 | 2 | 3 | 2 | 0.90 |
| G3 | 1 | 2 | 138 | 3 | 3 | 2 | 1 | 1 | 0.91 |
| G4 | 6 | 2 | 3 | 144 | 2 | 1 | 1 | 1 | 0.90 |
| G5 | 1 | 1 | 1 | 1 | 143 | 1 | 1 | 0 | 0.96 |
| G6 | 7 | 1 | 2 | 2 | 2 | 147 | 2 | 2 | 0.90 |
| G7 | 1 | 5 | 3 | 1 | 3 | 3 | 150 | 3 | 0.89 |
| G8 | 2 | 2 | 7 | 2 | 2 | 2 | 1 | 150 | 0.89 |
| ACC | 0.88 | 0.91 | 0.86 | 0.90 | 0.89 | 0.92 | 0.94 | 0.94 | 0.90 |
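As a cross-check, the PPV and ACC values in Table 4 can be recomputed directly from the confusion matrix. The following sketch uses our own illustrative names; rows are recognition results and columns are the true gesture samples, as in the table.

```python
def ppv_acc(cm):
    """PPV per row (eq. 15) and ACC per column (eq. 16) of a confusion
    matrix whose rows are recognition results and whose columns are the
    true gesture samples."""
    n = len(cm)
    ppv = [cm[i][i] / sum(cm[i]) for i in range(n)]
    acc = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]
    return ppv, acc

# Confusion matrix from Table 4 (rows G1..G8 = recognition results).
table4 = [
    [140, 1, 3, 5, 2, 2, 1, 1],
    [2, 146, 3, 2, 3, 2, 3, 2],
    [1, 2, 138, 3, 3, 2, 1, 1],
    [6, 2, 3, 144, 2, 1, 1, 1],
    [1, 1, 1, 1, 143, 1, 1, 0],
    [7, 1, 2, 2, 2, 147, 2, 2],
    [1, 5, 3, 1, 3, 3, 150, 3],
    [2, 2, 7, 2, 2, 2, 1, 150],
]
ppv, acc = ppv_acc(table4)
```

For example, the PPV of G5 comes out to 143/149 ≈ 0.96 and the ACC of G1 to 140/160 ≈ 0.88, matching the table.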
5.4. Comparison Experiment with Other HMM-Based Methods
Here, we compare the recognition performance of the proposed method with that of other HMM-based recognition methods.
The authors in [12] define three features (handshape, palm trajectory, and distance from the camera) extracted from the hand model in the image, and propose a combinatorial method based on HMM and BPNN. The HMM-BPNN method uses the classical HMM to evaluate the dynamic gesture features and then uses a BP neural network to classify the resulting state sequence.
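Both the HMM-BPNN method and ours rely on evaluating an observation sequence under a trained HMM. A generic scaled forward algorithm for discrete observations might look as follows; this is a sketch with our own naming and does not assume either paper’s state counts or model parameters.

```python
import math

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM
    with initial probabilities pi, transition matrix A, and emission
    matrix B (B[state][symbol])."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    loglik = 0.0
    s = sum(alpha)
    loglik += math.log(s)
    alpha = [a / s for a in alpha]          # rescale to avoid underflow
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
        s = sum(alpha)
        loglik += math.log(s)
        alpha = [a / s for a in alpha]
    return loglik
```

In the HMM-BPNN pipeline such per-model scores feed a BP network; in our method they feed the weighted fusion step.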
In this experiment, the samples come from the LM-Gesture3D recognition experiment in Section 5.1: for each gesture, 60 samples are randomly selected as the testing set, and the remaining samples are used as the training set. The experiment is divided into two parts: feature testing and algorithm testing.
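The per-gesture split described above can be sketched in a few lines; the function name, the fixed seed, and the dictionary layout are our illustrative assumptions.

```python
import random

def split_per_gesture(samples_by_gesture, n_test=60, seed=0):
    """Randomly hold out n_test samples of each gesture for testing
    and keep the rest for training, as described for this experiment."""
    rng = random.Random(seed)
    train, test = {}, {}
    for gesture, samples in samples_by_gesture.items():
        shuffled = samples[:]          # copy so the input stays intact
        rng.shuffle(shuffled)
        test[gesture] = shuffled[:n_test]
        train[gesture] = shuffled[n_test:]
    return train, test
```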
The feature testing experiment uses the features defined in [12] to describe the gestures and analyzes the recognition rate of the HMM-BPNN method. Table 5 shows the results: the HMM-BPNN method achieves an average recognition rate of only about 50.83% for the 8 dynamic gestures, and its recognition rate fluctuates greatly across gesture types. The main reason for its low recognition rate on LM-Gesture3D is that the three types of 2D features it defines can represent only simple, highly differentiated gestures; they cannot fully represent complex and highly similar gestures such as G5.
Recognition rate of the HMM-BPNN method of the feature test.
| Gesture | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | RR (%) |
|---|---|---|---|---|---|---|---|---|---|
| G1 | 35 | 7 | 3 | 5 | 1 | 4 | 2 | 3 | 58.3 |
| G2 | 6 | 34 | 4 | 4 | 2 | 3 | 4 | 3 | 56.7 |
| G3 | 5 | 5 | 30 | 5 | 2 | 4 | 4 | 5 | 50.0 |
| G4 | 4 | 5 | 6 | 31 | 1 | 4 | 4 | 5 | 51.7 |
| G5 | 5 | 4 | 3 | 5 | 25 | 5 | 5 | 6 | 41.7 |
| G6 | 5 | 4 | 5 | 4 | 4 | 30 | 4 | 4 | 50.0 |
| G7 | 4 | 6 | 4 | 4 | 2 | 4 | 29 | 7 | 48.3 |
| G8 | 5 | 5 | 3 | 4 | 2 | 5 | 6 | 30 | 50.0 |
The algorithm testing experiment uses the features defined by our method to describe the gestures and analyzes the recognition rate of the HMM-BPNN method again. Table 6 shows the results.
Recognition rate of the HMM-BPNN-based method of the algorithm testing.
| Gesture | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | RR (%) |
|---|---|---|---|---|---|---|---|---|---|
| G1 | 49 | 2 | 3 | 2 | 3 | 1 | 1 | – | 81.7 |
| G2 | 3 | 48 | 1 | 1 | 2 | 3 | 2 | – | 80.0 |
| G3 | 2 | 1 | 50 | 1 | 2 | 1 | 2 | – | 83.3 |
| G4 | 2 | 2 | 3 | 48 | 1 | 2 | 2 | – | 80.0 |
| G5 | 3 | 2 | 2 | 1 | 48 | 2 | 2 | – | 80.0 |
| G6 | 4 | 2 | 1 | 1 | – | 50 | 2 | 1 | 83.3 |
| G7 | 1 | 4 | 1 | 1 | 1 | 2 | 48 | 2 | 80.0 |
| G8 | 2 | 2 | 3 | 2 | 1 | 3 | – | 47 | 78.3 |
From Table 6, we can see that the HMM-BPNN method achieves an average recognition rate of about 80.83% for the 8 dynamic gestures, about 30 percentage points higher than in the feature testing experiment, and its recognition rate fluctuates less across gesture types. These results show that the features defined in this paper represent the complex gestures in LM-Gesture3D more effectively than those of the HMM-BPNN method.
For the same gesture samples and the same features, the recognition rate of our method, shown in Table 3, is more than 90%, about 10 percentage points higher than that of the HMM-BPNN method. We see two main reasons for the relatively low recognition rate of the HMM-BPNN method. First, the input of the BPNN classifier is decided by a maximum assessment over the probabilities of the trained HMM models of the four types of features, which ignores the interference between similar features. Second, the BP neural network is prone to falling into local minima, which increases the risk of misrecognition when different samples have very similar features.
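The contrast with our fusion step can be made concrete: instead of a hard per-feature maximum, we combine the four per-feature HMM scores with weights before choosing a class. The sketch below is illustrative only; the weight values and score values are placeholders, not the trained ones.

```python
def fused_decision(per_class_logprobs, weights):
    """Choose the gesture with the largest weighted sum of per-feature
    HMM log-likelihoods, rather than a hard per-feature maximum."""
    def score(g):
        return sum(w * lp for w, lp in zip(weights, per_class_logprobs[g]))
    return max(per_class_logprobs, key=score)

# Hypothetical weights for the four features (palm posture, bending
# angle, opening angle, trajectory) -- placeholders, not trained values.
weights = [0.2, 0.2, 0.2, 0.4]
scores = {
    "G1": [-10.0, -12.0, -11.0, -9.0],
    "G2": [-11.0, -11.5, -10.5, -14.0],
}
```

With these placeholder numbers, G1 wins even though G2 scores better on two individual features, illustrating why the fused result is not easily dominated by a single feature.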
6. Conclusion
In this paper, a fusion recognition method based on multiple features and HMM is proposed for dynamic gestures. We consider both the change in hand shape and the moving trajectory, and build four sorts of hand features that are straightforward, simple, and rotation invariant, which bring better naturalness and flexibility for the operators. Moreover, these features allow further extension to more kinds of complex dynamic gestures. For each feature, we build a corresponding HMM. In the recognition stage, we present a weighted fusion algorithm to combine the occurrence probabilities and obtain the final recognition result, so that the result is not easily dominated by any single feature.
The experimental results show that the proposed method is not only suitable for relatively simple dynamic gestures, such as letter gestures and waving gestures, but is also robust for complex dynamic gestures such as those in LM-Gesture3D. The average recognition rate of the proposed method on LM-Gesture3D reaches 90.6%, and the average recognition rate for inexperienced participants is about 90%. These results demonstrate the usability and feasibility of the proposed method.
Like other gesture recognition methods, the proposed method inevitably has limitations that call for further study. First, because four HMMs are evaluated for each gesture, the efficiency of the algorithm remains to be improved. Second, we have not yet investigated adaptive weighting methods and their impact on the recognition rate, which will be a future research direction.
Data Availability
The research library related to this paper will be established on GitHub (https://github.com/glchenwhut), where the experimental data and lists can be accessed.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Authors’ Contributions
Guoliang Chen conceived the idea, designed the experiments, and wrote the paper. Kaikai Ge helped with the algorithm and to analyze the experimental data.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant no. 61672396.
References
[1] S. S. Rautaray and A. Agrawal, “Vision based hand gesture recognition for human computer interaction: a survey,” 2015, vol. 43, no. 1, pp. 1–54. doi:10.1007/s10462-012-9356-9.
[2] J. Zhang, W. Li, P. O. Ogunbona, P. Wang, and C. Tang, “RGB-D-based action recognition datasets: a survey,” 2016, vol. 60, pp. 86–105. doi:10.1016/j.patcog.2016.05.019.
[3] L. Lamberti and F. Camastra, “Handy: a real-time three color glove-based gesture recognizer with learning vector quantization,” 2012, vol. 39, no. 12, pp. 10489–10494. doi:10.1016/j.eswa.2012.02.081.
[4] H. Cheng, L. Yang, and Z. Liu, “Survey on 3D hand gesture recognition,” 2016, vol. 26, no. 9, pp. 1659–1671. doi:10.1109/tcsvt.2015.2469551.
[5] P. K. Pisharady and M. Saerbeck, “Recent methods and databases in vision-based hand gesture recognition: a review,” 2015, vol. 141, pp. 152–165. doi:10.1016/j.cviu.2015.08.004.
[6] O. K. Oyedotun, E. O. Olaniyi, A. Helwan, and A. Khashman, “Decision support models for iris nevus diagnosis considering potential malignancy,” 2014, vol. 5, no. 12, pp. 419–426.
[7] N. A. Ibraheem and R. Z. Khan, “Vision based gesture recognition using neural networks approaches: a review,” 2012, vol. 3, no. 1.
[8] W.-Y. Lin and C.-Y. Hsieh, “Kernel-based representation for 2D/3D motion trajectory retrieval and classification,” 2013, vol. 46, no. 3, pp. 662–670. doi:10.1016/j.patcog.2012.09.014.
[9] K. M. Vamsikrishna, D. P. Dogra, and M. S. Desarkar, “Computer-vision-assisted palm rehabilitation with supervised learning,” 2016, vol. 63, no. 5, pp. 991–1001. doi:10.1109/tbme.2015.2480881.
[10] R. Schramm, C. R. Jung, and E. R. Miranda, “Dynamic time warping for music conducting gestures evaluation,” 2015, vol. 17, no. 2, pp. 243–255. doi:10.1109/tmm.2014.2377553.
[11] K. Martin Sagayam and D. Jude Hemanth, “A probabilistic model for state sequence analysis in hidden Markov model for hand gesture recognition,” 2019, vol. 35, no. 1, pp. 59–81. doi:10.1111/coin.12188.
[12] L. Zhou, L. S. Zhang, S. Lei, and X. B. Zhang, “Dynamic hand gesture recognition using HMM-BPNN model,” Proceedings of the IEEE International Conference on Real-time Computing and Robotics (RCAR), June 2016, Angkor Wat, Cambodia, pp. 422–426. doi:10.1109/rcar.2016.7784066.
[13] K. Martin Sagayam and D. Jude Hemanth, “ABC algorithm based optimization of 1-D hidden Markov model for hand gesture recognition applications,” 2018, vol. 99, pp. 313–323. doi:10.1016/j.compind.2018.03.035.
[14] O. K. Oyedotun and A. Khashman, “Deep learning in vision-based static hand gesture recognition,” 2017, vol. 28, no. 12, pp. 3941–3951. doi:10.1007/s00521-016-2294-8.
[15] P. Bao, A. I. Maqueda, C. R. del-Blanco, and N. García, “Tiny hand gesture recognition without localization via a deep convolutional network,” 2017, vol. 63, no. 3, pp. 251–257. doi:10.1109/tce.2017.014971.
[16] F. Weichert, D. Bachmann, B. Rudak, and D. Fisseler, “Analysis of the accuracy and robustness of the leap motion controller,” 2013, vol. 13, no. 5, pp. 6380–6393. doi:10.3390/s130506380.
[17] Y. Chen, Z. Ding, Y. L. Chen, and X. Wu, “Rapid recognition of dynamic hand gestures using leap motion,” Proceedings of the IEEE International Conference on Information and Automation, August 2015, Lijiang, China, pp. 1419–1424. doi:10.1109/icinfa.2015.7279509.
[18] S. Ameur, A. B. Khalifa, and M. S. Bouhlel, “A comprehensive leap motion database for hand gesture recognition,” Proceedings of the 7th International Conference on Science of Electronics, Technologies of Information and Telecommunications, December 2016, Hammamet, Tunisia, pp. 514–519.
[19] Y. Xu, Q. Wang, X. Bai, Y. L. Chen, and X. Wu, “A novel feature extracting method for dynamic gesture recognition based on support vector machine,” Proceedings of the IEEE International Conference on Information and Automation, July 2014, Hailar, China, pp. 437–441.
[20] W. Zeng, C. Wang, and Q. Wang, “Hand gesture recognition using leap motion via deterministic learning,” 2018.
[21] W. Lu, Z. Tong, and J. Chu, “Dynamic hand gesture recognition with leap motion controller,” 2016, vol. 23, no. 9, pp. 1188–1192.
[22] D. Avola, M. Bernardi, L. Cinque, G. L. Foresti, and C. Massaroni, “Exploiting recurrent neural networks and leap motion controller for the recognition of sign language and semaphoric hand gestures,” 2019, vol. 21, no. 1, pp. 234–245. doi:10.1109/tmm.2018.2856094.
[23] X. Ji, C. Yu, W. Chen, and D. Dong, “GNSS 3D attitude measurement system based on dual-antenna receiver with common clock,” Proceedings of the 2017 Forum on Cooperative Positioning and Service, May 2017, Harbin, China, pp. 223–227.
[24] H. Hong, J. Chao, Y. Jin, Z. Zhao, and W. Lin, “Key point model for hand pose estimation based on leap motion,” 2015, vol. 7, no. 12, pp. 1211–1216.
[25] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, “An efficient k-means clustering algorithm: analysis and implementation,” 2002, vol. 24, no. 7, pp. 881–892.
[26] J. J. Pan and K. Xu, “Three-dimensional freehand gesture manipulation based on Leap Motion (in Chinese),” 2015, no. 10, pp. 207–212.