Mining Key Skeleton Poses with Latent SVM for Action Recognition

Human action recognition based on 3D skeletons has become an active research field in recent years, driven by recently developed commodity depth sensors. Most published methods analyze entire 3D depth sequences, construct mid-level part representations, or use trajectory descriptors of spatiotemporal interest points to recognize human activities. Unlike previous work, this paper proposes a novel and simple action representation that models an action as a sequence of inconsecutive and discriminative skeleton poses, named key skeleton poses. The pairwise relative positions of skeleton joints are used as the feature of the skeleton poses, which are mined with the aid of the latent support vector machine (latent SVM). The advantage of our method is its resistance to intraclass variation such as noise and large nonlinear temporal deformation of human action. We evaluate the proposed approach on three benchmark action datasets captured by Kinect devices: MSR Action 3D dataset, UTKinect Action dataset, and Florence 3D Action dataset. The detailed experimental results demonstrate that the proposed approach achieves superior performance to the state-of-the-art skeleton-based action recognition methods.


Introduction
The task of automatic human action recognition has been studied over the last few decades as an important area of computer vision research. It has many applications, including video surveillance, human-computer interfaces, sports video analysis, and video retrieval. Despite remarkable research efforts and many encouraging advances in the past decade, accurate recognition of human actions is still a quite challenging task [1].
In traditional RGB videos, human action recognition mainly focuses on analyzing spatiotemporal volumes and their representations. According to the variety of visual spatiotemporal descriptors, human action recognition work can be classified into three categories. The first category is local spatiotemporal descriptors. An action recognition method first detects interest points (e.g., STIPs [2] or trajectories [3]) and then computes descriptors (e.g., HOG/HOF [2] and HOG3D [4]) based on the detected local motion volumes. These local features are then combined (e.g., bag-of-words) to represent actions. The second category is global spatiotemporal templates that represent the entire action. A variety of image measurements have been proposed to populate such templates, including optical flow and spatiotemporal orientation [5,6] descriptors. Besides the local and holistic representations, the third category is mid-level part representations, which model moderate portions of the action. Here, parts have been proposed that capture a neighborhood of spacetime [7,8] or a spatial key frame [9]. These representations attempt to balance the tradeoff between the generality exhibited by small patches, for example, visual words, and the specificity exhibited by large ones, for example, holistic templates.
In addition, with the advent of inexpensive RGB-depth sensors such as Microsoft Kinect [10], a lot of effort has been made to extract features for action recognition from depth data and skeletons. Reference [11] represents each depth frame as a bag of 3D points along the human silhouette and utilizes an HMM to model the temporal dynamics. Reference [12] learns semilocal features automatically from the data with an efficient random sampling approach. Reference [13] selects the most informative joints based on the discriminative measures of each joint. Inspired by [14], Seidenari et al. model the movements of the human body using kinematic chains and perform action recognition with a Nearest-Neighbor classifier [15]. In [16], skeleton sequences are represented as trajectories in an n-dimensional space; these trajectories are then interpreted in a Riemannian manifold (shape space). Recognition is finally performed using NN classification on this manifold. Reference [17] extracts a sparse set of active joint coordinates and maps these coordinates to a lower-dimensional linear manifold before training an SVM classifier. The methods above generally extract the spatiotemporal representation of the skeleton sequences with well-designed handcrafted features.
Recently, with the development of deep learning, several Recurrent Neural Network (RNN) models have been proposed for action recognition. In order to recognize actions according to the relative motion between limbs and the trunk, [18] uses an end-to-end hierarchical RNN for skeleton-based action recognition. Reference [19] uses skeleton sequences to regularize the learning of Long Short Term Memory (LSTM), which is grounded via a deep Convolutional Neural Network (DCNN) onto the video for action recognition.
Most of the above methods rely on entire video sequences (RGB or RGBD) to perform action recognition, in which spatiotemporal volumes are always selected as the representative feature of an action. These methods suffer from sensitivity to intraclass variation such as temporal scale or partial occlusions. For example, Figure 1 shows two athletes who perform somewhat different poses when diving, which makes the spatiotemporal volumes different. Motivated by this case, the question we seek to answer in this paper is whether a few inconsecutive key skeleton poses are enough to perform action recognition. As far as we know, this is an unresolved issue that has not yet been systematically investigated. In our early work [20], it was shown that some human actions can be recognized with only a few inconsecutive and discriminative frames in RGB video sequences. Related to our work, very short snippets [9] and discriminative action-specific patches [21] have been proposed as representations of specific actions. However, in contrast to our method, these two methods focus on consecutive frames.
In this paper, a novel framework is proposed for action recognition in which key skeleton poses are selected as the representation of an action in RGBD video sequences. In order to make our method more robust to translation, rotation, and scaling, Procrustes analysis [22] is conducted on the 3D skeleton joint data. Then, the pairwise relative positions of the 3D skeleton joints are computed as discriminative features to represent the human movement. Finally, key skeleton poses, defined as the most representative skeleton models of the action, are mined from the 3D skeleton videos with the help of the latent support vector machine (latent SVM) [23]. In early exploration experiments, we noticed that at least 4 inconsecutive key skeleton poses are needed. During testing, the temporal position and similarity of each of the key poses are compared with the model of the action. The proposed approach has been evaluated on three benchmark datasets: MSR Action 3D dataset [11], UTKinect Action dataset [24], and Florence 3D Action dataset [26]; all are captured with Kinect devices. Experimental results demonstrate that the proposed approach achieves better recognition accuracy than several existing methods.
The remainder of this paper is organized as follows. The proposed approach is elaborated in Section 2, including feature extraction, key pose mining, and action recognition. Experimental results are shown and analyzed in Section 3. Finally, we conclude this paper in Section 4.

Proposed Approach
Due to the large performance variation of an action, the appearance, temporal structure, and motion cues exhibit large intraclass variability. Selecting inconsecutive and discriminative key poses is therefore a promising way to represent the action. In this section, we answer the questions of what the discriminative key poses are and how to find them.

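Definition of the Key Poses and Model Structure.
The structure of the proposed approach is shown in Figure 2. Each action model is composed of a few key poses, and each key pose $k$ in the model is represented by three parts: (1) a linear classifier $f_k(x)$, which can discriminate the key pose; (2) the temporal position $\tau_k$ and offset $\sigma_k$, where key pose $k$ is most likely to appear in the neighborhood of $\tau_k$ with radius $\sigma_k$; and (3) the weight of the linear classifier $w_k^f$ and the weight of the temporal information $w_k^t$.

[Figure 2: Structure of the proposed approach. Panel labels: Feature extract; Finding key pose.]

Given a video that contains $n$ frames, $X = \{x_1, \ldots, x_n\}$, where $x_i$ is the $i$-th frame of the video, the score is computed as follows:

$$f_w(X) = \max_{T \in \mathcal{T}_X} \sum_{k=1}^{K} \left[ w_k^f f_k(x_{t_k}) + w_k^t \Delta_k(t_k) \right], \qquad (1)$$

in which $\mathcal{T}_X$ is the set of key pose positions of video $X$, $\mathcal{T}_X = \{T \mid T = (t_1, \ldots, t_K),\ 1 \le t_k \le n\}$, and $x_{t_k} \in X$. For example, $T$ is $\{1, 9, 10, 28\}$ in Figure 3(a). $K$ is the total number of key poses in the action model; in our experiments, $K$ ranges from 1 to 20. $t_k$ is the index of key pose $k$ in the frame sequence of the video. $\Delta_k$ is defined as follows:

$$\Delta_k(t_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left( -\frac{(t_k - t_0 - \tau_k)^2}{2\sigma_k^2} \right), \qquad (2)$$

in which $t_0$ is the frame at which the action begins. $\Delta_k$ is a Gaussian function and reaches its peak when $t_k - t_0 = \tau_k$. $t_0$ has been manually labeled on the training set. The method of finding $t_0$ in a test video is discussed in Section 2.4.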
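The scoring rule can be illustrated with a minimal Python sketch. It assumes that the score is a weighted sum of per-key-pose classifier outputs and Gaussian temporal terms, maximized over key pose positions, and that the classifier outputs are already available as a table. The names `temporal_term`, `model_score`, and `frame_scores` are hypothetical, frames are 0-indexed here, and the normalization of the Gaussian is an assumption.

```python
import math

def temporal_term(t, t0, tau, sigma):
    """Gaussian temporal score; it peaks when t - t0 == tau (assumed form)."""
    z = (t - t0 - tau) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def model_score(frame_scores, t0, taus, sigmas, w_f, w_t):
    """Best total score over key pose placements.

    frame_scores[k][t] is the (hypothetical) classifier output f_k for frame t.
    Each key pose is placed independently, so the maximum over all
    placements splits into one maximum per key pose.
    Returns (score, positions).
    """
    total, positions = 0.0, []
    for k, scores_k in enumerate(frame_scores):
        def combined(t, k=k, scores_k=scores_k):
            return (w_f[k] * scores_k[t]
                    + w_t[k] * temporal_term(t, t0, taus[k], sigmas[k]))
        best_t = max(range(len(scores_k)), key=combined)
        total += combined(best_t)
        positions.append(best_t)
    return total, positions
```

Because no ordering constraint between key poses is imposed in this sketch, the search costs one pass over the frames per key pose rather than an enumeration of all placements.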

Feature Extracting and Linear Classifier.
With the help of a real-time skeleton estimation algorithm, the 3D joint positions are employed to characterize the motion of the human body. Following [1], we represent the human movement as the pairwise relative positions of the joints. For a human skeleton, joint positions are tracked by the skeleton estimation algorithm, and each joint $i$ has 3 coordinates at each frame. The coordinates are normalized based on Procrustes analysis [22], so that the motion is invariant to the initial body orientation and the body size. For a given frame $x$ with joint positions $p_1, \ldots, p_N$, the feature is

$$\phi(x) = \{\, p_i - p_j \mid 1 \le i < j \le N \,\} \cup \{\, p_i \mid 1 \le i \le N \,\}. \qquad (3)$$

The feature is a 630-dimensional vector (570 pairwise relative position coordinates and 60 joint position coordinates) for the MSR Action 3D and UTKinect Action datasets, which have 20 joints per frame. For the Florence 3D Action dataset, which has 15 joints per frame, it is a 360-dimensional vector. (The selection of alternative feature representations is discussed in Experiment Result.) Then, we train a linear classifier for each key pose according to the following equation:

$$f_k(x) = \omega_k \cdot \phi(x) + b_k. \qquad (4)$$

The question of which frames should be used for training $f_k(x)$ is discussed in Section 2.3.
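The feature dimensions quoted above can be checked with a short Python sketch that builds the pairwise relative positions followed by the raw joint coordinates. The function name `pairwise_feature` and the ordering of the components inside the vector are assumptions made for illustration.

```python
from itertools import combinations

def pairwise_feature(joints):
    """joints: list of (x, y, z) tuples for one frame.

    Returns the pairwise differences p_i - p_j for i < j,
    followed by the raw joint coordinates, as one flat list.
    """
    feat = []
    for (xi, yi, zi), (xj, yj, zj) in combinations(joints, 2):
        feat.extend((xi - xj, yi - yj, zi - zj))
    for p in joints:
        feat.extend(p)
    return feat
```

With 20 joints there are 190 joint pairs, giving 570 + 60 = 630 values; with 15 joints there are 105 pairs, giving 315 + 45 = 360 values.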

Latent Key Poses Mining.
It is not easy to decide which frames contain the key poses, because the key pose space $\mathcal{T}_X$ is too large to enumerate all possible poses. Inspired by [23], since the key pose positions are not observable in the training data, we formulate the learning problem as a latent structural SVM, regarding the key pose positions as the latent variable.
Rewrite (1) as follows:

$$f_w(X) = \max_{T \in \mathcal{T}_X} w \cdot \Phi(X, T), \qquad (5)$$

in which $T = (t_1, \ldots, t_K)$ is treated as the latent variable, $w = (w_1^f, w_1^t, \ldots, w_K^f, w_K^t)$, and $\Phi(X, T) = (f_1(x_{t_1}), \Delta_1(t_1), \ldots, f_K(x_{t_K}), \Delta_K(t_K))$. Given a labeled set $D = \{\langle X_i, y_i \rangle\}_{i=1}^{M}$, where $y_i \in \{-1, +1\}$, the model is trained by minimizing

$$L(w) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{M} \max\bigl(0,\ 1 - y_i f_w(X_i)\bigr), \qquad (6)$$

in which $C$ is the penalty parameter. Following [23], the model is first initialized: $D_p$ and $D_n$ are the positive and negative subsets of $D$, and the model is initialized with $K$ key frames as shown in Algorithm 1. In Algorithm 1, $P_k$ and $N_k$ are the positive frame set and the negative frame set, respectively. They are used to train the linear classifier $f_k(x)$. In order to initialize our model, we first compute $\phi(x_t)$, the feature of the $t$-th frame of the first video sample in $D_p$. Then the Euclidean distance between $\phi(x_t)$ and the features of the frames in the other samples of $D_p$, within the neighborhood of temporal position $t$ with radius $\sigma_k$, is computed. The frame with the minimum Euclidean distance from $\phi(x_t)$ in each sample is added to $P_k$. Then $P_k$ is used to train the linear classifier $f_k(x)$, and $\tau_k$ is chosen as the average frame number in $P_k$. To select the next key pose, the frame $t$ with the minimum score under $f_k(x)$ is chosen for the next loop; in other words, the frame that is most different from the previous key pose is selected in the next loop. Finally, all $w_k^f$ and $w_k^t$ are trained with the linear SVM when Algorithm 1 is completed.
Once the initialization is finished, the model is iteratively trained as follows. First, for each positive video example, the optimal latent value $T_{opt} \in \mathcal{T}_X$ is found, where $T_{opt} = \arg\max_{T} (w \cdot \Phi(X, T))$; $\tau_k$ is updated with the average value of all $T_{opt}$, and a new linear classifier $f_k(x)$ is trained with the modified $P_k$ for each key pose. Second, (6) is optimized over $w$, where $f_w(X) = w \cdot \Phi(X, T_{opt})$, with stochastic gradient descent. Thus, the models are modified to better capture skeleton characteristics for each action.
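The alternation between fixing the latent key pose positions and updating the weights can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the training procedure of the paper: each video is given as a small list of candidate vectors Φ(X, T), one per candidate placement T, the per-key-pose classifiers and temporal positions are not retrained, and the update is a plain per-example subgradient step on a regularized hinge loss without a bias term. The names `train_latent_svm` and `best_candidate` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def best_candidate(w, cands):
    """Latent step: the candidate Phi(X, T) with the highest score w . Phi."""
    return max(cands, key=lambda phi: dot(w, phi))

def train_latent_svm(videos, labels, dim, C=1.0, lr=0.01, outer=10, inner=50):
    """videos[i] is a list of candidate vectors Phi(X_i, T), one per placement T.

    labels[i] is +1 or -1. Returns the learned weight vector.
    """
    w = [0.0] * dim
    n = len(videos)
    for _ in range(outer):
        # Step 1: fix the latent placement of every positive example.
        fixed = [best_candidate(w, v) if y > 0 else None
                 for v, y in zip(videos, labels)]
        # Step 2: per-example subgradient steps on the regularized hinge loss.
        for _ in range(inner):
            for v, y, phi_pos in zip(videos, labels, fixed):
                # Negatives keep the max over placements; positives use the fixed one.
                phi = phi_pos if y > 0 else best_candidate(w, v)
                grad = [wi / n for wi in w]  # share of the 0.5 * ||w||^2 term
                if y * dot(w, phi) < 1.0:    # margin violated: add hinge term
                    grad = [g - C * y * p for g, p in zip(grad, phi)]
                w = [wi - lr * g for wi, g in zip(w, grad)]
    return w
```

Fixing the placement for positive examples makes the inner problem convex in the weights, which is the usual motivation for this kind of alternation in latent SVM training.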

Action Recognition with Key Poses.
The key technical issue in action recognition in real-world video is that we do not know where the action starts, and searching all possible start positions takes a lot of time. Fortunately, the score of each possible start position can be computed independently, so a parallel tool such as OpenMP or CUDA might be helpful.
Given a test video $X$ with $n$ frames, the skeleton feature score $f_k(x_t)$ of each frame is first computed in advance so that it can be reused later. Then, for each possible action start position $t_0$, we compute the score of each key pose $k$ according to the following equation:

$$s_k(t_0) = \max_{1 \le t \le n} \left[ w_k^f f_k(x_t) + w_k^t \Delta_k(t) \right]. \qquad (7)$$

These scores are summed together as the final score of $t_0$. If the final score is larger than the threshold, then an action beginning at $t_0$ has been detected and recognized. Figure 3 shows key poses for different actions in Florence 3D Action dataset.
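The search over start positions can be sketched in Python as follows, assuming the per-frame classifier scores have been computed once and stored in a table. The names `gaussian`, `start_score`, `detect_starts`, and `frame_scores` are hypothetical, frames are 0-indexed, and the normalized Gaussian form of the temporal term is an assumption.

```python
import math

def gaussian(t, center, sigma):
    """Assumed temporal term: a normalized Gaussian centered at `center`."""
    z = (t - center) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def start_score(frame_scores, t0, taus, sigmas, w_f, w_t):
    """Sum over key poses of the best weighted classifier-plus-temporal score."""
    total = 0.0
    for k, scores_k in enumerate(frame_scores):
        total += max(
            w_f[k] * s + w_t[k] * gaussian(t, t0 + taus[k], sigmas[k])
            for t, s in enumerate(scores_k)
        )
    return total

def detect_starts(frame_scores, taus, sigmas, w_f, w_t, threshold):
    """Return every start frame t0 whose final score exceeds the threshold."""
    n = len(frame_scores[0])
    return [
        t0 for t0 in range(n)
        if start_score(frame_scores, t0, taus, sigmas, w_f, w_t) > threshold
    ]
```

Each call to `start_score` depends only on its own `t0`, so the loop in `detect_starts` could be distributed across workers without changing the result.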

Experiment Result
This section presents all experimental results. First, to eliminate the noise generated by translation, scale, and rotation changes of skeleton poses, we preprocess the datasets with Procrustes analysis [22], and we conduct the action recognition experiment with and without Procrustes analysis on the UTKinect dataset to demonstrate its effectiveness. Second, the appropriate feature is selected from four existing feature extraction methods according to experimental results on the Florence 3D Action dataset. Third, a quantitative experiment is conducted to select the number of inconsecutive key poses. Last, we evaluate our model and compare it with some state-of-the-art methods on three benchmark datasets: MSR Action 3D dataset, UTKinect Action dataset, and Florence 3D Action dataset.

Datasets
(1) Florence 3D Action Dataset. Florence 3D Action dataset [26] was collected at the University of Florence during 2012 and captured using a Kinect camera. It includes 9 activities; 10 subjects were asked to perform these actions two or three times. This resulted in a total of 215 activity samples. Each frame contains 15 skeleton joints.
(2) MSR Action 3D Dataset. MSR Action 3D dataset [11] consists of the skeleton data obtained by a depth sensor similar to the Microsoft Kinect. The data was captured at a frame rate of 15 frames per second. Each action was performed by 10 subjects in an unconstrained way two or three times.
(3) UTKinect Action Dataset. UTKinect Action dataset [24] was captured using a single stationary Kinect and contains 10 actions. Each action is performed twice by 10 subjects in an indoor setting. Three synchronized channels (RGB, depth, and skeleton) are recorded at a frame rate of 30 frames per second. The 10 actions are walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands, and clap hands. It is a challenging dataset due to the huge variations in viewpoint and high intraclass variations. Therefore, this dataset is used to validate the effectiveness of Procrustes analysis [22].

Data Preprocessing with Procrustes Analysis.
Skeleton data in each frame of a given video usually consists of a fixed number of predefined joints. The position of a joint is determined by three coordinates $(x, y, z)$. Figure 4 shows the skeleton definition in MSR Action 3D dataset. It contains 20 joints, which can be represented by their coordinates.
Using the raw human skeleton in the video as the feature is not a good choice, because the skeleton is subject to rotation, scaling, and translation. So, before the experiment, we normalize the datasets by Procrustes analysis. In statistics, Procrustes analysis is a form of statistical shape analysis used to analyze the distribution of a set of shapes, and it is widely applied in computer vision, for example in face detection. In this paper, it is used to align the skeleton joints and eliminate the noise due to rotation, scaling, or translation. The details of Procrustes analysis are described next.
Given a skeleton with $N$ joints, $((x_1, y_1, z_1), (x_2, y_2, z_2), \ldots, (x_N, y_N, z_N))$, the first step is translation. We compute the mean coordinates $(\bar{x}, \bar{y}, \bar{z})$ of all joints and move them to the origin. The translation is completed by subtracting the mean coordinates from each joint coordinate: $(x_i, y_i, z_i) \leftarrow (x_i - \bar{x},\ y_i - \bar{y},\ z_i - \bar{z})$. The purpose of scaling is to make the root mean square of all joint coordinates equal to 1. For the skeleton joints, we compute $s$ according to the following equation:

$$s = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i^2 + y_i^2 + z_i^2 \right)}, \qquad (8)$$

and the scaling result is $(x_i, y_i, z_i) \leftarrow (x_i / s,\ y_i / s,\ z_i / s)$. Rotation is the last step of Procrustes analysis. Removing the rotation is more complex, as a standard reference orientation is not always available. Given is a group of standard skeleton joint points $A = ((u_1, v_1, w_1), (u_2, v_2, w_2), \ldots, (u_N, v_N, w_N))$, which represent an action facing the positive direction of the x-axis. The mean coordinate of $A$ is at the origin, and the root mean square of its coordinates is 1. Then we compute the rotation matrix $R$ for skeleton $S = ((x_1, y_1, z_1), (x_2, y_2, z_2), \ldots, (x_N, y_N, z_N))$, which has been translated and scaled as described above, by

$$M = S^{\mathsf{T}} A = U \Sigma V^{\mathsf{T}}, \qquad (9)$$

in which $S$ and $A$ are treated as $N \times 3$ matrices and $M$ is a $3 \times 3$ matrix. $U \Sigma V^{\mathsf{T}}$ is the singular value decomposition of $M$ with orthogonal $U$ and $V$ and diagonal $\Sigma$. The rotation matrix $R$ is equal to $U$ multiplied by the transpose of $V$. Finally, the skeleton joint points $S$ are aligned with $A$ by computing $SR$.
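The three normalization steps can be sketched with NumPy as follows. This is a minimal illustration that assumes the rotation is obtained from the standard orthogonal Procrustes solution via the SVD of S^T A; the function names `normalize` and `align` are hypothetical, and no correction is applied for the case where the optimal orthogonal matrix is a reflection.

```python
import numpy as np

def normalize(S):
    """Center an (N, 3) joint array and scale it to unit root mean square size."""
    S = np.asarray(S, dtype=float)
    S = S - S.mean(axis=0)                 # translation: mean joint at the origin
    s = np.sqrt((S ** 2).sum() / len(S))   # root mean square distance from origin
    return S / s

def align(S, A):
    """Rotate a normalized skeleton S onto a normalized reference skeleton A."""
    U, _, Vt = np.linalg.svd(S.T @ A)      # 3 x 3 matrix, then its SVD
    R = U @ Vt                             # orthogonal matrix minimizing ||S R - A||
    return S @ R
```

As a usage example, a reference skeleton that is rotated, scaled, and shifted should map back onto the normalized reference after `align(normalize(moved), normalize(reference))`.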
We followed the cross-subject test setting of [30] on the UTKinect dataset to test the validity of Procrustes analysis. The results are shown in Table 1. The recognition rate of almost all actions is improved after preprocessing the skeleton joint points with Procrustes analysis. In particular, the recognition rate of one action is improved by 10%. This indicates that the translation, scaling, and rotation of the human skeleton in the video affect the recognition accuracy and that Procrustes analysis is an effective method to eliminate the influence of these geometric transformations.

Feature Extraction Method Selection.
With the extensive research on skeleton-based action recognition, many efficient feature representations are available. We select four of them as alternative feature representations: Pairwise [1], the most informative sequences of joint angles (MIJA) [31], histograms of 3D joints (HOJ3D) [24], and sequence of the most informative joints (SMIJ) [13].
Given is a skeleton $S = \{p_1, p_2, \ldots, p_N\}$, in which $p_i = (x_i, y_i, z_i)$. The Pairwise representation is computed as follows: for each joint $i$, we extract the pairwise relative position features by taking the difference between the position of joint $i$ and the position of another joint $j$, $p_{ij} = p_i - p_j$, so the pairwise feature of $S$ is $\phi_{\mathrm{pair}}(S) = \{p_{ij} \mid p_{ij} = p_i - p_j,\ 1 \le i < j \le N\}$. Because the original joints are also informative, we improve this representation by concatenating $\phi_{\mathrm{pair}}(S)$ and $S$. The new feature is then $\phi(S) = \phi_{\mathrm{pair}}(S) \cup S$.
The most informative sequences of joint angles (MIJA) representation regards joint angles as features. The shape of the joint trajectories encodes local motion patterns for each action. It uses 11 out of the 20 joints to capture information for an action and centers the skeleton, using the hip center joint as the origin $(0, 0, 0)$ of the coordinate system. From this origin, vectors to the 3D position of each joint are calculated. For each vector, it computes the angle $\theta_1$ of its projection onto the x-z plane with the positive x-axis and the angle $\theta_2$ between the vector and the y-axis. The feature consists of the 2 angles of each joint.
Histograms of 3D joints (HOJ3D) representation chooses 12 discriminative joints of the 20 skeletal joints. It takes the hip center as the center of the reference coordinate system and defines the horizontal reference direction according to the left and right hip. The remaining 8 joints are used to compute the 3D spatial histogram. The spherical coordinate space is partitioned into 84 bins, and for each joint location, a Gaussian weight function is used for the 3D bins. Counting the votes in each bin and concatenating them, we get an 84-dimensional feature vector.
Sequence of the most informative joints (SMIJ) representation also takes the joint angle as the feature, but it is different from MIJA. It partitions the joint angle time series of an action sequence into a number of congruent temporal segments and computes the variance of the joint angle time series of each joint over each temporal segment. The top 6 most variable joints in each temporal segment are selected to extract features with a mapping function $\Phi$. Here $\Phi(a): \mathbb{R}^{|a|} \to \mathbb{R}$ is a function that maps a time series $a$ of scalar values to a single scalar value.
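The joint selection step of SMIJ described above can be sketched in Python as follows. The sketch assumes one scalar joint angle series per joint, equal-length segments with any leftover trailing frames ignored, and population variance as the variability measure; the function name `most_informative_joints` is hypothetical, and the subsequent mapping function is not implemented here.

```python
from statistics import pvariance

def most_informative_joints(angles, num_segments, top=6):
    """angles[j] is the joint angle time series of joint j (all equal length).

    Returns, for each temporal segment, the indices of the `top` joints
    with the largest variance in that segment, most variable first.
    """
    seg_len = len(angles[0]) // num_segments
    result = []
    for s in range(num_segments):
        lo, hi = s * seg_len, (s + 1) * seg_len
        variances = [pvariance(series[lo:hi]) for series in angles]
        ranked = sorted(range(len(angles)),
                        key=lambda j: variances[j], reverse=True)
        result.append(ranked[:top])
    return result
```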
In order to find the optimal feature, we conduct an experiment on the Florence 3D Action dataset, in which each video is short. We estimate the coordinates of 5 additional joints from the original 15 joints of each frame in the Florence dataset, so that each frame has the same number of joints as in the MSR Action 3D and UTKinect datasets. The experiment uses the cross-subject test setting; one half of the dataset is used to train the key pose model and the other half is used for testing. The model has 4 key poses, and Procrustes analysis is applied before feature extraction. Results are shown in Figure 5. The overall accuracy of the Pairwise feature across the actions is better than that of SMIJ and MIJA. For all actions except sit down and stand up, the Pairwise representation shows promising results. So, in the following experiments, we select the Pairwise feature for action recognition. The estimated joint coordinates introduce more noise, so the accuracy is lower than the results on the original Florence 3D Action dataset (shown in Table 6).

Selection of Key Pose Numbers.
In this section, we conduct experiments to determine how many key poses are necessary for action recognition. The experimental results are shown in Figure 6; the horizontal axis denotes the number of key poses, and the vertical axis denotes the recognition accuracy of the proposed approach. The number of key poses ranges from 1 to 20. The accuracy increases with the number of key poses when the number is less than 4. The accuracy almost reaches its maximum when the number of key poses equals 4 and does not increase when the number of key poses is more than 4. Considering both accuracy and computation time, 4 is selected as the number of key poses for action recognition in the following experiments.
Table 2 lists the recognition accuracy for each action in UTKinect Action dataset only for 4 to 8 key poses. The recognition accuracy of an individual action varies with the number of key poses. However, the average recognition accuracy is nearly the same for different numbers of key poses, so 4 is the most cost-effective choice.

Results on MSR Action 3D Dataset.
According to the standard protocol provided by Li et al. [11], the dataset was divided into three subsets, shown in Table 3. AS1 and AS2 were intended to group actions with similar movement, while AS3 was intended to group complex actions together. For example, action hammer is likely to be confused with forward punch in AS1, and action pickup & throw in AS3 is a composition of bend and high throw in AS1. We evaluate our method using a cross-subject test setting: videos of 5 subjects were used to train our model and videos of the other 5 subjects were used for testing. Table 4 shows the results for AS1, AS2, and AS3. We compare our performance with Li et al. [11], Xia et al. [24], and Yang and Tian [25]. Our algorithm achieves a considerably higher recognition rate than Li et al. [11] in all the testing setups on AS1, AS2, and AS3. For AS2, the accuracy of the proposed method is the highest. For AS1 and AS3, our recognition rate is only slightly lower than Xia et al. [24] and Yang and Tian [25], respectively. However, the average accuracy of our method on all three subsets is higher than the other methods.
Table 5 shows the results on MSR Action 3D dataset. The average accuracy of the proposed method is 90.94%, which is better than the other six methods.

Method                                   Accuracy (MSR Action 3D)
Histogram of 3D joints [24]              78.97%
EigenJoints [25]                         82.30%
Angle similarities [27]                  83.53%
Actionlet [1]                            88.20%
Spatial and temporal part-sets [28]      90.22%
Covariance descriptors [29]              90.53%
Our approach                             90.94%
Results on UTKinect Action Dataset.
On the UTKinect dataset, we followed the cross-subject test setting of [30], in which one half of the subjects is used for training our model and the other half is used to evaluate the model. We compare our model with Xia et al. [24] and Gan and Chen [30]. Figure 7 summarizes the results of our model along with competing approaches on the UTKinect dataset. Our method achieves the best performance on three actions: pull, push, and throw. Most importantly, the average accuracy of our method is 91.5%, which is better than the other two methods (90.9% and 91.1% for Xia et al. [24] and Gan and Chen [30], respectively). The accuracy for actions such as clap hands and wave hands is lower; the reason may be that the skeleton joint movement ranges of these actions are not large enough and the skeleton data contain more noise. This hinders our method from finding the optimal key poses and degrades the accuracy.

Result on Florence 3D Actions Dataset.
We follow the leave-one-actor-out protocol suggested by the dataset collector on the original Florence 3D Action dataset. All the sequences from 9 out of 10 subjects are used for training, while the remaining one is used for testing. We repeat the procedure for each subject and finally average the 10 classification accuracy values. For comparison with other methods, the average action recognition accuracy is also computed. The experimental results are shown in Table 6. In each column, the data represent each action's recognition accuracy when the corresponding subject is used for testing. The challenges of this dataset are the human-object interaction and the different ways of performing the same action. The proposed approach obtains high accuracies for most of the actions and overcomes the difficulty of intraclass variation in actions such as bow and clap. It gets lower accuracies for actions such as answer the phone and read watch; this can be explained by the fact that these actions are human-object interactions with a small range of motion, which the Pairwise feature cannot reflect well. Furthermore, results compared with other methods are listed in Table 7. Our average accuracy is better than Seidenari et al. [15] and is the same as Devanne et al. [16].
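The leave-one-actor-out protocol can be sketched in Python as follows. The sample layout `(subject_id, features, label)` and the callback `train_and_eval`, which is expected to train on one list and return the accuracy on the other, are assumptions made for illustration and are not part of the original evaluation code.

```python
def leave_one_actor_out(samples, train_and_eval):
    """samples: list of (subject_id, features, label) tuples.

    train_and_eval(train, test) is a hypothetical callback that returns
    an accuracy in [0, 1]. Returns per-subject accuracies and their mean.
    """
    subjects = sorted({s[0] for s in samples})
    accs = {}
    for held_out in subjects:
        # All sequences of the held-out subject form the test set.
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        accs[held_out] = train_and_eval(train, test)
    return accs, sum(accs.values()) / len(accs)
```

Averaging per-subject accuracies gives each subject equal weight, regardless of how many sequences that subject recorded.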
Table 7: Comparison of our method with the others on Florence 3D Actions dataset.

Conclusion
In this paper, we presented an approach for skeleton-based action recognition that mines key skeleton poses with latent SVM. Experimental results demonstrated that human actions can be recognized from only a few frames containing key skeleton poses; in other words, a few inconsecutive and representative skeleton poses can describe the action in a video.
Starting from feature extraction using the pairwise relative positions of the joints, the positions of key poses are found with the help of latent SVM. Then the model is iteratively trained with positive and negative video examples. In the test procedure, a simple method recognizes the action by computing the score of each start position. We validated our model on three benchmark datasets: MSR Action 3D dataset, UTKinect Action dataset, and Florence 3D Action dataset. Experimental results demonstrated that the average accuracy of our method is higher than or equal to that of the compared methods. Because our method relies on extracting descriptors of simple relative positions of the joints, its performance degrades when the actions vary little and are uninformative, for instance, actions performed only by forearm gestures such as clap hands in UTKinect Action dataset. In the future, we will explore other local features that reflect minor motion for a better understanding of human action.

Figure 1: Two athletes perform the same action (diving) in different ways.

Figure 3: Key poses for different actions in Florence 3D Actions dataset.
and offset $\sigma_k$, where key pose $k$ is most likely to appear in the neighborhood of $\tau_k$ with radius $\sigma_k$, and (3) the weight of the linear classifier $w_k^f$ and the weight of the temporal information $w_k^t$.

Table 1: Results of action recognition with or without Procrustes analysis.

Table 2: Recognition accuracy on different number of key poses.

Table 3: The three subsets of actions used in the experiments.

Table 4: Comparison of our method with the others on AS1, AS2, and AS3.

Table 5: Comparison of our method with the others on MSR Action 3D.

Table 6: Results on Florence 3D Actions dataset.

Figure 7: Results on UTKinect Action dataset.