Enhanced Human Action Recognition Using Fusion of Skeletal Joint Dynamics and Structural Features



Introduction
The rapid growth in hardware and software technologies has resulted in the continuous generation of a huge amount of video data through video capturing devices such as smartphones and CCTV cameras. Also, a large amount of video content is being uploaded to YouTube every minute. Therefore, it is very important to extract useful information from these huge video databases and to recognize high-level activities for various applications such as automated surveillance systems, human-computer interaction, sports video analysis, real-time patient/children monitoring, shopping-behavior analysis, and dynamical systems [1]. Hence, human action recognition (HAR) from videos is an active area of research that has attracted the attention of several researchers in recent years.
Human action recognition focuses on detecting and tracking people and, in particular, on understanding human behaviors from a video sequence. The research in this area focuses mainly on the development of techniques for automated visual surveillance systems. It requires a combination of computer vision and pattern recognition algorithms. In the literature, the terms activity, behavior, action, gesture, and 'primitive/complex event' are frequently used to describe essentially the same concepts. HAR is challenging because of intraclass variation and interclass similarity. The same activity may vary from subject to subject, which is known as intraclass variation. Without contextual information, different activities may look similar, which leads to interclass similarity, for example, playing and running. There are many other challenges in HAR, such as multisubject interactions, group activities, and complex visual backgrounds. The two main approaches used for HAR are based on global descriptors and local descriptors. The local descriptors are robust to noise and can be applied to a wide range of action recognition problems. However, in recent years, skeleton-based approaches have been widely used due to the availability of depth sensors. Several datasets are available for the evaluation of action recognition algorithms. They vary in terms of the number of classes, sensors used, duration of action, viewpoint, complexity of the action performed, and so on. In this work, we address the problem of action recognition using a skeleton-based approach.
Contributions: (a) We propose a method for human action recognition based on encoded joint angle information and the joint displacement vector. (b) A neural network-based method to perform score-level fusion for action classification is proposed. (c) We experimentally show that the proposed method can be applied on datasets containing the skeletal joint information acquired using Kinect sensors and also on datasets where explicit pose estimation needs to be done. Thus, the proposed method can be used with a vision-based sensor or a Kinect sensor. The rest of the paper is organized as follows. Section 2 gives an overview of the existing techniques for human action recognition. Section 3 describes the proposed approach. The experimental results are demonstrated in Section 4. The conclusions and discussions are given in Section 5.

Review of Existing Techniques
Human activities can be broadly classified into four categories: gestures, actions, interactions (with objects and others), and group activities. Early approaches developed in the 1990s mainly focused on identifying gestures and simple actions based on motion analysis. A detailed review of motion analysis-based techniques is presented by Aggarwal and Cai [2]. However, the motion analysis-based methodologies were found to be less robust as they were insufficient to describe human activities containing complex structures. Therefore, an improved approach was discussed by Aggarwal and Ryoo [3], who focused on methodologies to perform high-level activity recognition designed for the analysis of human actions, interactions, and group activities.
Ben-Arie et al. [4] have proposed a technique to perform human action recognition by computing a set of pose and velocity vectors for body parts such as hands, legs, and torso. These features are stored in a multidimensional hash table to achieve indexing- and sequence-based voting. Kellokumpu et al. [5] proposed another approach based on a texture descriptor by combining motion and appearance cues. The movement dynamics are captured using temporal templates, and the observed movements are characterized using texture features. A spatiotemporal space is considered, and the human movements are described with dynamic texture features. Also, the use of motion energy features for human activity analysis is presented by Gao et al. [6]. The motion energy template is constructed for the video using a filter bank, and the actions are classified using SVM. Xu et al. [7] have proposed a hierarchical spatiotemporal model for human activity recognition. The model consists of a two-layer hidden conditional random field (HCRF), where the bottom layer is used to describe the spatial relations in each frame, and the top layer uses high-level features for characterizing the temporal relations throughout the video sequence. The bottom layer also provides high-level semantic representations. A learning algorithm is used, and human activities are identified. To improve the robustness of the action recognition task, a combination of features consisting of dense trajectories and motion boundary histogram descriptors has been used by Wang et al. [8]. The descriptor captures different kinds of information such as shape, appearance, and motion to address the problem of camera motion.
Deep learning models have gained popularity because of their superior performance in the field of pattern recognition and computer vision research. A review by Guo et al. [9] highlights the important developments in deep neural models. Ji et al. [10] proposed a 3D CNN model for human action recognition. The features are extracted from both spatial and temporal dimensions using 3D convolutions, thus capturing discriminative features. In another work, Wang et al. [11] proposed a technique where the spatiotemporal information obtained from 3D skeleton sequences is encoded into multiple 2D images forming Joint Trajectory Maps (JTMs), and ConvNets are applied to accomplish the action recognition task. As Joint Distance Maps (JDMs) describe texture features which are less sensitive to view variations, Li et al. [12] have developed an approach for action recognition by encoding spatiotemporal information of skeleton sequences into color texture images. Then, using convolutional neural networks, the discriminative features are obtained from the JDMs for achieving both single-view and cross-view action recognition. Hou et al. [13] have proposed a method for effective action recognition based on skeleton optical spectra (SOS), where discriminative features are learned using convolutional neural networks (ConvNets). The spatiotemporal information of a skeleton sequence is effectively captured using skeleton optical spectra. This method is more suitable in the case of limited annotated training video data. Wang et al. [14] have presented a detailed survey of recent advances in RGB-D based motion recognition using deep learning techniques. In another approach, Rahmani et al. [15] have developed an improved version of a deep learning model based on nonlinear knowledge transfer model learning, achieving invariance to viewpoint change. A general codebook is generated using k-means to encode the action trajectories, and then the same codebook is used for encoding action trajectories of real videos. Li et al. [16] have used multiple deep neural networks to achieve multiview learning for three-dimensional human action recognition. These multiple networks help to effectively learn the discriminative features and also capture spatial and temporal information. The recognition scores of all views are combined using multiply fusion. Xiao et al. [17] have introduced an end-to-end trainable architecture-based model for human action recognition. The model consists of deep neural networks and attention models for learning spatiotemporal features from the skeleton data. Li et al. [18] have proposed an approach for skeleton-based human action recognition. A deep model, namely, 3DConvLSTM, is used to learn spatiotemporal features from the video sequences, and an attention-based dynamic map is built for action classification.
An approach for online action recognition has been proposed by Tang et al. [19] based on a weighted covariance descriptor by considering the importance of frame sequences with respect to their temporal order and discriminativeness. The combination of nearest neighbour search and Log-Euclidean kernel-based SVM is used for classification. In another work, an optical acceleration-based descriptor has been used by Edison and Jiji [20] for human action recognition. Two descriptors have been computed for effectively capturing the motion information, namely, the histogram of optical acceleration and the histogram of spatial gradient acceleration. An approach based on the rank pooling method was introduced by Fernando et al. [21] for action recognition, which is capable of capturing both the appearance and the temporal dynamics of the video. A ranking function generated by the ranking machine provides important information about actions. In another work, Wang et al. [22] have presented a technique for action recognition based on order-aware convolutional pooling, focusing mainly on effectively capturing the dynamic information present in the video. After extracting features from each video frame, a convolutional filter bank is applied to each feature dimension, and then filter responses are aggregated. Hu et al. [23] introduced a new approach for early action prediction based on soft regression applied on RGB-D channels. Here, the depth information is considered to achieve more robustness and discriminative power. Finally, a Multiple Soft labels Recurrent Neural Network (MSRNN) model is constructed, where feature extraction is done based on the Local Accumulative Frame Feature (LAFF). Some more approaches for action recognition can be found, which have been developed based on sparse coding, Yang and Tian [24]; exemplar modeling, Hu et al. [25]; max-margin learning, Zhu et al. [26]; Fisher vector, Wang and Schmid [27]; and block-level dense connections, Hao and Zhang [28].
Through a literature survey, it is found that several techniques have been proposed for human action recognition. A detailed review on action recognition research is reported by Ramanathan et al. [29], Gowsikhaa et al. [30], and Fu [31].
Many approaches are available in the literature for human action recognition. Most of the existing techniques use either the local features extracted temporally or the skeleton representation of the human pose in the temporal sequence. However, the combination of temporal features and spatial features provides a better recognition rate. In this direction, we propose a method to recognize human action based on the combination of appearance and temporal features at the classifier decision level.

Proposed Work
In this work, we propose a method for human action recognition by considering the structural variation feature and the temporal displacement feature. The proposed method extracts features from the pose sequence in a given video. Figure 1 depicts the methodology of the proposed system. We extract the structural variation feature by detecting the angle made between the joints during an action. There are several methods available to estimate the pose. Some of the pose estimation techniques found in the literature are based on sensor readings, and other methods are based on vision-based techniques.

Pose Estimation for Action Recognition.
The OpenPose library [32,33] is one of the well-known vision-based libraries used to extract the skeletal joints. OpenPose detects the joint locations less accurately than sensor-based methods. It uses a VGG-19 deep neural network model to estimate the pose. The COCO model [34] consists of 18 skeletal joints, whereas the BODY_25 model gives 25 skeletal joint locations. In our experiments, we have used OpenPose to estimate the pose for the KTH dataset; however, for the other datasets, the pose information is taken from sensor readings. In the following section, we present the idea of structural feature extraction.

Structural Variation Feature Extraction.
Let us consider the skeleton represented by a set of points $\{j_1, j_2, \ldots, j_{n_s}\}$ indicating the estimated joint locations in the 2D image. Our goal is to obtain the angle between the joint $j_k$ and a set of joints $J = \{ j_i \mid i = 1, 2, \ldots, n_s,\ i \neq k \}$, which contributes to the structural variation in the skeleton. In a video having $N$ frames, the angle $\theta^k_{ij}$ is found, where $\theta^k_{ij}$ represents the angle between the joints $j_i$ and $j_j$ in the $k$th frame [equation missing]. Each component $v_p$ of the feature vector is obtained by binarizing an angle against a threshold $T$ [equation missing]. The procedure followed to fix the threshold $T$ is given in Section 3.2.1. The feature vectors $\vec{v}_i$, for $i = 1, 2, \ldots, n_s$, are concatenated to obtain the structural feature vector $\vec{v}$.

Feature Extraction Based on Angle Binning.
The vector $\vec{v}$ does not provide the variation in the angle at a finer level, as it is binarized with a single threshold value. Accordingly, we perform angle binning, where multiple thresholds are used in (3) to quantize the angle to a $b$-bit number, by modifying (2). This captures the angle between the joints at a finer level; at the same time, the quantization helps to suppress minute variations in the angle during an action. The process of feature extraction is shown in Figure 2. The thresholds in (3) are $T_l = l \cdot (\theta_{\max}/b)$, with $0 \le \theta_{\max} \le \pi/2$. The terms $\theta^k_{ij}$ and $f(x, y)$ are defined separately [equation missing].
The temporal variation feature captures the dynamics of each individual joint by tracking it through the frames. This process is explained in the following section.
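The angle binning step can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the authors' implementation: the angle definition in `joint_angle` (the inclination of the line joining two joints, folded into $[0, \pi/2]$), the rule of counting how many thresholds $T_l$ an angle reaches, and all function names are hypothetical choices made for illustration.

```python
import math

def joint_angle(p, q):
    """Inclination of the segment joining 2D joints p and q, folded into [0, pi/2].

    Assumed definition; the source text does not preserve the exact formula.
    """
    dx, dy = abs(q[0] - p[0]), abs(q[1] - p[1])
    return math.atan2(dy, dx)

def quantize_angle(theta, b, theta_max=math.pi / 2):
    """Return how many thresholds T_l = l * theta_max / b (l = 1..b) the angle reaches."""
    return sum(1 for l in range(1, b + 1) if theta >= l * theta_max / b)

def structural_feature(frames, k, b):
    """Quantized angles between joint k and every other joint, concatenated over frames.

    frames is a list of frames; each frame is a list of (x, y) joint locations.
    """
    feature = []
    for joints in frames:
        for i, joint in enumerate(joints):
            if i != k:
                feature.append(quantize_angle(joint_angle(joints[k], joint), b))
    return feature

# Two frames, three joints each, reference joint k = 0, b = 4 thresholds.
frames = [[(0, 0), (2, 1), (1, 2)],
          [(0, 0), (1, 0), (1, 2)]]
print(structural_feature(frames, k=0, b=4))
```

Quantizing in this way maps small changes in an angle to the same level, which is the suppression of minute variations mentioned in the text.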

Temporal Feature Extraction.
The temporal feature extraction looks at the change in the joint location for a joint $j_i$ from the frame $t$ to $t + 1$. We consider the location of the joint $j_i$ in two successive frames to find the relative position of the joint. This is effectively the tracking of the joint location. A histogram of the 2D displacement orientation of the joint location in the X-Y plane is constructed to capture the temporal dynamics. The vector representing the joint displacement is computed for the sequence of video frames. For each displacement vector of joints, we obtain the orientation pair consisting of the orientation angle and the magnitude, represented by $(\theta_i, \rho_i)$.
The orientation angles $\theta_i$, $i = 1, 2, \ldots, n_s$, for all the joints are used as the temporal features. The angle $\theta^t_i$ for joint $j_i$ at time instance $t + 1$ is computed from the displacement of the joint between the frames $t$ and $t + 1$ [equation missing]. The feature vector $\vec{f}_i$ for every joint location $j_i$ collects these angles over the frames of the video [equation missing]. A $k$-bin histogram is created for every joint $j_i$ from the feature vector $\vec{f}_i$. These histograms are concatenated to form a temporal feature vector $\vec{f}$ representing an action. The joint locations are sparse when compared to the traditional optical flow-based methods. Thus, the feature extraction process is computationally more efficient.
Figure 1: Overall methodology of action classification.
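The construction of the temporal feature can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions rather than the authors' implementation: the orientation is assumed to be the `atan2` of the frame-to-frame displacement, the histogram is assumed to cover $[-\pi, \pi)$ with equal-width bins and raw counts, and the function names are hypothetical.

```python
import math

def displacement_orientations(track):
    """Orientation angle of each frame-to-frame displacement of one joint.

    track is the list of (x, y) locations of the joint over successive frames.
    """
    return [math.atan2(y1 - y0, x1 - x0)
            for (x0, y0), (x1, y1) in zip(track, track[1:])]

def orientation_histogram(angles, k):
    """k-bin histogram of angles over [-pi, pi) with equal-width bins (raw counts)."""
    hist = [0] * k
    for a in angles:
        idx = min(int((a + math.pi) / (2 * math.pi) * k), k - 1)
        hist[idx] += 1
    return hist

def temporal_feature(frames, k):
    """Concatenate the per-joint orientation histograms into one feature vector."""
    n_joints = len(frames[0])
    feature = []
    for i in range(n_joints):
        track = [frame[i] for frame in frames]
        feature.extend(orientation_histogram(displacement_orientations(track), k))
    return feature

# Three frames, two joints: joint 0 moves along (+1, +1), joint 1 along (-1, -1).
frames = [[(0, 0), (5, 5)],
          [(1, 1), (4, 4)],
          [(2, 2), (3, 3)]]
print(temporal_feature(frames, k=4))
```

Only one angle per joint per frame pair is computed, which reflects the sparsity argument made in the text when comparing with dense optical flow.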
3.4. Score-Level Fusion Using Neural Network.
We combine the structural features and the temporal features at the score level. For every sample $j$, the classifier $i$ assigns a score ranging from $-\infty$ to $+\infty$. The score is the signed distance of the observation $j$ to the decision boundary. A positive score indicates that the sample $j$ belongs to class $i$. A negative score indicates that it does not, and its magnitude gives the distance of $j$ from the decision boundary. The score-level fusion is performed using a neural network. The neural network assigns significance scores to the classifiers based on structural and temporal features. The structural features are less discriminative for describing actions having similar body part movements such as walking, running, and jogging. The optimal fusion of temporal and structural features would help in better recognition.
To generalize the classifier fusion, we consider a multiclass classification problem with $c$ classes and $n$ classifiers. In our case, we have used scores from two SVM classifiers for fusion. The class prediction score for a sample $j$ from the $i$th classifier is the vector $\vec{x}_{ij} = [x^{(1)}_{ij}, x^{(2)}_{ij}, \ldots, x^{(c)}_{ij}]$, where each $x^{(t)}_{ij}$ is a prediction score corresponding to the class $t$. The input to the neural network for the sample $j$ is the concatenation $\vec{v}_j = [\vec{x}_{1j}, \vec{x}_{2j}, \ldots, \vec{x}_{nj}]$, where $\vec{v}_j \in \mathbb{R}^{nc}$. The predicted label at the output layer of the neural network is given by $\vec{y}'_j \in \mathbb{R}^{c}$. To get the optimal fusion score, we need to solve the objective function given in (11) for the $N$ training samples in the action recognition dataset [equation (11) missing], where $\vec{y}_j$ is the actual label at the output layer for the sample $j$.
For a neuron $k$ in the hidden layer $t$, the output $\theta_k$ of the neuron is obtained by applying $\sigma(\cdot)$, the sigmoid activation function, to its inputs weighted by the synaptic weights from the previous layer to the neuron $k$ [equation missing]. For a neuron $p$ at the output layer, the predicted label $o_p$ is obtained from $\vec{v}^{\,l}_j$, the input from the last hidden layer $l$, weighted by the synaptic weights from the last hidden layer to the output neuron $p$ [equation missing]. $S(\cdot)$ is the softmax function, which gives the output of this layer, $\vec{y}'_j$, for a sample $j$ [equation missing]. The neural network uses the backpropagation algorithm to learn the network parameters. An example of the neural network architecture used in the proposed model is shown in Figure 3.
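The score-level fusion network can be illustrated with a small Python sketch. It is a minimal sketch under stated assumptions, not the authors' code: a single hidden layer without bias terms, a mean cross-entropy loss, full-batch gradient descent, and the toy scores, layer size, learning rate, and function names are all hypothetical choices made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the maximum for numerical stability
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def forward(v, W1, W2):
    """Return (hidden activations, class probabilities) for one fused score vector v."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, v))) for row in W1]
    probs = softmax([sum(w * h for w, h in zip(row, hidden)) for row in W2])
    return hidden, probs

def train_fusion(V, Y, n_hidden=4, epochs=50, lr=0.1, seed=0):
    """Backpropagation with full-batch gradient descent; returns weights and per-epoch loss."""
    rng = random.Random(seed)
    n_in, n_out = len(V[0]), len(Y[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    losses = []
    for _ in range(epochs):
        g1 = [[0.0] * n_in for _ in range(n_hidden)]
        g2 = [[0.0] * n_hidden for _ in range(n_out)]
        loss = 0.0
        for v, y in zip(V, Y):
            hidden, probs = forward(v, W1, W2)
            loss -= sum(t * math.log(q) for t, q in zip(y, probs))
            d_out = [q - t for q, t in zip(probs, y)]  # gradient of softmax + cross-entropy
            d_hid = [hidden[k] * (1.0 - hidden[k]) *
                     sum(d_out[p] * W2[p][k] for p in range(n_out))
                     for k in range(n_hidden)]
            for p in range(n_out):
                for k in range(n_hidden):
                    g2[p][k] += d_out[p] * hidden[k]
            for k in range(n_hidden):
                for i in range(n_in):
                    g1[k][i] += d_hid[k] * v[i]
        losses.append(loss / len(V))  # loss measured before this epoch's update
        for p in range(n_out):
            for k in range(n_hidden):
                W2[p][k] -= lr * g2[p][k] / len(V)
        for k in range(n_hidden):
            for i in range(n_in):
                W1[k][i] -= lr * g1[k][i] / len(V)
    return W1, W2, losses

# Toy data: n = 2 classifiers, c = 3 classes, so each fused input holds n * c = 6 scores.
V = [[1.2, -0.8, -1.0, 0.9, -0.5, -1.1],
     [-0.7, 1.1, -0.9, -0.6, 0.8, -1.0],
     [-1.0, -0.9, 1.3, -0.8, -0.7, 1.0]]
Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W1, W2, losses = train_fusion(V, Y)
_, probs = forward(V[0], W1, W2)
print(losses[0], losses[-1], probs)
```

The first three entries of each input play the role of the scores of the structural classifier and the last three those of the temporal classifier; the learned weights then determine how much each score contributes to the fused decision.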

Experiments and Results
To demonstrate the performance of the proposed model, we carried out experiments on three publicly available datasets, namely, KTH [35], UTKinect [36], and MSR Action3D [37]. The KTH dataset requires explicit pose estimation, whereas the UTKinect dataset contains the pose information captured using Kinect sensors. The source code of our implementation is available at https://github.com/muralikrishnasn/HARJointDynamics.git.

Datasets.
The KTH dataset contains six action types performed by 25 subjects under four different conditions. The skeletal joint information is not included in the dataset, unlike the other datasets used in the experiment. The UTKinect dataset is acquired using a Kinect sensor. The dataset contains skeletal joint information for 10 types of actions performed by 10 subjects, repeated twice per action. The MSR Action3D dataset contains skeleton data for 20 action types, performed by 10 subjects, where each action is performed 2 to 3 times. The dataset contains 20 joint locations per frame captured using a sensor similar to the Kinect device.

Experimental Setup and Results.
In our experiments, the OpenPose library [33] is used to estimate the pose for the KTH dataset. A pretrained network with the BODY_25 model is used in our experiments. The parameters of the experiment have been set as described in [35]. The deep neural network to detect the joints is executed on a Tesla P100 GPU. Support Vector Machine (SVM) classifiers are trained on the structural and temporal features. The predicted scores from these SVM classifiers are combined using a neural network. We used a radial basis function kernel in the SVM classifiers. A simple feed-forward network with a sigmoid function at the hidden layers and softmax output neurons is used to solve (11). In the experiments, the neural network has been trained for 50 epochs. A plot of epochs versus cross-entropy is shown in Figure 4. The results of the experiments are shown in Figures 5(a)-5(c), summarizing the confusion matrices for structural features, temporal features, and the score-level fusion, respectively. It can be seen that the misclassifications are between highly similar actions like running and jogging. The proposed model has achieved an accuracy of ≈ 90.3% on the KTH dataset.
We have conducted experiments on the UTKinect dataset in a similar manner to that shown in [36,44]. The confusion matrix considering the structural features is presented in Figure 5(d). The results for temporal features and score-level fusion using the neural network are shown in Figures 6(a) and 6(b). The accuracy of the proposed method on the UTKinect dataset is ≈ 91.3% with a deviation of ±1.5.
The experiment on the MSR Action3D dataset has been conducted using the cross-subject test as described in [37], unlike the leave-one-subject-out cross-validation (LOOCV) method given in [40]. The actions are grouped into three subsets: AS1, AS2, and AS3. AS1 and AS2 have less interclass variation, whereas AS3 contains complex actions. The obtained results are listed in Table 1.
A summary of the results from all the three datasets is reported in Table 2. From Table 2, it is observed that the proposed method outperforms the existing methods for human action recognition. We used similar classifier settings for the other datasets in the experiment. The experimental results show that the proposed method outperforms some of the state-of-the-art techniques for all the three datasets considered in the experiments. For the MSR Action3D dataset, our method gives an accuracy of ≈ 90.33% with a deviation of ±2.5, which is better than the listed methods in Table 2 by more than 5 percentage points. Moreover, the fusion of classifiers shows better performance than the single classifier.

Table 2
Dataset | Method | Accuracy
KTH dataset [35] | STIP, Schüldt [38] | 73.6%
KTH dataset [35] | Efficient motion features [39] | 87.3%
KTH dataset [35] | Proposed method | 90.30%
UTKinect dataset [36] | Histogram of 3D joints [40] | 90.92%
UTKinect dataset [36] | Random forest [41] | 87.90%
UTKinect dataset [36] | Proposed method | 91.30%
MSR Action3D dataset | Histogram of 3D joints [40] | 78.97%
MSR Action3D dataset | Eigen joints [42] | 82.30%
MSR Action3D dataset | Joint angle similarities [43] | 83.53%

Influence of Quantization Parameter b and Histogram Bins k on Accuracy.
The performance of SVM classifier-1 shown in Figure 1 is analyzed by varying the quantization parameter $b$. The number of bits $b$ used in quantization versus accuracy is plotted in Figure 7. It is observed that the parameter has no influence on the results beyond $b = 8$ for the KTH and MSR Action3D datasets. However, the optimal value of $b$ for the UTKinect dataset is 16. This is due to the variations in the range of data values for the location coordinates.
A plot of the number of bins $k$ in the joint displacement feature versus the accuracy is given in a separate figure. The displacement vectors provide complementary information to the joint angles. Most of the pose estimation algorithms fail to detect the joints that are hidden due to occlusion or self-occlusion. Normally, the pose estimation algorithms result in a zero value for such joint locations. These hidden joint locations act as noise and may degrade the performance of the action recognition algorithm.

Analysis of Most Significant Joints.
In the KTH dataset, the hand-waving action is mainly due to the movement of joints j3 to j8. The other joints do not contribute to the action. The most important joints involved in an action are depicted in Figure 13. It can be observed that the actions walking, running, and jogging have similar characteristics in terms of angular movements. This is very useful in identifying any outliers while detecting abnormalities in actions. (Dominant joints with respect to angular movement for other datasets are included in Figures 14-17.)
The accuracy of the proposed system has been analyzed using two types of combiners: a trainable combiner using a neural network and a fixed combiner using score averaging [45]. This is shown in Figure 18. The neural network is a better combiner as it is able to find the optimal weights for the fusion, whereas score averaging works as a fixed combiner that gives equal importance to both classifiers and shows lower accuracy. The neural network-based fusion enhances the performance in terms of accuracy. It can be seen from Figures 19-23 that the fusion technique results in better performance. The correlation analysis is performed on the output of the two SVM classifiers. The result is listed in Table 3. The analysis shows that the average correlation is less than 0.5. This indicates that the classifiers moderately agree on the classification. Consequently, the fusion of these scores leads to improvement in the overall accuracy of the system.
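The fixed combiner and the agreement analysis can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions: the correlation measure is assumed to be the Pearson coefficient (the text does not name the measure used for Table 3), and the scores and function names are hypothetical.

```python
import math

def average_scores(scores_a, scores_b):
    """Fixed combiner: equal-weight average of two classifiers' per-class scores."""
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]

def predict(scores):
    """Index of the class with the highest score."""
    return max(range(len(scores)), key=lambda t: scores[t])

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-class scores of the structural and temporal SVMs for one sample.
structural = [0.4, -0.2, -0.9]
temporal = [-0.1, 0.6, -0.8]
fused = average_scores(structural, temporal)
print(predict(structural), predict(temporal), predict(fused))
print(pearson(structural, temporal))
```

In this toy sample the two classifiers disagree and the average sides with the temporal classifier; a trainable combiner can instead learn unequal weights, which is the advantage attributed to the neural network in the text.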

Conclusions
We have developed a method for human action recognition based on skeletal joints. The proposed method extracts structural and temporal features. The structural variations are captured using the joint angle, and the temporal variations are represented using the joint displacement vector. The proposed approach is found to be simple as it uses single-view 2D joint locations and yet outperforms some of the state-of-the-art techniques. Also, we showed that, in the absence of a Kinect sensor, a pose estimation algorithm can be used as a preliminary step. The proposed method shows promising results for action recognition tasks when temporal features and structural features are fused at the score level. Thus, the proposed method is suitable for robust human action recognition tasks.

Data Availability
The references of the datasets used in the experiment are provided in the reference list.

Conflicts of Interest
The authors have no conflicts of interest regarding the publication of this paper.

Table 3: Correlation analysis of the classifier output to find the classifier agreement for the SVM classifiers shown in Figure 1.