Human Activity Recognition as Time-Series Analysis

We propose a system that can recognize daily human activities with a Kinect-style depth camera. Our system utilizes a set of view-invariant features and the hidden state conditional random field (HCRF) model to recognize human activities from the 3D body pose stream provided by the MS Kinect API or OpenNI. Many high-level daily activities can be regarded as having a hierarchical structure in which multiple subactivities are performed sequentially or iteratively. To model these high-level daily activities effectively, we utilize a multiclass HCRF model, which is a kind of probabilistic graphical model. In addition, to obtain view-invariant but more informative features, we extract joint angles from the subject's skeleton model and then perform a feature transformation to obtain three different types of features regarding motion, structure, and hand positions. Experiments on two different datasets, KAD-30 and CAD-60, verify the high performance of our system.


Introduction
Vision-based activity recognition has found many applications such as human-computer interaction [1,2], surveillance [3,4], robot learning [5,6], and user interface design [7,8]. Recently, many researchers have used depth cameras such as Microsoft Kinect to detect human activities. Unlike conventional RGB cameras, Kinect-style depth cameras provide depth information in addition to the colors of the target object. Depth information can be used to estimate the 3D body poses of a human and to recognize his or her activities in real time. In this paper, we propose a system that can effectively recognize daily human activities with a Kinect-style depth camera. Our system utilizes a set of view-invariant features and the hidden state conditional random field (HCRF) model [9,10] to recognize human activities from the dynamic body pose estimates provided by the MS Kinect API or OpenNI. Many high-level daily activities can be regarded as having a hierarchical structure, where multiple subactivities are performed sequentially or iteratively. Our system utilizes a multiclass HCRF model to represent the hierarchical nature of such activities effectively.
Many existing systems use only the 3D coordinates of individual body joints as the feature set for activity recognition. However, these joint coordinates are easily affected by changes in the Kinect's viewpoint [11,12]. To address the view variance problem and obtain more informative features, our system extracts joint angles from the subject's skeleton model and then performs a feature transformation to obtain three different types of features regarding motion, structure, and hand positions.
The remainder of this paper is structured as follows. In Section 2, we briefly introduce the related works. Section 3 presents a comparison of various probabilistic graphical models including HMM, MEMM, CRF, and HCRF. Section 4 concentrates on the design of our activity recognition system. Section 5 presents the experiments conducted using two different datasets and the results obtained with our system. Finally, Section 6 summarizes our work and outlines future work.

Related Works
The most important factors affecting the performance of vision-based activity recognition systems are the set of features and the recognition model used to capture the unique characteristics of individual activities. Previous works adopt different features and models, resulting in distinct strengths and weaknesses in performance.
In Xia et al.'s work [13], histograms were extracted from the joint coordinates as features using modified spherical coordinate systems in order to overcome the view variance problem. However, for different activities that involve similar joint positions, the system could generate similar histograms, making it difficult to distinguish between the two activities. In that work, activities are modeled with the Hidden Markov Model (HMM). The HMM is a widely used probabilistic graphical model for processing time-series data. However, this model has the limitation that the current observation depends only on the current state, not on any previous states or observations. Moreover, it has another limitation on training efficiency since it requires supervised training to maximize the joint probability of observation and state sequences. On the other hand, in Sung et al.'s work [14], joint angles are used as features instead of the corresponding joint coordinates to overcome the view variance problem. Hierarchical Maximum Entropy Markov Models (MEMMs) are adopted to model the hierarchical nature of activities as well as to enhance the training efficiency. However, MEMMs are well known to suffer from the label bias problem.
In Zhang and Tian's study [15], spatiotemporal features and Support Vector Machines (SVMs) were used to represent activities. However, the features do not address the view variance problem, and SVMs are limited in learning human activity patterns over time in comparison with probabilistic graphical models. In Ong et al.'s work [16], features based on the human range of movement were extracted from joint poses, and k-means clustering, an unsupervised learning method, was applied to recognize daily activities. However, the features of that work are sensitive to camera view variance, and the range of motion of joints may vary from person to person. The system recognizes activities through k-means clustering without training a model. However, k-means clustering has several limitations: the number of clusters must be predetermined, and the resulting clusters may vary depending on the initial clusters.

Probabilistic Graphical Models
Probabilistic graphical models [17] can be considered one of the best ways to represent the hierarchical structure of high-level daily activities, where multiple subactivities are performed sequentially or iteratively. Among the widely used probabilistic graphical models for activity recognition are the Hidden Markov Model (HMM), the Maximum Entropy Markov Model (MEMM), and the Conditional Random Field (CRF), as shown in Figures 1(a)-1(c), respectively.
The HMM in Figure 1(a) is a generative graphical model in which the target system to be modeled is assumed to be a Markov process. In the figure, the variables $x_t$, $s_t$, and $y$ represent the observation, the hidden state, and the class label, respectively. This model assumes that the conditional probability distribution of the hidden variable $s_t$ at time $t$ depends only on the value of the hidden variable $s_{t-1}$. Similarly, it assumes that the value of the observation variable $x_t$ depends only on the value of the hidden variable $s_t$. This means that the HMM presumes that the observations are independent of one another given the hidden states. Therefore, this model cannot represent long-range dependencies among observations. Additionally, it has another limitation on training efficiency since it requires supervised training to maximize the joint probability of observation and state sequences.
The MEMM in Figure 1(b) is a discriminative graphical model that combines the features of the HMM and the Maximum Entropy (MaxEnt) model. An advantage of the MEMM over the HMM is that it provides increased freedom in choosing features to represent observations. Another advantage is that training can be considerably more efficient. In the MEMM, the parameters of the maximum-entropy distributions used for the transition probabilities can be estimated for each transition distribution in isolation. However, the MEMM has the drawback that it potentially suffers from the label bias problem, in which states with low-entropy transition distributions effectively ignore their observations.
The CRF model in Figure 1(c) is a discriminative undirected graphical model. In the figure, $x$ represents the observation sequence and $y_t$ represents the random variable which, conditioned on $x$, obeys the Markov property. The CRF model can contain any number of feature functions, and the feature functions can inspect the entire observation sequence $x$. This means that the CRF model avoids the independence assumption between observations and allows nonlocal dependencies between state and observation [18]. Moreover, this model has no label bias problem, in contrast with the MEMM. However, the CRF model must assign a label $y_t$ to each time $t$ and does not directly provide a way to estimate the conditional probability of a class label $y$ for an entire sequence $x$.
The HCRF model shown in Figure 1(d) is a generalized CRF model with hidden states $s_t$. It incorporates hidden state variables in a discriminative multiclass random field model. By allowing a classification model with hidden states, no a priori segmentation into substructures is needed, and labels at individual observations are optimally combined to form a class conditional estimate. As an augmentation of the CRF, this model can represent long-range dependencies among observations without the label bias problem. The HCRF model was introduced by Quattoni and Gunawardana and has been successfully applied to gesture recognition and phone classification [9,10]. Given these advantageous characteristics, we believe that the HCRF model can also be applied successfully to vision-based daily activity recognition.

Activity Recognition System
We design a system that can recognize high-level daily activities based on the 3D body pose data acquired from Microsoft's Kinect API. A high-level daily activity can be regarded as a hierarchical activity structure consisting of multiple subactivities that are performed sequentially or iteratively. For example, the activity of picking up an object on the floor consists of three successive subactivities: stooping down, grasping the object, and standing up, as described in Figure 2.
For the purpose of our research work, we collected training data of such high-level daily activities to construct the KAD-30 dataset. The KAD-30 dataset consists of 10 activities in total: opening a lid, drinking water, tying shoelaces, stretching, eating cereal, making a phone call, grasping an object on the floor, putting on and taking off a coat, cleaning the floor, and writing on a whiteboard. The proposed activity recognition system consists of three steps: feature extraction, model learning, and activity recognition.

Feature Extraction.
In this step, view-invariant features are extracted from the 3D position data of 15 joints of the human body, including the head, neck, and torso, and from two sets of joint orientation data that correspond to the head and torso. As mentioned before, the 3D joint positions are provided directly by Microsoft's Kinect API, which estimates them from the depth images acquired by the Kinect sensor. However, the 3D position $(x, y, z)$ of each joint provided by the Kinect API is expressed in a Cartesian coordinate system whose origin $(0, 0, 0)$ is at the center of the Kinect sensor. Thus, the 3D position of a joint changes whenever either the Kinect sensor or the subject changes its position. This means that the 3D joint coordinates acquired directly from the Kinect API are very sensitive to the Kinect's view variance, and so they are not suitable features for distinguishing daily human activities robustly under various environmental conditions. Figure 3 illustrates the view variance problem. As shown in the figure, if the Kinect's view is changed, the position value of the same elbow joint captured by the Kinect sensor also changes.

To address the view variance problem and obtain more informative features, our system extracts joint angles from the subject's skeleton model and then performs a feature transformation to obtain three different types of features regarding motion, structure, and hand positions. While one of the daily activities is performed, each joint of the performer moves according to a specific pattern over time. These temporal patterns of joint movement can be captured effectively by motion features. In addition, daily activities are performed through multiple interactions between distinct joints. For example, grasping an object on the floor is mainly accomplished through interaction between the joints of the knee and the hand. We try to capture these spatial patterns through structure features. Many daily human activities also involve hand movement, since humans rely heavily on their hands in daily life. Consider, for example, drinking water and opening the lid of a container. Hand position features, which represent the positions of both hands relative to the head and the torso, can help distinguish daily activities that use the hands.

Figure 4 illustrates the process of extracting the motion and the structure features. As shown in Figure 4, the 3D Cartesian coordinates of the form $(x, y, z)$ are first transformed into 2D spherical coordinates of the form $(\theta, \varphi)$ for each joint, where $\theta$ is the polar angle and $\varphi$ is the azimuthal angle of the joint. The following equation shows how to compute the polar angle $\theta$ and the azimuthal angle $\varphi$ from the corresponding 3D joint coordinates $(x, y, z)$. In the equation, $r$ is the radial distance, which is the Euclidean distance from the origin to the joint. In our work, the radial distance $r$ is omitted and only the polar angle $\theta$ and the azimuthal angle $\varphi$ are used to extract features through subsequent processes:

$$r = \sqrt{x^2 + y^2 + z^2}, \qquad \theta = \arccos\frac{z}{r}, \qquad \varphi = \arctan\frac{y}{x} \tag{1}$$

From the transformed 2D spherical coordinates $j_{t,n}$ of each joint $n$, motion features $m_{t,n}$ and structure features $s_{t,n}$ are calculated through the following equations. Below, $t$ and $n$ refer to the frame and joint indexes, respectively:

$$m_{t,n} = j_{t,n} - j_{t-1,n}, \qquad s_{t,n} = j_{t,n} - j_{t,k} \quad (k \neq n) \tag{2}$$

The motion features $m_{t,n}$ of joint $n$ are obtained from the $t$th input frame by computing the difference between the current position $j_{t,n}$ and the previous position $j_{t-1,n}$ of the joint $n$. Hence the motion features $m_{t,n}$ represent the positional change of each joint $n$ from the $(t-1)$th frame to the $t$th frame. On the other hand, the structure features $s_{t,n}$ of joint $n$ are extracted from the $t$th input frame by computing the difference between the current position $j_{t,n}$ of the joint $n$ and the current position $j_{t,k}$ of another joint $k$. If the joint $n$ is, for example, the center of the head, the joint $k$ can be any one of the other joints, such as the neck or the torso. Hence the structure features $s_{t,n}$ represent the position of the joint $n$ relative to the other joint $k$ at the $t$th frame. It is assumed that the position $j_{t,n}$ of each joint $n$ at frame $t$ has already been transformed into 2D spherical coordinates $(\theta_{t,n}, \varphi_{t,n})$ in the aforementioned way.
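The motion and structure features described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the polar angle is assumed to be measured from the z axis, `atan2` is used for the azimuthal angle so that all quadrants are handled, and structure features are assumed to be computed for every unordered pair of joints.

```python
import math
from itertools import combinations


def to_spherical_angles(x, y, z):
    """Return (theta, phi) for a 3D joint position; the radial distance is dropped."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return (0.0, 0.0)  # assumption: a joint at the origin maps to zero angles
    cos_theta = max(-1.0, min(1.0, z / r))  # clamp against rounding error
    theta = math.acos(cos_theta)            # polar angle (assumed from the z axis)
    phi = math.atan2(y, x)                  # azimuthal angle, quadrant-aware
    return (theta, phi)


def motion_features(prev_angles, curr_angles):
    """Per-joint angle change between frame t-1 and frame t."""
    return [(c[0] - p[0], c[1] - p[1]) for p, c in zip(prev_angles, curr_angles)]


def structure_features(curr_angles):
    """Angle difference for every unordered joint pair (n, k) in one frame."""
    return [
        (curr_angles[n][0] - curr_angles[k][0], curr_angles[n][1] - curr_angles[k][1])
        for n, k in combinations(range(len(curr_angles)), 2)
    ]
```

With 15 joints, `motion_features` yields 15 angle pairs and `structure_features` yields 105 angle pairs per frame. The sketch takes plain differences of angles and does not handle wraparound of the azimuthal angle at ±π.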
Figure 5 describes the process of extracting the hand position features. The position features of each hand are obtained by computing its positions relative to both the head and the torso. For example, while the relative position features $h_{t,\text{left},\text{head}}$ of the left hand with respect to the head are computed through (3), its relative position features $h_{t,\text{left},\text{torso}}$ with respect to the torso are calculated through (5). Similarly, the relative position features $h_{t,\text{right},\text{head}}$ and $h_{t,\text{right},\text{torso}}$ of the right hand are computed through (4) and (6), respectively. In the equations, $j_{t,\text{left hand}}$, $j_{t,\text{right hand}}$, $j_{t,\text{head}}$, and $j_{t,\text{torso}}$ represent the 3D position vectors of the left hand, the right hand, the head, and the torso, respectively, while $R_{t,\text{head}}$ and $R_{t,\text{torso}}$ are the $3 \times 3$ orientation matrices of the head and the torso, respectively:

$$h_{t,\text{left},\text{head}} = (j_{t,\text{left hand}} - j_{t,\text{head}}) * R_{t,\text{head}} \tag{3}$$
$$h_{t,\text{right},\text{head}} = (j_{t,\text{right hand}} - j_{t,\text{head}}) * R_{t,\text{head}} \tag{4}$$
$$h_{t,\text{left},\text{torso}} = (j_{t,\text{left hand}} - j_{t,\text{torso}}) * R_{t,\text{torso}} \tag{5}$$
$$h_{t,\text{right},\text{torso}} = (j_{t,\text{right hand}} - j_{t,\text{torso}}) * R_{t,\text{torso}} \tag{6}$$
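A minimal sketch of the hand position features follows. It assumes that the offset between a hand and its reference joint is treated as a row vector multiplied on the left of the 3 × 3 orientation matrix, which is how the order of the operands in the equations reads; the function and argument names are hypothetical.

```python
def relative_position(point, reference, orientation):
    """(point - reference) as a row vector times a 3x3 orientation matrix."""
    offset = [point[i] - reference[i] for i in range(3)]
    return [sum(offset[i] * orientation[i][j] for i in range(3)) for j in range(3)]


def hand_position_features(left_hand, right_hand, head, torso,
                           head_orientation, torso_orientation):
    """Concatenate the four relative hand positions into one 12-value list."""
    features = []
    features += relative_position(left_hand, head, head_orientation)     # left hand w.r.t. head
    features += relative_position(right_hand, head, head_orientation)    # right hand w.r.t. head
    features += relative_position(left_hand, torso, torso_orientation)   # left hand w.r.t. torso
    features += relative_position(right_hand, torso, torso_orientation)  # right hand w.r.t. torso
    return features
```

Because each offset is expressed in the local frame of the head or the torso, the resulting values do not depend on where the sensor is placed, which is the view-invariance property the features are meant to provide.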
In general, the higher the number of feature vector dimensions, the higher the computational complexity required for model learning and activity recognition. The feature vectors acquired from the feature extraction process have 252 dimensions. To increase the efficiency of model learning and activity recognition, vector quantization is performed by applying k-means clustering to these high-dimensional feature vectors. Through vector quantization, each high-dimensional feature vector is replaced by an integer index indicating the cluster to which the feature vector belongs. As a result, a one-dimensional integer time series is generated for each performance of an activity. Because the length of the time series is determined by how long the activity takes to perform, sequences of different lengths are generated for different activities. The subsequent processes of the proposed activity recognition system, model learning and activity recognition, use this time-series feature data of each activity for model training and testing.
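The vector quantization step can be illustrated with a small pure-Python k-means (Lloyd's algorithm). This is a sketch under assumptions: the codebook size, the number of iterations, and the random initialization are not given in the text and are chosen here only for illustration. The helper `feature_dimension` shows one decomposition that is consistent with the stated 252 dimensions (two angles per joint for motion, two angles per unordered joint pair for structure, and four 3D relative hand positions); the text itself does not itemize the count.

```python
import random


def feature_dimension(num_joints=15):
    """One decomposition consistent with 252 dimensions (an assumption)."""
    motion = num_joints * 2
    structure = (num_joints * (num_joints - 1) // 2) * 2
    hands = 4 * 3
    return motion + structure + hands


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(centroids, vector):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda c: squared_distance(centroids[c], vector))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd iterations; initial centroids are k randomly sampled vectors."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def quantize(sequence, centroids):
    """Replace every feature vector in a sequence by its cluster index."""
    return [nearest(centroids, v) for v in sequence]
```

Calling `quantize` on the frames of one recorded activity turns a sequence of feature vectors into the one-dimensional integer sequence that the later steps consume.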

Model Learning.
As mentioned before, many high-level daily activities can be regarded as having a hierarchical structure, where multiple subactivities are performed sequentially or iteratively. Our system utilizes the hidden state conditional random field (HCRF) model to represent the hierarchical nature of such activities effectively. In order to recognize a number of activities with a single trained model, our system uses a multiclass HCRF model. A state variable in this HCRF model represents a subactivity belonging to a high-level activity, and it is assumed to be hidden. Therefore, no a priori segmentation of an activity into subactivities is needed.

The objective function $L(\theta)$ depends on the potential function $\Psi(y, s, x; \theta)$, parameterized by $\theta$, which measures the compatibility among a label, a set of observations, and a configuration of the hidden states. Using the gradient ascent method, the optimized parameters $\theta^*$ that maximize the objective function $L(\theta)$ are found, as in the following equation:

$$\theta^* = \arg\max_{\theta} L(\theta)$$

The number of hidden states $h$ and the history size $\omega$ are determined in advance in order to train the HCRF model. In our system, the number of hidden states of the HCRF model is set to 7, considering the complexity of the target activities. The history size, which determines the dependency range, is set to 1. Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) is used as the optimization method to adjust the weights of the feature vectors in the HCRF model.

Activity Recognition.
In the activity recognition step, the conditional probability of each activity, $P(y \mid x, \omega, \theta^*)$, is calculated using the trained HCRF model $\theta^*$, the history size $\omega$, and the test sequence data $x$. The test data $x$ is then recognized as the activity $y^*$ with the highest conditional probability, as in the following equation:

$$y^* = \arg\max_{y \in Y} P(y \mid x, \omega, \theta^*)$$
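The recognition rule can be made concrete with a brute-force sketch that enumerates every hidden state sequence of a short, quantized observation sequence. Several things here are assumptions for illustration only: the potential is taken to be a sum of state-observation weights (`theta_s`), class-state weights (`theta_y`), and class-transition weights (`theta_e`) along a chain, the history size is 0, and all function names are hypothetical. Exhaustive enumeration is exponential in the sequence length, so a practical system would use belief propagation instead.

```python
import math
from itertools import product


def potential(y, states, x, theta_s, theta_y, theta_e):
    """Score of class y, hidden state sequence `states`, and observations x."""
    score = 0.0
    for t, s in enumerate(states):
        score += theta_s[s][x[t]]  # compatibility of state s with codeword x[t]
        score += theta_y[y][s]     # compatibility of class y with state s
    for s_prev, s_next in zip(states, states[1:]):
        score += theta_e[y][s_prev][s_next]  # class-dependent transition weight
    return score


def class_posteriors(x, num_classes, num_states, theta_s, theta_y, theta_e):
    """P(y | x) for every class, summing over all hidden state sequences."""
    unnormalized = []
    for y in range(num_classes):
        total = 0.0
        for states in product(range(num_states), repeat=len(x)):
            total += math.exp(potential(y, states, x, theta_s, theta_y, theta_e))
        unnormalized.append(total)
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def recognize(x, num_classes, num_states, theta_s, theta_y, theta_e):
    """Return the most probable class index and the full posterior list."""
    post = class_posteriors(x, num_classes, num_states, theta_s, theta_y, theta_e)
    best = max(range(num_classes), key=lambda y: post[y])
    return best, post
```

Here `x` is a list of integer codewords such as the output of vector quantization, and the three parameter tables are nested lists indexed as shown in `potential`.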

Performance Evaluation
Based on the design explained above, our activity recognition system was implemented using C++ and MATLAB on Windows 7. Several experiments were conducted to evaluate the performance of the proposed activity recognition system. In the experiments, two different datasets are used: the KAD-30 dataset from Kyonggi University and the CAD-60 dataset from Cornell University. Figure 7 shows the 10 common daily activities included in the KAD-30 dataset. The activities in the KAD-30 dataset are opening a lid, drinking water, tying shoelaces, stretching, eating cereal, making a phone call, picking up an object on the floor, putting on and taking off a coat, wiping the floor, and writing on a whiteboard. To collect the KAD-30 dataset, 3 different subjects performed the 10 different activities ten times in front of the Kinect sensor. The 3D body pose data for each activity were recorded for 30 to 40 seconds at 30 frames per second.
Figure 8 shows the 12 daily human activities in the CAD-60 dataset provided by Cornell University. The activities included in the CAD-60 dataset are brushing teeth, cooking (stirring), writing on a whiteboard, working on a computer, talking on the phone, wearing contact lenses, relaxing on a couch, opening a pill container, drinking water, cooking (chopping), talking on a couch, and rinsing the mouth.
To analyze the performance of our activity recognition system, three different experiments were conducted using the KAD-30 and CAD-60 datasets. In the first experiment, we compared the recognition performance of two different HCRF models: the one-versus-all HCRF model and the multiclass HCRF model. A one-versus-all HCRF model is able to distinguish only one activity from the others. In order to recognize $N$ different activities, a total of $N$ one-versus-all HCRF models need to be learned. On the other hand, a single multiclass HCRF model can be learned to recognize all $N$ activities. In addition, we conducted the experiment with different history sizes $\omega$ to analyze the effect of long-range dependency, setting $\omega = 0$ for one model and $\omega = 1$ for the other. Table 1 summarizes the results of the experiment comparing the recognition performance of the one-versus-all HCRF model and the multiclass HCRF model. The multiclass HCRF model performs better than the one-versus-all HCRF model. The performance of the HCRF models improved significantly when the history size was increased, which indicates that incorporating long-range dependencies was useful.
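The role of the history size can be illustrated by showing which observations each time step is allowed to see. The text does not spell out how the history is applied, so this sketch assumes a symmetric window that covers the given number of frames before and after the current one, with `None` as padding at the sequence boundaries; a past-only window would simply drop the right half. The function name is hypothetical.

```python
def windowed_observations(sequence, history):
    """For each time step, collect the observations within `history` frames of it."""
    windows = []
    for t in range(len(sequence)):
        window = []
        for offset in range(-history, history + 1):
            i = t + offset
            # None marks a position that falls outside the sequence
            window.append(sequence[i] if 0 <= i < len(sequence) else None)
        windows.append(tuple(window))
    return windows
```

With a history size of 0 each hidden state is scored against the current codeword only, whereas a history size of 1 also exposes the neighbouring codewords, which is one way a model can pick up dependencies beyond a single frame.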
In the second experiment, we analyzed the recognition performance per activity of the multiclass HCRF model. For this experiment, we set the history size $\omega$ of the multiclass HCRF model to 1. Figure 9 shows two confusion matrices

Figure 2: The hierarchical structure in an activity.

Figure 3: The view variance problem.

Figure 4: Extracting the motion and structure features.

Figure 6: Learning parameters of the HCRF model.

Figure 6 shows the process of learning the optimized parameters $\theta^* = \langle \theta_e, \theta_y, \theta_s \rangle$ of the HCRF model. The parameter vector $\theta^*$ is made up of three different components: $\theta_e$, $\theta_y$, and $\theta_s$. The component $\theta_s$ refers to the parameters corresponding to state $s_j$. Similarly, $\theta_y$ stands for the parameters corresponding to class $y$ and state $s_j$, and $\theta_e$ refers to the parameters corresponding to class $y$ and the pair of states $s_j$ and $s_k$. In order to learn the optimized parameters $\theta^*$ from the initial parameters $\theta$, training data of the form $(x_n, y_n)$ are used, where $x_n$ is an observation sequence and $y_n$ is its activity class label. In the model learning process, the parameters $\theta^*$ that maximize the objective function $L(\theta)$ on the training dataset are searched for. The first term of the objective function $L(\theta)$ includes the conditional probability $P(y_n \mid x_n, \theta)$. The conditional probability $P(y \mid x, \theta)$ of a class label $y$ given the observation $x$ is defined as in the following equation:

$$P(y \mid x, \theta) = \sum_{s} P(y, s \mid x, \theta) = \frac{\sum_{s} e^{\Psi(y, s, x; \theta)}}{\sum_{y' \in Y,\, s \in S^m} e^{\Psi(y', s, x; \theta)}} \tag{7}$$

Table 1: Performance comparisons between two different HCRF models.