Efficient Interaction Recognition through Positive Action Representation

This paper proposes a novel approach that decomposes a two-person interaction into a Positive Action and a Negative Action for more efficient behavior recognition. The Positive Action plays the decisive role in a two-person exchange; thus, interaction recognition can be simplified to Positive Action-based recognition, focusing on an action representation of just one person. Recently, a new depth sensor, the Microsoft Kinect camera, has become widely available; it provides RGB-D data with 3D spatial information for quantitative analysis. However, there are few publicly accessible test datasets captured with this camera for assessing two-person interaction recognition approaches. Therefore, we created a new dataset, named K3HI, with six types of complex human interactions: kicking, pointing, punching, pushing, exchanging an object, and shaking hands. Three types of features were extracted for each Positive Action: joint, plane, and velocity features. We used continuous Hidden Markov Models (HMMs) to evaluate both the Positive Action-based interaction recognition method and the traditional two-person interaction recognition approach on our test dataset. Experimental results show that the proposed technique is more accurate than the traditional method and shortens the sample training time, achieving comprehensive superiority.


Introduction
Over the last few decades, human activity analysis has undergone rapid development, receiving increasing attention in many fields such as intelligent surveillance, human-computer interaction, and elder care management [1,2]. Human activity can be categorized according to complexity as partial body action [3], simple action [4], interaction activity [5,6], or group activity [7]. Motivated by the activity classes drawn from [5,6], this paper focuses on two-person interaction recognition of six complex interactions: kicking, pointing, pushing, punching, exchanging an object, and shaking hands.
Much research has been done on two-person interactions [5-10] with respect to the kinds of complex action relationships and human features necessary for recognition. For example, [5] took into account whether one person's hand is above the other's shoulder or whether one person's foot is near the other's torso. Reference [6] used head-pose, arm-pose, leg-pose, and overall body-pose estimation for both people. However, these processes are complex and time consuming, and the recognition results might not be as accurate as a particular application requires. This paper proposes a new definition for interactions based on one person's behavior, called the Positive Action. In this view, one person's action plays the key role in an interaction; thus, two-person interaction recognition can be simplified into Positive Action recognition. This approach is simpler than traditional methods, saves computing time, and improves recognition results.
The recent proliferation of a cheap but effective depth sensor, the Microsoft Kinect [11], has created more opportunities for quantitative analysis of complex human activities. Compared to a traditional video camera, the Kinect has the advantage of synchronous acquisition of color and depth images; with depth maps, 3D information about a scene from a particular point of view is easily computed under diverse conditions [12]. This in turn makes behavior detection easier in badly lit or dark places. For example, Figure 1(a) shows a depth image captured by the Kinect in weak light, which clearly shows one person punching at another; Figure 1(b) shows a color image of this interaction captured synchronously with the depth image. With a traditional camera, only RGB images as in Figure 1(b) are collected, with limited value for surveillance and other applications. Unfortunately, there are few publicly accessible test datasets for assessing two-person interaction recognition approaches using the depth sensor. Thus, we created a new dataset for two-person interaction. The first version of this original dataset is available for download at http://www.lmars.whu.edu.cn/profweb/zhuxinyan/DataSetPublish/dataset.html. The Microsoft Kinect sensor produces a new type of data, RGB-D data, which is an improvement on RGB images for human behavior recognition research. Therefore, many researchers have collected their own data, some of which are publicly accessible on the Internet [13-15]. In [16], Sung et al.
produced a dataset comprising twelve unique activities in five realistic domestic environments: office, kitchen, bedroom, bathroom, and living room. The RGBD-HuDaAct video database [17], collected in a lab environment, includes 12 categories of human daily activities: making a phone call, mopping the floor, entering a room, and so forth. The LIRIS human activity dataset contains (gray/RGB/depth) videos showing people performing various activities taken from daily life (discussing, making telephone calls, exchanging an item, etc.); it includes not only the action class but also the spatial and temporal positions of objects in the video. However, these datasets address only individual activities and not two-person interactions [18].
Several more-than-one-person datasets have been created using the Kinect. In [19], the UT Kinect-human detection dataset was created: it contains 98 frames with two people appearing in the scene at different depths and in a variety of poses, including several simple interactions. In addition, [5] chose eight types of two-person interactions to establish another two-person dataset, covering approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. However, this latter dataset is not publicly available on the Internet.
Depth imaging data produced by the Kinect sensor is driving new research on single-person and daily activity recognition. For human activity or behavior representation, the method in [16,20] detected and recognized different activities through body-pose features, hand position features, and motion information, using the Kinect sensor. In [17], Ni et al. proposed depth-extended feature representation methods that obtain superior recognition performance on the RGBD-HuDaAct datasets. Nowozin and Shotton [21] used skeletal features (joint velocities, joint angles, and joint angle velocities) to reduce the latency in recognizing an action.
For human activity or behavior recognition, most efforts use HMM-based approaches. Park and Aggarwal [6] used HMMs for human motion recognition and combined them hierarchically using DBNs (Dynamic Bayesian Networks). Vogler and Metaxas [22] presented parallel HMMs to recognize American Sign Language based on magnetic tracking data, while Wilson and Bobick [23] proposed parametric HMMs to recognize human gestures. HMM-based recognition of more complex sequences is addressed in [24-26]. The method proposed in [24] recognizes motion units from optical flow data; in [25], Li proposed a landmark-point-trajectories-based approach to recognize view-invariant human actions; and Chen et al. [26] presented a star skeleton model to recognize single actions and series of actions.
Presently, there is little human interaction research based on Microsoft Kinect data, and few papers report a complex human activity dataset created to depict two-person interactions [5]. That research concluded that activity recognition using geometric relational features based on the distances between all pairs of joints outperforms other feature choices. Our proposed approach and test dataset extend this research.
The contribution of this paper is twofold: we developed an efficient approach based on Positive Action representation to recognize two-person interactions, and we created a new Kinect-based dataset to test and verify such methods. The rest of this paper is organized as follows. Section 2 presents our interaction dataset; Section 3 details the Positive Action definition and feature extraction method; Section 4 presents the Positive Action-based and traditional interaction recognition methods via HMMs; Section 5 reports experimental results for the two approaches on our test dataset; finally, Section 6 concludes this paper and discusses future work.

K3HI: Kinect-Based 3D Human Interaction Dataset
We collected two-person interactions using a Microsoft Kinect sensor. All videos were recorded in an indoor room while 15 volunteers performed the activities. Some interaction types have already been studied in [5,6]; therefore, we chose other types of relatively complex two-person interactions for our recognition studies.
The most important data in our dataset is the spatial information (3D coordinates) of the two persons' skeletons. To ensure the integrity and continuity of the target data, the original RGB images and depth information were ignored when capturing data. An articulated skeleton for each person was extracted using the OpenNI software [27] and the Natural Interaction (NITE) Middleware provided by PrimeSense [28]. A skeleton is represented by the 3D positions of 15 joints: head, neck, left shoulder, right shoulder, left elbow, right elbow, left hand, right hand, torso center, left hip, right hip, left knee, right knee, left foot, and right foot. However, when two persons overlap, especially in a hugging activity (see Figure 2), full-body tracking with the NITE Middleware can be inaccurate. Because bad and lost tracking seriously affect interaction results, hugging was not considered in our dataset. Finally, six types of two-person interactions were captured: kicking, punching, pointing, pushing, exchanging an object, and shaking hands. Figure 3 visualizes the collected interaction data in the form of skeletons, with different colors representing different actors.

Interactions can be classified into two groups [5,6,10-12]. In the first group, one person acts first and the other person gives a responsive action, for example, kicking, pointing, punching, pushing, and so forth. In the second group, both people perform an almost identical synchronous action, for example, exchanging an object, shaking hands, and so forth. We propose that an interaction can be decomposed into a Positive Action and a Negative Action. For interactions in the first group, the person who acts first, triggering the other person's reaction, performs the Positive Action. In the second group, since both people's behavior is similar and synchronized, we simply define the action with the greater position changes in the first few frames as the Positive Action. In all cases, a Negative Action is defined as the reciprocal action corresponding to a Positive Action in a two-person interaction.

Positive Action Definition. Most existing work on human interactions considers both people, asking what kind of action relationship they have and what kind of features should be chosen to best represent an interaction.
After a Positive Action is identified, complex interaction recognition becomes relatively easy. Figures 4(a)-4(f) show the original two-person interactions tested in [6], while Figures 4(a')-4(f') show the corresponding Positive Actions.

Positive Action Extraction. Next, we obtained the Positive Actions in our dataset by means of mathematical analysis, especially for interactions in the first group as defined in Section 3.1. The window size for each interaction was approximately 25 frames. We kept only the first ten frames, since the action changes in the first few frames are enough to distinguish the Positive Action from the Negative Action. The extraction process for the Positive Action is divided into the following three procedures.
(1) Aligning the Sequence. For an interaction activity, there are always variances in time or frame length when capturing the data. Before discerning a Positive Action, we first select the interactions of the same class and align their sequences using the Dynamic Time Warping (DTW) model, as mentioned in [29]. For each class, we selected a standard interaction sequence suitable for representing the interaction process. We then separately computed the minimal DTW distance between each remaining interaction sequence and the standard interaction sequence of the same class to find the optimal alignment.
In the DTW process, we express the feature vectors of two different sequences (in the same interaction class) as two time series (or frame series) $X^{(1)} = (x^{(1)}_1, \dots, x^{(1)}_{T_1})$ and $X^{(2)} = (x^{(2)}_1, \dots, x^{(2)}_{T_2})$. The alignment cost between the two series is lower when they are similar; that is, if two sequences are well aligned, the minimal DTW distance
$$\mathrm{DTW}\big(X^{(1)}, X^{(2)}\big) = \min_{w} \sum_{(i,j) \in w} \big\lVert x^{(1)}_i - x^{(2)}_j \big\rVert,$$
taken over all admissible warping paths $w$, is small.
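As a concrete illustration, the alignment step can be sketched as a plain dynamic-programming DTW; the function name and the per-frame Euclidean cost are our own assumptions, not code from the paper:

```python
import math

def dtw_distance(seq_a, seq_b):
    """Minimal DTW distance between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    # D[i][j]: minimal accumulated cost aligning seq_a[:i] with seq_b[:j].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(seq_a[i - 1], seq_b[j - 1])   # per-frame cost
            D[i][j] = cost + min(D[i - 1][j],       # insertion
                                 D[i][j - 1],       # deletion
                                 D[i - 1][j - 1])   # match
    return D[n][m]

# A candidate that repeats one frame still aligns perfectly with the
# standard sequence, so its DTW distance is zero:
standard = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidate = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(dtw_distance(standard, candidate))  # -> 0.0
```

In practice each frame vector would hold the skeleton features rather than 2D points, and the standard sequence of each class would be compared against every other sequence of that class.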
(2) Computing Key Joint Position Changes. We selected eight joints as key joints, which best represent changes in the body's motion.

(3) Identifying Positive Action. For actions in the first group defined in Section 3.1, extracting the Positive Action is harder than for the second group. According to the benchmark in [30], human reaction time is around 0.2-0.3 s. Our collected data has 15 frames per second, so the reaction time corresponds to 3-4 frames. This means that in the first group of interactions, when a Positive Action starts, the corresponding Negative Action occurs about 3-4 frames later.
In our Positive Action definition, because the first joint position changes occur within the first few adjacent frames and conform to this benchmark, we compare the maximum position changes of both persons' key joints between the $k$th and $(k+3)$th frames of a sequence. For the standard interaction sequence mentioned in procedure (1), $k = 1$; for the other sequences, $k$ takes a different value after DTW processing. This is expressed as
$$\text{Positive Action} = \arg\max_{p \in \{1,2\}} \Big( \max_{j} d^{(p)}_j(k, k+3) \Big), \quad (6)$$
where $\max_j d^{(1)}_j$ and $\max_j d^{(2)}_j$ denote the maximum position changes of the key joints of person one and person two, respectively. If person one's maximum change $m_1$ exceeds person two's $m_2$, person one performs the Positive Action and person two the Negative Action; otherwise, person two performs the Positive Action. Figure 6 shows the processing results for Positive Actions, ignoring the Negative Actions. Each action has its own distinct characteristics, even for easily confused interactions such as exchanging an object and shaking hands.
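A minimal sketch of this selection rule, under assumed joint names and an assumed frame layout (the paper's exact data structures are not specified):

```python
import math

# Eight key joints whose displacement is tracked (names are assumptions).
KEY_JOINTS = ["l_elbow", "r_elbow", "l_hand", "r_hand",
              "l_knee", "r_knee", "l_foot", "r_foot"]

def max_key_joint_change(frames, k=0):
    """Largest displacement of any key joint between frames k and k+3.

    frames: list of dicts mapping joint name -> (x, y, z) position."""
    return max(math.dist(frames[k][j], frames[k + 3][j]) for j in KEY_JOINTS)

def positive_action(person1_frames, person2_frames, k=0):
    """Return 1 or 2: which person performs the Positive Action, as in (6)."""
    m1 = max_key_joint_change(person1_frames, k)
    m2 = max_key_joint_change(person2_frames, k)
    return 1 if m1 > m2 else 2

# Toy example: person one kicks (right foot moves), person two stays still.
still = [{j: (0.0, 0.0, 0.0) for j in KEY_JOINTS} for _ in range(4)]
kicker = [{j: (0.0, 0.0, 0.0) for j in KEY_JOINTS} for _ in range(4)]
kicker[3]["r_foot"] = (0.5, 0.4, 0.0)   # right foot swings forward
print(positive_action(kicker, still))   # -> 1
```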
Positive Action extraction is much easier in the second group than in the first. According to the definition of the Positive Action for group two, we also use (6); the person with the maximum $d^{(p)}_j(k, k+3)$ performs the Positive Action.
To verify the Positive Action extraction method, we selected the "kicking" action from the first group of interactions and "shaking hands" from the second group, and calculated the position changes using (5) for the first 10 frames. Figure 5 shows the results. From Figure 5(a), it can be seen that as person one's right foot and right knee positions change from the first frame to the third frame, the positions of person two's left and right elbows and hands change from the fourth frame. These changes suggest that when person one starts to kick, person two's upper limbs react a fraction of a second later, so the first person's motion is the Positive Action. Figure 5(b), by contrast, shows no such causal connection between the two behaviors: both persons' right hands and elbows move in a synchronized fashion. In general, the experimental results support our Positive Action extraction method.
The visualization of Positive Actions is shown in Figure 6. Table 1 presents the extraction results for the Positive Action with and without DTW for the first group, illustrating that extraction accuracy is greater after DTW preprocessing.

Feature Extraction.
After Positive Actions are extracted, we utilize several body-pose features for motion-capture data representation and evaluate these features on our test dataset. One of the biggest challenges when using skeleton joints as features is that semantically similar motions may not be numerically similar [31]. To overcome this, [32] used the relational body-pose features introduced in [31], which describe geometric relations between specific joints in a single pose or a short sequence of poses. These relational pose features were used to recognize daily-life activities performed by a single actor in a random forest framework; they include joint, plane, and velocity features.

(i) Joint Features
Joint Distance. Let $p_{j,t} \in \mathbb{R}^3$ be the 3D location of joint $j$ in a Positive Action at time $t \in T$. The joint distance feature $F_{\mathrm{JoiDis}}$ is defined as the Euclidean distance between two joints at time $t$:
$$F_{\mathrm{JoiDis}}(j_1, j_2; t) = \lVert p_{j_1,t} - p_{j_2,t} \rVert,$$
where $j_1$ and $j_2$ are any two joints of a single person ($j_1 \neq j_2$).
Joint Motion. Similar to the joint distance feature, the joint motion feature $F_{\mathrm{JoiMot}}$ is defined as the Euclidean distance between joint positions in neighboring frames:
$$F_{\mathrm{JoiMot}}(j_1, j_2; t) = \lVert p_{j_1,t} - p_{j_2,t-1} \rVert.$$

(ii) Plane Features

Plane Feature. $F_{\mathrm{Plane}}$ captures the distance of a joint from the plane spanned by three other joints:
$$F_{\mathrm{Plane}}(j_1; j_2, j_3, j_4; t) = \operatorname{dist}\big(p_{j_1,t}, \langle p_{j_2,t}, p_{j_3,t}, p_{j_4,t} \rangle\big),$$
where $\langle \cdot \rangle$ denotes the plane through the three joint positions.

Normal Plane Feature. $F_{\mathrm{NorPlane}}$ is similar to the plane feature; it helps to determine if and how far the joint "hand" is raised above the "shoulder". $F_{\mathrm{NorPlane}}$ is defined as
$$F_{\mathrm{NorPlane}}(j_1; j_2, j_3, j_4; t) = \operatorname{dist}\big(p_{j_1,t}, \langle p_{j_2,t}, p_{j_3,t}; p_{j_4,t} \rangle_n\big),$$
where $\langle p_{j_2,t}, p_{j_3,t}; p_{j_4,t} \rangle_n$ denotes the plane with normal vector $p_{j_2,t} - p_{j_3,t}$ passing through $p_{j_4,t}$, and $j_1$, $j_2$, $j_3$, and $j_4$ are different joints.

(iii) Velocity Features
Velocity Feature. $F_{\mathrm{Vel}}$ captures the velocity of one joint along a direction generated by two other joints at time $t$. $F_{\mathrm{Vel}}$ is defined as
$$F_{\mathrm{Vel}}(j_1; j_2, j_3; t) = \Big\langle p_{j_1,t} - p_{j_1,t-1},\; \frac{p_{j_2,t} - p_{j_3,t}}{\lVert p_{j_2,t} - p_{j_3,t} \rVert} \Big\rangle,$$
where $j_1$, $j_2$, and $j_3$ are different joints.
Normal Velocity Feature. $F_{\mathrm{NorVel}}$ is similar to the normal plane feature; it captures the velocity of one joint along the direction of the normal vector of the plane generated by three other joints at time $t$. $F_{\mathrm{NorVel}}$ is defined as
$$F_{\mathrm{NorVel}}(j_1; j_2, j_3, j_4; t) = \big\langle p_{j_1,t} - p_{j_1,t-1},\; \hat{n}_{\langle j_2, j_3, j_4 \rangle} \big\rangle,$$
where $\hat{n}_{\langle \cdot \rangle}$ is the unit normal vector of the plane represented by $\langle \cdot \rangle$, and $j_1$, $j_2$, $j_3$, and $j_4$ are different joints.
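These feature families can be sketched with a few vector helpers; the function signatures and the example joint coordinates below are illustrative assumptions:

```python
import math

def sub(a, b):
    """Component-wise vector difference."""
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def joint_distance(p1, p2):
    """F_JoiDis: Euclidean distance between two joint positions in one frame."""
    return math.dist(p1, p2)

def normal_plane(p1, p2, p3, p4):
    """F_NorPlane: signed distance of joint p1 from the plane with normal
    vector (p2 - p3) passing through p4 (e.g. how far the hand is above
    the shoulder)."""
    return dot(sub(p1, p4), unit(sub(p2, p3)))

def velocity_along(p1_t, p1_prev, p2, p3):
    """F_Vel: displacement of a joint between frames, projected onto the
    direction generated by two other joints."""
    return dot(sub(p1_t, p1_prev), unit(sub(p2, p3)))

# Example: a hand 0.2 m above the shoulder plane whose normal is the
# torso-to-neck direction (all coordinates are made up):
hand, neck, torso, shoulder = (0, 1.8, 0), (0, 1.5, 0), (0, 1.0, 0), (0, 1.6, 0)
print(round(normal_plane(hand, neck, torso, shoulder), 3))  # -> 0.2
```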

Positive Action Recognition via HMM
Hidden Markov Models (HMMs) are widely used for modeling time-series data. Formally, an HMM can be described as a 5-tuple $\Omega = (\Phi, \Sigma, A, B, \pi)$, where $\Phi$ is the set of hidden states and $A$ holds the transition probabilities among states; these probabilities, as well as the starting probabilities $\pi$, are discrete. Every hidden state has a set of possible emissions $\Sigma$ with discrete or continuous emission probabilities $B$. A Gaussian Mixture Model (GMM) is used to represent the observations for each hidden state and to compute their probabilities [33]; the GMM density is a weighted sum of Gaussian densities.
In the training process, the HMM parameters are first initialized: we manually chose the number of observation mixtures $M$ and the number of hidden states $N$, divided the data sequence equally into $N$ parts, and clustered each part using $k$-means to establish the GMM. The Baum-Welch algorithm, also known as the Forward-Backward algorithm, was then used to reestimate the HMM parameters and to compute the output probability of each observation sequence $O^c_i$ (the $i$th sample sequence of action $c$). The sequence probabilities are summed, and the HMM parameters are fixed once $P(O \mid \Omega_c) = \sum_i P(O^c_i \mid \Omega_c)$ reaches its maximum. After training, we have six HMMs, one for each type of action.
During the recognition process, given the data sequence of an unknown action, feature vectors are extracted for each frame. Using the Viterbi algorithm, the likelihood $P_c = P(O \mid \Omega_c)$ of the observation sequence $O$ is computed. We repeated this procedure for the six HMMs generated in the training process, producing the probabilities $P_c$ ($1 \le c \le 6$). By comparing the values of $P_c$, we obtain the maximum likelihood $P_{\max}$, which identifies the type of interaction.
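The recognition step amounts to maximum-likelihood selection over per-class HMMs. The sketch below uses a hand-rolled forward algorithm with single-Gaussian, one-dimensional emissions and made-up parameters for two classes; the paper instead uses Baum-Welch-trained GMM emissions over the full feature vectors:

```python
import math

def gauss(x, mean, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def forward_loglik(obs, pi, A, means, variances):
    """Log-likelihood of obs under a Gaussian-emission HMM (forward algorithm)."""
    n = len(pi)
    alpha = [pi[i] * gauss(obs[0], means[i], variances[i]) for i in range(n)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for t in range(1, len(obs)):
        alpha = [gauss(obs[t], means[j], variances[j]) *
                 sum(alpha[i] * A[i][j] for i in range(n)) for j in range(n)]
        scale = sum(alpha)          # rescaling avoids numerical underflow
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

def classify(obs, models):
    """Pick the action whose HMM assigns the highest likelihood."""
    return max(models, key=lambda c: forward_loglik(obs, *models[c]))

# Two toy 3-state left-to-right models over a 1-D feature (illustrative
# parameters, not trained values):
A = [[0.8, 0.2, 0.0],
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]]
models = {
    "kicking":       ([1.0, 0.0, 0.0], A, [0.0, 1.0, 2.0], [0.1, 0.1, 0.1]),
    "shaking hands": ([1.0, 0.0, 0.0], A, [0.0, 0.2, 0.4], [0.1, 0.1, 0.1]),
}
print(classify([0.0, 0.9, 2.1], models))  # -> kicking
```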

Experimental Results
We used the features extracted from the Positive Actions identified in Section 3.2 to recognize interactions and, for comparison, the features extracted from the original interaction data as in [5]. We then compared and evaluated the recognition results of both approaches. The process of feature extraction and action recognition is illustrated in Figure 7.
In the Positive Action-based approach, the features described in Section 3.3 were classified into three groups: joint features, plane features, and velocity features. In our experiments, we recognized the six kinds of Positive Actions using each feature group as well as the mixed features. There are fifteen joints (each with 3D coordinates) for each action; thus, the dimension of $F_{\mathrm{JoiDis}}$ is $C_{15}^{2} = 105$ for each frame, and $F_{\mathrm{JoiMot}}$ has dimension $C_{15}^{2} \times T$ ($T$ is the total number of frames of an interaction). Considering the larger dimensionality of the plane and velocity features, we selected key joints to characterize them. For plane features, the relationship between the four limbs and the main body is critical; therefore, planes were spanned by triples of seven trunk joints ("head," "neck," "left shoulder," "right shoulder," "torso," "left hip," and "right hip"), with eight limb joints as target joints. In this way, we obtained a lower dimension of $C_{7}^{3} \times 8 = 280$ per frame. However, the feature dimensions were still larger than the number of training samples; thus, Principal Component Analysis (PCA) was used to reduce the dimensionality.
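The quoted dimensions follow directly from binomial counts over the joints involved; the 105 figure appears in the original, while the 280 value is our reconstruction from the stated joint counts:

```python
from math import comb

n_joints = 15
# Joint distance: one value per unordered pair of distinct joints.
joint_distance_dim = comb(n_joints, 2)
print(joint_distance_dim)  # -> 105 per frame

# Plane features: planes spanned by 3 of the 7 trunk joints, each
# evaluated against 8 limb (target) joints.
plane_dim = comb(7, 3) * 8
print(plane_dim)           # -> 280 per frame
```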
To classify interactions, evaluation is done with 4-fold cross-validation: 3 folds are used for training and 1 for testing. Because the 3-state HMM performed much better than the 4- and 5-state HMMs in our experiments, we trained 3-state continuous HMMs with GMM emissions. As expected, the transition probabilities and the observation probabilities turned out to be different for different actions. After training, the HMM parameters are fixed, and the Viterbi algorithm is used to find the maximum-likelihood category. Table 2 shows the experimental results for each kind of feature representation.
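The 4-fold evaluation protocol can be sketched as follows; the round-robin partition and sample IDs are illustrative assumptions:

```python
def k_fold_splits(samples, k=4):
    """Yield (train, test) lists: each fold is the test set exactly once."""
    folds = [samples[i::k] for i in range(k)]   # round-robin partition
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

samples = list(range(12))   # e.g. 12 recorded sequences of one action
for train, test in k_fold_splits(samples):
    assert len(train) == 9 and len(test) == 3   # 3 folds train, 1 tests
```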
For the traditional two-person relationship-based interaction recognition method (called the old approach in the rest of this paper), three kinds of features were likewise extracted from the original captured data (see Figure 3), following [5]. The training and recognition process was identical to that of the Positive Action-based (new) method. Figure 8 compares the recognition results of the two approaches.

The confusion matrices also compare different kinds of features for recognition: joint features include the joint motion and joint distance features; plane features include the plane and normal plane features; velocity features include the velocity and normal velocity features. The average recognition accuracies for (a)-(c) are 78.67%, 66.83%, and 55.67%; the average accuracies for (d)-(f) are 70.00%, 61.67%, and 48.67%. Joint features thus give better recognition results than plane and velocity features, suggesting that geometric relational features based on the distances between joints outperform other feature choices and verifying the conclusions of [5]. Furthermore, in both the old and new approaches there is some confusion between "pointing" and "punching" and between "exchanging an object" and "shaking hands"; these actions are similar, leading to lower recognition accuracy.

Most importantly, the average accuracy of interaction recognition based on Positive Action representation, as proposed in this paper, is 7% higher than that of the two-person relationship-based approach, and for geometric relational features it is almost 10% higher. There are several reasons for these results. First, a two-person feature representation is more complex than a Positive Action-based representation, which introduces unstable factors. Consider the "pointing" interaction with normal plane features: the Positive Action-based method only judges whether one person's "hand" is higher than that person's own "shoulder," whereas the old approach must also relate the pose to the other person's joints. Second, the Positive Action-based representation consumes less training time than the old approach (see Figure 9). In summary, Positive Action-based representation for two-person interaction recognition outperforms the old approach: not only is its recognition accuracy better, but the training time cost is also lower. The new method thus transforms a relatively complex
two-person interaction into a simpler Positive Action, making the recognition procedure more cost-effective while maintaining or even improving recognition quality. Therefore, the newly proposed approach is efficient for interaction recognition.

Conclusion
This paper presented a novel approach to recognizing relatively complex human interactions. Unlike many existing interaction recognition methods, our research focuses on the single actions that are most useful for distinguishing between types of interactions: two-person interaction recognition is transformed into Positive Action-based recognition.
The key contributions of this paper are as follows: (1) we investigated the reciprocal relationships in two-person interactions and proposed a new definition of a single person's behavior, called the Positive Action; (2) two-person interactions were recognized based on the Positive Action representation via continuous HMMs; (3) a new, publicly available test interaction dataset based on the Microsoft Kinect camera was created. Our experimental results demonstrate that the proposed method outperforms older approaches based on two-person relationships.
In the future, we plan to recruit more volunteers to capture additional data and extend our interaction dataset with further interaction categories. More importantly, owing to the limitations of human tracking software such as the NITE Middleware and the Kinect for Windows SDK, inaccurate tracking results occasionally occur. Therefore, we need to find a better way to track human actions, further improving recognition accuracy.
Figures 4(a')-4(f') show the simplified results: the complex interactions are reduced to Positive Action-based representations. The Positive Actions are clearly distinguishable from one another; therefore, only one person's features need to be taken into account, and traditional interaction recognition can be transformed into Positive Action recognition.

Figure 2 :
Figure 2: Bad tracking and lost tracking for a hugging activity. (a) and (b) show the key stages of hugging for two different pairs; the last two images in (a) represent bad tracking of the human bodies, and (b) represents lost tracking of the bodies.

Figure 3 :
Figure 3: Skeleton visualization of the interactions in our dataset. Three key poses were selected to represent the process of each interaction: (a) kicking, (b) punching, (c) pointing, (d) pushing, (e) exchanging an object, and (f) shaking hands.
The key joints are the left and right elbows, hands, knees, and feet. The position changes of the joints are described by the distances between neighboring frames, defined as
$$d^{(t,t+1)}_j = \big\lVert p^{(t+1)}_j - p^{(t)}_j \big\rVert, \quad (5)$$
where $d^{(t,t+1)}_j$ is the Euclidean distance of key joint $j$ between frames $t$ and $t+1$, and $p^{(t)}_j = (x, y, z)$ denotes the 3D position of joint $j$ at frame $t$.

Figure 4 :
Figure 4: A comparison between interactions and Positive Actions. (a)-(f) show the original interaction data in [6], and (a')-(f') are the Positive Actions of one person as described in this paper.

Figure 5 :
Figure 5: Key joint position changes in the two groups of interactions during the first 10 frames. (a) shows the first group of interactions with "kicking" as an example; (b) shows the second group with "shaking hands" as an example.
Figure 8: Confusion matrices of the recognition results. (a)-(c) represent the Positive Action-based approach, and (d)-(f) represent the old approach.

Figure 6 :
Figure 6: Skeleton visualization of Positive Actions. The red skeletons show only the Positive Actions in the two-person interactions; these serve as the interaction representation, and the Negative Actions are ignored in the recognition process.

Figure 7 :
Figure 7: Flow of the interaction recognition system.

Figure 9 :
Figure 9: Average time cost for training samples. The old and new methods are evaluated using three kinds of features: joint, plane, and velocity features.

Table 2 :
Interaction recognition results via Positive Action-based representation.

Table 3 :
Performance of additional classifiers.