Extract the Relational Information of Static Features and Motion Features for Human Activities Recognition in Videos

Both static features and motion features have shown promising performance in human activity recognition tasks. However, the information included in these features is insufficient for complex human activities. In this paper, we propose extracting the relational information of static features and motion features for human activity recognition. The videos are represented by a classical Bag-of-Words (BoW) model, which has proven useful in many works. To get a compact and discriminative codebook with small dimension, we employ the divisive algorithm based on KL-divergence to reconstruct the codebook. After that, to further capture strong relational information, we construct a bipartite graph to model the relationship between words of different feature sets. Then we use a k-way partition to create a new codebook in which similar words are grouped together. With this new codebook, videos can be represented by a new BoW vector with strong relational information. Moreover, we propose a method to compute new clusters from the divisive algorithm's projective function. We test our work on several datasets and obtain very promising results.


Introduction
Recognizing human activities in video automatically is a promising technology in computer vision. It has many application scenarios, such as content-based video retrieval, intelligent video surveillance, human-computer interaction, and e-health. Although many researchers have paid attention to this problem, it remains challenging to recognize human activities in videos because of the great variance caused by illumination change, camera motion, background clutter, and so on.
To recognize human activity patterns from massive videos, researchers extract discriminative features for further processing. Below, we briefly review some works based on static local features and motion features.
Regarding local static features, Lowe [1] proposed a descriptor which is invariant to translation, rotation, and scaling transformations. This descriptor detects interest points in a grey-level image and, at each interest point, accumulates statistics of the local gradient directions of image intensities to give a summarizing description of the image structure in a local neighborhood around that point. In this paper, we use a dense version of the SIFT descriptor, which has been proven to be useful for tasks such as object categorization, texture classification, image alignment, and biometrics [2]. On the other hand, to make use of the color information in images, color-based SIFT descriptors have been proposed [3].
As for local motion features, to capture temporal information in videos, Chen and Hauptmann [4] proposed the Mo-SIFT descriptor, which detects interest points, encodes their local appearance, and explicitly models local motion. Wang et al. [5] proposed an approach to describe trajectories densely. Laptev et al. [6] proposed the STIP descriptor, which computes descriptors of the space-time patch associated with each interest point.
However, local static features and local motion features each contain only partial information about human activities in video. Moreover, we believe that local motion and static features are complementary for action recognition in unrestricted videos [7]. Researchers have paid attention to multimodal fusion for obtaining complementary information [8]. As a work similar to this paper, Liu et al. [7] first extract local static features and local motion features from videos. After that, they use static information to prune motion features and use PageRank to prune local static features. Moreover, they employ the divisive algorithm based on KL-divergence for code word clustering, and the features are mixed during the classification phase.
Computational Intelligence and Neuroscience
A low-dimensional BoW may lose important relational information, while a high-dimensional one may lead to the curse of dimensionality. So, it is necessary to reconstruct the original codebook into a smaller one while losing as little information as possible. To solve this problem, Fulkerson et al. [9] used the Information Bottleneck to obtain meaningful feature clusters, and Pereira [10,11] used distributional clustering of words/features. Each word cluster can then be treated as a single feature, and thus dimensionality can be drastically reduced.
In our work, we use a bimodel to capture hybrid features before the classification phase. We do so for two reasons: (1) capturing relational information in a direct way and (2) markedly reducing the BoW dimensionality.
As our first contribution, we employ the divisive algorithm based on KL-divergence. This algorithm uses the divergence to derive an information loss criterion and is implemented iteratively like k-means. This makes it possible to obtain a compact and discriminative codebook with smaller dimension effectively and efficiently.
As our second contribution, we introduce the use of a bimodel to get a hybrid feature representation. We construct a bipartite graph to model the relationship between codebooks of different feature sets. Then, we use a k-way partition to get a new codebook. With this new codebook, videos can be represented by a new BoW vector with strong relational information. A similar work [12] employs a bimodel to get a joint audiovisual codebook. The bimodel requires the word clusters as input. We propose a method to compute new clusters from the divisive algorithm's output, while Ye et al. [12] generate the new clusters directly. Figure 1 is the flowchart of the proposed system. In this paper, we propose using several techniques to better exploit the relational information of static features and motion features for the human activity recognition task.

Local Static Features.
We densely detect local interest points with a Harris-Laplace detector and use a SIFT descriptor to encode these points. SIFT is invariant to rotation, scale, and illumination change. Moreover, dense SIFT has been proven to be useful for tasks such as object categorization, texture classification, image alignment, and biometrics [2].

Local Motion Features.
In our work, we employ the Dense Trajectory descriptor [5] as the original motion feature. Dense Trajectory first samples points densely at several spatial scales. After that, tracking is performed at the corresponding spatial scale. Finally, descriptors are computed along each trajectory. In this paper, we simply use the default parameters for feature extraction.

BoW Model.
The BoW model has been widely used and has been shown to be efficient in many tasks. This model groups all features into several clusters and uses these clusters to discretize the features of a video. Although BoW is effective and efficient, it may lose information when its dimension is low or lead to the curse of dimensionality when its dimension is high. We detail our codebook reconstruction technique for this problem in the next section.
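The discretization step can be sketched as follows. This is a minimal illustration, assuming that the codebook is already available as a list of cluster centers and that each descriptor is assigned to its nearest center by Euclidean distance; the function names `nearest_word` and `bow_histogram` and the toy data are hypothetical and are not taken from the packages used in this paper.

```python
import math

def nearest_word(descriptor, codebook):
    """Return the index of the codeword closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(codebook):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bow_histogram(descriptors, codebook):
    """Quantize descriptors against the codebook and return an L1-normalized histogram."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Toy 2-D example: three codewords and four descriptors from one video.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]
histogram = bow_histogram(descriptors, codebook)
```

With more codewords the histogram keeps finer distinctions but becomes longer and sparser, which is the trade-off that motivates the codebook reconstruction.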

Codebook Reconstruction.
Although the BoW model is efficient in many computer vision tasks, it has two obvious drawbacks. First, it inevitably loses information when its dimension is low; second, it may lead to the curse of dimensionality when its dimension is high. So, in this paper, we use a two-phase procedure to get the BoW representation. Firstly, we use k-means to get a large codebook. Then, a divisive algorithm based on KL-divergence is employed to reconstruct the initial codebook into a small and discriminative codebook. To make use of the bimodel, we propose a method to compute new clusters from the projective function of the original words.

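Compute Original Words' Projective Function. Suppose $p_1(x)$ and $p_2(x)$ are two probability distributions of a random variable $X$. The Kullback-Leibler (KL) divergence between $p_1(x)$ and $p_2(x)$ is defined as
$$\mathrm{KL}(p_1, p_2) = \sum_{x} p_1(x) \log \frac{p_1(x)}{p_2(x)}.$$
On the other hand, the Jensen-Shannon (JS) divergence is defined as
$$\mathrm{JS}_{\pi}(p_1, p_2) = \pi_1 \mathrm{KL}(p_1, \pi_1 p_1 + \pi_2 p_2) + \pi_2 \mathrm{KL}(p_2, \pi_1 p_1 + \pi_2 p_2),$$
where $\pi_1 + \pi_2 = 1$ and $\pi_1, \pi_2 \geq 0$; the definition extends in the same way to more than two distributions.
Let $C = (c_1, \ldots, c_l)$ represent the activity classes, and let $W = (w_1, \ldots, w_m)$ represent the original codebook. Then the information about $C$ captured by $W$ can be measured by the mutual information $I(C; W)$. Suppose $\hat{W} = (\hat{W}_1, \ldots, \hat{W}_k)$ is the new codebook we get; then we can measure the quality of the new codebook by the loss of MI, which is defined as
$$I(C; W) - I(C; \hat{W}) = \sum_{j=1}^{k} \pi(\hat{W}_j)\, \mathrm{JS}_{\pi'}\left(\{p(C \mid w_t) : w_t \in \hat{W}_j\}\right),$$
where $\pi(\hat{W}_j) = \sum_{w_t \in \hat{W}_j} \pi_t$, $\pi_t = p(w_t)$, and the JS weights are $\pi'_t = \pi_t / \pi(\hat{W}_j)$. And, after some derivation, we can rewrite it as follows:
$$I(C; W) - I(C; \hat{W}) = \sum_{j=1}^{k} \sum_{w_t \in \hat{W}_j} \pi_t\, \mathrm{KL}\left(p(C \mid w_t),\, p(C \mid \hat{W}_j)\right).$$
In this paper, we use an iterative procedure like the k-means algorithm to obtain the optimal new vocabulary using five major steps as follows:
(1) Perform initialization: for every original word $w_t$, assign it to $\hat{W}_j$ ($1 \leq j \leq |C|$) with $j = \arg\max_i p(c_i \mid w_t)$. After that, we get $|C|$ initial word clusters. Then each cluster is split into several groups, which results in $k$ initial clusters, say $\hat{W} = (\hat{W}_1, \ldots, \hat{W}_k)$.
We now discuss the computational complexity of our algorithm. Step (3) of each iteration requires the KL-divergence to be computed for every pair $p(C \mid w_t)$ and $p(C \mid \hat{W}_j)$. This is the most computationally demanding task and costs a total of $O(mkl)$ operations. Moreover, it can be proven that the objective function decreases at every iteration. So, the total time complexity is $O(mkl\tau)$, where $\tau$ is the number of iterations.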
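The k-means-like procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact implementation used in this paper: the loop alternates between recomputing the class distribution of each word cluster and reassigning every word to the cluster with the smallest KL-divergence, the number of new words is assumed to equal the number of classes (so the splitting of the initial clusters is omitted), and the names `divisive_clustering`, `cluster_distributions`, and `mi_loss` as well as the stopping tolerance are hypothetical.

```python
import math

def kl(p, q):
    """KL(p, q) for discrete distributions; infinite if q is zero where p has mass."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def cluster_distributions(cond, prior, assign, k):
    """p(C | cluster j): prior-weighted mean of the member words' class distributions."""
    n_classes = len(cond[0])
    mass = [0.0] * k
    mixed = [[0.0] * n_classes for _ in range(k)]
    for t, j in enumerate(assign):
        mass[j] += prior[t]
        for c in range(n_classes):
            mixed[j][c] += prior[t] * cond[t][c]
    # Empty clusters get None and are skipped during reassignment.
    return [[v / mass[j] for v in mixed[j]] if mass[j] > 0 else None for j in range(k)]

def mi_loss(cond, prior, assign, centers):
    """Objective: sum over words of pi_t * KL(p(C | w_t), p(C | cluster of w_t))."""
    return sum(prior[t] * kl(cond[t], centers[j]) for t, j in enumerate(assign))

def divisive_clustering(cond, prior, tol=1e-6, max_iter=100):
    # Initialization as in step (1): each word goes to its most probable class.
    assign = [max(range(len(p)), key=lambda c: p[c]) for p in cond]
    k = len(cond[0])  # assumption: k equals the number of classes, so no splitting
    centers = cluster_distributions(cond, prior, assign, k)
    previous = mi_loss(cond, prior, assign, centers)
    for _ in range(max_iter):
        live = [j for j in range(k) if centers[j] is not None]
        # Reassign every word to the cluster with the smallest KL-divergence.
        assign = [min(live, key=lambda j: kl(p, centers[j])) for p in cond]
        centers = cluster_distributions(cond, prior, assign, k)
        current = mi_loss(cond, prior, assign, centers)
        if previous - current < tol:
            break
        previous = current
    return assign, current

# Toy example: four words, two classes, uniform word priors.
cond = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
prior = [0.25, 0.25, 0.25, 0.25]
assign, loss = divisive_clustering(cond, prior)
```

In the toy example the two words that favor the first class end up in one cluster and the other two in the second cluster, with a small remaining loss of mutual information.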

Compute New Clusters.
Given the projection from original words to new words, we need to compute the clusters of the new words. To be specific, let $\hat{C} = (\hat{c}_1, \ldots, \hat{c}_k)$ be the new clusters; each $\hat{c}_j$ in $\hat{C}$ is computed over the training videos, where $N$ is the number of training videos and $h(n, t)$ represents the entry of the $n$th video for the original word $w_t$.

Bimodel Based Relational Information Extraction.
Given two codebooks extracted from the local static features and the local motion features, we need to generate a new codebook which carries as much relational information as possible. In this paper, we propose using a bimodel for this problem. Bimodel has been applied to IR [13] and Cross-View Action Recognition [14] successfully. To further capture strong relational information, we construct a bipartite graph to model the relationship between codebooks of different feature sets. Finally, we use a $k$-way partition to get a new codebook. With this new codebook, videos can be represented by a new BoW vector with strong relational information.
Let $W^{sta} = \{w_i^{sta}\}_{i=1}^{n_{sta}}$ and $W^{mot} = \{w_j^{mot}\}_{j=1}^{n_{mot}}$ be the codebooks of the static feature set and the motion feature set, respectively. We can construct a graph $G = (V, E)$, where $V$ and $E$ represent the vertices and edges, respectively. To be specific, as $G$ is a bipartite graph, $V = V^{sta} \cup V^{mot}$, where each vertex in $V^{sta}$ corresponds to a static word in $W^{sta}$ and each vertex in $V^{mot}$ corresponds to a motion word in $W^{mot}$. Moreover, each edge in $E$ only connects vertices between $V^{sta}$ and $V^{mot}$. The weight matrix of $G$ can be defined as
$$A = \begin{pmatrix} 0 & S \\ S^{T} & 0 \end{pmatrix},$$
where $S$ is a $|V^{sta}| \times |V^{mot}|$ matrix representing the similarity between any pair of words from the two codebooks. In this paper, we use a measurement like TF-IDF to measure the similarity. To be specific, each element of $S$ is defined by formula (9), where $h^{sta}(i)$ denotes the entry of $h^{sta}$ corresponding to the static word $w_i^{sta}$ and $h^{mot}(j)$ denotes the entry of $h^{mot}$ corresponding to the motion word $w_j^{mot}$.
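The assembly of the bipartite weight matrix can be sketched as follows. This is a minimal illustration in which the similarity is assumed to be a plain co-occurrence score, the sum over videos of the product of the static and motion BoW entries; this score is only a stand-in for the TF-IDF-like measurement of formula (9), and the names `cooccurrence_similarity` and `bipartite_weight_matrix` are hypothetical.

```python
def cooccurrence_similarity(static_hists, motion_hists):
    """S[i][j] = sum over videos of h_sta(i) * h_mot(j) (illustrative stand-in)."""
    n_sta, n_mot = len(static_hists[0]), len(motion_hists[0])
    S = [[0.0] * n_mot for _ in range(n_sta)]
    for h_sta, h_mot in zip(static_hists, motion_hists):
        for i in range(n_sta):
            for j in range(n_mot):
                S[i][j] += h_sta[i] * h_mot[j]
    return S

def bipartite_weight_matrix(S):
    """Assemble the block matrix [[0, S], [S^T, 0]]: edges only join static and motion words."""
    n_sta, n_mot = len(S), len(S[0])
    size = n_sta + n_mot
    A = [[0.0] * size for _ in range(size)]
    for i in range(n_sta):
        for j in range(n_mot):
            A[i][n_sta + j] = S[i][j]
            A[n_sta + j][i] = S[i][j]
    return A

# Toy example: two videos, two static words, three motion words.
static_hists = [[1, 0], [0, 2]]
motion_hists = [[3, 0, 0], [0, 0, 1]]
S = cooccurrence_similarity(static_hists, motion_hists)
A = bipartite_weight_matrix(S)
```

The two zero diagonal blocks encode the fact that no edge joins two static words or two motion words.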

Discover Bimodel Words.
After obtaining the bipartite graph between the static feature codebook and the motion feature codebook, we present the details of bimodel word discovery.
(1) Graph Bipartitioning. Given a bipartite graph $G = (V, E)$ with weight matrix $A$, bipartitioning is to partition $V$ into two subsets $V_1$ and $V_2$, where vertices in the same subset have strong relations and vertices in different subsets have weak relations. Formally, graph bipartitioning aims at minimizing the following objective function:
$$\mathrm{cut}(V_1, V_2) = \sum_{i \in V_1,\, j \in V_2} A_{ij}.$$
(2) Efficient k-Way Solution. Actually, finding a bipartitioning of the bigraph can be understood as classifying each vertex into two classes, for example, +1 and −1. Suppose $q_i$ is the projection value of vertex $i$; a good bipartitioning minimizes $(1/4) \sum_{(i,j) \in V \times V} A_{ij} (q_i - q_j)^2$. However, this may lead to a degenerate solution that assigns all vertices to +1 or −1. So, in this paper, we are actually looking for a balanced partition whose objective function looks like the following:
$$Q(V_1, V_2) = \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{weight}(V_1)} + \frac{\mathrm{cut}(V_1, V_2)}{\mathrm{weight}(V_2)}, \qquad \mathrm{weight}(V_c) = \sum_{i \in V_c} \sum_{k} A_{ik}.$$
This problem can be solved by spectral clustering, which first constructs a Laplace matrix as follows:
$$L_{ij} = \begin{cases} \sum_{k} A_{ik}, & i = j, \\ -A_{ij}, & i \neq j. \end{cases}$$
After that, the bipartitioning of $G$ can be provided by the eigenvector corresponding to the second smallest eigenvalue of the generalized eigenvalue problem $Lz = \lambda Dz$, where $D(i, i) = \sum_{k} A_{ik}$. However, with the efficient solution proposed in [13], we can get the optimal bipartitioning without this computational complexity. Suppose we have a matrix $D$, where $D_1^{sta}(i, i) = \sum_{j} S_{ij}$ and $D_2^{mot}(j, j) = \sum_{i} S_{ij}$, and $S$ is the similarity matrix of the bipartite graph, as follows:
$$D = \begin{pmatrix} D_1^{sta} & 0 \\ 0 & D_2^{mot} \end{pmatrix}.$$
Let $\hat{S} = (D_1^{sta})^{-1/2}\, S\, (D_2^{mot})^{-1/2}$; it can be proven that the second eigenvector of $L$ can be expressed in terms of the left and right singular vectors (say $u_2$ and $v_2$) of $\hat{S}$ as follows:
$$z_2 = \begin{pmatrix} (D_1^{sta})^{-1/2} u_2 \\ (D_2^{mot})^{-1/2} v_2 \end{pmatrix}.$$
In the general case, suppose we need to capture $k$ new words containing relational information; the optimal $k$-way partitioning solution is provided by the $\ell = \lceil \log_2 k \rceil$ singular vectors $\mathbf{U} = (u_2, \ldots, u_{\ell+1})$ and $\mathbf{V} = (v_2, \ldots, v_{\ell+1})$.
To be specific, let
$$Z = \begin{pmatrix} (D_1^{sta})^{-1/2}\, \mathbf{U} \\ (D_2^{mot})^{-1/2}\, \mathbf{V} \end{pmatrix};$$
we look for $k$ clusters $Z_1, \ldots, Z_k$ of the rows of $Z$ such that the sum of squares $\sum_{j=1}^{k} \sum_{z_i \in Z_j} \mathrm{distance}(z_i, m_j)^2$ is minimized, where $m_j$ is the center of cluster $Z_j$. Thus, our bimodel based clustering algorithm can be summarized as five basic steps as follows:
Input: training videos.
(1) Construct the bipartite graph, where each element of $S$ is computed as in formula (9).
(5) Run $k$-means on the row vectors of matrix $Z$ to get $k$ clusters.
With new clusters, each video can be represented as a new BoW vector which contains relational information.
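The spectral solution can be sketched for the simplest case of two new words ($k = 2$) as follows. This is a minimal illustration under several assumptions that are not part of the method itself: the second singular vector pair is obtained by power iteration with the first pair deflated instead of a full SVD, the final k-means step is replaced by thresholding the entries of the second eigenvector at zero, every row and column of the similarity matrix is assumed to have a positive sum, and the function name `spectral_bipartition` and the toy similarity matrix are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def spectral_bipartition(S, iterations=200):
    """Label static words (first) and motion words (second) with +1 or -1."""
    n_sta, n_mot = len(S), len(S[0])
    d1 = [sum(row) for row in S]                                      # diagonal of D1
    d2 = [sum(S[i][j] for i in range(n_sta)) for j in range(n_mot)]   # diagonal of D2
    # Normalized similarity S_hat = D1^(-1/2) S D2^(-1/2).
    S_hat = [[S[i][j] / math.sqrt(d1[i] * d2[j]) for j in range(n_mot)]
             for i in range(n_sta)]
    # The leading right singular vector is known in closed form: sqrt(d2), normalized.
    v1 = normalize([math.sqrt(d) for d in d2])
    v = [(-1.0) ** j * (j + 1) for j in range(n_mot)]  # fixed, non-trivial start vector
    for _ in range(iterations):
        # Deflate the leading singular vector, then apply S_hat^T S_hat.
        dot = sum(a * b for a, b in zip(v, v1))
        v = [a - dot * b for a, b in zip(v, v1)]
        u = [sum(S_hat[i][j] * v[j] for j in range(n_mot)) for i in range(n_sta)]
        v = normalize([sum(S_hat[i][j] * u[i] for i in range(n_sta))
                       for j in range(n_mot)])
    u = normalize([sum(S_hat[i][j] * v[j] for j in range(n_mot)) for i in range(n_sta)])
    # Second eigenvector z_2 = (D1^(-1/2) u_2, D2^(-1/2) v_2).
    z = ([u[i] / math.sqrt(d1[i]) for i in range(n_sta)]
         + [v[j] / math.sqrt(d2[j]) for j in range(n_mot)])
    # Assumption: for k = 2 the sign of z_2 replaces the k-means step.
    return [1 if value >= 0 else -1 for value in z]

# Toy example: static words 0 and 1 co-occur with motion words 0 and 1,
# while static word 2 co-occurs almost only with motion word 2.
S = [[5.0, 4.0, 0.1],
     [4.0, 5.0, 0.1],
     [0.1, 0.1, 6.0]]
labels = spectral_bipartition(S)
```

In the toy example, static words 0 and 1 receive the same label as motion words 0 and 1, and static word 2 shares its label with motion word 2, so each group forms one bimodel word. For larger $k$, the same construction would use $\lceil \log_2 k \rceil$ singular vector pairs followed by k-means on the rows of $Z$.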

Experiment and Analysis
3.1. Experiment on Olympic Dataset. The Olympic dataset (Figure 2) contains videos of athletes practicing different sports [15]. As all the videos are crawled from YouTube, there are few artificial constraints, which makes human activity recognition hard. There are 16 sports, including high jump, long jump, and basketball. In our experiment, we use the default split of training videos and testing videos. Figure 2 shows some screenshots from this dataset.
We use the package provided by [3] to extract dense SIFT features. For each video, we extract SIFT from the densely sampled grid with default parameters. In our experiment, we extract nearly 800000 SIFT features. We use the tool provided by [5] to extract Dense Trajectory features with default parameters. Finally, we sample every 100 frames and get about 60000 features for each event. After that, every video is represented by BoW vectors. We use a grid search and 5-fold cross validation to get the optimal parameters for the SVM [16] classifiers.
To demonstrate the effectiveness of our method, we conduct four experiments for comparison. In the first experiment, we simply use dense SIFT to recognize the testing videos. The second experiment uses only Dense Trajectory for recognition. The third experiment extracts the relational information with the bimodel. Finally, to demonstrate the influence of our codebook reconstruction method, we combine the bimodel with the divisive algorithm detailed before; this combination is called the all-in-one algorithm.
As Figure 3 shows, the average accuracy of the bimodel is clearly higher than that of dense SIFT [3] and Dense Trajectory [5]. In most cases, the bimodel based accuracy is higher than or the same as the accuracy of the other two features. This is in accord with our intuition that relational information carries information from both features, which leads to better results. Moreover, as the number of testing videos in "javelin throw" and "snatch" is very small, the single feature based classifiers perform badly, but the bimodel based classifiers can still deal with them. This is because the bimodel relational information contains information that a single feature does not include. Our experimental results show that the all-in-one algorithm performs better than the other three experiments in almost all cases.

Experiment on KTH Dataset. The KTH dataset [17] consists of six human action classes. Each action class is performed by 25 people, and every person repeats each action 4 times under different scenarios. Figure 4 shows some screenshots from this dataset. As Figure 5 shows, we compare our proposed all-in-one method with other state-of-the-art methods. Among them, Laptev et al. [6] used the STIP descriptor. Wang et al. [5] used the Dense Trajectory descriptor at multiple scales. Ye et al. [12] proposed a joint audiovisual bimodal representation using SIFT and STIP features. Zhou et al. [18] proposed a novel structured codebook construction method to encode rich spatial and temporal contextual information for human action recognition.
The results show that the proposed all-in-one method is better than the other methods for the "boxing," "hand-waving," "jogging," and "walking" actions. Meanwhile, we observe that the proposed method performs relatively worse in the "handclapping" class and the "running" class. Because the "running" action looks similar to the "jogging" action except for the speed, and the "handclapping" action looks similar to the "hand-waving" action, more specific information is needed to distinguish them.

Experiment on TRECVID MED Dataset.
TRECVID MED is a challenging task for the detection of complicated high-level events. We test our proposed method on the prespecified evaluation events in TRECVID MED 2016 development dataset [19], which includes 20 events. This dataset consists of 200,000 videos.
As Figure 6 shows, our proposed method takes better advantage of the useful information in the dense SIFT and Dense Trajectory features and gets higher accuracy than the other methods for all events except "parking a vehicle" and "dog show." For the "parking a vehicle" event, our method focuses on the actions of the human, but there is very little human action inside the car. For the "dog show" event, Ye et al. use an audiovisual bimodal representation, and the barking of the dog gives more clues.

Conclusions
In this paper, we present a bimodel for extracting the relational information of local static features and local motion features. To overcome the weakness of the BoW model, we further introduce a divisive algorithm that keeps more information during feature discretization. Our experiments have shown that the original static and motion features are complementary to their relational information.