Learning a Mid-Level Representation for Multiview Action Recognition

Recognizing human actions in videos is an active topic with broad commercial potential. Most existing action recognition methods assume the same camera view during both training and testing. Consequently, the performance of these single-view approaches may be severely affected by camera movement and viewpoint variation. In this paper, we address this problem by utilizing videos simultaneously recorded from multiple views. To this end, we propose a learning framework based on multitask random forests to exploit a discriminative mid-level representation for videos from multiple cameras. In the first step, subvolumes of continuous human-centered figures are extracted from the original videos. In the next step, spatiotemporal cuboids sampled from these subvolumes are characterized by multiple low-level descriptors. Then a set of multitask random forests is built upon multiview cuboids sampled at adjacent positions to construct an integrated mid-level representation for the multiview subvolumes of one action. Finally, a random forest classifier is employed to predict the action category in terms of the learned representation. Experiments conducted on the multiview IXMAS action dataset illustrate that the proposed method can effectively recognize human actions depicted in multiview videos.


Introduction
Automatic recognition of human actions in videos has become increasingly important in many applications such as intelligent video surveillance, smart home systems, video annotation, and human-computer interaction. For example, detecting suspicious human behavior in time is an essential task in intelligent video surveillance, and identifying falls of older people is of great importance for a smart home system. In recent years, a variety of action recognition approaches [1][2][3][4][5] have been proposed to solve single-view tasks, and some surveys [6][7][8][9][10] review the advances of single-view action recognition in detail. However, real-world videos pose great challenges to single-view action recognition, since the visual appearance of actions can be severely affected by viewpoint changes and self-occlusion.
Different from single-view approaches, which utilize one camera to capture human actions, multiview action recognition methods exploit several cameras to record actions from multiple views and try to recognize actions by fusing multiview videos. One strategy is to handle the problem at the classification level by annotating videos from multiple views separately and merging the predicted labels of all views. Pehlivan and Forsyth [11] designed a fusion scheme for videos from multiple views. They first annotated labels over frames and cameras using a nearest neighbor query technique and then employed a weighting scheme to fuse action judgments into the sequence label. Another group of methods resorts to merging data from multiple views at the feature level. These methods [12][13][14][15] utilize 3D or 2D models to build a discriminative representation of an action based on videos from multiple views. In fact, how to represent an action video with expressive features plays an especially important role in both multiview and single-view action recognition. A video representation with strong discriminative and descriptive ability is able to express human actions reasonably and supply sufficient information to the action classifier, which leads to an improvement in recognition performance.

Advances in Multimedia
This paper presents a multiview action recognition approach with a novel mid-level action representation. A learning framework based on multitask random forests is proposed to exploit a discriminative mid-level representation from low-level descriptors of multiview videos. The input of our method is a set of multiview subvolumes, each of which includes continuous human-centered figures. These subvolumes simultaneously record one action from different perspectives. We then sample spatiotemporal cuboids from the subvolumes at regular positions and extract multiple low-level descriptors to characterize each cuboid. During training, cuboids from multiple views sampled at four adjacent positions are grouped together to construct a multitask random forest by using action category and position as two related tasks, and a set of multitask random forests is constructed in this way. In testing, each cuboid is classified by the corresponding random forest, and a fusion strategy is employed to create an integrated histogram describing the cuboids sampled at a certain position of the multiview subvolumes. Histograms of different positions are concatenated into a mid-level representation for subvolumes simultaneously recorded from multiple views. Moreover, the integrated histogram of multiview cuboids is created in terms of the distributions of both action categories and cuboid positions, which endows the learned mid-level representation with the ability to exploit the spatial context of cuboids. To achieve multiview action recognition, a random forest classifier is adopted to predict the category of the action. Figure 1 depicts the overview of our multitask random forest learning framework.
The remainder of this paper is organized as follows. After a brief overview of related work in Section 2, we describe our method in detail in Sections 3, 4, and 5. Then a description of the experimental evaluation procedure, followed by an analysis of the results, is given in Section 6. Finally, Section 7 concludes the paper.

Related Work
The existing multiview action recognition methods fusing data at feature level can be roughly categorized into two groups, 3D based approaches and 2D based approaches.
Some action recognition methods based on 3D models have shown good performance on several public multiview action datasets. Weinland et al. [12] built 3D action representations based on invariant Fourier analysis of motion history volumes by using multiple view reconstructions. To account for view dependency among cameras and add full flexibility in camera configurations, they designed, in another work [16], an exemplar-based hidden Markov model to characterize actions with 3D occupancy grids constructed from multiple views. Holte et al. [14] combined the 3D optical flow of each view into enhanced 3D motion vector fields, which are described with the 3D Motion Context and the view-invariant Harmonic Motion Context. Generally, 3D reconstruction from multiple cameras requires additional processing such as camera calibration, which leads to high computational cost and reduces flexibility. To overcome the limitations of 3D reconstruction from 2D images, some methods employ depth sensors for multiview action recognition. Hsu et al. [17] addressed the problem of view changes by using RGB-D cameras such as Microsoft Kinect. They constructed a view-invariant representation based on the Spatiotemporal Matrix and integrated depth information into the spatiotemporal feature to improve performance.
In recent years, different methods based on 2D models have been proposed for multiview action recognition. These methods aim to construct discriminative and view-invariant action representations from one or more descriptors. Souvenir and Babbs [13] learned low-dimensional and view-independent representations of actions recorded from multiple views by using manifold learning. In [18], scale and location invariant features are calculated from human silhouettes to obtain sequences of multiview key poses, and action recognition is achieved through Dynamic Time Warping. Kushwaha et al. [19] extracted scale invariant contour-based pose features and uniform rotation invariant local binary patterns for view-invariant action recognition. Sargano et al. [20] learned discriminative and view-invariant descriptors for real-time multiview action recognition by using region-based geometrical and Hu-moments features extracted from human silhouettes. Chun and Lee [15] extracted local flow motion from multiview image sequences and estimated the dominant angle and intensity of optical flow for head direction identification. Then they utilized a histogram of the dominant angle and intensity to represent each sequence and concatenated the histograms of all views as the final feature of multiview sequences. Murtaza et al. [21] developed a silhouette-based view-independent action recognition scheme. They computed Motion History Images (MHI) for each view and employed Histograms of Oriented Gradients (HOG) to extract a low-dimensional description of them. Gao et al. [22] evaluated seven popular regularized multitask learning algorithms on multiview action datasets and treated different actions as different tasks. In their work, videos from each view are handled separately. Hao et al. [23] employed a sparse coding algorithm to transfer the low-level features of multiple views into a discriminative and high-level semantic space and achieved action recognition by a multitask learning approach in which each action is considered as an individual task.
Besides, some other methods employ deep learning techniques to learn discriminative features for multiview action recognition, and several neural networks have been developed to build these deep-learned features directly from the raw data. Lei et al. [24] utilized a convolutional neural network to extract effective and robust action features for continuous action segmentation and recognition under a multiview setup.
The proposed method is also related to our previous work [25], in which a random forest based learning framework is designed for building mid-level representations of action videos. Different from [25], the proposed method aims to solve the problem of multiview action recognition, and an integrated mid-level representation is learned for an action depicted in videos recorded from multiple views. Meanwhile, our multitask random forest learning framework is able to effectively exploit the spatial context of cuboids.
Figure 1: Framework of our method. First, we densely sample spatiotemporal cuboids from subvolumes of $M$ views. Suppose that 24 cuboids are extracted from one subvolume; then six multitask random forests are constructed, each of which is built upon cuboids from multiple views sampled at four adjacent positions. All cuboids are then classified by their corresponding random forests, and an integrated histogram is created to represent the cuboids of all $M$ views sampled at the same position. The concatenation of the histograms for all positions constitutes a mid-level representation of the $M$ input subvolumes. Finally, a random forest is utilized as the action classifier.

Overview of Our Method
Our goal is to recognize a human action by using videos recorded from multiple views. To this end, we propose a novel multitask random forest framework to learn a unified mid-level feature for an action. In order to remove the influence of the background, we first employ a human body detector or tracker to obtain the human-centered figures from a video, and then the video is divided into a series of subvolumes with fixed size, each of which is a sequence of human-centered figures. We densely extract spatiotemporal cuboids (e.g., 15 × 15 × 10) from the subvolumes, and each of them is represented by multiple low-level features.
Our multitask random forest framework utilizes a fusion strategy to get an integrated histogram feature for cuboids sampled at the same position of subvolumes that simultaneously record an action from different views. Concretely, a multitask random forest is built upon cuboids extracted at four adjacent positions of multiview subvolumes, and thus we can construct a set of multitask random forests corresponding to different groups of positions. To exploit the spatial context of cuboids, the position of a cuboid is treated as another task besides action category in the construction of a multitask random forest. Decision trees in a multitask random forest vote on the action category and position of cuboids and generate a single histogram for cuboids sampled at the same position of simultaneously recorded multiview subvolumes, according to the distributions of both action category and cuboid position. The concatenation of the histograms of all positions is normalized to get the mid-level representation for the multiview subvolumes. For multiview action recognition, a random forest classifier is adopted to predict the category of the action.
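Dense cuboid sampling can be illustrated with a minimal sketch. It assumes a subvolume indexed as (height, width, frames) and a regular stride along each axis. The function name `cuboid_origins`, the subvolume size, and the stride are illustrative assumptions; only the 15 × 15 × 10 cuboid size comes from the text.

```python
import itertools
from typing import List, Tuple

Shape = Tuple[int, int, int]


def cuboid_origins(volume: Shape, cuboid: Shape, stride: Shape) -> List[Shape]:
    """List the corner coordinates of all cuboids that fit inside the subvolume."""
    axes = [
        range(0, size - extent + 1, step)
        for size, extent, step in zip(volume, cuboid, stride)
    ]
    return list(itertools.product(*axes))


# Hypothetical 30 x 30 x 10 subvolume with non-overlapping 15 x 15 x 10 cuboids.
origins = cuboid_origins((30, 30, 10), (15, 15, 10), (15, 15, 10))
```

A smaller stride would produce overlapping cuboids and therefore more sampling positions per subvolume.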
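The concatenation and normalization step can be sketched as follows. The text does not state which norm is used, so L1 normalization is an assumption here, and `mid_level_representation` is a hypothetical helper name.

```python
from typing import List, Sequence


def mid_level_representation(histograms: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-position histograms and L1-normalize the result (assumed norm)."""
    concatenated = [value for histogram in histograms for value in histogram]
    total = sum(concatenated)
    if total == 0:
        return concatenated  # nothing to normalize
    return [value / total for value in concatenated]


# Two toy histograms, one per sampling position.
feature = mid_level_representation([[1.0, 1.0], [2.0, 0.0]])
```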

Low-Level Features
Our multitask random forest based framework is general for merging multiple low-level features. In our implementation, we extract three complementary low-level features to describe the motion, appearance, and temporal context of the person of interest. The optical flow feature computed from the entire human figure characterizes global motion information, the HOG3D spatiotemporal descriptor extracted from a single cuboid captures local motion and appearance information, and the temporal context feature expresses the relative temporal location of cuboids. Therefore, the mid-level feature built upon these three types of low-level features is more robust to video variations such as global deformation, local partial occlusion, and diversity of movement speed.
4.1. Optical Flow. The optical flow feature [35] is used to calculate the motion between two adjacent image frames. This motion descriptor performs well in the presence of noise, so it can tolerate the jitter of human figures caused by the human detector or tracker. Given a sequence of human-centered figures, a pixel-wise optical flow feature is calculated at each frame using the Lucas-Kanade algorithm [36].
4.2. HOG3D. HOG3D [37] is a local spatiotemporal descriptor based on histograms of oriented 3D spatiotemporal gradients. It is an extension of the HOG descriptor [38] to video. 3D gradients are calculated with arbitrary spatial and temporal scales, followed by orientation quantization using regular polyhedrons. A local support region is divided into $n_1 \times n_1 \times n_2$ cells. An orientation histogram is computed for each cell, and the concatenation of all histograms is normalized to generate the final descriptor. In this paper, HOG3D descriptors are computed for cuboids densely sampled from human-centered subvolumes. We set $n_1 = 4$ and $n_2 = 3$ and utilize an icosahedron with full orientation to quantize the 3D gradients of each cell. The dimension of the HOG3D feature is therefore 4 × 4 × 3 × 20 = 960.
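The HOG3D dimension quoted above follows from the cell grid and the number of orientation bins. A short check, where the 20 bins correspond to the 20 faces of an icosahedron with full orientation and the function name is hypothetical:

```python
def hog3d_dimension(spatial_cells: int, temporal_cells: int, orientation_bins: int) -> int:
    """Descriptor length: one orientation histogram per cell of the support region."""
    return spatial_cells * spatial_cells * temporal_cells * orientation_bins


# 4 x 4 spatial cells, 3 temporal cells, 20 orientation bins.
dimension = hog3d_dimension(4, 3, 20)
```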

Temporal Context.
The temporal context feature is characterized by the temporal relation among different cuboids and is regarded as a type of low-level feature in this paper. Given a video with $T$ frames, a cuboid $x$ is extracted from a subvolume which contains $T_0$ frames and begins with the $t$th frame. The temporal context of $x$ is described as a two-dimensional vector $[t/T, |t/T - 0.5|]^{\top}$, where $t/T$ represents the temporal position of $x$ in the whole video and $|t/T - 0.5|$ denotes the temporal offset of $x$ relative to the center of the video.
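As a minimal sketch of this feature, the two components can be computed from the index of the first frame of the subvolume and the total number of frames in the video. The function and parameter names are illustrative assumptions.

```python
from typing import List


def temporal_context(start_frame: int, num_frames: int) -> List[float]:
    """Return [relative position, offset from the video center] for a cuboid."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    position = start_frame / num_frames  # relative temporal position in the video
    offset = abs(position - 0.5)         # distance from the temporal center
    return [position, offset]


# A subvolume starting at frame 25 of a 100-frame video.
context = temporal_context(25, 100)
```

Cuboids taken from the same subvolume share the same temporal context, since the feature depends only on where the subvolume starts.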

Multitask Random Forest Learning Framework
We detail the proposed multitask random forest framework in this section. Suppose that an action is recorded by $M$ cameras simultaneously, and we obtain $V$ human-centered subvolumes with fixed size from each video, denoted as $\{\mathrm{vol}^{m,v}\}_{m=1:M,\,v=1:V}$. Then we densely sample $N$ spatiotemporal cuboids from every subvolume with a particular size and stride and denote them by $\{x^{m,v}_{n}\}_{m=1:M,\,n=1:N,\,v=1:V}$, each of which is characterized by multiple low-level features. In order to exploit the spatial context of cuboids, we treat the spatial position of cuboids as another type of annotation and employ cuboids of various action instances extracted at adjacent positions to build a multitask random forest by using both action labels and position labels. The proposed multitask random forest framework constructs an integrated histogram to describe the cuboids $\{x^{m,v}_{n}\}_{m=1:M}$ sampled at position $n$ of the multiview subvolumes $\{\mathrm{vol}^{m,v}\}_{m=1:M}$, and the histograms of all $N$ positions are concatenated to create a unified mid-level representation for the subvolumes $\{\mathrm{vol}^{m,v}\}_{m=1:M}$ that are simultaneously recorded from multiple views.

Construction of Multitask Random Forest. Our training cuboids $\{x^{m,v}_{n,i}, y_i\}_{m=1:M,\,n=1:N,\,v=1:V_i,\,i=1:I}$ are extracted from subvolumes of $I$ action instances, and each video of the $i$th instance generates $V_i$ subvolumes. Cuboids of the $i$th instance share the same action label $y_i$, and the position label of cuboid $x^{m,v}_{n,i}$ is $n$. As shown in Figure 1, we draw cuboids at regular positions of subvolumes and utilize training cuboids sampled at four adjacent positions to construct a multitask random forest. In total, we obtain $G$ random forests, denoted as $\{\mathrm{MTRF}_g\}_{g=1:G}$.
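The grouping of sampling positions into forests can be sketched as below. The text only states that four adjacent positions share one forest, so the 2 × 2 block layout, the row-major indexing, and the 4 × 6 grid are assumptions chosen to reproduce the example of 24 cuboids and six forests.

```python
from typing import List


def group_positions(rows: int, cols: int) -> List[List[int]]:
    """Group row-major grid positions into 2 x 2 blocks of adjacent positions."""
    if rows % 2 or cols % 2:
        raise ValueError("rows and cols must both be even")
    groups = []
    for r in range(0, rows, 2):
        for c in range(0, cols, 2):
            top_left = r * cols + c
            groups.append([top_left, top_left + 1, top_left + cols, top_left + cols + 1])
    return groups


# Hypothetical 4 x 6 grid: 24 positions, one forest per group of four.
groups = group_positions(4, 6)
```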
In order to build decision tree $\mathrm{Tree}_{g,k}$, we randomly sample about 2/3 of the cuboids from the original training set $D_g$ using the bootstrap method and obtain its own training dataset $D_{g,k}$. All of the training cuboids in $D_{g,k}$ go through the tree from the root. We split a node and the training cuboids assigned to it according to a particular feature chosen from a set of randomly sampled feature candidates. Since a cuboid is described by three types of low-level features (i.e., optical flow, HOG3D, and temporal context), two parameters $\alpha \in (0, 1)$ and $\beta \in (0, 1)$ are predefined to control the selection of feature candidates. Specifically, we generate two random numbers $r_{\alpha} \in [0, 1]$ and $r_{\beta} \in [0, 1]$ to decide which type of low-level features is utilized for the node split. If $r_{\alpha} < \alpha$, then a number of optical flow features are randomly selected as feature candidates; otherwise some randomly selected HOG3D features comprise the set of feature candidates. Meanwhile, if $r_{\beta} < \beta$, then all temporal context features are added to the set of feature candidates. Each feature candidate divides the training cuboids at this node into two groups, and the feature candidate with the largest information gain is chosen for the node split. Then the node splits into two child nodes, and each cuboid is sent to one of them. As the multitask random forest takes action category and cuboid position as two classification tasks, a random number $r_{\gamma} \in [0, 1]$ and a prior probability $\gamma \in (0, 1)$ codetermine which task is used to calculate the information gain of the data split.
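The random choices made at each node can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: information gain is computed with Shannon entropy, and all function names and string labels are hypothetical.

```python
import math
import random
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(labels: Sequence[int], goes_left: Sequence[bool]) -> float:
    """Entropy reduction obtained by splitting the labels into two groups."""
    total = len(labels)
    if total == 0:
        return 0.0
    left = [label for label, flag in zip(labels, goes_left) if flag]
    right = [label for label, flag in zip(labels, goes_left) if not flag]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / total
    return entropy(labels) - weighted


def choose_feature_types(alpha: float, beta: float, rng: random.Random) -> List[str]:
    """Pick optical flow or HOG3D candidates, optionally adding temporal context."""
    types = ["optical_flow" if rng.random() < alpha else "hog3d"]
    if rng.random() < beta:
        types.append("temporal_context")
    return types


def choose_task(prior_action: float, rng: random.Random) -> str:
    """Select which task's labels are used to score the split at this node."""
    return "action" if rng.random() < prior_action else "position"
```

At a node, the selected task determines whether action labels or position labels are passed to `information_gain` when the feature candidates are compared.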
A node stops splitting when it reaches the maximum tree depth $\mathrm{dep}_{\max}$ or when all samples arriving at this node belong to the same action category and position, and it is then regarded as a leaf. Two vectors $P = [p_1, p_2, \ldots, p_A]$ and $Q = [q_1, q_2, q_3, q_4]$ are created to store the distributions of action categories and cuboid positions, respectively. Here $p_a$ denotes the posterior probability that a cuboid arriving at the corresponding leaf node belongs to action $a$, and we have $A$ actions in total. Similarly, $q_j$ represents the proportion of cuboids at this leaf node that are extracted from a particular position. Both $P$ and $Q$ of a leaf node are calculated from the training cuboids assigned to it. The construction of a decision tree is summarized in Algorithm 1. The distributions of action categories and cuboid positions voted by all decision trees are then averaged.
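The two leaf vectors and the averaging over trees can be sketched as below. Labels are assumed to be zero-based integers, and the helper names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def leaf_distributions(
    action_labels: Sequence[int],
    position_labels: Sequence[int],
    num_actions: int,
    num_positions: int = 4,
) -> Tuple[List[float], List[float]]:
    """Empirical distributions of actions and positions among a leaf's cuboids."""
    total = len(action_labels)
    if total == 0 or total != len(position_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    action_counts = Counter(action_labels)
    position_counts = Counter(position_labels)
    actions = [action_counts[a] / total for a in range(num_actions)]
    positions = [position_counts[j] / total for j in range(num_positions)]
    return actions, positions


def average_votes(votes: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the distributions voted by several trees."""
    if not votes:
        raise ValueError("at least one vote is required")
    return [sum(column) / len(votes) for column in zip(*votes)]


# Four training cuboids reaching one leaf, with three action classes.
p, q = leaf_distributions([0, 0, 1, 2], [0, 1, 1, 1], num_actions=3)
```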
Following [25], the out-of-bag estimate [39] is employed in the construction of mid-level representations during training to avoid overfitting. As described in Algorithm 1, decision trees $\{\mathrm{Tree}_{g,k}\}_{k=1:K}$ of forest $\mathrm{MTRF}_g$ share an original training set $D_g = \{x^{m,v}_{n,i}, y_i\}_{m=1:M,\,n \in \mathrm{cub}(g),\,v=1:V_i,\,i=1:I}$, and about 2/3 of the cuboids in $D_g$ constitute the bootstrap training set $D_{g,k}$ for tree $\mathrm{Tree}_{g,k}$. The construction of the local descriptor $h(n, v, i)$ for training cuboids $\{x^{m,v}_{n,i}\}_{m=1:M}$ is the same as that for test cuboids, except that tree $\mathrm{Tree}_{g,k}$ does not contribute to $h(n, v, i)$ if it was trained on $x^{m,v}_{n,i}$. Accordingly, (1) is rewritten with an indicator function $\ell$ that marks whether a tree was trained on a given cuboid. Similarly, the mid-level representation $f(v, i)$ of training subvolumes $\{\mathrm{vol}^{m,v}_{i}\}_{m=1:M}$ is created by concatenating the local descriptors of all positions.
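The out-of-bag idea can be illustrated with a minimal sketch in which a training cuboid is described only by trees that did not see it during training. Averaging the remaining votes is an assumption made for illustration, and `out_of_bag_average` is a hypothetical helper.

```python
from typing import List, Optional, Sequence


def out_of_bag_average(
    tree_votes: Sequence[Sequence[float]],
    trained_on_sample: Sequence[bool],
) -> Optional[List[float]]:
    """Average the votes of trees whose bootstrap set excludes the sample."""
    kept = [vote for vote, in_bag in zip(tree_votes, trained_on_sample) if not in_bag]
    if not kept:
        return None  # every tree was trained on this sample
    return [sum(column) / len(kept) for column in zip(*kept)]


# Three trees vote; the first one was trained on the cuboid and is skipped.
votes = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
estimate = out_of_bag_average(votes, [True, False, False])
```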
For a new action instance $\{f(v)\}_{v=1:V}$, all of the decision trees in the random forest vote on the action category of each sample and assign a particular action label $\hat{y}(v)$ to $f(v)$. According to majority voting, we predict the final action category of this instance as
$$y^{*} = \arg\max_{a} \sum_{v=1}^{V} I(\hat{y}(v) = a),$$
where $I(\hat{y}(v) = a)$ is an indicator function; that is, $I(\hat{y}(v) = a)$ is 1 if $\hat{y}(v) = a$ and 0 otherwise.
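Majority voting over the labels assigned to the subvolumes of one instance can be sketched as follows. The tie-breaking rule (the label seen first wins) is an assumption, since the text does not specify one.

```python
from collections import Counter
from typing import Sequence


def majority_vote(subvolume_labels: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label encountered first."""
    if not subvolume_labels:
        raise ValueError("at least one label is required")
    return Counter(subvolume_labels).most_common(1)[0][0]


# Labels predicted for three subvolumes of the same action instance.
predicted = majority_vote(["walk", "wave", "walk"])
```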

Experiments
6.1. Human Action Datasets. Experiments are conducted on the multiview IXMAS action dataset [12] and the MuHAVi-MAS dataset [31] to evaluate the effectiveness of the proposed method.
The IXMAS Dataset. It consists of 11 actions performed by 10 actors, including "check watch", "cross arms", "scratch head", "sit down", "get up", "turn around", "walk", "wave", "punch", "kick", and "pick up". Five cameras simultaneously recorded these actions from different perspectives, that is, four side views and one top view. This dataset presents an increased challenge since actors can freely choose their position and orientation. Thus, there are large inter-view and intra-view viewpoint variations of human actions in this dataset, which makes it widely used to evaluate the performance of multiview action recognition methods.

Experimental Results
We compare our method with state-of-the-art methods on two datasets, and experimental results on the IXMAS dataset and the MuHAVi-MAS dataset are given in Tables 1 and 2, respectively. In our experiments, the leave-one-actor-out cross-validation strategy is adopted on both datasets. We run the random forest classifier 10 times and report the recognition accuracy averaged over the results of the 10 classifiers.
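The leave-one-actor-out protocol can be sketched as below: each actor is held out once for testing while the remaining actors are used for training. The generator name and the toy actor identifiers are illustrative assumptions.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_actor_out(
    actor_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held-out actor, training indices, test indices) for every actor."""
    for held_out in sorted(set(actor_ids)):
        train = [i for i, actor in enumerate(actor_ids) if actor != held_out]
        test = [i for i, actor in enumerate(actor_ids) if actor == held_out]
        yield held_out, train, test


# Three toy samples performed by two actors.
folds = list(leave_one_actor_out(["a", "b", "a"]))
```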

Results on the IXMAS Dataset.
As shown in Table 1, our method significantly outperforms all the recently proposed methods for multiview action recognition, which demonstrates the effectiveness of the proposed learning framework based on multitask random forests. The confusion matrix of the multiview action recognition results is depicted in Figure 2. We can observe from Figure 2 that the proposed method achieves promising performance on most actions, among which four actions (i.e., "sit down", "get up", "walk", and "punch") are correctly recognized. Meanwhile, some actions have similar motion, which may result in misclassification. For example, it is difficult to distinguish the actions "cross arms", "scratch head", and "wave", since they all involve motion of the upper limbs. Similarly, actors crouch in both "sit down" and "pick up", which may explain the misclassification of "pick up".

Method                 Accuracy
[31]                   61.8%
Orrite et al. [32]     75.0%
Cheema et al. [33]     75.5%
Murtaza et al. [34]    81.6%
Our method             91.2%

The prior probability $\gamma$ codetermines which task is used to calculate the information gain of the data split. More concretely, $\gamma$ denotes the probability that the action recognition task is selected at each node. We tune the value of $\gamma$ to investigate how it affects the performance and summarize the action recognition results in Figure 5. It should be noted that the multitask random forest reduces to a random forest that takes action recognition as its only task if $\gamma$ is set to 1. From Figure 5 we can observe that the action recognition results of the multitask random forest (e.g., $\gamma$ = 0.8, 0.9, 0.95) are better than those of the single-task random forest (i.e., $\gamma$ = 1), which demonstrates the effectiveness of our multitask random forest learning framework.

Conclusion
We presented a learning framework based on multitask random forests to exploit a discriminative mid-level representation for videos from multiple views. Our method starts from multiview subvolumes with fixed size, each of which is composed of continuous human-centered figures. Densely sampled spatiotemporal cuboids are extracted from the subvolumes, and three types of low-level descriptors are utilized to capture the motion, appearance, and temporal context of each cuboid. Then a multitask random forest is built upon cuboids from multiple views that are sampled at four adjacent positions, taking action category and position as two tasks. Each cuboid is classified by its corresponding random forest, and a fusion strategy is employed to create an integrated histogram describing the cuboids sampled at a certain position of the multiview subvolumes. The concatenation of the histograms for different positions is utilized as a mid-level representation for subvolumes simultaneously recorded from multiple views. Experiments on the IXMAS action dataset show that the proposed method is able to achieve promising performance.

Figure 5: Action recognition accuracy on the IXMAS dataset for different values of the prior probability $\gamma$ of selecting the action recognition task.
The optical flow vector field $F$ is split into two scalar fields corresponding to the horizontal and vertical components of the flow, $F_x$ and $F_y$. Then $F_x$ and $F_y$ are each half-wave rectified into two nonnegative channels, $F_x^{+}$, $F_x^{-}$ and $F_y^{+}$, $F_y^{-}$, respectively; namely, $F_x = F_x^{+} - F_x^{-}$ and $F_y = F_y^{+} - F_y^{-}$. Each channel is blurred with a Gaussian filter and normalized to obtain the final four sparse and nonnegative channels, $\hat{F}_x^{+}$, $\hat{F}_x^{-}$, $\hat{F}_y^{+}$, and $\hat{F}_y^{-}$, which constitute the motion descriptor of each frame.
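Half-wave rectification can be illustrated with a small sketch on a toy flow component stored as nested lists. With both channels nonnegative, the signed component equals the positive channel minus the negative channel. The Gaussian blurring and normalization steps are omitted, and the names are hypothetical.

```python
from typing import List, Tuple

Field = List[List[float]]


def half_wave_rectify(component: Field) -> Tuple[Field, Field]:
    """Split a signed flow component into nonnegative positive and negative channels."""
    positive = [[max(value, 0.0) for value in row] for row in component]
    negative = [[max(-value, 0.0) for value in row] for row in component]
    return positive, negative


# Toy 2 x 2 horizontal flow component.
flow_x = [[0.5, -1.0], [0.0, 2.0]]
pos, neg = half_wave_rectify(flow_x)

# The signed component is recovered as the difference of the two channels.
recovered = [
    [p - n for p, n in zip(pos_row, neg_row)]
    for pos_row, neg_row in zip(pos, neg)
]
```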

Table 2: Action recognition accuracy of different methods on the MuHAVi-MAS dataset.