Deep ChaosNet for Action Recognition in Videos



Introduction
Human action recognition in videos is an important area of computer vision that has received sustained attention from researchers because of its potential applications, such as video surveillance, entertainment, user interfaces, sports, video understanding, and patient monitoring. Current action recognition methods can be classified into three categories according to the action feature they use: chaos-based features [1], manual features [2], and deep learned features. Inspired by chaos-based and deep learned features, we propose deep ChaosNet for action recognition, which autonomously learns the nonlinear dynamical features of actions in video.

Related Works
In this section, we briefly review the action recognition literature on chaos-based, manual, and deep learned features.

Chaos-Based Feature.
Ali et al. [3] introduced a human action recognition architecture that uses the theory of chaotic systems to model and analyze the nonlinear dynamics of human actions. Trajectories of reference joints are used as the representation of the nonlinear dynamical system that generates the action. Xia et al. [4] proposed a human behavior recognition method based on chaotic invariant features and the relevance vector machine (RVM). The trajectory generated by the motion of the human joint points is extracted to represent the nonlinear system of human action behavior, and the time delay is estimated by the C-C method. The chaotic invariants representing human behavior are then extracted, and the RVM algorithm is used to identify the behavior. Venkataraman and Turaga [1] proposed using a descriptor of the shape of the dynamical attractor as the feature representation of the nature of the dynamics, in order to address the drawbacks of traditional approaches.
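The chaos-based methods above treat a joint trajectory as the observable of a nonlinear dynamical system, and a time delay (estimated, for example, by the C-C method) is typically used to reconstruct the phase space by time-delay embedding. The following is a minimal sketch of that embedding step only. The function name `delay_embed`, the toy trajectory, and the chosen `dim` and `tau` values are assumptions for illustration and are not taken from the cited works.

```python
from typing import List, Sequence


def delay_embed(series: Sequence[float], dim: int, tau: int) -> List[List[float]]:
    """Return delay vectors [x[i], x[i + tau], ..., x[i + (dim - 1) * tau]]."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    span = (dim - 1) * tau  # distance between first and last coordinate
    if len(series) <= span:
        raise ValueError("series is too short for this dim and tau")
    return [
        [series[i + k * tau] for k in range(dim)]
        for i in range(len(series) - span)
    ]


# Hypothetical 1-D joint coordinate observed over 8 frames.
joint_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# Embedding dimension and delay are picked by hand here; in practice they
# would be estimated from the data (the delay, e.g., by the C-C method).
vectors = delay_embed(joint_x, dim=3, tau=2)
```

Each row of `vectors` is one point of the reconstructed trajectory, from which chaotic invariants or attractor shape descriptors could then be computed.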

Manual Feature.
Since human behavior is composed of body movements, general human behavior characteristics are based on the underlying visual motion characteristics. The underlying visual features are easy to extract and represent, and the underlying visual motion features of the same action are robust to a certain degree across different cameras, so they were widely used in early human behavior recognition. There are two categories of human behavior characteristics: local feature representations and global feature representations. Existing global feature descriptions form global spatiotemporal cues from single-frame global features and video frame sequences, based on human body contours, posture joint points, and saliency segmentation. Examples include the motion history image (MHI) algorithm proposed by Bobick and Davis [5], the adaptation of the shape context algorithm proposed by Zhang et al. [6], and the kinematic feature proposed by Ali and Shah [7]. Local feature description of underlying action features has remained a hotspot in human behavior recognition research in recent years. Researchers have considered the changes in the motion field between frames and proposed various local spatiotemporal feature descriptions, such as STIP [8], MoSIFT [9,10], and dense trajectories [2,11].
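To make the idea of a global spatiotemporal cue concrete, the sketch below implements the standard motion history image update: pixels that move in the current frame are set to a duration `tau`, and all other pixels decay by one per frame down to zero. The binary motion masks are assumed to be given (for example, from frame differencing), and the names `update_mhi` and `masks` are hypothetical.

```python
from typing import List

Grid = List[List[int]]


def update_mhi(mhi: Grid, motion_mask: Grid, tau: int) -> Grid:
    """One MHI step: moving pixels become tau, the rest decay by 1 (floor 0)."""
    return [
        [tau if moving else max(0, previous - 1)
         for previous, moving in zip(mhi_row, mask_row)]
        for mhi_row, mask_row in zip(mhi, motion_mask)
    ]


# Hypothetical 1x3 binary motion masks for three consecutive frames:
# a single moving pixel travels from left to right.
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]

mhi = [[0, 0, 0]]
for mask in masks:
    mhi = update_mhi(mhi, mask, tau=3)
```

After the loop, more recent motion has a larger value, so a single image encodes both where and how recently motion occurred.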

Deep Learned Feature.
Deep learned features cover two aspects: action convolution features and action timing features. The former uses convolutional neural networks (CNNs) to learn local deep features of human behavior from different modalities, such as the RGB image frames and the optical flow of behavior videos [12]. The latter builds on the behavioral convolutional features and uses methods such as recurrent neural networks (RNNs), temporal segment networks, or linear coding to learn time-series features over the multiple stages of behavior development [13]. Because the memory capacity of the GPU/CPU is limited and behaviors differ in duration (that is, in the number of video frames), it is difficult to send all frames of a behavior video into the deep learning framework for feature learning. Therefore, key frame sampling must be performed on the behavior video during behavior recognition. Most existing behavior recognition algorithms use equal sampling [13] or sequential sampling [14-16], which ignores differences in how human behavior develops, so the key frames obtained are less representative.
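As a point of reference for the sampling baseline mentioned above, the following sketch picks one frame from each of several equal-length segments of a video. Reading "equal sampling" as taking the middle frame of each segment is an assumption, and `equal_sample_indices` is a hypothetical helper, not code from the cited works.

```python
from typing import List


def equal_sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick the middle frame index of each of num_samples equal segments."""
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


# A hypothetical 100-frame video reduced to 4 key frames.
indices = equal_sample_indices(num_frames=100, num_samples=4)
```

The indices depend only on the video length and not on its content, which is why such sampling can miss the frames that matter most for a given action.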

Deep ChaosNet Framework
Inspired by Wang et al. and Balakrishnan et al. [15,17], we propose the deep ChaosNet framework for action recognition. The framework is illustrated in Figure 1. Deep ChaosNet features are extracted from the video frames and then sent to the low-level LSTM encoder and the high-level LSTM encoder to obtain the low-level and high-level coding outputs, respectively. The agent is a behavior recognizer that produces the recognition results. It is based on hierarchical reinforcement learning and is mainly composed of a manager and a worker. The manager is a hidden layer responsible for setting behavioral segmentation targets at the high level. Given a segmentation target, the worker determines the spatiotemporal area of the video subsegment that best characterizes that target and outputs the segmentation recognition result.

Structure of the Network.
The network system structure is shown in Figure 2.
The manager LSTM unit obtains the environmental state information $h^M_t$ from the input $[c^M_{t-1}, h^W_{t-1}]$ and derives a meaningful behavioral stage goal $g_t$, which is used as worker LSTM input to guide the worker in selecting the spatiotemporal region of the next behavioral video subsegment:

$$h^M_t = S^M\left(c^M_{t-1}, h^W_{t-1}\right), \qquad g_t = u^M\left(h^M_t\right), \tag{1}$$

where $S^M$ is the manager LSTM nonlinear function and $u^M$ maps the environmental state information $h^M_t$ to the behavioral stage goal $g_t$. The worker LSTM unit obtains the context information $h^W_t$ from its input. Based on $h^W_t$, we predict the next key frame position $d_t$, the sampling area $l_t$, and the behavior category $p_t$.
For both the manager and the worker, we use a visual attention mechanism to explore areas of salient behavior. The manager attention model mainly explores the salient segment information of the behavior, and the worker attention model assists in searching for the behavior key frames and the salient areas within each frame. The parameters $c^M_t$ and $c^W_t$ are computed by the manager and worker attention models, respectively.
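As a hedged illustration of how an attention model can produce such a parameter, the sketch below implements generic dot-product soft attention: each feature vector is scored against a query, the scores are normalized with a softmax, and the context is the weighted sum of the features. This generic form is an assumption and is not necessarily the formula used by the framework. The names `attention_context`, `features`, and `query` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(
    features: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (weights, context) for dot-product soft attention."""
    scores = [sum(f * q for f, q in zip(feature, query)) for feature in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * feature[d] for w, feature in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


# Two hypothetical segment features; a zero query scores them equally.
features = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention_context(features, query=[0.0, 0.0])
```

In a manager/worker setting, the query would plausibly be a previous hidden state and the features would be encoder outputs, but that wiring is likewise an assumption here.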

Deep Learning Process.
The worker strategy learning process is a standard reinforcement learning process. At each step $t$, the worker gives a classification prediction $p_t$, and the environment then returns a reward $R_t$, so the goal of worker strategy learning is to minimize the negative value of the reward function, which defines the worker loss function. The manager does not directly interact with the environment, so its strategy learning process cannot copy that of the worker. Relative to the manager's time step $t$, the worker strategy $\pi_{\theta_w}(p_t; g_t)$ is relatively stable, and this strategy directly affects the worker's behavior classification output $p_{t,c}$ at time $c$. Thus, although the manager is a hidden layer, its strategic goal should also be to minimize the negative value of the current reward, which defines the manager loss function.
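One common way to turn "minimize the negative reward" into a trainable objective is the REINFORCE surrogate loss, in which the log-probability of each chosen prediction is weighted by the reward it received. The sketch below assumes that form. It is not taken from the source, and `worker_loss`, `log_probs`, and `rewards` are hypothetical names with toy values.

```python
import math
from typing import Sequence


def worker_loss(log_probs: Sequence[float], rewards: Sequence[float]) -> float:
    """Negative reward-weighted log-likelihood, averaged over steps."""
    if len(log_probs) != len(rewards) or not log_probs:
        raise ValueError("log_probs and rewards must be non-empty and aligned")
    weighted = sum(lp * r for lp, r in zip(log_probs, rewards))
    return -weighted / len(log_probs)


# Two hypothetical steps: the first prediction (probability 0.5) is rewarded,
# the second (probability 0.25) is not.
log_probs = [math.log(0.5), math.log(0.25)]
rewards = [1.0, 0.0]
loss = worker_loss(log_probs, rewards)
```

Lowering this loss raises the probability of rewarded predictions. Unrewarded steps contribute nothing to it, which is one reason practical implementations often subtract a baseline from the reward.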

Experiments and Results
We verify the proposed deep ChaosNet on two standard action datasets: UCF101 [18] and HMDB51 [19]. UCF101 is an action recognition dataset of realistic action videos collected from YouTube, with 101 action categories. The videos of each action category are divided into 25 groups, and each group contains 4 to 7 videos of that action. Videos from the same group may share common features, such as similar backgrounds and similar viewpoints. HMDB51 contains 51 types of actions and a total of 6849 videos collected from sources such as YouTube and Google Video. Each action contains at least 101 videos with a resolution of 320 × 240.
In the experiments, we construct a 7-layer deep ChaosNet for both action datasets. The outputs of the deep ChaosNet are 2048-dim frame features, which are then projected to 512 dimensions. We use a Bi-LSTM with hidden size 512 as the low-level encoder and an LSTM with hidden size 256 as the high-level encoder [20]. The worker network consists of a worker LSTM with hidden size 1024. The manager network is composed of a manager LSTM with hidden size 256, an attention module, and a linear layer that projects the output of the LSTM into the latent goal space. The internal critic of the environment is also an RNN, which contains a GRU, a built-in word embedding, a linear layer, and a sigmoid function.
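For reference, the layer sizes stated above can be collected in a single configuration object, sketched below. The class and field names are hypothetical. The derived property assumes that the Bi-LSTM hidden size of 512 is per direction and that the two directions are concatenated, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepChaosNetConfig:
    """Hyperparameters quoted in the experiments (field names are illustrative)."""

    chaosnet_layers: int = 7
    frame_feature_dim: int = 2048
    projected_feature_dim: int = 512
    low_level_encoder_hidden: int = 512   # Bi-LSTM
    high_level_encoder_hidden: int = 256  # LSTM
    worker_lstm_hidden: int = 1024
    manager_lstm_hidden: int = 256

    @property
    def low_level_encoder_output_dim(self) -> int:
        # Assumption: 512 is the per-direction size and the forward and
        # backward states are concatenated.
        return 2 * self.low_level_encoder_hidden


config = DeepChaosNetConfig()
```

Keeping the sizes in one frozen object makes it easier to check that adjacent modules agree on their input and output dimensions.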

Conclusions
We extend ChaosNet to a deep neural network and apply it to action recognition. We deepen the hidden layers of ChaosNet and then separately input still frames and the motion between frames into the deep network to extract spatial and temporal action features. These features act as the input to the attention-based action recognition framework. We verify our method on two standard action datasets, UCF101 and HMDB51, and the experimental results indicate that the proposed algorithm is competitive with the state of the art.