Human activity recognition (HAR) aims to recognize activities from a series of observations on the actions of subjects and the environmental conditions. Vision-based HAR research is the basis of many applications including video surveillance, health care, and human-computer interaction (HCI). This review highlights the advances of state-of-the-art activity recognition approaches, especially for the activity representation and classification methods. For the representation methods, we trace a chronological research trajectory from global representations to local representations and recent depth-based representations. For the classification methods, we follow the categorization into template-based methods, discriminative models, and generative models and review several prevalent methods. Next, representative and available datasets are introduced. Aiming to provide an overview of these methods and a convenient way of comparing them, we classify the existing literature with a detailed taxonomy covering representation and classification methods as well as the datasets they used. Finally, we investigate directions for future research.
Human activity recognition (HAR) is a widely studied computer vision problem. Applications of HAR include video surveillance, health care, and human-computer interaction. As imaging techniques advance and camera devices improve, novel approaches for HAR constantly emerge. This review aims to provide a comprehensive introduction to video-based human activity recognition, giving an overview of various approaches as well as their evolution by covering both representative classical literature and state-of-the-art approaches.
Human activities have an inherent hierarchical structure that can be described as a three-level categorization. At the bottom level are atomic action primitives, which constitute more complex human activities. Above the action primitive level is the action/activity level. Finally, complex interactions form the top level, which refers to human activities involving more than two persons and objects. In this paper, we follow this three-level categorization, namely action primitives, actions/activities, and interactions. This three-level categorization varies a little from previous surveys [
This review highlights the advances of image representation approaches and classification methods in vision-based activity recognition. Generally, for representation approaches, the related literature follows a research trajectory of global representations, local representations, and recent depth-based representations (Figure
Research trajectory of activity representation approaches.
On the other hand, classification techniques keep developing in step with machine learning methods. In fact, many classification methods were not originally designed for HAR. For instance, dynamic time warping (DTW) and hidden Markov models (HMMs) were first used in speech recognition [
In addition to the activity classification approaches, another critical research area within the HAR scope, the human tracking approach, is also reviewed briefly in a separate section. It attracts wide attention, especially in video surveillance systems for suspicious behavior detection.
The rest of this review is organized according to the general HAR process flow. First, research emphases and challenges of this domain are briefly illustrated in Section
Different from speech recognition, there is no grammar or strict definition for human activities. This causes two kinds of confusion. On one hand, the same activity may vary from subject to subject, which leads to intraclass variations; differences in performing speed and strength also widen the intraclass gaps. On the other hand, different activities may exhibit similar shapes (e.g., using a laptop and reading). This is termed interclass similarity, which is a common phenomenon in HAR. Accurate and distinctive features need to be designed and extracted from activity videos to deal with these problems.
While applications like video surveillance and fall detection systems use static cameras, more scenarios adopt dynamic recording devices. Sports event broadcast is a typical case of dynamic recording. In fact, with the popularity of smart devices such as smart glasses and smartphones, people tend to record videos with embedded cameras from wearable devices anytime. Most of these real-world videos have complex dynamic backgrounds. First, those videos, as well as the broadcasts, are recorded in various and changing backgrounds. Second, realistic videos abound with occlusions, illumination variance, and viewpoint changes, which makes it harder to recognize activities in such complex and varied conditions.
Earlier research concentrated on low-level human activities such as jumping, running, and waving hands. One typical characteristic of these activities is having a single subject without any human-human or human-object interactions. However, in the real world, people tend to perform interactive activities with one or more persons and objects. An American football game is a good example of interaction and group activity where multiple players (i.e., human-human interaction) in a team protect the football (i.e., human-object interaction) jointly and compete with players in the other team. It is a challenging task to locate and track multiple subjects synchronously or recognize the whole human group activities as “playing football” instead of “running.”
Long-distance and low-quality videos with severe occlusions exist in many scenarios of video surveillance. Large and crowded places like the metro and the passenger terminal of an airport are representative occasions where occlusions happen frequently. Besides, surveillance cameras installed in high places cannot provide high-quality videos like those in existing datasets, in which the target person is clear and obvious. Though we do not expect to track everyone in these cases, some abnormal or crime-related behaviors should be recognized by the HAR system (Figure
Long-distance videos under real-world settings. (a) HAR in long-distance broadcasts. (b) Abnormal behaviors in surveillance.
Global representations extract global descriptors directly from original videos or images and encode them as a whole feature. In this representation, the human subject is localized and isolated using background subtraction methods, forming the silhouettes or shapes (i.e., the region of interest (ROI)). Some global approaches encode the ROI and derive corners, edges, or optical flow from it as descriptors. Other silhouette-based global representation methods stack the silhouette images along the time axis to form 3D space-time volumes, which are then utilized for representation. Besides, the discrete Fourier transform (DFT), another global approach, takes advantage of frequency domain information of the ROI for recognition. Global representation approaches were mostly proposed in earlier works and gradually became outdated due to their sensitivity to noise, occlusions, and viewpoint changes.
To recognize the human activities in videos, an intuitive idea is to isolate the human body from the background. This procedure is called background subtraction or foreground extraction. The extracted foreground in the HAR is called silhouette, which is the region of interest and represented as a whole object in the global representation approach.
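As a concrete illustration of this procedure, the sketch below builds a per-pixel median background model from a short frame history and thresholds the difference to obtain a binary silhouette mask. This is a minimal toy example on grayscale frames represented as nested lists; the median model and the threshold value are illustrative choices, not the method of any particular paper.

```python
# Toy background subtraction: median background model + thresholding.
from statistics import median

def background_model(frames):
    """Per-pixel median over a list of grayscale frames (nested lists)."""
    h, w = len(frames[0]), len(frames[0][0])
    return [[median(f[y][x] for f in frames) for x in range(w)] for y in range(h)]

def extract_silhouette(frame, background, threshold=30):
    """Binary foreground mask: 1 where the frame deviates from the background."""
    return [[1 if abs(p - b) > threshold else 0
             for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

# Toy example: a static background of value 10; a bright subject pixel appears.
frames = [[[10, 10], [10, 10]] for _ in range(5)]   # 5 background frames
current = [[10, 200], [10, 10]]                     # subject at pixel (0, 1)
bg = background_model(frames)
mask = extract_silhouette(current, bg)              # -> [[0, 1], [0, 0]]
```

In practice, methods differ mainly in how the background model is built and updated (e.g., per-pixel Gaussians or mixtures of Gaussians); the final thresholding step that yields the silhouette is broadly similar.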
Calculating the background model is an important step before extracting silhouettes. Wren et al. [
Besides the silhouette representation, the 2D shape of the silhouette can be used as a feature as well. Veeraraghavan et al. [
Bobick and Davis [
In [
Silhouettes extracted from a single view can hardly satisfy the view-invariance property. To alleviate the influence of viewpoint changes, multiple cameras can be used to extract silhouettes from different viewpoints. Xu and Huang [
Optical flow is an effective way to extract and describe silhouettes for a dynamic background. Lucas-Kanade-Tomasi (LKT) feature tracker [
For recognizing human activities at a distance (i.e., the football broadcast video), Efros et al. [
Tran and Sorokin [
An activity video can be seen as a series of images that contain activity sequences. Concatenating all frames along the time axis forms the 3D space-time volume (STV) which has three dimensions including two spatial dimensions
Blank et al. [
Achard et al. [
In [
The DFT of an image frame is another global feature that contains the intensity information of the foreground object (i.e., the region of the subject's body), provided that the foreground object's intensity differs from the background. Kumari and Mitra [
Instead of extracting the silhouette or STV and encoding them as a whole, local representations process an activity video as a collection of local descriptors. They focus on specific local patches which are determined by interest point detectors or dense sampling [
An intuitive thought of local representation is to identify those interest points that contain high information contents in images or videos. Harris and Stephens [
Saliency can also be used to detect interest points. Saliency means that certain parts of an image are preattentively distinctive and are immediately perceivable [
Although these methods achieved remarkable results in HAR, one common deficiency is the inadequate number of stable interest points. In fact, the trade-off between the stability of those points and the number of points found is difficult to control. On one hand, the "right" and "discriminative" (i.e., stable) interest points are rare and difficult to identify. As stated in [
Besides the inherent properties of sparse interest points, many of the mentioned methods are inefficient. Therefore, these methods are restricted to the detection of a small number of points, or limited to low-resolution videos [
Dollar et al. [
Aiming at detecting interest points in an efficient way, Willems et al. [
Local descriptors are designed to describe the patches that are sampled either densely or at the interest points [
Laptev [
Similar to works of extending 2D interest point detector into spatiotemporal domain, such as the Harris corner detector [
Lowe proposed the scale-invariant feature transform (SIFT) in 1999 [
The speeded-up robust features (SURF) [
An extended 3D SURF descriptor was implemented by Willems et al. [
Dalal and Triggs [
Lu and Little et al. [
Klaser et al. [
Early spatiotemporal methods adopt a perspective of regarding the video as
Wang et al. [
The STIP-based descriptors and other elaborately designed descriptors are all referred to as local features. Local features are then encoded with feature encoding methods to represent activities, and the encoded features are subsequently fed into pretrained classifiers (e.g., SVM) [
Taxonomy of activity recognition literature.
References | Year | Representation (global/local/depth) | Classification | Modality | Level | Dataset | Performance result |
---|---|---|---|---|---|---|---|
Yamato et al. [ |
1992 | Symbols converted from mesh feature vector and encoded by vector quantization (G) | HMM | RGB | Action/activity | Collected dataset: |
96% accuracy |
Darrell and Pentland [ |
1993 | View model sets (G) | Dynamic time warping | RGB | Action primitive | Collected instances of 4 gestures. | 96% accuracy (“Hello” gesture) |
Brand et al. [ |
1997 | 2D blob feature (G) | Coupled HMM (CHMM) | RGB | Action primitive | Collected dataset: 52 instances. |
94.2% accuracy |
Oliver et al. [ |
2000 | 2D blob feature (G) | (i) CHMM; |
RGB | Interaction | Collected dataset: 11–75 training sequences +20 testing sequences. |
(i) 84.68% accuracy (average);
Bobick and Davis [ |
2001 | Motion energy image & motion history image (G) | Template matching by measuring Mahalanobis distance | RGB | Action/activity | Collected dataset: |
(a) 12/18 (single view); |
Efros et al. [ |
2003 | Optical flow (G) | K-nearest neighbor | RGB | Action/activity | (a) Ballet dataset; (b) tennis dataset; (c) football dataset | (a) 87.4% accuracy; |
Park and Aggarwal [ |
2004 | Body model by combining an ellipse representation and a convex hull-based polygonal representation (G) | Dynamic Bayesian network | RGB | Interaction | Collected dataset: 56 instances. |
78% accuracy |
Schüldt et al. [ |
2004 | Space-time interest points (L) | SVM | RGB | Action/activity | KTH dataset | 71.7% accuracy |
Blank et al. [ |
2005 | Space-time shape (G) | Spectral clustering algorithm | RGB | Action/activity | Weizmann dataset | 99.63% accuracy |
Oikonomopoulos et al. [ |
2005 | Spatiotemporal salient points (L) | RVM | RGB | Action/activity | Collected dataset: 152 instances. |
77.63% recall |
Dollar et al. [ |
2005 | Space-time interest points (L) | (i) 1-nearest neighbor (1NN); |
RGB | Action/activity | KTH dataset | (i) 78.5% accuracy (1NN); |
Ke et al. [ |
2005 | Integral videos (L) | Adaboost | RGB | Action/activity | KTH dataset | 62.97% accuracy |
Veeraraghavan et al. [ |
2005 | Space-time shape (G) | Nonparametric methods by extending DTW | RGB | Action/activity | (a) USF dataset [ |
No accuracy data presented. |
Duong et al. [ |
2005 | High level activities are represented as sequences of atomic activities; atomic activities are only represented using durations (−). | Switching hidden semi-Markov model (S-HSMM) | RGB | Interaction | Collected dataset: 80 video sequences. |
97.5% accuracy (average; Coxian model)
Weinland et al. [ |
2006 | Motion history volumes (G) | Principal component analysis (PCA) + Mahalanobis distance | RGB | Action/activity | IXMAS dataset [ |
93.33% accuracy |
Lu et al. [ |
2006 | PCA-HOG (L) | HMM | RGB | Action/activity | (a) Soccer sequences dataset [ |
The implemented system can track subjects in videos and recognize their activities robustly. No accuracy data presented. |
Ikizler and Duygulu [ |
2007 | Histogram of oriented rectangles and encoded with BoVW (G) | (i) Frame by frame voting; |
RGB | Action/activity | Weizmann dataset | 100% accuracy (DTW) |
Huang and Xu [ |
2007 | Envelop shape acquired from silhouettes (G) | HMM | RGB | Action/activity; |
Collected dataset: |
Subject dependent + view independent: 97.3% accuracy; |
Scovanner et al. [ |
2007 | 3D SIFT (L) | SVM | RGB | Action/activity | Weizmann dataset | 82.6% accuracy |
Vail et al. [ |
2007 | — | (i) HMM |
— | Interaction | Data from the hourglass and the unconstrained tag domains generated by robot simulator. | 98.1% accuracy (CRF, hourglass); |
Cherla et al. [ |
2008 | Width feature of normalized silhouette box (G) | Dynamic time warping | RGB | Action/activity | IXMAS dataset [ |
80.05% accuracy; |
Tran and Sorokin [ |
2008 | Silhouette and optical flow (G) | (i) Naïve Bayes (NB); |
RGB | Interaction; |
(a) Weizmann dataset; |
(a) 100% accuracy; |
Achard et al. [ |
2008 | Semi-global features extracted from space-time micro volumes (L) | HMM | RGB | Action/activity | Collected dataset: 1614 instances. |
87.39% accuracy (average) |
Rodriguez et al. [ |
2008 | Action MACH-maximum average correlation height (G) | Maximum average correlation height filter | RGB | Interaction; |
(a) KTH dataset; |
(a) 80.9% accuracy; |
Klaser et al. [
2008 | Histograms of oriented 3D spatiotemporal gradients (L) | SVM | RGB | Interaction; |
(a) KTH dataset; |
(a) 91.4% (±0.4) accuracy; |
Willems et al. [ |
2008 | Hessian-based STIP detector & SURF3D (L) | SVM | RGB | Action/activity | KTH dataset | 84.26% accuracy |
Laptev et al. [ |
2008 | STIP with HOG, HOF are encoded with BoVW (L) | SVM | RGB | Interaction; |
(a) KTH dataset; |
(a) 91.8% accuracy; |
Natarajan and Nevatia [ |
2008 | 23 degrees body model (G) | Hierarchical variable transition HMM (HVT-HMM) | RGB | Action/activity; |
(a) Weizmann dataset; |
(a) 100% accuracy; |
Natarajan and Nevatia [ |
2008 | 2-layer graphical model: top layer corresponds to actions in particular viewpoint; lower layer corresponds to individual poses (G) | Shape, flow, duration-conditional random field (SFD-CRF) | RGB | Action/activity | Collected dataset: 400 instances. |
78.9% accuracy |
Ning et al. [ |
2008 | Appearance and position context (APC) descriptor encoded by BoVW (L) | Latent pose conditional random fields (LPCRF) | RGB | Action/activity; |
HumanEva dataset | 95.0% accuracy (LPCRF |
Marszalek et al. [ |
2009 | SIFT, HOG, HOF encoded by BoVW (L) | SVM | RGB | Interaction | Hollywood2 dataset | 35.5% accuracy |
Li et al. [ |
2010 | Action graph of salient postures (D) | Non-Euclidean relational fuzzy (NERF) C-means & Hausdorf distance-based dissimilarity measure | Depth | Action/activity | MSR Action3D dataset | 91.6% accuracy (train/test = 1/2); |
Suk et al. [ |
2010 | YIQ color model for skin pixels; histogram-based color model for face region; optical flow for tracking of hand motion (L) | Dynamic Bayesian network | RGB | Action primitive | Collected dataset: 498 instances. |
(a) 99.59% accuracy; |
Baccouche et al. [ |
2010 | SIFT descriptor encoded by BoVW (L) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | RGB | Interaction | MICC-Soccer-Actions-4 dataset [ |
92% accuracy |
Kumari and Mitra [ |
2011 | Discrete Fourier transform on silhouettes (G) | K-nearest neighbor | RGB | Action/activity | (a) MuHaVi dataset; |
(a) 96% accuracy; |
Wang et al. [ |
2011 | Dense trajectory with HOG, HOF, MBH (L) | SVM | RGB | Interaction; |
(a) KTH dataset; |
(a) 94.2% accuracy; |
Wang et al. [ |
2012 | STIP with HOG, HOF are encoded with various encoding methods (L) | SVM | RGB | Interaction; |
(a) KTH dataset; |
(a) 92.13% accuracy (Fisher vector); |
Zhao et al. [ |
2012 | Combined representations: |
SVM | RGB-D | Interaction | RGBD-HuDaAct dataset | 89.1% accuracy |
Yang et al. [ |
2012 | DMM-HOG (D) | SVM | Depth | Action/activity | MSR Action3D dataset | 95.83% accuracy |
Xia et al. [ |
2012 | Histograms of 3D joint locations (D) | HMM | Depth | Action/activity | (a) collected dataset: 6220 frames, 200 samples. |
(a) 90.92% accuracy; |
Yang and Tian [ |
2012 | EigenJoints (D) | Naïve-Bayes-Nearest-Neighbor (NBNN) | Depth | Action/activity | MSR Action3D dataset | 96.8% accuracy; |
Wang et al. [ |
2012 | Local occupancy pattern for depth maps & Fourier temporal pyramid for temporal representation & actionlet ensemble model for characterizing activities (D) | SVM | Depth | Interaction; |
(a) MSR Action3D dataset; |
(a) 88.2% accuracy; |
Wang et al. [ |
2013 | Improved dense trajectory with HOG, HOF, MBH (L) | SVM | RGB | Interaction | (a) Hollywood2 dataset; |
(a) 64.3% accuracy; |
Oreifej and Liu [ |
2013 | Histogram of oriented 4D surface normals (D) | SVM | Depth | Action/activity; |
(a) MSR Action3D dataset; |
(a) 88.89% accuracy; |
Chaaraoui [ |
2013 | Combined representations: |
Dynamic time warping | RGB-D | Action/activity | MSR Action3D dataset | 91.80% accuracy |
Ren et al. [ |
2013 | Time-series curve of hand shape (G) | Dissimilarity measure based on Finger-Earth Mover’s Distance (FEMD) | RGB | Action primitive | Collected dataset: 1000 instances. |
93.9% accuracy |
Ni et al. [ |
2013 | Depth-Layered Multi-Channel STIPs (L) | SVM | RGB-D | Interaction | RGBD-HuDaAct database | 81.48% accuracy (codebook size = 512 & SPM kernel) |
Grushin et al. [ |
2013 | STIP with HOF (L) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | RGB | Action/activity | KTH dataset | 90.7% accuracy |
Peng et al. [ |
2014 | (i) STIP with HOG, HOF and encoded by various encoding methods; (L) |
SVM | RGB | Interaction | (a) HMDB51 dataset; |
Hybrid representation: |
Peng et al. [ |
2014 | Improved dense trajectory encoded with stacked Fisher kernel (L) | SVM | RGB | Interaction; |
(a) YouTube dataset; |
(a) 93.38% accuracy; |
Wang et al. [ |
2014 | Local occupancy pattern for depth maps & Fourier temporal pyramid for temporal representation & actionlet ensemble model for characterizing activities (D) | SVM | Depth | Interaction; |
(a) MSR Action3D dataset; |
(a) 88.2% accuracy; |
Simonyan and Zisserman [ |
2014 | Spatial stream ConvNets & optical flow based temporal stream ConvNets (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; |
(a) 59.4% accuracy; |
Lan et al. [ |
2015 | Improved dense trajectory with HOG, HOF, MBHx, MBHy enhanced with multiskip feature tracking (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; |
(a) 65.1% accuracy (L = 3); |
Shahroudy et al. [ |
2015 | Combined representations: |
SVM | RGB-D | Interaction | MSR DailyActivity3D | 81.9% accuracy |
Wang et al. [ |
2015 | Weighted hierarchical depth motion maps (D) | Three-channel deep convolutional neural networks (3ConvNets) | Depth | Interaction; |
(a) MSR Action3D dataset; |
(a) 100% accuracy; |
Wang et al. [ |
2015 | Pseudo-color images converted from DMMs (D) | Three-channel deep convolutional neural networks (3ConvNets) | Depth | Interaction; |
(a) MSR Action3D dataset; |
(a) 100% accuracy; |
Wang et al. [ |
2015 | Trajectory-pooled deep-convolutional descriptor and encoded by Fisher kernel (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; |
(a) 65.9% accuracy; |
Veeriah et al. [ |
2015 | (i) HOG3D in KTH 2D action dataset; (L) |
Differential recurrent neural network (dRNN) | RGB-D | Action/activity | (a) KTH dataset; |
(a) 93.96% accuracy (KTH-1); |
Du et al. [ |
2015 | Representations of skeleton data extracted by subnets (D) | Hierarchical bidirectional recurrent neural network (HBRNN) | RGB-D | Action/activity | (a) MSR Action3D dataset; |
(a) 94.49% accuracy; |
Zhen et al. [ |
2016 | STIP with HOG3D and encoded with various encoding methods (L) | SVM | RGB | Interaction; |
(a) KTH dataset; |
(a) 94.1% (Local NBNN); |
Chen et al. [ |
2016 | Action graph of skeleton-based features (D) | Maximum likelihood estimation | Depth | Action/activity | (a) MSR Action3D dataset; |
(a) 95.56% accuracy (cross subject); |
Zhu et al. [ |
2016 | Co-occurrence features of skeleton joints (D) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | Depth | Interaction; |
(a) SBU Kinect interaction dataset [ |
(a) 90.41% accuracy; |
Li et al. [ |
2016 | VLAD for deep dynamics (G) | Deep convolutional neural networks (ConvNets) | RGB | Interaction; |
(a) UCF101 dataset; |
(a) 84.65% accuracy; |
Berlin & John [ |
2016 | Harris corner-based interest points and histogram-based features (L) | Deep neural networks (DNNs) | RGB | Interaction | UT Interaction dataset [ |
95% accuracy on set1; |
Huang et al. [ |
2016 | Lie group features (L) | Lie Group Network (LieNet) | Depth | Interaction; |
(a) G3D-Gaming dataset [
(a) 89.10% accuracy; |
Mo et al. [ |
2016 | Automatically extracted features from skeletons data (D) | Convolutional neural networks (ConvNets) + multilayer perceptron | Depth | Interaction | CAD-60 dataset | 81.8% accuracy |
Shi et al. [ |
2016 | Three stream sequential deep trajectory descriptor (L) | Recurrent neural networks (RNN) and deep convolutional neural networks (ConvNets) | RGB | Interaction; |
(a) KTH dataset; |
(a) 96.8% accuracy; |
Yang et al. [ |
2017 | Low-level polynormal assembled from local neighboring hypersurface normals and are then aggregated by Super Normal Vector (D) | Linear classifier | Depth | Interaction; |
(a) MSR Action3D dataset; |
(a) 93.45% accuracy; |
Jalal et al. [ |
2017 | Multifeatures extracted from human body silhouettes and joints information (D) | HMM | Depth | Interaction; |
(a) Online self-annotated dataset [ |
(a) 71.6% accuracy; |
Several evaluations [
Further exploration has been conducted to match the best local feature with FV. In [
Recent stacked Fisher vectors [
Pipeline of Fisher vector and stacked Fisher vector. (a) Fisher vector. (b) Stacked Fisher vector.
The core idea of both FV and SFV is to capture more statistical information from images; in contrast, BoVW retains only the zeroth-order statistics. Take an
SFV further improves on FV for a simple and intuitive reason: SFV densely computes local features by dividing and scanning multiscale subvolumes. The main challenge is the holistic combination of those local FVs, since encoding them directly using another FV is impossible because of their high dimension (2
Previous research on HAR mainly concentrates on video sequences captured by traditional RGB cameras. Depth cameras, however, had seen limited use due to their high cost and complexity of operation [
Kinect RGBD cameras and their color images, depth maps, skeletal information. (a) Kinect v1 (2011). (b) Kinect v2 (2014). (c) Color image. (d) Depth map. (e) Skeleton captured by Kinect v1. (f) Skeleton captured by Kinect v2.
Depth maps contain an additional depth coordinate compared to conventional color images and are more informative. Approaches presented in this section regard depth maps as spatiotemporal signals and extract features directly from them. These features are either used independently or combined with the RGB channel to form multimodal features.
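To illustrate the idea of treating depth maps as spatiotemporal signals, the toy sketch below accumulates per-pixel absolute differences of consecutive depth frames, loosely in the spirit of depth-motion-map features. The tiny frames and depth values are invented for illustration; real methods project depth onto multiple planes and compute richer descriptors (e.g., HOG) on the accumulated maps.

```python
# Toy motion map over a depth sequence: accumulate per-pixel frame differences.
def depth_motion_map(depth_frames):
    """Sum of absolute differences between consecutive depth frames."""
    h, w = len(depth_frames[0]), len(depth_frames[0][0])
    dmm = [[0] * w for _ in range(h)]
    for prev, curr in zip(depth_frames, depth_frames[1:]):
        for y in range(h):
            for x in range(w):
                dmm[y][x] += abs(curr[y][x] - prev[y][x])
    return dmm

frames = [
    [[50, 50], [50, 50]],
    [[50, 40], [50, 50]],   # subject's hand moves closer at pixel (0, 1)
    [[50, 30], [50, 50]],
]
print(depth_motion_map(frames))  # -> [[0, 20], [0, 0]]
```

Pixels where depth changes over time light up in the accumulated map, so the map summarizes where and how much motion occurred in the sequence.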
Li et al. [
Zhao et al. [
Yang et al. [
Jalal et al. [
Skeletons and joint positions are features generated from depth maps. The Kinect device is popular in this representation due to the convenience of obtaining skeletons and joints. Applications using the Kinect v1 SDK generate 20 joints, while the later version (Kinect v2) generates 25 joints, adding 5 joints around the hands and neck (see Figure
First, the skeleton model has an inherent deficiency in that it suffers from the noisy skeleton problem when dealing with occlusions (see Figure
Noisy skeleton problem caused by self-occlusion.
Second, an intuitive fact can be observed: not all skeletal joints are involved in a particular activity, and only a few active joints are meaningful and informative for a certain activity [
Finally, being itself a feature extracted from depth maps, the skeleton-based representation is often combined with the original depth information to form a more informative and robust representation [
Xia et al. [
Yang and Tian [
Wang et al. [
Shahroudy et al. [
Chen et al. [
Besides depth-based features, skeleton data can be combined with other RGB features. To deal with the noisy skeleton problem, Chaaraoui et al. [
The next stage of HAR is the classification of activities that have been represented by proper feature sets extracted from images or videos. In this stage, classification algorithms give the activity label as the final result. Generally speaking, most activity classification algorithms can be divided into three categories, namely template-based approaches, generative models, and discriminative models. Template-based approaches are relatively simple and well accepted; however, they can sometimes be computationally expensive. Generative models learn a model of the joint probability
Template-based approaches try to portray common appearance characteristics of a certain activity using various representations. These common appearance characteristics, such as 2D/3D static images/volumes or a sequence of view models, are termed templates. Most template-based methods extract 2D/3D static templates and compare the similarity between the extracted images/volumes of test videos and the stored templates. For classification based on a sequence of key frames, dynamic time warping (DTW) is an effective approach.
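A minimal nearest-template classifier can be sketched as follows: each activity is summarized by one stored feature template, and a test feature is assigned the label of the closest template. Euclidean distance is used here for simplicity (methods such as Bobick and Davis's measure the Mahalanobis distance instead), and the template vectors are invented for illustration.

```python
# Toy template matching: label a feature by its nearest stored template.
def match_template(feature, templates):
    """Return the label of the stored template closest to the feature."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(templates, key=lambda label: dist(feature, templates[label]))

# Hypothetical per-activity templates (e.g., compact moment features).
templates = {"wave": [0.9, 0.1, 0.0], "walk": [0.1, 0.8, 0.6]}
observation = [0.8, 0.2, 0.1]
print(match_template(observation, templates))  # -> wave
```

The comparison step is cheap per template, but the cost grows with the number of stored templates, which is exactly the scalability issue discussed below for methods that keep many templates per activity.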
Bobick and Davis [
Shechtman and Irani [
Common template-based methods are unable to generate a single template for each activity. They often suffer high computational cost due to maintaining and comparing various templates. Rodriguez et al. [
Dynamic time warping (DTW) is a dynamic programming algorithm for matching two sequences that may vary in speed or length. Rabiner and Juang [
Darrell and Pentland [
Veeraraghavan et al. [
Although the DTW algorithm needs only a small number of training samples, its computational complexity increases significantly when dealing with a growing number of activity types or activities with high inter/intraclass variance, because extensive templates are needed to cover those variations.
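The classic DTW recurrence can be sketched as follows for two 1-D sequences; it finds the minimum-cost alignment that allows stretching either sequence in time. This is a textbook implementation rather than any specific paper's method, and real HAR systems apply it to sequences of high-dimensional frame features rather than scalars.

```python
# Classic DTW between two 1-D sequences, O(len(a) * len(b)).
def dtw_distance(a, b):
    """Minimum cumulative alignment cost between sequences a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

slow = [0, 1, 1, 2, 3, 3, 4]        # same motion performed slowly
fast = [0, 1, 2, 3, 4]
print(dtw_distance(slow, fast))      # -> 0.0 (perfect warped alignment)
```

Because the warping absorbs differences in execution speed, the slow and fast performances of the same motion align with zero cost, while a genuinely different sequence yields a large distance.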
The recognition task is a typical evaluation problem, one of the three canonical hidden Markov model problems, and can be solved by the forward algorithm. HMMs were initially proposed to solve the speech recognition problem [
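The forward algorithm that solves this evaluation problem can be sketched as below. For recognition, one HMM is typically trained per activity class, and the class whose model assigns the highest likelihood to the observed feature sequence wins. The 2-state toy model and all probabilities here are invented for illustration.

```python
# Forward algorithm: likelihood of an observation sequence under an HMM.
def forward_probability(obs, pi, A, B):
    """pi[i]: initial state probabilities; A[i][j]: transition probabilities;
    B[i][o]: probability of emitting symbol o in state i."""
    n = len(pi)
    # Initialization with the first observation.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: propagate forward probabilities through each time step.
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Toy 2-state HMM over 2 observation symbols (all numbers illustrative).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = forward_probability([0, 1, 0], pi, A, B)   # sequence likelihood
```

Evaluating this likelihood under each activity's trained HMM and taking the argmax is the standard HMM-based recognition scheme.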
A brief summary of the deficiencies of basic HMM and several efficient extensions are presented in [
Previous work has proposed several variants of HMM to handle the mentioned deficiencies [
Flexible duration models were suggested, including the hidden semi-Markov model (HSMM) and the variable transition HMM (VT-HMM). The hidden semi-Markov model is a candidate approach that has an explicit duration model with a specific distribution. Duong et al. [
Alternatively, Ramesh and Wilpon [
A dynamic Bayesian network (DBN) is a Bayesian network with the same structure unrolled along the time axis [
Figure
A typical dynamic Bayesian network [
Park and Aggarwal [
Cherla et al. [
Support vector machines (SVMs) are typical classifiers of discriminative models and gained extensive use in HAR. Vapnik et al. [
Schüldt et al. [
Laptev et al. [
Conditional random fields (CRFs) are undirected graphical models that compactly represent the conditional probability of a particular label sequence
Natarajan and Nevatia [
Ning et al. [
Basically, the deep learning architectures can be categorized into four groups, namely deep neural networks (DNNs), convolutional neural networks (ConvNets or CNNs), recurrent neural networks (RNNs), and some emergent architectures [
ConvNets are the most widely used among the mentioned deep learning architectures. Krizhevsky et al. [
One challenge for HAR using deep learning is how to apply it to small datasets, since HAR datasets are generally smaller than what ConvNets need. Common solutions include generating or duplicating more training instances, or converting HAR to a still image classification problem to leverage large image datasets (e.g., ImageNet) to pretrain the ConvNets. Wang et al. [
The most recent research aims to further improve the performance of ConvNets by combining it with other hand-crafted features or representations. Li et al. [
Unlike ConvNets, DNNs still use hand-crafted features instead of automatically learning features by deep networks from raw data. Berlin and John [
RNNs are designed for sequential information and have been explored successfully in speech recognition and natural language processing [
Among various RNNs architectures, the long short-term memory (LSTM) is the most popular one as it is able to maintain observations in memory for extended periods of time [
In addition to videos, RNNs can also be applied to skeleton data for activity recognition. Du et al. [
A detailed taxonomy of the representation and classification methods, and the datasets used by the works introduced in this review, is presented in Table
Feature encoding methods.
Method | Proposed by | Description paper, number of citations |
---|---|---|
Vector quantization (VQ)/hard assignment (HA) | Sivic et al. (2003) | [ |
Kernel codebook coding (KCB)/soft assignment (SA) | Gemert et al. (2008) | [
Sparse coding (SPC) | Yang et al. (2009) | [
Local coordinate coding (LCC) | Yu et al. (2009) | [ |
Locality-constrained linear coding (LLC) | Wang et al. (2010) | [ |
Improved Fisher kernel (iFK)/Fisher vector (FV) | Perronnin et al. (2010) | [ |
Triangle assignment coding (TAC) | Coates et al. (2010) | [ |
Vector of locally aggregated descriptors (VLAD) | Jegou et al. (2010) | [ |
Super vector coding (SVC) | Zhou et al. (2010) | [ |
Local tangent-based coding (LTC) | Yu et al. (2010) | [ |
Localized soft assignment coding (LSC/SA- |
Liu et al. (2011) | [ |
Salient coding (SC) | Huang et al. (2011) | [ |
Group salient coding (GSC) | Wu et al. (2012) | [ |
Stacked Fisher vectors (SFV) | Peng et al. (2014) | [ |
Besides the activity classification approaches, another critical research area is human tracking, which attracts wide attention in video surveillance systems for suspicious behavior detection. Human tracking is performed to locate a person along the video sequence over a time period, and the resulting trajectories of people are then further processed by expert surveillance systems for analyzing human behaviors and identifying potentially unsafe or abnormal situations [
Filtering is one of the widely used approaches for tracking, and the representative Kalman filter (KF) [
KF is a state estimate method based on linear dynamical systems that are perturbed by Gaussian noise [
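As a minimal illustration of the KF predict/update cycle on such a linear-Gaussian model, the numpy sketch below tracks a 1-D position with a constant-velocity state; the covariances and the noise-free measurements are illustrative, not from any cited system:

```python
import numpy as np

# Constant-velocity model: state s = [position, velocity], dt = 1.
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition
Hm = np.array([[1.0, 0.0]])              # we observe only the position
Q = 0.01 * np.eye(2)                     # process noise covariance (assumed)
R = np.array([[0.5]])                    # measurement noise covariance (assumed)

s = np.zeros(2)                          # initial state estimate
P = np.eye(2)                            # initial estimate covariance

measurements = [float(t) for t in range(1, 11)]   # target moving 1 unit per frame
for z in measurements:
    # predict: propagate the state and its uncertainty through the model
    s = F @ s
    P = F @ P @ F.T + Q
    # update: correct the prediction with the new measurement
    y = z - Hm @ s                       # innovation
    S = Hm @ P @ Hm.T + R                # innovation covariance
    K = P @ Hm.T @ np.linalg.inv(S)      # Kalman gain
    s = s + K @ y
    P = (np.eye(2) - K @ Hm) @ P

print(s)   # position estimate approaches 10, velocity approaches 1
```

Note that the velocity is never measured directly; the filter infers it from the sequence of positions, which is exactly what makes KF useful for predicting where a tracked person will appear next.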
Particle filter, or sequential Monte Carlo method [
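A particle filter replaces the KF's Gaussian posterior with a weighted set of samples, so it can represent nonlinear, non-Gaussian beliefs. The 1-D numpy sketch below (illustrative motion and observation models, toy noise levels) shows the predict/weight/resample loop:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 2000                                  # number of particles
particles = rng.normal(0.0, 2.0, N)       # initial belief over a 1-D position
true_pos = 0.0

for t in range(20):
    true_pos += 1.0                       # target moves one unit per frame
    z = true_pos + rng.normal(0, 0.5)     # noisy observation of the position
    # predict: propagate every particle through the motion model with noise
    particles += 1.0 + rng.normal(0, 0.3, N)
    # update: weight each particle by the observation likelihood
    w = np.exp(-0.5 * ((z - particles) / 0.5) ** 2)
    w /= w.sum()
    # resample: draw a new particle set proportional to the weights
    particles = rng.choice(particles, size=N, p=w)

estimate = particles.mean()
print(estimate)                           # close to the true position (20.0)
```

Because the belief is just a set of samples, the same loop works unchanged if the motion or observation model is made nonlinear, which is the practical advantage over KF for visual tracking.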
Kernel-based tracking [
Traditional kernel-based tracking used a symmetric constant kernel, and it tends to encounter problems with object scale and orientation variation, as well as object shape deformation. Research has been conducted on these problems. Liu et al. [
Early literature reported tracking methods using a single-kernel scheme. However, single kernel-based tracking can fail when the human is occluded; that is, the object may be lost or mismatched due to partial observation. Thus, multiple-kernel tracking is adopted in most recent research. Lee et al. [
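The localization step shared by the kernel-based trackers above boils down to mean-shift iterations: repeatedly moving the window center to the similarity-weighted centroid of positions inside the kernel support. The following 1-D numpy sketch uses a flat kernel over a synthetic similarity profile; all values are illustrative:

```python
import numpy as np

def mean_shift_1d(scores, start, bandwidth=5, iters=20):
    """Shift a window center to the score-weighted mean of positions
    inside the kernel bandwidth until it converges on a local mode."""
    x = float(start)
    pos = np.arange(len(scores), dtype=float)
    for _ in range(iters):
        mask = np.abs(pos - x) <= bandwidth       # flat (uniform) kernel support
        w = scores[mask]
        if w.sum() == 0:
            break
        x_new = (pos[mask] * w).sum() / w.sum()   # weighted centroid
        if abs(x_new - x) < 1e-3:                 # converged on a mode
            break
        x = x_new
    return x

# a synthetic "similarity profile" with the target centered at position 60
pos = np.arange(100, dtype=float)
scores = np.exp(-0.5 * ((pos - 60) / 4) ** 2)
found = mean_shift_1d(scores, start=52)
print(round(found))                               # 60
```

In a real tracker the scores come from comparing the candidate window's color histogram against the target model, and the same iteration runs in two dimensions; the single-kernel failure under occlusion discussed above corresponds to this profile losing its mode.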
Public datasets allow different approaches to be compared under the same standards and therefore accelerate the development of HAR methods. In this section, several representative datasets are reviewed, organized according to the three-level categorization mentioned in the beginning of this review (i.e., action primitive level, action/activity level, and interaction level). A good survey has already been published [
Overview of representative datasets.
Dataset | Modality | Level | Year | References | Web pages | Activity category |
---|---|---|---|---|---|---|
RGBD-HuDaAct | RGB-D | Interaction level | 2013 | [ | | 12 classes: eat meal, drink water, mop floor, and so forth |
Hollywood | RGB | Interaction level | 2008 | [ | | 8 classes: answer phone, hug person, kiss, and so forth |
Hollywood-2 | RGB | Interaction level | 2009 | [ | | 12 classes: answer phone, driving a car, fight, and so forth |
UCF sports | RGB | Interaction level | 2008 | [ | | 10 classes: golf swing, diving, lifting, and so forth |
KTH | RGB | Activity/action level | 2004 | [ | | 6 classes: walking, jogging, running, and so forth |
Weizmann | RGB | Activity/action level | 2005 | [ | | 10 classes: run, walk, bend, jumping-jack, and so forth |
NTU-MSR | RGB-D | Action primitive level | 2013 | [ | | 10 classes: 10 different gestures |
MSRC-Gesture | RGB-D | Action primitive level | 2012 | [ | | 12 classes: 12 different gestures |
MSR DailyActivity3D | RGB-D | Interaction level | 2012 | [ | | 16 classes: call cellphone, use laptop, walk, and so forth |
MSR Action3D | Depth | Activity/action level | 2010 | [ | | 20 classes: high arm wave, hand clap, jogging, and so forth |
While action primitives often act as components of high-level human activities (e.g., action primitives serve as a layer in a hierarchical HMM to recognize activities [
The NTU-MSR Kinect hand gesture dataset [
The MSRC-Kinect gesture dataset [
According to our definition, action/activity is middle-level human activity without any human-human or human-object interactions. We first review two classic datasets, namely the KTH human activity dataset and the Weizmann human activity dataset. Though these two datasets have gradually faded out of state-of-the-art evaluations and are now considered easy tasks (e.g., 100% accuracy on Weizmann in [
The KTH dataset [
The Weizmann activity dataset [
The MSR Action3D dataset [
Interaction level datasets pose relatively difficult tasks. Due to human-human or human-object interactions, interaction level human activities are more realistic and abound in various scenarios such as sport events [
Another well-known interaction level dataset is the Hollywood human activity dataset [
The UCF sports dataset [
The MSR DailyActivity3D dataset [
Human activity recognition remains an important problem in computer vision. HAR is the basis for many applications such as video surveillance, health care, and human-computer interaction. Methodologies and technologies have developed tremendously in the past decades and keep advancing. However, challenges still exist when facing realistic scenarios, in addition to the inherent intraclass variation and interclass similarity problems.
In this review, we divided human activities into three levels: action primitives, actions/activities, and interactions. We have summarized the classic and representative approaches to activity representation and classification, as well as some benchmark datasets at different levels. For representation approaches, we roughly sorted out the research trajectory from global representations to local representations and recent depth-based representations, and reviewed the literature in this order. State-of-the-art approaches, especially depth-based representations, were discussed, aiming to cover the recent development in the HAR domain. Classification methods, in turn, play important roles and prompt the advance of HAR. We categorized classification approaches into template-matching methods, discriminative models, and generative models; in total, seven types of methods, from the classic DTW to the newest deep learning, were summarized. For human tracking approaches, two categories were considered, namely filter-based and kernel-based tracking. Finally, seven datasets were introduced, covering levels from the primitive level to the interaction level and ranging from classic datasets to recent benchmarks for depth-based methods.
Though recent HAR approaches have achieved great success, applying them in real-world systems or applications is still nontrivial. Four future directions are recommended for consideration and further exploration.
First, most current well-performing approaches are hard to implement in real time or to deploy on wearable devices, as such platforms have constrained computing power. It is difficult for computationally constrained systems to match the performance of offline approaches. Existing work has utilized additional inertial sensors to assist recognition or has developed dedicated microchips for embedded devices. Besides these hardware-oriented solutions, from a computer vision perspective, more efficient descriptor extraction and classification approaches are expected, so that recognition models can be trained fast, even in real time. Another possible way is to degrade the quality of the input images and strike a balance among input information, algorithm efficiency, and recognition rate. For example, utilizing depth maps as inputs and abandoning color information is one way of degrading quality.
Second, many recognition tasks are solved case by case, for both the benchmark datasets and the recognition methods. Future research is encouraged to unite various datasets into a large, complex, and complete one. Though each dataset may act as a benchmark in its specific domain, uniting all of them encourages more effective and general algorithms that are closer to real-world conditions. For example, recent deep learning is reported to perform better on a larger dataset combining four datasets [
Third, mainstream recognition systems remain at a relatively low semantic level compared with higher-level behaviors. Ideally, a system should be able to report the behavior “having a meeting” rather than many people sitting and talking, or, even more difficult, conclude that a person hurried to catch a bus rather than just recognizing “running.” Activities are analogous to the words that make up a language of behaviors. Analyzing the logical and semantic relations between behaviors and activities is an important aspect, which can be learned by transferring natural language processing (NLP) techniques. Another conceivable direction is to derive additional features from contextual information. Though this direction has been largely exploited, current approaches usually introduce all possible contextual variables without screening, which not only reduces efficiency but also affects accuracy. Thus, dynamically and reasonably choosing contextual information is a promising topic for future research.
Finally, though recent deep learning approaches achieve remarkable performance, a conjoint ConvNets + LSTM architecture is expected for activity video analysis in the future. On the one hand, ConvNets are a spatial extension of conventional neural networks and exhibit advantages in image classification tasks. This structure captures spatial correlations but ignores the temporal dependencies of interframe content when modeling activity dynamics. On the other hand, LSTM, as a representative kind of RNN, is able to model temporal or sequence information, which makes up for the temporal shortcoming of ConvNets. LSTM is currently used in accelerometer-based recognition, skeleton-based activity recognition, and one-dimensional signal processing, but it has not received wide attention in combination with ConvNets for two-dimensional video activity recognition, which we believe is a promising direction in the future.
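The division of labor just described (ConvNets for spatial structure, LSTM for temporal dependencies) can be sketched end to end in numpy. Everything here is illustrative, not a trained model: each frame passes through a toy convolution-plus-pooling feature extractor, and the per-frame features are aggregated by an LSTM step into one video-level vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_features(frame, kernels):
    """Per-frame spatial features: valid 2-D convolution, ReLU, global average pool."""
    kH, kW = kernels.shape[1:]
    H, W = frame.shape
    feats = []
    for k in kernels:
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = (frame[i:i + kH, j:j + kW] * k).sum()
        feats.append(np.maximum(out, 0).mean())   # ReLU + global average pooling
    return np.array(feats)

def lstm_step(x, h, c, W, U, b):
    """One LSTM step aggregating the current frame feature into the memory."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    sig = lambda a: 1 / (1 + np.exp(-a))
    i, f, o = sig(z[:H]), sig(z[H:2 * H]), sig(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c + i * g
    return o * np.tanh(c), c

video = rng.normal(size=(8, 16, 16))      # dummy video: 8 frames of 16x16 gray
kernels = rng.normal(size=(4, 3, 3))      # 4 illustrative conv filters
Hh = 8
Wl = rng.normal(0, 0.1, (4 * Hh, 4))
Ul = rng.normal(0, 0.1, (4 * Hh, Hh))
b = np.zeros(4 * Hh)
h, c = np.zeros(Hh), np.zeros(Hh)
for frame in video:
    h, c = lstm_step(conv_features(frame, kernels), h, c, Wl, Ul, b)
print(h.shape)                            # (8,): a video-level feature vector
```

The final hidden state would feed a softmax classifier over activity classes; in a real system both the convolutional filters and the LSTM weights are learned jointly by backpropagation rather than drawn at random.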
The authors declare that there is no conflict of interest regarding the publication of this paper.
This research is supported by the National Natural Science Foundation of China (no. 61602430, no. 61672475, and no. 61402428); major projects of Shandong Province (no. 2015ZDZX05002); Qingdao Science and Technology Development Plan (no. 16-5-1-13-jch); and The Aoshan Innovation Project in Science and Technology of Qingdao National Laboratory for Marine Science and Technology (no. 2016ASKJ07).