A Comprehensive Review of Recent Deep Learning Techniques for Human Activity Recognition

Human action recognition is an important field in computer vision that has attracted remarkable attention from researchers. This survey provides a comprehensive overview of recent deep learning-based approaches to human action recognition using RGB video data. We divide recent deep learning-based methods into five categories. Moreover, pure-transformer (convolution-free) architectures have recently outperformed their convolutional counterparts in many fields of computer vision. We therefore also review recent convolution-free methods, which replace convolutional networks with transformer networks and have achieved state-of-the-art results on many human action recognition datasets. First, we discuss methods based on 2D convolutional neural networks. Then, methods based on recurrent neural networks, which are used to capture motion information, are discussed. 3D convolutional neural network-based methods are used in many recent approaches to capture both spatial and temporal information in videos. For long action videos, multistream approaches that use different streams to encode different features are reviewed. We also compare the performance of recently proposed methods on four popular benchmark datasets and review 26 benchmark datasets for human action recognition. Finally, some potential research directions are discussed to conclude this survey.


Introduction
Human action recognition is one of the most crucial tasks in video understanding. This field has a wide range of applications, such as video retrieval, entertainment, human-computer interaction, behavior analysis, security, video surveillance, and home monitoring. For example, we may want to find handshake events in a movie or offside decisions in a football match and have the results returned automatically. The goal of human action recognition is to automatically recognize the nature of an action from unknown video sequences.
There are several challenges in human action recognition. The need for view invariance is one of the reasons that make human action recognition more complex. Some simple datasets have a fixed viewpoint [1,2], while most recent datasets contain many viewpoints. In addition, each person has their own size, shape, and posture. They can appear with various clothes and accessories. An action performed in an indoor environment with a uniform or static background is easier to recognize than an action recorded against a cluttered or dynamic background. In addition, lighting conditions and viewpoints can increase or decrease recognition accuracy. The next problem is intraclass and interclass variation. A human action recognition method must be able to generalize an action over variations within a class and distinguish between actions of different classes. For example, people run or walk at different speeds. Occlusion is a hard issue in action recognition because some body parts temporarily disappear. For example, some body parts may cover other parts of the subject, or a person may be hidden behind another person. Temporal variation is also an important challenge because actions can last for a long time.
Deep learning methods have achieved state-of-the-art results on various computer vision problems, especially human action recognition. Convolutional neural networks (CNNs) [3] are neural networks that use the convolution operator in their layers. Convolutional networks are used for processing grid-structured data such as images, while recurrent neural networks (RNNs) [4] are a type of neural network for processing sequential data, such as text and video. In this survey, we focus on methods proposed for human action recognition using deep learning techniques.

Review of Related Survey Articles.
Since human action recognition is an attractive problem, many surveys have been conducted over the last few years. The most popular survey of human action recognition is the work in [5]. First, the authors discussed methods based on local and global representations. Then, three types of action classification approaches were discussed, including direct classification, temporal state-space models, and action detection. However, this study was conducted over ten years ago, and it reviewed methods using handcrafted features.
Zhang et al. [6] provided an overview of human action recognition, interaction recognition, and human action detection methods. A whole part of the survey was devoted to human action feature representation methods. First, the authors discussed handcrafted action features for RGB, depth, and skeleton data. Then, they reviewed some deep learning-based methods. However, they focused on two-stream networks and long short-term memory methods.
A review by Singh and Vishwakarma [7] focused on human action datasets of the past two decades. They classified these datasets into two classes, namely RGB (Red-Green-Blue) and RGB-D (depth) datasets. They discussed 26 RGB and 22 RGB-D datasets. Two categories of existing methods (handcrafted and learned feature representations) were discussed; however, the main contribution of this work is the dataset analysis.
RGB-D data play a vital role in human action recognition because they provide color, depth, and skeleton data. The performance of human action recognition systems improves significantly when they exploit depth and skeleton data. With a special focus on RGB-D data, Liu et al. [8] reviewed human action recognition and human interaction recognition based on handcrafted features. Then, in the next part, their survey discussed human activity recognition based on deep learning.
Zhu et al. [9] reviewed over 200 papers about human action recognition. Their survey focused on three different approaches for human action recognition. First, two-stream networks were reviewed. The two-stream methods try to exploit the temporal relationship between frames because motion information plays a vital role in human action recognition in video. The first stream encodes the spatial information and the second one encodes the optical flow. In their review, the authors focused on recurrent neural networks that were used as part of a two-stream network, while our work discusses RNN-based methods for human action recognition. Next, 3D CNN-based methods were discussed. 3D CNNs exploit both spatial and temporal information by using a 3D tensor with two spatial dimensions and one temporal dimension. Two-stream networks require huge computational resources, and 3D CNNs are hard to train. Therefore, they reviewed efficient video modeling methods, which try to reduce the computational cost.
Beddiar et al. [11] reported a survey that discussed human activity recognition approaches of the last ten years. The authors classified human activity recognition approaches into various categories. The first category is the feature extraction process. Both handcrafted features and feature learning were discussed. Then, they discussed three stages of human activity recognition approaches, including detection, tracking, and recognition. Next, unimodal and multimodal approaches were surveyed. They classified human activity recognition methods into three classes of learning supervision, namely supervised, unsupervised, and semisupervised methods. The review also covered different types of activities. However, recent deep learning techniques for human activity recognition were not clearly highlighted.
Jegham et al. [10] reviewed methods that aim to solve the many different challenges in human action recognition. The challenges discussed include anthropometric variation, multiview variation, cluttered and dynamic backgrounds, interclass similarity, intraclass variability, low-quality videos, occlusion, illumination variation, shadow and scale variation, camera motion, and poor weather conditions. In the second part, the authors reviewed recent action classification methods and popular datasets. They focused on three types of methods, including template-based methods, generative model-based methods, and discriminative model-based methods.
A different survey [12] discussed human pose estimation and its role in human action recognition applications. First, the survey discussed various types of human pose estimation, such as single-person, multiperson, and 3D human pose estimation, as well as human pose estimation in videos and depth images. In the remaining part, the authors discussed human pose estimation for action recognition.
A review of single-vision and multivision modalities was provided by Majumder and Kehtarnavaz [13]. In the single-vision modality section, the authors discussed approaches that use video data for action recognition. In the multivision modality section, methods using RGB + depth data were reviewed. For each modality, both conventional and deep learning approaches were reviewed. Table 1 provides a summary of recent related surveys. Moreover, some main contributions of this work are discussed.

Contributions of This Survey Article.
Human action recognition has a wide range of applications; therefore, many approaches have been proposed using deep learning techniques. We aim to provide a comprehensive survey of recent deep learning techniques for human action recognition. In summary, our main contributions are as follows: (i) We discuss the most recent deep learning techniques for human action recognition. (ii) We provide the first review of convolution-free approaches in the human action recognition field. (iii) We survey the most popular benchmark datasets for human action recognition. (iv) We provide a comprehensive analysis of the proposed methods. As shown in Figure 1, the rest of the survey is organized as follows. In Section 2, we discuss the most recent deep learning techniques for human action recognition. Then, we provide two accuracy comparisons on some popular datasets in Section 3. Section 4 reviews many popular benchmark datasets in the human action recognition field. Finally, we discuss some open research problems and give the conclusion of this survey.

Recent Deep Learning-Based Methods in Human Action Recognition
In this section, we review recent deep learning-based methods for human action recognition. The development of large-scale datasets and deep learning has led to a remarkable growth of deep learning-based models for human action recognition. There are four established trends, and a new trend has recently attracted some researchers. The first trend is 2D networks, which use 2D convolutional neural networks in their models, such as TSM [14], TRN [15], and GSM [16]. The second trend is action recognition based on RNNs, such as in [17][18][19]. The third trend is 3D single-stream networks, which use 3D convolutional kernels, such as CSN [20] and TSN [21,22]. The fourth trend is 3D two-stream networks, which include a spatial and a temporal stream to encode both structure and optical flow information, such as in [23][24][25][26]. Finally, convolution-free approaches based on the attention mechanism are a new trend in human action recognition that offers efficient computation and good performance, such as in [27][28][29] and TimeSformer [30].

Methods Based on 2D CNN.
In this part, we discuss the proposed methods that are based on 2D CNNs. One of the advantages of 2D CNNs is that their computation is cheap [14]. However, 2D CNNs often cannot exploit temporal information well. Therefore, many approaches try to capture both spatial and temporal information [16,31,32]. Optical flow information plays a vital role in action recognition, but it is expensive to compute. Therefore, in [33][34][35], the authors tried to compute optical flow at low cost and with high efficiency.
The streams of two-stream networks are often trained individually at high computational cost. Jiang et al. [31] proposed an efficient method to exploit both spatiotemporal and motion features in a 2D framework, namely the STM block. The STM block includes a channel-wise spatiotemporal module (CSTM) and a channel-wise motion module (CMM). The CSTM is used to extract spatiotemporal information. The input feature map $F \in \mathbb{R}^{N \times T \times C \times H \times W}$ is reshaped into $F^{*} \in \mathbb{R}^{NHW \times C \times T}$. Then, a channel-wise 1D convolution is applied to the input feature maps. A 3D convolutional network can encode local spatial and temporal features. However, it cannot encode the ordered temporal information of all clips. A channel-independent directional convolution (CIDC) [32] was introduced to solve this issue. Given an input feature map with $C$ channels, CIDC convolves each channel of the input feature map with $T'$ filters. The output feature map, which includes spatial and temporal information, is obtained by concatenating the $C \times T'$ feature maps. Another strategy in action recognition is frame selection. Gowda et al. [36] proposed a smart frame selection method that improved over many state-of-the-art models. The method includes two branches. The first computes a score $\delta_i$ for each frame, and the second computes a score $c_i$ for a pair of frames. Given $n$ frames, the top $m$ frames are chosen using a final score obtained by multiplying the scores $\delta_i$ and $c_i$. Finally, a classifier is used for the final prediction. The authors used the Something-Something-V2 dataset [37] for the ablation study.
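The reshape-then-convolve step described for the CSTM can be sketched in NumPy. This is only a minimal illustration under assumptions: the function name `channelwise_temporal_conv`, the odd kernel size, the zero padding at the clip boundaries, and the use of fixed rather than learned kernels are hypothetical choices made here, not details taken from [31].

```python
import numpy as np

def channelwise_temporal_conv(feat, kernels):
    """Apply one 1D temporal kernel per channel (no cross-channel mixing).

    feat:    array of shape (N, T, C, H, W)
    kernels: array of shape (C, K) with K odd (assumed for "same" padding)
    """
    n, t, c, h, w = feat.shape
    k = kernels.shape[1]
    # (N, T, C, H, W) -> (N, H, W, C, T) -> (N*H*W, C, T)
    x = feat.transpose(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
    pad = k // 2
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad)))  # zero-pad the time axis
    out = np.zeros_like(x)
    for i in range(k):
        # tap i of every channel's kernel multiplies a shifted view of the clip
        out += x_pad[:, :, i:i + t] * kernels[None, :, i:i + 1]
    # restore the original (N, T, C, H, W) layout
    return out.reshape(n, h, w, c, t).transpose(0, 4, 3, 1, 2)
```

With the kernel `[0, 1, 0]` on every channel the function returns its input unchanged, which is a quick sanity check that the two reshapes are inverse to each other.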
According to observations, movement variations at motion boundaries are very important in human action recognition. Zhang et al. proposed persistence of appearance (PA) [33] to obtain a map that encodes small motion variations at boundaries. The difference between optical flow and PA is that PA captures the motion variation without encoding the direction of the movement. Given two frames, eight 7 × 7 convolutions are applied to obtain the low-level feature maps $F_1$ and $F_2$. The $i$-th PA component, $PA_i$, is computed from the $i$-th feature maps of the two frames. All $PA_i$ are aggregated into a single-channel PA. PA maps appearance to dynamic motion because it maps a three-dimensional tensor to a two-dimensional one. To exploit motion information, Piergiovanni and Ryoo [34] proposed a convolutional layer that captures the flow of any channel for action recognition without computing optical flow. The proposed fully differentiable convolutional layer has learned parameters that enhance the performance of action recognition systems. Computing optical flow is expensive. Xu et al. [35] proposed a fast network to improve the extraction of optical flow. The optical flow is generated by MotionNet [41], which is an end-to-end trainable network. Moreover, OFF [42] is added to the network to obtain better optical flow features. The optical flow is computed directly from RGB frames without precalculation or storage. Therefore, both spatial and temporal information are learned by one network.
One of the most popular modules in human action recognition is the temporal shift module (TSM) [14]. TSM has the complexity of a 2D CNN but achieves the performance of a 3D CNN. In addition, this module can be inserted into a 2D CNN without extra computation or parameters. Given a tensor with $C$ channels and $T$ frames, one part of the channels is shifted by −1 along the temporal dimension, another part is shifted by +1, and the rest of the tensor is left unshifted. The TSM can be inserted before a convolutional layer or a residual block, but the spatial features may be harmed because information is lost. To deal with this problem, the TSM is inserted into the residual branch of a residual block. To exploit the temporal relations between frames in a video, Zhou et al. [15] proposed a temporal relation network (TRN), which accurately predicts human-object interactions in the Something-Something dataset. Their paper shows that the TRN outperformed two-stream networks as well as 3D convolutional networks. The pairwise temporal relation is computed as $T_2(V) = h_\phi\left(\sum_{i<j} g_\theta(f_i, f_j)\right)$, where $V = \{f_1, f_2, \ldots, f_n\}$ is the video with $n$ frames. The functions $h_\phi$ and $g_\theta$ are used to fuse the frame features. Moreover, the function that captures frame relations at different scales is described as $MT_N(V) = T_2(V) + T_3(V) + \cdots + T_N(V)$, where $T_d$ is the temporal relation of $d$ frames. A 2D convolutional neural network (CNN) has fewer parameters and faster computation than a 3D CNN. However, a 2D CNN usually captures only spatial information. Sudhakaran et al. proposed a gate shift module (GSM) [16], a 2D CNN module that captures spatial and temporal features. A spatial convolution is applied to the input. Then, a grouped spatial gating is computed. The 2D convolution output is split into group-gated features and a residual. The gated features are group-shifted and fused with the residual. The spatial and temporal information is exploited by learning the spatial gating. In a different approach from the abovementioned methods, Zhang et al. [43] applied video super-resolution to human action recognition by introducing two video super-resolution (SR) modules, namely spatial-oriented SR (SoSR) and temporal-oriented SR (ToSR). The low-resolution input video is enhanced by the two proposed modules. The input of the recognition network includes the output of the SoSR module and the optical flow computed from the output of the ToSR module.
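The channel shifting used by TSM [14] can be illustrated with a short NumPy sketch. Several details here are assumptions made for illustration and are not stated in the text above: the `(N, T, C, H, W)` layout, the `fold_div` parameter that controls how many channels are shifted, the zero filling at the clip boundaries, and the choice of which direction counts as −1. The `conv` argument of `residual_shift_block` is a hypothetical stand-in for the convolution of the residual branch.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Shift a fraction of the channels along time; x has shape (N, T, C, H, W)."""
    c = x.shape[2]
    fold = c // fold_div
    out = np.zeros_like(x)
    # first group: frame t receives the features of frame t + 1
    out[:, :-1, :fold] = x[:, 1:, :fold]
    # second group: frame t receives the features of frame t - 1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]
    # remaining channels are left unshifted
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]
    return out

def residual_shift_block(x, conv):
    """Shift only inside the residual branch so the identity path keeps x intact."""
    return x + conv(temporal_shift(x))
```

Because the shift is pure memory movement, it adds no parameters and no multiplications; placing it in the residual branch keeps the unshifted spatial features available through the identity connection.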

Methods Based on RNN.
CNNs are popular models for image representation. They are also used to learn action representations in videos [14][15][16]. However, they often work well only with short videos [33,34], since only spatial features are captured and the motion information of actions is not encoded. To encode longer motion in video, some approaches have used RNNs and long short-term memory (LSTM), such as in [17][18][19]. RNNs are widely used for sequence data such as video and text. LSTM is a special version of the RNN with the capability of learning long-term information. In addition, LSTM is combined with an attention mechanism [44] or used in a three-stream network [45,46] for action recognition.
With video data, RNNs and LSTMs require high memory storage and computation cost. A compact LSTM model (TR-LSTM) [17] was proposed to solve this issue. TR-LSTM uses the tensor ring decomposition to reconstruct the input-to-hidden layer of the recurrent network. In the tensor ring decomposition, the first and last tensors are connected circularly, forming a ring-like structure. A densely connected bidirectional LSTM (DB-LSTM) network [18] is used to represent the spatial and temporal information of human actions. The goal of DB-LSTM is to capture spatial, short-term, and long-term patterns. The spatial and short-term patterns are extracted by a sample representation learner module, and the long-term patterns are exploited by a sampling stack. Another work, named correlational convolutional LSTM ($C^2$LSTM) [19], aims to exploit both the spatial and temporal information of human action videos. The basic spatial features are extracted by two parallel convolutional networks, and then these features are used as input for the $C^2$LSTM module. The $C^2$LSTM extracts the spatial and temporal information as well as the time relation by using cross-correlation inside the LSTM.
A three-stream network was proposed by Liu et al. [45] for human action recognition. The network includes a spatial stream, a temporal stream, and a spatial-temporal saliency stream. These streams are used to extract the appearance information of RGB frames, the motion information of optical flow frames, and the spatiotemporal foreground information of objects from spatiotemporal saliency maps. In addition, they proposed three attention-aware LSTMs to exploit the relationship between frames. Another three-stream network [46] processes different frame rates for human activity recognition. The first stream operates at a single frame rate and the second stream operates at low frame rates. Both streams are used to capture spatial features. The third stream operates at high frame rates to capture temporal features. The output of the previous step is fed into two LSTM layers. This makes the proposed model deeper. Instead of using the LSTM layer, the authors use an attention mechanism to capture temporal information.
To extract the salient features of human action videos, Ge et al. [44] introduced an attention mechanism and a convolutional LSTM. A convolutional network is used to extract features of the input video. Then, a combination of an LSTM and a spatial transformer network extracts salient features. The final classification is obtained by a convolutional LSTM. The proposed combination can select salient localities effectively while achieving higher accuracy than soft attention and requiring less computation than hard attention.

Methods Based on 3D Single-Stream Network.
In this part, we discuss 3D convolution-based models. These methods obtain good results since a 3D CNN extracts spatial and temporal information directly from action videos. The input frames are fed into a 3D single-stream network to extract both spatial and temporal features.
Tran et al. [20] proposed a channel-separated convolutional network (CSN), which employs 3D group convolution. CSNs are defined as 3D CNNs in which only 1 × 1 × 1 conventional convolutions or k × k × k depthwise convolutions are used. In detail, the conventional convolutions are used for channel interaction, and the depthwise convolutions are used for local spatiotemporal interactions. In their work, the 3 × 3 × 3 convolution in the bottleneck block is replaced by a pair of a 1 × 1 × 1 convolution and a 3 × 3 × 3 depthwise convolution to obtain an interaction-preserved channel-separated bottleneck block. Moreover, the 1 × 1 × 1 convolution in this pair is removed to obtain an interaction-reduced channel-separated bottleneck block. The authors also applied group convolution to ResNet blocks. The two 3 × 3 × 3 convolutional layers of the simple ResNet block are replaced by two 3 × 3 × 3 grouped convolutions or by a set of one 1 × 1 × 1 convolution and two depthwise convolutions. 3D convolutional neural networks have high training complexity and a huge memory cost. To resolve this problem, Zhou et al. [47] proposed a combination of 2D and 3D convolution, namely the mixed convolutional tube (MiCT). The deep MiCT network is an end-to-end network that receives RGB video sequences as inputs. The whole network includes four MiCTs and a global pooling in the last layer. This pooling allows the network to accept videos of any length as inputs. Each MiCT block receives a 3D signal. The input is processed by a 3D convolution to extract spatial-temporal feature maps. The extracted features are passed through a 2D convolution to compute the final feature maps. MiCT-Net uses fewer 3D convolutions, but it obtains deeper feature maps. Instead of combining 2D and 3D convolutions, a new spatiotemporal architecture fuses 2D and 3D architectures to improve the spatiotemporal representation. Diba et al. proposed the holistic appearance and temporal network (HATNet) [48], which exploits semantic information at different levels. HATNet uses 2D convolutional blocks to encode the appearance information of individual frames in a video clip. In addition, the 3D convolutions extract temporal information from a batch of frames. ResNet18 and ResNet50 were used in HATNet for the 3D and 2D modules, respectively. The output feature maps of each 2D and 3D block are merged; then, a 1 × 1 × 1 convolution is applied to reduce the number of feature channels. With pretraining on the HVU dataset [48], HATNet obtained 97.8% and 76.5% accuracy on the UCF101 [39] and HMDB51 [40] datasets, respectively.
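The saving obtained by channel separation in CSN [20] can be made concrete with a small weight count. The sketch below is only illustrative: the helper `conv3d_params`, the channel width of 64, and the omission of bias terms are assumptions, and only the middle 3 × 3 × 3 layer of the bottleneck and its replacements are counted.

```python
def conv3d_params(c_in, c_out, k, groups=1):
    """Number of weights in a k x k x k 3D convolution (bias ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k ** 3

c = 64  # hypothetical channel width of the bottleneck's middle layer

# conventional 3 x 3 x 3 convolution: every output channel sees every input channel
standard = conv3d_params(c, c, 3)

# interaction-preserved: 1 x 1 x 1 convolution followed by a 3 x 3 x 3 depthwise convolution
interaction_preserved = conv3d_params(c, c, 1) + conv3d_params(c, c, 3, groups=c)

# interaction-reduced: only the 3 x 3 x 3 depthwise convolution remains
interaction_reduced = conv3d_params(c, c, 3, groups=c)

print(standard, interaction_preserved, interaction_reduced)  # 110592 5824 1728
```

For this width, the conventional layer needs 27·C² weights, the interaction-preserved pair needs C² + 27·C, and the interaction-reduced variant needs 27·C, which shows why separating channel interaction from spatiotemporal interaction is cheap.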
A video usually contains repeated information, and the temporal squeeze network [21] can map the movement information of a long video into a small set of frames. Given a video $X$ with $K$ frames, a frame-wise descriptor $z$ is obtained by applying the squeeze operation. The output of the squeeze operation is fed into an excitation operation. Global average pooling is used to implement the squeeze operation, while the excitation operation is implemented by two fully connected layers and two activation functions. The shorter frame sequence $Y'$ is obtained by projecting the flattened vector of $X$ onto the hyperplane $A$, where $A$ is computed from the output of the excitation operation. To reduce the computational cost of motion features, a FAST-GRU network [49] aggregates the temporal information. The FASTER framework uses an expensive model and a lightweight model to exploit the information of the action and the scene, respectively. The FAST-GRU aims to learn features from multiple models. This network maintains the resolution of the feature maps to exploit more spatial-temporal information. A fully connected layer is replaced by a 3D 1 × 1 × 1 convolution. The proposed method was evaluated on the Kinetics [38], UCF101 [39], and HMDB51 [40] datasets. A combination of a 3D convolutional neural network and long short-term memory [50] is used to capture low-level spatial-temporal features and high-level temporal features. The proposed network used the Inflated 3D CNN (I3D) [38] to extract spatial features and low-level motion features from a sequence of frames. Then, the output of the I3D model is fed into an LSTM network to exploit high-level temporal features. Temporal information plays a vital role in human action recognition; however, exploiting this information is still challenging. A temporal difference network (TDN) [51] was proposed to capture multiscale temporal information. In addition, TDN is an end-to-end model that captures both short-term and long-term motion information. Given $T$ frames $I = [I_1, \ldots, I_T]$, a short-term and a long-term temporal difference module (TDM) are applied to exploit short-term and long-term motion. To capture the short-term motion, a stacked RGB difference of frame $I_i$ is downsampled using average pooling, and motion information is then extracted with a 2D network. The feature is upsampled to match the size of the RGB features. In the long-term TDM, the aligned temporal difference is computed and then fed into a multiscale module to extract long-range motion information. Features are enhanced by a bidirectional cross-segment temporal difference. The TDN framework with a ResNet backbone [52] was evaluated on Kinetics-400 [38] and Something-Something-V1 and V2 [37]. Instead of computing the optical flow frame by frame, the MotionSqueeze module [53] learns motion features with a lightweight learning technique. The module contains three parts, namely correlation computation, displacement estimation, and feature transformation. The correlation score is defined as $s(\mathbf{x}, \mathbf{p}, t) = F^{(t)}_{\mathbf{x}} \cdot F^{(t+1)}_{\mathbf{x}+\mathbf{p}}$, where $F^{(t)}$ and $F^{(t+1)}$ are two input feature maps. Then, motion information is estimated in the displacement estimation module, and a confidence map is obtained from the correlation. The concatenation of the displacement map and the confidence map is used as the input of the feature transformation. The feature transformation converts this input into an effective motion feature. The MotionSqueeze module is inserted into ResNet and evaluated on the Something-Something-V1, Something-Something-V2 [37], Kinetics [54], and HMDB51 [40] datasets.
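The correlation score of the MotionSqueeze module [53] can be sketched in NumPy as a dot product between a feature vector in one frame and the feature vectors in a small window of the next frame. The window radius, the zero padding, and the function names are assumptions made for illustration. The hard argmax in `argmax_displacement` is a deliberately simplified stand-in for the displacement estimation step and is not claimed to be the estimator used in [53].

```python
import numpy as np

def local_correlation(f_t, f_next, radius=1):
    """Correlation volume between two feature maps of shape (C, H, W).

    corr[dy + r, dx + r, y, x] is the dot product of f_t[:, y, x]
    and f_next[:, y + dy, x + dx], with zeros outside the image.
    """
    c, h, w = f_t.shape
    r = radius
    padded = np.pad(f_next, ((0, 0), (r, r), (r, r)))
    corr = np.zeros((2 * r + 1, 2 * r + 1, h, w), dtype=f_t.dtype)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = padded[:, r + dy:r + dy + h, r + dx:r + dx + w]
            corr[dy + r, dx + r] = (f_t * shifted).sum(axis=0)
    return corr

def argmax_displacement(corr):
    """Pick, per position, the displacement with the highest correlation."""
    k = corr.shape[0]
    r = k // 2
    flat = corr.reshape(k * k, *corr.shape[2:])
    dy, dx = np.divmod(flat.argmax(axis=0), k)
    return dy - r, dx - r
```

If the second feature map is the first one moved one pixel to the right, interior positions obtain their highest score at the displacement (0, +1), which is the motion the later stages are meant to encode.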
Kalfaoglu et al. [22] proposed a method that obtained the highest accuracy on both the HMDB51 [40] and UCF101 [39] datasets, with 85.10% and 98.69%, respectively. The most important point of this study is that the authors replace the conventional temporal global average pooling (TGAP) layer with a bidirectional encoder representations from transformers (BERT) layer. This replacement exploits the temporal information with BERT's attention mechanism. They argued that TGAP ignores the order of the temporal features, whereas BERT can focus on the important temporal features. The proposed network removes the temporal global average pooling at the end of the 3D CNN architecture. A learned positional encoding is added to the extracted features to maintain the positional information. The last two parts of the architecture are multihead attention and classification. They also proposed some feature reduction blocks. Attention is a useful tool in many fields of computer vision. A novel W3 (what-where-when) video attention module [55], which includes a channel-temporal attention $M_c$ and a spatiotemporal attention $M_s$, was proposed for the action recognition problem. An average pooling and a max pooling are used to aggregate global spatial information. The output is fed into a shared MLP network to exploit the interchannel relationship. To model the temporal dynamics of objects, a channel-temporal attention with two layers of 1D convolutions is computed. For the spatiotemporal attention, an average pooling and a max pooling are used, as in the channel-temporal attention, to exploit spatial feature maps. The features are concatenated and fed into a 2D convolution to obtain frame-level spatial attention. To obtain the temporal attention, two 3D convolutional layers are applied to the frame-level spatial attention of the previous step. The W3 attention module was integrated into the ResNet50-based TSM [14]. The backbone CNN plays a vital role in many recent action recognition systems. Martinez et al. [56] changed the last layers of the backbone network to improve the representation capacity. The important information is maintained in the global feature branch. The global feature branch consists of a global average pooling and a linear classifier. The average pooling aggregates the spatial and temporal information of the video. In the discriminative filter bank, the filters include 1 × 1 or 1 × 1 × 1 convolutions and global max pooling to compute the highest activation value. The third branch is the local detail preserving feature branch. A bilinear upsampling operation is applied to double the resolution of the features. A skip connection is added from the features of stage 4. Two backbone networks (2D TSN [57] and inflated 3D [38]) were used to evaluate the proposed module on Something-Something-V1 [58] and Kinetics-400 [38]. Temporal modeling methods based on 3D CNNs require a large number of parameters and computations. Lee et al. [59] proposed VoV3D, a 3D network with an effective temporal modeling module. The module is named temporal one-shot aggregation (T-OSA). The T-OSA uses several 3D convolutions with different receptive fields. All the output features are concatenated, and their dimension is reduced by a 1 × 1 × 1 convolution. In addition, the authors proposed a depthwise spatial-temporal module, which decomposes a 3D depthwise convolution into a spatial depthwise convolution and a temporal depthwise convolution to make the network more lightweight and efficient. Something-Something-V1, Something-Something-V2 [58], and Kinetics-400 [38] were used for evaluation.
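The argument made for replacing TGAP [22], namely that averaging over time ignores the order of the temporal features, can be checked with a toy NumPy example. The attention pooling below is a heavily simplified, single-head stand-in with a fixed query and a fixed positional encoding; it is an assumption used only to show order sensitivity and is not the BERT layer of [22].

```python
import numpy as np

def tgap(features):
    """Temporal global average pooling: (T, D) -> (D,)."""
    return features.mean(axis=0)

def attention_pool(features, pos_enc, query):
    """Toy single-head attention pooling over time with a positional encoding."""
    x = features + pos_enc                      # (T, D): inject frame positions
    scores = x @ query / np.sqrt(x.shape[1])    # (T,): one score per time step
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights = weights / weights.sum()
    return weights @ x                          # (D,): weighted temporal summary

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))   # hypothetical features of 4 time steps
pos = rng.standard_normal((4, 8))     # stands in for a learned encoding
q = rng.standard_normal(8)

same_tgap = np.allclose(tgap(feats), tgap(feats[::-1]))
same_attn = np.allclose(attention_pool(feats, pos, q),
                        attention_pool(feats[::-1], pos, q))
```

Reversing the frame order leaves the TGAP output unchanged, while the pooled output that uses a positional encoding changes, which is the property the replacement relies on.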
Zhao and Snoek [60] proposed a single two-in-one stream network to reduce the computational complexity of the two-stream network. The network processes both RGB and optical flow in a single stream. The most important contributions of this work are the motion condition layer and the motion modulation layer. The motion condition layer maps flow inputs to a motion condition Ψ. Then, the motion condition Ψ is fed into the motion modulation layer to learn two affine transformation parameters (β, γ). These parameters are used to influence the appearance network as $M_2(F_{rgb}) = \beta \odot F_{rgb} + \gamma$, where $F_{rgb}$ denotes the RGB feature maps and ⊙ is the element-wise multiplication operation. Instead of deeply stacking convolution layers, Huang and Bors [61] proposed the region-based nonlocal (RNL) operation to exploit long-range dependencies. The RNL operation computes the relation between two positions based on their features and the neighboring features. The feature of each position is computed from all neighboring positions. The RNL operator is embedded into a residual block as $z = yW_z + x$. In addition, the RNL block is combined with the SE [62] block to exploit spatiotemporal attention and channel attention. Two backbone networks are used to implement the proposed RNL, including ResNet-50 [52] and temporal shift modules (TSM) [14]. The network was evaluated on Something-Something-V1 [37] and Kinetics-400 [38]. Furthermore, OmniSource [63] trains video recognition models using web data, such as images, short videos, and long videos. The method trains a 2D teacher network and a 3D teacher network to filter out the web data that have low confidence scores. Hua et al. proposed a dilated silhouette convolutional network (SCN) [64] for human action recognition in video. The silhouette boundary curves of the moving subject are extracted, and then the silhouette curves are stacked as a 3D curve volume. The curve volume is resampled to a 3D point cloud to represent the spatial and temporal information of actions.
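The motion modulation of the two-in-one stream network [60] described above is an element-wise affine transformation of the RGB features. A minimal sketch is given below, assuming the feature map is flattened to a list and that the scale and shift parameters (named `beta` and `gamma` here) have already been predicted from the motion condition; the function name and the sample values are hypothetical.

```python
def motion_modulation(f_rgb, beta, gamma):
    """Return beta ⊙ f_rgb + gamma, computed element by element.

    f_rgb, beta, and gamma are flat lists of equal length; in a real network
    they would be tensors with matching (or broadcastable) shapes.
    """
    if not (len(f_rgb) == len(beta) == len(gamma)):
        raise ValueError("f_rgb, beta, and gamma must have the same length")
    return [b * f + g for f, b, g in zip(f_rgb, beta, gamma)]


# Hypothetical values: scale and shift three appearance features.
features = [1.0, 2.0, -1.0]
scale = [1.0, 0.5, 2.0]
shift = [0.0, 1.0, 0.5]
print(motion_modulation(features, scale, shift))  # [1.0, 2.0, -1.5]
```

With a scale of one and a shift of zero the appearance features pass through unchanged, so the flow input only alters the RGB stream where the learned parameters deviate from that identity.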

Methods Based on 3D Multistream Network.
Multistream networks can exploit different types of features in human action recognition. Spatiotemporal information and motion information are two important features of human action recognition. A two-branch network consists of an RGB branch and a flow branch. The RGB branch exploits the visual structure of scenes and objects, while the flow branch exploits the motion of objects. Many recently proposed methods use a 3D CNN to exploit spatiotemporal information and a flow stream to exploit motion information [24,26,38]. The two-stream network obtains state-of-the-art accuracy by using RGB and flow images as input. However, each stream is usually trained individually, and computing the optical flow is expensive. Therefore, some approaches try to construct a two-stream network more efficiently [23,65]. Figure 3 shows a two-stream network architecture that is used in many recent approaches.
To apply different types of attention, a two-stream attention network [26] was proposed based on the visual attention mechanism. The network contains two streams. The first stream is the temporal feature stream, which takes an optical flow image sequence as input. An LSTM and a temporal attention are used to aggregate the information of the optical flow images. The second stream is a spatial-temporal feature stream. This stream uses an LSTM architecture to encode the temporal relationship. The spatial features are extracted by several convolutions. Then, the spatial attention highlights the important locations for the next step of feature generation, and the temporal attention is used to focus on the relevant frames.
The method was evaluated on UCF11 [66], UCF Sports [67], and jHMDB [68]. Another approach converts 2D classification networks into 3D ConvNets. The network is named Two-Stream Inflated 3D ConvNets (I3D) [38]. The authors inflated all the filters and pooling kernels of the 2D architecture by adding a temporal dimension. To pretrain the 3D model on the ImageNet dataset, the authors converted an image into a video by copying it many times. The network has two streams. The first stream uses RGB inputs, and the second one uses flow inputs. The two networks are trained separately, and their results are averaged. A two-pathway convolutional neural network [24], namely Fine and Coarse, was proposed by Huang et al. In the fine branch, the motion information of the raw input is extracted by a motion band-pass module. The extracted motion is fed into a backbone CNN [69] to learn fine-grained motion features. On the other hand, the coarse branch is used to learn coarse-grained information.
The raw frames are downsampled and fed into a backbone CNN to exploit coarse-grained features. In order to merge the features from the two branches, a lateral connection module was established. The proposed method was evaluated on Something-Something-V1 [37], Kinetics-400 [38], UCF101 [39], and HMDB51 [40]. A combination of RGB, flow, pose, and pairwise streams [70] was proposed to improve the performance of the action recognition system. The network includes two branches. The first branch uses C3D [71] and I3D [38] as backbone networks to extract spatial and temporal information. In the second branch, a pairwise stream learns the spatial relationship between the subject who performs the action and the surrounding objects. In addition, a pose stream takes keypoint images as input. Keypoint images provide the connected key body parts of a person. The predicted results are obtained by using the late fusion method. The network was evaluated on UCF101 [39] and HMDB51 [40]. Because optical flow is expensive to compute, a proposed approach [23] mimics the motion stream using a standard 3D CNN. The authors introduced two learning strategies, namely Motion Emulating RGB Stream (MERS) and Motion-Augmented RGB Stream (MARS). In the first strategy, a flow network is trained to classify actions using optical flow clips. Then, MERS is trained to mimic the flow stream using only RGB frames. The last layer of MERS is trained by using the imitative flow features. In the second strategy, a flow stream (teacher) is trained on optical flow clips. Next, the weights of the teacher network are frozen, and MARS (student) is trained with RGB frames as input. Since only RGB frames are used as input in the testing phase, the network avoids the high computation cost of optical flow. For the same reason, Stroud et al. [65] introduced the Distilled 3D Network (D3D), which obtained high performance without optical flow computation during inference.
D3D transfers the motion information of the temporal stream into the spatial stream. This leads the spatial stream to behave like the temporal stream. D3D trains two networks, a teacher network and a student network. The teacher network is the learned temporal stream of a two-stream network, and the student network is the spatial stream. The knowledge of the teacher network is distilled into the student network during the training phase.
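The teacher-student training used by MARS [23] and D3D [65] can be sketched as a classification loss on the RGB student plus a penalty that pulls the student features toward those of the frozen flow teacher. The mean squared error penalty, the weight `alpha`, and all function names below are assumptions made for illustration; the text above does not specify the exact loss used by either method.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mse(a, b):
    """Mean squared error between two equally sized feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(student_logits, label, student_feat, teacher_feat, alpha=1.0):
    """Classification loss plus a feature-matching term (assumed form).

    teacher_feat comes from the frozen flow network and is treated as a constant;
    alpha is a hypothetical weight balancing the two terms.
    """
    return cross_entropy(student_logits, label) + alpha * mse(student_feat, teacher_feat)


# Hypothetical two-class example with two-dimensional features.
loss = student_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 4.0], alpha=0.5)
print(round(loss, 4))
```

Setting `alpha` to a large value approaches the pure imitation behavior described for MERS, while a moderate value keeps the student anchored to the ground-truth labels as well.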
One of the problems of a two-stream network is how to exploit the complementary information between the two streams [25]. To solve this issue, Zhang et al. proposed a cross-stream network [25]. Two similar backbone networks are used to extract structure and motion features. Then, a cross-stream connection block is used to compute the correlation between the appearance and motion features. The classification scores are obtained by a classifier that takes the extracted features of the previous blocks as input. The cross-stream network was evaluated on UCF101 [39], HMDB51 [40], and Something-Something-V2 [58]. Most multimodality methods fuse their streams at the last stage of the model. A cross-modality approach [72] exchanges information between modalities in a more effective way. The proposed network has two branches. Instead of averaging the scores of the two branches, several cross-modality attention (CMA) blocks are added after some stages of the network. The CMA block matches a query of the first modality with key-value pairs of the second modality.
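The query and key-value matching performed by a CMA block can be sketched as a standard cross-attention step, in which queries come from one modality and keys and values from the other. The scaled dot-product form and the function names below are assumptions for illustration; [72] may use a different similarity function or normalization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Attend from one modality (queries) to another (keys, values).

    queries: list of d-dimensional vectors from the first modality.
    keys, values: lists of vectors from the second modality (same length).
    Returns one attended vector per query.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(attended)
    return outputs


# Hypothetical example: one RGB query attending over two flow positions.
rgb_query = [[1.0, 0.0]]
flow_keys = [[0.0, 0.0], [0.0, 0.0]]
flow_values = [[2.0], [4.0]]
print(cross_attention(rgb_query, flow_keys, flow_values))  # [[3.0]]
```

In the example both keys score equally, so the output is the plain average of the two flow values; keys that are more similar to the query would receive larger weights.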
A very deep network [73] uses residual learning to encode spatial-temporal information in human action recognition videos. The network, named residual spatial-temporal attention network (R-STAN), includes two streams. Since the computation of optical flow is costly, RGB Difference images are used to extract motion information. The RGB Difference images are computed by applying an element-wise subtraction operation between two frames. The network is constructed of many residual spatial-temporal attention blocks, each including a residual block, a temporal attention module, and a spatial attention module. A feature map is processed as $M'(x) = M \odot A_T \odot A_S$, where $M$ and $M'$ are the input and output feature maps and $A_T$ and $A_S$ are the temporal and spatial attention, respectively. Two standard datasets (UCF101 [39] and HMDB51 [40]) were used to evaluate the proposed method. A proposed neural network [74] computes the local and global representations in parallel. Therefore, the feature maps are processed in a local path and a global path. In the first path, the local features $x_l$ are updated from $x_{l-1}$ and the global vector $g_{l-1}$. In the second one, the global vector is updated with the local feature $x_l$. Next, the authors proposed a local and global combination classifier to make the final prediction by combining the local and global representations. Finally, they proposed two different local and global diffusion networks, namely LGD-2D and LGD-3D. The difference between LGD-2D and LGD-3D is that the input of the first one is T noncontinuous frames, while the input of the second is T consecutive frames. In addition, LGD-2D and LGD-3D use 2D convolution and 3D convolution, respectively. They evaluated the networks on two datasets, namely Kinetics-400 [38] and Kinetics-600 [75]. They also experimented on two of the most popular video action recognition datasets, UCF101 [39] and HMDB51 [40].
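The two operations described for R-STAN [73], the RGB Difference image and the attention-weighted feature map, are both element-wise. A minimal sketch follows, assuming a frame is a flat list of pixel values, the feature map is indexed as `m[t][p]` (frame t, spatial position p), the temporal attention has one weight per frame, and the spatial attention has one weight per position; this indexing, the broadcasting, and the subtraction order are assumptions, not details taken from [73].

```python
def rgb_difference(frame_a, frame_b):
    """Element-wise subtraction between two frames given as flat lists of pixel values."""
    return [b - a for a, b in zip(frame_a, frame_b)]


def apply_attention(m, a_t, a_s):
    """Compute m ⊙ a_t ⊙ a_s with a_t broadcast over positions and a_s over frames.

    m:   feature map as a list of frames, each a list of position features.
    a_t: one temporal attention weight per frame.
    a_s: one spatial attention weight per position.
    """
    return [
        [m[t][p] * a_t[t] * a_s[p] for p in range(len(a_s))]
        for t in range(len(a_t))
    ]


# Hypothetical values: two frames with two spatial positions each.
print(rgb_difference([10, 20, 30], [12, 18, 30]))                    # [2, -2, 0]
print(apply_attention([[1.0, 2.0], [3.0, 4.0]], [1.0, 0.5], [0.0, 2.0]))  # [[0.0, 4.0], [0.0, 4.0]]
```

Unchanged pixels yield a zero difference, which is why the difference image acts as a cheap stand-in for motion information.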
Instead of training different networks separately, Zhou et al. [76] constructed a probability space from which a spatial-temporal fusion strategy can be derived. The authors introduced spatial-temporal fusion strategies that obtained high performance on popular datasets. To exploit the mutual correlations in the video, an attention mechanism [77] is used in the 3D convolutional network. The authors proposed a temporal and a spatial attention submodule and then used these attentions to construct the temporal and spatial deformable 3D convolutional networks. Both 3D convolutional networks can learn temporal and spatial information as well as static appearance. Another proposed model [78] used pose information to predict actions. First, the authors used the PoseNet approach with a ResNet backbone to obtain estimated pose keypoints for each human in a frame. The backbone network is a 3D version of ResNet50. They added a feature gating module and did not apply temporal downsampling in any layer of the backbone network to improve the performance.
The authors tried to avoid training three models separately since the input included RGB, flow, and pose data. They proposed a multiteacher framework whose input can be RGB, flow, or pose. They evaluated the framework on three benchmark datasets, including Kinetics-600 [38], UCF101 [39], and HMDB51 [40].

Convolution-Free Approaches.
The 2D network is very successful in capturing spatial features. However, the motion information is still missed. A 3D convolutional network can encode spatial-temporal information in videos, but it requires a high computation cost. The Transformer was proposed for natural language processing and then adopted for computer vision. It does not require heavily stacked convolutions to encode information, such as in [27][28][29][30].
A convolution-free model [27] was proposed that requires a smaller number of frames for inference. The model is based on a self-attention mechanism for capturing both spatial and temporal information. The authors separate the spatial attention and the temporal attention to reduce the computation and exploit temporal information better. Each input frame (H × W) of the network is split into $N = HW/P^2$ nonoverlapping patches, where the size of each patch is P × P. Then, each patch representation is converted to query, key, and value vectors. To avoid expensive computation, spatial attention is applied only between patches of the same image. The output representations of the spatial attention are then passed to the temporal attention.
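The patch-splitting step described above can be sketched as follows. The sketch assumes a single-channel frame stored as a list of rows and a patch size that divides both the height and the width; the function names and the example sizes are illustrative.

```python
def num_patches(height, width, patch):
    """Number of nonoverlapping P x P patches in an H x W frame (H*W / P^2)."""
    if height % patch or width % patch:
        raise ValueError("patch size must divide both height and width")
    return (height * width) // (patch * patch)


def split_into_patches(frame, patch):
    """Split a frame (list of rows) into flattened P x P patches in row-major order."""
    height, width = len(frame), len(frame[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                frame[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


# A hypothetical 224 x 224 frame with 16 x 16 patches gives 196 tokens.
print(num_patches(224, 224, 16))  # 196

# Tiny 4 x 4 frame split into 2 x 2 patches.
tiny = [[r * 4 + c for c in range(4)] for r in range(4)]
print(split_into_patches(tiny, 2)[0])  # [0, 1, 4, 5]
```

Each flattened patch would then be linearly projected before the query, key, and value vectors are computed.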
To address the heavy memory usage of the vanilla video transformer, a video transformer [28] was introduced that reduces the memory cost. The issue is solved by applying spatial and temporal multihead separable attention (MSA) sequentially, $\mathrm{MSA}(S) = \mathrm{MSA}_s(\mathrm{MSA}_t(S))$. Moreover, the authors addressed the problem of redundant information in the temporal dimension. Instead of using temporal average pooling or 1D convolutions with stride 2, they proposed a topK pooling, which selects the top K entries with the highest standard deviation. They evaluated the model on six different datasets (Kinetics-400 [38], Kinetics-700 [79], Something-Something-V2 [37], Charades [80], UCF101 [39], and HMDB51 [40]).
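The topK pooling idea can be sketched as scoring each temporal position by a standard deviation and keeping only the K highest-scoring positions. What exactly the standard deviation is computed over is not stated above; the sketch assumes one row of scores per temporal position (for example, a row of attention weights), and the function name is hypothetical.

```python
import statistics


def topk_std_pooling(rows, k):
    """Keep the k temporal positions whose rows have the highest standard deviation.

    rows: one list of scores per temporal position (assumed input format).
    Returns the kept indices in temporal order and the corresponding rows.
    """
    scores = [statistics.pstdev(row) for row in rows]
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore temporal order after selection
    return keep, [rows[i] for i in keep]


# Hypothetical scores for four temporal positions; keep the two most varied.
rows = [[1, 1, 1], [0, 5, 10], [2, 2, 3], [0, 10, 20]]
indices, pooled = topk_std_pooling(rows, 2)
print(indices)  # [1, 3]
```

Unlike average pooling or strided convolution, this selection discards whole positions, so near-constant (less informative) positions are dropped rather than blended in.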
TimeSformer [30] is a convolution-free model that is faster than 3D convolutional networks. Each input frame is split into N nonoverlapping patches, as in [27]. The spatiotemporal position of each patch is encoded by a learnable positional embedding $e^{pos}_{(p,t)} \in \mathbb{R}^{D}$. Each patch $X_{p,t}$ is mapped into an embedding vector $z^{(0)}_{(p,t)}$. TimeSformer has L blocks, and a set of query, key, and value vectors is computed from $z^{(l-1)}_{(p,t)}$ for each block. In this study, the authors proposed a more efficient spatiotemporal attention: temporal attention is applied first, and its output is then fed into spatial attention.
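The efficiency gain from applying temporal attention and then spatial attention separately can be made concrete by counting query-key comparisons. The sketch below assumes N patches per frame and F frames and ignores any classification token; the function names and example sizes are illustrative.

```python
def joint_attention_pairs(n_patches, n_frames):
    """Query-key comparisons when every token attends to every token in the clip."""
    tokens = n_patches * n_frames
    return tokens * tokens


def divided_attention_pairs(n_patches, n_frames):
    """Comparisons when each token attends over time (F tokens at the same
    spatial location) and then over space (N tokens in the same frame)."""
    tokens = n_patches * n_frames
    return tokens * (n_frames + n_patches)


# Hypothetical clip: 8 frames of 196 patches each.
print(joint_attention_pairs(196, 8))    # 2458624
print(divided_attention_pairs(196, 8))  # 319872
```

Under these assumptions each token makes N + F comparisons instead of N · F, so the gap widens as either the resolution or the number of frames grows.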
Akbari et al. [29] introduced a convolution-free Transformer architecture, namely Video-Audio-Text Transformer (VATT). The input video clip of size T × H × W is split into a sequence of $\lceil T/t \rceil \cdot \lceil H/h \rceil \cdot \lceil W/w \rceil$ patches, where t × h × w is the patch size. The position of each location (i, j, k) is encoded as $e_{i,j,k} = e_{\mathrm{Temporal}_i} + e_{\mathrm{Horizontal}_j} + e_{\mathrm{Vertical}_k}$, and Multi-Head-Attention applies self-attention to the input. The multilayer perceptron includes two dense linear projections with a GeLU activation. The common space projection contains a linear projection and a two-layer projection with ReLU activation functions in between. The proposed method was evaluated on UCF101 [39], HMDB51 [40], Kinetics-400 [38], Kinetics-600 [75], and Moments in Time [83].
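The patch count and the factorized position encoding described for VATT can be sketched as below. The sketch assumes a separate patch size per axis (t, h, w) and three small lookup tables of learnable embeddings, one per axis; the function names and the toy embedding values are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def num_video_patches(T, H, W, t, h, w):
    """Number of patches when a T x H x W clip is cut into t x h x w blocks."""
    return ceil_div(T, t) * ceil_div(H, h) * ceil_div(W, w)


def position_embedding(i, j, k, e_temporal, e_horizontal, e_vertical):
    """Sum of the temporal, horizontal, and vertical embeddings for location (i, j, k)."""
    return [
        a + b + c
        for a, b, c in zip(e_temporal[i], e_horizontal[j], e_vertical[k])
    ]


# Hypothetical clip of 32 frames at 224 x 224 with 4 x 16 x 16 patches.
print(num_video_patches(32, 224, 224, 4, 16, 16))  # 1568

# Toy two-dimensional embedding tables.
e_t = [[1.0, 0.0], [2.0, 0.0]]
e_h = [[0.0, 1.0]]
e_v = [[0.5, 0.5]]
print(position_embedding(1, 0, 0, e_t, e_h, e_v))  # [2.5, 1.5]
```

Factorizing the position this way stores one table per axis rather than one embedding per (i, j, k) location, so the number of position parameters grows with the sum of the axis lengths instead of their product.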

A Comparison of Methods
First, we compare recent methods on two benchmark datasets, UCF101 [39] and HMDB51 [40]. These are the two most popular human action datasets that have been used to evaluate the performance of the proposed methods, as shown in Table 2. We group the proposed methods by year. In 2019, the local and global diffusion network achieved the best results with 98.20% and 80.50% on UCF101 and HMDB51, respectively. This network learns local and global features in parallel, and these features are diffused effectively. In 2020, Kalfaoglu et al. [22] obtained impressive results with 98.69% and 85.10% on UCF101 and HMDB51, respectively. Replacing the conventional temporal global average pooling layer with a bidirectional encoder representations from Transformers layer increases the performance of 3D convolutional neural networks. In 2021, a three-stream network obtained 99.00% on the UCF101 dataset. In the same year, many approaches introduced new models for human action recognition with a convolution-free architecture, such as VATT [29], VidTr [28], STAM [27], and TimeSformer [30]. Table 3 compares recent approaches on Something-Something-V1 and Something-Something-V2. TSM [84] is one of the most effective methods, combining high efficiency and high performance, because it obtains the performance of a 3D network with the complexity of a 2D network. TSM uses a simple temporal shift module to exploit temporal relationships with zero extra computation and zero extra parameters. It obtains 52.60% and 66.00% top-1 accuracy on Something-Something-V1 and Something-Something-V2, respectively. Another method, TDN [51], obtained state-of-the-art results on Something-Something-V1 and Something-Something-V2 with 56.80% and 68.20%, respectively. TDN focuses on capturing local and global motion for action recognition.
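The temporal shift idea behind TSM can be sketched as moving a few channels one step forward in time, a few one step backward, and leaving the rest in place, which mixes information across neighboring frames without any learned weights. The layout `x[t][c]`, the zero padding at the clip boundaries, and the number of shifted channels (`fold`) are assumptions for illustration.

```python
def temporal_shift(x, fold):
    """Shift channels along the time axis without any parameters.

    x: features indexed as x[t][c] (frame t, channel c).
    The first `fold` channels take their value from the previous frame,
    the next `fold` channels from the next frame, and the remaining
    channels are left in place. Missing neighbors are filled with zeros.
    """
    n_frames, n_channels = len(x), len(x[0])
    out = [[0.0] * n_channels for _ in range(n_frames)]
    for t in range(n_frames):
        for c in range(n_channels):
            if c < fold:
                src = t - 1      # value comes from the previous frame
            elif c < 2 * fold:
                src = t + 1      # value comes from the next frame
            else:
                src = t          # channel is not shifted
            if 0 <= src < n_frames:
                out[t][c] = x[src][c]
    return out


# Hypothetical clip: three frames, three channels, one channel shifted each way.
clip = [[1, 10, 100], [2, 20, 200], [3, 30, 300]]
print(temporal_shift(clip, 1))  # [[0.0, 20, 100], [1, 30, 200], [2, 0.0, 300]]
```

Because the operation only re-indexes existing values, a 2D convolution applied afterward sees features from adjacent frames at no extra parameter cost.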

Benchmark Datasets
Benchmark datasets play a vital role in estimating the performance of proposed methods. A dataset defines the scope of the problem and enables a fair comparison. For human action recognition, there is a wide range of benchmark datasets in common use. We briefly review the most well-known datasets and their information (size, average duration, action classes, and resolution) for human action recognition. These datasets are grouped into three categories: simple, clip-level, and video-level. Table 4 provides a summary of these datasets.

Simple Datasets.
The two popular datasets that are most used with traditional methods are KTH [1] and Weizmann [2]. However, methods have already achieved perfect accuracy on these datasets [102,103] because the background is static and simple and only one person performs an action in each video. Then, some more realistic datasets were proposed, such as Hollywood [90] and Hollywood2 [91].
KTH [1] is a video dataset including 2391 videos. The actions were performed by 25 different people in four different scenarios. The whole dataset (https://www.csc.kth.se/cvap/actions/) includes six human actions: walk, jog, run, box, hand-wave, and hand clap.
Weizmann [2] is a video dataset in which the actions were performed by nine people. Each participant performs 10 actions: run, walk, jump, skip, jack, jump-forward, jump-in-place, side, wave-two-hand, and wave-one-hand.
Hollywood [90] is a human action dataset taken from 32 movies.
Hollywood2 [91] is a human action dataset with 3669 video clips. This dataset (https://www.di.ens.fr/~laptev/actions/hollywood2/) includes 12 classes of actions and 10 classes of scenes with approximately 20.1 hours of video taken from 69 different movies.

Clip-Level Datasets.
The number of actions in the previous datasets is small, and the actions are simple. Therefore, some datasets such as UCF101 [39], HMDB51 [40], and J-HMDB [68] were introduced to provide a higher variety of actions. However, the samples are short clips, and each captures a single action. Then, some large-scale datasets, such as Charades [80], Something-Something [37], Kinetics [54], Kinetics-600 [75], Kinetics-700 [79], Diving48 [82], Moments in time [83], HACS [93], HVU [48], and AViD [94], were introduced. These datasets make it possible to train a deep convolutional neural network from scratch.
UCF101 [39] has 101 action classes, which are split into five categories: human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports.
HMDB51 [40] has 51 action categories with 6,766 video clips (https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/), which are extracted from different sources. There are five types of action, including general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction, and body movements for human interaction. The height of all the frames is 240 pixels. To maintain the original aspect ratio of the video, the width was scaled according to the height.
J-HMDB [68] is extracted from the HMDB51 dataset [40]. J-HMDB is provided not only for human action recognition but also for pose estimation and human detection. The dataset (http://jhmdb.is.tue.mpg.de/) contains 21 classes with 31,838 annotated frames. Each action has 36-55 video clips, and each clip includes 15-40 frames.
Charades [80] is a dataset of casual everyday activities of 267 people in their homes. The dataset has 9,848 videos with an average length of 30 seconds. It includes 157 action classes and is split into 7,985 videos for training and 1,863 videos for testing (https://prior.allenai.org/projects/charades).
Kinetics [54] (Kinetics-400 [38]) has 400 human action classes, and each class has at least 400 video clips. All clips were taken from YouTube. The actions in the dataset are human-object interactions or human-human interactions.
Diving48 [82] is split into 16,067 clips for training and 2,337 clips for testing. All clips were taken without background objects, and the scenes contain only a board, a pool, and spectators in the background.
Kinetics-700 [79] is an extension of the human action dataset Kinetics-600 [75]. The extended dataset (https://deepmind.com/research/open-source/kinetics) has 700 classes and was taken from YouTube. Each class of the dataset has at least 600 video clips, which have variable resolutions and frame rates.
Moments in time [83] is a human-annotated dataset with 339 different classes. This is a large-scale dataset with one million videos, and each video corresponds to an event occurring within three seconds. The dataset (http://moments.csail.mit.edu/) is split into 802,264, 33,900, and 67,800 videos for training, validation, and testing, respectively.
HACS [93] is a large-scale dataset for human action recognition. It contains 1.5M clips, which are sampled from 504K untrimmed videos. All clips (http://hacs.csail.mit.edu/) in this dataset have a two-second duration and cover 200 action categories.
HVU [48] is a multilabel and multitask video dataset that aims to describe the whole content of a video. The dataset includes approximately 572K videos of real-world scenarios.
AViD [94] is a video dataset for human action recognition. The main difference of this dataset is that it is collected from many different countries. This dataset (https://github.com/piergiaj/AViD) contains 410K training clips and 40K test clips. The duration of each clip is from 3 to 15 seconds.
ActivityNet [96] is a benchmark dataset for human activity understanding. The dataset (http://activity-net.org/index.html) contains human activities of daily living. With 849 video hours, ActivityNet provides 200 activity classes. Each class has an average of 137 untrimmed videos. Most of the videos have a duration between 5 and 10 minutes, and half of the videos have a resolution of 1280 × 720.
DALY [97] is a dataset for action localization in space and time. The dataset (http://thoth.inrialpes.fr/daly/) contains about 31 hours of YouTube videos with 10 everyday human actions.
AVA [100] is a video dataset in which the actions are localized in space and time. In addition, each person in the video is annotated with multiple labels. This dataset (https://research.google.com/ava/) contains 437 different videos of realistic scenes and action complexities. Each video is taken from the 15th to the 30th minute and has 900 frames. It is divided into 239 videos for training, 64 videos for validation, and 134 videos for testing, roughly a 55 : 15 : 30 split.
AVA-Kinetics [101] is an extension of the AVA dataset [100] with new videos from Kinetics-700 [79] annotated with the AVA action classes. AVA-Kinetics (https://research.google.com/ava/) has 238,906 videos, which are split into 142,475 videos for training, 32,529 videos for validation, and 64,902 videos for testing.

Open Research Problems
In the previous sections, we discussed recently proposed methods and benchmark datasets for human action recognition with RGB video data. In this section, we introduce some potential research problems in this field.
Data for Human Action Recognition.
RGB videos are widely used in most methods for action recognition because these data are very common and can be acquired at a low cost. However, other types of data, such as skeleton, depth, infrared sequence, and point cloud, provide more information for action recognition. Skeleton data provide the trajectories of human body joints. Depth and point cloud data capture 3D structure and distance information. Infrared data can be captured in dark environments; however, color and texture cannot be exploited in infrared data. Pose estimation detects the location of human body joints in images. The skeleton data provide the body structure and pose of the subject; therefore, we have more information for human action recognition. The skeleton data are obtained by applying pose estimation to RGB videos or depth data.
Combining different data types, such as RGB data with depth data or skeleton data with depth data, provides rich information for learning models. The RGB video data provide spatiotemporal features, while depth data provide the 3D structure and depth information. Features from different models can also be combined to obtain better performance.

Conclusions
In this survey, we provided a review of recent deep learning-based methods for human action recognition with RGB video data. We categorized recent approaches into five different groups, including 2D CNN-based methods, RNN-based methods, 3D single-stream network-based methods, 3D multistream network-based methods, and convolution-free-based methods. More recently, pure vision transformers with a convolution-free network have been shown to be effective for human action recognition and various fields of computer vision. Therefore, we discussed recent transformer-based methods. We compared the accuracy of recent methods on four popular datasets, including UCF101, HMDB51, Something-Something-V1, and Something-Something-V2. We also discussed a wide range of benchmark datasets for human action recognition that are used in recently proposed methods. Lastly, we provided some potential research directions for human action recognition.
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.