A Novel Parameter Initialization Technique Using RBM-NN for Human Action Recognition

Human action recognition is an active research topic in computer vision and its allied fields. The goal of human action recognition is to identify any human action that takes place in an image or a video dataset, for instance walking, running, jumping, or throwing. Existing human action recognition techniques are limited in model accuracy and flexibility. To overcome these limitations, deep learning technologies were implemented. In the deep learning approach, a model learns by itself to improve its recognition accuracy and avoids problems such as gradient explosion, overfitting, and underfitting. In this paper, we propose a novel parameter initialization technique using the Maxout activation function. Firstly, human action is detected and tracked from the video dataset to learn the spatial-temporal features. Secondly, the extracted feature descriptors are trained using the RBM-NN. Thirdly, the local features are encoded into global features using an integrated forward and backward propagation process via RBM-NN. Finally, an SVM classifier recognizes the human actions in the video dataset. The experimental analysis performed on various benchmark datasets showed an improved recognition rate when compared to other state-of-the-art learning models.


Introduction
Human action recognition [1] is used for a variety of applications such as video surveillance [2], retrieval [3,4], and detection [5][6][7]. The action recognition is performed by computational algorithms [8][9][10] that understand and detect human actions. These computational algorithms generate a label after detecting a human action. Action recognition involves extracting and learning human actions [11][12][13]. It can be performed by using three techniques: traditional design features, deep learning, and hybrid extraction [14]. Among these techniques, the hybrid extraction technique [15] has gained prominence in recent years. It involves using both traditional and deep learning techniques for recognition.
Though they provide a good recognition rate, there have been no recent advances. Action recognition comprises two components: representation [23][24][25][26][27] and classification [25]. The human actions in a video sequence are generated as a space-time feature in 3D representation [28,29]. They comprise both spatial and dynamic information; the spatial information includes human pose, and the dynamic information includes motion. The movement is captured through anchors or bounding boxes to detect the subject from cluttered backgrounds. To capture the spatial-temporal features in human actions, various methods use the Poisson distribution to extract the shape features [30,31]. For action representation and classification, the spatial-temporal information is taken as input. The spatial-temporal saliency is computed from the moving parts, and the local orientation is determined. These local representations are converted into global features by computing the weighted average of each point inside the bounding box and analyzing the different geometrical properties [32,33].
Initially, the spatial-temporal points were extracted using Laptev's detector [23] and the Harris corner detector [24] in the spatial-temporal domain. A Gaussian kernel [34] is applied to the video sequence to obtain a response function for the spatial-temporal dimensions. In other prominent methods, 2D Gaussian smoothing [35] is applied for obtaining the spatial features, and a 1D Gabor filter is applied for obtaining the temporal features along with other information such as raw pixels, gradient, and flow features. Principal component analysis [36][37][38] is applied to the feature vectors for dimensionality reduction. Descriptors such as 3D SIFT [39], HOG3D [7,40], HOG [41], and HOF [41] are used for describing the trajectories [42][43][44].
The spatial-temporal point of interest [45] captures information only over a short temporal distance. However, to describe the change in motion, it is necessary to track the points continuously. The trajectories along with the interest points are detected and tracked using Harris3D [24] with the KLT tracker [46]. Using this method [47], the trajectories are mapped with corresponding SIFT points over consecutive frames. Using the HOG, HOF, and MBH [48] features, the intertrajectories and intratrajectories are described. After the action is represented, action classifiers [30,31,45,49,50,51] are applied to the training samples to determine the class boundaries. The classification methods fall into two types: direct classification and the sequential method. Direct classification involves the extraction of a feature vector and the recognition of actions by classifiers such as SVM [36] and the K-NN method [52,53]. In the sequential method, the temporal features such as appearance and pose are obtained from the hidden Markov model [54][55][56], conditional random fields [57][58][59][60], and the structured support vector machine [61][62][63][64]. Furthermore, representative key poses are learned for efficient representation of human actions [33,34,65,66,67,68,69,70,71,72] to build a compact pose sequence.
Deep learning techniques [73] such as 2D ConvNets [21,74] and 3D ConvNets [26] perform feature learning via the convolution operator and temporal modeling [75]. The initialization of a deep neural network [72] is crucial for training the model. The model parameters [76][77][78] are initialized so that the states of the hidden layers follow a uniform distribution. If the model parameters [79,80] are not properly initialized, training suffers from gradient explosion. The most commonly used technique is the Xavier initialization method [81], which was modeled on the sigmoid activation function. Many models use the ReLU activation function [82], RBMs [83,84], and other methods [85] for learning.
In this paper, we propose a novel parameter initialization technique using the Maxout activation function (MAF) via a restricted Boltzmann machine-neural network (RBM-NN). The spatial and temporal features required for human action recognition are obtained from the video sequence via a feature learning process. The extracted spatial and temporal features are trained using the RBM-NN. The RBM-NN converts the local features into global features using an integrated forward and backward propagation process. An SVM classifier is used for recognizing the human actions in the video sequence. Section 2 describes the process of tracking human action from video sequences, extraction of shape features, and construction of an RBM-NN. Section 3 describes parameter initialization using an activation function, forward propagation, backward propagation, and action recognition using an SVM classifier. Section 4 consists of data preprocessing and model training for analyzing the effectiveness of the parameter initialization technique. Section 5 discusses the experimentation setup, result analysis performed on various benchmark datasets, influence of the learning parameter on model accuracy, and the loss function. Finally, Section 6 consists of concluding remarks followed by references.

Methodology
The spatial-temporal features [86,87] for human action recognition are obtained via a feature learning process [59,62], as shown in Figure 1. The first step involves using a detection and sequence tracking algorithm [88] to identify human action features. Secondly, the action tracking sequence is segregated into blocks to extract the shape features using the neural network layers implemented by RBMs [83,89]. The model is implemented by dividing the network into layers and feeding the output of the first layer as input to the second layer to learn the spatial-temporal features. The second hidden layer is used for dimensionality reduction of the output from the first layer, which reduces the computational cost.

Human Action Tracking from Video Sequence.
The actions of the human body are detected from the video frames through posture and action changes. Target detection and tracking methods such as pedestrian detection algorithms [90,91] are used to automatically detect and track the action sequences. A bounding box tracks the subject of interest and is optimized based on pose normalization. From the video dataset, the length of the tracking sequence is set to a fixed length $L$. If the length of the initial tracking sequence is greater than $L$, the redundant frames are discarded. If the length of the initial tracking sequence is shorter than $L$, the tracking sequence is extended to $L$ frames by zero padding. The human actions from the tracking sequence are denoted by $a_i$, and other actions are denoted as $o_i$.
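The fixed-length rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fix_length` and the representation of a frame as a flat list of floats are hypothetical, and the text does not say which frames count as redundant, so the sketch assumes the trailing ones are dropped.

```python
from typing import List, Sequence

Frame = List[float]  # hypothetical: one flattened bounding-box crop


def fix_length(track: Sequence[Frame], L: int) -> List[Frame]:
    """Return a tracking sequence of exactly L frames.

    Longer sequences are truncated (assumption: trailing frames are the
    redundant ones); shorter sequences are extended with all-zero frames.
    """
    if L <= 0:
        raise ValueError("L must be positive")
    if not track:
        raise ValueError("the tracking sequence must contain at least one frame")
    frames = [list(f) for f in track[:L]]   # discard frames beyond L
    frame_size = len(frames[0])
    missing = L - len(frames)               # number of zero frames to append
    frames.extend([0.0] * frame_size for _ in range(missing))
    return frames
```

For example, a two-frame sequence with `L = 4` gains two zero frames, while a six-frame sequence with `L = 3` keeps only its first three frames.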

Extracting Shape Features.
Every tracking sequence is divided into video blocks, and the initialization parameters are specified as $vb_w \times vb_h$. The segregated blocks are denoted as $V_k$, $k \in K$, where $K = \{1, 2, \ldots, vb_w \times vb_h\}$ and $k$ corresponds to the spatial position of the block. In the proposed method, a deep neural network is used for extracting the spatial-temporal features from low-level features. The first step involves segregating the blocks $B^k$, $k \in K$, into individual frames $B^k_n$, where $n = 1, 2, \ldots, L$, and dividing each frame into $C_w \times C_h$ grid cells. For each grid cell, a histogram of oriented gradients (HOG) is computed over $C_d$ directions to represent the shape characteristics. The shape dimensions of each image frame are denoted as $S_w \times S_h \times S_w$. The feature vector is represented [...] initial component of the shape feature of the image frame $IF^k_m$ is indicated as $s^k_{nt}$, where $t = 1, 2, \ldots, n$. The shape features from each block are extracted and concatenated into a long vector. These individual feature vectors represent the shape features.
During action recognition, the pose of the person is estimated and the shape features are extracted from the tracking sequence. The extracted shape features, i.e., the poses in individual frames, are normalized. The frame from a tracking sequence is represented as [...], where $k = 1, 2, \ldots, n$. The normalized shape vectors for every frame in the tracking sequence are given as [...], where $1 \le t \le s$, $I^k_{nt}$ is the normalized shape feature vector, and the component $s^k_{nt}$ is the shape factor vector that corresponds to the normalized value. The shape feature for every individual frame in the block is denoted as $B^k_n = (l^k_{n1}, l^k_{n2}, \ldots, l^k_{nm})$, where $k \in K$ and $1 \le n \le L$. The shape features from the video block are represented as [...]. The eigenvectors of the shape features are denoted as $B^k_1, B^k_2, \ldots, B^k_L$ and are provided as input to train the RBM-NN.

Constructing an RBM-Neural Network.
A restricted Boltzmann machine (RBM) [54,63] is a network architecture that consists of two neuron layers: the input layer and the hidden layer. The nodes in the input layer are connected to the nodes in the hidden layer, but there are no connections between nodes within the same layer. RBMs are capable of self-learning a discrete distribution via the hidden neurons. The input layer consists of multiple RBMs, as shown in Figure 2, to describe the distribution of action characteristics. For each type of action category, the training samples are fed to the RBMs with spatial features. The output layer of each RBM comprises $N$ neurons, and the value of $N$ has a direct influence on the distribution of every action learned. The proposed method analyses the influence that the value of $N$ has on the experimental results. The RBMs in the network layer are indexed by $k = 1, \ldots, vb_w \times vb_h$. Each RBM is trained on the shape features of the blocks at the corresponding spatial position $k$. The input video block has the shape feature $I^k = (i^k_{11}, i^k_{12}, \ldots, i^k_{Ln})^L$, and the corresponding output is represented as $(r^k_1, \ldots, r^k_N)^L$. The energy of the RBM for the state of the neurons $I^k$ is defined as [...], where $\theta^k$ is the RBM parameter, in which $P^k$ represents the symmetric correlation between the input and output neurons, and $a^k$ and $b^k$ are the bias column vectors of the input and the output layer. The set of model parameters used in the RBM is learned using the contrastive divergence (CD) algorithm [92]. The CD algorithm is effective for training undirected graphical models such as RBMs and estimates the energy gradient given a set of model parameters along with the training data. The CD provides the gradient estimates, keeps the model balanced, and avoids the issues of gradient explosion and overfitting.
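As an illustration of contrastive divergence, a single CD-1 update for one RBM can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes binary units (the HOG-based shape features in this work are real-valued, so a practical model would need real-valued visible units), and the function names, the learning rate `lr`, and the list-of-lists weight layout `P[i][j]` (input unit `i`, output unit `j`) are hypothetical. The symbols `P`, `a`, and `b` follow the weight and bias notation of the text.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hidden_probs(v: List[float], P: List[List[float]], b: List[float]) -> List[float]:
    """Activation probability of every output neuron given the input layer."""
    return [sigmoid(b[j] + sum(v[i] * P[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h: List[float], P: List[List[float]], a: List[float]) -> List[float]:
    """Activation probability of every input neuron given the output layer."""
    return [sigmoid(a[i] + sum(P[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def sample(probs: List[float], rng: random.Random) -> List[float]:
    """Draw a binary state for each unit from its activation probability."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_step(v0: List[float], P: List[List[float]], a: List[float],
             b: List[float], lr: float, rng: random.Random) -> List[float]:
    """Apply one CD-1 update to P, a, b in place; return the reconstruction."""
    h0_prob = hidden_probs(v0, P, b)        # positive phase
    h0 = sample(h0_prob, rng)
    v1_prob = visible_probs(h0, P, a)       # one Gibbs step back to the input
    h1_prob = hidden_probs(v1_prob, P, b)   # negative phase
    for i in range(len(a)):
        for j in range(len(b)):
            P[i][j] += lr * (v0[i] * h0_prob[j] - v1_prob[i] * h1_prob[j])
        a[i] += lr * (v0[i] - v1_prob[i])
    for j in range(len(b)):
        b[j] += lr * (h0_prob[j] - h1_prob[j])
    return v1_prob
```

Calling `cd1_step` repeatedly over the training vectors approximates the gradient of the log-likelihood from the difference between data-driven and reconstruction-driven statistics, without ever evaluating the partition function.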
The distribution between the input and output neurons for a single RBM is given as [...], where $Z(\theta^k)$ is the partition function, and the conditional probability distribution is derived from equation (3): [...]. The proposed method trains the RBM in the first layer of the neural network architecture. The network parameter set of the multiple RBM neural network layers for every action is denoted as $\theta = \{\theta^1, \theta^2, \ldots, \theta^n\}$. The proposed work trains a two-layer neural network for every action category. The second layer of the neural network is also an individual RBM and is used solely for dimensionality reduction of the output obtained from the first layer. The parameter of this network layer is denoted as $(W, a)$. For every action category, the input from an action sequence provides the feature vectors as output. The output of the trained two-layer neural network is modeled based on spatial-temporal shape feature learning from the block. The spatial-temporal individualities are represented as $R = (r_1, r_2, \ldots, r_A)$, where $A$ is the set generated based on experience and is denoted as

Importance of Effective Parameter Initialization.
To build an efficient model for human action recognition, an RBM-NN architecture is defined in the proposed work and trained to learn the parameters. The RBM-NN architecture is trained using the following steps: parameter initialization, optimization algorithm, forward propagation, cost function computation, gradient computation using backpropagation, and parameter update. When testing data are provided, the network uses the trained model to predict the class. For a network to perform efficiently, it is crucial to initialize the parameters correctly to avoid the problems of exploding and vanishing gradients.
Case 1. If the initialized parameter is too large, it leads to gradient explosion: initialized weight ≫ identity matrix.
Case 2. If the initialized parameter is too small, it leads to vanishing gradients: initialized weight ≪ identity matrix.
To prevent the problems specified above, two rules have to be adhered to while initializing the network parameters. First, the mean value of the activations must always be zero. Second, the variance of the activations must remain uniform throughout the network layers. If the rules are not followed, training converges to a locally optimal solution, which renders the model untrainable and leads to improper feature extraction. The model parameters are initialized in one of two ways: parameter initialization by pretraining a model and parameter optimization by training the neural network. In the first method, a model is trained in an unsupervised manner, and an AutoEncoder [93] is used to build a layer-by-layer unsupervised objective function. The layer-by-layer training is performed on neural networks of equal depth to obtain the feature representations from the input. Pretraining a model involves computational overhead, and the training efficiency is affected. The second method involves initializing the parameters and optimizing them using neural networks. The parameters can be initialized using a nonlinear activation function and backpropagation.

Parameter Initialization Using Maxout Activation Function.
In this paper, the parameter initialization technique is modeled using a Maxout layer. The layer consists of an activation function which takes the maximum of the inputs. When compared to other activation functions, the Maxout activation function [94] performs well due to the dropout technique. Dropout is a model averaging technique where a random subnetwork is trained for every iteration and the weights are averaged at the end. An approximation has to be used as these weights cannot be averaged explicitly. The inputs to the Maxout layer are not dropped using the corresponding activation function. The input with the maximum value for the data point is not affected as the dropout occurs in the linear part. Thus, it leads to efficient model averaging as the averaging approximation is for linear networks.
In the proposed work, it is assumed that the state of the neuron node follows the uniform distribution required for a Maxout activation function. The Maxout activation function is itself learned during training in our model. It can form a piecewise linear approximation of an arbitrary convex function, including ReLU, the absolute value function, and the quadratic function. It takes the maximum value from a set of linear values that are determined beforehand. The Maxout implements ReLU and the absolute value function using two linear functions and approximates the quadratic function using four linear functions. Because it can approximate any convex function using multiple linear functions, it is known as a piecewise linear approximator.
The Maxout unit is implemented using the following function: [...], where $n$ is the number of linear combinations. If $w_1$ is set to one, all the other values take the value zero such that the proposed activation function becomes equivalent to the traditional activation functions. As mentioned earlier, any continuous piecewise linear function can be expressed as a difference between two convex functions:

$$g(x) = f_1(x) - f_2(x), \qquad (9)$$

where $f_1(x)$ and $f_2(x)$ are the convex functions and $g(x)$ is a continuous piecewise linear approximation function. From equation (9), it can be deduced that a Maxout layer comprising two Maxout units can approximate any continuous function arbitrarily well. Also, both ReLU and leaky ReLU are special cases of a Maxout unit, so the Maxout unit enjoys all the benefits of a ReLU unit. It implements linearity of operations with no saturation and avoids the issue of dying ReLU. A Maxout layer can be formed with more units, but this increases the capacity of the network and requires more training. Thus, Maxout units are considered universal approximators.
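The special cases mentioned above can be checked with a small scalar sketch. It uses the standard Maxout formulation, the maximum over $n$ affine pieces $w_j x + b_j$, which is assumed here to match the unit used in this paper; the function names and the choice of tangent points for the quadratic are hypothetical.

```python
from typing import Sequence


def maxout(x: float, weights: Sequence[float], biases: Sequence[float]) -> float:
    """Scalar Maxout unit: the maximum over n affine pieces w_j * x + b_j."""
    if not weights or len(weights) != len(biases):
        raise ValueError("weights and biases must be non-empty and equally long")
    return max(w * x + b for w, b in zip(weights, biases))


def relu(x: float) -> float:
    """ReLU as a Maxout unit with two pieces: x and 0."""
    return maxout(x, (1.0, 0.0), (0.0, 0.0))


def absolute(x: float) -> float:
    """Absolute value as a Maxout unit with two pieces: x and -x."""
    return maxout(x, (1.0, -1.0), (0.0, 0.0))


def quadratic_approx(x: float) -> float:
    """Four-piece lower approximation of x**2 built from tangent lines.

    The tangent to x**2 at a point c is 2*c*x - c**2; the tangent points
    below are an arbitrary choice for illustration.
    """
    tangent_points = (-1.5, -0.5, 0.5, 1.5)
    return maxout(x,
                  [2.0 * c for c in tangent_points],
                  [-c * c for c in tangent_points])
```

Because every piece is a tangent to a convex function, `quadratic_approx(x)` never exceeds `x**2` and touches it at the four tangent points; adding more pieces tightens the approximation at the cost of more parameters.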
The MAF is modeled based on a theoretical derivation for parameter initialization of the model. Both the forward propagation and the backward propagation processes in the network are analyzed to ensure that every neuron follows a uniform distribution.

Forward Propagation Process.
To perform forward propagation, the following assumptions are made: (1) the input vector $vb$ and the parameter vector $W$ must be independent; (2) the input vector $vb$ and the parameter vector $W$ must follow the same distribution; (3) the initial distribution of the parameter vector $W$ must be symmetrical about the zero-point; and (4) the offset value $b$ of each layer must always be zero. The response of the hidden convolution layer in the RBM-NN is given as [...], where $t$ denotes the $t$-th hidden layer of the RBM-NN, $x_t \in A^p$, $x_t$ is the original input vector, and the mean value is set to zero after processing.
Here, $p$ is the number of input nodes connected to one neuron node, $u$ is the size of the convolution kernel, and $i$ is the number of input channels to the model. The output of every neuron node is passed through the MAF provided as follows: [...], where $n$ is the number of linear combinations. If $w_1$ is set to one, all the other values take the value zero such that the proposed activation function becomes equivalent to the traditional activation functions. The local linearity of the proposed activation function eliminates the issue of gradient explosion, but there is an increase in computational overhead during the training process. The variance of the initialization parameter can be obtained as follows: [...]. The weight $W_t$ and the hidden layers have to adhere to a Gaussian distribution with a mean value of zero as per assumptions 2 and 3. The initial state and the parameter vectors are assumed to be independent of each other as per assumption 1. Thus, the variance in the initialization parameter is provided: [...], where $E[x_t^2]$ is the expectation function. The proposed activation function can be simplified by considering two linear functions given as follows: [...]. Based on assumption 4, the offset value $b_{t-1}$ is always set to zero and the mean of the weights $W_t$ is also set to zero. The values $z_{t-1,1}$ and $z_{t-1,2}$ are assumed to be symmetrical about the mean point and follow the same distribution.
The expectation function $E[x_t^2]$ and the variance $\mathrm{Var}[z_{t-1}]$ are defined as follows: [...]. The expectation value $E[x_t^2]$ is obtained by substituting equation (15): [...]. As per assumption 2, the values $z_{t-1,1}$ and $z_{t-1,2}$ follow the uniform distribution and the new variance is obtained as follows: [...]. Substituting the variance value obtained from equation (17) into equation (16), we get [...]. The relationship between the variances is obtained by substituting equation (17) into equation (13) as follows: [...]. The difference in variance between the first hidden layer and the last hidden layer is obtained as follows: [...]. The initialization parameter for a neural network model must follow the necessary condition: [...]. When $t$ is set to 1, equation (21) is satisfied without interference by the activation function on the input vector. Based on the theoretical assumption, each node in the hidden layer behaves similarly to a neural network. Also, the model parameter initialization for every node in the hidden layer satisfies the Gaussian distribution.

Backpropagation Process.
In backpropagation, the following assumptions are made, similar to forward propagation: (1) the gradient $\Delta r_t$ and the parameter vector $W$ must be independent of each other; (2) the gradient $\Delta r_t$ and the parameter vector $W$ must follow the same distribution; and (3) the gradient $\Delta r_t$ and the parameter vector $W$ must be symmetric about zero so that $E[\Delta x_t] = 0$. The gradients obtained by the convolution parameter are given as follows: [...], where $\Delta x_t$ and $\Delta r_t$ are the gradients of the loss function. The value of the activation function is obtained when $a = 0$: [...]. The cases $f'(z_t, n) = 1$ and $f'(z_t, n) = 0$ each occur with probability one half. Moreover, $f'(z_t, n) = 1$ and $\Delta x_{t+1}$ are independent of each other based on assumption 1. The initial condition $n \in \{1, 2\}$ is provided: [...]. The variance function for the gradient is obtained as follows: [...]. The relationship between $\mathrm{Var}[\Delta x_2]$ and $\mathrm{Var}[\Delta x_{T+1}]$ can be defined as follows: [...]. For the gradient to move smoothly, the following initial condition has to be satisfied: [...]. The parameter $W$ of the neural network model also follows the same distribution based on assumption 2: [...]. It is not possible to satisfy the forward and backward propagation conditions at the same time. Thus, the parameter has to be optimized as follows: [...]. The optimized solution for the proposed initialization parameter for the RBM-NN based on the uniform distribution is obtained: [...].

SVM Classifier for Action Recognition.
An SVM classifier is built for each action category. The training samples of the RBM-NN are categorized into two groups: positive samples and negative samples. The samples which correspond to the action category $a_i$ are treated as the $u$ positive samples, and the other actions $o_i$ as the $v$ negative samples. The parameter vector $W$ and the other variables are optimized. If there is an imbalance between the positive and negative samples, the classification accuracy in the training phase is affected. To overcome this issue, a penalty coefficient $P$ is introduced. If the training set has fewer positive samples, a higher penalty coefficient $P$ is enforced on the positive samples and a lower penalty coefficient $P$ is assigned to the negative samples. The SVM objective function for our proposed method is defined as follows: [...], where $i = 1, 2, \ldots, u + v$, $R_i$ is the spatial-temporal feature of the $i$-th action sample, and $(R_i, y_i)$ is the input of the SVM classifier. Also, $u + v$ is the total number of training samples used for training the SVM classifier. The SVM classifier is trained for each action category and represented as an action model $(\theta, W, a, b)$ comprising the two-layer RBM-NN for human action recognition.
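The effect of a class-dependent penalty can be illustrated with the standard class-weighted soft-margin objective, $\frac{1}{2}\lVert W \rVert^2 + \sum_i P_i \max(0, 1 - y_i (W \cdot R_i + b))$. This form is an assumption that stands in for the objective of this paper. The inverse-frequency rule for choosing the two penalties and all function names are likewise hypothetical.

```python
from typing import Sequence, Tuple


def class_penalties(labels: Sequence[int], base_penalty: float) -> Tuple[float, float]:
    """Return (P_pos, P_neg), scaled inversely to the class frequencies.

    Labels are +1 for the action category and -1 for all other actions.
    """
    u = sum(1 for y in labels if y == 1)    # number of positive samples
    v = sum(1 for y in labels if y == -1)   # number of negative samples
    if u == 0 or v == 0:
        raise ValueError("both classes must be present")
    total = u + v
    return base_penalty * total / (2.0 * u), base_penalty * total / (2.0 * v)


def weighted_svm_objective(W: Sequence[float], bias: float,
                           samples: Sequence[Sequence[float]],
                           labels: Sequence[int],
                           p_pos: float, p_neg: float) -> float:
    """Class-weighted soft-margin objective with hinge-loss slack terms."""
    regularizer = 0.5 * sum(w * w for w in W)
    loss = 0.0
    for R, y in zip(samples, labels):
        margin = y * (sum(w * r for w, r in zip(W, R)) + bias)
        slack = max(0.0, 1.0 - margin)          # slack variable of this sample
        loss += (p_pos if y == 1 else p_neg) * slack
    return regularizer + loss
```

With one positive and three negative samples and a base penalty of 10, the rule gives the positive class a penalty of 20 and the negative class a penalty of about 6.67, so a misclassified positive sample costs three times as much as a misclassified negative one.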

Result Analysis and Discussion
The parameter initialization proposed in the paper is verified and analyzed on the MS-COCO [95], ImageNet [96], and CIFAR-100 [97] datasets. The RBM-NN comprises four convolution layers for analysis along with the loss function. The loss function considered in the model is the logistic loss layer obtained after downsampling. To prevent overfitting, the dataset is separated into batches and trained as submodels. The parameter is initialized randomly, and the submodels are trained using the dropout technique by randomly setting the output nodes to zero before updating the training set. The dropout probability for the model validation is set to 50% to determine the classification error rates.

Data Preprocessing.
The training data are preprocessed by applying global contrast normalization (GCN) and zero-phase component analysis (ZCA) whitening [98]. The GCN technique prevents the images from exhibiting various levels of contrast. The mean value is subtracted, and the image is rescaled such that the standard deviation across the pixels is constant. The ZCA whitening process ensures that the average covariance between the whitened pixel and the original image is maximal. It makes the data less redundant by removing the correlations between adjacent pixels.
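The GCN step described above can be sketched for a single image as follows. This is a minimal illustration: the function name, the `scale` target for the standard deviation, and the `eps` guard against flat images are assumptions, and the image is taken as a flat list of pixel intensities. ZCA whitening is omitted because it requires an eigendecomposition of the pixel covariance matrix.

```python
import math
from typing import List, Sequence


def global_contrast_normalize(pixels: Sequence[float],
                              scale: float = 1.0,
                              eps: float = 1e-8) -> List[float]:
    """Subtract the image mean and rescale to a fixed standard deviation."""
    n = len(pixels)
    if n == 0:
        raise ValueError("the image must contain at least one pixel")
    mean = sum(pixels) / n
    centered = [p - mean for p in pixels]
    std = math.sqrt(sum(c * c for c in centered) / n)
    divisor = max(std, eps)  # avoid division by zero for constant images
    return [scale * c / divisor for c in centered]
```

Two copies of the same image that differ only in brightness and contrast map to the same normalized output, which is the property the preprocessing relies on.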

Model Training.
The models were initially trained using the Xavier initialization method [81] for initializing the model parameters. The Xavier initialization method is chosen since it keeps the variance uniform across each network layer as per the assumptions followed during the forward propagation process. The initial and model parameters must follow the uniform distribution specified below:

$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_k + n_{k+1}}},\ \frac{\sqrt{6}}{\sqrt{n_k + n_{k+1}}}\right],$$

where $n_k$ is the number of input nodes and $n_{k+1}$ is the number of output nodes. The datasets MS-COCO [95], ImageNet [96], and CIFAR-100 [97] were considered as input for the proposed parameter initialization method, and the results were compared with parameters initialized via the Xavier model. The proposed parameter initialization method showed similar results in the classification accuracy of the activation function. The improvement in classification accuracy has been attributed to the fact that the nodes and states of the various hidden layers follow the same distribution pattern, which avoids the problem of gradient explosion. The ImageNet dataset comprises a 1000-class image problem and required 120 epochs.
The MS-COCO dataset comprises 80 classes and required 64 epochs for training. The CIFAR-100 dataset comprises 100 classes and required 200 epochs for training. The model required more layers for analysis along with the introduction of convolution kernels.
The deep neural network model was trained for 500,000 iterations with the learning rate initially set to 0.1. However, it was found that the learning rate decreased with an increase in the number of iterations. The comparison of the test error rates between the proposed initialization method and the Xavier initialization method is provided in Table 1. The analysis shows that the error rates obtained from the proposed method were better for both small (MS-COCO) and large datasets (ImageNet and CIFAR-100).
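For reference, the Xavier baseline used in this comparison can be sketched as below, assuming the standard normalized form that draws every weight uniformly from $[-\sqrt{6/(n_k + n_{k+1})}, \sqrt{6/(n_k + n_{k+1})}]$. The function name and the list-of-lists weight layout are hypothetical; the proposed Maxout-based initialization is not reproduced here.

```python
import math
import random
from typing import List


def xavier_uniform(n_in: int, n_out: int, rng: random.Random) -> List[List[float]]:
    """Sample an n_in x n_out weight matrix with Xavier uniform initialization.

    The resulting weights have zero mean and variance 2 / (n_in + n_out).
    """
    if n_in <= 0 or n_out <= 0:
        raise ValueError("layer sizes must be positive")
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]
```

The bound is chosen so that the weight variance, `limit**2 / 3`, equals `2 / (n_in + n_out)`, a compromise between preserving the activation variance in the forward pass and the gradient variance in the backward pass.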
The model parameters along with the slack variables are initialized and optimized by the objective function used by the SVM classifier. During the training process, it was noticed that there was an imbalance between the positive and negative samples: there were fewer positive samples in the training set than negative samples. Thus, a higher penalty coefficient $P$ was applied to the positive samples to balance the training samples.

Experimentation Setup and Analysis
The human action recognition using the proposed method is performed using the datasets specified in Table 2 along with their classes, modalities, and environment type. These benchmark datasets comprise actions performed in both simple and cluttered background scenes. The datasets are divided into training and testing sets. This discriminative action is used for segmentation to reduce the background correlation between the training and the testing set. The model is trained using small samples, and the data expansion method [108] is used to increase the number of video samples present in the training set.
Initially, the actions are detected from the video blocks to extract the spatial-temporal features. The features are fed to the RBMs for training along with suitable model parameters via the forward and backward propagation process. The output from the RBMs is fed to the SVM classifier for human action recognition. During the experimental analysis performed on the dataset, the influence of the parameter $N$ is analyzed along with the penalty coefficient $P$. The effect of the number of output neurons for each RBM is obtained by adjusting the value of the parameter $N$. The number $N$ of output neurons influences the average recognition rate of the action sequence. The value of $N$ determines the number of spatial-temporal features based on the RBM-NN. The SVM classifier is used for action recognition of multiple types of actions. The SVM classifier model calculates the shape features of the video blocks for each action category. After the classification values are compared, the action with the largest classification value is set as the action label for the test video sequence. The actions from the tracking sequence are detected from the action video. The proposed algorithm operates on image sequences with varied focus points, deep learning is used for learning all the features, and SVM classification is performed. The proposed action recognition feature is more specific than other methods. Finally, the model is compared with other state-of-the-art techniques in terms of the classification accuracy rate.
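The labeling rule, in which the action whose classifier gives the largest classification value becomes the label, can be sketched as a one-vs-rest decision. This is a simplified illustration: it assumes linear decision functions and one shared feature vector per test sequence, whereas in the proposed method each action model has its own RBM-NN features. The function names and the dictionary layout are hypothetical.

```python
from typing import Dict, Sequence, Tuple

LinearModel = Tuple[Sequence[float], float]  # hypothetical: (weights W, bias b)


def decision_value(W: Sequence[float], bias: float,
                   features: Sequence[float]) -> float:
    """Classification value of one per-action linear SVM."""
    return sum(w * r for w, r in zip(W, features)) + bias


def recognize_action(models: Dict[str, LinearModel],
                     features: Sequence[float]) -> str:
    """Return the action label whose classifier yields the largest value."""
    if not models:
        raise ValueError("at least one action model is required")
    scores = {label: decision_value(W, b, features)
              for label, (W, b) in models.items()}
    return max(scores, key=scores.get)
```

For instance, with two toy models for walking and running, a test sequence is labeled as running whenever the running classifier returns the higher value, regardless of whether either value is positive.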

Weizmann Dataset.
The Weizmann dataset [99] is made available by the Weizmann Institute of Science and consists of two datasets. The event-based analysis dataset consists of long sequences of around 6000 frames comprising various people. The actions are divided into four categories: running in place, walking, running, and waving. The ground truth dataset is action annotated for every frame and can be temporally segmented. The second dataset, the Weizmann actions as space-time shapes dataset, was created for human action recognition systems that are suitable for spatial and temporal volumes. The videos were recorded on a simple background with nine persons performing ten actions. The human actions have been divided into ten categories: walking, running, jumping, galloping, bending, one-hand waving, two-hands waving, jumping in place, jumping jacks, and skipping, as specified in Figure 3. It is a database of 91 low-resolution video sequences. The dataset comprising 91 video sequences is divided into 60 video samples for the training set and 31 action samples for the testing set.
During experimentation, every action in the tracking sequence was divided into 180 × 144 (25 fps) video blocks. The parameter $N$ is set to 300, where $N$ represents the number of output neurons of each RBM present in the first neural network layer. The proposed method is compared with the reference method [109]. For the SVM classifier, the penalty coefficient is set to $P = 10$, and the other slack variables are determined by the objective function. The neural network parameters are obtained by adaptive matching with the processed image data. The proposed work correctly identifies the rotation action of the Weizmann actions as space-time shapes dataset such as walking, running, jumping, bending, waving, and skipping. The proposed method is compared with the reference model [110] proposed by Haiam et al.
They proposed a trajectory-based approach for human action recognition to obtain the temporal discriminative features. The trajectories are extracted by detecting the STIPs and matching them with the SIFT descriptors in the video frames. The trajectory points are represented using the bag of words (BoW) model. Finally, an SVM-based approach is used for action recognition. From the confusion matrix shown in Figure 4, it can be seen that some frames are confused among actions such as walking, running, jumping, and skipping. Also, the action two-hand waving is similar to jumping jacks. These confusions influence the classification accuracy of the proposed model. The proposed approach is evaluated against the classification accuracy obtained by the following descriptors: TD, HOG, HOF, MBH, and their combinations, as shown in Figure 5. Table 3 shows the average recognition rate for the dataset along with the reference method. It can be seen that the HOG, HOF, and combined features achieved better accuracy than the proposed method due to variations in the codebook sizes and model representation. The vector patches are converted to codewords to produce a codebook comprising similar

CAVIAR Dataset.
The context-aware vision using image-based active recognition (CAVIAR) dataset [100] is a video dataset. It consists of seven activities: walking, slumping, fighting, entering, exiting, browsing, and meeting, as shown in Figure 6. The video sequences were recorded at different locations using a wide-angle camera lens in the INRIA Labs located in France and at a shopping center in Lisbon. The ground truth file is available in the CVML format. The file contains two types of labeling: activity label and scenario label. For every individual, the tracked target comprises 17 sequences, and the pixel positions depend on image scaling. The second video sequence displays the frontal view and is synchronized frame by frame. The sequences are 1500 frames longer than the first sequence.
The France sequence is categorized as "d1," and the Lisbon sequence is classified as "d2."

Table 3: Average recognition rate for the Weizmann dataset.
Method type                                        Average recognition rate (%)
Reference method using TD [110]                    94.44
Reference method using HOG [110]                   97.77
Reference method using HOF [110]                   96.66
Reference method using MBH [110]                   95.55
Reference method using combined methods [111]      96.66
Proposed method                                    96.3

From the confusion matrix shown in Figure 7, it can be seen that some confusions are observed for the actions walking, entering, and exiting. Moreover, similarities were also observed for the actions of fighting and meeting. The other actions in the dataset are classified accurately. The proposed method was compared with the reference method [112] implemented using the MFS detector and OpenCV classifier. The results from Table 4 and Figure 8 show that the recognition rate of our proposed method for both labels "d1" and "d2" is significantly better than that of the reference method. Negri et al. [112] proposed an approach for pedestrian detection that uses a movement feature space (MFS) to detect movements and generates descriptors using a cascade of boosted classifiers. The validation of the MFS detector is performed using an SVM classifier. The reference method considered only the frontal view of the dataset, resulting in only a few samples being used for validation. The lower recognition rate achieved by the OpenCV detector 20 (20 stages) and OpenCV detector 25 (25 stages) arises because both classifiers require more training stages to reduce the occurrence of false detections.

UCF Sports Action Dataset.
The UCF sports human dataset [101] comprises 150 videos with 10 action categories. The ten categories of actions include walking, kicking, lifting, golfing, running, diving-side, horse-driving, swing-side angle, skateboarding, and bench swinging, as shown in Figure 9. The 150 video samples are divided into 102 samples for the training set and 48 samples for the testing set. The N parameter for each cell is set to 200, and the penalty coefficient is set to P = 10 along with slack variables. The confusion matrix shown in Figure 10 shows an almost perfect accuracy rate, with confusion observed only between the activities running and skateboarding, as the model displayed false classification between these two action categories. The recognition rate for the reference methods [113][114][115][116] is specified in Table 5.
Mironicȃ et al. [113] proposed an approach to combine the frame features to model a global descriptor. The recognition accuracy of this method is affected when all the features are aggregated within a single descriptor and the BoW representation. Le et al. [114] proposed an unsupervised feature learning technique to learn the features directly from the video. They also explore an extended version of the ISA algorithm for learning the spatial-temporal features from unlabeled data. The classification was performed using a multiclass SVM in which the labels are predicted for all clips except the flipped versions, resulting in a drop in accuracy.
An action region proposal method was provided by Rezazadegan et al. [115] using optical flows. Action detection and recognition were performed using a CNN based on pose appearance and motion. Souly et al. [116] proposed an unsupervised method for detection using visual saliency [117] in videos. The video frames are divided into nonoverlapping cuboids and segmented using hierarchical segmentation to obtain the supervoxels from the cuboids. The features are decomposed into sparse matrices using PCA. When compared with the reference methods, the proposed method shows a better accuracy rate, as shown in Figure 11.

KTH Action Dataset.
The KTH action dataset [102] is collated by the KTH Royal Institute of Technology. It is a video database comprising human actions captured in various scenarios. It consists of six actions: walking, boxing, running, waving, jogging, and clapping. The dataset comprises 600 video files that are a combination of 25 individuals, 6 actions, and 4 different types of scenarios, as shown in Figure 12.
The experimental analysis is carried out using the reference methods [118][119][120][121][122]. Only one-third of the video samples are considered for experimentation. The 200 video samples are divided into 140 samples for the training set and 60 samples for the testing set. The confusion matrix for the dataset is shown in Figure 13. It can be observed that the classification rate was affected by the action category running, as it was detected as walking. The action category jogging was classified as running.
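The way a confusion matrix exposes such misclassifications, and how an average recognition rate follows from it, can be sketched as below. The labels and predictions are invented for illustration only and do not reproduce Figure 13; the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def confusion_matrix(true_labels: Sequence[str],
                     predicted_labels: Sequence[str],
                     classes: Sequence[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


def per_class_recognition_rate(matrix: List[List[int]],
                               classes: Sequence[str]) -> Dict[str, float]:
    """Fraction of each true class that was predicted correctly."""
    rates = {}
    for i, name in enumerate(classes):
        total = sum(matrix[i])
        rates[name] = matrix[i][i] / total if total else 0.0
    return rates


# Illustrative data: one jogging clip is labeled running,
# and one running clip is labeled walking.
classes = ["walking", "jogging", "running"]
truth = ["walking", "walking", "jogging", "jogging", "running", "running"]
predicted = ["walking", "walking", "jogging", "running", "walking", "running"]

matrix = confusion_matrix(truth, predicted, classes)
rates = per_class_recognition_rate(matrix, classes)
average_rate = sum(rates.values()) / len(rates)
print(matrix)
print(rates, average_rate)
```

Off-diagonal entries identify which pairs of actions are confused, while the mean of the diagonal ratios gives a class-averaged recognition rate.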
During experimentation, the parameter is fixed as N = 300 with four scenarios labeled as "d1," "d2," "d3," and "d4." The penalty coefficient is set as P = 10, and the slack variables are obtained by adaptive data matching. The average recognition rate for the dataset is shown in Table 6.
Sreeraj et al. [118] proposed a multiposture human detection system based on HOG and BO descriptors. This approach shows a slightly better accuracy rate as the system uses a fast-additive SVM classifier. This combined approach retains the HOG precision rate to improve the detection rate. Yang et al. [119] constructed a neighborhood by adding weights on the distance components. SONFs are generated, and MONFs are formed by concatenating multiple SONFs. The method also uses an LGSR classifier for obtaining the multiscale-oriented features and achieves better classification. Ji et al. [120] proposed an improved interest point detection to extract the 3D SIFT descriptors from single and multiple frames by applying PCA. The quantification of combined features using SVM increases computational cost and causes a drop in accuracy rate. The STLPC descriptor was proposed by Shao et al. [121] and learns the spatial-temporal features from the video sequence. A Laplacian pyramid is constructed by max pooling to capture the structural and motion features efficiently. The proposed method shows a slight decrease of 0.11% and 1.4%. The classification accuracy for the KTH dataset is shown in Figure 14.

CASIA Action Dataset.
The CASIA dataset [103] comprises 8 human actions: running, walking, jumping, crouching, punching, wandering, bending, and falling. The video action sequences were captured using a static camera from various angles and views. There are 1446 video sequences performed by 24 different subjects, as shown in Figure 15. For the experimental analysis, 250 video sequences are analyzed. They are split into 190 samples for the training set and 60 samples for the testing set. The N parameter is set as 300 for every cell, while the penalty coefficient is set as P = 10 along with the respective slack variables. The results of the reference framework [123], which uses the EM technique with an M-class SVM classifier and other classifiers, are provided in Table 7. The confusion matrix in Figure 16 shows that the action category falling achieves a full accuracy rate. Similar action categories such as running, walking, crouching, and bending have a 99% accuracy rate. The categories of punching and wandering show the lowest accuracy rate of 98%. Table 7 shows the average recognition rate for the CASIA dataset. Sharif et al. [123] proposed a hybrid strategy for human action classification that integrates four major techniques. Initially, the objects in motion are uniformly segmented, and the features are extracted using LBP, HOG, and Haralick features. The feature selection is performed by the joint entropy-PCA method, and the classification is performed using multiclass SVM. The classifiers multiclass SVM, DT, LDA, KNN, and EBT are used for experimental analysis. If high-resolution videos are used, there is a drop in efficiency due to computation overhead. Figure 17 shows that our proposed method has a better recognition rate than the classifiers used in the reference method.

i3DPost Multiview Dataset.
The i3DPost dataset is a multiview/3D human action/interaction database [104] created by the University of Surrey and CERTH-ITI (Center of Research and Technology Hellas Informatics and Telematics Institute). The dataset consists of multiview videos and 3D posture model sequences. The videos were recorded using a convergent eight-camera setup for capturing high-definition images with twelve people performing twelve different types of human motions. The actions performed by the subjects include walking, running, bending, jumping, waving, handshaking, pulling, and facial expressions, as shown in Figure 18. The 104 video sequences are divided into 60 samples for the training set and 44 samples for the testing set. This is because the actions in this dataset are much more complex than those in the UCF sports action dataset.
The N parameter is set as 150 for every cell, while the penalty coefficient is set as P = 10 along with the respective slack variables. The confusion matrix in Figure 19 shows that the action categories jumping, bending, waving, stand-up, run-fall, and walk-sit have a full recognition rate. The actions running and walking are misclassified in a few scenarios. Also, the actions handshaking and pulling are misclassified due to similar poses in some frames, leading to a decrease in recognition rate.
Among the methods listed in Table 8, Gkalelis et al. [124] and Iosifidis et al. [125] proposed an approach using binary masks obtained from multiview posture images for vectorization. This technique was used to extract the low-dimensional feature descriptors. DFT, FVQ, and LDA are applied for action recognition and classification. The authors tested their method with a limited testing set comprising only eight actions, compared to the 13 actions used in our proposed approach.
Holte et al. [126] proposed a score-based fusion technique for extracting the spatial-temporal features. These feature vectors are efficient for high frame data capture with different densities and views. Based on the evaluation of the accuracy rate in Figure 20, the proposed method achieves significant performance when compared to other reference methods with 13 actions.

JHMDB Action Dataset.
The joint-annotated human motion database (JHMDB) [105] is categorized into 12 action types. The twelve actions shown in Figure 21 include walking, climbing, golfing, kicking, jumping, pushing, running, pullup, catching, picking-up, baseball playing, and throwing. The dataset comprises three segmentation methods for the training and the testing sets. For our experimentation, we use only one segmentation method, in which only 316 videos are considered. They are further divided into 224 video segments for the training set and 92 video segments for the testing set. The N parameter is set as 350 for every cell, while the penalty coefficient is set as P = 10 along with the respective slack variables. The confusion matrix in Figure 22 shows that the action categories climbing, golfing, kicking, pushing, pullup, and pick-up have a 100 percent recognition rate. Action categories such as jumping, running, and catching showed recognition rates ranging from 91 to 98 percent. The action category with the lowest performance was walking, which was misclassified as running. The action jumping was misclassified as catching and vice versa, while the action baseball playing was misclassified as golfing.
As listed in Table 9, Jhuang et al. [105] performed a systematic performance evaluation using the annotated dataset. The baseline model was evaluated by categorizing the poses in the sample into three categories: low-, middle-, and high-level features. The dataset is annotated using a 2D puppet model, and the optical flow or the puppet flow is computed. The low- and mid-level poses are evaluated using the dense trajectory technique, while the high-level poses are evaluated using NTraj. Yu et al. [127] proposed a multimodal three-stream network for action recognition. PoseConvNET is used for detecting the 2D poses using the 2D CMU pose estimator, and the interpolation method is introduced for joint completion. The analysis performed on the individual cues showed a lower recognition rate when compared with the proposed method. However, when all the cues are combined, the reference method proposed by Yu et al. [127] shows better recognition by 1.34 percent when compared to our proposed method. The evaluation of the accuracy rates for the model is shown in Figure 23.

UCF101 Action Dataset.
The UCF101 dataset [106] is a collection of human action videos [128] and is an extended version of the UCF50 dataset. It comprises 101 human behaviors categorized into 25 groups, as shown in Figure 24, with 13320 behavioral video segments in total. The training and testing sets are divided into three categories. The average recognition rate from the three sets is analyzed from the dataset. The N parameter is set as 400 for every cell, while the penalty coefficient is set as P = 10, whereas other parameters are provided by pattern matching the image data to the processed image data. The effectiveness of the algorithm is measured against the following reference algorithms [9,93,111,129,130], as shown in Table 10.
Ryoo [111] proposed a dynamic and integral BoW model for action prediction. The human activities are predicted using 3D spatial-temporal local features along with the interest points. The feature values are clustered to form visual words using K-means, and the integral BoW uses HOG descriptors. The method showed a drop in recognition rates during the early stages of detection. Cao et al. [129] proposed a probabilistic framework for action recognition. Sparse coding is applied to spatial-temporal features, and the likelihood is obtained using MSSC. The datasets were tested using the SC and MSSC methods; the recognition rate was less satisfactory and required more training due to model complexity.
Kong et al. [130] proposed the MTSSVM model for predicting the temporal dynamics of all the observed features. This approach showed an improvement in the recognition rate when compared to other reference methods. The drop in recognition rate is because the model requires prior knowledge of the temporal action, which can be achieved only via prolonged training. A mem-LSTM model was proposed by You et al. [9] for recording the hard samples. The model used CNN and LSTM on the partially observed videos. The model has an improved recognition rate as it does not require prior knowledge of the features, and the global memory is sufficient for prediction. From Figure 25, it can be observed that the proposed method outperforms all the other reference methods.

HMDB51 Action Dataset.
The HMDB51 action dataset [107] comprises 51 behavior categories that contain 100 videos each and 6676 action sequences, as shown in Figure 26. The data are divided into three training and testing sequences for action recognition, with 60 training videos and 30 test videos. The proposed method is evaluated against the other techniques listed in Table 11. The N parameter is set as 150 for every cell, while the penalty coefficient is set as P = 10.
Jiang et al. [131] proposed a fuss-free method for modeling motion relationships by adopting global and local reference points. The code words are derived from the local feature patches and tested. Jain et al. [48] proposed a technique for decomposing the visual motion into dominant motions to compute the features and their respective trajectories. A DCS descriptor along with the VLAD coding technique is used for action recognition.
Heng et al. [132] introduced a technique for matching the feature points between the frames using the SURF descriptor and optical flow. These matched features are graphed with RANSAC for human action recognition. Zhang et al. [133] proposed a deep two-stream architecture for action recognition using video datasets. The knowledge is transferred from the optical flow CNN to the motion vector CNN to reduce computation overhead and to boost the performance of the model.
Karen et al. [135] proposed a two-stream ConvNet architecture to combine spatial-temporal features. The model is trained on dense multiframe optical flow to achieve enhanced performance. Figure 27 shows that the proposed method surpasses all the techniques considered for evaluation.

Figure 19: Confusion matrix for the i3DPost multiview dataset.

The RBM is a stochastic autoencoder that functions as both encoder and decoder. It is used for weight initialization in a neural network before training using stochastic gradient descent (SGD) for backpropagation. During training, multiple RBMs are stacked on top of each other to form a neural network. The RBM layer in the neural network inherits the functionality of the network. Thus, it can function either as an autoencoder or as part of the neural network. As mentioned earlier, the RBM-NN comprises a two-layer neural network that is fully connected to other layers. The visible layer functions as the input layer, and the hidden layer corresponds to the features of the input neurons. During training, the RBMs adjust their weights automatically. The weight fed to one output neuron corresponds to one feature of the input. For instance, each weight originates from an input pixel, and its value determines the strength of the connection towards the activation function. The parameters generated by the RBM are dynamic, and minor changes can cause huge differences in network behavior and performance. Every neuron is assigned an activation function, and the node output is set to either 1 (on) or 0 (off).
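The behavior of a single RBM with binary visible and hidden units can be sketched as follows. This is a minimal illustration and not the training procedure used in the experiments: it assumes one step of contrastive divergence (CD-1) as the update rule, a sigmoid activation, and a toy six-pixel input. The class name TinyRBM and all numeric settings are hypothetical.

```python
import math
import random
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Binary-binary RBM updated with one step of contrastive divergence."""

    def __init__(self, n_visible: int, n_hidden: int, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        # weights[j][i] connects visible unit i to hidden unit j.
        self.weights = [[self.rng.gauss(0.0, 0.01) for _ in range(n_visible)]
                        for _ in range(n_hidden)]
        self.hidden_bias = [0.0] * n_hidden
        self.visible_bias = [0.0] * n_visible

    def hidden_probs(self, v: List[int]) -> List[float]:
        return [sigmoid(self.hidden_bias[j]
                        + sum(w * x for w, x in zip(self.weights[j], v)))
                for j in range(self.n_hidden)]

    def visible_probs(self, h: List[int]) -> List[float]:
        return [sigmoid(self.visible_bias[i]
                        + sum(self.weights[j][i] * h[j]
                              for j in range(self.n_hidden)))
                for i in range(self.n_visible)]

    def sample(self, probs: List[float]) -> List[int]:
        # Each unit is switched on (1) or off (0) stochastically.
        return [1 if self.rng.random() < p else 0 for p in probs]

    def cd1_update(self, v0: List[int], lr: float = 0.1) -> None:
        h0_probs = self.hidden_probs(v0)
        h0 = self.sample(h0_probs)
        v1 = self.sample(self.visible_probs(h0))   # reconstruction
        h1_probs = self.hidden_probs(v1)
        for j in range(self.n_hidden):
            for i in range(self.n_visible):
                # Positive phase minus negative phase.
                self.weights[j][i] += lr * (h0_probs[j] * v0[i]
                                            - h1_probs[j] * v1[i])
            self.hidden_bias[j] += lr * (h0_probs[j] - h1_probs[j])
        for i in range(self.n_visible):
            self.visible_bias[i] += lr * (v0[i] - v1[i])


# Toy binary "pixel" patterns, purely illustrative.
patterns = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
rbm = TinyRBM(n_visible=6, n_hidden=3, seed=1)
for _ in range(200):
    for pattern in patterns:
        rbm.cd1_update(pattern)
features = rbm.sample(rbm.hidden_probs(patterns[0]))
print(features)
```

The learned weights and biases of such a layer are what would be handed to the subsequent network as its initial parameters before SGD-based fine-tuning.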
From Figure 28, we can observe that the classification accuracy of the model is influenced by the number of neurons provided to the RBM. The classification rate reaches its highest value when it satisfies the N parameter and gradually decreases after crossing the threshold layer. The influence of the parameter shows similar results for all the datasets.
Table 9: Average recognition rate for the JHMDB dataset.
Method type                                           Average recognition rate (%)
Reference baseline model [105]                        56.6
Reference baseline with low/mid-level pose [105]      69.00
Reference baseline with high-level pose [105]         76.00
Reference method with RGB + flow [127]                95.04
Reference method with RGB + pose [127]                91.67
Reference method with flow + pose [127]               97.10
Reference method with all combinations [127]          98.

Deep learning neural networks are trained using the SGD optimization algorithm. As part of the optimization problem, it is essential to continuously evaluate the error rate for the current state of the model. The error function used for our proposed method is a logistic regression loss function that estimates the loss of the model for weight updates. The loss function for our model is evaluated by generating a regression problem with a set of input variables, noise, and other properties. For evaluation, 100 input features are defined as input to the model. A total of 1000 samples are randomly generated, and the seed of the pseudorandom number generator is fixed to 1 to ensure that the same samples are considered every time the model is evaluated. Each input and output variable follows a Gaussian distribution for data standardization. The model has the learning rate set to 0.1 with the momentum set to 0.9. The model is trained for 100 epochs, and the testing set is evaluated at the end of every epoch to compute the loss function for the model. Figure 29 shows the performance of the model for the training and testing sets. Since the input and target variables for the model follow a Gaussian distribution, the average of the squared differences between the actual and predicted values is computed.
If the difference is large, a strict penalty is enforced on the model for making a misclassification. From Figure 30(a), we observe that the model was capable of learning the problem, achieving near-zero error for the MSE loss. The model converges reasonably for the training and the testing sets with a good performance rate.
If the target consists of widely spread values or the difference is large, punishing the model by enforcing a large penalty may affect the performance of the model. To avoid performance issues, the logarithm of every predicted value is calculated, and then the MSE is computed to obtain the MSLE. MSLE reduces the penalty enforced on the model if a large spread of values is obtained. The same configuration is followed, and the model is tested for widespread values using MSE and MSLE. From Figure 30(b), it can be observed that the MSE loss is significantly higher for the training and testing sets. This indicates that the model may be showing signs of overfitting, as there is a significant drop in the beginning and the model starts to recover gradually. Moreover, convergence between the training and the testing sets occurs at a later stage.
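The difference between the two losses can be illustrated numerically. The sketch below assumes the common convention of taking log(1 + x) of both the target and the predicted values before averaging the squared differences, which requires non-negative values; the numbers are invented and do not come from the experiments.

```python
import math
from typing import Sequence


def mse(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean of the squared differences between targets and predictions."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def msle(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """MSE computed on log(1 + x); assumes all values are non-negative."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical targets with a wide spread of magnitudes.
targets = [10.0, 1000.0]
predictions = [12.0, 1500.0]

print(mse(targets, predictions))   # dominated by the large-valued sample
print(msle(targets, predictions))  # penalizes relative rather than absolute error
```

With these values the MSE is driven almost entirely by the second sample, whereas the MSLE weighs both samples by their relative error, which is the penalty-reducing effect described above.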
For cases with values that are large or small compared to the mean value, the model might run into outliers. The mean absolute error (MAE) loss is considered suitable for handling outliers. It calculates the absolute difference between the target and the predicted values. In Figure 30(c), the training and the testing sets do not converge, and numerous spikes in values are observed, making it a poor fit in the case of outliers.

Figure 31 shows the overall performance evaluation of all the datasets that have been considered for human action recognition. The respective actions and the corresponding classification accuracy are provided for 41 action categories. For training and testing, individual actions such as walking, running, jumping, bending, waving, jumping jacks, and skipping display better top-1 accuracy rates, as the classification matches the target. Combined actions such as run-fall, walk-sit, and run-jump-walk also show a better classification rate when compared to individual instances. Standalone actions such as catching, entering, exiting, diving side, horse riding, skate boarding, facial expressions, and wandering were also classified accurately under top-5 accuracy, as the model considers the five most probable labels when matching the target label.

The restricted Boltzmann machine is composed of binary visible units and binary hidden units. The parameters for the RBM are estimated using stochastic maximum likelihood (SML). The time complexity of the RBM network is estimated to be O(n), where n is the number of input features or the number of components. The parameters considered for SML are the number of components, the learning rate for weight updates, batch size, number of iterations, verbose level, and random state. The random state determines the random number generation for sampling the visible and hidden layers and for initializing the components required for sampling the layers during fitting. It also ensures that the data remain uncorrupted and that the scoring sample obtains accurate results across multiple functions. The attributes considered for training the RBM are the biases of the hidden and visible units; the weight matrix and the hidden activation obtained from the model distribution are computed from the batch size and components.

Table 12 shows the computational complexity with respect to time for the various datasets. The table lists the dataset considered, number of videos, number of classes, pixel resolution, frames per second, the input sample considered for training the model, testing sample, training sample, testing and training accuracy, training time, and average epochs. From Table 12, it can be inferred that the training time increases when the video sample size and the pixel resolution increase. The input samples are divided into mini batches and tested with various iterations. The training time after each iteration is recorded, and the times of the individual iterations are averaged to obtain the training time of the dataset. The training time for the JHMDB and UCF101 datasets is high, as the input size and the pixel resolution are high. However, the training times of the datasets can be decreased, and better computational complexity can be achieved, with better computational resources.
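The distinction between top-1 and top-5 accuracy referred to in the performance evaluation can be made concrete with a short sketch. The function name top_k_accuracy and the class probabilities below are hypothetical; k = 2 is used only because the toy example has four classes.

```python
from typing import Dict, Sequence


def top_k_accuracy(score_rows: Sequence[Dict[str, float]],
                   true_labels: Sequence[str],
                   k: int) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    if len(score_rows) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Rank the candidate labels from most to least probable.
        ranked = sorted(scores, key=scores.get, reverse=True)[:k]
        if truth in ranked:
            hits += 1
    return hits / len(true_labels)


# Hypothetical class probabilities for three test clips.
score_rows = [
    {"walking": 0.70, "running": 0.20, "jumping": 0.06, "skipping": 0.04},
    {"walking": 0.50, "running": 0.40, "jumping": 0.07, "skipping": 0.03},
    {"walking": 0.60, "running": 0.30, "jumping": 0.08, "skipping": 0.02},
]
true_labels = ["walking", "running", "skipping"]

top1 = top_k_accuracy(score_rows, true_labels, k=1)
top2 = top_k_accuracy(score_rows, true_labels, k=2)
print(top1, top2)
```

A clip counts as correct under top-k accuracy whenever its true label appears among the k highest-ranked predictions, so the top-k rate can never be lower than the top-1 rate.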

Conclusion
In this paper, a parameter adaptive initialization method that uses a neural network is proposed. The parameter initialization method is modeled on the Maxout activation function using RBM-NN. The spatial and temporal features are learned from various human action datasets. From the experimental analysis, the model learns the spatial-temporal features from the shape feature sequences. An RBM-based neural network model is designed with two layers, and an SVM classifier recognizes multiclass human actions. The proposed method is tested on various benchmark datasets and compared with existing state-of-the-art techniques. The experimental results showed that the proposed method accurately identifies various human actions. The recognition rate was found to be significantly better than that of other state-of-the-art specific and multiclass human action recognition techniques.

Figure 31: Performance evaluation in terms of accuracy of human action detection.

Data Availability
The image datasets used to support the findings of this study are included in the article.

Disclosure
The research neither received any funding nor was performed as part of the employment. The research was solely carried out by the authors.