Channel Attention-Based Approach with Autoencoder Network for Human Action Recognition in Low-Resolution Frames



Introduction
Human activity recognition has a wide range of real-life applications, such as monitoring human activities, detecting abnormal or suspicious activity, retrieving video based on various actions, semantic video recognition, and observing patients in health centers [1,2]. To date, several solutions have been proposed for monitoring actions using video images, such as a visual review of events in videos [3,4]. While some have performed well, various body parts, such as hands and legs, can also be used to detect movement [5].
Still images alone cannot depict the full action. Our ability to recognize complete actions in video data is based on analyzing human body movements in-frame and their interactions with the environment [6].
The system should respond within a fraction of a second, a requirement that has unfortunately received little attention in previous research. Although a compromise between accuracy and time is required, real-time processing is still regarded as one of the top benchmarks in information processing. According to statistics, human activity recognition systems can process video frames at the frame rate and monitor nonstatic environments in real time [7,8]. Tracking multiple targets across a chain of online videos remains one of the most difficult aspects of video processing. This is especially true when it comes to recognizing human activity. Databases contain movements from everyday life. Some of these movements are considered normal, and some are considered anomalous [9][10][11]. Because of this, recognition under dense conditions, where multiple activities occur, is crucial. In addition, accuracy is compromised when movements overlap, such as jumping and diving together. Therefore, we plan on developing an action recognition system based on a network of video sensors in different dynamic environments. This will apply to several multispectral control videos. In order to recognize data, feature extraction and classification algorithms are required, regardless of the type of data. Support vector machines (SVM) and neural networks (NNs) can be utilized as primary classifiers in handcrafted feature extraction-oriented systems like those described in [12]. Deep learning (DL), mainly convolutional neural networks (CNNs), inspired by the hierarchical system of the human visual cortex, has advanced considerably in image classification [13]. By combining feature extraction and classification in one model, CNNs can learn categorical information from their features. Analyzing action representations and extracting features could significantly improve action recognition.
Human activity recognition is a challenging research field today. Video frames are analyzed to identify human activity. The demand for more precise and efficient frameworks for a variety of contexts has grown, as has the demand for more information, images, and video frames. In this field, deep learning is a highly effective and powerful technique. Recently, several approaches, also known as automatic methods, have been presented to recognize human activity in video using CNNs. Nevertheless, such systems may not process multiple video frames accurately in real time. Consequently, the requirement for large volumes of real-time and offline data has led to creative ideas in the field of motion and activity recognition through video. Some general goals are as follows:
(1) Our goal is to develop easy-to-use methods for our leading action recognition research. Across various applications of human activity recognition, the aim is to provide the most accurate estimate.
(2) Our model has been trained and can be used in action recognition applications as a hybrid deep learning architecture. The network can thus extract information from several datasets and generalize it to other datasets, resulting in improved accuracy. Our model is, therefore, more efficient, faster, and more suitable for big data applications. The proposed model can be implemented as a ready-made deep learning architecture in action recognition applications due to its rapid convergence and updating.
Video processing should incorporate deep learning techniques, which use several feature extraction models. A CNN with an autoencoder [11,14,15] (CNN-AE) characterizes features well. CNN-AE extracts and classifies features based on improved attention mechanisms. As most methods for recognizing human actions rely on the quality of the frames, recognition errors may occur when the resolution or dimensions of the image change. Figure 1 illustrates how a decrease in quality can adversely affect recognition.
Despite the loss of some frame information, decreasing the size or resolution of video frames can have benefits when sending them to data centers. These benefits include avoiding unnecessary operations such as compression and decompression and enabling online analysis of information received from the environment. Additionally, smaller frames reduce the complexity of computation. A suitable and fast structure, such as deep learning, can process small or low-resolution frames, reducing computation costs. DL-based structures can function in real time depending on how many layers they have. This reduces the computing complexity of the decision-making component. The proposed architecture also makes it easier to develop an architecture based on generalizability, uncertainty, and evaluation criteria. In this study, a computational method for monitoring human activity is developed using small video frames and a low number of frames. Smart city social systems can benefit from adopting the proposed approach. To identify human actions in a video with increased accuracy, our research uses a deep hybrid structure combining a CNN and an AE network.
This study aims to improve human activity identification in video. Computational complexity is reduced by processing a small number of lower-resolution frames in a short period of time. Our research is innovative in that it combines the improved CNN with the channel attention module and the AE structure for action recognition with a high number of classes and in low-resolution videos. A structure like this has not been proposed in a similar study before.
This article makes the following contributions.
1.1. Generalizability and Robustness.
The developed CNN with CAM and autoencoder (CNN-AE) model with attention mechanism (AM) is much more robust and helps the decision structure work more efficiently. The proposed method is considered robust since it shows low dispersion and only a small loss of accuracy under large changes in frame quality. Moreover, the diversity of the datasets used and the ability of the method to make accurate decisions about unknown data demonstrate its generalizability. Due to its robustness, the proposed system recognizes human actions reliably. In addition, the approach is capable of processing a random range of video frames of poor quality, indicating that it is sufficiently generalizable.

Monitoring Human Action.
It is also possible to detect individual behavior. Monitoring unusual activities can serve many purposes. Recognition of human activity on video has a substantial impact on environmental deterrence and urban crime prevention, resulting in a more sustainable city.
2 International Journal of Intelligent Systems

Real-Time Decision-Making.
A CAM-CNN architecture with AE architecture retains reliable recognition even when few frames or small videos are available, unlike end-to-end (e2e) and traditional deep learning models. According to some experiments, the proposed method operates in real time. On a single processor, it achieves a sufficient frame rate in frames per second (FPS), and each frame takes less than a second to process. The proposed action recognition method is therefore estimated to work in real time or close to real time. In other words, the decision-making process relies on a fast structure that recognizes human actions quickly, which aids the processing of small videos with a low number of frames.
The rest of this paper is organized as follows. Section 2 provides a brief overview of related studies. Section 3 presents a newly developed feature extraction and learning technique that uses the optimized CNN structure and AE. Section 4 reports the experimental outcomes generated by the proposed video frame analysis method. Section 5 concludes the study with a summary of the major points discussed.

Related Work
In recent years, video comprehension and action recognition have gained interest in computer vision. This is due to their potential applications, such as robots, autonomous driving, camera monitoring, and human behavior analysis. The earliest video sequence encoding techniques used handcrafted features [16][17][18][19][20][21][22][23][24][25]. With its rich trajectory features, action recognition with improved trajectories [19,26] achieved remarkable performance and has become one of the most popular hand-designed systems today. In this section, we discuss two significant topics: deep learning-based action recognition approaches and low-resolution activity recognition methods.
The remaining techniques for video action recognition can be divided into two categories. To enhance their temporal modeling capabilities, the first group of models uses a conventional two-stream structure [18,41]. A spatial 2D CNN learns semantic features from RGB frames, while a temporal 2D CNN analyzes motion content from optical flow. The final predictions are determined by averaging the scores of the two streams trained simultaneously. Data combinations for spatiotemporal analysis were examined in studies [37][42][43][44][45]. Using sparse frames sampled from evenly divided video clips, temporal segment networks (TSNs) [46] capture long-range relationships. Two-stream methods require optical flow computations, which are time-consuming and storage-intensive. The proposed technique, however, can operate without optical flow, which reduces network complexity.
3D-CNN-based systems and (2+1)D CNN systems comprise the second group of action recognition algorithms. 3D convolutions were used for the first time to model spatial and temporal data simultaneously in C3D [47]. In I3D [48], 2D convolutional kernels are inflated into 3D to capture spatiotemporal features. 3D-CNNs, however, involve many parameters, which makes them unsuitable for some applications. A variety of strategies have been adopted to manage the costly calculations of a 3D-CNN using the 2D + 1D paradigm. Pseudo-3D (P3D) [49] decomposes a 3D convolution into a pseudo-3D convolutional block. R(2+1)D [50] and S3D-G [51] factorize the 3D convolution to improve precision and reduce complexity. The relational module of the temporal relation network (TRN) [52] can be viewed as an alternative to pooling. The temporal shift module (TSM) [53] shifts a proportion of features along the temporal dimension, giving the network the performance of a 3D-CNN while maintaining the complexity of a 2D CNN. Nonlocal neural networks [54] made it possible to capture long-range temporal dependencies between video frames more efficiently. A dual-path network with an interactive fusion of mid-level elements was used in SlowFast [55] to model spatiotemporal data at two distinct temporal rates. Using the knowledge distillation procedure, our method also approximates the spatiotemporal representation at the feature level. The spatiotemporal representation capacity and transferability of 2D CNN and 3D-CNN models were examined in [56]. Action recognition effectiveness can be enhanced by maximizing selected frames via dynamic knowledge propagation [57]. Elastic semantic networks (Else-Net) [58] and memory attention networks (MAN) [59] have shown improvement in recognition precision in recent years.
Frame ordering has been discussed in several previous works [60][61][62]. While these previous efforts partially addressed some aspects of order prediction, their results only provided limited supervision, i.e., a binary label for in-order or out-of-order events [60,61] or subclip-based order prediction [62]. Furthermore, there is no explicit technique to encourage the model to prioritize motion data over background data.
Transformer-based techniques [63] significantly improve accuracy while conserving processing power. ViViT, a pure-transformer method that factorizes the space-time dimensions of the input, handles spatiotemporal tokens from a long series of frames effectively. By separating spatial and temporal attention within each block, TimeSformer [64] minimizes training time while maintaining test effectiveness. Spatial-temporal transformer (ST-TR) networks were constructed for skeleton-based action identification [65,66]. In comparison with the previous state of the art, Trear [67] has shown a significant improvement in egocentric RGB-D action recognition. MViT, a multiscale pyramid network, was presented in [68] to extract information from low-level to high-level attention. Compared with other successful applications, transformers have not fully realized their potential in action recognition.
Human action recognition has been employed to detect abnormal events and abnormal behaviors in some studies [69][70][71][72]. Additionally, it enhances safety and security by monitoring activities. Furthermore, it can be used to detect suspicious activity as part of a criminal investigation. Classical learning methods were used in some cases, while deep learning methods were utilized in others.

Low-Resolution-Based Action Recognition Methods.
Kawashima et al. [73] developed a deep learning-based method for identifying actions from extremely low-resolution thermal images. They distinguish between common and rare human actions (such as walking, sitting, and standing). Individual privacy protection is a strength of their work, which can be applied to Internet of Things (IoT) platforms. Even if privacy concerns are set aside, it is difficult to compute feature points and build a precise contour of the human body from low-resolution thermal images. Thermal images, their frame differences, and the center of gravity of people's areas are used as inputs to their deep learning method for learning the spatiotemporal representation.
The application of deep neural networks to video action recognition follows their widespread adoption for image classification [47,48,74]. According to C3D [47], one of the most well-known deep networks, 3D convolution is better suited to extracting spatiotemporal features from video. Analysis of deep ResNet [27] structure options for action recognition [74] has demonstrated desirable performance on common benchmarks using the I3D architecture [48]. Approaches to low-resolution (LR) single-frame applications include domain adaptation, feature learning, and super-resolution [48,75].
Privacy protection has influenced earlier research on this topic [76][77][78]. The model in [77] identifies several transformations that produce LR videos from the high-resolution (HR) training set. As a result of training on the LR dataset, action classifiers should gain a more precise decision boundary. The concept of inverse super-resolution (ISR) was introduced by Ryoo et al. [77] after they found distinct pixels in downsampled frames. Using this method, additional data can be extracted from low-resolution frames after learning how to transform images properly. To improve the acquisition of information inherent in low-resolution frames, Ryoo et al. [78] developed a multi-Siamese loss. Ryoo's achievements have established the standard for recovering lost visual information from constrained pixels. According to Chen et al. [79], LR and HR networks could share some filters in a semicoupled two-stream structure. It provides high-quality training frames. Xu et al. [80] found that leveraging HR videos effectively improved LR recognition performance significantly. A two-stream structure incorporating HR frames as inputs was demonstrated. A fully coupled two-stream network that shares all convolutional filters with an LR network outperforms previous methods marginally. CNN-based action classifiers are trained simultaneously [79,80] to ensure equal representation of HR and LR frames.
Super-resolution for action recognition is examined in [81]. Optical flow-guided training was developed to improve existing image- and video-driven super-resolution architectures. The authors demonstrate performance on small-scale actions by downsampling HMDB-51 and UCF-101 to 80 × 60, but their performance on genuinely small-scale datasets differs greatly.
Novel models address the practical difficulties associated with extremely low-resolution activity [82][83][84][85]. Demir et al. [86] have also developed a natural LR benchmark called TinyVIRAT and an approach that employs a progressive generative method to enhance LR quality. By using these models in HR frames, visual information lost over time with a limited number of pixels can be retrieved [87].
Even though LR frames were used in most of these methods, it is unclear why more optimal architectures were not used. In addition, similar methods have difficulty recognizing states like "falling," "sitting," and "lying down" because many action classes are not considered. Furthermore, some methods cannot be implemented in the real world as a model.
In contrast to previous action recognition models, this paper presents an improved CNN that incorporates attention mechanisms and an AE architecture. This will increase accuracy while using less information than previous models. In addition, we will test the method's suitability for low-latency and real-time scenarios. Based on feature learning, we developed a dataset for short-term human action recognition using low-quality video. Similar action recognition models require scanning the entire length of a video sequence to classify large temporal sequences. Through this method, we can create a new and enhanced machine-learning tool for testing models that recognize human motions quickly and with minimal latency.

Methodology
Figure 2 illustrates how our model recognizes various actions in video frames using the introduced method.We describe this method in the following sections.

Preprocessing.
In various environments captured on video surveillance, we use a deep learning network to recognize human actions and detect unusual activities or abnormal behavior. In addition to increasing accuracy, deep learning architectures are more capable of handling large datasets. Video input comes from a mix of existing and newly developed sources. Preprocessing involves extracting frames from previously captured videos. A subfolder named after each video is created to hold its frames. The video frames are saved as JPG images.
To conform to the enhanced integrated deep learning architecture, the data are resized and saved at 224 × 224. Before being stored in the folder, the testing video is also converted to frames and scaled to 224 × 224. The preprocessing is performed using MATLAB functions. The bilinear method was also used to produce large, medium, and low-resolution or small images (i.e., 100, 50, and 10% of the original frame resolution). For downsizing images, a method that reduces dimensions or resolution rapidly is preferred. The accuracy and speed of bilinear downsizing are the main reasons for choosing it.
Random sampling is used to select a few frames from an action video. Reducing video volume through frame sampling avoids unnecessary data processing. Depending on dataset characteristics, different videos have different numbers of shot segments. To reduce the number of images for each segment, we randomly select one frame per segment. In this way, almost all of the actions in a video are captured with a small amount of information. Figure 3 shows the method for capturing dynamically sampled shots.
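The bilinear downsizing step can be illustrated with a minimal sketch. The study performs preprocessing with MATLAB functions; the pure-Python version below is only an illustration, and the function names `bilinear_resize` and `downscale`, the pixel-center alignment, and the toy 10 × 10 frame are assumptions rather than details taken from the paper.

```python
def bilinear_resize(image, out_h, out_w):
    """Resize a grayscale image (list of rows of numbers) by bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for y in range(out_h):
        # Map the output row center to a fractional source row, clamped to the image.
        src_y = min(max((y + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(src_y)
        y1 = min(y0 + 1, in_h - 1)
        wy = src_y - y0
        row = []
        for x in range(out_w):
            src_x = min(max((x + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(src_x)
            x1 = min(x0 + 1, in_w - 1)
            wx = src_x - x0
            # Interpolate horizontally on the two neighboring rows, then vertically.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def downscale(image, fraction):
    """Downscale an image to the given fraction of its original height and width."""
    out_h = max(1, round(len(image) * fraction))
    out_w = max(1, round(len(image[0]) * fraction))
    return bilinear_resize(image, out_h, out_w)


# Toy 10 x 10 frame whose pixel value encodes its position (hypothetical data).
frame = [[float(r * 10 + c) for c in range(10)] for r in range(10)]
half = downscale(frame, 0.5)    # 50% of the original resolution
tenth = downscale(frame, 0.1)   # 10% of the original resolution
```

The same `bilinear_resize` call with `out_h = out_w = 224` would correspond to the fixed input size mentioned in the text.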
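A minimal sketch of selecting one random frame per segment is given below. The paper derives segments from shot boundaries that depend on the dataset; the equal-length segments, the function name `sample_one_frame_per_segment`, and the example values (120 frames, 8 segments) are assumptions made only for illustration.

```python
import random


def sample_one_frame_per_segment(num_frames, num_segments, seed=None):
    """Split frame indices 0..num_frames-1 into near-equal segments and
    draw one random frame index from each segment."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = random.Random(seed)
    picks = []
    for k in range(num_segments):
        start = k * num_frames // num_segments        # inclusive
        end = (k + 1) * num_frames // num_segments    # exclusive
        picks.append(rng.randrange(start, end))
    return picks


# Hypothetical video with 120 frames divided into 8 segments.
indices = sample_one_frame_per_segment(num_frames=120, num_segments=8, seed=0)
```

Because each index is drawn from its own segment, the selected frames stay in temporal order and cover the whole clip.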

Proposed Hybrid Model.
This paper describes a method for low-resolution action recognition and abnormal behavior detection from sample frames that consists of four parts: convolution, max-pooling, upsampling, and fully connected layers. The following are the parts of the proposed combined method for recognizing human actions in video.

Multilayer Convolution.
The CNN architecture is depicted in Figure 4. Multilayer convolution has four types of operations: fully connected layers (gray modules), upsampling layers (light yellow modules), max-pooling layers (light green modules), and convolutional layers (light blue modules). The permeability of porous materials was predicted using a CNN (see Figure 4). There are two convolutional layers and one max-pooling layer in the CNN architecture. Max-pooling reduces the number of parameters in the network and expands its receptive field by halving the size of the feature map. As a result, the CNN structure is essentially the design of the network, while the autoencoder (AE) is the core of the network [14].
For the AE and CNN, we provide low-resolution frames of size 128 × 128 × 1. The size of the feature map is reduced to 64 × 64 × 2 after the first CNN layer. The number of kernels determines the number of channels in the feature map when convolution is performed. Using the CNN architecture, a low-resolution 128 × 128 × 1 frame is successively converted to feature maps of size 64 × 64 × 2, 32 × 32 × 4, 16 × 16 × 8, 8 × 8 × 16, and 4 × 4 × 32. In the final feature map, each value represents the highest level of a feature. The 3-D feature map is flattened into a 1-dimensional feature line with 512 features (4 × 4 × 32 = 512). The AE creates a 4 × 4 × 64 feature map, which is then transformed into a 1024-dimensional feature line. The AE-CNN shown in Figure 4 is discussed below. In addition, the two feature maps are examined in an interconnected network. Input layers contain nodes that facilitate the transfer of low-resolution image output from one frame to the next.
Instances of a node may display regional characteristics, such as various parts of pixel picture information at different activity locations. Global characteristics can also be displayed in another instance. Training determines characteristics automatically. The fully connected network consists of nodes linked at the upper and lower layers. Each node is calculated from the nodes in the previous layer, where s indicates the current layer, r the number of neurons in the layer, and w and b the weights and biases of the fully connected layers. A fully connected network is a common machine learning strategy for evaluating, choosing, and utilizing high-level data to estimate values. For instance, as depicted in Figure 4, the layer size decreases from 400 to 150 according to the number of classes. A frame can be used to deduce the actions to be taken in the upper half of the tree. If the input image has poor resolution, the reconstructed features will be inappropriate, which is common in feature engineering scenarios. Low-resolution frames lack comprehensive information, resulting in confusion during training and accuracy drops. Low-level characteristics are needed to detect activities. A CNN cannot forecast high-resolution properties based on low-resolution images. To support the trained network, low-resolution frames and high-resolution features can both be used. The hybrid CNN combines low-resolution images with features, while the AE module creates high-resolution images.
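The feature-map sizes quoted above can be checked with a short shape trace. The sketch assumes that every stage halves the height and width and doubles the channel count, which is consistent with the listed sizes but is not stated as a rule in the paper; `trace_shapes` and `flattened_length` are hypothetical helper names.

```python
def trace_shapes(height, width, channels, num_stages, first_out_channels):
    """Return the (H, W, C) shape after each stage, assuming each stage
    halves H and W and doubles the number of output channels."""
    shapes = [(height, width, channels)]
    out_channels = first_out_channels
    for _ in range(num_stages):
        height //= 2
        width //= 2
        shapes.append((height, width, out_channels))
        out_channels *= 2
    return shapes


def flattened_length(shape):
    """Number of features after flattening an (H, W, C) feature map."""
    h, w, c = shape
    return h * w * c


cnn_shapes = trace_shapes(128, 128, 1, num_stages=5, first_out_channels=2)
for shape in cnn_shapes:
    print(shape)

cnn_features = flattened_length(cnn_shapes[-1])   # CNN branch: 4 x 4 x 32
ae_features = flattened_length((4, 4, 64))        # AE branch: 4 x 4 x 64
print(cnn_features, ae_features)
```

The two flattened lengths match the 512-dimensional and 1024-dimensional feature lines described in the text.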
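For reference, the node computation of a fully connected layer is conventionally written as below. This is the standard textbook form, offered as an assumption about the kind of expression intended for the symbols s, r, w, and b; the activation function f, the activations a, and the index conventions are not taken from the paper.

```latex
% Output of node j in layer s, computed from the r_{s-1} nodes of layer s-1.
% f is a nonlinear activation (assumed); w and b are weights and biases.
a_j^{(s)} = f\!\left( \sum_{i=1}^{r_{s-1}} w_{ji}^{(s)} \, a_i^{(s-1)} + b_j^{(s)} \right),
\qquad j = 1, \dots, r_s
```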

Autoencoder.
Training the AE does not require labeling every frame in the dataset. Relabeling, on the other hand, prevents low-detail frames from appearing and enables more accurate training. Because the AE is independent of labeled data, its training dataset is significantly easier to collect. As a result, the dataset containing the greatest number of pairs of low- and high-resolution frames is selected as a starting point. Figure 4 shows that the AE module contains an encoder (upper branch) and a decoder (lower branch). An encoder consists of three convolution layers and a max-pooling layer (distant branch). A decoder layer consists of one upsampling layer and two convolution layers. The upsampling operation, shown in yellow in Figure 4, has the opposite effect of the max-pooling operation. The small map is transformed into a large, high-resolution image by upsampling that doubles its width and height. The encoder transforms low-resolution [...] max-pooling layers, increasing the map size of the ultimate feature to 32 × 32 × 8.
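The opposite effects of max-pooling and upsampling on feature-map size can be seen in a small sketch. The paper does not specify the upsampling method, so nearest-neighbor repetition is assumed here; the function names and the 4 × 4 example map are hypothetical.

```python
def max_pool_2x2(fmap):
    """Halve height and width by taking the maximum of each 2 x 2 block."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
         for x in range(0, w - 1, 2)]
        for y in range(0, h - 1, 2)
    ]


def upsample_2x(fmap):
    """Double height and width by repeating every value (nearest neighbor, assumed)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


fmap = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2x2(fmap)      # 4 x 4 -> 2 x 2
restored = upsample_2x(pooled)   # 2 x 2 -> 4 x 4
```

Upsampling restores the spatial size but not the values discarded by pooling, which is why the decoder follows it with convolution layers.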
This network's parameters have also been designed and trained. Through repetition, each encoder and decoder consists of five layers. The initial conditions are a low-resolution frame (L), a high-resolution or original-size image (H), and a newly generated high-resolution or original-size image (newH). Four components of AE training are examined:
(1) The encoding process begins with the convolution layer, which transforms the L input data into features. The relationship between the feature F and the input L after an encoding layer is given in [15].
(2) Unlike the previous step, the decoding procedure converts the F feature into a high-resolution newH image. The relationship between the input newF and the output newH is given in [15]. The encoding and decoding convolution layers are identical with the exception of the last decoding layer. By changing the activation function of the last convolution layer, the output is mapped to the range 0-1.
(3) The adaptive moment estimation (Adam) technique minimizes the cross-entropy error over the N data in the AE (N_AE) by updating the network's parameters [15].
(4) During training, the number of encoding and decoding convolution layers increases. In both the encoder and the decoder, each layer is initialized one by one. Each encoder or decoder layer is added in three steps, up to five encoder layers.
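Step (3) above can be sketched with a toy example that minimizes a cross-entropy error using adaptive moment estimation. The update follows the commonly published Adam rule; the hyperparameters, the four-pixel target, and the use of a sigmoid to keep outputs in the range 0-1 are assumptions chosen for illustration and are not the paper's training configuration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(pred, target, eps=1e-12):
    """Mean binary cross-entropy between predictions and targets in [0, 1]."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place, using bias-corrected moment estimates."""
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grads):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


# Toy problem (hypothetical): learn four logits whose sigmoid matches target pixels.
target = [0.0, 0.25, 0.75, 1.0]
logits = [0.0, 0.0, 0.0, 0.0]
state = {"t": 0, "m": [0.0] * 4, "v": [0.0] * 4}

initial_loss = binary_cross_entropy([sigmoid(z) for z in logits], target)
for _ in range(500):
    preds = [sigmoid(z) for z in logits]
    # Gradient of the mean cross-entropy with respect to each logit.
    grads = [(p - t) / len(target) for p, t in zip(preds, target)]
    adam_step(logits, grads, state, lr=0.05)
final_loss = binary_cross_entropy([sigmoid(z) for z in logits], target)
```

In the actual AE the parameters are convolution weights rather than per-pixel logits, but the update rule applied to each parameter has the same form.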
With the above training approach, the encoder can achieve a high-resolution representation of human actions. This trains it to distinguish between low- and high-resolution frames from video. The decoder can produce high-quality images using this data. The CNN's kernel was incorporated into an image processing module to extract features from low-resolution images. The outputs of both the CNN and the AE are fed to a fully connected layer for the final prediction of actions from low-resolution images of distinct areas, with the AE acting as a parallel branch to the original CNN branch. Since the encoding features prevent defect accumulation, we use them instead of high-resolution frames. Using high-resolution frames would require encoders and decoders before the CNN, resulting in 15 convolution layers instead of the 5 proposed in this study, which increases parameters, overfitting, and enhancement. Accuracy decreases when degradation occurs.
The LOSS metric function in equation (4) is different from the loss function representing the entire combined network and its convergence. In this study, the LOSS function was used for the AE. For the entire combined network, and to guide the network in training all the parameters, the mean square error (MSE) was used as the loss function. The MSE can be expressed as follows:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i' - y_i \right)^2$$
Here, for N data, $y_i'$ represents the recognized action of the i-th low-resolution video image, while $y_i$ represents the observed action of the corresponding high-resolution image as determined through the utilization of the lattice Boltzmann technique.

Channel Attention.
Channel attention modules (CAMs) are CNN modules focused on channel-based attention. The channel attention map is generated by leveraging the interchannel relationship among features. The concept of channel attention arises from the understanding that each channel within a feature map detects specific features. Consequently, channel attention aims to determine the significance or relevance of the detected features in relation to the input frames. It is necessary to compress the spatial dimension of the input feature map to calculate channel attention effectively. A squeeze block and an excitation block were used in the feature channel domain. The CNN extracts spatial features as a fitted decision system. By adjusting several feature maps in the channel domain, discriminating features can be selected.
Performance can be maximized without adding new features by combining dense blocks and transition layers with channel attention. Channel attention networks are small, and their additional parameter count is just 0.22 M, which helps prevent overfitting. To minimize the size of the feature map, a transition layer with a 1 × 1 convolution layer and average pooling with stride 2 can be used. Combining the channel attention module with the transition layer results in adaptive sampling. In Figure 5, the channel-based attention mechanism processes feature channels in two stages, "squeeze" and "excitation."
In the squeeze step, the input features are compressed into a one-dimensional vector whose length equals the number of input channels. The original input feature U has size W × H × C, where W × H is the spatial size and C is the number of channels.
The 1 × 1 × C vector is generated by compressing each spatial domain W × H into a single value by global average pooling. Formally, the c-th component, $z_c$, of the squeeze output is given by the following equation [33,59]:
$$z_c = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j)$$
In the excitation phase, a gating mechanism consisting of two nonlinear, fully connected layers captures channel dependence.
To keep the model's computational complexity low, the two fully connected layers have just C/16 and C units, respectively. The excitation output s is given by [33,59]:
$$s = \sigma\left( W_2 \, \delta\left( W_1 z \right) \right)$$
Here, $W_1$ and $W_2$ are the parameters of the C/16 and C layers, σ is the sigmoid function, δ is the ReLU function, and z is the squeeze output. Finally, a weight is assigned to each feature channel. For each feature map, the weight $s_c$ and the initial feature map $u_c$ are used as inputs. Their channel-wise multiplication produces the final feature maps $u_c'$ [33,59]:
$$u_c' = s_c \cdot u_c$$
The channel attention module allocates adaptive weights to features by squeezing and exciting feature channels. The module adds only a limited number of parameters, namely those of the attention model for the feature maps.
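The squeeze, excitation, and channel-wise scaling steps described above can be sketched in a few lines. This is a minimal pure-Python illustration of the general squeeze-and-excitation pattern, not the authors' implementation: biases are omitted, the example uses C = 4 channels with a hidden size of 2 instead of C/16, and the weight matrices `w1` and `w2` are hypothetical values rather than trained parameters.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def squeeze(features):
    """Global average pooling: one value z_c per channel.
    `features` is a list of C channels, each an H x W list of lists."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]


def excite(z, w1, w2):
    """Gating: s = sigmoid(W2 * relu(W1 * z)), without bias terms."""
    hidden = [relu(h) for h in matvec(w1, z)]
    return [sigmoid(o) for o in matvec(w2, hidden)]


def channel_attention(features, w1, w2):
    """Scale every channel u_c by its attention weight s_c."""
    s = excite(squeeze(features), w1, w2)
    scaled = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, features)]
    return scaled, s


# Four 2 x 2 channels filled with 1, 2, 3, and 4 (hypothetical input).
features = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
w1 = [[0.1, 0.1, 0.1, 0.1], [-0.1, -0.1, -0.1, -0.1]]        # C -> hidden (2 x 4)
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 0.0]]       # hidden -> C (4 x 2)

scaled, weights = channel_attention(features, w1, w2)
```

Each weight lies strictly between 0 and 1, so the module can only attenuate channels relative to one another; the spatial layout inside each channel is left unchanged.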

Experimental Results
In this section, we analyze the results based on the implementation parts of the study methodology.We begin by examining the video frames.

Datasets.
Datasets utilized in the analysis include HMDB51 [88], UCF50 [75], and UCF101 [76]. HMDB51 [70] is one of the most complex and difficult-to-analyze video datasets related to human action recognition. Its actions include facial interaction, movement of body parts, physical contact with objects, and exercise. From YouTube, 6849 action samples were collected and categorized into 51 categories. Each category contains approximately 100 videos. The dataset is complicated because samples are collected from different participants performing the same task under different lighting and perspective settings. Considering the variety of camera movement, view and position of objects, object scale, perspective, cluttered background, and ambient light, UCF50 [75] shows a wide range of human behaviors. The action groups are divided into several groups with some characteristics in common, such as a person who plays the piano four times from different perspectives.
UCF101 [76] contains 13,320 YouTube videos in AVI format from 101 action classes. Every action takes between 2 and 7 seconds, and 100 to 130 samples are evenly distributed across all categories. UCF101 analysis is difficult due to the large number of action classes involving human interaction with objects, musical instruments, and body parts. A few frames from the UCF50 dataset are depicted in Figure 6.

Implementation Details.
The computer system used to develop our approach had the following features: an Intel(R) Core(TM) i7 processor (a single CPU), 8 GB of RAM, and a 64-bit operating system. MATLAB programming tools were used for the quantitative analysis. The default learning rate for the model is 0.001. The improved model trains the CNN and autoencoder for between 200 and 1000 epochs, and SGD is applied to both to further optimize the structure. Training the improved CNN model and autoencoder on the single CPU took about six to ten hours, depending on the learning structure.
All of our models are built on transfer learning models and fine-tuned convolutional networks. The training and validation process involved the calculation of errors, estimation of training parameters, convergence, and finally accuracy calculation. Error minimization during validation

Evaluations.
The multiclass performance is estimated with the accuracy criterion derived from the confusion matrix. In this study, all video frames were analyzed in three modes, in which the frames were generated at 70, 40, and 20% of the original quality, and the proposed model was used to identify the human actions. The confusion matrix, which compares the actual and predicted labels, shows how well a machine learning system performs in classification. Figures 7-9 show that the proposed method can recognize human activity at the three levels of video quality, i.e., 70, 40, and 20% of the original frames, with over 90% accuracy. In some instances, 100% accuracy was achieved. Each dataset is assessed in a separate section below.
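As a minimal illustration of how overall and per-class accuracy follow from a confusion matrix, consider the Python sketch below. The labels are made-up toy values, not results from the paper, and the helper names are hypothetical. The per-class figure assumes every class appears at least once among the true labels.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the actual class, columns the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # count every (actual, predicted) pair
    return cm

def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    return np.trace(cm) / cm.sum()

def per_class_accuracy(cm):
    """Correct predictions for each class divided by that class's sample count."""
    return np.diag(cm) / cm.sum(axis=1)

# Toy labels for three action classes (illustrative only).
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
print(overall_accuracy(cm))    # 6 of 8 samples are correct: 0.75
print(per_class_accuracy(cm))
```

A class whose row has all of its mass on the diagonal corresponds to the 100% cases mentioned above.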

UCF50.
As stated before, the videos in this database are classified into 50 distinct categories. The videos in each category are divided into groups that share characteristics. The categories include baseball, basketball shooting, bench press, motorbike riding, bicycling, shooting pool, diving, drumming, and numerous other sports and activities, some of which closely resemble other human acts and movements. Figure 7 shows the results of the algorithm for three distinct video quality levels with decreasing resolution. While the frame size has not changed, the output accuracy varies slightly from that at the original resolution. However, despite the drop in frame resolution, the difference between the results is relatively small, and the standard deviation between them is slight. Despite the large number of classes, the CAM-AE structure has learned discriminative features and representations through changes in the set of frames.
As a result, the accuracy of more than 50 categories exceeded 96% and five of them exceeded 97% in the various action categories. Figure 10 shows the learning, training, and convergence process of the proposed method in terms of the model's accuracy and loss for the video frames obtained from the UCF50 video data at all three frame quality levels. Compared with other deep structures, the method identifies human actions with less computational complexity. The hybrid structure, however, will be more effective with more training repetitions. Moreover, other CNN-family structures with many layers face similar challenges, such as limited generalizability, lack of interpretability, and computational complexity.

UCF101.
The UCF101 dataset is complex and difficult to use because it contains numerous action classes in which people perform various activities with a variety of items, such as playing musical instruments, using sports equipment, or interacting with different body parts. Figure 8 shows that when the size and resolution are reduced from 70% to 20% of the original frame, the classification error rate stays nearly the same, with low variance. Even when the video frames are of poor quality, processing is not impaired, and accuracy is higher than 96% in some cases.
Figure 11 illustrates the learning, training, and convergence processes of the proposed method, in terms of accuracy and loss, for the set of UCF101 video frames at three different quality levels. Consistent with the claim of the previous section, the proposed method is more accurate and has lower computational complexity than other similar deep structures and algorithms for recognizing human actions.

HMDB51.
The HMDB51 video frame set is one of the most complex sets of human activities ever studied. It includes categories related to human exercise, body movements, and body contact with objects. In total, there are 6849 YouTube actions divided into 51 categories, with approximately 100 videos in each category. Variations in participants, lighting, and perspective make the dataset more complex. State-of-the-art methods reach about 60% precision on this dataset. Interest in this dataset has grown considerably in recent years, with some studies reporting accuracy of about 70 percent. The suggested technique attains roughly 78 percent despite the loss in video quality.
The proposed method is admittedly less accurate on the HMDB51 dataset than on the other two datasets; however, compared to other similar methods, the results are satisfactory. The lower accuracy is due to the high complexity of the videos. There is little variation between the reported outputs despite a significant drop in quality. Figure 9 shows the results for three different video quality levels. Figure 12 depicts the training and convergence of the proposed technique for the HMDB51 video frames at the three quality levels.

Discussion
This research aims to reduce the number and size of video frames of human actions while maintaining accuracy. Classification accuracy, however, generally decreases as the video quality decreases. Through CAM and a deep hybrid structure with an AE, the proposed method overcomes the challenge of low video quality in terms of frame number and size.

Recognition and Video Frame Quality.
Table 1 examines the performance of the proposed method when the dimensions of the video frames and the number of frames are reduced. Frame labels are taken from the original labels after random sampling. To make the analysis less computationally complex, we randomly selected one out of every three frames. We also considered other scenarios, analyzing the final accuracy when one random frame is chosen from every 2, 3, 4, ..., and 10 frames. Table 2 reports these scenarios and shows that the highest accuracy was obtained when one out of every three frames was selected.
In addition, the frames with reduced dimensions were evaluated in terms of frames per second (FPS) to estimate the computational complexity. As the number of deleted frames increases, the correlation between the extracted features and the video sequence decreases. Methods that dynamically find the most appropriate video frames may be used; however, preparing the video and finding the proper frames with such strategies can take a long time.
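The frame reduction discussed in this subsection, keeping one random frame from each group of consecutive frames and shrinking the frame dimensions, can be sketched in Python as follows. This is an illustrative sketch rather than the authors' MATLAB code. The function names, the synthetic video, and the nearest-neighbour resizing are assumptions, since the paper does not state which resizing method was used.

```python
import numpy as np

def sample_one_per_group(frames, group_size, rng):
    """Keep one randomly chosen frame from each consecutive group of frames.

    frames : array of shape (N, H, W, C); a trailing incomplete group is dropped.
    """
    n_groups = len(frames) // group_size
    offsets = rng.integers(0, group_size, size=n_groups)   # random position in each group
    indices = np.arange(n_groups) * group_size + offsets
    return frames[indices], indices

def downscale(frames, scale):
    """Nearest-neighbour resize of (N, H, W, C) frames to `scale` times the original size."""
    _, h, w, _ = frames.shape
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    rows = np.arange(new_h) * h // new_h    # source row for each output row
    cols = np.arange(new_w) * w // new_w    # source column for each output column
    return frames[:, rows][:, :, cols]

rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(30, 60, 80, 3), dtype=np.uint8)  # synthetic clip
sampled, indices = sample_one_per_group(video, group_size=3, rng=rng)
small = downscale(sampled, scale=0.4)       # e.g. the 40% quality level
print(sampled.shape, small.shape)
```

Setting `group_size` to values from 2 to 10 reproduces the kind of scenarios compared in Table 2, and `scale` values of 0.7, 0.4, and 0.2 correspond to the three quality levels.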

Comparison.
The suggested method detects actions more accurately and with less computing expense than the comparative methods. With our method, features are extracted more accurately than with handcrafted methods. Many new approaches to action recognition have appeared in recent years, including deep learning methods [20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35][36][37]. Besides extracting features, it is essential to obtain discriminating features that generalize under diverse acquisition conditions. Feature extraction is sometimes achieved by creating a skeleton from the video; however, the information gained from the skeleton is sometimes discarded, making the method less robust and accurate [37,45,89,90]. Several of the methods in [25, 32-36, 41, 45, 91-99] degrade the features by adding unnecessary components, which lowers action recognition accuracy. Based on the attention mechanism, the autoencoder network, and the convolutional structure, the approach suggested in this paper provides a robust method that lowers the number and dimensions of video frames. The results are compared with those of similar approaches used in recent years in Table 3. The method can also compete with deep learning-based methods that have emerged in recent years for action recognition [101][102][103].
For UCF50, UCF101, and HMDB51, the model training time was 3, 5, and 4 hours, respectively. On these benchmark datasets, the suggested classification approach is validated as achieving superior or comparable classification precision. We find that our suggested technique correctly detects human actions in videos in the majority of cases. Video information can overlap with human actions, and current approaches may incorrectly classify similar actions, such as drinking, eating, chewing, and talking.

Limitations.
To date, considerable efforts have been dedicated to the recognition of human actions; however, only a limited subset of these efforts has adequately addressed the diverse range of limitations associated with this field. Video recording protocols for people's movements are one of the fundamental challenges encountered in this domain. The limitations involved include time considerations, camera positioning, diverse weather conditions, video interference, and the inherent ambiguity surrounding movement classification. Human position and speed influence the video images and the recognition performance. Excessive illumination and fluctuating weather conditions occasionally compromised the precision of human action recognition. The variety of camera angles makes it difficult to accurately evaluate performance based on captured frames. In many instances, the performance of the proposed model has been deemed satisfactory; however, it still needs to be trained on videos. The complexity, duration, and poor quality of video frames are significant challenges in this task. Several activities may take place simultaneously in a video, and humans engaging in multiple activities at the same time interfere with decision-making. To address this concern, it is necessary to consider distinct videos that can adequately train the model. Human actions are intrinsically complex and challenging to comprehend. Additionally, most action recognition models on standard video datasets focus on videos captured under optimal conditions, ignoring videos captured under abnormal conditions.
Moreover, implementation and constraint challenges may lead to pixel occlusion. Limitations such as camera movements and perspective distortion may influence individual actions, and recognition performance problems can be particularly aggravated when the camera moves. Variations in a system's operational classification affect its performance; there is only a marginal difference between walking and running, for example. Understanding human behavior requires discernment between different categories. In scenarios involving changes in style, perspective, behavior patterns, and attire, recognizing human actions becomes increasingly challenging. Human-object interaction and analogous activities remain active research topics. In addition to monitoring and tracing multiple actions, recognizing irregularities, such as fraud detection and anomalous behavior, within a limited set of training data is challenging.

Conclusion
Our method uses CNN-based channel attention mechanisms and autoencoders (AE) to recognize human actions in dynamic video with low resolution and a reduced number of frames. Even low-quality videos transmitted over the Internet or from social media can be handled by our system. The CNN model takes channel attention into account when choosing the frame-level representation. The designed AE, which converts high-dimensional data into a low-dimensional feature map, can reliably identify multiple actions from poor-quality video frames. Our experiments demonstrate that the proposed system is capable of processing a large number of frames per second (i.e., higher than 25 FPS) and can be employed in real time even when the resolution is poor. Using the UCF50, UCF101, and HMDB51 benchmark datasets, we evaluated the method's monitoring performance under nonstationary conditions. By using video frames with appropriate dependability ratings, the action recognition model can be fine-tuned to accommodate changes in nonstationary environments. With an improved version of our current system's architecture, our long-term strategy is to set and track specific goals. The video datasets do not include multiple actions performed by one individual, and actions that overlap, such as eating, drinking, and speaking, reduce the precision on video samples. We will develop a hybrid action recognition model for a multiview surveillance video architecture. In addition, we will design a training architecture to overcome challenges such as noise, similar actions, actions under different weather conditions, and multiple actions at once.

Figure 1: The recognition of actions in this figure is negatively affected by a decrease in resolution and dimensions.

Figure 3: Random sampling for a baseball action. With the help of the downsizing strategy, a new, low-volume video is reconstructed by randomly choosing frames for each segment.

Figure 4: The framework of the proposed architecture, which is based on AE and CNN architectures. The network has two branches. On the left side of the plot, a CNN with a channel attention block is used to recognize actions. The right-side branch implements an AE module to assess frame sequence characteristics.
One of the most challenging datasets is the YouTube Action database. The action videos of people in this dataset have low resolution, changing camera angles, changing scales, and bright, variable backgrounds. The dataset contains 11 sports classes with YouTube videos from 25 disciplines and four examples per action.

Figure 5: The channel-based attention mechanism, which operates by processing feature channels.

Figure 7: The confusion matrices for the UCF50 action recognition dataset at three different video quality levels.

Figure 10: The training and convergence process of the proposed method in terms of (a-c) accuracy and (d-f) loss for the UCF50 video data at all three frame quality levels.

Figure 11: The proposed method is evaluated through accuracy evaluation (a-c) and loss analysis (d-f) on the UCF101 dataset for each of the three frame quality levels.

Figure 12: The proposed method is evaluated through accuracy evaluation (a-c) and loss analysis (d-f) on the HMDB51 dataset for each of the three frame quality levels.

Table 1: Accuracy of the proposed method when the dimensions of the video frames and the number of frames are reduced.

Table 2: Accuracy, frames per second (FPS), and frame dimensions when one frame is chosen from frame sequences of different lengths.

Table 3: Comparison of the proposed method with other comparable methods in terms of accuracy and computational complexity. The best values are in bold.