GaitVision: Real-Time Extraction of Gait Parameters Using Residual Attention Network



Introduction
Fingerprints and faces are unique to every person [1]; hence, they are well suited for current biometric systems. Similarly, gait refers to a person's walking pattern, which is a unique characteristic of each individual. Gait is a complex biological process and a distinctive walking style that cannot be spoofed or imitated exactly. These unique patterns make gait ideal for user identification and authentication [2].
Generally, a biometric system compares the information registered for proof of identity to a person's current features.
This corresponds to the concept of one-to-one matching with a matching rate greater than 95%. Existing solutions achieve decent accuracy; however, spoofing, the mimicking or realistic imitation of a person's original pose, constitutes a threat to access authentication. The latest facial recognition techniques for biometrics are similarly threatened by facial spoof attacks. Many datasets of possible multiple-scenario spoof attacks have been released by CASIA and the ROSE lab, but the danger of spoofing persists. RFID or access cards and sensors [3] can easily be compromised in any authentication system. In this paper, overcoming the pitfalls of natural biometrics using an advanced method called gait biometrics is discussed.
However, gait features are purely dependent on visual appearance, which can cause problems under slight variations in color, contrast, and walking speed, and with low-resolution imagery. Gait computation is sensitive to these problems when pooling the results. The background color contributes to the disruption and slows down classification. These are some of the major problems that hold back the use of gait biometrics.
The authors address the issues faced during implementation and remove unnecessary features by isolating targets from the background, re-resolving low-resolution imagery by turning to a high frame rate (frames per second, FPS), and ensuring blur-free motion in the case of fast walking. All these techniques contribute to the accuracy of pose estimation. The design of gait features should be invariant with regard to clothing (color difference), viewing angle, and so on. Therefore, this work aims to disentangle gait features by isolating the target (walking person) from the background in visual appearance and converting the frames to a high frame rate, that is, 240 FPS, for motion blur-free visuals that avoid pixel corruption or distortion, which degrades accuracy [4,5].
To avoid the overlapping issue concerning the human gait, existing solutions use a 2-stage extractor: the first stage is person detection with a unique ID tagged on each person [6,7]. The second stage is human pose extraction inside the bounding box. Unique ID tagging removes pose overlapping problems [7,8]. It has also been demonstrated that gait recognition performance is significantly influenced by different intraclass variations related to the subject, such as shadows, walking surfaces, angle variations, environment variations, clothing, and segmentation errors [5,9].
A very common factor in human vision is that people are often able to identify a familiar person from a certain distance simply based on their walking style. It is even common to be able to mimic a person's walking style [10,11]. Although walking styles can be mimicked, microscopic parameters, such as leg pressure, angle of the walk, and distance of each step, cannot be accurately imitated. Thus, there has been a growing interest in natural biometrics, and researchers have exploited it for identification purposes. Initially, the inability of humans to mimic microscopic gait features aroused interest in defense research for secured access authentication. Each step may appear to be of a similar style at the macrolevel of observation; however, at the microlevel, the gait parameters show small variations that are unique to each frame and position. This work proposes an end-to-end deep learning technique to extract temporal information from the gait in each frame and position, that is, frame-level feature extraction from each silhouette independently [12]. The main contributions of this work include the following: (1) identifying a registered "subject" through his/her gait patterns; (2) performing basic but powerful statistical operations, such as mean, median, and max, in the attention mechanism rather than other activation layers, keeping the feature levels as simple as possible, since gait patterns are often biased and similar to each other; and (3) training the network with an attention model rather than a conventional CNN or transfer learning for high-level spatiotemporal feature extraction.
The list of abbreviations used in this manuscript is shown in Table 1.

Related Work
In this section, the authors describe the gait cycle and deep learning-based models for gait biometrics.

Gait Cycle.
A gait cycle describes the repetitive patterns of the human walking posture. A cycle outlines the postures between successive time instances of foot-to-floor contact.
These contact points, in relation to certain gait parameters, are essential for gait analysis. The gait cycle mainly consists of 2 phases: (i) stance and (ii) swing. These two principal phases contribute approximately 80% to a complete gait motion analysis. The stance phase covers 60% of the cycle, especially for walking motion; however, when a person runs, the major proportion shifts to the swing phase. Feature extraction in the time domain covers variations in intrinsic properties while walking, such as velocity, motion, body length, width, and bend angle. This provides the intrinsic patterns of a person's walking cycle, which are of extreme importance in recognition; this is how pattern information is extracted from walking cycles, as in the sketch below.
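To make these temporal quantities concrete, the following minimal Python sketch (illustrative only, not part of the paper's pipeline; the event timestamps are assumed to come from an upstream detector) derives cycle time, stance/swing percentages, and velocity from heel-strike and toe-off events of one leg.

```python
# Minimal sketch (assumption: heel-strike/toe-off times are already
# detected upstream, e.g., from pose keypoints) of basic temporal
# gait parameters for one leg.

def gait_cycle_parameters(heel_strikes, toe_offs, stride_lengths_m):
    """heel_strikes, toe_offs: ascending timestamps in seconds for one leg;
    stride_lengths_m: distance covered by each stride, in metres."""
    params = []
    for i in range(len(heel_strikes) - 1):
        cycle = heel_strikes[i + 1] - heel_strikes[i]   # one full gait cycle
        stance = toe_offs[i] - heel_strikes[i]          # foot on the ground
        swing = cycle - stance                          # foot in the air
        params.append({
            "cycle_s": cycle,
            "stance_pct": 100.0 * stance / cycle,       # ~60% when walking
            "swing_pct": 100.0 * swing / cycle,         # grows when running
            "velocity_mps": stride_lengths_m[i] / cycle,
        })
    return params
```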

Characteristics of Gait Cycle.
It has been demonstrated and sufficiently proven that human gait is unique to each person. The patterns of each individual, especially features such as the pelvis and thorax, differ completely between individuals. This temporal information can be used to build computer vision-based biometrics without any additional hardware equipment. The gait cycle of a subject can be broadly divided into 2 categories: right leg strikes and left leg strikes. Temporal information can be extracted from these 2 phases, mainly depicting heel strikes. This information is highly discriminative for differentiating between 2 subjects. A complete heel strike begins with a lifting of the heel in the forward direction and a swing in the backward direction, and then the cycle repeats.

Uniqueness of Gait.
Similar to the uniqueness of fingerprints, the gait posture of each individual is distinctive. Often, the concern regarding spoof attacks arises in natural biometrics, but an exact mimicking of a "target subject's" gait posture is impossible. A subject's walking style can be mimicked easily, but only at the macrolevel; the microlevel characteristics, such as foot pressure, heel strike angles, and distance maintained between each step, are impossible to mimic accurately. This fact is sufficient evidence for the expected success of a full implementation of gait biometrics in access authentication systems. There may be minute variations in walking posture, but these would not be natural changes; rather, they would be forced changes, similar to when movie stars completely transform their posture [13].

Gait Representation in Biometrics.
Unlike other vision-based models, gait does not entirely rely on pixel values. Metrics can be extracted from RGB, RGB-D, or binary frames. The metrics are purely based on the structure of the pixels, not on their intensity. Hence, for better results, the authors converted the frames into binary and computed the subject based on posture. These frames are generally known as masks and are obtained from gait energy/entropy images, defined by GEnI [14]. These images yield the silhouette masks of a target. The next extracted metric is the set of kinematic 2D body joint points. Conventional drawbacks, such as clothing and walking speed, can be resolved by this approach, given high-resolution frames for extracting rich information.
This method of extraction is proven to be robust to covariates, such as clothing, walking speed, and view angle, because high-resolution frames are used for computation. The authors of this work investigate training on patterns rather than training entirely on pixel-based features. This is contrary to the CNN model, which extracts millions of parameters rather than dealing with a compact set of computed parameters.
This novel approach can be utilized to train a model for different scenarios, such as pattern recognition. This work involves patterns, and the backbone is the patterns extracted from the data of each subject.
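As an illustration of the masking step described above, the sketch below uses OpenCV background subtraction to obtain a binary silhouette per frame; this is an assumed stand-in, not the authors' exact GEnI computation.

```python
# Illustrative silhouette extraction via background subtraction
# (assumption: a mostly static background, as in the CASIA-B setup).
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

def silhouette_mask(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    fg = subtractor.apply(gray)            # 255 = foreground, 127 = shadow
    _, mask = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)   # drop shadows
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)      # remove speckle
    return mask   # binary, structure-only input for the gait model
```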

Disentanglement Learning.
Disentanglement learning is a new form of gait approach to feature extraction that uses intense computational resources. Existing models mostly learn semantic latent vectors of the data (features) from CNN architectures. Disentanglement learning is gaining popularity in the computer vision realm for its purely data-driven approach. One of the outstanding networks in disentanglement learning is DrNet, which uses pose vectors with a two-encoder architecture; the content information is removed by generative adversarial training. Another approach, which includes segmentation and analysis, derives foreground segment masks of body parts from 2D pose joints using a U-Net architecture. These body part segments are transformed into the desired motion with adversarial training. Esser et al. [15] utilized U-Net and variational autoencoders (VAE) to disentangle an image into appearance and shape. Tran et al. [16,17] attained state-of-the-art performance on pose-invariant facial recognition by explicitly disentangling pose variations with a multitask GAN [18,19]. Further, DR-GAN [17] applies adversarial training with pose labels to disentangle pose features.

Single Image-Based Action Recognition.
Bhandari et al. [20] implemented simple image-based action recognition based on an HRNet [21] human pose estimation network. HRNet [21] represents multiresolution features with a set of feature maps extracted from an image in decreasing order of resolution and increasing order of channels [20]. The model returns heat maps and human joints for action recognition [14].

Attention Image-Based Feature Stream.
Bhandari et al. [20] and Fukui et al. [22] used attention mechanisms, such as attention-based image feature extraction streams, for foreground analysis. This method uses an attention image-based feature stream [23], which is built on top of ResNet18 [24]. The shallow stream is aided by skip connections from the feature maps extracted by HRNet [21]. These feature maps are concatenated with the output of each ResNet18 [24] layer using transition blocks.

Part-Image-Based Feature Stream.
For accurate action recognition using each body part, Bhandari et al. [20] proposed a part-image-based feature stream.
This involves feature extraction from HRNet [21], which is used for classifying body parts through the "Conv for pose estimation" block. These individual parts are then fed to ResNet18 [24] for classification.

Gait Datasets.
There are many conventional open-source datasets on gait. These are large in quantity and high in quality. A few examples are the SOTON large database, USF, CASIA-B, OU-ISIR, and TUM GAID [25]. The authors use CASIA-B [26] together with a custom-collected dataset for this work. Here, architectural performance is analyzed rather than compared with different state-of-the-art techniques on different databases. Hence, a large custom-collected dataset is used for testing and metric extraction. CASIA-B is a multiview dataset with 3 variations of each subject in terms of view angle, clothing, and carrying condition. The dataset contains 11 different view angles of each subject in the walking posture. A sample of the CASIA-B dataset is shown in Figure 1.

Limitations.
Most of the state-of-the-art methods mentioned above performed experiments based on image classification of subject-sequence frames using conventional convolutional neural networks and transfer learning methods. Most of these methods achieved good, but not consistently accurate, results across all subjects. A few of them applied an attention mechanism using pretrained transfer learning alone, which failed to extract all required features while also extracting unwanted features that diluted the feature learning. The limitations described above largely concern single-architecture image classification on a given dataset of different subjects. Hence, even if a slight variation is observed in the subject in terms of angle or distance, the predictions result in false negatives.
Keeping in mind the above factors and their main drawbacks, the authors propose a unique architecture that considers all necessary features and discards unnecessary features during the training phase, as explained in detail in the coming sections.

Technical Approach
In this section, the authors define the technical approach to gait formulation for a model that learns discriminative information from gait silhouettes. The proposed technical framework is shown in Figure 2.

Problem Formulation.
The proposed method mainly focuses on part-wise image feature extraction to understand the gait silhouette [27,28] at a microlevel. The training dataset from CASIA-B contains information on N people with unique identities $y_i$, $i \in \{1, 2, 3, \ldots, N\}$, where the gait silhouettes of identity $i$ are assumed to follow a probability distribution $P_i$ associated with that identity. All silhouettes of a person in one or more temporal sequences can be regarded as a set of $n$ silhouettes $X_i = \{x_j \mid j = 1, 2, \ldots, n\}$. Hence, the gait recognition of a person can be modeled mathematically as

$$\hat{y}_i = H(G(F(X_i))), \qquad (1)$$

where F is the set of convolutional layers that extract frame-level features from each unique-identity gait silhouette, G is a permutation-invariant function that maps frame-level features to a set-level feature extracted from each target subject (implemented as a set pooling operation over these layers), and H represents the discriminative learning of the probability function $P_i$ from the set-level features. The input $X_i$ is a tensor with four dimensions: set, image channel, image height, and image width.
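A minimal PyTorch sketch of this formulation follows; the module names F, G, and H mirror equation (1), while the encoder, the feature dimension, and the max-based pooling are illustrative assumptions rather than the exact architecture.

```python
# Sketch of y_i = H(G(F(X_i))) over a silhouette set of arbitrary size.
import torch
import torch.nn as nn

class GaitPipeline(nn.Module):
    def __init__(self, frame_encoder: nn.Module, num_ids: int, feat_dim: int = 256):
        super().__init__()
        self.F = frame_encoder                  # frame-level CNN features
        self.H = nn.Linear(feat_dim, num_ids)   # discriminative head for P_i

    def G(self, v):                             # permutation-invariant pooling
        return v.max(dim=0).values              # max over the set dimension

    def forward(self, x):                       # x: (set, C, H, W)
        v = self.F(x)                           # (set, feat_dim) frame features
        n = self.G(v)                           # (feat_dim,) set-level feature
        return self.H(n)                        # logits over the N identities
```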

Set Pooling.
Set pooling [27] is specifically used to extract the gait features of all N identity sets. Mathematically, it can be formulated as $N = G(V)$, where $N$ denotes set-level features and $V$ denotes frame-level features, given by $V = \{v_j \mid j = 1, 2, 3, \ldots, n\}$. To formulate a deeper function G, the permutation-invariant property of G is defined as

$$G(V) = G(\pi(V)), \qquad (2)$$

where $\pi$ is a permutation of the set elements [25]. Because this method is deployed in a real-time environment, the function G takes each set of a person's gait silhouettes with arbitrary cardinality. To satisfy the invariance constraint on G in equation (2), statistical functions are applied over the set dimension. Three powerful statistical functions are used for computation: max, mean, and median. The joint functions combining them are

$$G(V) = \max(V) + \operatorname{mean}(V) + \operatorname{median}(V), \qquad (3)$$

$$G(V) = 1{\times}1\,\mathrm{Conv}(\operatorname{cat}(\max(V), \operatorname{mean}(V), \operatorname{median}(V))) + \max(V), \qquad (4)$$

where "cat" represents concatenation along the channel dimension and $1{\times}1\,\mathrm{Conv}$ represents a 1 × 1 convolutional layer. These three statistical functions, max, mean, and median, are applied over the set dimension. Equations (3) and (4) represent learning a proper weighting to combine the information extracted by these three statistical functions.
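The set pooling of equations (3) and (4) can be sketched in PyTorch as below; the reading follows the set pooling literature the text builds on [27], and the residual max term and exact module layout are assumptions.

```python
# Set pooling sketch: max, mean, and median over the set dimension,
# concatenated along channels and fused by a 1x1 conv (equation (4)).
import torch
import torch.nn as nn

class SetPooling(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, v):                          # v: (set, C, H, W)
        mx = v.max(dim=0).values                   # (C, H, W)
        mean = v.mean(dim=0)
        median = v.median(dim=0).values
        cat = torch.cat([mx, mean, median], dim=0) # (3C, H, W), "cat" in eq. (4)
        fused = self.fuse(cat.unsqueeze(0)).squeeze(0)
        return fused + mx                          # residual max, set-level feature
```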

Attention Mechanism.
Visual attention networks are applied to extract spatiotemporal features at the frame level, which has been proven to improve set pooling performance [4,5,7].
Local information often misses crucial points when extracting temporal features from gait patterns. Hence, the authors present an element-wise attention map that extracts global information from the gait silhouettes. As shown in Figure 3, global information is first collected by the statistical functions. These values are then fed to a 1 × 1 convolutional filter to calculate the attention feature maps. The final set-level features [27] are then extracted with the max function and used to refine the frame-level features [29].
This residual structure can stabilize the convergence of the network's loss function. In this work, the authors consider three powerful statistical functions, that is, mean, median, and max, because they are easy to compute and the resulting values are easy to analyze as gait parameters. Since the parameters take matrix form, these three statistical functions work well to compute and maintain a distance between subjects.
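A sketch of this element-wise attention is given below, under the same assumptions as the set pooling sketch above; the exact filter arrangement of Figure 3 may differ.

```python
# Element-wise attention sketch: global set statistics feed a 1x1 conv
# producing an attention map that residually refines frame-level features.
import torch
import torch.nn as nn

class SetAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv2d(3 * channels, channels, kernel_size=1),
            nn.Sigmoid(),                          # element-wise attention map
        )

    def forward(self, v):                          # v: (set, C, H, W)
        stats = torch.cat([v.max(0).values, v.mean(0), v.median(0).values], 0)
        a = self.att(stats.unsqueeze(0))           # (1, C, H, W) global context
        refined = v + v * a                        # residual refinement per frame
        return refined.max(dim=0).values           # final set-level feature (max)
```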

Pyramid Mapping.
The extracted features are split into feature strips, or feature vectors, which are then used for person reidentification, as proposed in [8,10]. As in many networks, here too the images are cropped and resized to a uniform size. The discriminative parts vary from one size to another based on the camera angle fixed at the top. Fu et al. [10] proposed a horizontal pyramid mapping network for local and global feature extraction. The pyramid network consists of 4 scales that help the network focus on gathering local and global features. Extending this technique in this work, a fully connected layer for each pooled feature is used to map discriminative information. Suppose the pyramid network has S scales, $s \in \{1, 2, 3, \ldots, S\}$; the feature maps extracted by the set pooling function are split into $2^{s-1}$ strips along the image height dimension.
Thus, the total number of strips is $\sum_{s=1}^{S} 2^{s-1}$. Global pooling is applied to the 3D strips to extract the crucial information into a 1D feature vector. For an individual strip $z_{s,t}$, where $t \in \{1, 2, 3, \ldots, 2^{s-1}\}$ is the index of the strip, the final global pooling is given by $f'_{s,t} = \operatorname{maxpool}(z_{s,t}) + \operatorname{avgpool}(z_{s,t})$, which combines global max pooling and global average pooling. The final step applies fully connected layers to map the features $f'$ into feature vectors in the discriminative space. Each strip vector depicts different features from different receptive fields at different spatial positions. These feature vectors are then given to the fully connected layers. The convolutional layers operate with different receptive fields [30]: the deeper the extraction, the larger the receptive field, and because identity must be recognized from deeper parameters, deeper layers are used. The pixels representing the features (feature vectors) [31] in the shallow layers of the convolutional network focus mainly on local fine-grained information rather than global feature information, to avoid inappropriate features being used for training. However, global features are also important for isolating a person from the frame; hence, the deeper layers focus on global and coarse-grained information. The authors implement an approach similar to that in [10,27]: a multilayer global pipeline that collects set-level information from the convolutional layers.
These set-level features extracted from different layers are sent to the global pipeline.
The final feature map from the global pipeline consists of $\sum_{s=1}^{S} 2^{s-1}$ strip features, which pyramid mapping turns into the final feature set (vector) [10].
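A compact sketch of the horizontal pyramid mapping described above follows; S = 4 scales and the max + avg strip pooling come from the text, while tensor shapes and the per-strip fully connected layout are assumptions.

```python
# Horizontal pyramid mapping sketch: split the set-level feature map into
# 2^(s-1) horizontal strips per scale, pool each strip, map it with an FC.
import torch
import torch.nn as nn

class PyramidMapping(nn.Module):
    def __init__(self, channels: int, out_dim: int, scales: int = 4):
        super().__init__()
        self.scales = scales
        n_strips = sum(2 ** (s - 1) for s in range(1, scales + 1))  # 15 for S=4
        self.fcs = nn.ModuleList(nn.Linear(channels, out_dim)
                                 for _ in range(n_strips))

    def forward(self, n):                  # n: (C, H, W) set-level feature map
        feats, idx = [], 0
        for s in range(1, self.scales + 1):
            for strip in n.chunk(2 ** (s - 1), dim=1):   # split along height
                z = strip.amax(dim=(1, 2)) + strip.mean(dim=(1, 2))  # f' pooling
                feats.append(self.fcs[idx](z))           # per-strip FC mapping
                idx += 1
        return torch.stack(feats)          # (sum_s 2^(s-1), out_dim) vectors
```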

Pose Plotting.
The feature vectors are further used to plot the pose and to extract macro-gait features [32-34], such as pressure, knee angle, knee pressure, step size, and step angle; a small sketch of such joint-based computations is shown below.
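For illustration, the sketch below shows how two such macro features could be derived from 2D joint coordinates; the joint layout and formulas are assumptions for exposition, not the authors' exact computation.

```python
# Assumed macro-gait computations from 2D pose joints.
import numpy as np

def knee_angle(hip, knee, ankle):
    """Interior knee angle in degrees from three 2D joint points."""
    a = np.asarray(hip) - np.asarray(knee)
    b = np.asarray(ankle) - np.asarray(knee)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def step_size(left_ankle, right_ankle):
    """Pixel distance between the two ankles at heel strike."""
    return float(np.linalg.norm(np.asarray(left_ankle) - np.asarray(right_ankle)))
```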
This information is fed to a network that performs pose plotting for single and multiple persons, following the authors' previous work in [35,36]. The pose plotting pipeline flows as follows.

Single-Pose Plotting.
As [35] proved that a pure single-person pose estimator (SPPE) is unreliable due to localization errors, a hybrid network combining a symmetric spatial transformer network (STN) and a single-person pose estimator (SPPE) is introduced for accurate human pose estimation. Focusing on local features from the feature vectors given by the shallow layers, the spatial transformer network (STN) and the spatial detransformer network (SDTN) are used to remap to the original image and generate grids based on $\gamma$. The spatial affine transformation used for this pose prediction from the feature vector is given by

$$\begin{pmatrix} x^{s}_{i} \\ y^{s}_{i} \end{pmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{pmatrix} x^{t}_{i} \\ y^{t}_{i} \\ 1 \end{pmatrix}, \qquad (5)$$

where $\theta_1$, $\theta_2$, and $\theta_3$ are vectors, and $(x^{t}_{i}, y^{t}_{i})$ and $(x^{s}_{i}, y^{s}_{i})$ are the coordinates before and after the transformation. The symmetric spatial transformer network receives the feature vector from the global pipeline, and the spatial detransformer network generates the pose proposals via

$$\begin{pmatrix} x^{t}_{i} \\ y^{t}_{i} \end{pmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{pmatrix} x^{s}_{i} \\ y^{s}_{i} \\ 1 \end{pmatrix}. \qquad (6)$$

As mentioned in [15], the spatial detransformer network is the inverse operation of the spatial transformer network; the operations for $\gamma$ are given by

$$\begin{bmatrix} \gamma_1 & \gamma_2 \end{bmatrix} = \begin{bmatrix} \theta_1 & \theta_2 \end{bmatrix}^{-1}, \qquad \gamma_3 = -1 \times \begin{bmatrix} \gamma_1 & \gamma_2 \end{bmatrix} \theta_3. \qquad (7)$$
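A small numeric sketch of the STN/SDTN inversion of equations (5)-(7) follows, assuming θ is a 2 × 3 affine matrix; the round trip checks that the de-transform recovers the original coordinates.

```python
# SDTN as the inverse of the STN affine transform (equation (7)):
# [gamma1 gamma2] = [theta1 theta2]^(-1), gamma3 = -[gamma1 gamma2] theta3.
import numpy as np

def sdtn_params(theta):                       # theta: 2x3 STN affine matrix
    A, t = theta[:, :2], theta[:, 2]
    A_inv = np.linalg.inv(A)                  # [gamma1 gamma2]
    g3 = -A_inv @ t                           # gamma3
    return np.hstack([A_inv, g3[:, None]])    # 2x3 de-transform matrix

theta = np.array([[1.2, 0.1, 5.0],
                  [0.0, 0.9, -3.0]])
gamma = sdtn_params(theta)
p = np.array([2.0, 4.0, 1.0])                 # homogeneous source coordinate
q = np.append(theta @ p, 1.0)                 # transformed, re-homogenized
assert np.allclose(gamma @ q, p[:2])          # SDTN undoes the STN
```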

Fine-Tuning Pose Plotting.
For better extraction of human-dominant regions after the final feature vector recommendation, a parallel single-person pose estimator (SPPE) is added to the pose network during training [15], sharing its branch with the spatial transformer network. All layers of this parallel SPPE are frozen during the training phase and the weights of this branch are fixed; it is used to backpropagate center-located pose errors to the spatial transformer network. If the pose extracted by the spatial transformer network is not center-located, large pose plotting errors occur, resulting in the backpropagation of large errors through the parallel branch. Hence, the spatial transformer network learns to focus on the high-quality, dominant area estimated from the global pipeline. To maintain effectiveness, the parallel pose estimator is turned off in the testing phase to avoid multiple overlaps.

Pose Distance Plots.
The distance function for the pose plots is denoted $d_{\text{pose}}(P_i, P_j)$, and the box for pose $P_i$ is $B_i$. The soft matching function for the pose plots is defined as

$$K_{\text{sim}}(P_i, P_j \mid \sigma_1) = \sum_{n} \tanh\frac{c^{n}_{i}}{\sigma_1} \cdot \tanh\frac{c^{n}_{j}}{\sigma_1}, \qquad (8)$$

if $k^{n}_{j}$ is within the range of $B(k^{n}_{i})$; otherwise, it is 0. Here, $B(k^{n}_{i})$ is a box centered at the joint $k^{n}_{i}$, with each dimension being 1/10th of the original box $B_i$, and $c^{n}_{i}$ is the confidence score of joint $n$ of pose $P_i$. Some instances contain poses with low confidence scores, which are probably not correct poses but inaccurate mismatched plots; the tanh operation suppresses poses with low confidence scores, while the output for two joints with high confidence scores is close to 1. The distance plot counts the matching joints across a complete pose. The spatial distance between the parts is plotted as

$$H_{\text{sim}}(P_i, P_j \mid \sigma_2) = \sum_{n} \exp\left[-\frac{(k^{n}_{i} - k^{n}_{j})^2}{\sigma_2}\right]. \qquad (9)$$

Combining all the equations mentioned above, the complete final distance between human poses is articulated as

$$d(P_i, P_j \mid \Lambda) = K_{\text{sim}}(P_i, P_j \mid \sigma_1) + \lambda H_{\text{sim}}(P_i, P_j \mid \sigma_2), \qquad (10)$$

where $\lambda$ is the weight balancing the two distance terms and $\Lambda = \{\sigma_1, \sigma_2, \lambda\}$.
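An illustrative NumPy reading of equations (8)-(10) is sketched below; the gating-box construction and the parameter defaults are assumptions.

```python
# Pose distance sketch: tanh-gated confidence similarity K_sim (eq. (8))
# plus a lambda-weighted spatial term H_sim (eq. (9)), combined as eq. (10).
import numpy as np

def pose_distance(kp_i, kp_j, conf_i, conf_j, box_i,
                  sigma1=0.3, sigma2=0.3, lam=1.0):
    """kp_*: (n_joints, 2) coordinates; conf_*: (n_joints,) scores;
    box_i: (w, h) of the detection box B_i around pose i."""
    # Gate: joint n counts only if kp_j[n] lies inside B(kp_i[n]), a box
    # centered at the joint with sides 1/10th of the original box B_i.
    near = np.all(np.abs(kp_i - kp_j) <= 0.1 * np.asarray(box_i) / 2, axis=1)
    k_sim = np.sum(near * np.tanh(conf_i / sigma1) * np.tanh(conf_j / sigma1))
    h_sim = np.sum(np.exp(-np.sum((kp_i - kp_j) ** 2, axis=1) / sigma2))
    return k_sim + lam * h_sim   # larger value = more alike (similarity-like)
```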

In the case of a multihuman pose, the spatial detransformer network remaps the estimated human poses back to the original image coordinate system. To avoid redundant pose detections, a nonmaximum suppression network is used, as sketched below.
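A greedy sketch of this suppression step follows, treating the quantity from equation (10) as a similarity (an assumption of this sketch) with an elimination threshold η.

```python
# Greedy pose NMS sketch: keep the highest-confidence pose, drop remaining
# poses judged redundant by the pose distance above, repeat on the rest.
def pose_nms(poses, scores, similarity_fn, eta):
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        # Retain only poses sufficiently dissimilar to the kept reference.
        order = [j for j in order if similarity_fn(poses[i], poses[j]) < eta]
    return keep
```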

Training Details.
The CASIA-B dataset was used for training the proposed model. The dataset comprises 124 subjects under 3 different walking conditions and 11 view angles in the range 0°-180°. Each subject has 6 sequences for the normal condition, 2 sequences for walking with a bag, and 2 sequences for wearing a coat/jacket. Summing up all conditions and views, there are 110 sequences per subject. As there is no official split into training and test sets, the authors split the dataset in an 80-20 ratio.
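A minimal sketch of such a split is shown below; the sequence representation and the seeded shuffle are assumptions, since the paper does not specify the exact procedure.

```python
# Assumed 80-20 split over CASIA-B sequences (no official split exists).
import random

def split_sequences(sequences, train_ratio=0.8, seed=42):
    """sequences: list of (subject_id, condition, view, frames) tuples."""
    rng = random.Random(seed)
    shuffled = sequences[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]   # train set, test set
```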

Network Construction.
The proposed algorithm is built from two main blocks. The first block, with deep convolutional layers [37], is used to train on the CASIA-B dataset to extract the feature vectors. It is built on top of the ResNet18 module and acts as a human detector for the frames. The images are resized to 256 × 256 in the first block for the part-image-based stream. The resized images are fed to a 7 × 7 convolution with 64 filters and a stride of 2, followed by 3 × 3 max pooling with a stride of 2. This enters the first hidden layer, with two 3 × 3 convolutional layers of 64 filters that recur twice and are then passed to a 3 × 3 transition layer with 64 filters; downsampling and ReLU nonlinearity follow in the first layer. The second layer is the same as the first but with 128 filters, followed by the ReLU activation function and a 3 × 3 transition layer with 128 filters. The third layer repeats the pattern with 256 filters, the ReLU function, and a 3 × 3 transition layer with 256 filters, and the final layer uses 512 filters followed by a 3 × 3 transition layer with 512 filters. This attention network feeds into the statistical layers, called set pooling layers, rather than a fully connected layer with several classes, and finally a softmax activation function.
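A condensed PyTorch sketch of this first block is given below; filter counts and kernel sizes follow the description above, while padding, pooling placement, and the 3-channel input are assumptions.

```python
# ResNet18-style backbone sketch with the added 3x3 transition layers;
# the set pooling head sketched earlier replaces the usual FC classifier.
import torch.nn as nn

def transition(ch):                       # extra 3x3 transition layer per stage
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

def stage(in_ch, out_ch):                 # two 3x3 convs + transition + downsample
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        transition(out_ch),
        nn.MaxPool2d(2),
    )

backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),   # 256x256 input
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    stage(64, 64), stage(64, 128), stage(128, 256), stage(256, 512),
    # ...followed by the set pooling layers and a softmax classifier.
)
```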
In the attention image-based feature stream, the network configuration, shown in Table 2, contains additional transition layers that are not present in a standard ResNet18 module. These transition layers produce the feature maps F1, F2, F3, F4, which consist of the extracted features, that is, the feature vector, which is further fed into the pose estimator block for extracting statistical pose features. F4 is crucial for action-based feature extraction. As shown in Table 2, the network contains 4 layers, each consisting of two basic conv blocks. A pictorial representation of the technical architecture of the classification model is shown in Figure 4. The next block, the pose estimator, which takes the authors' previous work [35] as its base reference, has 6 convolutional layers with a pretrained pose model from VGG [38]. The pose blocks identify each body part and plot the pose distance between parts with respect to the confidence score.
The pose block plots all the joints in a 2D coordinate system in each frame, plotting all pose points and joining the lines between the joints [39]. The pose coordinates then provide the basis for extracting crucial macro features that maintain unique pattern values for each identity.

Performance Metrics.
The proposed network was trained with a batch size of 16 using the Adam optimizer [40] with an initial learning rate of 0.0001. The decay factor was maintained at 0.1 for the first 200 epochs. Training ran for 36 h and 1000 epochs on an Nvidia 1080 Ti GPU. The model evaluation metrics, shown in Figure 5, used to validate performance are average precision and the F1 score. The pose estimator compares the predicted joint coordinates and the pose distances inside the human region according to the intersection over union (IOU), and its parameters are updated at each epoch. The F1 score summarizes the success rate through precision and recall. Precision is the ratio of correct predictions to all predictions, and recall is the ratio of correct predictions to the total ground truths. However, neither alone is sufficient to measure the performance of the network [9,41]. The F1 score is calculated with precision and recall as dependent parameters to evaluate the network on the data. For these metrics, a correct prediction is termed a true positive (TP), a missed detection a false negative (FN), and an incorrect detection a false positive (FP) [9,41]. The computations for the abovementioned parameters are as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

As a heavyweight model combining 2 pretrained models, VGG and ResNet18, the inference time was ∼1.5 ms on GPU and ∼5 s on CPU in real time. Table 3 describes the results for some crucial gait patterns contributing to gait identity. The qualitative and quantitative results are plotted for the real-time video feed captured during the testing phase on the trained model of the proposed algorithm. Figure 6 visualizes the gait pattern motions graphically; these are unique to each identity subject and focus mostly on knee elements [32,42], such as angle, size, and pressure. During training, the model extracts the motion recordings from the sequences of each subject under their respective subject ID; these are fed as values to the feature vector. Figure 7 comprises the results of a test subject, representing the motion graphically throughout the feed. The variation pattern, a microlevel wave, is unique in each frame for all subjects. Figure 8, in contrast, shows another test subject for whom the camera is closer, with the pose plotted and the results of the gait motion displayed. Figure 8(b) specifically blacks out the frame and visualizes the graphical motion pattern. These gait patterns are used to save the weights for each unique subject, and these weights are used to classify each subject according to their identity. In this way, the proposed algorithm pins the gait biometrics to gait motion patterns.

The environment for real-time testing uses a normal Logitech camera with 1080p resolution; the subjects in Figures 7 and 8 were captured with this camera. The camera is placed 10 m away from the subject, in normal natural sunlight. Side view angles are used for real-time testing in order to capture leg movements throughout the motion. The frontal view would sometimes dilute the leg-movement information, since backward leg movements are missing from the frontal view. Hence, side-angle views are considered for real-time testing to capture all crucial points of the legs. Another real-time test, with the camera placed nearer to the subject and under different lighting conditions, is shown in Figure 9.
Figure 10 represents a comparison between 2 test subjects: Figure 10(a) shows the same subject in different scenarios, and Figure 10(b) shows different test subjects.

Comparison with the State-of-the-Art Methods.
A comparison with state-of-the-art methods is drawn in Table 4 against the proposed method, GaitVision. The comparison is made with respect to real-time evaluation. The appearance-based method GEI-SVR [44] uses the gait entropy image (GEI), where the extracted silhouettes and the energy image are defined by their silhouette masks. This method suffers a major drawback from sizeable intrasubject appearance changes due to covariates such as clothing, carrying condition, view angles, and walking speed. The methods described in [43,47] extract gait features from RGB images via a conditional random field. Zhang et al. [47] is a CNN-based approach with discriminative representations learned from data with multiple covariates. The main drawback of these two methods is their low performance in real time. Wu et al. [46] is a low-computational-cost method that can handle low-resolution images; however, it is sensitive to clothing changes, view angles, and walking speed, which makes it inappropriate for real-time deployment. Hu et al. [11] proposed view-invariant human gait identification, but in terms of view angles, GaitNet [47] surpasses [11]. Kusakunniran's method [45] describes spatiotemporal extraction of features.
This method extracts crucial spatiotemporal information while the subject is in motion, but much false and unnecessary information can be captured from the data. Kusakunniran et al. [48] proposed recognizing gait subjects across various angles through correlated motion. In real time, however, view angles and motion speed make this method unreliable. The proposed method, GaitVision, clearly surpasses the state-of-the-art methods in real time, working with different view angles, different backgrounds, and moderate motion. Most current methods are absent from real-time settings, since they are constrained to specific environments. Even though these methods train their models with a huge number of classes, limitations cause them to fail in real time. The authors here propose a unique procedure: collect the subject's motion for at least 5 minutes, train on it, and deploy in real time. The subjects collected in real time are trained and tested with various view angles and clothes, at moderate motion, with the camera 10 meters away from the subject. Figure 8(c) represents the isolation of the background to focus on the subject for clean gait pattern extraction (see Figure 8). The essential advantage of the proposed GaitVision algorithm is its method of training the subjects with different backgrounds and angles. Unlike the method proposed in [46], the GaitVision algorithm can detect a trained subject with different clothes and backgrounds. This is possible because of the training of crucial features and parameters extracted from the subject's motion. Another essential component of the proposed algorithm is real-time deployment, unlike most gait methods, which work on prerecorded videos. Hence, using these steps for extracting crucial gait features, GaitVision surpasses the rest of the state-of-the-art methods in real-time deployment.

Conclusion
The proposed algorithm is the backbone for classifying each subject based on their unique natural patterns. Gait motion is a unique natural pattern that can be used to train an authentication system for each distinctive subject. With training using deep convolutional layers, and the results then fed to feature maps to extract individual gait motion patterns, the detection accuracy for any trained subject is high, given at least 120 s of their motion at diverse angles. The environment is not hugely influential, as the frames are initially converted to grayscale and feature maps are extracted from the deep layers; this vector then suggests the local regions of human presence to the pose estimator, which plots the pose and extracts the gait motion patterns. The core novelty of this work lies in the pose estimation followed by frame-wise training of the pose patterns for each subject.

Limitations and Future Scope
The core aim of this work is to conduct extensive research on contactless gait biometrics using a simple camera. Hence, edge devices would be the optimal deployment target in production. However, this work uses a pretrained VGG network and a ResNet18 module, which are heavyweight, so some edge devices [33], such as the Raspberry Pi, are not suitable for production deployment; the edge device should be computationally capable, with more than 4 GB of RAM. Another drawback is constrained environments. Regarding camera angles, the top view is a major drawback and is not suitable, because the head and body pose would dominate the lower-body gait parameters. Since lower-body gait parameters play a crucial role in this work, top-view angles fail to capture the correct gait parameters; feature extraction becomes difficult for top-view angles and leads to false positives. Another problem in feature extraction is the camera distance from the subject. If the camera is kept beyond 10 meters from the subject, the motion parameters merge, leading to false parameter training and major inaccuracies.
Algorithm-wise, the major drawback is the FPS. As shown in the Results section (Figures 8 and 9), the maximum frame rate is ∼2 FPS on CPU; on GPU, it is 20 FPS with at least 4 GB of RAM and 55 FPS with 12 GB of RAM. The results in this work were obtained on CPU to show the complexity of the algorithm. The core vision is to deploy the algorithm standalone on an edge device; hence, this research can be extended by cutting computational load and preparing a lightweight model for easy edge deployment at the production level. The model can be further improved by collecting larger-scale datasets on more subjects, with at least 5 minutes of their motion at all possible angles and movements, and training the model for large-scale deployment.
Data Availability
The data were collected from on-site investigation and can be provided upon request from the corresponding author.
Ethical Approval
The study was conducted in the ethical manner advised by the Complexity journal.

Conflicts of Interest
The authors declare no conflicts of interest with any party.