Spatial-Temporal Graph Convolutional Framework for Yoga Action Recognition and Grading

The rapid development of the Internet has changed our lives, and many people have come to favor online video yoga teaching. However, yoga beginners cannot master standard yoga poses just by learning from videos, and difficult yoga poses can cause serious injury or even disability if they are not performed to standard. To address this problem, we propose a yoga action recognition and grading system based on a spatial-temporal graph convolutional neural network. First, we capture yoga movement data using a depth camera. We then label the yoga exercise videos frame by frame using a long short-term memory network, extract the skeletal joint point features sequentially using graph convolution, and arrange the video frames along the spatial-temporal dimension, correlating the joint points in each frame and its neighboring frames to obtain the connections between joints. Finally, the identified yoga movements are predicted and graded. Experiments show that our method can accurately identify and classify yoga poses; it can also identify whether yoga poses are standard and give timely feedback to yogis to prevent injuries caused by nonstandard poses.


Introduction
Yoga has become a very popular fitness exercise in today's life, but it is much more than that. Yoga is a physical and mental discipline that combines art, science, and philosophy. It can help people regulate their breathing, keep their bodies healthy, and calm their moods. In today's highly connected Internet era, according to incomplete statistics, yoga has become the preferred fitness exercise for 300 million people [1]. As a scientific exercise, yoga encompasses breath control exercises, body stretching exercises, and mind cleansing [2]. Yoga originated in ancient India and then spread to the West, where it became a mainstream fitness modality; it eventually spread globally with the Internet, becoming one of the most popular exercise cultures worldwide [3]. According to a joint UK and US survey, the demographic profile of the yoga training population indicates that women are the main enthusiasts of the sport, accounting for 85% of all yoga practitioners [4][5][6].
Numerous studies have shown that yoga exercises are beneficial to the human body. There is also a large amount of research in rehabilitation on how to make yoga training work better for patients during recovery. This is one of the reasons why yoga has become a favorite exercise for many people [7]. In addition, research has shown that yoga has a complementary healing effect on eating disorders; it can modify the patient's eating habits and help the patient maintain a diet [8]. In interviews, yoga practitioners reported that yoga gave them a positive subjective life experience, making them healthier and more optimistic. There were significant improvements in self-care, self-activity, life comfort, and dwelling senses [9][10][11]. In fact, most of the experience that yoga brings to people comes from the yoga instructor. The instructor, as the guide of yoga, influences the yoga student in an invisible way through his or her philosophy of teaching, teaching environment, outlook on life, values, and demonstration of yoga effectiveness [12].
Although some researchers have demonstrated that yoga can be practiced without differentiating between "traditional" and "authentic" issues [13], most people currently prefer modern yoga. Modern yoga is simpler and less demanding in terms of postural alignment and breathing exercises [14]. This is one of the reasons why modern yoga has become a healthy exercise for young and old alike. However, with overall economic development, yoga has gradually become commercialized. With commercialization, the expression of yoga has diversified and more and more people have been attracted to it. In our literature research, we found that yoga is becoming a synonym for young, beautiful, and attractive women [15]. Yoga appears in various fashion magazines, which show poses that have a certain ornamental quality but are difficult even in the eyes of professionals. Ordinary people are attracted by these ornamental poses, but the poses are risky for them. Commerce has idealized yoga in order to facilitate promotion and thus attract consumers [16]. The commercialization of yoga is therefore a double-edged sword. Because consumers are unfamiliar with the poses, they are likely to cause irreversible damage to their bodies when blindly imitating them, which is a potential risk in yoga training.
Traditionally, yoga is taught face-to-face, with the yoga instructor indicating in person whether the yoga poses are standard or not. This kind of teaching gives yoga students a more direct feeling for standard yoga movements. However, with the advent of the 5G era and the rapid development of short videos, short video platform bloggers often adopt online teaching methods to teach yoga poses in order to attract fans. This is also the way most people learn yoga at present: they watch videos and imitate them. However, most people do not have professional yoga equipment and props, and they are not clear enough about the standard yoga postures. Blindly imitating the yoga postures in the videos carries a great risk of physical injury. To solve this problem, we propose to use real-time posture detection technology to detect the posture movements of yoga students and then use deep learning algorithms to grade and match yoga movements. A reference movement is given to the yoga students, and for nonstandard movements, the students are prompted in time to prevent physical injuries. In the specific experiment, we use the depth camera to capture the training postures of yoga students and decompose the postures so that the yoga movements can be understood at the computer level. The postures are then compared with a standard database to verify whether they are standard and to give feedback to the yoga students. Experiments show that the method proposed in our research can provide effective feedback to yoga trainees on the grading of yoga poses. The contributions of this paper can be summarized as follows.
The rest of this paper is organized in the following manner. Section 2 discusses the work related to depth cameras and action recognition. Section 3 introduces the skeleton recognition principle of graph convolution, then introduces the residual unit and multistream input structure, and finally introduces the optimization principle of the partial perception framework. Section 4 reports the experimental data collection, model training details, and analysis of experimental results. Finally, Section 5 concludes our research and outlines some further research work.

Related Work
The presentation of human motion postures in 3D space often requires the use of depth cameras. Information such as joint angles and skeletal space points can be deduced from the depth camera or from the spatial position data of the human body [17]. Different poses generate different skeletal contours, and to solve this problem, some researchers have proposed the idea of spatial segmentation, which takes an approximate mapping approach to define the location of spatial points for each segmented region. Literature [18] proposed a joint distribution method, which takes a bidirectional derivative approach to the mapping function. Literature [19] also uses the joint distribution rule but, unlike the former, adopts a Bayesian algorithm to obtain the image contour conditions. The final distribution of the image contour conditions is mapped to the hybrid framework to obtain the spatial distribution features. Literature [17] additionally learns conditional distributions when learning features in the hybrid framework to obtain the image contour features more directly. In [20], to solve the image contour error problem caused by pose ambiguity, the researcher placed three depth cameras at different angles to capture the human motion contour in all directions and obtained the skeletal spatial position from a standard dataset. In [21], the researcher used the SVM method to learn different pose features and perform pose prediction on the acquired 3D shape data. This method links contours and 3D shapes but requires the support of large databases. Motion capture depth cameras also need to be calibrated to ensure accuracy in 3D reconstruction work. In [22], the researcher applied the EM algorithm to calibrate the human action pose for multicamera linkage, and the mapping of 2D contours to 3D skeletal joints was achieved by training a neural network. Literature [23] adopts hybrid probabilistic PCA to predict the 3D body structure captured by the depth camera, which improves the accuracy of the 3D joint point coordinates.
Human motion recognition techniques originated from skeletal annotation [24,25] of video clips [26,27] to obtain the motion pose in each frame, which was then judged by manual criteria. Earlier human action recognition methods are based on RGB images, but such methods are susceptible to the influence of nonobjective environments. The human skeleton-based action recognition method is less influenced by the nonobjective environment. This method can acquire the spatial-temporal features between joint points and learn the connection between features in a neural network to predict the human pose. Current neural network architectures that can be combined with the human skeleton approach include recurrent neural networks (RNN) [28,29], long short-term memory networks (LSTM) [30,31], and convolutional neural networks (CNN) [32]. To make the human skeleton approach more general, [25,26] proposed to use the heat map as a complement to the skeleton information and to use the human pose image in each video frame for the encoding process. The feature communication between bone joint points is shown in Figure 1.
Literature [33] proposed a method to construct a human action dataset combining skeleton information with video in order to improve the pose estimation and action recognition accuracy of CNN networks. Literature [34] proposed a multitask parallel learning framework to improve the accuracy and stability of body joint detection. Literature [35] proposed a human intention algorithm aiming at learning behavioral action features through environmental assistance. Literature [36] took an attention mechanism approach, which divides the human body into different parts and obtains attention from each part separately to recognize actions. Some researchers have found that the spatial-temporal graph convolution network (ST-GCN) can utilize the spatial-temporal information of skeletal articulation points effectively. It performs spatial-temporal convolution on the skeletal graph, models the graph representation of each skeleton, and uses a subsequent temporal filter to capture dynamic temporal information, as shown in Figure 2.

Graph Convolutional Network.
Following [37], the sequence of the human skeleton in space at each frame $t$ is expressed by a spatial graph convolution whose terms are defined as follows: $D$ represents the maximum distance of the graph, $f_{in}$ and $f_{out}$ represent the input and output feature maps, $\otimes$ represents the multiplication function, $A_d$ denotes the $d$-order adjacency matrix of the joint pairs, which is used in its normalized form, and $W_d$ and $M_d$ indicate adaptive adjustment parameters. These parameters play an important role in the realization of boundary adjustment and convolution operations. In order to extract temporal features, we insert an $L \times 1$ convolutional layer in the shallow layer to fuse the spatial information of the joint points between adjacent frames. In the process of temporal feature extraction, $L$ represents the length of the time window, which is a predefined hyperparameter. Each temporal unit and spatial unit is followed by a BatchNorm module and a ReLU module, and together they form one block.
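For reference, a spatial graph convolution consistent with the symbols defined above is commonly written as follows. This is a sketch under stated assumptions rather than the authors' exact formula: the degree matrix $\Lambda_d$ used for normalization does not appear in the text and is introduced here for illustration, and $\otimes$ is read as an elementwise product.

```latex
% Assumed form: sum over graph distances d = 0..D of a learned projection W_d
% applied to the input features, weighted by the normalized d-order adjacency
% matrix and an adaptive edge-importance mask M_d.
% \Lambda_d (diagonal degree matrix of A_d) is an assumption, not from the text.
f_{out} = \sum_{d=0}^{D} W_d \, f_{in}
          \left( \Lambda_d^{-\frac{1}{2}} A_d \Lambda_d^{-\frac{1}{2}} \otimes M_d \right)
```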

Residual Unit.
Literature [38] proposed a structure called the bottleneck, which exploits the advantages of 1 × 1 convolutions placed before and after the common convolution part to reduce the number of feature channels in the convolution operation. In this paper, we use the bottleneck structure in place of the original temporal and spatial modules, and we found in the experiment that the improved structure is significantly faster in model training and parameter calculation. For example, suppose the input and output channels are both 256, the channel reduction rate is r = 4, and the time window size is L = 9. Then, the total number of parameters involved in the calculation of the original structure is 256 × 256 × 9 = 589824. If the bottleneck structure is adopted, the total number of parameters involved in the calculation is 256 × 64 + 64 × 64 × 9 + 64 × 256 = 69632. Comparing the two, the bottleneck structure reduces the number of parameters of the original structure by a factor of about 8.5. Finally, we propose a new PartAtt block to enhance the generalization ability of the model. An example of a bottleneck structure frame is shown in Figure 3. Considering that the temporal module and the spatial module in the original structure cannot integrate the features well, we connect the temporal and spatial modules with the residual structure to construct the ResGCN unit. The specific residual connection structure is shown in Figure 4. The Module residual module adopts a skip connection mode, the Block residual module adopts a before-and-after connection mode, and the Dense residual module integrates the connection modes of the Module and Block residual modules, making the structure more compact and saving computation.
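The parameter comparison above can be checked with a few lines of Python. This is a minimal sketch that counts only convolution weights (biases and normalization parameters are ignored, as in the text); the function names are hypothetical.

```python
def temporal_conv_params(channels: int, window: int) -> int:
    """Weights of a plain L x 1 temporal convolution with equal in/out channels."""
    return channels * channels * window


def bottleneck_params(channels: int, reduction: int, window: int) -> int:
    """Weights of a bottleneck: 1x1 reduce, L x 1 convolution, 1x1 expand."""
    inner = channels // reduction            # reduced channel count
    reduce_1x1 = channels * inner            # 1x1 convolution that shrinks channels
    temporal = inner * inner * window        # L x 1 convolution on reduced channels
    expand_1x1 = inner * channels            # 1x1 convolution that restores channels
    return reduce_1x1 + temporal + expand_1x1


# Values from the example in the text: 256 channels, r = 4, L = 9.
original = temporal_conv_params(256, 9)
bottleneck = bottleneck_params(256, 4, 9)
ratio = original / bottleneck
print(original, bottleneck, round(ratio, 2))
```

The ratio comes out slightly below 8.5, which matches the rounded figure quoted in the text.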

Multistream Input Structure.
As we know from the bottleneck structure framework, each layer of input can be represented by a set of hyperparameters. In the first layer, we usually use basic operations to process the original input data. The second layer introduces the bottleneck structure to filter the output data of the previous layer; bottleneck designs differ in the number of channels between the input and output. The third and fourth layers also use the bottleneck structure, the only difference being that each layer is followed by a PartAtt unit. By introducing the PartAtt unit, all the position information of the extracted feature vector is preserved. In the decoding process, the encoding can be accessed directly through the PartAtt mechanism, which reduces the intermediate steps of traditional decoding and solves the problem of feature loss. In addition, in the PartAtt mechanism, each step of encoding and decoding directly accesses the source feature library, which realizes direct feature transfer between encoding and decoding and shortens the exchange path in feature transfer. Furthermore, the time step is set to 2 in the input stage of the third and fourth layers to further reduce the complexity of parameter computation and prevent overfitting.
Furthermore, in high-precision models, input data generally require a multistream architecture for presentation. For example, the dual-stream input architecture mentioned in [39] incorporates both joint data and skeletal data as inputs, and decision selection is made after multiple streams of inputs. This approach is adopted by most researchers because it is effective in improving model performance. However, the multistream architecture does not control the computational cost well: the large amount of data input, parameter exchange, and variable calculation in the multistream framework greatly increases the computational volume. Therefore, our action recognition model adopts a multistream architecture in the pretraining stage, with a total of three input branches, and each input branch feature is fused with the mainstream features in a pass-through tandem manner.
This structure not only preserves the skeleton features to a great extent but also makes the model more concise in its vertical structure and easier to converge during training.
In the data preprocessing stage, we mainly used the methods proposed in [29,40] for reference. In motion recognition based on bone joint points, data preprocessing is critical. In our work, preprocessing mainly revolves around joint positions, motion speeds, and bone characteristics. Suppose that a video of the action sequence is collected. According to the action sequence, the spatial coordinate set is $X = \{x \in \mathbb{R}^{C \times T \times V}\}$, where $C$ represents the coordinates, $T$ represents the frames, and $w \in \{x, y, z\}$ indexes the space coordinates.
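The three input types mentioned above (joint positions, motion speeds, bone characteristics) can be illustrated with a small sketch. The definitions below follow common practice in skeleton-based recognition and are assumptions, not the paper's exact formulas: positions are taken relative to a chosen center joint, velocity is the frame-to-frame difference, and a bone is the vector from a joint's parent to the joint. The third tensor axis is assumed to index joints, and the helper names, center index, and parent list are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]   # (x, y, z) of one joint
Frame = List[Point]                  # one point per joint
Sequence = List[Frame]               # one frame per time step


def _sub(a: Point, b: Point) -> Point:
    """Componentwise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def relative_positions(seq: Sequence, center: int) -> Sequence:
    """Joint positions expressed relative to a center joint in the same frame."""
    return [[_sub(joint, frame[center]) for joint in frame] for frame in seq]


def velocities(seq: Sequence) -> Sequence:
    """Frame-to-frame displacement of each joint; the last frame gets zeros."""
    out = []
    for t, frame in enumerate(seq):
        nxt = seq[t + 1] if t + 1 < len(seq) else frame
        out.append([_sub(b, a) for a, b in zip(frame, nxt)])
    return out


def bones(seq: Sequence, parents: List[int]) -> Sequence:
    """Vector from each joint's parent to the joint (root points to itself)."""
    return [[_sub(frame[v], frame[parents[v]]) for v in range(len(frame))]
            for frame in seq]


# Toy example: 2 frames, 3 joints in a chain 0 -> 1 -> 2 (hypothetical skeleton).
demo = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 1.0, 1.0)],
]
demo_parents = [0, 0, 1]
pos = relative_positions(demo, center=0)
vel = velocities(demo)
bon = bones(demo, demo_parents)
```

Each of the three results has the same frames-by-joints layout as the input, so they could be fed to separate input branches of a multistream model.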

Partial Perception Framework.
The long short-term memory neural network (LSTM) was proposed by Hochreiter [41] in 1997. LSTM is a derivative of the recurrent neural network (RNN). Since 2010, RNNs have been successfully applied to speech recognition [42], language modeling [43], and text generation [44]. However, vanishing and exploding gradients make RNNs difficult to apply to the study of long-term dynamics. As an improved form of the RNN, LSTM handles this problem well. LSTM gives the network a lot of freedom, so that the network memory unit can adaptively learn and update information, which greatly improves the performance of some perception networks.
Assume that $X = (x_1, x_2, \ldots, x_n)$ represents an input sentence composed of the word representations of $n$ words. At every position $t$, the RNN produces a hidden state $h_t$ and an output $y_t$; the hidden state $h_t$ is updated by a nonlinear activation function from the previous hidden state $h_{t-1}$ and the input $x_t$. Here $W_y$ and $b_y$ are the parameter matrix and vector learned during the training process, and $\sigma$ represents the elementwise softmax function. The LSTM unit includes an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and a memory unit $c_t$ to update the hidden state $h_t$, where $\odot$ denotes elementwise multiplication, $V$ represents a weight matrix, and $b$ represents a learned vector. To improve the model's performance, morpheme-level training was carried out with two LSTMs: the first reads the sequence from left to right, and the second reads a reversed copy of it. Before passing to the next layer, the outputs of the forward and reverse passes are concatenated. Finally, the prediction value is obtained using the activation function.
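For orientation, the recurrences described above are usually written in the following textbook form. This is a sketch of the standard RNN and LSTM formulation, not necessarily the exact variant used by the authors: the per-gate matrices $W_\ast$, $V_\ast$, the biases $b_\ast$, the candidate state $\tilde{c}_t$, and the symbol $\mathrm{sigm}$ for the logistic function are naming assumptions, chosen so that $\sigma$ keeps its meaning as softmax.

```latex
% Plain RNN: hidden-state update and softmax output.
\begin{aligned}
h_t &= \tanh\left(W_h x_t + V_h h_{t-1} + b_h\right), \\
y_t &= \sigma\left(W_y h_t + b_y\right).
\end{aligned}

% LSTM: gates, memory unit, and hidden state (standard form, assumed).
\begin{aligned}
i_t &= \mathrm{sigm}\left(W_i x_t + V_i h_{t-1} + b_i\right), \\
f_t &= \mathrm{sigm}\left(W_f x_t + V_f h_{t-1} + b_f\right), \\
o_t &= \mathrm{sigm}\left(W_o x_t + V_o h_{t-1} + b_o\right), \\
\tilde{c}_t &= \tanh\left(W_c x_t + V_c h_{t-1} + b_c\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh\left(c_t\right).
\end{aligned}
```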
The partial perception behavior of LSTM inspired our design, because in the human body recognition process the human skeleton is divided into multiple parts, each consisting of interconnected joints. These parts are defined by hand so that the graph convolution can explore the relationship between them and extract the corresponding spatial features of the joint points. To obtain the information of a point in a GCN, it is necessary to start from the neighborhood of that point. According to the adjacency matrix of the neighborhood, the skeleton data is automatically segmented, and then all the feedback information is passed to the next joint point to complete the capture of the feature points of the entire human skeleton. Through this operation, the defects of manually designed features are avoided, and the spatial features over the time series are obtained [45].
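The idea of gathering a joint's information from its neighborhood through the adjacency matrix can be sketched as follows. This is a simplified illustration, assuming symmetric normalization with self-loops as in common GCN formulations; it omits the learned weights and is not the authors' implementation. The function names and the three-joint chain are hypothetical.

```python
from typing import List, Tuple


def normalized_adjacency(num_joints: int,
                         edges: List[Tuple[int, int]]) -> List[List[float]]:
    """Build A + I for an undirected skeleton and scale it by D^-1/2 (A + I) D^-1/2."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]          # self-loops keep a joint's own feature
    for u, v in edges:
        adj[u][v] = 1.0
        adj[v][u] = 1.0
    degree = [sum(row) for row in adj]
    return [[adj[i][j] / (degree[i] ** 0.5 * degree[j] ** 0.5)
             for j in range(num_joints)]
            for i in range(num_joints)]


def aggregate(adj: List[List[float]],
              features: List[List[float]]) -> List[List[float]]:
    """Each joint's new feature is the adjacency-weighted sum over its neighborhood."""
    n, channels = len(adj), len(features[0])
    return [[sum(adj[i][j] * features[j][k] for j in range(n))
             for k in range(channels)]
            for i in range(n)]


# Toy chain of three joints: 0 - 1 - 2, with a one-channel feature per joint.
chain = normalized_adjacency(3, [(0, 1), (1, 2)])
mixed = aggregate(chain, [[1.0], [0.0], [0.0]])
```

In the toy chain, the feature placed on joint 0 reaches joint 1 but not joint 2, which shows that one aggregation step only mixes information within a one-hop neighborhood.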
If an ordinary convolutional neural network is used, all parts are merged into a whole for feature extraction by convolution. Partial perception networks can divide the joints into different parts and capture individual features for each part. Extracting features separately in this way helps to explore the connection between parts, that is, the spatial-temporal relationship between joints. The structure of our proposed spatial-temporal graph convolutional network for yoga action recognition is shown in Figure 5.

Data Collection.
Before establishing the yoga posture database, we referred to yoga courses and training materials to find a reasonable grading system to assess the risk of yoga postures. As mentioned in [46,47], the researcher compared the physical extensibility and commonality of action between different postures, and also considered breathing rate, posture intensity, and meditation. In addition, we interviewed a yoga instructor who showed us all the standard yoga poses and broke down each pose. From his experience, we learned that there are currently 6 main yoga poses: standing, forward bending, sitting, twisting, back bending, and supine. Each movement determines a different level of body stretch. In this paper, the grading mainly revolves around these movements; our experimental scoring takes the view of the depth camera placed directly in front as the main interface. The specific grading is shown in Table 1.
In preparing the yoga dataset, we invited a professional instructor for standard yoga posture data collection. Then we invited participants who had one year of yoga experience and participants who had no previous yoga experience, divided them into two groups, and asked them to complete each group of movements under the guidance of the instructor. During data collection, we recorded not only the body posture but also the duration and number of movements of each yoga posture. The yoga duration refers to the total time from when the breathing is adjusted until the posture is completely relaxed. The number of movements refers to the sum of all the postures done during the training period, excluding corrections of the postures by the instructor.
The Azure Kinect DK was used to collect yoga movement data. After collection, we manually calibrated the data and split it. To enhance the validity of the data, we added confidence parameters in the coding process. The data collection results are shown in Table 2.

Model Training.
In the model training process, we used PyTorch to implement yoga movement recognition and grading. First, we used OpenPose to extract the skeleton information from the yoga video dataset, and in each frame of the video we obtained the spatial coordinate information of each of the 14 joints. Then we use the heat map as the basis for pose estimation and perform secondary feature capture on the human skeleton. Next, each frame of data is arranged in the temporal dimension to correlate the features between the joints over time. Finally, the skeletal joint features are fused using the average prediction score and the weights are estimated in a progressive ranking. We set different learning rates at different epochs. At the beginning of training, the learning rate is set to 0.05 to adapt to the training speed of the data. Then the learning rate is set to 0.01 at epoch = 30 to speed up convergence; after that, the learning rate is gradually reduced at epoch = 50 and epoch = 60 to find the optimal solution. The specific parameters in the model training are shown in Table 3. All the work is done in Ubuntu 16.04, and the whole training and prediction process is done with NVIDIA TITAN X GPU support on an Intel Xeon E5-2620 CPU.
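The step schedule described above can be written as a small function. The values 0.05 and 0.01 and the epochs 30, 50, and 60 come from the text; the size of the reductions at epochs 50 and 60 is not stated, so the factor of 0.1 used here is an assumption, and the function name is hypothetical.

```python
def learning_rate(epoch: int, decay: float = 0.1) -> float:
    """Step learning-rate schedule: 0.05, then 0.01 from epoch 30, then decayed."""
    if epoch < 30:
        return 0.05                 # initial rate from the text
    rate = 0.01                     # rate from epoch 30 onward, from the text
    if epoch >= 50:
        rate *= decay               # assumed reduction factor at epoch 50
    if epoch >= 60:
        rate *= decay               # assumed reduction factor at epoch 60
    return rate


# Print the rate at the epochs where it changes.
for e in (0, 30, 50, 60):
    print(e, learning_rate(e))
```

In a PyTorch training loop, the returned value could be assigned to each parameter group's `lr` entry at the start of every epoch; a different decay factor can be passed if the actual reductions are known.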

Experimental Result.
For the experimental data collection, we recruited 50 experienced yogis and 50 inexperienced yogis, and the data was split according to the procedure described above. The sensitivity, specificity, precision, and accuracy of the skeletal features were computed on the split data, starting from each frame.
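The four metrics named above have standard definitions in terms of confusion-matrix counts, sketched below for a single pose treated as the positive class. The counts in the example are made up for illustration and are not results from the paper; the function names are hypothetical.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, precision, and accuracy from confusion counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),            # true positive rate (recall)
        "specificity": _ratio(tn, tn + fp),            # true negative rate
        "precision": _ratio(tp, tp + fp),              # correct share of positive calls
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


# Illustrative counts only (not taken from the paper's tables).
scores = binary_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```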
The experimental results are shown in Table 4. The standard yoga movements were decomposed on a larger scale, making them traceable in the validation set. Based on the above statistical results, higher sensitivity values represent more experience in yoga training and also indicate closer adherence to standard yoga poses. From the above experimental results, we can see that the recognition accuracy of all yoga poses is close to 1. The accuracies, as a kind of random error, all stay above 0.86, which shows that the model performs well. The gap between experienced yogis and inexperienced yogis is mainly in sensitivity and specificity. Experienced yogis scored higher in both metrics, reflecting their more standardized yoga poses. The yogis are captured by the depth camera while practicing yoga, and real-time skeletal joint tracking is performed on the captured video. Finally, the yoga movements are recognized with the trained model and then matched with the database to generate a grading score. The specific recognition effect is shown in Figure 6.
In addition, we compiled corresponding statistics on the grading, as shown in Table 5. Table 5 demonstrates that the average grading accuracy of experienced yogis over the whole set of yoga poses is higher than that of inexperienced yogis. The yoga posture with the greatest difference was forward bending, followed by back bending. Because these two poses are difficult, inexperienced yogis struggled to achieve the standard poses, so the accuracy of pose grading was lower. The above experimental results support the effectiveness of the grading system in this paper, which can give yogis feedback and remind them to change their postures if the yoga movements are not standard.

Conclusion
In this paper, we observed that, with the popularity of the Internet, people's lifestyles have changed, and many people choose to learn yoga by watching videos online. For yoga beginners who learn online in this way without the direct guidance of an instructor, there is a high chance that the yoga poses will be substandard. Highly difficult yoga poses can be disabling for beginners. To address this potential risk, we propose a yoga posture recognition and grading system based on a spatial-temporal graph convolutional neural network. We first use an LSTM network to label yoga practice videos frame by frame. Then we extract the skeletal joint point features sequentially with graph convolution and obtain the connection between joints by arranging each video frame in the spatial-temporal dimension and correlating the joint points in each frame with those in neighboring frames. Finally, experiments show that our method can accurately identify yoga poses and grade them accordingly, can identify whether the yoga poses are standard, and can give timely feedback to yogis to prevent injuries caused by nonstandard poses. For deep learning algorithms, the larger the dataset, the better the accuracy of the trained model. Since there is no specific dataset for yoga poses at present, the homemade dataset in this paper is small, which is a shortcoming of this work. Making datasets is a tedious and time-consuming task. In our future work, we will gradually enlarge the dataset, and at the same time, we will invest more effort in data preprocessing.
Data Availability
The dataset can be accessed upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.