Research Article Improved GCN Framework for Human Motion Recognition

Human motion recognition models based on spatial-temporal graph convolutional neural networks have developed steadily. We present an improved spatial-temporal graph convolutional neural network that addresses the large parameter count and low accuracy of this type of model. The method mainly draws on the inception structure. First, tensor rotation is added to the graph convolution layer to convert between the graph node dimension and the channel dimension, which enhances the model's ability to capture global information in small-scale graph tasks. Then an inception temporal convolution layer is added to build a multiscale temporal convolution filter that perceives temporal information hierarchically across four temporal scales. This overcomes the shortcomings of temporal graph convolutional networks in modeling the joint relevance of hidden layers and compensates for the information omitted in small-scale graph tasks. It also limits the number of parameters, lowers the computational load, and speeds up computation. In our experiments, we verify the model on the public dataset NTU RGB + D. Our method reduces the number of model parameters by 50% and achieves an accuracy of 90% in the CS evaluation system and 94% in the CV evaluation system. The results show that our method not only has high recognition accuracy and good robustness in human behavior recognition applications but also has a small number of model parameters, which effectively reduces the computational cost.


Introduction
Computer vision technology is a key link in the realization of artificial intelligence, and its emergence has given artificial intelligence great potential in visual perception. Within this field, human action recognition is one of the most challenging technologies. Its implementation supports intelligent applications such as pedestrian following and behavior analysis. The most widely researched human behavior recognition methods are based either on the human skeleton or on images, and the two approaches are very different [1][2][3][4]. Skeleton-based methods take human skeleton recognition data as input, focus on analyzing the depth, spatial, and temporal information of human skeletal joints, and then combine all features to produce a behavior prediction. Compared with image data, human skeletal data are more compact and reduce the computational cost by replacing a large number of pixels with a small set of skeletal joints. The skeleton-based action recognition method also performs better in working environments with multiple targets and complex backgrounds [5][6][7]. Traditional skeleton-based methods rely on skeletal joint trajectories, with the models built on recurrent neural networks over skeletal joint point data [8][9][10][11]. Some researchers prefer deep neural network models, for which the literature already demonstrates substantial advantages as well as shortcomings. The skeletal data distribution is rather fragmented, and the data of individual skeletal joint points are not locally linked.
Therefore, a deep neural network needs to be tailored to the structured skeletal data so that it can coordinate all the skeletal joint point data [12,13].
In the deep recurrent network model, only the connections in the feature point space can be analyzed, and the connections between features at the temporal level cannot be obtained. To solve this problem, the researchers in [14,15] used a long short-term memory network (LSTM) [16] to extract features from skeletal data: the authors first divided the skeletal data into slices, each corresponding to an individual LSTM unit, and then merged all the outputs. Such an architectural design improves the model's spatial perception of the skeletal data. However, the method is limited by its manually predetermined architecture and rules, which reduces the generalization ability and robustness of the network [17]. Considering the spatial and temporal features of skeletal data, the literature [18] introduced a graph convolutional network, which breaks the limitation of 2-dimensional data and can handle any graph structure. It transfers the computation of the graph convolutional network to skeletal node data, which integrates the connections between spatial feature points. Building on the previous study, the literature [19] presented a spatial-temporal graph convolutional neural network, which represents each skeletal data point in a graph structure and then performs feature extraction by graph convolution to obtain the spatial features between skeletal joint points [20][21][22]. In addition, the model adds a temporal convolution unit to integrate the temporal links between skeletal joint points, estimate the trajectory of skeletal joints, and finally predict the class of behavior [23].
Based on preliminary research and experiments, this paper proposes the Inception-ST-GCN (IST-GCN) method, which aims to reduce the complexity of building the neural network architecture while capturing the global information of the graph. A tensor rotation module is added to rotate the graph node dimension into the channel (RGB) dimension, after which the one-dimensional convolution Conv 1 × 1 captures the global information. A new inception-layer multiscale temporal convolution filter is added; it is divided into four branches with different temporal perception domains to capture richer temporal information and greatly decrease the number of model parameters. The IST-GCN method thus achieves a compact and efficient network. To test the effectiveness of the method, we perform experimental validation on the public dataset NTU RGB + D. The results show that the number of parameters is greatly reduced compared with the original ST-GCN model and that the accuracy of action recognition is greatly improved. The remainder of this paper is laid out as follows: Section 2 introduces the construction of the basic network and the principles of its mathematical equations. Section 3 details the principles and implementation procedures of the improved human action recognition network. Section 4 presents the experimental datasets and the analysis of the results. Finally, Section 5 reviews our findings and outlines further research.

Basic Network
Following our preliminary examination, we adopt the spatial-temporal graph convolutional neural network as the base network; its structure is shown in Figure 1. This network is an upgrade of the graph convolutional network that aims to optimize the perceptual domain of the graph convolution and to add the modeling of feature relations at the temporal level. Its main purpose is to encode the skeletal data as a sequence and to predict the behavior from the spatial features and temporal associations between skeletal joint points. For skeletal feature acquisition, we usually use the OpenPose [24] algorithm, which localizes the human body using 25 skeletal points and treats the connections among different skeletal points as human joints. The input is usually a video sample in AV format, and each frame of the sample video corresponds to one set of joint coordinates. The OpenPose algorithm resolves each set of joint coordinates and maps it to the skeletal graph nodes of the human body, using the joints and the edges of the human body as boundaries to build a complete spatial-temporal graph. In other words, the input obtained from OpenPose can be understood as a set of joint coordinates of skeletal points, in the same way that the input of a convolutional neural network is a 2-dimensional vector of pixel intensities. To obtain a wider range of information, graph convolutional layers are stacked, and all outputs are fed into the classifier in parallel. The input in Figure 1 is a fixed skeleton sequence. Let $T$ denote the number of frames in the skeleton sequence, $V$ the number of skeletal joints, and $G = (N, E)$ the constructed skeleton spatial-temporal graph, where $N = \{v_{ti} \mid t = 1, \ldots, T;\ i = 1, \ldots, V\}$ is the set of all nodes, that is, the skeleton joints across all time steps, and $v_{ti}$ denotes joint $i$ in frame $t$. $E$ denotes the set of connections between joints and consists of $E_T$ and $E_S$.
For an arbitrary pair of human skeleton joints $(i, j)$, $E_S = \{(v_{ti}, v_{tj}) \mid i, j = 1, \ldots, V;\ t = 1, \ldots, T\}$ denotes the intra-skeleton connections within frame $t$. The subset of intra-skeletal connections $E_S$ is divided into $K$ disjoint regions according to the center-of-gravity rule and is encoded with the adjacency matrices $A_k \in \{0, 1\}^{V \times V}$. $E_T = \{(v_{ti}, v_{(t+1)i}) \mid i = 1, \ldots, V;\ t = 1, \ldots, T - 1\}$ denotes the connections of each skeletal joint between consecutive frames. Fusing these features yields a graph of the sequence that extends over both the spatial and the temporal dimension. The literature [25] optimized the spatial submodule of the spatial-temporal graph convolutional neural network and proposed a graph convolution equation in which $A_s$ denotes the adjacency matrix of the internal connections of skeletal nodes, $I$ denotes the identity matrix, $K_s$ denotes the size of the convolution kernel in the spatial dimension, and $W_k$ denotes the trainable weights. The temporal convolution module uses a $1 \times K_t$ kernel, where $K_t$ denotes the number of frames covered by the kernel. It is implemented as a 2D convolution over the $(V, T)$ dimensions of the $(C_{in}, V, T)$ input, and the perceptual field of the convolution kernel is not considered in this operation.
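The spatial graph convolution described above can be illustrated numerically. The sketch below assumes the form commonly used for ST-GCN-style models, $f_{out} = \sum_{k=1}^{K_s} W_k f_{in} (\Lambda_k^{-1/2} A_k \Lambda_k^{-1/2})$ with $\sum_k A_k = A_s + I$ and $\Lambda_k$ the degree matrix of $A_k$; this form is an assumption consistent with the symbols defined in the text, not a quotation of the equation in [25]. The five-joint chain skeleton, the two-way partition (self-links and neighbor links, used here instead of the center-of-gravity rule), the random weights, and all function names are hypothetical.

```python
import numpy as np

def normalize(a, eps=1e-3):
    """Symmetric normalization Lambda^{-1/2} A Lambda^{-1/2}; eps avoids division by zero."""
    deg = a.sum(axis=1) + eps
    d_inv_sqrt = np.diag(deg ** -0.5)
    return d_inv_sqrt @ a @ d_inv_sqrt

def spatial_graph_conv(f_in, partitions, weights):
    """f_in: (C_in, T, V); partitions: list of (V, V); weights: list of (C_out, C_in)."""
    out = 0.0
    for a_k, w_k in zip(partitions, weights):
        # Aggregate features over connected joints, then mix channels with W_k.
        agg = np.einsum("ctv,vw->ctw", f_in, normalize(a_k))
        out = out + np.einsum("oc,ctw->otw", w_k, agg)
    return out

# Hypothetical 5-joint chain skeleton: 0-1-2-3-4.
V, T, C_in, C_out = 5, 4, 3, 8
A_s = np.zeros((V, V))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A_s[i, j] = A_s[j, i] = 1.0
partitions = [np.eye(V), A_s]  # K_s = 2 subsets whose sum is A_s + I

rng = np.random.default_rng(0)
f_in = rng.standard_normal((C_in, T, V))
weights = [rng.standard_normal((C_out, C_in)) for _ in partitions]
f_out = spatial_graph_conv(f_in, partitions, weights)
print(f_out.shape)  # (8, 4, 5)
```

Each output joint only receives information from joints that are linked to it in some $A_k$, which is the locality that the later tensor rotation strategy is meant to relax.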

Scientific Programming
The graph structures in graph convolution are predefined. To increase their adaptability, the literature [26] starts from the fixed adjacency matrix and proposes an adaptive graph convolution formula in which $B_k$ denotes the parameters learned in training and $C_k$ denotes the vertex connections determined by the similarity function.
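A minimal sketch of such an adaptive graph convolution follows. It assumes the form usually given for 2S-AGCN, $f_{out} = \sum_k W_k f_{in} (A_k + B_k + C_k)$, with $C_k$ obtained as a row-wise softmax over an embedded dot-product similarity between joints. The embedding matrices `w_theta` and `w_phi`, the tensor sizes, and the function names are illustrative assumptions, and `b_k` is simply set to zero here in place of a learned matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_adjacency(f_in, a_k, b_k, w_theta, w_phi):
    """f_in: (C_in, T, V). Returns A_k + B_k + C_k with shape (V, V)."""
    v = f_in.shape[-1]
    # Embed the features, then flatten channel and time so each joint has one vector.
    theta = np.einsum("ec,ctv->etv", w_theta, f_in).reshape(-1, v)
    phi = np.einsum("ec,ctv->etv", w_phi, f_in).reshape(-1, v)
    c_k = softmax(theta.T @ phi, axis=-1)  # data-dependent joint similarity
    return a_k + b_k + c_k

def adaptive_graph_conv(f_in, adjacencies, weights):
    """Sum over subsets k of W_k f_in (A_k + B_k + C_k)."""
    out = 0.0
    for adj, w_k in zip(adjacencies, weights):
        out = out + np.einsum("oc,ctv,vw->otw", w_k, f_in, adj)
    return out

V, T, C_in, C_out, E = 5, 4, 3, 8, 2
rng = np.random.default_rng(0)
f_in = rng.standard_normal((C_in, T, V))
a_k = np.eye(V)            # fixed, predefined part
b_k = np.zeros((V, V))     # stands in for the matrix learned in training
w_theta = rng.standard_normal((E, C_in))
w_phi = rng.standard_normal((E, C_in))
adj = adaptive_adjacency(f_in, a_k, b_k, w_theta, w_phi)
f_out = adaptive_graph_conv(f_in, [adj], [rng.standard_normal((C_out, C_in))])
print(adj.shape, f_out.shape)  # (5, 5) (8, 4, 5)
```

Because $C_k$ is computed from the input, two joints that are not physically connected can still exchange information when their features are similar.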

Improved Action Recognition Network
The spatial-temporal graph convolution model uses a predefined structural graph as a topological constraint so that the graphs at different time steps share the same topology. Such a structure prevents the graph task from fully capturing the relevant features of the hidden joint layer. The most common way to address this problem is to build a regional neural network that takes a local perceptual domain as the starting point and a small-scale graph task as the experimental region.
This can easily cause global information to be omitted. Mirroring the way convolutional neural networks compute over pixels, each graph node and its adjacent graph nodes become the key nodes for graph convolution computation in the graph convolution task. Considering the problems of density heterogeneity and narrow local structure between neighboring nodes, our improved network employs node features of fixed size for feature learning in the temporal dimension, selectively ignoring the size of cluster features, and is thereby able to capture more features in the temporal dimension. Therefore, we present the inception spatial-temporal graph convolutional network (IST-GCN), which applies the inception structure to some network layers to reduce the model parameters, broaden the network width, and enhance the robustness of the model.

Inception Module.
The inception module is a sparse structure proposed in 2015 that has excellent feature expression capability and local topology capability. When an image is input, its pixels pass through a series of convolution and pooling operations, and convolution kernels of different scales extract features at different scales. All the outputs are processed in parallel to filter out the best image features. The original structure of inception is shown in Figure 2. Its network structure mainly contains convolutional kernels of three scales and a 3 × 3 pooling; through the combination of kernels of sizes 1, 3, and 5, it can fully acquire large-scale sparse features and small-scale nonsparse features. Such structures increase not only the network width but also the adaptability of the network to different scales. Finally, all features are combined by a concatenation operation to obtain the nonlinear properties of the features.
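The parallel-branch idea can be sketched in one dimension. The example below is a simplified illustration rather than the 2D module of Figure 2: it assumes a single-channel signal, fixed averaging kernels of widths 1, 3, and 5 in place of learned weights, and a width-3 max pooling, and it stacks the four branch outputs along a new channel axis. All names are hypothetical.

```python
import numpy as np

def conv_same(signal, kernel):
    """1D convolution whose output has the same length as the input."""
    return np.convolve(signal, kernel, mode="same")

def max_pool_same(signal, size=3):
    """Stride-1 max pooling with edge padding, so the length is preserved."""
    pad = size // 2
    padded = np.pad(signal, pad, mode="edge")
    return np.array([padded[i:i + size].max() for i in range(len(signal))])

def inception_1d(signal):
    branches = [
        conv_same(signal, np.ones(1)),      # kernel of width 1
        conv_same(signal, np.ones(3) / 3),  # kernel of width 3
        conv_same(signal, np.ones(5) / 5),  # kernel of width 5
        max_pool_same(signal, 3),           # pooling of width 3
    ]
    return np.stack(branches)               # concatenate along the channel axis

signal = np.arange(8.0)
features = inception_1d(signal)
print(features.shape)  # (4, 8)
```

Every branch keeps the input length, which is what allows the outputs to be concatenated channel-wise.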

Graph Convolutional Layer Improvement Strategy.
Our proposed IST-GCN model originates from a two-part optimization of the spatial-temporal graph convolutional network. The first part optimizes the graph convolutional network layers; the second part adds the inception layer. In the graph convolutional layer, the original model aims to obtain spatial location information between the human skeletal joint points in order to represent the joint points. It starts from the initial neighboring nodes to build up a local perceptual domain, in which a large number of sample nodes are generated. Although many false samples are generated at this point, adding topological angle restrictions when the sequence is subsequently filtered in Euclidean space can filter out the false samples. When all sample nodes are in Euclidean space, they can be regarded as points from a global point of view, and the sequence of points can be regarded as a one-dimensional vector. In this case, capturing the features of a large number of sample nodes requires a large-scale graph convolution kernel whose size matches the number of nodes. To solve this problem, we propose a tensor rotation strategy. We add a tensor rotation module at the beginning and at the end of the graph convolution layer, and we call the resulting layer Rotate tensor GCN (R-GCN). The detailed network structure is shown in Figure 3.
Through the tensor rotation module, each sample node shares the same set of topological matrices, and all nodes participate in capturing global information. Taking human nodes as an example, each graph contains 25 nodes, and in the fully connected layer we choose a filter of size 25. The tensor rotation module rotates the tensor node by node so that the dimensionality of the nodes and the dimensionality of the channels are exchanged consistently. With tensor rotation, the predefined topological matrix is discarded and the global features are learned adaptively according to the self-cycling unit for joint relevance. Finally, the global information is integrated through the tail Conv 1 × 1 dimensionality reduction. Such a structural design avoids capturing higher-order features layer by layer with higher-order polynomial estimation, thus reducing the number of parameters.
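A minimal sketch of the tensor rotation idea follows, assuming a feature tensor laid out as (channels, frames, joints). The joint axis is swapped into the channel position, a Conv 1 × 1 (written here as a matrix product over that axis) mixes all 25 joints at once, and the tensor is rotated back before a tail Conv 1 × 1 over the channels. The shapes, random weights, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rotate(x):
    """Swap the channel axis and the joint axis: (C, T, V) <-> (V, T, C)."""
    return np.transpose(x, (2, 1, 0))

def conv1x1(x, weight):
    """1x1 convolution over the leading axis: weight is (out, in), x is (in, T, N)."""
    return np.einsum("oi,itn->otn", weight, x)

def r_gcn_layer(x, w_nodes, w_channels):
    x = rotate(x)                  # (V, T, C): joints now sit on the channel axis
    x = conv1x1(x, w_nodes)        # every output joint sees all V input joints
    x = rotate(x)                  # back to (C, T, V)
    return conv1x1(x, w_channels)  # tail Conv 1x1 over the channels

C_in, C_out, T, V = 3, 8, 4, 25
rng = np.random.default_rng(0)
x = rng.standard_normal((C_in, T, V))
w_nodes = rng.standard_normal((V, V))        # filter of size 25 over the joints
w_channels = rng.standard_normal((C_out, C_in))
y = r_gcn_layer(x, w_nodes, w_channels)
print(y.shape)  # (8, 4, 25)
```

In this sketch the joint-mixing weight is a dense 25 × 25 matrix, so no predefined adjacency matrix is needed; with an identity `w_nodes` the layer reduces to a plain channel-wise Conv 1 × 1.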

Inception Layer Design Strategy.
We use the inception structure to broaden the spatial-temporal graph convolutional network because of its sparse-structure advantage. The layout of the sparse structure yields more feature information while avoiding an increase in the number of parameters. We refer to the optimization process of inception from V1 to V4 and adopt its one-dimensional convolutional dimensionality reduction method [27][28][29]. We build the inception temporal convolution network (I-TCN). When a temporal convolution layer is widened, exponentially growing expansion coefficients exacerbate the growth in parameters. In contrast, the inception tiling structure grows layer by layer: a Conv 1 × 1 dimensionality reduction is placed before each branch, and each branch is assigned a different expansion setting, so that the time-scale information is graded across the inception branches and information from different temporal scales is integrated. Through this assignment of temporal coefficients, the exponential growth of the coefficients is avoided and the number of parameters is reduced.
Two two-layer I-TCN layers are added at the end of each IST-GCN cell, and the TCN is divided into 4 branches according to the hierarchical principle, with each branch producing output to the corresponding group; the structure is shown in Figure 4. The initial value of the expansion coefficient n of the network is 1. As the network deepens, the layer units increase step by step, and the maximum value of the expansion coefficient is 4. The external connection follows the residual structure and passes through a one-dimensional convolution with a stride of 2 in the middle; this design avoids the vanishing gradient problem. Improving the temporal convolutional network by inserting the inception structure captures more time-scale information while substantially reducing the number of network parameters and the computational cost. A compact and efficient temporal feature extraction network is obtained by using different temporal filters to adaptively select the best feature information for the classification problem.
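The effect of the branch layout on the parameter count can be checked with rough arithmetic. The sketch below assumes a baseline temporal convolution with a 9 × 1 kernel mapping C channels to C channels, and an inception-style layer with four branches, each consisting of a Conv 1 × 1 that reduces C to C/4 followed by a 3 × 1 convolution with expansion coefficient n = 1, ..., 4. The kernel sizes and channel widths are assumptions chosen for illustration, biases are ignored, and the resulting ratio is not meant to reproduce the figures reported in the experiments.

```python
def conv_params(c_in, c_out, kernel):
    """Weight count of a (kernel x 1) convolution, ignoring biases."""
    return c_in * c_out * kernel

def receptive_field(kernel, dilation):
    """Number of frames spanned by one dilated temporal kernel."""
    return 1 + (kernel - 1) * dilation

def baseline_tcn_params(channels, kernel=9):
    return conv_params(channels, channels, kernel)

def inception_tcn_params(channels, branches=4, kernel=3):
    width = channels // branches
    # Each branch: Conv 1x1 reduction, then a dilated temporal convolution.
    per_branch = conv_params(channels, width, 1) + conv_params(width, width, kernel)
    return branches * per_branch

channels = 64
base = baseline_tcn_params(channels)
incep = inception_tcn_params(channels)
fields = [receptive_field(3, n) for n in (1, 2, 3, 4)]
print(base, incep, fields)  # 36864 7168 [3, 5, 7, 9]
```

Dilation widens the temporal window of a branch without adding weights, which is why the four branches together cover windows of 3 to 9 frames at a fraction of the baseline cost under these assumptions.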

The process of human action recognition based on the IST-GCN model is shown in Figure 5. First, the sample video data are input and processed frame by frame during analysis. The positions of the human joints change from frame to frame, but the set of all joint points across frames obeys a random distribution. Therefore, we first place a batch normalization module (BN) in the first layer of the network to normalize the joint point data at the temporal level and the spatial level, which makes the input skeletal data more standardized, reduces the error volatility, and improves the convergence of the algorithm. In the second layer of the network, we choose the attention mechanism (ATT), which connects our new R-GCN layer and the I-TCN layer that follows. The R-GCN layer relies on the tensor rotation operation to obtain global information, after which the obtained global features are input into the I-TCN to analyze the relationships among the node features at the temporal level, supplemented by the ATT mechanism, which weakens features that fall outside the bounded range of the model and filters features of different time scales. The whole network consists of nine IST-GCN units connected sequentially to fully capture and fuse the graph feature information; it then performs average pooling, classifies the features through the fully connected layer, and finally outputs the behavior prediction according to the classification weights.

Datasets.
To validate the performance of our method, we chose the public dataset NTU RGB + D [30] for experimental validation.
This dataset is one of the more comprehensive datasets in terms of category coverage in human action recognition studies. It contains three types of data, namely the two-person interaction dataset, the medical interaction dataset, and the daily interaction dataset. It can be subdivided into 60 action categories based on action type, with a total of 56880 sample sequences. All videos are stored in a uniform dataset standard, and no sample video exceeds 300 frames. All sample data are preprocessed by OpenPose human skeleton detection, and the corresponding skeleton data and JSON files are stored separately. In addition, a set of independent evaluation criteria, namely Cross-Subject (CS) and Cross-View (CV), is proposed for this dataset. The CS evaluation system splits the data by the ID number of the person in the dataset, and the CV evaluation system splits the data by the camera ID number. The detailed sizes of the training and testing datasets are shown in Table 1.
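The two evaluation criteria amount to splitting the samples on a different key. The sketch below assumes that sample names follow the pattern `SsssCcccPpppRrrrAaaa` (setup, camera, performer, replication, action), which is how NTU RGB + D files are usually named; the three sample names and the ID sets passed to `split` are placeholders for illustration and are not the official training lists of the protocol.

```python
import re

NAME_PATTERN = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

def parse_name(name):
    """Return setup, camera, performer, replication and action IDs as integers."""
    match = NAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"unrecognised sample name: {name}")
    setup, camera, performer, replication, action = map(int, match.groups())
    return {"setup": setup, "camera": camera, "performer": performer,
            "replication": replication, "action": action}

def split(names, key, train_ids):
    """Split by performer ID (Cross-Subject) or camera ID (Cross-View)."""
    train, test = [], []
    for name in names:
        (train if parse_name(name)[key] in train_ids else test).append(name)
    return train, test

# Placeholder sample names and ID sets, for illustration only.
names = ["S001C001P001R001A001", "S001C002P002R001A001", "S001C003P001R002A010"]
cs_train, cs_test = split(names, "performer", {1})
cv_train, cv_test = split(names, "camera", {2, 3})
print(cs_test, cv_test)
```

Under CS no person appears in both the training and the test set, while under CV no camera view does, so the two criteria probe generalization across subjects and across viewpoints respectively.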

Experimental Details.
In the action recognition experiments, we mainly use the action jogging as the control standard to verify whether the action recognition results match the real action. Each test sample is 300 frames long. The experiments are divided into single-person and multiperson action recognition experiments to test the performance of the improved method hierarchically and to compare it with the spatial-temporal graph convolutional neural network model.

Single Action Recognition Experiment.
The single-person recognition result is shown in Figure 6. The action recognition result matches the preset experimental result, and the recognition is accurate.
Compared with the spatial-temporal graph convolutional neural network model, the single-person action recognition performance is not much different; the comparison experiment is shown in Figure 7. Although a few frames are recognized as triple jump and misrecognition occasionally occurs, the final score voting result still matches the real action, so these frames have little impact on the overall action recognition result.

Multiplayer Action Recognition Experiment.
The multiperson recognition result is shown in Figure 8. The action recognition performance is good: a few frames are misrecognized, but this does not affect the overall action recognition, and the recognition result is accurate.
Compared with the original spatial-temporal graph convolutional neural network model, the recognition performance of our method is superior; the comparison is shown in Figure 9.
As shown in Experiment A of Figure 9, the original ST-GCN method identifies the action as triple jump in two-thirds of the frames. Although some frames are identified as the real action, jogging, the overall score of triple jump is higher, so the final action recognition result is triple jump.
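The way frame-level scores decide the final label can be sketched as a simple score vote. The example assumes that each frame yields a score per class and that the class with the highest summed score wins; the aggregation rule and the per-frame scores are hypothetical values chosen to mimic a 300-frame sample in which two-thirds of the frames favor triple jump.

```python
from collections import defaultdict

def vote(frame_scores):
    """frame_scores: list of {label: score} dicts, one per frame."""
    totals = defaultdict(float)
    for scores in frame_scores:
        for label, score in scores.items():
            totals[label] += score
    # The class with the highest accumulated score is the final prediction.
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

# Hypothetical scores: 200 frames favor triple jump, 100 frames favor jogging.
frames = ([{"triple jump": 0.7, "jogging": 0.3}] * 200
          + [{"triple jump": 0.3, "jogging": 0.7}] * 100)
label, totals = vote(frames)
print(label)  # triple jump
```

A minority of correctly classified frames cannot overturn the result under this rule, which is consistent with the misclassification described for Experiment A.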
Our method uses time windows of different scales to capture information and has better control of global information, so it performs well in the multiperson action recognition experiment and its recognition results are accurate. In Experiment B of Figure 9, one person was occluded in the result of the original ST-GCN algorithm; although the skeletal information was recognized, the action could not be classified, and the overall action was then recognized as roller skating, which does not match the real action. Multiperson action recognition does not perform as well as single-person recognition: the more people there are, the lower the accuracy of human skeleton recognition and the efficiency of action classification. We therefore limit the multiperson action recognition experiments to fewer than three people. Our method can recognize and correctly categorize the occluded part of the action, which further highlights the advantages of our proposed IST-GCN method.

Experimental Results Analysis.
Our proposed IST-GCN method improves two main parts, namely, the rotated tensor module in the graph convolution layer (R-GCN) and the inception structure embedded in the temporal convolution layer (I-TCN). To verify the effect of each, ablation experiments were performed. First, the GCN in ST-GCN was replaced with R-GCN; this experimental group, named with the letter R, tests the efficiency of R-GCN. Second, the TCN in ST-GCN was replaced with I-TCN; this experimental group, named with the letter I, verifies the performance of the I-TCN module. These two groups were validated together with the spatial-temporal graph convolutional neural network and our proposed IST-GCN on the NTU RGB + D dataset. The results were compared in terms of accuracy (Acc), bone recognition accuracy (Bone), joint recognition accuracy (Joint), and number of parameters (Param), as shown in Table 2.
As shown in Table 2, the R-GCN technique improves overall accuracy by 3.7 percent, and the number of parameters is lowered proportionally.
The I-TCN approach improves overall accuracy by 7.5 percent and reduces the number of parameters by half. The results reveal that I-TCN has a greater impact on overall performance than R-GCN.
To verify the effectiveness of our IST-GCN, we compare four different kinds of skeleton-based action recognition models: dynamic skeleton [31], ST-GCN [18], P-LSTM [30], and TCN [32]. The dynamic skeleton represents action recognition models based on hand-crafted labels, P-LSTM represents the recurrent neural network class, TCN represents the convolutional neural network class, and ST-GCN represents the models based on graph convolutional neural networks. These four methods and our method are validated on the NTU RGB + D dataset. The experimental data are shown in Table 3. The comparison results in Table 3 indicate that, in the validation experiments on NTU RGB + D, the GCN-based action recognition methods greatly outperform the other types of action recognition methods, which shows that graph convolutional networks have great advantages. Compared with the spatial-temporal graph convolutional neural network model, our method improves the accuracy by 9% in the CS metric, reaching 90%, and by 6% in the CV metric, reaching 94%.
To verify the effectiveness of our method among similar optimization methods for graph convolutional neural networks, we compared it, in terms of both number of parameters (Params) and accuracy (Acc), with four of the better-performing current variants of graph convolutional neural networks, namely AS-GCN [33], 2S-AGCN [26], NAS-GCN [34], and Shift-GCN [35]. The validation was carried out on the NTU RGB + D dataset with the CS evaluation metric, and the comparison results are shown in Table 4.
The comparison with AS-GCN, 2S-AGCN, and NAS-GCN under the CS evaluation index indicates that our method, with an accuracy of 91%, is better in both the number of model parameters and accuracy. The Shift-GCN method introduces a more complex hyperbolic space structure, which further optimizes its classification accuracy. Although the accuracy of our method is not as good as that of Shift-GCN, our method adopts the inception structure to form a more compact model, and its number of model parameters is only one-fifth of that of the Shift-GCN method, which greatly decreases the computational cost. Furthermore, this model has fewer parameters than the previous ones. All of this demonstrates the efficacy of our strategy.

Conclusion
In this paper, we present a deep learning method for human action recognition based on the IST-GCN framework, which improves the recognition accuracy of the model while reducing the model parameters. First, we add a tensor rotation module in the graph convolution layer to better capture the global features of the graph task. Then we add the inception structure in the temporal convolution layer to build a multiscale temporal convolution filter that obtains temporal information in different temporal perceptual domains and reduces the computational load. Finally, we perform experimental validation on the public dataset NTU RGB + D. The accuracy of CS evaluation reaches 90% and the accuracy of CV evaluation reaches 94%. The results reveal that our optimized method is robust and accurate: it not only improves the efficiency of the graph topology learning process but also greatly decreases the number of parameters. Compared with the spatial-temporal graph convolutional neural network model and similar graph convolutional optimization algorithms, the advantages of our method are clear.
As can be seen from the experimental results in Table 4, there is still a gap between the accuracy of our method and that of Shift-GCN. Although we have a clear advantage in the number of parameters, accuracy is always the first assessment index for human action recognition. In future work, we will consider using a hyperbolic spatial structure to improve the accuracy while keeping the number of parameters small, in order to achieve a human action recognition model with high accuracy, few parameters, high robustness, and good stability.
Data Availability
The dataset can be accessed upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.