Aerobics Action Recognition Algorithm Based on Three-Dimensional Convolutional Neural Network and Multilabel Classification

As people pay increasing attention to health and aerobics is more widely promoted, the volume of aerobics video data and its audience have grown rapidly, and its potential application value has attracted widespread attention from both research and industry. This article integrates computer vision and deep learning to realize the intelligent recognition and representation of specific human movements in aerobics video sequences. The study proposes an automatic recognition method for floor exercise videos based on three-dimensional convolutional networks and multilabel classification. Because two-dimensional convolutional neural networks (CNNs) lose time information when extracting features, the proposed research uses three-dimensional convolutional networks to perform video recognition. Features are extracted in both time and space, and the extracted features are subjected to multiple binary classifications to achieve the goal of multilabel classification. Comparison and simulation experiments are conducted, and the experimental results demonstrate the effectiveness and superiority of the approach.


Introduction
With the rapid development of related technologies such as computers [1][2][3], networks [4][5][6], and multimedia, multimedia data have shown an exponential growth trend. Video [7] is a common and important form of multimedia data that is closely related to our daily lives. Video contains the most abundant data information, with a complex structure and a large amount of data. Faced with such a huge amount of video data, automatic video description makes it possible to better manage and utilize these rich video resources and can help users improve the indexing speed and search quality of online videos. For visually impaired people, automatic video description combined with text-to-speech technology converts the text in the computer into continuous spoken natural language [8]. This can help them better understand the content of the video, making their lives more convenient. In the field of automatic video description, automatic video analysis and understanding based on human actions has become a hot research problem in computer vision and pattern recognition in recent years. It has a wide range of application prospects in intelligent life assistance, advanced human-computer interaction, and content-based video retrieval [9] and is closely watched by researchers at home and abroad. Current aerobics video analysis research faces several problems: low-level video features cannot accurately reflect high-level human semantic concepts, action recognition algorithms for traditional RGB video have high time complexity and low recognition accuracy, and a single feature cannot cope with the massive growth and complexity of existing video data. Research on its automatic description therefore has important theoretical significance and extensive practical application value.
In terms of theoretical research, the research on automatic video description of floor exercise is a cross-cutting subject that integrates machine learning, pattern recognition [10][11][12], video analysis, computer vision, and cognitive science and provides a good basis for research in these fields. In-depth research can promote the development of related disciplines.
Automatic identification of aerobics videos [13] is a difficult problem in visual research. In practical applications, it has a wide range of application prospects and potential economic value. In addition to the abovementioned video retrieval and assistance to the visually impaired, potential application areas include sports assisted training, human-computer interaction, and project promotion. First of all, it can support human-computer interaction [14]. In a complex floor exercise routine, it is particularly important to identify the various human movements quickly. When watching aerobics competitions, commentators often interpret the decomposed movements with delays and errors. This paper strives to achieve higher recognition accuracy in the video-based automatic understanding of human movements and even to realize real-time movement recognition and interpretation. For nonprofessionals, automatic recognition can not only improve the experience of watching a competition but also make it more convenient to understand and learn aerobics.
Secondly, it can assist sports training. In aerobics videos, the movement of the human body is very complex and highly skilled. Compared with daily exercise, the analysis of aerobics video is more difficult and challenging. The analysis of aerobics videos can not only enhance the viewing experience of sports competitions but also help coaches analyze competitions and assist athletes in training.
Through research on the automatic understanding of aerobics, this paper improves the accuracy of aerobics movement recognition and analyzes the movement data in order to uncover regularities in the innovation and development of gymnastics technique and to support auxiliary training. For example, taking related athletes as the main research object, the paper analyzes the differences in the difficulty, arrangement, and quality of complete sets of movements between winners and ordinary athletes, studies the development and innovation trends of aerobics, and adjusts training countermeasures so as to improve the skill level of athletes [15].
Finally, project promotion can be carried out. Taking aerobics as a typical research object, knowledge transfer can be used to effectively identify and locate human movements in aerobics videos. The method of aerobics movement recognition can then be applied to other sports so as to extend the research results. The main innovations of this paper are as follows: (i) The existing algorithm model is improved, raising the accuracy of automatic aerobics recognition. (ii) Automatic aerobics video recognition is transformed into a multilabel classification problem. To extract the temporal and spatial feature representation of the video, a three-dimensional CNN [16][17][18] is used as a feature extractor. Then, a two-class classifier is constructed for each single decomposition action, and each video undergoes the two-class calculation for all categories to complete the multilabel classification process. (iii) Comparison and ablation experiments are conducted, and the experimental results demonstrate the effectiveness and superiority of our algorithm.

Convolutional Neural Network.
A typical CNN schematic diagram is shown in Figure 1. It consists of three parts: the input layer, several hidden layers, and one output layer. Each layer is composed of multiple neural units. CNN [19,20] has two key ideas that make its performance on computer vision problems particularly outstanding. The first is that CNN makes use of the two-dimensional structure of images. Since pixels in adjacent areas are usually highly correlated, CNN does not need to establish one-to-one connections between pixel units like traditional neural networks but can directly use grouped local connections. The second is that the CNN architecture relies on feature sharing, where each channel is generated by convolution using the same filter at all locations.
In the specific CNN network structure, the hidden layers usually include convolutional layers, activation functions, pooling layers, and fully connected layers. The function of the convolutional layer is to extract features from its input. It is composed of many convolutional units, whose parameters are optimized through backpropagation. In the process of recognition, the human brain first perceives each feature locally and then combines the local features to obtain global information. Therefore, the feature extraction of the convolutional layer plays a central role in the CNN. A shallow convolutional layer can usually extract only lower-level features, so commonly used CNNs stack multiple convolutional layers to obtain deeper feature maps. The activation function increases the nonlinear separation ability of the network. An activation function generally satisfies the properties of nonlinearity, continuous differentiability, monotonicity, a suitable unsaturated range, and approximate linearity at the origin. Commonly used activation functions include ReLU and Maxout. The pooling layer, also called the downsampling layer, usually follows the convolutional layer. The feature dimensions obtained by the convolutional layer are relatively large, and compressing and sampling the feature maps can not only reduce the computational complexity of the network and improve the recognition of features but also avoid overfitting to a certain extent. Common pooling methods include average pooling and maximum pooling. The fully connected layer connects all the features by weighting, and its output is used in the calculation of the classifier.
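The convolution, activation, and pooling steps described above can be illustrated with a minimal pure-Python sketch. The function names (`conv2d`, `relu`, `max_pool`), the toy image, and the kernel are illustrative assumptions and are not taken from the paper's implementation.

```python
def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation) of a 2D list with a smaller kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map):
    """ReLU activation: negative responses are set to zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]


# Toy 5x5 "image" and a 2x2 kernel (hypothetical values for illustration only).
image = [[1, 2, 3, 4, 5],
         [5, 4, 3, 2, 1],
         [1, 2, 3, 4, 5],
         [5, 4, 3, 2, 1],
         [1, 2, 3, 4, 5]]
kernel = [[1, -1],
          [-1, 1]]

feature_map = conv2d(image, kernel)      # 4x4 feature map
pooled = max_pool(relu(feature_map))     # 2x2 map after ReLU and pooling
```

The same kernel is applied at every location (feature sharing), and pooling halves each spatial dimension, which is the dimensionality reduction the paragraph refers to.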

Recurrent Neural Network.
Recurrent neural network (RNN) [21][22][23][24] is a special neural network structure inspired by human beings' reliance on past experience and memory in the cognitive process. An RNN not only remembers the input of the previous moment but also refers to that memory when processing the input of the next moment; that is, the current output of a sequence is jointly determined by its current input and the previous output. Concretely, the previously memorized output is used when calculating the current output. RNN is different from CNN. In RNN, the input data have a time order and thus form a sequence. This is the most critical point that distinguishes RNN from other neural networks [25][26][27], and it is also the fundamental reason why the "loop" can be established. The nodes between the hidden layers are unconnected in CNN and become connected in RNN, and the input of the hidden layer includes the output of the input layer and the output of the hidden layer at the previous moment. The hidden layer of the simplest RNN structure, expanded in time, is shown in Figure 2. X represents the input sample, O represents the output, U and V represent the weights of the sample input and output, respectively, t represents the time step, and the memory of the input sample at time t is expressed as follows: $S_t = f(U X_t + W S_{t-1})$, where $S_t$ denotes the memory at time t, f is the activation function, and W represents the weight applied to the memory of the previous moment.
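A minimal sketch of the recurrence described above, assuming the common simple-RNN form in which the hidden state is `tanh(U x_t + W h_{t-1})` and the output is `V h_t`. The choice of `tanh`, the helper names, and the scalar toy weights are assumptions made for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, U, W, V, h0):
    """Run a simple RNN over a sequence and return all outputs and the final state."""
    h = h0
    outputs = []
    for x in inputs:
        # The new memory mixes the current input with the previous memory.
        pre_activation = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(p) for p in pre_activation]
        outputs.append(matvec(V, h))
    return outputs, h


# The same input is fed twice; the outputs differ because of the memory term.
outputs, final_state = rnn_forward(
    inputs=[[1.0], [1.0]], U=[[0.5]], W=[[0.5]], V=[[1.0]], h0=[0.0])
```

Because the second step sees the memory of the first, identical inputs produce different outputs, which is the property that distinguishes the recurrent structure from a feedforward one.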
In practical applications, as the network model deepens, gradient explosion and gradient vanishing often trouble researchers when training RNN models. Once the gradient vanishes or explodes, the propagation of the training gradient is greatly degraded, and the original purpose of the RNN design cannot be achieved. That is, the training gradient cannot be propagated through a long sequence, which eventually leads to large deviations in the accuracy of the RNN on long sequences. To address the long-term dependence problem that arises when training RNN models, the long short-term memory (LSTM) network was proposed.
This network model improves the traditional RNN model by introducing a memory unit and gate control for the memory unit. The memory unit can store historical information and the long-term state of the network. The gate control determines the flow of information through linear intervention and can selectively increase or decrease the transmission of information.

Attention Mechanism.
The human brain pays different amounts of attention to different parts of the incoming signal when processing it, which is known as the visual attention mechanism [32,33]. Human vision can quickly scan the global image to find the target area that needs to be focused on, generally known as the focus of attention, and then invest more attention resources in this area to obtain more detailed information about the target while suppressing other useless information. The reason why this paper uses the attention mechanism is intuitive. The decisive video frames for the automatic description of the decomposed movements of floor exercises are those showing the manner, direction, and angle of the athlete's body turns, and the weight of these video frames should be greater. This paper uses an attention mechanism that allows the decoder to weight each time feature vector of the floor exercise video. Figure 3 shows the network structure after the attention mechanism is introduced.
This paper adopts the dynamic weighted sum of the time feature vectors, and the formula is as follows: $c = \sum_{i} \alpha_i x_i$, where $\alpha_i$ is the proportion of the matching score between the hidden layer output and the entire video representation vector at moment i in the overall score, and the calculation formula is as follows: $\alpha_i = \exp(\mathrm{score}(x_i, h_i)) / \sum_{j} \exp(\mathrm{score}(x_j, h_j))$, where $\mathrm{score}(x_i, h_i)$ represents the score of the output $h_i$ of the i-th hidden layer against the video feature vector $x_i$. The larger the score, the greater the attention paid to the input at this moment in the video. It is calculated as follows: $\mathrm{score}(x_i, h_i) = \omega^{\top} \tanh(W x_i + U h_i + b)$, where ω, W, and U are the weights and b is the bias.
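The weighting described above can be sketched as follows. This is a minimal illustration that assumes the widely used additive score, `omega · tanh(W x + U h + b)`, followed by a softmax over time steps; the exact form used by the authors may differ, and all names and toy values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def additive_score(x, h, omega, W, U, b):
    """Assumed additive attention score: omega . tanh(W x + U h + b)."""
    hidden = [math.tanh(dot(w_row, x) + dot(u_row, h) + b_k)
              for w_row, u_row, b_k in zip(W, U, b)]
    return dot(omega, hidden)


def attend(xs, hs, omega, W, U, b):
    """Return softmax attention weights and the weighted sum of the feature vectors."""
    scores = [additive_score(x, h, omega, W, U, b) for x, h in zip(xs, hs)]
    peak = max(scores)                      # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    context = [sum(a * x[k] for a, x in zip(alphas, xs))
               for k in range(len(xs[0]))]
    return alphas, context


# Two time steps with 2-dimensional features; the first one scores higher.
xs = [[1.0, 0.0], [0.0, 1.0]]
hs = [[0.0], [0.0]]
alphas, context = attend(xs, hs, omega=[1.0], W=[[1.0, 0.0]], U=[[0.0]], b=[0.0])
```

The weights always sum to one, so the context vector is a convex combination of the per-step features, with higher-scoring steps contributing more.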

Network Framework.
In this chapter, the automatic recognition problem of aerobics videos is transformed into a video multilabel classification problem, and the classification results are then transformed into actual floor exercise recognition. The basic framework used in this section is shown in Figure 4. The framework in the figure can be divided into two parts: first, a three-dimensional convolutional network extracts the multilabel video features of aerobics; then, SVMs perform multilabel classification on the videos. Finally, the results of multilabel classification are mapped to natural language, which completes the automatic description of the aerobics video.

Feature Extraction.
Compared with 2D convolutional networks, 3D convolutional networks can better model time information through 3D convolution and 3D pooling operations. In a two-dimensional convolutional network, convolution and pooling are performed only in space. In a three-dimensional convolutional network, they are performed in both time and space. As noted in the introduction of 3D convolutional networks above, a 2D convolution applied to an image outputs an image, and a 2D convolution applied to multiple images (which are regarded as different channels) also outputs an image. Therefore, the time information of the input data is lost after each convolution operation in a two-dimensional convolutional network. Only three-dimensional convolution preserves the time information of the input signal and produces an output volume. The same principle applies to 2D pooling and 3D pooling.
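The difference in output shape can be made concrete with a small pure-Python sketch. Treating the frames of a clip as channels of a 2D convolution is equivalent to using a kernel that spans all frames, so the time axis collapses to length one; a 3D kernel with a short temporal extent keeps a time axis. The clip size and kernel sizes below are hypothetical and chosen only for illustration.

```python
def conv3d(clip, kernel):
    """Valid 3D convolution of a T x H x W clip with a kt x kh x kw kernel."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    return [[[sum(clip[t + dt][r + dr][c + dc] * kernel[dt][dr][dc]
                  for dt in range(kt) for dr in range(kh) for dc in range(kw))
              for c in range(W - kw + 1)]
             for r in range(H - kh + 1)]
            for t in range(T - kt + 1)]


def shape(volume):
    return (len(volume), len(volume[0]), len(volume[0][0]))


def ones(depth, height, width):
    return [[[1] * width for _ in range(height)] for _ in range(depth)]


clip = ones(8, 6, 6)                        # 8 frames of 6x6 pixels

# 3D convolution with a 3x3x3 kernel: the output still has a time axis.
out_3d = conv3d(clip, ones(3, 3, 3))

# 2D convolution with frames as channels: the kernel covers all 8 frames,
# so every output value mixes the whole clip and the time axis collapses.
out_2d = conv3d(clip, ones(8, 3, 3))
```

Here `shape(out_3d)` keeps six temporal positions while `shape(out_2d)` has only one, which mirrors the loss of time information described above.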

Multiclassification of Video Based on SVM.
After the video features are obtained, this article establishes a two-class classifier for each decomposition action to determine whether the video contains this type of action. The SVM classifier is used as the two-class classifier in this work. In order to obtain the optimal linear interface of the SVM classifier, the basic idea is to transform the input space into a high-dimensional feature space through a nonlinear transformation, so that the SVM can be regarded as a linear classifier in a broad sense. As shown in Figure 5, the two types of training samples in the figure are represented by " * " and " " respectively, and x 1 and x 2 represent the two feature items of the sample. H and H′ are interfaces, and H 1 and H 2 are the planes that pass through the samples of each class closest to the interface and are parallel to it. In order to ensure that the empirical risk is minimized in the support vector classification model, the optimal dividing line is required not only to separate the two types of data correctly but also to maximize the classification interval between the two types (M in the figure). Therefore, although H′ is also a boundary that classifies correctly, it is not suitable as the boundary. The principle of interface selection is to make the support vector machine show better generalization ability.
Use $(a_1, y_1), \ldots, (a_N, y_N)$ to represent the linearly separable sample set of the two-class problem, where $a_i \in R^d$ and d represents the dimension. The category label is $y_i \in \{+1, -1\}$ with $i \in [1, N]$, w is a d-dimensional vector, and b is a constant. From this, the linear discriminant function can be obtained as follows: $c(a) = w \cdot a + b$. In order to obtain the maximum classification interval M, the interface needs to meet the following requirements: $w \cdot a_i + b > 0$ for $y_i = 1$ and $w \cdot a_i + b < 0$ for $y_i = -1$. Normalize formula (6) so that all samples satisfy $|c(a)| \geq 1$ and the sample with the smallest distance from the interface satisfies $|c(a)| = 1$; thus, $y_i (w \cdot a_i + b) - 1 \geq 0$. It can be deduced that $M = 2/\|w\|$. When $\|w\|$ is the smallest, the classification interval is the largest. The objective therefore becomes a minimization problem: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot a_i + b) - 1 \geq 0$, $i = 1, \ldots, N$. According to the Lagrange method, the corresponding Lagrange function is obtained: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i [y_i (w \cdot a_i + b) - 1]$, where $\alpha_i \geq 0$ are the Lagrange multipliers. The previous description covers the linearly separable case, but the actual problems to be dealt with are often linearly inseparable data sets, that is, nonlinearly separable problems. Therefore, some misclassifications will inevitably occur in the classification process. One solution is to transform the nonlinear problem into a linearly separable one. Specifically, a kernel function is introduced to map the nonlinearly separable problem in the input space to a higher-dimensional feature space.
The Gauss radial basis kernel function is used in this chapter: $K(a_i, a_j) = \exp(-\|a_i - a_j\|^2 / (2\sigma^2))$, where σ is the kernel width. For the N two-class problems of the N decomposition actions, N two-class SVMs are constructed, and each SVM is trained in a one-to-many manner; that is, the videos containing the current decomposition action are taken as positive samples, and all other data are taken as negative samples.
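The kernel and the one-to-many label construction can be sketched as follows. This is a minimal illustration rather than the authors' training code: the `sigma` parameterization of the Gaussian kernel is an assumption, and the action names and annotations are hypothetical. Training the N SVMs themselves is left to an SVM library of choice.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian radial basis kernel exp(-||a - b||^2 / (2 * sigma^2))."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def one_vs_rest_targets(video_labels, actions):
    """Build one binary target vector per action.

    A video is a positive sample (+1) for an action if its label set contains
    that action, and a negative sample (-1) otherwise.
    """
    return {action: [1 if action in labels else -1 for labels in video_labels]
            for action in actions}


# Hypothetical multilabel annotations for three videos.
actions = ["jump", "turn", "kick"]
video_labels = [{"jump", "turn"}, {"turn"}, {"kick"}]
targets = one_vs_rest_targets(video_labels, actions)
```

Each entry of `targets` is the label vector for one of the N binary classifiers, so a video with several actions is a positive sample for several classifiers at once.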

Automatic Recognition Based on Multiple Classifications.
In order to form a contrast experiment, this chapter maps the multilabel classification results into automatic recognition statements of aerobics. Different from the previous operation, this step does not involve the processing of video data but only compares the classification results with the test descriptions marked in Chapter 3.
Through the multiple binary SVM classifiers, each video obtains multiple 0-1 classification results, and the categories identified as present are assembled into a sentence. Because most of the videos contain only a few categories and the logical relationship between the categories is not obvious, semantic information is not considered here for the time being. The classification results are directly concatenated to form the description statement of the aerobics movements, which completes the automatic recognition process.
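The mapping from 0-1 results to a description can be sketched in a few lines. The separator and all action names other than "elbows bend at waist" are hypothetical and serve only as an illustration of direct concatenation.

```python
def describe(predictions, action_names, separator=", "):
    """Join the names of all actions whose binary classifier predicted 1."""
    return separator.join(
        name for name, flag in zip(action_names, predictions) if flag == 1)


# One 0-1 result per binary classifier, in the same order as action_names.
action_names = ["elbows bend at waist", "jump", "turn"]
description = describe([1, 0, 1], action_names)
```

No ordering or grammar model is applied, which matches the decision above to ignore semantic relationships between categories.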

Experimental Setup.
The experiment in this section is carried out on the Ubuntu 16.04 operating system. The code is based on the TensorFlow 1.6.0 framework, and the language used is Python 2.7. The network model was trained on two Nvidia Titan 1080 graphics cards with 11 GB of memory. The input video data are sampled every 5 frames. The input of the C3D feature extraction model is a 16-frame segment, with 8 frames overlapping between two consecutive segments. The FC1 activations of these segments are averaged to obtain a 4096-dimensional video descriptor.
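The sampling and segmentation scheme can be sketched as below. This is a minimal illustration, assuming that "sampled every 5 frames" means keeping one frame out of every five and that an incomplete final segment is dropped; the function names are hypothetical, and the descriptors are toy 2-dimensional vectors standing in for the 4096-dimensional FC1 activations.

```python
def sample_frames(num_frames, step=5):
    """Indices of the frames kept when taking one frame every `step` frames."""
    return list(range(0, num_frames, step))


def clip_windows(frames, length=16, overlap=8):
    """Split sampled frames into fixed-length segments that overlap by `overlap` frames."""
    stride = length - overlap
    return [frames[start:start + length]
            for start in range(0, len(frames) - length + 1, stride)]


def average_descriptors(descriptors):
    """Element-wise mean of per-segment descriptors, giving one video descriptor."""
    count = len(descriptors)
    return [sum(values) / count for values in zip(*descriptors)]


frames = sample_frames(200)          # a hypothetical 200-frame video
clips = clip_windows(frames)         # 16-frame segments with 8-frame overlap
video_descriptor = average_descriptors([[1.0, 3.0], [3.0, 5.0]])
```

With a stride of 8, the second half of each segment is the first half of the next, so every sampled frame except those at the ends contributes to two segments.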

Aerobics Multicategory Data Set.
This paper collects a large number of high-standard events of professional athletes, including the Olympic Games, the World Championships, the National Games, and other major men's and women's events. First of all, these competitions are preprocessed. A complete aerobics video involves many athletes. During the video, there will be playback of wonderful moments, slow motion

Evaluation Index
(1) Accuracy is the most common performance metric in classification models. It is suitable for two-class models and can also be used for multiclass models. The calculation of accuracy is also relatively simple. Assuming that the classification model is g and the test set D contains N samples $(x_i, y_i)$, the accuracy is calculated as follows: $\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(g(x_i) = y_i)$. (2) Bleu is currently the indicator closest to a human score. Bleu adopts the matching principle of N-grams, where an N-gram represents a sentence as sequences of N consecutive words. When N = 1, the result is Bleu 1.
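Both metrics can be illustrated with a short sketch. The accuracy function follows the definition above directly. The N-gram function implements only the clipped (modified) N-gram precision at the core of Bleu; it omits the brevity penalty and the combination across N, so it is a simplification and not the full metric. The sentence with "knee" is a hypothetical example.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of test samples whose predicted label equals the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate sentence against one reference."""
    candidate_counts = Counter(ngrams(candidate.split(), n))
    reference_counts = Counter(ngrams(reference.split(), n))
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / total


reference = "elbows bend at waist"
exact = modified_precision("elbows bend at waist", reference, 1)
unigram = modified_precision("elbows bend at knee", reference, 1)
bigram = modified_precision("elbows bend at knee", reference, 2)
```

A single wrong word lowers the unigram precision slightly and the bigram precision more strongly, which is why higher-order N-grams are stricter about word order and fluency.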

Experimental Results.
The ultimate goal of the multilabel classification in this article is to realize the automatic recognition of videos. This article takes the average value of Bleu 1 to Bleu 4, denoted Bleu, as the evaluation index and compares the recognized description of aerobics with the correct description. The experimental results are shown in Table 1. It can be clearly seen that the method used in this article, which recognizes aerobics videos automatically by converting the task into video multilabel classification, performs well. In addition, Figures 6-8 show the visual results of the experiment.

Ablation Study.
The experimental results are shown in Table 2; the values compared in the table are the averages of Bleu 1 to Bleu 4, denoted Bleu. The table covers three data sets, two of which are self-built. The two self-built data sets differ in how the video descriptions are labeled. Ours A uses the most direct natural language. Because describing the decomposed movements of aerobics places high demands on professional knowledge, Ours B adjusts the description sentences according to professional terminology when labeling. From the experimental results on these three data sets, it can be seen that the model with the attention mechanism introduced in this article performs better both on the MSVD data set and on the self-built data sets. The use of scheduled sampling in the experiment can also improve the experimental results to a certain extent.
Example recognition output: Ours: elbows bend at waist. Label: elbows bend at waist.

Conclusion
In this paper, we combine computer vision and deep learning to realize the intelligent recognition and representation of specific human movements in aerobics video sequences. To this end, this article proposes an automatic recognition method for floor exercise videos based on three-dimensional convolutional networks and multilabel classification. Since two-dimensional CNNs lose time information when extracting features, this paper uses three-dimensional convolutional networks to perform video recognition. Features are extracted in both time and space, and the extracted features are subjected to multiple binary classifications to achieve the goal of multilabel classification. We conduct comparison and simulation experiments, and the experimental results demonstrate the effectiveness and superiority of our algorithm.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.