Recognition and Prediction Method of Music and Dance Movements Based on G-ResNet SPP and Attention Mechanism

Aiming at the problems of difficult action recognition and low prediction accuracy in music and dance movement, this paper proposes a music and dance motion recognition and prediction method based on G-ResNet SPP (Spatial Pyramid Pooling) and an attention mechanism, so as to improve the accuracy of motion feature recognition. Firstly, the action recognition model is described, together with the related theoretical basis and the construction of the sampling function and weight function used in action recognition. Secondly, the recognition method built on the encoder-decoder framework is described. Finally, a group of music and dance sequences is used to verify the system, and the results show that the G-ResNet+SPP+Atten model achieves better recognition performance on different experimental datasets.


Introduction
The complex changes in dance movements, and the consistency of music and dance, are among the challenges facing online video today. Therefore, based on the ST-GCN (Spatial Temporal Graph Convolutional Network) model and the progressive prediction model, the dance movement recognition and prediction algorithm renders complex and changeable dance movements more realistically [1]. First, the original image is compressed to express the action information; then, the details of the action are extracted along the way, with the sequence of images as input; finally, the features of each attribute are extracted through the spatiotemporal convolution layer. The transfer function of the convolutional layer algorithm digs out the relationship between attribute features and dance movements and realizes the action-correction function for dancers. In addition, the optimized objective function of dance movement matching and the optimized objective function combined with ant colony theory are obtained [2]. Finally, the pheromone evaporation factor is adaptively adjusted according to the pheromone concentration in all music-dance motion segments of the choreography, and the pheromone update is performed dynamically according to the optimized scheme of dance motion matching [3]. The main challenges in the field of music and dance today are the postural recognition of the basic forms of dance and the consistency of music and dance. For such problems, this paper believes that the smoothness of action prediction and action transformation needs to be emphasized. The features extracted experimentally in [4] are 20 joint angles, using 30 joint positions to predict 50 positions from basic poses. The SPM (spatial pyramid matching) classifier of the PPM (Prediction by Partial Matching) model achieves the best attitude recognition effect [4].
By analyzing loss rate and accuracy, we estimated the origin of these poses, and the test was done on our own Balletto dataset of 120 dancers with an accuracy of 97.14%. A 10-point scale scored the dancers' expertise predictions of 50 poses, with an accuracy of 68.46% for grouping without postures and 89.80% with postures [5]. Due to the rapid development of the computing video industry, current video analysis methods can no longer meet the needs of the industry, so derived action prediction models have become the main direction of video industry development. They need to predict the possible states of future actions by analyzing current actions and to predict human behavior from incomplete actions. To establish a highly efficient and robust framework for action recognition and prediction, reference [6] investigates the latest techniques for complete action recognition and prediction; existing models, algorithms, techniques, action databases, evaluation protocols, and possible future directions are systematically discussed. In this work of identification and prediction of human behavior, a linear latent low-dimensional space is allowed to represent high-dimensional nonlinear behavior. The algorithm of this paper uses the Balletto dataset to extract some typical actions and designs experiments to evaluate its performance indices, achieving high accuracy and low latency of action recognition. To identify and predict human behavior, this paper proposes a hybrid method for anterior- and posterior-action-aware human action recognition and prediction based on the integration of convolutional neural networks and progressive prediction models [7]. The CNN (convolutional neural network) structure embeds the human and object information in the video image, and the identification and prediction of the front and rear video space are completed.
The action sequence is analyzed through a VMM (variable-order Markov model) to consider the current, past, and best action states, so as to maximize the recognition of dance movements and achieve the highest accuracy. Experimental evaluation shows that the method has high accuracy in both action recognition and prediction [8]. Unlike conventional experiments, we first need to introduce a deep graph autoencoder for the learning task on symbolic scene graphs, rather than relying only on structured Euclidean data. Our encoder divides operations into two branches, one for identifying input types and the other for predicting future action mappings. The network output is the set of detected and predicted action image type labels. We benchmark the new model proposed in this paper against different prior approaches on Balletto datasets, and experiments show that our model achieves higher accuracy and a lower loss rate [9]. A control-point, random-forest-based action recognition and prediction method is proposed for continuous action recognition of human bone sequences. Traditional methods often identify human behavior by constructing category classifiers; unlike such action recognition tasks, this paper identifies dance actions based on the development of behavioral sequences [10]. In this paper, we introduce a new method for action prediction for 3D trajectory recognition, using the 3D pose estimates of MOCCD (multiocular contracting curve density algorithm) to track over time with an encoder-decoder framework, relying on the LDT as a measure of similarity between trajectories. The true values obtained from the 10 angles of the same posture were studied while identifying the current action and predicting the action at the next moment. The trajectory recognition rate is as high as 99%, the prediction accuracy is about 95%, and only a small number of training sequences are needed.
In terms of experience, the method proposed in reference [11] has very high reference value. Specifically, a memory neural network is used to build an action predictor for all actions. These predicted actions can be synchronized with the music at the next moment. According to the prediction and removal rules of each step length, the prediction error can be eliminated, and the last prediction number can be used as the action sequence tag [12]. Pedestrian detection systems are an important part of a driver's safe journey; if these systems can identify and predict pedestrian behavior, and even estimate the time each person takes to cross the road, the safety of road traffic will be significantly improved. The study in [13] not only focuses on pedestrian action recognition but also predicts whether current pedestrian behavior will encounter danger in the future. It proposes a retinal recognition model using two recurrent neural networks, one identifying pedestrian intention and one predicting the time required to cross the road [13]. Deep learning is divided into unsupervised feature learning and supervised feature learning. In this paper, we propose multimodal learning and show how deep learning is trained to complete the action recognition task. We demonstrate how to use cross-modal feature learning: the representation shared between modalities is learned, and the classifier is trained using audio data and then tested using video data. Experimental validation on the Balletto dataset demonstrated that AVLetters published the best visual speech consensus [14]. Skeleton-based action recognition has now become a popular 3D classification problem, integrating Lie-group structure into deep network structures to learn simpler three-dimensional action identification from Lie-group features.
The input Lie-group features are transformed into more desirable ones by designing a rotation mapping layer. To reduce the dimensionality of high-level features, we use the log-mapping layer, regularize the output data, and perform the classification. Through evaluation on standard 3D human action recognition datasets, the final experimental results show that this algorithm is superior to most traditional deep learning methods [15].

ST-GCN Network
Model. ST-GCN differs from the end-to-end motion recognition model in that it applies GCN to the skeleton-based human behavior recognition system, adds the factor of the spatial relationship between joints, connects the natural connections between joints and the cross-frame temporal connections of the same joints, and then constructs multiple space-time graph convolutional layers to integrate information [16]. The ST-GCN-based human motion recognition process is shown in Figure 1.

Construction of Space-Time Map Convolution.
The traditional two-dimensional convolution algorithm [17] performs the convolution operation with a filter and the image pixel matrix. For a filter of size $K \times K$ and an input feature map $f_{in}$ with $c$ channels, the output of a 2D convolution at position $\mathbf{x}$ can be defined as

$$f_{out}(\mathbf{x}) = \sum_{h=1}^{K} \sum_{\omega=1}^{K} f_{in}(p(\mathbf{x}, h, \omega)) \cdot \mathbf{W}(h, \omega),$$

where $p$ is the sampling function, acting on the neighborhood $(h, \omega)$ of position $\mathbf{x}$, and $\mathbf{W}$ is the weight function, providing a weight vector in $c$-dimensional space. By redefining the sampling function $p$ and the weight function $\mathbf{W}$, the above convolution formula can be extended to a graph convolution formula.
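As a concrete illustration of the sampling/weight decomposition above, the following sketch (all names hypothetical, using NumPy) computes a 2D convolution output by explicitly gathering each $K \times K$ neighborhood with a sampling step and taking its inner product with the weight tensor:

```python
import numpy as np

def conv2d_single(x, w):
    """2D convolution: at every valid position the sampling function p
    gathers a K x K neighbourhood of the input, and the weight function
    W supplies one c-dimensional weight vector per filter offset."""
    H, Wid, C = x.shape
    K = w.shape[0]                              # w has shape (K, K, C)
    out = np.zeros((H - K + 1, Wid - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + K, j:j + K, :]      # sampling function p(h, w)
            out[i, j] = np.sum(patch * w)       # inner product with weights
    return out
```

Replacing the `patch` gathering step with a graph neighborhood lookup is exactly the extension to graph convolution described next in the text.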

Sampling and Weight
Functions. For 2D convolution operations, the sampling function $p(h, \omega)$ is defined on a pixel matrix centered at position $\mathbf{x}$, with the filter extent as the region. On the graph, the sampling function is instead defined on the neighbor set $B(v_{ti}) = \{v_{tj} \mid d(v_{tj}, v_{ti}) \le D\}$ of node $v_{ti}$, where $d(v_{tj}, v_{ti})$ represents the minimum length of any path from $v_{tj}$ to $v_{ti}$ [18]. The sampling function $p$ in this article selects the neighbor set $B(v_{ti})$. Therefore, the sampling function can be defined as follows:

$$p(v_{ti}, v_{tj}) = v_{tj}, \quad v_{tj} \in B(v_{ti}). \quad (2)$$

For 2D convolution, the weight function takes a weight value from each position of the filter; on the graph, nodes are instead mapped to subset labels $l_{ti}(v_{tj})$, which simplifies the operation by partitioning the neighborhood of a graph node into subsets. The simplified weight function is

$$\mathbf{W}(v_{ti}, v_{tj}) = \mathbf{W}'(l_{ti}(v_{tj})),$$

where $\mathbf{W}_c$ are the parameters of the multi-category classifier and $\oplus$ represents the channel stitching operation.
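The neighbor set $B(v_{ti})$ can be computed from the skeleton's adjacency matrix with a breadth-first search; the sketch below (a hypothetical helper, not the paper's code) returns all joints within shortest-path distance $D$ of a root joint:

```python
from collections import deque

def neighbor_set(adj, root, D=1):
    """Sampling-function support for graph convolution: B(v_ti) is the
    set of joints whose shortest-path distance d(v_tj, v_ti) <= D."""
    n = len(adj)
    dist = {root: 0}                 # BFS distances from the root joint
    q = deque([root])
    while q:
        u = q.popleft()
        for v in range(n):
            if adj[u][v] and v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return sorted(v for v, d in dist.items() if d <= D)
```

For a chain skeleton 0-1-2-3, `neighbor_set(adj, 0, D=1)` yields `[0, 1]` and `D=2` yields `[0, 1, 2]`, matching the definition of $B(v_{ti})$.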

Construction of Spatial Graph Convolution.
Through the above-given sampling function and weight function, the two-dimensional convolution operation can be extended to obtain the graph convolution formula on the spatial dimension:

$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(p(v_{ti}, v_{tj})) \cdot \mathbf{W}(v_{ti}, v_{tj}),$$

where the normalizing term $Z_{ti}(v_{tj})$ represents the cardinality of the corresponding subset, in order to balance the contributions of different subsets to the output. Substituting the above sampling and weight functions, the final graph convolution formula on the space is as follows:

$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(v_{tj}) \cdot \mathbf{W}'(l_{ti}(v_{tj})).$$
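A minimal NumPy sketch of the normalized spatial graph convolution, assuming the simplest uni-labeling partition (one subset per neighborhood, so the normalizing term $Z_{ti}$ reduces to the neighborhood cardinality):

```python
import numpy as np

def spatial_graph_conv(X, A, W):
    """One spatial graph-convolution step on joint features X (V x C):
    neighbour features are averaged with the 1/|B(v_ti)| normalisation
    and then linearly transformed by the weight matrix W (C x C_out)."""
    A_hat = A + np.eye(A.shape[0])           # include self-connections
    deg = A_hat.sum(axis=1, keepdims=True)   # |B(v_ti)| per joint
    return (A_hat / deg) @ X @ W             # normalise, aggregate, project
```

In the full ST-GCN partitioning strategy there is one weight matrix per subset label and the outputs are summed; the uni-label case above keeps the sketch short.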

Construction of a Space-Time Map Convolution.
When constructing a skeleton spatiotemporal graph, a skeleton frame sequence [19] with a time range of $\Gamma$ frames can be selected. By applying the spatial graph convolution over this sequence range, a time dimension can be added to define the spatiotemporal graph convolution neighborhood:

$$B(v_{ti}) = \{v_{qj} \mid d(v_{tj}, v_{ti}) \le D, \; |q - t| \le \lfloor \Gamma/2 \rfloor\},$$

where $\Gamma$ is the time range of the adjacent graph, that is, the temporal kernel size. To simplify the space-time graph convolution operation, the adjacent region of a bone point defined by the sampling function and the weight function can be mapped with the labeling

$$l_{ST}(v_{qj}) = l_{ti}(v_{tj}) + (q - t + \lfloor \Gamma/2 \rfloor) \times K,$$

where $l_{ti}(v_{tj})$ is the mapping result of the bone point $v_{tj}$ in the single-frame case and $K$ is the number of spatial subsets.
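The factorization into a per-frame spatial graph convolution followed by a temporal aggregation over $\Gamma$ frames can be sketched as follows (a simplified stand-in for ST-GCN's learned temporal convolution, using a plain moving average over the time window):

```python
import numpy as np

def st_graph_conv(X, A, W_s, gamma=3):
    """Spatio-temporal graph convolution sketch on X of shape (T, V, C):
    a spatial graph convolution per frame, then a size-gamma temporal
    window aggregated over the same joint across frames."""
    A_hat = A + np.eye(A.shape[0])
    deg = A_hat.sum(axis=1, keepdims=True)
    # spatial step, frame by frame
    S = np.stack([(A_hat / deg) @ X[t] @ W_s for t in range(X.shape[0])])
    # temporal step: average each joint over a gamma-frame window
    pad = gamma // 2
    Sp = np.pad(S, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    return np.stack([Sp[t:t + gamma].mean(axis=0) for t in range(X.shape[0])])
```

A real ST-GCN layer replaces the moving average with a learned 1-D convolution of kernel size $\Gamma$, but the data flow is the same.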

Dance Movement Prediction Model
The action prediction task aims to predict the action category $y$ corresponding to future video frames by means of the observed partial video. Given a video $V_1^L = [I_1, I_2, \ldots, I_L]$, where $L$ is the total number of frames of the video, and given the first $t$ frames of the observable part, $V_1^t = [I_1, I_2, \ldots, I_t]$, this task requires predicting the action categories that will occur from frame $t+1$ to frame $L$, that is, giving the action category labels of the unobserved part, $y_{t+1}^L = [y_{t+1}, y_{t+2}, \ldots, y_L]$. This article solves these two problems through a concise and efficient framework named TTPP. First, the long input video is split into multiple non-overlapping video clips $[I_1', I_2', \ldots, I_t']$, each containing the same number of consecutive frames. Secondly, the input video clips are encoded into corresponding fragment features through the feature encoding network $g_{enc}$, $f_t = g_{enc}(I_t')$. Subsequently, the temporal attention module aggregates $t$ consecutive video clip features into a historical feature $S_t = g_{ttm}(f_1, f_2, \ldots, f_t)$. Finally, the progressive prediction module progressively predicts future video features and action category scores [20]. The progressive prediction module consists of an initial prediction module $g_{pred}^0$ and a parameter-shared progressive prediction module $g_{pred}$: $g_{pred}^0$ takes the aggregated historical feature $S_t$ as input and predicts the fragment feature and corresponding action category score of the next moment; $g_{pred}$ accumulates previous prediction results together with the historical feature $S_t$ to generate subsequent features and action prediction results. The model uses an encoder-decoder framework adopted from text processing, so let us briefly introduce this codec framework. As shown in Figure 2, the encoder-decoder framework is commonly used in the field of text processing.
In the case of machine translation, for example, the encoder transforms the input sentence into an intermediate semantic vector $C$ [21], which encodes the information of the entire input statement:

$$C = f(x_1, x_2, \ldots, x_m),$$

where $m$ is the length of the input sentence and $f(\cdot)$ is the encoding function. The decoder generates the sentence information of the current moment by taking the semantic encoding vector $C$ and the sentence information of the historical moments that have already been generated as input.

Encoder-Decoder Framework.
As shown in Figure 2, taking the generation of the output $y_t$ at moment $t$ as an example, the decoding process is expressed as

$$y_t = g(y_1, y_2, \ldots, y_{t-1}, C),$$

where $g(\cdot)$ represents the decoder function. The decoder takes the result of each decoding step as the input of the next step, iteratively outputting a decoding sequence of length $n$. The joint distribution of the outputs can be decomposed into an ordered product of conditional probabilities, $p(y) = \prod_{t=1}^{n} p(y_t \mid y_1, y_2, \ldots, y_{t-1}, C)$, where $y = (y_1, y_2, \ldots, y_n)$ is the output decoding sequence. Similarly, the progressive motion prediction model based on temporal attention designed in this chapter uses the temporal attention module TTM as the encoder and the progressive prediction module PPM as the decoder. Next, the design ideas of both modules are highlighted.
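The encoding and iterative decoding steps can be sketched generically; `f` and `g` below are placeholders for the learned encoder and decoder functions:

```python
def encode(xs, f):
    """Encoder sketch: fold the input sequence x_1..x_m into a single
    semantic vector C = f(x_1, ..., x_m)."""
    return f(xs)

def decode(C, g, y0, n):
    """Decoder sketch: each output y_t = g(y_{t-1}, C) is fed back as
    the input of the next decoding step, yielding a length-n sequence."""
    ys, y = [], y0
    for _ in range(n):
        y = g(y, C)       # condition on C and the previous output
        ys.append(y)
    return ys
```

Any concrete encoder (a sum, an RNN, an attention module) and decoder can be plugged in for `f` and `g`; the autoregressive feedback loop is the point being illustrated.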

Progressive Prediction Module.
Inspired by the WaveNet model, this chapter designs the progressive prediction module PPM to mine aggregated historical information and produce more accurate action prediction results. The PPM module consists of an initial prediction submodule and a parameter-sharing progressive prediction submodule, each consisting of two fully connected layers, a ReLU activation function, and layer normalization (LN) [22].
Suppose that a total of $l$ future time points, from $t+1$ to $t+l$, are to be predicted. For the first future point $t+1$, the initial prediction module uses the aggregated historical feature $S_t \in \mathbb{R}^{d_m}$ as input to predict the action feature $f_{t+1} \in \mathbb{R}^{d_m}$ and the action probability score $p_{t+1} \in \mathbb{R}^{C}$ of moment $t+1$. At the other future points $t+i$ ($i > 1$), the feature representation $f_{t+i-1}$ predicted at the previous moment, the action probability score $p_{t+i-1}$, and the aggregated feature $S_t$ are stitched together on the channel dimension and fed into the progressive prediction module. Because of the channel stitching operation, the input dimension of the parameter-sharing progressive prediction module is $2d_m + C$. Both submodules have the same structure and consist of two fully connected layers: the first fully connected layer reduces the feature dimension to $d_m/2$, and the second raises it back to $d_m$ [23], using feature transformation to learn effective feature representations. It is worth noting that the progressive prediction submodule is parameter-sharing, so the entire PPM module satisfies a lightweight network design.
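A NumPy sketch of the PPM rollout under the dimensions stated above (the weights are random placeholders rather than trained parameters, and `classify` stands in for the learned classification head):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / (x.std() + eps)

def make_submodule(d_in, d_m, seed):
    """One PPM sub-module: FC down to d_m // 2, ReLU, FC up to d_m, LN."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((d_in, d_m // 2)) * 0.1
    W2 = rng.standard_normal((d_m // 2, d_m)) * 0.1
    return lambda z: layer_norm(np.maximum(z @ W1, 0.0) @ W2)

def progressive_predict(S_t, steps, d_m, n_cls, classify):
    """Roll out l future steps: step one sees only S_t; every later step
    sees the channel-stitched vector [f_{t+i-1}; p_{t+i-1}; S_t] of
    dimension 2*d_m + C, through a parameter-shared sub-module."""
    init = make_submodule(d_m, d_m, seed=0)
    shared = make_submodule(2 * d_m + n_cls, d_m, seed=1)
    f = init(S_t)
    p = classify(f)
    feats, probs = [f], [p]
    for _ in range(steps - 1):
        f = shared(np.concatenate([f, p, S_t]))   # channel stitching
        p = classify(f)
        feats.append(f)
        probs.append(p)
    return feats, probs
```

Because `shared` is reused at every later step, the rollout adds no parameters as $l$ grows, which is the lightweight property noted in the text.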

Model Training with Loss Function.
The entire TTPP framework can be trained end-to-end under the supervision signals of the PPM prediction module. Specifically, two types of loss functions are used: a feature reconstruction loss $L_r$ and an action classification loss $L_c$ [24]. $L_r$ measures the mean squared error between the predicted features and the corresponding ground-truth features:

$$L_r = \sum_{i=1}^{l} \| \hat{f}_{t+i} - f_{t+i} \|^2,$$

where $\hat{f}_{t+i}$ is the predicted feature of moment $t+i$ and $f_{t+i}$ is the ground-truth feature of moment $t+i$. $L_c$ is the sum of the cross-entropy losses at all predicted moments:

$$L_c = -\sum_{i=1}^{l} y(t+i,:) \log p_{t+i},$$

where $y(t+i,:)$ is the one-hot encoding of the ground-truth label at moment $t+i$. The optimization objective in this chapter consists of the feature reconstruction loss and the classification loss:

$$L = L_r + \lambda L_c,$$

where $\lambda$ is a hyperparameter used to balance the two loss functions.
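The combined objective $L = L_r + \lambda L_c$ can be written directly; the sketch below assumes lists of per-step predicted features, ground-truth features, probability vectors, and integer labels:

```python
import numpy as np

def ttpp_loss(pred_feats, true_feats, pred_probs, true_labels, lam=1.0):
    """Combined TTPP objective: feature-reconstruction loss L_r plus
    lambda-weighted cross-entropy L_c, summed over predicted steps."""
    L_r = sum(np.sum((p - t) ** 2) for p, t in zip(pred_feats, true_feats))
    # one-hot ground truth picks out a single -log p term per step
    L_c = sum(-np.log(np.clip(probs[y], 1e-12, 1.0))
              for probs, y in zip(pred_probs, true_labels))
    return L_r + lam * L_c
```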

Figure 2: Encoder-decoder frame diagram.

Music and Dance Movement Loss Function Optimization Design
The following are the various loss functions used in the three network components of NTS-Net and their optimization processes.

Filter for Losses.
First, in the filtering network, we mark $M$ informative regions $R = \{R_1, R_2, \ldots, R_M\}$, whose information content is expressed as $I = \{I_1, I_2, \ldots, I_M\}$ [25]. At the same time, when optimizing the network, the confidence of these $M$ regions is expressed as $C = \{C_1, C_2, \ldots, C_M\}$. Then, the loss function of the filter network can be defined as

$$L_I(I, C) = \sum_{(s,i):\, I_s > I_i} f(C_s - C_i),$$

where $f$ is a non-increasing function that enforces: if $C_s > C_i$, then $I_s > I_i$. In the experiment, we use the hinge loss function for $f$:

$$f(x) = \max\{1 - x, 0\}.$$

The expectation of this loss function over $(I, C)$ is that $I$ and $C$ are in the same order. The loss function of the filter network is differentiable, and using the chain rule in backpropagation it is possible to calculate its derivative, which follows directly from the definition $I_i = I(R_i)$.
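The pairwise hinge ranking loss can be sketched as follows; the margin value is set to 1 here, matching the hinge form $f(x) = \max\{1 - x, 0\}$:

```python
def filter_ranking_loss(I, C, margin=1.0):
    """Pairwise hinge loss for the filtering network: whenever region s
    is more informative than region i (I_s > I_i), its confidence C_s
    should also be larger; violations of the margin are penalised."""
    loss = 0.0
    for s in range(len(I)):
        for i in range(len(I)):
            if I[s] > I[i]:
                loss += max(0.0, C[i] - C[s] + margin)
    return loss
```

When the confidences are already ordered like the information scores with at least the margin of separation, every hinge term is zero and the loss vanishes.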

Optimization Loss.
We define the loss of the optimizing network, $L_C$, as follows: where $C$ is the confidence function that represents the degree of truth of the specified area. The first term of equation (19) represents the sum of all region losses, and the second term represents the cross-entropy loss of the complete image.

Check for Losses.
When the screening network obtains the $K$ regions $R_1, R_2, \ldots, R_K$ with the most information, the inspection network produces the fine-grained recognition result $P = S(X, R_1, R_2, \ldots, R_K)$. We use the cross-entropy loss as the classification loss, expressed as follows:

Joint Losses.
In the end, we combine the various losses for joint training. The final complete loss function is defined as

$$L_{total} = L_{cls} + \lambda L_I + \mu L_C,$$

where $\lambda$ and $\mu$ are hyperparameters; in this experiment we set both to 1. We use stochastic gradient descent to optimize $L_{total}$.

Evaluation Indicators.
In the experiment, the prediction accuracy of the model for action prediction is calculated in the form of the average Euler error, specifically by comparing each frame of the predicted sequence with the corresponding frame of the real sequence and plotting the results as a line chart. The Euler error is calculated as

$$Error_i = \sqrt{\sum_j \left(Real_{ij} - Predict_{ij}\right)^2},$$

where $Error_i$ is the error of the $i$-th frame, $Real_{ij}$ is the Euler angle of the real data at the $j$-th joint of the $i$-th frame, and $Predict_{ij}$ is the corresponding Euler angle of the predicted data. After summing the squares of the differences, the Euler error of the frame is obtained by taking the square root.
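The per-frame Euler error defined above is straightforward to compute; a minimal sketch:

```python
import numpy as np

def euler_error(real, pred):
    """Per-frame Euler error: root of the summed squared differences
    between real and predicted Euler angles across joints j, returning
    one error value per frame i."""
    real, pred = np.asarray(real), np.asarray(pred)
    return np.sqrt(np.sum((real - pred) ** 2, axis=-1))
```

Averaging the returned vector over all frames gives the average Euler error reported in the experiments.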

Datasets.
In this paper, 250 dance movements in the Balletto database are used for video decomposition, from which key-frame movement pictures of ballet dance are obtained; part of the selected data is then experimented on with the ST-GCN model and the progressive prediction model to verify the effectiveness of the algorithm, as shown in Figure 3.

Experimental Results and Analysis.
In this paper, SPP (Spatial Pyramid Pooling) is added to ResNet34, and the attention mechanism (Atten) is fused with the GRU network; with pyramid pooling and the attention mechanism introduced, the resulting FSAG-ResNet network model and the baseline G-ResNet network model were first experimented on the UCF101 dataset. The comparison of training loss and accuracy on the UCF101 dataset is shown in Figure 4. As can be seen from Figure 4(a), the training loss of the improved network with the SPP layer is smaller than that of the G-ResNet network model. After further integrating the attention mechanism on top of the SPP layer, the network training loss is smaller still, and compared with the G-ResNet network the final training loss is the smallest, indicating that the network is easier to optimize. According to Figure 4(b), after adding the SPP layer, the accuracy of the network on the UCF101 dataset is improved compared with the G-ResNet network; after integrating the attention mechanism, the accuracy improves further, both over the original network and over the SPP-only variant.
Similarly, the improved network is experimented on the HMDB51 dataset, and the training loss and accuracy on the HMDB51 dataset before and after the network improvement are compared as shown in Figure 5.
As can be seen from Figures 5(a) and 5(b), after adding the SPP layer and attention, the training loss of the network on the HMDB51 dataset becomes smaller, and the accuracy on HMDB51 is also improved compared with the original network. The experimental results on the above two datasets show that the improved G-ResNet network model has lower training loss and higher accuracy, which verifies the superiority of the proposed model. The comparison results of the improved G-ResNet network model with the G-ResNet model on the UCF101 and HMDB51 databases are tallied, and the recognition accuracy on the two databases is shown in Figure 6.
Compared with the results of the G-ResNet experiment, after the SPP layer was added, the recognition rate on UCF101 reached 95.6%, an increase of 3.2%, and on HMDB51 the recognition rate reached 62.6%, an increase of 4.2%. After adding the attention mechanism on the basis of SPP, the recognition rate on UCF101 was 96.3%, an increase of 0.8%, and the recognition rate on HMDB51 was 64.6%, an increase of 2.1%. Compared with the G-ResNet network model, the FSAG-ResNet network model presented in this chapter ultimately increases the recognition rate by 3.6% on UCF101 and by 6.0% on HMDB51. These results show that, after introducing pyramid pooling into ResNet34 on the basis of the G-ResNet network and fusing the attention mechanism with the GRU network, the recognition rate of the model on the UCF101 and HMDB51 datasets is improved, which proves the effectiveness of the improvement method. This paper also conducts a user survey on the authenticity of the dance and the consistency of dance and music. The dance videos generated by five different models were scored by 100 observers, and the average of the scores was calculated to obtain the true user evaluation of each model.
As can be seen from Figure 7, model 5 has the highest user evaluation, which reflects the higher authenticity of the video processed by that model.
As seen from Figure 8, our model is better than the other models in terms of musical consistency. Statistically, the Balletto dataset had higher musical consistency than the other categories of dance, showing that Balletto choreography is more consistent with the music. From the scores of the compared models, it can be seen that our model received the best user reviews relative to the other models in terms of both dance authenticity and musical consistency.

Conclusion
This paper reviews the current situation of the computer video industry in the field of music and dance movement recognition and prediction, as well as the application prospects of deep learning research, and then, focusing on the use of deep learning for the identification and prediction of music and dance movements, proposes an ST-GCN model and a progressive prediction model based on the attention mechanism. The main work of this article is as follows: (1) Establish a dance movement recognition model.
First, the spatial graph convolution is constructed from the sampling function and weight function; then the weight function construction is simplified; finally, the action recognition model under the new partitioning strategy is constructed. Experiments show that the accuracy of this model is greatly improved compared with the original model. (2) Progressive action prediction based on temporal attention. This paper uses an attention mechanism to capture historical information, supports parallel computation well, and combines the idea of iterative decoding of the neural network to make progressive action feature predictions. On the UCF101 dataset, the model outperforms the ordinary codec model in performance and efficiency, which further verifies its effectiveness.

Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request.