Automatic Description Method for Sports Videos Based on Economic Management Effectiveness Visualization Techniques

The automatic description (AD) of sports videos is a fundamental task for archiving broadcasters' content and for understanding video scenes, and economic management effectiveness visualization techniques are key to the classification of sports videos. In this paper, freestyle gymnastics video is used as an example to study automatic video description: the set of movements of an athlete in a freestyle gymnastics video is observed in order to generate the terminology of the movements performed by that athlete. The technique used in this paper to visualize the effectiveness of economic management is the long short-term memory (LSTM) network model, which is used to learn the mapping relationship between word sequences and video frame sequences. Attention mechanisms (AM) are also introduced to highlight the importance of the keyframes that determine freestyle gymnastics movements. The study is carried out by building a dataset of free gymnastics (FG) breakdown movements from professional events and applying a scheduled sampling method. Experimental results show that the method can improve the accuracy of automatic free gymnastics video (FGV) description. The proposed method has a wide range of applications in sports analysis and instruction.


Introduction
In the 21st century, along with the rapid development of Internet technology, video, a common form of multimedia data, has gradually become one of the important components of multimedia data [1]. In people's daily lives, huge amounts of video data are generated, and automatic video description allows for the effective management of these video resources [2]. With the in-depth research on automatic video description, automatic video analysis based on human movement has made significant progress in areas such as intelligent life assistance, advanced human-computer interaction, and content-based video retrieval, and is gradually receiving close attention.
A difficult problem remains in sports video analysis research: it is difficult for low-level video features to accurately reflect the needs of the human body, and single features struggle to keep pace with the rapid growth of available video data. As FG is a popular sport, research on its AD has made considerable achievements. The AD of FGV not only integrates theoretical knowledge of machine learning [3] but also involves several disciplines such as pattern recognition and video analysis, and it requires deeper research.
There are many issues that have not been adequately addressed in current research on the AD of freestyle gymnastics videos. In terms of practical applications, the study of the AD of FG videos has enormous application value. Among other things, FG movements require quick identification of the various types of movements performed by the athletes. This paper aims to achieve high recognition accuracy in automatic video-based human movement understanding, and even real-time movement recognition and commentary. For nonexperts, if ADs can be achieved, they will not only enhance the viewing experience but also facilitate understanding and learning of the sport of FG.
Kojima et al. [4] took an alternative perspective by studying human activity through the theory of behavioral concepts and thus described human behavior. Guadarrama et al. [5] combined the semantic hierarchy theory with semantic relations between multiple fragments. Rohrbach et al. [6] described features through mathematical modelling by studying human activity under conditional random fields. Xu et al. [7] proposed a combination of a deep video model and a joint embedding model as a framework for studying relationships between videos and words. The above research methods were limited by certain syntactic structures [8], making the research results deviate from everyday descriptions.
With the continuous development of deep neural networks (NN) [9] and the emergence of many large-scale datasets in image recognition [10], many approaches to semantic representation have changed dramatically. Hochreiter et al. [11] proposed the LSTM, which can effectively alleviate the vanishing gradient problem of recurrent neural networks (RNNs). Gers et al. [12] proposed a forget gate mechanism, and Graves et al. [13] improved the LSTM and proposed a bidirectional LSTM (BLSTM) NN, which has been widely used.
Venugopalan et al. [11] used a convolutional neural network (CNN) to extract features from all frames in a video and fed them into an LSTM to decode and generate text. Venugopalan [14] proposed S2VT, which uses an LSTM in both the encoding and decoding stages. Shetty et al. [15] trained a variety of models on different kinds of features, using an evaluation network to assess and generate a description of the video by generating correlations between sentences and video features. Jin et al. [16] used multiple features and fused features to represent the video.
With its greater freedom, varied movements, and the ability to perform a complete set of moves in a set time, FG is one of the most aesthetically pleasing sports in competitive gymnastics and is its quintessential representative. The study of automatic video descriptions of FG is of great relevance. For nonexperts, ADs would not only improve the viewing experience but also make it easier for them to understand and learn the sport of FG. In this paper, automatic video description is performed by extracting the athletes' set movements from FGVs. The LSTM network visualization techniques are used to learn the mapping relationship between word sequences and video frame sequences. An AM is also introduced to highlight the importance of the keyframes that determine the FG movements. Experiments are conducted on the MSVD dataset and self-built datasets, using scheduled sampling to eliminate the differences between the training decoder and the prediction decoder.

Background and Issues.
In recent years, with the rapid growth of sports video data volume and audience groups, the study of AD of sports video content has gradually become a hot topic. Apart from football and badminton, which are typical representatives of ball sports, other areas of sports video research have received less attention. This paper takes FGV as the object of study because it plays a fundamental role in other sports. FG has the greatest degree of freedom and difficulty in competitive gymnastics and is highly representative. By FGV comprehension, we mean understanding a given video of FG. The terminology for the set performed by the athlete is generated by observing the set in the video, such as the method, direction, and angle of the body flip. The traditional method relies on manual commentary, and many important competitions require real-time commentary by the commentator, which demands a high level of expertise. Nonspecialists understand the competition primarily through the point of view of the commentator, and any errors in the commentary will reduce their viewing experience. It is therefore essential to use pattern recognition techniques combined with natural language processing to achieve ADs of FGVs.

Algorithmic Framework Structure.
The framework of this paper is shown in Figure 1 and uses economic management effectiveness visualization techniques to analyze freestyle gymnastics video features. That is, an LSTM network [17] is used to express the mapping between the features studied and the words, to enable the description in language. With the development of deep neural nets, many larger datasets have emerged, such as Sports-1M. In this paper, the data used contain videos of FG from the Olympic Games and the National Games, and their decomposed movements are studied as a dataset.
In the FGVs, the decomposed movements mainly include the direction of the flip, the number of rotations, and the body posture of the athletes. The video frames containing the key movements of FG are defined as keyframes, and the keyframes with high discriminative power are extracted to improve the accuracy of the video description. The discriminative power of the video frames is calculated through an AM. In this paper, the AM [18] is integrated into the existing video description network to maximize the accuracy of video description by calculating the weights between different video frames. The basic framework shown in Figure 1 begins with the construction of a freestyle gymnastics decomposition movement dataset. The AD of sports videos starts with a CNN for feature extraction. For text data that have been annotated, the corresponding dictionary needs to be constructed to extract the corresponding features. Randomly selected text data are used to train the model until its performance stabilizes. The remaining text data are used as the test set, resulting in automated descriptions of freestyle videos.

Feature Extraction.
In the AD of the freestyle gymnastics videos studied in this paper, the data contain not only video data but also text data. We achieve a more accurate description of the videos by using CNNs to extract video features and natural language text processing to extract text features. The NN model is robust, its training cost is small, and its classification accuracy is high. The CNN model has simple operating requirements and does not demand much hardware, and the specific morphology of the features does not need to be considered during feature extraction [19]. In this section, three different types of 2D CNN, Visual Geometry Group (VGG) [20], ResNet [21], and DenseNet [22], are each used to perform feature learning on the videos, and the VGG network structures are shown in Figure 2.
Using the VGG network structure to extract feature representations of freestyle gymnastics videos gives a significantly lower error rate. ResNet can pass the original signal directly into the deeper layers of the neural network, speeding up the training of the network, while DenseNet builds on and improves ResNet. The feature maps generated by each DenseNet layer are also used as input to all subsequent layers, ensuring that the information is passed on and thus avoiding vanishing gradients.
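The difference between the two connectivity patterns mentioned here can be illustrated with a toy sketch. It is not the networks used in the experiments: the function names `residual_block` and `dense_block`, and the stand-in `layer` callables that operate on plain lists of numbers, are hypothetical and chosen only to show how the shortcut and the concatenation differ.

```python
def residual_block(x, layer):
    """ResNet-style connectivity: the block input is added to the layer output."""
    return [a + b for a, b in zip(x, layer(x))]


def dense_block(x, layers):
    """DenseNet-style connectivity: each layer sees all earlier feature maps."""
    features = list(x)
    for layer in layers:
        # The new features are concatenated to everything produced so far,
        # so later layers receive the outputs of all earlier layers.
        features = features + layer(features)
    return features


# Toy "layers": plain functions standing in for convolutional layers.
zero_layer = lambda v: [0.0 for _ in v]
sum_layer = lambda v: [sum(v)]

shortcut_only = residual_block([1.0, 2.0], zero_layer)
dense_features = dense_block([1.0], [sum_layer, sum_layer])
```

In the residual case the input survives unchanged even when the layer contributes nothing, which is the direct path for the original signal described above. In the dense case the feature list grows with every layer.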
In this paper, the descriptors of the FGVs are transformed into features by using one-hot vector coding. The words in the FG annotated text are first counted to construct a dictionary. The number of words used in the descriptions of the FG decomposition movements is not large, so no filtering of words is performed in the preprocessing.
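The dictionary construction and one-hot coding described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `build_vocabulary` and `one_hot` and the two sample descriptions are hypothetical, and the alphabetical index order is one arbitrary choice among many.

```python
from collections import Counter


def build_vocabulary(descriptions):
    """Count the words in all descriptions and map each word to an index."""
    counts = Counter(word for text in descriptions for word in text.split())
    # No frequency filtering is applied, since the vocabulary is small.
    # Sorting only makes the index assignment deterministic.
    return {word: index for index, word in enumerate(sorted(counts))}


def one_hot(word, vocabulary):
    """Return a list with a single 1 at the index of the given word."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1
    return vector


# Hypothetical annotation strings, for illustration only.
descriptions = ["forward tucked salto", "backward stretched salto twist"]
vocabulary = build_vocabulary(descriptions)
encoded = [one_hot(word, vocabulary) for word in descriptions[0].split()]
```

Each word becomes a vector whose length equals the vocabulary size, which stays manageable for a vocabulary of a few dozen terms.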

AD System for FG.
In this paper, we use an LSTM to learn video features. Standard recurrent NNs are prone to vanishing gradients during backpropagation, making it difficult to continuously optimize the network parameters [23]. The LSTM is a special type of recurrent NN that can effectively alleviate this problem, especially in tasks with long-distance dependencies, where it outperforms the RNN. It contains three kinds of gates: input gates, forget gates, and output gates. Each gate can be regarded as a fully connected layer, like those in a CNN, and the LSTM stores and updates information through these gates. A gate quantifies the amount of information passing through each part of the cell by using a sigmoid function to obtain a value between 0 and 1. When the sigmoid output is 0, no information is allowed to pass at that moment, and when the sigmoid output is 1, all information is allowed to pass at that moment. The gates for forgetting are called "forget gates" and the gates for outputting are known as "output gates."
The LSTM encodes a fixed-dimensional sequence of FG decomposition movement features into a feature sequence, which is then decoded by the LSTM NN to generate text. First, the fixed-dimensional FG decomposition movement feature vectors $X = (x_1, \ldots, x_n)$ are encoded into a feature sequence, giving the outputs $H = (h_1, \ldots, h_n)$ of the corresponding hidden layer. The output of the LSTM depends on the previous input sequence, so the feature vectors are fed into the LSTM one at a time in sequence, and the output is an encoding of the sequence vectors. After the feature vector of the last frame is input, the output of the LSTM is the encoding of the whole sequence of frames. In the decoding phase, the LSTM is fed the start character, which prompts it to begin decoding the hidden state into a sequence of words. The output is a sequence of words $Y = (y_1, \ldots, y_m)$ with probability
$$p(y_1, \ldots, y_m \mid x_1, \ldots, x_n) = \prod_{t=1}^{m} p(y_t \mid h_{n+t-1}, y_{t-1}). \tag{1}$$
When training in the decoding phase, the log-likelihood of the predicted sentence is computed given the hidden state of the frame sequence and the previously output words. The model is trained so that the following expression reaches its maximum value:
$$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h_{n+t-1}, y_{t-1}; \theta), \tag{2}$$
where $\theta$ denotes the model parameters, $\theta^{*}$ is the parameter value at which the maximum log-likelihood is reached, and $\arg\max$ returns the argument that attains the maximum. The log-likelihood is optimized over the entire training dataset using a stochastic gradient descent algorithm, which allows the LSTM to learn more appropriate hidden states. Given the output $z_t$ of the second-layer LSTM, the most probable target word $y$ in the vocabulary $V$ is found as shown in the following equation, where $W_y$ indicates the weight of the output:
$$p(y \mid z_t) = \frac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)}. \tag{3}$$
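The gating behaviour described in this section can be made concrete with a small sketch of one LSTM step. It is a simplified illustration, not the network trained in the paper: it assumes a single hidden unit with scalar input, and the names `lstm_step`, `encode`, and the parameter dictionary layout are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function with output in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single hidden unit with a scalar input."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))        # how much new content is written
    f = sigmoid(pre_activation("forget"))       # how much old cell state is kept
    o = sigmoid(pre_activation("output"))       # how much cell state is exposed
    g = math.tanh(pre_activation("candidate"))  # candidate cell content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def encode(sequence, params):
    """Feed the inputs one at a time and collect the hidden outputs h_1..h_n."""
    h, c = 0.0, 0.0
    hidden = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        hidden.append(h)
    return hidden


# Hypothetical parameters: (input weight, recurrent weight, bias) per gate.
demo_params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}
hidden_states = encode([0.1, 0.4, 0.9], demo_params)
```

Driving the forget gate towards 0 makes the step discard the previous cell state entirely, while driving it towards 1 carries the state forward, which is the 0-to-1 behaviour of the sigmoid described in the text.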

Mathematical Problems in Engineering

Attention Mechanism.
The difference in the attention allocated to different signals by the human brain when processing signals is referred to as the visual AM [24]. The region of a target on which human vision focuses by quickly scanning the image, in order to obtain more detailed information about that target and to eliminate other useless information, is referred to as the focus of attention [25]. In this paper, the AD of FG movements is based on the principle of the AM: the decisive video movements, i.e., the way the athlete's body flips, the angle, and the direction, are selected first and assigned more weight in order to make the AD more accurate. The introduction of the AM allows the decoder to assign weights to all feature vectors in the FGV.
The structure of the model containing the AM is shown in Figure 3.
In this paper, a dynamic weighted sum of the temporal feature vectors is used, with the following equation:
$$\varphi_t(X) = \sum_{i=1}^{n} \alpha_i^{t} x_i, \tag{4}$$
where $t$ denotes the moment $t$ and $x_i$ denotes the $i$th feature vector. $\alpha_i^{t}$ is the proportion of the overall score with which the output of the hidden layer at that moment matches the entire video representation vector, calculated as follows:
$$\alpha_i^{t} = \frac{\exp(\mathrm{score}(x_i, h_i))}{\sum_{j=1}^{n} \exp(\mathrm{score}(x_j, h_j))}, \tag{5}$$
where $\mathrm{score}(x_i, h_i)$ denotes how strongly the video feature vector $x_i$ matches the output $h_i$ of the $i$th hidden layer; the larger the score, the greater the attention given to the input at this moment in the video. The score is calculated as follows:
$$\mathrm{score}(x_i, h_i) = w^{\top} \tanh(W x_i + U h_i + b), \tag{6}$$
where $w$, $W$, and $U$ are weights and $b$ is an offset.
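The weighting scheme in this section can be sketched in a few lines. The sketch assumes the common additive scoring form w · tanh(W x + U h + b), followed by a softmax over the frames and a weighted sum of the frame features. The function names and the tiny parameter values are hypothetical and serve only to show the mechanics.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def attention_weights(features, hiddens, w, W, U, b):
    """Softmax-normalised additive scores, one weight per frame feature."""
    scores = []
    for x, h in zip(features, hiddens):
        pre = [p + q + r for p, q, r in zip(matvec(W, x), matvec(U, h), b)]
        scores.append(dot(w, [math.tanh(v) for v in pre]))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, weights):
    """Weighted sum of the frame feature vectors."""
    dim = len(features[0])
    return [sum(a * x[k] for a, x in zip(weights, features)) for k in range(dim)]


# Hypothetical toy values: two frames with 2-D features and 1-D hidden outputs.
features = [[0.0, 0.0], [5.0, 0.0]]
hiddens = [[0.0], [0.0]]
w, W, U, b = [1.0], [[1.0, 0.0]], [[0.0]], [0.0]

weights = attention_weights(features, hiddens, w, W, U, b)
context = attend(features, weights)
```

Frames whose features score higher receive larger weights, so they contribute more to the vector passed to the decoder, which is how keyframes gain importance.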

Experimental Setup.
The graphics card used in this paper is an NVIDIA Titan 1080 with 11 GB of memory. During network training, the input data were resized to 227 × 227, and the VGG-16 model was used with parameters pretrained on the ILSVRC-2012 image set, a subset of ImageNet. Comparison experiments were added in order to verify the impact of the features extracted by different 2D CNNs on the description results for the freestyle gymnastics videos. We conducted experiments with the ResNet101, ResNet50, and DenseNet201 CNNs, comparing the results obtained after their extracted features were input to the model for encoding and decoding.

Dataset Construction.
The construction of a dataset of decomposed movements for FG is an essential task for the AD of FG. The experimental dataset in this paper is mainly collected from videos of professional athletes competing in professional competitions, such as the Olympic Games, World Championships, National Games, and several other major events. These collected videos are first preprocessed: the video of each athlete is cut so that it includes only the athlete's FG movements. Because the broadcasts are interspersed with highlights, replays, and slow-motion commentary, these parts were ignored in the AM. Through data collection, we obtained a total of 298 training videos and 45 test videos. After preprocessing all the videos, some problems remain. As these videos were obtained from live broadcasts, there is no real-time caption display for the narrator's words, and we address the effect of distracting factors by using speech recognition. In this paper, word frequency statistics were computed on the 298 video descriptions collected, and the results showed that a total of 48 words appeared in these descriptions; the word frequencies of all words are shown in Table 1. Some words occur fewer than 10 times, and nearly half of the words occur only once or twice. Figure 4 also analyses the frequency of the 25 words with more than 10 occurrences. It can be seen that the number of words with more than 150 occurrences is still very small. The names of the words in Figure 4 are replaced by their first two letters.
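The word frequency statistics described above amount to counting tokens across the annotations and splitting them at a threshold. A minimal sketch follows, assuming whitespace-tokenised descriptions. The helper names and the three sample descriptions are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def word_frequencies(descriptions):
    """Count how often each word appears across all descriptions."""
    return Counter(word for text in descriptions for word in text.lower().split())


def split_by_threshold(frequencies, threshold):
    """Separate words occurring more than `threshold` times from the rest."""
    frequent = {w: c for w, c in frequencies.items() if c > threshold}
    rare = {w: c for w, c in frequencies.items() if c <= threshold}
    return frequent, rare


# Hypothetical annotation strings, for illustration only.
sample = ["forward tucked salto", "forward stretched salto", "forward twist"]
frequencies = word_frequencies(sample)
frequent, rare = split_by_threshold(frequencies, threshold=1)
```

With the real annotations, a threshold of 10 would separate the words plotted in the frequency figure from the remaining low-frequency words.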

Scheduled Sampling.
In the training phase, the decoder uses the target sample as the input for the next prediction step, whereas in the prediction phase the decoder takes its previous prediction result as the input for the next prediction. This difference means that the training and prediction scenarios do not match. In prediction, if the previous word is predicted incorrectly, the error propagates to all subsequent words, whereas this does not happen in the training phase.
This paper modifies the decoder model during training by introducing a scheduled sampling approach. The base model uses only the true annotated data as input, whereas the training decoder with scheduled sampling selects the model's own output with probability P as the input for the next prediction and the true label with probability 1 - P. The sampling rate P varies during the training process. In the beginning, when training is not yet sufficient, P is kept small so that the true description is mostly used as input; as training progresses, P is increased and more of the model's own output is used as input for the next prediction. As P grows larger and larger, the training decoder eventually becomes the same as the prediction decoder. In this way, the scheduled sampling scheme reduces the difference between the training and prediction decoders.
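The sampling rule described above can be sketched as follows. The linear growth of P is an assumption made for illustration, since the text only states that P increases during training. The function names `sampling_probability` and `choose_next_input` are hypothetical.

```python
import random


def sampling_probability(step, total_steps, p_start=0.0, p_end=1.0):
    """Linearly increase P from p_start to p_end over training (assumed schedule)."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * fraction


def choose_next_input(true_word, predicted_word, p, rng=random):
    """Use the model's own prediction with probability p, else the true word."""
    return predicted_word if rng.random() < p else true_word


# Early in training the true word is used; late in training the prediction is.
early = choose_next_input("stretched", "tucked", sampling_probability(0, 100))
late = choose_next_input("stretched", "tucked", sampling_probability(100, 100))
```

At P = 0 the decoder behaves like the base model with true inputs only, and at P = 1 it matches the prediction decoder, so intermediate values move gradually between the two regimes.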

Loss Function.
The iterations of the loss function in the experiments are presented using the visualization tool TensorBoard, as shown in Figure 5, which shows the variation of the training loss value with the number of iterations for the original model and the model after the introduction of the AM. The loss values of both models gradually decrease and converge. The model with the AM has a higher rate of convergence, while its time complexity increases and its starting loss value is larger.

Evaluation Metrics and Performance Comparison.
The result of the AD of FG is a description of the decomposed FG movements, which is a kind of natural language, so the description can be evaluated with the metrics used in natural language processing to evaluate the quality of machine translation results. Bleu is currently considered the metric closest to human rating. Bleu is based on N-gram matching, where an N-gram is a sequence of n consecutive words in a sentence. This paper conducts experiments on two corpora, the MSVD and the self-built dataset.
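To make the N-gram matching concrete, the sketch below computes a simplified sentence-level Bleu against a single reference and averages Bleu_1 to Bleu_4. It is an illustration under stated assumptions, not the evaluation code behind the reported results: it uses clipped n-gram precision, a brevity penalty, and no smoothing, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n):
    """Geometric mean of the 1..max_n precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(reference) / len(candidate))
    return penalty * math.exp(log_mean)


def mean_bleu(candidate, reference):
    """Average of Bleu_1 to Bleu_4."""
    return sum(bleu(candidate, reference, n) for n in range(1, 5)) / 4


reference = "forward stretched twist three".split()
candidate = "forward tucked twist three".split()
score = mean_bleu(candidate, reference)
```

For very short descriptions such as movement terms, higher-order precisions are often zero without smoothing, which is one reason an average over Bleu_1 to Bleu_4 can be more informative than Bleu_4 alone.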
The experimental results are shown in Table 2.
The experimental results compared in Table 2 are the mean Bleu from Bleu_1 to Bleu_4. The table shows three datasets, two of which are our own and differ in how the video descriptions are tagged. OURS(1) uses the most straightforward natural language; since the description of FG breakdown movements requires a high level of professionalism, OURS(2) is tagged differently, with the descriptive statements adapted to the specialist terminology. From the results on the three datasets, we see that the model with the AM introduced in this paper performs better on both the MSVD dataset and the self-built dataset. The different test results for the three models are given in Figure 6. From Figure 6, we see that MSVD has the best performance among the three models and accounts for the largest percentage, whereas OURS(1) is the least effective and OURS(2) is the second most effective, indicating that the effectiveness of the modified model has improved. Table 3 shows the comparison of the experimental results of feature extraction using the ResNet101, ResNet50, and DenseNet201 networks on the self-built dataset OURS(2). In Table 3, two evaluation metrics, ROUGE_L and METEOR, are also added, and in order to highlight the gaps in the experimental results, the average Bleu from Table 2 is expanded into the specific Bleu_1, Bleu_2, Bleu_3, and Bleu_4 results. The specific experimental results show that DenseNet201 performs the best in all evaluation metrics compared to VGG16.
Figure 7.
The ultimate goal of the multiclassification in this paper is to achieve AD of the video, using the improved method for testing. To ensure experimental rigour, the mean Bleu from Bleu_1 to Bleu_4 is still taken here as the evaluation metric, and the identified freestyle description statements are compared with the correct descriptions labelled in the previous section.
The comparison of the experimental results is shown in Table 4, and it is clear that the method of using video multilabel classification transformation for AD of freestyle gymnastics videos gives better experimental results. The pie charts of the experimental results of the five methods are shown in Figure 8. The experimental results of the method in this paper are the best, which verifies the effectiveness of the proposed method. The experimental results of the different AD models of FG on the self-constructed dataset are compared with those of the classification method in this paper. Compared to the original model S2VT, the results of the model with the AMs are similar for the direction test as "forward" in blue, but for the body posture test as "stretched" in red, the improved model is more specific. In the classification problem, the video contains two actions, and although only one correct category is identified, "forward stretched twist three," this category contains four correctly described words, thus improving the accuracy of the description.
Although the improved model improves the accuracy of the description, the AD method for FGV based on the LSTM networks only applies a two-dimensional CNN model for feature extraction. This increases the risk of gradient loss due to the temporal loss of information in the video data. In the future, three-dimensional CNNs could be used for feature extraction, and the network could be improved by fusing multimodal video features. In addition, the introduction of the AM could be further improved by introducing several attention modules at the same time to highlight the importance of the keyframes in decision-making. Attention should be paid to the speed and efficiency of the operation and to the improvement of the algorithm to improve the accuracy of the description.

Conclusion
In this paper, the method of video AD and related technical theories were introduced, and the AD method for FGV based on LSTM networks was described in detail. Firstly, the automatic video description method was taken as the entry point, and the current status of AD video methods and sports video research was reviewed. From the perspective of economic management effectiveness visualization techniques, the relevant concepts and development history were introduced, and the structures of three important types of NNs were described, with emphasis on the structural dissection of typical network models. The paper introduced the integration of AMs into existing video description networks, weighing the importance of video frames by means of weight values. In the course of the experiments, to improve the model's accuracy, scheduled sampling was employed to reduce the discrepancy between the training decoder and the prediction decoder. Finally, experiments were conducted on the feature extraction network structures VGG16, ResNet101, ResNet50, and DenseNet201, through which the feasibility of the improved method was verified. As the results of this paper were obtained in an experimental setting, more extraneous factors are likely to interfere in practical applications, and the model will be improved to make it applicable in realistic settings.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.