Design of National Sports Action Feature Extraction System Based on Convolutional Neural Network

ive and extractive summarization," Data Science and Engineering, vol. 4, no. 1, pp. 14–23, 2019.
[26] P. Zhou and Z. Jiang, "Self-organizing map neural network (SOM) downscaling method to simulate daily precipitation in the Yangtze and Huaihe River Basin," Climatic and Environmental Research, vol. 21, no. 5, pp. 512–524, 2016.
[27] X. Xiao, "Analysis on the employment psychological problems and adjustment of retired athletes in the process of career transformation," Modern Vocational Education, vol. 5, no. 12, pp. 216–217, 2018.
[28] S. Sahoo and M. K. Jha, "Pattern recognition in lithology classification: modeling using neural networks, self-organizing maps and genetic algorithms," Hydrogeology Journal, vol. 25, no. 2, pp. 311–330, 2016.
[29] Y. Zhou and B. Yang, "Sports video athlete detection using convolutional neural network," Journal of Natural Science of Xiangtan University, vol. 39, no. 1, pp. 95–98, 2017.
[30] J. Pang, "Research on the evaluation model of sports training adaptation based on self-organizing neural network," Journal of Nanjing Institute of Physical Education, vol. 16, no. 1, pp. 74–77, 2017.
[31] G. Querzola, C. Lovati, C. Mariani, and L. Pantoni, "A semiquantitative sport-specific assessment of recurrent traumatic brain injury: the TraQ questionnaire and its application in American football," Neurological Sciences, vol. 40, no. 9, pp. 1909–1915, 2019.
[32] J. Wang, X. Luo, and H. Yan, "Correlation analysis between injuries and functional movement screening for athletes of the National Shooting Team," Journal of Capital Institute of Physical Education, vol. 5, no. 4, pp. 352–355, 2016.
[33] G. Ma, "Research on the design of juvenile football players' sports injury prediction model," Automation Technology and Application, vol. 277, no. 7, pp. 141–144, 2018.


Introduction
In human-centered computer vision research (such as human detection, tracking, pose estimation, and motion recognition), human motion recognition is widely applied, for example, in video surveillance, human-machine interfaces, home assistance, human-computer interaction, and intelligent driving, and it has become an important research direction in computer vision [1][2][3][4][5].
According to the complexity and duration of the action, action recognition can be roughly divided into four types: gesture recognition, action recognition, interaction recognition, and group activity recognition. Specifically, a gesture expresses a person's thoughts, opinions, or emotions through basic movements or positions of the hands, arms, body, or head; typical examples are "waving" and "nodding." A gesture's duration is relatively short and its complexity is low. An action is an activity completed by a single human body mobilizing multiple parts of the body; that is, an action is a combination of multiple postures, such as "walking" and "boxing." An interaction is completed by two subjects: a person and an object, or a person and another person [6][7][8][9][10][11]. This means that an interaction expresses interplay between people, or between a person and an object, such as "hugging" and "playing the guitar." Group activities are the most complicated type; they may combine the other three types (posture, action, and interaction) and involve two or more people or objects, such as "two teams playing basketball" and "a group meeting," as shown in Figure 1.
The realization of action recognition is generally divided into three stages: first target detection, then feature extraction, and finally feature analysis, judgment, and recognition. There are corresponding studies for each stage, all aiming at efficient action recognition. Moving-target detection is not the focus of this article, so it will not be repeated here. In the feature extraction stage, the traditional method is to select features manually. However, in most cases a specific algorithm is tailored to the characteristics of the data, so the generalization ability is poor, results often differ greatly across datasets, and the complexity of the processing pipeline varies from dataset to dataset. Moreover, with the advent of the big data era, datasets are moving toward larger volumes, more categories, and wider ranges, making it harder to analyze data and extract features. As determining which features to extract becomes more difficult, feature extraction is gradually abandoning task-specific customization in favor of general methods. Researchers increasingly hope that action features in video can be extracted by nonmanual methods, which reduces the difficulty of hand-designing features in the early stage and improves the efficiency of large-scale action-data processing. As research has deepened, researchers have used models and algorithms to characterize features, such as image-based, model-based, and spatiotemporal representations. In the last stage, a classifier is needed to analyze the extracted features for classification; that is, the extracted features serve as an abstract representation of the original input image and are fed into the classifier, whose trained parameters are used to find the closest match to the input data.
The main goal of human action recognition is that the labels predicted on test data by the classification model obtained through training should fit the actual labels as closely as possible. As features have evolved from simple to complex, classifiers have followed the same path: from early linear binary classifiers to logistic regression, as well as a series of excellent traditional learning algorithms such as SVM and HMM, which are typical classifier models. However, these classifiers share a common drawback: different types of features must be matched with different classifiers, which greatly reduces computational efficiency. Deep learning, a branch of traditional machine learning, can make up for these shortcomings and has become a popular choice among recent classification algorithms [12,13].
Deep learning is an important branch of traditional machine learning. With the development of hardware and big data, it has gradually become one of the hot topics in computer vision research. Deep learning builds a hierarchical training model and establishes a progressive learning mechanism between input and output data: with each layer, the extracted feature representation rises one level of abstraction, so the final trained model can extract high-dimensional features of the original image that are conducive to final recognition and judgment. Because deep learning learns autonomously and does not require hand-designed algorithms, it is a more efficient and generalizable feature extraction method. So far, deep learning has produced good experimental results in target detection, target recognition, image processing, and related fields. One of its more prominent advantages is that it processes big data much faster than traditional hand-crafted features. With the rapid development of the Internet, deep learning fits the characteristics of the era and can extract value from massive amounts of data, making its future development very promising [14][15][16][17].
In recent years, developments in deep learning have provided new methods for judging and recognizing human actions. Deep learning algorithms can extract higher-level action features and give better classification results, and systems that apply deep learning to action recognition have achieved high recognition rates. The good classification performance of deep learning has been widely recognized and practically applied in driverless cars, medical image recognition, and image retrieval. Convolutional neural networks are a common way to learn high-level features in deep learning and have been applied to video classification. At the same time, earlier work pointed out that using a convolutional neural network to process each frame of a video sequence individually gives almost the same result as processing a series of frames together; that is, the convolutional networks compared in that work did not integrate temporal and spatial characteristics well [18][19][20][21][22].
To better mine the spatial and temporal features in action sequences, K. Simonyan et al. used convolutional neural networks to process the spatial and temporal data streams separately and then fused the learning results of the different features with specific methods [23][24][25][26][27][28]; see Figure 2.
To solve the above problems, this paper proposes a novel action recognition framework that compresses the action sequence based on depth information, effectively compressing a video sequence into several images, and then uses the image learning and classification capabilities of convolutional neural networks to complete the learning and classification of action sequences. This paper applies rank pooling to the depth image sequence to obtain three levels of dynamic images, compressing the depth video sequence; three convolutional neural networks then judge and recognize the three types of images, respectively, and the results are finally merged to obtain the best effect. In this research, we introduced AlexNet, which performs well on human action recognition in deep learning, and improved the accuracy of action recognition. Compared with other algorithms, the recognition performance and robustness of this method are far stronger, and the trained network can allow anyone to control a UAV by gestures according to regulations, which shows good universality [29][30][31][32][33].

Convolutional Neural Network
One of the most common algorithms for image recognition is the model built on a convolutional neural network (CNN), which can be regarded as a feedforward neural network that includes convolution operations. Through successive convolution and pooling operations, the model gradually extracts the features of the target image, and the extracted features have translation invariance. The biggest features of a convolutional neural network are weight sharing and local perception, which greatly reduce the number of parameters that must be learned during training. Note that each feature extraction requires multiple convolution kernels; different kernels correspond to different features, as shown in Figure 3. At the same time, the network's two-dimensional spatial structure suits it to computer vision tasks and gives it good characterization ability. Therefore, a CNN-based classification model mainly includes the following modules.

Convolutional Layer.
After the current data features are input to the convolutional layer, the feature expression of the input data is completed by setting an appropriate number of convolution kernels, the convolution kernel size, and the convolution method. The data is then multiplied with the convolution kernel to obtain each feature point on the corresponding feature map: all the data is traversed according to the convolution stride, realizing the convolution between the kernel parameters and the input data within the receptive field. The process can be expressed as

Z^{l+1}(i, j) = [Z^l \otimes w^{l+1}](i, j) + b = \sum_{k=1}^{K} \sum_{x=1}^{f} \sum_{y=1}^{f} Z_k^l(s_0 i + x, s_0 j + y) \, w_k^{l+1}(x, y) + b,

L_{l+1} = (L_l + 2p - f)/s_0 + 1,

where b represents the corresponding bias; Z^{l+1} and Z^l denote, respectively, the output and input of the layer; L_{l+1} represents the feature size of the output image; Z(i, j) represents a feature value; K represents the number of convolution kernels, that is, the number of channels; and f, s_0, and p represent, respectively, the convolution kernel size, the stride, and the padding.
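As a minimal sketch of the operation described above (an illustration, not the paper's implementation), a single-channel convolution with stride s_0 and zero padding p can be written in plain Python; the output size follows the formula (L + 2p − f)/s_0 + 1:

```python
def conv2d(image, kernel, stride=1, padding=0, bias=0.0):
    """Plain 2D convolution over a single-channel image (lists of lists)."""
    f = len(kernel)                      # kernel size f
    if padding:                          # zero-pad p rows/columns on each side
        w = len(image[0])
        pad_row = [0.0] * (w + 2 * padding)
        image = ([pad_row[:] for _ in range(padding)]
                 + [[0.0] * padding + list(row) + [0.0] * padding for row in image]
                 + [pad_row[:] for _ in range(padding)])
    H, W = len(image), len(image[0])
    out_h = (H - f) // stride + 1        # matches L_{l+1} = (L_l + 2p - f)/s0 + 1
    out_w = (W - f) // stride + 1
    out = [[bias] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            s = 0.0                      # dot product of kernel and receptive field
            for x in range(f):
                for y in range(f):
                    s += image[i * stride + x][j * stride + y] * kernel[x][y]
            out[i][j] += s
    return out
```

For a 3 × 3 input and a 2 × 2 kernel with stride 1 and no padding, the output is 2 × 2, exactly as the size formula predicts.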
When both the convolution stride and the convolution kernel size are 1, the operation is called unit (1 × 1) convolution, and a convolutional layer composed of such unit convolutions is called a network in network. With the input feature size unchanged, unit convolution can fuse features across multiple channels, which helps reduce the number of model parameters and thereby a certain amount of computation.
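A unit convolution reduces to a per-pixel weighted sum across channels. The following sketch (a hypothetical helper, not from the paper) fuses K input channels into a single map of unchanged spatial size:

```python
def unit_conv(feature_maps, weights, bias=0.0):
    """1x1 convolution: fuse K channels into one map; spatial size unchanged.
    feature_maps: list of K HxW maps; weights: one scalar per channel."""
    K = len(feature_maps)
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(weights[k] * feature_maps[k][i][j] for k in range(K)) + bias
             for j in range(W)]
            for i in range(H)]
```

Because each output pixel needs only K multiplications, a 1 × 1 layer mixing channels is far cheaper than a spatial convolution of the same width.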
The convolutional layer has the two characteristics of local perception and weight sharing. If the neurons of adjacent layers were fully connected in pairs, then for a 1000 × 1000 image the input layer alone would have 10^6 nodes, and a huge number of parameters would have to be learned during training. Instead, feature values are extracted as the convolution kernel slides over the previous layer; the kernel's weights are kept unchanged during each operation, and at any moment the kernel connects only to a part of the neurons in the previous layer. This local perception not only matches humans' part-to-whole cognition of things but also greatly reduces the number of parameters. Applying the information learned in one local area to other places is called weight sharing; in simple terms, the same filter is slid over the whole image, which likewise greatly reduces the parameters.
Each feature extraction uses multiple convolution kernels; the features obtained by different kernels differ, and each kernel extracts a certain aspect of the features. At the same time, the two-dimensional spatial structure gives the network good characterization capabilities for computer vision tasks.

Pooling Layer.
After the data from the previous layer is input, it is passed to the pooling layer, which mainly filters the data of the previous layer; the filtering standard follows the feature extraction of the convolutional layer. This process can be understood as imitating the human visual system to reduce the dimensionality of the data and finally obtain higher-level image representation features. Pooling helps reduce information redundancy and improves the scale invariance of the model, thereby helping to avoid the adverse effects of overfitting. In general, pooling can be divided into mean pooling and max pooling. The advantage of the latter is that it can learn the edge and texture structure of the image; the advantage of the former is that it can effectively reduce the deviation of the estimated mean and improve the anti-interference ability of the model. The pooling operation can be expressed as

A_k^l(i, j) = \left[ \sum_{x=1}^{f} \sum_{y=1}^{f} A_k^l(s_0 i + x, s_0 j + y)^p \right]^{1/p},

where s_0 represents the pooling stride; when p = 1 this is average pooling, and when p tends to infinity it is max pooling.
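The two limiting cases of the pooling formula, max and average pooling, can be sketched directly (an illustration only; function names are our own):

```python
def pool2d(feature, size=2, stride=2, mode="max"):
    """Max or average pooling over a single-channel feature map."""
    H, W = len(feature), len(feature[0])
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [feature[i * stride + x][j * stride + y]
                      for x in range(size) for y in range(size)]
            # p -> infinity gives the max; p = 1 gives the mean
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out
```

A 4 × 4 map pooled with a 2 × 2 window and stride 2 shrinks to 2 × 2, discarding redundant detail while keeping the dominant responses.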

Fully Connected Layer.
The fully connected layer is composed of several neurons "holding hands" with each other: each neuron in the latter layer is connected to every neuron in the former layer. This is equivalent to the aforementioned feedforward network and serves as the final layers of the convolution model, integrating the features provided by the previous convolution and pooling operations. The feature map is expanded into a one-dimensional column vector and used as the input to the fully connected layer. Using an activation function, the input feature is easily made nonlinear, feature vectors with more expressive ability are extracted, and feature classification is realized. In summary, the structure of a typical convolutional neural network for image tasks can usually be expressed as input layer → (convolutional layer → pooling layer) × N → fully connected layer(s) → output layer.
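The flatten-then-connect step can be sketched as follows (an illustrative toy, with ReLU as an assumed activation, not necessarily the paper's choice):

```python
def flatten(feature_maps):
    """Expand a stack of feature maps into a one-dimensional vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def fully_connected(x, weights, biases, relu=True):
    """Dense layer: every output neuron connects to every input value."""
    out = [sum(w * xi for w, xi in zip(ws, x)) + b
           for ws, b in zip(weights, biases)]
    # optional ReLU nonlinearity on the outputs
    return [max(0.0, v) for v in out] if relu else out
```

A usage example: the flattened vector of a 2 × 2 map has 4 entries, and each output neuron holds one weight per entry plus a bias, which is why fully connected layers dominate the parameter count.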
Because convolutional neural networks have advantages in processing images, they are generally regarded as a standard method in image recognition. Among traditional network structures there are many classics, such as the AlexNet, GoogLeNet, VGGNet, and ResNet networks.

Gradient Descent.
Standard (batch) gradient descent computes the gradient of the loss over the entire training set before each parameter update:

θ ← θ − η ∇_θ J(θ),

where η is the learning rate. This method has certain drawbacks, such as relatively slow computation and limited applicability.

Stochastic Gradient Descent (SGD).

Unlike standard gradient descent, which computes the gradient only after evaluating the loss of all samples, SGD computes the gradient and updates the parameters once per sample:

θ ← θ − η ∇_θ J(θ; x^{(i)}, y^{(i)}).

Minibatch Gradient Descent (MBGD).
This method is a compromise between stochastic gradient descent and batch gradient descent. Its main idea is as follows: given a dataset of n training samples, select a minibatch of size m (m < n) and update the parameters using the gradient computed on that minibatch:

θ ← θ − η ∇_θ J(θ; x^{(i:i+m)}, y^{(i:i+m)}).

AdaGrad is an adaptive-learning-rate method that applies a high learning rate to low-frequency parameters and a low learning rate to high-frequency parameters, which makes it well suited to processing sparse data. With gradient g, gradient accumulation variable r (updated as r ← r + g ⊙ g), global learning rate η, and a very small constant δ (generally 10^{-8}), the update is

θ ← θ − (η / (δ + √r)) ⊙ g.
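The minibatch update and the AdaGrad rule above can be illustrated on a toy least-squares problem; the code below is a schematic sketch (the toy loss, function names, and hyperparameters are our own, not the paper's):

```python
import random

def adagrad_step(theta, grad, r, lr=0.01, delta=1e-8):
    """One AdaGrad update: per-parameter rates shrink as squared
    gradients accumulate in r."""
    for k in range(len(theta)):
        r[k] += grad[k] ** 2
        theta[k] -= lr * grad[k] / (delta + r[k] ** 0.5)
    return theta, r

def minibatch_sgd(data, theta, lr=0.1, batch=4, epochs=50):
    """Minibatch gradient descent fitting y = theta0 * x by least squares."""
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch):
            mb = data[i:i + batch]
            # gradient of mean squared error over the minibatch
            g = sum(2 * (theta[0] * x - y) * x for x, y in mb) / len(mb)
            theta[0] -= lr * g
    return theta
```

On data drawn from y = 3x, the fitted parameter converges to 3; with batch equal to the dataset size this degenerates to standard gradient descent, and with batch 1 it becomes SGD.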

Action Feature Model Algorithm Based on Convolutional Neural Network
To simplify the processing of video sequences and make the computer's judgment and recognition of human action types more efficient, a method of compressing depth videos of human actions is proposed, as shown in Figure 4. This method considers the spatial and temporal characteristics of the video at the same time, preserving the action characteristics of the video while reducing information redundancy. The algorithm includes the following steps. First, remove the background interference in the depth video, leaving only the shape of the human body; after removing the background from a k-frame depth video sequence, the running average up to time t is expressed as

d_t = (1/t) \sum_{τ=1}^{t} I_τ,

where d_t represents the average of the depth features of all frames up to time t and I_τ is the τ-th foreground depth frame. The resulting depth image is shown in Figure 5. Then apply rank pooling to the depth video: at each time t, define a score value that must satisfy the condition that the later the frame, the greater the score.
Rank pooling is then performed on the depth video sequence to extract features. Rank pooling finds an optimal solution of an objective of the following (RankSVM) form:

w* = argmin_w (λ/2) ||w||^2 + (2/(k(k−1))) \sum_{t_1 > t_2} max{0, 1 − S(t_1; w) + S(t_2; w)},  with S(t; w) = ⟨w, d_t⟩.

In the depth image, each joint point is used as a center point for extension, and the depth image is cropped with a frame of size q × p to obtain the image block of that joint. To give the extracted image blocks the same scale, we find the maximum range of motion of the same unit over a video frame sequence and define it as a mask; each unit is zero-padded to the mask size. In this way, the relative scale between units is maintained, which is essential for preserving spatial information, as shown in Figure 6.
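The paper's exact rank pooling solver is not reproduced here; a common closed-form shortcut from the dynamic-image literature (the coefficients a_t = 2t − T − 1, an assumption on our part, not the paper's stated solver) collapses a T-frame sequence into one image:

```python
def approximate_rank_pool(frames):
    """Approximate rank pooling: compress a T-frame depth sequence into a
    single 'dynamic image' via the closed-form weights a_t = 2t - T - 1,
    which grow with t so later frames score higher."""
    T = len(frames)
    H, W = len(frames[0]), len(frames[0][0])
    dyn = [[0.0] * W for _ in range(H)]
    for t, frame in enumerate(frames, start=1):
        a = 2 * t - T - 1            # negative for early frames, positive for late
        for i in range(H):
            for j in range(W):
                dyn[i][j] += a * frame[i][j]
    return dyn
```

The resulting single image preserves temporal order (later frames dominate with positive weight), which is the property the score condition above demands.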
Convolutional neural networks have excellent effects on image and speech processing. Because a single neuron responds to its coverage area and the surrounding pixels, convolutional neural networks have unique advantages for large-scale image processing, and their weight-sharing structural design is closer to the working principle of biological neural networks.
Through the pooling layers and the weight-sharing design before the fully connected layers, the number of parameters of the entire neural network is greatly reduced. The design uses multiple convolution templates for feature extraction, which is far better than existing machine learning methods for image classification and recognition. Deep learning with convolutional neural networks can automatically extract useful features from images and use these features to classify them; the quality of the learned features exceeds that of many existing hand-specified features.
To achieve a good classification effect, we adopted the AlexNet network structure, which has achieved remarkable results on the ImageNet dataset, as the neural network for classifying gestures. The network structure has 5 convolutional layers and 3 fully connected layers.
The relationship between every two convolutional layers of the network structure is shown in Figure 7.
Among the three fully connected layers in AlexNet, each contains 4096 neurons. Such a network maximizes the multiclass logistic regression objective; that is, it maximizes the average log probability of the correct label under the prediction distribution over the training samples, thereby making classification more accurate. To make the convolutional neural network converge to good results faster, the AlexNet model pretrained on ImageNet is used in this article to initialize the network parameters.
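The objective described above, the average log probability of the correct label under the softmax prediction distribution, can be written out as a small sketch (illustrative helpers, not the paper's code):

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def avg_log_prob(batch_logits, labels):
    """Multiclass logistic regression objective: mean log probability
    assigned to each sample's correct label (training maximizes this)."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        total += math.log(softmax(logits)[y])
    return total / len(labels)
```

Maximizing this quantity is equivalent to minimizing the cross-entropy loss commonly used to train AlexNet-style classifiers.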

Hardware Platform.
The workstation used for training is equipped with an Intel E5-2300 CPU and 16 GB of DDR3 memory, with an Nvidia Titan X GPU to accelerate the training of the neural network. The predicted result is plotted in Figure 8.

Software Platform.
The deep learning platform used in the experiment is Caffe (Convolutional Architecture for Fast Feature Embedding), a general framework for deep learning algorithms. The framework uses many libraries for fast computation, models for fast data storage, and function templates that can be called directly, allowing developers to quickly implement the network structure they envision. The architecture abstracts many common operations of convolutional neural networks, implemented for both CPU and GPU, and the entire computation can be switched seamlessly between them. Caffe allows users to implement convolutional neural networks simply by specifying the network structure in a configuration file. The network parameters set in the experiment are as follows: the network uses 256 training images per iteration (batch size), and, according to the size of the training dataset, about 90 epochs are trained in total. The initial learning rate (base learning rate) is set to 0.001 and drops to 0.0001 after 60 epochs. For the remaining training parameters, we refer to the method proposed by A. Krizhevsky et al.
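The reported schedule (batch size 256, base rate 0.001 dropped to 0.0001 after 60 of roughly 90 epochs) amounts to a step learning-rate schedule; the helper names below are illustrative, not Caffe API:

```python
def step_lr(epoch, base_lr=0.001, drop_epoch=60, gamma=0.1):
    """Step schedule matching the reported settings: base rate 0.001,
    multiplied by gamma = 0.1 from epoch 60 onward."""
    return base_lr * (gamma if epoch >= drop_epoch else 1.0)

def iterations_per_epoch(num_images, batch_size=256):
    """Number of iterations Caffe-style training runs per epoch
    (ceiling division of dataset size by batch size)."""
    return -(-num_images // batch_size)
```

For example, a training set of 1000 images at batch size 256 takes 4 iterations per epoch, and the learning rate stays at 0.001 through epoch 59 before dropping.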
To verify the algorithm in this article, five human motion datasets are used: the MSR Action 3D dataset, the G3D dataset, the MSR Daily Activity 3D dataset, the SYSU 3D HOI dataset, and the UTD-MHAD dataset.
These datasets cover most types of actions, including single-person actions, in-game actions, daily activities, human-object interactions, and fine-grained actions. Therefore, they can show that the algorithm proposed in this paper is universal.

MSR Action 3D Dataset.

The MSR Action 3D dataset contains 20 simple actions performed by 10 people facing the camera, and each person performs each action 2 to 3 times. The experiment adopts cross-subject validation: the data of the subjects labeled 1, 3, 5, 7, and 9 is used for training, and the data of the subjects labeled 2, 4, 6, 8, and 10 is used for testing.
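The odd/even cross-subject protocol used here can be sketched as a simple split (the tuple layout is our own assumption for illustration):

```python
def cross_subject_split(samples):
    """Cross-subject protocol: odd-numbered subjects (1, 3, 5, 7, 9) train,
    even-numbered subjects (2, 4, 6, 8, 10) test.
    samples: list of (subject_id, action_id, data) tuples."""
    train = [s for s in samples if s[0] % 2 == 1]
    test = [s for s in samples if s[0] % 2 == 0]
    return train, test
```

Splitting by subject rather than by clip ensures no performer appears in both sets, which is what makes the evaluation a test of generalization to unseen people.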

G3D Dataset.

G3D is a game-action dataset, a series of action recordings of in-game scenes, containing 20 game actions performed by 10 people. The experimental protocol keeps the first five labeled subjects as training data and the last five as test data.

MSR Daily Activity 3D Dataset.
The MSR Daily Activity 3D dataset contains 16 actions performed by 10 people; each person performs each action twice, once standing and once sitting. Most of this dataset consists of human-object interactions. The experiment adopts cross-subject validation: the data of odd-numbered subjects is used for training, and the data of even-numbered subjects is used for testing.

Conclusions
This paper analyzes the framework of action recognition and designs a new framework from two directions: feature extraction and feature classification. For feature extraction, in order to retain more information from the original video, this article starts from the spatiotemporal structure and uses hierarchical rank pooling to obtain three types of dynamic depth images (DDI), namely, whole-body DDI, part DDI, and joint DDI; in the feature classification stage, this paper uses convolutional neural networks. The rank pooling algorithm is based on an assumed forward temporal order: it extracts and ranks video features and compresses the ranking information into a single picture, an approach verified on RGB color images and a video compression method with excellent performance. Two-way pooling additionally reverses this forward temporal assumption and trains the two directions separately, which is quite effective for actions sensitive to temporal information and yields more comprehensive information; on the other hand, more data makes model training more adequate.

Hierarchical Image Segmentation Combination.
This paper proposes a simple and effective way of extracting video spatial information. The human body is divided according to joint points and combined at the whole-body, part, and joint levels. Among the three types of pictures obtained in this way, the whole-body DDI provides body contour information, while the part and joint DDIs provide detailed information; they complement each other, which is of great significance for the next recognition step. This way of segmentation is also easy to understand, making it very convenient for other researchers to reverify.

The Challenge of Human Motion Recognition.
Intra-class and inter-class differences: for the same action, the performances of different people may vary greatly. For example, because of individual differences, even for the simple action of running, different people have different speeds and step lengths; a robust action recognition method should therefore generalize well. The environment can be divided into the background in which the action is executed and the camera environment. The environment in which the action occurs is an important differentiating factor: in a complex and cluttered background, it is difficult to accurately track and locate feature points of interest, and important parts of the human body are likely to be occluded by objects and other people. At the same time, lighting conditions further affect the appearance of a person's contours, which greatly interferes with the recognition of human movements. The occlusion problem can be alleviated by using multiple cameras to observe movement from different perspectives, but this raises synchronization problems and cannot achieve real-time performance. In addition, a moving camera increases the difficulty of locating and tracking the human body and also causes scale changes of the human body. A recognition method should be unaffected by within-class variation while still distinguishing the differences between categories. However, as the number of action types increases, there is a certain overlap between actions, which makes recognition more difficult.
Generally, research assumes that actions are easily segmented in the time dimension. Although this assumption reduces the segmentation burden in the recognition task, a separate segmentation process must be added in advance, which worsens the real-time performance of the whole recognition pipeline. The speed at which people perform actions varies greatly, and it is difficult to determine the starting point of an action, which has the greatest impact when extracting features from video to represent the action. Therefore, a robust human action recognition method should be invariant to the execution speed of the action.
Data Availability

The dataset can be accessed upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.