A Sports Training Video Classification Model Based on Deep Learning

Introduction
With the rapid development of multimedia technology, sports have received unprecedented attention and development. The mainstream research work on sports training video includes field and ground line detection; player detection, recognition, and tracking; camera calibration; event detection; and video abstract extraction. The classification of sports training video based on semantic information refers to the use of machine vision technology to automatically identify the type of sports training on the field and present the recognition result in a certain form of expression [1]. Owing to the extensive influence of sports, introducing machine vision and machine learning technology into sports training video classification has great potential commercial application value.
At present, there is little research on sports training video classification. Zhu et al. used a Gaussian mixture model to achieve player detection. A multitarget tracking method based on a support vector regression particle filter was used to extract the trajectories of the players and the football, and the interactive spatiotemporal information between the player and football trajectories was used to express and recognize tactical behavior in football games. Niu et al. achieved camera calibration by detecting and tracking the ground lines in the video image and finally expressed and recognized tactical behavior using the spatiotemporal trajectory information of the interaction between players and the football in real space. Perse et al. proposed a two-stage framework to recognize tactical behavior in basketball games. In the first stage, player trajectories are segmented according to a Gaussian mixture model under generalized context information in basketball games. In the second stage, player trajectories are semantically expressed according to the key information, and tactical behavior recognition is realized by template matching. Chen et al. designed an automatic recognition system that realized camera calibration by field line detection and recognized attack and defense patterns in basketball games from player trajectory descriptions on the field. Masui et al. used background subtraction to detect players and then represented the spatial distribution of players in different areas of the field with a symbol system to recognize football tactical behavior; this was a nontracking tactical behavior recognition method. Most existing tactical behavior recognition uses target trajectories as the underlying visual feature, which faces many problems.
Firstly, due to the mutual occlusion between targets, the randomness of target movement, and the complexity of the background, there are still many problems with the accuracy and persistence of target tracking. Secondly, because sports training video is mainly shot from a long-distance view, the identification of players and balls is poor under complex lighting conditions.
Deep learning forms more abstract high-level features by combining low-level features to discover distributed representations of data. The multilayer network structure of a deep model allows the network to learn the organization of features by itself [2] and to obtain the final semantic features through multiple levels of abstraction. In 2006, Hinton et al. proposed the first feasible deep model. Since then, deep learning has become a new research field of machine learning, known as a revolutionary technology in artificial intelligence. Deep learning constructs a multilayer network model and combines low-level features into high-level semantic features with abstract representations, so as to simulate the way the human brain perceives and recognizes. At present, deep learning has been widely used in speech, image, and other recognition and detection tasks and has achieved remarkable results. The main contributions of the study are as follows:
(i) To study a sports training video classification model based on deep learning
(ii) To establish the sports training video classification model using the convolutional neural network of the deep learning method
(iii) To verify the effectiveness of the proposed approach through experiments

Camera Calibration.
Camera calibration technology is used to restore the position of the target in real three-dimensional space. On this basis, the radial and tangential distortion in the nonlinear model are fully considered, the Rodrigues rotation equation is used to reduce the number of optimization parameters, and the steepest descent method and the LM optimization method are used to solve for the accurate parameters, respectively. Because the actual lens in the video is not an ideal perspective imaging device and exhibits varying degrees of distortion, this distortion can be divided into radial distortion and tangential distortion [3]. In order to describe the imaging model accurately, parameters are introduced for the lens radial and tangential distortion. The relationship between the ideal coordinates and the distortion parameters is as follows:

x_d = x_u + δ_x = x_u + k_1 r^2 x_u + k_2 r^4 x_u + k_3 r^6 x_u + 2 k_4 x_u y_u + k_5 (2 x_u^2 + r^2),
y_d = y_u + δ_y = y_u + k_1 r^2 y_u + k_2 r^4 y_u + k_3 r^6 y_u + k_4 (2 y_u^2 + r^2) + 2 k_5 x_u y_u. (1)

In (1), (x_u, y_u) is the normalized image coordinate calculated by the pinhole camera model; (x_d, y_d) is the image coordinate actually containing distortion; δ_x and δ_y are the nonlinear distortion values; r^2 = x_u^2 + y_u^2; k_1, k_2, k_3, k_4, and k_5 are the nonlinear distortion parameters, where k_1, k_2, and k_3 are the radial distortion coefficients, which cause radial movement of real image points on the image plane, and k_4 and k_5 are the tangential distortion coefficients.
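As an illustration, the distortion model in (1) can be transcribed directly. This is a sketch: the coefficient values passed in below are placeholders, not calibrated parameters from any real camera.

```python
def distort(xu, yu, k):
    """Apply the nonlinear distortion model of (1) to a normalized image
    coordinate (xu, yu). k = (k1..k5); k1-k3 are radial and k4-k5 are
    tangential distortion coefficients."""
    k1, k2, k3, k4, k5 = k
    r2 = xu * xu + yu * yu
    radial = k1 * r2 + k2 * r2**2 + k3 * r2**3
    xd = xu + radial * xu + 2 * k4 * xu * yu + k5 * (2 * xu * xu + r2)
    yd = yu + radial * yu + k4 * (2 * yu * yu + r2) + 2 * k5 * xu * yu
    return xd, yd
```

With all coefficients zero the model reduces to the identity, which is a quick sanity check on the transcription.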
Given the initial parameters, solving for the precise camera parameters is essentially an unconstrained multidimensional extremum problem. Because there is a deviation between the theoretical pixel coordinates and the measured values after the target feature points are projected onto the image plane [4-6], the optimal estimate of the camera parameters must minimize this deviation. According to nonlinear optimization theory, the objective function is expressed as follows:

F = Σ_{i=1}^{n} Σ_{j=1}^{p} || m_ij − m̂_ij ||^2. (2)

In (2), n is the number of target images captured by the camera under different viewing angles; p is the number of target feature points; m_ij is the observed coordinate of the j-th feature point in the i-th target image; m̂_ij is the theoretical coordinate of the projection point of the target feature point under the nonlinear model; M_j is the spatial coordinate of the j-th feature point on the target.
In the process of capturing the target from different angles, the internal parameters of the camera are regarded as constant, while the external parameters differ for each shooting angle. The number of optimized parameters increases significantly with the number of target images [7-9]. The Rodrigues rotation equation provides a way to represent a rotation by a vector. If the 3 × 3 rotation matrix with 9 elements is represented by the 3 elements of a vector r = (r_x, r_y, r_z), the external parameters of each image are reduced to 6, which greatly reduces the amount of calculation in the optimization process. The relationship between the rotation matrix and the rotation vector is

R = cos θ · I + (1 − cos θ) · r̂ r̂^T + sin θ · [r̂]_×, (3)

where θ = ||r||, r̂ = r/θ, and [r̂]_× is the antisymmetric cross-product matrix of r̂.

The steepest descent method searches along the negative gradient direction of the objective function until it reaches the lowest point. For a unimodal function, it can quickly find the extreme point. This method exploits the fact that the function value decreases continuously along the negative gradient direction from the initial point. For the initial point X_0 of the function F, there is a sequence X_0, X_1, X_2, ..., satisfying

X_{k+1} = X_k − λ_k ∇F(X_k), (4)

and the corresponding function values satisfy

F(X_0) > F(X_1) > F(X_2) > ... (5)

Because the objective function has the form of a minimal sum of squares and the coordinates of the feature points on the target image are nonlinear functions of the parameters to be estimated, this is a nonlinear least squares optimization problem. The LM method avoids the case where A_k^T A_k is an ill-conditioned matrix in least squares. In the LM algorithm, the descent direction is given by

d_k = −(A_k^T A_k + μ_k I)^{−1} A_k^T f(X_k), μ_k > 0. (6)

Through the above process of restoring the position of the target in real three-dimensional space, the accuracy of sports training video classification is improved.
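The steepest descent search in (4)-(5) can be sketched on a simple two-variable quadratic. The objective function, step size, and iteration count below are illustrative choices, not values from the paper:

```python
def grad_descent(grad, x0, lr=0.1, iters=200):
    """Steepest descent: repeatedly step along the negative gradient
    from x0, as in (4). A fixed step size is used here for simplicity;
    the paper pairs this search with LM for the final refinement."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Minimize F(x, y) = (x - 3)^2 + 2*(y + 1)^2; gradient is (2(x-3), 4(y+1)).
xmin = grad_descent(lambda p: [2 * (p[0] - 3), 4 * (p[1] + 1)], [0.0, 0.0])
```

For this unimodal function the iterates converge to the minimizer (3, −1), matching the monotone decrease of function values stated in (5).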

Video Preprocessing.
Before classifying the sports training videos, the videos must first be preprocessed. Video shot on a sports training site is usually divided into long-distance, medium-distance, and close-distance video [10]. The proportion of long-distance shooting in sports training is relatively large, because remote shooting can effectively capture whole-field information. The input video segment is converted into a sequence of frame images:

V = {v_1, v_2, ..., v_N}, (7)
where V represents the video segment corresponding to a specific sports event, v_i represents the i-th frame image, i = 1, 2, 3, ..., N, and N is the number of frames after the input video segment is converted into frame images.
In order to classify sports training videos more accurately, the input video segments are divided according to equal length [11], and several subvideo segments are obtained. The expression is as follows:

v_j = {v_j1, v_j2, ..., v_jm},

where j ≠ p, j, p = 1, 2, ..., M, and q = 1, 2, ..., m; v_j represents the j-th subvideo segment after video segmentation, v_jq represents the q-th frame image in the j-th subvideo segment, and M represents the number of subvideo segments. After the above processing, the input and segmentation of the sports training video are completed; the time span of the segmented video has a certain impact on the classification results.
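The equal-length segmentation step might be sketched as follows. How trailing frames that do not fill a whole sub-segment are handled is an assumption here (they are dropped); the paper does not say:

```python
def split_segments(frames, m):
    """Split a frame sequence V = [v1..vN] into equal-length subvideo
    segments of m frames each. Trailing frames that do not fill a whole
    segment are discarded (an assumption, not stated in the text)."""
    return [frames[j:j + m] for j in range(0, len(frames) - m + 1, m)]

segs = split_segments(list(range(10)), 4)  # two full segments of 4 frames
```

This keeps every sub-segment the same length, which matters later because the time span of the segmented video affects the classification results.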

Extraction of Motion Vector Field.
(1) Let the size of the sports training video be M × N × T, where M × N denotes the resolution and T denotes the length of the video sequence. Each frame is divided into K × L blocks, each of size h × v, where h = M/K and v = N/L, and C indexes the blocks.
(2) A rectangular coordinate system is established and the motion vector is mapped to this coordinate system [12]. The mapping diagram of the motion vector field in the rectangular coordinate system is shown in Figure 1. In Figure 1, C_x is the component of the motion vector of the C-th block in the horizontal (x) direction, C_y is its component in the vertical (y) direction, and ρ is the motion intensity of block C; then,

ρ = sqrt(C_x^2 + C_y^2), θ = arctan(C_y / C_x). (8)

(3) The coordinate systems of continuous video frames are arranged in chronological order [13]. The plane is divided into Q equal-angle sectors along the positive x direction, ρ is quantized into R intervals, and the histograms of θ and ρ are computed, respectively. In (9), q_i^t represents the number of motion vectors in sector q in frame t, and r_i^t represents the number of motion vectors whose ρ is quantized to interval r in frame t.
(4) The expectation and variance of the motion vector in the x and y directions are used to evaluate the motion in the block, namely,

μ_x = (1/n) Σ_i C_{x,i}^t, μ_y = (1/n) Σ_i C_{y,i}^t,
σ_x^2 = (1/n) Σ_i (C_{x,i}^t − μ_x)^2, σ_y^2 = (1/n) Σ_i (C_{y,i}^t − μ_y)^2. (11)
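Steps (2) and (3) — mapping block motion vectors to intensity and angle and building the angle histogram — can be sketched as below. The sector convention starting from the positive x direction follows the text; Q = 8 is an arbitrary example value:

```python
import math

def mv_polar(cx, cy):
    """Map a block motion vector (Cx, Cy) to its motion intensity rho
    and angle theta, as in the mapping of Figure 1 and equation (8)."""
    rho = math.hypot(cx, cy)
    theta = math.atan2(cy, cx) % (2 * math.pi)  # fold into [0, 2*pi)
    return rho, theta

def angle_histogram(vectors, q_bins=8):
    """Quantize the angles of a frame's motion vectors into Q equal
    sectors along the positive x direction and count per sector."""
    hist = [0] * q_bins
    width = 2 * math.pi / q_bins
    for cx, cy in vectors:
        _, theta = mv_polar(cx, cy)
        hist[min(int(theta / width), q_bins - 1)] += 1
    return hist
```

The ρ histogram of step (3) is built the same way after quantizing ρ into R intervals.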

Scientific Programming

In (11), C_{x,i}^t and C_{y,i}^t represent the components of the motion vector of the i-th macroblock in the x and y directions in a frame, and μ_x, μ_y, σ_x^2, and σ_y^2 represent the expectations and variances of the macroblock motion vectors in the x and y directions, respectively.

Extraction of Luminance Feature.
Assuming that the frame resolution is M × N, the frame is divided into blocks, y_i represents the brightness value of the i-th pixel in a block, and the average brightness value of each block is

ȳ = (1 / (h · v)) Σ_i y_i.

If y is used to represent the encoding value of the block luminance comparison, the encoding value of the luminance comparison result between the m-th block and the n-th block in the frame can be expressed by (12): for example, y_mn = 1 when the average brightness of block m is not less than that of block n, and y_mn = 0 otherwise. Through (12), the frames can be compared according to the average brightness of blocks and encoded with "1" and "0".
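A minimal sketch of the block-brightness comparison encoding follows. The convention that the "1" bit means the first block is at least as bright as the second is an assumption; the paper only says the comparison is coded with "1" and "0":

```python
def luminance_code(block_means):
    """Encode pairwise block-brightness comparisons as a '1'/'0' string:
    bit (m, n) is '1' when the average brightness of block m is at least
    that of block n (the >= direction is an assumed convention)."""
    bits = []
    n_blocks = len(block_means)
    for m in range(n_blocks):
        for n in range(m + 1, n_blocks):
            bits.append('1' if block_means[m] >= block_means[n] else '0')
    return ''.join(bits)

code = luminance_code([120.0, 80.0, 200.0])  # pairs (0,1), (0,2), (1,2)
```

The resulting bit string is a compact, illumination-ordering feature that two frames can be compared on directly.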

Color Feature Extraction.
Assuming that the frame size is M × N, the frame is converted into the HSV model and divided into blocks, with m ∈ {H, S, V}; then, the color characteristics of the sports training video are as follows:

μ_{m,n} = (1/P) Σ_i x_{i,m,n},
σ_{m,n}^2 = (1/P) Σ_i (x_{i,m,n} − μ_{m,n})^2,
S_{m,n} = ((1/P) Σ_i (x_{i,m,n} − μ_{m,n})^3)^{1/3},

where P is the number of pixels in a block; μ_{m,n}, σ_{m,n}^2, and S_{m,n} respectively represent the mean value, variance, and third-order moment of component m in the n-th block, computed over the pixel values x_{i,m,n}.
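The three color moments per block and per HSV component might be computed as below. Taking the signed cube root for the third-order moment is a common convention for color moments and is an assumption here:

```python
import math

def color_moments(values):
    """Mean, variance, and third-order moment of one HSV component's
    pixel values within one block, matching mu, sigma^2, and S in the
    color-feature equations."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    third = sum((v - mu) ** 3 for v in values) / n
    # signed cube root keeps the skew direction (assumed convention)
    s = math.copysign(abs(third) ** (1 / 3), third)
    return mu, var, s
```

Concatenating the three moments over all blocks and the H, S, and V channels gives the color feature vector of the frame.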

Texture Feature Extraction.
Let the sports training video image I have L gray levels. G denotes a gray-level co-occurrence matrix, and its element p_ij is the number of pixel pairs in I with gray levels i and j. p_ij is calculated as follows:

p_ij = #{(x, y) | f(x, y) = i, f(x + Δx, y + Δy) = j}, (14)

where f(x, y) is the gray level of the pixel (x, y), and Δx and Δy reflect the distance d and direction θ between the two points.
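The co-occurrence counting in (14) can be sketched directly. The offsets dx and dy encode the distance d and direction θ; the 2 × 2 image below is a toy example:

```python
def glcm(image, dx, dy, levels):
    """Gray-level co-occurrence matrix: p[i][j] counts pixel pairs where
    f(x, y) = i and f(x + dx, y + dy) = j, for an offset (dx, dy) that
    encodes the distance d and direction theta of equation (14)."""
    h, w = len(image), len(image[0])
    p = [[0] * levels for _ in range(levels)]
    for y in range(h):
        for x in range(w):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < w and 0 <= y2 < h:
                p[image[y][x]][image[y2][x2]] += 1
    return p

# 2x2 two-level image, horizontal offset (d = 1, theta = 0 degrees)
m = glcm([[0, 1], [1, 1]], 1, 0, 2)
```

Texture statistics such as energy, entropy, and contrast are then computed from the (normalized) matrix p.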
The most commonly used texture features, such as the energy, entropy, contrast, and correlation computed from G, are used as classification features of sports video.

(1) Convolution Layer. In a convolution layer, the features of the upper layer are convolved with a learnable convolution kernel, and the output features are obtained through an activation function [14]. Each output may combine the convolutions of multiple inputs:

x_j^l = f( Σ_{i ∈ M_j} x_i^{l−1} * k_{ij}^l + b_j^l ), (15)

where M_j represents the set of input feature maps connected to a convolution kernel and determines the connection between the convolution kernel and the input layer. The output feature map is obtained by convolving the input feature maps with the kernel. Assuming that each convolution kernel extracts one pattern, each output feature map corresponds to one feature and each convolution kernel is equivalent to one feature map. This is because the convolution layer uses weight sharing: each neuron uses the same convolution kernel to convolve the input, and each neuron is connected only to some of the input neurons, which reduces the number of convolution layer parameters. The function f is the activation function of the neurons, which is usually nonlinear. The input of a convolution layer is multiple two-dimensional planes, and each convolution kernel is connected to all input channels [15]. Convolution is performed in a three-dimensional space to obtain the position response output. Finally, the kernel convolves the whole input space to obtain a feature map. Usually, multiple convolution kernels are set in each convolution layer, and each kernel extracts different features, so that each feature map represents the feature plane extracted by the corresponding kernel.
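A minimal pure-Python sketch of the convolution layer just described: one output map per kernel, summed over the connected input maps, plus a bias and a ReLU activation. As in most CNN libraries, cross-correlation is used rather than flipped convolution, and connectivity here is all-to-all for simplicity:

```python
def conv2d_valid(x, k):
    """'Valid' 2D cross-correlation of one input feature map x with one
    kernel k (no padding, stride 1)."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + p][j + q] * k[p][q]
                 for p in range(kh) for q in range(kw))
             for j in range(ow)] for i in range(oh)]

def conv_layer(inputs, kernels, bias, f=lambda v: max(0.0, v)):
    """One output map per kernel, as in (15): sum the convolutions over
    the connected input maps M_j, add the bias, apply the activation f
    (ReLU here)."""
    outs = []
    for k, b in zip(kernels, bias):
        acc = None
        for x in inputs:
            c = conv2d_valid(x, k)
            acc = c if acc is None else [[a + v for a, v in zip(r1, r2)]
                                         for r1, r2 in zip(acc, c)]
        outs.append([[f(v + b) for v in row] for row in acc])
    return outs
```

Because the same kernel slides over every position (weight sharing), the parameter count is independent of the input resolution.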
(2) Down Sampling Layer. The purpose of the down sampling layer is to improve the robustness of the network to small deformations of the input samples, so as to enhance the generalization performance of the network. Let y_ijk represent the output of a neuron in the down sampling layer. The down sampling layer can be expressed as

y_ijk = Σ_{p,q} w_pq x_{i, j·s+p, k·s+q},

where w_pq is the normalized weighting window and s is the sampling stride; the window down-samples every input feature map without crossing different feature maps. The number of output feature maps in the down sampling layer is the same as the number of input feature maps, but the resolution of each feature map is reduced.

(3) Normalization Layer.
The normalization layer is very important for improving the performance of a neural network. In a convolutional neural network model, the normalization layer includes normalization over feature vectors within the same feature map and over features located in different feature maps; it strengthens feature maps with higher response values and drives different convolution kernels to learn different patterns [16, 17]. The subtractive normalization operation at a given location subtracts from the value at that location the weighted values of the pixels in its neighborhood; the weights can be determined by a Gaussian weighting window. Divisive normalization is a common normalization algorithm, which can intensify differences in response values and enhance the effect of highly responsive features.
Local response normalization is a common normalization algorithm in convolutional networks. The response value can be expressed as

b_{x,y}^i = a_{x,y}^i / ( k + α Σ_{j=max(0, i−n/2)}^{min(N−1, i+n/2)} (a_{x,y}^j)^2 )^β,

where a_{x,y}^i represents the value of the i-th input feature map at the coordinate (x, y); N represents the number of input feature maps; n denotes normalization over the adjacent n maps. The local response normalization layer contains three adjustable parameters, namely, the number of feature maps n and the parameters α and β. All normalization layers adopt the same parameter setting, such that n = 5, α = 0.0005, and β = 0.5.
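With the parameter setting above (k is not given in the text and is assumed to be 2, the AlexNet default), LRN across feature maps can be sketched as follows. For clarity it operates on one scalar activation per map; in a CNN the same formula is applied at every spatial position (x, y):

```python
def lrn(maps, k=2.0, n=5, alpha=0.0005, beta=0.5):
    """Local response normalization across feature maps: each activation
    a^i is divided by (k + alpha * sum of squares over the n adjacent
    maps)^beta. k = 2 is an assumed default; n, alpha, beta follow the
    paper's stated setting."""
    N = len(maps)
    out = []
    for i, a in enumerate(maps):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        s = sum(maps[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * s) ** beta)
    return out
```

Maps with strong neighbors are suppressed relative to locally dominant responses, which is exactly the "lateral inhibition" effect discussed later.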

(4) Fully Connected Layer.
The fully connected layer is usually at the top of the neural network and, together with the decision layer, forms a traditional multilayer perceptron that classifies the features extracted by the convolution layers. The overfitting of a convolutional neural network is mainly caused by the large number of parameters in the fully connected layer. Dropout is therefore usually added to the fully connected layer: some neurons are randomly selected to participate in training, which prevents the network from overfitting.
A multilayer convolutional neural network is composed of the above neuron layers, which perform different functions and must be combined according to certain rules to achieve good results. Among these layers, only the convolution layer and the fully connected layer contain trainable parameters. The convolution layer can retain the spatial position information of the input, which is required by the down sampling layer, so the convolution layer is usually alternated with the down sampling layer so that different convolution layers can extract features at different scales [18]. The fully connected layer destroys the position information of the feature planes and the differences between feature planes; it is usually used as part of the final multilayer perceptron classifier, which takes the features extracted by the convolution and down sampling layers and sends them to the decision layer for classification.

Structure of Improved Convolutional Neural Network.
e AlexNet convolutional neural network of deep learning is used to classify sports training videos.
e AlexNet convolutional neural network consists of 23 layers, including five convolution layers and three fully connected layers.

(1) Use the New Activation Function ReLU. Generally, the activation function of an artificial neuron is the hyperbolic tangent function f(x) = tanh(x) or the sigmoid function f(x) = (1 + e^{−x})^{−1}. Experiments show that when the sigmoid or hyperbolic tangent function is used to calculate the error gradient by backpropagation, the derivative involves division, which leads to a large amount of calculation; moreover, once the number of layers of a traditional neural network increases, the vanishing gradient problem occurs. The root cause is that with these functions the change of the function value slows down and the derivative approaches zero, so hidden layers far from the output layer are prone to vanishing gradients [19]. In addition, the need to add a weight penalty factor to obtain sparsity and the nonzero mean of its output are further disadvantages of the sigmoid function. The advantages of the ReLU function f(x) = max(0, x) are as follows: first, the calculation and convergence speeds are faster; second, ReLU outputs 0 when x < 0, producing network sparsity, reducing the interdependence of parameters, and alleviating the overfitting problem; third, its derivative is piecewise linear in both forward and backward propagation, avoiding the vanishing gradient.
(2) Local Response Normalization (LRN). In neurobiology, there is a concept called "lateral inhibition", which refers to the ability of excited neurons to inhibit their adjacent neurons, that is, to highlight the maximum peak in the local sensing area and increase the capability of biological perception.
In the neural network, the LRN layer realizes this "lateral inhibition". Let a_{xy}^i be the activation value of the neuron at position (x, y) of the i-th kernel function, b_{xy}^i be the activation value after normalization, and N be the total number of kernel functions; then, the mathematical model of LRN is expressed as follows:

b_{xy}^i = a_{xy}^i / ( k + α Σ_{j=max(0, i−n/2)}^{min(N−1, i+n/2)} (a_{xy}^j)^2 )^β,

where the sum is taken over the n kernel maps adjacent to position (x, y), and the hyperparameters k, n, α, and β need to be determined on the validation set. It is very effective to add an LRN layer after using the ReLU function as the activation function: the ReLU function has unlimited activation when x > 0, which needs LRN normalization. The LRN layer is expected to detect high-frequency features and amplify them by suppressing the surrounding neurons; the LRN layer also suppresses uniform responses in any given local neighborhood, that is, if all the values are large, normalization suppresses all of them uniformly. The purpose of the LRN layer is to make useful information more prominent by inhibiting and enhancing neuron outputs.

Event Matching.
Based on the output of the convolutional neural network, the events of the sports training test video sequence and the reference video sequence are matched by the event matching method. Given L_1 observation symbols of a video class, a multistate ergodic convolutional neural network model is trained using features extracted from sports training video frames, to obtain the event sequence (event probabilities and corresponding state transitions) in the corresponding reference video. The reference event sequence is used to create a dictionary for a given sports training event [20]. For an event with a specific state transition (k, l) in the reference, the probability distribution of the event is approximated by a Gaussian density function N(μ_kl, σ_kl), where μ_kl and σ_kl represent the mean value and variance of the density function, respectively. Each state transition is assigned a mean value and variance to represent the probability e_p^t(k, l) of the event occurring in that category. For sports training video clips that do not appear in the training stage, a reference convolutional neural network model is used to obtain the events. Let e_p^t(k, l) denote the event probability of state transition (k, l) at time t when the test sequence of observation symbols is evaluated against a reference model, and let L_2 denote the number of observation symbols in the test sequence. From these event probabilities, the similarity s between the test video clip and the reference model is computed. The similarity values between the video clip and all kinds of sports training are compared, and the clip is classified into the category with the highest similarity value.
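A sketch of the Gaussian event dictionary and a similarity score follows. The paper's exact similarity equation is not recoverable from the text, so the average log-likelihood of the test events under the reference Gaussians is used here as a stand-in scoring rule:

```python
import math

def gaussian_prob(e, mu, sigma):
    """Gaussian density N(mu, sigma) approximating the probability of an
    event with state transition (k, l) in the reference dictionary."""
    return (math.exp(-0.5 * ((e - mu) / sigma) ** 2)
            / (sigma * math.sqrt(2 * math.pi)))

def similarity(test_events, ref_params):
    """Average log-likelihood of the test sequence's event probabilities
    under one reference class's Gaussian dictionary. test_events is a
    list of ((k, l), e) pairs; ref_params maps (k, l) to (mu, sigma).
    The clip is assigned to the class with the highest score."""
    total = 0.0
    for trans, e in test_events:
        mu, sigma = ref_params[trans]
        total += math.log(gaussian_prob(e, mu, sigma) + 1e-12)
    return total / len(test_events)
```

A test clip whose event probabilities sit near a class's reference means scores higher for that class than a clip whose events deviate from them, which is the comparison rule described above.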

Results and Discussion
In order to verify the feasibility and effectiveness of the sports training video classification model, eight data sets often used in classification research on the network are selected as test objects. The data sets cover eight types of sports training videos, such as basketball, volleyball, and football. The detailed contents of the videos in each data set are shown in Table 1. Table 1 shows that the experimental data set contains many types of sports training videos. Sports training videos of different sizes and types are used to test the classification performance of the different models. The support vector machine model and the HMM model are selected as comparison models. Three models are used to classify the sports training videos of the 8 data sets, and the classification results are shown in Table 2.
The experimental results in Table 2 show that the classification of sports training videos can be realized by using the proposed model. The classification results of the proposed model are close to the actual categories of the sports training videos, which indicates that the model has high classification performance for sports training videos.
From the sports training video classification results of the proposed model, two images randomly captured from a basketball training video are shown in Figure 2.
As can be seen from the experimental results in Figure 2, the proposed model can accurately classify basketball training videos according to the extracted features, and the randomly captured pictures are all correctly identified as basketball training, which verifies that the proposed model classifies sports training videos effectively. The classification accuracy, recall rate, and precision rate are selected as the important indexes to evaluate the classification performance of the proposed model. n_c represents the number of correct recognition results, n_m represents the number of wrong recognition results, and n_f represents the number of failed recognition results. In order to effectively reduce the error caused by a single experiment, the average value of five experiments is taken, and the training and test samples are randomly divided at a 2 : 1 ratio. The evaluation index equations are as follows:

recall ratio = n_c / (n_c + n_m),
precision ratio = n_c / (n_c + n_f).

(22)
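The two indexes in (22) can be computed directly:

```python
def classification_metrics(n_c, n_m, n_f):
    """Recall and precision as defined in (22): n_c correct, n_m wrong,
    and n_f failed recognition results."""
    recall = n_c / (n_c + n_m)
    precision = n_c / (n_c + n_f)
    return recall, precision

r, p = classification_metrics(98, 1, 2)  # recall 98/99, precision 98/100
```

Averaging these values over the five repeated experiments gives the figures reported in Figures 3-5.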
Statistics of the accuracy comparison results of different data sets and different types of sports training video classification are shown in Figure 3.
As can be seen from the experimental results in Figure 3, under different data sets and different types of sports training, the classification accuracy of the proposed model is higher than 99% and is significantly higher than that of the other two models, which effectively verifies that the proposed model achieves higher accuracy in sports training video classification.
Statistics of the recall rate comparison results of different data sets and different types of sports training video classification are shown in Figure 4.
As can be seen from the experimental results in Figure 4, under different data sets and different types of sports training, the recall rate of the proposed model is higher than 98.5% and is significantly higher than that of the other two models, which verifies that the proposed model achieves a higher recall rate in sports training video classification.
Statistics of the precision rate comparison results of different data sets and different types of sports training video classification are shown in Figure 5.
As can be seen from the experimental results in Figure 5, under different data sets and different types of sports training, the precision rate of the proposed model is higher than 98% and is significantly higher than that of the other two models, which verifies the high precision of sports training video classification using the proposed model. The analysis of the above experimental results shows that the proposed model achieves the best classification accuracy, recall rate, and precision rate across different data sets and different types of sports training videos. Basketball and football training videos have strong continuity and change frequently, so more quantitative features are needed to capture the changes in these videos. Basketball and volleyball videos usually contain close-range images, while baseball and tennis videos are mostly shot from a long-distance perspective, making feature extraction difficult. Football is also shot from a long-distance perspective; it has continuous movement on the field and can be captured well by increasing the number of states. Football and basketball videos mostly use a single camera to track players or regions of interest; unlike other sports training, frequent switching between multiple cameras is conducive to event detection. The model can effectively handle the randomness of target movement and improve classification accuracy by extracting video features.
The training time and test time of the three models for classifying sports training videos on different data sets are counted; the comparison results are shown in Table 3. The results in Table 3 show that the classification speed of the proposed model is the fastest: accurate classification results can be obtained with shorter training and test times, which verifies that the model has higher classification efficiency for sports training videos. The above experimental results show that the proposed model can accurately classify all kinds of sports training videos, which shows that this model has good classification performance.

[Table data; column headings not preserved in the source]
Basketball: 1481, 1473, 1352, 1376
Badminton: 2384, 2415, 2384, 2584
Football: 3436, 3418, 3364, 3468
Running: 2564, 2542, 2498, 2348
Table tennis: 1765, 1759, 1743, 1842
Snooker: 3652, 3627, 3584, 3452
Tennis: 2755, 2711, 2684, 2679
Volleyball: 1546, 1638, 1724, 1711
Total: 19583, 19583, 19333

Conclusion
With the passage of time, the amount of sports training video data on the Internet is growing rapidly. In order to manage and retrieve sports training videos effectively, accurate classification is very important. Aiming at the shortcomings of existing approaches to sports training video classification, this paper establishes a sports training video classification model based on a deep learning method. A convolutional neural network with deep learning is used for classification in the proposed research. After classification, an event matching operation is performed, and video classification is realized according to similarity. The experimental results show that the proposed model can effectively identify all kinds of sports training videos and accurately detect the occurrence of events through the convolutional neural network, so as to achieve high-precision classification of sports training videos. Compared with other models, the proposed model has the advantages of simple implementation, fast processing speed, high classification accuracy, and strong generalization ability and adaptability.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.