Volleyball Movement Standardization Recognition Model Based on Convolutional Neural Network

Artificial intelligence and deep learning have attracted much attention from researchers in industry and academia. The volleyball movement standardization and recognition model involve the application of artificial intelligence and deep learning. In order to solve the problem that human action in volleyball video is continuous and effective spatial and temporal features need to be extracted from the video stream, the Inception module is decoupled and heterogeneous, replacing the original 5 × 5 convolutional structures with two 3 × 3 convolutional structures, as well as replacing the 3 × 3 convolutional structures with 1 × 3 and a 3 × 1 convolutional structure with internal parameter optimization to ensure the accuracy of recognition. The model uses the input motion video RGB map as the spatial input and the optical flow map as the temporal input, and the two are weighted 1 : 1 for feature fusion. Experiments are conducted on the volleyball action video and homemade dataset in UCF101, and the experimental data show that the accuracy of the DNet volleyball action standardization recognition model proposed in this paper is 94.12%, which proves that the method improves the recognition ability of the model while speeding up the training speed. The research presented in this paper provides important theoretical guidance for artificial intelligence and deep learning.


Introduction
Artifcial intelligence and deep learning have attracted much attention from researchers in industry and academia. Te volleyball movement standardization and recognition model involve the application of artifcial intelligence and deep learning. Te research presented in this paper provides important theoretical guidance for artifcial intelligence and deep learning. Based on the computer vision platform to achieve volleyball action standard recognition, which is an important trend in the application and development of artifcial intelligence in sports with a very wide range of application prospects, the surveillance camera in real-time under the human action recognition technology support to identify the volleyball player's action standardization domain and record and make a warning can efectively improve the volleyball player's sports efciency and volleyball sports level. In addition, through the analysis of the volleyball sports video, it can be clearly derived from the video whether player action standardization, volleyball sports tactical layout, and pointto-point analysis have an important role.
With the increasing maturity of convolutional neural network technology, human action recognition based on convolutional neural networks is widely concerned by domestic and foreign research scholars. Technical action detection in sports videos is an important application in the feld of computer vision in sports, among which volleyball has distinct technical characteristics and has signifcant advantages for individual human action technology recognition and classifcation. Trough the recognition and classifcation of both players' action techniques in sports videos, it can efectively improve the technical analysis ability and technical level of volleyball players. Terefore, the convolutional neural network volleyball action standard type recognition model has important research signifcance and application value.
For artifcial intelligence and deep learning, in convolutional neural network-based volleyball action criterion recognition models, the extraction of efective action features as well as representation features from video sequences is an important part of action analysis, retrieval, and recognition, which directly afects the accuracy and robustness of its results. In the past decades, many video action recognition methods and action recognition datasets have emerged. With the successful application of convolutional neural networks in the image domain (image classifcation [1], target detection [2], and scene classifcation [3]), research on the use of convolutional neural networks for video action recognition has started to emerge, and successful applications from images to videos illustrate the superiority of convolutional neural network models. Terefore, convolutional neural network-based action recognition has become an active research area, and its algorithmic framework is shown in Figure 1.

Contribution
As an important application project of computer vision technology in the feld of sports, volleyball player action standardness recognition has been a challenging research topic in the feld of computer vision, and the volleyball standardness action recognition model not only requires the convolutional neural network to extract spatial information such as motion background and character appearance scene in the image but also needs to efectively extract motion information in the temporal dimension of the video. In summary, this paper combines the idea of a spatio-temporal dual channel and constructs a dual-stream DNet convolutional neural network volleyball action standardness recognition model based on Inception [4], which uses the idea of weighted fusion to connect the extracted spatio-temporal features so as to improve the training speed of the model and the accuracy of volleyball action standardness recognition. In addition to theoretical contributions, this paper also has some practical signifcance. As volleyball is a mainstream sport for the public, action recognition for volleyball players and volleyball fans can help improve people's ability to use volleyball and enhance the public's love for volleyball. Tis paper also has some general applicability and can be applied to other sports, in addition to volleyball, for action recognition.

Literature Review
For artifcial intelligence and deep learning-aided volleyball movement standardization recognition models, human action recognition can be divided into behavior recognition based on traditional manual features, and behavior recognition based on depth features according to the way of feature extraction division. Te specifc division of behavior recognition is shown in Figure 2. Traditional feature extraction methods are generally designed manually to characterize action features through manual observation and design. Compared with traditional methods, deep learning methods do not use manual initiative to extract features, retain more valuable information in the video, and generally outperform traditional methods in terms of efectiveness. Deep learning methods applied to human behavior recognition should not only use the spatial information of the video but also the temporal information of the video, which is also the focus of the method research.
Te essence of action recognition in volleyball belongs to the category of human action recognition classifcation. Convolutional neural networks were introduced to the video domain for action recognition. In 2014, Simonyan and Zisserman [5] proposed the frst human action recognition method based on dual-stream neural networks, in which ImageNet was used to pretrain, RGB single frames and optical fow maps were used to train the dualstream network, and fnally fused for action recognition. Te method makes full use of spatio-temporal feature information, improves the accuracy of action recognition, and lays a theoretical foundation for volleyball action recognition. In 2016, Wang et al. [6] proposed the time segment network (TSN), combining a sparse temporal sampling strategy with a video supervision method based on the classical dual-stream network. TSN is also composed of a spatial stream convolutional network and a temporal stream convolutional network. But unlike twostream, which uses a single frame or a single stack of frames, TSN uses a series of short segments sparsely sampled from the entire video, each of which will give its own initial prediction of the behavior class, and the "consensus" of these segments to obtain the video-level prediction results. 2014 LSVRC challenge winner Goo-gleNet [7] network was proposed, with its structure of adding multiple Inception network models to the traditional deep convolutional neural network. Te second defciency is compensated by cross-practice, regularization, and data expansion techniques, and the network structure is named the temporal segmentation network (TSN). Due to the recent successful application of residual networks (ResNets) in deep network training [8], a novel spatial-temporal residual network model has been proposed that combines ResNets and a dual-fow model, which learns the spatio-temporal characteristics of the behavior hierarchically by connecting the residuals of the air-domain and time-domain fows.

Convolutional Double-Stream Volleyball Action Standard Type Recognition Model Construction.
For artifcial intelligence and deep learning, the structure of a dual-stream DNet network can be divided into two feature extraction modes: spatial stream channel and temporal stream channel. Te advantage of a DNet-based dual-stream convolutional neural network model over a traditional convolutional neural network is that it captures temporal information features in video data more clearly and efciently. For singleframe RGB, spatial information generally expresses features mainly in the form of coordinate positions of things and fxed scenes, while for temporal information, more information about the continuous motion of the target can be conveyed in the form of the continuous motion of multiple optical fow frames. Both spatial and temporal stream channels are trained by applying the same network to the feature images, and subsequently, action recognition is achieved through spatio-temporal feature fusion.
Te specifc network structure of the dual-stream DNet network is shown in Figure 3, which uses a twobranch network architecture to capture the spatial and temporal information of the video, respectively. Te air domain uses RGB images as input to extract appearance features, and the time domain uses optical fow information as input to extract temporal features, and then the two behavior recognition datasets are classifed by weighted feature fusion of the two-channel features and by the multitask training method to remove overftting and thus obtain better results.

GoogleNet Network Model.
With artifcial intelligence and deep learning, GoogleNet is a deep neural network model based on the Inception module introduced by Google, which won the ImageNet competition in 2014. GoogleNet difers from previous networks such as AlexNet4 and VGGl6 by increasing the depth of the network to obtain better training results, which deepens the core structure of the GoogleNet network since its Inception, as shown in Figure 4. Inception is to take multiple convolutional or pooling operations and put them together to assemble a network module, and design the neural network to assemble the whole network structure in a module. In networks that do not use this approach, we often layer only one operation, such as convolution or pooling [9], and the convolution kernel size for the convolution operation is also fxed size. However, in practical situations, diferent sizes of  Computational Intelligence and Neuroscience convolutional kernels are needed at diferent scales so as to obtain the best performance. Moreover, for the same picture, diferent sizes of convolution kernels perform diferently because they have diferent perceptual felds. Terefore, it is necessary to let the network choose the appropriate perceptual feld by itself, and Inception can meet such a need. An Inception module provides multiple convolutional kernel operations in parallel, and the network chooses to use them by adjusting the parameters in the training process. Te whole Inception structure is a series of multiple Inception modules. Te two main contributions of the Inception structure are the use of 1 × 1 convolution for lifting and dimensioning, as well as the simultaneous convolution and reaggregation at multiple dimensions. In Inception, a single convolutional kernel is changed to a model with three parallel convolutional kernels (1 × 1, 3 × 3, 5 × 5) and a single pooling layer. Each path is followed by a 1 × 1 convolution operation immediately before the convolution operation or after the pooling operation. More convolutions can be superimposed on the same size perceptual feld to extract richer features, and the 1 × 1 convolution can also achieve the efect of dimensionality reduction, which reduces the computational complexity and improves the performance of the network. Terefore, in this paper, the GoogleNet network is selected as the base network for volleyball movement standardization and recognition.

Construction of DNet Network Model.
With artifcial intelligence and deep learning in this paper, we will use the improved GoogleNet, i.e., DNet, for volleyball action standardization and recognition research, and DNet optimizes the internal structure of the Inception module.
Te DNet network structure improves the original three groups of parallel convolutional operations, named shallow EnInception-1 module, middle EnInception-2 module, and deep EnInception-3 module, respectively. Since the convolutional neural network with high accuracy, multiple parameters, and large computation has the problem of landing, the original convolutional structure is decoupled to improve the speed and accuracy of the deep convolutional neural network. In this paper, the original 5 × 5 convolutional structures of Inception-1 are decomposed into two 3 × 3 convolutional channels to reduce the model's computational parameters. In addition, the 3 × 3 convolutional structure of all Inceptions is modifed into an asymmetric [10] convolutional pattern, i.e., two convolutional blocks of 1 × 3 and 3 × 1 are used instead of the original 3 × 3 convolutional structure to increase the stability of the model.   Computational Intelligence and Neuroscience After improving the shallow Inception-1 module, the mid-layer Inception-2 module undergoes a larger structural improvement. Since the mid-layer module is more important for feature extraction than the shallow layer, the complete convolution operation is retained to make feature extraction more signifcant while accelerating the computational speed. First, the 5 × 5 sizes convolutional kernel in the traditional middle layer of the Inception-2 module structure is improved. A new convolutional layer is formed by connecting two 3 × 3 sized convolutional kernels in sequence, replacing the original large convolutional kernel structure with a new small convolutional network. Te two "3 × 3" size convolutional kernels, which are closer to the output graph, are also replaced by "3 × 1" and "1 × 3" convolutional kernels in the middle Inception-2 module structure, as shown in the dashed part of Figure 5. Te flter banks in the module are then extended to make the network structure wider rather than deeper, which reduces the dimensionality and results in a better performance of the neural network, which is more suitable for high-dimensional features such as the amplitude of the fne human movements in volleyball. Terefore, while the computation is faster, the accuracy of the model after adjustment is higher compared with the previous one, as shown in Figure 5. For the deeper Inception-3 structure, the original structure of the model is retained in this paper because the spatial concentration of the convolutional kernels decreases in the deeper layers of the network and the relatively larger convolutional kernels extract more abstract features and are, therefore, suitable for application in the deeper layers of the network.
In addition, an auxiliary classifcation layer batch is introduced to improve the convergence of the very deep network. Te aim is to push useful gradients to lower layers to make them immediately useful and to improve convergence during training by resisting the vanishing gradient problem in very deep networks. Te auxiliary classifer also plays the role of a regularization term. Tis is because the main classifer of the network performs better if the side branches are batch normalized or have a discarded layer. Tis also gives weak supporting evidence for the speculation that batch normalization acts as a regularization term [11]. Figure 6 shows the traditional pooling method, which loses the information on the feature map, and the right panel shows the process of enlarging the feature map before pooling, which has the problem of computing too much brightness. In order to balance the gain and loss of both, one-half of the feature map is obtained by convolution frst and one-half of the feature map is obtained by pooling, and fnally stitching is performed, as shown in Figure 7. Tis Inception module is used for 35 × 35 down to 17 × 17 and 17 × 17 down to 8 × 8. Te Inception module reduces the grid size while expanding the flter set. It is fast and avoids performance bottlenecks. [12] Layer and Choose the Optimal Ratio. Te two improved shallow Inception modules, the fve middle Inception-4 modules with optimized structure, and the two high Inception-5 modules with larger convolutional kernels were sequentially entered after the maximum pooling operation, followed by a global average pooling, and the optimal dropout ratio of 0.5 was chosen by introducing the dropout layer at the fully connected layer. [13] Classifer Loses. Since this classifer has a larger class spacing and smaller intraclass distance for diferent action features, the AM-Softmax classifer with algorithm improvement based on Softmax [14] is chosen to be applied to volleyball action recognition, which makes the local action classifcation efect more obvious. Te abovementioned improved DNet network for the GoogleNet network has more optimized performance compared with the original basic network.

Spatio-Temporal Feature Fusion Design.
In this paper, the spatio-temporal fusion strategy chosen for action recognition of volleyball movement in the video is designed, and the fusion approach in the classifcation layer is selected for feature fusion. Te fusion approach for the classifcation layer of the dual-stream network is to perform feature fusion in the later classifcation layer after the fully connected layer. Te dual-stream network extracts the spatial and temporal features in the motion video separately, and the weights are shared between the two networks. After convolution, pooling, and fully connected layer operations in the DNet network, the features of the two streams are fused in the classifcation layer, as shown in Figure 8.
In this paper, the weighted fusion method is used for feature fusion at the classifcation layer. For the weighted fusion method, it can be expressed as summing spatial features and temporal features after defning diferent weight assignments. For feature fusion, the temporal fow network and the spatial fow network are fused using the weighted fusion method with a weighted fusion weight ratio of 5 : 5 for time: space, and the fusion timing is chosen for fusion at the classifcation layer.

Introduction to the Dataset.
Te experimental data in this paper are the public dataset of UCF101 [15] and the homemade volleyball sports dataset, in which the video content of the homemade dataset includes a large number of volleyball standard action demonstrations with various angles in the video, various changes in lighting information, low pixels of the video, and a short video length. Te UCF101 public is an upgraded version of the UCF50 [16]    contents include many single volleyball basic actions and volleyball sparring actions, with various angles in the videos, various changes of lighting information, low-pixel videos, and short video length. According to the analysis requirements of the technical characteristics of volleyball standard movements, the input data are divided into four categories: (1) serving; (2) matting; (3) passing; (4) dunking.

Introduction to the Experimental Environment.
Te experimental hardware environment is shown in Table 1.
Te experimental environment in this paper is the PyTorch platform, in which the input dataset UCF101 and the homemade dataset are processed in RGB and optical fow formats, and for the GNet spatial branch, RGB format images with spatial step size step � 4 are used, and for the GNet temporal fow branch, T � 8 continuous optical fow images are overlaid. Te size of each iteration in the model training process is 100, and the learning rate is 0.001. A total of 45 iterations are performed in this experiment.

Evaluation Indicators.
Commonly used evaluation metrics are accuracy, detection rate, and recall, where accuracy represents the proportion of samples with correct predictions, as shown in the following equation: where TP and TN are the numbers of positive samples correctly identifed as positive and negative samples correctly identifed as negative, respectively, and N is the number of samples tested. Te detection rate is for a specifc category, as shown in the following equation: where N is the number of all detected target frames of a specifc class, and the recall is calculated, which represents the ratio of correctly predicted target frames to all ground truths, as shown in the following equation: To address the diferent requirements for the location deviation of target detection in diferent scenarios, an IOU threshold is usually given, above which the detection is considered successful. As well as considering the problem of category balancing, the performance of each category is usually found separately and then averaged for each category.
First, all the detection results are sorted, with higher scores being higher, and then the detection is judged to be successful in turn. First, fnd the intersection ratio of detection results (dets) and real targets (gt), i.e., IOU, and fnd the real target with the maximum IOU value for each detection result; if this maximum IOU exceeds a certain threshold, the det is matched against this gt and considered TP, noting that each gt can only be matched once. If the IOU of a det to all gt does not exceed the threshold or the gt with the maximum IOU exceeding the threshold has been matched, it is regarded as FP. After traversing all the dets from high to low, if there is still a gt that has not been matched, it is regarded as FN.
After getting TP, FP, and FN, the precision and recall can be calculated. But diferent tasks have diferent requirements for precision and recall; some tasks require higher recall and wrong detection is acceptable, while other tasks require higher precision and missed detection is acceptable. To evaluate recall and precision as a whole, we select the frst n of the ranked det (n � 1, 2, 3, ...) resulting in a recall-precision graph, and obtaining the AP value by calculating the area under this curve. Te AP values for each category are averaged to obtain mAP.

Description of Experimental Results.
Te experimental results are shown in Table 2. Figure 9 shows the line graph of the DNet recognition rate, where the horizontal coordinates represent each mainstream algorithm and the vertical coordinates represent the mAP values of each algorithm on the UCF101 dataset. From the experimental results, it can be analyzed that using the weighted fusion method by setting the weight ratio to time: space 5 : 5, decoupling and heterogeneity of the GoogleNet three-layer convolutional kernel, after using the auxiliary classifcation layer for fusion, the average accuracy of volleyball action standardization recognition of the feature fusion dual-stream DNet convolutional neural network reached 94.12% at this time, which was higher than other network models on the 80.45% on the homemade dataset VolleyballSis, which is higher than other network models. For the model structure of the algorithm, most of the algorithm data modalities used in this paper are RGB and optical stream OF. In the to-stream structure, DNet uses RGB and OF as input, respectively. On this basis, the three-layer branching Inception structure is split, and the original 5 × 5 convolutional structure is split into two 3 × 3 convolutional kernels to reduce the computational parameters and improve the computational speed. In addition, the heterogeneous convolutional network structure is introduced, and the 3 × 3 convolutional model is further split into two convolutional models of 1 × 3 and 3 × 1 to improve the accuracy of the algorithm model without increasing the computational efort, and good results are obtained.

Conclusion and Suggestion
Artifcial intelligence and deep learning have attracted much attention from researchers in industry and academia. Te volleyball movement standardization and recognition model involve the application of artifcial intelligence and deep learning. Te research presented in this paper provides important theoretical guidance for artifcial intelligence and deep learning. Te heterogeneous DNet convolutional neural network with spatio-temporal feature fusion for volleyball action standardization recognition proposed in this paper greatly reduces the training parameters of the model, thus achieving the goal of improving the model speed. Te introduction of heterogeneous convolution also ensures the stability problem of the model with constant parameters. It efectively improves the recognition ability and generalization ability of the model. However, there are still some problems and difculties in volleyball action, and standardization recognition.
(1) Strong dependence of the model on the motion viewpoint problem. When the model is applied to the actual volleyball sports scenario, most of the captured videos are collected from random viewpoints. Terefore, in the actual application scenario of the model, the model is required to decouple the motion viewpoint and reduce the dependence problem of the model on the motion viewpoint. Nowadays, most of the data sets are focused on a single viewpoint or captured by a specifc viewpoint, so the future improvement algorithm can cope with the change of complex viewpoints and handle multiple random viewpoints. (2) Weakly supervised algorithm problem.
Supervised learning-based behavior recognition relies heavily on the annotation of video action tags, and the annotation of video images can consume a lot of resources. With the continuous development of deep learning, the cost of video image labeling becomes more and more expensive. How to use the algorithm itself to perform image labeling has become a hot topic of research nowadays. It is based on weakly supervised or unsupervised volleyball action standard-type recognition, which can be classifed without manual labeling or with only a small amount of labeling.

Data Availability
Te data used to support the fndings of this study are available from the corresponding author upon request.

Conflicts of Interest
Te authors declare that there are no conficts of interest.