Target Tracking Algorithm in Football Match Video Based on Deep Learning


Introduction
In modern life, sports video is an important type of video that is popular among the majority of viewers. It accounts for a large proportion of existing TV programs and Internet content [1,2]. With the continuous improvement of people's quality of life and the rapid advancement of technology, people's demands on sports video are also rising. In terms of sports competition viewing, passive and flat viewing methods will gradually fail to meet the requirements of TV viewers [3][4][5]. The broadcasters need to add various visual effects to meet the visual requirements of the audience. In terms of game analysis, the team coach needs to extract relevant data from the football game video to assist in researching tactics. In terms of commercial applications, the broadcasters also need to explore more fully the commercial value of the football game broadcast. All of these require analyzing the video data of the game and processing the game images according to different requirements [6][7][8]. Among the many sports competitions, football matches have the largest number of viewers and the highest level of attention. Therefore, the detection, extraction, location, and tracking of moving targets in the video have high practical significance. The extraction and tracking of targets in football game video are hot spots in current image and video processing in the sports field. The technology required covers many areas of image processing, image analysis, and computer vision. In general, a football match video scene consists of background and targets, where the targets are the important part of the video and contain important information. Therefore, quickly and efficiently segmenting objects in video and tracking the target of interest are the bases for subsequent image analysis [9].
Although research on target tracking has made great progress, robust target tracking remains challenging due to the complexity of the environment and the influence of target deformation. The core problem of target tracking is feature representation. Early features were selected manually according to the application scenario, but their effect is far from meeting actual needs. Since the advent of deep learning technology, the field of computer vision has developed rapidly, and deep learning techniques were first used for image classification problems [10][11][12]. In recent years, multi-target tracking algorithms based on deep learning have also made some breakthroughs. Multi-target tracking is a very challenging research direction in the field of computer vision and has a wide range of practical applications, for example, intelligent video monitoring, abnormal behavior analysis, and mobile robot research. Traditional multi-target tracking algorithms tend to have poor tracking performance due to poor target detection. Detectors based on deep learning can achieve better detection results, which in turn improves tracking accuracy. Therefore, how to combine target tracking and deep learning effectively has become the focus of researchers.
Traditional target tracking algorithms have many problems. For example, the block-based scale-adaptive CSK rigid body target tracking algorithm proposed in reference 4 does not consider the confidence of the candidate box, and its tracking result has low precision. The KCF (kernelized correlation filter)-based tracking algorithm in reference 5 is only applicable to single-target tracking and therefore has limited applicability. The TLD tracking algorithm proposed in reference 6 loses the target when the target is severely occluded. A new target tracking algorithm that combines SIFT (scale-invariant feature transform) features and compression features is proposed in reference 7. This algorithm extracts features poorly, which results in a poor center position error and coverage and a higher resource occupancy rate.
Deep learning is a new research direction in the field of machine learning. It was introduced into machine learning to bring it closer to its original goal, artificial intelligence (AI). Deep learning learns the internal laws and representation levels of sample data. The information obtained in the learning process is very helpful for the interpretation of data such as text, images, and sound. Its ultimate goal is to give machines the ability to analyze and learn like human beings, so that they can recognize characters, images, sounds, and other data. Deep learning is a complex machine learning algorithm, and its results in speech and image recognition far exceed those of previous related technologies.
In response to the above problems, a target tracking algorithm for football match video based on deep learning is proposed in this paper. In the target detection algorithm based on GoogLeNet + LSTM, GoogLeNet performs convolution to obtain the feature map array. After processing, high-confidence candidate boxes are obtained for training and matching, which achieves target detection. The feature map of the detection result is pooled to obtain the depth features required for tracking. Based on these features, the discriminative scale space tracking algorithm and the Markov chain Monte Carlo algorithm are used to achieve single-target or multi-target tracking.

Target Detection Based on GoogLeNet + LSTM.
Target detection is the basis of the multi-target tracking algorithm based on data association. The GoogLeNet + LSTM framework is used for target detection to handle problems such as small targets and occlusions in football video. First, GoogLeNet is used for convolution. In the last layer, a 1 × 1024 × 15 × 20 feature map array is obtained and transposed into a 300 × 1024 feature map array. Each 1024-dimensional vector corresponds to a 139 × 139 area in the original picture. Each 1024-dimensional vector of the 300 × 1024 array is then processed in parallel by the LSTM sub-module. The hidden state of each output goes through two different fully connected layers: one directly outputs the position and width of the box, and the other outputs the confidence of this box through a softmax layer. The LSTM sub-module has a total of five such units; that is, each input can predict 5 boxes and their confidences. In training, the boxes are concentrated in the 64 × 64 region at the center of the receptive field, and they are ranked by confidence from high to low.
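The reshaping step can be illustrated with a minimal sketch. It assumes that the batch dimension of size 1 has already been dropped and that grid cells are enumerated in row-major order; the function name to_cell_vectors and the dummy data are hypothetical and are not taken from the detector itself.

```python
from typing import List

CHANNELS, ROWS, COLS = 1024, 15, 20  # sizes quoted in the text


def to_cell_vectors(fmap: List[List[List[float]]]) -> List[List[float]]:
    """Turn a [channel][row][col] map into one channel vector per grid cell.

    The output row index is row * cols + col (row-major order, an assumption).
    """
    channels, rows, cols = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [fmap[ch][r][c] for ch in range(channels)]
        for r in range(rows)
        for c in range(cols)
    ]


# Dummy feature map whose value encodes its own (channel, row, col) index.
dummy = [[[ch * 1000 + r * COLS + c for c in range(COLS)] for r in range(ROWS)]
         for ch in range(CHANNELS)]
cells = to_cell_vectors(dummy)
print(len(cells), len(cells[0]))  # 300 1024
```

Each of the 300 rows is the 1024-dimensional vector that would be passed to the LSTM sub-module for one grid cell.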
After processing, five detection boxes, each corresponding to a 64 × 64 block in the original image, and their confidence levels are obtained. The detection boxes produced by the sub-module over the whole video frame then need to be filtered, and boxes with low confidence are removed using a given threshold. Finally, the detection result is obtained [13][14][15][16][17]. The specific process is as follows: if a candidate box intersects an already confirmed box, the candidate box is removed. A confirmed box removes at most one candidate box. In this matching, the cost is expressed as a pair (m, d), where m ∈ {0, 1} indicates whether the two boxes intersect and d is the Manhattan distance between the two boxes. m takes priority over d; that is, two matching schemes are first compared by m, and d is compared only if m does not decide. The Hungarian algorithm is used to find the least costly match [18][19][20]. For example, if the confidence threshold of the filter is 0.5, boxes with confidence below 0.5 are removed.
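The filtering rule can be sketched as follows, under several assumptions that the text leaves open: m is taken as 0 for intersecting boxes so that they are matched first, d is measured between box centers, the costs of a matching are summed component-wise and compared lexicographically, and a brute-force search over assignments stands in for the Hungarian algorithm (it is only practical for a handful of boxes). All function names are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def intersects(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def manhattan(a: Box, b: Box) -> float:
    """Manhattan distance between the box centers (assumed definition of d)."""
    return (abs((a[0] + a[2]) - (b[0] + b[2])) + abs((a[1] + a[3]) - (b[1] + b[3]))) / 2


def pair_cost(confirmed: Box, candidate: Box) -> Tuple[int, float]:
    """Cost (m, d); m = 0 for intersecting boxes is an assumption."""
    return (0 if intersects(confirmed, candidate) else 1, manhattan(confirmed, candidate))


def best_matching(confirmed: List[Box], candidates: List[Box]) -> List[Tuple[int, int]]:
    """Cheapest one-to-one matching as (confirmed index, candidate index) pairs.

    Brute force over assignments, standing in for the Hungarian algorithm.
    """
    flipped = len(confirmed) > len(candidates)
    small, large = (candidates, confirmed) if flipped else (confirmed, candidates)
    best_pairs: List[Tuple[int, int]] = []
    best_cost = None
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(j, i) if flipped else (i, j) for i, j in enumerate(perm)]
        costs = [pair_cost(confirmed[a], candidates[b]) for a, b in pairs]
        total = (sum(m for m, _ in costs), sum(d for _, d in costs))
        if best_cost is None or total < best_cost:  # tuples compare m first, then d
            best_pairs, best_cost = pairs, total
    return best_pairs


def filter_candidates(confirmed, candidates, scores, threshold=0.5):
    """Indices of candidates kept after overlap removal and confidence thresholding."""
    removed = {b for a, b in best_matching(confirmed, candidates)
               if intersects(confirmed[a], candidates[b])}
    return [j for j, s in enumerate(scores) if j not in removed and s >= threshold]


confirmed = [(0, 0, 10, 10)]
candidates = [(2, 2, 12, 12), (50, 50, 60, 60), (4, 4, 14, 14), (100, 100, 110, 110)]
scores = [0.9, 0.8, 0.7, 0.3]
print(filter_candidates(confirmed, candidates, scores))  # [1, 2]
```

In this example, candidate 0 is removed by the confirmed box, candidate 2 survives even though it also overlaps (a confirmed box removes at most one candidate), and candidate 3 falls below the 0.5 threshold.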
In order to train the target detection model effectively, the following training method is adopted. More candidate boxes are obtained in the LSTM sub-module [21,22], but they contain detection errors. There are three types of errors [23,24]: (1) A location that is not a person's head is reported as a target. (2) The predicted box position differs from the ground-truth box position. (3) Multiple prediction boxes are generated for the same target.
A lower confidence level is assigned to such a candidate box to prevent Case 1. The position error is corrected to avoid Case 2. Lower confidence is given to the redundant prediction boxes generated for the same target to eliminate Case 3. In the loss function used for model training, G is the set of ground-truth boxes, C is the set of candidate boxes, f is the matching algorithm, $b_{pos}^{i}$ denotes the i-th ground-truth box, $b_{pos}^{j}$ denotes the j-th candidate box, $l_{pos}$ is the Manhattan distance between the two, and $l_{c}$ is the cross-entropy loss, which is the softmax loss in the corresponding network [25][26][27]. The first term of this loss function represents the position error between a candidate box and its matched ground-truth box, the second term represents the confidence of the candidate box, and α adjusts the balance between the two losses. The matching algorithm is the Hungarian algorithm. In the comparison function it uses, $o_{ij} \in \{0, 1\}$: if the center of the candidate box falls inside the ground-truth box, $o_{ij}$ is 0; otherwise it is 1. $r_{j}$ is the sequence number with which the candidate box was generated. The goal is for high-confidence boxes to be generated first when matching [28][29][30]. Therefore, when matching the same target, the lower the rank $r_{j}$ is, the lower the cost is [31,32]; $d_{ij}$ is the distance between the two boxes, that is, the distance error. The target detection results in the football match video are obtained by the detection algorithm based on GoogLeNet + LSTM. Based on these results, deep learning is used to extract depth features.
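The displayed forms of the loss function and of the comparison function are not reproduced in the text above. A form consistent with the definitions given there would be the following. The tilde that distinguishes candidate boxes from ground-truth boxes, the confidence output $\tilde{b}_{c}^{j}$, and the match indicator $y_{j}$ are notation assumed here for illustration, and the lexicographic comparison of the tuple is likewise an assumption.

```latex
L(G, C, f) = \alpha \sum_{i=1}^{|G|} l_{pos}\!\left(b_{pos}^{i},\, \tilde{b}_{pos}^{f(i)}\right)
           + \sum_{j=1}^{|C|} l_{c}\!\left(\tilde{b}_{c}^{j},\, y_{j}\right),
\qquad y_{j} = \mathbb{1}\!\left\{ f^{-1}(j) \neq \varnothing \right\}

\Delta\!\left(b_{i}, \tilde{b}_{j}\right) = \left(o_{ij},\; r_{j},\; d_{ij}\right)
```

Under this reading, two candidate assignments are compared first by $o_{ij}$, then by $r_{j}$, and only then by $d_{ij}$, which is what makes earlier-generated boxes cheaper to match to the same target.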

Extraction of Depth Features.
The box position obtained in the previous section is the position of the human head in the video, which is enlarged by a certain scale to cover the whole body [33]. After the position and size of the target box are obtained, the feature map array produced by the last convolutional layer of GoogLeNet is used to extract features. The depth features of each detected target are obtained by pooling the feature map; each feature is highly abstract and can characterize well the appearance of the target in the football match video. A distinctive property of the proposed algorithm is that the feature map is pooled to obtain the depth features required for target tracking without re-training. Therefore, the accuracy of target tracking in football game video is improved while the real-time performance of the tracking algorithm is unchanged.
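As an illustration of the pooling step, the sketch below averages the feature-map cells covered by a target box. Average pooling, the uniform mapping from pixels to grid cells, and the 640 × 480 frame size in the example are assumptions; the text does not state which pooling operator or input resolution is used. The function name pool_box_feature is hypothetical.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def pool_box_feature(fmap: FeatureMap, box: Tuple[float, float, float, float],
                     image_size: Tuple[int, int]) -> List[float]:
    """Average-pool the feature-map cells covered by a box given in image pixels.

    box is (x_min, y_min, x_max, y_max); image_size is (width, height).
    """
    channels = len(fmap)
    rows, cols = len(fmap[0]), len(fmap[0][0])
    img_w, img_h = image_size
    x0, y0, x1, y1 = box
    # Map pixel coordinates to grid cells, clamped to the grid (uniform stride assumed).
    c0 = max(0, min(cols - 1, int(x0 * cols / img_w)))
    c1 = max(c0, min(cols - 1, int(x1 * cols / img_w)))
    r0 = max(0, min(rows - 1, int(y0 * rows / img_h)))
    r1 = max(r0, min(rows - 1, int(y1 * rows / img_h)))
    n_cells = (r1 - r0 + 1) * (c1 - c0 + 1)
    return [
        sum(fmap[ch][r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)) / n_cells
        for ch in range(channels)
    ]


# Two-channel toy map on a 15 x 20 grid: channel 0 is constant, channel 1 equals the column index.
toy = [[[1.0] * 20 for _ in range(15)],
       [[float(c) for c in range(20)] for _ in range(15)]]
feature = pool_box_feature(toy, (64, 32, 191, 95), (640, 480))
print(feature)  # [1.0, 3.5]
```

With the real detector, the map would have 1024 channels, so each target would be described by a 1024-dimensional depth feature computed without any additional training.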

Single-Target Tracking Based on DSST.
The discriminative scale space tracking algorithm is referred to as the DSST tracking algorithm. After the depth features are obtained, the DSST algorithm is used to track a single target in the video. DSST combines a two-dimensional position filter with a one-dimensional scale filter. The candidate position is first determined using the two-dimensional position correlation filter, and this area is used as the reference area for the one-dimensional scale filter calculation. In this way, candidate blocks of different scales are obtained, and the scale with the highest matching degree is selected. In the scale selection rule, P and R are the width and height of the target in the previous frame; a is the scale factor, set to 1.02; and S is the number of scales, set to 33. The scales are not linearly spaced; the search proceeds from fine to coarse and from inside to outside. To extract image features and generate filters, MOSSE correlation filters are employed [34]. A series of image blocks are extracted from the target as training samples, denoted $y_1, y_2, y_3, \cdots, y_n$. The corresponding desired filter responses are Gaussian functions, denoted $g_1, g_2, g_3, \cdots, g_n$, each with its peak at the center. The goal is to find the filter that minimizes the mean square error. In the MOSSE optimal correlation filter formula, G is the Gaussian function, * denotes the complex conjugate, and $H_t$ is the filter that attains the minimum. The second equality in the formula follows from Parseval's theorem: its right side is the frequency-domain expression, and its left side is the spatial-domain expression. This transforms the problem from a spatial-domain solution to a frequency-domain solution.
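Returning to the scale search described above, the candidate scales can be enumerated with a short sketch. It assumes the standard DSST rule, in which the candidate sizes are a^n P × a^n R for integer n from ⌊−(S − 1)/2⌋ to ⌊(S − 1)/2⌋; the function name scale_candidates and the 40 × 80 target size are hypothetical.

```python
import math
from typing import List, Tuple


def scale_candidates(width: float, height: float, a: float = 1.02,
                     num_scales: int = 33) -> List[Tuple[float, float]]:
    """Return the candidate (width, height) pairs a**n * (P, R) for the scale search."""
    lo = math.floor(-(num_scales - 1) / 2)
    hi = math.floor((num_scales - 1) / 2)
    return [(a ** n * width, a ** n * height) for n in range(lo, hi + 1)]


sizes = scale_candidates(40.0, 80.0)
print(len(sizes))   # 33
print(sizes[16])    # (40.0, 80.0), the unscaled size at n = 0
```

Consecutive candidates differ by the constant ratio a rather than by a constant step, which is the non-linear spacing mentioned in the text: the sampling is finest around the previous target size.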
In the frequency domain, the filter that attains this minimum has a closed-form solution, referred to as (4). After the correlation filter is obtained, the target position in the next frame is determined from the response of the correlation score [35]. The area with the highest response value is the new target position, and the response is computed by the response formula, referred to as (5). In this algorithm, (4) uses the extracted depth feature and the Gaussian function G to obtain the correlation filter H, where t denotes the time index. When a new frame is input, the feature Z extracted from the image block is correlated with the filter H using (5), and the response score x is obtained to locate the candidate target [36].
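The displayed formulas referred to in this derivation are missing from the text. Assuming that the source follows the standard MOSSE formulation, and using the symbols introduced above ($y_j$ for training patches, $g_j$ for the desired Gaussian responses, capital letters for their discrete Fourier transforms), they would read as follows. The patch size $M \times N$ and the correlation operator $\star$ are notation assumed here; products and the division are element-wise.

```latex
\varepsilon = \sum_{j=1}^{n} \left\| h_t \star y_j - g_j \right\|^{2}
            = \frac{1}{MN} \sum_{j=1}^{n} \left\| H_t^{*}\, Y_j - G_j \right\|^{2}

H_t = \frac{\sum_{j=1}^{n} G_j^{*}\, Y_j}{\sum_{j=1}^{n} Y_j^{*}\, Y_j} \tag{4}

x = \mathcal{F}^{-1}\left\{ H_t^{*}\, Z \right\} \tag{5}
```

The first line is the objective with its spatial-domain form on the left and its frequency-domain form on the right; (4) is its minimizer, and the location of the maximum of x in (5) gives the new target position.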
DSST represents the input y of the image as a d-dimensional feature vector; the input signal y represents a certain image block of the input image. The optimal correlation filter H is established following the MOSSE idea. In its objective function, l indexes one dimension of the feature and λ is the regularization coefficient, and the minimizer again has a closed-form solution. Because solving it exactly requires a d × d-dimensional linear system to be solved for every pixel in the image block, the calculation amount is too large and time consuming. Therefore, a robust approximate solution is obtained by updating the numerator and the denominator of the closed-form solution separately, where η is the learning rate. The position of the target in the new frame is given by the maximum response value of the correlation filter.
Discrete Dynamics in Nature and Society
The DSST algorithm uses the dual correlation filter to track a single target in football match video. The algorithm is portable and efficient, but problems remain. When multiple targets are tracked, occlusion of the target inevitably reduces the tracking accuracy, and the discrimination against similar interfering targets is not strong. Therefore, the Markov chain Monte Carlo (HDDMCMC) algorithm is adopted when tracking multiple targets in the football match video.
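For the single-target filter described in this section, the displayed DSST equations are not shown in the text. Assuming that the source follows the standard DSST formulation, with y in place of the usual feature symbol and x for the response, they would read as follows. The numerator $A_t^{l}$ and denominator $B_t$ are names assumed here for the two running quantities; products and the division are element-wise.

```latex
\varepsilon = \left\| \sum_{l=1}^{d} h^{l} \star y^{l} - g \right\|^{2}
            + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2}

H^{l} = \frac{G^{*}\, Y^{l}}{\sum_{k=1}^{d} \left(Y^{k}\right)^{*} Y^{k} + \lambda}

A_{t}^{l} = (1 - \eta)\, A_{t-1}^{l} + \eta\, G_{t}^{*}\, Y_{t}^{l}, \qquad
B_{t} = (1 - \eta)\, B_{t-1} + \eta \sum_{k=1}^{d} \left(Y_{t}^{k}\right)^{*} Y_{t}^{k}

x = \mathcal{F}^{-1}\left\{ \frac{\sum_{l=1}^{d} \left(A^{l}\right)^{*} Z^{l}}{B + \lambda} \right\}
```

Updating $A_t^{l}$ and $B_t$ as running averages avoids solving a linear system per pixel, which is the approximation the text describes.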

Intra-Segment MCMC Algorithm.
In the multi-target tracking algorithm, the stability and continuity of motion are taken into account: for the same target in consecutive video frames, the appearance characteristics will not change drastically. In the intra-segment MCMC algorithm, the depth feature in Section 2.2 is used to measure the similarity of the target trajectory.
Each detected target is treated as a node and is described within the segment time interval [t, t + T]. Suppose the set of nodes of the video frame at time t is $N_t = \{N_t(1), N_t(2), N_t(3), \cdots, N_t(i), \cdots, N_t(N_t - 1), N_t(N_t)\}$. In the expression for the posterior probability P(w|D), $N_n(k)$ and $N_{n+1}(k)$ represent the n-th and (n + 1)-th nodes in the k-th pedestrian trajectory, respectively. $P(N_{n+1}(k) \mid N_n(k))$ is the similarity of the two nodes, calculated as the cosine of the angle between the depth features of the two nodes. $l_k$ is the length of the k-th track, and $|\tau_0|$ is the number of false alarms, which is included to keep the false alarm rate low.
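The node similarity can be sketched directly from its definition as the cosine of the angle between two depth features. Multiplying the similarities of consecutive nodes along a track is an assumption about how they enter the posterior, since the expression itself is not shown; both function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two depth-feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def track_similarity(track: Sequence[Sequence[float]]) -> float:
    """Product of similarities between consecutive nodes of one trajectory (assumed form)."""
    score = 1.0
    for prev, nxt in zip(track, track[1:]):
        score *= cosine_similarity(prev, nxt)
    return score


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(track_similarity([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]))  # 1.0
```

Because the cosine ignores vector length, two detections of the same player with differently scaled feature activations still score close to 1, while a switch to a different player lowers the track score.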

Inter-Segment MCMC Algorithm.
The data used are mainly the target trajectories generated by the intra-segment MCMC algorithm. The main moves taken by the algorithm include fusion, splitting, and switching operations. The intra-segment MCMC algorithm generates many relatively reliable target trajectories. If the trajectory of the same target is nevertheless broken, the break is caused by unstable detection data. Therefore, the purpose of the inter-segment MCMC is to further combine the target trajectory data of two time periods. In the current state, the posterior probability is updated accordingly; false alarm factors are no longer considered in this update because they are mainly used to divide the target trajectory [37]. For fusion operations, the allowed time interval is set as $t_{gap} = 10$, and the frame difference at the junction between the track segments of the two targets cannot exceed 6. The standard deviation of the probability is set by $3\sigma = \mathrm{size}(d_t^{i})$. In this way, the unit that can be transferred between different states is a relatively complete target trajectory segment that has been generated previously [38][39][40]. The inter-segment MCMC algorithm also moves targets that have left the video scene out of the current data set.
The current data set is denoted τ′. After the intra-segment MCMC obtains the trajectories in the next segment, they are matched by the inter-segment MCMC algorithm. That is, the algorithm continues to build on the previous target data and combines it with the current target data to further integrate and optimize the data. The entire algorithm runs continuously in this sliding manner.

Results
In order to verify the performance of the proposed algorithm in each aspect, it is compared with several traditional algorithms: CSK, KCF, Struck, CT, and TLD. The experimental data are football match videos. The threshold for accuracy is set to 20 pixels, and the threshold for the success rate is set to $t_0 = 0.5$. The results are shown in Tables 1 and 2. By comparison, the accuracy of the proposed algorithm reaches 0.88, and its success rate reaches 0.81. Although the proposed algorithm is not optimal for some video sequences, it is robust in overall performance. When the target is partially occluded, the algorithm can still track the target accurately. The center position error is the deviation between the center of the tracking box and the center of the ground-truth target box, and the coverage ratio is the ratio of the intersection of the tracking box and the ground-truth target box to their union. To evaluate the tracking performance of the different algorithms on entire video sequences, the experiment uses the average center position deviation and the average coverage as indicators, and the results are shown in Tables 3 and 4.
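The two indicators and the thresholded rates can be written down as a minimal sketch. It assumes that the center position error is a Euclidean distance, that a frame counts as accurate when this distance is at most 20 pixels, and that it counts as successful when the coverage exceeds 0.5; the function names and the two example frames are hypothetical and are not experimental data.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def center_error(pred: Box, truth: Box) -> float:
    """Euclidean distance between the two box centers, in pixels."""
    px, py = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    tx, ty = (truth[0] + truth[2]) / 2, (truth[1] + truth[3]) / 2
    return math.hypot(px - tx, py - ty)


def coverage(pred: Box, truth: Box) -> float:
    """Intersection area divided by union area of the two boxes."""
    iw = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    ih = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    inter = iw * ih
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (truth[2] - truth[0]) * (truth[3] - truth[1]) - inter)
    return inter / union if union > 0 else 0.0


def precision_and_success(preds: List[Box], truths: List[Box],
                          dist_thresh: float = 20.0,
                          overlap_thresh: float = 0.5) -> Tuple[float, float]:
    """Fraction of frames within the distance threshold and above the overlap threshold."""
    n = len(preds)
    precise = sum(center_error(p, t) <= dist_thresh for p, t in zip(preds, truths))
    success = sum(coverage(p, t) > overlap_thresh for p, t in zip(preds, truths))
    return precise / n, success / n


truths = [(0, 0, 100, 100), (0, 0, 100, 100)]
preds = [(10, 0, 110, 100), (60, 0, 160, 100)]
print(precision_and_success(preds, truths))  # (0.5, 0.5)
```

In the example, the first frame is off by 10 pixels with a coverage of about 0.82 and passes both tests, while the second is off by 60 pixels with a coverage of 0.25 and fails both.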
It can be seen from Tables 3 and 4 that, among the eight tracking video sequences, the proposed algorithm achieves the best average center position deviation in three sequences and the second best in two; for average coverage, it is best in two sequences and second best in four. The experiment uses a scoring method to evaluate the two indicators separately. For each video sequence, the algorithms are ranked on each indicator from best to worst and assigned scores of 6, 5, 4, 3, 2, and 1. The scores over all video sequences are then summed to give the final result, as shown in Figure 1. Analysis of Figure 1 shows that the proposed algorithm scores 39 points on the average center position deviation and 40 points on the average coverage, which are better than the scores of the other algorithms. This shows that the proposed algorithm performs best among the listed tracking algorithms. The video frame rate measures the number of frames displayed per second and reflects the smoothness of the tracking results. The average frame rates of the different algorithms on the eight video sequences are compared, and the results are shown in Table 5.
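The scoring rule can be sketched as follows. The function name rank_scores and the three-algorithm example values are hypothetical and purely illustrative; with six algorithms, the same function assigns the scores 6, 5, 4, 3, 2, and 1 described above. Ties are not treated specially in this sketch.

```python
from typing import Dict, List


def rank_scores(per_sequence: List[Dict[str, float]], higher_is_better: bool) -> Dict[str, int]:
    """Sum rank-based scores over sequences: best gets len(algorithms), worst gets 1."""
    totals: Dict[str, int] = {}
    for results in per_sequence:
        # Sort algorithm names from best to worst on this sequence.
        ordered = sorted(results, key=results.get, reverse=higher_is_better)
        top = len(ordered)
        for rank, name in enumerate(ordered):
            totals[name] = totals.get(name, 0) + (top - rank)
    return totals


# Illustrative coverage values for two sequences (not taken from the tables).
coverage_results = [{"A": 0.8, "B": 0.6, "C": 0.7},
                    {"A": 0.5, "B": 0.9, "C": 0.4}]
totals = rank_scores(coverage_results, higher_is_better=True)
print(sorted(totals.items()))  # [('A', 5), ('B', 4), ('C', 3)]
```

For the average center position deviation, where smaller values are better, the same function would be called with higher_is_better=False.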
Analysis of Table 5 shows that the average frame rate of the proposed algorithm is higher than that of the other algorithms and remains above 35 Hz. This shows that the tracking results of the proposed algorithm are smoother.
In order to verify the efficiency of the algorithm, the numbers of iterations and the time consumption of the different algorithms are compared, and the results are shown in Figure 2.
It can be seen from Figure 2 that the number of iterations of the proposed algorithm is similar to that of the CT algorithm, with an average of 1-2 iterations, which is significantly lower than that of the CSK algorithm. In terms of time consumption, the time consumed by the proposed algorithm is similar to that of the CSK algorithm, with an average of 12.5 s, which is significantly lower than that of the CT algorithm. The comparison results show that the tracking efficiency of the proposed algorithm is higher.
To verify the stability of the proposed algorithm, three algorithms are used to track the target in the same experimental environment. The average outage probabilities of the different algorithms are compared, and the results are shown in Table 6. To show the stability of the algorithms more clearly, the data in Table 6 are also plotted as a line graph, as shown in Figure 3.
Analysis of Table 6 and Figure 3 shows that the average outage probability of the proposed algorithm is 0.2371, which is lower than that of the other two algorithms. The experimental results show that the proposed algorithm has better stability when tracking the target in football match video.
In order to test the resource occupancy rate of the proposed algorithm, its resource usage during target detection, feature extraction, single-target tracking, and multi-target tracking is compared and analyzed. The results are shown in Table 7.
It can be seen from Table 7 that the CPU and memory usage of the proposed algorithm are 26%-34% and 7%-17%, respectively. Those of the CSK algorithm are 64%-70% and 33%-41%, respectively, and those of the CT algorithm are 64%-72% and 29%-39%, respectively. The experimental results show that, compared with the other two algorithms, the proposed algorithm has a low resource occupancy rate when tracking targets in football match video.

Discussion
The accuracy and success rate of the proposed algorithm are as high as 0.88 and 0.81, respectively; its scores on the average center position deviation index and the average coverage index are 39 and 40 points, respectively; the average frame rate is maintained above 35 Hz, and the tracking time is about 12.5 s. These data show that the proposed algorithm outperforms the other algorithms in target tracking. The reason is the use of deep learning techniques in this paper. Deep learning uses artificial neural networks that simulate the way the human brain analyzes things. By simulating how the human brain acquires and parses data, this structure can better learn the essential characteristics of objects. The main ideas of deep learning target tracking are as follows: first, a deep learning model is constructed and trained on standard data sets to obtain more accurate target feature information. Then, this model is used for target matching and positioning to achieve efficient target tracking. Depth features can reflect the appearance characteristics of moving objects more accurately than traditional features such as scale-invariant features (SIFT). Therefore, the algorithm of this paper greatly improves the accuracy of target tracking. At the same time, it combines the discriminative scale space algorithm and the Markov chain Monte Carlo algorithm to track the targets in football video, ensuring efficient tracking of single targets while achieving accurate tracking of multiple targets.

Conclusions
With the rapid development of computer technology, deep learning has become a powerful tool for video target tracking. Deep learning technology has the advantages of high precision, wide application range, and strong stability. Traditional target tracking algorithms for football match video have the defect that the target is lost when it is severely occluded. To address this defect, a target tracking algorithm for football game video based on deep learning is proposed. Deep learning and target tracking technology are combined to track the target. The experimental results show that the tracking success rate and required time of the proposed algorithm are 0.81 and 12.5 s, respectively, and that its average frame rate is above 35 Hz. Its scores on the average center position deviation index and the average coverage index are 39 and 40 points, respectively, and its resource occupancy rate is low. This shows that the algorithm can track the targets in football match video well. In future research, the information technology model will be used to further improve the accuracy of football game video target tracking and to reduce the loss of the target when it is severely occluded [41][42][43][44][45].
Data Availability
The data sets used and/or analyzed during the current study are available from the corresponding author on reasonable request.