Deep Neural Network-Based Sports Marketing Video Detection Research

With the rapid development of short video, sports marketing has diversified, and accurately detecting marketing videos has become more difficult. Detection centers on identifying certain key images in a video; analyzing these images then allows sports marketing videos to be detected effectively. This paper proposes video key image detection based on deep neural networks to address the unclear and unrecognizable boundaries of key images in multiscene recognition. First, a key image detection model based on a feedback network is proposed, and ablation experiments are conducted on the simple test set of DAVSOD. The experimental results show that the proposed model achieves better performance in both quantitative evaluation and visual effects and can accurately capture the overall shape of salient objects. A hybrid loss function is also introduced to identify the boundaries of key images, and the experimental results show that the proposed model outperforms or is comparable to current state-of-the-art video salient object detection models in terms of quantitative evaluation and visual effects.


Introduction
Vision is the main way humans receive information from the outside world, and according to research in the field of neuroscience, about $10^8$ to $10^9$ bytes of data enter the human eye every second [1]. Humans can cope with this volume because of the selective role of the visual attention mechanism, which allows the visual system to ignore irrelevant information and attend to relevant information, much like separating grains of wheat from the husk. In this Internet era of exploding data volumes, how to extract the information people care about from a huge amount of data while saving labor and material resources has gained a lot of attention. Therefore, introducing attention mechanisms into data processing tasks and prioritizing the allocation of processing resources to more critical information can help improve the efficiency of information processing [2][3][4][5][6].
In 1998, Borji and Itti [7] proposed the first computational model of visual saliency, building on the theory of Koch et al., the classical feature integration theory of cognitive psychology [8], and the guided search model [9]. The algorithm contains three main steps. First, three types of primary visual features (color, luminance, and orientation) are computed at multiple scales using center-surround contrast (key feature extraction). Second, the feature maps are normalized and then combined (feature fusion). Third, the key targets in the image are labeled using the winner-take-all (WTA) mechanism. The algorithm has had a significant impact on subsequent research on computational models of visual saliency in computer vision; mainstream saliency detection algorithms used a similar framework until deep learning techniques were adopted on a large scale.
Early image salient object detection models [10] were mainly based on a bottom-up approach using different low-level visual features, such as color and edges. Since salient object detection is closely related to human eye fixation prediction and both model the human visual attention mechanism, the early salient object detection models also borrowed some basic theories of human visual attention, including the classical contrast assumption and the center-surround assumption. For example, both assumptions were used by Liu et al. [11] and Achanta et al. [12], and a similar assumption was used by Cheng et al. [10], who considered color contrast information on both local and global scales; their algorithm was concise and straightforward and received wide attention from the academic community. In addition, Yan et al. [13] proposed over-segmenting the image at different scales to obtain appearance-consistent image representations, and then extracting and fusing the salient features at these scales to obtain the final salient object detection results. Visual center bias is also a commonly used hypothesis based on human attentional mechanisms [13]. It rests on the observation that the human visual system tends to assign higher attentional weights to the center of a scene. A later and popular hypothesis is the background prior, proposed by Wei et al. [14] in 2012. Unlike the center-surround hypothesis and the visual center bias hypothesis, which attempt to define "what is more likely to be the salient region," this hypothesis attempts to define "what is more likely to be the background." It is based on the observation that in most scenes, the regions near the edges of the image have a higher probability of belonging to the background.
This assumption can be considered a further development of the visual center bias assumption. Before the large-scale application of deep learning techniques, the background prior was the most effective assumption in the field of saliency detection, and the majority of high-performing models [15][16][17][18][19] were based on it. These works focus on how to further improve the accuracy of the background prior and how to apply more advanced one-class classifiers: because the background prior effectively yields samples of a single class (background), saliency detection can be treated as a one-class classification problem in which only one class of samples is given.
With the great success of deep learning techniques in image classification, the focus of research in salient object detection has gradually shifted to deep learning-based models. Slightly earlier work used deep learning features as a more effective saliency representation and trained fully connected classification networks on them. Lee et al. [20] used deep features as high-level information and Gabor filter responses and color histograms as low-level features, fusing saliency information from different levels for saliency prediction. These models achieve better performance but have some drawbacks: the use of classification networks based on fully connected layers leads to a large number of parameters and a loss of spatial information, and the need to classify each superpixel or object proposal as salient or non-salient makes the algorithms computationally expensive.
With the rise of fully convolutional neural networks, recent deep learning-based salient object detection work has used or adapted fully convolutional networks for pixel-level saliency prediction. Some work [21], inspired by the pixel-level semantic segmentation task, proposes fusing features from different neural network layers for salient object detection. Since the shallower layers of deep neural networks retain more fine-grained low-level visual features and the deeper layers extract higher-level semantic features, fusing features from different layers preserves the original low-level spatial information while obtaining higher-level semantic information. Currently, the main research focus of deep learning-based salient object detection is to explore more efficient network structures that retain more spatial details. Wang et al. [22] proposed the ASNet model, which detects visually salient objects by means of a visual attention prior.
The model treats visual attention as a high-level understanding of the whole scene, learned through the higher-level neural network layers, and treats salient object detection as a more fine-grained, object-level saliency detection task, with visual attention providing top-down guidance. The ASNet model is based on a stacked convolutional long short-term memory (ConvLSTM) network, whose recurrent structure can iteratively optimize saliency detection results. This work provides a deeper understanding of the visual attention mechanism and reveals the correlation between salient object detection and human eye fixation prediction. As a whole, deep learning-based salient object detection models achieve much better performance than traditional models [23][24][25][26].
Given the current state of research, this paper investigates video salient object detection based on deep neural networks. To extract richer spatial saliency information and better capture the overall shape of salient objects, a video salient object detection model based on an attention feedback network is proposed. To further obtain clearer boundaries, a new hybrid loss function is introduced on top of this model.

Convolutional Neural Network.
When people read or watch a video, they perceive and understand the current content based on the text or images they have already observed; they do not completely forget what came before and start from a blank state to understand what follows. Traditional neural networks cannot predict salient information in later frames based on the salient object regions in previous video frames. Recurrent neural networks (RNNs) give the network a memory, and their structure is shown in Figure 1.
Here, $\{X_k\}_{k=0}^{t}$ is a set of inputs with $(t + 1)$ time steps and $\{H_k\}_{k=0}^{t}$ is the corresponding output of the network. At time step $t$, network $N$ receives not only $X_t$ but also the hidden state from time step $t - 1$; that is, the network processes the current input with reference to its previous memory.
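The recurrence described here can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the network used in this paper: the tanh activation, the weight names `W_x`, `W_h`, and `b`, and the toy vector sizes are all hypothetical.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain RNN over a sequence and return every hidden state H_t."""
    hidden = np.zeros(W_h.shape[0])  # memory is empty before the first step
    outputs = []
    for x_t in inputs:
        # The new state depends on the current input AND the previous state.
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        outputs.append(hidden)
    return np.stack(outputs)

# Toy sequence: 4 time steps with 3 features each (hypothetical sizes).
rng = np.random.default_rng(0)
seq = rng.normal(size=(4, 3))
W_x = rng.normal(size=(5, 3)) * 0.1
W_h = rng.normal(size=(5, 5)) * 0.1
b = np.zeros(5)
states = rnn_forward(seq, W_x, W_h, b)
```

Because each state feeds into the next, changing the first input also changes the last output, which is the "memory" the text refers to.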
However, when the video sequence is long, the interval between the current video frame and the related video frame may be large, and the RNN may then lose its memory of distant video frames because of problems such as vanishing gradients. To address this long-term dependence problem, Hochreiter et al. [27] proposed the long short-term memory (LSTM) network, shown in Figure 2, whose three stages are the forgetting stage, the state update stage, and the output stage.
All three stages contain a sigmoid layer that maps the input information to the range [0, 1]; useful information is then selectively retained and useless information forgotten through element-wise multiplication.

The forgetting stage filters the previous cell state. The current input $x_t$ is connected with the hidden state $H_{t-1}$ of the previous moment, and the result is denoted $J_t$, where ※ denotes the connection operation:

$$J_t = x_t \,※\, H_{t-1} \tag{1}$$

A sigmoid layer then maps $J_t$ to the range [0, 1] to obtain the forget gate $f_t$, where $W_f$ and $b_f$ denote the weight and bias vector of the network layer, respectively, and $\sigma$ denotes the sigmoid operation:

$$f_t = \sigma(W_f J_t + b_f) \tag{2}$$

The gate is then multiplied element-wise ($\odot$) with the cell state $C_{t-1}$, which selectively keeps the useful information and forgets the useless information. The cell state at this point is denoted $C_t$:

$$C_t = f_t \odot C_{t-1} \tag{3}$$

The state update stage lets the cell state selectively absorb relevant information from $J_t$. $J_t$ passes through a sigmoid layer to generate the input gate $i_t$. The feature obtained by passing $J_t$ through a tanh layer is multiplied element-wise by $i_t$; this product is the information added to the cell state, and the new cell state $C_t^*$ is obtained by adding it element-wise to the $C_t$ obtained in the forgetting stage.

The output stage controls what information is output at the current moment. $J_t$ is fed into a sigmoid layer to obtain the output gate $O_t$. The current cell state $C_t^*$ is passed through a tanh layer, and the result is multiplied element-wise by $O_t$ to obtain the output at the current moment, $H_t = O_t \odot \tanh(C_t^*)$.
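The three stages can be written as a single time step in NumPy. This is a sketch of a standard LSTM cell under stated assumptions, not the implementation used in this paper: only the forget-gate parameters `W_f` and `b_f` are named in the text, so `W_i`, `b_i`, `W_c`, `b_c`, `W_o`, `b_o`, the helper `init_params`, and the toy sizes are hypothetical names introduced by analogy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: forgetting, state update, and output stages."""
    j_t = np.concatenate([x_t, h_prev])                    # connect input and previous hidden state
    # Forgetting stage: gate in [0, 1] scales the previous cell state.
    f_t = sigmoid(params["W_f"] @ j_t + params["b_f"])
    c_kept = f_t * c_prev
    # State update stage: input gate selects which new features to absorb.
    i_t = sigmoid(params["W_i"] @ j_t + params["b_i"])
    candidate = np.tanh(params["W_c"] @ j_t + params["b_c"])
    c_new = c_kept + i_t * candidate
    # Output stage: output gate selects what the cell exposes.
    o_t = sigmoid(params["W_o"] @ j_t + params["b_o"])
    h_t = o_t * np.tanh(c_new)
    return h_t, c_new

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases for the four layers."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fico":
        params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params

params = init_params(input_size=3, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
for x in np.ones((5, 3)):          # five identical toy inputs
    h, c = lstm_step(x, h, c, params)
```

Keeping the cell state separate from the hidden state is what lets the additive update carry information across many time steps.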
2.2. Loss Function. Pixel-level salient object detection can be viewed as a binary classification problem, where pixels belonging to the salient object are labeled 1 and pixels belonging to the background are labeled 0. Let $y_i$ denote the label of sample $x_i$ (the desired output), and let $\hat{y}_i$ denote the predicted probability that $y_i = 1$ for the given sample $x_i$, so that $1 - \hat{y}_i$ is the probability that $y_i = 0$. The probability of observing $y_i$ given $x_i$ is written $P(y_i \mid x_i)$. From the perspective of maximum likelihood, it can be expressed in the following form, which covers both true labels $y_i \in \{0, 1\}$:

$$P(y_i \mid x_i) = \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}$$

Taking the logarithm turns the product into a sum. Since a smaller loss is better, the log-likelihood is negated, and the loss function is calculated as follows:

$$L = -\left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
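As an illustration, the negative log-likelihood above is the binary cross-entropy, which can be evaluated per pixel on a saliency map. The sketch below assumes the per-pixel losses are averaged over the map and that predictions are clipped by a small `eps` for numerical stability; both choices, and the toy 2×2 maps, are assumptions rather than details taken from the paper.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy between predicted probabilities and 0/1 labels."""
    pred = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    per_pixel = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(per_pixel.mean())

# Toy 2x2 ground-truth mask: left column salient, right column background.
mask = np.array([[1.0, 0.0], [1.0, 0.0]])
good = np.array([[0.9, 0.1], [0.8, 0.2]])   # prediction close to the mask
bad = np.array([[0.2, 0.7], [0.4, 0.9]])    # prediction far from the mask
loss_good = binary_cross_entropy(good, mask)
loss_bad = binary_cross_entropy(bad, mask)
```

A prediction that agrees with the mask yields a smaller loss than one that contradicts it, which is the property gradient-based training exploits.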

Feedback Network.
To reduce the loss of necessary visual saliency information caused by repeated stride and pooling operations and to learn richer static saliency information, AFNet is used as the backbone of the static saliency module. In Figure 3, "Stimuli" denotes the input image frames. The encoding and decoding networks each consist of five convolutional blocks of VGG16 (denoted $E_i$ and $D_i$, respectively, $i \in \{1, 2, 3, 4, 5\}$), and the information transfer between corresponding convolutional blocks is controlled by the attention feedback module.

Feedback Network Detection Model.
The NHM model is proposed to capture richer spatial saliency information and thus better capture the overall shape of key images. The NHM model uses the attention feedback network as the backbone of the static saliency module to reduce the loss of visual saliency information caused by scale-space issues and to guide the correct fusion of multiscale features from coarse to fine scales. The multiscale feature maps extracted from the five decoding blocks of the attention feedback network are then fused and fed to the pyramid dilated convolution (PDC) module to retain more spatial saliency information. After that, temporal saliency information is captured using a saliency-shift-aware convolutional long short-term memory network (SSLSTM), which takes the shift of attention into account. Finally, the parameters of the model are optimized by gradually reducing the value of the loss function over successive iterations. The algorithm is divided into three parts: extraction of multiscale spatial features, integration of spatio-temporal saliency information, and loss minimization.
To mitigate negative effects such as the loss of visual information caused by the scale-space problem, the backbone of the static saliency detection module consists of AFNet and PDC modules connected together. AFNet is a fully convolutional encoder-decoder network. Its encoding and decoding networks each consist of five convolutional blocks, denoted $E_i$ and $D_i$, respectively, where $i \in \{1, 2, 3, 4, 5\}$, and each encoder block transmits its saliency information to the corresponding decoder block through the feedback module in AFNet. The feedback module uses a two-step iterative learning approach, with the time steps denoted by $i \in \{1, 2\}$. By simulating a feedback mechanism that multiplies the ternary map pixel by pixel with the obtained feature map, it helps correct inaccurate predictions generated earlier in the network and thus helps capture the overall shape of the key object. For global spatial saliency detection, AFNet uses a global perception module to overcome the problem that a fully connected operation ignores local information and generates redundant data. A multiscale segmentation strategy divides the feature map into 4, 16, and 36 parts, which are then stacked and reorganized for a global convolution operation so that both global and intraregional saliency information is fully used.
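The partition-and-stack step of the multiscale segmentation strategy can be sketched with array reshapes. The sketch assumes that 4, 16, and 36 parts correspond to 2×2, 4×4, and 6×6 grids and that the cells are stacked along the channel axis; the function names and the toy 2×12×12 feature map are hypothetical, and the global convolution that follows in the module is omitted.

```python
import numpy as np

def split_and_stack(feat, n):
    """Split a (C, H, W) map into an n x n grid and stack the cells on the channel axis."""
    c, h, w = feat.shape
    assert h % n == 0 and w % n == 0, "H and W must be divisible by n"
    cells = feat.reshape(c, n, h // n, n, w // n)   # (C, row, h', col, w')
    cells = cells.transpose(1, 3, 0, 2, 4)          # (row, col, C, h', w')
    return cells.reshape(n * n * c, h // n, w // n)

def unstack_and_merge(stacked, n):
    """Inverse of split_and_stack: put the cells back at their grid positions."""
    cn, hh, ww = stacked.shape
    c = cn // (n * n)
    cells = stacked.reshape(n, n, c, hh, ww)        # (row, col, C, h', w')
    cells = cells.transpose(2, 0, 3, 1, 4)          # (C, row, h', col, w')
    return cells.reshape(c, n * hh, n * ww)

# Toy feature map with 2 channels and a 12x12 spatial size (divisible by 2, 4, and 6).
feat = np.arange(2 * 12 * 12, dtype=float).reshape(2, 12, 12)
parts = {n * n: split_and_stack(feat, n) for n in (2, 4, 6)}   # 4, 16, and 36 parts
```

After stacking, every output location sees one position from each cell at once, so an ordinary convolution over the stacked map mixes information from distant regions of the original feature map.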
Applying an image salient object detection model directly to dynamic scenes has several drawbacks. First, image salient object detection can only detect spatial differences such as color contrast, orientation contrast, and brightness contrast, whereas in dynamic scenes the temporal factor is usually an important cue for saliency detection. Second, detecting on each frame individually, without reference to the saliency information contained in previous frames, may produce highly incoherent results between frames, because the target and background may differ significantly in appearance from frame to frame. Finally, video content often contains significant redundancy, since consecutive video frames need sufficiently similar content to provide a smooth viewing experience, and simply ignoring this redundancy leads to higher computational costs. Therefore, video salient object detection (VSOD) needs to consider both temporal and spatial saliency information, and a dynamic saliency detection module is used to integrate the two. To better simulate the perceptual function of the human visual system, learn temporal saliency information, and capture the process of attention shift, this paper uses SSLSTM as the dynamic saliency detection module, which combines the powerful spatio-temporal feature extraction capability of ConvLSTM with the attention shift mechanism.
Deep neural networks are optimized gradually by iteratively minimizing a loss function. The loss function measures the difference between the value predicted by the model and the true value, and the weights of the network are updated by gradient descent.
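The idea of iteratively reducing a loss by gradient descent can be illustrated on a very small problem. The sketch below trains a one-layer logistic model on four hand-made samples; the model, the data, the learning rate of 0.5, and the 200 steps are illustrative assumptions and have nothing to do with the training configuration of the proposed network.

```python
import numpy as np

def train_logistic(x, y, lr=0.5, steps=200):
    """Minimize the mean binary cross-entropy of a logistic model with plain gradient descent."""
    w = np.zeros(x.shape[1])
    losses = []
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))            # predicted probabilities
        p_safe = np.clip(p, 1e-7, 1.0 - 1e-7)
        losses.append(float(-(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe)).mean()))
        grad = x.T @ (p - y) / len(y)                 # gradient of the mean loss w.r.t. w
        w -= lr * grad                                # step against the gradient
    return w, losses

# Toy data: a constant bias feature plus one informative feature.
x = np.array([[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, losses = train_logistic(x, y)
```

The recorded `losses` shrink from one iteration to the next, which is the behavior the paragraph above describes at the scale of a full network.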
The meaning of each symbol is shown in Table 1. Because video salient object detection datasets contain relatively few human eye fixation annotations, $l_t$ is used to indicate whether the dataset contains such annotations. When the dataset does not contain eye fixation annotations, the loss function omits the $l_t A_t$ term, and the corresponding error is not back-propagated.

Loss Function Design.
A novel hybrid loss function is proposed based on the boundary enhancement loss. The function consists of the loss $L_a$ of the predicted attention-perception feature map, the loss $L_v$ of the final key object prediction result, and the loss $L_v^b$ of the final predicted target boundary. The weights $\omega_1$ and $\omega_2$ control the contributions of the object-level loss and the object-boundary loss, respectively, with $\omega_1 : \omega_2 = 1 : 10$ to emphasize the learning of the target boundary. Part of the training data does not contain human eye fixation annotations, so the loss $L_a$ of the predicted attention-perception feature map is divided into two cases: the loss $L_a^f$ calculated using human eye fixation annotations and the loss $L_a^m$ calculated using salient object annotations.
When $\delta(1) = 0$, $L_a = L_a^f$; when $\delta(1) = 1$, $L_a = L_a^m$. $S_t$ denotes the final key object prediction result, and $M_t$ denotes the object-level annotation of the key object; the loss $L_v$ is calculated from $S_t$ and $M_t$. The average pooling operation $P$ can be used to extract smooth boundaries: the boundary $B(X)$ of an image $X$ is obtained by taking the absolute value of the difference between $X$ and $P(X)$:

$$B(X) = \left| X - P(X) \right|$$

The final predicted target boundary loss $L_v^b$ is then computed between the boundaries $B(S_t)$ and $B(M_t)$.

On the basis of NHM, this hybrid loss function for capturing clear boundaries is added. It is based on the boundary enhancement loss and is composed of terms for the attention-perception feature map predicted by the model, the prediction results of key images, and the prediction results of key image boundaries. The resulting model is referred to as LNSM.
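The boundary extraction step and a weighted combination of the loss terms can be sketched as follows. Several details are assumptions, not taken from the paper: the 3×3 stride-1 pooling window with edge padding, the mean squared error used for the boundary term, the cross-entropy used for the object term, and the simple sum in the hypothetical `hybrid_loss`, where the attention loss is passed in as a precomputed number. Only the 1 : 10 weight ratio and the rule $B(X) = |X - P(X)|$ come from the text.

```python
import numpy as np

def average_pool(x, k=3):
    """k x k average pooling with stride 1 and edge padding (k odd, output keeps the input size)."""
    pad = k // 2
    padded = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def boundary(x, k=3):
    """B(X) = |X - P(X)|: large only where a pixel differs from its local average."""
    return np.abs(x - average_pool(x, k))

def bce(pred, target, eps=1e-7):
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

def hybrid_loss(pred, target, attention_loss=0.0, w1=1.0, w2=10.0):
    """Assumed form: attention term + w1 * object-level term + w2 * boundary term."""
    object_loss = bce(pred, target)
    boundary_loss = float(np.mean((boundary(pred) - boundary(target)) ** 2))
    return attention_loss + w1 * object_loss + w2 * boundary_loss

# Toy annotation: a 4x4 salient square inside an 8x8 frame.
target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
edges = boundary(target)
blurred = average_pool(target, k=5)     # a prediction with smeared boundaries
loss_sharp = hybrid_loss(target, target)
loss_blurred = hybrid_loss(blurred, target)
```

In this toy case `edges` is zero deep inside the square and in the far background and nonzero along the square's outline, so the boundary term penalizes exactly the region where blurred predictions go wrong.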

Experimental Design.
The experiments were run on an Nvidia GTX 1080 Ti GPU. They were implemented in Python on the Caffe deep learning framework, and Matlab was used for quantitative evaluation of performance. The training sets of DAVIS, DAVSOD, and FBMS and the validation set of DAVSOD were used to train the proposed model. The weights of the network model were initialised from the AFNet model, video was processed per batch, and the number of time steps processed by the ConvLSTM network layer was set to 3. The training process was set up as follows: first, the static saliency module was pretrained with a base learning rate of $10^{-9}$; then, the entire model was trained with the learning rate of the dynamic saliency module set to $10^{-8}$ and the learning rate of the static saliency module set to $10^{-10}$; finally, the static saliency module weights were fixed, and the dynamic saliency module was fine-tuned with the learning rate set to $10^{-10}$. Training the LNSM model took 32 hours and 64k iterations.

Comparison with Other Models.
In this paper, the proposed LNSM is compared with three advanced video salient object detection models, MBNM, PDBM, and SSAV, on datasets created specifically for the VSOD task (the entire ViSal and UVSD datasets, the test set of VOS, and the simple test set of DAVSOD). The results of the quantitative evaluation are shown in Table 1. Table 1 shows that the proposed model is better than the other models on all three metrics on the DAVSOD and ViSal datasets. On the simple test set of DAVSOD in particular, the F-measure and the mean absolute error, which are based on pixel-wise error, and the structure measure, which captures the overall structural difference, improve by 0.06, 0.03, and 0.064, respectively, compared with SSAV; advanced performance is also achieved on the other datasets. Moreover, ViSal is the first test benchmark specially designed for video salient object detection, and the DAVSOD dataset takes the shift of visual attention and its selectivity into account during labeling, so it can represent the real attention behavior of the human visual system in dynamic scenes. The experimental results on these two datasets show that the LNSM model performs well both on datasets created specifically for VSOD and on datasets such as DAVSOD that annotate key images according to human eye fixations.
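For reference, the two pixel-based metrics can be computed as in the sketch below. It assumes a fixed binarization threshold of 0.5 and the weighting β² = 0.3 that is common in the salient object detection literature; the paper's Matlab evaluation may use different settings (for example, adaptive or swept thresholds), and the structure measure is omitted because it is considerably more involved.

```python
import numpy as np

def mean_absolute_error(pred, gt):
    """Average absolute per-pixel difference between a saliency map and the ground truth."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall after thresholding the saliency map."""
    binary = pred >= threshold
    gt_bool = gt >= 0.5
    true_pos = np.logical_and(binary, gt_bool).sum()
    precision = true_pos / max(binary.sum(), 1)
    recall = true_pos / max(gt_bool.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float((1 + beta_sq) * precision * recall / (beta_sq * precision + recall))

# Toy 2x2 example: the top row is salient, and the prediction misses one pixel.
gt = np.array([[1.0, 1.0], [0.0, 0.0]])
pred = np.array([[0.9, 0.4], [0.2, 0.1]])
mae = mean_absolute_error(pred, gt)
f_beta = f_measure(pred, gt)
```

Lower is better for the mean absolute error, while higher is better for the F-measure, which is why improvements in the two metrics are reported with opposite directions of change.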

Conclusion
This paper focuses on key image detection based on deep neural networks in order to detect sports marketing videos. For detection across multiple scenes, a feedback network-based video key image detection model and a hybrid loss function are proposed to solve the key image detection problem. The proposed LNSM model is compared with three state-of-the-art models on six representative datasets in terms of quantitative evaluation and visualisation results. The quantitative results demonstrate that LNSM outperforms the other advanced models on all three evaluation metrics on the DAVSOD and ViSal datasets and achieves performance comparable to other models on widely used datasets.

Data Availability
The dataset can be accessed upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.