Research on Target Tracking Algorithm Based on Siamese Neural Network

Target tracking is a significant topic in the field of computer vision. In this paper, a target tracking algorithm based on a deep Siamese network is studied. To address cases in which tracking is not robust, such as drifting from or losing the target, the accuracy and robustness of the algorithm are improved by refining the feature extraction part and the online update part. This paper adds an SE-block and a temporal attention mechanism (TAM) to the Siamese neural network framework. The SE-block refines the extracted features: different channels are given different weights according to their importance, which improves the discrimination of the network and the recognition ability of the tracker. The temporal attention mechanism updates the target state by adjusting the weights of samples from the current frame and historical frames, which mitigates the model drift caused by similar backgrounds. We use the cross-entropy loss to distinguish the targets in different sequences so that they lie farther apart in the feature domain and their features are easier to identify. We train and test the network on three benchmarks and compare it with several state-of-the-art tracking methods. The experimental results demonstrate that the proposed algorithm is superior to the other methods in both the qualitative tracking results and the evaluation criteria. The proposed algorithm handles occlusion effectively while maintaining real-time performance during tracking.


Introduction
Target tracking is a research hotspot and a basic topic in digital image processing. It has important applications in many fields, such as the military, traffic monitoring, human-computer interaction, video monitoring, and precision guidance [1,2]. Given the size and position of the target in the initial frame of a video sequence, the task of target tracking is to predict the position and motion state of the target in subsequent frames according to its motion trajectory and posture changes [3]. Because the target and its environment change during tracking, the characteristics of the target change constantly, and the problem of meeting both the speed and accuracy requirements of tracking also has to be addressed.
There are several difficulties in target tracking, such as background clutter, deformation, scale variation, and occlusion. In addition to these common challenges, there are other challenging factors such as illumination change, motion blur, rotation, out of view, and fast motion. Together, these challenges make target tracking a very complex task in computer vision [4]. To solve these practical problems, researchers have proposed many tracking methods in recent years.
Most methods solve the tracking problem by building a model that can distinguish the target from the background. Because the specific information about the target becomes available only at tracking time, it is difficult to learn the target model during offline training, as is done in target detection. Instead, the target model must be constructed from the target information given during the test.
The unconventional nature of the target tracking problem poses significant challenges for an end-to-end learning solution [5]. Siamese neural networks have addressed these problems successfully [6][7][8]. These methods learn a feature embedding in which the similarity between two image regions is computed by a simple cross-correlation, and then choose the image region most similar to the template as the tracked target. Because the model corresponds only to the template features extracted from the target area, the tracker can easily be trained end to end on annotated images. Although Siamese neural networks have been successful in target tracking in recent years, they still have serious limitations. First, a lack of offline training data can introduce errors into the learned similarity measure, resulting in poor generalization. Second, a Siamese neural network uses only the appearance of the target when inferring the target model and ignores the background appearance information that is needed to distinguish the target from similar objects. Third, Siamese neural networks lack a powerful model updating strategy. Because of these limitations, the robustness of Siamese neural networks needs to be improved [9]. The contributions of this paper are as follows. First, we add the SE-block substructure [10] to the Siamese neural network, which enhances the feature representation of effective channels and improves feature discrimination by modeling the correlation between the channels of the image. Thereby, we can reduce the computational cost of extracting features. Second, to address the problem that the target is easily occluded during tracking, we add the temporal attention mechanism to the Siamese neural network framework. The temporal attention mechanism guides the parameter update by adjusting the weights of the sample losses from the current frame and historical frames.
Furthermore, we use the cross-entropy loss to distinguish the targets in different video sequences, which keeps them farther apart in the feature domain and makes the features more discriminative for classifying target and background.
To test the effectiveness of the proposed algorithm, we perform comprehensive experiments on three benchmarks. The results demonstrate that the proposed approach performs well on all three benchmarks and is superior to the compared methods in both qualitative and quantitative evaluation. They verify the feasibility of the proposed network, which alleviates the problem of target occlusion effectively while ensuring real-time performance.

Depth Models.
Target tracking methods based on deep learning track the target through the powerful representation ability of deep learning models. In 2012, the convolutional network AlexNet [11,12] was first proposed, and many convolutional networks were subsequently developed and used for target tracking, such as VGGNet [13], Google Inception Net [14], ResNet [15], and DenseNet [16]. The development of convolutional networks has solved a series of problems related to gradient diffusion in back-propagation, and the extracted semantic information is more robust to large changes. These models are highly effective for target detection and recognition [17,18] and image classification [19]. However, their benefit for tracking is less pronounced because of factors such as limited datasets and real-time requirements.
According to the way of deep learning model feature extraction, target tracking can be divided into tracking based on pretrained deep features and tracking based on offline training deep features.
In target tracking based on pretrained deep models, ImageNet [17] was the earliest source of pretrained features, in 2013. In 2015, Ma et al. proposed the HCF algorithm [20], which used VGG and integrated the shallow and deep features of the network into correlation filters. It showed good experimental results, but the algorithm did not handle the scale of the target: it assumed that the target scale is invariant during tracking, which makes it far less robust when tracking targets with large scale changes. In 2016, Qi et al. used the Hedge algorithm [21] to improve the fusion of the correlation filters trained on the features of each layer.
Then, Danelljan et al. proposed the C-COT algorithm [22], which combined deep semantic information and shallow appearance information, obtained a response map with continuous spatial resolution by interpolating the responses at different spatial resolutions, and then found the optimal scale and position by iteration [23]. The C-COT algorithm integrates feature maps of different resolutions well. Its disadvantage is that the amount of training data is very large, which easily causes dropped frames. In 2017, Danelljan et al. proposed the ECO algorithm [24], which improved on this by grouping samples, factorizing the convolution operators, and changing the update strategy. It increased the speed of the algorithm while preserving its accuracy. In 2018, Bhat et al. proposed the UPDT algorithm [25], which treated deep features and shallow features differently, used data augmentation and a difference response function to improve the accuracy and robustness of tracking, and also proposed a quality measure for adaptively fusing the response maps to further optimize the tracking result.
Deep learning models based on pretraining require less training data and can be used for target detection directly. However, the models are large, have many parameters, and have an inflexible structure, which leads to a large amount of computation.
Target tracking methods based on offline-trained deep models can achieve good tracking results by training, end to end, features that match the tracking task. Nam and Han proposed the MDNet algorithm [26] in 2016, which learned convolutional features to represent the target with a lightweight, small-scale network and used a SoftMax classifier [27] to classify the sampled candidates. It had good tracking performance, but its tracking speed leaves room for improvement.
Deep learning models based on offline training can achieve higher precision with fewer parameters, which speeds up convergence.

Siamese Neural Network.
The Siamese neural network is a deep learning model trained offline. Bertinetto et al. proposed the Siamese-FC algorithm [28], which addressed the more general similarity learning problem by training a deep network in an initial offline stage and used a fully convolutional Siamese network to locate the exemplar within a larger search image. This algorithm ran in real time, but its accuracy was not as good as that of correlation filtering methods combined with deep features. Tao et al. proposed the SINT algorithm [29], which generated multiple candidate regions in each image, learned a matching function between the candidate regions and the target template in a Siamese neural network, and then selected the candidate region with the smallest difference from the template for online tracking. This turned the tracking problem into a matching problem for the first time. However, processing a large number of candidate regions was cumbersome and time-consuming. In 2018, Li et al. added a region proposal network (RPN) [30] to the Siamese-FC algorithm, replacing multiscale detection with bounding box regression to obtain the bounding box with the maximum response. This improved the efficiency and performance of tracking, but the feature extraction capability of the convolutional layers remained to be improved. Most of the Siamese networks mentioned above are based on shallow networks, because deeper networks are prone to position errors caused by padding.
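The matching step behind Siamese-FC style trackers can be illustrated with a small sketch. It slides an exemplar feature map over a larger search feature map and records the inner product at each offset; the peak of the resulting response map marks the most similar region. The function names (`cross_correlate`, `best_location`) and the toy single-channel arrays are illustrative assumptions, not part of the cited algorithms, which operate on multi-channel learned embeddings.

```python
def cross_correlate(search, exemplar):
    """Slide `exemplar` over `search` and return the map of inner products."""
    sh, sw = len(search), len(search[0])
    eh, ew = len(exemplar), len(exemplar[0])
    response = []
    for top in range(sh - eh + 1):
        row = []
        for left in range(sw - ew + 1):
            # Inner product between the exemplar and the window at (top, left).
            score = sum(
                search[top + i][left + j] * exemplar[i][j]
                for i in range(eh)
                for j in range(ew)
            )
            row.append(score)
        response.append(row)
    return response


def best_location(response):
    """Return the (row, col) offset with the maximum response."""
    positions = (
        (r, c) for r in range(len(response)) for c in range(len(response[0]))
    )
    return max(positions, key=lambda rc: response[rc[0]][rc[1]])


# Toy single-channel "feature maps" (hypothetical values).
exemplar = [[1.0, 2.0], [3.0, 4.0]]
search = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 2.0, 0.0],
    [0.0, 3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
response = cross_correlate(search, exemplar)
peak = best_location(response)
```

In this toy case the exemplar pattern is embedded at offset (1, 1) of the search map, so the response peaks there.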

SE-Block.
The SE-block is a substructure that consists of a squeeze operation and an excitation operation; notably, it is not a complete network structure in itself. The SE-block learns feature weights from the network loss so that the weights of effective features become larger and the weights of ineffective or weakly effective features become smaller, which enhances the image through its effective channels [23]. In this way, the effective features extracted from the input frame are strengthened by using the channel correlation while the spatial feature information is still fully considered, so that the trained model achieves better results.

Temporal Attention Mechanism.
The attention mechanism is an important concept in neural networks and has been used widely in different fields, especially in image recognition, image processing, and NLP [31]. The attention mechanism in deep learning focuses on the key points, obtains the key information, and ignores other useless information. Most attention mechanism models are based on the encoder-decoder framework, which is shown in Figure 1.
In Figure 1, given an input x, the target y is generated by the encoder-decoder framework. The encoder encodes the input x and, through a nonlinear transformation, turns it into an intermediate semantic representation c. The decoder generates the target according to the semantic representation c of the input x and the previously generated historical information. The encoder-decoder framework can therefore be regarded as a general framework in which the encoder and decoder can use various model combinations, such as CNN, RNN, LSTM, and GRU.
Many approaches have been proposed for handling occlusion, but they have met with only limited success [12,18,32]. In this paper, the temporal attention mechanism is introduced to handle occlusion. In detail, the temporal attention mechanism is used to update the target state by adjusting the weights of the losses from training samples at the current frame and historical frames.

Construction of Network.
This paper proposes a network based on the Siamese neural network that improves the speed and accuracy of target tracking and handles occlusion. The network is trained offline, end to end. The structure of our network is shown in Figure 2.
The network consists of two processes: one is feature extraction in the Siamese neural network, and the other uses the positive and negative samples at the current and historical frames to update the target state with the help of the temporal attention mechanism. The target is generally the bounding box given in the first frame. We adopt exemplar images of 127 × 127 pixels after preprocessing; the search images are the candidate search regions in the frames to be tracked, and their size is 255 × 255 pixels after preprocessing [23]. The SE-block is added after conv5 of the network to form the SE-CNN structure, which makes full use of the channel and spatial information of the image to enhance the effectiveness of the channel features and improve feature extraction. SE-CNN is used to extract and then weight the features of the exemplar images and search images. The state of the search image with the maximum classification score is used as the estimated target state.
Then, we collect the positive and negative training samples at the current frame according to their overlap with the estimated target state. The positive training samples at historical frames are also used for updating the target state. The temporal attention mechanism reflects the weight of the estimated target state in the total loss when the parameters are updated online. The total loss is composed of the losses from the positive and negative samples at the current frame and the positive samples at historical frames, and the model is trained using the cross-entropy loss.

Network Improvement.
In this paper, the network uses the AlexNet structure [10], which includes five convolution layers and three fully connected layers, to extract features. The convolution kernels are 11 × 11 pixels in conv1, 5 × 5 pixels in conv2, and 3 × 3 pixels in conv3-5. The network adopts max pooling, and a ReLU (rectified linear unit) nonlinear activation function follows each convolution layer except conv5. The SE-block is added after conv5, and the normalization layer prevents the data distribution from changing, which reduces the risk of overfitting during training. The first improvement is to embed the SE-block after conv5 to form the SE-CNN module [23]. The SE-block consists of squeeze and excitation. The squeeze operation reduces the dimension of the features: it turns each two-dimensional feature channel into a real number that has, to some extent, a global receptive field, so that the output dimension matches the number of input feature channels and represents the global distribution of responses over the feature channels. The excitation operation generates a weight for each channel from the correlation between the feature channels; the weight represents the importance of that channel after feature selection. The reweight operation multiplies the original features channel by channel by these weights, which completes the recalibration of the original features in the channel dimension. The details of the improvement are shown in Figure 3. The second improvement is to make full use of the temporal attention mechanism to attend to historical and current samples based on the occlusion status.
The encoder-decoder framework can assign different influences (i.e., weights) to the positive and negative samples of video frames at different times and extract the key frames and the information they contain that may be useful for tracking, which makes the model's tracking decisions more accurate without increasing the cost of computation and storage. The historical samples are the reliable positive samples collected at historical frames, and the samples at the current frame reflect the state change of the target.

Tracking Strategy.
In essence, the tracking strategy can be roughly divided into the following four parts:
(1) Feature Extraction. SE-CNN is used as a feature extractor to extract the features of the exemplar images and the search images.
(2) Binary Classification. The features extracted by SE-CNN are input into the binary classifier, whose output is the probability that a candidate state is the target state, that is, the classification score.
(3) Estimated Target State. After the classification scores are compared, the candidate state with the maximum score is selected as the estimated target state.
(4) Handle Occlusion. We obtain the training samples from the current frame and historical frames. The temporal attention mechanism balances the relative importance of current and historical visual cues based on the occlusion status.

Algorithm Process.
The process of the algorithm in this paper is shown in Figure 4.

Image Preprocessing.
The exemplar images and the search images are resized to a fixed size. Specifically, preprocessing includes padding, cropping, and scaling; these steps must not distort the size information of the object, and they place the manually labeled target at the center of the image [23].

Feature Extraction.
The preprocessed exemplar and search images are input into the convolution layers in pairs. Assuming that the input is $X \in \mathbb{R}^{W' \times H' \times C'}$ and the output feature map is $U \in \mathbb{R}^{W \times H \times C}$, the convolution operation is
$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^{s} * X^{s},$$
where $v_c$ is the $c$-th convolution kernel, $X^{s}$ is the $s$-th input channel, and $u_c$ is the feature map of the $c$-th channel. Then, the feature map undergoes the squeeze operation, implemented by GAP (global average pooling) and written as $F_{sq}(\cdot)$. To express the global information of the feature map, we transform it from an input of size $H \times W \times C$ to an output of size $1 \times 1 \times C$:
$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j).$$
Next, the feature undergoes the excitation operation, denoted $F_{ex}(Z, W)$:
$$s = F_{ex}(Z, W) = \sigma\big(W_2\, \mathrm{ReLU}(W_1 Z)\big),$$
where ReLU is a nonlinear activation function, $\sigma(\cdot)$ is the sigmoid function, $Z$ is the result of the squeeze operation, and $W_1$ and $W_2$ are the parameters of two fully connected layers, which fuse the feature map information of the channels. Here $s$ is the vector of weights of the feature maps in the different channels, set as $\omega_i$ ($i = 1, 2, 3, 4, 5$). These weights are learned by the fully connected layers and the nonlinear layers, so they can be trained end to end [23]. Finally, the reweight operation applies the output weights to the original feature maps:
$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c,$$
where $s_c$ is the weight of the $c$-th channel and $u_c$ is a two-dimensional matrix. Different channels thus receive different weights. With these improvements, the network strengthens the effective channels according to their importance and improves the representation ability of the features [23].
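The squeeze, excitation, and reweight steps described above can be traced on a tiny example. The following is a minimal sketch in plain Python, assuming the standard SE form (global average pooling, two fully connected layers with ReLU and sigmoid, then channel-wise scaling). The function names, the toy feature maps, and the weight matrices `w1` and `w2` are hypothetical values chosen for illustration; they are not the trained parameters of the paper's SE-CNN.

```python
import math


def squeeze(feature_maps):
    """Global average pooling: reduce each H x W channel to one scalar."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def linear(weights, vector):
    """Matrix-vector product for a fully connected layer without bias."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def excitation(z, w1, w2):
    """s = sigmoid(W2 * ReLU(W1 * z)): one weight in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in linear(w1, z)]
    return [1.0 / (1.0 + math.exp(-o)) for o in linear(w2, hidden)]


def reweight(feature_maps, s):
    """Scale every value of channel c by its weight s[c]."""
    return [
        [[s_c * value for value in row] for row in channel]
        for channel, s_c in zip(feature_maps, s)
    ]


# Two 2 x 2 channels (hypothetical feature maps).
u = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
w1 = [[1.0, 1.0]]    # C = 2 -> 1 hidden unit (hypothetical weights)
w2 = [[0.0], [1.0]]  # 1 hidden unit -> C = 2 (hypothetical weights)

z = squeeze(u)              # per-channel means
s = excitation(z, w1, w2)   # per-channel weights
scaled = reweight(u, s)     # recalibrated feature maps
```

The bottleneck here has a single hidden unit only to keep the arithmetic easy to follow; the reduction ratio used in practice is a design choice not specified in this section.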

Binary Classifier.
Given the refined feature representation $\Phi_{att}(X^i_{t,j})$, the classification score is obtained by applying the binary classifier to it:
$$p^i_{t,j} = f\big(\Phi_{att}(X^i_{t,j});\, \omega^i_{cls}\big),$$
where $p^i_{t,j} \in [0, 1]$ is the output of the binary classifier $f$, representing the probability that the candidate state $X^i_{t,j}$ is the target $T_i$, and $\omega^i_{cls}$ is the parameter of the classifier for target $T_i$.

Estimated Target State.
The initial state of target $T_i$ is estimated by choosing the candidate state with the maximum classification score:
$$X^i_t = \arg\max_{X^i_{t,j}} \, p^i_{t,j}.$$
It is worth noting that an initial estimated state with too small a classification score will bias the updating of the model. To avoid model degeneration, we set a threshold: if the score is lower than the threshold, the target is considered not tracked at the current frame. Otherwise, the initial state $X^i_t$ is further refined using the object detection states $D_t = \{X^d_{t,m}\}_{m=1}^{M}$. In detail, the nearest detection state for $X^i_t$ is obtained as follows:
$$X^d_{t,m^*} = \arg\max_{X^d_{t,m} \in D_t} \mathrm{IoU}(X^i_t, X^d_{t,m}),$$
where $\mathrm{IoU}(X^i_t, X^d_{t,m})$ calculates the bounding box IoU overlap ratio between $X^i_t$ and $X^d_{t,m}$. Then, the final state of target $T_i$ is defined with respect to a predefined threshold $o_0$.
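The selection steps in this subsection (maximum classification score, score threshold, nearest detection by IoU) can be sketched as follows. Boxes are assumed to be `(x, y, width, height)` tuples, and the function names, the example scores, and the score threshold of 0.5 are hypothetical; the paper does not state the threshold value used. The final refinement with the threshold o_0 is not reproduced here.

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def estimate_state(candidates, scores, score_threshold):
    """Pick the candidate with the highest score, or None if it is too low."""
    best = max(range(len(candidates)), key=lambda j: scores[j])
    if scores[best] < score_threshold:
        return None  # the target is treated as untracked at this frame
    return candidates[best]


def nearest_detection(state, detections):
    """Return the detection with the largest IoU overlap with `state`."""
    return max(detections, key=lambda d: iou(state, d))


# Hypothetical candidates, classifier scores, and detections.
candidates = [(0, 0, 10, 10), (2, 2, 10, 10), (20, 20, 10, 10)]
scores = [0.2, 0.9, 0.4]
detections = [(3, 3, 10, 10), (30, 30, 10, 10)]

state = estimate_state(candidates, scores, score_threshold=0.5)
nearest = nearest_detection(state, detections)
overlap = iou(state, nearest)
untracked = estimate_state(candidates, [0.1, 0.2, 0.3], score_threshold=0.5)
```

Returning `None` for a low score mirrors the text's rule that a weakly scored estimate should not drive the model update.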

Handle Occlusion.
The training samples for online updating are obtained from the current frame and the historical states. For a target that is tracked, positive samples are drawn at the current frame $t$ with scale variations and small displacements around the estimated target state $X_t$. In addition, the historical states are used as positive samples. If the target is considered untracked at the current frame, only its historical states are used as positive samples. All negative samples are collected at the current frame $t$. The target-specific branches must be able to discriminate the tracked target from other targets and the background. Therefore, the estimated states of the other tracked targets are also used as negative samples. The loss function for updating the corresponding target-specific branch is defined as follows:
$$L^i_t = L^{i-}_t + (1 - \alpha^i_t)\, L^{i+}_t + \alpha^i_t\, L^{i+}_h,$$
where $L^{i-}_t$ is the loss from negative samples at the current frame, $L^{i+}_t$ is the loss from positive samples at the current frame, $L^{i+}_h$ is the loss from positive samples in the history, and $\alpha^i_t$ is introduced by the temporal attention mechanism.
To alleviate the problem of target occlusion during tracking, we introduce the temporal attention mechanism. The temporal attention of target $T_i$ is defined by the weighted feature $U(X^i_t)$ and the overlap status with other targets as follows:
$$\alpha^i_t = \sigma\big(\gamma^i s^i_t + \beta^i o^i_t + b^i\big),$$
where $\gamma^i$, $\beta^i$, and $b^i$ are learnable parameters, $s^i_t$ is the mean value of the weighted feature $U(X^i_t)$, $o^i_t$ is the maximum overlap between $T_i$ and the other targets at the current frame $t$, and $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. The value $\alpha^i_t$ represents the occlusion status of target $T_i$: the larger it is, the more seriously the target is occluded and the smaller the weight of the positive samples at the current frame. The temporal attention mechanism therefore provides a good balance between the current and historical visual cues of the target.
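The effect of the temporal attention value on the update can be illustrated numerically. The sketch below assumes that the attention is a sigmoid of a linear combination of the mean weighted feature s and the maximum overlap o, with learnable parameters written here as `gamma`, `beta`, and `b`, and that the total loss weights the current positive loss by (1 - alpha) and the historical positive loss by alpha. Both forms are assumptions consistent with the description in this section; all parameter and loss values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def temporal_attention(s, o, gamma, beta, b):
    """Occlusion-aware attention from feature mean `s` and max overlap `o`."""
    return sigmoid(gamma * s + beta * o + b)


def total_loss(loss_neg_cur, loss_pos_cur, loss_pos_hist, alpha):
    """Combine losses; a larger alpha shifts trust to historical positives."""
    return loss_neg_cur + (1.0 - alpha) * loss_pos_cur + alpha * loss_pos_hist


# Hypothetical learned parameters: a lower feature mean and a higher overlap
# both push alpha toward 1 (target judged more heavily occluded).
gamma, beta, b = -2.0, 4.0, 0.0

alpha_clear = temporal_attention(s=0.8, o=0.0, gamma=gamma, beta=beta, b=b)
alpha_occluded = temporal_attention(s=0.2, o=0.6, gamma=gamma, beta=beta, b=b)

loss_clear = total_loss(1.0, 2.0, 3.0, alpha_clear)
loss_occluded = total_loss(1.0, 2.0, 3.0, alpha_occluded)
```

With these values the occluded case yields the larger attention value, so the current-frame positive samples contribute less to the update and the historical positives contribute more.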

Experimental Setup.
The experiments in this paper use the PyTorch framework to build and train the convolutional neural network. The model is trained for 50 iterations on a GeForce RTX 2080 Ti GPU with a 2.4 GHz CPU [23]. The first 20 iterations train only the feature extraction network, and the last 30 iterations train the whole network, which also decides whether the location found by the object tracker is covered by an object detection. For parameter setting, the SGD optimizer [33][34][35] is used to optimize the loss function and update the network weights so as not to affect the tracking speed. The algorithm parameters are set as follows: the training batch size is 16; a warmup learning rate scheme is adopted, with an initial learning rate of 0.001 for the first 20 iterations and a learning rate of 0.005 for the last 30 iterations, which decreases to 0.0005 (weight decay); the speed on the test sequences is 0.5 fps; and the momentum is 0.9. We collect positive samples with an IoU overlap ratio of ≥0.7 and negative samples with an IoU overlap ratio of ≤0.3 with the target state at the current frame [36].
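The sample collection rule stated above (positive at IoU ≥ 0.7, negative at IoU ≤ 0.3 with the target state) can be sketched as a small labeling function. The `(x, y, width, height)` box format, the function names, and the example boxes are assumptions for illustration; candidates whose overlap falls between the two thresholds are left unused here, which is one common reading of such a rule rather than something the text states.

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_samples(boxes, target, pos_iou=0.7, neg_iou=0.3):
    """Split candidate boxes into positives and negatives by IoU with target."""
    positives, negatives = [], []
    for box in boxes:
        overlap = iou(box, target)
        if overlap >= pos_iou:
            positives.append(box)
        elif overlap <= neg_iou:
            negatives.append(box)
        # Boxes with neg_iou < overlap < pos_iou are ambiguous and skipped.
    return positives, negatives


# Hypothetical target state and candidate boxes.
target = (0, 0, 10, 10)
boxes = [(1, 0, 10, 10), (5, 0, 10, 10), (8, 8, 10, 10)]
positives, negatives = label_samples(boxes, target)
```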

Experimental Process.
The experimental process in this paper is shown in Figure 5 and is mainly divided into the following steps.
First, the SE-network and the temporal attention mechanism are introduced into the framework of the Siamese-FC algorithm, and the code is debugged. Next, a dataset is selected for training and testing, the training data are preprocessed, the improved algorithm is implemented, and the tracking model is trained. Then, the trained model is used to conduct experiments on the dataset, and the results are evaluated. Finally, other advanced tracking algorithms are tested, and their results are compared with those of the proposed algorithm.

Qualitative Evaluation.
To evaluate the effectiveness of the proposed algorithm, we train and test the network on the publicly available GOT10k benchmark [37], which is collected in unconstrained environments. It includes more than 10000 videos in 563 categories.
The test video sequences include many interference factors, such as rotation, occlusion, illumination change, direction change, and scale change, which help to verify the practical value of the proposed algorithm. C-COT [22], ECO [24], UPDT [25], MDNet [26], and CFNet [38] are selected for comparison with our algorithm on this benchmark. For a fair comparison, all the compared state-of-the-art algorithms, including ours, use the same parameters during testing.
This paper shows some experimental results of the six algorithms on GOT10k. Boxes with different colors represent the tracking results of different algorithms, and the proposed algorithm is represented by the red box. The qualitative evaluation is carried out from the following five aspects to show where the proposed algorithm tracks better than the other algorithms, as shown in Figure 6.
(1) Target Rotation. In "000577" and "005501," the direction of the tracked target changes dramatically, which makes the other five algorithms fail, whereas our algorithm tracks the target accurately.
(2) Motion Blur. In video sequences "003867" and "006037," motion blur arises from fast movement of the target or camera shake, which causes the compared algorithms to drift, whereas our algorithm is not affected.
(3) Illumination Change. In sequences such as "006504," the illumination changes dramatically during tracking, which requires the algorithm to be robust to illumination. In video sequence "006504," the compared algorithms fail one after another, and only our algorithm tracks the target accurately after the 119th frame, when the illumination changes dramatically.
(4) Complex Background. In video sequences "000492" and "000501," the complex background poses a great challenge to tracking accuracy. In "000492," all the compared algorithms are disturbed by the complex background and lose the target, whereas ours does not. In "000501," the compared algorithms drift to different degrees from the 10th frame onward, whereas our algorithm tracks the target accurately.
(5) Occlusion. In video sequences "000496," "000505," "000507," and "000510," the target is occluded to different degrees during tracking. In "000510," the target is seriously occluded by several animals, and only our algorithm tracks the target correctly.

Quantitative Evaluation.
To demonstrate the effectiveness of the algorithm objectively and comprehensively, we compare the proposed algorithm with several advanced tracking methods on three challenging benchmarks. GOT10k [37] is a large-scale benchmark that includes over 10,000 videos. On it, our algorithm is compared with C-COT [22], ECO [24], MDNet [26], SiamFCv2 [38], CF2 [39], GOTURN [40], and SiamFC [28], with AO (average overlap) and SR (success rate) as the evaluation criteria. The results are shown in Table 1.
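The two evaluation criteria can be made concrete with a short sketch. It assumes that AO is the mean of the per-frame overlaps between predicted and ground-truth boxes and that SR is the fraction of frames whose overlap exceeds a threshold. The thresholds 0.5 and 0.75 and the per-frame overlap values below are illustrative assumptions, since this section does not state them, and the official benchmark toolkit may aggregate results differently (for example, across sequences or classes).

```python
def average_overlap(overlaps):
    """AO: mean overlap between predictions and ground truth over frames."""
    return sum(overlaps) / len(overlaps)


def success_rate(overlaps, threshold):
    """SR: fraction of frames whose overlap exceeds `threshold`."""
    return sum(1 for o in overlaps if o > threshold) / len(overlaps)


# Hypothetical per-frame IoU overlaps for one tracked sequence.
overlaps = [0.9, 0.8, 0.6, 0.4, 0.0]

ao = average_overlap(overlaps)
sr_50 = success_rate(overlaps, 0.5)
sr_75 = success_rate(overlaps, 0.75)
```

A frame in which the target is lost contributes an overlap of zero, so it lowers AO and never counts as a success.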

Conclusions
This paper uses the Siamese neural network as the research framework and adds the SE-block and TAM to the network. SE-CNN makes full use of spatial feature information and channel correlation and lets the weights of the extracted features change according to their contribution, which is equivalent to a channel attention mechanism. TAM updates the target state by adjusting the weights of samples at the current frame and historical frames. The experimental results show that the proposed algorithm has good robustness in target tracking, satisfies the real-time requirements of tracking, and alleviates the problem of occlusion effectively. However, the tracker still deviates from the target in some video sequences in which the target moves too fast. Solving this problem is the focus of our future research.