A Real-Time Structured Output Tracker with Scale Adaption for Visual Target Tracking

The structured output tracking algorithm is a visual target tracking algorithm with excellent overall performance in recent years. However, its classifier produces erroneous information, resulting in target loss or tracking failure, when the target is occluded or its scale changes during tracking. In this work, a real-time structured output tracker with scale adaption is proposed: (1) target position prediction is added to the tracking process to improve real-time performance; (2) an adaptive scheme for target scale discrimination is proposed in the structured SVM to improve overall tracking accuracy; and (3) a Kalman filter is used to solve the occlusion problem in continuous tracking. Extensive evaluations on the OTB-2015 benchmark dataset with 100 sequences show that the proposed tracker runs at a highly efficient 84 fps and performs favorably against other tracking algorithms.


Introduction
Visual target tracking is an important task in computer vision and has been widely applied in fields such as intelligent transportation, military, and medical systems [1][2][3]. Many scholars have studied target tracking and achieved great progress. However, tracking still faces many challenges, such as target scale changes, occlusion, and illumination changes [4,5]. Therefore, ensuring accuracy during tracking while improving real-time performance is of great theoretical and practical significance [6,7]. The Struck tracker is an algorithm based on a discriminative classifier, and it adapts well to various complex backgrounds in visual target tracking [8,9]. The Struck tracker constructs a structured output support vector machine (SVM) classifier through online learning. During tracking, the algorithm first samples the local area around the target position in the previous frame and then takes the maximum of the classifier discriminant function as the current tracking result [10]. When updating the classifier, the algorithm abandons the intermediate sample-labeling step of conventional classifiers. Therefore, under occlusion and scale variation, the Struck tracker cannot track the target correctly [11,12]. This paper proposes a structured output SVM tracker with real-time and occlusion detection capabilities. The work includes the following. (1) A target position prediction process is introduced: sparse samples are used to compute a rough target position from the discriminator, which reduces the computation needed to determine the target position and improves real-time performance. (2) A multiscale sampling, adaptive scale tracking strategy is proposed: a scale discriminator computes the scale change of the target and adjusts the scale adaptively.
(3) The discriminant function value of the SVM classifier is used to implement occlusion detection. The SVM classifier stops updating when a certain degree of occlusion is detected, and a Kalman filter is applied to predict the target position while the target retains its motion information, so the target can still be tracked successfully.

Related Works
With the development of target tracking, many visual target tracking algorithms have been investigated [13]. The mainstream algorithms include traditional tracking methods, correlation filtering-based tracking methods, and tracking methods based on deep learning [14,15].

Traditional Tracking Methods.
Caulfield and Dawson-Howe [16] proposed the FragTrack method, which compares the template histogram with sub-block histograms to obtain possible locations of the target. Wang and Li [17] suggested a mean shift algorithm that uses color as the feature to obtain the probability density map of the whole image. Wu et al. [18] introduced a multi-instance learning method to address target drift and improve tracking accuracy. Babenko et al. [19] provided a compressive tracking algorithm to deal with occlusion and image noise. Chunxiao et al. [20] developed a tracking-learning-detection method that combines a traditional tracking algorithm with a detection algorithm to handle target occlusion. Hare et al. [21] presented a structured output SVM tracker that is robust to occlusion of the tracking target.

Correlation Filtering-Based Tracking Methods.
Bolme et al. [22] used a minimum output sum of squared error (MOSSE) filter, which improved tracking accuracy over many traditional algorithms. Henriques et al. [23] developed a circulant structure kernel tracker that builds a cyclic matrix to solve the ridge regression problem. Henriques et al. [24] proposed the kernel correlation filter (KCF), which performs excellently in tracking accuracy but cannot handle illumination and scale changes. Zhang and Zheng [25] presented a spatiotemporal context tracking algorithm with excellent real-time performance. Li and Zhu [26] suggested a scale adaptive multifeature tracking algorithm for handling scale changes of the tracking target.

Tracking Methods Based on Deep Learning.
Held et al. [27] presented the GOTURN tracking method, applying an end-to-end deep learning model to target tracking for the first time. Nam and Han [28] developed the MDNet tracking algorithm, which updates the model online to adapt to changing targets and scenes. Danelljan et al. [29] proposed the C-COT tracking algorithm, which uses a deep neural network to extract target features and handles target scale changes. Danelljan et al. [30] proposed the efficient convolution operators (ECO) tracking algorithm, which addressed the excessive size of the C-COT model and improved real-time performance.
Although these methods are somewhat effective in dealing with complex backgrounds, they have two shortcomings: (1) they cannot judge the current occlusion state, and so introduce erroneous information into the classifier when the target is occluded; (2) their performance drops significantly when the target scale changes. This work uses the discriminant function value of the SVM classifier to achieve occlusion detection by combining the framework of the Struck tracker with the motion information of the target. The classifier stops updating when partial occlusion is detected, and a Kalman filter is used to predict the target position while the target retains its motion information. To handle changes of target scale, a multiscale sampling strategy is proposed, and the optimal scale is computed with a scale discriminator.
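The occlusion-handling idea above, freezing the classifier and letting a motion model carry the target forward, can be sketched with a constant-velocity Kalman filter applied independently to each image coordinate. This is an illustrative sketch, not the authors' implementation; the per-axis decoupling and the noise parameters q and r are assumptions.

```python
# Hypothetical sketch: a 1-D constant-velocity Kalman filter per image axis,
# used to predict the target centre while occlusion freezes the classifier.
# The noise values q, r are illustrative assumptions, not the paper's.

class AxisKalman:
    """Constant-velocity Kalman filter for one image coordinate."""
    def __init__(self, pos, q=1e-2, r=1.0):
        self.x = [pos, 0.0]                # state: [position, velocity]
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q, self.r = q, r              # process / measurement noise

    def predict(self, dt=1.0):
        """Propagate the state one frame ahead; returns predicted position."""
        x, v = self.x
        self.x = [x + dt * v, v]
        p00, p01 = self.P[0]; p10, p11 = self.P[1]
        # P = F P F^T + Q with F = [[1, dt], [0, 1]]
        self.P = [[p00 + dt * (p10 + p01) + dt * dt * p11 + self.q,
                   p01 + dt * p11],
                  [p10 + dt * p11, p11 + self.q]]
        return self.x[0]

    def update(self, z):
        """Correct the state with an observed position z (H = [1, 0])."""
        s = self.P[0][0] + self.r
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s
        y = z - self.x[0]
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        p00, p01 = self.P[0]; p10, p11 = self.P[1]
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
```

While the target is visible, `update` is fed the tracked position each frame; once occlusion is detected, only `predict` is called, so the target position continues along the estimated motion until the target reappears.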

Proposed Method
First, the Struck tracker is reviewed; then the real-time structured output tracker with scale adaption is introduced, and the full details of the proposed algorithm are given.

The Struck Tracker.
During the tracking process, let P_{t−1} be the estimated bounding box at time t − 1. The Struck algorithm estimates the target displacement y_t ∈ Y, where Y is the search space Y = {(u, v) | u² + v² < r²} (r is the search radius, and (u, v) are two-dimensional space coordinates). The current-frame target position P_t, expressed as P_t = P_{t−1} ∘ y_t, is obtained by shifting the previous-frame position.
The prediction function f: X ⟶ Y is constructed to estimate the target transformation between frames, so the output space is the space of all transformations Y rather than binary labels. A discriminant function is introduced to predict the target position between frames:

F(x, y) = ⟨w, Φ(x, y)⟩,    (1)

which evaluates the similarity between a sample and the target (w is the coefficient vector and Φ is the kernel-induced feature map from the input space to the feature space). Given the sample set (x_1, y_1), …, (x_n, y_n) (n ≥ 1), the optimal hyperplane is found with the optimization objective

min_w (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i
s.t. ξ_i ≥ 0 and ⟨w, Φ(x_i, y_i)⟩ − ⟨w, Φ(x_i, y)⟩ ≥ Δ(y_i, y) − ξ_i, ∀i, ∀y ≠ y_i,    (2)

where Δ(y_i, y) is the loss function and C is the regularization parameter. This loss function should decrease towards 0 as y and y_i become more similar, with Δ(y_i, y) = 0 only when y = y_i. We choose the loss based on the overlap of target bounding boxes:

Δ(y_i, y) = 1 − s_o(y_i, y),    (3)

where s_o(y_i, y) is the overlap (intersection-over-union) between the target bounding boxes. Using standard Lagrangian duality techniques, the solution of equation (2) can be converted into its equivalent dual form:

max_α Σ_{i,y} Δ(y_i, y) α_i^y − (1/2) Σ_{i,y} Σ_{j,ȳ} α_i^y α_j^ȳ ⟨δΦ_i(y), δΦ_j(ȳ)⟩,    (4)

with δΦ_i(y) = Φ(x_i, y_i) − Φ(x_i, y). Replacing the parameter

β_i^y = −α_i^y + δ(y, y_i) Σ_ȳ α_i^ȳ    (5)

in equation (4), it can be written as

max_β −Σ_{i,y} Δ(y, y_i) β_i^y − (1/2) Σ_{i,y} Σ_{j,ȳ} β_i^y β_j^ȳ ⟨Φ(x_i, y), Φ(x_j, ȳ)⟩,
s.t. β_i^y ≤ C δ(y, y_i) and Σ_y β_i^y = 0,    (6)

where δ(y, ȳ) = 1 if y = ȳ and δ(y, ȳ) = 0 otherwise.
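The overlap-based loss Δ(y_i, y) = 1 − s_o(y_i, y) is straightforward to compute. The helper below is a minimal sketch assuming axis-aligned boxes in (x, y, w, h) format, with the overlap s_o taken as intersection-over-union.

```python
# Illustrative helper for the Struck structured loss based on bounding-box
# overlap. Box format (x, y, w, h) is an assumption for this sketch.

def overlap(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def loss(y_i, y):
    """Structured loss: 0 for identical boxes, approaching 1 as overlap vanishes."""
    return 1.0 - overlap(y_i, y)
```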
Then the discriminant function can be simplified to

F(x, y) = Σ_{i,ȳ} β_i^ȳ ⟨Φ(x_i, ȳ), Φ(x, y)⟩.    (7)

We refer to a sample (x_i, y) with β_i^y ≠ 0 as a support vector. For a given support pattern x_i, only the support vector (x_i, y_i) has β_i^{y_i} > 0, and any other support vector (x_i, y) with y ≠ y_i has β_i^y < 0; we refer to these as positive and negative support vectors, respectively. The selection of support vectors is controlled by the following gradient:

g_i(y) = −Δ(y, y_i) − F(x_i, y).    (8)

Since Δ(y, y_i) appears in g_i(y), the gradient computation includes the overlap between the sample and the target bounding box. The algorithm updates β_i^y and the gradients g_i incrementally at each frame, which implements the online learning and update of the classifier.

The Real-Time Structured Output Tracker with Scale Adaption.
The output of the Struck tracker is only the target position, and the accuracy of the target position decreases when the scale changes, so we propose a real-time structured output model with scale adaption. Taking the target tracking frame x of the previous frame, the set Y of candidate target positions in the current frame, and the set S of candidate target scales in the current frame as input, and the exact position y and the target scale s in the current frame as output, the model can be represented by the following decision function:

(y*, s*) = argmax_{y ∈ Y, s ∈ S} G(x, y, s),    (10)

where G is the discriminant function

G(x, y, s) = ⟨w, Φ(x, y, s)⟩,    (11)

with w the coefficient vector, Φ(x, y, s) the structured feature function, and ⟨·, ·⟩ the inner product. Combining equation (11), the adaptive scale tracking model is

min_w (1/2)‖w‖² + C Σ_{i=1}^{n} Σ_{j=1}^{k} ε_{ij}
s.t. ε_{ij} ≥ 0 and ⟨w, Φ(x_i, y_i, s_j)⟩ − ⟨w, Φ(x_i, y, s)⟩ ≥ Δ(y_i, s_j; y, s) − ε_{ij},    (12)

where n is the sampling number, ε_{ij} is the slack variable, and Δ(y_i, s_j; y, s) is the loss function. Analogously to equation (3), the loss function can be written as

Δ(y_i, s_j; y, s) = 1 − s_o(T(y_i, s_j), T(y, s)),    (13)

where T(y, s) is the sample block when the target position and scale are (y, s), and T(y_i, s_j) is the sample block when the target position and scale are (y_i, s_j).
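The scale set S with fixed step gap, and the multiscale sampling it drives, can be sketched as follows. The concrete values (k = 5 scales, gap = 0.1, boxes in (x, y, w, h) format) are illustrative assumptions, not the paper's exact settings; the experiments use 31 scales.

```python
# Sketch of the multiscale sampling scheme: a scale set S with a fixed
# step `gap` between neighbouring scales, applied around a base box.
# The defaults (k = 5, gap = 0.1) are assumptions for illustration.

def scale_set(k=5, gap=0.1):
    """Scales centred on 1.0 satisfying s_j - s_{j-1} = gap."""
    half = (k - 1) // 2
    return [1.0 + gap * (j - half) for j in range(k)]

def scaled_boxes(box, scales):
    """Resize a box (x, y, w, h) about its centre for every scale in S."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    return [(cx - w * s / 2.0, cy - h * s / 2.0, w * s, h * s)
            for s in scales]
```

Each scaled box is then a candidate (y, s) pair whose sample block T(y, s) is scored by the discriminant function.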
Using the same Lagrangian duality as for equation (4), the structured SVM problem of equation (12) can be transformed into its dual with coefficients β_{ij}^{(y,s)}, and equation (14) is obtained:

max_β −Σ_{i,j} Σ_{y,s} Δ(y, s; y_i, s_j) β_{ij}^{(y,s)} − (1/2) Σ_{i,j} Σ_{y,s} Σ_{m,l} Σ_{ȳ,s̄} β_{ij}^{(y,s)} β_{ml}^{(ȳ,s̄)} ⟨Φ(x_i, y, s), Φ(x_m, ȳ, s̄)⟩,
s.t. β_{ij}^{(y,s)} ≤ C δ(y, s; y_i, s_j) and Σ_{y,s} β_{ij}^{(y,s)} = 0,    (14)

where δ(y, s; y_i, s_j) = 1 when y = y_i and s = s_j, and δ(y, s; y_i, s_j) = 0 otherwise. We can then get the discriminant function

G(x, y, s) = Σ_{i,j} Σ_{ȳ,s̄} β_{ij}^{(ȳ,s̄)} ⟨Φ(x_i, ȳ, s̄), Φ(x, y, s)⟩.

By selecting a proper kernel function K(x_i, ȳ, s̄; x, y, s) = ⟨Φ(x_i, ȳ, s̄), Φ(x, y, s)⟩, the discriminant function is

G(x, y, s) = Σ_{i,j} Σ_{ȳ,s̄} β_{ij}^{(ȳ,s̄)} K(x_i, ȳ, s̄; x, y, s).    (18)

Equation (18) is the final form of the discriminant function in the adaptive scale tracking model. The tracking process of the algorithm is divided into two parts: target position prediction, and determination of the target position and scale.
In the target position prediction step, the neighbourhood of the previous frame's target position is sampled sparsely rather than densely, which reduces the number of candidates to evaluate. The structured feature vector of each sampled position is extracted, and the rough target position y^rough is obtained by maximizing the discriminant function of equation (18) under the decision function of equation (10). The process of determining the target position and scale includes four steps. (1) A scale set S = {s_1, s_2, …, s_j, …, s_k} (k ≥ 1) is set up whose elements satisfy s_j − s_{j−1} = gap (where gap is the step size).
(2) Taking the rough estimated target location y_j^rough as centre, the search area is further reduced, and the scale set S is sampled in this reduced area, giving k groups of sampling results. (3) The structured feature vector of each sampled block is extracted. (4) The feature vector Φ(x_pq, y_pq^accurate, s_p) that maximizes the decision function according to equations (18) and (22) is selected; the corresponding sample (y_pq^accurate, s_p), in which y_pq^accurate is the exact position and s_p is the target scale, is the tracking result. The performance of the Struck tracker decreases significantly when the target is partially or completely occluded, so we perform active occlusion detection during the tracking of each frame.
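The coarse-to-fine search described above, a sparse sample of displacements for the rough position followed by dense sampling in a reduced radius, can be sketched as below. The radial and angular step sizes are hypothetical; the paper does not specify the exact sparse pattern.

```python
# Hypothetical sketch of the two-stage search: sparse radial sampling of
# displacements for the rough position, then a dense integer grid inside
# a reduced radius for refinement. Step sizes are illustrative assumptions.

import math

def sparse_samples(r, step_r, step_a_deg):
    """Coarse displacements on polar rings of radius <= r."""
    out = [(0, 0)]
    rad = step_r
    while rad <= r:
        a = 0.0
        while a < 360.0:
            t = math.radians(a)
            out.append((round(rad * math.cos(t)), round(rad * math.sin(t))))
            a += step_a_deg
        rad += step_r
    return out

def dense_samples(r):
    """All integer displacements (u, v) with u^2 + v^2 < r^2."""
    return [(u, v) for u in range(-r, r + 1)
                   for v in range(-r, r + 1) if u * u + v * v < r * r]
```

In the tracker, `sparse_samples` would be scored to find y^rough, and `dense_samples` with a smaller radius would then be scored around y^rough at every scale in S.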
Let F_t = F(P_{t−1} ∘ y_t) be the value of the classifier discriminant function at the search position P = P_{t−1} ∘ y in frame I_t (where t is the frame number of the tracking target); for the current frame, F_t is the discriminant value at the estimated target position P = P_{t−1} ∘ y_t. The change rate of F_t can be defined as

V_{F_t} = (F_t − F̄) / |F̄|,

where a queue Q = {F_1, …, F_v} (v ≤ t) stores the discriminant values of the last v frames and F̄ = (1/v) Σ_{i=1}^{v} Q_i is the average of the elements in Q. After each frame is tracked, the current value F_t is appended to Q to update the occlusion detection. A threshold c is introduced to distinguish whether the target is occluded. When V_{F_t} > c, the algorithm continues to track the target normally. When V_{F_t} ≤ c, the algorithm stops updating both the queue Q and the classifier, ensuring that no erroneous information is introduced into the classifier.
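The occlusion test can be sketched as a small detector that keeps the queue Q of recent discriminant values and freezes updates when the relative drop exceeds the threshold. The exact change-rate formula used here (relative drop against the running average) is an assumption, chosen to be consistent with the negative threshold c = −0.3 reported in the experiments.

```python
# A minimal sketch of the occlusion detector: the queue Q holds the last
# v discriminant values; a relative drop of the current value below the
# threshold c flags occlusion and freezes Q (and the classifier).
# The change-rate formula is an assumption consistent with c = -0.3.

from collections import deque

class OcclusionDetector:
    def __init__(self, v=10, c=-0.3):
        self.q = deque(maxlen=v)   # queue Q of recent discriminant values
        self.c = c                 # occlusion threshold

    def step(self, f_t):
        """Return True if the target is judged occluded in this frame."""
        if not self.q:
            self.q.append(f_t)
            return False
        avg = sum(self.q) / len(self.q)
        rate = (f_t - avg) / abs(avg) if avg != 0 else 0.0
        occluded = rate <= self.c
        if not occluded:           # freeze Q while occluded
            self.q.append(f_t)
        return occluded
```

When `step` returns True, the tracker would stop updating the classifier and fall back on the Kalman filter's position prediction until the discriminant value recovers.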

Steps of the Suggested Method.
The steps of the proposed method are summarized in Algorithm 1.

Experiments
The tracking performance of six trackers, namely the structured output tracker with kernels (Struck), the kernel correlation filter (KCF), the background-aware correlation filter (BACF), the tracking-learning-detection (TLD) tracker, the compressive tracking (CT) algorithm, and the proposed tracker (OURS), is evaluated [31,32]. The overlap success rate reflects the degree of overlap between the tracked bounding box and the ground-truth box [33,34]. The center position error indicates the offset between the output target center and the labeled center position [35,36].

Parameter Setup.
Experiments are implemented in Matlab on an Intel I7-8565U 1.8 GHz CPU with 8 GB RAM. The regularization parameter C is set to 100, the support vector budget is 100, the original search range is three times the target size of the previous frame, the number of scales is 31, the history queue Q = {F_1, …, F_v} of discriminant values has length v = 10, the occlusion detection threshold c is −0.3, the target bounding box size change factor δ is 0.1, and the search radius r is 30.

Per-sequence results (table spilled here during extraction; column headers lost, apart from the sequence Man):
OURS: 97, 108, 119, 58, 112, 109
Struck: 34, 53, 76, 28, 62, 75
KCF: 87, 81, 105, 47, 99, 107
BACF: 69, 55, 61, 32, 71, 77
TLD: 12, 14, 23, 11, 35, 45
CT: 37, 39, 46, 17, 60, 66

Experiments on the OTB-2015 Benchmark Dataset.
The tracking performance of the different trackers is tested on the OTB-2015 benchmark dataset. The tracking results are shown in Figure 1, from which we can see that the OURS tracker outperforms the comparison trackers. The OURS tracker and the Struck tracker are also tested under various complex backgrounds; the selected sequence information is shown in Table 1, and their average success rate and average precision plots under the various backgrounds are shown in Figure 2. Figure 2 shows that the OURS tracker performs better than the Struck tracker on the OTB-2015 benchmark dataset.
The success rates of the various trackers on the selected sequences are shown in Table 2; the tracking performance of the OURS tracker is better than those of the other comparison algorithms. The frames per second of the various trackers on the selected sequences are shown in Table 3; the running speed of the OURS tracker is higher than those of the other comparison algorithms.

The Success Rate on More Challenging Video Sequences.
In addition, the tracking success rate plots for 11 challenging video sequences are shown in Figure 3. We can see from Figure 3 that the OURS tracker shows better tracking performance on these 11 sequences than the other trackers; it therefore handles challenging video sequences better than the comparison trackers. In the video sequence Faceocc1, the challenge is partial occlusion. The TLD tracker and the Struck tracker begin to drift at the 44th frame. As the target emerges from partial occlusion, the comparison trackers lose the target at the 69th frame, but the OURS tracker can deal with the partial occlusion.

In the video sequence Coke, the target is affected by complete occlusion: it is fully occluded at the 256th frame. The Struck, KCF, and TLD trackers fail at the 276th frame, but the OURS tracker can deal with the full occlusion problem.
In the video sequence Faceocc2, partial occlusion affects the target during tracking. The target is partially occluded at the 261st frame. The comparison trackers lose the target when it emerges from partial occlusion at the 294th frame, but the OURS tracker can solve the partial occlusion problem.
In the video sequence Deer, the tracking target undergoes fast motion. The Struck tracker and the TLD tracker lose the target at the 19th frame, and the KCF tracker loses it at the 60th frame, but the OURS tracker can handle the fast motion.
In the video sequence Man, illumination variation affects the target during tracking. The TLD tracker and the Struck tracker begin to drift at the 35th frame. All comparison trackers lose the target as the background illumination changes gradually, but the OURS tracker can solve the illumination variation problem.
In the video sequence Dog, the target scale changes. The comparison trackers begin to drift at the 911th frame and lose the target at the 1034th frame, but the OURS tracker adapts to the changes in target scale and tracks the target correctly.

Conclusion
In this work, a real-time structured output tracker with scale adaption is proposed: (1) a target position prediction process, which improves real-time performance, is added during tracking; (2) multiscale sampling is used to obtain samples of different scales, and the best scale is obtained with a scale discriminator to improve tracking accuracy; and (3) an occlusion judgment mechanism determines whether to update the classifier, and Kalman filtering is applied to solve the problem of continuous tracking under occlusion. The tracking performance of the OURS tracker is better than those of the other trackers in the different research cases due to the following advantages.
The OURS tracker uses a multiscale sampling strategy to estimate the target scale during tracking and a Kalman filter to handle tracking under target occlusion. The experimental results show that the proposed tracker performs excellently on various complex backgrounds in the OTB-2015 dataset and also achieves excellent success rate and tracking accuracy on different challenging sequences.
In the future, our research will focus on applying the proposed algorithm to multitarget tracking, building on its successful application to single-target tracking.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.