Robust Visual Correlation Tracking

Recent years have seen growing interest in tracking-by-detection methods for visual object tracking because of their excellent tracking performance. However, most existing methods fix the target scale, which leaves the trackers unable to handle large scale variations in complex scenes. In this paper, we decompose tracking into target translation and scale prediction. We adopt a scale estimation approach based on the tracking-by-detection framework, develop a new model update scheme, and present a robust correlation tracking algorithm with discriminative correlation filters. The approach works by learning translation and scale correlation filters. We obtain the target translation and scale by finding the maximum output response of the learned correlation filters and then update the target models online. Extensive experimental results on 12 challenging benchmark sequences show that, compared to the second best of five other state-of-the-art tracking algorithms, the proposed approach reduces the average center location error (CLE) by 6.8 pixels and significantly improves performance by 17.5% in average success rate (SR) and by 5.4% in average distance precision (DP); it is also robust to appearance variations introduced by scale variations, pose variations, illumination changes, partial occlusion, fast motion, rotation, and background clutter.


Introduction
Visual tracking, as a fundamental step to explore videos, is important in many computer vision based applications, such as face recognition, human behavior analysis, robotics, intelligent surveillance, intelligent transportation systems, and human-computer interaction. The objective of visual tracking is to estimate the locations of a target in a video sequence [1][2][3]. During the tracking process, the state of the target is estimated over time by associating its representation in the current frame with those in previous frames. Though research on visual tracking algorithms has lasted for decades, visual tracking remains a challenging problem because of factors such as pose variation, illumination changes, partial occlusion, fast motion, scale variation, background clutter, and so on.
In general, current tracking algorithms can be classified as either generative or discriminative approaches. Generative approaches [4][5][6][7] focus on learning an appearance model and formulate the tracking problem as finding the target observation most similar to the learned appearance or with minimal reconstruction error. The models are based on templates or subspace models. However, these generative models do not take the background information into consideration, thereby throwing away useful information that can help to discriminate the object from the background. Different from generative trackers, discriminative methods [8][9][10][11][12][13] address the tracking problem as a classification problem that differentiates the tracked targets from the backgrounds. They employ both the target and the background information. For example, Avidan [14] proposes a strong classifier based on a set of weak classifiers to perform ensemble tracking. Kalal et al. [15] propose a P-N learning algorithm to learn tracking classifiers from positive and negative samples. These methods are also termed tracking-by-detection [16][17][18], in which a binary classifier separates the target from the background in consecutive frames. In recent years, tracking-by-detection methods have been shown to provide excellent tracking performance.
Most current tracking algorithms are confined to finding the target location only. This implies poor tracking performance in sequences with great scale changes. Several methods [19][20][21] that use Scale Invariant Feature Transform (SIFT) features can adapt to object scale variations, but they run at low frame rates and cannot be used in real-time applications. Tu et al. [19] propose a vehicle tracking approach combining blob based tracking and SIFT features based tracking, which is robust to the size of the vehicle. Jiang et al. [20] present a novel algorithm for object tracking based on the particle filter and SIFT. Wei et al. [21] propose a SIFT based mean shift algorithm, which can be used for continuous vehicle tracking in complex situations. In this paper, we present an adaptive scale tracking approach using discriminative correlation filters, which can estimate the target scale accurately. One main contribution of this work is to decompose the tracking task into translation and scale estimation. The target translation and scale estimation both work by making use of kernelized correlation filters. In addition, we adopt a new online update scheme based on the MOSSE [22] tracker, which takes all previous frames into consideration when computing the current models. Experimental results on challenging video sequences demonstrate the superior robustness and stability of our proposed method against state-of-the-art methods.
The rest of this paper is organized as follows. A brief summary of the most related work is first given in Section 2. The tracking algorithm with kernelized correlation filters is introduced in Section 3. Section 4 describes our proposed approach. Following this, the experimental results are presented with comparisons to state-of-the-art methods on challenging sequences in Section 5. Finally, we conclude this paper in Section 6.

Related Work
Visual object tracking has been studied extensively and has many applications. In this section, we introduce the approaches most closely related to our work.
Correlation filters have been used in many applications such as object detection and recognition [23]. Since the correlation operator is readily transferred into the Fourier domain as element-wise multiplication, correlation filters have recently attracted considerable attention in visual tracking due to their computational efficiency. In recent years, researchers have begun to bring correlation filters into tracking-by-detection methods, with great success. Bolme et al. [22] propose to learn a minimum output sum of squared error (MOSSE) filter for visual tracking on gray-scale images, where the learned filter encodes the target appearance and is updated on every frame. Henriques et al. [24] propose a circulant structure of tracking-by-detection with kernels (CSK) method, which uses correlation filters in a kernel space. They propose the first kernelized correlation filter, but the CSK method only builds on single-channel features. Generalizations of linear correlation filters to multiple channels have also been proposed [25][26][27], which allow the use of more modern features such as histograms of oriented gradients (HOG). Henriques et al. [28] propose a kernelized correlation filter (KCF) tracking algorithm, which is further improved by using HOG features. Danelljan et al. [1] propose an adaptive color attributes tracking method, which exploits the color attributes of a target and learns an adaptive correlation filter by mapping multichannel features into a Gaussian kernel space. However, the above methods do not consider target scale prediction. Recently, Wu et al. [29] performed a comprehensive evaluation of online tracking algorithms. In the evaluation, the CSK tracker is shown to provide competitive performance with the highest speed among the ten top trackers. Due to its excellent performance, we base our approach on the CSK tracker.

Kernelized Correlation Filters Based Tracking
For correlation filter based trackers, correlation can be computed in the Fourier domain through the Fast Fourier Transform (FFT), and the correlation response can be transformed back into the spatial domain with the inverse FFT. The CSK tracking method explores a dense sampling strategy, showing that the process of taking subwindows in a frame induces a circulant structure. The CSK tracker learns a regularized least squares (RLS) classifier of the target appearance from a single image patch, obtains the kernelized correlation filter using circulant matrices and the kernel trick, and localizes the target in a new frame by finding the maximum response of the correlation filter. In this section, we briefly describe the CSK tracker.

Circulant Matrices.
Assume C(a) is an N × N circulant matrix; then it can be obtained from a 1 × N vector a = (a_0, a_1, ..., a_{N−1}): the first column is the transposition of the vector a, the second column is the transposition of the vector cyclically shifted one element to the right, and so on. For an N × 1 vector u, the product of C(a) and u represents the convolution of the vectors a and u [30]; it can be expressed in the Fourier domain as

C(a)u = F^{−1}(F(a) ⊙ F(u)),   (2)

where F^{−1} and F denote the inverse Fourier transform and the Fourier transform, respectively, and ⊙ denotes element-wise multiplication.
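The identity above is easy to verify numerically. The sketch below (our own helper, not code from the paper) checks that the dense circulant product C(a)u matches the FFT-based computation:

```python
import numpy as np

def circulant(a):
    # N x N circulant matrix: the first column is a, and each subsequent
    # column is a cyclic shift of the previous one, as described above.
    n = len(a)
    return np.array([[a[(i - j) % n] for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
a = rng.standard_normal(8)
u = rng.standard_normal(8)

direct = circulant(a) @ u                               # O(N^2) product
fast = np.fft.ifft(np.fft.fft(a) * np.fft.fft(u)).real  # O(N log N)

assert np.allclose(direct, fast)
```

This O(N log N) equivalence is precisely what makes the dense sampling of the CSK tracker computationally feasible.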

The RLS Classifier.
The CSK tracker trains the classifier by minimizing the regularized least squares cost function over the samples x_i:

min_w Σ_i (⟨w, x_i⟩ − y_i)² + λ‖w‖²,   (3)

where y_i is the desired output for x_i and λ is a regularization parameter.
Mapping the inputs x to the feature space φ(x) with the kernel trick, the kernel is κ(x, x′) = φ(x)ᵀφ(x′). Then we can express the solution w as a linear combination of the inputs [32]:

w = Σ_i α_i φ(x_i),   (4)

where α_i is the coefficient.
Then the RLS with kernels has the simple closed-form solution [31]:

d = (K + λI)^{−1} y,   (5)

where K is the kernel matrix with elements K_ij = κ(x_i, x_j), I is the identity matrix, y is the desired output vector with elements y_i, and d is the classifier coefficient vector with elements α_i.
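As a small illustration (all names ours), the closed form d = (K + λI)^{−1} y can be computed directly with a Gaussian kernel matrix. This is a naive O(N³) solve; the circulant property exploited in the next subsection reduces it to FFTs:

```python
import numpy as np

def kernel_rls_coefficients(X, y, sigma=0.2, lam=0.01):
    """Closed-form kernel RLS: d = (K + lam*I)^{-1} y,
    with Gaussian kernel K_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-d2 / sigma**2)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 4))   # 16 training samples, 4 features each
y = rng.standard_normal(16)        # desired continuous outputs
d = kernel_rls_coefficients(X, y)  # satisfies (K + lam*I) d = y
```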

Fast Target Location Estimation.
It has been proved that the kernel matrix K is circulant if κ is unitarily invariant [24]. We can obtain (6) from (5) according to the property of circulant matrices:

D = F(d) = F(y) / (F(h) + λ),   (6)

where K = C(h), h is the vector with elements h_i = κ(x, x_i), and the division is element-wise.
We perform the target location detection using the image patch of interest z in a new frame. The response of the RLS classifier is ŷ = wᵀz = Σ_i α_i κ(x, z_i), and it can be computed in the Fourier domain as

ŷ = F^{−1}(D ⊙ H̄),   (7)

where H̄ = F(h̄) and h̄ is the vector with elements h̄_i = κ(x, z_i); here x represents the target model learned from the previous frame and z_i is the i-th cyclically shifted sample of the image patch z.
The position of the target in the new frame is obtained by finding the position that maximizes ŷ, which means finding the position that maximizes the response of the filter h. Then D and x are updated as

D^t = (1 − γ) D^{t−1} + γ D̂^t,   (8)
x^t = (1 − γ) x^{t−1} + γ x̂^t,   (9)

where γ is the learning rate, D^t and D^{t−1} denote the updated coefficients at frame t and frame t − 1, respectively, x^t and x^{t−1} denote the updated target models at frame t and frame t − 1, respectively, and D̂^t and x̂^t denote the coefficients and target model computed from frame t alone. For more details, we refer to [24].
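A minimal single-channel sketch of this detect-and-update loop is shown below, using the linear-kernel special case (for which the kernelized filter reduces to a MOSSE-style filter) so the whole pipeline fits in a few lines. All names are ours, and feature extraction, windowing, and subwindow cropping are omitted:

```python
import numpy as np

def train(x, y, lam=0.01):
    # Linear-kernel special case: in the Fourier domain the coefficients are
    # D = F(y) / (F(x) * conj(F(x)) + lam), since F(k_xx) = F(x) * conj(F(x)).
    X = np.fft.fft2(x)
    return np.fft.fft2(y) / (X * np.conj(X) + lam), x

def detect(D, model, z):
    # Response y_hat = F^{-1}(D . conj(F(model)) . F(z)); its argmax gives
    # the cyclic shift between the stored model and the new patch z.
    resp = np.fft.ifft2(D * np.conj(np.fft.fft2(model)) * np.fft.fft2(z)).real
    return tuple(int(v) for v in np.unravel_index(resp.argmax(), resp.shape))

n = 32
g = np.arange(n)
c = np.minimum(g, n - g)  # cyclic distance to index 0
y = np.exp(-(c[:, None] ** 2 + c[None, :] ** 2) / (2 * 2.0 ** 2))  # labels

rng = np.random.default_rng(2)
x = rng.standard_normal((n, n))      # "appearance" observed in frame 1
D, model = train(x, y)

z = np.roll(np.roll(x, 5, axis=0), 3, axis=1)  # target moved by (5, 3)
shift = detect(D, model, z)                    # recovers (5, 3)

# Online update with learning rate gamma, in the spirit of (8):
gamma = 0.075
D_new, _ = train(z, y)
D = (1 - gamma) * D + gamma * D_new
```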

The Proposed Visual Tracking Algorithm
In this section, we present the adaptive scale tracking method based on kernelized correlation filters in detail. Recently, Danelljan et al. [8] proposed a scale estimation method based on the MOSSE filter. Inspired by it, we propose a robust correlation tracking approach based on the CSK tracker.
Since the scale changes very little between two consecutive frames in visual tracking, we can first detect the target position using the position kernelized correlation filter and then estimate the target scale using the scale kernelized correlation filter, which is learned from samples collected around the detected target. In the following subsections, we introduce a new online update scheme and a scale prediction strategy.

Online Update Scheme.
Since the appearance of the target often changes significantly during the visual tracking, it is necessary to update the target model to adapt to these changes.In the CSK tracker, the model consists of the transformed classifier coefficients and the learned target model.But they are computed only considering the current appearance.This limits the performance because not all the previous frames are considered to compute the current model.However, the MOSSE tracker [22] employs a robust update scheme by considering all previous frames when computing the current model and performs well.Here we adopt the same idea to update the models in our approach.
We take all the extracted appearances {x^j : j = 1, ..., t} of the target, from the first frame to the current frame t, into consideration in our update scheme. Therefore, the cost function in (3) can be modified as

arg min_w Σ_{j=1}^{t} ( Σ_i (⟨w, x_i^j⟩ − y_i^j)² + λ‖w‖² ).   (10)

Then the coefficients D^t for frame t can be computed as

D^t = D^t_Num / D^t_Den,   (11)

where D^t_Num and D^t_Den accumulate the per-frame numerator F(y^j) and denominator F(h^j) + λ of (6) over the frames j = 1, ..., t. The target appearance x is updated using (9). Here we update the numerator D^t_Num and the denominator D^t_Den of D^t in (11) separately as

D^t_Num = (1 − γ) D^{t−1}_Num + γ F(y^t),   (13)
D^t_Den = (1 − γ) D^{t−1}_Den + γ (F(h^t) + λ).   (14)

Figure 1: The process of extracting features. We get the multiscale image patches around the tracked target at frame t, then resize the patches to the initial target size P × R by bilinear interpolation, and extract HOG features from these resized patches.
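The separate numerator/denominator bookkeeping can be sketched as follows (variable names are ours; the linear-kernel, MOSSE-style per-frame terms stand in for the kernelized quantities):

```python
import numpy as np

def update_filter(num, den, x, y, gamma=0.075, lam=0.01):
    """Keep the filter's numerator and denominator as separate running
    averages, so every previous frame contributes to the current model."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    new_num = Y * np.conj(X)         # per-frame numerator
    new_den = X * np.conj(X) + lam   # per-frame denominator
    if num is None:                  # first frame: no history yet
        return new_num, new_den
    return ((1 - gamma) * num + gamma * new_num,
            (1 - gamma) * den + gamma * new_den)

# Per frame: num, den = update_filter(num, den, patch, labels);
# the filter actually applied at frame t is D_t = num / den.
```

Dividing the two running averages only once at detection time is what makes this scheme more stable than interpolating the ratio D itself.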

The Target Scale Prediction Strategy.
To predict the target scale variation, we learn another kernelized correlation filter and train another classifier on multiscale image patches around the most reliably tracked targets. During tracking, we construct a target pyramid around the tracked target to estimate the target scale. We resize the patches to the initial target size by bilinear interpolation before extracting features. The training samples for learning the filter are computed by extracting HOG features from the resized patches, which are centred around the tracked target. The extracted features are then multiplied by a Hamming window to reduce the frequency effect of the image boundary when using the FFT, as described in [22]. Assume the initial target size in the current frame is P × R and the size of the scale filter is S × 1; then we extract the sample x_s^n from an image patch of size (a^n P) × (a^n R) centred around the target, where n ∈ {⌊−(S−1)/2⌋, ..., ⌊(S−1)/2⌋} and a is the scale factor. The process of extracting features is shown in Figure 1. We compute the coefficients D_s^t by (15) and the response ŷ_s for a new frame by (16), update D_s^t using (13) and (14), and update the scale model x_s using (9). The target scale in the new frame is obtained by finding the scale index that maximizes ŷ_s:

ŷ_s = F^{−1}(D_s ⊙ H_s),   (16)

where H_s = F(h_s), with h_s being the vector with elements h_s^n = κ(x_s, z_s^n), where x_s is the scale model learned from frame t − 1 and z_s^n is the sample extracted from the new frame.
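The scale pyramid construction can be sketched as below (a rough illustration with our own names; raw grayscale pixels stand in for the HOG features, and the windowing step is omitted):

```python
import numpy as np
from scipy.ndimage import zoom

def extract_scale_samples(frame, center, base_size, S=31, a=1.1):
    """Crop S patches of size (a**n * P) x (a**n * R) around `center`,
    n in {-(S-1)//2, ..., (S-1)//2}, and resize each one back to the
    base target size P x R with bilinear interpolation (order=1)."""
    P, R = base_size
    cy, cx = center
    samples = []
    for n in range(-(S - 1) // 2, (S - 1) // 2 + 1):
        h = max(2, int(round(a**n * P)))
        w = max(2, int(round(a**n * R)))
        y0, x0 = max(0, cy - h // 2), max(0, cx - w // 2)
        patch = frame[y0:y0 + h, x0:x0 + w]
        samples.append(zoom(patch, (P / patch.shape[0], R / patch.shape[1]),
                            order=1))
    return np.stack(samples)  # shape (S, P, R); HOG would be computed here
```

In the full tracker, HOG features would be extracted from each resized patch and multiplied by the window before the FFT.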

Implementation.
The total procedure of our approach is summarized in Algorithm 1. In our approach, we use the Gaussian kernel function κ(x, x′) = exp(−‖x − x′‖² / σ²) in both the translation and scale detection, where σ is the standard deviation.
In tracking-by-detection methods, the closer a sample is to the currently tracked target centre, the more likely it is a positive sample. Since the square loss of RLS with kernels allows for continuous values, we do not need to limit ourselves to binary labels; the line between classification and regression is essentially blurred. For the continuous training output, we choose the Gaussian function, which is known to minimize ringing in the Fourier domain [33]. Therefore, the desired outputs y and y_s both use Gaussian functions:

y(p) = exp(−‖p − p*‖² / σ_p²),    y_s(s_i) = exp(−(s_i − s*)² / σ_s²),   (17)

where p represents a target location, p* represents the coordinate of the tracked target centre, s_i is a target scale (1 ≤ i ≤ S, where i is an integer), s* is the centre scale of the target, and σ_p and σ_s are the standard deviations.
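A small sketch of these regression targets and the Gaussian kernel (names ours): the label map peaks at the patch centre and decays smoothly, so near-centre shifts are treated as "almost positive" rather than as binary negatives:

```python
import numpy as np

def gaussian_labels(M, N, sigma):
    """y(p) = exp(-||p - p*||^2 / sigma^2), peaked at the patch centre p*."""
    ys, xs = np.ogrid[:M, :N]
    d2 = (ys - M // 2) ** 2 + (xs - N // 2) ** 2
    return np.exp(-d2 / sigma**2)

def gaussian_kernel(x, xp, sigma=0.2):
    """kappa(x, x') = exp(-||x - x'||^2 / sigma^2)."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(xp)) ** 2) / sigma**2)

y = gaussian_labels(32, 32, sigma=2.0)  # translation labels, 32x32 patch
```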

Experimental Results
To verify the efficiency of the method introduced above, we test the proposed tracking algorithm on 12 challenging video sequences from [29]. They have been widely used in many recent tracking papers and are summarized in Table 1.

Input:
The t-th frame I_t of the video sequence, initial target position p_0 and scale s_0.
Output: Detected target position p_t and scale s_t.

Until the End of the Video Sequence
Algorithm 1: Proposed tracking algorithm.

Performance Evaluation.
In order to evaluate the overall performance of the proposed method, three evaluation metrics are used, namely, centre location error (CLE), success rate (SR), and distance precision (DP). The CLE is defined as the average Euclidean distance between the manually labeled ground-truth centre and the detected centre location of the target; we use the average CLE over all frames of a sequence to evaluate the overall performance on that sequence. SR is computed by (18). DP is defined as the relative number of frames in a sequence whose CLE is smaller than a fixed threshold, which is set to 20 pixels in our experiments:

score = area(B_T ∩ B_G) / area(B_T ∪ B_G),    SR = n / N,   (18)

where score is the overlap score, B_T is the tracked bounding box, B_G is the ground-truth bounding box, area(·) represents the region area, ∩ and ∪, respectively, represent the intersection and union of two regions, n is the number of successfully tracked frames, that is, frames whose overlap score is larger than 0.5, and N is the total number of frames of one sequence.
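These three metrics are straightforward to implement; the sketch below (our own helper names, with boxes given as (x, y, w, h)) follows the definitions above:

```python
import numpy as np

def overlap_score(bt, bg):
    """score = area(B_T intersect B_G) / area(B_T union B_G)."""
    ix = max(0, min(bt[0] + bt[2], bg[0] + bg[2]) - max(bt[0], bg[0]))
    iy = max(0, min(bt[1] + bt[3], bg[1] + bg[3]) - max(bt[1], bg[1]))
    inter = ix * iy
    return inter / (bt[2] * bt[3] + bg[2] * bg[3] - inter)

def evaluate(tracked, truth, dp_thresh=20, sr_thresh=0.5):
    """Returns (CLE, SR, DP): mean centre error, fraction of frames with
    overlap score above sr_thresh, fraction with centre error below
    dp_thresh pixels."""
    ct = np.array([(x + w / 2.0, y + h / 2.0) for x, y, w, h in tracked])
    cg = np.array([(x + w / 2.0, y + h / 2.0) for x, y, w, h in truth])
    err = np.linalg.norm(ct - cg, axis=1)
    scores = np.array([overlap_score(t, g) for t, g in zip(tracked, truth)])
    return err.mean(), (scores > sr_thresh).mean(), (err < dp_thresh).mean()
```

For example, a two-frame toy sequence with one perfect box and one box 100 pixels off in each direction yields SR = 0.5 and DP = 0.5.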

Comparison with Original Update Scheme.
To show the effect of the changed update scheme on tracking, we compute the average CLE, average SR, and average DP over the 12 sequences for the CSK, the CSK with the new update scheme, the CSK with scale prediction, and our tracker, which includes both the new update scheme and scale prediction. The CSK with the new update scheme reduces the average CLE by 13.2 pixels and improves the performance by 1.4% in average SR and 7.7% in average DP compared to the CSK. Our tracker reduces the average CLE by 1.5 pixels and improves the tracking performance by 3.5% in average SR and by 2.6% in average DP compared to the CSK with scale prediction. Our tracker achieves the best performance in terms of average CLE, average SR, and average DP.

Comparison with CSK Tracker.
From Tables 3-5, we can see that our tracker reduces the average CLE from 40.2 pixels to 8.1 pixels and improves the performance by 29.1% in average SR and 31.1% in average DP compared to the CSK.Our approach outperforms the CSK in terms of the average CLE, average SR, and average DP.
To illustrate this clearly, we use the Girl sequence as an example. Figures 2 and 3, respectively, show the partial tracking results and the plots of the three evaluation metrics. Figure 2 shows the tracking results on the Girl sequence with scale variation, pose variation, rotation, and partial occlusion. When the girl undergoes rotation at frame #110, the CSK tracker begins to drift. When the target size becomes smaller at frame #156, our tracked box shrinks accordingly and our tracker continues to track the girl accurately. The tracking error of the CSK tracker accumulates as the target appearance varies: CSK drifts greatly at frame #436 and fails to track the girl at frame #472. Our tracker, however, tracks the girl successfully throughout. Figure 3 also shows that our approach is better than CSK.

Comparison with State-of-the-Art Trackers.
Since it is impractical to use all the existing tracking algorithms to validate the efficacy of our tracker, we compare the proposed algorithm with 5 state-of-the-art trackers: MOSSE tracker [22], Compressive Tracker (CT) [17], Weighted Multiple Instance Learning Tracker (WMILT) [34], KCF tracker with HOG features [28], and CSK tracker [24].In order to compare fairly, we use the same parameters as the authors suggested in their papers and only change the target location and size used in the first frame.

Quantitative Analysis.
We compute the median CLE, SR, and DP to evaluate the performance of the 6 tracking methods on the 12 challenging video sequences in our experiments. The results are shown in Tables 3-5, with the best results reported in bold. The three tables show that our tracker achieves the best or second best performance in most sequences in terms of CLE, SR, and DP. From the figures, we can see that our tracker maintains a smaller centre location error, a higher overlap score, and a higher distance precision in general. This implies that our approach produces more accurate and stable results than the other 5 trackers.

Qualitative Analysis
Scale, Illumination, and Pose Variation. Figures 7(a), 7(b), 7(c), and 7(d), respectively, illustrate the results on the Car4, Singer1, Trellis, and David sequences with scale and illumination variations as well as pose changes. In the Car4 sequence, the vehicle undergoes drastic illumination and scale changes, especially when it passes beneath a bridge (see frame #230).
Besides, the vehicle also undergoes background clutter. Only our approach and KCF are robust to these factors and perform well on this sequence. The HOG features are robust to illumination changes, but the background information in the tracked box of KCF accumulates because of the target scale variation, and KCF drifts greatly at frame #641. Our tracker, however, accurately detects the target position and scale all the time since it can predict the object scale in time. CT and WMILT use discriminative classifiers learned from Haar-like features, MOSSE uses an adaptive correlation filter, and CSK brings kernelized correlation filters into tracking, but they all perform poorly in this case. For the Singer1 sequence, all trackers except ours fail to deal with the large scale variation, large illumination variation, and pose variation at the same time. Despite these challenges, our approach is able to track the target accurately. For the David indoor sequence shown in Figure 7(d), the person walks towards the moving camera, resulting in significant appearance variations due to the illumination and scale changes. CT, KCF, and our approach can successfully track the target in most frames of the David sequence. In the Trellis sequence, the target undergoes abrupt pose variation, and only KCF and our tracker perform well; the CLE of our tracker is smaller and its SR is higher.
Scale, Pose Variation, Occlusion, and Rotation. Figures 7(e) and 7(f) illustrate the results on sequences with scale variation and partial occlusion. The car moves from far to near and is occluded by trees in the CarScale sequence. Both KCF and our tracker complete the whole tracking task for this sequence, but the SR of our tracker is higher. In Figure 7(f), the girl also undergoes in-plane rotation and pose variation (see frames #141 and #180), which make the tracking more difficult. Only our tracker is able to track the target successfully in most frames of this sequence.
Background Clutter, Illumination, Pose Variation, and Occlusion.The targets in the Skating1 and CarDark sequences undergo background clutter, illumination, and pose changes.
For the Skating1 sequence in Figure 7(g), the target also undergoes partial occlusion (see frame #163).Only KCF and our tracker perform well during the tracking process, but our approach performs better in terms of CLE and SR.For the CarDark sequence in Figure 7(h), MOSSE, CSK, and our tracker provide promising results compared to other trackers.
Scale, Pose Variation, Occlusion, and Abrupt Motion. The targets in these sequences undergo deformation and heavy occlusion at the same time. All the other trackers except KCF and ours fail to track the object successfully. In the Faceocc1 sequence, however, only MOSSE, CSK, and our approach perform well.
Discussion. From the above qualitative and quantitative analyses, our tracker outperforms the other trackers in most cases. The reason is that our tracker not only predicts the target location but is also able to estimate the target scale. In Figure 8(c), the goat moves unstably all the time (see frames #5, #54, and #98). Our tracker drifts away because of the error accumulated by the online update during the continuous unstable motion.

Conclusion
Based on the framework of tracking with kernelized correlation filters and the tracking-by-detection method, we develop a robust visual correlation tracking algorithm with improved tracking performance in this paper. Our tracker estimates the target translation and scale variations effectively and efficiently by learning kernelized correlation filters. By accurately estimating the target scale during tracking, our tracker avoids accumulating background information in the tracked box and remains robust to large scale variations.

Figure 2 :
Figure 2: Partial tracking results on the Girl sequence (frames #308, #436, and #472) compared to CSK: the plots of our tracker and CSK are, respectively, represented by the red dot-dashed curve and the green solid curve.

Figure 3 :
Figure 3: Three evaluation metrics plots: the plots of our tracker and CSK are, respectively, represented by red dot-dashed curve and green solid curve.

Figure 7 :
Figure 7: Partial tracking results: the plots of our tracker, MOSSE, WMILT, CT, KCF, and CSK are represented by red dash-dot box, cyan dashed box, blue dashed box, yellow dashed box, white dashed box, and green solid box.
D_s = Y_s / (F(h_s) + λ),   (15)

with h_s the vector with elements h_s^n = κ(x_s, x_s^n), where x_s is the scale model learned from frame t − 1, and Y_s = F(y_s), where y_s is the desired output for x_s at frame t.

Table 1 :
The tracking sequences used in our experiments.
All experiments are performed on an Intel Core i3-2130 PC with 2 GB RAM. For fair evaluation, all parameters are fixed across all video sequences in our experiments. For a target of size P × R and a scale filter of size S × 1, the standard deviations σ_p and σ_s are set proportionally to √(PR) and √S, respectively. The standard deviation of the Gaussian kernel is 0.2. The learning rate γ is 0.075. The regularization parameter λ is 0.01. The size of the scale filter is set to S = 31 and the scale factor is set to a = 1.1.

Table 2
shows the comparison results, and the best results are shown in bold. From the table, we can see that the new update scheme improves the performance of the tracker compared to the original update scheme.

Table 2 :
Comparison with original update scheme.