Recent years have seen growing interest in tracking-by-detection methods for visual object tracking because of their excellent tracking performance. However, most existing methods use a fixed scale, which makes the trackers unreliable when handling large scale variations in complex scenes. In this paper, we decompose tracking into target translation and scale prediction. We adopt a scale estimation approach based on the tracking-by-detection framework, develop a new model update scheme, and present a robust correlation tracking algorithm with discriminative correlation filters. The approach works by learning translation and scale correlation filters. We obtain the target translation and scale by finding the maximum output response of the learned correlation filters and then update the target models online. Extensive experimental results on 12 challenging benchmark sequences show that the proposed tracking approach reduces the average center location error (CLE) by 6.8 pixels and significantly improves performance by 17.5% in average success rate (SR) and by 5.4% in average distance precision (DP) compared to the second best of the five other excellent existing tracking algorithms, and that it is robust to appearance variations introduced by scale variations, pose variations, illumination changes, partial occlusion, fast motion, rotation, and background clutter.
Visual tracking, as a fundamental step in exploring videos, is important in many computer-vision-based applications, such as face recognition, human behavior analysis, robotics, intelligent surveillance, intelligent transportation systems, and human-computer interaction. The objective of visual tracking is to estimate the locations of a target in a video sequence [
In general, current tracking algorithms can be classified as either generative or discriminative approaches. Generative approaches [
Most current tracking algorithms are confined to finding the target location only, which implies poor tracking performance on sequences with large scale changes. Several methods [
The rest of this paper is organized as follows. A brief summary of the most related work is first given in Section
Visual object tracking has been studied extensively and has many applications. In this section, we introduce the approaches most closely related to our work.
Correlation filters have been used in many applications such as object detection and recognition [
For correlation filter based trackers, correlation can be computed in the Fourier domain through the Fast Fourier Transform (FFT), and the correlation response can be transformed back into the spatial domain with the inverse FFT. The CSK tracking method exploits a dense sampling strategy while showing that the process of taking subwindows in a frame induces circulant structure. The CSK tracker learns a regularized least squares (RLS) classifier of the target appearance from a single image patch, obtains the kernelized correlation filter using circulant matrices and the kernel trick, and localizes the target in a new frame by finding the maximum response of the correlation filter. In this section, we briefly describe the CSK tracker.
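To make the Fourier-domain pipeline concrete, the following sketch trains and applies a simple *linear* correlation filter (MOSSE-style rather than the kernelized CSK filter); the patch size, Gaussian width, and regularization value are illustrative choices, not values from the paper.

```python
import numpy as np

def gaussian_peak(h, w, sigma=2.0):
    # Desired response: a 2-D Gaussian centred on the patch centre.
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def train_filter(patch, y, lam=1e-2):
    # Learn a linear correlation filter in closed form in the Fourier domain.
    X, Y = np.fft.fft2(patch), np.fft.fft2(y)
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def detect(filt, patch):
    # Correlate the filter with a patch; the response peak gives the target.
    response = np.real(np.fft.ifft2(filt * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(response), response.shape)

rng = np.random.default_rng(0)
patch = rng.standard_normal((31, 31))
y = gaussian_peak(31, 31)
filt = train_filter(patch, y)
peak = detect(filt, patch)  # on the training patch, the peak sits at the centre
```

Training and detection are each a handful of FFTs and element-wise operations, which is the source of the speed of this family of trackers.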
Assume
The first column is the transposition of the vector
It has been shown that, in many practical problems, the RLS classifier offers classification performance equivalent to the support vector machine (SVM) while being easier to implement [
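The ease of implementation comes from the closed-form ridge-regression solution w = (X^T X + lam*I)^{-1} X^T y, which is a few lines of linear algebra; the data below are purely illustrative.

```python
import numpy as np

def rls_fit(X, y, lam=1.0):
    # Closed-form regularized least squares (ridge regression):
    # w = (X^T X + lam * I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative data: with lam -> 0 this recovers the least-squares solution.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = rls_fit(X, y, lam=0.0)
```

Unlike an SVM, no iterative optimization is needed; a single linear solve (or, with circulant structure, an FFT) produces the classifier.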
Mapping the inputs
Then the RLS with kernels has the simple closed form solution [
It has been proved that the kernel matrix
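The practical payoff of the circulant kernel matrix is that the n-by-n RLS system (K + lam*I) alpha = y diagonalizes under the DFT, so alpha can be found by element-wise division. The sketch below verifies this for a linear kernel (the kernelized case follows the same pattern); n, lam, and the random data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 64, 0.1
x = rng.standard_normal(n)   # base sample; its cyclic shifts form the training set
y = rng.standard_normal(n)   # regression targets

# Linear kernel of x with all its cyclic shifts: one circular correlation.
k = np.real(np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(x))))

# Dense route: build the circulant kernel matrix and solve (K + lam*I) alpha = y.
K = np.stack([np.roll(k, i) for i in range(n)])
alpha_dense = np.linalg.solve(K + lam * np.eye(n), y)

# Fourier route: element-wise division, O(n log n) instead of O(n^3).
alpha_fft = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k) + lam)))
```

The two routes agree to numerical precision, which is why dense sampling of all cyclic shifts costs no more than a few FFTs.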
We complete the target location detection using the image patch of interest
The position of the target in a new frame is obtained by finding the position that maximizes the correlation response.
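A quick way to see that the maximum of the correlation response localizes the target: cyclically shifting the input patch moves the response peak by exactly the same shift. The linear filter below is a simplified (MOSSE-style) stand-in for the kernelized one; sizes, regularization, and the shift are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
h = w = 31
patch = rng.standard_normal((h, w))

# Gaussian training label centred at (15, 15).
ys, xs = np.mgrid[0:h, 0:w]
label = np.exp(-((ys - 15) ** 2 + (xs - 15) ** 2) / (2 * 2.0 ** 2))

# Closed-form linear filter in the Fourier domain.
X = np.fft.fft2(patch)
filt = (np.fft.fft2(label) * np.conj(X)) / (X * np.conj(X) + 1e-2)

def locate(z):
    # Position of the maximum correlation response.
    r = np.real(np.fft.ifft2(filt * np.fft.fft2(z)))
    return np.unravel_index(np.argmax(r), r.shape)

shifted = np.roll(patch, (3, 5), axis=(0, 1))  # target moved down 3, right 5
```

Calling `locate(patch)` recovers the label centre, and `locate(shifted)` recovers the centre displaced by the applied shift, which is exactly how frame-to-frame translation is read off the response map.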
In this section, we present the adaptive scale tracking method based on kernelized correlation filters in detail. Recently, Danelljan et al. [
Since the appearance of the target often changes significantly during visual tracking, it is necessary to update the target model to adapt to these changes. In the CSK tracker, the model consists of the transformed classifier coefficients and the learned target model, but these are computed considering only the current appearance. This limits performance because not all previous frames contribute to the current model. However, the MOSSE tracker [
Then the coefficients
Then (
The target appearance
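The update scheme can be illustrated as an exponentially weighted running average of the filter's numerator and denominator, in the spirit of MOSSE; eta and the scalar "models" here are illustrative stand-ins for the transformed coefficients and target appearance.

```python
import numpy as np

def update_model(A_prev, B_prev, A_cur, B_cur, eta=0.02):
    # Blend the previous model with the current frame's estimate; every past
    # frame keeps an exponentially decaying weight instead of being discarded.
    A = (1 - eta) * A_prev + eta * A_cur
    B = (1 - eta) * B_prev + eta * B_cur
    return A, B

# After t updates with new evidence 0, the initial frame's weight is (1-eta)**t.
A, B = 1.0, 1.0
for _ in range(3):
    A, B = update_model(A, B, 0.0, 0.0, eta=0.5)
```

Because old frames are down-weighted rather than dropped, the model remains stable under occlusion yet still adapts to gradual appearance change.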
To predict target scale variation, we learn another kernelized correlation filter and train another classifier on multiscale image patches around the most reliably tracked targets. During tracking, we construct a target pyramid around the tracked target to estimate the target scale. We resize the patches to the size of the initial target using bilinear interpolation before extracting features. The training samples for learning the filter are computed by extracting HOG features from the resized patches, which are centred around the tracked target. The extracted features are then multiplied by a Hamming window to reduce the boundary effects introduced by the FFT, as described in [
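The multiscale sampling step might look like the following sketch: crop patches at scales a**k around the current target, resize each to the template size, and apply a 2-D Hamming window. It uses nearest-neighbour resizing and raw pixels instead of the paper's bilinear interpolation and HOG features; all names and parameter values are illustrative.

```python
import numpy as np

def resize_nn(img, out_h, out_w):
    # Nearest-neighbour resize (the paper uses bilinear interpolation).
    h, w = img.shape
    rows = (np.arange(out_h) * h // out_h).astype(int)
    cols = (np.arange(out_w) * w // out_w).astype(int)
    return img[np.ix_(rows, cols)]

def scale_samples(frame, cy, cx, base_h, base_w, a=1.05, n_scales=5):
    # Crop patches at scales a**k around (cy, cx), resize them to the template
    # size, and window them to reduce FFT boundary effects.
    win = np.outer(np.hamming(base_h), np.hamming(base_w))
    samples = []
    for k in range(-(n_scales // 2), n_scales // 2 + 1):
        h = max(1, int(round(base_h * a ** k)))
        w = max(1, int(round(base_w * a ** k)))
        y0, x0 = cy - h // 2, cx - w // 2
        patch = frame[y0:y0 + h, x0:x0 + w]
        samples.append(resize_nn(patch, base_h, base_w) * win)
    return np.stack(samples)

frame = np.arange(100.0 * 100.0).reshape(100, 100)
samples = scale_samples(frame, 50, 50, 16, 16)  # one 16x16 sample per scale
```

Each pyramid level ends up the same size as the template, so a single scale filter can score all levels and the level with the maximum response gives the scale estimate.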
The process of extracting features. We get the multiscale image patches around the tracked target at frame
The overall procedure of our approach is summarized in Algorithm
Crop out the searching region in frame t centred at the previous target position; find the target translation by maximizing the translation correlation response; estimate the target scale by maximizing the scale correlation response over the target pyramid; and update the translation and scale models with the newly tracked target.
To verify the efficiency of the method introduced above, we test the proposed tracking algorithm on 12 challenging video sequences which are from [
The tracking sequences used in our experiments.
Sequence  Frames  Main challenges 

Car4  659  Scale, illumination, pose variation, and background clutter 
CarScale  252  Scale variation and occlusion 
Dog1  1350  Scale and pose variation 
Girl  502  Scale, pose variation, in-plane rotation, and occlusion 
Trellis  569  Scale, illumination, and pose variation 
Singer1  351  Illumination and scale variation 
David  462  Illumination, scale, and pose variation 
Woman  597  Non-rigid deformation and occlusion 
Tiger1  354  Abrupt motion, pose variation, and occlusion 
Skating1  400  Illumination, pose variation, background clutter, and occlusion 
CarDark  393  Illumination variation and background clutter 
Faceocc1  886  Occlusion 
All our experiments are performed using MATLAB 2010a on a 3.4 GHz Intel Core i3-2130 PC with 2 GB RAM. For fair evaluation, all parameters are fixed across all video sequences in our experiments. For a target of size
In order to evaluate the overall performance of the proposed method, three evaluation metrics are used, namely, centre location error (CLE), success rate (SR), and distance precision (DP). The CLE is defined as the average Euclidean distance between the manually labeled ground truths and the detected centre locations of the target. Then we use the average CLE over all the frames of a sequence to evaluate the overall performance for the sequence. SR is computed by (
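Under common conventions (an overlap threshold of 0.5 for SR and a fixed pixel radius for DP; the paper's exact thresholds are given by its equations), the three metrics can be computed as follows. The (x, y, w, h) box format and all names are illustrative.

```python
import numpy as np

def cle(pred_centers, gt_centers):
    # Average Euclidean distance between predicted and ground-truth centres.
    d = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return d.mean()

def overlap(b1, b2):
    # Intersection-over-union of two boxes given as (x, y, w, h).
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union

def success_rate(pred_boxes, gt_boxes, thresh=0.5):
    # Fraction of frames whose overlap score exceeds thresh.
    scores = [overlap(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return np.mean([s > thresh for s in scores])

def distance_precision(pred_centers, gt_centers, thresh=20):
    # Fraction of frames whose centre error is within thresh pixels.
    d = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return np.mean(d <= thresh)
```

CLE is an error (lower is better), while SR and DP are percentages of successfully tracked frames (higher is better), which is why the three metrics together give a rounded view of performance.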
To show the effect of the changed update scheme on tracking, we compute the average CLE, average SR, and average DP over 12 sequences for the CSK, CSK with new update scheme, CSK with scale prediction, and our tracker which includes a new update scheme and a scale prediction at the same time. Table
Comparison with original update scheme.
Method  Average CLE (pixels)  Average SR (%)  Average DP (%) 

CSK  40.2  53.5  61.8 
CSK + new update scheme  27.0  54.9  69.5 
CSK + scale prediction  9.6  79.1  90.3 
Ours  –  –  – 
From Tables
Centre location error (in pixels).
Sequence  MOSSE  WMILT  CT  KCF  CSK  Ours 

Car4  105.2  85.7  77.1  9.5  19.1  – 
CarScale  70.0  70.4  –  16.1  90.5  19.5 
Dog1  –  6.7  9.1  4.4  4.9  5.3 
Girl  52.4  51.0  40.5  32.5  36.0  – 
Trellis  77.5  43.0  44.8  8.2  19.0  – 
Singer1  112.2  16.8  19.1  12.8  13.9  – 
David  129.1  24.4  –  9.5  17.2  9.5 
Woman  355.5  126.2  119.3  10.1  236.9  – 
Tiger1  32.6  –  22.2  18.0  26.1  11.7 
Skating1  20.8  8.4  152.8  –  10.1  – 
CarDark  3.2  60.8  47.0  5.8  –  2.7 
Faceocc1  6.1  31.8  29.6  43.7  –  6.7 
Average CLE  72.4  44.6  48.9  14.9  40.2  – 
Success rate (%).
Sequence  MOSSE  WMILT  CT  KCF  CSK  Ours 

Car4  27.5  24.6  27.6  36.7  27.6  – 
CarScale  44.8  44.8  44.8  44.4  44.8  – 
Dog1  65.3  62.7  59.4  65.3  65.1  – 
Girl  35.6  27.7  30.7  –  51.5  76.2 
Trellis  30.6  33.6  26.0  84.0  55.0  – 
Singer1  29.6  27.6  27.6  29.6  29.6  – 
David  14.0  51.6  –  –  57.0  25.8 
Woman  23.6  16.2  15.4  –  23.8  86.1 
Tiger1  39.4  –  56.3  69.0  50.7  57.7 
Skating1  36.2  34.0  9.0  36.2  37.5  – 
CarDark  96.4  0.3  12.2  72.3  –  – 
Faceocc1  –  58.4  74.7  65.7  99.4  99.4 
Average SR  45.3  37.7  40.3  65.1  53.5  – 
Distance precision (%).
Sequence  MOSSE  WMILT  CT  KCF  CSK  Ours 

Car4  28.1  24.1  35.4  95.3  35.5  – 
CarScale  65.1  63.1  65.1  –  65.5  76.2 
Dog1  –  94.3  94.9  –  99.9  99.6 
Girl  34.7  21.8  24.8  60.4  39.6  – 
Trellis  34.1  45.0  25.3  –  75.6  – 
Singer1  84.9  63.5  32.2  81.5  67.5  – 
David  14.0  40.9  –  –  50.5  97.8 
Woman  24.8  20.6  20.4  –  25.0  – 
Tiger1  52.1  –  67.6  73.2  63.4  87.3 
Skating1  70.0  97.8  11.7  –  87.0  96.0 
CarDark  –  10.4  21.4  –  –  – 
Faceocc1  –  19.1  32.0  64.6  –  98.9 
Average DP  58.9  49.5  44.2  87.5  61.8  – 
For clarity, we use the Girl sequence as an example for analysis. Figures
Partial tracking results compared to CSK: our tracker and CSK are represented by the red dot-dashed curve and the green solid curve, respectively.
Girl (#15, #110, and #156)
Girl (#308, #436, and #472)
Three evaluation metric plots: our tracker and CSK are represented by the red dot-dashed curve and the green solid curve, respectively.
The center location error plot
The overlap score plot
The distance precision plot
Since it is impractical to compare against all existing tracking algorithms, we compare the proposed algorithm with five state-of-the-art trackers: the MOSSE tracker [
We compute the median CLE, SR, and DP to evaluate the performance of 6 tracking methods on the 12 challenging video sequences in our experiments. The results are shown in Tables
The centre location error plots: (a) Car4, (b) CarScale, (c) Dog1, (d) Girl, (e) Trellis, (f) Singer1, (g) David, (h) Woman, (i) Tiger1, (j) Skating1, (k) CarDark, and (l) Faceocc1. The plots of our tracker, MOSSE, WMILT, CT, KCF, and CSK are represented by the red solid, cyan dashed, blue dashed, yellow dashed, magenta dashed, and green dashed curves, respectively.
The overlap score plots: (a) Car4, (b) CarScale, (c) Dog1, (d) Girl, (e) Trellis, (f) Singer1, (g) David, (h) Woman, (i) Tiger1, (j) Skating1, (k) CarDark, and (l) Faceocc1. The plots of our tracker, MOSSE, WMILT, CT, KCF, and CSK are represented by the red solid, cyan dashed, blue dashed, yellow dashed, magenta dashed, and green dashed curves, respectively.
The distance precision plots: (a) Car4, (b) CarScale, (c) Dog1, (d) Girl, (e) Trellis, (f) Singer1, (g) David, (h) Woman, (i) Tiger1, (j) Skating1, (k) CarDark, and (l) Faceocc1. The plots of our tracker, MOSSE, WMILT, CT, KCF, and CSK are represented by the red solid, cyan dashed, blue dashed, yellow dashed, magenta dashed, and green dashed curves, respectively (the chosen threshold is 50 pixels).
Partial tracking results: our tracker, MOSSE, WMILT, CT, KCF, and CSK are represented by the red dash-dot box, cyan dashed box, blue dashed box, yellow dashed box, white dashed box, and green solid box, respectively.
Car4 (#230, #288, and #641)
Singer1 (#31, #102, and #349)
Trellis (#358, #431, and #558)
David (#20, #166, and #390)
CarScale (#139, #168, and #180)
Girl (#141, #180, and #440)
Skating1 (#93, #163, and #311)
CarDark (#92, #276, and #310)
Dog1 (#655, #1018, and #1341)
Tiger1 (#112, #248, and #339)
Woman (#128, #194, and #461)
Faceocc1 (#214, #585, and #680)
From the above qualitative and quantitative analyses, our tracker outperforms the other trackers in most cases. The reason is that our tracker not only predicts the target location but also estimates the target scale accurately at the same time. As to computational complexity, the most time-consuming part of our tracker is computing the HOG feature vectors of all the candidate samples. Our tracker is implemented in MATLAB and runs at about 15 frames per second (FPS) on an Intel Core i3-2130 3.4 GHz CPU with 2 GB RAM. Our tracker performs well in the above experiments, but drift is observed when the initial target is very small (e.g., the Freeman3 and Freeman4 sequences) and when the target moves erratically throughout the sequence (e.g., the Goat sequence, as shown in Figure
Three failed tracking cases.
Freeman3 (#4, #410, and #440)
Freeman4 (#10, #202, and #258)
Goat (#5, #54, and #98)
Based on the framework of tracking with kernelized correlation filters and the tracking-by-detection method, we develop a robust visual correlation tracking algorithm with improved tracking performance in this paper. Our tracker estimates target translation and scale variations effectively and efficiently by learning kernelized correlation filters. By accurately estimating the target scale during tracking, our tracker obtains more useful information from the target and reduces interference from the background. The translation is estimated by modeling the temporal context correlation, and the scale is estimated by searching the tracked target appearance pyramid. In addition, we further develop an update scheme that takes all previous frames into consideration when computing the current model. Experimental results on challenging sequences clearly show that our approach outperforms state-of-the-art tracking algorithms in terms of efficiency, accuracy, and robustness.
The authors declare that they have no conflicts of interest regarding the publication of this paper.
The authors sincerely thank the reviewers for their valuable advice and comments. This work is supported by the National High Technology Research and Development Program of China (Grant no. 2014AA7031010B).