Visual Tracking Based on Complementary Learners with Distractor Handling Suryo

The representation of the object is an important factor in building a robust visual object tracking algorithm. Complementary learners that combine color-histogram-based and correlation-filter-based representations of the target object can be used for this purpose, since each has advantages that compensate for the other's drawbacks in visual tracking. However, a tracking algorithm can still fail because of a distractor, even when complementary learners have been implemented for the target object representation. In this study, we show that, in order to handle the distractor, the distractor must first be detected by learning the responses from the color-histogram-based and correlation-filter-based representations. Then, to determine the target location, we decide whether the responses from both representations should be merged or only the response from the correlation filter should be used. This decision depends on the result of the distractor detection process. Experiments were performed on the widely used VOT2014 and VOT2015 benchmark datasets. The results verify that the proposed method performs favorably compared with several state-of-the-art visual tracking algorithms.


Introduction
Given the initial state (e.g., position and other information) of a target object in the first frame, the goal of visual tracking is to predict the states of the target in subsequent frames. Visual tracking has an important role in several applications in computer vision, such as motion analysis, visual surveillance, human-computer interaction, and robot navigation. Although this issue has been studied for several decades and considerable progress has been made, it still presents challenges, in particular, the development of a robust algorithm for overcoming problems such as occlusions, camera motion, illumination changes, motion changes, and size changes.
An important factor in creating a robust visual tracking algorithm is the representation of the target object. Several decades ago, to solve challenging problems in visual tracking, researchers used a color histogram [1] to represent the target object. A generative approach combined with an optimization method, such as the Lucas-Kanade algorithm, Kalman filter [2], or particle filter [3], was usually applied. The Lucas-Kanade algorithm uses a differential method to estimate optical flow. Unfortunately, the computation involved in this method is expensive, and it has many disadvantages when addressing challenging problems in visual tracking. The Kalman filter is also limited in that it assumes that both the system and observation model equations are linear and that the state follows a Gaussian distribution. These assumptions are not realistic in many real conditions. The particle filter was proposed to overcome the limitations of the Kalman filter. Although it has been shown that a particle filter significantly improves the results and can handle nonlinear problems, it involves a trade-off among accuracy, the number of particles, and computation time [4]. Furthermore, the generative approach focuses only on learning an appearance model. It does not take information from the background into consideration, although such information is very valuable for developing a more robust visual tracking algorithm. Moreover, although color histogram-based representation is robust to deformations, it has a drawback when illumination changes occur. It is also sensitive to motion blur.
Later, the discriminative approach was proposed for improving the performance of the generative approach. The main difference between the discriminative and the generative approach is the use of a classifier to determine the location of the target object. The generative approach does not need a classifier to determine the output; the output is determined by the nearest distance according to a one-by-one distance comparison with the target. For this reason, the generative approach is computationally expensive. The discriminative approach, on the other hand, uses a classifier for determining the output and takes information from the background into consideration. Therefore, positive and negative samples should be used for representing the target object and the background, respectively. For example, Grabner et al. proposed an online feature selection method using an AdaBoost algorithm for visual tracking. This method has online training capability [5]. Although it operates quickly, online learning is problematic because each update of the tracker may introduce an error, which can eventually lead to tracking failure (drifting). Semisupervised online boosting alleviates the drifting problem in tracking applications [6]. Another method for visual tracking, called multiple instance learning (MIL), was proposed by Babenko et al. to replace traditional supervised learning [7]. This method treats positive and negative samples as a positive and a negative bag, respectively. Then, to determine the output, a boosting classifier is used. This method operates faster and more accurately than traditional supervised learning. Kalal et al. proposed a tracking-learning-detection framework [8]; unfortunately, this framework needs a large memory for computation. These methods can be termed tracking-by-detection methods.
Recently, a correlation filter has been used, which provides efficient computation, since the operator is transformed into the Fourier domain. Further, it also produces good results even though a limited amount of training data is used. For these reasons, researchers introduced the correlation filter into the tracking-by-detection method for visual tracking. An example is the minimum output sum of squared error (MOSSE) tracker introduced by Bolme et al. [9]. For training the correlation filter, this method used only grayscale samples. According to the results of recent studies, the method can be improved by using multidimensional features such as histogram of oriented gradients (HOG) features [10][11][12][13]. Although the correlation filter provides efficient computation, all the circular shifts should be learned during the process. To resolve this issue, Danelljan et al. proposed the spatially regularized discriminative correlation filter (SRDCF) [14]. Although it achieves excellent results, this method needs a longer computational time than the original one. Moreover, although the correlation filter shows excellent robustness to challenging problems such as illumination changes and motion blur, it has a drawback when problems such as deformation arise.
To let color histogram-based and correlation filter-based representations compensate for each other's drawbacks, a representation of the target object based on complementary learners was proposed [10,11,15]. In this study, we adopt complementary learners and propose an object-aware method based on them. The two representations are computed in parallel, producing a color histogram response and a correlation filter response, respectively. Since the tracking algorithm can fail because of the distractor, a method to handle the distractor is proposed to minimize tracking failures. First, distractor detection should be performed. This can be achieved by calculating the distance between the location of the maximum value of the color histogram response and that of the maximum value of the correlation filter response. Then, the location of the target object can be determined from either the maximum value of the correlation filter response or the maximum value of the merged responses of the color histogram and the correlation filter; the value selected depends on the results of the distractor detection process. We demonstrate our proposed method on the widely used VOT2014 and VOT2015 benchmarks. According to the results of our experiments, the proposed method performs favorably compared with state-of-the-art visual tracking algorithms.
The rest of this paper is organized as follows. We describe our object-aware method based on complementary learners in Section 2. The distractor detection method is explained in Section 3. The proposed method is detailed in Section 4. In Section 5, the experimental results with comparisons to the state-of-the-art methods are presented. Finally, conclusions are presented in Section 6.

Object-Aware Method Based on Complementary Learners
One important factor in building a robust visual tracking algorithm is determining the model representation of the target object. Color histogram-based object representation has been used widely. Unfortunately, this representation is not robust when the color of the distractor is similar to that of the tracked object. In addition, this representation has a drawback when illumination changes occur and is also sensitive to motion blur. Recently, a correlation filter has been used for representing the object. Although it is robust to challenges such as motion changes and illumination changes, it has a drawback when deformation occurs. Complementary learners, in which the results of a collaboration between the correlation filter and color histogram are used to represent the target object in visual tracking, were inspired by these ideas [10,11,15]. The representations should be computed in parallel to produce each response before the distractor is analyzed based on these responses.
where $|R_O|$ and $|R_B|$ are the rectangle areas of the object and the background, respectively. Finally, the response of the color histogram $y^{\mathrm{ch}}_t$ can be obtained by using the integral image from $P(x \in O \mid \mathrm{id}_x)$.
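The color histogram response can be illustrated with a short sketch. It assumes one common instance of Bayes' theorem, which may differ from the exact expression in (1): the likelihood of a bin is its pixel count divided by the region area, and the prior of each region is proportional to its area. The function names, the fallback value 0.5 for unseen bins, and the box-mean use of the integral image are illustrative assumptions.

```python
def object_likelihood(hist_obj, hist_bg, area_obj, area_bg):
    """Per-bin posterior P(x in O | bin) from raw pixel counts (assumed form)."""
    prior_obj = area_obj / (area_obj + area_bg)   # assumed area-proportional prior
    prior_bg = area_bg / (area_obj + area_bg)
    posterior = []
    for n_obj, n_bg in zip(hist_obj, hist_bg):
        num = (n_obj / area_obj) * prior_obj      # likelihood * prior (object)
        den = num + (n_bg / area_bg) * prior_bg   # evidence over both regions
        # Bins never observed in either region carry no evidence: use 0.5.
        posterior.append(num / den if den > 0 else 0.5)
    return posterior


def integral_image(img):
    """ii[r + 1][c + 1] holds the sum of img[0..r][0..c]."""
    rows, cols = len(img), len(img[0])
    ii = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii


def histogram_response(pixel_likelihood, box_h, box_w):
    """Mean object likelihood inside every box_h x box_w window."""
    ii = integral_image(pixel_likelihood)
    rows, cols = len(pixel_likelihood), len(pixel_likelihood[0])
    out = []
    for r in range(rows - box_h + 1):
        row = []
        for c in range(cols - box_w + 1):
            total = (ii[r + box_h][c + box_w] - ii[r][c + box_w]
                     - ii[r + box_h][c] + ii[r][c])
            row.append(total / (box_h * box_w))
        out.append(row)
    return out


posterior = object_likelihood([8, 0, 2], [4, 16, 0], area_obj=10, area_bg=20)
response = histogram_response([[1, 1, 0], [1, 1, 0], [0, 0, 0]], 2, 2)
```

With area-proportional priors the areas cancel and the posterior reduces to the ratio of the object count to the total count in each bin, which is why the first bin above evaluates to 8 / (8 + 4).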
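On the other hand, as in [10][11][12][13][14], HOG features are used as multidimensional features. They produce a $d$-dimensional feature map representation of an image. Based on this representation, the optimal correlation filter $h$ is obtained by using
$$h = \arg\min_{h} \left\| \sum_{l=1}^{d} h^{l} \otimes f^{l} - g \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2}, \tag{2}$$
where $f$, $g$, $\otimes$, and $\lambda$ are the rectangle patch of the feature map that represents the target, the desired correlation output, the circular correlation, and the parameter that controls the effect of the regularization term, respectively. Further, the correlation filter operates in the Fourier domain, and, therefore, we can use the discrete Fourier transform (DFT), which produces a complex variable. Because the results of the DFT take a complex form and we need to solve (2), we follow the method presented in [16] and obtain
$$H^{l} = \frac{\bar{G} \odot F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} \odot F^{k} + \lambda}, \tag{3}$$
where $\bar{G}$ is the complex conjugate of the DFT of $g$, $F^{l}$ is the DFT of $f^{l}$, $\bar{F}^{k}$ is the complex conjugate of the DFT of $f^{k}$, $F^{k}$ is the DFT of $f^{k}$, $\odot$ represents element-wise multiplication, and $H^{l}$ is the result in the Fourier domain.
A visual tracking algorithm requires inexpensive computation. This is because online learning, which is effective for handling appearance changes in the target object, as was shown in [5][6][7][8], must be performed at every frame. Further, based on (3), a $d \times d$ linear system of equations per pixel needs to be solved, and this requires expensive computation. Thus, rather than performing this expensive computation, a robust approximation is used: an online update of the numerator $A^{l}_{t}$ and denominator $B_{t}$ at frame $t$, which was adopted from [16]:
$$A^{l}_{t} = (1 - \eta_{\mathrm{cf}}) A^{l}_{t-1} + \eta_{\mathrm{cf}}\, \bar{G}_{t} \odot F^{l}_{t}, \qquad B_{t} = (1 - \eta_{\mathrm{cf}}) B_{t-1} + \eta_{\mathrm{cf}} \sum_{k=1}^{d} \bar{F}^{k}_{t} \odot F^{k}_{t}, \tag{4}$$
where $A^{l}_{t-1}$ is the numerator at frame $t-1$, $B_{t-1}$ is the denominator at frame $t-1$, and $\eta_{\mathrm{cf}}$ is the learning rate. Moreover, the response of the correlation filter $y^{\mathrm{cf}}_{t}$ can be calculated using the inverse DFT:
$$y^{\mathrm{cf}}_{t} = \mathcal{F}^{-1} \left\{ \frac{\sum_{l=1}^{d} \bar{A}^{l}_{t-1} \odot Z^{l}_{t}}{B_{t-1} + \lambda} \right\}, \tag{5}$$
where $Z^{l}_{t}$ is the DFT of the feature map from $|R_{\mathrm{search},t}|$, which has been multiplied by a Hann window, and $\bar{A}^{l}_{t-1}$ is the complex conjugate of $A^{l}_{t-1}$.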
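The numerator and denominator update and the response computation can be sketched in one dimension. This is a minimal illustration under several assumptions: a single feature channel instead of HOG features, a naive DFT instead of an FFT, no Hann window, and hypothetical function and variable names. It shows only that the response peak follows a circular shift of the input; it is not the implementation used in the experiments.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]


def init_filter(f, g):
    """Numerator conj(G) * F and denominator conj(F) * F for one channel."""
    F, G = dft(f), dft(g)
    A = [Gk.conjugate() * Fk for Gk, Fk in zip(G, F)]
    B = [(Fk.conjugate() * Fk).real for Fk in F]
    return A, B


def update_filter(A_prev, B_prev, f, g, eta=0.01):
    """Running average of numerator and denominator with learning rate eta."""
    A_new, B_new = init_filter(f, g)
    A = [(1 - eta) * a0 + eta * a1 for a0, a1 in zip(A_prev, A_new)]
    B = [(1 - eta) * b0 + eta * b1 for b0, b1 in zip(B_prev, B_new)]
    return A, B


def response(A, B, z, lam=0.01):
    """Inverse DFT of conj(A) * Z / (B + lambda); real part only."""
    Z = dft(z)
    Y = [a.conjugate() * zk / (b + lam) for a, zk, b in zip(A, Z, B)]
    return [v.real for v in idft(Y)]


n = 16
center = 8
# Desired output: Gaussian peak at the target position (sigma is an assumption).
g = [math.exp(-((m - center) ** 2) / (2 * 1.5 ** 2)) for m in range(n)]
f0 = [0.0, 10.0, 1.0, 6.0, 3.0, 3.0, 6.0, 1.0,
      10.0, 0.0, 4.0, 0.0, 10.0, 1.0, 6.0, 3.0]    # toy training signal
f1 = [1.1 * v for v in f0]                          # slightly brighter copy

A, B = init_filter(f0, g)
A, B = update_filter(A, B, f1, g)

shift = 3
z = [f0[(m - shift) % n] for m in range(n)]         # target moved by 3 samples
y_same = response(A, B, f0)
y_shifted = response(A, B, z)
peak_same = max(range(n), key=lambda m: y_same[m])
peak_shifted = max(range(n), key=lambda m: y_shifted[m])
```

The displacement of the peak from `center` gives the estimated translation, which is how the response is used for localization; the same mechanism applies per dimension in the two-dimensional, multichannel case.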

Distractor Detection
Visual tracking algorithms usually fail because of the distractor, in particular when the distractor has a representation similar to that of the target object. To overcome this problem, Kalal et al. [8] proposed a learning method assisted by positive and negative constraints to distinguish a target object from the background. In addition, they used optical flow for the motion model. Unfortunately, this approach needs a large memory for computation. Recently, Possegger et al. [17] proposed foreground and background modeling based on the color histogram. Unfortunately, the drawbacks of color histogram features still influence their approach and make it less robust than a shape HOG correlation filter-based tracker. In this section, we describe our proposed distractor detection method. Given the responses from the color histogram $y^{\mathrm{ch}}_t$ and the correlation filter $y^{\mathrm{cf}}_t$, the maximum values of $y^{\mathrm{ch}}_t$ and $y^{\mathrm{cf}}_t$ can be determined. The maximum value of $y^{\mathrm{ch}}_t$ is represented by $v_{\mathrm{ch}}$ and that of $y^{\mathrm{cf}}_t$ by $v_{\mathrm{cf}}$. Because these responses take a two-dimensional form, these maximum values have coordinate information indicating their respective positions. Distractor detection can be achieved by using the Euclidean distance between the positions of $v_{\mathrm{ch}}$ and $v_{\mathrm{cf}}$:
$$D_t = \begin{cases} 1, & d_t \geq d_0, \\ 0, & d_t < d_0, \end{cases} \tag{6}$$
where $d_t$ and $d_0$ represent the distance at frame $t$ and the distance threshold, respectively, and 1 indicates that a distractor appears and 0 that no distractor appears. The distractor detection procedure is illustrated in Figure 1. Moreover, compared with [8], our proposed distractor detection method does not need a large memory for computation.
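The detection rule can be sketched as follows. The sketch assumes that both responses are sampled on the same pixel grid, so that their peak coordinates are directly comparable; the function names are hypothetical, and the default threshold of 20 is the value reported in the experimental section.

```python
import math


def argmax_2d(response):
    """Return (row, col) of the largest value in a 2-D list."""
    best_r, best_c = 0, 0
    for r, row in enumerate(response):
        for c, value in enumerate(row):
            if value > response[best_r][best_c]:
                best_r, best_c = r, c
    return best_r, best_c


def detect_distractor(resp_ch, resp_cf, d0=20.0):
    """Return (flag, distance): flag is 1 when the two peaks are at least d0 apart."""
    r_ch, c_ch = argmax_2d(resp_ch)
    r_cf, c_cf = argmax_2d(resp_cf)
    d_t = math.hypot(r_ch - r_cf, c_ch - c_cf)   # Euclidean distance between peaks
    return (1 if d_t >= d0 else 0), d_t


def peak_map(size, peak):
    """Toy response: zeros everywhere except a single peak."""
    grid = [[0.0] * size for _ in range(size)]
    grid[peak[0]][peak[1]] = 1.0
    return grid


# Peaks 25 pixels apart: the color histogram is likely locked onto a distractor.
flag_far, dist_far = detect_distractor(peak_map(40, (5, 5)), peak_map(40, (5, 30)))
# Peaks about 2.2 pixels apart: the two representations agree.
flag_near, dist_near = detect_distractor(peak_map(40, (10, 10)), peak_map(40, (12, 11)))
```

Only two peak coordinates are stored per frame, which is consistent with the small memory footprint claimed for this step.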

Figure 1: Illustrations of detecting the distractor at frames 630 and 637. The panels show the response of the color histogram and the response of the correlation filter for each frame. At frame 630, no distractor appears because $d_t < d_0$. At frame 637, a distractor appears because $d_t \geq d_0$.

Proposed Method
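In this section, our proposed method for visual tracking is described. For translation estimation, the translation sample $z_{\mathrm{trans},t}$ is used together with the numerator and denominator from the previous frame to obtain $y^{\mathrm{cf}}_t$ by implementation in (5). Figure 2 shows the proposed method framework.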
When the parameters $y^{\mathrm{ch}}_t$ and $y^{\mathrm{cf}}_t$ have been obtained, in order to minimize the tracking failure due to the distractor, the distractor must be detected prior to the final location estimation of the target object. To detect the distractor, we use (6). The final location estimation of the target object can be obtained by maximizing the score $S_t$, where
$$S_t = \begin{cases} y^{\mathrm{cf}}_t, & \text{if a distractor appears}, \\ \gamma_{\mathrm{ch}}\, y^{\mathrm{ch}}_t + \gamma_{\mathrm{cf}}\, y^{\mathrm{cf}}_t, & \text{otherwise}, \end{cases} \tag{7}$$
where $\gamma_{\mathrm{ch}}$ and $\gamma_{\mathrm{cf}}$ are the coefficients related to $y^{\mathrm{ch}}_t$ and $y^{\mathrm{cf}}_t$, respectively. According to (7), when the distractor appears, the response from the correlation filter $y^{\mathrm{cf}}_t$ is selected in order to get the final location estimation. This is because color histogram-based representation is less discriminative than correlation filter-based representation: it is often inadequate to discriminate the target object from the background, it is sensitive to motion blur, and it cannot handle illumination variation well. This choice is also supported by the benchmark results of the VOT2014 dataset [18,19] and the VOT2015 dataset [15,20]. The DSST tracker [16], SAMF tracker [21], and KCF tracker [22] occupied the top three ranks in the benchmark results of the VOT2014 dataset. These trackers are based on a shape HOG correlation filter. Furthermore, shape HOG correlation filter-based trackers consistently lead the accuracy-robustness ranking compared with color-based trackers on the VOT2015 benchmark dataset.
Scale changes of the target object can also cause tracking failure. For this reason, scale estimation is required, for which a correlation filter can be used, as shown in [16]. The process is almost the same as for translation estimation. The scale sample $z_{\mathrm{scale},t}$ is extracted at the location that maximizes $S_t$, considering the scale estimation from the previous frame $s_{t-1}$. After $z_{\mathrm{scale},t}$ is extracted, the parameters $A_{\mathrm{scale},t-1}$ and $B_{\mathrm{scale},t-1}$ are used together with $z_{\mathrm{scale},t}$ to obtain $y^{\mathrm{cf}}_{\mathrm{scale},t}$ by implementation in (5). The scale estimation $s_t$ at frame $t$ can be calculated by maximizing the score $y^{\mathrm{cf}}_{\mathrm{scale},t}$. The parameter $y^{\mathrm{cf}}_{\mathrm{scale},t}$ that has the maximum score is represented by the output r of the proposed method.
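The selection rule in (7) can be sketched as follows: when a distractor is flagged, only the correlation filter response is used; otherwise the two responses are merged with fixed weights. The function names are hypothetical, the responses are assumed to share the same grid, and the default weights 0.3 and 0.7 are the values reported in the experimental section.

```python
def fuse_responses(resp_ch, resp_cf, distractor, gamma_ch=0.3, gamma_cf=0.7):
    """Score map: correlation filter only if a distractor is flagged, else a weighted sum."""
    if distractor:
        return [row[:] for row in resp_cf]
    return [[gamma_ch * a + gamma_cf * b for a, b in zip(row_ch, row_cf)]
            for row_ch, row_cf in zip(resp_ch, resp_cf)]


def locate_target(score):
    """Return (row, col) of the maximum score."""
    best = (0, 0)
    for r, row in enumerate(score):
        for c, value in enumerate(row):
            if value > score[best[0]][best[1]]:
                best = (r, c)
    return best


# Toy responses: the color histogram prefers (0, 0), the correlation filter (0, 1).
resp_ch = [[0.9, 0.1], [0.0, 0.0]]
resp_cf = [[0.2, 0.5], [0.0, 0.0]]

merged_location = locate_target(fuse_responses(resp_ch, resp_cf, distractor=0))
cf_only_location = locate_target(fuse_responses(resp_ch, resp_cf, distractor=1))
```

In this toy case the merged score follows the strong color histogram peak, whereas the flagged case ignores it, which is the intended behavior when the color peak belongs to a distractor.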
Since appearance changes always occur and influence the target object, they can also cause tracking failure. Certain parameters need to be updated to handle this problem. Six parameters should be updated: the parameters $h_{O,t}$ and $h_{B,t}$ for color histogram-based representation and the parameters $A_{\mathrm{trans},t}$, $B_{\mathrm{trans},t}$, $A_{\mathrm{scale},t}$, and $B_{\mathrm{scale},t}$ for correlation filter-based representation. The parameters $h_{O,t}$ and $h_{B,t}$ can be obtained as
$$h_{O,t} = (1 - \eta_{\mathrm{ch}})\, h_{O,t-1} + \eta_{\mathrm{ch}}\, \hat{h}_{O,t}, \qquad h_{B,t} = (1 - \eta_{\mathrm{ch}})\, h_{B,t-1} + \eta_{\mathrm{ch}}\, \hat{h}_{B,t}, \tag{8}$$
where $\hat{h}_{O,t}$ is the color histogram for the target object, $\hat{h}_{B,t}$ is the color histogram for the background, and $\eta_{\mathrm{ch}}$ is a coefficient related to the color histogram-based representation.
On the other hand, the samples $f_{\mathrm{trans}}$ and $f_{\mathrm{scale}}$ should be extracted from frame $t$ at r and $s_t$ to update the parameters in the correlation filter-based representation, respectively. After the samples have been extracted, the updates of parameters $A_{\mathrm{trans},t}$ and $B_{\mathrm{trans},t}$ are determined by using (4) with $f_{\mathrm{trans}}$. Parameters $A_{\mathrm{scale},t}$ and $B_{\mathrm{scale},t}$ are also updated by using (4) with $f_{\mathrm{scale}}$.
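The histogram update can be sketched as a linear interpolation between the previous model and the histogram measured in the current frame. The interpolation form and the joint RGB binning below are assumptions made for illustration (the text states only that 32 bins per channel and an update coefficient of 0.01 were used), and the function names are hypothetical.

```python
def rgb_bin_index(pixel, bins_per_channel=32):
    """Map an 8-bit (r, g, b) pixel to a joint bin index (assumed joint binning)."""
    width = 256 // bins_per_channel          # 8 intensity levels per bin
    r, g, b = (channel // width for channel in pixel)
    return (r * bins_per_channel + g) * bins_per_channel + b


def build_histogram(pixels, bins_per_channel=32):
    """Count pixels per joint bin for one region (object or background)."""
    hist = [0.0] * bins_per_channel ** 3
    for pixel in pixels:
        hist[rgb_bin_index(pixel, bins_per_channel)] += 1.0
    return hist


def update_histogram(h_prev, h_new, eta_ch=0.01):
    """Blend the previous model with the current measurement."""
    return [(1.0 - eta_ch) * p + eta_ch * n for p, n in zip(h_prev, h_new)]


h_model = [1.0, 0.0]
h_measured = [0.0, 1.0]
h_updated = update_histogram(h_model, h_measured)
```

A small coefficient such as 0.01 makes the model change slowly, so a few frames dominated by a distractor or an occluder do not overwrite the learned colors.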

Experimental Results and Discussions
In this section, a comprehensive evaluation of the proposed method is presented. The proposed method is compared on two recently published benchmarks that are widely used: VOT2014 [18,19] and VOT2015 [15,20]. The method was implemented in MATLAB R2016a, and the experiment was performed on an Intel(R) Core(TM) i5 2.60 GHz CPU with 8 GB RAM. For color histogram-based representation, the number of bins $N_{\mathrm{bins}}$ was 32 for each channel of a red green blue (RGB) image. The value of the parameter $\eta_{\mathrm{ch}}$ for updating the color histogram was 0.01. Further, for the correlation filter-based representation, we used a HOG cell size of 8 × 8. The values of parameters $\eta_{\mathrm{cf}}$, $\lambda$, and $d_0$ were 0.01, 0.01, and 20, respectively. When a distractor did not appear, the score $S_t$ was constructed from the merged responses of $y^{\mathrm{ch}}_t$ and $y^{\mathrm{cf}}_t$. Thus, coefficients $\gamma_{\mathrm{ch}}$ and $\gamma_{\mathrm{cf}}$ were required. According to the results of our experiments, the coefficients $\gamma_{\mathrm{ch}}$ and $\gamma_{\mathrm{cf}}$ were set to 0.3 and 0.7, respectively.
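For reference, the reported settings can be collected in a single configuration object. The class and field names below are hypothetical; only the numeric values come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackerConfig:
    n_bins: int = 32          # histogram bins per RGB channel
    eta_ch: float = 0.01      # update coefficient of the color histogram
    hog_cell_size: int = 8    # HOG cell size (8 x 8 pixels)
    eta_cf: float = 0.01      # update coefficient of the correlation filter
    lam: float = 0.01         # regularization parameter of the correlation filter
    d0: float = 20.0          # distance threshold for distractor detection
    gamma_ch: float = 0.3     # weight of the color histogram response
    gamma_cf: float = 0.7     # weight of the correlation filter response


config = TrackerConfig()
```

The two merging weights sum to one, so the merged score stays on the same scale as the individual responses.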
Figure 4 shows the AR rank plots of LPMT and the state-of-the-art methods for the challenges of camera motion, illumination change, motion change, occlusion, and size change. For each challenge, LPMT shows a good performance: it is always ranked in the top five among all 33 trackers. In particular, in the occlusion challenge, where most trackers fail and the problem is compounded by the presence of an object similar to the target object, LPMT outperforms the other state-of-the-art algorithms. This indicates that the proposed method meets these challenges effectively. In this figure, neutral means that no challenge exists in the sequence frame.
Figure 5 shows the AR rank plots of LPMT and the state-of-the-art trackers on the VOT2014 benchmark dataset for all the challenges combined and the average expected overlap rank. Since LPMT showed a good performance according to the AR rank plot for each challenge, where it was always ranked in the top five, this method also ranked in the top five for the overall challenges. Based on the average expected overlap, LPMT was ranked fourth, with an average expected overlap of almost 0.3. On this benchmark dataset, DSST [16] achieved the top rank in average expected overlap, with a value equal to 0.3. This method uses HOG and grayscale features. For detailed information about the VOT2014 benchmark dataset and its performance parameters, please refer to [18,19].
Figure 5: Accuracy-robustness rank plot of LPMT and the state-of-the-art trackers on the VOT2014 benchmark dataset for the baseline experiment of the overall challenges (a) and the average expected overlap rank (b). In the accuracy-robustness rank plot, the accuracy and robustness ranks are plotted along the vertical and horizontal axes, respectively. Our LPMT is represented by the purple square.

Mathematical Problems in Engineering
In the VOT2015 benchmark dataset, there are 60 sequences that represent more challenging problems than those in the VOT2014 dataset. As for the VOT2014 benchmark dataset, the accuracy and robustness performance parameters were used, which are represented by the AR rank plot. To justify the design choices of the proposed method LPMT on this benchmark dataset, it is compared with the proposed method without distractor detection, the proposed method which uses only shape HOG features, and the proposed method which uses only color histogram features. Figure 6 shows the results of these comparisons.
On the VOT2015 benchmark dataset, the proposed tracker, the proposed tracker without distractor detection, the proposed tracker which only uses shape HOG features, and the proposed tracker which only uses color histogram features all achieve an accuracy rank of 1.00. Furthermore, the robustness ranks (baseline mean) of the proposed tracker, the proposed tracker without distractor detection, the proposed tracker which only uses shape HOG features, and the proposed tracker which only uses color histogram features are 1.00, 1.33, 2.83, and 3.33, respectively. These results show that color histogram features are less robust than shape HOG features, which indicates that color histogram features are less discriminative than shape HOG features. Furthermore, these results also show that the proposed tracker with distractor detection is more robust than the proposed tracker without distractor detection.
Figure 8 shows the AR rank plots of LPMT and the state-of-the-art trackers on the VOT2015 benchmark dataset for the overall challenges and the average expected overlap rank. Since LPMT shows a good performance in the AR rank plot for each challenge, where it was always ranked in the top two, this method outperforms the other state-of-the-art trackers for the overall challenges. Based on the average expected overlap, LPMT achieves the first rank, with an average expected overlap equal to 0.25. On this benchmark dataset, DSST [16], which achieved the top rank in average expected overlap on the VOT2014 benchmark dataset, was ranked thirtieth. The second rank is achieved by the Rajssc tracker, which is based on a correlation filter. For detailed information about the VOT2015 benchmark dataset and its performance parameters, please refer to [15,20].

Conclusions
This paper presented a method that uses complementary learners, which consist of the response of the color histogram and the response of the correlation filter, for representing the target object. To overcome a distractor that has a representation similar to that of the target object, the proposed method also detects the distractor based on the responses of the color histogram and the correlation filter. Based on evaluations on the VOT2014 and VOT2015 benchmark datasets, the proposed method yields a favorable performance compared with several state-of-the-art visual tracking algorithms.
Figure 8: Accuracy-robustness rank plot for the baseline experiment of the overall challenges (a) and the average expected overlap rank (b) on the VOT2015 benchmark dataset. In the accuracy-robustness rank plot, the accuracy and robustness ranks are plotted along the vertical and horizontal axes, respectively. Our LPMT is represented by the blue plus-sign.
Given frame $t-1$, the rectangle area of the object $|R_{O,t-1}|$, and that of the background $|R_{B,t-1}|$, we calculate certain parameters that are related to each representation before we proceed to frame $t$, since the proposed method uses a color histogram and a correlation filter for representing the target object. First, considering the color histogram-based representation, the parameters $h_{O,t-1}$ and $h_{B,t-1}$ can be calculated based on the pixels in the observation area and the number of bins $N_{\mathrm{bins}}$ that are needed. For the correlation filter-based representation, the numerator $A_{\mathrm{trans},t-1}$ and denominator $B_{\mathrm{trans},t-1}$ parameters for translation estimation and the numerator $A_{\mathrm{scale},t-1}$ and denominator $B_{\mathrm{scale},t-1}$ parameters for scale estimation should be determined. Parameters $A_{\mathrm{trans},t-1}$ and $B_{\mathrm{trans},t-1}$ can be calculated by $A^{l}_{\mathrm{trans},t-1} = \bar{G}_{\mathrm{trans},t-1} \odot F^{l}_{\mathrm{trans},t-1}$ and $B_{\mathrm{trans},t-1} = \sum_{k=1}^{d} \bar{F}^{k}_{\mathrm{trans},t-1} \odot F^{k}_{\mathrm{trans},t-1}$, respectively. Here, $\bar{G}_{\mathrm{trans},t-1}$, $F^{l}_{\mathrm{trans},t-1}$, $\bar{F}^{k}_{\mathrm{trans},t-1}$, and $F^{k}_{\mathrm{trans},t-1}$ are the complex conjugate of the DFT of $g_{\mathrm{trans},t-1}$, the DFT of $f^{l}_{\mathrm{trans},t-1}$, the complex conjugate of the DFT of $f^{k}_{\mathrm{trans},t-1}$, and the DFT of $f^{k}_{\mathrm{trans},t-1}$, respectively. Similarly, parameters $A_{\mathrm{scale},t-1}$ and $B_{\mathrm{scale},t-1}$ can be calculated by $A^{l}_{\mathrm{scale},t-1} = \bar{G}_{\mathrm{scale},t-1} \odot F^{l}_{\mathrm{scale},t-1}$ and $B_{\mathrm{scale},t-1} = \sum_{k=1}^{d} \bar{F}^{k}_{\mathrm{scale},t-1} \odot F^{k}_{\mathrm{scale},t-1}$, respectively. Parameters $\bar{G}_{\mathrm{scale},t-1}$, $F^{l}_{\mathrm{scale},t-1}$, $\bar{F}^{k}_{\mathrm{scale},t-1}$, and $F^{k}_{\mathrm{scale},t-1}$ are the complex conjugate of the DFT of $g_{\mathrm{scale},t-1}$, the DFT of $f^{l}_{\mathrm{scale},t-1}$, the complex conjugate of the DFT of $f^{k}_{\mathrm{scale},t-1}$, and the DFT of $f^{k}_{\mathrm{scale},t-1}$, respectively. After these parameters for frame $t-1$ have been calculated, the search for the target object in frame $t$ can proceed. To search for the target object in frame $t$, the response from the color histogram $y^{\mathrm{ch}}_t$ and the response from the correlation filter $y^{\mathrm{cf}}_t$ are needed. Given the search area of the target object $|R_{\mathrm{search},t}|$ at frame $t$, where $|R_{\mathrm{search},t}| = |R_{B,t}|$, to obtain $y^{\mathrm{ch}}_t$, we use $h_{O,t-1}$ and $h_{B,t-1}$ and then implement these parameters in (1), where $\mathrm{id}_x$ is related to the pixel at $|R_{\mathrm{search},t}|$. Further, the results of this step are computed by using an integral image in order to obtain $y^{\mathrm{ch}}_t$. On the other hand, translation estimation is used to estimate the location of the target object when the correlation filter-based target object representation is used. Given frame $t$, the translation sample $z_{\mathrm{trans},t}$ is extracted from $|R_{\mathrm{search},t}|$ at the scale estimated in the previous frame, $s_{t-1}$. After $z_{\mathrm{trans},t}$ is extracted, the parameters $A_{\mathrm{trans},t-1}$ and $B_{\mathrm{trans},t-1}$ are used together with $z_{\mathrm{trans},t}$ to obtain $y^{\mathrm{cf}}_t$ by implementation in (5).
Figure 3: Accuracy-robustness rank plot based on the VOT2014 benchmark dataset. Panels: ranking plots for the labels neutral, illumination change, motion change, occlusion, and size change, and for the experiment baseline (mean). The proposed tracker which only uses color histogram features is symbolized by the red circle. The proposed tracker which only uses the shape HOG correlation filter is symbolized by the yellow cross. The proposed tracker without distractor detection is symbolized by the green asterisk. The proposed tracker is symbolized by the green triangle.

Figure 4: Accuracy-robustness rank plots of LPMT and the state-of-the-art trackers on the VOT2014 benchmark dataset for the experimental baseline of the following challenges: camera motion, neutral, illumination change, motion change, occlusion, and size change. The accuracy and robustness ranks are plotted along the vertical and horizontal axes, respectively. LPMT is represented by the purple square.
Figure 6: Accuracy-robustness rank plot based on the VOT2015 benchmark dataset. Panels: ranking plots for the labels camera motion, neutral, illumination change, motion change, occlusion, and size change, and for the experiment baseline (mean). The proposed method which only uses color histogram features is symbolized by the red circle. The proposed method which only uses the shape HOG correlation filter is symbolized by the yellow cross. The proposed method without distractor detection is symbolized by the green asterisk. The proposed method is symbolized by the green triangle.
Figure 7: Accuracy-robustness rank plot for the baseline experiment of the challenges of camera motion, neutral, illumination change, motion change, occlusion, and size change on the VOT2015 benchmark dataset (one ranking plot per label). The accuracy and robustness ranks are plotted along the vertical and horizontal axes, respectively. LPMT is represented by the blue plus-sign.
Given frame $t$, we can calculate the color histogram of the object, $h_{O}$, and the color histogram of the background, $h_{B}$, from the previous frame to obtain the response of the color histogram, $y^{\mathrm{ch}}_t$. First, this response is computed from the pixel at location $x$ in the search area of the target object $|R_{\mathrm{search},t}|$, which has the bin index $\mathrm{id}_x$. Then, following Bayes' theorem, we calculate $P(x \in O \mid \mathrm{id}_x)$ as in (1).