Discriminative Fusion Correlation Learning for Visible and Infrared Tracking

Discriminative correlation filter- (DCF-) based trackers are computationally efficient and achieve excellent tracking in challenging applications. However, most of them suffer from low accuracy and robustness due to the lack of diversity information extracted from a single type of spectral image (visible spectrum). The fusion of visible and infrared imaging sensors, a typical form of multisensor cooperation, provides complementary useful features and consistently helps recognize the target from the background efficiently in visual tracking. Therefore, this paper proposes a discriminative fusion correlation learning model to improve DCF-based tracking performance by efficiently combining multiple features from visible and infrared images. Fusion learning filters are extracted via late fusion with early estimation, in which the performances of the filters are weighted to improve the flexibility of fusion. Moreover, the proposed discriminative filter selection model considers the surrounding background information in order to increase the discriminability of the template filters and thereby improve model learning. Extensive experiments show that the proposed method achieves superior performance in challenging visible and infrared tracking tasks.


Introduction
Visual tracking has received widespread attention for its extensive applications in video surveillance, autonomous driving, human-machine interaction, military targeting, robot vision, etc. [1,2]. Depending on the appearance model, existing tracking algorithms fall into two categories: generative and discriminative tracking. Generative tracking algorithms build a target model and search for the candidate image patch with maximal similarity. For example, Wang et al. [3] proposed a novel regression-based object tracking framework which successfully incorporates the Lucas-Kanade algorithm into an end-to-end deep learning paradigm. Chi et al. [4] trained a dual network with random patches measuring the similarities between the network activation and target appearance to improve the robustness of visual tracking. In contrast, the goal of discriminative algorithms is to learn a classifier that discriminates between the target's appearance and that of the environment, given an initial image patch containing the target. Yang et al. [5] proposed a temporally restricted reverse-low-rank learning algorithm for visual tracking that jointly represents target and background templates via candidates, exploits the low-rank structure among consecutive target observations, and enforces the temporal consistency of the target at a global level. A new peak strength metric [6] was proposed to measure the discriminative capability of the learned correlation filter; it effectively strengthens the peak of the correlation response, leading to more discriminative performance than previous methods.
Besides these efforts, other researchers have worked on tracking methods that are both generative and discriminative. For instance, Zhang et al. [7] obtained an object likelihood map to adaptively regularize correlation filter learning, suppressing background clutter while making full use of long-term stable target appearance information. Qi et al. [8] proposed a structure-aware local sparse coding algorithm, which encodes a target candidate using templates with both global and local sparsity constraints and obtains a more precise and discriminative sparse representation to account for appearance changes. In [9], an adaptive set of filtering templates is learned to alleviate the drifting problem of tracking by carefully selecting object candidates in different situations to jointly capture target appearance variations. Moreover, a variety of simple yet effective features are integrated into the learning process of the filters to further improve their discriminative power. In the salient-sparse-collaborative tracker [10], an object salient feature map is built to create a salient-sparse discriminative model and a salient-sparse generative model that both handle appearance variation and reduce tracking drift effectively. A multilayer convolutional network-based visual tracking algorithm based on important region selection [11] builds high entropy selection and background discrimination models and obtains the feature maps by weighting the template filters with cluster weights, which enables the training samples to be informative, providing enough stable information, and also discriminative, so as to resist distractors. Generally speaking, discriminative and generative methods have complementary advantages in appearance modeling, and the success of a visual tracking method depends not only on its representation ability against appearance variations but also on the discriminability between target and background, thus leading to the requirement of a more robust training model [12].
Recently, discriminative correlation filter- (DCF-) based visual tracking methods [13][14][15][16][17][18] have shown excellent performance in real-time visual tracking owing to their robustness and computational efficiency. DCF-based methods work by learning an optimal correlation filter used to locate the target in the next frame. The significant gain in speed is obtained by exploiting the fast Fourier transform (FFT) at both the learning and detection stages [14]. Bolme et al. [13] presented an adaptive correlation filter, named the Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters by minimizing the output sum of squared error. Based on MOSSE, Danelljan et al. [14,15] proposed a novel scale-adaptive tracking approach that learns separate discriminative correlation filters for translation and scale estimation, achieving accurate and robust scale estimation in a tracking-by-detection framework. Galoogahi et al. [16] proposed a computationally efficient background-aware correlation filter based on handcrafted features that can efficiently model how both the foreground and background of the object vary over time. The work in [17] reformulates DCFs as a one-layer convolutional neural network that integrates feature extraction, response map generation, and model update with residual learning. Johnander et al. [18] proposed a unified formulation for learning a deformable convolution filter in which the deformable filter is represented as a linear combination of subfilters, and both the subfilter coefficients and their relative locations are inferred jointly. However, the above trackers fail when the target undergoes severe appearance changes due to the limited data supplied by single features.
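For concreteness, the MOSSE closed-form solution described above can be sketched in a few lines of NumPy. This is an illustrative single-channel sketch, with an arbitrary patch size and regularizer `lam`, not the authors' implementation:

```python
import numpy as np

def train_mosse(f, g, lam=1e-4):
    """Closed-form MOSSE-style filter learned in the Fourier domain.

    f : 2D training patch, g : desired Gaussian response.
    Returns the filter numerator A, denominator B, and regularizer lam,
    so the response to a test patch z is ifft2(conj(A) * fft2(z) / (B + lam)).
    """
    F, G = np.fft.fft2(f), np.fft.fft2(g)
    A = np.conj(G) * F            # per-frequency numerator
    B = (np.conj(F) * F).real     # per-frequency energy (denominator)
    return A, B, lam

def respond(A, B, lam, z):
    """Correlation response map of a test patch z under the learned filter."""
    Z = np.fft.fft2(z)
    return np.real(np.fft.ifft2(np.conj(A) * Z / (B + lam)))
```

Correlating the filter with the very patch it was trained on approximately reproduces the desired Gaussian output, with the response peak at the labelled target position.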
Multiple-feature fusion contains more useful information than a single feature, thus providing higher precision, certainty, and reliability for visual tracking. Wu et al. [19] proposed a data fusion approach via sparse representation with applications to robust visual tracking. Uzkent et al. [20] proposed an adaptive fusion tracking method that combines likelihood maps from multiple bands of hyperspectral imagery into one single, more distinctive representation, which increases the margin between the mean values of foreground and background pixels in the fused map. Chan et al. [21] proposed a robust adaptive fusion tracking method, which incorporates a novel complex cell into the group of object representations to enhance global distinctiveness. Feature fusion also achieves superior performance in correlation filter-based tracking. For example, Rapuru et al. [22] proposed a robust tracking algorithm that efficiently fuses tracking, learning, and detection with the systematic model update strategy of the kernelized correlation filter tracker.
Although much effort has been made, single-sensor feature fusion-based tracking suffers from low accuracy and robustness due to the lack of diversity information. The fusion of visible and infrared sensors, a typical form of multisensor cooperation, provides complementary useful features and is able to achieve more robust and accurate tracking results [23]. Li et al. [24] designed a fusion scheme containing joint sparse representation and a colearning update model to fuse color visual spectrum and thermal spectrum images for object tracking. Li et al. [25] proposed an adaptive fusion scheme based on collaborative sparse representation in a Bayesian filtering framework for online tracking. Mangale and Khambete [26] developed a reliable camouflaged target detection and tracking system using the fusion of visible and infrared imaging. Yun et al. [23] proposed a compressive time-space Kalman fusion tracker with time-space adaptability for visible and infrared images and introduced an extended Kalman filter to update the fusion coefficients optimally. A visible and infrared fusion tracking algorithm based on a multiview multikernel fusion model is presented in [27]. Zhang et al. [28] transferred visible tracking data to infrared data to obtain better tracking performance. Lan et al. [29] proposed a joint feature learning and discriminative classifier framework for multimodality tracking, which jointly eliminates outlier samples caused by large variations and learns discriminability-consistent features from heterogeneous modalities. Li et al. [30] proposed a convolutional neural network architecture comprising a two-stream ConvNet and a FusionNet, which shows that tracking with visible and infrared fusion outperforms tracking with a single sensor in terms of accuracy and robustness.
DCF-based trackers have a significantly low computational load and are especially suitable for a variety of real-time challenging applications. However, most DCF-based trackers suffer from low accuracy and robustness due to the lack of diversity information extracted from a single type of spectral image (visible spectrum). Therefore, this paper proposes a discriminative fusion correlation learning (DFCL) model to improve DCF-based tracking performance by combining multiple features from visible and infrared imaging sensors. The main contributions of our work are summarized as follows: (i) A discriminative fusion correlation learning model is presented to fuse visible and infrared features such that valuable information from all sensors is preserved. (ii) The proposed fusion learning filters are obtained via late fusion with early estimation, in which the performances of the filters are weighted to improve the flexibility of fusion. (iii) The proposed discriminative filter selection model considers the surrounding background information in order to increase the discriminability of the template filters and thereby improve model learning. The remainder of this paper is organized as follows. In Section 2, the multichannel discriminative correlation filter is introduced. In Section 3, we describe our work in detail. The experimental results are presented in Section 4. Section 5 concludes with a general discussion.

Multichannel Discriminative Correlation Filter
Multichannel DCF provides superior robustness and efficiency in dealing with challenging tracking tasks [14]. In the multichannel DCF-based tracking algorithm, $d$-channel Histogram of Oriented Gradients (HOG) features [14] are extracted from the target sample $f$ to maintain diverse information. During the training process, the goal is to learn a correlation filter $h$ by minimizing the error of the correlation response compared to the desired correlation output $g$:

$$\varepsilon = \Big\| \sum_{l=1}^{d} h^{l} \ast f^{l} - g \Big\|^{2} + \lambda \sum_{l=1}^{d} \big\| h^{l} \big\|^{2}, \quad (1)$$

where $\ast$ denotes circular correlation and $\lambda$ is the weight parameter [14]. $f^{l}$ and $h^{l}$ ($l = 1, \cdots, d$) are the $l$-th channel feature and the corresponding correlation filter, respectively. The correlation output $g$ is supposed to be a Gaussian function with a parametrized standard deviation [14].
The minimization of (1) can be solved in the Fourier domain, where the optimal filter is

$$H^{l} = \frac{\bar{G} F^{l}}{\sum_{k=1}^{d} \bar{F}^{k} F^{k} + \lambda}, \quad (2)$$

where $H$, $F$, and $G$ are the discrete Fourier transforms (DFTs) of $h$, $f$, and $g$, respectively. The bar $\bar{\cdot}$ denotes complex conjugation, and the multiplications and divisions in (2) are performed pointwise. The numerator $A_{t}^{l}$ and denominator $B_{t}$ of the filter $H_{t}^{l}$ in (2) are updated as

$$A_{t}^{l} = (1 - \eta) A_{t-1}^{l} + \eta \bar{G}_{t} F_{t}^{l}, \qquad B_{t} = (1 - \eta) B_{t-1} + \eta \sum_{k=1}^{d} \bar{F}_{t}^{k} F_{t}^{k}, \quad (3)$$

where $\eta$ is a learning rate parameter.
During the tracking process, the DFT $Y_{t}$ of the correlation score $y_{t}$ of the test sample $z_{t}$ is computed in the Fourier domain as

$$Y_{t} = \frac{\sum_{l=1}^{d} \bar{A}_{t-1}^{l} Z_{t}^{l}}{B_{t-1} + \lambda}, \quad (4)$$

where $Y_{t}$ and $Z_{t}^{l}$ are the DFTs of $y_{t}$ and $z_{t}^{l}$, respectively, and $A_{t-1}^{l}$ and $B_{t-1}$ are the numerator and denominator of the filter updated in the previous frame. The correlation score is then obtained by taking the inverse DFT $y_{t} = \mathcal{F}^{-1}\{Y_{t}\}$. The estimate of the current target state is obtained by finding the maximum correlation score among the test samples.
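The multichannel training, running update, and detection steps above can be sketched in NumPy as follows. The toy feature stacks, and the `eta` and `lam` values, are illustrative choices, not the paper's settings:

```python
import numpy as np

def train_filter(f, g):
    """Learn the multichannel DCF numerator A (one per channel) and the
    shared denominator B from a feature stack f of shape (d, H, W) and a
    desired Gaussian output g of shape (H, W)."""
    F = np.fft.fft2(f, axes=(-2, -1))
    G = np.fft.fft2(g)
    A = np.conj(G)[None] * F                 # per-channel numerators
    B = np.sum(np.conj(F) * F, axis=0).real  # shared denominator
    return A, B

def update_filter(A, B, A_new, B_new, eta=0.025):
    """Running-average model update with learning rate eta."""
    return (1 - eta) * A + eta * A_new, (1 - eta) * B + eta * B_new

def detect(A, B, z, lam=1e-2):
    """Correlation score map for a test feature stack z of shape (d, H, W)."""
    Z = np.fft.fft2(z, axes=(-2, -1))
    Y = np.sum(np.conj(A) * Z, axis=0) / (B + lam)
    return np.real(np.fft.ifft2(Y))
```

Running `detect` on the training stack itself approximately recovers the Gaussian label, so the score map peaks at the labelled target position.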

Proposed Discriminative Fusion Correlation Learning
In this section, we introduce the tracking framework of the proposed algorithm, the general scheme of which is described in Figure 1. Firstly, the multichannel features are extracted from the visible and infrared images, respectively, according to [14]. Secondly, the proposed discriminative filter selection and the fusion filter learning are applied to obtain the fusion response map. Finally, the discriminative filters and fusion filters are updated via the tracking result obtained from the response map. We discuss these steps in detail below.

Discriminative Filter Selection.
According to DCF-based trackers, we obtain the correlation output $g$ by

$$g = \sum_{l=1}^{d} h_{T}^{l} \ast f_{T}^{l}, \quad (5)$$

where $f_{T}^{l}$ and $h_{T}^{l}$ are the target sample and the target correlation filter corresponding to the $l$-th channel feature among the $d$ channels, respectively. In this paper, $g$ is selected as a 2D Gaussian function with $\sigma = 2.0$ [13].
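Generating such a Gaussian correlation target with $\sigma = 2.0$ is a one-liner; a small sketch (the grid size here is an arbitrary choice):

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """2D Gaussian correlation target of size (h, w), peaked at the centre."""
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    return np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigma ** 2))
```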
Before tracking, we need to choose the optimal target correlation filters in the training step by minimizing the correlation error of (5) in the Fourier domain, which yields

$$H_{T}^{l} = \frac{\bar{G} F_{T}^{l}}{\sum_{k=1}^{d} \bar{F}_{T}^{k} F_{T}^{k} + \lambda}, \quad (6)$$

where $F_{T}^{l}$, $H_{T}^{l}$, and $G$ are the DFTs of $f_{T}^{l}$, $h_{T}^{l}$, and $g$, respectively.
Different from the single training sample of the target appearance, multiple background samples at different locations around the target need to be considered to maintain a stable model. However, extracting multichannel features from each background sample increases the computational complexity significantly. Moreover, in practice, single-channel features from multiple background samples are sufficient to achieve satisfactory performance. Therefore, in this paper, we extract $m$ background samples randomly in the range of an annulus around the target location [11] and obtain the correlation output $g_{B}$ as

$$g_{B} = h_{B} \ast f_{B,i}, \quad (7)$$

where $f_{B,i}$, $i = 1, \cdots, m$, denotes the $i$-th background sample.
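Drawing the $m$ background sample centres from an annulus around the target can be sketched as follows. The radii passed in are hypothetical, since the paper follows [11] without restating exact values:

```python
import numpy as np

def sample_annulus(center, r_in, r_out, m, rng=None):
    """Draw m background sample centres uniformly from an annulus.

    center : (y, x) target location; r_in < r_out are the annulus radii.
    Returns an (m, 2) array of integer (y, x) positions.
    """
    if rng is None:
        rng = np.random.default_rng()
    theta = rng.uniform(0.0, 2.0 * np.pi, m)
    # sqrt sampling gives a spatially uniform density over the annulus
    r = np.sqrt(rng.uniform(r_in ** 2, r_out ** 2, m))
    ys = np.round(center[0] + r * np.sin(theta)).astype(int)
    xs = np.round(center[1] + r * np.cos(theta)).astype(int)
    return np.stack([ys, xs], axis=1)
```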
Similarly, the optimal background correlation filter in the training step is selected by minimizing the correlation error over the $m$ background samples:

$$H_{B} = \frac{\sum_{i=1}^{m} \bar{G} F_{B,i}}{\sum_{i=1}^{m} \bar{F}_{B,i} F_{B,i} + \lambda}, \quad (8)$$

where $F_{B,i}$ and $H_{B}$ are the DFTs of $f_{B,i}$ and $h_{B}$, respectively. While tracking, the DFT $Y_{t,d}$ of the estimated discriminative correlation score $y_{t,d}$ of the test sample $z_{t}$ is defined as

$$Y_{t,d} = Y_{T,t} - Y_{B,t}, \quad (9)$$

where $Y_{T,t}$ and $Y_{B,t}$ are the responses of the target and background filters to the test sample, each computed as in (4). The fusion score over the visible and infrared sensors is

$$Y_{t,f} = \sum_{i} w_{t,i} Y_{t,d,i}, \quad (10)$$

where the weight $w_{t,i}$ is derived from the correlation score of the $i$-th image computed by (9). After obtaining $Y_{t,f}$, the fusion correlation score $y_{t,f}$ is obtained by taking the inverse DFT $y_{t,f} = \mathcal{F}^{-1}\{Y_{t,f}\}$. The fusion location of the current target state is obtained by finding the maximum correlation score among the test samples.
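The late-fusion step can be sketched as a weighted combination of the per-sensor response maps. Here the weights are taken proportional to each map's peak response, which is one plausible reading of the early-estimation weighting; the paper's exact weight formula is not reproduced:

```python
import numpy as np

def fuse_responses(maps):
    """Fuse per-sensor response maps with peak-proportional weights.

    maps : list of 2D response maps (e.g. visible and infrared).
    Returns the fused map and the estimated (y, x) target position.
    """
    peaks = np.array([m.max() for m in maps])
    w = peaks / peaks.sum()                    # normalised fusion weights
    fused = sum(wi * mi for wi, mi in zip(w, maps))
    pos = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, pos
```

A sensor whose response peak is weak (e.g. an ambiguous infrared map) thus contributes less to the fused localisation, which is the intuition behind weighting filter performances.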
The whole tracking process of DFCL is summarized in Algorithm 1.

Experiments
The proposed DFCL algorithm was tested on several challenging real-world sequences, and qualitative and quantitative analyses of the tracking results are presented in this section.

Input: The $t$-th visible and infrared images.
For $t = 1$ to the number of frames do:
1. Crop the samples and extract the $l$-th ($l = 1, \cdots, d$) channel features $f^{l}$ for the visible and infrared images, respectively.

Experimental Results
The performance of DFCL was compared with the state-of-the-art trackers Struck [32], ODFS [33], STC [34], KCF [35], and ROT [36], the DCF-based trackers MOSSE [13], DSST [14], and fDSST [15], and the visible-infrared fusion trackers TSKF [23], MVMKF [27], L1-PF [19], JSR [24], and CSR [25]. Figures 2-6 present the experimental results of the test trackers in the challenging visible sequences Biker [37], Campus [37], Car [38], Crossroad [39], Hotkettle [39], Inglassandmobile [39], Labman [40], Pedestrian [41], and Runner [38], as well as their corresponding infrared sequences Biker-ir, Campus-ir, Car-ir, Crossroad-ir, Hotkettle-ir, Inglassandmobile-ir, Labman-ir, Pedestrian-ir, and Runner-ir. Single-sensor trackers were tested separately on the visible and the corresponding infrared sequences, while the visible-infrared fusion trackers obtain their results with information from both the visible and infrared sequences. For convenience of presentation, some tracking curves are not shown entirely in the figures. Next, the performance of the trackers in each sequence is described in detail.

(a) Sequences Biker and Biker-ir: Biker presents an example of complex background clutter. The target human in the visible sequence encounters similar background disturbances (i.e., bikes), which cause the ODFS, MOSSE, fDSST, TSKF, and MVMKF trackers to drift away from the target. The corresponding infrared sequence Biker-ir provides temperature information that eliminates the background clutter in Biker. But when the target approaches another person at around Frame #20, Struck, ODFS, STC, MOSSE, TSKF, and MVMKF do not perform well because they are not able to distinguish the target from persons with a similar temperature in infrared sequences. Only KCF, ROT, DSST, and our DFCL achieve precise and robust performances in these sequences.
(b) Sequences Campus and Campus-ir: the target in Campus and Campus-ir undergoes background clutter, occlusion, and scale variation. At the beginning of Campus, ODFS, STC, KCF, and ROT lose the target due to background disturbance. Only TSKF and DFCL perform well, while Struck, fDSST, and MVMKF do not achieve accurate results. Because of the infrared information provided by Campus-ir, fewer test trackers lose the target when background clutter occurs, as shown in Figure 2. But Struck, KCF, and ROT mistake another person for the target. As shown in Figure 2, most of the trackers result in tracking failures, whereas DFCL outperforms the others in most metrics (location accuracy and success rate).
(c) Sequences Car and Car-ir: Car and Car-ir demonstrate the efficiency of DFCL in coping with heavy occlusions. The target driving car is occluded by lampposts and trees many times, which causes tracking failures of most trackers. Only TSKF, MVMKF, and DFCL are able to handle the occlusion throughout the tracking process in this sequence. As shown in Figure 2, most trackers perform better in Car-ir than in Car because the infrared features can overcome the difficulty of detecting the target among similar surrounding background. STC, TSKF, MVMKF, and DFCL are able to handle this problem, and the result of DFCL is the most accurate, as shown in Figure 2.
(d) Sequences Crossroad and Crossroad-ir: the target in Crossroad and Crossroad-ir undergoes heavy background clutter while she crosses the road. While the target is passing the road lamp, both ODFS and JSR lose the target. Then, when a car passes by the target, Struck, TSKF, and MVMKF drift away from the target. When the target walks toward the sidewalk, most of the trackers are not able to handle the heavy background clutter, but our tracker achieves satisfactory tracking results, as shown in Figures 2-4.
(e) Sequences Hotkettle and Hotkettle-ir: in these sequences, tracking is hard because of the changing, complex background clutter. Most trackers perform better in Hotkettle-ir than in Hotkettle because the temperature divergence makes the hot target more distinct against the cold background. Struck, KCF, DSST, fDSST, and DFCL achieve robust and accurate tracking performances, as shown in Figures 2-4.
(f) Sequences Inglassandmobile and Inglassandmobile-ir: these sequences demonstrate the performances of the 14 trackers under background clutter, illumination changes, and occlusion. As shown in Figure 2, when the illumination changes at around Frame #300, ODFS and fDSST lose the target, and KCF, TSKF, and L1-PF drift slightly away from the target. When the target approaches a tree, the background clutter causes most of the trackers to fail, as can be seen from Figure 2. Our DFCL can overcome these challenges and performs well in these sequences.
(g) Sequences Labman and Labman-ir: the experiments in Labman and Labman-ir evaluate tracking performance under appearance variation, rotation, scale variation, and background clutter. In Labman, when the target man walks into the laboratory, ODFS, STC, and MOSSE lose the target. When the man keeps shaking and turning his head at around Frame #400, KCF, ROT, and DSST fail. Most trackers achieve better tracking performances in Labman-ir, as shown in Figure 2.
(h) Sequences Pedestrian and Pedestrian-ir: the target in Pedestrian and Pedestrian-ir undergoes heavy background clutter and occlusion. As shown in Figure 2, the other trackers result in tracking failures in Pedestrian, whereas our tracker shows satisfactory performance in terms of both accuracy and robustness. The efficient infrared features extracted from Pedestrian-ir ensure the tracking success of Struck, STC, and DFCL, as can be seen from Figures 2-4.
(i) Sequences Runner and Runner-ir: Runner and Runner-ir contain examples of heavy occlusion, abrupt movement, and scale variation. The target running man is occluded by lampposts, trees, a stone tablet, and bushes many times, resulting in tracking failures of most trackers. The abrupt movement and scale variation also cause many trackers to drift away from the target in both Runner and Runner-ir, as shown in Figure 2. Once again, our DFCL is able to overcome the above problems and achieves good performance.
Figures 5 and 6 quantitatively demonstrate the performances in terms of average location error (pixel) and success rate. The success rate is defined as the fraction of frames in the whole tracking process in which the overlapping rate exceeds 0.5 [33]. A smaller average location error and a larger success rate indicate higher accuracy and robustness. Figures 5 and 6 show that DFCL performs satisfactorily in most of the tracking sequences.
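In code, the success rate defined above (a frame succeeds when its overlapping rate exceeds 0.5) is simply:

```python
def success_rate(overlaps, threshold=0.5):
    """Fraction of frames whose overlapping rate exceeds the threshold.

    overlaps : iterable of per-frame overlapping rates in [0, 1].
    """
    frames = list(overlaps)
    return sum(o > threshold for o in frames) / len(frames)
```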
To validate the effectiveness of the discriminative filter selection model of DFCL, we compared the tracker DCL (the proposed DFCL without the fusion learning model) with DFCL and the original DCF tracker MOSSE on the visible sequences. The performances shown in Figure 7 demonstrate the efficiency of the discriminative filter selection model, especially in the sequences with background clutter, i.e., Biker, Hotkettle, Inglassandmobile, and Pedestrian.

Conclusion
Discriminative correlation filter- (DCF-) based trackers have the advantage of being computationally efficient and more robust than most other state-of-the-art trackers in challenging tracking tasks, making them especially suitable for a variety of real-time applications. However, most DCF-based trackers suffer from low accuracy due to the lack of diversity information extracted from a single type of spectral image (visible spectrum). The fusion of visible and infrared sensors, a typical form of multisensor cooperation, provides complementary useful features and consistently helps recognize the target from the background efficiently in visual tracking. For these reasons, this paper proposed a discriminative fusion correlation learning model to improve DCF-based tracking performance by combining multiple features from visible and infrared imaging sensors. The proposed fusion learning filters are obtained via late fusion with early estimation, in which the performances of the filters are weighted to improve the flexibility of fusion. Moreover, the proposed discriminative filter selection model considers the surrounding background information in order to increase the discriminability of the template filters and thereby improve model learning. Numerous real-world video sequences were used to test DFCL and other state-of-the-art algorithms, and only representative videos were selected for presentation. The experimental results demonstrate that DFCL is highly accurate and robust.

Figure 1: Discriminative fusion correlation learning for visible and infrared tracking. Blue and green boxes denote the target and background samples extracted from the tracking result, respectively.

Figure 2: Tracking performances of the test sequences.

Figure 4: Overlapping rate of the test sequences.

Figure 7: Average location error (pixel) and success rate of DCL, DFCL, and MOSSE on the test sequences.
Output: Target result and the discriminative correlation filters $H_{T}$ and $H_{B}$.
Algorithm 1: Discriminative fusion correlation learning for visible and infrared tracking.
4.1. Experimental Environment and Evaluation Criteria. DFCL was implemented in the C++ programming language with the .NET Framework 4.0 in Visual Studio 2010 on an Intel Dual-Core 1.70 GHz CPU with 4 GB RAM. Two metrics, i.e., location error (pixel) and overlapping rate, are used to evaluate the tracking results quantitatively. The location error is computed as $e = \sqrt{(x_{G} - x_{T})^{2} + (y_{G} - y_{T})^{2}}$, where $(x_{G}, y_{G})$ and $(x_{T}, y_{T})$ are the centers of the ground-truth bounding box (either downloaded from a standard database or located manually) and the tracking bounding box, respectively. The tracking overlapping rate is defined as $v = a(R_{G} \cap R_{T}) / a(R_{G} \cup R_{T})$, where $R_{G}$ and $R_{T}$ denote the ground-truth and tracking bounding boxes, respectively, and $a(\cdot)$ is the rectangular area function. A smaller location error and a larger overlapping rate indicate higher accuracy and robustness.
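These two metrics translate directly from their definitions into code; the (x, y, w, h) box layout below is our assumption for illustration:

```python
import math

def location_error(gt_center, tr_center):
    """Euclidean distance (pixels) between ground-truth and tracked centres."""
    return math.hypot(gt_center[0] - tr_center[0], gt_center[1] - tr_center[1])

def overlap_rate(gt, tr):
    """Intersection-over-union of two axis-aligned (x, y, w, h) boxes."""
    ix = max(0.0, min(gt[0] + gt[2], tr[0] + tr[2]) - max(gt[0], tr[0]))
    iy = max(0.0, min(gt[1] + gt[3], tr[1] + tr[3]) - max(gt[1], tr[1]))
    inter = ix * iy
    union = gt[2] * gt[3] + tr[2] * tr[3] - inter
    return inter / union
```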