A Narrow Deep Learning Assisted Visual Tracking with Joint Features

A robust tracking method is proposed for complex visual sequences. Different from time-consuming offline training in current deep tracking, we design a simple two-layer online learning network which fuses local convolution features and global handcrafted features together to give the robust representation for visual tracking. The target state estimation is modeled by an adaptive Gaussian mixture. The motion information is used to direct the distribution of the candidate samples effectively. And meanwhile, an adaptive scale selection is addressed to avoid bringing extra background information. A corresponding object template model updating procedure is developed to account for possible occlusion and minor change. Our tracking method has a light structure and performs favorably against several state-of-the-art methods in tracking challenging scenarios on the recent tracking benchmark data set.


Introduction
Visual tracking is an important topic in computer vision with a wide range of applications, such as video surveillance, automobile navigation, human-computer interfaces, and driverless vehicles [1]. Although substantial progress has been made in recent years, it remains a challenging task due to many factors such as illumination changes, quick movement, and background disturbance [2].
To address these challenges for robust tracking, current visual tracking algorithms focus on exploiting robust handcrafted target representations, such as Haar-like features, color histograms, and HOG descriptors. Since each type of handcrafted feature typically addresses only a few specific kinds of appearance change, such features are not tailored to all generic objects, and sophisticated learning techniques are required to improve their representative capabilities. These learning methods build models to distinguish the target from the background. They typically learn classifiers based on multiple instance learning, online boosting, structured output SVMs, etc. Recently, correlation filter based tracking algorithms have achieved remarkable results due to their computational efficiency in the Fourier domain. The filter can locate the tracking target effectively, but all of these algorithms have the limitation of being excessively dependent on the maximum response value. When the response map becomes unreliable under some challenging circumstances, drift may occur and the object may even be lost. Different from handcrafted features, deep learning adopts a hierarchical architecture to simulate the mechanism of the human brain, which can generate outstanding representations for highly unstructured visual data. The success of convolutional neural networks (CNNs) in object recognition and detection has inspired tracking algorithms to employ the discriminative features learned by CNNs [3,4]. To produce stable shared weights, a CNN needs a large number of training samples, which are often not available in visual tracking, as there exist only a few reliable positive instances extracted from the initial frame. In addition, because CNNs are trained to recognize object classes, the deeper the network structure is, the faster the spatial information is lost. Thus, naively applying CNN models to tracking is not suitable. One way to address these problems is to fine-tune a pretrained CNN model. The other way is to design a narrow learning network.
Motivated by the challenges of object tracking in complex scenarios and inspired by the fusion method [5], we propose a novel tracking scheme similar to correlation convolution, which takes advantage of convolutional and handcrafted features and meanwhile makes use of the adaptive Gaussian mixture filter (GMF) to generate samples effectively. The main contributions of this paper are summarized below:
(1) We propose an efficient feature extraction scheme, which combines local convolutional features and global handcrafted features to produce a robust appearance expression. A simple narrow network is designed to generate high-level local features without pretraining. In this way, spatial information can be well preserved, which brings more accurate tracking.
(2) Our method takes advantage of the strong correlation between adjacent frames to produce groups of convolution filters, which help to discriminate the target from the surrounding background with the maximum likeness.
(3) An adaptive sampling scheme reshapes the candidates' distribution based on first-order motion information. This detection mechanism allows us to locate the target object correctly.
(4) In addition, we design an adaptive scale estimation method and an efficient model updating scheme. Spatiotemporal properties determine the scale change and the degree of updating.
The rest of this paper is structured as follows. We first review related work in Section 2. Next, Section 3 presents the joint features generated by a simple two-layer network, together with the adaptive particle filter tracking model and the model updating scheme. Section 4 demonstrates the objective and subjective experimental results on current benchmark datasets.

Related Work
Visual object tracking has been studied extensively, and comprehensive surveys can be found in [6][7][8][9]. In this section, for simplicity, we review only the works related to our method, which include the particle filter based trackers and the deep learning based trackers.
Particle filter (PF) algorithms have been studied in visual object tracking for many years, and their variations are still widely used nowadays, as they are neither limited to linear systems nor require the noise to be Gaussian [10][11][12]. The traditional PF algorithm implements a recursive Bayesian framework by using the nonparametric Monte Carlo sampling method, which can effectively track target objects in most scenes. But challenging problems still exist. PF requires complex appearance models to be designed to deal with different visual sequences. Moreover, while updating the posterior distributions, it uses a sequential importance sampling scheme with resampling to address the sample degeneration phenomenon, in which only a few particles representing the distribution have significant weights. Resampling may give limited results and is computationally expensive. Different from current PF, the appearance model in our method is automatically learned by a convolutional network and can be used directly in many challenging sequences. We also introduce motion information into the Gaussian equation, so that the posterior distributions are decided directly by the state of the target objects without requiring particle resampling. Compared with current PF, this reduces computational complexity and brings adaptive particle distributions.
Deep neural networks are a powerful tool for learning image representations in computer vision applications. Inspired by the success of convolutional neural networks in image classification and object recognition [13][14][15], researchers in the tracking community have started to focus on deep trackers that exploit the strength of CNNs. These deep trackers follow two trends. One trend is discriminative convolution trackers (DCT), which combine the correlation filter tracking framework with CNN features. These tracking methods replace handcrafted features such as HOG with deep features and use a correlation filter to find the maximum response [16][17][18]. The other trend is to design tracking networks and pretrain them, aiming to learn target-specific features for each new sequence. Online tracking then follows, using PF or classifiers [3,19,20]. Despite their notable performance, none of these approaches is designed for real-time applications because of their time-consuming feature extraction and complex optimization details. In addition, they cannot be trained end to end and can only tune the hyperparameters heuristically, since the feature extraction and tracking processes are separate. Different from existing deep tracking frameworks, we formulate the extraction of high-level features as a one-layer convolution operation in the spatial domain. The global handcrafted features are then fused in a layer similar to a fully connected layer. This narrow deep learning structure allows feed-forward learning to capture a robust appearance expression. The online adaptive tracking and updating mechanism brings optimal estimation for target location and scale selection.

The Tracking Framework
Our tracking algorithm includes constructing the target model, online tracking, and model updating. The flowchart is shown in Figure 1.

Joint Features Generation.
Deep convolutional features can express abstract information just like our brains, while different handcrafted features tend to deal well with certain challenging tracking problems. Here we combine local convolutional features with color information as the appearance model of the tracked objects in order to deal with challenging problems such as illumination variation and occlusion.

Local Convolutional Features.
When there is heavy disturbance from complicated backgrounds, object drift will occur in current CNN tracking or correlation filter tracking. To avoid losing the object when background information is involved, our algorithm designs a group of local filters which consider not only the inner features of the target but also the background disturbance.
The target field is preprocessed into a fixed size and a set of overlapping patches is densely sampled inside it. To maintain good geometric and illumination invariance, we calculate the HOG features for each patch to express the local appearance and shape. After this, the k-means algorithm is applied to select a foreground bank with $n$ patches $\{F_1, F_2, \ldots, F_n\}$. Each patch is processed by subtracting its mean to get the local contrast, giving the fixed foreground filters $\{F^o_1, F^o_2, \ldots, F^o_n\}$:
$$F^o_i = F_i - \bar{F}_i, \quad i = 1, 2, \ldots, n, \tag{1}$$
where $\bar{F}_i$ denotes the mean value of patch $F_i$.
Since the background context provides useful information to discriminate the target, background filters are generated from a group of samples. We choose $s$ samples around the target in the current frame, and a set of patches is selected in each sample. Just as for the foreground filters, we calculate the HOG features and use k-means clustering to select $n$ patches for each sample. After subtracting the mean value from each patch, a bank of filters $\{F^b_{j,1}, F^b_{j,2}, \ldots, F^b_{j,n}\}$ $(j = 1, 2, \ldots, s)$ is produced. Then we summarize all filters by a weighted average over the samples to produce the background filters $\{F^b_1, \ldots, F^b_n\}$:
$$F^b_i = \sum_{j=1}^{s} \omega_j F^b_{j,i}, \quad i = 1, 2, \ldots, n, \tag{2}$$
where $\omega_j$ is the weight of the $j$th sample.
The final convolutional filters $\{F^c_1, \ldots, F^c_n\}$ are defined as the difference between the target filters and the background filters:
$$F^c_i = F^o_i - F^b_i, \quad i = 1, 2, \ldots, n. \tag{3}$$
Then, given the candidates $P = \{P_1, P_2, \ldots, P_m\}$ in the current input frame, the local convolutional feature map $C = \{c_1, c_2, \ldots, c_n\}$ for each candidate $P_k$ is defined as
$$c_i = F^c_i \otimes P_k, \quad i = 1, 2, \ldots, n, \tag{4}$$
where $\otimes$ denotes the convolution operator.
In this way, the background information can be well suppressed, and the drift brought by accumulated error is alleviated. In addition, the feature maps produced by these filters are robust to the noise introduced by appearance variations.
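The filter construction described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: patches hold raw intensities instead of HOG features, the k-means selection is treated as already done, the background weights are supplied by the caller, and cross-correlation (no kernel flip) stands in for the convolution. All function names are hypothetical.

```python
from typing import List

Patch = List[List[float]]  # a small 2-D array stored as nested lists


def subtract_mean(patch: Patch) -> Patch:
    """Remove the patch mean so that only local contrast remains."""
    total = sum(sum(row) for row in patch)
    mean = total / (len(patch) * len(patch[0]))
    return [[v - mean for v in row] for row in patch]


def weighted_average(patches: List[Patch], weights: List[float]) -> Patch:
    """Average same-sized patches; weights are normalized to sum to one."""
    norm = sum(weights)
    rows, cols = len(patches[0]), len(patches[0][0])
    return [[sum(w * p[r][c] for p, w in zip(patches, weights)) / norm
             for c in range(cols)] for r in range(rows)]


def difference(a: Patch, b: Patch) -> Patch:
    """Element-wise a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def build_filters(fg_patches: List[Patch],
                  bg_banks: List[List[Patch]],
                  bg_weights: List[float]) -> List[Patch]:
    """fg_patches: n selected target patches.
    bg_banks: s banks (one per background sample) of n patches each.
    Returns n filters: mean-subtracted foreground minus averaged background."""
    filters = []
    for i, fg in enumerate(fg_patches):
        bg = weighted_average([subtract_mean(bank[i]) for bank in bg_banks],
                              bg_weights)
        filters.append(difference(subtract_mean(fg), bg))
    return filters


def correlate_valid(image: Patch, kernel: Patch) -> Patch:
    """Slide the kernel over the image without padding (assumed stand-in
    for the convolution operator; the kernel is not flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]
```

Calling `correlate_valid(candidate, f)` for every filter `f` returned by `build_filters` yields one feature map per filter for that candidate.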

Global Color Features.
Local convolutional features based on HOG capture the texture of the image while ignoring color information. Color information has shown its advantages in environments with shade or strong color contrast. So we use global color histograms in the HSV color space as a complement to the local convolutional features.
According to the visual discrimination of human eyes, we quantize the three channels H, S, and V into discrete levels, with $H \in \{0, \ldots, 7\}$ and $S, V \in \{0, 1, 2\}$, and fuse them into a single value:
$$L = Q_S Q_V H + Q_V S + V. \tag{6}$$
Here, $Q_S = Q_V = 3$ are the quantization levels for the S and V channels. The maximum value is $L_{\max} = 7 \times 9 + 2 \times 3 + 2 = 71$. Then the global color histogram for a given image sample is defined as a vector $G = \{g(0), g(1), \ldots, g(71)\}$, and each element $g(k)$ is calculated by
$$g(k) = \frac{1}{N_w N_h} \sum_{w=1}^{N_w} \sum_{h=1}^{N_h} \delta\big(L(w, h) - k\big), \tag{7}$$
where $N_w \times N_h$ is the number of pixels in one sample, $\delta$ is the Kronecker delta function, and $L(w, h)$ is the fused value according to formula (6) at the location $(w, h)$.
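A minimal Python sketch of the 72-bin HSV histogram follows. The fused index uses the 8, 3, and 3 levels implied by the maximum value of 71. The uniform bin boundaries, the normalization by the pixel count, and all function names are assumptions made for illustration; the quantization thresholds used in the paper are not reproduced here.

```python
import colorsys
from typing import List, Sequence, Tuple

H_LEVELS, S_LEVELS, V_LEVELS = 8, 3, 3  # 8 * 3 * 3 = 72 bins in total


def quantize(value: float, levels: int) -> int:
    """Map a value in [0, 1] to a level in [0, levels - 1].
    Uniform bins are an assumption, not the paper's thresholds."""
    return min(int(value * levels), levels - 1)


def fused_value(r: float, g: float, b: float) -> int:
    """Fuse the quantized H, S, V of one RGB pixel (channels in [0, 1])
    into a single index L = 9 * H + 3 * S + V."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    hq = quantize(h, H_LEVELS)
    sq = quantize(s, S_LEVELS)
    vq = quantize(v, V_LEVELS)
    return hq * S_LEVELS * V_LEVELS + sq * V_LEVELS + vq


def color_histogram(pixels: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Histogram of fused values over all pixels of one sample,
    divided by the pixel count (normalization is assumed)."""
    hist = [0.0] * (H_LEVELS * S_LEVELS * V_LEVELS)
    for r, g, b in pixels:
        hist[fused_value(r, g, b)] += 1.0
    return [count / len(pixels) for count in hist]
```

Because the histogram is global, it discards pixel positions entirely, which is why it complements the position-sensitive convolutional maps rather than replacing them.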

Joint Appearance Model.
The convolutional feature maps and color histogram features encode the local structural texture information and the global color information of the target. Fusing them can therefore bring a better representation to handle appearance variations and background disturbance. Here, we concatenate these two features to form a 1-D feature vector $\mathbf{v} = [C, G]$, $\mathbf{v} \in \mathbb{R}^{1 \times K}$, where $C$ is the convolutional feature map, $G$ is the color histogram, and $K$ is the total number of features, just like the fully connected layer in a deep convolutional network. In addition, to make the features robust to the noise introduced by appearance variation, we normalize each feature value $v_k \in \mathbf{v}$ (the $k$th feature value) to obtain the final appearance template. Hence the complex patch features preserve the geometric layouts of the useful parts and suppress the confusing background in advance. The fused features give a full appearance description, which can improve the tracking robustness.
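The fusion step can be sketched in a few lines of Python. The concatenation follows the description above; the choice of L2 normalization is an assumption, since the normalization formula is not available here, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def joint_feature(conv_maps: Sequence[Sequence[float]],
                  color_hist: Sequence[float]) -> List[float]:
    """Flatten the convolutional maps, append the color histogram,
    and scale the result to unit L2 norm (assumed normalization)."""
    flat = [v for fmap in conv_maps for v in fmap]
    flat.extend(color_hist)
    norm = math.sqrt(sum(v * v for v in flat))
    if norm == 0.0:
        return flat  # an all-zero vector is returned unchanged
    return [v / norm for v in flat]
```

Each entry of `conv_maps` is expected to be one feature map already flattened to a 1-D sequence.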

Position Estimation.
Our tracking algorithm is formulated within an adaptive framework similar to particle filter tracking. Different from the current random sampling methods, our algorithm introduces the speed information of consecutive frames to predict the candidate samples during tracking. We assume that the $m$ target states $\{(x_t^i, y_t^i)\}_{i=1,\ldots,m}$ in the current frame are modeled by two Gaussian distributions, and the dynamic model is formulated in terms of the target position $(x_{t-1}, y_{t-1})$ in the previous frame and the Gaussian parameters $(\mu_x, \mu_y)$ and $(\sigma_x, \sigma_y)$, which are the mean and variance, respectively. They are determined by the motion information in the three previous frames, together with two constants $\alpha$ and $\beta$, which control the clustering degree of the candidate samples. Different from the traditional particle filter, the motion information gives the prediction of the candidates in the next frame. When there is partial or even full occlusion, the samples can get a more reasonable distribution.
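One way to picture motion-guided sampling is the following Python sketch. The exact formulas for the mean and variance are not reproduced here, so the sketch assumes a specific form: the mean is the average displacement over the last three positions, and the standard deviation grows linearly with the speed through two constants. The names `motion_statistics`, `sample_candidates`, `alpha`, and `beta` are illustrative only.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def motion_statistics(history: List[Point], alpha: float, beta: float):
    """Per-axis shift and spread from the last three target positions.
    Assumed form: shift = mean displacement, spread = alpha * |shift| + beta."""
    (x3, y3), (x2, y2), (x1, y1) = history[-3:]  # oldest to newest
    vx = ((x1 - x2) + (x2 - x3)) / 2.0
    vy = ((y1 - y2) + (y2 - y3)) / 2.0
    return (vx, vy), (alpha * abs(vx) + beta, alpha * abs(vy) + beta)


def sample_candidates(history: List[Point], m: int,
                      alpha: float = 1.0, beta: float = 2.0,
                      rng: Optional[random.Random] = None) -> List[Point]:
    """Draw m candidate positions from two independent Gaussians centered
    on the previous position shifted by the estimated motion."""
    rng = rng or random.Random()
    (mu_x, mu_y), (sd_x, sd_y) = motion_statistics(history, alpha, beta)
    x_prev, y_prev = history[-1]
    return [(rng.gauss(x_prev + mu_x, sd_x), rng.gauss(y_prev + mu_y, sd_y))
            for _ in range(m)]
```

With this form, a fast-moving target spreads the candidates more widely along its direction of travel, while a nearly static target keeps them tightly clustered.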
Then the optimal state $(x_t, y_t)$ is obtained by weighting all the predicted states:
$$(x_t, y_t) = \sum_{i=1}^{m} w_i \, (x_t^i, y_t^i).$$
The weights $w_i$ $(i = 1, \ldots, m)$ play a key role in robust tracking. They are produced by the observation model $p_i$ $(i = 1, \ldots, m)$. Here, $d_i$ gives the distance between the appearance representation of the $i$th candidate sample and the target template $T_{t-1}$ at frame $t-1$. We expect that the smaller the difference is, the bigger the contribution is. So the observation model $p_i$ in this work is defined as a decreasing function of $d_i$ that includes a minor positive constant $\varepsilon$. With the proposed dynamic and observation models, the algorithm can keep tracking the target even if there is full occlusion and fast motion in the scene. In the David sequence, the target walks behind a tree and appears again. If depending only on the convolutional features, the tracking fails. The color information and motion information help to predict the target well (see Figure 2). In Figure 3, the target keeps rotating and moving fast; a visual tracker without color and motion cues loses the target.
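The weighting step can be sketched as follows. The sketch assumes a Euclidean distance between feature vectors and an inverse-distance observation model, $p_i = 1/(d_i + \varepsilon)$, normalized so that the weights sum to one; this is one plausible reading of "the smaller the difference, the bigger the contribution" and is not confirmed by the text. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def distance(feature: Sequence[float], template: Sequence[float]) -> float:
    """Euclidean distance between a candidate feature and the template."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(feature, template)))


def observation_weights(distances: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Inverse-distance scores (assumed form), normalized to sum to one.
    eps keeps the score finite when a distance is exactly zero."""
    scores = [1.0 / (d + eps) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


def estimate_state(candidates: Sequence[Tuple[float, float]],
                   weights: Sequence[float]) -> Tuple[float, float]:
    """Weighted average of the candidate positions."""
    x = sum(w * cx for w, (cx, _) in zip(weights, candidates))
    y = sum(w * cy for w, (_, cy) in zip(weights, candidates))
    return x, y
```

Because every candidate contributes in proportion to its similarity, no resampling step is needed after the estimate is formed.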

Scale Estimation.
Rotation or scale variation tends to appear when the object moves. To avoid involving extra background information or losing part of the foreground, the tracking scale needs to be adjusted adaptively according to the tendency of scale change [21]. Supposing $s_{t-1}$, $s_{t-2}$, $s_{t-3}$ are the sizes of the three former tracking results, we define a scale variation factor $\gamma$ from them. If $\gamma > 0$, the scale is regarded as being enlarged. We define an amplification pool $A = \{a_1, \ldots, a_k\}$ for candidate scales. Then the new candidate scales $S = \{s_j \mid j = 1, 2, \ldots, k\}$ in the current frame are calculated from this pool. To find the most proper scale, we calculate the distance $d_j$ according to formula (14) for each candidate with the size $s_j$. The final target scale is chosen by
$$s_t = \arg\min_{s_j \in S} d_j.$$
Otherwise, the scale will be reduced. Similar to the scale amplification, we also set a reduction pool $R = \{r_1, \ldots, r_k\}$ and choose the scale whose patch features are the closest to the model template.
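The scale search can be illustrated by the short Python sketch below. Several details are assumptions: the trend is taken as the average relative size change over the last three results, the candidate sizes are the previous size multiplied by each pool entry, and the pool values themselves are made up for the example. The function names are hypothetical.

```python
from typing import Callable, Sequence


def scale_trend(s1: float, s2: float, s3: float) -> float:
    """Trend from sizes at frames t-1, t-2, t-3 (assumed form):
    mean relative change between consecutive results."""
    return ((s1 - s2) / s2 + (s2 - s3) / s3) / 2.0


def select_scale(prev_size: float, trend: float,
                 distance_fn: Callable[[float], float],
                 grow_pool: Sequence[float] = (1.0, 1.05, 1.1),
                 shrink_pool: Sequence[float] = (1.0, 0.95, 0.9)) -> float:
    """Pick the pool by the sign of the trend, then return the candidate
    size whose features are closest to the template. distance_fn maps a
    candidate size to its template distance; pool values are illustrative."""
    pool = grow_pool if trend > 0 else shrink_pool
    candidates = [prev_size * factor for factor in pool]
    return min(candidates, key=distance_fn)
```

Searching only in the direction suggested by the trend halves the number of candidate scales that must be scored in each frame.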

Model Updating.
Model updating is an important step in visual tracking. During tracking, the object appearance often changes with scale, motion, rotation, or posture. Therefore, the appearance model needs to be updated over time to accommodate these changes for robust visual tracking. But overupdating easily results in drift and brings extra computation. Motivated by these intuitions, we consider that the appearance model need not be updated when there is little change or large occlusion. For these two situations, we predefine two thresholds $th_1$ and $th_2$. Supposing $D = \|\mathbf{v}_{t-1} - T_{t-1}\|_2^1$ denotes the difference between the current tracked target features $\mathbf{v}_{t-1}$ and the target model $T_{t-1}$, then if $D < th_1$ or $D > th_2$, the appearance template will not be updated. Otherwise, the model gets a new appearance description $T_t$ by combining the former template and the current target features with a weighting parameter $\lambda$, which is set to 0.95. With the incremental update scheme, the appearance template is able not only to largely maintain the former appearance but also to adapt to the target variations. Meanwhile, the thresholds and the weighting parameter effectively control the updating degree. In this way, our method can alleviate the drift problem. Due to illumination variation and target movement, the background keeps changing. To get correct appearance features for the predicted samples, we regenerate new background samples around the current target location $(x_t, y_t)$. The new convolution filters are recalculated according to formulas (2) and (3).
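A minimal Python sketch of the gated update is given below. It assumes the common linear interpolation form, in which the old template is weighted by 0.95 and the new features by 0.05, and a Euclidean difference measure; neither is confirmed by the text, and the function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def update_template(template: Sequence[float], feature: Sequence[float],
                    th_low: float, th_high: float,
                    lam: float = 0.95) -> List[float]:
    """Return the template for the next frame.
    The template is kept unchanged when the difference is below th_low
    (little change) or above th_high (likely occlusion); otherwise it is
    blended as lam * old + (1 - lam) * new (assumed update form)."""
    diff = math.sqrt(sum((f - t) ** 2 for f, t in zip(feature, template)))
    if diff < th_low or diff > th_high:
        return list(template)
    return [lam * t + (1.0 - lam) * f for t, f in zip(template, feature)]
```

Skipping the update in both extreme cases saves computation when nothing has changed and prevents an occluder from being absorbed into the template.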

Implementation Details.
The proposed algorithm is tested on the OTB 100 dataset. The size and location of the target in the first frame are given by the ground-truth. In each frame, candidate samples are resized into 32 × 32 and each patch is 6 × 6. During online tracking, the number of the local filters is set to 80 and the adaptive Gaussian mixture filter produces 300 candidate samples. The constants

Quantitative Comparisons.
For quantitative evaluation, we compare our algorithm with 9 current trackers by reporting the results of one-pass evaluation (OPE) based on precision plots and success plots. The precision plots show the percentage of frames whose estimated location is within a predefined threshold distance of the ground truth, with the threshold varying from 0 to 50 pixels. The success metric computes the intersection over union (IoU) and counts the number of successful frames whose IoU is larger than a given threshold (varying from 0 to 1).
First of all, to highlight the contribution of local convolutional features, global handcrafted features, and motion information, we compare our proposed algorithm with two simplified versions. They depend only on convolutional features (CNT) or on the fused convolutional and color features (CNT-color). Five visual sequences (basketball, david3, human8, lemming, and biker) are tested with initialization from the ground-truth position in the first frame. These sequences include different challenging attributes such as fast motion, full occlusion, rotation, and illumination variation. The OPE curves in Figure 4 clearly show that our algorithm outperforms the other two versions by 32%-53% in success plots and 56%-60% in precision plots.
Then, to gain more insight into the proposed method, the proposed algorithm is compared with nine state-of-the-art visual trackers: CXT, IVT, CT, L1APG, CNN, MIT, CNT-color, CNT, and SCM. We use the publicly available source codes provided by the authors themselves with the same initialization and parameter settings to generate the comparative results. Success plots and precision plots for five different attributes are illustrated in Figure 5: fast motion, occlusion, illumination variation, scale variation, and deformation. In the case of fast motion, the average precision plots and success plots show that our proposed method outperforms the other trackers by more than 40% and over 8%, respectively. Meanwhile, for occlusion sequences, our method outperforms the CNN tracker, which depends only on deep convolutional features, by a wide margin of more than 85% in terms of precision OPE and over 78% in success OPE. Though our method ranks only second, it still maintains the best performance when the threshold is less than 0.5, as shown in the success curve. Under the challenging conditions of illumination variation and fast motion, our method gets the best results for both success and precision evaluations. This is largely due to the fact that motion information allows the object direction to be predicted well. Though several other trackers perform better in sequences with scale variation, the average statistics show that our method achieves competitive results (more than 39% and more than 4% in terms of precision rate and success rate, respectively) because of the fusion of local and global features.
The sequence in Figure 6(a) exemplifies the robustness of the proposed algorithm to complete and partial occlusion. Only CNT-color and our algorithm are capable of tracking the target during the entire sequence. Other trackers experience drift at different instances: CXT at frame 15, MIT, L1APG, SCM, CT at frame 90, CNN at frame 161, and MIT at frame 243 because of partial occlusion.
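The two OPE metrics described at the start of this section can be computed per threshold as in the following Python sketch. Boxes are assumed to be given as (x, y, width, height) with the top-left corner as origin; the function names are hypothetical and this is not the benchmark toolkit's code.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between the two box centers, in pixels."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of frames whose center error is within the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if center_error(p, g) <= threshold)
    return hits / len(gts)


def success(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of frames whose IoU is larger than the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) > threshold)
    return hits / len(gts)
```

Sweeping `threshold` from 0 to 50 pixels for `precision` and from 0 to 1 for `success` produces the two curves of a precision plot and a success plot.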

Qualitative Comparisons.
The Lemming sequence in Figure 6(b) includes full occlusion, fast motion, and scale variation. The other methods drift from the moving object, while the proposed algorithm keeps tracking accurately because it exploits the motion tendency well.
In the Human8 sequence, the tracked object is subject to changes in illumination, and the background color is sometimes similar to the color of the man's backpack. CT, MIT, SCM, CXT, IVT, and CNN drift from the target continuously from frame 12 because of the similar color between the shade and the target. Meanwhile, L1APG and CNT cannot deal with the persistent change in illuminance and the disturbance of the background; they can only keep tracking until frame 55. Our method successfully tracks the object through the whole sequence.
The Skiing sequence includes in-plane rotation and fast motion. Tracking results at frames {9, 13, 15, 38, 41} for all 10 methods are shown. The different tracking methods are color-coded. The CXT tracker starts to drift from the target at frame 7 and finally loses it. From frame 9, the other trackers except our algorithm and CNT-color begin to drift, and they fail completely at frame 15. When the target jumps quickly and rotates in the air, only our method tracks the target successfully for almost the whole span from frame 1 to frame 41.
In the Woman sequence, there are partial occlusion, motion blur, and rotation. The CNN, CT, CXT, MIT, IVT, and L1APG trackers lose the target when the occlusion occurs because of the background disturbance with similar color (see frames 399 and 479). After the target passes the cars, fast walking brings a slight blur effect at frame 567. At this moment, only our method can maintain good tracking. After that, the target turns back several times, and our method succeeds in tracking her with the help of the fused features and motion information.
The CarScale sequence shows a car that keeps changing scale and moving fast. When it passes a tree from frame 166, CXT, IVT, CT, MIT, and CNN begin to lose the target because of the occlusion and fail at frame 169. After that, the scale changes continuously with the quick motion. Our method can not only keep tracking but also adjust the tracking speed to the moving car.

Conclusion
A novel visual tracking algorithm is proposed based on a simple online learning network. The fusion of global color features and local convolutional features gives robust tracking against shade and the presence of confusing colors in the background. Meanwhile, the speed information directs the propagation of the particles and improves the adaptivity of PF. When there is total occlusion or quick motion, the proposed tracker can maintain robust tracking. The proposed algorithm achieves substantial performance gain over the existing state-of-the-art trackers.

Data Availability
(1) All the source images are from the TB-100 dataset, which is publicly available online at http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html. (2) The other results, including quantitative and qualitative comparisons, are available from the authors by email at crystalhanlei@163.com.

Conflicts of Interest
The authors declare that they have no conflict of interest.