Online Learning of a Discriminative Dictionary with Label Information for Robust Object Tracking

A supervised approach to online learning of a structured, sparse, and discriminative representation for object tracking is presented. Label information from the training data is incorporated into the dictionary learning process to construct a robust and discriminative dictionary. This is accomplished by adding an ideal-code regularization term and a classification error term to the total objective function. By minimizing the total objective function, we jointly learn a high-quality dictionary and an optimal linear multiclass classifier using the iteratively reweighted least squares (IRLS) algorithm. Combined with robust sparse coding, the learned classifier is employed directly to separate the object from the background. As tracking proceeds, the proposed algorithm alternates between robust sparse coding and dictionary updating. Experimental evaluations on challenging sequences show that the proposed algorithm performs favorably against state-of-the-art methods in terms of effectiveness, accuracy, and robustness.


Introduction
Given the initialized position and size of a target in the first frame (or former frames) of a video, the goal of visual tracking is to estimate the states of the moving target in the subsequent frames. This active topic has been extensively studied in computer vision due to its important role in many applications such as automated surveillance, robot navigation, video indexing, traffic monitoring, and human-computer interaction. Despite the fact that much progress has been made in recent years [1][2][3][4][5], developing a robust tracking algorithm is still a challenging problem due to numerous factors: illumination changes, partial or full occlusions, dynamic appearance changes, scaling, abrupt motion, background clutter, pose variation, and shape deformation.
Inspired by the success of sparse representation-based face recognition [6], Mei and Ling [7] propose a novel L1 tracker that uses a series of target templates and trivial ones to model the tracked target under sparsity constraints. In detail, the target templates describe the tracked object, while the trivial templates account for outliers (such as occlusion). This representation scheme is robust to a wide range of image corruptions, especially moderate occlusions. Building on this milestone work, a number of extensions [6][7][8][9][10][11][12][13][14][15][16][17][18][19][20] have been developed to improve the L1 tracker in terms of both speed and accuracy. However, sparse representation-based approaches have some drawbacks. First, previous tracking algorithms construct their dictionaries naively: they directly use samples drawn from the tracked target region and its background as dictionary atoms, without any selection. This makes the dictionary redundant and ignores the discriminative and structured information in the initial training data. Second, some methods use either a static dictionary during tracking [10] or heuristic dictionary updates. Finally, many sparse coding-based trackers [6][7][8][9][10][11][12][13][14][15] minimize the reconstruction error under the L2 norm to increase representative power but ignore the discriminative ability of the learned dictionary. An L2-norm data fitting term is not robust, can be vulnerable to large outliers, and makes tracking unstable.
In this paper, we formulate object tracking in a particle filter framework as a binary classification problem. The a priori information in the training data is exploited effectively to online-learn a discriminative and reconstructive dictionary. Specifically, the class label information is incorporated into the dictionary learning process through a classification error term and an ideal-code regularization term. Combined with the traditional reconstruction error, a total objective function for dictionary learning is constructed with an L1 data fitting term. By minimizing this objective function, we obtain a high-quality dictionary and an optimal linear classifier jointly using the iteratively reweighted least squares (IRLS) algorithm. With the help of robust sparse coding, the resulting classifier can separate the tracked object from the background effectively.
The main contributions of this paper are the following.
(1) The a priori information in the training samples is exploited to construct a compact and discriminative dictionary. The learned dictionary preserves the structural information of the training samples and encourages samples from the same class to have similar representations, which is a critical factor for sparse representation-based object trackers. (2) A high-quality dictionary and an optimal linear classifier are learned jointly. All the training samples from the object and the background are involved in the dictionary learning process simultaneously. (3) Many existing sparse coding-based trackers do not use a robust function in the data fitting term and can be vulnerable to large outliers. An L1-norm fitting function is incorporated into the data term to overcome this problem and make tracking reliable.
The paper is organized as follows. In Section 2, we summarize the works most related to ours. Section 3 presents the L1 tracker and dictionary learning as the background to facilitate the introduction of our proposed model in the next section. The detailed description of the proposed tracking approach is presented in Section 4. Section 5 gives the detailed experiment setup and results. Finally, Section 6 concludes the paper.

Related Work
Much work has been done in object tracking. In this section, we briefly review representative tracking methods and those most related to our tracker. We focus specifically on tracking methods that use particle filters, sparse representation, and general multitask learning. For a more thorough survey of tracking methods, we refer the reader to [1][2][3][4][5].
Existing tracking algorithms can be roughly categorized as either generative or discriminative.
2.1. The Generative Trackers. The generative methods represent the target with an appearance model and formulate tracking as a search for the regions most similar to the tracked target. These methods are based on either templates or subspace models. Popular generative trackers include the eigentracker [21], the mean shift tracker [22], the fragment-based tracker [23], the incremental tracker (IVT) [24], and the VTD tracker [25]. Black and Jepson [21] learn a subspace model offline to represent the target at predefined views and build on the optical flow framework for tracking. The mean shift tracker [22] is a popular mode-finding method, which successfully copes with camera motion, partial occlusions, clutter, and target scale variations. The fragment tracker [23] aims to handle partial occlusion with a representation based on histograms of local patches; the tracking task is carried out by accumulating votes from matching local patches using a template. However, this template is static and cannot handle changes in object appearance. Ross et al. [24] learn an adaptive linear subspace online for modeling target appearance and implement tracking with a particle filter, but IVT is less effective in handling heavy occlusion or nonrigid distortion. Kwon and Lee [25] extend the classic particle filter framework with multiple dynamic observation models to account for appearance and motion variation. Nevertheless, due to the adopted generative representation scheme, this tracker is not equipped to distinguish between the target and its local background.

2.2. Discriminative Trackers. Discriminative methods cast tracking as a classification problem that distinguishes the tracked target from its surrounding background. The trained classifier is updated online during tracking. Discriminative tracking algorithms use information from both the target and the background. Examples of discriminative methods are ensemble tracking [26], online boosting (OAB) [27], semi-supervised online boosting [28], online multiple instance learning tracking [29], adaptive metric differential tracking [30], TLD [31], and CT [32].
In ensemble tracking [26], a set of weak classifiers is trained and combined to distinguish the target object from the background. The features used in [26] may contain redundant and irrelevant information, which affects classification performance, so feature selection is needed. Collins et al. [33] demonstrate that online selection of discriminative features can greatly improve tracking performance. Inspired by advances in face detection [34], many boosting-based feature selection methods have been proposed. Grabner et al. [27] propose an online boosting algorithm to select features for tracking. However, these trackers [27, 33] use only one positive sample (i.e., the current tracker location) and a few negative samples when updating the classifier. As the appearance model is updated with noisy and potentially misaligned examples, this often leads to tracking drift. To better handle drift, Grabner et al. [28] propose an online semi-supervised tracker that labels only the samples in the first frame, leaving the samples in subsequent frames unlabeled. However, this semi-supervised approach discards some information that is very helpful in this problem domain. Babenko et al. [29] introduce multiple instance learning (MIL) into online tracking, where samples are considered within positive and negative bags or sets. Within the MIL framework, several tracking algorithms have been developed [30][35][36][37][38] to handle the location ambiguity of positive samples or to actively select discriminative features. Besides, Kalal et al. [31] propose the PN learning algorithm, which exploits the underlying structure of positive and negative samples to learn effective classifiers for object tracking.
Recently, an efficient tracking algorithm [32] based on compressive sensing theory [39] has been proposed, which demonstrates that low-dimensional features randomly extracted from a high-dimensional multiscale image feature space preserve the discriminative capability, thereby facilitating object tracking.

2.3. Sparse Representation for Object Tracking. Sparse representation has been successfully applied to visual tracking [6]. The best candidate is found by minimizing the reconstruction error with respect to the target templates and trivial ones. Most of these tracking algorithms operate in the particle filter framework. For each particle, its representation is computed independently by solving a constrained L1 minimization problem with nonnegativity constraints, so hundreds of L1-norm minimization problems must be solved per frame during tracking. Moreover, the L1 solver used in [7, 8] is based on the interior point method, which turns out to be too slow for tracking. A minimal error bounding strategy is introduced in [8] to reduce the number of particles, and hence the number of L1-norm minimizations to solve; a speed-up of four to five times is reported. In [9], an APG-based solver is used to improve the L1 tracker. Liu et al. [10] integrate dynamic group sparsity into the tracking problem and use high-dimensional image features to improve robustness. Liu et al. [11] also develop a tracking algorithm based on a local sparse model, which employs histograms of sparse coefficients and the mean shift algorithm for object tracking. However, this method relies on a static local sparse dictionary and may fail when a similar object appears in the scene. Li et al. [14] adopt dimensionality reduction and a customized orthogonal matching pursuit algorithm to accelerate the L1 tracker. In [15], the authors propose a robust object tracking algorithm using a collaborative model that combines a sparsity-based discriminative classifier (SDC) and a sparsity-based generative model (SGM); however, its naive model-update strategy and similarity measure limit the performance of the tracker.
In [16], the authors develop a simple yet robust tracking method based on a structural local sparse appearance model, which exploits both partial and spatial information of the target through a novel alignment-pooling method. Zhang et al. [17] adopt low-rank sparse learning to capture the correlations among particles for robust tracking. Inspired by these works, they develop the multitask tracking (MTT) algorithm [18]. However, the dictionary still includes the trivial templates, which degrade the efficiency and effectiveness of the tracker.

Background
In this section, we briefly introduce the L1 tracker and dictionary learning to facilitate the presentation of our model in the next section.
3.1. L1 Tracker. The L1 tracker and most of its extensions operate in the particle filter framework. The best candidate is found by minimizing the reconstruction error with respect to the target templates and trivial ones. In each frame, the L1 tracker first generates candidate particles around the current tracking result. For each particle, its representation is computed independently by solving a constrained L1 minimization problem with nonnegativity constraints. To adapt to appearance changes of the object, the templates are updated according to both the weights assigned to each template and the similarity between the templates and the current estimate of the target.
The L1 tracker can be viewed as a sparse coding process with a given dictionary (object templates and trivial ones). However, the L1 tracker and its extensions ignore dictionary quality; they adopt a simple construction strategy, taking the entire positive (or negative) training set as the dictionary. Sparse coding with such a large dictionary is computationally expensive.
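As a concrete illustration of this coding step, the sketch below (hypothetical names; numpy; a plain projected ISTA solver in place of the interior-point or APG solvers discussed in Section 2) represents one candidate with target templates plus positive and negative trivial templates under a nonnegative sparsity constraint:

```python
import numpy as np

def ista_nonneg_lasso(B, x, lam=0.01, n_iter=1000):
    """Solve min_c ||x - B c||_2^2 + lam * sum(c)  s.t.  c >= 0
    by projected iterative soft-thresholding (ISTA)."""
    L = 2.0 * np.linalg.norm(B, 2) ** 2        # Lipschitz constant of the gradient
    t = 1.0 / L                                # step size
    c = np.zeros(B.shape[1])
    for _ in range(n_iter):
        v = c - t * 2.0 * (B.T @ (B @ c - x))  # gradient step on the data term
        c = np.maximum(v - t * lam, 0.0)       # shrink, then project onto c >= 0
    return c

# Toy setup: 3 target templates and positive/negative trivial templates.
rng = np.random.default_rng(0)
d = 16
T = rng.random((d, 3))                         # target templates (columns)
B = np.hstack([T, np.eye(d), -np.eye(d)])      # dictionary [T, I, -I]
x = 0.7 * T[:, 0] + 0.3 * T[:, 1]              # candidate explained by the targets
c = ista_nonneg_lasso(B, x)
err = np.linalg.norm(x - B @ c)                # reconstruction error of the candidate
```

A candidate whose coefficients concentrate on the trivial templates is poorly explained by the target templates, which is exactly the cue the L1 tracker uses to rank particles.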

3.2. Dictionary Learning. The goal of dictionary learning is to find an optimized dictionary that provides sparse representations for the input signals. Let X = [x₁, x₂, …, xₙ] ∈ R^(d×n) be a set of input signals, where xᵢ denotes the ith input signal with a d-dimensional feature description. A reconstructive dictionary of size K for sparse representation can be obtained by solving the following minimization problem:

min_{D,A} ‖X − DA‖₂²  s.t.  ‖aᵢ‖₀ ≤ T, i = 1, …, n,  (2)

where D = [d₁, d₂, …, d_K] ∈ R^(d×K) is the learned dictionary, A = [a₁, a₂, …, aₙ] ∈ R^(K×n) contains the sparse codes of the input signals, and T bounds the number of nonzero entries per code. In general, the number of training samples is much larger than the size of the dictionary (n ≫ K), and each xᵢ uses only a few dictionary atoms in its representation under the sparsity constraint. The objective function is usually optimized in a two-stage manner, alternately optimizing with respect to D (bases) and A (coefficients) while holding the other fixed. Each stage is convex in D (with A fixed) and in A (with D fixed), but the problem is not jointly convex. The objective in (2) focuses only on minimizing the reconstruction error and does not consider the discriminative power of the dictionary. Hence, supervised approaches [40][41][42][43][44][45][46][47] have been proposed to improve the discriminative power of the dictionary by integrating category label information into the objective function of dictionary learning.
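The two-stage alternation can be sketched as follows (a minimal illustration with hypothetical names: an L1 penalty stands in for the L0 constraint, and the dictionary stage is a ridge-regularized least-squares fit with column renormalization rather than K-SVD):

```python
import numpy as np

def soft(V, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def learn_dictionary(X, K=8, lam=0.1, n_outer=15, n_ista=50):
    """Alternate a sparse-coding stage (D fixed) with a dictionary stage (A fixed)."""
    rng = np.random.default_rng(1)
    d, n = X.shape
    D = rng.standard_normal((d, K))
    D /= np.linalg.norm(D, axis=0)                     # unit-norm atoms
    A = np.zeros((K, n))
    for _ in range(n_outer):
        # Coefficient stage: min_A ||X - D A||_F^2 + lam ||A||_1 via ISTA.
        L = 2.0 * np.linalg.norm(D, 2) ** 2
        for _ in range(n_ista):
            A = soft(A - (2.0 / L) * (D.T @ (D @ A - X)), lam / L)
        # Dictionary stage: regularized least squares, then renormalize columns.
        D = X @ A.T @ np.linalg.inv(A @ A.T + 1e-6 * np.eye(K))
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    # Final coding pass so that A matches the renormalized D.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    for _ in range(n_ista):
        A = soft(A - (2.0 / L) * (D.T @ (D @ A - X)), lam / L)
    return D, A

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 100))                     # toy input signals
D, A = learn_dictionary(X)
recon = np.linalg.norm(X - D @ A) / np.linalg.norm(X)  # relative reconstruction error
```

Neither stage increases the objective, so the alternation converges to a local minimum of the (jointly nonconvex) problem.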

The Proposed Tracker
Many existing online dictionary learning methods do not use a robust function in the data fitting term and can be vulnerable to large outliers. In robust statistics, L1 fitting functions are known to make estimation reliable. During object tracking, challenging factors such as occlusion, illumination changes, abrupt motion, and background clutter typically act as outliers. If an L2-norm data fitting term is adopted in a sparse representation-based tracker, drift accumulates and results in tracking failure. L1 fitting functions overcome this problem and make tracking reliable. Inspired by the work in [40][41][42][43][44][45][46][47], this section presents an approach to online-learn a structured, sparse, and discriminative representation for object tracking.

4.1. The Total Objective Function. We construct a robust objective function for online dictionary learning and optimal classifier training. Concretely, the total objective function of the proposed tracker is defined as

min_{D,W,G,A} ‖X − DA‖₁ + λ₁‖H − WA‖₁ + λ₂‖Q − GA‖₂² + λ₃‖A‖₁,  (3)

where X denotes the training samples, D the dictionary, A the sparse codes, W the linear classifier parameters, H the class labels of X, Q the ideal codes, and G a linear transformation; the parameters λ₁, λ₂, λ₃ control the relative weights of the classification error term, the ideal-code regularization term, and the L1-norm regularization term against the reconstruction error term.
(1) Reconstruction Error Term ‖X − DA‖₁. This data fitting term is robust compared with the L2 norm and can handle outliers such as partial occlusion and background clutter in the training data. The reconstruction errors of all the particles with respect to the learned dictionary atoms are computed at the same time.

(2) Ideal Structured Regularization Term ‖Q − GA‖₂². This term encourages discriminative sparse codes: Q = [q₁, q₂, …, qₙ] ∈ R^(K×n) is the ideal code matrix, where the nonzero entries of qᵢ indicate the dictionary atoms sharing the class of sample xᵢ, and G is a linear transformation mapping the sparse codes A toward these ideal codes. It forces samples from the same class to have similar representations.

(3) Classification Error Term ‖H − WA‖₁. This term measures the classification error and supports learning an optimal classifier. For the object tracking task, we define two classes: tracked object and background. A simple linear classifier f(a; W) = Wa is adopted, where W denotes the classifier parameters. H = [h₁, h₂, …, hₙ] ∈ R^(2×n) contains the class labels of the training data X; hᵢ = [1, 0]ᵀ is the label vector of an object sample xᵢ, with the nonzero position indicating the class of xᵢ.

(4) L1-Norm Regularization Term ‖A‖₁. Adding this sparseness criterion to the objective function lets us learn a sparse and structured representation with the learned high-quality dictionary D. The proposed tracker operates in the particle filter framework: the candidate particles are densely sampled around the current tracking result, so their representations should be sparse and similar with respect to the given dictionary D. In other words, a few atoms of D should suffice to represent all the particles.
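The label matrix H described above can be assembled directly from the sample labels; a trivial sketch (the function name is ours):

```python
import numpy as np

def build_label_matrix(labels):
    """H in R^{2 x n}: column i is [1, 0]^T for an object sample (label 1)
    and [0, 1]^T for a background sample (label 0)."""
    H = np.zeros((2, len(labels)))
    for i, y in enumerate(labels):
        H[0 if y == 1 else 1, i] = 1.0
    return H

# Three object samples and two background samples.
H = build_label_matrix([1, 1, 0, 1, 0])
```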

4.2. Optimization Procedure. To solve the optimization problem in (3), we rewrite the proposed objective function in a compact stacked form, yielding the dictionary subproblem (4) and the robust sparse coding subproblem (5).
Given the initial dictionary D₀, the robust sparse codes A* can be obtained from (5). With D₀ and A* fixed, (4) reduces to an L1 regression problem, which the IRLS algorithm solves row by row:

D₁(i, :) = arg min Σⱼ wⱼ (x_{i,j} − D₁(i, :)aⱼ)²,  (6)

where x_{i,j} denotes the ith entry of the jth stacked training sample, wⱼ = 1/√((x_{i,j} − D₁(i, :)aⱼ)² + ε), and ε is a small positive value. Taking derivatives of (6) and setting them to zero, the global optimum is reached by solving for D₁(i, :) in the linear system D₁(i, :)M = c, with M = Σⱼ wⱼ aⱼaⱼᵀ and c = Σⱼ wⱼ x_{i,j}aⱼᵀ. As tracking continues, the dictionary is updated with the incoming data; the online versions of M and c are

Mₜ = γ M_{t−1} + Σⱼ wⱼ aⱼaⱼᵀ,  (9)
cₜ = γ c_{t−1} + Σⱼ wⱼ x_{i,j}aⱼᵀ,  (10)

where M_{t−1} and c_{t−1} accumulate the former data, the second terms in both (9) and (10) come from the newly arrived data, and γ is a forgetting factor.
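The reweighting at the heart of IRLS — each residual rⱼ gets weight 1/√(rⱼ² + ε), after which a weighted least-squares problem is solved — can be sketched for a generic L1 regression (hypothetical names; the gross outlier illustrates why the L1 fit is preferred over L2):

```python
import numpy as np

def irls_l1(Phi, y, n_iter=50, eps=1e-6):
    """Minimize ||y - Phi w||_1 by iteratively reweighted least squares."""
    w = np.linalg.lstsq(Phi, y, rcond=None)[0]        # L2 fit as the initial guess
    for _ in range(n_iter):
        r = y - Phi @ w
        weights = 1.0 / np.sqrt(r ** 2 + eps)         # IRLS weights
        WPhi = Phi * weights[:, None]                 # rows scaled by the weights
        # Weighted normal equations: (Phi^T W Phi) w = Phi^T W y.
        w = np.linalg.solve(Phi.T @ WPhi + 1e-10 * np.eye(Phi.shape[1]),
                            WPhi.T @ y)
    return w

rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(50), np.linspace(0.0, 1.0, 50)])
y = 2.0 + 3.0 * Phi[:, 1] + 0.01 * rng.standard_normal(50)
y[10] += 100.0                                        # one gross outlier
w_l1 = irls_l1(Phi, y)                                # robust fit: close to (2, 3)
w_l2 = np.linalg.lstsq(Phi, y, rcond=None)[0]         # L2 fit: badly skewed
```

The outlier's weight collapses toward zero as its residual grows, so the L1 fit is dominated by the inliers, while the plain least-squares fit is pulled far off.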
We have thus learned the augmented dictionary D₁ = [Dᵀ, √λ₁ Wᵀ]ᵀ, which contains both the dictionary D and the classifier W. For all the particles, we first compute their robust sparse codes A* from (5) and then obtain the classification scores of the particles from the optimal classifier W. Tracking is completed by selecting the particle with the maximal object score, where aᵢ is the sparse code of the ith particle with respect to the learned dictionary D and the sparse codes of all the particles form the matrix A.
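With the sparse codes of all particles collected into a matrix, scoring them with the learned linear classifier reduces to a single matrix product; a minimal sketch with random stand-in data (all names hypothetical, and the margin-based confidence is our assumption — the paper only states that the observation likelihood is proportional to the classifier score):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_particles = 20, 100
A = rng.random((K, n_particles))        # sparse codes of all candidate particles
W = rng.standard_normal((2, K))         # learned linear classifier (2 classes)

scores = W @ A                          # row 0: object scores, row 1: background
confidence = scores[0] - scores[1]      # margin of "object" over "background"
best = int(np.argmax(confidence))       # index of the particle chosen as result
```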

4.3. Tracking Algorithm Details. For initialization, we manually choose the foreground object with a bounding box and then shift it by a few pixels to generate the positive samples. Besides, we shift the bounding box far away from the object location to generate the negative samples, which do not overlap with the positive samples. The K-SVD algorithm is executed on the positive and negative samples separately to learn the initial dictionary. The proposed tracking algorithm operates in the particle filter framework, which recursively approximates the posterior distribution using a finite set of weighted samples and consists of two steps: prediction and update. At frame t, let the affine parameters sₜ = (xₜ, yₜ, σₜ, αₜ, θₜ, φₜ) represent the target state, where xₜ and yₜ are the 2D coordinates, σₜ and αₜ are the scale and aspect ratio, θₜ is the rotation angle, and φₜ is the skew. z_{1:t−1} = {z₁, z₂, …, z_{t−1}} denotes the observations of the target from the first frame to frame t − 1. Particle filter tracking estimates and propagates the posterior probability by recursively performing prediction,

p(sₜ | z_{1:t−1}) = ∫ p(sₜ | s_{t−1}) p(s_{t−1} | z_{1:t−1}) ds_{t−1},  (12)

and updating,

p(sₜ | z_{1:t}) ∝ p(zₜ | sₜ) p(sₜ | z_{1:t−1}).  (13)

The optimal state for frame t is obtained as the maximum of the approximate posterior probability:

sₜ* = arg max p(sₜ | z_{1:t}).  (14)

This inference is governed by the motion model p(sₜ | s_{t−1}), which describes the temporal correlation of the tracking results in consecutive frames and is modeled as a Gaussian distribution with the dimensions of sₜ assumed independent. The observation model p(zₜ | sₜ) reflects the similarity between a target candidate and the dictionary templates. In this paper, p(zₜ | sₜ) is proportional to the classifier score.
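The prediction and update steps with the independent-Gaussian motion model can be sketched as follows (hypothetical state layout and noise levels; the stand-in likelihood plays the role of the classifier score that the observation model is proportional to):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 300
# Affine state per particle: (x, y, scale, aspect, rotation, skew).
state0 = np.array([120.0, 80.0, 1.0, 1.0, 0.0, 0.0])
particles = np.tile(state0, (N, 1))
sigma = np.array([4.0, 4.0, 0.01, 0.005, 0.01, 0.001])   # per-dimension std devs

def predict(particles):
    """Motion model p(s_t | s_{t-1}): independent Gaussian per dimension."""
    return particles + rng.standard_normal(particles.shape) * sigma

def update_and_resample(particles, likelihood):
    """Weight particles by the observation model, resample, keep the MAP state."""
    w = likelihood / likelihood.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], particles[np.argmax(w)]

particles = predict(particles)
# Stand-in likelihood; in the tracker this comes from the classifier score.
lik = np.exp(-0.01 * np.sum((particles[:, :2] - state0[:2]) ** 2, axis=1))
particles, s_map = update_and_resample(particles, lik)
```

Resampling in proportion to the weights concentrates particles in high-likelihood regions, which is what keeps the finite sample set an adequate approximation of the posterior.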

Experiments
In this section, we make a thorough comparison on challenging image sequences between our proposed tracker and state-of-the-art tracking methods. Our tracker is evaluated on 8 challenging tracking sequences (e.g., the car11, cliffbar, and woman sequences) that are publicly available online. Table 1 lists all the evaluated image sequences; these videos were recorded in indoor and outdoor environments and include the aforementioned challenging factors in visual tracking. We evaluate the proposed tracker against eleven state-of-the-art visual tracking algorithms: ONND [12], LSST [13], SCM [15], ASLA [16], MTT [18], CT [32], VTD [25], MIL [29], PN [31], IVT [24], and L1 [6]. These trackers are implemented using publicly available source code or binaries provided by the authors and are initialized with their default parameters.

5.1. Parameter Setting. The proposed algorithm is implemented in MATLAB R2011b on a Pentium 2.3 GHz dual-core laptop with 2 GB memory. For each sequence, the location of the target object is manually labeled in the first frame. Each image sample from the target and background is normalized to a 32 × 32 patch. We set the parameters λ₁, λ₂, λ₃ in (5) to 2, 4, and 0.01, respectively. The small positive constant in (10) is set to 0.01, and the forgetting factor is set to 0.95. The numbers of positive and negative templates are 200 and 600, respectively. The learned dictionary includes 200 atoms.

5.2. Quantitative Comparison. For quantitative performance comparison, two popular evaluation criteria are used, namely, the center location error (CLE) and the tracking success rate (TSR). The CLE is computed as the distance between the predicted center position and the ground-truth center position; smaller is better. Figure 1 presents the relative position errors (in pixels) between the ground-truth center and the tracking results, and Table 2 summarizes the average center location errors in pixels. The TSR is computed as the ratio of the number of frames in which the target is successfully tracked to the total number of frames in the sequence. To decide whether the target is successfully tracked in a frame, we use the overlap score of the PASCAL VOC challenge [48],

score = area(R_T ∩ R_G) / area(R_T ∪ R_G),

where R_T is the bounding box of the current tracking result and R_G is the ground-truth bounding box. Table 3 and Figure 2 give the average tracking success rates and the relative tracking success rates, respectively. Overall, the proposed tracker performs well against the other state-of-the-art algorithms.
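The VOC overlap score and the TSR can be computed as below (pure-Python sketch; boxes are (x, y, w, h) tuples, and the 0.5 success threshold is the conventional choice, assumed here):

```python
def voc_overlap(box_a, box_b):
    """PASCAL VOC overlap score: area(A ∩ B) / area(A ∪ B) for (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(tracked, truth, threshold=0.5):
    """TSR: fraction of frames whose overlap exceeds the threshold."""
    hits = sum(voc_overlap(t, g) > threshold for t, g in zip(tracked, truth))
    return hits / len(truth)

s = voc_overlap((0, 0, 10, 10), (5, 0, 10, 10))   # half-shifted boxes: 50/150
```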

5.3. Qualitative Comparison. There are blurred images in the animal sequence, which makes this situation difficult for most trackers. From Figure 3, we can see that the head of the fawn becomes blurred at frames 25 and 42, making the appearance of the tracked object indistinguishable. Most tracking algorithms, such as MIL, PN, and Frag, fail to follow the target. The proposed algorithm successfully tracks the target object throughout the sequence; its localization accuracy and overlap rate are better than those of SCM, LSST, and ONND, though lower than ASLA's.

In the car11 sequence, a car is driven into a very dark environment. The contrast between the tracked target and its surrounding background is low, and the ambient light changes significantly. Furthermore, the low image resolution of the target object makes tracking difficult. The tracking results are illustrated in Figure 3. Due to the lighting changes, the Frag and MIL algorithms start to drift around frame 60, and the L1 method starts to fail at frame 250. The IVT, SCM, ASLA, LSST, MTT, and ONND algorithms perform as well as our tracker over the whole sequence, although their accuracy and robustness are lower than those of our proposed algorithm; the other methods drift away when drastic illumination variation occurs (e.g., #0200 and #0250) or when similar objects appear in the scene (e.g., #0305), especially as the car makes a turn at about frame 260. The tracked object in the girl sequence undergoes occlusion (complete occlusion of the girl's face as she swivels in the chair), large pose changes, and scale variation with in-plane and out-of-plane rotations (from large to small and back). The tracking results are shown in Figure 3. The experimental results demonstrate that our method achieves the best performance on this sequence.

[Figure 2: Overlap rate evaluation. Overlap rates for the tested video clips; our algorithm is compared with the state-of-the-art methods.]
Other trackers experience drift at different instances: Frag at frame 248, IVT at frame 436, and VTD at frame 477.
There is abrupt motion in the jumping sequence, so it is difficult to predict the location of the tracked target in the blurry images. Furthermore, it is rather challenging to account for the drastic appearance changes caused by motion blur and to properly update the appearance models. Figure 3 shows that most tracking algorithms fail to follow the target right from the beginning of this sequence (e.g., #13 and #25). The proposed algorithm, together with SCM, ASLA, LSST, and PN, successfully tracks the target object throughout the sequence.
In the cliffbar video, the background has texture similar to the target. Moreover, the target undergoes scale variation, in-plane rotation, and abrupt motion, as shown in Figure 3. The Frag, L1, IVT, CT, MIL, LSST, ONND, and SCM methods drift to the cluttered background, while our proposed tracker has the best performance on this sequence: it adapts to the scale and rotation changes of the target and overcomes the influence of the similar background and motion blur.
In the caviar sequence, the target is occluded by two people at times, one of whom is similar in color and shape to the target. Numerous methods fail to track the target because similar objects surround it when heavy occlusion occurs. In contrast, our tracker achieves stable performance over the entire sequence, even through the large scale change with heavy occlusion at frame 442.
The football sequence is challenging due to the cluttered background: the scene contains many football players whose helmets are similar in appearance to the tracked object. When the tracked target approaches other players, some trackers lose robustness and begin to drift, as shown in frames 76, 113, and 150 in Figure 3. When the two football players collide at frame 290, most tracking methods cannot locate the target correctly. Only our tracker, CT, VTD, and ONND overcome this problem and locate the correct object throughout the whole sequence, with our method achieving the highest accuracy.
In the woman sequence, the walking woman undergoes pose variation together with long-term partial occlusion. The difficulty lies in the fact that the woman is heavily occluded by the parked cars. Most trackers fail and lock onto a car with a color similar to the woman's trousers when her legs are heavily occluded from frame 110 to 130. Only ONND, our tracker, and ASLA overcome this difficulty and follow the target accurately. Although the PN tracker can re-detect the target after tracking fails, it is vulnerable to occlusion and repeatedly loses the target, as shown in Figure 3.

Conclusions
In this paper, we present a supervised approach to learn and update a structured, sparse, and discriminative representation for object tracking. Label information from training data is incorporated into the dictionary learning process to construct a discriminative structured dictionary. This is accomplished by adding an ideal-code regularization term and classification error term to the total objective function. By minimizing the objective function, we can obtain a high quality dictionary and optimal linear classifier simultaneously. This approach exploits the strength of label information and encourages images from the same class to have similar representations. Experimental results on challenging image sequences demonstrate that our tracking algorithm performs favorably against several state-of-the-art algorithms. Possible future work includes online and robust discriminative dictionary learning and structured low-rank representations for real-time object tracking.