Online Hierarchical Sparse Representation of Multifeature for Robust Object Tracking

Object tracking based on sparse representation has produced promising results in recent years. However, trackers built on the sparse representation framework tend to overemphasize sparsity and ignore the correlation of visual information. In addition, sparse coding methods encode each local region independently and ignore the spatial neighborhood information of the image. In this paper, we propose a robust tracking algorithm. First, multiple complementary features are used to describe the object appearance; the appearance model of the tracked target is built from instantaneous and stable appearance features simultaneously. A two-stage sparse-coded method, which takes both the spatial neighborhood information of the image patch and the computational burden into consideration, is used to compute the reconstructed object appearance. Then, the reliability of each tracker is measured by the tracking likelihood function of the transient and reconstructed appearance models. Finally, the most reliable tracker is obtained within a well-established particle filter framework; the training set and the template library are incrementally updated based on the current tracking results. Experimental results on different challenging video sequences show that the proposed algorithm performs well, with superior tracking accuracy and robustness.


Introduction
The task of visual tracking is to locate an object of interest and follow it through a video sequence. It is an important research topic in computer vision because of its widespread applications in traffic monitoring, vehicle navigation, and visual surveillance. Robust object tracking in dynamic environments remains challenging, mainly because factors such as occlusion, pose variation, illumination change, and background clutter cause large appearance changes [1,2]. A robust appearance model is important for dealing with occlusions and other interferences during tracking. A target object is represented by visual information such as color, edge, or texture features extracted from the target region. However, numerous trackers rely on a single feature to build the target appearance and ignore the complementary representation of different features; they usually lack robustness and are sensitive to interferences in dynamic environments [3]. For example, Ross et al. [4] use the intensity feature to represent the appearance model of the target object and integrate incremental learning to obtain a low-dimensional subspace representation. Babenko et al. [5] propose Multiple Instance Learning (MIL), which employs Haar-like features to build a discriminative appearance model for robust tracking. Mei and Ling [6] introduce an ℓ1 minimization robust visual tracking method that uses the intensity feature and represents the target appearance as a sparse linear combination of appearance templates and trivial templates in the template space. However, a single feature ignores the complementary characteristics of different visual information and is insufficient to describe drastic changes of target appearance in complicated environments. Therefore, its representation ability declines when occlusions or other interferences occur in complex backgrounds [3].
As a result, numerous trackers fuse multiple features to describe the target object and build the object appearance model, which better captures appearance changes and improves the robustness of trackers in dynamic environments [7][8][9][10][11]. However, how to effectively use and integrate multiple features for robust tracking still needs to be addressed. Numerous trackers based on sparse representation have been proposed in recent years [12][13][14][15][16][17]. Mei et al. [6,17] proposed an ℓ1 minimization robust tracking method that treats tracking as a sparse approximation problem. Zhang et al. [14] proposed a low-rank sparse representation tracking method. Liu et al. [15] developed a robust tracking algorithm using a local sparse appearance model to balance the requirements of stability and flexibility during tracking. These trackers all solve the sparse approximation problem with the ℓ1 regularized least squares method and show promising results against many existing trackers. However, sparse coding based on ℓ1 minimization yields a very sparse representation but ignores the importance of collaborative representation and the correlation of visual information; it is vulnerable to interferences, and ℓ1 minimization is very time-consuming. In [16], Zhang et al. emphasized the role of collaborative representation with ℓ2 regularized least squares (ℓ2 RLS) and showed that ℓ2 RLS reduces the computational burden. Shi et al. [18] also demonstrated that ℓ2 RLS is more accurate, robust, and faster. In [19], Yu et al. demonstrated that traditional sparse coding methods ignore the spatial neighborhood structure of the image because they encode local patches independently; they then proposed an efficient discriminative image representation method that uses a two-layer sparse coding scheme at the pixel level.
Motivated by the challenges mentioned above, this paper proposes an object tracking algorithm that combines multiple visual features with hierarchical sparse coding. As shown in Figure 1, a multiple complementary feature representation [20] is used to robustly represent the object; the target appearance is modeled with a two-stage sparse-coded method, which uses ℓ2 regularized least squares to solve the sparse approximation problem. Each tracker then uses a different feature to estimate the object state and build the multiple observation models. The reliability of each tracker is computed by the tracking likelihood function of the instantaneous and reconstructed appearance models, which take the transient and stable appearance changes into consideration. Finally, the most reliable tracker is obtained within a well-established particle filter framework; the training set and template library are incrementally updated based on the current tracking result.
The main contributions of the proposed tracking algorithm are as follows.
(1) We construct the target appearance from both instantaneous and stable appearance features; the transient appearance model and the reconstructed object appearance model are built independently.
(2) A two-stage sparse-coded method is employed to obtain the reconstructed coefficient vector used to construct the reconstructed appearance model. The two-stage sparse-coded method takes the temporal correlation between target templates and the spatial neighborhood structure of the image patches into consideration and solves the sparse approximation problem by ℓ2 RLS. This reduces the computational burden and improves the tracking performance.
(3) To better describe the object appearance changes, the reliability of each tracker is measured by the tracking likelihood function of the instantaneous and reconstructed appearance models, which take transient and stable appearance changes into consideration.
Experimental results on challenging sequences show that the proposed method performs well compared to state-of-the-art methods.

The Proposed Tracking Algorithm
Here, $p(x_t \mid x_{t-1}, k)$ is the motion model of the $k$th tracker between the $t$th and $(t-1)$th frames, which is restricted to a Gaussian distribution over $x_t$ given $x_{t-1}$. $p(x_{t-1} \mid y_{1:t-1}, k)$ denotes the prior distribution up to frame $t-1$, and $p\{k \mid y_{1:t}\}$ is the probability of the $k$th tracker. The crossover probability of the $k$th tracker for multiple features is
In addition, the $k$th tracker probability $p\{\cdot\}$ satisfies
Then, we sparsely represent the candidate sample $y$ with state $x$ from the template library $D$; the likelihood of the observation model is
where $\varepsilon = \min_{c} \lVert y - Dc \rVert$ is the sparse reconstruction error of the candidate sample $y$ and $c$ is the vector of sparse coefficients. Therefore, the tracking result $\hat{x}_t$ at the $t$th frame is given by the most reliable tracker, that is, the one with the highest tracker probability:

Multiple Features Representation for Object Appearance.
Different features have complementary characteristics for coping with appearance changes: for example, HOG features are robust to pose variations [21], and Haar-like features can effectively deal with occlusions [22], whereas a single appearance model is insufficient to represent the target in a complicated environment. Therefore, we exploit different types of features to build multiple appearance models that represent the object robustly. Features with complementary characteristics handle various appearance changes, which is beneficial for tracking the target object robustly.
In the proposed method, we use three trackers based on HOG, Haar-like, and intensity features to represent the object appearance, which can effectively deal with occlusions, illumination changes, and pose variations. For each frame, we extract the multiple features to form the feature sets, where each feature is a real-valued vector whose length is the dimension of that feature. The feature sets are then normalized to form the target template.
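As a rough illustration of forming a normalized multifeature template, the sketch below scales each feature vector to unit ℓ2 norm. The choice of ℓ2 normalization, the function name `build_template`, and the random stand-in vectors are assumptions made for illustration; the text does not specify the normalization, and real HOG or Haar-like extraction is not shown. The vector lengths follow the feature dimensions reported in the Parameters Setting section.

```python
import numpy as np


def build_template(feature_vectors):
    """Scale each feature vector to unit l2 norm (assumed normalization).

    feature_vectors: dict mapping a feature name to a 1-D array.
    Returns a dict with the same keys and normalized float arrays.
    """
    template = {}
    for name, vec in feature_vectors.items():
        vec = np.asarray(vec, dtype=float).ravel()
        norm = np.linalg.norm(vec)
        # Leave an all-zero vector unchanged to avoid dividing by zero.
        template[name] = vec / norm if norm > 0 else vec
    return template


# Random stand-ins for the three features; only the lengths come from the paper.
rng = np.random.default_rng(0)
features = {
    "intensity": rng.random(1024),
    "hog": rng.random(1296),
    "haar": rng.random(1760),
}
template = build_template(features)
```

Normalizing each feature separately keeps features of different lengths and value ranges on a comparable scale before they are compared with the template library.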

Object Representation by Hierarchical Sparse Coding.
In the proposed method, we use the transient and stable features to describe the abrupt and stable object appearance changes. The stable features are sparsely represented by the current template with hierarchical sparse coding. Then, the reliability of each tracker is measured by the tracking likelihood function of instantaneous and reconstructed appearance models.
The transient features up to the current frame are collected from the most recent frames, and the transient appearance model is obtained by averaging these recent appearance features. The stable object appearance is represented by sparse coding of the stable features. Because tracking based on sparse representation searches for the samples with minimal reconstruction error over the template library, a target can be reconstructed from several templates [23]. Among the features, only some have the discriminative capability to separate the target from its background. To both separate the target from its background and minimize the reconstruction error over the template library, we use hierarchical sparse coding to minimize reconstruction errors and maximize the discriminative capability of the features. In addition, we use ℓ2 RLS to solve the sparse approximation problem, which reduces the computational burden.
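The ℓ2 RLS coding step mentioned above has a closed-form solution, which is the reason it is cheap to compute. The following is a minimal sketch, assuming a template matrix `D` with one template per column, a candidate feature vector `y`, and a regularization weight `lam`. All three names are illustrative and not taken from the text; the default of 0.001 mirrors the sparsity parameter values reported in the Parameters Setting section, and the two-stage details of the proposed method are not reproduced here.

```python
import numpy as np


def l2_rls_code(D, y, lam=0.001):
    """Solve min_c ||y - D c||_2^2 + lam * ||c||_2^2 in closed form.

    D: (d, n) template matrix, one template per column (assumed layout).
    y: (d,) candidate feature vector.
    Returns the coefficient vector c and the reconstruction residual.
    """
    n = D.shape[1]
    # Normal equations of regularized least squares: (D^T D + lam I) c = D^T y.
    c = np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ y)
    residual = np.linalg.norm(y - D @ c)
    return c, residual


# Toy data: 16 random templates of dimension 1024; the candidate equals template 3.
rng = np.random.default_rng(0)
D = rng.random((1024, 16))
y = D[:, 3].copy()
coeffs, residual = l2_rls_code(D, y)
```

Note that ℓ2 regularization shrinks the coefficients rather than forcing exact zeros, so the result is a collaborative representation over all templates rather than a strictly sparse one.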
For a new arriving frame, we can achieve tracking results {̂| = 1, . . . , }. For th tracker, , denotes the candidate image patch represented by th features and , , is the reconstructed appearance for , . Then, a two-stage sparse-coded method by ℓ 2 RLS is used to obtain the coefficient vectors , and , as follows: where 1 and 2 are nonzero components.
To effectively tackle the high-dimensional data in feature space, we use the diagonal matrix to decrease the dimension of the feature space. For a set of samples = { ∈ R 1× | = 1, . . . , }, the joint sparse solution is shown as follows: where ( , ) is the loss function and 1 , 2 are the sparse parameters. If ̸ = 0, the th feature is activated. The loss function is computed as where { ∈ R ×1 | = 1, . . . , } is the sparse vector. If ̸ = 0, the th feature is selected.
Then, the solution to the minimum loss function ( , ) is achieved by solving the sparse problem as where 0 denotes the maximum number of features that can be selected. Considering the spatial neighborhood information of the image patch, let ( , ) denote the th neighbor of th feature; then the vector set is where is the weight of the neighbors.
The diagonal matrix is formed as
The first-stage sparse coding above takes account of the spatial relationship of neighborhood features, which helps select a set of discriminative features that separate the target from its background, and it reduces the computational burden by using ℓ2 RLS to solve the sparse approximation problem. Target templates usually contain some background features, which differ from their neighbors. By performing discriminative feature selection as above, we can efficiently eliminate the background features from the target templates and therefore construct a more efficient and robust target template library.
In the second sparse reconstruction stage, , and , in (12) can be computed as follows: The nonzero row of matrix forms the matrix ∈ R 0 × ; let 1 = 1 , = , and = . Then, where 1 , 2 are the sparsity parameters that control the sparse representation of the target template and the tolerance of interference in complicated environment, respectively.
Therefore, the reconstructed object appearance for the candidate is represented as
After the above sparse reconstruction, the feature dimension is reduced from $d \times n$ to $d_0 \times n$, where $n$ is the number of templates in the target library.
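The dimension reduction described above can be pictured as keeping only the template rows that the diagonal selection matrix leaves nonzero. The sketch below is an illustration under assumptions: `w_diag` stands for the diagonal of the selection matrix, `D` for a template matrix with d feature rows and n template columns, and d0 for the number of selected features. These names, and the toy values, are hypothetical; how the diagonal entries are actually computed in the first stage is not shown.

```python
import numpy as np


def select_feature_rows(w_diag, D):
    """Keep the rows of D whose diagonal selection weight is nonzero.

    w_diag: (d,) diagonal entries of the selection matrix (assumed given).
    D: (d, n) template matrix.
    Returns a (d0, n) matrix, where d0 is the number of nonzero weights.
    """
    w_diag = np.asarray(w_diag, dtype=float)
    keep = w_diag != 0
    # Same result as forming np.diag(w_diag) @ D and dropping the rows
    # that the diagonal matrix zeroed out.
    return w_diag[keep, None] * D[keep, :]


rng = np.random.default_rng(1)
D = rng.random((6, 3))                              # d = 6 features, n = 3 templates
w_diag = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])   # d0 = 3 selected features
D_reduced = select_feature_rows(w_diag, D)
```

Working on the reduced matrix means that the second-stage coding solves a problem with d0 rows instead of d, which is where the computational saving comes from.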
The predicted reliable object state for th tracker at frame iŝ= Then the corresponding tracking likelihood function of the th tracker at frame is In the proposed method, we use instantaneous and reconstructed features to describe the transient and stable appearance changes. The reliability of each tracker is where ( |̂) is the instantaneous appearance likelihood based on the transient object appearance , , which is formed by a set of recent frames features , . ( |̂) is the reconstructed object appearance likelihood based on the reconstructed object appearance , , and , , comes from the stable object appearance, which is formed by th feature and the tracking result , from the th tracker: where is the control parameter.

Prediction and Update.
To robustly track the target object, we update the tracker probability of the multiple trackers and the reliability of each tracker. The tracker probability is updated as follows:
where
The corresponding observation model $p(y_t \mid x_t, k)$ for each tracker is updated based on the incremental subspace model in [4]. Then, the particle filter is used to approximate the state posterior distribution $p(x_t \mid y_{1:t}, k)$ by a set of $N = 600$ particles [24]:
$$p(x_t \mid y_{1:t}, k) \approx \sum_{j=1}^{N} w_{t,k}^{j} \, \delta\left(x_t - x_{t,k}^{j}\right),$$
where $\delta(\cdot)$ is a delta function and $\{w_{t,k}^{j}\}_{j=1}^{N}$ are the sample weights associated with the particles $\{x_{t,k}^{j}\}_{j=1,\ldots,N}$.
The particles are obtained from the state prediction $p(x_t \mid x_{t-1}, k)$, which is simplified by a first-order Markov model
The weights are updated as
Then, we obtain a set of reliable states by maximizing the posterior estimates:
In the proposed method, the target appearance is constructed from multiple features that take account of transient and stable appearance changes to cope with occlusion and other interferences in complicated environments. For example, in a dynamic environment with drastic occlusion or illumination changes, the stable features are rarely updated, but the transient features can effectively describe the frequent appearance changes. In a static background, if a background sample is added to the template, it usually yields a good reconstruction with high likelihood because the background is static most of the time. Because such an incorrect template is nonlinear and differs from its neighbors, the two-stage sparse coding method, which takes account of the spatial relationship of neighborhood features, can prevent it from being selected. Therefore, we can construct a more efficient and robust target template library.
In addition, we update the template library based on the current tracking result, as done in the IVT method [4]; samples with high likelihood that lie near the target are added to the template library. We repeat this procedure for each frame of the sequence. Tracking based on joint multiple feature representation and hierarchical sparse coding can provide a robust and accurate tracking result.
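The particle filter steps of this subsection (prediction with a first-order Markov motion model, weighting by the observation likelihood, and selection of the maximum a posteriori state) can be sketched for a single tracker as follows. This is a simplified illustration, not the authors' implementation: the state is reduced to a 2-D position, the weights are set directly from the likelihood (as if the previous weights were uniform), resampling is omitted, and the names `particle_filter_step`, `sigma`, and `toy_likelihood` are hypothetical.

```python
import math
import random


def particle_filter_step(particles, likelihood, sigma=4.0, rng=random):
    """One predict / weight / select step for 2-D position particles.

    particles: list of (x, y) states from the previous frame.
    likelihood: callable mapping a state to a nonnegative observation score.
    Returns the predicted particles, their normalized weights, and the
    particle with the largest weight (the MAP estimate).
    """
    # Prediction: first-order Markov motion model with Gaussian noise.
    predicted = [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
                 for x, y in particles]
    # Weighting: score each predicted particle with the observation likelihood.
    weights = [likelihood(p) for p in predicted]
    total = sum(weights)
    if total > 0:
        weights = [w / total for w in weights]
    else:
        # Fall back to uniform weights if every likelihood is zero.
        weights = [1.0 / len(predicted)] * len(predicted)
    # Selection: the particle with the highest weight is the MAP estimate.
    best = max(range(len(predicted)), key=weights.__getitem__)
    return predicted, weights, predicted[best]


# Toy likelihood that peaks at a known target position (illustration only).
target = (50.0, 40.0)


def toy_likelihood(state):
    d2 = (state[0] - target[0]) ** 2 + (state[1] - target[1]) ** 2
    return math.exp(-d2 / (2.0 * 10.0 ** 2))


predicted, weights, estimate = particle_filter_step(
    [(48.0, 42.0)] * 600, toy_likelihood, rng=random.Random(0))
```

In the tracker described here, the toy likelihood would be replaced by the observation likelihood of each tracker, and the step would be run once per tracker and per frame with 600 particles.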

Tracking Based on Online Hierarchical Sparse Representation of Multifeature
As described above, the main steps of the proposed tracking algorithm are shown in Algorithm 1.
Algorithm 1 (tracking based on hierarchical sparse representation of multifeature).

Experiments
To analyze the performance of the proposed tracking method, we compare it with five state-of-the-art trackers [25], namely IVT [4], L1 [6], MIL [5], OAB [26], and VTD [27], on several challenging video sequences. The target objects in the test videos are either nonrigid or rigid objects that undergo significant pose variation, heavy occlusion, in-plane and out-of-plane rotation, or motion blur. The video sequences are available at https://sites.google.com/site/trackerbenchmark/benchmarks/v10. The proposed tracking algorithm is implemented in MATLAB and runs at around 20 frames per second on a PC with 2CPU, 2.5 GHz, and 3.1 GB RAM.

Parameters Setting.
For all test video sequences, we manually select the initial target location. Each image patch is normalized to 32 × 32 pixels and sparsity parameters 1 = 2 = 0.001 and = 0.1. The dimensions of the intensity, HOG, and Haar-like features are 1024, 1296, and 1760, respectively. The number of particles is 600, and the number of template samples is 16. Table 1 lists the characteristics of the sequences evaluated in the experiments.

Qualitative Comparison
Experiment 1 (illumination variation, occlusion, scale change, and fast motion of rigid object). The Car4 sequence shows a car on an open road with illumination variation and partial occlusion, as shown in Figure 2(a). At frame 86, the OAB tracker drifts slightly because of occlusion by the trees and bridge, and it fails to track the car at frame 233. The L1, MIL, and VTD trackers start to drift away from the target when drastic illumination changes occur at frame 195 and fail to track the target at frame 255. IVT and the proposed method track the target successfully because they dynamically update the template, which helps in coping with occlusion and illumination changes. However, the result of IVT is less satisfactory because its tracking box is larger than the target object from frame 195 to the end of the sequence.
In the CarScale sequence, the target is a fast-moving car on an open road. Compared with the Car4 sequence, this sequence is more challenging because the tracked car undergoes large scale changes and fast motion throughout the sequence. Because of the fast motion combined with occlusion by the trees, the IVT, L1, MIL, and VTD trackers drift to different degrees at frame 164 and finally lose the target at frame 171. The proposed method gives the best results, followed by the OAB tracker.
The CarDark sequence is challenging because the target undergoes motion blur in a night scene with low contrast and strong reflection interference. Because of the strong reflections, the MIL tracker drifts a little from the target at frame 122, loses it at frame 202, and then locks onto another car. The IVT, L1, and VTD trackers drift away from the target at frame 277. The OAB tracker performs well on this sequence and yields the second best results. The proposed method tracks the target accurately over the whole sequence, with small center position error and high overlap rate.
Experiment 2 (occlusion, scale change, and rotation of nonrigid object). The target in the FaceOcc2 sequence undergoes drastic occlusion and in-plane rotation. As shown in Figure 3(a), when the face is slightly occluded by a book at frames 128∼185 and 245∼279, all methods perform well. However, when partial occlusion and in-plane rotation occur together at frames 392∼510, most trackers perform poorly. When the target is almost fully occluded by a book and a hat at frames 688∼740, all methods except the proposed one drift away from the target to different degrees. Because the proposed method uses multiple complementary features to build the transient and stable appearance models and updates the template library online, it handles the occlusion effectively and gives satisfactory tracking results.
The Freeman1 sequence is challenging because the man's face undergoes large scale changes and view variations. Because of the large scale changes, MIL drifts away from the target at frame 32. As the view changes from left to right, the L1 and OAB trackers totally lose the target at frames 131 and 176, respectively. IVT, VTD, and the proposed method perform well on this challenging sequence and track the target accurately.
The Girl sequence has drastic appearance changes because of out-of-plane rotation and occlusion by a similar target. When out-of-plane rotation occurs at frames 90∼122 and 169∼260, the IVT tracker totally fails to track the girl's face, and the other trackers drift to different degrees. The OAB and VTD trackers lose the target and track the similar object instead when the girl's face is occluded by a man's face at frames 420∼470. The MIL tracker tracks the target successfully apart from some errors, for example at frames 303 and 433. The L1 tracker and the proposed method perform well on this sequence.
Experiment 3 (illumination, scale change, and occlusion of nonrigid object). The target in the Shaking sequence undergoes drastic illumination and pose changes throughout the video. Accurate tracking is made harder because the color of the object is similar to the stage lighting. IVT and OAB almost fail to track the target at frame 23 and cannot recover in the remaining frames. The MIL tracker drifts a little at frame 61 because of the drastic illumination changes. Although the stage lights change drastically while the head shakes severely, the L1 and VTD trackers perform well apart from some errors. The proposed method adapts effectively to the severe appearance changes when those variations occur together and achieves satisfactory results.
The Woman sequence is very challenging because the target object undergoes large scale changes, view variations, and occlusions simultaneously. As shown in Figure 4(b), none of the trackers except the proposed method performs well. The results of the proposed method show only slight drift, whereas the IVT, L1, MIL, OAB, and VTD trackers totally lose the target when heavy occlusion occurs at frame 130, and none of them except the OAB tracker recovers in the subsequent frames. The OAB tracker recaptures the target at frame 337 and keeps tracking it until the end of the sequence with a little drift.
The Jogging sequence is even more challenging because the tracked target is fully occluded by a stem and undergoes large scale change and fast motion simultaneously. The IVT, L1, MIL, OAB, and VTD trackers completely fail to track the target when it is fully occluded by the stem at frames 50∼62, although the OAB tracker recaptures the target at frame 106. The proposed method tracks the target accurately over the entire sequence.
From the sampled tracking results of the proposed method and the five other methods on the 9 image sequences, we conclude that the proposed algorithm can accurately and robustly track the target under illumination variation, scale change, and motion blur.

Quantitative Comparison.
Two metrics are used to evaluate the proposed tracker against the reference trackers on gray-scale videos. The first is the center position error, which is the Euclidean distance (in pixels) between the ground-truth center and the tracked object center at each frame. The other metric is the overlap rate [28], defined as score = area(R_T ∩ R_G)/area(R_T ∪ R_G), where R_T denotes the bounding box generated by a tracking method and R_G is the ground-truth bounding box.
Note: the optimal result is shown in bold and the suboptimal one in italic.
Tables 2 and 3 show the average center position errors and the average overlap rates for all trackers. Figures 5 and 6 show the per-frame center position error curves and overlap rate curves of the different trackers on the 9 video sequences. The proposed algorithm achieves the optimal or suboptimal performance in terms of average center position error and average overlap rate on most test video sequences. Most competing tracking methods do not give satisfactory results: their center position errors are larger and their overlap rates are lower. The average position error of the proposed method over the 9 videos is only 5.53 pixels, which is far less than that of the other trackers.
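The two evaluation metrics defined above can be computed per frame as in the following sketch. The (x, y, width, height) box layout with a top-left origin and the function names `center_error` and `overlap_rate` are assumptions made for illustration; the benchmark's own annotation format should be checked before reuse.

```python
import math


def center_error(box_a, box_b):
    """Euclidean distance in pixels between the centers of two boxes.

    Boxes are (x, y, width, height) with (x, y) the top-left corner (assumed).
    """
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    return math.hypot(ax - bx, ay - by)


def overlap_rate(box_t, box_g):
    """Intersection area divided by union area of tracked and ground-truth boxes."""
    x1 = max(box_t[0], box_g[0])
    y1 = max(box_t[1], box_g[1])
    x2 = min(box_t[0] + box_t[2], box_g[0] + box_g[2])
    y2 = min(box_t[1] + box_t[3], box_g[1] + box_g[3])
    # Clamp at zero so that disjoint boxes give an empty intersection.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0


# Toy frame: the tracked box is shifted 10 pixels left of the ground truth.
tracked = (10, 10, 20, 20)
truth = (20, 10, 20, 20)
error = center_error(tracked, truth)
score = overlap_rate(tracked, truth)
```

Averaging `error` and `score` over all frames of a sequence gives the per-sequence figures of the kind reported in Tables 2 and 3.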

Conclusion
In this paper, we propose a robust tracking algorithm that leverages hierarchical sparse coding to optimize the image representation built from multiple features. We compare our tracking method with five state-of-the-art trackers on nine sequences to validate the robustness and accuracy of the proposed method. The experimental results show that our method can effectively and robustly handle challenging scenes in which the target object undergoes drastic variation in pose, scale, rotation, occlusion, and illumination. The success of our method can be attributed to constructing multiple observation models from the multiple features by hierarchical sparse coding, which takes the spatial relationship of neighborhood features into consideration and solves the sparse approximation problem by ℓ2 RLS. The appearance model constructed from instantaneous and stable appearance features with two-stage sparse coding is more robust to appearance change in complex environments and more effective at selecting a set of discriminative features that separate the target from its background. In the proposed method, we compute the reliability of each tracker by the tracker likelihood function, which accounts for the transient and reconstructed appearance models, and select the most reliable one among the multiple trackers. The training set and the template library are both incrementally updated online. All of these help cope with appearance change and improve the tracking performance in dynamic environments.
However, the proposed method has the following main limitations.
(1) The tracking system is not efficient enough for real-time tracking because multiple features are computed at the same time for the test video sequences, which is time-consuming. In addition, it cannot adaptively choose which features to extract according to the video attributes.
(2) The ability of each feature to describe the target cannot be effectively measured.
(3) It cannot successfully track the target when the object leaves the scene and reappears in subsequent frames.
In the future, we will improve the proposed method in two aspects. (1) We will improve the real-time performance of the algorithm by adaptively extracting the features according to the video attributes, which can reduce the time and computational load of feature extraction. (2) We will improve the tracking performance by introducing occlusion and drift handling mechanisms, which can prevent the template from being updated with wrong samples when the target object is occluded or the tracker drifts. Both strategies are useful for dealing with appearance changes and for robustly tracking the target in complex environments.