Tracking-by-detection methods have been widely studied with promising results. These methods usually train a classifier or a pool of classifiers online, use previous tracking results to generate a new training set for the object appearance, and update the current model to predict the object location in subsequent frames. However, the updating process can easily cause drifting under appearance variation and occlusion. Previous methods decide whether or not to update the classifier(s) with a learning rate parameter that is fixed across all scenarios. This parameter has a great influence on the tracker's performance and should be dynamically adjusted according to scene changes during tracking. In this paper, we propose a novel method to model the time-varying appearance of an object that takes the appearance variation and occlusion of local patches into consideration. In contrast to existing methods, the learning rate for updating the classifier ensembles is adaptively adjusted by estimating the appearance variation with sparse optical flow and the possible occlusion of the object between consecutive frames. Experiments and evaluations on several challenging video sequences demonstrate that the proposed method is more robust against appearance variation and occlusion than state-of-the-art approaches.
Visual object tracking, one of the cardinal problems in computer vision, has a wide range of applications including video surveillance, human-computer interaction, video retrieval, and autonomous navigation. Although numerous single-object tracking methods have been proposed, many of them only achieve favorable performance in simple environments with slow motion and slight occlusion. Hence, it remains challenging to design a robust algorithm for complex and dynamic scenes with distractions such as heavy occlusion, appearance variation, and cluttered background.
The purpose of object tracking is to estimate the states of a moving target in a video. A tracking system usually has three components: an appearance model, a motion model, and a model update scheme. The appearance model represents the object with proper features and verifies predictions against the object representation. The motion model predicts the most likely state of the target. The model update scheme adapts the tracker to appearance variation and occlusion of the target object. Existing trackers can be classified as either generative or discriminative. For generative methods [
In this paper, we mainly focus on the model update scheme because this component has a great impact on tracker performance. Existing strategies usually concentrate on updating the weights of the classifiers and ignore updating their learning rate. To address this problem, we introduce appearance variation estimation and occlusion estimation to control the learning rate of the classifiers, motivated by [
The remainder of this paper is organized as follows. The work related to visual tracking and sparse representation is reviewed in Section
In this section, we discuss related online tracking algorithms that handle appearance variation and occlusion of the target object. Although much progress has been made in visual tracking, designing an effective and robust tracker remains challenging due to distractions including appearance variation, varying illumination, occlusion, and background clutter.
Tracking-by-detection [
Appearance variation and occlusion are two major problems in visual tracking. Several algorithms using holistic and local representation schemes have been proposed to handle them. The IVT tracker [
As mentioned above, handling appearance variation and occlusion remains a major problem and an active research topic in object tracking. Motivated by [
In this section, we introduce the appearance variation estimation based on sparse optical flow and the sparsity-based occlusion estimation method of our tracker. We address the problem that the learning rate of the classifier cannot be dynamically adapted to scene changes during tracking. Our tracker is a variant of RET [
We first give a brief introduction to the RET tracking method. The tracker puts emphasis on estimating the state of the classifier rather than the state of the object. It characterizes the ensemble weight vector that combines weak classifiers as a random variable and evolves its distribution with recursive Bayesian estimation. The object bounding box is divided into local patches of size
The classification method of the tracker is described as follows. Given a weight vector
The final strong classifier is obtained for input data
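The weighted-vote classification just described can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the decision-stump weak classifiers and the fixed weight vector `w` are assumptions for the example (in RET the weights are drawn from an evolving Dirichlet distribution rather than fixed).

```python
import numpy as np

def weak_outputs(x, thresholds):
    # Each weak classifier is assumed to be a decision stump: +1 if the
    # corresponding feature exceeds its threshold, -1 otherwise.
    return np.where(np.asarray(x) > np.asarray(thresholds), 1.0, -1.0)

def strong_classify(x, thresholds, w):
    """Weighted vote of the weak classifier pool; w plays the role of the
    ensemble weight vector that combines the weak classifiers."""
    score = float(np.asarray(w) @ weak_outputs(x, thresholds))
    return 1 if score >= 0 else -1
```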
In visual tracking, appearance variation is an important factor that affects performance. Most existing methods address it with feature descriptors and ignore estimating the rate and degree of the variation. The methods proposed in [
Motivated by the demonstrated success of optical flow [
The workflow of the appearance variation estimation method using sparse optical flow.
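As a rough illustration of this idea, the sketch below estimates a per-patch translation with a basic Lucas-Kanade least-squares step and uses the mean flow magnitude as an appearance-variation score. The paper's exact scoring rule is not reproduced here; both functions and the averaging rule are assumptions for illustration.

```python
import numpy as np

def lk_flow(prev, curr):
    """Single-patch Lucas-Kanade: least-squares translation (dx, dy)."""
    Ix = np.gradient(prev, axis=1)   # horizontal spatial gradient
    Iy = np.gradient(prev, axis=0)   # vertical spatial gradient
    It = curr - prev                 # temporal gradient between frames
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    v, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return v

def variation_score(prev_patches, curr_patches):
    """Mean flow magnitude over patches as an appearance-variation proxy."""
    mags = [np.linalg.norm(lk_flow(p, c))
            for p, c in zip(prev_patches, curr_patches)]
    return float(np.mean(mags))
```

A larger score indicates faster appearance change between consecutive frames, which is the quantity used later to raise the learning rate.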
In order to estimate the possible occlusion, motivated by some successful applications of sparse representation [
In our method, each image patch is sparsely represented by its neighboring patches. If a patch is occluded, its true neighboring patches cannot be found, which causes a large reconstruction error. The reconstruction error can therefore be used to evaluate the occlusion state of each image patch: the larger the reconstruction error of a patch, the more likely the patch is occluded. If the error of a patch exceeds a threshold, we regard it as an occluded patch. In this way, we can estimate the occlusion between two consecutive frames. The diagram of our sparsity-based occlusion estimation method is shown in Figure
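A minimal numpy sketch of this scheme, assuming an ISTA solver for the lasso and a dictionary `D` whose columns are the vectorized neighboring patches; the regularization weight `lam` and the iteration count are illustrative choices, not the paper's settings.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, lam=0.01, n_iter=300):
    """ISTA for the lasso: min_a 0.5*||x - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - D.T @ (D @ a - x) / L, lam / L)
    return a

def occlusion_error(x, D):
    """Reconstruction error of a patch coded over its neighboring patches;
    a large error suggests the patch is occluded."""
    return float(np.linalg.norm(x - D @ sparse_code(x, D)))
```

A patch that is a clean combination of its neighbors reconstructs with small error, while a partially occluded patch does not, which is exactly the cue thresholded above.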
The diagram of the sparsity-based occlusion estimation method.
We use overlapped sliding windows on the images to obtain some patches with the same size. Each patch is converted to a vector. For better description of the algorithm, we use Figure
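The overlapped sliding-window patch extraction just described can be sketched as follows; the patch size and step are hypothetical values for illustration.

```python
import numpy as np

def extract_patches(img, size=8, step=4):
    """Extract overlapping square patches and flatten each into a vector."""
    H, W = img.shape
    patches = []
    for y in range(0, H - size + 1, step):
        for x in range(0, W - size + 1, step):
            patches.append(img[y:y + size, x:x + size].ravel())
    return np.array(patches)
```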
The reconstruction error can be used to estimate the occlusion. The reconstruction error of the patch
From the above equation, the occlusion values of image
In tracking, the object appearance often changes significantly because of factors such as appearance variation and occlusion. Hence it is necessary to update the classifier over time. The RET tracker updates both the Dirichlet distribution of weight vectors and the pool of weak classifiers after the classification stage at each time step. It updates the Dirichlet parameters and
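As a hedged illustration of maintaining a Dirichlet distribution over ensemble weights (the paper's exact parameter update is not reproduced here), the expected weight vector is simply the normalized Dirichlet parameter vector, and the parameters can be reinforced with per-classifier counts; the count-based rule below is a hypothetical example.

```python
import numpy as np

def dirichlet_mean(alpha):
    """Mean of a Dirichlet(alpha): the expected ensemble weight vector."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()

def update_alpha(alpha, counts):
    # Hypothetical rule: reinforce each weak classifier in proportion to
    # how often it agreed with the strong classifier's output.
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)
```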
Once a new set of positive and negative samples is identified, a tracker should decide whether or not to use it to update the classifiers. RET updates the pool of classifiers by the following equation:
It is apparent that the model of local patches with fast appearance variation should be updated to learn the target appearance, while the model of local patches with heavy occlusion should not be updated, to avoid introducing errors. The learning rate for the pool of classifiers is set according to the estimation of appearance variation and occlusion. Our classifier update method is defined as follows:
We update the model at each time step. Our adaptive classifier update thus keeps the initial model while incorporating new models that learn the target appearance under occlusion and other changes, making our tracker more robust to appearance variation and occlusion.
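One plausible instantiation of such an adaptive update is sketched below. The multiplicative combination rule and `base_lr` are assumptions for illustration, not the paper's equation: a higher estimated variation raises the rate, while heavier estimated occlusion suppresses it.

```python
import numpy as np

def adaptive_learning_rate(variation, occlusion, base_lr=0.1):
    """Hypothetical combination rule: scale the base rate up with the
    appearance-variation score and down with the occlusion score (both
    assumed normalized so that occlusion lies in [0, 1])."""
    return float(np.clip(base_lr * variation * (1.0 - occlusion), 0.0, 1.0))

def update_classifier(model, new_model, lr):
    # Exponential moving average of model parameters with the adaptive rate.
    return (1.0 - lr) * np.asarray(model) + lr * np.asarray(new_model)
```

Under full occlusion the rate collapses to zero, so the classifier is frozen instead of learning the occluder, which matches the behavior argued for above.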
To evaluate the performance of our proposed tracker, we test it on 19 publicly available video sequences. The sequences come from the VTD dataset, the MIL dataset, and the visual tracker benchmark [
Note that we fix the parameters of our tracker for all sequences to demonstrate its robustness and effectiveness. We search for the target of interest using a standard sliding-window method, and the object bounding box is divided into a regular grid of
To quantitatively evaluate the performance of each tracker, we use three widely accepted evaluation metrics: the successful tracking rate (STR), the average center location error (ACLE), and the average overlap rate (AOR). The successful tracking rate is the ratio of the number of successfully tracked frames to the total number of frames in the sequence; a frame is labeled as successfully tracked if the overlap between the predicted and ground-truth bounding boxes is more than 0.5. The center location error is the Euclidean distance between the center of the tracking result and that of the ground truth for each frame. The overlap rate is based on the PASCAL VOC criterion. Given the tracked bounding box
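The three metrics can be computed from bounding boxes in `(x, y, w, h)` form as follows; this is a straightforward implementation of the definitions stated above.

```python
def overlap_rate(a, b):
    """PASCAL VOC overlap (intersection over union) of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union

def center_error(a, b):
    """Euclidean distance between the two box centers."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def success_rate(pred, gt, thr=0.5):
    """Fraction of frames whose overlap with the ground truth exceeds thr."""
    return sum(overlap_rate(p, g) > thr for p, g in zip(pred, gt)) / len(gt)
```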
AOR (STR). The best two results are shown in bold and italic fonts. The results of some algorithms come from the visual tracker benchmark [
| Sequence | TLD | CT | Frag | MIL | SCM | DFT | OAB | IVT | VTD | RET | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Deer | 0.60 (0.73) | 0.04 (0.04) | 0.17 (0.21) | 0.12 (0.13) | 0.07 (0.03) | 0.26 (0.31) | — | 0.03 (0.03) | 0.60 (0.72) | 0.06 (0.04) | — |
| David | — | 0.50 (0.43) | 0.17 (0.12) | 0.43 (0.23) | 0.72 (0.92) | 0.30 (0.23) | 0.39 (0.15) | 0.65 (0.80) | 0.54 (0.70) | 0.56 (0.68) | — |
| Faceocc | 0.59 (0.83) | 0.64 (0.85) | — | 0.60 (0.77) | — | 0.69 (0.80) | 0.66 (0.91) | — | — | 0.68 (0.93) | — |
| Faceocc2 | 0.62 (0.83) | 0.61 (0.74) | 0.65 (0.76) | 0.67 (0.94) | 0.73 (0.87) | — | 0.60 (0.75) | 0.73 (0.91) | 0.69 (0.80) | — | — |
| Girl | 0.57 (0.76) | 0.31 (0.18) | 0.45 (0.54) | 0.40 (0.29) | 0.68 (0.88) | 0.56 (0.25) | — | 0.17 (0.19) | 0.55 (0.65) | 0.69 (0.84) | 0.69 (0.90) |
| Sylvester | 0.67 (0.83) | 0.58 (0.69) | 0.53 (0.55) | — | 0.38 (0.41) | 0.56 (0.68) | 0.52 (0.68) | 0.40 (0.43) | 0.62 (0.80) | 0.61 (0.80) | 0.61 (0.80) |
| Tiger1 | 0.38 (0.46) | 0.41 (0.25) | 0.27 (0.31) | 0.12 (0.10) | 0.16 (0.13) | 0.53 (0.68) | 0.11 (0.10) | 0.10 (0.09) | 0.31 (0.35) | 0.12 (0.12) | — |
| Tiger2 | 0.26 (0.17) | 0.45 (0.37) | 0.12 (0.12) | 0.46 (0.45) | 0.09 (0.11) | — | 0.15 (0.14) | 0.09 (0.08) | 0.25 (0.27) | 0.30 (0.17) | — |
| Coke | — | 0.23 (0.10) | 0.04 (0.03) | 0.20 (0.12) | 0.11 (0.09) | 0.33 (0.17) | 0.12 (0.13) | 0.18 (0.20) | 0.14 (0.14) | 0.37 (0.23) | 0.36 (0.22) |
| Singer1 | — | 0.35 (0.25) | 0.20 (0.22) | 0.36 (0.28) | — | 0.36 (0.48) | 0.34 (0.23) | 0.60 (0.68) | 0.29 (0.38) | 0.49 (0.43) | 0.73 (0.97) |
| Singer2 | 0.21 (0.10) | 0.08 (0.08) | 0.20 (0.20) | 0.51 (0.48) | 0.17 (0.16) | — | 0.05 (0.03) | 0.04 (0.04) | 0.04 (0.04) | 0.42 (0.45) | 0.38 (0.50) |
| Football | 0.49 (0.41) | 0.61 (0.79) | — | 0.59 (0.74) | 0.49 (0.59) | 0.66 (0.84) | 0.34 (0.37) | 0.56 (0.72) | 0.55 (0.67) | 0.56 (0.77) | 0.62 (0.82) |
| Lemming | 0.53 (0.59) | 0.55 (0.68) | 0.31 (0.41) | 0.65 (0.81) | 0.14 (0.17) | 0.41 (0.47) | — | 0.14 (0.17) | 0.14 (0.17) | 0.44 (0.49) | — |
| Liquor | 0.52 (0.58) | 0.20 (0.21) | 0.33 (0.37) | 0.22 (0.20) | 0.32 (0.32) | 0.22 (0.23) | 0.45 (0.48) | 0.23 (0.21) | 0.20 (0.32) | 0.50 (0.58) | — |
| Skating1 | 0.20 (0.23) | 0.09 (0.10) | 0.13 (0.12) | 0.13 (0.10) | 0.47 (0.42) | 0.14 (0.16) | 0.40 (0.34) | 0.07 (0.08) | 0.10 (0.13) | — | 0.48 (0.52) |
| Shaking | 0.39 (0.40) | 0.10 (0.04) | 0.08 (0.07) | 0.43 (0.23) | — | 0.64 (0.83) | 0.01 (0.01) | 0.03 (0.01) | 0.08 (0.04) | 0.44 (0.53) | 0.60 (0.74) |
| Walking | 0.45 (0.38) | 0.52 (0.50) | 0.54 (0.51) | 0.55 (0.54) | — | 0.56 (0.55) | 0.54 (0.48) | — | — | 0.61 (0.81) | — |
| Soccer | 0.12 (0.11) | 0.12 (0.13) | 0.08 (0.14) | 0.25 (0.20) | 0.27 (0.18) | 0.17 (0.20) | 0.09 (0.08) | 0.13 (0.16) | 0.24 (0.20) | 0.35 (0.21) | — |
| Basketball | 0.11 (0.07) | 0.30 (0.18) | 0.62 (0.61) | 0.24 (0.26) | 0.07 (0.06) | 0.28 (0.35) | 0.04 (0.05) | 0.17 (0.08) | 0.20 (0.10) | — | 0.32 (0.35) |
ACLE (STR). The best two results are shown in bold and italic fonts. The results of some algorithms come from the visual tracker benchmark [
| Sequence | TLD | CT | Frag | MIL | SCM | DFT | OAB | IVT | VTD | RET | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Deer | 31 (0.73) | 246 (0.04) | 105 (0.21) | 101 (0.13) | 104 (0.03) | 99 (0.31) | — | 183 (0.03) | 24 (0.72) | 135 (0.04) | — |
| David | — | 10 (0.43) | 82 (0.12) | 17 (0.23) | 4 (0.92) | 43 (0.23) | 22 (0.15) | 5 (0.80) | 14 (0.70) | 12 (0.68) | — |
| Faceocc | 28 (0.83) | 26 (0.85) | — | 30 (0.77) | — | 24 (0.80) | 25 (0.91) | — | — | 20 (0.93) | — |
| Faceocc2 | 12 (0.83) | 19 (0.74) | 16 (0.76) | 14 (0.94) | 9 (0.87) | — | 20 (0.75) | 7 (0.91) | 13 (0.80) | — | — |
| Girl | 10 (0.76) | 19 (0.18) | 21 (0.54) | 14 (0.29) | 3 (0.88) | 24 (0.25) | — | 22 (0.19) | 9 (0.65) | 19 (0.84) | 17 (0.90) |
| Sylvester | 9 (0.83) | 15 (0.69) | 15 (0.55) | — | 45 (0.41) | 15 (0.68) | 34 (0.68) | 26 (0.43) | 20 (0.80) | 12 (0.80) | 12 (0.80) |
| Tiger1 | 49 (0.46) | 30 (0.25) | 74 (0.31) | 109 (0.10) | 93 (0.13) | 41 (0.68) | 95 (0.10) | 107 (0.09) | 58 (0.35) | 108 (0.12) | — |
| Tiger2 | 38 (0.17) | 28 (0.37) | 114 (0.12) | 27 (0.45) | 141 (0.11) | — | 252 (0.14) | 105 (0.08) | 65 (0.27) | 41 (0.17) | — |
| Coke | — | 40 (0.10) | 125 (0.03) | 47 (0.12) | 71 (0.09) | 36 (0.17) | 83 (0.13) | 50 (0.20) | 69 (0.14) | 13 (0.23) | 13 (0.22) |
| Singer1 | — | 16 (0.25) | 89 (0.22) | 16 (0.28) | — | 19 (0.48) | 13 (0.23) | 11 (0.68) | 53 (0.38) | 4 (0.43) | 7 (0.97) |
| Singer2 | 58 (0.10) | 127 (0.08) | 89 (0.20) | 23 (0.48) | 113 (0.16) | — | 186 (0.03) | 176 (0.04) | 181 (0.04) | 44 (0.45) | 82 (0.50) |
| Football | 14 (0.41) | 11 (0.79) | — | 12 (0.74) | 16 (0.59) | 9 (0.84) | 72 (0.37) | 14 (0.72) | 15 (0.67) | 14 (0.77) | 7 (0.82) |
| Lemming | 16 (0.59) | 32 (0.68) | 127 (0.41) | 12 (0.81) | 186 (0.17) | 78 (0.47) | — | 182 (0.17) | 178 (0.17) | 79 (0.49) | — |
| Liquor | 38 (0.58) | 186 (0.21) | 100 (0.37) | 142 (0.20) | 99 (0.32) | 221 (0.23) | 69 (0.48) | 119 (0.21) | 213 (0.32) | 60 (0.58) | — |
| Skating1 | 145 (0.23) | 150 (0.10) | 149 (0.12) | 139 (0.10) | 16 (0.42) | 174 (0.16) | 43 (0.34) | 247 (0.08) | 159 (0.13) | — | 27 (0.52) |
| Shaking | 37 (0.40) | 80 (0.04) | 192 (0.07) | 24 (0.23) | — | 26 (0.83) | 192 (0.01) | 86 (0.01) | 110 (0.04) | 49 (0.53) | 15 (0.74) |
| Walking | 10 (0.38) | 7 (0.50) | 9 (0.51) | 3 (0.54) | — | 6 (0.55) | 5 (0.48) | — | — | 6 (0.81) | — |
| Soccer | 80 (0.11) | 80 (0.13) | 111 (0.14) | 37 (0.20) | 38 (0.18) | 129 (0.20) | 92 (0.08) | 201 (0.16) | 85 (0.20) | 21 (0.21) | — |
| Basketball | 137 (0.07) | 57 (0.18) | 13 (0.61) | 82 (0.26) | 294 (0.06) | 120 (0.35) | 143 (0.05) | 180 (0.08) | 115 (0.10) | — | 93 (0.35) |
Tables
We also plot some tracking results of 11 trackers on 12 video sequences for qualitative comparison as shown in Figure
Comparison of our method with 11 trackers on 12 image sequences with challenging situations (panels: Singer2, Liquor, David, Deer, Sylvester, Football, Singer1, Faceocc1, Skating1, Faceocc2, Tiger1, and Shaking).
As the results on the sequences with scale variation and shape change show, our tracker can handle appearance variation better than the other methods.
Based on RET, we propose a novel adaptive classifier update method. Instead of the traditional model update, our method takes appearance variation and possible occlusion into consideration at each time step. We estimate the appearance variation using sparse optical flow and the possible occlusion using a sparsity-based occlusion estimation method, and we combine the two estimates to adaptively update the classifier and the model. Extensive experiments and evaluations against several state-of-the-art methods on challenging video sequences demonstrate the effectiveness and robustness of the proposed algorithm. In the future, we are interested in extracting several kinds of features and fusing them. We will also explore the integration of a strong motion prediction model to address the sensitivity of tracking-by-detection methods to gating parameters.
The authors declare that there is no conflict of interest regarding the publication of this paper.
This work was supported in part by the Natural Science Foundation of China (nos. 61272195, 61472055, and U1401252), the Program for New Century Excellent Talents in University of China (NCET-11-1085), the Chongqing Outstanding Youth Fund (cstc2014jcyjjq40001), and the Chongqing Research Program of Application Foundation and Advanced Technology (cstc2012jjA40036).