Visual Tracking Based on an Improved Online Multiple Instance Learning Algorithm

An improved online multiple instance learning (IMIL) algorithm for visual tracking is proposed. In the IMIL algorithm, the importance of each instance's contribution to a bag probability is determined by the instance's own probability. A selection strategy based on an inner product is presented to choose weak classifiers from a classifier pool, which avoids computing the instance probabilities and bag probability M times. Furthermore, a feedback strategy is presented to update the weak classifiers. In the feedback update strategy, different weights are assigned to the tracking result and the template according to the maximum classifier score. Finally, the presented algorithm is compared with other state-of-the-art algorithms. The experimental results demonstrate that the proposed tracking algorithm runs in real time and is robust to occlusion and appearance changes.


Introduction
The purpose of visual tracking is to estimate the state of a target from frame to frame. It has been widely studied in computer vision for applications such as surveillance, robot navigation, and human-computer interaction [1]. Although various tracking algorithms have been proposed [2][3][4][5][6][7][8][9][10][11], visual tracking is still a very challenging task due to appearance changes caused by pose, illumination, occlusion, and motion [1]. An effective appearance model is important in the tracking process. The appearance model represents a target by using information extracted from the target region. Since color features are readily accessible from an image, they are widely used to model a target's appearance [3]. One of the best methods for color-based visual tracking is the mean shift (MS) algorithm [4,5]. In color-based object tracking, the color histogram of the target is learned in the first frame, and the target is then searched for in the subsequent frames by measuring the similarity between two histograms. The MS algorithm has been successfully used in visual tracking due to its simplicity, efficiency, and robustness [4,5]. However, the color histogram can only describe the color distribution in a region and ignores the spatial distribution. Therefore, its representation ability declines significantly when there are partial occlusions or similar objects. As a result, a color-based tracker may wrongly select a "bad" candidate as the tracking result. To overcome this drawback, parts-based color histograms were adopted [6]. The method is robust to occlusions as it combines the color histograms of multiple parts. Recently, compressed sensing theory has attracted much attention in computer vision, for example, in face recognition [7], denoising and in-painting [8], and visual tracking [9][10][11][12]. It has been demonstrated that a compressed sensing based tracking method can cope with partial occlusions, illumination changes, and pose variations [11].
The generative appearance models mentioned above are learned in the first frame. Therefore, they cannot deal with major appearance changes as the track evolves. Discriminative algorithms, which adapt the appearance model as the track evolves, were proposed to address this [13]. The main idea of these appearance models is to treat tracking as a binary classification task in which a classifier is continuously updated [14]. The best way to implement this task is to design appearance models for the target and background [14] and then learn a discriminative classifier according to the models. Grabner [15] proposed an online boosting algorithm to learn a discriminative appearance model for visual tracking. The tracker took the current tracking location as one positive example and extracted negative samples from the neighborhood around the tracking location. However, the results often drifted away because the appearance model was updated with a suboptimal positive example. Zhang et al. [16] implemented a real-time compressed tracking algorithm. Multiple positive and negative examples were cropped around the current tracking location to update a discriminative classifier. However, the ambiguity problem may occur and confuse the classifier [17]. To deal with this problem, Grabner et al. [17] proposed a semisupervised boosting approach where only the samples from the first frame were labeled and the subsequent training examples were left unlabeled. The method achieved good performance even when the target disappeared from the camera's field of view. However, the tracker failed when there was large interframe motion.
The methods of tracking-by-detection have progressed in the past few decades. These methods train a classifier by using positive and negative samples extracted from the current frame. In the tracking process, some candidates around the old object location are extracted, and the trained classifier is employed to find the sample with the maximum classifier score. However, Viola et al. [18] argued that the inherent ambiguities in object detection cause difficulty for traditional learning methods. To handle the problem, the multiple instance learning (MIL) method was proposed for object detection [18,19] and object recognition [20][21][22]. Studies show that the technique can cope with partial occlusions and ambiguities [22,23]. The basic idea of the MIL method is that training examples are grouped into bags, each of which contains several instances [14]. The bags, rather than the individual instances, are labeled. A bag is positive if at least one of its instances is positive. In contrast, a bag is negative when all of its instances are negative. Normally, the positive bag is constructed from the instances sampled around the labeled object. The MIL based tracking method can cope with the ambiguity problem by finding the most "correct" instance in each positive bag [19]. However, the MIL trackers still have some shortcomings, for example, a heavy computational load, tracking drift, and failure to handle appearance variations. To address these problems, Zhang and Song [24] proposed an improved MIL tracker (WMIL) by weighting instance probabilities. Zhang and Song used an efficient online approach to maximize the bag likelihood function, which resulted in robust and fast tracking. However, when a bag probability is computed, the included instances are assigned different weights based on their distance to the previous tracking location. As a result, the target will be lost when drift occurs, because the real instance far away from the previous tracking location is assigned a small weight. Xu et al. [25] presented an efficient MIL tracker by using a Fisher information criterion, which yielded robust performance. Zhou et al. [26] implemented online visual tracking by using MIL with instance significance estimation. The method assigned each instance a significance coefficient that represents its contribution to the bag likelihood. The method dealt with the drift problem to some extent. In general, these MIL trackers successfully deal with slight appearance variations and achieve robust and fast tracking. However, these trackers fail to handle strong ambiguity, for example, large appearance variations and tracking drift.
In this study, to deal with the above problems, we propose an improved online multiple instance learning algorithm (IMIL) for object tracking. The IMIL algorithm has advantages over the other MIL trackers in terms of efficiency, accuracy, and robustness. First, to reduce the computational cost, an inner product based selection method is employed to choose the weak classifiers from a classifier pool. The selection strategy avoids computing the instance probabilities and bag probability M times. Consequently, compared with the MIL, the IMIL runs in real time. Second, to generate a bag probability, the importance of each instance is determined by its probability. Compared with the WMIL tracker, the strategy avoids introducing interference from the background and thus deals with the drift problem. Third, to cope with variations in illumination and pose, a feedback strategy is proposed to update the parameters of the classifiers. In the strategy, the learning rate is tuned by considering the relationship between the maximum similarity score and two given thresholds after detecting the object. Therefore, using the proposed update strategy, the IMIL handles variations in appearance and illumination.
The rest of this paper is organized as follows. Section 2 details the proposed tracking algorithm; Section 3 presents the experimental results of the proposed tracker compared with other trackers; the paper concludes with a short summary in Section 4.

The IMIL Based Tracking Algorithm
An improved online multiple instance learning algorithm is proposed for object tracking in video sequences. The basic flow of the IMIL algorithm is shown in Figure 1. Each instance is represented by a set of compressed Haar-like features [14,24]. Let $l_t(x) \in \mathbb{R}^2$ denote the location of instance $x$ in the $t$th frame. It is assumed that the tracker can detect the object at the location $l_t^*$. Once the object is detected, a set of instances $X^{\alpha} = \{x : \|l_t(x) - l_t^*\| < \alpha\}$ is cropped to construct a positive bag. In contrast, the instances for the negative bag are cropped from an annular region $X^{\zeta,\beta} = \{x : \zeta < \|l_t(x) - l_t^*\| < \beta\}$, $\zeta < \beta$. The extracted positive and negative bags are used to train weak classifiers. Then, a strong classifier is generated by selecting the weak classifiers with stronger classifying ability. For the $(t+1)$th frame, we assume that the object appears within a radius $s$ of the tracking location in the $t$th frame. Thus, candidate samples $X^{s} = \{x : \|l_{t+1}(x) - l_t^*\| < s\}$, $\alpha < s < \beta$, are cropped within a search radius $s$ centered at the previous location $l_t^*$. The learned strong classifier estimates $p(y = 1 \mid x)$ for all samples $x \in X^{s}$ and updates the tracking location according to $l_{t+1}^* = l(\arg\max_{x \in X^{s}} p(y = 1 \mid x))$. Furthermore, the classifier updates the appearance model after tracking a new location.
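The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `crop_locations`, the constant names, and all radius values are hypothetical, and every integer pixel location in each region is enumerated, whereas a practical tracker would typically subsample the negative region.

```python
import math

def crop_locations(center, r_max, r_min=None):
    """Integer pixel locations l with ||l - center|| < r_max.

    If r_min is given, only locations with ||l - center|| > r_min are kept,
    which yields an annular region.
    """
    cx, cy = center
    reach = int(math.ceil(r_max))
    locations = []
    for dx in range(-reach, reach + 1):
        for dy in range(-reach, reach + 1):
            dist = math.hypot(dx, dy)
            if dist < r_max and (r_min is None or dist > r_min):
                locations.append((cx + dx, cy + dy))
    return locations

# Hypothetical radii in pixels (illustrative values only).
POSITIVE_RADIUS = 4      # radius of the positive bag
NEGATIVE_INNER = 8       # inner radius of the negative annulus
NEGATIVE_OUTER = 38      # outer radius of the negative annulus
SEARCH_RADIUS = 25       # search radius used in the next frame

previous_location = (100, 100)
positive_bag = crop_locations(previous_location, POSITIVE_RADIUS)
negative_bag = crop_locations(previous_location, NEGATIVE_OUTER,
                              r_min=NEGATIVE_INNER)
candidates = crop_locations(previous_location, SEARCH_RADIUS)
```

Because the inner radius of the annulus exceeds the positive radius in this sketch, the two bags never share a location, which keeps the labels unambiguous.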

Online MIL Boosting Algorithm.
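Babenko et al. [14] proposed an online MIL Boosting method for visual tracking. The details of the MIL Boosting tracker are shown in Algorithm 1. The training data have the form $\{(X_1, y_1), \ldots, (X_n, y_n)\}$, where $X_i = \{x_{i1}, \ldots, x_{im}\}$ is a bag constructed by instances and $y_i$ is the bag label according to $y_i = \max_j (y_{ij})$. The key of the MIL Boosting method is to design a strong classifier
$$H(x) = \sum_{k} h_k(x),$$
where $h_k$ is a weak classifier relating to a sampled instance (represented by Haar-like feature $f_k$) in the positive bag. The weak classifier $h_k(\cdot)$ is assumed to be Gaussian distributed with four parameters $(\mu_1, \sigma_1, \mu_0, \sigma_0)$. The weak classifier returns the log odds ratio:
$$h_k(x) = \log \frac{p(y = 1 \mid f_k(x))}{p(y = 0 \mid f_k(x))}.$$
When the strong classifier receives new data $\{(X_1, y_1), \ldots, (X_n, y_n)\}$, the instance probability is modeled as
$$p(y_i \mid x_{ij}) = \sigma(H(x_{ij})),$$
where $\sigma(z) = 1/(1 + \exp(-z))$ is the sigmoid function. Then, the $i$th bag probability is modeled by adopting the Noisy-OR model [18]:
$$p(y_i \mid X_i) = 1 - \prod_{j} \left(1 - p(y_i \mid x_{ij})\right).$$
The MIL tracker learns a pool of $M$ candidate weak classifiers. Then, $K$ $(< M)$ weak classifiers are selected from the candidate pool by maximizing the log likelihood of bags:
$$h_k = \arg\max_{h \in \{h_1, \ldots, h_M\}} \mathcal{L}(H_{k-1} + h),$$
where $\mathcal{L} = \sum_i \left(y_i \log(p_i) + (1 - y_i) \log(1 - p_i)\right)$ is the bag log-likelihood function and $H_{k-1} = \sum_{j=0}^{k-1} h_j$. In Algorithm 1, each candidate $m = 1, \ldots, M$ is evaluated through $p_{ij}^m = \sigma(H_{ij} + h_m(x_{ij}))$, the best candidate is $m^* = \arg\max_m \mathcal{L}^m$, and the selected weak classifier is set by $h_k(x) \leftarrow h_{m^*}(x)$. Finally, the selected weak classifiers generate a strong classifier to determine the new tracking location with the maximum score.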
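The Noisy-OR bag probability and the greedy likelihood-based selection can be illustrated with the sketch below. It assumes that the responses of every candidate weak classifier on every instance are already available as plain nested lists; the function names are hypothetical, and skipping candidates that were already chosen is an assumption of this sketch rather than something stated in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def noisy_or(instance_scores):
    """Bag probability 1 - prod_j (1 - sigmoid(H(x_ij)))."""
    miss = 1.0
    for score in instance_scores:
        miss *= 1.0 - sigmoid(score)
    return 1.0 - miss

def bag_log_likelihood(bag_scores, labels, eps=1e-12):
    """Sum over bags of y*log(p) + (1 - y)*log(1 - p)."""
    total = 0.0
    for scores, y in zip(bag_scores, labels):
        p = min(max(noisy_or(scores), eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def greedy_select(pool, labels, k):
    """pool[m][i][j] holds the response h_m(x_ij); returns chosen indices."""
    strong = [[0.0] * len(bag) for bag in pool[0]]  # running scores H(x_ij)
    chosen = []
    for _ in range(min(k, len(pool))):
        best_m, best_ll, best_scores = None, -math.inf, None
        for m, responses in enumerate(pool):
            if m in chosen:  # assumption: no candidate is picked twice
                continue
            trial = [[s + r for s, r in zip(strong_bag, resp_bag)]
                     for strong_bag, resp_bag in zip(strong, responses)]
            ll = bag_log_likelihood(trial, labels)
            if ll > best_ll:
                best_m, best_ll, best_scores = m, ll, trial
        chosen.append(best_m)
        strong = best_scores
    return chosen
```

Each selection round re-evaluates all instance and bag probabilities once per remaining candidate, which is the repeated cost that the text attributes to the MIL tracker.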

Computational Intelligence and Neuroscience
After detecting the object in the new frame, the parameters of the weak classifiers are updated online as a weighted combination of the template and the tracking result, where the weight is a learning rate. It specifies the relative importance of the tracking result and the template. The MIL Boosting algorithm deals with the suboptimal problem included in the weak classifiers. However, computing the bag probability with the Noisy-OR model makes the MIL tracker prone to selecting less effective features, which confuses the classifier [27]. Another disadvantage of the MIL tracker is that all of the instance probabilities and the bag probabilities must be updated $M$ times after selecting a feature, which results in a heavy computational load [24]. Moreover, the weak classifiers update their parameters with a fixed learning rate, which makes the tracker sensitive to occlusion and to variations in illumination and appearance.
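A Gaussian weak classifier with an online update might look like the following sketch. The exact update equations are not reproduced in the text, so the linear blend of old parameters with new sample statistics is an assumption of this example, as are the class name, the method names, and the convention that `result_weight` multiplies the statistics of the new tracking result while `1 - result_weight` multiplies the stored template parameters.

```python
import math

class GaussianWeakClassifier:
    """Weak classifier with four parameters (mu1, sigma1, mu0, sigma0)."""

    def __init__(self, mu1=0.0, sigma1=1.0, mu0=0.0, sigma0=1.0):
        self.mu1, self.sigma1 = mu1, sigma1
        self.mu0, self.sigma0 = mu0, sigma0

    @staticmethod
    def _log_gaussian(f, mu, sigma):
        # The constant -0.5*log(2*pi) is dropped; it cancels in the log odds.
        return -math.log(sigma) - 0.5 * ((f - mu) / sigma) ** 2

    def log_odds(self, f):
        """log p(f | y=1) - log p(f | y=0), assuming equal class priors."""
        return (self._log_gaussian(f, self.mu1, self.sigma1)
                - self._log_gaussian(f, self.mu0, self.sigma0))

    @staticmethod
    def _blend(old, new, result_weight):
        return (1.0 - result_weight) * old + result_weight * new

    def update(self, pos_features, neg_features, result_weight,
               min_sigma=1e-6):
        """Blend stored parameters with statistics of the new samples."""
        pos_mean = sum(pos_features) / len(pos_features)
        pos_std = math.sqrt(sum((f - pos_mean) ** 2 for f in pos_features)
                            / len(pos_features))
        neg_mean = sum(neg_features) / len(neg_features)
        neg_std = math.sqrt(sum((f - neg_mean) ** 2 for f in neg_features)
                            / len(neg_features))
        self.mu1 = self._blend(self.mu1, pos_mean, result_weight)
        self.mu0 = self._blend(self.mu0, neg_mean, result_weight)
        # A small floor keeps the log-density well defined.
        self.sigma1 = max(self._blend(self.sigma1, pos_std, result_weight),
                          min_sigma)
        self.sigma0 = max(self._blend(self.sigma0, neg_std, result_weight),
                          min_sigma)
```

With a fixed `result_weight`, every frame changes the classifier by the same proportion regardless of how reliable the tracking result is, which is the sensitivity the paragraph above points out.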

Bag Probability.
To overcome the drawbacks mentioned above, the IMIL tracker is proposed. It is assumed that the bag probability depends on the instance probabilities with equal weights [28]. The importance of each instance's contribution to a bag probability therefore depends mainly on the instance probability: the bigger the instance probability is, the more it contributes to the bag probability. We use an equally weighted sum of instance probabilities to yield superior performance; that is, the positive bag probability is defined as the equally weighted sum of the probabilities of the instances in the positive bag. Compared with the Noisy-OR model and the WMIL tracker, our method computes the bag probability according to each instance probability. In the WMIL tracker, the positive instances are assigned weights according to the Euclidean distances between the location of each instance and the current tracking location, and the bag probability is $p(y = 1 \mid X^{+}) = \sum_{j=0}^{N-1} w_{j0}\, p(y_1 = 1 \mid x_{1j})$ [24]. The nearer the distance is, the more important the instance probability is. However, if the tracking location $x_{10}$ drifts away from the real target position (see Figure 2), an instance on the real target that is far from the current tracking location will be assigned a smaller weight than an instance near the tracking location, simply because of their distances. As a result, the near instance contributes more to the bag probability than the far instance, which is contrary to the fact. The obtained bag probability will then contain information from the background and finally lead to a tracking failure. In our method, the importance of an instance is determined by its probability. The instance whose probability is bigger will contribute more to the bag probability. Using this strategy, the tracker achieves superior performance.

The MIL tracker selects the weak classifiers with stronger classifying ability from a weak classifier pool $\{h_1, \ldots, h_M\}$ by maximizing the log-likelihood function. To realize real-time visual tracking, a more efficient criterion based on the inner product is employed to select weak classifiers from the classifier pool [24]. Using this criterion, the instance probabilities and bag probabilities do not need to be computed $M$ times after selecting one weak classifier. Inspired by this, we select the weak classifier $h_k$ as follows:
$$h_k = \arg\max_{h \in \{h_1, \ldots, h_M\}} \langle h, \nabla \mathcal{L}(H) \rangle \big|_{H = H_{k-1}},$$
where $\langle h, \nabla \mathcal{L}(H) \rangle = \frac{1}{N + L} \sum_{j=0}^{N+L-1} h(x_j)\, \nabla \mathcal{L}(H)(x_j)$ is averaged over the $N + L$ training instances, the gradient term contains the factor $(1 - \sigma(H(x_j)))$, and $p(y \mid x) = \sigma(H(x))$. The IMIL method is shown in Algorithm 2.
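The two ideas in this subsection can be illustrated with the sketch below. Several parts are assumptions made for illustration only: the equally weighted sum is normalized by the number of instances so that it stays in [0, 1], the distance-based weights follow an exponential decay that stands in for the WMIL weighting, and the gradient of the log-likelihood is passed in as a precomputed list because its exact form is not reproduced here. All function names are hypothetical.

```python
import math

def equal_weight_bag_probability(instance_probs):
    """Equally weighted sum of instance probabilities (normalized)."""
    return sum(instance_probs) / len(instance_probs)

def distance_weights(distances):
    """Assumed distance-based weights: exp(-d), normalized to sum to 1."""
    raw = [math.exp(-d) for d in distances]
    total = sum(raw)
    return [w / total for w in raw]

def contribution_shares(instance_probs, weights):
    """Fraction of the bag probability contributed by each instance."""
    terms = [w * p for w, p in zip(weights, instance_probs)]
    total = sum(terms)
    return [t / total for t in terms]

def select_by_inner_product(pool_responses, gradient):
    """Pick the candidate h_m maximizing (1/n) * sum_j h_m(x_j) * g_j.

    pool_responses[m][j] is h_m(x_j) and gradient[j] is the derivative of
    the bag log-likelihood with respect to H(x_j), computed once per round.
    """
    n = len(gradient)
    scores = [sum(h * g for h, g in zip(responses, gradient)) / n
              for responses in pool_responses]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Drift scenario: two background instances near the (drifted) tracking
# location and one instance on the real target, far from that location.
probs = [0.2, 0.2, 0.9]
dists = [0.0, 1.0, 6.0]
equal_shares = contribution_shares(probs, [1.0 / len(probs)] * len(probs))
distance_shares = contribution_shares(probs, distance_weights(dists))
```

In this toy scenario the far instance dominates the bag probability under equal weights but is almost ignored under distance-based weights. The inner-product criterion needs one gradient evaluation per round, after which each candidate costs only a single dot product.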

Weak Classifiers.
The weak classifiers of the MIL update their four parameters with fixed weights [16], which may introduce errors when the tracking results are inaccurate. Normally, the tracking process is formulated as a search problem that aims to select the candidate with the maximum classifier score ($H_{\max}$). Usually, the most "correct" samples detected by the tracker are similar in most frames. However, when there is an occlusion or a large interframe motion, the update strategy will introduce errors from the background and lead to suboptimal results in the remaining frames. Consequently, drift will occur or the target will be lost.
To avoid the problems mentioned above, a feedback strategy is proposed for updating the classifiers. Our goal is to detect the candidate area that is not only the most similar to the foreground but also the most dissimilar to the background.

Algorithm 2: Online IMIL.
The maximum classifier score $H_{\max}$ obtained by applying the strong classifier denotes the similarity between the tracking result and the target. In our method, $TH_h$ and $TH_l$ are set as the higher threshold and the lower threshold, respectively. To keep the classifier far from the background, the update strategy changes the learning rate $\eta$ according to the relationship between $H_{\max}$ and $TH_h$ or $TH_l$. In this strategy, $\eta$ specifies the importance of the tracking result, while $1 - \eta$ specifies the importance of the template. $H_{\max} > TH_h$ means that the candidate area is more similar to the template (foreground), while $H_{\max} < TH_l$ means that the candidate area is more similar to the background. Considering the relationships between $H_{\max}$ and the two thresholds, two weights are assigned to the tracking result and the template to specify their importance when the classifier is updated. The feedback strategy can deal with appearance changes and avoid excessive updates as the track evolves. When the target is successfully tracked, the maximum classifier score $H_{\max}$ is greater than $TH_h$, which means that the similarity between the tracking result and the template is very high. Then, the weight of the tracking result is increased to its largest value to cope with appearance variations. In other words, the new classifier depends mainly on the tracking result. When there is a serious occlusion or the target is lost, the most "correct" candidate contains a large amount of background information. The maximum classifier score $H_{\max}$ is then less than the lower threshold $TH_l$, which means that the similarity between the tracking result and the template is very low. In this case, the weight of the tracking result should be small to avoid introducing more errors into the classifier. As a result, the new classifier depends mainly on the template. When there are partial occlusions or some appearance variations, $H_{\max}$ lies between the lower threshold and the higher one. In such a case, the classifier should be updated according to the tracking result and the template simultaneously.
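The three-way feedback rule can be sketched as a small function. The function names, the threshold arguments, and the three weight values are hypothetical placeholders chosen for illustration; the text does not state the actual thresholds or learning rates.

```python
def feedback_result_weight(max_score, th_low, th_high,
                           w_low=0.05, w_mid=0.15, w_high=0.30):
    """Weight given to the tracking result when updating the classifier.

    All default weights are illustrative assumptions, not reported values.
    """
    if th_low >= th_high:
        raise ValueError("th_low must be smaller than th_high")
    if max_score > th_high:
        # Confident tracking: rely mainly on the new tracking result.
        return w_high
    if max_score < th_low:
        # Likely occlusion or loss: rely mainly on the stored template.
        return w_low
    # Partial occlusion or moderate change: use both.
    return w_mid

def blend(template_value, result_value, result_weight):
    """Weighted combination of a template parameter and a new estimate."""
    return (1.0 - result_weight) * template_value + result_weight * result_value
```

For example, with thresholds 0.3 and 0.7, a maximum score of 0.1 selects the smallest weight, so a template parameter of 10.0 moves only slightly toward a new estimate of 20.0.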

Experiments
We compared the proposed object tracking algorithm (IMIL) with three of the latest trackers on six challenging video sequences. The three trackers are the online MIL Boosting tracker (MIL) [14], the WMIL tracker [24], and the significance-coefficients MIL tracker [26]. For the compared trackers, the binary code released by the authors is used [26]. The six video sequences are "David indoor" [24], "Occluded face" [24], "Tiger 2" [24], "Cliff bar" [24], "Coke can" [14], and "Coupon book" [26] (Figure 3). Using these video sequences, we evaluate the IMIL tracker's ability to handle illumination changes, occlusion, pose variations, and appearance changes. There are serious illumination variations in "David indoor" and "Coke can." In addition, the can moves fast in the "Coke can" video. In the sequences "Occluded face 2" and "Tiger 2," the face is often occluded by a book and the tiger is often occluded by leaves. Moreover, there are also pose changes in these video sequences. Pose variations exist in the video "Cliff bar," while the appearance of the dollar changes in the video "Coupon book." The proposed IMIL is implemented in MATLAB and run on a computer with a 2.33 GHz Core 2 CPU and 2 GB of RAM.

Parameters Setting.
For the online MIL Boosting tracker [14], the search radius is set to 35, and about 1000 samples are extracted for detecting the object location. Set [...] a strong classifier. The learning rate of the weak classifier is 0.85. For the WMIL tracker [24], set $s = 25$ to search the target; $\alpha = 4$ to crop the positive instances; $\zeta = 2\alpha$ and $\beta = 1.5s$ to determine the negative instances. The number of candidate weak classifiers is set to $M = 150$. Fifteen weak classifiers are selected from the classifier pool to generate a strong WMIL classifier. The learning parameter is again 0.85. For the significance-coefficients MIL [26], set $\alpha = 4$ and $\beta = 50$ to generate the positive and negative bags; the learning rate is set to 0.85. The number of candidate weak classifiers is $M = 150$, and 15 weak classifiers are selected for generating a strong classifier. For the IMIL tracker, set $s = 25$ for detecting the target; $\alpha = 4$ is set for sampling positive instances; $\zeta = 2\alpha$ and $\beta = 1.5s$ are set for extracting negative instances. To generate a strong classifier, 30 weak classifiers are selected from the classifier pool, which includes 120 weak classifiers.

Figure 3: Tracking object location by using MIL [14], WMIL [24], significance-coefficients MIL [26], and IMIL. The video sequences are "David indoor," "Occluded face," "Coke can," "Cliff bar," "Tiger 2," and "Coupon book" from top to bottom.

Tracking Object Location.
We perform our experiments on the above six video sequences. For all sequences, the images are converted into gray scale before processing. For each sequence, the classifier is learned in the first frame. Then, the locations in the subsequent frames are tracked by these trackers. For updating the classifier, the proposed IMIL algorithm tunes the weights of the tracking result and the template according to the maximum similarity in the tracking process. Therefore, compared with the MIL, WMIL, and significance-coefficients MIL trackers, the IMIL tracker achieves the best performance when there are appearance changes and pose and illumination variations. Furthermore, the IMIL method computes a bag probability according to the instance probabilities, so an instance contributes more to a bag probability if its probability is bigger. As a result, the IMIL tracker overcomes the tracking drift problem that exists in the WMIL when there is fast motion (e.g., "Coke can") in the tracking process.

Quantitative Analysis.
We employ the center location error and the overlap rate to evaluate the performance of our method. The center location error measures the distance between the center of the tracking result and the center of the ground truth. The overlap rate measures how much the tracking result overlaps the ground truth bounding box [29]. The overlap rate is defined as $\text{score} = \text{area}(R_T \cap R_G) / \text{area}(R_T \cup R_G)$, where $R_T$ is the region of the tracking result and $R_G$ is the ground truth bounding box. A tracking result whose overlap rate exceeds 50% is considered to be a correct detection.
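Both evaluation measures can be computed as in the sketch below. It assumes axis-aligned boxes given as `(x, y, width, height)` with `(x, y)` the top-left corner; this box format and the function names are assumptions of the example, not something specified in the text.

```python
import math

def center_location_error(box_t, box_g):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    cx_t, cy_t = box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0
    cx_g, cy_g = box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0
    return math.hypot(cx_t - cx_g, cy_t - cy_g)

def overlap_rate(box_t, box_g):
    """area(intersection) / area(union) of two (x, y, w, h) boxes."""
    left = max(box_t[0], box_g[0])
    top = max(box_t[1], box_g[1])
    right = min(box_t[0] + box_t[2], box_g[0] + box_g[2])
    bottom = min(box_t[1] + box_t[3], box_g[1] + box_g[3])
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(tracked, truth, threshold=0.5):
    """Fraction of frames whose overlap rate exceeds the threshold."""
    hits = sum(1 for t, g in zip(tracked, truth)
               if overlap_rate(t, g) > threshold)
    return hits / len(tracked)
```

For instance, two 10 x 10 boxes offset horizontally by 5 pixels have a center location error of 5 and an overlap rate of 50/150, so that frame would not count as a correct detection under the 50% criterion.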
The center location errors for all the trackers are shown in Figure 4. The average center location errors are detailed in Table 1. The smaller the average center location error is, the better the tracking algorithm performs. Bold indicates the best performance. The overlap rates for all the trackers are detailed in Table 2. Bold indicates the best performance. In the IMIL tracker, the bag probability is calculated according to the instance probabilities. When there is tracking drift or fast motion, the bag probability mainly depends on the instance whose probability is the largest. Therefore, the drift problem can be corrected. Moreover, the learning rate is tuned according to the tracking result. As a result, the IMIL handles occlusion and variations in illumination and pose. Therefore, compared with the other MIL based trackers,

Computational Cost.
In this section, we compare our method with MIL, WMIL, and significance-coefficients MIL in terms of computational cost. The computational cost of these algorithms is measured by the average computing time needed to process one image. The average computing time is defined as $t = t_{\text{all}} / n_{\text{frames}}$, where $t_{\text{all}}$ is the total computing time for processing the whole video sequence, $n_{\text{frames}}$ is the number of frames in the sequence, and $t$ is the resulting average computing time. The computational cost of these methods is affected by three main factors: the strategy for selecting weak classifiers, the number of weak classifiers, and the number of selected weak classifiers used to generate a strong classifier. The IMIL tracker has the lowest computational load for the following reasons: (1) the proposed criterion for selecting weak classifiers avoids computing the instance probabilities and bag probability $M$ times before choosing one weak classifier; (2) about 120 weak classifiers are learned and 30 weak classifiers are chosen for generating a strong classifier (for MIL and WMIL, the number of learned weak classifiers is 150 and that of the selected classifiers is 50). The average computational cost of the different algorithms on the six video clips is shown in Table 3. The results show that the IMIL requires less computing time.

Conclusion
In this paper, we presented an improved online multiple instance learning algorithm for visual tracking. A feedback scheme was used to update the parameters of the classifiers, which can handle the appearance changes caused by pose, illumination, and occlusion. We summed the instance probabilities with equal weights to generate a bag probability. The method avoids introducing information from the background and yields superior performance. A more efficient criterion was proposed to select weak classifiers, which avoids computing the instance probabilities and the bag probabilities M times after selecting one weak classifier. Finally, experiments on six challenging video sequences demonstrated that the proposed algorithm performs well in terms of efficiency, accuracy, and robustness.