Vision Target Tracker Based on Incremental Dictionary Learning and Global and Local Classification

Figure 2: Affine transformation from X to I.


Introduction
Visual target tracking in uncertain environments is an important component of computer vision [1]. In uncertain scenes, target quality is degraded mainly by occlusion, pose changes, significant illumination variations, and so on. Therefore, a discrimination method that is strongly robust to target and environment changes is required for accurate tracking. Visual target tracking can be treated as a binary classification problem between targets and backgrounds: a target candidate set is established by affine transformation, and a classifier is then used to discriminate the target from the candidate set [2]. Therefore, the classifier should not only discriminate targets well but also reject background features, and it should be robust to occlusions, pose changes, and illumination variations.
In this paper, an incremental tracking algorithm is proposed to resolve target appearance variation and occlusion problems. The system diagram is shown in Figure 1. The object is represented with global and local sparse representations, and the tracking task is formulated as a sparse representation binary classification problem with incremental dictionary learning. For object representation, targets are treated as positive samples, whereas backgrounds are treated as negative samples. Positive and negative samples are then used to establish the discriminatory dictionary, and the target appearance model is treated as a linear combination of the discriminatory dictionary and sparse coding. In the first frame, targets are affine-transformed into an affine transformation subspace, where the target is found with the minimum reconstruction error. A global classifier with sparse representation is established to determine the global region of the target from the center-point collection, while a sparse representation local classifier is used to find the target location within the global region. Because the appearance of the target and the external scene vary in real time, the dictionary needs to be updated with features of the next frame by incremental learning to keep the tracking result accurate.
The rest of the paper is organized as follows. Section 2 reviews the related works. Section 3 proposes the target motion model and the sparse representation global classifier and local classifier algorithm. Dictionary learning and incremental updating are introduced in Section 4. Section 5 reports the experimental results. Finally, the conclusions are summarized in Section 6.

Related Works
Currently, according to the target appearance model, target tracking methods can be divided into two categories: generative and discriminative model methods. The generative model method uses an appearance model to replace the observed target template; the tracking result is the area with the highest similarity to the appearance model. Examples are mean-shift [3] and incremental tracking [4]. In [4], in order to make the algorithm adapt effectively to real-time changes of target appearance, the target appearance models are incrementally learned by a group of low-dimensional subspaces. The discriminative model method casts tracking as a binary classification problem: tracking is formulated as finding the target location that accurately separates the target from the background. In [5], online multiple instance learning improves the robustness of the target tracking system to occlusion. In [6], visual object tracking is construed as a numerical optimization problem, and cluster analysis is applied to the sampled parameter space to redetect the object and renew the local tracker. In [7], an ensemble of weak classifiers is trained online to distinguish between the object and the background, and the weak classifiers are combined into a strong classifier using AdaBoost; the strong classifier is then used to label pixels in the next frame as belonging to either the object or the background. Some researchers have obtained stable target tracking systems by using a sparse representation classification model. In [8], a set of trivial templates is constructed to model occlusion and corruption, the target candidate is sparsely represented by target and trivial templates in new frames, and the candidate with the smallest projection error is taken as the target during tracking. In [9], sample-based adaptive sparse representation is proposed to address partial occlusion or deterioration; object representations are described as a sample basis by exploiting L1-norm minimization. In [10], a dynamic group sparsity and two-stage sparse optimization are proposed to jointly minimize the target reconstruction error and maximize the discriminative power. In [11], tracking is achieved by a minimum error bound and occlusion detection: the minimum error bound is calculated to guide particle resampling, and occlusion detection is performed by investigating the trivial coefficients in the L1-norm minimization.
In addition, to address the problem that the overcomplete dictionary cannot discriminate sufficiently, Xuemei utilized target templates to structure the dictionary $[T, I, -I]$ [9,11]; $I$ and $-I$ are unrelated unit matrices, and $T$ is a small set of target templates. In [12], a dictionary obtained by learning is shown to be more effective for sparse representation than a preset dictionary. In [13], a dictionary learning function is proposed in which the dictionary is obtained by the K-SVD algorithm and a linear classifier. In [14], the dictionary learning problem is treated as the optimization of a smooth nonconvex objective over convex sets, and an iterative online algorithm is proposed that solves this problem by efficiently minimizing, at each step, a quadratic surrogate function of the empirical cost over the set of constraints. On this basis, [15] proposes a new discriminative DL framework that employs the Fisher discrimination criterion to learn a structured dictionary.

Sparse Representation Global and Local Classifier
The motion model of the object is obtained through the transfer of the probability state of the affine transformation parameters. The function of the motion model is as follows:

$$P(X_{t+1} \mid X_t) = N(X_{t+1} \mid X_t, \sigma), \quad (1)$$

where $X_t$ denotes the affine parameters of the target state in frame $t$, $N(X_{t+1} \mid X_t, \sigma)$ is modeled independently by a Gaussian distribution, $\sigma$ is a diagonal covariance matrix, and the elements of the diagonal matrix are the variances of each of the affine parameters. $\{X^1_{t+1}, X^2_{t+1}, \dots, X^n_{t+1}\}$ is a group of affine parameter sets randomly generated by function (1) in the current frame, and $\{I^1_{t+1}, I^2_{t+1}, \dots, I^n_{t+1}\}$ is the set of areas where the target may occur (candidate image areas), constructed by affine transformation from $\{X^1_{t+1}, X^2_{t+1}, \dots, X^n_{t+1}\}$. The area of the target is then found among the candidate images by a sparse representation classifier, which is trained using the previous tracking results.
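The candidate generation described by function (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_candidate_states`, the six-parameter state layout, and all numeric values are assumptions chosen for the example.

```python
import random

def sample_candidate_states(prev_state, variances, n_candidates, seed=0):
    """Draw candidate affine states around the previous state.

    Each parameter is perturbed independently with its own variance,
    which corresponds to a Gaussian with a diagonal covariance matrix.
    """
    rng = random.Random(seed)
    stds = [v ** 0.5 for v in variances]
    return [
        [rng.gauss(mean, std) for mean, std in zip(prev_state, stds)]
        for _ in range(n_candidates)
    ]

# Hypothetical previous state: center x, center y, then four shape parameters.
prev_state = [160.0, 120.0, 1.0, 1.0, 0.0, 0.0]
# Hypothetical per-parameter variances (the diagonal of sigma).
variances = [16.0, 16.0, 1e-4, 1e-4, 1e-4, 1e-6]
candidates = sample_candidate_states(prev_state, variances, 600)
```

Each row of `candidates` would then be turned into a candidate image area by the affine warp, a step that is omitted here.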

Sparse Representation Classifier.
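Wright et al. [16] proposed the sparse representation-based classification (SRC) method for robust face recognition (FR). We denote $A = [A_1, A_2, \dots, A_c]$ as the set of original training samples, where $A_i$ is the subset of the training samples from class $i$, $c$ is the number of classes of subjects, and $y$ is a testing sample. The procedures of the sparse representation classifier are as follows. First, $y$ is sparsely coded on $A$ via L1-norm minimization:

$$\hat{\alpha} = \arg\min_{\alpha} \left\{ \|y - A\alpha\|_2^2 + \gamma \|\alpha\|_1 \right\}, \quad (2)$$

where $\gamma$ is a scalar constant. Classification is then performed via

$$\operatorname{identity}(y) = \arg\min_i \{e_i\}, \quad (3)$$

where $e_i = \|y - A_i \hat{\alpha}_i\|_2$, $\hat{\alpha} = [\hat{\alpha}_1; \hat{\alpha}_2; \dots; \hat{\alpha}_c]$, and $\hat{\alpha}_i$ is the coefficient vector associated with class $i$.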
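The two SRC steps, sparse coding followed by class-wise residual comparison, can be sketched as below. This is an illustrative sketch only: the L1 problem is solved with plain ISTA (iterative soft thresholding), which is one of several possible solvers and not necessarily the one used in [16]; the function names and the toy data are assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code(A, y, gamma, n_iter=500):
    """Minimize ||y - A a||_2^2 + gamma * ||a||_1 with ISTA."""
    # The gradient of the quadratic term has Lipschitz constant 2 * ||A||_2^2.
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    a = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ a - y)
        a = soft_threshold(a - step * grad, step * gamma)
    return a

def src_classify(A, labels, y, gamma=0.01):
    """Return the class with the smallest class-wise reconstruction residual."""
    a = sparse_code(A, y, gamma)
    classes = np.unique(labels)
    residuals = [
        np.linalg.norm(y - A[:, labels == c] @ a[labels == c]) for c in classes
    ]
    return classes[int(np.argmin(residuals))], residuals

# Toy data (hypothetical): two training samples per class, stored as columns.
A = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9],
              [0.0, 0.1, 0.0, 0.1]])
A = A / np.linalg.norm(A, axis=0)          # unit-norm columns
labels = np.array([0, 0, 1, 1])
y = np.array([0.95, 0.05, 0.02])
y = y / np.linalg.norm(y)
label, residuals = src_classify(A, labels, y)
```

In the toy data the test sample lies close to the class 0 training samples, so the class 0 residual is the smaller one.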

Sparse Representation Global and Local Classifier.
We divide the target state $X_t = (x, y, s, r, \theta, \lambda)$ into two parts: the center coordinates and the local state. Each candidate can be obtained from the affine transformation of $X_t$; usually, the number of candidates is very large, and the computational cost of discriminating the whole candidate set directly is the key issue. Every subset of local states in the candidate set can be obtained by the affine transformation of one center-point element, as illustrated in Figure 3. In order to reduce the computational cost, the center-point element is searched first, and the local-state element is then searched in the subset mapped from that center-point element. However, this implies training two classifiers, one for the center-point set and one for the collection of local-state subsets, which raises the computational cost of the classifiers again. In sparse representation classifier models, updating the overcomplete dictionary serves to train multiple classifiers at once and so reduces the computational cost of training.
The center-point set is constructed by affine transformation of the center coordinates, and most of its elements contain numerous negative sample features. Figure 3 shows that the classification algorithm for this set is equivalent to a sparse representation global classifier, namely, SRGC. Considering that the target in two consecutive frames has maximum likelihood, the nonzero entries in the sparse coding are concentrated in the same positions; we therefore add a constraint $\|\alpha_{GC} - \alpha_t\|_2^2$ to the reconstruction error function to keep the sparse coding in the current frame close to that of the prior tracking result. The metric for classification is defined from this constrained reconstruction error, where $\gamma$ is a scalar constant, $\mu$ is a preset weight coefficient, $\alpha_{GC}$ is the coding coefficient vector for determining the location of the center coordinates, and $\alpha_t$ is the coefficient vector of the prior tracking result. The sparse representation global classifier is made by (3).
In the current frame, the target center point is fixed, and the set of target local states is then constructed by affine transformation from the last tracking result, as illustrated in Figure 4. Taking into account that most target features are contained in the elements of this set, we consider that discrimination in this part is equivalent to a sparse representation local classifier, namely, SRLC. The objective function can be transformed into a coding function over the local dictionary by adding the constraint $\|\alpha_{LC} - \alpha_t\|_2^2$. We also add the coding discriminant fidelity term $\|\alpha_{LC}\|_1$ to the reconstruction error function to ensure that the target has the sparsest coding in the local dictionary. The function for classification is defined from these terms, and the sparse representation local classifier is made by (3). The proposed algorithm is summarized in Algorithm 1.
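The idea of coding a candidate while keeping its coefficients close to those of the previous tracking result can be sketched as follows. The objective used here, reconstruction error plus an L1 term weighted by `gamma` plus a squared distance to the previous coefficients weighted by `mu`, is an assumption about how the terms named above are combined; the solver (ISTA), the function names, and the toy data are likewise illustrative.

```python
import numpy as np

def code_with_prior(D, y, alpha_prev, gamma=0.01, mu=0.1, n_iter=500):
    """Minimize ||y - D a||^2 + gamma*||a||_1 + mu*||a - alpha_prev||^2 (assumed form)."""
    step = 1.0 / (2.0 * (np.linalg.norm(D, 2) ** 2 + mu))
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - y) + 2.0 * mu * (a - alpha_prev)
        z = a - step * grad
        a = np.sign(z) * np.maximum(np.abs(z) - step * gamma, 0.0)
    return a

def objective(D, y, a, alpha_prev, gamma=0.01, mu=0.1):
    """Value of the assumed objective for a given coefficient vector."""
    return (np.sum((y - D @ a) ** 2)
            + gamma * np.sum(np.abs(a))
            + mu * np.sum((a - alpha_prev) ** 2))

def best_candidate(D, candidates, alpha_prev, gamma=0.01, mu=0.1):
    """Index of the candidate with the lowest objective value."""
    scores = []
    for y in candidates:
        a = code_with_prior(D, y, alpha_prev, gamma, mu)
        scores.append(objective(D, y, a, alpha_prev, gamma, mu))
    return int(np.argmin(scores)), scores

# Toy example (hypothetical): identity dictionary, two candidates.
D = np.eye(3)
alpha_prev = np.array([1.0, 0.0, 0.0])
candidates = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
index, scores = best_candidate(D, candidates, alpha_prev)
```

Under this reading, the same routine would be applied twice per frame: once over the center-point candidates with the global dictionary, and once over the local-state candidates with the local dictionary.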

Learning and Incremental Updating for Dictionary
According to the coding and discriminant functions in the last section, the coefficients $\alpha_{GC}$ and $\alpha_{LC}$ cannot have prominent sparsity if the overcomplete dictionary discriminates poorly; all of the samples could then be chosen as the sparsest code, which harms the classification performance of the reconstruction error function. Therefore, the biased discriminant analysis (BDA) method [17] is introduced into the dictionary learning function in this paper, so that it is effective for the object and the opposite for nonobject samples. BDA uses dispersity expressions of the plus and minus samples ($S_+$ and $S_-$). For object tracking, only the region of the object is of interest, so the background characteristics, noises, occlusions, and so on are regarded as noncharacterized samples $A_-$. Let $X = [x_1, x_2, \dots, x_n]$ be the coding coefficient vector of sample set $A$ over dictionary $D$, so that the tested sample set can be denoted as $A \approx DX$. The dictionary learning function is built from three terms: $\|A_+ - DX\|_F^2$ is the discriminant fidelity term, which is only used for $A_+$ since the only thing of interest in object tracking is the area of the object; $\|X\|_1$ is the l1-norm sparse constraint term; and $f(X)$ is the discriminant constraint with respect to the coefficient vector $X$.
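According to the BDA discriminant rule, let $f(X)$ be $\operatorname{tr}(S_+) - \operatorname{tr}(S_-)$. Because this function is nonconvex and unstable, $\|X\|_F^2$ is added into $f(X)$ as a relaxation term; therefore,

$$f(X) = \operatorname{tr}(S_+) - \operatorname{tr}(S_-) + \eta \|X\|_F^2,$$

where $\eta$ is the control variable. The proposed BDDL method is formed by using this $f(X)$ in the dictionary learning function $J_{(D,X)}$. Similar to [15], $J_{(D,X)}$ is nonconvex in $(D, X)$ jointly, but it is a convex function of $D$ when $X$ is already known and also a convex function of $X$ when $D$ is already known. So, $J$ is in fact a biconvex function of $D$ and $X$.
The mean values of the new and old plus samples are denoted $m_+^{new}$ and $m_+^{old}$, respectively; $M_+$ and $N_+$ are the numbers of new and old samples. Furthermore, the weighted mean of these two mean values of the plus sample set is $m_+ = (N_+ m_+^{old} + M_+ m_+^{new}) / (N_+ + M_+)$.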

Dictionary Incremental
Similarly, the new dispersity expression of the plus sample set is computed using the weighted mean value $m_+$. The dispersity expression of the updated plus samples $S_+$ is then obtained from it and from $S_+^{old}$, the old dispersity expression of the plus sample set. However, negative samples only need to be rejected rather than discriminated in real time, and the dispersity of negative samples is defined accordingly. If we take $S_-$ and $S_+$ into consideration, then the updated function $f(X) = \operatorname{tr}(S_+) - \operatorname{tr}(S_-) + \eta \|X\|_F^2$ can be represented using the term $\Psi = \frac{N_+ M_+}{N_+ + M_+} (m_+^{new} - m_+^{old})(m_+^{new} - m_+^{old})^T$. According to (14), $D_{old}$ is fixed and $X$ is computed; $D$ is then reconstructed from the obtained $X$, that is, $D$ is updated. Here $X$ is not used for discrimination immediately and is just a reconfigurable coding coefficient matrix. In this step, $D_{old}$ is the old dictionary, and $A_+$ is the joint matrix of samples in the current and previous frames; the function $J_{(D,X)}$ is rewritten accordingly.
The first frame requires initialization: the target is framed manually, $A_+^0$ is the set of initial positive samples, $A_-^0$ is the set of initial negative samples, and $m_+^0$ is the mean value of the initial positive samples; $S_+$ and $S_-$ are computed by (6). We initialize all atoms $d_j$ of dictionary $D$ as random vectors with unit l2-norm, solve $X$ by solving (15), and then fix $X$ and solve $D$ by solving (16).
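The role of the correction term $\Psi$ can be illustrated with the standard identity for merging two scatter matrices: the scatter of the pooled samples equals the old scatter plus the new scatter plus a rank-one term built from the difference of the two means. The sketch below checks this identity numerically. It assumes that the paper's update follows this usual rule, which the surviving text suggests but does not state in full; the function names and random data are illustrative.

```python
import numpy as np

def scatter(samples):
    """Scatter matrix and mean of samples stored as rows."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    return centered.T @ centered, mean

def merge_scatter(S_old, m_old, n_old, S_new, m_new, n_new):
    """Combine two scatter matrices without revisiting the old samples."""
    mean = (n_old * m_old + n_new * m_new) / (n_old + n_new)   # weighted mean
    diff = (m_new - m_old)[:, None]
    psi = (n_old * n_new / (n_old + n_new)) * (diff @ diff.T)  # rank-one correction
    return S_old + S_new + psi, mean

rng = np.random.default_rng(0)
old_samples = rng.normal(size=(20, 3))           # hypothetical earlier positive samples
new_samples = rng.normal(loc=0.5, size=(8, 3))   # hypothetical samples from the new frame

S_old, m_old = scatter(old_samples)
S_new, m_new = scatter(new_samples)
S_merged, m_merged = merge_scatter(S_old, m_old, 20, S_new, m_new, 8)

# Reference: recompute from all samples at once.
S_batch, m_batch = scatter(np.vstack([old_samples, new_samples]))
```

Because only the old scatter, the old mean, and the sample counts are needed, earlier frames do not have to be stored to keep the positive-sample statistics current.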

Experiments
Experiments were performed on four video sequences, which include occlusion, illumination variation, appearance variation, and other corruption factors. In the experiments, the target location in the first frame is framed manually, the initial dictionaries are randomly generated, and tracking results are drawn as rectangular boxes. Figures 5, 6, 7, and 8 show a representative sample of tracking results. The tracking method of this paper is compared with incremental visual tracking (IVT) [4], multiple instance learning (MIL) [5], and the L1 tracker (L1) [8]. To further evaluate the proposed method, each method is applied to the four video sequences, and the tracking results are compared qualitatively and quantitatively. The paper uses the gray histogram as the feature representation of
In the Car4 sequence, when the car passes under the bridge and through the shade, the illumination intensity changes markedly. Tracking results are shown in Figure 6; the image frames are #41, #792, #234, #309, #416, and #602. When the car goes under the bridge, MIL degrades significantly but does not lose the target; IVT also degrades, but it recovers. The method in this paper and L1, compared with MIL and IVT, can locate the target accurately.
In the David Indoor sequence, the degradations include two illumination changes, expression change, and partial occlusion. The tracking results are shown in Figure 7; the image frames are #15, #158, #299, #309, #319, and #419. The method in this paper can locate the target accurately; in contrast, the result of L1 is ineffective. The reason is that the light intensity changes the gray histogram of the target, which affects the gray histogram feature of the image; MIL and IVT may be more sensitive to this effect.
The OneLeaveShopReenter2cor sequence shows a woman walking through a corridor when a man walks by, which leads to large occlusions. The occluder wears clothes of a similar color. The tracking results are shown in Figure 8; the image frames are #47, #198, #266, #374, #380, and #489. The method in this paper and L1 can locate the target accurately. When the occlusion happens, MIL takes the occluder as the target and loses the target; because the target is similar to the occluder, IVT has difficulty discriminating between the object and the occluder.
In conclusion, both the method in this paper and L1 can locate the target accurately, and they are strongly robust to occlusions, pose changes, significant illumination variations, and so forth.

Quantitative Comparison.
We evaluate the tracking performance by position error. The position error is approximated by the distance between the central position of the tracking result and the manually labeled ground truth. Table 1 shows the statistics of the position error, including the maximum, mean, and standard deviation. Figure 9 shows the errors of all four trackers.
From the previous comparison results, we can see that the proposed method tracks the target more accurately than the other methods in the video sequences OneLeaveShopReenter2cor, David Indoor, and Car4. The maximum, mean, and standard deviation of its position errors are smaller than those of IVT and MIL. Therefore, in complex environments, our method is more robust. Compared with L1, the tracking results on the sequences OneLeaveShopReenter2cor and Car4 show that L1 has higher stability in scenes where the illumination does not change significantly. However, although the standard deviation of the position error of the L1 tracker in those sequences is smaller than that of the proposed method, the update capability of L1 is weaker than that of the proposed method when the grayscale histogram distribution changes greatly. The dictionary in L1 is constructed from target templates, so the learned dictionary is more robust than it.
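The position error statistics can be computed as in the following sketch. The function names and the sample coordinates are illustrative, and the use of the population standard deviation is an assumption, since the text does not say which variant is reported.

```python
import math
import statistics

def position_errors(tracked_centers, truth_centers):
    """Euclidean distance between tracked and ground-truth centers, per frame."""
    return [math.dist(p, q) for p, q in zip(tracked_centers, truth_centers)]

def summarize(errors):
    """Maximum, mean, and (population) standard deviation of the errors."""
    return {
        "max": max(errors),
        "mean": statistics.fmean(errors),
        "std": statistics.pstdev(errors),
    }

# Hypothetical two-frame example.
tracked = [(0.0, 0.0), (3.0, 4.0)]
truth = [(0.0, 0.0), (0.0, 0.0)]
errors = position_errors(tracked, truth)   # [0.0, 5.0]
summary = summarize(errors)                # max 5.0, mean 2.5, std 2.5
```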

Conclusion
In this paper, a tracking algorithm based on sparse representation and dictionary learning was proposed. Based on biased discriminant analysis, we proposed an effective incremental learning algorithm to construct the overcomplete dictionary. Positive and negative samples are obtained during the tracking process and are used to update the discriminant dictionary by biased discriminant analysis. We then proposed sparse representation global and local classification for the set of central points and the set of local states. Compared to the state-of-the-art tracking methods, the proposed algorithm improves the discriminating performance of the overcomplete dictionary and the adaptive ability of the appearance model. It is strongly robust to illumination changes, perspective changes, and rotation of the target itself.

Figure 1: Diagram of targets tracking via sparse representation global and local classifier.

Figure 5: Tracking results of the PETS01D1Human1 sequence (MIL is yellow, IVT is blue, L1 is green, and our tracker is red).

Figure 7: Tracking results of the David Indoor sequence (MIL is yellow, IVT is blue, L1 is green, and our tracker is red).

Figure 9: Position error plots of the tested sequences.
Motion Model of Targets.
We denote the affine transformation parameters $X_t = (x, y, s, r, \theta, \lambda)$ as the target state in frame $t$, where $x$ and $y$ are the coordinates of the center point, $s$ is the change of scale, $r$ is the bearing rate, $\theta$ is the rotation angle, and $\lambda$ is the angle of inclination.

$N_+$ and $N_-$ are the total numbers of plus and minus samples.
Given a dictionary $D = [d_1, d_2, \dots, d_n]$, $d_i^j$ is the $j$th element of the $i$th vector, which is called an atom of the dictionary. $A = [A_+, A_-]$ is the training sample set, where $A_+$ and $A_-$ are the characterized and noncharacterized samples for the object, which are also called plus and minus samples.
Input: $X_t$ is the tracking result of the prior frame; $\{X^1_{t+1}, X^2_{t+1}, \dots, X^n_{t+1}\}$ is the set of candidate samples at credible positions of the center-point coordinates in the next frame; $D$ is the overcomplete dictionary; the number of frames is also given.

Table 1: Analysis of location errors.
The test sequence PETS01D1Human1 shows a person walking from the lower right corner of the screen to the extreme left side, during which telephone poles briefly occlude the person. Tracking results are shown in Figure 5; the image frames are #49, #88, #155, #257, #324, and #405. All methods track the target effectively, which shows that under the same light intensity, the same camera angle, and slight occlusion, every method can follow the target. It also indicates that the proposed method and the compared methods are effective target tracking algorithms.