Target Tracking via Particle Filter and Convolutional Network

We propose a more effective tracking algorithm that works robustly in complex scenes with illumination change, appearance change, and partial occlusion. The algorithm is based on an improved particle filter with an efficiently designed observation model. Predefined convolutional filters are used to extract high-order features, and a global representation is generated by combining local features without changing their structures and spatial arrangements, which increases feature invariance while maintaining specificity. The features extracted by the convolutional network, which requires no training in our approach, are introduced into the particle filter: the observation model is constructed by fusing the color feature of the target with a set of template features extracted by the convolutional network. During tracking, the template is updated in real time, which improves the robustness of the algorithm. Experiments show that the algorithm achieves an ideal tracking effect when targets are in a complex environment.


Introduction
Object tracking has wide application prospects in computer vision. Recently, many researchers have studied it extensively in real-world settings [1]. Detecting and tracking a target is a very difficult task in practical applications [2]. Many factors can degrade the performance of a tracking algorithm, including attitude change, appearance variation owing to illumination changes, partial occlusion, and background noise [3,4]. To address these problems, we need more efficient machine learning [5] and feature extraction [6] methods to describe the target.
At present, tracking algorithms fall into two main types: generative models and discriminative models [5]. The particle filter is a representative generative tracking algorithm and has been used widely in tracking problems. It has the advantages of simplicity and flexibility, and it easily handles non-Gaussian and multimodal system models; much related work is presented in [7-11]. Information from different measurement sources can be used within the particle filter framework, which has greatly improved tracking performance. In practice, however, there are still many ways to improve the effectiveness of tracking algorithms.
In addition, the classical particle filter usually adopts a dynamic model with global information. Regardless of whether the target is occluded or deformed, it treats the target as a whole, which neglects the target's local information. When the target is partially occluded or its local appearance changes, the particle filter cannot accurately track it.
Discriminative methods distinguish targets from backgrounds by training classifiers. At present, most deep learning methods for target tracking also belong to the discriminative framework. Deep learning has made outstanding achievements in image classification and target detection and has become one of the most powerful automatic feature extraction methods. A deep network gradually derives high-level abstract features from low-level features through learning and mapping across multiple levels. These abstract features have high dimensionality and strong discriminability, so high accuracy in classification and regression tasks can be achieved with a simple classifier. Some tracking methods based on learned features have been proposed, using convolutional networks trained offline [12,13]. In tracking, target localization is achieved by extracting the features of the target at different layers of the network.
The key point in all of these methods is how to learn an effective feature extractor offline with a great deal of auxiliary data, which consumes a lot of time. These methods also give no consideration to the similar local structure and inner geometric distribution information shared by the target over consecutive frames, which is handy and effective in distinguishing the target from the background in visual tracking. In addition, the pure use of deep learning does not solve the problem of tracking drift, and it needs to be combined with other methods for the deep network to play its role fully [14,15].
In summary, a target tracking algorithm combining a particle filter and a convolutional network is proposed in this paper. The features extracted by the convolutional network are introduced into the particle filter framework, and the target block is represented by sparse representation. The local and spatial information of the target is fully exploited to represent the state change of the object, and different information is handled according to the target state. Because the global information of the particle filter is combined to determine the current target position, local appearance changes and partial occlusion of the target are better handled. In the tracking process, the template is updated according to the tracking results, which improves the robustness of the algorithm to a certain extent. Experiments show that when the target is in a complex environment, the algorithm achieves an ideal tracking effect.

Particle Filtering Tracking Formula
The tracking problem for the particle filter is to estimate the posterior probability density $p(s_t \mid z_{1:t})$ at the $t$ moment, which is obtained in two steps [10].
Step 1 (prediction). First, suppose the initial probability density $p(s_0)$ is known and the posterior probability density function $p(s_{t-1} \mid z_{1:t-1})$ at the $t-1$ moment is also known. The state $s_t$ is described as a three-dimensional vector, $s_t = [x_t, y_t, \theta_t]^{\mathrm{T}}$ (position and scale). The prediction step is

$$p(s_t \mid z_{1:t-1}) = \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid z_{1:t-1})\, \mathrm{d}s_{t-1}, \quad (1)$$

where $p(s_t \mid s_{t-1})$ is defined by the state equation of the target.

Step 2 (update). The posterior is updated with the new observation:

$$p(s_t \mid z_{1:t}) = \frac{p(z_t \mid s_t)\, p(s_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}. \quad (2)$$

The observation likelihood function $p(z_t \mid s_t)$ is determined by the observation of the target, and $p(z_t \mid z_{1:t-1})$ is a normalization constant.
In fact, since the integral in formula (1) is difficult to evaluate, the recursive Bayesian filter is approximated by the nonparametric Monte Carlo method (i.e., the particle filter). The basic formula is

$$p(s_t \mid z_{1:t}) \approx \sum_{i=1}^{N} w_t^i\, \delta(s_t - s_t^i),$$

where $w_t^i$ is the weight of the corresponding particle. The weight of the particle is updated according to the observation value; that is,

$$w_t^i \propto w_{t-1}^i\, \frac{p(z_t \mid s_t^i)\, p(s_t^i \mid s_{t-1}^i)}{q(s_t^i \mid s_{t-1}^i, z_t)},$$

where $q(s_t^i \mid s_{t-1}^i, z_t)$ is the proposal distribution (importance density) in Bayesian importance sampling. A common choice is to take the proposal as the prior density, $q(s_t \mid s_{t-1}^i, z_t) = p(s_t \mid s_{t-1}^i)$. Then the weight reduces to

$$w_t^i \propto w_{t-1}^i\, p(z_t \mid s_t^i).$$

Finally, the state estimate of the target is obtained; that is,

$$\hat{s}_t = \sum_{i=1}^{N} w_t^i\, s_t^i.$$
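As a concrete illustration, the predict-update recursion with the prior as proposal fits in a few lines of NumPy. The 1-D toy dynamics, noise levels, and function names below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, observe, transition):
    """One predict/update cycle using the prior p(s_t | s_{t-1}) as the
    proposal, so each weight reduces to w_t ∝ w_{t-1} * p(z_t | s_t)."""
    particles = transition(particles)           # prediction by the motion model
    weights = weights * observe(particles)      # update by the observation likelihood
    weights = weights / weights.sum()           # normalize to a distribution
    estimate = np.sum(weights[:, None] * particles, axis=0)  # weighted-mean state
    return particles, weights, estimate

# Toy 1-D demonstration: the true state is 5.0.
n = 500
particles = rng.normal(0.0, 5.0, size=(n, 1))
weights = np.full(n, 1.0 / n)
transition = lambda p: p + rng.normal(0.0, 0.5, size=p.shape)   # random-walk dynamics
observe = lambda p: np.exp(-0.5 * (p[:, 0] - 5.0) ** 2)          # Gaussian likelihood

for _ in range(10):
    particles, weights, estimate = particle_filter_step(
        particles, weights, observe, transition)
```

After a few cycles the weighted-mean estimate concentrates near the true state; a full tracker would also resample when the effective particle count drops.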

Target Motion Model.
For each new frame, every particle carries out a state transition according to the motion model

$$s_t = s_{t-1} + r_{t-1}\, v_t,$$

where $v_t$ is Gaussian white noise and $r_{t-1}$ is the propagation radius of the particle, which is proportional to the average state change of the target at the previous moment.
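A minimal sketch of this motion model. The base radii reuse the particle standard deviations from the experimental setup ($\sigma_x = 4$, $\sigma_y = 4$, $\sigma_\theta = 0.4$), while the `gain` constant is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Base radii taken from the paper's particle standard deviations
# (sigma_x = 4, sigma_y = 4, sigma_scale = 0.4); the gain is illustrative.
BASE = np.array([4.0, 4.0, 0.4])

def propagate(particles, radius):
    """Transition s_t = s_{t-1} + r_{t-1} * v_t with Gaussian white noise v_t."""
    return particles + radius * rng.standard_normal(particles.shape)

def update_radius(recent_states, base=BASE, gain=0.5):
    """Propagation radius proportional to the average state change
    of the target over recent frames."""
    avg_change = np.mean(np.abs(np.diff(recent_states, axis=0)), axis=0)
    return base + gain * avg_change
```

Tying the radius to recent motion lets the particle cloud spread when the target accelerates and tighten when it is nearly still.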

Target Observation Model.
Each input image is warped to a fixed size of $n \times n$ pixels, denoted $I \in \mathbb{R}^{n \times n}$. A set of local image blocks $Y = \{Y_1, \ldots, Y_l\}$ is obtained by densely sampling with a sliding window of size $w \times w$ ($w$ is called the receptive field size), where $Y_i \in \mathbb{R}^{w \times w}$ is the $i$th image block and $l = (n - w + 1) \times (n - w + 1)$. Each block $Y_i$ is preprocessed by subtracting its mean and applying $\ell_2$ normalization [11].
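The dense sampling and preprocessing step can be sketched as follows; with the paper's settings $n = 32$ and $w = 6$ this yields $(32 - 6 + 1)^2 = 729$ patches:

```python
import numpy as np

def extract_patches(image, w):
    """Densely sample all w x w blocks of an n x n image (stride 1),
    then subtract each block's mean and l2-normalize it, as the
    preprocessing step describes."""
    n = image.shape[0]
    patches = []
    for i in range(n - w + 1):
        for j in range(n - w + 1):
            p = image[i:i + w, j:j + w].astype(float).ravel()
            p -= p.mean()                    # zero-mean
            norm = np.linalg.norm(p)
            if norm > 0:
                p /= norm                    # unit l2 norm
            patches.append(p)
    return np.array(patches)                 # shape: ((n-w+1)**2, w*w)

img = np.arange(32 * 32, dtype=float).reshape(32, 32)
P = extract_patches(img, 6)
```

Each row of `P` is one normalized receptive-field block, ready for filtering.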

Color Model
(1) Target Model. Let $x_i^*$, $i = 1, \ldots, n_p$, denote the pixel locations of the vectorized image patch, normalized with the target center at zero. The number of Eigen-value (color) bins is $m$. The probability of the Eigen value $u$ in the target model is [16]

$$q_u = C \sum_{i=1}^{n_p} k\bigl(\|x_i^*\|^2\bigr)\, \delta\bigl[b(x_i^*) - u\bigr],$$

where $k(\cdot)$ is a kernel function used to adjust the size of the weights, $\delta$ is the delta function, $b(x_i^*)$ represents the color value of the pixel at $x_i^*$, $u$ is the color index of the histogram, and $C$ is a normalization constant.

(2) Target Candidate Model. Similarly, taking the candidate position $y$ as the center, the probability of the target candidate model is

$$p_u(y) = C_h \sum_{i=1}^{n_h} k\!\left(\left\|\frac{y - x_i}{h}\right\|^2\right) \delta\bigl[b(x_i) - u\bigr],$$

where $h$ is the window radius and $C_h$ is the corresponding normalization constant.

(3) Similarity Function. In this paper, the Bhattacharyya coefficient is used to calculate the similarity of the two models [17]:

$$\rho(p, q) = \sum_{u=1}^{m} \sqrt{p_u(y)\, q_u},$$

which has a value of $0 \sim 1$. We then define the distance between the two target templates as

$$d = \sqrt{1 - \rho(p, q)}.$$

The corresponding color observation probability of the particle is obtained:

$$p_{\mathrm{color}} \propto \exp\!\left(-\frac{d^2}{2\sigma^2}\right),$$

where the mean square deviation is $\sigma = 0.2$.
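A hedged NumPy sketch of the kernel-weighted histogram, Bhattacharyya coefficient, and color likelihood. The Epanechnikov kernel and a 16-bin grayscale histogram are illustrative choices, not fixed by the paper; the paper's $\sigma = 0.2$ is kept:

```python
import numpy as np

def epanechnikov_weights(h, w):
    """Kernel k(.) giving larger weight to pixels near the patch center."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r2 = ((ys - cy) / (h / 2.0)) ** 2 + ((xs - cx) / (w / 2.0)) ** 2
    return np.maximum(0.0, 1.0 - r2)

def color_model(patch, bins=16):
    """Kernel-weighted color histogram q_u, normalized to sum to one.
    `patch` holds intensities in [0, 1]; b(x) is the bin index."""
    k = epanechnikov_weights(*patch.shape)
    idx = np.minimum((patch * bins).astype(int), bins - 1)
    q = np.bincount(idx.ravel(), weights=k.ravel(), minlength=bins)
    return q / q.sum()

def bhattacharyya(p, q):
    """Similarity rho(p, q) = sum_u sqrt(p_u * q_u), in [0, 1]."""
    return float(np.sum(np.sqrt(p * q)))

def color_likelihood(p, q, sigma=0.2):
    """exp(-d^2 / (2 sigma^2)) with d^2 = 1 - rho(p, q); sigma from the paper."""
    return float(np.exp(-(1.0 - bhattacharyya(p, q)) / (2.0 * sigma ** 2)))

rng = np.random.default_rng(2)
patch = rng.random((24, 24))
q_target = color_model(patch)
q_candidate = color_model(np.clip(patch + 0.02, 0.0, 1.0))
```

Identical histograms give $\rho = 1$ and a likelihood of 1; the likelihood decays smoothly as the candidate's colors drift from the template's.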

Convolutional Networks Model.
In order to describe the target better, we apply convolutional networks to learn robust representations for visual tracking without offline training on a large amount of auxiliary data, inspired by recent studies [11,18]. First, we use predefined convolutional filters to extract high-order features. Second, we generate a global representation by combining local features without changing their structures and spatial arrangements, which increases feature invariance while maintaining specificity.
Step 2 (background layer). At the same time, the useful background information around the target is used to distinguish the target from the background. Samples are selected around the target, and $k$-means is used to select a bank of background filters $F^b = \{F^b_1, \ldots, F^b_d\}$. We summarize the filters in $F^b$ by average pooling to generate the background context filter set $\bar{F}^b$, which is then convolved with the input image, $S^b_i = \bar{F}^b_i \otimes I$. Finally, the simple cell feature maps are defined as the responses of the combined filters.

Step 3 (convolution layer). The simple cell feature maps are produced by the filter set $F = F^o \cup F^b$, where $F^o$ is the target filter bank. The $d$ different feature maps are stacked to construct a three-dimensional tensor $V \in \mathbb{R}^{(n-w+1) \times (n-w+1) \times d}$, that is, the combination of the feature maps. This representation is shift-sensitive and therefore preserves specificity. In addition, because the warped region is a fixed $n \times n$, the feature is robust to changes of the target scale.
To increase robustness to appearance change, we represent the feature $V$ using a sparse representation.
The solution of the model can then be obtained by soft shrinkage [19]:

$$c = \mathrm{sign}\bigl(\mathrm{vec}(V)\bigr) \odot \max\bigl(|\mathrm{vec}(V)| - \lambda,\, 0\bigr),$$

where $\mathrm{sign}(\cdot)$ is a sign function, $\lambda$ is set to the median value of $|\mathrm{vec}(V)|$, and the resulting sparse representation is $c = [c_1, c_2, \ldots]$, with $c_i \geq 0$.
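The soft-shrinkage operator is simple to state in code; following the text, the threshold defaults to the median of the absolute feature values:

```python
import numpy as np

def soft_shrink(v, lam=None):
    """Soft-thresholding: c = sign(v) * max(|v| - lam, 0).
    Following the text, lam defaults to the median of |v|."""
    v = np.asarray(v, dtype=float)
    if lam is None:
        lam = np.median(np.abs(v))
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([0.9, -0.2, 0.05, -0.7, 0.3])
c = soft_shrink(v)   # threshold lam = median(|v|) = 0.3
```

Choosing the median as the threshold zeroes out roughly half of the entries, which is what makes the representation sparse.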
Step 4 (model update). The update strategy takes a low-pass filtering form:

$$c_t = \lambda c_{t-1} + (1 - \lambda)\, \hat{c}_{t-1}, \quad (16)$$

where $\lambda$ is a learning parameter, $c_t$ is the target template at frame $t$, $\hat{v}_{t-1}$ is the feature of the previous frame, and $\hat{c}_{t-1}$ is the sparse expression of $\hat{v}_{t-1}$. The observation model in the convolution model is defined by

$$p_{\mathrm{conv}}^{(i)} \propto \exp\bigl(-\|c_t \odot W - \hat{c}_t^{(i)} \odot W\|^2\bigr), \quad (17)$$

where $\hat{c}_t^{(i)}$ is the $i$th candidate sample representation at frame $t$ based on the complex cell features, $\odot$ expresses the element-wise product, and $W$ is an indicator function whose elements mark the nonzero entries of the target template.
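A sketch of the low-pass template update and the masked candidate score. The mixing direction of the update and the form of the indicator $W$ are our reconstructions from the text, and `sigma` is an illustrative parameter:

```python
import numpy as np

LAM = 0.95  # learning parameter from the experimental setup

def update_template(c_prev, c_hat_prev, lam=LAM):
    """Low-pass template update c_t = lam * c_{t-1} + (1 - lam) * c_hat_{t-1};
    the mixing direction is our reading of the text."""
    return lam * c_prev + (1.0 - lam) * c_hat_prev

def conv_likelihood(template, candidate, sigma=1.0):
    """Score a candidate only where the sparse template is nonzero;
    W plays the role of the indicator function in the observation model."""
    W = (template != 0).astype(float)
    d2 = np.sum((template * W - candidate * W) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

template = np.array([1.0, 0.0, 2.0])
```

Masking with $W$ means entries the sparse template has zeroed out cannot penalize a candidate, which is what makes the score tolerant of background variation.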

System Observation Model.
The system observation probability density of each particle fuses the two features:

$$p(z_t \mid s_t) = \alpha\, p_{\mathrm{color}} + (1 - \alpha)\, p_{\mathrm{conv}},$$

where the parameter $\alpha$ regulates the proportion of each feature's observation probability in the total observation probability. When the background is complex and the target is partially occluded, the global positioning advantage of the color distribution should be fully exploited, and $\alpha$ should be increased. When the color of the target differs from the background color, $\alpha$ should be reduced so that the localization advantage of the convolutional feature can be fully exploited. Under normal circumstances, we take $\alpha < 0.5$.
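The fusion rule reduces to a one-line weighted mixture; the additive form is our reading of the text, and $\alpha = 0.4$ is an illustrative default consistent with $\alpha < 0.5$:

```python
def fused_likelihood(p_color, p_conv, alpha=0.4):
    """Weighted mixture of the two observation probabilities.
    alpha < 0.5 (here 0.4, an illustrative value) favors the
    convolutional feature under normal circumstances."""
    return alpha * p_color + (1.0 - alpha) * p_conv
```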
The flow chart of particle filter algorithm based on feature fusion is displayed in Figure 1.

Implementation Parameters Setup.
We utilized the two-layer convolutional network as a feature extractor in the experiment. The size of the warped image is set to 32 × 32 ($n = 32$) and the receptive field size to 6 × 6 ($w = 6$), giving $l = (32 - 6 + 1) \times (32 - 6 + 1)$ image patches at the initial frame; the number of filters is set to $d = 100$. The learning parameter $\lambda$ in (16) is set to 0.95 and the template is updated every frame. The standard deviations of the target state of the particle filter are set as $\sigma_x = 4$, $\sigma_y = 4$, $\sigma_\theta = 0.4$, and $N = 600$ particles are used. We compared the improved algorithm with five other tracking algorithms (CNT [11], MS [16], MTT [20], VTS [21], and CPF [22]). These five algorithms all use the particle filter as the search mechanism, and MTT, CNT, and VTS use sparse representation. The tracking benchmark dataset [23] is used for the experimental validation. To better analyze the performance of the algorithm, we employ videos with 11 attributes from complex scenes [11].

Evaluation Metrics.
In order to quantitatively evaluate the system, we use two graphs: the success plot and the precision plot. The success plot is based on the overlap rate $r = \mathrm{Area}(\mathrm{BT} \cap \mathrm{BG})/\mathrm{Area}(\mathrm{BT} \cup \mathrm{BG})$, where BT denotes the tracked bounding box and BG denotes the ground truth. The success rate is the percentage of frames with $r > t_0$, swept over all thresholds $t_0 \in [0, 1]$. The precision plot shows the percentage of frames whose tracked position lies within a given distance threshold of the ground truth; in order to rank the trackers, we set the threshold to 20 pixels in the precision plot. One-pass evaluation (OPE) is used to report the average success and precision against the ground-truth target state [23].
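The overlap-rate and success-rate computations can be sketched directly; boxes are assumed to be given as `[x, y, w, h]`:

```python
import numpy as np

def overlap_rate(bt, bg):
    """r = Area(BT ∩ BG) / Area(BT ∪ BG) for boxes given as [x, y, w, h]."""
    x1, y1 = max(bt[0], bg[0]), max(bt[1], bg[1])
    x2 = min(bt[0] + bt[2], bg[0] + bg[2])
    y2 = min(bt[1] + bt[3], bg[1] + bg[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = bt[2] * bt[3] + bg[2] * bg[3] - inter
    return inter / union

def success_rate(overlaps, t0):
    """Fraction of frames whose overlap rate exceeds the threshold t0."""
    return float(np.mean(np.asarray(overlaps) > t0))
```

Sweeping `t0` over [0, 1] and plotting `success_rate` at each value yields the success plot described above.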

Quantitative Analysis
(1) Whole Performance. We present the performance of the top 6 implemented tracking algorithms in Figure 2 according to the success and precision plots. All graphs are produced using the benchmark evaluation [23]; our proposed algorithm ranks first by success rate and second by precision rate. Note that the proposed algorithm exploits only a simple sparse image representation that encodes local structural and geometric layout information of the target, yet achieves competitive performance compared to the other methods. Furthermore, even using only target-specific information from the first frame, without learning from auxiliary training data, our method performs well in contrast to the other methods. This is mainly because generic features learned offline from numerous auxiliary data may not adapt well to object appearance variations during tracking.
(2) Performance of Basic Attributes. To analyze the performance of the improved method, we estimate the trackers on videos with different attributes; in this paper, 11 attributes are selected. Figures 3 and 4 show, respectively, the success plots and the corresponding precision plots [24]. Our improved algorithm ranks first on 3 of the attributes in the success plots, and likewise first on 3 in the precision plots.
Our method also ranks first among all evaluated trackers on the video sequences with low resolution. It is difficult to extract useful features of the targets when the resolution of the videos is low. In contrast, our algorithm extracts dense information across the entire target region with convolution operators to separate the target from the complex scene.
Our method ranks second, behind the CNT method, on the video sequences with background clutter attributes. The proposed algorithm uses background context information that is updated online and pooled in every frame, and hence provides effective features to precisely locate objects among the clutter.

Qualitative Analysis
(1) The Variation of Illumination and Posture. Figure 5 shows successful tracking results on the shaking and skating1 sequences. When great changes take place in the stage lighting conditions, the posture of the target also changes drastically owing to dancing or head shaking. Our algorithm effectively handles the posture changes because the observation model is developed through online updating. In addition, the proposed algorithm is robust to illumination changes because the observation model uses a hybrid template. The other algorithms fail to track the target when these lighting changes occur simultaneously.
(2) Occlusion. Figure 6 gives the successful tracking results of the improved algorithm while the object is seriously occluded by other objects. Our method and CNT can robustly track the woman and jogging-1 sequences, and the positioning of our method is more accurate. The two methods are robust in the presence of occlusion because of their efficient observation models: the model uses local features and records the appearance of the target over time, including occlusion, appearance change, and mixtures of the two, and it therefore accounts for various occlusions. The other algorithms fail to track the targets accurately.
(3) Background Clutters. The carDark sequence is tested in Figure 7. The background clutter is drastic, and its appearance is similar to the target itself. Under these circumstances, the other tracking algorithms have difficulty locating the target accurately.
(5) Scale Changes. The tracking results of the freeman3 and singer1 sequences are shown in Figure 9. In these sequences, the tracked targets undergo serious scale changes. For the freeman3 sequence, a person moves towards the camera with a large scale variation in his face appearance; furthermore, the appearance also varies as the posture changes. The MTT, CPF, MS, and VTS algorithms deviate from the tracked target from frame #330, whereas the proposed and CNT algorithms succeed in tracking. The MTT, CPF, MS, and VTS algorithms also do not complete tracking efficiently when the target has a large scale change in the singer1 sequence, but our proposed method and CNT achieve better performance. The proposed algorithm effectively handles scale variation because the model representation is built on scale-invariant complex cell features.

Conclusion
We have put forward an effective tracking method using a particle filter and a convolutional network. A deep learning method is used to extract effective features for robust tracking, and the algorithm efficiently handles severe appearance change and occlusion. The experimental results show that the improved method outperforms traditional tracking methods in drastic tracking surroundings. Since the algorithm is easily extended by adding more effective feature information, the tracking results could be enhanced further.

Figure 1: Flow chart of the algorithm.

Figure 4: The precision plots of videos with multiple attributes.

(4) Deformation. Some successful tracking results of the Singer2 sequence are shown in Figure 8. There are illumination changes and deformation in the sequence. Only our method performs well in all of the sequences.