Visual Vehicle Tracking Based on Deep Representation and Semisupervised Learning

Discriminative tracking methods use binary classification to discriminate between foreground and background and have achieved useful results. However, the number of labeled training samples available to them is insufficient for accurate tracking. Hence, discriminative classifiers must use their own classification results to update themselves, which may lead to feedback-induced tracking drift. To overcome these problems, we propose a semisupervised tracking algorithm that uses deep representation and transfer learning. Firstly, a 2D multilayer deep belief network is trained with a large number of unlabeled samples. The nonlinear mapping in the top layer of this network is extracted as the feature dictionary. Then, this feature dictionary is used to transfer-train and update a deep tracker. The positive samples for training are the tracked vehicles, and the negative samples are background images. Finally, a particle filter is used to estimate the vehicle position. We demonstrate experimentally that the proposed vehicle tracking algorithm effectively restrains drift while adapting to changes in vehicle appearance. Compared with similar algorithms, our method achieves a higher tracking success rate and smaller average center-pixel errors.


Introduction
Visual vehicle tracking is one of the key technologies used in active vehicle safety applications. It is used in advanced driver assistance systems (ADAS) and in intelligent vehicles. However, in real traffic situations, both the camera and the vehicles to be tracked are in motion. The visual tracking of target vehicles is often affected by complex backgrounds, changes in illumination, and occlusion by other vehicles or objects. These factors make vehicle tracking a challenging task in real traffic scenarios.
Existing tracking algorithms generally fall into one of two categories: generative models and discriminative models. Discriminative models convert tracking problems into binary classification problems of target and background. They have attracted much research interest in recent years. Many representative methods have been proposed, such as the online AdaBoost algorithm [1], the multiple instance learning algorithm [2], the TLD (tracking-learning-detection) algorithm [3], and the SVT (support vector tracking) algorithm [4].
The biggest challenge for discriminative model based tracking methods is tracker drift. This is because only a small number of labeled samples (in most cases, only one positive sample) can be used to train the classifier. Additionally, the tracking of subsequent frames depends on the classification results of the previous frame. If the tracked area in the previous frame is not locked onto the optimal target, this self-training approach can lead to classifier drift, which causes accumulated errors and further tracking failures.
To address this shortcoming of the self-training approach, semisupervised learning-based tracking has been proposed [5][6][7]. In this method, a large number of auxiliary images is used to maintain a feature dictionary capable of describing images. The dictionary is then used to update the tracker online. However, methods of this kind tend to use pixels from the original image or handcrafted features (such as Haar, HOG, or SIFT features) to generate the feature dictionary. These cannot satisfy the requirements of the image classification needed for robust tracking.
The newly developed deep learning framework is a bioinspired architecture that describes data such as images, voice, and text by mimicking the human brain's mechanisms of learning and analysis. Through deep learning, features are transformed from their original space in a lower layer to a new space in a higher layer [8]. Compared with handcrafted features, the features generated automatically by deep learning are more capable of expressing the internal properties of the data. Inspired by this, we propose a semisupervised vehicle tracking algorithm that uses deep representation. The overall structure of the proposed method is shown in Figure 1. Firstly, a large number of unlabeled images are selected as training samples, and a 2D deep belief network (2D-DBN) is used as the classifier structure. Then, a multilayer deep network is trained with those unlabeled samples, and the nonlinear mapping nodes in the top layer are taken as the feature dictionary. In the tracking process, the subimage containing the vehicle to be tracked in the initial frame is set as the positive sample, and the surrounding background subimages are set as negative samples. After that, an online deep tracker is learned from the generated samples and the feature dictionary. Finally, a particle filter is used to estimate a small candidate area for the tracker. The deep learning-based tracking algorithm is able to fully exploit the deep-level feature information embedded in the image. This effectively suppresses drift while keeping the vehicle tracking system updated.
Generally, compared with other semisupervised learning-based tracking methods that use handcrafted features [5][6][7], this work introduces the deep learning framework into traditional semisupervised learning. It applies the 2D-DBN deep model to generate features automatically during unsupervised offline training and then uses a small number of samples to adjust the tracker online.
The rest of this article is organized as follows. Section 2 introduces in detail the establishment of the deep model based semisupervised tracking framework, which contains both offline and online training steps. The vehicle position prediction method based on a particle filter is given in Section 3. Section 4 presents the experimental results and their analysis. Finally, a brief conclusion is given in Section 5.

Establishment of Vehicle Tracking Based on Deep Modeling
The establishment of the deep tracker requires two steps (Figure 2): offline unsupervised training and online supervised training. First, the input subimages are converted to gray scale and normalized to the [0, 1] interval. After that, the entire normalized matrix is input to the 2D-DBN to perform feature dictionary extraction.
The structure of the proposed 2D-DBN is shown in Figure 3. It is constructed with one visible layer, v^1, and two hidden layers, h^1 and h^2. The visible layer contains n_v × n_v units, equal to the dimension of the input samples. The hidden layers h^1 and h^2 contain n_1 × n_1 and n_2 × n_2 units, respectively. The visible layer and hidden layer are connected by a group of weights, W. After unsupervised training to adjust the weights, the units in hidden layer h^2 can be considered as the feature dictionary.
In unsupervised training, the greedy layer-wise reconstruction algorithm is used to adjust the weights between each pair of adjacent layers [9]. Taking visible layer v^1 and hidden layer h^1 as an example, v^1 and h^1 can be seen as a restricted Boltzmann machine (RBM). The state energy E(v^1, h^1) of the units of v^1 and h^1 can be written as

E(v^1, h^1) = −Σ_{i,j} Σ_{k,l} w_{ij,kl} v^1_{ij} h^1_{kl} − Σ_{i,j} b^1_{ij} v^1_{ij} − Σ_{k,l} c^1_{kl} h^1_{kl},

where w_{ij,kl} is the connecting weight between unit (i, j) in visible layer v^1 and unit (k, l) in hidden layer h^1, and b^1_{ij} and c^1_{kl} are the biases of the corresponding units.
So, the RBM defines a joint probability distribution:

P(v^1, h^1) = (1/Z) exp(−E(v^1, h^1)),

where Z is a normalizing constant (the partition function).

Then, the conditional probability distributions of the visible state v^1 and hidden state h^1 can be expressed with logistic functions:

P(h^1_{kl} = 1 | v^1) = σ(Σ_{i,j} w_{ij,kl} v^1_{ij} + c^1_{kl}),
P(v^1_{ij} = 1 | h^1) = σ(Σ_{k,l} w_{ij,kl} h^1_{kl} + b^1_{ij}),

where σ(x) = 1/(1 + exp(−x)).
Based on the analysis above, the connecting weights and biases are updated with the contrastive divergence algorithm [9]:

Δw_{ij,kl} = ε (⟨v^1_{ij} h^1_{kl}⟩_data − ⟨v^1_{ij} h^1_{kl}⟩_recon),

where ⟨⋅⟩_data is the expectation under the data distribution, ⟨⋅⟩_recon is the expectation under the reconstruction distribution, and ε = 1 is the update step size. By repeating these steps from lower layers to higher layers, the weights of (v^1, h^1) and (h^1, h^2) are obtained. The weights of every unit in h^2 can then be considered as the nonlinear feature dictionary.
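The CD-1 update above can be sketched in NumPy for a binary RBM. This is a minimal illustrative sketch, not the paper's implementation: the 2D unit grids are flattened into vectors, and the function name `cd1_update` and its batch handling are our own assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b, c, v0, lr=1.0, rng=None):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    W : (n_visible, n_hidden) weight matrix
    b : (n_visible,) visible biases
    c : (n_hidden,)  hidden biases
    v0: (batch, n_visible) training vectors in [0, 1]
    lr: update step size (the paper uses a step size of 1)
    """
    rng = rng or np.random.default_rng(0)
    # Positive phase: hidden probabilities given the data, then a sample.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Reconstruction: one Gibbs step down to the visible layer and back up.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # <.>_data - <.>_recon, averaged over the batch.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```

Stacking two such RBMs, as in the 2D-DBN, amounts to training the lower pair (v^1, h^1) first and then feeding its hidden probabilities as "visible" data to the upper pair (h^1, h^2).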
In practice, many vehicle subimages captured from road videos are input to the visible layer one by one. With the unsupervised training process given above, all the weights of the 2D-DBN are adjusted; if the weights of the top layer are displayed graphically, the learned features can be viewed. Some of the features generated by the proposed unsupervised learning method are shown in Figure 4. It can be seen that the features are most sensitive to shapes such as edges and corners, which occur frequently in vehicles.

Tracker Online Training.
When the tracked area is identified in the first frame of a video, positive and negative samples are generated. Firstly, the tracked area is chosen as the positive sample and is then rotated from −5 to 5 degrees in steps of 1 degree to generate more positive samples. Secondly, negative subimages are generated in a ring-shaped neighborhood around the tracked area. This neighborhood is defined as r < ‖c_neg − c‖ < R, in which c_neg is the center of a negative sample, c is the center of the tracked area, and r and R are the inner and outer radii of the ring.
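The sample generation above can be sketched as follows. This is a hedged illustration: the function names, the nearest-neighbor rotation, and the parameters `r_in`, `r_out`, and `n_neg` are our own choices, not values specified in the paper.

```python
import numpy as np

def rotate_patch(patch, angle_deg):
    """Rotate a gray-scale patch about its center (nearest-neighbor)."""
    h, w = patch.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse-map each output pixel back into the source patch.
    sx = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    sy = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    sx = np.clip(np.rint(sx).astype(int), 0, w - 1)
    sy = np.clip(np.rint(sy).astype(int), 0, h - 1)
    return patch[sy, sx]

def generate_samples(frame, box, r_in, r_out, n_neg=20, rng=None):
    """Positives: rotations of the tracked box from -5 to +5 degrees.
    Negatives: patches whose centers lie in the ring r_in < ||c_neg - c|| < r_out.
    box is (x, y, w, h); frame is a 2-D gray-scale array."""
    rng = rng or np.random.default_rng(0)
    x, y, w, h = box
    patch = frame[y:y + h, x:x + w]
    positives = [rotate_patch(patch, a) for a in range(-5, 6)]  # 1-degree steps
    cx, cy = x + w / 2.0, y + h / 2.0
    negatives = []
    while len(negatives) < n_neg:
        radius = rng.uniform(r_in, r_out)
        angle = rng.uniform(0.0, 2.0 * np.pi)
        nx = int(round(cx + radius * np.cos(angle) - w / 2.0))
        ny = int(round(cy + radius * np.sin(angle) - h / 2.0))
        if 0 <= nx <= frame.shape[1] - w and 0 <= ny <= frame.shape[0] - h:
            negatives.append(frame[ny:ny + h, nx:nx + w])
    return positives, negatives
```

Note that the 11 rotated positives all keep the original patch size, so they can be fed to the fixed-size visible layer without resizing.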
Based on the offline-trained DBN, a label layer with a sigmoid activation is added to form a two-dimensional feedforward artificial neural network (Figure 5). In the online training process, the positive and negative samples and their labels are input to this network. The backpropagation algorithm is used to perform supervised training and adjust all the network weights. Finally, all the subimages in subsequent frames are fed to the newly updated network for classification.
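As a simplified sketch of this supervised step: the paper backpropagates through all network weights, but the sketch below trains only the added sigmoid label layer on top of the DBN's top-layer features, which is the last stage of that fine-tuning. The function name and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def finetune_label_layer(features, labels, epochs=200, lr=0.5):
    """Train a sigmoid label layer on DBN features by gradient descent.

    features : (n_samples, n_features) activations of the top hidden layer
    labels   : (n_samples,) 1 for vehicle (positive), 0 for background
    Returns the label-layer weights and bias.
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(features @ w + b)   # predicted probability of "vehicle"
        grad = p - labels               # cross-entropy gradient at the output
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b
```

In the full tracker this gradient would continue to propagate down through the DBN weights; restricting the update to the label layer is a common cheap alternative when the offline features are already good.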

Position Prediction Based on Particle Filtering
In the tracking process, the online trained classifier needs to verify a large number of subimages in subsequent frames. To reduce processing time, a particle filter is used to estimate the target position.
The particle filter uses a Monte Carlo algorithm with Bayesian filtering and demonstrates good object tracking performance. Let x_t and z_t be the state and observation of the target at time t. The vehicle position estimation can then be described as follows: given the observations Z_t = {z_0, z_1, z_2, ..., z_t} up to time t, iteratively estimate the state of the system at time t.
Let the state space model of the system be

x_t = f(x_{t−1}) + v_t,
z_t = h(x_t) + n_t,

where f and h are the state transfer function and observation function, respectively, and v_t and n_t represent system noise and observation noise, respectively. The filtering process contains two main steps: forecasting and updating. In the forecasting process, the posterior probability density p(x_{t−1} | z_{0:t−1}) at time t − 1 is used to obtain the prior probability density at time t:

p(x_t | z_{0:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | z_{0:t−1}) dx_{t−1}.

In the updating process, the newest observation z_t at time t and the prior probability density p(x_t | z_{0:t−1}) are used to calculate the posterior probability density p(x_t | z_{0:t}):

p(x_t | z_{0:t}) = p(z_t | x_t) p(x_t | z_{0:t−1}) / p(z_t | z_{0:t−1}).

Assume that {x^i_{0:t}, w^i_t}, i = 1, ..., N, is a group of weighted samples from the posterior probability density with Σ_i w^i_t = 1, where x^i_{0:t} = {x^i_k, k = 0, 1, 2, ..., t} is the sample trajectory from time 0 to time t. Based on the Monte Carlo principle, the posterior probability density at time t can be approximated with a discrete weighting formula:

p(x_t | z_{0:t}) ≈ Σ_{i=1}^{N} w^i_t δ(x_t − x^i_t),

in which δ(⋅) is the Dirac delta function. Further, the estimated value of the system state x_t at time t is

x̂_t = Σ_{i=1}^{N} w^i_t x^i_t.
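The forecast, update, and estimation equations above correspond to one step of a bootstrap (SIR) particle filter, which can be sketched as follows. The random-walk motion model and the abstract `observe` likelihood (standing in for the deep classifier's score on the patch at each particle) are our own assumptions; the paper does not specify f or the proposal.

```python
import numpy as np

def particle_filter_step(particles, weights, observe, transition_noise=2.0, rng=None):
    """One predict-update step of a bootstrap particle filter.

    particles : (N, d) state samples x_t (here d = 2 for image position)
    weights   : (N,) normalized importance weights, sum to 1
    observe   : callable returning likelihoods p(z_t | x_t) for all particles
    """
    rng = rng or np.random.default_rng(0)
    n = len(particles)
    # Forecast: propagate particles through a random-walk state model.
    particles = particles + rng.normal(0.0, transition_noise, particles.shape)
    # Update: reweight by the observation likelihood and renormalize.
    weights = weights * observe(particles)
    weights = weights / weights.sum()
    # State estimate: weighted mean approximating E[x_t | z_0:t].
    estimate = (weights[:, None] * particles).sum(axis=0)
    # Systematic resampling to avoid weight degeneracy.
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    idx = np.minimum(idx, n - 1)
    particles = particles[idx]
    weights = np.full(n, 1.0 / n)
    return particles, weights, estimate
```

In the tracker, only the patches at the N particle positions need to pass through the deep network, which is what reduces the per-frame processing load.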

Experiment and Analysis
The key parameters of the deep network and the particle filter are set as follows: the number of units of the visible layer is 24 × 24 (n_v = 24); the numbers of units of the two hidden layers are 18 × 18 and 12 × 12 (n_1 = 18, n_2 = 12); the initial weights for offline training lie in [0, 1] and follow a Gaussian distribution; the number of particles in the particle filter is N = 1000. For unsupervised offline training, around 5000 subimages containing only vehicles were selected. Some typical training images are shown in Figure 6.
In the experiments, different road scenarios were selected, including daytime, nighttime, and rainy conditions. To evaluate performance, two criteria were used: the tracking success rate and the center pixel offset. Tracking success is defined as area(R_T ∩ R_G)/area(R_T ∪ R_G) ≥ 50%, and the center pixel offset is defined as the Euclidean distance between the centers of R_T and R_G, in which R_T is the tracking box and R_G is the ground-truth target box.
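The two evaluation criteria above can be computed as follows; boxes are given as (x, y, w, h) tuples, a representation we assume for illustration.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union

def center_offset(box_a, box_b):
    """Euclidean distance between the two box centers, in pixels."""
    ca = np.array([box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0])
    cb = np.array([box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0])
    return float(np.linalg.norm(ca - cb))

def is_success(tracked, ground_truth):
    """A frame counts as a tracking success when IoU >= 0.5."""
    return iou(tracked, ground_truth) >= 0.5
```

The success rate reported in Table 1 is then simply the fraction of frames for which `is_success` holds, and Table 2 averages `center_offset` over all frames.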
The experimental results are shown in Tables 1 and 2. It can be seen from the tables that, since the proposed tracking algorithm is able to generate richer deep features, it performs better than existing state-of-the-art algorithms under the four different test conditions. The four typical scenarios chosen are daytime with good illumination, nighttime with road lamps, daytime with heavy rain, and twilight. Examples of real tracking results are shown in Figure 7.
It should be mentioned that although the performance of our method is improved, its processing speed is relatively low. The average processing time for one image is around 75 ms, the second worst among all the methods used in the experiments (Table 3). A likely reason for the high time cost is that the deep model contains a large number of neurons and weight connections, and computing the weight of each connection requires considerable processing time. The experimental platform was an Intel 2.67 GHz CPU with 4 GB RAM running the Windows 7 operating system.

Conclusion
To address the problem that traditional tracking methods are not robust enough for vehicle tracking in ADAS, a deep representation and semisupervised onboard vehicle tracking algorithm was proposed. Relying on the strong feature extraction ability of deep modeling, this method dramatically inhibits drift. Overall, the proposed semisupervised, deep model based tracking algorithm outperforms most existing methods in terms of tracking success rate and average center pixel offset. However, its real-time performance still needs to be improved. There are two possible ways to address this. First, a more concise deep model could be designed to reduce the computational burden. Second, parallel computing technology could be applied to speed up the calculation process.

Figure 2:
Figure 2: Overall structure of proposed semisupervised learning and deep model based method.

Figure 3:
Figure 3: Structure of the deep belief network.

Figure 4:
Figure 4: Example of features generated by unsupervised training.

Figure 5:
Figure 5: The structure of the proposed artificial neural network based on a DBN.

Figure 6:
Figure 6: Some of the typical images for unsupervised offline training.

Figure 7:
Figure 7: Examples of real tracking results.

Table 2:
The average center pixel offset of each algorithm (pixels).

Table 3:
The average processing time of each algorithm (ms).