A Multistep Framework for Vision-Based Vehicle Detection

Vision-based vehicle detection is a critical technology that plays an important role not only in vehicle active safety but also in road video surveillance applications. In this work, a multistep framework for vision-based vehicle detection is proposed. In the first step, for vehicle candidate generation, a novel method based on geometrical and coarse depth information is proposed. In the second step, for candidate verification, a deep belief network (DBN) for vehicle classification is trained. In the last step, a temporal analysis method based on complexity and spatial information is used to further reduce missed and false detections. Experiments demonstrate that this framework achieves a high true positive (TP) rate as well as a low false positive (FP) rate. On-road experimental results demonstrate that the algorithm performs better than state-of-the-art vehicle detection algorithms on the testing data sets.


Introduction
An advanced driver assistance system (ADAS) is developed to improve driver safety and comfort, which requires comprehensive perception and understanding of the on-road environment. For example, static and dynamic obstacles on the road need to be detected accurately in real time so that the safe driving space of the ego vehicle can be determined. Recently, with the fast development of optical devices and hardware, vision-based environment sensing has drawn much attention from researchers. Vehicle detection with an in-vehicle front-mounted vision system is one of the most important missions for vision-based ADAS.
Robust vision-based on-road vehicle detection with low missed detection and low false detection rates faces many challenges, since highways, urban roads, and city streets are dynamic environments in which the background and illumination are time-varying. Besides, the shape, color, size, and appearance of vehicles are highly variable. Making this task even more difficult, the ego vehicle and target vehicles are generally in motion, so the size and location of target vehicles captured in the image are diverse.
For robust vehicle detection, a two-step strategy proposed by Sun et al. [1], consisting of candidate generation (CG) and candidate verification (CV), is often applied by researchers. In the CG step, image blocks in which vehicles might exist are generated with simple, low-cost algorithms. In the CV step, the candidates generated in CG are verified with another, relatively complex machine learning algorithm.
In this work, mainly following the two-step strategy, a multistep framework for vision-based vehicle detection is proposed. In the first step, for vehicle candidate generation, a novel method based on geometrical and coarse depth information is proposed. In the second step, for candidate verification, a DBN for vehicle classification is trained. In the last step, to further reduce missed and false detections, a temporal analysis method based on complexity and spatial information is used. The overall framework proposed in this work is shown in Figure 1.
Our main contributions are in three aspects: (1) a novel vehicle candidate generation method based on geometrical and coarse depth information, which dramatically reduces the number of candidates that need to be verified and thus speeds up the whole detection process; (2) a DBN-based deep vehicle classifier; (3) a temporal analysis method based on complexity and spatial information, proposed to deal with occasionally appearing missed and false detections.

Vehicle Candidate Generation
In the vehicle candidate generation step, traversal search such as the sliding window method is traditionally used. Since traversal search generates a huge number of candidates, it cannot meet real-time requirements. Another CG method, based on a ground plane assumption, is proposed by Kim and Cho, which dramatically decreases the number of candidates [2]. However, this method performs poorly when the camera orientation changes or the road surface is curved. In [3], Felzenszwalb and Huttenlocher use a symmetry- and edge-based candidate generation method, which cannot handle partial vehicle occlusion.
In our work, a novel vehicle candidate generation method based on geometrical and coarse depth information is proposed, integrating bottom-layer and intermediate-layer information of the image. In bottom-layer information extraction, the original image is transformed into super-pixel form according to pixel similarity. Then, in intermediate-layer information extraction, the road scene is mainly divided into three parts, namely horizontal plane, vertical plane, and sky, by geometrical information extraction. Meanwhile, a coarse depth estimation method is applied to get approximate depth information from a single image. Finally, by combining the acquired geometrical and coarse depth information, super pixels are clustered so that vehicle candidates can be generated. The vehicle candidate generation process is shown in Figure 2.

Bottom Layer Information Extraction.
The bottom-layer information is the image information which can be obtained directly without further processing, such as pixel values, color, and so forth. The super pixel is an important expression of the underlying image information. Super-pixel segmentation is essentially an over-segmentation method which aggregates neighboring pixels with similar features into a group, named a super pixel. Because super-pixel segmentation better reflects the significance of human perception, it has a unique advantage in object recognition tasks. Since first proposed in [3], a large number of machine vision approaches now use super pixels instead of pixels in a wide range of applications such as image segmentation [4], image analysis [5], and image targeting [6]. In our application, the SLIC method described in [3] is applied to segment the road image into super-pixel form (Figure 2(b)).
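To illustrate the idea behind SLIC-style over-segmentation (not the paper's exact implementation), the following minimal sketch clusters pixels in a joint color-and-position feature space; the full SLIC algorithm additionally restricts each assignment to a local window for speed, which is omitted here.

```python
import numpy as np

def simple_slic(image, n_segments=16, compactness=10.0, n_iters=5):
    """Simplified SLIC-style superpixel segmentation (numpy only).

    Clusters pixels in a joint (color, position) feature space, which is
    the core idea of SLIC; `compactness` trades color similarity against
    spatial proximity.
    """
    h, w, _ = image.shape
    # Scale spatial coordinates so `compactness` controls the trade-off.
    step = np.sqrt(h * w / n_segments)
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.concatenate(
        [image.reshape(-1, 3).astype(float),
         (compactness / step) * np.stack([ys.ravel(), xs.ravel()], axis=1)],
        axis=1)
    # Initialize cluster centers spread evenly over the pixel list.
    idx = np.linspace(0, h * w - 1, n_segments).astype(int)
    centers = feats[idx]
    for _ in range(n_iters):
        # Assign each pixel to its nearest center, then recompute centers.
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_segments):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(0)
    return labels.reshape(h, w)
```

Each connected group of equal labels then plays the role of one super pixel in the later clustering stage.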

Intermediate Layer Information Extraction
Geometrical Information Extraction. Through statistical analysis of a large number of road pictures, it is found that horizontal planes in an image usually belong to the road surface area, vertical planes tend to be the surfaces of on-road objects such as vehicles, trees, and fences, and the sky is often in the upper part of the image. If each image pixel can be identified as horizontal plane, vertical plane, or sky, it provides rich information for vehicle candidate generation. By characterizing each super pixel with color, position, and perspective cues and feeding them into a pretrained AdaBoost regression classifier, Hoiem et al. successfully obtain the category of each super pixel [7]. Following Hoiem's work, images are divided into the aforementioned road pavement, vertical objects, and sky. As shown in the bottom-left picture of Figure 2(c), road pavement is marked in green, sky in blue, and vertical objects in brown and black.

Coarse Depth Information Extraction.
To obtain depth information from images, there are many mature methods based on stereo vision, but they are not suitable for our work with a single front-mounted camera. Recently, depth information acquisition from a single image has received widespread attention. References [8, 9] propose accurate single-image depth estimation algorithms, but these methods need several seconds per image, which makes it difficult to meet real-time requirements.
Different from applications that need accurate depth information, the vehicle candidate generation task only needs coarse depth information. For this, an SVM classifier that estimates the coarse depth of image objects is proposed. A large number of training images are collected with stereo vision and labeled with depth, and the depth is roughly divided into three groups. For training, each sample image is divided into a 6 × 6 grid, and the following three features are extracted to form the feature vector: (1) the mean value and histogram of the three channels in HSV color space, (2) Gabor-based texture gradient features, and (3) the category of each grid cell (sky, ground, etc.). The classifier is trained with the LIBSVM toolbox [10], and its output is the depth category. As shown in the bottom-right picture of Figure 2(c), the three coarse depth levels are marked with different colors.
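The per-cell feature vector described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: a plain gradient-magnitude histogram stands in for the Gabor texture features, the geometric category is assumed to be appended by the caller, and the input is assumed to be already converted to HSV with channels in [0, 1].

```python
import numpy as np

def grid_features(patch_hsv, n_bins=8):
    """Feature vector for one grid cell: per-channel mean and histogram
    in HSV space, plus a gradient-magnitude histogram as a crude texture
    proxy (the paper uses Gabor-based texture gradients here)."""
    feats = []
    for c in range(3):
        ch = patch_hsv[..., c]
        hist, _ = np.histogram(ch, bins=n_bins, range=(0.0, 1.0))
        feats.append([ch.mean()])            # channel mean
        feats.append(hist / max(ch.size, 1)) # normalized channel histogram
    gy, gx = np.gradient(patch_hsv[..., 2])  # texture proxy on V channel
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=n_bins)
    feats.append(hist / max(mag.size, 1))
    return np.concatenate(feats)

def image_depth_features(img_hsv, cells=6):
    """Split the image into a 6 x 6 grid and stack the per-cell features,
    forming the input vector for the coarse-depth SVM."""
    h, w = img_hsv.shape[:2]
    rows = np.array_split(np.arange(h), cells)
    cols = np.array_split(np.arange(w), cells)
    return np.concatenate([
        grid_features(img_hsv[r[0]:r[-1] + 1, c[0]:c[-1] + 1])
        for r in rows for c in cols])
```

The resulting vector would then be fed to a three-class SVM (e.g., via LIBSVM, as in the paper) that outputs the coarse depth group.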
Based on the coarse depth information, the pixel-level size range of vehicle candidates is given in Table 1. With the geometrical and coarse depth information of the image obtained, vehicle candidates can now be generated. In this section, firstly, the super pixels that satisfy the requirements of a vehicle are picked out. Then, vehicle candidates are generated by clustering the selected super pixels. The super pixel selection is mainly based on the following prior knowledge.

Vehicle Candidate Generation
(1) Vertical constraints: a super pixel that might belong to a vehicle should be in or adjacent to the vertical plane.
(2) Ground constraints: a super pixel that might belong to a vehicle should connect to the ground area.
(3) Depth constraints: super pixels within the same depth range can be clustered together.
(4) Size constraints: a super pixel that might belong to a vehicle should not be out of the range of vehicle size in the image.
By applying these four rules, a clustering strategy is proposed so that proper super pixels can be selected and grouped into vehicle candidates. The clustering method is as follows.
Algorithm 1 (clustering method for vehicle candidate generation). Consider the following steps:
Step 1. Start from a super pixel s that belongs to the vertical plane and connects to the ground area.
Step 2. Let N(s) be the set of super pixels near s that belong to the vertical plane. If N(s) is empty or all its members have been processed, STOP. Otherwise, find the super pixel t in N(s) with the minimal Euclidean distance d(s, t).
Step 3. If (s, t) satisfies (a) s and t are within the same depth range and (b) s ∪ t is not out of the range of vehicle size in the image, then s and t merge into a new super pixel s = s ∪ t.
Step 4. If the merged super pixel still satisfies the vehicle pixel size range constraint shown in Table 1, jump to Step 2; otherwise, the current group is considered a vehicle candidate, and the process jumps to Step 2 for the next seed.
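Algorithm 1 can be sketched in code as follows. The superpixel attribute names, the crude union-size rule, and all thresholds (including the Table 1 size range) are assumptions for illustration, not values from the paper.

```python
import numpy as np

def in_size_range(w, h, max_w=200, max_h=160):
    """Table 1 vehicle pixel size range; the limits here are assumed."""
    return w <= max_w and h <= max_h

def cluster_candidates(sps, max_dist=40.0):
    """Greedy clustering sketch of Algorithm 1. Each superpixel is a dict
    with centroid (cx, cy), box size (w, h), a coarse depth label, and
    the two geometric flags."""
    candidates, used = [], set()
    for i, s in enumerate(sps):
        # Step 1: seed from vertical-plane superpixels touching the ground.
        if i in used or not (s["vertical"] and s["touches_ground"]):
            continue
        region = {"cx": s["cx"], "cy": s["cy"], "w": s["w"], "h": s["h"],
                  "depth": s["depth"]}
        used.add(i)
        merged = True
        while merged:
            merged = False
            # Step 2: nearest unused vertical-plane neighbour.
            best, best_d = None, max_dist
            for j, t in enumerate(sps):
                if j in used or not t["vertical"]:
                    continue
                d = np.hypot(t["cx"] - region["cx"], t["cy"] - region["cy"])
                if d < best_d:
                    best, best_d = j, d
            # Steps 3-4: merge only if depths match and size stays plausible.
            if best is not None and sps[best]["depth"] == region["depth"]:
                t = sps[best]
                new_w = region["w"] + t["w"]      # crude union size
                new_h = max(region["h"], t["h"])
                if in_size_range(new_w, new_h):
                    region.update(w=new_w, h=new_h,
                                  cx=(region["cx"] + t["cx"]) / 2,
                                  cy=(region["cy"] + t["cy"]) / 2)
                    used.add(best)
                    merged = True
        candidates.append(region)
    return candidates
```

Each returned region would then be cropped from the frame and passed to the DBN verifier.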

Vehicle Candidate Verification
In this section, a deep belief network (DBN) based vehicle candidate verification algorithm is proposed.
Machine learning has proved to be the most useful approach for the vehicle candidate verification task. Support vector machines (SVM) and AdaBoost are the two most common classifiers used for training vehicle detectors [11-20]. However, classifiers such as SVM and AdaBoost belong to shallow learning models, because both can be modeled as a structure with one input layer, one hidden layer, and one output layer. Deep learning, on the other hand, is a class of machine learning techniques in which hierarchical architectures are exploited for representation learning and pattern classification. Superior to shallow models, deep learning is able to learn multiple levels of representation and abstraction of image data.
There are many subclasses of deep architectures, among which the deep belief network (DBN) is a typical deep learning structure, first proposed by Hinton et al. [21]. The DBN has demonstrated its success on MNIST classification. In [22], a modified DBN is developed in which a Boltzmann machine is used on the top layer and applied to a 3D object recognition task. In our work, a DBN-based classifier is trained for the vehicle candidate verification task.
In Section 3.1, the overall architecture of the DBN classifier for vehicle candidate verification is introduced. In Sections 3.2 and 3.3, the training method of the whole DBN is derived. The training set is composed of vehicle images and nonvehicle images.

Deep Belief Network (DBN) for Vehicle Candidate Verification
The training set can be written as S = {(X_k, y_k), k = 1, ..., n}, where X_k is a training sample in the image space R^(w × h) and y_k is the label vector corresponding to X_k. If X_k belongs to a vehicle, y_k = (1, 0); otherwise, y_k = (0, 1).
The ultimate purpose of the vehicle candidate verification task is to learn, from the given training set, a mapping function from the training data to the labels, so that this mapping function can classify unknown images as vehicle or nonvehicle.
Based on the task described above, a DBN architecture is applied to address this problem. Figure 4 shows the overall architecture: a fully interconnected directed belief network with one visible input layer v^1, N hidden layers h^1, ..., h^N, and one visible label layer La at the top. The visible input layer v^1 contains w × h units, equal to the dimension of the training features, which in this application are the raw 2D pixel values of the training samples. At the top, the La layer has just two units, equal to the number of classes to be distinguished. The problem is thus formulated as searching for the optimum parameter space θ of this DBN.

The main learning process of the proposed DBN consists of two steps.
(1) The parameters of two adjacent layers are refined with the greedy layer-wise reconstruction method; this is repeated until the parameters of all hidden layers are fixed. Step 1 is the so-called pretraining process.
(2) The whole DBN is fine-tuned with the label layer La information using back propagation. Step 2 can be viewed as the supervised training step.

Pretraining with the Greedy Layer-Wise Reconstruction Method.
In this subsection, the parameters of two adjacent layers are refined with the greedy layer-wise reconstruction method proposed by Hinton et al. [21]. To illustrate the pretraining process, we take the visible input layer v^1 and the first hidden layer h^1 as an example. The layers v^1 and h^1 form a restricted Boltzmann machine (RBM); let w × h be the number of units in v^1 and m × n that of h^1. The energy of a state (v^1, h^1) of this RBM is

E(v^1, h^1; θ^1) = −Σ_{i,j} Σ_{k,l} A^1_{ij,kl} v^1_{ij} h^1_{kl} − Σ_{i,j} b^1_{ij} v^1_{ij} − Σ_{k,l} c^1_{kl} h^1_{kl},

in which θ^1 = (A^1, b^1, c^1) are the parameters between v^1 and h^1: A^1_{ij,kl} is the symmetric weight from input unit (i, j) in v^1 to hidden unit (k, l) in h^1, and b^1_{ij} and c^1_{kl} are the (i, j)th and (k, l)th biases of v^1 and h^1, respectively. The RBM then has the joint distribution

P(v^1, h^1; θ^1) = exp(−E(v^1, h^1; θ^1)) / Z,

where Z is the normalization constant, and the probability that the model assigns to v^1 is

P(v^1; θ^1) = (1/Z) Σ_{h^1} exp(−E(v^1, h^1; θ^1)).

After that, the conditional distributions over the visible state v^1 and the hidden state h^1 are given by the logistic function:

P(h^1_{kl} = 1 | v^1) = σ(c^1_{kl} + Σ_{i,j} A^1_{ij,kl} v^1_{ij}),
P(v^1_{ij} = 1 | h^1) = σ(b^1_{ij} + Σ_{k,l} A^1_{ij,kl} h^1_{kl}),

where σ(x) = 1/(1 + exp(−x)). At last, starting from random Gaussian initial values A^1(0), b^1(0), and c^1(0), the weights and biases are updated step by step with the contrastive divergence algorithm [23]:

ΔA^1_{ij,kl} = η(⟨v^1_{ij} h^1_{kl}⟩_data − ⟨v^1_{ij} h^1_{kl}⟩_recon),
Δb^1_{ij} = η(⟨v^1_{ij}⟩_data − ⟨v^1_{ij}⟩_recon),
Δc^1_{kl} = η(⟨h^1_{kl}⟩_data − ⟨h^1_{kl}⟩_recon),

in which ⟨⋅⟩_data denotes the expectation with respect to the data distribution, ⟨⋅⟩_recon the expectation under the reconstruction distribution after one step, and the number of contrastive divergence steps is typically set to 1. The pretraining process above is illustrated with the pair (v^1, h^1); indeed, the whole pretraining is carried out from the lowest layer pair (v^1, h^1) up to the top pair (h^{N−1}, h^N), one pair at a time.
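The contrastive divergence updates above can be sketched as a single CD-1 step on one RBM layer pair. This is an illustrative vectorized version (flattened units, batch input, assumed learning rate), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, A, b, c, eta=0.1):
    """One contrastive-divergence (CD-1) update for an RBM layer pair.

    v0: batch of visible vectors (batch, n_vis); A: weights (n_vis, n_hid);
    b, c: visible and hidden biases. Implements the <.>_data - <.>_recon
    update rule from the pretraining equations.
    """
    # Up pass: hidden probabilities and a binary sample from them.
    ph0 = sigmoid(v0 @ A + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down-up pass: one-step reconstruction and its hidden probabilities.
    pv1 = sigmoid(h0 @ A.T + b)
    ph1 = sigmoid(pv1 @ A + c)
    # Updates: data expectation minus reconstruction expectation.
    A += eta * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += eta * (v0 - pv1).mean(axis=0)
    c += eta * (ph0 - ph1).mean(axis=0)
    return A, b, c
```

Stacking is done greedily: after this RBM converges, sigmoid(v @ A + c) becomes the "data" for the next layer pair.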

Global Fine-Tuning.
In the unsupervised pretraining process above, the greedy layer-wise algorithm is used to learn the DBN parameters. In this subsection, a traditional back propagation algorithm is used to fine-tune the parameters θ = [A, b, c] with the information of the label layer La.
Since a good parameter initialization has been obtained in the pretraining process, back propagation is only needed to finely adjust the parameters so that a local optimum θ* = [A*, b*, c*] can be reached. In this stage, the learning objective is to minimize the classification error −Σ_k y_k log ŷ_k, where y_k and ŷ_k are the true label and the network output for sample X_k.
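As a sketch of this supervised stage, the following shows one gradient step on the top layer under the −Σ y log ŷ objective. Only the output layer is shown for brevity; in the full network the same error signal is propagated down through every pretrained layer, and the layer sizes and learning rate here are illustrative.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def finetune_step(x, y, W, eta=0.1):
    """One backprop step on the output layer under cross-entropy loss.

    x: top hidden activations (batch, n_hid); y: one-hot labels such as
    (1, 0) for vehicle and (0, 1) for nonvehicle; W: (n_hid, 2) weights.
    """
    y_hat = softmax(x @ W)
    loss = -(y * np.log(y_hat + 1e-12)).sum(axis=1).mean()
    grad = x.T @ (y_hat - y) / len(x)   # d loss / d W for softmax + CE
    W -= eta * grad
    return W, loss
```

Repeating this step drives the classification error down from the pretrained starting point.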

Temporal Analysis Using Complexity and Spatial Information
After the vehicle candidate verification step proposed in the last section, most vehicles in the frames are detected. However, due to search window scale sparsity and imperfect detector performance, there may still be some missed and false detections in some frames. By observation, it is found that most missed and false detections appear only occasionally, so a temporal analysis using complexity and spatial information is proposed to refine the detection results.
The temporal analysis has two components. First, for a detected target in frame t, if the target is judged, using a similarity function, to have also appeared in frames t − 1 and t − 2, it is considered a true positive; otherwise it is considered a false detection and is eliminated. Second, for a verified target in frame t − 1, the same similarity function is used to determine whether there is a corresponding target in frame t. If not, a missed detection is assumed, and the area with the highest similarity, provided it is above a threshold, is taken as the detection target.
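The two checks can be sketched as follows. Here `similar(a, b)` stands in for the complexity-and-spatial similarity function defined next, and `tau` is an assumed matching threshold; both are placeholders, not values from the paper.

```python
def confirm_detections(curr, prev1, prev2, similar, tau=0.5):
    """Keep a detection only if it also appears (by similarity) in the
    two previous frames; otherwise treat it as a false detection."""
    return [d for d in curr
            if any(similar(d, p) > tau for p in prev1)
            and any(similar(d, p) > tau for p in prev2)]

def recover_misses(curr, prev_confirmed, areas, similar, tau=0.5):
    """For each confirmed target in frame t-1 with no match in frame t,
    adopt the candidate area with the highest above-threshold similarity
    as a recovered (previously missed) detection."""
    recovered = list(curr)
    for p in prev_confirmed:
        if not any(similar(p, d) > tau for d in curr):
            best = max(areas, key=lambda a: similar(p, a), default=None)
            if best is not None and similar(p, best) > tau:
                recovered.append(best)
    return recovered
```

The detection and target representations are deliberately opaque here; any object the similarity function accepts will do.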
The similarity between a target v_i in frame t and a target v_j in frame t + 1 is defined in terms of two components: the complexity similarity S_c(v_i, v_j) and the spatial similarity S_s(v_i, v_j), which are defined as follows.
(a) Complexity Similarity. There are multiple ways to measure image complexity, among which the proportion of edge pixels is a good measurement. Let n be the number of pixels in an image and ñ the number of pixels belonging to edges; the image complexity is then defined as c = ñ/n. The ratio of the complexities of two images is taken as the complexity similarity between target i in frame t and target j in frame t + 1.
(b) Spatial Similarity. Vehicle movement in video is a continuous process, and the time interval between two consecutive frames is very short (around 40 ms), so vehicle motion does not usually change dramatically in such a short period; vehicles in two consecutive frames show very small displacement and deformation. Therefore, the detection window size and centroid coordinates are used to build the spatial similarity function, in which (x, y) are the centroid coordinates of the detection window, (w, h) is the detection window size, and k_x, k_y, k_w, and k_h are constant factors.
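The exact functional forms are not fully specified above, so the following is one plausible implementation: the edge detector, the min/max folding of the complexity ratio, the exponential combination of the constant factors, and all numeric values are assumptions.

```python
import numpy as np

def complexity(gray, thresh=0.2):
    """Image complexity c = n_edge / n, using a simple gradient-magnitude
    edge test (the paper's edge detector is unspecified)."""
    gy, gx = np.gradient(gray.astype(float))
    edges = np.hypot(gx, gy) > thresh
    return edges.sum() / edges.size

def complexity_similarity(patch_a, patch_b):
    """Ratio of the two complexities, folded into [0, 1]."""
    ca, cb = complexity(patch_a), complexity(patch_b)
    if max(ca, cb) == 0:
        return 1.0
    return min(ca, cb) / max(ca, cb)

def spatial_similarity(box_a, box_b, k=(0.01, 0.01, 0.02, 0.02)):
    """Penalize centroid displacement and window-size change between
    consecutive frames; boxes are (x, y, w, h) and k holds the constant
    factors k_x, k_y, k_w, k_h (values here are assumed)."""
    dx, dy = box_a[0] - box_b[0], box_a[1] - box_b[1]
    dw, dh = box_a[2] - box_b[2], box_a[3] - box_b[3]
    return np.exp(-(k[0] * dx**2 + k[1] * dy**2
                    + k[2] * dw**2 + k[3] * dh**2))
```

Both terms lie in [0, 1] and equal 1 for identical patches and boxes, so their product (or any monotone combination) can serve as the overall similarity.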

Experiments and Analysis
The proposed multistep vehicle detection framework is tested on the PETS2001 dataset and on many video clips captured on the highway by our project group. In the experiments, the number of hidden layers of the DBN is set to 2, the numbers of units in the two hidden layers are 24 × 24 and 16 × 16, respectively, and the threshold is set to 0.8. Besides, training image samples are all resized to 32 × 32. The frames are all of 640 × 480 resolution, and the experiments are run on our Advantech industrial computer. Some vehicle candidate generation results are shown in Figure 5, where (a) is the original image, (b) shows the geometrical information (sky, vertical plane, and ground plane) extracted from the original image, (c) is the depth information from the pretrained classifier, and (d) shows the generated candidates.
Based on the vehicle candidate generation results, the DBN-based candidate verification method is then applied. The verification results are shown in Figure 6: the left column shows generated vehicle candidates marked in yellow, while the right column shows the verified vehicles marked with blue rectangles.
Finally, by applying the temporal analysis, some missed detections can also be recovered. As shown in Figure 7, the blue windows mark vehicles normally detected by the DBN detector, while the dashed green windows mark vehicles missed by the detector but corrected by the temporal analysis.
Our method is tested on more than 10 road-captured videos covering different times of day and different weather conditions. The overall vehicle detection results are shown in Table 2, together with some state-of-the-art vehicle detection results.
From the results shown in Table 2, it can be seen that the proposed vehicle detection framework attains the lowest false positive (FP) rate while reaching the second highest true positive (TP) rate, which is 0.24% lower than that of Southall's stereo vision based method. Meanwhile, by using monocular vision, the processing speed of our method is much faster than Southall's.

Method | TP rate | FP rate | Remarks
[24] | 81.48% | 16.7% | Monocular vision
Alonso et al. [25] | 92.63% | 3.68% | Monocular vision
Sun et al. [12] | 98.5% | 2% | Monocular vision; their most effective classifier
Southall et al. [26] | 99.37% | | Stereo vision

Figure 2: One sample of vehicle candidate generation.

Figure 3 shows an example of the proposed clustering method for vehicle candidate generation. Figure 3(a) is the original image, and Figures 3(b)-3(e) show the clustering process.

Figure 3: An example of the proposed clustering method.

Figure 5: Two vehicle candidate generation results.

Table 1: Vehicle size range in pixel level.

Table 2: Overall vehicle detection effects comparison.
Conclusion
In this work, a multistep framework for vision-based vehicle detection has been proposed. In the first step, for vehicle candidate generation, a novel method based on geometrical and coarse depth information is proposed. In the second step, for candidate verification, a DBN architecture for vehicle classification is trained. In the last step, a temporal analysis method based on complexity and spatial information is used to further reduce missed and false detections. Experiments demonstrate that this framework achieves a high TP rate as well as a low FP rate.