A Vehicle Detection Algorithm Based on Deep Belief Network

Vision based vehicle detection is a critical technology that plays an important role in not only vehicle active safety but also road video surveillance application. Traditional shallow model based vehicle detection algorithm still cannot meet the requirement of accurate vehicle detection in these applications. In this work, a novel deep learning based vehicle detection algorithm with 2D deep belief network (2D-DBN) is proposed. In the algorithm, the proposed 2D-DBN architecture uses second-order planes instead of first-order vector as input and uses bilinear projection for retaining discriminative information so as to determine the size of the deep architecture which enhances the success rate of vehicle detection. On-road experimental results demonstrate that the algorithm performs better than state-of-the-art vehicle detection algorithm in testing data sets.


Introduction
Robust vision based vehicle detection on the road is to some extent a challenging problem since highways and urban and city roads are dynamic environment, in which the background and illuminations are dynamic and time variant. Besides, the shape, color, size, and appearance of vehicles are of high variability. To make this task even more difficult, the ego vehicle and target vehicles are generally in motion so that the size and location of target vehicles mapped to the image are diverse.
Although deep learning for object recognition has been an area of great interest in the machine-learning community, no prior research study has been reported that uses deep learning to establish an on-road vehicle detection method. In this paper, a 2D-DBN based vehicle detection algorithm is proposed.
The main novelty and contribution of this work include the following. (1) A deep learning architecture of 2D-DBN which preserves discriminative information for vehicle detection is proposed.
(2) A deep learning based on-road vehicle detection system has been implemented and thorough quantitative performance analysis has been presented.
The rest of this paper is organized as follows. Section 2 will give a brief talking about vision based vehicle detection tasks and deep learning for object recognition. Section 3 introduces in detail the proposed 2D-DBN architecture and training methods for the vehicle detection tasks. The experiments and analysis will be given in Section 4 and Section 5 is the conclusion.

Related Research
In this section, a brief overview of two categories of work that is relevant to our research is introduced. The first set is about vision based vehicle detection and the second focuses on deep learning for object recognition.

Vision Based Vehicle Detection.
Since only monocular visual perception is used in our project, this section will mainly refer to studies using monocular vision for on-road vehicle detection.
For monocular vision based vehicle detection, using vehicle appearance characteristics is the most common and effective approach. A variety of appearance features have been used in the field to detect vehicles. Some typical image features representing intuitive vehicle appearance information, such as local symmetry, edge, and cast shadow, have been used by many earlier works.  In recent years, there has been a transition from simpler image features to general and robust feature sets for vehicle detection. These feature sets, now common in the computer vision literature, allow for direct classification and detection of objects in images. For vehicle detection purpose, histogram of oriented gradient (HOG) features and Haarlike features are extremely well represented in literature. Besides, Gabor features, scale-invariant feature transform (SIFT) features, speeded up robust features (SURF), and some combined features are also applied for vehicle image representation.
Classification methods for appearance-based vehicle detection have also followed the general trends in the computer vision and machine learning literature. Compared to generative classifiers, discriminative classifiers, which learn a decision boundary between two classes (vehicles and unvehicles), have been more widely used in vehicle detection application. Support vector machines (SVM) and Adaboost are the most two common classifiers that are used for training vehicle detector. In [1], SVM classification was used to classify Haar feature vectors. The combination of HOG features and SVM classification has also been used [2][3][4]. Adaboost [5] has also been widely used for classification for Viola and Jones' contribution. The combination of Haar-like feature extraction and Adaboost classification has been used to detect rear faces of vehicles in [6][7][8][9]. Artificial neural network classifiers were also used for vehicle detection, but the training often failed due to local optimum [10].

Deep Learning for Object
Recognition. Classifiers such as SVM and Adaboost referred to in last section are all indeed a shallow learning model because they both can be modeled as structure with one input layer, one hidden layer, and one output layer. Deep learning refers to a class of machine learning techniques, where hierarchical architectures are exploited for representation learning and pattern classification. Different from those shallow models, deep learning has the ability of learning multiple levels of representation and abstraction that helps to make sense of image data. From another point of view, deep learning can be viewed as one kind of multilayer neural networks with adding a novel unsurprised pretraining process.
There are various subclasses of deep architecture. Deep belief networks (DBN) modal is a typical deep learning structure which is first proposed by Hinton et al. [11]. The original DBN has demonstrated its success in simple image classification tasks of MNIST. In [12], a modified DBN is developed in which a Boltzmann machine is used on the top layer. This modified DBN is used in a 3D object recognition task.
Deep convolutional neural network (DCNN) with the ability to preserve the space structure and resistance to small variations in the images is used in image classification [13]. Recently, DCNN achieved the best performance compared to other state-of-the-art methods in the 2012 ImageNet LSVRC contest containing 1.2 million images with more than 1000 different classes. In this DCNN application, a very large architecture is built with more than 600,000 neurons and over 60 million weights.
DBN is a probabilistic model composed of multiple layers of stochastic, hidden variables. The learning procedure of DBN can be divided into two stages: generative learning to abstract information layer by layer with unlabelled samples firstly and then discriminative learning to fine-tune the whole deep network with labeled samples to the ultimate learning target [11]. Figure 1 shows a typical DBN with one input layer 1 and N hidden layers 1 , 2 , . . . , , while x is the input data which can be, for example, a vector, and y is the learning target, for example, class labels. In the unsupervised stage of DBN training processes, each pair of layers grouped together to reconstruct the input of the layer from the output. In Figure 1, the layer-wise reconstruction happens between 1 and 1 , 1 and 2 , . . ., −1 and , respectively, which is implemented by a family of restricted Boltzmann machines (RBMs) [14]. After the greedy unsupervised learning of each pair of layers, the features are progressively combined from loose low-level representations into more compact highlevel representations. In the supervised stage, the whole deep network is then refined using a contrastive version of the "wake-sleep" algorithm via a global gradient-based optimization strategy.

Deep Learning Based Vehicle Detection
In this section, a novel algorithm based on deep belief network (DBN) is proposed. Traditional DBN for object classification has some shortages. First, the training samples are regularized to first-order vector which will lead to the missing of spatial information contained by the image samples. This will obviously lead to a decline in the detection rate in vehicle detection tasks. Secondly, the size of layers (such as node number and layer number) in the traditional DBN is manually set which is often big and will lead to structural redundancy and increase the training and decision time of the classifier, while, for vehicle detection algorithm which is usually used in real-time application, decision time is a critical factor. The proposed 2D-DBN architecture for vehicle detection uses second-order planes instead of firstorder vector of 1D-DBN as input and uses bilinear projection for retaining discriminative information so as to determine the size of the deep architecture. The bilinear projection maps original second-order output of lower layer to a small bilinear space without reducing discriminative information. And the size of the upper layer is that of the bilinear space.
In Section 3.1, the overall architecture of our 2D-DBN for vehicle detection will be introduced. In Section 3.2, the bilinear projection method of lower layer output will be given. In Sections 3.3 and 3.4, the training method of the whole 2D-DBN for vehicle detection will be deduced.

2 Deep Belief Network (2D-DBN) for Vehicle Detection.
Let be the set of data samples including vehicle images and nonvehicle images, assuming that is consisting with K samples which is shown below: In , X is training samples and in the image space R × . Meanwhile, means the labels corresponding to , which can be written as In , is the label vector of X . If X belongs to vehicles, = (1, 0). On the contrary, = (0, 1). The ultimate purpose in vehicle detection task is to learn a mapping function from training data to the label data based on the given training set, so that this mapping function is able to classify unknown images between vehicle and nonvehicle. Based on the task described above, a novel 2D deep belief network (2D-DBN) is proposed to address this problem. Figure 2 shows the overall architecture of 2D-DBN. A fully interconnected directed belief network includes one visible input layer 1 , hidden layers 1 , . . . , , and one visible label layer La at the top. The visible input layer 1 maintains × neural and equal to the dimension of training feature which is the original 2D image pixel values of training samples in this application. Since maximum discriminative ability wishes to be preserved from layer to layer with nonredundant layer size in this application for real-time requirement, the size of hidden layers is dynamically decided with so-called bilinear projection. In the top, the La layer just has two units which is equal to the classes this application would like to classify. Till now, the problem is formulated to search for the optimum parameter space of this 2D-DBN.
The main learning process of the proposed 2D-DBN has three steps.
(1) The bilinear projection is utilized to map the lower layer output data onto subspace and optimized to optimum dimension as well as to retain discriminative information. The size of the upper layer will be determined by this optimum dimension.
(2) When the size of the upper layer is determined, the parameters of the two adjacent layers will be refined with the greedy-wise reconstruction method. Repeat step (1) and step (2) till all the parameters of hidden layers are fixed. Here, step (1) and step (2) are so called pretraining process.
(3) Finally, the whole 2D-DBN will be fine-tuned with the La layer information based on back propagation. Here, step (3) can be viewed as supervised training step.

Bilinear Projection for Upper Layer Size Determination.
In this section, followed with Zhong's contribution [15], bilinear projection is used in order to determine the size of every upper layer in adjacent layer groups. As mentioned in Section 3.1, with the labeled training data X ∈ R × as the output of the visible layer 1 , bilinear projection maps the 4 The Scientific World Journal original data X onto a subspace and is represented by its latent form LX . The bilinear projection is written as follows: Here, U ∈ × and V ∈ × are projection matrices that map the original data X by its latent form LX with the constraint that U U = I and V V = I.
How to determine the value of U and V so that the discriminative information of X can be preserved is the issue that needs to be solved. For this, a specific objective function is built as follows: in which is between-class weights, is within class weights, and ∈ [0, 1] is the balance parameter. and are calculated as follows [16]: Here, is the class label of sample data X , which is either (1, 0) or (0, 1).
is the number of samples that belong to class and is those not belonging to class . Since vehicle detection is a binary classification problem, ∈ {1, 2} in this application.
It can be seen that the purpose of the objective function is to simultaneously maximize the between-class distances and minimize the within-class distances. In other words, the objective function focuses on maximizing the discriminative information of all the sample data. However, optimizing (U, V) is a nonconvex optimization problem with two matrices U and V. To deal with this trouble, a strategy called alternative fixing (AF) is used, which is fixing U (or V) and optimizing the objective function (U, V) with just variable matrix V (or U) and then fixing V (or U) and optimizing (U, V) with just U (or V). AF will be implemented alternatively till (U, V) reaches its upper bound.
After the optimum process, new U * and V * that maximize (U, V) are got and preserve the discriminative information of original sample data . Based on this, then, the size of the upper layer can be determined by the number of positive eigenvalues of U * and V * , which is and , respectively.

Pretraining with Greedy Layer-Wise Reconstruction
Method. In last subsection, the size of the upper layer is determined to be × . In this subsection, the parameters of the two adjacent layers will be refined with the greedy-wise reconstruction method proposed by Hinton et al. [11]. To illustrate this pretraining process, we take the visible input layer 1 and the first hidden layer 1 for example.
The visible input layer 1 and the first hidden layer 1 contract a restrict Boltzmann machine (RBM). × is the neural number in 1 and × is that of 1 . The energy of the state (V 1 ,ℎ 1 ) in this RBM is in which 1 = (A 1 , b 1 , c 1 ) are the parameters between the visible input layer 1 and the first hidden layer 1 . 1 , is the symmetric weights from input neural ( , ) in 1 to the hidden neural ( , ) in 1 . 1 and 1 are the ( , ) ℎ and ( , ) ℎ bias of 1 and 1 . So this RBM is with the joint distribution as follows: Here, Z is the normalization parameter and the probability that k 1 is assigned to 1 of this modal is After that, the conditional distributions over visible input state k 1 in layer 1 and hidden state ℎ 1 in 1 are able to be given by the logistic function, respectively, Here, ( ) = 1/(1 + exp(− )). At last, the weights and biases are able to be updated step by step from random Gaussian distribution values The Scientific World Journal 1 , (0), 1 (0), and 1 (0) with Contrastive Divergence algorithm [17], and the updating formulations are in which ⟨⋅⟩ data means the expectation with respect to the data distribution and ⟨⋅⟩ recon means the reconstruction distribution after one step. Meanwhile, is step size which is set to = 1 typically. Above, the pretraining process is demonstrated by taking the visible input layer 1 and the first hidden layer 1 for example. Indeed, the whole pretraining process will be taken from low layer groups ( 1 , 1 ) to up layer groups ( −1 , ) one by one.

Global Fine-Tuning.
In the above unsurprised pretraining process, the greedy layer-wise algorithm is used to learn the 2D-DBN parameters with the information added from bilinear projection. In this subsection, a traditional back propagation algorithm will be used to fine-tune the parameters = [A, b, c] with the information of label layer La.
Since good parameters initiation has been maintained in the pretraining process, back propagation is just utilized to finely adjust the parameters so that local optimum parameters * = [A * , b * , c * ] can be got. In this stage, the learning objection is to minimize the classification error [− ∑ y logŷ ], where y andŷ are the real label and output label of data X in layer .

Experiment and Analysis
This section will take experiments on vehicle datasets to demonstrate the performance of the proposed 2D-DBN. The datasets for training are Caltech1999 database which includes images containing 126 rear view vehicles. Besides, another  Figure 3 shows some of these positive and negative training samples. The testing datasets are recorded road videos with 735 manual marked vehicles. By using the proposed method, three different architectures of 2D-DBN are applied. They all contain one visible layer and one label layer, but with one, two, and three hidden layers, respectively. In training, the critical parameters of the proposed 2D-DBN in experiments are set as = 0.5 and = 0.8 and image samples for training are all resized to 32 × 32.
The detection results of these three architectures of 2D-DBN are shown in Table 1. It can be seen that 2D-DBN with two hidden layers maintains the highest detection rate.
The learned weights of hidden layers are shown in Figure 4.
Then, we compared the performance of our 2D-DBN with many other state-of-the-art classifiers, including support vector machine (SVM), k-nearest neighbor (KNN), neural networks, 1D-DBN, and deep convoluted neural network (DCNN).
The detection results of these methods are shown in Table 2.
From the compared results, it can be concluded that classification methods with deep architecture, for example, 1D-DBN, DCNN, and 2D-DBN are significantly better than those of shallow architecture, for example, SVM, KNN, and NN. Moreover, our proposed 2D-DBN is better than 1D-DBN and DCNN due to 2D feature input and the bilinear projection.
Finally, this 2D-DBN vehicle detection method is utilized on road vehicle detection system and some of the vehicle  sensing results in real road situation are shown in Figure 5. The four rows of images are picked in daylight highway, raining day highway, daylight urban, and night highway with road lamp, respectively. The solid green box means detected vehicles, and the dotted red box means undetected vehicles or false detected vehicles. The average vehicle detection time for one frame with 640 × 480 resolution is around 53 ms in our Advantech industrial computer.
Overall, most of the on-road vehicles can be sensed successfully while misdetection and false detection sometimes occurred during adverse situations such as partial occlusion and bad weather.

Conclusion
In this work, a novel vehicle detection algorithm based on 2D-DBN is proposed. In the algorithm, the proposed 2D-DBN architecture uses second-order planes instead of first-order vector as input and uses bilinear projection for retaining discriminative information so as to determine the size of the deep architecture which enhances the success rate of vehicle detection. On-road experimental results demonstrate that the system works well in different roads, weather, and lighting conditions.
The future work of our research will focus on the situation when a vehicle is partially occluded with deep architecture framework.