This paper aims at generating high-quality object proposals for object detection in autonomous driving. Most existing proposal generation methods are designed for general object detection and may not perform well in a particular scene. We propose several geometric features suited for autonomous driving and integrate them into state-of-the-art general proposal generation methods. In particular, we formulate the integration as a feature fusion problem, fusing the geometric features with existing proposal generation methods in a Bayesian framework. Experiments on the challenging KITTI benchmark demonstrate that our approach improves the existing methods significantly. Combined with a convolutional neural network detector, our approach achieves state-of-the-art performance on all three KITTI object classes.
Object detection has been studied for many years, and there are a variety of robust approaches [
However, these methods suffer a great performance degradation when applied to the autonomous driving scene, such as the challenging KITTI benchmark [
In this paper, we propose an effective approach to improve object proposals in the autonomous driving scene. Our work is motivated by the following observations. First, there are three primary object classes in the autonomous driving scene: Car, Cyclist, and Pedestrian. These objects usually lie on the ground, with different heights, so the proposals should also lie on the ground. Second, the real-world size of objects within one category varies far less than their size in the image, while the real-world sizes of different categories differ from each other. It is therefore helpful to use the object size prior as an indicator to generate proposals. The details are discussed in Section
This paper makes two fundamental contributions.
(1) We propose two new geometric features, AR and SD2, to represent the object size prior. We exploit D2R as an indicator to constrain the proposals to lie on the ground. These features are demonstrated to be effective for generating fewer proposals with higher recall.
(2) We analyze the four geometric features, AR, SD2, DMD, and D2R, in depth and propose a method to combine these features with existing methods efficiently. The final results on the KITTI object detection benchmark achieve state-of-the-art performance among stereo-based methods.
Since depth information is required to compute the geometric features, we assume a stereo image pair as input and obtain depth information via the state-of-the-art approach by Yamaguchi et al. [
The main idea of object proposal methods is to generate a relatively small number of bounding boxes that cover the objects of interest in an image with high recall. Existing proposal generation methods are often based on low-level image features and can generally be divided into two categories: grouping methods and window scoring methods.
Grouping proposal methods aim to generate multiple segments that are likely to correspond to objects. To cover objects of various sizes, most methods attempt to merge the output of a hierarchical image segmentation algorithm. The decision to merge segments is typically designed manually, based on superpixel shape, appearance features, and boundary estimates.
Selective Search [
In order to detect objects of different sizes, MCG [
Since SS and MCG both need an initial image segmentation, which impacts the object proposal results, CPMC [
Window scoring methods score each candidate window to indicate how likely it is to contain an object of interest. Compared to grouping approaches, these methods usually return bounding boxes directly and run fast. However, they tend to generate proposals with low localization accuracy unless the window sampling is performed very densely.
Objectness [
BING [
Edgeboxes [
However, most previous methods are designed for general objects; they do not perform well in a particular scene such as the KITTI [
Most grouping and scoring methods mentioned above either purely use RGB appearance features or only use depth-informed geometric features, ignoring the complementarity of the two. Although some methods, such as MCG-D, use RGB and depth features simultaneously, they are not suitable for autonomous driving because of the complex outdoor environment. In this paper, we propose a method that exploits both the appearance features and the geometric features. Our work formulates the problem as fusing these two complementary kinds of features in a Bayesian framework to obtain high-quality object proposals in autonomous driving.
As mentioned in previous sections, geometric features are important for improving the quality of object proposals. We introduce four geometric features: the aspect ratio (AR), the diagonal multiplied by the distance (DMD), the area multiplied by the square of the object depth (SD2), and the distance to the road (D2R).
Objects in different classes usually exhibit significant differences in appearance, while those in the same class vary far less. Since an object is tightly bounded by a bounding box, the aspect ratio of boxes of the same class should vary within a specific range. Based on this intuition, we use AR as a feature to assess the possibility of an image window covering a specific class. The aspect ratio of a bounding box is calculated as AR = w / h, where w and h are the width and height of the box.
Objects' sizes in the image can be measured by the bounding boxes covering them, and they vary significantly across the dataset. Meanwhile, the real-world size of objects in the same class varies far less, as mentioned in [
As shown in Figure
The imaging principle of the camera.
Depth information has been utilized for object detection in recent years; it can be computed from a disparity map or directly obtained by depth sensors, such as Kinect. In this paper, we use a stereo image pair as input and compute the disparity map via the state-of-the-art approach by Yamaguchi et al. [
As mentioned above, the relationship between the image size and the depth of the object can approximately serve as a proxy for the real-world object size. The camera focal length can be ignored, as it is a constant. Inspired by this observation, we use the product of the area of the bounding box and the squared distance to the camera as an approximate representation of the real-world object size. The SD2 feature can be written as SD2 = w · h · d², where w and h are the width and height of the bounding box and d is the object depth.
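Under the pinhole imaging model discussed above, the link between image size and real-world size can be made explicit. A brief sketch (with f the focal length, d the object depth, and W, H the real-world width and height):

```latex
% Pinhole projection of an object of real-world size W x H at depth d:
w = \frac{f\,W}{d}, \qquad h = \frac{f\,H}{d}
% Hence the real-world area satisfies
W H = \frac{w\,h\,d^{2}}{f^{2}} \;\propto\; w\,h\,d^{2}
% Dropping the constant focal length yields the SD2 feature:
\mathrm{SD2} = w \cdot h \cdot d^{2}
```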
DMD is a feature that approximately represents the real-world object size [
The distributions of DMD and SD2 for Car, Cyclist, and Pedestrian are shown in the second and third rows of Figure
Statistics of the four object features. For each object class (Car, Cyclist, and Pedestrian), from top to bottom the features are AR, DMD, SD2, and D2R. The features are normalized to zero mean and unit variance (mean subtraction and division by the standard deviation).
Since all the annotated objects in the KITTI benchmark are on the ground, the ground plane can be used as an important indicator of the possibility that a proposal contains an object. A proposal is more likely to cover an object when it is close to the ground plane and less likely when it is far away from it. We use the same method as in [
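As an illustration, the four features can be computed from a proposal box, its depth, and a ground-plane estimate roughly as follows. This is a sketch; the function name, the plane parameterization, and the camera intrinsics are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def geometric_features(box, depth, ground_plane):
    """Compute the four geometric features for one proposal.

    box: (x1, y1, x2, y2) bounding box in pixels.
    depth: depth d of the object (e.g. median depth of the pixels
        inside the box, from the disparity map), in metres.
    ground_plane: (a, b, c, e) with a*X + b*Y + c*Z + e = 0 in camera
        coordinates (an illustrative plane representation).
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1

    ar = w / h                        # AR: aspect ratio of the box
    dmd = np.hypot(w, h) * depth      # DMD: box diagonal times depth
    sd2 = w * h * depth ** 2          # SD2: box area times squared depth

    # D2R: distance from the box's bottom-centre 3D point to the ground
    # plane. Back-projection uses assumed (KITTI-like) intrinsics.
    f, cx, cy = 721.5, 609.6, 172.9   # illustrative values only
    u, v = (x1 + x2) / 2.0, y2
    X = (u - cx) * depth / f
    Y = (v - cy) * depth / f
    a, b, c, e = ground_plane
    d2r = abs(a * X + b * Y + c * depth + e) / np.hypot(a, np.hypot(b, c))
    return ar, dmd, sd2, d2r
```

The features can then be normalized to zero mean and unit variance before fusion, as noted in the figure above.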
As the four proposal features are relatively complementary, using several of them at the same time is promising. AR gives only the proportions of the object's projection in the image. DMD or SD2 serves as a proxy for the real-world object size, but both depend on precise depth calculated from the disparity map. D2R denotes the distance to the road, which can roughly distinguish positive examples from negative ones.
To combine these features (AR, SC, DMD, SD2, and D2R), we train a Bayesian classifier to distinguish between positives and negatives. SC is the initial score of the existing method. For each training image, we sample all the proposals whose IoU overlap with a ground-truth box exceeds a threshold as positives, and the rest as negatives.
After training, given a proposal, we calculate its posterior probability using the following equation:
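Assuming conditional independence of the features (naive Bayes), a posterior of the following form is consistent with the per-feature histograms used in training, for a proposal with feature values f_1, ..., f_n (here n = 5: AR, SC, DMD, SD2, and D2R):

```latex
P(\text{obj} \mid f_1,\dots,f_n)
  = \frac{P(\text{obj}) \prod_{i=1}^{n} P(f_i \mid \text{obj})}
         {P(\text{obj}) \prod_{i=1}^{n} P(f_i \mid \text{obj})
          + P(\overline{\text{obj}}) \prod_{i=1}^{n} P(f_i \mid \overline{\text{obj}})}
```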
After a large number of positive and negative proposals are sampled, the distributions of their features (AR, SC, DMD, SD2, and D2R) are estimated via histograms (we sample all the proposals whose IoU overlap with a ground-truth box exceeds a threshold as positives, and the rest as negatives).
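The histogram-based training and scoring described above can be sketched as follows. This is a minimal illustration under a naive Bayes assumption; the class name, bin count, and smoothing are assumptions rather than the paper's exact settings:

```python
import numpy as np

class HistogramNaiveBayes:
    """Naive Bayes over per-feature histograms, used as a proposal re-scorer."""

    def __init__(self, n_bins=40):
        self.n_bins = n_bins

    def fit(self, pos, neg):
        """pos, neg: arrays of shape (n_samples, n_features)."""
        self.prior_pos = len(pos) / (len(pos) + len(neg))
        self.edges, self.p_pos, self.p_neg = [], [], []
        for j in range(pos.shape[1]):
            lo = min(pos[:, j].min(), neg[:, j].min())
            hi = max(pos[:, j].max(), neg[:, j].max())
            edges = np.linspace(lo, hi, self.n_bins + 1)
            # Laplace-smoothed likelihood histograms for each class.
            hp, _ = np.histogram(pos[:, j], bins=edges)
            hn, _ = np.histogram(neg[:, j], bins=edges)
            self.edges.append(edges)
            self.p_pos.append((hp + 1) / (hp.sum() + self.n_bins))
            self.p_neg.append((hn + 1) / (hn.sum() + self.n_bins))
        return self

    def posterior(self, x):
        """Posterior probability that feature vector x belongs to an object."""
        log_pos = np.log(self.prior_pos)
        log_neg = np.log(1 - self.prior_pos)
        for j, v in enumerate(x):
            b = int(np.clip(np.searchsorted(self.edges[j], v) - 1,
                            0, self.n_bins - 1))
            log_pos += np.log(self.p_pos[j][b])
            log_neg += np.log(self.p_neg[j][b])
        m = max(log_pos, log_neg)          # stabilize the exponentials
        num = np.exp(log_pos - m)
        return num / (num + np.exp(log_neg - m))
```

Proposals are then ranked by this posterior instead of by the original SC score alone.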
In this section, we evaluate our method on the challenging KITTI benchmark [
Following [
The results of feature analysis and feature integration are reported on the hard validation set for all three object classes, while the comparison with the state of the art covers all three object classes and all three difficulty regimes, using the same metrics described in the previous section.
We first verify the effectiveness of each geometric feature independently. As our goal is to analyze the performance of each feature and their combinations, independent of the baseline method, we evaluate our method based only on Edgeboxes. The results of the baseline method are denoted SC. As shown in Figure
Single feature results: the first row shows the recall versus IOU curve for 500 proposals, while the second row shows recall versus the number of proposals at different IOU thresholds. For Car the IOU threshold is 0.7, and it is 0.5 for Cyclist and Pedestrian. We analyze the original results and the four proposed features independently to observe their usefulness. All four proposed features work better than the original result when a single feature is used to generate the proposals. From the experiments on the three object classes, D2R is the most useful feature, while our proposed feature SD2 ranks second. DMD has performance similar to SD2, because both capture the constancy of object size in the real world. AR is also a useful feature.
Then we combine the geometric features and SC together in a Bayesian framework, using different combinations to find the best fusion of these features. In order to apply the Bayes rule, the prior probabilities of positives and negatives are estimated from the frequencies of the sampled training proposals.
We combine the five features in a Bayesian framework with all possible combinations: 10 pairs of features, 10 triplets, 5 combinations of four, and 1 with all five features together. We have evaluated all the combinations. Since plotting all of them would be hard to read, we only show the top 2 results among the pairs and among the triplets, 1 from the four-feature combinations, and 1 for all five features. The results are shown in Figure
Results on the hard validation sets for all three object classes. AUC is the abbreviation for Area Under the Curve, Recall is the maximum recall the method can achieve, and the last column is the number of proposals required to achieve a certain recall (Inf when it is never reached).

Features                 Cars                    Cyclist                 Pedestrian
                         AUC   Recall (%)  N     AUC   Recall (%)  N     AUC   Recall (%)  N

Single features
  AR                     0.13  59          Inf   0.14  52          Inf   0.20  89          Inf
  SC                     0.15  70          Inf   0.24  82          2209  0.29  89          1226
  DMD                    0.17  73          Inf   0.27  78          3735  0.28  84          1337
  SD2                    0.19  78          4426  0.27  74          Inf   0.31  85          1463
  D2R                    –     –           –     –     –           –     –     –           –

Feature combinations
  D2R + DMD              0.39  92          682   0.43  89          832   0.46  90          307
  D2R + SD2              0.41  92          509   0.42  88          927   0.46  89          286
  D2R + DMD + AR         0.45  92          413   0.47  89          564   0.53  90          609
  D2R + SD2 + AR         –     –           –     –     –           –     0.54  89          667
  D2R + DMD + SD2 + AR   0.46  92          392   0.47  89          625   –     –           –
  All                    0.45  92          423   0.44  89          568   0.52  91          234
Feature combination results: the first row shows the recall versus IOU curve for 500 proposals, while the second row shows recall versus the number of proposals at different IOU thresholds. For Car the IOU threshold is 0.7, and it is 0.5 for Cyclist and Pedestrian.
Based on the analysis of the features in the previous section, we choose D2R, SD2, and AR as our final selection. As our method can be integrated into any object proposal generation method, we verify its effectiveness on two representative methods: EB (Edgeboxes) and SS (Selective Search). Correspondingly, we name their improved versions OurEB145 and OurSS145, where 1 represents AR, 2 SC, 3 DMD, 4 SD2, and 5 D2R. OurEB145 denotes the results obtained by fusing the three geometric features AR, SD2, and D2R with EB in a Bayesian framework. In the rest of the paper we simply write OurEB instead of OurEB145, and likewise OurSS. We also compare our results with 3DOP because it is the state-of-the-art method that exploits geometric features to generate object proposals.
Figure
Recall versus IOU for 500 proposals in three regimes (Easy, Moderate, Hard). From top to bottom: Car, Cyclist, and Pedestrian.
Figure
Recall versus number of proposals: the overlap threshold for Car is 0.7, and it is 0.5 for Pedestrian and Cyclist. The columns are the Easy, Moderate, and Hard regimes; from top to bottom: Car, Cyclist, and Pedestrian.
Given the depth map, our features can be computed efficiently. Combined with the existing method, our approach obtains significant improvement with only 0.2 s of additional runtime on a single core in MATLAB. Table
Running time of different proposal methods.
Method          Selective Search  Edgeboxes  3DOP  OurSS  OurEB
Time (seconds)  15                1.5        1.2   15.2   1.7
To evaluate the object detection performance based on our proposal generation method, we apply the state-of-the-art Fast R-CNN object detector on the bounding box proposals generated by our method, as done for 3DOP in [
Average Precision (AP) (in %) on the validation set of the KITTI object detection benchmark with 1000 proposals; for EB and SS, the number of proposals is 2000.

Method   Cars                      Cyclist                   Pedestrian
         Easy   Moderate  Hard     Easy   Moderate  Hard     Easy   Moderate  Hard

SS [     75.91  60.00     50.98    56.23  39.16     38.83    54.06  47.55     40.56
EB [     86.81  70.47     61.16    55.01  37.87     35.80    57.79  49.99     42.19
3DOP     94.47  87.09     –        84.65  57.38     55.63    72.47  65.00     57.24
OurSS    –      –         78.57    –      –         –        74.23  66.54     57.90
OurEB    88.92  87.40     78.43    83.38  57.72     55.69    –      –         –
The visual results of our object detection framework are shown in Figure
Visual results of our object detection framework. The odd rows show the ground-truth bounding boxes, while the even rows show the detected bounding boxes. Different colors indicate different difficulties: green means not occluded, yellow partly occluded, and red fully occluded. The first four rows show results for Cars, while the last four rows show results for Pedestrian and Cyclist.
In this paper, we propose several geometric features suitable for object proposals in the autonomous driving scene and integrate them with existing object proposal generation methods in a Bayesian framework. We analyze in depth the effectiveness of each geometric feature and of different feature combinations. Experiments on the challenging KITTI benchmark demonstrate that, by integrating these geometric features into existing object proposal methods, we achieve significant improvements on all three object classes and consequently improve object detection performance. Our future work will focus on integrating the geometric features into a fully CNN-based framework to boost performance in the autonomous driving scene.
The authors declare that they have no competing interests.