Vehicle detection is one of the most important environment perception tasks for autonomous vehicles. Traditional vision-based vehicle detection methods are not accurate enough, especially for small and occluded targets, while light detection and ranging (lidar)-based methods detect obstacles well but are time-consuming and have a low classification rate for different target types. To address these shortcomings and make full use of the depth information provided by lidar and the obstacle classification ability of vision, this work proposes a real-time vehicle detection algorithm that fuses vision and lidar point cloud information. First, obstacles are detected by the grid projection method using the lidar point cloud. Then, the obstacles are mapped onto the image to obtain several separate regions of interest (ROIs). After that, the ROIs are expanded based on a dynamic threshold and merged to generate the final ROI. Finally, a deep learning method named You Only Look Once (YOLO) is applied to the ROI to detect vehicles. Experimental results on the KITTI dataset demonstrate that the proposed algorithm achieves high detection accuracy and good real-time performance. Compared with detection based on YOLO deep learning alone, the mean average precision (mAP) is increased by 17%.
The core technologies of unmanned driving include environmental perception, precise positioning, and path planning. Complex road environments, especially mixed traffic environments, make environment perception difficult for autonomous vehicles. Vehicle detection is an important part of environment perception and plays a vital role in the safe driving of the unmanned ground vehicle (UGV).
Currently, the mainstream obstacle detection sensors are cameras and lidar. Cameras have been widely used in intelligent driving because of their low cost and their ability to capture the textures and colors of targets, which are especially important for recognizing traffic lights and traffic signs. In [
On the other hand, lidar can acquire distance and three-dimensional information at long detection range. Lidar is not affected by illumination conditions and is highly robust, so it has been widely used in environmental perception. In [
The use of multisensor fusion schemes that combine camera and lidar in UGVs has gradually increased. There are three mainstream integration schemes: (1) detecting targets with lidar and camera separately and then merging the results [
In this paper, a max-min elevation map is constructed from the point cloud and clustered with eight-connected region labeling, and morphological dilation is then applied to the clustering results. After that, the minimum bounding rectangle of each connected domain is computed, and the points within these rectangles are projected onto the image. After projection, the coordinate extrema of each rectangle on the image are obtained. The regions are enlarged and merged to obtain the region of interest in the image, and finally, YOLO is used to classify the obstacles. The process is shown in Figure
Algorithm flowchart.
The Velodyne HDL-64 lidar used in this paper returns about 1.3 million points per second. This huge amount of data helps the UGV perceive its environment, but it poses a great challenge to real-time algorithm performance. To address this problem, this paper uses a max-min elevation map for environment modeling. At the distance of
Example of grid map.
After traversing all the points, the points within the grid map are projected onto their corresponding cells, and the height data are preserved. A cell is considered to contain an obstacle when the difference between its maximum and minimum point heights exceeds a set threshold. The grid map provides a computationally efficient approximation to the terrain gradient in a cell [
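The max-min elevation test above can be sketched as follows; the cell size, grid range, and height threshold used here are illustrative assumptions, not the paper's values:

```python
import numpy as np

def detect_obstacle_cells(points, cell_size=0.2, x_range=(0.0, 40.0),
                          y_range=(-20.0, 20.0), height_thresh=0.3):
    """Project lidar points (N x 3: x forward, y left, z up) onto a 2-D grid
    and mark a cell as an obstacle when max(z) - min(z) exceeds the threshold."""
    nx = int((x_range[1] - x_range[0]) / cell_size)
    ny = int((y_range[1] - y_range[0]) / cell_size)
    zmax = np.full((nx, ny), -np.inf)
    zmin = np.full((nx, ny), np.inf)
    ix = ((points[:, 0] - x_range[0]) / cell_size).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell_size).astype(int)
    ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for i, j, z in zip(ix[ok], iy[ok], points[ok, 2]):
        zmax[i, j] = max(zmax[i, j], z)
        zmin[i, j] = min(zmin[i, j], z)
    # empty cells give -inf - inf = -inf, which never exceeds the threshold
    return (zmax - zmin) > height_thresh
```

Only the per-cell height extrema need to be stored, which is what keeps the map cheap to build despite the point count.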
Because the spacing between the scan lines of a multiline lidar grows with scanning distance, obstacles can be scanned incompletely, especially in the vertical direction, and distant obstacles may be hit by only a few scan lines. Therefore, this paper applies a morphological dilation operation to the obstacle cells in the grid map. The matrix we use is given as follows:
Taking the scene shown in Figure
Example scenario.
Original point cloud.
Grid map.
Grid map after expansion.
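The dilation step can be sketched as below, assuming for illustration a 3x3 all-ones structuring element in place of the paper's matrix:

```python
import numpy as np

def dilate(grid, struct):
    """Binary dilation of a 2-D occupancy grid with a (2k+1)x(2k+1)
    structuring element: every obstacle cell turns on its covered neighbors."""
    out = np.zeros_like(grid)
    k = struct.shape[0] // 2
    h, w = grid.shape
    for i, j in zip(*np.nonzero(grid)):
        for di in range(-k, k + 1):
            for dj in range(-k, k + 1):
                if struct[di + k, dj + k] and 0 <= i + di < h and 0 <= j + dj < w:
                    out[i + di, j + dj] = True
    return out
```

A single obstacle cell left by a distant, sparsely scanned target thus grows into a small patch, so neighboring returns from the same obstacle become connected before clustering.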
In this paper, eight-connected region labeling is applied for clustering. If an obstacle cell is connected to another obstacle cell above, below, to the left, to the right, or at the upper-left, lower-left, upper-right, or lower-right corner, the cells are considered to belong to the same obstacle and are assigned the same label.
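A minimal breadth-first sketch of eight-connected labeling on the binary grid:

```python
from collections import deque
import numpy as np

def label_8connected(grid):
    """Assign the same integer label to obstacle cells that touch in any of
    the eight neighboring directions; returns (label map, number of obstacles)."""
    labels = np.zeros(grid.shape, dtype=int)
    h, w = grid.shape
    count = 0
    for si in range(h):
        for sj in range(w):
            if grid[si, sj] and labels[si, sj] == 0:
                count += 1
                queue = deque([(si, sj)])
                labels[si, sj] = count
                while queue:
                    i, j = queue.popleft()
                    for di in (-1, 0, 1):
                        for dj in (-1, 0, 1):
                            ni, nj = i + di, j + dj
                            if (0 <= ni < h and 0 <= nj < w
                                    and grid[ni, nj] and labels[ni, nj] == 0):
                                labels[ni, nj] = count
                                queue.append((ni, nj))
    return labels, count
```

Diagonally touching cells, which four-connectivity would split into two obstacles, are kept together here.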
We first convert the grid coordinate extremum
The normalized homogeneous coordinates are obtained as follows:
The projection effect is shown in Figure
Obstacle point cloud projection image.
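The projection onto the image plane can be sketched as follows, assuming a single 3x4 projection matrix P that combines the camera intrinsics with the lidar-to-camera extrinsics (as the KITTI calibration files provide):

```python
import numpy as np

def project_to_image(points_lidar, P):
    """Project 3-D points onto the image plane: append a homogeneous 1,
    multiply by the 3x4 projection matrix, and normalize by the last row."""
    n = points_lidar.shape[0]
    hom = np.hstack([points_lidar, np.ones((n, 1))])  # homogeneous coordinates
    uvw = (P @ hom.T).T
    return uvw[:, :2] / uvw[:, 2:3]                   # (u, v) pixel coordinates
```

Points with nonpositive depth should be discarded before projection in practice, since they lie behind the camera.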
From the coordinates of the points of
Original region of interest.
For obstacle
The parameters of the rectangular region of interest that the obstacle belongs to are given as follows:
Enlarged region of interest.
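The enlargement step can be sketched as below; the margin here simply grows in proportion to the ROI size, a stand-in for the paper's dynamic threshold, and the ratio is an assumption:

```python
def enlarge_roi(x1, y1, x2, y2, img_w, img_h, ratio=0.2):
    """Expand an ROI (x1, y1, x2, y2) by a margin proportional to its width
    and height, clipped to the image bounds."""
    dx = (x2 - x1) * ratio
    dy = (y2 - y1) * ratio
    return (max(0.0, x1 - dx), max(0.0, y1 - dy),
            min(float(img_w), x2 + dx), min(float(img_h), y2 + dy))
```

Expanding by a size-dependent margin compensates for the incomplete vertical extent of sparsely scanned obstacles while keeping small ROIs tight.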
The regions of interest may overlap after enlargement. In this paper, overlapping rectangles are merged into one region of interest whose parameters are given as follows:
The merged region of interest is shown in Figure
Merged region of interest.
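The merging step above can be sketched as repeatedly replacing any two overlapping rectangles with their common bounding box until no pair overlaps:

```python
def overlaps(a, b):
    """True when rectangles a and b (x1, y1, x2, y2) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def merge_rois(rois):
    """Merge overlapping rectangles into their bounding boxes until stable."""
    rois = list(rois)
    merged = True
    while merged:
        merged = False
        for i in range(len(rois)):
            for j in range(i + 1, len(rois)):
                if overlaps(rois[i], rois[j]):
                    a, b = rois[i], rois[j]
                    rois[i] = (min(a[0], b[0]), min(a[1], b[1]),
                               max(a[2], b[2]), max(a[3], b[3]))
                    del rois[j]
                    merged = True
                    break
            if merged:
                break
    return rois
```

Because a merged box may newly overlap a third rectangle, the loop restarts after every merge instead of doing a single pass.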
YOLO is a single-stage target detection algorithm. It uses a single neural network to predict bounding boxes and class probabilities directly from the complete image in a single evaluation. Moreover, it can be optimized end-to-end for detection performance, so it offers high real-time performance. The basic YOLO model processes images in real time at 45 frames/s while achieving more than twice the mAP of other common real-time detectors. However, YOLO has a certain localization error [
In this work, we used the KITTI [
Before training, we created the training and validation sets. We randomly divided all 7481 images in the KITTI dataset into a training set and a validation set at a ratio of 7 : 3. The size of the images provided by KITTI is
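The random 7 : 3 split can be sketched as follows (the fixed seed is an assumption added for reproducibility):

```python
import random

def split_dataset(ids, train_ratio=0.7, seed=0):
    """Shuffle sample ids with a fixed seed and split them into
    training and validation subsets at the given ratio."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_ratio)
    return ids[:n_train], ids[n_train:]
```

Applied to the 7481 KITTI images, this yields 5236 training and 2245 validation samples.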
The main parameters of the experimental platform were as follows: CPU, Intel Xeon E5-2687W v4 at 3.00 GHz; memory, 128 GB; GPU, NVIDIA Quadro M4000. Deep learning was performed on the Keras platform.
The training parameters of the YOLO v3 are given in Table
Training parameters of the YOLO v3.
| Parameter | Value |
|---|---|
| Batch size | 2 |
| Learning rate | 10^-4 |
| Ignore thresh | 0.5 |
| Number of epochs | 100 |
In the test, the test set provided by the KITTI was used. The result of the original YOLO v3 algorithm is shown in Table
Result of the original YOLO v3
| Benchmark | Easy | Moderate | Hard |
|---|---|---|---|
| Car (detection) | 58.56% | 43.38% | 38.23% |
Result of the proposed method
| Benchmark | Easy | Moderate | Hard |
|---|---|---|---|
| Car (detection) | 70.58% | 62.71% | 55.17% |
Test results on the KITTI dataset. (a) Precision and recall chart of vehicle detection on the KITTI test set of the YOLO v3 algorithm. (b) Precision and recall chart of vehicle detection on the KITTI test set of our algorithm.
Figures
The experimental results of Scenario 1. (a) Scenario 1. (b) Results of vehicle detection using the YOLO v3 algorithm. (c) Results of vehicle detection using our algorithm.
The experimental results of Scenario 2. (a) Scenario 2. (b) Results of vehicle detection using the YOLO v3 algorithm. (c) Results of vehicle detection using our algorithm.
The experimental results of Scenario 3. (a) Scenario 3. (b) Results of vehicle detection using the YOLO v3 algorithm. (c) Results of vehicle detection using our algorithm.
The experimental results show that the proposed algorithm significantly improves vehicle detection accuracy at all difficulty levels compared with the original YOLO v3 algorithm, especially for severely occluded vehicles. Under the easy, moderate, and hard difficulties, the average precision (AP) improved by nearly 12%, 20%, and 17%, respectively. On the KITTI training set, the area of all ROIs amounts to 55% of the total image area, while the ROIs contain 96% of all ground truths. At the same time, because the grid map reduces the dimensionality of the point cloud and the selected points lie only in the range of
We selected 50 images each at the easy, moderate, and hard difficulty levels from the KITTI test set and compared the obtained results with the results given in [
Comparison of the results.
| Method | Easy AP (%) | Moderate AP (%) | Hard AP (%) |
|---|---|---|---|
| Reference [ | 46.28 | 39.87 | 24.32 |
| Reference [ | 41.72 | 33.49 | 19.01 |
| This work | 69.36 | 61.33 | 46.27 |
The experimental results show that the proposed algorithm has clear advantages under all difficulty conditions compared with the algorithms proposed in [
However, the proposed algorithm still has certain shortcomings. For instance, when the target vehicle is too far from the UGV to be scanned by the lidar, the algorithm cannot detect it. Likewise, if a large vehicle such as a container truck is close to the autonomous vehicle, only a portion of it may fall inside the enlarged ROI, so the algorithm could fail to identify it.
A vehicle detection method based on the multisensor fusion is proposed in this paper. Using the calibration relationship between the lidar and camera, the region of interest extracted by the lidar is projected into the image obtained by the camera, and the region of interest in the image is obtained and processed. Finally, the YOLO v3 algorithm is used to detect the vehicle in the region of interest. The effectiveness of the proposed algorithm is verified by experiments.
In future work, we will optimize the extraction of the region of interest to achieve better target extraction. At the same time, different ROI sizes mean different input image sizes. Although they are resized to a common size at the input of YOLO, the proportion of the image area occupied by obstacles changes greatly, which decreases the detection accuracy of the model. To solve this problem, we will improve the YOLO algorithm or adopt other, more targeted networks to achieve better detection accuracy.
The data used to support the findings of this study are included within the article.
The authors declare that they have no conflicts of interest.
H.W. and Y.C. did the methodology, X.L. and Y.L. worked on the software, and L.C. worked on the project administration of the study.
This research was funded by the National Key Research and Development Program of China (2018YFB0105003), the National Natural Science Foundation of China (U1764264, 51875255, U1664258, U1764257, 61601203, and 61773184), the Key Research and Development Program of Jiangsu Province (BE2016149), the Natural Science Foundation of Jiangsu Province (BK20180100), the Key Project of the Development of Strategic Emerging Industries of Jiangsu Province (2016-1094, 2015-1084), the Key Research and Development Program of Zhenjiang City (GY2017006), and the Overseas Training Program for Universities of Jiangsu Province.