Learning Deformable Network for 3D Object Detection on Point Clouds

3D object detection from point cloud data has long been a research hotspot in perception for autonomous driving. As deep neural networks have matured, neural-network-based 3D object detection has begun to show clear advantages. Experimental results show that a mismatch between anchors and training samples degrades detection accuracy, yet this problem has not been well solved. The contributions of this paper are as follows. First, deformable convolution is introduced into a point cloud object detection network for the first time, which improves the adaptability of the network to vehicles with different orientations and shapes. Second, a new anchor generation method for the RPN is proposed, which effectively prevents mismatches between anchors and ground truth and removes the angle classification loss from the loss function. Compared with the state-of-the-art method, both the AP and the AOS of the detection results are improved.


Introduction
3D object detection methods mainly solve the problem of locating and identifying targets from both 2D and 3D data, including images, LiDAR data, and point cloud data. 3D object detection plays an increasingly important role in real-world applications such as autonomous driving [1,2], housekeeping equipment [3], and augmented reality [4]. A large number of methods have been proposed to solve the problem of 3D object detection. Image-based methods: when images contain detailed information, many methods generate 3D bounding boxes [5,6] by first estimating depth from monocular or stereo images. The accuracy of these methods is limited by the low accuracy of depth estimation. In [8], a Fast RCNN [7] pipeline is used to generate 3D box proposals, followed by region-based recognition. Monocular images and shape priors of cars are used to propose 3D boxes in [9], although the length and width of cars and other objects are not fixed.
Another method designs a 3D scene representation which reasons jointly about the 3D shape of multiple objects.
RGB-D-based methods: with the deployment of 3D sensors such as RGB-D cameras and LiDAR in a variety of real-world settings, the problem of inaccurate depth estimation is avoided [10]. Some methods use 3D pixel-wise features to detect cars in RGB-D images and take advantage of a 3D car model, which provides additional information: height, occlusion, and 3D pose. A detailed geometry representation of objects is introduced in [11]. Sliding Shapes [12] applies a sliding-shape detector to the RGB-D image and generates 3D bounding boxes with SVM classifiers on grids encoded with handcrafted features. Building on Sliding Shapes, Deep Sliding Shapes extracts geometry features through 3D convolutional networks, but the better geometry features come at the cost of expensive computation. In [13], the 3D depth data is converted into 2D maps and CNNs are applied to localize the objects. Other methods reduce the 3D detection workload by using an efficient 2D detector and making full use of 2D information.
Similarly, F-PointNets detect objects in 2D images and then apply PointNet to the corresponding frustum point cloud to generate 3D bounding boxes. This saves a great amount of computation, so the method achieves a competitive processing speed. However, F-PointNets rely too heavily on the 2D detection result: if the 2D detector misses a distant or small object such as a pedestrian or cyclist, F-PointNets cannot detect it either.
Multiview: Maturana and Scherer [14] show a way to represent the point cloud by volumetric and multiview representations. MVCNN [15] is the first method to apply multiview computer vision to 3D object perception. 2D renderings of objects from 3D data under different "perspectives" are used as the original training data. The model trained with a classic 2D image convolutional network is faster and performs better on the recognition and classification of 3D objects than models trained directly on 3D data. To make full use of image-based feature extraction methods, [16] projects point clouds into the front view. Compared with LiDAR-only 3D detection, Enzweiler and Gavrila [17] make greater progress on 3D detection, especially for pedestrians and cyclists. Among these methods, MV3D first projects the point cloud into the bird's eye view and then applies a CNN to generate rough 3D bounding boxes, which are projected to three views. For each view, a deep fusion network extracts region-wise features via ROI pooling to jointly perform object prediction and 3D box regression. However, the additional need for a camera causes time synchronization and calibration problems, which limit the use of these methods and require sensors that are always in a good state.
LiDAR-based methods: here, we try to estimate accurate 3D bounding boxes and class labels of objects from point clouds. Unlike RGB image data, 3D point clouds are very robust to changes of view angle and illumination. They contain relatively complete structural information and the precise depth of each point. However, precisely because the data form is different (unordered, sparse, and locality sensitive), traditional network structures are often unable to process point cloud data directly. Bariya and Nishino [18] simply extend the 2D region proposal network to 3D point clouds, but the computation cost increases dramatically. Other representations of the point cloud are explored in [19], which yield satisfactory results when plenty of detailed 3D structure information is provided.
This also means that most current 3D object detection methods need to preprocess the point cloud data, converting it into two-dimensional data before it is sent to the network for processing. For example, Song and Xiao [20] extend Faster/Mask RCNN [21] to 3D, whereas prior methods only process 2D detection results. Those methods convert the irregular point clouds to regular 3D grids and apply 3D CNN detectors to perform the detection task; because of the use of three-dimensional convolution, their computation cost is very large. There are also some methods [22] that project point cloud data to the bird's eye view and carry out 2D object detection on the projected image. However, this kind of method loses a lot of spatial detail, so the detection results are very limited. In addition, some previous approaches [23] infer point cloud features with neural networks on a point cloud that has been divided into many grids.
In this work, we introduce a 3D detection framework that directly processes raw point cloud data and does not depend on any 2D detectors. Our detection network is inspired by recent advances in 3D neural network models for point clouds and is extended by the Hough voting network, VoteNet, for object detection.
We leverage VoteNet, a hierarchical deep network for point cloud learning. Because we process the original point cloud directly, we avoid the redundant computation of converting the point cloud into other forms of data or features, as voxelization does, and we avoid the loss of dimensional and structural information that occurs when point clouds are projected to the front view or bird's eye view. Computing only on the perceived points also takes advantage of the sparsity of the point cloud. Although PointNet++ has been used for object classification and semantic segmentation with impressive results, how to use this architecture to detect 3D objects in point clouds is still a field rarely touched by researchers.
A simple idea, borrowed from 2D object detection, is to propose many 3D bounding boxes in the point cloud together with features learned by the network. However, it is very difficult to apply this approach directly to the sparse original point cloud. Because the depth sensor only captures the surface of an object, the center of a 3D object is likely to lie in empty space, far away from any captured surface point. So, in a point cloud, there may be no perceived points at the center of the object, which is very different from the situation in an image. Therefore, it is difficult to gather scene context near the object center with a point-based network. If we simply enlarge the receptive field of the network to capture more context information, we also absorb many nearby objects, which introduces more clutter. A deep point cloud network with a voting mechanism similar to classical Hough voting solves this problem. The new points generated by voting lie near the object center and can be grouped and aggregated to generate box proposals. In [24], a powerful 3D object detector is introduced that is purely geometric and can be applied directly to point clouds. Traditional Hough voting consists of multiple independent modules, so joint optimization is difficult. VoteNet, however, can be optimized end to end. Specifically, the point cloud is fed to the backbone network to extract features, and points with features are then sampled from the point cloud to vote for object centers. The voted centers appear near the real centers, which makes it easy to generate 3D box proposals with a learned network. We conduct experiments on the 3D object detection datasets SUN RGB-D and ScanNet [25].
Our method uses only the geometry of these two datasets and achieves state-of-the-art 3D object detection performance, outperforming methods that use both RGB and geometry or even multiview RGB images. Our research shows that, in our algorithm framework, context information is aggregated more effectively, and that our method brings large improvements in cases where the object center is far from the object surface.
In summary, the contributions of our work are as follows: (1) The Euclidean clustering method is used to realize object detection on point cloud data, and a better detection result is achieved on the KITTI dataset. According to the characteristics of the KITTI dataset, a data interface is designed to preprocess the labeled targets, and a suitable data augmentation method is proposed for the neural-network-based detection algorithm. (2) Based on the working principle of LiDAR, a method is proposed that divides space in the cylindrical coordinate system and converts the point cloud into voxels. For the first time, deformable convolution is introduced into a point cloud object detection network, which enhances the adaptability of the network to vehicles with different orientations and shapes. A new anchor generation method for the RPN is proposed, which effectively prevents the mismatch between anchors and ground truth and, at the same time, removes the angle classification loss from the loss function. Compared with SECOND, the AP and AOS of the detection results are improved. The improvements proposed in this paper are also applicable to other voxel-based 3D object detection algorithms. The paper is organized as follows. Related work is presented in more detail in Section 2. Section 3 then gives a detailed introduction to the proposed method. Datasets and experiments are presented in Section 4. Conclusions and outlook are presented in Section 5.

Related Work
Various approaches have been proposed for 3D object detection by extracting features from images and point clouds. Several existing methods help to detect the 3D bounding boxes of objects. Xiang et al. [26] generate 3D bounding boxes using only front view images with shape priors, or project the depth data into the front view as 2D maps, which are fed to CNNs to localize objects. MV3D extracts features from both the front view RGB image and the LiDAR point cloud projected to the bird's eye view. 3D bounding boxes are proposed by a trained RPN on the bird's eye view. However, this method is weak at detecting objects that are far away or small, such as pedestrians and cyclists, and it also has difficulty dealing with overlap between objects. PointRCNN [27] directly uses the raw point cloud as the input of deep networks to generate proposal boxes and improves the way features are extracted from point clouds, but it is dragged down by the empty voxels caused by the sparsity of the point cloud. VoxelNet divides space into many voxels and stacks several VFE layers to obtain point cloud features from each voxel. These features are then passed to CNNs for detection and segmentation. Similarly, SECOND [28] applies sparse convolution to obtain features from voxels efficiently. F-PointNet [29] first uses a 2D object detector on the image to make 2D object proposals and then takes the corresponding frustum point cloud as the basis of regression and prediction, so the accuracy of F-PointNet relies too heavily on the 2D detector.
Representation learning from point clouds: many deep network architectures have been proposed to deal with point clouds, and they achieve strong performance on 3D object detection and object segmentation. Some of these point-cloud-based 3D detection techniques extract features by representing the point cloud in the form of voxels (dividing the point cloud into many cuboids). Ashburner and Friston [30] inspire this line of work by applying 3D convolutional neural networks to voxels, although it is restricted by the sparsity of point clouds and the high cost of 3D convolution. In [31], every nonempty voxel is encoded by 6 statistical quantities derived from the points in the voxel. Li [32] represents each voxel by fusing multiple local statistics. Song and Xiao [33] simply encode the 3D voxels in binary form. VoxelNet first samples the points within each voxel and then applies several VFE layers to extract features from each voxel to represent the whole point cloud.
Then, PointNet represents point cloud data as a vector, and shape features are extracted for an FCN to perform classification. Before PointNet and PointNet++, there was little work on obtaining features directly from the raw, unordered, sparse point cloud. After that, PointNet is used in [34] to generate 3D objects in the frustum point cloud corresponding to a 2D object proposal.

Voxelization.
Our method includes the spatial-feature extractor, deformable layer, RPN layer, and final regression layer. In VoxelNet, the Cartesian coordinate system is used for voxel segmentation and point cloud clustering, as shown in Figure 1(a). For comparison, Figure 1(b) shows the voxel segmentation in the cylindrical coordinate system. Because of its working principle, LiDAR cannot obtain information from behind an object. To overcome this shortcoming, we divide voxels based on cylindrical coordinates. Compared with the division based on the Cartesian coordinate system, the method in this paper can significantly improve the uniformity of points within voxels, the sparsity of voxels, and the processing efficiency of the sparse convolution algorithm.
Suppose the cylindrical coordinate system consists of three axes, ρ, θ, and z. For the given point cloud data, a point (x, y, z) in the Cartesian coordinate system is converted into the cylindrical coordinate system with the coordinates $\rho = \sqrt{x^2 + y^2}$, $\theta = \arctan(y/x)$, $z_c = z$; then, we divide the voxel space evenly. After space division, the points need to be clustered into voxels. Because the division in the cylindrical coordinate system is more consistent with the working principle of LiDAR than that in the Cartesian coordinate system, a sparser point cloud representation can be obtained. Considering that a large number of points consumes computing power and that the density of points within the grid is not uniform, the maximum number of points in each nonempty voxel is set to T: redundant points are discarded, and voxels with fewer than T points are padded with zeros. The voxelization algorithm is shown in Algorithm 1. We set the maximum number of voxels K and the maximum number of points T in each voxel, and a tensor with shape K × T × 4 is generated by the voxel clustering algorithm.
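The voxelization steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the function name `cylindrical_voxelize`, the bin sizes `d_rho`, `d_theta`, `d_z`, the use of `atan2` in place of arctan(y/x), and the choice to keep the first T points of a voxel (rather than, for example, sampling them randomly) are all hypothetical.

```python
import math
from collections import defaultdict


def cylindrical_voxelize(points, d_rho, d_theta, d_z, max_voxels, max_points):
    """Group (x, y, z, r) points into voxels of a cylindrical grid.

    Returns (coords, voxels). coords[k] is the integer (i_rho, i_theta, i_z)
    index of voxel k; voxels[k] holds exactly max_points rows
    [rho, theta, z, r], zero-padded when the voxel has fewer points.
    """
    buckets = defaultdict(list)
    for x, y, z, r in points:
        rho = math.hypot(x, y)
        # Assumption: atan2 replaces arctan(y/x) so that x = 0 and all
        # four quadrants are handled.
        theta = math.atan2(y, x)
        key = (
            int(rho // d_rho),
            int((theta + math.pi) // d_theta),
            int(z // d_z),
        )
        if key not in buckets and len(buckets) >= max_voxels:
            continue  # voxel budget K is exhausted
        if len(buckets[key]) < max_points:
            buckets[key].append([rho, theta, z, r])
        # Points beyond T in a voxel are discarded (assumption: first T kept).

    coords = list(buckets.keys())
    voxels = []
    for key in coords:
        rows = buckets[key]
        padding = [[0.0] * 4 for _ in range(max_points - len(rows))]
        voxels.append(rows + padding)
    return coords, voxels


# Example: three nearby points share one voxel (T = 2 keeps two of them),
# and a fourth point falls into a second voxel that is zero-padded.
pts = [
    (1.0, 0.0, 0.0, 0.5),
    (1.0, 0.01, 0.0, 0.2),
    (1.0, 0.02, 0.0, 0.9),
    (0.0, 2.0, 0.5, 0.1),
]
coords, voxels = cylindrical_voxelize(
    pts, d_rho=0.5, d_theta=math.pi / 8, d_z=1.0, max_voxels=10, max_points=2
)
```

The nested list `voxels` has the K × T × 4 layout described in the text, except that only the nonempty voxels actually found (at most K) are returned.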

Network Architecture.
Point cloud feature selection: $V = \{p_i = (\rho_i, \theta_i, z_i, r_i)\}_{i=1,\ldots,t}$ is defined as a nonempty voxel, where $\rho_i$, $\theta_i$, $z_i$ are the 3D coordinates of a point and $r_i$ is its reflection intensity. First, the average value $V_m = (v_\rho, v_\theta, v_z)$ of all points in each voxel and the center of each voxel $V_c = (c_\rho, c_\theta, c_z)$ are estimated. Then, the distance from each point to the average and the distance from each point to the center of the voxel are added to the feature; finally, a tensor with shape K × T × 10 is generated.
Voxel feature extraction layer: voxel feature extraction uses the VFE proposed by VoxelNet as the main structure; the VFE layer takes clustered voxels as input and uses a fully connected layer, batch normalization, and a ReLU layer to extract features.
Assuming that the number of output nodes of the VFE is $C = 2n$, $n \in \mathbb{Z}^{*}_{+}$, the VFE layer takes the T × 10 point features corresponding to the same voxel, passes them through a fully connected layer with C/2 outputs to obtain a feature with dimension K × C/2, then uses max pooling to obtain the locally aggregated feature of each voxel with dimension K × 1, and finally copies it to the point-by-point features to obtain the final voxel feature with shape K × C. The single VFE layer is shown in Figure 2.
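The feature construction and a single VFE step can be sketched for one voxel as follows. This is a hedged illustration, not the authors' implementation: the helper names `augment_voxel` and `vfe_layer` are hypothetical, the added channels are assumed to be the per-axis offsets to the voxel mean and to the voxel center (4 + 3 + 3 = 10), batch normalization is omitted, and the per-point axis T is kept explicit, so the shapes here are T × C/2 before pooling and T × C after concatenation. Zero-padded rows are assumed to have been removed before the call.

```python
def augment_voxel(points, center):
    """Extend each (rho, theta, z, r) point of one voxel to 10 channels.

    Assumption: the extra channels are the per-axis offsets of the point
    from the voxel mean and from the voxel center.
    """
    count = len(points)
    mean = [sum(p[k] for p in points) / count for k in range(3)]
    features = []
    for p in points:
        to_mean = [p[k] - mean[k] for k in range(3)]
        to_center = [p[k] - center[k] for k in range(3)]
        features.append(list(p) + to_mean + to_center)
    return features


def vfe_layer(features, weight, bias):
    """One VFE step: shared linear layer + ReLU, max pooling, concatenation.

    weight has C/2 rows with one value per input channel; bias has C/2
    values. Batch normalization is left out to keep the sketch short.
    """
    pointwise = [
        [
            max(0.0, sum(w * x for w, x in zip(row, f)) + b)
            for row, b in zip(weight, bias)
        ]
        for f in features
    ]
    # Element-wise maximum over the points of the voxel (local aggregation).
    pooled = [max(column) for column in zip(*pointwise)]
    # Copy the aggregated feature back to every point: C/2 + C/2 = C channels.
    return [row + pooled for row in pointwise]


# Example with two points and hypothetical weights (C/2 = 2, so C = 4).
voxel = [(1.0, 0.1, 0.2, 0.5), (1.2, 0.3, 0.0, 0.7)]
feats = augment_voxel(voxel, center=(1.25, 0.2, 0.5))
weight = [[0.1] * 10, [-0.1] * 10]
bias = [0.0, 0.0]
out = vfe_layer(feats, weight, bias)
```

In a trained network the weights would be learned and the same layer would be applied to all K voxels in a batch; the loop-based form is only meant to make the data flow visible.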
Spatial-feature extraction layer: we use submanifold sparse convolution and ordinary sparse convolution to extract spatial features. By gradually enlarging the receptive field over the voxel features, more context information is added for shape description. At the same time, the spatial-feature extraction layer extracts information along the z-axis and transforms the sparse 3D data into dense 2D pseudoimages. The spatial-feature extraction layer takes the features obtained from the voxel feature extraction layer as input and converts the nonempty voxels into sparse 4D tensors according to the coordinates of each voxel in the voxel grid; then, the spatial features are extracted. The spatial-feature extraction layer consists of two modules: the first module contains a submanifold convolution layer and a normal 3D sparse convolution layer; the second module consists of two submanifold convolutions and a normal 3D sparse convolution layer. The two modules only downsample along the z-axis, without changing the length and width of the feature map. After the two modules, the z-axis dimension is downsampled to two levels, and the sparse data is transformed into a dense feature map. Then, the 4D tensors of the two layers are reshaped into 3D tensors similar to images. The structure of the spatial-feature extraction layer is shown in Figure 3.
Deformable convolution layer: because the shape of the convolution kernel used in a convolution layer is fixed, the receptive field is still square after many convolutions, which limits the ability of the network to model deformation. Deformable convolution and deformable ROI pooling are proposed by deformable convolution networks (DCN). DCN adaptively deforms the receptive field by adjusting the input sampling positions of the convolution kernel.
The regular sampling grid of ordinary convolution makes it difficult for the network to adapt to geometric deformation. To weaken this limitation, an offset variable is added to the position of each sampling point in the convolution kernel. The sampling positions of the kernel then change adaptively according to the shape of the sample, instead of being limited to the regular lattice points. In this paper, DCN is introduced into point cloud object detection for the first time. Because vehicle orientations in point cloud object detection are distributed over 360 degrees, while ordinary convolution can only form a square receptive field, the introduction of deformable convolution allows the network to better adapt to vehicles with different orientations. Applied to the output of the spatial-feature extraction layer, the deformable convolution is expressed as DCN($c_o$, $c_i$, $k$, $s$, $p$), where $c_i$ and $c_o$ are the numbers of channels of the input and output tensors, $k$ is the size of the convolution kernel, and $s$ and $p$ are the stride and padding, respectively. The specific structure is shown in Figure 4.
Algorithm 1 (fragment): (1) input points $P = \{(x_i, y_i, z_i)\}$, $i = 1, 2, \ldots, n$; (2) convert to the cylindrical coordinate system.
Figure legend: fully connected layer, max pooling, batch normalization, feature fusion.
Mobile Information Systems
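To make the offset mechanism of the deformable convolution concrete, the following sketch evaluates a single-channel deformable convolution at one output location. It is an illustration under assumptions rather than the layer used in the paper: the function names are hypothetical, the offsets are passed in directly instead of being predicted by a learned convolution, and fractional sampling positions are resolved by bilinear interpolation with zeros outside the feature map.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D grid (list of rows) at fractional (y, x).

    Positions outside the grid contribute zero.
    """
    height, width = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < height and 0 <= xx < width:
                weight = (1.0 - abs(y - yy)) * (1.0 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value


def deformable_conv_at(feature, kernel, offsets, cy, cx):
    """Output of a k x k deformable convolution centered at (cy, cx).

    offsets[i][j] = (dy, dx) shifts the sampling position of kernel tap
    (i, j) away from the regular lattice; all-zero offsets reproduce an
    ordinary convolution.
    """
    k = len(kernel)
    half = k // 2
    total = 0.0
    for i in range(k):
        for j in range(k):
            dy, dx = offsets[i][j]
            sample_y = cy + i - half + dy
            sample_x = cx + j - half + dx
            total += kernel[i][j] * bilinear(feature, sample_y, sample_x)
    return total


# Example: a 4 x 4 feature map with value 4*y + x and an all-ones 3 x 3 kernel.
feature = [[4.0 * y + x for x in range(4)] for y in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
zero_offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
shifted_offsets = [[(0.0, 0.5)] * 3 for _ in range(3)]
regular = deformable_conv_at(feature, kernel, zero_offsets, 1, 1)
shifted = deformable_conv_at(feature, kernel, shifted_offsets, 1, 1)
```

With zero offsets the result equals the ordinary convolution output; shifting every tap by half a cell along x changes the sampled values, which is the degree of freedom that lets the receptive field follow objects that are not axis-aligned.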
Our proposed network is shown in Figure 5. The proposed network uses a backbone extractor to extract spatial features; a deformable layer and an RPN layer then follow to add more discriminative ability; finally, the 3D object regression is predicted.
An RPN is often used to determine whether a target to be detected is present in a local part of the feature map. The structure of the RPN is shown in Figure 6. The network consists of three modules: first, convolution layers downsample the input tensor three times to obtain the feature maps $x_1$, $y_2$, $y_3$; then, we deconvolve and upsample $x_1$, $y_2$, $y_3$ to obtain $up_1$, $up_2$, $up_3$; finally, we concatenate $up_1$, $up_2$, $up_3$ to obtain the final feature map. A 1 × 1 convolution is applied to predict the category and the object regression.

Experiments
Data augmentation: training a neural network is a process of adjusting the parameters of a mapping model. When training a deep learning model, the amount of data has a great impact on the convergence speed and generalization performance of the network. Random sampling: we use the training dataset to generate a point cloud database that contains all target categories and their 3D bounding boxes. During training, we randomly extract n ground-truth targets from this database and insert them into the current training data. This strategy greatly increases the number of real targets in each point cloud frame. Random disturbance: considering the influence of noise on network performance, we use a method similar to that of VoxelNet. For each frame, each ground-truth box and its points are transformed independently and randomly, instead of transforming all point clouds with the same parameters. Global rotation and scaling: we apply global scaling and rotation to the whole point cloud.
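The global rotation and scaling step can be sketched as follows. This is a minimal illustration under assumptions: the source does not give the parameter ranges or the box encoding, so the rotation range of ±π/4, the scale range of 0.95 to 1.05, the box layout (cx, cy, cz, length, width, height, yaw), and the function names are all hypothetical.

```python
import math
import random


def global_rotate_scale(points, boxes, angle, scale):
    """Rotate a scene about the z-axis by `angle` (radians) and scale it.

    points: iterable of (x, y, z, r).
    boxes: iterable of (cx, cy, cz, length, width, height, yaw); this box
    layout is an assumption made for the sketch.
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    def transform(x, y, z):
        return (
            scale * (cos_a * x - sin_a * y),
            scale * (sin_a * x + cos_a * y),
            scale * z,
        )

    # Reflection intensity r is left unchanged.
    new_points = [transform(x, y, z) + (r,) for x, y, z, r in points]
    # Box centers move with the scene, sizes scale, and yaw turns by `angle`.
    new_boxes = [
        transform(cx, cy, cz) + (scale * l, scale * w, scale * h, yaw + angle)
        for cx, cy, cz, l, w, h, yaw in boxes
    ]
    return new_points, new_boxes


def random_global_augment(points, boxes, max_angle=math.pi / 4,
                          scale_range=(0.95, 1.05), rng=random):
    """Draw one random angle and scale (hypothetical ranges) per frame."""
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(*scale_range)
    return global_rotate_scale(points, boxes, angle, scale)


# Example: a quarter turn with scale 2 maps (1, 0, 0) to roughly (0, 2, 0).
pts, bxs = global_rotate_scale(
    [(1.0, 0.0, 0.0, 0.3)],
    [(1.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0)],
    angle=math.pi / 2,
    scale=2.0,
)
```

Applying the same transform to points and boxes keeps the labels consistent with the augmented point cloud; the per-object random disturbance described above would apply a similar transform to each ground-truth box and only the points inside it.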
Implementation: we implement our network with the PyTorch framework and use Adam as the optimizer. The initial learning rate is set to 0.0002, and a learning rate decay strategy is used during training. The network was trained for 160 epochs in total, which took about 22 hours on an RTX 2080 Ti GPU. We use AP (Average Precision) and AOS (Average Orientation Similarity) as the evaluation metrics.
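For readers unfamiliar with AOS, the orientation term it averages can be sketched as below. This is not taken from the paper: it follows the cosine similarity used in the KITTI benchmark definition, where each matched detection contributes (1 + cos Δθ)/2 for an orientation error Δθ. The function name is hypothetical, and the averaging over recall levels and the penalty for duplicate detections in the full metric are deliberately left out.

```python
import math


def orientation_similarity(delta_thetas):
    """Mean of (1 + cos(delta)) / 2 over matched detections.

    delta_thetas: orientation errors in radians between each detection and
    its matched ground-truth box. Returns a value in [0, 1].
    """
    if not delta_thetas:
        return 0.0
    total = sum((1.0 + math.cos(d)) / 2.0 for d in delta_thetas)
    return total / len(delta_thetas)


# Example: one perfectly oriented detection and one that is flipped by 180 degrees.
similarity = orientation_similarity([0.0, math.pi])
```

A detection with the correct box but the opposite heading therefore contributes nothing to the orientation term, which is why AOS is sensitive to how the network handles vehicle direction.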
Evaluation results on the KITTI dataset: we evaluate the vehicle detection performance of the proposed framework on the KITTI dataset. We divide the KITTI dataset according to the difficulty of detection. Our method exceeds SECOND by 1.24%, 1.10%, and 3.19% in AP on the easy, moderate, and hard subsets, respectively, and by 0.84%, 1.21%, and 1.47% in AOS. The detailed results are shown in Tables 1 and 2. "SECOND" is the previous state-of-the-art result, "Deformable" means that we only add the deformable layer for the final 3D regression, "Cartesian" means that we perform the voxelization in the Cartesian coordinate system, and "ours" means our overall performance.
Comparison of different voxel partition methods: the voxel partition method in the cylindrical coordinate system is better suited to the working principle of LiDAR. First, we calculate the number of nonempty voxels and the variance of the number of points per voxel obtained by the different partition methods. Using the cylindrical coordinate system to divide the point cloud effectively reduces the number of nonempty voxels and improves the uniformity of the points per voxel. The detailed statistical results are shown in Table 3. Evaluation of the deformable convolution module: deformable convolution can adaptively adjust the sampling positions according to the shape of the target. In this paper, deformable convolution is introduced into the 3D object detection algorithm to adapt to targets at any angle between 0 and 360 degrees. The influence of deformable convolution on 3D object detection is compared experimentally in Table 1.
Visualization results: the detection results are shown in Figure 7. The upper image shows the 3D bounding boxes of the detection results projected onto the bird's eye view. The lower image shows the road conditions in front of the vehicle as photographed by the camera. In the bird's eye view, red boxes are cars, blue boxes are pedestrians, and green boxes are cyclists. The algorithm can accurately detect the cars, pedestrians, and cyclists around the vehicle.

Conclusions
3D object detection based on point cloud data in driverless scenes has long been a research hotspot in driverless perception technology. As deep neural network technology has matured, neural-network-based 3D object detection has begun to show great advantages. Based on point cloud data collected by a vehicle-mounted 64-line LiDAR, and using the KITTI dataset for evaluation, this paper studies how to detect the position, size, and orientation of obstacles in the environment quickly and accurately, so as to provide reliable information for vehicle tracking and path planning.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon reasonable request.