A 3D Multiobject Tracking Algorithm of Point Cloud Based on Deep Learning



Introduction
With the rapid development of computer vision, image processing, and other technologies, as well as the emergence of deep learning, the field of object detection has advanced greatly. From the high accuracy of the two-stage RCNN [1], Fast RCNN [2], and Faster RCNN [3] to the high speed of the one-stage YOLO [4], YOLOv2 [5], YOLOv3 [6], and SSD [7], and from anchor-based methods [8,9] to anchor-free methods [10,11], object detection has made great progress in both accuracy and speed. At the same time, the development of object detection has also promoted other fields, including object tracking. Multiobject tracking is a branch of object tracking that is closely related to the development of object detection [12]. Object tracking algorithms are divided into single-object tracking algorithms [13] and multiobject tracking algorithms [14]. Single-object tracking algorithms are widely used in monitoring and navigation systems. Among them, SiamMask [15] needs only an initialization frame; it can then generate object segmentation masks and bounding boxes in video at speeds up to 35 FPS. SiamRPN++ [16] develops a Siamese tracker based on the ResNet architecture. Chen et al. [17] proposed a multiscale fast correlation filtering tracking algorithm based on a feature fusion model. Zhang et al. [18] exploited spatial and semantic convolutional features extracted from convolutional neural networks for continuous object tracking. Multiobject tracking is widely used in autonomous driving systems because it can associate object detection results over time without switching the identities of multiple targets [19,20]. An autonomous driving system can estimate the location of an object with a tracking algorithm and thus avoid accidents. Among MOT algorithms, simple online and realtime tracking (SORT) [21] adopts the Kalman filter and the Hungarian matching algorithm to track objects, which yields fast and strong tracking performance, but it may switch the ID of an occluded object after it reappears. In order to reduce the frequency of ID switches, simple online and realtime tracking with a deep association metric (DeepSORT) [22] was proposed. DeepSORT retains the advantages of SORT and makes up for its defects by adding a pedestrian reidentification network, extracting pedestrian features, and matching feature similarity. Ristani and Tomasi [23] put forward the DeepCC algorithm, and Tang et al. [24] put forward the LMP algorithm; all of these algorithms use reidentification to improve tracking performance by matching the similarity of trajectories. In order to enhance robustness to complicated changes of multiple objects and complex background scenes, Chen et al. [25] proposed a visual object tracking algorithm based on an adaptive combination kernel. In addition, tracking has many other applications, such as tracking in basketball games [26].
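The core association step of SORT-style trackers described above — solving a bipartite assignment between predicted tracks and new detections over an IoU cost matrix with the Hungarian algorithm — can be sketched as follows. This is a minimal illustration, not the cited implementations; the box format and the IoU threshold value are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_2d(a, b):
    """Axis-aligned IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_threshold=0.3):
    """Hungarian matching on a negated-IoU cost matrix.
    Returns (track_index, detection_index) pairs above the IoU threshold."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            cost[i, j] = -iou_2d(t, d)          # minimise negative IoU
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_threshold]
```

Pairs left unmatched by the threshold are the ones that spawn new tracks or mark tracks as lost, which is where the ID-switch problem discussed above arises.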
Although the related algorithms have become more and more mature in image processing, image-based object detection cannot escape the limitations of two-dimensional data, and the drawbacks of such data are obvious, which leads to many problems in these algorithms. For example, object detection and tracking algorithms are greatly affected by light and by rain, snow, and haze. Under such conditions, object detection accuracy is low, and the recognition results are two-dimensional, including neither distance nor volume. However, the point clouds acquired by LiDAR are little affected by light and carry distance and volume information, which can overcome the problems above and make up for the shortcomings of image processing. In recent years, with the decrease of LiDAR cost, more and more researchers use LiDAR instead of cameras for object detection. Meanwhile, unlike images, point clouds are sparse and unordered points in space, so the mature algorithms used in image processing cannot be applied to point clouds directly. To solve this problem, many researchers adopted projection methods [27][28][29][30][31] to project 3D objects into multiple views and fuse the features of each view for detection and recognition. The projection method provides a way to transform point cloud processing into image processing. However, a large number of projections increases computation, while reducing the number of projections causes a loss of information. Wu et al. [32] and Le and Duan [33] applied the idea of voxelization to voxelize the point clouds and process them directly, which improved the efficiency of object detection. The development of point cloud object detection has also promoted point cloud tracking algorithms. Weng and Kitani [34] extended the two-dimensional SORT to three dimensions and proposed the AB3DMOT algorithm, which performed well on the KITTI dataset [35]. In order to improve the performance of point cloud multiobject tracking and recover the ID information of occluded objects, we combine a pedestrian reidentification algorithm with a 3D Kalman filter and apply them to point clouds. The proposed model provides a new baseline for point cloud tracking algorithms.

Related Works
2.1. 3D Object Detection. 3D object detection is an indispensable part of 3D object tracking, and the 3D bounding box produced by detection is also very important for the quality of tracking. 3D object detection methods can be divided into four categories: image processing methods, voxel-based methods, point-based methods, and fusion methods. Li et al. [36] projected the 3D point cloud onto a 2D image and then used a 2D end-to-end fully convolutional neural network to predict target confidence and 3D bounding boxes through bounding box encoding. Simon et al. [37] transformed point clouds into BEV, density, and intensity maps and used image processing methods for 3D detection. Zhou and Tuzel [38] proposed VoxelNet, which divides point clouds into different voxels. Then, they used the VFE (voxel feature encoding) layer to encode features uniformly. Finally, an RPN (region proposal network) was used for category classification and 3D bounding box regression. Based on VoxelNet, Yan et al. [39] proposed sparsely embedded convolutional detection (SECOND), which uses sparse convolution and further improves detection accuracy. Qi et al. [40] put forward PointNet, which uses point clouds directly. PointNet adopted a spatial transformation matrix to align point clouds and a combined convolutional neural network (CNN), obtaining good results in object segmentation and detection; this method performs better than two-dimensional image processing. Later, in order to address the shortcomings of PointNet, Qi et al. [41] put forward PointNet++ as a modification of PointNet. Shi et al. [42] put forward PV-RCNN by combining the advantages of voxel-based and point-based methods and achieved the highest score on the KITTI benchmark. In addition, there are other multisensor fusion methods: MV3D [43] fused the BEV and front view of point clouds with RGB images; AVOD [44] fused RGB images with a six-channel BEV map consisting of five equal-height slices and a density map; and F-ConvNet [45] used 2D regions for end-to-end estimation of bounding boxes in 3D space.

2.2. 3D MOT.
The difference between 3D MOT and 2D MOT is that the tracked objects in 3D MOT are three-dimensional and carry height and distance information. Osep et al. [46] proposed a 2D-3D Kalman filter to jointly use images and the 3D world coordinate system. Baser et al. [47] proposed an online multiobject tracking method based on CNN. Hu et al. [48] used a long short-term memory (LSTM) learning module to predict long-term motion more accurately. Frossard and Urtasun [49] formulated the problem as a linear program and adopted CNNs to detect and match end-to-end. Zhang et al. [50] put forward mmMOT to encode point clouds in the data association process and realized the fusion of multimodal data. Shenoi et al. [51] developed JRMOT, which uses a two-dimensional RGB image and a three-dimensional point cloud: the point cloud is used for detection, and the RGB image is used for CNN-based reidentification, achieving multiobject tracking. The camera shooting angle causes occlusion of objects in an RGB image, so we instead combine the bird's-eye view of the point cloud with the CNN-based reidentification method to match similarity and use a three-dimensional Kalman filter to predict the three-dimensional information of each object's movement.

Materials and Methods
According to the characteristics of point clouds, 2D and 3D information is processed separately. We use the 3D Kalman filter to predict the 3D coordinate information of the point cloud targets and extract features from the bird's-eye view with the reidentification network. Our system uses a three-dimensional object detection network such as SECOND to obtain the three-dimensional coordinate information X, Y, Z, L, W, H, and θ. These seven parameters represent the coordinates of the center point, the length, width, and height, and the heading angle of the box. The object detection results are transformed into 2D bounding boxes in a three-channel image composed of the BEV, density, and intensity maps, and then they are sent to the reidentification network to extract features. X, Y, Z, L, W, H, and θ are used for state prediction and trajectory matching in the 3D Kalman filter. After that, the results of feature matching and 3D Kalman filter matching are combined to obtain the ID information of the current detection results. The flow chart is shown in Figure 1. Each detection is denoted D_ti = {X, Y, W, L, θ, Z, H, S} (S represents the detection score). D_t is the detection result of frame t, and D_t = {D_t1, D_t2, ..., D_tn} (n represents the number of objects detected). In addition, considering detection speed and accuracy, we choose SECOND as the three-dimensional object detector of our tracking system. SECOND uses sparse convolution to significantly improve the speed of training and inference. The structure of SECOND is shown in Figure 2, and its detection performance is shown in Figure 3.
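The detection representation described above can be sketched as a small container. The field names and helper methods are illustrative assumptions based on the description, not the paper's code; the reduced (x, y, l, w) observation anticipates the Kalman filter design in the next subsection:

```python
from dataclasses import dataclass

@dataclass
class Detection3D:
    """One detection D_ti: centre, size, heading, and score (hypothetical container)."""
    x: float      # centre x (m)
    y: float      # centre y (m)
    z: float      # centre z (m)
    l: float      # length (m)
    w: float      # width (m)
    h: float      # height (m)
    theta: float  # heading angle (rad)
    score: float  # detection score S

    def kalman_observation(self):
        """Observation for the 3D Kalman filter; z, h, and theta are dropped
        (see the filter design in the next subsection)."""
        return (self.x, self.y, self.l, self.w)

    def bev_box(self):
        """Axis-aligned BEV box (x1, y1, x2, y2) used to crop the three-channel
        image for reidentification; heading is ignored in this sketch."""
        return (self.x - self.l / 2, self.y - self.w / 2,
                self.x + self.l / 2, self.y + self.w / 2)
```

Each frame's detections then feed two branches: `kalman_observation()` goes to the motion filter, and `bev_box()` selects the image region handed to the reidentification network.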

3.1. 3D Kalman Filter.
In order to describe a moving object, we use the Kalman filter to predict its state in the next frame. It predicts the position in the current frame from the historical positions of the target and establishes the following state equation (1) for each target:

T = (x, y, z, θ, l, w, h, v_x, v_y, v_z),  (1)

where x, y, and z are the coordinates of the target in the point cloud, θ denotes the heading angle, l, w, and h denote the length, width, and height of the object, respectively, and v_x, v_y, and v_z denote its velocities along each axis. By observing the movement of vehicles and the characteristics of targets in the point cloud, we find that the height and z coordinate of vehicles and pedestrians hardly change during movement. In order to reduce the amount of computation and improve performance, we ignore the height h and the z coordinate. In our experiments, we also find that including the angle increases the radian error of the predicted target, and the target's angle may flip over. Therefore, the final state model we use is as follows:

T = (x, y, l, w, v_x, v_y).  (2)

The state of a detection result can be expressed as follows:

D = (x, y, l, w).  (3)

The predicted state equation can be expressed as follows:

T_est = (x + v_x, y + v_y, l, w, v_x, v_y).  (4)
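A minimal constant-velocity Kalman filter over this reduced state can be sketched in NumPy as follows. The noise covariances and initial uncertainty are illustrative placeholders, not tuned values from the paper:

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter over s = [x, y, l, w, vx, vy],
    matching the reduced state above (z, h, and theta omitted)."""

    def __init__(self, x, y, l, w):
        self.s = np.array([x, y, l, w, 0.0, 0.0])
        self.P = np.eye(6) * 10.0        # initial state covariance (assumed)
        self.F = np.eye(6)               # transition: x += vx, y += vy
        self.F[0, 4] = self.F[1, 5] = 1.0
        self.H = np.eye(4, 6)            # observe (x, y, l, w)
        self.Q = np.eye(6) * 0.01        # process noise (assumed)
        self.R = np.eye(4) * 0.1         # measurement noise (assumed)

    def predict(self):
        """Advance one frame; returns the predicted (x, y, l, w)."""
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:4]

    def update(self, z):
        """Correct the state with a matched detection z = (x, y, l, w)."""
        y = np.asarray(z, dtype=float) - self.H @ self.s
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.s = self.s + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```

Because position and velocity are coupled through F, a few predict/update cycles on a moving target push the velocity estimate toward the true motion, which is what lets the filter extrapolate through short occlusions.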

3.2. Point Cloud Reidentification.
The point cloud differs from the image in that the point cloud has few fine-grained features, and fine features are difficult to distinguish. Although RGB images can be used for reidentification to obtain a large number of fine-grained features, they have some problems: the image may be occluded, and the farther the distance, the smaller the target and the less distinctive the features, which can even become indistinguishable. In contrast, the bird's-eye view of the point cloud has a large field of view and no occlusion of objects, which is conducive to reidentification and solves the problems existing in images.
Reidentification matches feature similarity within a trajectory, so that when an object reappears after being occluded, its original trajectory can be recovered by comparison with the features stored in that trajectory, whereas traditional matching methods cause ID switches. We use the three-channel image composed of the BEV map, density map, and intensity map of the point cloud in place of the RGB image for feature extraction. Due to the difference between the point cloud coordinate system and the image coordinate system, equation (5) is used to convert point cloud coordinates to image coordinates, and the transformation diagram is shown in Figure 4, where x and y represent coordinates in the point cloud coordinate system, h denotes the distance from the point cloud boundary to the y-axis, w is the distance from the point cloud boundary to the x-axis, and x_t and y_t represent the coordinates in the image coordinate system. After the coordinate transformation, the height of the point cloud is mapped to pixel values to obtain the bird's-eye view, and then the intensity values of the corresponding points are mapped into the intensity map. Finally, we calculate the density value of the corresponding point cloud cells in the image using equation (6):

ρ_i = min(1, log(c_i + 1)/log 64),  (6)
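The point-to-pixel mapping and the density channel described above might look as follows in code. The detection range, the 0.1 m-per-pixel resolution, and the log-base-64 normalisation are KITTI/Complex-YOLO-style assumptions for illustration, not necessarily the paper's exact parameters:

```python
import numpy as np

def lidar_to_bev_pixel(x, y, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), res=0.1):
    """Map a LiDAR point (x forward, y left, in metres) to a (row, col) pixel
    in the BEV image.  Range and resolution are illustrative assumptions."""
    row = int(round((x_range[1] - x) / res))   # row 0 = far edge of the range
    col = int(round((y - y_range[0]) / res))
    return row, col

def density_channel(counts):
    """Normalised density per BEV cell: rho_i = min(1, log(c_i + 1) / log 64),
    the Complex-YOLO-style normalisation equation (6) appears to describe."""
    counts = np.asarray(counts, dtype=float)
    return np.minimum(1.0, np.log1p(counts) / np.log(64.0))
```

The height and intensity channels are filled analogously: for each cell, the maximum point height and the intensity of the corresponding points are written to the pixel chosen by `lidar_to_bev_pixel`.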
where ρ_i represents the density of the ith location and c_i represents the number of point cloud points at the ith location. The resulting three-channel image is shown in Figure 5.

It can be seen from Tables 1 and 2 that our method performs better than the FANTrack method. Since our method is mainly intended for use at the roadside, it faces many occlusion and reappearance problems, which rarely occur in the KITTI dataset. Therefore, the advantages of our method are not fully reflected on the KITTI dataset, where our results are slightly lower than those of AB3DMOT. In order to show that our method can rematch the original trajectory and to demonstrate the advantage of the reidentification network, we compare the occlusions in frames 354 to 360 and frames 372 to 379 in the first sequence of the KITTI dataset. In Figure 7, vehicle ID 222 in AB3DMOT jumps to 252 after occlusion, while the ID number in our method remains 187 after occlusion. In Figure 8, vehicle ID 262 of frame 372 in the AB3DMOT method reappears as 275 after occlusion, while our method keeps ID 204 all the time.

Results and Discussion
Figure 9 shows a segment of the roadside data. With our method, the IDs of the two objects numbered 4004 and 3985 remain unchanged after occlusion, while ID switches occur for the corresponding vehicles in the AB3DMOT method. Whether on KITTI data or roadside data, our method can keep the ID number after occlusion, which reflects the advantage of the reidentification method in matching by features when distance information is lacking.

Conclusions
This paper introduced a reidentification algorithm into a point cloud tracking algorithm based on 2D MOT and proposed a 3D MOT algorithm based on deep learning. We use an object detector to obtain the 3D bounding box of the target, then use a 3D Kalman filter to estimate its state, combine this with the reidentification algorithm to match feature similarity, and finally use the Hungarian algorithm for data association. On the KITTI dataset, our approach achieves competitive results, and on the roadside dataset, our approach stands out further. We believe that our method can be widely used in self-driving and roadside assisted driving.
(i) The tracking algorithm based on deep learning from image processing is introduced into the tracking algorithm based on point clouds, and a tracking algorithm model based on deep learning is established.
(ii) The proposed tracking algorithm model uses the three-channel image composed of the bird's-eye view (BEV), density, and intensity maps of the point cloud to train the point cloud reidentification network. The two-dimensional features of the three-channel image are extracted by the point cloud reidentification network and cascade-matched with the IoU location features.
(iii) The proposed tracking algorithm model performs well in point cloud tracking: the original trajectory can be matched again after occlusion.
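The cascade matching mentioned in (ii) — appearance similarity from the reidentification network, gated by IoU overlap — can be sketched as a single-pass association. This is a simplification (DeepSORT's full age-ordered cascade is omitted), and the gate and distance thresholds are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cascade_match(track_feats, det_feats, iou_matrix,
                  iou_gate=0.1, max_cos_dist=0.4):
    """Match unit-norm re-ID features by cosine distance, forbidding pairs whose
    BEV boxes barely overlap.  Returns (track_index, detection_index) pairs."""
    cost = 1.0 - track_feats @ det_feats.T      # cosine distance
    cost[iou_matrix < iou_gate] = 1e5           # IoU gate: forbid these pairs
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cos_dist]
```

Detections left unmatched here would fall through to the motion-only Kalman/IoU matching stage, and tracks unmatched there are finally marked occluded rather than deleted, which is what allows an ID to be recovered when the object reappears.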

Figure 3 :
Figure 3: SECOND detects point clouds of roadside LiDAR.

Figure 4 :
Figure 4: Converting point cloud coordinates to the image coordinate system.

Figure 1 :Figure 2 :
Figure 1: Our proposed 3D MOT system is composed of 3D object detection and tracking (data association and filtering) components. T_{t−1} and T_t refer to tracks at time t−1 and time t, with the superscript indicating the space.

Figure 7 :
Figure 7: Effect comparison of frame 354 to frame 360 in Sequence 1.

Figure 8 :
Figure 8: Effect comparison of frame 372 to frame 379 in Sequence 1.