Multimodal Multiobject Tracking by Fusing Deep Appearance Features and Motion Information

Multiobject Tracking (MOT) is one of the most important abilities of autonomous driving systems. However, most existing MOT methods use only a single sensor, such as a camera, and therefore suffer from insufficient reliability. In this paper, we propose a novel Multiobject Tracking method that fuses deep appearance features and motion information of objects. In this method, the locations of objects are first determined by a 2D object detector and a 3D object detector. We use the Nonmaximum Suppression (NMS) algorithm to combine the detection results of the two detectors to ensure detection accuracy in complex scenes. After that, we use a Convolutional Neural Network (CNN) to learn the deep appearance features of objects and employ a Kalman Filter to obtain the motion information of objects. Finally, the MOT task is achieved by associating the motion information and deep appearance features. A successful match indicates that the object was tracked successfully. A set of experiments on the KITTI Tracking Benchmark shows that the proposed MOT method can effectively perform the MOT task. The Multiobject Tracking Accuracy (MOTA) is up to 76.40% and the Multiobject Tracking Precision (MOTP) is up to 83.50%.


Introduction
The objective of Multiobject Tracking (MOT) is to track multiple objects at the same time and estimate their current states, such as locations, velocities, and sizes, while maintaining their identities. Hence, MOT is one of the most important abilities of autonomous systems, but it remains challenging because the target objects may be occluded or confused with objects of similar shape. Owing to the rapid development of object detectors, several tracking-by-detection methods [1][2][3][4][5] have been proposed to address the MOT problem. Typically, the existing tracking-by-detection methods involve two main computational steps: object detection and tracking. These methods first detect the location of objects and then compute the trajectories of the objects based on the results of object detection [6][7][8]. The accuracy of object tracking is highly related to the performance of object detection. Hence, the key requirements of the MOT are to start tracking new targets that may appear at any time and to recover lost targets from the detections and associate them again. However, most of the tracking-by-detection methods are based on vision-based object detection. In the case of occlusion and overexposure, vision-based object detection may lead to false association with existing trajectories. For example, Figure 1(a) shows the failure of vehicle detection on an image in which humans occlude the vehicle. Figure 1(b) shows that the camera becomes unusable under overexposure. An autonomous driving scene may contain multiple objects, and the states of the objects are usually uncertain [9,10]. In this case, vision-based object detection is susceptible to occlusion or overexposure, which easily leads to false detections or loss of the tracked target. Besides, one major challenge of the MOT is how to reduce incorrect identity switches. Because the tracked objects often look very similar, it is challenging to track objects correctly and perform correct Re-Identification (Re-ID).
Multimodal data fusion has the potential to improve the stability and accuracy of the MOT. However, a majority of traditional methods use the camera, LiDAR, or radar. These methods need hand-crafted features [11]. However, hand-crafted features are often not of high precision, and it is difficult to guarantee the tracking performance. Hence, it is necessary to design a feature learning method that can automatically learn appearance features from raw visual data. Moreover, in autonomous driving systems, since the objects are moving rather than stationary, the motion information of objects should be integrated with the appearance features to achieve the MOT tasks. In addition, some MOT methods include depth information in the tracking process by using a depth camera in order to improve tracking performance. For example, Mehner et al. [12] used an ordinary camera to obtain 2D information of objects and used a depth camera to obtain depth information to assist in locating the objects in world coordinates. Although this can improve the accuracy, the depth camera has a small field of view and high noise and is easily affected by sunlight, so it is not as effective as LiDAR. Moreover, they only use a Kalman Filter for tracking, which does not work well in complex scenarios.
In this paper, we propose a multimodal MOT method by fusing the motion information and the deep appearance features of objects. This paper employs a 2D object detector, i.e., You Only Look Once (YOLOv3) [3], and a 3D object detector, i.e., PointRCNN [5], to process the RGB image and the laser point cloud, respectively. The combination of 2D detection and 3D detection helps to improve the robustness of object detection. Then, the MOT is achieved by associating the motion information and the deep appearance features of the target object. A set of experiments on the KITTI Tracking Benchmark is performed to demonstrate the effectiveness of the proposed MOT method. Our contributions are summarized as follows:
(1) The 2D object detection based on the image and the 3D object detection based on the laser point cloud are combined to detect the location of objects, which is robust against light changes and occlusion.
(2) We apply a CNN that is pretrained to discriminate vehicles on a large-scale vehicle Re-Identification dataset to automatically extract the deep appearance features of the target object without manually designing features.
(3) A multimodal MOT method is proposed by fusing the motion information and deep appearance features of the object to achieve the MOT task. In addition, the proposed method obtains competitive qualitative and quantitative tracking results on the KITTI tracking benchmark.
The rest of the paper is organized as follows. Section 2 introduces related works. Section 3 presents the proposed multimodal MOT method. Experiments and their results are presented in Section 4. Finally, the conclusion and future work are summarized in Section 5.

Related Works
This section provides an overview of the two related research topics: multiobject tracking and object detection.

Multiobject Tracking.
The problem of the MOT first appeared in the tracking of object trajectories, for example, the tracking of multiple enemy aircraft or passing missiles. With the development of computer vision, researchers have proposed several MOT methods from different aspects in the past few decades. For example, single-object tracking methods have been extended to support multiple objects. According to the data association, the existing MOT methods can be divided into two categories: offline and online MOT methods. In offline methods [13][14][15][16], the detections of all frames in the sequence are combined to obtain the object trajectory robustly. These methods need to construct a global graph structure, which leads to high computational complexity. In online MOT methods [17][18][19][20], by contrast, the detections are only associated with the existing trajectories frame by frame. Hence, online methods are more suitable for real-time tracking.
Most of the existing MOT methods rely on motion information produced by a Kalman Filter [21], the Hungarian algorithm with a Kalman Filter [17], a Particle Filter [22], or a probability hypothesis density filter [23]. However, in autonomous driving systems, due to the uncertainty of the scene, it is impossible to track objects stably by using motion information alone.
Therefore, more recent methods combine the motion features with the appearance features to improve the re-identification of target objects. Traditionally, the appearance features of objects are manually designed [24], which cannot provide reliable features, especially in complex scenes. Owing to the rapid development of deep learning, deep convolutional networks [9,25,26] have been widely used to extract the appearance features from raw visual data. For example, Wojke et al. [17] used a CNN to extract the pedestrian image features and measure the distance between features for human detection.

Object Detection.
Most of the existing 2D object detection methods are based on CNNs and can be divided into two-stage detectors and one-stage detectors. The two-stage detectors, such as RCNN [27], Fast RCNN [28], Faster RCNN [1], and FPN [29], first generate candidate regions, for example with a Region Proposal Network (RPN), and then perform bounding-box classification and regression. For example, RCNN starts with the extraction of a set of object proposals by selective search. Then, each proposal is rescaled to a fixed-size image and fed into a CNN model that is trained on ImageNet. In this way, the presence of an object within each region is predicted and its category is recognized. Although the two-stage detectors have made great progress, their main drawback is that the redundant feature computation over a large number of overlapping proposals results in a very slow detection speed. The one-stage detectors include YOLO [3,30,31], Single Shot MultiBox Detector (SSD) [2], and RetinaNet [32]. These detectors do not need a region proposal stage. They directly generate the class probabilities and bounding boxes of the objects in a single stage of computation. For example, YOLO applies a single neural network to the whole image. This network divides the image into regions and predicts the bounding boxes and the probabilities for each region simultaneously. Compared with the two-stage detectors, the one-stage detectors have a higher detection speed.
Because point-cloud data contains richer geometric features, 3D object detection has attracted more and more attention. Compared with 2D object detection, 3D object detection is more challenging because it needs to process the point clouds of the scene. Chen et al. [33] projected the point cloud to the bird's-eye view and used 2D CNNs to learn point-cloud features for 3D box generation. Song and Xiao [34,35] divided the point cloud into equally spaced 3D voxels and used 3D CNNs to learn the features of the voxels to generate 3D boxes. Shi et al. [36] used PointNet++ [37] to process the point-cloud inputs for 3D box generation. Besides, some methods [38,39] estimate 3D bounding boxes based on images.

Method
This section introduces the proposed multimodal MOT method, which tracks multiple objects at the same time and records their trajectories. The proposed MOT method includes four main computations: object detection with Nonmaximum Suppression, motion information extraction, deep appearance feature learning, and object tracking with data association. Figure 2 shows an overview of the proposed MOT method. We combine the results of 2D object detection and 3D object detection such that the location of the object can be detected robustly. Based on this, the motion information and appearance features of objects are computed, respectively. Finally, the motion information and appearance features of objects are associated to track the target object.

Object Detection with NMS.
The first task of the MOT is to detect the location of objects in the scene. In this paper, we propose to combine the results of 2D object detection and 3D object detection for robust object detection. We use a 2D detector, i.e., YOLOv3 [3], trained on the training set of the KITTI 2D object detection benchmark, and a 3D detector, i.e., PointRCNN [5], trained on the training set of the KITTI 3D object detection benchmark. The 2D detector processes the RGB image. The output of 2D object detection is a set of detections, where $n_1$ is the number of objects at frame $t$. The 3D detector processes the point clouds that were collected from a LiDAR. The output of 3D object detection is likewise a set of detections, where $n_2$ is the number of objects at frame $t$. For further calculation, we project each LiDAR point from 3D space into the 2D image using the combined camera and LiDAR calibration:

$$y = P_{rect} R_{rect} T_{proj}\, x, \tag{1}$$

where $y$ is the projected point in the RGB image and $x$ denotes the 3D LiDAR point. $P_{rect}$ and $R_{rect}$ are the intrinsic camera parameters: $P_{rect}$ is the camera matrix, and $R_{rect}$ is the rectification matrix that makes the images co-planar. $T_{proj}$ projects the point $x$ from the LiDAR coordinate system onto the camera coordinate system. Both the intrinsic and extrinsic parameters are available in the KITTI dataset [40]. Figure 3 shows an example of point projections.
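The projection step can be sketched as a chain of matrix products followed by a perspective division. The snippet below is a minimal illustration only: the helper names (`matmul`, `project_lidar_point`) and the toy calibration values are hypothetical and are not taken from KITTI or from the authors' code, and the matrices are assumed to be already expanded to homogeneous form.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def project_lidar_point(point_xyz, p_rect, r_rect, t_proj):
    """Project one 3D point to pixel coordinates via y = P_rect R_rect T_proj x.

    p_rect is 3x4; r_rect and t_proj are 4x4 homogeneous matrices.
    Returns (u, v), or None if the point lies behind the image plane.
    """
    x = [[point_xyz[0]], [point_xyz[1]], [point_xyz[2]], [1.0]]  # homogeneous column
    y = matmul(p_rect, matmul(r_rect, matmul(t_proj, x)))
    depth = y[2][0]
    if depth <= 0:
        return None  # not visible in the image
    return (y[0][0] / depth, y[1][0] / depth)  # perspective division


# Toy calibration (made-up values): focal length 100, principal point (50, 40).
P_RECT = [[100.0, 0.0, 50.0, 0.0],
          [0.0, 100.0, 40.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
# Identity stands in for R_rect and T_proj, so the point is treated as
# already expressed in camera axes (an assumption for this toy example).
IDENTITY4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

pixel = project_lidar_point((1.0, 2.0, 10.0), P_RECT, IDENTITY4, IDENTITY4)
print(pixel)
```

With real calibration files, the three matrices would be read from the dataset rather than hard-coded, and points with non-positive depth would be discarded before drawing boxes.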
After the 3D point clouds are projected onto the image, two overlapping boxes will appear on the same object. This paper further uses the Nonmaximum Suppression (NMS) algorithm to get rid of the extra boxes. The NMS sorts all detection boxes on the basis of their scores and selects the box $M$ with the highest score. All other detection boxes that have a large overlap with $M$ are suppressed by using a predefined threshold $N_t$:

$$s_i = \begin{cases} s_i, & \mathrm{IOU}(M, b_i) \le N_t, \\ 0, & \mathrm{IOU}(M, b_i) > N_t, \end{cases} \tag{2}$$

where $b_i$ is the detection box to be screened and $s_i$ is its score. When $\mathrm{IOU}(M, b_i)$ is greater than $N_t$, $b_i$ will be removed. In our experiment, $N_t$ is set to 0.7. Figure 4 compares the detection results without NMS and with NMS.
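The suppression rule can be illustrated with a short greedy NMS routine over a pooled list of 2D boxes (for instance, image detections together with projected 3D detections). This is a sketch under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, and the function names `iou` and `nms` are illustrative rather than the authors' implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, n_t=0.7):
    """Greedy NMS: keep the best-scoring box M, drop boxes with IOU(M, b_i) > n_t."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)          # index of the current highest-scoring box
        keep.append(m)
        order = [i for i in order if iou(boxes[m], boxes[i]) <= n_t]
    return keep


# Two nearly identical boxes on one object plus one separate box.
boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 0.0, 11.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.6]
keep = nms(boxes, scores, n_t=0.7)
print(keep)
```

In this toy case the second box overlaps the first with an IOU of about 0.82, which exceeds 0.7, so only the first and third boxes survive.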

Learning Object Appearance Features.
Before implementing the MOT, we need to extract the appearance features of the object. This paper employs a CNN to automatically learn the deep appearance features of objects from raw visual data. The CNN is trained on a large-scale benchmark dataset [41]. The dataset contains over 50,000 images of 776 vehicles captured by 20 cameras. Figure 5 shows several samples from this dataset.

Extraction of Object Motion Information.
Since the objects are usually moving rather than stationary, it is necessary to extract the motion information of objects for the MOT. This paper employs the Kalman filter to predict the state of the object and then extract its motion information. We use eight parameters $(a, b, c, h, \dot{a}, \dot{b}, \dot{c}, \dot{h})$ to describe the tracking state at frame $k$, where $(a, b)$ is the bounding box center position, $c$ is the aspect ratio, $h$ is the height of the bounding box, and $(\dot{a}, \dot{b}, \dot{c}, \dot{h})$ represents the corresponding velocities in the image coordinate system.
Because the time interval between consecutive frames is very short, the motion can be regarded as a linear constant-velocity model. We get the predicted object state at the next frame and calculate the error covariance matrix $P_k^-$ between the predicted state and the true state:

$$x_k^- = A\, x_{k-1}, \qquad P_k^- = A P_{k-1} A^{T} + Q, \tag{3}$$

where $x_k^-$ is the predicted object state at frame $k$, $A$ is the state transition matrix, $x_{k-1}$ is the object state at frame $k-1$, and $Q$ is the covariance matrix of the prediction noise. Then, we can get the Kalman gain matrix $K_k$ and calculate the estimated state $x_k$:

$$K_k = P_k^- H^{T} \left( H P_k^- H^{T} + R \right)^{-1}, \qquad x_k = x_k^- + K_k \left( z_k - H x_k^- \right), \tag{4}$$

where $z_k$ is the measured value, $H$ is the conversion matrix from $x_k^-$ to $z_k$, and $R$ is the covariance matrix of the measurement noise. Finally, the covariance matrix $P_k$ is updated:

$$P_k = \left( I - K_k H \right) P_k^-. \tag{5}$$
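The predict and update cycle can be illustrated on a reduced problem. The sketch below tracks a single coordinate with the state [position, velocity] instead of the full eight-dimensional state, and the noise values in `Q` and `R` are arbitrary placeholders rather than the settings used in the experiments. All helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


A = [[1.0, 1.0], [0.0, 1.0]]    # constant-velocity transition, one frame per step
H = [[1.0, 0.0]]                # only the position is measured
Q = [[0.01, 0.0], [0.0, 0.01]]  # prediction noise covariance (placeholder)
R = [[1.0]]                     # measurement noise covariance (placeholder)


def predict(x, p):
    """Prediction step: x_pred = A x, P_pred = A P A^T + Q."""
    x_pred = matmul(A, x)
    p_pred = add(matmul(matmul(A, p), transpose(A)), Q)
    return x_pred, p_pred


def update(x_pred, p_pred, z):
    """Correction step for a scalar measurement z."""
    s = add(matmul(matmul(H, p_pred), transpose(H)), R)[0][0]   # H P H^T + R (1x1)
    k = [[row[0] / s] for row in matmul(p_pred, transpose(H))]  # Kalman gain (2x1)
    innovation = z - matmul(H, x_pred)[0][0]
    x_new = [[x_pred[i][0] + k[i][0] * innovation] for i in range(2)]
    kh = matmul(k, H)
    i_minus_kh = [[(1.0 if i == j else 0.0) - kh[i][j] for j in range(2)]
                  for i in range(2)]
    p_new = matmul(i_minus_kh, p_pred)                          # (I - K H) P_pred
    return x_new, p_new


# A target moving one unit per frame, observed without noise for simplicity.
x, p = [[0.0], [0.0]], [[10.0, 0.0], [0.0, 10.0]]
for z in [1.0, 2.0, 3.0, 4.0, 5.0]:
    x, p = predict(x, p)
    x, p = update(x, p, z)
print(x)
```

After a few frames the estimated position approaches the last measurement and the estimated velocity approaches one unit per frame, which is the behavior the constant-velocity model is meant to capture.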

Object Tracking Based on Data Association.
The next step is to associate the deep appearance features and the motion information of the object for the MOT. First, this paper uses the Mahalanobis distance to measure the motion correlation between the predicted state of the Kalman Filter and the newly detected bounding boxes:

$$d^{(1)}(i, j) = \left( d_j - y_i \right)^{T} S_i^{-1} \left( d_j - y_i \right), \tag{6}$$

where $d_j$ denotes the $j$th bounding box detection, and $y_i$ and $S_i$ represent the mean and covariance of the $i$th predicted bounding box. A threshold can be adjusted to control the minimum confidence of the motion information association between objects $i$ and $j$. We denote this decision with an indicator $b^{(1)}_{i,j}$, as shown in equation (7). The indicator is equal to 1 if the Mahalanobis distance is smaller than or equal to a threshold $t^{(1)}$, which is set to 9.4877 (the 0.95 quantile of the chi-square distribution with four degrees of freedom) for our four-dimensional measurement space:

$$b^{(1)}_{i,j} = \mathbb{1}\left[ d^{(1)}(i, j) \le t^{(1)} \right]. \tag{7}$$

The Mahalanobis distance is a suitable association metric only when the motion uncertainty is very low. In the image space, however, the Kalman filter framework provides only a rough prediction. Therefore, this paper also adopts a second metric. It measures the smallest cosine distance of the appearance features between the $i$th track and the $j$th detection as follows:

$$d^{(2)}(i, j) = \min_{k} \left( 1 - r_j^{T} r_k^{(i)} \right), \tag{8}$$

where $r_j$ is the appearance feature vector of detection $d_j$ and $r_k^{(i)}$ represents the feature vector of the $i$th tracked object at the most recent frame $k$. In our experiment, $k$ ranges over at most 100 available feature vectors. In addition, in order to determine whether the appearance features are related, we introduce a binary indicator, as shown in equation (9). The threshold $t^{(2)}$ for this indicator is set on the VeRi dataset:

$$b^{(2)}_{i,j} = \mathbb{1}\left[ d^{(2)}(i, j) \le t^{(2)} \right]. \tag{9}$$
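The two metrics and their gates can be sketched in a few lines. To keep the example free of matrix inversion, the covariance is assumed to be diagonal, which is a simplification of the general Mahalanobis distance; the appearance threshold `T2 = 0.3` is a placeholder, since the value used in the experiments is not stated here. All function names are illustrative.

```python
import math

T1 = 9.4877  # motion gate from the text (four-dimensional measurement space)
T2 = 0.3     # appearance gate: placeholder value, not taken from the paper


def mahalanobis_sq(detection, mean, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance matrix."""
    return sum((d - m) ** 2 / v for d, m, v in zip(detection, mean, variances))


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def min_cosine_distance(track_gallery, detection_feature):
    """Smallest cosine distance between a detection and a track's stored features."""
    r_j = normalize(detection_feature)
    return min(1.0 - sum(a * b for a, b in zip(normalize(r_k), r_j))
               for r_k in track_gallery)


def gate(distance, threshold):
    """Binary indicator: 1 if the distance passes the threshold, else 0."""
    return 1 if distance <= threshold else 0


# Measurement layout (a, b, c, h): center, aspect ratio, height.
d1 = mahalanobis_sq((10.0, 10.0, 0.5, 20.0), (11.0, 10.0, 0.5, 22.0),
                    (1.0, 1.0, 0.01, 4.0))
d2 = min_cosine_distance([[1.0, 0.0], [0.0, 1.0]], [3.0, 4.0])
print(d1, gate(d1, T1), d2, gate(d2, T2))
```

Keeping a gallery of recent feature vectors per track, rather than a single vector, is what lets the appearance metric match an object whose view changed while it was occluded.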
The Mahalanobis distance determines whether the position predicted by the Kalman filter is related to the new detection, which is especially useful for short-term prediction. The cosine distance considers the appearance of the tracked objects, which is especially useful for recovering identities after a long period of occlusion. Therefore, this paper combines the two metrics using a weighted sum:

$$c_{i,j} = \lambda\, d^{(1)}(i, j) + (1 - \lambda)\, d^{(2)}(i, j), \tag{10}$$

where we call an association admissible if $b^{(1)}_{i,j} = 1$ and $b^{(2)}_{i,j} = 1$. The hyperparameter $\lambda$ is used to control the influence of each metric on the combined association. For example, when there is substantial object motion, the prediction of the constant-velocity motion model becomes less effective. Thus, the appearance metric becomes more significant by reducing the value of $\lambda$; on the contrary, when there are few vehicles on the road and no long-term partial occlusions, increasing the value of $\lambda$ raises the importance of the motion distance metric.
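The weighted combination and the admissibility test can be sketched as follows. The text does not specify which assignment solver is applied to the combined cost, so the greedy matcher below is only a simple stand-in; the appearance threshold `t2=0.3` is likewise a placeholder, and the function names are hypothetical.

```python
def association_cost(d1, d2, lam=0.1, t1=9.4877, t2=0.3):
    """Weighted sum lam*d1 + (1-lam)*d2, or None if either gate rejects the pair."""
    if d1 > t1 or d2 > t2:
        return None  # not admissible
    return lam * d1 + (1.0 - lam) * d2


def greedy_match(motion_dist, appearance_dist, lam=0.1):
    """Match tracks (rows) to detections (columns) by ascending combined cost.

    Greedy matching is an illustrative stand-in for an optimal assignment solver.
    """
    candidates = []
    for i, row in enumerate(motion_dist):
        for j, d1 in enumerate(row):
            cost = association_cost(d1, appearance_dist[i][j], lam)
            if cost is not None:
                candidates.append((cost, i, j))
    candidates.sort()
    matches, used_tracks, used_dets = [], set(), set()
    for _, i, j in candidates:
        if i not in used_tracks and j not in used_dets:
            matches.append((i, j))
            used_tracks.add(i)
            used_dets.add(j)
    return matches


# Two tracks and two detections; off-diagonal pairs fail one of the gates.
motion = [[1.0, 20.0], [2.0, 1.5]]
appearance = [[0.05, 0.10], [0.50, 0.10]]
matches = greedy_match(motion, appearance, lam=0.1)
print(matches)
```

With a small `lam`, the appearance term dominates the cost, while the motion gate still removes pairs that are physically implausible.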
Finally, in our implementation, the maximum number of frames for which a target may remain lost, $A_{max}$, is considered. In order to avoid redundant computations, if a tracked object is not re-identified within the most recent $A_{max}$ frames since its last successful association, it is assumed to have left the scene. If the object is seen again, a new ID will be assigned to it. A new track is hypothesized when a detection cannot be associated with any existing track. If the predicted position of the object can be correctly associated with a detection in $F_{min}$ consecutive frames, we confirm that a new tracked target has appeared.
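The track lifecycle described above can be sketched as a small state machine. Several details are assumptions made for illustration and are not stated in the text: the detection that creates a track counts as its first matched frame, a tentative track is dropped on its first miss, and a confirmed track is deleted once it has been missed for `A_MAX` consecutive frames. The class and method names are hypothetical.

```python
F_MIN = 3   # consecutive matched frames needed to confirm a new track
A_MAX = 30  # consecutive missed frames after which a track is dropped


class Track:
    """Minimal bookkeeping for one track: tentative -> confirmed -> deleted."""

    def __init__(self, track_id):
        self.track_id = track_id
        self.hits = 1      # the creating detection counts as the first match (assumption)
        self.misses = 0
        self.state = "tentative"

    def mark_matched(self):
        """Call when a detection was associated with this track in the current frame."""
        self.hits += 1
        self.misses = 0
        if self.state == "tentative" and self.hits >= F_MIN:
            self.state = "confirmed"

    def mark_missed(self):
        """Call when no detection was associated with this track in the current frame."""
        self.misses += 1
        # A tentative track must be matched in consecutive frames (assumption:
        # one miss discards it); a confirmed track survives up to A_MAX misses.
        if self.state == "tentative" or self.misses >= A_MAX:
            self.state = "deleted"


track = Track(track_id=1)
track.mark_matched()
track.mark_matched()
print(track.state)  # confirmed after F_MIN consecutive matched frames
```

A deleted track is never revived in this sketch, so an object that returns after a long absence would start a new track with a new ID, in line with the behavior described in the text.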

Experiment
This section introduces the dataset, evaluation metrics, training parameters, and experimental evaluation results of the experiments on the KITTI Tracking Benchmark.

Dataset.
The proposed method was evaluated on the KITTI tracking benchmark [43]. The KITTI dataset was collected under 4 different scenarios, including city, residential, road, and campus. Some samples of the KITTI dataset are shown in Figure 6. The dataset consists of 21 training sequences and 29 test sequences. In each sequence, LiDAR point clouds, RGB images, and calibration files are provided. In the training sequences, eight different classes are labeled, including car, pedestrian, and cyclist. The objects in the images are annotated with 3D and 2D bounding boxes across frames, and each object has a unique ID. In this work, we used all 29 test sequences for model validation and used only the car subset for model evaluation because it has the most instances of all object types.

Evaluation Metric.
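The indexes used to evaluate the performance of the proposed MOT method were as follows:
Multiobject Tracking Accuracy (MOTA) [44]. Equation (11) shows the computation of the MOTA, where $t$ is the index of the frame, $G_t$ is the number of ground-truth objects in frame $t$, and $FN_t$, $FP_t$, and $IDS_t$ are the numbers of false negatives, false positives, and identity switches in frame $t$:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( FN_t + FP_t + IDS_t \right)}{\sum_t G_t}. \tag{11}$$

Multiobject Tracking Precision (MOTP): the alignment accuracy between the annotated and the predicted bounding boxes [44].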
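As a worked illustration of the MOTA definition in [44], the score subtracts from one the ratio of all errors (false negatives, false positives, and identity switches) to all ground-truth objects, both summed over frames. The per-frame counts below are invented for the example and are not results from the experiments; the function name and dictionary keys are hypothetical.

```python
def mota(per_frame):
    """MOTA = 1 - sum_t(FN_t + FP_t + IDS_t) / sum_t(G_t).

    per_frame is a list of dicts with keys "fn", "fp", "ids", and "gt".
    """
    errors = sum(f["fn"] + f["fp"] + f["ids"] for f in per_frame)
    ground_truth = sum(f["gt"] for f in per_frame)
    return 1.0 - errors / ground_truth


# Invented counts for two frames with five ground-truth objects each.
frames = [
    {"fn": 1, "fp": 0, "ids": 0, "gt": 5},
    {"fn": 0, "fp": 1, "ids": 1, "gt": 5},
]
score = mota(frames)
print(round(score, 4))
```

Because the errors are summed before dividing, a frame with many objects weighs more than a sparse frame, and the score can become negative when the errors outnumber the ground-truth objects.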

Training Parameters.
This paper trained the 2D detector, i.e., YOLOv3, on the training set of the KITTI 2D object detection benchmark [5] and trained the 3D detector, i.e., PointRCNN, on the training set of the KITTI 3D object detection benchmark [36]. The IOU threshold $N_t$ of the NMS module was set to 0.7. The minimum number of matched frames required to create a new trajectory, $F_{min}$, was set to 3, and the maximum number of frames for which a target may remain lost was set to $A_{max} = 30$. Because the prediction results of the Kalman Filter are rough and there are many scenes with long-term partial occlusions in the KITTI dataset, we set $\lambda = 0.1$.

Qualitative Evaluation.
We evaluated the proposed tracking method qualitatively by using the KITTI test sequences. Different scenarios including occlusions, clutter, parked vehicles, and false positives from detectors were considered in the qualitative evaluation. Figure 7 shows an example from test sequence 0 in the test set. Each vehicle was assigned a tracking ID as a reference. Although the vehicles are parked densely and irregularly, the proposed MOT method can continuously detect and track them. Moreover, this figure shows that, since the image is easily affected by the environment, such as illumination changes and partial occlusion, the shape of the detected target changes. In addition, the scale of the target objects may differ greatly. In this case, the proposed MOT method still obtained a relatively high tracking performance. The experimental results show that our method can locate each car well even in cluttered and strongly lit scenes and keep the ID of each car unchanged. Figure 8 shows another example from test sequence 1. Figure 8(a) shows that the object detector produces a false detection result, and Figure 8(b) shows that the false positive of the detector is overcome by data association. In the case of transient errors in object detection, the proposed MOT method can still track the target stably. Hence, these experimental results demonstrate the robustness of the proposed MOT method.

Benchmark Results.
We further evaluated the proposed MOT method on the KITTI Tracking Benchmark. In this evaluation, we considered some published online MOT methods for comparison. The results are presented in Table 2. It can be seen that the proposed MOT method is very competitive. In particular, the proposed MOT method returns the fewest identity switches while maintaining competitive MOTA scores, MOTP scores, and track fragmentations. The tracking accuracy is mainly affected by a large number of false positives. Given their overall impact on the MOTA score, the combination of the 2D and 3D object detection results can significantly improve the performance of the MOT. Besides, because we set the maximum allowed track age and associate the object motion information with the appearance features, the proposed MOT method has the fewest identity switches. Therefore, the proposed MOT method can generate a relatively stable trajectory of the target object.

Ablation Study.
The ablation study evaluated the effects of the hyperparameters on the performance of the proposed MOT method. Table 3 shows the results of the ablation study on the KITTI benchmark. The hyperparameter $N_t$ is the IOU threshold, and $F_{min}$ denotes the minimum number of matching frames required to create a new trajectory. From the table, it can be seen that setting $N_t = 0.6$ may miss some correct detection results, because the number of detected objects is reduced. Setting $N_t = 0.8$ may result in some wrong detection results, which is also the reason why it has the most identity switches (IDS). $F_{min} = 1$ means that tracking starts immediately when a new target is detected, which leads to more IDS and fragmentations (FRAG). $F_{min} = 5$ yields the fewest IDS, but the MOTA is lower. Therefore, we finally set $N_t = 0.7$ and $F_{min} = 3$.

Conclusion
This paper proposed a multimodal MOT method by fusing the motion information and the deep appearance features of objects. In this method, we use a Nonmaximum Suppression algorithm to combine a 2D object detector and a 3D object detector for robust object detection. Then, the deep appearance features of objects are learned by a CNN, and the motion information of objects is computed by the Kalman Filter. The MOT task is achieved by associating the appearance features and the motion information of the target object. The effectiveness of the proposed MOT method was demonstrated in a set of experiments. The proposed MOT method can track objects stably in crowded scenes and effectively avoid false detections. On the KITTI tracking benchmark, the proposed method also shows competitive results.
Although 3D object detection is used in the proposed MOT method, it is only used as auxiliary information for 2D object detection. 3D object detection can provide accurate position and size estimation for autonomous driving. Therefore, our future work will be directed towards 3D multitarget tracking that can adapt to more complex environments.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.