Visual Odometry for Self-Driving with Multihypothesis and Network Prediction

Robustness is critical in visual odometry (VO) systems, as it determines reliable performance across varied scenarios and challenging environments. With the development of data-driven techniques such as deep learning, the combination of data-driven VO and traditional model-based VO has achieved accurate tracking performance. However, local optima in the model-based cost function still limit robustness. In this study, we introduce a novel framework that embeds a particle filter (PF) in the optimization process, where the PF is constructed from deep neural network (DNN) predictions. We propose building the PF from motion prediction classification and its uncertainty, based on the characteristics of on-road driving motion. In addition, an interval DNN prediction strategy is introduced to improve real-time performance. Experimental results show that our framework achieves better tracking accuracy and robustness than existing works while maintaining comparable time consumption.


Introduction
As a fundamental problem in robot navigation, visual odometry (VO) has been studied for decades. Owing to mature studies of camera modeling and perception modeling [1], model-based methods have drawn considerable attention over the last twenty years. Representative studies have been published [2,3], and many milestone results in tracking accuracy have been achieved.
With the development of deep learning over the last decade, especially the availability of open image datasets [4] and graphics processing units (GPUs), data-driven approaches, distinct from model-based methods, are gaining popularity [5,6]. In particular, data-driven end-to-end visual odometry methods [7,8] are more straightforward and simpler than model-based methods, although their accuracy and generalization are still inferior. To improve the accuracy and robustness of VO systems, combinations of model-based and data-driven methods have been proposed [9]. Such combinations incorporate the prior knowledge provided by a DNN into a model-based optimization framework [10].
Regarding self-driving applications, VO plays an essential role in navigation tasks such as visual localization and mapping [11]. Given the complicated and challenging situations of city traffic, combinations of data-driven and model-based methods are a promising way to handle these challenges [12]. However, the robustness of VO and visual simultaneous localization and mapping (VSLAM) systems in the varied environments of self-driving applications is still an open problem.
To address the robustness problem in self-driving, we propose to combine a DNN-based image classification network with a particle filter (PF) in a model-based pose estimation framework, as shown in Figure 1. With the generalization provided by the image classification network, owing to its high-level representation, and the multiple hypotheses introduced by the DNN-enabled PF in pose optimization, the robustness and accuracy of visual pose estimation are improved. The simulation and experimental results demonstrate the advantages of our proposed method. The contribution of our study is twofold: (1) we propose a DNN-enabled PF for self-driving pose estimation, where the DNN-based image classification network is used for generalization; (2) we introduce multihypothesis pose estimation in a model-based framework to overcome local optima during optimization.

Related Work and Motivation
Our proposed work intersects two topics: DNN applications in VO and on-road self-driving. Applying DNNs to VO and VSLAM systems (Figure 2) has become popular in recent years due to the significant progress in image processing and pattern recognition achieved by DNNs. For end-to-end estimation of the transformation in VO, DeepVO [7] and UnDeepVO [8] are typical examples of supervised and unsupervised training methods, respectively, which leverage DNNs for pairwise image processing. Due to the multilevel perception in DNNs, especially the convolution operation, the information obtained from the various layers is rich, and tracking robustness can be enhanced. However, the localization accuracy of these data-driven methods is still lower than that of model-based VO and VSLAM systems such as DSO [2], ORB-SLAM [13], and SVO [3].
To improve the accuracy of end-to-end data-driven VO pose estimation, researchers have proposed combining DNN prediction with model-based pose optimization. In CNN-SVO [14], the output from the DNN is used as prior knowledge for optimization, leading to accurate camera tracking thanks to the feasible DNN prediction. Similarly, CNN-SLAM [15] incorporates depth prediction from a DNN into a mature model-based VSLAM system [16], resulting in a dense structure and improved camera motion estimation. The study in [9] takes the pose transformation predicted by a DNN directly as the localization initialization for more robust tracking, although the improved performance is only achieved in environments similar to the training data. The study in [17] uses both depth and pose DNN predictions for optimization initialization, where the depth, pose, and uncertainty are generated by the network simultaneously.
As for VO and VSLAM systems in self-driving, visual-based automated driving is advancing rapidly thanks to the progress made with deep learning. There are two main paradigms in this area [11]: (1) the mediated perception approach, which semantically reasons about the scene and makes driving decisions based on it [18]; (2) the behavior reflex approach, which learns driving decisions end-to-end [19]. Reliable and fast mapping and localization are needed in almost all driving scenarios. Owing to the high resolution of cameras compared with other sensors such as RADAR and LiDAR, situations that require detailed knowledge about the environment are well suited to VSLAM and VO [20]. Therefore, VO and VSLAM systems play an essential role in self-driving. Especially with the development of deep learning, applying data-driven methods to self-driving has achieved much progress in recent years [21,22]. However, in challenging situations, including motion blur, large perspective changes, and illumination changes, the robustness of existing VO and VSLAM systems still cannot satisfy the requirements.

Figure 1: The framework of our method. With the prior knowledge from DNN prediction, we leverage the DNN motion classification to establish a particle filter. With the particle filter, multihypothesis pose optimization can be performed for robust visual tracking.
To handle the robustness problem, especially the use of DNN output in challenging environments while maintaining tracking accuracy, we propose a system framework with a DNN-enabled PF in the pose estimation optimization.
Instead of estimating motion directly from DNN prediction as in existing work [7,8], we use image classification results for more general performance. Owing to the uncertainty representation in image classification, the generality of image classification networks has been verified [23,24]. Moreover, for robust pose estimation, instead of the single hypothesis used in related work [17], a multihypothesis back-end optimization strategy is designed to overcome local optima in optimization, where the feasibility of the multiple hypotheses is guaranteed by the DNN motion prediction.

Methodology
In our method, the robustness improvement is realized by multihypothesis pose optimization with a PF constructed from DNN image classification. Three parts are involved in the methodology: motion prediction by DNN-based image classification, PF construction using the motion prediction, and pose optimization with the PF multihypothesis.

Prior Knowledge from DNN Image Classification.
We use the DNN introduced in [25] to provide the motion label prediction, which is regarded as prior knowledge. Since the motion of a self-driving car is limited, the DNN classification of the captured image pair is feasible because of the limited motion types: planar rotation and translation. Instead of obtaining the transformation prediction directly from the DNN (as in the end-to-end style), the input image is classified into six motion types: go-forward, right-side, left-side, no-rotation, turn-left, and turn-right, as shown in Figure 3.
With these six labels, the coefficients of the particle filter in the multihypothesis optimization can be determined. It is also worth noting that obtaining training data with motion-type classification labels is much easier than collecting a dataset with exact transformation information, because the requirement on ground truth transformation accuracy, as well as the dimension of the ground truth labels, is relaxed. Therefore, the cost of the training process is also lower than for methods with exact DNN motion transformation prediction.
According to the discussion above, we define three coefficients of the objective function in formula (1), where r_l, r_r, r_s, m_l, m_r, and m_s are the DNN predictions of the image classification probabilities of turning left, turning right, no turning, left-side, right-side, and go-straight, respectively.
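The body of formula (1) did not survive extraction. As a rough illustration only, the sketch below assumes the coefficients are signed differences of the class probabilities, with the L2 normalization n applied to the translation part; the function name and exact mapping are assumptions, not the paper's formula.

```python
import numpy as np

def motion_coefficients(r_l, r_r, r_s, m_l, m_r, m_s):
    """Illustrative mapping from the six class probabilities to the
    rotation/translation coefficients (assumed form, not formula (1))."""
    C_r = r_r - r_l                      # signed rotation tendency
    t = np.array([m_r - m_l, m_s])      # raw lateral (X) and forward (Z) parts
    n = np.linalg.norm(t)               # L2 normalization for monocular scale
    C_x, C_z = (t / n) if n > 0 else (0.0, 0.0)
    return C_r, C_x, C_z
```

A purely forward prediction (m_s dominant, turn probabilities balanced) then yields C_r near zero and a unit forward translation direction.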

Since the sum of m_l, m_r, and m_s is 1, while translation in a monocular transformation is normalized by the L2-norm, a normalization operation n is applied here, wherein m_l and m_r share the same degree of freedom. The coordinate system is shown in Figure 4. The on-road self-driving car is assumed to move in the X-Z plane and rotate about the Y-axis. With the defined coordinate system and the planar motion assumption, the motion prediction transformation matrix v_t can be built from the obtained C_r^t, C_x^t, and C_z^t as shown in formula (2), and v_t is then used in the particle filter establishment.
where the two unknown variables are the rotation factor θ and the translation factor α.
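A planar motion transform of this kind can be sketched as follows, assuming v_t is a 4x4 homogeneous matrix with rotation θ about the Y-axis and translation α·(C_x, 0, C_z) in the X-Z plane; the exact parameterization of formula (2) is an assumption.

```python
import numpy as np

def planar_motion(theta, alpha, C_x, C_z):
    """Homogeneous transform for planar on-road motion: rotation theta
    about the Y-axis, translation alpha * (C_x, 0, C_z) in the X-Z plane."""
    c, s = np.cos(theta), np.sin(theta)
    v = np.eye(4)
    v[0, 0], v[0, 2] = c, s       # rotation about Y
    v[2, 0], v[2, 2] = -s, c
    v[0, 3] = alpha * C_x         # translation along X
    v[2, 3] = alpha * C_z         # translation along Z
    return v
```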
In addition to the motion prediction, the corresponding uncertainty u_i predicted by the network is also recorded for the PF construction; it indicates the confidence of the motion prediction results and is used to construct the covariance matrix in the PF.

Preliminaries of the Particle Filter.
The basic idea of the PF is that the belief is represented by a set of samples (also called particles) drawn according to the posterior distribution. In other words, rather than approximating posteriors in a parametric form, as in the Kalman filter and the extended Kalman filter, the PF represents posteriors by weighted particles that approximate the desired distribution.
The PF algorithm applies Bayesian iteration to update the particle states through prediction and update stages.
The Bayesian iterative formula is as follows:

pre(x_t) = ∫ p(x_t | u_t, x_{t−1}) bel(x_{t−1}) dx_{t−1},
bel(x_t) = η p(z_t | x_t) pre(x_t). (3)

As formula (3) shows, bel(x_t) is the belief obtained from the prediction pre(x_t) by iteration. The kinematic model p(x_t | u_t, x_{t−1}) predicts the state, and the observation model p(z_t | x_t) updates the pose.
By recursively applying prediction and update, the algorithm continually refreshes the particle set. The flexibility of the nonlinear kinematic model and the observation model makes the algorithm extremely robust and suitable for robot localization in almost any field. Moreover, the accuracy of the PF depends to a large extent on the size of the particle set: the larger the set, the higher the achievable accuracy. Only as the particle set grows very large does the particle distribution approach the true distribution arbitrarily closely.
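The predict-update recursion above can be illustrated with a minimal bootstrap particle filter on a 1-D state; this is a generic sketch of the technique, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles, weights, u, z, motion_std=0.1, obs_std=0.5):
    """One predict-update cycle of a bootstrap particle filter (1-D state)."""
    # predict: sample from the kinematic model p(x_t | u_t, x_{t-1})
    particles = particles + u + rng.normal(0.0, motion_std, particles.shape)
    # update: reweight by the Gaussian observation likelihood p(z_t | x_t)
    weights = weights * np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
    weights = weights / weights.sum()
    # resample: draw a new equally weighted set in proportion to the weights
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```

Run over several steps, the particle mean tracks the true state as long as the motion and observation models are roughly correct.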

DNN-Enabled Particle Filter.
Based on the original PF described above and the DNN motion prediction, we propose a DNN-enabled PF algorithm, where the initialization is realized by the DNN predictions C_x^t, C_z^t, and C_r^t and the uncertainty u_i, and the resampling process takes the DNN observation into account. In the initialization, the mean of the particles is set at the origin, and the initial covariance is obtained from the prediction uncertainty. The covariance matrix Σ_t is constructed from the motion prediction results C_x^t, C_z^t, and C_r^t, the corresponding uncertainty u_i obtained from the Softmax layer of the leveraged network, and a given weight-balance value Υ. The mean is given by the last iteration. Therefore, the initialization distribution of the PF can be written as a Gaussian N_t for particle sampling, where x_i^t denotes the i-th particle at time t.
After initialization, in the motion update process, all particles are updated according to the motion model and the predicted velocity, which is obtained from the visual tracking process with the most weighted particle. After each motion update and weight calculation, we compute N_eff to evaluate the necessity of particle resampling:

N_eff = 1 / Σ_i (w_i^t)^2,

where w_i^t is the weight of particle i at time t, which is an indicator of bel(x_t). The particle weight is obtained from the residual Γ_t of the optimization initialized with the corresponding pose, with β a given value for particle weight generation.
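With normalized weights, the effective sample size and a residual-to-weight mapping can be sketched as follows; the exponential form exp(−β·Γ) is an assumption about the weight formula, which did not survive extraction.

```python
import numpy as np

def effective_sample_size(weights):
    """N_eff = 1 / sum_i (w_i)^2 for normalized weights; resampling is
    typically triggered when N_eff drops below, e.g., half the set size."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def weights_from_residuals(residuals, beta=1.0):
    """Map optimization residuals to normalized particle weights
    (the exponential form is an assumed, not the paper's, mapping)."""
    w = np.exp(-beta * np.asarray(residuals, dtype=float))
    return w / w.sum()
```

Uniform weights give N_eff equal to the particle count; a single dominant weight drives N_eff toward 1, signaling degeneracy.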
In the resampling process, the resampling distribution is determined by the motion model and the observation from the DNN. In detail, we adopt the resampling distribution in [23] for weighted sampling, where the DNN prediction is considered. Given the motion prediction at time t, the most weighted particle pose at time t − 1 is denoted as T_{t−1} and assigned to x_{t−1}^w. With the transformation estimate v_t generated from C_x^t, C_z^t, and C_r^t, the resampling distribution is limited by a given threshold r to a range around the network prediction, and f(·) is the motion model function, written as p(x_t | u_t, x_{t−1}) in the Bayes iteration. In our proposed PF algorithm, the DNN prediction is only leveraged in the resampling and initialization processes. Since resampling is not required after every motion update, the DNN prediction can be executed at intervals. With such an interval strategy, the real-time performance of online visual tracking is improved.
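The bounded resampling step can be sketched as follows, assuming a minimal (x, z, yaw) vector representation in which composing the best particle pose with the predicted motion reduces to addition; this is a simplification of the full rigid-body composition.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_near_prediction(T_prev, v_t, sigma, r, n):
    """Draw n planar pose hypotheses (x, z, yaw) around the DNN-predicted
    pose, rejecting samples farther than the threshold r from it."""
    mean = T_prev + v_t                      # compose best particle with prediction
    samples = []
    while len(samples) < n:
        s = mean + rng.normal(0.0, sigma, 3)
        if np.linalg.norm(s - mean) <= r:    # keep samples within radius r
            samples.append(s)
    return np.array(samples)
```

The rejection test keeps every hypothesis within the trust region r around the network prediction, which is what makes the sampled particles feasible initializations.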

Pose-Graph Optimization with the Particle Filter.
With the PF constructed from DNN prediction, the pose estimation in the optimization is conducted with multiple hypotheses. We give the initial estimate of the optimization according to the particle poses and execute the optimization in parallel. The optimization function is written according to the definition of the reprojection error, where p_i is a feature point in the image plane, P_i is the corresponding projected 3-D point, π(·) is the projection function under the estimated pose, and the minimizer is the result of the pose optimization. During the optimization, the initial estimate T_0 is given by the pose of the particle filter mentioned above. After the optimization, the residual is assigned to the corresponding particle as its weight. Only the pose graph is involved in the optimization, instead of the complete graph including map points and poses, so the variable dimension is much smaller, saving computation time. We also run the pose-graph optimization within a sliding window, which only considers the neighboring variables most related to the current pose estimation. The size of the sliding window is set according to the covisibility graph.
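The reprojection error for a single correspondence can be sketched with a standard pinhole model; the intrinsic matrix K is introduced here for illustration and is not specified in the text.

```python
import numpy as np

def reprojection_residual(T, P, p, K):
    """Reprojection error for one correspondence: transform the 3-D point P
    by the 4x4 pose T, project with intrinsics K, compare to observed pixel p."""
    Pc = (T @ np.append(P, 1.0))[:3]      # point in the camera frame
    uv = K @ (Pc / Pc[2])                 # pinhole projection pi(.)
    return p - uv[:2]                     # 2-D pixel residual
```

Summing the squared norms of these residuals over the sliding-window correspondences gives the cost that each particle-initialized optimization minimizes.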

Experimental Setup.
We use the pretrained DNN [26] in this experiment, which has been trained for motion-type classification of a single image input. An example of the training dataset is shown in Figure 5. We also fine-tune the pretrained network on the KITTI dataset, which covers city self-driving environments. Sequences 03 and 04 are used for fine-tuning and are excluded from the tests. To evaluate our proposed method, we use the following metrics: pose estimation accuracy of visual localization, the number of optimization iterations, and time consumption in seconds. The practicality of our method is evaluated by the runtime performance.
For visual tracking and pose optimization, we use ORB-SLAM3 as the base of our implementation, and the optimization is performed with the DNN-enabled PF algorithm. We build the vision part of this implementation upon the OpenCV library.
The Levenberg-Marquardt (LM) solver [27] in the Eigen library is selected as the optimization solver, which is effective for nonlinear optimization problems. All experiments run on a laptop with an Intel Core i7 CPU (8 cores, 16 threads), 16 GB of RAM, and an RTX-2060 GPU.

Visual Tracking Precision.
By using the prediction from the DNN classification result, we estimate trajectories on the KITTI dataset; accuracy is reported as the Root Mean Square Absolute Trajectory Error (RMS ATE) [28]. The results are listed in Table 1, where the unit of RMS ATE is m. For our proposed method, two groups of experiments are conducted: the complete proposed method, with DNN prediction and the DNN-enabled PF in the pose estimation optimization; and the multihypothesis optimization with PF only, without DNN prediction. These two groups are set up to show the advantages of multihypothesis optimization and of DNN prediction in particle sampling. Since our system is built on ORB-SLAM, the accuracy of the original ORB-SLAM is also shown. As competitors, representative works such as ORB-SLAM [13], DDVO [9], and DSO [2] are included in the comparisons. Note that loop-closure detection and global optimization in ORB-SLAM are disabled, since most VO systems do not perform global optimization. The comparison results are given in Table 1; we generated the ORB-SLAM and DSO results ourselves and extracted the DDVO results from the paper [9]. ORB-SLAM and DSO are representative model-based indirect and direct systems, respectively, and DDVO is a VO system leveraging DNN prediction in both environment construction and pose estimation. Our method obtains better accuracy on most sequences. Examples of the results, including estimated trajectories compared with the ground truth and the error curves, are shown in Figures 6 and 7.
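Over already-aligned trajectories, RMS ATE reduces to the root mean square of the per-frame position errors. The following is a minimal sketch; a full evaluation as in [28] would first align the estimated trajectory to the ground truth (e.g., with a similarity transform).

```python
import numpy as np

def rms_ate(est, gt):
    """RMS Absolute Trajectory Error between aligned trajectories,
    given as N x 3 arrays of camera positions (meters)."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    err = np.linalg.norm(est - gt, axis=1)   # per-frame position error
    return float(np.sqrt(np.mean(err ** 2)))
```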

Convergence and Time Consumption.
Based on the results above, an example of the real-time performance of our system is shown in Figure 8. The time consumption of the multihypothesis pose optimization is verified here. Since the DNN prediction is executed at intervals, we count its time consumption over the whole process. We also provide the result of our method with PF only, whose sampling distribution does not consider the neural network prediction; the corresponding iteration process is similar to the PF algorithm introduced in [29].
We also provide a real-time performance analysis of the whole system in Table 2. Three groups of experiments are conducted: the traditional optimization with a single initialization, called "single hypothesis"; the optimization with multiple hypotheses from a PF constructed from the motion model and random initialization, called "multiple hypothesis"; and the optimization with the DNN-enabled PF, where the PF is established with the DNN prediction.
In addition to the time consumption, to show the real-time performance independently of the computer configuration, we provide the convergence performance of the DNN-enabled particle filter. Since the iteration time is related to the efficiency of particle sampling, the advantage of particle filter sampling with the DNN observation can be shown by the convergence performance. Since the feasibility of the sampled particles is enhanced by the DNN prediction, our method converges in fewer iterations than the others.
Based on the visual tracking results above, an example of the convergence performance in the multihypothesis optimization process is shown in Figure 6, which plots the number of particle filter iterations as well as the value of the cost function. With the same input, the multihypothesis optimization with DNN prediction satisfies the termination condition in fewer iteration steps. The DNN prediction also provides an appropriate initialization for pose estimation, which benefits the optimization convergence and achieves the fastest convergence. The average framerate of the proposed method on the KITTI dataset is 8.1 Hz, a runtime performance able to meet the requirements of most applications.

Discussion.
Here we present an analysis of the results in Tables 1 and 2. In terms of tracking accuracy, our proposed method outperforms the popular systems, with a smaller tracking error as indicated by the RMS ATE; examples of visual tracking trajectories are shown in Figures 6 and 7.
Compared with popular VSLAM systems such as ORB-SLAM, DSO, and DDVO, the proposed method with multihypothesis optimization initialization converges to the globally optimal solution with higher probability and in fewer iterations. As the tables show, the multihypothesis optimization with PF only obtains better tracking accuracy than the existing systems, while its error is still higher than that of the trajectory estimation with neural network prediction. Because the neural network prediction makes the sampled particles highly effective, convergence is improved by a sampling distribution with appropriate observation and prediction.
As for the real-time performance of the optimization, the proposed method with neural network prediction is less time-consuming than the one without it. Because the neural network prediction provides a feasible distribution for particle sampling, the time required for particle convergence is reduced. The optimization initialization by the DNN-enabled PF also provides a lower initial cost for fast convergence, as shown in the convergence analysis in Figure 8. The optimization with the multihypothesis only requires the most iteration steps because the feasibility of the PF samples is not guaranteed. Compared with existing methods that consider a single initialization hypothesis during optimization, our method still spends more time. However, for visual tracking applications, the average frame rate of our proposed method can still satisfy most applications, verifying its practicality.

Conclusion
In this paper, we propose a method that demonstrates the robustness and accuracy introduced by multihypothesis pose estimation with the proposed DNN-enabled PF. An image classification DNN provides the motion label prediction for optimization initialization. In addition, the DNN-enabled PF improves the particle distribution by considering both the motion prediction and its uncertainty. Instead of estimating motion directly from the DNN prediction, we use the high-level representation for more general performance. Moreover, for robust pose estimation, a multihypothesis back-end optimization strategy is designed to overcome local optima in optimization. The scalability of our method is guaranteed by the improved generalization, which meets the requirements of many applications.
With the robustness introduced by our method, higher visual tracking accuracy than existing work is achieved. The experiments, built on ORB-SLAM, show the advantages of our proposed multihypothesis optimization and DNN-enabled particle filter, with average accuracy improved by 13.3%. In the future, we plan to extend our work to unmanned drones and other VO or VSLAM systems that require more degrees of freedom and more generalized performance. The given parameters may also be tuned by learning methods in future work.

Data Availability
Because of the confidentiality requirements of the college, the data cannot be made public.

Conflicts of Interest
The authors declare that they have no conflicts of interest.