Mobile Robot Localization Based on Vision and Multisensor

To deal with the low accuracy of positioning for mobile robots when only using visual sensors and an IMU, a method based on tight coupling and nonlinear optimization is proposed to obtain a high-precision visual positioning scheme by combining measured value of the preintegrated inertial measurement unit (IMU) and values of the odometer and characteristic observations. First, the preprocessing part of the observation data includes tracking of the image data and the odometer data, and preintegration of IMU data. Second, the initialization part of the above three sensors includes IMU preintegration, odometer preintegration, and gyroscope bias calculation. It also includes the alignment of speed, gravity, and scale. Finally, a local BA (bundle adjustment) joint optimization and global graph optimization are established, so as to obtain more accurate positioning results.


Introduction
Positioning technology is undoubtedly the basic premise for a wide range of applications such as robot navigation, autonomous driving, and virtual reality. e visual positioning system has won the attention of researchers because of its small size, low cost, and simple hardware setup [1][2][3][4]. In recent years, the visual SLAM (simultaneous location and mapping) has experienced rapid development, among which the more classic methods include the PATM [5] algorithm proposed by Klein et al. in 2007 and its successor ORB-SLAM [6]. Other notable methods are LSD-SLAM [7] for the monocular vision direct method, SVO [8] based on the sparse method, DSO [9] and DVO-SLAM [10], etc. Since the discussion cannot be fully discussed, this section discusses the method of multisensor fusion localization, skipping the above method of localization using only visual sensors. Due to a large amount of information measured by the camera and the low cost, the application on mobile robots is very extensive. erefore, the following mainly introduces the positioning method combining the camera and other sensors.
e way the camera is integrated with other sensors can be divided into loose coupling [11,12] and tight coupling. Among them, the loose-coupling method is used to use the camera and other sensors as separate modules to calculate the respective pose. e fusion information methods generally include Kalman filter (KF), extended Kalman filter (EKF), and other methods, while other sensor information is used for state propagation. e tight-coupling methods are mainly divided into two kinds: EKF [13,14] and graph-based optimization [15,16]. A widely concerned VIO method based on EKF is the MSCKF [13] algorithm proposed by MARS Laboratory of the University of Minnesota. It can marginalize old map points and postures to ensure accuracy and computational efficiency under the condition that the number of map points and postures in the state vector is constant. However, with the development of the optimization technology in SLAM, VINS based on graph optimization has also been studied more and more. e limitation of the filtering algorithm is analyzed in reference [17], and the method based on optimization is superior to the method based on the filter. e ICE-BA algorithm proposed in reference [18] improves the BA optimization part of VI-SLAM, changes the incremental nature of SLAM measurement, resolves the direct conflict between sliding window edge deviation and global closed-loop detection constraints, and improves the computational efficiency. In [8], the IMU preintegration technology was extended to Lie algebra in 2015 and implemented in the open-source factor graph optimization library GTSAM. Based on this, [19,20] designed the VINS-MONO algorithm and obtained the experimental results that the dynamic performance of continuous-time preintegration is better than the discrete preintegration. However, there are many differences in the adaptability of sensors on different platforms. In the mobile robot, because the robot is mostly in the plane uniform motion [4], the IMU cannot be fully excited, which causes numerical stability problems during the initialization of the scale. erefore, the applicability of the IMU is not as good as the mileage. At the same time, external calibration and initialization issues between sensors pose significant challenges for highly nonlinear visual-inertial systems. ese problems will affect the positioning accuracy and robustness of mobile robots. To solve these problems, a tightly coupled and optimization-based approach is used to fuse monocular vision sensor, IMU, and wheeled odometer to achieve accurate pose estimation in this paper. e main work of this paper is shown as follows: (1) In this paper, the odometer-inertial navigation system is used to estimate the scale to solve the problem that the absolute scale cannot be obtained by location algorithm using monocular camera and scale stability cannot be guaranteed. (2) A BA tight-coupling optimization model is established, including marginalization (removal of pose and feature point constraints from the sliding window), IMU residual between adjacent frames in the sliding window, visual reprojection error, and measurement error of odometer. Finally, accurate positioning results are obtained by minimizing the error cost function.

System Composition and Coordinate System Definition.
In Figure 1, the system starts with data preprocessing (Section 2.2), including image feature extraction and tracking, preintegration of IMU measurements between two consecutive frames, and calculation of pose measurement by two consecutive interframe odometers. In the second part (Section 3), the initialization process provides the contents including attitude, velocity, gravity vector, gyroscope bias, and three-dimensional feature location for subsequent nonlinear optimization. e third part is local BA joint optimization (Section 4) and global map optimization, including a sliding window BA optimization model; the fourth part is loop detection, eliminating drift and ensuring map consistency. e main application method is DOW2 [21], so it will not be discussed in this paper. e relevant coordinate system in this paper is shown in Figure 2, which defines the coordinate system and symbols used in this paper. Among them, (oxyz) w represents the world coordinate system, (oxyz) c represents the camera coordinate system, (oxyz) w represents the IMU coordinate system, (oxyz) d represents the odometer coordinate system, and (oxyz) b represents the robot coordinate system, which is defined as the same as the IMU coordinate system. e rotation matrix R and quaternion q are used simultaneously to represent the rotation. q w b and p w b represent the rotation and translation from the robot coordinate system to the world coordinate system. b j represents the coordinate system of the robot when the j frame image is obtained. c j represents the camera coordinate system when the j frame image is obtained. ⊗ denotes multiplication between quaternions. (• ) denotes the measured or estimated value of a specific quantity.

Feature Point Tracking.
e tracking and matching of feature points in monocular vision are used to inherit or generate new feature points from the previous frame to the next frame. e specific process is shown as follows: firstly, the strongest f corner features are found on the first frame image, and the nonmaximum suppression radius is set at the same time. en, when the next frame image arrives, these feature points are tracked by the optical flow method [22], and the matching points are found in the next frame image. e matching points in the front and back frames are corrected to get the final corrected position. Based on the corrected position, the F-matrix and RANSAC algorithm are used to remove the outliers. Finally, the strongest new feature points are searched in the area beyond the matched feature points, and the number of feature points is complemented to f, which ensures that the matching number of feature points is consistent with that of the next frame. e value of f is generally in the range of 100-300, and the specific content is referenced in [23].

Inertial Navigation Preintegration.
e original accelerometer and gyroscope can measure the acceleration a and angular velocity ω of the robot. It is assumed that the additional noise in the accelerometer and gyroscope is the Gaussian white noise n a ∼ N(0, σ 2 a ) and n ω ∼ N(0, σ 2 ω ). e bias of accelerometer and gyroscope is assumed to be a random walk model, whose derivatives obey Gaussian . e measurement results are illustrated as follows: From the above formula, the position, velocity, and direction of the next moment can be calculated, and the time interval between the two moments is given as Δt � [j, j + 1]: e reference coordinate system is transformed from the world coordinate system to the robot coordinate system. Formula (2) is multiplied by the transformation matrix R b j w from the world coordinate system to the b j frame coordinate system. By using b j as the reference frame of a given bias, the preintegration term can be obtained only by the IMU measurement, and the covariance matrix Q b j b j+1 of α, β, and q propagates accordingly. en, when the IMU bias changes slightly, the first-order approximation estimation of the bias state is adjusted based on α, β, and q, and the resulting preintegral is expressed as e preintegration method does not need to repeat IMU measurements, and it saves a lot of computing time and resources for subsequent graph-based optimization algorithms.

Odometer Preintegration.
e wheel odometer can measure the linear velocity v d and the angular velocity ω d of the robot. It is assumed that the noise between the measured value and the true value is Gaussian noise, n � [n v , n g ]. e measurement model is given as follows:  z c Considering the problem of slipping off the wheeled robot, the angular velocity measured by the gyroscope can be used as the acceleration of the robot. Referring to the IMU preintegration algorithm, the preintegration is calculated for the odometer measurement and the angular velocity measured by the gyroscope; the preintegration of the velocity is simpler, and there is no offsetting effect in the acceleration. e odometer preintegration results are shown as follows: Similar to IMU preintegral, the preintegral measurement is obtained by separating the initial value and the constant term of the integral. When the gyroscope bias changes slightly, the approximation of the previous moment is used to update α d j d j+1 and q d j d j+1 , instead of iteratively calculating the preintegration measurements:

System Initialization State Estimation
e initial value of the system is obtained by a loosely coupled method. Firstly, the structure from motion (SFM) is used to estimate the camera pose and feature 3D position. en, the SFM results are aligned with IMU preintegration and odometer preintegration, respectively, to correct the gyroscope bias. At the same time, the corresponding velocity of each frame, gravity vector direction, and scale factor are solved. It should be noted that the scale is the final result obtained by weighting the scale estimated by the two alignment processes according to the dynamic weight method.

Vision-Only Initialization Based on the Sliding Window.
e monocular initialization uses SFM to estimate the camera scale pose and feature point spatial position of each frame within the sliding window. Firstly, the feature matching between the latest frame and any other frame in the window is checked. If a stable feature tracking and sufficient parallax are found, the scale and pose between the two frames are restored by the five-point method [24]. en, the feature triangulation and the reprojection error are performed to be minimized. Otherwise, the current latest frame is saved, continued to wait for the new frame. e pose (p c 0 c j , q c 0 c j ) of all frames in the sliding window is obtained by the above procedure. It is assumed that the external parameter between camera and IMU is (p b c , q b c ), the external parameter between camera and the odometer is (p d c , q d c ). e pose is converted to the robot coordinate system and odometer coordinate system, respectively. e attitude is converted to the IMU coordinate system as Similarly, the conversion to the odometer coordinate system is shown as where s 1 and s 2 both represent the scale information of the visual measurement and are used for the subsequent dynamic weighting to solve the final scale parameter s.

Vision and IMU Preintegral
where B represents all keyframes in the sliding window, where After the initial calibration value of the gyroscope bias m ω is obtained, the IMU preintegration is recalculated based on the new gyroscope bias.

Velocity, Gravity Vector, and Scale Initialization.
After the gyro bias is initialized, the gyro measurement is regarded as accurate and continues to be used to initialize other variables to be optimized: where v b j b j is the velocity of the j frame image robot coordinate system, g c 0 is the gravity vector in the c 0 coordinate system, and s 1 is based on the monocular scale obtained by the IMU preintegration. e world coordinate system w in the preintegral terms α  Journal of Robotics where By solving the linear least-squares problem, According to the above formulas, χ I can be optimized, including all keyframe velocities, gravity vectors, and monocular scales in sliding windows.

Vision and Odometer Preintegral Alignment.
Since the gyro measurement is used for the rotation measurement of the odometer preintegration, the gyro offset after calibration can be used directly. e variables to be optimized are defined as Similarly, the odometer preconcentration term is transferred to the c 0 coordinate system and converted into the form of Hx � b: By solving the linear least-squares problem, e coordinate system speed of odometer in each frame in the window and the parameters of the monocular scale can be obtained.

Dynamic Weighted Scale Parameter.
e idea of dynamic weighting method is used to design a sliding window with a size of 5 keyframes based on the original sliding window of the system and reoperate the sliding window of the system with a size of 10 keyframes. erefore, the keyframe B i∈ 1,2,3,4,5 { } is selected from the sliding window, where 1-5 frames within the window are included in B 1 , 2-6 frames within the window are included in B 2 , and so on, and B 5 is derived. e scale parameters s 1,i and s 2,i at B i are calculated according to Sections 3.2 and 3.3, respectively, and the minimum mean square errors e 1,i and e 2,i are obtained. e scale parameter s � W 1 s 1 + W 2 s 2 is calculated by dynamically adjusting the weights W 1 and W 2 according to the above parameters, wherein 2 n�1 W n � 1. e initial values of W 1 and W 2 are both 0.5 as shown in Figure 3.

State Variables and Cost Functions.
To improve computational efficiency, a sliding window is used to achieve graph optimization in this paper, so the state variables defining the sliding window optimization at time j are shown as χ � x 0 , x 1 , . . . , x n , λ 0 , λ 1 , . . . , λ k , where x i represents the position, velocity, the orientation of the robot in the global coordinate system, and the bias of the IMU. λ k represents the inverse depth of the kth feature point. n and k represent the number of keyframes and the total number of 3D feature points, respectively. In the sliding window, the residual of all measurement models is minimized to optimize the state variables: where r B (z    Journal of Robotics where e above formula is a three-dimensional coordinate form of the reprojection equation, and the optimization variable is based on the estimated value and the reprojection error in the normalized camera coordinate system, including the state quantities [p w b i , q w b i ] and [p w b j , q w b j ] and the inverse depth λ l at two moments. To minimize the reprojection error, it is necessary to optimize the above state quantities, and the corresponding Jacobian matrix is obtained as shown in equations (23)-(25).

IMU Measurement Residual.
According to the IMU measurement in the two consecutive frames b i and b i+1 and the content in formula (4), the residual of the IMU is obtained as follows: Here, are the measurements obtained after the corrected bias. δθ b i b i+1 is a three-dimensional error state representation of the quaternion, and [q] xyz represents the imaginary part of the extracted quaternion q. In the graph optimization, the state quantities x i and x i+1 need to be optimized, and the corresponding Jacobian matrix is obtained as shown in equations (27)-(29).

Odometer Measurement Residual.
It is assumed that the axle is on the central axis of the robot. e amount of displacement measured by the odometer can be used to obtain the displacement of the center of the robot. erefore, the odometer measurement residual is defined as In the graph optimization model, we need to optimize the states x i and x i+1 under two consecutive frames b i and b i+1 and obtain the corresponding Jacobian matrices such as formulas (31) and (32): where q � [w, x, y, z] � [w, s],

Experimental Result Analysis
e experimental scene in this paper is the indoor environment with a Vicon motion capture system, and the robot platform is Turtlebot3 mobile robot with a camera named Xiaomi. e robot moves in the area of 3m × 3m, and the average speed of the whole process is 0.2m/s. In the experiment, the position and posture of the robot acquired 8 Journal of Robotics by the motion capture system in the motion capture coordinate system are taken as the true value of the motion trajectory.

Scale Error Analysis.
e experiment mainly verifies the effects of the initialization state estimation algorithm and the dynamic weighting result proposed in Section 3.4. Compared with the initialization using only IMU, in this paper, the combination of visual, IMU, and odometer can achieve more accurate scale and convergence time. e scale estimation results and time are shown in Table 1, and the root mean square error (RMSE) is used to quantitatively represent the accuracy of the algorithm.
It can be seen from the results in Table 1 that the initialization state estimation algorithm of this paper reduces the RMSE of the scale value by 72.2% and the convergence time RMSE by 62.9% compared with the amplification using only the IMU for initialization. erefore, the initialization state estimation algorithm proposed has obvious improvement in the accuracy and stability of scale estimation.

Pose Error Analysis.
e algorithm proposed in this paper is an improvement based on VINS_MONO, so the experimental results are compared with the VINS_MONO. e trajectory of mono_imu in the figure represents the result of the VINS_MONO [25] algorithm running on the robot platform. e trajectory and true value comparison of the two algorithms are shown in Figures 4 and 5. e dotted line in the figure is the real trajectory obtained by the motion capture system, and the solid line is the trajectory obtained by the algorithm. As can be seen from the figure, due to the odometer constraint and more accurate initialization scale added to the algorithm in this paper, the accuracy in the positioning process of mobile robots is better than that of VINS_MONO.
To quantitatively analyze the accuracy of the algorithm, the accuracy is measured by estimating the absolute position error (APE) obtained by comparing the pose result with the true motion pose of each time step. e pose and APE and the rotational APE of the algorithm proposed in this paper and the monocular inertial navigation algorithm are shown in Figure 6, respectively. e overall pose APE obtained by the two algorithms is shown in Table 2.
By comparing the two methods, it is not difficult to see that the proposed algorithm is superior to the monocular inertial fusion algorithm in accuracy, which indicates that the odometer information has a certain degree of improvement on the positioning result. Moreover, the better planar constraint of the odometer makes the algorithm consistently stable at 0.03 m in the z w direction, while the monocular inertial fusion algorithm has a certain degree of trajectory divergence in the z w direction. Since the center of gravity of the rigid body model built in the Vicon system is 0.03 m, the true value keeps near 0.03 m in the z w direction. e results are shown in Figure 7.

Conclusions
In this paper, aiming at the problem of low accuracy and robustness of the monocular inertial navigation algorithm in the pose estimation of mobile robots, a multisensor fusion positioning system is designed, including monocular vision, IMU, and odometer, which realizes the initial state estimation of monocular vision and the fusion location of the tightly coupled visual-inertial odometer. e method of odometer preintegration and dynamic weighted scale parameters, as well as the back-end tightly coupled pose optimization method based on graph optimization, is proposed, which greatly improves the accuracy and robustness of the positioning system. en, the effectiveness and feasibility of the proposed algorithm is verified by theoretical derivation and practical experiments.
Although the research in this paper has achieved certain results, there are still some aspects that need to be improved. First, the results are only tested in the laboratory or a small number of complex scenarios, and it cannot fully explain the ability to work robustly and stably for a long time in a complex environment. Second, binocular cameras are superior to monocular cameras in terms of depth calculation and scale estimation. In subsequent solutions, multisensor fusion schemes can be considered based on binocular vision and contribute to future map-based visual navigation studies.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.