Fast Monocular Visual-Inertial Initialization with an Improved Iterative Strategy

The initialization process strongly affects the performance of a monocular visual-inertial simultaneous localization and mapping (VI-SLAM) system. The initial estimation is usually solved by least squares, for example with the Gauss-Newton (G-N) algorithm, but a large iteration increment can lead to slow convergence or even divergence. To solve this problem, an improved iterative strategy for the initial estimation is proposed. Our initialization proceeds in four steps. First, the pure visual ORB-SLAM model is used to make all variables observable. Second, IMU preintegration is adopted for IMU-camera frequency alignment at the same time as key frame generation. Third, an improved iterative strategy based on the trust region is introduced for gyroscope bias estimation, and the gravity direction is refined. Finally, the accelerometer bias and visual scale are estimated on the basis of the previous estimates. Experimental results on public datasets show that the initial values converge faster, and that the velocity and pose of the sensor suite are estimated more accurately, than with the original method.


Introduction
With the development of artificial intelligence (AI), monocular visual-inertial simultaneous localization and mapping (VI-SLAM) has become an active research topic in the robotics and computer vision communities. Camera images contain rich information about the surrounding environment, which can be used to estimate the camera poses as well as to reconstruct sparse or dense maps. IMUs provide measurements of angular velocity and local linear acceleration, which can be used to estimate the rigid-body motion over a short period. These complementary features make the visual-inertial combination suitable for many applications such as autonomous or semiautonomous driving [1,2], unmanned aerial robots [3,4], augmented reality (AR) [5,6], and 3D reconstruction [7,8]. At present, tightly coupled nonlinear optimization is widely applied to visual/visual-inertial SLAM, for example in ORB-SLAM and ORB-SLAM 2/3 [9][10][11], OKVIS [12], VINS-MONO/VINS-FUSION [13,14], VI-DSO [15], and VI-ORBSLAM [16]. The estimation of initial values has a great effect on these VI-SLAM systems. Specifically, the estimation of the IMU's initial values is of great significance to the initialization process; once these parameters are obtained, the measurements of the inertial measurement unit (IMU) can be used to improve the accuracy and robustness of continuous tracking, as well as to recover the metric scale of the three-dimensional (3D) visual map.
(i) The joint visual-inertial initialization method was pioneered by Martinelli [17]; it assumes that all features are correctly tracked in all frames, so spurious tracks lead to poor real-time performance. It was later improved by Kaiser et al. at the cost of a small loss of precision [18]. The method in [19] suffers from a low initialization recall; it works at only twenty percent of the trajectory points, which can be a problem for robotic applications in which the system needs to be launched immediately. (ii) The disjoint method was first introduced by Mur-Artal and Tardós [16] and later adapted, with good performance, by Qin and Shen and by Yang and Shen [20,21]. In both cases, the IMU parameters are estimated in separate steps by solving a series of linear formulas with least-squares methods such as Gauss-Newton (G-N) and Levenberg-Marquardt (L-M) [22,23]. The G-N method is built on the Newton method. It has a high local convergence rate, but several limitations remain, such as the large iteration increment, which may slow the global convergence [24,25]. In later research, Levenberg [26] and Marquardt [27] suggested a damped G-N method, in which the size and direction of the iterative step are influenced by a damping parameter. This removes the need for a specific line search while still guaranteeing global convergence [28]. However, a set of more complex equations must be solved several times at each iteration, which may reduce the solution speed. To sum up, the speed of initialization convergence is important for a VI-SLAM system. The contribution of this paper is an improved iterative strategy, based on the trust region method, that speeds up the initial estimation.
The flowchart of our initialization method is shown in Figure 1. It can be divided into four steps. First, the pure visual ORB-SLAM model is adopted to make all variables observable in the preliminary stage. Second, IMU preintegration is adopted for IMU-camera frequency alignment at the same time as key frame generation. Third, an iterative strategy based on the trust region is introduced for the gyroscope bias estimation, and the gravity direction is refined. Finally, the accelerometer bias and visual scale are estimated on the basis of the previous estimates. In the experiments, qualitative and quantitative analyses on the public datasets [29] demonstrate the improvement. The results show that the initial values converge faster, and that the velocity and poses of the sensor suite are estimated more accurately, than with the original method. The remainder of this paper is organized as follows: Section 2 describes the process of IMU initial estimation. The improved iterative strategy is described in Section 3. The experiments are described in Section 4. The conclusion is drawn in Section 5.

2. IMU Initial Estimation
In this section, the initial parameters of the IMU are estimated. The relationship between the IMU body frame {B} and the camera frame {C} is defined with the scale factor s taken into account:

R_WB = R_WC R_CB,  P_WB = s P_WC + R_WC P_CB,    (1)

where R and P represent the rotation matrices and translation vectors, respectively.
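As a quick illustration, the scale-aware camera-to-IMU position transform can be evaluated numerically. This is a minimal sketch; the function name and the numbers are ours, not from the paper:

```python
import numpy as np

def camera_to_imu_position(P_WC, R_WC, P_CB, s):
    """Scale-aware transform of a camera position into the IMU body frame:
    P_WB = s * P_WC + R_WC @ P_CB (the orientation uses R_WB = R_WC R_CB)."""
    return s * P_WC + R_WC @ P_CB

# Hypothetical numbers for illustration only.
P_WC = np.array([1.0, 2.0, 0.5])   # up-to-scale camera position in {W}
R_WC = np.eye(3)                   # camera orientation in {W}
P_CB = np.array([0.1, 0.0, 0.0])   # camera-to-IMU lever arm (calibration)
s = 2.0                            # metric scale factor
P_WB = camera_to_imu_position(P_WC, R_WC, P_CB, s)
```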
The IMU preintegration technology [30,31] is adopted for IMU-camera frequency alignment. In order to make all variables observable, the pure monocular visual SLAM system [9,10] is launched to work a few seconds for generating key frames. The detailed process of the IMU parameter estimation is described as follows.
2.1. Estimating Gyroscope Bias. The gyroscope bias can be estimated from the known orientations of consecutive key frames. First, we assume that the change of the bias is negligible, i.e., the bias b_g is a small constant. Then, the difference between the preintegrated gyroscope measurements and the orientations estimated by the pure visual SLAM is minimized:

b_g* = argmin_{b_g} Σ_{i=1}^{N−1} || Log( (ΔR_{i,i+1} Exp(J^g_ΔR b_g))^T (R^i_WB)^T R^{i+1}_WB ) ||²,    (2)

where N denotes the total number of key frames, R^(·)_WB = R^(·)_WC R_CB is computed from the orientation R^(·)_WC estimated by the visual SLAM and the calibration R_CB, ΔR_{i,i+1} ≈ (R^i_WB)^T R^{i+1}_WB denotes the preintegrated gyroscope measurement between two consecutive key frames, Exp(·) denotes the exponential map R = Exp(φ) with the rotation matrix R ∈ SO(3) and the corresponding vector φ ∈ so(3), Log(·) denotes its inverse, and J^g_ΔR represents the Jacobian of the preintegrated rotation with respect to the gyroscope bias. Analytic Jacobian matrices for a similar expression can be found in [31].
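The gyroscope-bias minimization can be sketched as a small Gauss-Newton loop on SO(3) residuals. The sketch below uses a finite-difference Jacobian for brevity (the paper and [31] use analytic Jacobians); the toy data, the identity preintegration Jacobians, and the function names are our assumptions:

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def Exp(phi):
    """SO(3) exponential map (Rodrigues formula)."""
    t = np.linalg.norm(phi)
    if t < 1e-10:
        return np.eye(3) + hat(phi)
    a = phi / t
    return (np.cos(t) * np.eye(3) + np.sin(t) * hat(a)
            + (1.0 - np.cos(t)) * np.outer(a, a))

def Log(R):
    """SO(3) logarithm map (angle-axis vector of a rotation matrix)."""
    c = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    t = np.arccos(c)
    if t < 1e-10:
        return np.zeros(3)
    w = (R - R.T) / (2.0 * np.sin(t))
    return t * np.array([w[2, 1], w[0, 2], w[1, 0]])

def estimate_gyro_bias(dR_meas, R_rel, J_g, iters=10):
    """Gauss-Newton on r_i(b) = Log((dR_i Exp(J_i b))^T R_rel_i),
    with a finite-difference Jacobian for brevity."""
    def residuals(b):
        return np.concatenate([Log((dR @ Exp(J @ b)).T @ Rr)
                               for dR, Rr, J in zip(dR_meas, R_rel, J_g)])
    b = np.zeros(3)
    for _ in range(iters):
        r = residuals(b)
        Jn = np.column_stack([(residuals(b + 1e-6 * e) - r) / 1e-6
                              for e in np.eye(3)])
        b = b - np.linalg.solve(Jn.T @ Jn, Jn.T @ r)
    return b

# Toy check with synthetic data and a known bias (identity Jacobians assumed).
rng = np.random.default_rng(0)
b_true = np.array([0.01, -0.02, 0.03])
dR_meas = [Exp(0.1 * rng.normal(size=3)) for _ in range(5)]
J_g = [np.eye(3)] * 5
R_rel = [dR @ Exp(b_true) for dR in dR_meas]
b_est = estimate_gyro_bias(dR_meas, R_rel, J_g)
```

With the synthetic measurements constructed from a known bias, the recovered `b_est` matches `b_true`.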

2.2. Estimating Gravity Direction.
The gravity direction has a great influence on the estimation of the acceleration, so it should be refined before the accelerometer bias and scale are estimated. In particular, a new constraint, the gravity magnitude G (G ≈ 9.8 m/s²), is introduced. In the frame {W}, the gravity direction can be computed as

ĝ_W = g*_W / ||g*_W||,    (3)

where g*_W is the current gravity estimate. With the gravity direction in the inertial frame {I} denoted ĝ_I = (0, 0, −1), the rotation R_WI can be calculated from the angle θ between the two direction vectors:

R_WI = Exp(v θ),  v = (ĝ_I × ĝ_W) / ||ĝ_I × ĝ_W||,  θ = atan2(||ĝ_I × ĝ_W||, ĝ_I · ĝ_W),    (4)

and the gravity vector can be expressed as

g_W = R_WI ĝ_I G.    (5)

Algorithm 1: Updating the trust region radius Δ_k.
  Inputs: Δ_k, ω_k, γ_k. Output: Δ_{k+1}.
  Step 1: If ω_k > μ_2, then put: Δ_{k+1} ← max(Δ_k, c_2 ||γ_k||).
  Step 2: Else if ω_k < μ_1, then put: Δ_{k+1} ← c_1 Δ_k.
  Otherwise, Δ_{k+1} ← Δ_k.

where R_WI can be parameterized by just two angles, about the x and y axes of frame {I}, since a rotation about the z axis has no effect on g_W.
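The axis-angle construction of R_WI can be sketched directly. The function name and the sample gravity estimate are illustrative, and the degenerate case where ĝ_I and ĝ_W are parallel is not handled:

```python
import numpy as np

def gravity_alignment(g_W_est, G=9.8):
    """Rotation R_WI mapping the inertial gravity direction g_I = (0,0,-1)
    onto the estimated direction, plus the refined gravity vector
    g_W = R_WI g_I G. The parallel case (zero cross product) is not handled."""
    g_I = np.array([0.0, 0.0, -1.0])
    g_W_hat = g_W_est / np.linalg.norm(g_W_est)
    cross = np.cross(g_I, g_W_hat)
    v = cross / np.linalg.norm(cross)                         # rotation axis
    theta = np.arctan2(np.linalg.norm(cross), g_I @ g_W_hat)  # rotation angle
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    # Rodrigues formula for Exp(v * theta).
    R_WI = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    return R_WI, R_WI @ g_I * G

# Illustrative (made-up) raw gravity estimate, slightly off-vertical.
R_WI, g_W = gravity_alignment(np.array([0.3, -0.2, -9.7]))
```

By construction, R_WI is a proper rotation and g_W points along the estimated direction with magnitude G.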

2.3. Estimating Accelerometer Bias and Scale.
Once an accurate gyroscope bias and gravity vector are obtained, the positions, velocities, and rotations can be calculated by integration. Considering the influence of the accelerometer bias, R_WI is also adjusted by a perturbation δθ with two degrees of freedom, δθ = [δθ_xy^T, 0]^T, so that Equation (5) can be rewritten as

g_W = R_WI Exp(δθ) ĝ_I G ≈ R_WI ĝ_I G − R_WI (ĝ_I)^∧ δθ G.    (6)

Considering the effect of the accelerometer bias b_a as well, the preintegrated position relation becomes

P^{i+1}_WB = P^i_WB + v^i_WB Δt_{i,i+1} + (1/2) g_W Δt²_{i,i+1} + R^i_WB (Δp_{i,i+1} + J^a_Δp b_a).    (7)

The velocities can be eliminated when the constraints between three consecutive key frames are taken into account, which gives the linear relationship

[λ(i)  φ(i)  ζ(i)] [s, δθ_xy^T, b_a^T]^T = ψ(i),    (8)

where λ(i), φ(i), ζ(i), and ψ(i) are parameterized from the key frame positions, the preintegrated terms, and the gravity terms; their full expressions are given in [16], and [·]_(:,1:2) denotes the first two columns of a matrix. Stacking all relations between three consecutive key frames in (8), a linear system A_{3(N−2)×6} X_{6×1} = B_{3(N−2)×1} is formed, which can be solved by the singular value decomposition (SVD) method. The system contains 3(N − 2) equations and 6 unknown variables, so at least 4 key frames are needed to solve it.
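The SVD solve of the stacked system can be sketched generically. The synthetic A below stands in for the real λ/φ/ζ blocks, whose expressions are in [16]; all numbers are made up:

```python
import numpy as np

def solve_scale_and_bias(A, B):
    """Least-squares solution of A x = B via SVD,
    x = V diag(1/sigma) U^T B, with x = [s, dtheta_xy, b_a]."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ ((U.T @ B) / sigma)

# Synthetic, well-conditioned stand-in for the real lambda/phi/zeta blocks.
rng = np.random.default_rng(1)
N = 6                                   # key frames -> 3*(N-2) = 12 equations
A = rng.normal(size=(3 * (N - 2), 6))
x_true = np.array([2.0, 0.01, -0.02, 0.05, -0.03, 0.08])  # [s, dθ_xy, b_a]
B = A @ x_true
x_est = solve_scale_and_bias(A, B)
```

With N = 6 key frames the system has 12 equations for 6 unknowns, matching the requirement of at least 4 key frames.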

3. Improved Iterative Strategy
As Equation (2) is a typical nonlinear least-squares problem, a common solver is the G-N algorithm adopted in [16]. The G-N method provides a high local convergence rate, but a large iteration increment can lead to slow global convergence or even divergence. To tackle this problem, an improved iterative strategy is introduced. In particular, our method is a trust region-based method that combines the steepest descent and G-N algorithms; the trust region model is shown in Figure 2. Similar to the L-M algorithm, the objective function is approximated by a model within the trust region, and the solution is taken as the minimum of the model function around the current point. The following minimization subproblem is then solved at each iteration step:

min_γ η_k(γ) = f_k + g_k^T γ + (1/2) γ^T B_k γ,  subject to ||γ|| ≤ Δ_k,    (10)

where f_k = f(x_k) is the objective function value, η_k(γ) is the approximate model of f around the current point x_k, g_k = ∇f(x_k) is the gradient of f at x_k, B_k is an approximation of the Hessian (J^T J in the least-squares case), Δ_k represents the radius of the trust region, and ||·|| denotes the Euclidean norm. At each iteration step, the radius of the trust region is adjusted according to the gain ratio ω_k:

ω_k = (f(x_k) − f(x_k + γ_k)) / (η_k(0) − η_k(γ_k)),    (11)

where the numerator f(x_k) − f(x_k + γ_k) and the denominator η_k(0) − η_k(γ_k) are the actual and predicted reduction values, respectively. The steps of updating Δ_k are described in Algorithm 1.
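The gain ratio ω_k and the radius update of Algorithm 1 can be sketched as follows; the thresholds μ_1, μ_2 and the factors c_1, c_2 are illustrative values of our choosing, not the paper's:

```python
import numpy as np

# Illustrative thresholds/factors (mu_1, mu_2, c_1, c_2 are our assumptions).
MU_1, MU_2 = 0.25, 0.75
C_1, C_2 = 0.5, 2.0

def gain_ratio(f, model, x, step):
    """omega_k = actual reduction / predicted reduction."""
    actual = f(x) - f(x + step)
    predicted = model(np.zeros_like(step)) - model(step)
    return actual / predicted

def update_radius(delta, omega, step):
    """Algorithm 1: enlarge the radius on a good gain ratio, shrink it on a
    poor one, otherwise keep it unchanged."""
    if omega > MU_2:
        return max(delta, C_2 * np.linalg.norm(step))
    if omega < MU_1:
        return C_1 * delta
    return delta
```

A gain ratio near 1 means the model predicted the reduction well, so the region can grow; a small or negative ratio means the model is untrustworthy at this radius.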
In this work, the sufficient reduction condition for global convergence requires that the reduction achieved by the obtained step is at least equal to the reduction of the Cauchy step [32], which satisfies

η_k(0) − η_k(γ_k) ≥ (1/2) ||g_k|| min(Δ_k, ||g_k|| / ||J_x^T J_x||),    (12)

where J_x represents the Jacobian matrix; it contains the gradients of the residuals, which can be found in [22].

Assuming that γ*_k(Δ) is the solution of Equation (10), and since our method combines the steepest descent and G-N algorithms, γ*_k(Δ) is obtained by selecting between the two steps γ^cd_k and γ^gn_k:

γ^cd_k = −(g_k^T g_k / (g_k^T B_k g_k)) g_k,  γ^gn_k = −B_k^{−1} g_k,    (13)

where γ^cd_k is the step of the steepest descent algorithm (the Cauchy step) and γ^gn_k is the step of the G-N algorithm. According to Equation (13), two cases occur for γ^cd_k and γ^gn_k:

(1) The G-N point is outside the trust region. The path of our iterative strategy then consists of two segments: the first is the line segment from the current point to γ^cd_k, and the second is the line segment extending from γ^cd_k to γ^gn_k. Specifically, it can be formulated by γ_k(β), β ∈ [0, 2]:

γ_k(β) = β γ^cd_k for 0 ≤ β ≤ 1,  γ_k(β) = γ^cd_k + (β − 1)(γ^gn_k − γ^cd_k) for 1 ≤ β ≤ 2,    (14)

where β is computed from the boundary condition

||γ_k(β)|| = Δ_k.    (15)

(2) The G-N point is inside the trust region. In this case, the G-N step γ^gn_k is taken directly.

Thus, the solution of our method is obtained on this approximate path as the minimum point of the model function; the process is shown in Algorithm 2.
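The two-case selection between the Cauchy and G-N steps is the classical dogleg construction; a minimal numpy sketch, with B_k taken as J^T J and names of our choosing:

```python
import numpy as np

def dogleg_step(g, J, delta):
    """Classical dogleg selection between the Cauchy (steepest-descent)
    step and the Gauss-Newton step inside a trust region of radius delta,
    with B = J^T J."""
    B = J.T @ J
    gn = -np.linalg.solve(B, g)             # Gauss-Newton step
    if np.linalg.norm(gn) <= delta:         # case (2): G-N point inside
        return gn
    cd = -((g @ g) / (g @ (B @ g))) * g     # Cauchy step
    if np.linalg.norm(cd) >= delta:         # even the Cauchy point is outside
        return -(delta / np.linalg.norm(g)) * g
    # Case (1): follow the segment from the Cauchy point toward the G-N
    # point until it crosses the boundary: ||cd + t (gn - cd)|| = delta.
    d = gn - cd
    a, b, c = d @ d, 2.0 * (cd @ d), cd @ cd - delta ** 2
    t = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return cd + t * d
```

When the G-N point lies inside the region the full G-N step is taken; otherwise the returned step lies exactly on the trust-region boundary.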

4. Experiments
In this section, the proposed initialization algorithm is integrated into the VI-ORBSLAM framework [16], and several tests are evaluated on the EuRoC dataset [29]. To demonstrate its performance, we compare our method with the original VI-ORBSLAM, VINS-MONO, and the monocular version of the ORB-SLAM3 framework. The EuRoC dataset, shown in Figure 3, consists of machine hall scenes and a man-made laboratory room. According to texture, illumination, motion blur, and fast/slow motions, the sequences are classified into easy, medium, and difficult sets [33]. The stereo cameras (denoted CAM0 and CAM1) and the IMU are logged at 20 Hz and 200 Hz, respectively, with hardware time synchronization. Besides accurate ground truth for the moving trajectories, the dataset also provides the accelerometer and gyroscope biases and the velocities of the IMU body. In this work, we use only the monocular (left camera) images and the IMU measurements. All experiments are carried out on a computer with 16 GB RAM and an i7-9700 CPU (8 cores @ 3.00 GHz), running Ubuntu 18.04 with ROS Melodic. For convenience of analysis, the parameter convergence, velocity estimation, and localization accuracy are evaluated on the V2_01_easy sequence of the EuRoC dataset, in which the accelerometer and gyroscope biases are approximately [-0.0236, 0.1210, 0.0748] m/s² and [-0.0023, 0.0250, 0.0817] rad/s, respectively. The sparse mapping results on the V2_01_easy sequence, together with the key frame trajectory, are shown in Figure 4. The convergence of the initial parameters, the velocity estimation errors, and the key frame absolute trajectory error (ATE) over all 11 sequences are also analyzed, as detailed below.

4.1. Initialization Parameter Convergence.
The time-varying curves of the initial estimates, i.e., the accelerometer bias b_a, gyroscope bias b_g, gravity g_W, and visual scale s, are shown in Figure 5. We compare the convergence of the two optimization-based methods: the Gauss-Newton method and the proposed method. All parameters converge successfully with both methods; the Gauss-Newton- (G-N-) based method converges within 9 seconds, while our method converges within 6 seconds. In particular, as shown in Figures 5(a) and 5(c), the curves of the accelerometer bias and gravity oscillate severely within the first 4 seconds. This is because the MAV platform does not excite the sensor suite sufficiently in the stationary and slight-disturbance stage, which makes the accelerometer bias and the gravity vector hard to distinguish. Besides, the estimate of the gyroscope bias converges to a stable value in a very short time (within 2 seconds for ours), as shown in Figure 5(b).
This confirms that the proposed iterative strategy described in Section 3 performs well. The estimate of the visual scale is plotted in Figure 5(d). It should be noted that the optimal value of the scale factor is obtained in advance by manually scaling and aligning the estimated IMU body trajectories to the ground-truth trajectories. The result shows that our method also converges to the optimal value faster than the G-N-based method.

4.2. Velocity Estimation.
The time curves of the estimated velocities in the x, y, and z directions are shown in Figure 6, which compares VI-ORBSLAM (red line), the ground truth (blue line), and our proposed algorithm (green line) on the V2_01 sequence of the EuRoC dataset. Since the estimates and the ground truth are expressed in different reference frames, the estimated velocities are aligned with the ground truth in advance. The velocity estimation suffers different drifts in the x, y, and z directions, but the error of ours is smaller than that of VI-ORBSLAM. The velocity drift over the whole path is quantified in Table 1. Compared with the velocity drift of VI-ORBSLAM, [0.0293, 0.0165, 0.0501] m/s, the drift of ours is [0.0195, 0.0110, 0.0385] m/s in the x, y, and z axes; the accuracy of ours is thus improved by [33.45, 33.33, 23.15]%. The percentage improvement is calculated as |Error* − Error⊕|/|Error⊕| × 100%, where Error⊕ and Error* indicate the drift of VI-ORBSLAM and the drift of our method, respectively.

The three comparison curves of the APE results are given in Figure 9; our method performs best on the APE evaluation compared with VI-ORBSLAM and VINS-MONO. The comparisons of key frame trajectories with the ground truth on the V2_01 sequence are shown in Figure 10, where the ground truth is represented by a black dotted line, VI-ORBSLAM by a blue line, VINS-MONO by a green line, ORB-SLAM3 by a violet line, and our method by a red line. The quantitative comparison in terms of the max, mean, median, min, RMSE, SSE, and Std values is shown in Table 2.
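The per-axis percentage improvement formula can be checked against the Table 1 drift numbers; a minimal sketch (the function name is ours):

```python
import numpy as np

def improvement_percent(err_base, err_ours):
    """Per-axis improvement: |Error* - Error_base| / |Error_base| * 100%."""
    err_base = np.asarray(err_base)
    err_ours = np.asarray(err_ours)
    return np.abs(err_ours - err_base) / np.abs(err_base) * 100.0

imp = improvement_percent([0.0293, 0.0165, 0.0501],   # VI-ORBSLAM drift (m/s)
                          [0.0195, 0.0110, 0.0385])   # our drift (m/s)
# imp is approximately [33.45, 33.33, 23.15] percent, as reported.
```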
The quantitative root mean square error (RMSE) over the whole 11 sequences is also evaluated; the median values over 10 runs are reported in Table 3, and the corresponding boxplots are provided in Figure 11. It should be noted that the estimated poses are usually not in the same coordinate system as the ground truth, so an alignment step is needed. In practice, we calculate the transformation matrix S ∈ Sim(3) from the estimated poses to the ground truth; the APE of the ith frame is then defined as

E_i = Q_i^{−1} S P_i,

where P_i is the estimated pose and Q_i the corresponding ground-truth pose. The absolute translational and orientation RMSE are then calculated as

RMSE_trans = sqrt((1/N) Σ_{i=1}^{N} ||trans(E_i)||²),  RMSE_rot = sqrt((1/N) Σ_{i=1}^{N} ||rot(E_i)||²).

From Table 3, it can be seen that our method achieves the best accuracy on most sequences.
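The Sim(3) alignment followed by the translational RMSE can be sketched with Umeyama's closed-form method. This is a simplified position-only version (the full APE also evaluates rotation), and the function names are ours:

```python
import numpy as np

def umeyama_alignment(est, gt):
    """Similarity transform (s, R, t) in Sim(3) mapping estimated positions
    onto the ground truth (Umeyama's method); est/gt are N x 3 arrays."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G_ = est - mu_e, gt - mu_g
    C = G_.T @ E / len(est)                      # cross-covariance
    U, D, Vt = np.linalg.svd(C)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ape_rmse(est, gt):
    """Translational APE RMSE after Sim(3) alignment of the trajectory."""
    s, R, t = umeyama_alignment(est, gt)
    err = gt - (s * (R @ est.T).T + t)
    return np.sqrt((err ** 2).sum(1).mean())
```

For a trajectory that differs from the ground truth only by a similarity transform, the aligned RMSE is zero up to numerical precision.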

5. Conclusions
In this paper, the proposed initialization algorithm is built on the VI-ORBSLAM framework. To improve the quality of the initialization, an improved trust region-based iterative strategy is proposed. Our method has been verified on a public dataset: it converges faster in the initialization stage, and the velocity and pose are estimated more accurately than with the original method. At present, binocular VI-SLAM technology has greater practical value in industrial applications, where several key issues such as real-time computing and online recovery capabilities must be considered. In future work, we will try to apply the proposal to the binocular model, as well as to the positioning and mapping of unmanned vehicles in outdoor environments.

Data Availability
The datasets used in this article can be found at the following link: https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.