Relative Pose Estimation Algorithm with Gyroscope Sensor

This paper proposes a novel vision and inertial fusion algorithm, S²fM (Simplified Structure from Motion), for camera relative pose estimation. Unlike existing algorithms, our algorithm estimates the rotation parameter and the translation parameter separately. S²fM employs gyroscopes to estimate the camera rotation parameter, which is then fused with the image data to estimate the camera translation parameter. Our contributions are twofold. (1) Given that no inertial sensor can estimate the translation parameter accurately enough, we propose a translation estimation algorithm that fuses gyroscope sensor data and image data. (2) Our S²fM algorithm is efficient and suitable for smart devices. Experimental results validate the efficiency of the proposed S²fM algorithm.


Introduction
Camera relative pose estimation (CPE) is the estimation of the camera extrinsic parameters, that is, the camera 3D rotation parameter and 3D translation parameter. It is one of the key issues in computer vision and is widely applied in 3D scene reconstruction, augmented reality, panorama, and digital video stabilization solutions.
The traditional solutions to the CPE problem are based on image processing techniques, that is, the visual methods. These solutions usually first extract feature correspondences between frame pairs and then model the CPE problem as linear equations under the epipolar geometry constraint. In this way the CPE problem is transformed into an optimization problem. Hartley [1] proved the feasibility of using 8 pairs of feature correspondences to handle the CPE problem and proposed the 8-point (8 pt) algorithm to solve the CPE problem for uncalibrated cameras. After that, in order to find simpler solutions, researchers proposed the 7 pt algorithm [2], 6 pt algorithms [3, 4], and 5 pt algorithms [4-6]. These are mature traditional solutions based on image processing techniques, with accurate estimation results but complex calculation and slow computing speed. With the fast-growing employment of MEMS sensors in smart devices, inertial-based solutions for the CPE problem have been tried recently. These solutions [7, 8] usually first perform CPE by visual and inertial methods individually and then adopt a data filter to fuse the two results in order to obtain a more reliable estimate. The two individual algorithms complement each other and improve the robustness of CPE. The disadvantage of this approach is that it needs additional time to fuse the two results, which reduces efficiency.
The above analysis shows that visual solutions for CPE are mature and accurate but computationally complex, whereas inertial solutions have been tried but without satisfactory results. Considering that rotation can be estimated quickly and accurately by gyroscopes but no proper sensors can estimate translation accurately enough for CPE applications, this paper proposes a visual and inertial fusion solution, S²fM (Simplified Structure from Motion). S²fM divides the CPE problem into two parts: the rotation estimation part and the translation estimation part. It first employs the gyroscope sensor to estimate the rotational information and then fuses the estimated rotational information with image data to estimate camera translation. Our solution relies on both gyroscope sensor data and image data, but there might be a time delay between them, so a calibration algorithm is necessary to align the gyroscope data and image data.
The camera focal length is also estimated in the calibration algorithm, which further simplifies the visual algorithm for translation estimation. Since the calibration needs to be done only once for each device, the main cost of our solution lies in the visual algorithm stage, which has been simplified to deal with only 3 feature pairs. Our main contributions are twofold. (1) Given that no inertial sensor can estimate the translation parameter accurately enough, we propose a translation estimation algorithm that fuses gyroscope sensor data and image data. (2) Our S²fM algorithm is efficient and suitable for smart devices.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 describes the proposed solution. Section 4 presents the experimental results, and Section 5 draws the conclusion.

Related Work
Generally, CPE solutions can be classified into two major groups.
The first group of solutions are the traditional solutions. These solutions model the CPE problem as a linear estimation problem based on image feature correspondences under two-view geometry (mostly adopted) or multiview geometry [2]. A fundamental matrix is determined by the feature correspondences and can then be decomposed to give the relative camera orientation and translation. Thus, the CPE problem is transformed into the fundamental matrix estimation and decomposition problem. The fundamental matrix decomposition problem is called the minimal problem in computer vision, whose solutions are divided into two categories: one for calibrated cameras and the other for uncalibrated cameras. The essential issue in the minimal problem is how many correspondence points the solution needs at least. For uncalibrated cameras, the solutions include the 8 pt (point) algorithm, the 7 pt algorithm, and the 6 pt algorithm. For calibrated cameras, the solution is the 5 pt algorithm, since the number of relative pose parameters is 5, that is, 3 for rotation and 2 for translation (up to an unknown scale factor). Hartley proved the validity of the 8 pt algorithm [1] in 1997, in which the correspondence problem is supposed to have been solved. After the 8 pt algorithm, in order to find simpler algorithms for uncalibrated cameras, researchers tried to add constraints to the formulated equations and proposed the 7 pt and 6 pt algorithms. In 2003, Hartley and Zisserman proposed the 7 pt algorithm [2], which added the constraint that the fundamental matrix and the essential matrix are singular matrices. In 2005, Stewénius et al. proposed a 6 pt algorithm [3] and in 2012 Kukelova et al. proposed a 6 pt algorithm based on polynomial eigenvalues [4]. 5 pt algorithms for calibrated cameras include Nistér's 5 pt algorithm [5] in 2004, Li and Hartley's 5 pt algorithm [6] in 2006, and Kukelova's polynomial-eigenvalue-based 5 pt algorithm [4] in 2012. These widely used traditional visual solutions rely on image correspondence points, which may contain error and noise. The RANSAC [9] method is usually introduced to reduce such error and noise. Brückner et al. [10] compared the traditional solutions above. The advantage of these traditional solutions is that they can generate accurate results; the disadvantage is that they are computationally complex: the more correspondence points the algorithm needs, the slower it runs.
Another group of solutions for CPE are the inertial-based solutions, which were not proposed until MEMS sensors became accurate enough. In 2008, Gaida et al. [7] introduced a multisensor framework that combines gyroscopes, accelerometers, and magnetometers as a unit to estimate camera pose. A visual method is then adopted to estimate camera pose as well. Finally, an extended Kalman filter is adopted to fuse the two results to obtain the final pose. One disadvantage of using accelerometers for translation estimation is that translation measurements from accelerometers are significantly less accurate than orientation measurements [11-13]. This is because gyroscope data need to be integrated only once to obtain the camera's orientation, whereas accelerometer data need to be integrated twice to obtain the camera's translation, which amplifies the noise and significantly degrades the accuracy. Miyano et al. [8] proposed an inertial and visual combination solution. It uses an acceleration sensor and a magnetic sensor to roughly estimate a camera pose and then searches for the accurate pose by matching a captured image with a set of reference images. Corke et al. [14] surveyed inertial and vision fusion solutions. These fusion solutions usually first perform CPE with separate inertial-based and visual-based solutions, generating respective results, and then fuse them by data filters. This is cooperation between inertial and visual methods. Its advantage lies in robustness, because the two methods can complement each other. Its disadvantage is the slow computing speed caused by the fusion process.
This paper proposes an inertial and visual fusion solution called S²fM for CPE. Unlike existing fusion solutions, which fuse inertial data and visual data in a cooperation manner, our solution fuses them in a division manner: it divides the CPE problem into a rotation part and a translation part. Our solution first estimates camera rotation by gyroscopes and then uses it as a known parameter in the visual method to estimate camera translation. Since the reliability and efficiency of gyroscopes for rotation estimation have been proven [7, 8, 11-14], they can significantly simplify the visual solution for camera translation estimation. As we will derive in the next section, only 3 pairs of correspondence points are needed for translation estimation. Unlike Hartley and Nistér, who made great efforts to find the solution of the established equation sets, our focus is on proposing an inertial and visual fusion solution that solves the CPE problem efficiently, given that no inertial sensor can estimate the translation parameter accurately enough.

Proposed Solution
This section describes our proposed solution which is under the pinhole camera model and consists of three steps: camera and gyroscope calibration, estimation of camera rotation, and estimation of camera translation.

Camera and Gyroscope Calibration.
Our solution first calibrates the camera and gyroscope. The calibration covers the following:
(1) Gyroscope noise processing
(2) Camera focal length $f$ calibration (in pixel units)
(3) The delay $t_d$ between the gyroscope and frame sample timestamps

Gyroscope Noise Processing.
Raw MEMS gyroscope data need to be processed to remove zero-drift and random noise. We use the general statistical method to remove zero-drift: the device is kept in a static position for a period of time to gather statistics of the zero-drift, which is then subtracted from the source data to leave a stable, zero-mean, normally distributed random noise. This random noise is then modeled through a time sequence method and suppressed by a Kalman filter to give usable gyroscope data.
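The two noise-processing steps can be sketched as follows. This is a minimal illustration for a single gyroscope axis, not the paper's implementation: it assumes a scalar Kalman filter with a random-walk state model in place of the time sequence model, and the function names and the variances `q` and `r` are hypothetical choices made for the example.

```python
import random

def estimate_zero_drift(static_samples):
    """Mean of the readings recorded while the device is at rest."""
    return sum(static_samples) / len(static_samples)

def kalman_smooth(samples, q=1e-6, r=1e-4):
    """Scalar Kalman filter with a random-walk state model.

    q: assumed process-noise variance (illustrative value).
    r: assumed measurement-noise variance (illustrative value).
    """
    x, p = samples[0], 1.0
    smoothed = []
    for z in samples:
        p += q                # predict: uncertainty grows
        k = p / (p + r)       # Kalman gain
        x += k * (z - x)      # update with the new measurement
        p *= (1.0 - k)
        smoothed.append(x)
    return smoothed

# Synthetic single-axis data: constant rate plus bias plus Gaussian noise.
random.seed(0)
true_rate, bias, sigma = 0.5, 0.02, 0.01
static = [bias + random.gauss(0.0, sigma) for _ in range(2000)]
moving = [true_rate + bias + random.gauss(0.0, sigma) for _ in range(500)]

drift = estimate_zero_drift(static)
corrected = [w - drift for w in moving]   # zero-drift removed
filtered = kalman_smooth(corrected)       # random noise suppressed
print(round(drift, 3), round(filtered[-1], 2))
```

On a real device the static recording would come from the sensor log, and `q` and `r` would have to be tuned to the measured noise statistics.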

Calibrating Algorithm.
After noise processing, the gyroscope data can be used for calibration. The purpose of our calibrating algorithm is to calibrate the parameters $t_d$ (delay between the gyroscope and frame sample timestamps) and $f$ (camera focal length). We adopt a calibrating algorithm similar to that of Miyano et al. [8] under the camera rotation model (as Figure 1 shows) but with an optimized objective function.
As Figure 1 shows, the camera moves under the rotation model with its optical center unchanged [15]. A point $X$ in world coordinates and its projected image coordinate $x$ have the following mapping relation:
$$x = KX, \qquad X = \lambda K^{-1} x, \tag{1}$$
where
(1) $K$ is the camera intrinsic matrix,
$$K = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix}, \tag{2}$$
in which $f$ is the camera focal length to be recovered and $(u_0, v_0)$ is the projected point of the optical center, which is (0, 0) in the proposed solution, since we set the image center as the optical center;
(2) $\lambda$ is an unknown scaling factor, since the translation can only be determined up to scale in CPE.
Under this rotation model, the relationship between image points in a pair of frames for two different camera orientations (as Figure 1 shows) can be derived. For a scene point $X$, the projected points $x$ and $x'$ in the image planes of two different frames are
$$x = KRX, \qquad x' = KR'X, \tag{3}$$
where $R$ and $R'$ are 3 × 3 rotation matrices representing the rotational parameters at the two frame timestamps, which will be detailed in the next section. Rearranging (3) and substituting for $X$, we get a mapping of all points in one frame to another:
$$x' = K R' R^{T} K^{-1} x. \tag{4}$$
With these parameters, we formulate calibration as an optimization problem over $f$ and $t_d$, given as (5). Its objective is accumulated over all frames, indexed by the frame number, and over the features of each frame; the number of features may differ for each frame pair.
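The sketch below illustrates the rotation-model mapping and a calibration cost built on it. It assumes the mapping takes the standard form x' ~ K R' Rᵀ K⁻¹ x and uses a plain sum of squared pixel residuals as the cost, which is an assumption and not necessarily the paper's optimized objective. The delay t_d is left out: the rotations are assumed to be already sampled at the delay-corrected timestamps. All function names are hypothetical.

```python
import numpy as np

def intrinsic(f, u0=0.0, v0=0.0):
    """Pinhole intrinsic matrix K with focal length f (pixels)."""
    return np.array([[f, 0.0, u0], [0.0, f, v0], [0.0, 0.0, 1.0]])

def rot_y(angle):
    """Rotation by `angle` radians about the camera y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(K, R, X):
    """Project 3D points X (N, 3) to pixels with x ~ K R X."""
    p = X @ (K @ R).T
    return p[:, :2] / p[:, 2:3]

def warp_points(pts, K, R, R_prime):
    """Map pixels of the first frame into the second: x' ~ K R' R^T K^-1 x."""
    W = K @ R_prime @ R.T @ np.linalg.inv(K)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ W.T
    return mapped[:, :2] / mapped[:, 2:3]

def calibration_cost(f, pts, pts_prime, R, R_prime):
    """Sum of squared pixel residuals for a candidate focal length f."""
    predicted = warp_points(pts, intrinsic(f), R, R_prime)
    return float(np.sum((predicted - pts_prime) ** 2))

# Synthetic pure-rotation data generated with a known focal length.
f_true = 800.0
X = np.array([[1.0, 0.5, 5.0], [-1.0, 1.0, 6.0],
              [0.5, -1.0, 4.0], [-0.5, -0.5, 7.0]])
R, R_prime = np.eye(3), rot_y(0.05)
pts = project(intrinsic(f_true), R, X)
pts_prime = project(intrinsic(f_true), R_prime, X)

# Coarse grid search over candidate focal lengths.
candidates = [600.0, 700.0, 800.0, 900.0, 1000.0]
costs = [calibration_cost(f, pts, pts_prime, R, R_prime) for f in candidates]
best_f = candidates[int(np.argmin(costs))]
print(best_f)
```

A grid search is used only to keep the example short; a full calibration would search jointly over the focal length and the delay with a proper optimizer.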

Estimation of Rotation.
After calibration, gyroscope data can be used to estimate the rotation of the device accurately. The gyroscope outputs the angular velocity about every axis, so the rotation angle about each axis can be calculated by integrating the angular velocity. For any axis of the gyroscope, let $\omega_k$ be its $k$th angular velocity sample and let $\Delta t_k$ be the corresponding sampling time; then the angular value $\Delta\Theta$ from moment $a$ to moment $b$ can be obtained as shown in the following equation:
$$\Delta\Theta = \sum_{k=a}^{b} \omega_k \, \Delta t_k. \tag{6}$$
Let $r = (x, y, z)$ be the rotation values of the three gyroscope axes, where $x$, $y$, and $z$ can be calculated by (6). Let $\Theta = \sqrt{x^2 + y^2 + z^2}$ be the total rotation value; then the rotation matrix $R$ can be given by the Rodrigues formula, as shown in the following equation:
$$R = I + \frac{\sin\Theta}{\Theta}\,[r]_\times + \frac{1 - \cos\Theta}{\Theta^2}\,[r]_\times^2, \tag{7}$$
where $I$ is the identity matrix and $[r]_\times$ is the cross product matrix of $r$:
$$[r]_\times = \begin{bmatrix} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{bmatrix}. \tag{8}$$
Finally, we obtain $R \in SO(3)$, which is an orthogonal matrix representing the rotation of the device.
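A minimal sketch of this step: per-axis integration of the angular velocities followed by the Rodrigues formula for an unnormalized rotation vector. The function names are hypothetical, and the synthetic input (a constant rate about the z-axis sampled at 100 Hz) is chosen only so that the result is easy to check.

```python
import numpy as np

def integrate_gyro(omega, dt):
    """Sum angular velocity times sampling time for each axis.

    omega: (N, 3) angular velocities in rad/s.
    dt:    (N,) sampling times in seconds.
    Returns the rotation values r = (x, y, z) in radians.
    """
    omega = np.asarray(omega, dtype=float)
    dt = np.asarray(dt, dtype=float)
    return (omega * dt[:, None]).sum(axis=0)

def cross_matrix(r):
    """Cross product matrix [r]x of a 3-vector r."""
    x, y, z = r
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def rodrigues(r):
    """Rotation matrix for the rotation vector r (angle = norm of r)."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)      # no measurable rotation
    rx = cross_matrix(r)
    return (np.eye(3)
            + np.sin(theta) / theta * rx
            + (1.0 - np.cos(theta)) / theta ** 2 * (rx @ rx))

# One second of data at 100 Hz, rotating at pi/2 rad/s about the z-axis.
omega = np.tile([0.0, 0.0, np.pi / 2], (100, 1))
dt = np.full(100, 0.01)
r = integrate_gyro(omega, dt)
R = rodrigues(r)
print(np.round(R, 6))
```

Summing the per-axis angles treats the motion between two frames as a single axis-angle rotation, which is exact for rotation about a fixed axis and an approximation otherwise; the approximation improves as the interval between frames shrinks.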

Estimation of Translation.
The obtained rotation matrix $R$ is used as a known parameter in the estimation of translation based on the visual method to generate the final camera pose. Similar to the traditional method, we formulate linear equations using correspondence points and solve them to get the camera translation.

Formulation of Linear Equations.
As shown in the two-view geometry of the proposed S²fM algorithm (Figure 2), given a scene point $X_0$ in Euclidean space, its projections $x$ and $x'$ in the camera coordinate systems $C$ and $C'$ are related through the rigid motion between the two views, where $R$ is the rotation matrix (3 × 3) and $t$ is the translation 3-vector. Define $[t]_\times$ as the cross product matrix of $t$.
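For orientation, one standard route from the two-view relation to a linear system in the translation can be written out as follows. This is a sketch under assumed conventions (camera-frame points related by $X' = RX + t$, and normalized image coordinates $\hat{x} = K^{-1}x$); the symbols $\hat{x}$, $a_i$, and $A$ are introduced here for illustration only, and the paper's own intermediate equations may use different notation.

```latex
\begin{aligned}
\hat{x}'^{\top} [t]_\times R\, \hat{x} &= 0
  && \text{(epipolar constraint with } E = [t]_\times R\text{)} \\
t^{\top} \bigl( (R\hat{x}) \times \hat{x}' \bigr) &= 0
  && \text{(scalar triple product identity)} \\
A\,t &= 0, \quad
  A = \begin{bmatrix} a_1^{\top} \\ a_2^{\top} \\ a_3^{\top} \end{bmatrix}, \quad
  a_i = (R\hat{x}_i) \times \hat{x}'_i
  && \text{(one row per correspondence)}
\end{aligned}
```

Because $R$ and $K$ are known, each correspondence contributes a single equation that is linear in the three components of $t$, which is consistent with the 3 feature pairs the solution requires.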

Solution of Linear Equations.
The inputs of (17) are the camera intrinsic matrix $K$, the rotation matrix $R$, and the feature correspondences $(\tilde{u}, \tilde{v})$ and $(\tilde{u}', \tilde{v}')$. This is a typical ternary homogeneous linear equation set, and we need only 3 pairs of feature correspondences to solve it ($t$ can also be estimated with 2 pairs of feature correspondences by enforcing a constraint that reduces the number of degrees of freedom to 2). We extract SIFT correspondence features [16] between frames and introduce the RANSAC algorithm [9] to remove mismatches and noise. Then the singular value decomposition (SVD) method is employed to solve (17). Since the essential matrix is defined up to scale, the solved translation vector $t$ is also defined up to scale; that is, $\|t\| = 1$.
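The SVD step can be sketched as follows. The example assumes the linear system takes the form A t = 0 with one row (R x̂) × x̂' per correspondence, where x̂ = K⁻¹x; this is a standard formulation under the convention X' = RX + t and is not necessarily identical to (17). Feature matching and RANSAC are omitted, and all function names are hypothetical.

```python
import numpy as np

def rot_y(angle):
    """Rotation by `angle` radians about the camera y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(K, X):
    """Project camera-frame points X (N, 3) to pixel coordinates."""
    p = X @ K.T
    return p[:, :2] / p[:, 2:3]

def estimate_translation(pts, pts_prime, K, R):
    """Unit translation (up to sign) from N >= 3 pixel correspondences.

    Builds one row (R x_hat) x x_hat' per correspondence and returns the
    right singular vector of the smallest singular value.
    """
    K_inv = np.linalg.inv(K)
    ones = np.ones((len(pts), 1))
    x_hat = np.hstack([pts, ones]) @ K_inv.T
    x_hat_prime = np.hstack([pts_prime, ones]) @ K_inv.T
    A = np.cross(x_hat @ R.T, x_hat_prime)
    _, _, vt = np.linalg.svd(A)
    t = vt[-1]
    return t / np.linalg.norm(t)

# Synthetic scene: known rotation (as if read from the gyroscope),
# known unit translation, and three scene points.
K = np.array([[800.0, 0.0, 0.0], [0.0, 800.0, 0.0], [0.0, 0.0, 1.0]])
R = rot_y(0.1)
t_true = np.array([0.6, 0.0, 0.8])
X = np.array([[1.0, 0.5, 5.0], [-1.0, 1.0, 6.0], [0.5, -1.0, 4.0]])

pts = project(K, X)
pts_prime = project(K, X @ R.T + t_true)
t_est = estimate_translation(pts, pts_prime, K, R)
print(abs(float(t_est @ t_true)))
```

The estimate matches the true direction only up to sign, which is why a separate sign check is needed afterwards.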
Two solutions are still possible due to the arbitrary choice of sign for the translation $t$. The correct one can be easily determined by ensuring that the reconstructed points lie in front of the camera [5], as shown in Figure 3. One pair of feature correspondences is enough to determine the sign of $t$. First, select a random pair of feature correspondences and a random sign of $t$; then apply the 3D reconstruction solution by Hartley and Sturm [17] to reconstruct its 3D space coordinate $X_0$. If $X_0$ lies in front of the cameras, as Figure 3(a) shows, the selected sign of $t$ is already correct; otherwise, invert the sign of $t$, as shown in Figure 3(b).
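The sign test can be sketched as follows. Note that this example swaps in a simple least-squares depth solve for the Hartley and Sturm triangulation [17] used in the paper, and it assumes the convention X' = RX + t with normalized image coordinates; the function name is hypothetical.

```python
import numpy as np

def rot_y(angle):
    """Rotation by `angle` radians about the camera y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def fix_translation_sign(t, x_hat, x_hat_prime, R):
    """Return t or -t so that the point lies in front of both cameras.

    Solves depth * (R x_hat) - depth' * x_hat' = -t for the two depths
    in the least-squares sense and checks that both are positive.
    """
    A = np.column_stack([R @ x_hat, -x_hat_prime])
    depths, *_ = np.linalg.lstsq(A, -t, rcond=None)
    return t if np.all(depths > 0) else -t

# One synthetic correspondence in normalized image coordinates.
R = rot_y(0.1)
t_true = np.array([0.6, 0.0, 0.8])
X = np.array([1.0, 0.5, 5.0])
x_hat = X / X[2]
X_prime = R @ X + t_true
x_hat_prime = X_prime / X_prime[2]

# Start from the wrong sign; the check should flip it back.
t_fixed = fix_translation_sign(-t_true, x_hat, x_hat_prime, R)
print(t_fixed)
```

With noisy data a single correspondence can give a misleading answer, so in practice one might repeat the check over several inliers and take the majority.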

Experimental Results
Our experimental device is a Lenovo Vibe Z with a built-in ST L3GD20 three-axis gyroscope. The gyroscope runs at a frequency of 100 Hz.
We show the experimental results of the S²fM algorithm for CPE. To show the results explicitly, we draw the camera 3D motion computed by the proposed solution to check its validity. Then we compare our solution with the traditional solutions (the 8 pt, 6 pt, and 5 pt algorithms; the 7 pt algorithm performs similarly to the 8 pt algorithm, so we omit it) and the inertial-based solution proposed by Gaida et al. [7]. First of all, however, we show the calibrating result for the device and gyroscope used in our experiments.

Calibrating Result.
We record a video of about 10 seconds while rotating the camera around one of its axes, record the gyroscope data of all three axes, and then run the calibrating algorithm. Under a pure rotation model, the frame translation motion can be estimated both from SIFT features and from gyroscope data. After calibration, the frame motions estimated from the gyroscope should align with those estimated from SIFT features. As the results in Figures 4 and 5 show, the best parameters align the two motion results well.
In order to show the results more explicitly, the calibration error for every frame is computed, as shown in Figure 6. The average calibration error is 0.0024 pixels and the absolute average calibration error is 0.75 pixels per frame. There are some error peaks in the calibration because the motion is not a pure rotation (some translation exists), but this does not affect the calibrating results.

CPE Results.
We write the letters L, E, N, O, V, and O and the word Lenovo in the air with our experimental device and run the S²fM algorithm to estimate the camera pose. Figure 7 shows the results, where every point represents one estimated camera pose and the red one is the start position.

Evaluation.
We compare our solution with the 8 pt algorithm [1], the 6 pt algorithm [3], the 5 pt algorithm [6], and the fusion method by Gaida et al. [7]. In order to make the experiments as complete as possible, we design six basic camera motion scenes: (1) Left-right translation. All possible camera motions can be expressed as a combination of the six basic motions above. To suit different application precisions, different error thresholds (2.0, 1.0, and 0.5 pixels) are set in the linear square solution.

Accuracy.
The symmetric squared geometric error is introduced to measure the reprojection error [10]. It is computed from the feature correspondences $x$ and $x'$ and the fundamental matrix $F$. The reprojection errors of all six motion scenes with different error thresholds are shown in Figure 8. The result for the fusion method [7] is the one given by its authors in their paper.
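A sketch of how such an error can be computed, assuming the usual definition as the sum of the squared distances from each point to the epipolar line induced by its partner in the other image. The exact expression used in the paper is not reproduced here, and the function names are hypothetical. The fundamental matrix is built from a known synthetic pose under the convention X' = RX + t.

```python
import numpy as np

def cross_matrix(t):
    """Cross product matrix [t]x of a 3-vector t."""
    x, y, z = t
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def symmetric_squared_geometric_error(F, x, x_prime):
    """Squared point-to-epipolar-line distance, summed over both images.

    x, x_prime: pixel coordinates (2,) of one correspondence.
    """
    x_h = np.append(x, 1.0)
    xp_h = np.append(x_prime, 1.0)
    line_prime = F @ x_h          # epipolar line of x in the second image
    line = F.T @ xp_h             # epipolar line of x' in the first image
    residual = float(xp_h @ F @ x_h)
    return (residual ** 2 / (line_prime[0] ** 2 + line_prime[1] ** 2)
            + residual ** 2 / (line[0] ** 2 + line[1] ** 2))

# Synthetic pose and one scene point.
K = np.array([[800.0, 0.0, 0.0], [0.0, 800.0, 0.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t = np.array([0.6, 0.0, 0.8])
K_inv = np.linalg.inv(K)
F = K_inv.T @ cross_matrix(t) @ R @ K_inv

X = np.array([1.0, 0.5, 5.0])
X_prime = R @ X + t
x = (K @ X)[:2] / X[2]
x_prime = (K @ X_prime)[:2] / X_prime[2]

err_exact = symmetric_squared_geometric_error(F, x, x_prime)
err_shifted = symmetric_squared_geometric_error(F, x, x_prime + np.array([0.0, 3.0]))
print(err_exact, err_shifted)
```

The error is in squared pixels: it is numerically zero for the exact correspondence and grows once the second point is shifted off its epipolar line.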
We compute the average reprojection errors of the 6 basic scenes under different RANSAC thresholds, as shown in Figure 9 and Table 1. The results show that, in our experiments, Nistér's 5 pt algorithm generates the minimal reprojection error and our solution performs similarly to the 6 pt algorithm.

Efficiency.
We analyze the efficiency of S²fM both in theory and in practice.
(1) Theoretical Analysis. Since the camera rotation can be read out from the gyroscope in real time, the main cost of S²fM lies in translation estimation by the visual method. The time complexity of the traditional algorithms can be analyzed by the number of feature pairs needed in solving the linear equations. In our experiments, we adopt the SVD algorithm to solve the linear equations, whose time complexity is $O(n^3)$. As a result, the time complexity ratio of solving one linear equation set for CPE would be as shown in Table 2 (with S²fM set to be 1).
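As a worked illustration of this cubic scaling, the ratios can be computed directly. The sketch assumes the cost is exactly proportional to n³ with n the number of feature pairs, ignoring constant factors; the values in Table 2 remain the reference, and the labels below are illustrative.

```python
# Feature pairs needed by each algorithm.
pairs = {"S2fM": 3, "5 pt": 5, "6 pt": 6, "8 pt": 8}

# Cost ratio relative to S2fM under the assumed n^3 model.
ratios = {name: (n / pairs["S2fM"]) ** 3 for name, n in pairs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Under this assumption the 5 pt, 6 pt, and 8 pt solvers cost about 4.63, 8, and 18.96 times as much per equation set as the 3-pair solver.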
(2) Practical Analysis. In practice, all of the algorithms (S²fM, 8 pt, 6 pt, and 5 pt) need to adopt the RANSAC algorithm to estimate the optimal solution of the linear equations, and the number of iterations will not be the same. However, for each algorithm, we set the upper limit of RANSAC iterations to 128 in our experiments and then compute the average frame-computing time over all 11715 experiments above with the same SIFT features and the same running environment. With this many experiments, the average value should be sufficient to compare the performance of the algorithms, as shown in Table 3.
As seen in Table 3, the practical efficiency result generally tallies with the theoretical analysis. Another limitation is that our solution is ideally supposed to perform similarly to the 5 pt algorithm, because the two are the most similar in estimating translation, but the experimental results show that it is less accurate than the 5 pt algorithm. The reason that S²fM does not achieve the ideal results lies in the focal length calibrating error and the drift of the gyroscope. A more accurate calibrating algorithm would be of great help.
Moreover, in our experiments, we choose to use 3 pairs of feature correspondences to estimate the camera translation, but $t$ can also be estimated with 2 pairs of feature correspondences by enforcing a constraint that reduces the number of degrees of freedom to 2. However, the purpose of this paper is to provide a fusion solution for CPE, not a mathematical technique for solving the equation set.

Conclusion
Traditionally, CPE has been formulated as the problem of estimating the optimal camera pose given a set of point correspondences. This is the vision-based method. The first 8-point solver was proposed in the 1990s, and 7-point, 6-point, and 5-point solvers were proposed later in the 2000s, all of which are based on the feature correspondence problem as reviewed in the literature. These solvers all focus on the mathematical technique of solving the formulated optimal estimation problem. In recent years, as MEMS sensors have become accurate enough and popular in many handheld devices, MEMS-based methods have appeared in CPE-related problems. Generally, sensor data are coupled with image processing data through a data filter, for example, a Kalman filter, for robust camera pose estimation. The drawback of this method is the computational complexity caused by the additional data filtering process. This paper proposes a camera pose estimation algorithm with a gyroscope sensor, which estimates camera rotation with the built-in gyroscope of the device and then fuses it with the image data to estimate camera translation. The proposed fusion solution differs from existing fusion solutions in the manner in which inertial and visual data are fused. We compare our solution with both the traditional solutions and an existing fusion solution, and the experimental results validate the efficiency of our solution. As for accuracy, our solution performs similarly to the 6 pt algorithm and can be further improved with better focal length calibration and drift compensation techniques.
Given that no proper MEMS sensor can estimate translation accurately enough, S²fM provides a way to fuse the inertial data with visual data to solve the problem.

Figure 3: Two possible signs of $t$.

Table 2: Time complexity ratio of solving one linear equation set for CPE in theory.

Table 3: Average frame-computing time for CPE in practice.

One main limitation of the proposed solution lies in the drift of the gyroscope. After running for a long period of time, the gyroscope data may drift and reduce the estimation accuracy. This can be compensated by timely calibrations, for example, assisted by other inertial sensors, to remain drift-free.