Monocular SLAM has attracted increasing attention recently because it is flexible and economical. In this paper, a novel metric online direct monocular SLAM approach is proposed, which can obtain a metric reconstruction of the scene. In the proposed approach, a chessboard is used to provide the initial depth map and scale correction information during the SLAM process. The chessboard provides the absolute scale of the scene and serves as a bridge between the camera coordinate system and the world coordinate system. The scene is reconstructed as a series of keyframes with their poses and associated semidense depth maps, using highly accurate pose estimation achieved by direct grid point-based alignment. The estimated pose is coupled with depth map estimation computed by filtering over a large number of pixelwise small-baseline stereo comparisons. In addition, this paper formulates the scale-drift model among keyframes, and the calibration chessboard is used to correct the accumulated pose error. Finally, several indoor experiments are conducted. The results suggest that the proposed approach achieves higher reconstruction accuracy than the traditional LSD-SLAM approach. The approach also runs in real time on a commonly used computer.

In the robotics community, simultaneous localization and mapping (SLAM) refers to building a map of the surroundings while determining the robot's own position, which is necessary for a robot to navigate autonomously in an unknown environment [

Traditional visual navigation usually uses a stereo vision system, which directly provides 3-dimensional information about the surroundings, and the camera position can be easily estimated from the disparity between two or more cameras. However, the accuracy of stereo visual navigation is limited by the baseline length. This limitation is especially critical in applications where the baseline is severely constrained, such as remote sensing and micro UAVs. Therefore, monocular visual navigation tends to be more general and more commonly used.

Generally speaking, there exist two classes of monocular SLAM: feature-based methods and direct methods. In feature-based approaches, including filtering-based [

This decoupling simplifies the overall problem at the cost of information loss, such as the information carried by curved edges. Such image information often makes up a large part of the image, especially in man-made environments, and is important for a robot to fulfill tasks like obstacle avoidance.

Direct visual odometry (VO) methods overcome this limitation by estimating the camera pose directly from the image intensities, so that all information contained in the image is used. Moreover, direct methods can exploit more geometric information about the environment, which helps achieve higher accuracy and robustness, especially in sparsely textured environments where few key points are available. The geometric information about the scene is valuable for robotics in many applications, such as augmented reality. Direct image alignment is well established for stereo sensors or RGB-D [

In both feature-based methods and direct methods, monocular SLAM can only reconstruct the scene up to an unknown scale factor. A further challenge is scale drift, which is regarded as the major source of accumulated error [

In this section, the mathematical definitions and notation used in this paper are introduced. In particular, we introduce the most widely used camera model, the pinhole camera model (Section

The most widely used camera model is the pinhole camera model which is shown in Figure

Camera perspective model.

As described in Figure
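The pinhole model can be illustrated with a short numerical sketch. The function names and the parameter names `fx`, `fy`, `cx`, `cy` (focal lengths and principal point in pixels) are illustrative assumptions rather than the notation of this paper, and lens distortion is ignored.

```python
import numpy as np

def project_points(points_cam, fx, fy, cx, cy):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates (pinhole, no distortion)."""
    points_cam = np.asarray(points_cam, dtype=float)
    z = points_cam[:, 2]
    if np.any(z <= 0):
        raise ValueError("all points must lie in front of the camera (z > 0)")
    u = fx * points_cam[:, 0] / z + cx
    v = fy * points_cam[:, 1] / z + cy
    return np.stack([u, v], axis=1)

def back_project(pixels, depths, fx, fy, cx, cy):
    """Lift Nx2 pixels with known depths back to Nx3 camera-frame points."""
    pixels = np.asarray(pixels, dtype=float)
    depths = np.asarray(depths, dtype=float)
    x = (pixels[:, 0] - cx) / fx * depths
    y = (pixels[:, 1] - cy) / fy * depths
    return np.stack([x, y, depths], axis=1)
```

Back-projection requires the depth of each pixel, which is exactly the quantity a monocular system has to estimate.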

As described in the introduction, a chessboard is used to initialize the depth map. During initialization, the relationship between the camera coordinate system and the world coordinate system is estimated. Herein, the world coordinate system is defined as follows: the origin is the top-left corner of the chessboard, the

World coordinate definition.

In this section, the representation of 3D pose transformation is described in the same way as in previous work, such as [

And 3D similarity transform

However, a nonredundant representation of the camera pose is needed during optimization, which the definition above cannot provide, so the corresponding element
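To make the redundancy concrete, the following sketch builds a 3D similarity transform as a 4x4 matrix with 16 entries, although it has only 7 degrees of freedom (3 for rotation, 3 for translation, 1 for scale). The helper names are hypothetical, and the matrix layout (scaled rotation in the top-left block, translation in the last column) is the usual convention, assumed here rather than taken from the paper's equations.

```python
import numpy as np

def sim3_matrix(scale, R, t):
    """Build the 4x4 matrix [[s*R, t], [0, 1]] of a 3D similarity transform."""
    S = np.eye(4)
    S[:3, :3] = scale * np.asarray(R, dtype=float)
    S[:3, 3] = np.asarray(t, dtype=float)
    return S

def sim3_inverse(S):
    """Invert a similarity transform in closed form."""
    sR = S[:3, :3]
    scale = np.cbrt(np.linalg.det(sR))  # det(s*R) = s^3 when R is a rotation
    R = sR / scale
    S_inv = np.eye(4)
    S_inv[:3, :3] = R.T / scale
    S_inv[:3, 3] = -(R.T @ S[:3, 3]) / scale
    return S_inv

def sim3_apply(S, points):
    """Apply the transform to Nx3 points: y = s * R @ x + t."""
    points = np.asarray(points, dtype=float)
    return points @ S[:3, :3].T + S[:3, 3]
```

An optimizer that updated all 16 matrix entries freely would leave the set of valid similarity transforms, which is why a minimal parametrization is preferred.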

The Gauss-Newton algorithm is effective for nonlinear least-squares problems, with the advantage of a low computational cost [

Beginning with an initial guess

To be robust to outliers arising, for example, from occlusions or reflections, an iteratively reweighted least-squares formulation is used, which can be expressed as [
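The idea of iteratively reweighted Gauss-Newton can be sketched on a toy curve-fitting problem. This is not the tracking code of the paper: the Huber weight function, the threshold `delta`, and the exponential model are assumptions chosen only to show how the weights are recomputed from the residuals before every Gauss-Newton step.

```python
import numpy as np

def huber_weights(residuals, delta):
    """Huber weights: 1 for small residuals, delta/|r| for large ones."""
    abs_r = np.abs(residuals)
    weights = np.ones_like(abs_r)
    large = abs_r > delta
    weights[large] = delta / abs_r[large]
    return weights

def gauss_newton_irls(residual_fn, jacobian_fn, x0, delta=0.1, max_iters=50, tol=1e-10):
    """Minimize a robust sum of squared residuals by reweighted Gauss-Newton."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iters):
        r = residual_fn(x)                 # residual vector, shape (n,)
        J = jacobian_fn(x)                 # Jacobian of r, shape (n, p)
        w = huber_weights(r, delta)        # weights from the current residuals
        JTW = J.T * w                      # J^T W with W = diag(w)
        step = np.linalg.solve(JTW @ J, -JTW @ r)
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x

# Toy problem (illustrative data): fit y = a * exp(b * x) with three gross outliers.
xs = np.linspace(0.0, 1.0, 40)
ys = 2.0 * np.exp(-1.5 * xs)
ys[[5, 20, 33]] += 3.0

def residual_fn(p):
    return p[0] * np.exp(p[1] * xs) - ys

def jacobian_fn(p):
    e = np.exp(p[1] * xs)
    return np.stack([e, p[0] * xs * e], axis=1)

estimate = gauss_newton_irls(residual_fn, jacobian_fn, x0=[1.5, -1.0])
```

Because the Huber weights only bound the influence of the outliers rather than removing it, the estimate stays close to, but not exactly at, the outlier-free parameters `(2.0, -1.5)`.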

The main contribution of this paper is a metric online monocular SLAM approach that uses a commonly available chessboard as a reference. The process can be divided into 5 parts. The chessboard provides the initial depth estimate of the scene, and this initial estimate serves as the metric scale source for the scene reconstruction. The initial depth estimate makes it possible to correct the scale drift as it propagates through the keyframes, and thus we can obtain a global metric reconstruction. The whole process of this approach is shown in Figure

Overview of the whole SLAM system.

In this section, we present our work in 4 parts: the initial depth estimation in Section

In the initialization process, unlike the traditional LSD-SLAM approach, we use a standard key point-based method to obtain the initial depth map with the aid of a calibration object, which in this paper is a commonly used chessboard. Note that any other reference object could also be used for this initial depth estimation.

During the calibration process, chessboard corner detection is performed first. With the known 2D coordinates of the chessboard corners and their corresponding 3D coordinates, the pose of the camera relative to the world coordinate system can be obtained, which is known as the PnP problem [

To run online, depth is estimated only for pixels with a sufficiently large intensity gradient, that is, pixels on corners or edges. We search for the corresponding points of those pixels along the epipolar lines in the second image using a window-based matching approach with a window size of 3 pixels. In addition, a parallax constraint and a sequence constraint are used to reduce mismatches.
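A minimal sketch of these two steps is given below. It assumes, for simplicity, a rectified image pair so that epipolar lines coincide with image rows. It also uses a sum-of-squared-differences cost and a maximum disparity as a stand-in for the parallax constraint. These choices and all function names are illustrative and are not confirmed by the text.

```python
import numpy as np

def high_gradient_mask(image, threshold):
    """Mark pixels whose intensity gradient magnitude exceeds the threshold."""
    image = np.asarray(image, dtype=float)
    gy, gx = np.gradient(image)            # derivatives along rows, then columns
    return np.hypot(gx, gy) > threshold

def match_along_row(left, right, row, col, max_disparity, half_window=1):
    """Find the column in `right` that best matches the window around (row, col) in `left`.

    The search runs along the same row (rectified pair assumed) and is limited
    to disparities in [0, max_disparity]. half_window=1 gives a 3x3 window.
    """
    hw = half_window
    patch = left[row - hw:row + hw + 1, col - hw:col + hw + 1]
    best_col, best_cost = None, np.inf
    for d in range(max_disparity + 1):
        c = col - d
        if c - hw < 0:                      # window would leave the image
            break
        candidate = right[row - hw:row + hw + 1, c - hw:c + hw + 1]
        cost = np.sum((patch - candidate) ** 2)
        if cost < best_cost:
            best_col, best_cost = c, cost
    return best_col, best_cost
```

Restricting the search to masked pixels and to a bounded disparity range keeps the cost low enough for online use.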

With the known intrinsic camera parameters, extrinsic parameters, and corresponding image point pairs, the initial depth map can be estimated:
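As an illustration of this step, the sketch below triangulates one point from a matched pixel pair with the standard linear (DLT) method. It is a generic textbook formulation under assumed conventions (`R`, `t` map world coordinates to camera coordinates, `K` is the intrinsic matrix) and may differ in form from the expression used in this paper.

```python
import numpy as np

def projection_matrix(K, R, t):
    """Build the 3x4 projection matrix P = K [R | t]."""
    return K @ np.hstack([R, np.reshape(t, (3, 1))])

def triangulate_point(P1, P2, pixel1, pixel2):
    """Linear triangulation of one 3D point from two views.

    Each view contributes two linear equations in the homogeneous point X;
    the solution is the right singular vector with the smallest singular value.
    """
    u1, v1 = pixel1
    u2, v2 = pixel2
    A = np.stack([
        u1 * P1[2] - P1[0],
        v1 * P1[2] - P1[1],
        u2 * P2[2] - P2[0],
        v2 * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

If the first camera frame is taken as the reference (`R` identity, `t` zero), the third coordinate of the triangulated point is the depth assigned to the pixel in the initial depth map.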

Pose tracking is the main recursive task in visual localization. The commonly used pose tracking method is image alignment, in which images sampled at successive time steps are used to determine the location of the moving camera. As in traditional monocular visual odometry work, such as [

The pose of a new frame is estimated by minimizing the variance-normalized photometric error:

To solve the scale-drift problem, direct

Here, a depth residual

The problem can also be solved with iteratively reweighted Gauss-Newton optimization, which is one of the most commonly adopted approaches for nonlinear optimization problems, as described in Section

In this section, we show how the external aid of a chessboard is used to correct the pose error and the accumulated error in the depth maps. This is the main contribution of our work. Our SLAM approach is designed for indoor robots, which means that the robot can see the calibration object more than once while moving. It is therefore practical to use the calibration object to correct the accumulated pose estimation error.

Detecting the chessboard in every frame is unnecessary and computationally expensive. We therefore give a criterion, based on the pose of the current frame, for deciding whether to run chessboard detection.

When all four corners of the chessboard can be observed, the whole chessboard is inside the field of view; this can be used to decide when to detect the chessboard. In the world coordinate system, the four corners of the chessboard are

The homogeneous image pixel coordinates can be expressed as

A chessboard detection is performed only under the condition that

Usually, successful chessboard detection is hard to achieve when the camera is too far away from the chessboard, even when the whole chessboard can be observed, so we add a further constraint to exclude this case:
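The gating rule described in this subsection can be sketched as follows. The exact conditions are assumptions made for illustration: the four outer corners must project inside the image with positive depth, and their mean distance to the camera must not exceed a threshold `max_distance`. The function name, the world-to-camera convention for `R` and `t`, and the use of the mean distance are hypothetical.

```python
import numpy as np

def should_detect_chessboard(corners_world, R, t, K, image_size, max_distance):
    """Decide from the current pose estimate whether chessboard detection is worth running.

    corners_world: 4x3 world coordinates of the outer chessboard corners.
    R, t: assumed world-to-camera rotation and translation of the current frame.
    image_size: (width, height) in pixels.
    """
    corners_world = np.asarray(corners_world, dtype=float)
    corners_cam = corners_world @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float)
    if np.any(corners_cam[:, 2] <= 0):      # a corner lies behind the camera
        return False
    pixels_h = corners_cam @ np.asarray(K, dtype=float).T   # homogeneous pixel coordinates
    pixels = pixels_h[:, :2] / pixels_h[:, 2:3]
    width, height = image_size
    inside = (
        (pixels[:, 0] >= 0) & (pixels[:, 0] < width)
        & (pixels[:, 1] >= 0) & (pixels[:, 1] < height)
    )
    if not np.all(inside):                  # the board is only partly visible
        return False
    mean_distance = np.linalg.norm(corners_cam, axis=1).mean()
    return bool(mean_distance <= max_distance)
```

The check costs only a handful of matrix products per frame, so the comparatively expensive corner detector runs only when the board is likely to be found.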

When chessboard detection succeeds on the current frame but failed on the previous frame, a new keyframe is created with a corrected pose, which is estimated from the correspondence between the image coordinates and the world coordinates of the chessboard corners. Then, the relative pose between the newly created keyframe and the previous keyframe can be calculated as

Depth map estimation is one of the most commonly studied problems in monocular SLAM, and in this paper we adopt the commonly used method proposed in [

After the pose estimation process, pose graph optimization is necessary to continuously optimize the map, which consists of a set of keyframes and their camera poses. The error function is defined in the following equation, the same as in [

In the experiments, the SLAM approach is run in an indoor environment, where the visual scene is occupied by man-made equipment. The camera is a commonly used industrial camera with a frame rate higher than 30 frames/sec.

Before the experiment, the camera is placed in front of a chessboard for the initial alignment, from which the initial position and orientation of the camera relative to the world coordinate system are calculated. During the experiment, the hand-held camera is moved around the house arbitrarily and returned to the start position at the end. Based on the same dataset recorded during the motion, SLAM is run twice to compare two different methods. The first experiment uses the most common monocular SLAM approach, as described in [

Reconstruction of scene.

As depicted in the experimental results, there exists accumulated error in the pose estimation when the traditional LSD-SLAM method is used, which is shown in Figure

A comparison between the reconstructed points and the real ones. The red points are real chessboard corners and the blue ones are those reconstructed.

Reconstructed chessboard corners in different keyframes.

The aforementioned results provide only a visual comparison. To assess the proposed method quantitatively, we also provide a numerical analysis of the reconstruction results. Because the surrounding scene in the experiment is chosen arbitrarily, without any accurate scale or size information, we use the reconstruction of the known chessboard to verify the accuracy of our method. As depicted in Figure

As depicted in Figure

In addition, the coordinates of a randomly selected chessboard corner in different keyframes are also recorded and used to show how the accuracy of the reconstruction varies.

From the results shown in Figure

In this article, a metric direct monocular SLAM system is introduced, which runs in real time on a CPU and obtains a metric reconstruction of the scene. With the assistance of a chessboard, the initial depth map is estimated; meanwhile, the similarity transform between the known world coordinate system and the map coordinate system is calculated, which can be used to convert the map to the known world coordinate system. The system is tested in a complex indoor environment, and its accuracy is verified by comparing the estimated chessboard corner coordinates with the real ones. The indoor experiments demonstrate the effectiveness of the proposed metric monocular SLAM approach.

The authors declare that they have no competing interests.

This work was supported by the National Natural Science Foundation of China (Grant no. 61403398).