3 D Visual SLAM Based on Multiple Iterative Closest Point

With the development of novel RGB-D visual sensors, data association has been a basic problem in 3D Visual Simultaneous Localization and Mapping (VSLAM). To solve the problem, a VSLAM algorithm based on Multiple Iterative Closest Point (MICP) is presented. By using both RGB and depth information obtained from RGB-D camera, 3D models of indoor environment can be reconstructed, which provide extensive knowledge for mobile robots to accomplish tasks such as VSLAM and Human-Robot Interaction. Due to the limited views of RGB-D camera, additional information about the camera pose is needed. In this paper, the motion of the RGB-D camera is estimated by a motion capture system after a calibration process. Based on the estimated pose, the MICP algorithm is used to improve the alignment. A Kinect mobile robot which is running Robot Operating System and the motion capture system has been used for experiments. Experiment results show that not only the proposed VSLAM algorithm achieved good accuracy and reliability, but also the 3D map can be generated in real time.


Introduction
Motivation.During the past ten years, most robot systems rely on two dimensional (2D) sensors for map creation, selflocalization, and motion planning.Robot maps of indoor environments have often been 2D occupancy maps in the past, allowing robots to be localized in terms of position and orientation.However, in real world scenarios, robots have to face a three dimensional (3D) environment, not just 2D environment.The robot can have better performance in higher level missions if 3D environmental information can be obtained.Hence, many researchers now focus on 3D mapping and the RGB depth sensors (RGB-D) cameras are new alternatives to provide the useful information to create the map.Meanwhile, 2D Simultaneous Localization and Mapping based on a laser scanner has been implemented in so many cases that it may now be considered solved [1].SLAM applications are numerous in hazardous or poisonous environments which require a robot with ability of SLAMbased navigation.Laser scanner is often performed accurately in SLAM, but it is expensive and has limited ability in complex environment.
In recent years, with development of different type of sensors, such as 3D laser scanners and stereo cameras, which have been used in Visual Simultaneous Localization and Mapping (VSLAM) [2], VSLAM has become increasingly important.In particular, RGB-D as one kind of 3D laser scanners are widely used to obtain 3D environment data in many practical applications [3][4][5].A technique called distance imaging technology is used in RGB-D sensor which means that the generated 2D image shows the distance between one point and another point set in a scene.The obtained output is called a depth image, which contains a value corresponding to a distance value [2].Besides, RGB-D cameras offer an alternative to create metric maps of an environment with known scale at a unit price much cheaper than other sensors such as laser range finders.This paper investigates how to use RGB-D sensors to create a 3D map.The problem can be stated as generating a digitized representation of an indoor environment, which can be used for automatic localization in that environment and other applications such as 3D VSLAM and 3D rendering.Most of existing research works on 3D mapping used features extracted from the color images [6,7].However, visual features are not robust enough in some situations such as when the environment lacks features or when the camera has strong movements.Our objective is to develop a system which can realize 3D VSLAM accurately, quickly, and reliably.
Related Work.Traditional 2D or 3D VSLAM methods usually use a monocular camera, a binocular camera, a trinocular stereo camera, a fish camera, or a panorama camera for visual SLAM [8].Tong et al. [9] used a fish camera to get images and mapped wide-angle images to spherical images and then used Extended Kalman Filter (EKF) for VSLAM.Farrokhsiar and Najjaran [10] proposed a theoretical framework of Rao-Blackwellized Particle Filter in which higher order state variables and a modified undelayed initialization scheme are incorporated to solve the 3D monocular VSLAM problem.Currently, RGB-D sensors are widely used for various forms of researches.Novel RGB-D sensors like Microsoft Kinect and ASUS Xtion Pro live are two kinds of popular 3D sensors, as shown in Figure 1, which provide RGB images and depth images.In general, VSLAM approaches operating on RGB-D images are different from stereo systems as the input is RGB-depth images instead of color images.Izadi et al. [11] presented a Kinect fusion method to realize real-time 3D reconstruction and interaction by using a moving depth camera.Fioraio and Konolige [12] presented a system which used bundle adjustment to align dense point clouds of Kinect without using RGB images.The rest of the paper is organized as follows: Section 2 describes the MICP-based 3D visual SLAM method.Section 3 introduces hardware and software of the mobile robot platform.The experimental results and comparison with other methods are shown in Section 4. Finally, Section 5 and last part end with a summary and acknowledgement.

3D VSLAM Using a Kinect Sensor
Most traditional VSLAM methods are based on EKF algorithm [2,6].In this paper, a novel VSLAM algorithm based on MICP is proposed.Color (RGB) and depth (D) data are both used in the algorithm and it is generic due to the fact that it contains no motion model.Schematic overview of the proposed algorithm is shown in Figure 2. The proposed VSLAM mainly includes five steps.
Step 1. RGB-D sensor is used to collect depth and RGB images with synchronized timestamps.And then feature points from RGB images are extracted by SURF (Speeded Up Robust Feature) algorithm [15].In order to match two kinds of feature points, extracted feature points from RGB images are projected to depth image.In this step, some uncertainty will be introduced into a chain of operations, not only due to synchronization mismatch between depth and RGB images, but also because of interpolation between points with large differences in depth.The fact is that a minor misprojection of a feature is lying on an object border onto the depth image which can result in a big depth error and makes features picked at object borders unreliable.Step 2. RANSAC algorithm is used to find a 6D transform for the camera position in noise [16].Feature points are matched with earlier extracted feature points from a set of 30 images in database.The set consists of a subset including some of the most recent captured images and another subset including images randomly selected from the set of all formerly captured images.Three matched feature pairs are randomly selected and are used to calculate a 6D transform.
Step 3. In order to create a globally consistent trajectory, General Graph Optimization (g 2 o) algorithm is used to realize global optimization [17], which is an easily extensible graph optimizer that can be applied to a wide range of problems including several variants of SLAM and bundle adjustment.And it performs a minimization of a nonlinear error function that can be represented as a graph.
Step 4. All feature pairs are then evaluated by their Euclidean distances to each other.Pairs whose Euclidian distance is below a certain threshold are counted as inliers.From these inliers a refined 6D transform is calculated using MICP.
Step 5. Registered point clouds are used by an improved FastSLAM 2.0 to complete 3D Occupancy Grid map [18].

RGB-D Sensor Calibration and Coordinate System Conversion.
As position and orientation of RGB-D sensors in the world coordinate system cannot be obtained directly, we can infer the true pose of a sensor through calibrating markers which are on the top of the sensor.The purpose of calibration is to obtain the relationship between the sensor pose and the rigid body which is represented by markers.Three markers are mounted at the top of the sensor, which represent a rigid body which has an ability of rotation and translation.The calibration process involves some relationships among five coordinate systems, which are world coordinate system  (represented by VICON motion capture system) [19], marker coordinate system , sensor coordinate system , image coordinate system , and target coordinate system , as shown in Figure 3.After we obtain relationship between rigid body and RGB-D sensor, the position of rigid body is able to get through VICON motion capture system.And then position and orientation offset of the sensor is represented by the product of the rigid body position and a homogeneous matrix.

Calibration for RGB-D Sensor.
The whole set of the RGB-D sensor has two key parts which are the color camera and the depth camera.The calibration for depth camera is already accomplished by the manufacturer.Then, using the Coordinate system of VICON (world coordinate system) depth camera intrinsics, each pixel (  ,   ) in the depth camera can be projected to metric 3D space (, , ) using these following formulas: RGB-D sensor calibration is used to establish the relationship between the sensor and the world coordinate system.Before calibrating, we firstly need to discover the relationship between image coordinate system and world coordinate system, which involves the following two steps: the first step is to map the image space to a normalized 3D space using intrinsic parameters  and the second step is to transform the normalized 3D space to world coordinates using the rotation  and translation .The formula is shown as follows: where  is the focal length,   ,   are the scale factors,  is the skew coefficient, and   ,   are the principle points.MATLAB camera calibration toolbox is used to obtain both extrinsic and intrinsic parameters of the camera.A chessboard is used to conduct the calibration.The calibration is based on Herrera's method [20].Then extracted corner points and the extrinsic results are shown in Figure 4.

Homogeneous Transformation.
A 3D coordinate transformation matrix is represented by a 4 × 4 homogeneous  which contains two-part information including the rotation  and translation .The relationship among the five coordinates is as follows: where   = (     0 1 ),   = (    0 1 ), and   = (     0 1 ).  ,   are the extrinsic result of camera calibration.  ,   are the data obtained from the VICON capture system.  can be obtained from the frame represented by the plane of the chessboard.So the offset   is represented by the product of these matrices.Then the camera pose is obtained by multiplying this matrix with the coordinate system of the three markers.

Feature Extraction and Matching.
In order to estimate spatial relationship of sensor pose from images, it is necessary to use image feature.And relationship between any two images is achieved by matching image feature points.In the paper, Speeded Up Robust Feature (SURF) algorithm [15] is used for feature extraction and matching.SURF algorithm includes detectors and descriptors, which are used to extract feature and discover consistency between two feature points, respectively.Specific implementation of SURF algorithm is divided into three steps: feature point detection, construction of feature point descriptor, and fast matching.
Step 1 (feature point detection).This step is divided into integral images, approximately Hessian matrix discriminate extreme point, scale space description, and features location (, , ).Integral image calculates pixels sum in a rectangular area.Integral image  ∑ () at a point  = (, ) of image  is defined as follows: In Hessian matrix in a given image  at a point  = (, ), scale  is defined as where   (, ) is a divalent partial derivative of Gaussian  2 / 2 , which convolute image  at a point .Box filters approximate Gaussian function in order to accelerate integral images; the approximate Hessian matrix is as follows: where weight  is a parameter, which is used to balance Hessian determinant expression.
Scale-space description is usually used to describe pyramid image.After calculating extreme value at the point of (, , ) in scale image according to Hessian matrix, a quadratic fitting function is used for interpolation: Derivate (7) and obtain the extreme value at extreme points Step 2 (construction of feature point descriptor).Descriptor is used to describe pixel values distribution of feature points in scale space field.Haar wavelet response of partial derivative at  and  directions is used to create distribution of 64D, which is accelerated by integral image at the same time: Step 3 (fast matching).Fast matching is divided into preliminary matching and exact matching.Euclidean distance of feature vector is used in preliminary matching as a similarity judgment measure of key points in two images.Kd-tree search [21] is used to discover Euclidean distance of original feature points in prepared register image: where   means the th element of the th feature descriptor in prepared register image;   is the th element of the th feature descriptor in reference image;  represents dimension number of feature vector.

Global Pose Graph
Optimization.RANSAC is an iterative algorithm which is used to adapt parameters of a mathematical model to experimental data and is a suitable method when a data set contains a high percentage of outliers, that is, measurements that suffer from measurement errors so large that validity of the measurements is low.RANSAC is the robust way that handles outliers in the data set, and it is suggested to be a suitable method for feature matching in VSLAM.More details of the algorithm can be found in [16].g 2 o implements a general and efficient graph optimizer.A graph optimizer achieves a minimization of a nonlinear error function represented as a graph.Nodes represent vectors of parameters and edges how well two parameter vectors match to external constraint, relating to the two parameter vectors.In the case where g 2 o is used by VSLAM for global pose graph optimization, each node in the graph represents a state variable of the robot and each edge represents a pairwise observation between nodes that are connected by edge.The algorithm is fast due to sparse connectivity property of graph and advanced solvers of sparse linear systems are used.More thorough presentation is presented in [17].

3D Point Clouds
Register.ICP algorithm is one of the most widely used intuitive registration methods [22].The ICP algorithm can be summarized in two key steps.
Step 1. Find matched feature points between two point clouds.
Step 2. Compute a transform , which includes a rotation and translation such that it minimizes the Euclidian distance between matched points.
In the paper, we propose a MICP algorithm, which can be thought of as a plane-to-plane matching algorithm.The most likelihood estimate of the transform can be achieved by assigning probabilistic properties to extracted features.Furthermore, probabilistic framework can make use of all general research in probabilistic techniques to increase robustness, such as outlier rejection.The probabilistic model assumes that points in the measured point clouds D = {α  } and Ŝ = {ŝ  } are independently Gaussian distributed [12].This will generate probabilistic point clouds  and  with points   ∼ (α  ,    ) and   ∼ (ŝ  ,    ).Here    and    are threedimensional covariance associated with the measured points.Define  * as the correct transform when perfect compliance between points occurs as ŝ =  * d .
Defining the function and considering this function evaluated with  * , as in (13), give the probabilistic function This is an objective function for optimization and the most likelihood estimate can be found solving The solution of ( 14) gives a transform which can be used to optimize the point cloud match in a probabilistic manner.By setting    =  and    = 0, the standard point-to-point MICP algorithm is obtained.In this case (14) becomes which coincides a point-to-point MICP algorithm for matched features.
In the first layer, firstly ICP algorithm is used in the MICP algorithm to align the current frame with the previous frame.If estimation error is extremely small, then rotation matrix and translation matrix will be updated.In the second layer, if matching fails due to situations such as shaking and sudden change of the sensor direction, then ICP algorithm will be started again until the nearest frame in map  is found out to match the current frame.If both of the two levels of registration failed, the current frame will be discarded; if a frame is at a distance of  with all frames in map , then it will be considered as an update and added into map .
The detail of the proposed MICP algorithm is shown in Algorithm 1.

3D Occupancy Grid Mapping. FastSLAM algorithm decomposes VSLAM problem into mobile robot localization problem and landmark position estimation problem.
And particle filters are used to realize mobile robot pose estimation of the entire path; meanwhile, EKF is used for estimating landmark position.The FastSLAM algorithm has two versions 1.0 and 2.0 [23].In the paper, the improved FastSLAM 2.0 version algorithm was used [18].
We analyze and solve VSLAM problem based on the perspective of probability.Firstly, we need to estimate the robot pose   at time  and joint posterior probability density (  ,  |  0: ,  0: ), of landmark position .For convenience,   (  , ) is used to replace (  ,  |  0: ,  0: ), which is like a Markov chain, that is, the current state   (  , ) only related with the former state  −1 ( −1 , ), which can be expressed as a probability model as follows: where  is a normalization constant, (  |   , ) is a probability of observation model, and (  |  −1 ,   ) is a probability of motion model.
From entire trajectory of the mobile robot, we can obtain the following formula: Equation ( 17) decomposes the joint probability density of robot pose and landmark position into estimate  + 1: one for the robot trajectory   ( 0: ), the remaining  is the position estimate of each landmark on the basis of known path.
Particle filters are used in FastSLAM algorithm use to estimate trajectory of the mobile robot   ( 0: ) [24], each particle in the particle filter represents a path estimate and associates with  independent EKF, and each EKF corresponds to a landmark.Particle th at time  can be expressed as follows: (1) for current frame  do (2) ICP(current frame , previous frame  − 1) (5) if ICP succeeds then (6) Update else (8) find nearest frame  ( ICP(current frame , frame ) (10) if ICP succeeds then (11) Updated Updated failed, discard the frame (14) end if (15) end if (16) if Update succeeds then (17) if arg max ∈ ‖(  est ,   est − (  est ,   est ))‖ > threshold then (18) add frame  to M (19) end if (20) end if where   , and covariance Σ  , . . .are Gaussian parameters of posterior probability density of each landmark, so each EKF estimated a landmark; therefore, it is a low-dimensional EKF.
The improved FastSLAM algorithm is used in the paper, and the process is as follows [24].
Step 1. ( Step 2. Update estimation of landmark observation.The observation function ℎ(⋅) is linear, and its adopted update mode is similar to the EKF-SLAM.
Step 3. Resampling: each particle is given a weight    = Target Distribution/Proposal Distrution. particles are resampled based on probability and then start the next round of iteration.RGB camera and depth sensor [28].This device is connected to a computer via USB port.Its effective sensing range is from 0.4 meters to 4 meters, vertical viewing angle range is ±43 ∘ , horizontal range is of ±57 ∘ , and frame rate is of 30 fps.Beyond major components mentioned above, an external battery is used to supply power for FitPC2 computer and Kinect sensor.And a USB-powered minifan is used to cool FitPC2.

Software Setup. Robot Operating System (ROS) is
an open source metaoperating system, including low-level device control, wireless communications, and software package management [29].ROS has two basic parts, one is the core part of ROS which functions as an "Operation System" and the other part is the packages contributed by the whole ROS community.In ROS, a program can be divided into different nodes which can be distributed to different computing devices in the same network.According to the form of software packages, ROS can effectively create and manage programs.Executable files are executed by "Nodes." A program can be divided into different "Nodes" in ROS.These "Nodes" can be running on different computers in the same network.A node is an independent process, which can receive and publish "Topics" from other "Nodes." A part of the hardware can be regarded as a node, and a data processing algorithm also can be used as a node, as long as all these "Nodes" are working in the same "ROS Master." The whole algorithm can be successfully achieved in such a distributed mode.
There are two kinds of "Nodes" in our ROS.One kind is hardware drivers including Pioneer robot driver node and Kinect driver node.Another is functional procedures including feature extraction node, feature point match node, data association node, and FastSLAM GMapping node, as shown in Figure 6.Pioneer robot 3DX driver program is called "ROSARIA package, " developed by Aria library; Kinect sensor driver program is called "openni kinect" which comes from ROS community.Programs of feature extraction node, feature point match node, and data association node are written by us except the FastSLAM GMapping package which is used for robot localization in this part; FastSLAM GMapping package and navigation package are supplied by ROS community which are used for VSLAM and autonomous navigation.

Experiment Results and Comparison
A VICON motion capture system [19] is mounted in a laboratory which was used as a test environment, as shown in Figure 5.A Kinect sensor was used as a RGB-D sensor, and three markers were fixed on the top of Kinect as a rigid body, which is used for camera calibration.ROS [29] was used as a software platform to implement the proposed algorithm.All experiments were carried out on a powerful computer who has a hexa-core CPU with 8 GB of memory.All processes (also called "Nodes") are developed in accordance with the framework of ROS, including Kinect drivers, image compression and processing procedures, movement data receiver program, point cloud data processing, and display program.

Evaluation of the Proposed MICP-Based 3D VSLAM.
In this section, we firstly present results on the accuracy from the proposed MICP-based visual SLAM approach and then evaluate the accuracy, reliability, and real-time performance of proposed method.Finally, we investigate the influence of various parameters on the running time of our VSLAM method.In order to evaluate our proposed method, an evaluation tool that computes the root mean square error (RSME) [30,31] is used which provides an evaluation for our visual SLAM method.Furthermore, a VICON motion capture system is used to provide precise RGB-D sensor positioning data.In order to assess validity and reliability of the proposed method, VSLAM tests were conducted in an artificial environment and an office environment, respectively.Meanwhile, a VICON motion capture system was used to verify accuracy of the proposed algorithm.According to the proposed MICP algorithm, a handle was used to control the mobile robot to surround the room at a steady speed, and then a 3D map of the entire environment was created by Kinect sensor mounted on the robot.After 50 frames were captured, a 3D map could be generated quickly.The VSLAM results are respectively shown in Figures 7 and 8.In the first round of experiments, we evaluated the accuracy of the proposed MICP on all objects in the artificial room and in the office.On the objects "Floor" and "Desk, " we, respectively, obtained the best values of 1.9 cm and 4.1 cm RMSE error.Meanwhile, we achieved the worst result of 21.8 cm RMSE error on the object "Bookshelf, " as indicated by numbers in bold in Table 1.From Table 1, we can find that the range of camera degrees is varied from 15.48 deg/s to 66.23 deg/s, and the range of velocities is varied from 0.09 m/s to 0.47 m/s.And the proposed method achieves on average an accuracy of 11.9 cm and 4.16 ∘ .Furthermore, high angular and translational velocity pose have no obvious difficulty even though some frames had a certain level of overlap.In general, these evaluation results show that our approach performs well in most of these objects.And with an average frame processing time of 0.618 s our approach is suitable for online operation.
Results on these objects from Table 1 show that the capabilities of the proposed approach are in the best case.However, good results can usually be achieved when the robot is carefully moved in an indoor environment.Generally, a map accomplished by someone who hold a RGB-D sensor is better than other map which is built by RGB-D sensor fixed on a mobile robot, as human being can keep the RGB-D sensor moving in a uniform velocity.However, when someone uses a joystick or a computer to control a robot, it is usually hard to keep the robot moving at a constant speed.On the other side, when the robot tries to cover areas of a larger space, data from RGB-D sensor are more challenging as the unrestricted camera motions.

Comparison of Our MICP-Based 3D VSLAM with Other
Methods.Furthermore, we compared the proposed method with method based on Kinect fusion [11], method based on RGB-D SLAM [7], and method only based on ICP, respectively.Comparison results are shown in Table 2.The Kinect fusion-based method can achieve the highest accuracy of 5 mm, but its reliability only can last for 21 hours.Another method based on RGB-D SLAM had the fastest speed of 0.5 fps.However, all calculations are based on hardware.Additionally, its process requires a lot of memory to store data, so the created map size is limited.
Besides, when we only used the ICP method in the proposed algorithm, the running speed of the algorithm was nearly 6 fps.But when MICP was used in the algorithm,

Figure 3 :
Figure 3: Schematic view of relationship among five different coordinate systems in calibration.

Figure 4 :
Figure 4: Corner points extraction (a) and results for extrinsic parameters (b).

Figure 6 :
Figure 6: Structure chart of nodes in our ROS.

Figure 7 (
a) is the 2D image of an artificial environment; Figures 7(b) and 7(c) are images from the front view and top view of the 3D artificial environment, respectively.Figures 8(a) and 8(b) are images from left corner and right corner of an office, respectively; Figures 8(c) and 8(d) are images from the front view and top view of the 3D artificial environment, respectively.