Method for SLAM Based on Omnidirectional Vision: A Delayed-EKF Approach

This work presents a method for implementing a visual-based simultaneous localization and mapping (SLAM) system using omnidirectional vision data, with application to autonomous mobile robots. In SLAM, a mobile robot operates in an unknown environment using only on-board sensors to simultaneously build a map of its surroundings, which it uses to track its position.The SLAM is perhaps one of the most fundamental problems to solve in robotics to build mobile robots truly autonomous. The visual sensor used in this work is an omnidirectional vision sensor; this sensor provides a wide field of view which is advantageous in a mobile robot in an autonomous navigation task. Since the visual sensor used in this work is monocular, a method to recover the depth of the features is required. To estimate the unknown depth we propose a novel stochastic triangulation technique.The system proposed in this work can be applied to indoor or cluttered environments for performing visual-based navigation when GPS signal is not available. Experiments with synthetic and real data are presented in order to validate the proposal.


Introduction
In recent years, many researchers have addressed the issue of making mobile robots more and more autonomous.In this context, the estimation of the state of the vehicle (i.e., its attitude and position) is a fundamental necessity for any application involving autonomy.
For open outdoor domains this problem is seemingly solved with on-board Global Positioning System (GPS) and Inertial Measurements Units (IMU), as well as their integrated version: the Inertial Navigation Systems (INS).On the contrary, unknown, cluttered, and GPS-denied environments still pose a considerable challenge.While attitude estimation is well handled by available systems, GPS-based position estimation can have some drawbacks.Specially GPS is not a reliable service as its availability can be limited by urban canyons and it is completely unavailable in indoor environments.
There are many approaches that work on this problem, one of the simplest is the use of odometry to estimate the pose of the robot; however, due to errors of the sensor and problems like wheel slip, this approach is useful only for short trajectories.For this reason, different type of sensors might be used.One option is the use of a laser sensor; this is an interesting approach; however these type of sensors are more expensive.Another option is the use of a camera; one of the advantages is the cost of this sensor which is very low compared with other sensors; in addition, its weight and power consumption make it an excellent option for robots with weight constraints, like an aerial robot; furthermore it provides more information than other sensors.
Simultaneous localization and mapping (SLAM) is an important problem to solve in robotics in order to build truly autonomous mobile robots.SLAM deals with the way in which a mobile robot can operate in an a priori unknown environment using only on-board sensors to simultaneously build a map of its surroundings, which it uses to track its position.Using visual-based SLAM methods, a robot can localize itself by detecting and tracking natural visual features of the environment.Also, because these visual features typically have an spatial meaning, the information obtained from the visual measurements can be used for replacing range sensors: the lasers are often expensive and heavy, and the operation range of ultrasonics is limited.
On the other hand, while range sensors (e.g., laser) provide range and angular information, a camera is a projective sensor, which measures the bearing of image features.Therefore, depth information (range) cannot be obtained in a single frame [1,2].This fact has motivated the appearance of special techniques for feature initialization to allow the use of bearing sensors, such as cameras, in SLAM systems.In this sense, the treatment of the features in the stochastic map, such as, initialization and measurement., is perhaps still the most important problem in monocular SLAM in order to improve robustness.
Currently, there are two main approaches for implementing vision-based SLAM systems: (i) filtering-based methods (as [3][4][5]) and (ii) the optimization-based methods (as [6,7]).While the former approach is shown to give accurate results when the availability of computational power is enough, filtering-based SLAM methods might be still beneficial if limited processing power is available [8].Moreover, filteringbased methods are better suited for incorporating, in a simple manner, additional sensors to the system.In this sense, most robotics applications make use of multiple sensor inputs.

Related Work.
There are several options for a visual sensor; the most common is a perspective camera; however this sensor has a short field of view, which can be a problem in an autonomous navigation system.In this work, we propose to use an omnidirectional vision system, which is a combination of a mirror and a camera [9,10]; the advantages of this sensor are its wide field of view; this capability can improve the autonomous navigation system.
In previous works, omnidirectional vision systems have been used for localization.In [11], the authors present an approach for self-localization using an omnidirectional vision system and a perspective camera; they combine the information of both cameras and with the use of epipolar geometry they estimate the location of the robot.However the localization is estimated with respect to the omnidirectional images acquired previously; in contrast in our approach the localization of the robot is estimated with respect to the features acquired on real-time.In this case, the image features extracted from the omnidirectional vision sensor will be tracked and used as input for a method to estimate simultaneously both the pose of the robot and a map of its surroundings.
In [12] a Particle Filter is used for the localization of a mobile robot equipped with omnidirectional vision; an approach is proposed that reduces the number of features generated by SIFT, as well as their extraction and matching time.Omnidirectional vision can also be applied for the estimation of attitude and heading; in [13] a visual compass proposed makes use of catadioptric images in order to calculate the relative rotation and heading of an unmanned aerial vehicle.In [14] an algorithm that builds topological maps is proposed; in this case, local features are extracted from images obtained in sequence and are used for both: to cluster the images into nodes and to detect links between nodes.
Compared with the number of SLAM approaches using projective cameras, the use of omnidirectional vision in SLAM has been less explored.An early approach is proposed in [15], in this work two omnidirectional vision sensors are used as a stereo system for performing EKF-based SLAM.In [16] a monocular SLAM system that does not make use of any motion model is proposed.The key aspect of the system is a pose estimation algorithm that uses information from the estimated map, as well as the epipolar constraint.
In [17] a SLAM with an omnidirectional sensor has been proposed; in contrast with our approach they do not estimate the depth.In [18] the previous approach is extended by considering a patch formulation for data association which is invariant to rotation and scale.In [19], the authors present an approach for SLAM with omnidirectional vision; they extract features from the environment and update the position of the robot and a map library by using the EKF algorithm.In [20] the SLAM estimation method is based on the FastSLAM approach and the Hungarian algorithm for hierarchical data association.In [21] a technique for the extraction and matching of vertical lines in omnidirectional images is proposed; also, the initialization of features is carried out with an unscented transformation.

Objectives and Contributions.
This work presents a novel method for implementing a visual-based SLAM system.The proposed approach combines odometry data and visual information obtained from an omnidirectional camera for estimating the state of the mobile robot as well as the map of its environment.The system can be applied to indoor or cluttered environments for performing visual-based navigation when no GPS signal is available.The system is mainly intended to be applied to small autonomous land robots, and its design takes into account the limited resources commonly available in this kind of applications.As it will be shown later, the proposal exhibits an improvement in the computational cost when it is compared with related methods.
The approach proposed in this paper shares some similarities with [18]; both are EKF-based methods and both use salient points as visual features.However while the method proposed in [18] makes use of the undelayed inverse depth technique [5] for initializing new features into the map, the proposed work is based on the delayed initialization technique, which is proposed in [22], for the same purpose.
Additionally this work presents the following novelties.
(i) The stochastic technique of triangulation that is used [22] for estimating the depth of visual features is extended in order to accommodate for the particularities of an omnidirectional vision sensor.In this case, a spherical projection model is used instead of the typical projection model in perspective cameras.
(ii) Also, different from [22] and other works, in order to improve the modularity and scalability of the proposed method, the process for detecting and tracking visual features is fully decoupled from the main estimation process.In this case, a simple but effective scheme is proposed.
(iii) Different from [18,22], which make use of the inverse depth parametrization.In this work the features of the map are parametrized directly in Euclidean coordinates.Euclidean features imply less computational cost than the inverse depth features [23].
The paper is organized as follows: in Section 2 the description of the problem is presented, Section 3 presents an introduction to the projection model of the omnidirectional vision sensor, the proposed approach is presented in Section 4, in Section 5 the experiments results are presented, and finally in Section 6 the conclusions are given.

Problem Description
The main goal of the proposed method is to estimate the following system state : where  V represents the state of the mobile robot and   represents the location of a feature point  in the environment.At the same time,  V is composed of where The proposed approach is intended for local autonomous vehicle navigation.In this case, the local tangent frame is used as the navigation reference frame and thus initial position of the vehicle defines its origin.The navigation coordinate frame follows the NED (North, East, Down) convention.In this work the magnitudes expressed in the (i) navigation frame, (ii) vehicle (robot) frame, and (iii) camera frame are denoted, respectively, by the superscripts , , and  (see Figure 1).All the coordinates systems are right-handed defined.
In this work, it is assumed that origin of the robot's coordinate frame  is aligned with the origin of the camera's coordinate  frame but rotated by a magnitude defined by a known camera-to-robot rotation matrix   .A superscript  denotes a reference frame  expressed with respect to reference .

Omnidirectional Vision
As we mention previously, perspective cameras have a small field of view; one effective way to increase the field of view is the combination of a mirror with a camera; this approach is called omnidirectional vision.The image formation process of an omnidirectional vision sensor is different from the perspective camera.In this work, the model for omnidirectional cameras described in [24] is used.With this model we can compute the projection of a 3D point into the image plane, i.e., a projection R 3 ⇒ R 2 .
Let   = (, , )  represent a 3D point; this point can be expressed in the camera frame by   ; the relationship between these points can be defined as where   = (    )  is the rotation matrix transforming from navigation frame  to the camera frame .
Then, the point   is projected onto a unit sphere (See Figure 2): The point   is changed to a new reference frame centered in   = (0, 0, ) and    = (  ,   ,   + ). is a calibration parameter of the model and depends only on the system modelled and the geometry of the mirror.For a common perspective camera  = 0.For catadioptric systems with parabolic mirror and orthographic camera  = 1.For systems with hyperbolic mirror and perspective camera 0 <  < 1.
The point is then projected onto the normalized plane: Radial and tangential distortion induced by the lens are added through a distortion model  and distortion parameters  = [ 1  2  3  4  5 ]: The final projection involves a generalized camera projection matrix, where  =  is the generalized focal length for the camera-mirror system,  is a mirror parameter,  is the focal length of the camera, ( 0 , V 0 ) is the principal point,  is the skew, and  is the aspect ratio.
where  and V are the coordinates of the image point  expressed in pixels units.Inversely, a directional vector can be computed from the image point coordinates  and V.The vector ℎ  points from the camera optical center position to the 3D point location.
In ( 9), (, , ) =   , where   can be obtained from   from the inverted distortion model.And Finally, ℎ  can be expressed in the navigation frame by ℎ  =   ℎ  where   =     is the camera to navigation rotation matrix.

Detection and Tracking of Visual Measurements.
As we mention before, in this work, an omnidirectional camera is considered (see Figure 1) as the main sensory input to the system.A standard small base-line tracker is proposed for detecting and tracking visual features.For this purpose, the Kanade-Lucas-Tomasi Tracker (KLT) [25] is used, but any small base-line tracker can be used too.Detection of new features and their tracking are completely conducted by the KLT, (see Figure 3).In this case, the availability is assumed, at each  frame, of a list of  () = [ V (1) ,  V(2) , . . .,  V() ] visual measurements of  static landmarks of the environment.Visual measurements  V are modelled by where  V() = [, V] represents the true position in pixel coordinates (in the image plane) of the projection of a landmark .
The term V V represents the uncertainty associated with visual measurements and it is modelled by a Gaussian white noise with power spectral density (PSD)  2 V .Note that all the process for detecting and tracking visual features is fully decoupled from the main estimation process.Later, it will be seen that the whole uncertainty associated with the tracking process will enter into the filter by means of the update equations through parameter  2 V .Most of the time, the KLT produces good quality results; however, sometimes, false positives are also obtained (e.g., continuous tracking of a feature that should be occluded by objects in front of it).In this work, a simple validation technique is used, which has been shown to give good results in combination with the KLT.This technique may be resumed as follows.
( If the score obtained is higher than a threshold, then the current (   , V   ) position is assumed to be a valid measurement.
The proposed technique rejects some overoptimistic measurements provided by the tracker, while it still provides some good degree of rotational invariance to rotations of the camera.The use of visual descriptors with at least some degree of invariance to rotations is very important in applications involving omnidirectional cameras.
Although some degree of validation is provided by the above technique, more robustness for the data association process can be obtained by the use of batch validation techniques as [26][27][28].These techniques are applied during the update stage of the filter.

Odometry Measurements.
In this work a differential wheeled robot is considered; however, it should be straightforward to adapt the proposed method to a platform with a different configuration.
Measurements   = [  ,   ]  of the rotational speed of the robot's wheels are obtained through the wheel encoders and can be modelled by where   and   represent the true angular rate of the right and left wheel and V  is a Gaussian white noise with PSD  2  .
4.3.System Prediction.At every step , when measurements   of the encoders are available, the estimated system state x is taken a step forward by the following (discrete) nonlinear model: where   is computed from the current quaternion    , Δ is the sample time of the system, and where  is the wheel radius,  is the distance between wheels of the robot, and   ,   are the angular rotation measured by the encoders of right and left wheels.In the model defined in (12), a closed form solution of q = 1/2() is used for integrating the current velocity rotation   over the quaternion   .In this case  = [  +1 Δ/2] ⊤ and In the system prediction model presented above, note that a typical differential drive kinematics model, (13), which is defined for flat surfaces, has been coupled with a more general 6DOF kinematic model defined in (12).This modular design should allow using another kind of motion models with minor modifications.An Extended Kalman Filter (EKF) propagates the system state x over time.It is assumed that the map features ŷ remain static (rigid scene assumption) so x+1 = [x V(+1) , ŷ1() , ŷ2() , . . ., ŷ() ] ⊤ .The state covariance matrix  is taken a step forward by where  and the Jacobians ∇  , ∇  are defined as follows: V /x V are the derivatives of the equations of the nonlinear prediction model (12) with respect to the robot state xV . V / are the derivatives of the nonlinear prediction model with respect to the system inputs   and   .Uncertainties of the encoders, as well another kind of unstructured uncertainties, like wheel slippages, are incorporated into the system by means of the process noise covariance matrix  =  2   2×2 , through parameter  2  .
4.4.Visual Aid.Depth information cannot be obtained in a single measurement when bearing sensors (e.g., a omnidirectional camera) are used.To infer the depth of a feature, the sensor must observe this feature repeatedly as it freely moves through its environment, estimating the angle from the feature to the sensor center.The difference between those angle measurements is the parallax angle.Actually, parallax is the key to estimate the depth of the features.In case of indoor sequences, a displacement of centimeters could be enough to produce parallax; on the other hand, when the distance to a feature increases, then the sensor has to travel more to produce parallax.
In monocular-based systems, the treatment of the features in the stochastic map (initialization, measurement, etc.) is an important problem to address with direct implications in the robustness of the system.In this work, a novel method is proposed, for incorporating new features into the system.In this approach, a single hypothesis is computed for the initial depth of features, by using of a stochastic technique of triangulation.The method is based on previous author's work [22].In this previous work, a common projective camera is used as input to the SLAM system.

Initialization of Visual Features.
At the  frame, when a visual feature is detected for the first time, the following entry   is stored in a table (see Figure 4): where   = [   , , ] models a 3D semiline defined on one side by the vertex    , corresponding to the current optical center coordinates of the camera expressed in the navigation frame and pointing to infinity on the other side, with azimuth and elevation,  and , respectively, and where ℎ  = [ℎ   , ℎ   , ℎ   ]  is computed as is indicated in Section 3.    is a 5 × 5 covariance matrix which models the uncertainty of   .   =   , where  is the system covariance matrix and  is the Jacobian matrix formed by the partial derivatives of the function   = ℎ(x,  V ) with respect to [x,  V ]  .At every subsequent frame  + 1,  + 2, . . .,  + , a hypothesis of the feature depth   is computed by (see Figure 4) where   =  − ( + ) is the parallax.  =    0 −    indicates the displacement of the camera since it was first observed to its current position, and where  is the angle defined by ℎ 1 and   .ℎ 1 is the normalized directional vector : computed taking   ,   from   and where  is the angle defined by ℎ 2 and −  .ℎ 2 = ℎ  is the directional vector pointing from the current camera optical center to the feature location and is computed as is indicated in Section 3 from the current measurement  V .At each step, the hypothesis of depth   is passed through a low-pass filter.The above occurs since the depth estimated by triangulation varies considerably, especially for low parallax.In previous author's work [22] it is showed that only a few degrees of parallax are enough in order to reduce the uncertainty in depth estimations.
When the parallax   is greater than a threshold (  >  min ) a new feature ŷnew = [  ,   ,   ] ⊤ is added to the system state vector x: where ŷnew =    0 +  (  ,   ) .
In this work, good experimental results were obtained with  min = 4 deg.
The system state covariance matrix P is updated by where   new is the 3 × 3 covariance matrix which models the uncertainty of the new feature ŷnew , and In (25),    is taken from   and  2  is a parameter modelling the uncertainty of process of depth estimation. is the Jacobian matrix formed by the partial derivatives of the function ŷnew = ℎ(  , ) with respect to [   0 ,  0 ,  0 , ]  .

Map Management and Real-Time Issues.
A SLAM framework that works reliably locally can be applied to largescale problems using methods, such as submapping, graphbased global optimization [8], or global mapping.
Therefore, in this work, large-scale SLAM and loopclosing are not considered.Although, these problems have been intensively studied in the past.
Moreover, this work is motivated by the application of monocular SLAM in the context of visual odometry.When the number of features in the system state increases, then computational cost grows rapidly, and consequently, it becomes difficult to maintain the frame rate operation.To alleviate this drawback, old features can be removed from the state for maintaining a stable number of features and, therefore, to stabilize the computational cost per frame.Obviously, if old features are removed, then previous mapped areas cannot be recognized in the future.However, in the context of visual odometry or local mapping, this fact is not considered as a problem.
In particular, in this work, oldest features are removed from the system state when the number of map features surpasses a certain threshold.Good experimental results were obtained with a threshold equal to 50 features.The removal process is carried out using the approach described in [22].
It is important to note that, since early works as [29], the real-time feasibility of EKF-based SLAM methods was shown for maps composed up to 100 features using standard hardware.Even though there are real-time implementations for high computation demanding techniques as the optimization-based methods, the threshold used in this work is under a bound that should permit a real-time performance, by means of an optimized implementation.
As it was mentioned before, in the proposed method, the process for detecting and tracking visual features is fully decoupled from the main estimation process.In this sense, this architecture should permit to parallelize at least both processes in software or hardware implementations.However, this subject is beyond the scope of this work.

Experimental Results
In this section we present the results obtained using synthetic data obtained by simulations as well as the results obtained with real data.The experiments were performed in order to validate the performance of the proposed method.A MATLAB5 implementation was used for this purpose.

Experiments with Simulations.
The main purpose of the experiments, carried out through simulations, was to validate the correctness of the measurement model used for the omnidirectional camera.In this case, the problem of data association was avoided and perfect matching of visual features was assumed for experiments.Also in simulations, the use of the odometry data obtained from the robot's encoders was avoided.In this case, the model described in (12) was slightly modified in order to use Gaussian noise as inputs instead of odometry measurements.With this modification, the method was tested in a purely monocular 6DOF SLAM context, for obtaining more insight about the performance of the visual aiding.
Figure 5 shows the simulated environments used in experiments.In this case, the camera was moved in order to follow a spiral-like trajectory.The area, where the vehicle is moving, is surrounded by a wall of landmarks.During the trajectory, the camera yaw was changed under a constant pattern of movement.In order to recover the metric scale of the world, the system was initialized with perfect knowledge of four landmarks.At the end of the experiment it can be appreciated that both trajectory and map have been satisfactorily estimated.

Comparative Study with Real
Data.The proposed method was executed off-line, using the Rawseeds dataset [30] as input signals, in order to test its performance with real data.This dataset is freely available online, and it contains several sources of sensors signals.In the experiments the following signals were used: (i) visual information obtained from an omnidirectional camera at 15 frames per second (fps) and with a resolution of 640 × 640 pixels.(ii) Odometry data was obtained from the wheels encoders of the robot, available at 50 Hz.The sensors are mounted over a differential drive platform, (see [30] for a complete description).In experiments, the 1-point RANSAC method [28] has been used for validating the visual matches of map features.The method was tested under both indoor and outdoor conditions.Figure 6 shows examples of the visual input, as well as the output of the proposed method for both cases.
For comparison purposes, another two related methods were also implemented: (i) the scheme proposed in [18] (UID), which is based in the well-known undelayed inverse depth methodology and (ii) A variant of the former approach, the undelayed inverse depth to Euclidean method (UID2E), which is based on [23].For performing the experiments, the same input signals were used with the three methods.In this case, because the process for detecting and tracking visual features is fully decoupled from the main estimation process (Section 4.1), it is important to note that the same set of visual features were used as input to the three methods.Therefore, the experimental results were obtained under similar conditions.

Indoor Experiments.
The dataset used for testing the proposed method under indoor conditions was captured by the robot traveling through corridors and hallways of a building intended to education.The sequence used in these experiments is composed by 3700 frames.Figure 7 shows  a top view of the map and trajectory that was obtained with each method.It is important to note that for indoor conditions there is no availability of a ground truth signal.In this case, in order to have a better interpretation of the results, the estimated map and trajectory were overlaid to a CAD drawing of the building where the raw data was collected.Table 1 summarizes the experimental results, under indoor environments, obtained with the three methods.The following statistics have been computed for each method: (i) the execution time of the SLAM process (Ts), (ii) the average execution time per frame of the SLAM process (aTpFs), (iii) the execution time of the 1-point-RANSAC validation process (Tr), (iv) the average execution time per frame of the 1point-RANSAC validation process (aTpFr), (v) the number of features initialized into the system (NIF), (vi) the average number of features per frame used for updating the filter (aFpF), and (vii) the average number of outliers rejected per frame, by the 1-point-RANSAC validation process (aNFRpF).
Figure 8 illustrates the evolution over time of the number of features contained within the system state.Figure 9 illustrates the evolution over time of the number of outliers detected by 1-point-RANSAC method.The visual matches detected as outliers are not considered for updating the filter.

Outdoor Experiments.
The dataset used for testing the proposed method under outdoor conditions was captured by the robot traveling along sidewalks of a campus.The sequence used in these experiments is composed of 1600 frames.Figure 10 shows the map and trajectory obtained with the proposed method (DE) and the UID and UID2E methods.Note that for outdoor experiments a ground truth reference is available.Table 2 summarizes the experimental results under outdoor environments obtained with the three methods.Additionally to the statistics computed for indoor experiments, in this case for outdoor experiments, the average Euclidean  error (AEE) in position estimation was also computed due to the availability of the ground truth signal.
Figure 11 shows the evolution over time of the Euclidean error in position estimation for each method.Note that this figure also displays the Euclidean error obtained when only odometry data, obtained from the wheels encoders of the robot, is used for computing the position.Figures 12 and 13, respectively, show the evolution over time of the number of features included within the map and the number of outliers rejected per frame by the 1-point-RANSAC method.

Discussion.
According to the results of the comparative study some implications can be inferred.
For indoor experiments, analysing Figure 7, it can be seen that similar trajectories were estimated with the proposed method (DE) and the UID method.On the other hand, the UID2E method presents a considerable drift in this experiment.For outdoor experiments, by analysing Figures 10 and 11, it can be seen that the DE method showed a slightly better performance.In this case, very similar results were obtained with UID and UID2E methods.
After analyzing the results presented in Tables 1 and  2, it can be observed that the proposed method offers a better performance in terms of computational cost.Two main reasons can explain the reduction of the execution time of the DE method with respect to UID and UID2E methods.
The first reason has to do with the use of the Euclidean parametrization of features, which results in a reduction of the size of the vector state (compared with the inverse depth parametrization).As it is well known, the computational cost of the Kalman Filter scales badly with the size of the state.The UID2E method transforms the inverse depth features, that are well conditioned, to Euclidean features in order to improve the computational cost of the UID method.In fact, by comparing with the UID method, some reduction of computational cost is obtained with the UID2E method.However, as it was the case with indoor experiments, the UID2E method can show sometimes a larger drift in estimations.
The second reason has to do with the inherent property of the proposed method to reject weak features during the initialization process.Different from the undelayed methods whose features are initialized at the very first time that they are observed, the delayed methods (as the proposed one) collect depth information of a feature prior to its inclusion into the system state.During this initial period, weak candidate points that are detected but are lost after only a few frames that have been tracked are not initialized into the system state.Also, candidate points that do not exhibit parallax are not included into the system.As a consequence of the above, a less number of strong visual features are initialized into the system with the DE method (observe the NIF column in Tables 1 and 2).
On the other hand, as it was mentioned before, undelayed methods (UID and UID2E) initialize into the filter all the visual features detected.In this case, weak visual features are continuously been removed from the system state, while the new features are been detected and initialized.The high number of features that are initialized and removed continually requires that some portion of the system state performs somehow as a buffer for these features.This has as a consequence that the UID and the UID2E methods require maintaining a larger system state (see Figures 8 and 12).
The 1-point-RANSAC validates the unequivocal association of visual features with its spatial meaning.This validation This suggests that the DE method produces maps less dense of features, but also more consistent.
Moreover, the number of outliers also has a huge impact to the computational cost.In this case, the time consumed by the 1-point-RANSAC validation process increases with the number of outliers (see Tables 1 and 2).

Conclusion
In this paper a method for mobile robot navigation using visual information is presented.The proposed scheme is a filter-based simultaneous localization and mapping (SLAM) system.In this case, visual information is incorporated into the system in order to minimize the odometry error.A monocular omnidirectional vision sensor was used in this work; this sensor allows tracking the features more time during the navigation and this makes the approach more robust to estimate the pose of the mobile robot as well as the map of the environment.In this case, the appearance and spatial information provided by the built map could be useful for multiple tasks.To estimate the depth of the features a novel method is proposed which is based in a stochastic technique of triangulation.In addition, the detection and tracking process are fully decoupled from the main estimation process.For this purpose, a simple but effective scheme was proposed.Finally, the effectiveness of the proposed approach is verified through simulations and experimental results with real data.
A comparison study was presented in order to provide additional insights about the performance of the proposed method.When it is compared with the related methods that were used in the study, the proposal shows a considerable improvement in terms of computational cost.Also, the results suggest that more consistent estimates are obtained with the proposed method.The proposed method should be useful to be applied to mobile robots moving across indoor or cluttered environments where GPS signal is not available.

Figure 1 :
Figure1: System parametrization.The local tangent coordinate frame is used as the navigation reference frame.It is assumed that origin of the robot's coordinate frame is aligned with the origin of the camera's coordinate frame but rotated by a known magnitude.

1 Figure 2 :
Figure 2: Full projection model used in this work.

Figure 3 :
Figure 3: Detection and tracking of visual features are conducted by a Lucas Kanade tracker.Note that only the region of the image defined by { 1 <  <  2 } is considered, where  1 is chosen in order to avoid visual features generated by the robot itself.

AFigure 4 :
Figure 4: A hypothesis of depth is computed for each visual feature prior to its inclusion into the system state by means of a stochastic technique of triangulation.

Figure 5 :
Figure 5: In simulations, the omnidirectional camera was moved through an environment composed by 3d points which emulate visual landmarks.(a) shows one of the frames as it should have been seen by the camera.(b) shows the actual (red) and estimated (blue) trajectory and map after 500 frames.

Figure 6 :
Figure 6: Examples of the experiments with real data in indoor (a, b) and outdoor (c, d) environments.The video frames (a, c) are displaying illustrating visual features that have been tracked, as well as the corresponding trajectory and map estimated by the method (b, d).Comparing visual features corresponding to elements of the environment with the estimated map, it can be appreciated that the physical structure of the environment is partially recovered.

Figure 7 :
Figure 7: Top view of the estimated map and trajectory, using real data from sensors mounted in a differential drive robot, moving through corridors and hallways of a building.The results are presented for the proposed method (DE) (a), the UID method (b), and the UID2E method (c).

Figure 8 :Figure 9 :
Figure 8: Indoor experiment: evolution of the number of the map features for each method.

Figure 11 :Figure 12 :
Figure 11: Outdoor experiment: evolution of the Euclidean error in position estimation.

Figure 13 :
Figure 13: Outdoor experiment: evolution of the number of outliers rejected per frame by the 1-point-RANSAC method.
where ℎ  /x V are the partial derivatives of the equations of the measurement prediction model ℎ  with respect to the robot state xV .And ℎ  /ŷ  are the partial derivatives of ℎ  with respect to feature ŷ .Note that ℎ  /ŷ  have only a nonzero value at the location (indexes) of the observed feature ŷ .=( 2×2 )2V is the measurement noise covariance matrix.

Table 1 :
Experimental results obtained under indoor conditions.

Table 2 :
Experimental results obtained under outdoor conditions.