Three-Dimensional Location and Mapping Analysis in Mobile Robotics Based on Visual SLAM Methods

One of the essential tasks required from a mobile robot is the autonomous and safe navigation of its working environment. However, in many cases, a model of the environment, or map, is not available to execute this task. Indeed, navigation requires a permanent estimation of the location on a map, which is not available for unknown environments. In such a scenario, the robot must have extended capabilities to solve, concurrently, the problems of localization and mapping. The simultaneous solution of these two problems is known as SLAM (simultaneous localization and mapping) and is a complex problem, not yet fully solved by the scientific community. This is due to the fact that localization requires a map that is not yet available, since it is still under construction. In turn, the elaboration of a map requires the estimation of the robot's location. This is the reason why SLAM has been compared to the chicken-and-egg problem. For a robot facing an unknown environment, the question becomes: what should be solved first, localization or mapping? The answer is that the robot will have to solve both problems at the same time. This article presents a study of some of the most representative open-source visual SLAM (vSLAM) methods, beginning with an analysis of their characteristics and presenting selection criteria for an experimental design that allows contrasting their advantages and disadvantages. Two of the most representative algorithms for solving vSLAM were considered (RTAB-Map and ORB-SLAM2). The experiments were validated with a robotic system designed for this purpose, which is fully compatible with ROS (robot operating system).


Introduction
One of the biggest challenges faced by robotics research is the incorporation of advanced levels of autonomy in robotic systems, understanding autonomy as the capacity of a system to carry out its processes and operations with no human intervention. Navigation is perhaps the problem most closely linked to the concept of autonomy in mobile robots. Precisely, some of the tasks that define a mobile robot as an autonomous entity are the safe navigation of its environment and path planning [1].
In robotics, the terms location, pose, and localization refer to the coordinates and orientation of a robot with regard to a global spatial representation system [2]. Localization and mapping, which is the ability of some robots to model their environment, are two fundamental problems in robotics and are closely related [3]. When a representation of the environment is not available, it must be constructed autonomously by the robot. This is a nontrivial problem, since in order to build models of the environment, it is necessary to establish a precise location in relation to a model that is not yet available. Similarly, in order to obtain a good estimate of the location, a map is required. Consequently, when a robot faces an unknown environment, it will have to simultaneously solve two problems that cannot be solved independently: localization and mapping [4]. These two problems are so strongly correlated that it is not possible to establish which of them is the cause and which is the effect. The need to solve both problems concurrently results in the simultaneous localization and mapping (SLAM) problem.
To successfully carry out these two tasks, the robot must have extended capabilities to reliably estimate its pose (x, y, θ)^T and be able to elaborate relatively accurate representations of its environment (mapping). Traditional methods employed in solving SLAM are based on probabilistic techniques. In such techniques, parameters and measures are treated as random variables, and the uncertainty in the localization estimate is modelled by using probability density functions.
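The probabilistic treatment described above can be illustrated with a minimal one-dimensional sketch (not part of the methods used in this article): the robot's position is a Gaussian random variable whose variance grows during motion (prediction) and shrinks when a measurement is fused (correction), in the style of a Kalman filter. All numeric values are illustrative only.

```python
# Minimal 1D illustration of probabilistic localization: the belief over
# the robot position is a Gaussian (mean, variance), updated by a motion
# step and corrected by a noisy range measurement.

def predict(mean, var, motion, motion_var):
    """Motion shifts the belief but also increases its uncertainty."""
    return mean + motion, var + motion_var

def correct(mean, var, measurement, meas_var):
    """Fuse a noisy measurement; the fused belief is more certain."""
    k = var / (var + meas_var)  # Kalman gain: how much to trust the measurement
    return mean + k * (measurement - mean), (1 - k) * var

mean, var = 0.0, 1.0                                      # initial belief
mean, var = predict(mean, var, motion=1.0, motion_var=0.5)
mean, var = correct(mean, var, measurement=1.2, meas_var=0.5)
print(mean, var)  # variance after correction is smaller than before it
```

The same prediction/correction structure, generalized to the pose (x, y, θ)^T and to landmark positions, underlies the classical probabilistic SLAM formulations.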
This article addresses the problem of simultaneous localization and mapping of indoor environments based on vSLAM methods, under controlled conditions that allow highlighting the differential aspects of two of the most representative approaches, namely, RTAB-Map (real-time appearance-based mapping) and ORB-SLAM2 (oriented FAST and rotated BRIEF simultaneous localization and mapping 2). Both RTAB-Map and ORB-SLAM2 are representative methods of the current state of the art in vSLAM (visual SLAM); however, while the former produces dense maps, the latter generates sparse ones. In both cases, the robot trajectory and the three-dimensional structure of the environment are estimated. In the authors' criteria, the contributions of this article are the following: (1) a holistic view of vSLAM methods that incorporates conceptual and procedural elements supporting the selection of the most convenient method for the development of robotic applications. (2) A detailed discussion of simple experimental settings under controlled conditions that tend toward repeatability, facilitating the comparison of methods.
(3) A complete interpretation of the models obtained from each approach in real environments, to which a wide variety of objects providing complex characteristics were incorporated.
This paper is organized as follows: in the second section, the materials and methods established for this research are discussed. The third section presents the qualitative selection method proposed in this study. The fourth section describes the experiments performed and discusses the results obtained. Finally, in the last two sections, conclusions are drawn and references are presented.

Materials and Methods
In recent years, robotics research and the development of robotic systems have experienced significant growth, largely due to the emergence of new modelling, estimation, optimization, and control methods, and to advances in computing, sensing, mechatronics, and communications. This scenario is conducive to the rise of new applications of robotics in various areas of daily life [5,6] and to the need for versatile and powerful tools for the design, simulation, and development of functional robotic systems.
Robots are now considered complex distributed systems that include a wide variety of hardware and software components, which is why this research incorporates the highest possible degree of modularity in the design and implementation of the experimental robotic system, both at the hardware and software levels. Although modular design poses additional integration, communication, and interoperability problems, these can be solved by using an intermediate software layer, or middleware, for distributed robotic systems [7]. Generally, the latest generation of open-source middleware facilitates integration with libraries and packages widely used in computer vision, point cloud processing, 3D geometry, and simulation. They greatly ease the development, testing, and debugging of algorithms with rigorous security and robustness requirements, mainly in applications where close interaction with people is needed, such as assistance, educational, and logistic robots for goods delivery [8][9][10].
As for hardware, several companies around the world (Clearpath Robotics, Fetch Robotics, PAL Robotics, and Hokuyo Automatic) develop and supply a wide range of robotic platforms and sensing, perception, and actuation systems for research, learning, and development of industrial and commercial applications compatible with different middlewares.
The following four tools were used in the experiments conducted in this work: the robot operating system (ROS), Gazebo, TurtleBot2, and a workstation [11].

Robot Operating System (ROS).
ROS is an open-source middleware for robotic application development that has been well accepted in academic and industrial environments and has positioned itself as a de facto standard for robotic application development [12][13][14]. It provides a distributed programming environment for both real and simulated platforms, as well as hardware abstraction, low-level control, message transfer between processes, and software package management [15]. ROS was conceived considering the criteria listed in Table 1.

Gazebo.
Gazebo is a simulator compatible with the ROS middleware. Preliminary tests of the implemented teleoperation algorithms were carried out with this tool, which offers a 3D simulation environment that requires a computer with good graphics processing capabilities, such as those provided by GPUs (graphics processing units) [17,18].

Turtlebot2.
It is an open hardware platform for research, testing, and development of methods, algorithms, and applications in robotics [12,14,15]. It holds the necessary perception, control, and actuation systems to start with the development of ROS-supported applications. Some low-cost modifications were carried out to improve the processing capacity and autonomy of the platform [19].

Processing and Communications.
Two computers were available. A high-performance one, which we call the workstation, was used for system supervision through GUIs (graphical user interfaces) and for the execution of processes with high computational cost. The other, of lower performance, which we call the Robot-PC, was incorporated into the mobile robot to execute processes of lower computational cost, mainly those associated with proprioception, exteroception, and effector control of the TurtleBot2 robot. Both devices were connected through a router.
Features of the hardware used are listed in Table 2. The experimental mobile robot prototype was named RobSLAM. It was developed from the TurtleBot2 Personal Robot, an open-source platform supported by ROS (Figure 1). The robot incorporates an RGB-D exteroceptive sensor, trays and brackets that allow incorporating new equipment, and, as its main component, the iClebo Kobuki mobile base, which integrates proprioceptive sensing, low-level processing, and locomotion systems.
The intermediate tray was removed from the original platform to reduce the height of the robot and keep its center of gravity low. Four aluminium spacers, 13 mm in diameter and 51 mm high, were available, whose length was extended by 20 mm high acrylic cylinders. These spacers were used to anchor the NUC minicomputer (Robot-PC) to the bottom of the upper tray, for which a 130 × 150 mm rectangular support base with its fixing screws was provided. The hardware architecture of the RobSLAM system is shown in Figure 2.
Since the 19.1 V DC @ 2 A supply available on one of the Kobuki base connectors was insufficient to power the NUC minicomputer, a regulated power supply system was designed that includes a 12 V @ 5.5 Ah battery, a DC/DC converter to boost the voltage to 19.0 V DC, and a fixing structure to prevent the battery from shifting. The system was anchored to the robot's bottom tray so as not to displace its center of gravity. The design provided a convenient power supply for the Robot-PC and an energy autonomy of up to three and a half hours under full battery charge, sufficient time to carry out the proposed experiments and collect data for analysis. Figure 3 shows the final appearance of the mobile robot.

Criteria for the Selection of vSLAM Methods.
This section presents the qualitative criteria that support this study. The types of sensors and detectors/descriptors used by each method and the kind of environment model were considered, along with specific aspects such as 2D and 3D modelling support, model dimensions, map reuse, loop detection, and global optimization. Table 3 presents some of the most popular vSLAM algorithms, classified according to the type of visual sensors they support and the type of detector/descriptor they use. MonoSLAM was the first real-time vSLAM algorithm to use a camera as the only sensor. In this method, feature-based sparse probabilistic maps are produced by estimating camera states, features, and associated uncertainties. PTAM (parallel tracking and mapping) is a monocular algorithm also used in augmented reality. It often does not deliver full maps but instead produces small maps that include a single physical object in the environment. Both MonoSLAM and PTAM were developed in the last decade and are now considered outdated, which is why they were discarded. Figure 4 summarizes a classification based on the method of extracting information from the image sequence and the type of map produced. Sparse maps consider small subsets of pixels, precisely those associated with the detected features and their neighborhood. In contrast, dense maps use most or all of the pixels in the image. Two of the most representative vSLAM algorithms for dense mapping are DTAM (dense tracking and mapping) and RTAB-Map (real-time appearance-based mapping); while RTAB-Map is classified as an indirect method, DTAM is considered a direct one. In indirect methods, features of interest are first extracted from the image, and from them, the camera is located and the map is constructed. In direct methods, by contrast, the intensity of the pixels is considered without a preliminary stage of feature extraction.
An important matter to consider is that DTAM is designed for small and localized scenes or spaces [20], while RTAB-Map supports large-scale, long-term 2D and 3D mapping. Another direct method worth highlighting is LSD-SLAM (large-scale direct monocular SLAM), which considers areas with a higher intensity gradient than those considered by DTAM. In this sense, LSD-SLAM ignores regions with low or no texture, for which it is difficult to estimate depth; for this reason, it is considered a semidense method [21]. A more recently proposed algorithm is DSO. This approach seeks to minimize the photometric error of the set of pixels between images, instead of minimizing the geometric back-projection error, and performs a joint optimization of all model parameters [22,23]. ORB-SLAM2 is an extension of its predecessor ORB-SLAM [24] that supports not only monocular but also stereo and RGB-D cameras (Table 3). ORB-SLAM2 includes map reuse, loop detection, and relocation capabilities. It operates in real time on standard processing systems and in a wide variety of environments, from small indoor enclosures with short sequences of images captured with a handheld RGB camera, through industrial environments with sequences captured from drones, to large sequences obtained from automobiles driving in urban environments [25]. Based on the above, the methods selected to conduct the experiments are ORB-SLAM2 in its monocular component [26] and RTAB-Map in its RGB-D component, which are highlighted with green circles in Figure 4, following the criteria considered in [11].

Results and Discussions
Visual SLAM experiments are conducted using the selected methods (RTAB-Map and ORB-SLAM2), and their advantages and disadvantages are weighed. The experiments were conducted in an indoor office environment, whose floor plan is shown in Figure 5. The enclosure is divided into cubicles (A to N), and there is only one access door, located at one end of the enclosure. For the robot's path, cubicles H and J and the section of the corridor that connects them were considered. Figure 6 shows the test environment for the ORB-SLAM2 algorithm in its monocular component, together with its dimensions. It corresponds to the desk located in cubicle J of the office space in Figure 5. Some of the possible positions adopted by the camera during the experiment are indicated, from the starting point to the endpoint.

ORB-SLAM2 Hand-Held Experiment.
The experiment is conducted by holding the camera with one hand and following a trajectory close to the one indicated by the orange dashed line in Figure 6. This type of SLAM experiment is known in the literature as hand-held reconstruction [27,28]. During the path, the camera was aimed at the desk at all times, at a height of approximately 130 cm with respect to the floor. Different objects were placed on the desk to provide texture to the algorithm and the possibility of detecting and tracking different features. A round-trip path was designed, starting at the point labelled start, on the far left of Figure 6, advancing to the point labelled end, and finally returning to the starting point. Figure 7 shows the working environment for this experiment, which includes a desktop and different objects placed on it, providing texture and a variety of features to the algorithm. The experimental setup for the SLAM experiment of the considered environment is shown in Figure 8. In this case, the moving platform is not required since the camera moves manually over the scene. Figure 9 shows the map displayed by the ORB-SLAM2 algorithm using its monocular component, according to the criteria of [25]. The library can be compiled with or without ROS; in this experiment, it is used in combination with this middleware. To run ORB-SLAM2 with ROS, three terminals with remote access to the NUC PC via SSH are opened from the workstation PC; thus, it is possible to monitor the process from the workstation PC (Figure 8). Subsequently, the commands in Table 4 are entered in the remote terminals. The first two commands are straightforward: the first one executes the ROS Master, and the second one launches usb_cam_node, a ROS driver for USB cameras, in this case the Logitech C270 device shown in Figure 8. The last command launches the Mono node of the ORB-SLAM2 package, which starts mapping the environment from the sequence of images provided by the camera.
In this way, a real-time monocular SLAM process is implemented. The command requires two parameters: the first indicates the location of the bag-of-words vocabulary contained in the ORBvoc.txt file, and the second indicates the location of the ASUS.yaml file containing the camera parameters.
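Based on the description above, the three commands can be sketched as follows (device paths and the exact vocabulary and settings file locations are assumptions and depend on the local installation; the standard ORB-SLAM2 ROS invocation is `rosrun ORB_SLAM2 Mono PATH_TO_VOCABULARY PATH_TO_SETTINGS_FILE`):

```shell
# Terminal 1: start the ROS Master
roscore

# Terminal 2: launch the USB camera driver (here, the Logitech C270)
rosrun usb_cam usb_cam_node _video_device:=/dev/video0

# Terminal 3: launch the monocular node of ORB-SLAM2 with the
# bag-of-words vocabulary and the camera calibration file
rosrun ORB_SLAM2 Mono ORBvoc.txt ASUS.yaml
```

Note that the Mono node subscribes to the image topic published by the camera driver, so the topic names must match (remapping may be required depending on the driver configuration).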
The mapping starts by extracting ORB features from the sequence of images to build an initial map (map initialization) on which to estimate the camera position. In the first two images of Figure 9(a) (upper left corner), we can see how the system detects a set of relevant features of the environment, which is indicated by the green line segments on the observed objects. At this stage, the GUI (graphical user interface) of the system displays the message "trying to initialize" in the image flow window. Once the algorithm manages to build an initial map, it proceeds to estimate the location of the camera on it. The system is now in SLAM mode, which is indicated by displaying the message "SLAM mode" in the image flow window. In addition, the feature signaling changes to small diamonds enclosed by squares, as seen in the third image of Figure 9(a). Each point p_j of the ORB feature cloud stores the following [24]: its three-dimensional position x_wj in the world coordinate system; the viewing direction n_j, which corresponds to the mean unit vector of the rays joining the point with the optical centers of the keyframes that observe it; a representative descriptor D_j, corresponding to the ORB descriptor whose Hamming distance is minimum with respect to all the keyframes in which the point is observed; and the maximum and minimum distances (d_max and d_min) from which the point can be observed, according to the scale invariance limits of the ORB features.
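The choice of the representative descriptor mentioned above can be illustrated with a small sketch (toy 8-bit values rather than the 256-bit descriptors ORB actually uses): ORB descriptors are binary strings compared with the Hamming distance, and the representative is the observation whose summed distance to all the others is minimal.

```python
# Sketch of representative-descriptor selection for a map point:
# among the descriptors observed in the keyframes that see the point,
# pick the one with minimum total Hamming distance to the others.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

def representative(descriptors):
    """Descriptor minimizing the summed Hamming distance to all others."""
    return min(descriptors, key=lambda d: sum(hamming(d, e) for e in descriptors))

# Toy 8-bit "descriptors" of one map point observed in three keyframes
obs = [0b10110100, 0b10110110, 0b00110100]
rep = representative(obs)
print(f"{rep:08b}")  # the middle-most observation wins
```

In the real system, this selection keeps descriptor matching against the map cheap: each point is matched using a single descriptor rather than all of its observations.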
Each keyframe stores the following [24]: its pose T_iw, which corresponds to a rigid body transformation of the world points to the camera coordinate system; the intrinsic parameters of the camera, which include the focal length and the coordinates of the principal point; and all ORB features extracted from the image, whether or not associated with a map point, whose coordinates are undistorted or, if distorted, for which a distortion model is available. ORB-SLAM2 is a GraphSLAM method in which closed-loop detection is performed based on the optimization of the position graph (Figure 4). The optimization problem of the 3D structure, the camera position, and its intrinsic parameters is solved by employing iterative nonlinear optimization, which aims to minimize the distance between the back projection of the 3D model and the associated points in the image [29]. In the position graph, the keyframes correspond to the nodes and the camera trajectories correspond to the arcs (Figure 10). Figure 11 establishes the correspondence of some objects in the scene with the sparse point cloud or map.
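The position graph described above can be sketched as a simple data structure (this is only an illustration of the graph's shape, not the optimization back-end that GraphSLAM systems run over it):

```python
# Minimal position-graph sketch: keyframes are nodes storing a 2D pose
# (x, y, theta), and arcs store the relative transform between keyframes.
# Odometry links consecutive keyframes; a loop closure adds an extra arc
# between non-consecutive ones, which the optimizer uses to correct drift.

class PoseGraph:
    def __init__(self):
        self.nodes = {}   # keyframe id -> (x, y, theta)
        self.edges = []   # (id_from, id_to, relative transform, kind)

    def add_keyframe(self, kf_id, pose):
        self.nodes[kf_id] = pose

    def add_edge(self, i, j, rel, kind="odometry"):
        self.edges.append((i, j, rel, kind))

g = PoseGraph()
g.add_keyframe(0, (0.00, 0.0, 0.0))
g.add_keyframe(1, (0.86, 0.0, 0.0))
g.add_keyframe(2, (1.72, 0.0, 0.0))
g.add_edge(0, 1, (0.86, 0.0, 0.0))
g.add_edge(1, 2, (0.86, 0.0, 0.0))
# Revisiting the start adds a loop-closure constraint back to node 0
g.add_edge(2, 0, (-1.72, 0.0, 0.0), kind="loop_closure")
print(len(g.nodes), len(g.edges))
```

Optimizing such a graph (e.g., with iterative nonlinear least squares, as done by back-ends like g2o) adjusts the node poses so that all arcs, including the loop closure, are satisfied as well as possible.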

RTAB-Map Hand-Held Experiment.

The same "L"-shaped desktop environment (Figure 5) was considered to perform hand-held mapping experiments with RTAB-Map. In this particular case, a Kinect sensor was used, which delivers both depth and RGB images. Figure 12(a) shows one of the 3D maps obtained, highlighting the dense point cloud that gives a realistic appearance to the model. The pose graph is shown in Figure 12(b), together with the initial and final coordinates of the frame linked to the sensor.
The graph nodes store camera odometry, along with visual information from depth images, RGB images, and SURF features quantized into an incremental visual dictionary (visual words), used by the closed-loop detection algorithm [30,31]. The algorithm allows establishing matches between the current camera location and previous locations of already visited sites by employing the bag-of-words method [32]. Figure 13 shows the visual odometry and closed-loop detection implemented by the RTAB-Map method.
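The bag-of-words matching mentioned above can be illustrated with a toy sketch (histograms and the similarity measure here are illustrative; RTAB-Map's actual implementation uses an incremental dictionary with tf-idf-style likelihood scores):

```python
# Sketch of bag-of-words place matching for loop-closure detection:
# each image is summarized as a histogram of "visual word" occurrences,
# and the current image is compared against stored places by cosine
# similarity between histograms.

import math

def cosine(h1, h2):
    """Cosine similarity between two word histograms."""
    dot = sum(a * b for a, b in zip(h1, h2))
    n1 = math.sqrt(sum(a * a for a in h1))
    n2 = math.sqrt(sum(b * b for b in h2))
    return dot / (n1 * n2)

# Word histograms over a 5-word dictionary for three stored places
places = {
    "A": [4, 0, 1, 3, 0],
    "B": [0, 5, 2, 0, 1],
    "C": [4, 1, 1, 2, 0],
}
query = [3, 0, 1, 3, 0]  # histogram of the current image

best = max(places, key=lambda p: cosine(places[p], query))
print(best)  # the stored place most similar to the current view
```

When the best similarity exceeds a threshold, the matched place is proposed as a loop-closure candidate and verified geometrically before a constraint is added to the graph.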

Dispersed Mapping Experiment with the RobSLAM System.

Figure 14 shows the working environment for the visual SLAM mapping experiments conducted with the RobSLAM system. Cubicles H and J of the enclosure in Figure 5 were considered. The designed path, which consists of three sections, is indicated in green: the first section is 0.86 m long, with an orientation of 150° with respect to a line parallel to the outer edge of the cubicles; the second section runs along that parallel, at a distance of 1.02 m from the edge, and is 2.55 m long; and the third section is symmetrical to the first and of the same length.
Monocular SLAM systems estimate depth by triangulating corresponding key points in consecutive images, thus requiring lateral camera displacements that generate baselines of sufficient length for depth calculation (Figure 15). For this reason, they tend to fail when the robot moves only forward or backward [33]. To allow lateral displacements, the camera was located on the right side of the mobile robot, as shown in Figure 14.
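The dependence on the baseline can be made concrete with the standard triangulation relation Z = f·B/d, where f is the focal length in pixels, B the baseline created by the lateral displacement, and d the disparity of a matched key point. The numeric values below are illustrative only:

```python
# Sketch of depth from triangulation: a lateral displacement B between
# two views produces a disparity d for a matched key point, from which
# depth follows as Z = f * B / d. With no lateral motion, B -> 0, the
# disparity vanishes, and depth becomes unobservable.

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    if disparity_px <= 0:
        raise ValueError("no disparity: depth is unobservable")
    return focal_px * baseline_m / disparity_px

# 600 px focal length, 10 cm lateral displacement, 12 px disparity
z = depth_from_disparity(600.0, 0.10, 12.0)
print(z)
```

The same relation explains why purely forward motion is degenerate: points near the optical axis produce almost no disparity, so their depth cannot be triangulated reliably.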
The robot was teleoperated along the trajectory shown in Figure 14 using the hardware designed for this purpose, which is shown in Figure 2 of the Materials and Methods section. Figure 16 shows some stages of the map construction process with ORB-SLAM2. Figure 16(a) shows the beginning of the map construction (right); it was necessary to attach a printed chessboard pattern to the first partition observed by the camera (left). In fact, given the lack of texture of the partitions found in the environment, it was essential to place this type of pattern on several of them; otherwise, the algorithm would not identify a sufficient number of features to perform the tracking. As the robot teleoperation progresses, the trajectory of the camera and the 3D map are simultaneously estimated (Figures 16(b) and 16(c)). As some features are left behind, new ones are incorporated for tracking. Sites already visited, whose features are not part of the current scene, appear in black in the point cloud and become part of the global map. The red points in the cloud are features of the local map under construction. Figures 16(d) and 16(e) show the mapping and localization process almost at the end of the tour. It can be seen how the estimation of the camera trajectory fits the designed tour and how the 3D reconstruction reflects, in general terms, the geometrical structure of the environment. The final map is shown in Figure 17, which also shows a checkerboard pattern attached to a desk, adding texture to the scene. Figure 18 shows that although the estimated trajectory of the camera is close to the robot's odometry (pink line), this estimate does not correspond to the robot's tour, which is indicative of the uncertainty in both estimates.
In this section, the ORB-SLAM2 and RTAB-Map algorithms were employed in hand-held mapping experiments, in which the sensor was manually moved through a working environment consisting of an "L"-shaped desk, obtaining three-dimensional maps with sparse and dense point clouds, respectively. The environment provided a sufficient and varied number of features to maintain visual odometry and to elaborate its three-dimensional model [30].
An additional experiment using the RobSLAM system and the ORB-SLAM2 algorithm was conducted in an office building. The robot was teleoperated along two contiguous modules with objects such as chairs, tables, and partitions with wood veneer. In this case, a lack of texture was evident, mainly in the partitions, which caused a frequent loss of visual odometry and the failure of the experiment, even when the robot was moved slightly backward to try to recover the odometry.
One solution to the problem of the lack of texture was to adhere a checkerboard pattern, such as those used in camera calibration, to the partitions. Although feature tracking and visual odometry improved, a second problem was detected: the boards observed in the middle of the tour were considered previously visited sites. Indeed, the boards observed at the beginning of the tour and those observed later were considered closed loops by the parallel loop detection process of ORB-SLAM2. This problem led to false camera location estimates and maps that did not reflect the geometrical properties of the environment. It was solved by rotating some of the boards and adding new objects to the scene. With this, it was finally possible to obtain consistent three-dimensional maps.
Both ORB-SLAM2 and RTAB-Map are GraphSLAM algorithms, from which the following are obtained: (1) a point cloud and (2) a position graph. However, while ORB-SLAM2 produces sparse maps, RTAB-Map delivers a dense point cloud, which implies the need for hardware with higher processing capabilities.
The main advantages of ORB-SLAM2 are as follows: (1) the map can be initialized with only the first pair of images and (2) it is globally consistent, although closing loops can result in large jumps in the position estimate. Its disadvantages include the following: (1) a trained bag-of-words vocabulary is required, (2) the localization mode requires an integrated map and is highly dependent on relocation, and (3) the