Distributed Monocular SLAM for Indoor Map Building

Utilization and generation of indoormaps are critical elements in accurate indoor tracking. Simultaneous Localization andMapping (SLAM) is one of the main techniques for such map generation. In SLAM an agent generates a map of an unknown environment while estimating its location in it. Ubiquitous cameras lead tomonocular visual SLAM,where a camera is the only sensing device for the SLAMprocess. Inmodern applications, multiplemobile agentsmay be involved in the generation of suchmaps, thus requiring a distributed computational framework. Each agent can generate its own local map, which can then be combined into amap covering a larger area. By doing so, they can cover a given environment faster than a single agent. Furthermore, they can interact with each other in the same environment, making this framework more practical, especially for collaborative applications such as augmented reality. One of the main challenges of distributed SLAM is identifying overlapping maps, especially when relative starting positions of agents are unknown. In this paper, we are proposing a system having multiple monocular agents, with unknown relative starting positions, which generates a semidense global map of the environment.


Introduction
Utilization of indoor maps is a critical component of accurate indoor tracking when existing infrastructures such as GPS do not work reliably. Therefore, generating such maps with high accuracy for unknown environments becomes critical in the infrastructure of indoor tracking. Furthermore, such maps may be generated partially by different agents moving in and out of an environment, and unifying these independently and partially generated maps into a highly accurate global map is crucial. The map generation and its utilization for localization can be done in many different modalities. Simultaneous Localization and Mapping (SLAM) is one of the main techniques for such map generation. In modern applications, multiple mobile agents may be involved in the generation of such maps, thus requiring a distributed computational framework.
SLAM is a problem that addresses generating a map of an environment and tracking an agent in the environment. These two tasks are interrelated, since an accurate map is necessary to localize the agent precisely, and only a correctly localized agent can construct a good map. The SLAM problem is also known as the Tracking and Mapping (TAM) problem.
Cameras are becoming a popular choice for SLAM, since they are ubiquitous in smart devices. Furthermore, the smaller form factor and the lower cost of cameras also contribute to this choice. When we use a camera as the input device, the process is called visual SLAM. For visual SLAM, three main types of cameras are used: monocular, stereo, and RGBD. Unlike monocular cameras, stereo and RGBD cameras provide depth data in addition to image data to simplify the initialization and the pose estimation process.
Visual SLAM uses either the direct or the feature-based methods. Direct methods work on the intensity information of images without computing features and generally produce denser maps. Dense maps can be more attractive in certain applications, such as augmented reality, in which a user is interacting with the environment and virtual objects in the environment. It is desirable that this interaction be realistic and seamless. A dense map of the environment makes this interaction possible.

Journal of Sensors
In distributed SLAM, multiple agents perform SLAM in an environment collaboratively. These agents (which are cameras for the purposes of this paper) can enter and exit the environment at any time. If there is a map of the environment, the agents can utilize it to localize themselves in it. If an agent moves in a part of the environment that is not mapped, it can start building the map and localize itself in it as part of the SLAM process. Each agent can do this independently, however, when they are operating in a common environment, it makes sense to use their locally built maps to complement each other. At the same time, they can complete and improve the global map, while helping each other in their respective tasks.
Additionally, using multiple agents to perform SLAM increases the robustness of SLAM process, which makes it more fault tolerant and less vulnerable to catastrophic failures. One of the main challenges in distributed SLAM is to compute map overlaps, especially when agents have no prior knowledge of their relative starting positions. Usually, agents also have limited bandwidth to communicate with each other.
In this paper, we introduce a distributed framework for monocular visual SLAM agents with no prior knowledge of their relative positions.

Related Work
In a seminal paper Smith et al. [1] introduced a solution to the SLAM problem using extended Kalman filter (EKF-SLAM). In their work, the extended Kalman filter is used to estimate the posterior distribution over agent pose and landmark positions incrementally. However, processing a covariance matrix is a significant challenge as it grows with the number of landmarks. The entire covariance matrix has to be updated even when the system observes one landmark. This severely limits the number of landmarks in EKF-SLAM, typically a few hundred. Furthermore, EKF-SLAM has Gaussian noise assumptions. FastSLAM by Montemerlo et al. [2,3] addressed above limitations using a Monte Carlo Sampling (particle filter) based approach. Most importantly FastSLAM supported nonlinear process models and non-Gaussian pose distributions. In more recent work, FastSLAM by Cain and Leonessa [4] uses a compressed occupancy grid to reduce the data usage of each particle by 40%. Pei et al. [5] used distributed unscented particle filter to avoid reconfiguring the entire system during vehicle state estimation. Martinez-Cantin and Castellanos in [6] proposed an Unscented Kalman Filter based approach (UKF-SLAM) to support large scale environments.
Davison et al. [7] introduced MonoSLAM, a SLAM method of capturing the path of a freely moving camera (6 Degrees of Freedom) while generating a sparse map. This monocular visual SLAM method worked in a room-sized environment. The map consisted of image patches representing features. Their solution was a combination of EKF-SLAM for estimation and Particle Filtering (PF) for feature initialization. The entire system is initialized by positioning the camera in front of a marker.
Klein and Murray in [8] presented Parallel Tracking and Mapping (PTAM), one of the most significant solutions for visual SLAM. This robust SLAM solution mainly focused on accurate and fast mapping in a similar environment to MonoSLAM. Its implementation decoupled mapping and localization into two threads. The front-end thread only performs pose estimation and feature tracking while the backend thread performed mapping and everything else, such as feature initialization and removing unnecessary key frames. A set of sparse point features represented the map. The system is initialized by moving the camera roughly 10 centimeters perpendicular to the optical path. RANSAC [9] and 5-point algorithm [10] initialized the system. A global bundle adjustment (BA) [11] with Levenberg-Marquardt optimization [10] adjusted the pose of all key frames. Furthermore, a local BA changed the pose of a subset of key frames to allow a reasonable rate of exploration.
Although MonoSLAM and PTAM address the same problem, PTAM used BA in contrast to MonoSLAM's incremental approach. BA is heavily used and proven to work well for offline Structure from Motion (SfM). Even though BA is relatively computationally expensive, PTAM and other researchers recently adopted BA for many real-time monocular visual SLAM solutions. Strasdat et al. 's analysis in [12] showed that increasing the number of image features acquired per frame is more beneficial than incorporating information from increased number of closely placed camera frames. They argue that the former increases the accuracy of the motion estimation and a better map estimation for a given computational budget. Their analysis hence favors bundle adjustment techniques over incremental methods for accurate monocular visual SLAM. Moreover, BA helps to increase the number of features on the map, leading to denser maps.
Scale drift is one of the biggest challenges in monocular visual SLAM. Strasdat et al. [13] introduced a pose graph optimization technique that corrects the scale drift at loop closures. Their method handled large looped trajectories well.
The work by DTAM by Newcombe et al. [14] and LSD-SLAM by Engel et al. [15,16] utilize image pixel intensities directly instead of computed features for SLAM. Their systems generate dense or semidense maps of the environment. Furthermore, these direct methods are more robust to motion blur of images.
During the SLAM process, an agent might revisit the same location in multiple instances. Error accumulation can lead this to go unnoticed. The solution for this problem is referred to as loop closing. This could be done using appearance based image-to-image, map-to-map, or map-toimage matching approaches. The survey from Williams et al. [17] concludes with positive remarks on map-to-image approaches.

Distributed SLAM.
One of the challenges in generating a globally consistent map is identifying map overlaps of agents. It is relatively easier to determine map overlaps if all of the agent relative poses are known at all times. For example, Nettleton et al. [18] used global positioning sensors (GPS) to detect agent locations. When the agent position is known, it is only a matter of doing a proximity check between agent trajectories to detect their map overlaps. However, location Journal of Sensors 3 sensors like GPS are not always available, and they do not do well in indoors, nor in underwater vehicles.
The relative transformations between coordinate systems of agent maps can be computed if the starting position of each agent is known. Paull et al. [19] initialized all agents from known locations. Next, agents performed SLAM and estimated their new locations, while at the same time communicating their locations to each other. Given that the agents already knew the transformation between their maps, they were able to easily determine map overlaps, similar to the case in having location sensors.
When these agent relative locations are unknown, the distributed SLAM problem becomes more challenging. In some contributions, agents continued to build local maps until they saw each other. Howard et al. [20] proposes a method in which each agent would be able to detect other agents. Agents use these coincidental encounters to find their relative locations. Dieter et al. in [21] presents a method where each agent is actively seeking other agents in the environment to find relative locations between them. These methods either require special sensors to be seen by each other or to actively seek each other.
Some methods heavily depend on a central node. Zou and Tan in [22] allowed cameras to move independently in a dynamic environment. However, all of their cameras were initialized from the same scene and connected to the same computer. Although their cameras were distributed, all frames were processed at the same time in their SLAM process. This tightly couples agents, since agents do not possess any knowledge of the environment individually.
The multiagent system by Forster et al. in [23] used a centralized ground station for mapping, loop closure detection, and map merging. However, relying on a central agent is highly prone to failures, especially when the central node fails.
In [24], agents used a master-slave approach where the slave is always in the master agent's map to maintain a map overlap. Their method restricts the free movement of the slave agent.
Williams et al. in [25,26] introduced a method to construct a global map from a multiagent system using a Constrained Local Submap Filter (CLSF). In their method, overlaps between the local and global maps are determined using a Maximal Common Subgraph (MCS) method. Kin and Newman in [27] use a visual similarity matrix to determine relative agent locations. A subsequence of visually similar images is detected from the images captured in each agent. Cunningham et al. in [28,29] formulate the distributed SLAM problem using a graphical model. In their fully decentralized system, each agent maintained a consistent local map, augmented with information shared in a neighborhood of agents. In [30], authors proposed a fully distributed solution for the data association problem.
In the system proposed in [30], local feature matches are propagated through the low-bandwidth communication network. This method helps agents to find global correspondences with other agents with no direct connections. Work done in [18] discusses most informative features to transmit, to reduce bandwidth requirements. Our proposed framework performs distributed SLAM with no knowledge of the initial agent locations. Furthermore, the agents do not get their location directly from sensors like GPS. Instead, they estimate the location using a visual SLAM process. Moreover, the framework does not rely on a central agent, but rather a network of monitoring agents that look for map overlaps of SLAM agents.
We used the experimental framework for distributed SLAM we introduced in [31], during the development of this framework.

Nodes of the Distributed Framework.
Our distributed SLAM framework consists of two types of nodes, the exploring node and the monitoring node. Each node is deployed in its own physical machine. At any given time, the framework has at least one monitoring node and an arbitrary number exploring nodes. Each node is identified using a global unique identifier.
A connection between nodes is made when nodes are required to communicate with each other over the network. As shown in Figure 1, the exploring nodes are connected to a monitoring node. Furthermore, two exploring nodes become connected when their maps overlap with each other.

Exploring Node.
Each exploring node performs a semidense visual SLAM by using a camera as the only sensor, based on the work by [16]. Our choice is based on the fact that denser maps describe the environment in more detail, compared to the sparse, feature-based maps. Denser maps enable better interaction with the environment, especially in AR applications. This also means exploring nodes that communicate more data, compared to feature-based methods. Each exploring node maintains a set of key frames and a pose graph to represent the map. It periodically sends out its map information to the monitoring node, as well as to the other exploring nodes to which it is connected. Furthermore, it processes incoming commands from the monitoring node.

Monitoring
Node. The monitoring node's responsibilities include map overlap detection between the exploring nodes and loop closure detection in each exploring node. It maintains a feature store in which all salient features are stored. Features from all incoming key frames are matched against the feature store to identify map overlaps and loop 4 Journal of Sensors (1) procedure getUniqueID (key frame id, node id) return id ⊳ A globally unique identifier (5) end procedure Algorithm 1: Unique identifier for a key frame. closures. The monitoring node uses a graph, which will be referred to as the fusion graph in this paper, to prioritize and issue commands to merge overlapping maps.
In advanced configurations of our proposed distributed framework, there could be multiple monitoring nodes. Ideally, each monitoring node connects to overlapping exploring nodes. In practice, monitoring nodes may move exploring nodes among themselves dynamically to minimize the number of overlapping exploring node clusters. Map overlap detection between two exploring nodes belonging to different monitoring nodes is accomplished by sharing features between monitoring nodes. In this paper, our experiments are limited to a single monitoring node configuration. Figure 2, the exploring node maintains a map of the environment using multiple key frames and a pose graph.

Key Frames.
The th key frame, K , consists of an absolute pose ∈ R 7 , an image , a map containing coordinate reciprocals corresponding to nonnegligible intensity gradient pixels (an inverse depth map), an inverse depth variance map , and a list of features . The absolute pose is encoded with a translation, along with orientation and scale parameters using a quaternion. The elements of are the three components of the translation and the four components of the quaternion representing the rotation. The scale is represented using the magnitude of the quaternion. When K is first introduced into the pose graph, the features of K are computed. In K , corresponds to a 32-bit globally unique identifier computed using Algorithm 1. Figure 3 contains a visual representation of two different key frames.
We used SURF [32] features and SIFT [33] descriptors in our framework. Because we computed features only for the key frames, the added computational cost that resulted did not adversely affect the real-time performance.

Pose Graph. Pose graph edges
contain similarity transformations and Σ constraints. Here ∈ R 7 and Σ are relative pose transformations and corresponding covariance matrix between th and th the key frames, respectively.
Both absolute pose and similarity transformation are encoded with a translation (three components) and orientation with scale using a quaternion (four components).

Generating and Updating the Map.
The SLAM process simultaneously tracks the camera against the current key frame K and improves its and based on its new observations. Once the camera deviates significantly from the K , either a new key frame is created or, if available, an existing key frame is selected from the map. Next, if a new key frame is created, the previous key frame used for tracking is inserted into the pose graph. The pose graph is continuously optimized in the background. More information on the LSD-SLAM process is found in [15].

Identifying Salient Features.
To determine the saliency of a feature, first the feature is filtered for its ( ), where is the location feature found. The th feature in K should satisfy where is a threshold computed empirically. Only salient features are kept as . We experimented with different values for to minimize the number of features exchanged, while still achieving sufficient map overlap detection. We found 0.001 to be a good value with satisfactory results.

Sending Salient Features to Monitoring
Node. For every salient feature in , the corresponding 3D location and the descriptor are computed. Next, the key frame identifier , the salient features ( , ) , and the pose are sent to the monitoring node.
The communication process between the nodes is explained in more detail in Section 3.6.2.

Map Overlap Detection in Monitoring Node.
Exploring nodes of our distributed framework do not know their relative pose at the beginning. The monitoring node is responsible for detecting map overlaps and computing corresponding relative pose between nodes.

Feature Store.
Each entry in the feature store contains a feature descriptor , key frame identifier , 3D feature location , and key frame pose . Every incoming feature descriptor is matched against the entries in the feature store, to identify common features between two key frames. We used the FLANN [34] method for feature matching. We used a search radius of 150, which was calculated empirically. We divide each key frame into a grid of 16 equal sized cells. If features are matched in at least 10 cells with another key frame K , it is concluded that there is an overlap between key frames K and K . Our approach expects the indoor environment to contain sufficient textured regions of overlap to function properly. It fails if the overlapping region contains only scenes like texture-less walls.
If overlapping key frames belong to the same exploring node, it is considered that a loop closure is found. Otherwise, matching information contributes to the fusion graph.

Fusion
Graph. All available exploring nodes are represented as vertices in the fusion graph as shown in Figure 4. Assume there is an overlap between key frames K and K and K ∈ K and K ∈ K , where K represent key frames in th exploring node. Then, the fusion graph contains an edge between and . The number of features matched between and is represented using as shown in Figure 4. Note that the edge between and could represent matching features between many different key frame pairs. If the direction of  In this example, is higher than other edges (indicated by the thicker edge), so and are merged first. Furthermore, the map merging process is initialized by sending 's map to following the direction of the edge. the edge is → , map merging process is initialized from . As shown in Figure 5, first sends the map to . Then, merges the map and sends its original map to . merges the received map and notifies the monitoring node about the completion of map merging process. Assume that the fusion graph edge having the largest satisfies where is an empirical threshold. Then the monitoring node concludes that a map overlap exists between exploring nodes and . Empirically, 120 shared features are found to be a good value for . Next, the rigid transformation between Exploring node e j Exploring node e k Figure 5: The map merging process of the fusion graph edge → . and , , is computed using a Singular Value Decomposition (SVD) based on the least squares method [35]. Similar to the absolute pose representation is encoded using 3 components of translation and 4 components of quaternion. The scale is initially assumed to be 1 and a proper value is estimated later, during the map merging followed by pose graph optimization in each exploring node. of all relevant features between and are used for the computation. The RANSAC algorithm [9] is used to make the computation robust to outliers. Figure 3 shows a set of matched features between two key frames, K and K .

Issuing Commands to Exploring Nodes. A map merge
command is issued to exploring nodes and . The command contains the relative pose between two nodes. The command also contains the key frame correspondences used to compute the relative pose between and . Similarly, a loop closure command is issued to an exploring node , when both overlapping key frames K and K belong to . The command contains the relative pose between key frames, which is also computed using the same least squares method [35].
The communication process is explained in more detail in Section 3.6.2.

Merging Maps of Two Exploring Nodes.
First, as shown in Figure 1, a connection is created between two exploring nodes. Once the connection is made, each exploring node sends its map to its counterpart. Once the map is received, the key frame correspondences found in the map merge are directly transformed into new constraints between pose graphs of and . The similarity transformation of the constraint is computed using key frame pose and relative pose between exploring nodes. Figure 6 shows how and were generating their own maps before merging. Figure 7 shows the resulting merged map for . Once map merging is complete, each exploring node listens to its counterpart for new key frames and the pose graph, to incrementally update its map.

Loop
Closure. In most instances, completing smaller loop closures increases the robustness of tracking. Completing large loop closures, however, has more impact in generating an accurate map. Direct semidense SLAM operations alone do not support large loop closures.
Upon receiving a loop closure command with , the exploring node checks whether K and K are consecutive  key frames in the pose graph. If that is the case, we discard the loop closure command since K was constructed using K and already has a better estimate for the edge . Otherwise,  3.6. System Implementation. We use the Robot Operating System (ROS) infrastructure to implement the distributed SLAM framework described in this paper [36]. A ROS node is responsible for performing computations. ROS also provides a message passing communication framework between nodes. Nodes in our framework are implemented as ROS nodes.
In its communication framework, ROS provides named communication buses called topics. Multiple nodes can publish messages to a topic while multiple subscribed nodes could receive them. ROS nodes can communicate with each other peer-to-peer via topics. In our framework, communication channels are implemented using ROS topics. Figure 8 shows components of the distributed framework and the communications between nodes. The exploring node consists of five main modules: input stream, tracking, mapping, constraint search, and optimization modules. Each of these modules runs in its own thread and it contains the map.

Components of the Distributed Framework.
The input stream module accepts all incoming messages including image frames, key frames, pose graph, map, and commands. All image frames are transferred to the tracking module. Key frames, pose graph, and map are transferred to the optimization module so that they can be merged into the map before an optimization iteration. Commands are processed in the input stream module itself.
The tracking module accepts the new frame from input stream module and tracks it against the current key frame. If the current key frame can no longer be used to track the current frame, a new key frame is created. The old key frame will be added to the map by the mapping module. The constraint search module is used to recover from tracking failures. The optimization module continuously optimizes the pose graph in the background.
The monitoring node maintains the feature store and the fusion graph as explained in Section 3.3.

Communication between Nodes.
The distributed framework provides multiple communication channels for nodes. These communication channels are shown as arrows in Figure 8.
Upon creating new key frames, exploring nodes send salient features and the absolute pose of the key frame through the features channel. The monitoring node receives them and processes them to issue commands through the commands channel. The command could be either a merge command or a loop closure command.
When an exploring node receive a merge command, it creates multiple channels with the other exploring node. The map channel is used to exchange a map between each other. This channel ceases to exist once the map is transferred. And key frames and pose graph channels are created between these nodes.
For every new key frame, its information is written into the key frames channel. After every pose graph optimization, the pose graph information is written into pose graph channel. All exploring nodes that are using these channels incrementally update their maps after merging.

Public Datasets.
To evaluate our framework, we need a monocular visual SLAM dataset, with multiple trajectories covering a single scene. We considered publicly available datasets, but they did not satisfy our requirements. For example, the dataset EuRoC [37] contains pure rotations, which did not work well with the monocular SLAM approach we used. The dataset Kitti [38] is mainly a stereo dataset. Even when we considered a single camera, the direct monocular SLAM process failed since the camera motion is along the optical axis. The TUM-Mono [39] dataset does not provide the ground truth for all frames and is primarily suitable for 8 Journal of Sensors  evaluating single agent SLAM. That said, we created the DIST-Mono dataset to evaluate our framework. We also made it publicly available (http://slam.cs.iupui.edu).

Experimental Setup.
Our experimental setup is designed to precisely define the ground truth of a camera motion. As shown in Figure 9 we mounted a Point Grey Firefly MV global shutter camera onto a Computer Numeric Controller (CNC) machine. We also prepared a 1 m × 1.5 m scene containing wooden objects. We then moved the camera along a path for about four minutes each time, while capturing its location ground truth periodically. Our in-house built, 3-axis CNC machine is controlled using an open-source controller called TinyG. The controller converts the provided trajectory from the gcode file format into linear synchronized movements along , , and axes. The maximum travel volume of the machine is 1 m × 1 m × 0.3 m ( × × ). We also prepared a 1 m × 1.5 m scene containing wooden objects. The scene is uniformly lit by two 4-feet long LED tube lights.
To collect datasets, we moved the camera at a speed of 25 mm/s, along a path for about four minutes each time, while capturing its location ground truth periodically. We captured 640 × 480 resolution camera frames at 60 Hz and ground truth at 40 Hz. The CNC machine has 0.2 mm accuracy in all three axes. We developed an open-source ROS node (http://github.com/japzi/rostinyg) to capture the ground truth from the TinyG CNC controller.
During experimentation we played back the datasets at twice the speed it was recorded (2x), effectively making the camera movement to be 50 mm/s.

The DIST-Mono Dataset.
The dataset consists of five subdatasets. We defined three camera motion paths: Path A, Path B, and Path C. All of these paths are on a plane S01-A-0, S01-C-0 0.0121 slanted above the scene as shown in Figure 10(a). These paths have roughly 10% overlap with each other and three different starting points per path. We generated two datasets using Path A by rotating the camera around its -axis. In S01-A-0, the camera optical axis and scene axis are on a vertical plane. In S01-A-P20, we rotated the camera around its -axis by 20 ∘ . This is illustrated in Figure 10(b). Similarly, we created datasets S01-B-0, S01-B-N20, and S01-C-0 as shown in Table 1.

System Evaluation.
As an experiment, we deployed two exploring nodes and one monitoring node in three different machines. One exploring node processed the dataset SCENE-A-0 and the other the dataset SCENE-B-N20. After map merging, each exploring node exported its key frame poses in TUM dataset [40] pose format. Most importantly, these poses contain key frames from both exploring nodes. We then computed the absolute translation RMSE [40] against the ground truth. To support the nondeterministic nature of the distributed framework, we ran the experiment five times and the median result is recorded. Similarly we performed two more experiments with other combinations of datasets as shown in Table 2. Given the fact that monocular visual SLAM systems do not capture the scale, we manually calculated that in all experiments to minimize the RMSE error. Figure 11 shows how estimated key frame positions are compared against ground truth in experiment 3. The red circles in the figure display the estimated key frame position, whereas the black circles display the ground truth of said key frame position. Red lines show the difference between the estimated and the ground truth positions of the key frame.

AR Application: Adding and Viewing Virtual Objects.
We developed an AR application to test our framework. We added an AR window to each exploring node. The AR window allows users to add virtual objects (simple cube in our example) into its map. This allows us to demonstrate the collaborative AR potential of the distributed SLAM framework, in which (i) each agent is able to view the augmented scene from its viewpoint and (ii) if it is in an unexplored part of the scene, generate its own local map and contribute it to the global map. We also added a relevant channel to share the virtual object information between exploring nodes. Given that the relative transformation between nodes is known for connected exploring nodes, these cubes are placed correctly in the map. Figure 12 shows AR windows of two exploring nodes and two interactively added cubes. 4.6. Discussion. Our framework relies on image features for map overlap detection. As explained in Section 3.2.5, features ( , ) and the pose of each K are processed by monitoring nodes for this purpose. In addition to the current semidense monocular SLAM method, many key frame based SLAM methods using stereo or RGBD sensors could easily be adapted into exploring nodes, given the fact that those methods can produce required data with minimal effort. However, other sensors, such as LiDAR, do not provide the data appearance based features required in monitoring node. Therefore, they require a major change in the monitoring node to function properly.

Conclusions
We introduced a distributed monocular SLAM framework that identifies map overlaps based on an appearance based method. Most importantly the framework computes relative transformations of local maps of SLAM nodes, even when their relative starting positions are unknown. We demonstrated empirically that the distributed framework we developed successfully generated the map using multiple agents and localized them in the environment with little amount of error. We achieved a pose location RMSE between 0.0097 m and 0.0136 m for experiments that were conducted in a 1 m × 1.5 m scene. Each node traveled about 4 m on average. We developed an empirical set up that generated the data set with associated ground truth to be used for extensive evaluation and validation of the distributed SLAM method. This data set has been made publicly available for other researchers to use. We also developed an augmented reality application to showcase how two nodes can use our framework to interact with the shared global map.
Our next steps would be to improve exploring node's SLAM process by incorporating features in pose graph optimization. That would help greatly in supporting public datasets as well. Furthermore, we will evaluate ORB descriptors instead of SIFT descriptors to improve performance and reduce the network bandwidth usage.