Vision-Based Semantic Unscented FastSLAM for Indoor Service Robot

This paper proposes a vision-based Semantic Unscented FastSLAM (UFastSLAM) algorithm for mobile service robots that combines semantic relationships with the Unscented FastSLAM. The landmark positions and the semantic relationships among landmarks are detected by a binocular vision system. The semantic observation model is then created by transforming the semantic relationships into the semantic metric map. Semantic Unscented FastSLAM can update the locations of the landmarks and the robot pose even when the encoder accumulates large errors that cannot be corrected by the loop closure detection of the vision system. Experiments demonstrate that the Semantic Unscented FastSLAM algorithm achieves much better performance in indoor autonomous surveillance than Unscented FastSLAM.


Introduction
Visual simultaneous localization and mapping (SLAM) uses cameras as the only exteroceptive sensors to recover a representation of the environment and to localize the robot, complemented with information from the proprioceptive sensors to increase accuracy and robustness. For mobile robots, vision has proved to be an effective and inexpensive sensing modality for localization and mapping. Sim et al. solved the SLAM problem with a stereo pair of cameras [1,2]. Schleicher et al. used a top-down Bayesian method to perform a vision-based mapping process in which identification and localization of natural landmarks from the images were provided by a wide-angle stereo camera [3]. In this paper, a new semantic vision SLAM framework is proposed to improve performance without dramatically increasing the complexity of the algorithms.
The literature on visual SLAM has focused on feature-based SLAM, where a feature can be described by points with a 2D position (SIFT [4], SURF [5]) or a 3D position [6,7], or by edge segments [8,9]. However, feature extraction from natural visual scenes depends heavily on the environment, where only sparse features may be found. These features can occasionally be too few to fully constrain the pose of the robot. Hence appearance-based SLAM was proposed to represent the recorded images of the environment with prominent features as a whole [10]. Morita et al. reported another novel appearance-based localization approach for outdoor navigation with feature or object learning, recognition, and classification using SVM [11]. However, the use of rich sensory information in these appearance-based SLAM solutions results in very time-consuming computation, especially for larger-scale environments. To allow real-time operation in more moderately sized environments, one method was proposed to observe the interframe motion of every other corner feature in a visual odometry style [12,13]. Some researchers have also proposed discovering and incorporating higher level map structure in the form of lines [14] and planes [15,16].
Different kinds of maps have been applied in SLAM. Metric maps capture the geometric properties of the environment, whereas topological maps describe the connectivity between different locations [17]. Topological maps can represent the environment as a list of significant places, which simplifies the problem of large-scale mapping [18]. However, one limitation of the topological representation is the lack of metric information. The strategy of mixing metric and topological information in a single consistent model was therefore proposed [19]. Fernández et al. also developed a hybrid metric-topological algorithm to build a metric map while maintaining a topological graph and to detect loop closures [20]. Thrun and Buecken combined grid-based and topological methods to map indoor robot environments [21]. Such hybrid algorithms take advantage of local metric grids for enhanced local planning while avoiding the computation of a complete global grid map. However, these maps are very limited in describing the environment beyond distinguishing between occupied and empty areas.
To capture richer information about the environment, semantic mapping has recently become a research topic. Wolf and Sukhatme proposed a semantic classification method based on HMMs and SVMs to tackle the problems of terrain mapping and activity-based mapping [22]. Ranganathan and Dellaert described a technique to model and recognize places using objects as the basic semantic concept [23]. Yi et al. proposed a semantic representation and Bayesian model for robot localization using spatial contexts among objects [24,25]. This paper takes advantage of the semantic relationships of features in the visual SLAM framework.
Early work on SLAM was done by Smith et al., who applied the Extended Kalman Filter (EKF) [26]. Later, Doucet et al. introduced the Rao-Blackwellized particle filter (RBPF) as an efficient solution to the SLAM problem, which is also called FastSLAM [27]. The Unscented FastSLAM algorithm was then proposed to overcome the drawbacks of FastSLAM by applying the scaled unscented transformation (SUT) in place of the linearization in the FastSLAM framework [28]. The SLAM solution in this paper is based on Unscented FastSLAM.
Hence the main contribution of this paper is a novel Semantic Unscented FastSLAM algorithm that improves the accuracy of localization and mapping while maintaining a sparse map for real-time implementation. The semantic relationship and the topological metric map are combined to form a new kind of map for SLAM. Three experiments have been carried out to validate the proposed technique.
The rest of the paper is organized as follows. Section 2 describes the semantic topological metric map and the observation model. The framework of the Semantic Unscented FastSLAM is presented in Section 3. The experimental results and discussion are presented in Section 4. The concluding remarks are presented in Section 5.

Semantic Topological Metric Map and Observation Model
2.1. Semantic Topological Metric Map. The semantic topological metric map is defined as the combination of the topological metric map and the semantic relationships between the landmarks, under the assumption that such semantic relationships can be represented by mathematical equations.
The spatial semantic relationship between the available landmarks is always invariant with respect to the robot location. Denote the semantic topological metric map as $M$ and the semantic metric relationship as $S$. Figure 1 shows the process of creating the semantic topological map. The procedure is summarized as follows.
(i) When the robot starts to move and the first landmark is observed, the semantic topological map $M$ includes only the position vector of landmark 1, $m_1$:
$$M = \{m_1\}. \tag{1}$$
(ii) As the robot moves forward, more landmarks are observed. If there are no semantic relationships between any pair of landmarks, the semantic topological metric map $M$ is the same as the regular topological metric map. If the number of observed landmarks is $n$, the semantic topological map $M$ includes the position vectors of landmarks $1, \ldots, n$:
$$M = \{m_1, m_2, \ldots, m_n\}. \tag{2}$$
(iii) When the robot observes landmark $n+1$, the semantic relationship between landmark $n+1$ and landmark $n$, $S_{m_{n+1}, m_n}$, is also found. If all the semantic relationships with the observed landmark $n+1$ are defined as the set $S_{\mathrm{semantic},n+1}$, the semantic topological metric map $M$ is updated with the addition of the semantic metric relationship as in (3)-(4):
$$M = \{m_1, \ldots, m_n, m_{n+1}, S_{\mathrm{semantic},n+1}\}, \tag{3}$$
$$S_{\mathrm{semantic},n+1} = \{S_{m_{n+1}, m_n}\}. \tag{4}$$
(iv) When the robot observes landmark $n+2$, if the semantic relationship between landmarks $n+2$ and $n+1$ is found, the total semantic topological metric map becomes
$$M = \{m_1, \ldots, m_{n+1}, m_{n+2}, S_{\mathrm{semantic},n+1}, S_{\mathrm{semantic},n+2}\}, \tag{5}$$
$$S_{\mathrm{semantic},n+2} = \{S_{m_{n+2}, m_{n+1}}\}. \tag{6}$$
Since more than one semantic relationship between different landmarks, (4) and (6), has been observed, an extended semantic topological relationship is created (Figure 2), in which landmarks $n+2$, $n+1$, and $n$ are associated together. The semantic topological metric map then becomes
$$M = \{m_1, \ldots, m_{n+2}, S_{\mathrm{semantic},n+1}, S_{\mathrm{semantic},n+2}\}, \quad S_{\mathrm{semantic},n+2} = \{S_{m_{n+2}, m_{n+1}}, S_{m_{n+2}, m_n}\}, \tag{7}$$
where $S_{m_{n+2}, m_n}$ is the extended semantic relationship between landmark $n+2$ and landmark $n$. When $S_{m_{n+2}, m_{n+1}}$ coincidentally has the same semantic relationship as $S_{m_{n+1}, m_n}$, they can be associated together as a single relationship among the three landmarks:
$$S_{\mathrm{semantic},n+2} = \{S_{m_{n+2}, m_{n+1}, m_n}\}. \tag{8}$$
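The map-building steps above can be illustrated with a small sketch. The `SemanticMap` class, its field names, and the string labels are hypothetical and not part of the paper. The sketch also assumes that a relationship is extended to a third landmark only when both links carry the same label, which is one possible reading of step (iv).

```python
from dataclasses import dataclass, field


@dataclass
class SemanticMap:
    """Landmark positions plus pairwise semantic relations (hypothetical layout)."""
    landmarks: dict = field(default_factory=dict)  # landmark id -> (x, y)
    relations: dict = field(default_factory=dict)  # frozenset({id_a, id_b}) -> label

    def add_landmark(self, lid, position):
        # Steps (i)-(ii): the map grows by one position vector per new landmark.
        self.landmarks[lid] = position

    def add_relation(self, new_id, old_id, label):
        # Step (iii): store the relation between the new and an existing landmark.
        self.relations[frozenset((new_id, old_id))] = label
        # Step (iv), assumed rule: if old_id is already linked to a third
        # landmark with the same label, extend the relation to that landmark.
        for pair, other_label in list(self.relations.items()):
            if other_label == label and old_id in pair and new_id not in pair:
                (third,) = pair - {old_id}
                self.relations.setdefault(frozenset((new_id, third)), label)


semantic_map = SemanticMap()
for lid, pos in [(1, (0.0, 2.0)), (2, (1.5, 2.0)), (3, (3.0, 2.0))]:
    semantic_map.add_landmark(lid, pos)
semantic_map.add_relation(2, 1, "x-line")
semantic_map.add_relation(3, 2, "x-line")  # also creates the extended 3-1 relation
```

Storing each pair as a `frozenset` keeps the relation symmetric, so looking up landmarks (3, 1) and (1, 3) gives the same entry.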

2.2. Semantic Observation Model.
A semantic observation model is the observation model of the vision sensor with the semantic relationships embedded in it. Hence the semantic observation model consists not only of the metric distance $r$ and the bearing $\theta$ of each landmark, but also of the semantic metric relationships between different landmarks. The dimension of the semantic observation model can be up to $i+1$, where $i$ is the total number of landmarks observed so far. In this case, the semantic observation model can be represented as
$$z_{i,\mathrm{semantic}} = h(\mathbf{x}, m_i, S) = \begin{bmatrix} r_i \\ \theta_i \\ g(S_{m_i, m_1}) \\ \vdots \\ g(S_{m_i, m_{i-1}}) \end{bmatrix}, \quad r_i = \sqrt{(x_{m,i} - x)^2 + (y_{m,i} - y)^2}, \quad \theta_i = \arctan\frac{y_{m,i} - y}{x_{m,i} - x} - \varphi, \tag{9}$$
where $g(*)$ is the mathematical expression of the semantic metric relationship, $(x, y)$ are the coordinates of the current robot pose, and $(x_{m,i}, y_{m,i})$ are the coordinates of the landmark $i$ observed at the current time period. $S_{m_i, m_1}, S_{m_i, m_2}, \ldots, S_{m_i, m_{i-1}}$ are the possible semantic metric relationships associated with $m_i$. The position vector of the robot is defined as $\mathbf{x} = [x, y, \varphi]^T$. $m$ is defined as the position set of all the landmarks observed at the current time as follows:
$$m = \{m_1, m_2, \ldots, m_i\}, \quad m_j = [x_{m,j}, y_{m,j}]^T. \tag{10}$$
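As a rough illustration of how such an observation vector could be assembled, the sketch below stacks the range, the bearing, and one residual per semantic relationship. The function name `semantic_observation`, the argument layout, and the callable `g` are assumptions for illustration. The paper does not specify the exact form of the relationship terms here.

```python
import math


def semantic_observation(pose, landmark, related, g):
    """Return [range, bearing, g(landmark, other) for each related landmark].

    pose     -- (x, y, heading) of the robot
    landmark -- (x, y) of the observed landmark
    related  -- positions of landmarks that share a semantic relationship with it
    g        -- assumed callable returning the scalar residual of one relationship
    """
    x, y, heading = pose
    lx, ly = landmark
    rng = math.hypot(lx - x, ly - y)
    bearing = math.atan2(ly - y, lx - x) - heading
    # Wrap the bearing into [-pi, pi] so that innovations stay small.
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))
    return [rng, bearing] + [g(landmark, other) for other in related]


# Hypothetical relationship: both landmarks share the same y coordinate.
same_y = lambda a, b: a[1] - b[1]
z = semantic_observation((0.0, 0.0, 0.0), (3.0, 4.0), [(1.0, 4.0)], same_y)
```

With one related landmark the vector has three entries, consistent with a dimension that grows by one per semantic relationship.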

Semantic Unscented FastSLAM
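Semantic Unscented FastSLAM partitions the SLAM posterior into a localization problem and independent landmark position estimation problems conditioned on the robot pose estimate and the semantic metric relationships between the landmarks as follows:
$$p(\mathbf{x}_{1:t}, M \mid z_{1:t,\mathrm{semantic}}, u_{1:t}) = p(\mathbf{x}_{1:t} \mid z_{1:t,\mathrm{semantic}}, u_{1:t}) \prod_{i=1}^{n} p(m_i, S_{\mathrm{semantic},i} \mid \mathbf{x}_{1:t}, z_{1:t,\mathrm{semantic}}), \tag{11}$$
where $\mathbf{x}_t$ is the robot pose at time $t$ and $M$ denotes the full semantic metric map at the current time period as follows:
$$M = \{m_1, \ldots, m_n, S_{\mathrm{semantic},1}, \ldots, S_{\mathrm{semantic},n}\}. \tag{12}$$
Suppose the control vector of the robot is $u_t = [v_t, \omega_t]^T$, where $v$ and $\omega$ represent the linear and angular velocities of the robot. According to the kinematics of the wheeled mobile robot [29], the motion model is represented as follows:
$$\mathbf{x}_t = f(\mathbf{x}_{t-1}, u_t), \tag{13}$$
where $f$ integrates the kinematics $\dot{x} = v\cos\varphi$, $\dot{y} = v\sin\varphi$, $\dot{\varphi} = \omega$ over one sampling period.
3.1. Robot Pose Estimation. Since the particle filter is incorporated into the FastSLAM framework, the following derivation is given for a single particle as an example. The robot pose at time $t$ for the $k$th particle can be estimated as
$$p(\mathbf{x}_t^{[k]} \mid z_{1:t,\mathrm{semantic}}, u_{1:t}) = \eta\, p(z_{t,\mathrm{semantic}} \mid \mathbf{x}_t^{[k]}) \int p(\mathbf{x}_t^{[k]} \mid \mathbf{x}_{t-1}^{[k]}, u_t)\, p(\mathbf{x}_{t-1}^{[k]} \mid z_{1:t-1,\mathrm{semantic}}, u_{1:t-1})\, d\mathbf{x}_{t-1}^{[k]}, \tag{14}$$
where $p(\mathbf{x}_{t-1}^{[k]} \mid z_{1:t-1,\mathrm{semantic}}, u_{1:t-1})$ is represented by a Gaussian with mean $\mathbf{x}_{t-1}^{[k]}$ and covariance $P_{t-1}^{[k]}$. The term $p(\mathbf{x}_t^{[k]} \mid \mathbf{x}_{t-1}^{[k]}, u_t)$ is predicted below according to the motion model of the robot. In order to integrate the robot pose and the map update, the state vector is augmented with the control input and the observation vector as
$$\mathbf{x}_{t-1}^{a[k]} = \begin{bmatrix} \mathbf{x}_{t-1}^{[k]} \\ 0 \\ 0 \end{bmatrix}, \quad P_{t-1}^{a[k]} = \begin{bmatrix} P_{t-1}^{[k]} & 0 & 0 \\ 0 & Q_t & 0 \\ 0 & 0 & R_t \end{bmatrix}, \tag{15}$$
where $\mathbf{x}_{t-1}^{a[k]}$ is the augmented state vector, $Q_t$ is the motion noise covariance, $R_t$ is the observation noise covariance, and $P_{t-1}^{a[k]}$ is the augmented covariance matrix. In order to apply the unscented transformation, a symmetric set of $2L+1$ sigma points ($L$ is the dimension of the augmented state vector) needs to be extracted first as follows [30]:
$$\chi_{t-1}^{a[0][k]} = \mathbf{x}_{t-1}^{a[k]},$$
$$\chi_{t-1}^{a[j][k]} = \mathbf{x}_{t-1}^{a[k]} + \left(\sqrt{(L+\lambda)\, P_{t-1}^{a[k]}}\right)_j, \quad j = 1, \ldots, L,$$
$$\chi_{t-1}^{a[j][k]} = \mathbf{x}_{t-1}^{a[k]} - \left(\sqrt{(L+\lambda)\, P_{t-1}^{a[k]}}\right)_{j-L}, \quad j = L+1, \ldots, 2L, \tag{16}$$
where the subscript $j$ means the $j$th column of a matrix. The parameter $\lambda$ is computed by $\lambda = \alpha^2 (L + \kappa) - L$, where $\alpha$ is a small number chosen to avoid sampling nonlocal effects under strong nonlinearities and $\kappa$ is a scaling parameter determining how far the sigma points are separated from the mean value. Each sigma point $\chi_{t-1}^{a[j][k]}$ contains the robot pose, control noise, and semantic observation noise components as
$$\chi_{t-1}^{a[j][k]} = \begin{bmatrix} \chi_{t-1}^{[j][k]} \\ \chi_t^{u[j][k]} \\ \chi_t^{z[j][k]} \end{bmatrix}. \tag{17}$$
The prediction of the robot pose can then be derived by passing the above sigma points through the motion model $f$ in (13).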
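The sigma-point extraction described above can be sketched in plain Python. This is a minimal illustration of the standard scaled unscented transform, not the authors' implementation. The helper names `cholesky` and `sigma_points` and the default values of `alpha` and `kappa` are assumptions, and the Cholesky factor is used as the matrix square root, which is a common but not unique choice.

```python
import math


def cholesky(a):
    """Lower-triangular factor low with low * low^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def sigma_points(mean, cov, alpha=0.1, kappa=0.0):
    """Return the 2L+1 symmetric sigma points and the scaling parameter lambda."""
    n = len(mean)
    lam = alpha ** 2 * (n + kappa) - n
    root = cholesky([[(n + lam) * c for c in row] for row in cov])
    columns = [[root[r][i] for r in range(n)] for i in range(n)]
    points = [list(mean)]                                              # central point
    points += [[m + c for m, c in zip(mean, col)] for col in columns]  # mean + column
    points += [[m - c for m, c in zip(mean, col)] for col in columns]  # mean - column
    return points, lam


pts, lam = sigma_points([1.0, 2.0], [[4.0, 0.0], [0.0, 9.0]], alpha=1.0, kappa=1.0)
```

For the augmented state, `mean` would be the pose followed by zeros for the noise terms, and `cov` the block-diagonal matrix of the pose covariance and the two noise covariances.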
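The transformed sigma points of the robot pose, $\bar{\chi}_t^{[j][k]}$, are calculated as
$$\bar{\chi}_t^{[j][k]} = f\left(\chi_{t-1}^{[j][k]},\; u_t^{[k]} + \chi_t^{u[j][k]}\right), \tag{18}$$
where the current control vector is the sum of $u_t^{[k]}$ and the control noise component $\chi_t^{u[j][k]}$ of each sigma point. The prediction of the robot pose can then be calculated as
$$\mathbf{x}_{t|t-1}^{[k]} = \sum_{j=0}^{2L} w_m^{[j]}\, \bar{\chi}_t^{[j][k]}, \quad P_{t|t-1}^{[k]} = \sum_{j=0}^{2L} w_c^{[j]} \left(\bar{\chi}_t^{[j][k]} - \mathbf{x}_{t|t-1}^{[k]}\right)\left(\bar{\chi}_t^{[j][k]} - \mathbf{x}_{t|t-1}^{[k]}\right)^T. \tag{19}$$
The weights are calculated by the following equations:
$$w_m^{[0]} = \frac{\lambda}{L+\lambda}, \quad w_c^{[0]} = \frac{\lambda}{L+\lambda} + (1 - \alpha^2 + \beta), \quad w_m^{[j]} = w_c^{[j]} = \frac{1}{2(L+\lambda)}, \quad j = 1, \ldots, 2L, \tag{20}$$
where the weight $w_m^{[j]}$ is used to compute the mean of the predicted robot pose, and the weight $w_c^{[j]}$ is used to recover the covariance of the Gaussian. Here $L$, $\lambda$, and $\alpha$ are the dimension and scaling parameters used to generate the sigma points, and the parameter $\beta$ is used to incorporate knowledge of the higher order moments of the posterior distribution.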
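The weights and the weighted recovery of a mean and covariance can be sketched as follows. The function names and default parameter values are assumptions. The value `beta=2.0` is the choice commonly recommended for Gaussian distributions rather than one stated in the paper.

```python
def ut_weights(n, alpha=0.1, kappa=0.0, beta=2.0):
    """Mean and covariance weights of the scaled unscented transform for dimension n."""
    lam = alpha ** 2 * (n + kappa) - n
    w_mean = [lam / (n + lam)] + [1.0 / (2.0 * (n + lam))] * (2 * n)
    w_cov = list(w_mean)
    w_cov[0] += 1.0 - alpha ** 2 + beta  # only the central weight differs
    return w_mean, w_cov


def weighted_mean_cov(points, w_mean, w_cov):
    """Recover a Gaussian (mean, covariance) from transformed sigma points."""
    dim = len(points[0])
    mean = [sum(w * p[d] for w, p in zip(w_mean, points)) for d in range(dim)]
    cov = [[sum(w * (p[r] - mean[r]) * (p[c] - mean[c]) for w, p in zip(w_cov, points))
            for c in range(dim)] for r in range(dim)]
    return mean, cov


a, b = 12 ** 0.5, 27 ** 0.5
points = [[1.0, 2.0], [1.0 + a, 2.0], [1.0, 2.0 + b], [1.0 - a, 2.0], [1.0, 2.0 - b]]
w_mean, w_cov = ut_weights(2, alpha=1.0, kappa=1.0)
mean, cov = weighted_mean_cov(points, w_mean, w_cov)
```

In this usage example the points are passed through unchanged (an identity motion model), so the recovered mean and covariance reproduce the Gaussian the points were drawn from, which is a convenient sanity check.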
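Suppose the $i$th landmark and its semantic relationships are observed; the transformed sigma points of the semantic observation vector can be derived as
$$\bar{Z}_t^{[j][k]} = h\left(\bar{\chi}_t^{[j][k]},\; m_i^{[k]},\; S_{\mathrm{semantic},i}^{[k]}\right) + \chi_t^{z[j][k]}, \tag{21}$$
where the semantic metric relationships, $S_{\mathrm{semantic},i}^{[k]}$, are included in the semantic observation model $h(\cdot)$ in (9) for the robot pose update. This additional information in the update is what improves the robot localization. The prediction of the semantic observation vector can then be calculated as
$$\hat{z}_{t,\mathrm{semantic}}^{[k]} = \sum_{j=0}^{2L} w_m^{[j]}\, \bar{Z}_t^{[j][k]}. \tag{22}$$
The Kalman gain can then be obtained by the following equations as usual:
$$\Sigma_t^{z,z[k]} = \sum_{j=0}^{2L} w_c^{[j]} \left(\bar{Z}_t^{[j][k]} - \hat{z}_{t,\mathrm{semantic}}^{[k]}\right)\left(\bar{Z}_t^{[j][k]} - \hat{z}_{t,\mathrm{semantic}}^{[k]}\right)^T,$$
$$\Sigma_t^{x,z[k]} = \sum_{j=0}^{2L} w_c^{[j]} \left(\bar{\chi}_t^{[j][k]} - \mathbf{x}_{t|t-1}^{[k]}\right)\left(\bar{Z}_t^{[j][k]} - \hat{z}_{t,\mathrm{semantic}}^{[k]}\right)^T,$$
$$K_t^{[k]} = \Sigma_t^{x,z[k]} \left(\Sigma_t^{z,z[k]}\right)^{-1}, \tag{23}$$
where $\Sigma_t^{z,z[k]}$ is the innovation covariance and $\Sigma_t^{x,z[k]}$ is the cross-covariance.
Therefore, the mean and covariance of the robot pose are estimated at time $t$ by
$$\mathbf{x}_t^{[k]} = \mathbf{x}_{t|t-1}^{[k]} + K_t^{[k]} \left(z_{t,\mathrm{semantic}} - \hat{z}_{t,\mathrm{semantic}}^{[k]}\right), \quad P_t^{[k]} = P_{t|t-1}^{[k]} - K_t^{[k]}\, \Sigma_t^{z,z[k]} \left(K_t^{[k]}\right)^T. \tag{24}$$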
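The gain computation and the pose update can be sketched with small pure-Python matrix helpers. The names `ukf_update`, `matmul`, `inverse`, and `transpose` are hypothetical, and the sketch assumes that the innovation covariance and cross-covariance have already been accumulated from the sigma points.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def ukf_update(x_pred, p_pred, z, z_pred, s_zz, s_xz):
    """Kalman gain from the cross- and innovation covariances, then the update."""
    gain = matmul(s_xz, inverse(s_zz))
    innovation = [[zi - zh] for zi, zh in zip(z, z_pred)]
    x_new = [xi + d[0] for xi, d in zip(x_pred, matmul(gain, innovation))]
    correction = matmul(matmul(gain, s_zz), transpose(gain))
    p_new = [[p - q for p, q in zip(pr, qr)] for pr, qr in zip(p_pred, correction)]
    return x_new, p_new, gain


x_new, p_new, gain = ukf_update([0.0], [[1.0]], [2.0], [0.0], [[2.0]], [[1.0]])
```

In a real filter the bearing component of the innovation would additionally be wrapped to a principal angle before it is multiplied by the gain. The sketch omits this for brevity.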

3.2. Landmark Position Estimate with Semantic Constraints.
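For the observed landmark $i$, the probability of the landmark position estimate can be represented as
$$p\left(m_{i,t}^{[k]}, S_{\mathrm{semantic},i} \mid \mathbf{x}_t^{[k]}, z_{t,\mathrm{semantic}}^{[k]}\right) = \eta\, p\left(z_{t,\mathrm{semantic}}^{[k]} \mid \mathbf{x}_t^{[k]}, m_{i,t}^{[k]}, S_{\mathrm{semantic},i}\right) p\left(m_{i,t-1}^{[k]}, S_{\mathrm{semantic},i} \mid \mathbf{x}_{t-1}^{[k]}, z_{t-1,\mathrm{semantic}}^{[k]}\right), \tag{25}$$
where the probability $p(m_{i,t-1}^{[k]}, S_{\mathrm{semantic},i} \mid \mathbf{x}_{t-1}^{[k]}, z_{t-1,\mathrm{semantic}}^{[k]})$ is represented by a Gaussian with mean $\mu_{i,t-1}^{[k]}$ and covariance $\Sigma_{i,t-1}^{[k]}$. The term $p(z_{t,\mathrm{semantic}}^{[k]} \mid \mathbf{x}_t^{[k]}, m_{i,t}^{[k]}, S_{\mathrm{semantic},i})$ is derived in the following. Likewise, the sigma points of the observed landmark position, $m_i$, are initialized as
$$\chi^{[0][k]} = \mu_{i,t-1}^{[k]}, \quad \chi^{[j][k]} = \mu_{i,t-1}^{[k]} + \left(\sqrt{(L+\lambda)\, \Sigma_{i,t-1}^{[k]}}\right)_j, \; j = 1, \ldots, L, \quad \chi^{[j][k]} = \mu_{i,t-1}^{[k]} - \left(\sqrt{(L+\lambda)\, \Sigma_{i,t-1}^{[k]}}\right)_{j-L}, \; j = L+1, \ldots, 2L, \tag{26}$$
where $L$ is now the dimension of the landmark position. The transformed sigma points of the landmark position estimation with semantic relationships can be derived as
$$\bar{Z}_t^{[j][k]} = h\left(\mathbf{x}_t^{[k]},\; \chi^{[j][k]},\; S_{\mathrm{semantic},i}\right), \tag{27}$$
where $h(\cdot)$ is the observation model in (9) and $\mathbf{x}_t^{[k]}$ is the current estimate of the robot pose in (24). Hence the predicted semantic observation vector, $\hat{z}_{t,\mathrm{semantic}}^{[k]}$, is
$$\hat{z}_{t,\mathrm{semantic}}^{[k]} = \sum_{j=0}^{2L} w_m^{[j]}\, \bar{Z}_t^{[j][k]}. \tag{28}$$
The Kalman gain $K_{t,\mathrm{semantic}}$ is then calculated as follows:
$$\Sigma_t^{z,z[k]} = \sum_{j=0}^{2L} w_c^{[j]} \left(\bar{Z}_t^{[j][k]} - \hat{z}_{t,\mathrm{semantic}}^{[k]}\right)\left(\bar{Z}_t^{[j][k]} - \hat{z}_{t,\mathrm{semantic}}^{[k]}\right)^T + R_t,$$
$$\Sigma_t^{m,z[k]} = \sum_{j=0}^{2L} w_c^{[j]} \left(\chi^{[j][k]} - \mu_{i,t-1}^{[k]}\right)\left(\bar{Z}_t^{[j][k]} - \hat{z}_{t,\mathrm{semantic}}^{[k]}\right)^T, \quad K_{t,\mathrm{semantic}} = \Sigma_t^{m,z[k]} \left(\Sigma_t^{z,z[k]}\right)^{-1}. \tag{29}$$
Note that the weights $w_m^{[j]}$ and $w_c^{[j]}$ are the same as in (20). Finally, the mean $\mu_{i,t}^{[k]}$ and the covariance $\Sigma_{i,t}^{[k]}$ of the $i$th landmark position are updated by
$$\mu_{i,t}^{[k]} = \mu_{i,t-1}^{[k]} + K_{t,\mathrm{semantic}} \left(z_{t,\mathrm{semantic}} - \hat{z}_{t,\mathrm{semantic}}^{[k]}\right), \tag{30}$$
$$\Sigma_{i,t}^{[k]} = \Sigma_{i,t-1}^{[k]} - K_{t,\mathrm{semantic}}\, \Sigma_t^{z,z[k]}\, K_{t,\mathrm{semantic}}^T. \tag{31}$$
Note that $z_{t,\mathrm{semantic}}$ includes the true observation of the relative position of the landmark and the robot and the semantic relationships associated with this landmark. These observation values are obtained by image processing of the vision sensor data. If more landmarks are observed at one time, the derivation is similar except that more semantic relationships are included in the observation model.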
As mentioned at the beginning, all of the above derivation is with respect to a single particle $k$. The traditional resampling procedure is then applied over all particles, and the robot pose and the landmark positions are finally estimated.
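The paper does not say which resampling scheme is meant by "traditional". As one common possibility, the sketch below shows systematic (low-variance) resampling. The function name `systematic_resample` and its `rng` parameter are assumptions for illustration.

```python
import random


def systematic_resample(weights, rng=random):
    """Return indices of the particles to keep, drawn by systematic resampling."""
    n = len(weights)
    step = sum(weights) / n
    start = rng.random() * step  # one random offset in [0, step)
    indices = []
    i, cumulative = 0, weights[0]
    for k in range(n):
        target = start + k * step
        # Advance to the particle whose cumulative weight interval holds target.
        while target >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
    return indices


kept = systematic_resample([0.0, 0.0, 1.0, 0.0], random.Random(0))
```

Each selected index would then be used to copy that particle's pose and landmark estimates into the next generation, with all weights reset to be equal.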

Experiments and Discussions
4.1. Experiment Procedures. The platform used in the experiments was a Pioneer 3-DX robot equipped with a binocular camera system. The camera was the only exteroceptive sensor used to recover the representation of the environment. The sampling period was 0.5 seconds. The proposed technique was evaluated in three different types of experiments. In Experiment 1, the robot moved along a simple rectangular trajectory (8 m × 14 m) in a neat lab environment. The environment in Experiment 2 was a regular office area, more representative of the settings in which most indoor service robots operate, and was used to verify the performance of the Semantic Unscented FastSLAM. Experiment 3 was conducted in a messy environment where the robot had to move along a zig-zag path to go through aisles.
In the experiments, three kinds of semantic metric relationships were found. The first was that a newly observed landmark and another landmark already in the map were both along the $x$-axis ($x$-line). The second was that two landmarks were both along the $y$-axis ($y$-line). These two kinds of semantic relationships are denoted by {$x$-line, $y$-line}. The third was that three landmarks were collinear, such as landmarks on the walls of neighboring cubes in an office. Suppose $m_i$, $m_j$, and $m_k$ are three landmarks with the above semantic relationships; the semantic observation model can then be represented, respectively, as in (32).
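One way to express the three relationship types as scalar residuals that vanish when the relationship holds is sketched below. This is an illustrative assumption rather than the paper's equation (32). In particular, it assumes that two landmarks "along the x-axis" lie on a line parallel to the x-axis (equal y coordinates), and similarly for the y-axis case. The function names are hypothetical.

```python
def x_line(a, b):
    """Zero when a and b lie on a line parallel to the x-axis (assumed meaning)."""
    return a[1] - b[1]


def y_line(a, b):
    """Zero when a and b lie on a line parallel to the y-axis (assumed meaning)."""
    return a[0] - b[0]


def collinear(a, b, c):
    """Twice the signed area of triangle abc; zero when the points are collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


residuals = [
    x_line((0.0, 1.0), (5.0, 1.0)),
    y_line((2.0, 0.0), (2.0, 7.0)),
    collinear((0.0, 0.0), (1.0, 1.0), (3.0, 3.0)),
]
```

Residuals of this kind can be appended to the range and bearing terms of the observation vector, with the expected observation equal to zero whenever the relationship has been detected.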

4.2. Experiment Results and Discussions
Experiment 1. Figure 3 shows one image taken by the vision sensor on the robot with three landmarks $n$, $n+1$, and $n+2$, where landmarks $n+2$ and $n+1$ were located along a common coordinate axis. This semantic relationship is applied for localization and mapping. Figure 4 compares the system performance using the Unscented FastSLAM (Figure 4(a)) and the proposed method (Figure 4(b)). As shown in Figure 4(a), the error of the robot pose grew, especially after the robot turned. This error could not be corrected by loop closure detection because none of the landmarks observed after turning had been observed before. In Figure 4(b), the localization error was greatly reduced after the semantic topological metric map was applied. Figure 5 is a partially enlarged view of Figure 4, where A1 and B1 are the estimates from odometry only, A2 and B2 are the estimates from Unscented FastSLAM, and A3 and B3 are the estimates from the proposed Semantic Unscented FastSLAM. When Landmark #6 was observed by the robot for the first time, it was also found to have an axis-line semantic topological relationship with Landmarks #4, #2, and #3 in the existing map. This semantic relationship resulted in a much better robot pose estimate, B3 in Figure 5(b), which was pulled back from the dead-reckoning deviation B1, in contrast to B2, which does not take advantage of semantic relationships.
Experiment 2. Figure 6 shows the experimental environment in Experiment 2, where the reference trajectory started from the circle and ended at the same point after a complex surveillance route along the arrow directions. The start point was defined as the origin of the inertial frame (0, 0). It is worth noting that this office was composed of a few cubes that were higher than the robot. Hence when the robot moved along the reference trajectory, most landmarks could not be observed more than once before the robot was close to the end point. The experimental results using the Unscented FastSLAM and the proposed Semantic Unscented FastSLAM are shown in Figure 7. In this experiment, when the robot moved close to the end point, Landmark #1 was observed again after a long period, allowing loop closure detection. The estimate of the end point by odometry only, C1, was far away from the real end point. As shown in Figure 7(a), the end point estimated by the Unscented FastSLAM before the loop closure detection of Landmark #1, C2, was better but still had a large error. This error was too large to be corrected by the loop closure detection (see D2 for the estimate after loop closure detection). Figure 7(b) shows that the error of the end point estimated by the proposed Semantic Unscented FastSLAM before the loop closure detection, C3, was much smaller because of the semantic updates in the algorithm. Therefore, after the loop closure detection, the estimate moved close to the reference point (see D3 for the estimate by Semantic Unscented FastSLAM in Figure 7(b)).
Experiment 3. The experiment environment is shown in Figure 8, where small triangles represent a few locations along the reference path and the solid line represents the walls of the cubes. Notice that the reference path is not straight within each aisle because the aisles had irregular widths and the robot also needed to avoid chairs and boxes on both sides of the aisle. Figure 9 shows an example of two pictures captured by the camera in which three green landmarks were detected as having a collinear relationship. Figure 10 illustrates the performance of the surveillance robot in Experiment 3. As shown in Figure 10(b), the estimated locations of the robot and landmarks are much closer to the reference path using the proposed Semantic Unscented FastSLAM than without considering semantic relationships (Figure 10(a)).

Conclusions
This paper has proposed a vision-based Semantic Unscented FastSLAM for mobile robots. The semantic relationship is combined with the traditional topological metric map to improve the accuracy of localization and mapping. Experiments were conducted to verify that the Semantic Unscented FastSLAM is more robust and applicable to more general indoor autonomous surveillance.

Figure 1: The process of creating the semantic map.

Figure 3: The observation result of vision system.

Figure 5: Partially enlarged view of SLAM results in Experiment 1.
Experiment result using Semantic Unscented FastSLAM