Keyframes Global Map Establishing Method for Robot Localization through Content-Based Image Matching

Self-localization and mapping are important for indoor mobile robots. We report a robust algorithm for map building and subsequent localization, especially suited for indoor floor-cleaning robots. Common methods such as SLAM can easily be kidnapped by collisions or disturbed by similar objects. Therefore, a keyframes global map establishing method is needed for robot localization in multiple rooms and corridors. Content-based image matching is the core of this method. It is designed for this situation by establishing keyframes containing both ceiling and distorted wall images. Image distortion, caused by the robot's view angle and movement, is analyzed and deduced, and an image matching solution is presented, consisting of extraction of the overlap regions of keyframes and rebuilding of the overlap region through subblock matching. To improve accuracy, ceiling point detection and mismatched-subblock checking methods are incorporated. This matching method can process environment video effectively. In experiments, fewer than 5% of frames are extracted as keyframes to build the global map; these keyframes span large spatial distances while still overlapping each other. Through this method, the robot can localize itself by matching its real-time vision frames against the keyframes map. Even with many similar objects in the environment, or when the robot is kidnapped, localization is achieved with a position RMSE below 0.5 m.


Introduction
The ideal indoor mobile robot should store a global map of the entire indoor space for self-localization, especially in a building including multiple rooms and corridors [1]. It would be much better if this global map could be built by the robot itself while studying the indoor environment, without human help [2].
For this research field, SLAM (Simultaneous Localization and Mapping) is the most commonly employed approach [3], especially V-SLAM (Visual Simultaneous Localization and Mapping) [4,5]. Many recent variants take advantage of new features or 3D information to build navigation maps [6,7], including ORB SLAM [8], dense SLAM [9,10], semidense SLAM [11], LSD SLAM [12], and CV-SLAM [13,14]. The CV-SLAM method is specially designed for indoor robot localization [15] and can make good use of ceiling features as a navigation map through an upward-looking camera [16]. This technology has been used widely in the Roomba, a product of iRobot Inc. Similar methods are used in other models made by Dyson, SAMSUNG, and LG.
However, the SLAM method is easily disturbed, and the most troublesome problems are the kidnap problem and interference from similar objects [17]. The kidnap problem occurs frequently when a robot is suddenly involved in a collision, kicked, or intentionally repositioned during operation. The robot then cannot localize itself from the video information of previous moments [18]. Similar-object interference refers to the fact that a robot is easily confused by similar features on similar objects in different places and erroneously fixes itself to the wrong position. An indoor robot that looks mostly at the ceiling of an indoor space does not detect many significant features. To make matters worse, there are usually many similar objects in an indoor environment (e.g., air-conditioning outlets, ceiling lamps, and ceiling tiles). Hence, established methods are not sufficiently reliable for global indoor positioning.
The rest of this paper is organized as follows. Section 2 describes the design of the robot vision system and the distortion features of common indoor objects. Section 3 discusses the content-based image matching method and the keyframes global map establishing method. Sections 4 and 5 present and discuss the experimental results and conclusions.

Image Distortion Model and Feature Analysis for Common Indoor Objects
Content-based image matching using both ceilings and walls is one of the best ways to detect keyframes and build a global map: it can extract keyframes from environment video as efficiently as possible to describe the indoor environment while keeping the connections between these frames. The main problem for content-based image matching, however, is image distortion. If two frames are taken at different view angles and positions, the object distortions in them differ, and the image content similarity between the two frames is difficult to calculate.
Here we discuss the method for analyzing the features of image content distortion by establishing a distortion model. An upward-looking camera (SY012HD wide-angle camera, Weixinshijie Technology Co., Ltd, Shenzhen, China) is installed on the wheeled robot (Roomba, R1-L081A, Midea, Suzhou, China) as the robot vision sensor [21], as shown in Figure 1(a), the same configuration as that carrying a CV-SLAM system [22,23].

The Features of Robot View Angle and Movement for Ceiling and Wall Regions.
The camera model, which projects a three-dimensional object in the real world onto a two-dimensional picture, is the standard pinhole projection [24,25]:

s [u, v, 1]^T = K [R | t] [x, y, z, 1]^T,    (1)

where (u, v) is the image point, (x, y, z) is the world point, K is the camera intrinsic matrix, and [R | t] is the rotation and translation of the camera. There are two kinds of main indoor objects: the ceiling and wall regions. The ceiling is parallel to the floor, and wall regions (e.g., wall, window, door, and furniture) are perpendicular to the floor. For the upward-looking camera on the robot, the view angles for them are different.
When the robot moves on the floor, its route is parallel to the ceiling; the heading θ is the only robot view angle that can change, and both roll and pitch are equal to 0.
For the wall regions, because they are fixed on the floor, the freedom of the robot view angles is also very limited. We analyze each wall separately by giving each wall an independent coordinate frame (front/back wall or side wall), as shown in Figure 1(b), with one axis perpendicular to the wall, one axis parallel to the wall and floor, and one axis perpendicular to the floor. The heading θ between a wall and the robot is the only view angle that can change. As a wall is perpendicular to the floor and ceiling, it can be seen as part of the ceiling with a roll or pitch of 90 deg. For the front (or back) wall of the robot, the pitch is 90 deg and the roll is 0; for the left (or right) wall, the roll is 90 deg and the pitch is 0, as shown in Figure 1(b).
Because of these different view angles, the ceiling and walls deform differently in the robot's vision. To match precisely, this deformation should be corrected before image content matching. From (1), it can also be deduced that the robot movement (x_r, y_r) and the object position (x, y, z) contribute to the image distortion. Therefore, bringing the robot view angle, robot movement, and object position into (1), the distortion models of the ceiling and wall regions can be established, and their distortion features can be extracted.

Ceiling Distortion Model and Feature Extraction.
Given the coordinate origin on the ceiling, x_r and y_r describe the robot movement, and z_c is the ceiling height. The heading between the ceiling and the camera is θ, and roll and pitch are 0. Therefore (1) can be transformed and simplified ((2)-(4)): the distortion of the ceiling includes only rotation and translation, and the shape of the ceiling is unchanged. As rotation and translation form an affine transform, they can be corrected by an affine transform too. Through SURF (Speeded-Up Robust Features), the matched feature points in frames A and B are extracted [26,27]. Bringing the coordinates of these points into the rigid transform

x_B = x_A cos Δθ − y_A sin Δθ + Δx,
y_B = x_A sin Δθ + y_A cos Δθ + Δy,    (5)

the heading difference Δθ and the movement differences Δx and Δy between the two frames can be resolved [28]. Frames A and B can then be adjusted to the same robot view angle and camera shooting position by rotation and translation according to Δθ, Δx, and Δy. Their overlap regions (containing the same objects in both frames) can be extracted, and the image similarity is obtained by comparing the similarity of these overlap regions.

Wall Distortion Model and Feature Extraction.
For the front wall (pitch = 90 deg, roll = 0) and the side wall (pitch = 0, roll = 90 deg), the heading difference is 90 deg (or −90 deg), but their image distortions have the same form, as deduced below.
Given that the robot heading is θ when it takes frame A, a point (x, y, z) on the front wall projects to a corresponding point (u_A, v_A) in frame A according to (6), which simplifies to (7). A point on the side wall projects according to (8), which simplifies to (9)-(10). Comparing with (7), the x-axis and y-axis terms are exchanged in (9) and (10); accordingly, x and y are exchanged, and u and v are exchanged too. The distortion of the front wall in the robot vision therefore has the same form as that of the side wall, and the distortion correction method for the front wall is the same as that for the side wall.
Take the distortion of the front wall as an example. Given that the robot heading and movement for frame A are θ, x_r, y_r, for frame B, which is to be matched with frame A, the parameters are θ + Δθ, x_r + Δx, and y_r + Δy. According to (8), the object point (x, y, z) projects to (u_A, v_A) in frame A as in (11). For the ceiling regions in frames A and B, the deformations, including rotation and translation, can be adjusted through (5). However, the wall region in frame A, after rotation and translation, is transformed differently from that of frame B, as in (12)-(14), where (u', v') is the point in frame A processed by (5). Comparing (15) with (10) and (11), the transformed frame A is much closer to frame B than the original frame A. The similarity between frames A and B can therefore be calculated more exactly by matching with the transformed frame A than with the original.
However, there is still some difference between the transformed frame A and frame B. On one hand, their denominators are different, containing θ in one case and θ + Δθ in the other. On the other hand, some terms in (15) are unknown (e.g., the point coordinates (x, y, z) in the room space), so the per-point translation (Δu, Δv) cannot be calculated from the two equations alone. To neutralize these disadvantages, an overlap region extraction method and a subblock matching method are presented, through which (Δu, Δv) can be resolved. The term subblock is defined as follows: an image is divided into many equal-sized small blocks, which are named subblocks in this paper. Equation (5) is likewise accomplished through the overlap region extraction method designed in this paper.
If frames A and B contain the same image content, their transform results after being processed by (5) and (15) are very similar and very useful for image matching and similarity analysis. For the same ceiling regions in frames A and B, the correlation coefficient between them is very large. For the same wall regions, as they contain the same objects, the number of correctly matched subblocks is also large. This paper combines the two measures to analyze the similarity between frames A and B.

Match Method Design
The content-based image matching method is designed first; the keyframes global map is then established through this matching method.

Content-Based Image Matching through Overlap Region Extraction and Subblock Rebuild.
The content-based image matching method is designed according to the features of image distortion and consists of three parts: image overlap region extraction, overlap region rebuild through subblock matching, and, lastly, image content similarity calculation, as shown in Figure 2. For frames A and B, this method adjusts the distortion of frame A to resemble that of frame B and then calculates their degree of similarity. To adjust more exactly, a ceiling point detection method is designed to improve the accuracy of overlap region extraction, and a mismatched subblock checking method is designed to rebuild the overlap region more exactly.

Image Overlap Region Extraction.
Frames A and B, which are taken by the robot at different positions, may overlap only partially. Content-based image matching focuses on analyzing the image similarity in this overlap region.
To extract this overlap region, frames A and B are adjusted to the same view angle and camera shooting position according to their matched feature points extracted by the SURF method. If all the feature points lie on the ceiling, the translation Δx, Δy and rotation Δθ between the two frames can be calculated directly from (5). Through rotation and translation, the same objects in frames A and B are moved to the same position; this image processing is shown in Figure 3. The overlap regions can then be extracted effectively from the two adjusted frames.
The image processing result is shown in Figure 4. The rotation and translation of frames A and B are shown in Figures 4(c) and 4(d), and the same objects overlap, as shown in Figure 4(e). The overlap region mask is the set of pixels covered by both images, as shown in Figure 4(f). The overlap regions in frames A and B can be extracted through this mask.
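The overlap mask computation can be sketched as follows: after estimating (Δθ, Δx, Δy), each pixel of frame B is mapped back into frame A's coordinates, and the mask keeps the pixels that land inside frame A's bounds. This is a minimal inverse-warp sketch, not the paper's exact implementation:

```python
import numpy as np

def overlap_mask(shape, d_theta, dx, dy):
    """Binary mask of pixels in frame B whose inverse-transformed
    coordinates (undoing rotation d_theta and translation dx, dy)
    fall inside frame A of the same size."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    c, s = np.cos(d_theta), np.sin(d_theta)
    # Inverse rigid transform: translate back, then rotate back.
    xa = c * (xs - dx) + s * (ys - dy)
    ya = -s * (xs - dx) + c * (ys - dy)
    return (xa >= 0) & (xa < w) & (ya >= 0) & (ya < h)

# Pure translation by 3 pixels: the leftmost 3 columns of B fall
# outside frame A, so only 7 of the 10 columns overlap.
mask = overlap_mask((4, 10), 0.0, 3.0, 0.0)
```

The overlap regions of both frames are then read out through this mask, exactly as Figure 4(f) illustrates.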
If feature points were extracted from wall regions, they would interfere with the calculation of Δx, Δy, and Δθ. To delete the points not on the ceiling (including those on walls, furniture, windows, and doors), this paper designs a ceiling point detection method based on the property that all ceiling points are of equal height.
For two ceiling points P and Q, given that the coordinate origin is on the ceiling so that ceiling points have height 0, it follows from (4) and (17) that the lengths of their connect line in frames A and B (points P and Q in Figure 5) are equal: rotation and translation do not change the distance between two points at the same height, as shown in Figure 5. However, if one of the feature points lies on a wall, its height differs from that of the ceiling, and the projection scale of that point differs between the two frames; since Δθ appears in the denominator of (19), expressions (18) and (19) are not equal, and the connect-line lengths of such a point pair differ between the two frames. Therefore, the points on walls can be deleted effectively by comparing the lengths of the connect lines of feature points in the two frames, and the feature points on the ceiling can be retained. The image processing is shown in Figure 6.
The points on the ceiling can thus be extracted by comparing the lengths of their connect lines in frames A and B. These points make the calculation in (5) more exact and the overlap region extraction more effective. In the overlap regions, as rotation and translation have been completed, the difference in the ceiling image between frames A and B caused by robot rotation and translation is removed. The remaining wall image distortion difference between the two frames is adjusted in the next section.
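The connect-line test above can be sketched as a pairwise consistency vote over matched points. The voting scheme and tolerance are illustrative assumptions; the paper only specifies the length comparison itself:

```python
import numpy as np

def filter_ceiling_points(pts_a, pts_b, tol=0.05):
    """Keep matched points whose pairwise 'connect line' lengths agree
    between frames A and B. Ceiling points (equal height) preserve
    these lengths under rotation+translation; wall points do not."""
    pts_a = np.asarray(pts_a, dtype=float)
    pts_b = np.asarray(pts_b, dtype=float)
    n = len(pts_a)
    votes = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            la = np.linalg.norm(pts_a[i] - pts_a[j])
            lb = np.linalg.norm(pts_b[i] - pts_b[j])
            if abs(la - lb) <= tol * max(la, lb, 1e-9):
                votes[i] += 1
                votes[j] += 1
    # A true ceiling point agrees with most other ceiling points.
    return votes >= (n - 1) // 2

# Three rigidly consistent points plus one outlier "wall" point whose
# distances to the others change between the frames.
pts_a = [[0, 0], [1, 0], [0, 1], [5, 5]]
pts_b = [[0, 0], [1, 0], [0, 1], [9, 9]]
keep = filter_ceiling_points(pts_a, pts_b)
```

Only the inlier points survive, and they alone are fed into (5).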

Overlap Region Rebuild through Subblock Matching.
For the points in the wall region, it is difficult to calculate the translation (Δu, Δv) of each point directly through (15), as the robot cannot measure each point's coordinates (x, y, z) in the room through its monocular camera. This paper presents a subblock matching method to calculate the translation value.
As a distorted object in the frame is still a single entity in the image, the translation values (Δu, Δv) of the distorted object points within a small image region are nearly equal. Such a region can therefore be translated as a unit according to its average translation value. The overlap region in frame A is first divided into many small subblocks, and each subblock is then matched against the overlap region in frame B by the SAD (Sum of Absolute Differences) method to get its average translation (Δu, Δv). In this way the distortion of the overlap region in frame A is adjusted to resemble that of frame B, as shown in Figure 7.
The SAD matching criterion is

SAD(i, j) = Σ_m Σ_n |B_s(m, n) − R_B(i + m, j + n)|,    (20)

where B_s is one of the subblocks in the overlap region of frame A, of size m_s × n_s, and R_B is the overlap region of frame B, of size M × N. Traversing R_B, the offset (i, j) that makes (20) smallest is the most suitable rebuild position of B_s in R_B; this (i, j) is the (Δu, Δv) of subblock B_s. Through this method, every subblock in frame A finds its rebuild position and its translation (Δu, Δv), and the rebuilt overlap region of frame A is similar to that of frame B.
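An exhaustive-search implementation of the SAD criterion (20) can be sketched as follows; a real system would restrict the search window around the subblock's original position, which is an optimization the sketch omits:

```python
import numpy as np

def sad_match(block, region):
    """Slide `block` (m x n) over `region` (M x N) and return the
    offset (i, j) minimising the Sum of Absolute Differences --
    the rebuild position, i.e. the subblock translation (du, dv)."""
    m, n = block.shape
    M, N = region.shape
    best, best_ij = np.inf, (0, 0)
    for i in range(M - m + 1):
        for j in range(N - n + 1):
            sad = np.abs(region[i:i + m, j:j + n] - block).sum()
            if sad < best:
                best, best_ij = sad, (i, j)
    return best_ij

# Hide a 3x3 subblock at offset (2, 4) inside a random region; the
# SAD search should recover exactly that offset (SAD = 0 there).
rng = np.random.default_rng(0)
region = rng.integers(0, 256, (10, 12)).astype(float)
block = region[2:5, 4:7].copy()
offset = sad_match(block, region)
```

Repeating this for every subblock of frame A's overlap region yields the per-subblock translations used to rebuild the region, as in Figure 7.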
Considering that some subblocks might be mismatched to wrong positions because of similar objects, this paper presents a mismatched subblock checking method to detect incorrect (Δu, Δv) values.
As (Δu, Δv) is the average of the per-point translations within a subblock, its behavior is very similar to that of the per-point values, and (15) can be written in matrix form (21). The incorrect (Δu, Δv) of mismatched subblocks can therefore be picked out according to the residual rotation and translation differences between frames A and B after overlap region extraction: the projection ratio between the real world and the camera image allows (x, y, z) to be substituted by the corresponding subblock coordinates (i, j). The thresholds (Δu_max, Δv_max) for the (Δu, Δv) of a subblock at (i, j) are calculated by the affine model (22)-(23), where Δθ_t, Δx_t, and Δy_t are the thresholds on the rotation and translation differences between frames A and B after overlap region extraction. To delete mismatched subblocks effectively, Δθ_t, Δx_t, and Δy_t are set smaller than the maximum residual rotation and translation differences. If the (Δu, Δv) of a subblock exceeds its threshold (Δu_max, Δv_max), that block is deduced to be mismatched and is deleted. Through mismatched block detection, a correct content-based image matching result can be obtained. The matching result for two similar frames is shown in Figure 8, and that for dissimilar frames in Figure 9. This subblock matching method can match similar frames and evaluate their similarity effectively.
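The threshold test can be sketched with a simplified small-angle bound in place of the paper's full affine model (22)-(23): a residual rotation of at most Δθ_t displaces a point at radius r by at most Δθ_t·r, on top of the translation bounds. The bound shape is an assumption of this sketch:

```python
import numpy as np

def check_subblocks(offsets, centers, d_theta_t, dx_t, dy_t):
    """Flag subblocks whose matched translation exceeds what the
    residual rotation/translation (after overlap alignment) could
    plausibly produce at that subblock's position."""
    offsets = np.asarray(offsets, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Small-angle bound: rotation contributes up to |d_theta_t| * r.
    r = np.linalg.norm(centers, axis=1)
    limit_x = d_theta_t * r + dx_t
    limit_y = d_theta_t * r + dy_t
    return (np.abs(offsets[:, 0]) <= limit_x) & \
           (np.abs(offsets[:, 1]) <= limit_y)

# A plausible small shift passes; a 30-pixel jump (a similar-object
# mismatch) is rejected.
ok = check_subblocks(offsets=[[1, 1], [30, 0]],
                     centers=[[10, 10], [10, 10]],
                     d_theta_t=0.05, dx_t=5.0, dy_t=5.0)
```

Rejected subblocks are excluded both from the rebuilt overlap region and from the matched-subblock count used in the similarity score.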

Image Content Similarity Calculation.
If the content of frames A and B is similar, the number of correctly matched subblocks is large, and the correlation coefficient between the rebuilt overlap region of frame A and the overlap region of frame B is also large [29,30]. The similarity S between the two frames is the product of the matched subblock count and the correlation coefficient:

S = N_s · Σ (g_A(i, j) − ḡ_A)(g_B(i, j) − ḡ_B) / sqrt(Σ (g_A(i, j) − ḡ_A)² · Σ (g_B(i, j) − ḡ_B)²),

where N_s is the number of matched subblocks, g_A(i, j) is the pixel value in the rebuilt overlap region of frame A, g_B(i, j) is the pixel value in the overlap region of frame B, and ḡ_A and ḡ_B are the average pixel values.
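The score above combines a normalized correlation coefficient with the subblock count; a direct sketch, with variable names chosen for illustration:

```python
import numpy as np

def similarity(rebuilt, target, n_matched):
    """Frame similarity: number of correctly matched subblocks times
    the normalised correlation coefficient between the rebuilt
    overlap region (frame A) and the overlap region (frame B)."""
    a = rebuilt - rebuilt.mean()
    b = target - target.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    r = (a * b).sum() / denom if denom > 0 else 0.0
    return n_matched * r

# A region and a linearly rescaled copy correlate perfectly (r = 1),
# so the similarity equals the matched-subblock count.
x = np.arange(16, dtype=float).reshape(4, 4)
s = similarity(x, 2 * x + 3, n_matched=10)
```

Because the correlation coefficient is invariant to linear intensity changes, the score tolerates illumination differences between the two frames while the subblock count penalizes structural disagreement.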

Keyframes Global Map Establishing through Content-Based Image Matching.
Before establishing the map, the robot first moves through the building automatically and takes a video of the indoor environment with its camera. Then, through the content-based image matching method, the robot extracts a keyframe sequence from the vision video to build the global map of the indoor environment by itself. The first keyframe is the first frame of the vision video. The remaining keyframes are extracted as follows.

Step 1. For the kth keyframe, its similarity with each of the subsequent 50 frames of video (about 17 seconds) is calculated by the robot.

Step 2. The maximum similarity among the subsequent 50 frames is found first; it corresponds to the frame whose spatial position is nearest to the kth keyframe. Then, the first frame whose similarity drops to 50% of this maximum is extracted as the (k + 1)th keyframe. If all 50 subsequent frames have similarity larger than 50% of the maximum, the 50th frame becomes the (k + 1)th keyframe. As there is 50% similarity between the kth and (k + 1)th keyframes, they can both span a long spatial distance and overlap each other.
Repeating Steps 1 and 2 over all frames of the indoor environment video, the keyframe sequence is extracted, as shown in Figure 10.
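The two-step extraction loop above can be sketched as follows. The similarity function is a placeholder for the content-based matcher, and the synthetic example uses a hypothetical similarity that decays with frame distance:

```python
def extract_keyframes(frames, sim, window=50, ratio=0.5):
    """Keyframe extraction sketch: from each keyframe, scan the next
    `window` frames; the first whose similarity drops to `ratio`
    times the window's maximum becomes the next keyframe."""
    keys = [0]          # the first keyframe is the first video frame
    k = 0
    while k + 1 < len(frames):
        end = min(k + window, len(frames) - 1)
        sims = [sim(frames[k], frames[i]) for i in range(k + 1, end + 1)]
        s_max = max(sims)
        nxt = end       # fall back to the last frame in the window
        for i, s in enumerate(sims, start=k + 1):
            if s <= ratio * s_max:
                nxt = i
                break
        keys.append(nxt)
        k = nxt
    return keys

# Hypothetical similarity: 1 at zero frame distance, 0 beyond 20 frames.
frames = list(range(200))
sim = lambda a, b: max(0.0, 1.0 - abs(a - b) / 20.0)
keys = extract_keyframes(frames, sim)
```

With this synthetic similarity, consecutive keyframes end up a fixed number of frames apart, mirroring the paper's observation that keyframes are widely spaced yet still 50% overlapping.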
The global map is established from these keyframes and consists of two parts: the keyframe sequence and the global position of each keyframe. As the keyframes partially overlap each other, extracting the feature points in the overlap region between neighboring keyframes and bringing them into (5) resolves the relative position relationship of these keyframes, including the heading difference Δθ_{k+1,k} and the position differences Δx_{k+1,k} and Δy_{k+1,k}. Then, chaining these relative positions through (24),

x_{k+1} = x_k + Δx_{k+1,k} cos θ_k − Δy_{k+1,k} sin θ_k,
y_{k+1} = y_k + Δx_{k+1,k} sin θ_k + Δy_{k+1,k} cos θ_k,    (24)

the global position of each keyframe is obtained, where (x_k, y_k) and (x_{k+1}, y_{k+1}) are the global positions of the kth and (k + 1)th keyframes.
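The chaining of relative keyframe poses into global positions, as in (24), is plain dead reckoning. A minimal sketch, assuming each relative pose (Δθ, Δx, Δy) is expressed in the previous keyframe's coordinate frame:

```python
import numpy as np

def accumulate_poses(rel):
    """Chain relative keyframe poses (d_theta, dx, dy) into global
    poses (theta, x, y), rotating each relative translation into the
    global frame by the accumulated heading."""
    theta, x, y = 0.0, 0.0, 0.0
    poses = [(theta, x, y)]
    for d_theta, dx, dy in rel:
        x += dx * np.cos(theta) - dy * np.sin(theta)
        y += dx * np.sin(theta) + dy * np.cos(theta)
        theta += d_theta
        poses.append((theta, x, y))
    return poses

# Two 1 m steps with a 90 deg left turn in between trace an L shape.
poses = accumulate_poses([(np.pi / 2, 1.0, 0.0), (0.0, 1.0, 0.0)])
```

Whether the heading increment is applied before or after the translation of the same step is a convention of this sketch; the paper does not spell out the ordering.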
Experiment Results and Analysis

Through this content-based image matching method, the spatial distance between keyframes can be very large, and the number of keyframes is very small, so the robot can resolve its position quickly by matching against these keyframes. In the experiment, a total of 1710 frames were taken by the robot while studying the indoor environment, and 72 keyframes were extracted by this method.
Through this global map, the robot can localize itself in real time; the image processing of robot localization is shown in Figure 11. When the robot moves in the indoor environment to render service, the content-based image matching method is used to match the real-time robot vision frames with the keyframe sequence in the map and find the most similar keyframe for each vision frame. As in the map establishing process, feature points between the robot vision frame and this keyframe are extracted, and their relative position is resolved through (5). The global position of each vision frame, which is also the robot's global position, is then resolved through (24), just as for the (k + 1)th keyframe, as shown in Figure 13. The environment of the experiment site, including ceiling and wall regions (walls, windows, doors, and furniture), is also described by this keyframe sequence and its global positions, and the position RMSE is less than 0.3 m, as shown in Figure 14.
The robot can make good use of this map to localize itself in the whole building and draw its route through the different rooms and corridors effectively, as shown in Figure 15.
To evaluate the localization precision of this map building method, the corners of the tile floor are taken as ground signs, and the air-conditioning port and ceiling lamps are taken as ceiling signs. The localization RMSE between the robot localization results and these signs is less than 0.5 m, as shown in Table 1.
Table 1 also compares our method with ORB SLAM. The algorithm architecture of our method is similar to that of ORB SLAM, but this paper uses image content matching in place of feature point matching. Through image content matching, the robot can more precisely pick the keyframe from the global map that is most similar to its real-time vision and is seldom disturbed by similar objects in the indoor environment. As the experiment site (Figure 14) includes four parts, two corridors and two rooms, the comparison between our method and ORB SLAM is also divided into four parts, as shown in Table 1.
It can be seen from Figure 14 that there are many similar objects in the experiment site, which makes it a serious test for both our method and ORB SLAM.
When the robot moves in the two rooms, the experimental results of our method are better than those of ORB SLAM. This is because there are many similar objects in the two rooms, such as the air-conditioning outlets and ceiling lamps. The feature points on these similar objects in different rooms are easily mismatched as the same points. Since ORB SLAM needs to match all keyframes in the map, the robot can easily localize itself to the wrong room (the distance between the two rooms is 12 m). Our method, by contrast, takes advantage of the images of dissimilar objects in the different rooms; the interference caused by similar objects is suppressed, and the robot fixes its position precisely in these rooms with fewer localization errors.
While the robot moves in the two corridors, where there are fewer similar objects than in the rooms and ORB can extract feature points more effectively, the experimental results of ORB SLAM are better than those of our method. Table 1 also compares our method with CV-SLAM; the difference is not significant under normal conditions. However, compared with the CV-SLAM method on commercial equipment, our method is immune to kidnapping events, because it builds a global map of the indoor environment and can use this map to fix the robot's position at any time.

The Test for Robot Localization under the Kidnap Condition.
The kidnap problem for robot self-localization can be solved effectively by our method. To test the robot under a more complex kidnap condition, two adjoining rooms (20 m² and 10 m²) and a corridor (5 m) were chosen as the experiment site. Through our method, 16 frames were extracted as keyframes to build the global map of these rooms and the corridor; the position relationship of the keyframe sequence is shown in Figure 16, and the keyframe mosaic result in Figure 17. The rooms and the small corridor restrict the field of view of the robot camera, so the robot cannot fix its position by watching distant landmarks. In the test, the robot was frequently kidnapped: it was taken by the experimenters from one place and suddenly put in another place 2 or 3 m away. Through matching with the global map, the robot was still able to fix its position effectively, especially when suddenly moved from a room to the corridor, as shown in Figure 17, and the position RMSE is less than 0.4 m.

Conclusion
For the problem of robot localization and mapping in indoor environments, this paper presents a keyframes global map establishing method for robot localization through content-based image matching, with the ability to analyze the distortion and overlap of keyframes. The results show that common problems, such as kidnapping or disturbance by similar objects, can be resolved through the content-based image matching method presented in this paper, which is specially designed for indoor environments. In the test, the keyframes global map established by this method describes the indoor environment effectively. Although there are many similar objects in the experiment site, the robot cannot be kidnapped and localizes itself accurately (Figure 18), with a position RMSE of less than 0.5 m.

Figure 1 :
Figure 1: The relationship between wall and upward-looking camera.(a) Robot and camera.(b) Relationship between front wall and side wall and camera.

Figure 2 :
Figure 2: The image processing of content-based image matching.

Figure 3 :
Figure 3: The image processing of rotation and translation.

Figure 4 :
Figure 4: The overlap region extraction progress. (a) Frame A, (b) frame B, (c) the translation result, (d) the rotation result, (e) the overlap between the two frames, and (f) the overlap region mask.

Figure 5 :
Figure 5: The connect lines on the ceiling.(a) The connect lines on the original ceiling.(b) The connect lines on the ceiling after rotation and translation.

Figure 6 :
Figure 6: The image processing of ceiling feature point detection (red points are the feature points). (a) Original feature points in frame A, (b) ceiling point detection result in frame A, (c) original feature points in frame B, and (d) ceiling point detection result in frame B.

Figure 7 :
Figure 7: The overlap region rebuilding process.

Figure 8 :
Figure 8: The rebuilt overlap region from two similar frames. (a) The overlap region in frame A, (b) the overlap region in frame B, and (c) the rebuilt overlap region in frame A.

Figure 9 :
Figure 9: The rebuilt overlap region from two dissimilar frames. (a) Frame A, (b) frame B, (c) the overlap region in frame B, and (d) the rebuilt overlap region in frame A.

Figure 10 :
Figure 10: The image processing of keyframes global map establishing.

Figure 11 :
Figure 11: The image processing of robot localization.

Figure 12 :
Figure 12: The experiment site.(a) The robot moving in the experiment site.(b) The ceiling in the robot vision.

Figure 13 :
Figure 13: The global position relationship of keyframes sequence.

Figure 16 :
Figure 16: The global position relationship of keyframes sequence; " * " is the position of each keyframe.

Figure 17 :
Figure 17: The global map for the kidnap test described by the keyframe sequence.

Figure 18 :