Affine-Invariant Geometric Constraints-Based High Accuracy Simultaneous Localization and Mapping

In this study we describe a new appearance-based loop-closure detection method for online incremental simultaneous localization and mapping (SLAM) using affine-invariant-based geometric constraints. Unlike other pure bag-of-words-based approaches, our proposed method uses geometric constraints as a supplement to improve accuracy. By establishing an affine-invariant hypothesis, the proposed method excludes incorrect visual words and calculates the dispersion of correctly matched visual words to improve the accuracy of the likelihood calculation. In addition, the camera's intrinsic parameters and distortion coefficients are sufficient for this method; 3D measurement is not necessary. We use the mechanism of Long-Term Memory and Working Memory (WM) to manage the memory. Only a limited size of the WM is used for loop-closure detection; therefore the proposed method is suitable for large-scale real-time SLAM. We tested our method using the CityCenter and Lip6Indoor datasets. The proposed method effectively corrects the typical false-positive localizations of previous methods, thus achieving better recall ratios and better precision.


Introduction
Simultaneous localization and mapping (SLAM) is widely used to generate maps for localization or autonomous robotic navigation.
Appearance-based SLAM is characterized by low-cost solutions. Moreover, SLAM based on visual features provides abundant information for use in matching and recognition.
Almost all appearance-based SLAM methods are pure bag-of-words approaches that extract SIFT [1] or SURF [2] descriptors from images and then match the descriptors, by brute force, by the nearest neighbor distance ratio (NNDR) [3], and so forth, to calculate the likelihood between two locations.
The biggest challenge in improving the precision and recall ratio of loop-closure detection is that, during loop-closure hypothesis selection, an incorrect localization can receive a higher score than the correct one. This results in the acceptance of false-positive localizations and the rejection of correct localizations (false negatives).
The likelihood calculation between two places is the most decisive factor in establishing a loop-closure hypothesis. But in many conditions a pure bag-of-words approach cannot effectively calculate the likelihood between places.
Our proposed method attempts to improve the likelihood calculation results by appending the geometric constraints of visual words to the classic pure bag-of-words likelihood calculation. The geometric constraints include order and acreage constraints, which are designed as affine-invariant geometric constraints. Therefore, the proposed method can work well even when the viewpoint changes significantly. This method uses a memory management approach similar to those in [4,5] for real-time processing and uses SURF for its visual descriptors; descriptors are matched by NNDR [3].
In this paper, we describe a more accurate likelihood calculation that improves loop-closure detection performance. By establishing an affine-invariant hypothesis, the proposed method excludes incorrectly matched visual words and calculates the dispersion of correctly matched visual words to improve the accuracy of the likelihood calculation. Section 2 reviews some previous pure bag-of-words-based approaches and their typical problems. In Section 3, we describe the proposed method. Section 4 presents our experimental results. In Section 5, we discuss our proposed method's advantages, disadvantages, and outlook.
Cummins and Newman [6] proposed a rapid method based on the probabilistic bailout condition for appearance-only SLAM. But this approach's precision and recall ratio are not satisfactory.
Kawewong et al. [9] proposed a method that tracks robust features in a sequence of images, called position-invariant robust features (PIRF). They also proposed two online incremental appearance-only SLAM methods based on PIRF, PIRF-nav [7] and PIRF-nav2 [8]. Owing to PIRF's robustness, PIRF-nav and PIRF-nav2 perform satisfactorily in dynamic environments. Compared with the method in [6], the precision and recall ratio also improved significantly.
However, the processing time of PIRF-nav and PIRF-nav2 for loop-closure detection cannot be controlled very well: it increases as the map's scale increases. In addition, because PIRF keeps only robust features that persist across a sequence of images, many useful features are ignored. This can cause a significant loss of visual features, particularly in low-resolution indoor datasets such as Lip6Indoor [10]. Thus, it is difficult to improve the performance of PIRF-nav and PIRF-nav2.
Labbé and Michaud [4,5] proposed a method based on a short-term memory (STM) and Long-Term Memory mechanism, called RTAB-map. It controls the processing time of SLAM effectively, so that the processing time does not increase when the map's scale increases.
However, because of the problems shown in Figure 1, it is difficult to improve RTAB-map's recall ratio.
RTAB-map is the best vision-only SLAM method currently available and probably represents the limit of performance possible for pure bag-of-words approaches.
FAB-MAP3D [11] is a SLAM method that combines a pure bag-of-words approach with 3D geometric constraints. It works better than FAB-MAP [6], but it requires 3D measurement information about each visual word.
The proposed method attempts to design geometric constraints for appearance-only SLAM without any 3D measuring.

Figure 2 (panel labels): affine-invariant features of two locations; calculate two places' proportion of acreage; build likelihood of two locations.
Unlike RANSAC [12] and PROSAC [13], the proposed method estimates an affine-invariant hypothesis and calculates the likelihood between two places without any random elements. Thus the proposed method is more stable and better suited for situations in which only a few words match.

Proposed Method
This section presents our new likelihood calculation method with geometric constraints. We also include a brief explanation of the loop-closure hypothesis selection. Figure 2 shows the likelihood calculation of the proposed method.
3.1. Image Undistortion.
Sometimes a camera lens causes significant distortions; undistorted images are necessary to establish an affine-invariant hypothesis.
To produce undistorted images, we must establish the camera's requisite intrinsic parameters $(f_x, f_y, c_x, c_y)$, radial distortion coefficients $(k_1, k_2)$, and tangential distortion coefficients $(p_1, p_2)$ by calibration. It is easy to calibrate a camera and undistort its images using OpenCV [15]. The intrinsic parameters and distortion coefficients are fixed for a given camera. More details are available in the OpenCV documentation.
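The undistortion step can be illustrated with a minimal sketch of the radial and tangential lens model documented by OpenCV, inverted by fixed-point iteration for a single pixel. This is not the paper's implementation (which relies on OpenCV itself); the function names `distort` and `undistort_pixel` and all numeric values are assumptions chosen for illustration.

```python
def distort(x, y, k1, k2, p1, p2):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion
    to a normalized image point (x, y)."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return xd, yd


def undistort_pixel(u, v, fx, fy, cx, cy, k1, k2, p1, p2, iterations=20):
    """Map a distorted pixel (u, v) to its undistorted pixel position.

    The distortion model has no closed-form inverse, so the normalized
    point is refined by fixed-point iteration (adequate for mild distortion).
    """
    # Pixel coordinates -> normalized (distorted) coordinates.
    xd = (u - cx) / fx
    yd = (v - cy) / fy
    x, y = xd, yd  # initial guess: no distortion
    for _ in range(iterations):
        r2 = x * x + y * y
        radial = 1.0 + k1 * r2 + k2 * r2 * r2
        dx = 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
        dy = p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
        x = (xd - dx) / radial
        y = (yd - dy) / radial
    # Normalized coordinates -> pixel coordinates.
    return fx * x + cx, fy * y + cy


# Hypothetical calibration values for a 240 x 192 image (illustration only).
FX, FY, CX, CY = 300.0, 300.0, 120.0, 96.0
K1, K2, P1, P2 = -0.2, 0.05, 0.001, -0.0005
```

In practice the whole image is remapped once per frame with the calibrated parameters; the per-pixel version above only makes the role of each coefficient explicit.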
Since the real world is not flat, real world images do not strictly abide by the affine-invariant constraint.However, for the most part, landmarks in images can be considered to be in a flat environment.

Order Constraint.
We designed a distance order constraint to exclude incorrectly matched visual words.
Figure 3 illustrates an example of incorrect matching. For each matched word, we first calculate its relative distance vector, which lists the other matched words sorted from nearest to farthest; the same is done for its counterpart in the other image. In the example, each vector has four entries, and the vectors of every pair other than the incorrectly matched one are identical in the two images.
We designed an offset-based linear formula to calculate the difference, diff, between the relative distance vectors of a matched pair. The offset's definition is shown in Figure 5, and diff is defined in (1). In Figure 4, the two example pairs yield diff = 0.25 and diff = 11/16. A higher diff indicates a higher probability of incorrect matching, so diff can be used to distinguish correctly and incorrectly matched visual words.
Please note that diff is not an affine-invariant quantity and is sensitive to the percentage of noise. We therefore cannot set a fixed threshold for eliminating incorrectly matched visual words in large-scale SLAM. Using a normalized value, $\mathrm{diff}_{\mathrm{normal}}$, is one candidate: we normalize the diffs using their mean $\mu_{\mathrm{diff}}$ and standard deviation $\sigma_{\mathrm{diff}}$.
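The idea of an offset-based difference followed by normalization can be sketched as follows. The exact form of expression (1) is not reproduced here, so `order_diff` is a hypothetical stand-in (sum of position offsets divided by the squared vector length) and is not claimed to reproduce the values quoted for Figure 4. The normalization is assumed to be a standard z-score.

```python
from statistics import mean, pstdev


def order_diff(order_a, order_b):
    """Hypothetical stand-in for expression (1): sum of position offsets
    between two relative distance vectors, divided by n^2."""
    n = len(order_a)
    position_in_b = {word: idx for idx, word in enumerate(order_b)}
    total_offset = 0
    for idx, word in enumerate(order_a):
        if word in position_in_b:
            total_offset += abs(idx - position_in_b[word])
        else:
            # Assumption: a word missing from order_b gets the maximum offset n.
            total_offset += n
    return total_offset / (n * n)


def normalize(diffs):
    """Z-score normalization of the diffs by their mean and standard deviation."""
    mu = mean(diffs)
    sigma = pstdev(diffs)
    if sigma == 0.0:
        return [0.0 for _ in diffs]  # all pairs equally consistent
    return [(d - mu) / sigma for d in diffs]


# Three consistent pairs and one pair whose neighbor order is reversed.
diffs = [
    order_diff(list("bcde"), list("bcde")),
    order_diff(list("bcde"), list("bcde")),
    order_diff(list("bcde"), list("cbde")),
    order_diff(list("bcde"), list("edcb")),
]
normalized = normalize(diffs)
```

Because the normalized values are relative to the other pairs in the same image pair, a single cut-off on them is less sensitive to the overall noise level than a cut-off on the raw diff.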
Our proposed method uses kd-tree-based [16] FLANN [17] to establish the relative distance vectors when descriptors are extracted. All extracted words are used to establish these vectors, which are retained for further queries. When the order constraint is processed, new vectors are established by eliminating all mismatched words from the retained vectors, and expression (1) is calculated on the new vectors. The original vectors do not change.
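A minimal sketch of building and reusing relative distance vectors is given below. It substitutes a brute-force sort for the kd-tree-based FLANN search used in the paper, which is acceptable only for small examples; the function names and the sample coordinates are assumptions.

```python
import math


def relative_distance_vectors(points):
    """points: dict mapping word id -> (x, y) image position.

    Returns a dict mapping each word id to the list of all other word ids,
    sorted from nearest to farthest (brute force, not FLANN).
    """
    vectors = {}
    for wid, (x, y) in points.items():
        others = [
            (math.hypot(ox - x, oy - y), oid)
            for oid, (ox, oy) in points.items()
            if oid != wid
        ]
        others.sort()
        vectors[wid] = [oid for _, oid in others]
    return vectors


def restrict(vector, kept_ids):
    """Build a new vector that keeps only the ids in kept_ids, preserving
    the nearest-to-farthest order. The retained vector is left unchanged."""
    return [wid for wid in vector if wid in kept_ids]


words = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (3.0, 0.0), "d": (0.0, 5.0)}
retained = relative_distance_vectors(words)
# Vector of word "a" restricted to the words matched in the other image.
for_order_constraint = restrict(retained["a"], {"b", "d"})
```

Computing the vectors once per image and filtering them per query avoids repeating the neighbor search for every candidate location.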
In Figure 4, the incorrectly matched pair is excluded, and we obtain a corrected set of four matched words in each image. However, the order constraint alone is not strict enough for a highly accurate likelihood calculation. We designed an acreage constraint to establish an affine-invariant hypothesis based on these corrected sets.

Acreage Constraint.
An example of an affine invariant is illustrated in Figure 6. Although the coordinates of the matched words change significantly from one image to the other, the proportional relationship of the acreage illustrated in the figure does not change; that is, the ratio between any two areas spanned by the words in one image equals the ratio between the corresponding areas in the other image.
Therefore, when the affine-invariant proportional relationship of acreage has been found, an affine-invariant hypothesis can be established.
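The invariance of area ratios can be checked numerically. An affine map multiplies every area by the same factor (the absolute determinant of its linear part), so ratios of areas are preserved. The sketch below uses arbitrary sample points and an arbitrary affine map; none of the values come from the paper.

```python
def triangle_area(p, q, r):
    """Area of the triangle p, q, r from the 2D cross product."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])
    return abs(cross) / 2.0


def affine(point, a, b, c, d, tx, ty):
    """Apply x' = a*x + b*y + tx, y' = c*x + d*y + ty."""
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


# Four sample word positions in the first image.
A, B, C, D = (0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (5.0, 5.0)

# The same words after an arbitrary affine change of viewpoint
# (rotation, shear, and scale combined; determinant 2*3 - 1*(-1) = 7).
params = (2.0, 1.0, -1.0, 3.0, 10.0, -4.0)
A2, B2, C2, D2 = (affine(p, *params) for p in (A, B, C, D))

ratio_before = triangle_area(A, B, C) / triangle_area(A, B, D)
ratio_after = triangle_area(A2, B2, C2) / triangle_area(A2, B2, D2)
```

The individual areas grow by the determinant of the map, while `ratio_before` and `ratio_after` agree, which is the property the acreage constraint relies on.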
We propose a method to establish an affine-invariant hypothesis based on the results of the order constraint. First, we calculate a total area, which is defined using the center of gravity of the corrected set of words. We then define the deviation Dev of two pairs of visual words according to an affine-invariant hypothesis. Dev is set to 0 if it is smaller than a threshold, and this establishes an affine-invariant hypothesis. Because this threshold is a robust affine-invariant threshold, a fixed value is suitable for large-scale SLAM.
In fact, the center of gravity is important in the establishment of the affine-invariant hypothesis. Dev is meaningful only if the center of gravity is built from visual words that obey the affine-invariant constraint. After processing the order constraint, the incorrectly matched words have been eliminated, but the noise remains.
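One way to picture a deviation measure built on the center of gravity is sketched below. The definitions here are assumptions and not the paper's formulas: each pair of words forms a triangle with the center of gravity, its area is expressed as a fraction of the total of all such areas, and the deviation of a pair is the difference between its fractions in the two images.

```python
from itertools import combinations


def centroid(points):
    """Center of gravity of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def triangle_area(p, q, r):
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])
    return abs(cross) / 2.0


def area_fractions(points):
    """Fraction of the total area contributed by each word pair (i, j),
    using triangles (center of gravity, word i, word j)."""
    g = centroid(points)
    areas = {
        (i, j): triangle_area(g, points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }
    total = sum(areas.values())
    if total == 0.0:
        raise ValueError("all words are collinear; no area to compare")
    return {pair: area / total for pair, area in areas.items()}


def deviations(points_a, points_b):
    """Hypothetical deviation per word pair between two images."""
    fa, fb = area_fractions(points_a), area_fractions(points_b)
    return {pair: abs(fa[pair] - fb[pair]) for pair in fa}


place = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
# Same words under an affine map: x' = 2x + y + 10, y' = -x + 3y - 4.
affine_view = [(2 * x + y + 10, -x + 3 * y - 4) for x, y in place]
# Same words with the last one displaced (not an affine image of `place`).
noisy_view = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (8.0, 8.0)]
```

Under these assumed definitions the deviations vanish for `affine_view` and become clearly non-zero for `noisy_view`, which also shows why a center of gravity computed from a contaminated word set makes the measure unreliable.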

Likelihood Calculation.
After the above processing (the order constraint followed by the acreage constraint), only correctly matched words remain. It is now possible to calculate a geometric constraints-based likelihood between the testing place and the current place. This likelihood involves two quantities, Affine and Dispersion.
Affine is the proportion between the size of the corrected set of words and the number of matched word pairs between the two places, so that 0 ≤ Affine ≤ 1.
Dispersion is a parameter, based on the dispersion of the affine-invariant words, for estimating the likelihood between two places; likewise, 0 ≤ Dispersion ≤ 1.
In [4,5], the likelihood calculation formula is

$$
s(z_t, z_c) =
\begin{cases}
N_{\mathrm{pair}} / N_{z_t}, & \text{if } N_{z_t} \ge N_{z_c},\\
N_{\mathrm{pair}} / N_{z_c}, & \text{if } N_{z_t} < N_{z_c},
\end{cases}
$$

where $N_{\mathrm{pair}}$ is the number of matched word pairs and $N_{z_t}$ and $N_{z_c}$ are the total number of words of the signature $z_t$ and the compared signature $z_c$, respectively. However, since this method attempts to obtain a low likelihood, it may cause a false-negative localization. But for pure bag-of-words-based approaches, because precision is hard to control, there is no alternative but to choose low likelihoods. We propose a new likelihood calculation method that combines this formula with the geometric constraints-based likelihood.

Figure 6: The corresponding pairs shown are those processed by the acreage constraint. The centers of gravity of the two word sets are used together with the word sets to establish a likelihood between two locations. The left part of the figure is the first step of the algorithm, which obtains credible estimates for the two images. It converges when the sizes of the temporary word sets no longer change. Convergence is achieved rapidly (after 2 to 3 loops) because, while the credible estimates are being established, some correct matches may be rejected (overfitting). The right part of Figure 6 presents the processing that attempts to retrieve incorrectly rejected matches on the basis of the credible estimates.

This likelihood calculation method is fairer than that in [4,5]. In addition, since geometric constraints are added to the calculation, the proposed method achieves better accuracy.
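For comparison, the pure bag-of-words likelihood of [4,5], as we read it, divides the number of matched word pairs by the larger of the two word counts. The sketch below illustrates only this baseline and why it ignores geometry; the function name and the sample counts are assumptions, and the combined likelihood of the proposed method is not reproduced.

```python
def bow_likelihood(n_pair, n_words_t, n_words_c):
    """Baseline likelihood: matched pairs divided by the larger word count.

    Dividing by the larger count yields the lower of the two possible
    ratios, which is the conservative choice discussed in the text.
    """
    largest = max(n_words_t, n_words_c)
    if largest == 0:
        return 0.0  # no words in either signature
    return n_pair / largest


# Two signatures with 100 and 60 words sharing 30 matched pairs.
score = bow_likelihood(30, 100, 60)

# The score depends only on counts: 30 matches concentrated in one corner
# of the image score the same as 30 matches spread over the whole place.
same_score = bow_likelihood(30, 60, 100)
```

Since the baseline sees only counts, a query that matches just a part of a place cannot be told apart from a genuine revisit, which is the gap the geometric terms are meant to close.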

Brief Summary of Loop-Closure Hypothesis Selection.
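The proposed method uses a loop-closure hypothesis selection method similar to that in [5]. We update the Bayesian filter by the following recursion formula:

$$
p(S_t \mid L^t) = \eta \, p(L_t \mid S_t) \sum_{i=-1}^{t_n} p(S_t \mid S_{t-1} = i) \, p(S_{t-1} = i \mid L^{t-1}),
$$

where $\eta$ is a normalization term and $t_n$ is the index of the newest location considered, as in [4,5]; $S_t = i$ is the probability that $L_t$ closes a loop with a past location $L_i$, and $S_t = -1$ is the probability that the current place in the STM is a new place. $p(L_t \mid S_t)$ is important for this formula: it is the likelihood normalized by the mean $\mu$ and standard deviation $\sigma$, and it is the term that the proposed method significantly affects. The term $\sum_{i=-1}^{t_n} p(S_t \mid S_{t-1} = i)\, p(S_{t-1} = i \mid L^{t-1})$ is not described in detail here owing to space limitations; please refer to [4,5].
When the posterior probability of the new-place hypothesis $S_t = -1$ is lower than the loop-closure threshold $T_{\mathrm{loop}}$, the loop-closure hypothesis will be accepted.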
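The filter update and the acceptance rule can be sketched as below. This is a simplified illustration under stated assumptions: the set of states is the same before and after the update, the transition model is uniform (the actual model is described in [4,5]), and the names `bayes_update`, `select_loop_closure`, and `t_loop` are hypothetical.

```python
def bayes_update(prior, likelihood, transition):
    """One step of a discrete Bayesian filter.

    prior:      dict state -> probability from the previous step
    likelihood: dict state -> likelihood of the current observation
    transition: function (new_state, old_state) -> transition probability
    State -1 stands for "the current place is a new place".
    """
    unnormalized = {}
    for new_state in likelihood:
        predicted = sum(
            transition(new_state, old_state) * p for old_state, p in prior.items()
        )
        unnormalized[new_state] = likelihood[new_state] * predicted
    eta = sum(unnormalized.values())
    return {state: value / eta for state, value in unnormalized.items()}


def select_loop_closure(posterior, t_loop):
    """Accept the most probable past location when the new-place
    probability falls below the threshold; otherwise return None."""
    if posterior[-1] < t_loop:
        candidates = {s: p for s, p in posterior.items() if s != -1}
        return max(candidates, key=candidates.get)
    return None


states = [-1, 0, 1]
prior = {-1: 0.9, 0: 0.05, 1: 0.05}
likelihood = {-1: 1.0, 0: 1.0, 1: 8.0}  # location 1 looks very similar


def uniform(new_state, old_state):
    # Assumption: uniform transition model, for illustration only.
    return 1.0 / len(states)


posterior = bayes_update(prior, likelihood, uniform)
accepted = select_loop_closure(posterior, t_loop=0.2)
```

The sketch makes visible why the likelihood term dominates the decision: with a flat prediction, the posterior simply follows the normalized likelihoods.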

Please note that when  is too high the loop-closure hypothesis will be rejected, although the probability of a high loop-closure hypothesis is very high. This may cause false-positive localizations.

Experiments
We performed our calculations using a MacBook Pro (i7, 16 GB RAM). The application is written in C++. We tested our method on two well-known datasets: Lip6Indoor and CityCenter.
4.1. Lip6Indoor.
Figure 7 shows a typical false-positive localized place that occurs in pure bag-of-words approaches such as [5,6,8]. After processing by the proposed method, the geometric constraints for the two places give a Dispersion value of 0.12, and the false-positive loop-closure hypothesis is rejected.
Compared with [5], the proposed method improved the recall ratio by only 1.55%. But as the recall ratio increases in [5], precision decreases rapidly. At a 100% recall ratio, the precision of [5] is 63%, but the precision of our proposed method is 87.5%. Table 1 shows the results for the Lip6Indoor dataset. The methods in [6,8] are faster than [5], but their recall ratio is low. Compared with [5], the proposed method requires an average of 53.5 ms of additional processing time per frame. The maximum processing time for one frame using our proposed method is 825.3 ms. Because this dataset is captured at 1 Hz, the proposed method can be processed in real time.

CityCenter.
In the CityCenter dataset, since our method controls precision effectively, we obtained a higher recall ratio. The resolution of the 2474 images in this dataset is 640 × 480. The images were captured in pairs, two simultaneously at each location.
The recall ratio cannot be increased further because in some scenes (such as jungles) there are too many similar words. The proposed method failed in these types of scenes: with too many incorrectly matched pairs, a bad affine-invariant hypothesis was established. Table 2 shows the results for the CityCenter dataset.
The maximum processing time for one frame of the proposed method is 1780.7 ms. The dataset is captured at approximately 0.5 Hz, so the proposed method can also be processed in real time on this dataset.
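The real-time claims for both datasets follow from comparing the worst-case frame time with the time available between frames. The short check below uses the figures reported in the text; the CityCenter rate is only approximate, so its headroom is indicative, and the function names are assumptions.

```python
def frame_budget_ms(capture_rate_hz):
    """Time available per frame, in milliseconds."""
    return 1000.0 / capture_rate_hz


def headroom_ms(max_processing_ms, capture_rate_hz):
    """Spare time per frame in the worst case (negative means too slow)."""
    return frame_budget_ms(capture_rate_hz) - max_processing_ms


# Worst-case processing times and capture rates reported in the text.
lip6_headroom = headroom_ms(825.3, 1.0)         # Lip6Indoor, 1 Hz
citycenter_headroom = headroom_ms(1780.7, 0.5)  # CityCenter, about 0.5 Hz
```

Both values are positive (roughly 175 ms and 219 ms), so even the slowest frame finishes before the next one arrives at these capture rates.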

Conclusion and Future Studies
These experiments showed that our proposed method can work better than pure bag-of-words-based SLAM approaches. They indicate that 2D geometric constraints are an effective way to break the bottleneck and improve the accuracy of appearance-based SLAM.
Although the proposed method works well for the most part, it cannot handle some problems. In particular, one typical problem is too many similar words in the same image. Methods to solve this problem are being considered. One possible solution is to increase the NNDR [3] threshold so that repeated features are avoided more effectively. This step should reduce the false-positive ratio of descriptor matching, but it also causes more correct matches to be rejected (false negatives). Next, the proposed method would construct an affine-invariant hypothesis based on the features matched with the higher threshold. Lastly, the rejected repeated features would be tested against the affine-invariant hypothesis to retrieve potentially correct matches.
Today, high-performance handheld smartphones are very popular. Because the proposed method achieves high robustness without requiring any 3D measurement, it may be applied on handheld devices and many other types of platforms, for example, for pedestrian navigation.
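The trade-off controlled by the NNDR threshold can be seen in a minimal sketch of the ratio test. The convention here is an assumption: a match is kept when the nearest distance divided by the second-nearest distance is below `max_ratio`, so a stricter test corresponds to a lower `max_ratio`; the paper's threshold may be defined the other way around. The descriptors are toy 2D points rather than SURF vectors.

```python
import math


def nndr_match(query, candidates, max_ratio=0.8):
    """Nearest neighbor distance ratio test.

    Returns the index of the best candidate, or None when the best match
    is not sufficiently better than the second best (ambiguous match).
    """
    if len(candidates) < 2:
        raise ValueError("the ratio test needs at least two candidates")
    ranked = sorted((math.dist(query, c), idx) for idx, c in enumerate(candidates))
    (d1, best), (d2, _) = ranked[0], ranked[1]
    if d2 == 0.0:
        return None  # identical descriptors: cannot disambiguate
    return best if d1 / d2 < max_ratio else None


query = (0.0, 0.0)
distinct = [(1.0, 0.0), (10.0, 0.0), (0.0, 20.0)]  # one clear nearest neighbor
repeated = [(1.0, 0.0), (0.0, 1.1)]                # two near-identical features

clear_match = nndr_match(query, distinct)
strict_result = nndr_match(query, repeated, max_ratio=0.8)
loose_result = nndr_match(query, repeated, max_ratio=0.95)
```

With the stricter setting the repeated feature is rejected, and with the looser one it is accepted. Rejected candidates of this kind are the ones the last step of the proposal would re-test against the affine-invariant hypothesis.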

Figure 1: Examples of two problems that occur in [4,5,7,8] on the CityCenter [14] dataset. (1) Previous methods' false-positive localized location: many incorrect matches between different locations. Because too many words are incorrectly matched between two different locations, previous pure bag-of-words systems treat two different locations as the same place. With the proposed method, 0 matches are accepted, so the proposed method can solve this problem. (2) Previous methods' false-negative localized place: apart from the ground truth, there are incorrect matches with too many different locations. The panel shows a raw-likelihood (without normalizing) comparison between [5] and the proposed method. Although the ground truth obtains the highest likelihood, the noise from incorrect matches with too many different locations causes previous pure bag-of-words systems to reject the ground truth incorrectly. The proposed method suppresses the noise and retains the peak of the ground truth, and thus it can accept the ground truth correctly.

Figure 2: Framework of likelihood calculation on the basis of geometric constraints.

Figure 5: An example of an affine invariant. Affine transformations include rotation, shearing, translation, and scaling; this example includes rotation and scaling. The two sets of four words shown are the matched visual words between the two images, and the center of gravity of each set of words is marked.

Figure 7: Typical false-positive localized place in the Lip6Indoor dataset. Although the query place (right) matches only a part of the localized place (left), pure bag-of-words approaches do not calculate likelihood based on matched acreage.
Figure 4: Calculation of the offset. First, for a word in one relative distance vector, find its corresponding matched word in the other vector; then calculate the offset between the corresponding IDs.
In the Lip6Indoor example of Figure 7, the false-positive loop-closure hypothesis is rejected by the proposed method. The resolution of the 388 images in this dataset is 240 × 192.