Visual Localization by Place Recognition Based on Multifeature (D-λLBP++HOG)

Visual localization is widely used in autonomous navigation systems and Advanced Driver Assistance Systems (ADAS). This paper presents a visual localization method based on multifeature fusion and disparity information using stereo images. We integrate disparity information into complete center-symmetric local binary patterns (CSLBP) to obtain a robust global image description (D-CSLBP). In order to represent the scene in depth, the fusion of D-CSLBP and HOG features provides valuable information and reduces the effect of typical place recognition problems such as perceptual aliasing. It improves visual recognition performance by taking advantage of depth, texture, and shape information. In addition, for real-time visual localization, locality sensitive hashing (LSH) is used to compress the high-dimensional multifeature into binary vectors, which speeds up image matching. To show its effectiveness, the proposed method is tested and evaluated on real datasets acquired in outdoor environments. The obtained results show that our approach allows more effective visual localization than the state-of-the-art method FAB-MAP.


Introduction
One of the prerequisites of navigation is to make the vehicle or robot able to reliably determine its position within its environment. With the wide use of cameras, a variety of approaches have been proposed to address the challenges of visual localization based on place recognition [1,2].
The FAB-MAP (Fast Appearance Based Mapping) method [3] can be considered a milestone in the field of visual localization. The FAB-MAP approach matches the appearance of the current scene to the same (or a similar) previously visited place by converting the images into bag-of-words representations built on local features such as SIFT or SURF.
Recently, binary image descriptors, which encode patch appearance using compact binary strings with low memory requirements, have been widely used in image description and visual recognition [4]. In local feature based place recognition approaches, an image is represented as a collection of local features; the robustness of these approaches to local image variations comes from this representation and from the discriminative power of the descriptors. Nevertheless, most of these works exhibit a high computation cost or require complex feature extraction for image matching [5,6]. Also, few works pay attention to depth information for visual place recognition.
The advantages of binary descriptors are that they are invariant to monotonic changes in gray-scale and fast to calculate. One typical binary descriptor is LBP (local binary pattern) [7]. Since it was first proposed in 1996, several new variants of binary descriptors have been proposed [8,9]. They show great invariance to monotonic illumination changes, do not require many parameters to be set, and have a high discriminative power. However, most of them are unfortunately not efficient for background modeling or place description because of their sensitivity to noise or illumination. In this paper, the most relevant binary descriptors for visual place recognition that will be tested and compared in our approach are LBP, CLBP (complete local binary pattern) [10], CSLBP (center-symmetric local binary patterns) [11], CSLDP (center-symmetric local derivative pattern) [12], and XCSLBP (extended CSLBP) [13]. These different local binary descriptors are collectively denoted as LBP. Besides local binary features, histograms of oriented gradients (HOG) features have also been successfully used in various vision tasks such as object classification, image search, and scene classification [14]. Wang et al. [15] combine HOG and LBP and propose a novel human detection approach capable of handling partial occlusion. For such applications, HOG is one of the best features to capture edge or local shape information, which provides a rough description (shape information) of the scene.
Considering the robust and strong image representation ability of binary descriptors and the HOG feature, we expect that their combination provides more useful information and therefore improves place recognition performance. In this paper, stereo images are used for visual place recognition. A novel localization approach is proposed which uses multifeature fusion by combining HOG and binary features (LBP), as shown in Figure 1. HOG features are obtained from the gray-scale image, while LBP features are built from both the gray-scale image and the disparity map. The features are first extracted from the blocks composing the gray-scale image and the disparity map and then concatenated. We extend the application of the LBP descriptor to the disparity map in order to incorporate disparity information in the image representation by simply concatenating the two descriptors (LBP from the gray-scale image and LBP from the disparity map). This produces a new descriptor named D-LBP. The integration of disparity information in the image representation provides depth information, which should be helpful for place recognition, especially in complex environments. Indeed, image description using LBP and HOG features together with depth information reduces the perceptual aliasing problems related to visual place recognition. As will be shown in our experiments, feature combination achieves better recognition performance than any single feature. The performance of place recognition is also compared with the state-of-the-art FAB-MAP algorithm: the $F_1$ scores achieved by our approach on the four tested datasets are better than those resulting from the FAB-MAP method. Furthermore, considering that the comparison of high-dimensional multifeatures is time-consuming, locality sensitive hashing is applied to the multifeatures to speed up feature comparison and image matching.
The most important contributions introduced in this paper are the following:
(i) An innovative method for place recognition based visual localization using a multifeature descriptor (D-LBP++HOG) extracted from the gray-scale image and the disparity map. The proposed multifeature descriptor takes advantage of texture, depth, and shape information and hence performs better than a single feature (see Section 5.2).
(ii) A study of the impact of the image block size on the binary descriptors. A binary descriptor extracted from small blocks is better at discriminating the local details of different locations, while a large block size for image representation may cause the loss of some discriminative information (see Section 5.1).
(iii) A speed-up of the place recognition method, achieved by approximating the Euclidean distance between features with the Hamming distance over bit vectors obtained by locality sensitive hashing (see Section 5.3).
The rest of this paper is organized as follows. First, the LBP descriptor and several of its variants, as well as the HOG feature, are introduced in Section 2. Then, in Section 3, the proposed approach is described in detail. Section 4 presents the tested datasets and the performance evaluation parameters used. The obtained results are presented and discussed in Section 5. Finally, conclusions and future work close this paper (Section 6).

Overview of Used Image Descriptors
In this part, some of the state-of-the-art image descriptors used and compared in the proposed approach are described.

LBP (Local Binary Pattern).
LBP is a texture descriptor that codifies local primitives (such as curved edges, spots, and flat areas) into a feature histogram. The original LBP operator labels the pixels of an image with decimal numbers, called local binary patterns or LBP codes, which encode the local structure around each pixel [8].
As illustrated in Figure 2, the gray level of each pixel is compared with those of its eight neighbors in a 3 × 3 region by subtracting the center pixel value. The resulting strictly negative values are encoded with 0 and the others with 1. A binary number is obtained by concatenating all these binary codes, and its corresponding decimal value is used for labeling the central pixel. In Figure 3, examples of neighborhoods used for the LBP operator are illustrated.

The generalized LBP definition uses $P$ sample points evenly distributed on a circle of radius $R$ around a center pixel located at $(x_c, y_c)$. The position $(x_p, y_p)$ of the neighboring points, where $p \in \{0, \ldots, P-1\}$, is given by

$$x_p = x_c + R\cos\left(\frac{2\pi p}{P}\right), \qquad y_p = y_c - R\sin\left(\frac{2\pi p}{P}\right).$$

The local binary code for the position $(x_c, y_c)$ can be computed by comparing the gray-scale value $g_c$ of this center pixel and the gray-scale values $g_p$ of its neighbor pixels located at $(x_p, y_p)$, where $p \in \{0, \ldots, P-1\}$. The value of the LBP code of the center pixel at position $(x_c, y_c)$ is given by

$$\mathrm{LBP}_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p,$$

where $s$ is the Heaviside function:

$$s(x) = \begin{cases} 1, & x \ge 0, \\ 0, & \text{otherwise}. \end{cases}$$

The operator $\mathrm{LBP}_{P,R}$ produces $2^P$ different output values, corresponding to the $2^P$ different binary patterns formed by the $P$ pixels in the neighborhood. Although this method can capture the relations of nearby and adjacent pixels, it leads to a large data dimension. Ojala et al. [7] further propose "uniform patterns" to reduce the dimension of the LBP feature while keeping its discrimination power. For this, a uniformity measure of a pattern is used: $U$("pattern") is the number of bitwise transitions from 0 to 1 or vice versa when the bit pattern is considered circular. The $U$ value of an LBP pattern can be computed by

$$U(\mathrm{LBP}_{P,R}) = \left| s(g_{P-1} - g_c) - s(g_0 - g_c) \right| + \sum_{p=1}^{P-1} \left| s(g_p - g_c) - s(g_{p-1} - g_c) \right|.$$

The LBP codes are then accumulated in a histogram with bins $k \in [0, K]$, where $K$ is the maximal LBP pattern value. With uniform patterns (those with $U \le 2$, all other patterns sharing a single bin), the length of the histogram is $P(P-1)+3$.
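The LBP code and the uniformity measure described above can be illustrated with a minimal sketch in plain Python. The function names `lbp_code` and `uniformity`, the image layout (a list of rows indexed as `img[y][x]`), and the nearest-pixel rounding of the sample positions are assumptions made for illustration only; practical implementations usually interpolate the sample points.

```python
import math

def lbp_code(img, xc, yc, P=8, R=1):
    """LBP_{P,R} code of the pixel at (xc, yc); img is a list of rows (img[y][x])."""
    gc = img[yc][xc]
    code = 0
    for p in range(P):
        # Sample point p on the circle of radius R.
        # Assumption: nearest-pixel rounding instead of bilinear interpolation.
        xp = xc + int(round(R * math.cos(2 * math.pi * p / P)))
        yp = yc - int(round(R * math.sin(2 * math.pi * p / P)))
        if img[yp][xp] - gc >= 0:  # s(g_p - g_c) = 1 when the difference is >= 0
            code |= 1 << p
    return code

def uniformity(code, P=8):
    """Number of 0/1 transitions when the P-bit pattern is read circularly."""
    bits = [(code >> p) & 1 for p in range(P)]
    return sum(bits[p] != bits[p - 1] for p in range(P))  # bits[-1] closes the circle

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_code(patch, 1, 1)
# Uniform patterns (U <= 2) get one bin each; all other patterns share one bin.
n_bins = sum(1 for c in range(2 ** 8) if uniformity(c) <= 2) + 1
```

For $P = 8$, counting the patterns with $U \le 2$ and adding the shared bin gives the histogram length $P(P-1)+3 = 59$ stated in the text.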

CLBP (Complete Local Binary Pattern).
The LBP feature considers only the signs of local differences (i.e., the difference of each pixel with its neighbors), whereas the CLBP feature [10] considers both the magnitude (M) and the sign (S) of local differences as well as the original center gray level value (C). Consequently, three operators, namely, CLBP_M, CLBP_S, and CLBP_C, are used to code the magnitude, sign, and center gray level. Given the gray-scale value $g_c$ of the center pixel $(x_c, y_c)$ and its $P$ circularly and evenly spaced neighbors with gray-scale values $g_p$, $p \in \{0, \ldots, P-1\}$, the difference between $g_p$ and $g_c$ can be simply calculated using $d_p = g_p - g_c$. The local difference vector $[d_0, d_1, \ldots, d_{P-1}]$ characterizes the image local structure at $(x_c, y_c)$. Because the central gray level $g_c$ is removed in the local difference vector, $[d_0, d_1, \ldots, d_{P-1}]$ is robust to illumination changes and is more efficient in pattern matching. $d_p$ can be further decomposed into two components:

$$d_p = s_p \cdot m_p, \qquad s_p = \operatorname{sign}(d_p), \qquad m_p = |d_p|,$$

where $s_p$ is the sign component of $d_p$ and $m_p$ is the magnitude component of $d_p$. CLBP_M is used to code the magnitude information of local differences:

$$\mathrm{CLBP\_M}_{P,R} = \sum_{p=0}^{P-1} t(m_p, c)\, 2^p, \qquad t(x, c) = \begin{cases} 1, & x \ge c, \\ 0, & \text{otherwise}, \end{cases}$$

where $c$ is a threshold which is set to the mean value of the $m_p$ values from the whole image. CLBP_S is the same as the original LBP and is used to code the sign information of local differences:

$$\mathrm{CLBP\_S}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p.$$

CLBP_C is used to code the information of the original center gray level value:

$$\mathrm{CLBP\_C}_{P,R} = t(g_c, c_I),$$

where the threshold $c_I$ is set to the average gray level of the input image. The dimension of the histograms corresponding to CLBP_S and CLBP_M is $2^P$, while the dimension of CLBP_C is 2. CLBP_C only uses the center gray level value, which can be easily affected by changes of viewpoint or illumination. Therefore, in our work, only the histograms of the CLBP_S and CLBP_M codes are computed and then concatenated to construct the CLBP feature. Thus, the final dimension of the CLBP feature is $2^{P+1}$.
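As a small illustration of the sign and magnitude operators, the following sketch computes the CLBP_S and CLBP_M codes of one pixel from its neighbor values. The function name `clbp_codes` and the threshold passed in the example are hypothetical; in the text the threshold $c$ is the mean magnitude over the whole image, which is not computed here.

```python
def clbp_codes(neighbors, gc, c):
    """Return (CLBP_S, CLBP_M) for one pixel.

    neighbors: gray values g_0 .. g_{P-1}; gc: center gray value;
    c: magnitude threshold (assumed to be supplied by the caller).
    """
    s_code = m_code = 0
    for p, gp in enumerate(neighbors):
        d = gp - gc                # local difference d_p
        if d >= 0:                 # sign component, as in the original LBP
            s_code |= 1 << p
        if abs(d) >= c:            # magnitude component m_p thresholded by c
            m_code |= 1 << p
    return s_code, m_code

s_code, m_code = clbp_codes([1, 2, 5, 6, 7, 9, 8, 7], gc=6, c=2)
```

The two codes feed two separate histograms of length $2^P$ each, which are concatenated to form the CLBP feature.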

CSLBP (Center-Symmetric Local Binary Patterns).
CSLBP [11] is another modified version of LBP. CSLBP produces a shorter feature set than LBP, but it is also a first-order local pattern in the center-symmetric direction and it ignores the central pixel information. CSLBP is closely related to the gradient operator because, like some gradient operators, it compares the gray levels of pairs of opposite pixels in center-symmetric directions instead of comparing the central pixel to its neighbors. In this way, the CSLBP feature takes advantage of the properties of both LBP and gradient based features.
For an even number $P$ of neighboring pixels distributed on radius $R$, the CSLBP operator produces $2^{P/2}$ possible distinct patterns. The operator is given by

$$\mathrm{CSLBP}_{P,R}(x_c, y_c) = \sum_{p=0}^{P/2-1} s\left(g_p - g_{p+(P/2)}\right) 2^p, \qquad s(x) = \begin{cases} 1, & x > T, \\ 0, & \text{otherwise}, \end{cases}$$

where $g_p$ and $g_{p+(P/2)}$ are the gray values of center-symmetric pairs of pixels. $T$ is used to threshold the gray level difference so as to increase the robustness of the CSLBP feature on flat image regions. Since the gray levels are normalized in [0, 1], the authors of [11] recommend using a small value for $T$.
Given an image of size $M \times N$, after the computation of the CSLBP patterns, a histogram is built to represent the texture image:

$$H(k) = \sum_{i=1}^{M} \sum_{j=1}^{N} f\left(\mathrm{CSLBP}_{P,R}(i, j), k\right), \quad k \in [0, K], \qquad f(x, y) = \begin{cases} 1, & x = y, \\ 0, & \text{otherwise}, \end{cases}$$

where $K = 2^{P/2} - 1$ is the maximal CSLBP pattern value. By construction, the length of the histogram resulting from the CSLBP feature is $2^{P/2}$.
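The CSLBP operator and its histogram can be sketched as follows. The helper names `cslbp_code` and `histogram`, and the threshold value `T=0.01`, are assumptions for illustration (the text only states that a small value of $T$ is recommended for gray levels normalized in [0, 1]).

```python
def cslbp_code(neighbors, T=0.01):
    """CSLBP code from the P neighbor gray values (P even, values in [0, 1])."""
    half = len(neighbors) // 2
    code = 0
    for p in range(half):
        # Compare each pixel with its center-symmetric counterpart.
        if neighbors[p] - neighbors[p + half] > T:
            code |= 1 << p
    return code

def histogram(codes, n_bins):
    """Count how often each pattern value occurs."""
    h = [0] * n_bins
    for c in codes:
        h[c] += 1
    return h

code = cslbp_code([0.9, 0.2, 0.5, 0.6, 0.1, 0.9, 0.5, 0.3])
h = histogram([code, code, 0], n_bins=2 ** 4)  # 2^(P/2) bins for P = 8
```

With $P = 8$ the histogram has only 16 bins, compared with 256 for the basic LBP operator, which is why CSLBP yields a much shorter feature.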

CSLDP (Center-Symmetric Local Derivative Pattern).
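The CSLDP operator [12] is a second-order derivative pattern in the center-symmetric direction. CSLDP captures more information by encoding the relationship between the central pixel and its center-symmetric neighbors. Moreover, CSLDP has a shorter length than LBP. For an even number $P$ of neighboring pixels distributed on radius $R$, the CSLDP operator produces $2^{P/2}$ possible distinct patterns and is defined as

$$\mathrm{CSLDP}_{P,R}(x_c, y_c) = \sum_{p=0}^{P/2-1} t\left(g_p - g_c,\; g_c - g_{p+(P/2)}\right) 2^p,$$

where $g_p$ and $g_{p+(P/2)}$ are the gray-scale values of neighborhood pixels in the center-symmetric direction, and $g_c$ corresponds to the gray value of the central pixel located at $(x_c, y_c)$. The threshold function $t(\cdot, \cdot)$ is used to determine the type of local pattern transition and is defined as

$$t(x_1, x_2) = \begin{cases} 1, & x_1 \cdot x_2 \le 0, \\ 0, & \text{otherwise}. \end{cases}$$

A CSLDP pattern encodes the second-order center-symmetric derivatives at pixel $(x_c, y_c)$ along the $0^\circ$, $45^\circ$, $90^\circ$, and $135^\circ$ directions. For $P = 8$, they can be represented as

$$t(g_0 - g_c,\, g_c - g_4), \quad t(g_1 - g_c,\, g_c - g_5), \quad t(g_2 - g_c,\, g_c - g_6), \quad t(g_3 - g_c,\, g_c - g_7).$$

The CSLDP histogram construction method is the same as for CSLBP, and its histogram length is also $2^{P/2}$.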

XCSLBP (Extended CSLBP).
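The work in [13] proposes a new LBP variant called XCSLBP (eXtended CSLBP), which compares the gray values of pairs of center-symmetric pixels while also considering the central pixel, without increasing the histogram length. This combination makes the resulting descriptor robust to illumination changes and noise. For an even number $P$ of neighboring pixels distributed on radius $R$, XCSLBP is expressed as

$$\mathrm{XCSLBP}_{P,R}(x_c, y_c) = \sum_{p=0}^{P/2-1} s\left(g_1(p, c) + g_2(p, c)\right) 2^p,$$

where the threshold function $s$, which is used to determine the types of local pattern transition, is defined as

$$s(x) = \begin{cases} 1, & x \ge 0, \\ 0, & \text{otherwise}, \end{cases}$$

with

$$g_1(p, c) = \left(g_p - g_{p+(P/2)}\right) + g_c, \qquad g_2(p, c) = \left(g_p - g_c\right)\left(g_{p+(P/2)} - g_c\right),$$

where $g_p$ and $g_{p+(P/2)}$ are the gray values of center-symmetric pixels and $g_c$ is the gray value of the central pixel. The XCSLBP operator produces histograms with a length of $2^{P/2}$.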
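A minimal sketch of an XCSLBP-style code for one pixel is given below. It assumes the commonly cited formulation in which each bit thresholds the sum of the center-symmetric difference plus the center value and the product of the two center differences; the function name `xcslbp_code` is hypothetical, and the exact form used in [13] should be checked against that reference.

```python
def xcslbp_code(neighbors, gc):
    """XCSLBP-style code from P neighbor gray values (P even) and center value gc."""
    half = len(neighbors) // 2
    code = 0
    for p in range(half):
        gp, gq = neighbors[p], neighbors[p + half]   # center-symmetric pair
        g1 = (gp - gq) + gc                          # pair difference plus center value
        g2 = (gp - gc) * (gq - gc)                   # product of the center differences
        if g1 + g2 >= 0:
            code |= 1 << p
    return code

code = xcslbp_code([1, 2, 5, 6, 7, 9, 8, 7], gc=6)
```

As with CSLBP, only $P/2$ comparisons are made, so the histogram keeps the length $2^{P/2}$ even though the central pixel now contributes to every bit.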

HOG (Histograms of Oriented Gradients).
Besides LBP and its variants, another histogram feature named HOG has also been widely accepted as one of the best features to capture edge or local shape information. The HOG feature was proposed by Dalal and Triggs [16] and is widely used for object detection in computer vision. The essential idea of the HOG feature is that the shape or appearance of a local object can be described by the distribution of intensity gradients and edge directions [17].
The HOG descriptor is a one-dimensional histogram of the gradient orientations of intensity in local regions that can represent object shape. As shown in Figure 4, HOG divides the image into small connected blocks, and, for each block, a histogram of gradient directions for the pixels within the block is computed. The combination of these histograms represents the feature vector. At each pixel, the gradient is a 2D vector with a real-valued magnitude and a discretized direction (9 possible directions uniformly distributed in $[0, \pi]$). During the construction of the integral image of HOG, the feature value at each pixel is treated as a 9D vector, and the value at each dimension is the interpolated magnitude value at the corresponding direction. Since HOG takes the gradient information of adjacent pixels as the basis to extract features, it is robust to changes in geometry and is not easily affected by local lighting conditions.

Overview of Proposed Approach
In this section, a robust visual localization method based on multifeature combination is developed. The general principle is to find the image that best matches the currently acquired one among a set of previously acquired and GPS-tagged training images.
The whole system includes an offline phase and an online phase. In the offline phase, the multifeatures $V_{\text{train}}$ of a set of GPS-tagged training images are extracted and stored. In the online phase, the multifeature of the current testing image is extracted and then compared with each multifeature of $V_{\text{train}}$ based on Euclidean distance. The computed distances are then used to select the best candidate (see Section 3.4); the smaller the distance is, the higher the similarity between the images will be. A distance ratio SS between the two best candidates (i.e., corresponding to the two minimum computed distances) is considered for matching validation (see Section 3.5). If the ratio SS is lower than or equal to a threshold Th, the first best image candidate (with the lower matching distance) is confirmed as positive; otherwise, it is regarded as negative (in this case, no matching result is kept). When a matching is confirmed as positive, the current position can be obtained from the matched GPS-tagged training image (see Section 3.6).
As illustrated in Figure 5, the overall approach comprises six stages: (1) image preprocessing, (2) block based feature extraction, (3) multifeature concatenation, (4) image matching, (5) final matching validation, and (6) visual localization.

Image Preprocessing.
Downsampling reduces the original image size, which makes feature extraction faster. In fact, it has already been shown in [18] that high resolution images are not more helpful than lower resolution ones. Therefore, downsampling is the first step before feature extraction.
As is well known, illumination has a significant influence on outdoor image appearance. Therefore, the second preprocessing step is contrast-limited adaptive histogram equalization (CLAHE) [19], which enhances the contrast of the gray-scale image. Through this adjustment, the intensities are better distributed over the histogram, which allows areas of lower local contrast to gain higher contrast. The contrast enhancement, especially in homogeneous areas, can be limited to avoid amplifying any noise that might be present in the image. At the same time, it also decreases the influence of shadows. An example image after applying CLAHE can be seen in Figure 6: CLAHE preprocessing visibly improves the image contrast and brightens the image (especially in dark regions).
To exploit local properties and enhance the image representation ability, image features are extracted from small image blocks (subimage areas) without any segmentation, and these independent feature descriptors are then concatenated to obtain the final image feature. The block based feature extraction process is illustrated on an example in Figure 7. The block based approach can capture the spatial properties of images and can be used with any histogram descriptor.

Block Based Feature Extraction.
After image preprocessing, features are extracted, as illustrated in Figure 8. The LBP feature is extracted from the gray-scale image and the disparity map independently, while the HOG feature $V_{\text{HOG}}$ is extracted from the gray-scale image. For both LBP and HOG, the features are extracted from image blocks. In order to facilitate block based feature extraction, all image blocks in the full image have the same size. The influence of different block sizes will be studied in Section 5.1. Image parts that do not fill a whole block are ignored.
(i) LBP Feature Extraction. The LBP features from the gray-scale image and the disparity map are obtained using the following equations:

$$V_{\text{gray}} = \left[h_1^{\text{gray}}, h_2^{\text{gray}}, \ldots, h_m^{\text{gray}}\right], \qquad V_{\text{dis}} = \left[h_1^{\text{dis}}, h_2^{\text{dis}}, \ldots, h_n^{\text{dis}}\right],$$

where $V_{\text{gray}}$ is a vector which stores the LBP feature obtained from the gray-scale image and $V_{\text{dis}}$ stores the LBP feature obtained from the disparity map. $m$ and $n$ are the numbers of image blocks of the gray-scale image and the disparity map, respectively. $h_i^{\text{gray}}$ ($i \in [1, 2, \ldots, m]$) is the LBP histogram of the $i$th block of the gray-scale image and $h_j^{\text{dis}}$ ($j \in [1, 2, \ldots, n]$) is the LBP histogram of the $j$th block of the disparity map. In our work, the disparity map is calculated using the SGBM (Semiglobal Block Matching) algorithm [20]. With the SGBM method, there are some useless parts ("black areas") for which no depth information is computed, especially on the left and right sides of the disparity map. The LBP operator is not applied in these "black areas"; these useless parts are simply removed. Because of this removal, $m$ and $n$ are not identical.
By using the block based approach, the features $V_{\text{gray}}$ and $V_{\text{dis}}$ are extracted from the gray-scale image and the disparity map, respectively. Then, the D-LBP feature can be computed by concatenating $V_{\text{gray}}$ and $V_{\text{dis}}$:

$$V_{\text{D-LBP}} = \left[V_{\text{gray}}, V_{\text{dis}}\right].$$

(ii) HOG Feature Extraction. The HOG feature is also computed for each image block of the gray-scale image. The HOG features obtained from all the image blocks are then concatenated:

$$V_{\text{HOG}} = \left[h_1^{\text{hog}}, h_2^{\text{hog}}, \ldots, h_m^{\text{hog}}\right].$$

Here, $h_i^{\text{hog}}$ ($i \in [1, 2, \ldots, m]$) is the HOG feature extracted from the $i$th image block, and $m$ is the number of blocks of the gray-scale image. It should be noted that HOG feature extraction adopts the same image block size as LBP feature extraction from the gray-scale image; therefore, the number of image blocks is the same.
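The block based extraction and concatenation described above can be sketched in a few lines. The sketch assumes that a per-pixel code image (for example, CSLBP codes) has already been computed; the function name `block_histograms` and the toy code images are hypothetical and serve only to show the mechanics.

```python
def block_histograms(code_img, block, n_bins):
    """Concatenate the per-block histograms of a 2D code image (list of rows).

    Image parts that do not fill a whole block are ignored.
    """
    rows, cols = len(code_img), len(code_img[0])
    feature = []
    for by in range(0, rows - block + 1, block):
        for bx in range(0, cols - block + 1, block):
            h = [0] * n_bins
            for y in range(by, by + block):
                for x in range(bx, bx + block):
                    h[code_img[y][x]] += 1
            feature.extend(h)          # append the histogram of this block
    return feature

codes_gray = [[0, 1, 2, 3, 0],
              [1, 1, 2, 2, 0],
              [3, 3, 0, 0, 1],
              [3, 2, 1, 0, 1]]
codes_dis = [[1, 1],
             [0, 2]]                   # smaller: "black areas" already removed

v_gray = block_histograms(codes_gray, block=2, n_bins=4)
v_dis = block_histograms(codes_dis, block=2, n_bins=4)
v_dlbp = v_gray + v_dis                # D-LBP: concatenation of both features
```

In this toy example the gray-scale code image yields four blocks (its last column is dropped) and the disparity code image yields one, which mirrors the remark that the two block counts need not be identical.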

Multifeature Concatenation.
In order to take advantage of the different features, D-LBP and HOG are combined to represent the image. Since D-LBP and HOG are two independent features, we simply consider that they have the same weight in place recognition. The final multifeature is obtained through concatenation using the following equation:

$$V = \left[V_{\text{D-LBP}}, V_{\text{HOG}}\right].$$

Image Matching.
The multifeature $V_{\text{test}}^{i}$ of the $i$th testing image is compared with the multifeature $V_{\text{train}}^{j}$ of each training image using the distance

$$D_{i,j} = \left\| V_{\text{test}}^{i} - V_{\text{train}}^{j} \right\|,$$

where $\| \cdot \|$ denotes the Euclidean norm.
A small distance means high similarity. Image matching candidates are searched based on this Euclidean distance. After distance computation, for the testing image, the two minimum distances ($D_{i,1}$ and $D_{i,2}$) and their corresponding training images (the two best candidates) are kept.

Final Matching Validation.
For a given current image pair $I_{\text{test}}^{i}$, the validation of the matching candidate from the training database is based on the ratio $SS_i$, calculated as follows:

$$SS_i = \frac{D_{i,1}}{D_{i,2}},$$

where $D_{i,1}$ and $D_{i,2}$ are, respectively, the first and second minimum distances between the current image multifeature $V_{\text{test}}^{i}$ and the multifeatures of the training images. As said before, the lower the distance is, the more similar the images are. The potential matching candidate is the training image giving the lowest distance to the testing image. However, if the second best matching candidate provides a distance very close to the first one, the matching algorithm has produced two confusable solutions. In this case, we propose to ignore the matching result and consider that the testing image has no matching image. For that, a threshold Th is applied to the ratio $SS_i$, which takes its values in the range [0, 1]. The final decision is as follows: if $SS_i$ is lower than or equal to the threshold Th, then the pair $(i, 1)$ is considered as positive, and the pair is matched. Otherwise, the pair is considered as negative and is ignored.
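The candidate search and the ratio-based validation can be summarized by the following sketch. The function name `validate_match`, the default threshold `th=0.8`, and the handling of a zero second distance are assumptions for illustration; the text does not prescribe a particular value of Th.

```python
import math

def validate_match(v_test, train_features, th=0.8):
    """Return (index of the matched training feature or None, ratio SS)."""
    # Euclidean distance to every training feature, smallest first.
    dists = sorted((math.dist(v_test, v), j) for j, v in enumerate(train_features))
    (d1, j1), (d2, _) = dists[0], dists[1]
    # Assumption: two zero distances are treated as fully ambiguous (SS = 1).
    ss = d1 / d2 if d2 > 0 else 1.0
    return (j1 if ss <= th else None), ss

train = [[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]]
match_a, ss_a = validate_match([0.0, 1.0], train)   # clearly closest to training image 0
match_b, ss_b = validate_match([1.5, 2.0], train)   # equidistant from images 0 and 1
```

In the second call the two best candidates are equally distant, so the ratio equals 1 and the match is rejected, which is the ambiguous situation the validation step is meant to filter out.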

Visual Localization.
After the image matching result is successfully validated, the vehicle can localize itself through the position of the matched training image. Since the training images are tagged with GPS or pose information, the vehicle obtains its position by taking the GPS position of the training image matched with the current testing image. This is a topological level localization; that is, the system simply identifies the most likely location. It is therefore not a very accurate metric localization, because the training and testing trajectories are not exactly the same.
It should be noted that a place cannot be localized when validation fails (negative matching case).
The route taken for the UTBM-1 dataset is shown in Figure 9. [...] The training database is composed of 540 images, while the testing database is composed of 520 images. The GPS information of each image is also collected.
The popular KITTI benchmark dataset is also used to test our proposal. The KITTI Odometry dataset has 22 sequences containing a total of 44182 stereo images (39.2 km). These sequences include environments with different characteristics and challenging situations such as perceptual aliasing and scene changes. Among them, the sequences KITTI 05 and KITTI 06, which contain loop closures, were selected to evaluate our method. There are 2761 and 1101 images in the KITTI 05 and KITTI 06 datasets, respectively.
For UTBM-1 and UTBM-2 datasets, ground-truth was constructed by manually finding pairs of frame correspondences according to the GPS data, while the KITTI dataset ground-truth was built according to the pose information [21].

Image Preprocessing and Feature Extraction.
In our work, for faster feature extraction, the original color images were converted to gray-scale and downsampled to half scale. That means that the images in datasets UTBM-1 and UTBM-2 were resized to 640 × 480 and the images in datasets KITTI 05 and KITTI 06 were resized to 613 × 235.
In order to reduce the illumination influence on the outdoor image appearance, contrast-limited adaptive histogram equalization (CLAHE) method was used (see Section 3.1).
Moreover, as a pair of images is acquired at each instant, a disparity map can be computed easily using the SGBM (Semiglobal Block Matching) algorithm [20].
After image preprocessing, binary descriptors (LBP, CLBP, CSLBP, CSLDP, and XCSLBP) are extracted with the following parameters: 8 sampling points and a radius of 3 pixels. The HOG descriptor is extracted from the gray-scale images. To capture large-scale spatial information, the cell size of HOG is 32 × 32. The number of cells in each block is specified as a 2-element vector.
An example of extracted image features can be seen in Figure 10. It can be seen that the local binary features pay more attention to texture information. It can also be noted that CSLBP and XCSLBP perform better than LBP. The HOG feature depicts object shape information in the image. Therefore, combining LBP and HOG features could bring more information and provide a better description of the place (scene).

Performance Evaluation.
Precision-recall characteristics and the $F_1$ score are widely used to show the effectiveness of image retrieval methods. Therefore, our evaluation methodology is based on precision-recall curves and the $F_1$ score. In our experiments, the number of training images is larger than or equal to the number of testing images; thus each testing image has a ground-truth match. Therefore, among the positives, there are true positives (correct results among the successfully validated image matching candidates) and false positives (wrong results among the successfully validated image matching candidates). The sum of the true positives and false positives is the total number of retrieved images.
More specifically, precision is the ratio of true positives (TP) over the number of retrieved images (the number of all the successfully validated image matching candidates, i.e., true positives plus false positives, FP), and recall is the ratio of true positives over the total number of testing images $N_{\text{test}}$:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{N_{\text{test}}}.$$

The final curve is computed by varying the threshold Th (applied to the ratio SS) linearly between 0 and 1 and calculating the corresponding values of precision and recall. 100 values of the threshold Th are considered to obtain well-defined curves. When the threshold is set to 1, the candidates whose ratio is below or equal to 1 are positives; in this case, the number of retrieved images is identical to the number of testing images. When the threshold is 0, only the candidates whose ratio is below or equal to 0 are regarded as positives; in this case, there is no retrieved image.
Precision relates the number of correct matches to the number of false matches, whereas recall relates the number of correct matches to the number of missed matches. A perfect system would return a result where both precision and recall have a value of one. The $F_1$ score is a single value that indicates the overall effectiveness of an image retrieval method. Based on the precision and recall, the $F_1$ score is defined as

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
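As a worked illustration of these definitions, the sketch below computes precision, recall, and the $F_1$ score from counts. The function name `precision_recall_f1`, the example counts, and the convention of returning a precision of 0 when no image is retrieved are assumptions made for the example, not values from the experiments.

```python
def precision_recall_f1(tp, fp, n_test):
    """Precision, recall and F1 from true positives, false positives and test set size."""
    retrieved = tp + fp                      # all successfully validated matches
    # Assumption: precision is reported as 0 when nothing is retrieved.
    precision = tp / retrieved if retrieved else 0.0
    recall = tp / n_test                     # every test image has a ground-truth match
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 450 validated matches out of 520 test images, 400 of them correct.
p, r, f1 = precision_recall_f1(tp=400, fp=50, n_test=520)
```

With these counts, precision is 400/450, recall is 400/520, and the $F_1$ score is 800/970, that is, about 0.82.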

Experiments and Results
Different aspects of our proposal are evaluated in the following sections. In Section 5.1, the performance of binary features (LBP and its variants) with and without disparity information is studied. In addition, the influence of the image block size on the binary feature D-CSLBP is investigated in Section 5.1. In Section 5.2, the effect of the multifeature fusion proposed in our approach is analyzed. Note that the experimental results obtained in Sections 5.1 and 5.2 are based on Euclidean distance. In Section 5.3, the efficiency of our LSH based visual recognition is checked: the execution time and recognition performance of our complete system are evaluated. Finally, visual localization at the 100% recognition level is discussed in Section 5.4.

Performance of Different Binary Features.
In this section, we compare the performance of binary features in two situations: with and without the disparity map. Here the features are compared based on the Euclidean distance. Table 1 gives the $F_1$ scores of the binary descriptors in the two cases (without and with disparity information). It can be seen that LBP, CLBP, CSLBP, and CSLDP with disparity information improve the image retrieval ability, as their $F_1$ scores are higher with disparity information than without it. Among them, D-CSLBP is the best one; it achieves the highest $F_1$ score.
Figure 11 depicts the precision-recall curves obtained by the different binary features on two typical datasets, UTBM-1 and KITTI 06. It can be seen that D-CSLBP performs better than D-LBP, D-CLBP, D-CSLDP, and D-XCSLBP. Also, the maximum recall at 100% precision is higher for D-CSLBP than for the other features.

Comparison of Different Image Block Sizes.
In this section, the influence of block size of block based D-CSLBP feature is studied.
A small block size permits discriminating local details, while a large block size makes the representation more robust. Each image block is square in our experiment (block size 32 × 32 is abbreviated as 32). The performance of the D-CSLBP feature in place recognition is evaluated with different block sizes: 32, 64, 128, and 32 + 64 + 128 (multiblock sizes, i.e., three different block sizes used together).
According to Figure 12, it can be noted that by increasing the block size from 32 to 64 and 128, the place recognition ability decreases.The computation of D-CSLBP feature with combination of the block sizes 32, 64, and 128 only permits achieving a slightly better performance than the D-CSLBP feature with block size 32.
The binary descriptor D-CSLBP extracted from small blocks thus benefits from discriminative local details, while feature extraction using a larger block size tends to drop some discriminative information from the image representation.
However, when the block size is too small, the additional information does not bring further improvement to the image matching process. At the same time, a smaller image block size may increase the computation time of feature extraction. Therefore, in the following experiments, the image block size for D-CSLBP is set to 32.

Performance of Multifeature Combination.
In this section, we compare the performance of multifeature descriptor (D-CSLBP++HOG) with single independent feature descriptor.
Figure 13 shows the precision-recall curves obtained with the different tested features: D-CSLBP, HOG, and D-CSLBP++HOG.It can be found that the binary feature D-CSLBP combined with HOG permits improving image retrieval performance.Combining D-CSLBP and HOG can achieve better result than each single feature, which means that the combination is useful for place recognition.
Table 2 compares the $F_1$ scores of the different features with the state-of-the-art FAB-MAP method. It confirms that the multifeature D-CSLBP++HOG achieves better results than a single feature. The $F_1$ score of D-CSLBP++HOG is the highest for all four datasets. Furthermore, the proposed method outperforms the FAB-MAP method.
For a better understanding of the proposed multifeature, an example of distance matrices for the UTBM-1 dataset is presented in Figure 14. Here, to clearly demonstrate the feature performance, the distance matrix $D$ is normalized into the 0-1 range. The distances between the same or similar images are close to 0 (red color), while, for larger distances, the corresponding color is close to yellow. As plotted in Figure 14(b), the ground-truth line is red. When perceptual aliasing occurs, some red points (noise) appear outside the ground-truth line. In the distance matrix provided by our method using the D-CSLBP++HOG feature (see Figure 14(c)), it can be seen that the noise which appears around the diagonal (ground-truth line) due to perceptual aliasing is clearly reduced with respect to the other feature approaches (CSLBP, D-CSLBP, and HOG). These observations are supported by the precision-recall curves depicted in Figure 13(a) and the results in Table 2.
We can thus conclude that integrating HOG and disparity information improves the image matching results. D-CSLBP++HOG achieves better performance than the other features mainly because the feature combination takes advantage of texture, shape, and depth information.

To speed up image matching significantly, the locality sensitive hashing (LSH) method, which preserves the Euclidean similarity [22], is used for visual recognition. LSH is arguably the most popular unsupervised hashing method and has been applied to many problems, including information retrieval and computer vision [23]. The paper [23] demonstrates that the Euclidean distance between two high-dimensional vectors can be closely approximated by the Hamming distance between the respective hashed bit vectors. The more hash bits are used, the better the approximation. The LSH method simply uses a random projection matrix to project the high-dimensional data into a low-dimensional binary (Hamming) space; that is, each data point is mapped to a $K$-bit vector, called the hash key. Approximate nearest neighbors can thus be found in sublinear time. A key ingredient of locality sensitive hashing is mapping similar features to the same bucket with high probability.
More precisely, for a multifeature $V^{test}$ obtained from a testing image and a multifeature $V^{train}$ obtained from a training image, the hashing functions $h(\cdot)$ from the LSH family satisfy the following locality preserving property:

$$\Pr\left[h(V^{test}) = h(V^{train})\right] = \mathrm{sim}\left(V^{test}, V^{train}\right),$$

where the similarity measure $\mathrm{sim}$ is directly linked to the Euclidean distance function. Hash keys are constructed by applying $K$ binary-valued hash functions to each image feature. The $K$ binary-valued LSH functions consist of random projections and thresholds:

$$h_k(V) = \mathrm{sign}\left(w_k^{\top} V + b_k\right), \quad k = 1, \ldots, K,$$

where $w_k$ is a $d$-dimensional data-independent random hyperplane, usually drawn from a standard Gaussian distribution [24], and $b_k$ is a random intercept. For a normalized dataset with zero mean, an approximately balanced partition is obtained with $b_k = 0$. By applying the $K$ binary-valued hash functions to each image feature, the high-dimensional multifeatures $V^{test}$ and $V^{train}$ are converted into low-dimensional $K$-bit codes. Since these codes are binary, they can be compared more efficiently than the original features.
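The hashing step can be illustrated with a minimal Python sketch. It assumes zero-mean features, so the intercept is set to 0, and it packs the hash bits into a Python integer so that the Hamming distance reduces to an XOR followed by a bit count. The function names (`make_hyperplanes`, `hash_feature`, `hamming`) and the toy 8-dimensional feature are illustrative assumptions, not the implementation evaluated in this paper.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes with standard Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_feature(feature, hyperplanes):
    """Hash a real-valued feature into an integer hash key.

    Each bit is 1 when the projection onto one hyperplane is non-negative
    (intercept fixed to 0, which assumes zero-mean features).
    """
    key = 0
    for w in hyperplanes:
        projection = sum(wi * xi for wi, xi in zip(w, feature))
        key = (key << 1) | (1 if projection >= 0.0 else 0)
    return key


def hamming(key_a, key_b):
    """Number of differing bits between two hash keys."""
    return bin(key_a ^ key_b).count("1")


# Toy example: a query, a slightly perturbed copy, and its negation.
planes = make_hyperplanes(dim=8, n_bits=64, seed=42)
query = [0.5, -1.0, 0.3, 0.8, -0.2, 0.1, -0.7, 0.4]
near = [x + 0.01 for x in query]
far = [-x for x in query]

key_query = hash_feature(query, planes)
print(hamming(key_query, hash_feature(near, planes)))  # small
print(hamming(key_query, hash_feature(far, planes)))   # all bits differ
```

In this sketch, the nearly identical feature keeps almost all hash bits, whereas the negated feature flips every projection sign, which is the behavior the locality preserving property describes.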
In our experiment, we compare the place recognition performance achieved with hashed multifeatures of different binary lengths ($2^8, \ldots, 2^{12}$ bits) on four datasets in Figure 15. Since the image sizes differ, the multifeature dimension is 18696 for the UTBM-1 and UTBM-2 datasets and 6432 for KITTI 05 and KITTI 06. It can be seen that using 4096 or 2048 bits retains more than 86% of the total place recognition performance.
Figure 15: Visual localization results obtained by our system on four datasets. The trajectory of the vehicle is depicted with black lines; the loop closure zone is plotted with blue lines. Red points are locations correctly recognized at 100% precision by our proposed approach. There are no false positives in any case. Note that the loop closure zone of datasets UTBM-1 and UTBM-2 is the whole trajectory, while the loop closure zone of KITTI 05 and KITTI 06 covers only the parts of the trajectory shown in blue.

Table 3 shows the $F_1$ score obtained with different hash bit lengths applied to the multifeature (D-CSLBP++HOG) of our place recognition method. The average matching time is also presented; it does not include the feature extraction time. The experiments were conducted on a laptop with an Intel i7-4700MQ CPU and 32 GB of RAM.
As Table 3 shows, the average matching time using 4096 bits is almost half of that obtained with the Euclidean distance over the original full features. Compared with full multifeature matching, hashing the original multifeature into 4096 bits makes the distance computation and comparison faster. For large-scale datasets, the speed-up is expected to be even more significant.

Visual Localization Results.
In the previous section, hashing the original feature into 4096 bits showed good place recognition performance. Therefore, in this section, we describe the visual localization results achieved with 4096 hash bits.
Figure 15 shows the final place recognition results for the different datasets at a precision level of 100%. For the UTBM-1 and UTBM-2 datasets, we obtained 23.81% and 11.35% recall at 100% precision, respectively, while for the KITTI 05 and KITTI 06 datasets, recall rates of 17.38% and 32.39% were achieved, respectively. At the 100% precision level, every obtained place recognition result is correct. A correct place recognition means a successful visual localization; therefore, the higher the recognition rate (recall) at 100% precision, the more robust the visual localization system.
When the threshold value Th is adjusted, the recognition precision changes as well. At the 100% precision level, each recognized place is a true positive and its localization error is small (bounded by the ground-truth criterion, which is 5 m in our case). To achieve the 100% recognition precision level, the threshold value is set to 0.88 and 0.58 for the UTBM-1 and UTBM-2 datasets, respectively.
When the threshold is set to 1, every image matching result is accepted as positive. In this case, the precision is at its lowest and there are many false matches, which lead to large localization errors. In general, the smaller the threshold, the fewer the false recognitions.
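The role of the threshold can be illustrated with a minimal Python sketch. It rests on assumptions that this section does not spell out: distances are scaled to [0, 1] by min-max normalization, a match is accepted when its normalized distance is at most Th, and recall is the fraction of all test images that are correctly recognized (which presumes that every test image has a counterpart in the training set). All function names are hypothetical.

```python
def normalize(distances):
    """Min-max scale a list of distances to the range [0, 1]."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        return [0.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]


def precision_recall(norm_distances, is_correct, th):
    """Precision and recall when matches with distance <= th are accepted.

    norm_distances[i] is the normalized distance of the best match of test
    image i; is_correct[i] tells whether that match satisfies the
    ground-truth criterion (for example, within 5 m).
    """
    accepted = [ok for d, ok in zip(norm_distances, is_correct) if d <= th]
    true_positives = sum(accepted)
    # Convention: precision is 1.0 when nothing is accepted.
    precision = true_positives / len(accepted) if accepted else 1.0
    recall = true_positives / len(is_correct)
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with four test images.
best_distances = [2.0, 4.0, 6.0, 10.0]
correct = [True, True, False, True]
norm = normalize(best_distances)  # [0.0, 0.25, 0.5, 1.0]

print(precision_recall(norm, correct, th=0.25))  # (1.0, 0.5)
print(precision_recall(norm, correct, th=1.0))   # (0.75, 0.75)
```

With the small threshold only the two closest matches are accepted and both are correct; with the threshold at 1 every match is accepted, so the false match lowers the precision.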
In addition, at precision levels below 100%, the recognized places are not all correct and some false recognitions appear. For these falsely recognized places, the localization error can be very large, because the testing image can be wrongly matched to any image in the training database. This is why some locations have a very large localization error.
Table 4 gives the average localization error and recall ratio at different precision levels. For all the datasets, at 100% precision, the minimum localization error is 0 while the maximum error is not larger than 5 m. It should be noted that, at the 100% precision level, some places cannot be recognized and no localization results are obtained there. This problem can be addressed with a visual odometry technique or with extra sensors (such as LiDAR or an Inertial Measurement Unit (IMU)).
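The localization error of recognized places can be evaluated along the following lines. This is a minimal sketch for offline evaluation only: it assumes that positions are already expressed as planar (x, y) coordinates in meters (raw GPS coordinates would first need a projection), that the estimated position of a test image is the tagged position of its matched training image, and that the function and variable names are hypothetical.

```python
import math


def localization_errors(matches, train_positions, test_positions):
    """Distance in meters between estimated and true test positions.

    matches is a list of (test_index, train_index) pairs for the accepted
    place recognitions.
    """
    errors = []
    for test_idx, train_idx in matches:
        estimated = train_positions[train_idx]
        errors.append(math.dist(estimated, test_positions[test_idx]))
    return errors


def average_error(errors):
    """Mean localization error, or None when no place was recognized."""
    return sum(errors) / len(errors) if errors else None


# Toy example: two accepted matches.
train_positions = [(0.0, 0.0), (10.0, 0.0)]
test_positions = [(3.0, 4.0), (10.0, 1.0)]
errors = localization_errors([(0, 0), (1, 1)], train_positions, test_positions)
print(errors)                 # [5.0, 1.0]
print(average_error(errors))  # 3.0
```

Test images without an accepted match simply do not appear in `matches`, which mirrors the unrecognized places mentioned above.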

Conclusion and Future Works
In this paper, we presented a visual vehicle localization approach that uses a multifeature built from the gray-scale image and the disparity map. The multifeature concatenates the D-CSLBP and HOG features to take advantage of texture, depth, and shape information. In addition, block based feature extraction was used to take spatial information into account. Image matching with the proposed multifeature D-CSLBP++HOG, compressed by locality sensitive hashing, makes the visual recognition more efficient. The experimental results demonstrated that this approach provides viable place recognition based visual localization in outdoor environments and compares favorably with the state-of-the-art FAB-MAP method.
However, in long-term visual localization, place recognition is prone to be affected by appearance or seasonal changes. The future objective of our research is to achieve robust long-term localization across different times and seasons. Sequence matching will be considered for place recognition in future work.

Figure 1: Multifeature built from gray-scale image and disparity map. Features are first extracted from each image block and then concatenated. The symbol "++" means concatenation.

Figure 2: Illustration of the basic LBP operator.


Figure 5: The process of the proposed place recognition based visual localization.

Figure 7: An example of block based local binary descriptor extraction. Features are first extracted from each image block and then concatenated. Here, the image blocks do not overlap and no image segmentation is needed.
Using this method, a multifeature set $\{V^{train}_j\}_{j=1}^{N^{train}}$ of all training image pairs $\{I^{train}_j\}_{j=1}^{N^{train}}$ is obtained. For a current testing image pair $I^{test}_i$, a multifeature $V^{test}_i$ is also obtained. Image matching is then conducted by comparing the Euclidean distances between the multifeature $V^{test}_i$ of the current testing image and all training image multifeatures $V^{train}_j$ ($j = 1, 2, \ldots, N^{train}$) from the training image dataset.
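Block based extraction can be sketched as follows. The example uses the plain 4-bit CS-LBP operator on a gray-scale image only, as a stand-in for the D-CSLBP descriptor, whose disparity term is not reproduced here. The function names, the nested-list image representation, and the choice to skip pixels on the image border are assumptions made for illustration.

```python
# Eight neighbours at radius 1 in circular order; neighbour k is compared
# with the opposite neighbour k + 4 (center-symmetric pairs).
OFFSETS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]


def cs_lbp_code(img, y, x, threshold=0):
    """4-bit CS-LBP code of pixel (y, x), a value in 0..15."""
    code = 0
    for k in range(4):
        dy1, dx1 = OFFSETS[k]
        dy2, dx2 = OFFSETS[k + 4]
        if img[y + dy1][x + dx1] - img[y + dy2][x + dx2] > threshold:
            code |= 1 << k
    return code


def block_histogram(img, top, left, size):
    """16-bin histogram of CS-LBP codes inside one block (border skipped)."""
    hist = [0] * 16
    for y in range(max(top, 1), min(top + size, len(img) - 1)):
        for x in range(max(left, 1), min(left + size, len(img[0]) - 1)):
            hist[cs_lbp_code(img, y, x)] += 1
    return hist


def block_descriptor(img, block_size):
    """Concatenate the histograms of all non-overlapping blocks."""
    descriptor = []
    for top in range(0, len(img) - block_size + 1, block_size):
        for left in range(0, len(img[0]) - block_size + 1, block_size):
            descriptor.extend(block_histogram(img, top, left, block_size))
    return descriptor


# Toy example: an 8 x 8 horizontal intensity ramp split into 4 x 4 blocks.
image = [[x for x in range(8)] for _ in range(8)]
descriptor = block_descriptor(image, block_size=4)
print(len(descriptor))  # 4 blocks x 16 bins = 64
```

Because each block contributes its own histogram at a fixed position of the descriptor, the concatenation keeps coarse spatial layout, which a single global histogram would discard.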

3.4. Feature Comparison and Image Matching.
Feature comparison is performed based on the Euclidean distance between features. Each testing image multifeature $V^{test}_i$ is compared with all the training image multifeatures $V^{train}_j$ ($j = 1, 2, \ldots, N^{train}$) of the training database. The distance $D_{i,j}$ between the multifeature $V^{test}_i$ ($i = 1, 2, \ldots, N^{test}$) of a testing image and the multifeature vector $V^{train}_j$ ($j = 1, 2, \ldots, N^{train}$) of a training image is computed as follows:

$$D_{i,j} = \left\lVert V^{test}_i - V^{train}_j \right\rVert_2 .$$
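A minimal Python sketch of this comparison is given below. It assumes that every multifeature is stored as a list of numbers of equal length and that the candidate place for a testing image is simply the training image at the smallest Euclidean distance; the function names are hypothetical.

```python
import math


def distance_matrix(test_features, train_features):
    """Euclidean distance between every test and every training feature."""
    return [[math.dist(v_test, v_train) for v_train in train_features]
            for v_test in test_features]


def nearest_training_image(distance_row):
    """Index and distance of the closest training image for one test image."""
    j = min(range(len(distance_row)), key=distance_row.__getitem__)
    return j, distance_row[j]


# Toy example with 2-dimensional features.
train_features = [[0.0, 0.0], [3.0, 4.0]]
test_features = [[0.0, 1.0], [3.0, 3.0]]

distances = distance_matrix(test_features, train_features)
for i, row in enumerate(distances):
    print(i, nearest_training_image(row))
# 0 (0, 1.0)
# 1 (1, 1.0)
```

The brute-force comparison costs one full-dimensional distance per pair of test and training images, which is the cost that hashing the features into short binary codes is meant to reduce.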

Figure 8: Block based feature extraction procedure (applied to each image pair of the training database in the offline phase and to the current image pair in the testing phase).
(a): the experimental vehicle traversed about 4 km in a typical outdoor environment. Three typical areas were traversed: an urban city road (area a), an area with many factory buildings (area b), and a natural scene surrounding a lake (area c). The training and testing data were collected at different times, on 2014/9/11 and 2014/9/5, respectively. The training database is composed of 849 images while the testing database is composed of 819 images. The average distance between two successive frames was around 3.5 m. To tag the training images, the GPS position of each image was obtained with an RTK-GPS receiver. The UTBM-2 dataset (Figure 9(b)) consists of a 2.3 km route in downtown Belfort acquired on 2014/9/5. The first traversal, performed in the morning, was used to acquire the training dataset, and the second one, conducted in the afternoon, was used to acquire the testing dataset. Each traversal of this route took approximately 20 minutes. The training database is

Figure 11: Image retrieval performance (precision-recall curve) comparison considering different block based binary features on UTBM-1 and KITTI 06 datasets.Here the image block size is 32.

Figure 12: Image retrieval performance (precision-recall curve) comparison considering D-CSLBP feature extracted with different image block sizes, on four datasets.

Figure 14: Example of distance matrices for the UTBM-1 dataset. Here, the distance matrix is normalized to the range 0-1. The distances between the same or similar images are close to 0 (red color), while larger distances are shown in colors closer to yellow. In (a), two images of the same place are taken at different times (two weeks apart). Panels (b) to (f) plot the distance matrices. They show that the multifeature combination (c) reduces the noise that appears around the diagonal (ground-truth line). Besides, compared with (d), adding disparity information in (e) decreases perceptual aliasing, as confirmed by the precision-recall curves in Figure 13(a).

Table 1: $F_1$ score comparison for different tested binary features on four datasets. Here the block size is set to 32 × 32.

Table 2: Comparison of $F_1$ scores for different features and the state-of-the-art FAB-MAP method, on four datasets.

Table 3: $F_1$ score and matching time comparison of different hash bit lengths for our approach and the state-of-the-art FAB-MAP method.

Table 4: Recall results and average localization error at three precision levels (4096 hash bits).