With the increasing demand for location-based services in places such as railway stations, airports, and shopping malls, indoor positioning technology has become one of the most attractive research areas. Due to the effects of multipath propagation, wireless indoor localization methods such as WiFi, Bluetooth, and pseudolite have difficulty achieving high-precision positioning. In this work, we present an image-based localization approach which can obtain the position simply by taking a picture of the surrounding environment. This paper proposes a novel approach which classifies different scenes based on deep belief networks and solves the camera position with several spatial reference points extracted from depth images by the perspective-n-point (PnP) algorithm.
According to statistics, people spend more than 80 percent of their time in indoor environments such as shopping malls, airports, libraries, campuses, and hospitals. The purpose of an indoor localization system is to provide accurate positions inside large buildings. It is vital to applications such as the evacuation of people trapped at fire scenes, the tracking of valuable assets, and indoor service robots. For these applications to be widely accepted, indoor localization requires an accurate and reliable position estimation scheme [
In order to provide a stable indoor location service, a large number of technologies have been investigated, including pseudolite, Bluetooth, ultrasonic, WiFi, ultra-wideband, and LED [
The vision-based positioning method is a passive positioning technology that can achieve high accuracy without extra infrastructure. Moreover, it outputs not only the position but also the viewing angle at the same time. Therefore, it has gradually become a hotspot of indoor positioning technology [
Before presenting the proposed approach, we review previous work on image-based localization and roughly divide these methods into three categories.
Localization methods that rely entirely on natural image features lack robustness, especially under varying illumination. To improve the robustness and accuracy of the reference points, manual mark-based localization methods place special coded marks in the scene to meet higher positioning requirements. Coded marks bring three benefits: they simplify the automatic detection of corresponding points, they introduce the system scale, and they allow targets to be distinguished and identified through a unique code on each mark. Common types of marks include concentric rings, QR codes, and patterns composed of colored dots. The advantage is a higher recognition rate and a substantial reduction in the complexity of the positioning method. The disadvantages are that the installation and maintenance costs are high, the marks are easily occluded, and the scope of application is limited [
Natural mark-based localization methods usually detect objects in the image and match them against an existing building database that stores the location information of the natural marks in the building. The advantage of this method is that it does not require additional local infrastructure. In other words, the reference objects are actually a series of digital reference points (control points in photogrammetry) in the database. Therefore, this type of system is suitable for large-scale coverage without adding too much cost. The disadvantages are that the recognition algorithm is complex and easily affected by the environment, the features may change over time, and the database needs to be updated [
Learning-based localization methods have emerged in the past few years. They are end-to-end methods that directly estimate the 6-DoF pose and have been proposed to solve loop-closure detection and pose estimation [
In this section, we first present an overview of the framework; the key modules are then explained in more detail in the subsequent sections.
The whole pipeline of the visual localization system is shown in Figure
The framework of the visual localization system.
In the offline stage, an RGB-D camera is carried around the indoor environment to collect sufficient RGB and depth images. At the same time, the camera poses and the 3D point cloud are reconstructed. The RGB images are used as the training dataset for the network model, and the model parameters are saved once the loss function value no longer decreases. In the online stage, a user enters the building, downloads the trained network parameters to a mobile phone, and takes a picture; the most similar database image is then identified by the deep learning network. Mismatched points are eliminated, and the pixel coordinates of the matched points and the depths of the corresponding points are extracted. According to the pinhole imaging model, the camera position is then solved from these 2D-3D correspondences.
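As a sketch of the control flow only, the online stage can be summarized as follows; every name in this snippet is a hypothetical placeholder standing in for the modules detailed in the following sections, not the actual implementation.

```python
# Illustrative glue code for the online stage; all helpers (dbn_model,
# database, match_features, solve_pnp) are hypothetical placeholders.
def localize(query_image, dbn_model, database, K):
    scene_id = dbn_model.classify(query_image)           # DBN scene classification
    ref = database.most_similar(scene_id, query_image)   # retrieved RGB-D reference image
    matches = match_features(query_image, ref.rgb)       # ORB + Canny fusion, Hamming matching
    pts_3d = ref.backproject(matches.ref_pixels)         # 3D points from the depth image
    return solve_pnp(pts_3d, matches.query_pixels, K)    # pose from 2D-3D correspondences
```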
Due to manufacturing and installation errors of the camera lens, images suffer from radial and tangential distortion. Therefore, we must calibrate the camera and correct the images in the preprocessing stage. The checkerboard contains calibration reference points, and the coordinates of each point are disturbed by the same noise. An objective function of the reprojection error over these reference points is established and minimized to estimate the intrinsic parameters and distortion coefficients.
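A minimal calibration sketch with OpenCV is shown below; the checkerboard size (9x6 inner corners) and the image folder are assumptions for illustration, not values from this paper.

```python
# Checkerboard calibration sketch with OpenCV (assumed 9x6 inner corners).
import glob
import cv2
import numpy as np

pattern = (9, 6)                         # inner corners per row/column (assumption)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):    # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# K is the intrinsic matrix; dist holds radial/tangential coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
undistorted = cv2.undistort(cv2.imread(path), K, dist)
```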
In this section, we use the deep belief network (DBN) to categorize the different indoor scenes. The framework includes image preprocessing, LBP feature extraction, DBN training, and scene classification.
The improved LBP feature is insensitive to rotation and illumination changes. The LBP operator can be described as follows: the gray value of the center pixel of the window is taken as a threshold, and the gray values of the surrounding 8 pixels are compared with this threshold in a clockwise direction; if a gray value is greater than or equal to the threshold, the pixel is marked as 1, otherwise as 0. The comparison yields an 8-bit binary number, and after decimal conversion we obtain the LBP value of the center pixel of this window. The value reflects the texture information at this position. The calculation process is shown in Figure
Local binary pattern calculation process.
The formula of the local binary pattern is $\mathrm{LBP}(x_c, y_c) = \sum_{p=0}^{7} s(i_p - i_c)\,2^{p}$, where $i_c$ is the gray value of the center pixel $(x_c, y_c)$, $i_p$ is the gray value of the $p$-th neighboring pixel, and $s(x) = 1$ if $x \ge 0$ and $s(x) = 0$ otherwise.
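The basic 3x3 operator above can be written directly in NumPy; the sketch below assumes `image` is a 2D uint8 grayscale array.

```python
# Basic 3x3 LBP: compare the 8 neighbors with the center pixel clockwise
# and pack the comparison bits into one 8-bit code per pixel.
import numpy as np

def lbp_basic(image: np.ndarray) -> np.ndarray:
    img = image.astype(np.int32)
    center = img[1:-1, 1:-1]
    # Clockwise offsets of the 8 neighbors, starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(center)
    for p, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:img.shape[0] - 1 + dy,
                       1 + dx:img.shape[1] - 1 + dx]
        code |= (neighbor >= center).astype(np.int32) << p  # s(i_p - i_c) * 2^p
    return code.astype(np.uint8)
```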
The earliest LBP operator can only cover a small image area, so researchers have continually proposed optimizations and improvements of it. We adopt the method that remedies the fixed window size of the original LBP operator by replacing the traditional square neighborhood with a circular neighborhood and expanding the window size, as shown in Figure
Three types of LBP.
In order to make the LBP operator rotation invariant, the circular neighborhood is rotated clockwise to obtain a series of binary strings, the minimum binary value among them is taken, and this value is converted to decimal, which is the LBP value of the point. The process of obtaining the rotation-invariant LBP operator is shown in Figure
Rotation-invariant LBP schematic.
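The circular, rotation-invariant variant is available off the shelf in scikit-image; the sketch below is one way to build the texture feature vector, with the sample count P, radius R, and placeholder image as illustrative assumptions.

```python
# Rotation-invariant circular LBP via scikit-image; method="ror" takes the
# minimum code over all circular bit rotations, as described above.
import numpy as np
from skimage.feature import local_binary_pattern

gray_image = np.random.randint(0, 256, (120, 160), dtype=np.uint8)  # placeholder image
P, R = 8, 1                          # 8 samples on a circle of radius 1 (assumed)
codes = local_binary_pattern(gray_image, P, R, method="ror")
# A normalized histogram of the codes serves as the texture feature
# vector that is fed to the DBN.
hist, _ = np.histogram(codes, bins=np.arange(2**P + 1), density=True)
```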
The deep belief network consists of multiple stacked restricted Boltzmann machines (RBMs) and a backpropagation (BP) neural network. The Boltzmann machine is a stochastic neural network based on statistical learning rules. It consists of a visible layer and a hidden layer, and neurons are connected to each other both within the same layer and across layers. A neuron has two output states, active and inactive, represented by 1 and 0. The advantage of the Boltzmann machine is its powerful unsupervised learning ability, which can learn complex rules from a large amount of data; the disadvantages are the huge amount of computation and the long training time. The restricted Boltzmann machine removes the connections between neurons within the same layer, so the hidden units are mutually independent given the visible units and vice versa. Roux and Bengio theoretically proved that as long as the number of hidden neurons and the number of training samples are sufficient, an arbitrary discrete distribution can be fitted. The structures of the BM and the RBM are shown in Figure
Boltzmann machine and restricted Boltzmann machine.
The joint configuration energy of its visible and hidden layers is defined as $E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i}\sum_{j} v_i w_{ij} h_j$, where $v_i$ and $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $a_i$ and $b_j$ are their biases, and $w_{ij}$ is the weight between them.
The output of the hidden layer unit is $h_j = \sigma\big(b_j + \sum_{i} v_i w_{ij}\big)$, where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function.
When the parameters are known, based on the above energy function, the joint probability distribution of $v$ and $h$ is $P(v, h) = \frac{1}{Z} e^{-E(v, h)}$, where $Z = \sum_{v, h} e^{-E(v, h)}$ is the partition function.
Since the activation states of the hidden units (and likewise the visible units) are conditionally independent, when the state of the visible (hidden) layer is given, the activation probabilities of the $j$-th hidden unit and the $i$-th visible unit are $P(h_j = 1 \mid v) = \sigma\big(b_j + \sum_{i} v_i w_{ij}\big)$ and $P(v_i = 1 \mid h) = \sigma\big(a_i + \sum_{j} h_j w_{ij}\big)$.
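These two conditionals are all that is needed for one contrastive-divergence (CD-1) training step of a binary RBM. The NumPy sketch below illustrates a single step; the layer sizes and learning rate are illustrative assumptions, not the values used in this paper.

```python
# One CD-1 step for a binary RBM, following the conditionals above.
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 256, 64, 0.1                  # assumed sizes / learning rate
W = 0.01 * rng.standard_normal((n_vis, n_hid))   # weights w_ij
a = np.zeros(n_vis)                              # visible biases a_i
b = np.zeros(n_hid)                              # hidden biases b_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = (rng.random(n_vis) < 0.5).astype(float)     # one training vector (placeholder)

# Positive phase: P(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij).
ph0 = sigmoid(b + v0 @ W)
h0 = (rng.random(n_hid) < ph0).astype(float)
# Negative phase: reconstruct v from h, then recompute hidden probabilities.
pv1 = sigmoid(a + h0 @ W.T)                      # P(v_i = 1 | h)
v1 = (rng.random(n_vis) < pv1).astype(float)
ph1 = sigmoid(b + v1 @ W)
# Parameter update: lr * (<v h>_data - <v h>_model).
W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
a += lr * (v0 - v1)
b += lr * (ph0 - ph1)
```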
In this paper, we propose a multifeature point fusion algorithm. Combining an edge detection algorithm with the ORB detection algorithm enables the extractor to capture edge information, thereby increasing the number of matching points on weakly textured objects. The edge feature points are obtained with the Canny algorithm to ensure that objects with little texture still yield feature points. ORB has scale and rotation invariance, and it is faster than SIFT. The BRIEF description algorithm is used to construct the feature point descriptors [
The brute-force algorithm is adopted as the feature matching strategy. It calculates the Hamming distance between each feature point of the template image and each feature point of the sample image. The minimum Hamming distance is then compared with a threshold: if the distance is less than the threshold, the two points are regarded as a matching pair; otherwise, they are not. The framework of feature extraction and matching is shown in Figure
The process of multifeature fusion extraction and matching.
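One plausible OpenCV realization of this fusion is sketched below: ORB keypoints are augmented with subsampled Canny edge pixels, binary descriptors are computed with ORB's BRIEF-derived descriptor (rBRIEF, standing in for plain BRIEF here), and brute-force Hamming matching with an assumed distance threshold selects the matches. The file paths, subsampling stride, and threshold are all placeholders.

```python
# Multifeature fusion sketch: ORB keypoints + Canny edge keypoints,
# binary descriptors, brute-force Hamming matching with a threshold.
import cv2
import numpy as np

def extract_fused(gray):
    orb = cv2.ORB_create(nfeatures=1000)
    kps = list(orb.detect(gray, None))
    # Sample Canny edge pixels as extra keypoints for low-texture objects.
    edges = cv2.Canny(gray, 100, 200)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs[::20], ys[::20]):          # subsample edge pixels (stride assumed)
        kps.append(cv2.KeyPoint(float(x), float(y), 31))
    kps, desc = orb.compute(gray, kps)            # rBRIEF binary descriptors
    return kps, desc

tmpl = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)   # placeholder paths
query = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
kp1, d1 = extract_fused(tmpl)
kp2, d2 = extract_fused(query)

# Brute-force Hamming matching; keep matches below an assumed threshold.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = [m for m in bf.match(d1, d2) if m.distance < 50]
```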
The core idea is to select four noncoplanar virtual control points; all the spatial reference points are then represented by these four virtual control points, and the coordinates of the virtual control points are solved from the correspondences between the spatial reference points and their projection points, thereby obtaining the coordinates of all the spatial reference points. Finally, the rotation matrix and the translation vector are solved. The specific algorithm is described as follows.
Given $n$ spatial reference points with known world coordinates $p_i^w$ ($i = 1, \dots, n$) and known image projections, the goal is to recover the rotation matrix $R$ and translation vector $t$ of the camera.
First, select four noncoplanar virtual control points in the world coordinate system. The relationship between the virtual control points and their projection points is shown in Figure
Virtual control point and its projection point correspondence.
As shown in the figure, each spatial reference point is expressed as a weighted combination of the four virtual control points, $p_i^w = \sum_{j=1}^{4} \alpha_{ij} c_j^w$ with $\sum_{j=1}^{4} \alpha_{ij} = 1$; since the weights $\alpha_{ij}$ are invariant to the choice of coordinate frame, $p_i^c = \sum_{j=1}^{4} \alpha_{ij} c_j^c$ also holds in the camera frame.
Assume the camera intrinsic parameters obtained from calibration form the matrix $K$. By the pinhole model, each reference point satisfies $w_i [u_i, v_i, 1]^T = K p_i^c = K \sum_{j=1}^{4} \alpha_{ij} c_j^c$, where $(u_i, v_i)$ are the pixel coordinates of its projection and $w_i$ is a projective scale factor.
Then, obtain the equation: writing $c_j^c = (x_j^c, y_j^c, z_j^c)^T$, with focal lengths $(f_u, f_v)$ and principal point $(u_c, v_c)$, eliminating $w_i$ gives two linear equations per point, $\sum_{j=1}^{4} \big(\alpha_{ij} f_u x_j^c + \alpha_{ij}(u_c - u_i) z_j^c\big) = 0$ and $\sum_{j=1}^{4} \big(\alpha_{ij} f_v y_j^c + \alpha_{ij}(v_c - v_i) z_j^c\big) = 0$; stacking them for all $n$ points yields the homogeneous linear system $Mx = 0$, where $x = [c_1^{c\,T}, c_2^{c\,T}, c_3^{c\,T}, c_4^{c\,T}]^T \in \mathbb{R}^{12}$.
Assume $x = \sum_{k=1}^{N} \beta_k v_k$, where the $v_k$ are the eigenvectors of $M^T M$ associated with its smallest eigenvalues and the coefficients $\beta_k$ are determined by requiring that the distances between the virtual control points in the camera frame equal their distances in the world frame.
The solution gives the camera-frame coordinates of the four virtual control points, from which the camera-frame coordinates of all spatial reference points are recovered through the weights $\alpha_{ij}$.
The image coordinates of the four virtual control points obtained by the solution and the camera focal length obtained during the calibration process are taken into the absolute positioning algorithm to obtain the rotation matrix and the translation vector.
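The same 2D-3D pose solve is available through OpenCV's EPnP implementation; the sketch below uses synthetic placeholder points and an assumed intrinsic matrix rather than the paper's data.

```python
# Pose from 2D-3D correspondences with OpenCV's EPnP solver.
import cv2
import numpy as np

object_pts = np.random.rand(12, 3).astype(np.float32)        # 3D reference points (world frame, placeholder)
image_pts = np.random.rand(12, 2).astype(np.float32) * 640   # their 2D projections (placeholder)
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])                              # calibrated intrinsics (assumed)
dist = np.zeros(5)                                           # images assumed already undistorted

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, dist,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)        # rotation matrix from the rotation vector
camera_position = -R.T @ tvec     # camera center in the world frame
```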
We conducted two experiments to evaluate the proposed system. In the first experiment, we compare the proposed algorithm with other state-of-the-art algorithms on public datasets and then perform numerical analysis to show the accuracy of our system. The second experiment evaluates the positioning accuracy in the real world.
The experimental devices include an Android mobile phone (Lenovo Phab 2 Pro) and a depth camera (Intel RealSense D435) as shown in Figure
Intel RealSense D435 and Lenovo mobile phone.
The user interface of the proposed visual positioning system on a smart mobile phone running in an indoor environment.
In this experiment, we adopted the ICL-NUIM dataset which consists of RGB-D images from camera trajectories from two indoor scenes. The ICL-NUIM dataset is aimed at benchmarking RGB-D, Visual Odometry, and SLAM algorithms [
Table
Comparison of mean position and orientation errors on the ICL-NUIM dataset.
| Method | Living room | Office room |
|---|---|---|
| PoseNet | 0.60 m, 3.64° | 0.46 m, 2.97° |
| 4D PoseNet | 0.58 m, 3.40° | 0.44 m, 2.81° |
| CNN+LSTM | 0.54 m, 3.21° | 0.41 m, 2.66° |
| Ours | 0.48 m, 3.07° | 0.33 m, 2.40° |
The images are acquired by a handheld depth camera at a series of locations. The image size is
Images captured from different scenes.
Using the RTAB-Map algorithm, we obtain the 3D point cloud of the laboratory, which is shown in Figure
3D point cloud of laboratory.
The 2D map of our laboratory is shown in Figure
Environmental map and walking route.
In the offline stage, we collected a total of 144 images. Since some of the images captured at different spots are similar, we divide them into 18 categories. In the online stage, we captured 45 images at different locations on route 1 and 27 images on route 2. The classification accuracy is defined as the proportion of correctly classified images: $\text{accuracy} = N_{\text{correct}} / N_{\text{total}} \times 100\%$.
Most mismatched scenes are concentrated at corners, mainly due to the lack of distinctive features or to feature mismatches. Several mismatched scenes are shown in Figure
Mismatched scene.
After removing the wrongly matched results, the error cumulative distribution function graph is shown in Figure
Error cumulative distribution function graph.
The trajectory of the camera is compared with the predefined route. After calculating the Euclidean distance between the positions estimated by our method and the true positions, we obtain the error cumulative distribution function graph (Figure
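For reference, the error CDF can be produced as in the short sketch below; the trajectories here are synthetic placeholders standing in for the estimated and ground-truth positions.

```python
# Euclidean position errors and their empirical CDF.
import numpy as np
import matplotlib.pyplot as plt

estimated = np.random.rand(72, 2) * 10                 # placeholder estimated positions (m)
truth = estimated + np.random.randn(72, 2) * 0.3       # placeholder ground truth (m)

errors = np.sort(np.linalg.norm(estimated - truth, axis=1))  # Euclidean distances
cdf = np.arange(1, len(errors) + 1) / len(errors)            # empirical CDF

plt.plot(errors, cdf)
plt.xlabel("Position error (m)")
plt.ylabel("Cumulative probability")
plt.show()
```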
Since the original depth images in our experiment are processed by RTAB-Map, the accuracy of the resulting model is limited. For example, in an indoor environment, intense illumination and strong shadows may lead to inconspicuous local features, which makes it difficult to construct a good point cloud model. In the future, we plan to use laser equipment to construct the point cloud.
In this article, we have presented an indoor positioning system based only on cameras. The main work is to use deep learning to identify the category of the scene and to use 2D-3D matched feature points to calculate the location. We implemented the proposed approach on a mobile phone and achieved decimeter-level positioning accuracy. A preliminary indoor positioning experiment is reported in this paper, but the experimental site is a small-scale place. The following work needs to be done in the future: with the rapid development of deep learning, high-level semantics can be generated to effectively overcome the limitations of hand-crafted features; a more robust, lightweight image retrieval algorithm can be adopted; and tests under different lighting and dynamic environments, system tests in large-scale scenarios, and long-term performance tests should be carried out.
The data used to support the findings of this study are included within the article.
The authors declare that they have no conflicts of interest.
This study was partially supported by the Key Research Development Program of Hebei (Project No. 19210906D).