A Novel Feature-Level Data Fusion Method for Indoor Autonomous Localization

We present a novel feature-level data fusion method for autonomous localization in an inactive multiple-reference unknown indoor environment. Since monocular sensors cannot provide depth information directly, the proposed method incorporates the edge information of images from a camera with the homologous depth information received from an infrared sensor. Real-time experimental results demonstrate that the accuracies of position and orientation are greatly improved by the proposed fusion method in an unknown complex indoor environment. Compared to monocular localization, the proposed method achieves up to 70 percent improvement in accuracy.


Introduction
To achieve seamless positioning and navigation in degraded or denied unknown areas such as indoor environments [1,2], various new technologies have replaced the Global Positioning System (GPS) [3-6] and the Inertial Navigation System (INS) [7] as the mainstream technologies in the indoor autonomous localization field. The most popular one is visual Simultaneous Localization and Mapping (SLAM) [3,8,9]. One of the key technologies in SLAM is autonomous localization, that is, localizing oneself in an unknown environment; the other is creating a map of this environment. These two tasks are coupled: accurate localization benefits mapping, and a good map is crucial for localizing oneself. For this reason, the two tasks must be performed at the same time [10].
The key to good localization is the correct processing of the geometrical information gathered by cameras. Three types of methods have been considered for autonomous localization estimation. (a) Monocular autonomous localization: at an early stage, researchers preferred monocular methods for their simple device structure. A good example was introduced by Civera et al. [11,12], where 1-point Random Sample Consensus (RANSAC) based on the Extended Kalman Filter (EKF) was used as the data association method. They also showed the real-time feasibility of monocular SLAM with affordable hardware that cannot provide depth information, using the EKF algorithm to recover depth from feature information [13]; however, the monocular method requires a long convergence time to obtain the depth information. (b) Multisensor of the same "type": stereovision SLAM is a typical example, treating two linked cameras as a single 3D sensor [14,15]; it suffers from a limited 3D estimation range and mechanical fragility, since a stereovision sensor can provide reasonable 3D estimation only within a limited range. (c) Multisensor of different "types": with the use of data fusion and filtering methods in the multisensor field, researchers have recently paid increasing attention to multisensor methods [16-18]. A multisensor of different "types" can provide more kinds of feature information, enhancing estimation accuracy and strengthening anti-interference. To further improve the performance of multisensors of different "types" in SLAM, we introduce a novel feature-level data fusion method for indoor autonomous localization. The basic idea is to fuse camera and infrared sensor information with each other. In contrast to a laser sensor, this type of sensor also has the outstanding advantage of low cost. This paper is organized as follows. Our system framework is introduced in Section 2.
Section 3 describes the proposed feature-level data fusion based indoor localization method. Section 4 presents the experimental results. Section 5 concludes the paper.

Framework of Proposed System
Among the various SLAM methods [10], the most popular one is monocular SLAM [11]. In the literature, feature detection [19], data association, and prediction are important parts of SLAM. Feature detection extracts features in images, and one can localize the camera by analyzing the relative position between these points and the camera. Data association maps the feature points between two adjacent frames. Because of the lack of depth information, the initial depth is usually assumed before localization and is then estimated after several iterations of a sequential estimator such as the EKF.
In this section, we present a framework that supports the feature-level data fusion based autonomous localization, as shown in Figure 1. In the proposed localization method, features in the unknown area are first detected. The feature-level data fusion based autonomous localization of Section 3 is then performed. It is designed to deal with two problems: (a) the unavailable initial depth information of the monocular method and (b) the self-calibration of multiple dissimilar sensors.
Figure 1 is the block diagram of the proposed system. The state vector x can be represented by the set of camera states x_v and the set of feature point states y_i:

x = [x_v^T, y_1^T, y_2^T, ..., y_n^T]^T,  x_v = [r^WC, q^WC, v^W, ω^W]^T,  (1)

where x_v contains the position and orientation as a 6-DOF [20] constant-velocity model, y_i contains the world coordinates of the feature points, WC represents the world coordinate, q^WC is the direction of the sensor view, and v^W and ω^W are the velocity and angular velocity. Consider the following:

x_v(k+1) = [r^WC_k + (v^W_k + V_k) Δt,  q^WC_k × q((ω^W_k + Ω_k) Δt),  v^W_k + V_k,  ω^W_k + Ω_k]^T.  (2)

In (2), k+1 refers to the step number, q((ω^W_k + Ω_k) Δt) transforms the local orientation vector into a quaternion, V_k and Ω_k are the impulses of velocity and angular velocity, and Δt is the delay between k and k+1. From the velocity model, the EKF prediction of the state vector can be described as

x_v(k+1 | k) = f_v(x_v(k | k)),  P(k+1 | k) = F_v P(k | k) F_v^T + G_k Q_k G_k^T,  (3)

where x_v(k+1 | k) stands for the prediction of the position and orientation of the camera, F_v is the Jacobian of f_v, Q_k is the covariance of the zero-mean Gaussian noise assumed for the dynamic model, and G_k is the Jacobian of this noise. Consider the following:

x(k+1 | k+1) = x(k+1 | k) + K (z_{k+1} - h(x(k+1 | k))),  K = P(k+1 | k) H^T (H P(k+1 | k) H^T + R)^{-1},  (4)

where x(k+1 | k+1) is the updated state with the features information, z_{k+1} is the observation at (k+1 | k), K stands for the Kalman gain, and H is the Jacobian of the measurement equation h. With the state vector and covariance matrix, the prediction of the camera motion is available.
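The prediction and update steps above follow the standard EKF recursion. A minimal runnable sketch, with an illustrative one-dimensional position/velocity state and assumed noise values rather than the paper's full camera state, is:

```python
import numpy as np

def ekf_predict(x, P, F, Q):
    """Propagate the state mean and covariance through the motion Jacobian F."""
    x_pred = F @ x                        # linearized constant-velocity prediction
    P_pred = F @ P @ F.T + Q              # inflate covariance by process noise
    return x_pred, P_pred

def ekf_update(x_pred, P_pred, z, H, R):
    """Fuse an observation z with measurement Jacobian H via the Kalman gain."""
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# Toy example: state [position, velocity], observing position only.
dt = 0.4                                  # assumed Δt between steps k and k+1
F = np.array([[1.0, dt], [0.0, 1.0]])     # constant-velocity model Jacobian
Q = 0.01 * np.eye(2)                      # assumed process noise covariance
H = np.array([[1.0, 0.0]])                # measurement Jacobian: observe position
R = np.array([[0.05]])                    # assumed measurement noise covariance

x, P = np.array([0.0, 1.0]), np.eye(2)
x, P = ekf_predict(x, P, F, Q)            # predicted position: 0.4
x, P = ekf_update(x, P, np.array([0.45]), H, R)
```

After the update, the state is pulled toward the observation, exactly as the features correct the camera pose in the framework above.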
The data association block in the framework can reduce the size of the data set, and 1-point RANSAC [11] is a paradigm for this model. The number of RANSAC iterations, and hence the computational cost, can be greatly reduced. After data association, the features of the two adjacent frames are linked, deleted, or created.
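The 1-point RANSAC idea, hypothesizing a model from a single match and letting the remaining matches vote, can be sketched as follows; the translation-only motion model, threshold, and iteration count are simplifying assumptions (the method of [11] draws its hypotheses from the EKF prior):

```python
import random

def one_point_ransac(matches, iters=10, thresh=2.0, rng=random.Random(0)):
    """matches: list of ((x1, y1), (x2, y2)) feature pairs between two frames.
    One match is enough to hypothesize a 2-D translation; the rest vote."""
    best_inliers = []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.choice(matches)  # one match -> one hypothesis
        dx, dy = x2 - x1, y2 - y1                 # hypothesized translation
        inliers = [m for m in matches
                   if abs(m[1][0] - m[0][0] - dx) < thresh
                   and abs(m[1][1] - m[0][1] - dy) < thresh]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers

# Nine consistent matches shifted by (5, 3) plus one gross outlier.
good = [((i, i), (i + 5, i + 3)) for i in range(9)]
matches = good + [((50, 50), (90, 10))]
inliers = one_point_ransac(matches)
print(len(inliers))   # -> 9 (the outlier is rejected)
```

Because a single correspondence fixes the hypothesis, far fewer iterations are needed than with multi-point sampling, which is the computational saving noted above.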

Proposed Feature-Level Data Fusion Based Autonomous Localization
In this section, we present an autonomous localization method based on a novel feature-level data fusion method. Because of the difference between the camera and the infrared sensor, depth information cannot be projected onto an image with high accuracy. The proposed method provides the fused information of depth and image through the edge information, as shown in Figure 2. With the known pose of the camera, the world coordinates of the feature points can be inferred; similarly, from the features, the pose can also be calculated.

3.1. Camera and Infrared Sensor Geometric Modeling. A 3D point P in the scene and its corresponding projection p are related by the following equations. We use X, Y, Z to represent the world coordinates and u, v for the image coordinates:

p = K [R | t] P,  p_ir = K_ir [R | t] P,  (5)

where K is the camera calibration matrix, K_ir is the infrared sensor calibration matrix, R is the rotation matrix, and t is the translation vector between the world and camera coordinate systems.
Consider the following: K is defined by f_u and f_v, which refer to the focal lengths of the camera along the u-axis and v-axis, and by u_0 and v_0, which are the elements of the principal point vector:

K = [f_u 0 u_0; 0 f_v v_0; 0 0 1].  (6)

The infrared sensor has the same R and t as the camera. The corresponding projection in the infrared sensor is p_ir, and the relation between p and p_ir can be expressed as

p_ir = K_ir K^{-1} p.  (7)
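A minimal sketch of this pinhole projection model follows; the intrinsic values are illustrative assumptions, not the Kinect's calibrated parameters:

```python
import numpy as np

# Pinhole projection: p ~ K [R | t] P. The intrinsics (f_u, f_v, u0, v0)
# below are assumed values for illustration only.
K = np.array([[525.0,   0.0, 320.0],    # f_u, 0,  u_0
              [  0.0, 525.0, 240.0],    # 0,  f_v, v_0
              [  0.0,   0.0,   1.0]])

R = np.eye(3)                            # camera aligned with the world axes
t = np.zeros(3)                          # camera at the world origin

def project(P_world, K, R, t):
    """Project a 3-D world point to pixel coordinates (u, v)."""
    P_cam = R @ P_world + t              # world frame -> camera frame
    uvw = K @ P_cam                      # camera frame -> homogeneous pixels
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

u, v = project(np.array([0.2, -0.1, 2.0]), K, R, t)
print(round(u, 1), round(v, 1))          # -> 372.5 213.8
```

Since the text states that the infrared sensor shares R and t with the camera, the same `project` call with K_ir in place of K maps the same world point into the depth image, which is exactly the pixel correspondence of equation (7).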

3.2. Feature-Level Data Fusion Method of Image and Depth. Although most of the feature point extraction methods in Section 2 concentrate on the edges of objects, the depth information at these places is not accurate. To solve this problem, we use a feature-level data fusion method based on edge information, performed in three steps. Before performing the data fusion, edge information is chosen as the feature by analyzing the image and depth information; comparing various methods, we select the Canny operator for extraction. The block diagram of the proposed method is shown in Figure 3. First, p = (x, y) is one edge point of the image I, p_d = (x, y) is the projection of (x, y) onto the depth information D, and M is the cross-shaped matrix with its center point at p_d. It is constituted by the 4 points adjacent to p_d, as shown in Figure 4.
Consider the following: second, the nearest nonzero point to p_d in M, named q, which is the point used for edge alignment, should be obtained. We assume that q = (x + i, y + j), which means that |i| + |j| is the minimum among all nonzero points in M.
Third, moving q = (x + i, y + j) toward p_d in the horizontal or vertical direction, and then extracting the depth information of the point next to q (the one farther than q from p_d), we can fill in all the empty points between q and p_d. Given the relations between i and j, there are 4 cases that should be considered. D_f is the fused depth information obtained in this way.

After introducing the projection of features from the image plane into the 3D world coordinate, the relation between the camera pose and the 3D feature points in the world coordinate is described here. Before depicting the process, a problem should be noticed. Assume a feature point P_1 is registered in the world coordinate and its depth information is available. In that case, let the camera move along the tangent of the circle whose center is the feature point P_1. Even though there is no change in the image projection and depth of P_1, the pose of the camera should not be considered unchanged. Therefore, the number of features used for camera pose estimation should be at least two. To carry out this calculation, the following parameters must be known, as shown in Figure 5: (1) the world coordinates of the feature points P_1 and P_2, (2) the projections of the feature points on the image, p_1 = [u_1, v_1]^T and p_2 = [u_2, v_2]^T, (3) the depth information of the feature points, d_1 and d_2, (4) the vertical and horizontal visual angles of the camera view, α and β, and (5) the size of the images, W and H. Assume that the camera world coordinate is C = [x_c, y_c, z_c]^T and the angles between the camera view and the coordinate axes are θ (horizontal) and φ (vertical). One can calculate the camera pose using the following equations. Obviously, x_c, y_c, z_c, θ, and φ can be derived by solving these equations, so the pose of the camera can be obtained from the previously mentioned parameters.
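The three-step edge alignment can be sketched as follows; the search radius and the toy depth map are assumptions for illustration:

```python
import numpy as np

def fuse_edge_depth(depth, x, y, radius=3):
    """Fill a zero-depth pixel at (x, y) from the nearest nonzero neighbor
    along the four axis directions (the cross-shaped neighborhood M).
    The search radius is an assumed bound, not from the paper."""
    if depth[y, x] != 0:
        return depth                      # depth already valid at the edge point
    best = None                           # (|i|+|j|, value) of nearest nonzero q
    for r in range(1, radius + 1):
        for dx, dy in ((r, 0), (-r, 0), (0, r), (0, -r)):
            xx, yy = x + dx, y + dy
            if 0 <= xx < depth.shape[1] and 0 <= yy < depth.shape[0]:
                if depth[yy, xx] != 0 and (best is None or r < best[0]):
                    best = (r, depth[yy, xx])
        if best is not None:
            break                         # nearest ring found, stop searching
    if best is not None:
        depth[y, x] = best[1]             # copy q's depth toward the edge point
    return depth

# Toy 5x5 depth map (mm): one column of missing depth along the "edge".
d = np.full((5, 5), 800, dtype=np.uint16)
d[:, 2] = 0                               # misaligned edge -> invalid depth
fused = fuse_edge_depth(d, x=2, y=2)
print(fused[2, 2])                        # -> 800
```

In the full method this fill is repeated for every Canny edge point of the image, so the fused map D_f has valid depth exactly where the feature extractor looks.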
Consider the following: finally, consolidating the previous equations, if the camera pose at time t is known, the image coordinates of the feature points and their depths are known at all times, and the feature point information is stable, then the camera pose at time t + 1 can be calculated, and the world coordinates of the feature points at time t + 1 can also be calculated from the previously described data. Therefore, by mathematical induction, as long as at least two feature points can be found in each later image, the camera pose can be obtained at every moment.
The algorithm ultimately achieves autonomous localization estimation. The hardware operational principle, which involves the camera, the infrared transmitter, and the infrared receiver, is shown in Figure 6.

Experiments and Results Analysis
The features on the image plane are projected into world coordinates through the information from these sensors. The frame rate is 30 frames per second. The software platforms for the Kinect are OpenNI [23] and OpenCV [24].

4.2. Experimental Results and Analysis of Data Fusion.
The image and the depth information of the carton shown in Figure 7 are used for experiments.
(1) After calibration, the intrinsic parameters of the infrared sensor and the camera are K_ir and K. Since their rotation matrices and translation matrices are the same, the pixel mapping relation between the depth information and the images can be calculated through equation (7).
(2) The edge detection results of the depth information and the image are shown in Figure 8; the red lines represent the image edges and the blue lines represent the depth information edges.
(3) After the alignment of the edges, the results of the feature-level data fusion are shown in Figure 9. The left one is the original data, and the fused data is on the right.
From the edge parts in Figure 9, we can find that there are a lot of glitches in the raw data. Compared to the left one, the right image shows a qualitative improvement along the carton edges. After removing the glitch noise, the fused information is ready for further data analysis.

4.3. Indoor Autonomous Localization Estimation Error and Analysis. In the autonomous localization experiments, the chosen scene is a complex indoor environment shown in Figure 10. All of the depth information is continuously distributed in a range of 0.85 m to 9 m. To evaluate the positioning method comprehensively, two experiments are conducted in this part: the sensors are in uniform rectilinear motion in the first one and in nonuniform nonlinear motion in the second one. In the first experiment, the camera moves 2.2 m from the original point along the negative direction of one coordinate axis; the Kinect sensor is placed on a cart moving at a speed of 2 cm/s in uniform motion. To simulate human behavior, we conduct the second experiment: the camera first moves 1.6 m uniformly along the positive direction of the same axis from the original point and then moves along a quarter circle with a radius of 1 m whose tangent is that axis. The speeds of the uniform motion and the quarter-circle motion are 4 cm/s and 2 cm/s, respectively.
The infrared sensor and the camera obtain the depth and image information at a frequency of 2.5 Hz; the images are 640 × 480 color pixels, and the depth information is 16-bit grey level at 320 × 240 pixels. The raw data from the sensors are recorded on a hard drive for post-experiment analysis. Three kinds of data are compared: the images alone (monocular), the pixel-level fused data, and the feature-level fused data. Figure 11 shows the positioning cumulative errors in terms of the error in the Euclidean coordinate and along the two coordinate axes. The first plot shows that the fused data can improve the accuracy by 70 percent along one axis, reducing the error from 0.5 m to 0.15 m. The second plot shows that the fused data can improve the accuracy by 67 percent along the other axis, reducing the error from 0.45 m to 0.15 m. The third plot shows that the fused data can improve the accuracy by 67 percent in the Euclidean distance. These results show that the localization method based on feature-level data fusion holds more advantages in the direction of motion and increases the positioning accuracy by about 70 percent.
In Figure 12, the positioning RMS errors are shown from four respects: the position and attitude errors in the Euclidean coordinate and the errors along the two coordinate axes; they evaluate the stability of the localization method. By comparing the third and fourth plots, we find that the localization method based on fused information has excellent performance in the direction of motion. At the same time, comparing the first and second plots, the improvement in the stability of the pose estimation is also considerable. Figure 13 follows the same arrangement as Figure 11 and can also be divided into three parts: the error in the Euclidean coordinate and along the two coordinate axes. The first plot shows the improvement of the error along one axis, reducing the error from 0.35 m to 0.02 m. The accuracy of localization is improved by about 60 percent along the other axis and in the Euclidean coordinate, with the error reducing from 1 m to 0.3 m.
Figure 14 shows the positioning RMS errors of the second experiment; the method can also be evaluated from four respects, as in Figure 12. From these plots, we can find that the localization method based on feature-level fused information holds a low RMS error level, as shown by contrastive analysis and practical calculation.

Conclusion
We presented a novel feature-level data fusion based autonomous localization method for unknown indoor environments in this paper. The accuracy of the method was evaluated with real-time experiments. The experiments were conducted in an inactive multiple-reference unknown indoor environment, and the platform consisted of a Kinect and a laptop. In the first experiment, the sensors moved in uniform linear motion; in the second, they moved in nonlinear, nonuniform motion. The positioning errors of the three types of information from different data fusion methods were compared. The first experiment proved that the accuracy improvement of the autonomous localization based on feature-level data fusion reaches up to 70 percent compared to monocular SLAM (without depth information) and 50 percent compared to the positioning method based on pixel-level fused data. The second experiment showed that the positioning method is also practical in a real environment. Thus, the advantage of feature-level fused data addresses two problems: the data fusion of multisensors of different "types" and the reliability of visual SLAM.

Figure 1 :
Figure 1: Framework of proposed system.

Figure 2 :
Figure 2: Replacement of monocular information with fused information.

Figure 3 :
Figure 3: The architecture of autonomous localization estimation using multisensor.

Figure 4 :
Figure 4: The processes of line alignment in the feature-level data fusion.

Figure 5 :
Figure 5: The relationships between camera pose and features.

Figure 6 :
Figure 6: Operational principle of the combined hardware of camera and infrared sensors.

Figure 7 :
Figure 7: Image from CCD (a) and infrared sensor data (b).

4.1. Experimental Platform. In this experiment, the Kinect [21,22] sensor is used. The Kinect is equipped with two sets of lenses. The middle sensor is the RGB camera (color CMOS) with a resolution of VGA (640 × 480). The infrared transmitter on the left and the receiver on the right comprise a set of 3D depth sensors with a resolution of QVGA (320 × 240).

Figure 8 :
Figure 8: Edges of image (red line) and depth information (blue line).

Figure 9 :
Figure 9: Comparison of original data (a) and fused data (b).

Figure 10 :
Figure 10: Experimental environment and its depth information.

Figure 11 :
Figure 11: The results of the first experiment. The cumulative positioning errors along the two coordinate axes ((a) and (b)) and the overall positioning error (c).

Figure 12 :
Figure 12: The results of the first experiment. The RMS errors of the positioning (a), the orientation estimation (b), and the positioning along the two coordinate axes ((c) and (d)).

Figure 13 :
Figure 13: The results of the second experiment. The cumulative positioning errors along the two coordinate axes ((a) and (b)) and the overall positioning error (c).

Figure 14 :
Figure 14: The results of the second experiment. The RMS errors of the positioning (a), the orientation estimation (b), and the positioning along the two coordinate axes ((c) and (d)).

3.3. Pose Estimation. In order to register a feature point p = [u, v]^T into the world coordinate as a 3D point P = [X, Y, Z]^T, the following parameters should be known: the pose of the camera (its world coordinate C = [x_c, y_c, z_c]^T and the angles θ and φ between the current camera view and the initial camera view, where θ is horizontal and φ is vertical), the depth d of the feature point, the vertical and horizontal visual angles of the camera view, α and β, and the image size W × H. The equation for registering p into the world coordinate P takes the form

X = x_c + d cos φ_p cos θ_p,  Y = y_c + d cos φ_p sin θ_p,  Z = z_c + d sin φ_p,

where θ_p and φ_p are the ray angles of the pixel (u, v), obtained by offsetting θ and φ by the pixel's normalized offset from the image center scaled by β and α.
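A sketch of this registration under the parameters listed above; the axis convention and the Kinect-like field-of-view values are assumptions:

```python
import math

def register_point(u, v, d, C, theta, phi, alpha, beta, W, H):
    """Back-project pixel (u, v) at depth d into a world 3-D point, given the
    camera position C, view angles (theta, phi), fields of view (beta
    horizontal, alpha vertical), and image size W x H."""
    # Per-pixel ray angles: offset from the view axis, scaled by the FOV.
    theta_p = theta + (u - W / 2.0) / W * beta   # horizontal ray angle
    phi_p = phi + (v - H / 2.0) / H * alpha      # vertical ray angle
    # Depth d measured along the ray; spherical -> Cartesian in world frame.
    X = C[0] + d * math.cos(phi_p) * math.cos(theta_p)
    Y = C[1] + d * math.cos(phi_p) * math.sin(theta_p)
    Z = C[2] + d * math.sin(phi_p)
    return X, Y, Z

# Camera at the origin with theta = phi = 0: the image-center pixel observed
# at a depth of 2 m should land 2 m straight ahead along the first axis.
P = register_point(u=320, v=240, d=2.0, C=(0.0, 0.0, 0.0),
                   theta=0.0, phi=0.0,
                   alpha=math.radians(43), beta=math.radians(57),
                   W=640, H=480)
print(P)   # -> (2.0, 0.0, 0.0)
```

Inverting these same relations for two registered features is what yields the camera pose (x_c, y_c, z_c, θ, φ) in Section 3.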