Monocular Vision-Based Robot Localization and Target Tracking

This paper presents a vision-based technology for localizing targets in 3D environment. It is achieved by the combination of different types of sensors including optical wheel encoders, an electrical compass, and visual observations with a single camera. Based on the robot motion model and image sequences, extended Kalman filter is applied to estimate target locations and the robot pose simultaneously. The proposed localization system is applicable in practice because it is not necessary to have the initializing setting regarding starting the system from artificial landmarks of known size. The technique is especially suitable for navigation and target tracing for an indoor robot and has a high potential extension to surveillance and monitoring for Unmanned Aerial Vehicles with aerial odometry sensors. The experimental results present “cm” level accuracy of the localization of the targets in indoor environment under a high-speed robot movement.


Introduction
Knowledge about the environment is a critical issue for autonomous vehicle operations.The capability of localizing targets with a robot for environment is highly demanded.In [1], it investigates the vision-based object recognition technique for detecting targets of the environment which have to be reached by a robot.After the robot finishes the target detection, it is an essential task for robots to know the positions of targets.The tasks include navigation and object tracking.As a result, the approach to localize targets for specific tasks is an essential issue in some applications.For UAVs, there are some similar demands [2,3] that UAVs have to track the positions of ground objects for reconnaissance or rescue assistance with monocular vision.Besides, UAVs need to observe the changing surroundings to understand the movements of the aircraft better, or the localizing system has to direct the aircraft to a region of interest after taking ground observations.
In recent years, odometry sensors have been widely used for estimating the motion of vehicles moving in a 3D space environment such as UAVs and Unmanned Ground Vehicles (UGVs).For instance, Inertial Navigation Systems (INSs) are applied to measure linear acceleration and rotational velocity, and capable of tracking the position, velocity and attitude of a vehicle by integrating these signals [4] or mobile vehicles use the Global Position System (GPS) and Inertial Measurement Units (IMUs) for land vehicle applications [5].However, most of inertial navigation system sensors are expensive for some applications in the indoor environment.Optical wheel encoders and an electrical compass provide linear and angular velocities respectively.Both of two sensors are basic odometry sensors and widely used owing to their low cost, simplicity, and easy maintenance.Encoders provide a way of measuring the velocity to estimate the position of the robot and compasses are often used to detect the orientations of the robot.Based on the sensor information, motion control is done and then the localization of the robot is estimated [6,7].Despite some limitations of an encoder, most researchers agree that an encoder is a vital part of the robot's navigation system and the tasks will be simplified if the encoder accuracy is improved.Besides, the additional sensor, a camera, allows a robot to perform a variety of tasks autonomously.The use of computer vision for localization has been investigated for several decades.The camera has not been at the center of robot localization while most of researchers have more attention to other sensors such as laser range-finders and sonar.However, it is surprising that vision is still an attractive choice for sensors because cameras are compact, cheaper, well understood, and ubiquitous.In this paper, the algorithm is achieved by combining different types

Explanation of the Algorithm
The algorithm is mainly divided into five parts including motion modeling, new target adding, measurement modeling, image matching, and EKF updating.By sequential measurement from sensors, the EKF is capable of improving The robot moves from r w i to r w k .Based on the two views, the parallax is produced.The initial depth can be refined to approach the real distance and the position of the ith target is estimated.
the initial estimate for the unknown information while simultaneously updating the localization of targets and the robot pose.Finally, Chelesky decomposition and forwardand-back substitution are presented to calculate the inverse covariance matrix efficiently.

The Origins of the Proposed Algorithm.
A single camera is mounted on our system as one of the sensors.The monocular vision infers that the depth of the target is not able to be measured by only one image but estimated by the sequential images.Therefore, the single camera has to estimate the depth by observing the target repeatedly to get parallax between different captured rays from the target to the robot.The orientations of target are estimated in the world coordinate system only by one image.Y w i is a sixdimension state vector and used to describe the position of the ith target in 3D space.Its equation is addressed as r w i is the location of the robot when the robot detects the ith target in the first time.G c i is defined as the position of the target with respect to the camera coordinate system and denoted by θ w i and ϕ w i are the orientations of the ith target and calculated by the pixel location of the target in only one image.The relationship between the position of the target G c i and the current robot localization r w k is presented in Figure 1.G w i is the position of the target with respect to the world frame system.1/ρ w i is defined as the distance from r w i to the target and m(θ w i , ϕ w i ) is its unit vector.Consequently, 1/ρ w i m(θ w i , ϕ w i ) is seen as the vector from r w i to the target and then G w i is calculated as ((r i is estimated if the depth 1/ρ w i is known in advance.However, the depth of the target is not able to be measured by a single image.The EKF is applied to estimate 1/ρ w i , θ w i , and ϕ w i and they converge toward more correct values under the recursive iteration by using sequential image measurement information.To sum up, (2) is the basic concept and origin of our proposed algorithm.

Motion Modeling.
We first derive the continuous-time system model that describes the time evolution in the state estimate.This model allows us to employ the sampled measurements of the wheel encoder and the compass for state propagation.The process state is described by the vector where r w k is the position of the robot with respect to the world frame.q w k is the quaternion that represents the orientation of the world frame for the robot.The linear velocity in the world frame and the angular velocity with respect to the camera frame are denoted by v w k and ω c k , respectively.In this system, the control inputs are measured by wheel encoders and the compass which provide the linear velocity and the angular velocity, respectively.Besides, the linear velocity is used to simulate the acceleration data of the robot for describing how components of the system state vector change completely.The definition of acceleration is defined as a w k = (v w k − v w k−1 )/Δt.In order to have similar on-line simulation, we define a w k as (v w k − v w k−1 )/Δt rather than (v w k+1 − v w k )/Δt.Although the former definition is not better than the later one which is closer to the real acceleration, an iterated EKF is still capable of compensating for errors when adopting (v w k − v w k−1 )/Δt as the robot's acceleration.The system model describing the time evolution of the wheel encoders and of the compass is given by The compass is used to measure Δθ k which is defined as where Δt.The covariance is predicted and modified if considering the control covariance Cov(v w k , ω c k , a w k ).We assume that the process noise of the control vector is not correlated with the process state vector Xm k so that Cov(Xm k , v w k , ω c k , a w k ) is set to be a zero matrix.By using (A.4) in Appendix A, the predicted covariance of the system model is expressed as where The first item F Xm Cov(Xm k )F Xm T represents the noise from the process state vector and Q describes the noise from the measurement of the wheel encoders and the compass.The Jacobian matrix of F Xm is shown as The Jacobian matrixes of F v and F ω are as the same as ∂f/∂v w k and ∂f/∂ω c k , respectively.

Adding New Target.
It is remarkable about our algorithm that new targets are initialized by using only one image.The initialization includes the state initial values and the covariance assignment.

Target Initialization.
Equation ( 1) is used to define the location of the new target and estimate the six parameters of Y w i with the EKF prediction-update loop.If the robot senses the 1st target at the state k for the first time, the new target information is added and then the process state vector is modified.The expanded process state vector is represented by the following equation: where The first time observation of the 1st target is done at the current camera location r w 1 .m(θ w 1 , ϕ w 1 ) is defined as an unit Figure 2: The orientations of the target with respect to the world frame can be estimated as a function of its undistorted pixel location h u .
vector from the location r w 1 to the 1st target with respect to the world frame.Figure 2 illustrates the relationship about m(θ w 1 , ϕ w 1 ), θ w 1 , and T is defined as the pixel location of the 1st target in an undistorted image.g c is the location of the 1st target with respect to the camera frame.The location of the 1st target with respect to the world frame g w is addressed as where q 2 1 +q 2 2 − q 2 3 − q 2 4 2 q 2 q 3 − q 1 q 4 2 q 2 q 4 + q 1 q 3 2 q 2 q 3 + q 1 q 4 q 2 1 −q 2 2 + q 2 3 − q 2 4 2 q 3 q 4 − q 1 q 2 2 q 2 q 4 − q 1 q 3 2 q 3 q 4 + q Then we get Only g w x /g c z and g w y /g c z rather than g w x and g w y are computed by using h u .It is impossible to know g c z by only one image.However, it is possible to make use of g w x /g c z and g w y /g c z to calculate θ w 1 and ϕ w 1 which are shown as The unit vector m(θ w 1 , ϕ w 1 ) is derived as a function of the orientations of the target and described as The final parameter 1/ρ w 1 is not able to be measured by only one image but estimated by the sequential images with EKF prediction-update loop.As a result, it is assumed to be an initial value 1/ρ 0 .

New Covariance
where B 1 is not a zero matrix because the new target's location is correlated with the estimates of the robot.According to (B.4) and (B.5) in Appendix B, B 1 and A 1 are derived as where 2.4.Measurement Modeling.The robot moves continuously and records the sequential images.This is a process of detecting and identifying the targets.The parameters of targets have been set into the process state vector and the covariance has been estimated with the recursive loop.

Sensor Model.
The predicted pixel locations of targets are estimated as a function of the prior process state vector which is described in (4).According to (2), the location of the ith target in the camera frame can be defined in another way: Our measurement sensor of the camera is monocular vision.
There is a camera sensor model to describe how the sensor maps the variables in the process state vector into the sensor variables.By using the pinhole model [16], T c i is derived as a function of the undistorted pixel location of the ith target (u i , v i ) and denoted by Equation ( 22) and ( 23) are combined to derive the predicted pixel location without distortion for the ith target Its equation is addressed as By using image correct, the predicted pixel location of the ith target within distortion Zd where k 1 and k 2 are coefficients for the radial distortion of the image.
The actual image target the robot takes is the distorted pixel location rather than the undistorted one.Therefore, we have to calculate Zd k(i) to get the measurement innovation for EKF updating.

Measurement Covariance
Assignment.The measurement covariance is expected to describe what the likely variation of measurement is by the sensor under the current condition.The variation is infected by the variables in the process state vector and the noise which corrupts the sensor.The measurement Jacobian matrix H i is where When there are N targets observed concurrently, they are stacked in one measurement vector Zd k to form a single batch-form update equation which is shown as Similarly, the batch measurement Jacobian matrix is derived as where Based on the estimation of measurement covariance, the small area where the target lies with high probability is able to be predicted.The measurement covariance of the ith target is defined as The template match technique is used to search the actual image pixel location of the ith target in the small search area whose width and length are kσ uu(i) and kσ vv(i) , respectively.The coefficient k is a constant and defines how large the search area is.The larger k is, the more is the search time.There are two advantages regarding the search area which is estimated by the measurement covariance.One advantage is that a lot of time is saved because the image pixel location of the ith target is detected in a small search area rather than in a 320 × 240 image.The other one is that the successful search rate increases dramatically because the search area allows the image pixel location to be correctly matched with high probability.As a result, it is not necessary to use a complicated object recognition algorithm such as Scale Invariant Feature Transform (SIFT) [17] for image matching in our system.Cross-correlation search is applied to be our template match algorithm and the computing loading is lower due to its simplification.This approach uses crosscorrelation of image to search a suitable image patch.I D is defined as the template image patch for the ith target and stored in the database.I t is defined as a candidate image patch in the search area whose width is kσ uu(i) and length is kσ vv(i) .The cross-correlation value of I t with I D is given by M 2 is the number of pixels in an image patch and σ ID and σ It are the stand deviations of I D and I D , respectively.I D and I t are the average values of I D and I t , respectively.The maximum value of cross-correlation of I t with I D is the best template matching and it is seen as the matching target pixel patch for the ith target.

Iterated EKF Updating.
The EKF is one of the most widely used nonlinear estimators due to its similarity to the optimal linear filter, its simplicity of implementation, and its ability to provide accurate estimates in practice.We employ the iterated EKF to update the state.At each iteration step k, the prior process state vector is computed by using (4) and then Zd k is calculated as a function of Xm − k by using (28).Next, the measurement innovation and the measurement Jocobian matrix are computed as Zd k − Zd k and H k(2N×(13+6N)) , respectively.The measurement covariance and Kalman gain can be performed as (33) Finally, the process state vector and its covariance are updated at each iteration state and presented as where Zd k is the actual measurement and detected by using image matching search.Zd k is the predicted measurement and computed by the sensor model.The error in the estimation is reduced by the iteration and the unknown depths of the targets converge toward the real values gradually.

Fast Inverse Transformation for Covariance Matrices.
The inverse matrix of the measurement covariance has to be computed at each iteration step.It needs plenty of running time by using the transpose of the matrix of cofactors to invert the measurement covariance.In order to deal with the large size of the inverse matrix within the variations of N targets efficiently, Cholesky decomposition is applied to invert the measurement covariance and reduce the running time at each iteration step.The measurement covariance is factored in the form Cov(Zd k ) = LL T for a lower-left corner of the matrix L because the measurement covariance is positive semidefinite.The matrix L is not unique, and so multiple factorizations of a given matrix Cov(Zd k ) are possible.A standard method for factoring positive semidefinite matrices is the Cholesky factorization.The element of L is computed by using Cholesky decomposition and addressed as follows: where A i j is the element of A = C(Zd k ).A matrix equation in the form LX = b or UX = b can be solved by forward substitution for lower triangular matrix L and back substitution for upper triangular matrix U, respectively.In our proposed method, Cholesky decomposition and forward-and-back substitution are combined to invert the measurement covariance for reducing the computational loading.

The Analysis of the Algorithm
In this section, there are two key points about the proposed algorithm presented as follows.The first key point is that the details of algorithm will be analyzed to prove that it is not necessary to get the depth information in our algorithm even though it is designed to track the 3D localization of the targets.The other key point is introduced to explain why the simple template match technique, cross-correlation search, is applied.

Depth Information Analysis in the Proposed Algorithm.
The depth information should be known while the robot localizes the targets in 3D space by using monocular vision.However, the proposed algorithm does not need the depth information to localize targets.As presented in (14), it is not necessary to know the depth information to calculate θ w 1 and ϕ w 1 .Similarly, it seems that the depth information should be known to compute k q in (19).k q is derived as where The depth information g w z has to be known for calculating ∂θ w i /∂g w according to the following equation: However, ∂θ w i /∂q w k still can be computed without knowing g w z because of ∂g w /∂q w k .The details are proved as follows: where (40) g w x /g c z , g w y /g c z , and g w z /g c z are computed by (14).Equation (39) can be calculated without knowing the depth information g c z due to the following equation: In the same way, the depth information should be known to compute k hd in equation ( 19).k hd is derived as where ∂g w /∂g c = R wc and ∂h u /∂h d can be computed without g c z but ∂θ w i /∂g w cannot be calculated without g w z according to (39).However, ∂θ w i /∂h d still can be estimated without knowing g c z and g w z if ∂θ w i /∂g w and ∂g c /∂h u are combined.The details are proved by the following equations: where Therefore, ∂θ w i /∂h d is computed without g c z and g w z by using (44) and applying (14) to calculate g w x /g c z , g w y /g c z and g w z /g c z .Based on the results of calculating ∂θ w i /∂q w k , ∂θ w i /∂h d and ( 14), it infers that we have solved a problem that the depth information, g c z and g w z , is not able to be measured only by one image.This is a very important characteristic of the proposed algorithm.

Analysis Regarding Using Cross-Correlation Search.
Generally speaking, cross-correlation search is though as a simple template match algorithm and it is not as robust as SIFT.However, I D is still detected correctly in the search area by using cross-correlation search because the small area is estimated by an iterated EKF and it includes the actual pixel location of the target with high probability.As shown in Figure 3, the predicted pixel locations of the targets are estimated by using the sensor model and denoted by the green crosses.The pixel locations of the red crosses are the corresponding target pixel locations and detected by applying cross-correlation search.Based on the testing result that the red crosses are closer to the actual target pixel locations, it has proved that it is feasible for the proposed algorithm to apply cross-correlation search as its template match algorithm.

Experimental Results
In order to validate the proposed algorithm for localizing targets when the ground truth is available we have performed a number of experiments.The algorithm is implemented by C++ and performed by PC with 2.0 GHz microprocessor.The monocular sensor is a single camera with wide angle lens because we hope that more targets can be observed in one image and tracking rate can be higher.The camera's field of view is 170 • with a focal length of 1.7 mm.The image measurements received at a rate of 15 Hz are distorted with noise σ = 20 pixel.The addressed experimental results are tested under the high speed robot motion because the average velocity of each case is higher than 20 cm/sec and the maximum velocity of all cases is 69.11 cm/sec.For the duration of experiments, the initial distance between the camera and the targets ranges from 1.68 m to 5.76 m.The unknown depth of the target is estimated by sequential images with EKF and six cases (3.0 m, 3.5 m, 4.0 m, 4.5 m, 5.0 m, and 5.5 m) are assumed to be default depths for each experiment case.All of the sensors mounted on our system are shown in Figure 4.
There are some measurement errors caused by the camera distortion when using the camera with wide angle lens.Before validating the proposed algorithm, we performed an experiment to estimate the distorted noise by making use of the artificial landmarks.We chose the corners of the artificial landmarks as targets.The undistorted pixel locations of the corners are estimated by the pinhole model and then the image correction is applied to compute their predicted pixel locations with distortion.Owing to using a camera with wide angle lens, there is an error between the predicted pixel location with distortion and its actual pixel location which is detected by the cross-correlation search.In the proposed algorithm, the predicted pixel location without distortion is estimated in terms of the prior process state vector.The distorted error is produced if transforming the undistorted pixel locations of targets to the distorted pixel locations.Therefore, the distorted error should be taken into consideration very carefully.Based on the experimental result, the distorted noise is assumed to be 20 pixels.Figure 6 shows the real environmental features as targets whose positions require to be estimated for a task that the robot passing through the door.Since the developed system does not require an initial setting at the first iteration, it provides a practical usage under a dynamical motion if target is detected instead of training processes of the environment features for robot localization.The two different moving  paths with varied linear velocities and orientations are examined as shown in Figure 7.The initial true depth for paths A and B between robot and the doors is about 3.5 meter, path A is a straight forward path, and path B is a left way of robot to move forward to the doors.The experimental result is summarized in Table 1, and the averaged error of proposed approach is about 10 cm.According to the experimental result shown in Figures 8 and 9, it infers that the camera is not able to provide the precise measurement information to make MonoSLAM [18] correcting the prior process state variables under a highspeed movement when the control input is not provided.It also shows that the image-distorted noise σ = 20 pixel is too large for MonoSLAM [18] model.Owing to combining hybrid sensor information, the proposed MVTL algorithm is capable of localizing targets with image distorted noise σ = 20 pixel and its average error is lower than 0.11 m.It has proved that MVTL algorithm is robust to track targets under a higher-speed movement with larger measurement noise.

Acceleration Approximation-Based Error Comparison.
We also performed an experiment about the comparison between encoders and IMU.We choose an encoder as our sensor instead of IMU in indoor environment.The acceleration a w (k) is approximated as (v w (k) − v w (k−1) )/Δt by encoders in order to have similar on-line simulation data.It will increase additional acceleration noise if we use prevelocity v w (k) rather than v w (k+1) to simulate the acceleration of the robot at state k.However, an iterated EKF is robust enough to compensate for errors from this low-cost sensor and the definition of acceleration.The error of acceleration from a simulated IMU is lower than 0.001 m/s 2 .Based on the result shown in Table 2, the errors of localization by using encoders are still accepted even though we use prevelocity v w (k) to simulate acceleration at iterated state k.It has proved that errors are able to be reduced gradually with EKF recursive loop.

Performance of Depth Initial Conditions.
In terms of the experimental result shown in Figure 10, we could conclude that the ability of localizing targets depends on the parallax from different views rather than the distance between the robot and the target.It is a very crucial viewpoint to analyze the stability of the system in terms of means and deviations.Not only do we analyze one the target localization but also we understand the details about multiple target localization.Refering to the above experiments, MVTL algorithm localizes targets with a higher-speed motion and six kinds of default depth values converge to the similar value with little errors.

Experiments of Fast Inverse Matrix Transformation.
Finally, we performed an experiment to verify Cholesky decomposition and forward-and-back substitution.In this experiment, the totally testing time is 4.2 sec with 15 Hz image rate.At first, the transpose of the matrix of cofactors is used to invert the measurement covariance matrix whose size ranges from 2 × 2 to 8 × 8.The experimental results shown in Table 3 present that the running time is long if the system inverts a 6 × 6 or a more dimension matrix.However, the running time is reduced dramatically if the system inverts an 8 × 8 matrix by using Cholesky decomposition and forwardand-back substitution.The test result shows that the system is capable of being real-time even though localizing four targets concurrently.able to localize targets with "cm" level accuracy under a higher-speed movement.Besides, we have validated that it is practical to use odometry sensors to track targets as well.Not only does the system start the recursive procedure without the initial setting but also the robot can move rapidly to localize targets.The EKF-based algorithm is really practical and robust to reduce errors from sensors even though the low-cost sensors are used in our system.The efficiency of the proposed algorithm is impressive by using Cholesky decomposition and some computational tricks.

Conclusion and Future Work
In terms of the experimental result shown in Figure 10, we could conclude that the ability of localizing targets depends on the parallax from different views rather than the distance between the robot and the target.Consequently, the proposed algorithm has a high potential extension to surveillance and monitoring for UAVs with aerial odometry sensors.In the same way, it is able to be used widely for robot tasks as well.

Future Work.
The targets are assumed to be stationary landmarks in the proposed algorithm.It will be an interesting and challenging research if the algorithm is modified to track a moving target.This type of the technique is necessary and going to be widely used in many applications.For instance, UGV has to know the intended motions of other moving objects in order to plan its navigating path for the next state.Therefore, it will be a good future work to add a new approach into the proposed algorithm to track moving targets.Besides, there is another important future work for our system.This future work is to find some ways or algorithms to improve accuracy of the measurement data by the low-cost sensor because it will improve the localization errors in terms of more accurate measurement data.

Appendices
The proof of the system covariance is presented in appendix.It derives the modified covariance in motion model at each iteration step and the new covariance after adding new targets.

A. The Modified System Covariance in Motion Model
The process state vector is predicted by the system process model.It is determined by the old state and the control inputs applied in the process.Y k is a control vector and

Figure 1 :
Figure1: The robot moves from r w i to r w k .Based on the two views, the parallax is produced.The initial depth can be refined to approach the real distance and the position of the ith target is estimated.
Assignment.The system covariance is modified after adding a new target.By using (B.1) in Appendix B, the new covariance Cov(X − k ) and the function of Y w 1 are denoted by

Frame = 50 Figure 3 :
Figure 3: Detect the actual target pixel location by using crosscorrelation search.

Figure 4 :
Figure 4: The encoder shown in (a) is used to measure the linear velocity and simulate the acceleration for the robot.The electrical compass shown in (b) is applied to measure the angular velocity of the robot.As shown in (c), the camera with wide angle lens is the additional measurement sensor.The three sensors are combined to estimate the localization of targets.

Figure 5 :
Figure 5: The experiments are designed to localize the natural landmarks as targets with varied linear and angular velocity patterns in different moving paths including the straight and curved paths.

Figure 6 :
Figure 6: The predicted pixel locations of the targets are denoted by the green crosses and estimated by using the sensor model.The pixel locations of the red crosses are the corresponding target pixel locations and detected by applying cross-correlation search.

4. 1 .
Results of Target Localization.For demonstrating the performance and practical applications to the proposed MVLT, the experiments for a mobile robot go through the doors by tracking the feature targets located upon the door and localize robot location simultaneously.4.1.1.Target Localization Experiments.Once a robot has to pass through doors, one should recognize the accurate doors position.The experiment shown in Figure5is designed to verify accuracy of MVLT for the door as tracked targets.

Figure 8 :
Figure 8: The comparison of the target 1 depth errors (meter) between MVTL and MonoSLAM [18] in Path B.

Figure 10 :
Figure 10: The unknown depth of the target is estimated by sequential images with EKF and six cases (3.0 m, 3.5 m, 4.0 m, 4.5 m, 5.0 m, and 5.5 m) are assumed to be default depths for each experiment case.

5. 1 .
Conclusion.Based on the different experiments we performed, it has proved that the proposed algorithm is pixel locations of the targets in a 320 × 240 image.Image matching is a fundamental and vital aspect of many problems in computer vision including motion tracking and object recognition.The image features are invariant to rotation, image scaling, change in illumination, and 3D camera viewpoint.It is still a popular research topic regarding how to allow a single candidate image to be correctly matched with high probability.
) 2.5.Image Matching by Using Cross-Correlation.The image patch information of the targets whose variables are set in the process state vector has been stored in the data-base when the robot senses them in the first time.The predicted pixel locations of the targets within distortion are estimated by using (25).Next important issue is how to search for the actual

Table 1 :
Error comparison of target tracking.

Table 2 :
Comparison of localization errors between the encoder and IMU.

Table 3 :
Comparison of execution time.