Vision System of Mobile Robot Combining Binocular and Depth Cameras

In order to optimize three-dimensional (3D) reconstruction and obtain more precise actual distances to objects, a 3D reconstruction system combining binocular and depth cameras is proposed in this paper. The whole system consists of two identical color cameras, a TOF depth camera, an image processing host, a mobile robot control host, and a mobile robot. Because of structural constraints, the resolution of the TOF depth camera is very low, which makes it difficult to meet the requirements of trajectory planning. The resolution of binocular stereo cameras can be very high, but stereo matching performs poorly in low-texture scenes, so binocular stereo cameras alone also struggle to meet the requirement of high accuracy. Hence, the proposed system integrates the depth camera with stereo matching to improve the precision of 3D reconstruction. Moreover, a double-thread processing method is applied to improve the efficiency of the system. The experimental results show that the system can effectively improve the accuracy of 3D reconstruction, accurately identify the distance of objects from the camera, and support the trajectory planning strategy.


Introduction
With the development of society, the application of robots is expanding rapidly [1]. As the range of application fields grows, the accuracy and stability of robot movement become essential to the adoption of robots. Consequently, how to improve the accuracy of robots has attracted wide attention, and more and more universities and companies have become involved in research on improving robot stability. 3D imaging technologies directly affect the accuracy of robot movement. A variety of 3D imaging technologies exist for acquiring depth information about the world. In general, they can be categorized into two major classes: stereo matching methods and direct depth measuring methods.
Stereo vision [2][3][4] mainly studies how to use imaging technology to obtain the distance information of objects in a scene from images taken from different views. At present, the basic method of stereo vision is to capture images of the same scene from two or more perspectives, obtain the disparity between corresponding pixels in the different images, and then derive the spatial position of the target in the scene through the principle of triangulation.
Binocular stereo vision simulates the human visual system, using two 2D images of the same scene obtained by two cameras from different points of view. The goal of binocular stereo matching is to find the corresponding matching points in the left and right images after calibration. The difference in the horizontal coordinates of corresponding points is called the disparity, and the disparities of all pixels form a disparity map. The key objective of stereo matching methods is to reduce the matching ambiguities introduced by low-texture regions; such methods can generally be classified into two categories: local matching methods and global matching methods. Researchers have done a wealth of work on stereo matching in the past few years, but, unfortunately, the fundamental problems in stereo such as occlusion and texture-less regions remain unsolved.
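As a concrete illustration of the local matching category, the following Python sketch computes a disparity map with a sum-of-absolute-differences (SAD) search along each row of a rectified pair. This is an illustrative toy example, not the algorithm used in this paper; all names and parameter values are assumptions.

```python
import numpy as np

def sad_disparity(left, right, max_disp=16, win=2):
    """Minimal local stereo matching: for each left-image pixel, search along
    the same row (images are assumed rectified) for the horizontal shift d
    that minimizes the sum of absolute differences over a (2*win+1)^2 window.
    """
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(win, h - win):
        for x in range(win + max_disp, w - win):
            patch_l = left[y - win:y + win + 1, x - win:x + win + 1].astype(np.int32)
            best_cost, best_d = None, 0
            for d in range(max_disp + 1):
                patch_r = right[y - win:y + win + 1,
                                x - d - win:x - d + win + 1].astype(np.int32)
                cost = np.abs(patch_l - patch_r).sum()
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

Real implementations (such as the BM algorithm used later in this paper) add pre-filtering, uniqueness checks, and sub-pixel refinement, but the core one-dimensional row search is the same.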
Direct depth measuring cameras [5,6] mainly use the TOF (time-of-flight) ranging principle. The system's illumination source emits a light beam into the scene; a detection sensor detects the reflected light signal and computes the propagation time of the optical signal. Finally, the spatial coordinates of the object are calculated from the propagation velocity of the optical signal and the relative positions of the light source and sensor. These sensors measure the time delay between transmission of a light pulse and detection of the reflected signal over an entire frame at once. In general, a binocular stereo vision system has the advantages of simple structure and high vertical resolution, but the stereo matching process is complex and produces many matching errors in weakly textured regions. The vertical resolution of a depth camera is independent of the distance of the scene within the imaging range, and it is relatively simple to achieve real-time three-dimensional imaging using depth camera data, but the resolution and stability of a TOF depth camera are low owing to hardware limitations.
In this paper, a novel vision system for a mobile robot is proposed, which achieves more accurate 3D data by fusing stereo matching and direct depth measurement. In the proposed 3D reconstruction system, an image processing host is used to simulate the human brain, and two parallel cameras plus a depth camera are used to simulate human eyes, as shown in Figure 1. Specifically, the image processing host uses the binocular stereo cameras to obtain 3D reconstruction data based on stereo matching. Then the image processing host fuses the 3D reconstruction data of stereo matching with the data of the depth camera to achieve 3D reconstructed data of high precision. Finally, it computes the physical distance of scene objects from the camera using these 3D data and executes the trajectory planning strategy [7,8] to improve the stability of the robot. Moreover, a double-thread processing method is applied to improve the efficiency of the system. When the robot works, the image processing host obtains 3D reconstruction data in a short time, and the latest motion route is computed to guide the robot. The experiments demonstrate that the 3D data fusion between the depth camera and the stereo cameras not only meets the requirements of practical applications but also improves the accuracy of matching; in practice, it is shown to improve the accuracy of 3D reconstruction.
The rest of this paper is organized as follows: Section 2 introduces the design of the system. In Section 3, we describe the framework of the 3D reconstruction system. In Section 4, experimental results are detailed and justify the superiority of our approach. Finally, conclusions are drawn in Section 5.

Design of the System
As shown in Figure 2, the whole system is composed of five parts: two color cameras of the same type, one depth camera, one image processing host, one mobile robot, and one mobile robot control host. The two color cameras are placed on top of the mobile robot along the same horizontal line, and the depth camera is placed below the left camera along the same vertical line. The three cameras communicate with the image processing host through three USB 3.0 links; the mobile robot is driven by two brushless DC motors, which are controlled by the mobile robot control host. The mobile robot control host and the image processing host communicate with each other through a serial port. In normal operation, the system is first initialized. The three cameras then start to photograph the same scene synchronously and transmit the image data to the image processing host through USB 3.0 in real time. The image processing host first applies the correlation algorithm [9,10] to the two color images and calculates their disparity data. Second, it registers the 3D image data of the low-resolution depth camera with the left color image. Then it projects the spatial points detected by the TOF camera into the matching view. Finally, the 3D reconstruction data of the binocular cameras is fused with the 3D data of the depth camera to obtain high-precision 3D reconstruction data, with which trajectory planning is carried out. While the image processing host is working, the image capturing thread continuously updates and refreshes the scene data to ensure planning within a short time. When the image processing thread completes, the trajectory planning data is sent to the mobile robot control host through the serial port, and the mobile robot control host controls the mobile robot to avoid obstacles.
As shown in Figure 2, the horizontal placement of the two color cameras is particularly important for the accuracy of binocular matching. We can obtain the relative rotation and translation parameters using a MATLAB calibration toolbox [11,12]. These parameters are used to rectify the images into line alignment, and the overlapping area of the two images is cropped [13,14]. The purpose of these preprocessing operations is to improve the accuracy of stereo matching and reduce the matching time.
From Figure 2, we can also see the relative position of the depth camera and the left color camera. In order to make the scene reconstruction and fusion more accurate, it is necessary to register the data obtained by the depth camera and the color camera; that is, it is essential to establish the correspondence between the depth camera and the left color camera [15][16][17]. An effective camera registration method between the depth camera and the left color camera is developed in this paper. Firstly, we calibrate the two cameras (one image from the left color camera, the other from the depth camera) to obtain the relative rotation and translation parameters. Then, the image processing host uses these parameters to project the points of the depth camera onto the image of the left color camera. At last, a novel fusion method is proposed to obtain high-precision depth data.

3D Fusion Reconstruction System
3.1. Binocular Ranging and Calibration. Let $(u, v)$ be the pixel coordinates in the color image, and let $(X_c, Y_c, Z_c)$ be the corresponding coordinates in the color camera coordinate system. Based on the pinhole imaging model,

$$u - u_0 = \frac{f X_c}{d_x Z_c}, \qquad v - v_0 = \frac{f Y_c}{d_y Z_c}.$$

Here, $f$ is the focal length of the color camera; $(u_0, v_0)$ are the coordinates of the principal point; $d_x$ and $d_y$ are the physical sizes of a pixel in the horizontal and vertical directions, respectively. Defining $f_x = f/d_x$ and $f_y = f/d_y$, we can obtain

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}.$$

Here, $f_x$, $f_y$, $u_0$, $v_0$ are the internal parameters of the color camera. In this paper, the internal parameters of the left and right cameras are obtained using the MATLAB Zhengyou Zhang calibration toolbox.

As shown in Figure 3, the coordinates of the target point in the left view are $(x_l, y_l)$ and the coordinates of the target point in the right view are $(x_r, y_r)$. Binocular ranging mainly exploits the difference between the horizontal coordinates of the target point in the two rectified views (i.e., the disparity [18] $d = x_l - x_r$), and the disparity is inversely proportional to the distance $Z$ from the target point to the image plane:

$$Z = \frac{f b}{d},$$

where $b$ is the baseline between the two cameras. Hence, the key to binocular ranging is to compute the disparity map between the left and right images. Various binocular stereo matching methods have been proposed to calculate disparity maps. Before binocular stereo matching, binocular rectification should be applied so that the rows of the left and right images are strictly aligned, lying exactly at the same level. As a result, a point in the left image and its corresponding point in the right image have the same row number, so a one-dimensional search along the row suffices to find the corresponding point for stereo matching, which saves computation. Specifically, the left and right images can be rectified into line alignment using the relative rotation matrix $R$ and translation matrix $T$, which are obtained in this paper using the MATLAB Zhengyou Zhang calibration toolbox. As shown in Figure 4, the images after calibration achieve line alignment well. In this paper, the BM stereo matching algorithm is then applied to obtain the disparity map because of the strict requirement on runtime.
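The triangulation relationship $Z = fb/d$ can be sketched as follows; the focal length and baseline values in the usage note are illustrative, not the calibrated parameters of this system.

```python
def disparity_to_depth(d_pixels, focal_px, baseline_m):
    """Triangulation for a rectified stereo pair: depth Z = f*b/d, where f is
    the focal length in pixels, b the baseline in meters, and d the disparity
    in pixels. Depth is inversely proportional to disparity."""
    if d_pixels <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_px * baseline_m / d_pixels
```

For example, with an assumed focal length of 800 px and a 10 cm baseline, a disparity of 100 px yields a depth of 0.8 m, and halving the disparity doubles the depth.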

3.2. Depth Registration.
In the system, the TOF depth camera is an SR4000, produced by the Swiss company Mesa Imaging. The TOF depth camera uses a modulator to emit infrared light at multiple angles. When the light returns to the sensor, the distance information for each pixel can be obtained in real time from the round-trip time. The TOF depth camera can thus obtain depth information of the scene in real time, but the resolution of the camera is only 144 × 176. Hence, in this paper, a fusion system is developed to integrate the depth camera with binocular stereo matching and improve the quality of the 3D reconstruction.
Before fusion, the data in the depth map need to be aligned pixel by pixel to the color pixels. The registration process depends on the result of camera calibration [19,20]. The parameters for registration are the conversion matrices between the two coordinate systems [21,22]. In this paper, an effective registration method is applied to align the depth image to the color image. Specifically, with the intrinsic parameters of the color camera, we can mark the spatial position of each experimental checkerboard in the color camera coordinate system. Then, the depth camera captures the position of the plate, which is coplanar with the checkerboard in physical space, as shown in Figure 5. Finally, we obtain the projection matrix between the depth camera and the color camera by a gradient descent optimization method.
Assume that the space point $P$ has coordinates $(X_c, Y_c, Z_c)$ in the color camera coordinate system and $(X_d, Y_d, Z_d)$ in the depth camera coordinate system.

Figure 5: The comparison of the depth camera coordinate system and the color camera coordinate system.
Using formula (2), we can recover the camera-frame coordinates from the pixel coordinates and the depth value, which yields formulas (7) and (8):

$$X_c = \frac{(u - u_0) Z_c}{f_x}, \qquad Y_c = \frac{(v - v_0) Z_c}{f_y},$$

and analogously for the depth camera. The conversion formula between the color camera and depth camera coordinate systems is

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R_{dc} \begin{bmatrix} X_d \\ Y_d \\ Z_d \end{bmatrix} + T_{dc}. \tag{9}$$

Here, $R_{dc}$ and $T_{dc}$ are the relative rotation and translation matrices between the left color camera and the depth camera. Expanding formula (9) row by row yields scalar equations (formulas (11)) in the entries of $R_{dc}$ and $T_{dc}$, from which the extrinsic parameters $R_{dc}$ and $T_{dc}$ can be calculated using a gradient descent optimization method. The accuracy of $R_{dc}$ and $T_{dc}$ plays an important role in the accuracy of the system. Since the depth information of the TOF camera is determined by the phase difference between the emitted and reflected infrared light, the higher the reflected light intensity, the higher the credibility of the depth value; conversely, if the light is largely absorbed, the obtained depth value may be far from the true value. So the depth values of the black checkerboard blocks are less precise, while the depth values of the white blocks are more realistic. Therefore, we extract only the depth data of the white checkerboard blocks for subsequent processing, selecting the white-block pixels by a gray-level threshold.
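The registration step above, transforming a point from the depth-camera frame into the color-camera frame and projecting it with the pinhole model, can be sketched as follows. The rotation, translation, and intrinsic values used in the test are illustrative placeholders, not this system's calibrated parameters.

```python
import numpy as np

def project_depth_point(p_d, R_dc, T_dc, fx, fy, u0, v0):
    """Transform a 3-D point from the depth-camera frame to the color-camera
    frame (P_c = R_dc @ P_d + T_dc), then project it to color-image pixel
    coordinates via the pinhole model: u = fx*X/Z + u0, v = fy*Y/Z + v0."""
    p_c = R_dc @ np.asarray(p_d, dtype=float) + np.asarray(T_dc, dtype=float)
    X, Y, Z = p_c
    return fx * X / Z + u0, fy * Y / Z + v0
```

Running this for every valid TOF pixel produces the registered (but sparse) depth map that the fusion stage fills in.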
Moreover, due to the noise of the TOF camera, points on the same plane will show slight deviations. In order to obtain depth values closest to the true ones, we fit the three-dimensional coordinate data of the same template with the plane equation $ax + by + cz + d = 0$. The optimal coefficients $a$, $b$, $c$, $d$ are obtained by the least squares method [23,24]. In order to get more accurate extrinsic parameters $R_{dc}$ and $T_{dc}$, we capture many groups of pictures and repeat the same procedure. Figure 6(a) shows some gray images of the checkerboard from the left color camera; Figure 6(b) shows gray images of the checkerboard from the depth camera; Figure 6(c) shows depth images of the checkerboard from the depth camera. After the calculation process above, the parameters $R_{dc}$ and $T_{dc}$ are obtained; using them, the depth image can be registered to the color image effectively, as shown in Figure 7.

3.3. Depth Fusion Algorithm.
As shown in Figure 7(c), many depth regions need to be filled after registration. Hence, an effective depth fusion algorithm is proposed in this paper, as shown in Figure 8. Firstly, bilinear interpolation is applied to the registered depth map. After interpolation, there still exist black holes to be filled, as shown in Figure 9(d). Then, a fusion method based on the result of binocular stereo matching is proposed to improve the accuracy. Specifically, we traverse the pixels of the interpolated depth image; when a pixel value is zero, we look up the corresponding value in the binocular matching result, as shown in Figure 9(c). If the binocular matching value is valid, it is filled into the depth image (before filling, the disparity from binocular stereo matching is converted into an actual depth value). An initial fused depth image is thus obtained, as shown in Figure 9(e).
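The hole-filling traversal described above can be sketched as follows; this is a minimal vectorized version with illustrative array names, assuming the stereo disparity has already been converted to depth.

```python
import numpy as np

def fuse_depth(depth_tof, depth_stereo):
    """Initial fusion step: wherever the registered/interpolated TOF depth map
    is still zero (a hole) and the stereo-derived depth is valid (non-zero),
    copy in the stereo value; elsewhere keep the TOF value."""
    fused = depth_tof.copy()
    holes = (depth_tof == 0) & (depth_stereo > 0)
    fused[holes] = depth_stereo[holes]
    return fused
```

Pixels where neither source is valid remain zero and are handled by the segmentation-based filling that follows.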
Moreover, a filling method based on mean-shift segmentation is further applied, based on the following prior assumptions: (1) pixels with similar colors within a region are likely to have similar depths; (2) world surfaces are piecewise smooth. Specifically, the mean-shift segmentation algorithm is applied to the left color image. Then the pixels in the remaining black holes are assigned the most frequent value of the non-zero pixels in the same color segmentation region. Finally, the final depth image is obtained after median filtering, as shown in Figure 9(f). It can be seen that the quality of the final depth image is obviously improved.
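The segmentation-based filling step can be sketched as follows, assuming a precomputed label map (e.g., from mean-shift segmentation) and integer-quantized depth values. This is an illustrative implementation, not the authors' code.

```python
import numpy as np

def fill_by_segment(depth, labels):
    """Fill remaining holes (zero pixels) with the most frequent non-zero depth
    value found in the same color-segmentation region. `labels` is an integer
    label map of the same shape as `depth`; depths are assumed quantized to
    non-negative integers so np.bincount can find the mode."""
    out = depth.copy()
    for lab in np.unique(labels):
        mask = labels == lab
        vals = depth[mask]
        nonzero = vals[vals > 0]
        if nonzero.size == 0:
            continue                   # no evidence in this segment; leave holes
        mode = np.bincount(nonzero.astype(np.int64)).argmax()
        out[mask & (depth == 0)] = mode
    return out
```

The mode (rather than the mean) is used so that a few outlier depths in a segment do not contaminate the filled value, matching the maximum-probability assignment described above.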

Experimental Results
The system uses as base hardware a mobile robot, two surveillance cameras of the same model (Mercury (MERCURY) Series-High Speed 30) made by Daheng, China (http://www.daheng-image.com/), and a Swiss Mesa Imaging SR4000 depth camera. The image processing software runs in VS2012 based on OpenCV 2.4.9, and an industrial control panel is used as the image processing host. On this basis, we tested the system proposed in this paper. After the system is initialized, the program starts running. First, the processing host reads the parameters into memory, and then the system starts to capture and display images (display is used only in test mode). The host matches the images using the parameters in memory and obtains new three-dimensional reconstruction data. Finally, the system completes trajectory planning with these data to avoid obstacles.
Firstly, experiments are carried out to test the system. As shown in Figures 10-12, the system obtains satisfying depth images after fusion. Then, the accuracy of the system is evaluated in Table 1. "The coordinate of our system" is the coordinate value of an object obtained using our system; "actual coordinate" is obtained from actual measurements (the depth camera coordinate system is defined as the real-world coordinate system). As shown in Table 1, the MSE of the measurement experiments is 0.019, 0.0076, and 0.0133 for the x, y, and z directions, respectively. The accuracy of our system can well meet the requirements of practical applications.
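The per-axis MSE reported in Table 1 would be computed as follows; this is a generic sketch, and the measured and actual points in the test are made-up examples, not the paper's data.

```python
def axis_mse(measured, actual):
    """Per-axis mean squared error between measured 3-D points and their
    ground-truth coordinates: for each axis i, average (m_i - a_i)^2 over
    all point pairs."""
    n = len(measured)
    return tuple(
        sum((m[i] - a[i]) ** 2 for m, a in zip(measured, actual)) / n
        for i in range(3)
    )
```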
Then, we analyze the time performance of the system. As shown in Table 2, the average processing time is 0.5513133 seconds, which can well meet the requirements of actual applications, such as mobile robot obstacle avoidance.
In the system, the software (i.e., the image processing system) uses VS2012 as the development platform, based on OpenCV 2.4.9. In order to observe scene changes and view the fusion result, we built the program on MFC. The program uses a multithreaded control mode. The camera part is controlled by a separate thread to facilitate real-time acquisition, which refreshes the image buffer in real time. The fusion part is controlled by another processing thread. The acquisition thread and the processing thread do not block each other, which guarantees high processing efficiency and avoids delay. The MFC interface and running results are shown in Figure 13 (the "Start Matching" button chooses whether to start the fusion; the "Stop Matching" button ends it). The current system uses multithreading to satisfy the strict requirement on time cost; later, we will try a GPU acceleration method to reduce the time cost further.
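The double-thread design can be sketched as follows: a simplified Python analogue of the C++/MFC threads, with a one-slot queue standing in for the image buffer and integers standing in for frames. All names here are illustrative.

```python
import queue
import threading
import time

def capture_loop(buf, stop):
    """Acquisition thread: keeps refreshing a one-slot buffer with the latest
    frame, dropping stale frames so the processing thread always sees the
    newest data (mirroring the real-time image buffer refresh)."""
    frame_id = 0
    while not stop.is_set():
        frame_id += 1
        try:
            buf.get_nowait()          # drop the stale frame, if any
        except queue.Empty:
            pass
        buf.put(frame_id)
        time.sleep(0.001)

def process_loop(buf, stop, results):
    """Processing thread: takes the newest frame, runs the (simulated) fusion
    and planning step, and stops after a fixed number of cycles."""
    while not stop.is_set():
        frame = buf.get()
        results.append(frame)          # stand-in for fusion + trajectory planning
        if len(results) >= 5:
            stop.set()

buf, stop, results = queue.Queue(maxsize=1), threading.Event(), []
t1 = threading.Thread(target=capture_loop, args=(buf, stop), daemon=True)
t2 = threading.Thread(target=process_loop, args=(buf, stop, results))
t1.start(); t2.start()
t2.join(); t1.join(timeout=2)
```

Because the capture thread overwrites rather than queues frames, the processing thread never works on outdated scene data, which is the property the paper's double-thread method relies on for timely planning.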

Summary
In this paper, the visual processing section is used to acquire the three-dimensional data of the scene in front of the mobile robot. The three cameras transmit the image data and the 3D data of the depth camera to the image processing host via USB 3.0. The image processing host uses calibration data to rectify the images and maps the depth data onto the left color image after registration. At the same time, the host matches the two color images using the correlation algorithm. The system then fuses the calibrated 3D data of the depth camera with the matching result of the two color images to obtain three-dimensional scene data of higher accuracy. Finally, the high-precision 3D data is used to guide the mobile robot in avoiding obstacles.
The system is proposed based on the following two observations. (1) The real-time performance and accuracy of binocular matching are in contradiction, and it is difficult for both to meet the requirements simultaneously. Binocular matching can be used in scenes where the accuracy requirements are not very strict, but it is hard to meet the requirements of high-precision mobile robots. (2) The three-dimensional data of the TOF depth camera is accurate and its real-time performance is very high, but due to structural constraints its resolution is very low, only 144 × 176, which also makes it hard to meet the requirements of mobile robots. The main difficulty of this system is to accurately register the depth data to the left color image. In order to reduce the impact of objective factors such as noise, this paper calculates the $R_{dc}$ and $T_{dc}$ parameter matrices from multiple sets of images. The test results show that the system can meet the requirements of mobile robot trajectory planning. The system uses a double-thread processing method to improve its efficiency, and, later, we will try a GPU acceleration method to further ensure real-time performance.

Figure 2: The components of the system.

Figure 4: Line alignment correction: (a) camera images before the correction; (b) camera images after the correction.

Figure 6: The images used to obtain the parameters $R_{dc}$ and $T_{dc}$ between the left color camera and the depth camera. (a) Gray images of the checkerboard from the left color camera; (b) images of the checkerboard from the depth camera; (c) depth images of the checkerboard from the depth camera.

Figure 8: The framework of our fusion system.

Figure 9: The effect of the proposed depth fusion algorithm: (a) the left color image; (b) the registered depth image; (c) disparity image of binocular stereo matching (BM algorithm); (d) depth image of interpolation; (e) the initial fusion depth image; (f) the final depth image.

Figure 10: Result of the proposed system: (a) left color image; (b) right color image; (c) disparity map; (d) low-resolution depth image from depth camera; (e) depth image registered with the left color image; (f) the final fusion depth result.

Figure 11: Result of the proposed system: (a) left color image; (b) right color image; (c) disparity map; (d) low-resolution depth image from depth camera; (e) depth image registered with the left color image; (f) the final fusion depth result.

Figure 12: Result of the proposed system: (a) left color image; (b) right color image; (c) disparity map; (d) low-resolution depth image from depth camera; (e) depth image registered with the left color image; (f) the final fusion depth result.

Figure 13: The running result of the system.

Table 1: Accuracy experiments for the system.