3D Distance Measurement from a Camera to a Mobile Vehicle, Using Monocular Vision

Estimating the distance to objects in real-world scenes is important in several applications, such as navigation of autonomous robots, simultaneous localization and mapping (SLAM), and augmented reality (AR). Although technologies exist for this purpose, they have disadvantages in some cases. For example, GPS systems are susceptible to interference, especially in places surrounded by buildings, under bridges, or indoors; alternatively, RGBD sensors can be used, but they are expensive, and their operational range is limited. Monocular vision is a suitable low-cost alternative that can be used indoors and outdoors. However, monocular odometry is challenging because the object location can only be known up to a scale factor. Moreover, when objects are moving, the location must be estimated from consecutive images, which accumulates error. This paper introduces a new method to compute the distance from a single image of the desired object, with known dimensions, captured with a calibrated monocular vision system. This method is less restrictive than other proposals in the state-of-the-art literature. For the detection of interest points, a Region-based Convolutional Neural Network combined with a corner detector was used. The proposed method was tested on a standard dataset and on images acquired by a low-cost, low-resolution webcam under noncontrolled conditions. The system was tested and compared with a calibrated stereo vision system. Results showed similar performance for both systems, but the monocular system accomplished the task in less time.


Introduction
In recent years, the increase in the processing capacity of computers has made it possible to process digital images in real time. With this technological improvement, the interest of the scientific community in developing systems for pose and location estimation based on artificial vision has also grown. One reason is that there are many applications in which this technology can be used. These applications include the navigation of autonomous robots [1,2], simultaneous localization and mapping (SLAM) in unknown places [3,4], augmented reality (AR) [5], and inspection of industrial systems [6,7]. Many of these applications require the location of objects in three-dimensional (3D) real-world coordinates.
GPS systems are normally used to obtain 3D real-world locations, but this technology is susceptible to interference, especially in places surrounded by buildings, under bridges, or indoors. Furthermore, they have large error margins, up to several decimetres. RGBD cameras can also be used; however, in addition to their high cost, they use infrared sensors to determine the distance from the camera to the object (depth). This hinders and even prevents their application in some places illuminated with natural light.
In some cases, vision systems offer advantages over the currently dominant technologies, at a low cost. A vision system can use a single camera (a monocular system) or two or more cameras (a stereo system). In stereo systems, a stereoscopic calibration is usually necessary to determine the rotation and translation of one camera with respect to the other. Generally, the position of the first camera serves as the reference coordinate system. The stereo calibration parameters must be kept fixed during the whole estimation process to achieve good results. To calculate the location of a point in three-dimensional space, each camera must capture an image containing that point; then, the corresponding points must be matched and their 2D coordinates identified in the two images; finally, triangulation is performed to obtain their 3D coordinates. However, systems based on two (or more) cameras have some practical disadvantages. The main ones are the following: the difference in the response of each camera to the colour and luminance of the input signal makes the matching of corresponding points difficult; these systems require more physical space and consume more energy; the computational cost is higher because two images must be processed on each occasion; and the cameras may lose calibration due to movements or vibrations, mainly when they are attached to a mobile vehicle. In addition, when distant points are observed by the two cameras, the system degenerates and tends to behave like a monocular system. All of the above makes monocular systems a good alternative. Moreover, monocular vision systems can be built with low-cost hardware and can be used for indoor and outdoor applications with low error rates.
To estimate the pose in monocular systems, since just one camera is used, the same camera must be moved to capture images from different positions. Because this movement is unknown, each new relative pose must be estimated. In the literature, two approaches have proven successful for monocular pose estimation: filtering methods [8][9][10][11] and keyframe-based methods [12][13][14][15]. The pioneering work of Davison et al. [8] recovered trajectories from a monocular camera by detecting natural landmarks using the Shi and Tomasi [16] operator; they made a probabilistic estimation of the state of the moving camera with an Extended Kalman Filter (EKF). In that approach, every frame is processed by the filter to jointly estimate the map feature locations and the camera pose. However, Strasdat et al. [17] showed that keyframe-based techniques are more accurate than filtering for the same computational cost. Keyframe-based methods need to keep the information of the views captured in each image. Besides, optimization methods are required to reduce the estimation errors. Usually, the optimization is computationally expensive because it involves computing the reprojection error over all accumulated views. Despite this, the error increases considerably over time.
Another important problem in monocular systems is that the location of a point in 3D space can only be known up to a certain scale factor. This scale factor must be obtained by using some bootstrap method. In the literature, several approaches have been proposed to estimate this scale factor: in reference [18], the authors use a camera as the main sensor and an inertial measurement unit (IMU) to determine the scale. In [19], the depth is estimated by using a convolutional neural network; this estimation is refined, and the error is reduced, by training the network with consecutive images. In [20], the authors used a pattern of three concentric circles of known diameter in a plane; the camera must be perpendicular to the plane of the circles to calculate the initial depth. In [21], it is assumed that the field of vision of one camera mounted on an airship is always perpendicular to the earth. Subsequently, the camera is placed at a known distance from the face of a person (centered in the image). Then, pixels of the detected face are counted; finally, with these data, a relationship is established to calculate the depth when detecting the same face in future images from different distances from the camera.
In this paper, we introduce a new technique to calculate the 3D location of a moving vehicle. Depth calculation is based on a group of three points from the vehicle, using a calibrated monocular system. Only the distance between each pair of reference points must be known for depth computation. The points can be in any position and the camera is not required to be perpendicular to the plane formed by the three points. Using this technique, it is possible to calculate the pose (rotation and translation) of the camera from each image, avoiding error accumulation. To detect the reference object, we use transfer learning with a pretrained Convolutional Neural Network. Computer simulations show a good performance of the proposed method in terms of precision and speed of processing.

Materials and Methods
Most vision techniques rely on the camera model and calibration. The camera model is a geometrical approximation of how light travels through the camera lens and forms images. Camera calibration is required to correct the main deviations of the real camera from the model used. Besides, camera calibration can be used to relate camera measurements, in pixels, to the real three-dimensional world. In this section, we review some basic principles of the pinhole camera model, calibration, and triangulation methods.

Pinhole Model.
Due to its simplicity, the pinhole model is widely used to represent the formation of images in a camera. In this model, a single ray of light from each scene point is assumed to enter the camera and to be projected onto an imaging plane (projective plane). Figure 1 illustrates the principle on which this model is based. As can be seen, each point P with coordinates (x, y, z) in 3D space is projected through the pinhole (which is taken as the origin of the coordinate system) to the point P′ with coordinates (x′, y′, f) on the plane Π inside the camera. Here, f is the focal length of the camera. From Figure 1, by the similarity of triangles, it can be established that [22]

$$\frac{x'}{x} = \frac{y'}{y} = \frac{f}{z} = \lambda, \tag{1}$$

where λ is a scale factor. If the focal length and λ are known, it is possible to calculate the coordinates of the 3D point P from the 2D coordinates of the point projected onto the image plane. Usually, the focal length, together with the other intrinsic parameters and the extrinsic parameters, can be obtained from the camera calibration process.
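The similar-triangles relation x′/x = y′/y = f/z can be illustrated with a minimal Python sketch. It assumes image coordinates measured from the principal point and a focal length expressed in the same units as x′ and y′; the function names `project` and `back_project` are illustrative and do not come from the original work.

```python
import numpy as np

def project(point_3d, f):
    """Project a 3D point (x, y, z) onto the image plane at distance f."""
    x, y, z = point_3d
    lam = f / z                      # scale factor lambda = f / z
    return np.array([lam * x, lam * y])

def back_project(point_2d, f, z):
    """Recover (x, y, z) from image coordinates when the depth z is known."""
    xp, yp = point_2d
    return np.array([xp * z / f, yp * z / f, z])

# Round trip: project a point and recover it using its known depth.
p = np.array([0.2, -0.1, 2.0])
p_img = project(p, f=0.004)
p_rec = back_project(p_img, f=0.004, z=p[2])
```

The round trip only works because the depth z is supplied; without it, a single image point fixes the 3D point only up to the scale factor.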

Camera Calibration.
The basic pinhole model does not include any kind of distortion, which usually occurs in real-world cameras. Camera calibration gives a model of the camera's geometry and of the distortions caused by the lenses. This information can be used to define the intrinsic and extrinsic parameters of the camera. Let P′ be the projection of a 3D point P onto the camera plane (as shown in Figure 1). By using homogeneous coordinates, we define P = [X Y Z 1]^T and P′ = [x y 1]^T. Then, we can express the mapping from P to P′ in terms of matrix multiplication as [23]

$$\lambda P' = A \begin{bmatrix} R & t \end{bmatrix} P,$$

where P is a point of the object in 3D, in homogeneous coordinates; P′ is the same point of the object in homogeneous 2D coordinates; [R t] is the matrix of extrinsic parameters (rotation and translation); λ is an arbitrary scale factor; and A is the matrix of intrinsic parameters. The intrinsic parameter matrix is defined by

$$A = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}.$$

Here, f_x and f_y provide information (depending on the size of the pixel) on the focal distance in the x and y directions, respectively; c_x and c_y are the coordinates of the principal point of the image; s is known as the skew and represents the angle of inclination of the pixel.
The pose of an object, relative to the camera coordinate system, can be described in terms of the rotation matrix R and the translation vector t. Rotations around the x, y, and z axes can be represented by the rotation matrices R_x, R_y, and R_z, respectively:

$$R_x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}, \quad
R_y = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}, \quad
R_z = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

Here, α, β, and θ are the rotation angles around the x, y, and z axes, respectively. Finally, the rotation matrix R can be composed by multiplying the three rotation matrices.
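As a hedged illustration of the mapping from a 3D point to pixel coordinates through the intrinsic matrix A and the extrinsic parameters [R t], the sketch below builds R from the three axis rotations and projects one point. The multiplication order R = R_z R_y R_x and the numeric values of A are assumptions made for the example; the text does not specify them.

```python
import numpy as np

def rotation_matrix(alpha, beta, theta):
    """Compose R from rotations about x (alpha), y (beta), and z (theta)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    ct, st = np.cos(theta), np.sin(theta)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[ct, -st, 0], [st, ct, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx              # assumed composition order

def project_point(A, R, t, X):
    """Apply lambda * P' = A [R t] P and return the pixel coordinates (x, y)."""
    Xh = np.append(X, 1.0)                   # homogeneous 3D point
    Rt = np.hstack([R, t.reshape(3, 1)])     # 3x4 extrinsic matrix
    p = A @ Rt @ Xh                          # equals lambda * [x, y, 1]
    return p[:2] / p[2]

# Hypothetical intrinsics: f_x = f_y = 800 pixels, principal point (320, 240), no skew.
A = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = rotation_matrix(0.0, 0.0, 0.0)
t = np.zeros(3)
pixel = project_point(A, R, t, np.array([0.1, -0.05, 2.0]))
```

Dividing by the third component removes the arbitrary scale factor λ, which in this setting equals the depth of the point in the camera frame.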

Triangulation.
The process of determining the 3D coordinates of a point in space, given its projection onto two or more images, is called triangulation. First, interest points, such as corners or salient features, must be detected and extracted from the images. Then, the corresponding points must be matched in at least two images (that is, the same points must be found in them). After that, the strongest matches with respect to an established threshold are selected (keeping the indices of the strongest matching points in the input set). Next, the matched features are triangulated as follows: suppose we have a set of corresponding points x_i ↔ x_i′ in two images, and there exist camera matrices P and P′ and a set of 3D points X_i that give rise to these image correspondences, in the sense that PX_i = x_i and P′X_i = x_i′. These equations can be combined into the form AX = b, which is a linear equation in X.

Convolutional Neural Network.
Unlike traditional machine learning methods for classification, in which features must be chosen manually and extracted with specialized algorithms, deep learning networks automatically discover relevant features from data. CNNs are composed of an input layer, an output layer, and many hidden layers in between [24]:
Convolutional Layer. Units in a convolutional layer are organized in feature maps. Within them, each unit is connected to local patches in the feature maps of the previous layer through a set of weights, called a filter bank. Each filter activates certain features from the images.
Pooling Layer. This layer is used to merge semantically similar features into one, to simplify the output, by performing nonlinear downsampling, which reduces the number of parameters that the network needs to learn. The pooling layer takes a pool size as a hyperparameter, usually 2 by 2. It then processes its input in the following way: the input is divided into a grid of 2-by-2 areas, and a representative value is taken from each group of four pixels (normally the maximum value).
Rectified Linear Unit (ReLU). The result of the convolution is then passed through a nonlinearity called ReLU, which allows faster and more effective training by mapping negative values to zero and maintaining positive values.
These three operations are repeated over tens or hundreds of layers, with each layer learning to detect different features. After feature detection, the architecture of a CNN shifts to classification. The next-to-last layer is a fully connected layer that outputs a vector of K dimensions, where K is the number of classes that the network will be able to predict. This vector contains one score for each class of the image being classified. The final layer of the CNN architecture uses a softmax function to convert these scores into class probabilities and provide the classification output.
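The three operations described above, followed by a softmax, can be sketched in a few lines of NumPy. This is only a toy, single-channel illustration with a hand-picked filter; real CNNs learn their filter banks and operate on many channels. As in most CNN libraries, the "convolution" here is implemented as a cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Map negative values to zero and keep positive values."""
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """Keep the maximum of each non-overlapping 2-by-2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]                            # drop an odd last row/column
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(scores):
    """Convert a vector of class scores into probabilities."""
    e = np.exp(scores - np.max(scores))      # shift for numerical stability
    return e / e.sum()

image = np.arange(36, dtype=float).reshape(6, 6)    # intensity grows left to right
kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])       # responds to horizontal increase
features = max_pool_2x2(relu(conv2d_valid(image, kernel)))
probs = softmax(np.array([2.0, 1.0, 0.1]))
```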

Region-Based CNN.
In contrast with classification, detection requires the accurate localization of objects (possibly many) in images. For that purpose, Region-based CNNs (R-CNNs) were proposed. In this case, the network finds regions of interest (ROIs) where the object is likely to be found. Compared to image classification, object detection is a more challenging task that requires more complex methods. The complexity arises because detection requires the accurate localization and refinement of many objects. Several algorithms have been proposed for training R-CNNs; some methods use multistage pipelines, but they are slow. Other methods use a sliding window technique to generate region proposals.
For this task, region-based methods have shown better performance than the other methods. In this case, the first step is to propose several candidate object locations; then, the proposals must be refined to achieve precise localization. Each region must be evaluated, and its membership in each object class versus the background is scored. Among region-based methods, the most popular algorithms are regions with CNN (R-CNN) [25], Fast Region-based Convolutional Network (Fast R-CNN) [26], and Faster Region-based Convolutional Network (Faster R-CNN) [27].

Transfer Learning.
The success of CNNs depends on the number of images used to train them. Usually, thousands or millions of images are required to achieve good results. Training a CNN from scratch may require weeks of computation on a high-performance computer. Besides, manually preparing images for training is very time-consuming. A better option is to use a pretrained network and fine-tune it with a few images to make it work on the desired task. This method is called transfer learning.
Transfer learning refers to the situation where what has been learned in one setting is exploited to improve generalization in another setting [28]. For example, a CNN is trained to recognize one set of visual categories, such as cars, and is then fine-tuned to learn a different set of visual categories, such as trucks. Here, a key point about image data is that the features extracted from one data set are highly reusable across other data sources. Usually, only the deeper layers are fine-tuned, and the weights of the early layers are fixed. The reason for training only the deeper layers, while keeping the early layers fixed, is that the earlier layers capture only primitive features like edges, whereas the deeper layers capture more complex features. The primitive features do not change much with the application at hand, whereas the deeper features might be sensitive to it. Some popular pretrained CNNs available for transfer learning are AlexNet [29], ResNet [30], and GoogLeNet [31].
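The idea of keeping the early layers fixed while training only the last layers can be illustrated with a toy NumPy sketch. It is not the network used in this work: the "pretrained" layer is simply a fixed random projection followed by ReLU, the data are synthetic, and only a logistic-regression head is trained by gradient descent. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained early layers: these weights are frozen (never updated).
W_frozen = rng.normal(size=(2, 16))

def features(X):
    """Frozen feature extractor: linear map followed by ReLU."""
    return np.maximum(X @ W_frozen, 0.0)

def loss_and_grads(H, y, w, b):
    """Binary cross-entropy of the trainable head and its gradients."""
    p = 1.0 / (1.0 + np.exp(-(H @ w + b)))
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad_w = H.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    return loss, grad_w, grad_b

# Synthetic "new task": two classes separated by the line x0 + x1 = 0.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

H = features(X)                  # computed once, since the extractor is frozen
w, b = np.zeros(16), 0.0         # new trainable head
W_before = W_frozen.copy()

initial_loss, _, _ = loss_and_grads(H, y, w, b)
for _ in range(200):             # only the head parameters are updated
    _, gw, gb = loss_and_grads(H, y, w, b)
    w -= 0.1 * gw
    b -= 0.1 * gb
final_loss, _, _ = loss_and_grads(H, y, w, b)
```

Because the extractor never changes, its outputs can be cached, which is one reason fine-tuning only the final layers is much cheaper than training from scratch.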

Corners Detection.
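Corners are features invariant to translation, rotation, and illumination, and robust algorithms exist to detect them. The main idea of corner detection is to search for strong derivatives in two orthogonal directions of the image I. This can be done by applying a matrix of second-order derivatives (Hessian matrix H_s) to the image intensities [32]:

$$H_s = \begin{bmatrix} \dfrac{\partial^2 I}{\partial x^2} & \dfrac{\partial^2 I}{\partial x\,\partial y} \\ \dfrac{\partial^2 I}{\partial x\,\partial y} & \dfrac{\partial^2 I}{\partial y^2} \end{bmatrix}.$$

In practice, the most used approach is to apply the autocorrelation matrix M of the second derivative image over a small window W around each point:

$$M = \begin{bmatrix} \sum_{W} I_x^2 & \sum_{W} I_x I_y \\ \sum_{W} I_x I_y & \sum_{W} I_y^2 \end{bmatrix},$$

where I_x and I_y denote the derivatives of the image in the x and y directions.

Depth Computation.
In this section, we describe the proposed depth calculation method. From Equation (1), the following relationships can be obtained:

$$x = \frac{x' z}{f}, \qquad y = \frac{y' z}{f}. \tag{8}$$

Now, suppose we have at least three points P_1, P_2, and P_3 in 3D space with coordinates (x_1, y_1, z_1), (x_2, y_2, z_2), and (x_3, y_3, z_3), respectively. Suppose also that the distances d_1, d_2, and d_3 between each pair of points are known. Using the relationships in (8), the absolute distance between each pair of points can be calculated by

$$\begin{aligned}
d_1 &= \left|\frac{x_1' z_1}{f} - \frac{x_2' z_2}{f}\right| + \left|\frac{y_1' z_1}{f} - \frac{y_2' z_2}{f}\right| + \left|z_1 - z_2\right|,\\
d_2 &= \left|\frac{x_2' z_2}{f} - \frac{x_3' z_3}{f}\right| + \left|\frac{y_2' z_2}{f} - \frac{y_3' z_3}{f}\right| + \left|z_2 - z_3\right|,\\
d_3 &= \left|\frac{x_1' z_1}{f} - \frac{x_3' z_3}{f}\right| + \left|\frac{y_1' z_1}{f} - \frac{y_3' z_3}{f}\right| + \left|z_1 - z_3\right|.
\end{aligned} \tag{9}$$

In this system of nonlinear equations (9), z_i represents the depth at which each point is located with respect to the camera. Note that each absolute value term |A − B| has two possible solutions:

$$|A - B| = \begin{cases} A - B, & A \ge B, \\ B - A, & A < B. \end{cases} \tag{10}$$

Then, by combining all solutions of the nine absolute value terms, we obtain 2^9 = 512 systems of linear equations. For example, one of these systems is

$$\begin{aligned}
\left(\frac{x_1'}{f} + \frac{y_1'}{f} + 1\right) z_1 - \left(\frac{x_2'}{f} + \frac{y_2'}{f} + 1\right) z_2 &= d_1,\\
\left(\frac{x_2'}{f} + \frac{y_2'}{f} + 1\right) z_2 - \left(\frac{x_3'}{f} + \frac{y_3'}{f} + 1\right) z_3 &= d_2,\\
\left(\frac{x_1'}{f} + \frac{y_1'}{f} + 1\right) z_1 - \left(\frac{x_3'}{f} + \frac{y_3'}{f} + 1\right) z_3 &= d_3.
\end{aligned} \tag{11}$$

To find the best solution, all 512 sets of equations must be computed. In (11), by changing the sign of each term, there are eight combinations of values inside each pair of parentheses. Besides, the content inside each pair of parentheses is repeated in two different equations. Then, to reduce computations, these repeated terms can be evaluated once and reused across the systems. Subsequently, each solution is tested in the original system of equations (9), keeping only the solutions that fulfill the restriction that all z_i are positive (the camera cannot capture points behind it). Finally, the solution with the least error is selected.
Substituting these results into Equation (8) also yields the real-world coordinates (x_i, y_i, z_i) of the three points. The relative pose (translation and rotation) of the camera with respect to the vehicle can also be determined from these coordinates.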
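The sign-enumeration procedure described above can be sketched as follows. This is an illustrative reading of the method, not the authors' implementation. It assumes that each pairwise distance is the sum of the absolute differences of the x, y, and z coordinates (which is what the nine absolute value terms and the 512 linear systems suggest), that d_1, d_2, and d_3 correspond to the pairs (P_1, P_2), (P_2, P_3), and (P_1, P_3), and that image coordinates are measured from the principal point. The function names are hypothetical.

```python
import itertools
import numpy as np

def residual(z, a, b, d, pairs):
    """Total mismatch between the given distances and those implied by depths z."""
    x, y = a * z, b * z                      # x = x' z / f, y = y' z / f
    err = 0.0
    for (i, j), dij in zip(pairs, d):
        dist = abs(x[i] - x[j]) + abs(y[i] - y[j]) + abs(z[i] - z[j])
        err += abs(dist - dij)
    return err

def solve_depths(img_pts, f, d, pairs=((0, 1), (1, 2), (0, 2))):
    """Try all 2**9 sign patterns and keep the positive-depth solution with least error."""
    img_pts = np.asarray(img_pts, dtype=float)
    a, b = img_pts[:, 0] / f, img_pts[:, 1] / f
    d = np.asarray(d, dtype=float)
    best_z, best_err = None, np.inf
    for signs in itertools.product((1.0, -1.0), repeat=9):
        A = np.zeros((3, 3))
        for row, (i, j) in enumerate(pairs):
            sx, sy, sz = signs[3 * row:3 * row + 3]
            A[row, i] = sx * a[i] + sy * b[i] + sz
            A[row, j] = -(sx * a[j] + sy * b[j] + sz)
        # Least squares also copes with singular sign patterns.
        z = np.linalg.lstsq(A, d, rcond=None)[0]
        if np.any(z <= 0):                   # points cannot lie behind the camera
            continue
        err = residual(z, a, b, d, pairs)
        if err < best_err:
            best_z, best_err = z, err
    return best_z, best_err

# Synthetic check: three points with known coordinates and their pairwise distances.
P = np.array([[1.0, 2.0, 10.0], [4.0, 1.0, 12.0], [2.0, 5.0, 9.0]])
f = 800.0
img = f * P[:, :2] / P[:, 2:3]               # ideal pinhole projections
d = [6.0, 9.0, 5.0]                          # distances for pairs (1,2), (2,3), (1,3)
z, err = solve_depths(img, f, d)
```

The sketch solves every system from scratch for clarity; the reuse of repeated parenthesized terms mentioned in the text would be a straightforward optimization.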

Results and Discussion
Experimental Configuration.
In this section, by means of computer simulations, we illustrate the performance of the proposed distance computation method. Because stereo vision has proven to give good results when computing the distance to objects, we validated the results of the proposed method against the results obtained from a calibrated stereo vision system. Both the monocular and the stereo systems were compared with respect to manually measured distances. RGB images were obtained with two low-cost webcams at a resolution of 640 × 480 pixels with three colour channels. Both cameras were fixed and calibrated for stereo operation. We use a set of 234 images containing the mobile vehicle shown in Figure 2. The vehicle dimensions are 21 × 16 × 15.5 centimeters. The blue rectangle at the back of the vehicle is used as the reference to compute the distance. The rectangle dimensions are 4.7 × 5 centimeters.
We use noncontrolled natural illumination. Calibration of the cameras was carried out with the technique proposed by Zhang [33]. With this calibration technique, it is only required that the camera observe a flat pattern from different orientations (as shown in Figure 3). The pattern or the camera can be moved freely, and it is not necessary to know the movement made. This calibration yields the intrinsic and extrinsic parameters of the camera simultaneously. Each square of the pattern measures 27 × 27 millimeters.

Object Detector.
For object detection, we tested three region-based algorithms: R-CNN [25], Fast R-CNN [26], and Faster R-CNN [27]. The best results were achieved with the Fast R-CNN algorithm. The network takes as input an entire image and a set of object proposals. Then, the whole image is processed with several convolutional and max pooling layers to produce a feature map. Then, for each object proposal, a pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected layers that end in two output layers: one that produces softmax probability estimates over all object classes plus a background class, and another that outputs refined bounding-box positions for each object class [26].
For our implementation, we selected a pretrained net: AlexNet [29]. AlexNet was the winner of the 2012 ILSVRC competition; it has been trained on over a million images and can classify images into 1000 object categories. The network has learned rich feature representations for a wide range of images. It takes an image as input and outputs the probabilities for each of the object categories. The first layer (input layer) requires input images of size 227-by-227-by-3, where 3 is the number of color channels; therefore, each input image must be resized to that size. The last three layers of the pretrained network are configured for 1000 classes. These three layers must be fine-tuned for the new classification problem. To retrain the selected net, we replaced the last three layers of the network. The newly added layers were a fully connected layer, a softmax layer, and a classification layer. The final fully connected layer was set to have the same size as the number of classes in the new data set (one class: vehicle). To learn faster in the new layers than in the transferred layers, we increased the learning rate factors of the fully connected layer.
To train the network, we used 105 images containing the desired object. Each target was manually enclosed in a box, and the coordinates of each box were used as the ground truth. 60% of the images were used for training and 40% for testing. The network was trained using a low-performance GPU (GeForce GTX 960 M) with an Intel Pentium Core i7 processor with 8 Mb of RAM. The main parameters selected for training were gradient method L2-norm, 50 epochs, minibatch size 8, momentum 0.9, and initial learning rate 1.0e-03. Figure 4 shows the confusion matrix of the results, where class one is for the vehicle and class zero is for the background. As can be seen, the system misclassifies only 2.3% of the images (one false positive). Figure 5 shows an example of the performance of the trained network. As can be seen, the net recognizes the vehicle with a confidence of 0.9997, even in low-quality images. Next, inside the ROI, the Harris detector is applied to identify the corners of the blue rectangle (back of the vehicle).
Figure 6: Comparison of distance measurement among monocular (proposed) system, stereo system, and manual reference (distance in centimeters versus image number).
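The corner detection step applied inside the ROI can be illustrated with a minimal NumPy sketch of the Harris response, computed from the windowed sums of derivative products. The window radius, the constant k = 0.04, and the use of central differences for the derivatives are assumptions made for the example; they are not parameters reported in this work.

```python
import numpy as np

def box_sum(a, r):
    """Sum of `a` over a (2r+1) x (2r+1) window around each pixel (zero padding)."""
    p = np.pad(a, r)
    out = np.zeros_like(a, dtype=float)
    n = 2 * r + 1
    for dy in range(n):
        for dx in range(n):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_response(image, r=1, k=0.04):
    """Harris corner response det(M) - k * trace(M)**2 at every pixel."""
    Iy, Ix = np.gradient(image.astype(float))   # derivatives along rows and columns
    Sxx = box_sum(Ix * Ix, r)
    Syy = box_sum(Iy * Iy, r)
    Sxy = box_sum(Ix * Iy, r)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2

# Synthetic test image: a bright square on a dark background.
image = np.zeros((20, 20))
image[5:15, 5:15] = 1.0
response = harris_response(image)
corner = np.unravel_index(np.argmax(response), response.shape)
```

The response is positive near corners, negative along straight edges, and zero in flat regions, so thresholding it and keeping local maxima yields candidate corner locations.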

Distance Calculation.
For distance calculation, the vehicle was moved remotely, and images were captured in pairs (using the stereo system) from several poses. The monocular system computes the distance as described previously. For the stereo system, the corners detected in the left and right cameras are matched, and their centroids are calculated. With these centroids, the distance is computed by means of triangulation, also described previously. Figure 6 shows a comparative graph. The percentage of mean quadratic error with respect to the manual reference was 0.27% for the stereo system and 0.28% for the monocular system. This means that the proposed method performs similarly to a stereo system.
Since computational cost is an important issue, we compared the processing time of both systems. To obtain representative results, 30 tests were performed and averaged on each system. The stereo system had an average time of 0.01945 seconds, whereas the monocular system had an average time of 0.01485 seconds. This means that the algorithm for the monocular system is faster than the algorithm for the stereo system.
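The triangulation used by the stereo baseline can be sketched with a standard linear method: the two projection equations are stacked and the resulting linear system is solved for the 3D point in the least-squares sense. The camera matrices, the baseline of 0.1 units, and the function names below are hypothetical values chosen for illustration, not the calibration of the system used here.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one point from two views (3x4 camera matrices)."""
    rows = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Move the homogeneous column to the right-hand side: A X = b.
    A, b = rows[:, :3], -rows[:, 3]
    X, *_ = np.linalg.lstsq(A, b, rcond=None)
    return X

def to_pixel(P, X):
    """Project a 3D point with camera matrix P and return pixel coordinates."""
    p = P @ np.append(X, 1.0)
    return p[:2] / p[2]

# Hypothetical stereo rig: identical intrinsics, second camera shifted along x.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 2.0])
X_est = triangulate(P_left, P_right, to_pixel(P_left, X_true), to_pixel(P_right, X_true))
distance = np.linalg.norm(X_est)             # distance from the left camera center
```

With noisy matches the four equations are no longer consistent, and the least-squares solution gives the point that best satisfies them in the algebraic sense.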

Conclusions
In this paper, a new method to calculate the distance from a camera to a moving object in three-dimensional space, using a monocular vision system, was introduced. This information can serve applications such as industrial autonomous robot navigation, SLAM, augmented reality, or control systems, using a single camera. For the calculation, it is only necessary to know three points and the pairwise distances between them. To detect objects and interest points, an R-CNN combined with a corner detector was used. Unlike other methods such as Mono-SLAM, this method avoids error accumulation since it computes the distance from each image and does not require optimization methods, such as bundle adjustment.
The experimental results showed a good performance of the distance computation algorithm on images taken with a low-cost camera under uncontrolled illumination conditions. The maximum error obtained was less than 0.9%, compared to a calibrated stereo vision system. Moreover, the system runs faster than the stereo system.
A drawback of this method is that it depends heavily on good detection and matching of interest points. Although modern Convolutional Neural Networks perform well in detecting objects, their computational cost is still high.

Data Availability
Data can be requested by e-mail to smdiaz06@hotmail.com.