An RGB-D-Based Cross-Field of View Pose Estimation System for a Free Flight Target in a Wind Tunnel

Estimating the real-time pose of a free flight aircraft in a complex wind tunnel environment is extremely difficult. Because of the highly dynamic testing environment, the complicated illumination conditions, and the unpredictable motion of the target, most general pose estimation methods fail. In this paper, we introduce a cross-field of view (FOV) real-time pose estimation system that provides high-precision pose estimation of a free flight aircraft in the wind tunnel environment. Multiview live RGB-D streams are used as input to ensure that the measurement area is fully covered. First, a multimodal initialization method is developed to measure the spatial relationship between the RGB-D camera and the aircraft. Based on all the input multimodal information, a cross-FOV model is proposed to recognize the dominating sensor and to extract the foreground region accurately and automatically. Second, we develop an RGB-D-based pose estimation method for a single target, by which the 3D sparse points and the pose of the target are obtained simultaneously in real time. Experiments on real scenes and on RGB-D image sequences simulated from 3D models both demonstrate the effectiveness of our method.


Introduction
Aircraft attitude estimation plays a crucial role in the aircraft control systems of a wind tunnel. During flight, the flight parameters must be adjusted according to the real-time attitude of the aircraft [1]. When verifying flight performance, it is also necessary to check how the aircraft performs in different attitudes, and attitude estimation is an important part of this [2]. In computer vision, aircraft attitude estimation can be regarded as an object pose estimation task. Vision systems are the most widely used measurement technique in the wind tunnel; they provide crucial data that can be compared with computational fluid dynamics (CFD) predictions to help validate design geometries.
However, it is hard to obtain satisfactory measurement precision in a low-speed wind tunnel when the target is flying freely. The main reason is that the wind tunnel is a highly dynamic testing environment, often with complex illumination conditions and unpredictable target motion. These factors severely degrade the measurement accuracy of the estimation system.
The low-speed wind tunnel is eight meters long, six meters wide, and six meters high. Thus, multiview live RGB-D streams are needed as input to ensure that the measurement area is fully covered. Furthermore, a multimodal initialization method is developed to measure the spatial relationship between the RGB-D camera and the aircraft. Based on all the input multimodal information, our cross-FOV model recognizes the dominating sensor and extracts the foreground region accurately and automatically.
The object pose estimation task has been extensively studied [3]. Traditional methods of object attitude measurement are mainly divided into template matching and feature matching [4]. Template matching methods [5] are usually applied to weakly textured scenes. Such methods need to reconstruct 3D objects and then match real scenes with the 3D models to find the best pose. The classic ICP algorithm [3] and the RANSAC algorithm solve for the current pose by minimizing the distance between corresponding points of the actual scene and the model [6]. In computer vision applications [7][8][9][10][11], the contour of an object is widely regarded as the most reliable information, because feature-based recognition methods [12][13][14][15] are likely to fail when recognizing poses of weakly textured objects.
In this paper, to overcome these problems, we propose a cross-field pose estimation framework based on local features that estimates the pose of an aircraft in real time and handles cross-field problems. By acquiring the relative positional relationship between the camera and the aircraft, we transform the relative pose of the camera into the relative pose of the aircraft. We run our pose estimation system to measure the pose of an aircraft model; according to the experimental results, the pose estimation system with the cross-FOV model obtains accurate measurement results in a wind tunnel. We will apply our system in the low-speed wind tunnel of the China Aerodynamics Research and Development Centre (CARDC).

Overview of Our Method
We propose a cross-FOV RGB-D pose estimation system that processes each new frame in real time. While maintaining high-precision pose estimation, our system reconstructs a sparse point cloud of the object in the scene and tracks the target motion continuously as the target moves across different fields of view. Figure 1 illustrates the frame-to-frame operation of our system, and Figure 2 shows the structure of our system in a real scene.

Pose Initialization.
In this section, a relative attitude measurement module is used to obtain the relative attitude between the aircraft and the initial camera. We place two tags on the X-axis of the aircraft, as shown in Figure 3. The center of the two tags coincides with the center of the aircraft. Once the tags' locations are detected, the center of the aircraft is localized. The system then transforms it into the relative attitude between the aircraft and the initial camera.

Tag Recognition.
This module detects the position of the tags shown in Figure 3. We detect the tags with a detector that follows the AprilTag method [5]. The first step applies adaptive thresholding to convert the input grayscale image into a black-and-white image. The next step segments the edges according to the black-and-white components from which they arise, to find edges that might form the boundary of a tag. Finally, the method computes an approximate partition by searching for a small number of corner points and then iterates through all possible combinations of corner points to find all fitting quads. After these steps, each tag is localized in the image coordinate system, and the center of the two tags represents the center of the aircraft.
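The last step, going from detected tag quads to the aircraft center in the image, can be sketched as follows. This is a minimal illustration, not the AprilTag implementation: it assumes each tag is already available as four corner points in order around the quad, takes the tag center as the intersection of the quad's diagonals, and interprets "the center of the two tags" as the midpoint of the two tag centers. The function names `quad_center` and `aircraft_center` are hypothetical.

```python
import numpy as np

def quad_center(corners):
    """Intersection of the diagonals of a quad given as 4 ordered corners (pixels)."""
    p0, p1, p2, p3 = np.asarray(corners, dtype=float)
    # Solve p0 + s * (p2 - p0) = p1 + t * (p3 - p1) for (s, t).
    A = np.column_stack((p2 - p0, -(p3 - p1)))
    s, _ = np.linalg.solve(A, p1 - p0)
    return p0 + s * (p2 - p0)

def aircraft_center(tag_a_corners, tag_b_corners):
    """Assumed convention: aircraft center = midpoint of the two tag centers."""
    return 0.5 * (quad_center(tag_a_corners) + quad_center(tag_b_corners))

# Two hypothetical square tags lying side by side in the image.
tag_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
tag_b = [(4, 0), (6, 0), (6, 2), (4, 2)]
center_uv = aircraft_center(tag_a, tag_b)
```

The diagonal intersection is used instead of the corner mean because it stays on the tag's true center under perspective distortion.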

Aircraft Center Localization.
We transform the coordinate $(u, v)$ of the aircraft center in the image, together with its corresponding depth value $d$, into a 3D coordinate so that the relative attitude between the aircraft and the initial camera is obtained:

$$X_c = \frac{(u - u_0)\, d}{f_x}, \qquad Y_c = \frac{(v - v_0)\, d}{f_y}, \qquad Z_c = d, \tag{1}$$

where $(f_x, f_y)$ is the focal length and $(u_0, v_0)$ is the principal point, both known from calibration, and $(X_c, Y_c, Z_c)$ is the 3D coordinate of the aircraft center.
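The transformation from a pixel with depth to a 3D point can be sketched with the standard pinhole back-projection. This is a sketch under stated assumptions: the depth value is taken to be the distance along the optical axis, lens distortion is ignored, and the function name `back_project` and the intrinsic values in the usage lines are hypothetical.

```python
import numpy as np

def back_project(u, v, d, fx, fy, u0, v0):
    """Back-project pixel (u, v) with depth d into camera coordinates.

    (fx, fy) is the focal length and (u0, v0) the principal point, in pixels.
    Assumes d is measured along the optical axis and no lens distortion.
    """
    z = float(d)
    x = (u - u0) * z / fx
    y = (v - v0) * z / fy
    return np.array([x, y, z])

# Hypothetical intrinsics for a 640x480 sensor.
fx, fy, u0, v0 = 525.0, 525.0, 320.0, 240.0
on_axis = back_project(320, 240, 2.0, fx, fy, u0, v0)
off_axis = back_project(845, 240, 2.0, fx, fy, u0, v0)
```

A pixel at the principal point maps onto the optical axis, and the lateral offset grows linearly with depth, which is why depth noise translates directly into position noise for off-axis points.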

Cross-FOV Model.
In the wind tunnel, as shown in Figure 4, cross-field of view measurement is needed. Thus, we designed a cross-FOV model. For camera $c$ in the camera set $C$ at time $t$, we have the color frame $I_t^c$ and the depth frame $D_t^c$. To choose the best input camera $c^*$, we backtrack $M$ frames to get the set $N_M$ of maximum-frame-score cameras. The frame score can be expressed by

$$S_t^c = \sum_{x=1}^{X} \sum_{y=1}^{Y} h(x, y),$$

where $h(x, y)$ is the depth score at $D_t^c(x, y)$ that fits the depth constraint $H$, and $X$ and $Y$ are, respectively, the frame width and height.
The best input camera $c^*$ is chosen from the camera set $C$ as the one with the maximum score,

$$c^* = \arg\max_{c \in C} p(c),$$

where $p(\cdot)$ is a proportion function that computes the share of camera $c$ in the camera set $N_M$.
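The selection logic described above can be sketched as a per-frame score followed by a vote over a backtracking window. This is only one plausible reading of the model, not the authors' implementation: the depth constraint is assumed to be a simple range `[d_min, d_max]`, the depth score is assumed to be 1 for a pixel inside that range and 0 otherwise, and the names `frame_score` and `CameraSelector` are hypothetical.

```python
from collections import Counter, deque

import numpy as np

def frame_score(depth, d_min, d_max):
    """Count pixels whose depth lies inside the assumed constraint [d_min, d_max]."""
    depth = np.asarray(depth, dtype=float)
    return int(np.count_nonzero((depth >= d_min) & (depth <= d_max)))

class CameraSelector:
    """Pick the camera that most often had the best frame score in recent frames."""

    def __init__(self, window, d_min, d_max):
        self.history = deque(maxlen=window)  # best-scoring camera per recent frame
        self.d_min, self.d_max = d_min, d_max

    def update(self, depth_frames):
        """depth_frames: dict mapping camera id to its current depth image."""
        scores = {cam: frame_score(d, self.d_min, self.d_max)
                  for cam, d in depth_frames.items()}
        self.history.append(max(scores, key=scores.get))
        # The camera with the largest share of the window is the input camera.
        return Counter(self.history).most_common(1)[0][0]

# Hypothetical 2x2 depth frames: "near" satisfies the constraint, "far" does not.
near, far = np.full((2, 2), 1.0), np.full((2, 2), 10.0)
selector = CameraSelector(window=3, d_min=0.5, d_max=3.0)
picks = [selector.update({"A": near, "B": far}),
         selector.update({"A": near, "B": far}),
         selector.update({"A": far, "B": near}),
         selector.update({"A": far, "B": near})]
```

Voting over a window rather than switching on every frame keeps the selected stream stable when the target sits near the boundary between two fields of view.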

Feature Extraction.
The system uses ORB, a fast binary descriptor, for feature extraction. This descriptor is rotation invariant and resistant to noise and illumination changes. It is also fast to extract and match, which makes ORB suitable for real-time pose estimation in a complex environment.
Our system handles RGB-D input. We extract ORB features on the RGB image for tracking and, for each feature with coordinates $(u, v)$ and corresponding depth value $d$, we transform it into a world coordinate system according to (1).
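Binary descriptors such as ORB are fast to match because the distance between two descriptors is a Hamming distance, that is, a count of differing bits. The sketch below shows brute-force nearest-neighbour matching on such descriptors. It assumes 256-bit descriptors stored as 32 bytes per feature and does not perform the extraction itself; the function name `hamming_match` is hypothetical, and a real system would typically add a ratio or distance threshold to reject weak matches.

```python
import numpy as np

def hamming_match(desc_a, desc_b):
    """For each descriptor in desc_a, find the nearest one in desc_b.

    desc_a, desc_b: uint8 arrays of shape (n, 32), i.e. 256-bit descriptors.
    Returns (indices into desc_b, Hamming distances).
    """
    a = np.unpackbits(np.asarray(desc_a, dtype=np.uint8), axis=1)
    b = np.unpackbits(np.asarray(desc_b, dtype=np.uint8), axis=1)
    # Pairwise count of differing bits, shape (len(a), len(b)).
    dist = (a[:, None, :] != b[None, :, :]).sum(axis=2)
    nearest = dist.argmin(axis=1)
    return nearest, dist[np.arange(len(a)), nearest]

# Hypothetical descriptors: each query differs from its match by a single bit.
train = np.array([[0] * 32, [255] * 32], dtype=np.uint8)
query = np.array([[1] + [0] * 31, [255] * 31 + [254]], dtype=np.uint8)
idx, dists = hamming_match(query, train)
```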

Bundle Adjustment.
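After the initialization operation, our system performs bundle adjustment to minimize the reprojection error between each 3D point $X^i$ and its corresponding 2D point $x^i$, which yields the camera's instantaneous pose $\{R, t\}$ relative to the previous frame:

$$\{R, t\} = \arg\min_{R, t} \sum_i \rho\left( \left\| x^i - \pi\left(R X^i + t\right) \right\|_{\Sigma}^{2} \right),$$

where $\rho$ is the robust Huber cost function and $\Sigma$ is the covariance matrix associated with the scale of the keypoint. The projection function $\pi$ is defined as follows:

$$\pi\left([X, Y, Z]^{T}\right) = \left[ f_x \frac{X}{Z} + u_0,\; f_y \frac{Y}{Z} + v_0 \right]^{T},$$

where $(f_x, f_y)$ is the focal length and $(u_0, v_0)$ is the principal point, both known from the calibration.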
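The quantity minimized by the bundle adjustment can be sketched as a robust reprojection cost. This sketch only evaluates the cost for a given pose and leaves the optimizer out. It assumes an identity covariance (so the scale weighting is omitted), applies the Huber function to the squared pixel error, and uses a hypothetical threshold `delta`; the function names are also hypothetical.

```python
import numpy as np

def project(points, fx, fy, u0, v0):
    """Pinhole projection of (n, 3) camera-frame points to (n, 2) pixels."""
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    return np.column_stack((fx * X / Z + u0, fy * Y / Z + v0))

def huber(sq_err, delta):
    """Huber cost on squared errors: quadratic near zero, linear in the tails."""
    err = np.sqrt(sq_err)
    return np.where(err <= delta, sq_err, 2.0 * delta * err - delta ** 2)

def reprojection_cost(R, t, points3d, points2d, intrinsics, delta=1.0):
    """Sum of Huber-weighted reprojection errors (identity covariance assumed)."""
    cam_points = points3d @ R.T + t  # apply the pose to every 3D point
    residuals = points2d - project(cam_points, *intrinsics)
    return float(np.sum(huber(np.sum(residuals ** 2, axis=1), delta)))

# Hypothetical data: two points seen from the identity pose.
intrinsics = (500.0, 500.0, 320.0, 240.0)
pts3d = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
exact = project(pts3d, *intrinsics)
outlier = exact + np.array([[0.0, 0.0], [10.0, 0.0]])
cost_exact = reprojection_cost(np.eye(3), np.zeros(3), pts3d, exact, intrinsics)
cost_outlier = reprojection_cost(np.eye(3), np.zeros(3), pts3d, outlier,
                                 intrinsics, delta=2.0)
```

With a plain squared error the 10-pixel outlier would contribute 100 to the cost; the Huber function reduces this to 36, which limits the pull of mismatched features on the estimated pose.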

Experiments
In the evaluation stage, we carry out a quantitative evaluation on both synthetic and real sequences with ground truth data. Our synthetic sequences, which imitate a real experimental environment, were designed specifically for this work. In the synthetic scene, we set up ambient light and multiple light sources to simulate the real complex lighting conditions. The synthetic camera uses the settings of an Asus Xtion camera, with a resolution of 640×480 pixels and a field of view of 58° H, 45° V, and 70° D.

Synthetic Experiments.
In Figure 5, the left image is a synthetic color image, the middle image is the corresponding depth image, and the right image is the output of our system. The point cloud is sparsely reconstructed from the model, and the coordinate axes at the center of the model show the current pose of the target. We set up three kinds of translate sequences.
For each synthetic scene, we compare the estimated and ground truth trajectories of the aircraft by computing the root-mean-square error (RMSE). Results on synthetic sequences are shown in Table 1 and Figure 6. We also compare with Co-fusion [3] on the translate sequences. Co-fusion [3] outputs only translation, so Table 1 gives no rotation results for it. Co-fusion [3] failed on the sequences rotate1, rotate2, rotate3, and translate&rotate.
As shown in Figures 6-8, the trajectories estimated by the proposed method fit the ground truth well in all scenes. In the rotate sequences, the proposed method also tracks the object stably. As Table 1 shows, the proposed method performs better than Co-fusion [3]. In some scenes, Co-fusion [3] may fail to track the object, which leads to a large estimation error. Our method, in contrast, achieves long-term, effective tracking as long as the first frame is provided for initialization.
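The RMSE comparison between an estimated and a ground truth trajectory can be sketched as follows. This is a minimal per-axis version that assumes the two trajectories are already time-aligned and expressed in the same coordinate frame; the function name `rmse` and the sample values are hypothetical. For rotation angles, the differences would additionally need to be wrapped to a common range before squaring.

```python
import numpy as np

def rmse(estimated, ground_truth):
    """Per-axis root-mean-square error between two (n, k) trajectories."""
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    return np.sqrt(np.mean((est - gt) ** 2, axis=0))

# Hypothetical two-sample trajectory with an error on the X axis only.
gt_traj = np.zeros((2, 3))
est_traj = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
errors = rmse(est_traj, gt_traj)
```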

Experimental Verification with Hexapod and Real Scene.
For real sequences, we ran a series of experiments on a high-precision Hexapod, as Figure 9 shows; the accuracy of the Hexapod reaches the micron level (Table 2). An experiment is set up for each axis. After aligning the cameras with the Hexapod platform, we move the platform uniformly along each axis in turn to test the accuracy of translation and rotation. Results on real sequences are shown in Figure 10.
Experiments on the high-precision Hexapod platform show that our proposed method also performs well on real sequences. In the rotation experiments, the yaw angle is detected most accurately, which indicates that the proposed method reaches its highest accuracy when the depth does not change.
We have performed a series of qualitative experiments to demonstrate the capabilities of our method. The comparison with Co-fusion [3] indicates that our method achieves higher accuracy. As mentioned earlier, our method adapts to a complex lighting environment and achieves high-precision tracking and pose estimation. In particular, it satisfies the need for cross-FOV measurement, which means that it achieves reliable pose estimation over a wide range of environments, as demonstrated in the experiments. The real scene experiment is shown in Figure 12, and the estimated trajectory of the sequence is shown in Figure 11; the trajectory is accurate.

Conclusions
We introduced a cross-field of view (FOV) real-time pose estimation system that provides high-precision pose estimation of a free flight aircraft in a wind tunnel environment. Multiview live RGB-D streams are used as input to ensure that the measurement area is fully covered.

Figure 1: Overview of our method, showing the data flow starting from the RGB-D streams. RGB streams are used to calibrate the initial pose of the aircraft. Our cross-FOV model processes multiple RGB-D streams and outputs the target stream. By tracking the target, our method obtains the pose and point cloud of the target.

Figure 2: The structure diagram of our system.

Figure 5: The synthetic RGB-D input and the estimated point cloud result of our system.

Figure 6: Comparison between the ground truth, our estimated trajectory, and Co-fusion's estimated trajectory on the translate sequence.

Figure 7: Comparison between the ground truth and our estimated pose on rotate sequences.

Figure 8: Comparison between the ground truth and our result on translate&rotate sequences.


Table 1: RMSE of estimated translation and rotation for synthetic sequences.
X, Y, Z: 5 µrad