RGBD Scene Flow Estimation with Global Nonrigid and Local Rigid Assumption

RGBD scene flow has attracted increasing attention in computer vision with the popularity of depth sensors. To estimate the 3D motion of objects accurately, an RGBD scene flow estimation method with a global nonrigid and local rigid motion assumption is proposed in this paper. Firstly, preprocessing is performed, which includes colour-depth registration and depth image inpainting, to handle holes and noise in the depth image; secondly, the depth image is segmented to obtain different motion regions with different depth values; thirdly, scene flow is estimated based on the global nonrigid and local rigid assumption and the spatial-temporal correlation of RGBD information. Under the global nonrigid and local rigid assumption, each segmented region is divided into several blocks, and each block has a rigid motion. With this assumption, interference between the motions of different parts of the same segmented region is avoided, especially for nonrigid objects such as a human body. Experiments are conducted on an RGBD tracking dataset and a deformable 3D reconstruction dataset. The visual comparison shows that the proposed method better distinguishes the moving parts from the static parts in the same region, and the quantitative comparison shows that more accurate scene flow can be obtained.


Introduction
Vedula et al. [1] first proposed scene flow, which describes the 3D motion field formed by motion in a 3D scene. Scene flow is a fundamental input to high-level tasks such as scene understanding and analysis. With the development and application of computer vision and artificial intelligence, the related technologies have been used in object detection and segmentation [2,3], depth interpolation, and 3D reconstruction in many dynamic scenes, such as autonomous driving [4,5], high-speed video generation [6], and 3D reconstruction [7].
Some research efforts have been dedicated to the estimation of scene flow in different settings: monocular vision [8], stereo vision [1,2,9,10], and RGBD [11-13]. Affordable RGBD cameras capture colour and depth information simultaneously, so we focus on RGBD scene flow estimation. Among the existing methods, segmentation-based methods are attractive because they deal better with large displacements and occlusion. These methods exploit the correlation of motion within a local area, for example through a local rigidity assumption, which can improve the accuracy of scene flow estimation. Under the local rigidity assumption, all pixels in a segmented region are assumed to share one rigid motion.
However, if the segmented region is a nonrigid object, pixels with different degrees of motion affect each other and degrade the overall scene flow estimate. In this paper, a local rigid and global nonrigid assumption on segmented regions is introduced into RGBD scene flow estimation. Under this assumption, motion within a local area of a segmented object is correlated, while the motion of the whole segmented object is nonrigid. With this assumption, interference between the motions of different parts of the same segmented region is avoided.

Related Work
According to the solving process, existing approaches can be roughly divided into two categories: variational approaches [1,8,12,13], which construct the objective function on scene flow directly, and segmentation-based methods with the assumption of local rigid motion [14-16]. The variational approaches commonly estimate dense scene flow under constraints from spatial-temporal visual information. An objective function is constructed to estimate the dense scene flow [1,8,17]. Xiao et al. [8] constructed an objective function on scene flow in a monocular camera environment, which includes a brightness constancy assumption, a gradient constancy assumption, a short-time object velocity constancy assumption, etc. Jaimez et al. [17] considered the depth information from the RGBD camera and presented a dense real-time scene flow algorithm with brightness constancy and geometric consistency.
The methods based on segmentation estimate the rigid motion of each segmented region, and then the local rigid motion and nonrigid motion are mixed to get dense scene flow [16,18-20]. In [20], an efficient RGBD PatchMatch was used to solve large displacement motion patterns and stage, and a further occlusion model and spatial smoothness regularization were used to compute the RGBD scene flow field. Golyanik et al. [18] presented a multiframe scene flow approach that assumes scene transformations to be locally rigid in RGBD image sequences. Xiang et al. [19] used a 3D local rigidity assumption to estimate the dense scene flow in a variational framework. Schuster et al. [21] interpolated the sparse matches between stereoscopic image pairs to estimate scene flow, in which the initial sparse matching is in effect a local rigid assumption.
Sun et al. [16] proposed a layered RGBD scene flow method, in which the depth ordering from RGBD is used to segment the scene and to resolve occlusions. The layered RGBD scene flow method is promising because spatial smoothness is separated from the modelling of discontinuities and occlusions, so occlusion boundaries can be modelled from the relative depth order. The depth image is layered based on the depth information. To estimate the motion of the scene, it is assumed that pixels belonging to the same layer have the same rigid motion. Estimating scene flow directly is a high-dimensional problem, so the solution space is large and the computational complexity is high. Methods with the assumption of local rigid motion reduce the solution space. However, in most methods that assume local rigid motion, the local region is a semantic unit, such as a superpixel or a specific object. The assumption therefore cannot be applied well to nonrigid objects, because the internal motion of a nonrigid object is not consistent. In this paper, we propose an assumption of global nonrigid and local rigid motion based on the study of Sun et al. [16], which can accurately estimate the motion of each segmented region by dividing each segmented region into different blocks. Besides, affordable RGBD cameras provide both colour and depth information simultaneously. Therefore, we focus on approaches that use colour and depth information.

Methodology
The framework for estimating RGBD scene flow is shown in Figure 1. It consists of two steps: preprocessing and scene flow estimation. The preprocessing performs basic processing on the input RGBD image sequence (the red box in Figure 1), involving colour-depth registration and depth image inpainting, and thus provides the material for estimating scene flow efficiently. Details are given in Section 3.1. The scene flow estimation step computes the scene flow. It includes two parts: depth image segmentation, and scene flow estimation using the preprocessing result and spatiotemporal constraints from the RGBD image sequence (Section 3.2).

Preprocessing.
In the preprocessing, colour-depth registration and depth image inpainting are performed. The colour-depth registration associates the depth image with the RGB image. The depth image inpainting repairs holes and noise, which are caused by occlusions, lack of point correspondences, sensor imperfection, etc. [22].

Colour-Depth Registration.
To register the RGB image and the depth image, a projective matrix M is calculated as shown in equation (1), where (x, y) is a pixel coordinate in the depth image and (X, Y) is the corresponding coordinate in the RGB image:
Furthermore, equation (1) can be rewritten as follows:
M contains eight unknown parameters, so at least four pairs of corresponding points in the depth image and the RGB image are needed to solve it. In this paper, corner points are used.
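The eight-parameter solve described above can be sketched in Python as follows. This is only a minimal illustration, assuming that M is a standard 3 × 3 planar projective matrix normalised so that its bottom-right entry is 1; the function names and the use of NumPy are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def estimate_projective_matrix(depth_pts, rgb_pts):
    """Solve the 8 unknowns of a 3x3 projective matrix M (with M[2, 2] = 1)
    from exactly four correspondences (x, y) -> (X, Y)."""
    depth_pts = np.asarray(depth_pts, dtype=float)
    rgb_pts = np.asarray(rgb_pts, dtype=float)
    if depth_pts.shape != (4, 2) or rgb_pts.shape != (4, 2):
        raise ValueError("exactly four point pairs are expected")
    A = np.zeros((8, 8))
    b = np.zeros(8)
    for i, ((x, y), (X, Y)) in enumerate(zip(depth_pts, rgb_pts)):
        # X * (m31*x + m32*y + 1) = m11*x + m12*y + m13
        A[2 * i] = [x, y, 1, 0, 0, 0, -X * x, -X * y]
        b[2 * i] = X
        # Y * (m31*x + m32*y + 1) = m21*x + m22*y + m23
        A[2 * i + 1] = [0, 0, 0, x, y, 1, -Y * x, -Y * y]
        b[2 * i + 1] = Y
    m = np.linalg.solve(A, b)
    return np.append(m, 1.0).reshape(3, 3)

def map_point(M, x, y):
    """Map a depth-image pixel (x, y) to RGB-image coordinates (X, Y)."""
    X, Y, W = M @ np.array([x, y, 1.0])
    return X / W, Y / W
```

Each correspondence contributes two linear equations, which is why four pairs are the minimum; with more than four corners, a least-squares solve would be the natural extension.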

Inpainting.
To handle the holes and noise in the depth image, an inpainting algorithm guided by RGB image information is used [23]. In this algorithm, holes and small noise are both regarded as noise; holes are larger connected areas whose depth value is 0, while small noise occupies smaller connected areas. In this paper, holes are inpainted based on depth domain similarity and colour consistency between the aligned depth image and RGB image, and small noise is removed with a local bilateral filter.

Scene Flow Estimation.
To estimate scene flow accurately, the depth image is first roughly segmented into different regions, since local motion is more strongly correlated within the same region. The K-means clustering algorithm is applied to the inpainted depth image to segment and label it, so that the scene can be segmented quickly and simply from the depth information. The value of K depends on the number of moving regions in the scene.
To calculate the RGBD scene flow, an assumption of global nonrigid and local rigid motion is proposed in this paper to describe the behaviour of the scene. Within a segmented region, the motion of the pixels in a small local area is highly consistent, so the local motion of a segmented region is assumed to be rigid.
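The depth-based K-means segmentation can be illustrated with the short sketch below. It assumes that clustering is done on the raw depth value of each pixel only, and it uses a deterministic initialisation spread over the depth range; both choices, as well as the function name, are assumptions for illustration rather than details given in the paper.

```python
import numpy as np

def kmeans_depth_segmentation(depth, k, n_iter=50):
    """Cluster the pixels of a depth image into k depth layers.
    Returns (label image, cluster centres)."""
    values = np.asarray(depth, dtype=float).ravel()
    # Assumed initialisation: centres evenly spaced over the depth range.
    centres = np.linspace(values.min(), values.max(), k)

    def assign(c):
        # Index of the nearest centre for every pixel.
        return np.argmin(np.abs(values[:, None] - c[None, :]), axis=1)

    for _ in range(n_iter):
        labels = assign(centres)
        new_centres = centres.copy()
        for j in range(k):
            members = values[labels == j]
            if members.size:  # keep the old centre if a cluster is empty
                new_centres[j] = members.mean()
        converged = np.allclose(new_centres, centres)
        centres = new_centres
        if converged:
            break
    labels = assign(centres)
    return labels.reshape(np.shape(depth)), centres
```

With this initialisation, label 0 corresponds to the nearest depth layer and label k - 1 to the farthest one.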

Global Nonrigid and Local Rigid Assumption.
Each segmented region is divided into a number of sufficiently small blocks of size 3 × 3 (Figure 2). Under the global nonrigid and local rigid assumption, the pixels in each block share a common 3D rigid motion R, which includes the rotation and the translation relative to the camera coordinate system (local rigid assumption), while different blocks have different motions (global nonrigid assumption).
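One simple way to picture the block division is to give every pixel of a segmented region the index of the 3 × 3 tile it falls in, so that one rigid motion can later be attached to each index. The sketch below assumes that the tiles are aligned to the image grid, which the paper does not state; the function name is hypothetical.

```python
import numpy as np

def block_ids(region_mask, block=3):
    """Return an integer image holding, for each pixel inside the region,
    the index of its block x block tile, and -1 outside the region."""
    region_mask = np.asarray(region_mask, dtype=bool)
    h, w = region_mask.shape
    rows, cols = np.indices((h, w))
    n_block_cols = -(-w // block)  # ceiling division
    ids = (rows // block) * n_block_cols + (cols // block)
    return np.where(region_mask, ids, -1)
```

Pixels with the same non-negative index would share one rigid motion, while different indices inside the same region are free to move differently.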
Let p_1 = (x_1, y_1) be a 2D point in frame t and p_2 = (x_2, y_2) its corresponding 2D point in frame t + 1. The depth values of p_1 and p_2 are z_1 and z_2, which are taken from the depth images. According to the camera imaging principle and the 2D-3D transformation model in [16], the corresponding 3D points P_1 = (X_1, Y_1, Z_1) and P_2 = (X_2, Y_2, Z_2) of p_1 and p_2 are as follows:
where (f_x, f_y)^T and (c_x, c_y)^T represent the camera focal length and principal point, respectively. The rigid motion R from P_1 to P_2 can be expressed as follows:
In equation (5), the image coordinate p_2 corresponding to the spatial point P_2 is given by
The corresponding local rigid RGBD scene flow from p_1 to p_2 is as follows:
where u, v, and w are the horizontal motion, vertical motion, and depth change of p_1. Furthermore, a spatial constraint term for scene flow is presented as follows:
where
where u_tk, v_tk, and w_tk are the scene flow in the x, y, and z directions for the segmented region k at frame t, and N_p is the set of the 4 nearest spatial neighbours of the pixel p. E_spa^u, E_spa^v, and E_spa^w reflect the motion correlation in the different directions within the same segmented region.
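The chain from a pixel to its local rigid scene flow can be sketched as below. The sketch assumes the standard pinhole camera model, with (c_x, c_y) treated as the principal point, and it splits the rigid motion R into a 3 × 3 rotation and a translation vector; these conventions and all function names are assumptions for illustration and may differ from the exact equations of the paper.

```python
import numpy as np

def back_project(x, y, z, fx, fy, cx, cy):
    """Pixel (x, y) with depth z -> 3D point in the camera frame."""
    return np.array([(x - cx) * z / fx, (y - cy) * z / fy, z])

def project(P, fx, fy, cx, cy):
    """3D point in the camera frame -> pixel coordinates."""
    X, Y, Z = P
    return fx * X / Z + cx, fy * Y / Z + cy

def rigid_scene_flow(x1, y1, z1, rotation, translation, fx, fy, cx, cy):
    """Scene flow (u, v, w) of pixel p1 induced by one rigid block motion."""
    P1 = back_project(x1, y1, z1, fx, fy, cx, cy)
    P2 = rotation @ P1 + translation      # assumed form of the rigid motion
    x2, y2 = project(P2, fx, fy, cx, cy)
    return x2 - x1, y2 - y1, P2[2] - z1   # image motion and depth change
```

For example, a pure translation of a block towards the camera changes w directly and changes u and v only through the perspective division.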

Spatiotemporal Correlation.
Referring to the objective function in [16,24,25], the spatiotemporal correlation of the RGBD image sequence is also considered besides the global nonrigid and local rigid assumption. The spatial-temporal correlation of the RGBD image sequence contains two terms: the consistency of RGBD data and the coherence of the segmented regions.
(1) The Consistency of RGBD Data. If p is visible in frame t and p + (u_tk(p), v_tk(p)) is also visible in frame t + 1 in the depth image and the aligned RGB image, the point has a constant appearance under the motion (u_tk(p), v_tk(p), w_tk(p)). The RGBD data consistency term can be represented as follows:
(2) The Coherence of the Segmented Region. If p in frame t belongs to the segmented region k, then p + (u_tk(p), v_tk(p)) in frame t + 1 also belongs to the segmented region k. The segmented-region coherence term can be represented as follows:
where g_tk is a support function, which represents the probability that a pixel belongs to the segmented region k in frame t. According to equations (8), (10), and (11), a total objective function is constructed as follows:
where λ_data, λ_spa, and λ_sup are the weights of E_data, E_spa, and E_sup, respectively. The coordinate descent method is used to minimize the RGBD scene flow energy function in equation (12). Firstly, the initial scene flow is estimated from the interframe optical flow and the segmentation of the depth image. Secondly, the scene flow is optimized by image warping while the layering result is kept fixed. Thirdly, the layered support function is optimized with the coordinate descent method while the scene flow is kept fixed. Finally, the final scene flow is obtained by repeating the second and third steps.
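The alternating structure of this optimization can be sketched generically as follows. The sketch assumes that the total energy is the weighted sum of the three terms and that the loop stops when the energy no longer decreases; the callables update_flow, update_support, and energy are hypothetical placeholders for the paper's actual update steps, which are not reproduced here.

```python
def total_energy(e_data, e_spa, e_sup, lam_data, lam_spa, lam_sup):
    """Weighted sum of the data, spatial, and support terms (assumed form)."""
    return lam_data * e_data + lam_spa * e_spa + lam_sup * e_sup

def alternate_minimisation(flow, support, update_flow, update_support,
                           energy, max_iters=10, tol=1e-6):
    """Alternate between the two blocks of variables until the energy
    stops decreasing by more than tol."""
    prev = energy(flow, support)
    for _ in range(max_iters):
        flow = update_flow(flow, support)        # step 2: support fixed
        support = update_support(flow, support)  # step 3: flow fixed
        cur = energy(flow, support)
        if prev - cur < tol:
            break
        prev = cur
    return flow, support
```

Because each step minimizes the same energy over one block of variables while the other is fixed, the energy cannot increase from one iteration to the next, which is what makes the stopping test meaningful.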

Experiments
In this section, the performance of the proposed method is first evaluated by analysing the results obtained without the assumption of global nonrigid and local rigid motion. Then, the method is tested on the Princeton Tracking Benchmark and the Deformable 3D Reconstruction dataset, and qualitative and quantitative comparisons are presented.

Performance Evaluation on the Term on Spatial Constraints of Scene Flow.
The spatial constraint term of scene flow reflects the relationship between the scene flow of a pixel and that of its neighbourhood. To evaluate its contribution, some experiments are run without this term.
In Figures 3 and 4, the scene flow is estimated without the spatial constraints of scene flow. For clarity, the scene flow figures are not shrunk too much. It is obvious that the scene flow loses smoothness within the same segmented region.
That means the scene flow of pixels in the same region is discontinuous, since the correlation between the scene flow of pixels in the same region is not considered.

Princeton Tracking Benchmark.
This dataset contains multiple independently moving targets and large areas of occlusion [28]. In this section, the "Bear_back" sequence is used to test the proposed method, and the results are shown in Figure 3. In the "Bear_back" sequence, the motion of the scene is produced mainly by the opposite movement of two hands, in addition to some slight motion of the body. In Figure 5, the first two columns are two consecutive images from the Bear_back sequence, including RGB and depth images. The third column is the segmentation result, with K set to 5 in the K-means clustering algorithm. The first and second rows of the fourth column are the results of Sun's method [16] and our method, respectively. Comparing the scene flow in the red box between Sun's method and ours shows that, under the same segmentation, the result of the proposed method is closer to the moving region in the real image.

Deformable 3D Reconstruction Dataset [26].
The deformable 3D reconstruction dataset is a nonrigid dataset. In this paper, the "Hat" and "Alex" sequences are used to test the proposed method, and different poses from different times are selected in each of these two sequences to validate that the proposed method is invariant to pose variation.
In the "Hat" sequence, the motion is caused by taking off the hat, and two poses are used, which are called Pose 1 and Pose 2. Pose 1 has a small amplitude, involving slight motion of the hat and arm and a slight twist (Figure 6). Pose 2 mainly includes the motion of the hat, and the direction of the scene flow is basically uniform (Figure 7). In Figures 6 and 7, the first two columns are the consecutive RGB and depth images, the third column is the segmentation result with K = 2 in the K-means clustering algorithm, and the fifth column is the scene flow of Sun's method and ours. Occlusion calculation is an important part of Sun's method; therefore, the occlusions are also presented in this section.
In the fifth column of Figures 6 and 7, the scene flow estimated with Sun's method covers the whole human body, including some stationary parts. The reason may be that pixels in the same segmented region share a common rigid motion, so that scene flow is also assigned to pixels that do not move. In contrast, our method estimates the scene flow of the moving parts only, such as the arm, head, and hat, because each segmented region is divided into different blocks and the scene flow is estimated per 3 × 3 block in each segmented region.
In the "Alex" sequence, Pose 3 and Pose 4 are used. Pose 3 is produced by waving arms and some movement of clothes (Figure 8), and Pose 4 is produced by the motion of the arms (Figure 9). In the segmentation of the "Alex" sequence, K is also set to 2. In Pose 3 and Pose 4, the motion amplitude of the arms is greater than that of the rest of the human body. Sun's method treats the motion amplitude of the whole body as the same, although the motion of the arms is significantly greater than that of the rest of the body.
Comparing the scene flow estimation results visually (Figures 6-9) shows that our method can accurately estimate the scene flow of nonrigid objects that contain parts with different motions.

Evaluation Results.
Two quantitative metrics, the root mean square (RMS) error and the average angular error (AAE), are used to compare the proposed method with Sun's method.
To compute RMS and AAE, the 3D scene flow produced by the algorithm is mapped to a 2D optical flow, which is compared with the ground-truth optical flow over all the pixels in the image. The smaller the difference is, the more accurate the estimate is. Let the estimated optical flow be (u, v)^T and the true optical flow be (u_GT, v_GT)^T; then RMS and AAE are calculated as follows:
where N is the number of pixels in the image. The errors of the proposed method and of Sun's method are shown in Figures 10(a) and 10(b), where the blue bars represent the errors of our method and the orange bars the errors of Sun's method. From Figure 10, it is obvious that the blue bars are shorter than the orange bars; that is, the RMS and AAE of the proposed method are lower than those of Sun's method on the test datasets.
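Both metrics can be sketched in a few lines. The sketch assumes the definitions commonly used in the optical flow literature: RMS as the square root of the mean squared endpoint difference, and AAE as the mean angle between the space-time vectors (u, v, 1) and (u_GT, v_GT, 1), reported in degrees. The exact formulas used in the paper may differ, and the function names are illustrative.

```python
import numpy as np

def rms_error(u, v, u_gt, v_gt):
    """Root mean square endpoint error over all pixels."""
    u, v, u_gt, v_gt = (np.asarray(a, dtype=float) for a in (u, v, u_gt, v_gt))
    return float(np.sqrt(np.mean((u - u_gt) ** 2 + (v - v_gt) ** 2)))

def average_angular_error(u, v, u_gt, v_gt):
    """Mean angle in degrees between (u, v, 1) and (u_gt, v_gt, 1)."""
    u, v, u_gt, v_gt = (np.asarray(a, dtype=float) for a in (u, v, u_gt, v_gt))
    num = u * u_gt + v * v_gt + 1.0
    den = np.sqrt(u ** 2 + v ** 2 + 1.0) * np.sqrt(u_gt ** 2 + v_gt ** 2 + 1.0)
    # Clip to guard against rounding slightly outside [-1, 1].
    angles = np.arccos(np.clip(num / den, -1.0, 1.0))
    return float(np.degrees(angles).mean())
```

The constant 1 appended to each flow vector keeps the angle well defined for pixels whose flow is zero.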

Conclusions
In this paper, an RGBD scene flow estimation method with a global nonrigid and local rigid motion assumption is presented. The method consists of preprocessing and scene flow estimation. The preprocessing produces the registered RGB image and depth image, which provide the material for estimating scene flow. In the scene flow estimation, the K-means clustering algorithm is used to segment the depth image and process the occlusions, and then scene flow is estimated with the spatial-temporal correlation of the RGBD image sequence and the global nonrigid and local rigid assumption in each segmented region. To represent the global nonrigid and local rigid assumption, each segmented region is divided into a number of sufficiently small blocks, since the motion of pixels in the same block is consistent while the motion of pixels in different blocks is inconsistent. Experiments on different datasets and different poses show that the scene flow can be estimated more accurately with the proposed method. However, the running time of the code is longer than that of [16] because each segmented region is divided into different blocks. In future work, we will optimize the model. Since trained deep neural networks can predict scene flow rapidly, we will also draw on existing methods to study learning-based approaches.

Data Availability
The data used to support the findings of this study are available at http://tracking.cs.princeton.edu/dataset.html and http://campar.in.tum.de/personal/slavcheva/deformabledataset/index.html.