Inaccuracy-Tolerant Sparse-to-Dense Depth Propagation for Semiautomatic 2D-to-3D Conversion

Current semiautomatic 2D-to-3D conversion methods assume that user input is perfectly accurate. However, it is difficult to obtain 100% accurate user scribbles, and even small errors in the input degrade the conversion quality. This paper addresses the issue with a scribble confidence that considers color differences between labeled pixels and their neighbors. First, for each labeled pixel, the numbers of neighbors with similar and with different color values are counted. The ratio between these two numbers at each labeled pixel is regarded as its scribble confidence. Second, the sparse-to-dense depth conversion is formulated as a confident optimization problem by introducing a confidence-weighted data cost term and local and k-nearest depth-consistent regularization terms. Finally, the dense depth-map is obtained by solving a sparse linear system. The proposed approach is compared with existing methods on several representative images. The experimental results demonstrate that the proposed method can tolerate some errors in user input and can reduce depth-map artifacts caused by inaccurate user input.


Introduction
3D videos have attracted more and more attention, providing an immersive visual experience by exploiting depth information [1]. With rapid advances in 3D display technologies, the shortage of 3D content has become one of the bottlenecks restricting the development of the entire 3D industry [2]. To remedy this issue, many 2D-to-3D conversion methods have been developed to convert existing 2D images/videos into 3D format by creating depth-maps [3]. Semiautomatic 2D-to-3D conversion can produce high-quality depth-maps from sparse user scribbles by using sparse-to-dense depth propagation. However, current methods assume that user scribbles are perfectly accurate [4][5][6], and depth quality degrades significantly when inaccurate scribbles are present. Figure 1 shows an experimental result where the scribbles are partly inaccurate. As shown in Figures 1(d) and 1(e), existing methods generate visual artifacts around inaccurately labeled regions. To handle inaccurate input, a confident sparse-to-dense propagation algorithm is introduced in this paper that obtains accurate depth-maps even from erroneous user scribbles, as in Figure 1(f).
The proposed method is based on the observation that inaccurate input often occurs at or near object boundaries and that the number of correct scribbles is much larger than the number of incorrect ones. The rest of this paper is organized as follows. In Section 2, related work on sparse-to-dense depth propagation for 2D-to-3D conversion is reviewed. The proposed method is described in Section 3. Experimental results are provided in Section 4. Finally, conclusions and future work are given in Section 5.

Related Work
2D-to-3D conversion algorithms can be categorized into manual, automatic, and semiautomatic methods. Manual methods offer the highest-quality conversion results but require precise per-pixel depth assignment, which is time consuming and costly. Automatic methods infer depth information in images/videos by exploiting different depth perception cues such as motion, occlusion, vanishing points, and defocus. Recently, with the popularity of deep learning, many neural networks have been proposed for automatic depth estimation [7][8][9]. However, existing automatic methods generally provide a limited 3D effect due to ambiguities between depth and perception cues [2]. Semiautomatic methods are the most widely used schemes for 3D content creation, since they balance conversion quality and production cost. The core step of semiautomatic methods is sparse-to-dense depth propagation on key frames, in which dense depth-maps are estimated from user-assigned sparse depth values. The conversion quality largely depends on the accuracy of the depth-maps at key frames. Thus, this paper mainly focuses on sparse-to-dense depth propagation for semiautomatic 2D-to-3D conversion.
Phan and Androutsos [10] combine random walks (RW) with graph cuts (GC) for sparse-to-dense depth estimation, but incorrectly segmented object boundaries provided by GC may degrade depth quality. Rzeszutek and Androutsos [11] use the domain transform filter to propagate sparse labels throughout an image, but it may smooth out depth edges. Iizuka et al. [12] utilize superpixel-based geodesic-distance-weighted interpolation and optimization-based edge-preserving smoothing to compute dense depth from user scribbles. Similarly, Wu et al. [13] apply a superpixel-based optimization method to obtain dense depth-maps from sparse input. However, these superpixel-based methods are affected by the performance of the superpixel segmentation. Yuan et al. [14] propose a nonlocal RW algorithm to produce a dense depth-map from user scribbles on a single 2D image. Liang and Shen [15] further extend this scheme with the ability to process videos. However, RW-based methods cannot modify user-assigned labels, and erroneous input will seriously degrade depth accuracy. Lopez et al. [16] incorporate perspective and equality/inequality constraints into an optimization framework for dense depth estimation, but this may add an additional burden to user operations. Vosters and Haan [17] propose a line-scanning-based sparse-to-dense propagation method with low computation cost, but accuracy may be lost. Revaud et al. [18] use an edge-aware geodesic distance for sparse-to-dense optical flow interpolation, but the result is vulnerable to inaccurate input. All of the above methods, however, do not account for the possibility of inaccurate scribbles, and thus they give reliable results only for accurate input. To address this issue, the confidence of scribbles is calculated here based on local color variation. There have been some recent works on error-tolerant interactive image segmentation [19][20][21]. However, these methods are not well suited to 2D-to-3D conversion, since they mainly focus on foreground/background separation.

Method
As shown in Figure 2, the proposed method works as follows. First, the user draws sparse scribbles on the 2D image/key frames, where brighter red marks indicate regions closer to the camera. Second, depth values at labeled pixels are extracted according to the intensities of the scribbles. Third, the confidence of the scribbles is calculated based on the color variation in the labeled regions. Fourth, an energy function is built in which the scribble confidence is incorporated into the data cost. Finally, the energy function is minimized by solving a sparse linear system to obtain the dense depth-map.

3.1. Scribble Confidence.
It can be observed that pixels in accurately labeled regions often have similar color values, while erroneous input mainly appears at object boundaries with strong color variations. Based on this observation, the scribble confidence is calculated using the following formula:

s_i = ‖[δ(‖I_i − I_j‖_2)]_{j∈N(i)}‖_0 / (‖[1 − δ(‖I_i − I_j‖_2)]_{j∈N(i)}‖_0 + ε), i ∈ Ω, (1)

where I_i denotes the Lab color values at pixel i. The reason for using the Lab color space is that it takes human perception into account [22]. N(i) is the set of 8 neighbors of pixel i, δ(•) is the Dirac delta function, ‖•‖_0 and ‖•‖_2 are the L0 norm and L2 norm, respectively, ε is a small positive constant set to 10^−5 to prevent division by zero, and Ω is the set of labeled pixels. It can be seen from formula (1) that the confidence of a labeled pixel is lower when the color difference between it and its neighboring pixels is larger. Since inaccurate input is mainly located at or near object boundaries, around which the color changes significantly, the proposed method can penalize inaccurate scribbles. One may question whether the confidence of correct scribbles is always high. The confidence of labeled pixels in textured regions is indeed low, and correct labels in these regions will be mistaken for incorrect ones. However, the number of correct scribbles is much larger than the number of incorrect ones. Therefore, the impact on the accurate scribbles can be tolerated. Figure 3 gives an example of how scribble confidence works.
The confidence of inaccurate scribbles that cross object boundaries is low, as shown in Figure 3(d). The current optimization method [6] generates visual artifacts around inaccurately labeled regions, as can be seen in the regions within the red circles in Figure 3(e). These artifacts are removed when the scribble confidence is incorporated into the optimization method, as shown in Figure 3(f).
As shown by the blue square in Figure 3(a), when user scribbles are inside objects, the color variation between a labeled pixel and its neighbors is small; in this case, the scribble confidence of the pixel at the center of the blue square is 0.6. When user scribbles approach object boundaries, the color variation between a labeled pixel and its neighbors becomes larger, as shown by the yellow and pink squares in Figure 3(a). The scribble confidence of the pixel at the center of the yellow square is 0.3, while that of the center pixel of the pink square is 0.0. Since erroneous scribbles mainly appear at object boundaries, the proposed method can suppress erroneous input by using the color differences between labeled pixels and their neighbors.
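The neighbor-counting idea above can be sketched in code. This is a minimal, unoptimized sketch rather than the paper's exact procedure: the strict equality test implied by the Dirac delta in formula (1) is relaxed here to a color-difference threshold `tau`, which, like the function name, is an assumption of this sketch.

```python
import numpy as np

def scribble_confidence(lab, labeled_mask, tau=10.0, eps=1e-5):
    """Confidence of each labeled pixel as the ratio of similar-color to
    different-color neighbors in its 8-connected neighborhood.

    lab          : (H, W, 3) float array of Lab color values
    labeled_mask : (H, W) bool array, True where the user drew a scribble
    tau          : color-difference threshold separating "similar" from
                   "different" neighbors (an assumption of this sketch)
    eps          : small positive constant to prevent division by zero
    """
    h, w, _ = lab.shape
    conf = np.zeros((h, w))
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for i in range(h):
        for j in range(w):
            if not labeled_mask[i, j]:
                continue
            similar = different = 0
            for di, dj in offsets:
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    # L2 color distance to the neighbor
                    d = np.linalg.norm(lab[i, j] - lab[ni, nj])
                    if d < tau:
                        similar += 1
                    else:
                        different += 1
            # Ratio of similar-color to different-color neighbors
            conf[i, j] = similar / (different + eps)
    return conf
```

A labeled pixel surrounded by pixels of a different color (e.g., a scribble that strays onto the wrong side of an object boundary) gets confidence 0, while a pixel inside a uniform region gets a very large ratio; in practice the confidences can be normalized before being used as the s_i weights.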

3.2. Energy Function.
Let n be the total number of pixels and w and h the image width and height in pixels, that is, n = w × h. The dense depth-map x is estimated by minimizing the following energy function:

E(x) = Σ_{i∈Ω} s_i (x_i − b_i)^2 + c Σ_i Σ_{j∈N(i)} w_ij (x_i − x_j)^2 + λ Σ_i Σ_{j∈K(i)} w_ij (x_i − x_j)^2, (2)

where the first term is the data cost, the second term is the local smoothness, and the last term is the k-nearest smoothness. b_i is the user-assigned depth value at labeled pixel i, K(i) is the set of k-nearest neighbors of pixel i, and s_i is the scribble confidence at pixel i obtained from formula (1).
The feature vector T is used to find the k-nearest neighbors; here, α is a parameter set to 30 in all experiments. c and λ in formula (2) are parameters that weigh the local smoothness term and the k-nearest smoothness term, respectively. w_ij is a Gaussian weight measuring the color similarity between pixels i and j, defined as follows:

w_ij = exp(−‖I_i − I_j‖_2^2 / (2σ^2)), (3)

where σ is the bandwidth parameter, fixed to 0.03 in all experiments.
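The Gaussian affinity can be written as a one-line helper. The standard Gaussian form on the squared color distance and the use of normalized color values (consistent with the small bandwidth σ = 0.03) are assumptions of this sketch.

```python
import numpy as np

def gaussian_weight(ci, cj, sigma=0.03):
    """Affinity w_ij between two pixels from their (normalized) color
    values: exp of the negative squared color distance over 2*sigma^2."""
    d2 = float(np.sum((np.asarray(ci, float) - np.asarray(cj, float)) ** 2))
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

Identical colors give weight 1, and the weight decays rapidly as the color difference grows, so depth smoothing is effectively switched off across strong color edges.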
In formula (2), the data cost measures the consistency between the estimate and the user-assigned depth values. Since the scribble confidence is incorporated, the proposed method is robust to inaccurate user input. The local smoothness term makes neighboring pixels with similar colors have similar depth values. To reduce the impact on correct scribbles in textured regions, the k-nearest smoothness term is introduced so that distant pixels with similar features also have similar depth values. The energy function in formula (2) is minimized to obtain the dense depth-map x from the sparse depth-map b. To facilitate computer implementation, formula (2) is rewritten in matrix form as follows:

E(x) = (x − b)^T S (x − b) + c x^T L x + λ x^T L_k x, (4)

where S is an n × n diagonal matrix whose i-th diagonal entry is s_i. L is the n × n Laplacian matrix for local neighbors, defined as L = D − W, where W = [w_ij]_{n×n} (j ∈ N(i)) is the n × n affinity matrix for local neighbors and D is an n × n diagonal matrix whose i-th diagonal entry is Σ_j w_ij. Similarly, L_k = D_k − W_k, where W_k is the n × n affinity matrix for k-nearest neighbors and D_k is an n × n diagonal matrix whose i-th diagonal entry is the corresponding row sum of W_k. The energy function in formula (4) is convex; taking its derivative with respect to x and setting it to zero leads to the following system of linear equations:

(S + cL + λL_k) x = S b. (5)

The system in formula (5) is sparse and positive definite, which means the solution x can be obtained using the conjugate gradient method.
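The solve described above can be sketched with SciPy's sparse conjugate gradient. The function name and argument layout are assumptions; the matrix construction follows the text (Laplacians as degree matrix minus affinity matrix, confidence-weighted diagonal data term).

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import cg

def propagate_depth(s, W_local, W_knn, b, c=1.0, lam=1e-5):
    """Solve (S + c*L + lam*L_k) x = S b for the dense depth-map x.

    s       : (n,) scribble confidences (zero at unlabeled pixels)
    W_local : (n, n) sparse affinity matrix over local neighbors
    W_knn   : (n, n) sparse affinity matrix over k-nearest neighbors
    b       : (n,) sparse depth values (zero at unlabeled pixels)
    """
    S = sparse.diags(s)
    # Graph Laplacians: L = D - W with D the diagonal of row sums
    L = sparse.diags(np.asarray(W_local.sum(axis=1)).ravel()) - W_local
    Lk = sparse.diags(np.asarray(W_knn.sum(axis=1)).ravel()) - W_knn
    A = (S + c * L + lam * Lk).tocsr()
    # Sparse, symmetric positive definite -> conjugate gradient
    x, info = cg(A, S @ b, atol=1e-10)
    return x
```

On a toy 4-pixel chain with unit local weights and confident labels of 0 and 1 at the two ends, the soft data term yields the linear ramp [0.2, 0.4, 0.6, 0.8] rather than hard-clamping the endpoints, which is exactly what makes down-weighting unreliable scribbles possible.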

Experimental Results
4.1. Experimental Setup. Four representative test images, RGBZ_01, RGBZ_03, RGBZ_05, and RGBZ_07, from the RGBZ dataset [23] are used to evaluate the performance. The proposed method is compared with several state-of-the-art methods, including RW [4], optimization (OPT) [6], hybrid GC and RW (HGR) [10], superpixel-based optimization (SOPT) [13], and nonlocal RW (NRW) [14]. In the proposed method, the regularization weight parameters c and λ are fixed to 1 and 10^−5, respectively. The local neighborhoods in formulas (1) and (2) are empirically set to 3 × 3 square windows centered at each pixel. The parameter k of the k-nearest neighbors in formula (2) is set to 9. Structural similarity (SSIM) and PSNR are used as the quantitative measures for comparison, with the parameters of SSIM set to the default values suggested by Wang et al. [24].
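The PSNR measure used in the comparison can be sketched as a small helper; the 8-bit peak value of 255 is an assumption here, and the SSIM side can be supplied by `skimage.metrics.structural_similarity` with its default (Wang et al.) parameters.

```python
import numpy as np

def psnr(depth_est, depth_gt, peak=255.0):
    """PSNR in dB between an estimated depth-map and the ground truth.
    `peak` is the maximum possible depth value (8-bit maps assumed)."""
    est = np.asarray(depth_est, dtype=float)
    gt = np.asarray(depth_gt, dtype=float)
    mse = np.mean((est - gt) ** 2)
    if mse == 0:
        return float("inf")  # identical maps
    return 10.0 * np.log10(peak ** 2 / mse)
```

Higher PSNR means the propagated depth-map is closer to the ground-truth depth-map in a mean-squared sense.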

4.2. Experiments in the Absence of Inaccurate Scribbles.
In this section, user input is assumed to be perfectly accurate, and the performance of the proposed method in this case is shown.
The SSIM comparison is listed in Table 1, and the PSNR comparison is shown in Table 2. As shown in Tables 1 and 2, the proposed method is comparable to the current optimization method [6] when user input is accurate. Figures 4-7 show qualitative comparisons of the different methods on the four test images. It can be seen that the proposed method is superior in reducing depth bleeding artifacts compared with the previous optimization method [6]. The reason is that the k-nearest smoothness term in formula (2) is effective in preserving sharp depth boundaries [14]. In summary, the proposed method can be safely used in the case of accurate input.

4.3. Experiments in the Presence of Inaccurate Scribbles.
In this section, some inaccurate scribbles are added to the abovementioned experiments by roughly drawing labels across some randomly selected object boundaries (see the regions within the white squares of Figures 8(b)-11(b)). The SSIM comparison in this case is listed in Table 3, and the PSNR comparison is shown in Table 4. It can be seen that the proposed method is superior to the other approaches when inaccurate input is present and obtains the highest SSIM and PSNR values on average. This shows that scribble confidence can help resist inaccurate scribbles. The performance of current methods degrades significantly in the case of inaccurate input, since they assume all scribbles are perfectly accurate. The qualitative comparisons of the different methods are shown in Figures 8-11. Current methods generate undesirable visual artifacts around inaccurately labeled regions, while thanks to the scribble confidence, the proposed method successfully reduces the artifacts caused by inaccurate input.

4.4. Experiments for Sparse Labeling. In this section, user input is assumed to be very sparse. In the first row of Figure 12, results are shown for a very sparse input containing only seven strokes. In the second row, two strokes are added to the input image. It can be seen that the performance of all methods improves as the number of accurate scribbles increases. NRW and the proposed method are superior to the others in preserving depth discontinuities, since they both use nonlocal regularization. As analyzed in Section 3.1, although the proposed method may mistake correct labels for incorrect ones when they are located in textured regions, it can obtain acceptable results even with very sparse scribbles thanks to the k-nearest smoothness in formula (2), as shown in Figures 12(g) and 12(n).

Conclusion
Semiautomatic 2D-to-3D conversion has proven to be an effective solution for alleviating the 3D content shortage. Its key step is sparse-to-dense depth conversion from user scribbles. Existing methods assume user input is entirely accurate, and even small errors may degrade the depth quality dramatically. To alleviate this problem, the color difference between labeled pixels and their neighbors is used to compute a scribble confidence, and a confident optimization method is proposed for sparse-to-dense depth conversion. Furthermore, k-nearest smoothness is introduced to make the proposed method perform well even with very sparse input. The experiments demonstrate that the proposed method is superior to existing methods when inaccurate input is present, while competitive results are obtained when all scribbles are accurate.
Currently, the proposed method mainly focuses on 2D-to-3D conversion for images. In the future, the proposed method will be extended to videos.

Figure 1: A sparse-to-dense depth propagation result with inaccurate input. (a) User scribbles (inaccurate scribbles are marked by the white circles). (b) Sparse depth-map extracted from the user scribbles. (c) Ground-truth depth-map. (d) Depth-map generated by random walks [4]. (e) Depth-map generated by the optimization method [6]. (f) Depth-map generated by the proposed method.

Figure 3: Demonstration of the scribble confidence provided by the proposed method. (a) User scribbles. (b) Sparse depth. (c) Label mask. (d) Scribble confidence (the brighter the intensity, the higher the confidence of the labeled pixels). (e) Depth-map of the optimization method [6]. (f) Depth-map of the confident optimization method.

Table 1: SSIM comparison in the absence of inaccurate scribbles.

Table 2: PSNR comparison in the absence of inaccurate scribbles. The best PSNR (dB) in each row is shown in bold.
