Robust Semiautomatic 2D-to-3D Conversion with Welsch M-Estimator for Data Fidelity

Semiautomatic 2D-to-3D conversion plays an important role in generating 3D content for display. However, most existing methods assume that user scribbles are perfectly correct and give acceptable results only when the user provides accurate labels. To address this problem, we use a Welsch M-estimator data fidelity term to resist erroneous scribbles. Its ability to alleviate the influence of inaccurate scribbles is theoretically guaranteed by the redescending property of the Welsch M-estimator. First, the Welsch M-estimator is introduced to measure the fidelity between the estimated depth and the user-provided depth; then local smoothness is enforced using a color-weighted Welsch M-estimator so that neighboring pixels with similar colors have similar depth values. Finally, we solve the resulting problem with a generalized iteratively reweighted least squares algorithm. Experiments demonstrate that our method obtains competitive performance in the absence of inaccurate scribbles and outperforms the state of the art both visually and quantitatively in the presence of inaccurate scribbles.


Introduction
3D videos have gained much attention as 3D viewing has become popular and the Virtual Reality (VR) market has emerged. The biggest issue facing the 3D industry is the lack of program material. 2D-to-3D conversion is a practical solution to alleviate this content shortage by estimating depth information from monoscopic images [1]. High-quality depth extraction plays a key role in 2D-to-3D conversion [2].
Depending on whether human intervention is utilized, 2D-to-3D conversion can be divided into three categories: manual, automatic, and semiautomatic methods [3]. Manual methods can provide high-quality results through per-pixel depth assignment, but this labeling makes the conversion process both cumbersome and expensive [4]. Automatic methods attempt to estimate depth from monoscopic images using various cues such as defocus, texture gradients, and scattering [5]. Recently, deep-learning-inspired approaches have been proposed for automatically converting 2D video/images to 3D format [5-10]. Although these methods can produce depth maps automatically, they struggle to provide robust and stable conversion results on general content. Semiautomatic methods balance 3D quality against conversion cost and consist of the following steps: first, the user labels chosen key frames to provide sparse depth; then dense depth maps of the key frames are obtained via sparse-to-dense propagation; finally, depth maps of nonkey frames are generated by propagating depth from the key frames [3]. The conversion quality largely depends on the accuracy of the depth maps for key frames. Therefore, we focus on the most relevant line of work in semiautomatic 2D-to-3D conversion: sparse-to-dense depth propagation for key frames.
Various methods have been proposed for dense depth estimation from user scribbles. Rzeszutek et al. [11] exploit random walks (RW) to generate dense depth based on the user input, but RW has problems preserving strong edges [12], resulting in blurring artifacts at object boundaries. Phan and Androutsos [12] attempt to enhance the depth discontinuities of RW by introducing hard segmentation constraints provided by graph cuts (GC). However, GC has difficulty locating object boundaries at low-contrast transitions from foreground to background [13] and may introduce fake boundaries. Our previous work [14] demonstrates that the depth discontinuities of RW can be enhanced with nonlocal pairwise constraints. Wu et al. [15] enhance depth boundaries with superpixel constraints, which can prevent depth propagation across low-contrast edge regions. Lopez et al. [16] formulate depth estimation from user scribbles as a graph-based optimization problem with equality, inequality, and perspective constraints. Becker et al. [17] let the user annotate depth discontinuities in key frames and learn depth edges of nonkey frames with random forests, which can produce dense depth maps with sharp edges at discontinuities but requires more cumbersome labeling work. Kawai and Sasaki [18] propose to generate dense depth from user-provided anchor points on object outlines in key frames, but this increases labeling difficulty since it is hard to locate object outlines precisely. Donatsch et al. [19] employ user-provided geometric features to generate stereo pairs directly, but their method is mainly suitable for images containing buildings. Zhang et al. [20] utilize interactive segmentation to refine foreground depth, but inaccurate segmentation may introduce depth artifacts. Iizuka et al. [21] show that geodesic-distance-based interpolation can obtain dense depth efficiently from user input with few scribbles. Liao et al. [22] let the user assign diffusion strength during sparse-to-dense propagation to influence the depth estimation.
Existing approaches mainly focus on enhancing depth quality and assume that user scribbles are entirely accurate. Therefore, they generate correct depth only from accurate user scribbles, and even small errors in the input may degrade depth quality significantly, as shown in Figure 1. Erroneous scribbles inside objects or the background can be easily removed by users during the conversion process. However, it is hard for users to make adjustments when erroneous input appears at object boundaries. A user-friendly semiautomatic 2D-to-3D conversion method should have the ability to remove erroneous input automatically. Handling inaccurate user labels has been addressed in semiautomatic image segmentation [23, 25, 26]. While Subr et al. [25] and Bai and Wu [26] can discriminate accurate from inaccurate input, they focus on binary labels and cannot be applied to 2D-to-3D conversion directly. Oh et al. [23] utilize the occurrence and cooccurrence probability (OCP) of color values for labeled pixels to estimate the reliability of each label, but may mistake correct labels for incorrect ones. Surprisingly, few 2D-to-3D conversion methods handle inaccurate user labels. To address this issue, we propose a robust method based on a Welsch M-estimator data fidelity, motivated by the fact that the Welsch-loss-based redescending M-estimator can effectively resist extreme outliers [27]. We note that the Welsch M-estimator has been used to construct regularizers for depth superresolution in recent years [24, 28-31].
Although these methods employ the Welsch M-estimator for regularization to handle the structural differences between texture and depth images, we leverage it for data fidelity to resist the influence of inaccurate input on the estimated depth.
Thanks to the Welsch M-estimator data fidelity, our approach outperforms existing methods in the presence of inaccurate input and provides at least comparable performance in the absence of erroneous input. The remainder of this paper is divided into three sections. In Section 2, our method for robust semiautomatic 2D-to-3D conversion is presented. Experimental results are provided in Section 3. Finally, we give conclusions in Section 4.

Proposed Approach
The semiautomatic 2D-to-3D conversion framework based on the proposed method is shown in Figure 2. First, we provide an interaction tool (https://github.com/tcyhx/brush2depth) for the user to brush sparse scribbles on input 2D images or key frames, indicating initial depth. Second, a sparse depth map is obtained from the intensities of the user scribbles, where lighter and darker scribbles denote closer and farther from the viewer, respectively. Third, we construct data fidelity and regularization terms using the Welsch M-estimator and formulate sparse-to-dense depth propagation as a robust optimization problem, which is illustrated in Section 2.1. Then, we solve the optimization problem via generalized iteratively reweighted least squares (GIRLS) [32], as discussed in Section 2.2. Finally, we produce 3D content using a depth-image-based rendering (DIBR) technique proposed in our previous work [33].
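As a small illustration of the second step, the following sketch (Python; the function name `scribbles_to_sparse_depth` and the grayscale-intensity convention are our assumptions for illustration, not the documented behavior of the tool) converts a scribble image and its label mask into the sparse depth map:

```python
import numpy as np

def scribbles_to_sparse_depth(scribble_img, labeled_mask):
    """Map scribble intensities to initial depths: lighter strokes are
    read as closer to the viewer, darker strokes as farther away."""
    # Reduce a color scribble image to a single intensity channel.
    gray = scribble_img.mean(axis=2) if scribble_img.ndim == 3 else scribble_img
    u = np.zeros(gray.shape, dtype=np.float64)
    u[labeled_mask] = gray[labeled_mask]  # intensities assumed in [0, 1]
    return u
```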

Model.
Let $\Omega$ denote the set of user-labeled pixel locations. Given the $n$-pixel input image $I$, the user-provided sparse depth map $\mathbf{u}$, and the estimated dense depth map $\mathbf{d}$, we denote by $I_i$, $u_i$, and $d_i$ the corresponding values at pixel $i$. Without loss of generality, we assume that $I_i$, $u_i$, and $d_i$ are normalized to the range 0 to 1. We minimize the following objective function to estimate $\mathbf{d}$ from $\mathbf{u}$:

$$E(\mathbf{d}) = \sum_{i \in \Omega} \phi_{\sigma_1}(d_i - u_i) + \lambda \sum_{i=1}^{n} \sum_{j \in \mathcal{N}_i} w_{ij}\,\phi_{\sigma}(d_i - d_j), \qquad (1)$$

where $\phi_{\sigma}(x) = 1 - \exp(-\sigma x^2)$ denotes the Welsch function and the bandwidth parameter $\sigma$ influences the strength of the penalty on outliers ($\sigma_1$ plays the same role in the data fidelity term), $\mathcal{N}_i$ is the local neighboring index set of pixel $i$, $w_{ij}$ represents a Gaussian weighting function measuring the appearance similarity between pixels $i$ and $j$, which is given by $w_{ij} = \exp(-\|I_i - I_j\|^2 / (2\sigma_2^2))$, and $\lambda$ is the parameter balancing the data fidelity against the regularizer.
It can be seen from formula (1) that we introduce data consistency through the Welsch function to suppress erroneous user input while also adopting the Welsch loss for regularization. Since the Welsch M-estimator can deal with outliers of large magnitude [27], the data fidelity term allows us to ignore inaccurate scribbles, while the regularizer minimizes depth blurring caused by structural differences between texture and depth images. Recently, Ham et al. [24], Kim et al. [28, 29], and Liu et al. [30, 31] have modeled the regularity of depth maps with the Welsch M-estimator. Our model differs from [24, 28-31] in its data fidelity: these methods all use a quadratic data fidelity, which cannot handle inaccuracies in user scribbles. As shown in Figure 3, the Welsch M-estimator data fidelity helps reduce visual artifacts caused by inaccurate input, whereas the quadratic data fidelity cannot suppress erroneous input. The characteristics of our model are further illustrated in Section 2.3.
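To illustrate the redescending behavior numerically, here is a minimal sketch in Python (using the Welsch form $\phi_\sigma(x) = 1 - \exp(-\sigma x^2)$ stated in (1); the residual values are arbitrary examples) comparing the Welsch penalty with the quadratic one:

```python
import numpy as np

def welsch(x, sigma):
    """Welsch penalty from (1): bounded and redescending, so a residual
    caused by an erroneous scribble saturates instead of dominating."""
    return 1.0 - np.exp(-sigma * x ** 2)

residuals = np.array([0.01, 0.05, 0.2, 1.0])  # |estimated - scribbled| depth
print(welsch(residuals, sigma=1000.0))  # ~[0.10, 0.92, 1.00, 1.00]: capped at 1
print(residuals ** 2)                   # quadratic penalty grows without bound
```

Because the Welsch penalty saturates, a badly scribbled pixel contributes at most a constant to the objective, which is exactly why the data term can afford to ignore it.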

Solver.
The optimization problem of minimizing (1) is nonconvex and can be solved by the GIRLS algorithm [32]. The idea of GIRLS is to determine an upper-bound quadratic function and then iteratively minimize the quadratic approximations to obtain a local minimum.
The quadratic upper bound of the Welsch function can be obtained following [24]:

$$\phi_{\sigma}(x) \le \phi_{\sigma}(y) + \sigma e^{-\sigma y^2}\,(x^2 - y^2), \qquad (2)$$

with equality only if $x = y$. Thus the quadratic upper bound of formula (1) is given by

$$E(\mathbf{d}) \le \sum_{i \in \Omega} \sigma_1 e^{-\sigma_1 (d_i^k - u_i)^2}\,(d_i - u_i)^2 + \lambda \sum_{i=1}^{n} \sum_{j \in \mathcal{N}_i} w_{ij}\,\sigma e^{-\sigma (d_i^k - d_j^k)^2}\,(d_i - d_j)^2 + C, \qquad (3)$$

where $C$ is a constant term which does not depend on $\mathbf{d}$ and will be ignored in the solving process, $\mathbf{d}^k$ denotes the estimated depth map at the $k$-th iteration, and $d_i^k$ represents its value at pixel $i$.
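For completeness, the bound in (2) is simply the tangent (first-order) bound of a concave function; the short derivation below is a sketch under the Welsch form stated in Section 2.1:

```latex
% g(t) = 1 - e^{-\sigma t} is concave in t, so g(t) \le g(t_0) + g'(t_0)(t - t_0),
% where g'(t_0) = \sigma e^{-\sigma t_0}. Substituting t = x^2 and t_0 = y^2:
\phi_\sigma(x) = g(x^2)
            \le g(y^2) + \sigma e^{-\sigma y^2}\,(x^2 - y^2)
             =  \phi_\sigma(y) + \sigma e^{-\sigma y^2}\,(x^2 - y^2),
% which is formula (2); the bound is tight at x = y.
```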
Then, GIRLS minimizes (1) by iteratively solving the following problem:

$$\mathbf{d}^{k+1} = \arg\min_{\mathbf{d}} \sum_{i \in \Omega} \sigma_1 e^{-\sigma_1 (d_i^k - u_i)^2}\,(d_i - u_i)^2 + \lambda \sum_{i=1}^{n} \sum_{j \in \mathcal{N}_i} w_{ij}\,\sigma e^{-\sigma (d_i^k - d_j^k)^2}\,(d_i - d_j)^2. \qquad (4)$$

Let $\mathbf{u} = [u_i]_{n \times 1}$ and $\mathbf{d}^{k+1} = [d_i^{k+1}]_{n \times 1}$; the problem in (4) can be solved in matrix form as

$$\left(\sigma_1 \mathbf{M}^k + \lambda\sigma \mathbf{L}^k\right) \mathbf{d}^{k+1} = \sigma_1 \mathbf{M}^k \mathbf{u}, \qquad (5)$$

where $\mathbf{M}^k$ is an $n \times n$ diagonal matrix whose $i$-th diagonal entry is

$$M_{ii}^k = \begin{cases} e^{-\sigma_1 (d_i^k - u_i)^2}, & i \in \Omega, \\ 0, & \text{otherwise}, \end{cases} \qquad (6)$$

$\mathbf{L}^k = \boldsymbol{\Lambda}^k - \mathbf{A}^k$ represents the $n \times n$ Laplacian matrix at the $k$-th iteration, where $\mathbf{A}^k$ denotes an $n \times n$ affinity matrix whose entry in the $i$-th row and $j$-th column is

$$A_{ij}^k = \begin{cases} w_{ij}\, e^{-\sigma (d_i^k - d_j^k)^2}, & j \in \mathcal{N}_i, \\ 0, & \text{otherwise}, \end{cases} \qquad (7)$$

and $\boldsymbol{\Lambda}^k$ is an $n \times n$ diagonal matrix with $i$-th diagonal entry

$$\Lambda_{ii}^k = \sum_{j=1}^{n} A_{ij}^k. \qquad (8)$$
In summary, the whole procedure used to minimize (1) is illustrated as follows.
Set $k = 0$ and initialize $\mathbf{d}^0 = \mathbf{u}$.
while $k < k_{\max}$ do
(1) Update the entries of $\mathbf{M}^k$ by formula (6).
(2) Update the entries of $\mathbf{A}^k$ by formula (7).
(3) Update the entries of $\boldsymbol{\Lambda}^k$ by formula (8), and form $\mathbf{L}^k = \boldsymbol{\Lambda}^k - \mathbf{A}^k$.
(4) Obtain $\mathbf{d}^{k+1}$ by solving the linear system (5).
(5) $k \leftarrow k + 1$.
end while
Final estimated dense depth $\mathbf{d} = \mathbf{d}^{k_{\max}}$.

Here, $k_{\max}$ denotes the maximal number of iterations.
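To make the procedure concrete, below is a minimal GIRLS sketch in Python with SciPy sparse matrices; the function name, the 4-connected grid, and the initialization $\mathbf{d}^0 = \mathbf{u}$ are our assumptions for illustration rather than details fixed by the reference implementation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def girls_depth_propagation(img, u, omega, lam=10.0, sigma=2000.0,
                            sigma1=1000.0, sigma2=0.1, k_max=10):
    """Sparse-to-dense depth propagation by minimizing (1) with GIRLS.

    img:   (h, w, 3) float image in [0, 1]
    u:     (h, w) sparse depth map in [0, 1] (zeros where unlabeled)
    omega: (h, w) boolean mask of user-labeled pixels
    """
    h, w = u.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)

    # 4-connected neighbor pairs; each undirected edge is listed once.
    pi = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    pj = np.concatenate([idx[:, 1:].ravel(), idx[1:, :].ravel()])

    # Fixed Gaussian color weights w_ij between neighboring pixels.
    feat = img.reshape(n, -1)
    w_ij = np.exp(-np.sum((feat[pi] - feat[pj]) ** 2, axis=1)
                  / (2.0 * sigma2 ** 2))

    uv = u.ravel().astype(np.float64)
    om = omega.ravel().astype(np.float64)
    d = uv.copy()  # initialize d^0 from the user scribbles

    for _ in range(k_max):
        # (6): redescending data-fidelity weights on labeled pixels only.
        m = om * np.exp(-sigma1 * (d - uv) ** 2)
        # (7): affinities combining color similarity and current depth.
        a = w_ij * np.exp(-sigma * (d[pi] - d[pj]) ** 2)
        A = sp.coo_matrix((np.concatenate([a, a]),
                           (np.concatenate([pi, pj]),
                            np.concatenate([pj, pi]))), shape=(n, n)).tocsr()
        # (8): degree matrix, then the Laplacian L^k = Lambda^k - A^k.
        L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A
        M = sp.diags(m)
        # (5): solve (sigma1 * M^k + lam * sigma * L^k) d^{k+1} = sigma1 * M^k u.
        d = spsolve((sigma1 * M + lam * sigma * L).tocsc(), sigma1 * m * uv)

    return d.reshape(h, w)
```

Each iteration costs one sparse linear solve; for large images, `spsolve` could be replaced by a preconditioned conjugate-gradient solver, since the system matrix is symmetric and positive definite whenever some labeled weights are nonzero.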

Analysis.
Looking at formula (3), we can observe that the data fidelity is weighted by a Gaussian function of the difference between the latest estimated and the user-provided depth values.
At erroneous input regions, inaccurate scribbles have depth values that differ from those of their neighboring pixels; thus the smoothness imposed by the regularizer makes the estimated depth deviate from the user-provided depth, and the weight of the data fidelity decreases to zero during the iterative solving process. Therefore, the proposed model can suppress inaccurate user scribbles.
At accurate input regions, the depth values of labeled pixels are consistent with their neighbors; thus the result relies mainly on the data fidelity term, which pulls the estimated depth toward the user-assigned depth. Therefore, the weight of the data fidelity approaches 1 during the iterative solution process, and accurate user scribbles are not affected by the proposed model.
Figure 4 illustrates the change curve of the data fidelity weight for an input image. We can see that the fidelity weight rapidly drops to 0 at erroneous input regions and stays close to 1 at accurately labeled regions.
Experimental Results

As the quantitative evaluation metric, we used the structural similarity index (SSIM) [35], since it can predict human perception of image quality. Similar to Konno et al. [36], the standard deviation of the Gaussian function in SSIM was set to 4 so that it evaluates the similarity of semiglobal structure. A higher SSIM value indicates better performance.
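As a pointer for reproduction, the Gaussian-weighted SSIM described above can be computed with scikit-image; the placeholder depth maps below are synthetic stand-ins, not data from our experiments.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Placeholder depth maps; in practice these would be the estimated and
# ground truth depths, normalized to [0, 1].
rng = np.random.default_rng(0)
ground_truth = rng.random((240, 320))
estimated = np.clip(ground_truth + 0.05 * rng.standard_normal((240, 320)), 0, 1)

# Gaussian-weighted SSIM with standard deviation 4 (cf. Konno et al. [36]).
score = structural_similarity(ground_truth, estimated,
                              gaussian_weights=True, sigma=4.0,
                              use_sample_covariance=False, data_range=1.0)
print(f"SSIM: {score:.4f}")
```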

Choice of the Parameters.
The parameters $\lambda$, $\sigma$, $\sigma_1$, $\sigma_2$, and $k_{\max}$ must be set before running our sparse-to-dense depth propagation algorithm. The parameter $\lambda$ balances the data fidelity against the regularizer and affects depth smoothness. The bandwidth parameters $\sigma$ and $\sigma_2$ adjust how well depth discontinuities are preserved. Liu et al. [37] have proposed an adaptive method to calculate the bandwidth according to the local depth smoothness. The parameter $\sigma_1$ influences the strength of resistance to outliers. $k_{\max}$ is used to terminate the iterations. Our algorithm typically converges in fewer than 10 iterations; thus $k_{\max}$ is fixed to 10 in our method. We find that the choice $\lambda = 10$, $\sigma = 2000$, $\sigma_1 = 1000$, and $\sigma_2 = 0.1$ is proper for most cases.
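Plugged into the GIRLS sketch from Section 2.2, these defaults would read as follows (hypothetical usage, reusing the function and argument names assumed earlier):

```python
# lambda = 10, sigma = 2000, sigma1 = 1000, sigma2 = 0.1, k_max = 10.
depth = girls_depth_propagation(img, sparse_depth, labeled_mask,
                                lam=10.0, sigma=2000.0, sigma1=1000.0,
                                sigma2=0.1, k_max=10)
```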

Comparison with Existing Methods in the Presence of Erroneous Scribbles.
In this subsection, we roughly draw labels across some randomly selected object boundaries; these erroneously labeled regions are marked by white circles in Figures 5-13. Table 1 shows quantitative comparisons in the presence of erroneous scribbles. It can be seen from Table 1 that the proposed method achieves the best performance on all scenes in terms of SSIM.

Comparison with Existing Methods in the Absence of Erroneous Scribbles.
In this subsection, we perform experiments on depth estimation with human interaction for 2D-to-3D conversion when erroneous input is absent. Table 2 presents SSIM comparisons for the estimated depth. Visual comparisons are shown in Figures 14-22.
Table 2 shows that our method achieves the same performance as SDF [24] in terms of SSIM. The reason is that the proposed method generates weights approaching 1 for the data fidelity in the absence of erroneous input, as illustrated in Figure 4. From the data in Table 2, we can also see that our method has the second-highest SSIM on average. Therefore, the proposed method has performance comparable to state-of-the-art approaches in the absence of erroneous scribbles.
The above experiments show that the proposed method outperforms state-of-the-art methods both qualitatively and quantitatively in the presence of inaccurate scribbles. In addition, our method shows comparable performance when accurate scribbles are provided. Therefore, the proposed method can be used for depth estimation in 2D-to-3D conversion under various conditions.

Conclusion and Future Work
We have proposed a robust sparse-to-dense depth propagation method for images or key frames in semiautomatic 2D-to-3D conversion. Depth estimation is formulated as a nonconvex optimization problem. We leverage the Welsch M-estimator to construct the data fidelity term and exploit the outlier-resistance property of the redescending M-estimator to suppress erroneous scribbles. Experiments demonstrate that our method is more robust than state-of-the-art methods when inaccurate input is present and obtains comparable performance in the absence of erroneous scribbles. The parameters of our method are set empirically. In the future, an optimal parameter setting scheme according to depth properties should be developed. In addition, we will apply our method to depth propagation from key frames to nonkey frames.

Figure 1: Depth estimation with erroneous user input, where (a) is user labeled image (erroneous scribbles at object boundaries are marked by white circles, and errors inside objects or background are marked by yellow circles), (b) is ground truth depth, (c) is result of Rzeszutek et al. [11], (d) is result of Phan and Androutsos [12], (e) is result of Yuan et al. [14], (f) is result of Wu et al. [15], (g) is result of Ham et al. [24], and (h) is result of the proposed method.

Figure 3: Depths obtained by minimizing quadratic and Welsch M-estimator data fidelity based objective functions, where (a) is input image, (b) is user labeled image (erroneous scribbles at object boundaries are marked by white circles, and errors inside objects or background are marked by yellow circles), (c) is sparse depth map obtained from (b), (d) is ground truth depth, (e) is dense depth map generated by minimizing the quadratic data fidelity based objective function, and (f) is dense depth map generated by minimizing the Welsch M-estimator data fidelity based objective function.

Figure 4: Change curve for the weight of data fidelity during the iterative solution process, where green and blue curves are for labels marked by green and blue circles, respectively.

Figure 5: Results of different methods on RGBZ 01 in the presence of erroneous scribbles. (a) is input image. (b) is user labeled image (scribbles inside white and yellow circles are inaccurate). (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 6: Results of different methods on RGBZ 02 in the presence of erroneous scribbles; panels (a)-(j) are organized as in Figure 5.

Figure 7: Results of different methods on RGBZ 03 in the presence of erroneous scribbles. (a) is input image. (b) is user labeled image (scribbles inside white and yellow circles are inaccurate). (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 8: Results of different methods on RGBZ 04 in the presence of erroneous scribbles; panels (a)-(j) are organized as in Figure 7.

Figure 9: Results of different methods on RGBZ 05 in the presence of erroneous scribbles. (a) is input image. (b) is user labeled image (scribbles inside white and yellow circles are inaccurate). (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 10: Results of different methods on RGBZ 06 in the presence of erroneous scribbles. (a) is input image. (b) is user labeled image (scribbles inside white and yellow circles are inaccurate). (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 11: Results of different methods on RGBZ 07 in the presence of erroneous scribbles. (a) is input image. (b) is user labeled image (scribbles inside white and yellow circles are inaccurate). (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 12: Results of different methods on RGBZ 08 in the presence of erroneous scribbles. (a) is input image. (b) is user labeled image (scribbles inside white and yellow circles are inaccurate). (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 13: Results of different methods on RGBZ 09 in the presence of erroneous scribbles; panels (a)-(j) are organized as in Figure 12.

Figure 14: Results of different methods on RGBZ 01 in the absence of erroneous scribbles. (a) is input image. (b) is user labeled image. (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 15: Results of different methods on RGBZ 02 in the absence of erroneous scribbles. (a) is input image. (b) is user labeled image. (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 16: Results of different methods on RGBZ 03 in the absence of erroneous scribbles. (a) is input image. (b) is user labeled image. (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 17: Results of different methods on RGBZ 04 in the absence of erroneous scribbles; panels (a)-(j) are organized as in Figure 16.

Figure 18: Results of different methods on RGBZ 05 in the absence of erroneous scribbles. (a) is input image. (b) is user labeled image. (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 19: Results of different methods on RGBZ 06 in the absence of erroneous scribbles. (a) is input image. (b) is user labeled image. (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 20: Results of different methods on RGBZ 07 in the absence of erroneous scribbles. (a) is input image. (b) is user labeled image. (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 21: Results of different methods on RGBZ 08 in the absence of erroneous scribbles. (a) is input image. (b) is user labeled image. (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Figure 22: Results of different methods on RGBZ 09 in the absence of erroneous scribbles. (a) is input image. (b) is user labeled image. (c) is ground truth depth. (d) is result of RW. (e) is result of HGCRW. (f) is result of NRW. (g) is result of SCO. (h) is result of OCP. (i) is result of SDF. (j) is result of the proposed method.

Table 1: SSIM comparison in the presence of erroneous input. The first and second best SSIM in each row are shown in bold and italic, respectively.

Table 2: SSIM comparison in the absence of erroneous input. The first and second best SSIM in each row are shown in bold and italic, respectively.