Hand Depth Image Denoising and Superresolution via Noise-Aware Dictionaries

This paper proposes a two-stage method for hand depth image denoising and superresolution, using bilateral filters and dictionaries learned via K-SVD based on noise-aware orthogonal matching pursuit (NAOMP). The bilateral filtering phase recovers singular points and removes artifacts on silhouettes by averaging depth data over neighborhood pixels on which both depth difference and RGB similarity restrictions are imposed. The dictionary learning phase uses NAOMP to train dictionaries that separate faithful depth from noisy data. Compared with traditional OMP, NAOMP adds a residual reduction step which effectively weakens the noise term within the residual while the residual is decomposed in terms of atoms. Experimental results demonstrate that the bilateral filtering phase and the NAOMP-based dictionary learning phase jointly denoise both virtual and real depth images effectively.


Introduction
With the development of 3D range imaging devices such as laser scanners, the Kinect sensor, and Time-of-Flight (ToF) cameras, depth images are widely used in various research fields including computer vision, computer graphics, virtual reality, and human-computer interaction. While laser scanners provide 3D measurements with high accuracy, the Kinect sensor and ToF cameras provide a convenient way to accomplish 3D range imaging in a timely manner, which facilitates many applications with high requirements for efficiency and convenience.
However, depth images provided by the Kinect sensor or ToF cameras, whether based on the structured light principle or on the ToF principle, suffer from lower quality and resolution because of the deficiency of received light speckles and the noise incurred from the ranging environment. Typically, such depth images contain holes, missing regions, or unstable boundaries, as well as nonzero-mean Gaussian noise (see Figure 1).
The research work devoted to the enhancement of depth images, including superresolution and denoising of depth images, can be roughly divided into three categories: filtering methods, probabilistic methods, and sparse representation methods. In general, filtering methods [1][2][3][4][5][6] perform depth enhancement by using filters, based on the assumption that faithful depth data and noise are separable in the frequency domain; probabilistic methods [7][8][9][10][11][12] formulate depth enhancement as an uncertainty problem of depth measurement and use probabilistic graphical models to solve it; sparse representation methods [13][14][15][16][17][18] model the depth enhancement problem as a sparse optimization by assuming that faithful depth data have an underlying sparse or low-rank structure.
Hand depth image denoising and superresolution are important for hand-based human-machine interaction. Although many methods have been proposed for RGB/depth image denoising and superresolution, conventional approaches do not work well for hand depth images. This is because the resolution of depth images captured by the Kinect sensor is 512 × 424, of which the hand occupies only a very small subregion (typically 170 × 150 in our experiments). Thus traditional approaches usually confuse depth with noise and are unsuitable for such small-scale depth data.
This paper proposes hand depth image denoising and superresolution using bilateral filters and NAOMP-based dictionaries. The bilateral filtering phase recovers singular points and removes artifacts on silhouettes by averaging depth data with restrictions on both depth difference and RGB similarity. The dictionary learning phase uses the K-SVD method with NAOMP to train dictionaries that separate faithful depth from noisy data. While traditional orthogonal matching pursuit (OMP) works well for training dictionaries to denoise RGB images, its performance deteriorates badly on depth images, as depth images involve nonzero-mean Gaussian noise. As a result, traditional dictionary learning algorithms (e.g., OMP-based K-SVD) cannot prevent noisy data from penetrating the dictionaries, which results in an unsatisfactory denoising effect. To adapt traditional OMP-based K-SVD to the nonzero-mean Gaussian noise which frequently appears in depth data, we propose NAOMP as a replacement for traditional OMP in K-SVD, where the noise term within residuals is weakened in each atom updating step. Such a modification yields dictionaries capable of representing faithful depth data more precisely. Experimental results show that the proposed bilateral filters and NAOMP-based dictionaries jointly give promising results for denoising both virtual and real depth data, compared with traditional bilateral filters and traditional OMP-based dictionaries.
This paper is organized as follows. Section 2 reviews previous work on enhancing depth images. Section 3 proposes the bilateral filtering phase using RGBD data, which first recovers singular points, then removes incorrect points over silhouettes, and finally filters nonsingular points over nonsilhouette regions. Sections 4 and 5 propose depth image denoising and superresolution, respectively, both using dictionaries learned via NAOMP-based K-SVD. Section 6 shows experimental results with both virtual and real hand depth images.

Previous Work
Previous work on enhancing depth images is reviewed in this section, including superresolution of depth images and denoising depth images in the following three categories.
2.1. Filtering Methods. Yang et al. [1] construct a 3D volume of depth probability, impose a bilateral filter over the volume iteratively, and obtain high-resolution depth images by taking the winner-takes-all approach on the weighted volume and a subpixel refinement afterward. Huhle et al. [2] present a two-stage depth enhancement method, which first removes outliers from depth data and then performs smoothing via a nonlocal means filter which uses the similarity of both color and intrapatch of depth. Wasza et al. [3] propose a GPU-based depth image preprocessing, including a normalized convolution for restoring depth images, a bilateral temporal averaging for dynamic scenes, and a guided filter for edge-preserving denoising. Min et al. [4] propose a joint histogram of depth maps for measuring color similarity between reference and neighboring pixels and find a global mode solution via $\ell_1$-norm minimization for depth video enhancement. Fu et al. [5] propose a spatial-temporal denoising algorithm which exploits both the intraframe spatial correlation and the interframe temporal correlation to fill the depth holes and suppress the depth noise. Camplani and Salgado [6] propose a joint-bilateral filtering framework for denoising depth images, by evaluating missing depth values with a filter over neighboring pixels involving both spatial and temporal information, with the filter weights selected as a function of a photometric similarity measure of the neighboring pixels.
2.2. Probabilistic Methods. Mac Aodha et al. [7] explore the height field of patches of low-resolution depth images and select high-resolution candidate depth patches by solving a Markov Random Field (MRF) labeling problem. Shen and Cheung [8] use depth layers to account for the differences between foreground objects and background scene, the missing depth value phenomenon, and the correlation between color and depth channels, and consider the depth layer labeling as a maximum a posteriori estimation problem. Wang et al. [9] evaluate the confidence of the depth map for adaptive weighting of MRF energy terms and introduce a guided depth recovering method in the framework of MRF optimization for handling large holes across multiple image regions. Li et al. [10] segment the input low-resolution depth image into several regions with different labels which correspond to a high-resolution counterpart on training images and formulate depth superresolution as an MRF-based patchwork assembly problem. Hui and Ngan [11] propose a variational-based depth map enhancement by fusing the depth maps from the active sensor of a moving RGBD system and the depth cues from an induced optical flow. Yang et al. [12] propose a regression method for enhancing depth images using RGB-D data, which first fits the regression model for depth images and then designs pixel-wise regression predictors using the similarity of depth images and the accompanying color images.

2.3. Sparse Representation Methods. Schuon et al. [13] combine several low-resolution noisy depth images of a static scene from slightly displaced viewpoints and merge them into a high-resolution depth image with ToF calibration data. The depth superresolution problem is formulated as an optimization of a data reconstruction term plus a sparsity term of the spatial gradient for separating noise from features. Li et al. [14] propose a joint example-based depth map superresolution method which reconstructs high-resolution depth images by learning a mapping function from a set of training samples of an image database. Kiechle et al. [15] propose a joint intensity and depth cosparse model for depth map superresolution, by assuming that the cosupports of corresponding intensity and depth image structures are aligned. Zheng et al. [16] propose constructing multiple dictionaries with different structures and different numbers of atoms for the sparse representation of each low-resolution patch of depth images. Xie et al. [17] learn a coupled dictionary with local coordinate constraints and incorporate an adaptively regularized shock filter to sharpen the edges, implementing both depth superresolution and depth denoising. Lu et al. [18] assemble similar RGBD patches into a low-rank matrix in order to prevent the noise or weak correlation between color and depth.

RGBD-Based Bilateral Filters
This section proposes the bilateral filtering phase, which preprocesses depth images by using bilateral filters with both depth and RGB restrictions. The phase consists of three steps. First, the depth of each singular point of a depth image is corrected by averaging depths over the neighborhood according to a rule based on both RGB comparison and depth histogram difference. Second, the depth of each point on the silhouettes is corrected by averaging depths over the neighborhood according to a rule based on both RGB comparison and depth difference. Third, the depth of the other points is corrected using traditional bilateral filters, that is, by averaging depths over the spatial neighborhood.
Let $I_p$ be the intensity/depth value at the pixel $p$ of an intensity/depth image. Traditional filters at the pixel $p$ with respect to its neighborhood $\Omega_p$ are given by

$$\tilde{I}_p = \frac{1}{W_p} \sum_{q \in \Omega_p} f(p, q)\, g(\|I_p - I_q\|_2)\, I_q, \qquad (1)$$

where $p = (i, j)$ with $i$, $j$ representing the row index and the column index of $p$ within the image, respectively, $W_p$ denotes the normalization term, $f(p, q)$ is the 2D Gaussian smoothing kernel (known as the domain term) which measures the closeness of the pixels, and $g(\|I_p - I_q\|_2)$ is the 1D Gaussian smoothing kernel (known as the range term) which measures the similarity of RGB/depth values of the pixels in RGB/depth images.
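As an illustration of the traditional filter (1), the following is a minimal sketch for a single-channel image stored as a nested list. The function name, the window radius, and the two Gaussian widths `sigma_s` and `sigma_r` are assumptions made for this example and are not taken from the parameter settings of this paper.

```python
import math

def bilateral_filter(img, radius=1, sigma_s=1.0, sigma_r=10.0):
    """Traditional bilateral filter on a 2D list of floats (one channel).

    radius, sigma_s and sigma_r are illustrative values, not the paper's.
    """
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc, norm = 0.0, 0.0
            # Square neighborhood of p = (i, j), clipped at the image border.
            for u in range(max(0, i - radius), min(rows, i + radius + 1)):
                for v in range(max(0, j - radius), min(cols, j + radius + 1)):
                    # Domain term: spatial closeness of p and q = (u, v).
                    w_s = math.exp(-((i - u) ** 2 + (j - v) ** 2) / (2 * sigma_s ** 2))
                    # Range term: similarity of the two pixel values.
                    w_r = math.exp(-((img[i][j] - img[u][v]) ** 2) / (2 * sigma_r ** 2))
                    acc += w_s * w_r * img[u][v]
                    norm += w_s * w_r
            # norm is the normalization term; it is positive because q = p has weight 1.
            out[i][j] = acc / norm
    return out
```

With a small `sigma_r`, pixels across a sharp depth edge receive almost no weight, which is why the edge is preserved while flat regions are smoothed.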

Recovering Singular Points.
The singular points of a depth image are referred to as the points whose depth is undetected.
For each singular point $p$ of a depth image, we first set an initial depth value $D_p^{\text{initial}}$ at $p$ as the average of the depths in the neighborhood $\Omega_p$ of $p$ which are no smaller than $(1/2) D_{\Omega}^{\max}$, where $D_{\Omega}^{\max}$ is the maximum depth over $\Omega_p$. We choose this initial value because experimental results indicate that the depths over the neighborhood of a singular point generally belong to two intervals: one determined by the depths of the target hand and one determined by the depths of the background. Experimental results show that $(1/2) D_{\Omega}^{\max}$ separates the two intervals well and hence gives an initial approximation of the depth at $p$. We then choose a suitable subset of $\Omega_p$ whose histogram of depth is much greater than the histogram of depth at $p$. Choosing such a subregion of the neighborhood of $p$ is preferable to directly choosing a spatial neighborhood of $p$, because the RGB comparison improves the confidence of the pixels and because the histogram of the depth avoids producing shadows within the silhouette; we illustrate this improvement in Figure 2. Finally, we select the filtered value of the depth at the pixel $p$ by averaging the depth data with respect to both the domain term and the range term, provided that there are enough confidence pixels; otherwise the singular point keeps its value. Such untreated singular points are far fewer than before and are dispersed within the depth image, and can hence easily be treated in the second phase. That is,

$$\tilde{D}_p = \begin{cases} \dfrac{1}{W_p} \sum_{q \in \tilde{\Omega}_p} f(p, q)\, g(\|I_p - I_q\|_2)\, D_q, & \text{if } \tilde{\Omega}_p \text{ contains enough pixels}, \\ D_p, & \text{otherwise}, \end{cases} \qquad (2)$$

where $I_p$, $D_p$ denote the RGB and depth values at the pixel $p$ of an image, respectively, $W_p$ denotes the normalization term, and $\tilde{\Omega}_p$ denotes the confidence subset of a square neighborhood $\Omega_p$ of $p$ with both a depth difference restriction and an intensity similarity restriction, with $\mathrm{hist}_{\Omega}(d) := |\{(i, j) \in \Omega : D_{i,j} = d\}|$ denoting the number of pixels within $\Omega$ whose depth value is equal to $d$.
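The two quantities used above, the depth histogram over a neighborhood and the initial depth of a singular point, can be sketched as follows. This is only an illustration: the neighborhood is passed as a flat list of depth values, and the function names are hypothetical.

```python
from collections import Counter

def depth_histogram(neighborhood):
    """hist(d): number of pixels in the neighborhood whose depth equals d."""
    return Counter(neighborhood)

def initial_depth(neighborhood):
    """Average of the depths that are no smaller than half the maximum depth.

    The half-maximum threshold is the rule stated in the text; it is meant
    to separate the hand interval from the background interval.
    """
    d_max = max(neighborhood)
    kept = [d for d in neighborhood if d >= 0.5 * d_max]
    return sum(kept) / len(kept)
```

For example, a neighborhood that mixes depths near 600 and near 1500 has its threshold at half of the largest value, so only the larger group contributes to the initial value.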

Removing Incorrect Points on Silhouettes.
For each nonsingular point $p$ located on the silhouettes of depth images, we modify the depth value at $p$ by first choosing a suitable subset of $\Omega_p$ whose depth is much greater than the depth of $p$ and whose RGB values are similar to those of $p$, and then averaging the depth data within the background domain (corresponding to the restriction of depth difference) with RGB values similar to those at $p$, provided that there are enough such pixels. That is,

$$\tilde{D}_p = \begin{cases} \dfrac{1}{W_p} \sum_{q \in \tilde{\Omega}_p} f(p, q)\, g(\|I_p - I_q\|_2)\, D_q, & \text{if } \tilde{\Omega}_p \text{ contains enough pixels}, \\ D_p, & \text{otherwise}, \end{cases} \qquad (3)$$

where $I_p$, $D_p$, $W_p$, $f(\cdot, \cdot)$, and $g(\cdot)$ have the same meanings as in (2) and $\tilde{\Omega}_p = \{q \in \Omega_p : D_q \ge D_p + \tau_5,\ \|I_p - I_q\|_2 \le \tau_6\}$ denotes the confidence subset of a square neighborhood $\Omega_p$ of $p$ with both a depth difference restriction and an intensity similarity restriction.

The remaining nonsingular points over nonsilhouette regions are treated by the filtering (4), that is, by traditional bilateral filters which average depths over the spatial neighborhood. Unlike the filtering (2) and (3), where both of the neighborhoods are determined by a depth difference restriction and an intensity similarity restriction, the filtering (4) simply selects a square neighborhood without additional restrictions. This is because the filtering (2) and (3) treat singular points and incorrect points on silhouettes, which require careful treatment, whereas the filtering (4) performs a smoothing effect over the whole square neighborhood of the point which needs smoothing.

Figures 2 and 3 show some comparison results of depth images using traditional filters [6] and the proposed filters (2)-(4). In short, the proposed filters recover singular points using neighboring points with a quantity restriction (2) and correct artifacts using neighboring points with an intensity restriction (3), while traditional bilateral filters accomplish such tasks by directly averaging points within the spatial neighborhood. Therefore, traditional filters always produce new artifacts, while the proposed filters either correct the artifacts over silhouettes or reduce them to a small number without creating new ones, so that they can be easily treated in the next dictionary learning phase.
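The silhouette correction described above can be sketched for a single pixel as follows. The confidence subset keeps neighbors that are clearly deeper than the pixel and similar in color, as stated in the text; the function name, the Gaussian widths, and the parameters `t_depth`, `t_rgb`, and `min_count` are hypothetical stand-ins for the thresholds of the paper, whose values are not reproduced here.

```python
import math

def correct_silhouette_pixel(depth, rgb, i, j, radius, t_depth, t_rgb, min_count,
                             sigma_s=2.0, sigma_r=20.0):
    """Return a corrected depth for pixel p = (i, j).

    depth: 2D list of floats; rgb: 2D list of (r, g, b) tuples.
    All thresholds and Gaussian widths are illustrative assumptions.
    """
    rows, cols = len(depth), len(depth[0])
    acc, norm, count = 0.0, 0.0, 0
    for u in range(max(0, i - radius), min(rows, i + radius + 1)):
        for v in range(max(0, j - radius), min(cols, j + radius + 1)):
            rgb_dist = math.dist(rgb[i][j], rgb[u][v])
            # Confidence subset: depth difference and RGB similarity restrictions.
            if depth[u][v] >= depth[i][j] + t_depth and rgb_dist <= t_rgb:
                w_s = math.exp(-((i - u) ** 2 + (j - v) ** 2) / (2 * sigma_s ** 2))
                w_r = math.exp(-(rgb_dist ** 2) / (2 * sigma_r ** 2))
                acc += w_s * w_r * depth[u][v]
                norm += w_s * w_r
                count += 1
    # Too few confident neighbors: leave the depth unchanged.
    if count < min_count:
        return depth[i][j]
    return acc / norm
```

Because a pixel can never satisfy the depth difference restriction against itself when `t_depth` is positive, the corrected value is built only from its neighbors.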

Hand Depth Denoising Using Noise-Aware Dictionaries
This section proposes the dictionary-based denoising phase, which follows the bilateral filtering phase given in Section 3.
Recovering the original image from a degraded image with additive white Gaussian noise can be modelled by solving the ill-posed system $Y = X + k$, where $Y$ denotes the degraded image, $X$ denotes the original image, and $k$ denotes the noise term. Sparse representation is an important tool for solving such an ill-posed system. According to sparse representation, natural images can be represented by a linear combination of the elements of an overcomplete basis (known as a dictionary) with very few nonzero coefficients. Therefore, by denoting $D \in \mathbb{R}^{n \times K}$ to be such a dictionary whose columns are basis vectors (known as atoms) with $K \gg n$, the original image can be obtained by imposing an $\ell_0$-norm constraint on the coefficients $\alpha$ over the system $Y = D\alpha + k$. In particular, we select image patches of size $\sqrt{n} \times \sqrt{n}$ pixels randomly from $Y$ as training data and order each patch lexicographically as a column vector $Y_{ij} \in \mathbb{R}^n$. Then we obtain a trained dictionary $D$ and sparse coefficients $\alpha_{ij}$ simultaneously via the $\ell_0$-minimization (5) of all coefficients of training data, and the original image is reconstructed patch by patch via (6). Because (5) and (6) are both nonconvex problems involving $\ell_0$-norm minimization, developing efficient algorithms for solving them is an important task. Elad and Aharon [19] propose the K-SVD algorithm to solve (5), by constructing a dictionary iteratively from training signals with a sparse coding phase and an atom update phase. Within the sparse coding phase of K-SVD and for the optimization problem (6), OMP is a greedy algorithm frequently used for solving the $\ell_0$-norm minimization. While traditional K-SVD denoises traditional images well for zero-mean additive Gaussian noise, it is not well suited for denoising depth images with nonzero-mean noise, because traditional OMP can hardly separate noise from the noiseless image when the amplitude of the noise has an irregular distribution. To remove such noise from depth images more effectively, traditional OMP is improved by modifying the amplitude of entries of the residual whenever such entries have large amplitude and are few in number.
Figure 4 illustrates this idea. The $y$-axis represents the value of each component of the residual while the $x$-axis represents the index of each component. Let $r_{k-1}$ be the residual obtained in an iteration of OMP. The left subfigure of Figure 4 shows how traditional OMP represents residuals when the noise is zero-mean Gaussian. Although $r_{k-1}$ contains four components heavily contaminated by noise (denoted by blue cubes, i.e., the original training data $Y$ has noise terms in the positions of those four components), the least square fitting, obtained when computing the sparse coefficients, approximates the noiseless components well. Hence most of the noiseless components vanish at the next iteration, and the stopping criterion on the number of sparse coefficients is satisfied before the noise terms are decomposed with respect to the current atoms. The middle subfigure of Figure 4 shows how traditional OMP fails to remove noise from depth images when the noise is nonzero-mean. In this case, the least square fitting deviates from most noiseless terms and hence the noisy terms begin to be decomposed in the next iteration, because neither the criterion on the number of sparse coefficients nor the criterion on the residual amplitude is satisfied. From this step on, the new atoms obtained are contaminated by noise. The right subfigure of Figure 4 shows how NAOMP addresses this issue. When the current residual $r_{k-1}$ contains entries of greater magnitude and those entries are relatively few, it is reasonable to believe that those entries correspond to the noisy components of the training data. In this case, each of those components of $r_{k-1}$ is modified by reevaluating it as the amplitude of the corresponding component on the fitting line. Then the updated residual is redecomposed over atoms. By doing so, the least square fitting approximates the noiseless terms of residuals and weakens the effect of noisy terms. We illustrate the idea using numerical examples in the Appendix.
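For reference, the traditional OMP iteration discussed above (select the atom with the largest inner product with the residual, refit by least squares, update the residual) can be sketched in plain Python. This is a generic textbook form, not the listing of Algorithm 2; atoms are assumed to have unit norm and to be linearly independent on the selected support, and the helper names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    """Solve the small square system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def omp(atoms, y, max_atoms, tol=1e-10):
    """atoms: list of unit-norm vectors. Returns (support, coeffs, residual)."""
    residual, support, coeffs = list(y), [], []
    while len(support) < max_atoms and dot(residual, residual) > tol:
        # Atom selection: largest absolute inner product with the residual.
        best = max((j for j in range(len(atoms)) if j not in support),
                   key=lambda j: abs(dot(atoms[j], residual)))
        support.append(best)
        # Least-squares fit of y on the selected atoms via the normal equations.
        gram = [[dot(atoms[s], atoms[t]) for t in support] for s in support]
        rhs = [dot(atoms[s], y) for s in support]
        coeffs = solve(gram, rhs)
        approx = [sum(c * atoms[s][i] for c, s in zip(coeffs, support))
                  for i in range(len(y))]
        residual = [yi - ai for yi, ai in zip(y, approx)]
    return support, coeffs, residual
```

The two stopping tests in the loop correspond to the criterion on the number of sparse coefficients and the criterion on the residual amplitude mentioned in the discussion of Figure 4.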
The NAOMP algorithm is given in Algorithm 3. After performing traditional OMP (lines (4)-(9)), the number of components which are relatively greater than the others is checked. If this number is small, the current residual is reevaluated (line (12)) so that the values of the greater components approach those of the other components. Then the atom and the residual are updated again as in OMP (lines (13)-(17)). Detailed parameter settings are given in Section 6.
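The residual reevaluation step can be illustrated by the following loose sketch. It is not the listing of Algorithm 3: the rule for deciding which components are "relatively greater" (a multiple of the median magnitude) and the bound on how many such components count as "small" are assumptions introduced here, as are the names `ratio` and `max_fraction`.

```python
def reduce_residual(residual, fit, ratio=3.0, max_fraction=0.25):
    """Replace a few dominant residual entries by the fitted values.

    residual: current residual vector (list of floats)
    fit: least-squares fit of the residual on the current atoms, same length
    ratio, max_fraction: hypothetical thresholds, not the paper's parameters
    """
    magnitudes = sorted(abs(r) for r in residual)
    median = magnitudes[len(magnitudes) // 2]
    # Components whose magnitude is much greater than that of the others.
    large = [i for i, r in enumerate(residual) if abs(r) > ratio * median]
    # Only a small number of dominant entries is regarded as noise.
    if not large or len(large) > max_fraction * len(residual):
        return list(residual)
    reduced = list(residual)
    for i in large:
        # Reevaluate the entry as the value of the fit at that position.
        reduced[i] = fit[i]
    return reduced
```

After this step the reduced residual would be decomposed again over the atoms, so that the least square fitting is driven by the noiseless components rather than by the few dominant ones.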

Hand Depth Image Superresolution Using Noise-Aware Dictionaries
We apply NAOMP to hand depth image superresolution, where we use NAOMP for joint dictionary training. The main idea is similar to the work of [20]; hence we only describe the part of our work that differs (the dictionary training algorithm) without giving details. We randomly select patches of virtual hand depth images as a training set, each of which is stacked as a vector, denoted by $x_i$, $i \in \Omega_{\text{train}}$. Then we obtain the corresponding downsampled version $y_i$ and form the pairwise training set $\{(x_i, y_i) : i \in \Omega_{\text{train}}\}$. We denote by $D_h$ and $D_l$ the joint dictionaries, which give the sparse representations for high-resolution and low-resolution image patches, respectively. The joint dictionary is given by a minimization problem over $D_h$, $D_l$, and the sparse coefficients. By rewriting the first and the second terms of its objective as a single 2-norm term, we obtain an optimization model similar to that of the last section and solve it using the NAOMP-based K-SVD algorithm. To recover the high-resolution patch of an input image $Y$, we find a sparse representation of each patch of $Y$ with respect to $D_l$ and then obtain the corresponding high-resolution patch by combining the high-resolution atoms of $D_h$ with the same sparse coefficients.
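The step of merging the two reconstruction terms into one 2-norm term can be illustrated as follows, assuming for simplicity that both terms are unweighted squared 2-norms (any weighting used in the actual model is not reproduced here). Stacking the high-resolution and low-resolution dictionaries row-wise, and stacking the two patch vectors in the same way, gives a single residual with the same squared norm. All names are hypothetical, and dictionaries are stored as lists of rows.

```python
def matvec(D, a):
    """Product D a, where D is a list of rows and a is a coefficient vector."""
    return [sum(d * x for d, x in zip(row, a)) for row in D]

def sq_norm(v):
    return sum(x * x for x in v)

def stack(D_h, D_l):
    """Joint dictionary: rows of D_h on top of rows of D_l."""
    return D_h + D_l

def joint_error(D_h, D_l, x_h, y_l, alpha):
    """Squared 2-norm of the stacked residual [x_h; y_l] - [D_h; D_l] alpha."""
    stacked = x_h + y_l
    approx = matvec(stack(D_h, D_l), alpha)
    return sq_norm([s - a for s, a in zip(stacked, approx)])

def reconstruct_high_res(D_h, alpha):
    """High-resolution patch from coefficients computed with respect to D_l."""
    return matvec(D_h, alpha)
```

Because the same coefficient vector multiplies both dictionaries, coefficients found on the low-resolution side can be reused directly with the high-resolution atoms, which is what the recovery step relies on.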

Experimental Results
The experimental results are given in this section. The experiments are run on a Core 2 Quad Q6600 2.4 GHz machine with 2 GB RAM using Visual Studio 2010. The proposed filters are given by (2)-(4), and NAOMP is given by Algorithm 3. All the parameters in traditional filters [6], the proposed filters (2)-(4), OMP-based K-SVD [19] (Algorithms 1 and 2), and NAOMP-based K-SVD (Algorithms 1 and 3) are given in Table 1. One can see from Table 1 that traditional OMP-based K-SVD has to select different values for the residual threshold according to the intensity of noise, while NAOMP-based K-SVD selects a single value. In fact, according to [19], the denoising effect heavily depends on the choice of the residual threshold. Such a threshold is difficult to determine for OMP-based K-SVD when no a priori information about the noise is given, while this threshold is fixed and easily determined in NAOMP-based K-SVD.

Denoising Virtual Hand Depth Images.
Artificial Gaussian noise is added to three virtual hand models and the images are denoised using OMP-based K-SVD and NAOMP-based K-SVD, both without filter preprocessing. Qualitative results are given in Figure 5 (with 0.5%, 2%, and 5% of noise) and quantitative results are given in Table 2. We see that traditional OMP fails to recover the wrist part of the models while NAOMP treats it well. Moreover, in all cases the proposed method provides a higher PSNR than OMP-based K-SVD, except for the first example with 0.5% noise.
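For readers who wish to reproduce a comparison of this kind, PSNR can be computed as in the sketch below, using the standard definition based on the mean squared error. The peak value is passed explicitly because the peak used for the depth images in Table 2 is not stated here; the function name is hypothetical.

```python
import math

def psnr(reference, estimate, peak):
    """Peak signal-to-noise ratio in dB between two 2D lists of equal shape.

    peak is the maximum possible pixel value and must be supplied by the
    caller; it is an assumption of this sketch, not a value from the paper.
    """
    squared_error, count = 0.0, 0
    for row_ref, row_est in zip(reference, estimate):
        for a, b in zip(row_ref, row_est):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)
```

A higher value means that the denoised image is closer to the noise-free reference.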

Denoising Hand Depth Images Obtained from Kinect v2.
We show qualitative results of denoising and superresolution of six hand depth images obtained from the Kinect sensor v2 in Figure 6 and give the average running time of the different approaches in Table 3. The comparison includes five approaches: traditional bilateral filters [6], OMP-based K-SVD [19], NAOMP-based K-SVD, the bilateral filters (2)-(4) plus OMP-based K-SVD, and the bilateral filters (2)-(4) plus NAOMP-based K-SVD. In general, both bilateral filters preprocess depth images well in that singular points are removed (the different effects of the two filters are shown in Section 3 using Figures 2 and 3). Moreover, one can see from the fifth and sixth columns that NAOMP-based K-SVD produces clearer silhouettes than OMP-based K-SVD, as the dictionaries trained with NAOMP remove noisy terms well.

Discussions.
The two-stage depth image denoising and superresolution method enhances hand depth images well mainly for the following two reasons. First, the proposed bilateral filters choose suitable neighborhood pixels with more refined depth and RGB restrictions than traditional filters. Second, NAOMP modifies the noisy terms of the residual in each atom updating step so that they are reevaluated with values close to those of the noiseless terms. This lets the residual be decomposed with less influence from noise and results in dictionaries which are less contaminated by noise.
It should be noted that the proposed method may fail in denoising and superresolution of depth images of objects other than human hands. This is because, while the skin of human hands has RGB values within a small range, other objects do not have this property, which makes the proposed bilateral filters fail to select a suitable neighborhood for the filter functions.
In future work, the proposed image denoising and superresolution framework shall be developed for enhancing depth images of more complex objects or scenes. More accurate RGB/depth restrictions can be designed for the filter functions to preprocess depth images, so that the remaining singular points and artifacts are more dispersed. Furthermore, the modification of residuals within NAOMP can be improved so that the updated atoms represent the noiseless data more precisely.

B. OMP versus NAOMP: The Second Example
be an input signal, a faithful signal, respectively, where k = ( 2 , 0, 0, 0) ⊤ denotes the noise term and where $d_1$, $d_3$ are atoms of the following dictionary: According to the expression of X, the best choice of atoms is $\{d_1, d_3\}$. Suppose that 2 <  < 10,  > 0. Let us recover the signal using OMP and NAOMP, respectively, both with two iteration steps. For OMP, in the first step, the atom which takes the maximum inner product with $r_0 = Y$ is $d_1$, and the residual is accordingly given by where $x_1 = \arg\min_{x \in \mathbb{R}} \|r_0 - x d_1\|_2$. In the second step, using simple calculation and the inequalities  > 2,  > 0, the atom which takes the maximum inner product with $r_1$ is $d_4$. Finally we recover the signal by

Figure 5: Denoising additive Gaussian noise using traditional OMP and NAOMP (both without the filtering phase). Left: hand depth images with artificial noise. Middle: denoised images using traditional OMP. Right: denoised images using NAOMP. Rows 1-3: 0.5%, 2%, and 5% of noise for hand 1; rows 4-6: 0.5%, 2%, and 5% of noise for hand 2; rows 7-9: 0.5%, 2%, and 5% of noise for hand 3.

Figure 1: Color/depth images of human hands captured from Kinect sensor v2.

Figure 2: Left to right: noisy depth images, images preprocessed by traditional filters, images preprocessed by the proposed filters (2)-(4), and comparison details. The details show that the proposed filters (2)-(4) either remove the artifacts on silhouettes or leave only a small number of them, while traditional filters turn the artifacts into a shadow effect. Although those shadow artifacts have smaller depth values, the continuous region in which they lie makes them difficult to remove in the dictionary learning phase.
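$$\min_{D,\,\alpha_{ij}} \sum_{(i,j) \in \Omega_{\text{train}}} \|\alpha_{ij}\|_0 \quad \text{s.t.} \quad D\alpha_{ij} \approx Y_{ij} \quad \forall (i, j) \in \Omega_{\text{train}}, \qquad (5)$$

where $\Omega_{\text{train}}$ denotes the training set of indices of patches of $Y$ we randomly select. The original image $X$ is finally reconstructed by assembling all image patches $X_{ij} = D\alpha_{ij}$ with $\alpha_{ij}$ given by

$$\min_{\alpha_{ij} \in \mathbb{R}^K} \|\alpha_{ij}\|_0 \quad \text{s.t.} \quad D\alpha_{ij} \approx Y_{ij} \quad \forall (i, j). \qquad (6)$$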

Figure 4: The difference between how OMP and NAOMP work on decompositions of residuals with nonzero-mean noise. Left: traditional OMP for zero-mean Gaussian noise. Middle: traditional OMP for nonzero-mean noise. Right: NAOMP for nonzero-mean noise. The $x$-axis: the index of each component of the residual; the $y$-axis: the value of each component of the residual. The residual in this figure is a vector of length sixteen. From top to bottom: a procedure for updating a residual and an atom.

Table 3: Average running time comparison of six models in Figure 6.

than the error of OMP. Moreover, NAOMP selects the correct atoms $\{d_1, d_3\}$ while OMP selects the atoms $\{d_1, d_4\}$.