Dataset Denoising Based on Manifold Assumption

Learning the knowledge hidden in the manifold-geometric distribution of a dataset is essential for many machine learning algorithms. However, the geometric distribution is usually corrupted by noise, especially in high-dimensional datasets. In this paper, we propose a denoising method that captures the "true" geometric structure of a high-dimensional nonrigid point cloud dataset by a variational approach. First, we improve the Tikhonov model by adding a local structure term so that the variational diffusion acts on the tangent space of the manifold. Then, we define a discrete Laplacian operator by graph theory and obtain the optimal solution from the Euler–Lagrange equation. Experiments show that our method removes noise effectively on both synthetic scatter point cloud datasets and real image datasets. Furthermore, as a preprocessing step, our method improves the robustness of manifold learning and increases the accuracy rate in classification problems.


Introduction
Since objects vary gradually in the real world, the manifold assumption states that the data points depicting the states of an object should be distributed on a smooth low-dimensional manifold embedded in a high-dimensional observation space [1]. The dimensions of the manifold correspond to the key factors that control the variation of the object's state. For example, in Figure 1, the images of the rotating duck toy are distributed on a one-dimensional manifold (a curve) embedded in high-dimensional pixel space. Each image depicts a particular state of the duck. Although the pixel values change dramatically across these images, humans can easily discover that they are controlled by one key factor: the rotation of the duck.
Learning the knowledge hidden in the manifold-geometric distribution of a high-dimensional dataset is essential for many machine learning algorithms. For example, manifold learning algorithms aim to discover the nonlinear geometric structure of a dataset by preserving different local geometric properties [3][4][5][6][7][8]. The embedding results can be further used in data visualization, motion analysis, and classification [9,10]. Moreover, much research takes the manifold assumption as a constraint in its objective function [11,12]. It is worth noting that the manifold assumption has recently been applied to explain why deep learning works well [13][14][15]. This research indicates that deep learning can capture the manifold structure of one kind of knowledge through powerful nonlinear mappings.
However, noise is inevitable in data acquisition. For example, in Figure 1, the noiseless images of the rotating duck toy (red points) should lie on a curve embedded in pixel space. However, due to a long exposure time and camera shake, the duck becomes "brighter" and "smaller" in one image. The corresponding noisy data point, marked "N" in green in Figure 1, does not lie on the curve because the pixel values change dramatically in the noisy image.
Noise makes machine learning models fragile and hard to train. For example, outlier points are difficult to handle in classification and clustering tasks; the machine learning model must become more complex to obtain proper results [13]. In manifold learning algorithms, noisy points make it difficult for the recovered embeddings to capture the true manifold-geometric distribution of the dataset. The reason is that the "short circuit" phenomenon easily arises in a noisy dataset, which destroys the local linear structure of the manifold [16].
In this paper, we propose a novel denoising method based on the manifold assumption. Our aim is to obtain the data points that lie on the noiseless manifold from the noisy data points. Compared with existing denoising methods, our method has two contributions worth highlighting: (1) Our method makes use of the manifold-geometric distribution information of the dataset. Therefore, it works on a dataset rather than on a single data point.
(2) Our method improves the Tikhonov model to make the variational diffusion on the tangent space of the manifold for a high-dimensional nonrigid point cloud dataset.
Our method captures the "true" geometric structure of the noisy dataset. After denoising, the key factors that control the geometric distribution of the dataset are maintained and the characteristics of individual points are removed as noise. As a preprocessing step, our method can improve the robustness of manifold learning and increase the accuracy rate in classification problems. The rest of the paper is organized as follows: a brief review of research on the manifold assumption is given in Section 1. Section 2 describes the motivation and details of the proposed method. In Section 3, experiments are conducted on both synthetic and real data to evaluate our method. Section 4 presents concluding remarks and a discussion of future work.

Related Work
Existing denoising methods usually target the noise within a single data point, such as Gaussian noise or salt-and-pepper noise [17,18] in an image. However, these methods cannot deal with noise that distorts the geometric distribution of the dataset, such as the noisy duck toy image (the green point) caused by a longer exposure time and camera shake in Figure 1.
Only a few studies address this problem. Gong et al. [19] proposed a local linear denoising method. This method removes noise by first projecting noisy data points onto the tangent space of the manifold, which is estimated by principal component analysis. Then, the locally denoised patches are aligned to obtain the globally denoised dataset. However, the principal components may be distorted because they are calculated from the neighborhoods of noisy data points, which can lead to wrong denoising results. Hao et al. [16] also utilized principal component analysis and a projection method to find the noiseless data points; therefore, it suffers from the same problem. Moreover, many machine learning methods propose noise-resistant models for outliers but do not discuss denoising as an independent problem [7,20]. For example, Zhang et al. [7] proposed an adaptive neighborhood selection method using a shrink-and-expand strategy to resist noise in the neighborhoods of the manifold.
In this paper, we propose a denoising method for the dataset.
This method improves the Tikhonov model by adding a local structure term. The optimal solution is obtained by minimizing the objective function through a variational diffusion approach.

Proposed Approach
Let F = {f(1), f(2), . . . , f(m)} be the noisy dataset, where f(x) ∈ R^D is the x-th data point in F and D is the dimension of f(x). Let U = {u(1), u(2), . . . , u(m)} be the noiseless dataset we want to obtain, where u(x) ∈ R^D is the x-th data point in U. Then f(x) = u(x) + ξ(x), where ξ(x) ∈ R^D is the noise in f(x). The goal is to recover U from F. We illustrate our method in three steps: first, we introduce the inspiration and motivation; then, we construct the objective function by improving the Tikhonov model; finally, we optimize the objective function and obtain the solution by substituting discrete operators.

Inspiration and Motivation.
The manifold assumption claims that the noiseless data points u(x) that depict the object state (the blue points in Figure 2) should lie on a smooth manifold U (the blue surface in Figure 2) embedded in the observation space. However, the noisy points f(x) (red points) are distributed on the noisy manifold F. The denoising problem is how to obtain u(x) on U from f(x) on F.

Objective Function.
The objective function is formulated in this part. First, we briefly review the Tikhonov model for image denoising, which is similar to our problem. Then, the challenge of our problem is shown. Finally, we improve the Tikhonov model and construct the objective function for our problem.

In image denoising, let f(x, y) = u(x, y) + ξ(x, y), where f(x, y) and u(x, y) are the pixel values at row x and column y in the noisy and noiseless image, respectively, and ξ(x, y) is the noise. In Figure 2, if we regard the x, y, and z coordinates of f(x) as the row number, column number, and pixel value, then the red manifold F depicts the pattern of the noisy image. Therefore, the image denoising problem is to find a noiseless image U from F. The Tikhonov model is one of the most classical variational models for this problem [21]:

E(u) = ∫_Ω (u − f)² dx + α ∫_Ω |∇u|² dx, (1)

where Ω is the image domain and dx is the area element (pixel) in Ω. ∇u is the gradient of u(x). The first term ∫_Ω (u − f)² dx is the "data term", which measures the Euclidean distance between F and U. The second term ∫_Ω |∇u|² dx is the "smooth term", which measures the noise strength of U. Since these two terms have opposite effects, the parameter α balances them. If α is small, U is close to F but the noise strength is large. On the other hand, if α is large, the noise becomes small but the image pattern of U is "unlike" F.
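To make the Tikhonov model concrete, the following sketch minimizes the energy above by gradient descent on a small synthetic image. The 5-point Laplacian discretization, step size, iteration count, and Neumann (replicated) boundaries are our illustrative choices, not details from the paper.

```python
import numpy as np

def tikhonov_denoise(f, alpha=0.5, step=0.1, iters=200):
    """Minimize E(u) = (u - f)^2 + alpha * |grad u|^2 by gradient descent.

    The energy gradient is (u - f) - alpha * Laplacian(u); the 5-point
    Laplacian with replicated boundaries is a standard discretization."""
    u = f.copy()
    for _ in range(iters):
        up = np.roll(u, -1, axis=0);    up[-1, :] = u[-1, :]
        down = np.roll(u, 1, axis=0);   down[0, :] = u[0, :]
        left = np.roll(u, -1, axis=1);  left[:, -1] = u[:, -1]
        right = np.roll(u, 1, axis=1);  right[:, 0] = u[:, 0]
        lap = up + down + left + right - 4.0 * u
        u = u - step * ((u - f) - alpha * lap)
    return u

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))    # smooth ramp image
noisy = clean + 0.2 * rng.standard_normal(clean.shape)
denoised = tikhonov_denoise(noisy)
print(np.mean((noisy - clean) ** 2) > np.mean((denoised - clean) ** 2))
```

A larger `alpha` smooths more aggressively, reproducing the trade-off described above: small α keeps U close to F but noisy; large α suppresses noise but blurs the image pattern.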

e Challenge of Our Problem.
In the image denoising problem, the gradient operator is defined as [21]

∇u = (u(x + 1, y) − u(x, y), u(x, y + 1) − u(x, y)). (2)

When minimizing the "smooth term" ∫_Ω |∇u|² dx in (1), the pixel values in the image become the same, whereas the image grid does not change since x and y are fixed.
However, in our problem, the dataset consists of nonrigid, high-dimensional cloud points. Let u(x) = [u(x)₁, u(x)₂, . . . , u(x)_D] ∈ R^D be a data point, where D is the dimension of u(x). Suppose

N_u(x) = {u(y₁), u(y₂), . . . , u(y_k)} (3)

is the neighborhood of u(x), determined by the KNN method. Naturally, the gradient operator is defined as

∇u = [u(x) − u(y₁), u(x) − u(y₂), . . . , u(x) − u(y_k)]ᵀ. (4)
Therefore, the "smooth term" in (1) becomes

∫ |∇u|² dx = Σ_x Σ_{u(y_i) ∈ N_u(x)} ‖u(x) − u(y_i)‖². (5)

When minimizing the objective function, this "smooth term" makes u(x) and u(y_i) become the same point. Therefore, the "cluster" phenomenon arises in the dataset: some points are pulled close together and the other points are pushed away. As a result, the geometric structure of the manifold U (the blue surface in Figure 2) shrinks to a few point clusters rather than becoming smooth.
Therefore, the Tikhonov model cannot be applied directly to solve our problem.

Our Objective Function.
To deal with this problem, we maintain the geometric distribution of U by keeping the tangent linear structure when minimizing the objective function. Since the neighborhood of the manifold can be regarded as tangent space (the blue plane in Figure 3), we make the neighborhood structure of U the same as that of F. The weight of the local linear representation is utilized to depict the geometric structure of the neighborhood.
The weight W_f of data point f(x) is defined as

W_f = argmin_W ‖f(x) − Σ_{i=1}^{k} W_i f(y_i)‖²,  s.t.  Σ_{i=1}^{k} W_i = 1, (6)

where f(y_i) ∈ N_f(x) and W_fi, the i-th component of W_f, is the weight between f(x) and f(y_i). Similarly, the linear representation weight of u(x) is defined as W_u. The local linear structure can be maintained if we set W_u the same as W_f. Then, f(x) can only move along the normal space of the manifold when minimizing the "smooth term" in the objective function, because the tangent geometric structure is fixed by W_u. Therefore, we add a "local structure term" to the Tikhonov model:

∫_Ω ( u(x) − ∫_{N_u(x)} W_fi u(y_i) dy )² dx, (7)

where ∫_{N_u(x)} W_fi u(y_i) dy is the linear reconstruction of u(x). Thus, our objective function is

E(u) = ∫_Ω (u − f)² dx + α ∫_Ω |∇u|² dx + β ∫_Ω ( u − ∫_{N_u(x)} W_fi u(y_i) dy )² dx, (8)

where α and β are balance parameters.
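The constrained least squares in (6) is the same weight computation used in locally linear embedding. The sketch below solves it for a single point; the ridge term `reg` is a small stabilizer we add for numerical conditioning, not a parameter from the paper.

```python
import numpy as np

def reconstruction_weights(x, neighbors, reg=1e-3):
    """Solve W = argmin ||x - sum_i W_i y_i||^2  s.t.  sum_i W_i = 1.

    Standard LLE-style constrained least squares via the local Gram
    matrix; `reg` regularizes G when k > D or neighbors are collinear."""
    Z = neighbors - x                  # shift neighbors to the query point
    G = Z @ Z.T                        # local k x k Gram matrix
    G = G + reg * np.trace(G) * np.eye(len(G))
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                 # enforce the sum-to-one constraint

rng = np.random.default_rng(1)
x = rng.standard_normal(5)             # a point in R^5
nbrs = rng.standard_normal((4, 5))     # its k = 4 neighbors
w = reconstruction_weights(x, nbrs)
print(round(w.sum(), 6))               # → 1.0
```

These weights are computed once on the noisy data F and then held fixed for U, which is what pins the tangent structure during the diffusion.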

Optimal Solution.
In this part, we obtain the optimal u by minimizing the objective function (8). The solution in continuous form is calculated first. Then, the discrete operators are defined and substituted to obtain a discrete solution.
Figure 2: Illustration of the idea of our method: obtain the noiseless blue points that lie on the smooth manifold (blue surface) from the noisy red points that are distributed on an irregular surface (the noisy manifold).

Solution in Continuous Form.
To obtain the optimal u, we calculate the derivative of (8) with respect to u by the variational approach and set it to zero:

2(u − f) − 2αΔu + 2β( u − ∫_{N_u(x)} W_fi u(y_i) dy ) = 0. (9)

Therefore, the Euler–Lagrange equation of u is

(u − f) − αΔu + β( u − ∫_{N_u(x)} W_fi u(y_i) dy ) = 0. (10)

Then,

u = (1 / (1 + β)) ( f + αΔu + β ∫_{N_u(x)} W_fi u(y_i) dy ), (11)

and the boundary condition is

∂u/∂n = 0 on ∂Ω. (12)

Solution in Discrete Form.
To obtain the discrete solution, we define the discrete Laplacian operator in (11) by spectral graph theory [22]. First, the gradient of u(x) is defined as

∇_wG u(x) = [ (u(y₁) − u(x))√w(x, y₁), (u(y₂) − u(x))√w(x, y₂), . . . , (u(y_k) − u(x))√w(x, y_k) ]ᵀ. (13)

This gradient is a k-dimensional vector because there are k data points in N_u(x). The subscript "wG" is an abbreviation of "weighted graph". W_d(x, y) is a weight vector whose component W_d(x, y_i) should be large if u(x) and u(y_i) are near and small if they are far apart. Therefore, we define W_d(x, y) as

w(x, y_i) = exp( −d(x, y_i)² / σ² ), (14)

where

d(x, y_i) = ‖u(x) − u(y_i)‖ (15)

is the Euclidean distance between u(x) and u(y_i). Consequently, the divergence of a vector v(x, y) is (the derivation is listed in the "Notice" at the end of this section)

div_wG v(x) = Σ_{y ∈ N_u(x)} √w(x, y) ( v(x, y) − v(y, x) ). (16)

Let v(x, y) = ∇_wG u(x, y) = (u(y) − u(x))√w(x, y); therefore, the discrete Laplace operator of u(x) can be defined as

Δ_wG u(x) = div_wG(∇_wG u)(x) = 2 Σ_{y ∈ N_u(x)} w(x, y)( u(y) − u(x) ). (17)

We plug the discrete Laplace operator into (11). The solution of our objective energy function (8) is

u^(k+1)(x) = (1 / (1 + β)) ( f(x) + α Δ_wG u^(k)(x) + β Σ_i W_fi u^(k)(y_i) ), (18)

where the superscripts k and k + 1 denote the iteration step. The initial value of u is set to f. The optimal u is obtained by iteration, which terminates when E(u) < ε, where E(u) is the objective function value and ε is a small tolerance we set. The boundary condition (12) can be ignored because the dataset consists of scattered, nonrigid cloud points.
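The whole pipeline can be sketched as the fixed-point iteration u ← (f + α·Lap(u) + β·Rec(u)) / (1 + β) on a k-NN graph. In this sketch the edge weights are row-normalized Gaussians and the reconstruction weights are uniform stand-ins for W_f; both are simplifying assumptions for brevity, since the paper computes W_f by the constrained least squares of (6).

```python
import numpy as np

def manifold_denoise(F, alpha=0.5, beta=1.0, k=8, iters=50):
    """Fixed-point sketch of the iterative update: each step mixes the
    data term f, a graph-Laplacian smoothing term, and a local linear
    reconstruction term, divided by (1 + beta)."""
    m = F.shape[0]
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbr = np.argsort(d2, axis=1)[:, :k]           # k nearest neighbors
    d2n = np.take_along_axis(d2, nbr, axis=1)
    w = np.exp(-d2n / np.median(d2n))             # Gaussian edge weights
    w /= w.sum(axis=1, keepdims=True)             # normalize for stability
    W = np.full((m, k), 1.0 / k)                  # uniform stand-in for W_f
    u = F.copy()
    for _ in range(iters):
        lap = (w[..., None] * (u[nbr] - u[:, None, :])).sum(1)  # graph Laplacian
        rec = (W[..., None] * u[nbr]).sum(1)      # local linear reconstruction
        u = (F + alpha * lap + beta * rec) / (1.0 + beta)
    return u

rng = np.random.default_rng(2)
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
clean = np.c_[np.cos(t), np.sin(t)]               # a circle: 1D manifold in 2D
noisy = clean + 0.05 * rng.standard_normal(clean.shape)
out = manifold_denoise(noisy)
print(np.mean((out - clean) ** 2) < np.mean((noisy - clean) ** 2))
```

On this toy circle the iteration pulls the noisy points back toward the manifold while the data term keeps the overall layout anchored to F.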
Notice: the divergence of a vector v in (16) can be derived as follows. On the weighted graph, the gradient and divergence operators are adjoint with respect to the inner products over vertices and edges, ⟨∇_wG u, v⟩ = −⟨u, div_wG v⟩. Expanding this identity with the gradient component (u(y) − u(x))√w(x, y) and rearranging the sums over the edges gives

div_wG v(x) = Σ_{y ∈ N_u(x)} √w(x, y) ( v(x, y) − v(y, x) ).

Therefore, (16) follows.

Experiments
In this section, we evaluate our algorithm on both a synthetic scatter point cloud dataset and real image datasets. Then, the method is utilized as a preprocessing step for manifold learning and classification tasks. The major parameters of our algorithm are (1) the neighborhood size k; (2) the smooth term weight α; and (3) the local structure term weight β.

Experiments on Synthetic 3D Scatter Cloud Data.
In this part, we test our algorithm on the classical "swiss roll" dataset.
The data points are randomly sampled from a 2D manifold embedded in 3D space like a swiss roll cake. Figures 4(a) and 4(b) in the first row show the noiseless and noisy datasets at the [−8, 10] and [0, 0] viewpoints, respectively. It is obvious that the noisy data points are distributed around the "swiss roll" manifold but do not lie exactly on it. Our goal is to recover the noiseless dataset in Figure 4(a) from the noisy dataset in Figure 4(b). In this experiment, we set the number of data points n = 1300, the KNN parameter k = 12, and the noise parameter NI = 1. The MATLAB code for the swiss roll dataset is listed in Table 1. The second, third, and fourth rows in Figure 4 show the denoising results of our method with (α, β) equal to (1, 1), (3, 1), and (0.3, 1), respectively. For ease of viewing, we show the denoised datasets at the [−8, 10] and [0, 0] viewpoints in the left and right columns. In the right column, it is easy to see that the denoised data points are close to the tangent space of the manifold compared with (b), which shows that our method is effective. Among them, (f) seems to be the best result because the denoised points are the nearest to the manifold compared with (d) and (h). However, the "cluster" phenomenon arises in the denoised dataset; some points are pulled close together and the other points are pushed away, which is easy to see in (e). The reason is that the large smooth parameter (α = 3) distorts the geometric distribution when minimizing the objective function. Conversely, the "cluster" phenomenon in (g) is not serious when we set the small parameter α = 0.3, but the noise remains large.
To conduct a quantitative comparison between the noisy and denoised datasets, we assess the quality of the denoised datasets by the mean square error (MSE) and the tangent distance error (TE). MSE is a widely used index that measures the average squared Euclidean distance between two datasets:

MSE = (1/N) Σ_{i=1}^{N} ‖u_i − u_i*‖²,

where N is the number of points in the dataset and u_i and u_i* are a noisy data point and the corresponding noiseless data point.
The tangent distance error (TE) measures the distance of u_i to the tangent space of the manifold. A small TE indicates that u_i lies on the manifold and the noise is weak. On the contrary, the noise strength is large if TE is big. For convenience of calculation, we approximate TE as the Euclidean distance between u_i and its nearest data point in the noiseless dataset:

TE = (1/N) Σ_{i=1}^{N} min_{u* ∈ U*} ‖u_i − u*‖,

where N is the number of data points, u_i and u_i* represent the denoised data point and the noiseless data point, respectively, and U* is the noiseless dataset.

Table 1: MATLAB code for generating the swiss roll dataset.

Input: the number of data points n; noise parameter NI
Output: noiseless and noisy swiss roll datasets

t = (3 * pi / 2) * (1 + 2 * rand(n, 1));
height = 30 * rand(n, 1);
noiseless_data = [t .* cos(t), height, t .* sin(t)];
noise_data = noiseless_data + NI * randn(n, 3);
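The two quality indexes described above are straightforward to implement. The sketch below follows the definitions directly; the datasets are random stand-ins for illustration.

```python
import numpy as np

def mse(U, U_star):
    """Average squared Euclidean distance between paired datasets (MSE)."""
    return np.mean(np.sum((U - U_star) ** 2, axis=1))

def tangent_error(U, U_star):
    """TE approximated, as in the text, by the distance from each point
    to its nearest point in the noiseless dataset, averaged over the set."""
    d = np.sqrt(((U[:, None, :] - U_star[None, :, :]) ** 2).sum(-1))
    return np.mean(d.min(axis=1))

rng = np.random.default_rng(3)
clean = rng.standard_normal((100, 3))
noisy = clean + 0.1 * rng.standard_normal((100, 3))
print(mse(clean, clean) == 0.0, tangent_error(clean, clean) == 0.0)  # → True True
print(mse(noisy, clean) > 0.0)                                       # → True
```

Note that TE, unlike MSE, does not require point-to-point correspondence, which is why it better reflects whether points lie near the manifold.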
To evaluate our algorithm, we test seven sets of α and β ranging from 0 to 10. The MSE and TE values are listed in Tables 2 and 3. When α and β equal 0, the "data term" is the only term remaining in the objective function (8).
Therefore, the denoised dataset is then the same as the noisy dataset, and the value at (α = 0, β = 0) gives the errors of the noisy dataset. When α is small and β is large, the "data term" and "local structure term" maintain the geometric structure of the noisy dataset. Therefore, the errors at the upper right of each table are close to the errors of the noisy dataset. When α is large and β is small, the "smooth term" plays the major role. It can lead to the "cluster" phenomenon, which distorts the geometric structure of the dataset and makes the errors at the bottom left of each table large. It can be seen that the errors near the diagonals of the tables are much smaller than the others.

Experiments on the Image Dataset.
In this part, we test our method on two real image datasets: the MNIST handwritten digit dataset [23] and the "LLE face" dataset. An image is regarded as a point in pixel space. For example, an image in the MNIST dataset can be regarded as a point in 784-dimensional space because it has 784 pixels. Therefore, the only difference between this part and the experiment in Section 3.1 is that the dimensionality of an image-point is much higher than that of a synthetic scatter point in 3D space.
We analyze the denoised images from both subjective and objective aspects. First, our method is applied to the raw image datasets. Ideally, the key factors that control the geometric distribution of the dataset are maintained and the characteristics of individual images are removed as noise. Since there is no ground truth for the raw image datasets, we can only evaluate the results subjectively by eye. Second, we add several types of noise to an image and use MSE to objectively measure the denoised images produced by our method and by classical image denoising methods.

Experiments on the Raw Image Dataset.
We select the "number 3" and "number 4" datasets in MNIST, which contain 1010 and 982 images, respectively. The size of each image is 28 × 28 pixels. The "LLE face" dataset contains 1965 face images with different expressions and shooting angles. The size of each image is 28 × 20 pixels. Figure 5 shows 110 images from the "handwritten number 3" dataset. The left side shows the original images and the right side shows the corresponding denoised images produced by our method. In this experiment, k = 15, α = 0.8, and β = 1. Four typical images are marked with a box and enlarged in Figure 5. It can be seen that blurred strokes become clear while the posture of the digit in each image is maintained. Figure 6 shows 110 images from the "handwritten number 4" dataset; the left and right sides are the original images and the corresponding denoised images, respectively. In this experiment, k = 15, α = 8, and β = 1. It can be seen that the denoised images maintain the main factors, such as the angularity of the digit "4", while individual characteristics are removed as noise; for example, the differences in stroke width become small after denoising. Four typical images are marked with a box and enlarged in Figure 6. It is obvious that the margin of the "head" of the digit "4" becomes large in the first two images after denoising. In the third image, the stroke becomes broad. In the fourth image, the "bend" at the top of the stroke is removed. Figure 7 shows the denoising results for the LLE face dataset. This dataset contains 1965 face images, and the size of each image is 28 × 20 pixels. In this experiment, k = 15, α = 3, and β = 0.8. [4] shows that this dataset is distributed on a manifold spanned by two key factors: head pose and expression, where the expression is reflected by the lip shape in the images.
It can be seen that these two factors are maintained after denoising and the characteristics of the individual images are removed as noise. Four typical images are marked with a box and enlarged in Figure 7. In the first two images, the head twists slightly to the left and right in the original dataset, whereas the head pose is fixed after denoising. In the third image, the original head appears smaller than in the other images, which may be caused by camera shake. The corresponding denoised image enlarges the face, and the cheeks and chin become "fatter". In the fourth image, the eyes are "open" after denoising.

Experiments on the Noise Image.
In this part, we add several different types of noise to an LLE face image. Then, our method and three classical image denoising methods are applied to these noisy images. Finally, MSE is utilized to evaluate the denoised images. Figure 8 shows the denoised images produced by the four denoising methods for five types of noise. The first column is a raw LLE face image. Brightness noise, Gaussian noise, salt-and-pepper noise, rotation noise, and scaling noise are added to the raw image; the results are shown in the second column, from top to bottom. The MATLAB code of the noise models is listed in Table 4. Three classical denoising methods, mean filtering, median filtering, and the Tikhonov method, are utilized to deal with these noisy images. The corresponding denoised images are listed in the third, fourth, and fifth columns in Figure 8. The images in the last column are the denoising results of our method. The MSE is listed below each image. In this experiment, the size of the raw LLE face image is 28 × 20 pixels. In mean filtering, the size of the filter is 2 × 2 pixels. In median filtering, the size of the filter is 3 × 3 pixels. In the Tikhonov method, the smooth parameter is 0.3. The parameters of our method are set to k = 15, α = 3, and β = 3.
It can be seen that the three classical denoising methods have no effect on brightness noise, rotation noise, and scaling noise. These noises still exist in the denoised images, and the MSE even becomes larger after denoising compared with the noisy image, whereas our method works well. For example, the rotated face is fixed in the fourth row and sixth column, and the MSE becomes smaller. The reason is that classical image denoising methods use only the pattern information within a single image. They cannot "see" the geometric distribution information of the whole image dataset, whereas our method removes noise by drawing the noisy data points back to the noiseless manifold-geometric distribution of the image dataset.

Denoising Dataset for Manifold Learning.
In this part, we utilize our method as a preprocessing step and compare the recovered low-dimensional embeddings of the noisy and denoised datasets on several manifold learning algorithms. In this experiment, α, β, and k are 1, 0.8, and 13. Figure 9 compares the embeddings of the noisy and denoised datasets obtained by Isomap, LTSA, and HLLE; in particular, Figures 9(g) and 9(h) are the embeddings of the noisy and denoised datasets obtained by HLLE. It is obvious that the embeddings of the noisy dataset cannot reflect the geometric distribution of the manifold, since noisy neighborhoods easily lead to the "short circuit" phenomenon. With the denoised dataset, all three manifold learning methods obtain proper embeddings. The Isomap results exhibit the "hole" phenomenon because the calculated geodesic distances are always larger than the true ones.

To conduct a quantitative comparison, we assess the quality of the embeddings by three indexes: the embedding error, the trustworthiness error, and the continuity error [8]. The embedding error E measures the squared distance from the recovered low-dimensional embeddings to the ground truth coordinates:

E = (1/N) Σ_{n=1}^{N} ‖y_n − y_n*‖²,

where N is the number of data points and y_n and y_n* represent the embedding coordinates and the ground truth coordinates, respectively. This index tends to measure the global structure distortion of the manifold. The trustworthiness error T and continuity error C measure the local geometric structure distortion. The trustworthiness error measures the proportion of points that are brought too close together in the low-dimensional embedding, and the continuity error measures the proportion of points that are pushed away:

T(k) = 1 − (2 / (Nk(2N − 3k − 1))) Σ_{n=1}^{N} Σ_{m ∈ U_n^(k)} (r(n, m) − k),

C(k) = 1 − (2 / (Nk(2N − 3k − 1))) Σ_{n=1}^{N} Σ_{m ∈ V_n^(k)} (r̂(n, m) − k),

where k is the number of points in the neighborhood, r(n, m) is the rank of the point u_m in the ordering according to the pairwise distances from point u_n in the high-dimensional space, and r̂(n, m) is the rank of the point y_m in the ordering according to the pairwise distances from point y_n in the low-dimensional embedding.
The sets U_n^(k) and V_n^(k) denote the points that are among the k nearest neighbors of the n-th point in the low-dimensional embedding but not in the high-dimensional space, and those that are neighbors in the high-dimensional space but not in the low-dimensional embedding, respectively.
We test our method on several dimension reduction methods. The noisy swiss roll dataset contains 1300 points. Here, we set α, β, and k to 1, 0.8, and 13. The best embedding results among several trials are selected in this experiment. The embedding errors, trustworthiness errors, and continuity errors are listed in Tables 5–7, respectively. To show the effectiveness of our method, the errors of the noisy dataset, the denoised dataset, and the noiseless dataset are listed in three rows. It can be seen that the errors become small with the denoised dataset for Isomap, LLE, HLLE, LTSA, and AML. However, LE and LPP perform poorly on the denoised dataset.
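The trustworthiness index above can be computed directly from neighbor ranks. The sketch below follows the standard rank-based definition (the same quantity is available in scikit-learn as `sklearn.manifold.trustworthiness`); the random data is illustrative, and the normalization assumes k < N/2.

```python
import numpy as np

def trustworthiness(X, Y, k=5):
    """T(k): penalizes points that enter the k-neighborhood in the
    embedding Y without being k-nearest neighbors in the original X."""
    n = len(X)
    dX = ((X[:, None] - X[None]) ** 2).sum(-1)
    dY = ((Y[:, None] - Y[None]) ** 2).sum(-1)
    np.fill_diagonal(dX, np.inf)
    np.fill_diagonal(dY, np.inf)
    rank_X = np.argsort(np.argsort(dX, axis=1), axis=1) + 1   # r(n, m)
    nn_X = np.argsort(dX, axis=1)[:, :k]                      # neighbors in X
    nn_Y = np.argsort(dY, axis=1)[:, :k]                      # neighbors in Y
    t = 0.0
    for i in range(n):
        intruders = np.setdiff1d(nn_Y[i], nn_X[i])            # U_n^(k)
        t += (rank_X[i, intruders] - k).sum()
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * t

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 3))
print(trustworthiness(X, X.copy(), k=5))   # → 1.0 (identical embedding)
```

Continuity is computed symmetrically by swapping the roles of X and Y, i.e. ranking in the embedding and collecting the points missing from the embedding neighborhood.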

Classification Experiment.
In this part, we utilize our method as a preprocessing step and compare the accuracy rates of the original dataset and the denoised dataset in a classification task. The MNIST handwritten digit dataset is selected, which contains 60000 images in ten classes, from digit 0 to digit 9. Each class has about 6000 images, and the size of each image is 28 × 28 pixels. To obtain the denoised dataset, we apply our denoising method to each of the ten classes.
In this experiment, we specify different numbers of images in each class as training data and use the remaining images as test data, for both the original dataset and the denoised dataset. A simple one-hidden-layer neural network is adopted as the classifier. The input layer has 784 units, corresponding to the pixels of an image. The output layer has 10 units, corresponding to the ten categories from digit zero to nine. We set 25 units in the hidden layer, including a bias unit. The parameters of the network are trained by backpropagation.
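The classifier described above can be sketched as follows. The 784–25–10 architecture follows the text; the random data, learning rate, and iteration count are purely illustrative assumptions, since the original training details are not given.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy stand-in for the experiment: 784-dimensional inputs, 25 hidden
# units, 10 output classes, trained by plain backpropagation.
n, d, h, c = 200, 784, 25, 10
X = rng.standard_normal((n, d))
y = rng.integers(0, c, size=n)
T = np.zeros((n, c)); T[np.arange(n), y] = 1.0        # one-hot targets

W1 = 0.01 * rng.standard_normal((d, h)); b1 = np.zeros(h)
W2 = 0.01 * rng.standard_normal((h, c)); b2 = np.zeros(c)

def forward(X):
    a = np.tanh(X @ W1 + b1)                          # hidden layer
    z = a @ W2 + b2
    z = z - z.max(axis=1, keepdims=True)              # stable softmax
    p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
    return a, p

def loss(p):
    return -np.mean(np.log(p[np.arange(n), y] + 1e-12))

_, p0 = forward(X)
loss_before = loss(p0)
for _ in range(300):                                  # backpropagation steps
    a, p = forward(X)
    g2 = (p - T) / n                                  # output-layer error
    g1 = (g2 @ W2.T) * (1.0 - a ** 2)                 # hidden-layer error
    W2 -= 0.1 * (a.T @ g2); b2 -= 0.1 * g2.sum(0)
    W1 -= 0.1 * (X.T @ g1); b1 -= 0.1 * g1.sum(0)
_, p1 = forward(X)
print(loss(p1) < loss_before)
```

A deliberately small classifier like this keeps the comparison fair: any accuracy gain comes from the preprocessing, not from model capacity.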
For each classification task, we repeat the experiment 10 times and plot the mean accuracy rate in Figure 10. The labels "original dataset" and "denoising dataset" refer to the raw MNIST dataset and the dataset denoised with our method. The x-coordinate is the number of training images in each class and the y-coordinate is the accuracy rate. The blue and red lines are the accuracy rates of the original dataset and the denoised dataset, respectively. It is obvious that the accuracy rate goes down as the number of training images in each class decreases. The performance of the denoised dataset is much better than that of the original dataset, especially when there are fewer than 50 training images in each class. The accuracy is above 96% even when there are only 10 training images per class for the denoised dataset. The reason is that the individual characteristics are removed in the denoised dataset, as shown in Figures 5–7 in Section 3.2.1. A denoised dataset that lies on a "clean" manifold spanned by the key factors of the data makes it easy for a machine learning algorithm to learn the geometric distribution knowledge of the dataset. It also illustrates that our method captures features that are in some sense essential to the classifier.

Conclusion and Future Work
We propose a denoising method for a dataset rather than for a single data point. The method is inspired by the manifold assumption. A local structure term is added to the Tikhonov model so that the noisy points diffuse along the tangent space of the manifold. Our method highlights the major factors hidden in the dataset and removes the characteristics of individual data points. Experiments show that our method can eliminate noise effectively on both synthetic scatter point cloud datasets and real image datasets. As a preprocessing step, our method can improve the robustness of manifold learning and increase the accuracy rate in classification problems. However, the parameters of this model are sensitive because the optimal solution is calculated by iteration. The geometric distribution of the dataset is distorted when the smooth term parameter is large; on the contrary, when it is small, the noise intensity remains large after denoising. Our future work will focus on this problem.

Data Availability
Some or all data, models, or codes generated or used during the study are available from the first author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.