This paper proposes a new manifold-based dimension reduction algorithm framework. It handles dimension reduction of noisy data and reports the dimension reduction results together with the deviation values caused by noise interference. Commonly used manifold learning methods are sensitive to noise in the data. Mean computation, a denoising method, is an important step in data preprocessing but leads to a loss of local structural information. In addition, it is difficult to measure the accuracy of the dimension reduction of noisy data. Manifold learning methods therefore often transform the data into an approximately smooth manifold structure; however, practical data from the physical world may not meet this requirement. The proposed framework follows the idea of manifold localization and uses graph sampling to determine local anchor points from the given data. The specific range of each locality is then determined using graph spectral analysis, and the density within each local range is estimated to obtain the distribution parameters. Next, manifold-based dimension reduction with distribution parameters is established, and the deviation values in each local range are measured and extended to all data. The proposed framework thus provides a method for measuring the deviation caused by noise.

Manifold learning is used for nonlinear dimension reduction of natural and general data. These data are often assumed to lie on low-dimensional nonlinear manifolds embedded in a high-dimensional space [

Manifold-based dimension reduction of noisy data is challenging. Traditionally, Hein and Maier [

Although the above manifold learning methods can suppress noise to a certain extent, the original data may be overprocessed or subjected to artificial intervention, and the effectiveness of dimension reduction becomes difficult to evaluate. Yan et al. [

The low-dimensional representation of noisy data shows a certain randomness under interference; that is, noise shifts the low-dimensional coordinates and thereby changes the existing distribution characteristics, and these changes appear as deviation values in the dimension reduction results. Some works, such as that by Zhang et al. [

Localization is an important process in most manifold methods. The key to proper localization is to treat the nonlinear high-dimensional structure as approximately linear and low-dimensional within each locality. This paper uses graph sampling [

To measure the deviation values of the low-dimensional samples, this study uses maximum likelihood estimation in each local range to construct a normal distribution function adapted to the current locality. This function is used to obtain the maximum deviation of the noisy data in each locality; because a sample may belong to several localities, its dimension reduction result is affected by the fluctuation of noise interference in one or more of them. Distance weights are then computed to combine these local deviations into the deviation of each dimension reduction result caused by noise.

The contributions of this paper are as follows:

This paper proposes a manifold dimension reduction framework for noisy data.

A method for determining the local ranges of complex noisy data is provided, based on graph sampling and spectral graph theory.

This paper proposes a weighted sum method to compute the deviation values of the low-dimensional data caused by noise in different local ranges.

Furthermore, the framework is compatible with many kinds of manifold dimension reduction methods and can thus potentially be applied in different research fields.

The remainder of this paper is organized as follows. The next section introduces the preliminary works. The subsequent section details the proposed algorithm framework, and the section after that provides the experimental results on simulation data and practical data. Finally, a summary is given in the final section.

The noisy data are denoted as

In order to analyze the spectral properties of the graph

We introduce the spectral graph wavelet theory [

At this time,

The algorithm framework in this paper is divided into three parts: the determination of local anchor vectors, local range and distribution estimation, and manifold dimension reduction with distribution parameters.

Localization analysis is the first step in reducing the dimension of the manifold, and determining the local anchor vectors is important for identifying the local positions. Graph sampling is similar to image downsampling: based on the polarity of the eigenvectors of the Laplacian matrix, it selects some data samples to form a reduced data graph, retaining the data samples of interest so as to determine key samples for computing the local ranges. These key samples are collectively referred to as “local anchor vectors” in this paper.

Let the local anchor vector set be

This paper adopts a graph sampling method based on the polarity of the components of the largest eigenvector, that is,

After each local anchor vector is determined, it is necessary to further determine its local range and then calculate information such as the deviation values caused by noise interference.
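The anchor selection described above can be illustrated with a minimal sketch. It assumes a k-nearest-neighbor graph with Gaussian edge weights and the unnormalized Laplacian L = D - W; the function names `gaussian_knn_graph` and `select_anchors`, the parameters `k` and `sigma`, and the choice to keep the nonnegative components are illustrative assumptions rather than the exact construction used in this paper.

```python
import numpy as np

def gaussian_knn_graph(X, k=8, sigma=1.0):
    """Symmetric k-nearest-neighbor graph with Gaussian edge weights (assumed construction)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between all samples.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]  # skip the sample itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)  # symmetrize

def select_anchors(W):
    """Keep the vertices whose component in the largest Laplacian eigenvector is nonnegative."""
    L = np.diag(W.sum(axis=1)) - W        # unnormalized Laplacian (assumption)
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues are returned in ascending order
    u_max = eigvecs[:, -1]                # eigenvector of the largest eigenvalue
    # The sign of an eigenvector is arbitrary, so either polarity class may be returned.
    return np.flatnonzero(u_max >= 0)
```

Because the sign of an eigenvector is arbitrary, the two polarity classes are interchangeable; a practical implementation might fix a convention, for example by keeping the smaller class.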

Spectral graph theory is used to further analyze the local ranges. Spectral graph analysis is a special type of spectral analysis. Spectral analysis itself is based on the frequency domain, where a signal is characterized by its spectral coefficients or spectral energy. According to [

This study uses the spectral graph wavelet method to obtain the local range. This method is more flexible and has fewer parameters than the common k-nearest neighbor algorithm. If a signal is applied to an anchor vector, it propagates and extends to the surrounding area, and its strength is attenuated by distance and noise. When the local range is determined using the spectral graph wavelet, the range is taken as the region within which the signal has attenuated to a certain degree; the bandpass characteristic of the wavelet function also affects the speed of attenuation.

Let a signal function

The filter function controls the span over which the signal spreads and can thus extract the area in which the local range is likely to lie. Given a certain anchor vector

Therefore, we traverse the anchor vector set and determine the local range of each anchor vector through the spectral energy calculation given above.
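As an illustration of this step, the sketch below places a unit impulse on one anchor vertex, filters it in the graph spectral domain, and keeps the vertices whose spectral energy stays above a fraction of the maximum. The kernel g(x) = x·exp(-x), the parameters `scale` and `ratio`, and the thresholding rule are hypothetical stand-ins; the filter function and the decision rule of this paper are not reproduced here.

```python
import numpy as np

def local_range(W, anchor, scale=1.0, ratio=0.1):
    """Estimate the local range of one anchor from spectral graph wavelet energy."""
    L = np.diag(W.sum(axis=1)) - W            # unnormalized graph Laplacian
    lam, U = np.linalg.eigh(L)                # graph frequencies and Fourier basis
    g = (scale * lam) * np.exp(-scale * lam)  # assumed band-pass kernel g(x) = x * exp(-x)
    delta = np.zeros(W.shape[0])
    delta[anchor] = 1.0                       # unit signal applied to the anchor
    psi = U @ (g * (U.T @ delta))             # wavelet centered at the anchor
    energy = psi ** 2                         # spectral energy at every vertex
    members = np.flatnonzero(energy >= ratio * energy.max())
    # The anchor is always kept in its own local range (a convention of this sketch).
    return np.union1d(members, [anchor]), energy
```

A larger `scale` spreads the wavelet further over the graph, and a smaller `ratio` accepts more strongly attenuated vertices, so both enlarge the local range.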

Noise causes the samples within the local range of each anchor vector to follow a certain distribution. Under the premise of Gaussian noise, this distribution can also be treated as approximately Gaussian. To perform the manifold dimension reduction with distribution parameters, it is necessary to estimate the distribution parameters.

Our proposed framework uses the maximum likelihood estimation. We consider the anchor vector

The estimated distribution of the anchor vector

The manifold-based dimension reduction with distribution parameters in this section addresses the problem of measuring the deviation values of low-dimensional manifolds under noise interference. Maximum likelihood estimation is used in each local range to construct a normal distribution function adapted to the current locality. In this study, manifold mapping functions are established for the local mean and deviation results to obtain the manifold dimension reduction results with distribution parameters, and the deviations between the deviation dimension reduction result and the mean dimension reduction result are then calculated. Similarly, a manifold mapping function is established using the original data; because each data sample belongs to the range of one or more local anchor vectors, its dimension reduction result is also affected by the fluctuation of noise interference in one or more localities. In this study, distance weights are used to calculate the weighted sum of the deviation results for the dimension reduction result of each data sample.
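The per-locality estimation can be sketched as follows. For a Gaussian model, the maximum likelihood estimates are the sample mean and the covariance normalized by the number of samples. Forming the deviation vectors as the mean plus or minus the per-coordinate standard deviation is an assumption of this sketch, suggested by the positive and negative standard deviation directions used in the experiments; the function name `local_gaussian_mle` is hypothetical.

```python
import numpy as np

def local_gaussian_mle(X_local):
    """Maximum likelihood estimates of a Gaussian fitted to the samples of one local range."""
    n = X_local.shape[0]
    mean = X_local.mean(axis=0)
    centered = X_local - mean
    cov = centered.T @ centered / n   # MLE covariance divides by n, not n - 1
    std = np.sqrt(np.diag(cov))       # per-coordinate standard deviation
    # Assumed deviation vectors: one standard deviation on either side of the mean.
    return mean, cov, mean + std, mean - std
```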

For each anchor vector and its local range, the mean and covariance are estimated. In effect, the means of all anchor vectors constitute a type of denoising result, and this mean result forms an approximately smooth manifold. Therefore, the manifold dimension reduction of the mean values is used as the benchmark of the deviation measure. The framework proposed in this study can use any manifold dimension reduction method to establish a mapping function. Assume that the selected manifold dimension reduction method is expressed as

We iterate through all anchor vectors to obtain all local deviation value sets

(1)

The above formula represents the deviation calculation of the

(2)

The Euclidean distance calculation in the above formula represents the deviation calculation of the data distance within the local range of the anchor vector

To achieve dimension reduction of the overall noisy data and to measure the interference effect of noise, the framework in this study builds on the manifold dimension reduction with distribution parameters of the anchor vectors and their local deviation values, and provides a distance-weighted method for the dimension reduction values of the original noisy data.

The reason for using the distance-weighted method is that any one data sample

The deviation values of each dimension of the low-dimensional local of anchor vectors in

For the entire data

We iterate over all low-dimension

The above deviation value calculation for each dimension in the low-dimensional space and for the overall data in the low-dimensional space is shown in Figure

(a) Diagram for calculating the deviation value of each dimension in the dimension reduction result of a manifold with distribution parameters using the anchor vector

In Figure

Through the weights, the different local ranges can simultaneously affect the final deviation values.
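The weighting idea can be illustrated with a short sketch. It assumes that the weight of each local range is inversely proportional to the distance between the sample and the mean vector of the corresponding anchor locality, normalized so that the weights sum to one. The weight formula of this paper is not reproduced here, so the inverse-distance rule, the function name `weighted_deviation`, and the parameter `eps` are illustrative assumptions.

```python
import numpy as np

def weighted_deviation(x, anchor_means, anchor_devs, member_of, eps=1e-12):
    """Distance-weighted deviation of one sample.

    x            : (D,) sample in the original space
    anchor_means : (m, D) mean vector of every anchor locality
    anchor_devs  : (m, d) low-dimensional deviation values of every anchor locality
    member_of    : indices of the anchors whose local range contains x
    """
    idx = np.asarray(member_of)
    dists = np.linalg.norm(anchor_means[idx] - x, axis=1)
    w = 1.0 / (dists + eps)   # closer localities receive larger weights (assumption)
    w /= w.sum()              # normalize the weights to sum to one
    return w @ anchor_devs[idx], w
```

With this rule, a sample that belongs to a single local range simply inherits the deviation values of that locality, while a sample shared by several local ranges receives a blend dominated by the nearest one.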

The three parts of the algorithm described above are summarized in the schematic illustration of Figure

A schematic illustration of the whole proposed framework of manifold dimension reduction of noisy data.

The computing process of the proposed framework is as follows.

Input (1) data samples with noise

Step 1. Construct a graph

Step 2. Calculate the Laplacian matrix

Step 3. Determine the anchor vector set

Step 4. For each

Apply a signal function

Calculate the spectral graph wavelet coefficient of the signal function according to (

Determine the local range of the anchor vector

Obtain the distribution parameters of the anchor vector locality according to (

Step 5. Learn to obtain the manifold dimension reduction result

Step 6. Use all covariances of the anchor vector set

Step 7. Learn to obtain the manifold dimension reduction result

Step 8. Obtain the deviation value of each dimension of the low-dimensional space according to (

Step 9. For each

According to (

Calculate the deviation values
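Steps 5 to 8 can be sketched as follows, under loudly stated assumptions. A PCA projection fitted on the mean vectors is swapped in for the manifold mapping function purely to keep the example self-contained; it is not one of the manifold methods used in this paper. The deviation of each low-dimensional coordinate is taken as the larger of the absolute differences between the mapped deviation vectors and the mapped mean, and the overall deviation as the corresponding Euclidean distance. The names `fit_linear_map` and `anchor_deviations` are hypothetical.

```python
import numpy as np

def fit_linear_map(M, dim=2):
    """Stand-in mapping function: PCA projection fitted on the mean vectors M (m x D)."""
    center = M.mean(axis=0)
    _, _, Vt = np.linalg.svd(M - center, full_matrices=False)
    P = Vt[:dim].T                       # top principal directions
    return lambda X: (X - center) @ P

def anchor_deviations(M, S, dim=2):
    """Low-dimensional deviation values of every anchor locality.

    M : (m, D) mean vectors; S : (m, D) per-coordinate standard deviations.
    """
    f = fit_linear_map(M, dim)
    Y_mean = f(M)                        # benchmark: embedding of the mean vectors
    d_plus = f(M + S) - Y_mean           # shift caused by the positive deviation vectors
    d_minus = f(M - S) - Y_mean          # shift caused by the negative deviation vectors
    per_dim = np.maximum(np.abs(d_plus), np.abs(d_minus))
    dist = np.maximum(np.linalg.norm(d_plus, axis=1),
                      np.linalg.norm(d_minus, axis=1))
    return Y_mean, per_dim, dist
```

Any manifold method that offers a mapping for new points could replace `fit_linear_map`; for a nonlinear mapping, the positive and negative shifts generally differ, which is why both are evaluated.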

For the proposed algorithm framework, two simulated datasets and two image datasets are selected for algorithm implementation and simulation.

This three-dimensional dataset has a total of 1000 data samples. Gaussian noise with 0 mean and a standard deviation of 0.5 is added to each sample point, as shown in Figure

Visualization of noisy Swiss roll data.
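A dataset of this kind can be generated with the sketch below. It uses a commonly used Swiss roll parametrization, which is an assumption because the exact parametrization of this experiment is not stated here; the sample count of 1000 and the noise standard deviation of 0.5 follow the text. The function name `noisy_swiss_roll` and the `seed` parameter are illustrative.

```python
import numpy as np

def noisy_swiss_roll(n_samples=1000, noise_std=0.5, seed=0):
    """Three-dimensional Swiss roll with additive zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.uniform(size=n_samples))  # position along the roll
    height = 21.0 * rng.uniform(size=n_samples)                  # position across the roll
    clean = np.column_stack((t * np.cos(t), height, t * np.sin(t)))
    noisy = clean + rng.normal(0.0, noise_std, size=clean.shape)
    return noisy, clean, t
```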

We construct the graph

Set of anchor vectors in green color.

Next, the spectral graph wavelet method determines the local range of each anchor vector. The filtering function selected in this study is

Spectral graph wavelet method used to determine the local range of an anchor vector.

For each local range, we estimate the distribution parameters, that is, the mean and covariance parameters. According to (

The mean vectors and deviations because of distribution parameters at each anchor vector locality.

The mean vectors are used to learn to determine the dimension reduction

For now, we can obtain the dimension reduction results of the mean vectors and deviation values, as shown in Figure

Dimension reduction of the mean vectors and deviation values in low-dimensional space of each anchor vector.

As can be seen from Figure

We use manifold learning to perform dimension reduction on the entire noisy data and the proposed distance weighting method to calculate the deviation values of the noise interference in the dimension reduction result. In order to visualize the deviation results, the positive and negative

Dimension reduction and deviation value results for entire noisy Swiss roll data.

As can be seen from Figure

This dataset has a total of 1000 data samples. Gaussian noise with 0 mean and a standard deviation of 0.5 is added to each sample point, as shown in Figure

Visualization of noisy S-shaped dataset.

According to the anchor vector set and the local range determination method given in the framework, the anchor point set is shown in Figure

The set of anchor vectors.

The spectral graph wavelet method used to determine the local range of an anchor vector.

After determining the anchor vector and local range, the distribution parameters are estimated and the deviation can be obtained based on the positive and negative directions of the standard deviation, as shown in Figure

The mean vectors and deviations because of distribution parameters at each anchor vector locality.

Similar to the Swiss roll dataset experiment, the mean vector and the deviation values in Figure

Dimension reduction of the mean vectors and deviation values in low-dimensional space of each anchor vector.

Then, manifold dimension reduction processing is performed on the original noisy data, and deviation values are calculated for noise interference in the dimension reduction results. Similarly, select the positive and negative directions of the

Figure

Dimension reduction and deviation value results for entire noisy S-shaped dataset.

The MNIST dataset is a set of handwritten digit grayscale images of size

Gaussian noise is added to the original 2000 vectors, with 0 mean and a standard deviation of 0.5. The proposed algorithm framework uses LE, ISOMAP, LLE, and LTSA, the commonly used manifold dimension reduction methods, and the parameter in the filter function

Example of original “0” and “1” images and images with noise.

Manifold dimension reduction results of “0” and “1” and results of deviation values.

As seen in Figure

Gaussian noise is added to the original 1500 vectors, with 0 mean and a standard deviation of 0.5. Other conditions are similar to the two-class experiment. Figure

Example of original “0,” “1,” and “2” images and images with noise.

Manifold dimension reduction results for “0,” “1,” and “2” and results of deviation values.

The results in Figure

The Fashion-MNIST dataset is an extended version of the MNIST dataset and includes

We arbitrarily select “T-shirts” and “boots” as the two-class samples. The processing method is the same as in the MNIST dataset experiment. Gaussian noise with a mean of 0 and a standard deviation of 0.5 is introduced. Each class consists of 1,000 randomly selected data samples. The proposed algorithm framework is used to process the samples, and the parameter of the filter function

Examples of original “T-shirt” and “boots” and images with added noise.

Manifold dimension reduction results for two-class samples and results of deviation values.

As can be seen in Figure

We arbitrarily select the three-class samples of “T-shirt,” “pants,” and “boots.” The setting of the experiment is the same as that of the two-class process. Gaussian noise with mean of 0 and standard deviation of 0.5 is introduced. We randomly take 500 data samples from each class for dimension reduction processing. Figure

Examples of original “T-shirt,” “pants,” “boots,” and examples with added noise.

Manifold dimension reduction results of three-class samples and deviation values.

As can be seen in Figure

This paper presents a manifold-based dimension reduction algorithm framework capable of processing noisy data. Considering manifold localization and the distribution characteristics brought about by noise interference, we propose a method for determining local anchor vectors using graph sampling and a method for determining the local ranges of anchor vectors based on spectral graph wavelets. In each local range of an anchor vector, maximum likelihood estimation is used to estimate the distribution parameters, and a distance-weighted deviation value calculation method for dimension reduction results with distribution parameters is proposed. Within the framework, any of the currently used manifold learning methods can be adopted for the dimension reduction step.

As seen from the simulations on noisy data, the proposed framework can achieve the dimension reduction of noisy data and measure the deviation values caused by noise interference. For data containing class labels, it provides dimension reduction results and deviation values with clear class discrimination. Moreover, the proposed framework is compatible with other kinds of dimension reduction methods and thereby extends those methods to accept noisy data as input, so that they can be applied in more complex situations.

Future work will investigate manifold dimension reduction calculations with other types of distribution parameters, the optimization of filter functions, and quantitative evaluation methods.

This paper used the public MNIST and Fashion-MNIST datasets.

The authors declare that they have no conflicts of interest.

This work was supported by the National Natural Science Foundation of China (no. 61903029), National Key R&D Program of China (no. 2017YFB0702100), and National Material Environmental Corrosion Platform.