Robust Structure Preserving Nonnegative Matrix Factorization for Dimensionality Reduction

As a linear dimensionality reduction method, nonnegative matrix factorization (NMF) has been widely used in many fields, such as machine learning and data mining. However, there are still two major drawbacks for NMF: (a) NMF can only perform semantic factorization in Euclidean space, and it fails to discover the intrinsic geometrical structure of high-dimensional data distribution. (b)NMF suffers fromnoisy data, which are commonly encountered in real-world applications. To address these issues, in this paper, we present a new robust structure preserving nonnegative matrix factorization (RSPNMF) framework. In RSPNMF, a local affinity graph and a distant repulsion graph are constructed to encode the geometrical information, and noisy data influence is alleviated by characterizing the data reconstruction term of NMF with l 2,1 -norm instead of l 2 -norm. With incorporation of the local and distant structure preservation regularization term into the robust NMF framework, our algorithm can discover a low-dimensional embedding subspace with the nature of structure preservation. RSPNMF is formulated as an optimization problem and solved by an effective iterativemultiplicative update algorithm. Experimental results on some facial image datasets clustering show significant performance improvement of RSPNMF in comparison with the state-of-the-art algorithms.


Introduction
With the development of data acquisition technology, highdimensional data is ubiquitous in many areas.Directly learning in original data space is prone to disturbance by noise; meanwhile, it is hard to discover the latent semantic structure [1], so low-dimensional representation learning becomes a fundamental problem in numerous data analysis algorithm [2][3][4][5][6][7][8][9][10].As a common used tool for dimensionality reduction, matrix factorization has received more and more attention in recent years [2,3,11].Unlike typical matrix factorization methods, such as QR and SVD, which seek for exact factorization of original matrix, nonnegative matrix factorization (NMF) approximately factorizes the original matrix into two low-dimensional nonnegative submatrixes.It is the nonnegative constraint which assures NMF can learn a sparse and part-based representation of the sample matrix.Psychological and physiological evidences have proved that NMF is consistent with how the brain interprets an object in a part-based, nonsubtractive way [12][13][14].Recently, NMF has exhibited great success in a variety of applications, like text analysis [15], image processing [16,17], and video representation [18].
In order to detect the latent geometrical structure hidden in high-dimensional data space, many manifold learning algorithms have been proposed, for instance, isometric feature mapping (ISOMAP) [19], locally linear embedding (LLE) [20], Laplacian eigenmaps (LE) [21], and local tangent space alignment (LTSA) [22].Motivated by the benefit of NMF and manifold learning, a graph regularized NMF (GNMF) method is proposed in [5].Although GNMF displays some improvement on clustering performance, it suffer greatly from the following two problems in real-world applications: (a) GNMF only consider the ideal case where the input data is noise-free.When the input data samples is corrupted by noise, GNMF even leads to an inaccurate matrix factorization.(b) GNMF overemphasizes the affinity relationship 2 Mathematical Problems in Engineering between neighboring data samples; the repulsion relationship between data samples which is favorable for clustering is even neglected thoroughly.
To address the abovementioned issues of GNMF, in this paper, we propose a new robust structure preserving nonnegative matrix factorization (RSPNMF) method.In RSPNMF, the data reconstruction term is formulated in the form of robust ℓ 2,1 -norm instead of ℓ 2 -norm, which will improve the model robustness to noisy data; meanwhile, the rotation invariance property of the model is also preserved.The success of GNMF relies on the assumption that two neighboring data samples share the same semantic label; it characterizes the compactness relationship among neighboring data samples.Inspired by this observation, we propose a new assumption; that is, data samples which are remote are generally semantic difference; and then we constrain that the low-dimensional representation of distant data samples in the original data space should also be far apart in embedding subspace.We consider that the preservation of neighboring data sample's compactness together with distant data sample's repulsion is helpful to describe the intrinsic geometrical structure of data distribution.
The contributions of this paper are presented as follows: (1) A novel nonnegative matrix factorization algorithm, named RSPNMF, is proposed, in which the intrinsic manifold structure is approximated with the local affinity and distant repulsion graphs.By incorporating the corresponding regularization term into robust NMF framework, our proposed algorithm can find the latent subspace of the original matrix with nature of manifold structure learning.
(2) We formulate our approach as an optimization problem and provide an iterative multiplicative update algorithm to solve it.The convergence proof of the proposed algorithm is also provided.
(3) Our experiments on several datasets for clustering validate our proposed algorithm.Compared with several state-of-the-art methods, the results of our RSPNMF show the better performance.
The rest of the paper is organized as follows.In Section 2, we briefly review the background of NMF and related works.In Section 3, our RSPNMF approach and the convergence proof of optimization scheme are presented.Some experimental results and comparison with other approach are presented in Section 4. We end this paper with some conclusions and discussion in Section 5.

Related Works
Given a nonnegative matrix  = [ 1 , . . .,   ],   ∈ R ×1 + is a data sample.NMF attempts to find two low-dimensional nonnegative submatrixes whose product provides a best approximation to the original matrix ; that is, With the purpose of finding a compact factorization for the original data matrix, the number of basis component  usually is set less than  and .
One common data reconstruction function used to quantify the approximation of NMF is formulated as follows: where ‖ ⋅ ‖  denotes the Frobenius norm.Due to no-convexity of the objective function in (2) on both  and , it is unrealistic to expect an algorithm to find its global minimum.An iterative multiplicative update algorithm which can find the local minima of ( 2) is proposed in [23], which are As the update algorithm allows only additive combination among different basis components, it is believed that NMF can learn a sparse and part-based representation [4].
Inspired by the work exhibited in [23], many of NMF variants have been proposed over the past years.In order to explicitly impose sparseness constraint, sparseness constrained NMF is proposed in [9,[24][25][26][27][28]; these works expect more sparse representation than the original NMF.To characterize different residual error model, two new distance metrics are introduced in [29,30].Similar to SVD presented in [31], a nonnegative matrix can also be factorized into three nonnegative components.In [32], a fast nonnegative matrix trifactorization is proposed, and then orthogonality constrained nonnegative matrix trifactorization is studied in [27].Different from unsupervised NMF variants, a semisupervised nonnegative matrix decomposition algorithm is proposed in [9], where the label information of data samples was introduced to NMF as an additional constraint.To improve discriminating power of MMF, in [33,34], Fisher's criterion as an additional regularization term is incorporated into NMF framework.In spite of the success of the abovementioned NMF variants in various scenarios, all of them perform matrix factorization on Euclidean space.
In [39,40], a dual regularized seminonnegative matrix trifactorization is proposed, in which data samples and feature samples manifold structure are also considered.Although the graph-based NMF algorithm incorporates the prior knowledge of manifold structure into NMF framework, overemphasizing local structure preservation sometimes will degrade its performance.A structure preserving nonnegative matrix factorization (SPNMF) is proposed in [10], in which the intrinsic manifold is approximated with local and distant graphs.Experiment results demonstrate SPNMF outperforms GNMF and its variants.
All the abovementioned NMF variants are based on the premise that the input data are noise-free; it is impractical to real-world data samples which are often corrupted by noise.For boosting NMF robustness to outliers, ℓ 2,1 -norm based NMF reconstruction function is applied in [41,42].In [43], ℓ 2,1 -norm and ℓ 1 -norm are used to formulate the graph embedding and data reconstruction function, which endows the algorithm with robustness to unreliable graphs and noisy labels.
Among the existing methods, SPNMF is closely relevant to our method and performs better on some specific datasets.There are still some differences between our proposed algorithm and SPNMF.Firstly, the Laplacian matrix of distant repulsion graph from SPNMF needs to be updated in each iteration, which is time consuming.Additionally, to learn a better parts-based representation, the basis vector of SPNMF is required to be as orthogonal as possible.Finally, SPNMF is sensitive to noisy data points.

The Proposed Method
Our goal is to perform matrix factorization on the hidden semantic subspace.To this end, we impose the local affinity and distant repulsion structure preservation on the new robust NMF framework.In this section, we will first explain the structure preservation and robust data reconstruction terms of RSPNMF, followed by its optimization algorithm.The convergence of the proposed algorithm is proved finally.

Problem Formulation.
Despite the diverse motivations of various dimensionality reduction algorithms, most of them can be explained within a graph embedding framework.Let  = {,} be an undirected weighted graph, where vertex set  corresponds to a dataset and  ∈ R × + is an affinity matrix whose elements measure the similarity of each pair of vertices.In graph embedding algorithm, graph  characterizes the prior knowledge of geometric structure from data distribution.
If we use the  1 -nearest neighboring graph  to characterize the local structure of data distribution, the weight matrix  = [  ] × can be defined as follows: where  is the bandwidth parameter and   1 (  ) denotes the set of  1 nearest neighbors of   .
Local invariance assumption which encourages neighboring data pairs in the original space to be still close in the low-dimensional embedding subspace can be formulated as where V  is the corresponding low-dimensional representation of   .By defining the diagonal matrix  and Laplacian matrix  of graph  are as we can rewrite (5) as where Tr() denotes the trace of a matrix.Equation ( 7) essentially holds the smoothness of dimensionality reduction process.It is based on two vital assumptions; firstly, the neighboring data samples in highdimensional space are semantic similarity, and secondly the preservation of affinity structure plays great role for low-dimensional representation.
Local invariance assumption essentially exploits the favorite relationship among similar data samples under unsupervised condition; however, it ignores unfavorite relationship between divergent data pairs.In this paper, we conjecture that the distant data pairs are always semantic difference.A new distant neighboring graph   = {,   } which is used to describe the repulsion relationship between dissimilar data pairs is also constructed.The corresponding weight matrix   = [   ] × is defined as follows: where   2 (  ) denotes the remotest  2 data samples of   in givensample dataset.From ( 8), we can we can find that the more the distance between   and   , the smaller the value of    .If we want the corresponding low-dimensional representation V  and V  are also faraway, the value of ‖V  − V  ‖ 2 must be as large as possible.Inspired by this observation, we propose a new distant repulsion assumption, which emphasizes that the lowdimensional counterpart of distant data pairs is also far apart in embedding subspace.This distant repulsion constraint is characterized by the objective function defined in The distant repulsion assumption in ( 9) can maintain the repellent relationship between each remote data pair during dimensionality reduction, which can be rewritten as where   and   are the diagonal matrix and Laplacian matrix of graph   , respectively, and   =   −   ,    = ∑  ̸ =    for any given .With the assumption that semantic label of distant data pairs is certainly different, (10) actually emphasizes the unfavorite relationship among dissimilar data samples.(7) and (10) together can better preserve the manifold structure of data distribution, their optimized object is opposite, one for minimum and the other for maximum.To address this issue, we divide the basis matrix  into two parts; that is,

Algorithm. Although
correspondingly, the coefficient matrix  is also partitioned into two parts: Then, the original matrix  can be reconstructed with ( 1 ,  1 ) and ( 2 ,  2 ) in an additive manner: where , and  ≥ . 1 and  2 are the coefficient matrix of the original matrix  on  1 and  2 , respectively.
If we restrict the fact that the submatrix  1 retains the property of graph  and avoids the property of graph   , then submatrix  1 is the low-dimensional representation of matrix  with local invariance assumption.With the added constraint that the property of basic matrices  1 and  is consistent, (7) can be translated as min On the contrary, we assume that submatrix  2 preserves the property of graph   and avoids the property of graph ; meanwhile, we constrain the fact that the property of matrix  2 is complementary to matrix .This will lead to the fact that the calculation of finding maximum value in (11) is transformed into the calculation of finding minimum value showed in min In ( 2), the error of each data sample entered the objective function of NMF with squared residue, a few outliers with large errors will easily dominate the objective function.In this paper, we employ ℓ 2,1 -norm to define the data reconstruction term of NMF framework.Compared with ℓ 2 -norm, ℓ 2,1norm can keep the feature rotation invariance together with minimizing outlier impact.The ℓ 2,1 -norm of matrix  can be defined as correspondingly, the ℓ 2,1 robust loss function of NMF can be formulated as With the purpose of finding a robust low-dimensional representation with the nature of structure preservation, we explicitly incorporate ( 14) and ( 15) into the robust NMF framework.The objective function of the proposed RSPNMF is described as follows: From ( 18), we can find that RSPNMF divides the ultimate matrix factorization framework into two parts, one for structure preservation and the other for data reconstruction.
The constants  and  are the tradeoff parameters to balance the contribution of different regularization terms.When  = 0, RSPNMF degenerates to GNMF [6].When  = 0 and  = 0, RSPNMF even degrade to the same form of NMF [23].
Because the objective function showed in (18) is not convex on both  and  together, in the following, we will manage to achieve its local minima with multiplicative iterative update algorithm.
Using the KKT conditions     = 0,   V 1  = 0, and   V 2  = 0, we obtain the following multiplicative update rules for RSPNMF: To avoid the scale transfer problem which was pointed out in [44], we normalize  and  as follows: The total procedure for solving the minimization problem of ( 18) is presented in Algorithm 1, from which we can obtain a low-dimensional representation of input data samples.

Convergence Analysis.
Due to the objective function shown in (18) is nonconvex, only local optimal solutions can be obtained.In this section, we use the property of auxiliary function approach to prove the objective function   is nonincreasing under the update formulas in ( 21) and (22).
In order to prove the objective function   is nonincreasing under the update formula in (21), we only need to construct a proper auxiliary function   (,   ) which satisfies   (,   ) ≥   () and   (, ) =   (), where () denotes the part of   which is only relevant to  and   () is the (, )th element of ().
From Lemma 2, we can find that   () is nonincreasing under the update rule in (21).
Let   (,   )/  = 0; there must be  * , which makes   (,   ) attain the minima, so we can obtain that Convergence Analysis for  1 .Similar to the convergence analysis for , the first and the second derivate of   with respect to  1 can be described as The Taylor series expansion of   ( 1 ) is as follows: Now, we construct a new auxiliary function for   ( 1 ): and we prove which satisfies   ( 1 ,  1  ) ≥   ( 1 ) and   ( 1 ,  1 ) ≥   ( 1 ).
Proof.Obviously,   ( 1 ,  1 ) =   ( 1 ) is true; we only need to prove   ( 1 ,  1  ) ≥   ( 1 ).Since we can deduce that )/V 1  = 0; we can find that Finally, we have The convergence analysis for  2 is similar to the convergence analysis for  1 , and it is omitted.

Computational Complexity Analysis.
With the update rules in ( 21) and ( 22), it is not hard to calculate the extra computational cost of our proposed algorithm in each iteration, and we summarize the result in Table 1.It is important to note that matrix  is sparse, because the local graph  is constructed with  1 neighbors.The average nonzero element on each row of  is  1 .Thus, we only need  1  and flam (a floating point addition and multiplication) to compute  1 .
Besides the multiplicative updates, RSPNMF also needs ( 2 ) to construct the local affinity graph  and distant repulsion graph   .If the multiplicative updates stop after  iterations, the overall cost for RSPNMF is (+2 2 ).The overall cost for GNMF is ( +  2 ) and the overall cost for SPNMF is (( + ) +  2 ).

Experimental Results
In this section, we carry out several experiments to examine the clustering performance of our approach in comparison with some state-of-the-art methods.

Datasets Description.
Three image datasets which are widely employed in dimensionality reduction literatures are used in our experiments.
(1) ORL Dataset.The ORL dataset contains 400 face images of 40 subjects under different lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (with glasses or no glasses).All these images were taken against a dark homogeneous background with the subjects in an upright, frontal position.In our experiments, original images are firstly normalized in scale and orientation, such that two eyes are aligned at the same position.Then, the facial areas are cropped into the final images for clustering.Finally, each image is normalized to array of 32 × 32 pixels and reshaped to a long vector with 256 gray levels per pixel.
(2) YALE Dataset.The YALE dataset includes 165 face images collected from 15 volunteers.Each image has different facial expression and facial details.The same preprocessing as for the ORL dataset is implemented on YALE dataset.
(3) FERET Dataset.For the FERET dataset, we use 70 subjects with 6 facial images for each subject.Each image is also represented by a 1024-dimensional vector in image space.
The statistic property of these datasets is listed in Table 2, which includes the number of samples, the dimension of features, and the number of object classes.
As showed in Figure 1, to evaluate the robustness of mentioned algorithms to noisy data, all the images were occluded with local and global noise.The local noise consists of occlusion patch of 12 × 12 pixels with random black and white dots.The global noise is replaced by Gaussian noise with mean 0 and variance 36.

Compared Methods.
To investigate the clustering performance of our proposed approach, we compared it with the following methods: (i) Canonical -means (means in short) clustering method as a baseline.
(ii) -means clustering in the principle component subspace (PCA in short).
(iii) Nonnegative matrix factorization based clustering (NMF in short).We use the -norm formulation.
Except the means, for each image, a new lowdimensional representation can be derived from the matrix factorization, and then we perform clustering by using means in the new embedding subspace.

Evaluation Metrics.
To evaluate the clustering performance of all involved methods, we adopted two popular metrics, the accuracy (AC) and the normalized mutual information (NMI), as used in [6].
The accuracy indicates one-to-one relationship between obtained cluster label and ground truth label.The accuracy is defined as follows: where (map(  ),   ) is the delta function which judges whether the ground truth   is equal to the obtained label   and map(  ) is a permutation mapping function that maps each obtained cluster label   to the equivalent label from the dataset.
The second measure metrics is the normalized mutual information (NMI), which is used for determining the quality of clustering.
Let  and   denote the sets of cluster from the ground truth and the clustering method, respectively.The mutual information metric MI(,   ) is defined as follows: where (  ) and (   ) denote the extent of a sample belonging to the clusters   and    , respectively.(  ,    ) is a joint probability that an arbitrarily selected sample belongs to   as well as    at the same time.The NMI can be defined as follows: where () and (  ) are the entropies of  and   , respectively.It can be seen that NMI(,   ) ranges from 0 to 1.The larger the NMI is, the better the clustering result will be.

Clustering Results and Analysis.
In order to fair examine the clustering performance of all mentioned methods, 20 test runs are conducted on the optimal parameter setting, and then the means and standard deviations are recorded.In each test, means is applied 20 times with different initialization and the best result is reported.All clustering performances on the three noise contaminated datasets are reported in Tables 3 and 4.
From these results, we can derive the following observations: (a) SPNMF outperforms means, PCA, and NMF in most cases.This demonstrates the importance of structure preservation in discovering the hidden factors.In addition, the performance of SPNMF is superior to GNMF.This reveals that, by leveraging the ratio of local affinity and distant repulsion regularization, SPNMF can learn a better low-dimensional representation.(b) The performance of RSPNMF exceeds SPNMF; this demonstrates the robustness of ℓ 2,1 -norm in handling noise data samples.4.5.Parameter Sensitiveness.Our RSPNMF has five essential parameters: the number of nearest neighbors  1 , the number of distant neighbors  2 , the regularization parameter , the regularization parameter , and the number of basis vectors .For all mentioned dimensionality reduction algorithms, the number of basis vectors  is equal to the number of clusters.The complement space and the original space reduced dimension  is set to /2.For all graph-based methods, the number of nearest neighbors is set to {3, 5, . . ., 19, 21}.For SPNMF and RSPNMF, the number of distant points is also set to {3, 5, . . ., 19, 21}.For GNMF, the regularization parameter is set by the grid {10 −3 , 10 −2 , . . ., 10 3 }.For SPNMF and RSPNMF, we tune  and  by searching the grid {10 −3 , 10 −2 , . . ., 10 3 }.For SPNMF,  is set to 1.
The result of parameter sensitiveness on different datasets with local and global noise is showed in Figure 2. In order to compare with GNMF, we fix the parameter  and tune the parameter .From Figure 2, we can find that the performance of RSPNMF and SPNMF is superior to GNMF; meanwhile, they are all stable with respect to the change of parameter .This demonstrates the importance of the distant repulsion preservation on characterizing the intrinsic distribution of data samples.Besides, RSPNMF still outperforms SPNMF when  range within {0.001, 10}.This is because the robust data reconstruction term of RSPNMF alleviates the noisy data impact on clustering performance.
The relevant samples selection plays an important role in all graph regularized NMF method.For GNMF, which only use  1 -nearest neighboring graph to capture the local geometric structure of data distribution; meanwhile, assume that the neighboring samples are with the same semantic label.For noise corrupted data samples, this assumption may not hold.In particular, as  1 increases, clustering performance of GNMF is more likely to fail as shown in Figure 3.For RSPNMF and SPNMF, the local and distant data samples relationship is also considered.This will definitely reinforce the algorithm robustness to  1 -nearest neighboring samples increase.In addition, from Figure 3 we can find that the introduction of ℓ 2,1 -norm improves RSPNMF clustering results.

Convergence and Computational
Cost.The effectiveness of the proposed RSPNMF has been demonstrated by the above clustering experiments.Here, we evaluate the efficiency of RSPNMF, that is, how fast the update rules in ( 21) and ( 22) are converged.
Figure 4 shows the convergence curves of RSPNMF on all the three datasets.For each figure, the -axis is the value of objective function and the -axis denotes the iteration     number.From the convergence curves, it can be observed that the convergence rate of RSPNMF is fast, and the runtime cost of RSPNMF is acceptable.

Conclusion
In this paper, we proposed a novel nonnegative matrix factorization approach named RSPNMF.RSPNMF performs the subspace learning with the nature of manifold structure preservation and robust data reconstruction.As a result, RSPNMF is more discriminating than GNMF which only considers the local structure preservation of the clean data.Experimental results on face image clustering show that RSPNMF provides a better representation in the sense of semantic subspace.The superiority of RSPNMF arises in the following two aspects: (a) the objective function of RSPNMF uses the ℓ 2,1norm as the discrepancy measure and therefore alleviates the outlier issue that is commonly existing in image clustering.(b) The regularization term of RSPNMF incorporates the local and distant geometric information, which leads the nonnegative matrix factorization to be performed on the specific manifold.The added geometric information generally improves the clustering results.

Figure 2 :Figure 3 :
Figure 2: The performance of involved methods with respect to the parameter  on different dataset.The result of local noise is shown in the first column.The result of global noise is shown in the second column.

Figure 4 :
Figure 4: The convergence curve of RSPNMF with 500 iterations.The result of local noise is shown in the first column.The result of global noise is shown in the second column.

Table 1 :
Computational operation counts for updating and  in each iteration.

Table 2 :
Statistic of real-world datasets.

Table 3 :
Clustering performance on local noise contaminated datasets.The best performances are highlighted in bold font.

Table 4 :
Clustering performance on global noise contaminated datasets.The best performances are highlighted in bold font.