Neighborhood Preserving Convex Nonnegative Matrix Factorization

Convex nonnegative matrix factorization (CNMF) is a variation of nonnegative matrix factorization (NMF) in which each cluster is expressed as a linear combination of the data points and each data point is represented as a linear combination of the cluster centers. When the manifold structure of the data is nonlinear, neither NMF nor CNMF can characterize the geometric structure of the data. This paper introduces neighborhood preserving convex nonnegative matrix factorization (NPCNMF), which imposes on CNMF the additional constraint that each data point can be represented as a linear combination of its neighbors. Our method is thus able to reap the benefits of both nonnegative data factorization and manifold structure preservation. An efficient multiplicative updating procedure is derived, and its convergence is guaranteed theoretically. The feasibility and effectiveness of NPCNMF are verified on several standard data sets with promising results.


Introduction
Nonnegative matrix factorization (NMF) [1, 2] has been widely used in information retrieval, computer vision, pattern recognition, and DNA gene expression analysis [3, 4]. NMF decomposes the data matrix as the product of two matrices whose elements are all nonnegative. Many researchers have noted that such a decomposition has favorable properties over other similar decompositions, such as PCA. One of the most useful properties of NMF is that it usually leads to a parts-based representation, because it allows only additive, not subtractive, combinations. Such a representation encodes much of the data, making it easy to interpret. NMF can be traced back to the 1970s and was studied extensively by Paatero and Tapper [5]. The work of Lee and Seung [1] brought much attention to NMF in the machine learning and data mining fields. Since then, various extensions and variations of NMF have been proposed. Li et al. [4] proposed the local nonnegative matrix factorization (LNMF) algorithm, which imposes extra constraints on the cost function to obtain more localized and parts-based image features. Hoyer [6, 7] employed sparsity constraints to improve local data representation, while nonnegative tensor factorization was studied in [8, 9] by Hazan et al. to handle data encoded as high-order tensors. All the methods mentioned above are unsupervised. Wang et al. [10] and Zafeiriou et al. [11] independently proposed Fisher-NMF, further studied by Kotsia et al. [12], which adds a constraint seeking to maximize the between-class scatter and minimize the within-class scatter in the subspace spanned by the bases.
One of the most important drawbacks of NMF and its variants is that these methods have to be performed in the original feature space of the data points, so that NMF cannot be kernelized and the powerful idea of the kernel method cannot be applied to it. Ding et al. [13] proposed convex nonnegative matrix factorization (CNMF), which strives to address this problem while inheriting all the strengths of NMF; it models each cluster as a linear combination of the data points and each data point as a linear combination of the cluster centers. The major advantage of CNMF over NMF is that it can be performed on any data representation, either in the original space or in a reproducing kernel Hilbert space (RKHS).
Recently, there has been a lot of interest in geometrically motivated approaches to data analysis in high dimensional spaces. When the data live on or close to a nonlinear low dimensional manifold embedded in the high dimensional ambient space [14, 15], Euclidean distance is incapable of characterizing the geometric structure of the data, and hence traditional methods like NMF and CNMF no longer work well. Neither CNMF nor NMF exploits the geometric structure of the data; both assume that the data points are sampled from a Euclidean space. To address this problem, Cai et al. proposed graph regularized NMF (GNMF) [16] and locally consistent concept factorization (LCCF) [17], which assume that nearby data points are likely to be in the same cluster, that is, the cluster assumption [18, 19]. The Euclidean and manifold geometry are unified through a regularization framework, which has a better interpretation from the manifold perspective.
In this paper, we introduce a novel matrix factorization algorithm, called neighborhood preserving convex nonnegative matrix factorization (NPCNMF), which is based on the assumption that if a data point can be reconstructed from its neighbors in the input space, then it can be reconstructed from its neighbors by the same reconstruction coefficients in the low dimensional subspace, that is, the local linear embedding assumption [20]. NPCNMF not only inherits the advantages of CNMF, for example, nonnegativity, but also overcomes its shortcoming, namely, the Euclidean assumption. We also propose a multiplicative algorithm to efficiently solve the corresponding optimization problem, and its convergence is theoretically guaranteed.
The rest of this paper is organized as follows. In Section 2, we briefly review NMF and CNMF. In Section 3, we introduce our algorithm and provide the proof of its convergence. Experiments on three benchmark face recognition data sets are presented in Section 4. Finally, we draw a conclusion and provide suggestions for future work.

A Review of NMF and CNMF
Nonnegative matrix factorization (NMF) factorizes the data matrix into one nonnegative basis matrix and one nonnegative coefficient matrix. Given a nonnegative data matrix X = [x_1, x_2, ..., x_n] ∈ R^{m×n}, NMF seeks two nonnegative matrices U ∈ R^{m×k} and V ∈ R^{n×k} which minimize the following objective function:

J_NMF = ‖X − UV^T‖²_F,  (1)

where ‖·‖_F is the Frobenius norm. The objective function is a joint optimization problem over the basis matrix U and the coefficient matrix V. Although it is not jointly convex in U and V, it is convex with respect to each of them when the other one is fixed. Therefore, it is unrealistic to expect an algorithm to find the global minimum of J_NMF.
To optimize the objective, Lee and Seung [2] presented an iterative multiplicative updating algorithm as follows:

u_ik ← u_ik (XV)_ik / (UV^TV)_ik,  v_jk ← v_jk (X^TU)_jk / (VU^TU)_jk.  (2)

It is proved that the above update steps find a local minimum of the objective function in (1).
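As a concrete reference, the two update rules above can be sketched in a few lines of NumPy. This is a minimal illustration; the function name, random initialization, fixed iteration count, and the small `eps` guard against division by zero are our own choices, not part of the original formulation.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for min ||X - U V^T||_F^2 with U, V >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps
    V = rng.random((n, k)) + eps
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)    # u_ik <- u_ik (XV)_ik / (UV^TV)_ik
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)  # v_jk <- v_jk (X^TU)_jk / (VU^TU)_jk
    return U, V
```

Because every factor in the updates is nonnegative, U and V stay nonnegative throughout, which is the mechanism behind the parts-based representation discussed above.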
In reality, we have k ≪ m and k ≪ n. Thus, NMF essentially tries to find a compressed approximation of the original matrix, X ≈ UV^T. We can view this approximation column by column as

x_j ≈ Σ_k u_k v_jk,  (3)

where u_k is the kth column vector of U. Thus, each data vector x_j is approximated by a linear combination of the columns of U, weighted by the components of V. One limitation of NMF is that the nonnegativity requirement is not applicable to applications where the data involve negative numbers. The second is that it is not clear how to effectively perform NMF in a transformed data space, so that the powerful kernel method cannot be applied. To overcome these problems, Ding et al. [13] proposed the convex nonnegative matrix factorization (CNMF) algorithm, which applies to both nonnegative and mixed-sign data matrices. CNMF models each basis as a linear combination of the data points and each data point as a linear combination of all the bases. Translating these statements into mathematics, we have

u_k = Σ_i w_ik x_i,  (4)
x_j ≈ Σ_k u_k v_jk,  (5)

where w_ik is a nonnegative weight relating data point x_i to the kth basis and v_jk is the nonnegative projection value of x_j onto the kth basis. Replacing u_k in (5) with (4), we have

x_j ≈ Σ_k Σ_i w_ik v_jk x_i.  (6)

We form the m × n data matrix X = [x_1, x_2, ..., x_n] using the feature vector of data point x_j as the jth column, the n × k weight matrix W = [w_ik], and the n × k projection matrix V = [v_jk]. From (6), we have

X ≈ XWV^T.  (7)

Equation (7) can be interpreted as an approximation of the original data set. CNMF minimizes the squared error

J_CNMF = ‖X − XWV^T‖²_F.  (8)

The matrices W and V are updated iteratively until convergence using the following rules:

w_ik ← w_ik √( (Y^+V + Y^−WV^TV)_ik / (Y^−V + Y^+WV^TV)_ik ),  (9)
v_jk ← v_jk √( (Y^+W + VW^TY^−W)_jk / (Y^−W + VW^TY^+W)_jk ),  (10)

where Y = X^TX and the matrices Y^+ and Y^− are given by Y^+_ij = (|Y_ij| + Y_ij)/2 and Y^−_ij = (|Y_ij| − Y_ij)/2, respectively.
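The CNMF iteration can likewise be sketched in NumPy. This is a minimal sketch of the update rules of Ding et al. [13] as summarized above, assuming the Y^+/Y^− splitting of Y = X^TX; the function name, initialization, and `eps` guard are our own choices.

```python
import numpy as np

def cnmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates for CNMF, X ~ X W V^T, valid for mixed-sign X."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((n, k)) + eps
    V = rng.random((n, k)) + eps
    Y = X.T @ X
    Yp = (np.abs(Y) + Y) / 2   # Y^+ : positive part
    Ym = (np.abs(Y) - Y) / 2   # Y^- : negative part
    for _ in range(n_iter):
        VtV = V.T @ V
        W *= np.sqrt((Yp @ V + Ym @ W @ VtV + eps) / (Ym @ V + Yp @ W @ VtV + eps))
        V *= np.sqrt((Yp @ W + V @ (W.T @ Ym @ W) + eps) /
                     (Ym @ W + V @ (W.T @ Yp @ W) + eps))
    return W, V
```

Note that the data enter only through Y = X^TX, which is what makes the kernel extension mentioned earlier possible: Y can be replaced by any kernel matrix.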

Neighborhood Preserving Convex Nonnegative Matrix Factorization
In this section, we introduce our neighborhood preserving convex nonnegative matrix factorization method, which takes the local linear embedding constraint as an additional requirement. The method presented in this paper is fundamentally motivated by neighborhood preserving embedding.

The Objective Function.
Many real world data are actually sampled from a nonlinear low dimensional manifold embedded in the high dimensional ambient space. Both NMF and CNMF perform the factorization in the Euclidean space. They fail to discover the local geometrical structure of the data space, which is essential to the clustering problem. Neighborhood preserving embedding (NPE) aims at preserving the local manifold structure. Specifically, each data point is represented as a linear combination of its neighboring data points, and the combination coefficients are specified in a weight matrix. We can then find an optimal embedding such that the combination coefficients are preserved in the low dimensional subspace.
For each data point x_i, we find its p nearest neighbors. We can then characterize the local geometrical structure by the linear coefficients that reconstruct each data point from its neighbors. The reconstruction coefficients are computed by minimizing the following objective function:

min_M Σ_{i=1}^n ‖x_i − Σ_j M_ij x_j‖²,

subject to Σ_j M_ij = 1 for each i, where M_ij = 0 unless x_j is one of the p nearest neighbors of x_i (we write the neighbor-weight matrix as M to distinguish it from the CNMF factor W). The local geometry encoded by M can then be preserved in the dimensionality reduced space by minimizing

min_V tr(V^TLV),  (12)

where tr(·) denotes the trace of a matrix, I ∈ R^{n×n} is the identity matrix, and L = (I − M)^T(I − M). By minimizing (12), we essentially formalize our intuition that if a data point can be represented by its neighbors in the original space, then it can be represented by its neighbors with the same combination coefficients in the dimensionality reduced space.
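The neighbor-weight computation above is the same constrained least squares step used in LLE/NPE. The following is a minimal NumPy sketch, assuming columns of X are data points; the function name and the Tikhonov regularization of the local Gram matrix (a standard numerical stabilizer, not part of the formulation above) are our own choices. We call the weight matrix M here.

```python
import numpy as np

def reconstruction_weights(X, p=5, reg=1e-3):
    """Solve min ||x_i - sum_j M_ij x_j||^2 over the p nearest neighbors of each
    x_i, subject to sum_j M_ij = 1, and build L = (I - M)^T (I - M)."""
    n = X.shape[1]                        # columns of X are data points
    M = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(X - X[:, [i]], axis=0)
        d[i] = np.inf                     # exclude the point itself
        nbrs = np.argsort(d)[:p]
        Z = X[:, nbrs] - X[:, [i]]        # neighbors shifted to the origin
        C = Z.T @ Z                       # local Gram matrix
        C += np.eye(p) * reg * np.trace(C)  # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(p))
        M[i, nbrs] = w / w.sum()          # enforce the sum-to-one constraint
    I = np.eye(n)
    return M, (I - M).T @ (I - M)
```

Since each row of M sums to one, (I − M) annihilates the all-ones vector, so L is symmetric positive semidefinite with L·1 = 0.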
With the neighborhood preserving constraint, NPCNMF incorporates (8) and minimizes the following objective function:

J = ‖X − XWV^T‖²_F + μ tr(V^TLV),  (13)

where μ is a positive regularization parameter controlling the contribution of the additional constraint. We call (13) neighborhood preserving convex nonnegative matrix factorization (NPCNMF). When μ = 0, (13) degenerates to the original CNMF.
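For reference, the objective in (13) is straightforward to evaluate numerically. A small sketch; the helper name is ours.

```python
import numpy as np

def npcnmf_objective(X, W, V, L, mu=100.0):
    """Objective (13): squared CNMF reconstruction error plus the
    neighborhood preserving penalty mu * tr(V^T L V)."""
    fit = np.linalg.norm(X - X @ W @ V.T, 'fro') ** 2
    penalty = np.trace(V.T @ L @ V)
    return fit + mu * penalty
```

With mu = 0 this reduces to the CNMF squared error (8), matching the degeneration noted above.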

The Algorithm.
We introduce an iterative algorithm to find a local minimum of the optimization problem. By defining K = X^TX and using the matrix properties ‖A‖²_F = tr(A^TA), tr(AB) = tr(BA), and tr(A) = tr(A^T), we can rewrite the objective function J as follows:

J = tr(K) − 2 tr(VW^TK) + tr(VW^TKWV^T) + μ tr(V^TLV).  (14)

This is a typical constrained optimization problem and can be solved using the Lagrange multiplier method. Let ψ_ik and φ_jk be the Lagrange multipliers for the constraints w_ik ≥ 0 and v_jk ≥ 0, respectively, and let Ψ = [ψ_ik] and Φ = [φ_jk]. The Lagrangian function is

ℒ = J + tr(ΨW^T) + tr(ΦV^T).  (15)

The partial derivatives of ℒ with respect to W and V are

∂ℒ/∂W = −2KV + 2KWV^TV + Ψ,  (16)
∂ℒ/∂V = −2KW + 2VW^TKW + 2μLV + Φ.  (17)

Using the Karush-Kuhn-Tucker conditions ψ_ik w_ik = 0 and φ_jk v_jk = 0, we get the following equations for w_ik and v_jk:

(−KV + KWV^TV)_ik w_ik = 0,  (18)
(−KW + VW^TKW + μLV)_jk v_jk = 0.  (19)

Writing K = K^+ − K^− and L = L^+ − L^−, where K^+_ij = (|K_ij| + K_ij)/2 and K^−_ij = (|K_ij| − K_ij)/2 (and similarly for L^±), these equations lead to the following updating formulas:

w_ik ← w_ik √( (K^+V + K^−WV^TV)_ik / (K^−V + K^+WV^TV)_ik ),  (20)
v_jk ← v_jk √( (K^+W + VW^TK^−W + μL^−V)_jk / (K^−W + VW^TK^+W + μL^+V)_jk ).  (21)

Note that the solution minimizing the criterion function J is not unique. If W and V are a solution, then WD and VD^{−1} also form a solution for any positive diagonal matrix D. To make the solution unique, we further require that w_k^TKw_k = 1, where w_k is the kth column vector of W. The matrix V is adjusted accordingly so that WV^T does not change. This can be achieved by

w_k ← w_k / √(w_k^TKw_k),  v_k ← v_k √(w_k^TKw_k).

Convergence Analysis.
In this section, we investigate the convergence of the updating formulas (20) and (21). We use the auxiliary function approach [16] to prove the convergence.
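Putting the pieces together, one NPCNMF iteration, including the positive/negative splitting and the normalization step, can be sketched as follows. This is our own minimal NumPy rendering of the update rules and normalization described above, with `eps` as a numerical guard; it is a sketch, not the authors' reference implementation.

```python
import numpy as np

def pos_neg(A):
    """Split A into its positive and negative parts, A = A^+ - A^-."""
    return (np.abs(A) + A) / 2, (np.abs(A) - A) / 2

def npcnmf_update(X, W, V, L, mu=100.0, eps=1e-10):
    """One multiplicative step for J = ||X - X W V^T||_F^2 + mu tr(V^T L V),
    followed by the normalization w_k^T K w_k = 1 (V rescaled to compensate)."""
    K = X.T @ X
    Kp, Km = pos_neg(K)
    Lp, Lm = pos_neg(L)
    VtV = V.T @ V
    W = W * np.sqrt((Kp @ V + Km @ W @ VtV + eps) /
                    (Km @ V + Kp @ W @ VtV + eps))
    V = V * np.sqrt((Kp @ W + V @ (W.T @ Km @ W) + mu * (Lm @ V) + eps) /
                    (Km @ W + V @ (W.T @ Kp @ W) + mu * (Lp @ V) + eps))
    # normalize columns so that w_k^T K w_k = 1; rescale V so W V^T is unchanged
    s = np.sqrt(np.einsum('ik,ij,jk->k', W, K, W))
    return W / s, V * s
```

The rescaling by the positive diagonal factors s and 1/s leaves the product WV^T unchanged, which is exactly the non-uniqueness resolved by the normalization requirement above.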
Here we first introduce the definition of an auxiliary function [16].

Definition 1. Z(h, h′) is an auxiliary function of J(h) if the conditions

Z(h, h′) ≥ J(h),  Z(h, h) = J(h)

are satisfied.

Lemma 2.
If Z is an auxiliary function for J, then J is nonincreasing under the update

h^(t+1) = arg min_h Z(h, h^(t)).

Proof. Consider J(h^(t+1)) ≤ Z(h^(t+1), h^(t)) ≤ Z(h^(t), h^(t)) = J(h^(t)).

The correctness and convergence of the algorithm are addressed in the following.
For given K, fixing V, and considering any element w_ik in W, we use J(W) to denote the part of J that is only relevant to w_ik. Setting H = W, B^+ = K^+V, B^− = K^−V, and A = V^TV, we get

J(H) = −2 tr(H^TB^+) + 2 tr(H^TB^−) + tr(H^TK^+HA) − tr(H^TK^−HA).

Theorem 4. The function Z(H, H′) obtained by collecting the bounds derived below is an auxiliary function of J(H). Furthermore, it is a convex function of H, and its global minimum is

h_ik = h′_ik √( (B^+ + K^−H′A)_ik / (B^− + K^+H′A)_ik ).

From this minimum, setting H^(t+1) ← H and H^(t) ← H′, one recovers (20).

Proof. We find upper bounds for each of the two positive terms and lower bounds for each of the two negative terms. For the quadratic positive term in J(H), applying Lemma 3 with A ← K^+, B ← A, and S ← H, we obtain the upper bound

tr(H^TK^+HA) ≤ Σ_ik (K^+H′A)_ik h_ik² / h′_ik.

The linear positive term in J(H) is bounded by

2 tr(H^TB^−) ≤ Σ_ik B^−_ik (h_ik² + h′_ik²) / h′_ik,

using the inequality ab ≤ (a² + b²)/2, which holds for any a, b > 0. To obtain lower bounds for the two remaining terms, we use the inequality z ≥ 1 + log z, which holds for any z > 0. The linear negative term in J(H) is bounded by

2 tr(H^TB^+) ≥ 2 Σ_ik B^+_ik h′_ik (1 + log(h_ik/h′_ik)),

and the quadratic negative term is bounded by

tr(H^TK^−HA) ≥ Σ_{i,j,k,l} K^−_ij h′_jl A_lk h′_ik (1 + log(h_ik h_jl / (h′_ik h′_jl))).

Collecting all bounds, we obtain Z(H, H′). Obviously, J(H) ≤ Z(H, H′) and J(H) = Z(H, H).

To find the minimum of Z(H, H′), we take the partial derivatives ∂Z/∂h_ik. The Hessian matrix of Z(H, H′) is a diagonal matrix with positive entries; thus, Z(H, H′) is a convex function of H. Therefore, we obtain the global minimum by setting ∂Z(H, H′)/∂h_ik = 0 and solving for h_ik. Rearranging, we obtain (20).

Theorem 5. Updating W using (20) monotonically decreases the value of the objective in (13); hence it converges.
For given K, fixing W, and considering any element v_jk in V, we use J(V) to denote the part of J that is only relevant to v_jk. We get

J(V) = −2 tr(VW^TK) + tr(VW^TKWV^T) + μ tr(V^TLV).
Theorem 6. Setting H = V, B^+ = K^+W, B^− = K^−W, and A^± = W^TK^±W, the function Z(H, H′) obtained by collecting the bounds derived below is an auxiliary function of J(H); that is, it satisfies the requirements J(H) ≤ Z(H, H′) and J(H) = Z(H, H). Furthermore, it is a convex function of H, and its global minimum is

h_jk = h′_jk √( (B^+ + H′A^− + μL^−H′)_jk / (B^− + H′A^+ + μL^+H′)_jk ).

From this minimum, setting H^(t+1) ← H and H^(t) ← H′, one recovers (21).

Proof. The function J(H) is

J(H) = −2 tr(H^TB^+) + 2 tr(H^TB^−) + tr(HA^+H^T) − tr(HA^−H^T) + μ tr(H^TL^+H) − μ tr(H^TL^−H).
We find upper bounds for each of the three positive terms and lower bounds for each of the three negative terms. For the term tr(HA^+H^T), applying Lemma 3 with A ← I, B ← A^+, and S ← H, we obtain the upper bound

tr(HA^+H^T) ≤ Σ_jk (H′A^+)_jk h_jk² / h′_jk.

The term 2 tr(H^TB^−) is bounded by

2 tr(H^TB^−) ≤ Σ_jk B^−_jk (h_jk² + h′_jk²) / h′_jk,

using the inequality ab ≤ (a² + b²)/2, which holds for any a, b > 0. For the term μ tr(H^TL^+H), applying Lemma 3 with A ← L^+, B ← I, and S ← H, we obtain the upper bound

μ tr(H^TL^+H) ≤ μ Σ_jk (L^+H′)_jk h_jk² / h′_jk.

To obtain lower bounds for the three remaining terms, we use the inequality z ≥ 1 + log z, which holds for any z > 0. The term 2 tr(H^TB^+) is bounded by

2 tr(H^TB^+) ≥ 2 Σ_jk B^+_jk h′_jk (1 + log(h_jk/h′_jk)),

the term tr(HA^−H^T) is bounded by

tr(HA^−H^T) ≥ Σ_{j,k,l} A^−_kl h′_jk h′_jl (1 + log(h_jk h_jl / (h′_jk h′_jl))),

and the term μ tr(H^TL^−H) is bounded by

μ tr(H^TL^−H) ≥ μ Σ_{i,j,k} L^−_ij h′_ik h′_jk (1 + log(h_ik h_jk / (h′_ik h′_jk))).

Collecting all bounds, we obtain Z(H, H′). Obviously, J(H) ≤ Z(H, H′) and J(H) = Z(H, H).

To find the minimum of Z(H, H′), we take the partial derivative

∂Z(H, H′)/∂h_jk = 2(B^− + H′A^+ + μL^+H′)_jk h_jk/h′_jk − 2(B^+ + H′A^− + μL^−H′)_jk h′_jk/h_jk.

The Hessian matrix containing the second derivatives,

∂²Z/∂h_jk∂h_il = δ_ji δ_kl ( 2(B^− + H′A^+ + μL^+H′)_jk / h′_jk + 2(B^+ + H′A^− + μL^−H′)_jk h′_jk / h_jk² ),

is a diagonal matrix with positive entries. Thus, Z(H, H′) is a convex function of H. Therefore, we obtain the global minimum by setting ∂Z(H, H′)/∂h_jk = 0 and solving for h_jk. Rearranging, we obtain (21).
Theorem 7. Updating V using (21) will monotonically decrease the value of the objective in (13); hence it converges.

Experimental Results
In this section, we evaluate the performance of the proposed method on face recognition and compare it with popular subspace learning algorithms: four unsupervised ones, namely, principal component analysis (PCA) [21], neighborhood preserving embedding (NPE) [20], local nonnegative matrix factorization (LNMF) [4], and convex nonnegative matrix factorization (CNMF) [13], and one supervised algorithm, namely, linear discriminant analysis (LDA) [21]. We use the nearest neighbor (NN) classifier in the original space as the baseline. We apply the different algorithms to obtain new representations for each chosen data set, and then the NN classifier is applied in the new representation spaces.

Data Preparation.
The experiments are conducted on three data sets: the Cambridge ORL face database, the Yale database, and the CMU PIE face database. The important statistics of these data sets are described below.
The Yale database contains 165 gray scale images of 15 individuals.All images demonstrate variations in lighting condition (left-light, center-light, right-light), facial expression (normal, happy, sad, sleepy, surprised, and wink), and with/without glasses.
The ORL database contains ten different images of each of 40 distinct subjects, thus 400 images in total. For some subjects, the images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

The CMU PIE face database contains more than 40,000 facial images of 68 people. The images were acquired over different poses, under variable illumination conditions, and with different facial expressions. In our experiment, we choose the images from the frontal pose (C27), and each subject has around 49 images with varying illuminations and facial expressions.
In all the experiments, images are preprocessed so that faces are located. Original images are first normalized in scale and orientation such that the two eyes are aligned at the same position. Then the facial areas are cropped to form the final images. Each image is 32 × 32 pixels with 256 gray levels per pixel.

Parameter Settings.
For each data set, we randomly divide it into training and testing sets and evaluate the recognition accuracy on the testing set. In detail, for each individual in the ORL and Yale data sets, we randomly select 2, 3, and 4 images per individual, respectively, as training samples and use the remaining images as test samples, while for each individual in the PIE data set, we randomly select 5, 10, and 20 images per individual as training samples. For each partition, we repeat each experiment 20 times and calculate the average recognition accuracy. In general, the recognition rate varies with the dimension of the face subspace. The best result obtained in the optimal subspace and the corresponding dimensionality for each method are reported.
For the face recognition experiments, several parameters need to be set beforehand. For LDA, we use PCA as a first-step dimensionality reduction to avoid the singularity problem; the dimension of the PCA step is fixed at n − c (the number of training samples minus the number of classes, the standard choice), and LDA is then performed. There are two parameters in our NPCNMF and NPE approaches: the number of nearest neighbors p and the regularization parameter μ. Throughout our experiments, we empirically set the number of nearest neighbors p to 5 and the regularization parameter μ to 100.
Each testing sample x is projected into the linear subspace spanned by the column vectors of the basis matrix U = XW, namely, h ≈ (XW)†x, where (XW)† indicates the pseudoinverse of the matrix XW.
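The projection-plus-nearest-neighbor pipeline described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name; it assumes columns of the data matrices are samples and that W is the trained factor.

```python
import numpy as np

def nn_classify(X_train, y_train, X_test, W):
    """Project samples onto the basis U = X_train W via the pseudoinverse,
    then classify each test sample by 1-nearest-neighbor in the reduced space."""
    U = X_train @ W                 # basis matrix (m x k)
    U_pinv = np.linalg.pinv(U)      # (XW)^dagger
    H_train = U_pinv @ X_train      # k-dimensional representations
    H_test = U_pinv @ X_test
    preds = []
    for h in H_test.T:
        d = np.linalg.norm(H_train - h[:, None], axis=0)
        preds.append(y_train[np.argmin(d)])
    return np.array(preds)
```

The same routine serves as the NN baseline when W is replaced by the projection learned by any of the compared subspace methods.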

Classification Results.
Tables 1, 2, and 3 show the evaluation results of all the methods on the three data sets, respectively, where the value in each entry represents the average recognition accuracy over 20 independent trials and the number in brackets is the corresponding projection dimensionality. These experiments reveal a number of interesting points.
(1) It is clear that dimensionality reduction is beneficial in face recognition. There is a significant increase in performance from using LDA, NPE, NMF, LNMF, and CNMF. However, PCA fails to gain an improvement over the baseline. This is because PCA does not encode discriminative information.
(2) The performances of the nonnegative algorithms NMF, LNMF, and CNMF are much worse than that of the supervised algorithm LDA, which shows that, without considering the label information, nonnegative algorithms cannot guarantee good discriminating power.
(3) Our NPCNMF algorithm outperforms the other five methods. The reason lies in the fact that NPCNMF considers the geometrical structure of the data. This shows that, by leveraging the power of both the parts-based representation and the intrinsic geometrical structure of the data, NPCNMF can learn a better compact representation in the sense of semantic structure.

Conclusion and Future Work
In this paper, we have presented a novel matrix factorization method called NPCNMF for dimensionality reduction. NPCNMF combines the advantages of CNMF with a neighborhood preserving constraint that respects the local geometric structure of the data, and experiments on three face data sets verify its effectiveness.

Lemma 3.
For any nonnegative matrices A ∈ R^{n×n}, B ∈ R^{k×k}, S ∈ R^{n×k}, and S′ ∈ R^{n×k}, where A and B are symmetric, the following inequality holds:

Σ_{i=1}^{n} Σ_{p=1}^{k} (AS′B)_ip S_ip² / S′_ip ≥ tr(S^TASB).

Table 1 :
Face recognition accuracy on the ORL data set. The number in brackets is the corresponding projection dimensionality.

Table 2 :
Face recognition accuracy on the Yale data set. The number in brackets is the corresponding projection dimensionality.

Table 3 :
Face recognition accuracy on the PIE data set. The number in brackets is the corresponding projection dimensionality.