Mathematical Problems in Engineering, Volume 2014, Article ID 154942, doi:10.1155/2014/154942

Research Article

Neighborhood Preserving Convex Nonnegative Matrix Factorization

Wei Jiang, Min Li, and Yongqing Zhang

School of Mathematics, Liaoning Normal University, Dalian 116029, China

Received 27 September 2013; Accepted 6 January 2014; Published 25 February 2014

Academic Editor: Chung-Hao Chen

Copyright © 2014 Wei Jiang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The convex nonnegative matrix factorization (CNMF) is a variation of nonnegative matrix factorization (NMF) in which each cluster is expressed as a linear combination of the data points and each data point is represented as a linear combination of the cluster centers. When the manifold structure of the data is nonlinear, neither NMF nor CNMF can characterize the geometric structure of the data. This paper introduces neighborhood preserving convex nonnegative matrix factorization (NPCNMF), which imposes on CNMF the additional constraint that each data point can be represented as a linear combination of its neighbors. Our method thus combines the benefits of nonnegative data factorization with the preservation of the manifold structure. An efficient multiplicative updating procedure is derived, and its convergence is guaranteed theoretically. The feasibility and effectiveness of NPCNMF are verified on several standard data sets with promising results.

1. Introduction

Nonnegative matrix factorization (NMF) [1, 2] has been widely used in information retrieval, computer vision, pattern recognition, and DNA gene expression analysis [3, 4]. NMF decomposes the data matrix as the product of two matrices that possess only nonnegative elements. Many researchers have noted that such a decomposition has a number of favorable properties over other similar decompositions, such as PCA. One of the most useful properties of NMF is that it usually leads to a parts-based representation, because it allows only additive, not subtractive, combinations. Such a representation encodes much of the data, making it easy to interpret. NMF can be traced back to the 1970s and was studied extensively by Paatero and Tapper . The work of Lee and Seung  brought much attention to NMF in the machine learning and data mining fields. Since then, various extensions and variations of NMF have been proposed. Li et al.  proposed the local nonnegative matrix factorization (LNMF) algorithm, which imposes extra constraints on the cost function to obtain more localized and parts-based image features. Hoyer [6, 7] employed sparsity constraints to improve local data representation, while nonnegative tensor factorization was studied in [8, 9] by Hazan et al. to handle data encoded as high-order tensors. All the methods mentioned above are unsupervised. Wang et al.  and Zafeiriou et al.  independently proposed the Fisher-NMF, which was further studied by Kotsia et al. , by adding a constraint that seeks to maximize the between-class scatter and minimize the within-class scatter in the subspace spanned by the bases.

One of the most important drawbacks of NMF and its variants is that these methods have to be performed in the original feature space of the data points, so they cannot be kernelized and the powerful idea of the kernel method cannot be applied to NMF. Ding et al.  proposed convex nonnegative matrix factorization (CNMF), which addresses this problem while inheriting all the strengths of NMF; it models each cluster as a linear combination of the data points and each data point as a linear combination of the cluster centers. The major advantage of CNMF over NMF is that it can be performed on any data representation, either in the original space or in a reproducing kernel Hilbert space (RKHS).

Recently, there has been considerable interest in geometrically motivated approaches to data analysis in high dimensional spaces. When the data lie on or close to a nonlinear low dimensional manifold embedded in the high dimensional ambient space [14, 15], Euclidean distance is incapable of characterizing the geometric structure of the data, and hence traditional methods like NMF and CNMF no longer work well. Neither NMF nor CNMF exploits the geometric structure of the data; both assume that the data points are sampled from a Euclidean space. To address this problem, Cai et al. proposed graph regularized NMF (GNMF)  and locally consistent concept factorization (LCCF) , which assume that nearby data points are likely to be in the same cluster, that is, the cluster assumption [18, 19]. The Euclidean and manifold geometry are unified through a regularization framework, which has a better interpretation from the manifold perspective.

In this paper, we introduce a novel matrix factorization algorithm, called neighborhood preserving convex nonnegative matrix factorization (NPCNMF), based on the assumption that if a data point can be reconstructed from its neighbors in the input space, then it can be reconstructed from its neighbors by the same reconstruction coefficients in the low dimensional subspace, that is, the locally linear embedding assumption . NPCNMF not only inherits the advantages of CNMF, for example, nonnegativity, but also overcomes its main shortcoming, the Euclidean assumption. We also propose a multiplicative algorithm to efficiently solve the corresponding optimization problem, and its convergence is theoretically guaranteed.

The rest of this paper is organized as follows. In Section 2, we briefly review NMF and CNMF. In Section 3, we introduce our algorithm and provide the proof of the convergence of the proposed algorithm. Experiments on three benchmark face recognition data sets are demonstrated in Section 4. Finally, we draw a conclusion and provide suggestions for future work.

2. A Review of NMF and CNMF

Nonnegative matrix factorization (NMF) factorizes the data matrix into a nonnegative basis matrix and a nonnegative coefficient matrix. Given a nonnegative data matrix $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}_+^{m \times N}$, each column of $X$ is a sample point. NMF aims to find two nonnegative matrices $U \in \mathbb{R}_+^{m \times r}$ and $V \in \mathbb{R}_+^{N \times r}$ which minimize the objective function
(1) $\mathcal{J}_{\mathrm{NMF}} = \|X - UV^T\|_F^2, \quad \text{s.t. } U \geq 0,\ V \geq 0,$
where $\|\cdot\|_F$ is the Frobenius norm.

The objective function is a joint optimization problem over the basis matrix $U$ and the coefficient matrix $V$. Although it is not jointly convex in $U$ and $V$, it is convex with respect to each of them when the other is fixed. Therefore, it is unrealistic to expect an algorithm to find the global minimum of $\mathcal{J}_{\mathrm{NMF}}$. To optimize the objective, Lee and Seung  presented an iterative multiplicative updating algorithm:
(2) $u_{ik}^{t+1} = u_{ik}^{t}\,\frac{(XV)_{ik}}{(UV^TV)_{ik}}, \qquad v_{jk}^{t+1} = v_{jk}^{t}\,\frac{(X^TU)_{jk}}{(VU^TU)_{jk}}.$
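The multiplicative updates in (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the function name, iteration count, and the small constant added to the denominators for numerical safety are our own choices:

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for X (m x N, nonnegative) ~= U V^T."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    U = rng.random((m, r)) + 0.1   # positive init keeps the updates well defined
    V = rng.random((N, r)) + 0.1
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)     # u_ik <- u_ik (XV)_ik / (UV^TV)_ik
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)   # v_jk <- v_jk (X^TU)_jk / (VU^TU)_jk
    return U, V
```

Because each update multiplies the current factor by a nonnegative ratio, nonnegativity is preserved automatically throughout the iterations.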

It can be proved that the above update steps find a local minimum of the objective function in (1).

In practice, we have $r \ll m$ and $r \ll N$. Thus, NMF essentially tries to find a compressed approximation of the original matrix, $X \approx UV^T$. We can view this approximation column by column as
(3) $x_j \approx \sum_{k=1}^{r} u_k v_{jk},$
where $u_k$ is the $k$th column vector of $U$. Thus, each data vector $x_j$ is approximated by a linear combination of the columns of $U$, weighted by the components of $V$. One limitation of NMF is that the nonnegativity requirement is not applicable to applications where the data involve negative numbers. The second is that it is not clear how to effectively perform NMF in a transformed data space so that the powerful kernel method can be applied. To overcome these problems, Ding et al.  proposed the convex nonnegative matrix factorization (CNMF) algorithm, which applies to both nonnegative and mixed-sign data matrices. CNMF requires that each basis vector be a linear combination of the data points, while each data point is approximated by a linear combination of all the bases. Translating these statements into mathematics, we have
(4) $u_k = \sum_{j=1}^{N} x_j w_{jk},$ (5) $x_j \approx \sum_{k=1}^{r} u_k v_{jk},$
where $w_{jk}$ is a nonnegative weight relating data point $x_j$ to the $k$th basis and $v_{jk}$ is a nonnegative projection value of $x_j$. Replacing $u_k$ in (5) with (4), we have
(6) $x_j \approx \sum_{k=1}^{r} \sum_{i=1}^{N} x_i w_{ik} v_{jk}.$
We form the $m \times N$ data matrix $X = [x_1, x_2, \ldots, x_N]$ using the feature vector of data point $x_i$ as the $i$th column, the $N \times r$ weight matrix $W = [w_{ij}]$, and the $N \times r$ projection matrix $V = [v_{ij}]$. From (6), we have
(7) $X \approx XWV^T.$

Equation (7) can be interpreted as an approximation of the original data set, obtained by minimizing the squared error between $X$ and its approximation:
(8) $\mathcal{J}_{\mathrm{CNMF}} = \|X - XWV^T\|^2,$
where $X \in \mathbb{R}^{m \times N}$, $W \in \mathbb{R}_+^{N \times r}$, and $V \in \mathbb{R}_+^{N \times r}$. The matrices $W$ and $V$ are updated iteratively until convergence using the following rules:
(9) $w_{ik} \leftarrow w_{ik}\sqrt{\frac{(Y^+V)_{ik} + (Y^-WV^TV)_{ik}}{(Y^-V)_{ik} + (Y^+WV^TV)_{ik}}}, \qquad v_{ik} \leftarrow v_{ik}\sqrt{\frac{(Y^+W)_{ik} + (VW^TY^-W)_{ik}}{(Y^-W)_{ik} + (VW^TY^+W)_{ik}}},$
where $Y = X^TX$ and the matrices $Y^+$ and $Y^-$ are given by
(10) $Y_{ik}^+ = \tfrac{1}{2}(|Y_{ik}| + Y_{ik}), \qquad Y_{ik}^- = \tfrac{1}{2}(|Y_{ik}| - Y_{ik}),$
respectively.
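The CNMF update rules can be sketched as follows. This is an illustrative NumPy implementation assuming the square-root form of the multiplicative updates used in Ding et al.'s convex NMF; the function names and defaults are ours:

```python
import numpy as np

def pos_neg(Y):
    """Split a matrix into nonnegative parts: Y = Y^+ - Y^-, as in (10)."""
    return (np.abs(Y) + Y) / 2, (np.abs(Y) - Y) / 2

def cnmf(X, r, n_iter=200, eps=1e-10, seed=0):
    """CNMF: X (possibly mixed-sign) ~= X W V^T with W, V >= 0."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    W = rng.random((N, r)) + 0.1
    V = rng.random((N, r)) + 0.1
    Yp, Yn = pos_neg(X.T @ X)           # Y = X^T X
    for _ in range(n_iter):
        VtV = V.T @ V
        W *= np.sqrt((Yp @ V + Yn @ W @ VtV) /
                     (Yn @ V + Yp @ W @ VtV + eps))
        V *= np.sqrt((Yp @ W + V @ (W.T @ Yn @ W)) /
                     (Yn @ W + V @ (W.T @ Yp @ W) + eps))
    return W, V
```

Note that only $Y = X^TX$ enters the updates, which is why CNMF can be kernelized by replacing $Y$ with any kernel matrix.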

3. Neighborhood Preserving Convex Nonnegative Matrix Factorization

In this section, we introduce our neighborhood preserving convex nonnegative matrix factorization method, which takes the local linear embedding constraint as an additional requirement. The method presented in this paper is fundamentally motivated by neighborhood preserving embedding.

3.1. The Objective Function

Many real world data are actually sampled from a nonlinear low dimensional manifold embedded in the high dimensional ambient space. Both NMF and CNMF perform the factorization in the Euclidean space. They fail to discover the local geometrical structure of the data space, which is essential to the clustering problem. NPE aims at preserving the local manifold structure. Specifically, each data point is represented as a linear combination of its neighboring data points, and the combination coefficients are specified in a weight matrix. We can then find an optimal embedding such that these combination coefficients are preserved in the low dimensional subspace.

For each data point, we find its $k$ nearest neighbors, and we characterize the local geometrical structure by the linear coefficients that reconstruct each data point from its neighbors. The reconstruction coefficients are computed by the following objective function:
(11) $\min \sum_i \Big\|x_i - \sum_{x_j \in \mathcal{N}_k(x_i)} w_{ij} x_j\Big\|^2, \quad \text{s.t. } \sum_{x_j \in \mathcal{N}_k(x_i)} w_{ij} = 1,$
and $w_{ij} = 0$ if $x_j \notin \mathcal{N}_k(x_i)$, where $\mathcal{N}_k(x_i)$ denotes the set of $k$ nearest neighbors of $x_i$.

Then the low dimensional representations $v_i$, $1 \leq i \leq N$, can be made to preserve these coefficients by minimizing
(12) $\min \sum_i \Big\|v_i - \sum_{x_j \in \mathcal{N}_k(x_i)} w_{ij} v_j\Big\|^2 = \mathrm{tr}(V^T(I - W)^T(I - W)V) = \mathrm{tr}(V^TLV),$
where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, $I \in \mathbb{R}^{N \times N}$ is the identity matrix, and $L = (I - W)^T(I - W)$. By minimizing (12), we formalize the intuition that if a data point can be reconstructed from its neighbors in the original space, then it can be reconstructed from its neighbors by the same combination coefficients in the dimensionality-reduced space.
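The weights in (11) and the matrix $L$ in (12) can be computed as in locally linear embedding. A sketch follows; the ridge regularization of the local Gram matrix is a standard numerical safeguard, not part of the paper:

```python
import numpy as np

def npe_graph(X, k=5, reg=1e-3):
    """Return the LLE-style weight matrix Wg (each row sums to 1 over its
    k neighbors) and L = (I - Wg)^T (I - Wg); X is m x N, one sample/column."""
    N = X.shape[1]
    Wg = np.zeros((N, N))
    for i in range(N):
        dist = np.linalg.norm(X - X[:, [i]], axis=0)
        idx = np.argsort(dist)[1:k + 1]        # k nearest neighbors, self excluded
        Z = X[:, idx] - X[:, [i]]              # neighbors shifted to the origin
        G = Z.T @ Z                            # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)     # regularize for numerical stability
        w = np.linalg.solve(G, np.ones(k))
        Wg[i, idx] = w / w.sum()               # enforce the sum-to-one constraint
    I = np.eye(N)
    return Wg, (I - Wg).T @ (I - Wg)
```

Since $L = (I - W)^T(I - W)$ is a Gram matrix, it is symmetric positive semidefinite, which is what makes the regularizer $\mathrm{tr}(V^TLV)$ well behaved.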

With the neighborhood preserving constraint, we augment the CNMF objective (8) and minimize
(13) $\mathcal{J} = \|X - XWV^T\|^2 + \lambda\,\mathrm{tr}(V^TLV),$
where $\lambda \geq 0$ is a regularization parameter controlling the contribution of the additional constraint. We call (13) neighborhood preserving convex nonnegative matrix factorization (NPCNMF). With $\lambda = 0$, (13) degenerates to the original CNMF.

3.2. The Algorithm

We introduce an iterative algorithm to find a local minimum of the optimization problem. Defining $K = X^TX$ and using the matrix properties $\|A\|^2 = \mathrm{tr}(A^TA)$, $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, and $\mathrm{tr}(A) = \mathrm{tr}(A^T)$, we can rewrite the objective function $\mathcal{J}$ as
(14) $\mathcal{J} = \mathrm{tr}((X - XWV^T)^T(X - XWV^T)) + \lambda\,\mathrm{tr}(V^TLV) = \mathrm{tr}((I - WV^T)^TK(I - WV^T)) + \lambda\,\mathrm{tr}(V^TLV) = \mathrm{tr}(K) - 2\,\mathrm{tr}(VW^TK) + \mathrm{tr}(VW^TKWV^T) + \lambda\,\mathrm{tr}(V^TLV).$

This is a typical constrained optimization problem and can be solved using the Lagrange multiplier method. Let $\psi_{ij}$ and $\phi_{ij}$ be the Lagrange multipliers for the constraints $w_{ij} \geq 0$ and $v_{ij} \geq 0$, respectively, and let $\Psi = [\psi_{ij}]$ and $\Phi = [\phi_{ij}]$. The Lagrangian function is
(15) $\mathcal{L} = \mathrm{tr}(K) - 2\,\mathrm{tr}(VW^TK) + \mathrm{tr}(VW^TKWV^T) + \lambda\,\mathrm{tr}(V^TLV) + \mathrm{tr}(\Psi W^T) + \mathrm{tr}(\Phi V^T).$

The partial derivatives of $\mathcal{L}$ with respect to $W$ and $V$ are
(16) $\frac{\partial\mathcal{L}}{\partial W} = -2KV + 2KWV^TV + \Psi, \qquad \frac{\partial\mathcal{L}}{\partial V} = -2KW + 2VW^TKW + 2\lambda LV + \Phi.$

Using the Karush-Kuhn-Tucker conditions $\psi_{ik}w_{ik} = 0$ and $\phi_{jk}v_{jk} = 0$, we get the following equations for $w_{ik}$ and $v_{jk}$:
(17) $(-KV + KWV^TV)_{ik}\,w_{ik} = 0, \qquad (-KW + VW^TKW + \lambda LV)_{jk}\,v_{jk} = 0.$

The corresponding equivalent formulas are
(18) $(-KV + KWV^TV)_{ik}\,w_{ik}^2 = 0, \qquad (-KW + VW^TKW + \lambda LV)_{jk}\,v_{jk}^2 = 0.$

Introduce
(19) $A = A^+ - A^-,$
where $A_{ik}^+ = (|A_{ik}| + A_{ik})/2$ and $A_{ik}^- = (|A_{ik}| - A_{ik})/2$.

These equations lead to the following updating formulas:
(20) $w_{ik} \leftarrow w_{ik}\sqrt{\frac{(K^+V)_{ik} + (K^-WV^TV)_{ik}}{(K^-V)_{ik} + (K^+WV^TV)_{ik}}},$
(21) $v_{jk} \leftarrow v_{jk}\sqrt{\frac{(K^+W)_{jk} + (VW^TK^-W)_{jk} + \lambda(L^-V)_{jk}}{(K^-W)_{jk} + (VW^TK^+W)_{jk} + \lambda(L^+V)_{jk}}}.$
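The NPCNMF updates give the following iterative procedure. This is a sketch assuming the square-root form of the multiplicative updates (as in Ding et al.'s convex NMF) and an externally supplied neighborhood matrix $L$; the epsilon guard and defaults are ours:

```python
import numpy as np

def npcnmf(X, L, r, lam=100.0, n_iter=200, eps=1e-10, seed=0):
    """Minimize ||X - X W V^T||^2 + lam * tr(V^T L V) with multiplicative
    updates; K = X^T X and L are split into positive and negative parts
    A^+ = (|A| + A)/2, A^- = (|A| - A)/2."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    W = rng.random((N, r)) + 0.1
    V = rng.random((N, r)) + 0.1
    K = X.T @ X
    Kp, Kn = (np.abs(K) + K) / 2, (np.abs(K) - K) / 2
    Lp, Ln = (np.abs(L) + L) / 2, (np.abs(L) - L) / 2
    for _ in range(n_iter):
        VtV = V.T @ V
        W *= np.sqrt((Kp @ V + Kn @ W @ VtV) /
                     (Kn @ V + Kp @ W @ VtV + eps))
        V *= np.sqrt((Kp @ W + V @ (W.T @ Kn @ W) + lam * (Ln @ V)) /
                     (Kn @ W + V @ (W.T @ Kp @ W) + lam * (Lp @ V) + eps))
    return W, V
```

As with CNMF, the data enter only through $K = X^TX$, so mixed-sign data and kernel matrices are both admissible.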

Note that the solution minimizing the criterion function $\mathcal{J}$ is not unique. If $W$ and $V$ are a solution, then $WD$, $VD^{-1}$ also form a solution for any positive diagonal matrix $D$. To make the solution unique, we further require $w^TKw = 1$, where $w$ is any column vector of $W$. The matrix $V$ is adjusted accordingly so that $WV^T$ does not change. This can be achieved by
(22) $V \leftarrow V\,[\mathrm{diag}(W^TKW)]^{1/2}, \qquad W \leftarrow W\,[\mathrm{diag}(W^TKW)]^{-1/2}.$
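The normalization step (22) amounts to a small helper (the function name is ours):

```python
import numpy as np

def normalize_wv(W, V, K):
    """Enforce diag(W^T K W) = I while keeping W V^T unchanged:
    W <- W D^{-1/2}, V <- V D^{1/2} with D = diag(W^T K W)."""
    d = np.sqrt(np.diag(W.T @ K @ W))
    return W / d, V * d
```

Because $W$ and $V$ are scaled by reciprocal diagonal factors, the product $WV^T$, and hence the approximation $XWV^T$, is left unchanged.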

3.3. Convergence Analysis

In this section, we investigate the convergence of the updating formulas (20) and (21). We use the auxiliary function approach  to prove convergence. We first introduce the definition of an auxiliary function .

Definition 1.

$Z(h, h')$ is an auxiliary function of $F(h)$ if the conditions
(23) $Z(h, h') \geq F(h), \qquad Z(h, h) = F(h)$
are satisfied.

Lemma 2.

If $Z$ is an auxiliary function for $F$, then $F$ is nonincreasing under the update
(24) $h^{(t+1)} = \arg\min_h Z(h, h^{(t)}).$

Proof.

Consider
(25) $F(h^{(t+1)}) \leq Z(h^{(t+1)}, h^{(t)}) \leq Z(h^{(t)}, h^{(t)}) = F(h^{(t)}).$

Lemma 3.

For any nonnegative matrices $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{k \times k}$, $S \in \mathbb{R}^{n \times k}$, and $S' \in \mathbb{R}^{n \times k}$, with $A$ and $B$ symmetric, the following inequality holds:
(26) $\sum_{i=1}^{n}\sum_{p=1}^{k} \frac{(AS'B)_{ip}\,S_{ip}^2}{S'_{ip}} \geq \mathrm{tr}(S^TASB).$

The correctness and convergence of the algorithm are addressed in the following.

For given $K$, fixing $V$ and considering any element $w_{ij}$ in $W$, we use $\mathcal{J}(W)$ to denote the part of $\mathcal{J}$ relevant to $w_{ij}$. We get
(27) $\mathcal{J}(W) = -2\,\mathrm{tr}(VW^TK) + \mathrm{tr}(VW^TKWV^T).$

Theorem 4.

One rewrites $\mathcal{J}(W)$ as
(28) $\mathcal{J}(H) = \mathrm{tr}(-2H^TB^+ + 2H^TB^- + H^TA^+HC - H^TA^-HC),$
where $B = KV$, $A = K$, $C = V^TV$, and $H = W$.

Then the following function
(29) $Z(H, H') = -\sum_{ik} 2B^+_{ik}H'_{ik}\left(1 + \log\frac{H_{ik}}{H'_{ik}}\right) + \sum_{ik} B^-_{ik}\frac{H_{ik}^2 + H'^2_{ik}}{H'_{ik}} + \sum_{ik} \frac{(A^+H'C)_{ik}H_{ik}^2}{H'_{ik}} - \sum_{ijkl} A^-_{ij}H'_{jk}C_{kl}H'_{il}\left(1 + \log\frac{H_{jk}H_{il}}{H'_{jk}H'_{il}}\right)$
is an auxiliary function of $\mathcal{J}(H)$; that is, it satisfies the requirements $\mathcal{J}(H) \leq Z(H, H')$ and $\mathcal{J}(H) = Z(H, H)$. Furthermore, it is a convex function of $H$ and its global minimum is
(30) $H_{ik} = H'_{ik}\sqrt{\frac{B^+_{ik} + (A^-H'C)_{ik}}{B^-_{ik} + (A^+H'C)_{ik}}}.$

Setting $H^{(t+1)} = H$ and $H^{(t)} = H'$, one recovers (20) by letting $B^+ = K^+V$, $B^- = K^-V$, $A = K$, $C = V^TV$, and $H = W$.

Proof.

The function $\mathcal{J}(H)$ is
(31) $\mathcal{J}(H) = \mathrm{tr}(-2H^TB^+ + 2H^TB^- + H^TA^+HC - H^TA^-HC).$

We find upper bounds for each of the two positive terms and lower bounds for each of the two negative terms. For the third term in $\mathcal{J}(H)$, applying Lemma 3 with $A \leftarrow A^+$, $B \leftarrow C$, and $S \leftarrow H$ gives the upper bound
(32) $\mathrm{tr}(H^TA^+HC) \leq \sum_{ik} \frac{(A^+H'C)_{ik}H_{ik}^2}{H'_{ik}}.$

The second term of $\mathcal{J}(H)$ is bounded by
(33) $\mathrm{tr}(H^TB^-) = \sum_{ik} H_{ik}B^-_{ik} \leq \sum_{ik} B^-_{ik}\frac{H_{ik}^2 + H'^2_{ik}}{2H'_{ik}},$
using the inequality $a \leq (a^2 + b^2)/2b$, which holds for any $a, b > 0$.

To obtain lower bounds for the two remaining terms, we use the inequality $z \geq 1 + \log z$, which holds for any $z > 0$. The first term in $\mathcal{J}(H)$ is bounded by
(34) $\mathrm{tr}(H^TB^+) = \sum_{ik} B^+_{ik}H_{ik} \geq \sum_{ik} B^+_{ik}H'_{ik}\left(1 + \log\frac{H_{ik}}{H'_{ik}}\right).$

The last term in $\mathcal{J}(H)$ is bounded by
(35) $\mathrm{tr}(H^TA^-HC) \geq \sum_{ijkl} A^-_{ij}H'_{jk}C_{kl}H'_{il}\left(1 + \log\frac{H_{jk}H_{il}}{H'_{jk}H'_{il}}\right).$

Collecting all bounds, we obtain $Z(H, H')$ as in (29). Obviously, $\mathcal{J}(H) \leq Z(H, H')$ and $\mathcal{J}(H) = Z(H, H)$.

To find the minimum of $Z(H, H')$, we take
(36) $\frac{\partial Z(H, H')}{\partial H_{ik}} = -2B^+_{ik}\frac{H'_{ik}}{H_{ik}} + 2B^-_{ik}\frac{H_{ik}}{H'_{ik}} + 2(A^+H'C)_{ik}\frac{H_{ik}}{H'_{ik}} - 2(A^-H'C)_{ik}\frac{H'_{ik}}{H_{ik}}.$

The Hessian of $Z(H, H')$,
(37) $\frac{\partial^2 Z(H, H')}{\partial H_{ik}\,\partial H_{jl}} = \delta_{ij}\delta_{kl}Y_{ik},$
is a diagonal matrix with positive entries
(38) $Y_{ik} = 2\left[B^+_{ik} + (A^-H'C)_{ik}\right]\frac{H'_{ik}}{H_{ik}^2} + \frac{2\left[B^-_{ik} + (A^+H'C)_{ik}\right]}{H'_{ik}}.$

Thus, $Z(H, H')$ is a convex function of $H$. Therefore, we obtain the global minimum by setting $\partial Z(H, H')/\partial H_{ik} = 0$ in (36) and solving for $H$. Rearranging, we obtain (30).

Theorem 5.

Updating W using (20) will monotonically decrease the value of the objective in (13); hence it converges.

Proof.

By Lemma 2 and Theorem 4, we get $\mathcal{J}(W^{(t)}) = Z(W^{(t)}, W^{(t)}) \geq Z(W^{(t+1)}, W^{(t)}) \geq \mathcal{J}(W^{(t+1)})$, so $\mathcal{J}(W)$ is monotonically decreasing. Since $\mathcal{J}(W)$ is obviously bounded below, the theorem follows.

For given $K$, fixing $W$ and considering any element $v_{ij}$ in $V$, we use $\mathcal{J}(V)$ to denote the part of $\mathcal{J}$ relevant to $v_{ij}$. We get
(39) $\mathcal{J}(V) = -2\,\mathrm{tr}(VW^TK) + \mathrm{tr}(VW^TKWV^T) + \lambda\,\mathrm{tr}(V^TLV).$

Theorem 6.

One rewrites $\mathcal{J}(V)$ as
(40) $\mathcal{J}(H) = \mathrm{tr}(-2H^TB^+ + 2H^TB^- + HA^+H^T - HA^-H^T + \lambda H^TL^+H - \lambda H^TL^-H),$
where $B = KW$, $A = W^TKW$, and $H = V$.

Then the following function
(41) $Z(H, H') = -\sum_{ik} 2B^+_{ik}H'_{ik}\left(1 + \log\frac{H_{ik}}{H'_{ik}}\right) + \sum_{ik} B^-_{ik}\frac{H_{ik}^2 + H'^2_{ik}}{H'_{ik}} + \sum_{ik} \frac{(H'A^+)_{ik}H_{ik}^2}{H'_{ik}} - \sum_{ikl} A^-_{kl}H'_{ik}H'_{il}\left(1 + \log\frac{H_{ik}H_{il}}{H'_{ik}H'_{il}}\right) + \lambda\sum_{ik} \frac{(L^+H')_{ik}H_{ik}^2}{H'_{ik}} - \lambda\sum_{ijk} L^-_{ij}H'_{ik}H'_{jk}\left(1 + \log\frac{H_{ik}H_{jk}}{H'_{ik}H'_{jk}}\right)$
is an auxiliary function of $\mathcal{J}(H)$; that is, it satisfies the requirements $\mathcal{J}(H) \leq Z(H, H')$ and $\mathcal{J}(H) = Z(H, H)$. Furthermore, it is a convex function of $H$ and its global minimum is
(42) $H_{ik} = H'_{ik}\sqrt{\frac{B^+_{ik} + (H'A^-)_{ik} + \lambda(L^-H')_{ik}}{B^-_{ik} + (H'A^+)_{ik} + \lambda(L^+H')_{ik}}}.$

Setting $H^{(t+1)} = H$ and $H^{(t)} = H'$, one recovers (21) by letting $B^+ = K^+W$, $B^- = K^-W$, $A = W^TKW$, and $H = V$.

Proof.

The function $\mathcal{J}(H)$ is
(43) $\mathcal{J}(H) = \mathrm{tr}(-2H^TB^+ + 2H^TB^- + HA^+H^T - HA^-H^T + \lambda H^TL^+H - \lambda H^TL^-H).$

We find upper bounds for each of the three positive terms and lower bounds for each of the three negative terms. For the third term in $\mathcal{J}(H)$, applying Lemma 3 with $A \leftarrow I$, $B \leftarrow A^+$, and $S \leftarrow H$ gives the upper bound
(44) $\mathrm{tr}(HA^+H^T) \leq \sum_{ik} \frac{(H'A^+)_{ik}H_{ik}^2}{H'_{ik}}.$

The second term of $\mathcal{J}(H)$ is bounded by
(45) $\mathrm{tr}(H^TB^-) = \sum_{ik} H_{ik}B^-_{ik} \leq \sum_{ik} B^-_{ik}\frac{H_{ik}^2 + H'^2_{ik}}{2H'_{ik}},$
using the inequality $a \leq (a^2 + b^2)/2b$, which holds for any $a, b > 0$.

For the fifth term in $\mathcal{J}(H)$, setting $A \leftarrow L^+$, $B \leftarrow I$, and $S \leftarrow H$ gives the upper bound
(46) $\lambda\,\mathrm{tr}(H^TL^+H) \leq \lambda\sum_{ik} \frac{(L^+H')_{ik}H_{ik}^2}{H'_{ik}}.$

To obtain lower bounds for the three remaining terms, we use the inequality $z \geq 1 + \log z$, which holds for any $z > 0$. The first term in $\mathcal{J}(H)$ is bounded by
(47) $\mathrm{tr}(H^TB^+) = \sum_{ik} B^+_{ik}H_{ik} \geq \sum_{ik} B^+_{ik}H'_{ik}\left(1 + \log\frac{H_{ik}}{H'_{ik}}\right).$

The fourth term in $\mathcal{J}(H)$ is bounded by
(48) $\mathrm{tr}(HA^-H^T) \geq \sum_{ikl} A^-_{kl}H'_{ik}H'_{il}\left(1 + \log\frac{H_{ik}H_{il}}{H'_{ik}H'_{il}}\right).$

The last term in $\mathcal{J}(H)$ is bounded by
(49) $\lambda\,\mathrm{tr}(H^TL^-H) \geq \lambda\sum_{ijk} L^-_{ij}H'_{ik}H'_{jk}\left(1 + \log\frac{H_{ik}H_{jk}}{H'_{ik}H'_{jk}}\right).$

Collecting all bounds, we obtain $Z(H, H')$ as in (41). Obviously, $\mathcal{J}(H) \leq Z(H, H')$ and $\mathcal{J}(H) = Z(H, H)$.

To find the minimum of $Z(H, H')$, we set
(50) $\frac{\partial Z(H, H')}{\partial H_{ik}} = -2B^+_{ik}\frac{H'_{ik}}{H_{ik}} + 2B^-_{ik}\frac{H_{ik}}{H'_{ik}} + 2(H'A^+)_{ik}\frac{H_{ik}}{H'_{ik}} - 2(H'A^-)_{ik}\frac{H'_{ik}}{H_{ik}} + 2\lambda(L^+H')_{ik}\frac{H_{ik}}{H'_{ik}} - 2\lambda(L^-H')_{ik}\frac{H'_{ik}}{H_{ik}} = 0.$

We have
(51) $-B^+_{ik}H'^2_{ik} + B^-_{ik}H_{ik}^2 + (H'A^+)_{ik}H_{ik}^2 - (H'A^-)_{ik}H'^2_{ik} + \lambda(L^+H')_{ik}H_{ik}^2 - \lambda(L^-H')_{ik}H'^2_{ik} = 0.$

Therefore
(52) $H_{ik} = H'_{ik}\sqrt{\frac{B^+_{ik} + (H'A^-)_{ik} + \lambda(L^-H')_{ik}}{B^-_{ik} + (H'A^+)_{ik} + \lambda(L^+H')_{ik}}}.$

The Hessian matrix containing the second derivatives,
(53) $\frac{\partial^2 Z(H, H')}{\partial H_{ik}\,\partial H_{jl}} = \delta_{ij}\delta_{kl}Y_{ik},$
is a diagonal matrix with positive entries
(54) $Y_{ik} = 2\left[B^+_{ik} + (H'A^-)_{ik} + \lambda(L^-H')_{ik}\right]\frac{H'_{ik}}{H_{ik}^2} + \frac{2\left[B^-_{ik} + (H'A^+)_{ik} + \lambda(L^+H')_{ik}\right]}{H'_{ik}}.$

Thus, $Z(H, H')$ is a convex function of $H$. Therefore, we obtain the global minimum by setting $\partial Z(H, H')/\partial H_{ik} = 0$ as in (50) and solving for $H$, which gives (52). Rearranging, we obtain (21).

Theorem 7.

Updating V using (21) will monotonically decrease the value of the objective in (13); hence it converges.

Proof.

By Lemma 2 and Theorem 6, we get $\mathcal{J}(V^{(t)}) = Z(V^{(t)}, V^{(t)}) \geq Z(V^{(t+1)}, V^{(t)}) \geq \mathcal{J}(V^{(t+1)})$, so $\mathcal{J}(V)$ is monotonically decreasing. Since $\mathcal{J}(V)$ is obviously bounded below, the theorem follows.

4. Experimental Results

In this section, we evaluate the performance of the proposed method on face recognition and compare it with popular subspace learning algorithms: four unsupervised ones, principal component analysis (PCA) , neighborhood preserving embedding (NPE) , local nonnegative matrix factorization (LNMF) , and convex nonnegative matrix factorization (CNMF) , and one supervised algorithm, linear discriminant analysis (LDA) . We use the nearest neighbor (NN) classifier in the original space as the baseline. We apply the different algorithms to obtain new representations for each chosen data set, and then the NN method is applied in the new representation spaces.

4.1. Data Preparation

The experiments are conducted on three data sets: the Cambridge ORL face database, the Yale database, and the CMU PIE face database. The important statistics of these data sets are described below.

The Yale database contains 165 gray scale images of 15 individuals. All images demonstrate variations in lighting condition (left-light, center-light, right-light), facial expression (normal, happy, sad, sleepy, surprised, and wink), and with/without glasses.

The ORL database contains ten different images of each of 40 distinct subjects, thus 400 images in total. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

The CMU PIE face database contains more than 40 000 facial images of 68 people. The images were acquired over different poses, under variable illumination conditions, and with different facial expressions. In our experiment, we choose the images from the frontal pose (C27) and each subject has around 49 images from varying illuminations and facial expressions.

In all the experiments, images are preprocessed so that faces are located. Original images are first normalized in scale and orientation such that the two eyes are aligned at the same position. Then the facial areas were cropped into the final images for clustering. Each image is of 32 × 32 pixels with 256 gray levels per pixel.

4.2. Parameter Settings

For each data set, we randomly divide it into training and testing sets and evaluate the recognition accuracy on the testing set. In detail, for each individual in the ORL and Yale data sets, we randomly select 2, 3, and 4 images per individual, respectively, as training samples and use the rest as test samples, while for each individual in the PIE data set, we randomly select 5, 10, and 20 images per individual as training samples. For each partition, we repeat each experiment 20 times and calculate the average recognition accuracy. In general, the recognition rate varies with the dimension of the face subspace. The best result obtained in the optimal subspace and the corresponding dimensionality for each method are reported.

For the face recognition experiments, several parameters need to be decided beforehand. For LDA, we use PCA as a first dimensionality reduction step to avoid the singularity problem; the dimension of the PCA step is fixed at $N - c$, where $c$ is the number of classes, before performing LDA. There are two parameters in our NPCNMF and the NPE approach: the number of nearest neighbors $k$ and the regularization parameter $\lambda$. Throughout our experiments, we empirically set the number of nearest neighbors $k$ to 5 and the regularization parameter $\lambda$ to 100.

Each testing sample $y$ is projected into the linear subspace spanned by the column vectors of the basis matrix $U = XW$, namely, $h_y = U^{\dagger}y$, where $U^{\dagger}$ denotes the pseudoinverse of $U$.
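Reading the projection as the Moore-Penrose pseudoinverse applied to the learned basis $U = XW$ (the formula in the text is partially garbled, so this is our interpretation), a sketch:

```python
import numpy as np

def project(y, X, W):
    """Low-dimensional code of a test sample y under the CNMF-style basis
    U = X W: h = pinv(U) @ y, i.e., the least-squares coefficients of y in U."""
    U = X @ W
    return np.linalg.pinv(U) @ y
```

Classification then runs the nearest neighbor rule on these codes rather than on the raw pixels.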

4.3. Classification Results

Tables 1, 2, and 3 show the evaluation results of all the methods on the three data sets, respectively, where the value in each entry represents the average recognition accuracy of 20 independent trials, and the number in brackets is the corresponding projection dimensionality. These experiments reveal a number of interesting points.

It is clear that dimensionality reduction is beneficial for face recognition. There is a significant increase in performance from using LDA, NPE, NMF, LNMF, and CNMF. However, PCA fails to gain any improvement over the baseline, because PCA does not encode discriminative information.

The performance of the nonnegative algorithms NMF, LNMF, and CNMF is much worse than that of the supervised algorithm LDA, which shows that without considering label information, nonnegative algorithms cannot guarantee good discriminating power.

Our NPCNMF algorithm outperforms the other five methods because it takes the geometrical structure of the data into account. This shows that by leveraging the power of both the parts-based representation and the intrinsic geometrical structure of the data, NPCNMF can learn a better compact representation in the sense of semantic structure.

Face recognition accuracy on the ORL data set. The number in brackets is the corresponding projection dimensionality.

Method 2 Train 3 Train 4 Train
Baseline 69.32% 77.56% 83.48%
PCA 69.32% (79) 77.56% (118) 83.48% (152)
LDA 72.80% (25) 83.79% (39) 90.13% (39)
NPE 73.19% (36) 84.29% (54) 91.06% (73)
NMF 70.87% (97) 78.98% (81) 84.48% (95)
LNMF 71.73% (178) 81.09% (168) 86.31% (195)
CNMF 72.23% (138) 83.58% (143) 89.56% (111)
NPCNMF 77.31% (143) 86.73% (153) 93.35% (145)

Face recognition accuracy on the Yale data set. The number in brackets is the corresponding projection dimensionality.

Method 2 Train 3 Train 4 Train
Baseline 46.04% 49.96% 55.62%
PCA 46.04% (29) 49.96% (44) 55.62% (58)
LDA 42.81% (11) 60.33% (14) 68.10% (13)
NPE 48.19% (13) 62.00% (19) 69.00% (73)
NMF 44.11% (112) 49.00% (195) 52.19% (164)
LNMF 44.00% (157) 48.84% (198) 53.57% (197)
CNMF 49.72% (125) 59.50% (168) 65.77% (129)
NPCNMF 63.45% (124) 71.83% (148) 81.38% (153)

Face recognition accuracy on the PIE data set. The number in brackets is the corresponding projection dimensionality.

Method 5 Train 10 Train 20 Train
Baseline 43.02% 62.90% 83.19%
PCA 42.87% (199) 62.51% (195) 82.84% (200)
LDA 84.39% (67) 90.47% (67) 93.98% (67)
NPE 84.71% (166) 91.48% (200) 94.33% (200)
NMF 78.66% (200) 88.98% (200) 92.52% (200)
LNMF 76.47% (200) 87.91% (200) 92.61% (196)
CNMF 83.72% (176) 90.89% (187) 93.78% (159)
NPCNMF 88.43% (147) 94.86% (158) 98.58% (133)
5. Conclusion and Future Work

In this paper, we have presented a novel matrix factorization method called NPCNMF for dimensionality reduction, which respects the local geometric structure. As a result, NPCNMF has more discriminating power than the ordinary NMF and CNMF approaches, which consider only the Euclidean structure of the data. Experimental results on face data sets show that NPCNMF provides a better representation in the sense of semantic structure.

Several challenges remain to be investigated in our future work.

A suitable value of λ is important to our algorithm. It remains unknown how to do model selection theoretically and efficiently.

NPCNMF is currently limited to the linear projections, and those nonlinear techniques (e.g., kernel tricks) may further boost the algorithmic performance. We will investigate it in our future work.

Another further research direction is how to extend the current framework for tensor-based nonnegative data decomposition.

The NPCNMF algorithm is iterative and sensitive to the initialization of W and V. It is unclear how to choose optimal initializations in a principled manner.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[2] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proceedings of the Conference on Neural Information Processing Systems (NIPS '00), 2000.
[3] M. Cooper and J. Foote, "Summarizing video using non-negative similarity matrix factorization," in Proceedings of the IEEE Workshop on Multimedia Signal Processing, 2002.
[4] S. Z. Li, X. Hou, H. Zhang, and Q. Cheng, "Learning spatially localized, parts-based representation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '01), pp. I-207–I-212, 2001.
[5] P. Paatero and U. Tapper, "Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[6] P. O. Hoyer, "Non-negative sparse coding," in Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, 2002.
[7] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
[8] T. Hazan, S. Polak, and A. Shashua, "Sparse image coding using a 3D non-negative tensor factorization," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), pp. 50–57, 2005.
[9] A. Shashua and T. Hazan, "Non-negative tensor factorization with applications to statistics and computer vision," in Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pp. 793–800, 2005.
[10] Y. Wang, Y. Jia, C. Hu, and M. Turk, "Fisher non-negative matrix factorization for learning local features," in Proceedings of the 6th Asian Conference on Computer Vision (ACCV '04), 2004.
[11] S. Zafeiriou, A. Tefas, I. Buciu, and I. Pitas, "Exploiting discriminant information in nonnegative matrix factorization with application to frontal face verification," IEEE Transactions on Neural Networks, vol. 17, no. 3, pp. 683–695, 2006.
[12] I. Kotsia, S. Zafeiriou, and I. Pitas, "Novel discriminant non-negative matrix factorization algorithm with applications to facial image characterization problems," IEEE Transactions on Information Forensics and Security, vol. 2, no. 3, pp. 588–595, 2007.
[13] C. H. Q. Ding, T. Li, and M. I. Jordan, "Convex and semi-nonnegative matrix factorizations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 45–55, 2010.
[14] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
[15] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[16] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548–1560, 2011.
[17] D. Cai, X. He, and J. Han, "Locally consistent concept factorization for document clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 902–913, 2011.
[18] L. Zhang, L. Qiao, and S. Chen, "Graph-optimized locality preserving projections," Pattern Recognition, vol. 43, no. 6, pp. 1993–2002, 2010.
[19] X. He and P. Niyogi, "Locality preserving projections," in Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2003.
[20] X. He, D. Cai, S. Yan, and H.-J. Zhang, "Neighborhood preserving embedding," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), pp. 1208–1213, 2005.
[21] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: recognition using class specific linear projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.