Relationship Matrix Nonnegative Decomposition for Clustering

Nonnegative matrix factorization (NMF) is a popular tool for analyzing the latent structure of nonnegative data. For a positive pairwise similarity matrix, symmetric NMF (SNMF) and weighted NMF (WNMF) can be used to cluster the data. However, neither of them is very effective for an ill-structured pairwise similarity matrix. In this paper, a novel model, called relationship matrix nonnegative decomposition (RMND), is proposed to discover the latent clustering structure from the pairwise similarity matrix. The RMND model is derived from the nonlinear NMF algorithm. RMND decomposes a pairwise similarity matrix into a product of three low-rank nonnegative matrices. The pairwise similarity matrix is represented as a transformation of a positive semidefinite matrix which brings out the latent clustering structure. We develop a learning procedure based on multiplicative update rules and the steepest descent method to calculate the nonnegative solution of RMND. Experimental results on four different databases show that the proposed RMND approach achieves higher clustering accuracy.


Introduction
Nonnegative matrix factorization (NMF) [1] has been introduced as an effective technique for analyzing the latent structure of nonnegative data such as images and documents. A variety of real-world applications of NMF have been found in many areas such as machine learning, signal processing [2-4], data clustering [5, 6], and computer vision [7].
Most applications focus on the clustering aspect of NMF [8, 9]. Each sample can be represented as a linear combination of clustering centroids. Recently, a theoretical analysis has shown the equivalence between NMF and K-means/spectral clustering [10]. Symmetric NMF (SNMF) [10] is an extension of NMF. It aims at learning the clustering structure from a kernel matrix or pairwise similarity matrix which is positive semidefinite. When the similarity matrix is not positive semidefinite, SNMF is not able to capture the clustering structure contained in the subspace associated with the negative eigenvalues. In order to overcome this limitation, weighted NMF (WNMF) [10] was developed. In the WNMF model, the indefiniteness of the pairwise similarity matrix is passed on to a specific low-rank matrix, and WNMF improves on the clustering performance of SNMF. When a portion of the data is labeled, it is desirable to incorporate the class label information into WNMF in order to improve the clustering performance. To this end, a semisupervised NMF (SSNMF) [11] has been studied, which incorporates domain knowledge into WNMF to extract more clustering structure information.
In SNMF, WNMF, and SSNMF, a low-rank approximation to the pairwise similarity matrix is used. The goal is to learn the latent clustering structure by minimizing the reconstruction error. However, since there is no prior knowledge about the data, the kernel matrix is often obtained from pairwise Euclidean distances in the high-dimensional space, which makes it sensitive to unexpected noise. Consequently, minimizing a reconstruction-based objective function, as in SNMF, WNMF, and SSNMF, may produce undesirable performance in clustering tasks. In this paper, we present a novel model, called relationship matrix nonnegative decomposition (RMND), for data clustering tasks. The RMND model is derived from the nonlinear NMF algorithms, which take advantage of kernel functions in the high-dimensional feature space. RMND decomposes a pairwise similarity matrix into a product of a positive semidefinite matrix, a distribution matrix of similarity on latent features, and an encoding matrix. The positive semidefinite matrix brings out the clustering structure and, through an appropriate transformation, is treated as a more convincing pairwise similarity matrix. RMND learns the correct relationship matrix adaptively. Furthermore, because of the positive semidefiniteness, the SNMF formulation is incorporated in RMND, and a more tractable representation of the pairwise similarity matrix is obtained. We develop a learning procedure for RMND to discover the latent clustering structure. Experimental results show that the proposed RMND leads to significant improvements in clustering performance.
The rest of the paper is organized as follows. In Section 2, we briefly review SNMF and WNMF. In Section 3, we present the proposed RMND model and its learning procedure. Some experimental results on several datasets are shown in Section 4. Finally, conclusions and final remarks are given in Section 5.

Symmetric NMF (SNMF) and Weighted NMF (WNMF)
A pairwise similarity matrix $X$ is a nonnegative matrix, since the pairwise similarities between different objects cannot be negative. For the kernel case, $X = V^T V$ is the standard inner-product linear kernel matrix, where $V$ is a nonnegative data matrix of size $p \times n$, and this can be extended to any other kernel. The NMF technique is a powerful way to discover the latent structure in $X$. Since $X$ is a symmetric matrix, Ding et al. [10] introduced the SNMF model as follows:

$$X \approx A A^T, \tag{2.1}$$

where $A$ is a nonnegative matrix of size $n \times q$ whose rows denote the degrees to which the samples are related to the $q$ cluster centroids. In (2.1), $AA^T$ is a positive semidefinite matrix. When the similarity matrix $X$ is indefinite, $X$ has negative eigenvalues, and $AA^T$ will not provide a good approximation, since $AA^T$ cannot absorb the subspace associated with the negative eigenvalues. For kernel matrices, the decomposition in (2.1) is feasible. However, a large number of similarity matrices are nonnegative but not positive semidefinite. Ding et al. [10] therefore introduced an improved factorization model,

$$X \approx A S A^T, \tag{2.2}$$

where $S$ is a nonnegative matrix of size $q \times q$ which inherits the indefiniteness of $X$. The detailed update rules for $A$ and $S$ can be found in [10].

Relationship Matrix Nonnegative Decomposition (RMND)
Recently, NMF has been extended to nonlinear nonnegative component analysis algorithms (referred to as KNMF) by Zafeiriou and Petrou [12]. KNMF is proposed to model efficiently the nonlinearities that are present in most real-life applications. The idea of KNMF is to perform NMF in a high-dimensional feature space. Specifically, KNMF finds a set of nonnegative weights and nonnegative basis vectors such that the nonlinearly mapped training vectors can be written as linear combinations of the nonlinearly mapped nonnegative basis vectors. Let $\phi : \mathbb{R}^p \to \mathcal{H}$ be a mapping that projects the data $v_i$ to a Hilbert space $\mathcal{H}$ of arbitrary dimensionality. KNMF attempts to find a set of $q$ vectors $w_j \in \mathbb{R}^p$ and a set of nonnegative weights $h_{ji}$ such that

$$V^\Phi \approx W^\Phi H, \tag{3.1}$$

where $V^\Phi = [\phi(v_1), \ldots, \phi(v_n)]$, $W^\Phi = [\phi(w_1), \ldots, \phi(w_q)]$, and $H = [h_{ji}]$. The nonlinear mapping is related to a kernel function through $k(v_i, v_j) = \phi(v_i)^T \phi(v_j)$. The detailed algorithms for directly learning $W$ and $H$ can be found in [12].
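The kernel relation between $\phi$ and $k$ is what allows a feature-space fit to be evaluated without computing $\phi$ explicitly. As an illustration only (assuming a squared-norm fit per sample, which is a common choice and not necessarily the exact cost used in [12]), the error for a sample $v_i$ expands entirely into kernel evaluations:

```latex
\Bigl\| \phi(v_i) - \sum_{j=1}^{q} h_{ji}\,\phi(w_j) \Bigr\|^{2}
  = k(v_i, v_i)
  - 2 \sum_{j=1}^{q} h_{ji}\, k(v_i, w_j)
  + \sum_{j=1}^{q} \sum_{j'=1}^{q} h_{ji}\, h_{j'i}\, k(w_j, w_{j'})
```

Each term on the right-hand side needs only the kernel function, so the dimensionality of the Hilbert space never enters the computation.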
In this paper, we focus on the convex nonlinear nonnegative component analysis algorithm, referred to as CKNMF in [12]. Instead of finding both $W$ and $H$ simultaneously, Zafeiriou and Petrou followed similar lines to convex-NMF [9] and assumed that the centroid $\phi(w_j)$ lies in the space spanned by the columns of $V^\Phi$. Formally, $\phi(w_j)$ can be written as

$$\phi(w_j) = \sum_{l=1}^{n} m_{lj}\, \phi(v_l), \tag{3.2}$$

where $m_{lj} \ge 0$ and $\sum_{l=1}^{n} m_{lj} = 1$. This means that the centroid $\phi(w_j)$ can be interpreted as a convex weighted combination of the data points $\phi(v_l)$. Using (3.2), approximation (3.1) is reformulated in matrix form as

$$X \approx X M H, \tag{3.3}$$

where $X$ is the kernel matrix with entries $x_{ij} = k(v_i, v_j)$ and $M = [m_{lj}]$. Equation (3.3) provides a new decomposition of the kernel matrix in which each matrix has an explicit interpretation: $X$ is the relationship matrix between different objects based on a certain kernel function, each column of $M$ denotes the relationship distribution on a certain latent feature according to the property of convex combinations, and $H$ is the encoding coefficient matrix. In particular, we rewrite (3.3) in entrywise form as

$$x_{ij} \approx \sum_{s=1}^{q} (XM)_{is}\, h_{sj}. \tag{3.4}$$
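The step from the feature-space approximation to a factorization of the kernel matrix can be spelled out as follows. This is a short worked derivation using only the definitions of $V^\Phi$, $W^\Phi$, $M$, $H$, and $k$ given in the text; the intermediate lines are a sketch of one natural route rather than a quotation of [12]:

```latex
\begin{aligned}
W^\Phi &= V^\Phi M
  && \text{(convex combination of the mapped samples, } M = [m_{lj}]\text{)} \\
V^\Phi &\approx W^\Phi H = V^\Phi M H
  && \text{(substitute into the KNMF approximation)} \\
(V^\Phi)^T V^\Phi &\approx (V^\Phi)^T V^\Phi M H
  && \text{(left-multiply by } (V^\Phi)^T\text{)} \\
X &\approx X M H
  && \text{(since } x_{ij} = \phi(v_i)^T \phi(v_j) = k(v_i, v_j)\text{)} \\
x_{ij} &\approx \sum_{s=1}^{q} (XM)_{is}\, h_{sj},
  \qquad (XM)_{is} = \sum_{l=1}^{n} x_{il}\, m_{ls}
  && \text{(entrywise form)}
\end{aligned}
```

Because each column of $M$ is nonnegative and sums to one, $(XM)_{is}$ is a weighted average of the similarities $x_{il}$ of object $i$.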
It can be noted from (3.4) that $(XM)_{is}$ represents the weighted average relationship measure of object $i$ on the $s$th latent feature, so the relationship measure between objects $i$ and $j$ is a linear combination of the weighted average relationship measures on the latent features. However, (3.3) or (3.4) is not convincing for clustering tasks, since the kernel matrix $X$ cannot represent the relationship between different objects faithfully. It is more desirable to discover the latent relationship adaptively. Consequently, we replace the $X$ on the right-hand side of (3.3) by a latent relationship matrix:

$$X \approx R M H, \tag{3.5}$$

where $R$ denotes the correct relationship matrix. From (3.5), the correct relationship matrix $R$ is learned adaptively from the kernel matrix $X$, and $X$ is a linear transformation of $R$. A relationship matrix $R$ which brings out the latent clustering structure is approximately a block diagonal matrix under a suitable rearrangement of the samples. It would be a positive semidefinite matrix, so the SNMF model is a reasonable way to learn a low-rank representation of $R$. Thus, we derive our new model, referred to as relationship matrix nonnegative decomposition (RMND), as follows:

$$X \approx A A^T M H, \tag{3.6}$$

where $A$ is a nonnegative matrix whose rows denote the degrees to which the samples are related to the cluster centroids. The corresponding optimization problem of RMND is the minimization problem

(3.7)
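As a hedged sketch, one objective consistent with the RMND model $X \approx AA^TMH$ and with the reconstruction error reported in the experiments is the squared Frobenius error. The factor $1/2$ and the exact way the constraints are written below are assumptions for illustration, not a form confirmed by the text:

```latex
\min_{A,\,M,\,H}\; D = \tfrac{1}{2}\,\bigl\| X - A A^{T} M H \bigr\|_{F}^{2}
\quad \text{s.t.} \quad
A \ge 0,\;\; M \ge 0,\;\; H \ge 0,\;\;
\sum_{i=1}^{n} m_{ij} = 1,\;\; j = 1, \dots, q
```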
The objective function $D$ of RMND in (3.7) is not convex in $A$, $M$, and $H$ simultaneously. Therefore, it is unrealistic to expect an algorithm to find the global minimum of $D$. Since $AA^TMH = AA^T(MU^{-1})(UH)$, where $U$ is a diagonal matrix with $u_{jj} = \sum_i m_{ij}$, the normalization of $M$ can be easily handled after $M$ is updated. Therefore, we only consider the nonnegativity constraints on the factors. When $A$ is fixed, let $\gamma_{jk}$ and $\psi_{jk}$ be the Lagrange multipliers for the constraints $m_{jk} \ge 0$ and $h_{jk} \ge 0$, respectively. Defining the matrices $\Gamma = [\gamma_{jk}]$ and $\Psi = [\psi_{jk}]$, the Lagrangian $L$ is

(3.9)
Using the KKT conditions $\gamma_{jk} m_{jk} = 0$ and $\psi_{jk} h_{jk} = 0$, we get the following equations:

(3.10)
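The above equations lead to the following multiplicative update rules:

$$M \leftarrow M \otimes \bigl(AA^T X H^T\bigr) \oslash \bigl(AA^T AA^T M H H^T\bigr), \qquad H \leftarrow H \otimes \bigl(M^T AA^T X\bigr) \oslash \bigl(M^T AA^T AA^T M H\bigr), \tag{3.11}$$

where $\otimes$ and $\oslash$ denote elementwise multiplication and division, respectively.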
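One way to see where multiplicative rules of this kind come from is sketched below. It assumes, for illustration only, that $D = \tfrac{1}{2}\|X - RMH\|_F^2$ with $R = AA^T$ held fixed, and that the multiplier terms enter the Lagrangian with a plus sign; neither the scaling nor the sign convention is confirmed by the text.

```latex
\begin{aligned}
L &= \tfrac{1}{2}\,\| X - R M H \|_F^2
     + \operatorname{tr}(\Gamma M^{T}) + \operatorname{tr}(\Psi H^{T}),
     \qquad R = A A^{T} = R^{T}, \\
\frac{\partial L}{\partial M} &= -R X H^{T} + R R M H H^{T} + \Gamma = 0,
\qquad
\frac{\partial L}{\partial H} = -M^{T} R X + M^{T} R R M H + \Psi = 0, \\
0 &= \bigl(R R M H H^{T} - R X H^{T}\bigr)_{jk}\, m_{jk},
\qquad
0 = \bigl(M^{T} R R M H - M^{T} R X\bigr)_{jk}\, h_{jk}
\quad (\text{using } \gamma_{jk} m_{jk} = 0,\; \psi_{jk} h_{jk} = 0), \\
m_{jk} &\leftarrow m_{jk}\,
   \frac{\bigl(R X H^{T}\bigr)_{jk}}{\bigl(R R M H H^{T}\bigr)_{jk}},
\qquad
h_{jk} \leftarrow h_{jk}\,
   \frac{\bigl(M^{T} R X\bigr)_{jk}}{\bigl(M^{T} R R M H\bigr)_{jk}}
\end{aligned}
```

A fixed point of the last line satisfies the complementary-slackness equations in the third line, and nonnegative factors stay nonnegative because every quantity in the ratios is nonnegative.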
For the factor matrix $A$, the corresponding partial derivative of $D$ is

(3.12)
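The following sketch gives a gradient with respect to $A$ under the assumption that $D = \tfrac{1}{2}\|X - AA^TMH\|_F^2$ (the scaling is an assumption; a different constant only rescales the gradient) and checks it against central finite differences. The function names `objective`, `grad_A`, and `numeric_grad_A` are hypothetical helpers introduced here.

```python
import numpy as np

def objective(X, A, M, H):
    """Assumed objective: 0.5 * ||X - A A^T M H||_F^2."""
    E = X - A @ A.T @ M @ H
    return 0.5 * np.sum(E * E)

def grad_A(X, A, M, H):
    """Analytic gradient of the assumed objective with respect to A.

    With B = M H and E = X - A A^T B, dD/dA = -(E B^T + B E^T) A.
    """
    B = M @ H
    E = X - A @ A.T @ B
    return -(E @ B.T + B @ E.T) @ A

def numeric_grad_A(X, A, M, H, eps=1e-6):
    """Central finite-difference gradient, used only as a check."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for k in range(A.shape[1]):
            A_plus, A_minus = A.copy(), A.copy()
            A_plus[i, k] += eps
            A_minus[i, k] -= eps
            G[i, k] = (objective(X, A_plus, M, H)
                       - objective(X, A_minus, M, H)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
n, q = 5, 2
X = rng.random((n, n))
A = rng.random((n, q))
M = rng.random((n, q))
H = rng.random((q, n))
analytic = grad_A(X, A, M, H)
numeric = numeric_grad_A(X, A, M, H)
```

Agreement between `analytic` and `numeric` is a quick sanity check before the gradient is used in a descent step.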
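Input: Positive matrix $X \in \mathbb{R}^{n \times n}$ and a positive integer $q$.
Output: Nonnegative factor matrices $A \in \mathbb{R}^{n \times q}$, $M \in \mathbb{R}^{n \times q}$, and $H \in \mathbb{R}^{q \times n}$.
Learning procedure:
(S1) Initialize $A$, $M$, and $H$ to random positive matrices, and normalize each column of $A$ to one.
(S2) Repeat the following iterations until convergence:
  (1) $M \leftarrow M \otimes (AA^T X H^T) \oslash (AA^T AA^T M H H^T)$.
  (2) $H \leftarrow H \otimes (M^T AA^T X) \oslash (M^T AA^T AA^T M H)$.
  (3) $M \leftarrow MU^{-1}$ and $H \leftarrow UH$, where $U$ is a diagonal matrix with $u_{jj} = \sum_i m_{ij}$.
  (4) Repeatedly select a small positive constant $\mu_A$ until the objective function is decreased:
    (i) $\bar{A} = A - \mu_A\, \partial D / \partial A$.
    (ii) Project each column of $\bar{A}$ to be a nonnegative vector with unit $L_2$ norm.
  (5) $A := \bar{A}$.
Above, $\otimes$ and $\oslash$ denote elementwise multiplication and division, respectively.
Algorithm 1: Relationship matrix nonnegative decomposition (RMND).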
For the update of $A$, our algorithm essentially takes a step in the direction of the negative gradient and, subsequently, projects onto the constraint space, making sure that the step taken is small enough that the objective function $D$ is reduced at every step. The learning procedure for RMND is summarized in Algorithm 1.
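The learning procedure can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes the objective $D = \tfrac{1}{2}\|X - AA^TMH\|_F^2$, uses step halving starting from $\mu_A = 1$ as the step-size search, runs a fixed number of iterations instead of a convergence test, and adds a small constant `EPS` to avoid division by zero. The names `rmnd`, `project_columns`, `n_iter`, and `max_halvings` are hypothetical.

```python
import numpy as np

EPS = 1e-12  # numerical guard against division by zero (an addition, not in the text)

def objective(X, A, M, H):
    """Assumed objective: 0.5 * ||X - A A^T M H||_F^2."""
    E = X - A @ A.T @ M @ H
    return 0.5 * np.sum(E * E)

def project_columns(A):
    """Clip to nonnegative values and rescale each column to unit L2 norm."""
    A = np.maximum(A, 0.0)
    return A / np.maximum(np.linalg.norm(A, axis=0), EPS)

def rmnd(X, q, n_iter=100, max_halvings=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # S1: random positive initialization, columns of A normalized
    A = project_columns(rng.random((n, q)))
    M = rng.random((n, q))
    H = rng.random((q, n))
    history = [objective(X, A, M, H)]
    for _ in range(n_iter):
        R = A @ A.T
        # Steps 1-2: multiplicative updates of M and H with A fixed
        M = M * (R @ X @ H.T) / (R @ R @ M @ H @ H.T + EPS)
        RM = R @ M
        H = H * (RM.T @ X) / (RM.T @ RM @ H + EPS)
        # Step 3: make each column of M sum to one without changing M @ H
        u = np.maximum(M.sum(axis=0), EPS)
        M, H = M / u, u[:, None] * H
        # Step 4: projected gradient step on A with step halving
        B = M @ H
        E = X - R @ B
        G = -(E @ B.T + B @ E.T) @ A
        current = objective(X, A, M, H)
        mu = 1.0
        for _ in range(max_halvings):
            A_bar = project_columns(A - mu * G)
            if objective(X, A_bar, M, H) < current:
                A = A_bar  # Step 5: accept the projected step
                break
            mu *= 0.5
        history.append(objective(X, A, M, H))
    return A, M, H, history

# Usage on a small synthetic similarity matrix with two noisy blocks.
rng = np.random.default_rng(1)
X = np.kron(np.eye(2), np.ones((6, 6))) + 0.05 * rng.random((12, 12))
X = (X + X.T) / 2
A, M, H, history = rmnd(X, q=2)
```

If no step size in the halving schedule reduces the objective, $A$ is left unchanged for that iteration, so a step on $A$ is accepted only when it lowers the objective.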

Computational Complexity Analysis
In this subsection, we discuss the extra computational cost of our proposed algorithm in comparison with SNMF and WNMF. We count the arithmetic operations for each algorithm. Based on the updating rules in [10], it is not hard to count the arithmetic operations of each iteration in SNMF and WNMF. For RMND, the steepest descent method is used to update the factor matrix $A$, and we use bisection to determine the small positive constant $\mu_A$. Let $N_1$ be the maximum iteration number in the steepest descent method. We summarize the operation counts for each iteration in Table 1. Suppose that the algorithms stop after $N_2$ iterations. The overall cost for both SNMF and WNMF is

(3.13)

The overall cost for RMND is

(3.14)
Thus, the overall cost of SNMF, WNMF, and RMND scales with $qn^2$, where $n$ is the number of samples, so considerable time is needed for large-scale data clustering tasks. For RMND, the overall cost is also affected by the maximum iteration number $N_1$ in the steepest descent method. Nevertheless, as shown in Section 4, RMND is capable of improving the clustering performance. We will develop algorithms with faster convergence and lower computational complexity in future work.

Numerical Experiments
We evaluate the performance of five different methods, RMND, K-means clustering, spectral clustering (SpeClus) [13], SNMF, and WNMF, on a data clustering task. In RMND, once $A$ is learned, we set $\hat{X} = \hat{A}\hat{A}^T$, where $\hat{A}$ is the modification of $A$ with normalized rows. We apply K-means clustering to $A$, $H$, and the factor matrices learned by SNMF and WNMF on $X$. Finally, the best clustering result of RMND is reported.
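The post-processing step can be sketched as below. Several choices here are assumptions made for illustration: rows are normalized to unit $L_2$ norm (the text does not specify the norm), K-means is run on the row-normalized factor, and a small self-contained K-means with farthest-point initialization stands in for whichever K-means implementation was used. The names `normalize_rows` and `kmeans` are hypothetical.

```python
import numpy as np

def normalize_rows(A, eps=1e-12):
    """Scale each row of A to unit L2 norm (assumed normalization)."""
    return A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), eps)

def kmeans(Y, k, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        # distance of every sample to its closest already-chosen center
        d = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy factor matrix: the first three samples load on cluster 0, the rest on cluster 1.
A = np.array([[1.0, 0.1], [0.9, 0.0], [1.0, 0.05],
              [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]])
A_hat = normalize_rows(A)
X_hat = A_hat @ A_hat.T      # estimated pairwise similarity matrix
labels = kmeans(A_hat, k=2)
```

With unit-norm rows, the entries of `X_hat` are cosine similarities between samples in the learned factor space, so the diagonal equals one.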

Datasets
We use five datasets for evaluating the clustering performance of algorithms. The detailed description for the datasets is listed below.

Evaluation Metrics for Clustering and Kernel Function
In all these methods, we set $q$, the dimensionality of the feature subspace, equal to the number of classes in the dataset. Two performance measures, clustering accuracy and normalized mutual information (NMI), are used to evaluate the clustering performance of the algorithms. If we denote the true label of the $i$th sample by $c_i$ and the estimated label by $\hat{c}_i$, the clustering accuracy is computed as $\sum_{i=1}^{n} \delta(c_i, \hat{c}_i)/n$, where $\delta(x, y) = 1$ for $x = y$ and $\delta(x, y) = 0$ for $x \ne y$. The clustering accuracy achieves its maximum value 1 when the clustering results are perfect. Let $T$ be the set of clusters obtained from the ground truth and $\hat{T}$ the set obtained from our algorithm. The normalized mutual information measure is defined in terms of $\mathrm{MI}(T, \hat{T})$, the mutual information between $T$ and $\hat{T}$, and $H(T)$ and $H(\hat{T})$, the entropies of $T$ and $\hat{T}$, respectively. The value of NMI varies between 0 and 1. The greater the normalized mutual information, the better the clustering quality.
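Both measures can be sketched in a few lines. The accuracy follows the formula in the text literally, so it assumes the estimated labels have already been matched to the class labels. For NMI, the normalization by the larger of the two entropies is an assumption chosen for illustration; other normalizations also give values between 0 and 1. The function names are hypothetical.

```python
import numpy as np

def clustering_accuracy(true_labels, est_labels):
    """Fraction of samples with delta(c_i, c_hat_i) = 1."""
    true_labels, est_labels = np.asarray(true_labels), np.asarray(est_labels)
    return float(np.mean(true_labels == est_labels))

def entropy(labels):
    """Entropy of a partition given as a label vector (natural logarithm)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_information(a, b):
    """Mutual information between two partitions of the same samples."""
    a, b = np.asarray(a), np.asarray(b)
    mi = 0.0
    for x in np.unique(a):
        for y in np.unique(b):
            p_xy = np.mean((a == x) & (b == y))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(a == x) * np.mean(b == y)))
    return float(mi)

def nmi(a, b):
    """MI divided by max(H(a), H(b)); this normalization is an assumption."""
    denom = max(entropy(a), entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 1.0

truth = [0, 0, 1, 1]
acc_same = clustering_accuracy(truth, [0, 0, 1, 1])
nmi_same = nmi(truth, [0, 0, 1, 1])
nmi_swapped = nmi(truth, [1, 1, 0, 0])      # same partition, labels renamed
nmi_unrelated = nmi(truth, [0, 1, 0, 1])    # statistically independent partition
```

NMI depends only on the partitions, so renaming the cluster labels leaves it unchanged, whereas the accuracy formula is sensitive to how estimated labels are named.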
In our experiments, we use the Gaussian kernel function to calculate the kernel matrix and then evaluate the clustering performance of the different algorithms. For comparison, the same kernel matrix is used for all algorithms. The Gaussian kernel function used here has a parameter $t$. The pairwise similarity matrix $X$ is then given by (4.3), where $t$ is the heat kernel parameter [16] and $N_l(v_j)$ denotes the set of $l$ nearest neighbors of $v_j$. For simplicity, we set $t$ in a tunable way, as in (4.4), from a squared average distance between different samples [17] and a scale factor $m$.
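A neighborhood-restricted Gaussian similarity matrix of this kind can be sketched as follows. The concrete forms are assumptions made for illustration: the kernel is taken as $\exp(-\|v_i - v_j\|^2 / t)$, two samples are connected when either is among the $l$ nearest neighbors of the other, self-similarities are kept, and $t$ is $m$ times the mean squared distance between distinct samples. The function name `similarity_matrix` is hypothetical.

```python
import numpy as np

def similarity_matrix(V, l, m=1.0):
    """Build a symmetric l-nearest-neighbor Gaussian similarity matrix.

    V is a p x n data matrix with samples as columns; l must be at most n - 1.
    """
    n = V.shape[1]
    sq = np.sum(V * V, axis=0)
    # squared Euclidean distances between all pairs of columns
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (V.T @ V), 0.0)
    off_diagonal = ~np.eye(n, dtype=bool)
    t = m * D2[off_diagonal].mean()          # assumed form of the tunable parameter
    K = np.exp(-D2 / t)
    # indices of the l nearest neighbors of each sample, excluding itself
    D2_no_self = D2 + np.diag(np.full(n, np.inf))
    neighbors = np.argsort(D2_no_self, axis=1)[:, :l]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), l), neighbors.ravel()] = True
    mask |= mask.T                            # connect if either is a neighbor
    np.fill_diagonal(mask, True)
    return np.where(mask, K, 0.0)

rng = np.random.default_rng(0)
V = rng.random((3, 8))
X_full = similarity_matrix(V, l=7)   # l = n - 1 gives the fully connected graph
X_knn = similarity_matrix(V, l=2)    # sparser, more local graph
```

With $l = n - 1$ every pair of samples is connected, which corresponds to the fully connected graph; smaller $l$ zeroes out similarities between distant samples.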

Experimental Results on the JAFFE Dataset
To demonstrate how our method improves the performance of data clustering, we first set the number of nearest neighbors to $l = n - 1$ and the scale factor to $m = 1$, where $n$ is the number of samples. The pairwise similarity matrix $X$ is then the weighted adjacency matrix of a fully connected graph, similar to those used in spectral clustering. Figure 1 displays the pairwise similarity matrix obtained from the JAFFE dataset. It can be noted that $X$ is ill structured. In order to discover the latent clustering structure, we apply the RMND, SNMF, and WNMF algorithms to obtain the respective decompositions of $X$. The factor matrices are randomly initialized with values in the range $[0, 1]$. Figure 2 shows that the objective function value decreases with increasing iteration number. It can be noted that the reconstruction error of RMND is smaller than those of SNMF and WNMF after 500 iterations. Figures 3, 4, and 5 display the estimated pairwise similarity matrices of the SNMF, WNMF, and RMND algorithms, respectively. The estimated pairwise similarity matrix learned by RMND is more highly structured, so RMND produces a better representation of the kernel matrix than SNMF and WNMF. SNMF and WNMF have similar representations of $X$ when the algorithms have converged.

Clustering Performance Evaluation on Various Pairwise Similarity Matrix
In graph embedding methods, the pairwise similarity matrix, also referred to as the affinity matrix, has been widely used. In this subsection, we test our algorithm under different adjacency graph constructions to show how different graph structures affect the clustering performance. The number of nearest neighbors used in this paper defines the locality of the graph. In Figures 6 and 7, we show the average clustering accuracies and normalized mutual information, respectively, versus different numbers of nearest neighbors over 20 independent runs with the scale factor $m = 1$. As can be seen, RMND performs better when the number of nearest neighbors is larger than 60, and the maximum clustering accuracy achieved is 86.62% when 190 nearest neighbors are used in (4.3). The normalized mutual information is better for $l > 150$, and the maximum normalized mutual information is 0.8429 at $l = 210$. For SNMF and WNMF, the best clustering accuracies are 84.55% and 82.82%, respectively, and the best normalized mutual information values are 0.8373 and 0.8214, respectively. This implies that RMND is more suitable for discovering the clustering structure contained in a smoothed pairwise similarity matrix.
Note that the choice of parameters in (4.4) is still an open problem. To this end, we explore the range of possible values of the scale factor $m$ to determine the heat kernel parameter. Specifically, $m$ is taken from {0.2, 0.4, ..., 2.0}. Figures 8 and 9 show the clustering accuracies and normalized mutual information on the JAFFE dataset under different scale factors. All $n - 1$ nearest neighbors are used in this experiment. As $m$ increases, the performance decreases. The reason might be that the differences between pairwise similarities are small for larger values of $m$, so the pairwise similarity matrix becomes more and more ill structured. Nevertheless, RMND leads to better clustering performance than SNMF and WNMF.

Conclusions and Future Work
We have presented a novel relationship matrix nonnegative decomposition (RMND) model for data clustering tasks. The RMND model is formulated by decomposing a pairwise similarity matrix into a product of three low-rank nonnegative matrices which have explicit interpretations. The correct relationship matrix is adaptively learned from the pairwise similarity matrix by RMND. We develop a learning procedure based on multiplicative update rules and the steepest descent method to calculate the nonnegative solution of RMND. Extensive numerical experiments confirm that (1) RMND provides a favorable low-rank representation of the pairwise similarity matrix; (2) by using an appropriate kernel function, the ability of RMND, SNMF, and WNMF to deal with mixed-sign data makes them useful for many applications, in contrast to the original NMF; and (3) RMND improves on the clustering performance of SNMF and WNMF.
Future work includes the following topics. The first is to develop algorithms with faster convergence and better solutions in terms of minimizing the objective function. The second is to investigate the ability of RMND with different kinds of kernel functions.

Figure 1: The pairwise similarity matrix X obtained from the JAFFE dataset.

Figure 8: Clustering accuracies derived by applying RMND, SNMF, and WNMF on the JAFFE database versus different scale factor m.

Figure 9: Normalized mutual information derived by applying RMND, SNMF, and WNMF on the JAFFE database versus different scale factor m.

Table 3: Clustering accuracy and normalized mutual information of K-means, spectral clustering (SpeClus), SNMF, WNMF, and RMND on the Reuters dataset.

Table 4: Clustering accuracy and normalized mutual information of K-means, spectral clustering (SpeClus), SNMF, WNMF, and RMND on the USPS dataset.

Figure 6: Clustering accuracies derived by applying RMND, SNMF, and WNMF on the JAFFE database versus different numbers of nearest neighbors.

Figure 7: Normalized mutual information derived by applying RMND, SNMF, and WNMF on the JAFFE database versus different numbers of nearest neighbors.