Parallel Nonnegative Matrix Factorization with Manifold Regularization

Nonnegative matrix factorization (NMF) decomposes a high-dimensional nonnegative matrix into the product of two reduced dimensional nonnegative matrices. However, conventional NMF neither qualifies large-scale datasets as it maintains all data in memory nor preserves the geometrical structure of data which is needed in some practical tasks. In this paper, we propose a parallel NMF with manifold regularization method (PNMF-M) to overcome the aforementioned deficiencies by parallelizing the manifold regularized NMF on distributed computing system. In particular, PNMF-M distributes both data samples and factor matrices to multiple computing nodes instead of loading the whole dataset in a single node and updates both factor matrices locally on each node. In this way, PNMF-M succeeds to resolve the pressure of memory consumption for large-scale datasets and to speed up the computation by parallelization. For constructing the adjacency matrix in manifold regularization, we propose a two-step distributed graph constructionmethod, which is proved to be equivalent to the batch constructionmethod. Experimental results on popular text corpora and image datasets demonstrate that PNMF-M significantly improves both scalability and time efficiency of conventional NMF thanks to the parallelization on distributed computing system; meanwhile it significantly enhances the representation ability of conventional NMF thanks to the incorporated manifold regularization.


Introduction
Data representation is a fundamental problem in data analysis.A good representation typically uncovers the latent structure of a dataset by reducing the dimensionality of data.Several methods including principal component analysis (PCA), linear discriminant analysis (LDA), and vector quantization (VQ) have addressed this issue.Recently, nonnegative matrix factorization (NMF) [1] incorporates nonnegativity constraint to obtain parts-based representation of data, and thus it has been widely applied in many applications, such as document clustering [2,3], image recognition [4,5], audio processing [6], and video processing [7].
However, conventional NMF suffers from a few deficiencies: (1) conventional NMF usually works in batch mode and requires all data to reside in memory, and this leads to tremendous storage overhead as the increase of the data samples and (2) conventional NMF ignores the geometrical structure embedded in data and causes unsatisfactory representation ability.To overcome the first deficiency, either parallel or distributed algorithms have been proposed for NMF to fit for large-scale datasets.Kanjani [8] utilized multithreading to develop a parallel NMF (PNMF) based on multicore machine.Robila and Maciak [9] introduced two thread-level parallel versions for traditional multiplicative solution and adaptive projected gradient method.However, their methods are prohibited for large-scale datasets due to the memory limitation of a single computer.Liu et al. [10] proposed a distributed NMF (DNMF) to analyze large-scale web dyadic data and verified its effectiveness on distributed computing systems.Dong et al. [11] also attempted to design a PNMF based on the distributed memory platform with the message passing interface library.Although all the above parallel NMF algorithms achieve a considerable speedup in terms of scalability, they cannot consider the geometric structure in dataset.To overcome the second deficiency, Cai et al. [12] proposed graph-regularized NMF (GNMF) which extended conventional NMF by constructing an affinity graph [13] to encode the geometrical information of data and enhanced representation ability.Gu et al. [14] further extended GNMF to avoid trivial solution and scale transfer problems by imposing a normalized cut-like constraint on the cluster assignment matrix.Lu et al. [15] incorporated manifold regularization into NMF for hyperspectral unmixing and obtained desirable unmixing performance.These improved algorithms get better representation ability but work inefficiently for large-scale datasets.Liu et al. [16] introduced the geometric structure of data into incremental NMF [17,18] and utilized two efficient sparse approximations, buffering and random projected tree, to process large-scale datasets.Yu et al. [19] also presented an incremental GNMF algorithm to improve scalability.But these algorithms only performed well for incremental or streaming datasets and could not deal with large-scale batch datasets.In addition, Guan et al. [20] and Liu et al. [21], respectively, introduce Manhattan distance and large-cone penalty for NMF to improve representation and generalization ability.
In conclusion, none of the above works can simultaneously overcome both deficiencies due to the great computation for calculating the decomposition and the storage requirement of the adjacency matrix.In this paper, we take the best of advantages of parallel NMF and manifold regularized NMF and design a parallel NMF with manifold regularization method (PNMF-M) by parallelizing manifold regularized NMF on distributed computing systems.In particular, PNMF-M distributes both data samples and factor matrices to multiple computing nodes in a balanced way and parallelizes the update for both factor matrices locally on each node.Since the graph construction is the bottleneck of the computation of PNMF-M, we adopt a two-step distributed graph construction method to compute the adjacency matrix and obtain an adjacent graph equivalent to that constructed in batch mode.Experimental results on popular text corpora and image datasets show that PNMF-M not only outperforms conventional NMF in terms of both scalability and time efficiency by parallelization, but also significantly enhances the representation ability of conventional NMF by incorporating manifold regularization.

Related Work
This section gives a brief review of PNMF, GNMF, and their corresponding popular algorithms.Details are as follows.
2.1.PNMF.Due to working in batch mode, conventional NMF requires to process masses of large matrix operations, which is inevitably confronted with high time overhead.Parallel processing is often an efficient method to speed up computing intensive algorithms.In general, the parallelism of the algorithms is embodied in two aspects.The first aspect is computation flow, but unfortunately conventional NMF is highly correlated among the computing steps and could not be decomposed into uncorrelated tasks.So the second aspect, data structure, is taken into account.A feasible plan is to divide the data into small blocks and assign them to different processing units.

GNMF.
In common, conventional NMF performs the factorization in the Euclidean space, which cannot get satisfactory solution.Based on manifold assumption [22], GNMF embeds the intrinsic geometric structure of the data space in conventional NMF to overcome the above defect.
One of the representative algorithms for GNMF is proposed by Cai et al. [12].They define the graph weight matrix  ∈ R × , which is obtained by computing the weight relationship between any two data samples in input data matrix .Furthermore, another two matrices are derived from .One is the diagonal matrix  whose entries are row or column sum of , that is,   = ∑  =1   = ∑  =1   , and the other is  =  − , which is called graph Laplacian.They utilize Lagrangian multiplier method to solve the above problem and derive the following update rules: where  ≥ 0 is the regularization parameter,  denotes the iteration number,  = 1, . . ., ,  = 1, . . ., , and  = 1, . . ., .
By incorporating the geometric structure, GNMF can significantly improve representation ability in comparison to conventional NMF.

Parallel Manifold Regularized Nonnegative Matrix Factorization
In conventional GNMF, the intrinsic low dimensional manifold structure embedded in the dataset is expressed by  by  graph weight matrix .From the update rule (2), we can easily see that the manifold terms in conventional GNMF give rise to a lot of additional computations compared with NMF.Therefore, we describe the proposed PNMF-M in this section, which improves this issue by distributing both data samples and factor matrices to multiple computing nodes while parallelizing the update for both factor matrices and manifold construction.
As the scale of  is much larger than those of , , and , the weight matrix  should be distributed across  processors.The same is true for .To this end, PNMF-M divides  and  into  blocks in the row direction, namely, In this way, each of the first  − 1 blocks is a ⌈/⌉ × -dimensional matrix and the last one is a (−⌈/⌉×(−1))×-dimensional matrix, as described in Figures 1(c) and 1(d).
According to [12], to take the advantage of manifold regularization, PNMF-M optimizes the following objective: where  denotes the regularization parameter and  = −.
In each iteration round, PNMF-M updates each block of  and  locally by ( 5) and ( 6), respectively, and inserts the reduction for the whole factors among processors between (5) and (6).Since our algorithm is developed on a distributed computing system with 40 Gb/s Infiniband interconnection network whose communication overloads among processors are relatively low, this updating strategy significantly speeds up the calculation of manifold regularized NMF by the distributed computing.

Manifold Construction.
Besides data partition and distributed computing, manifold construction is another key factor in PNMF-M.In the update rule (5), updating V in the Input: data matrix  ∈ R × , reduced dimensionality  ∈ N, regularization parameter  ∈ R, tolerance  ∈ R, parallelism  ∈ N Output: basis matrix  ∈ R × , coefficient matrix  ∈ R × (1) Assign X , X , Ǔ  , and V to   as described in Figures 1 and 2 (2) Compute idx() from X (3) Send X to the other  − 1 processes   (4) Receive X from the other  − 1 processes   to obtain idx() (5) Merge all  index sets to construct Ŵ (6) repeat (7) Apply (5)   th processing unit requires the products of V and Ŵ in all  processing units.However, directly constructing the global adjacency graph from the input data matrix  costs Θ( 2 ) time in manifold learning and heavily prohibits manifold regularized NMF for large-scale datasets.In this subsection, we design an efficient construction method by means of the communications between different processing units.
In Figure 1(c), we can see that the th processor simply holds Ŵ which is a part of the global manifold  and represents the weight relationship between data samples in X and all the data samples in .Originally, it needs the entire data matrix  to construct Ŵ , and the tremendous storage requirement becomes the bottleneck which restricts the scalability of the algorithm.It is therefore critical to construct the manifold in an efficient way.To overcome this drawback, in this paper, we design a two-step manifold construction method by means of high-throughput interconnection network.
Figure 3 depicts the details of the construction method.Without loss of generality, we take the th processor   as an example to demonstrate both steps.In the first step, PNMF-M first constructs locally an adjacency subgraph from X in   relative to the adjacency graph constructed from  and thus obtains an index set idx() which contains local neighbor indices of each data sample in X .Then,   utilizes highthroughput interconnection network to take turns to receive data samples from the other −1 remote processors.When X comes, where  = 1, 2, . . .,  except , PNMF-M reconstructs an adjacency subgraph from X and X and obtains an index set idx() which contains remote neighbor indices of each data sample in X .In the second step, PNMF-M computes the weight relationship between X and data sample matrix Xidx() indicated by each index set idx() and picks out the final neighbor indices of each data sample in X to construct a manifold Ŵ in   .
In our method, we select -nearest neighbors (-NN) mode in [22] to construct the graph, and thus the proposed construction method can be proved to get a manifold equivalent to the batch construction method.Since each processor can discard data sample matrix received from remote processors after constructing adjacency subgraph, PNMF-M greatly reduces the storage overhead and improves the scalability.

PNMF-M Algorithm.
We are now ready to alternatively apply the update rules ( 5) and ( 6) to solve PNMF-M.Algorithm 1 shows the pseudocode on distributed computing systems.Given that  processes exist in PNMF-M, Algorithm 1 first partitions data samples and factor matrices according to Figures 1 and 2 and assigns the associated submatrices to each process   (Statement (1)).Then,   begins to locally construct its private weight matrix in parallel.It first computes idx() from X in the local process (Statement (2)) and then receives X from the other  − 1 processes   to obtain idx() (Statement (4)).Next, it merges all  index sets to construct Ŵ (Statement (5)).Then,   alternatively employs the update rules ( 5) and ( 6) to obtain its private coefficient submatrix V (Statement (7)) and basis submatrix Ǔ  (Statement (10)) until the termination condition for the iteration is satisfied.Notice that updating V requires the basis matrix  which consists of  submatrices Ǔ  distributed across each process   , where  = 1, 2, . . ., , as shown in (5).Hence, Algorithm 1 has to insert the interprocess communication to implement the reduction for  before updating V (Statement (11)-( 12)).Similarly, Algorithm 1 also requires increasing the interprocess communication to achieve the reduction for  before updating Û (Statements (8) and ( 9)).

Experimental Evaluation
In this section, we verify the proposed PNMF-M by comparing it with PNMF in [11] and GNMF in [12], which incorporate parallel computing and manifold regularization into NMF, respectively.

Datasets and Evaluation Metrics.
We choose Reuters and TDT2 text corpora widely used for document clustering and MNIST and COIL100 image databases used for pattern recognition and machine learning as our benchmark datasets.Table 1 lists their detailed statistics.
In this experiment, we utilize canonical -means clustering method to transform the coefficient matrix obtained by our algorithm into the labels of data samples, then compare them with the labels provided by the benchmark datasets, and finally use the accuracy (AC) and the normalized mutual information (NMI) to evaluate the clustering performance of each tested algorithm.For the way to compute AC and NMI, Xu et al. [2] and Cai et al. [12] have given detailed description, so we need not explain them here.
To evaluate the scalability of our proposed algorithm, we record the execution time of each tested algorithm and utilize the speedup (SP) and time cost (TC) as metrics, where SP denotes the speedup obtained by PNMF-M relative to GNMF and TC denotes the time cost introduced by PNMF-M relative to PNMF; that is,

Clustering Performance Comparisons.
We collect data samples in multiple clusters randomly selected from the benchmark datasets and provide them and the corresponding cluster number to each tested algorithm.All the tested algorithms are run 50 times on different data subsets each of which consists of some part of clusters randomly selected from the whole clusters.We compute the mean of them for each given cluster number as the final clustering performance.At the same time, we also compute the mean of SP and TC to verify the effectiveness of PNMF-M.Table 2 lists the values of main experimental parameters for clustering performance evaluation.In our experiment, we set regularization parameter to 100 and the tolerance to 10 −4 as the termination condition of each iteration.In addition, we employ -NN definition to construct the graph weight matrix and set  to 5. For parallel algorithms, we adaptively select the parallelism  according to the number of samples  and set  = ceil(), where ceil() means rounding up  to an integer.
Figures 4-7 report AC and NMI versus the cluster number on four benchmark datasets, respectively.Figure 8 reports SP and TC of our algorithm versus the cluster number on four benchmark datasets.
Through comparative analysis, we can obtain the following observations: (1) Since the manifold constructed in PNMF-M is equivalent to that constructed in GNMF, both obtain the same clustering performance on all four benchmark datasets.
(2) PNMF-M gets better clustering performance than PNMF in terms of both AC and NMI, because PNMF-M incorporates the intrinsic geometric structure of data and obtains the same clustering performance as GNMF, while PNMF ignores the manifold structure of the dataset.
(3) GNMF directly constructs the manifold from the entire data matrix in a single processing node, which incurs tremendous computation and storage overhead.In contrast, PNMF-M designs a two-step distributed method to accelerate the manifold construction.In addition, PNMF-M parallelizes the update procedure to eliminate large size matrix operations in the update rules and further reduces the computation and storage overhead.According to Figure 8, PNMF-M acquires about 4-10 times speedup on four benchmark datasets relative to GNMF.
(4) Compared with PNMF, PNMF-M requires to construct the manifold in parallel and introduce the manifold term into the update rule for the coefficient matrix, which inevitably increases the computation overhead of the algorithm to some extent.PNMF-M pays about 2-4 times cost on four benchmark datasets relative to PNMF, as shown in Figure 8.

Scalability Comparisons.
We collect increasing scale data subsets in all the clusters randomly selected from the benchmark datasets as the input datasets of each tested algorithm, run them in the same hardware configuration and software environment as listed in Table 3, and verify the scalability by comparing their execution time and evaluating SP achieved by PNMF-M.In our experiments, we set parallelism to 2 (PNMF-M2), 4 (PNMF-M4), 8 (PNMF-M8), and 16 (PNMF-M16) for PNMF-M, respectively.
Figure 9 reports SP achieved by PNMF-M with different parallelism on four benchmark datasets relative to GNMF.We can draw the conclusions from it as follows: (1) When the number of input data samples is small (below about 5 * 10 3 ), SP obtained by PNMF-M is less than the corresponding parallelism.This is because all processing  (2) When the number of input data samples is large (above about 5 * 10 3 ), SP achieved by PNMF-M is more than the corresponding parallelism.The reason is that all the input data samples reside in external storage before the execution of each tested algorithm.But GNMF requires all data to be loaded into memory during its execution, which goes beyond the memory capacity in a single processing node and has to load large amounts of data samples by virtual memory.To address the problem, PNMF-M divides the input data into multiple subsets and stores them in different processing node, which reduces memory access overhead caused by loading data.
(3) In the case of small input data samples, PNMF-M achieves better SP on MNIST and COIL100 than on Reuters   and TDT2 when parallelism for PNMF-M is equal to 2, 4, and 8.But the result is the opposite with the increase of the number of input data samples.The main reason for this result is twofold: one is based on the two analyses mentioned in (1) and (2) and the other is that memory and communication overhead for a single data sample in Reuters and TDT2 are larger because the dimensionality of data in them is higher than that in MNIST and COIL100.But when parallelism for PNMF-M increases to 16, the overhead caused by high dimensionality is spread over each processing node.Hence, SP obtained by PNMF-M16 has little difference among four benchmark datasets.
In addition, we also compare the execution time of all the tested algorithms on four benchmark datasets, as shown in Figure 10.It is obvious that the increasingly widening gap of the execution time exists between serial algorithm (GNMF) and parallel algorithms (PNMF-M2, PNMF-M4, PNMF-M8, and PNMF-M16).It is because GNMF involves lots of large size matrix operations during the manifold construction and updating procedure, and the time consumed by them grows nonlinearly as the number of data samples and attributes increase, while PNMF-M transforms them into relatively small size matrix operations and utilizes multiple processing nodes to perform them in parallel.Hence, PNMF-M achieves better scalability than GNMF.

Conclusion
In this paper, we proposed a parallel nonnegative matrix factorization with regularization method (PNMF-M) which introduced manifold regularization into conventional NMF and parallelized it on distributed computing systems.In PNMF-M, besides the balanced data partition strategy, we conducted the update procedure in parallel.In particular, we presented a two-step distributed graph construction method and obtained a manifold equivalent to that constructed in batch mode.Experimental results showed that PNMF-M not only outperforms conventional NMF in terms of both scalability and time efficiency thanks to the parallelization, but also significantly enhances the representation ability of conventional NMF thanks to the manifold regularization.

Figure 8 :
Figure 8: SP and TC on four benchmark datasets.

Figure 10 :
Figure 10: Execution time on four benchmark datasets.

Table 1 :
Statistics of benchmark datasets.

Table 2 :
Experimental parameters for clustering performance evaluation.

Table 3 :
Experimental environment for scalability evaluation.
nodes need to communicate with each other to construct the manifold during the execution of PNMF-M, which introduces communication latency into parallel computation.