Two Efficient Algorithms for Orthogonal Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a popular method for the multivariate analysis of nonnegative data. It involves decomposing a data matrix into a product of two factor matrices with all entries restricted to being nonnegative. Orthogonal nonnegative matrix factorization (ONMF) has been introduced recently. This method has demonstrated remarkable performance in clustering tasks, such as gene expression classification. In this study, we introduce two convergent methods for solving ONMF. First, we design a convergent orthogonal algorithm based on the Lagrange multiplier method. Second, we propose an approach that is based on the alternating direction method. Finally, we demonstrate that the two proposed approaches tend to deliver higher-quality solutions and perform better in clustering tasks compared with a state-of-the-art ONMF algorithm.


Introduction
Nonnegative matrix factorization (NMF) has been investigated by many researchers, such as Paatero and Tapper [1]. However, it was only through the works of Lee and Seung published in Nature and NIPS [2,3] that this method gained popularity. Based on the argument that nonnegativity is crucial for human perception, they proposed a simple algorithm for finding nonnegative representations of nonnegative data and images. The basic NMF is a technique that decomposes a nonnegative data matrix into a pair of nonnegative matrices of lower rank, Y ≈ AX, where Y ∈ R_+^{I×T} denotes a nonnegative data matrix, and A ∈ R_+^{I×J} and X ∈ R_+^{J×T} denote the basis matrix and coefficient matrix, respectively. J denotes the number of factors and is usually chosen so that J ≪ min(I, T). Recently, NMF has proven useful for many applications in pattern recognition, multimedia, text mining, and clustering, as well as environment and DNA gene expression classification [4][5][6][7][8][9][10][11][12].
In solving the NMF problem, the update rules given by Lee and Seung [3] take a multiplicative form and satisfy the nonnegativity constraint implicitly and elegantly. The extension of their multiplicative updates [3] can be found, for instance, in the work of Sha et al. [10], where a multiplicative update rule is proposed to solve a nonnegative quadratic programming problem. However, the slow convergence of the multiplicative updating rule has been pointed out, and more efficient algorithms equipped with stronger theoretical convergence properties have been introduced.
These efficient algorithms are based on either the alternating nonnegative least squares (ANLS) framework or the hierarchical alternating least squares (HALS) method.
It has been proven that ONMF is equivalent to K-means clustering [16,18]. Moreover, as a result of the nonnegativity and orthogonality constraints, the orthogonal factor matrix is naturally sparse. In such cases, the factorization is essentially unique, and we can ignore the permutation problem.
Ding et al. and Pompili et al. [16,18] proposed two different kinds of methods to solve the ONMF problem. In the study by Ding et al. [16], ONMF algorithms strictly enforce nonnegativity for each iterate while trying to achieve orthogonality in the limit. This can be done using a proper penalization term [16], a projection matrix formulation [19], or a suitable choice of search direction [15]. In the study by Pompili et al. [18], the authors proposed a method working the opposite way: at each iteration, a projected gradient scheme is used to ensure that the orthogonal factor iterates are orthogonal but not necessarily nonnegative. However, these algorithms are still not convergent algorithms for ONMF. In this study, we first briefly introduce NMF and ONMF algorithms. Next, based on the Lagrange multiplier method, we design a convergent algorithm, similar in spirit to the algorithm presented by Lin [20]. Because NMF algorithms based on the alternating direction method (ADM) are more efficient than multiplicative update algorithms, we propose another convergent algorithm for solving the ONMF problem by combining the ADM approach with our first convergent algorithm. Experiments on two grayscale images and the ALLAML gene-sample data demonstrate that Algorithms 2 and 3 tend to deliver higher-quality solutions and perform better in a clustering task compared with Algorithm 1. The remainder of this article is organized as follows: Section 2 gives a brief review of NMF and ONMF algorithms. In Section 3, an improved version of the original ONMF algorithm is presented together with its convergence analysis, and another new ONMF algorithm based on the alternating least squares (ALS) approach is provided. Section 4 is devoted to numerical experiments. Some concluding remarks are provided in Section 5.

Related Algorithms
In this section, we provide a brief review of NMF and ONMF algorithms.

Nonnegative Matrix Factorization.
NMF aims to approximate a given nonnegative data matrix by the product of two nonnegative matrices A and B of lower rank. Because the NMF problem is not convex in A and B jointly, various algorithms have been proposed. The first method for solving equation (2) is the ALS algorithm used by Paatero and Tapper in 1994. It minimizes the least squares cost function with respect to either A or B, one at a time, by fixing the other and disregarding nonnegativity; after each least squares step, the algorithm sets all negative entries to zero. The update scheme is as follows: where A† denotes the Moore-Penrose pseudoinverse. Another popular method for NMF is the multiplicative updating method proposed by Lee and Seung. It can be expressed as follows: where ⊙ and ⊘ denote elementwise multiplication and division, respectively. When started from nonnegative initial estimates, the iterates remain nonnegative throughout the iterations. By all indications, this algorithm appears to be by far the most widely used solution method for NMF. The modified update rule given in equations (7) and (8) was first proposed by Gillis and Glineur [21]; it can be expressed as follows: where ε > 0 denotes a small number. They proved that if a sequence of solutions generated by (7) and (8) has a limit point, then this point is necessarily a stationary point of the following optimization problem:
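The two classical schemes just described can be sketched in a few lines of NumPy. This is an illustrative sketch under assumed names and initialization, not code from the paper:

```python
import numpy as np

def nmf_als_step(Y, A, X):
    """One ALS sweep (Paatero & Tapper style): unconstrained least squares
    via the Moore-Penrose pseudoinverse, then projection of negatives to zero."""
    A = np.maximum(0.0, Y @ np.linalg.pinv(X))
    X = np.maximum(0.0, np.linalg.pinv(A) @ Y)
    return A, X

def nmf_mu(Y, J, max_iter=500, eps=1e-12, seed=0):
    """Lee-Seung multiplicative updates for Y ~= A @ X under the Frobenius
    loss; eps guards the elementwise division against zero denominators."""
    rng = np.random.default_rng(seed)
    I, T = Y.shape
    A = rng.random((I, J)) + 0.1  # positive start keeps iterates nonnegative
    X = rng.random((J, T)) + 0.1
    for _ in range(max_iter):
        A *= (Y @ X.T) / (A @ X @ X.T + eps)
        X *= (A.T @ Y) / (A.T @ A @ X + eps)
    return A, X
```

Because both updates only multiply or project, nonnegativity of the iterates is preserved automatically, which is exactly the property the text attributes to the multiplicative form.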

Original ONMF Algorithms.
In ONMF, an additional orthogonality constraint, XX^T = I, is imposed on the NMF. We briefly review the first ONMF algorithm proposed by Ding et al. [16] and reveal the problem behind ONMF. The goal of ONMF is to find a nonnegative orthogonal matrix X and a nonnegative matrix A by minimizing the following objective function using a Lagrangian multiplier λ. Because it is not easy to solve this problem for every value of λ, Ding et al. [16] ignored the nonnegativity and relied on XX^T = I to obtain a unique value of Λ_X. The KKT conditions for the objective in equation (10) are as follows: Using the multiplicative updating method, the update rules of A and X are derived as follows: The problem with this algorithm is that it is difficult to determine Λ_X. By summing over the index i, the authors found an exact formulation for the diagonal entries:
Input: Y ∈ R_+^{I×T} and J ∈ N.
Output: Factors A ∈ R_+^{I×J} and X ∈ R_+^{J×T}, which solve problem (18).
(1) Select initial matrices A^(0) > 0 and X^(0) > 0.
(2) For k = 1 : maxIt
(2.1) Calculate the factor matrix A using the corresponding update formula.
(2.2) Calculate the factor matrix X using the corresponding update formula.
(2.3) Calculate the factor matrix A using the corresponding update formula.
(2.4) Calculate the factor matrix X using the corresponding update formula.
(2.5) Calculate the multiplier matrix Λ using the corresponding update formula.
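As a concrete illustration of the Ding-style scheme, the sketch below applies a multiplicative orthogonal update to the row-orthogonal factor X. The rule is transcribed, as an assumption, from the uni-orthogonal updates of Ding et al. [16] with the multiplier folded into the denominator; it is not the paper's exact Algorithm 1:

```python
import numpy as np

def onmf_ding(Y, J, max_iter=300, eps=1e-12, seed=0):
    """Ding-style multiplicative ONMF sketch for Y ~= A @ X with X X^T ~ I.

    A follows the plain Lee-Seung rule; X follows the square-root
    orthogonal rule (an assumed transcription of Ding et al.'s update)."""
    rng = np.random.default_rng(seed)
    I, T = Y.shape
    A = rng.random((I, J)) + 0.1
    X = rng.random((J, T)) + 0.1
    for _ in range(max_iter):
        A *= (Y @ X.T) / (A @ X @ X.T + eps)
        AtY = A.T @ Y
        # orthogonality-aware update: the X^T X term plays the role of Lambda_X
        X *= np.sqrt(AtY / (AtY @ X.T @ X + eps))
    return A, X
```

As the text notes, the multiplier formulation only encourages orthogonality, so XX^T approaches, but need not exactly equal, a diagonal matrix.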

Mathematical Problems in Engineering
The off-diagonal entries are obtained by ignoring the nonnegativity constraint on X and setting ∇_X(X) to a zero matrix: Equation (15) is derived from equation (10) using the fact that ‖X‖²_F = tr(X^T X) and from equation (16) using the orthogonality constraint XX^T = I. By combining equations (14) and (16), Λ_X can be defined as follows: Note that the formulation with this specific value of Λ_X does not strictly satisfy orthogonality. Therefore, ONMF algorithms prioritize the approximation while relaxing the degree of orthogonality.
To improve the approximation, Mirzal [22] proposed the following cost function: Using the same strategy as in the aforementioned algorithm, the update forms are derived as follows: Next, the author proposed an equivalent algorithm with a robust convergence guarantee by (1) transforming the algorithm into a corresponding AUR algorithm and (2) replacing every zero entry that does not satisfy the KKT conditions with a small positive number to prevent zero locking. This strategy was employed to derive a convergent algorithm for ONMF.
The AUR version of equation (18) can be expressed as follows: where [A]_ij = a_ij denotes the ijth entry of matrix A. Additionally, the author proposed an equally robust AUR algorithm (see Algorithm 1) for equation (18), where maxIt denotes the maximum number of iterations and ∇_A D_F represents the gradient of the objective function (18). The author proved the convergence of this algorithm. However, this algorithm incurs a high computational cost. The algorithm based on the alternating application of the rules in equations (5) and (6) is not guaranteed to converge to a first-order stationary point. However, a slight modification proposed by Lin [20] achieves this property (roughly speaking, MU is recast as a variable metric steepest descent method, and the step length is modified accordingly). In another study [23], the authors demonstrated through numerical experiments that the update rules in equations (7) and (8) work better than the original update rules in equations (5) and (6) in some cases. This indicates that equations (7) and (8) are important not only from a theoretical point of view but also from a practical viewpoint. Therefore, in the next section, we propose convergent updates based on the alternating approach.
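The zero-locking fix behind equations (7) and (8) amounts to wrapping each multiplicative update in a max(ε, ·) so that no entry can get stuck at zero. A minimal sketch, with variable names of our own choosing:

```python
import numpy as np

def mu_step_modified(Y, A, X, delta=1e-8, eps=1e-12):
    """One step of the epsilon-modified multiplicative updates in the spirit
    of Gillis & Glineur: max(delta, .) keeps every entry strictly positive,
    preventing zero locking (delta plays the role of the small number
    epsilon in the text)."""
    A = np.maximum(delta, A * ((Y @ X.T) / (A @ X @ X.T + eps)))
    X = np.maximum(delta, X * ((A.T @ Y) / (A.T @ A @ X + eps)))
    return A, X
```

Because every iterate stays in the interior of the nonnegative orthant, limit points of the sequence can be related to stationary points of the ε-perturbed problem, which is the theoretical advantage the text describes.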

Alternating Direction Algorithm for ONMF
3.1. Classic ADM Approach. ADM is an algorithm intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. The algorithm solves problems of the form:
with variables x ∈ R^n and z ∈ R^m, where A ∈ R^{p×n}, B ∈ R^{p×m}, and c ∈ R^p. We assume that f and g are convex. As in the method of multipliers, we form the augmented Lagrangian as follows: ADM consists of the following iterations: where ρ > 0 denotes the augmented Lagrangian parameter and c ∈ (0, 1.618) represents a step length. The algorithm is similar to dual ascent and the method of multipliers. It consists of an x-minimization step (26), a z-minimization step (27), and a dual variable update. As in the method of multipliers, the dual variable update uses a step size that is equal to the augmented Lagrangian parameter ρ. In contrast, the classic augmented Lagrangian multiplier method requires a joint minimization with respect to both x and z, i.e., it replaces steps (26) and (27) with a single joint minimization that involves both f(x) and g(z) and could become much more expensive. In Section 3.2, we propose improved algorithms for ONMF and present the convergence analysis.
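The three ADM steps can be made concrete on the simplest relevant instance, nonnegative least squares: take f(x) = ‖Ax − y‖², let g be the indicator of the nonnegative orthant, and impose x − z = 0. This toy example is our own illustration, not taken from the paper:

```python
import numpy as np

def admm_nnls(A, y, rho=1.0, gamma=1.0, iters=800):
    """ADM sketch for min ||Ax - y||^2 s.t. x >= 0, via the split x = z.

    rho is the augmented Lagrangian parameter; gamma in (0, 1.618) is the
    dual step length, matching the text's admissible range."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); lam = np.zeros(n)
    Aty = A.T @ y
    M = np.linalg.inv(2 * (A.T @ A) + rho * np.eye(n))  # cached x-step solve
    for _ in range(iters):
        x = M @ (2 * Aty + rho * z - lam)   # x-minimization step
        z = np.maximum(0.0, x + lam / rho)  # z-minimization: projection
        lam = lam + gamma * rho * (x - z)   # dual variable update
    return z
```

Note how the splitting turns the hard constrained problem into a cheap linear solve plus an elementwise projection, which is exactly the decomposability ADM is designed to exploit.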
We discuss the orthogonality constraint on the rows of X. Similar results can be derived for A.

Classic ADM Approach Extension to ONMF.
First, we define the objective of the ONMF as follows: such that A ≥ 0 and X ≥ 0. The AUR version of Problem (18) can be modified as follows: Inspired by the approaches proposed by Gillis and Glineur [23] and Yoo and Choi [24], we employ these strategies to devise a convergent algorithm for NMF with an orthogonality constraint, which is Problem (4.1). The modified AUR version of Problem (4.1) is as follows: where ε > 0 denotes a small number, ρ > 0 denotes the penalty parameter, and c ∈ (0, 1.618) represents a step length.
The terms above stand for the modifications for avoiding zero locking with a small positive number σ. The detailed procedure for the ALS-based version of Problem (4.1) is given in Algorithm 2.
For Algorithm 2, we obtain the following basic result. Proof. The proof of the statement for A is similar to that for X; therefore, we only prove the statement for X. The statement is obvious for k = 0; hence, we only need to prove the results for k ≥ 1.
Similarly, the statement for A can be proven.

Convergence Analysis.
To analyze the convergence of Algorithm 2, we first prove its nonincreasing property under the update rules in equations (25) and (26). Because the algorithm solves Problem (1.1) in an alternating manner, L(A^(k)) and L(X^(k)) can be analyzed separately. Because the update rule of A is similar to that of X, it is sufficient to prove the nonincreasing property of L(X^(k)).
By making use of the auxiliary function approach [3], the nonincreasing property of L(X^(k)) can be proved by showing that L(X^(k+1)) ≤ G(X^(k+1), X^(k)) ≤ G(X^(k), X^(k)) = L(X^(k)), where the auxiliary function G is defined, similarly to [22], as G(X, X^(k)) = L(X^(k)) + tr((X − X^(k))^T ∇_X L(X^(k))) + (1/2) Σ_{t=1}^{T} (x_t − x_t^(k))^T D_t (x_t − x_t^(k)), where x_t (t = 1, 2, . . . , T) denotes the tth column of X. Meanwhile, we define ∇_X L(X^(k))_t as the tth column of ∇_X L(X^(k)), and D_t (t = 1, . . . , T) as a diagonal matrix with its diagonal entries determined by the corresponding non-KKT index set. Here, l_t = {j : x^(k)_{jt} > 0, ∇_X L(X^(k))_{jt} ≠ 0, or x^(k)_{jt} = 0, ∇_X L(X^(k))_{jt} < 0} denotes the set of non-KKT indices in the tth column of X^(k).
Using the Taylor series, the expansion of L(X) can be expressed as follows: where L = ∇²_X L(X^(k)) and ξ^(k)_X represents the higher-order terms of the Taylor series, and where ∇²_X L(X^(k))_t denotes the tth column of ∇²_X L(X^(k)). Next, for the function G(X, X^(k)), which is an auxiliary function of L(X), we first have to prove Lemma 2. As shown later, A^(k) and X^(k) must be bounded for L(A^(k), X^(k)). The boundedness of A^(k) and X^(k) will be proven in Theorem 3.
where D_t and δ^(k)_X D_t are diagonal matrices. With the boundedness of A^(k) and X^(k) and a sufficiently large δ^(k)_X, the inequality G(X, X) ≤ G(X, X^(k)) can be guaranteed.
Additionally, it is clear that G(X, X) = G(X, X^(k)) only if X^(k) satisfies the KKT conditions demonstrated in equation (48), regardless of ξ^(k)_X. Because δ^(k)_X is a variable, G(X, X) = G(X, X^(k)) holds only if X = X^(k) in equation (51), as a result of the update rule of X and the boundedness of A^(k) and X^(k).

Given the positive semidefiniteness of the matrix D, G(X, X^(k)) is a strictly convex function with respect to X; therefore, it has a unique minimum. The aforementioned considerations result in the following: if X^(k) satisfies the KKT conditions, then X = X^(k), and therefore, G(X^(k), X^(k)) = G(X, X^(k)).
Conversely, assume that the equality holds but X^(k) does not satisfy the KKT conditions; then there is at least one index for which the conditions are violated. Furthermore, if x^(k)_{jt} = 0, then x_{jt} = x^(k)_{jt}, which contradicts the equality assumption. Therefore, x^(k)_{jt} ≠ 0. As a result, we obtain a contradiction with the assumption. Statement (2) can be obtained directly from Lemma 1.

Proof.
This theorem is a corollary of Theorem 1.
Proof. Let us verify Assumptions A1-A3 (these assumptions can be found in the work of [23]). Assumption A1 holds because of the coercivity of ‖·‖_F. Assumptions A2 and A3 hold because all the coefficient matrices are identity matrices. Additionally, because ‖·‖_F is Lipschitz differentiable, the sequence (A^(k), X^(k), Λ^(k)) generated by Algorithm 3 is bounded, and L(A^(k), X^(k)) is lower bounded according to Theorem 2 in the work of [23]. The convergence property of the modified update rules in equations (27) and (29) is stated as follows. □ Theorem 3. Given a sufficiently large number δ^(k)_X, every limit point of this algorithm is a stationary point of the following optimization problem: Proof. Because A^(k), X^(k) (k = 1, . . . , ∞) lie in a compact set, the sequence (A^(k), X^(k)) generated by Algorithm 2 has at least one limit point. Gillis and Glineur [23] introduced a similar analysis showing that every limit point of this algorithm is a stationary point of the optimization problem (41).

Auxiliary Variable-Based ADM Extension to ONMF.
To facilitate the efficient use of ADM, we first introduce two auxiliary variables, U and V, and consider the following equivalent model: where A, U ∈ R^{m×q} and X, V ∈ R^{q×n}. The augmented Lagrangian of equation (59) is as follows: where Λ ∈ R^{m×q} and Π ∈ R^{q×n} are Lagrangian multipliers, α, ρ_1, ρ_2 > 0 are penalty parameters, and A·B := Σ_{i,j} a_{ij} b_{ij} for matrices A and B of the same size. The alternating direction method for equation (59) is derived by successively minimizing L_A with respect to A, X, U, and V, one at a time, while fixing the others at their most recent values, and then updating the multipliers Λ and Π. The introduction of the two auxiliary variables U and V makes it easy to carry out each of the alternating minimization steps. Specifically, these steps can be expressed in closed form as follows: The detailed procedure for the algorithm of Problem (59) is given in Algorithm 3.
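The closed-form steps have a compact NumPy realization. The sketch below follows the splitting A = U ≥ 0, X = V ≥ 0 with multipliers Λ and Π, but it omits the orthogonality handling and the parameter α, so it is a simplified stand-in for Algorithm 3 rather than a faithful implementation:

```python
import numpy as np

def adm_nmf(Y, J, rho1=1.0, rho2=1.0, iters=300, seed=0):
    """Auxiliary-variable ADM sketch for Y ~= A @ X with splits
    A = U >= 0 and X = V >= 0 (multipliers Lam, Pi); every subproblem
    is solved in closed form, as in the text."""
    rng = np.random.default_rng(seed)
    I, T = Y.shape
    A = rng.random((I, J)); X = rng.random((J, T))
    U = A.copy(); V = X.copy()
    Lam = np.zeros_like(A); Pi = np.zeros_like(X)
    Id = np.eye(J)
    for _ in range(iters):
        # A-step: min ||Y - A X||_F^2 + Lam.(A - U) + (rho1/2)||A - U||_F^2
        A = (2 * Y @ X.T - Lam + rho1 * U) @ np.linalg.inv(2 * X @ X.T + rho1 * Id)
        # X-step: same structure with Pi and V
        X = np.linalg.solve(2 * A.T @ A + rho2 * Id, 2 * A.T @ Y - Pi + rho2 * V)
        # U-, V-steps: projections onto the nonnegative orthant
        U = np.maximum(0.0, A + Lam / rho1)
        V = np.maximum(0.0, X + Pi / rho2)
        # multiplier updates
        Lam = Lam + rho1 * (A - U)
        Pi = Pi + rho2 * (X - V)
    return U, V
```

The point of the splitting is visible in the code: nonnegativity is handled entirely by the trivial U- and V-projections, while the A- and X-steps reduce to unconstrained linear solves.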

Experiments and Results
In this section, we analyze and compare Algorithms 2 and 3 with Algorithms 1 and 4 (ONP-MF [18]). All experiments are performed using the Windows 7 operating system and MATLAB v8.1 (R2013a) running on a Lenovo laptop with an Intel Core(TM) i5-3470 CPU at 3.2 GHz and 4 GB of memory.

Test on Images.
To visualize the results, we use examples of ONMF arising in image processing to analyze the quality of the solutions. Two grayscale images are tested using Algorithms 1-4. The two grayscale images, a panda in Figure 1(a) and a cat in Figure 1(b), have resolutions of 1200 × 1920 and 128 × 128, respectively.
We evaluate the degree of approximation and orthogonality using two indices:
Relative residual error (RRE):
Orthogonality residual error (ORE):
We set tol = 10^{-5}, δ = 10^{-8}, and α = 0.1, with maxiter = 500 for the image of the cat and maxiter = 100 for the image of the panda; we then repeat the process 10 times and calculate the relative errors and orthogonality. The results are given in Tables 1-5, and the recovered images are shown in Figures 2 and 3.
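The two indices have the standard forms RRE = ‖Y − AX‖_F / ‖Y‖_F and ORE = ‖XX^T − I‖_F; the paper's equations are elided in this copy, so the exact normalization below is an assumption:

```python
import numpy as np

def rre(Y, A, X):
    """Relative residual error ||Y - A X||_F / ||Y||_F."""
    return np.linalg.norm(Y - A @ X) / np.linalg.norm(Y)

def ore(X):
    """Orthogonality residual error ||X X^T - I||_F for a J x T factor X
    whose rows are meant to be orthonormal."""
    return np.linalg.norm(X @ X.T - np.eye(X.shape[0]))
```

Both indices are zero exactly when the factorization is exact and the rows of X are orthonormal, which is why they separate approximation quality from orthogonality quality in Tables 1-5.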
Tables 1 and 2 show that Algorithm 2 performs slightly better than Algorithms 1 and 3 in terms of relative error in most cases. Moreover, we observe that Algorithm 4 converges in the fewest iterations. From Tables 3 and 4, we can see that Algorithm 2 demonstrates the best performance compared with Algorithms 1, 3, and 4 in terms of orthogonality. From Table 5, we observe that Algorithm 4 requires the lowest computational cost among all the algorithms. From Figures 2 and 3, we observe that Algorithm 2 achieves a better recovery quality than Algorithms 1 and 3, while the recovery quality of Algorithm 4 is the best.

Clustering Capability.
We test the clustering capability of the ONMF algorithms on the ALLAML gene-sample data provided by [4] and compare their performance. To measure the clustering performance, we use purity and entropy as our performance metrics. We expect these metrics to provide good insight into how our algorithms work.
Entropy measures how classes of genes are distributed within each cluster [25]. It is defined as follows: where N denotes the number of samples, c_rs denotes the number of samples in the rth cluster that belong to the sth class, and c_r denotes the size of the rth cluster. Purity is the most commonly used metric. It measures the extent to which each cluster contains data points from primarily one class [25]. Therefore, the higher the value of purity, the better the clustering performance. Purity is defined as follows: The results are reported in Tables 6 and 7. From Tables 6 and 7, we observe that Algorithms 2 and 3 achieve better purity and entropy results on the leukemia data sets than Algorithms 1 and 4. There exist slight differences in the relative performance of purity and entropy in our comparison; that is, higher purity values do not necessarily correspond to lower entropy values. This is because entropy takes into account the entire distribution of the genes in a particular cluster and not only the largest class, as in the computation of purity. In summary, the comparison shows that Algorithms 2 and 3 are viable and competitive in clustering tasks.
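Both metrics can be computed directly from the cluster and class label vectors. The implementations below use the standard definitions (purity = (1/N) Σ_r max_s c_rs, entropy as a cluster-size-weighted average of within-cluster class entropies); the paper's equations are elided here, so the log base of 2 is an assumption:

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Fraction of samples assigned to the majority class of their cluster;
    higher is better, 1.0 means every cluster is pure."""
    clusters = np.unique(labels_pred)
    hits = sum(np.bincount(labels_true[labels_pred == c]).max()
               for c in clusters)
    return hits / len(labels_true)

def entropy_score(labels_true, labels_pred, base=2):
    """Weighted within-cluster entropy (lower is better, 0 means pure
    clusters); c_r / N weights each cluster by its size."""
    N = len(labels_true)
    H = 0.0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        cr = len(members)
        counts = np.bincount(members)
        p = counts[counts > 0] / cr
        H += (cr / N) * (-(p * np.log(p) / np.log(base)).sum())
    return H
```

A perfect clustering (up to relabeling of clusters) yields purity 1 and entropy 0, while lumping everything into one cluster drives purity down toward the largest class fraction, which matches the interpretation given in the text.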

Conclusion
In this study, we develop two efficient modified algorithms based on ADM for the ONMF problem. We then prove that any sequence of solutions generated by the modified updates of Algorithms 2 and 3 has at least one convergent subsequence, and the limit of any convergent subsequence is a stationary point of the corresponding optimization problem. From the numerical results, a notable observation is that Algorithms 2 and 3 are feasible and efficient, especially with respect to the orthogonality of the solutions. The framework for the convergence analysis of the updates for Algorithm 3 presented in this article may be applied to ADM algorithms for NMF [26].

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.