Multiview Clustering via Robust Neighboring Constraint Nonnegative Matrix Factorization

Many real-world datasets are described by multiple views, which provide complementary information to each other. Synthesizing multiview features for data representation can lead to a more comprehensive description of the data for clustering tasks. However, it is often difficult to preserve the true local structure in each view and to reconcile the noises and outliers among views. In this paper, instead of seeking a common representation among views, a novel robust neighboring constraint nonnegative matrix factorization (rNNMF) is proposed to learn the neighbor structure representation in each view, and an L2,1-norm-based loss function is designed to improve its robustness against noises and outliers. Then, a final comprehensive representation of the data is integrated from the representations of the individual views. Finally, a neighboring similarity graph is learned, and the graph cut method is used to partition the data into its underlying clusters. Experimental results on several real-world datasets show that our model achieves more accurate performance in multiview clustering than existing state-of-the-art methods.


Introduction
Clustering is a fundamental topic in machine learning and data mining. Real-world datasets are often composed of different views, and these views often provide compatible and complementary information. Thus, multiview clustering (MVC) aims to integrate those different views and uncover the consistent latent information to achieve better clustering performance [1]. Over the past decades, it has attracted great attention [2, 3] and has been widely used in various real applications [4].
Essentially, given the multiview inputs, the critical task in MVC is to fuse the information of different views and learn a common agreement for clustering. To integrate views efficiently, many subspace clustering-based methods [5, 6] and nonnegative matrix factorization- (NMF-) based methods [7, 8] have been developed. In particular, NMF is shown to be equivalent to relaxed k-means, and symmetric NMF is closely related to spectral clustering [9]. However, NMF cannot preserve the geometrical structure of the data space, which is essential for finding the true cluster structures. Many manifold learning methods, motivated by the so-called local invariance idea that nearby points are likely to have similar embeddings, have been proposed, such as locally linear embedding (LLE) [10] and locality preserving projection (LPP) [11]. In particular, Cai et al. [12] proposed graph-regularized nonnegative matrix factorization (GNMF) to find a compact representation which uncovers the hidden semantics and simultaneously respects the intrinsic geometric structure. It is well accepted that clustering performance can be significantly enhanced when local invariance is considered. These, however, are all single-view clustering methods.
On the other hand, many NMF-based MVC methods [13] have attracted attention, in which various constraints are applied to the coefficient matrix to cluster the data points. Multi-NMF [14] formulated a joint multiview NMF learning process with a constraint that encourages the representation of each view toward a common consensus. Many extensions of multi-NMF were proposed for image clustering and other tasks [15]. In [16], two weight matrices are introduced to alleviate the issue of dataset imbalance in real applications. Ou et al. [17] explored the local geometric structure of each view under the patch alignment framework and adopted a correntropy-induced metric to measure the reconstruction error of each view to improve robustness. A deep matrix factorization model [18] seeks a common representation by introducing graph regularization to guide shared representation learning in the final layer of each view. However, existing approaches all exploit the common information shared by multiple views but neglect the diversity among views. The diversity means that each view of the data contains some distinct information that the other views do not have.
In this paper, we propose a novel MVC method, called robust neighboring constraint NMF (rNNMF), which uses the local neighboring structure of each view to capture the diversity features. In rNNMF, a neighboring graph is constructed and updated for each view during the factorization process to obtain the underlying diversity features. Finally, these diversity features are combined into an integrated feature for the dataset; a global graph is then generated from this integrated feature, and Ncut is used to partition the data into its underlying groups.
In summary, the novelty and contribution of our research are as follows: (1) A neighboring constraint NMF method is proposed to learn the diversity representation of data in each view. The proposed model keeps only the nearest relationship between a point and its nearest neighbor to maintain the geometrical structure during feature learning in each view.
(2) An L2,1-norm loss function is used in rNNMF to improve the robustness of the feature in each view and reduce the effect of noisy features. The rest of this paper is organized as follows. Section 2 introduces related work on NMF-based MVC algorithms. Our proposed robust NMF-based MVC model is introduced in Section 3, the experimental results are presented in Section 4, and conclusions and discussions are given in Section 5.

Related Work
Both subspace clustering and NMF-based methods are important in MVC. For example, a robust graph can be learned with correlation consensus agreement in [5] to improve the clustering performance. A multigraph regularized low-rank representation- (LRR-) based method was proposed to achieve data correlation consensus among all views [6]. A structured LRR was proposed by factorizing into latent low-dimensional data-cluster representations, which characterize the data clustering structure for each view [1]. Meanwhile, NMF-based methods [19] have also proved useful; they enforce the constraint that the elements of the factor matrices must be nonnegative. It has been shown that when the Frobenius norm is used as the divergence, NMF is equivalent to a relaxed form of the K-means clustering method. However, NMF fails to discover the intrinsic geometry of the data, which is essential in real applications. To preserve the local geometrical structure of the data space, Cai et al. imposed graph regularization on NMF (GNMF). In [20], Shang et al. proposed graph dual regularization NMF (DNMF), which simultaneously considers the geometric structures of the data manifold and the feature manifold. Two subspace clustering algorithms were proposed in [21], which established connections with spectral normalized cut [22] and ratio cut clustering. They also extended nonlinear orthogonal NMF and introduced a graph regularization to obtain a factorization that respects the local geometric structure after the nonlinear mapping.
In MVC, NMF-based methods have also received increasing attention. Let the input be X = {X^(1), ..., X^(v), ..., X^(V)}, where X^(v) is the v-th view, a d_v × n matrix, where d_v denotes the feature dimensionality (rows) and n is the number of data points (columns). H^(v) is the representation of the v-th view, a p × n matrix, where p denotes the dimensionality of the learned representation, and Z^(v) is the corresponding basis matrix. The overall framework of NMF-based methods is

min_{Z^(v), H^(v) ≥ 0} Σ_{v=1}^{V} ‖X^(v) − Z^(v)H^(v)‖²_F + λ Σ_{v≠s} f(H^(v), H^(s)),

where ‖·‖_F is the Frobenius norm, and both Z^(v) and H^(v) must be nonnegative. By default, the input data of each view should also be nonnegative for NMF-based methods, i.e., X^(v) ≥ 0. f(H^(v), H^(s)) is the regularization term that learns the agreement among different views. For example, MulNMF designed a constraint that encourages the representation of each view toward a common consensus H*. DiNMF [23] introduced a constraint term tr(H^(v)H^(s)T) to guarantee the diversity among points in different views. To deal with mixed-sign data, based on the semi-NMF model, a deep semi-NMF method couples the output representations in the final layer of factorization and enforces the views to share the same representation after layer-by-layer factorization.
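As an illustration of this framework, the following sketch alternates standard multiplicative NMF updates per view with a soft pull toward a consensus representation H*. It is a minimal sketch of the consensus-style objective above, not the exact MulNMF algorithm; the function name, the averaging choice for H*, and the way the consensus term is folded into the multiplicative update are our own.

```python
import numpy as np

def mvc_nmf_sketch(views, p, n_iter=200, lam=0.1, seed=0):
    """Schematic multiview NMF: per-view factorization X^(v) ~ Z^(v) H^(v)
    plus a soft pull of every H^(v) toward a consensus H*.
    Illustrative sketch only, not the exact MulNMF update rules."""
    rng = np.random.default_rng(seed)
    n = views[0].shape[1]
    Z = [rng.random((X.shape[0], p)) for X in views]
    H = [rng.random((p, n)) for _ in views]
    eps = 1e-10
    for _ in range(n_iter):
        Hstar = np.mean(H, axis=0)          # consensus representation
        for v, X in enumerate(views):
            # standard multiplicative updates (Lee & Seung style) with a
            # consensus term lam * ||H^(v) - H*||_F^2 folded in
            Z[v] *= (X @ H[v].T) / (Z[v] @ H[v] @ H[v].T + eps)
            num = Z[v].T @ X + lam * Hstar
            den = Z[v].T @ Z[v] @ H[v] + lam * H[v] + eps
            H[v] *= num / den
    return Z, H, np.mean(H, axis=0)
```

Because the consensus H* is nonnegative, the extra term splits cleanly into the nonnegative numerator and denominator of the multiplicative rule, so all factors stay nonnegative throughout.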
Although good performance can be achieved by those methods through finding a common agreement among views, the consensus information cannot be explored effectively, and they do not make full use of the information in the multiple views [18]. To make full use of the diversity information of the views, combining the views' representations is a natural approach, and our work focuses on this kind of combination. However, directly aggregating the original information contained in the views can lead to poor clustering performance. Therefore, it is necessary to design a new method that can not only maximally preserve the diversity feature of each view but also obtain an aggregated representation with good clustering performance.

Robust Neighboring Constraint Regularization. Given the data X^(v) of the v-th view, we define the neighbor indicator matrix R^(v) as

R^(v)(i, j) = 1 if i ∈ N^(v)(j), R^(v)(j, j) = −1, and 0 otherwise,

where N^(v)(j) is the set of the nearest neighbors of point j in the v-th view; that is, if point i is one of the nearest neighbors of j, R^(v)(i, j) is set to 1. We hope the difference between points j and i is as small as possible, and column j of H^(v)R^(v) describes this diversity of the point with respect to its neighbors: the smaller the value of H^(v)R^(v), the more similar they are. We introduce the L2,1 norm to penalize H^(v)R^(v) when seeking a representation in each view:

min_{H^(v)} ‖H^(v)R^(v)‖_{2,1}.
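The construction above can be sketched as follows. This is our own reading of the definition (with the diagonal entry set so that each column of H R collects neighbor differences, consistent with R being mixed-sign), not the paper's exact code; function names are ours.

```python
import numpy as np

def neighbor_indicator(X, k=1):
    """Build the mixed-sign neighbor matrix R for one view (columns of X
    are points): R[i, j] = 1 if i is among the k nearest neighbors of j,
    and R[j, j] = -k, so that column j of H @ R sums the differences
    between the neighbors' representations and h_j."""
    n = X.shape[1]
    # pairwise squared Euclidean distances between columns
    sq = (X ** 2).sum(0)
    D = sq[:, None] + sq[None, :] - 2 * X.T @ X
    np.fill_diagonal(D, np.inf)          # exclude self from neighbor search
    R = np.zeros((n, n))
    for j in range(n):
        nn = np.argsort(D[:, j])[:k]     # k nearest neighbors of point j
        R[nn, j] = 1.0
        R[j, j] = -float(k)              # each column of R sums to zero
    return R

def l21_norm(M):
    """L2,1 norm: sum of the Euclidean norms of the columns."""
    return np.linalg.norm(M, axis=0).sum()
```

With k = 1, as the paper prescribes, column j of H R is exactly h_i − h_j for the single nearest neighbor i of j, and l21_norm(H @ R) is the regularizer being minimized.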

Objective Function and Optimization Algorithm.
In MVC, to learn the neighbor information in each view, based on ORNMF [19], which is a robust representation approach, the proposed rNNMF can be expressed as

min_{W^(v), H^(v) ≥ 0} Σ_{v=1}^{V} ( ‖X^(v) − W^(v)H^(v)‖_{2,1} + α‖H^(v)R^(v)‖_{2,1} ),  (4)

where the L2,1-norm is applied to the loss function and defined as ‖X − WH‖_{2,1} = Σ_{i=1}^{n} ‖X_i − WH_i‖. Because the error of each point is not squared, the impact of large errors is reduced significantly. The first term in equation (4) is the data fidelity term of the v-th view, and the second term is the nearest neighboring constraint term. H^(v) is the representation learned from the v-th view. α is a positive parameter that specifies the relative importance of the factorization term and the regularization term in the model.
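To see why the unsquared per-point error helps, compare how the two losses react to a single corrupted sample. This is a toy check with an arbitrary outlier magnitude; the helper names are ours.

```python
import numpy as np

def l21_loss(X, W, H):
    """L2,1 reconstruction loss: sum_i ||X_i - (WH)_i||_2 (column errors, not squared)."""
    return np.linalg.norm(X - W @ H, axis=0).sum()

def fro_loss(X, W, H):
    """Squared Frobenius loss: sum of squared entries of the residual."""
    return np.linalg.norm(X - W @ H) ** 2

rng = np.random.default_rng(0)
X = rng.random((5, 20)); W = rng.random((5, 3)); H = rng.random((3, 20))
base_l21, base_fro = l21_loss(X, W, H), fro_loss(X, W, H)

X[:, 0] += 100.0                      # corrupt one sample with a large outlier
ratio_l21 = l21_loss(X, W, H) / base_l21
ratio_fro = fro_loss(X, W, H) / base_fro
# the outlier grows the L2,1 loss linearly but the squared loss quadratically,
# so ratio_fro is far larger than ratio_l21
```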
Like most NMF-based methods, the objective function in (4) is not convex, so we present an iterative algorithm to reach a local minimum of (4).
Computing H^(v): to update H^(v) with W^(v) fixed, we need to solve the objective function (6), where D1^(v) and D2^(v) are n × n diagonal matrices whose elements reweight each data point and each neighbor difference, respectively, following the L2,1-norm derivation in [19]. The partial derivative of the Lagrangian function L(H^(v)) with respect to H^(v) is then computed. Because R^(v) is mixed-sign, we decompose it into two nonnegative parts M+ and M−, representing the positive part and the negative part, respectively, which yields the multiplicative updating rule (10). Computing W^(v): to update W^(v) with H^(v) fixed, the corresponding objective function should be solved. This is similar to that in [19, 24], so we obtain the updating rule (12), where (·) indicates the Hadamard product. For each view, the updating rules of H^(v) and W^(v) satisfy the theorem in [24], which guarantees the correctness of the rules. The correctness analysis and convergence proof of (10) and (12) are given, based on the method of [19], in the following section.
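The role of D1^(v) and D2^(v) can be illustrated with the standard iteratively reweighted trick for L2,1 norms. The exact 1/(2·norm) constants below are the usual ORNMF-style choice and are an assumption here, not copied from the paper:

```python
import numpy as np

def reweight_diagonals(X, W, H, R, eps=1e-10):
    """Diagonal reweighting matrices that turn the L2,1 terms into weighted
    quadratic ones (standard IRLS device; assumed form):
        D1[i, i] = 1 / (2 * ||X_i - (W H)_i||)
        D2[i, i] = 1 / (2 * ||(H R)_i||)
    Columns with small residuals get large weights; outlier columns are
    down-weighted, which is where the robustness comes from."""
    res = np.linalg.norm(X - W @ H, axis=0)   # per-sample reconstruction error
    d1 = 1.0 / (2.0 * np.maximum(res, eps))
    hr = np.linalg.norm(H @ R, axis=0)        # per-column neighbor differences
    d2 = 1.0 / (2.0 * np.maximum(hr, eps))
    return np.diag(d1), np.diag(d2)
```

Recomputing D1 and D2 after every update of H^(v) and W^(v) is what lets a quadratic solver minimize the nonsmooth L2,1 objective.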

Correctness and Convergence
Theorem 1. If the updating rule of H^(v) converges, then the final solution satisfies the KKT optimality conditions.
where t denotes the t-th iteration; at convergence, the update reaches a fixed point, from which the KKT condition follows. □ We now prove the convergence of the updating rule (10) using the auxiliary function approach in [19]. The definition of the auxiliary function is as follows. The auxiliary function is useful because of the following Lemma 1.
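For completeness, the standard auxiliary-function definition and the monotonicity chain behind Lemma 1 (as used in [19]) can be written as:

```latex
G(H, H') \ \text{is an auxiliary function for} \ J(H) \ \text{if} \quad
G(H, H') \ge J(H) \quad \text{and} \quad G(H, H) = J(H).
```

```latex
H^{(t+1)} = \arg\min_{H} G\big(H, H^{(t)}\big)
\;\Longrightarrow\;
J\big(H^{(t+1)}\big) \le G\big(H^{(t+1)}, H^{(t)}\big)
\le G\big(H^{(t)}, H^{(t)}\big) = J\big(H^{(t)}\big).
```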

Lemma 1. J is nonincreasing under the updating rule H^(t+1) = arg min_H G(H, H^(t)).
Proof. Following the definition of G(H, H′), we have J(H^(t+1)) ≤ G(H^(t+1), H^(t)) ≤ G(H^(t), H^(t)) = J(H^(t)). □ The key point is to find an appropriate auxiliary function for (6). Because the learning process is independent in each view in general, let J(H) represent J(H^(v)) for each view's process, and let B = D1 X^T W, A = W^T W, and C = R D2 R^T. We rewrite (6) accordingly. Since the update rules are elementwise, we prove that each entry H_ik is nonincreasing under the update (10) by defining an auxiliary function with respect to H_ik as follows.

Lemma 2. The function G(H, H′) defined above is an auxiliary function for J(H).

Proof. From the inequality z ≥ 1 + log z, which holds for all z > 0, we obtain the corresponding bounds. □ With the lemma and proposition in [19], we obtain the remaining inequalities. Collecting all the bounds, G(H, H′) ≥ J(H) holds, and Lemma 2 is proven. Hence, (6) is nonincreasing under the iterative updating rule (10).

Proof. G(H, H′) is a convex function. To find its minimum, following the KKT condition, we set its partial derivative with respect to H to zero.
□ This derives the updating rule (10) under the objective function J(H) in (6). The updating rule of W^(v) can also be derived by this method.

Clustering with Similarity Graph.
Because the neighboring structure is preserved during the factorization, a graph-based clustering method is chosen to cluster the data in our study. The similarity graph G is built from the final representation H by the k-NN algorithm. Then, normalized cut (Ncut) [22] is used to obtain the final clustering results, which achieves better performance by taking the graph structure of the data into account. Details of our method are described in Algorithm 1.
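A minimal sketch of this final step, building a heat-kernel k-NN graph from H and computing the Ncut-style spectral embedding (the final k-means on the embedding rows is omitted for brevity; function names are ours):

```python
import numpy as np

def knn_similarity_graph(H, k=5, sigma=1.0):
    """Symmetric k-NN similarity graph from representation H (columns are
    points), with heat-kernel weights exp(-d^2 / (2 sigma^2))."""
    n = H.shape[1]
    sq = (H ** 2).sum(0)
    D = sq[:, None] + sq[None, :] - 2 * H.T @ H   # squared distances
    np.fill_diagonal(D, np.inf)                   # exclude self
    Wg = np.zeros((n, n))
    for j in range(n):
        nn = np.argsort(D[:, j])[:k]              # k nearest neighbors of j
        Wg[nn, j] = np.exp(-np.maximum(D[nn, j], 0) / (2 * sigma ** 2))
    return np.maximum(Wg, Wg.T)                   # symmetrize

def spectral_embedding(Wg, n_clusters):
    """Ncut-style embedding: eigenvectors of the symmetric normalized graph
    Laplacian for the smallest eigenvalues; k-means on the rows of the
    returned matrix would yield the final clusters."""
    d = Wg.sum(1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    Lsym = np.eye(len(d)) - Dinv @ Wg @ Dinv
    vals, vecs = np.linalg.eigh(Lsym)             # ascending eigenvalues
    return vecs[:, :n_clusters]
```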

Experiment
Washington (http://www.cs.umd.edu/projects/linqs/projects/lbc/) belongs to WebKB, which collected webpages from four universities. The webpages are distributed over five classes: student, project, course, staff, and faculty, and they are described by two views: the content view and the citation view. Each webpage is described by 1703 words in the content view and by the citation links between pages in the citation view. We summarize the datasets in Table 1. The algorithms that we compare against are as follows: (1) BestSV, which reports the best performance among the single views [25]; NMF-based methods: (2) MulNMF and (3) D-SNMF; subspace clustering-based methods: (4) c-LRSSC [26], (5) p-LRSSC [26], (6) RMSC [27], (7) ECMSC [28], and (8) MVGL [29]. The codes of all the baseline methods are provided by their authors. We tune the parameters of all comparison methods according to the corresponding literature to obtain their best performance. For RMSC, its parameter is searched from 0.005 to 100 as the authors suggest. For all the NMF-based methods, we set the dimensionality of the new space to be the same as or larger than the number of clusters, and the initialization follows the authors' suggestions. K-means is applied to the new representation for clustering. This process is repeated 10 times, and the average clustering performance is recorded as the final result.
For rNNMF, the dimensionality of H^(v) is 60 for UCI and ORL, and 20 for 3Sources and Washington. It is initialized by NNDSVD, and results are averaged over five runs. The heat kernel, with parameter σ = 2, is used to compute the distances between points when selecting the neighboring points, as in [30] (https://github.com/louloupiano/PCPSNMF).
For evaluation, we use three metrics: accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (AR) [31]. For all the metrics, a higher value denotes better performance.

Input: data X^(v) for each view, parameter α, parameter k
Initialize: W^(v) and H^(v) for each view
for each view do
    update H^(v) and W^(v) until convergence
end for
H = [H^(1); · · · ; H^(v); · · · ; H^(V)]
Similarity graph: G is built with H and k
Clustering: Ncut(G)
Output: clustering results

ALGORITHM 1: rNNMF-based MVC model.

Clustering Performance.
Table 2 summarizes the clustering performance; the best values are in bold. As the table shows, rNNMF achieves the highest performance. On UCI, with settings σ = 0.1 and k = 5, it outperforms the second best method by roughly 6.78%, 9.07%, and 12.49% on the three metrics. On 3Sources, it also obtains the best results, with σ = 0.05 and k = 5. On ORL, when σ = 0.05 and k = 5, it also achieves the best ACC and AR, while it is slightly worse than ECMSC in NMI. On Washington, with σ = 0.001 and k = 7, it outperforms the second best method by roughly 2.43%, 0.77%, and 12.87%. Overall, rNNMF clearly achieves more accurate performance than the other methods. We present Figure 1 to show more details of the clustering results, using the similarity matrices yielded by two high-performance MVC methods on UCI and ORL. For UCI, the diagonal blocks of rNNMF are whiter than those of MulNMF, and the surrounding nondiagonal blocks are blacker. For ORL, similar conclusions hold, and they are even clearer in the similarity matrix of rNNMF.
Hence, with the representation combining process, rNNMF can fuse multiple views efficiently, and the neighboring constraint plays an important role in discovering the underlying structure of the points. It is observed that a comprehensive graph structure is important for discovering the cluster structure. ACC and NMI improve with increasing σ on UCI and reach their best values when σ = 0.1; then, the performance drops obviously.
This tendency can also be observed on 3Sources; the only difference is that the best ACC and NMI are obtained when σ = 0.05. On the ORL dataset, however, performance improves only slightly with increasing σ and drops obviously after reaching the best value. ACC and NMI on Washington drop with increasing σ. This shows that too large a σ can destroy the similarity structure in each view, leading to worse performance in the final clustering process, while a suitable σ can strengthen the cluster structure in the learning process. Furthermore, when σ = 0, the ACC and NMI of our model are still better than those of some methods, such as D-SNMF and RMSC, demonstrating that combining multiple views and building a graph is useful in MVC. Parameter k is important for creating the final similarity graph. Figure 3 shows the ACC and NMI results with different k values. As we can see, the final results are sensitive to k in the integrated representation. On UCI, 3Sources, and ORL, the best performance is obtained when k = 5, and performance then drops as k increases. On Washington, ACC and NMI reach their best when k = 7. This shows that the influence of k mainly concerns building the similarity graph from the integrated representation, and building this graph well is important. Figure 4 shows the convergence property of rNNMF by plotting the objective error at each iteration. The objective value decreases steadily on all datasets, and the NMI remains roughly stable around the convergence point, so the maximum number of iterations is set to 100 for all experiments.

Conclusion
In this paper, we proposed a novel NMF-based MVC model, named rNNMF. In this model, the neighbor structure representation is learned in each view, and an L2,1-norm-based loss function is designed to improve its robustness against noises and outliers. Then, a final representation of the data is integrated from the representations of all views, and a graph is learned from this representation. Finally, the graph cut method is used to partition the data into its underlying clusters. Unlike existing methods, rNNMF can well encode the local structure of each view's feature space and achieve structure agreement via combining fusion. Experiments show that the rNNMF-based model yields higher performance. One important direction for further work is to find a better graph structure to obtain a clearer representation. In addition, weighting each view is also worth studying to deal with varying levels of view quality. Future work can also extend the rNNMF model and its optimization strategy to handle dynamic data and achieve online multiview clustering.

Data Availability
All the data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.