Low Rank Correlation Representation and Clustering

Correlation learning is a technique utilized to find a common representation in cross-domain and multiview datasets. However, most existing methods are not robust enough to handle noisy data. As such, the learned common representation matrix can easily be influenced by noisy samples inherent in different instances of the data. In this paper, we propose a novel correlation learning method based on a low-rank representation, which learns a common representation between two instances of data in a latent subspace. Specifically, we begin by learning a low-rank representation matrix and an orthogonal rotation matrix to handle the noisy samples in one instance of the data so that a second instance of the data can linearly reconstruct the low-rank representation. Our method then finds a similarity matrix that closely approximates the common low-rank representation matrix, such that a rank constraint on the Laplacian matrix reveals the clustering structure explicitly without any spectral postprocessing. Extensive experimental results on the ORL, Yale, Coil-20, Caltech 101-20, and UCI digits datasets demonstrate that our method outperforms other state-of-the-art methods on six evaluation metrics.

The reason is that different instances of cross-domain and multiview datasets often share a common property that is useful in the overall clustering task [9,10]. As such, several correlation learning techniques, including canonical correlation analysis (CCA) [11-15], cotraining [15-17], nonnegative matrix factorization [18,19], and subspace clustering [20-26], have been utilized over the years to find a common representation in cross-domain and multiview datasets. For example, CCA [11] learns the linear relations between two sets of correlated variables, with a nonlinear variant proposed in [12]. The study in [13] builds on CCA, using it to obtain a low-dimensional subspace spanned by a set of correlated but different datasets. In addition, Qin et al. [14] utilized cluster CCA to align domains in a projected correlated subspace. Cotraining [16] is another classical method that maximizes the agreement between two distinct views of data to aid the learning process by alternately adapting the knowledge gained from one view to the second view. Considering the successes of cotraining, Kumar and Daume [17] incorporated the idea of cotraining into multiview spectral clustering by learning a common clustering matrix that agrees across different views. Furthermore, Sun and Jin [15] proposed a method named robust cotraining, which combines the principles of cotraining and CCA, where CCA is used to analyze the labels predicted by cotraining on unlabeled samples. While the clustering methods based on CCA [13,14] and cotraining [17] have been validated to be efficient, their clustering performance may degrade with noisy data, which is ubiquitous in practice [27].
Recently, subspace clustering methods [20-24] have provided a more robust solution to tackle the abovementioned shortcomings. These methods utilize either low-rank representation (LRR) [28,29] or a combination of LRR and sparse subspace clustering (SSC) [30] to minimize the discrepancy between cross-domain and multiview datasets in a latent subspace. Specifically, these methods exploit the self-expressiveness property to select similar samples in the original dataset to reconstruct each other and thereby find a low-dimensional representation. This is possible because high-dimensional data are assumed to lie in multiple corresponding low-dimensional subspaces, and the task becomes finding those low-dimensional subspaces together with their cluster members [31]. Accordingly, Xia et al. [20] combined LRR and SSC techniques to learn a low-rank transition matrix shared across multiple views, which is subsequently used for clustering. Similarly, Ding and Fu [22] and Zhang et al. [24] utilized the LRR technique to pursue a domain-invariant representation that agrees across domains. Brbic and Kopriva [23] used a combination of LRR and SSC to find a low-dimensional representation matrix for each view.
Then, using these low-dimensional subspace structures, Brbic and Kopriva [23] obtain a joint similarity matrix that balances the agreement between the different views. However, the methods mentioned above find the common representation matrix shared across views by assuming that similar samples in each view reside near each other. While this assumption is plausible, it can be problematic in the practical case where the data samples in each view are distributed randomly with noise. The chance then increases that two unrelated data samples are selected to reconstruct one another, and this limitation can cause the learned common representation matrix to be faulty in a manner capable of degrading clustering performance. Therefore, the recent work in [26], which extends [25], attempts to resolve this issue by allowing a common similarity matrix to approximate the low-dimensional embedding matrices of the different views. However, this state-of-the-art method fails to consider the proximity issue mentioned above in constructing the individual view structures. Thus, noise can propagate into the different low-dimensional embedding matrices.
To this end, in this paper, we propose a novel correlation learning method, which finds a common low-rank matrix between two different instances of data in a latent subspace. The core idea is that we learn this common low-rank matrix using one instance of the data in such a way that a second instance can linearly reconstruct it. First, we consider a real-world scenario where two instances of data have different dimensions. Hence, we utilize matrix factorization to align a low-rank matrix learned in one instance to the other through an orthogonal constraint. This approach allows the low-rank matrix to become a common one shared between the two domains. Moreover, it avoids the propagation of noise across different low-rank matrices. Our method then finds an ideal similarity matrix that closely approximates the common low-rank matrix, with a rank constraint on the Laplacian matrix so that it reveals the clustering structure explicitly without spectral clustering postprocessing. Extensive experimental results on the ORL, Yale, Coil-20, Caltech 101-20, and UCI digits datasets demonstrate that our method outperforms other state-of-the-art methods on six evaluation metrics, namely, accuracy (ACC), normalized mutual information (NMI), adjusted rand index (AR), F-score, precision, and recall.
Our major contributions are summarized as follows: (1) We propose a novel method based on LRR for correlation learning between two instances of data. Specifically, our method learns a common low-rank matrix in one instance of the data and uses matrix factorization to align it with the second domain. This way, our method avoids the propagation of noise across different low-rank matrices that could otherwise cause a faulty common matrix.
(2) Furthermore, our method obtains a clustering structure without performing spectral postprocessing of the low-dimensional embedding matrix. To achieve this, we find an ideal similarity matrix that best approximates the common low-rank matrix such that a rank constraint on the Laplacian matrix reveals the clustering structure explicitly. (3) Extensive experimental results on five benchmark datasets demonstrate that our method outperforms other state-of-the-art methods on six evaluation metrics.

Related Work
The purpose of correlation learning in cross-domain and multiview clustering tasks is to learn a common representation that maximizes the agreement between different instances or distributions of a given dataset to improve clustering performance [17]. Many studies [27,31-33] provide an extensive review of the existing methods utilized previously to learn the correlation between cross-domain and multiview datasets. Among them are two classical methods, namely, CCA [11] and cotraining [16], which have inspired several other multiple-instance-based methods such as [13,17]. While [13] is built on CCA principles, [17] is based on cotraining, in which a common structure is learned in a low-dimensional subspace to reduce the disagreement between distinct data views. Moreover, the more recent subspace clustering methods [22,24,34], built on LRR [28,29], provide more robust approaches. These methods construct a common similarity matrix to align the different views or cross-domain datasets using the low-rank representation of each view, where similar data samples are selected to reconstruct one another linearly [35]. For simplicity, assuming we have a single-view data matrix X with some noisy samples, LRR can obtain a low-rank representation matrix using nuclear norm regularization [36] as follows:

min_U ‖U‖_* + λ‖E‖_{2,1}, s.t. X = XU + E, (1)

where X denotes the self-dictionary, ‖·‖_* denotes the nuclear norm, and E = X − XU denotes the error matrix. Upon obtaining U in a multiview or cross-domain setting, the subspace clustering methods mentioned above can find a common similarity matrix to balance the agreement between the different low-rank representations U. Besides, some methods, such as [20,23], combine both LRR and SSC [30] to learn a common similarity matrix. Regardless, these subspace clustering methods utilize a two-phase learning approach to obtain a clustering structure by applying spectral clustering [37] on the learned similarity matrix. In this case, the clustering performance can degrade when the similarity matrix produced in the first phase is faulty. Considering this drawback, Yang et al. [38] utilized block diagonal constraints to encourage a proper common matrix, and Liang et al. [39] explored the diversity between domains to improve performance. Nonetheless, [38,39] are equally two-phase methods. Moreover, the previous method [26] and the more recent one [40] can obtain a clustering structure directly without applying spectral clustering over the consensus similarity matrix. They both achieve that by imposing a rank constraint on the Laplacian matrix to guarantee exactly c connected components.
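To make the role of the nuclear norm in equation (1) concrete, the sketch below shows the singular value thresholding operator that standard LRR-style solvers use as the proximal step for ‖U‖_*; the function name, toy data, and threshold are our own illustrations rather than the paper's code.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau * ||.||_*.
    Shrinks each singular value by tau and drops those that fall to zero,
    which is how LRR-style solvers suppress small, noise-related components."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt

# Toy usage: a noisy matrix whose clean part is rank 2.
rng = np.random.default_rng(0)
clean = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 30))
noisy = clean + 0.05 * rng.standard_normal((50, 30))
denoised = svt(noisy, tau=1.0)
print(np.linalg.matrix_rank(denoised))  # typically far smaller than 30
```

In practice this operator is applied inside an alternating scheme (e.g., ADMM) rather than once, but the thresholding step itself is the source of the low-rank effect.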
Our proposed method is related to subspace clustering methods based on LRR because we also obtain a similarity matrix through a low-rank representation. However, unlike these methods, our method learns a common low-rank representation matrix from only one instance of the data. Then, following an approach similar to [26,40], we directly pursue a block diagonal clustering structure without performing any spectral postprocessing.

Proposed Method
In this section, we present our proposed low-rank correlation representation and clustering method. First, we formulate the common low-rank matrix and then incorporate clustering directly into our model.

Model Formulation.
Given two multi-instance datasets, X and Y, of sizes n × d1 and n × d2, respectively, where d1 and d2 are the dimensions of the feature spaces and n is the number of samples, a naive approach would be to learn different low-rank matrices for X and Y using equation (1) and then merge both matrices to obtain a common matrix. This approach, however, has a limitation: it yields a fallible common matrix if any of the individual low-rank matrices is faulty. Therefore, we suppose that a common matrix can be obtained differently through one instance of the data, since multiview data originate from one underlying latent subspace [41]. On the other hand, this is tricky for two reasons. First, the dimensions of X and Y are different. Second, the redundant view-specific structure and the common structure in X and Y are mixed together. To tackle this, we utilize matrix factorization [42,43] to adaptively find a low-rank matrix U that captures the common structure in X and Y. At first, we obtain the common structure from Y as follows:

min_{U,P} ‖U‖_* + λ1‖Y − UP^T‖_F^2, s.t. P^T P = I, (2)

where U ∈ R^{n×n} denotes a low-rank matrix and P ∈ R^{d2×n} is a factorized variable that makes Y ≈ UP^T possible under the orthogonal constraint P^T P = I. Then, U can be made to hold the common structure of X as well with the following model:

min_{U,P,Q} ‖U‖_* + λ1(‖Y − UP^T‖_F^2 + ‖U − XQ‖_F^2), s.t. P^T P = I, Q^T Q = I, (3)

where Q ∈ R^{d1×n} plays a role similar to P and is used here to align the common structure in the two domains under the orthogonal constraint Q^T Q = I, so that U can capture an accurate manifold structure of X and Y through U ≈ XQ.
Upon learning U, our method finds an ideal similarity matrix S = [s_1, . . . , s_n]^T ∈ R^{n×n} by making S approximate U as closely as possible. As such, we formulate our model as follows:

min_{U,P,Q,S} ‖U‖_* + λ1(‖Y − UP^T‖_F^2 + ‖U − XQ‖_F^2) + λ2‖S − U‖_F^2, s.t. P^T P = I, Q^T Q = I, S1 = 1, S ≥ 0, rank(L_s) = n − c, (4)

where S ≥ 0 ensures that all the entries of S are nonnegative, and L_s = I − S denotes the normalized Laplacian matrix because the constraint S1 = 1 normalizes S such that Σ_j S_ij = 1. Then, rank(L_s) = n − c allows S to become the clustering structure with exactly c connected components, avoiding spectral postprocessing of the low-dimensional embedding matrix through the following theorem [44,45].

Theorem 1. If the similarity matrix S is nonnegative, the multiplicity c of the eigenvalue zero of the Laplacian matrix L_s is equal to the number of connected components in the graph associated with S.
A detailed proof of Theorem 1 is given in Proposition 2 of [45].
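Theorem 1 is easy to verify numerically: for a nonnegative, row-stochastic similarity matrix made of c disconnected blocks, L_s = I − S has exactly c zero eigenvalues. A minimal check follows; the toy sizes and values are our own and only illustrate the statement.

```python
import numpy as np

# Three disconnected blocks of four nodes each -> three connected components.
block = np.full((4, 4), 0.25)            # each row sums to 1, mimicking S1 = 1
S = np.kron(np.eye(3), block)            # nonnegative, block diagonal similarity
L_s = np.eye(S.shape[0]) - S             # Laplacian as used in the model: L_s = I - S
eigvals = np.linalg.eigvalsh((L_s + L_s.T) / 2)
print(np.sum(np.isclose(eigvals, 0.0)))  # prints 3: multiplicity of 0 = number of blocks
```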

Discussion.
Here, we discuss the importance of the matrices P and Q, which are constrained to be column orthogonal through P^T P = I and Q^T Q = I, respectively, in our model. In equation (2), the low-rank matrix U is learned from Y such that it captures the common structure between X and Y. This is possible because we let Y differ from U through P^T and not by itself. Then, Q is used in equation (3) to adaptively match the common structure inside X (in its original d1 dimensions) to U. For this, we suppose that the relative error of the term ‖U − XQ‖_F^2 should be very small for U to capture the common structure in X and Y correctly. Therefore, the proposed method can guarantee more discriminative ability in practice, since U is learned adaptively using the nuclear norm: with this norm, we can remove the effect of noise through soft thresholding [46], which sets small singular values to zero to suppress the noisy data.
This approach differs from existing ones, where the data samples themselves are used as dictionaries to learn a low-dimensional representation for each of X and Y before a common space can be found.

Optimization.
We propose an optimization method built on augmented Lagrange multipliers [47] to solve equation (4) by iteratively updating all variables. First, we denote the i-th smallest eigenvalue of L_s as θ_i(L_s) to make equation (4) easier to solve. Then, referring to Proposition 1 of [45], θ_i(L_s) ≥ 0 because L_s is a positive semidefinite matrix. Therefore, given a large value of λ3, equation (4) is equivalent to equation (5), in which the rank constraint is replaced by the penalty λ3 Σ_{i=1}^{c} θ_i(L_s) added to the objective. When λ3 is large enough, Σ_{i=1}^{c} θ_i(L_s) = 0, so that the constraint rank(L_s) = n − c is satisfied. According to Ky Fan's theorem [48], minimizing Σ_{i=1}^{c} θ_i(L_s) is equivalent to minimizing Tr(F^T L_s F) subject to F^T F = I, because an optimal F contains the eigenvectors corresponding to the c smallest eigenvalues of L_s. For a better understanding of Ky Fan's theorem, Zhan et al. [26] provided a simplified version as follows.
Theorem 2. Let the eigenvalues of L_s be ordered as 0 ≤ θ_1(L_s) ≤ θ_2(L_s) ≤ · · · ≤ θ_n(L_s), with corresponding eigenvectors f_1, f_2, . . . , f_n. Then Σ_{i=1}^{c} θ_i(L_s) = min_{F ∈ R^{n×c}, F^T F = I} Tr(F^T L_s F), and the minimum is attained when the columns of F are the eigenvectors associated with the c smallest eigenvalues.

Thus, we rewrite equation (5) by replacing λ3 Σ_{i=1}^{c} θ_i(L_s) with λ3 Tr(F^T L_s F) under the constraint F^T F = I, which gives equation (6). We also introduce an intermediate variable J = U to make equation (6) easier to solve, which yields equation (7). The augmented Lagrangian of equation (7) is then formed as equation (8), where M_1 is the Lagrange multiplier. Hence, we divide equation (8) into several subproblems and update each one while fixing the others, in the following order.
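Ky Fan's theorem, which justifies replacing Σ_{i=1}^{c} θ_i(L_s) with min_{F^T F = I} Tr(F^T L_s F), can also be checked numerically. The toy Laplacian below is illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 8))
W = (A + A.T) / 2                          # symmetric, nonnegative affinity
L = np.diag(W.sum(axis=1)) - W             # graph Laplacian, positive semidefinite
c = 3
eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
F = eigvecs[:, :c]                         # eigenvectors of the c smallest eigenvalues
# Sum of the c smallest eigenvalues equals the minimal trace value.
print(np.isclose(eigvals[:c].sum(), np.trace(F.T @ L @ F)))  # True
```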

J Subproblem.
We solve equation (9) using the SVD [46]: denoting the SVD of the associated matrix M as T H V^T and applying the soft-thresholding (shrinkage) operator S_μ[·] to the singular values in H, we obtain the closed-form update for J in equation (10).

P Subproblem.

The P subproblem in equation (11) is an orthogonal Procrustes problem, which is difficult to solve directly because the feasible set of matrices P satisfying P^T P = I is not convex. Fortunately, the SVD [46] provides a unique solution to this problem. Therefore, we solve for P from the SVD of U^T Y, as given in equation (12).
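For reference, the closed-form orthogonal Procrustes step behind the P update can be sketched as follows; the helper name and toy shapes are our own, and the exact matrix factored by the SVD is an assumption consistent with the text's svd(U^T Y) (we factor its transpose Y^T U, which carries the same information).

```python
import numpy as np

def update_P(Y, U):
    """Closed-form orthogonal Procrustes step:
    argmax_P Tr(P^T Y^T U)  s.t.  P^T P = I,
    obtained from the SVD of Y^T U."""
    A, _, Bt = np.linalg.svd(Y.T @ U, full_matrices=False)
    return A @ Bt

# Illustrative shapes only: n samples, d2 features (d2 >= n so that P^T P = I is feasible).
n, d2 = 20, 30
rng = np.random.default_rng(2)
Y, U = rng.standard_normal((n, d2)), rng.standard_normal((n, n))
P = update_P(Y, U)
print(np.allclose(P.T @ P, np.eye(n)))  # column-orthogonal, as required
```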

Q Subproblem.
The Q subproblem can be solved in the same way as the P subproblem.
S Subproblem.

Since F in equation (16) contains the eigenvectors corresponding to the c smallest eigenvalues of L_s, we rewrite equation (16) accordingly. Then, denoting λ3‖f_i − f_j‖_2^2 − λ2 u_ij as p_ij, equation (17) can be rewritten as a minimization over each column s_j. Similar to [26], the optimal s_j is given in closed form by equation (19), a Euclidean projection onto the simplex.
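The Euclidean projection onto the simplex used for each column s_j admits a standard O(n log n) routine; the sketch below follows the usual sorting-based algorithm, and the vector fed into the projection (built from p_j) is our reading of the text rather than the paper's exact formula.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1} (sorting-based method)."""
    u = np.sort(v)[::-1]                    # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# Illustrative column update: s_j is the projection of a vector derived from p_j.
rng = np.random.default_rng(3)
p_j = rng.standard_normal(10)
s_j = project_simplex(-p_j / 2.0)           # assumed form of the target vector
print(s_j.min() >= 0, np.isclose(s_j.sum(), 1.0))  # (True, True)
```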

Complexity Analysis.
The computational complexity of our proposed method, illustrated in Algorithm 1, is determined mainly by the nuclear norm calculation in equation (10), the matrix inverse and multiplication in equation (13), and the Euclidean projection onto the simplex in equation (19). The cost of equation (8) is O(n^3), while the inverse of an n × n matrix in equation (15) consumes O(n^3). The time complexity of matrix multiplication is O(n^3); since there are several multiplications in equation (15), the overall multiplication cost is (k + 1)O(n^3), which is not negligible when the number of data samples is large. Moreover, the Euclidean projection onto the simplex in equation (19) takes O(n).
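For orientation, the overall inexact-ALM loop can be sketched as below. The update functions are placeholders for the closed-form steps of equations (10), (12), (13), (15), and (19); their signatures, the stopping rule, and the residual used for the multiplier update are our own assumptions, not the paper's implementation.

```python
import numpy as np

def lrcr_alm(X, Y, c, solvers, lam3=1.0, rho=1.01, mu=1e-2, mu_max=1e6, max_iter=150):
    """Sketch of the alternating loop summarized in Algorithm 1.

    `solvers` is assumed to bundle the closed-form updates for J, P, Q, U, S
    (equations (10), (12), (13), (15), (19)); they are not reproduced here.
    """
    n = X.shape[0]
    M1 = np.zeros((n, n))                 # Lagrange multiplier for the constraint J = U
    U = S = np.eye(n)                     # the paper initializes these from a c-NN graph
    for _ in range(max_iter):
        # F holds the c eigenvectors of L_s associated with its c smallest eigenvalues.
        Ls = np.eye(n) - (S + S.T) / 2
        _, vecs = np.linalg.eigh(Ls)
        F = vecs[:, :c]

        J = solvers.update_J(U, M1, mu)
        P = solvers.update_P(Y, U)
        Q = solvers.update_Q(X, U)
        U = solvers.update_U(X, Y, J, P, Q, S, M1, mu)
        S = solvers.update_S(U, F, lam3)

        M1 = M1 + mu * (U - J)            # multiplier update (assumed residual U - J)
        mu = min(rho * mu, mu_max)        # penalty update, as in Algorithm 1
        # lam3 is adjusted heuristically elsewhere from the connected-component count.
    return S, U
```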

Datasets.
We perform experiments on the ORL, Yale, Coil-20, Caltech 101-20, and UCI digits datasets to demonstrate the superiority of our method on face image clustering, object image clustering, and handwritten digit clustering, respectively. For each dataset, we select two types of features to represent X and Y. Specifically, we extract LBP and Gabor features for ORL, Yale, Coil-20, and Caltech 101-20, and we select Fourier coefficients of the character shapes (FOU) and profile correlations (FAC) as features for UCI digits. A detailed description of each dataset follows, while Table 1 and Figure 1 provide a summary and sample images of the datasets, respectively.

Experimental Setting.
We fine-tuned the parameters for each compared method in strict compliance with the experimental settings in the respective literature. For our proposed method, two parameters, λ1 and λ2, need tuning. Therefore, we use a grid search to find the best λ1 and λ2 from [0.001, 0.01, 0.1, 1, 10, 100, 1000]. We then run all experiments ten times and report the mean performance to guarantee fairness.
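A minimal sketch of this grid search, assuming hypothetical `run_method` and `clustering_accuracy` helpers that stand in for the full experimental pipeline:

```python
import itertools
import numpy as np

GRID = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

def tune(X, Y, labels, run_method, clustering_accuracy, repeats=10):
    """Pick (lam1, lam2) maximizing the mean ACC over repeated runs."""
    best_params, best_score = None, -np.inf
    for lam1, lam2 in itertools.product(GRID, GRID):
        accs = [clustering_accuracy(labels, run_method(X, Y, lam1, lam2))
                for _ in range(repeats)]
        if np.mean(accs) > best_score:
            best_params, best_score = (lam1, lam2), np.mean(accs)
    return best_params, best_score
```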

Evaluation Metrics.
To evaluate our method, we use six standard evaluation metrics, i.e., accuracy (ACC), normalized mutual information (NMI), adjusted rand index (AR), F-score, precision, and recall. These metrics capture different aspects of the performance and thereby demonstrate the superiority of our method over the compared state-of-the-art methods. For example, ACC measures the percentage of correctly clustered data samples in the learned clustering structure compared with the ground-truth labels, whereas NMI is an information-theoretic measure that relies on the amount of statistical information shared by random variables.
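For completeness, ACC is usually computed after matching predicted clusters to ground-truth classes with the Hungarian algorithm, while NMI is available from scikit-learn directly; the utility below is our own illustration, not the paper's evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping of predicted clusters to true classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    d = max(y_true.max(), y_pred.max()) + 1
    agree = np.zeros((d, d), dtype=int)
    for t, p in zip(y_true, y_pred):
        agree[p, t] += 1
    row, col = linear_sum_assignment(-agree)   # maximize total agreement
    return agree[row, col].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                     # same partition, permuted labels
print(clustering_accuracy(y_true, y_pred))          # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
```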

Image Clustering.
In this section, we present the image clustering performance results on all five benchmark datasets using the six evaluation metrics described earlier.
Input: training datasets X, Y; number of clusters c
Initialize: M_1 = 0; U and S are initialized from the c-nearest-neighbor graph; F is formed by the c eigenvectors of L_s corresponding to the c smallest eigenvalues; ρ = 1.01
While not converged do
  Update J while fixing the others by equation (10)
  Update P while fixing the others by equation (12)
  Update Q while fixing the others by equation (13)
  Update U while fixing the others by equation (15)
  Update S while fixing the others by equation (19)
  Update the multipliers
  Update μ by μ = min(ρμ, μ_max)
End while
Output: S, U

ALGORITHM 1: Algorithm for our proposed method.

Table 2 displays the performance results on the ORL and Yale datasets. Specifically, on the ORL dataset, our method outperforms the seven compared state-of-the-art methods with 81.39%, 91.32%, 71.13%, and 83.10% in ACC, NMI, precision, and recall, respectively. Surprisingly, DiMSC outperforms the more recent methods, such as MCGC, MVGL, and MLRSSC, with 78.87% in ACC and 91.19% in NMI. This may result from the approach DiMSC uses to learn the common representation matrix: it introduces a diversity regularizer to obtain a representation matrix for each view, which enhances the diversity of the different views. Besides, one can observe that GMC and SM2SC, which are more recent methods, perform better than the other compared methods on most evaluation metrics. In particular, SM2SC has the best AR of 78.36%, which is just 1% higher than that obtained by our proposed method. Nonetheless, our proposed method has the best performance on the Yale dataset on all evaluation metrics, with 74.48% in ACC, 71.96% in NMI, 50.11% in AR, 51.84% in F-score, 50.42% in precision, and 59.18% in recall.

Table 3 illustrates the clustering performance on the Coil-20 and Caltech 101-20 object datasets. Our method has the best performance on all evaluation metrics, except on the Coil-20 dataset, where SM2SC has the best ACC by a small margin of 0.29%. Table 4 shows that our proposed method outperforms the other state-of-the-art methods on five evaluation metrics, including NMI and AR, where it achieves 86.76% and 78.42%, respectively.
Overall, our method outperforms GMC, SM2SC, and MCGC only slightly, especially on the object datasets. This is not surprising because, for instance, SM2SC employs a block diagonal regularization to enforce a proper common structure. Similarly, GMC and MCGC utilize a rank constraint to find a unified clustering structure that balances the agreement between the different views, likewise avoiding k-means spectral postprocessing of the low-dimensional embedding matrix. Notwithstanding, our proposed method outperforms the remaining state-of-the-art methods by a wide margin on all evaluation metrics. Furthermore, we provide more intuition into our approach by visualizing the clustering structure learned on three datasets in Figure 2. We can clearly see that our proposed method yields a block diagonal clustering structure on all three datasets with exactly c connected components. Therefore, we conclude that our common low-rank learning approach is very efficient.

Image Recognition with 0%, 10%, and 20% Levels of Corruption.
In this section, we present the experimental results obtained for image recognition with respect to the accuracy metric. In this experiment, we study the robustness of each algorithm against noise corruption by gradually injecting 10% and 20% noise into the five benchmark datasets. We keep the same parameter settings described in Section 4.2, and a K-nearest-neighbor (KNN) classifier is applied to evaluate the classification accuracy of each algorithm. It is easily noticeable from Tables 5-9 that the performance of all algorithms degrades as the noise level increases. Specifically, our proposed method degrades by only about 3-5% on all datasets at a 10% level of noise corruption. However, all algorithms suffer a significant reduction in performance at a 20% level of noise corruption. Still, we can observe that our proposed method is generally more robust against noise than the other compared methods.

Figure 3 demonstrates the clustering performance of our method with respect to NMI and ACC when varying λ1 and λ2 on the Yale, Coil-20, and UCI digits datasets. First, we note that λ3 does not need tuning like the other parameters because we find its best value heuristically, as in [26], to accelerate the process. That is, we initialize λ3 = 1 and automatically set λ3 = λ3 * 2 or λ3 = λ3/2 in each iteration when the number of connected components is smaller or larger than c, respectively. As a result, we can see from Figure 3 that our method is more sensitive to λ1 on both NMI and ACC because λ1 controls the learning of the common low-rank matrix U. Hence, our proposed method is certain to have a good performance once a suitable value for λ1 is found. On the other hand, our method is only slightly sensitive to λ2, which means that λ2 is not as important as λ1 to our model. Still, these results show that our proposed method is relatively stable regardless of the dataset.
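The heuristic schedule for λ3 described above can be sketched as a small routine that counts the connected components of the current S and doubles or halves λ3 accordingly; the thresholding of S into a graph is our own assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def adjust_lam3(S, c, lam3, tol=1e-8):
    """Double lam3 if S has too few connected components, halve it if too many."""
    graph = csr_matrix((np.abs(S) > tol).astype(int))
    n_comp, _ = connected_components(graph, directed=False)
    if n_comp < c:
        return lam3 * 2.0
    if n_comp > c:
        return lam3 / 2.0
    return lam3
```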

Convergence Analysis.
Although the convergence of the inexact augmented Lagrange multiplier (ALM) method with more than three subproblems is still difficult to prove theoretically [29], we compute the relative error of the term λ2‖S − U‖_F^2 to demonstrate the convergence behavior of our proposed method. To illustrate this, Figure 4 shows the errors over the iterations. It is easy to see that, on all three benchmark datasets, our method converges within 150 iterations.

Conclusion
We proposed a novel method based on LRR, which learns a common low-rank representation matrix shared by two multi-instance datasets to improve clustering accuracy. Specifically, our proposed method obtains this common low-rank representation matrix using only one instance of the data so that a second instance of the data can linearly reconstruct it. Then, utilizing this common low-rank representation, our method obtains an ideal similarity matrix, which explicitly reveals the clustering structure without any spectral postprocessing. This approach differs from existing state-of-the-art methods, where a projection matrix is learned for each instance of the data to obtain a common structure. Extensive experiments on five benchmark datasets demonstrate our method's superiority over seven compared state-of-the-art methods on all six evaluation metrics. In future work, we plan to extend our ideas to deep multiview learning.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.