Exact k-Component Graph Learning for Image Clustering

The performance of graph-based clustering methods depends heavily on the quality of the data affinity graph, as a good affinity graph approximates well the pairwise similarity between data samples. To a large extent, existing graph-based clustering methods construct the affinity graph based on a fixed distance metric, which is often not an accurate representation of the underlying data structure. They also require postprocessing on the affinity graph to obtain clustering results. Thus, the results are sensitive to the particular graph construction method. To address these two drawbacks, we propose a k-component graph clustering (k-GC) approach to learn an intrinsic affinity graph and to obtain clustering results simultaneously. Specifically, k-GC learns the data affinity graph by assigning adaptive and optimal neighbors to each data point based on the local distances. Efficient iterative updating algorithms are derived for k-GC, along with proofs of convergence. Experiments on several benchmark datasets demonstrate the effectiveness of k-GC.


Introduction
Clustering is one of the most fundamental topics in computer vision and pattern recognition. The objective of clustering is to discover the data structure and partition a group of data points into several clusters, where the similarity of data points within the same cluster is greater than the similarity between points from different clusters [1][2][3][4][5][6][7][8].
The structure of data is usually characterized by the affinity matrix of a graph whose edges denote the similarities between data points. If the vertices belonging to each cluster form a connected component, i.e., there is no edge connecting different clusters, then the cut value is zero in graph-theoretic terms. Our purpose is to learn a graph with exactly k components so that the vertices in each connected component of the graph are partitioned into one cluster.
A cluster indicator matrix Z can be constructed from the data labels: $z_{ic} = 1$ if a data point $x_i$ is assigned to the c-th cluster, and $z_{ic} = 0$ otherwise. Since $ZZ^\top$ is a strictly block diagonal matrix in the ideal case, many clustering methods are designed to obtain a block diagonal affinity matrix, or they use it as an important prior [9][10][11][12][13]. We find, however, that a good graph structure yields a good cluster indicator matrix even if the affinity matrix is not strictly block diagonal. For example, in Figure 1, we can obtain an ideal cluster indicator matrix Z from the graph itself because each connected component corresponds to exactly one cluster, but the affinity matrix W (see Figure 1) constructed from the graph is not strictly block diagonal, i.e., some on-block diagonal elements are zeros. If the graph has k connected components, we can directly obtain a good cluster indicator matrix from the graph itself, even when the edges are relatively sparse within each connected component. This means that the graph structure, rather than a strictly block diagonal affinity matrix, is the intrinsic quality needed for a good clustering result.
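The relation between labels, the indicator matrix Z, and the block structure of $ZZ^\top$ can be illustrated with a minimal sketch. The label vector and the helper names `indicator_matrix` and `gram` are hypothetical and chosen only for illustration.

```python
def indicator_matrix(labels, k):
    """Return Z as a list of rows with Z[i][c] = 1 if point i is in cluster c."""
    return [[1 if labels[i] == c else 0 for c in range(k)]
            for i in range(len(labels))]


def gram(Z):
    """Return Z Z^T; entry (i, j) is 1 exactly when i and j share a cluster."""
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]


# Hypothetical labels: three points in cluster 0, two points in cluster 1.
labels = [0, 0, 0, 1, 1]
Z = indicator_matrix(labels, k=2)
B = gram(Z)  # block diagonal when points are ordered by cluster
```

With points ordered by cluster, `B` has a 3-by-3 block and a 2-by-2 block of ones on its diagonal and zeros elsewhere, which is the ideal block diagonal pattern referred to above.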
We propose a novel graph-based clustering approach that exploits k connected components for clustering, called k-component graph clustering (k-GC). Figure 1 shows a schematic of the k-component graph learning. k-GC aims to learn a graph whose edges are tuned adaptively until the graph has exactly k connected components [14]. To evaluate the effectiveness of k-GC, we have conducted experiments on six benchmark datasets in comparison to state-of-the-art approaches. Experimental results demonstrate that k-GC consistently performs better than the other approaches. k-GC makes the following contributions:
(1) k-GC learns a graph with exactly k connected components. Since the vertices in each connected component belong to one cluster, labels are obtained directly from the learned graph itself.
(2) The clustering indicators are obtained from the learned graph itself without a postprocessing k-means clustering step.
(3) k-GC can be used as an alternative to spectral clustering (SC). Similar to SC, k-GC only needs an initial affinity graph as input, without involving the raw data.

Related Work
SC, which exploits the eigenstructure of a data affinity graph to partition data into different groups, has become one of the most fundamental clustering approaches. Standard SC uses the radial basis function to construct the affinity matrix [15], and its performance relies heavily on the eigenstructure of the affinity matrix [15][16][17][18][19][20].
In SC, the similarity between pairwise data points $x_i$ and $x_j$ is first computed by the radial basis function to construct the affinity matrix W. Then the k eigenvectors of the normalized Laplacian matrix L corresponding to the k smallest eigenvalues, $H = [h_1, h_2, \ldots, h_k] \in \mathbb{R}^{n \times k}$, are regarded as the low-dimensional embedding of the raw data X, and k-means clustering is finally performed on H to obtain the labels.
However, due to the ambiguity and uncertainty inherent in data structure, the intrinsic affinity matrix cannot be determined by a unified function. Since most existing SC methods use a predefined affinity matrix of the graph, the cut value is usually minimized but not zeroed, e.g., ratio cut [21], normalized cut [16], and min-max cut [22], which requires a postprocessing k-means clustering step to obtain the clustering labels [15,17,18].
Many approaches have been proposed to improve the performance of SC. Generally, they can be categorized into three paradigms:
(1) How to improve data clustering using a predefined affinity matrix [12,15,23]
(2) How to construct a better affinity matrix to obtain a better result than standard SC [17,24,25]
(3) How to learn the affinity matrix and clustering structure simultaneously [26][27][28][29]
This paper is related to the third paradigm. In the third paradigm, the objective functions of [26,28,29] usually involve the raw data, whereas k-GC does not. All of these methods employ a rank constraint on the Laplacian matrix [26][27][28][29], and clustering with adaptive neighbors (CAN) [26] and constrained Laplacian rank (CLR) [27] are the two methods most closely related to k-GC. CAN learns the graph from raw data. In contrast, CLR learns the graph by minimizing the difference between the input initial graph and the learned graph.
In k-GC, the raw data are not involved in the objective function, and k-GC is directly modified from standard SC to tune the structure of graph so that an intrinsic graph with exact k-connected components is obtained.

k-Component Graph Clustering
Suppose that the data matrix is denoted by $X = [X_1, X_2, \ldots, X_k] = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $X_c$ denotes the data matrix belonging to the c-th cluster, k denotes the cluster number, $x_i$ denotes the i-th data point, n is the number of data points, and d denotes the dimension.

Preprocessing.
We propose a simple neighborhood-preserving method to refine the raw data in this section. Assume that a data point $x_i$ and its nearest data point $y_i$ are both generated from the same point, since they may be disturbed by noise. Then, we replace $x_i$ by a weighted linear combination of $x_i$ and $y_i$ (equation (1)), where the weight $r_i$ is defined by equation (2). Since a data point is generally closer to its nearest neighbor in the same cluster than to one in a different cluster, it is straightforward to check from equation (2) that $r_i$ tends to 1 if $x_i$ and $y_i$ belong to the same cluster and $r_i$ tends to 0 if $x_i$ and $y_i$ belong to different clusters. This implies that two data points in the same cluster become closer during the iteration, while two data points in different clusters have very little or no influence on each other after the preprocessing. As equation (1) is iterated, data points in the same cluster become closer and closer, which makes the clustering easier to carry out.
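The following is a minimal sketch of one pass of such a nearest-neighbor refinement. The exact forms of equations (1) and (2) are not reproduced in this text, so both formulas in the code are assumptions: a Gaussian weight $r_i = \exp(-\|x_i - y_i\|^2 / \sigma^2)$ and the normalized combination $(x_i + r_i y_i)/(1 + r_i)$. They are chosen only because they match the described behaviour ($r_i$ near 1 for close pairs, near 0 for distant pairs). The function name `refine_once` is hypothetical.

```python
import math


def refine_once(points, sigma):
    """One pass: pull each point toward its nearest neighbor (assumed rule)."""
    refined = []
    for i, x in enumerate(points):
        # Nearest other point y_i under squared Euclidean distance.
        d2, y = min(
            (sum((a - b) ** 2 for a, b in zip(x, p)), p)
            for j, p in enumerate(points) if j != i
        )
        # Assumed weight: close to 1 for near pairs, close to 0 for far pairs.
        r = math.exp(-d2 / (sigma ** 2))
        # Assumed update: normalized weighted combination of x_i and y_i.
        refined.append(tuple((a + r * b) / (1.0 + r) for a, b in zip(x, y)))
    return refined


# Hypothetical data: two tight pairs that are far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]
refined = refine_once(points, sigma=1.0)
```

In this toy example the two points of each pair move toward each other, while the gap between the pairs stays close to its original value.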

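k-Component Graph.
An undirected finite graph is denoted by G, and its vertex set is denoted by $V = \{v_1, v_2, \ldots, v_n\}$. We consider a weighted graph and assign a non-negative real weight $w_{ij}$ to each pair of vertices $v_i$ and $v_j$. The degree of vertex $v_j$ is $d_j = \sum_{i=1}^{n} w_{ij}$, and the Laplacian matrix of the graph is $L = D - W$, where D is the diagonal matrix whose (j, j)-element is $d_j$.
For any vector $f \in \mathbb{R}^n$, we have

$$f^\top L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2 \ge 0. \qquad (3)$$

Hence L is positive semidefinite, the smallest eigenvalue of L is equal to zero, and the corresponding eigenvector is the all-ones vector 1.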
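For completeness, the intermediate steps behind the quadratic form of the Laplacian can be written out as follows. This is the standard derivation, assuming a symmetric W with degrees $d_i = \sum_j w_{ij}$ and $L = D - W$.

```latex
\begin{aligned}
f^\top L f
  &= f^\top D f - f^\top W f
   = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} w_{ij} f_i f_j \\
  &= \frac{1}{2}\Bigl( \sum_{i=1}^{n} d_i f_i^2
       - 2 \sum_{i,j=1}^{n} w_{ij} f_i f_j
       + \sum_{j=1}^{n} d_j f_j^2 \Bigr) \\
  &= \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \,(f_i - f_j)^2 \;\ge\; 0 .
\end{aligned}
```

Setting $f = \mathbf{1}$ gives $L\mathbf{1} = D\mathbf{1} - W\mathbf{1} = 0$, which is why zero is always an eigenvalue with the all-ones eigenvector.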

Theorem 1.
The number of connected components k is equal to the multiplicity of 0 as an eigenvalue of L.
Proof. Suppose that there are k connected components in graph G, that the c-th subgraph is denoted by $G_c$, and that the corresponding vertex set of $G_c$ is denoted by $V_c$. Let h be an eigenvector of L with eigenvalue 0, so that $h^\top L h = \frac{1}{2}\sum_{i,j} w_{ij}(h_i - h_j)^2 = 0$. In the c-th connected component $G_c$, every edge has a positive weight $w_{ij} > 0$, so the corresponding term $(h_i - h_j)^2$ must be zero, i.e., $h_i = h_j$ for every pair of adjacent vertices. Because $G_c$ is connected, h takes the same constant value on the whole component, i.e., $h_i = \mu_c$ for all $v_i \in V_c$. The eigenvectors associated with different components are linearly independent, so the multiplicity of 0 as an eigenvalue is the number of components of the graph G, and vice versa. □
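The statement of Theorem 1 can be checked numerically on a small example. The sketch below uses a hypothetical five-vertex graph with two components and verifies that the indicator vector of each component is mapped to zero by L, while a vector that is not constant on a component is not. The helper names `laplacian` and `matvec` are illustrative only.

```python
def laplacian(W):
    """L = D - W for a symmetric affinity matrix W given as a list of rows."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def matvec(A, v):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


# Hypothetical graph: vertices {0, 1, 2} form one component, {3, 4} another.
W = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
L = laplacian(W)
r1 = matvec(L, [1, 1, 1, 0, 0])     # indicator of the first component
r2 = matvec(L, [0, 0, 0, 1, 1])     # indicator of the second component
mixed = matvec(L, [1, 0, 0, 0, 0])  # not constant on a component
```

Both `r1` and `r2` are zero vectors, so the two indicators are independent eigenvectors for eigenvalue 0, whereas `mixed` is nonzero.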
Similar proofs of Theorem 1 can be found in previous works [13,18,31]. We explore Theorem 1 in the context of a specific example. In Figure 1, we show an intrinsic graph G with two connected components, and an intrinsic affinity matrix W of the graph G that is consistent with Theorem 1 is given by equation (4), where we constrain the degree $d_j = \sum_{i=1}^{n} w_{ij} = 1$, i.e., the sum of each column of W is equal to one.
The Laplacian matrix L of equation (4) has an eigenvalue zero with multiplicity k = 2, and we have the two corresponding eigenvectors $h_1$ and $h_2$. They are concatenated as $H = [h_1, h_2]$, whose transpose is given by equation (5). If the graph G in Figure 1 corresponds to the affinity matrix W in equation (4), i.e., the Laplacian matrix L of equation (4) satisfies Theorem 1, then the clustering labels are easy to obtain without further graph-cut or k-means clustering algorithms. Since the vertices in each connected component of the graph are partitioned into one cluster, the clustering labels can be obtained easily by applying a strongly connected component algorithm [32] to the graph G.
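Reading labels directly off the graph can be sketched as follows. The source cites Tarjan's strongly connected component algorithm [32]. The sketch below swaps in a simpler breadth-first search on the symmetrized graph, which yields the same components for an undirected graph. The function name `component_labels` and the example matrix are hypothetical.

```python
from collections import deque


def component_labels(W, tol=0.0):
    """Label vertices by connected component of the symmetrized graph of W."""
    n = len(W)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                # An edge exists if either direction carries positive weight.
                if labels[j] == -1 and (W[i][j] > tol or W[j][i] > tol):
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels, current


# Hypothetical sparse graph: a chain 0-1-2 and a separate edge 3-4.
W = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
labels, n_components = component_labels(W)
```

Vertices 0 and 2 are not directly connected, yet they receive the same label because both are linked to vertex 1. This is the sense in which a sparse, non-block-diagonal affinity matrix can still give ideal cluster indicators.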
Since it is a positive semidefinite matrix, L has n non-negative eigenvalues $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Theorem 1 indicates that if $\sum_{c=1}^{k} \lambda_c = 0$, then the graph G has k connected components and the vertices are already partitioned into k clusters [18,30,31,33]. Then, according to Fan's theorem [14,34],

$$\sum_{c=1}^{k} \lambda_c = \min_{H \in \mathbb{R}^{n \times k},\; H^\top H = I} \operatorname{Tr}(H^\top L H), \qquad (6)$$

where Tr(·) denotes the trace operator, $I \in \mathbb{R}^{k \times k}$ is an identity matrix, $L = D - (W^\top + W)/2$ is the Laplacian matrix, and D is a diagonal matrix whose diagonal elements are the column sums of $(W^\top + W)/2$. The proof of Fan's theorem can be found in [13,35,36]. The objective function value of equation (6) does not generally tend to zero, because the structure of graph G varies with the graph construction method. In the following section, we propose a method to tune the graph structure of G adaptively so that the objective function value of equation (6) tends to zero.

k-GC Algorithm.
In this section, we explore how to learn a graph W with exactly k connected components [14]. The right-hand side of equation (6) can be solved with respect to H by the eigenvectors of L, but it has a trivial solution with respect to W, i.e., only one $w_{ij}$ is assigned a nonzero value and the others are zeroed in each column of W. We add an ℓ2-norm regularization to smooth the weights in W. Then, we optimize W and H simultaneously as in equation (7), where β is the trade-off parameter.
If the affinity matrix is consistent with Theorem 1, the procedure of obtaining clustering labels is named k-GC. In other words, k-GC guarantees that the objective function of standard SC is zeroed by tuning the graph structure. We divide problem (7) into two subproblems and solve them alternately. The first subproblem is to fix W and update H. Then, equation (7) becomes equation (8), which is solved by the k eigenvectors of L corresponding to the k smallest eigenvalues. The second subproblem is to fix H and update W, which gives equation (9), where $h^{(i)}$ denotes the i-th column of $H^\top$. Each column of W is independent, so solving equation (9) is equivalent to optimizing equation (10) column by column, where the linear coefficient of $w_{ij}$ is denoted by $g_i$ and $h^{(j)}$ is a fixed vector when we solve for the j-th column $w_j$. Solving equation (10) is in turn equivalent to optimizing equation (11), where $g = [g_1, g_2, \ldots, g_n]^\top$ is a constant vector. Equation (11) is a Euclidean projection problem onto the simplex, and there are several algorithms for solving it [37][38][39]. According to the Karush-Kuhn-Tucker conditions [37], it can be verified that the optimal solution $w_j^*$ is given by equation (12). Equation (10) will be analysed specifically in Section 3.4, where we show how to obtain exactly k connected components.
We alternately optimize equations (9) and (8) until the sum of the k smallest eigenvalues of L becomes zero. The algorithm for solving equation (7) is summarized in Algorithm 1.
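The column update described above is a Euclidean projection onto the simplex, which can be sketched with the well-known sort-based procedure. The helper `update_column` additionally assumes that the projected point is $-g/(2\beta)$, i.e., that the per-column objective has the form $\sum_i g_i w_{ij} + \beta w_{ij}^2$. This scaling is an assumption, since equations (10) to (12) are not reproduced in this text. Both function names are hypothetical.

```python
def project_to_simplex(v):
    """Return argmin_w ||w - v||^2 subject to w >= 0 and sum(w) = 1."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for idx, value in enumerate(u, start=1):
        cumulative += value
        t = (cumulative - 1.0) / idx
        if value - t > 0:
            theta = t  # keep the threshold of the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def update_column(g, beta):
    """Assumed column update: project -g / (2 * beta) onto the simplex."""
    return project_to_simplex([-gi / (2.0 * beta) for gi in g])


# Hypothetical coefficients for one column; smaller g_i means a closer vertex.
w = update_column([0.0, 1.0, 4.0], beta=1.0)
```

For these values the update returns `[0.75, 0.25, 0.0]`: the closest vertex gets the largest weight, the farthest is cut off, and the weights sum to one. A larger `beta` spreads the weight over more vertices.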

k-Component Graph Learning.
We find a trade-off between two graph structures: in the first case, one vertex is connected with only one other vertex in the vertex set V, and in the second case, all vertices are connected with each other by the same weight 1/n. The trade-off renders the objective function of equation (6) close to zero.
We optimize each column of W independently, and one column of W is denoted by $w_j$. The values in the vector $w_j$ are zero or positive: $w_{ij} = 0$ means that $v_i$ has no edge with $v_j$, and $w_{ij} > 0$ means that $v_i$ and $v_j$ are connected by an edge. The first case is to optimize equation (13), which returns the minimum value $g_i = \min(g)$, i.e., $v_j$ is connected only to the corresponding $v_i$. The second case is equation (14), which returns $w_{ij}^* = 1/n$, ∀j. Thus, an adaptive graph learning objective function is given by equation (15), where $\beta_j$ is the trade-off parameter. If $\beta_j = 0$, then optimizing equation (15) is equivalent to solving equation (13); if $\beta_j$ tends to positive infinity, then equation (15) becomes equation (14). If we want to learn a sparse graph, we can set $\beta_j$ to a small value; if we want to learn a graph with more edges, we can set $\beta_j$ to a relatively large value.
Solving equation (15) is equivalent to optimizing problem (16). According to [40], the optimal affinities $w_{ij}^*$ are given by equation (17). It is straightforward to check from equation (17) that the number of neighbors m is also determined by the setting of $\beta_j$. In practice, the structure of the graph G can be tuned coarsely by m and finely by the trade-off parameter $\beta_j$. The parameter m has an explicit meaning, while $\beta_j$ has an implicit relation with the structure of the graph. We tune both of them to obtain an intrinsic affinity matrix.
As in standard SC, any affinity matrix can be used as the input of k-GC. There is one parameter β in equation (7). In Algorithm 1, we use equation (9) to obtain the initial graph by replacing H with $X^\top$. For each $w_j$, $j \in \{1, 2, \ldots, n\}$, we have a different $\beta_j$, so before the iteration, we set $\beta = \sum_{j=1}^{n} \beta_j$ to preserve the initial structure. The initial graph structure is mainly determined by m.
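The link between the number of neighbors m and the trade-off parameter can be sketched with the closed form commonly used in adaptive-neighbor graph learning. Whether equation (17) has exactly this form is an assumption, as is the $\beta \|w_j\|^2$ scaling behind the returned `beta`. The function name `adaptive_neighbor_weights` and its `skip` argument (used to exclude the vertex itself) are hypothetical.

```python
def adaptive_neighbor_weights(g, m, skip=None):
    """Give the m smallest entries of g positive weights that sum to one."""
    order = sorted((i for i in range(len(g)) if i != skip), key=lambda i: g[i])
    if m >= len(order):
        raise ValueError("m must be smaller than the number of candidates")
    nearest = order[:m]
    g_next = g[order[m]]  # the (m+1)-th smallest coefficient
    denom = m * g_next - sum(g[i] for i in nearest)
    w = [0.0] * len(g)
    if denom <= 0:
        # Degenerate ties: fall back to uniform weights on the m neighbors.
        for i in nearest:
            w[i] = 1.0 / m
    else:
        for i in nearest:
            w[i] = (g_next - g[i]) / denom
    beta = 0.5 * denom  # implied trade-off value under the assumed scaling
    return w, beta


# Hypothetical coefficients for vertex 0; index 0 is the vertex itself.
w, beta = adaptive_neighbor_weights([0.0, 1.0, 2.0, 4.0, 9.0], m=2, skip=0)
```

Here the two nearest vertices receive weights 0.6 and 0.4, the third nearest lands exactly on the threshold with weight 0, and the implied trade-off value is 2.5. Choosing m therefore fixes a matching per-column trade-off value, which is the coarse tuning described above.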

Convergence Analysis.
Since the second-order derivative of equation (16) with respect to $w_j$ is equal to 1 ≥ 0, equation (16) is a convex problem. Because the Laplacian matrix is positive semidefinite, the objective of equation (8) is a convex function of H, and its constrained minimizer is given by the eigenvectors of L. When W and H are optimized alternately, neither update increases the objective. As a result, the overall objective function value of equation (7) decreases monotonically in each iteration until Algorithm 1 converges.

Computational Complexity Analysis.
The first step in solving equation (7) is to solve equation (10). We need O(n) time to compute $h^{(j)}$ and $O(t_1 n)$ time to solve equation (15), where $t_1$ is the iteration number. This is repeated n times to calculate each $w_j$, ∀j, so the complexity of the first step is $O((t_1 + 1)n^2)$. The second step is an eigendecomposition procedure, and the complexity of the generalized eigenvector problem is $O((n + k)n^2)$. For solving equation (8), we need to calculate the k eigenvectors of L, so its cost is $O(kn^2)$. Thus, the total complexity of solving equation (7) is $t_o$ times the combined cost of these two steps, where $t_o$ is the number of iterations of the two steps.

Experimental Results
In this section, we conduct experiments on six datasets to demonstrate the effectiveness of k-GC in terms of clustering accuracy (ACC) and normalized mutual information (NMI).
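For reference, the two evaluation measures can be sketched in plain Python. The source does not state which label-matching procedure or which NMI normalization is used, so two assumptions are made here: ACC is computed by brute-force search over label permutations (practical only for a small number of clusters), and NMI is normalized by the geometric mean of the two entropies. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(truth, pred):
    """Best-match accuracy over all mappings of predicted to true labels."""
    t_ids = sorted(set(truth))
    p_ids = sorted(set(pred))
    # Pad with None so that surplus predicted labels can stay unmatched.
    targets = t_ids + [None] * max(0, len(p_ids) - len(t_ids))
    counts = Counter(zip(pred, truth))
    best = 0
    for perm in permutations(targets, len(p_ids)):
        hits = sum(counts[(p, t)] for p, t in zip(p_ids, perm) if t is not None)
        best = max(best, hits)
    return best / len(truth)


def normalized_mutual_information(truth, pred):
    """NMI with geometric-mean normalization (an assumed convention)."""
    n = len(truth)
    ct, cp, joint = Counter(truth), Counter(pred), Counter(zip(truth, pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    h_t = -sum(c / n * math.log(c / n) for c in ct.values())
    h_p = -sum(c / n * math.log(c / n) for c in cp.values())
    if h_t == 0 or h_p == 0:
        return 1.0 if h_t == h_p else 0.0
    return mi / math.sqrt(h_t * h_p)


# Hypothetical labelings: the first prediction only renames the clusters.
truth = [0, 0, 1, 1]
acc_same = clustering_accuracy(truth, [1, 1, 0, 0])
nmi_same = normalized_mutual_information(truth, [1, 1, 0, 0])
acc_rand = clustering_accuracy(truth, [0, 1, 0, 1])
nmi_rand = normalized_mutual_information(truth, [0, 1, 0, 1])
```

Both measures are invariant to a renaming of the cluster labels, which is why the first prediction scores 1 on both, while the second, which is independent of the ground truth, scores 0.5 in ACC and 0 in NMI.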

Dataset Description.
The six datasets are as follows:
(1) The two-moon dataset is a randomly generated synthetic dataset with two clusters of data distributed in moon shapes. Each cluster has 100 samples, and the noise percentage is set to 0.20.
(2) The path-based dataset [41] is made up of 300 samples that belong to three clusters.
The first two datasets are two-dimensional synthetic datasets. For the Yale, ORL, COIL-20, and Notting-Hill datasets, we employ the intensity features of the images [43,44]. A small fraction of the samples in these image datasets is shown in Figure 2. The dataset description is summarized in Table 1.

Algorithm 1. Input: a data matrix $X \in \mathbb{R}^{d \times n}$, the number of neighbors m, and the number of clusters k.
We compare k-GC with the following methods:
(1) k-means [45] clusters samples based on the similarity between samples and cluster centroids. There is no parameter to tune in k-means. We run k-means clustering 10 times to evaluate the performance. To reduce the effect of random initialization, each of these runs consists of 30 restarts, and we report the result with the minimum value of the k-means objective function among the restarts.
(2) R-cut [21] finds the first k eigenvectors of the Laplacian matrix of the graph so as to minimize the similarity between different parts of the graph. Specifically, for a two-way partition, the eigenvector of the second smallest eigenvalue of the Laplacian matrix gives the optimal solution of the relaxed problem.
(3) ST-SC [17] constructs a graph at a local scale, and each data point chooses different neighbors. It has a parameter that determines the number of nearest neighbors for graph construction, and we tune the neighbor number and report the best results in terms of the objective value of k-means.
(4) RG-SC [24] generates robust affinity graphs for spectral clustering by identifying and exploiting discriminative features based on an unsupervised clustering random forest. The default parameters are used in our experiments.
(5) CAN [26] learns the data similarity matrix and clustering structure simultaneously and uses a rank constraint to create the clustering structure in the similarity matrix as several disconnected components. The number of nearest neighbors is searched from 2 to 50 with an increment of 2.
(6) CLR [27] imposes a Laplacian rank constraint on the learned graph, which best approximates the input initial affinity graph. In our experiments, the number of nearest neighbors is searched from 2 to 50 with an interval of 2 to obtain the best parameter.
For k-GC, we tune the parameter m in the range [5,40] with an interval of two, and β is fixed by $\beta = \sum_{j=1}^{n} \beta_j$. For preprocessing, we iterate equation (1) ten times with σ = 12. We select the optimal m in terms of the objective value of k-means clustering performed on H. k-GC obtains the clustering results from the learned graph itself because the data points in each connected component belong to one cluster, and the clustering indicators are obtained directly with Tarjan's strongly connected component algorithm [32]. In practice, we can determine β in a heuristic way to accelerate the procedure [26]. The stopping condition that the sum of the k smallest eigenvalues of L is zeroed, i.e., that the first term of equation (7) is zeroed, is stronger than the conventional stopping condition [26], so we adopt the former. Because of this stopping condition, we initialize β to a constant (e.g., $\beta = \sum_{j=1}^{n} \beta_j$), then increase β if the number of connected components of W is larger than k, and decrease β if it is smaller than k during the iteration.
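The heuristic for steering β toward exactly k components can be sketched as a small update rule. The text only states the direction of the change, so the multiplicative step `factor=2.0` and the function name `adjust_beta` are assumptions made for illustration.

```python
def adjust_beta(beta, n_components, k, factor=2.0):
    """Rescale beta toward a graph with exactly k connected components."""
    if n_components > k:
        # Graph too fragmented: a larger beta allows more edges per vertex.
        return beta * factor
    if n_components < k:
        # Graph too connected: a smaller beta makes each column sparser.
        return beta / factor
    return beta


# Hypothetical trace of the rule for k = 3 clusters.
beta = 1.0
beta = adjust_beta(beta, n_components=5, k=3)  # too many components
beta = adjust_beta(beta, n_components=3, k=3)  # target reached, unchanged
```

The direction follows the role of the trade-off parameter described earlier: small values favour a sparse graph and large values favour a graph with more edges.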
For all these methods, we run each method 10 times and report the mean of performance as well as the standard deviation in Table 2.

Results and Analysis.
We choose six datasets to demonstrate the effectiveness of k-GC, the first two of which are synthetic while the rest are real-world datasets. The clustering results are reported in Table 2, from which we can see that k-GC improves the performance to a large extent in comparison to the state-of-the-art methods. The clustering results of the compared methods on the two synthetic datasets are visualized in Figure 3, which shows intuitively that k-GC approximates the ground truth most closely. Notably, on the two synthetic datasets, two-moon and path-based, k-GC achieves results approaching 100%, meaning almost entirely accurate clustering. This highlights the importance of the intrinsic graph structure, whose analysis allows k-GC to outperform the other methods significantly. Figure 4 shows that the objective value of k-GC, equation (7), is nonincreasing during the iterations. On all datasets, it converges to a fixed value within 10 iterations. k-GC needs only a few alternating iterations to obtain a cut-zeroed graph, while the conventional SC-based methods need postprocessing with the k-means clustering algorithm. Therefore, the k-GC algorithm is effective overall.

Conclusion
In this paper, we have proposed a novel k-component graph clustering method, which learns a graph with exactly k connected components. Since the vertices in each connected component of the intrinsic graph belong to one cluster, labels are obtained directly from the learned graph itself without further graph-cut or k-means clustering algorithms. k-GC learns the affinity matrix and clustering structure simultaneously. Moreover, k-GC can serve as an alternative to SC due to its simplicity and its effectiveness over standard SC. This paper focuses on the spectral analysis of the graph Laplacian, for which the sum of the k smallest eigenvalues is driven to zero. This updates the graph structure until it has exactly k components. An efficient optimization algorithm is presented together with its analysis. Experiments on six benchmarks have demonstrated the superiority of k-GC.

Conflicts of Interest
The authors declare that they have no conflicts of interest.