A Fast Clustering Algorithm for Data with a Few Labeled Instances

The diameter of a cluster is the maximum intracluster distance between pairs of instances within the same cluster, and the split of a cluster is the minimum distance between instances within the cluster and instances outside the cluster. Given a few labeled instances, this paper makes two contributions. First, we present a simple and fast clustering algorithm with the following property: if the ratio of the minimum split to the maximum diameter (RSD) of the optimal solution is greater than one, the algorithm returns optimal solutions for three clustering criteria. Second, we study the metric learning problem: learn a distance metric that makes the RSD as large as possible. Compared with existing metric learning algorithms, one of our metric learning algorithms is computationally efficient: it is a linear programming model rather than the semidefinite programming model used by most existing algorithms. We demonstrate empirically that the supervision and the learned metric can improve the clustering quality.


Introduction
Clustering is the unsupervised classification of instances into clusters in a way that attempts to minimize the intracluster distance and to maximize the intercluster distance. Two criteria commonly used to measure the quality of a clustering are diameter and split. The diameter of a cluster is the maximum distance between pairs of instances within the same cluster, and the split of a cluster is the minimum distance between instances within the cluster and instances outside the cluster. Clearly, the diameter of a cluster is a natural indication of homogeneity of the cluster and the split of a cluster is a natural indication of separation between the cluster and other clusters. 2 Computational Intelligence and Neuroscience criteria can be found using Gonzalez's algorithm [1] in linear time: maximizing RSD, maximizing the minimum split, and minimizing the maximum diameter.
However, the condition $\mathrm{RSD}_{\mathrm{opt}}(X) > 1$ is too strong and unrealistic for real-world data. So a natural question arises: if $X$ is poorly clusterable ($\mathrm{RSD}_{\mathrm{opt}}(X) \ll 1$), can $X$ be made more clusterable by a metric learning approach, so that Gonzalez's algorithm performs better with the learned metric than with the original metric?
In the clustering literature, there are two common ways to add supervision information to clustering. The first adds a small portion of labeled training data to the unlabeled data; this is also called semisupervised learning [12,13]. The second specifies pairwise constraints instead of class labels [14,15]: a must-link constraint requires that the two instances involved be in the same cluster, whereas the two instances involved in a cannot-link constraint must be in different clusters.
Overall, empirical studies showed that supervised metric learning algorithms can usually outperform unsupervised ones by exploiting either the label information or the side information presented in pairwise constraints. However, despite extensive studies, most of the existing algorithms for metric learning have at least one of the following drawbacks: they need to solve a nontrivial optimization problem (for example, a semidefinite programming problem), they have parameters to tune, or they return only a locally optimal solution.
In this paper, we present two simple metric learning models to make data more clusterable. The two models are computationally efficient, parameter-free, and free of local optima. The rest of this paper is organized as follows. Section 2 gives some notations and the definitions of clustering criteria used in the paper. Section 3 gives Gonzalez's farthest-point clustering algorithm for unsupervised learning, presents a nearest neighbor-based clustering algorithm for semisupervised learning, and discusses the properties of the two algorithms. In Section 4, we formulate the problem of making data more clusterable as a convex optimization problem. Section 5 presents the experimental results. We conclude the paper in Section 6.

Notations and Preliminary
We use the following notations in the rest of the paper.
$|\cdot|$: the cardinality of a set.
$X \subset \mathbb{R}^m$: the set of instances (in $m$-dimensional space) to be clustered.
In the rest of the paper, we use ℘ ssc to denote the subset of ℘ that respects the semisupervised constraints, and we require that any partition in the context of semisupervised learning should respect the semisupervised constraints.
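Definition 5. The unsupervised and semisupervised min-max diameter problems are defined as, respectively,
$$\min_{P \in \wp} \max_{C \in P} d(C), \qquad \min_{P \in \wp_{\mathrm{ssc}}} \max_{C \in P} d(C),$$
where $d(C)$ denotes the diameter of cluster $C$.
Definition 6. The unsupervised and semisupervised max-RSD problems are defined as, respectively,
$$\max_{P \in \wp} \mathrm{RSD}(P), \qquad \max_{P \in \wp_{\mathrm{ssc}}} \mathrm{RSD}(P).$$
For the unsupervised max-RSD problem, Wang and Chen [9] presented an exact algorithm for $k = 2$ and a 2-approximation algorithm for $k \ge 3$; however, the worst-case time complexity of both algorithms is $O(n^3)$, which makes them impractical for large-scale data.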
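The three quantities behind these definitions (diameter, split, and RSD) can be illustrated with a minimal Python sketch. It assumes Euclidean distance and a partition given as a list of clusters, each a list of coordinate tuples; the function names are hypothetical and not taken from the paper.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)


def diameter(cluster):
    """Maximum distance between two instances of one cluster (0 for a singleton)."""
    return max((dist(x, y) for x, y in combinations(cluster, 2)), default=0.0)


def split(cluster, outside):
    """Minimum distance between an instance in the cluster and one outside it."""
    return min(dist(x, y) for x in cluster for y in outside)


def rsd(partition):
    """Ratio of the minimum split to the maximum diameter of a partition.

    Assumes the partition has at least two non-empty clusters.
    """
    max_diameter = max(diameter(c) for c in partition)
    min_split = min(
        split(c, [y for other in partition if other is not c for y in other])
        for c in partition
    )
    return min_split / max_diameter if max_diameter > 0 else float("inf")


# Two tight, well-separated clusters: diameter 1, split 5.
partition = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
print(rsd(partition))
```

An RSD above one, as in this toy partition, is the well-clusterable case studied in the next section.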

Well-Clusterable Data: Find the Optimal Solution Efficiently
In this section, we show that if $\mathrm{RSD}_{\mathrm{opt}}(X) > 1$, the max-RSD problem, the max-min split problem, and the min-max diameter problem can be simultaneously solved by Gonzalez's algorithm for unsupervised learning in Section 3.1 and by a nearest neighbor-based algorithm for semisupervised learning in Section 3.2, respectively. We also discuss the properties of the two algorithms for the case of $\mathrm{RSD}_{\mathrm{opt}}(X) \le 1$.

Unsupervised Learning.
The farthest-point clustering (FPC) algorithm proposed by Gonzalez [1] is shown in Algorithm 1, where the meaning of nearest neighbor is the literal one as in (9); that is, $x$'s nearest neighbor in $S$ is
$$\arg\min_{y \in S} d(x, y). \tag{9}$$
Theorem 7. For unsupervised learning, if $\mathrm{RSD}_{\mathrm{opt}}(X) > 1$, then the partition returned by FPC is simultaneously the optimal solution of the max-RSD problem, the max-min split problem, and the min-max diameter problem.
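Algorithm: FPC
Input: The input data $X$ and the number $k$ of clusters.
Output: The partition $P$ of $X$.
(1) $S \leftarrow \emptyset$;
(2) Randomly select an instance $x$ from $X$; $S \leftarrow S \cup \{x\}$;
(3) while $|S| < k$: select the instance $x \in X - S$ with the maximum $\mathrm{Min}(x, S)$, where $\mathrm{Min}(x, S) = \min_{y \in S} d(x, y)$; $S \leftarrow S \cup \{x\}$;
(4) Let $P$ be the partition obtained by assigning each instance $x$ of $X$ to its nearest neighbor in $S$ (if $x \in S$, the nearest neighbor of $x$ in $S$ is itself);
(5) return $P$;
Algorithm 1: The FPC algorithm for unsupervised learning [1].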
Proof. (a) The proof of the max-RSD problem: let $P^* = \{C_1, C_2, \ldots, C_k\}$ be the optimal partition of the max-RSD problem; then $\mathrm{RSD}(P^*) > 1$, and we have $\forall x, y \in C_i$, $\forall z \notin C_i$:
$$d(x, y) < d(x, z). \tag{10}$$
We prove the following proposition: any pair of instances in $S$ (see Algorithm 1) must be in different clusters of $P^*$; that is, $S$ contains exactly one instance of each cluster $C_i$, $i = 1, 2, \ldots, k$. If this holds, then by (10), for any instance $x \in C_i$, $i = 1, 2, \ldots, k$, its nearest neighbor in $S$ must be the instance $y$ such that $y$ also belongs to $C_i$, and hence $P = P^*$.
We prove the proposition by contradiction. Assume that there exists a pair of instances $x$ and $y$ in $S$ that belong to the same cluster $C_i$ for some $i$. Without loss of generality, let $x$ be selected into $S$ before $y$. Then $\mathrm{Min}(y, S) \le d(x, y)$ when selecting $y$ into $S$, where $\mathrm{Min}(y, S)$ denotes the distance from $y$ to its nearest neighbor in $S$. Note that $|S| < k$ before selecting $y$, so there exists at least one cluster $C_j$ ($j \ne i$) such that no instance in $S$ belongs to $C_j$. By (10), for any $z \in C_j$, we have $\mathrm{Min}(z, S) > d(x, y) \ge \mathrm{Min}(y, S)$; hence $y$ has no chance to be selected into $S$, since we select the instance with the maximum $\mathrm{Min}(\cdot, S)$, and thus the proposition holds.
(b) Since separating any pair $x, y$ of instances within the same cluster of $P^*$ into different clusters will strictly decrease the split of the resulting partition, the conclusion for the max-min split problem holds.
(c) Since grouping any pair $x, y$ of instances in different clusters of $P^*$ into the same cluster will strictly increase the diameter of the resulting partition, the conclusion for the min-max diameter problem holds.
Clearly, the time complexity of FPC is $O(nk)$ by maintaining a nearest neighbor table that records the nearest neighbor in $S$ of each instance $x \in X - S$ and the corresponding distance between $x$ and its nearest neighbor in $S$. The space complexity is $O(n)$. So the time complexity and the space complexity are both linear in $n$ for a fixed $k$. Using a more complicated approach, the algorithm can be implemented in $O(n \log k)$, but that implementation depends exponentially on the dimension [3]. Now, a natural question arises: if $\mathrm{RSD}_{\mathrm{opt}}(X) \le 1$, how does the FPC algorithm perform? Although in this paper we cannot give a performance guarantee for the FPC algorithm on the max-RSD problem and the max-min split problem if $\mathrm{RSD}_{\mathrm{opt}}(X) \le 1$, Gonzalez [1] proved the following theorem (see also [2,3]).
Theorem 8 (see [1]). The FPC is a 2-approximation algorithm for the unsupervised min-max diameter problem with the triangle inequality satisfied for any $k$. Furthermore, for $k \ge 3$, the $(2 - \varepsilon)$-approximation of the unsupervised min-max diameter problem with the triangle inequality satisfied is NP-complete for any $\varepsilon > 0$.
So as far as the approximation ratio is concerned, the FPC algorithm is the best for the unsupervised min-max diameter problem unless P = NP.
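Algorithm 1, together with the nearest neighbor table mentioned in the complexity discussion, can be sketched in Python as follows. This is a minimal illustration under stated assumptions (Euclidean distance, instances as coordinate tuples, a hypothetical function name `fpc`), not the authors' MATLAB implementation.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)


def fpc(X, k, seed=None):
    """Farthest-point clustering sketch: return (centers, labels).

    labels[i] is the position in `centers` of the center nearest to X[i].
    """
    rng = random.Random(seed)
    first = rng.randrange(len(X))
    centers = [first]
    # Nearest neighbor table: closest selected center and its distance, per instance.
    nearest = [0] * len(X)
    min_dist = [dist(x, X[first]) for x in X]
    while len(centers) < k:
        # Select the instance that is farthest from its nearest selected center.
        far = max(range(len(X)), key=min_dist.__getitem__)
        centers.append(far)
        # Only distances to the new center are computed: O(n) work per round.
        for i, x in enumerate(X):
            d = dist(x, X[far])
            if d < min_dist[i]:
                min_dist[i] = d
                nearest[i] = len(centers) - 1
    return [X[c] for c in centers], nearest


X = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
centers, labels = fpc(X, k=3, seed=0)
print(labels)
```

Because each of the $k$ rounds touches every instance once, the sketch runs in $O(nk)$ distance computations, matching the bound stated above. On the toy data the split is far larger than the diameter, so the three natural pairs are recovered whichever instance is drawn first.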

Semi-Supervised Learning.
For semisupervised learning, we present a nearest neighbor-based clustering (NNC) algorithm as shown in Algorithm 2. The algorithm is self-explanatory, so we give no further explanation.

Theorem 9.
For semisupervised learning, if $\mathrm{RSD}_{\mathrm{opt}}(X) > 1$, then the partition returned by NNC is simultaneously the optimal solution of the semisupervised max-RSD problem, the semisupervised max-min split problem, and the semisupervised min-max diameter problem.
The proofs for the semisupervised max-min split problem and the semisupervised min-max diameter problem are similar to (b) and (c) in the proof of Theorem 7, respectively, and we omit them here.
The time complexity of NNC using a simple implementation is $O(n \sum_{i=1}^{k} |L_i|)$. The space complexity of NNC is $O(n)$. Since we assume that the $L_i$ are small sets for $i = 1, 2, \ldots, k$, the time and space complexities are also linear in $n$ when the $|L_i|$ are regarded as constants for $i = 1, 2, \ldots, k$.
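Similar to Theorem 8, we have the following theorem for the semisupervised min-max diameter problem.
Theorem 10. The NNC is a 2-approximation algorithm for the semisupervised min-max diameter problem with the triangle inequality satisfied.
Proof. Let $L = L_1 \cup L_2 \cup \cdots \cup L_k$, let $\alpha = \max_{i} d(L_i)$, let $\beta = \max\{\mathrm{Min}(x, L) : x$ is an unlabeled instance$\}$, and let $x$ be any unlabeled instance such that $\mathrm{Min}(x, L) = \beta$. Since the optimal partition of a semisupervised min-max diameter problem must respect the supervision, we have $d_{\mathrm{opt}}(X) \ge \alpha$, where $d_{\mathrm{opt}}(X)$ denotes the diameter of the optimal solution of the semisupervised min-max diameter problem; at the same time, $x$ and $L_i$ for some $i \in \{1, 2, \ldots, k\}$ must be within the same cluster of the optimal solution, so $d_{\mathrm{opt}}(X) \ge \beta$; therefore $d_{\mathrm{opt}}(X) \ge \max\{\alpha, \beta\}$. Now consider the partition $P = \{C_1, C_2, \ldots, C_k\}$ returned by NNC. Since each unlabeled instance is assigned to its nearest neighbor in $L$, for any cluster $C_i$ of $P$, $i = 1, 2, \ldots, k$ (assume that the super-instance in $C_i$ is $L_i$), we have $d(x, L_i) \le \beta$ for every unlabeled $x \in C_i$, and $d(C_i) \le 2\beta$ by the triangle inequality. So $d(P) \le \max\{2\beta, \alpha\} \le 2 d_{\mathrm{opt}}(X)$, and the theorem holds.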
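Algorithm 2 itself is not reproduced in this text, so the following Python sketch rests on the description in the proof above: every labeled set seeds one cluster, and each unlabeled instance joins the cluster of its nearest labeled instance. Euclidean distance, the function name `nnc`, and the argument names are assumptions made for illustration.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def nnc(unlabeled, labeled_sets):
    """Nearest neighbor-based clustering sketch.

    labeled_sets[i] holds the labeled instances of class i; the result is one
    cluster per class, seeded with the labeled instances.
    """
    clusters = [list(s) for s in labeled_sets]
    for x in unlabeled:
        # Brute-force search over all labeled instances for the nearest one.
        best = min(
            range(len(labeled_sets)),
            key=lambda i: min(dist(x, y) for y in labeled_sets[i]),
        )
        clusters[best].append(x)
    return clusters


labeled_sets = [[(0, 0)], [(10, 0)]]
unlabeled = [(1, 0), (9, 0), (4, 0)]
print(nnc(unlabeled, labeled_sets))
```

The brute-force search compares each unlabeled instance with every labeled one, which is where the linear dependence on $n$ for small labeled sets comes from.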

The Metric Learning Models
If the given data are poorly clusterable, that is, $\mathrm{RSD}_{\mathrm{opt}}(X)$ is far less than one, the algorithms FPC and NNC may perform poorly. Given the supervision, we use metric learning to make the supervised data more clusterable, and then the two algorithms can be used with the new metric. Supervised metric learning attempts to learn distance metrics that keep instances with the same class labels (or with a must-link constraint) close and separate instances with different class labels (or with a cannot-link constraint) far away. As discussed in the first section, there are many possible ways to realize this intuition; for example, Xing et al. [18] presented the following model:
$$\min_{A} \sum_{(x, y) \in \mathcal{S}} \|x - y\|_A^2$$
$$\text{s.t.} \quad \sum_{(x, y) \in \mathcal{D}} \|x - y\|_A \ge 1, \tag{13}$$
$$A \succeq 0. \tag{14}$$
In the above model, $\mathcal{S}$ denotes the set of must-link constraints, $\mathcal{D}$ denotes the set of cannot-link constraints, $A$ is an $m \times m$ Mahalanobis distance matrix, and $\|x - y\|_A$ denotes the distance $d_A(x, y)$ between two instances $x$ and $y \in X \subseteq \mathbb{R}^m$ with respect to $A$; that is,
$$\|x - y\|_A = \sqrt{(x - y)^T A (x - y)},$$
where $T$ denotes the transpose of a matrix or a vector. The constraint (14) requires that $A$ be a positive semidefinite matrix; that is, $\forall z \in \mathbb{R}^m$, $z^T A z \ge 0$. The choice of the constant 1 on the right-hand side of (13) is arbitrary but not important, and changing it to any other positive constant $c$ results only in $A$ being replaced by $c^2 A$.
Note that the matrix $A$ can be either a full matrix or a diagonal matrix. In natural language, Xing et al.'s model minimizes the sum of squared distances with respect to $A$ between pairs of instances with must-link constraints subject to the following constraints: (a) the sum of distances with respect to $A$ between pairs of instances with cannot-link constraints is greater than or equal to one, and (b) $A$ is a positive semidefinite matrix.
Xing et al.'s model, like most existing metric learning models, is a semidefinite programming problem and is thus computationally expensive, and even intractable in high-dimensional space for the case of a full matrix.
Inspired by the RSD clustering criterion, we propose two metric learning models: one learns a full matrix and the other learns a diagonal matrix. In this section, the supervision can be given either in the form of labeled sets $L_1, L_2, \ldots, L_k$ or in the form of pairwise constraints.

The Labeled Sets.
Given the supervision $L_1, L_2, \ldots, L_k$, we want to learn a Mahalanobis distance matrix $A$ such that the minimum split with respect to $A$ among $L_i$, $i = 1, 2, \ldots, k$, is maximized subject to the following constraints: (a) the distance between each pair of instances with the same class label is less than or equal to one and (b) $A$ is a positive semidefinite matrix. Formally, we have the following optimization problem (the case of full matrix).

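The Case of Full Matrix
$$\max_{A, s} \; s$$
$$\text{s.t.} \quad \|x - y\|_A^2 \ge s, \quad \forall x \in L_i,\ \forall y \in L_j,\ i \ne j, \tag{17}$$
$$\|x - y\|_A^2 \le 1, \quad \forall x, y \in L_i,\ i = 1, 2, \ldots, k, \tag{18}$$
$$A \succeq 0, \quad s \ge 0.$$
The constraint (17) requires that the scalar variable $s$ (the minimum split) be no larger than the distance between any pair of instances with different class labels, so that at the optimum $s$ equals the minimum of those distances. The constraint (18) requires that the distance between each pair of instances with the same class label be less than or equal to one. The optimization objective is to maximize $s$. Similar to (13), the choice of the constant 1 on the right-hand side of (18) is arbitrary but not important and can be set to any positive constant.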
The full matrix model is an SDP optimization problem, and, theoretically, the global optimal solution can be found efficiently [36]. However, when $A$ is a full matrix, the number of variables ($|A|$) is quadratic in $m$, and thus the model is prohibitive for problems with a large number of dimensions. To avoid this problem, we can require that $A$ be a diagonal matrix. Since $A$ is a diagonal matrix, $A$ is a positive semidefinite matrix if and only if $a_{ii} \ge 0$ for $i = 1, 2, \ldots, m$, where $a_{ii}$ is the $i$th diagonal entry. So, learning a diagonal matrix $A$ is equivalent to learning a vector $w \in \mathbb{R}^m$ using the following model (the case of diagonal matrix).

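The Case of Diagonal Matrix
$$\max_{w, s} \; s$$
$$\text{s.t.} \quad \|x - y\|_w^2 \ge s, \quad \forall x \in L_i,\ \forall y \in L_j,\ i \ne j, \tag{22}$$
$$\|x - y\|_w^2 \le 1, \quad \forall x, y \in L_i,\ i = 1, 2, \ldots, k, \tag{23}$$
$$w_i \ge 0, \quad i = 1, 2, \ldots, m, \tag{24}$$
$$s \ge 0,$$
where
$$\|x - y\|_w^2 = \sum_{i=1}^{m} w_i (x_i - y_i)^2. \tag{26}$$
The constraint (24) requires that each component of $w$ be greater than or equal to zero. Now since the optimization objective and all constraints are linear, the above optimization problem is a linear programming problem with $m + 1$ variables and $k \times |L| \times (|L| - 1)/2 + k \times (k - 1) \times |L|^2/2 + (m + 1)$ inequality constraints (assuming that all $L_i$ have equal size $|L|$). When $|L_i|$ is small for $i = 1, 2, \ldots, k$, the global optimal solution can be found efficiently using an optimization package, for example, the MATLAB linprog function or the CVX-MATLAB software for disciplined convex programming (http://cvxr.com/cvx/download/).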
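To make the constraint count concrete, the diagonal model can be written in the standard form accepted by LP solvers: minimize $c^T z$ subject to $A_{ub} z \le b_{ub}$ and $z \ge 0$, with $z = (w_1, \ldots, w_m, s)$. The Python sketch below only builds the matrices. It assumes that the distance in the constraints is the weighted squared difference $\sum_i w_i (x_i - y_i)^2$ (which is what keeps the constraints linear in $w$), and the function name `build_diag_lp` is hypothetical.

```python
from itertools import combinations


def build_diag_lp(labeled_sets):
    """Return (c, A_ub, b_ub) for the diagonal model, variables z = (w_1..w_m, s).

    Nonnegativity of w and s is left to the solver's variable bounds.
    """
    m = len(labeled_sets[0][0])

    def squared_diffs(x, y):
        return [(a - b) ** 2 for a, b in zip(x, y)]

    A_ub, b_ub = [], []
    # Same label: sum_i w_i (x_i - y_i)^2 <= 1.
    for members in labeled_sets:
        for x, y in combinations(members, 2):
            A_ub.append(squared_diffs(x, y) + [0.0])
            b_ub.append(1.0)
    # Different labels: s - sum_i w_i (x_i - y_i)^2 <= 0.
    for set_i, set_j in combinations(labeled_sets, 2):
        for x in set_i:
            for y in set_j:
                A_ub.append([-v for v in squared_diffs(x, y)] + [1.0])
                b_ub.append(0.0)
    c = [0.0] * m + [-1.0]  # maximizing s is minimizing -s
    return c, A_ub, b_ub


# k = 3 classes, |L| = 2 labeled instances each, m = 2 dimensions.
sets = [[(1.0, 1.0), (1.2, 1.1)], [(2.0, 1.0), (1.9, 1.2)], [(1.5, 2.0), (1.4, 1.8)]]
c, A_ub, b_ub = build_diag_lp(sets)
print(len(A_ub))  # 3*2*1/2 + 3*2*2**2/2 = 15 rows, plus m + 1 bounds
```

The triple can then be handed to any LP solver; for instance, `scipy.optimize.linprog(c, A_ub=A_ub, b_ub=b_ub)` uses nonnegative variable bounds by default, which supplies the remaining $m + 1$ constraints. The paper itself reports using MATLAB, so this solver choice is only an illustration.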

Pairwise Constraints.
If the supervision is given in the form of pairwise constraints, that is, the must-link and cannot-link constraints, the models also work after a minor modification. Let ML be the set of must-link constraints, and let CL be the set of cannot-link constraints; then the full matrix model and the diagonal matrix model should be modified by substituting (17′) for (17), (18′) for (18), (22′) for (22), and (23′) for (23), respectively:
$$\|x - y\|_A^2 \ge s, \quad \forall (x, y) \in \mathrm{CL}, \tag{17$'$}$$
$$\|x - y\|_A^2 \le 1, \quad \forall (x, y) \in \mathrm{ML}, \tag{18$'$}$$
$$\|x - y\|_w^2 \ge s, \quad \forall (x, y) \in \mathrm{CL}, \tag{22$'$}$$
$$\|x - y\|_w^2 \le 1, \quad \forall (x, y) \in \mathrm{ML}. \tag{23$'$}$$
However, if the supervision is given in the form of pairwise constraints, it is nontrivial to decide whether there is a partition $P$ of $X$ that satisfies all of those pairwise constraints (we call this the feasibility problem). For CL constraints, Davidson and Ravi showed that the feasibility problem is equivalent to the $k$-colorability problem [37] and thus NP-complete [38], whereas the feasibility problem is trivial if the supervision is given in the form of labeled sets. Of course, if we do not require that all of those pairwise constraints be satisfied, the FPC algorithm can be naturally used together with the metric learned from the pairwise constraints.
Clearly, the metric learning models proposed in this paper are practicable only when the cardinality of sets of labeled instances or the number of pairwise constraints is small. Otherwise, the problem is usually overconstrained and there is no feasible solution.

The Compared Algorithms and Benchmark Datasets.
To validate whether semisupervised learning performs better than unsupervised learning, whether metric learning can improve clustering quality, and whether our metric learning model performs better than that of Xing et al. for the FPC and NNC algorithms, we implemented the following algorithms:
(i) the FPC algorithm as shown in Algorithm 1;
(ii) the NNC algorithm as shown in Algorithm 2;
(iii) the FPC with our metric learning model (the case of diagonal matrix) (FPC Diag); that is, we first use our metric learning model to learn a vector $w$ and then use the FPC clustering algorithm with the learned vector, with the distance computed using (26);
(iv) the NNC with our metric learning model (the case of diagonal matrix) (NNC Diag);
We also implemented the following algorithms as baseline approaches. We chose $k$-means for comparison because it is very simple and is also a linear time algorithm when $k$ and the number of repetitions are regarded as constants:
(i) the constrained $k$-means [39]
We conduct experiments on twenty UCI real-world datasets obtained from the Machine Learning Repository of the University of California, Irvine [42]. The information about those datasets is summarized in Table 1.

The Experiments Setup.
We first apply the following preprocessing: for a nominal attribute with $v$ different values, we replace these values by the integers $1, 2, \ldots, v$, and then all attributes are normalized to the interval [1,2].
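The preprocessing step can be sketched in Python as below. Two details are assumptions, since the text does not fix them: integers are assigned to nominal values in sorted order, and the normalization is a min-max rescaling. Both function names are hypothetical.

```python
def encode_nominal(column):
    """Replace the distinct values of a nominal attribute by integers 1, 2, ..."""
    # Assumption: codes follow the sorted order of the distinct values.
    codes = {value: i for i, value in enumerate(sorted(set(column)), start=1)}
    return [codes[value] for value in column]


def scale_to_1_2(column):
    """Min-max rescale a numeric attribute to the interval [1, 2]."""
    low, high = min(column), max(column)
    if high == low:  # constant attribute: map every value to 1
        return [1.0 for _ in column]
    return [1.0 + (value - low) / (high - low) for value in column]


print(encode_nominal(["red", "blue", "red"]))
print(scale_to_1_2([0, 5, 10]))
```

Rescaling every attribute to the same interval keeps any single attribute from dominating the distance purely because of its units.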
Except for Ecoli, $|L_i|$ is set to five for $i = 1, 2, \ldots, k$. Because the smallest of the eight classes in the Ecoli dataset has only two instances, $|L_i|$ is set to two for $i = 1, 2, \ldots, k$ on that dataset. The stop condition is that either the number of repetitions exceeds 100 or the objective difference between two consecutive repetitions is less than $10^{-6}$.
We use the Rand Index [43] to measure the clustering quality in our experiments. The Rand Index reflects the agreement of the clustering result with the ground truth. Here, the ground truth is given by the data's class labels. Let $a$ be the number of instance pairs that are assigned to the same cluster and have the same class label, and let $b$ be the number of instance pairs that are assigned to different clusters and have different class labels. Then, the Rand Index is defined as
$$\mathrm{RI} = \frac{a + b}{n(n - 1)/2},$$
where $n$ is the number of instances.
All algorithms are implemented in MATLAB R2009b, and experiments are carried out on a 2.6 GHz dual-core Pentium PC with 2 GB of RAM. Table 2 summarizes the mean Rand Index and the standard deviation over 20 random runs on twenty datasets, and the value in bold in each row is the highest. Table 2 shows that although no algorithm performs better than the other algorithms on all datasets, in general we can draw the following conclusion.

The Mean Rand Index.
Figure 1 shows that both NNC and FPC are much faster than CopK and PCK, which is consistent with their time complexities: the complexity of FPC and NNC is $O(nk)$, whereas the complexity of CopK and PCK is $O(nkt)$, where $t$ is the number of repetitions of $k$-means. Figure 1 also shows that Xing et al.'s model is slower than our model when the number of dimensions is relatively large, for example, Ionosphere, Promoters, Sick, and Splice. On the other hand, since the number of inequality constraints is quadratic in the number of class labels, our Diag model is slower than Xing et al.'s model on datasets with a relatively large number of class labels, for example, Ecoli, Mfeat-fac, Mfeat-pix, Yeast, and Zoo.
The experimental results in Table 2 and Figure 1 show that the FPC algorithm is very fast, but its clustering results are unsatisfactory. The NNC algorithm proposed in this paper has the same time complexity as FPC, but its clustering quality is much better than that of FPC when a few labeled instances are available.
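For reference, the Rand Index used throughout this evaluation counts the instance pairs on which the clustering and the class labels agree. A minimal Python sketch follows; the function name `rand_index` and the flat-label inputs are assumptions for illustration, not the authors' MATLAB code.

```python
from itertools import combinations


def rand_index(class_labels, cluster_labels):
    """Fraction of instance pairs on which class labels and clusters agree."""
    agreeing = 0
    total = 0
    for i, j in combinations(range(len(class_labels)), 2):
        same_class = class_labels[i] == class_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        # Counts both "same cluster, same class" and "different, different" pairs.
        agreeing += same_class == same_cluster
        total += 1
    return agreeing / total


print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical up to renaming clusters
```

The measure ignores the names of the clusters, so a clustering that matches the classes up to relabeling scores 1.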

Conclusion
In this paper, we studied the problem related to clusterability. We showed that if the input data are well clusterable, the optimal solutions with respect to the min-max diameter criterion, the max-min split criterion, and the max-RSD criterion can be simultaneously found in linear time for both unsupervised and semisupervised learning. For the max-RSD criterion, we also proposed two convex optimization models to make data more clusterable.
The experimental results on twenty UCI datasets demonstrate that both the supervision and the learned metric can significantly improve the clustering quality. We believe that the proposed NNC algorithm and metric learning models are useful when only a few labeled instances are available.
Usually, the term semisupervised learning is used to describe scenarios where both the labeled data and the unlabeled data affect the performance of a learning algorithm, which is not the case here: the supervised data are used either to induce a nearest neighbor classifier on the unlabeled data or to find a metric vector. Hence, the supervision information could be exploited more fully in future work.