Most existing semisupervised clustering algorithms that rely on a small labeled dataset achieve low accuracy on multidensity and imbalanced datasets, and labeling data is expensive and time consuming in many real-world applications. This paper focuses on active data selection and semisupervised clustering for multidensity and imbalanced datasets and proposes an active semisupervised clustering algorithm. The proposed algorithm uses an active mechanism for data selection to minimize the amount of labeled data, and it uses multiple thresholds to expand the labeled dataset on multidensity and imbalanced datasets. Three standard datasets and one synthetic dataset are used to evaluate the proposed algorithm, and the experimental results show that it achieves higher accuracy and more stable performance than other clustering and semisupervised clustering algorithms, especially when the datasets are multidensity and imbalanced.

Semisupervised clustering has recently been studied as a way to improve the performance of clustering algorithms: it allows a human expert to incorporate domain knowledge into the clustering process and thus guide it toward better results. The use of domain knowledge in clustering is motivated by the fact that prior knowledge about some data objects can be obtained in many applications; this prior knowledge may take the form of labels on data objects or relationships between data objects. The "must-link" and "cannot-link" constraints capture relationships among data objects, while labeled objects can be used in clustering algorithms to help determine the groups and obtain more meaningful results. Most existing semisupervised clustering algorithms can be divided into three categories: methods based on labeled data [

Semisupervised clustering algorithms based on labeled data use the label information to improve clustering performance. Semisupervised k-means is a popular semisupervised clustering method [

The concepts of two basic pairwise constraints were defined by Wagstaff et al. [

Fuzzy clustering models express clustering results through memberships, where the membership grades act as probabilities that each data object belongs to each class. To improve the performance of fuzzy clustering, prior knowledge has been incorporated into it, and most such methods use the prior knowledge to modify the objective function. Labeled data [

Most semisupervised clustering algorithms assume that the labeled dataset or pairwise constraints are given. In practice, obtaining such prior knowledge is expensive and time consuming. In addition, if the labeled dataset is too small when constructing a semisupervised clustering based on labeled data, some clusters in an imbalanced dataset may contain no labeled data, and the data in those clusters will then be assigned to other clusters forcibly. For example, the dataset shown in Figure

An imbalanced and multidensity dataset which contains 4 clusters.

The active learning method, which aims to achieve high accuracy using as little labeled data as possible, selects informative data actively and has them labeled by an oracle. Active learning can greatly reduce the cost of obtaining labeled data without compromising the performance of the clustering algorithm, which is very attractive and valuable in real-world applications.

Perhaps the simplest and most commonly used active learning technique is uncertainty sampling [
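The uncertainty-sampling idea can be illustrated with a minimal least-confidence query; this is a sketch of the general technique, not the cited implementation, and `predict_proba` is a hypothetical stand-in for any probabilistic classifier:

```python
import math

def least_confidence_query(pool, predict_proba):
    """Return the index of the pool point the model is least confident about."""
    confidence = lambda x: max(predict_proba(x))  # probability of the top class
    return min(range(len(pool)), key=lambda i: confidence(pool[i]))

# Toy two-class model: a sigmoid over a single feature, boundary at x = 0.
def predict_proba(x):
    p = 1.0 / (1.0 + math.exp(-x))
    return (p, 1.0 - p)

pool = [-3.0, -0.1, 2.5]
query = least_confidence_query(pool, predict_proba)  # picks the point nearest the boundary
```

The queried point is the one closest to the decision boundary, where the top-class probability is closest to 0.5.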

Although most of the active learning strategies are applied into classification tasks, in the recent years, active learning is also introduced into clustering [

Most existing active learning algorithms are pool-based or stream-based and are mainly applied in supervised learning. Although active learning has been introduced into semisupervised clustering, the performance of these clustering algorithms is unsatisfying on imbalanced and multidensity datasets. The most uncertain data lie on the boundaries of clusters and are not "representative" of the other data in the same cluster, so knowing their labels is unlikely to improve the performance of the clustering algorithm as a whole. This paper instead selects the data point with maximum density from each cluster produced by MST clustering.
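The MST-based selection step can be sketched as follows. The density proxy (negative mean in-cluster distance) and the Prim-style MST construction are illustrative assumptions, not the paper's exact definitions:

```python
import math

def mst_edges(points):
    """Prim's algorithm; returns the MST edges as (distance, i, j) triples."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        d, i, j = min((dist(i, j), i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        in_tree.add(j)
        edges.append((d, i, j))
    return edges

def select_representatives(points, k):
    """Cut the k-1 longest MST edges, then pick the densest point of each part."""
    edges = sorted(mst_edges(points))[:-(k - 1)] if k > 1 else mst_edges(points)
    # Union-find over the remaining edges gives the k partitions.
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in edges:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    reps = []
    for members in clusters.values():
        # Density proxy: negative mean distance to the other cluster members.
        def density(i):
            others = [m for m in members if m != i]
            return -sum(math.dist(points[i], points[m]) for m in others) / max(len(others), 1)
        reps.append(max(members, key=density))
    return sorted(reps)
```

Only the returned representatives need to be sent to the oracle, one per MST partition.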

Since the dataset is imbalanced, the distribution of the labeled data does not match that of the whole data space, and a data point and its k nearest neighbors may not share the same class label.

The proposed semisupervised clustering method utilizes MST clustering to select data points actively so as to avoid labeling data in the same cluster repeatedly. If we need

The proposed clustering algorithm achieves label propagation by using labeled data to expand their

The rest of this paper is organized as follows. Section

The

Each classification algorithm requires enough labeled data to achieve high classification accuracy. However, labeling data is expensive and time consuming in many real-world applications, and often only a very small labeled dataset can be obtained. For instance, there are 3 classes in Figure

Low accuracy of KNN on an imbalanced dataset.

There are two problems for most classification and semisupervised clustering algorithms like

The first is that when the whole dataset is imbalanced and the labeled dataset is small, selecting labeled data at random cannot guarantee that at least one data object is selected from each class.

The second is that the class label of some unlabeled data and the labels of its k nearest neighbors may differ, particularly where cluster densities vary.

Some definitions are given as follows in order to describe the proposed active semisupervised clustering algorithm.

The proposed active semisupervised clustering process can be divided into two algorithms: active data selection algorithm (Algorithm

(1) Let

to be selected,

(2) Use Prim's algorithm to construct the MST of

(3)

(4) Compute edge’s inconsistent value

(5)

(6) Sort all

(7) Insert the sorted edges into a list:

(8)

(9) Delete

(10) Check the number of partitions in MST,

(11)

(12)

(13)

(14)

(15)

(16)

(17) Compute density of each point in

(18) Select the data point with maximum density and add it to

(19)

(20) Query oracle about labels of data in

(21) Return

(1) Input the value of

(2)

(3) Suppose that the number of different labels in

(4) Merge

(5)

(6)

(7) Compute the

(8)

(9)

(10)

(11)

(12)

(13)

(14) Label the remaining unlabeled data according to the KNN rule.

(15) Output the clustering results.
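Step (14)'s KNN rule, a majority vote among the k nearest labeled points, can be sketched as follows (k = 3 is an arbitrary example value, not the paper's setting):

```python
import math
from collections import Counter

def knn_label(point, labeled_points, k=3):
    """Assign `point` the most frequent label among its k nearest labeled points.

    `labeled_points` is a list of (coordinates, label) pairs."""
    nearest = sorted(labeled_points, key=lambda pl: math.dist(point, pl[0]))[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]

labeled = [((0, 0), 'a'), ((1, 0), 'a'), ((0.5, 0.5), 'a'),
           ((5, 0), 'b'), ((6, 0), 'b')]
```

After the threshold-based expansion, only the points no cluster could absorb are labeled this way.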

If the dataset is imbalanced and we randomly select a small number of data points from it, some clusters may have no selected data. Using these selected data as labeled data to guide clustering, the objects in clusters with no selected data are forcibly assigned to other clusters. This decreases the accuracy of the semisupervised clustering algorithm and yields unsatisfying results. To make the selected data cover as many clusters as possible, an active mechanism for selecting data points is presented. It partitions a given dataset into

Algorithm

Using a small number of labeled data to achieve high clustering accuracy is challenging, especially when the dataset is imbalanced and multidensity; a semisupervised clustering algorithm should exploit the characteristics of the labeled dataset to guide its clustering process. In this paper, first, the MST clustering results are merged according to the labels of their labeled data (each cluster has one and only one labeled data point). Since the densities of the clusters may differ, the same expanding threshold should not be used when expanding the labeled dataset by label propagation. Second, the expanding threshold of each cluster is obtained automatically from its density and is used to expand the labeled dataset within that cluster. Finally, each remaining unlabeled data point is assigned the most frequent label among its k nearest neighbors.
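Label propagation under per-cluster thresholds can be sketched as follows; the breadth-first expansion and the `thresholds` mapping are illustrative assumptions, not the paper's exact procedure:

```python
import math

def expand_labels(points, labeled, thresholds):
    """Breadth-first label propagation: an unlabeled point adopts the label of
    a labeled neighbor whenever their distance is within that label's own
    expanding threshold. `thresholds[label]` is the per-cluster threshold."""
    labels = dict(labeled)          # index -> label
    frontier = list(labeled)
    while frontier:
        i, lab = frontier.pop()
        for j, p in enumerate(points):
            if j not in labels and math.dist(points[i], p) <= thresholds[lab]:
                labels[j] = lab
                frontier.append((j, lab))
    return labels

# Two clusters of different density: one spaced 0.5 apart, one 0.4 apart.
pts = [(0, 0), (0.5, 0), (1.0, 0), (5, 0), (5.4, 0), (5.8, 0)]
seeds = [(0, 'a'), (3, 'b')]
result = expand_labels(pts, seeds, thresholds={'a': 0.6, 'b': 0.5})
```

Because each label carries its own threshold, the dense cluster's small threshold never leaks across the gap between the clusters.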

The

(1) Get all the labeled data which belong to

(2) Let

(3)

(4)

(5)

(6)

(7)

(8)

(9)

(10)

(11)

(12)

(13)

(14)

(15)

(16) Delete

(17)

(18) Return
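One way the per-cluster expanding threshold could be derived from density is sketched below, under the assumption that mean nearest-neighbour distance serves as the density measure and `factor` is a tunable constant; both are illustrative choices, not the paper's formula:

```python
import math

def expanding_threshold(cluster_points, factor=1.5):
    """Expanding threshold for one cluster: the mean nearest-neighbour
    distance within the cluster, scaled by an assumed factor."""
    def nn_dist(i):
        return min(math.dist(cluster_points[i], q)
                   for j, q in enumerate(cluster_points) if j != i)
    mean_nn = sum(nn_dist(i) for i in range(len(cluster_points))) / len(cluster_points)
    return factor * mean_nn
```

A dense cluster thus gets a small threshold and a sparse cluster a large one, which is what the multithreshold expansion requires.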

Algorithm

We use three standard datasets from UCI Machine Learning Repository [

This subsection demonstrates the performance of the proposed semisupervised clustering algorithm on three UCI datasets: IRIS, Wine, and Ecoli. To show that the proposed algorithm achieves higher accuracy than SSDBSCAN and Constrained-Kmeans on imbalanced and multidensity datasets, three datasets are constructed by deleting part of the data from some clusters of IRIS, Wine, and Ecoli.

The IRIS dataset, which contains 150 data objects, is perhaps the best-known dataset in the pattern recognition and data mining literature. IRIS contains 3 clusters of 50 data objects each. We turn IRIS into an imbalanced and multidensity dataset by randomly deleting 20 data objects from the second cluster; let modified IRIS denote this dataset. Since IRIS contains only 150 data objects, we select 2, 3, 4, 5, 6, 7, 8, 9, and 10 percent of the data from IRIS and modified IRIS, respectively, as labeled datasets, and view the rest of the data as unlabeled. The experimental results are shown in Figures

Clustering accuracy (%) obtained with the proposed algorithm and other algorithms on IRIS.

Clustering accuracy (%) obtained with the proposed algorithm and other algorithms on modified IRIS.

Figure

Figure

The Wine dataset contains 178 data objects, which fall into 3 clusters of sizes 59, 71, and 48, respectively. We adopt the same method to turn Wine into an imbalanced and multidensity dataset by randomly removing 25 data objects from the first cluster; let modified Wine denote this dataset. We select 2, 3, 4, 5, 6, 7, 8, 9, and 10 percent of the data from Wine and modified Wine, respectively, as labeled datasets, and view the rest of the data as unlabeled. Figures

Clustering accuracy (%) obtained with the proposed algorithm and other algorithms on Wine.

Clustering accuracy (%) obtained with the proposed algorithm and other algorithms on modified Wine.

Figure

Figure

The Ecoli dataset, which contains 336 data objects, has 8 clusters. The sizes of the 8 clusters are 143, 77, 52, 35, 20, 5, 2, and 2, respectively.

Since the last three clusters each contain fewer than 6 data objects, they can be viewed as noise and are deleted in the experiment. We select 2, 3, 4, 5, 6, 7, 8, 9, and 10 percent of the Ecoli dataset, respectively. The experimental results are shown in Figure

Clustering accuracy (%) obtained with the proposed algorithm and other algorithms on Ecoli.

In this subsection, we generate a synthetic imbalanced and multidensity dataset of 2500 data objects with two attributes; the data can be partitioned into 4 clusters of sizes 1000, 100, 800, and 600, respectively. These data are shown in Figure

Clustering accuracy (%) obtained with the proposed algorithm and other algorithms on Synthetic.

The accuracy of Constrained-Kmeans and SSDBSCAN depends heavily on the labeled data. Although we select 10% of all data, not a single point from the second cluster is selected as labeled data, so in the clustering results the data objects in the second cluster are assigned to other clusters; this phenomenon appears in the clustering results of SSDBSCAN. In addition, even when Constrained-Kmeans selects labeled data from all of the clusters, it assigns some data objects from the other three clusters to the second cluster, which is why the accuracy of Constrained-Kmeans does not improve as the percentage of labeled data increases. Figure

A new active semisupervised clustering algorithm is proposed that actively selects informative data from the clustering results of the MST. Labeling these data and using them to label their

This paper is supported by the Fundamental Research Funds for the Central Universities (lzujbky-2012-212) and is partially supported by the IBM 2010 X10 Innovation Awards Project.