This paper presents an assembled unsupervised learning framework that adopts information from the supervised learning process, and gives a corresponding implementation algorithm. The algorithm consists of two phases: first extracting and clustering data representatives (DRs) to obtain labeled training data, and then classifying non-DRs based on the labeled DRs. The implementation algorithm is called SDSN since it employs the tuning-scaled Support vector domain description to collect DRs, uses a Spectrum-based method to cluster DRs, and adopts the Nearest neighbor classifier to label non-DRs. The validity of the clustering procedure in the first phase is analyzed theoretically. In the second phase, a new metric is defined in a data-dependent way so that the nearest neighbor classifier can work with the learned discriminative information. A fast training approach for DR extraction is provided for greater efficiency. Experimental results on synthetic and real datasets verify the correctness and performance of the proposed idea and show that SDSN is more practical than the traditional, purely unsupervised clustering procedure.
Among the diverse issues of data mining, clustering is a fundamental one owing to its ability to provide data group information before other, further mining tasks are carried out. Clustering pursues a data partition with maximum cohesion and minimum coupling [
Our main idea is implemented by a concrete algorithm, SDSN, which involves the support vector technique [
The contributions of SDSN are as follows.
Experiments are conducted to compare SDSN with some state-of-the-art clustering methods, namely, purely unsupervised clustering methods. Empirical results show that SDSN behaves better than or competitively with its peers, which verifies the validity and performance of our two-phase clustering idea. Besides, viewed from another perspective, SDSN can be regarded as a type of support vector clustering algorithm. So, experiments are also conducted to compare SDSN with popular clustering methods based on the support vector technique. Empirical evidence indicates the improvement of SDSN over its counterparts in efficiency and performance.
In the following description, Section
The essence of the two-phase clustering idea is to learn discriminative information from data twice by integrating supervised information into the unsupervised learning process. In detail, it discovers DRs, learns label information from the DRs, and trains a classifier based on the labeled DRs. The remaining unlabeled data are then classified by this classifier.
Compared with a purely unsupervised clustering process, which detects clusters from all unlabeled data only once, the proposed idea learns the discriminative information of clusters twice. The first learning happens during DR collection: cluster distribution information is investigated by finding the DRs located on the boundaries of dense regions. The second happens when the DRs are clustered: grouping information is probed once again, and this information is carried by the classifier trained on the labeled DRs. This, in turn, endows the resulting classifier with strong discriminative ability for labeling the other data.
In the first phase, DRs that are able to describe the sketch of the dataset are collected. Existing data reduction approaches fall into two main types. The first type specifies DRs through random sampling; the second finds DRs based on probability distribution information. Clearly, the former incurs randomness and consequently can lead to unexpected results. The latter often produces unsatisfactory results because the exact probability distribution of a real dataset is usually unknown.
To address these difficulties, we exploit SVDD to produce support vectors, which serve as DRs. The reason for choosing SVDD to search for DRs is its boundary-detecting mechanism, which gives SVDD the ability to handle arbitrarily shaped clusters. Furthermore, the kernel function employed by SVDD enhances its generalization ability. However, the support vectors produced by SVDD describe only the cluster boundaries, without providing information about the inner regions of clusters. Therefore, we propose the tuning-scaled SVDD to yield support vectors that can describe the dataset sketch. Besides, we give a fast training method for the tuning-scaled SVDD to improve efficiency. Classical SVDD is now introduced in brief.
Given the
Points with
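For illustration, the SVDD dual (maximize sum_i a_i K(x_i, x_i) - sum_{i,j} a_i a_j K(x_i, x_j) subject to sum_i a_i = 1 and 0 <= a_i <= C) can be solved with simple SMO-style pairwise updates; since K(x, x) = 1 for a Gaussian kernel, this amounts to minimizing a^T K a over the capped simplex. The pure-Python sketch below is our own minimal solver with arbitrary parameter values, not the paper's implementation:

```python
import math, random

def gauss(x, y, q=1.0):
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))

def svdd(data, C=1.0, q=1.0, iters=5000, seed=0):
    """Minimal SVDD dual solver: minimize a^T K a subject to
    sum(a) = 1, 0 <= a_i <= C, via SMO-style pairwise updates.
    (With a Gaussian kernel, K_ii = 1, so this is equivalent to
    maximizing the standard SVDD dual objective.)"""
    n = len(data)
    K = [[gauss(data[i], data[j], q) for j in range(n)] for i in range(n)]
    a = [1.0 / n] * n                     # feasible start on the simplex
    g = [sum(K[i][j] * a[j] for j in range(n)) for i in range(n)]  # g = K a
    rng = random.Random(seed)
    for _ in range(iters):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue
        denom = K[i][i] + K[j][j] - 2 * K[i][j]
        if denom <= 1e-12:
            continue
        # optimal mass transfer from a_j to a_i, clipped to the box
        d = (g[j] - g[i]) / denom
        d = max(-a[i], min(d, a[j]))      # keep a_i >= 0 and a_j >= 0
        d = min(d, C - a[i])              # keep a_i <= C
        d = max(d, -(C - a[j]))           # keep a_j <= C
        if d == 0.0:
            continue
        a[i] += d
        a[j] -= d
        for k in range(n):                # update g = K a incrementally
            g[k] += d * (K[k][i] - K[k][j])
    svs = [i for i in range(n) if a[i] > 1e-6]   # points with nonzero support
    return a, svs

alpha, sv_idx = svdd([(float(i),) for i in range(5)], C=1.0, q=0.05)
```

On this one-dimensional toy set the support mass concentrates on the extreme (boundary) points, illustrating SVDD's boundary-detecting behavior.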
The original SVs are located only on cluster boundaries. In this paper, however, we tune the scale parameter of the Gaussian kernel of SVDD data-adaptively, which enhances the differences among data points; consequently, more SVs are produced, describing both cluster contours and the contours of small, high-density inner-cluster regions. We therefore name the SVs produced by the tuning-scaled SVDD semi-support vectors (semi-SVs). Obviously, these semi-SVs develop a sketch of the dataset, and they are qualified to work as DRs.
In detail, for
Two figures in Figure
SVDD with
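The paper's exact scale-tuning rule is not reproduced above, so the sketch below substitutes a common data-adaptive choice (local scaling in the style of Zelnik-Manor and Perona): each point's scale is its distance to its k-th nearest neighbor, so dense regions get smaller scales, which sharpens the kernel there. This is an illustrative assumption, not the paper's tuning rule:

```python
import math

def local_scales(data, k=2):
    """sigma_i = distance from x_i to its k-th nearest neighbour.
    Points in dense regions get small scales, which enhances the
    differences among nearby points."""
    scales = []
    for i, x in enumerate(data):
        dists = sorted(math.dist(x, y) for j, y in enumerate(data) if j != i)
        scales.append(dists[k - 1])
    return scales

def tuned_kernel(x, y, sx, sy):
    """Gaussian kernel with a per-pair, data-adaptive scale."""
    return math.exp(-math.dist(x, y) ** 2 / (sx * sy))

data = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 0.0), (8.0, 0.0)]
scales = local_scales(data, k=2)
```

With a per-point scale like this, a sharper kernel in dense regions tends to promote additional support vectors inside clusters, in the spirit of the semi-SVs described above.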
In the first phase, after extracting DRs, we cluster the DRs with a spectrum-analysis-based method. The clustering method has the following steps.
In feature space, after the tuning-scaled SVDD is conducted, data are mapped onto the hypersphere. Suppose this sphere is
Denote
For any
Although we call the points with nonzero support values semi-SVs, they have the same properties as SVs. Therefore, based on the previous lemma, we have the following geometric property of semi-SVs in feature space [
In the feature space of the Gaussian kernel, semi-SVs are collected in terms of clusters on the intersection hyperline of
In this study, we give two geometric properties of the hyperspheres produced by SVDD. These two properties are expected to extend the proposed algorithm to the multi-classification problem in future work.
If the SVDD procedure is conducted on each class with the same kernel scale, multiple hyperspheres are guaranteed to exist in the same feature space.
For kernel-based algorithms, the kernel function defines a nonlinear map
The maps of the same data created by different kernelbased methods are the same, provided that the Gaussian kernels have the same scale parameter.
According to Proposition
Based on the previous two propositions, the next job is to probe further geometric properties of SVs and to investigate the relationship between the geometric positions of SVs and the hyperspheres, with the intention of developing a wise discrimination approach. This forms a future research direction: addressing the multi-classification problem by conducting the SVDD procedure on each class and building a new labeling approach.
The validity of the clustering method is interpreted as follows.
Firstly, SVs are grouped in terms of clusters on the intersection circle, which means that their distribution in feature space has regular directions. Angle information is therefore a good criterion for detecting their clusters. Angular distance is measured by the cosine value, and because SVs lie on the surface of the unit ball, the cosine value reduces to the inner product:
Secondly, according to the idea of spectrum analysis, if we eigendecompose the affinity matrix of a dataset, the inherent distribution information of the dataset is revealed. This distribution information is described by the eigenvectors, because these vectors describe the distribution directions. For our algorithm, the eigendecomposition of the kernel matrix yields eigenvectors that describe the distribution directions based on angle information. The subsequent clustering operations are therefore well expected to produce correct clusters.
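These two observations can be combined into a minimal sketch. Because ||phi(x)|| = 1 in the feature space of a Gaussian kernel, the cosine of the angle between phi(x_i) and phi(x_j) is exactly the kernel value K(x_i, x_j), so the kernel matrix doubles as an angle-based affinity matrix. The pure-Python sketch below uses power iteration with deflation instead of a full eigendecomposition and handles only the two-cluster case; it is our own illustration, not the paper's procedure:

```python
import math

def gauss(x, y, q=1.0):
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))

def power_iter(A, steps=200):
    """Dominant eigenpair of a symmetric matrix by power iteration."""
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(steps):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(c * c for c in w))
        v = [c / lam for c in w]
    return lam, v

def spectral_two_way(points, q=1.0):
    """Cluster by angle information in feature space: the affinity
    A_ij = K(x_i, x_j) equals cos(angle(phi(x_i), phi(x_j))) because
    ||phi(x)|| = 1 for a Gaussian kernel.  For two well-separated
    groups the two leading eigenvectors localise on the two groups;
    each point joins the eigenvector with the larger component."""
    n = len(points)
    A = [[gauss(points[i], points[j], q) for j in range(n)] for i in range(n)]
    lam1, v1 = power_iter(A)
    # deflate the top eigenpair, then extract the second
    B = [[A[i][j] - lam1 * v1[i] * v1[j] for j in range(n)] for i in range(n)]
    _, v2 = power_iter(B)
    return [0 if abs(v1[i]) >= abs(v2[i]) else 1 for i in range(n)]

groups = spectral_two_way([(0.0,), (0.1,), (0.2,), (10.0,), (10.1,)])
```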
With the labeled DRs in hand, we label the remaining data using the NN classifier. Like any distance-based classifier, NN faces two scenarios of clusters. The first is convex-shaped clusters, where spatial information can reflect cluster structures; the other is nonconvex clusters. In the latter scenario, the Euclidean metric often fails, and the cluster structures become important for cluster detection. This can be illustrated in Figure
Dataset1 (a), the shortest path of Euclidean (b), and the shortest paths of
In Figure
The key to the new metric is to investigate the multiple reachable paths between two points, including the straight line segment that directly connects them and the indirect paths that consist of multiple line segments. The final distance is determined by the shortest path, that is, the path of minimal length. In computing the lengths of indirect paths, neighboring information is exploited, and this information spreads along the paths so that the message of global cluster structures is accumulated gradually and integrated into the metric formulation.
Given dataset
Here, the Euclidean distance is rescaled by an exponential mechanism with the intention of reducing the distance between similar points while enlarging the distance between dissimilar points. This effectively strengthens cluster boundaries.
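A minimal sketch of this rescaling effect, assuming the common form exp(rho*d) - 1 (the paper's exact formula is not shown above):

```python
import math

def rescaled(d, rho=1.0):
    """One standard exponential rescaling of a Euclidean distance d
    (an assumption; the paper's exact formula is not reproduced
    here): roughly linear (rho * d) for small d, but growing
    sharply for large d."""
    return math.exp(rho * d) - 1.0

# a long between-cluster gap is amplified far more than a short
# within-cluster gap, which effectively strengthens cluster boundaries
near, far = 0.5, 3.0
amplification = (rescaled(far) / rescaled(near)) / (far / near)
```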
Then the path length is defined. For a path
Therein,
We now give the computation steps of the new metric. Different from the classical Dijkstra algorithm [
For the
for
Therein, procedure
[Algorithm listing, steps (1)-(13): an initialization step, two while-loop passes containing for/if relaxation updates, and two final steps that fill the distance matrices; the mathematical details of the individual steps are elided.]
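Putting the pieces together, a hedged pure-Python sketch of such a metric: edge lengths are exponentially rescaled Euclidean distances (again assuming the exp(rho*d) - 1 form), and the distance between two points is the shortest path over the complete graph, computed Dijkstra-style with a heap. This illustrates the idea, not the paper's exact procedure:

```python
import heapq, math

def path_metric(points, rho=1.0):
    """All-pairs shortest-path distances over the complete graph whose
    edge (i, j) has length exp(rho * ||x_i - x_j||) - 1.  Chains of
    short hops through a cluster are cheaper than one long jump, so
    the metric follows nonconvex cluster structure."""
    n = len(points)
    edge = [[math.exp(rho * math.dist(points[i], points[j])) - 1.0
             for j in range(n)] for i in range(n)]
    dist = []
    for s in range(n):                       # Dijkstra from each source
        d = [math.inf] * n
        d[s] = 0.0
        heap = [(0.0, s)]
        while heap:
            du, u = heapq.heappop(heap)
            if du > d[u]:
                continue                     # stale heap entry
            for v in range(n):
                nd = du + edge[u][v]
                if nd < d[v]:
                    d[v] = nd
                    heapq.heappush(heap, (nd, v))
        dist.append(d)
    return dist

pts = [(0.0,), (1.0,), (2.0,)]
D = path_metric(pts, rho=1.0)
```

For three collinear points, the two-hop path 0 -> 1 -> 2 of length 2(e - 1) is shorter than the direct edge of length e^2 - 1, so the metric prefers routes that pass through intermediate points of the same cluster.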
In
To check the performance of
Dataset2 (a), clusters of Euclidean (b), and clusters of
Dataset3 (a), clusters of Euclidean (b), clusters produced by
The shortest path produced by Euclidean (a) and the shortest path produced by
The shortest path produced by Euclidean (a), the shortest path produced by
Three metrics are introduced into
The DR extraction process of SDSN is essentially a quadratic optimization problem, which incurs a high computational cost. We therefore give a fast training approach based on incremental learning to improve efficiency.
Like common incremental methods, our approach sets up a working set and divides the remaining data into batches. These batches are merged into the working set incrementally. The novelty of our approach is that the batches are sorted in an order that benefits the incremental development of the problem solution. For the SVDD optimization problem, the solution is the support values
Schrödinger equation (SE) [
In quantum mechanics, SE describes the law of energy conservation of a particle, where
To apply SE in machine learning,
In geometric terms,
It has been known that
In our fast training approach,
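The Schrödinger-equation-derived quantity used for sorting is not fully reproduced above. As an illustrative stand-in in the same spirit, the sketch below sorts points by a Parzen-window wave function psi (small psi marks sparse, boundary-like regions, where support vectors are likely) and then slices the ordering into batches; both the choice of psi and the ascending order are our assumptions, not the paper's exact criterion:

```python
import math

def parzen_psi(data, sigma=1.0):
    """Parzen-window estimate psi(x_i) = sum_j exp(-||x_i - x_j||^2 /
    (2 sigma^2)); small psi marks sparse or boundary-like regions."""
    return [sum(math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))
                for y in data) for x in data]

def sorted_batches(data, batch_size, sigma=1.0):
    """Order points by ascending psi (likely support vectors first)
    and slice the ordering into batches for incremental training."""
    psi = parzen_psi(data, sigma)
    order = sorted(range(len(data)), key=lambda i: psi[i])
    return [order[k:k + batch_size] for k in range(0, len(order), batch_size)]

data = [(0.0,), (0.1,), (0.2,), (5.0,)]
batches = sorted_batches(data, batch_size=2)
```

Here the isolated point at x = 5 has the smallest psi and therefore enters the working set in the first batch.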
We now check the quality of the fast training approach. In the experiments, the SVDD optimization process is conducted in three ways:
Performance comparison of fast training approaches.
Dataset (size) | (1) no. of SVs | (2) no. of SVs | (2) no. of shared | (3) no. of SVs | (3) no. of shared
Wine (178)     | 22             | 21             | 18                | 21             | 20
Liver (345)    | 43             | 40             | 39                | 41             | 40
Monk3 (432)    | 102            | 96             | 92                | 95             | 95
Letter (520)   | 115            | 115            | 102               | 112            | 108
Datasets are taken from the UCI machine learning repository. In Table
From Table
Besides, we record the time consumption of three processes in Table
Time comparison of fast training approaches (seconds).
Methods | Wine | Liver | Monk3 | Letter
(1)     | 2.32 | 68.1  | 165.2 | 112.7
(2)     | 0.52 | 1.58  | 6.72  | 3.93
(3)     | 0.10 | 0.74  | 3.70  | 1.29
Since SDSN is designed on the basis of the support vector clustering method, experiments are first conducted to compare the performance of SDSN with support vector clustering (SVC) and its variants. To be self-contained, SVC and some of its variants are introduced here in brief. SVC consists of an optimization piece and a labeling piece. The original labeling procedure is called complete graph (CG). CG identifies clusters by constructing a complete graph and taking the connected components of the graph as clusters [
Therein,
Variants of SVC focus on revising its labeling process. Support vector graph (SVG) [
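CG's labeling step can be sketched as follows: two points are adjacent if every sampled point on the segment between them passes an "inside the learned sphere" test, and clusters are the connected components of the resulting graph. The `inside` predicate below is a hypothetical stand-in for the trained SVDD radius test, used only to make the sketch runnable:

```python
def cg_labels(points, inside, samples=10):
    """Complete-graph labeling: i and j are adjacent if every sampled
    point on the segment [x_i, x_j] satisfies the 'inside the sphere'
    test; clusters are the connected components of this graph."""
    n = len(points)
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ok = all(inside(tuple(a + (t / samples) * (b - a)
                                  for a, b in zip(points[i], points[j])))
                     for t in range(1, samples))
            adj[i][j] = adj[j][i] = ok
    labels, nxt = [-1] * n, 0             # connected components by DFS
    for s in range(n):
        if labels[s] != -1:
            continue
        stack = [s]
        labels[s] = nxt
        while stack:
            u = stack.pop()
            for v in range(n):
                if adj[u][v] and labels[v] == -1:
                    labels[v] = nxt
                    stack.append(v)
        nxt += 1
    return labels

# hypothetical "inside the sphere" region: two separated intervals
inside_toy = lambda x: x[0] < 2.0 or x[0] > 9.0
cg = cg_labels([(0.0,), (1.0,), (10.0,), (11.0,)], inside_toy)
```

Because the segment between the two groups leaves the "inside" region, the graph splits into two components, which become the two clusters.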
Besides the previous datasets, the news groups dataset is tested. This dataset contains about 20,000 articles divided into 20 news groups. They are named as
Clustering accuracy comparison (
Dataset | CG1  | CG2  | CG3  | SVG  | PG   | GD   | CCL  | SDSN
(1)     | 84.9 | 85.2 | 85.8 | 81.4 | —    | 86.5 | 86.6 | 86.6
(2)     | 82.8 | 83.2 | 83.9 | 80.3 | —    | 80.0 | 84.5 | 84.3
(3)     | 78.6 | 79.1 | 79.6 | 78   | —    | 78.7 | 79.6 | 79.5
(4)     | 66.3 | 67.5 | 67.8 | 65   | —    | 68   | 68.3 | 68.5
(5)     | 66.3 | 67.2 | 67.8 | 65.2 | —    | 65   | 69   | 68.6
Wine    | 94.8 | 94.8 | 95.5 | 93   | 94.3 | 95   | 96.6 | 95
Liver   | 70.0 | 70.3 | 70.5 | 69.1 | 68.1 | 70.0 | 71.3 | 71.3
Monk3   | 96.6 | 96.6 | 97.1 | 95.8 | 95.3 | 96.1 | 97.1 | 97.3
Letter  | 87.2 | 87.2 | 87.8 | 87.0 | 86.9 | 85.3 | 88.5 | 88.2
From Table
One-run time consumptions of the methods (seconds).
Now, we compare SDSN with some clustering methods:
Comparison of clustering accuracies of methods (

Dataset |      | Girolami | NI   | NJW  | SDSN
(1)     | 79.4 | 83.9     | 82.8 | 86.7 | 86.6
(2)     | 81.8 | 83.7     | 80.2 | 85   | 84.3
(3)     | 75   | 78       | 73   | 81.1 | 79.5
(4)     | 66.7 | 68       | 67.5 | 71.6 | 68.5
(5)     | 65.3 | 66.2     | 66.8 | 68.9 | 68.6
Wine    | 94   | 95.7     | 96   | 97.5 | 95.8
Liver   | 71.1 | 72.6     | 70.3 | 73.1 | 71.3
Monk3   | 96.2 | 97       | 96.4 | 97.3 | 97.3
Letter  | 86.9 | 87.2     | 87.5 | 90.5 | 88.2
Among the five methods, NJW is found to be the best clustering tool. It probes the spectrum of all the data to obtain the inherent distribution directions, which greatly assists in detecting correct clusters. But NJW incurs a huge cost, with
This paper presents a general clustering framework that first extracts and clusters DRs and then classifies the non-DRs. The idea is implemented by the algorithm SDSN. SDSN consists of a tuning-scaled SVDD optimization that extracts DRs, a spectrum-analysis-based method that clusters the DRs, a nearest neighbor classifier that works with an informed metric to classify the non-DRs, and a fast training approach that eases the computation. To establish the validity of the spectrum-analysis-based method, the geometric property of the Gaussian kernel feature space is explored and proved. Experiments demonstrate the performance of the tuning-scaled SVDD, of the new metric, and of the fast training approach. The advantage of SDSN over the purely unsupervised clustering approach is verified as well.
Since supervised information can be utilized in the unsupervised learning process, it is also reasonable to expect that clustering information can help the classification mining task. Future work is focused on solving the multi-classification problem based on the geometric properties of multiple hyperspheres in feature space, as mentioned in Section
This research is partially supported by the National Natural Science Foundation of China under Grants nos. 61105129, 61304174, and 11226146 and the Natural Science Foundation of Jiangsu Province of China under Grant no. BK2011581.