As one of the typical clustering algorithms, heuristic clustering is characterized by its flexibility in feature integration. This paper proposes a heuristic algorithm based on cognitive feature integration. The proposed algorithm employs nonparametric density estimation and maximum likelihood estimation to integrate global and local cognitive features and finally outputs satisfactory clustering results. The new approach possesses great extensibility, which enables priors to be supplemented and misclassifications to be adjusted during the clustering process. The advantages of the new approach are as follows: (1) it is effective in recognizing stable clustering results without priors given in advance; (2) it can be applied to complex data sets and is not restricted by the density and shape of the clusters; and (3) it is effective in recognizing noise and outliers, so they do not need to be eliminated in advance. Experiments on synthetic and real data sets demonstrate the better performance of the new algorithm.

Clustering is an automatic process of partitioning a data set with a proper similarity measurement. In this process, data in the same group have maximum similarity, while data in different groups have minimum similarity. As an unsupervised learning method, clustering is greatly influenced by the similarity measurement and is closely related to priors in application fields. Clustering is extensively applied in fields such as biology [

Clustering has developed into many types of algorithms through its successful application in those fields. According to the clustering process, it is generally classified into two types: agglomerative clustering and partitional clustering. Agglomerative clustering initially treats each data point as its own cluster and then merges clusters according to certain principles, thus forming proper data groups. Partitional clustering redistributes the existing data groups and forms proper clusters, such as k-means [

According to different similarity measurements, clustering can also be subdivided into many types: local density-based clustering, clustering based on density estimation, clustering based on matrix calculation, clustering based on graph calculation, grid-based clustering, and so on.

Clustering based on local density features is a typical clustering approach. Its advantages are simple calculation and high efficiency; a representative example is clustering with density peaks (CDP) [

Density-based clustering algorithms constitute a significant proportion of clustering research, such as mean shift [

One typical clustering algorithm is spectral clustering [

There are two problems in clustering: one is the definition of similarity, including Euclidean and non-Euclidean dissimilarity; the other is the treatment of similarity. Different algorithms employ different methods of similarity treatment. The first problem is subjective because it is closely related to priors in the application field. The second problem lies in centroid estimation or boundary description. For most clustering algorithms, mistakes made in the middle of the process propagate to the final clustering result, which means that the ability to correct errors needs to be improved.
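To make the first problem concrete, the following minimal sketch contrasts a Euclidean dissimilarity with a non-Euclidean one (cosine dissimilarity, chosen here only as an illustration; the function names are our own, not the paper's):

```python
import math

def euclidean(a, b):
    # Euclidean distance: small values mean high similarity.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_dissimilarity(a, b):
    # A non-Euclidean alternative: 1 - cosine similarity,
    # sensitive to direction rather than magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

a, b = (1.0, 0.0), (2.0, 0.0)
print(euclidean(a, b))             # 1.0: the points are one unit apart
print(cosine_dissimilarity(a, b))  # 0.0: same direction, fully similar
```

The two measures can disagree sharply on the same pair of points, which is why the choice of dissimilarity is subjective and tied to priors in the application field.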

This paper proposes a heuristic clustering algorithm based on cognitive feature capturing. The proposed algorithm is a centroid learning process that captures data structure features with a new type of similarity measurement to obtain better clustering results. For feature description, three cognitive features are considered: neighbourhood, density difference, and connectivity. A similarity measurement (a new kernel function) is established to capture the three features. For clustering, the paper proposes a heuristic algorithm based on centroid learning. The proposed algorithm possesses great extensibility: it allows priors to be supplemented and misclassifications to be adjusted during the clustering process, which to some extent weakens the subjective dependence on the similarity measurement.

In Section

In clustering, cognitive features such as neighbourhood, density difference, and connectivity are of great significance for the identification of clusters. Samples at closer distances tend to be recognized as belonging to one cluster. Samples with a large density difference tend to be recognized as belonging to different clusters, while samples with little density difference can extend continuously into one cluster. Capturing these features, which always depends on the similarity measurement, helps in the recognition of clusters.
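The three features can be given simple computable proxies. The sketch below is illustrative only (it uses a k-nearest-neighbour construction of our own; the paper's kernel function is defined differently): neighbourhood as the k nearest neighbours, density as the inverse mean neighbour distance, and connectivity as overlap of neighbourhoods.

```python
import math

def knn(points, i, k):
    # Neighbourhood: indices of the k nearest neighbours of point i.
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[1:k + 1]  # skip the point itself

def local_density(points, i, k):
    # Density proxy: inverse of the mean distance to the k nearest
    # neighbours (larger value = denser surroundings).
    nbrs = knn(points, i, k)
    mean_d = sum(math.dist(points[i], points[j]) for j in nbrs) / k
    return 1.0 / mean_d

def connected(points, i, j, k):
    # Connectivity proxy: two points are linked when their
    # k-nearest neighbourhoods overlap.
    return bool(set(knn(points, i, k)) & set(knn(points, j, k)))
```

On a toy set with three tightly packed points and one distant outlier, the packed points show a much higher density proxy than the outlier and are mutually connected, matching the intuition described above.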

For data set

In Equation (

Similarity measurement (

As one of the most important types of clustering algorithm, centroid estimation continuously searches for the centroids of local areas or of the whole data set. In data set

Equation (

Combination of local centroid estimation (

The new heuristic clustering algorithm is a process of searching for local density maximum (centroid) and is expressed as

With gradient (

Heuristic clustering algorithms (

Heuristic clustering algorithm (

In clustering, the number of clusters decreases constantly until only one cluster, or the required number of clusters, remains. In this process, since the variation of clusters is regular, some clusters remain stable for a relatively long period, which is called the survival period of the cluster and is expressed as

The heuristic clustering algorithm based on centroid learning is a process of centroid estimation, in which the centroids of local areas are searched constantly until satisfactory results are obtained. The centroid learning process makes good use of the clustering structure information of the data set and performs better on data sets with manifold structures and varied density structures. In this section, the proposed algorithm is tested on synthetic and real data sets and is also applied to volume rendering to show its good performance.
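The centroid-learning idea of iterating toward local density maxima can be sketched with a mean-shift-style update (this is a generic Gaussian-kernel stand-in, not the paper's kernel; `shift_once`, `find_centroid`, and the bandwidth `h` are our own illustrative names):

```python
import math

def shift_once(x, points, h):
    # One centroid-learning step: move x toward the kernel-weighted
    # mean of its neighbourhood (mean-shift-style update).
    w = [math.exp(-math.dist(x, p) ** 2 / (2 * h * h)) for p in points]
    total = sum(w)
    return tuple(sum(wi * p[d] for wi, p in zip(w, points)) / total
                 for d in range(len(x)))

def find_centroid(x, points, h, tol=1e-6, max_iter=200):
    # Iterate until the shift is negligible: x has then reached a
    # local density maximum, i.e. an estimated centroid.
    for _ in range(max_iter):
        nx = shift_once(x, points, h)
        if math.dist(x, nx) < tol:
            break
        x = nx
    return x
```

Starting from any point near a cluster, the iteration converges to that cluster's density peak; points converging to the same peak would then be grouped together.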

For heuristic clustering algorithm based on centroid learning, nearest neighbourhood parameter

The clustering process of a data set is shown in Figure

Procedure of the clustering. The clustering result with the longest lifetime (5 clusters) is recognized as the final result. The result with 6 clusters is also stable; in it, an outlier is recognized as a separate cluster.
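Selecting the result with the longest lifetime can be sketched as follows (a minimal illustration of the survival-period rule; the function name and the example sequence of cluster counts are our own):

```python
def longest_lifetime(counts):
    # counts: number of clusters observed at each step of the merging
    # process. The count that stays unchanged for the most consecutive
    # steps (its "survival period") is taken as the final number.
    best, best_len, cur, cur_len = counts[0], 0, counts[0], 0
    for c in counts:
        cur_len = cur_len + 1 if c == cur else 1
        cur = c
        if cur_len > best_len:
            best, best_len = c, cur_len
    return best

# e.g. cluster counts observed while merging: 5 survives longest
print(longest_lifetime([9, 8, 7, 6, 6, 5, 5, 5, 5, 4, 3, 2, 1]))  # 5
```

In the example sequence, the 5-cluster configuration survives four consecutive steps, longer than any other count, so 5 is chosen as the final number of clusters.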

Four synthetic data sets with various density and shape structures are employed to test the efficiency of the new algorithm. The four data sets in Figures

Comparison of clustering results with NMI.

Data sets | New | CSSF | Chameleon | Spectral-Ng | CDP |
---|---|---|---|---|---|
Airplane | 1.00 | 0.34 | 1.00 | 1.00 | 1.00 |
Anchor | 0.98 | 0.79 | 0.85 | 0.79 | 0.98 |
Ring | 1.00 | - | 0.28 | 1.00 | 0.82 |
Swissroll | 1.00 | - | 1.00 | 0.02 | 0.34 |

Clustering results of the new algorithm on the four synthetic data sets. (a) Result on the “Airplane” data set. (b) Result on the “Anchor” data set. (c) Result on the “Ring” data set. (d) Result on the “Swissroll” data set. The four data sets have typical cognitive features, such as density difference, connectivity, and manifold structure.

Four real data sets are employed to test the new clustering algorithm: pen-digit, iris, noise iris, and noise USPS-01. Pen-digit and iris, with the whole training data set, are from the UCI machine learning repository; the pen-digit data set contains 7494 images with 16 attributes, and the iris data set contains 150 samples with 4 attributes. Noise iris is composed of the iris data set plus 5% uniform random samples. The “USPS-01” data set contains 2200 images of the ‘0’ and ‘1’ classes drawn from the USPS handwritten digits, and “noise USPS-01” is “USPS-01” contaminated with 2% residual USPS samples.
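Contamination of the kind used for the noise iris set can be sketched as below. This is our own reading of "5% uniform random samples" (sampling uniformly within the bounding box of the original data); the paper does not specify the exact scheme, and `add_uniform_noise` is an illustrative name.

```python
import random

def add_uniform_noise(data, fraction, seed=0):
    # Append `fraction` * len(data) uniform random samples drawn from
    # the bounding box of the original data (assumed noise scheme).
    rng = random.Random(seed)
    dims = len(data[0])
    lo = [min(p[d] for p in data) for d in range(dims)]
    hi = [max(p[d] for p in data) for d in range(dims)]
    n_noise = round(len(data) * fraction)
    noise = [tuple(rng.uniform(lo[d], hi[d]) for d in range(dims))
             for _ in range(n_noise)]
    return data + noise

# e.g. 150 iris samples + 5% noise -> 158 points in total
```

A robust clustering algorithm should recognize these added points as noise rather than absorb them into the genuine clusters.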

The clustering results are compared with those of NRSC [

Comparison of clustering results measured by NMI.

Clustering is applied to volume rendering to improve the separability of structures with similar attributes. Volume rendering is a technique for displaying a 2D projection of a 3D discretely sampled data set, typically a 3D scalar field. A typical 3D data set is a group of 2D slice images acquired by a CT, MRI, or micro-CT scanner. Clustering can be used in the scalar field to separate similar structures, and the clustering result in the scalar field is transformed into a visualized image.

The new clustering is employed to separate adjacent structures in two volume rendering examples: one is an engine part from industry, and the other is the knee joints from medical imaging. The results of volume rendering with the new clustering are shown in Figure

Two clustering examples on volume rendering to separate structures with similar attributes. (a) A part of the 3D scalar field, which is labelled with a red rectangle in (b). (b) The visualization of clustering results in the scalar field. The visualization inside the red rectangle is obtained by clustering the data points in (a). (c) The visualization of knee joints from CT scanning. (d) The visualization of knee joints obtained by clustering data points in the 3D scalar field.

This paper proposes a heuristic clustering algorithm based on cognitive feature capturing. The new approach is effective in integrating the cognitive features of a data set and conducts clustering based on those features. The proposed approach is not restricted by the density and shape of the data set, and it handles manifold-structured data after dimensionality reduction especially well. Heuristic clustering shows an advantage in integrating data structure and is worth further research concerning feature-information feedback and the effective usage of data features in the clustering process.

The pen-digit and iris data sets are from the UCI machine learning repository, and the USPS handwritten digits data set is from

The authors declare that they have no conflicts of interest.

The study presented in this article is supported by the National Science Foundation of China, Research Grants no. 61305070 and no. 61703001.