This study proposes a novel method to calculate the density of data points based on K-nearest neighbors and Shannon entropy. A variant of tissue-like P systems with active membranes is introduced to realize the clustering process. The new variant of tissue-like P systems can improve the efficiency of the algorithm and reduce its computational complexity. Finally, experimental results on synthetic and real-world datasets show that the new method is more effective than other state-of-the-art clustering methods.
Clustering is an unsupervised learning method, which aims to divide a given population into several groups or classes, called clusters, in such a way that similar objects are put into the same group and dissimilar objects are put into different groups. Clustering methods generally include five categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods [
Usually, in clustering algorithms an objective function measuring the clustering quality is optimized by an iterative process. However, this iteration can make such algorithms inefficient. Thus, the density peaks clustering (DPC) algorithm was proposed by Rodriguez and Laio [
Membrane computing, proposed by Pǎun [
Based on previous works, the main motivation of this work is to use membrane systems to develop a framework for a density peak clustering algorithm. A new method of calculating the density of the data points is proposed based on the K-nearest neighbors and Shannon entropy. A variant of the tissue-like P system with active membranes is used to realize the clustering process. The new model of the P system can improve efficiency and reduce computational complexity. Experimental results show that this method is more effective and accurate than state-of-the-art methods.
The rest of this paper is organized as follows. Section
Rodriguez and Laio [
Let
The local density
After
The computation of the local densities of the data points is a key factor for the effectiveness and efficiency of the DPC. There are many other ways to calculate the local densities. For example, the local density of
The way in (
Each component on the right side of (
Three different function curves.
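Although the exact formulas above are specific to the cited equations, the two most common local-density definitions in the DPC literature are the cutoff kernel and the Gaussian kernel. A minimal NumPy sketch of both follows; the function name and signature are illustrative, not from the paper.

```python
import numpy as np

def local_densities(dist, d_c, kernel="gaussian"):
    """Local density of each point from a pairwise distance matrix.

    Two standard DPC variants (a sketch, not the paper's exact method):
      - "cutoff":   rho_i = #{ j != i : d_ij < d_c }
      - "gaussian": rho_i = sum_{j != i} exp(-(d_ij / d_c)**2)
    """
    n = dist.shape[0]
    off_diag = ~np.eye(n, dtype=bool)            # exclude each point's self-distance
    if kernel == "cutoff":
        rho = ((dist < d_c) & off_diag).sum(axis=1)
    else:
        rho = np.where(off_diag, np.exp(-(dist / d_c) ** 2), 0.0).sum(axis=1)
    return rho
```

The Gaussian kernel yields continuous density values and thus fewer ties than the integer-valued cutoff kernel, which is one reason it is often preferred on small datasets.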
A tissue-like P system has a graphical structure. The nodes of the graph correspond to the cells and the environment of the tissue-like P system, whereas the edges represent the channels for communication between cells. The tissue-like P system is slightly more complicated than the cell-like P system. Each cell has a state, and a state can change only if it satisfies the conditions specified by the rules. The basic framework of the tissue-like P system used in this study is shown in Figure
Membrane structure of a tissue-like P system.
A P system with active membranes is a construct:
The biggest difference between a cell-like P system and a tissue-like P system is that in the tissue-like P system every cell can communicate with the environment, whereas in the cell-like P system only the skin membrane can. This does not mean, however, that any two cells in a tissue-like P system can communicate with each other directly: if there is no direct communication channel between two cells, they can still communicate indirectly through the environment.
DPC nevertheless has some drawbacks. Its most obvious shortcoming is that it requires presetting the value of the cutoff distance
K-nearest neighbors (KNN) is widely used to characterize the local neighborhood of an instance in classification, clustering, local outlier detection, and related fields. The goal is to find the K nearest neighbors of a sample among N samples, where the distances between points are usually computed as Euclidean distances. Let KNN(
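As a concrete illustration of the KNN step, a plain NumPy sketch of finding each point's K nearest neighbors under the Euclidean distance (the function name is ours, not the paper's):

```python
import numpy as np

def knn_indices(X, k):
    """Return, for each row of X, the indices of its k nearest neighbors
    under Euclidean distance (a sketch of the standard KNN computation)."""
    # Pairwise squared Euclidean distances between all rows of X
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)            # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]    # k smallest distances per row
```

Squared distances suffice here because the square root is monotone and does not change the neighbor ordering.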
Shannon entropy measures the degree of disorder in a system. The more unstable the system is, the larger its Shannon entropy, and vice versa. The Shannon entropy, represented by
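The standard definition, H = -Σ_i p_i log₂ p_i over a discrete probability distribution, can be computed directly:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H = -sum_i p_i * log2(p_i) of a discrete distribution.

    Zero-probability terms are skipped, following the convention 0*log(0) = 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A uniform distribution over 4 outcomes gives the maximum entropy log₂ 4 = 2 bits, while a degenerate distribution gives 0, matching the intuition that a more "unstable" (uncertain) system has larger entropy.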
However, the decision graph is calculated by the product of
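In the standard DPC heuristic, the decision value is the product of the local density and the distance to the nearest higher-density point, and the points with the largest products are taken as cluster centers. A minimal sketch under that assumption (symbol names are ours):

```python
import numpy as np

def select_centers(rho, delta, n_clusters):
    """Pick cluster centers by the decision value gamma_i = rho_i * delta_i
    (the standard DPC heuristic, assumed here): the n_clusters points with
    the largest gamma are chosen as centers."""
    gamma = np.asarray(rho, dtype=float) * np.asarray(delta, dtype=float)
    return np.argsort(gamma)[::-1][:n_clusters]   # indices, largest gamma first
```

Sorting by the product rather than by density alone is what lets DPC reject high-density points that sit close to an even denser neighbor.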
The specific calculation method is as follows. First, the local density of data point
To guarantee the consistency of the metrics of
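One common way to make two metrics comparable before combining them is min-max normalization to [0, 1]; whether this matches the paper's exact scaling formula is an assumption on our part.

```python
import numpy as np

def min_max(x):
    """Scale a vector linearly to [0, 1] (min-max normalization).

    A constant vector is mapped to all zeros to avoid division by zero.
    """
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)
```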
In the following, a tissue-like P system with active membranes for density peak clustering, called KST-DPC, is proposed. As mentioned before, assume the dataset with
The initial configuration of the tissue-like P system.
When the system is initialized, the objects
At the beginning, there are
The tissue-like membrane system in the calculation process.
The main steps of KST-DPC are summarized in Algorithm
As usual, computations in the cells in the tissue-like P system can be implemented in parallel. Because of the parallel implementation, the generation of the dissimilarity matrix uses
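On conventional hardware, the closest analogue of computing all pairwise distances "in parallel" inside the cells is a single vectorized matrix computation. A sketch of the dissimilarity-matrix step (our formulation, using the identity ‖a-b‖² = ‖a‖² + ‖b‖² - 2a·b):

```python
import numpy as np

def dissimilarity_matrix(X):
    """All-pairs Euclidean distance matrix in one vectorized step.

    The tissue-like P system evaluates all n*(n-1)/2 distances simultaneously;
    here the analogue is a single BLAS-backed matrix expression.
    """
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))   # clamp tiny negative rounding errors
```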
Experiments on six synthetic datasets and four real-world datasets are carried out to test the performance of KST-DPC. The synthetic datasets are from
Synthetic datasets.
Dataset | Instances | Dimensions | Clusters |
---|---|---|---|
Spiral | 312 | 2 | 3 |
Compound | 399 | 2 | 6 |
Jain | 373 | 2 | 2 |
Aggregation | 788 | 2 | 7 |
R15 | 600 | 2 | 15 |
D31 | 3100 | 2 | 31 |
Real-world datasets.
Dataset | Instances | Dimensions | Clusters |
---|---|---|---|
Vertebral | 310 | 7 | 2 |
Seeds | 210 | 7 | 3 |
Breast cancer | 699 | 10 | 2 |
Banknotes | 1372 | 5 | 2 |
The performance of KST-DPC was compared with those of the well-known clustering algorithms SC [
The performances of the above clustering algorithms are measured by clustering quality, namely Accuracy (Acc) and Normalized Mutual Information (NMI). Both are widely used measures for evaluating clustering algorithms. The larger the values, the better the clustering; both measures have an upper bound of 1.
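Clustering accuracy must account for the fact that cluster labels are arbitrary: a prediction is scored under the best one-to-one relabeling of clusters to classes. A small brute-force sketch (fine for the few clusters used here; the function name is ours):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Acc: fraction of points whose predicted cluster matches the true class
    under the best one-to-one relabeling of clusters (brute force over
    permutations; assumes equal numbers of predicted and true labels)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pred_labels = np.unique(y_pred)
    best = 0.0
    for perm in permutations(np.unique(y_true)):
        mapping = dict(zip(pred_labels, perm))      # try this relabeling
        acc = np.mean([mapping[p] == t for p, t in zip(y_pred, y_true)])
        best = max(best, acc)
    return best
```

For larger numbers of clusters the permutation search is usually replaced by the Hungarian assignment algorithm, and NMI is typically taken from a standard library implementation.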
In this subsection, the performances of KST-DPC, DPC-KNN, DBSCAN, and SC are reported on the six synthetic datasets. The clustering results by the four clustering algorithms for the six synthetic datasets are color coded and displayed in two-dimensional spaces as shown in Figures
Clustering results of the Spiral dataset by the four clustering algorithms.
KST-DPC
DPC-KNN
DBSCAN
SC
Clustering results of the Compound dataset by the four clustering algorithms.
KST-DPC
DPC-KNN
DBSCAN
SC
Clustering results of the Jain dataset by the four clustering algorithms.
KST-DPC
DPC-KNN
DBSCAN
SC
Clustering results of the Aggregation dataset by the four clustering algorithms.
KST-DPC
DPC-KNN
DBSCAN
SC
Clustering results of the R15 dataset by the four clustering algorithms.
KST-DPC
DPC-KNN
DBSCAN
SC
Clustering results of the D31 dataset by the four clustering algorithms.
KST-DPC
DPC-KNN
DBSCAN
SC
The performance measures of the four clustering algorithms on the six synthetic datasets are reported in Table
Results on the synthetic datasets.
Algorithm | Par | C1 | Acc | NMI | Algorithm | Par | C1 | Acc | NMI |
---|---|---|---|---|---|---|---|---|---|
Spiral | | | | | Compound | | | | |
KST-DPC | 16 | 3 | 1.00 | 1.00 | KST-DPC | 217 | 6 | 0.98 | 0.95 |
DPC-KNN | 20 | 3 | 1.00 | 1.00 | DPC-KNN | 360 | 6 | 0.6466 | 0.7663 |
DBSCAN | 1.2/3 | 3 | 1.00 | 1.00 | DBSCAN | 1.5/3 | 5 | 0.8596 | 0.9429 |
SC | 3 | 3 | 1.00 | 1.00 | SC | 6 | 6 | 0.6015 | 0.7622 |
Jain | | | | | Aggregation | | | | |
KST-DPC | 4 | 2 | 1.00 | 1.00 | KST-DPC | 40 | 7 | 1.00 | 1.00 |
DPC-KNN | 8 | 2 | 0.9035 | 0.5972 | DPC-KNN | 40 | 7 | 0.9987 | 0.9957 |
DBSCAN | 2.62/4 | 2 | 1.00 | 1.00 | DBSCAN | 1.59/3 | 5 | 0.8274 | 0.8894 |
SC | 2 | 2 | 1.00 | 1.00 | SC | 7 | 7 | 0.9937 | 0.9824 |
R15 | | | | | D31 | | | | |
KST-DPC | 20 | 15 | 1.00 | 0.99 | KST-DPC | 25 | 31 | 1.0000 | 1.0000 |
DPC-KNN | 20 | 15 | 1.00 | 0.99 | DPC-KNN | 25 | 31 | 0.9700 | 0.9500 |
DBSCAN | 0.4/5 | 13 | 0.78 | 0.9155 | DBSCAN | 0.46/3 | 27 | 0.6516 | 0.8444 |
SC | 15 | 15 | 0.9967 | 0.9942 | SC | 31 | 31 | 0.9765 | 0.9670 |
Results on the real-world datasets.
Algorithm | Par | C1 | Acc | NMI | Algorithm | Par | C1 | Acc | NMI |
---|---|---|---|---|---|---|---|---|---|
Vertebral | | | | | Seeds | | | | |
KST-DPC | 9 | 2 | | 0.0313 | KST-DPC | 4 | 3 | | |
DPC-KNN | 9 | 2 | | | DPC-KNN | 6 | 3 | 0.8143 | 0.6252 |
DBSCAN | 7/48 | 2 | 0.6742 | - | DBSCAN | 0.92/7 | 3 | 0.5857 | 0.4835 |
SC | 2 | 2 | - | - | SC | 3 | 3 | 0.6071 | 0.5987 |
Breast cancer | | | | | Banknotes | | | | |
KST-DPC | 70 | 2 | | | KST-DPC | 68 | 2 | | |
DPC-KNN | 76 | 2 | 0.7954 | 0.3154 | DPC-KNN | 82 | 2 | 0.7340 | 0.3311 |
DBSCAN | 6/20 | 2 | 0.6552 | 0.0872 | DBSCAN | 6.5/5 | 2 | 0.5554 | 6.7210e-16 |
SC | 2 | 2 | - | - | SC | 2 | 2 | 0.6152 | 0.0598 |
The Spiral dataset has 3 clusters with 312 data points embracing each other. Table
The Compound dataset has 6 clusters with 399 data points. From Table
The Jain dataset has two clusters with 373 data points in a 2-dimensional space. The clustering results show that KST-DPC, DBSCAN, and SC obtain correct results, with both benchmark values equal to 1.00. The experimental results of the 4 algorithms are shown in Table
The Aggregation dataset has 7 clusters with different sizes and shapes and two pairs of clusters connected to each other. Figure
The R15 dataset has 15 clusters containing 600 data points. The clusters are slightly overlapping and are distributed randomly in a 2-dimensional space. One cluster lies in the center of the 2-dimensional space and is closely surrounded by seven other clusters. The experimental results of the 4 algorithms are shown in Table
The D31 dataset has 31 clusters and contains 3100 data points. These clusters are slightly overlapping and are distributed randomly in a 2-dimensional space. The experimental results of the 4 algorithms are shown in Table
This subsection reports the performances of the clustering algorithms on the four real-world datasets. The varying sizes and dimensions of these datasets are useful in testing the performance of the algorithms under different conditions.
The number of clusters, Acc and NMI are also used to measure the performances of the clustering algorithms on these real-world datasets. The experimental results are reported in Table
The Vertebral dataset consists of 2 clusters and 310 data points. As Table
The Seeds dataset consists of 210 data points and 3 clusters. Results in Table
The Breast Cancer dataset consists of 699 data points and 2 clusters. The results on this dataset in Table
The Banknotes dataset consists of 1372 data points and 2 clusters. From Table
All these experimental results show that KST-DPC outperforms the other clustering algorithms, obtaining larger values of Acc and NMI.
This study proposed a density peak clustering algorithm based on K-nearest neighbors, Shannon entropy, and tissue-like P systems. It uses the K-nearest neighbors and Shannon entropy to calculate the density metric. This algorithm overcomes DPC's shortcoming of having to preset the value of the cutoff distance
However, the parameter
The synthetic datasets are available at
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was partially supported by the National Natural Science Foundation of China (nos. 61876101, 61802234, and 61806114), the Social Science Fund Project of Shandong (16BGLJ06, 11CGLJ22), China Postdoctoral Science Foundation Funded Project (2017M612339, 2018M642695), Natural Science Foundation of Shandong Province (ZR2019QF007), China Postdoctoral Special Funding Project (2019T120607), and Youth Fund for Humanities and Social Sciences, Ministry of Education (19YJCZH244).