Adaptive Mixed-Attribute Data Clustering Method Based on Density Peaks



Introduction
Clustering analysis has been widely used in statistics, machine learning, pattern recognition, and image processing, for example in image inpainting [1,2] and image super-resolution reconstruction [3]. Mixed-attribute data clustering is one of the research hotspots in data mining. There are many solutions to mixed-attribute data clustering, including attribute conversion methods, clustering ensemble methods, prototype-based methods, hierarchical clustering methods, and density clustering methods [4]. The K-prototypes algorithm proposed by Huang [5] and the iterative clustering learning based on object-cluster similarity metric (OCIL) algorithm proposed by Cheung and Jia [6] are both typical prototype-based methods. The similarity-based agglomerative clustering (SBAC) algorithm proposed by Li and Biswas [7] is a well-known agglomerative hierarchical clustering method. Density clustering algorithms include the relative density-based clustering algorithm for mixture datasets (RDBC_M) proposed by Huang and Li [8] and the density-based clustering algorithm for mixed data with mixed distance measure methods (MDCDen) proposed by Chen and He [9]. However, these state-of-the-art methods require user intervention and parameter tuning, so they cannot realize adaptive clustering. The density peaks clustering (DPC) algorithm proposed by Rodriguez and Laio [10] has attracted a great deal of attention from researchers in recent years [11-14]. In this algorithm, a decision graph is constructed by calculating a local density ρ_i and a relative distance δ_i for each point, and the number of clusters is determined by manually selecting the center points of the clusters in the decision graph. Each remaining data point is assigned to the cluster of its nearest higher-density neighbor. Theoretically, the algorithm can cluster data of arbitrary shape and type and automatically identify outliers.
The algorithm is efficient and has only one parameter, the cutoff distance d_c, which determines the local density calculation. The input of the DPC algorithm is the distance matrix between data points, so as long as the distance measurement problem of mixed-attribute data is solved, the algorithm can be applied directly to cluster mixed-attribute data. Therefore, Liu et al. [15] defined a distance measurement method for mixed attributes and extended the DPC algorithm to a modified DPC algorithm for mixed-attribute data (DPC_M), which was successfully applied to mixed-attribute data clustering. Du et al. [16] defined a distance measurement method between mixed-attribute data points by referring to the similarity used in the OCIL algorithm and used the DPC algorithm to perform clustering analysis on numerical, categorical, and mixed-attribute data. These two algorithms prove the feasibility of the density peaks algorithm for clustering mixed-attribute data. However, they are not adaptive and need manual intervention in the clustering process.
There are also many studies on adaptive improvement of the DPC algorithm, but they mainly focus on clustering numerical attribute datasets; they are detailed in the next section. To realize adaptive clustering of mixed-attribute data, we propose an adaptive mixed-attribute data clustering method based on the DPC algorithm, called AMDPC. Experimental results show that the proposed AMDPC algorithm achieves a better clustering effect, automatically determines the cluster number, and realizes adaptive clustering of mixed-attribute datasets without user-set parameters.
The main contributions of this paper are as follows: (1) The distance measurement of mixed-attribute data is studied, and a unified distance measurement method is used to construct the distance matrix between the data points of mixed-attribute data, which solves the problem of applying the DPC algorithm to mixed-attribute data. (2) The adaptive improvement of the DPC algorithm is studied, and a new method to determine the cluster centers is proposed. Because a cluster center is usually a data point with both a large local density and a large relative distance, after calculating c_i = ρ_i × δ_i the cluster centers are determined by calculating the inflection points of the sorted c_i, ρ_i, and δ_i sequences. (3) An adaptive local density calculation method based on K-nearest neighbors (KNN) is used to improve the robustness of the algorithm, without manually determining the cutoff distance d_c or other parameters.

Density Peaks Clustering.
The DPC algorithm is based on the following two assumptions: a cluster center has a higher local density than the neighbor points that surround it, and a cluster center is relatively far from other points of higher density. Therefore, the DPC algorithm constructs a decision graph by calculating a local density ρ_i and a relative distance δ_i to find the cluster centers of a dataset. Each remaining data point in the dataset is assigned to the cluster of its nearest neighbor whose local density is higher than its own.
Suppose that X = {X_1, X_2, ..., X_n} is the dataset to be clustered, consisting of n data points, and define the distance between the data points X_i and X_j as d_ij = dist(X_i, X_j). A cutoff distance d_c is defined in the DPC algorithm, and the local density ρ_i and the relative distance δ_i of each data point are defined as shown in equations (1) and (2):
$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \tag{1}$$
$$\delta_i = \begin{cases} \min_{j:\, \rho_j > \rho_i} d_{ij}, & \text{if } \rho_i < \max_k \rho_k, \\ \max_{j} d_{ij}, & \text{otherwise}, \end{cases} \tag{2}$$
where χ(d_ij − d_c) = 1 when d_ij − d_c < 0 and 0 otherwise.
When the local density of X_i is not the maximum, the relative distance is the minimum distance from this point to all points with higher density; otherwise, the relative distance is the maximum distance from this point to all other points.
When the dataset has few data points, the local density is generally calculated using a Gaussian kernel, as shown in equation (3):
$$\rho_i = \sum_{j \neq i} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right). \tag{3}$$
Based on the local density ρ_i and relative distance δ_i of each data point, users can explicitly choose the cluster centers on the decision graph. Once the centers are determined, each remaining data point is assigned to the same cluster as its nearest neighbor with a higher density.
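The quantities described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes a precomputed symmetric distance matrix stored as a list of lists, uses the cutoff-kernel density (the count of neighbors closer than d_c), and breaks density ties by index, which is an assumption made here only for determinism. The function name `dpc_quantities` is hypothetical.

```python
def dpc_quantities(dist, d_c):
    """Return (rho, delta, nearest_higher) for a symmetric distance matrix."""
    n = len(dist)
    # Local density: number of other points strictly closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Rank points from densest to sparsest; ties are broken by index
    # (an assumption of this sketch, not something the text specifies).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    nearest_higher = [-1] * n  # -1 marks "no denser neighbor"
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest point: delta is its largest distance to any point.
            delta[i] = max(dist[i])
            continue
        # Closest point among those ranked denser than point i.
        j = min(order[:rank], key=lambda q: dist[i][q])
        delta[i] = dist[i][j]
        nearest_higher[i] = j
    return rho, delta, nearest_higher


# Five points on a line at positions 0, 1, 2, 10, 11.
positions = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]
rho, delta, nearest_higher = dpc_quantities(dist, d_c=1.5)
```

In this toy example the point at position 1 is the densest, and the point at position 10 has a low density but a large relative distance, which is the pattern that the decision graph is meant to expose.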

Adaptive Improvement of the DPC Algorithm.
The original DPC algorithm has some drawbacks: the manually selected cutoff distance d_c and cluster centers have a great influence on the clustering results, and the original local density calculation is not effective for data whose clusters differ in density or shape. At the same time, the original sample allocation strategy can create a domino effect: once a sample is misallocated, a series of further allocation errors follows, producing incorrect clustering results and reducing their reliability [20,21]. Therefore, many adaptive improvements of the original DPC clustering algorithm have been proposed. Research on the adaptive improvement of the density peaks algorithm mainly focuses on the automatic determination of the cutoff distance d_c, the calculation of adaptive local density, the design of adaptive distance measurement, and the automatic determination of the cluster number (the selection of cluster centers). Most of these studies focus on the clustering of numerical attribute data.

Adaptive Improvement for Selection of Cutoff Distance and Local Density Calculation.
Wang et al. [22] proposed a method that uses the potential entropy of the data field to automatically extract, from the original dataset, the optimal threshold value for different kernel functions and different datasets. According to the characteristics of the dataset, Jiang et al. [23] used the change of the nearest-neighbor distance curve to automatically determine the density threshold d_c and used this method to guide the merging of clusters after the first clustering with DPC. This was done to solve the problem that the DPC algorithm divides a cluster into multiple clusters when there are two or more density peaks in it. Sun et al. [24] proposed an ADPC method with a Fisher linear discriminant. The Pearson correlation coefficient is first introduced as the weight, and then a kernel density estimation function based on the weighted Euclidean distance is used to calculate the local density of the samples. Lotfi et al. [25] proposed a novel dynamic density peaks clustering method based on density backbone and fuzzy neighborhood, called DPC-DBFN, in which a fuzzy kernel is proposed to compute the local densities of the data points. Parmar et al. adopted residual error computation to measure the local density within a neighborhood region and proposed residual error-based density peak clustering algorithms named REDPC [26,27] and FREDPC [28].
Du et al. [29] proposed a DPC-KNN algorithm that introduced K-nearest-neighbor information into the local density calculation. In addition, they proposed a principal component analysis (PCA) based variant, named DPC-KNN-PCA, for high-dimensional data clustering. Juanying et al. proposed the KNN-DPC [20] and FKNN-DPC [21] algorithms, which introduce a uniform local density metric based on KNNs and fuzzy KNNs together with two new strategies for assigning the remaining points to their most likely clusters. Yaohui et al. [30] proposed an adaptive DPC algorithm (named ADPC-KNN), which introduced the idea of KNNs to calculate the global parameter d_c and the local density ρ_i of each point, applied a new approach to automatically select the initial cluster centers, and finally aggregated the clusters if they were density reachable. Shi et al. [31] presented an algorithm called the adaptive clustering algorithm based on KNN and density (ACND) that first determines the KNN of every data point and then redefines the similarity between pairs of points with shared nearest neighbors. It does not force the user to define parameter values; it recognizes the core points, constructs the clusters around them, and then attempts to detect the cluster boundaries. It makes full use of KNN information, has low computational complexity, and can deal with different shapes and data sizes in the presence of noise and outliers. Xu et al. [32] proposed extended adaptive density peaks clustering (EDAP) for overlapping community detection, in which the local density is calculated based on KNN. Jiang et al. [33] proposed a method called G-KNN-DPC that calculates the cutoff distance based on the Gini coefficient and KNN. Sun and Liu [34] proposed a new density formula, combining the idea of gravitation with KNN, that makes the local densities of sample points in dense and sparse areas more clearly separable. Fan et al. [35] proposed a new DPC algorithm by incorporating an improved mutual K-nearest-neighbor graph (Mk-NNG) into DPC.
In general, KNN is used for local density calculations in most of these improved algorithms. Letting d(X_i, X_j) be the Euclidean distance between the ith and jth data points in the dataset X = {X_1, X_2, ..., X_n}, the local density calculation formula defined by DPC-KNN is expressed by
$$\rho_i = \exp\left(-\frac{1}{k}\sum_{X_j \in \mathrm{KNN}(X_i)} d(X_i, X_j)^2\right), \tag{4}$$
and the local density calculation formula defined in the KNN-DPC and FKNN-DPC algorithms is expressed as follows:
$$\rho_i = \sum_{X_j \in \mathrm{KNN}(X_i)} \exp\left(-d(X_i, X_j)\right), \tag{5}$$
where KNN(X_i) represents the KNN set of data point X_i. In general, k takes a fixed value of 5 or 6, or is calculated as a percentage of the data points in the dataset. In most cases, k = ⌈p × N⌉, where the percentage p = 2%, N is the total number of data points in the dataset, and ⌈·⌉ is the ceiling function. Most of these algorithms are based on equations (4) and (5) or variants.
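As a hedged illustration of a KNN-based density, the sketch below computes, for each point, the exponential of the negative mean squared distance to its k nearest neighbors, which is how the DPC-KNN density is described in this paper, and derives k as the ceiling of a percentage of the dataset size. The function names and the list-of-lists distance matrix are assumptions of this example.

```python
import math


def knn_density(dist, k):
    """Density of each point from the mean squared distance to its k nearest neighbors."""
    n = len(dist)
    rho = []
    for i in range(n):
        # Distances from point i to every other point, smallest first.
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        mean_sq = sum(d * d for d in nearest) / len(nearest)
        rho.append(math.exp(-mean_sq))
    return rho


def neighbor_count(n, percent):
    """k as the ceiling of `percent` percent of n data points (at least 1)."""
    return max(1, math.ceil(n * percent / 100))


positions = [0, 1, 2, 10]
dist = [[abs(a - b) for b in positions] for a in positions]
rho = knn_density(dist, k=1)
```

With this definition no cutoff distance is needed; the isolated point at position 10 receives a density close to zero because even its nearest neighbor is far away.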

Automatic Determination of Cluster Number.
To solve the problem that the density peaks algorithm requires the cluster centers to be selected manually, Ma et al. [36] introduced the weight of the cluster center. First, the products c_i = ρ_i × δ_i of the normalized relative distance δ_i and the local density ρ_i were calculated. Then, the inflection point of c was used to determine the cluster centers of the dataset, which avoids subjective differences between users in selecting the centers. Zhao [37] proposed an improved LDPC algorithm combined with a linear fitting method. In this algorithm, the sparse and dense points are separated by linear fitting, and then the residual sequence C is obtained by taking the difference between the original values c_s and the fitted values c_r. The average residual value of the first 20 points is selected as the threshold, and the data points with residual values greater than the threshold are taken as the center points. Du et al. [38] proposed a parameter-adaptive clustering algorithm named DDPA-DP. A data-driven approach runs through the design of DDPA-DP: first, a series of fitted curves is established to automatically detect the roles of points from their density attributes instead of using artificial thresholds; meanwhile, a new point role, the "pending point", is defined, and the radius of the local field is adaptively optimized according to the change in the number of pending points. García-García and García-Ródenas [39] proposed an optimization-based methodology for automatic parameter and center selection that uses an internal or external cluster validity index as the objective function.
Wang et al. [40] proposed an efficient hierarchical clustering algorithm based on density peaks. They used the step characteristics of the parameter c to distinguish different levels of clustering and then constructed a hierarchical clustering tree based on an intermediate result of DPC (NNeigh, a DPC array) to complete efficient hierarchical clustering and determine the cluster number automatically. Zhang and Li [41] extended the traditional DPC algorithm by using the CHAMELEON hierarchical clustering algorithm. In the extended algorithm, DPC was used for the initial clustering, and then the hierarchical clustering algorithm was used to merge the resulting subclasses, which improved the clustering effect. Bie et al. [42] proposed a fuzzy DPC algorithm called Fuzzy-CFSFDP that uses fuzzy rules to find all density peaks, treats each peak as a local cluster, and then merges close local clusters into global clusters to obtain the final clustering. Ding et al. [43] proposed an improved density peaks clustering based on a natural neighbor expanded group (DPC-NNEG). They first define a natural neighbor expanded (NNE) and a natural neighbor expanded group (NNEG) and then divide all NNEGs into a target number of sets, according to the degree of closeness of the NNEGs, as the final clustering result. To describe the cluster centers more comprehensively, Diao et al. [44] redefined the local density and relative distance by fusing the distance attributes of two neighbor relationships (KNN and SNN). This method can detect low-density cluster centers. Mehrmohammadi et al. [45] proposed a better method for selecting centers based on the mutual kNN graph and the shortest path. Fang et al. [46] proposed adaptive core fusion-based density peak clustering (CFDPC) to adaptively detect clusters of any shape and density. An initial clustering based on automatic finding of density peaks is performed first. An adaptive search approach then finds the core points, and a core fusion strategy based on similarity within the cluster produces the final clustering results.
In summary, the main ideas for automatically determining the cluster number fall into two directions. The first is to determine the cluster centers by taking the larger values of c, ρ, and δ, for example by finding the inflection point of c or by using curve fitting and residual analysis. The second is to adopt the idea of hierarchical clustering, initially selecting more cluster centers and then merging the close local clusters.
Notably, c, ρ, and δ are discrete sequences, and to calculate the inflection point of a discrete sequence, Ma et al. [36] used the slope of the line segment between two points. With the index of the sorted sequence y taken as the abscissa, the calculation formula is expressed as follows:
$$S_i^m = \frac{y_{i+m} - y_i}{m}, \tag{6}$$
where S_i^m represents the average change rate of the discrete sequence in the interval [i, i + m], namely, the slope of y over that interval. Based on the slope calculation, the inflection point is defined by equation (7), where S_i^1 is the slope from the ith point to the (i + 1)th point and S_1^{i−1} is the slope from the first point to the ith point, that is, the average change rate of the discrete sequence y in the interval [1, i]. In this case, the inflection point is the critical point with the fastest slope change.
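The slope-based idea can be illustrated with a short sketch. The exact form of the inflection-point definition in equation (7) is not reproduced here, so the code below rests on a loudly stated assumption: it takes the inflection point to be the index at which the one-step slope differs most, in absolute value, from the average slope measured from the first point. The function names are hypothetical and indices are 0-based.

```python
def slope(y, i, m):
    """Average change rate of the sequence y over the index interval [i, i + m]."""
    return (y[i + m] - y[i]) / m


def inflection_index(y):
    """Index where the local slope departs most from the average slope so far.

    Assumption of this sketch: "fastest slope change" is read as the largest
    absolute difference between the slope from point i to point i + 1 and the
    average slope from the first point to point i.
    """
    if len(y) < 3:
        raise ValueError("need at least three values")
    best_i, best_change = 1, -1.0
    for i in range(1, len(y) - 1):
        local = slope(y, i, 1)    # slope from point i to point i + 1
        average = slope(y, 0, i)  # average slope from the first point to point i
        change = abs(local - average)
        if change > best_change:
            best_i, best_change = i, change
    return best_i


# A sequence sorted in descending order with a sharp drop after the third value.
y = [10, 9.5, 9, 1, 0.9, 0.8, 0.7]
knee = inflection_index(y)
```

Under this reading, the detected index is the last point before the sharp drop; a different reading of equation (7) could place it one position later.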

Adaptive Mixed-Attribute-Data Density Peaks Clustering
Suppose that X = {X_1, X_2, ..., X_n} is a mixed dataset with d dimensions and n instances, which contains d_r numerical attributes and d_c (d_c = d − d_r) categorical attributes. For two instances X_i and X_j in the dataset, their distance D(X_i, X_j) is defined by equation (8). Equations (9) and (10) give the distance computation for the numerical attributes, D_r(X_i, X_j), and for the categorical attributes, D_c(X_i, X_j), respectively, where dist(X_i^{d_r}, X_j^{d_r}) denotes the normalized Euclidean distance between the numerical attributes of the data points X_i and X_j. Because the Euclidean distance is non-negative, normalization ensures that the distance value of the numerical attributes lies in the interval [0, 1]. For the categorical attributes, the matching method with entropy weights is used. The matching distance of the data points X_i and X_j on the tth categorical attribute is calculated by equation (11). The importance of a categorical attribute is quantified by its average entropy over its attribute values, and the weight ω_t of each attribute is then computed by
$$\omega_t = \frac{H_{A_t}}{\sum_{k=1}^{d_c} H_{A_k}}. \tag{12}$$
Assume that the total number of categorical values of the tth categorical attribute is m_t and that the probability of occurrence of the sth (s = 1, 2, ..., m_t) value is p(a_ts). The entropy H_{A_t} can be calculated using equation (13); it represents the average entropy of the m_t values of the tth categorical attribute:
$$H_{A_t} = -\frac{1}{m_t}\sum_{s=1}^{m_t} p(a_{ts}) \log p(a_{ts}). \tag{13}$$
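The entropy weighting of categorical attributes can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the average entropy of an attribute is its Shannon entropy divided by the number of distinct values, weights are the entropies normalized to sum to one, and the per-attribute matching distance is taken to be simple 0/1 mismatch. The sample columns are hypothetical and are not the contents of Table 1.

```python
import math
from collections import Counter


def average_entropy(values):
    """Shannon entropy of one categorical attribute divided by its number of distinct values."""
    counts = Counter(values)
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / len(counts)


def entropy_weights(columns):
    """Normalize the average entropies of the attributes so that the weights sum to 1."""
    entropies = [average_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to equal weights (assumption).
        return [1 / len(columns)] * len(columns)
    return [h / total for h in entropies]


def categorical_distance(x, y, weights):
    """Weighted simple-matching distance: add the weight of each mismatching attribute."""
    return sum(w for w, a, b in zip(weights, x, y) if a != b)


# Hypothetical categorical columns for five records.
outlook = ["sunny", "sunny", "rain", "rain", "overcast"]
windy = ["yes", "no", "yes", "no", "no"]
weights = entropy_weights([outlook, windy])
d_same = categorical_distance(("sunny", "yes"), ("sunny", "yes"), weights)
d_diff = categorical_distance(("sunny", "yes"), ("rain", "no"), weights)
```

Because the weights are a ratio of entropies, the base of the logarithm does not affect them, and two records that differ on every categorical attribute are at distance 1.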
Consider the mixed-attribute weather dataset shown in Table 1. The dataset DS = {X_1, X_2, X_3, X_4, X_5} has five records X_1-X_5 and four attributes A_1-A_4, which represent weather, windy, temperature, and humidity. The first two attributes, weather and windy, are categorical and the last two are numerical, so d_r = 2 and d_c = 2. The unified distance metric is calculated as follows.
First, the numerical attributes must be normalized. Equations (12) and (13) are then used to calculate the weights of the first and second dimensions as ω_1 = 0.5708 and ω_2 = 0.4292, respectively. Finally, the distance between the first record and the other records can be calculated according to equation (8).

Local Density Calculation Based on KNN.
In a small dataset, the Gaussian kernel function is usually used to calculate the local density, which requires manually setting the density threshold parameter d_c. As mentioned above, many studies have adopted KNN information to calculate the local density adaptively. We adopt the idea of the DPC-KNN algorithm and use the Gaussian kernel function improved with KNN information to calculate the local density of each data point. Using the KNN set of each data point, we calculate the average of the squared distances between the data point and its KNNs. Thus, equation (4) can be used to calculate the local density of the data points. With this calculation method, it is not necessary to set the cutoff distance parameter d_c; only the nearest-neighbor number K must be determined. Through the subsequent experiments, the nearest-neighbor number K can be determined automatically from the number of data points in the dataset.

Automatically Determining the Cluster Number.
Among the methods for automatically determining the cluster number, the inflection-point calculation on c proposed by Ma et al. [36] is simple but not sufficiently accurate. In theory, the center points should have both a large local density ρ_i and a large relative distance δ_i, but a large product c_i does not guarantee that the local density and the relative distance are both large.
In the sample dataset of the DPC algorithm and its decision graph [10], shown in Figure 1, points 1 and 10 in the upper right corner of the decision graph are cluster centers, for which the local density and relative distance are both large. Points 26, 27, and 28, however, are treated as outliers: their relative distance is large, but their local density is small. Therefore, we propose a three-inflection-point method, which uses equations (2) and (4) to calculate the relative distance and local density of the data points. We then sort the c, ρ, and δ values of the data points in descending order and use equation (7) to calculate the inflection points of c, ρ, and δ, which yield three candidate sets S_g, S_p, and S_d, respectively. The candidate set S_g contains the points whose c values are larger than that of the inflection point of c. Similarly, the candidate set S_p contains the points whose ρ values are larger than that of the inflection point of ρ, and the candidate set S_d contains the points whose δ values are larger than that of the inflection point of δ. We then calculate the intersection S_c = S_g ∩ S_p ∩ S_d, and S_c is the cluster center set. Points whose relative distances are large but that are not cluster centers can be judged as outliers, which are obtained by calculating S_o = S_d − S_c. Therefore, the improved method proposed in this paper can automatically identify both the cluster centers and the outliers.
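The set operations in this step can be sketched directly. The example below assumes that the three inflection-point values have already been obtained by some means and are passed in as thresholds; the function name `select_centers` and the toy numbers are hypothetical.

```python
def select_centers(rho, delta, rho_ip, delta_ip, c_ip):
    """Return (centers, outliers) as sets of point indices.

    rho_ip, delta_ip and c_ip are the values of rho, delta and c at their
    respective inflection points (computed elsewhere).
    """
    c = [r * d for r, d in zip(rho, delta)]
    s_g = {i for i, v in enumerate(c) if v > c_ip}          # large product
    s_p = {i for i, v in enumerate(rho) if v > rho_ip}      # large density
    s_d = {i for i, v in enumerate(delta) if v > delta_ip}  # large distance
    centers = s_g & s_p & s_d
    outliers = s_d - centers
    return centers, outliers


rho = [10, 9, 1, 5, 4]
delta = [8, 7, 9, 1, 1]
centers, outliers = select_centers(rho, delta, rho_ip=6, delta_ip=5, c_ip=20)
```

Point 2 in this toy data has a large relative distance but a small density, so it passes only the δ test and is reported as an outlier rather than a center.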
For example, the local density ρ_i, relative distance δ_i, and c_i of some data points of a sample dataset are shown in Table 2. The inflection points are calculated according to equation (7). The three-inflection-point algorithm for determining the cluster centers is described in Algorithm 1.

AMDPC Implementation.
First, we use the unified distance measurement of the mixed-attribute data to calculate the distance matrix of the mixed-attribute dataset according to equation (8). Then, we calculate the local density ρ_i of each data point using the KNN-based equation (4) and the relative distance δ_i using equation (2); thus, c_i = ρ_i × δ_i is calculated and the cluster centers are found using Algorithm 1. Finally, each remaining point is clustered by finding its nearest neighbor with a higher density and giving it the same cluster label as that neighbor. The overall flow diagram of the AMDPC algorithm is shown in Figure 2. The input of the algorithm is the mixed-attribute dataset (DS) and the output is the cluster label vector (CL). The detailed process of the AMDPC algorithm is given in Algorithm 2.
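The final assignment step can be sketched as a single pass over the points in order of decreasing density, so that the denser neighbor of each point is already labeled when the point is visited. This is a minimal illustration: the function name is hypothetical, `nearest_higher[i]` is assumed to hold the index of the nearest denser point (or -1 if there is none), and ties in density are assumed to have been broken by index when `nearest_higher` was built.

```python
def assign_labels(rho, nearest_higher, centers):
    """Propagate cluster labels from the centers to all remaining points."""
    labels = [-1] * len(rho)
    # Give each center its own cluster label.
    for label, center in enumerate(sorted(centers)):
        labels[center] = label
    # Visit points from dense to sparse; each unlabeled point copies the
    # label of its nearest denser neighbor, which was visited earlier.
    for i in sorted(range(len(rho)), key=lambda p: (-rho[p], p)):
        if labels[i] == -1 and nearest_higher[i] != -1:
            labels[i] = labels[nearest_higher[i]]
    return labels


rho = [1, 2, 1, 1, 1]
nearest_higher = [1, -1, 1, 2, 3]
labels = assign_labels(rho, nearest_higher, centers={1, 3})
```

A point keeps the label -1 only if the densest point of the dataset was not chosen as a center, which this sketch does not attempt to repair.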

Complexity Analysis.
For a dataset with n data points, the space complexity of the algorithm comes mainly from the storage of the distance matrix. According to the input format of the DPC algorithm, 3 × n(n − 1)/2 storage cells are needed: columns 1 and 2 hold the data point numbers and column 3 holds the distance between the two data points. In addition, the algorithm requires three arrays of length n to store the local density ρ, the relative distance δ, and their product c, so the space complexity is O(n²). The time complexity of the AMDPC algorithm comes mainly from the distance calculation in Step 2 and the local density computation in Step 3, each of which is O(n²). The time complexity of the sort in Step 4 (Algorithm 1) depends on the sorting algorithm, ranging from O(n log n) at best to O(n²) at worst, so the total complexity is no more than O(n²). The time complexity of the data point allocation in Step 5 is O(n). Therefore, the overall complexity of the algorithm is O(n²), the same as that of the DPC algorithm.
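The three-column storage described above can be illustrated by flattening the upper triangle of a square distance matrix. The helper names below are hypothetical, and the 1-based point numbers are an assumption chosen to match the "data point numbers" wording.

```python
def to_pair_list(dist):
    """List (i, j, d_ij) for every unordered pair, using 1-based point numbers."""
    n = len(dist)
    return [(i + 1, j + 1, dist[i][j])
            for i in range(n) for j in range(i + 1, n)]


def storage_cells(n):
    """Number of stored values: three columns times n(n - 1)/2 rows."""
    return 3 * n * (n - 1) // 2


dist = [[0, 1, 2, 3],
        [1, 0, 4, 5],
        [2, 4, 0, 6],
        [3, 5, 6, 0]]
rows = to_pair_list(dist)
```

For the 3000-point Adult sample this gives 3 × 3000 × 2999 / 2 = 13,495,500 stored values, which shows how quickly the quadratic term dominates.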

Experimental Analysis
To verify the effectiveness of the AMDPC algorithm in this paper, we used several mixed datasets from the University of California-Irvine (UCI) for experimental study. We compared the clustering results of the AMDPC algorithm with those of the K-prototype and DPC_M algorithms.
We implemented the three algorithms in MATLAB 2015a (MathWorks, USA) running on Windows 10 on a laptop with an Intel Core i5-5200U CPU and 4 GB of DDR3 memory.

Experimental Datasets.
In this study, we investigated five mixed datasets from the UCI machine-learning repository, namely, Statlog Heart, Cleveland Heart Disease, Statlog Credit Approval, Acute Inflammations, and Adult. Brief information describing these datasets is shown in Table 3.
The Acute Inflammations dataset contains pathological and physiological indicators for 120 patients with acute inflammation. There is one numerical attribute (body temperature) and five categorical attributes (different symptoms), which are used to determine whether each patient has cystitis and nephritis. Two class labels represent the two diseases; we used the first, which indicates cystitis, in our experiments. The deletion of missing data in a dataset did not affect the result of the clustering analysis. Therefore, we eliminated 6 instances with missing values from the Cleveland dataset and 37 instances with missing values from the Credit dataset before the experiment. The Adult dataset was extracted from the census bureau database and contains 30162 training instances, from which we selected 3000 by random sampling. In addition, we normalized the numerical attributes using min-max normalization.
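Min-max normalization of one numerical attribute can be sketched as below. Mapping a constant column to zeros is an assumption of this example, since the text does not say how that case is handled, and the sample temperatures are hypothetical.

```python
def min_max_normalize(column):
    """Rescale the values of one numerical attribute to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant attribute: map everything to 0 (assumption of this sketch).
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


temperatures = [36.0, 38.0, 40.0]  # hypothetical body temperatures
scaled = min_max_normalize(temperatures)
```

After this step every numerical attribute contributes on the same scale, which keeps the numerical part of the distance comparable with the categorical part.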

Effectiveness Analysis.
We used the K-prototype algorithm, the DPC_M algorithm, and the proposed AMDPC algorithm to separately cluster the datasets described in Section 4.1.
Following [5], the parameter γ of the K-prototype algorithm was set to 1/2σ, where σ represents the average standard deviation of the numerical attributes. The K-prototype algorithm was run 100 times and the clustering results were averaged. In the DPC_M algorithm, the percent parameter was p = 2, as described in [15]. When the AMDPC algorithm calculated the local density, the parameter K was assigned as ceil(0.1 * N); that is, 10% of the data points were taken as the nearest neighbors.
Because the UCI datasets have real class labels, the clustering accuracy rate (ACC) can be used as the validity index. We also used the normalized mutual information (NMI), Rand index (RI), adjusted Rand index (ARI), and F-score as validity indexes. For all indexes, the higher the index value, the better the clustering effect. The optimal results are indicated in bold in Tables 4-8.
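Two of these indexes are easy to state in code. The sketch below is an illustration, not the evaluation script used in the experiments: clustering accuracy is computed by trying every one-to-one mapping from predicted clusters to true classes, which is only practical for a small number of clusters, and the Rand index is computed by counting agreeing pairs. The function names are hypothetical.

```python
from itertools import combinations, permutations


def clustering_accuracy(true, pred):
    """Best fraction of correctly labeled points over all one-to-one label mappings."""
    pred_ids = sorted(set(pred))
    true_ids = sorted(set(true))
    # Pad with None so that surplus predicted clusters can stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for t, p in zip(true, pred) if mapping[p] == t)
        best = max(best, hits)
    return best / len(true)


def rand_index(true, pred):
    """Fraction of point pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(true)), 2))
    agree = sum(1 for i, j in pairs
                if (true[i] == true[j]) == (pred[i] == pred[j]))
    return agree / len(pairs)


true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
acc = clustering_accuracy(true, pred)
ri = rand_index(true, pred)
```

The brute-force mapping grows factorially with the number of clusters; for larger problems an assignment solver such as the Hungarian method is the usual replacement.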
Accordingly, we observed that the performance of the AMDPC algorithm was much better than that of the traditional K-prototype algorithm. The AMDPC algorithm improved the clustering accuracy by 22.58%, 24.25%, 28.03%, 22.5%, and 10.12% for the Heart, Cleveland, Credit, Acute, and Adult datasets, respectively. It also outperformed the DPC_M algorithm on the first four datasets, as shown in Tables 4-7. On the Adult dataset, the clustering accuracy of the AMDPC algorithm was 0.43% lower than that of the DPC_M algorithm, but AMDPC was better in the NMI and ARI indexes. The F-score takes into account both precision and recall; its values show that the algorithms perform differently on different experimental datasets. The proposed AMDPC algorithm achieved its best performance on the Credit dataset.
As shown in Table 7, for the first four indexes, the DPC_M algorithm had two different results because of the different selection of center points. DPC_M2 was worse than the AMDPC algorithm in clustering effect, whereas DPC_M1 was better. This showed that the selection of the center points in the DPC algorithm had a significant influence on the clustering results.

Algorithm 2: The AMDPC algorithm.
Input: DS (the mixed-attribute dataset)
Output: CL (the cluster label vector)
Begin
Step 1: Load the dataset DS and separate it into the numerical subset Dr and the categorical subset Dc.
Step 2: Construct the distance matrix of DS according to equation (8).
Step 3: Calculate the local density and the relative distance of each data point according to equations (4) and (2).
Step 4: Find the cluster center points using Algorithm 1.
Step 5: Assign the class labels and return the class label vector CL.
End

Algorithm 1: The three-inflection-point algorithm for determining the cluster centers.
Input: rho, delta (the local density vector ρ and the relative distance vector δ)
Output: Sc (the set of cluster centers S_c)
(1) // Step 1. Calculate c_i = ρ_i * δ_i.

Parameter-Adjustment Experiment.
The AMDPC algorithm uses the Gaussian kernel function improved with KNN information to calculate the local density of each data point. To understand how the parameter K affects the effectiveness of the algorithm, we conducted a series of experiments and found that the best effect was obtained when K was approximately 10% of the data instances in the dataset.
Taking the Heart dataset as an example, there were 270 data points in total. We set K to 1-20% of the data points and calculated the clustering accuracy of the AMDPC algorithm. The results are presented in Table 9 (optimal results are indicated in bold).
As shown in Table 9, when K was 10% of the data points (K = 27), the clustering accuracy reached its best value. As shown in Figure 4, on the Heart and Cleveland datasets, setting K to 10% of the data points achieved the best effect. On the Credit and Acute datasets, some values of K led to an incorrect cluster number, so the clustering accuracy (ACC = 0) was marked "not available" in the graph. For the Acute dataset, K = 4 or 5 was the best, and 10% was also good. Therefore, we set the value of K in the AMDPC algorithm to 10% of the data points in the dataset.

Computational Complexity Experiment.
To verify the time complexity of the proposed algorithm, we measured the running time of the above three algorithms. The K-prototype algorithm was run 100 times and the running times were averaged. The running times are shown in Table 10.
As shown above, the K-prototype is the most efficient algorithm. The proposed AMDPC needs more time to calculate the distances and compute the local density. With an increase in data volume, the time consumption of the AMDPC algorithm and that of the DPC_M algorithm present a linear relationship, and the time complexity of the two algorithms is of the same order of magnitude, which is consistent with the previous theoretical analysis. As shown in Table 11, with the 120 points of the Acute dataset, the clustering time used by AMDPC is 3.6 times that of the DPC_M algorithm. When the data amount increases to 653 points (Credit), the clustering time used by AMDPC is about 6 times that of the DPC_M algorithm. When the data amount increases to 3000 points (Adult), the clustering time used by AMDPC is less than 2 times that of the DPC_M algorithm.

Conclusion
The DPC algorithm is a simple and efficient algorithm. As long as the distance measurement problem of data points in a mixed-attribute dataset is solved, the DPC algorithm can also be used for efficient clustering of mixed-attribute data. In this paper, we studied the clustering methods of mixed-attribute data, focusing on the DPC algorithm and its adaptive improvement. Accordingly, we proposed an adaptive mixed-attribute data clustering algorithm based on DPC, called AMDPC, which adopts a unified mixed-attribute distance measurement method and a KNN-based adaptive local density calculation method. We used three inflection points to calculate the cluster center set and automatically determine the cluster number, which realizes adaptive clustering of mixed-attribute datasets. According to the experimental results, the proposed algorithm was significantly superior to the traditional K-prototype algorithm and slightly better than the DPC_M algorithm on most datasets. On all five datasets, the clustering accuracy of the AMDPC algorithm improved significantly compared with that of the K-prototype algorithm, by 10.12% to 28.03%, and it also improved slightly compared with the DPC_M algorithm except on the Adult dataset. In addition, AMDPC implements adaptive clustering without manual adjustment of any parameters.
The AMDPC algorithm can realize adaptive clustering of mixed-attribute data well. When we used KNN to calculate the local density of the data points, the value of K differed from the value used in previous research [10,14], and the value of K also had a significant influence on the clustering effect. According to the experimental analysis, the effect was optimal when K was 10% of the data points, but there is still room to adjust the value of K on different datasets, which requires further research.
There are still many problems in adaptive clustering of mixed-attribute data to be studied further, such as clustering datasets that contain a huge number of objects or attributes, or datasets with arbitrary shapes, different sizes, variable density, and overlapping clusters.

Data Availability
Data used to support the findings of this study are available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/index.php and from https://github.com/milaan9/Clustering-Datasets.