A Clustering Algorithm via Density Perception and Hierarchical Aggregation Based on Urban Multimodal Big Data for Identifying and Analyzing Categories of Poverty-Stricken Households in China

Kaifaqu Campus of Dalian University of Technology, No. 321, Tuqiang Street, Dalian Economic and Technological Development Zone, Dalian, Liaoning 116600, China
Faculty of Business and Management, Universiti Teknologi MARA, Cawangan Sarawak, Jalan Meranek, 94300 Kota Samarahan, Sarawak, Malaysia
International School of Shenyang Jianzhu University, No. 25, Hunnan East Road, Hunnan New District, Shenyang, Liaoning 110168, China
Huawei Nanjing Research & Development Center, No. 101 Software Avenue, Yuhuatai District, Nanjing, Jiangsu 210012, China


Introduction
With the development of Information and Communication Technology, the era of multimodal big data has fully arrived. Cities are places of prime importance for big data distribution, covering population, economy, transportation, and landscape [1][2][3]. The urban multimodal big data obtained by traditional collection methods such as field surveys and questionnaire interviews cannot objectively and accurately reflect the status quo of urban development and the patterns of residents' activities over a wide range of time and space; moreover, the urban operation information obtained this way has a large lag. Multimodal big data can make up for these defects and deeply depict the urban physical space and social environment. This not only makes it possible to objectively understand the urban system and summarize its development rules but also provides important support for urban planning and related research such as poverty-relief work and urban education. It must be admitted that urban planning based on urban multimodal big data is a very challenging task for poverty-relief work, although it can improve urban environments, quality of life, and smart city systems [4,5]. Owing to the tight schedule and heavy workload of targeted poverty alleviation in the early stage, the basic information on each impoverished object and the causes of poverty is not comprehensive and accurate enough and needs to be further enriched and improved. The management mechanism for poor objects is also imperfect. Given the large number of poor people in poor villages and their complicated family situations, the numbers of people escaping from and returning to poverty change constantly [6]. In addition, the management mechanism for poor objects at the village level is not sound enough, so there is a lack of timely knowledge of changes in the poor population of poor villages.
In this paper, we focus on the tasks of identifying and analyzing categories of poverty-stricken households in China. Eradication of poverty is a historical task facing the international community. With the development of artificial intelligence (AI) technologies such as machine learning and deep learning, a growing number of researchers are making great efforts to develop and unleash the huge potential of these AI technologies in alleviating poverty [7]. China, as the largest developing country worldwide, has made a significant contribution to global poverty alleviation. In 2013, the Chinese government put forward the concept of targeted poverty alleviation, which aims to take targeted measures to assist each truly poverty-stricken household and fundamentally eliminate the various factors leading to poverty, thus achieving the goal of sustainable poverty alleviation [8]. On the basis of this policy, this paper adopts a clustering algorithm [9] to divide the data of poverty-stricken households in China reasonably and thus identify different categories of poverty-stricken households, supporting the formulation and implementation of antipoverty measures.
Poverty-oriented scientific research depends on the analysis of poverty data. Chinese poverty data generally come from population censuses carried out by the country, society, and universities [10]. Due to the wide coverage of the population and individual differences in educational level and psychology, respondents may not answer questionnaires according to actual conditions, which results in the subjectivity of questionnaire data. Additionally, faults in processes such as data entry and storage can easily lead to outliers and missing values in datasets. Since the quality of poverty datasets obtained by population censuses is hard to guarantee, their clustering poses certain difficulties for algorithm design and application. The design of clustering algorithms for poverty datasets should therefore make reasonable allowance for the noise caused by missing values and outliers. Nowadays, common clustering methods mainly include partitional clustering, hierarchical clustering, and density-based clustering [11]. The K-means clustering algorithm achieves clustering through partitioning: it assigns each sample to the closest cluster according to the distances between samples and prototypes, updates the prototypes as the averages of the samples within clusters, and repeats these steps until the iteration ends [12]. Although the method is simple and practicable, the number of clusters and the initial prototypes need to be predefined. Agglomerative hierarchical clustering (AHC) regards each sample as a separate cluster and then repeatedly merges the two closest clusters into a new cluster [13]. The AHC algorithm requires no predefined prototypes and can obtain the hierarchical structure of clusters, but it is sensitive to noise within the data.
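The partitional baseline above can be made concrete with a short sketch (a toy illustration in plain Python, not the paper's implementation; the sample points and the fixed initial prototypes are our own, chosen to highlight the fact that K-means depends on predefined prototypes):

```python
# Minimal K-means sketch: repeatedly reassign points to the nearest
# prototype, then move each prototype to the mean of its cluster.
import math

def kmeans(points, prototypes, iters=10):
    """Lloyd's algorithm with fixed initial prototypes (a known K-means weakness)."""
    for _ in range(iters):
        # assignment step: each sample goes to its closest prototype
        clusters = [[] for _ in prototypes]
        for p in points:
            j = min(range(len(prototypes)), key=lambda j: math.dist(p, prototypes[j]))
            clusters[j].append(p)
        # update step: each prototype becomes the mean of its cluster
        prototypes = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else prototypes[j]
            for j, cl in enumerate(clusters)
        ]
    return prototypes, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
protos, clusters = kmeans(pts, [(0.0, 0.0), (5.0, 5.0)])
print(protos)  # two prototypes near (0.05, 0.1) and (5.1, 4.95)
```

Note that both the number of clusters and the starting prototypes must be supplied, which is exactly the limitation contrasted with AHC in the text.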
The density-based spatial clustering of applications with noise (DBSCAN) algorithm is a representative density-based clustering method. It defines a cluster as the maximal set of density-connected samples and takes sample regions with high densities as clusters, thus discovering clusters of arbitrary shapes [14]. However, the hyperparameters eps and minpts in the DBSCAN algorithm, i.e., the neighborhood radius and the minimum number of samples required to form a dense region, have a great influence on the clustering result, and the method is not applicable to datasets with varying density distributions. Many researchers have improved DBSCAN in view of these problems and proposed algorithms such as K-nearest neighbor DBSCAN (KNN-DBSCAN), DVBSCAN, and varied density-based spatial clustering of applications with noise (VDBSCAN) [15][16][17][18]. For instance, Gaonkar and Sawant [19] drew a k-dist graph based on the distance between each sample and its k-th nearest neighbor so as to identify multiple values of the neighborhood radius, and then found the clusters with different densities under each value of the neighborhood radius. Fahim et al. proposed an enhanced DBSCAN (EDBSCAN) algorithm, which defined a density variation for core points and specified that a core point is allowed to expand only when its density variation is less than or equal to a threshold value and its neighborhood satisfies a homogeneity index [20]. In terms of clustering methods, other researchers have proposed further advanced approaches such as robust FCM clustering [21], an improved quantum clustering algorithm [22], and a swarm clustering algorithm [23]. Chen et al. [24] proposed a fast clustering method for large-scale data. Chel et al. [25] presented the HDBSCAN clustering algorithm to find clustering patterns in calcium spiking obtained by confocal imaging of single cells. Znidi et al. [26] introduced a new methodology for discovering the degree of coherency among buses using the correlation index of the voltage angle between each pair of buses and used hierarchical density-based spatial clustering of applications with noise to partition the network into islands. Parmar et al. [27] proposed a residual error-based density peak clustering algorithm named REDPC to better handle datasets comprising various data distribution patterns. Specifically, REDPC adopts residual error computation to measure the local density within a neighborhood region. Parmar et al. [28,29] further proposed a feasible residual error-based density peak clustering algorithm with a fragment merging strategy, where the local density within the neighborhood region is measured through residual error computation and the resulting residual errors are then used to generate residual fragments for cluster formation.
Overall, the above methods suffer from low clustering efficiency and high time consumption on high-dimensional data.
Considering that clusters in real-world datasets may have different sizes, shapes, and densities, accompanied by certain noises and outliers, this paper takes the idea of initial division and hierarchical aggregation to design a clustering algorithm named hierarchical DBSCAN (HDBSCAN). The proposed method comprises two stages, division and aggregation. Our contributions are as follows: (1) First, the method makes an initial division of the dataset based on sample densities; that is, it uses the neighbor information of samples to calculate local density values and then searches the set of density-connected samples for each unlabeled core point sequentially, processing the density values in descending order, thus forming the initial clusters.
(2) Then, the method adopts the idea of hierarchical clustering to perform the aggregation of neighbor clusters. Based on the inner and border distances between clusters, the most similar clusters are regarded as neighbor clusters and merged into a new cluster, and the process is repeated until the iteration ends.
(3) Through this division-and-aggregation scheme, the method can identify clusters of different forms in the dataset. Moreover, noise data cannot be absorbed into high-density clusters because their density is relatively sparse, so the proposed method handles noise reasonably. The rest of this paper is organized as follows. Section 2 introduces two typical clustering algorithms, i.e., DBSCAN and hierarchical clustering. Section 3 describes the proposed hierarchical DBSCAN algorithm in detail. Section 4 discusses the clustering performance of the proposed method, applies it to the Chinese poverty dataset, and further analyzes the clustering result. Finally, conclusions are presented in Section 5.

Theoretical Foundation
DBSCAN Clustering.
The DBSCAN algorithm regards regions with high densities as clusters and those with sparse densities as noises. It requires two hyperparameters, i.e., the neighborhood radius eps and the minimum number of samples required to form a dense region, minpts.
Let D = {x_1, . . . , x_n} represent the dataset composed of n samples with d attributes. The eps-neighborhood of a sample x_i is defined as

N_eps(x_i) = {x_j ∈ D | dist(x_i, x_j) ≤ eps}, (1)

where dist(x_i, x_j) denotes the distance between samples x_i and x_j, calculated by the Euclidean distance

dist(x_i, x_j) = (Σ_{l=1}^{d} (x_{il} − x_{jl})^2)^{1/2}. (2)

If x_i satisfies equation (3), it is called a core point:

|N_eps(x_i)| ≥ minpts. (3)

There are several definitions in the DBSCAN algorithm, listed as follows: (1) A sample x_j is directly density-reachable from x_i with respect to eps and minpts if x_i is a core point and x_j ∈ N_eps(x_i). (2) A sample x_j is density-reachable from x_i with respect to eps and minpts if there exists a chain of samples x_{m_1}, . . . , x_{m_l} with x_{m_1} = x_i and x_{m_l} = x_j such that each x_{m_{s+1}} is directly density-reachable from x_{m_s}. (3) Two samples x_i and x_j are density-connected with respect to eps and minpts if there exists a sample x_m from which both x_i and x_j are density-reachable with respect to eps and minpts.
In the process of clustering, the algorithm randomly selects a core point as the initial point and takes all the core points in its eps-neighborhood for continuous expansion. The expansion ends when the maximal set of density-connected samples is found and labeled as one cluster. After that, the algorithm randomly chooses another unlabeled core point to generate a new cluster. The process of clustering completes when all the core points are labeled.
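The expansion process described above can be condensed into a compact sketch (illustrative plain Python, not the paper's code; note that eps and minpts are global here, which is precisely the limitation the proposed method addresses later):

```python
# Compact DBSCAN sketch: pick a core point, expand the maximal
# density-connected set, label it as one cluster, repeat.
import math

def dbscan(points, eps, minpts):
    n = len(points)
    labels = [None] * n          # None = unvisited, -1 = noise
    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < minpts:   # not a core point
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:              # expand the maximal density-connected set
            j = queue.pop()
            if labels[j] in (None, -1):
                if labels[j] is None:
                    nb = neighbors(j)
                    if len(nb) >= minpts:  # core point: keep expanding
                        queue.extend(k for k in nb if labels[k] in (None, -1))
                labels[j] = cluster
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=1.5, minpts=2))  # -> [1, 1, 1, 2, 2, 2, -1]
```

The isolated point at (50, 50) has too few neighbors within eps and is labeled as noise (-1), matching the behavior described in the text.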

Hierarchical Clustering.
Hierarchical clustering can be divided into agglomerative hierarchical clustering and divisive hierarchical clustering. Agglomerative hierarchical clustering first takes each sample as a separate cluster, then finds the two closest clusters by measuring the distance between clusters, and merges them into a new cluster. Subsequently, the algorithm recalculates the distances between clusters and continues the aggregation process. Divisive hierarchical clustering is the exact opposite: it regards the whole dataset as one cluster and then performs divisions iteratively.
In hierarchical clustering, the distance between clusters C_p and C_q can be calculated by equation (4), i.e., the average of sample distances between the two clusters:

d_avg(C_p, C_q) = (1 / (|C_p||C_q|)) Σ_{x_i ∈ C_p} Σ_{x_j ∈ C_q} dist(x_i, x_j). (4)

Besides, the minimum distance of samples between clusters shown in equation (5), or the maximum distance of samples between clusters, can also be used to measure the distance between two clusters:

d_min(C_p, C_q) = min_{x_i ∈ C_p, x_j ∈ C_q} dist(x_i, x_j). (5)
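The linkage choices in equations (4) and (5) can be sketched directly (toy clusters of our own):

```python
# The three inter-cluster distance choices: average linkage (eq. (4)),
# single/minimum linkage (eq. (5)), and complete/maximum linkage.
import math

def pairwise(cp, cq):
    return [math.dist(a, b) for a in cp for b in cq]

def d_avg(cp, cq):  # equation (4): mean of all cross-cluster distances
    d = pairwise(cp, cq)
    return sum(d) / len(d)

def d_min(cp, cq):  # equation (5): single linkage
    return min(pairwise(cp, cq))

def d_max(cp, cq):  # complete linkage
    return max(pairwise(cp, cq))

cp = [(0.0, 0.0), (0.0, 1.0)]
cq = [(3.0, 0.0), (4.0, 0.0)]
print(d_min(cp, cq), d_max(cp, cq))  # 3.0 and sqrt(17)
```

Single linkage tends to chain elongated clusters together, while complete linkage favors compact ones; average linkage is a compromise between the two.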

2.3. The Hierarchical DBSCAN Algorithm. As global hyperparameters of the DBSCAN algorithm, the numerical values of minpts and eps have a direct impact on the expansion of all the clusters. Figure 1 illustrates the expansion of clusters under different numerical values of eps, where the red points denote the initial core points in each iteration of expansion. According to Figure 1(a), the clusters C_1 and C_2 can be identified while the other samples are regarded as noises and cannot be partitioned properly if the DBSCAN algorithm takes eps_1 as the neighborhood radius. It can be seen from Figure 1(b) that all the samples are merged into one cluster C_1 through four iterations of expansion if the algorithm takes eps_2 as the neighborhood radius.
In view of the above problem, this paper adopts a division-and-aggregation scheme to design the HDBSCAN clustering algorithm. First, the proposed method makes an initial division of the dataset according to sample densities; during the expansion of each cluster, the method adaptively adjusts the neighborhood radius based on the neighbor information of samples within the cluster. Then, the idea of hierarchical clustering is adopted to perform recursive aggregation; that is, the method takes the cluster pair with the minimum distance as neighbor clusters and merges them into a new cluster. Through division and aggregation, the method can perceive clusters of different forms in the data space.

Initial Division.
During the process of initial division, the parameter k is used to calculate the local density. Let SN_k(x_i) represent the set composed of the k samples closest to x_i; the average distance between x_i and all samples in the set is

dist_k(x_i) = (1/k) Σ_{x_j ∈ SN_k(x_i)} dist(x_i, x_j). (6)

The distance dist_k(x_i) can capture the density distribution around the sample x_i: the smaller the value, the greater the density. Therefore, the local density of x_i can be defined as

den(x_i) = 1 / dist_k(x_i). (7)

The neighborhood radius of x_i, namely eps(x_i), is the distance between x_i and its maxpts-th nearest sample. The process of the initial division includes the following steps.
Step 1. Calculate the local density for each sample and then sort the samples by local density in descending order to form the sequence O. The cluster label is initialized as q = 1.
Step 2. Select an unlabeled sample x_i from the sequence O in order and set the iteration number t = 1.
Step 3. Let C_q^(t) and Q_q^(t) represent the set of samples and the sequence of core points for the q-th cluster in the t-th iteration, where C_q^(1) and Q_q^(1) are initialized with the selected sample x_i.

Step 4. Calculate the adaptive neighborhood radius eps(C_q^(t)) for the expansion of the current cluster from all samples in the cluster.

Step 5. Select a core point x_j from the sequence Q_q^(t) in order and continue the expansion based on eps(C_q^(t)).
Step 6. Calculate the set of neighbor samples to be expanded according to eps(C_q^(t)) and add the unlabeled samples in it to the cluster.

Step 7. Form the sequence of core points Q_q^(t+1) from the newly added samples once all the core points in Q_q^(t) have been processed.

Step 8. The expansion of the q-th cluster C_q is completed if Q_q^(t+1) = ∅; it then proceeds to Step 9. Otherwise, it sets t = t + 1 and returns to Step 4.
Step 9. The initial division ends if all the samples are labeled. Otherwise, the cluster label is set as q = q + 1 and the procedure returns to Step 2.
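The density ordering underlying Steps 1-2 can be sketched in a few lines (our own variable names; the adaptive-radius expansion of Steps 3-8 is omitted for brevity):

```python
# Local density as the reciprocal of the mean distance to the k nearest
# neighbours, and the sequence O of samples in descending density order.
import math

def local_density(points, k):
    dens = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dens.append(1.0 / (sum(d[:k]) / k))   # den(x_i) as defined in the text
    return dens

def density_order(points, k):
    # the sequence O: sample indices in descending order of local density
    dens = local_density(points, k)
    return sorted(range(len(points)), key=lambda i: -dens[i])

pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
order = density_order(pts, k=2)
print(order)  # the isolated sample (5, 5) has the lowest density and comes last
```

Processing samples in this order means clusters start growing from the densest regions, so sparse noise points are only reached at the end and cannot seed high-density clusters.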

Aggregation of Neighbor Clusters.
In this paper, the similarity between clusters is measured by border distance and inner distance. Figure 2 takes the clusters C p ′ and C q ′ during the aggregation as an example to describe two kinds of distances. In Figure 2, the red points denote the core points and the grey ones denote the border points distributed around the clusters.
Suppose that the dataset can be represented by D = {C_1, . . . , C_K} after the initial division, where K denotes the number of clusters and C_i (i = 1, . . . , K) denotes the i-th initial cluster. The neighbor clusters are merged to form new clusters continuously during the aggregation. A merged cluster C_p′ is described by its samples together with their neighborhood radii, where eps^(t) denotes the neighborhood radius at the completion of division for x_i; this value changes dynamically due to the adaptive adjustment of the neighborhood radius. According to Figure 2(a), the border distance between clusters C_p′ and C_q′ is the minimum distance between the border points of the two clusters. As can be seen from Figure 2(b), the cluster C_p′ consists of four initial clusters, and thus the inner distance of the cluster is defined over the distances between the initial clusters merged to form it. During the aggregation, the two clusters with the minimum border distance are considered as neighbor clusters for further merging if their difference of inner distances and that of densities are below certain limitations. Algorithm 1 is a simple implementation of aggregation for neighbor clusters. In the actual implementation of the algorithm, values such as border distances and inner distances are stored to avoid repeated calculation. As shown in Algorithm 1, two clusters are involved in calculating neighbor clusters only when their density difference, border distances, and inner distances satisfy certain conditions. The proposed HDBSCAN clustering algorithm can capture clusters with different forms in the data space. The aggregation of neighbor clusters weakens the sensitivity of the algorithm to the hyperparameters used in the initial division. Besides, the result of the division in the DBSCAN algorithm depends on the selection sequence of initial core points; the proposed method can weaken the fluctuation caused by this selection sequence to some extent. Algorithm 2 summarizes the whole process.
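The aggregation loop can be illustrated with a stripped-down sketch (our own simplification: only the border-distance criterion with a distance threshold `eps` is checked here, whereas the paper's Algorithm 1 additionally compares inner distances and densities before merging):

```python
# Repeatedly merge the pair of clusters with the smallest border distance,
# stopping when no pair is closer than the threshold.
import math

def border_distance(cp, cq):
    # minimum distance between samples of the two clusters (Figure 2(a) idea)
    return min(math.dist(a, b) for a in cp for b in cq)

def aggregate(clusters, eps):
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        # find the pair with the minimum border distance
        pairs = [(border_distance(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
        d, p, q = min(pairs)
        if d > eps:          # no sufficiently close neighbor clusters remain
            break
        clusters[p].extend(clusters.pop(q))
    return clusters

initial = [[(0, 0), (0, 1)], [(0, 2), (0, 3)], [(10, 10)]]
merged = aggregate(initial, eps=1.5)
print(len(merged))  # the two adjacent clusters merge; the distant one stays: 2
```

Caching the pairwise border distances, as the text notes for the actual implementation, avoids recomputing them on every pass of the loop.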

Datasets.
Three public artificial datasets and several real-world datasets are chosen to verify the effectiveness of the proposed clustering algorithm. The description of the artificial datasets is listed in Table 1, and their visualization is shown in Figure 3. The description of the real-world datasets is listed in Table 2, where Banknote, Parkinson, Codon usage, HCV, and Planning relax are taken from the UCI machine learning repository, and CFPS2016 is the dataset of poverty-stricken households in China. The CFPS2016 dataset comes from the China Family Panel Studies (CFPS) released by the Institute of Social Science Survey of Peking University, China, in 2016. In the experiment, the CFPS2016 dataset consists of 14019 samples and 320 attributes, which cover the family economy as well as the states of adults and children in health, education, and psychology. Hence, the CFPS2016 dataset can reflect the status of each Chinese household objectively.

ALGORITHM 1: Aggregation of neighbor clusters.
(1) Input: clusters after initial division D = {C_1, . . . , C_K}; the threshold ε; den(x_i), i = 1, . . . , n
(2) Output: final clusters after aggregation D = {C_1′, . . . , C_{K′}′}
(3) min_O ← +∞, combine ← ∅
(4) While True
(5)   Calculate average_den_diff as the average of density differences between clusters
(6)   For each cluster C_p′ in D
(7)     For each cluster C_q′ in D with q ≠ p
(8)       Calculate den(C_p′) and den(C_q′) as the averaged densities of samples in the clusters
(9)       Calculate the border distance O, the density difference den_diff, and the inner-distance difference dist_diff
(10)      If den_diff < average_den_diff and dist_diff < ε and O < min_O
(11)        min_O ← O, combine ← (C_p′, C_q′)
(12)      End If
(13)    End For
(14)  End For
(15)  If combine ≠ ∅, merge the clusters in combine and reset min_O ← +∞, combine ← ∅
(16)  Else Break
(17) End While

ALGORITHM 2: The proposed clustering method, i.e., the initial division procedure followed by Algorithm 1.
During the data preprocessing, we fill in missing values with the K-nearest neighbor imputation method [30]; then 1778 poverty-stricken households are identified out of the 14019 Chinese households based on the Alkire-Foster method, the main measurement method of multidimensional poverty [31]. The shared parameters in this experiment are set the same as those of DBSCAN under the same experimental platform.
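K-nearest neighbor imputation as in [30] can be sketched as follows (a simplified single-column illustration; `knn_impute` and the toy rows are our own, not the paper's preprocessing code):

```python
# Fill each missing entry with the mean of that column over the k rows
# closest to it in the fully observed columns.
import math

def knn_impute(rows, k=2):
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is None:
                # distance over columns observed in both rows
                def dist(r):
                    return math.sqrt(sum((a - b) ** 2
                                         for c, (a, b) in enumerate(zip(row, r))
                                         if c != j and a is not None and r[c] is not None))
                donors = [r for r in rows if r[j] is not None]
                donors.sort(key=dist)
                filled[i][j] = sum(r[j] for r in donors[:k]) / k
    return filled

data = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.05, None]]
print(knn_impute(data, k=2)[3][1])  # imputed from the two nearest rows: 11.0
```

Unlike mean imputation over the whole column, the KNN variant preserves local structure: the missing value is borrowed only from households that look similar on the observed attributes.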

Evaluation Metrics.
We take the silhouette coefficient (SC) [32], Davies-Bouldin index (DBI) [33], adjusted Rand index (ARI), and normalized mutual information (NMI) [34] to measure the performance of clustering. The silhouette coefficient is defined by

SC = (1/n) Σ_{i=1}^{n} (b(i) − a(i)) / max{a(i), b(i)}, (8)

where n denotes the total number of samples; a(i) denotes the average distance between the sample x_i and all other samples in its cluster, which reflects the cohesiveness of clustering; and b(i) denotes the minimum of the average distances between the sample x_i and all samples in any other cluster, which reflects the dispersity of clustering. A larger SC represents higher clustering performance. Besides, the Davies-Bouldin index is defined as

DBI = (1/K′) Σ_{i=1}^{K′} max_{j≠i} (S_i + S_j) / ‖w_i − w_j‖_2, (9)

where K′ denotes the number of clusters; S_i and S_j denote the average distances between all the samples within clusters i and j and the respective cluster centroids; and ‖w_i − w_j‖_2 denotes the distance between cluster centroids. A smaller DBI denotes higher clustering performance. The adjusted Rand index (ARI) and normalized mutual information (NMI) are also used for evaluation. ARI is a chance-adjusted similarity measure between two clusterings and is related to accuracy, while NMI quantifies the amount of information one clustering provides about the other (i.e., their mutual dependence). When observations are identified as noise, each noise observation is treated as a distinct singleton cluster for both ARI and NMI.
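The silhouette computation can be written directly from the definitions of a(i) and b(i) (a plain-Python sketch; in practice library routines such as scikit-learn's `silhouette_score` would be used):

```python
# Silhouette coefficient from first principles:
# a(i): mean distance to the other samples in x_i's own cluster;
# b(i): smallest mean distance from x_i to the samples of another cluster.
import math

def silhouette(points, labels):
    scores = []
    clusters = set(labels)
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own)
        b = min(sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
                / sum(1 for l in labels if l == c)
                for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)   # SC: mean silhouette over all samples

pts = [(0.0, 0.0), (0.0, 0.2), (5.0, 5.0), (5.0, 5.2)]
print(round(silhouette(pts, [0, 0, 1, 1]), 3))  # well-separated clusters: close to 1
```

Swapping one point into the wrong cluster drives its b(i) below a(i), pushing SC toward negative values, which is why a larger SC indicates better clustering.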

Compared Methods.
This paper compares the proposed method with five existing clustering algorithms, described as follows: (1) AHC: as described in Section 2.2, the method regards every sample as a separate cluster and then merges the two closest clusters continuously until the iteration ends.
(2) DBSCAN: as described in Section 2.1, the method performs continuous expansion for each cluster based on core points and thus takes regions with high densities as clusters and those with low densities as noises.
(3) EDBSCAN: the method calculates the density variation for each core point and specifies that a core point is allowed to expand only when its density variation is below a specified threshold and its neighborhood satisfies the homogeneity index [35].
(4) NS-DBSCAN: the NS-DBSCAN algorithm uses a strategy similar to that of the DBSCAN algorithm. Furthermore, it provides a new technique for visualizing the density distribution and indicating the intrinsic clustering structure [36].
(5) ADBSCAN: unlike many other algorithms that estimate the density of each sample using different kinds of density estimators and then choose core samples based on a threshold, ADBSCAN utilizes the inherent properties of the nearest neighbor graph [37].

Artificial Datasets and Real-World Datasets from UCI.
First, we conduct experiments on the effect of ε on the local sensitivity, as shown in Figure 4. The selected ε is then used in the following experiments to provide an equitable comparison. From Figure 4, we can see that when ε is 0.5, the local sensitivity is small and the effect of the proposed method is better; therefore, we select ε = 0.5 in this paper. The clustering results of the three artificial datasets based on the proposed method are shown in Figure 5, where each region with a different color can be regarded as one cluster. According to Figures 5(a), 5(c), and 5(e), the datasets are cut into several regions with different densities after the initial division. As can be seen from Figures 5(b), 5(d), and 5(f), the adjacent regions with similar densities aggregate continuously during the aggregation of neighbor clusters, which contributes to the ideal clustering results. In Figure 5(f), some discrete points are distributed around the four large clusters.
The proposed method identifies these points as noises since there exist certain differences between the densities of the discrete points and those of the clusters around them.
The metric values for three UCI datasets obtained by the compared methods are shown in Table 3, in which the optimal results are bolded and the suboptimal results are italicized.
According to Table 3, all the SC values obtained by the proposed HDBSCAN are better than those obtained by the other methods, and the method also achieves ideal DBI values. For instance, on the Parkinson dataset, the SC value of HDBSCAN is 8.91% higher than that of the suboptimal method AHC. Although the DBI value of HDBSCAN is suboptimal, it is only 2.63% worse than that of EDBSCAN.
The above results indicate that the proposed HDBSCAN has ideal clustering performance. Table 3 also shows the ARI performance of the different methods on the artificial datasets. From these results, HDBSCAN ranks first on these datasets. More importantly, in each case HDBSCAN is able to identify the underlying classes of each dataset, whereas each of the other approaches fails at this task in at least one case.

The Dataset of Poverty-Stricken Households in China.
We perform clustering on the 1778 poverty-stricken households of CFPS2016 so as to identify different categories of poverty-stricken households. Table 4 shows the metric values for CFPS2016 obtained by the compared methods, where the optimal results are bolded and the suboptimal results are italicized. Table 4 also shows the NMI performance results on the same datasets and clustering approaches; here, the ranking of HDBSCAN is identical to that discussed with respect to ARI.
We also compare accuracy with the other methods. The results are the average values shown in Table 5.
It can be seen from Table 5 that the values of SC and DBI obtained by HDBSCAN are better than those obtained by the other compared methods. Therefore, the proposed method has ideal clustering performance on the CFPS2016 dataset. The clustering result based on HDBSCAN is listed in Table 6.
According to Table 6, the proposed method divides CFPS2016 into 10 clusters and identifies 70 noises. Additionally, the numbers of households within different clusters are distributed unevenly. For instance, the number of households in Cluster 1 is 382 while those in Cluster 9 and Cluster 10 are 61 and 34, respectively. To evaluate the rationality of the clustering result, we adopt the random forest algorithm to calculate the importances of attributes in the ten clusters and thus analyze the characteristics of each cluster. Specifically, based on the labels generated by HDBSCAN clustering, we take each cluster as the positive class and the other clusters as the negative class to construct multiple binary classification models, thereby mining the important attributes within each cluster. The characteristics of Cluster 9 are as follows: (1) the average age of adults in the household is 76; (2) almost every household member has no pension insurance. Besides, the characteristics of Cluster 10 are as follows: (1) the annual per capita income of the household is 35,914 yuan, 1.43 times higher than the average level.
(2) More than half of the members use computers. The living standard of households in Cluster 10 is relatively high compared with other clusters, and Cluster 10 accounts for a small proportion of poverty-stricken households. According to the above analysis, the causes of poverty and the characteristics of most households are similar, so the numbers of households in some clusters are large, whereas the characteristics of a few poverty-stricken households are clearly different from the others, which leads to small numbers of households in clusters such as Cluster 9 and Cluster 10. Figure 6 shows the distribution of attribute importances in each cluster, where the abscissa values indicate the indices of the 320 attributes and the ordinate values indicate the attribute importances; the ten curves represent the distributions of attribute importances in the ten clusters.
As can be seen from Figure 6, the distributions of attribute importances represented by the ten curves differ markedly from each other. For instance, the attribute with the highest importance in Cluster 7 is the 165th-dimensional attribute, which denotes the stage of schooling for household members at the last survey, while that in Cluster 8 is the 218th-dimensional attribute, which denotes the total post-tax annual income from work. This phenomenon shows that poverty-stricken households in different categories differ in their characteristics and causes of poverty. Therefore, the proposed method can identify the commonalities and differences in poverty effectively. Finally, for all the datasets, we conduct computational complexity experiments with the different methods. The results are shown in Table 7. Because the proposed method is a hierarchical DBSCAN algorithm based on initial division and aggregation of neighbor clusters, its running time is higher than that of traditional DBSCAN; however, it is lower than that of the other newer methods.
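The one-vs-rest importance analysis described above can be sketched as follows with scikit-learn's RandomForestClassifier (assumed available); the data and cluster labels here are synthetic stand-ins for CFPS2016 and the HDBSCAN output:

```python
# One-vs-rest attribute importance: for each cluster, fit a binary
# random forest (this cluster vs. all others) and read off
# feature_importances_, mirroring the analysis behind Figure 6.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# toy data: attribute 0 separates cluster 0 from the rest, attribute 1 is noise
X = rng.normal(size=(200, 2))
labels = (X[:, 0] > 0).astype(int)          # stand-in for HDBSCAN cluster ids

for cluster in np.unique(labels):
    y = (labels == cluster).astype(int)     # this cluster vs. all others
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(cluster, rf.feature_importances_) # attribute 0 dominates
```

On the real data, the top-ranked attributes per cluster (e.g., the 165th and 218th dimensions noted above) are what characterize each category of poverty-stricken household.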

Conclusions
This paper designs a hierarchical DBSCAN algorithm based on the initial division and aggregation of neighbor clusters. First, the proposed method HDBSCAN adopts an adaptive neighborhood radius to perceive regions with different densities and thus makes the initial division of the dataset.
Then, iterative aggregation is performed on neighbor clusters according to the border and inner distances. Experiments on artificial datasets and UCI real-world datasets indicate that HDBSCAN has ideal clustering performance. Additionally, HDBSCAN divides the dataset of Chinese poverty-stricken households, namely CFPS2016, into 10 clusters, and experimental results verify the rationality of the clustering result. The main reasons for the ideal performance of HDBSCAN lie in two aspects. First, the adaptive neighborhood radius helps to identify regions of different densities in data spaces with imbalanced density distributions. Second, the aggregation further merges neighbor clusters with similar densities, which effectively weakens the impact of the accuracy of the initial partition on the clustering performance. However, when the dimensionality of the dataset is very high, the clustering effect degrades. In the future, more research will be conducted on the clustering result of the CFPS2016 dataset. Specifically, we will study the characteristics of poverty-stricken households in each category so as to support the formulation and implementation of antipoverty measures. The advanced clustering technology will also be applied in targeted poverty alleviation in poverty-stricken counties of China.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.