Two-Stage User Identification Based on User Topology Dynamic Community Clustering

To solve the problem of node information loss during user matching in existing fixed-community methods for cross-social-network user identification based on user topological relationships, the Two-Stage User Identification Based on User Topology Dynamic Community Clustering (UIUTDC) algorithm is proposed. Firstly, we perform community clustering on different social networks, calculate the similarity between the communities of different networks, and screen out community pairs with greater similarity. Secondly, two-way marriage matching is carried out for users between pairs of communities with high similarity. Then, dynamic community clustering is performed by resetting the number of community clusters. Finally, the iteration is repeated until no new matching user pairs are generated or the set number of iterations is reached. Experiments conducted on the real-world Twitter-Foursquare social network datasets demonstrate that, compared with the global user matching method and the hidden label node method, the average accuracy of the proposed UIUTDC algorithm is improved by 33% and 26.8%, respectively. Using only user topology information, the proposed UIUTDC algorithm effectively improves the accuracy of identity recognition in practical applications.


Introduction
With the rapid development of artificial intelligence technology, its scope of application continues to expand. Artificial intelligence technology represented by deep learning and breadth learning is becoming increasingly mature. Data integration is the premise and foundation of breadth learning. To achieve multisource data integration, user identification across social networks has become a valuable research hotspot.
Social networks connect users on the Internet, allowing them to communicate and interact, and form virtual social behavior similar to that in reality. According to statistics, 42% of users have more than one social network account, and 93% of Instagram users use Facebook at the same time [1]. Different social network platforms have different functions, and these platforms are independent of each other. User information is scattered across different networks, and the information of the same real user cannot be shared between them. Each network forms an "island," which makes it impossible to integrate data between networks. To break down these information "islands" and achieve multisource data integration, cross-social network user identification is a necessary premise and basis. Cross-social network user identification has strong research value and practical significance in many fields such as user portraits, commercial advertising, friend recommendation, and maintenance of online public opinion security. At present, cross-social network user identification methods mainly include methods based on user attribute information, user behavior information, and user topology information, as well as methods that integrate these three kinds of feature information. In social networks, user topology, namely, friend relationships, is authentic and difficult to forge [2]. Therefore, this article uses user topology information for user identification.

Related Work
Among the methods based on user attributes, Zafarani et al. [3] first proposed the method of user name mapping to identify users. Peritio et al. [4] proposed a method to calculate the similarity between user names based on the uniqueness of user names. Liu Dong et al. [5] extracted the hidden features of usernames from multiple angles and integrated the statistical results of the probability distribution of various features to infer the identity of the corresponding usernames. Vosecky et al. [6] first proposed using vectors to represent user profile information and then calculated the similarity between the vectors. Although the accuracy of user attribute-based methods is very high, attribute information is private to the user and is therefore difficult to obtain. In addition, because users are highly aware of network security, they may provide wrong content when filling in attribute information. Therefore, the universality of user attribute-based methods is not high [7].
Among the methods based on user behavior information, Kong and Zhang et al. [8] match users by calculating the similarity of users in different networks with respect to time, space, and text information. Liu et al. [9] proposed a method to identify users by integrating information such as the user's published content, writing habits, and behavior trajectory. Roedler et al. [10] established the user's own unique social behavior pattern by using the time information carried by the social network and the geographical information recorded by the device, which was used as an identification mark. The methods based on user behavior information face the problem that the user's geographic and spatial information is sparse in social networks, so they are difficult to apply to large-scale social networks.
Among the methods based on user topology information, Narayanan et al. [11] showed for the first time that users can be identified using only user relationships, relying on a small number of initial matching seed nodes and iterative updating to continuously find new nodes, but the recognition accuracy of this method is not high. Nitish et al. [12] proposed an identification algorithm for multiple social networks based on the node degree and the number of common neighbors. Zhou et al. [13] took the number of seed nodes shared by the nodes to be matched as the cross-network similarity of the nodes and selected those with greater similarity for matching. However, when only user topology information is used and there are many nodes, both efficiency and accuracy are low.
Among the methods based on multidimensional information fusion, Peled et al. [14] extracted features from two aspects, user topology and user attributes, established a 27-dimensional feature vector, and finally judged whether the user identities match through the similarity of the feature vectors. Liu et al. [9] used three-dimensional information to train the model in a semisupervised learning manner to complete the matching. Zhang et al. [15] also used the information of all the above three dimensions. First, the network structure information was used to select the set of potential matching nodes for the users to be matched, and then the user name and spatiotemporal trajectory were used to train the classifier, which is a kind of unsupervised learning algorithm. Xing et al. [16] first used entropy to assign weights to user name features, then analyzed user interests, and combined the user name with the user's published content to identify users across social networks. These algorithms fully consider multidimensional information, so their overall performance is better. Although the methods based on multidimensional information fusion are effective, it is difficult to obtain comprehensive data in specific social networks. Moreover, the multidimensional information model is complex, difficult to build, inefficient, and prone to "overfitting" when the amount of data is insufficient.
Although the method based on user topology information is not efficient and accurate, the user topology information is authentic and difficult to forge. Wu Zheng et al. [17] used potential relationship information to improve the recognition of nodes to be matched by clustering social networks. However, the clustering of communities in this method is fixed, resulting in the loss of information of nodes outside the community, and the efficiency and accuracy are not high.
Based on the above literature analysis, this paper proposes a dynamic community clustering two-stage user identification algorithm based on user topological relationships to solve the problem of node information loss during user matching between fixed communities. Firstly, we perform community clustering on different social networks, calculate the similarity between the communities of different networks, and filter out the community pairs with greater similarity. Secondly, we match users between these community pairs on different networks. Finally, we add the matched node pairs to the seed node user pairs, reset the number of community clusters (decreasing by a certain level), redo the dynamic community clustering, and then match users in the community pairs with greater similarity again. The iteration is repeated until no new matching user pairs are generated or the set number of iterations is reached.

Related definitions.
Identity recognition based on user relationships uses user topological structure relationships to identify the accounts of the same natural person on different community platforms. A formal description of this problem is as follows.
Definition 1 (cross-social network user matching). There are two different social network platforms, G_A and G_B, where G_A = (U_A, E_A) and G_B = (U_B, E_B). U_A and U_B represent the sets of all users in social networks A and B, respectively, and E_A and E_B represent the sets of user topological relationships in social networks A and B, respectively. The cross-social network user matching relation M is a pair of users belonging to the same natural person in the A and B networks.
Definition 2 (known user matching node pair). A known user matching node is a network user matching node that is found in advance through specific methods such as URL address information. This article uses Seed_User to indicate known user matching node pairs.

UIUTDC Algorithm Principle.
Since there are many users in social networks and friend relationships are relatively complicated, computing the similarity of user nodes one by one across the networks is very costly. Apart from a small number of friends (compared to the number of users in the entire social network), most other users rarely contact a given user. According to the principle that things cluster by kind and people gather in groups, a user and his friends are likely to be in the same cluster (community) in a social network, and the clusters that the same real user belongs to in different social networks have a large degree of similarity. To address the node information loss caused by fixed community clustering, the UIUTDC algorithm first clusters the different networks using multiple rounds of dynamic community clustering. A different number of communities is set in each round (decreasing by a certain level), so that the networks are clustered from different angles, the entire network is covered, and users are matched more fully over multiple iterations. After each round of community clustering, the similarity of the communities in different networks is calculated, as shown in Figure 1(a), and the community pairs with greater similarity are filtered out. Then the similarity of the user nodes within those community pairs is calculated, and the node pairs with higher similarity are taken as the matching user pairs, as shown in Figure 1.

TSUIBUTDC Algorithm Framework.
The specific framework of the UIUTDC algorithm is shown in Figure 2.
Firstly, we initialize the number of community clusters in social networks A and B, perform community clustering on each network, and calculate and filter out the community pairs with greater similarity between A and B. Secondly, for each community pair with greater similarity, we pair each user in the A-network community with each user in the B-network community, and calculate and screen out the user pairs with high similarity. The user pairs with high similarity are bidirectionally matched, and the matched nodes are added to the seed node pair set. This loop continues until all communities in the community pairs with greater similarity in the A and B networks have been matched. We then judge whether the iteration is over, that is, whether the maximum number of iterations has been reached or the algorithm has converged (the maximum number of iterations is obtained through experiments, and convergence means that no new seed nodes are generated). If so, the newly generated seed node pairs are output, and the program ends. Otherwise, we reset the number of clustering communities of the A and B social networks (decreasing by a certain level) and repeat the above process of community clustering, screening of similar community pairs, and matching of user nodes until the maximum number of iterations is reached or no new seed node pairs are generated.
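The control flow of the framework can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the names iterate_matching, match_round, k_start, and k_step are hypothetical, match_round stands in for one full round of clustering, community-pair screening, and two-way matching, and the lower bound of 2 communities is an assumption.

```python
from typing import Callable, Set, Tuple

Pair = Tuple[str, str]  # (user in network A, user in network B)


def iterate_matching(
    match_round: Callable[[int, Set[Pair]], Set[Pair]],
    seeds: Set[Pair],
    k_start: int,
    k_step: int,
    max_iters: int,
) -> Set[Pair]:
    """Repeat cluster-then-match rounds until convergence or max_iters.

    match_round(k, known) is a hypothetical callback that clusters both
    networks into k communities and returns the user pairs it matched.
    """
    known = set(seeds)
    k = k_start
    for _ in range(max_iters):
        found = match_round(k, known) - known
        if not found:  # converged: no new seed node pairs were generated
            break
        known |= found  # matched pairs become seeds for the next round
        k = max(2, k - k_step)  # assumed floor; fewer communities next round
    return known - set(seeds)  # only the newly generated pairs
```

Passing the round as a callback keeps the stopping rule (no new pairs, or the iteration limit) separate from the choice of clustering method, which the text leaves open.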

Calculate and Filter Out the Community Pairs with Greater Similarity between A and B Networks.
The calculation of community similarity is based on the common a priori seed node relations in the communities. The calculation is shown in formula (1), where c_i^A and c_j^B represent the ith community in social network A and the jth community in social network B, respectively. c_i^A_SU represents the a priori seed nodes of the ith community in social network A, and c_j^B_SU represents the a priori seed nodes of the jth community in social network B.
In order to store the community pairs with greater similarity, we design the Com_pair set. Its element data structure includes the community pair sequence number attribute Com and the community pair similarity attribute Sim, where the Com attribute contains Com_A and Com_B. Com_A stores the community ordinal number of the A network, and Com_B stores the community ordinal number of the B network. The Sim attribute stores the similarity between the A network community and the B network community. The structure is shown in Figure 3.
Com_pair[m].(Com_A, Com_B) represents the mth similar community pair in Com_pair, and Com_pair[m].Sim represents the similarity of the mth similar community pair. The community pairs with greater similarity between A and B networks are calculated as follows:
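A minimal sketch of this screening step is given below. It assumes, as one possible reading of the description, that the similarity of two communities is the number of a priori seed pairs with one endpoint in each community; the exact form of formula (1) may differ. The function name similar_community_pairs and the threshold min_sim are hypothetical, and Com_pair is modelled as a list of dictionaries with the Com and Sim attributes described above.

```python
from typing import Dict, List, Set, Tuple


def similar_community_pairs(
    comms_a: List[Set[str]],
    comms_b: List[Set[str]],
    seed_user: Set[Tuple[str, str]],
    min_sim: int = 1,
) -> List[Dict]:
    """Build Com_pair: community pairs whose shared seed count >= min_sim."""
    com_pair = []
    for i, ca in enumerate(comms_a):
        for j, cb in enumerate(comms_b):
            # Assumed similarity: seed pairs (a, b) with a in ca and b in cb.
            sim = sum(1 for a, b in seed_user if a in ca and b in cb)
            if sim >= min_sim:
                com_pair.append({"Com": (i, j), "Sim": sim})
    # Most similar community pairs first.
    com_pair.sort(key=lambda e: e["Sim"], reverse=True)
    return com_pair
```

With this layout, com_pair[m]["Com"] plays the role of Com_pair[m].(Com_A, Com_B) and com_pair[m]["Sim"] that of Com_pair[m].Sim.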

Calculate and Filter Out User Pairs with Greater Similarity between Social Networks A and B.
After obtaining the community pairs with high similarity, the user similarity between communities with high similarity in different networks is calculated; that is, the ratio of the number of the same prior seed nodes in the neighbor nodes of two users to the total number of the neighbor nodes of the two users is calculated. The specific calculation is shown in formula (2), where u_i^A_N represents the set of neighbor nodes of the ith node of the community in the A network, and u_j^B_N represents the set of neighbor nodes of the jth node of the community in the B network. NCSU represents the number of common seed node pairs in the neighbor nodes.
In order to store user pairs with high similarity and their similarity, the User_sim set is designed. Its element data structure includes the User attribute and the Sim attribute. The User attribute contains user[0] and user[1]: user[0] stores the A network user node, and user[1] stores the B network user node. Sim stores the similarity between the A network user node and the B network user node. The structure is shown in Figure 4. User_sim[k].user[0] represents the A network user in the kth user pair of the User_sim set, User_sim[k].user[1] represents the B network user in the kth user pair of the User_sim set, and User_sim[k].sim represents the similarity of the kth user pair in the User_sim set.
Calculate and filter out user pairs with greater similarity between social networks A and B:
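The ratio described for formula (2) can be sketched as below. The sketch assumes that the "total number of the neighbor nodes of two users" is the sum of the two neighbor-set sizes; a union-based count would be another reading. The function name user_similarity is hypothetical.

```python
from typing import Set, Tuple


def user_similarity(
    nbrs_a: Set[str],
    nbrs_b: Set[str],
    seed_user: Set[Tuple[str, str]],
) -> float:
    """Common seed pairs among the neighbors, divided by total neighbors."""
    # NCSU: seed pairs (a, b) with a a neighbor in A and b a neighbor in B.
    ncsu = sum(1 for a, b in seed_user if a in nbrs_a and b in nbrs_b)
    total = len(nbrs_a) + len(nbrs_b)  # assumed denominator (sum of sizes)
    return ncsu / total if total else 0.0
```

For example, two users with three neighbors each, two of which form known seed pairs, would score 2/6 under this assumption.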

User Pair Two-Way Matching.
Considering the accuracy of user-pair matching, this paper uses a user-pair screening mechanism (user two-way matching). A pair is accepted only when the user in the B network is the most similar candidate for the user in the A network and, conversely, the user in the A network is the most similar candidate for the user in the B network. The user pairs that have the highest similarity in both directions are selected as the result, and the rest wait for matching.
Figure 5 shows the two-way matching process. After two-way matching, two user matching pairs are generated, and the remaining two users wait for the next match. User matching is mainly judged by the similarity of the user pairs. We sort the similar user pairs in User_sim obtained in Section 3.3.2 by similarity sim from large to small and match the sorted similar users. The two-way matching process is as follows:
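One way to express the two-way rule in code is the mutual-best-match sketch below. It assumes User_sim is a list of dictionaries with "user" and "sim" keys mirroring the structure described earlier, and that ties are broken by sorted order; both choices, and the function name two_way_match, are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def two_way_match(user_sim: List[Dict]) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where each user is the other's best candidate."""
    # Sort from large to small similarity, as described in the text.
    ranked = sorted(user_sim, key=lambda e: e["sim"], reverse=True)
    best_for_a: Dict[str, str] = {}
    best_for_b: Dict[str, str] = {}
    for entry in ranked:
        a, b = entry["user"]
        # The first occurrence in ranked order is the best candidate.
        best_for_a.setdefault(a, b)
        best_for_b.setdefault(b, a)
    # Keep only pairs that agree in both directions; the rest wait.
    return [(a, b) for a, b in best_for_a.items() if best_for_b[b] == a]
```

Users whose best candidate prefers someone else are left unmatched, which corresponds to waiting for the next round.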

Experimental Dataset.
In this paper, the Twitter-Foursquare [18] dataset is selected for the experiment. First, the user's homepage on Twitter is found through the URL link on the user's Foursquare homepage to determine the seed nodes. Then, the two social networks are each processed according to node degree, and the nodes whose degree is less than 1 are deleted. The dataset is shown in Table 1, which gives the relevant information of the two real-world social network datasets; the number of anchor links between the two social networks is 1862. Here, the nodes connected by anchor links in the two networks are regarded as seed nodes. The percentage of seed nodes in Twitter and Foursquare is 69.6% and 61.7%, respectively.
Only the 1862 anchor-linked node pairs are known to match. Therefore, apart from the prior seed nodes, it can only be judged how many matching node pairs are found among these 1862 node pairs. It is impossible to determine whether a node pair that is not judged as a matching node pair is correct, so this article uses only the accuracy rate (that is, how many seed node pairs are correctly found out of the 1862 seed node pairs) as the evaluation criterion. The specific calculation is shown in formula (3), where Acc represents the accuracy rate, F_seed represents the number of matching node pairs found in the final iteration, SU represents the number of known user matching node pairs, that is, the 1862 anchor link (matching) node pairs in this experiment, and R_seed is the number of prior seed node pairs randomly selected from the 1862 matching node pairs.

Evaluation
This paper randomly selects 100 and 200 prior seed node pairs and uses the global user matching method (GUMM), the hidden label node method (HLNM), and the UIUTDC algorithm to conduct experiments. The comparative analysis of the results is as follows.

Comparison and Analysis of 100 Prior Seed Node Pairs Iterative Matching Node Pairs.
As shown in Figure 6, when 100 a priori seed nodes are randomly selected, the number of node pairs matched by the UIUTDC method proposed in this paper is far greater than that of the hidden label node method and the global node matching method from the beginning of the iteration to the end. The reason is that the UIUTDC method uses dynamic community division and matches in two stages, which makes the matching process more comprehensive and the matching result better. The number of node pairs matched by the hidden label node method in the early stage is less than that of the global matching method. The reason is that the hidden label node method selects nodes in descending order of degree (that is, the number of friends the user has in the social network). In the early stage, nodes with a larger degree are selected to participate in the matching, and there are few such nodes in the network, so fewer nodes participate in the matching and the result is lower than that of the global matching method. Only from the fifth iteration, when more nodes participate in the matching process, does the hidden label node method match more nodes than the global matching method.

Comparison and Analysis of 200 Prior Seed Node Pairs Iterative Matching Node Pairs.
As shown in Figure 7, when 200 seed nodes are randomly selected, the number of node pairs generated by the UIUTDC algorithm is always greater than that generated by the hidden label node method and the global matching method as the number of iterations increases. The hidden label node method generates fewer seed nodes than the global matching method before the seventh iteration. The reason is that the hidden label node method selects the nodes participating in the matching in descending order of degree, so fewer nodes participate in the matching in the early stage and the result is lower than that of the global matching method. From the seventh iteration, all the nodes in the hidden label node method participate, and the result is better than that of the global matching method.

Comparison and Analysis of Accuracy of Different Methods.
The accuracy graphs for different numbers of seed nodes in Figure 8 show, on the one hand, that the accuracy of the UIUTDC method is much higher than that of the global user matching method and the hidden label node method.
On the other hand, when there are few seed nodes, the accuracy of the global matching method and the hidden label node method is not very high. However, as the number of seed nodes increases, the accuracy of the UIUTDC method, the global user matching method, and the hidden label node method all improve, which shows that prior seed nodes have a certain influence on the experimental results. The more seed nodes there are, the higher the accuracy rate, and the effect is most obvious for the hidden label node method. The experimental results show that the average accuracy of the UIUTDC method is 42.33%, which is 33% higher than the average accuracy of the global user matching method and 26.8% higher than the average accuracy of the hidden label node method.

Comparison and Analysis of Time Consumption of Different Methods.
This paper uses the same computer to run the UIUTDC method, the global user matching method, and the hidden label node method on the real network dataset and obtains the running time comparison chart for different numbers of seed nodes, as shown in Figure 9. The time comparison graph shows that the overall time consumption of the UIUTDC method is much lower than that of the global user matching algorithm and the hidden label node method. That is, its time complexity is better than that of both. The global user matching method requires the users in the two social networks to perform a one-to-one matching calculation (assuming that both networks have n user nodes), so its time complexity is O(n^2). In the UIUTDC method, the number of nodes in a cluster is much smaller than that of the entire network (assuming the number of clusters is k and the average number of nodes in each cluster is l, with l = n/k), which greatly reduces the overall cost of similarity calculation for all nodes in the network. The cost includes two parts: one is the cluster similarity calculation, in which the k clusters are matched against each other at a cost of O(k^2); the other is the user node matching within a cluster pair, at a cost of O(l^2). The total cost is O(n) + O(k^2) + O(kl^2), which is O(k^2 + kl^2 + n). Since l = n/k, the total cost is O(k^2 + n^2/k + n), 1 < k < n. When n is very large, an appropriate value of k makes O(k^2 + n^2/k + n) < O(n^2); that is, the computational complexity of the UIUTDC method is less than that of global user matching. The hidden label node method has fewer participating nodes in the early stage, so its early time consumption is low, but as the number of nodes increases, the number of calculations increases, resulting in longer time consumption in the later stage.
The UIUTDC method is based on dividing the networks into communities and selecting communities with high similarity, which greatly reduces the amount of user matching calculation. The UIUTDC method is therefore superior to the global user matching method and the hidden label node method in terms of time.
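To make the comparison concrete, the cost expressions above can be evaluated numerically. The sketch below simply plugs values into k^2 + n^2/k + n and n^2 with constant factors ignored; the function names and the sample values n = 10000 and k = 100 are illustrative and not taken from the experiments.

```python
def clustered_cost(n: int, k: int) -> float:
    """Operation count k^2 + n^2/k + n for the community-based scheme."""
    return k**2 + n**2 / k + n


def global_cost(n: int) -> int:
    """Operation count n^2 for one-to-one global matching."""
    return n**2


n, k = 10_000, 100
print(clustered_cost(n, k))  # 10000 + 1000000 + 10000 = 1020000.0
print(global_cost(n))        # 100000000
```

For these sample values the community-based count is roughly a hundredth of the global count. Setting the derivative of k^2 + n^2/k with respect to k to zero gives k = (n^2/2)^(1/3) as the value that minimizes this expression.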

Conclusions
This paper proposes a dynamic community clustering two-stage user identification algorithm based on user topological relationships. The algorithm uses the user topological structure information in the social networks: it divides each social network into communities, selects communities with greater similarity in different networks, and matches the nodes within those communities. In this way, the time complexity of the matching algorithm is reduced while the accuracy of node matching is improved. To prevent the loss of node information that occurs when node matching relies on a fixed number of communities, dynamic community division is adopted. The number of communities differs in each iteration, so the nodes in the network communities are fully matched from different angles, which improves the accuracy of node matching. When the algorithm is applied to a real social network dataset, the results show that its accuracy is 33% and 26.8% higher than that of the global user matching algorithm and the hidden label node algorithm, respectively. In terms of time, the algorithm in this paper reduces on average 637911 seconds and 1,94657 seconds than the global user matching algorithm and the hidden label node algorithm.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.