Exploiting Two-Level Information Entropy across Social Networks for User Identification

As the popularity of online social networks has grown, more and more users now hold multiple virtual accounts at the same time. Under these circumstances, identifying multiple social accounts belonging to the same user across different social networks is of great importance for many applications, such as user recommendation, personalized services, and information fusion. In this paper, we aggregate user profile information and user behavior information, then measure and analyze the attributes contained in these two types of information to implement user identification across social networks. Moreover, as different user attributes have different effects on user identification, this paper proposes a two-level information entropy-based weight assignment method (TIW) to weigh each attribute. Finally, we combine the scoring formula with the bidirectional stable marriage matching algorithm to achieve optimal user account matching and thereby obtain the final matching pairs. Experimental results demonstrate that the proposed two-level information entropy method yields excellent performance in terms of precision rate, recall rate, F-measure (F1), and area under curve (AUC).


Introduction
A product of the Web 2.0 era, today's social networks provide people with a rich variety of social services. According to research, 42% of users have more than one social network account [1]. Since different social networks have their own unique social methods and provide different social services to users, they generate a wealth of social user information; however, each social network account is isolated and there is no direct contact between accounts. Therefore, the social information generated by user accounts is distributed across multiple social networks. User identification across social networks refers to the process of identifying virtual accounts that belong to the same entity user across different social networks. Achieving this type of identification provides assistance for other fields, such as network recommendation and user behavior analysis, in a variety of different ways, and allows the "big data" contained in multisource social networks to be fully exploited.
User identification, also known as user identity resolution and anchor linking, is aimed at identifying the same users across different social networks. The core idea underpinning existing research works involves utilizing user profile information, network topology information, and user behavior information to calculate and analyze whether a user account matching pair represents two accounts created by the same user. The research work based on user profile information mainly focuses on specific attribute information provided by users (e.g., username, birthday, and interests) [2]. Similarity calculation methods are then applied to these attributes in order to determine whether or not the two accounts in a pair belong to the same user. However, existing social networks have begun to focus on user privacy issues, meaning that a portion of attribute information is not easily available. Related works based on network topology information have mainly focused on using a user's friend relationships to identify their accounts across different social networks [3]; however, these connections between friend relationships are often sparse, meaning that this method encounters certain limitations when applied. Research works based on user behavior information are now primarily aimed at using the content generated by the users to identify said users [4,5], as user-generated content is easier to access than the other two types of information. Moreover, the content published by users is personalized and can thus be used effectively to map user behavior habits.
User profile information refers to the information that users are required to enter or select when they register for their social accounts. Relevant research has shown that more than 45% of users choose the same display name across different social networks [6]. Therefore, there are a number of largely similar phenomena involved in the process of filling in profile information across different social networks, which provides a reliable basis for user identification. User behavior information mainly refers to the content posted by users on major social networks. Moreover, users often make some of the content they publish public, meaning that this content is easy to access. Intuitively, user behavior information-based user identification could constitute a means of breaking through the limitations of the above two methods.
Accordingly, through the analysis of characteristic information in the process of user identification and a simple comparison of existing work, the contributions of our work are summarized as below:
(1) We calculate the similarities of user accounts based on the user profile information and user behavior information, respectively. Then, we briefly analyze the importance of different attributes in the process of user identification.
(2) To the best of our knowledge, we are the first to propose a two-level information entropy-based weight assignment method to weigh each attribute.
(3) A user similarity scoring formula is designed through the similarity calculation and weight assignment of these user attributes.
(4) We use a stable marriage matching algorithm to ensure that the different accounts achieve bidirectional optimality in the matching process in order to further improve user identification performance.
In this work, we conduct user identification across two different social networks; for multiple social networks, we can adopt the transitivity law that anchor linkage follows [7] to identify users.
The remainder of this paper is organized as follows. We describe the current related works in Section 2. In Section 3, we formulate the problem of user identification. Section 4 introduces the method of user identification and the details of its implementation, while the experimental results of our proposed method are provided in Section 5. Finally, we summarize this paper and lay out some avenues for future research in Section 6.

Related Works
Current studies of user identification across multiple social networks can be divided into three categories, according to the different types of information utilized: user profile information-based matching strategy, network topology information-based matching strategy, and user behavior information-based matching strategy. At present, most studies center on one of the abovementioned types of information when engaging in user identification, while only a small portion of these studies have fused this information.
2.1. User Profile Information-Based User Identification. Studies engaging in user profile information-based research focus on user profile attributes, which include display name, city, gender, and profile image.
Display names, which are required to be entered by almost all social networks, have been widely studied in an attempt to identify the virtual accounts of users across different social networks. Perito et al. [8] calculated the similarity of display names between accounts and used binary classifiers to identify users. Similarly, Liu et al. [9] focused on determining whether user accounts on different social networks that use the same username belong to the same natural person. Zafarani and Liu [10,11] proposed a user-mapping method by modeling and analyzing user behavior habits as regards display names. Moreover, the profile image is another attribute that has attracted considerable attention from researchers. Acquisti et al. [12] utilized a face recognition algorithm to enable user identification. Although both display names and profile images can be used to identify users, they are not universal for some large social networks, as many unrelated users may have the same display name; for example, many users have the display name "John Smith" on Facebook.
User identification can be better achieved through the combination of user profile attributes. Iofciu et al. [13] proposed a method that involves measuring the distance between user profile information. Motoyama and Varghese [14] transformed user attributes (gender, education, etc.) into word sets and matched account pairs by calculating the similarities between users. Cortis et al. [15] proposed a user profile identification technique based on weighted ontology to identify users by calculating the similarity between users' grammatical semantics. Abel et al. [16] aggregated user profiles and implemented identification across systems to match users; a similar study involving multiple social networks can also be found in [17]. Li et al. [18] developed an identification model that combines username and display name, both of which contain rich information redundancy. He and Li [19] designed an identification scheme based on user preferences. They fully exploit the redundant information of the user's nickname and analyze the topic of the user's published content. Finally, the above two attributes are combined to identify the user. Mishra [20] developed a pairwise comparison string model that analyzes user profile information among different social networks and adopts different similarity calculation methods to calculate the similarity of different attributes. Finally, the obtained value is compared with the threshold to generate a final matching pair. Ahmad and Ali [21] proposed a two-step method to examine the problem of matching social accounts across social media services. A good effect is achieved by fusing the username and the network structure of the seed node to identify users. In light of the above, the proposed method reduces the usage of user attributes, while the attributes chosen for use have high accessibility.

2.2. Network Topology Information-Based User Identification.
Studies addressing network topology information-based research focus primarily on identifying the same user by utilizing that user's friend relationships. A user's friend relationships are easy to obtain, and malicious imitation and falsification problems are almost impossible. Narayanan and Shmatikov [22] demonstrated that user identification can be accomplished by relying on network topology information; however, the user identification performance achieved in this study required further improvement. Bartunov et al. [23] constructed an objective function by combining profile information and network topology information, then optimized the objective function to obtain optimal account matching pairs. Cui et al. [24] aggregated the similarities between user profile information and graphics to enable mapping from email networks to Facebook networks. Korula et al. [25] converted the user identification problem into a mathematical problem in an attempt to determine whether accounts in different social networks are identical by calculating the similarity of the user graph structure. Tan et al. [26] built models from the users' social relationships, mapped users to a low-dimensional space, and thereby reduced the amount of computation required by the user identification process. Zhou et al. [27] utilized the number of shared seed nodes to measure the similarity between nodes on different social networks and selected the nodes with the highest matching degree for pairing; a similar study was also conducted in [28]. Zhou et al. [29] designed an unsupervised user identification scheme to identify users without the use of seed nodes. This method extracts the friend relationships of each account in the social network as a feature vector, constructs a learning model, and then employs the related similarity calculation method to obtain the similarity between the friend feature vectors. Zhang et al. [30] designed an identification scheme based on a graph neural network, which maps network nodes to a low-dimensional space and preserves the local and global connection modes, so that the social network reconstructed from the learned node features is close to the original social network. Recently, Li et al. [31] performed user identification by mining multihop nodes in the user friendship networks. They analyzed the contribution of multihop nodes for user identification and combined the features contained in the display name to achieve better performance and universality of the identification method. However, it should be noted that this information is often sparse, as only a small percentage of users are willing to disclose their friend relationships.

2.3. User Behavior Information-Based User Identification.
The primary focus of user behavior information-based user identification studies is the content published by users [32]. Users' behavior information exhibits a certain degree of personalization that can be used to map out the user's behavior habits, making this type of information a good choice for user identification purposes. Narayanan et al. [33] analyzed the writing style of user-published content and calculated the similarity probability of publishing content between different social accounts of the same user. Kong et al. [34] designed a Multi-Network Anchoring (MNA) method for user mapping, which comprehensively calculated the similarity between time-based, space-based, and text content across platforms to achieve user identification. Goga et al. [35] fused the geolocation information, status timestamp, and writing style of published content to identify users. Liu et al. [36] calculated the probability of two accounts belonging to the same natural person with reference to the PoIs, writing style, and locations extracted from user behavior information. Chen et al. [37] proposed a new type of user identification approach by extracting users' spatiotemporal features from user behavior information. Li et al. [38] designed a recognition model based on user behavior information and measured the similarity of user behavior information across the dimensions of space, time, and content; a supervised machine learning algorithm was then used to match the account pairs. Recently, Chen and Tan [39] mapped user access behavior to different time windows, extracting time, text, and sequence features of access behavior to obtain rich semantic content. Finally, a multilayered perception network-based identification model is constructed to identify the user's identity. To the best of our knowledge, many existing works mainly focus on the extraction of user information, while ignoring the analysis of the importance of user information. 
The state-of-the-art works performed a simple analysis of user information [4,5]. However, the degree of contribution to distinguishing user information varies greatly, which is not conducive to the balance of weight assignment in the user identification process. Furthermore, with the development of the Internet of Things [40][41][42][43][44][45], location information began to enter the researcher's field of vision, which is also a powerful feature for user identification purposes. Hao et al. [46] converted the user trajectory into multiple grid sequences, transformed these sequences into vectors using the TF-IDF model, and then used cosine similarity to calculate the similarity between the vectors obtained. Roedler et al. [47] combined user-generated timestamp information with location information for use in constructing a personalized social behavior pattern designed to address user identification issues. Qi et al. [48] designed an identification scheme based on user trajectory, which analyzes the TOP-N region with the most frequent distribution of user trajectories and uses different similarity calculation methods to measure the similarity between two trajectories. This achieves better identification performance. However, geolocation information is often sparse in social networks, as only a small proportion of users are willing to make their geographic locations public.

Problem Definition
This section defines key terms used in the present paper, integrates user profile information and user behavior information to identify the entity users behind multiple social accounts, and elaborates on the process of user identification.
The term "social network" is used to refer to virtual communities and platforms on which people publish, share, and exchange ideas and information. On social networks, people input information within a bounded system that allows them to create a profile and can then communicate with other users [49]. From the above description, it is evident that a social network comprises three important elements: users (with their profile information), behavioral information among users (or content), and friend relationships (or network) [50]. More detailed definitions of these terms are provided below.
Definition 1 (social network). A social network is defined as $S = \{V, E, C\}$, where V denotes the users, E denotes the users' profile information, and C denotes the user-generated behavior information.
Delving into the main components of a social network, one can easily determine that both E and C are generated by V. To a certain extent, E and C can therefore be considered as attributes of V, while V is the core element of the social network. Therefore, user identification is of great importance in multiple studies of social networking.
In this study, $S_A$ and $S_B$ are used to represent social networks A and B, respectively.
Definition 2 (entity user). An entity user is a user representing a single unique individual, in combination with his or her profile information, behavior information, and friend relationships.
Definition 3 (user matched pair). Given social networks $S_A$ and $S_B$, if social accounts $V_{Ai}$ and $V_{Bj}$ belong to the same person in real life, then they constitute a user matched pair $(V_{Ai}, V_{Bj})$.
In this section, we formulate user identification across social networks based on profile information and behavioral information. In Figure 1, we consider two social networks $S_A$ and $S_B$. Each of these social networks is home to some users whose profile information and behavioral information are presented in Table 1.
Generally speaking, multiple virtual accounts belonging to the same user but created on different social networks tend to be isolated from each other, without any connection between them [34]. The main task of user identification is accordingly to discover the potential relationship between these accounts on different social networks [51]. In Figure 1, the user information sets of two people in $S_A$ are $(E_{Ai}, C_{Ai})$ and $(E_{Ak}, C_{Ak})$, respectively. Here, $E_{Ai}$ represents the user's 17 profile attributes, corresponding to $T_1^{mp}$ in Figure 2, and $C_{Ai}$ represents the user's 3 behavioral attributes, corresponding to $T_1^{mb}$ in Figure 2; the same holds for $(E_{Ak}, C_{Ak})$. Moreover, there is a user in $S_B$ whose user information set is $(E_{Bj}, C_{Bj})$. The following two problems need to be solved for user identification to be achieved:
(1) There are two accounts $V_{Ak}$ and $V_{Bj}$; can we determine whether these two accounts belong to the same person?
(2) There are two accounts $V_{Ai}$ and $V_{Bj}$; can we determine whether these two accounts belong to two different persons?
If two user information sets from different social networks are sufficiently similar, they are likely to belong to the same person. The higher their attribute similarity, moreover, the higher the probability that they are the same person. User identification issues can thus be defined as follows: where $\mathrm{Score}_{ij}$ denotes the similarity score of $V_{Ai}$ and $V_{Bj}$. It should be noted here that, for various reasons, some people have multiple accounts in the same social network; however, we usually consider these accounts to be independent and belonging to different persons, which is to say that we only identify one of the accounts.

Proposed Method
4.1. User Identification Framework. In this paper, we mainly aim to combine users' profile information and behavior information to identify the entity users behind multiple social accounts. We fuse these two types of information for three main reasons.
(1) User profile information and user behavior information are easier to obtain than user network topology. In addition, these two types of user information can more intuitively reflect the user's living habits and style on different social networks.
(2) User profile information and user behavior information exhibit similarity across social networks.
(3) User topology involves contributions from both single-hop nodes and multihop nodes. If the user topological structure is weighted directly, it will cause the problem of weight imbalance among the user nodes.
Based on the above analysis, we focus on integrating user profile information and user behavior information. Through in-depth mining of the two types of user information, a two-level information entropy-based weight assignment algorithm is proposed to solve the problem of weight imbalance caused by multiattribute information.

Wireless Communications and Mobile Computing
As shown in Figure 2, three user accounts $v_1^n$, $v_1^m$, and $v_2^k$ are taken from two different social networks, $S_A$ and $S_B$. Each user account contains 17 user profile attributes (such as $T_1^{mp}$) and 3 user behavior attributes (such as $T_1^{mb}$). Let us select accounts $v_1^n$ and $v_2^k$ as an example. The similarity $S_{nk}$ is calculated for this pair of accounts; similarly, $S_{mk}$ denotes the similarity of the 20 user attribute values between users m and k. Then, a weight assignment method based on two-level information entropy is used to weight each attribute, and a weighted similarity vector $X_{nk}$ is obtained. Finally, the bidirectional stable marriage matching algorithm is used for account matching. If y = 1, the accounts match; otherwise, the two accounts belong to different persons.
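The matching step can be illustrated with a minimal sketch of the classic Gale-Shapley stable marriage procedure, in which preferences are derived directly from pairwise similarity scores. This is only an illustration under stated assumptions: the function name `stable_match` and the score-matrix input are hypothetical, and the paper's bidirectional variant may add conditions that are not described in this section.

```python
def stable_match(score):
    """Gale-Shapley matching driven by a similarity matrix (illustrative sketch).

    score[i][j] is the similarity between account i in network A and
    account j in network B. Returns a dict {i: j} of matched pairs.
    """
    n_a = len(score)
    n_b = len(score[0]) if score else 0
    # Each account in A ranks the accounts in B by descending similarity.
    prefs = [sorted(range(n_b), key=lambda j: -score[i][j]) for i in range(n_a)]
    next_choice = [0] * n_a      # index of the next candidate each i will propose to
    holder = {}                  # j -> i currently holding j
    free = list(range(n_a))
    while free:
        i = free.pop()
        if next_choice[i] >= n_b:
            continue             # i has proposed to everyone and stays unmatched
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        if j not in holder:
            holder[j] = i
        elif score[i][j] > score[holder[j]][j]:
            free.append(holder[j])   # j prefers i, so the previous partner is released
            holder[j] = i
        else:
            free.append(i)           # j rejects i, who will try the next candidate
    return {i: j for j, i in holder.items()}
```

Because both sides rank each other by the same similarity score, no unmatched pair of accounts in the result prefers each other over their assigned partners.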

4.2. User Profile Information Analysis.
In Section 3, we divide the user profile information into different dimensions according to different feature attributes. The similarity calculation methods are used to calculate the similarity of each dimension of the two accounts across different social networks, after which the similarity values are compared with their corresponding thresholds in order to obtain the results of the user profile information comparison across the different dimensions.
Before the similarity calculation is performed, the profile information must be generalized to ensure that the information is unified across the different social networks. For example, the user's gender may appear in different social networks as "boy, girl," "man, woman," etc.; this paper uniformly generalizes these values to "boy, girl." In addition, since the types of user information we analyze differ, different similarity calculation methods are needed. In this paper, three key methods are adopted to perform the similarity calculation of profile information with different attributes: the Dice coefficient (a common measure for evaluating classification results, which can also be used as a loss function to measure the loss between the classification result and the label), cosine similarity (this method treats the user's preference as a point in an n-dimensional coordinate system and forms a vector by connecting this point with the origin; the similarity value between two users is the cosine of the angle between the two vectors), and exact matching (for specific user information, such as gender). These methods are explained in more detail below:
(1) Dice Coefficient. Let a and b represent two sets of strings. The Dice coefficient is
$$\mathrm{Dice}(a, b) = \frac{2\,|a \cap b|}{|a| + |b|}, \quad (2)$$
where the numerator is based on the size of the intersection of the two sets, while the denominator is the sum of the sizes of the two sets. For example, for a = {play, music, basketball} and b = {music, reading, football}, the intersection is {music}, while |a| = 3 and |b| = 3, meaning that the Dice coefficient of these two sets is $2 \times 1/(3 + 3) \approx 0.33$.
(2) Cosine Similarity. This method is often used to calculate the similarity between two vectors. We quantize two strings into word vectors; here, let $x_i$ and $y_i$ denote the ith components of the two word vectors, and let m denote the word vector dimension. Therefore, the cosine similarity of the two strings is as follows:
$$\cos(x, y) = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2}\,\sqrt{\sum_{i=1}^{m} y_i^2}}. \quad (3)$$
(3) Exact Matching. Under this method, the attribute information of the two accounts must be identical. For example, in the matching process, the gender of two accounts must match exactly (e.g., both are male or both are female).
Figure 1: Cross-network research to merge various SNs (panel labels: "Same person?" and "Different person?").
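The three profile similarity measures described above can be sketched in a few lines of Python. The function names are hypothetical, and the whitespace-based bag-of-words vectorization used for the cosine similarity is an assumption made for illustration; the paper does not specify how strings are quantized into word vectors.

```python
import math
from collections import Counter

def dice(a, b):
    """Dice coefficient of two collections of strings, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(text_a, text_b):
    """Cosine similarity of two strings under a bag-of-words model (assumed)."""
    x = Counter(text_a.lower().split())
    y = Counter(text_b.lower().split())
    dot = sum(x[w] * y[w] for w in x)
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0

def exact(a, b):
    """Exact matching: 1.0 if the two attribute values are identical, else 0.0."""
    return 1.0 if a == b else 0.0

# The worked example from the text: one shared item out of 3 + 3.
print(round(dice({"play", "music", "basketball"}, {"music", "reading", "football"}), 2))
```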

4.3. User Behavior Information
Figure 2: Across social networks user identification implementation framework (diagram label: $S_w(X_{mk}, y_{mk})$).
Although user profile information is important to user identity matching, there will be some information loss caused by privacy protection, meaning that the performance of any identification based solely on user profile information will not be perfect. Accordingly, in this paper, we analyze user behavior information on the basis of user profile information. In short, user identification can be better achieved by fusing the above two types of information.
4.3.1. User Blog Data Similarity Calculation. This paper draws on some of the ideas underpinning the concept of frequent pattern mining to propose a method for calculating the similarity of user blog data. Some mining algorithms can be used to determine potential connections between users' behaviors. The specific details of the frequent pattern mining process are explained in Figure 3 below.
Step 1. Word segmentation is performed on each user's blog. As can be seen from Figure 3, each of the user's blog posts is identified by a TID (transaction ID). Each blog post contains a different number of items from the set {I1, ..., I5}; the item set of each blog post is the result of the word segmentation. After word segmentation is complete, each blog post is treated as a set of distinct words and forms a transaction $T_i$; as such, all of a user's blog posts form the transaction set $T = [T_1, T_2, \ldots, T_i]$.
Step 2. In the first iteration of the algorithm, all the items in the transaction set T are traversed and their support is counted, which yields the candidate one-item set $C_1$. In this paper, the minimum support for the one-item set is set to two. Item sets that do not satisfy this support requirement are filtered out, so that only the frequent one-item set $L_1$ remains.
Step 3. $L_1$ is joined with itself to obtain the candidate two-item set $C_2$. For example, {I1} and {I2} in $L_1$ are connectable, and the item set {I1, I2} is obtained after the connection. By analogy, we connect the connectable item sets to get the candidate set $C_2$ and obtain the support count corresponding to each item set. Moreover, we set the minimum support of $C_2$ to one. The transaction set T is scanned to filter out item sets that do not satisfy the support, with the result that the frequent two-item set $L_2$ is obtained.
Step 4. Similarly, the frequent three-item set $L_3$, frequent four-item set $L_4$, ..., frequent n-item set $L_n$ are generated in sequence until none of the generated candidate item sets C satisfies the minimum support; at this point, the algorithm ends. The example presented in Figure 3 ends with the frequent four-item set $L_4$. After the above process is complete, the frequent items and their support counts have been obtained. Accordingly, the similarity calculation formula of the blog data of users i and j is as follows: where $F_n$ denotes the frequent items shared by users i and j, $C_{Ai}(F_n)$ and $C_{Bj}(F_n)$ denote the support counts of frequent items $F_n$ of users i and j, respectively, while $C_{F_n}$ denotes the number of item sets of $F_n$. The "1" in the formula is added to ensure that high-frequency items are avoided.
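Steps 1 to 4 follow the general shape of Apriori-style frequent item set mining, which can be sketched as below. This is a simplified illustration rather than the authors' implementation: the function name `frequent_itemsets` is hypothetical, word segmentation is assumed to have been done already (each post is given as a set of words), and a single `min_support` threshold is used for every level, whereas the text sets different thresholds for the one-item and two-item sets.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=2):
    """Return {frozenset(items): support_count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]
    # Step 2: count single items (C1) and keep the frequent ones (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Step 3: join L(k-1) with itself to form the candidate k-item sets C(k).
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune candidates that contain an infrequent (k-1)-item subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Scan T to count support and keep the frequent k-item sets L(k).
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                current[c] = support
        frequent.update(current)
        k += 1  # Step 4: repeat until no candidate satisfies the minimum support.
    return frequent

posts = [{"music", "travel", "coffee"}, {"music", "coffee"},
         {"music", "travel"}, {"travel", "coffee", "music"}]
for itemset, support in sorted(frequent_itemsets(posts).items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

The returned support counts are the quantities that the blog similarity formula compares between two accounts.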

4.3.2. Punctuation Marks Similarity Calculation.
Punctuation marks, when used in the writing of blog posts, can clearly reflect the user's personal habits of behavior where writing is concerned. Therefore, this information can also be measured and analyzed as an attribute for use in determining user's identity. In this subsection, a punctuation mark vector is generated to store the frequency at which punctuation marks appear in the users' blog posts. The measure of punctuation similarity is aimed at determining whether users utilize similar punctuation marks when writing a blog post. In Table 2, we present a case in which two users, Ben and Emily, use different punctuation marks in the blog posts they publish.
In order to calculate the similarity between the punctuation marks used by each user account, each user's punctuation marks are quantized into a vector; each element of the vector is $P_i = c_i/n$. Here, $c_i$ is the count of the ith punctuation mark, and n is the total number of blog posts. In this way, we can obtain the user account's punctuation mark vector. In this paper, we calculate the similarity between different account punctuation vectors via cosine similarity. The specific formula is shown in (3).
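A minimal sketch of the punctuation vector $P_i = c_i/n$ and its cosine comparison is given below. The list of tracked punctuation marks (`PUNCT`) and the function names are assumptions made for illustration; the paper does not list which marks are counted.

```python
import math

# Assumed set of tracked punctuation marks; the paper does not enumerate them.
PUNCT = [",", ".", "!", "?", ";", ":"]

def punctuation_vector(posts):
    """Element i is (count of PUNCT[i] over all posts) / (number of posts)."""
    n = len(posts)
    if n == 0:
        return [0.0] * len(PUNCT)
    return [sum(p.count(ch) for p in posts) / n for ch in PUNCT]

def cosine(x, y):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

ben = punctuation_vector(["Hi, there!", "Why?"])
emily = punctuation_vector(["Well. Fine.", "Okay; sure."])
print(ben, emily, cosine(ben, emily))
```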

4.3.3. State Timestamp Similarity Calculation.
The same user is highly likely to post at consistent times across different social networks; this attribute can therefore also reflect the user's behavior habits. State timestamp similarity measures a user's behavior characteristics according to the dynamic numbers (the number of posts published by the user in each time interval) generated in different time periods, thereby enabling the similarity between two user accounts to be calculated more accurately.
We first need to tag the time at which the user posted each status. The dynamic number of each time period is counted and divided by the total number of blog posts, from which the average dynamic number of each time period is obtained; this average dynamic number of each time period can constitute a 24-dimensional user timestamp vector. Finally, formula (5) is used to calculate the similarity of the status timestamps between different accounts.
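The construction of the 24-dimensional timestamp vector can be sketched as follows. The hourly bucketing is an assumption inferred from the stated dimension of 24, and the function name `timestamp_vector` is hypothetical; the similarity formula itself is not reproduced here.

```python
from datetime import datetime

def timestamp_vector(post_times):
    """Return a 24-dimensional vector: per-hour post count divided by total posts."""
    counts = [0] * 24
    for t in post_times:
        counts[t.hour] += 1          # tag each post with its hour of the day
    total = len(post_times)
    return [c / total for c in counts] if total else [0.0] * 24

times = [datetime(2021, 1, 1, 8, 5), datetime(2021, 1, 2, 8, 30),
         datetime(2021, 1, 3, 22, 0), datetime(2021, 1, 4, 23, 59)]
vec = timestamp_vector(times)
print(vec[8], vec[22], vec[23])
```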
In formula (5), $v_{it}$ and $v_{jt}$ denote the average dynamic number of users i and j over period t, while T denotes the dimension of the timestamp vector.
To improve identification performance, we need to assign reasonable weights to each of the attribute items. Following earlier works [4,5], Figure 4 illustrates the contribution of a single attribute used for user identification.
It can be clearly seen from Figure 4(a) that the username has different distributions of similarity between the same user and different users when performing account matching; therefore, it can be concluded that this attribute item has strong discriminability, and thus that the weight assigned to it should be large. Moreover, it can be clearly seen from Figure 4(b) that when account matching is performed using the interest attribute, the user distribution is relatively uniform, meaning that it is not easy to determine whether or not users are identical; hence, the weight assigned to this attribute should be relatively small. In summary, there are differences in the contribution made by each attribute to user identification; it is therefore vital to assign appropriate weights to each attribute.

Weight Assignment Algorithm.
After calculating the similarity between user attributes, we need to assign corresponding weights to each attribute to further improve user identification performance. Both traditional expert subjective weighting methods and objective weighting methods have their limitations, which include poor robustness, dependence on large amounts of sample data, and poor generality.
In order to solve the above problems associated with weight assignment, the present paper proposes a two-level information entropy-based weight assignment method. The basic idea behind the entropy weight method is that the greater the difference in an index, the greater the difference in weights; we therefore use the concept of information entropy to solve the problem of weight assignment in user identification. According to the definition of information entropy, for any random variable $x$, the formula is as follows:
$$H(x) = -\sum p(x) \log p(x),$$
where the sum runs over the values that the attribute can take and $p(x)$ denotes the value probability of this attribute. Based on the information entropy concept, the posterior probability of each attribute is further calculated, allowing the impact of each attribute on user identification performance to be more accurately determined. The one-level weight assignment of the user attribute is obtained by combining the posterior probability and the information entropy, where $p(y_s \mid s)$ is the posterior probability of the attribute, i.e., the probability that the same user attribute $y_s$ is consistent, $x$ is an attribute, and $X$ is the set of all attributes. The posterior probability and $p(x)$ can be obtained via statistical means.
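As a minimal sketch of the entropy computation, the value probability of an attribute can be estimated from observed frequencies and plugged into the standard Shannon entropy definition. The function name `shannon_entropy` and the base-2 logarithm are assumptions made for illustration; the paper does not state the logarithm base.

```python
import math
from collections import Counter


def shannon_entropy(values, base=2.0):
    """Entropy of the empirical value distribution of a single attribute."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    # p(x) is estimated as the relative frequency of each observed value.
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())
```

An attribute whose values are spread over many distinct entries yields a high entropy, whereas an attribute on which every account takes the same value yields an entropy of zero.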
Softmax is an extremely common and important function in machine learning and deep learning applications, particularly in multicategory scenarios. It maps its inputs to real numbers between zero and one; moreover, normalization guarantees that the outputs sum to one, meaning that the probabilities of the multiple classes also sum to exactly one. The output of Softmax characterizes the relative probability between different categories. Accordingly, we use the Softmax concept to assign a two-level weight to the user's attributes. After obtaining the one-level weight of the user attribute, the weight values of all attributes are combined into an array $Z = (z_1, z_2, \cdots, z_n)$, which is used as the Softmax input.
As shown in Figure 5, the one-level weight value of each attribute is passed through the Softmax function, which outputs the probability corresponding to each attribute. When using Softmax, the problem of numerical overflow must be addressed: because of the exponential operation, a large value in $Z$ can overflow after exponentiation, so some numerical processing of $Z$ is required. This paper therefore multiplies the elements in $Z$ by one-tenth. The probability of each user attribute is accordingly calculated as follows:
$$p_i = \frac{e^{z_i/10}}{\sum_{j=1}^{n} e^{z_j/10}}.$$
After obtaining the probability of each attribute, the attribute entropy value is redefined in terms of these probabilities. Moreover, since the entropy value is inversely proportional to the weight, a variant entropy value $R_i$ is introduced, and the two-level weight of each attribute is finally obtained.
As shown in Figure 6, we compare the variances (y-axis) of three methods of weight assignment: empirical probability-based weights, posterior probability-based weights, and two-level information entropy-based weights. The specific values obtained by these three weight assignment methods are 0.0031, 65.9918, and 285.8982, respectively. The experimental data thus support the effectiveness of the proposed method. Note that the blue histogram corresponds to the left axis, while the red histogram corresponds to the right axis.
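The scaled Softmax step can be illustrated as follows. This is a sketch rather than the authors' implementation: the function name `scaled_softmax` is hypothetical, the default `scale=0.1` mirrors the one-tenth factor described above, and the subtraction of the maximum is an additional, commonly used safeguard that is not taken from the paper and does not change the result.

```python
import math


def scaled_softmax(z, scale=0.1):
    """Softmax over scale * z; scale=0.1 mirrors the one-tenth factor in the text."""
    if not z:
        return []
    scaled = [scale * v for v in z]
    # Extra safeguard (not from the paper): shifting by the maximum keeps every
    # exponent <= 0, so math.exp cannot overflow, and the ratios are unchanged.
    shift = max(scaled)
    exps = [math.exp(v - shift) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs sum to one and preserve the ordering of the one-level weights, so a larger one-level weight always maps to a larger probability.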

User Account Matching
Scoring Formula.
The user similarity score formula can be constructed by calculating the similarity between the user attributes and assigning weights accordingly. In order to improve the efficiency of similarity calculation between user accounts, this paper uses a stable marriage matching algorithm [18] to match account pairs across social networks; moreover, it also utilizes a bidirectional matching strategy to further improve user identification performance. The user similarity score formula is as follows:
$$\mathrm{Score}_{ij} = \sum_{x=1}^{u} W'_x \, S_{ij}^{x}.$$
Here, $\mathrm{Score}_{ij}$ denotes the final matching score, $W'_x$ represents the weight of the $x$th attribute of the account, $S_{ij}^{x}$ denotes the similarity of the $x$th attribute of the two accounts $i$ and $j$, and the value of $u$ is twenty. The size of $\mathrm{Score}_{ij}$ is used to determine whether the entity users behind the two social accounts are identical.
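A minimal sketch of the score computation is given below, assuming that the score is the weighted sum of the per-attribute similarities. The function name `match_score` and the three-attribute example values are hypothetical; the paper uses twenty attributes.

```python
def match_score(weights, similarities):
    """Weighted sum of attribute similarities for one pair of accounts.

    Assumption: the score combines the two-level weights and the
    per-attribute similarities linearly.
    """
    if len(weights) != len(similarities):
        raise ValueError("weights and similarities must have the same length")
    return sum(w * s for w, s in zip(weights, similarities))


# Illustrative values only: three attributes instead of the paper's twenty.
example = match_score([0.5, 0.3, 0.2], [1.0, 0.5, 0.0])
```

Computing this score for every pair of accounts across the two networks yields the score matrix on which the matching step operates.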

Bidirectional Stable Marriage Matching Algorithm.
This step involves constructing the scoring formula between user accounts, then using the bidirectional stable marriage matching algorithm to achieve user account matching. Generally speaking, the best user identification result satisfies one-to-one matching. The core idea of the proposed TIW-UI algorithm is still stable marriage matching, so the corresponding algorithm complexity is $O(n^2)$, which is the same as the complexity of state-of-the-art matching algorithms [5,29,52]. It is worth noting that we mainly focus on the weight assignment of user information. The following is the specific process of the matching algorithm used in this paper.
Step 1. Score each user account in social network $S_A$ against all user accounts in social network $S_B$ via the scoring formula. The specific process is as outlined in Algorithm 1.
Step 2. Each account in $S_A$ is matched with its top-ranked account in $S_B$, where candidates are ranked in descending order of final score. If the account in $S_B$ has not been matched with any other account in $S_A$, then it is matched with the current account in $S_A$. If, however, the account in $S_B$ has already been matched with other accounts in $S_A$, then all accounts competing for it are compared by score, and the account with the highest score is selected as the matching pair.

Step 3. If there are still accounts that do not match, return to Step 2. The overall detailed bidirectional matching of the algorithm is outlined in Algorithm 2.
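The proposal loop of Steps 1 to 3 can be sketched as a standard stable marriage (Gale-Shapley style) procedure driven by the score matrix. This is an illustration under stated assumptions, not a reproduction of Algorithm 2: the function name `stable_match` is hypothetical, only the direction in which $S_A$ accounts propose is shown, and the bidirectional strategy of the paper is not reproduced here.

```python
def stable_match(scores):
    """Match S_A accounts to S_B accounts from a score matrix.

    scores[i][j] is the matching score between account i in S_A and
    account j in S_B. Returns a dict mapping S_A index -> S_B index.
    """
    n_a = len(scores)
    n_b = len(scores[0]) if n_a else 0
    # Step 1: each S_A account ranks all S_B accounts by descending score.
    prefs = [sorted(range(n_b), key=lambda j, i=i: -scores[i][j])
             for i in range(n_a)]
    next_choice = [0] * n_a   # position of the next candidate in prefs[i]
    holder = {}               # S_B index -> S_A index currently matched to it
    free = list(range(n_a))   # S_A accounts that still need a partner
    while free:               # Step 3: repeat while unmatched accounts remain
        i = free.pop()
        if next_choice[i] >= n_b:
            continue          # i has tried every candidate and stays unmatched
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        # Step 2: take j if it is free, or if i outscores the current holder.
        if j not in holder:
            holder[j] = i
        elif scores[i][j] > scores[holder[j]][j]:
            free.append(holder[j])
            holder[j] = i
        else:
            free.append(i)
    return {i: j for j, i in holder.items()}
```

Each $S_A$ account proposes to each $S_B$ account at most once, which gives the quadratic complexity mentioned above.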

Analysis of Experimental Results
Several experiments were conducted to more intuitively illustrate the effectiveness of the proposed method. All experiments were performed using a computer with 8 GB of RAM and a 2.4 GHz CPU. User profile information and user behavior information from two major social networks, namely, Facebook and Twitter, were selected for user identification across social networks. Reference [53] provides an open dataset that includes five mainstream foreign social networks. The ratio between the training set and the test set is set to 3 : 1.
In order to facilitate comparison between the proposed algorithm and other algorithms, this paper uses the most widely used user identification metrics, namely, precision rate, recall rate, F1, and AUC. The relevant formulae are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where TP denotes the case in which the two accounts are the same user and the matching result indicates the same user, TN denotes the two accounts not being the same user while the matching result indicates different users, FP denotes the two accounts belonging to different users while the matching result indicates the same user, and FN denotes an instance in which the two accounts belong to the same user, but the matching result indicates that the users are different.
AUC refers to the area under the ROC curve, for which the false positive rate (FPR) is the x-axis and the true positive rate (TPR) is the y-axis. Since the results of this paper are divided into two categories, i.e., same entity user and different entity users, AUC can also be used to measure identification results.
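The precision rate, recall rate, and F1 can be computed directly from the confusion counts defined above, using their standard definitions. The function name `precision_recall_f1` and the example counts are hypothetical and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 8 correct pairs, 2 wrong pairs, 2 missed pairs.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

TN does not enter any of these three metrics; it only affects the false positive rate used for the ROC curve and AUC.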

The Impact of Weight on User Identification Performance.
The two-level information entropy-based weight assignment algorithm is used to assign weights to user attributes with the aim of improving identification performance. In order to verify the algorithm's effectiveness, we use the control variable method to analyze the experimental results of three different weight assignment methods.
(1) The Empirical Probability-Based Weight Assignment Method (EW). There are two ways to obtain the weights. First, they can be calculated from user history information. Second, when historical information is not available or the data is incomplete, they can be obtained based on expert experience [4].
(2) The Posterior Probability-Based Weight Assignment Method (PW). Once the user information has been obtained, the probability that the same attribute holds the same user information on different social networks is estimated [5].
From the description of the above two methods, it can be seen that neither analyzes user information in a fine-grained manner. Relying only on the initially obtained data probability for weight assignment skews the weights toward certain categories and makes the weight assignment unbalanced. Therefore, we propose a two-level information entropy-based weight assignment method (TIW). The comparison of the three methods is shown in Figure 7.
This paper proposes a two-level information entropy-based across social network user identification algorithm (TIW-UI). The effectiveness of the proposed algorithm is illustrated through comparison with the random forest confirmation algorithm based on stable marriage matching (RFCA-SMM) [5] and the ranking-based cross-matching algorithm (RCM) [52]. RFCA-SMM analyzes user-generated data and combines posterior probability and information entropy to assign weights to user data. However, its weight assignment method is not able to fully reflect the differences between user attributes. The identification performance of RCM is largely influenced by the number of seed users, i.e., the number of known matching pairs. In cases where it is not possible to know in advance which accounts belong to the same entity user across two social networks, the identification performance of RCM requires improvement.
Since the algorithm for user identification in this paper is untagged, the experimental results of the three algorithms are analyzed using an untagged dataset. From Figure 8 and Table 4, it can be seen that the proposed TIW-UI is superior to RFCA-SMM and RCM in terms of precision rate, F1, and AUC. TIW-UI analyzes a total of 17 user profile information attributes and 3 user behavior information attributes and thereby comprehensively calculates the similarity in users' information across social networks. The two-level weight assignment is applied to the attributes used. Compared with RFCA-SMM and RCM, different attributes therefore make substantially different contributions to the identification results, which improves the overall user identification performance. Moreover, TIW-UI and RFCA-SMM achieve some improvement compared with RCM in terms of the evaluation metrics, mainly because these two algorithms do not take the social network structure into account [52]. To illustrate the effectiveness of the proposed algorithm more clearly, Table 4 compares the performance of the three algorithms across the different evaluation metrics. In terms of precision rate, TIW-UI improved by 3.7% and 1.1% compared to RCM and RFCA-SMM, respectively. In terms of recall rate, TIW-UI is 1.1% higher than RFCA-SMM, but slightly lower than RCM. This is because RCM uses cross matching to filter out wrong matching pairs to some extent. TIW-UI increased by 1.1% and 1% in terms of F1 compared to RCM and RFCA-SMM, respectively, and in terms of AUC, it increased by 5.5% and 0.7%, respectively. In general, the proposed algorithm achieves better user identification performance than the other two algorithms.

Conclusions
This study addressed the problem of user identification across social networks and provided an innovative solution. As the most relevant types of user information, user profile information and user behavior information are key to identifying the unique users behind different social network accounts. In this paper, the similarity between the above two types of user information is calculated and analyzed, and a two-level information entropy-based weight assignment algorithm is employed to weight each user attribute appropriately. Finally, we combine the scoring formula with the bidirectional stable marriage matching algorithm to obtain the optimal account matching results.
In future work, user privacy protection will also become an urgent problem to be solved. There should be a complementary relationship between user identification and user privacy protection, and the leakage of user privacy caused by user identification should be minimized in the research process. We will aim to use ideas from game theory to achieve a balance between user identification and user privacy protection. If this idea can be translated into a workable form, it will solve the data constraint problem affecting most existing identification algorithms while also effectively protecting user privacy data.

Data Availability
Data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.