User Identification Based on Integrating Multiple User Information across Online Social Networks



Introduction
With the development of social networks and their diversity, the number of active users on social networks has increased year by year. According to a Statista report, the number of active Facebook users exceeded 2.7 billion in July 2020, and the number of active Twitter users reached 353 million in July 2020 [1]. People may have accounts on several social networks simultaneously. They can use Twitter to follow the latest developments in their areas of interest, Facebook to post life updates and keep in touch with friends, LinkedIn to post career information and keep in contact with colleagues, and Foursquare to post locations [2,3]. If we can match the accounts of an individual across different social networks, we can integrate more comprehensive personal information about that individual and recover his or her complete friend relationships [4].
User identification across social networks is also called user recognition, matching user accounts, user matching, or anchor linking [10]. In recent years, there have been many works on user identification across social networks. Most existing works use attributes in the profile for user identification [4,11-15], such as display name, profile photo, and location. Due to user privacy settings, users may fill in fake information or choose not to fill it in. These limitations make these methods quite fragile [16]. Some existing works perform relationship-based user identification [17-19]. Relationships have higher discriminability and are difficult to fake [10]. However, taking into account user privacy settings and social network restrictions on data crawling, we may only obtain part of the relationships, which results in sparse and incomplete relationships. A number of existing works also use user-generated contents (UGCs) for user identification [4,20]. These methods are usually based on posting time, location, writing style, or similarity of content [4]. However, they may ignore other information contained in the content, such as organizations and URLs. Because of user privacy settings and social network restrictions on data crawling, UGCs may also be incomplete.
For most users, profiles, UGCs, and relationships may be missing or incomplete in real social networks, so the features extracted in previous research may be sparse. If more effective features can be extracted from public and available user data, the impact of these problems can be reduced. Therefore, our paper uses multiple kinds of public user information to perform user identification. The main advantages of MUIUI and the contributions of our work are as follows:
(1) A complete user identification framework: we propose a complete user identification framework, MUIUI, which covers the pipeline from data collection to user identification. Firstly, we crawl user data from two popular social networks and extract multiple user information from the user data, including profiles, UGCs, and relationships. Then, we extract features from the multiple user information. Finally, we employ a fusion classifier to address the user identification problem.
(2) Conducted on popular social networks: this paper focuses on two popular social networks, Twitter and Facebook. We expand the raw dataset proposed in [21-25], which was crawled during November 2012 [9]. We screen the users in the raw dataset whose accounts are still active and take them as positive samples. We construct negative samples whose display names are similar to the display names of half of the positive samples. All negative samples and positive samples constitute the dataset used in this paper. We develop multiprocess crawlers to obtain the user data, including profiles displayed in December 2019 and UGCs and relationships published before January 2020, until we reach the limits of the social networks. We can disclose the dataset used in this paper. The MUIUI framework is evaluated on this dataset.
(3) Extracted a set of effective features: we use named entity recognition to extract locations and organizations from profiles and UGCs and regard them as all locations and all organizations. We use the entity linking method to associate the aliases of the locations and organizations. We propose methods to calculate the similarity of all locations, the similarity of all organizations, and the similarity of URLs in profiles and UGCs. We apply the following relationships jointly with the location in the profile to conduct user identification. The experiments show that the features extracted in this paper are effective for user identification and that using multiple user information improves the performance of user identification.
In the rest of this paper, Section 2 presents related works. Section 3 introduces the basic background and formalizes the problem statement. In Section 4, we describe the user identification framework MUIUI. In Section 5, we conduct three experiments and compare MUIUI with three existing works. Finally, Section 6 concludes the paper and discusses future work.

Related Works
In recent years, there has been much research on user identification across social networks. The existing research can be roughly divided into four categories: profile-based user identification, UGC-based user identification, relationship-based user identification, and user identification based on profile and user relationship.
Profile-based user identification uses only the profile to identify users. In online social networks, attributes in a profile include the display name, user ID, introduction, location in profile, work education experience, and profile photos. Most research studies use one or more of these attributes and show that they are helpful for user identification. Some existing works use only one attribute for user identification, such as the display name [11,13,26-28], profile photos [29], or locations [30-33]. These studies demonstrate the feasibility of using a single attribute to perform user identification. However, social networks contain more than a single attribute, and applying several attributes jointly can improve the performance of user identification [10]. Li et al. [34] used display names and user IDs to link user identities. Motoyama and Varghese used various attributes, such as display name, location in the profile, age, and email, to link user identities [35]. Due to user privacy settings, users may fill in fake profile information or choose not to fill it in, which decreases the accuracy of profile-based user identification.
UGC-based user identification uses only UGCs to identify users. Attributes in a UGC include locations, organizations, time, content, and writing style. Li et al. [4] calculated the similarity of UGCs on spatial, temporal, and content dimensions. Then, they proposed a cascaded three-level machine learning method to solve user identification. Goga et al. [36] used three features extracted from UGCs, namely the location attached to UGCs, timestamp, and writing style, to identify users. Because of user privacy settings and social network restrictions on data crawling, UGCs may be incomplete, so the robustness of the above identification methods may be poor.
Relationship-based user identification uses only relationships to identify users. Xuan et al. [17] found that users usually maintain a similar circle of friends on different social networks. They used relationships and proposed FRUI. Zhang et al. [18] proposed the energy model COSNET by considering the local and global similarities between multiple networks. Zhou et al. [19] sampled the network and learned the vector representation of network nodes. They aligned anchor nodes through neural networks and linked users with dual learning and policy gradient. Some researchers also apply graph embedding to user identification. Man et al. [37] used the network embedding method to explore the network structure and identify users through cross-network mapping. Zhou et al. [38] proposed FRUI-P, a method based on social relationships that requires no prior knowledge. Liu et al. [25] embedded both the following relationship and the follower relationship into the network structure to identify users. There are also some existing works based on both profile and user relationship. Zhang and Yu [39] combined user attributes and network structure to link potential multiple shared entities. Li et al. [10] combined user display name and social network information redundancy to identify users. Zhang et al. [40] extracted features from display name, location in the profile, and relationships to identify users. Due to social network restrictions on data crawling, the difficulty of obtaining multilevel relationships, and the highly dynamic topology of social networks [41], relationships will be sparse, incomplete, and unstable.
Relationships can be divided into following relationships and follower relationships [25]. Due to the openness of social networks, any user can follow other users, and a user may not know the people who follow him or her. Therefore, we focus only on following relationships. Nowadays, due to user privacy settings and social network restrictions on data crawling, profiles, UGCs, and relationships may be missing, incomplete, or fake in real social networks.
This paper identifies a set of effective features that are extracted from public and available user data and can reduce the impact of missing or incomplete user data.

Problem Formulation
Suppose there are two social networks, Twitter and Facebook, represented by $G^t$ and $G^f$. We use $G^t = (V^t, E^t)$ to define social network $G^t$, where $V^t$ represents the set of all user accounts and $E^t$ represents the set of relationships. The user data of user $v_i^t$ include the profile, user-generated contents, and relationships. The profile includes the display name $name_i^t$, location $loc_i^t$, user ID $id_i^t$, and work education experience $we_i^t$. The user-generated contents $UGC_i^t$ include locations $UGCL_i^t$, organizations $UGCO_i^t$, and URLs $UGCU_i^t$. The relationships include following relationships and follower relationships. The definition of social network $G^f$ is the same as that of $G^t$. As shown in Figure 1, we can define user identification across social networks as follows.
User identification: determine whether the user $v_i^t$ in the social network $G^t$ and the user $v_k^f$ in the social network $G^f$ are the same natural person in reality. If they belong to the same natural person, then the user $v_i^t$ and the user $v_k^f$ are called anchor users.
As shown in Figure 2, this paper mainly addresses user identification between two popular social networks, that is, determining whether two user accounts from two social networks belong to the same natural person. Of course, this method can also be applied to user identification among multiple social networks. The dataset in this paper contains partial ground truth, that is, anchor link users. We use A =

Model and Solution Framework
The framework proposed in this paper is mainly used for user identification when profiles, UGCs, and relationships are missing or incomplete. Firstly, we introduce the framework as a whole. Then, we introduce the feature extraction methods in detail. Finally, we introduce the user identification method based on a fusion classifier.

MUIUI Framework.
The MUIUI framework includes a data crawl and storage module, a feature extraction module, and a detection module, as shown in Figure 3. The data crawl and storage module collects user data from Twitter and Facebook and stores it in a MySQL database. This paper uses multiprocess crawlers to crawl user data from Twitter and Facebook. The user data include profiles, UGCs, and relationships. The feature extraction module extracts effective features from the multiple user information obtained from the user data. We obtain fourteen features from the display name and use them as the similarity of the display name. The named entity recognition method is used to obtain all locations and organizations from UGCs and profiles, and the entity linking method is used to disambiguate and integrate them. We extract all URLs from UGCs and profiles. We also extract organizations from the work education experience and combine them with the following relationships to calculate the similarity of following organizations. We propose several algorithms to measure the similarity of the display name, all locations, all organizations, location in profile, all URLs, following organizations, and user ID, respectively. Combining the above features, a 20-dimensional feature vector is finally obtained and input to the detection module to perform user identification. The detection module uses a fusion classifier: we use the stacking method to fuse the three best-performing base classifiers. The output of the detection module is anchor users or nonanchor users.

Feature Extraction.
Generally, user data contain multiple user information, from which we can extract several effective features. In the following, we exploit multiple user information from networks $G^t$ and $G^f$.

Similarity of Display Name.
The display name is closely related to the user, but it may not be unique in social networks. Some existing works use the display name as the only attribute for user identification [11,13,14]. Compared with other attributes, the display name is easier to obtain. Nevertheless, the user can change the display name at will, so the robustness of user identification based on the display name is poor. Li et al. [11] extracted 14 features from the display name. This paper uses their method to obtain the feature vector $X_{ik}^{name}$ from the two display names $name_i^t$ and name

Similarity of All Locations and Similarity of All Organizations.
In social networks, users may disclose their location in the profile, work education experience, and UGCs. The work education experience is filled in by the user and is closely related to the user. It includes organizations, such as the company where the user works and the school where the user studies. Some social networks include work education experiences directly in the profile (such as LinkedIn and Facebook), while in others they are hidden in the profile (such as Twitter). This paper mainly analyzes two social networks, Twitter and Facebook, so for Twitter, we use the users' introductions as their work education experiences. The content of the UGCs also contains much hidden information, for example, locations related to the user, URLs shared by the user, and organizations that the user is concerned about. Named entity recognition can identify named entities from text data. This paper uses named entity recognition to obtain a set of locations and organizations from the content of the UGCs and the work education experience. All locations include the location in the profile and the locations involved in the UGCs. Meanwhile, all organizations include the organizations in the work education experience and the organizations involved in the UGCs.
Figure 1: Illustration of user identification across $G^t$ and $G^f$.
Since all locations and organizations are closely related to the users themselves, the locations and organizations involved in a user's public information in different social networks will overlap. Moreover, the more often a user mentions a location or organization, the more important it is. The same entity may have many aliases, and named entity recognition may also make mistakes. The entity linking method can address these problems. All recognized locations and organizations are mapped to Wikipedia entry IDs, where names pointing to the same entity are mapped to the same ID. Furthermore, entities that do not exist in Wikipedia entries are deleted to improve accuracy.
This paper uses the named entity recognition method provided by the spaCy (https://spacy.io/) library and the entity linking method provided by the open-source entity linking framework Dexter (https://dexter.isti.cnr.it/). Dexter uses English Wikipedia to implement entity linking.

Security and Communication Networks
For user $v_i^t$ and user $v_k^f$, the similarity of all locations and the similarity of all organizations can be calculated as follows:
Step 1: for user $v_i^t$, a set of locations $UGCL_i^t$ and a set of organizations $UGCO_i^t$ are obtained from the content of the UGCs through named entity recognition. Similarly, we obtain a set of locations UGCL [...]
Step 5: calculate $sim_{loc}$ and $sim_{org}$ by equations (1) and (2), where $\lambda_{im}^t$ is the frequency of $lid_{im}$ in $LID_i^t$ and $\lambda_{kn}^f$ is the frequency of $lid_{kn}$ in LID [...]
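Equations (1) and (2) are not reproduced in this text, so the following is only a sketch of one plausible frequency-weighted measure over linked entity IDs, namely a cosine similarity on ID counts. The function name `weighted_id_similarity`, the integer IDs in the usage note, and the cosine weighting itself are assumptions made for illustration; they are not the paper's definitions.

```python
from collections import Counter
from math import sqrt


def weighted_id_similarity(ids_t, ids_f):
    """Cosine similarity between two multisets of linked entity IDs.

    ids_t and ids_f are lists of Wikipedia entry IDs (repeats allowed),
    standing in for LID_i^t and LID_k^f. ASSUMPTION: cosine weighting is
    illustrative only; it is not the paper's equation (1) or (2).
    """
    freq_t, freq_f = Counter(ids_t), Counter(ids_f)
    if not freq_t or not freq_f:
        return 0.0  # missing data on either side gives no evidence
    shared = freq_t.keys() & freq_f.keys()
    # Entities mentioned more often on both sides contribute more.
    dot = sum(freq_t[e] * freq_f[e] for e in shared)
    norm_t = sqrt(sum(c * c for c in freq_t.values()))
    norm_f = sqrt(sum(c * c for c in freq_f.values()))
    return dot / (norm_t * norm_f)
```

For example, `weighted_id_similarity([101, 101, 202], [101, 303])` uses made-up IDs and yields a value strictly between 0 and 1, because only the entity 101 is shared. Returning 0 for an empty side keeps the feature defined when a user's locations or organizations are missing.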

Similarity of Location in the Profile.
The location in the profile may be the user's current city or hometown. It is more accurate than the location information extracted from the content of UGCs. Therefore, the similarity of location in the profile is taken as one feature. The profile locations filled in by the same user in different social networks should be closely related [40]. However, there are many aliases for the same location. This paper uses the API provided by pickpoint (https://app.pickpoint.io/) to convert location names into their latitude and longitude. The similarity of location in the profile is calculated based on the latitude and longitude of the locations and is expressed by equation (5), where $d(loc_k^f, loc_i^t)$ in equation (4) can be measured by equation (3).
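Since equations (3)-(5) are not reproduced here, the sketch below shows one common way to measure the distance between two geocoded profile locations: the haversine great-circle distance. Using haversine for $d(\cdot,\cdot)$, the name `haversine_km`, and the Earth radius constant are assumptions; the paper's own distance and similarity definitions may differ.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points
    given in decimal degrees. ASSUMPTION: a stand-in for d(loc_k^f, loc_i^t).
    """
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))
```

A similarity feature would then map this distance to a bounded score, for instance by thresholding or decay. That mapping is left out here because the source does not show it.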

Similarity of All URLs.
UGCs often include some URLs. These URLs may be links to UGCs on other social networks, links that the user is interested in, or links related to the user's work education experiences. This paper finds that users may share the same URLs on different social networks. Users may also fill in URLs in their profiles, which are often closely related to the users: a URL may point to a company web page, a personal web page, or the user's homepage on another social network. Based on these extracted URLs, the similarity of all URLs can be calculated.
We use a method similar to Agarwal's URL extraction method [12] to extract the URL sets $UGCU_i^t$ and $UGCU_k^f$ from the profile and UGCs, respectively. The calculation method of $sim_{URL}$ is shown in equation (6), where $c^t$ and $c^f$ represent the number of occurrences of a URL in the URL sets $UGCU_i^t$ and $UGCU_k^f$, respectively, and the URL belongs to the intersection of $UGCU_i^t$ and $UGCU_k^f$.
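As a rough illustration of the idea, the sketch below extracts URLs from text with a simple regular expression, counts their occurrences on each side, and scores the overlap of the shared URLs. The pattern, the helper names `extract_urls` and `shared_url_similarity`, and the overlap formula are all assumptions; they are not Agarwal's extraction method or the paper's equation (6).

```python
import re
from collections import Counter

# ASSUMPTION: a deliberately simple pattern for http(s) URLs.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(texts):
    """Count URLs across an iterable of text snippets (profile and UGCs)."""
    counts = Counter()
    for text in texts:
        for url in URL_PATTERN.findall(text):
            # Strip trailing punctuation and slashes picked up from prose.
            counts[url.rstrip(".,;:!?)]}'\"/")] += 1
    return counts


def shared_url_similarity(urls_t, urls_f):
    """Overlap of two URL count tables, in [0, 1].

    For each shared URL, c_t and c_f are its counts on the two networks.
    ASSUMPTION: sum(min(c_t, c_f)) / min(total_t, total_f) is illustrative.
    """
    total_t, total_f = sum(urls_t.values()), sum(urls_f.values())
    if total_t == 0 or total_f == 0:
        return 0.0
    shared = urls_t.keys() & urls_f.keys()
    overlap = sum(min(urls_t[u], urls_f[u]) for u in shared)
    return overlap / min(total_t, total_f)
```

Real crawled text usually also contains shortened links, so a production pipeline would likely need to resolve redirects before comparing URLs. That step is omitted here.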

Similarity of Following Organizations.
Some social networks, such as Twitter, divide relationships into following relationships and follower relationships. Following relationships refer to other users that the target user is following, while follower relationships refer to other users following the target user [25]. Due to the openness of social networks, anyone can become a user's follower. Therefore, we use following relationships and work education experiences to calculate the similarity of following organizations. The work education experience was introduced in Section 4.2.2. It includes the organizations where the user works or studies, and these organizations often have official accounts in social networks. This paper found that users often follow the official social accounts of the organizations where they work or study.
This paper mainly analyzes two social networks, Twitter and Facebook. We suppose Twitter is the social network $G^t$ and Facebook is the social network $G^f$. Because different social networks contain different user information, this paper extracts organizations from the work education experience of Facebook users and obtains the following relationships from Twitter users. Firstly, we extract the homepage URLs of the followed users on Twitter and use the entity recognition method to extract the organizations from work education experiences on Facebook. Secondly, we use Google's advanced search to obtain the homepage URLs of the organizations' official accounts on Twitter (for example, to obtain the official account of Apple on Twitter, the Google search query is Apple + site: twitter.com). Finally, we calculate the similarity of following organizations. For user $v_i^t$ and user $v_k^f$, the detailed algorithm for the similarity of following organizations is shown in Algorithm 1.
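The body of Algorithm 1 is not available in this text, so the following is only a sketch of the final comparison step, assuming both inputs have already been collected. It measures the fraction of a Facebook user's organizations whose official Twitter accounts are among the accounts followed by the Twitter user. The helper names and the ratio itself are hypothetical.

```python
def normalize_handle(url):
    """Reduce a Twitter homepage URL to a lowercase account handle."""
    return url.rstrip("/").rsplit("/", 1)[-1].lstrip("@").lower()


def following_org_similarity(following_urls, org_account_urls):
    """Share of organizations whose official accounts are followed.

    following_urls: homepage URLs of the accounts followed on Twitter.
    org_account_urls: official Twitter homepage URLs found for the
        organizations in the Facebook work education experience.
    ASSUMPTION: this ratio is illustrative; it is not Algorithm 1.
    """
    orgs = {normalize_handle(u) for u in org_account_urls}
    if not orgs:
        return 0.0  # no organizations disclosed, so no evidence
    followed = {normalize_handle(u) for u in following_urls}
    return len(orgs & followed) / len(orgs)
```

Comparing normalized handles rather than raw URLs avoids spurious mismatches caused by trailing slashes or letter case.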

Similarity of User ID.
The user ID can uniquely identify a user in the social network. In Twitter and Facebook, the initial value of the user ID is usually automatically generated by the social network, and the initial user ID has a strong correlation with the user's display name. The user can also modify it to a familiar string, but it must be unique. Some research [12] found that the user ID can be used for user identification. Therefore, this paper takes the similarity of user ID as one classification feature. The user ID is usually a short string composed of numbers, letters, and underscores, so a string similarity calculation method can be used.
This paper uses the Jaro-Winkler algorithm, which is often used to calculate the similarity of English names. This algorithm increases the weight of the initial characters and makes the string similarity more dependent on the initial part of the string. For user $v_i^t$ and user $v_k^f$, the calculation method of $sim_{userid}$ is [...] where $id_i^t$ and id [...] is the length of the common prefix at the start of the string, up to a maximum of four characters, and $p$ is a constant scaling factor for how much the score is adjusted upwards for having common prefixes (the value of $p$ is 0.1 in Jaro-Winkler).
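For reference, the standard Jaro-Winkler similarity is $sim_w = sim_j + \ell\, p\, (1 - sim_j)$, where $sim_j$ is the Jaro similarity and $\ell$ is the common-prefix length capped at four. The sketch below implements this textbook form; the symbol names and function names are ours and may not match the notation of the paper's own equations.

```python
def jaro(s1, s2):
    """Jaro similarity of two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2  # number of transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler similarity: boosts the Jaro score for a shared prefix."""
    sim_j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim_j + prefix * p * (1.0 - sim_j)
```

User IDs could be lowercased before comparison if case differences should be ignored. Whether the paper does so is not stated.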

Fusion Classifier.
For the same dataset, the effects of different classifiers vary. Zhang et al. [40] used logistic regression (LR) and multilayer perceptron (MLP) classifiers for user identification. Liu et al. [42] used a support vector machine (SVM) as the model classifier. Zafarani and Liu [43] used logistic regression (LR) as the model classifier. Li et al. [10] used a gradient boosting (GB) classifier and tuned its parameters for user identification. Li et al. [11] tested seven supervised machine learning models on the training set and selected the best model, logistic regression with built-in cross-validation (LRCV), as the classifier. These works show that base classifiers can already solve the classification problem well. Li et al. [4] performed ten cross-validations on the classification effect of 10 base classifiers and selected the three better base classifiers to construct a fusion classifier, which also shows that a fusion classifier is generally better than a base classifier. This paper mainly uses a supervised machine learning model to identify anchor users based on the above features.

Experimental Dataset.
This paper focuses on two popular social networks, Twitter and Facebook. We expanded the raw dataset proposed in [21-25], which was crawled during November 2012 [9]. We screened the users in the raw dataset whose accounts are still active and took them as positive samples. We re-crawled 2397 pairs of Twitter and Facebook users in the raw dataset and found that 1292 pairs of Twitter and Facebook user accounts were still active. To improve the classifier's performance, 1292 pairs of negative samples were added to the dataset, and half of the negative samples have display names similar to those of the positive samples. These 2584 pairs of samples are used as the experimental dataset.
We developed multiprocess crawlers to obtain the profiles of the dataset in December 2019 and the UGCs and relationships of the dataset published before January 2020, up to the limits imposed by the social networks. The UGCs can be divided into original and reposted contents. In this paper, we consider the reposted contents to be part of the UGCs, and the same content reposted multiple times is counted only once. Both the Twitter and Facebook users in the dataset are native speakers of English.

Evaluation Metrics.
In the experiments, accuracy, recall, precision, and F1 score are used to evaluate the framework. In this paper, positive samples indicate anchor users, and negative samples indicate nonanchor users.
A confusion matrix is shown in Table 1. TP is the number of samples whose predicted and actual values are both positive. TN is the number of samples whose predicted and actual values are both negative. FN is the number of samples that are predicted as negative but are actually positive. FP is the number of samples that are predicted as positive but are actually negative.
Accuracy (ACC) is the ratio of correct predictions to all samples and is expressed by equation (9):
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}. \tag{9}$$
Recall (REC) is the ratio of correctly predicted positive samples to all actual positive samples and is expressed by equation (10):
$$REC = \frac{TP}{TP + FN}. \tag{10}$$
Precision (PRE) is the ratio of correctly predicted positive samples to all predicted positive samples and is expressed by equation (11):
$$PRE = \frac{TP}{TP + FP}. \tag{11}$$
The F1 score is the harmonic mean of precision and recall and is expressed by equation (12):
$$F1 = \frac{2 \cdot PRE \cdot REC}{PRE + REC}. \tag{12}$$
Area under curve (AUC) is the area under the ROC curve and can be used to evaluate two-class classifiers. A classifier with a larger AUC generally performs better.
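The four metrics defined above can be computed directly from the confusion-matrix counts. The following is a minimal sketch; the function name `classification_metrics` and the convention of returning 0 when a denominator is zero are our assumptions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts.

    ASSUMPTION: a metric is reported as 0.0 when its denominator is zero.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return {"accuracy": acc, "recall": rec, "precision": pre, "f1": f1}
```

For instance, with TP = 8, TN = 6, FP = 2, and FN = 4, the accuracy is 14/20 = 0.7, the recall is 8/12, the precision is 8/10, and the F1 score is 8/11.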

Experiments and Analysis.
To show that MUIUI is an effective user identification framework even when user data are incomplete or missing, this paper compiles statistics on the missing and incomplete user data in the dataset, as shown in Table 2. The numerical values in Table 2 are the numbers of users whose user data are missing or incomplete. Missing information means that the user has not filled in the information or has not disclosed it. Incomplete information means that the user has disclosed and filled in the information, but only part of it can be obtained due to social network restrictions. False locations are judged by whether location names can be converted into latitude and longitude: if a location can be converted, it is regarded as true. Besides, if a user fills in the location as "Earth" or another meaningless noun, it is also regarded as false information.
According to the statistics in Table 2, all types of user data in the dataset used in this paper are missing or incomplete to some extent, except for display names. This dataset was crawled from real social networks by multiprocess crawlers, which shows that user data have varying degrees of missingness, falsity, and incompleteness in real social networks. To evaluate the effectiveness of the MUIUI framework, we compare MUIUI with three existing methods: the method proposed by Li [11], the OPL method proposed by Zhang [15], and the ALLEN-LR method proposed by Zhang [40]. The experiments use the dataset introduced in Section 5.1, which has 1292 pairs of anchor users (positive samples) and 1292 pairs of nonanchor users (negative samples). The dataset includes 1881 Twitter users and 1305 Facebook users. These classifiers can be implemented through scikit-learn [44], and all the parameters use their default values. In the experiments, the ratio of positive samples to negative samples is 1 : 1, and the ratio of the training set to the test set is 2 : 1. These 13 base classifiers are tested with the retraining process, and the average results are shown in Figure 4.
According to the results in Figure 4, RF, Grab, and AdaB have the best performance. Grab and AdaB are strong classifiers, that is, classifiers with higher accuracy that work better than weak classifiers, whereas the other base classifiers are weak classifiers. This is why Grab and AdaB score significantly higher than the other classifiers. For RF, if the number of trees (that is, the dimensions of features) is larger, the RF classification performance will be better. The features in this paper reach 20 dimensions, that is, the number of trees is large, so RF works better. Therefore, we choose RF, Grab, and AdaB as base classifiers and use the stacking method to construct a fusion classifier as the final classifier.
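The stacking step can be sketched with scikit-learn's `StackingClassifier`. The example below is an illustration on synthetic data, not the paper's experiment: the 20 synthetic features stand in for the similarity feature vector, and the logistic regression meta-learner and 5-fold internal cross-validation are assumptions, since the text does not state which final estimator is used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 20-dimensional similarity feature vectors.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
# Training set : test set = 2 : 1, with both classes kept in proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)

fusion = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("ada", AdaBoostClassifier(random_state=0)),
    ],
    # ASSUMPTION: the meta-learner is not specified in the source text.
    final_estimator=LogisticRegression(),
    cv=5,
)
fusion.fit(X_train, y_train)
predictions = fusion.predict(X_test)  # 1 = anchor pair, 0 = nonanchor pair
```

Stacking trains the meta-learner on out-of-fold predictions of the base classifiers, which limits the leakage that would occur if the base models were scored on their own training data.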

The Ratio of Positive Sample to Negative Sample.
The ratio of positive samples to negative samples in the training dataset may affect the user identification framework. To choose this ratio for MUIUI, the following experiments train MUIUI with ratios of 8 : 1, 6 : 1, 4 : 1, 2 : 1, 1 : 1, 1 : 2, 1 : 4, 1 : 6, and 1 : 8 and compare it with the method proposed by Li [11], the OPL method proposed by Zhang [15], and the ALLEN-LR method proposed by Zhang [40].
The results are shown in Figures 5(a)-5(d).
According to the results in Figure 5(a), the accuracy first drops and then rises. Because the number of samples is the smallest at 1 : 1, the accuracy reaches a minimum at 1 : 1. From 1 : 1 to both ends, the number of samples increases, and the accuracy becomes higher. Accuracy counts both correctly predicted positive samples and correctly predicted negative samples. The more actual positive samples there are, the more positive samples are accurately predicted, and the same holds for negative samples; the more samples, the higher the accuracy. Therefore, the accuracy first decreases and then increases. As shown in Figures 5(b)-5(d), when the proportion of positive samples decreases, the recall, precision, and F1 score also decrease. If the training dataset has more positive samples, the classifier learns more features of the positive samples and predicts the positive samples more accurately, which leads to some negative samples being predicted as positive samples. It can be seen from Figure 5 that the ALLEN-LR method has a higher recall than the method in this paper when there are more positive samples than negative samples. However, when there are more negative samples than positive samples, the performance of ALLEN-LR drops sharply. When the ratio of positive samples to negative samples is 1 : 4, 1 : 6, or 1 : 8, its recall, precision, and F1 score are almost zero. It shows that ALLEN-LR may judge some negative samples as positive samples. Based on this situation, the F1 score can evaluate the model better. According to Figure 5(d), MUIUI is stable and superior to the other methods at different ratios. Because the cost of obtaining positive samples is too high, this paper chooses the ratio of 1 : 1 to construct the dataset.

The Ratio of the Training Set to the Test Set.
To illustrate the effectiveness of MUIUI more fully, the following experiments vary the ratio of the training set to the test set. For each ratio, 100 rounds of sampling verification are carried out, and the average of the 100 verification results is taken as the final result. According to the results, the accuracy, recall, precision, and F1 score of the different frameworks are plotted.
Figures 6(a)-6(d) show that the MUIUI has higher indicators than the other three methods under different ratios. At the same time, it can be concluded that the larger the proportion of the training set is, the better the four methods perform.
Li's [11] method only extracts 14 features based on the display name, and there are no missing display names in the dataset, so this is the only method that does not suffer from missing user data. The ALLEN-LR method [40] extracts features from the display name, the locations in the profiles of a user and his/her friends, and the multilayer relationships, and uses the LR classifier to perform user identification. The ALLEN-LR method relies heavily on relationships and needs the locations in the profiles of a user and his/her friends to be relatively complete. However, the relationships in our dataset are incomplete, and the location in the profile is partially missing. When the data are partially missing or incomplete, the performance of ALLEN-LR is not ideal, and increasing the proportion of the training set does not help. The OPL method [15] proposes methods to compute the similarity of the display name, the similarity of profile photo, the similarity of location in profile, the similarity of text in profile, the similarity of URL in the profile, the popularity of the user, and the language the user used. These seven features are used for user identification. Because the profiles and relationships of some users are missing or incomplete in our dataset, the performance of OPL is also not ideal. These results indicate that MUIUI can reduce the impact of missing or incomplete user data.

Conclusion and Future Works
User identification, which can be used for friend recommendation, user privacy protection, and advertising recommendation, has attracted extensive attention in academic circles. Due to user privacy settings and social network restrictions on data crawling, user data may be missing or incomplete in real social networks, and the features extracted in previous research may be sparse. To address these problems, we extracted effective features from public and available user data, which can reduce their impact. Firstly, we developed multiprocess crawlers to obtain the latest user data of the dataset. Then, we used named entity recognition and entity linking to obtain and integrate locations and organizations from profiles and UGCs and extracted URLs from UGCs. We developed several algorithms to measure the similarity of the display name, all locations, all organizations, location in profile, all URLs, following organizations, and user ID, respectively. Finally, we proposed a user identification method based on a fusion classifier. We verified the MUIUI framework on the dataset we crawled, and the results indicate that its performance is better than that of existing representative works.
Other popular social networks, such as LinkedIn and Instagram, also contain user data, and our work will be extended to these social networks in the future. We will introduce more effective features into the user identification method, such as user hotspot topic detection, trajectory analysis, and face perception of profile photos. These features may improve the performance of user identification.
Data Availability
The data supporting this paper are from previously reported studies and datasets, which have been cited. The processed data are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.