Temporal Activity Path Based Character Correction in Social Networks

Vast amounts of multimedia data contain massive and multifarious social information, which is used to construct large-scale social networks. In a complex social network, a character should ideally be denoted by one and only one vertex. However, a character is often denoted by two or more vertices with different names and is therefore treated as multiple, different characters. This problem leads to incorrect results in network analysis and mining. The practical challenge is that character uniqueness is hard to confirm because of many complicating factors, for example, name changes and anonymization, which lead to character duplication. The limited early research shows that previous methods depend heavily on supplementary attribute information from databases. In this paper, we propose a novel method to merge character vertices that refer to the same entity but are denoted by different names. With this method, we first build the relationship network among characters from records of the social activities they participated in, which are extracted from multimedia sources. We then define temporal activity paths (TAPs) for each character over time. After that, we measure the similarity of the TAPs of any two characters. If the similarity is high enough, the two vertices should be considered the same character. Based on TAPs, we can thus determine whether to merge two character vertices. Our experiments show that this solution can accurately confirm character uniqueness in large-scale social networks.


Introduction
In the past decade, the mobile Internet and social multimedia applications have become an indispensable part of social life, and huge volumes of multimedia data are being produced and consumed [1]. For instance, Facebook reported 350 million photos uploaded daily as of November 2013; 100 hours of video were uploaded to YouTube every minute, resulting in more than 2 billion videos in total by the end of 2013 [2]. Social media networks allow people to communicate, share, comment on, and observe different types of multimedia content [3]. As social activities become more frequent, social networks have grown larger and much more complex. Generally, we extract information and construct social transaction databases from vast amounts of multimedia data, such as text, images [4,5], video, and audio [6], in order to construct large-scale social networks, which are modelled as graphs [7] with a node-edge representation [8]. Multimedia data can generally be described in multiple views [9,10], such as a color view and a textual view [11,12]. In social networks, the relations between character vertices are tangled by time differences between transactions, incomplete personal information records, anonymity, and differences in information patterns and structures. It is therefore difficult to maintain a one-to-one mapping between characters in relation networks and people in real life. In addition, a character may be recorded as different vertices under a former name and a present name. These vertices have the same personal information, structure, and relation attributes. For social networks, such vertices and relationships are redundant and severely perturb the results of social network analysis. Therefore, the ambiguity of character vertices has become a key problem in social network analysis.
In relational databases, we can use multidimensional personal information, such as name, gender, and date of birth, to confirm the uniqueness of characters. In a big data environment, however, multimedia data comes mainly from unstructured data storage. Its scale is vast and its types are multifarious [13,14], such as text, images, video, and audio. Besides, the data is

Related Work
2.1. Social Analysis via Multimedia. The advent of social networks and cloud computing has made sharing social multimedia easier and more efficient [15]. With the rapid increase in the volume of multimedia data, social network analysis and mining via multimedia data has recently attracted the attention of a number of researchers. Zhuhadar et al. proposed combining social learning network analysis and social learning content analysis to study the impact of social multimedia systems on cyberlearners [16]. They presented evidence from the analysis that a social multimedia system affects the communication between faculty and students. To deal with the challenges of event detection from massive social media data in social networks, Zhao et al. [17] proposed a novel real-time event detection method named microblog clique, which explores the high correlations among different microblogs and is supported by social multimedia data. Sang and Xu [2] proposed analyzing the variety of big social multimedia from the perspective of its various sources. Laforest et al. [18] presented a new kind of social network named spontaneous and ephemeral social networks (SESNs), which allow people to collaborate spontaneously in the production of multimedia documents. In order to find overlapping communities in multimedia social networks, Huang et al. [19] proposed an efficient algorithm named LEPSO for overlapping community discovery, which is based on line graph theory, ensemble learning, and particle swarm optimization.

Name Ambiguity and De-Anonymity.
Recently, name ambiguity and de-anonymity have been widely studied. Existing methods for identifying characters in social networks can be divided into three categories according to their data source: internal relational databases, Internet webpages, and the topological structure of social networks.
Name ambiguity and de-anonymity share the same essential features. In the past, the identity of a character was determined from the accurate attribute information held in the internal databases of enterprises. In 2008, Narayanan and Shmatikov proposed a method for processing high-dimensional data [20], such as personal attributes, recommendations, and transaction information. With it, users can identify characters in an anonymous database from limited personal information. The method remains robust even when the background information is inaccurate or perturbed. However, an internal database is localized and static; it cannot describe the features of characters thoroughly. Therefore, these methods are not suitable for the name ambiguity problem in big data, which is complex, dynamic, and cross-platform.
Name ambiguity is more prominent on the Internet. In 2008, Tang et al. proposed a standard probability framework to recognize the independence of observed objects [21]. However, when we search for a name on the Internet, numerous webpages containing the same name can be returned, and it is not certain whether these pages belong to the same person. Bekkerman and McCallum proposed two statistical methods to solve this problem in 2005 [22]. One is based on the link structure of webpages, and the other is based on multiway distributional clustering. Both are unsupervised frameworks that need only a little prior knowledge, and the experiments show that their solution outperforms traditional clustering. However, the above methods are strongly affected by the uncertainty of web information.
At present, name disambiguation has become even more prominent in social network modeling and analysis. In 2008, Liu and Terzi analyzed character-centered social networks and pointed out that features of the relation structure can expose characters' identities [23]. For identity hiding in social relation networks, they defined graph anonymization and proposed an algorithm based on the k-degree anonymous graph and the node degree sequence. Narayanan and Shmatikov defined privacy in social networks in 2009 [24] and designed a novel reidentification algorithm that de-anonymizes and identifies nodes by using the topological structure of the network. For the dynamic evolution of social networks, Ding et al. proposed the "threading" technique, which uses the connections between released data to implement de-anonymization [25]. They also proposed combining structure information with node attributes to reidentify anonymous nodes. Korayem and Crandall worked on de-anonymization specifically in cross-platform social networks [26]; their method recognizes that different accounts belong to one user by extracting time sequence data, text features, geographic location, and social relation characteristics. In 2012, Srivatsa and Hicks introduced the de-anonymization of users' mobility traces based on the graph structure of social networks [27]. Since a large number of mobility traces can be used to build a contact graph of characters, they proposed a structural similarity of interuser correlation, which is used to find the corresponding nodes in the contact graph and the social network. De-anonymization with mobility traces is thus implemented by mapping character nodes between the two networks.
The methods mentioned above aim to solve the problem of anonymity in traditional networks, in which nodes and relations have only one category. However, a social network created from big data is more comprehensive: its relationships and nodes are of diverse categories, and it is temporal. In addition, these methods need additional attributes to supplement the topology information of social networks.
From enterprise databases to webpage databases and then to social network databases, the development has moved from local data to global data. Previous methods rely on the attribute details of local data; that is, they need much auxiliary information to identify characters, and their efficiency is low. Big data, however, is multisourced [28], time-variable, global, and macroscopic. Social networks are built on global data, and it is impossible to trace back to the distributed sources. In this type of network, the intrinsic relationship structure of vertices is a key factor for measuring the uniqueness of characters. Since different characters have different social relationships, we can identify characters by features of the network structure. To address the heterogeneity and temporality of big data networks, we propose a uniqueness correction method and the notion of activity path similarity based on heterogeneous temporal networks [29,30], in order to improve the efficiency and accuracy of character identification.

Social Network Modeling
A large-scale social network is built on diversified multimedia data, which is multimodal [31]; for instance, an image can be described by a color modality or a shape modality. Such data contains information about multifarious and complex social activities. We can build this kind of network by extracting relations from the transaction activity information extracted from multimedia datasets. As an academic network is a typical social network, we use it as an example to describe the process of social network mining.
In general, academic relations mainly include teacher-student relationships, classmate relationships, project partnerships, and coauthor relationships. These relations are contained in education experience, research and work experience, cooperation and coauthoring experience, and academic activity and conference experience. Information about academic activities is contained in project proposals, project progress and concluding reports, degree certificates, award certificates, photographs or videos of conferences, and other scientific documents. Therefore, we extract academic activity information from the multimedia data and then construct an academic transaction activity network. It is the basis for mining and analyzing the academic network among scholars.
Figure 1 shows a general view of the framework for constructing the academic relation network. First, we apply an academic activity transaction extraction method to multimedia sources, which contain texts, images, audio, and videos, to collect the individual resume information of scholars and information about team members. Then we construct an academic activity transaction database. This database contains personal information, study and work experience information, and project and publication information. After that, we build an academic activity relation network containing heterogeneous vertices and relations. On this basis, we create academic transaction activity networks. In this kind of network, there are several types of transaction activities, such as study experiences ("graduated from Tsinghua University," "studying in Cambridge University," "was conferred doctor's degree," etc.), work experiences ("worked at Microsoft MSRA," "teaches in Central South University," etc.), publication information ("published (μ + λ) Evolutionary Strategy for 3D Modeling and Segmentation with Superquadrics," etc.), and research experiences ("took over The Association Rules Mining of Time Series and Knowledge Discovery for Recognition of Expert Academic Activities Track project," etc.).
These transaction networks are 2-mode networks, which consist of two types of vertices, character vertices and entity vertices, together with their activity relations. The two vertex types represent scholars or researchers and academic entities, respectively. We can mine alumni relationships, workmate relationships, project cooperation, and coauthor relationships from them. The character vertices and academic relations constitute the academic relationship network, which is a homogeneous 1-mode network. We propose a vertex merging method based on the structure error of the network to implement uniqueness correction in this 1-mode network.
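The step from the 2-mode transaction network to the 1-mode relationship network can be illustrated with a short sketch. It is not the implementation used in this paper; the record layout (character, entity) and the function name `project_to_one_mode` are assumptions made only for the example. Two characters are linked whenever they are attached to the same entity vertex.

```python
from collections import defaultdict
from itertools import combinations


def project_to_one_mode(activities):
    """Project (character, entity) records onto a character-character network.

    Two characters are linked when they share at least one entity vertex;
    the value stored for each link is the set of shared entities.
    """
    members = defaultdict(set)  # entity -> characters attached to it
    for character, entity in activities:
        members[entity].add(character)

    links = defaultdict(set)  # frozenset({a, b}) -> shared entities
    for entity, characters in members.items():
        for a, b in combinations(sorted(characters), 2):
            links[frozenset((a, b))].add(entity)
    return dict(links)


# Hypothetical records: (character vertex, entity vertex)
activities = [
    ("Scholar A", "University X"),
    ("Scholar B", "University X"),
    ("Scholar B", "Project Y"),
    ("Scholar C", "Project Y"),
]
network = project_to_one_mode(activities)
```

In this toy input, Scholar A and Scholar B become linked through University X, while Scholar A and Scholar C remain unlinked because they share no entity.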

Evaluating Uniqueness of Character Vertices Based on Structure Error
Redundant vertex and relation information is generally carried by nonunique character vertices. Thus, correct structure merging is a key step in removing redundant information from social networks. Theoretically, the structure of a network does not change after redundant vertices and relations are merged. We evaluate the uniqueness of character vertices with a merging test and then screen out redundant vertex candidates.

Evaluating Uniqueness of Character Vertices.
In a social network, we regard character vertices that have the same neighbors as suspicious redundant vertices. Some of them contain redundant information and are nonunique, while others are highly similar but not redundant. Thus, we call the suspicious redundant vertices redundant vertex candidates.
) and the values of structure error between  are not zero.

Redundant Vertices Candidates.
In theory, nonunique character vertices have identical or nearly identical relation structures. This situation generates redundant relations and vertices, which should be merged in order to remove the redundant information. We introduce the notion of structure error to describe the difference in network structure between vertices. Vertices with identical or highly similar structure are referred to as redundant vertex candidates. They contain redundant relation information.

Definition 3 (redundant vertices). Let a vertices set be
If the vertices in Η1 , Η2 , . . ., Η are nonunique, the set Η is referred to as redundant vertices set.The number of all vertices in Η is denoted as   .
Based on this notion, we can recognize redundant vertex candidates in social networks according to structure error. If the structure error of $v_i$ and $v_j$ is nonzero, we regard $v_i$ and $v_j$ as a pair of unique vertices; otherwise they are redundant vertex candidates.

Algorithm.
We designed a method for screening redundant vertex candidates in social networks according to the above notion, which is shown in Algorithms 1 and 2. First, we arbitrarily select two character vertices $v_i$ and $v_j$ from the network and calculate the number of relations between these vertices and their neighbors, denoted as preRelations. Second, following the merging principle, we calculate the number of relations between them and their neighbors after correct merging, denoted as postRelations. Last, we calculate the structure error of each vertex pair and put the pairs whose structure error is zero into the set of redundant vertex candidates.
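The screening procedure can be sketched as follows. The exact definition of structure error is not reproduced in this section, so the sketch assumes one that is consistent with the description above: if two vertices are truly redundant, merging them halves their relations, so the error is taken to be 2 · postRelations − preRelations. The function names and the adjacency-set input are likewise hypothetical.

```python
from itertools import combinations


def structure_error(neighbors, u, v):
    """Assumed structure error of the pair (u, v); 0 means identical neighbourhoods.

    pre_relations  = relations of u and v before merging
    post_relations = relations of the merged vertex
    """
    n_u = neighbors[u] - {v}
    n_v = neighbors[v] - {u}
    pre_relations = len(n_u) + len(n_v)
    post_relations = len(n_u | n_v)
    return 2 * post_relations - pre_relations


def screen_candidates(neighbors):
    """Return the vertex pairs whose assumed structure error is zero."""
    candidates = []
    for u, v in combinations(sorted(neighbors), 2):
        # Skip pairs that have no relation to any third vertex.
        has_relations = bool(neighbors[u] - {v}) and bool(neighbors[v] - {u})
        if has_relations and structure_error(neighbors, u, v) == 0:
            candidates.append((u, v))
    return candidates


# Hypothetical 1-mode network given as adjacency sets
neighbors = {
    "A": {"C", "D"},
    "B": {"C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B", "E"},
    "E": {"D"},
}
candidates = screen_candidates(neighbors)
```

Under this assumption the error equals the number of neighbors that the two vertices do not share, so only pairs with exactly the same neighbors, here A and B, are kept as candidates.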

Character Uniqueness Measure Based on Activity Path Similarity
Character vertices and entity vertices lie in different subnetworks. As differences in temporal attributes cause differences in relation paths, we introduce the activity path to describe this network structure. Based on this notion, we quantitatively measure the similarity of character vertices by calculating the temporal weights of activity paths. After combining the results from all subnetworks, character uniqueness can be measured precisely.
Thus, a large-scale transaction activity network (TAN) contains multiple types of vertices and relations, and differences in vertex types lead to differences in relation types [32]. In the real world, a TAN always contains several types of social activity; that is, there are different types of relations and vertices in one network. Figure 3 shows a heterogeneous academic TAN. It contains two types of vertices: scholar vertices (subscript S) and entity vertices (subscripts I, C, P, and R), which represent institutions, conferences, publications, and research projects. Because academic activities differ, there are different relations between vertices, such as the write relation between scholars and papers and the participation relation between scholars and conferences. We use the subscripts org, atd, incl, wrk, wrt, and undt to denote the six types of relations (organize, attend, included, work at, write, and undertake) and their corresponding temporal attribute sets.

Transaction Activity Path (TAP).
In a TAN, transaction activity paths (TAPs for short) depend on its topology. We regard character vertices as master vertices and entity vertices as their neighbor vertices, and then we can describe TAPs. A TAP is a path that goes through a pair of character vertices, one entity vertex, and the relations between them. From one master vertex to another, there are one or more TAPs through their common neighbors, and these paths carry the semantics and temporal attributes of the original transaction records.
Let $v_i$ and $v_j$ be character vertices and let $v$ be a neighbor (entity) vertex.
Definition 3 (transaction activity path). In a TAN, let $v_i$ or $v_j$ be the start vertex; a path that begins at $v_i$, goes through the neighbor vertex $v$, and ends at $v_j$ is called a transaction activity path. The set of TAPs between $v_i$ and $v_j$ is denoted by $P_{ij}$. Similarly, we denote the path through neighbor
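As an illustration of the definition, the following sketch enumerates TAPs from a list of relation records. The `Relation` record layout and the function name `taps` are assumptions for the example, not part of the paper's algorithms. Paths between two different characters run through a common neighbor; paths that return to their start vertex use the same relation twice.

```python
from collections import namedtuple

# Hypothetical record layout for one relation of the 2-mode network
Relation = namedtuple("Relation", "character entity label start end")


def taps(relations, source, target):
    """List TAPs source -> entity -> target as (first relation, second relation).

    For source != target the path runs through a common neighbour.
    For source == target the path uses the same relation twice.
    """
    paths = []
    for first in (r for r in relations if r.character == source):
        for second in (r for r in relations if r.character == target):
            if first.entity != second.entity:
                continue
            if source == target and first != second:
                continue
            paths.append((first, second))
    return paths


relations = [
    Relation("Scholar A", "University X", "study", 1998, 2002),
    Relation("Scholar B", "University X", "study", 1998, 2002),
    Relation("Scholar B", "Company Z", "work", 2003, 2010),
]
between = taps(relations, "Scholar A", "Scholar B")
loops_b = taps(relations, "Scholar B", "Scholar B")
```

Each returned pair keeps both relation records, so the start and end times needed for temporal weighting stay attached to the path.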

Character Uniqueness Measure.
Owing to the temporal attributes of relations, we can define and calculate temporal weights for relations and TAPs, which reflect the temporal characteristics of transaction activity networks. Based on temporal weights, we can calculate the TAP similarity to measure the degree of similarity of character vertex pairs. The similarity threshold is a filter that screens out unique vertices so that we obtain the redundant vertex set.

Temporal Weight Calculation.
In a transaction activity network, the temporal weights of relations are determined by their start and end times, while the temporal weights of TAPs are determined by the temporal weights of their relations. Based on the time attribute of a relation, which consists of its start time and end time, we can use the following equation to calculate the temporal weight of the relation: Now denotes the current date, and the two indices denote the label of the relation and the label of the neighbor vertex. The following equation is the temporal weight of TAPs: The temporal weight of a relation reflects its start time and end time, as well as its duration. The temporal weight of a TAP carries all of this information, since a TAP consists of two relations. The weight is thus determined by the temporal attributes of the relations.
Figure 5: The first type of TAP.

Transaction Activity Path Similarity.
In a transaction activity network, let $v_i$ and $v_j$ be character vertices and let $v$ be an entity vertex that is a neighbor of both $v_i$ and $v_j$. The TAP sets $P_{ij}$, $P_{ii}$, and $P_{jj}$ represent three types of paths, which have three different structures.
Figure 5 shows the first type of TAP, between $v_i$ and $v_j$. These paths begin at $v_i$, go through a relation, the neighbor $v$, and a second relation, and end at $v_j$. In a network, all TAPs between two different vertices are of this type. The second and third types are shown in Figures 6 and 7. Each of them begins at one vertex ($v_i$ or $v_j$) and ends at the same vertex, passing through the same relation twice.
Definition 2 (SimTAP).SimTAP is the similarity between two vertices V  and V  .It is decided by structure and temporal weight of TAPs between V  and V  .The definition formula of SimTAP is as follows: In this formula, (P  ), (P  ), and (P  ) denote the temporal weight sums of these three types of TAPs.We use the following formulas to calculate these weights: 1 and    1 are weights of relations between V  and V  .
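To make the role of the three path types concrete, the following sketch computes a similarity from their temporal weight sums. The normalization used here, twice the weight of the paths between the two vertices divided by the weights of the two kinds of self-returning paths, is an assumption in the style of PathSim-like measures and may differ from the formula of Definition 2; the function name `sim_tap` and the plain lists of path weights are also hypothetical.

```python
def sim_tap(w_between, w_self_i, w_self_j):
    """Assumed PathSim-style similarity from temporal path weights.

    w_between: weights of the TAPs between the two vertices (first type)
    w_self_i:  weights of the TAPs that start and end at the first vertex
    w_self_j:  weights of the TAPs that start and end at the second vertex
    """
    total_self = sum(w_self_i) + sum(w_self_j)
    if total_self == 0:
        return 0.0  # no activity recorded for either vertex
    return 2.0 * sum(w_between) / total_self


# Two vertices with identical activity records
identical = sim_tap([0.5, 0.25], [0.5, 0.25], [0.5, 0.25])
# Two vertices that share only part of their activities
partial = sim_tap([0.5], [0.5, 0.25], [0.5, 0.75])
```

With this normalization, two vertices whose activity records coincide reach the maximum value of 1, and the value decreases as the shared paths account for a smaller share of each vertex's own activity.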
Generally, a transaction activity network contains several subnetworks. To measure the similarity of characters over the whole network, we add the similarity values from each subnetwork and then calculate their arithmetic mean. Let a TAN be $\mathcal{G} = \{G_b \mid b \in B\}$. We calculate the similarity of the vertex pair $v_i$ and $v_j$ in each $G_b$ and obtain the TAP similarity set $\{\mathrm{SimTAP}_b(v_i, v_j) \mid b \in B\}$. After that, we calculate the arithmetic mean over $\mathcal{G}$:

$$\mathrm{SimTAP}(v_i, v_j) = \frac{1}{|B|} \sum_{b \in B} \mathrm{SimTAP}_b(v_i, v_j)$$

In the formula, $|B|$ is the number of subnetworks in $\mathcal{G}$ and $\mathrm{SimTAP}(v_i, v_j)$ is the TAP similarity of $v_i$ and $v_j$.

Character Uniqueness Measure.
SimTAP($v_i$, $v_j$) measures the uniqueness of characters quantitatively in TANs. The larger its value, the greater the similarity between the character vertices, and vice versa. Following this idea, we propose a uniqueness measurement for characters: after calculating SimTAP($v_i$, $v_j$), we set a character uniqueness threshold according to the features of the network and the data-analytic requirements and use it to screen the results. If SimTAP($v_i$, $v_j$) is below the threshold, we regard $v_i$ and $v_j$ as unique characters; if SimTAP($v_i$, $v_j$) is greater than or equal to the threshold, $v_i$ and $v_j$ are highly similar, which indicates that we need to merge these vertices and their shared relations.
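The averaging over subnetworks and the threshold test can be sketched as follows. The function names, the dictionary layout, and the similarity values are hypothetical; only the arithmetic mean, the "greater than or equal to" rule, and the threshold value 0.70 used in this paper are taken from the text.

```python
def overall_similarity(per_subnetwork):
    """Arithmetic mean of the per-subnetwork SimTAP values of one vertex pair."""
    return sum(per_subnetwork.values()) / len(per_subnetwork)


def split_by_threshold(pair_similarities, threshold=0.70):
    """Split vertex pairs into (redundant, unique) by their mean similarity."""
    redundant, unique = [], []
    for pair, per_subnetwork in pair_similarities.items():
        if overall_similarity(per_subnetwork) >= threshold:
            redundant.append(pair)  # merge these vertices and shared relations
        else:
            unique.append(pair)     # keep both vertices in the network
    return redundant, unique


# Hypothetical per-subnetwork similarities for two candidate pairs
pair_similarities = {
    ("A", "B"): {"study": 1.0, "work": 1.0, "research": 1.0, "coauthor": 1.0},
    ("C", "D"): {"study": 0.0, "work": 1.0, "research": 0.5, "coauthor": 0.5},
}
redundant, unique = split_by_threshold(pair_similarities)
```

In this toy input the pair (A, B) averages 1.0 and is marked for merging, while (C, D) averages 0.5 and is kept as two unique characters.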
In this table, columns $v_i$ and $v_j$ are the names of the characters, $G_S$, $G_W$, $G_P$, and $G_C$ indicate the similarity of the vertex pairs in the four subnetworks, and the remaining column denotes the similarity in the whole network. After setting the threshold to 0.70, we can

Table 1: The results of transaction activity similarity.

Algorithm Design.
We designed a TAP similarity algorithm based on the theory above, which is shown in Algorithm 5. First, we get the relation lists of the vertex pair $v_i$ and $v_j$ from each subnetwork and calculate the temporal weights of their transaction activity paths. Second, we calculate the similarity of $v_i$ and $v_j$ in each subnetwork and then the arithmetic mean SimTAP over the whole network, as shown in Algorithm 4. After traversing all vertex pairs in the candidate redundant vertex set H and obtaining their similarities, we set a threshold and compare it with each similarity. We regard the vertices whose similarity is larger than the threshold as redundant vertices and put them into the redundant vertex set Η, as shown in Algorithm 3. The vertices whose similarity is smaller than the threshold are regarded as unique vertices, and they must remain in the network.

Experiment and Analysis.
The multimedia dataset for building academic transaction networks contains texts, images, and videos, including proposals, papers, award certificates, and videos of academic conferences. In the experiment of this paper, we extract academic activity transaction data from 724 proposals of the Natural Science Foundation of China (NSFC) [33], which are texts in Chinese only, and establish a transaction database. After that, we import these data into the graph database Neo4j and construct transaction activity networks containing 598 vertices. We mine the academic relationships between scholars and then build academic networks. On this basis, we calculate the structure error of character vertices and give a visual presentation of the network [34]. We screen out the vertex pairs whose structure error is 0 as the redundant vertex candidate set H and then calculate the SimTAP of each vertex pair in H. Table 2 shows partial results of the structure error calculation.
In Table 2, the first two fields denote two vertices and the third field denotes the value of their structure error in the network. We can see that the structure errors of the vertex pairs Faye Wu and Fei Wu, and Shaojia Zhu and Shaonan Zhu, equal zero. Therefore, these two vertex pairs are regarded as redundant vertex candidates. Their structure features are shown in Figure 8. The four highlighted character vertices are Faye Wu, Fei Wu, Shaojia Zhu, and Shaonan Zhu. The two highlighted subnetworks illustrate that the vertices in each pair have the same neighbors.
To analyze our method in more depth, we extract academic activity information from the database. Tables 3-6 show the academic activity information of Faye Wu and Fei Wu.
We can see that Faye Wu and Fei Wu studied in the same school over the same period. Likewise, they have the same work, project, and publication experience. That is, their academic experience is identical. Thus, Faye Wu and Fei Wu are not unique, and one of the two vertices is redundant information.
In Figure 8, the vertices Shaojia Zhu and Shaonan Zhu have the same neighbors. Similarly, we extract their activity information.
From Tables 7-10 we can see that Shaojia Zhu and Shaonan Zhu studied in the same universities and were employed by the same employer, but over different periods. That means their education and work experience are different. The difference between Shaojia Zhu and Shaonan Zhu is caused by the difference in temporal attributes. Therefore, both of them are unique and they do not contain redundant information.
The results indicate that character vertices that have the same neighbors may not contain exactly the same social activity information. These vertices are redundant candidates, and some of them are in fact unique, but we cannot recognize these by structure error alone. Structure error can only screen out the vertices whose structure error is not zero, which are certainly unique. Therefore, we need to examine character uniqueness further.

TAPs Similarity Calculation.
We first calculate the similarity of vertex pairs in an academic network containing 589 characters. After setting the threshold to 0.70, we screen out the vertex pairs whose SimTAP is higher than the threshold. The results are shown in Table 11.
In this table, the similarity of the vertex pair Faye Wu and Fei Wu is 1.0000, which indicates that their academic activity information is identical; that is, the similarity between them reaches its maximum. The similarities of Jia Gao and Di Feng, Xinhua Zou and Xingxing Zou, and Zhe Feng and Kang Du are 0.7450, 0.7463, 0.8546, and 0.7500.

Regression Analysis.
We chose the vertex pairs whose similarity in a subnetwork is zero and extracted their transaction information from the database. It is shown in Tables 12, 13, and 14.
We can see from Table 12 that Jia Gao and Di Feng studied in three different universities. Likewise, Table 13 shows that Xinghua Zou and Xingxing Zou studied in different colleges as well. In Table 14, the publications of Zhe Feng and Kang Du are entirely different. These situations indicate that these three pairs of characters are different in education and publication

4. 1 . 1 .
Uniqueness of Vertices.Let  = ⟨, , Φ⟩ be a 1-mode network in which the vertices represent characters.Two Extracting information from multimedia data Transaction

Figure 1: Framework of academic relationship network construction.

Figure 6: The second type of TAP.

Figure 7: The third type of TAP.

Instance 2 .
In a transaction activity network  = {G S , G W , G R , G C }, the type sets of vertices and relations are, respectively, denoted by A and B = {Study, Work, Research, Coauthor}.There are 10 character vertices in this network and the values of similarity of them are shown in Table

Table 2: Structure error of vertices in the network (columns: first vertex, second vertex, structure error of the pair).

Table 3: Education experience info of Faye Wu and Fei Wu.

Table 4: Work experience info of Faye Wu and Fei Wu.

Table 5: Project information of Faye Wu and Fei Wu.

Table 6: Publication info of Faye Wu and Fei Wu.
and Nonlinear Feedback Synchronization and Performance Research in Discrete Chaotic System | 2004 | 2004
Fei Wu | Adaptive Control Synchronization Approach Research of Unified Chaotic Systems | 2000 | 2000
Fei Wu | Linear and Nonlinear Feedback Synchronization and Performance Research in Discrete Chaotic System | 2004 | 2004

Table 7: Education experience info of Shaojia Zhu and Shaonan Zhu.

Table 8: Work experience info of Shaojia Zhu and Shaonan Zhu.

Table 9: Project information of Shaojia Zhu and Shaonan Zhu.