Why You Go Reveals Who You Know: Disclosing Social Relationship by Cooccurrence

The popularity of location-based services (LBS) and the ubiquity of sensor devices have resulted in rich spatiotemporal data. A large number of human behaviors have been recorded, including cooccurrence, which refers to the phenomenon that two people have been to the same place at the same time. These data enable attackers to infer people's social relationships from their cooccurrences, and many attack models have been proposed. However, current attack models still cannot effectively address the following two challenges: How can cooccurrences of acquaintances be distinguished from those of strangers? What kind of cooccurrence contributes to strong social strength? In this paper, we present a novel social relationship attack model, the Mobility Intention-based Relationship Inference (MIRI) model, which addresses both issues. First, we extract mobility intentions and adopt them to characterize cooccurrences. Then, a classification model is trained for attacking social relationship. The experimental results on two real-world datasets demonstrate that the proposed MIRI model can properly differentiate cooccurrences by simultaneously considering spatial and temporal features. The comparison results also indicate that the MIRI model significantly outperforms state-of-the-art social relationship attack models.


Introduction
Nowadays, along with the rapid advances in Internet of Things (IOT) devices [1], and especially with the popularity of social network services (SNSs) [2], the volume of human spatiotemporal data has increased tremendously. A large number of spatiotemporal datasets have been developed for research and applications. In the last decade, many attack models [3][4][5][6] and privacy protection strategies [7] have been proposed to protect users' privacy before datasets are released to the public. However, most existing studies focus on protecting individual privacy. Few studies address the protection of the social relationship between two users.
On the one hand, it is quite clear that two people who have an intimate social relationship, such as friends, colleagues, or family members, are often seen together. This phenomenon is known as cooccurrence, where two people have been to the same place at the same time [8]. Cooccurrence is very common in people's daily life. On the other hand, the abundant spatiotemporal datasets enable adversaries to analyze social relationship based on cooccurrence, which leads to the problem of social relationship protection.
If a social relationship between two people that should not be known is inferred by criminals, it can lead to crimes, and the two people might even get hurt in the real world. In 2014, a criminal learned that a woman was an accountant employed by the owner of private companies in China. The criminal pretended to be the owner and obtained $30,000 by fraud. Along with the increasing amount of human spatiotemporal data, attacking social relationship based on cooccurrence has become a focal point in social relationship research.
In the past few years, many attack models based on cooccurrence have been proposed. These attack models employ spatial features of cooccurrence, such as location entropy [9], to infer social relationship. The basic idea is that cooccurrences at nonpublic places imply strong social strength, while cooccurrences at public locations contribute less to social strength. These attack models have two problems. First, as we will see later, these attack models can be prevented by inserting fake records. Second, the precision of these attack models is not satisfactory. Two people with an intimate relationship, such as friends, may have cooccurred at public locations, such as a shopping mall or a cinema. This leads to an unreliable estimation of social strength.
What is more, existing models are based on self-reported data that involve explicit user operations, such as Gowalla [8] and cell phone data [10]. In general, many cooccurrences in such data belong to acquaintances due to reporting motivation [11]. Hence, it is easy to distinguish cooccurrences of acquaintances from those of strangers in existing attack models. However, with the development of the Internet of Things (IOT), rich passively collected spatiotemporal data are being produced by IOT devices, such as the Smart Card Data (SCD) of public transport [12], traffic surveillance [13], and bank notes [14]. The collection devices are deployed in public places and people have little awareness of them. Therefore, most cooccurrences are coincidences, meaning that two strangers cooccurred.
What is more, all cooccurrences in passively collected spatiotemporal datasets take place at public locations. Figure 1 shows the distribution of location entropy in Beijing Bus Smart Card (BBSC) and Gowalla, the two real datasets used in our experiment. Over 90% of the locations in Gowalla have entropy less than 4, and 30% of the locations are checked in by only one user. In contrast, in the BBSC dataset, more than 70% of the locations have entropy over 4. This means that colocations in BBSC are mainly public places. Hence, existing attack models cannot be directly applied to passively collected data. In summary, there are still two significant challenges in existing social relationship attack models: first, how to distinguish cooccurrences of acquaintances from those of strangers and, second, what kind of cooccurrence contributes to strong social strength.
To address the above two challenges, we propose a novel inference attack model called the Mobility Intention-based Relationship Inference (MIRI) model. We adopt mobility intention to analyze cooccurrence behaviors and exploit their contributions to social relationship. In general, human mobility is fundamentally driven by diverse mobility intentions, such as family party, shopping, and dining. If two people always cooccur for the same mobility intentions, they are acquaintances with high probability. If two people cooccur for different mobility intentions, they are likely to be strangers and the cooccurrences are likely to be coincidences. Moreover, the social relationship of two people who often cooccur for shopping or entertainment is expected to be much closer than that of two people who only cooccur for commuting.
In the MIRI model, we first obtain mobility intentions. Then, an Adaboost model is trained to map every cooccurrence to a mobility intention dyad. Finally, we train an SVM classifier based on mobility intention dyads to infer social relationship.
The major contributions of this paper are as follows: (i) We analyze existing inference attack models and propose a method to protect social relationship.
The remainder of this paper is organized as follows. In Section 2, the problem is formally defined and current inference attack models are analyzed. Our proposed model is detailed in Section 3. In Section 4, extensive experiments are conducted. Finally, we conclude the paper with future directions in Section 5.

Problem Definition and Current Attack Models
Definition 2 (social relationship attack). Given a cooccurrence set $C_{uv}$ of users $u$ and $v$, the problem of social relationship attack is to infer whether they are acquaintances or strangers and how close their relationship is.
In this work, we use mobility intention to attack social relationship from spatiotemporal data. Formally, mobility intention is defined as follows.
Definition 3 (mobility intention). The mobility intention refers to a common cause which can explain why a user appeared at location $loc$ at time $t$.

Current Attack Models.
In the last decade, many research works have focused on attacking social relationship by cooccurrences. Most attack models are based on spatial features, such as location entropy and the number of cooccurrences. The relation between social relationship and cooccurrence was first studied by Crandall et al. [15]. They found that, among Flickr users, the probability of a social tie increases sharply as the number of cooccurrences increases. The works [10,16] obtained similar conclusions from cell phone data. Nonetheless, cooccurrences at different locations do not contribute equally to social relationship. Other spatial features of cooccurrence have therefore been adopted to decide the contribution to social relationship.
Location entropy is a widely used spatial feature. It takes into account both the number of users who are observed at a location and the relative proportions of their footprints. Let $loc_k$ be a location and $U_k$ be the set of all users who have footprints at $loc_k$. Let $F_k$ be the set of footprints at $loc_k$ and $F_{k,i}$ be the set of $u_i$'s footprints at $loc_k$. The probability that a randomly picked footprint from $F_k$ belongs to user $u_i$ is $P_k(u_i) = |F_{k,i}|/|F_k|$, which is the fraction of all footprints at location $loc_k$ that belong to user $u_i$. If we define this event as a random variable, then its uncertainty is given by the location entropy
$$H(loc_k) = -\sum_{u_i \in U_k} P_k(u_i) \log P_k(u_i). \quad (1)$$
A high location entropy indicates that many users have footprints at the location in equal proportions. Many public places, such as plazas, famous scenic spots, train stations, and stadiums, are popular with many visitors and have high location entropy. Conversely, a location has low entropy if the distribution of its footprints is heavily concentrated on a few users. Private places, such as houses, which are specific to a few people, have low location entropy. From the definition of location entropy (1), it is clear that locations with high entropy usually have more cooccurrences than locations with low entropy.
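As an illustration of (1), the following minimal Python sketch computes the entropy of one location from the list of users who left footprints there. The function name `location_entropy`, the sample user lists, and the use of the natural logarithm are assumptions made for illustration; the logarithm base used in the experiments is not stated here.

```python
import math
from collections import Counter

def location_entropy(footprint_users):
    """Shannon entropy of the user distribution at one location.

    footprint_users: iterable with one user id per footprint at the location.
    The natural logarithm is an assumption of this sketch.
    """
    counts = Counter(footprint_users)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        p = n / total              # P_k(u_i) = |F_{k,i}| / |F_k|
        entropy -= p * math.log(p)
    return entropy

# Hypothetical data: a home dominated by one user vs. a station with ten equal visitors.
private_home = ["alice"] * 9 + ["bob"]
train_station = ["u%d" % i for i in range(10)]
print(location_entropy(private_home) < location_entropy(train_station))
```

A location visited by a single user has entropy 0, and a location whose footprints are split equally among n users has entropy log n, which is the maximum for n users.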
Cranshaw et al. [9] were the first to use location entropy to assign different contributions to cooccurrences. Their experimental results show that the precision of social relationship attack is greatly improved. Since then, many state-of-the-art inference models have been based on location entropy. In [8], Pham et al. considered both location entropy and diversity of locations in cooccurrences. They proposed the EBM model, which is a linear regression model to attack the social relationship of two users:
$$s_{ij} = \alpha \cdot D_{ij} + \beta \cdot F_{ij} + \gamma, \quad (2)$$
where $s_{ij}$ is the social strength of $u_i$ and $u_j$, and $\alpha$, $\beta$, and $\gamma$ are optimal parameters. $D_{ij}$ is a measurement of the diversity of colocations and $F_{ij}$ is closely related to the entropy of colocations. Wang et al. [17] considered not only location entropy but also two additional factors: individual mobility pattern and time gaps between two consecutive cooccurrences. The result of the proposed model PGT is the product of the above three weights:
$$w_{ij} = w_{P} \cdot w_{G} \cdot w_{T}, \quad (3)$$
where $w_{G}$ is decided by location entropy and plays a key role in the proposed attack model. Zhou et al. [18] proposed the TAI model, which considers the distribution of cooccurrences over locations.
The problem of social relationship attack cannot be solved by current location privacy protection techniques. Shahabi et al. [19] proposed a framework which can attack social relationship from privacy-preserving spatiotemporal datasets. However, existing attack models are mainly based on location entropy. Furthermore, colocations with low entropy provide more information for social relationship attack. Hence, we can protect the social relationship by increasing the location entropy of locations with low entropy. From (1), we know that a location has high entropy if many users visited it in equal proportions. The entropy of a low-entropy location can therefore be raised by inserting fake footprints at that colocation. For example, suppose there are five footprints of one user at a private location $loc_k$; the entropy of $loc_k$ is
$$H(loc_k) = -\frac{5}{5}\log\frac{5}{5} = 0.$$
If we add fifteen fake footprints divided equally among three users, then the entropy of $loc_k$ is
$$H(loc_k) = -\sum_{n=1}^{4}\frac{5}{20}\log\frac{5}{20} = \log 4.$$
The new entropy of $loc_k$ is much greater than the original value. Based on (2) and (3), the precision of EBM and PGT will decrease dramatically after location entropy values are modified. Therefore, we can protect social relationship by inserting fake footprints before releasing spatiotemporal datasets.
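The fake-footprint example above can be checked numerically. The sketch below is illustrative only: the helper names `entropy` and `add_fake_footprints`, the user ids, and the natural logarithm are assumptions, and the padding strategy (an equal number of fake footprints per fake user) simply mirrors the worked example rather than a complete protection mechanism.

```python
import math
from collections import Counter

def entropy(users):
    """Location entropy of a non-empty list of footprint owners (natural log)."""
    counts = Counter(users)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def add_fake_footprints(users, fake_users, per_user):
    """Return a new footprint list with per_user fake footprints for each fake user."""
    padded = list(users)
    for fake in fake_users:
        padded.extend([fake] * per_user)
    return padded

# Five footprints of one user at a private location.
original = ["u1"] * 5
# Fifteen fake footprints divided equally among three hypothetical users.
padded = add_fake_footprints(original, ["fake_a", "fake_b", "fake_c"], per_user=5)

before = entropy(original)   # one user holds every footprint
after = entropy(padded)      # four users hold five footprints each
print(before, after)
```

With these inputs the entropy rises from 0 to log 4, so any attack weight that decreases with location entropy assigns less importance to cooccurrences at the padded location.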
What is more, although the above inference models have shown how social relationship correlates with cooccurrence, the weights of these models have limited discernibility because they consider only partial features of cooccurrence. The problem of distinguishing cooccurrences of acquaintances from those of strangers has not been studied in existing attack models. In contrast, as we will discuss later, our proposed MIRI model determines a cooccurrence's contribution by the corresponding mobility intentions, which take into account various spatial and temporal features, and thus overcomes the drawbacks of previous works.

Social Relationship Inference
3.1. Overview. As shown in Figure 2, the MIRI model consists of three steps: (1) obtaining mobility intentions; (2) mapping every cooccurrence to a mobility intention dyad; (3) training an SVM classifier for attacking social relationship. After the SVM classifier is trained, given the cooccurrences of two people, we can infer whether they are acquaintances or strangers.
We explain how to obtain mobility intentions in Section 3.2. In Section 3.3, we train an Adaboost model with comprehensive feature engineering for mapping a footprint to a mobility intention. Then, a cooccurrence can be characterized by a mobility intention dyad. The problem of extracting cooccurrences from a spatiotemporal dataset is out of the scope of this paper, and we assume cooccurrences have been obtained before they are characterized. In Section 3.4, a feature vector based on mobility intention dyads is constructed for attacking social relationship. If the social strength between users in the training dataset can be measured by a continuous value such as the Katz score [8], we can train a linear regression model and tell how close two users' relationship is. However, the training dataset of passively collected spatiotemporal data only tells whether two users are acquaintances or not. Hence, in this paper, a binary SVM classifier is trained to infer whether two users are acquaintances or strangers. Our binary classifier can be extended to a linear regression model once a continuous measurement is available in the training dataset.

Mobility Intention Extraction.
As mentioned before, mobility intentions can be used to infer social relationship. They are latent variables which cannot be observed directly. First of all, we need to know how many and what kinds of mobility intentions are hidden in a spatiotemporal dataset. There are two ways to obtain mobility intentions: summarizing them from auxiliary material and extracting them from the dataset.
First, we can obtain mobility intentions from auxiliary materials, such as statistical materials or social network service (SNS) providers. For instance, the 2009 National Household Travel Survey (NHTS) provides information on the nation's inventory of daily travel, including mobility intention (work, shopping, etc.) [20]. The Beijing Transport Institute releases the Beijing Transport Annual Report on public transport in Beijing to the public every year [21]. The annual reports provide the proportions of mobility intentions in Beijing public transport. The proportions of seven mobility intentions in 2015 and 2016 are shown in Figure 3.
Many SNSs provide location category information. There were nine location categories in the Gowalla dataset and five location categories in another popular check-in service, Foursquare: food, coffee, nightlife, fun, and shopping. Many researchers [22][23][24] consider location categories as activity categories, such as entertainment, food, and shopping. In general, the activity categories can also be considered as mobility intentions.
However, many spatiotemporal datasets have no auxiliary information. Furthermore, mobility intentions are not the same in different datasets. The second way to obtain mobility intentions is to extract them from the spatiotemporal dataset itself. Many studies show that human mobility follows simple reproducible patterns. The mobility patterns show a high degree of temporal and spatial regularity and can be considered as mobility intentions [14,25]. For example, commuting, which is a basic mobility pattern in many spatiotemporal datasets, can be used to explain why a worker arrived at the workplace around 9 a.m. on work days. Hence, we can use mobility patterns as mobility intentions.
In our former work [26], we used Nonnegative Tensor Factorization (NTF) to extract mobility patterns from a spatiotemporal dataset and considered them as mobility intentions. NTF is an effective tool for analyzing the interrelationship between the spatial and temporal attributes of a spatiotemporal dataset [27]. The CANDECOMP/PARAFAC (CP) decomposition algorithm [28] is a widely used form of NTF and factorizes a tensor $\mathcal{Y}$ into a sum of component rank-one tensors in the following manner:
$$\mathcal{Y} \approx \sum_{r=1}^{R} \mathcal{Y}_r.$$
Every rank-one tensor $\mathcal{Y}_r$ is the outer product of three vectors:
$$\mathcal{Y}_r = \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r.$$
Each $\mathcal{Y}_r$ is a mobility pattern and is considered as a mobility intention in our former work.
In order to extract mobility intentions, we first construct a three-dimensional location-hour-day tensor $\mathcal{X} \in \mathbb{R}^{l \times h \times d}$. We partition one day into $h$ time bins with approximately equal time intervals. The element $x_{l_i, h_j, d_k}$ of the three-way tensor $\mathcal{X} \in \mathbb{R}^{l \times h \times d}$ is computed from $\mathrm{Count}(l_i, h_j, d_k)$, where $l_i$, $h_j$, and $d_k$ are the indices of the location, the time bin, and the day of month, respectively; $l$ is the total number of locations; and $\mathrm{Count}(l_i, h_j, d_k)$ is the number of users who appeared at location $l_i$ in time bin $h_j$ on the $d_k$th day.
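The construction of the location-hour-day tensor can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `build_count_tensor`, the footprint tuple layout, and the sample records are hypothetical, and the sketch stores the raw number of distinct users per cell without any normalization.

```python
from collections import defaultdict

def build_count_tensor(footprints, locations, num_bins, num_days):
    """Build a location x time-bin x day tensor of distinct-user counts.

    footprints: iterable of (user_id, location_id, hour_of_day, day_of_month),
    with hour_of_day in [0, 24) and day_of_month starting at 1.
    """
    loc_index = {loc: i for i, loc in enumerate(locations)}
    bin_width = 24 / num_bins
    seen = defaultdict(set)  # (location, bin, day) -> users observed in that cell
    for user, loc, hour, day in footprints:
        time_bin = min(int(hour // bin_width), num_bins - 1)
        seen[(loc_index[loc], time_bin, day - 1)].add(user)
    tensor = [[[0] * num_days for _ in range(num_bins)] for _ in locations]
    for (i, j, k), users in seen.items():
        tensor[i][j][k] = len(users)
    return tensor

# Hypothetical records: user "a" swipes twice at station s1 around 9 a.m. on day 1.
records = [("a", "s1", 9, 1), ("b", "s1", 9.5, 1), ("a", "s1", 9.2, 1), ("c", "s2", 18, 2)]
X = build_count_tensor(records, ["s1", "s2"], num_bins=24, num_days=31)
print(X[0][9][0], X[1][18][1])
```

Counting distinct users rather than raw records follows the description of Count as the number of users who appeared in a cell, so repeated footprints of the same user in one cell are counted once.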
Then the tensor $\mathcal{X}$ is factorized into a linear combination of rank-one tensors $\mathcal{Y}_r$ through the CP decomposition algorithm. After decomposition, we manually assign labels that summarize the mobility intention described by every rank-one tensor $\mathcal{Y}_r$. Here, every rank-one tensor $\mathcal{Y}_r$ is the outer product of three vectors $\mathbf{h}_r$, $\mathbf{l}_r$, and $\mathbf{d}_r$:
$$\mathcal{Y}_r = \mathbf{h}_r \circ \mathbf{l}_r \circ \mathbf{d}_r,$$
where $\mathbf{h}_r$, $\mathbf{l}_r$, and $\mathbf{d}_r$ can be considered the inherent characteristics of the mobility intention for hour, location, and day, respectively. What is more, these rank-one tensors can also be used to analyze the spatiotemporal features which are applied to form the mapping function in the following subsection. We use $m_r$ to denote the $r$th mobility intention and $M = \{m_r \mid 1 \leqslant r \leqslant R\}$ to represent the set of $R$ mobility intentions. After obtaining the mobility intentions, we can map every footprint to a mobility intention.
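To make the rank-one structure concrete, the sketch below builds a tensor as a sum of outer products of hour, location, and day vectors. It only illustrates the form of the CP model; it does not perform the factorization itself, and the names `outer3` and `cp_reconstruct` and the toy factor values are assumptions. The axis order (hour, location, day) follows the order in which the three vectors are listed above.

```python
def outer3(h, l, d):
    """Outer product of three vectors: entry [a][b][c] = h[a] * l[b] * d[c]."""
    return [[[ha * lb * dc for dc in d] for lb in l] for ha in h]

def cp_reconstruct(components):
    """Sum rank-one tensors given as (h, l, d) triples with matching lengths."""
    h0, l0, d0 = components[0]
    total = [[[0.0] * len(d0) for _ in l0] for _ in h0]
    for h, l, d in components:
        rank_one = outer3(h, l, d)
        for a in range(len(h0)):
            for b in range(len(l0)):
                for c in range(len(d0)):
                    total[a][b][c] += rank_one[a][b][c]
    return total

# Toy factors: a "morning, location 0, workdays" pattern and an
# "evening, location 1, weekend" pattern (two hours, two locations, three days).
commuting = ([1.0, 0.0], [1.0, 0.0], [1.0, 1.0, 0.0])
leisure = ([0.0, 1.0], [0.0, 1.0], [0.0, 0.0, 1.0])
Y = cp_reconstruct([commuting, leisure])
print(Y[0][0], Y[1][1])
```

Each component is nonzero only where its hour, location, and day vectors are all nonzero, which is why a component can be read as one spatiotemporal pattern and labeled manually.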

Inferring Mobility Intention Dyads.
In this subsection, we detail how to map a cooccurrence $c = (f_k^u, f_l^v)$ to a mobility intention dyad. We consider every mobility intention $m_r \in M$, $1 \leqslant r \leqslant R$, as one class. Each footprint $f_k^u = (loc_k^u, t_k^u)$ corresponds to one mobility intention; in other words, it belongs to one class. The mobility intention mapping can then be considered as a multiclass classification problem.
In order to achieve good multiclass classification performance, we perform comprehensive feature engineering and model training. After analyzing the vectors of the rank-one tensors in the last subsection and considering the experimental results of feature engineering, we propose three kinds of distinguishable features: spatial features, hour features, and day features. Let $f_k^u = (loc_k^u, t_k^u)$ and $f_{k-1}^u = (loc_{k-1}^u, t_{k-1}^u)$ be the $k$th and $(k-1)$th footprints of user $u$, respectively, and let $f_{k'}^u$ be the last footprint of user $u$ which also occurred at location $loc_k^u$. Then, the features can be defined as follows:
(i) Spatial features
(a) Location entropy is used to measure the popularity of a location and its definition is
$$H(loc_k^u) = -\sum_{v} P_v \log P_v,$$
where $P_v$ is the probability that a randomly picked check-in from all check-ins at location $loc_k^u$ belongs to user $v$.
Finally, with the above features, we train an Adaboost model [29] to map $f_k^u$ to a mobility intention $m_k^u$. Then, we can characterize a cooccurrence $c = (f_k^u, f_l^v)$ with a mobility intention dyad $(m_k^u, m_l^v)$ through the trained Adaboost model. In Section 2, we discussed how to protect social relationship by changing the entropy of locations. However, this method does not work here. The location entropy is only one of the inputs of our Adaboost model, and the Adaboost model can achieve similar classification results with much less tweaking of parameters. Hence, changing the location entropy will not seriously affect the multiclass classification results, and it cannot protect the social relationship against MIRI.

Social Relationship Inference.
So far, we can build the inference model based on the mobility intention dyads. By adopting the Adaboost model, the cooccurrence sequence $C_{uv}$ of user $u$ and user $v$ can be characterized as a sequence of mobility intention dyads, one dyad per cooccurrence. In order to identify the weight of each mobility intention dyad, we construct a mobility intention vector $\mathbf{m}_{uv}$ with the following $R(R+1)/2$ elements:
$$\mathbf{m}_{uv} = (\mathrm{Count}(m_1, m_1), \ldots, \mathrm{Count}(m_p, m_q), \ldots, \mathrm{Count}(m_R, m_R)), \quad 1 \leqslant p \leqslant q \leqslant R,$$
where $\mathrm{Count}(m_p, m_q)$ is the number of mobility intention dyads $(m_p, m_q)$ in the sequence. For example, suppose three mobility intentions $m_1$, $m_2$, and $m_3$ are extracted from a spatiotemporal dataset. Given the mobility intention dyads of $u$ and $v$, the feature vector $\mathbf{m}_{uv}$ shows that the two users cooccurred three times for the same mobility intention $m_1$, cooccurred twice with one for $m_1$ and the other for $m_2$, and cooccurred once for the remaining, different mobility intention dyads. Then, we can adopt virtually any existing binary classification algorithm to distinguish acquaintances from strangers. In this paper, we use SVM as our binary classifier. After training is finished, given a pair of users and their cooccurrences, we first map the cooccurrences to mobility intention dyads through the Adaboost model. Then we construct the mobility intention vector. Finally, we infer whether the two users are acquaintances or not with the trained classifier.
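The construction of the mobility intention vector can be sketched in a few lines of Python. The function name `intention_vector`, the element ordering, and the sample dyad sequence are assumptions made for illustration; in particular, the sketch treats a dyad as unordered, which is what the $R(R+1)/2$ vector length suggests.

```python
from collections import Counter
from itertools import combinations_with_replacement

def intention_vector(dyads, intentions):
    """Count unordered mobility intention dyads in a fixed order.

    dyads: list of (intention_a, intention_b), one per cooccurrence.
    intentions: ordered list of all R intention labels.
    Returns a list of R(R+1)/2 counts ordered as (m1,m1), (m1,m2), ..., (mR,mR).
    """
    rank = {m: i for i, m in enumerate(intentions)}
    # Sort each dyad by intention order so (m2, m1) and (m1, m2) share one key.
    counts = Counter(tuple(sorted(d, key=rank.__getitem__)) for d in dyads)
    return [counts[pair] for pair in combinations_with_replacement(intentions, 2)]

# Hypothetical dyad sequence for one user pair.
dyads = [("m1", "m1")] * 3 + [("m1", "m2"), ("m2", "m1"), ("m2", "m3")]
vector = intention_vector(dyads, ["m1", "m2", "m3"])
print(vector)  # counts for (m1,m1), (m1,m2), (m1,m3), (m2,m2), (m2,m3), (m3,m3)
```

The resulting fixed-length vector can be passed to any binary classifier, and the weight a linear classifier learns for each position can be read as the contribution of the corresponding dyad.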

Experiment and Analysis
In this section, we first describe the experimental settings, including the datasets and the experimental environment. Next, the baselines and the evaluation metric used to evaluate performance are discussed. Finally, we compare the MIRI model with several state-of-the-art models to demonstrate its effectiveness.
4.1. Settings. We use two real-world datasets in our experiment: Gowalla [2] and BBSC. The Gowalla dataset is a publicly available check-in dataset which was collected from February 2009 to October 2010. It is a self-reported dataset and consists of two different sets. The first one is spatiotemporal data with 6,442,890 check-ins from 107,092 users, and the format of every check-in is as follows: <user ID, latitude, longitude, timestamp, location ID>. The other one is a social graph of friendships among users and has 950,327 friend pairs, which serve as the ground truth. We mine cooccurrences from the first set and form the first experimental data collection together with the corresponding relationship information of user pairs in the second set. The BBSC dataset collects prepaid smart card records for public transportation in Beijing, China, and is a passively collected dataset. The format of each card record is as follows: <card ID, line ID, bus station, swap card time>. The geographic coordinates of each bus station can be obtained from the Google Places API. We obtained a dataset with 275,951,094 bus transaction records from about 16,161,460 cards, covering October 1, 2014, to October 31, 2014. We identified 412 card users and 2,796 friend pairs among these card users, and the cooccurrences of these 412 card users and the corresponding relationships form the second experimental data collection. Every experimental data collection is divided into two subsets: a training set and a test set. We construct the mobility intention-based relationship attack model and the baseline models on the training set and test the performance of all attack models on the test set. In order to verify the effectiveness of tensor decomposition, we extract mobility intentions from both BBSC and Gowalla.

Methodology.
The precision-recall curve is used to measure the precision of our model and to compare it with the baseline models. Let TR denote the set of ground truth friend pairs in the test set and MR be the set of friend pairs reported by a social relationship inference model. The precision and recall are defined as
$$\text{precision} = \frac{|TR \cap MR|}{|MR|}, \qquad \text{recall} = \frac{|TR \cap MR|}{|TR|}.$$
Three baseline models are chosen for performance comparison: EBM [8], PGT [17], and TAI [18]. EBM is a state-of-the-art social relationship inference model. It is an Entropy-Based Model that considers two major factors: location diversity, which is measured by Renyi entropy, and weighted frequency, which is based on location entropy. PGT is an extension of EBM that considers two additional factors: a personal factor, which indicates an individual user's probability of visiting a certain location, and a temporal factor, which considers the time gaps between consecutive cooccurrences. TAI is a probabilistic model based on LDA, and its performance is closely related to the number of "topic themes" and the spatiotemporal windows. The topic themes are similar to the mobility intentions of our proposed MIRI.
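The two metrics can be computed directly from the sets TR and MR. The following sketch is illustrative: the function name `precision_recall` and the sample pairs are hypothetical, and user pairs are treated as unordered so that (a, b) and (b, a) denote the same friend pair.

```python
def precision_recall(true_pairs, reported_pairs):
    """Precision and recall over unordered user pairs.

    true_pairs: ground truth friend pairs (TR).
    reported_pairs: friend pairs reported by an inference model (MR).
    """
    tr = {frozenset(p) for p in true_pairs}
    mr = {frozenset(p) for p in reported_pairs}
    hits = len(tr & mr)
    precision = hits / len(mr) if mr else 0.0
    recall = hits / len(tr) if tr else 0.0
    return precision, recall

# Hypothetical ground truth and model output.
TR = [("a", "b"), ("a", "c"), ("b", "d")]
MR = [("b", "a"), ("c", "d")]
precision, recall = precision_recall(TR, MR)
print(precision, recall)
```

Sweeping the decision threshold of a classifier and recomputing both values at each threshold yields the points of a precision-recall curve.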

Protecting Social Relationship by Increasing Location Entropy.
In Section 2.2, we analyzed current social relationship attack models and concluded that social relationship can be protected by increasing location entropy values. In this subsection, we use experiments to illustrate the proposed protection method.
In the Gowalla dataset, the location entropy of every location with entropy less than 3 is raised to 3. Then the EBM model and the PGT model are trained and tested. Figure 4 compares the performance of the two inference models on the original dataset and the modified dataset. After location entropy modification, the performance of both attack models decreases dramatically. The prediction precisions when recall is greater than 0.2 and less than 0.8 reduce by 35% to 89% and 120% to 296% for the EBM model and the PGT model, respectively. These results show that increasing location entropy is an effective method to protect social relationship against these models.

Performance Comparison with Baseline Models.
In Figure 5, the precision-recall curves of the baseline models and MIRI on the two datasets show that MIRI performs best among all compared models on both datasets. In detail, for the Gowalla dataset in Figure 5(a), MIRI outperforms PGT by 5% to 20% in precision because it considers comprehensive spatiotemporal features. For the BBSC dataset in Figure 5(b), the performance of EBM and MIRI on BBSC is not as good as that on Gowalla. There are two reasons for this. The first is that all cooccurrences in BBSC occurred at public places. The second is that there are more coincidences in BBSC than in Gowalla. Although the coincidence weights of EBM and PGT are very small, the number of coincidences is large, which leads to an unreliable estimation of social relationship. However, MIRI has stable performance on both datasets. A possible reason is that mobility intention plays a key role in differentiating cooccurrences of acquaintances from those of strangers. Our experiments demonstrate that the use of mobility intention improves the precision of social relationship attack.

Contribution of Mobility Intention Dyads.
After tensor decomposition, we extract ten and seven mobility intentions from the Gowalla dataset and the BBSC dataset, respectively. The seven mobility intentions from the BBSC dataset are commuting, shopping, visiting relatives/friends, dining, recreation, entertainment, and routing business. Compared with the mobility intentions in Figure 3, the mobility intentions extracted from the dataset are consistent with those in the Beijing Transport Annual Report [21]. This result verifies the effectiveness of extracting mobility intentions from a spatiotemporal dataset by the tensor decomposition method.
The top 5 positive and negative weights of mobility intention dyads are illustrated in Figure 6. The weights presented in the figure are normalized by the sum of the absolute values of all the weights to reflect the relative importance of each pair of mobility intentions. From the figure, negative weights are generally larger in magnitude than positive weights, and they differ little from one another. This provides evidence that two people are most likely strangers when they cooccurred for different mobility intentions. The positive weights, in contrast, vary widely. For the Gowalla dataset in Figure 6(a), the mobility intention pair party-party, which means two people cooccurred for a family party, is a dominant indicator of acquaintances. On the contrary, there is no single dominant positive weight in BBSC, probably because there are no private locations in BBSC. In general, the positive weights differ considerably in both Figure 6(a) and Figure 6(b), which means that dyads contribute differently to social relationship. This result shows that cooccurrences with high positive weights imply strong social strength.

Conclusions
In this paper, we discuss existing works on attacking social relationship by cooccurrences and propose a method that inserts fake footprints to prevent this kind of attack. We have proposed a novel social relationship attack model called MIRI, which uses mobility intention dyads as features and adopts a classifier to infer whether two persons are acquaintances or not. Extensive experiments indicate that the proposed model significantly outperforms existing social relationship attack models and can be applied to both self-reported datasets and passively collected datasets.
This work leads to important future research questions. How can we protect social relationship under the MIRI attack? What kind of operations are needed before releasing spatiotemporal datasets? We believe that this work provides a necessary step towards addressing such questions.

Figure 1: Distribution of location entropy in BBSC and Gowalla.

Figure 3: The proportion of mobility intention in Beijing public transport.

Figure 4: Comparison of original Gowalla dataset and its modification.

Figure 5: Comparison with the state-of-the-art models.
Notation for Definition 2 (Section 2): Let $C_{uv} = \{c_1, c_2, \ldots, c_n\}$ denote the set of $n$ cooccurrences of users $u$ and $v$. The social relationship attack problem from spatiotemporal data is formally defined on this set.
Remaining feature definitions for the mobility intention mapping (Section 3.3):
(i) Spatial features (continued)
(b) Location type indicates the category of location $loc_k^u$, such as bar or mall. Location type can be obtained from an LBS application program interface (API), such as the Google Places API (https://developers.google.com/places/?hl=zh-cn) or the sina weibo API (http://open.weibo.com/).
(c) Distance refers to the distance between location $loc_k^u$ and the last location $loc_{k-1}^u$.
(iii) Day features
(a) Day of week is the weekday of $t_k^u$. It is denoted by {0, . . ., 6}, which means Sunday to Saturday.
(b) Day of month is the day in the month of $t_k^u$. It is denoted by {1, . . ., 31}.
(c) Day type refers to the category of $t_k^u$. There are three categories: workday, short breaking holidays, and long holidays.
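Some of the features above can be derived from a footprint with the Python standard library alone, as the following sketch shows. The names `haversine_km` and `day_features` and the sample coordinates are hypothetical. The great-circle (haversine) formula is an assumption, since the distance metric is not specified here, and the day type feature is omitted because it requires a holiday calendar.

```python
import math
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def day_features(ts):
    """Return (day of week, day of month) with Sunday=0, ..., Saturday=6."""
    # datetime.weekday() uses Monday=0, so shift by one to start the week on Sunday.
    day_of_week = (ts.weekday() + 1) % 7
    return day_of_week, ts.day

# Hypothetical consecutive footprints of one card user.
prev_stop = (39.90, 116.40)
curr_stop = (39.95, 116.45)
distance = haversine_km(*prev_stop, *curr_stop)
features = day_features(datetime(2014, 10, 1, 8, 30))
print(round(distance, 2), features)
```

The resulting values can be concatenated with the location entropy and location type to form the input vector of the multiclass classifier.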