An Efficient and Differential Privacy-Based Scheme for Aggregating Mobility Datasets



Introduction
Along with the explosive growth of mobile smart devices, such as mobile phones, smart glasses, and in-vehicle navigation devices, the massive amount of user mobility data they collect has enabled the rapid development of various location-based applications. Recent market studies show that the Apple App Store has more than 2.2 million apps and Google Play has over 2.8 million apps [1]. From the perspective of supporting citizens' daily lives, these mobility data enable intelligent transportation systems [2], spatial resource optimization, and even fire emergency response [3]. From the perspective of academic research, mining mobility data enables us to better understand human behavior [4] and provides solutions to some global issues, such as controlling a full-blown COVID-19 epidemic [5]. Location-based applications have become a necessity in modern life.
Although these mobile smart devices and applications have brought considerable benefits and improved citizens' quality of life, the potential privacy risk has gradually attracted attention, since the large-scale and fine-grained collection of individual users' mobility data can reveal sensitive information, such as lifestyle, physical condition, home location, and even identity [6]. In [7], the authors infer the "top N" locations for each user from call records (which contain location information) and correlate this information with publicly available census data to find the user's home location. Golle and Partridge [8] found that even though the location records were anonymized, users' identities could still be inferred with some background knowledge. Besides, studies in [9] have shown that by combining location data, gender, zip code, and birthdate, the majority of the US population can be uniquely identified.
Thus, protecting the privacy of mobile smart device users has become a pressing problem, and extensive research has been carried out. Mobility datasets usually contain the trajectory data of the users; thus, a simple and naive idea to solve the privacy problem is to release aggregated mobility datasets, such as the number of users in a target block at a specific timestamp, instead of the raw trajectory data. No individual user's information seems to be exposed by this method, and the released data can still support a wide range of applications, such as epidemic control [10], transportation scheduling [11], and business intelligence [12]. However, in 2017, Xu et al. [13] developed an attack system based on the uniqueness and regularity of human mobility to recover individual trajectories from aggregated mobility datasets, achieving an accuracy of 73% to 91%. To resist this kind of attack, P4Mobi [14] was proposed to aggregate mobility datasets; it achieves privacy preservation at the statistical level with the probabilistic data structure count-min sketch (CMS). P4Mobi achieves outstanding performance in resisting the attack, but its capacity for privacy preservation is determined by the size of the CMS, which also controls the utility of the released data. To weaken the strong coupling between P4Mobi's privacy-preserving capacity and the utility of its released data, DP-Mobi [15] was proposed. DP-Mobi is an enhanced version of P4Mobi in which, in addition to CMS, differential privacy is employed to control the privacy level, which also enhances the privacy of the released data compared to P4Mobi. An overview of the workflow of P4Mobi and DP-Mobi is shown in Figure 1. The raw mobility data are first processed by the trusted third party, and then the privacy-preserving population distribution over a period of time is transmitted to the location-based service providers for further processing. Experimental results in [14, 15] demonstrate that, for both P4Mobi and DP-Mobi, the utility of the aggregated mobility datasets is determined by the size of the sketches: the better the utility, the larger the sketches must be, and thus the less efficient the data transmission between the trusted third party and the service providers.
The goal of our work is to improve the data transmission efficiency between the trusted third party and the service providers while retaining a desirable utility of the aggregated population distribution datasets. To this end, we propose an efficient and differential privacy-based scheme, which protects the privacy of the mobility datasets while improving data transmission efficiency. The workflow of our scheme is similar to that of P4Mobi and DP-Mobi as shown in Figure 1: the raw mobility dataset containing the trajectories of the users is collected by the trusted third party, where the global sketch of our scheme aggregates the raw mobility data and stores the result in the G-array. In the last step of our scheme, the G-array is transmitted to the location-based service providers for further use. To reduce the potential privacy risks during data transmission, instead of executing the enquiry procedure in the trusted third party, our scheme delivers the G-array to the location-based service providers and authorizes them to obtain the privacy-friendly population distribution datasets by enquiring the G-array.
The highlight of our scheme is that we develop two collaborative components, the global sketch and the temporal sketch, to aggregate mobility datasets with better utility preservation. In P4Mobi and DP-Mobi, only one sketch is employed in the aggregation procedure, so the utility of the released population distributions is directly affected by the size of the sketch. When the sketch is small, the probability of collisions (i.e., two or more different records being stored in the same cell of the sketch) is high, and less utility is preserved. In our scheme, the size of the temporal sketch is set larger than that of the global sketch, so the temporal sketch preserves utility better. Then, in the aggregation stage, a mobility record $l_{ij}$ is first stored in the temporal sketch, and only if the number of $l_{ij}$ stored in the global sketch is smaller than that in the temporal sketch is the mobility record $l_{ij}$ stored in the global sketch. Therefore, in our scheme, even if the global sketch is small, the utility of the final released population distributions can still be preserved. Besides, as privacy issues may occur in the data transmission procedure between the trusted third party and the location-based service providers, we employ the Laplace mechanism to make the transmitted data satisfy ϵ-differential privacy.
Journal of Advanced Transportation
The contributions of our paper are summarized as follows:
(i) We propose an efficient and differential privacy-based scheme for protecting mobility datasets, which aggregates the raw mobility data with the temporal sketch and the global sketch. To reduce the privacy risk in the data transmission stage, our scheme sends the G-array, which stores the population distributions in the global sketch, to the location-based service providers and authorizes them to reconstruct the privacy-preserved population distributions from the G-array.
(ii) To balance the tradeoff between the utility of the privacy-preserving population distribution and the size of the sketch that aggregates the mobility data, our scheme employs a larger temporal sketch to aggregate the mobility data and to guarantee the utility of the population distributions aggregated by the global sketch. Compared to other CMS-based methods, our scheme increases the utility of the aggregated population distributions when the sketches are of the same size.
(iii) We enhance the privacy of our scheme by employing the Laplace mechanism, so that the transmitted G-array satisfies ϵ-differential privacy. Users of our scheme can tune the privacy parameter λ to meet different privacy requirements.
(iv) We conduct an empirical evaluation of our scheme by comparing it with three other state-of-the-art privacy-preserving methods for mobility datasets (DP-simple, P4Mobi, and DPCMS). The experimental results demonstrate that the privacy-preserving capacity of our scheme is affected by the sizes of the temporal sketch and the global sketch and by the privacy parameter λ. Compared to the other three methods, the volume of data transmitted by our scheme is much smaller when the utility of the released population distributions is at the same level.
We organize the rest of the paper as follows. In Section 2, we present the preliminaries related to our scheme, including (a) count-min sketch, (b) differential privacy, and (c) privacy-preserving mechanisms on mobility datasets. A detailed introduction of our scheme is presented in Section 3. Section 4 evaluates our scheme through experiments that compare it with the other three privacy-preserving methods. Finally, we conclude our work and discuss future work in Section 5.

Preliminary
In this section, we present a set of definitions related to our scheme (Sections 2.1 and 2.2) and three other privacy-preserving mechanisms for mobility datasets (Section 2.3), which serve as the baselines for our scheme in Section 4.

Count-Min Sketch.
Count-min sketch (CMS), proposed by Cormode and Muthukrishnan [16], is a probabilistic data structure that stores the frequencies of items in an array and returns an estimate of the frequency of any given item when enquired. Due to its relatively small memory footprint and high accuracy, CMS has been a popular choice for a broad spectrum of applications, such as processing distributed datasets [17, 18], aggregating statistics in sensor networks [19], and detecting attacks in routers [20].
A CMS consists of an array of $d$ rows and $w$ columns, and we denote the cell in the $i$-th row and $j$-th column of the array by $A_i[j]$, where $1 \le i \le d$ and $1 \le j \le w$. Each row $A_i$ of the array is associated with an independent hash function $h_i(\cdot)$, which has a uniformly distributed output. To estimate the frequency of a given item with CMS, the values of all cells in the array are first initialized to 0, and then two operations are executed: insert and enquiry.
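2.1.1. Insert. To insert an item $e$, CMS computes the $d$ hash values $h_1(e), h_2(e), \ldots, h_d(e)$ and increases the value of the cell $A_i[h_i(e)]$ in each row $i$ by 1.
2.1.2. Enquiry. We obtain the estimated frequency of item $e$ through the enquiry operation, whose first step is the same as that of the insert operation, i.e., computing the $d$ hash values $h_1(e), h_2(e), \ldots, h_d(e)$. Then, the smallest value among $A_1[h_1(e)], A_2[h_2(e)], \ldots, A_d[h_d(e)]$ is returned as the estimated frequency of item $e$.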
From the description of CMS, it is not difficult to see that the size of the array (i.e., the values of $w$ and $d$) is a key factor in determining the accuracy of the estimated frequency of the target items. In the insert operation, different items can be inserted into the same cell of the array (referred to as a collision) if the array is small relative to the number of distinct items, so the frequency of these items is overestimated and the accuracy decreases. Inspired by this characteristic of CMS, we propose a CMS-based scheme that employs a temporal sketch to guarantee the accuracy of the estimated population distributions even if the array is small. The details of our scheme are introduced in Section 3.
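The insert and enquiry operations described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the class name, the method names, and the way one hash function per row is derived (SHA-256 salted with the row index) are assumptions made only for this example.

```python
import hashlib


class CountMinSketch:
    """Minimal count-min sketch with d rows and w columns."""

    def __init__(self, w, d):
        self.w = w
        self.d = d
        # All cells start at 0, as described in the text.
        self.array = [[0] * w for _ in range(d)]

    def _column(self, row, item):
        # Hypothetical hash family: SHA-256 salted with the row index.
        # The paper only requires independent, uniformly distributed hashes.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def insert(self, item):
        # Increment one cell per row.
        for row in range(self.d):
            self.array[row][self._column(row, item)] += 1

    def enquiry(self, item):
        # The smallest of the d candidate cells is the frequency estimate.
        return min(self.array[row][self._column(row, item)] for row in range(self.d))


sketch = CountMinSketch(w=64, d=4)
for location in ["L1", "L2", "L1", "L3", "L1"]:
    sketch.insert(location)

# The estimate never undercounts; collisions can only inflate it.
print(sketch.enquiry("L1"))
```

Because every insert increments all cells that the item maps to, the estimate is always at least the true frequency, which is why collisions show up only as overestimation.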

Differential Privacy.
In this section, we give some preliminaries on differential privacy, which plays an important role in our scheme for protecting the privacy of mobility datasets. Differential privacy, introduced by Dwork et al. [21], is a framework for quantifying the privacy level of a dataset while releasing useful aggregate information about it.
Consider two neighbouring datasets $D_1, D_2 \in D^n$, where $D^n$ is the set of all possible datasets, and a real-valued query function $q: D^n \to \mathbb{R}$. A randomized query-answering mechanism $K$ for the query function $q$ randomly outputs a number with a probability distribution depending on the query output $q(D)$, where $D$ is the dataset. (ϵ, δ)-differential privacy is defined as follows.
Definition 1. A randomized mechanism $K$ gives (ϵ, δ)-differential privacy if, for all datasets $D_1$ and $D_2$ differing on at most one element and all $S \subset \mathrm{Range}(K)$,
$$\Pr[K(D_1) \in S] \le \exp(\epsilon) \cdot \Pr[K(D_2) \in S] + \delta, \tag{1}$$
where ϵ and δ are the privacy budget and confidence of mechanism $K$, respectively.
Based on Definition 1, the smaller the value of ϵ, the closer the two probabilities $\Pr[K(D_1) \in S]$ and $\Pr[K(D_2) \in S]$ are, and the better the mechanism preserves privacy. The smaller the value of δ, the more strictly the mechanism complies with the definition of differential privacy.
The sensitivity [22] of a real-valued query function is defined as follows.
Definition 2. For a real-valued query function $q: D^n \to \mathbb{R}$, the sensitivity of $q$ is defined as
$$\Delta = \max_{D_1, D_2} |q(D_1) - q(D_2)| \tag{2}$$
for all $D_1$ and $D_2$ differing in at most one element.
In [22], basic techniques that protect datasets while satisfying differential privacy, such as randomized response, the Laplace mechanism, and the exponential mechanism, are introduced. Among them, the Laplace mechanism is a classic technique satisfying (ϵ, 0)-differential privacy, in which the privacy budget is determined by the value of ϵ and the confidence value is 0; that is, the Laplace mechanism complies with the definition of differential privacy strictly. Before introducing the Laplace mechanism, we first give the definition of the Laplace distribution.
Definition 3. The Laplace distribution (centered at 0) with scale $b$ is the distribution with probability density function
$$\mathrm{Lap}(x \mid b) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right). \tag{3}$$
Sometimes, we write Lap(b) to denote the Laplace distribution with scale $b$.
As its name suggests, the Laplace mechanism first computes the query function $q$ and then perturbs each coordinate with noise drawn from the Laplace distribution. The scale of the noise is calibrated to the sensitivity of $q$ divided by ϵ.

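Definition 4. Given any query function $q: D^n \to \mathbb{R}^k$, the Laplace mechanism is defined as
$$K(D) = q(D) + (Y_1, Y_2, \ldots, Y_k), \tag{4}$$
where $Y_i$ are independent random variables drawn from $\mathrm{Lap}(\Delta/\epsilon)$ and $\Delta$ is the sensitivity of function $q$.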
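The Laplace mechanism of Definition 4 can be illustrated with the Python standard library alone. The sketch below is hedged: the function names, the example counts, and the chosen sensitivity and ϵ are hypothetical, and the sampler relies on the known fact that the difference of two independent exponential variables with the same mean follows a Laplace distribution centered at 0.

```python
import random


def sample_laplace(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(query_output, sensitivity, epsilon, rng):
    # Noise scale is the sensitivity divided by epsilon, one draw per coordinate.
    scale = sensitivity / epsilon
    return [value + sample_laplace(scale, rng) for value in query_output]


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
true_counts = [12.0, 0.0, 7.0]  # hypothetical query output q(D)
noisy_counts = laplace_mechanism(true_counts, sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy_counts)
```

A smaller ϵ gives a larger noise scale, which matches the observation after Definition 1 that smaller ϵ means stronger privacy.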
In our scheme, the privacy of the mobility datasets is still at high risk in the data transmission step between the trusted third party and the location-based service providers. Although our scheme transmits the sketches instead of the estimated population distributions in this step to enhance privacy, adversaries who observe enough sketches may still recover some information, such as the specific hash functions associated with each row of the sketches or the population distribution. Thus, our scheme applies the Laplace mechanism to the sketches and transmits these differentially private sketches. The specific parameter settings of the Laplace mechanism are introduced in Section 3.

Privacy-Preserving Mechanisms on Mobility Datasets.
To resist the threats to location privacy, various privacy-preserving mechanisms for mobility datasets have been proposed. In this part, we focus on three of them that are relevant to our scheme.

DP-Simple.
A simple and naive method to protect the privacy of mobility datasets is to aggregate the trajectories and release only the population distributions, hiding the sensitive personal information in the trajectories. To enhance the privacy of this simple method, the concept of differential privacy has been adopted in a mechanism called geo-indistinguishability (which we refer to as DP-simple) [23]. In DP-simple, the Laplace mechanism is first applied to the raw trajectories to ensure differential privacy, i.e., noise drawn from the Laplace distribution is added to each location point in the trajectories. Then, DP-simple aggregates the perturbed trajectories to obtain the population distribution for release. With this mechanism, it is more difficult to recover sensitive individual information than with simple aggregation, and the privacy-preserving level of DP-simple is determined by the parameters of the Laplace mechanism.
In the DP-simple method, the Laplace mechanism is applied to each individual user's trajectory data on the client side before it is transmitted to the third party, and the third party then generates the population distribution from the protected trajectory data. In such a scenario, there are fewer constraints on the third party. The utility of the released population distribution is affected only by the parameter of the Laplace mechanism.

P4Mobi.
Unlike the DP-simple mechanism, which offers theoretical guarantees for protecting the privacy of mobility datasets, P4Mobi [14] preserves privacy by employing a probabilistic data structure called count-min sketch (CMS) [16] in the statistical step of generating population distributions from mobility datasets, and it offers practical guarantees. As described in Section 2.1, the size of the CMS affects the accuracy of the estimated item frequencies: when the CMS is small relative to the number of distinct items, collisions happen, which decreases the final accuracy. Inspired by this, P4Mobi formalizes the relationship between the utility (accuracy) loss of the final estimated population distributions and the size of the CMS as
$$\text{Utility loss} = \left(1 - \left(1 - \frac{1}{w}\right)^{q-1}\right)^{d}, \tag{5}$$
where $d$ and $w$ denote the number of rows and columns in the CMS, respectively, and $q$ refers to the number of locations in the target area. In equation (5), $(1 - 1/w)^{q-1}$ denotes the probability that the other $q - 1$ locations (except the current location) are not mapped to the position of the current location record in one row of the sketch, which means that the current location is mapped to a position with a value of zero (not occupied by other elements) in this row. Therefore, the probability that the current location is mapped to a position with a nonzero value (i.e., a collision occurs) in all $d$ rows is $(1 - (1 - 1/w)^{q-1})^{d}$. Equation (5) expresses the probability of collisions in the CMS in terms of the CMS parameters, which also reflects the utility loss of the CMS. The utility loss in turn reflects the privacy level of the released data; since the value of $q$ is constant for a specific raw mobility dataset, users can tune the values of $w$ and $d$ in equation (5) to satisfy different utility loss (privacy) requirements for the released population distribution.
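To make the dependence on $w$ and $d$ concrete, the collision probability $(1 - (1 - 1/w)^{q-1})^{d}$ stated in the text can be evaluated directly. The function name and the parameter values below are illustrative assumptions and are not taken from [14].

```python
def cms_collision_probability(q, w, d):
    """Probability that a location collides in all d rows of a w-column CMS."""
    no_collision_in_row = (1.0 - 1.0 / w) ** (q - 1)
    return (1.0 - no_collision_in_row) ** d


q = 625  # hypothetical number of locations (e.g. a 25 x 25 grid)
for w, d in [(64, 2), (256, 2), (256, 4), (1024, 4)]:
    print(w, d, cms_collision_probability(q, w, d))
```

Increasing either $w$ or $d$ lowers the probability, which is the tuning knob the text describes: a larger sketch loses less utility but, in P4Mobi, also offers less privacy.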

DPCMS.
Although P4Mobi performs better than traditional mechanisms in resisting the attack model [13], the tradeoff between the utility loss and the privacy level of the final released population distribution is its main concern. As an advanced version of P4Mobi, the authors of [15] introduce a differential privacy-based probabilistic mechanism for releasing mobility datasets (referred to as DPCMS). As in P4Mobi, CMS is the main component of DPCMS and enhances privacy at the statistical level. The main contribution of DPCMS is that it applies the Laplace mechanism to the sketch before the enquiry step of CMS, which not only makes the sketches differentially private but also weakens the strong correlation between utility loss and privacy in P4Mobi.
While the abovementioned mechanisms offer various guarantees for the privacy of mobility datasets, it is difficult for them to guarantee the privacy of data in the transmission stage. In this paper, we show how our scheme protects the privacy of the transmitted data and guarantees a desirable utility (accuracy) of the final released population distributions.

Scheme
As shown in Figure 1, the application scenario of our scheme consists of three main participants: the mobility dataset providers (individual users), the trusted third party, and the location-based service providers. To obtain location-based services, the individual users first send their mobility data to the trusted third party, where our scheme aggregates these mobility data and stores the aggregated population distributions in the form of a protected array (G-array), which is then transmitted to the location-based service providers. Finally, the service providers recover the population distributions from the protected G-array and provide services to the individual users. In this section, we describe each step of our scheme in detail and analyze its privacy- and utility-preserving capacity.

Mobility Dataset.
Generally, the mobility dataset contains the trajectory information of individual users, which records the users' whereabouts (locations) associated with a series of timestamps. To describe our scheme intuitively, we assume that the target area is represented as a grid, and each location corresponds to a cell in the grid. Formally, let $L = \{L_1, L_2, \ldots, L_q\}$ be the universe of $q$ locations in the target area and T = t

Framework of Our Scheme.
We summarize the framework of our scheme in Figure 2. Broadly, there are two components in our scheme, the global sketch and the temporal sketch, on which three operations (initialization, update, and query) are conducted in different orders.
As shown in Figure 2, when a new timestamp $t_i$ arrives, our scheme first calls the initialization operation, which initializes the G-array and the T-array for the global sketch and the temporal sketch, respectively. Then, for each user's location record $l_{k,i}$, $1 \le k \le m$, at timestamp $t_i$, the temporal sketch first updates the T-array with it and then queries the T-array about $l_{k,i}$, returning $Q_t$, which stands for the number of occurrences of $l_{k,i}$ in the T-array, i.e., the number of users at location $l_{k,i}$ stored in the T-array. The global sketch, before updating with the location record $l_{k,i}$, first conducts the query operation on the G-array with $l_{k,i}$ and compares the result $Q_g$ with $Q_t$. If the value of $Q_g$ is smaller than that of $Q_t$, $l_{k,i}$ is updated into the G-array; otherwise, the G-array is left unchanged. In either case, the next user's location record $l_{k+1,i}$ at timestamp $t_i$ becomes the input for both the temporal sketch and the global sketch. When $k \le m$ no longer holds, i.e., all users' location records at timestamp $t_i$ have been processed, our scheme conducts the encryption operation on the G-array in the global sketch (as shown at the top of Figure 2) and transmits the protected G-array to the location-based service providers for further processing. In the following, we describe the four operations of our scheme (initialization, update, query, and encryption) in detail.

Initialization.
At each timestamp $t_i$ of a mobility dataset, our scheme aggregates all users' location records in an array and transmits the array to the location-based service providers, which use it to recover the population distribution of the target area at $t_i$. Therefore, the initialization operation is conducted whenever a new timestamp arrives. In this operation, our scheme presets four parameters, $(w_g, d_g)$ and $(w_t, d_t)$, and then creates the G-array and the T-array for the global sketch and the temporal sketch, respectively. Both the G-array and the T-array are zero-valued arrays; the G-array has $d_g$ rows and $w_g$ columns, and the T-array has $d_t$ rows and $w_t$ columns. In the final step of this operation, the initialization assigns $d_g$ independent hash functions to the G-array, one per row, and likewise assigns $d_t$ independent hash functions to the T-array. Later, in Section 3.3, we analyze how the values of the parameters $(w_g, d_g)$ and $(w_t, d_t)$ affect the utility of our scheme and how to preset them to satisfy different utility requirements.

Update.
To aggregate a location record $l_{k,i}$ into an array, update first computes the $d_g$ (or $d_t$) hash functions $h_j(l_{k,i})$, one for each row $j$, and then increases the value of the cell in the $j$-th row and $h_j(l_{k,i})$-th column of the array by 1. In Figure 3, we summarize the update operation.

Query.
The query operation of our scheme is responsible for returning the frequency of the location $l_{k,i}$ in the array. As in the enquiry operation of CMS, the first step of query is computing the $d_g$ (or $d_t$) hash functions $h_j(l_{k,i})$, which give the column index of the target cell in each row $j$ of the array. Then, the smallest value among the target cells is returned as the result. Formally, for an array $A$ (G-array or T-array) with $d$ rows in our scheme, we have
$$\mathrm{query}(A, l_{k,i}) = \min_{1 \le j \le d} A_j[h_j(l_{k,i})], \tag{6}$$
which returns the frequency of the location record $l_{k,i}$ stored in the array $A$. In the raw mobility dataset, the location record $l_{k,i}$ denotes the position of the user $u_k$ at timestamp $t_i$, which can also be read as one user being at $l_{k,i}$ at $t_i$. Therefore, the frequency of $l_{k,i}$ stored in the array represents the population at $l_{k,i}$, i.e., the query operation of our scheme returns the number of users in the target location.
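The interplay of initialization, update, and query across the two sketches can be sketched as follows. This is an illustrative reading of the workflow in Figure 2, not the authors' code: the function names, the SHA-256-based hash family, and the reuse of the same hash family for both arrays are assumptions made for this example.

```python
import hashlib


def column(row, location, width):
    # Hypothetical hash family: SHA-256 salted with the row index.
    digest = hashlib.sha256(f"{row}:{location}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % width


def initialize(width, depth):
    # Zero-valued array with `depth` rows and `width` columns.
    return [[0] * width for _ in range(depth)]


def update(array, location):
    width = len(array[0])
    for row in range(len(array)):
        array[row][column(row, location, width)] += 1


def query(array, location):
    width = len(array[0])
    return min(array[row][column(row, location, width)] for row in range(len(array)))


def aggregate_timestamp(records, w_g, d_g, w_t, d_t):
    """Aggregate all location records of one timestamp into G-array and T-array."""
    g_array = initialize(w_g, d_g)
    t_array = initialize(w_t, d_t)
    for location in records:
        update(t_array, location)      # the temporal sketch always records
        q_t = query(t_array, location)
        q_g = query(g_array, location)
        if q_g < q_t:                  # update G-array only if it lags behind
            update(g_array, location)
    return g_array, t_array


records = ["c3", "c3", "c7", "c1", "c3"]  # hypothetical cell identifiers
g_array, t_array = aggregate_timestamp(records, w_g=8, d_g=2, w_t=64, d_t=4)
print(query(g_array, "c3"), query(t_array, "c3"))
```

Under this reading, a record is skipped in the G-array when earlier collisions have already inflated its count to the level reported by the larger T-array, which limits how far collisions can accumulate in the small G-array.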

Encryption.
When the last record of the timestamp $t_i$ has been updated, our scheme calls encryption to protect the G-array in the global sketch. In this operation, we employ the Laplace mechanism to preserve the privacy of the G-array and make it satisfy ϵ-differential privacy. According to the required privacy level, we preset the value of the parameter $\lambda = 2/\epsilon$, and then random variables drawn from the Laplace distribution Lap(λ) are added to the G-array. The encryption operation is formalized as
$$\tilde{G}\text{-array} = G\text{-array} + (X_1, X_2, \ldots, X_{d_g \times w_g}), \quad X_n \sim \mathrm{Lap}(\lambda), \tag{7}$$
where $\tilde{G}$-array denotes the protected G-array. In Section 3.3, we analyze how the encryption operation makes the protected G-array satisfy ϵ-differential privacy and how to tune the parameter λ to satisfy different privacy requirements.
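A minimal sketch of the encryption operation, assuming the G-array is held as a list of rows and that independent Lap(λ) noise with λ = 2/ϵ is added to every cell. The function name and the example array are hypothetical; the Laplace sampler uses the difference of two exponential draws from the Python standard library.

```python
import random


def encrypt_g_array(g_array, epsilon, rng):
    lam = 2.0 / epsilon  # scale of the Laplace noise, lambda = 2 / epsilon
    protected = []
    for row in g_array:
        # Difference of two exponentials with mean lam is Laplace(0, lam).
        protected.append(
            [cell + rng.expovariate(1.0 / lam) - rng.expovariate(1.0 / lam) for cell in row]
        )
    return protected


rng = random.Random(7)  # fixed seed only to make the illustration repeatable
g_array = [[3, 0, 1, 0], [0, 3, 0, 1]]  # hypothetical 2 x 4 G-array
protected = encrypt_g_array(g_array, epsilon=1.0, rng=rng)
print(protected)
```

The function returns a new array and leaves the original untouched, so the trusted third party can keep the exact G-array while only the noisy copy is transmitted.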

Utility and Privacy Analysis of Our Scheme.
In this section, we analyze how our mechanism meets the needs of strong utility, sound privacy preservation, and cost-efficient data transmission.

Utility Preservation.
Recall the workflow of our scheme: it first updates the raw mobility dataset into the G-array and then triggers encryption to protect the G-array before it is transmitted to the location-based service providers. An obvious way to improve the efficiency of data transmission between the trusted third party and the service providers is to reduce the amount of transmitted data, i.e., to downsize the G-array. However, the discussion in the last paragraph of Section 2.1 indicates that the size of the G-array determines the accuracy of the final estimated population distributions: the smaller the array, the less accurate the results. In our scheme, to balance the tradeoff between the utility (accuracy) of the final results and the data transmission cost, we introduce the temporal sketch to guarantee the utility (accuracy) of the final population distributions while keeping the G-array relatively small. The temporal sketch works in the trusted third party, where storage space is abundant, so the size of the T-array, i.e., $(w_t, d_t)$, can be set without regard to storage constraints. To avoid collisions in the G-array, before updating the current record $l_{k,i}$, our scheme first queries the G-array about the number of $l_{k,i}$ ($Q_g$) and then compares it with that aggregated by the T-array ($Q_t$). When the value of $Q_g$ is smaller than that of $Q_t$, the probability of collisions in the G-array when updating $l_{k,i}$ can be considered relatively low, which indicates that the utility (accuracy) of the G-array can be guaranteed.
Based on the above description, we quantitatively analyze the utility-preserving ability of our scheme by relating the probability of collisions in the G-array to the preset parameters $(w_t, d_t)$ and $(w_g, d_g)$. For a location record $l_{k,i}$, the precondition for a collision in the G-array is that a collision first happens in the T-array, and the probability of collisions of $l_{k,i}$ in the T-array is calculated as
$$\Pr[\text{collision of } l_{k,i} \text{ in T-array}] = \left(1 - \left(1 - \frac{1}{w_t}\right)^{q-1}\right)^{d_t}, \tag{8}$$
where $q$ is the number of locations in the mobility dataset and $(1 - 1/w_t)^{q-1}$ is the probability that the other $q - 1$ locations (except $l_{k,i}$) in the dataset are not updated to the position of $l_{k,i}$ in one row of the T-array, i.e., $l_{k,i}$ is updated to a position that is not occupied by the others in this row of the T-array. Therefore, the probability that $l_{k,i}$ is updated to a position in one row that has been occupied by others is $1 - (1 - 1/w_t)^{q-1}$. According to the query operation of our scheme, the result of query is affected only when a collision happens in all the rows of the T-array, and thus the probability of collisions of $l_{k,i}$ in the T-array is $(1 - (1 - 1/w_t)^{q-1})^{d_t}$. The update and query operations in the global sketch are the same as those in the temporal sketch, except for the condition that the result of query for $l_{k,i}$ in the global sketch should be smaller than that in the temporal sketch; i.e., collisions happen in the global sketch only when they first happen in the temporal sketch. Therefore, the probability of collisions of $l_{k,i}$ in the G-array is
$$\Pr[\text{collision of } l_{k,i} \text{ in G-array}] = \left(1 - \left(1 - \frac{1}{w_t}\right)^{q-1}\right)^{d_t} \cdot \left(1 - \left(1 - \frac{1}{w_g}\right)^{q-1}\right)^{d_g}. \tag{9}$$
For a given mobility dataset, the value of $q$ is a constant. To improve the data transmission efficiency of our scheme, we can preset the values of $(w_g, d_g)$ and then set the values of $(w_t, d_t)$ according to the utility requirement. Following equation (9), for fixed values of $(w_g, d_g)$, when the value of $w_t$ or $d_t$ is set larger, the probability of collisions in the G-array becomes lower; that is, the utility loss of our scheme is smaller. In summary, for a given data transmission cost requirement, the utility of our scheme depends on the values of $(w_t, d_t)$: the larger their values, the more utility our scheme achieves.
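The trend described above can be checked numerically. The sketch below assumes that the G-array collision probability is the product of the T-array term $(1 - (1 - 1/w_t)^{q-1})^{d_t}$ and the corresponding G-array term, which is one reading of the argument that a G-array collision requires a T-array collision first. The function names and parameter values are illustrative only.

```python
def row_collision(q, w):
    # Probability that a location shares its cell with another one in a single row.
    return 1.0 - (1.0 - 1.0 / w) ** (q - 1)


def t_array_collision(q, w_t, d_t):
    return row_collision(q, w_t) ** d_t


def g_array_collision(q, w_g, d_g, w_t, d_t):
    # Assumption: product of the T-array term and the G-array term.
    return t_array_collision(q, w_t, d_t) * row_collision(q, w_g) ** d_g


q, w_g, d_g = 625, 64, 2  # hypothetical dataset size and fixed G-array size
for w_t in (64, 256, 1024, 4096):
    print(w_t, g_array_collision(q, w_g, d_g, w_t, d_t=4))
```

With the G-array size held fixed, enlarging the T-array drives the printed probability down, which mirrors the conclusion that utility can be improved without increasing the transmitted data volume.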
To enhance the privacy of the global sketch, the Laplace mechanism is used in our scheme. Before transmitting the global sketch, our scheme generates Laplace noise and adds it to the sketch. This step also affects the utility of the released population distributions. To make the utility analysis intuitive, Figure 4 shows the probability of the noise generated by the Laplace mechanism under different values of the parameter λ. When the value of λ is 0.5, the probability of generating a noise value of 0 is almost 99%, and in such a situation the effect on utility is tiny. When the value of λ increases to 2, the probability of value 0 decreases to about 21% and the probability of value 4 increases to nearly 10%; that is, the noise is larger, and the utility of the results decreases considerably. In summary, the value of the parameter λ also affects the utility of the results in our scheme: the larger the value of λ, the greater the utility loss.

Privacy Analysis.
In privacy-preserving mechanisms, privacy preservation and the utility of the final results interact with each other: stronger privacy preservation is always accompanied by a sacrifice of utility. In our scheme, we employ the temporal sketch and the global sketch to aggregate mobility datasets and protect the users' privacy. From equation (9), the probability of collisions in the G-array lies in the range (0, 1); that is, no matter how large the values of $(w_t, d_t)$ and $(w_g, d_g)$ are set, collisions always exist in the G-array, and the aggregated results differ from the real population distributions of the mobility dataset. From the perspective of privacy preservation, our scheme protects the privacy of the users by introducing collisions in the G-array.
However, tuning the parameters $(w_t, d_t)$ of the temporal sketch and $(w_g, d_g)$ of the global sketch is insufficient to preserve both the privacy and the utility of the mobility dataset, because under such circumstances the correlation between the privacy and the utility of the protected mobility dataset is excessively strong. To address this issue, our scheme includes the encryption operation to enhance the privacy preservation of the mobility datasets.
Suppose that in the data transmission stage our scheme sent the G-array directly to the service providers; adversaries could then access the G-array and reconstruct sensitive individual information of the users. Through the encryption operation, variables drawn from the Laplace distribution are added to the G-array, and the protected G-array ($\tilde{G}$-array) is sent to the service provider. Let the adversaries' attack on the G-array be $f(\cdot)$, so that $f(G\text{-array})$ returns the cell in the G-array that satisfies the adversaries' requirements. After the encryption operation, the sensitivity (Definition 2) of the attack function $f(\cdot)$ is $\Delta f = \max |f(G\text{-array}) - f(\tilde{G}\text{-array})|$. After the noise is added, the position of the cell containing the information that the adversaries are interested in may or may not change, and therefore the maximum difference between $f(G\text{-array})$ and $f(\tilde{G}\text{-array})$ is 2, i.e., $\Delta f = 2$. Based on the Laplace mechanism, our scheme chooses variables drawn from the Laplace distribution with scale $\lambda = 2/\epsilon$, i.e., $X \sim \mathrm{Lap}(2/\epsilon)$, to protect the G-array. Here, we compare the probability density functions of the G-array and the $\tilde{G}$-array (denoted as $P_G$ and $P_{\tilde{G}}$ in equation (10), respectively, with $G$ and $\tilde{G}$ abbreviating the two arrays) at some arbitrary point $z \in \mathbb{R}^k$ as
$$\frac{P_G(z)}{P_{\tilde{G}}(z)} = \prod_{n=1}^{k} \frac{\exp\left(-\epsilon |f(G)_n - z_n| / \Delta f\right)}{\exp\left(-\epsilon |f(\tilde{G})_n - z_n| / \Delta f\right)} \le \prod_{n=1}^{k} \exp\left(\frac{\epsilon |f(G)_n - f(\tilde{G})_n|}{\Delta f}\right) = \exp\left(\frac{\epsilon \, \|f(G) - f(\tilde{G})\|_1}{\Delta f}\right) \le \exp(\epsilon). \tag{10}$$
The detailed derivation of equation (10) is provided in [22].
The abovementioned analysis indicates that the encryption operation makes the protected G-array satisfy ϵ-differential privacy. To meet different privacy requirements, the users of our scheme can tune the value of ϵ: the smaller the value of ϵ, the larger the scale 2/ϵ of the added noise, and the more privacy is preserved in the protected G-array.
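The encryption operation described above can be illustrated with a minimal Python sketch. It assumes the G-array is a list of rows of counters and that independent noise is added to every cell; the function names `laplace_noise` and `protect_g_array` are hypothetical and are not taken from the scheme's implementation.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Lap(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution centered at 0 with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def protect_g_array(g_array, epsilon, sensitivity=2.0, rng=None):
    """Return a noisy copy of g_array with Lap(sensitivity / epsilon) noise per cell.

    Assumption: noise is added independently to each cell of the G-array.
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon  # lambda = 2 / epsilon in the text
    return [[cell + laplace_noise(scale, rng) for cell in row] for row in g_array]


# Usage: a smaller epsilon gives a larger noise scale and therefore more perturbation.
g_array = [[3, 0, 5], [1, 4, 3]]
protected = protect_g_array(g_array, epsilon=1.0, rng=random.Random(7))
```

A smaller ϵ yields a larger scale 2/ϵ, so the protected array deviates more from the raw G-array; this is the privacy/utility knob discussed above.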

Evaluation
In this section, we empirically evaluate our scheme on a real-world mobility dataset and a synthetic dataset and compare its performance with three other privacy-preserving methods for mobility datasets, DP-simple, CMS, and DP-CMS, which have been introduced in Section 3.1. The experiments were performed on a MacBook Pro with a 2.70 GHz Intel Core i5 processor and 8.00 GB of RAM. The Python programming language was used to implement our proposed scheme. In our experiments, we use a GPS trajectory dataset collected in the GeoLife project [24][25][26] over a period of more than four years (from April 2007 to August 2012) with a sampling interval of approximately 5 s. In this dataset, a raw trajectory record contains a sequence of time-stamped points, each of which is associated with latitude, longitude, and altitude. Before running our scheme, we first preprocess the raw mobility dataset. By observing the distribution of the raw dataset, we find that the trajectories collected within the area of longitude 116.25° to 116.50° and latitude 39.85° to 40.10° are denser; therefore, we select the trajectory data in this area as the research target. For formalization purposes, we split the target area into a 25 × 25 grid and set the time granularity to one minute. In the preprocessed dataset, the total number of users is 354, the number of location cells is 625, and the number of time slots is 1440. The population distribution in the GeoLife dataset is sparse; therefore, we generate a synthetic trajectory dataset as a supplement to the evaluation. In the synthetic dataset, the total number of users is 5000, the number of location cells is 400, and the number of time slots is 100.
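The discretization step described above can be sketched as follows. The bounding box, the 25 × 25 grid, and the one-minute granularity come from the text; the row-major cell numbering, the treatment of out-of-area points, and the helper names `location_cell` and `time_slot` are assumptions made for illustration only.

```python
LON_MIN, LON_MAX = 116.25, 116.50
LAT_MIN, LAT_MAX = 39.85, 40.10
GRID = 25  # 25 x 25 grid, i.e. 625 location cells


def location_cell(lat, lon):
    """Map a GPS point to a cell index in [0, 624], or None if outside the area.

    Assumption: cells are numbered row by row (row = latitude band).
    """
    if not (LAT_MIN <= lat < LAT_MAX and LON_MIN <= lon < LON_MAX):
        return None
    row = min(int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * GRID), GRID - 1)
    col = min(int((lon - LON_MIN) / (LON_MAX - LON_MIN) * GRID), GRID - 1)
    return row * GRID + col


def time_slot(hour, minute):
    """Map a time of day to one of the 1440 one-minute slots."""
    return hour * 60 + minute
```

With these helpers, each raw GPS point reduces to a (cell, slot) pair, which is the form the aggregation operates on.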

Evaluation Criteria.
Inspired by [27], we use the error between the real and published population distributions to evaluate the practical utility of the scheme. In our scheme, we assume that the population distribution transmitted to the location-based service providers at the timestamp $t_i$ is $\widehat{D}_i$, and the real population distribution at this timestamp is $D_i$; then, the error between them is defined as where $q$ is the number of locations in the area and $d_j$ and $\widehat{d}_j$ are the number of users at the $j$th location at timestamp $t_i$ in the real and published population distribution datasets, respectively. Then, the error of the scheme is defined as where $n$ is the number of total time slots in the mobility dataset.
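A minimal sketch of one common way to compute such a metric is given below. It is an assumption rather than the paper's exact definition: the per-timestamp error is taken as the mean over the $q$ locations of $|d_j - \widehat{d}_j| / \max(d_j, 1)$, and the scheme's error as the mean over the $n$ time slots. The floor of 1 in the denominator (to avoid division by zero for empty cells) and the function names are hypothetical.

```python
def relative_error(real, published, floor=1.0):
    """Mean relative error over the q locations of one timestamp.

    Assumption: the denominator is max(d_j, floor) so that empty cells
    do not cause a division by zero.
    """
    q = len(real)
    return sum(abs(d - p) / max(d, floor) for d, p in zip(real, published)) / q


def average_relative_error(real_series, published_series):
    """Mean of the per-timestamp errors over all n time slots."""
    n = len(real_series)
    return sum(
        relative_error(r, p) for r, p in zip(real_series, published_series)
    ) / n


# Two timestamps, two locations each.
real = [[2, 0], [4, 4]]
published = [[1, 1], [4, 2]]
are = average_relative_error(real, published)
```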

Experimental Results.
In this part, we present the performance of our scheme in preserving the utility of the final released population distributions with different parameter settings: the size of the G-array, the size of the T-array, and the value of ϵ in the encryption operation.

Effect of the G-Array's Size.
The size of the G-array is determined by the value of $(w_g, d_g)$. To measure the impact of the size of the G-array on the utility of our scheme, we conduct two experiments that vary $w_g$ and $d_g$, respectively. In the first experiment, for the GeoLife dataset, we set the depth of the G-array to 6, i.e., $d_g = 6$. The size of the T-array is $w_t = 40$ and $d_t = 30$. The value of $w_g$ varies from 3 to 21. For comparison purposes, we also conduct experiments with the DP-CMS, CMS, and DP-simple methods. For the DP-CMS and CMS methods, the size of the sketches is set to be the same as in our scheme. For DP-CMS, DP-simple, and our scheme, we set the parameter of the Laplace mechanism to 1.
The results are shown in Figure 5(a). We can observe in Figure 5(a) that the average relative error (ARE) decreases as the value of $w_g$ increases, which indicates that more utility of the population distributions released by our scheme is preserved when the width of the G-array is large. The performance of the DP-simple method is independent of $w_g$, and its average relative error in this experiment is about 1. When the value of $w_g$ is smaller than 9, the ARE of our scheme is larger than that of DP-simple, because in our scheme, in addition to the encryption operation, the small size of the G-array causes more collisions and decreases the utility of the final result. When the width of the G-array increases to 9, the utility-preserving ability of our scheme surpasses that of DP-simple, because in our scheme the Laplace mechanism is applied to the G-array instead of to each location as in DP-simple, so the utility of the released population distributions is better. The trend of the CMS method's utility-preserving performance is similar to that of the DP-CMS method, except that when the value of $w_g$ is larger than 12, the average relative error of CMS is smaller than that of our scheme; i.e., the CMS method achieves better utility preservation than our scheme. When the size of the G-array becomes larger, the probability of collisions in both our scheme and the CMS method becomes smaller, and the utility of the CMS method improves. In our scheme, however, the encryption operation employs the Laplace mechanism to protect privacy, so the utility-preserving ability of our scheme becomes weaker than that of CMS, which also indicates that our scheme achieves better privacy preservation than CMS with the same sketch size. For the synthetic dataset, similar results are shown in Figure 5(b).
Besides the value of $w_g$, we also conduct experiments to verify the effect of the depth of the G-array. In this experiment, for the GeoLife dataset, we fix the width of the G-array at $w_g = 6$, and $d_g$ varies from 3 to 30. The results of this experiment are shown in Figure 6; the utility of our scheme improves as the value of $d_g$ increases. The comparison with the other three methods is similar to that in Figure 5. In summary, Figures 5 and 6 show that the utility of our scheme is affected by the size of the G-array: the larger the size, the better the utility preservation. Compared to the other three methods, our scheme can achieve better utility preservation with a suitable size of the G-array.

Effect of the T-Array's Size.
In our scheme, the T-array is an important component responsible for preserving the utility (accuracy) of the released population distribution. In this part, we show the experimental results related to the effect of the T-array's size on the utility of our scheme. Similar to the experiments conducted in Section 4.3.1, we conducted two experiments to study the effect of the T-array's width and depth, and the results are presented in Figures 7 and 8, respectively.
In Figure 7, for the GeoLife dataset, we set the G-array's size in our scheme as $w_g = 6$ and $d_g = 15$. The sketch size of the CMS and DP-CMS methods is the same as that of the G-array. The parameters of the Laplace mechanism of our scheme, DP-simple, and DP-CMS are all set to 1 in this experiment. The value of $w_t$ is in the range of 5 to 40. The T-array is a component unique to our scheme, so in this experiment the average relative error of the DP-CMS, CMS, and DP-simple methods is stable. The overall trend in Figure 7 shows that the utility of our scheme is affected by the value of $w_t$: when the width of the T-array increases, the utility of our scheme also increases. For the GeoLife dataset, when the value of $w_t$ is smaller than 20, our scheme preserves more utility than CMS and DP-CMS but less than DP-simple. Compared to CMS and DP-CMS, even though the sizes of the sketches that store the population distributions are the same, the existence of the T-array in our scheme guarantees the utility of the released population distributions. Compared with the DP-simple method, when the T-array is small, the probability of collisions in our scheme is relatively high, so its utility is lower. When the value of $w_t$ is larger than 20, our scheme achieves the best utility preservation among the DP-CMS, CMS, and DP-simple methods.
Figure 8 shows the average relative error of the released population distributions with different depths of the T-array in our scheme. For the GeoLife dataset, we set the width of the T-array as $w_t = 20$, the depth of the T-array varies from 5 to 30, and the size of the G-array, CMS, and DP-CMS is set as $w_g = 6$, $d_g = 15$. For DP-simple, DP-CMS, and our scheme, the parameter of the Laplace mechanism is set to 1. The results shown in Figure 8 are similar to those presented in Figure 7, which indicates that the utility of our scheme improves when the depth of the T-array increases, and when the value of $w_t$ is larger than 20, the average relative error of our scheme is the smallest among these four methods; thus, the utility preservation of our scheme is the best in this situation.

Effect of ϵ in the Encryption Operation.
In Sections 4.3.1 and 4.3.2, we have demonstrated that the size of the G-array and the T-array in our scheme affects the utility of the released population distributions, and that with appropriate settings of $(w_g, d_g)$ and $(w_t, d_t)$, our scheme can achieve better utility preservation than the CMS, DP-CMS, and DP-simple methods. In this section, we study the effect of ϵ in our scheme on the final released population distributions.
As differential privacy is not involved in the CMS method, this experiment only compares our scheme, the DP-CMS method, and the DP-simple method. For our scheme, the size of the T-array is set as $w_t = 20$ and $d_t = 12$, and the size of the G-array is $w_g = 6$ and $d_g = 15$, which is the same as the sketch size of the DP-CMS method. For the GeoLife dataset, we vary the value of ϵ from 0.9 to 10, and the results are shown in Figure 9(a). For the synthetic dataset, as the number of users is larger than in the GeoLife dataset, we vary the value of ϵ from 0.1 to 1, which increases the added noise, and the results are shown in Figure 9(b). The utility-preserving ability of all three methods is correlated with the value of ϵ: when the value of ϵ increases, the utility of all three methods increases. In particular, the average relative error of DP-CMS is the largest throughout, i.e., the utility-preserving ability of DP-CMS is weaker than that of DP-simple and our scheme in this experiment. The reason is that in DP-CMS, both the collisions that happen in the update procedure and the added noise drawn from the Laplace distribution reduce the utility of the final results. For our scheme and the DP-simple method, when the value of ϵ is larger than 1.0, the average relative error of DP-simple is smaller than ours, but when ϵ is decreased to smaller than 2.0, the average relative error of our scheme becomes smaller. In other words, as ϵ becomes smaller and the privacy level of the results becomes stronger, the utility-preserving ability of our scheme becomes better than that of the DP-simple and DP-CMS methods.

The Comparison of Data Transmission Cost.
Besides its performance in preserving the utility of the released population distributions, another highlight of our scheme is its high efficiency in the data transmission stage. For CMS, DP-CMS, and our scheme, the array that stores the frequency of users is transmitted to the service providers, while for the DP-simple method, the volume of data transmitted to the service providers is $4 \times q$ bytes, where $q$ is the number of locations in the raw mobility dataset and each value occupies 4 bytes in the transmission stage. In this experiment, for the GeoLife dataset and the synthetic dataset, the transmitted data volume of the DP-simple method is 2500 bytes and 5000 bytes, respectively, which is shown as the solid line in Figure 10(a) and Figure 10(b).
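The volume comparison can be made concrete with a small sketch. The $4 \times q$ rule for DP-simple is taken from the text; treating each sketch cell as a 4-byte value as well, and the helper names `dp_simple_volume` and `sketch_volume`, are assumptions for illustration.

```python
BYTES_PER_VALUE = 4  # each transmitted value occupies 4 bytes


def dp_simple_volume(q):
    """Bytes sent by DP-simple: one value per location."""
    return BYTES_PER_VALUE * q


def sketch_volume(width, depth):
    """Bytes sent when a width x depth sketch array is transmitted.

    Assumption: every sketch cell is also a 4-byte value.
    """
    return BYTES_PER_VALUE * width * depth


# GeoLife setting from the text: q = 625 locations, G-array of w_g = 6, d_g = 15.
geolife_dp_simple = dp_simple_volume(625)   # 2500 bytes
geolife_g_array = sketch_volume(6, 15)      # 360 bytes under the assumption above
```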
For CMS, DP-CMS, and our scheme, we combine the results demonstrated in Figures 5-9 and show the correlation between the average relative error and the transmitted data volume in Figure 10. Reading Figure 10 horizontally, when the average relative error becomes larger, the transmitted data volume decreases; that is, when the volume of transmitted data is reduced, the utility of the final released population distributions also decreases. Reading Figure 10 vertically, we see that at the same level of utility preservation, the transmitted data volume of our scheme is much smaller than that of the CMS, DP-CMS, and DP-simple methods.

Conclusion
In this paper, we propose a scheme that protects the privacy of mobility datasets by employing a temporal sketch, a global sketch, and a Laplace mechanism. The scheme updates the users' trajectory data into a G-array and sends it to the location-based service providers, who obtain the population distributions by querying the G-array. Compared to other CMS-based methods, our scheme can balance the tradeoff between the utility and the privacy preservation of the released population distributions. Beyond this, the inclusion of the T-array in our scheme sharply decreases the data transmission cost between the trusted third party and the location-based service providers. The evaluation results show that, compared with the DP-CMS, CMS, and DP-simple methods, our scheme saves approximately 80%, 75%, and 96% of the data transmission cost, respectively, when the average relative error (utility loss) is about 1.8.
Nevertheless, in the research area of preserving the privacy of mobility datasets, another interesting research direction is improving the efficiency of differential privacy. Location data collected by GPS may already contain sensing noise [28], and for such location data, it is pointless to add more noise. Therefore, research on distinguishing location data with sensing noise would be of great significance for improving the efficiency of differential privacy. There are still other factors affecting the privacy- or utility-preserving capacity. Building on our work, optimizing the spatial [29] and temporal granularity [30] in the data collection stage to protect the privacy of mobility datasets would be another direction of future research.

Definition 3.
The Laplace distribution (centered at 0) with scale $b$ is the distribution with probability density function $\mathrm{Lap}(x \mid b) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right)$.
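As a quick numerical illustration of this density, the sketch below evaluates the standard zero-centered Laplace probability density function for two scales; the function name `laplace_pdf` is introduced here for illustration only.

```python
import math


def laplace_pdf(x, b):
    """Density of the Laplace distribution centered at 0 with scale b."""
    return math.exp(-abs(x) / b) / (2.0 * b)


# A larger scale flattens the density: the peak at 0 drops from 0.5 to 0.25.
peak_scale_1 = laplace_pdf(0.0, 1.0)
peak_scale_2 = laplace_pdf(0.0, 2.0)
```

A flatter density means that large noise values are more likely, which is why a larger scale corresponds to stronger perturbation.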

Figure 4: Probability of the noise generated by the Laplace mechanism under different values of λ.

Figure 5: The average relative error of the released population distributions with different values of $w_g$. (a) GeoLife dataset. (b) Synthetic dataset.

Figure 7: The average relative error of the released population distributions with different values of $w_t$. (a) GeoLife dataset. (b) Synthetic dataset.

Figure 9: The average relative error of the released population distributions with different values of ϵ in the encryption operation. (a) GeoLife dataset. (b) Synthetic dataset.
2.1.1. Insert. To insert an item $e$ into a CMS, i.e., to increment the stored frequency of item $e$ by 1, the CMS first computes $d$ hash functions $h_1(e), h_2(e), \ldots, h_d(e)$ and then increases the values of the cells $A_1[h_1(e)], A_2[h_2(e)], \ldots, A_d[h_d(e)]$ by 1, respectively. The abovementioned process can be formulated as $A_i[h_i(e)] \leftarrow A_i[h_i(e)] + 1$ for $1 \le i \le d$.
Let $t_1, t_2, \ldots, t_n$ denote $n$ timestamps. For $m$ users $U = \{u_1, u_2, \ldots, u_m\}$, we present their trajectories around $n$ timestamps as $Tr = \{tr_1, tr_2, \ldots, tr_m\}$, where $tr_i = \{l_{i1}, l_{i2}, \ldots, l_{in}\}$ $(1 \le i \le m)$ and $l_{ij} \in L$ $(1 \le j \le n)$. As mentioned earlier, our scheme aggregates the mobility datasets and releases the population distributions around $n$ timestamps to support the location-based service providers, which are represented as $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i = \{p_{1,i}, p_{2,i}, \ldots, p_{q,i}\}$ and $p_{k,i}$ denotes the number of users in location $L_k$ at timestamp $t_i$.
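The CMS insert operation of Section 2.1.1 can be sketched in Python as follows. This is a minimal illustration, not the scheme's implementation: the class name, the use of salted SHA-256 digests as the $d$ hash functions, and the minimum-based point query (the usual CMS estimate) are assumptions.

```python
import hashlib


class CountMinSketch:
    """A d x w array of counters with one hash function per row."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Assumption: row-salted SHA-256 stands in for the hash function h_row.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def insert(self, item, count=1):
        # A_i[h_i(e)] <- A_i[h_i(e)] + 1 for every row i.
        for i in range(self.depth):
            self.rows[i][self._index(i, item)] += count

    def query(self, item):
        # Collisions only inflate counters, so the minimum never underestimates.
        return min(self.rows[i][self._index(i, item)] for i in range(self.depth))


# Usage: count how many users are observed in each location cell.
cms = CountMinSketch(width=16, depth=4)
for cell in ["cell_3", "cell_3", "cell_7", "cell_3"]:
    cms.insert(cell)
estimate = cms.query("cell_3")  # at least 3; larger only if collisions occur
```

Because every insert increments exactly one counter per row, each row always sums to the total number of inserted items, and a narrower sketch simply packs those items into fewer cells, which is the source of the collisions discussed in the evaluation.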