LPPS : A Distributed Cache Pushing Based K-Anonymity Location Privacy Preserving Scheme

Recent years have witnessed the rapid growth of location-based services (LBSs) for mobile social network applications. To enable location-based services, mobile users are required to report their location information to the LBS servers and receive answers to location-based queries. Location privacy leaks happen when such servers are compromised, which has been a primary concern for information security. To address this issue, we propose the Location Privacy Preservation Scheme (LPPS) based on distributed cache pushing. Unlike existing solutions, LPPS deploys distributed cache proxies to cover users' most visited locations and proactively pushes cache content to mobile users, which can reduce the risk of leaking users' location information. The proposed LPPS includes three major processes. First, we propose an algorithm to find the optimal deployment of proxies to cover popular locations. Second, we present cache strategies for location-based queries based on the Markov chain model and propose update and replacement strategies for cache content maintenance. Third, we introduce a privacy protection scheme which is proved to achieve a $k$-anonymity guarantee for location-based services. Extensive experiments illustrate that the proposed LPPS achieves decent service coverage ratio and cache hit ratio with lower communication overhead compared to existing solutions.


Introduction
Mobile social network applications have been booming in recent years. With the rapid development of localization technologies (such as GPS) and the mobile Internet, mobile social network applications with embedded location-based services (LBS), such as Foursquare and Twitter, have become very popular. As a result, people use LBS more and more often. With the help of LBS, these applications help users connect with each other better. Typical applications include discovering popular restaurants in the local area [1], traffic navigation [2], recommendation of friends nearby, and location-based advertisement (such as "Promoted Tweets" [3]). According to a recent report, LBS is envisioned to become an over 10-billion-per-year business by the year 2016 [4].
As every coin has two sides, LBS for mobile social applications may leak users' location trajectories, which raises concern about privacy protection in location-based services. To obtain LBS, a user needs to submit a query and a location to the server and fetch the desired answer. Leaked location information increases the risk that an adversary tracks the user's daily life, that the user receives unwanted customized advertisements, or even that private activities such as visiting a bank or going to a hospital are revealed [5]. It is therefore important to protect users' location privacy in LBS.
Many efforts have been made to protect users' location privacy. The $k$-anonymity model was proposed in [6]; it states that when $k$-anonymity is satisfied, each individual is indistinguishable from $k - 1$ other individuals. A user can achieve $k$-anonymity by sending $k$ queries with different locations to the server and choosing the desired answer from the responses. However, this method wastes network bandwidth and causes extra overhead on both the client and server sides. A few works introduced trustworthy middleware or cache proxies that use random noise to conceal the user's real ID and locations [7-13]. However, once the middleware or the proxies are compromised, location privacy is no longer guaranteed. In short, most existing works adopt a pull-based strategy in which the user submits the location-based query to the LBS server or a third-party server. The disadvantage of the pull-based strategy is that it is difficult to guard against the compromise of LBS servers or proxies at different levels.
To deal with this problem, we propose location privacy preservation by using cache pushing. The basic idea is to apply distributed cache proxies to store the most popular location-related data and to push the data to the users proactively. If the desired data is available from the cache, the user does not need to send out the location-based query; thus his privacy is preserved.
We use Figure 1 to illustrate the motivation of our work. Assume a user is walking on a road and wants to request information about location $A$ ($A$ could be a bank or a hospital, which is the sensitive information to be preserved). In the traditional pull-based strategy, the user sends the query to the LBS server. Once the LBS server receives the query, it knows that the user is heading for $A$. In our push-based strategy, the nearby popular location-related data about locations $A$, $B$, and $C$ could have been stored in the cache proxy beforehand (e.g., such locations have been requested by other users previously and have been stored in the cache proxy), and they are pushed to the user when he passes by the wireless access point. As a result, the user obtains the desired data without contacting the LBS server or reporting his location to the proxy. Since the LBS server is not aware of the query and the proxy only knows that the user may go to $A$, $B$, or $C$, neither of them can determine the user's real destination.
We propose the Location Privacy Preservation Scheme (LPPS) to achieve location anonymity. LPPS needs to answer three key questions: where to deploy cache proxies, how to organize and maintain cache content in proxies, and how to achieve $k$-anonymity assurance. To address these issues, we first introduce a greedy deployment algorithm that calculates the most frequently visited locations of mobile users (known as stay points) and deploys proxies to cover as many such regions as possible. Then we introduce a cache pushing strategy that uses a group index to record the popular cache items and pushes the requested content to users in batches. Cache replacement and updating strategies are proposed to mine truly popular places of interest from a large number of fake feedbacks when cache misses occur. Thereafter, we propose a location privacy protection scheme to achieve $k$-anonymity assurance. To the best of our knowledge, this is the first work to apply a push-based distributed caching architecture to privacy protection in location-based services.
Compared to the pull-based strategy, our push-based strategy has several advantages. First, it is easy to implement and resilient to the compromise of servers. Second, both the LBS server and the proxies have only partial information about the mobile users (a user contacts the LBS server only when the requested data is not found in the proxy), which avoids continuous location tracking. Third, the deployment of distributed cache proxies facilitates data queries and reduces the workload and network traffic of the LBS server.
Our main contributions are summarized as follows:
(i) A Location Privacy Preservation Scheme (LPPS). The proposed scheme introduces a distributed cache layer to store the most popular location-related data and push the data to mobile users, which reduces the communication overhead to the servers and enhances the location privacy of the users.
(ii) A Strategy for Cache Proxy Deployment. We present a strategy to reveal the users' frequently visited locations and deploy cache proxies in such areas to improve the utility of cached data.
(iii) A Strategy for Organizing Cache Contents Based on the Markov Chain Model. We present a strategy to find the most popular location-related contents based on a Markov chain model constructed from historical statistics and propose strategies to deploy popular cache contents on the proxies.
(iv) Distributed Cache Maintenance Strategies. The proposed cache pushing strategy divides cache content into groups and broadcasts the cache items in batches. To maintain the cache contents, a cache updating strategy based on the balls-into-bins theorem is applied to identify truly popular location-related data.
(v) $k$-Anonymity Location Privacy Preservation. We prove that the proposed privacy preservation scheme based on cache pushing can achieve $k$-anonymity assurance, which is resilient to the compromise of the LBS server or proxies.
(vi) Trace-Driven Simulations. We evaluate the system performance using a real mobile trace collected in Beijing and a cab trace collected in San Francisco, which shows the efficiency of LPPS.
The rest of the paper is organized as follows. Section 2 reviews the related work on system architecture and models for privacy preservation in LBS. Section 3 presents details of the Location Privacy Preservation Scheme (LPPS) that provides $k$-anonymity location privacy. Section 4 conducts experiments to evaluate the efficiency of the proposed LPPS strategy using both real and synthetic datasets. Section 5 concludes this paper.

Related Work
In this section, we review the related work on system architecture and models for privacy preservation in LBS and distributed architectures and pull-push algorithms used in other domains.

Privacy Preservation Architectures.
There are three commonly used architectures of privacy preservation for LBS, namely, the noncooperative architecture [14], the centralized architecture [11,13,15], and the peer-to-peer architecture [16,17]. In the first one, users rely only on their own knowledge to protect their location privacy. In the second one, a trusted third-party component is added to provide location privacy protection for users. In the third one, users preserve their privacy cooperatively.
In the noncooperative architecture, mobile clients are assumed to have powerful computation capability and can handle their own privacy preservation requirements by generating fake locations or pseudo-usernames independently [14]. The advantage of this structure is that it is simple and easy to deploy. However, due to the lack of global information, its privacy preservation ability is weak.
In the centralized architecture, a third-party component is responsible for location anonymization in cooperation with the user and the LBS server [11,13,15]. This trusted middleware improves the capacity for privacy protection. The disadvantage is that it may become the bottleneck of the whole system and is an easy target for attacks.
The peer-to-peer architecture [16,17] assumes that users are cooperative and trust each other. Each mobile user has enough capacity for computing and storage. However, it cannot avoid attacks from malicious users.

Privacy Preservation Models.
The most commonly used model to evaluate the privacy protection level is $k$-anonymity, introduced by Sweeney in [6]. In this model, attackers cannot distinguish one user from a group of $k$ users. Privacy preservation models can be categorized into three kinds as follows.
In the first kind [8,18], users send fake positions to the server. The fake positions have to be around the real ones in order to get the information or service needed by the users. The distance between them is related to the degree of privacy protection and the quality of service: a larger distance yields a higher protection degree but poorer service quality. Li et al. [18] introduce a model in which users send several fake positions to the server and request points of interest surrounding these fake positions, to conceal the real point of interest the user wants.
The second kind [9,10,19] relies on pseudonyms. Hiding the identities of users through pseudonyms was presented in [19], where the privacy protection degree depends on the strength of the relationship between users' identities and particular locations. To achieve better anonymity, users need to update their pseudonyms frequently [9]. Beresford and Stajano [10] introduced a model based on dynamic pseudonyms, in which spaces are separated into "Mix Zones" and "Application Zones." Every time users move into "Mix Zones," the pseudonyms are changed. Reference [20] proposes an incentive mechanism to encourage users to cooperate with each other in this kind of "Mix Zones." Liu et al. investigated the optimal multiple Mix Zones placement problem for location privacy protection [21]. Gong et al. [22] take social ties into consideration and motivate users to participate in a socially-aware pseudonym change game. However, attackers may find patterns in the changes of pseudonyms. For this issue, a method involving path confusion was developed in [23]. However, the service delay may be large in this method. Taking prediction into consideration, CacheCloak [7], which is also a kind of path confusion, uses predictions to protect privacy and decrease the service delay. However, it needs a third-party server and consumes a lot of computing resources and storage space. Zhu and Cao proposed APPLAUS, in which colocated Bluetooth-enabled mobile devices mutually generate location proofs and upload them to a location proof server [24]. Periodically changed pseudonyms are used by the mobile devices to protect source location privacy from each other and from the untrusted location proof server. However, it needs a trusted third-party Certification Authority server and consumes a lot of computing resources.
In the third kind [11-13], users send a region instead of their accurate position to reduce the risk of location privacy leakage. Such a service usually needs to be delayed for a period of time in order to collect requests from more than one user in the same region and send them to the LBS at once. An attacker can only learn that the users are in a particular region in a particular period of time and thus cannot track a user continuously. The server has to retrieve query results for the user's whole cloaking region, which protects the privacy of users. However, the server workload increases dramatically, as does the amount of download traffic. Further, not only may the service be delayed, but the performance of LBS may also suffer because of inaccurate location information. Galdames and Cai proposed processing queries as a batch instead of one by one independently to reduce the workload of servers [25]. Wang et al. proposed L2P2 to find the smallest cloaking area for each location request so that diverse privacy requirements over spatial and/or temporal dimensions are satisfied for each user [26]. Vu et al. proposed a mechanism based on locality-sensitive hashing (LSH) to partition user locations into groups each containing at least $k$ users (called spatial cloaks) [27]. The mechanism is shown to preserve both locality and $k$-anonymity.
Other works [28,29] rely on encryption. The server encrypts the users' location and query separately and stores them in different databases, which can prevent users' information from leaking. Xue et al. designed a novel index structure to provide fast search for users when they check in at the same venue frequently and outsourced the heavy cryptographic computations to the server to reduce the computational overhead for mobile clients [28]. Li and Jung designed a Location Query Protocol (PLQP), which allows different levels of location query on encrypted location information for different users and is efficient enough to be applied on mobile platforms. However, these kinds of encryption consume a lot of computing, communication, and storage resources [29]. Wei et al. separate user identities and anonymized location updates onto two entities. Users' location privacy is protected if either entity is compromised by the adversary [30]. However, if both entities are compromised, the scheme cannot prevent users' information from leaking.

Distributed Architectures and Pull-Push Algorithms.
Distributed architectures and pull-push algorithms have been used for many general applications such as social networks, p2p networks, and location-based advertising. Doerr et al. analyzed why rumors spread fast in social networks; one of the reasons is that nodes of small degree use push-pull models to build a shortcut between nodes of large degree (hubs): a node of small degree quickly pulls the rumor from a node of large degree and quickly pushes it to another node of large degree [31]. Zhang et al. presented an unstructured peer-to-peer network called GridMedia for live media streaming, which employs a push-pull approach that greatly reduces latency and inherits most of the good features (such as simplicity and robustness) of the pull method [32]. Unni and Harmon studied the effectiveness of push versus pull mobile location-based advertising (LBA); the results indicate that privacy concerns are high and that the perceived benefits and value of LBA are low [33].
In summary, most existing architectures and models for privacy preservation in LBS are pull-based, which makes it hard to guard against the compromise of servers. Different from these works, we propose a push-based strategy that reduces the risk arising from server compromise while preserving $k$-anonymity location privacy. To the best of our knowledge, this is the first work on a proactive push-based privacy protection scheme for location-based services.

LPPS: A Location Privacy Preservation Scheme Based on Distributed Cache Pushing
In this section, we present a Location Privacy Preservation Scheme (LPPS) that provides $k$-anonymity location privacy by using distributed cache pushing. The basic idea of LPPS is to deploy a distributed cache layer that stores the most popular location-related data and pushes the data to mobile users in range. By listening to the broadcast from the cache layer, mobile users are able to obtain the desired data without querying the LBS server, which removes the chance of leaking their location information.
The system architecture is illustrated in Figure 2. There are three major components in the system: the LBS server, the distributed cache layer, and the mobile users. The distributed cache layer could be physically deployed on wireless access points or could be a set of proxy servers connected to the Internet. Throughout this paper, we refer to the entities in the cache layer as cache proxies. Cache proxies are distributed and deployed in different regions, and they can work independently without knowing of each other's existence. Each cache proxy only stores location-related data. When a mobile user is within communication range, the cache proxy pushes the cached data to the user. If the user's desired data is found in the cache, we call it a cache hit; otherwise we call it a cache miss. When a cache miss occurs, the user sends $k - 1$ random queries together with the real one to the LBS server. When the LBS server sends back the responses, the user generates a feedback index containing the same $k - 1$ random locations/keywords together with the real one and sends it to the cache proxy to inform the proxy to update its cache. The proxy later uses the feedback index to request cache data from the LBS server and obtains the feedback content, with which it updates its cache content.
The threat model is defined as follows: we assume that both the LBS server and the distributed cache layer may be compromised and that mobile users trust themselves. The proposed scheme achieves the following goal: even if the LBS server or the distributed cache layer is compromised, it cannot obtain enough information to track the entire trajectory of a mobile user.
There are three key questions in implementing LPPS: where to deploy the cache proxies; how to organize, maintain, and distribute cache content in proxies; and how to guarantee $k$-anonymity. In the following sections, we introduce the strategies for cache proxy deployment, cache maintenance, and cache pushing for privacy preservation. Figure 3 shows the three key questions and brief answers.

Where to Deploy Cache Proxies.
In this subsection we answer the question "where to deploy cache proxies." In order to maximize the usage of cached content, cache proxies should be deployed near the places that are frequently visited by users. To find such "hot spots," we need the historical movement trajectories of users. We do not care who the users are, so users can be anonymous. Many such trajectory datasets exist: they contain the movement trajectories of users, but a sequence of trajectories cannot be linked to a particular person because the users are anonymous. For example, when a government does city planning, it can refer to datasets such as taxi and bus trajectories, so it is possible to obtain anonymous trajectory data of a city. Such mobility trajectory datasets are openly available [2,34].
Physically deploying proxies on wireless access points may be costly. Although power, machines, and rented space for deploying proxies are not cheap, there are benefits and motivation to deploy distributed cache proxies based on the existing public infrastructure to provide LBS for people passing by, especially for today's O2O commercial business. In New York, the de Blasio administration's plan to convert aging pay phones to Wi-Fi hot spots won unanimous approval from a review committee, clearing the way for the installation of thousands of Wi-Fi sites across New York City over the next decade [35]. Our work could be helpful for this kind of city plan. Duan et al. [36,37] analyze different mechanisms to motivate the collaboration of smartphone users on both data acquisition and distributed computing; they propose a reward-based collaboration scheme for data acquisition applications and use contract theory for distributed computing applications. Our work could be practically rolled out using a reward-based collaboration scheme that encourages users to collaborate in our system to reduce the cost. Besides, this system can be practically rolled out with a VANET (Vehicular Ad hoc Network); our cache proxies could be roadside units in a VANET [38], which can not only reduce the cost but also provide more kinds of services.
The historical trajectory of user  is represented by With the set of stay points obtained, we are able to devise the strategy of cache proxy deployment.The detailed strategy includes the following steps.
Step 1 (divide the geographical area into grids). Each unit in the grid is a small square of size $d \times d$, which is the communication area of a cache proxy and is covered by several Wi-Fi APs. Reference [39] shows the bandwidth of a Wi-Fi AP when transmitting location-related caches. Assume there are $n \times n$ squares in total. Each square is assigned a unique ID for reference.
Step 2 (construct the transition matrix $W$). We first map the trajectory of each user to a stay point list; then we map each stay point list to a list of square IDs by using the location information. With this transformation, a user trajectory can be represented by a sequence of square IDs. We then build a weighted graph $G(V, E)$, where $V$ is the set of nodes indicated by square IDs and $E$ is the set of edges (if a user moves from one node to another, there is an edge between them).
We define the weight $w_{i \to j}$ on the edge $(i, j)$ as the number of users moving from location $i$ to location $j$, which can be computed as follows. Let $w_{i \to j} = 0$ at the beginning. If a user is found moving from square $s_i$ to square $s_j$ (the source stay point is in square $s_i$ and the destination stay point is in square $s_j$), let $w_{i \to j} = w_{i \to j} + 1$. After scanning the whole trajectory set, we get $G(V, E, W)$ with weight $w$ on every edge.
Step 3 (calculate the visiting frequency). To find the most popular grids, we define a value $h_i$ to represent the popularity of the $i$th grid: $h_i = \sum_{j=1}^{n \times n} w_{j \to i}$.
Step 4 (deploy cache proxies). Assume $r$ is the number of cache proxies we plan to deploy. We sort the values $h_i$ in descending order and choose the top $r$ squares to deploy the cache proxies.
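Steps 1 to 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the planar coordinates, the row-major square numbering, and the choice to measure popularity by incoming moves are all assumptions made for the example.

```python
from collections import Counter

def square_id(x, y, side, cols):
    """Map a planar coordinate to the ID of the grid square containing it
    (row-major numbering; the numbering scheme is an assumption)."""
    return int(y // side) * cols + int(x // side)

def transition_weights(stay_point_lists, side, cols):
    """Count moves between consecutive stay points, keyed by (source, destination) square."""
    weights = Counter()
    for points in stay_point_lists:
        squares = [square_id(x, y, side, cols) for x, y in points]
        for src, dst in zip(squares, squares[1:]):
            weights[(src, dst)] += 1
    return weights

def choose_proxy_squares(weights, num_proxies):
    """Rank squares by total incoming weight (assumed popularity measure)
    and return the IDs of the top num_proxies squares."""
    popularity = Counter()
    for (_src, dst), w in weights.items():
        popularity[dst] += w
    ranked = sorted(popularity.items(), key=lambda kv: (-kv[1], kv[0]))
    return [sq for sq, _ in ranked[:num_proxies]]

# Hypothetical stay-point lists on a 2 x 2 grid of unit squares.
trips = [
    [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5)],
    [(0.5, 1.5), (1.5, 1.5)],
    [(0.5, 0.5), (1.5, 1.5)],
]
weights = transition_weights(trips, side=1.0, cols=2)
top = choose_proxy_squares(weights, num_proxies=2)
```

In this toy input, square 3 receives three incoming moves and square 1 receives one, so they are the two squares selected.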

How to Organize and Maintain Cache Content in Proxies.
To answer the question of how to organize and maintain cache content in proxies, we use a Markov chain model to describe the query interests and mobility pattern of users.
Based on the model, we propose cache placement, cache pushing, and cache update strategies.

The Markov Chain Model.
We use a Markov chain to describe the probabilities of a user moving from one location to another. The states of the Markov chain are the popular locations found by using the previous mechanism, and the transition probabilities are the probabilities of users moving between locations. The Markov chain can be built as follows.
Based on the weighted graph described above, we can calculate the probability that a user moves from location $i$ to location $j$ as $p_{i,j} = w_{i \to j} / \sum_{l=1}^{n \times n} w_{i \to l}$, which satisfies
$$\sum_{j=1}^{n \times n} p_{i,j} = 1. \quad (1)$$
In our model, $P = [p_{i,j}]$ forms the one-step transition matrix of the Markov chain. The probability of moving from one square to another can be measured by counting the visit frequency: after counting all the transitions among the stay points, we can easily get the probability of moving from one stay point to another. Figure 4 is an example of the Markov chain. In this example, $p_{1,2} = 6/(6 + 9 + 5) = 30\%$. Given an initial configuration of the user distribution, the users move to other locations with the probabilities defined by the transition matrix; thus the distribution evolves over time. Let $\pi^{(t)} = \{\pi^{(t)}_1, \pi^{(t)}_2, \ldots\}$ be the distribution at time $t$, where $\pi^{(t)}_i = \Pr[X_t = i]$ is the percentage of users in location $i$ at time $t$. At the beginning, the distribution is $\pi^{(0)}$. After one time slot, the distribution becomes $\pi^{(1)} = \pi^{(0)} P$. The system evolves as $\pi^{(0)} \xrightarrow{P} \pi^{(1)} \xrightarrow{P} \pi^{(2)} \xrightarrow{P} \cdots \pi^{(t)}$. By induction, we obtain $\pi^{(t)} = \pi^{(0)} P^t$. When the system runs for a long time (i.e., $t \to \infty$), $\pi^{(t)}$ converges to a stable distribution regardless of the initial distribution. The stable distribution can be obtained by solving the equation $\pi = \pi P$, and it represents the long-run distribution of mobile users over the locations.
Three pieces of information are delivered by the Markov chain convergence theorem [40] regarding the stationary distribution:
(1) Existence: there exists a stationary distribution.
(2) Uniqueness: the stationary distribution is unique.
(3) Convergence: starting from any initial distribution, the chain converges to the stationary distribution.
Neither irreducibility nor aperiodicity is necessary for the existence of a stationary distribution. In fact, any finite Markov chain has a stationary distribution. Irreducibility and aperiodicity guarantee the uniqueness of the stationary distribution and the convergence to it.
In our problem, if the Markov chain is reducible, it means that users move within separated regions. In that case, we can divide the map into submaps such that the Markov chain for each submap is irreducible. The chain is also aperiodic, because it is unlikely that the lengths of all loops in the graph share a common divisor such as 2. So the Markov chain has a stationary distribution $\pi$. The stationary distribution can be derived by solving the equation $\pi = \pi P$, or it can be obtained by running $\pi^{(t)} = \pi^{(0)} P^t$ for a sufficiently long time $t$.
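The construction of the transition matrix and the iterative computation of the stationary distribution can be illustrated with a short Python sketch. The count matrix below is hypothetical: only its first row reuses the weights 6, 9, and 5 from the Figure 4 example (with 6 placed on the move from location 1 to location 2), and the function names, tolerance, and iteration limit are assumptions.

```python
def transition_matrix(counts):
    """Row-normalize a square matrix of move counts into transition probabilities."""
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("every location needs at least one outgoing move")
        matrix.append([w / total for w in row])
    return matrix

def step(dist, matrix):
    """One time slot: multiply the row vector dist by the transition matrix."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def stationary(matrix, iterations=1000, tol=1e-12):
    """Iterate the distribution from a uniform start until it stops changing."""
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        nxt = step(dist, matrix)
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    return dist

# Hypothetical move counts between three locations.
counts = [
    [5, 6, 9],
    [4, 0, 4],
    [3, 3, 0],
]
P = transition_matrix(counts)
pi = stationary(P)
```

The self-loop at the first location makes this small chain aperiodic, and every location can reach every other, so the iteration settles on a single distribution that satisfies the fixed-point equation up to numerical tolerance.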

Cache Placement Strategy.
We manage cache content on proxies as follows. We only consider the short-term movement of users; thus we predict the future movement for three steps; that is, $\pi^{(t+3)} = \pi^{(t)} P^3$. For each cache proxy $i$, let the initial distribution be $\pi^{(0)} = [0, \ldots, 0, 1, 0, \ldots, 0]$ (all users in square $i$). Then we calculate $\pi^{(3)} = \pi^{(0)} \times P^3$. The vector $\pi^{(3)}$ indicates the probability that the user moves to each of the other locations within three steps. Given this information, our cache placement and replacement principle is to cache the most probable locations and to evict the less probably visited ones.
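The three-step prediction can be sketched as follows. The transition matrix is a hypothetical example chosen so that the result is easy to check by hand, and the function names and the `capacity` parameter are assumptions introduced for illustration.

```python
def step(dist, matrix):
    """One time slot: multiply the row vector dist by the transition matrix."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def predict(matrix, start, steps=3):
    """Distribution after `steps` moves when all users start in square `start`."""
    dist = [0.0] * len(matrix)
    dist[start] = 1.0
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist

def locations_to_cache(matrix, start, capacity, steps=3):
    """Pick the `capacity` most probable locations for the proxy in square `start`."""
    dist = predict(matrix, start, steps)
    ranked = sorted(range(len(dist)), key=lambda j: (-dist[j], j))
    return ranked[:capacity]

# Hypothetical transition matrix over three squares.
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
three_step = predict(P, start=0)
chosen = locations_to_cache(P, start=0, capacity=2)
```

Starting from square 0, the user is in square 1 after one step and in square 2 after two steps, so the three-step distribution splits evenly between squares 0 and 1, and those are the two locations chosen for caching.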

Cache Pushing Strategy.
To preserve users' location privacy, the LPPS scheme pushes cache content to mobile users proactively. The cache pushing strategy for a cache hit is described as follows.
Step 1. The cache proxy divides the cache items into several groups, with each group having $k$ items. Each item is in the format $\langle location, keyword, data \rangle$. To avoid forming patterns, the cache items are grouped together randomly. Each group generates an index in the form $\{\langle location_i, keyword_i \rangle\}$, which indicates the set of cache items in the group. The group IDs and their indexes are pushed to the users.
Step 2. When a user receives the group index, he compares his query, which follows the format $Q = \langle location, keyword \rangle$, with the index. If there is a cache hit, the user sends the group ID back to the cache proxy to acquire the data.
Step 3. On receiving the group ID, the cache proxy pushes the cache items in that group to the user.
Step 4. The user receives the grouped cache items and chooses the desired data. Due to the broadcast nature of wireless communication, other users who want data in the same group and overhear the broadcast can use the data without interacting with the cache proxy.
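The cache-hit exchange can be sketched in Python as below. This is a minimal sketch under stated assumptions: cache items are keyed by a hypothetical (location, keyword) pair, the group size plays the role of the anonymity parameter, and all function names and sample data are invented for the example.

```python
import random

def build_groups(items, group_size, rng):
    """Randomly partition cache keys into groups; returns {group_id: [keys]}.
    The returned mapping doubles as the index pushed to users."""
    keys = list(items)
    rng.shuffle(keys)
    starts = range(0, len(keys), group_size)
    return {gid: keys[s:s + group_size] for gid, s in enumerate(starts)}

def find_group(index, query):
    """User side: return the ID of the group whose index lists the query, or None on a miss."""
    for gid, keys in index.items():
        if query in keys:
            return gid
    return None

def push_group(items, index, gid):
    """Proxy side: return every cache item in the requested group."""
    return {key: items[key] for key in index[gid]}

# Hypothetical cache content keyed by (location, keyword).
items = {
    ("bank", "atm"): "ATM on the ground floor",
    ("hospital", "parking"): "Parking behind building B",
    ("cafe", "menu"): "Coffee and sandwiches",
    ("park", "hours"): "Open 6:00 to 22:00",
    ("mall", "sales"): "Weekend discounts",
    ("museum", "tickets"): "Tickets sold at the entrance",
}
index = build_groups(items, group_size=3, rng=random.Random(0))
gid = find_group(index, ("bank", "atm"))
pushed = push_group(items, index, gid)
miss = find_group(index, ("zoo", "tickets"))
```

Because the proxy only learns a group ID, it sees that the user wants one of the items in that group but not which one.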
If the requested data is not found in the index, a cache miss happens. The cache pushing strategy for a cache miss is described as follows.
Step 1. This step is the same as Step 1 in the cache pushing strategy for a cache hit.
Step 2. When a user receives the group index, he compares his query, which follows the format $Q = \langle location, keyword \rangle$, with the index. If a cache miss occurs, the user needs to retrieve the data from the LBS server. To protect location privacy, the user generates $k - 1$ queries with random locations/keywords and sends them together with the original query to the LBS server.
Step 3. After the LBS server sends the $k$ responses back to the user, the user uses them to generate a feedback index in the form $\{\langle location_i, keyword_i \rangle\}$; then the user sends the index to the cache proxy.
Step 4. On receiving the feedback index in the form $\{\langle location_i, keyword_i \rangle\}$, the cache proxy can store it for a time until it needs to update its cache. We assume the cache proxy may be disconnected from the LBS servers or the Internet while it is not updating its cache. When the cache proxy needs to update its cache, it uses the stored feedback indexes as requests to query the LBS server about each $\langle location_i, keyword_i \rangle$ and then receives the data from the LBS server as feedback content to update its cache. If the cache proxy reaches its cache size limit, an LRU policy is used to replace cache items. Algorithm 1 shows the pseudocode of our cache pushing strategy for each cache proxy.
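The cache-miss path can be sketched in the same spirit. The sketch below is not Algorithm 1; it only illustrates two ingredients named in the steps above: mixing the real query with random decoys, and an LRU replacement policy on the proxy. The function and class names, the pool of candidate queries, and the sample data are assumptions.

```python
import random
from collections import OrderedDict

def build_query_batch(real_query, candidate_queries, k, rng):
    """User side: hide the real query among k - 1 randomly chosen decoys."""
    pool = [q for q in candidate_queries if q != real_query]
    batch = rng.sample(pool, k - 1) + [real_query]
    rng.shuffle(batch)
    return batch

class LRUCache:
    """Proxy side: fixed-capacity cache that evicts the least recently used item."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used item

# Hypothetical pool of (location, keyword) pairs the user can draw decoys from.
candidates = [
    ("bank", "atm"), ("hospital", "parking"), ("cafe", "menu"),
    ("park", "hours"), ("mall", "sales"), ("museum", "tickets"),
]
real = ("hospital", "parking")
batch = build_query_batch(real, candidates, k=4, rng=random.Random(1))
feedback_index = list(batch)  # the same k pairs are later reported to the proxy

cache = LRUCache(capacity=2)
cache.put(("bank", "atm"), "a")
cache.put(("cafe", "menu"), "b")
cache.get(("bank", "atm"))
cache.put(("park", "hours"), "c")
```

After the last `put`, the cafe entry has been evicted because the bank entry was read more recently.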

Cache Updating Strategy.
Now we consider the cache updating and replacement strategy when a cache miss occurs.
Popular locations change when people's interests change. For example, when a new interesting shop or restaurant opens, people nearby may go to the new place frequently, which forms a new hot spot. The cache replacement and updating strategy should take the dynamics of users' visiting interests into account on a cache miss.
After cache misses have happened several times, the cache proxy will have received a large number of feedback indexes from users. Since they contain fake random requests for the sake of privacy preservation, we want to identify the truly popular locations from the feedback indexes. We use a balls-into-bins model introduced in [40] to achieve this. As shown in Figures 5 and 6, because the real hot places are always needed by users, the cumulative probability that such a location occurs should be higher than that of random requests. In Figure 5, we use black balls to indicate the real requests and white balls to indicate the fake requests (queries with random locations/keywords). Each bin indicates a location related to the requests. Issuing location-based queries is equivalent to putting balls (black and white) into bins. Each white ball drops into a bin chosen uniformly at random, corresponding to a user query with a random location/keyword. So each bin is expected to get $(k - 1) \times m / n$ white balls, where $m$ is the total number of cache misses that happened and $n$ is the total number of locations users may need. The distribution of black balls is not random: it is proportional to the popularity of the location. So the distribution of black balls is very different from that of the white balls, with the result that a few bins get most of the black balls, as Figure 6 shows. Finding the bins that contain the most black balls therefore indicates the truly popular locations.
Suppose we have $m$ balls, indicating the number of locations in the feedback indexes, and $n$ bins, indicating the total number of locations. Suppose all balls are thrown into bins independently. Let $X_i$ be the random variable representing the number of balls in the $i$th bin. We calculate the probability that a bin receives at least $\theta$ balls. If all balls are dropped uniformly at random, any given $\theta$ balls all land in bin $i$ with probability $(1/n)^{\theta}$, and there are $\binom{m}{\theta}$ distinct combinations of $\theta$ balls. Therefore,
$$\Pr[X_i \ge \theta] \le \binom{m}{\theta} \left(\frac{1}{n}\right)^{\theta} \le \frac{1}{\theta!} \left(\frac{m}{n}\right)^{\theta}, \quad (2)$$
where $n$ is the total number of locations and $m$ is the number of locations in the feedback indexes. In practice, $m \ll n$, which indicates that $(1/\theta!) \times (m/n)^{\theta}$ approaches 0 for larger $\theta$.
If $k$ is large enough, $\Pr[X_i \ge k]$ is very small, so a bin receiving at least $k$ balls is a low-probability event under uniform throwing. If a bin does receive at least $k$ balls, the balls were most likely not thrown uniformly at random, and the bin represents a genuinely popular location. For example, if $n = 1000$, $m = 10000$, and $k = 10$, we obtain $\Pr[X_i \ge k] \le 2.7557319223986 \times 10^{-17}$. If such an event occurs, we can be highly confident that the $i$th location is popular and should be chosen for insertion into the cache proxy.
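The numerical example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `uniform_tail_bounds` is hypothetical, and the two returned values correspond to the two upper bounds in the balls-into-bins argument (the union bound over sets of balls and the looser factorial form).

```python
import math


def uniform_tail_bounds(n: int, m: int, k: int) -> tuple[float, float]:
    """Upper bounds on Pr[X_i >= k] when n balls fall uniformly into m bins."""
    # Union bound: C(n, k) sets of k balls, each landing in bin i w.p. (1/m)^k.
    union_bound = math.comb(n, k) / m**k
    # Looser closed form: (1/k!) * (n/m)^k.
    factorial_bound = (n / m) ** k / math.factorial(k)
    return union_bound, factorial_bound


union_bound, factorial_bound = uniform_tail_bounds(n=1000, m=10000, k=10)
print(f"{union_bound:.4e} <= {factorial_bound:.4e}")
```

The second value reproduces the bound quoted in the example; the first is slightly tighter, as expected.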
We propose the following cache updating principles. Once a cache proxy has collected $n$ locations in its feedback indexes, it identifies the popular locations by evaluating the probability in (2), requests the corresponding feedback content from the LBS server, and inserts the location-related items into its cache. The least visited locations are evicted from the cache to make room for the new items. The detailed algorithm is shown in Algorithm 2.
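The updating principle can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 2: the function name `update_cache`, its parameters, and the popularity score (stored visit count plus feedback count) are assumptions made for the example, and the request to the LBS server is only indicated by a comment.

```python
from collections import Counter


def update_cache(cache, visits, feedback, threshold, capacity):
    """Return the new set of cached location ids after one update round.

    cache     -- set of location ids currently cached
    visits    -- dict mapping location id -> visit count seen by this proxy
    feedback  -- list of location ids from the feedback indexes (real and fake)
    threshold -- a location reported at least this often is deemed popular
    capacity  -- maximum number of locations the cache can hold
    """
    counts = Counter(feedback)
    # Locations whose feedback count is too high to be explained by random fakes.
    popular = {loc for loc, c in counts.items() if c >= threshold and loc not in cache}
    # In the scheme, content for `popular` would be requested from the LBS server here.
    merged = set(cache) | popular

    def score(loc):
        # Assumed popularity score: known visits plus feedback occurrences.
        return visits.get(loc, 0) + counts.get(loc, 0)

    # Keep the most popular entries; the least visited ones are evicted.
    kept = sorted(merged, key=lambda loc: (-score(loc), loc))[:capacity]
    return set(kept)


new_cache = update_cache(
    cache={1, 2, 3},
    visits={1: 50, 2: 5, 3: 30},
    feedback=[7] * 12 + [4, 5, 6, 8, 9],
    threshold=10,
    capacity=3,
)
```

In this toy run, location 7 exceeds the threshold and replaces the least visited cached location, while the single fake reports for locations 4, 5, 6, 8, and 9 are ignored.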

Discussion and Analysis.
In this section, we discuss the efficiency of the proposed LPPS scheme.
First, the deployment of cache proxies is greedy. Because proxies are placed in the most visited grid squares, the deployment approximately maximizes the number of visits that fall within proxy coverage.
Second, the placement of cache content is based on movement prediction using Markov chain analysis. To find the proper location-related content to push, we need the locations most strongly related to the current location, which can be computed from the transition matrix $P$. We push content for the most probable next locations to mobile users passing by, which reduces both the risk of location reporting and the communication overhead to the LBS server.
Third, we analyze the computational cost of the proposed algorithm, which involves two steps.
(1) Using the original data to compute where to deploy cache proxies. Because we choose the most visited squares to deploy cache proxies, if the input consists of $N$ historical trajectory points and the area is divided into a square grid with $M$ cells in total, the time complexity is $O(N)$ and the space complexity is $O(M^2)$. This computational cost is large, but it is incurred only once.

Algorithm 2 (steps 5 to 11):
(5) for each location $j$ in the feedback indexes do
(6) if the count of location $j$ ≥ the threshold then
(7) request feedback content about location $j$ from LBS servers
(8) newCache = cache ∪ {cache about location $j$}
(9) end if
(10) end for
(11) sort $cache_i$ in descending order of …

(2) Placing and maintaining cache content.
(2.1) To place cache content, we compute $\pi^{(3)} = \pi^{(0)} \times P^{3}$. The vector $\pi^{(3)}$ gives the probability that the user moves to each other location within three steps. Our cache placement and replacement principle is to cache the most popular locations and evict the less frequently visited ones. If the city is divided into $M$ grid cells, computing $\pi^{(3)}$ has time complexity $O(M^3)$ and space complexity $O(M^2)$. This cost is also large, but it too is incurred only once. After that, if the cache size of $cache_i$ is $S$, we only sort the entries of $\pi^{(3)}$ and choose the top $S$ most popular locations as the cache content for $cache_i$. Choosing the $S$ most popular locations has time complexity $O(M \log M)$ and space complexity $O(M)$.
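The three-step prediction and top-$S$ selection can be sketched in plain Python. This is an illustration under assumptions: the 3 × 3 transition matrix is a made-up toy, the function names are hypothetical, and the sketch applies three vector-matrix products instead of forming the cube of the transition matrix explicitly.

```python
def step(dist, P):
    """One Markov step: row vector `dist` times transition matrix `P`."""
    m = len(P)
    return [sum(dist[i] * P[i][j] for i in range(m)) for j in range(m)]


def three_step_distribution(pi0, P):
    """Distribution over locations after three transitions from pi0."""
    pi = pi0
    for _ in range(3):
        pi = step(pi, P)
    return pi


def top_locations(pi, S):
    """Indices of the S most probable locations (ties broken by index)."""
    return sorted(range(len(pi)), key=lambda j: (-pi[j], j))[:S]


# Toy transition matrix over three locations; each row sums to 1.
P = [
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],
]
pi0 = [1.0, 0.0, 0.0]  # the user is currently at location 0
pi3 = three_step_distribution(pi0, P)
cache_content = top_locations(pi3, S=2)
```

Applying three vector-matrix products costs less than cubing the matrix, but the cube only has to be computed once and can then be reused for every starting location, which matches the one-time cost discussed in the text.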
(2.2) Maintaining cache content in proxies consumes not only bandwidth between the proxies and the LBS servers for cache updates, but also computational resources at the proxies to determine which part of the cache needs to be replaced and updated, as we explain next. We use Algorithm 2 to replace and update the cache. To find popular location data that is not in the original cache, we analyze the feedback indexes with the balls-into-bins model and request the content for the truly popular locations from the LBS servers. We then use this content to replace the least popular entries in the original cache. As in (2.1), sorting and choosing the $S$ most popular locations to cache requires $O(M \log M)$ time and $O(M)$ space, where $M$ is the number of grid cells.
In summary, although the one-time computational cost is $O(M^3) + O(N)$ in time and $O(M^2)$ in space, where $N$ is the number of historical trajectory points and $M$ is the number of grid cells, the cost of keeping the system running is only $O(M \log M)$ in time and $O(M)$ in space.

How to Guarantee k-Anonymity Location Privacy Preservation.
We show that LPPS achieves $k$-anonymity of location privacy with respect to both the cache proxy and the LBS server.
When a cache hit occurs, the cache proxy receives the group ID from the user. That is, the cache proxy knows that the desired data is within the group. However, the group contains $k$ different, randomly chosen items, so the proxy cannot distinguish the target from the other $k - 1$ items. On a cache hit, the LBS server and the user do not communicate, so the LBS server cannot track the user.
When a cache miss occurs, the user sends out location-based queries with fake locations and keywords. Since the LBS server receives $k$ different queries from the same user, it is hard to distinguish the real request from the other $k - 1$ requests. The proxy also receives a feedback index from the user, and it is likewise hard to distinguish the real entry from the other $k - 1$ locations/keywords in the feedback index. Besides, the LBS server receives user requests only intermittently (only when a cache miss occurs), which makes it impossible to track the user's location continuously.
In the threat model, the proxies and the LBS server may be compromised, and they can collude to try to track users. This does not affect the $k$-anonymity of location privacy. When a cache hit occurs, the proxies only know the $k$ destinations the user may be heading to, and the LBS server has no information about the user, so $k$-anonymity of location privacy is guaranteed. When a cache miss occurs, both the proxies and the LBS server receive $k$ locations/keywords, of which $k - 1$ are indistinguishable random locations/keywords, so $k$-anonymity of location privacy is also guaranteed. In fact, both the LBS server and the proxy only know the $k$ probable destinations that the user is heading to and nothing more; they cannot obtain enough information to track the user. Our protocol leaves both the LBS server and the distributed cache layer with only partial information about the user; even when they collude, they still cannot obtain enough information. Even if neighboring proxies collude, they still only learn the $k$ probable next destinations of the user.
In summary, the proposed LPPS guarantees $k$-anonymity of location privacy.
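The cache-miss case can be illustrated with a short sketch of how a client might hide one real location among $k - 1$ random ones. The function name `build_anonymous_queries` and the use of integer location ids are assumptions for the example; the paper's actual message format is not reproduced here.

```python
import random


def build_anonymous_queries(real_location, all_locations, k, rng=None):
    """Return k distinct locations: the real one plus k - 1 random fakes, shuffled."""
    rng = rng or random.Random()
    candidates = [loc for loc in all_locations if loc != real_location]
    if len(candidates) < k - 1:
        raise ValueError("not enough distinct locations to reach k-anonymity")
    queries = rng.sample(candidates, k - 1) + [real_location]
    rng.shuffle(queries)  # hide the position of the real query in the batch
    return queries


queries = build_anonymous_queries(
    real_location=42, all_locations=range(100), k=5, rng=random.Random(0)
)
```

Because the fakes are drawn uniformly and the batch is shuffled, an observer who sees only the batch has no positional cue for picking out the real location.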

Performance Evaluation
In this section, we conduct experiments on both real and synthetic datasets to evaluate the efficiency of the proposed LPPS strategy.

Experiment Setup.
The two real datasets are mobile user traces collected in Beijing [2] and cab mobility traces collected in San Francisco [34]; their statistics are given in Table 1. The historical traces are obtained from the publicly available GPS datasets of the GeoLife and Cabspotting projects. The GeoLife dataset (published by Microsoft Research Asia) logged the mobility traces (GPS coordinates) of 178 users over a period of more than four years (from April 2007 to October 2011); most of the trace locations are in Beijing. The Cabspotting project logged the cab mobility traces (GPS coordinates) of 536 users collected in May 2008.
Our simulation areas are a square region in Beijing ranging from latitude 39.75°N to 40.1°N and from longitude 116.2°E to 116.66°E, and a square region in San Francisco ranging from latitude 37.6°N to 37.85°N and from longitude 122.55°W to 122.23°W. We remove the trajectories that fall outside these areas and apply the method described in Section 3.1 to calculate the stay points (with $\Delta_d$ = 600 m and $\Delta_t$ = 1200 s), obtaining 13,666 and 1,894,208 stay points in total for GeoLife and Cabspotting, respectively. We assume that the communication area of one proxy is 600 m × 600 m and accordingly divide each simulation area into a square grid (65 × 65 for GeoLife and 47 × 47 for Cabspotting). We apply the cache proxy deployment strategy to calculate the visiting frequency of each grid cell and deploy the cache proxies in the most popular locations. Based on the GPS dataset, knowing the current location, each mobile user generates a location-based query to request location information for his/her next stay point. The query is in the form ⟨location, keyword⟩, where the location is set to the one the user is heading to, which is available from the mobile trace. In total, 7,693 queries are generated for GeoLife and 9,266,161 for Cabspotting. For the required $k$-anonymity, we set $k = 5$.
The synthetic dataset is generated as follows. Users are distributed over a large area of 100 × 100 grid cells, each 600 m × 600 m, and generate a number of queries every minute. Both the locations from which queries are generated and the locations that the queries request follow a Zipf distribution. In this model, the probability $p_i$ that a request is generated from location $l_i$ and the probability $q_i$ that a request asks about location $l_i$ are given by $p_i = q_i = c / i^{\alpha}$, where $c = \left(\sum_{k=1}^{100 \times 100} 1/k^{\alpha}\right)^{-1}$. The parameter $\alpha$ ($0 \le \alpha \le 1$) is the skewness parameter of the Zipf-like distribution, indicating the degree of concentration of requests. We set $\alpha$ to 0.7, 0.8, 0.9, and 1.0 in our simulated datasets. All users' requests follow the same Zipf pattern. Although a synthetic dataset cannot represent a real dataset exactly, it allows us to vary data characteristics in a controlled manner that is not possible with real datasets.
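A Zipf-like request generator of this kind can be sketched as follows. The function name `zipf_probabilities`, the symbol name `alpha` for the skewness parameter, and the seed are assumptions for the example; locations are identified by their popularity rank from 1 to 100 × 100.

```python
import random


def zipf_probabilities(num_locations, alpha):
    """Probability of the i-th ranked location: c / i**alpha, normalized to sum to 1."""
    weights = [1.0 / (i**alpha) for i in range(1, num_locations + 1)]
    c = 1.0 / sum(weights)  # normalization constant
    return [c * w for w in weights]


probs = zipf_probabilities(num_locations=100 * 100, alpha=0.8)

# Draw 1000 requested locations (ranks 1..10000) according to the distribution.
rng = random.Random(7)
sample = rng.choices(range(1, len(probs) + 1), weights=probs, k=1000)
```

A larger skewness value concentrates more probability mass on the top-ranked locations, which is the effect the experiments vary.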
The performance metrics are service coverage ratio, cache hit ratio, and communication overhead. We chose these metrics because service coverage determines whether a mobile user can be served, cache hit ratio determines whether a mobile user can be served without the LBS server, and communication overhead directly affects mobile users. For each metric, we evaluate the impact of the number of proxies and the cache size $S$. While evaluating the impact of the number of proxies (resp., $S$), we fix $S$ = 2% (resp., the number of proxies at 14% of total locations).

Service Coverage Ratio.
We introduce the coverage ratio to evaluate the performance of our cache deployment strategy. A query is said to be covered if the mobile user is within the communication range of a proxy when the query is generated. The coverage ratio is defined as the number of queries covered by the cache layer divided by the total number of queries. If the number of cache proxies approaches the total number of grid cells, the service coverage ratio could reach 100%.
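The metric can be sketched together with the greedy deployment it evaluates. This is a toy illustration under assumptions: grid cells are `(row, column)` tuples, a proxy is assumed to cover exactly the cell it is placed in, and the function names are hypothetical.

```python
from collections import Counter


def deploy_proxies(stay_point_cells, num_proxies):
    """Greedy deployment: place proxies in the most visited grid cells."""
    counts = Counter(stay_point_cells)
    ranked = sorted(counts, key=lambda cell: (-counts[cell], cell))
    return set(ranked[:num_proxies])


def coverage_ratio(query_cells, proxy_cells):
    """Fraction of queries issued from a cell that hosts a proxy."""
    if not query_cells:
        return 0.0
    covered = sum(1 for cell in query_cells if cell in proxy_cells)
    return covered / len(query_cells)


# Toy data: ten stay points concentrated in a few cells.
stay_point_cells = [(0, 0)] * 5 + [(1, 2)] * 3 + [(3, 3)] + [(4, 1)]
proxies = deploy_proxies(stay_point_cells, num_proxies=2)
ratio = coverage_ratio(stay_point_cells, proxies)
```

With two proxies covering the two busiest cells, eight of the ten toy queries are covered, which mirrors the observation that a small fraction of cells can serve most queries when visits are concentrated.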
In our experiment, the number of proxies varies from 5% to 20% of the total number of grid cells. Figures 9(a) and 9(b) show the impact of the number of cache proxies. Figure 9(a) shows how the coverage ratio varies with the number of proxies for the two real datasets. When the number of proxies is 5% of the grid cells, the coverage ratio is about 85% for Beijing and 95% for San Francisco. As the number of proxies increases, the coverage ratio increases quickly; at 20%, it approaches 99% for both real datasets. This indicates that cache proxies deployed in a small portion of the grid can serve most of the queries. It also suggests a power-law distribution in the datasets. We can also see that cab traces in San Francisco are more concentrated than mobile user traces in Beijing. Figure 9(b) shows the coverage ratio for the simulated Zipf-distribution dataset with different values of the skewness parameter $\alpha$, which indicates how concentrated the simulated dataset is. For a fixed $\alpha$, the coverage ratio increases quickly as the number of proxies increases; for a fixed number of proxies, the coverage ratio increases quickly as $\alpha$ increases. Because cache size does not affect the service coverage ratio, we do not illustrate the impact of cache size $S$ on it.

Cache Hit Ratio.
We use cache hit ratio as another performance metric of cache deployment strategy.A query is said to be hit if the mobile user can get the desired content from the cache when the query is generated.The cache hit ratio is defined as the proportion of the number of queries hit by the cache layer to the total number of queries.
Figures 10(a) and 10(b) show the impact of the number of cache proxies. Figures 10(c) and 10(d) show the impact of the cache size $S$.
The number of cache proxies is an important factor for the performance of our strategies. A larger number of cache proxies yields a higher cache hit ratio and provides a better location privacy guarantee. Figure 10(a) shows, for the real datasets with the cache size set to 2%, how the average cache hit ratio of proxies varies as the number of cache proxies varies from 5% to 20% of the total number of grid cells. The cache hit ratio is low when cache size is small. When the number of proxies reaches 5%, the cache hit ratio is over 0.90 for Beijing and over 0.50 for San Francisco. It increases slowly as the number of proxies increases further. As shown in Figure 9(a), once the number of proxies reaches 8% for San Francisco, the coverage ratio barely increases, which causes the cache hit ratio to increase slowly. The number of destinations for a cab is much larger than that for a mobile user, so the hit ratio for the Cabspotting dataset is lower than for GeoLife. Figure 10(b) shows the results with the simulated Zipf-distribution dataset. For a fixed skewness parameter $\alpha$, the hit ratio increases as the number of proxies increases; for a fixed number of proxies, the hit ratio increases as $\alpha$ increases.
Cache size is an important factor for the performance of our strategies. A larger cache size yields a higher cache hit ratio and provides a better location privacy guarantee. Figure 10(c) shows, for the real datasets with the number of cache proxies set to 2% of the grid cells, how the average cache hit ratio varies as the cache size $S$ varies from 1% to 3% of the size of the keyword set. The cache hit ratio is low when the cache size is small: it is about 50% for $S$ = 1%. The number of destinations for a cab is much larger than that for a mobile user, so the hit ratio for the Cabspotting dataset is lower than for GeoLife. We can also see that the cache hit ratio increases slightly with cache size up to 2% in the GeoLife dataset. Figure 10(d) shows the results with the simulated Zipf-distribution dataset. For a fixed skewness parameter $\alpha$, which indicates how concentrated the simulated dataset is, the hit ratio increases as $S$ increases; for a fixed $S$, the hit ratio increases as $\alpha$ increases.

Comparison of Communication Overhead.
We assume that the communication cost of transmitting one unit of message from the LBS server to a mobile user is $c_s$, and that the cost from a cache proxy to a user is $c_p$. Normally $c_p$ is much smaller than $c_s$: the cache proxy is local whereas the LBS server is a remote site, and the proxies can broadcast messages to many users at the same time. We set $c_p / c_s = 0.25$ in our experiments. The communication overhead is calculated as the total number of messages exchanged between the users and the LBS server and the proxies, each multiplied by its corresponding communication cost.
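The overhead calculation can be sketched as a simple weighted message count. The sketch rests on stated assumptions that are not taken from the paper: every query exchanges `k` message units in both schemes, a cache hit is served entirely by a proxy, a cache miss is served entirely by the LBS server, and the names `cost_server` and `cost_proxy` stand for the per-unit server and proxy costs.

```python
def relative_overhead(num_queries, num_hits, cost_server=1.0, cost_proxy=0.25, k=5):
    """LPPS communication cost as a fraction of the classical k-anonymity cost."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    misses = num_queries - num_hits
    # Baseline: every query sends k message units to the remote LBS server.
    baseline = num_queries * k * cost_server
    # LPPS (assumed): hits are served by a proxy, misses by the LBS server.
    lpps = num_hits * k * cost_proxy + misses * k * cost_server
    return lpps / baseline


best_case = relative_overhead(num_queries=1000, num_hits=1000)
half_hits = relative_overhead(num_queries=1000, num_hits=500)
```

Under these assumptions, the best case (all queries hit the cache) gives a ratio equal to the proxy-to-server cost ratio of 0.25, and the ratio grows linearly toward 1 as the hit ratio falls.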
We use the classical $k$-anonymity strategy [6] as the baseline for comparison with the proposed LPPS. Figures 11(a) and 11(c) are based on the real datasets and show the communication overhead of LPPS as a percentage of that of the classical $k$-anonymity strategy. Figure 11(a) shows the impact of the number of proxies: as it increases, the cost decreases. Figure 11(c) shows the impact of $S$: as the cache size $S$ increases, the cost decreases relative to the $k$-anonymity strategy. Thus the communication overhead decreases as the number of proxies and the cache size increase. Figures 11(b) and 11(d) show the results with the simulated Zipf-distribution dataset for different values of the skewness parameter $\alpha$. For a fixed $\alpha$, the cost decreases quickly as the number of proxies and $S$ increase. For a fixed number of proxies or cache size, the cost decreases quickly as $\alpha$ increases, that is, as the dataset becomes more concentrated. In the best case, the communication overhead of LPPS is only about 25% of that of the classical $k$-anonymity strategy. Table 2 compares the performance of LPPS and the classical $k$-anonymity strategy.

Conclusion
In this paper, we address the location privacy issues in prevailing location-based services and propose a push-based location privacy preservation scheme called LPPS. LPPS introduces a distributed cache layer to store popular location-related data and push them to mobile users. Strategies for cache proxy deployment and cache pushing are proposed to achieve $k$-anonymity of location privacy. Cache replacement and updating strategies are proposed to mine the truly popular places of interest from feedback indexes that contain many fake entries when cache misses occur. Trace-driven simulations show that the proposed scheme achieves a high service coverage ratio, a decent cache hit ratio, and low communication overhead. In the future, we will focus on personalized cache pushing strategies; for example, people coming from different directions at different times may have different destinations.
We will also consider more kinds of cache content besides location data. The system could be deployed in practice with VANETs (vehicular ad hoc networks) and city-wide Wi-Fi plans; our cache proxies could be the roadside units [38] in a VANET, which can provide more kinds of services.

Figure 1: An example of location privacy preservation.

Figure 4: An example of the Markov chain.

Figure 5: Balls to bins model: feedback as balls, locations as bins.

Figure 6: Balls to bins model: bins receiving more balls stand for real popular locations.

Figure 7: Deployment of cache proxies in

Figure 8: Deployment of cache proxies in San Francisco.

Figure 9(b): The coverage ratio with simulated Zipf-distribution dataset.

Figures 11(a) and 11(c): Communication overhead of LPPS as a percentage of the classical $k$-anonymity strategy, from real datasets. Figure 11(a) shows the impact of the number of proxies.
…, $p^u_1$, …}. Consider $p^u_i = (x^u_i, y^u_i, t^u_i)$, where $x^u_i$ and $y^u_i$ are the geographical coordinates of the $i$th position in the trajectory of user $u$, and $t^u_i$ is the timestamp denoting when user $u$ is at the $i$th position. Without loss of generality, $t^u_i < t^u_{i+1}$. The "frequently visited locations" have temporal and spatial meanings. We introduce the concept of stay points to represent them. A stay point is defined as a location where a mobile user stays within distance $\Delta_d$ for at least time $\Delta_t$, where $\Delta_d$ and $\Delta_t$ are the thresholds for spatial and temporal differences. The set of stay points $SP_u$ of user $u$ can be obtained as follows. Starting from a location $p^u_i \in T_u$, where $T_u$ denotes the trajectory of user $u$, we compute the largest set $\{p^u_{i+1}, \ldots, p^u_{i+j}\}$ whose points stay within distance $\Delta_d$ for at least time $\Delta_t$. Denoting the representative coordinates of this set by $(\bar{x}, \bar{y})$, we let $SP_u = SP_u \cup \{(\bar{x}, \bar{y})\}$ and remove the set $\{p^u_i, p^u_{i+1}, \ldots, p^u_{i+j}\}$ from $T_u$. Starting from the first location, we repeat the process until $T_u$ is empty. As a result, we obtain the set of stay points $SP_u$, which represents the characteristic locations of user movement and filters out the locations that the user only visits temporarily.
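Stay-point extraction along these lines can be sketched as follows. This is a hedged illustration, not the paper's exact procedure: distances are measured from the first point of each candidate group with the haversine formula, the stay point is taken as the mean coordinate of the group, and the function names and default thresholds (600 m, 1200 s) are assumptions for the example.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def stay_points(trajectory, delta_d=600.0, delta_t=1200.0):
    """trajectory: list of (lat, lon, timestamp_seconds), sorted by timestamp."""
    result = []
    i, n = 0, len(trajectory)
    while i < n:
        lat0, lon0, t0 = trajectory[i]
        j = i + 1
        # Extend the group while points remain within delta_d of the anchor point.
        while j < n and haversine_m(lat0, lon0, trajectory[j][0], trajectory[j][1]) <= delta_d:
            j += 1
        if trajectory[j - 1][2] - t0 >= delta_t:
            group = trajectory[i:j]
            # Assumed representative coordinate: the mean of the group.
            mean_lat = sum(p[0] for p in group) / len(group)
            mean_lon = sum(p[1] for p in group) / len(group)
            result.append((mean_lat, mean_lon))
            i = j  # consume the whole group
        else:
            i += 1  # the user only passed through; try the next point
    return result


trace = [
    (39.90, 116.40, 0),
    (39.90, 116.40, 600),
    (39.90, 116.40, 1300),
    (39.95, 116.50, 1400),
    (39.96, 116.60, 1500),
]
points = stay_points(trace)
```

In the toy trace, the user lingers at the first location for 1300 s and then moves on, so only one stay point is reported.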

Table 1: Statistics of trace datasets.

Cache proxies are deployed in the most popular locations. Figure 7 illustrates the result of cache proxy deployment when the number of proxies is set to 5% of 65 × 65 in the GeoLife dataset; each blue dot indicates one proxy, and there are 211 proxies in the figure. Figure 8 shows the result of cache proxy deployment when the number of proxies is set to 5% of 47 × 47 in the Cabspotting dataset; each blue dot indicates one proxy, and there are 110 proxies in the figure.