Achieving the Optimal 𝑘 -Anonymity for Content Privacy in Interactive Cyberphysical Systems



Introduction
Cyberphysical systems (CPS), which deeply integrate computing, communication, control, and monitoring components, have enabled modern services in our daily life, such as the smart grid, intelligent transportation, and automatic driving. Recent developments in mobile communication and networks have fostered many modern applications built on interactive cyberphysical systems, in which client software programs or devices take actions according to their interactions with CPS servers. In more detail, a client sends a request to the CPS server and takes actions on receiving the reply from the CPS server. The actions to be taken depend on the reply of the CPS server. Health care, automatic driving, and location-based services fall into this category of interactive CPS applications. Suppose an elderly user, Bob, wears a health-care device that is connected to a CPS server through the mobile Internet. Bob could send "stomachache" to the CPS server for instructions to help him. The CPS server may send a reply telling Bob what to do or where the nearest hospital is. Bob then takes his action according to the reply. Similar processes could be adopted in applications of automatic driving and location-based services. These modern services over interactive CPS are attractive since they bring convenience to people's daily life. However, privacy concerns arise at the same time, because users' requests are submitted to the CPS servers through the Internet. These requests disclose users' query contents to the CPS server and even to malicious eavesdroppers on the communication channels. Here we refer to query content as the parameters in users' requests, such as "stomachache" in Bob's request to the health-care CPS server. These query contents should be treated as sensitive information for individuals, and their abuse or further leakage will make users vulnerable with respect to their private life or even personal safety [1].
In view of this privacy concern, content privacy should be recognized to emphasize that users' query contents must be kept as sensitive information in interactive cyberphysical systems. Many research efforts have been made to protect different types of query contents, such as locations and keywords, in the literature of interactive cyberphysical systems such as location-based services. The major body of these efforts consists of two parts, cloaking based solutions and client based solutions. Cloaking based solutions employ a trusted server from a third party. When a user u queries with a query content c, the query (u, c) is sent to the trusted server, which then generates a cloaking region containing u's location and the locations of at least k − 1 other users. Here k specifies the level of privacy guarantee. The query with the cloaking region is then sent to the CPS server, and the CPS server cannot determine where u is or even whether u is querying. In this process, cloaking based solutions aim to make u indistinguishable from k − 1 other users; however, they suffer from inherent drawbacks brought by the trusted server, which may become the single point of failure for privacy and the bottleneck of query performance. More seriously, when the CPS server holds certain side information, such as the prior probability of query contents, cloaking based solutions will suffer further privacy breach. To address this issue, client based solutions have been presented, aiming at k-anonymity provided by the client side. Reference [2] generates k − 1 dummy query contents for continuous scenarios. To prevent the adversary from inferring the actual query content, [2] constrains the selected dummies to have prior probability larger than a predefined threshold. In practice, it is difficult to determine a proper threshold, and, what is more, dummies selected by [2] could still be eliminated if they have quite different prior probabilities. Reference [3] employs an entropy based privacy metric and generates k − 1 dummy locations in a random manner. However, improper dummies could be eliminated from the reported k query contents due to the process of [3]; thus the provided privacy is degraded.
In this paper, we investigate the problem of preserving content privacy in interactive cyberphysical systems with a client based solution. To guarantee the utility of requests to the CPS servers, we adopt k-anonymity in order to prevent the adversary from recognizing the actual query content among the k reported contents, since the actual query content must be sent to the CPS server for a meaningful reply. In this process, the major challenge arises in two aspects. First, the provided k-anonymity should be carefully designed so that the adversary is not able to eliminate any query contents. Second, the overall privacy produced by k-anonymity should be optimized. We present two privacy metrics, named expected entropy and dp-coefficient, which depict the achieved content privacy using an entropy based concept and a differential privacy style measurement, respectively. We then propose an algorithm, Multilayer Alignment (MLA), which establishes k-anonymity based mechanisms for preserving content privacy. Given the prior probability of query contents together with an integer k which specifies the privacy level, MLA generates a set of reports, each of which consists of k distinct query contents, together with a probability distribution over the report set for each query content. Given any report R submitted to the CPS server, MLA guarantees that the posterior probability of each query content in R is larger than 0; to this end, the adversary is not able to eliminate any query content from a report. Here a report can be taken as a set of distinct query contents, and we give its formal definition in Section 3. We theoretically establish the properties of MLA by proving that MLA achieves the optimal expected entropy and the optimal dp-coefficient at the same time. These attractive properties make MLA the optimal k-anonymity solution for preserving content privacy in interactive cyberphysical systems. The major contributions of this paper are as follows.
(i) We formulate the problem of achieving the optimal k-anonymity based mechanisms for preserving content privacy in interactive cyberphysical systems. The problem formulation is based on two content privacy metrics built on entropy and differential privacy concepts.
(ii) We propose the Multilayer Alignment (MLA) algorithm, which establishes k-anonymity based mechanisms for preserving content privacy. The MLA algorithm prevents adversaries from eliminating query contents from reports using Bayesian inference.
(iii) We prove that MLA achieves the optimal k-anonymity mechanisms in terms of both of our presented content privacy metrics simultaneously.
(iv) We evaluate our proposed MLA algorithm using real-life datasets. The evaluation results validate that MLA achieves effective content privacy in terms of the entropy based and differential privacy style content privacy metrics.
The rest of this paper is organized as follows. Section 2 introduces the necessary preliminaries, including the process of preserving content privacy using a client based solution, together with commonly accepted privacy metrics. Section 3 formulates the problem of achieving the optimal k-anonymity for content privacy in interactive cyberphysical systems. Section 4 proposes the MLA algorithm, which establishes effective mechanisms for preserving content privacy. Section 5 theoretically proves that our proposed MLA algorithm achieves the optimal k-anonymity for content privacy in terms of the content privacy metrics introduced in Section 3. Section 6 evaluates the MLA algorithm based on real-life datasets, and the related work is discussed in Section 7. Finally, Section 8 concludes this paper.

Preliminaries
This section introduces the necessary preliminaries, including the process of preserving content privacy using a client based k-anonymity solution. Then two accepted privacy notions, i.e., k-anonymity and differential privacy, and their corresponding metrics are introduced.

Client Based 𝑘-Anonymity Solution.
In the process of a client based k-anonymity solution for content privacy preservation, k query contents are reported to the CPS server when a user u wants to submit a request. The k reported query contents are determined on u's device, and no trusted third-party servers are employed; thus the potential single point of privacy failure and the query performance bottleneck are eliminated. It is worth noticing that the actual query content u queries should be included in the reported ones; otherwise u is not able to receive a meaningful reply from the CPS server. After receiving the reported query contents, the CPS server processes k queries, one for each reported query content, and then returns the query results to u. Irrelevant results are filtered out on u's mobile device, and the actual results are returned to u. In this process, the CPS server receives k distinct query contents instead of a single actual one, and to this end the actual query content is hidden. Nevertheless, careful design is required to avoid ineffective dummies among the reported query contents. The way of generating reports to the CPS server determines the level of content privacy achieved, and this motivates our work.
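To make the interaction concrete, the following Python sketch illustrates the client-side flow with a naive random dummy selection; the function names and the stub server are our own illustrative choices, and the random selection is exactly the weak policy that the rest of this paper replaces with an optimized reporting probability.

```python
import random

def query_cps(actual, contents, k, send_to_server):
    """Build a k-anonymous report on the client, query, and filter locally."""
    dummies = random.sample([c for c in contents if c != actual], k - 1)
    report = dummies + [actual]
    random.shuffle(report)                 # do not reveal the actual position
    results = send_to_server(report)       # server answers all k contents
    return results[actual]                 # drop the k-1 irrelevant answers

# stub CPS server that returns one reply per reported content
reply = query_cps("stomachache",
                  ["stomachache", "headache", "fever", "cough", "rash"],
                  k=3,
                  send_to_server=lambda rep: {c: "advice for " + c for c in rep})
print(reply)
```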

Privacy Notions and Metrics
2.2.1. k-Anonymity. One widely adopted notion of privacy is k-anonymity, which was first introduced in the database community by [4]. The principle of k-anonymity is to hide the sensitive information among k − 1 dummies so as to make the adversary unable to recognize the actual one. In the literature of privacy protection in interactive cyberphysical systems, k-anonymity solutions can be categorized into cloaking based solutions and client based solutions. The cloaking based solutions such as [5] employ a third-party but trusted server, which is responsible for hiding the actual user among at least k − 1 dummy users by spatial generalization. The client based solutions, including more recent work such as [2, 3, 6], run on users' devices and generate k − 1 dummies in a local manner, in which process certain side information is adopted, for instance, the prior probability of each query content. The trusted server in cloaking based solutions may become a single point of failure if it is hacked by attackers, and it is a performance bottleneck that incurs long latency for requests. What is more, most cloaking based solutions are unaware of the side information held by the adversary, such as the prior probability of query contents. At the same time, the existing client based solutions provide only specious k-anonymity, since the attackers may violate the principle of k-anonymity by rerunning the algorithms or launching probability inference for each of the k query contents.
The quality of k-anonymity can be measured by the concept of entropy borrowed from information theory. When the CPS server receives a report R = {c_1, ..., c_k} consisting of k query contents, the entropy of R is formulated as follows:

$$H(R) = -\sum_{i=1}^{k} p(c_i \mid R) \log p(c_i \mid R).$$

Note that this formulation is slightly different from [3]; actually, [3] takes the prior probability p(c_i) as an approximation of the posterior probability p(c_i | R).
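As an illustration, the snippet below computes H(R) from the posterior probabilities of the contents in a report (a base-2 logarithm is assumed here; the argument does not depend on the base).

```python
from math import log2

def report_entropy(posteriors):
    """H(R) over the posterior probabilities of the k contents in a report."""
    return -sum(p * log2(p) for p in posteriors if p > 0)

print(report_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: best case for k=4
print(report_entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits: skew helps attacks
```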

2.2.2. Differential Privacy.
Differential privacy was first introduced and applied in statistical databases, and it aims to prevent the leakage of any individual's information during query processing. Generally speaking, to satisfy the notion of differential privacy, a random algorithm should return query results with similar distributions for two databases differing in just one tuple. In other words, under the control of differential privacy, a single modification of a database brings only a minor change to the query results. The definition of differential privacy is given below.
Definition 1 (differential privacy). Given ε ⩾ 0, a randomized algorithm A satisfies ε-differential privacy if for all neighboring databases D and D′ and any O ⊆ Range(A),

$$\Pr(\mathcal{A}(D) \in O) \leqslant e^{\varepsilon} \times \Pr(\mathcal{A}(D') \in O).$$

Any pair of neighboring databases D and D′ satisfies one of the following conditions: (1) (for unbounded differential privacy) D can be transformed into D′ with exactly one insertion or deletion; (2) (for bounded differential privacy) D can be transformed into D′ with exactly one modification.
Bounded differential privacy prevents distinguishing two datasets of the same size that differ in exactly one tuple. Unbounded differential privacy prevents distinguishing two datasets which are the same except that one of them holds exactly one additional tuple.
The metric for differential privacy is the coefficient ε in Definition 1. Intuitively, a smaller ε leads to better privacy but larger noise in the query result, while a larger ε leads to less noise in the query result but a weaker privacy guarantee.
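For concreteness, the following sketch shows the classical Laplace mechanism, the textbook way to satisfy Definition 1 for a numeric query; it is not part of our k-anonymity construction and is included only to make the role of ε tangible.

```python
import math
import random

def laplace_mechanism(true_count, sensitivity, epsilon):
    """Return true_count plus Laplace(sensitivity/epsilon) noise.

    A smaller epsilon means more noise and a stronger privacy guarantee.
    """
    u = random.random() - 0.5                 # uniform on [-0.5, 0.5)
    scale = sensitivity / epsilon
    # inverse-CDF sampling of the Laplace distribution
    return true_count - scale * math.copysign(1.0, u) \
        * math.log(max(1e-12, 1.0 - 2.0 * abs(u)))

print(laplace_mechanism(true_count=120, sensitivity=1, epsilon=0.5))
```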

Problem Definition
This section formulates the problem of achieving the optimal k-anonymity based mechanisms for content privacy in interactive cyberphysical systems. Before introducing the problem definition, we provide several definitions which interpret the indispensable concepts for our problem definition.
When a user queries a content c, the client based solution first generates a report consisting of k distinct query contents and then sends the report to the CPS server for a response. A formal definition of a report is given as follows.
Definition 2 (report). Given the global set C of query contents and an integer k > 0, a report R is a subset of C with size k. When a user queries content c, the generated report R must contain c; i.e., c ∈ R. Denote the set of all the reports for the given C and k by R_{C,k}. The set of reports containing the query content c is denoted by R_{c,k}.
In the rest of this paper, we focus on a specified C and k, and we also use the notation R_c (instead of R_{c,k}) to refer to the set of reports containing the query content c.

For a client based solution, multiple reports could include an identical query content c. When c is queried, one of these reports is submitted to the CPS server. The following definition of reporting probability depicts the process of selecting such a report.

Definition 3 (reporting probability). Given the global set C of query contents and an integer k > 0, a reporting probability is a function P_{C,k}: C × R_{C,k} → [0, 1] satisfying the following constraints: (1) P_{C,k}(c, R) ⩾ 0 for every c ∈ C and R ∈ R_{C,k}; (2) ∑_{R ∈ R_{C,k}} P_{C,k}(c, R) = 1 for every c ∈ C; (3) P_{C,k}(c, R) = 0 if c ∉ R.

The first two constraints in the definition of reporting probability illustrate that when querying a content c ∈ C, a report R ∈ R_c is selected according to the probability P_{C,k}(c, R). The third constraint specifies that a report R will not be selected for c if c ∉ R.
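A small validity checker for Definition 3 can be sketched as follows; the dict-based encoding of P_{C,k} (keys are (content, report) pairs) is our own convention for these illustrations.

```python
def is_valid_reporting_probability(contents, reports, P, tol=1e-9):
    """Check the three constraints of Definition 3 for a toy mechanism."""
    for c in contents:
        probs = [P.get((c, R), 0.0) for R in reports]
        if any(p < 0 for p in probs):                 # (1) non-negative
            return False
        if abs(sum(probs) - 1.0) > tol:               # (2) sums to 1 per content
            return False
        for R in reports:                             # (3) P(c,R)=0 if c not in R
            if c not in R and P.get((c, R), 0.0) > 0:
                return False
    return True

reports = [frozenset({"a", "b"}), frozenset({"a", "c"})]
P = {("a", reports[0]): 0.5, ("a", reports[1]): 0.5,
     ("b", reports[0]): 1.0, ("c", reports[1]): 1.0}
print(is_valid_reporting_probability(["a", "b", "c"], reports, P))  # True
```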
Next we formulate a client based solution as a mechanism in a probabilistic manner based on the concepts of report and reporting probability.
Definition 4 (mechanism). Given the global set C of query contents and an integer k > 0, a k-anonymity based mechanism M consists of two components, the set of reports R_{C,k} and the reporting probability P_{C,k}. When a user queries content c ∈ C, M randomly selects a report R from R_{C,k}, and the probability of selecting R is P_{C,k}(c, R).
The above definition of a k-anonymity based mechanism may look unusual at first glance; nevertheless, existing solutions can be taken as instances of it. We could specify the reporting probability using the additional parameters in these works, for instance, the predefined prior probability threshold in [2].
This paper adopts two privacy metrics to measure a given k-anonymity based mechanism M in terms of privacy. As formulated in the following definition, the first metric integrates the entropy measures of all the reports generated by M.
Definition 5 (expected entropy). Given the global set C of query contents, an integer k > 0, and the prior probability p(.) of query contents, the expected entropy of mechanism M(R_{C,k}, P_{C,k}) is calculated as

$$EE(\mathcal{M}) = \sum_{R \in \mathcal{R}_{C,k}} \Big( \sum_{c \in R} p(c)\, P_{C,k}(c, R) \Big) \times H(R).$$

Here p(c | R) = p(c) P_{C,k}(c, R) / ∑_{c′ ∈ R} p(c′) P_{C,k}(c′, R) is the posterior probability of c given report R. EE(M) measures the achieved content privacy overall by considering all the generated reports: the probability of each report is taken as the weight, and the entropy of each report is integrated in the above formulation. A larger EE(M) indicates that better content privacy is obtained with respect to the concept of entropy.
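Under the same dict-based encoding as before, the expected entropy of a mechanism can be computed directly from Definition 5, as the sketch below shows; the toy mechanism in the usage example reports {a, b} or {a, c} for the skewed prior p(a) = 0.5, p(b) = p(c) = 0.25.

```python
from math import log2

def expected_entropy(prior, reports, P):
    """EE(M): report probabilities weighting per-report entropies."""
    ee = 0.0
    for R in reports:
        mass = [prior[c] * P.get((c, R), 0.0) for c in R]
        pr_R = sum(mass)                      # probability that R is reported
        if pr_R == 0:
            continue
        posts = [m / pr_R for m in mass]      # posterior p(c | R)
        ee += pr_R * -sum(p * log2(p) for p in posts if p > 0)
    return ee

prior = {"a": 0.5, "b": 0.25, "c": 0.25}
reports = [frozenset({"a", "b"}), frozenset({"a", "c"})]
P = {("a", reports[0]): 0.5, ("a", reports[1]): 0.5,
     ("b", reports[0]): 1.0, ("c", reports[1]): 1.0}
print(expected_entropy(prior, reports, P))   # 1.0 bit: both reports uniform
```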
The second metric incorporates the notions of k-anonymity and differential privacy. It measures a mechanism by the most distinguishable pair of query contents in the generated reports. The following definition formulates our second metric, named dp-coefficient.
Definition 6 (dp-coefficient). Given the global set C of query contents, an integer k > 0, and the prior probability p(.) of query contents, the dp-coefficient of mechanism M(R_{C,k}, P_{C,k}) is calculated as

$$dp(\mathcal{M}) = \max_{R \in \mathcal{R}_{C,k}} \; \max_{c, c' \in R} \; \ln \frac{p(c \mid R)}{p(c' \mid R)}.$$

Here the terms p(c | R) and p(c′ | R) are the posterior probabilities of query contents c and c′ given a report R, and they are calculated in the same way as the posterior probability described above.
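The dp-coefficient can be computed in the same way; since the normalization by the report probability cancels inside the ratio, the raw masses p(c)P_{C,k}(c, R) suffice, as sketched below with the toy mechanism from the previous example.

```python
from math import log

def dp_coefficient(prior, reports, P):
    """Largest ln-ratio of posterior probabilities within any report."""
    worst = 0.0
    for R in reports:
        masses = [prior[c] * P.get((c, R), 0.0) for c in R]
        masses = [m for m in masses if m > 0]
        if len(masses) >= 2:
            # normalisation by Pr(R) cancels inside the ratio
            worst = max(worst, log(max(masses) / min(masses)))
    return worst

prior = {"a": 0.5, "b": 0.25, "c": 0.25}
reports = [frozenset({"a", "b"}), frozenset({"a", "c"})]
P = {("a", reports[0]): 0.5, ("a", reports[1]): 0.5,
     ("b", reports[0]): 1.0, ("c", reports[1]): 1.0}
print(dp_coefficient(prior, reports, P))   # 0.0: posteriors are uniform
```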
Based on these content privacy metrics, i.e., expected entropy and dp-coefficient, we formulate the problem of achieving the optimal k-anonymity for content privacy in interactive cyberphysical systems as follows.
Problem Definition. Given the global set C of query contents, an integer k > 0, and the prior probability p(.) of query contents, compute a mechanism M(R_{C,k}, P_{C,k}) with the optimal content privacy. The optimal content privacy is achieved if EE(M) is maximized.
EE(M) and dp(M) depict the content privacy achieved by M from the holistic and the individual points of view, respectively. Although our problem definition aims at optimizing the expected entropy, in the next section we propose an algorithm which achieves the optimal expected entropy and the optimal dp-coefficient simultaneously.

Achieving the Optimal 𝑘-Anonymity
This section first provides a short discussion of a naïve solution to the problem defined in Section 3. Then we propose our Multilayer Alignment (MLA) algorithm, which achieves the optimal k-anonymity for content privacy in interactive cyberphysical systems. MLA exhibits an attractive property: it achieves the maximized expected entropy and the minimized dp-coefficient simultaneously.

A Naïve Solution.
According to the problem definition in Section 3, the essential challenge of establishing the optimal mechanisms lies in building the reporting probability P_{C,k}: C × R_{C,k} → [0, 1]. A naïve approach to achieving the optimal k-anonymity is to formulate the problem with nonlinear programming under the linear constraints of Definition 3, with the expected entropy or the dp-coefficient as the optimization objective. However, the nonlinear programming formulation employs $\binom{|C|}{k} \times |C|$ variables, each of which stands for an entry in P_{C,k}. When |C| grows to 100 and k is set to 10, there are more than 10^15 variables. Thus this naïve approach is impractical due to its computational expense.
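For reference, the variable count can be checked directly (a one-line illustration):

```python
from math import comb

# one variable per entry of P_{C,k}; with |C| = 100 and k = 10:
print(comb(100, 10) * 100)   # 1731030945644000, i.e., more than 10**15
```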

The Multilayer Alignment Algorithm.

MLA computes the optimal k-anonymity mechanism in two phases, namely, (1) Segment Alignment and (2) Mechanism Initiation. The major idea of MLA is to generate a mechanism in which the query contents of each report have posterior probabilities as similar as possible. To accomplish this goal in a holistic manner, Segment Alignment amortizes each query content with large prior probability over multiple query contents with small prior probability. Moreover, the reports generated by Mechanism Initiation share the same distribution of posterior probability over the k included query contents. Next we introduce the two phases of MLA.

Segment Alignment.
Given the prior probability p(.) of query contents, MLA represents each query content c_i ∈ C by a segment s_i with length |s_i| = p(c_i). The segments for all the query contents are sorted in descending order, and the sorted set is denoted by S = {s_1, ..., s_n}. MLA then aligns the segments onto k layers in order. The aligning process has two modes, i.e., aligning dominant and aligning dominated. At the beginning of the alignment, the mode aligning dominant is active, and the number of remaining layers (denoted k_r) is set to k. MLA checks whether the current segment is dominant: when aligning s_i, s_i is dominant if the condition |s_i| × k_r > ∑_{i ⩽ j ⩽ n} |s_j| holds. If the current segment s_i is dominant, MLA aligns s_i onto the current layer, and s_i takes up the entire layer; the alignment stays in mode aligning dominant, and s_{i+1} becomes the current segment when the alignment continues. If the current segment s_i is not dominant, the alignment turns to mode aligning dominated. MLA then sets the length of each of the remaining layers to ∑_{i ⩽ j ⩽ n} |s_j| / k_r and aligns s_i, ..., s_n along the remaining layers. When aligning a segment s_j, if the current layer has blank length l_b less than |s_j|, s_j is divided into two parts with lengths l_b and |s_j| − l_b; the first part is aligned onto the current layer, and the second part is aligned onto the beginning of the next layer. In mode aligning dominated, MLA goes on aligning all the remaining segments and never turns back to mode aligning dominant. After all the remaining segments are aligned, the first phase of MLA terminates and MLA continues to the second phase.
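The following Python sketch mirrors the Segment Alignment phase just described; the data conventions (priors as a dict, a layer as a list of (content, length) pairs) and the floating-point tolerances are our own simplifications, and |C| ⩾ k is assumed.

```python
def segment_alignment(prior, k):
    """Phase 1 of MLA: align prior-probability segments onto k layers."""
    segs = sorted(prior.items(), key=lambda cv: cv[1], reverse=True)
    layers, i, rest = [], 0, k
    # aligning-dominant mode: s_i takes a whole layer while it exceeds the
    # average length of the layers that remain
    while i < len(segs) and segs[i][1] * rest > sum(v for _, v in segs[i:]):
        layers.append([segs[i]])
        i, rest = i + 1, rest - 1
    if rest == 0:
        return layers
    # aligning-dominated mode: the remaining mass is spread over `rest`
    # equal-length layers, splitting segments at layer boundaries
    layer_len = sum(v for _, v in segs[i:]) / rest
    cur, room = [], layer_len
    for c, v in segs[i:]:
        while v > 1e-12:
            piece = min(v, room)
            cur.append((c, piece))
            v, room = v - piece, room - piece
            if room <= 1e-12:            # current layer is full
                layers.append(cur)
                cur, room = [], layer_len
    if cur:                              # numerical leftovers, if any
        layers.append(cur)
    return layers
```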
Here we use the instance in Example 7 to illustrate the process of segment alignment. There are 4 layers in the process of alignment. The segment for c_1 is aligned in mode aligning dominant since p(c_1) × k > 1. The segments for the remaining query contents are aligned in mode aligning dominated, and each of the remaining layers has length (1 − 0.4)/(4 − 1) = 0.2. The alignment result is shown in Figure 1.

Mechanism Initiation.

Denote the length of the ith layer by l_i, where i = 1, ..., k. The second phase of MLA first shrinks the length of each layer so that all the layers have the same length after shrinking; the shrinking ratio of the ith layer is recorded as r_i. In the next step, MLA sets a vertical line at the beginning of each layer. The vertical line then moves to the right until it touches the first point on any layer at which a segment ends, and the scanned parts of the k layers are packed into a report. Denote by len_i the length of the scanned part on the ith layer; then the probability of this report is ∑_{1⩽i⩽k} len_i × r_i, and the posterior probability of the query content on the ith layer is (len_i × r_i) / (∑_{1⩽j⩽k} len_j × r_j). The vertical line continues moving to the right, and MLA packs the next report whenever a segment ends on some layer. The process terminates after the vertical line reaches the end of the layers and generates the last report. Continuing with Example 7 as shown in Figure 2, the shrinking ratios of the 4 layers are 2, 1, 1, and 1. A vertical line starts moving to the right from the left end of all the layers. It first touches the end points of c_{3,2} and c_5 on layers 3 and 4, respectively, and a report R_1 = {c_1, c_2, c_3, c_5} is generated. It keeps moving to the right and touches the end points of c_2 and c_6 on layers 2 and 4, respectively, and report R_2 = {c_1, c_2, c_4, c_6} is generated. Finally, the vertical line touches the end points of all the layers and generates the last report R_3 = {c_1, c_3, c_4, c_7}. In the end, MLA generates 3 reports, namely, R_1 = {c_1, c_2, c_3, c_5}, R_2 = {c_1, c_2, c_4, c_6}, and R_3 = {c_1, c_3, c_4, c_7}. The reporting probability is given in Table 1. Take c_1 for instance: half of its prior probability is assigned to report R_1, and one fourth is assigned to each of reports R_2 and R_3, so the reporting probability of c_1 for R_1, R_2, and R_3 is 1/2, 1/4, and 1/4, respectively, as shown in Table 1. The reporting probability of the other contents can be calculated in the same manner.
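Continuing the sketch, the Mechanism Initiation phase below performs the shrinking and the vertical-line sweep; it reuses segment_alignment from the previous sketch, and the priors in the usage example are hypothetical numbers chosen to resemble Example 7, not the paper's exact instance.

```python
def mechanism_initiation(layers, prior, tol=1e-9):
    """Phase 2 of MLA: shrink layers, sweep, and pack reports."""
    totals = [sum(length for _, length in layer) for layer in layers]
    # normalise: after "shrinking", every layer has length 1
    heads = [[(c, length / t) for c, length in layer]
             for layer, t in zip(layers, totals)]
    reports, P = [], {}
    while all(heads):                                # sweep until the layers end
        step = min(head[0][1] for head in heads)     # next segment end point
        report = frozenset(head[0][0] for head in heads)
        if report not in reports:
            reports.append(report)
        for i, head in enumerate(heads):
            c, rem = head[0]
            # unshrunk length scanned on layer i is step * totals[i];
            # dividing by p(c) gives the share of c's mass put into this report
            P[(c, report)] = P.get((c, report), 0.0) + step * totals[i] / prior[c]
            if rem - step <= tol:
                head.pop(0)
            else:
                head[0] = (c, rem - step)
    return reports, P

prior = {"c1": 0.4, "c2": 0.15, "c3": 0.12, "c4": 0.11,
         "c5": 0.08, "c6": 0.08, "c7": 0.06}   # hypothetical priors
reports, P = mechanism_initiation(segment_alignment(prior, k=4), prior)
for R in reports:
    print(sorted(R), [round(P[(c, R)], 3) for c in sorted(R)])
```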
The pseudocode of the Multilayer Alignment algorithm is shown in Algorithm 1. In the beginning, MLA sorts the query contents in C according to their prior probability in descending order (line 1). Then it initiates the necessary structures and variables. Lists L_1, ..., L_k store the alignment of the k layers (lines 2-3). A variable indicates the layer being processed (line 4), and a flag indicates whether the alignment is under aligning dominant mode (line 5). A further variable records how much prior probability of the current query content is taken up by the last layer, while v records the length of each layer processed in dominated mode; both are initiated in line 6. The array keeping the lengths of the k layers is initiated in line 7. The loop in lines 8-28 aligns the query contents in C in order onto the k layers. The aligning mode is set to dominant in line 5 before processing the first query content. Under aligning dominant mode, MLA checks whether c_i should be aligned in this mode. If the answer is yes, a segment for c_i is created with length p(c_i) and added to the list of the current layer; here the constructor of a segment specifies the label and the length of the segment. The alignment of c_i on the current layer then terminates (lines 10-12). If c_i should not be aligned under aligning dominant mode, MLA turns to mode aligning dominated and calculates the length v of each of the remaining layers (lines 13-15). In aligning dominated mode, MLA executes the code in lines 16-28. If c_i can be entirely aligned onto the current layer (line 16), MLA creates a segment for c_i with length p(c_i) and adds it to the list (line 17); the length of the current layer and the bookkeeping variables are updated (lines 18-19). If the current layer does not have sufficient space to hold c_i, c_i is split into two segments: lines 21-23 align the first segment onto the current layer, and lines 26-28 align the second segment onto the next layer. Lines 24-25 deal with the special case where c_i exactly uses up the space of the current layer. At this point Segment Alignment terminates, and MLA proceeds to the phase of Mechanism Initiation. It packs the heads of the k lists into a report R (lines 30-31). Then MLA determines the length by which the vertical line moves to the right (line 32). The reporting probability related to the current report is calculated in line 35. For each layer, MLA updates the length of the head; if the head of a layer is entirely packed into a report, it is popped from the list (lines 36-39). The computation cost of MLA consists of three parts: sorting C and initiating variables, Segment Alignment, and Mechanism Initiation. The first part costs O(|C| log |C|). Segment Alignment costs O(k + |C|) since at most k + |C| segments are aligned, and each alignment costs constant time. Mechanism Initiation costs O(k(k + |C|)), which is dominated by packing at most k + |C| reports (each packing costs O(k)) and calculating at most k(k + |C|) entries of P_{C,k}. In practice, k should be set smaller than |C|. The total cost of MLA is O(|C| log |C| + k|C|).

Properties of the MLA Algorithm
This section formally proves that MLA achieves the optimal expected entropy and the optimal dp-coefficient simultaneously. We first introduce some concepts which build the necessary foundation for our formal proof.

Definition 8 (dominant content). Given the global set C of query contents, an integer k > 0, and the prior probability p(.), for every c ∈ C let dom(c) be the number of query contents with prior probability larger than p(c). A query content c ∈ C is a dominant content iff the following conditions hold: (i) dom(c) < k; (ii) ∑_{p(c_j) ⩽ p(c)} p(c_j) ⩽ p(c)(k − dom(c)).

Definition 9 (dominated content). Given the global set C of query contents, an integer k > 0, and the prior probability p(.), c ∈ C is a dominated content iff it is not a dominant content.
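Definition 8 translates directly into code; the sketch below classifies the dominant contents of a prior distribution (the hypothetical priors reuse those of the earlier alignment sketch).

```python
def dominant_contents(prior, k):
    """Return the set of dominant contents according to Definition 8."""
    result = set()
    for c, pc in prior.items():
        dom = sum(1 for v in prior.values() if v > pc)    # contents larger than c
        tail = sum(v for v in prior.values() if v <= pc)  # mass not above c
        if dom < k and tail <= pc * (k - dom):
            result.add(c)
    return result

prior = {"c1": 0.4, "c2": 0.15, "c3": 0.12, "c4": 0.11,
         "c5": 0.08, "c6": 0.08, "c7": 0.06}
print(dominant_contents(prior, k=4))   # {'c1'} for these hypothetical priors
```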
According to the process of segment alignment in MLA, each dominant content takes up an entire layer. If there are remaining layers, the dominated contents take up these layers, and none of them takes up an entire layer. In the rest of this paper, we use dominant layer and dominated layer to denote a layer taken up by a dominant content or by dominated contents, respectively. Recall the instance in Figure 1: c_1 is a dominant content and layer 1 is a dominant layer; query contents c_2, ..., c_7 are dominated contents, and layers 2, 3, and 4 are dominated layers.
Definition 10 (layering strategy). Given the global set C of query contents, an integer k > 0, the prior probability p(.) of query contents, and a reporting probability P_{C,k}, let the query contents in each report be permuted arbitrarily, and denote the jth query content of report R by c_{R,j}. A layering strategy V induced by M is a k-dimensional vector whose jth component is calculated as ∑_{R ∈ R_{C,k}} P_{C,k}(c_{R,j}, R) p(c_{R,j}). When the query contents of each report are sorted by the value of P_{C,k}(c, R) p(c) in descending order, the standard layering strategy is induced.
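Under the dict-based encoding used in the earlier sketches, the standard layering strategy of Definition 10 (and its entropy, formalized in Definition 11 below) can be computed as follows.

```python
from math import log2

def standard_layering_strategy(prior, reports, P, k):
    """Sort each report's masses descending and sum them layer by layer."""
    v = [0.0] * k
    for R in reports:
        masses = sorted((P.get((c, R), 0.0) * prior[c] for c in R), reverse=True)
        for j, m in enumerate(masses):
            v[j] += m
    return v

def strategy_entropy(v):
    """Entropy of a layering strategy; its components sum to 1."""
    return -sum(x * log2(x) for x in v if x > 0)

prior = {"a": 0.5, "b": 0.25, "c": 0.25}
reports = [frozenset({"a", "b"}), frozenset({"a", "c"})]
P = {("a", reports[0]): 0.5, ("a", reports[1]): 0.5,
     ("b", reports[0]): 1.0, ("c", reports[1]): 1.0}
v = standard_layering_strategy(prior, reports, P, k=2)
print(v, strategy_entropy(v))   # [0.5, 0.5] -> 1.0 bit
```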
According to the above definition of a layering strategy, a mechanism has multiple layering strategies. Intuitively, a mechanism assigns the prior probability of each query content to one or multiple reports. A report R contains k parts from distinct query contents, and they can be viewed as k segments on k layers. When the query contents in each report are permuted, we can build k layers by connecting all the segments on the same layer from different reports. To this end, we call the k-dimensional vector a layering strategy. Next we define the entropy of a layering strategy.
Definition 11 (entropy of a layering strategy). Given a layering strategy V = {v_1, ..., v_k}, the entropy of V is H(V) = −∑_{1⩽j⩽k} v_j log v_j. Note that the components of a layering strategy sum to 1, so H(V) is well defined.

Lemma 12. Given a mechanism M(R_{C,k}, P_{C,k}) and any layering strategy V induced by M, EE(M) ⩽ H(V).

Proof. Suppose the query contents in each report of M are arbitrarily sorted, and we get an induced layering strategy V = {v_1, ..., v_k}. Denote the set of reports with probability larger than 0 by R = {R_1, ..., R_m}, and let c_{i,j} be the jth query content of report R_i in the process of inducing V. For 1 ⩽ j ⩽ k, we have

$$v_j = \sum_{1 \leqslant i \leqslant m} P_{C,k}(c_{i,j}, R_i)\, p(c_{i,j}).$$

Write a_{i,j} = p(c_{i,j}) P_{C,k}(c_{i,j}, R_i) and Pr(R_i) = ∑_{1⩽j⩽k} a_{i,j}, so that p(c_{i,j} | R_i) = a_{i,j}/Pr(R_i) and EE(M) = −∑_{1⩽j⩽k} ∑_{1⩽i⩽m} a_{i,j} log(a_{i,j}/Pr(R_i)). By applying the log-sum inequality [7] to each j (and using ∑_{1⩽i⩽m} Pr(R_i) = 1), we have the following condition:

$$\sum_{1 \leqslant i \leqslant m} a_{i,j} \log \frac{a_{i,j}}{\Pr(R_i)} \geqslant \Big( \sum_{1 \leqslant i \leqslant m} a_{i,j} \Big) \log \frac{\sum_{1 \leqslant i \leqslant m} a_{i,j}}{\sum_{1 \leqslant i \leqslant m} \Pr(R_i)} = v_j \log v_j.$$

Summing over j yields EE(M) ⩽ −∑_{1⩽j⩽k} v_j log v_j = H(V). So we prove that EE(M) ⩽ H(V).
Lemma 13. Given a mechanism M(R_{C,k}, P_{C,k}) generated by MLA, let V be the standard layering strategy of M and let M′ be an arbitrary mechanism; then M′ has at least one induced layering strategy V′ satisfying H(V) ⩾ H(V′).
Proof. Given the mechanism M produced by MLA together with its standard layering strategy V, we prove Lemma 13 by constructing an induced layering strategy V′ for an arbitrary mechanism M′ such that H(V) ⩾ H(V′). To this end, we sort the query contents in each report of M′ as follows.
For each report R of M′, we iterate over all the query contents. For a query content c, if it is a dominant content determined by MLA and its layer index in V is j, we set the position of c in R to j. After arranging all the dominant contents, we sort the rest of the query contents in R by p(c) × P_{C,k}(c, R) in descending order and fill the remaining positions of R. In this way we construct an induced layering strategy V′ of M′, and in the following we prove that H(V) ⩾ H(V′).
Let k_d be the number of dominant layers in V, and let us first investigate the first k_d layers of V′. Each dominant content on the jth layer of V is also aligned only on the jth layer of V′; at the same time, the jth layer of V′ possibly contains dominated contents as well. So for each dominant layer the length in V′ is no smaller than that in V; i.e., V′.v_j ⩾ V.v_j for 1 ⩽ j ⩽ k_d. As a consequence, the total length of the dominated layers in V′ is no larger than that in V if k_d < k; i.e., ∑_{k_d < j ⩽ k} V′.v_j ⩽ ∑_{k_d < j ⩽ k} V.v_j.
Here we turn to a necessary observation about modifying a layering strategy at two layers so that its entropy increases. Suppose V′ is an arbitrary layering strategy whose values on layer g and layer h differ; without loss of generality, assume V′.v_g > V′.v_h. We then move a length of δ from layer g to layer h, where 0 < δ < V′.v_g − V′.v_h. It is easy to see that the entropy of the modified layering strategy is larger than the entropy of V′. Next we transform V′ into V with a series of modifications of this type between two layers with different lengths.
The transformation includes two phases. In the first phase, we make the dominated layers of V′ (here the dominated layers and dominant layers are determined by V) have the same length. Let v̄ be the average length of the dominated layers of V′. We repeat the following modification: each time we pick the dominated layers with the smallest and the largest lengths and move length from the longer to the shorter until one of them reaches v̄. Then the number of layers with length v̄ increases by at least one. After at most k − k_d modifications, phase 1 terminates, and each modification increases the entropy of V′. If the dominated layers of V′ already have the same length at the beginning of phase 1, the entropy remains unchanged.
In phase 2, we investigate each of the first k_d layers. For a layer j ⩽ k_d, let the jth dominant content of MLA be c_j. We remove a length of V′.v_j − p(c_j) from layer j and distribute it evenly to the dominated layers. After these modifications, the length of each dominated layer of V′ is no larger than that of V (denoted l_d); meanwhile, the remaining length p(c_j) of layer j is larger than l_d. According to the observation above, each modification of phase 2 does not decrease the entropy of V′. After at most k_d modifications, V′ is transformed into V.
Combining phases 1 and 2, we obtain a transformation from an induced layering strategy V′ of an arbitrary mechanism to V, which is the standard layering strategy of the mechanism produced by MLA. Each step of the transformation increases the entropy or keeps it unchanged, so we prove that H(V′) ⩽ H(V).

Lemma 14. Given a mechanism M(R_{C,k}, P_{C,k}) generated by MLA with standard layering strategy V, EE(M) = H(V).
Proof. The standard layering strategy of M restores the result of Segment Alignment in the process of MLA. Let l_1, ..., l_k be the lengths of the k generated layers. Due to the shrinking process of MLA, the initiated reports have the same ratios between the corresponding query contents on any two given layers. Thus the standard layering strategy V of M can be calculated as V = {l_1/∑_{1⩽j⩽k} l_j, l_2/∑_{1⩽j⩽k} l_j, ..., l_k/∑_{1⩽j⩽k} l_j}. For each report R_i produced in the process of inducing V, we use c_{i,j} to denote the jth query content in R_i. Then we have

$$\frac{p(c_{i,j})\, P_{C,k}(c_{i,j}, R_i)}{p(c_{i,j'})\, P_{C,k}(c_{i,j'}, R_i)} = \frac{l_j}{l_{j'}}, \quad 1 \leqslant j, j' \leqslant k,$$

so the entropy of each report R_i equals the entropy of V. As a consequence, the expected entropy of M can be calculated as

$$EE(\mathcal{M}) = \sum_{1 \leqslant i \leqslant m} \Pr(R_i)\, H(R_i) = H(V) \sum_{1 \leqslant i \leqslant m} \Pr(R_i) = H(V).$$

So we prove that EE(M) = H(V).
Lemmas 12, 13, and 14 illustrate the relationship between the expected entropy achieved by MLA and the entropy of the induced layering strategies of any other mechanism. Based on these facts, we get the following theorem.

Theorem 15. MLA achieves the optimal expected entropy.
Proof. According to Lemma 14, the mechanism M produced by MLA achieves the expected entropy H(V), where V is the standard layering strategy of M. Let M′ be an arbitrary mechanism; it has at least one induced layering strategy V′ such that H(V′) ⩽ H(V), due to Lemma 13. At the same time, EE(M′) ⩽ H(V′) according to Lemma 12. Then we have EE(M) ⩾ EE(M′), so we prove that MLA achieves the optimal expected entropy through M.

Theorem 16. MLA achieves the optimal dp-coefficient.
Proof. Given the mechanism M produced by MLA together with its standard layering strategy V, we construct an induced layering strategy V′ for an arbitrary mechanism M′ in the same way as in the proof of Lemma 13. We sort the query contents in each report of M′ as follows: for each report R of M′, we traverse its query contents; for a query content c, if it is a dominant content determined by MLA and its layer index in V is j, we set the position of c in R to j; after arranging all the dominant contents, we sort the rest of the query contents in R by p(c) × P_{C,k}(c, R) in descending order and fill the remaining positions of R. The first layer of V contains only the first dominant content, whereas the first layer of V′ contains the first dominant content entirely and possibly some dominated contents as well. So we have V.v_1 ⩽ V′.v_1. On the other hand, we know that the total length of the dominated layers in V is no smaller than that in V′; at the same time, each dominated layer has the same length in V, while the kth layer in V′ has the smallest length. Then we have V.v_k ⩾ V′.v_k. In M the dp-coefficient is actually ln(V.v_1/V.v_k). Denote the set of reports produced by M′ by R′ = {R_1, ..., R_m}, let c_{i,j} be the jth query content in report R_i for 1 ⩽ i ⩽ m and 1 ⩽ j ⩽ k, and write a_{i,j} = p(c_{i,j}) P′_{C,k}(c_{i,j}, R_i). Then we have

$$dp(\mathcal{M}) = \ln \frac{V.v_1}{V.v_k} \leqslant \ln \frac{V'.v_1}{V'.v_k} = \ln \frac{\sum_{1 \leqslant i \leqslant m} a_{i,1}}{\sum_{1 \leqslant i \leqslant m} a_{i,k}} \leqslant \max_{1 \leqslant i \leqslant m} \ln \frac{a_{i,1}}{a_{i,k}} = \max_{1 \leqslant i \leqslant m} \ln \frac{p(c_{i,1} \mid R_i)}{p(c_{i,k} \mid R_i)} \leqslant dp(\mathcal{M}').$$

That is to say, the dp-coefficient of M is no larger than that of an arbitrary mechanism M′. So we prove that MLA achieves the optimal dp-coefficient through M.

Evaluation
This section evaluates the performance of our proposed MLA algorithm on three real-life datasets, and the evaluation results report the comparison between MLA and three existing approaches, including those proposed in [8] and [3].

Evaluation Setting
Datasets. To obtain the prior probability of query contents, we employ three real-life datasets: two from [9], which contain street objects in the states of Texas and California, respectively, and one from [10], which contains worldwide coordinates and geotags. Each street object is labeled with a coordinate and a set of keywords. We use the coordinates as locations and take the keywords and geotags as query contents. We divide each dataset into 8 × 8 regions and calculate a prior distribution of query contents for each region. Given the number of query contents |C|, we pick the query contents with top-|C| frequency and use them to compute the prior probability. In the two street-object datasets, some keywords such as city names and state names are removed since they dominate the frequency but provide no meaning. For each dataset, the average measures over its regions are reported in the evaluation results. The details of the three datasets are introduced in Table 2.
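The preprocessing described above can be sketched as follows; the field names and the tiny example objects are illustrative, since the schemas of the real datasets may differ.

```python
from collections import Counter

def region_priors(objects, n_contents, grid=8):
    """Bucket objects into a grid x grid partition and build per-region priors
    over the top-n_contents most frequent keywords."""
    xs = [o["x"] for o in objects]
    ys = [o["y"] for o in objects]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    regions = {}
    for o in objects:
        i = min(grid - 1, int((o["x"] - x0) / (x1 - x0 + 1e-9) * grid))
        j = min(grid - 1, int((o["y"] - y0) / (y1 - y0 + 1e-9) * grid))
        regions.setdefault((i, j), Counter()).update(o["keywords"])
    priors = {}
    for cell, counts in regions.items():
        top = counts.most_common(n_contents)       # top-|C| frequent contents
        total = sum(f for _, f in top)
        priors[cell] = {w: f / total for w, f in top}
    return priors

objs = [{"x": 1.0, "y": 2.0, "keywords": ["cafe", "wifi"]},
        {"x": 5.0, "y": 7.5, "keywords": ["hospital"]},
        {"x": 1.2, "y": 2.1, "keywords": ["cafe"]}]
print(region_priors(objs, n_contents=2))
```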
Testbed. We implement our proposed MLA and the competitors in Java (JDK version 1.8.0_151). All of the evaluation is conducted on a PC with an i7-7700 CPU, 8 GB memory, and a 1 TB 7200 rpm hard disk.
Query Generation. For each prior distribution obtained for a region, we generate 1000 queries following the prior distribution to test the competitor whose reports are generated at query time; its internal loop count is set to 50, as in [3]. For MLA and the other competitors, we directly evaluate the privacy measures using the mechanism obtained for each prior distribution.
Privacy Measures. We employ three privacy measures to evaluate the content privacy achieved by MLA and its competitors: (1) expected entropy; (2) dp-coefficient; and (3) effective k. Expected entropy and dp-coefficient are introduced in Section 3. Effective k counts the query contents in a report whose posterior probability is positive, and it measures the uncertainty of the reports in a mechanism.
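Effective k is straightforward to compute from a mechanism under the dict-based encoding of the earlier sketches:

```python
def effective_k(prior, report, P):
    """Number of contents in a report with positive posterior probability."""
    return sum(1 for c in report if prior[c] * P.get((c, report), 0.0) > 0)

prior = {"a": 0.5, "b": 0.25, "c": 0.25}
R = frozenset({"a", "b"})
P = {("a", R): 0.5, ("b", R): 1.0}
print(effective_k(prior, R, P))   # 2: no content in R can be eliminated
```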
Parameters. We test the effects of two parameters on the employed privacy measures: the number of query contents in a report (denoted k) and the number of query contents in the global set C (denoted |C|). In the following evaluation, k is set to 5, 10, 15, 20, and 25, with a default value of 10. Parameter |C| is set to 50, 60, 70, 80, 90, and 100, with a default value of 80.

Evaluation Results
Figures 3 and 4 depict the expected entropy achieved by MLA and its competitors. We first study the effects of parameter k on the expected entropy in Figure 3. Here the total number of query contents in C is set to 80, and k is increased from 5 to 25. As shown in Figure 3, our proposed MLA achieves the best expected entropy on all three real-life datasets. This is consistent with the fact that MLA achieves the optimal expected entropy. The expected entropy achieved by MLA and its competitors grows with parameter k, since a larger k improves the uncertainty of the reports in a mechanism. On the most skewed dataset, MLA outperforms the competitors by a larger margin than on the other two datasets. The reason is that MLA splits the large prior probability of a query content over a larger number of reports; thus it is more suitable for dealing with skewed prior distributions of query contents. The competitors, on the other hand, (approximately) keep the ratio of the posterior probabilities of two query contents the same as that of their prior probabilities.
Compared to the two more uniform datasets, the expected entropy achieved on the most skewed dataset is correspondingly smaller, since a more skewed distribution of query content prior probability decreases the optimal expected entropy. Figure 4 presents the achieved expected entropy when parameter |C| grows from 50 to 100 while k is fixed at 10. MLA again outperforms its competitors in terms of expected entropy. When |C| grows, the expected entropy of MLA and two of the competitors slightly increases, while the remaining competitor gets decreasing expected entropy. The reason is that an increased |C| relieves the skewness of the prior distribution of query contents, so the former algorithms achieve better expected entropy. However, due to the process of the remaining competitor, querying the most frequent contents makes fewer query contents in its reports eliminable; when an increasing |C| relieves the effects of the top frequent contents, more query contents in its reports get eliminated. Consistent with what is shown in Figure 3, a larger improvement of MLA over the competitors is obtained on the most skewed dataset, while better expected entropy is achieved on the two more uniform datasets.
Next we investigate the dp-coefficient of MLA and its competitors. The privacy measure of dp-coefficient depicts the uncertainty of the reports in a mechanism: a smaller dp-coefficient means that it is more difficult for the adversary to eliminate a query content from any report. The effects of parameter k on the dp-coefficient are studied in Figure 5, where we fix parameter |C| at 80 and increase k from 5 to 25. On all the datasets, MLA achieves a significantly better dp-coefficient than the competitors. When k increases, the dp-coefficient of all the algorithms grows, since more query contents are packed into the same report. On the most skewed dataset, a larger dp-coefficient is obtained, because the skewness increases the difference of the prior probabilities of the query contents in the same report. On the two more uniform datasets, a very small dp-coefficient is achieved by MLA when k is set to 5 or 10. For the other cases of k, the dp-coefficient of MLA is almost always smaller than 1, which means very good uncertainty among the query contents in any report. The competitors, on the other hand, suffer larger dp-coefficients of around 4 and 2 on the different datasets.
The effects of parameter |C| on the dp-coefficient are investigated in Figure 6. When we fix k at 10 and increase |C| from 50 to 100, the competitors produce nearly constant dp-coefficients: larger than 2.5 on the most skewed dataset and around 1.5 on the other two. In contrast, the dp-coefficient of MLA decreases as |C| grows, since a larger |C| relieves the skewness of the prior distribution. On the most skewed dataset, MLA achieves a dp-coefficient smaller than 1.5. For the two more uniform datasets, MLA produces very small dp-coefficients: it obtains the ideal dp-coefficient of 0 on one dataset, and a value very close to 0 on the other. This brings significant difficulty to an adversary trying to infer the actual query content from any report of MLA. Generally speaking, MLA achieves a much better dp-coefficient than its competitors, and it is able to produce dp-coefficients close to 0 for more uniform datasets.
Finally, we test the effective k of MLA and its competitors in Figure 7. Given the values of k and |C|, the same effective k is obtained on the different datasets, so we report the effects of parameters k and |C| on effective k in Figures 7(a) and 7(b), respectively, rather than for each dataset individually. As shown in Figure 7, MLA and two of the competitors achieve the optimal effective k, equal to the value of k. In contrast, the remaining competitor provides an effective k smaller than k. This illustrates the effectiveness of the former algorithms with regard to the impossibility of eliminating any query contents from a report. We argue that an effective k-anonymity mechanism should provide an effective k equal to the value of k.
In summary, MLA achieves the best privacy measures in terms of expected entropy, dp-coefficient, and effective k simultaneously, which is consistent with our theoretical analysis in Section 5.
Related Work

Location privacy and content privacy are recognized in location-based services. Solutions for preserving location privacy and content privacy in location-based services have mainly focused on the cloaking technique, such as [5]. The cloaking technique employs a third-party server to execute spatial generalization algorithms so that the querier is hidden among at least k − 1 other users. However, the third-party server may become the single point of failure for privacy or a performance bottleneck of query processing. To this end, a number of client based solutions [2, 3, 6, 8] have been proposed recently. Reference [3] works on the problem of generating proper dummies for the locations in queries reported to CPS servers, so as to hide the user's actual location. In [3], 2k locations with probability similar to the user's location are chosen as dummy candidates, and k − 1 of them are randomly selected as the final dummies. This approach obtains good entropy for the k locations in the reported query. Although this solution includes a random nature, the posterior probabilities of the k reported locations are still different due to the process of dummy selection, and the guaranteed privacy is not clear. Reference [6] employs a cache to avoid submitting queries to CPS servers as much as possible and thus prevents the leakage of the user's location. Reference [2] proposes a mechanism for protecting content privacy in a continuous manner: a set of k query contents is generated for a traveling path, and the user submits the same queries along the path to avoid privacy breach. This fits continuous querying; however, there is no privacy guarantee, since it simply chooses query contents with probability larger than a given threshold as candidates. In summary, server based k-anonymity suffers from a single point of failure, and the existing client based solutions do not provide a provable privacy guarantee for the k query contents reported to the CPS server. Reference [24] studies improving geo-indistinguishability with multiple criteria for better location privacy; however, this approach cannot be adopted for content privacy due to utility concerns. Reference [25] studies protecting privacy for smartphone usage, and this is parallel to our work. Recommendation in location-based systems [26] is getting more and more attention, and a location privacy preservation method is proposed for review publication in location-based systems in [27]. The notion of k-anonymity is also developed in statistical databases in [28, 29].
Differential privacy was first introduced in statistical databases [30]. The intuitive idea of differential privacy is that a single change of the input should not modify the output significantly. By this guarantee the adversary cannot recognize the input among all possible inputs similar to the real one. Due to the simple and clean nature of differential privacy, it has been adopted widely, for example, in machine learning [31], statistical databases [32-37], data mining [38], graph data [39], data analytics [40], and crowdsourcing [15]. Recent research has started combining correlation [41] and personalization [42] with original differential privacy. Our work is parallel to this large body of differential privacy research: we combine differential privacy and k-anonymity to provide guaranteed privacy in interactive cyberphysical systems. Differential privacy has also been adopted in the literature of privacy protection in location-based services; [43] ensures that an adversary will not gain significant information about a user's location after a query is reported, which is achieved by making the ratio of two nearby locations' posterior probabilities similar to that of their prior probabilities. Mechanisms following or adopting similar privacy guarantees have been presented to optimize privacy or utility [44]. Besides k-anonymity and differential privacy, a number of works customize semantic privacy metrics, such as [45-47] in social networks. This paper also defines privacy measures based on entropy and differential privacy.

Conclusion
This paper investigates preserving content privacy in interactive cyberphysical systems through k-anonymity based mechanisms. We present two privacy metrics, named expected entropy and dp-coefficient, which are based on entropy and differential privacy, respectively, and formulate the problem of achieving the optimal k-anonymity for content privacy in interactive cyberphysical systems based on these privacy metrics. An algorithm, MLA, consisting of two phases, namely, Segment Alignment and Mechanism Initiation, is proposed to establish mechanisms achieving the optimal k-anonymity. Theoretical analysis illustrates the attractive property that MLA achieves the optimal expected entropy and the optimal dp-coefficient simultaneously. We conduct an evaluation based on three real-life datasets, testing three privacy metrics, namely, expected entropy, dp-coefficient, and effective k, which depict the uncertainty of the reports in mechanisms. The evaluation results demonstrate that MLA outperforms its competitors, including recent client based solutions, over all the employed privacy metrics, and these results are consistent with the fact that MLA achieves the optimal k-anonymity for content privacy in interactive cyberphysical systems.

Figure 7: Effective k versus parameters k and |C|.
Figure 1: Segment alignment for the instance consisting of 7 query contents in Example 7. The first layer is taken up by c_1 under aligning dominant mode, and the remaining 3 layers are taken up by c_2, c_3, c_4, c_5, c_6, and c_7 under aligning dominated mode. Here c_{i,j} means the jth part of content c_i.

Table 1: The reporting probability for Example 7.

Table 2: Details of the datasets used in the evaluation.