A Sampling-Based Method for Highly Efficient Privacy-Preserving Data Publication


Introduction
The widespread adoption of cutting-edge artificial intelligence techniques usually requires large-scale data [1]. With smart devices and easy network access, large volumes of data are now produced not only by dominant enterprises but also by individual contributors such as IoT devices. Therefore, collecting content over wireless links has been considered a fundamental task for data processing and AI enhancement [2]. However, concerns about privacy and resource consumption rise accordingly. The ubiquitous availability of data sources has broken the boundary between cyber life and physical life. It is believed that every aspect of our life is recorded by some data, which poses numerous challenges for privacy-preserving data sharing [3]. Meanwhile, because data are stored on personal devices, they are usually uploaded via wireless or wired networks. The publication of data is therefore also resource-consuming, especially for multimedia content. Both factors significantly hinder pervasive data collection and make it a nontrivial problem. Therefore, this paper proposes a sampling-based strategy for private data publication.
In practice, data consumers can accept, or are even more interested in, statistics about the data instead of the detailed contents from every contributor. Such statistics can be sufficient and more reliable for analysis and decision-making. For example, service providers may use the scale of traffic loads for traffic prediction, while regular users can plan their routes based on such statistics [4]. Among these statistics, the histogram, which provides the distribution of underlying facts, is believed to be essential for data analysis [5]. It may act as an index showing the portion of users falling in each category. Meanwhile, the histogram also provides insights for privacy preservation, as individuals do not disclose their original contents to data brokers or consumers.
To formalize such insights, local differential privacy (LDP) [6] has been proposed and is considered a novel paradigm for privacy preservation in distributed settings. LDP extends the original differential privacy by removing the requirement of a trusted data curator. In a typical LDP framework, each participant locally holds her contents and reports encoded and obfuscated results to the data broker, which aggregates the contents to generate statistics. In this way, the individual contents are protected, while the noise can be reduced during aggregation. Existing works address different types of statistics. For categorical values, heavy hitters, frequent itemsets, and many other statistics are investigated [7]. For numerical values, the mean value, the summation, and some other aggregation queries are studied [8]. Meanwhile, there is also a batch of studies focusing on efficient and fair data publication under various scenarios [9][10][11].
However, current works on LDP require the encoding of original contents, which is usually bandwidth-consuming. Take the random response as an example [12]. It encodes the value into K folds, where each fold represents one category of value, like the visited website. As all K folds are encoded, the bandwidth for the encoded contents is large. This becomes extremely difficult when the set of candidate values is large or even infinite (e.g., numerical values). Although some works have been conducted, they have not thoroughly considered the problem.
Fortunately, the data consumers, when dealing with statistics, can actually accept minor uncertainty in the results. This is due to the inherent bias underlying the collected data, where the contributors themselves are also samples of the whole population. Meanwhile, statistics with bounded minor errors will not affect decision-making. Therefore, it is interesting to study whether we can further reduce bandwidth consumption while maintaining utility and privacy preservation.
As a result, this work proposes a novel framework for the publication of histograms in distributed settings. In the framework, data contributors locally hold their contents, like their income data. The data consumers request histograms with different granularities. The data brokers, which are assumed to be semihonest, act as the coordinators among them. They collect the results from participants and may try to infer the raw contents beneath them. A sampling-based algorithm is designed, where the raw data are first encoded with randomized perturbation, and then a bit-level sampling strategy is applied for publication. The data brokers decode the sampled results and respond to consumers with aggregated histograms.
In the framework, all participants are assured of local differential privacy, which is theoretically proved under the sampling strategy. Furthermore, we also propose a mechanism where participants can efficiently derive the encoding scheme under multiple histograms with heterogeneous queries. As for the sampling, two strategies are given, based on whether queries place the same focus on all intervals. We prove the unbiased results for the first sampling strategy and the optimized allocation of bandwidth under the second strategy. Finally, we evaluate the performance on real-world datasets. The results demonstrate the efficiency of our methods. As far as we know, this is the first study on sampling-based histogram publication over numerical values under local differential privacy. The main contributions of this work include the following:
(i) A novel framework for efficient and privacy-preserved histogram publication over multiple participants
(ii) Two sampling-based strategies for distributed histogram publication under LDP
(iii) Theoretical analysis on the accuracy, efficiency, and privacy preservation
(iv) Evaluation on real-world data traces to demonstrate the effectiveness of the proposed methods
The rest of the paper is organized as follows. Section 2 reviews the related literature. Section 3 presents the problem formulation and some preliminaries. Section 4 introduces sampling-based algorithms for histogram publication. The evaluation results are shown in Section 5. Section 6 concludes the paper.

Related Work
2.1. Privacy-Preserved Data Publication. The publication of private data has been extensively studied during the past decades. Typical techniques including K-anonymity have been proposed [13][14][15][16], where the sensitive contents are mixed and obfuscated before publication. However, these studies usually assume that adversaries have limited background knowledge and are vulnerable to specific types of attacks. Differential privacy, as a novel index for privacy preservation, allows the existence of arbitrary attackers. There are also some studies investigating histogram publication under differential privacy, focusing on different types of data [17,18]. They also exploit the underlying properties of these data to further reduce the degree of noise [5,19,20]. However, they assume the existence of a trusted data curator to coordinate the data publication, which is usually infeasible for distributed data publication. Finally, there are also some studies considering fairness and other issues within private data publication [21][22][23]. However, they fail to properly reduce the bandwidth consumption and are not compatible with histogram publication.
2.2. Local Differential Privacy. Local differential privacy [6] provides a novel paradigm for distributed data publication under differential privacy. It allows multiple data contributors to privately aggregate their contents when the data broker is semihonest. Multiple types of statistics are studied, including the publication of graph structures [24], range counting [25], and the histogram distribution [26]. There are also some studies investigating the efficiency of data publication, ranging from the RAPPOR and Basic RAPPOR methods [12] proposed by Google to sophisticated methods where more mathematical solutions are applied. Reference [27] reviews current studies on LDP and provides guidelines for applications. Meanwhile, there is also a batch of studies on the publication of numerical values, where statistics like mean values are considered [8,28,29]. However, these studies tend to encode the numerical value into several fixed values, and the perturbed contents may fall out of the original range. This may reduce the utility of published contents [30,31]. Therefore, the design of an efficient mechanism for histogram publication under local differential privacy is still a challenging topic.
2.3. Sampling-Based Data Collection. Finally, the sampling-based strategy has also been studied for data collection from multiple contributors. One typical domain where the sampling strategy functions well is the Internet of Things. The wireless and battery-powered sensors and actuators are usually limited in resources. Sampling-based data collection can balance the accuracy of the results against the devoted resources. Corresponding studies are conducted on statistics like Top-K values and data sketching.
As for the combination of sampling strategies and privacy preservation, there are also some studies arguing that the sampling component can strengthen the degree of differential privacy. However, these studies require content-level sampling, which means the sampled contributors save no resources. Applying the sampling strategy while flexibly balancing the bandwidth consumption is still not well addressed.

Problem Formulation
This section first provides the problem formulation for distributed histogram publication, including the network and attacking models. Then, preliminaries on local differential privacy are given.

Network Model.
The whole platform consists of three parties: the data brokers, the data consumers, and the data contributors. Initially, data consumers post their histogram queries to data brokers, denoted as $l_1, l_2, \dots, l_M$. To simplify our model, $l_i$ indicates both the $i$th query and the interval length of the histogram requested by the query, i.e., the granularity of the histogram. Different consumers may request queries with different granularities, as they can have heterogeneous purposes. For example, taxi companies request a coarse-grained level of traffic loads to guide the deployment of their services, while navigation apps expect fine-grained histograms to generate a fast route. Upon receiving the requests, the data brokers generate a data collection plan among participants. The plan consists of a set of consecutive intervals $[D_0, D_1), [D_1, D_2), \dots, [D_{K-1}, D_K]$, together with other parameters like the sampling ratio. As for the intervals, $D_0 = D_L$ indicates the minimum value of all contents, while $D_K = D_U$ refers to the maximum value of all contents. In this framework, we focus on the snapshot query, such that the data brokers collect queries from all consumers before generating a plan. Therefore, the intervals are generated based on all queries.
After receiving the plan from data brokers, data contributors encode and report their contents. Assume $N$ contributors exist in the system, denoted as $\{u_1, u_2, \dots, u_N\}$. Each of them holds one content $d_i$, with $D_L \le d_i \le D_U$. All contents belong to one dimension or can be applied for the same type of query, like the humidity level of a local area or the various sensing data capturing the traffic congestion. We denote the total bandwidth used for reporting $d_i$ to the data brokers as $B_i$.
Finally, the data broker collects results from all contributors. It will first decode the reported contents and then aggregate them into different intervals. In a final step, the data broker will generate and distribute corresponding histograms to different consumers and charge them accordingly.
3.1.1. Adversarial Model. Due to the latent value of contents held by contributors, both data brokers and consumers are assumed to be semihonest. This means they will not break into contributors' devices to steal the contents but will follow the framework and try to infer the original contents. Therefore, contributors should carefully encode their contents to thwart such inference attacks. Local differential privacy has been considered an extension of differential privacy that removes the requirement of a trusted data broker. LDP still allows arbitrary background knowledge from the adversaries and can protect individual contents within the statistics. To achieve the LDP property, data contributors publish a perturbed version of $d_i$, which is either a noisy value or a related data structure. The definition of local differential privacy is shown in Definition 1.
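Definition 1 (local differential privacy). An algorithm $Q(\cdot)$ satisfies $\varepsilon$-local differential privacy ($\varepsilon$-LDP), where $\varepsilon \ge 0$, if and only if for two arbitrary contents $T_i$ and $T_j$ and every output $T^* \in \mathrm{Range}(Q)$:
$$\Pr[Q(T_i) = T^*] \le e^{\varepsilon} \cdot \Pr[Q(T_j) = T^*],$$
where $\mathrm{Range}(Q)$ denotes the set of all possible outputs of $Q(\cdot)$.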
Based on its definition, the local differential privacy ensures that no significant information will be disclosed to the data receivers. The parameter ε indicates the degree of privacy, where a larger ε means data contributors are less sensitive and will produce more accurate results.

Design Objective.
Based on the network model and the attacks from the adversaries, data contributors aim to both reduce their bandwidth consumption and protect their raw contents during data uploading. The data brokers are concerned with coordinating the trading between other parties, so their focus is to generate a proper plan for data collection. The plan should be efficient and provide reasonable performance. Generally, assume that the accumulated variance for each query $l_i$ is $\mathrm{Var}(l_i)$. The design objective is given as follows:

Wireless Communications and Mobile Computing
which means the derived result should be unbiased, the total bandwidth should be constrained, and the privacy for each contributor should be preserved.
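As a hedged illustration only, one way to write an objective that matches the three requirements stated above (unbiasedness, bounded bandwidth, and per-contributor LDP) is sketched below. The symbols $\hat{H}_i$ (estimated histogram for query $l_i$), $H_i$ (true histogram), and the use of $B_0$ as the total bandwidth budget are assumptions introduced for this sketch and are not taken from the original formulation.

```latex
\begin{aligned}
\min \quad & \sum_{i=1}^{M} \mathrm{Var}(l_i) \\
\text{s.t.} \quad & \mathbb{E}\big[\hat{H}_i\big] = H_i, \qquad 1 \le i \le M, \\
& \sum_{i=1}^{N} B_i \le B_0, \\
& Q(\cdot) \text{ satisfies } \varepsilon\text{-LDP for every contributor } u_i .
\end{aligned}
```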

3.2. Preliminaries. This part introduces preliminaries for LDP. It first introduces one basic encoding-decoding-based method for data uploading and then addresses the compositional properties of LDP. The random response method provides some basic ideas for the implementation of LDP. We take the Basic RAPPOR proposed by Google as an example.
In Basic RAPPOR, assume there is a L-bit vector with binary entry, denoted as Then, V ′ can be generated by the randomized response: Finally, V ′ will be sent to the data curator for subsequent analysis. Actually, this mechanism of perturbation achieves the LDP property for vector V, which is proved by a previous study [27].
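The encode-then-perturb idea can be illustrated with a minimal sketch. It assumes a one-hot L-bit vector and a symmetric per-bit randomized response whose flip probability is $1/(1 + e^{\varepsilon/2})$, a common parameterization that gives $\varepsilon$-LDP because two one-hot vectors differ in at most two bits. The exact perturbation probabilities used by the paper are not recoverable from the text, and the function names are hypothetical.

```python
import math
import random

def one_hot(index, length):
    """Encode a category index as an L-bit vector with a single 1."""
    return [1 if i == index else 0 for i in range(length)]

def keep_probability(epsilon):
    """Assumed retention probability e^(eps/2) / (1 + e^(eps/2)).
    Two one-hot vectors differ in at most two bits, so the likelihood
    ratio of any output is bounded by (p / (1 - p))^2 = e^eps."""
    return math.exp(epsilon / 2) / (1 + math.exp(epsilon / 2))

def perturb(vector, epsilon, rng):
    """Randomized response: keep each bit with probability p, flip otherwise."""
    p = keep_probability(epsilon)
    return [b if rng.random() < p else 1 - b for b in vector]

rng = random.Random(0)
v = one_hot(2, 5)                      # V: the locally held vector
v_prime = perturb(v, epsilon=2.0, rng=rng)  # V': what is sent to the curator
```

The seed and the values `index=2`, `length=5`, and `epsilon=2.0` are arbitrary illustrative choices.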

Theorem 2. For an arbitrary vector
Data sampling, where contributors upload only part of their contents, is also a major strategy for saving resources in distributed data collection. It is believed that sampling can further reduce the disclosure of information, and Li et al. have theoretically proved this effect [32].
Theorem 3. Assume $F(\cdot)$ to be an $\varepsilon$-differentially private algorithm and $S(\cdot)$ to be a sampling algorithm. Then, if $S(\cdot)$ is first applied to a dataset, which is later perturbed by $F(\cdot)$, the derived result satisfies $\ln(1 + P_0(e^{\varepsilon} - 1))$-differential privacy, where $P_0$ is the sampling probability.
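The bound in Theorem 3 is easy to evaluate numerically. The short sketch below computes the amplified privacy parameter for a given $\varepsilon$ and sampling probability $P_0$; the function name is hypothetical.

```python
import math

def amplified_epsilon(epsilon, sampling_probability):
    """Privacy parameter after sampling: ln(1 + P0 * (e^eps - 1))."""
    return math.log(1 + sampling_probability * (math.exp(epsilon) - 1))

# Sampling with P0 < 1 yields a strictly smaller (stronger) parameter.
full = amplified_epsilon(1.0, 1.0)   # no sampling: parameter is unchanged
half = amplified_epsilon(1.0, 0.5)   # half of the contents sampled
```

With $P_0 = 1$ the formula returns $\varepsilon$ itself, and with $P_0 = 0$ it returns 0, which matches the intuition that unsampled contents disclose nothing.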
Finally, the compositional property of differential privacy can also be merged with the LDP.
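Theorem 4. Assume that each $F_i(\cdot)$, $1 \le i \le k$, is an $\varepsilon_i$-differentially private algorithm. Then, applying all $F_i(\cdot)$ to one data item $d_0$ provides $\sum_{i=1}^{k} \varepsilon_i$-differential privacy.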

Sampling-Based Histogram Publication
This part first provides a scheme applied for plan generation. Then, we argue that the efficiency could be further improved among all contributors by uploading partial results. The second part gives a bit-level sampling algorithm, where the bandwidth consumption among contributors is reduced and balanced. The third part proposes a biased sampling mechanism, where the sampling ratios can be adjusted and optimized according to the requests of consumers.

Data Plan for Multiple Queries.
Within the framework, the data broker receives multiple queries $\{l_1, l_2, \dots, l_M\}$ from consumers. It then generates a corresponding plan $[D_0, D_1), [D_1, D_2), \dots, [D_{K-1}, D_K]$ from these queries. The main objective is to reduce the number of intervals in the plan, as each interval may incur extra bandwidth consumption. The data broker applies the following strategy for plan generation.
The data broker iteratively generates intervals according to each query $l_i$, whose checkpoints take the form $D_0 + l_i \cdot j$ for nonnegative integers $j$ within $[D_L, D_U]$. Then, the data broker combines all intervals from all queries. Specifically, every two consecutive checkpoints $D_0 + l_i \cdot j$ and $D_0 + l_m \cdot n$ from one or two queries compose a new interval $[D_0 + l_i \cdot j, D_0 + l_m \cdot n)$. The newly generated intervals are used by the data plan as $[D_0, D_1), [D_1, D_2), \dots, [D_{K-1}, D_K]$. Therefore, the histogram for each query can be derived from such intervals by iteratively merging the results in several consecutive intervals. The whole procedure takes $O(M \cdot (D_U - D_L))$ time.
In the baseline method, data contributors may follow the typical random response to locally encode and obfuscate their contents. The results are uploaded to the data curators, which decode the results and publish the aggregated histograms to consumers.
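The plan generation described above can be sketched as merging the checkpoints of all queries and then re-aggregating the merged intervals per query. The sketch assumes integer endpoints and interval lengths to avoid floating-point boundary issues; the function names and the example values are hypothetical.

```python
def build_plan(lower, upper, lengths):
    """Merge the checkpoints lower + l*j of every query into one sorted list."""
    points = {lower, upper}
    for length in lengths:
        j = 0
        while lower + length * j < upper:
            points.add(lower + length * j)
            j += 1
    return sorted(points)

def histogram_for_query(plan, counts, lower, upper, length):
    """Sum the counts of merged intervals that fall inside each query bucket.
    Every merged interval lies inside exactly one bucket of the query,
    because the query's own checkpoints are part of the plan."""
    n_buckets = -(-(upper - lower) // length)  # ceiling division
    result = [0] * n_buckets
    for start, count in zip(plan[:-1], counts):
        result[(start - lower) // length] += count
    return result

plan = build_plan(0, 12, [3, 4])           # two queries with lengths 3 and 4
counts = [1] * (len(plan) - 1)             # placeholder count per merged interval
coarse = histogram_for_query(plan, counts, 0, 12, 4)
fine = histogram_for_query(plan, counts, 0, 12, 3)
```

Here the merged plan has boundaries 0, 3, 4, 6, 8, 9, 12, so one set of six uploaded counts can answer both queries.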

Bit-Level Sampling for Histogram Publication.
This part introduces the sampling-based algorithm for histogram publication. Intuitively, the data curator can randomly pick a group of contributors for histogram publication, or the contributors can locally determine whether to participate in data processing with some sampling ratio. However, in both cases the sampled contributors have to apply the encoding mechanisms and fully upload the vectors, which is unbalanced and undesirable. Therefore, this part proposes an improved algorithm to sample the contents from another dimension.

The algorithm is named Bit-Sampling Histogram Publication (BSHP for short).
The main idea of BSHP is to implement bit-level sampling among contributors instead of one-time participant selection. Initially, BSHP follows the same steps as the baseline method, where the data curator processes and distributes the queries to contributors. After locally encoding their data values, contributors follow the data collection requests from the data curator. In the $j$th iteration, the data curator announces a sampling ratio $P_j$ to all contributors $u_i$, where $0 \le P_j \le 1$. Then, every $u_i$ locally executes Bernoulli sampling with probability $P_j$. Contributors who sampled 1 upload the corresponding bit $D'_{ij}$ to the data curator. This request phase repeats for $K_0$ rounds until all bits are requested.
In the decoding phase, the data curator first estimates the counting of data values in each interval of the integrated partition, where Then, the data curator derives the results for different queries in the same way with the baseline method, and the whole algorithm terminates.
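The request and decoding phases can be simulated for one interval with the sketch below. It is an illustration under stated assumptions, not the paper's exact estimator: each contributor keeps its bit with probability $p$ and flips it otherwise, uploads it with probability $P$, and the curator inverts $E[\text{ones}] = P \cdot (R \cdot p + (N - R)(1 - p))$. All names and parameter values are hypothetical.

```python
import math
import random

def bshp_round(bits, sampling_probability, keep_probability, rng):
    """One request round for a single interval: every contributor perturbs
    its bit locally and uploads it only if its Bernoulli draw succeeds."""
    uploaded_ones = 0
    for bit in bits:
        perturbed = bit if rng.random() < keep_probability else 1 - bit
        if rng.random() < sampling_probability:
            uploaded_ones += perturbed
    return uploaded_ones

def estimate_count(uploaded_ones, n, sampling_probability, keep_probability):
    """Assumed decoder: invert E[ones] = P * (R*p + (n - R)*(1 - p))."""
    q = 1 - keep_probability
    return (uploaded_ones / sampling_probability - n * q) / (keep_probability - q)

rng = random.Random(7)
n, true_count = 20000, 6000
bits = [1] * true_count + [0] * (n - true_count)
p = math.exp(1.0) / (1 + math.exp(1.0))   # keep probability for epsilon = 2
ones = bshp_round(bits, 0.8, p, rng)
estimate = estimate_count(ones, n, 0.8, p)
```

Only about 80% of the bits are uploaded in this run, yet the estimate stays close to the true count because the decoder rescales by the sampling probability.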

Analysis. This part analyzes the performance of BSHP. We first investigate the accuracy of histogram publication and then discuss issues on privacy preservation.
The following theorem indicates that BSHP provides an unbiased estimation for the counting of each interval in the integrated partition.
Proof. Within BSHP, the data curator estimates $R_k$ as We denote $R_k^0$ as the total size of contents belonging to interval $[W_k, W_{k+1}]$. According to the definition of the random response, we have where $V_s$ refers to the sampling variable indicating whether a contributor is selected, and $V_r$ and $V_r'$ indicate whether the corresponding bit is retained or reversed. We have the following notation: Then, As $V_s$, $V_r$, and $V_r'$ are independent variables, Therefore, Then, we have which means $R_k$ is an unbiased estimator for the count of data values in $[W_k, W_{k+1}]$.
According to Theorem 5, BSHP provides an unbiased estimation for each interval in the integrated partition. Therefore, the final output for each query is also an unbiased estimation, as the final results are derived by combining these unbiased counts.
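To make the unbiasedness argument concrete, the following worked derivation uses an assumed symmetric randomized response in which each bit is retained with probability $p$ and reversed with probability $q = 1 - p$. The symbols $p$, $q$, $C_k$ (number of uploaded ones for interval $k$), and the estimator $\hat{R}_k$ are introduced here for illustration and need not coincide with the paper's exact expressions.

```latex
\begin{aligned}
\mathbb{E}[C_k]
  &= \sum_{i=1}^{N} \mathbb{E}[V_s]\,\big(p\, b_{ik} + q\,(1 - b_{ik})\big)
   = P_k \big(p\, R_k^0 + q\,(N - R_k^0)\big), \\
\hat{R}_k
  &= \frac{C_k / P_k - N q}{p - q}, \\
\mathbb{E}[\hat{R}_k]
  &= \frac{p\, R_k^0 + q\,(N - R_k^0) - N q}{p - q}
   = \frac{(p - q)\, R_k^0}{p - q}
   = R_k^0 .
\end{aligned}
```

Here $b_{ik} \in \{0, 1\}$ is the true bit of contributor $u_i$ for interval $k$, and the first line relies on the independence of the sampling and perturbation variables.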
The variance of the estimated result for an interval is calculated in Lemma 6. The main idea of the lemma is to combine the variance from the two steps of sampling and derive the correlation between the variance and the two parameters $P_k$ and $\varepsilon_0$.
Lemma 6. For each interval $[W_k, W_{k+1}]$, with parameters $P_k$ and $\varepsilon_0$, the variance follows
Proof. First of all,

Assume As $V_s$ is the Bernoulli sampling, The same conclusion also holds for $V_r$ and $V_r'$. Therefore,
As for each data consumer, the variance of the histogram is determined by the summation of variances in different intervals. We omit the conclusion here as the summation is straightforward.
Now we analyze the privacy preservation property of BSHP. In BSHP, each contributor uploads only partial results of their vectors. Meanwhile, the perturbation and sampling can be applied in either order. According to the conclusions in [32], the sampling strengthens privacy preservation. Therefore, we have the following conclusion.

Weighted Sampling for Histogram Publication.
This part further studies the sampling method for histogram publication. Specifically, the data consumers may have different requirements on histograms. Taking income data as an example, some consumers may prefer more accurate results for people with high salaries, while others may expect more accurate results in the middle of the population. Therefore, the algorithms for histogram publication should also handle such sophisticated utilities for consumers. The proposed algorithm is named Weighted-Sampling Histogram Publication (WSHP for short).
Initially, WSHP allows each consumer to report their weights at different intervals. The weights for all intervals in the $i$th query are The data curator first derives the integrated partition on the whole range, i.e., $\{W_1, W_2, \dots, W_{K_0}\}$. Then, WSHP counts the accumulated weights for each interval. For an arbitrary interval $[W_i, W_{i+1}]$, its weight $\omega_i$ is derived by adding up corresponding weights from all contributors. Assume the $j$th query has its $k$th interval covering $[W_i, W_{i+1}]$; then, the weight inherited from $\omega_{jk}$ is $\omega_{jk} \cdot ((W_{i+1} - W_i)/(W_{jk+1} - W_{jk}))$. Following this strategy, WSHP traverses all contributors to derive all $\omega_i$: where $[W_i, W_{i+1}] \subset [W_{jk}, W_{jk+1}]$. Notice that $[W_i, W_{i+1}]$ will always belong to some $[W_{jk}, W_{jk+1}]$. Otherwise, $[W_i, W_{i+1}]$ will be further partitioned into subintervals. Based on the weights, the data curator extracts corresponding sampling probabilities for different intervals. We consider a specific case where the incoming data are uniformly distributed over the whole range. Then, the sampling probabilities are determined by the following constraints. Firstly, where $P_0$ is the overall ratio of collected bits. Secondly, for an arbitrary pair of $P_i$ and $P_j$, where α = (1/2)N 2 f . Finally, WSHP follows the same strategy as BSHP to iteratively sample bits from contributors based on the corresponding $P_i$ and derives the results for different consumers.
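The weight aggregation step can be sketched directly from the rule above, where a merged interval inherits the share $\omega_{jk} \cdot (W_{i+1} - W_i)/(W_{jk+1} - W_{jk})$ of each covering query interval. The allocation function that follows is only a hypothetical stand-in for the paper's constraints, which are not recoverable from the text: it sets $P_i$ proportional to $\sqrt{\omega_i}$, the choice that minimizes $\sum_i \omega_i / P_i$ for a fixed $\sum_i P_i$, and caps the result at 1. All names and example values are assumptions.

```python
import math

def aggregate_weights(partition, queries):
    """partition: sorted boundaries of the integrated partition.
    queries: list of (boundaries, weights) pairs, one pair per query.
    Each merged interval inherits a length-proportional share of the
    weight of the query interval that covers it."""
    omegas = []
    for left, right in zip(partition[:-1], partition[1:]):
        total = 0.0
        for boundaries, weights in queries:
            for k, weight in enumerate(weights):
                q_left, q_right = boundaries[k], boundaries[k + 1]
                if q_left <= left and right <= q_right:
                    total += weight * (right - left) / (q_right - q_left)
        omegas.append(total)
    return omegas

def allocate_probabilities(omegas, overall_ratio):
    """Hypothetical rule: P_i proportional to sqrt(omega_i), scaled so the
    mean of P_i equals overall_ratio, then capped at 1."""
    roots = [math.sqrt(w) for w in omegas]
    scale = overall_ratio * len(roots) / sum(roots)
    return [min(1.0, scale * r) for r in roots]

partition = [0, 3, 4, 6, 8, 9, 12]
queries = [
    ([0, 4, 8, 12], [1.0, 1.0, 4.0]),          # query 1 stresses the top range
    ([0, 3, 6, 9, 12], [1.0, 1.0, 1.0, 1.0]),  # query 2 weighs all intervals equally
]
omegas = aggregate_weights(partition, queries)
probabilities = allocate_probabilities(omegas, overall_ratio=0.5)
```

In this example the last merged interval receives the largest accumulated weight and therefore the highest sampling probability, while the total weight of each query is preserved by the length-proportional split.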
4.3.1. Analysis. The objective of WSHP is to derive improved utilities for data consumers with heterogeneous concerns. The following theorem indicates that the WSHP algorithm can maximize the overall utility for all requestors.
Theorem 8. With fixed privacy budgets and bandwidths, WSHP can achieve optimal utilities for data consumers when the data values are uniformly distributed in the whole range.
Proof. As the bandwidths and privacy budgets are fixed in this case, WSHP adjusts the sampling ratios to balance the accuracy among different intervals. Meanwhile, the general variance is applied to measure the accuracy, as WSHP provides an unbiased estimation. The general variance is Then, according to the correlations between $\omega_{ij}$ and $\omega_k$ and the relationships between intervals in the integrated partition and histograms, we have Now we combine the analysis in equation (17) and derive where $\alpha$, $\beta$, $\gamma$, and $\delta$ are all constants, and α = (1/2)N 2 f . The results in equation (24) could be merged into equation (23), Minimizing the variance $\mathrm{Var}(Q)$ requires knowledge of the $R_j^0$'s, which is obviously unavailable to data curators. Instead, we assume that the underlying data values are uniformly distributed in the range. Then, $R_j^0$ can be approximated by $((W_{j+1} - W_j)/(D_U - D_L)) \cdot N$. Therefore, the variance $\mathrm{Var}(Q)$ is determined by ∑ 2 + ω j α)/P j . It is obvious that equation (21) can minimize $\mathrm{Var}(Q)$ in this circumstance, which is given in Theorem 8.

Discussion
This section covers the situation where data consumers post their queries asynchronously, and the data curator has to acquire the data once and respond to continuously emerging queries.
The basic settings are similar to the previous cases, where the privacy budget $\varepsilon_0$ and the bandwidth budget $B_0$ are both fixed. In this case, the data curator could do one of the following:
(1) Devote all resources to extracting one single histogram for all queries
(2) Partition the resources into multiple histograms and combine them for multiple queries
We will show that the first strategy is actually preferred even though it is straightforward. First, the derived results should ideally provide an unbiased estimation for forthcoming queries. However, this is usually infeasible due to the diverse partition of intervals in histograms. An alternative objective is then to minimize the difference between the ground truth and the estimated result. In the worst case, the distance could be all data values falling in two consecutive intervals in the published histogram. Therefore, minimizing this distance leads to the most fine-grained partition of intervals, which implies devoting all resources to a single histogram.
On the other hand, we can also achieve the same conclusion by considering the use of privacy budgets. Intuitively, partitioning the budgets into multiple folds will not reduce the overall variance, while extra bandwidths will be wasted for content uploading. Therefore, it is also preferred that the first strategy should be selected for the online querying model.
Our future study will investigate the design of methods for online histogram publication. Other advanced mechanisms besides the random response may be introduced for this case.

Evaluation
In this section, we adopt the salary data collected for ordinary citizens in the United States [33] to verify the performance of the sampling-based methods. New York City, San Francisco, and Baltimore are selected for our evaluation. Table 1 shows the overview of the datasets. We assume that data consumers request histograms of income levels with heterogeneous granularity. The data contributors publish their data to the consumers, while privacy concerns and bandwidth consumption must be addressed. The data curator coordinates the trading between the two parties by generating the data plan and the final results.
The performance of the proposed algorithms is compared with a baseline method. In this method, the data contributors respond to each consumer separately. To thwart collusion among consumers, the baseline algorithm requires the privacy budget to be shared among multiple responses; e.g., if the total privacy budget is $\varepsilon_0$, a contributor applies a budget of $\varepsilon_0/K$ to each of the $K$ queries. Each algorithm has been executed 20 times to mitigate the randomness. Finally, the mean square error (MSE for short) is applied as the metric.
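The effect of splitting the budget in the baseline can be illustrated with a small sketch. It assumes the same per-bit flip probability $1/(1 + e^{\varepsilon/2})$ as a common randomized-response parameterization, which may differ from the exact setting used in the experiments; the function names and the numbers are illustrative only.

```python
import math

def flip_probability(epsilon):
    """Assumed per-bit flip probability 1 / (1 + e^(epsilon / 2))."""
    return 1 / (1 + math.exp(epsilon / 2))

def mean_square_error(estimated, truth):
    """MSE between an estimated histogram and the ground truth."""
    return sum((e - t) ** 2 for e, t in zip(estimated, truth)) / len(truth)

total_budget, n_queries = 15.0, 3
shared = flip_probability(total_budget)              # one common query
split = flip_probability(total_budget / n_queries)   # baseline: epsilon_0 / K each

mse = mean_square_error([1, 2, 3], [1, 2, 5])        # toy histograms
```

Because the flip probability grows as the per-query budget shrinks, each baseline response is noisier than a single response that uses the whole budget.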
6.1. Basic Performance. This part studies both the numerical values and the overall performance. There are three consumers in the system, requesting 3-fold, 5-fold, and 7-fold histograms, respectively. They share a total budget of ε = 15, which the baseline algorithm partitions among all three consumers. Meanwhile, our sampling-based algorithms apply the whole budget to one common query. The sampling probability is 0.8.
The results are given in Figures 1-3. As we see, the proposed algorithms provide better utilities. They outperform the baseline method and achieve more accurate shapes for histograms, even though only part of the bits is collected under sampling. The difference is significant, given that many data values fall into each interval, which reduces the influence of randomness. It indicates that the proposed method can achieve good utilities with reduced bandwidth consumption.
This part also studies the overall performance under different budgets. In this group, the privacy budgets vary from 3 to 18. Two sampling-based algorithms are evaluated, with sampling probabilities set as 0.8 and 0.4.
According to the results in Figure 4, the proposed algorithms can reduce the MSE for histograms. The improvement is more significant when the privacy budget is relatively large. The reason is that the saving on the privacy budget can outweigh the effect of sampling and thus maintain a good utility. We also observe that a higher sampling ratio can improve the general performance. This is intuitively reasonable, as more samples decrease the impact of randomness.
6.2. Heterogeneous Sampling Ratios. Finally, we study the impact of various sampling ratios on histogram publication. In this group, the privacy budget is 15, and the data consumers still request 3-fold, 5-fold, and 7-fold histograms. The sampling ratios increase from 0.3 to 0.7 in steps of 0.1. The results are depicted in Figure 5. The general performance is improved for all three datasets according to the results, where the MSE value is reduced by at least 50% (San Francisco). However, the performance stays on the same scale for each dataset. This observation implies that increasing the sampling ratio (i.e., the bandwidth) will not always improve the data utility. In this case, the privacy budget becomes the bottleneck for highly effective data publication. However, it also indicates that bandwidth can be saved without reducing the total utility too much.
Generally, both proposed algorithms can effectively and efficiently improve the performance of histogram publication.

Conclusion
Jointly preserving sensitive information and improving efficiency during data collection has long been considered a challenging task for data processing. The emergence of local differential privacy sheds light on this task. However, existing works fail to combine the sampling strategy with the mechanisms designed for LDP. Therefore, this work proposes a novel framework for privacy-preserved histogram publication in distributed settings. It first investigates a novel plan for data collection over numerical values and then designs two sampling-based algorithms for data encoding and decoding. These algorithms apply bit-level sampling to balance the cost among data contributors and can help consumers adjust their emphasis on different intervals of the histogram. Extensive analysis is provided, including unbiased results, privacy preservation, and optimization in allocating the bandwidth resources. Finally, we conduct an evaluation on one real-world dataset to show the superiority of the proposed algorithms.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.