Key-Value Data Collection with Distribution Estimation under Local Differential Privacy

Local differential privacy (LDP) is a promising privacy-preserving technology from the users' perspective, as users perturb their private information locally before reporting it to the aggregator. We study the problem of collecting heterogeneous data, that is, key-value pairs, under LDP, which is widely involved in real-world applications. Although previous LDP work on key-value data collection achieves good utility on frequency estimation of keys and distribution estimation of values, it has three drawbacks: (1) existing work perturbs numerical values in a discrete manner that does not exploit the ordinal nature of the numerical domain and leads to poor accuracy, (2) it does not provide an improved privacy budget composition and consumes more privacy budget than necessary to achieve a given privacy level, and (3) the frequency estimation of keys is not the most accurate due to the lack of a consistency requirement. In this paper, we propose a novel mechanism to collect key-value data under LDP that leverages the numerical nature of the domain and yields better utility. Due to our correlated perturbation, the mechanism consumes less privacy budget than previous work while keeping the same privacy level. We also adopt consistency as a postprocessing step, applied to the estimated key frequencies, to further improve accuracy. Comprehensive experiments demonstrate that our approach consistently outperforms state-of-the-art mechanisms under the same LDP guarantee.


Introduction
Differential privacy (DP) [1] is the state-of-the-art technology for private data release, which provides provable and measurable privacy protection regardless of the adversary's background knowledge. Different from DP in the centralized setting that protects data after data collection, local differential privacy (LDP) has been proposed to protect data during data collection. In LDP, the server is assumed to be untrusted. Each user locally obfuscates his/her personal data using the LDP mechanism before uploading. After receiving the perturbed data from all users, the server performs data analytics or answers queries. LDP technology enables collecting statistics of users under a privacy guarantee and has been widely deployed in practice. For example, Apple deploys the LDP mechanism in iOS to identify heavy hitters in emojis while keeping user privacy [2]; RAPPOR has been deployed in Google's browser Chrome to collect and analyze the web browsing behavior of users under LDP [3].
Early LDP work mainly focuses on simple statistical queries such as frequency estimation on categorical data [4] and mean estimation [5,6] on numerical data. Nowadays, LDP is also applied to hybrid data types or queries [7][8][9], for example, key-value data, which contains categorical and numerical data simultaneously and is widely used in practice. The following examples show potential applications of key-value data: (i) Product rating analysis: online market platforms such as Amazon and eBay collect users' ratings for the products they bought and show the ratings online as a reference for other buyers. These rating data are usually in the form of key-value pairs where the key is the product and the value is the rating.
(ii) Software usage analysis: software developers and providers such as Microsoft need to collect the usage time of each software product to analyze users' preferences. These data are usually in the form of key-value pairs, where the key is the software identifier and the value is its usage time.
The works [7,8] are the first to study the problem of collecting key-value data under LDP. They design mechanisms to support two estimation tasks: (1) the frequency of keys and (2) the mean of values. Recently, Ye et al. [9] expanded this line of work and were the first to propose PrivKVM* to estimate both the frequency of keys and the distribution of values. It discretizes the domain into many bins and reports which bin contains the private value using a categorical frequency oracle (e.g., GRR [10]) while considering the correlation between keys and values. However, it has three main limitations. First, it perturbs the numerical value in a discrete manner, which does not work well because it ignores the ordinal information of the numerical domain; that is, a perturbed report that is close to the true value also carries useful information for distribution estimation. Second, although the mechanism considers the correlation between keys and values, it does not lead to an improved budget composition. Third, it does not consider consistency in the estimated key frequencies. That is, the estimated frequencies should satisfy the properties of a frequency: (i) each frequency should be non-negative and (ii) the sum of all estimated frequencies should be 1. Without enforcing this consistency requirement, the mechanism may not produce the most accurate key frequencies [11].
Motivated by this, in this paper, we study the problem of collecting key-value data under LDP and propose a mechanism that addresses the above three limitations: existing mechanisms (1) do not exploit the ordinal information of the numerical domain, (2) do not provide an improved budget composition, and (3) do not enforce consistency to achieve the best accuracy. Our mechanism aims to collect the two most fundamental statistics of key-value pairs: key frequency and value distribution. It contains three steps: (1) padding and sampling, (2) perturbation, and (3) aggregating and estimating key frequency and value distribution.
In step 1, each user pads his key-value data with dummy pairs to the same length (making the sampling rate identical for all users) and samples one key-value pair. The reason for using the sampling protocol is that each user possesses multiple key-value pairs; if all pairs were reported to the server, each pair would split the privacy budget, resulting in large noise per pair and poor utility.
In step 2, we solve the first and second limitations. Each user perturbs the sampled key-value pair in a correlated manner, because there is an inherent correlation between key and value, and reporting the value may also reveal information about the presence of the corresponding key [8].
Thus, we first perturb the key and then perturb the value according to the perturbation result of the key. If a possessed key is still possessed after perturbation, we report the value via an LDP mechanism that utilizes the ordered nature of the domain and directly perturbs the value in the numerical domain, which solves the first limitation by leveraging the ordinal information of the numerical domain. However, a challenge arises when a non-possessed key is perturbed as possessed. By the LDP definition, the perturbed values of dummy keys and the perturbed values of genuine keys must be indistinguishable; thus, the perturbed values of dummy keys would affect the distribution estimation. To address this problem, we generate fake values for these keys in a way that satisfies LDP. Previous work selects fake values uniformly at random from the discrete output domain. However, such fake value generation does not work for our mechanism: our output domain is a continuous numerical interval, and generating fake values only at discrete points would violate the privacy guarantee, since the probability ratio for outputs that are not those discrete values is unbounded. Therefore, we design a new fake value generation method for our mechanism.
As an outgrowth of the fake value generation, we show that our correlated perturbation has a privacy amplification effect: it consumes less privacy budget overall than the sum of the budgets for key and value perturbation. This solves the second limitation by providing a tighter budget composition, which achieves a better privacy-utility tradeoff than basic sequential composition (used in PrivKVM*).
In step 3, we solve the third limitation. The server collects the perturbed results from all users and estimates the key frequencies and the value distribution. For frequency estimation, the server enforces consistency requirements to improve accuracy. Since the fake values generated in the perturbation step affect the distribution, removing their influence is another challenge. Consistency is not designed for this challenge, since it only requires that the estimated frequencies sum to 1 and are non-negative but cannot detect and remove fake values. To address this problem, we design a method to statistically remove the fake values in the distribution estimation.
Our main contributions are summarized as follows: (1) Novel LDP mechanism for key-value data collection: we propose a mechanism that supports frequency estimation and distribution estimation over key-value data. It takes advantage of the numerical nature of the data domain and achieves better accuracy than existing solutions. (2) Improved privacy budget composition: we show that the privacy budget composition of our correlated perturbation mechanism has a tighter bound than sequential composition, which provides a privacy amplification effect and achieves a better privacy-utility trade-off.
(3) Consistency as postprocessing to improve accuracy: we enforce consistency as a postprocessing step for key frequency estimation in our mechanism, which further improves accuracy compared with existing LDP mechanisms for key-value data collection.
(4) Comprehensive evaluation: we implement the mechanism and evaluate it on real-world data sets. The results show that our mechanism outperforms existing LDP schemes. In particular, our mechanism significantly improves the accuracy, reducing the error of current mechanisms by about an order of magnitude in most cases, especially when ε is small (large noise).

Preliminary
2.1. Local Differential Privacy. In the centralized setting of differential privacy, a trusted server or data aggregator holds all users' personal data and is responsible for responding to queries while using DP mechanisms to protect user privacy. However, the assumption of a trusted server may not hold in practice. Local differential privacy addresses this problem: in the local setting, each user perturbs his personal data and then uploads the perturbed result to the server for data analysis. In this way, the server can be untrusted because it never accesses the original data.
Definition 1 (local differential privacy (LDP) [12]). A randomized algorithm M satisfies ε-LDP if and only if, for any two inputs x₁ and x₂ and any output y,

Pr[M(x₁) = y] ≤ e^ε · Pr[M(x₂) = y].

By Definition 1, given any output y, the adversary cannot infer with high confidence whether the original input is x₁ or x₂. Here, the confidence is controlled by the parameter ε (called the privacy budget). The smaller the ε, the closer the probability Pr[M(x₁) = y] is to Pr[M(x₂) = y]; that is, the mechanism provides stronger privacy protection, since the adversary has lower confidence in distinguishing whether the original input is x₁ or x₂.
When multiple LDP mechanisms are combined to form a new mechanism, the sequential composition theorem guarantees the total privacy of the new mechanism: the privacy budgets of the component mechanisms add up. In addition, LDP is immune to postprocessing.
Theorem 2 (postprocessing [12]). If a mechanism M satisfies ε-LDP, then for any function F that does not access the original data, the composition F ∘ M also satisfies ε-LDP.
By Theorem 2, postprocessing does not violate the privacy guarantee of LDP mechanisms. In this paper, we use postprocessing to further improve the utility of our mechanism.
After receiving the perturbed results from all n users, the server or aggregator can estimate the fraction of users who possess the i-th item. Denote the true frequency of the i-th item by f_i; the aggregator can estimate it by the following unbiased estimator f̂_i:

f̂_i = ((1/n) Σ_{j=1}^{n} 1{x_j[i] = 1} − c) / (a − c),

where 1{·} is the indicator function, x_j[i] indicates the i-th bit of the perturbed vector of user j, and a and c denote the probabilities that an input bit 1 and an input bit 0, respectively, are reported as 1.

Square Wave Mechanism (SW Mechanism)
SW mechanism is designed for numerical distribution estimation under LDP [13]. The intuition behind this mechanism is to increase the probability that a noisy reported value carries useful information about the input: for a numerical domain, a noisy report that is different from but close to the true value also contains useful information for distribution estimation. Therefore, given an input x, the SW mechanism reports values close to x with a higher probability than values far away from x. Formally, the SW mechanism M assumes the input domain is D = [0, 1] (any bounded value can be linearly transformed into this domain) and the output domain is [−b, 1 + b]; it then perturbs the input x with the probability density

Pr[M(x) = y] = p if |x − y| ≤ b, and q otherwise, for y ∈ [−b, 1 + b].

As shown in [13], the values p and q are set to

p = e^ε/(2be^ε + 1), q = 1/(2be^ε + 1),

which are derived by maximizing the difference between p and q while requiring the total probability to integrate to 1 (i.e., 2bp + q = 1). Also, the parameter b is set to

b = (εe^ε − e^ε + 1)/(2e^ε(e^ε − ε − 1)),

which is obtained by maximizing the mutual information between the input and output of the SW mechanism.
Since the output domain and the input domain differ, the server/aggregator uses the expectation-maximization (EM) algorithm to reconstruct the distribution after receiving the noisy reported values.

Problem Statement
There are n users and one server in our system model, where each user possesses one or multiple key-value pairs 〈k, v〉. The domain of keys is assumed to be K = {1, 2, . . . , d}, while the domain of values is assumed to be V = [0, 1] (any bounded value can be linearly transformed into this domain). Besides, we assume user i possesses the set of key-value pairs S_i. The goal of the server is to collect the key-value data from all users and then estimate (1) the frequency and (2) the value distribution of a certain key. In other words, the server calculates the fraction of users who have a certain key and the value distribution of that key among those who have it.

3.1. Threat Model.
We assume the server is untrusted, and a data breach might occur as a result of unauthorized data publishing or hacking. The adversary is considered to have access to all users' outputs and to be aware of the perturbation algorithm locally deployed on the user side. Furthermore, we assume that all users honestly follow the perturbation mechanism.
3.2. PrivKVM*. To the best of our knowledge, PrivKVM* [9] is the state-of-the-art LDP framework for key-value data collection that supports frequency estimation of keys and distribution estimation of values. We first briefly describe the mechanism and then summarize the main differences between our work and PrivKVM*.
3.2.1. Workflow of PrivKVM*. PrivKVM* collects the key-value data in two phases. In the first phase, each user first samples one key-value pair uniformly at random from the full domain of keys and then perturbs it in a correlated manner. Specifically, the user first perturbs the key and then perturbs the value according to the perturbed result of the key. There are four cases of perturbation: (1) The sampled key is possessed by the user, and the key is perturbed as possessed. The user perturbs the value using a technique called GVPP, which is a categorical frequency oracle with boundaries. That is, the user first discretizes the numerical domain into many bins and then discretizes the value to a boundary of the bin containing it with a specific probability. Then, the user reports which boundary contains the private value using a categorical frequency oracle, for example, GRR [10].
(2) The sampled key is possessed by the user, and the key is perturbed as non-possessed. The existing key disappears after perturbation in this case, and the user simply sets the value to 0.
(3) The sampled key is not possessed by the user, and the key is perturbed as possessed. In this case, a "fake" key appears, and the user samples a mean uniformly at random from the current means of all bins as the value.
(4) The sampled key is not possessed by the user, and the key is perturbed as non-possessed. The user simply sets the value to 0.
After the perturbation, each user reports the obfuscated result, and the server estimates the statistical information: (1) the frequency of all keys, (2) the mean of the values, and (3) the distribution of the values. To obtain a more accurate mean estimation, the server leverages a virtual iteration technique to further calibrate the estimated mean. In the second phase, the server broadcasts to all users the heavy hitters (the keys with frequencies higher than a given threshold) and their corresponding means estimated in the first phase, and the users and the server repeat the steps of the first phase to obtain the statistics, except that the statistics are computed over the set of heavy hitters instead of all keys. For the rest of the keys, the server averages their statistics to reduce the noise effect, since they are non-heavy hitters and the number of samples is insufficient.

3.2.2. Limitations of PrivKVM*.
We summarize the limitations of PrivKVM* as follows: (1) PrivKVM* estimates the value distribution using a categorical frequency oracle with boundaries. In this way, the server can estimate the count of values falling in each bin and obtain the density distribution over the domain. However, values in a numerical domain have a meaningful total order, and this method ignores such information due to discretization. Worse, it faces the challenge of finding the optimal number of bins. Binning causes two sources of error: (i) LDP noise and (ii) bias due to grouping values together. More bins lead to larger error due to LDP noise, and fewer bins result in larger error because of bias. Unfortunately, finding the optimal number of bins is a non-trivial task, since its effect depends on both ε and the actual data distribution, which is unknown to the server [13]. (2) PrivKVM* does not consider the consistency problem in the frequency estimation of keys. That is, the estimated frequencies may not satisfy the basic requirements of a frequency: (i) every frequency should be non-negative and (ii) all frequencies should sum to 1. Thus, the estimated frequencies are not the most accurate. In what follows, we elaborate on our mechanism. Important notations are summarized in Table 1.

Proposed Method
The overview of the proposed method is shown in Figure 1. The idea of our mechanism is as follows. Each user first samples one key-value pair from his personal data (the sampling protocol is discussed in Section 4.1); then each user privately perturbs the sampled key-value pair using our LDP mechanism (discussed in Section 4.2). After receiving the reported results, the server aggregates the perturbed data and estimates the key frequency and value distribution, as shown in Section 4.3.

4.1. Sampling Protocol.
In this subsection, we explain why we need to sample before reporting perturbed data and elaborate on our sampling protocol.

Why Sampling Protocol.
In practice, each user may have multiple key-value pairs. If the user perturbs all his key-value pairs, then each pair consumes privacy budget, and the LDP mechanism has to split the total privacy budget among them. Thus, the noise added to each pair would be too large. To solve this problem, a promising method is to sample and submit one pair, which avoids splitting the privacy budget and improves utility.

Our Sampling Protocol.
Sampling protocols are widely used in existing LDP mechanisms for key-value data perturbation [7][8][9]. However, they either do not support distribution estimation or do not work well on large domains. In particular, PrivKVM* [9] samples from the full domain in the first phase to identify heavy hitters, which does not work well when the domain size is very large and each user possesses only a small number of keys, since users rarely report information about the keys they actually possess. Therefore, we adopt the padding-and-sampling protocol [8,14] for key-value data to support frequency and distribution estimation. The advantage of the padding-and-sampling protocol is that it samples from the set of keys a user possesses instead of from the full domain, and thus it handles large domains better. Our sampling protocol proceeds as follows. First, users generate dummy key-value pairs whose keys are drawn from {d + 1, . . . , d + l} and whose values are zeros. User i with |S_i| < l adds l − |S_i| distinct random dummy key-value pairs to S_i to make its length l. Without padding, determining the probability that a pair is sampled is difficult, resulting in inaccurate estimation. After padding, the key domain of the padded data is K′ = {1, 2, . . . , d′} with d′ = d + l. Then each user samples one pair from the padded data to perturb and upload. Although some pairs may remain unsampled, this only happens for infrequent pairs, and useful information is still reported with high probability. The details are shown in Algorithm 1.
We note that the previous mechanism PCKV [8] also adopts a padding-and-sampling protocol for key-value data collection under LDP. We emphasize that there is a difference between their protocol and ours. PCKV only supports mean estimation of values; thus, it discretizes each value into 1 or −1 with a particular probability to guarantee unbiasedness of the mean estimation. Our sampling protocol does not adopt this discretization step because we want to estimate the value distribution, and discretized values lose the numerical information and would affect the distribution estimation.
According to the literature [8,14], the padding length l causes two types of error: (1) variance and (2) bias between the true values and the estimated results. A smaller l underestimates the key frequencies and results in a large bias, while a larger l enlarges the noise in the estimation, leading to a large variance. Unfortunately, finding the optimal padding length l that balances this trade-off is a non-trivial task and remains an open problem [8]. Thus, in this paper, we empirically set a suitable padding length l in the experiments when comparing with other LDP mechanisms.

4.2. Perturbation Mechanism.
In this subsection, we introduce our perturbation mechanism. By Algorithm 1, each user samples one key-value pair 〈k, v〉 as the input of the perturbation mechanism. The basic idea is to perturb the value according to the perturbed result of the key. If a non-possessed key is perturbed as possessed or a possessed key is perturbed as non-possessed, we generate a fake value for the key to avoid influencing the distribution estimation of the values. Under this strategy, we find that the mechanism provides a tighter privacy budget composition (see Theorem 3); that is, the total privacy budget of the combined perturbations (key perturbation and value perturbation) is smaller than under sequential composition. Based on this idea and two basic LDP mechanisms (UE and SW), we design an LDP mechanism for key-value data collection that supports numerical distribution estimation. The overall LDP perturbation is shown in Algorithm 2.
In the UE mechanism, the original input is encoded as a binary vector where the bit at the input's position is 1 and all other bits are 0. Similarly, for key-value data, we encode the sampled key-value pair 〈k, v〉 as a vector x where the k-th element x[k] (corresponding to the key k) is 〈1, v〉 and every other element is 〈0, 0〉. The perturbation is then divided into two steps. Since each element of the vector has two items (key and value), for brevity we use k^(i) and v^(i) to denote the key and value of the i-th element of vector x, and k̂^(i) and v̂^(i) to denote their perturbed counterparts. First, we perturb each key bit as follows:

Pr[k̂^(i) = 1] = a if k^(i) = 1, and c if k^(i) = 0,

where c ≤ 0.5 ≤ a. Given the perturbation result of the key, we then perturb the value. The value perturbation is divided into three cases: (1) if k^(i) is perturbed from 1 to 1, the corresponding value is perturbed by the SW mechanism; (2) if k^(i) is perturbed from 0 to 1, a fake value drawn from the uniform distribution over the output domain [−b, 1 + b] is reported; (3) if k^(i) is perturbed from 0 to 0 or from 1 to 0, the key is reported as non-possessed, and thus we set the perturbed value to 0.

Privacy Analysis of Our Mechanism.
In our mechanism, the key is perturbed by the UE mechanism with privacy budget ε₁ = ln(a(1 − c)/c(1 − a)) (see Section 2); the value is perturbed by the SW mechanism with privacy budget ε₂ = ln(p/q). The key perturbation and value perturbation are correlated; that is, the value perturbation relies on the key and the key perturbation. In general, correlated perturbation may leak less privacy than independent perturbation and can have a privacy amplification effect [8]; that is, the total privacy budget ε is less than the sum ε₁ + ε₂. Theorem 3 shows that our mechanism satisfies LDP and has a tighter budget composition than sequential composition.

Theorem 3. Denote the privacy budgets for key perturbation and value perturbation by ε₁ and ε₂, respectively; our mechanism satisfies ε-LDP where

ε = ln max{(ε₂e^{ε₂} − e^{ε₂} + 1)/(e^{ε₂}(e^{ε₂} − ε₂ − 1)), e^{ε₂}, e^{ε₁}(e^{ε₂} − 1)/ε₂} ≤ ε₁ + ε₂.

Proof. For a key-value set S, we denote its key-value pairs by 〈i, v_i〉 for all i ∈ S, where i ∈ S means the pair 〈i, ·〉 ∈ S. Suppose the sampled key-value pair is 〈k, v〉 (v ∈ [0, 1]); note that v = 0 if the sampled key is a dummy key drawn from {d + 1, . . . , d′}. For the input vector x, only the k-th element is non-zero, and the others are zeros. Then the probability of outputting a perturbed vector x̂ is

Pr(x̂ | 〈k, v〉) = (Pr(x̂[k] | 〈1, v〉)/Pr(x̂[k] | 〈0, 0〉)) × ∏_i Pr(x̂[i] | 〈0, 0〉).

Denote the first term by f(x̂, k) and the second term by g(x̂). Since g(x̂) is the same for different inputs, it cancels out when we calculate the ratio of Pr(x̂ | S₁) to Pr(x̂ | S₂) (S₁, S₂ being two different key-value sets). Therefore, we first calculate f(x̂, k). According to the perturbation mechanism, the numerator is

Pr(x̂[k] | 〈1, v〉) = ap or aq if k̂^(k) = 1 (depending on whether |v̂^(k) − v| ≤ b), and 1 − a if k̂^(k) = 0,

while the denominator equals c · (1/(1 + 2b)) if k̂^(k) = 1 (a fake value uniform on [−b, 1 + b]) and 1 − c if k̂^(k) = 0. Based on this, we obtain

f(x̂, k) ∈ {ap/(c · (1/(1 + 2b))), aq/(c · (1/(1 + 2b))), (1 − a)/(1 − c)}.

Then, we discuss the upper and lower bounds of f(x̂, k).
Since both the UE mechanism and the SW mechanism have a higher probability of keeping the input than of perturbing it to another value, we have a ≥ c and p ≥ 1/(1 + 2b) ≥ q. Because ap/(c · (1/(1 + 2b))) is greater than both aq/(c · (1/(1 + 2b))) and (1 − a)/(1 − c), we have the upper bound f_u and lower bound f_l of f(x̂, k) as follows:

f_u = ap/(c · (1/(1 + 2b))) = ap(1 + 2b)/c, f_l = min{aq(1 + 2b)/c, (1 − a)/(1 − c)}.

Then, the probability of outputting x̂ given a key-value set S (padded to length l) is

Pr(x̂ | S) = (1/l) Σ_{k∈S} f(x̂, k) · g(x̂) ≤ f_u · g(x̂).

Similarly, we also have Pr(x̂ | S) ≥ f_l · g(x̂). Thus, the following inequality holds for two different key-value sets S₁ and S₂:

Pr(x̂ | S₁)/Pr(x̂ | S₂) ≤ f_u/f_l ≤ e^ε,

which completes the proof.

Algorithm 2: Perturbation. Input: the sampled key-value pair 〈k, v〉 and privacy budgets ε₁, ε₂. Output: perturbed result x̂. (1) Encode 〈k, v〉 as vector x. (2) Perturb each key bit k^(i) and value v^(i) as described above.
It is worth noting that the work [8] also proposed a tighter privacy budget composition for its mechanism. However, our tighter composition differs from that in [8]. Specifically, the composition theorems hold for different LDP problems: the improved composition in [8] holds for estimating key frequencies and the mean of values under LDP, whereas our tighter composition holds for estimating key frequencies and the distribution of values. Moreover, the perturbations differ between our mechanism and [8]. The work [8] proposed two mechanisms: (1) PCKV-UE and (2) PCKV-GRR. PCKV-UE combines unary encoding (UE) with randomized response, and PCKV-GRR is based on GRR, while the components of our mechanism are the UE and square wave (SW) mechanisms. As a result, the privacy budgets compose in different ways. In our mechanism, the budget composes as in equation (9), but in PCKV-UE and PCKV-GRR, the budgets compose as max{ε₂, ε₁ + ln(2/(1 + e^{−ε₂}))} and ln((e^{ε₁+ε₂} + λ)/(min{e^{ε₁}, (e^{ε₂} + 1)/2} + λ)), respectively, where λ is a term depending on the padding length. Figure 2 shows (1) the basic sequential composition, (2) the tighter composition of our mechanism, and (3) the tighter composition of PCKV (both PCKV-UE and PCKV-GRR). Note that the composition of PCKV-GRR depends on the padding length l (l ≥ 1), and a larger l results in a tighter budget composition; therefore, we compare against PCKV-GRR with varying l. The result shows that the composition of our mechanism is less tight than that of PCKV even under the minimum l, that is, l = 1. There is an intuition behind this result: our mechanism estimates the value distribution under LDP, which needs more information about the data than PCKV, which only estimates the mean of the values. Thus, PCKV can bound the privacy loss at a tighter level.
Figure 2 also shows the relationship between our composition and basic sequential composition, which demonstrates the privacy amplification of our mechanism. Compared with sequential composition, where the total privacy budget is ε = ε₁ + ε₂, our mechanism consumes less privacy budget because max{(ε₂e^{ε₂} − e^{ε₂} + 1)/(e^{ε₂}(e^{ε₂} − ε₂ − 1)), e^{ε₂}, e^{ε₁}(e^{ε₂} − 1)/ε₂} ≤ e^{ε₁+ε₂}. In other words, our mechanism has a privacy amplification effect.

No Privacy Amplification Effect from Sampling Protocol.
In Theorem 3, the privacy guarantee is independent of the padding length l, which means our mechanism obtains no privacy amplification from the sampling protocol. The main reason is that our mechanism outputs a vector containing multiple keys, and multiple positions in the vector can be 1. Therefore, even if the sampling protocol is used, the upper bound of the probability ratio in the worst case is independent of the protocol. We give an example that considers only the key perturbation to make this point clearer. Suppose the padded key domain has size d′ = 6, S₁ contains keys 1 and 2, and S₂ contains neither; each user samples one key and encodes it (the encoded vector depends on which key is sampled). The probability ratio is Pr(M(S₁) = y)/Pr(M(S₂) = y); thus, in the worst case, we maximize Pr(M(S₁) = y) and minimize Pr(M(S₂) = y). To this end, we select the output vector y = [110000] because, in our mechanism, 1 → 1 occurs with the highest probability and 0 → 1 with the smallest probability. No matter which key is sampled, the probability of outputting y is the same, namely Pr(M(S₁) = y) = ac(1 − c)⁴ and Pr(M(S₂) = y) = c²(1 − a)(1 − c)³. In other words, there is no privacy benefit from the sampling protocol.

Privacy Budget Allocation.
Since our mechanism contains two steps that use ε₁ and ε₂, respectively, allocating the privacy budget between the steps is important. A basic and widely used idea is to express the error as a function of ε and then find the optimal ε₁ and ε₂ that minimize it [8]. However, calculating the error (or distance) between the estimated distribution and the true distribution as a function of ε is a non-trivial task [13].
Thus, we use an empirical allocation method in this paper and leave the search for an optimal privacy budget allocation strategy for future work.
In our experiments, we observe that even under such a suboptimal budget allocation, our mechanism still outperforms mechanisms that use an optimal privacy budget allocation.

4.3. Aggregation and Estimation.
In this subsection, we introduce how to aggregate the perturbed results and estimate the frequency of keys and the distribution of values. For frequency estimation, an unbiased estimator was proposed in [8,14]. However, it does not take prior knowledge about the estimated frequencies into account, which reduces utility. For numerical distribution estimation, the SW mechanism uses the EM algorithm to estimate the distribution. However, due to the fake values in our design (generated when a key is perturbed from 0 to 1), directly applying the EM algorithm would not yield a useful estimate. We use postprocessing methods to address these problems. Note that postprocessing the output of a DP mechanism does not affect its privacy guarantee [1].

Key Frequency Estimation.
After the server receives the perturbed reports from all users, it counts the number of 1's supporting each key i, denoted as n_i = Count(k_y^(i) = 1). Then, we first use the estimator in [8, 14] to obtain an unbiased frequency estimate of key i,

f̂_i = l · (n_i/n − c) / (a − c).

Formally, Theorem 4. If the padding length l ≥ |S_u| for all users u, the estimator f̂_i is unbiased, that is, E[f̂_i] = f_i,

and the variance is

Var[f̂_i] = (l^2 / (n(a − c)^2)) · ((f_i/l) · a(1 − a) + (1 − f_i/l) · c(1 − c)).
Proof. The random variable n_i is the sum of n independent random variables, each following a Bernoulli distribution. For users who input key i for perturbation (a fraction f_i/l of all n users), the variable is drawn from Bernoulli(a), and for users who do not input key i (a fraction 1 − f_i/l of all n users), it is drawn from Bernoulli(c). Thus, the expectation of the estimator is E[f̂_i] = (l/(a − c)) · (E[n_i]/n − c) = (l/(a − c)) · ((f_i/l)a + (1 − f_i/l)c − c) = f_i, and the variance of the estimator follows from Var[n_i] = n · ((f_i/l)a(1 − a) + (1 − f_i/l)c(1 − c)).

Improve the Utility with
Postprocessing. The estimator f̂_i is unbiased only in theory; the estimates may not be consistent. That is, the estimates for many keys may be negative, and the frequencies may not sum to 1. Such inconsistency may reduce the utility of LDP mechanisms [11]. Therefore, we further enforce the following consistency requirements on the estimated results to improve the utility:

min_{f̃} ‖f̃ − f̂‖_2^2  s.t.  Σ_i f̃_i = 1, f̃_i ≥ 0 for all i.
Based on the KKT conditions [15], the postprocessed result has the closed form

f̃_i = f̂_i + (1/|A|) · (1 − Σ_{j∈A} f̂_j) for i ∈ A, and f̃_i = 0 otherwise,

where A is the set of keys whose frequencies remain non-negative in f̃.
We further explain why we use ‖f̃ − f̂‖_2^2 as the objective function. The L_2 norm is used because the noise introduced by UE is well approximated by Gaussian noise, and minimizing the L_2 norm achieves the MLE [13]. Besides, when we enforce the consistency requirement on the estimated results, many results can satisfy it, and the L_2 objective selects the one closest to the unbiased estimates.

Theorem 5. The postprocessing introduces a positive bias into the estimator f̃_i.

Proof. For each key i ∈ A, the bias is E[f̃_i] − f_i = E[(1/|A|) · (1 − Σ_{j∈A} f̂_j)]. In many application domains, the number of users is large, and the true frequencies of many keys are far from zero.
Thus, few estimated results are likely to be negative, and |A| tends to be large. Therefore, even though the postprocessing introduces a positive bias, it is sufficiently small in practice.
4.3.3. Numerical Distribution Estimation. The server performs the distribution estimation in a discretized way, that is, via a histogram on the value domain. For a key i, the server first finds the reported results that support key i, that is, those whose i-th bit of the perturbed vector k_y equals 1.

Data Sets.
Four real-world data sets are used in our evaluation: E-commerce [16], Clothing [17], Amazon [18], and Movie [19]. We summarize the data set parameters in Table 2, where l is the padding length. All rating values are linearly transformed into the range [0, 1].

Competitors.
We compare our mechanism with three existing mechanisms: PrivKVM* [9], PCKV [8], and KVUE [20]. PrivKVM* is described in Section 3.2, so we do not introduce it again here. PCKV comprises two mechanisms, PCKV-UE and PCKV-GRR, which are based on optimized unary encoding (OUE) and generalized randomized response (GRR), respectively; we compare with both in this paper. KVUE is a mechanism proposed to improve the performance of PrivKVM [7], the degraded version of PrivKVM*. It treats each key-value pair as a whole entity instead of treating the key and value separately, and directly perturbs each entity.
Since only PrivKVM* supports both frequency estimation of keys and distribution estimation of values, while the other mechanisms are designed only for frequency estimation and mean estimation of values, we compare with PrivKVM* on both the frequency and distribution estimation tasks and with PCKV and KVUE only on frequency and mean estimation.

Evaluation Environments.
All mechanisms are implemented in Python 3.6 with NumPy 1.14. All experiments are conducted on an Amax server running Ubuntu 16.04, with an Intel Xeon Silver 4214 CPU (2.2 GHz, 24 cores in total) and 128 GB of DDR4-2666 memory.

Frequency.
We evaluate the key frequency estimation by the mean squared error (MSE). Formally, we measure

MSE = (1/|X|) Σ_{i∈X} (f̂_i − f_i)^2,

where X is any subset of the key domain K, and we set the default X to be K.

Distribution Distance.
We evaluate the distribution estimation by the average Wasserstein distance (AW). Formally, we measure

AW = (1/|X|) Σ_{k∈X} W_k(H, H̃),

where X is any subset of the key domain K, and we set the default X to be K. W_k(H, H̃) is the Wasserstein distance between the true value distribution of key k and the estimated distribution. Formally, given the histogram H constructed from the true values of key k over buckets B_1, ..., B_m, where H(B_i) = Pr[v ∈ B_i], and the reconstructed histogram H̃, the Wasserstein distance is

W_k(H, H̃) = Σ_{i=1}^{m} |Σ_{j≤i} (H(B_j) − H̃(B_j))| · |B_i|.

Mean and Variance.
Given the estimated value distribution, we can also calculate the mean and variance of the values, and we again use the MSE to evaluate the mean and variance estimation. Formally, we measure

MSE_mean = (1/|X|) Σ_{i∈X} (μ̃_i − μ_i)^2 and MSE_var = (1/|X|) Σ_{i∈X} (σ̃_i^2 − σ_i^2)^2,

where, as in frequency estimation, X is any subset of the key domain K with default X = K; μ̃_i and μ_i are the estimated and true mean of the value of key i, respectively; and σ̃_i^2 and σ_i^2 are the estimated and true variance of the value of key i, respectively.
All metrics measure the error between the estimated result and the true result; the smaller the metric, the more accurate the estimate. All results are averaged over 50 repeats to stabilize the experimental results.

Key Frequency.
We first evaluate the existing LDP mechanisms on key frequency estimation. Here, we analyze these methods on three tasks:
(1) Frequency of individual keys: We measure the MSE between [f̂_i]_{i∈K} and [f_i]_{i∈K}. In this task, X is the key domain K.
(2) Frequency of the most frequent keys: We select the top-T keys and measure the MSE between their true frequencies and the postprocessed ones. Formally, denote the top-T keys by the set D_T = {i ∈ K | f_i ranks in the top T} and measure the MSE between [f̂_i]_{i∈D_T} and [f_i]_{i∈D_T}. We set the defaults T = 15 and ε = 1. In this task, X is the set of top-T keys.
(3) Frequency of subsets of keys: Estimating the frequency of a subset of keys plays an important role in interactive data analysis (e.g., estimating which category of products is more popular). We uniformly sample a fraction α (0 ≤ α ≤ 1) of the key domain and measure the MSE between the sums of the true and postprocessed frequencies. Formally, let D_α be a randomly sampled subset containing α × |K| keys, and define f_{D_α} = Σ_{i∈D_α} f_i and f̂_{D_α} = Σ_{i∈D_α} f̂_i. We sample D_α 100 times and measure the MSE between f_{D_α} and f̂_{D_α}. We set the defaults α = 30% and ε = 1.

Frequency of Individual Key.
We first evaluate the performance when querying the frequencies of individual keys; the results are shown in Figure 3. We conclude that our method outperforms all other methods (its MSE is the smallest) on all data sets because we enforce consistency as postprocessing. Especially when the noise is large (ε ≤ 2), our method reduces the MSE of the state-of-the-art solution by about two orders of magnitude. This is because the estimated frequencies are prone to be inconsistent under large noise, and our postprocessing improves the accuracy significantly.
This also happens in the other two tasks for a similar reason (see Figures 4 and 5). On the E-commerce, Clothing, and Amazon data sets, the MSE results of the other existing methods are very similar; this is because the numbers of users in the Clothing and Amazon data sets are large, which compensates for the impact of their large key domains. Our method shows its smallest MSE on the Amazon data set among all four data sets. This is because Amazon has the largest number of users, which leads to a smaller bias and better accuracy (according to the analysis of our postprocessing in Theorem 5). On the Movie data set, although no method performs as well as on the first three data sets (the large padding length l leads to a large error), our method still performs best among all mechanisms due to the consistency requirement.

Frequency of Most Frequent Keys.
The MSE results when querying the top-T frequent keys under varying T and ε values are shown in Figures 4 and 6. Overall, our mechanism significantly reduces the MSE relative to the other methods under all ε values and, in most cases, all T values. Similar to the results for individual keys, the MSE of our method is markedly lower than that of the other solutions when the noise is large (ε ≤ 2). As ε grows, the decline in the MSE of our method levels off. Figure 6 shows that the MSE of our mechanism is significantly smaller than that of the other mechanisms on all data sets under all T, which demonstrates that our method can cope with various queries for the top-T frequent keys.

Frequency of Subsets of Key.
We show the results for frequency estimation of subsets of keys in Figures 5 and 7. Overall, our method outperforms the other mechanisms under all ε and α values. In Figure 5, the MSEs of all mechanisms decrease as ε grows, and the large gap between our method and the others indicates that it performs much better than the existing methods. Moreover, it is worth noting that in Figure 7, the MSEs of the other mechanisms grow as α increases, while the MSE of our method is symmetric about α = 50%. This is because the individual estimation errors accumulate as α increases under the other mechanisms, whereas we enforce consistency so that all estimated frequencies sum to 1; thus, estimating the frequency of a subset with α > 50% is equivalent to estimating that of its complement.

Distribution.
We evaluate the existing LDP mechanisms on distribution estimation from three perspectives: (1) distribution distance, (2) mean, and (3) variance. We compare our mechanism with PrivKVM* on all three tasks and with the other mechanisms only on mean estimation, since they are designed only for the mean. We set the number of buckets m = 1024 in our experiments, as this has been shown to perform best in most cases for distribution estimation [13].

Distribution Distance.
We plot the AW results as a function of ε in Figure 8. Our mechanism outperforms PrivKVM* and achieves a reasonable distribution estimation on all data sets; the largest AW is only about 10^−1.
This is because PrivKVM* perturbs numerical values in a discrete manner and does not exploit the ordinal information of the numerical domain. It is also worth noting that the AW results on the first three data sets (E-commerce, Clothing, and Amazon) are similar and lower than that on the Movie data set.
This is because the padding length l for the Movie data set is the largest (l = 100), which leads to a large error in frequency estimation (see Figure 3). Thus, we may obtain a relatively inaccurate count of the users who generate fake values when we statistically remove them, which leads to a high AW for distribution estimation.

Mean.
The evaluation of mean estimation is shown in Figure 9. Our method performs much better than all other mechanisms under all ε values. Specifically, when ε is relatively small, our mechanism significantly reduces the MSEs of all other solutions; when ε is larger than 4, the MSE of our mechanism is one to two orders of magnitude smaller than that of most other solutions. This is because our mechanism reports a value close to the original value with a higher probability than a value far away from it. In this way, the perturbed result carries more useful information about the original value and leads to more accurate results.

Variance.
Figure 10 plots the MSE of the variance estimation as a function of ε. Due to its categorical frequency oracle, PrivKVM* underperforms in our experiments. It is worth noting that the MSE on the Movie data set is the highest. This is again because the largest padding length (l = 100) for the Movie data set leads to a large error in frequency estimation (see Figure 3) and results in a relatively inaccurate count of the users who generate fake values when we statistically remove them.

Related Work
Differential privacy has become the de facto standard for privacy preservation. There are many LDP deployments in the real world: the Google Chrome extension [3], the spelling prediction of Apple [2], and telemetry collection by Microsoft [21].

Frequency Oracle and Distribution Estimation.
Estimating the frequencies of values is a basic task in LDP. Several mechanisms [3, 10, 22, 23] have been proposed for this task, often called frequency oracles. For example, RAPPOR [3] enables the estimation of the marginal frequencies of a set of strings. However, it needs a dictionary of the candidate strings, which can be very large or unknown in practice. To solve this problem, Fanti et al. [24] use the EM algorithm as a decoder for RAPPOR to enable learning without explicit dictionary knowledge. Based on RAPPOR, Ren et al. [25] propose a novel mechanism to estimate distributions of high-dimensional data. Instead of the EM algorithm, they use Lasso regression to estimate the distribution in one round. Combining the EM algorithm and Lasso regression, Ren et al. [26] further propose a solution that generates synthetic data by leveraging the estimated distribution of the data under LDP.
Although the above schemes also use the EM algorithm, there are two differences from the EM algorithm in our mechanism: (1) our EM algorithm can statistically remove the fake values, and (2) it operates on the aggregated results and is thus more efficient. When estimating the distribution of numerical data, a naïve approach is to bucketize the data and apply the categorical frequency oracles listed above. In [4], the authors achieve distribution estimation under LDP but with a strictly weaker privacy guarantee. There are also mechanisms that handle numerical data but focus on the specific task of mean estimation, namely SR [5, 21] and PM [27]. The SW mechanism [13] is the state-of-the-art mechanism for distribution estimation under LDP; it recovers the full distribution instead of focusing on a specific task. Different from existing LDP mechanisms that only focus on simple statistical queries (such as frequency and mean), our paper designs a new LDP mechanism for key-value data collection that considers both key frequency and value distribution simultaneously.

Postprocessing.
For statistical tasks in differential privacy, one can utilize structural information to postprocess and improve the accuracy. Following this idea, Hay et al. [28] propose an efficient hierarchical method that minimizes the L_2 difference between the noisy result and the processed result. Lee et al. [29] consider the non-negativity constraint and use the alternating direction method of multipliers (ADMM) to obtain a result that achieves the maximum likelihood. Wang et al. [11] further improve the accuracy by enforcing consistency, requiring the frequencies to be non-negative and sum to one. Jia and Gong [30] use conditional expectation to estimate the true data given the LDP-protected results.
This method shows satisfactory results when the data approximately follow a power-law distribution. The EM algorithm is used in [13] to improve the accuracy of histogram data when estimating numerical distributions.
In this paper, we adopt postprocessing for key frequency estimation to further improve the accuracy.
6.3. Key-Value Data Collection. Ye et al. [7] are the first to propose LDP mechanisms to collect key-value data, called PrivKV, PrivKVM, and PrivKVM+. PrivKVM iteratively estimates the mean to guarantee unbiasedness. PrivKV is a simplified version of PrivKVM and can be regarded as PrivKVM with only one iteration. To balance unbiasedness and communication cost, they also propose an advanced version of PrivKVM called PrivKVM+. Sun et al. [20] propose another estimator for frequency and mean estimation under PrivKV to achieve better accuracy. They also introduce conditional analysis of key-value data for other complex analysis tasks in machine learning. Gu et al. [8] propose the framework PCKV, which perturbs the key and value in a correlated manner and provides a tighter privacy budget composition. As a result, PCKV outperforms the above LDP mechanisms in both key frequency estimation and value mean estimation. To the best of our knowledge, PrivKVM* [9] is the state-of-the-art mechanism, which not only supports more statistical tasks but also achieves the best accuracy in most cases.

Discussion and Conclusion
In this paper, we propose a novel LDP mechanism for private key-value data collection. By exploiting the numerical information of the value domain, our mechanism outperforms existing schemes in most cases. The mechanism perturbs the key-value data in a correlated manner, which results in a tighter privacy budget composition. We further improve the accuracy of the frequency estimation by enforcing consistency. Finally, we evaluate our mechanism on four real-world data sets and demonstrate that it outperforms existing schemes.
Although our mechanism performs well in our experiments, it has the following limitations: (1) we do not consider the optimal padding length l, which would lead to more accurate results, and (2) our mechanism adopts a suboptimal privacy budget allocation scheme instead of the optimal one. Although our method still outperforms previous mechanisms, it does not achieve the minimum error.
In future work, we will study the optimal padding length l, which can further improve the privacy-utility trade-off, and the optimal privacy budget allocation.

Data Availability
The key-value data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.