Collecting Multidimensional Numerical Data and Estimating Mean with Personalized Local Differential Privacy



Introduction
With the development and advancement of information technology, a large amount of data is generated every day in all industries. The vast majority of companies and organizations recognize the wealth of knowledge that can be gained by collecting and analyzing data. As a result, data collection and analysis have become popular and widespread. However, data collection also raises serious privacy concerns. Large amounts of sensitive user data have been leaked, raising a number of public safety issues such as fraud and harassment. To address the privacy issue, Dwork et al. [1] proposed differential privacy (DP) as a standard for privacy protection in various domains.
Unfortunately, since DP focuses only on centralized datasets and assumes that the server is trusted, data is not protected on the client side and privacy issues still arise. In this context, local differential privacy (LDP) [2] was proposed as a variant of differential privacy. In an LDP solution, users first perturb their own data on the user side and then upload the perturbed data to the server. In this way, the server receives only the perturbed data, and the user's real data never leaves the user's side. Therefore, there is no privacy problem due to the untrustworthiness of the server. Due to its strong security, LDP has been studied and applied by organizations such as Google Chrome [3], Apple iOS and macOS [4], and Microsoft Windows Insiders [5].
However, existing multidimensional LDP solutions focus only on improving the usability of aggregated datasets while ignoring users' personalized privacy needs. In most solutions, users can only give their data to the perturbation mechanism to generate perturbed data and then upload it to the server. Specifically, the perturbation mechanism in a multidimensional LDP solution typically has two inputs: the user data and the privacy budget for each attribute. The user data is the real data of the user, and the privacy budget is usually distributed equally across attributes. This privacy budget allocation method meets the privacy needs of the server setting, i.e., each attribute is equally important. However, since users usually have different sensitivities to different attributes, it usually does not meet users' true privacy needs. As a result, attributes with high sensitivity have weak protection strength, while attributes with low sensitivity have poor usability. Therefore, it is important to design a multidimensional LDP solution that meets users' personalized privacy needs.
This paper first discusses the existing optimal solutions for multidimensional numerical data collection. To address their personalized privacy problem, we propose a new privacy standard, personalized local differential privacy (PLDP). Under the PLDP standard, a solution ensures overall security and availability while meeting the personalized privacy needs of users within certain ranges. To solve the sampling dimension problem, we assume that each attribute is uniformly distributed and obtain a better sampling dimension by minimizing the average variance. Based on the above work, we design a personalized multidimensional piecewise mechanism (PMPM) to collect multidimensional numerical data, which has a smaller mean square error than the existing optimal solution and can meet the personalized privacy needs of users. In addition, we compared our solution with existing solutions on two real datasets. The experimental results demonstrate that the mean square error of our scheme is lower than that of existing solutions. The contributions and innovations can be summarized as follows: (1) We design a personalized privacy budget allocation within a certain range and further propose a new privacy standard, personalized local differential privacy. It meets the personalized privacy needs of users while ensuring the availability of perturbed data. (2) We optimize the sampling dimension of the existing solution, which results in a smaller mean square error of the perturbed data. (3) We propose a personalized multidimensional piecewise mechanism (PMPM) for collecting multidimensional numerical data and estimating the mean. It not only has a smaller mean square error but also meets the personalized privacy needs of users. (4) We compared our solution with existing solutions on two real datasets. The results validate the superiority of our solution, which meets users' personalized privacy needs while having a smaller mean square error.

Related Work
Differential privacy (DP) is a strict privacy standard designed to ensure that data is shared without risk. It guarantees that even if an attacker knows all but one piece of data in the dataset, he still cannot infer information about that piece of data [6]. In a DP solution, the server must be trusted because it collects and perturbs the user's real data. However, in real life, servers are not always trusted, which leads to solutions that may have privacy issues even if they satisfy DP. In this context, local differential privacy (LDP) [2] has been proposed as a variant of DP. In an LDP solution, the user first perturbs the real data on the client side and then uploads the perturbed data to the server. Only the data owner can access the original data, which provides stronger privacy protection for the user [7]. Even if the server is not trusted, there are no privacy issues. Under LDP, collecting numerical data and estimating the mean is one of the basic goals of statistics. Duchi et al. [8][9][10] proposed an extreme value perturbation solution that randomly perturbs the user's real data x ∈ [−1, 1] to one of the two extreme values (e^ϵ + 1)/(e^ϵ − 1) or −(e^ϵ + 1)/(e^ϵ − 1), with probabilities ((e^ϵ − 1)/(2e^ϵ + 2))·x + 1/2 and −((e^ϵ − 1)/(2e^ϵ + 2))·x + 1/2, respectively. The worst-case variance of this solution is O(((e^ϵ + 1)/(e^ϵ − 1))^2), which makes the variance small when ϵ is small but not very good when ϵ is large, because no matter how large ϵ is, ((e^ϵ + 1)/(e^ϵ − 1))^2 is greater than 1. Kairouz et al. [11] demonstrated that such extreme value perturbation solutions are not always optimal. Then, Wang et al. [12] proposed a distribution perturbation solution called the piecewise mechanism (PM), which perturbs the user's real data to a value in [l(x), r(x)] with high probability and to a value in [−C, l(x)) or (r(x), C] with low probability. Li et al. [13] proposed a method similar to PM, called the square wave mechanism, for estimating the distribution.
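As an illustration, Duchi et al.'s one-dimensional extreme value perturbation can be sketched as follows (a minimal Python sketch under our own naming; the sampling style is illustrative, not taken from [8]):

```python
import math
import random

def duchi_1d(x: float, eps: float) -> float:
    """Duchi et al.'s extreme-value perturbation for x in [-1, 1]:
    report +b or -b, where b = (e^eps + 1)/(e^eps - 1)."""
    b = (math.exp(eps) + 1) / (math.exp(eps) - 1)
    # probability of reporting +b; chosen so that E[output] = x (unbiased)
    p = (math.exp(eps) - 1) / (2 * math.exp(eps) + 2) * x + 0.5
    return b if random.random() < p else -b
```

Averaging many such reports recovers the true value, at the cost of the O(((e^ϵ + 1)/(e^ϵ − 1))^2) worst-case variance discussed above.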
The solutions mentioned above are used to collect one-dimensional numerical data and estimate the mean. For multidimensional numerical data, a one-dimensional solution cannot simply be applied repeatedly to each attribute due to the curse of dimensionality. Duchi et al. [9] proposed a multidimensional extreme value perturbation solution that randomly perturbs the user's real data x ∈ [−1, 1]^d to y ∈ [−B, B]^d, where B is a constant related to ϵ. Based on Duchi's solution, Nguyên et al. [14] proposed the Harmony solution, which randomly selects one attribute for perturbation and uploads it instead of all attributes. The Harmony solution has a smaller communication cost while achieving the same variance as Duchi's solution. Wang et al. [15] adjusted the probability of Duchi's solution so that it satisfies (ϵ, δ)-LDP with a smaller variance. Wang et al. [12] extended PM to multiple dimensions by random sampling, achieving a smaller variance than Duchi's solution, especially when ϵ is large. In addition, Akter and Hashem [16] proposed personal local differential privacy and extended Duchi's solution.
In summary, existing solutions for multidimensional numerical data collection and mean estimation perform well in terms of data availability but fall short in terms of users' personalized privacy needs. Our goal is to design a solution that ensures overall usability and security while also meeting the personalized privacy needs of users.

Preliminaries
3.1. Differential Privacy. Differential privacy (DP) is a definition of privacy tailored to the problem of privacy-preserving data analysis [17]. In a DP solution, the users upload real data to the server. For any analysis algorithm, the server perturbs the real output before publishing it, ensuring that an attacker cannot infer any real data about the users. DP is defined as Definition 1.
Definition 1 (Differential privacy (DP) [18]). A randomized mechanism M satisfies ϵ-differential privacy if and only if, for all datasets D1 and D2 differing on at most one element and all S ⊆ Range(M), it holds that

Pr[M(D1) ∈ S] ≤ e^ϵ · Pr[M(D2) ∈ S]. (1)

Local Differential Privacy.
Local differential privacy (LDP) is a variant of DP. Unlike DP, LDP ensures that an attacker cannot infer the user's real data, even if the server is untrustworthy. LDP is defined as Definition 2.
Definition 2 (Local differential privacy (LDP) [19]). A randomized mechanism M satisfies ϵ-local differential privacy if and only if, for any pair of inputs x, x′ and any possible output y ∈ Range(M), it holds that

Pr[M(x) = y] ≤ e^ϵ · Pr[M(x′) = y].

Since LDP is based on differential privacy theory, it inherits the composability property of the latter [20], as shown in Theorem 1.

Theorem 1 (Sequential composition of LDP [21]). Suppose a randomized mechanism M consists of d independent randomized mechanisms M_1, M_2, . . ., M_d, where M_j satisfies ϵ_j-LDP. Then M satisfies (∑_{j=1}^{d} ϵ_j)-LDP. The proof appears in Appendix.
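As a concrete illustration of Definition 2, the classic randomized response mechanism for a single bit satisfies ϵ-LDP (a minimal sketch with our own function name): it keeps the true bit with probability e^ϵ/(e^ϵ + 1), so the ratio of output probabilities for any two inputs is bounded by e^ϵ.

```python
import math
import random

def randomized_response(bit: int, eps: float) -> int:
    """Randomized response for a bit in {0, 1}.
    Keeps the true bit with probability e^eps/(e^eps + 1) and flips it
    otherwise, which satisfies eps-LDP for binary inputs."""
    p_keep = math.exp(eps) / (math.exp(eps) + 1)
    return bit if random.random() < p_keep else 1 - bit
```

For example, with ϵ = ln 3 the true bit is kept with probability 3/4, and Pr[output = y | input = 1] / Pr[output = y | input = 0] never exceeds 3 = e^ϵ.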

Collecting Multidimensional Numerical Data and Estimating Mean
In this section, we first discuss the existing optimal solution, MPM. After that, we point out the shortcomings of MPM and propose our solution, PMPM. Some notations used in this paper are listed in Table 1.

MPM: Existing Optimal Solution
4.1.1. One-Dimensional Numerical Data. Wang et al. [12] proposed a randomized mechanism called the piecewise mechanism (PM), which is mainly used to collect one-dimensional numerical data and estimate the mean. It takes a value x_i ∈ [−1, 1] as input and outputs a perturbed value y_i ∈ [−C, C], where C = (e^{ϵ/2} + 1)/(e^{ϵ/2} − 1). The probability density function of y_i is a piecewise constant function: pdf(y_i = y | x_i) = p for y ∈ [l(x_i), r(x_i)] and p/e^ϵ for y ∈ [−C, l(x_i)) ∪ (r(x_i), C], where p = (e^ϵ − e^{ϵ/2})/(2e^{ϵ/2} + 2), l(x_i) = ((C + 1)/2)·x_i − (C − 1)/2, and r(x_i) = l(x_i) + C − 1. The pseudocode is described in Algorithm 1. Wang et al. [12] proved that PM satisfies ϵ-LDP and that, given an input x_i, PM returns an output y_i with E[y_i] = x_i and Var[y_i] = (x_i)^2/(e^{ϵ/2} − 1) + (e^{ϵ/2} + 3)/(3(e^{ϵ/2} − 1)^2).

4.1.2. Multidimensional Numerical Data. Then, Wang et al. [12] proposed a multidimensional piecewise mechanism (MPM) based on PM, which is mainly used to collect multidimensional numerical data and estimate the mean. The pseudocode is described in Algorithm 2. Wang et al. [12] proved that MPM satisfies ϵ-LDP and that, given an input x_{i,j}, MPM returns an unbiased output y_{i,j}. By minimizing the worst-case variance of y_{i,j}, they set the sampling dimension to k = max(1, min(d, ⌊ϵ/2.5⌋)). In addition, letting Y_j = (1/n)∑_{i=1}^{n} y_{i,j} and X_j = (1/n)∑_{i=1}^{n} x_{i,j}, Wang et al. [12] proved that Y_j is an unbiased estimator of X_j and that, with at least 1 − β probability, |Y_j − X_j| = O(√(d·log(d/β))/(ϵ·√n)), which is asymptotically optimal [9].
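Based on the formulas for C, l(x), r(x), and p above, PM can be sketched in Python as follows (a minimal sketch; the function name and the concrete sampling strategy are our own, not taken from [12]):

```python
import math
import random

def piecewise_mechanism(x: float, eps: float) -> float:
    """Piecewise mechanism (PM) for x in [-1, 1]: outputs y in [-C, C]
    with density p inside [l(x), r(x)] and p/e^eps outside, so E[y] = x."""
    e_half = math.exp(eps / 2)
    C = (e_half + 1) / (e_half - 1)
    l = (C + 1) / 2 * x - (C - 1) / 2       # left end of the high-density band
    r = l + C - 1                           # right end; band width is C - 1
    p = (math.exp(eps) - e_half) / (2 * e_half + 2)
    # total probability mass inside the band is p * (C - 1)
    if random.random() < p * (r - l):
        return random.uniform(l, r)
    # otherwise sample uniformly from [-C, l) union (r, C]
    left_len, right_len = l - (-C), C - r
    u = random.uniform(0, left_len + right_len)
    return -C + u if u < left_len else r + (u - left_len)
```

One can check that the two pieces integrate to 1: p·(C − 1) + (p/e^ϵ)·(C + 1) = 1 for every ϵ > 0.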

System Model.
Our solution mainly applies to data collection in crowdsourcing mode. An untrusted server wants to collect multidimensional data from each user. To protect privacy, the users perturb the data on each client. Then, the users upload the perturbed data to the server for use by third parties. Our goal is to personalize the perturbation while keeping the availability of the perturbed data as high as possible. Specifically, the system model consists of a server and n users. Each user has data with d attributes. For each record, the user reduces its dimension from d to k by random sampling. For the k sampled attributes, the user allocates a privacy budget to each attribute and then perturbs each attribute according to the allocated privacy budget. After that, each user uploads the perturbed data to the server. The server aggregates the dataset, calculates the mean for each attribute, and publishes it. The system model of our solution is shown in Figure 1.
Example 1. For simplicity, we consider n = 4, d = 2, and k = 1. The simplified system model is shown in Figure 2.

Personalized Local Differential Privacy.
In MPM, the privacy budget is evenly allocated to each sampled attribute. However, users have different sensitivities to different attributes, which we call personalized privacy needs. Even allocation results in weak privacy protection for some attributes and low availability for others. To solve the problem of personalized privacy needs, we propose a new concept called personalized local differential privacy (PLDP). The specific definition is shown in Definition 3.
Definition 3 (Personalized local differential privacy (PLDP)). A randomized mechanism M satisfies (ϵ, τ)-personalized local differential privacy if and only if it allocates the total privacy budget ϵ to d attributes according to the user's needs, with each attribute's budget within [ϵ/(τd), (1 + (τ − 1)d)ϵ/(τd)], and, for any pair of inputs x, x′ and any possible output y ∈ Range(M), it holds that

Pr[M(x) = y] ≤ e^ϵ · Pr[M(x′) = y],

where τ is a personalization parameter within [1, +∞). The value of τ determines the range of privacy budget allocation: the larger τ is, the more personalized the randomized mechanism is; the smaller τ is, the less personalized it is. When τ is 1, the definition is equivalent to allocating the privacy budget equally. We will introduce the specific settings of τ in combination with experiments in Section 5.3.
PLDP is an extended version of LDP. It adds a privacy budget allocation condition, which we call personalized privacy budget allocation. In the personalized privacy budget allocation, users can allocate the privacy budget to each attribute within a certain range according to their own personalized needs. Specifically, since τ, ϵ, and d are given, the users have a definite allocation range [ϵ/(τd), (1 + (τ − 1)d)ϵ/(τd)]. Then, users can allocate privacy budgets to each attribute at will within this range. It should be noted that the sum of the allocated privacy budgets must be equal to ϵ.
Here, we give a simple allocation method. First, we divide attributes into important attributes and unimportant attributes. Then, we allocate the smallest privacy budget ϵ/(τd) to each unimportant attribute. Finally, we evenly allocate the remaining privacy budget to the important attributes.
For example, suppose ϵ = 10, τ = 1.5, and d = 2, with the two attributes age and gender, so that the allocation range is [10/3, 20/3]. Since we think age is important and gender is not, we allocate the privacy budget 20/3 to age and 10/3 to gender. The relationship between LDP and PLDP can be summarized as follows. If a randomized mechanism M satisfies (ϵ, τ)-PLDP, then it satisfies ϵ-LDP. If a randomized mechanism M satisfies ϵ-LDP and personalized privacy budget allocation, then it satisfies (ϵ, τ)-PLDP.
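The simple allocation method above can be sketched as follows (a hypothetical helper under our own naming; it assumes at least one important attribute):

```python
def allocate_budget(eps: float, tau: float, important) -> list:
    """Personalized budget allocation: give each unimportant attribute the
    minimum budget eps/(tau*d) and split the remainder evenly among the
    important attributes. `important` is a list of booleans, one per attribute,
    and must contain at least one True."""
    d = len(important)
    lo = eps / (tau * d)                 # lower bound of the allocation range
    n_unimp = important.count(False)
    n_imp = d - n_unimp
    rest = (eps - n_unimp * lo) / n_imp  # even share for important attributes
    return [rest if imp else lo for imp in important]
```

For the worked example (ϵ = 10, τ = 1.5, age important, gender not), `allocate_budget(10, 1.5, [True, False])` yields 20/3 for age and 10/3 for gender, summing to ϵ.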

Personalized Multidimensional Piecewise Mechanism.
In MPM, the sampling dimension k is obtained by minimizing the worst-case variance. However, the worst-case scenario is usually rare. As a result, the variance over the overall data is not optimal. To solve this problem, we obtain the sampling dimension k by minimizing the average variance. Since the variance needs to be calculated, we first present our solution, the personalized multidimensional piecewise mechanism (PMPM). The pseudocode is described in Algorithm 3.
PMPM meets the personalized privacy needs of users while ensuring overall security and availability. Theorem 2 ensures the security of PMPM.
Theorem 3. Given an input x_i, PMPM returns an output y_i such that, for any sampled value j, E[y_{i,j}] = x_{i,j}. The proof appears in Appendix. Theorem 4 ensures the availability of the perturbed data.

Theorem 4. Given an input x_i, PMPM returns an output y_i such that, for any sampled value j,

Var[y_{i,j}] = (d/k)·[(x_{i,j})^2/(e^{ϵ_{i,j}/2} − 1) + (e^{ϵ_{i,j}/2} + 3)/(3(e^{ϵ_{i,j}/2} − 1)^2)] + (d/k − 1)·(x_{i,j})^2.
The proof appears in Appendix. By Theorem 4, we obtain the average variance of y_{i,j} by setting (x_{i,j})^2 to E[(x_{i,j})^2] and ϵ_{i,j} to E[ϵ_{i,j}]. E[(x_{i,j})^2] is related to the distribution of the attributes. However, the distribution of each attribute is different. To obtain a definite value, we must assume that all attributes have the same distribution. Finally, in order to be as close to the real value as possible and to facilitate calculation, we assume that all attributes are uniformly distributed on [−1, 1], so that E[(x_{i,j})^2] = 1/3. In addition, we assume that each attribute is valued by the same number of users, so that E[ϵ_{i,j}] = ϵ/k. Thus, the average variance of y_{i,j} is

AverageVar[y_{i,j}] = (d/(3k))·[1/(e^{ϵ/2k} − 1) + (e^{ϵ/2k} + 3)/(e^{ϵ/2k} − 1)^2] + (d/k − 1)/3.

We plot ϵ·(AverageVar[y_{i,j}] + 1/3)/d with respect to k/ϵ (see Figure 3). It is observed that ϵ·(AverageVar[y_{i,j}] + 1/3)/d is roughly minimized when k/ϵ = 0.28 (i.e., k = 0.28ϵ). Thus, to minimize the average variance of y_{i,j}, we set the optimal value of k to

k = max(1, min(d, ⌊0.28ϵ⌋)).
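The minimization above can be checked numerically. The sketch below (our own code, under the stated uniform-distribution and equal-valuation assumptions) evaluates ϵ·(AverageVar[y_{i,j}] + 1/3)/d as a function of u = k/ϵ, which depends on k and ϵ only through their ratio, and locates its minimizer by grid search:

```python
import math

def scaled_avg_var(u: float) -> float:
    """eps * (AverageVar[y_ij] + 1/3) / d as a function of u = k/eps,
    assuming uniformly distributed attributes and an average per-attribute
    budget of eps/k. Depends only on the ratio u."""
    a = 1 / (2 * u)                      # a = eps / (2k)
    e = math.exp(a)
    return (1 / (3 * u)) * (1 / (e - 1) + (e + 3) / (e - 1) ** 2 + 1)

# grid search over the ratio k/eps for the minimizer
us = [i / 1000 for i in range(100, 601)]
u_opt = min(us, key=scaled_avg_var)
```

Running the grid search places the minimizer near u = 0.28, matching the choice k = ⌊0.28ϵ⌋.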

Estimating Mean.
After the server aggregates the perturbed data of all users, it can estimate the mean of each attribute. Suppose there are n users and the jth attribute of the perturbed data of the ith user is represented as y_{i,j}; then the server calculates Y_j = (1/n)∑_{i=1}^{n} y_{i,j} to estimate the mean of each attribute. Note that when y_{i,j} does not exist (i.e., the attribute was not sampled), it is skipped.
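The server-side estimator can be sketched as follows (our own illustration: each user's report is modeled as a map from sampled attribute indices to perturbed values, so unsampled attributes contribute nothing to the sum while the divisor remains n, matching Y_j = (1/n)∑ y_{i,j}):

```python
def estimate_means(reports, d: int) -> list:
    """Estimate the mean of each of d attributes from n user reports.
    Each report is a dict {attribute index: perturbed value} containing
    only the sampled attributes; missing entries are skipped (count as 0),
    and every sum is divided by the total number of users n."""
    n = len(reports)
    sums = [0.0] * d
    for rep in reports:
        for j, y in rep.items():
            sums[j] += y
    return [s / n for s in sums]
```

Because each y_{i,j} already carries the d/k scaling from the mechanism, dividing by n (rather than by the number of users who sampled attribute j) keeps the estimator unbiased.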
Theorem 5 ensures the accuracy of the estimated mean, which is asymptotically optimal [9].

Theorem 5. For any j, with at least 1 − β probability, |Y_j − X_j| = O(√(d·log(d/β))/(ϵ·√n)).

The proof appears in Appendix.

Evaluation
We implemented the proposed solution and compared it with two existing solutions (Duchi's solution [9] and MPM solution [12]) on two real datasets.

Datasets.
We conducted experiments on two real datasets with the parameters shown in Table 2. Census 2015 [22] and Census 2017 [22] include census data for all counties in 2015 and 2017, respectively. They contain statistics on each county's citizens, such as "TotalPop," "Men," and "White." To simplify the experiment, we removed the category attributes and the numerical attributes that lacked data.

Evaluation Metrics.
As in much previous work, we evaluate the performance of a solution by the mean square error (MSE), which is defined as

MSE = (1/d)·∑_{j=1}^{d} (Y_j − X_j)^2,

where d is the number of attributes of the data, Y_j is the estimated mean of the jth attribute, and X_j is the true mean of the jth attribute.
In addition, since Y_j is a random variable, to make the evaluation more accurate, we repeated the MSE calculation 100 times and took the average value as the final result.
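The metric can be sketched as follows (our own helper names):

```python
def mse(est, true) -> float:
    """Mean square error across the d attribute means:
    (1/d) * sum_j (Y_j - X_j)^2."""
    d = len(true)
    return sum((y - x) ** 2 for y, x in zip(est, true)) / d

def averaged_mse(run_once, true, repeats: int = 100) -> float:
    """Repeat a full estimation run `repeats` times and average the MSE,
    since each run's estimate is a random variable."""
    return sum(mse(run_once(), true) for _ in range(repeats)) / repeats
```

Here `run_once` stands for one complete collect-and-estimate pass returning the d estimated means.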

Personalization Parameter.
Different values of τ determine different privacy budget allocation ranges, and different privacy budgets correspond to different probability ratios e^ϵ in differential privacy. By setting the value of τ, we control the maximum and minimum values of the privacy budget allocation and can thus control the maximum difference in probability ratio e^ϵ between two attributes (i.e., the degree of personalization). To show this more intuitively, we set τ to 1.125, 1.25, 1.375, and 1.5 and observe the difference. Figure 4 shows the maximum difference in probability ratio e^ϵ between two attributes for the different personalization parameters τ.
Through experimentation, we can set the value of τ according to the degree of personalization we need. However, the value of τ should not be too large; otherwise, a too small privacy budget will lead to large errors.
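Since the allocation range is [ϵ/(τd), (1 + (τ − 1)d)ϵ/(τd)], the largest possible gap between two attributes' budgets is (τ − 1)ϵ/τ (independent of d), so the maximum difference in probability ratio between two attributes is e^{(τ−1)ϵ/τ}. A one-line sketch (our own helper name):

```python
import math

def max_ratio_gap(eps: float, tau: float) -> float:
    """Largest possible ratio e^{eps_max} / e^{eps_min} between the
    probability ratios of two attributes, given total budget eps and
    personalization parameter tau. Equals e^{(tau - 1) * eps / tau}."""
    return math.exp((tau - 1) * eps / tau)
```

At τ = 1 the gap is 1 (no personalization), and it grows monotonically with τ, matching the trend shown in Figure 4.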

Comparison with Existing Solutions under Different
Privacy Budgets. When the privacy budget is greater than or equal to 7.143, k is at least 2 and a personalized privacy budget allocation exists. Therefore, to evaluate PMPM more accurately, we set different privacy budgets from 8 to 14.

Algorithm 3: Personalized multidimensional piecewise mechanism (PMPM).
Input: x_i ∈ [−1, 1]^d: the ith raw data; ϵ: the total privacy budget; τ: the personalization parameter.
Output: y_i: the perturbed data.
  Sample k values uniformly without replacement from {1, 2, . . ., d};
  Let y_i = ⟨0, 0, . . ., 0⟩;
  for each sampled value j do
    Allocate ϵ to attribute A_j according to the user's needs within [ϵ/(τk), (1 + (τ − 1)k)ϵ/(τk)] and get ϵ_{i,j};
    Feed x_{i,j} and ϵ_{i,j} as input to PM and obtain a noisy value t_{i,j};
    y_{i,j} = (d/k) · t_{i,j};
  return y_i

We compare our solution with the MPM solution [12] on Census 2015 and Census 2017 under different privacy budgets, and the results are shown in Figures 5 and 6.
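Algorithm 3 can be sketched end-to-end in Python (our own illustrative code combining PM, the simple important/unimportant budget split from Section 4, and the sampling dimension k = max(1, min(d, ⌊0.28ϵ⌋)); all names are our own):

```python
import math
import random

def pm(x: float, eps: float) -> float:
    """Piecewise mechanism on [-1, 1] (after Wang et al. [12])."""
    e_half = math.exp(eps / 2)
    C = (e_half + 1) / (e_half - 1)
    l = (C + 1) / 2 * x - (C - 1) / 2
    r = l + C - 1
    p = (math.exp(eps) - e_half) / (2 * e_half + 2)
    if random.random() < p * (r - l):
        return random.uniform(l, r)
    u = random.uniform(0, (l + C) + (C - r))
    return -C + u if u < l + C else r + (u - (l + C))

def pmpm(x, eps: float, tau: float, important):
    """One user's PMPM report: sample k = max(1, min(d, floor(0.28*eps)))
    attributes, allocate per-attribute budgets within the personalized
    range (minimum eps/(tau*k) to unimportant attributes, remainder split
    evenly among important ones), perturb each with PM, and scale by d/k.
    Returns {attribute index: perturbed value} for the sampled attributes."""
    d = len(x)
    k = max(1, min(d, int(0.28 * eps)))
    sampled = random.sample(range(d), k)
    lo = eps / (tau * k)
    n_imp = sum(important[j] for j in sampled)
    if n_imp in (0, k):
        # all sampled attributes equally (un)important: equal split
        budgets = {j: eps / k for j in sampled}
    else:
        rest = (eps - (k - n_imp) * lo) / n_imp
        budgets = {j: (rest if important[j] else lo) for j in sampled}
    return {j: (d / k) * pm(x[j], budgets[j]) for j in sampled}
```

With n_imp = 1 the important attribute receives exactly the upper bound (1 + (τ − 1)k)ϵ/(τk), so the split always stays inside the personalized range and the budgets sum to ϵ.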
The experimental results show that the MSE of our solution is always lower than that of the MPM solution when τ is 1.125, 1.25, or 1.375. When τ is 1.5, the MSE of our solution is sometimes higher than that of the MPM solution. In addition, both our solution and the MPM solution experience a rebound in MSE as the privacy budget increases.
Through our analysis, we believe the reasons for the above-mentioned gaps are as follows: (1) Our k value is better. In our solution, the value of k is obtained by minimizing the average variance; in the MPM solution, it is obtained by minimizing the worst-case variance. In fact, the worst-case scenario is rare. Therefore, our k value is better, which leads to a smaller overall mean square error. (2) An increase in the value of τ leads to an increase in MSE. As τ increases, the range of allocated personalized privacy budgets expands, which leads to the emergence of smaller privacy budgets. The smaller the privacy budget, the larger the error, so there are cases where our solution has a higher MSE. (3) Rounding down. In determining the value of k, we round down, which biases the resulting k value; our k value may be the same under different privacy budgets. When the k value is too small, the MSE rebounds.
Then, we compare our solution with Duchi's solution [9] on Census 2015 and Census 2017 under different privacy budgets, and the results are shown in Figures 7 and 8.
The experimental results show that the MSE of our solution is much lower than that of Duchi's solution under different privacy budgets.
Through our analysis, we believe the reason for the above-mentioned gap is as follows: (1) Random sampling. Our solution reduces the dimensionality of high-dimensional data by random sampling, while Duchi's solution does not process high-dimensional data. Due to the curse of dimensionality, the mean square error of our solution is much lower than that of Duchi's solution.

Comparison with Existing Solutions under Different
Dimensions. By removing some attributes, we set the two datasets to different dimensions and compare our solution with the MPM solution [12] on Census 2015 and Census 2017; the results are shown in Figures 9 and 10. The experimental results show that the MSE of our solution is lower than that of the MPM solution under different dimensions. In addition, a larger value of k does not necessarily result in a larger MSE for our solution.
Through our analysis, we believe the reasons for the above-mentioned gaps are as follows: (1) Our k value is better. Therefore, the MSE of our solution is lower than that of the MPM solution under different dimensions.
(2) Privacy budgets are randomly allocated. The larger the value of τ, the larger the range of privacy budget allocation. However, since we randomly allocated the privacy budget within the range during the experiment, the MSE is not necessarily large for a large τ value.
Then, we compared our solution with Duchi's solution [9] on Census 2015 and Census 2017 under different dimensions, and the results are shown in Figures 11 and 12.
The experimental results show that the MSE of our solution is much lower than that of Duchi's solution under different dimensions. In addition, the MSE of Duchi's solution increases as the dimensionality increases. Through our analysis, we believe the reasons for the above-mentioned gaps are as follows: (1) Random sampling. Our solution handles high-dimensional data by random sampling, so its mean square error is lower. (2) Curse of dimensionality. Duchi's solution does not effectively handle high-dimensional data, and thus its mean square error increases rapidly as the dimensionality increases.

Conclusions
In this paper, we first designed a personalized privacy budget allocation within a certain range and proposed a new privacy standard, personalized local differential privacy (PLDP). Then, we optimized the sampling dimension of the existing solution MPM by minimizing the average variance. Finally, we designed a personalized multidimensional piecewise mechanism (PMPM) based on the above research. In addition, we validated the superiority of our solution by comparing it with existing solutions on two real datasets. Overall, our solution has a smaller mean square error while meeting the personalized privacy needs of users. However, our solution is only applicable to collecting multidimensional numerical data and estimating the mean; we have not yet extended it effectively to categorical data. On the other hand, to be closer to the true distribution, we obtain k by assuming that all attributes are uniformly distributed. Better results could be achieved if the true distribution were available.

Appendix
Proof of Theorem 1. For any pair of inputs x, x′ and output y = (y_1, y_2, . . ., y_d), by Definition 2, we have

Pr[M(x) = y] = ∏_{j=1}^{d} Pr[M_j(x) = y_j] ≤ ∏_{j=1}^{d} e^{ϵ_j}·Pr[M_j(x′) = y_j] = e^{∑_{j=1}^{d} ϵ_j}·Pr[M(x′) = y].
This completes the proof.

Proof. From Algorithm 3, PMPM composes k instances of PM. Since E[y_i] = x_i in PM, for any sampled value j we have E[y_{i,j}] = (k/d)·(d/k)·x_{i,j} = x_{i,j}. This completes the proof.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.