Limiting Privacy Breaches in Average-Distance Query

Querying average distances is useful for real-world applications such as business decisions and medical diagnosis, as it can help a decision maker better understand the users' data in a database. However, privacy has been an increasing concern. People now suffer serious privacy leakage from various kinds of sources, especially service providers that insufficiently protect users' private data. In this paper, we discover a new type of attack on the average-distance query (AVGD query) with noisy results. The attack is general in that it can be used to reveal private data of different dimensions. We theoretically analyze how different factors affect the accuracy of the attack and propose a privacy-preserving mechanism based on the analysis. We experiment on two real-life datasets to show the feasibility and severity of the attack. The results show that the severity of the attack is mainly influenced by factors including the noise magnitude, the number of queries, and the number of users in each query. We also validate the correctness of our theoretical analysis by comparing it with the experimental results and confirm the effectiveness of the privacy-preserving mechanism.


Introduction
Nowadays, a major concern in modern society is the leakage of private information, e.g., health conditions and locations. Reports show that healthcare data breaches in 2018 resulted in the exposure of 13 million healthcare records [1]. Breaches in the US healthcare field cost $6.2 billion each year [2]. Meanwhile, the disclosure of personal location data can cause serious issues: e.g., by analyzing the semantic information (e.g., hospital and church) of users' locations, sensitive information such as home address, health status, and religious faith may be revealed [3,4]. On the other hand, users' data are valuable when queried for public usage. Therefore, privacy-preserving mechanisms are often used during query processing [5]. However, it is still questionable whether existing mechanisms are sufficient for privacy.
We focus on the average-distance query (AVGD query) [6][7][8], which serves as a basic component of several applications such as business decision and medical diagnosis [8]. Given a database containing the data of users, an AVGD query takes a query point and a set of users as inputs and returns the average distance between the query point and the data of those users. The query point is a value located in the same space as the users' data. To the best of our knowledge, privacy breaches in the AVGD query have not been studied yet, and user privacy needs attention during such queries. In this paper, we study privacy breaches in the AVGD query and propose a new kind of attack. As we find, an adversary can mount an attack by selecting the user sets, querying on specific query points, and revealing users' data by leveraging the query results.
Both of the following examples of the AVGD query, which correspond to different dimensions of query points, are exposed to the attack. Example 1: a company plans to deploy a new branch, whose location must be carefully chosen according to several factors, e.g., traveling time. In this case, a location-based service [8] benefits the choosing process by outputting the average distance between local users and a location input by the company. Here, the query point is the coordinate of the location. Example 2: medical data can be clustered to predict the likelihood of diseases. The average distance between the elements in a cluster and the cluster centroid and the average distance between the elements of two clusters are basic measurements in clustering approaches. We consider the following one-dimensional case: given the medical data (e.g., blood pressure) of a group of patients and a query point (e.g., the cluster centroid), a hospital queries the average distance between the data of each patient and the query point. Here, the query point is a real-valued medical datum, and the absolute value distance is the distance measure. To perform the attack, an adversary selects two user sets, which differ in one target user. Then, the adversary chooses a set of query points and gets the results queried on these query points and the two user sets. The adversary can set up equations based on the correlations between the results and recover the data of the target user. The detailed process of the attack is described in Section 3.
Nevertheless, most of the existing techniques for privacy-preserving query processing cannot prevent this kind of attack. K-anonymity-based approaches [9][10][11] are suitable for publishing or querying a single user's record, while the AVGD query returns aggregate results. Other methods like data transformation [12,13] and homomorphic encryption [8,14] avoid transmitting user data in plain text. They prevent the exposure of the original user data during the processing of the query, but they do not consider the information leakage from the query results.
Noise-based output perturbation [5] (e.g., differential privacy with Laplace noise [15]) is a possible method to prevent the attack. However, as we demonstrate in this paper, the effectiveness of the noise may be weakened under the attack. Too little noise may not provide enough protection, while too much noise destroys the utility of the results. Thus, the amount of noise should be quantified to balance utility and privacy. We aim to decide the minimum amount of noise that meets the desired privacy requirement. Previous works [16][17][18] have studied attacks on queries with noisy outputs, but they all focused on sum queries or questions based on sum queries. They showed lower bounds on the noise (as a function of the number of queries) needed to prevent violations of privacy. To the best of our knowledge, we are the first to investigate attacks on AVGD queries. Instead of giving a lower bound directly, we formalize the process of the attack and obtain explicit expressions for the uncertainty of an adversary's estimation under different conditions, which serve as a guide on the amount of noise to be added. As we further find, the uncertainty is affected by several factors during the queries, e.g., the number of users in each query, the data value of the target user, and the number of queries. Finally, we propose a privacy-preserving mechanism based on the analysis.
We evaluate the proposed attack on two real-world datasets: a one-dimensional medical dataset and a two-dimensional location dataset. The results show the feasibility and severity of such attacks. We compare the experimental results with our theoretical analysis under different factors and show the correctness of the theoretical analysis, which guarantees the effectiveness of the corresponding privacy-preserving mechanism. The contributions of this paper are as follows: (i) we discover a new kind of attack on the AVGD query, in which the adversary can recover a target user's data by analyzing query results; (ii) we perform a detailed theoretical analysis of the attack and propose a privacy-preserving mechanism based on the analysis; (iii) we evaluate the attack on real-world datasets, showing the severity of the attack and the effectiveness of our theoretical analysis. The rest of this paper is organized as follows. Section 2 reviews the relevant work on the average-distance query and privacy-preserving query processing. Section 3 gives the formal definition of the AVGD query, details the proposed attack, and describes the goal of this paper. Section 4 theoretically analyzes the uncertainty of the adversary's estimation. Section 5 gives a privacy-preserving mechanism based on the theoretical analysis. Section 6 conducts the attack on real-world datasets and gives the experimental results. Section 7 concludes this work and discusses future directions.

Related Work

Average-Distance Query.
The average distance serves as a basic metric in several real-life applications. Scellato et al. [6] introduce the average distance as one of the metrics for the spatial properties of social networks. Armenatzoglou et al. [7] use the average distance as a metric for ranking users in a geosocial network. Yang et al. [19] seek to find a group of attendees for an impromptu activity such that the average spatial distance between each attendee and the rally point is minimized under a restriction of the social relationship. The average distance metric is also used for the problem of min-dist optimal location selection [8,20,21]: given a set of existing facilities and a set of users, a location for a new facility is found such that the average distance between each user and his/her nearest facility is minimized.

Privacy-Preserving Query Processing.
Several approaches have been proposed for privacy-preserving queries. K-anonymity-based approaches [9] mix the target user's data with k − 1 other users' data such that an adversary cannot distinguish the target user from these k users. L-diversity [10] and t-closeness [11] are proposed to enhance the definition of k-anonymity. The attack proposed in this paper leverages the outputs of AVGD queries, which are aggregate information, while k-anonymity-based approaches preserve privacy when publishing or querying a single user's data.
Approaches based on data transformation [12,13] transform the user's data before sending queries to the server. The server answers queries based on the transformed data rather than the original data. Some approaches use cryptography [22][23][24], like homomorphic encryption [8,14], to encrypt the data of the users and enable queries on encrypted data. These approaches avoid transmitting the original user data during the processing of the query, thus protecting user privacy, but they do not consider the privacy leakage from the query results, which motivates the attack proposed in this paper.
Differential privacy [15] is proposed to protect the privacy of individuals while releasing aggregate outputs. For two neighboring databases, differential privacy requires that query results on these two databases be indistinguishable up to an exponential factor. One way to achieve differential privacy is to add controlled noise (e.g., Laplace noise [15]) to the query results. Applying differential privacy to the AVGD query is possible, but it is essentially a noise-based method, and the amount of noise should be carefully chosen under different factors, as shown in this paper.
Dinur and Nissim [16] derived lower bounds for the noise needed in sum queries to prevent violations of privacy, under the assumptions of unbounded and bounded adversaries. Several works [17,18,25] further improved the study with more general databases and more efficient attacks. Blum et al. [26] proposed a practical framework to preserve privacy where the number of queries is limited to be sublinear in the size of the database. All these works focused on sum queries or questions based on them.

Problem Formulation
In this section, we give the formal definition of the AVGD query. Then, we show the proposed attack on the AVGD query and describe the goal of this paper.

System Model of Average-Distance Query.
The system model of the AVGD query is as follows. There is a server that provides a service of the AVGD query and a client that requests queries. The server has a database which contains the records of users. We assume that all these records are real valued. To run an AVGD query, the client needs to specify a set of users in the database. One way to specify the users is to use an identifier shared between the server and the client, e.g., the phone number. The client can also specify the users based on some of their attributes, e.g., patients who show specific symptoms in a healthcare database. The AVGD query is formalized as follows: (1) the client requests a query by choosing a user set U and a query point p; (2) the server answers the query by computing the average distance Avg(U, p) = Σ_{u∈U} d(t_u, p)/|U|. Here, t_u denotes the data of u, and d(t_u, p) is the distance between p and t_u. The implementation of d depends on the application. For example, it can be the absolute distance in a one-dimensional case or the spatial distance when the data are two-dimensional location points.
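As a concrete sketch of the query above (illustrative Python, not the paper's implementation; the user-id scheme is a placeholder, and the Euclidean norm stands in for the application-specific d):

```python
import numpy as np

def avgd_query(data, user_ids, p):
    """AVGD query: Avg(U, p) = sum_{u in U} d(t_u, p) / |U|.
    `data` maps a user id to that user's record; records may be
    scalars (1-D case) or fixed-length vectors (2-D case)."""
    pts = np.array([data[u] for u in user_ids], dtype=float)
    pts = pts.reshape(len(user_ids), -1)      # unify 1-D and multi-D records
    p = np.asarray(p, dtype=float).reshape(-1)
    return float(np.linalg.norm(pts - p, axis=1).mean())
```

For one-dimensional records the Euclidean norm reduces to the absolute distance, matching the two cases studied later.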

Adversary Model.
The client is considered to be "semihonest." The client will follow the protocol correctly to request queries. He/she will not break into the server system to obtain users' information illegally. However, the client may obtain extra information by analyzing the query results. The server is considered to be honest; it is aware of the exact data value of each user but has to hide users' data from the client.

Attack on the AVGD Query.
An adversary can mount an attack as shown in Figure 1. He/she chooses two groups of users U_1 and U_2 that differ in one user (the differing user can be targeted based on attributes when the adversary knows enough partial information [18]): U_1 = {u_1, u_2, ..., u_k} and U_2 = U_1 ∪ {u_0}. We assume that the adversary is not aware of the value of k. Then, the adversary chooses query points p_1, p_2, ..., p_n. For each p_i, he/she gets the query results a_1i = Avg(U_1, p_i) and a_2i = Avg(U_2, p_i). The equation a_2i = (k·a_1i + d(t_0, p_i))/(k + 1) holds, where t_0 denotes the data of u_0. Then, the adversary can solve equation (1) to recover t_0. The goal of this paper is to quantify the effect caused by such attacks and to design the corresponding noise-based protection schemes. Specifically, in the protection scheme, we assume that the server adds noise to the query results. We consider two kinds of noise: multiplicative noise and additive noise. The multiplicative noise is represented as a random variable z, and thus the noisy output is a_mul = (1 + z)a. We assume that z obeys a Gaussian distribution, which has been used for statistical disclosure control [5]. The additive noise is represented as a random variable ã, and the noisy output is a_add = a + ã. We assume that ã obeys a Laplace distribution, which is one of the means to provide differential privacy [15]. Both kinds of noise are assumed to have a mean value of 0 to avoid bias [5]. Note that we aim to present the attack with different kinds of noise. Enabling a differential privacy mechanism for the AVGD query is out of the scope of this paper, and we do not require the additive noise with Laplace distribution to satisfy the demands of differential privacy. We analyze how the noise affects the adversary's estimation of the user's data value and the utility of the query results.
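The attack can be sketched end-to-end in the noiseless one-dimensional case (illustrative Python; the dataset, the target value, and the solver starting points are assumptions, and nonlinear least squares plays the role of solving equation (1)):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
data = rng.normal(81.0, 11.0, size=50)    # U1: k = 50 records (hypothetical)
t0 = 95.0                                 # secret value of the target user u0
U1, U2 = data, np.append(data, t0)        # U2 = U1 ∪ {u0}

def avg(vals, p):                         # Avg(U, p) with absolute distance
    return np.abs(vals - p).mean()

# Query points inside the data range: the kink of |t0 - p| is what
# makes k and t0 jointly identifiable.
ps = np.linspace(50.0, 140.0, 40)
a1 = np.array([avg(U1, p) for p in ps])
a2 = np.array([avg(U2, p) for p in ps])

# Fit the unknown (k, t0) in  a2 = (k*a1 + |t0 - p|) / (k + 1).
def residuals(theta):
    k, t = theta
    return (k * a1 + np.abs(t - ps)) / (k + 1.0) - a2

# Multi-start to avoid local minima of the piecewise-smooth residual.
starts = [(k0, tg) for k0 in (5.0, 100.0) for tg in (60.0, 90.0, 120.0)]
fits = [least_squares(residuals, x0=list(s)) for s in starts]
k_est, t0_est = min(fits, key=lambda f: f.cost).x
```

With exact (noiseless) answers, the fit recovers both k and the target's value, which is precisely why the server must perturb its outputs.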

Uncertainty Analysis
To protect users' privacy, the server replies to queries with noise added to the results. This causes uncertainty in the adversary's estimation of the target user's data. In this section, we investigate the factors that influence the precision of the adversary's estimation. Specifically, we solve the problem in a general model under two kinds of noise, i.e., the multiplicative noise and the additive noise, and then refine the solutions in two practical cases.

The Multiplicative Noise.
The adversary solves the following fitting problem to recover the data of the target user. Assume the adversary chooses two user sets U_1 and U_2 such that U_2 = U_1 ∪ {u_0}. For a query point p, the noisy results of the AVGD queries with U_1 and U_2 are a_mul,1 = (1 + z_1)a_1 and a_mul,2 = (1 + z_2)a_2, respectively. We assume that |z_1|, |z_2| ≪ 1. Let k = |U_1| and the data of u_0 be t_0; the adversary needs to fit the unknown parameters k and t_0 in the model of equation (2) with sets of p_i and noisy a_mul,1i and a_mul,2i, where a_mul,1i = (1 + z_1i)a_1i and a_mul,2i = (1 + z_2i)a_2i are the noisy outputs queried on the i-th query point p_i. a_1i and a_2i are the exact results queried on p_i, and the noises added to them are z_1i and z_2i, respectively. We simplify the model as follows. Let x = [a_1, p]^T and G = a_2, and let α = [k, t_0]^T denote the parameters to be fitted. We rewrite the model in equation (2) as equation (3). Then, the problem is to fit α in equation (3) with noisy x and G. We write the noise on x as [z_1 ∘ a_1, 0]^T, where the symbol (∘) denotes the Hadamard product.
Next, we derive the uncertainty of the estimation of the parameter α in equation (3) caused by the noise. Sader et al. [27] solved the problem of uncertainty analysis when fitting a function with noisy output data (G). We extend their work and enable uncertainty analysis when both the input (x) and the output data (G) contain noise. The method of least squares is used to solve such a fitting problem with the goal that the residual S [27] in equation (4) is minimized, where α′ is the estimated value of α.
Thus, α′ satisfies equation (5), where ∇_α is the gradient with respect to α. We expand α′ = α_0 + α_1, where α_0 is the solution in the noiseless condition, i.e., the ideal value of the fitting parameter, and α_1 denotes the deviation between the estimation and the ideal value. Then, we have equation (6), where ∇_x is the gradient with respect to x.
As we assume α_0 to be the solution without the effect of noise, we can get equation (7). Substituting equations (6) and (7) into equation (5) and ignoring the high-order small components, we get the expression of the deviation in the fitted parameters caused by the noise in equations (8) and (9). The expectation and variance of the fitting parameters can be approximated in the form of integrals. We assume that the query points p_i are uniformly picked from a query range D. We set δp as the average area in D segmented by the query points. z_1i and z_2i are noises added by the server and are assumed to follow the same distribution, with mean 0 and variance Var[z].

[Figure 1: The adversary chooses two sets of users U_1 and U_2 and query points p_1, ..., p_n; the server answers the queries Avg(U_1, p_1), Avg(U_2, p_1), ..., Avg(U_1, p_n), Avg(U_2, p_n) with noise added to the results.]

We approximate equations (8) and (9) with integrals and get the results for the expectation and variance of the m-th component, α_1|m, of α_1, under the multiplicative noise:

The Additive Noise.
The adversary solves a similar fitting problem to the case of the multiplicative noise, where the fitting model is the same as equation (2). The only difference is that the noisy results of the AVGD queries with U_1 and U_2 under the additive noise are a_add,1 = a_1 + ã_1 and a_add,2 = a_2 + ã_2, respectively. We assume that |ã_1| ≪ a_1 and |ã_2| ≪ a_2. The residual to be minimized in equation (4) changes into equation (13). We omit the detailed derivation. ã_1i and ã_2i are assumed to follow the same Laplace distribution, with mean 0 and variance Var[ã]. We give the expectation and variance of the uncertainty of the adversary's estimation, α_1, in equations (14) and (15), where δp is the average area in D segmented by the query points and B is defined in equation (12).

Case 1: One-Dimensional Average-Distance Query.
We consider a query of the one-dimensional average distance; i.e., the queried records of users are one-dimensional real values. For instance, the queried data can be the diastolic blood pressure (diaBP) of users from a medical database. In this case, given a user set U and a query point p, Avg(U, p) = Σ_{u∈U} |v_u − p|/k, where k denotes the number of users in U and v_u denotes the data of user u. Assuming that the value of the target user u_0 is v_0, the model for the adversary to fit is a_2 = (k·a_1 + |v_0 − p|)/(k + 1). We compute the expectation and variance of the adversary's estimation in the cases of multiplicative and additive noise, as in equations (10), (11), (14), and (15). In both cases, we have α_0 = [k, v_0]^T and x = [a_1, p]^T in these equations.
To simplify the results, we give an approximation of a_1 as follows. Assume the data in the database follow a Gaussian distribution N(μ, σ²) and the query range is from v_m to v_n. Then, we approximate the value of a_1 by its expectation E|v − p| over v ∼ N(μ, σ²). The uncertainty of the adversary's estimation is then as follows. Let v_0′ be the adversary's estimation of v_0, and let v_e = v_0′ − v_0 be the error of the adversary's estimation. The adversary chooses a total of n_query query points, and thus δp = (v_n − v_m)/n_query. According to equations (10), (11), (14), and (15), we can get the variance and the expected value of v_e under the assumptions of multiplicative noise and additive noise, respectively, as in equations (18) and (19), where s_0 denotes the relative value of v_0 in the query range and c_0(s_0) ∼ c_3(s_0) are coefficients whose values rely only on s_0.
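If a_1 is approximated by the expected absolute deviation E|V − p| of a Gaussian V ∼ N(μ, σ²) (an assumption consistent with the setup above), it has a closed form, the folded-normal mean, that can be checked by sampling; the constants below mirror the diaBP statistics and are otherwise illustrative:

```python
import numpy as np
from scipy.stats import norm

def mean_abs_dev(mu, sigma, p):
    """Closed-form E|V - p| for V ~ N(mu, sigma^2): the folded-normal mean."""
    m = mu - p
    return (sigma * np.sqrt(2.0 / np.pi) * np.exp(-m**2 / (2.0 * sigma**2))
            + m * (1.0 - 2.0 * norm.cdf(-m / sigma)))

rng = np.random.default_rng(2)
v = rng.normal(81.31, 10.98, size=400_000)   # diaBP-like Gaussian records
p = 95.0
mc = np.abs(v - p).mean()                    # Monte-Carlo estimate of a1
```

The Monte-Carlo mean and the closed form agree to a few hundredths, so the Gaussian-data approximation of a_1 is cheap to evaluate at any query point.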

Case 2: Two-Dimensional Average-Distance Query.
We consider the query of the average distance between spatial locations in this case. Assume that the server has a database of users' location data. The coordinates of locations are in the polar coordinate system. The client queries the average distance between a set of k users U and a specified query point p = (ρ_p, θ_p): Avg(U, p) = Σ_{u∈U} d((ρ_u, θ_u), p)/k, where (ρ_u, θ_u) denotes the coordinate of user u. We denote the location of the target user as (ρ_0, θ_0). In this case, the model for the adversary to fit is a_2 = (k·a_1 + d((ρ_0, θ_0), p))/(k + 1), and we have α_0 = [k, ρ_0, θ_0]^T and x = [a_1, ρ_p, θ_p]^T in equations (10), (11), (14), and (15). The multiplicative noise on the query results enters equations (10) and (11), and the additive noise enters equations (14) and (15).
We give an approximation of a_1 in this case as follows. We find in real datasets that users' check-in locations concentrate at the center of the city and disperse towards the edge of the city. Figure 2 shows the distribution of the check-ins of users in New York from a real-world dataset [28]. The distribution is shown as a heat map in which the red areas have higher concentrations of check-ins than the blue areas. The check-ins concentrate in the center of the city and decrease towards the outskirts, because people are more active in the city center. Accordingly, we assume that the coordinates of users follow the distribution in which the radial coordinate and the angular coordinate of each user are uniformly selected from 0 ∼ R and 0 ∼ 2π, respectively, which concentrates points in the center. Then, a_1 can be represented as the expectation of d((ρ_u, θ_u), p) under this distribution. The uncertainty of the adversary's estimation in this case is given as follows. Let the adversary's estimation of (ρ_0, θ_0) be (ρ_0′, θ_0′); ρ_e = ρ_0′ − ρ_0 and θ_e = θ_0′ − θ_0 denote the errors of the adversary's estimations of ρ_0 and θ_0. The adversary queries on a total of n_query query points. According to equations (10) and (11) and letting δp = πR²/n_query, we can get the variances of ρ_e and θ_e under the assumptions of multiplicative and additive noise, respectively, as

Var_mul[ρ_e] = (πR²/n_query) · Var[z] · (c_4(s_0) + c_5(s_0)k + c_6(s_0)k²),
Var_mul[θ_e] = (π/n_query) · Var[z] · (c_7(s_0) + c_8(s_0)k + c_9(s_0)k²),
Var_add[ρ_e] = (π/n_query) · Var[ã] · c_10(s_0) · (2k² + 2k + 1),
Var_add[θ_e] = (π/(n_query·R²)) · Var[ã] · c_11(s_0) · (2k² + 2k + 1),    (23)–(26)

where s_0 = ρ_0/R denotes the relative location of u_0 in the query range and c_4(s_0) ∼ c_11(s_0) are coefficients whose values are determined by s_0. The expected values of ρ_e and θ_e are both 0.

Findings.
An obvious finding in equations (18), (19), and (23)∼(26) is that the variances of v_e, ρ_e, and θ_e are in direct proportion to Var[z] or Var[ã] and in inverse proportion to n_query. In other words, the noise on the query results makes it harder for adversaries to reveal the target user's data. Besides, as the number of returned query results increases, the risk of data leakage also increases, since more equations can be built to solve for the fitting parameters. Moreover, a smaller number of users in a query leads to a higher accuracy of the malicious estimation of users' data.

Privacy-Preserving Mechanism
In this section, we give a privacy-preserving mechanism for AVGD queries. We first define the privacy metric for the AVGD query. Then, we design a mechanism to decide the minimum amount of noise for a query given the desired privacy requirement.

Privacy Metric for AVGD Query.
We define the privacy metric for the AVGD query as follows. Hoh et al. [29] and Shokri et al. [30] quantified location privacy as the expected distance between the user's location and the adversary's estimation. We follow this idea and quantify the privacy in the attack on AVGD queries as the expected error distance (eed), i.e., the expectation of the distance between the target user's data and the adversary's estimation. The definition is eed = E[d(t, t′)], where t is the real data of the target user and t′ is the value estimated by the adversary. Therefore, a smaller eed corresponds to a higher risk of privacy leakage. We then give the detailed description of eed in the one-dimensional case (eed_1) and the two-dimensional case (eed_2).
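Under a zero-mean Gaussian error model (the approximation adopted for both noise types in the following subsections), eed_1 = E|v_e| has the closed form sqrt(2·Var[v_e]/π); a small sketch with an illustrative variance checks this by sampling:

```python
import numpy as np

def eed_1d_mc(var_ve, n_samples=200_000, seed=1):
    """Monte-Carlo eed_1 = E|v_e| with v_e ~ N(0, var_ve)."""
    rng = np.random.default_rng(seed)
    ve = rng.normal(0.0, np.sqrt(var_ve), n_samples)
    return float(np.abs(ve).mean())

var_ve = 4.0                                   # illustrative Var[v_e]
closed_form = np.sqrt(2.0 * var_ve / np.pi)    # E|N(0, s^2)| = s*sqrt(2/pi)
```

This link between the variance of the adversary's error and eed is what lets the server translate a privacy requirement into a noise magnitude.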

One-Dimensional Case.
According to equation (8), the error of an adversary's estimation is a sum of random variables. In the case of multiplicative noise, the noises are assumed to obey a Gaussian distribution; thus, the error is a sum of Gaussian noises, which obeys a Gaussian distribution as well. The noises are assumed to obey a Laplace distribution in the case of additive noise. However, a sum of Laplace noises may not obey a Laplace distribution. We empirically approximate the distribution of the error in the case of additive noise as a Gaussian distribution. The expectations of the errors in the multiplicative noise case and the additive noise case are both 0, and the variances are described in equations (18) and (19). The resulting eed_1 is given in equation (29).

Two-Dimensional Case.
Denote the real coordinate of the target user as (ρ_0, θ_0) and the estimation of the adversary as (ρ_0′, θ_0′). Let ρ_e = ρ_0′ − ρ_0 and θ_e = θ_0′ − θ_0 denote the errors of the adversary's estimations of ρ_0 and θ_0, respectively. Then, the distance between the real coordinate of the target user and the adversary's estimation is Δl = sqrt(ρ_0² + ρ_0′² − 2ρ_0ρ_0′cos(θ_e)). Approximating cos(θ_e) with 1 − θ_e²/2, we get Δl ≈ sqrt(ρ_e² + ρ_0ρ_0′θ_e²). eed_2 in the two-dimensional case is then E[Δl], as in equation (32). To calculate the expected value of Δl, we use the following approximation: for a function g(x) of a random variable x, the Taylor approximation [31] gives E[g(x)] ≈ g(E[x]) + g″(E[x])·Var[x]/2. ρ_e and θ_e are random variables that obey a Gaussian distribution in the case of multiplicative noise and are approximated to obey a Gaussian distribution in the case of additive noise. The expected values of ρ_e and θ_e are both 0, and the variances are Var[ρ_e] and Var[θ_e] (subscripts are omitted), respectively, as described in equations (23) to (26).
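The small-angle step above (cos θ_e ≈ 1 − θ_e²/2, giving Δl ≈ sqrt(ρ_e² + ρ_0ρ_0′θ_e²)) is easy to sanity-check numerically with illustrative values:

```python
import numpy as np

rho0, rho0p, theta_e = 10_000.0, 10_050.0, 0.01  # metres / radians, illustrative
# Exact polar distance between (rho0, theta0) and (rho0', theta0 + theta_e).
exact = np.sqrt(rho0**2 + rho0p**2 - 2.0 * rho0 * rho0p * np.cos(theta_e))
# Small-angle approximation: Delta_l ~= sqrt(rho_e^2 + rho0*rho0p*theta_e^2).
approx = np.sqrt((rho0p - rho0)**2 + rho0 * rho0p * theta_e**2)
```

For estimation errors on the order of tens of metres and hundredths of a radian, the relative gap between the exact and approximate Δl is far below the noise-induced uncertainty itself.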

Privacy-Preserving Algorithm.
Given the privacy metric for the AVGD query, we design a mechanism for the privacy-preserving AVGD query, as shown in Algorithm 1. Since the precision of an adversary's estimation increases when he/she gets more results and when there are fewer users in each query, we assume that the server limits the number of queries that each client can request on a specified user set to n_query and limits the minimum number of users in each query to k. Each user u in the database has a privacy requirement eed_u; i.e., the eed of the adversary's estimation of the data of u must be no less than eed_u. Assume that a client specifies a user set U and a query point q. Then, the server calibrates the noise magnitude such that the privacy requirement of each user in U is satisfied and responds to the client with noisy results.
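A minimal sketch of the calibration step, assuming (as in the analysis) that the adversary's error is approximately zero-mean Gaussian with variance proportional to the noise variance; the factor `sensitivity`, which bundles n_query, k, and s_0, and all names here are illustrative, not the paper's API:

```python
import numpy as np

def calibrate_sigma(eed_req, sensitivity):
    """Smallest noise std meeting one user's requirement eed_req, assuming
    Var[v_e] = sensitivity * sigma^2 and eed = SD[v_e] * sqrt(2/pi)."""
    var = (eed_req ** 2) * np.pi / (2.0 * sensitivity)
    return float(np.sqrt(var))

def private_avgd(a_true, sigma, rng, additive=True):
    """Answer one AVGD query with calibrated noise (last steps of Algorithm 1)."""
    if additive:
        # Laplace with scale sigma/sqrt(2) has variance sigma^2.
        return a_true + rng.laplace(0.0, sigma / np.sqrt(2.0))
    return a_true * (1.0 + rng.normal(0.0, sigma))
```

For a whole user set, the server would take the largest calibrated σ over the users in U so that every individual requirement is met.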

Experiment
In this section, we perform the attack on real-world datasets and seek to answer the following questions: How do different factors influence the uncertainty of the adversary's estimation? How severe is the attack in real-life cases? How correct is our theoretical analysis? How much utility is lost when enabling privacy preservation?

6.1. Dataset

6.1.1. One-Dimensional Case.
We use a medical dataset from the Framingham Heart Study [32] that contains the health data of 4,240 users. We use the diastolic blood pressure (diaBP) data of each user as the inputs in our experiments. The diaBP data in this dataset range from 48 mmHg to 142.5 mmHg and follow a Gaussian distribution with a mean value of 81.31 mmHg and a standard deviation of 10.98 mmHg.

Two-Dimensional Case.
We use a location dataset collected from Foursquare [33]. The dataset contains check-ins contributed by users in New York from 12 April 2012 to 16 February 2013 [28]. Each check-in is associated with a user id, a timestamp, and a GPS coordinate. The original dataset contains 1,083 unique users and 227,428 check-ins. In our setting, we consider only one check-in for each user in each query. For simplicity, we choose the latest check-in of each user as the target of the clients in our experiments. The radius of this dataset is 28.5 km.

Measurement.
We aim to measure the severity of the attack and the correctness of our theoretical analysis. The severity of the attack is measured as eed_1 and eed_2, defined in equations (29) and (32) for the two cases. Smaller values of eed_1 and eed_2 mean higher severity. We measure the correctness of the theoretical standard deviations in equations (18), (19), and (23) to (26) and of eed_1 and eed_2 in equations (29) and (32). The correctness is measured as the error rate between the theoretical estimation and the experimental value, i.e., error rate = |y_t − y_e|/y_t, where y_t and y_e denote the theoretical estimation and the experimental value, respectively.

Experimental Setup and Statistical Results.
The behavior of an adversary is simulated as follows. The dataset and a query range specified by the server are given. The adversary focuses on a target user u_0 (assumed to be within the query range). The adversary randomly chooses a user set U_1 within the query range containing k users and a neighboring user set U_2 = U_1 ∪ {u_0}. Assume that the maximum number of queries that each client can request on a specified user set is n_query. The adversary chooses n_query query points within the query range. In the one-dimensional case, we set the maximum query range from 20 mmHg to 145 mmHg. In the two-dimensional case, the query range for the angular coordinate is from 0 to 2π, and the maximum query range for the radial coordinate is from 0 m to 28,500 m. For each query point p, the adversary requests AVGD queries on U_1 and U_2, respectively. The server computes the results a_1 = Avg(U_1, p) and a_2 = Avg(U_2, p) and adds multiplicative noises z_1 and z_2 (resp., additive noises ã_1 and ã_2) to the query results, such that the noisy outputs are (1 + z_1)a_1 and (1 + z_2)a_2 (resp., a_1 + ã_1 and a_2 + ã_2), where z_1 and z_2 (resp., ã_1 and ã_2) are random variables from the Gaussian distribution N(0, Var[z]) (resp., the Laplace distribution Laplace(0, sqrt(Var[ã]/2))). The adversary gets a total of 2n_query noisy results and solves equations like equation (1) to recover the data of u_0. We use the lsqnonlin function with the default trust-region-reflective algorithm [34] in MATLAB R2018b to solve the nonlinear least-squares problems.
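The simulation loop can be reproduced in outline (illustrative Python; SciPy's `least_squares` stands in for MATLAB's `lsqnonlin`, and the dataset statistics are the diaBP Gaussian fit):

```python
import numpy as np
from scipy.optimize import least_squares

def simulate_attack(noise_sd, n_query, k, seed=0):
    """One simulated attack with multiplicative Gaussian noise; returns the
    adversary's estimation error v_e = v0' - v0 for the target user."""
    rng = np.random.default_rng(seed)
    data = rng.normal(81.31, 10.98, size=k)          # U1
    v0 = 95.0                                        # target user's value
    ps = np.linspace(50.0, 140.0, n_query)           # query points
    a1 = np.array([np.abs(data - p).mean() for p in ps])
    a2 = (k * a1 + np.abs(v0 - ps)) / (k + 1)        # exact U2 answers
    a1n = a1 * (1.0 + rng.normal(0.0, noise_sd, n_query))
    a2n = a2 * (1.0 + rng.normal(0.0, noise_sd, n_query))
    res = lambda th: (th[0] * a1n + np.abs(th[1] - ps)) / (th[0] + 1.0) - a2n
    fits = [least_squares(res, x0=[k0, 90.0]) for k0 in (5.0, 100.0)]
    return min(fits, key=lambda f: f.cost).x[1] - v0
```

Averaging |v_e| over many such runs gives the empirical eed_1 that the experiments compare against the theory.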
We study the influences of different factors on the adversary's estimation. The measured quantities include the uncertainty of the adversary's estimation, the eed of the adversary, and the error rate of the theoretical estimation. The factors we consider in our experiments are as follows: the magnitude (standard deviation) of the noise added to the query results (SD[z] and SD[ã]), the number of users in the query requested by the adversary (k), the limited number of queries on a specified user set for each client (n_query), the relative position of the target user's data in the query range (s_0), and the size of the query range.
In our experiments, we observe the influences by simulating attacks with different values of the parameters. Under each configuration of parameters, we perform the simulation of the attack described above 2,000 times. The detailed settings of the parameters and the results are as follows. Note that in Section 6.3 we only demonstrate the results; the detailed analysis is given in Sections 6.4-6.6.

Input: n_query: the limited number of queries on a specified user set; k: the minimum number of users in each query; U: the user set that the client specifies; q: the query point that the client specifies.
Output: ã: the noisy output.
(1) if |U| < k then
(2)   return false
(3) end if
(4) for each u ∈ U do
(5)   Get the privacy requirement eed_u of u.
(6)   Solve the variance σ_u² such that the privacy requirement eed_u can be met.
(7) end for
(8) σ² ← max_{u∈U} σ_u²
(9) Compute the correct answer a = Avg(U, q).
(10) Draw z ∼ N(0, σ²) for multiplicative noise, or α ∼ Laplace(0, σ/√2) for additive noise.
(11) ã ← (1 + z)a for multiplicative noise, or ã ← a + α for additive noise.
(12) return ã
ALGORITHM 1: Privacy-preserving AVGD query processing.
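Algorithm 1 can be sketched in Python as follows. This is a minimal one-dimensional sketch: deriving σ from each user's privacy requirement eed_u (steps (5)-(8)) is abstracted into a `sigma` parameter, and the function name and sample values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def avgd_query(U, q, k_min, sigma, noise="multiplicative"):
    """Sketch of Algorithm 1 for the one-dimensional case.

    In the paper, sigma is derived from each user's privacy
    requirement eed_u; here it is passed in directly so the
    sketch stays self-contained.
    """
    U = np.asarray(U, dtype=float)
    if len(U) < k_min:            # steps (1)-(3): reject undersized user sets
        return None               # "return false" in the pseudocode
    a = np.abs(U - q).mean()      # step (9): correct answer Avg(U, q)
    if noise == "multiplicative":
        z = rng.normal(0.0, sigma)                    # z ~ N(0, sigma^2)
        return (1.0 + z) * a                          # step (11)
    alpha = rng.laplace(0.0, sigma / np.sqrt(2))      # Var[alpha] = sigma^2
    return a + alpha

users = rng.uniform(20, 145, 100)   # hypothetical diastolic readings (mmHg)
noisy = avgd_query(users, 90.0, k_min=50, sigma=0.01)
```

The Laplace scale σ/√2 matches the pseudocode: a Laplace distribution with scale b has variance 2b², so b = σ/√2 gives additive noise with variance σ².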
We first demonstrate the influence of the noise magnitude. In the experiments, we fix the query range to the maximum query range and set k = 50, n_query = 2,000, and s_0 = 0.5 (in the two-dimensional case, s_0 denotes the relative position of the radial coordinate of the target user; the angular coordinate does not affect the uncertainty of the adversary, as shown in equations (23)-(26), so we set it to 0). For the one-dimensional case, we change the multiplicative noise SD[z] (resp., the additive noise SD[α]) from 0.01 to 0.1 (resp., 0.1 mmHg to 1 mmHg). For the two-dimensional case, we change SD[z] (resp., SD[α]) from 0.001 to 0.01 (resp., 10 m to 100 m). The results of the first experiment are shown in Figures 3 and 4.
Secondly, we study the influence of the number of users in each query. We fix SD[z] = 0.01 and SD[α] = 0.1 mmHg for the one-dimensional case and SD[z] = 0.001 and SD[α] = 10 m for the two-dimensional case. For both cases, we fix the query range to the maximum query range, fix the parameters n_query = 2,000 and s_0 = 0.5, and change k from 20 to 200. The results are shown in Figures 5 and 6.
We then show the effect of the number of queries on a specified user set. We set SD[z] = 0.01 and SD[α] = 0.1 mmHg for the one-dimensional case and SD[z] = 0.001 and SD[α] = 10 m for the two-dimensional case, fix k = 50 and s_0 = 0.5, and fix the query range to the maximum query range for both cases. We choose n_query from 200 to 2,000 for both cases. The results are shown in Figures 7 and 8.
We also study the effect of the relative position (s_0) of the target user's data in the query range. Note that in the two-dimensional case, we only consider the impact of the radial coordinate of the target user because the angular coordinate has no effect on the uncertainty of the estimation. The parameter SD[z] (resp., SD[α]) is set to 0.01 (resp., 0.1 mmHg) for the one-dimensional case and 0.001 (resp., 10 m) for the two-dimensional case. We fix the query range to the maximum query range, fix k = 50 and n_query = 2,000, and choose s_0 from 0.1 to 1 for both cases. The results are shown in Figures 9 and 10.
At last, we demonstrate the influence of the size of the query range. We fix SD[z] = 0.01 and SD[α] = 0.1 mmHg for the one-dimensional case and SD[z] = 0.001 and SD[α] = 10 m for the two-dimensional case. For both cases, the parameters k = 50, n_query = 2,000, and s_0 = 0.5 are fixed. We change the size of the query ranges; specifically, we fix the central point of the query ranges and change the ranges around the central point. The results are shown in Figures 11 and 12.

The Noise Magnitude.
As shown in Figures 3 and 4, the standard deviation of the adversary's estimation error and the eed of the adversary are in direct proportion to the standard deviation of the noise. The precision of the adversary's estimation decreases when the server adds larger noises to the query results.

The Number of Users in Each Query.
As shown in Figures 5 and 6, the standard deviation of the adversary's estimation error and the eed of the adversary's estimation are in direct proportion to the number of users in each query. The precision of the adversary's estimation decreases when the number of users in each query increases. The reason is as follows. For a_1 = Avg(U_1, p) and a_2 = Avg(U_2, p), where U_2 = U_1 ∪ {u_0}, the difference between a_1 and a_2 is smaller when there are more users in U_1, and thus it is more difficult for an adversary to recover the target user's data.

The Number of Queries.
As shown in Figures 7 and 8, the standard deviation of the estimation error and the eed decrease when n_query increases. More queries generate more results that can be used for the fitting problem, and thus the adversary gets a more precise estimation.

The Relative Position of the Target User's Data.
As shown in Figures 9 and 10, the influence of s_0 on the uncertainty of the adversary does not show a consistent pattern; it depends on the specific case. Also, the multiplicative noises and the additive noises in the same case affect the results differently.

The Size of the Query Range.
As shown in Figures 11 and 12, for both the one-dimensional and two-dimensional cases, the eed of the adversary's estimation increases under multiplicative noise when the query range gets larger. When the noise is additive, the eed does not change much with different query ranges. In fact, the value of the eed is closely related to the absolute noise magnitude. The additive noise itself is an absolute noise and remains the same for different query ranges. The multiplicative noise is a relative noise: when the value of SD[z] is fixed, the absolute noise magnitude increases with larger query ranges, and thus the value of the eed rises.
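The dilution effect behind the number-of-users observation can be checked numerically. The sketch below assumes the one-dimensional AVGD query is the mean absolute distance and uses hypothetical values for the query point and the target; `mean_gap` is an illustrative helper, not a function from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
p, x0 = 60.0, 92.0   # query point and target value (hypothetical, mmHg)

def mean_gap(k, trials=300):
    """Average |Avg(U2, p) - Avg(U1, p)| over random size-k user sets U1,
    where U2 = U1 ∪ {u0} adds the target user's value x0."""
    gaps = []
    for _ in range(trials):
        U1 = rng.uniform(20, 145, k)
        a1 = np.abs(U1 - p).mean()
        a2 = np.abs(np.append(U1, x0) - p).mean()
        gaps.append(abs(a2 - a1))
    return float(np.mean(gaps))

# Exact identity: a2 - a1 = (|x0 - p| - a1) / (k + 1), so the signal the
# adversary must separate from the noise shrinks roughly as 1/k.
g20, g200 = mean_gap(20), mean_gap(200)
```

Growing k from 20 to 200 shrinks the average gap by about an order of magnitude, which is why the adversary's uncertainty grows with the number of users in each query.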

Severity of the Attack.
We demonstrate the severity of the attack based on the theoretical estimation and the results on the real-world datasets. First of all, the severity of the attack depends on the sensitivity of the data in different cases, i.e., to what extent the information leakage threatens a user's privacy. For example, in the two-dimensional case, when an adversary achieves eed_2 = 1,000 m, it is difficult for the adversary to link the estimated location to a specific sensitive place (e.g., a hospital or a church) with such uncertainty. But in the one-dimensional case, if the adversary achieves eed_1 = 5 mmHg, a health problem such as high blood pressure of the target user may be leaked. Besides the sensitivity of different cases, the severity of the attack increases when the adversary gets a higher accuracy of the estimation, which is mainly influenced by the following three factors.

The Noise Magnitude.
When the server reduces the noise magnitude, the eed of the adversary decreases, and thus the severity rises. In Figure 3, eed_1 decreases from 5.6 mmHg to 0.45 mmHg (resp., 1.25 mmHg to 0.13 mmHg) when SD[z] changes from 0.1 to 0.01 (resp., SD[α] changes from 1 mmHg to 0.1 mmHg). In Figure 4, eed_2 decreases from 631 m to 71.8 m (resp., 277 m to 28 m) when SD[z] changes from 0.01 to 0.001 (resp., SD[α] changes from 100 m to 10 m). Moreover, the proposed attack can resist the noise to some extent: as shown in Figure 3, the adversary achieves an eed_1 of about 5.6 mmHg even when the server adds noise with SD[z] = 0.1.

The Number of Users in Each Query.
The adversary gets a lower value of the eed when the number of users in each query decreases. As shown in Figures 5 and 6, when changing k from 200 to 20 (in the multiplicative noise case), eed_1 decreases from 1.8 mmHg to 0.19 mmHg and eed_2 decreases from 251 m to 29.5 m. It is the server's responsibility to take measures to enforce a minimum number of users in each query.

The Number of Queries.
The more queries the adversary requests, the lower the eed is. As shown in Figures 7 and 8, eed_1 decreases from 1.4 mmHg to 0.44 mmHg and eed_2 decreases from 211 m to 71.7 m when n_query increases from 200 to 2,000 (in the multiplicative noise case). The server has to limit the number of queries on a specified user set requested by each client to mitigate the risk of privacy breaches.

Correctness of the Theoretical Estimation.
The error rate of our theoretical estimations increases when the magnitude of the noise gets larger, as shown in Figures 3 and 4. This comes from the assumption in our theoretical analysis that |z| ≪ 1 and |α| ≪ a_1. In Figure 3, the error rate of eed_1 is less than 0.01 under multiplicative noises when SD[z] = 0.01 for the one-dimensional case and increases to 0.06 when SD[z] = 0.1. In the two-dimensional case, the error rate of eed_2 increases from 0.01 to 0.12 when SD[z] changes from 0.001 to 0.01 under multiplicative noises, as shown in Figure 4. The magnitude of the noise determines the utility of the results; the server is unlikely to add a large noise (e.g., with SD[z] > 0.1) to the results due to the utility concern, and thus the error rates of our theoretical estimations can be controlled. The correctness of the theoretical analysis ensures the effectiveness of the privacy-preserving mechanism in Algorithm 1.
It is worth noting that the error rate of eed_1 increases when k is small, as shown in Figure 5. This comes from the error in the approximation of a_1 in equation (17), which relies on an assumption about the distribution of the target users. The accuracy of the approximation decreases when the user set is small.
As shown in Figure 11, the error rate is high when the query range is small. The reason is that when we narrow down the query range, the distribution of the queried data restricted to the query range deviates from our assumption in Section 6.1 (we assume that the data in the one-dimensional case obey a Gaussian distribution), which affects the theoretical results.

Utility Loss under the Privacy-Preserving Mechanism.
We show the utility loss when the privacy-preserving mechanism adds noise to the query results to meet different privacy requirements. The precision of the query results determines the utility, and thus we use the standard deviation of the noise to measure the utility; i.e., the larger the standard deviation of the noise, the less utility the query results have. The privacy for the two cases is quantified as eed_1 and eed_2, as defined in equations (29) and (32). We give examples showing the minimum amount of noise needed to meet the desired privacy requirement in Figure 13. For the one-dimensional case, we set n_query = 200 and k = 50 and change eed_1 from 0 mmHg to 15 mmHg. For the two-dimensional case, we set n_query = 2,000 and k = 50 and change eed_2 from 0 m to 1,000 m. We show the corresponding minimum amount of noise when s_0 is set to 0.3, 0.6, and 0.9, respectively. The standard deviation of the minimum noise is in direct proportion to the desired privacy requirement, i.e., a higher privacy requirement causes a lower utility of the query results. In the one-dimensional case with s_0 = 0.6, the standard deviation of the minimum noise SD[z] (resp., SD[α]) increases from 0.05 to 0.15 (resp., 1.81 mmHg to 5.44 mmHg) when the privacy requirement increases from 5 mmHg to 15 mmHg. In the two-dimensional case with s_0 = 0.3, the standard deviation SD[z] (resp., SD[α]) increases from 0.008 to 0.014 (resp., 173 m to 347 m) when the privacy requirement increases from 500 m to 1,000 m.

Figure 11: Impact of the size of the query range (diaBP dataset).
To decrease the magnitude of the noise, and thus preserve the utility of the results, one possible way is to restrict the maximum number of queries that each client can request and to enforce a minimum number of users in each query.
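The two server-side restrictions above can be sketched as a small admission check. This is an illustrative design, not part of the paper's mechanism; the class name, parameters, and keying scheme are all assumptions.

```python
from collections import defaultdict

class QueryLimiter:
    """Illustrative sketch: enforce a minimum user-set size k_min and a
    per-client cap n_query on queries against any one user set."""

    def __init__(self, k_min=50, n_query=2000):
        self.k_min = k_min
        self.n_query = n_query
        self.counts = defaultdict(int)   # (client, user set) -> queries served

    def allow(self, client_id, user_ids):
        users = frozenset(user_ids)
        if len(users) < self.k_min:
            return False                 # user set too small
        key = (client_id, users)
        if self.counts[key] >= self.n_query:
            return False                 # per-set query budget exhausted
        self.counts[key] += 1
        return True
```

Note that keying the counter by client does not stop colluding clients from pooling their budgets on the same user set; a global per-set budget would be stricter, at the cost of penalizing honest clients.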

Conclusion and Future Work
The AVGD query serves as a basic component of real-world applications such as business decision making and medical diagnosis. The work of this paper is instructive for service providers that wish to enable AVGD queries while preserving the privacy of users' data. Specifically, we first propose an attack on the AVGD query. An attacker can recover a target user's data by querying with carefully selected user sets and query points. To understand the severity of such an attack and further propose the privacy-preserving mechanism, we formalize the process of the attack as a fitting problem with noisy data. We give a theoretical analysis of the uncertainty of the adversary. The results show that the severity of such an attack is mainly related to factors such as the noise magnitude, the number of queries, and the number of users in each query. Other factors, such as the data of the target user and the size of the query ranges, show low correlations with the severity of the attack. Based on the theoretical analysis, we design an algorithm for privacy-preserving AVGD query processing. Experiments on two real-life cases show the severity of such an attack and the effectiveness of our theoretical analysis. We also evaluate the loss of service quality when enabling the privacy-preserving mechanism.

The attack proposed in this paper is somewhat idealized, and future work can focus on more practical situations. First, recall that the proposed attack is based on queries on two groups of users that differ in only one user. In future work, the attack can be extended to select more than two user sets, making it more flexible in choosing the sets. Besides, recall that an adversary can achieve a higher precision of the estimated private data by using more query results. How to limit the number of queries in practical applications should be studied, as several attackers can collude to query the same user sets.
However, this is a nontrivial task, as discussed in previous works [26].

Data Availability
All data included in this study are available upon request from the corresponding author.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.