Query Privacy Preserving for Data Aggregation in Wireless Sensor Networks

Wireless Sensor Networks (WSNs) are increasingly involved in many applications. However, communication overhead and energy efficiency of sensor nodes are the major concerns in WSNs. In addition, the broadcast communication mode of WSNs makes the network vulnerable to privacy disclosure when the sensor nodes are subject to malicious behaviours. To address these issues, we present a Queries Privacy Preserving mechanism for Data Aggregation (QPPDA), which may reduce energy consumption by allowing multiple queries to be aggregated into a single packet and preserve data privacy effectively by employing a privacy homomorphic encryption scheme. The performance evaluations obtained from the theoretical analysis and the experimental simulation show that our mechanism can reduce the communication overhead of the network and protect the private data from being compromised.


Introduction
As a novel and modern technique, Wireless Sensor Networks (WSNs) have been introduced into a variety of scenarios such as medical applications [1], smart homes [2,3], autonomous vehicles [4], traffic administration [5], and military battlefields [6]. A WSN is composed of hundreds or thousands of tiny resource-constrained sensor nodes which are generally deployed in an unattended or even hostile area. These nodes are difficult to replace or recharge. This prevents WSNs from being applied to more critical applications, especially in scenarios where a long lifetime and high-quality services are needed. It is important that traffic and computation overhead be kept as low as possible to extend the lifetime of WSNs. The Data Aggregation (DA) [7][8][9][10][11][12][13] technique is one of the most effective ways for the network to save energy and improve efficiency. It reduces the quantity of transmitted information by aggregating the data from different nodes and decreasing redundancy, thereby prolonging the lifetime of the network. Unfortunately, DA is vulnerable to some attacks. Take the aggregation node as an example: it is an intermediate tier between the sensor nodes and the Base Station (BS). The main roles of aggregation nodes are to store the sensing data and reply to the queries received from BS. If most of the aggregation nodes have been compromised, the data of the whole network may be revealed and tampered with easily. This may result in serious threats or economic loss, or even damage to the safety of state property. Therefore, the Security Data Aggregation (SDA) plays an important role in the critical applications of WSNs.
Privacy Preserving (PP) has attracted much attention in many fields, such as smart grid [14], Internet of Things [15,16], edge computing [17], social networks [18], and other application scenarios [19][20][21]. PP can also protect the privacy of sensing data when DA is adopted in a WSN, and some interesting schemes have been proposed in recent years [22][23][24][25]. However, these solutions cannot guarantee data integrity. Although the schemes discussed in [26][27][28] addressed the issue of data integrity, they may cause the leakage of concealed data due to the decryption at the aggregation nodes. A scheme proposed in [29] attempted to bridge the gap between PP and data integrity by integrating an encryption algorithm with a MAC authentication mechanism, but it risks putting a heavy computation burden on sensor nodes.
In general, BS has two ways to collect information in a WSN. One is that BS sends a query and the nodes reply accordingly. The other is that the nodes periodically report information to BS. We focus on the former in this paper because the latter consumes more resources in transmitting replies, which is inconsistent with our intention of saving energy. The data query has been widely exploited in current studies. For example, the maximum/minimum query was used to monitor a patient and identify the maximum or minimum value of an indicator, which could be regarded as a sign of whether the patient is in a good state or not [30]. Up to now, the single query with PP, such as range query [31], verifiable top-k query [32], and location query [33], has been well addressed. However, the single query method cannot meet the requirements of applications when it is introduced into a large-scale network. Therefore, how to enrich the function of queries becomes an urgent research challenge. As one reasonable solution, the multiple queries mechanism has been proposed, in which many queries can be executed simultaneously [34]. However, the multiple query mechanism with PP is an emerging direction, and many valuable issues remain to be solved.
To address the abovementioned issues, we propose a Queries Privacy Preserving mechanism for Data Aggregation (QPPDA) in this paper. The goal of our work is to bridge the gap between PP and energy consumption, and the following techniques are adopted. Firstly, the multiple queries are aggregated into a single packet in order to reduce energy consumption. Then, a homomorphic encryption scheme is carried out, and the confidentiality of private data is ensured. Next, the data for different queries in a single aggregated packet can be distinguished from each other in the decryption of the aggregated data at BS. Compared with the single query, QPPDA may greatly decrease the communication and computation overhead. The main contributions of this paper are as follows.
(i) Improvement of Gridding Technology. The high computation complexity of cells limits the application of the grid technique. We break this restriction by improving the relative location algorithm in grid topology. As a result, the computation complexity is decreased, and the relative location provides an efficient way to maintain a dynamic WSN.
(ii) Effective Privacy Preserving. Privacy is easily destroyed by an attacker because a WSN is usually deployed in an unattended or even hostile environment. The elliptic curve encryption combined with the homomorphic algorithm is adopted to effectively protect the private data from being compromised.
(iii) Efficient Reply. Sending multiple replies individually wastes network resources. By aggregating the multiple queries into a single packet, the performance of the WSN is improved in terms of energy consumption and lifetime.
The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 discusses the topology construction of the network. Section 4 elaborates our scheme in detail. Section 5 evaluates the performance of QPPDA. We conclude this paper in Section 6.

Grid Topology.
Connectivity is one of the key issues in WSNs, and many valuable solutions have been proposed to deal with this challenge. A grid-based SDA scheme was proposed in [35]. The whole network was divided into nonoverlapping virtual cells which were small enough to ensure that the radio coverage of a node can cover its surrounding cells; namely, each node in a cell can directly communicate with the nodes in the neighbouring cells. In [36], the nodes were divided into groups according to their geographic locations, with only one node reserved in each group to connect to the backbone network. In this way, the scheme proposed in [36] not only ensures the connectivity of nodes but also speeds up the convergence rate of the network. Although the connectivity of the network is guaranteed, the grid topology causes a higher computational complexity than tree or cluster topology.

Privacy Preserving.
As to PP, some cryptographic schemes have been adopted to carry out hop-by-hop encryption [37]. He et al. presented an Integrity-protecting Private Data Aggregation scheme (IPDA) [38], which is an improvement on the Cluster-based Private Data Aggregation (CPDA) [22]. Both IPDA and CPDA achieve privacy preserving through the technique of data slicing and assembling, which ensures integrity by constructing two disjoint aggregation trees. However, the disjoint aggregation trees are computation- and communication-consuming and inapplicable to resource-constrained WSNs. As far as the hop-by-hop scheme is concerned, data privacy cannot be guaranteed because the ciphertext must be decrypted in the intermediate nodes when the DA technique is applied. Therefore, the end-to-end scheme is a desirable choice in a network with DA. In [30,31], the nodes directly sent the encrypted data to BS without any decryption operation in the intermediate nodes. Castelluccia et al. [39] proposed a simple and provably secure additive homomorphic stream cipher which permitted the efficient aggregation of encrypted data. Girao et al. [40] discussed a mechanism which can conceal the sensing data and the aggregation data in an end-to-end manner. Though these schemes are efficient in preserving the data privacy of DA, they cannot prevent the private data from being eavesdropped on by neighbouring nodes. Compared with [40], the Integrity Protecting Hierarchical Concealed Data Aggregation (IPHCDA) for WSNs ensured, with the support of asymmetric cryptography, that no private data of a sensor node were released to any other node [41]. It employed the elliptic curve-based Privacy Homomorphism (PH) and allowed the concealed aggregation data to be encrypted with different keys. The scheme includes the following steps.
Step 1: generate the key pair $(p_u, p_r)$ according to the points on the elliptic curve.
Step 2: encrypt $m$ using $c = g^m + h^r$, where $+$ is the addition operation of elliptic curve points, $r$ is a random number, and $g^m$ and $h^r$ denote the scalar multiplication of elliptic curve points.
Step 3: perform the DA. Two ciphertexts $(c_1, c_2)$ are aggregated by elliptic curve point addition, $c_1 + c_2$.
Step 4: decrypt a ciphertext using the private key $p_r$ at BS.
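The additive property that the aggregation step relies on can be written out explicitly. The following is a sketch in the notation of Step 2, where a superscript denotes scalar multiplication of an elliptic curve point; the subscripted symbols $m_1, m_2, r_1, r_2$ are introduced here for illustration only and do not appear in the original description.

```latex
\begin{align}
c_1 + c_2 &= \left(g^{m_1} + h^{r_1}\right) + \left(g^{m_2} + h^{r_2}\right) \\
          &= g^{m_1 + m_2} + h^{r_1 + r_2}
\end{align}
```

Under this reading, the sum of two ciphertexts has the same form as an encryption of $m_1 + m_2$ with randomness $r_1 + r_2$, which is why an aggregator can add readings without decrypting them.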

Query Privacy Preserving.
The contributions presented in [42][43][44] investigated privacy schemes for the single query when attackers attempt to tamper with or eavesdrop on the private information of nodes. Papadopoulos et al. proposed a privacy-preserving scheme for range queries [42] based on the bucketing technique [45], in which the domain of data values was divided into multiple buckets, and the time was divided into slots as well. In each time slot, data items collected by a sensor node were classified into different buckets with different IDs. If BS wants to perform a range query, it does not send the range directly. Instead, the IDs of the buckets that cover the required range are sent to the storage nodes. However, the bucket partitioning technique cannot prevent a compromised storage node from carrying out malicious activities in a WSN. Faced with this challenge, the scheme proposed in [46] addressed the privacy of queries by encoding the sensing data. However, it incurs high computation overhead and communication cost. To the best of our knowledge, few contributions have investigated the privacy preserving of multiple queries with DA.
Different from the abovementioned approaches, QPPDA has the advantages of decreasing the resource consumption and protecting the private data from being compromised simultaneously.

Sensor Networks and Data Aggregation Model.
A sensor network is modelled as a grid which is divided into many cells, each containing a number of sensor nodes.
There are three types of nodes in a network: BS, Aggregation Node (AN), and Member Node (MN). It is assumed that BS is trusted and has unlimited energy, computing resources, and storage capacity. An MN collects the sensing data and sends them to its AN. An AN is responsible for forwarding the queries sent by BS and aggregating the data of MNs. The network size is $N$, which means that there are $N$ nodes in a WSN. The sensor nodes are organized into a grid structure as shown in Figure 1. Notice that we adopt a three-dimensional model rather than the two-dimensional models used in most of the related works with grid topology. This model may expand the application scenarios of QPPDA, and it can be used in many complex natural environments.
Let $D = \{d_1, d_2, \ldots, d_N\}$ be the raw data gathered at MNs. The set of sensing data, $D$, can be transmitted to BS hop-by-hop. However, transmitting all the raw data to BS may result in a huge burden on the bandwidth and high energy consumption. Therefore, DA is a favoured technique to decrease the occupancy of resources.
A data aggregation function is defined as $y = f(d_1, d_2, \ldots, d_N)$, where $f$ represents the aggregation function, which may be addition, average, min, max, or count. We focus on the addition aggregation function in our model, $f(d_1, d_2, \ldots, d_N) = \sum_{i=1}^{N} d_i$. It should be noticed that the addition aggregation function is not too restrictive because many other functions, such as average and count, can be deduced from it.
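To make concrete how average and count can be deduced from addition, the following minimal Python sketch expresses both through a single additive primitive. The function names and the sample readings are hypothetical and chosen only for illustration.

```python
def additive_aggregate(values):
    """Addition aggregation: the sum of all reported values."""
    return sum(values)


def count_via_sum(readings, predicate):
    """Count: each node reports 1 if its reading matches, else 0; the sum is the count."""
    return additive_aggregate(1 if predicate(d) else 0 for d in readings)


def average_via_sum(readings):
    """Average: the aggregated sum divided by the aggregated count."""
    n = count_via_sum(readings, lambda _: True)
    return additive_aggregate(readings) / n if n else 0.0


readings = [21, 25, 19, 30]  # illustrative sensor readings
print(additive_aggregate(readings))
print(count_via_sum(readings, lambda d: d > 20))
print(average_via_sum(readings))
```

Because every derived function reduces to one or two sums, a scheme that protects additive aggregation also covers these cases; min and max do not reduce this way.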

Threat Model.
When queries are initialized, BS broadcasts them to the whole network. The nodes which meet the requirements of the queries send their reply data to AN, and the data are exposed to malicious activities if a security mechanism is absent. We adopt the well-known "honest but curious" threat model [47], in which the adversaries attempt to break the privacy but faithfully follow the protocol specification during the process of DA. Meanwhile, adversaries can overhear the original data of sensors by eavesdropping on the wireless link. In addition, a few nodes may collude with each other to violate the data privacy of the overall network.

Privacy Data Aggregation Protocol
We present a privacy data aggregation protocol called Queries Privacy Protection for Data Aggregation (QPPDA) which involves three phases: the grid division, the key generation, and the query processing. Firstly, a network is divided into adjacent virtual cells, and the nodes within neighbouring cells can directly communicate with each other. Secondly, the corresponding key for each type of query is generated in order to guarantee the data privacy. Finally, the nodes aggregate multiple replies into a single packet which is transmitted to BS hop-by-hop.

Grid Division.
The grid division phase is responsible for the construction of the network structure. In the Geographical Adaptive Fidelity (GAF) algorithm, a network area was divided into a grid topology which consisted of many contiguous cells according to the geographic information and the radio coverage of nodes [46]. In order to make GAF suitable for WSNs in practice, some improvements of GAF were proposed, and a relative position was adopted to obtain the grid information [48]. However, some valuable issues, such as data privacy and accuracy, are left for future study.
We define all the cells that have a common edge as the neighboring cells. In the division process, it must be ensured that all the nodes of a cell can directly communicate with the nodes in the neighboring cells. Thus, equation (1), which relates $r$ and $R$, needs to be satisfied, where $r$ denotes the side length of the cell and $R$ is the communication radius of the node. The relationship between $r$ and $R$ is shown in Figure 2.
We take the following steps to divide a grid into adjacent cells. Firstly, BS broadcasts its location, $L_{bs}(x_{bs}, y_{bs}, z_{bs})$, and the side length $r$ of each cell to all the nodes in a WSN. Node $i(x_i, y_i, z_i)$ can calculate the coordinate of cell $G_i(g(x_i), g(y_i), g(z_i))$ and determine which cell it belongs to using the following equation:
$$g(x_i) = \left\lceil \frac{x_i - x_{bs}}{r} \right\rceil, \quad g(y_i) = \left\lceil \frac{y_i - y_{bs}}{r} \right\rceil, \quad g(z_i) = \left\lceil \frac{z_i - z_{bs}}{r} \right\rceil. \tag{2}$$
Now, we use a simple example to explain why we use the ceiling (top integral) instead of the floor (bottom integral) when the cell coordinate is determined in equation (2). Assume that the coordinate of node $i$ is (9, 11, 10) and that of BS is (0, 0, 0). The side length $r$ is 4, as shown in Figure 3. We first calculate the cell where node $i$ stays using the ceiling according to equation (2):
$$g(x) = \lceil (9 - 0)/4 \rceil = 3, \quad g(y) = \lceil (11 - 0)/4 \rceil = 3, \quad g(z) = \lceil (10 - 0)/4 \rceil = 3.$$
Therefore, node $i$ is inside $G_i(3, 3, 3)$. On the contrary, we obtain $G_i(2, 2, 2)$ using the floor. It can be seen from Figure 3 that $G_i(3, 3, 3)$ is the desired one. The pseudocode of the grid division is listed in Algorithm 1. The $N$ nodes compute their coordinates from Line 1 to Line 10, and the computation complexity of grid division is $O(N)$.
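The cell computation of equation (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Algorithm 1; the function names are hypothetical, and the floor variant is included only to reproduce the comparison made in the worked example.

```python
import math


def cell_coordinate(node, bs, r):
    """Cell index of `node` relative to the base station `bs`, using the ceiling as in equation (2)."""
    return tuple(math.ceil((p - b) / r) for p, b in zip(node, bs))


def cell_coordinate_floor(node, bs, r):
    """Same computation with the floor, shown only for comparison."""
    return tuple(math.floor((p - b) / r) for p, b in zip(node, bs))


node_i = (9, 11, 10)  # coordinates from the worked example
bs = (0, 0, 0)
r = 4

print(cell_coordinate(node_i, bs, r))        # ceiling variant
print(cell_coordinate_floor(node_i, bs, r))  # floor variant
```

Each node performs a constant amount of work on three coordinates, which is consistent with the stated $O(N)$ cost over $N$ nodes.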
After the grid is divided into cells, some sensor nodes (ANs) in different cells are selected and organized into an aggregation tree rooted at BS. In a cell, the member nodes send the data to AN, and AN sends the aggregation results to BS hop-by-hop along the aggregation tree. Figure 4 demonstrates the aggregation tree and the data aggregation in a cell, respectively.

Key Generation.
We introduce the homomorphic encryption scheme based on the elliptic curve [14] into QPPDA, which can protect private data from being revealed. The encryption method assigns different keys to the concealed data acquired from different nodes, and BS can correctly distinguish them in the aggregation process [41].
Assume that $k$ types of query are supported by a network. Therefore, $k$ public and private key pairs are required. We take the following steps to generate the key pairs. Given a parameter $\tau$, we define an algorithm $\Psi(\tau)$ to output a tuple $(q_1, q_2, \ldots, q_{k+1}, E)$, where $E$ is a set of elliptic curve points that form a cyclic group. The order of $E$ is $n$, where $n = q_1 \times q_2 \times \cdots \times q_{k+1}$. $\Psi(\tau)$ works as follows.
(i) Generate $k + 1$ random $\tau$-bit primes $(q_1, q_2, \ldots, q_{k+1})$ and set $n = q_1 \times q_2 \times \cdots \times q_{k+1}$
(ii) Generate a set of elliptic curve points $E$
(iii) Output the security elements $(q_1, q_2, \ldots, q_{k+1}, E)$
The points $(\alpha_1, \alpha_2, \ldots, \alpha_{k+1}, \beta)$ are randomly chosen, and the order of each of these points is $n$.
Then, we calculate $c$ as
$$c = \alpha_{k+1}^{\prod_{r=1}^{k} q_r}. \tag{3}$$
The order of $c$ is $q_{k+1}$. Next, $k$ public keys $h_j$ ($j = 1, 2, \ldots, k$) are computed for the $k$ queries according to equation (4), where the order of $h_j$ is $q_j$. Finally, we establish the public key set $P_{uk} = \{n, E, h_1, h_2, \ldots, h_k, \beta, c\}$ and the private key set $P_{rk} = \{q_1, q_2, \ldots, q_{k+1}\}$. Therefore, the $j$th key pair, $(P_{uj}, P_{rj})$, is generated for the $j$th query.
The key generation is illustrated in Algorithm 2, where Lines 4 to 6 obtain a tuple by the elliptic curve algorithm, and Lines 7 to 15 display the process of producing keys. $N$ nodes execute the elliptic curve algorithm to generate key pairs for $k$ queries with a complexity of $O(N^2)$.
For each query, $p$ and $t$ represent the type of query and the query epoch, respectively, and $T$ denotes the time that AN spends on replying to BS. Four steps are taken to process the query: the data collection, the data encryption, the data aggregation, and the data decryption.

Data Collection.
After receiving the queries, the nodes collect the sensing data $d_i \in \{0, 1, \ldots, D\}$ according to the query types, where $D$ is the maximum value of $d_i$.

Data Encryption.
The nodes encrypt data using the public key, and the encryption process is as follows.
Step 2: node $i$ selects a key according to the type of query. If the type of query is $x$, that is, $p = x$, the public key is $P_{uk} = P_{ux} = (n, E, g, h_x)$, where $h_x = u^{q_{j+1}}$.
Step 3: node $i$ computes the ciphertext $C_i^x$.

Data Aggregation.
Let $d_x$ denote the reply message of the $x$th query. Consequently, the $k$ ciphertexts in node $i$ (aggregator) are aggregated into a single ciphertext $C_i'$.
Algorithm 1 (grid division), recovered steps: calculate $G_i(g(x_i), g(y_i), g(z_i))$ according to equation (2); obtain the position $G(g(x_i), g(y_i), g(z_i))$.

Data Decryption.
During the decryption, BS is able to decrypt the data of each query separately from the aggregated ciphertext $C_i'$. To decrypt a ciphertext $C_i^x$, BS needs to obtain the plaintext from equation (6) using the private key $q_x$.
The pseudocode of query processing is shown in Algorithm 3. Lines 1 to 10 describe the data collection, the data encryption, and the data aggregation in detail. Lines 11 to 16 delineate the process of separating the decrypted data at BS, with a complexity of $O(N)$.
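The way one aggregated ciphertext can carry several queries and still be separated at BS can be illustrated with a toy model. The sketch below is not the authors' scheme and is not secure: it replaces the elliptic curve group with the additive group of integers modulo $n$ (where discrete logarithms are trivial), and it assumes the forms $h_j = (n/q_j)\,\alpha_j$ and $C = d\,h_x + r\,c$, which are chosen only because they match the element orders stated in the key generation. All function names, primes, and readings are hypothetical.

```python
import math
import random


def keygen(primes, seed=0):
    """Toy key generation in (Z_n, +). Illustration only, NOT secure."""
    rng = random.Random(seed)
    n = math.prod(primes)
    k = len(primes) - 1  # number of supported query types

    def element_of_order_n():
        # In (Z_n, +), a residue has order n exactly when it is coprime to n.
        while True:
            a = rng.randrange(1, n)
            if math.gcd(a, n) == 1:
                return a

    alphas = [element_of_order_n() for _ in range(k + 1)]
    # Assumed construction: h_j has order q_j, and c has order q_{k+1}.
    h = [(n // primes[j]) * alphas[j] % n for j in range(k)]
    c = (n // primes[k]) * alphas[k] % n
    return {"n": n, "h": h, "c": c}, list(primes)


def encrypt(pub, x, d, rng):
    """Encrypt reading d for query type x (0-based) with fresh randomness r."""
    r = rng.randrange(1, pub["n"])
    return (d * pub["h"][x] + r * pub["c"]) % pub["n"]


def aggregate(pub, ciphertexts):
    """Aggregation is plain group addition; no decryption is needed."""
    return sum(ciphertexts) % pub["n"]


def decrypt(pub, priv, x, agg, max_sum):
    """Recover the summed reading of query x from the aggregated ciphertext."""
    n = pub["n"]
    cofactor = n // priv[x]               # cancels every component except query x
    target = cofactor * agg % n
    base = cofactor * pub["h"][x] % n
    for s in range(max_sum + 1):          # small discrete log by exhaustive search
        if s * base % n == target:
            return s
    raise ValueError("sum outside the searched range")


pub, priv = keygen([1009, 1013, 1019])    # k = 2 query types
rng = random.Random(1)
readings = {0: [3, 5, 7], 1: [10, 20, 30]}
packet = aggregate(pub, [encrypt(pub, x, d, rng) for x, ds in readings.items() for d in ds])
print(decrypt(pub, priv, 0, packet, 1000), decrypt(pub, priv, 1, packet, 1000))
```

Multiplying by the cofactor $n/q_x$ annihilates the blinding term and the components of every other query, so only the sum for query $x$ survives; the search bound `max_sum` must stay below $q_x$ for the result to be unambiguous.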

Performance Analysis and Simulation Experiment
We evaluated the performance of QPPDA in terms of privacy preservation, communication efficiency, and computation overhead through theoretical analysis and simulation experiments. QPPDA was implemented using MATLAB. A WSN with 600 nodes was considered, and these nodes were randomly deployed in a 400 m × 400 m area. The transmission range of the sensor node was 50 m.

Privacy-Preservation Analysis.
We analyze the privacy preservation performance of QPPDA when a node is compromised by physical attack. If an adversary compromises an AN, it can perform an unauthorized aggregation and send false aggregation results to BS. However, due to the asymmetry of the public key scheme, an adversary cannot gain any additional information related to the data aggregation. Hence, the compromised node may affect the data integrity but not the data confidentiality in QPPDA. Through the analysis, we conclude that privacy can be revealed through the leakage of keys. Assume that $p_{ovk}$ and $p_{ovr}$ are the probabilities that the key and the random number are broken, respectively. Therefore, the probability of information leakage is $p = (p_{ovk} \times p_{ovr})^k$. Figure 5 demonstrates the privacy performance for different types of queries. The exposure probability of privacy is less than 0.45% even if the reveal probability of the key is 0.4. Besides, the more frequently the queries are sent, the better the data confidentiality will be. This indicates that QPPDA can effectively preserve the data privacy.
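The leakage formula $p = (p_{ovk} \times p_{ovr})^k$ is easy to evaluate numerically. The sketch below is only an illustration: the function name is hypothetical, and the value of $p_{ovr}$ and the range of $k$ are assumptions, since the settings behind Figure 5 are not stated in the text.

```python
def leakage_probability(p_ovk, p_ovr, k):
    """Probability of information leakage for k query types: (p_ovk * p_ovr) ** k."""
    return (p_ovk * p_ovr) ** k


# Assumed settings: key reveal probability 0.4 (from the text), random-number
# reveal probability 0.4 (assumption), and k from 1 to 4 (assumption).
for k in (1, 2, 3, 4):
    print(k, leakage_probability(0.4, 0.4, k))
```

With these assumed values, $k = 3$ gives $0.16^3 \approx 0.0041$, that is, roughly 0.41%, and the probability falls geometrically as $k$ grows.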

Energy Consumption.
In our experiments, we considered the efficiency of communication and computation and adopted the typical data query schemes, the single query [42] and the Slice-Mix-AggRegaTe (SMART) [22], as the benchmarks to verify the energy consumption of QPPDA.

Communication Overhead.
Communication overhead is mainly derived from data transmission, e.g., a node transmits its sensing data to AN or BS in a WSN. For node $i$, the length of data is $l_{ci}$ bits and the average number of hops between any two nodes is $L$. Thus, the communication overhead of a cell with $k$ queries can be computed. The overhead of a single query comes from sending the encrypted data and HMACs. The HMAC of node $i$ is $l_{hi}$ bits and the data is $l_{ci}$ bits. Then, the overhead of a cell with $k$ single queries can be formalized. In SMART, each node divides its sensing data into three slices, two of which are sent to the neighboring nodes and one of which is kept by the node itself.
Algorithm 2 (key generation). Input: security parameter, $\tau$; query types, $k$. Output: key sets $(P_{uk}, P_{rk})$.

Computation Cost.
Single query converts the query scope to a prefix format before the data are transmitted. The number of binary prefixes representing the query range is nearly $(2l_{ci} - 2)$, and the prefix family of a data item contains exactly $(l_{ci} + 1)$ prefixes. Therefore, the node needs to perform about $(2l_{ci} - 2) \times (l_{ci} + 1)$ comparisons, and the computation complexity of the query is $O(w^2 N k)$ in the worst case. The computation overhead of QPPDA comes from data encryption, and its computation complexity is $O(l_{ci} N^2)$ according to Algorithm 2. Data mixing is the main computational cost in SMART, which is $O(N)$. Consequently, according to Figure 7, the computation consumption of single query is higher than that of QPPDA and SMART when the number of nodes in a WSN is fixed. It should be noticed that the computation overhead of SMART is less than that of QPPDA when the one-slice mechanism (SMART-1) is adopted. However, one-slice SMART may result in a lower security level compared with QPPDA. Therefore, our scheme offers a better tradeoff between security and computation complexity.
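The two prefix counts used in this estimate can be checked with a small sketch of the standard range-to-prefix conversion. This is an illustration of the general prefix technique, not code from the single query scheme; the function names and the 4-bit example are hypothetical, with `w` playing the role of the data length in bits.

```python
def range_to_prefixes(lo, hi, w):
    """Cover the integer range [lo, hi] of w-bit values with a minimal set of binary prefixes."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo ...
        size = lo & -lo if lo > 0 else 1 << w
        # ... and that still fits inside the remaining range.
        while size > hi - lo + 1:
            size >>= 1
        wild = size.bit_length() - 1  # number of wildcard bits
        head = format(lo >> wild, f"0{w - wild}b") if w > wild else ""
        prefixes.append(head + "*" * wild)
        lo += size
    return prefixes


def prefix_family(x, w):
    """All w + 1 prefixes of the w-bit value x, from fully specified to fully wildcarded."""
    bits = format(x, f"0{w}b")
    return [bits[: w - i] + "*" * i for i in range(w + 1)]


def in_range(x, lo, hi, w):
    """x lies in [lo, hi] exactly when its prefix family shares a prefix with the range cover."""
    return bool(set(prefix_family(x, w)) & set(range_to_prefixes(lo, hi, w)))


print(range_to_prefixes(1, 14, 4))
print(prefix_family(7, 4))
print(in_range(7, 1, 14, 4), in_range(15, 1, 14, 4))
```

For 4-bit values, the range [1, 14] needs $2 \times 4 - 2 = 6$ prefixes and every value has $4 + 1 = 5$ prefixes in its family, so a naive membership test compares up to $6 \times 5$ pairs, matching the product quoted above.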

Aggregation Accuracy.
The accuracy is defined in [22] as the ratio between the summation collected by the data aggregation and the real summation of all individual sensor nodes. Figure 8 illustrates the accuracy of QPPDA, single query, and SMART with respect to different query times in our simulation. From Figure 8, we can observe that the accuracy of QPPDA improves as the query times increase. Two reasons contribute to this, which have already been analyzed in [8]: (i) with a longer time interval, the data messages to be sent within this duration will have less chance to collide; (ii) with a longer time interval, the data messages will have a better chance of being delivered before the deadline. Besides, we can observe that QPPDA has a better accuracy than single query and SMART. It has been demonstrated in Section 5.2.1 that the communication overhead of QPPDA is reduced significantly, and the amount of transmission of QPPDA is much less than that of single query and SMART. Therefore, the chances of collision and packet loss are also decreased, which leads to an improvement in aggregation accuracy.
Algorithm 3 (query processing), recovered steps: select a key $P_{ux} = (n, E, g, h_x)$; encrypt the data; End If; (10) End For; (11) For ($i = 1$; $i \le x$; $i$++) at BS; (12) decrypt the data.

Conclusion
Energy consumption and data privacy are two important concerns in WSNs. The limited energy of sensor nodes may shorten the lifetime of the network, and the nodes are often deployed in dangerous areas where data privacy is more easily compromised than in a wired network. Faced with these challenges, we present a query privacy protection mechanism for data aggregation which can reduce energy consumption and preserve data privacy as well. Experimental results show that our scheme can guarantee data privacy, decrease the system overhead, and improve the accuracy of data aggregation. In future work, we will focus on aggregation functions other than additive aggregation, such as mean, max, and count. The privacy of QPPDA is closely related to the number of keys, and it is challenging to improve the security of QPPDA without complex key distribution so as to save energy and decrease the storage requirement. In addition, tree or cluster topology will be discussed in our subsequent study in order to expand the application scenarios of our scheme.
Data Availability
The datasets generated or analysed during the current study are available from the corresponding author on reasonable request.