Efficient and Secure Top-k Query Processing on Hybrid Sensed Data

The ubiquity of mobile devices equipped with various sensors has promoted the advent of a novel data sensing paradigm. Building on the traditional static sensing mode, mobile sensing (sensor) nodes collaboratively collect data with static sensor nodes. This large volume of hybrid sensed data is then sent to storage nodes for flexible management and top-k query services. One crucial security issue is that a compromised storage node may falsify or drop some data during query processing and thus return fake or incorrect results to the query users. In this paper, we propose an efficient and verifiable scheme (EVTopk) for secure top-k query processing on hybrid sensed data, which is suitable for tiered hybrid sensing networks where mobile nodes exist. The basic idea is to bind each data record, generated by a static or mobile sensing node, with the location where it is sensed. Verification information is then created sequentially and submitted along with the encrypted locations and hybrid sensed data for the user's verification. The security and efficiency of EVTopk are analyzed in theory and evaluated in experiments, respectively.


Introduction
Nowadays, the proliferation of mobile devices (e.g., smartphones) embedded with powerful sensors has brought about a revolution in the data sensing paradigm of tiered sensor networks. Traditionally, the two-tiered sensor network [1][2][3] consists of many energy-constrained static (or fixed) sensor nodes at the lower level and some resource-abundant storage nodes at the upper level. By contrast, the novel sensing paradigm relies on both static sensor nodes and mobile devices (regarded as mobile sensing or sensor nodes) to jointly monitor data about their surroundings. It brings great benefits to some application-specific scenarios, such as traffic monitoring in a smart city. Mobile phones or vehicles, acting as mobile sensing nodes, can monitor traffic conditions in areas that the fixed traffic sensors do not cover. Clearly, the introduction of mobile sensing nodes provides more comprehensive information and reduces the cost of deploying static sensors. Moreover, the energy consumption of static sensors is better balanced.
The storage nodes, acting as the data center, are responsible for storing the hybrid sensed data and providing assorted query services for network users (query users). In particular, one important type of query is the top-k query, which asks for the $k$ highest (or lowest) data records in a specific region during a specified time slot. Continuing with the above example of traffic monitoring, the traffic department can retrieve the 10 most crowded roads (according to the traffic flow) in a city from 8 AM to 9 AM by issuing a top-10 query.
However, security remains a critical concern for practical top-k queries, since the storage nodes manage a large volume of data and are vulnerable to various attacks. For instance, a compromised storage node may falsify or drop some data during query processing and thus return fake or incorrect results to the query users. To solve this problem, several verification schemes [4][5][6][7][8] for secure top-k queries in tiered sensor networks have been proposed, which aim to enable the query user to verify the authenticity and correctness of query results. Specifically, Dai et al. [4,5] generated verification information by chaining the ordered and adjacent data records. Similarly, Zhang et al. [7] bound adjacent data records and some auxiliary ID lists with a message authentication code (MAC). Unfortunately, these solutions are only suitable for static tiered sensor networks, as they assume that the query users and the storage node know the mapping between the nodes' locations and their corresponding IDs. Clearly, this assumption does not hold in our application scenario, where mobile sensing nodes exist at the lower sensing level. More specifically, unlike static sensor nodes, mobile sensing nodes generate data at different locations, and a compromised storage node may replace the data records sensed in the query region with others sensed outside the region. This attack is difficult to detect with existing verification methods designed for static sensor nodes.
In this paper, we propose an efficient and verifiable scheme for secure top-k query processing (EVTopk) on hybrid sensed data, which enables the query user to efficiently verify the authenticity and correctness of the query result in our novel tiered sensing network. Our specific contributions are summarized as follows:
(i) We formulate a novel tiered sensing network model (tiered hybrid sensing network) on the basis of the traditional tiered sensor network, where mobile sensing nodes jointly collect data records with static sensor nodes at the lower level.
(ii) To achieve EVTopk, we give a concrete method based on sequence relationships that binds each data record with its corresponding sensing location. Moreover, the data records generated by mobile nodes are returned in different formats according to their sensing locations, which enables the query user to detect any inauthentic or incorrect result.
(iii) We present the theoretical analysis and conduct extensive experiments to show the security and efficiency of our proposed scheme.
The rest of the paper is organized as follows. In Section 2, we review related work, followed by preliminaries on the system model and problem formulation in Section 3. Section 4 presents our efficient and verifiable top-k query scheme EVTopk and the theoretical analysis in terms of security and communication overhead. In Section 5, extensive experiments are conducted and the simulation results are analyzed in detail. Finally, we conclude this paper with directions for future work in Section 6.

Related Work
Recently, owing to its better scalability and capacity, the two-tiered wireless sensor network (TWSN) has drawn increasing attention among researchers. In a TWSN, the storage nodes are more likely to be compromised by attackers because of their important role in data storage and query management, which raises two challenging security issues: data privacy and data integrity (query integrity).
Many privacy preserving methods [9][10][11][12] were proposed to prevent compromised storage nodes from learning sensitive information about the data collected by sensor nodes and/or the queries issued by users. Peng et al. [10] proposed a Bloom filter based scheme for secure top-k queries; moreover, the authors adopted an obfuscation coding method to hide the real data distribution and query preference. In [12], a secure top-k query method, called PCTopk, was proposed to preserve data privacy and query correctness simultaneously. However, this line of work is orthogonal to ours: in this paper, we focus only on the security of the query results rather than on privacy, because the hybrid sensed data are not sensitive and are accessible to the public in our scenario.
In another line of research, considerable attention has been paid to ensuring query integrity (both the authenticity and correctness of the query result), such as verifiable range queries [13,14], secure kNN queries [15,16], and top-k queries in different network environments. Three schemes were first proposed by Zhang et al. [2] to verify fine-grained top-k query results in two-tiered sensor networks, which were further improved in [7] to deal with various attacks launched by compromised storage nodes and/or sensor nodes. However, in addition to returning one verification message for each unqualified sensor node, this scheme requires the sensor nodes to exchange their highest scores with other nodes, which results in a large communication overhead. The same problem also exists in [17], because each sensor node needs to send the node IDs and hashes of its neighbors. Additionally, more than $k$ data records are returned to the query user due to the existence of dummy readings. Dai et al. [4,5] proposed an efficient verifiable top-k query scheme (EVTQ) that chains ordered and adjacent data records. Although the scheme can verify the query result, the storage node needs to send multiple verification messages for the unqualified sensor nodes. As an improvement, He et al. [8] proposed an efficient top-k query processing scheme with integrity verification (ETQ-RIV), in which sensor nodes are not required to submit information about neighbors and only one verification message is returned for all unqualified sensor nodes. Therefore, the query communication cost is significantly reduced. However, all the schemes above only consider static sensor nodes and are not suitable for our network model, where mobile sensing nodes exist. Recently, Liu et al. [18] proposed the first verifiable scheme, named VTMSN, for top-k queries in tiered mobile sensor networks, which is the work most relevant to ours. Besides the location information of each data record, both the static and mobile nodes are also required to send encrypted chaining locations to the storage node, some of which are then sent to the query users for verification. Unfortunately, a large extra communication cost is incurred since the static nodes send many redundant locations. Moreover, it is assumed that all the sensor nodes periodically switch between a static and a mobile state, which differs somewhat from our scenario.

Preliminary
In this section, we introduce our system, attack model, and design goals and briefly state the problems investigated in this paper.

System Model.
In this paper, our system model is based on the two-tiered sensor network, which consists of sensor nodes at the lower level and storage nodes at the upper level. Different from traditional two-tiered sensor networks, we introduce mobile sensing (sensor) nodes at the lower level. As shown in Figure 1, in each sensing cell at the sensing level, some mobile sensor nodes (e.g., vehicles in a smart city) cooperate in sensing with the resource-constrained static sensor nodes. This sensing paradigm can reduce the energy consumption of the static nodes. In addition to the hybrid sensor nodes, each sensing cell contains a storage node, which is responsible for collecting the hybrid sensed data and answering query requests from external query users. Specifically, the query users at the user level can issue a top-k query and obtain the query results via an on-demand wireless link.
For ease of presentation, we assume that the whole network is partitioned into many sensing cells according to real geographic location. As introduced above, each cell consists of many fixed (static) sensor nodes, mobile sensor nodes, and a storage node. In particular, similar to [18], we assume that each mobile node only moves within its affiliated sensing cell. During the sensing period, the mobile nodes move and stop regularly to save energy, and we suppose that sensed data are only generated when the mobile nodes stop moving. In other words, each data record generated by a mobile node corresponds to the specific location where it is sensed. Localization techniques [19][20][21] are available to estimate the locations of static and mobile nodes. At the end of each sensing time slot, the static and mobile sensor nodes send their hybrid sensed data to their corresponding storage node.

Attack Model.
Compared with a sensor node, a storage node holding a large volume of sensed data is more vulnerable to attacks. Hence, in this paper, we mainly address the security problem of a compromised storage node that may return fake or incorrect query results to the query users. In particular, the following attacks are considered in our model.
(i) Replacement attack: the compromised storage node may replace some qualified top-k query results with unqualified data records, or even with records not generated by the sensor nodes. More specifically, since the mobile nodes generate data records at different locations, the compromised storage node may replace the data records sensed in the query region with those sensed by the same mobile node at other locations.
(ii) Deletion attack: the compromised storage node may drop some qualified data records generated in the query region and return incomplete results to the query users.

Design Goals.
In view of the aforementioned attacks, our design goal is to enable the query users to verify the authenticity and correctness of the query results returned by the storage node. In addition, query and verification efficiency should also be guaranteed. In general, our method aims to achieve the following security goals and performance guarantee.
(i) Authenticity: all the $k$ data records in the result set returned by the storage node are generated by static sensor nodes, or by mobile sensor nodes that have moved within the query region.
(ii) Correctness: all the $k$ data records in the result set are in the query region, and they have the $k$ highest scores among the data records sensed in the query region.
(iii) Efficiency: since the communication cost plays a significant role in the energy consumption of the whole network, the two security goals above should be achieved with as little communication overhead as possible.

As presented in [7], we assume that each record $d_{i,j} \in D_i$ can be scored by some public scoring function [22] according to the specific query application. Let $f(\cdot)$ denote a specified increasing scoring function; then the score of data record $d_{i,j}$ is $s_{i,j} = f(d_{i,j})$. For simplicity, we only consider scalar sensed data, and the situation where $d_{i,j}$ and $d_{i',j'}$ are only partially ordered is not considered in this paper. Therefore, if $d_{i,j} \le d_{i',j'}$, which means that each attribute value in $d_{i,j}$ is smaller than or equal to that in $d_{i',j'}$, then we have $s_{i,j} \le s_{i',j'}$. In other words, if $d_{i,j}$ and $d_{i',j'}$ are both generated in the query region and $d_{i,j}$ is in the top-k query result, so is $d_{i',j'}$. Furthermore, we formulate the top-k query as $Q_t = \langle c_q, G_q, t, k \rangle$, where $c_q$ and $G_q$ denote the ID of the query sensing cell and the geographic query region, respectively, and $t$ and $k$ are the query time slot and the number of desired data records. Given the query request $Q_t$, the issue investigated in this paper is how to enable the query user to efficiently verify the authenticity and correctness of the query result, as described in Section 3.3.
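To make the query semantics concrete, the following minimal Python sketch evaluates a top-k query over plaintext records, without any verification. The `Record` type, the rectangular region encoding, and the function names are illustrative assumptions, not part of the scheme as specified in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Region = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), an assumed encoding


@dataclass(frozen=True)
class Record:
    node_id: int
    slot: int
    location: Tuple[float, float]  # sensing location in metres
    score: float                   # value of the public scoring function


def in_region(location: Tuple[float, float], region: Region) -> bool:
    """Return True if the location lies inside the rectangular query region."""
    x, y = location
    x_min, y_min, x_max, y_max = region
    return x_min <= x <= x_max and y_min <= y <= y_max


def top_k(records: List[Record], region: Region, slot: int, k: int) -> List[Record]:
    """Return the k highest-scoring records sensed in the region during the slot."""
    candidates = [r for r in records if r.slot == slot and in_region(r.location, region)]
    return sorted(candidates, key=lambda r: r.score, reverse=True)[:k]


records = [
    Record(1, 8, (100.0, 100.0), 42.0),
    Record(2, 8, (150.0, 120.0), 57.0),
    Record(2, 8, (900.0, 900.0), 99.0),  # high score, but sensed outside the region
    Record(3, 7, (110.0, 130.0), 80.0),  # sensed in a different time slot
    Record(3, 8, (110.0, 130.0), 35.0),
]
result = top_k(records, region=(0.0, 0.0, 400.0, 400.0), slot=8, k=2)
```

Note that the record with score 99.0 must be excluded even though it has the highest score, because the same mobile node sensed it outside the region. This is exactly the case a replacement attack would try to exploit.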

Problem Statement.
To make our description clear, the major notations used in this paper are summarized in Notation Settings.

Secure Top-𝑘 Query Processing
In this section, we propose a novel verifiable top-k query scheme, EVTopk, to efficiently verify the authenticity and correctness of top-k query results over hybrid sensed data. We describe the scheme in detail in Section 4.1, and the security analysis against the compromised storage node is presented in Section 4.2.

Manipulations at Three Levels.
In this section, we describe the data preprocessing, top-k query processing, and query verification at the sensing, storage, and user levels, respectively. The details are as follows.

Data Preprocessing at Sensing Level.
In addition to submitting the sensed data, the sensor nodes are required to send some additional information to the storage node, part of which is then returned to the query users for query verification. Specifically, we generate the verification information by chaining the ordered hybrid data records of each sensor node via a cryptographic hash function. Note that, to perform effective query verification for mobile sensor nodes, we bind each data record with its corresponding sensing location.
Consider the sensor node $S_i$ as an example. First, $S_i$ sorts its sensed data records $D_i$ during each time slot in descending order of the records' scores. We assume that the scores of the data records in $D_i$ differ from each other, so that a unique correct query result exists for each top-k query. Suppose that an ordered list $\langle d_{i,1}, d_{i,2}, \ldots, d_{i,n_i} \rangle$ is generated such that $s_{i,1} > s_{i,2} > \cdots > s_{i,n_i}$. In addition, similar to [7], we assume that $S_i$ shares a distinct key $k_{i,t}$ with the query user during time slot $t$, which is used to encrypt the location information and to compute the hash message authentication code (HMAC) used as verification information. Initially, $S_i$ is preloaded with $k_{i,0}$. At the end of time slot $t \ge 1$, the time slot key $k_{i,t}$ is updated with $\mathrm{hash}(k_{i,t-1})$ [7], where $\mathrm{hash}(\cdot)$ denotes a hash function. Note that, to provide stronger security, it is necessary to update the encryption key in each time slot. Following the sequence of the ordered data records, $S_i$ then chains adjacent data scores by recursively computing the verification values $V_{i,j}$ as in (1), where $\mathrm{HMAC}_{k_{i,t}}(\cdot)$ denotes the hash message authentication code keyed with $k_{i,t}$ and $\|$ refers to concatenation. Note that, if $S_i$ is a static node, we have $l_{i,j} = l_{i,1}$ for $j \in [2, n_i]$. We hereafter use $l_{i,1}$ to denote the location of static node $S_i$ for simplicity. Moreover, if $n_i = 0$, we set $V_{i,1} = \mathrm{HMAC}_{k_{i,t}}(i \,\|\, t)$.
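The per-slot key update and the chaining step can be illustrated with Python's standard `hmac` and `hashlib` modules. This is a sketch under stated assumptions, not the paper's exact construction: the chaining form used here, in which each tag covers a record's score, its location, and the tag of the next lower record, is an assumed reading of (1). The byte encoding, the use of SHA-1 for the key update, and all function names are likewise hypothetical.

```python
import hashlib
import hmac
from typing import List, Tuple

Location = Tuple[float, float]


def next_slot_key(prev_key: bytes) -> bytes:
    """Derive the key of the next time slot by hashing the previous key."""
    return hashlib.sha1(prev_key).digest()


def encode(score: float, location: Location) -> bytes:
    """Hypothetical byte encoding of a (score, location) pair."""
    return f"{score!r}|{location[0]!r},{location[1]!r}".encode()


def build_chain(key: bytes, ordered: List[Tuple[float, Location]]) -> List[bytes]:
    """Return tags V_1..V_n for records sorted by descending score.

    Assumed form: V_j = HMAC_key(score_j || location_j || V_{j+1}),
    with an empty trailing tag for the last record.
    """
    tags: List[bytes] = [b""] * len(ordered)
    following = b""
    for j in range(len(ordered) - 1, -1, -1):
        score, loc = ordered[j]
        following = hmac.new(key, encode(score, loc) + b"|" + following, hashlib.sha1).digest()
        tags[j] = following
    return tags


key_0 = b"preloaded-secret"          # stands in for the preloaded initial key
key_1 = next_slot_key(key_0)         # key for time slot 1
records = [(35.0, (10.0, 20.0)), (57.0, (15.0, 25.0)), (42.0, (30.0, 5.0))]
ordered = sorted(records, key=lambda r: r[0], reverse=True)
tags = build_chain(key_1, ordered)
```

Because each tag depends on all lower-ranked records, the first tag commits the node to its whole ordered list, and any suffix of the list reproduces the corresponding suffix of the tags.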
Let $E_{k_{i,t}}(\cdot)$ denote the symmetric encryption function with key $k_{i,t}$ (in this paper, we use AES to encrypt the location information because of its better performance, especially when dealing with large volumes of data). Subsequently, each sensor node $S_i \in \{S_i\}_{i=1}^{N}$ submits the following information $\Psi_i$ to the storage node. Specifically, if $S_i$ is a static node, the information is as follows:
$\langle i, t, E_{k_{i,t}}(l_{i,1}), d_{i,1}, \ldots, d_{i,n_i}, V_{i,1}, \ldots, V_{i,n_i} \rangle, \quad n_i > 0.$ (2)
If $S_i$ is a mobile node, the information in (3) carries the encrypted concatenation of all its sensing locations, $E_{k_{i,t}}(l_{i,1} \,\|\, \cdots \,\|\, l_{i,n_i})$, in place of the single encrypted location. Note that, different from [2,7], we assume that the location information of each sensor node is unknown to the storage node and the query users in advance. Therefore, it is necessary to send some location information to the storage node in this phase. After receiving the message $\Psi_i$, the storage node stores the corresponding information of $S_i$, including its data records $D_i = \{d_{i,j}\}_{j=1}^{n_i}$, the encrypted sensing locations $E_{k_{i,t}}(l_{i,1} \,\|\, \cdots \,\|\, l_{i,n_i})$, and the verification information $V_{i,1}, \ldots, V_{i,n_i}$.

Top-𝑘 Query Processing at Storage Level.
The storage node, after receiving a top-k query $Q_t = \langle c_q, G_q, t, k \rangle$ from the query user, first locates the query cell $c_q$ and its corresponding query region $G_q$. It is worth noting that $G_q$ may completely cover multiple static sensor nodes or partially cover multiple mobile sensor nodes during time slot $t$, as mobile nodes may move out of $G_q$ during this time slot. Let $U_q$ denote the set of static or mobile sensor nodes in $G_q$; we call the set of their corresponding data records the candidate set, denoted by $\mathcal{C}_t$. Note that some data records in $\mathcal{C}_t$ may not be in the query region.
Subsequently, the storage node retrieves the $k$ highest data records $RS_t$ in $\mathcal{C}_t$ whose sensing locations are in $G_q$, among which the lowest record score is denoted by $\tau$. In particular, for each $S_i \in U_q$, we assume that there are $m_i$ data records with scores not lower than $\tau$. Accordingly, we have $0 \le m_i \le n_i$ and $k \le \sum_{S_i \in U_q} m_i \le |\mathcal{C}_t|$, as some mobile sensor nodes may generate data records outside $G_q$ but with scores higher than $\tau$. Moreover, all these data records are in the candidate set $\mathcal{C}_t$. For each $d_{i,j} \in \mathcal{C}_t$, we make the following definition:
$x_{i,j} = \begin{cases} d_{i,j}, & l_{i,j} \in G_q, \\ s_{i,j}, & \text{otherwise}, \end{cases}$ (4)
which means that if the data record $d_{i,j}$ is generated in $G_q$, $x_{i,j}$ equals $d_{i,j}$; otherwise, it equals the score of $d_{i,j}$. The query user can then simply identify whether the data records are in $G_q$ according to their data format (here referring to the data dimensions). For each candidate $S_i \in U_q$, the storage node returns the following information $\mathcal{R}_i$ to the query user.
If $S_i$ is a static sensor node, the information is
$\langle i, E_{k_{i,t}}(l_{i,1}), d_{i,1}, \ldots, d_{i,m_i}, V_{i,1} \rangle, \quad m_i = n_i \ge 1.$ (5)
If $S_i$ is a mobile sensor node, the information is given by (6), while for each $S_i \notin U_q$, the information is given by (7). As we can see, for the sensor nodes in $U_q$, besides the qualified top-k data records in $RS_t$, the storage node must return the data scores that are higher than $\tau$ but are generated outside $G_q$, so that the query user can recompute $V_{i,1}$ with (1). For the sensor nodes outside the query region, the storage node only needs to return the node IDs and their encrypted location information (7) to the query user for verification.
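The selection logic at the storage node can be sketched as follows, leaving out encryption and HMAC tags. The sketch computes the threshold (the lowest score among the in-region top-k) and, for every node, lists each entry scoring at least the threshold either as a full record (sensed inside the region) or as a bare score (sensed outside), mirroring the two data formats described above. The dictionary layout, the tuple tags `"record"` and `"score"`, and the function names are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple

Location = Tuple[float, float]
Region = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), an assumed encoding


def in_region(location: Location, region: Region) -> bool:
    x, y = location
    x_min, y_min, x_max, y_max = region
    return x_min <= x <= x_max and y_min <= y <= y_max


def answer_query(
    data: Dict[int, List[Tuple[float, Location]]], region: Region, k: int
) -> Tuple[Optional[float], Dict[int, list]]:
    """Return the threshold score and, per node, the entries scoring at least the threshold."""
    inside = sorted(
        (score for recs in data.values() for score, loc in recs if in_region(loc, region)),
        reverse=True,
    )
    if not inside:
        return None, {}
    threshold = inside[:k][-1]  # lowest score among the in-region top-k
    response: Dict[int, list] = {}
    for node_id, recs in data.items():
        items = []
        for score, loc in sorted(recs, key=lambda r: r[0], reverse=True):
            if score < threshold:
                break
            # In-region entries keep their location; out-of-region ones reveal the score only.
            items.append(("record", score, loc) if in_region(loc, region) else ("score", score))
        response[node_id] = items
    return threshold, response


data = {
    1: [(42.0, (100.0, 100.0))],
    2: [(57.0, (150.0, 120.0)), (99.0, (900.0, 900.0))],
    3: [(35.0, (110.0, 130.0))],
}
threshold, response = answer_query(data, region=(0.0, 0.0, 400.0, 400.0), k=2)
```

In this example, node 2 must also disclose the score 99.0 of its out-of-region record, since the user needs every score above the threshold to recompute the head of that node's chain.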

Result Verification at User Level.
Now we discuss how the user verifies the authenticity and correctness of the query result, as mentioned in Section 3.3. For each piece of information $\mathcal{R}_i$ received from the storage node, the query user follows the verification steps below.
(1) First, the user determines which of the above cases $\mathcal{R}_i$ belongs to according to its message format. More specifically, the user can distinguish (5) from (6) according to the first part (i.e., the node's ID), since it is assumed in Section 3.4 that users can identify the static nodes by their IDs. In addition, as we can see, the information in (7) only contains two parts (ID [...] scores are chained with HMAC. Similarly, if all the qualified data records of $S_i$ are dropped, this is easily detected during the authentication verification, because the storage node is still required to return the maximum data score $s_{i,m_i+1}$ and the verification information $V_{i,m_i+2}$, from which the user obviously cannot derive the same $V_{i,1}$ as computed by $S_i$. For a mobile node $S_i$, if part of its qualified data records is dropped, the result may pass the authentication verification but will fail the correctness check in step (3). Assume that $\mathcal{R}_i = \langle d_{i,1}, s_{i,2}, d_{i,3}, s_{i,4}, V_{i,5}, V_{i,1} \rangle$ is supposed to be returned to the user, which indicates that $d_{i,1}$ and $d_{i,3}$ are two qualified data records, and $s_{i,2}$ is higher than $\tau$ but was generated outside $G_q$. If the compromised storage node drops $d_{i,3}$ and returns $d_{i,1}$, $s_{i,2}$, $V_{i,3}$, and $V_{i,1}$ to the user, the user can derive the correct $V_{i,1}$ from the first three items $d_{i,1}$, $s_{i,2}$, and $V_{i,3}$. However, the format indicates that $d_{i,1}$ is a qualified record and $s_{i,2}$ is lower than $\tau$, which contradicts the fact that $s_{i,2}$ is higher than $\tau$. As with the static node, if all the qualified data records of $S_i$ are dropped, this is easily detected during the authentication verification.
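The chain-consistency part of this verification can be sketched as follows. The user folds the returned entries back into a head tag, starting from the returned tail tag, and compares it with the claimed $V_{i,1}$. As a simplification, the sketch assumes that the user already holds the plaintext location of every entry (in the scheme these would come from decrypting the returned locations) and that each tag covers score, location, and the following tag. Both are assumptions, as are the encoding and the function names. The format-based correctness check of step (3) is not modeled.

```python
import hashlib
import hmac
from typing import List, Tuple

Entry = Tuple[float, Tuple[float, float]]  # (score, location)


def tag(key: bytes, entry: Entry, following: bytes) -> bytes:
    """Assumed tag form: HMAC over score, location, and the next tag in the chain."""
    score, (x, y) = entry
    message = f"{score!r}|{x!r},{y!r}|".encode() + following
    return hmac.new(key, message, hashlib.sha1).digest()


def head_tag(key: bytes, entries: List[Entry], tail: bytes = b"") -> bytes:
    """Fold the entries from last to first, starting from the given tail tag."""
    current = tail
    for entry in reversed(entries):
        current = tag(key, entry, current)
    return current


def verify(key: bytes, returned: List[Entry], tail: bytes, claimed_head: bytes) -> bool:
    """Check that the returned prefix and tail tag reproduce the claimed head tag."""
    return hmac.compare_digest(head_tag(key, returned, tail), claimed_head)


key = b"slot-key-shared-with-user"
chain = [
    (99.0, (900.0, 900.0)),
    (57.0, (150.0, 120.0)),
    (42.0, (100.0, 100.0)),
    (35.0, (110.0, 130.0)),
]
v1 = head_tag(key, chain)        # head tag computed by the sensor node
v4 = head_tag(key, chain[3:])    # tag of the lowest-ranked entry

honest = verify(key, chain[:3], v4, v1)
dropped = verify(key, [chain[0], chain[2]], v4, v1)
forged = verify(key, [chain[0], (58.0, (150.0, 120.0)), chain[2]], v4, v1)
```

Here `honest` holds, while dropping the middle entry or altering its score breaks the recomputed head tag, since the storage node cannot produce valid tags without the slot key.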

Communication Cost among Three Levels.
We assume that the length of encrypted data is the same as that of its corresponding plaintext. Let $L_{\mathrm{id}}$, $L_{\mathrm{loc}}$, $L_{\mathrm{score}}$, and $L_{h}$ denote the bit-lengths of a sensor node's ID, location, score, and each HMAC, respectively. Moreover, let $H$ be the average number of hops between a sensor node and a storage node. Additionally, suppose that there are $N_1$ static sensor nodes $(S_1, S_2, \ldots, S_{N_1})$ and $N_2 = N - N_1$ mobile nodes $(S_{N_1+1}, S_{N_1+2}, \ldots, S_N)$ in a query cell. For ease of presentation, we assume that each sensor node $S_i$, $i \in [1, N]$, generates $n_i = n$ data records during slot $t$. Then we have the following theorems.

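Theorem 3. Without considering the transmission of some fundamental information, the extra communication cost between the sensing and the storage level in EVTopk is given by
$C_{\mathrm{ss}} = H \left[ N_1 (L_{\mathrm{loc}} + n L_{h}) + N_2\, n (L_{\mathrm{loc}} + L_{h}) \right],$ (8)
where $H$ is the average hop count, $N_1$ and $N_2$ are the numbers of static and mobile nodes, $n$ is the number of records per node, and $L_{\mathrm{loc}}$ and $L_{h}$ are the bit-lengths of a location and an HMAC.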
Proof. As shown in Section 4.1.1, since the node IDs, time slot, and sensed data records are the fundamental information to be sent to the storage node, we omit the cost of transmitting them for clarity.
Each static node $S_i$, $i \in [1, N_1]$, needs to send the encrypted location $E_{k_{i,t}}(l_{i,1})$ and $n$ pieces of verification information $V_{i,1}, \ldots, V_{i,n}$ to the storage node. Note that the length of encrypted data is the same as that of its corresponding plaintext. Thus, the additional communication cost incurred by each static node is
$H (L_{\mathrm{loc}} + n L_{h}).$ (9)
Each mobile node $S_i$, $i \in [N_1 + 1, N]$, needs to send the encrypted locations $E_{k_{i,t}}(l_{i,1} \,\|\, \cdots \,\|\, l_{i,n})$ and $n$ pieces of verification information $V_{i,1}, \ldots, V_{i,n}$ to the storage node. Thus, the additional communication cost incurred by each mobile node is
$H (n L_{\mathrm{loc}} + n L_{h}).$ (10)
Hence, by integrating (9) and (10) over the $N_1$ static nodes and the $N_2$ mobile nodes, we have
$C_{\mathrm{ss}} = H \left[ N_1 (L_{\mathrm{loc}} + n L_{h}) + N_2\, n (L_{\mathrm{loc}} + L_{h}) \right].$ (11)

Let $N_q$ denote the number of qualified sensor nodes that contribute their records to $RS_t$; specifically, $N_q$ contains $N_{q1}$ static and $N_{q2}$ mobile nodes. Then the number of unqualified sensor nodes is $N - N_q$, which comprises $N_1 - N_{q1}$ static nodes and $N_2 - N_{q2}$ mobile nodes.
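The sensing-to-storage cost described in this proof can be checked numerically. The sketch below follows the per-node counts stated above: one encrypted location plus one HMAC per record for a static node, one encrypted location and one HMAC per record for a mobile node, each multiplied by the average hop count. The 160-bit HMAC length and the 250/250 node split with 10 records per node match the simulation settings in this paper. The 64-bit location length and the hop count of 3 are assumed values chosen only for illustration.

```python
def sensing_storage_extra_cost(
    n_static: int,
    n_mobile: int,
    records_per_node: int,
    loc_bits: int,
    hmac_bits: int,
    avg_hops: int,
) -> int:
    """Extra bits sent from the sensing level to the storage level in one time slot."""
    # A static node sends one encrypted location and one HMAC per record.
    static_bits = loc_bits + records_per_node * hmac_bits
    # A mobile node sends one encrypted location and one HMAC per record.
    mobile_bits = records_per_node * (loc_bits + hmac_bits)
    return avg_hops * (n_static * static_bits + n_mobile * mobile_bits)


cost_bits = sensing_storage_extra_cost(
    n_static=250,
    n_mobile=250,
    records_per_node=10,
    loc_bits=64,     # assumed location length
    hmac_bits=160,   # HMAC-SHA1 output length
    avg_hops=3,      # assumed average hop count
)
```

The expression is linear in both the number of nodes and the number of records per node, which is consistent with the linear growth reported in the simulation section.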

Theorem 4. Without considering the transmission of qualified data records and their corresponding node IDs, the maximum extra communication cost between storage node and query user in EVTopk is given by
Proof. As shown in (5)

Simulation Results
In this section, we evaluate the efficiency of our proposed scheme EVTopk and validate the theoretical results. Specifically, we use 160-bit HMAC-SHA1 to compute the verification information for each sensor node, and the default query region in the cell is 400 m × 400 m. The performance metrics used to evaluate our scheme are the two aspects discussed above: the communication cost between the sensing and storage levels, and that between the storage and user levels. It is worth noting that these metrics are not compared with those of other schemes, as none of them fits our network model, in which both static and mobile nodes exist. To reduce the effect of random variation, in all sets of experiments each reported value is the average over 100 random simulations.

Communication Cost between Sensing and Storage Level.
Figure 2 shows the simulation and theoretical results of the communication cost between the sensing and storage levels with varying $N$, the total number of sensor nodes in the query cell. It is clear that our simulation result matches the analytical result closely. More specifically, the communication cost increases linearly as $N$ goes from 200 to 800. The reason is that more static or mobile sensor nodes need to send location and verification information to the storage node as $N$ increases. In particular, because of their mobility, mobile nodes need to send encrypted concatenated locations, as shown in (3), which results in a higher communication cost than for static nodes. Moreover, the actual number of hops between a mobile sensor node and a storage node may exceed the average hop count considered in Theorem 3. Therefore, the simulation incurs a slightly higher communication cost than the theoretical result.
Figure 3 shows the impact of $n$, the number of data records generated by each node per slot, on the sensing-storage communication cost in both our simulation and theoretical analysis. Similarly, we can observe that the simulation and theoretical results closely match as $n$ increases, and that the sensing-storage communication cost in both grows linearly with $n$. The reason is that more location and verification information needs to be sent to the storage node for each static or mobile node. Moreover, our simulation exhibits a slightly higher cost for the same reason as analyzed above.

Communication Cost between Storage and User Level.
As for the communication cost between the storage and user levels, since Theorem 4 gives only a theoretical bound rather than the detailed cost, we do not compare the theoretical and simulation results of our scheme in this section. Figure 4 illustrates the impact of $k$ on the storage-user communication cost where the query region $G_q$ is 400 m × 400 m, 500 m × 500 m, and 600 m × 600 m, respectively.
As we can see, on the one hand, as $k$ increases, the communication cost grows slowly for a fixed query region. On the other hand, if $k$ remains unchanged, a larger query region incurs a higher cost. The reason is that more verification information about qualified or unqualified sensor nodes is transmitted to the query user as $k$ or $G_q$ increases. More specifically, if $k$ remains unchanged, more static or mobile nodes are unqualified as $G_q$ is extended, which incurs more verification information about unqualified nodes. In contrast, if $k$ increases, there may be more qualified sensor nodes in a given query region, which incurs more verification information about qualified nodes. Note that the impact of $k$ on the storage-user communication cost is much smaller than that of $G_q$, because unqualified nodes incur more cost than qualified nodes. Moreover, it is not required to send verification information for each qualified data record.
In addition, Figure 5 depicts the impact of $N$ and $n$ on the storage-user communication cost. Similarly, the communication cost between the storage and user levels increases linearly with $N$, which is consistent with the theoretical bound given in Theorem 4. This is because the query region becomes denser as $N$ grows in a given query cell, and more verification information about unqualified sensor nodes needs to be sent to the query user. In contrast, as $n$ increases, more location information about mobile nodes needs to be transmitted, which leads to a larger communication cost.

Conclusion
In this paper, we study secure top-k query processing in a novel sensing paradigm, where hybrid sensed data are generated by both static and mobile sensor nodes. To tackle the security threats posed by a compromised storage node, we propose an efficient and verifiable scheme that enables the query user to efficiently verify the authenticity and correctness of the query result. A novel data binding method is designed to generate the verification information, with which the storage node is not required to return verification information for each qualified data record. Theoretical analysis and simulation results demonstrate the security and efficiency of our scheme. Our future work is to investigate privacy preserving top-k query processing with integrity verification on hybrid sensed data in tiered hybrid sensing networks.

Notation Settings
$N$: The number of sensor nodes in a sensing cell
$S_i$: The $i$th sensor node in a sensing cell
$R$: The communication radius of sensor nodes
$D_i$: The set of data records sensed by $S_i$
$d_{i,j}$: The $j$th data record in $D_i$
$s_{i,j}$: The score of $d_{i,j}$

Figure 2: Impact of $N$ on the sensing-storage communication cost.

Figure 3: Impact of $n$ on the sensing-storage communication cost.

Figure 4: Impact of $k$ and $G_q$ on the storage-user communication cost.

Figure 5: Impact of $N$ and $n$ on the storage-user communication cost.
Statement. Without loss of generality, we only focus on top-k query processing that covers one sensing cell for simplicity, which consists of $N$ sensor nodes $\{S_i\}_{i=1}^{N}$ (including static and mobile sensor nodes) and a storage node. Moreover, we assume that the query user can distinguish the static nodes from the mobile nodes by their IDs. Suppose that each sensor node $S_i \in \{S_i\}_{i=1}^{N}$ senses $n_i$ data records, denoted by $D_i = \{d_{i,j}\}_{j=1}^{n_i}$, during each time slot $t$. In particular, the location where each record $d_{i,j}$ is generated is denoted by $l_{i,j}$. Obviously, if $S_i$ is a static sensor node, $l_{i,j}$ is the same for each $j \in \{1, \ldots, n_i\}$. At the end of slot $t$, the storage node receives $\sum_{i=1}^{N} n_i$ data records ($D = \bigcup_{i=1}^{N} D_i$). Note that the sensed data records may have multiple attributes (e.g., temperature, humidity), since the sensor nodes are usually embedded with multiple sensors.

Table 1: Default simulation settings.
More specifically, we assume a sensing cell of 1000 m × 1000 m with 500 randomly distributed sensor nodes and a storage node at the center, in which static and mobile nodes each account for 50%, and the mobile nodes move according to a random waypoint model. Without loss of generality, we suppose that each static or mobile sensor node generates 10 data records during the time slot, and that the packets are transmitted without collision or error. Table 1 presents the default parameter settings in our experiments, unless stated otherwise.