Alternative Tuples Based Probabilistic Skyline Query Processing in Wireless Sensor Networks

Because uncertainty is inherent in sensing data, this paper investigates processing and optimization techniques for Probabilistic Skyline (PS) queries in wireless sensor networks (WSNs). An analysis of the properties of PS shows that it is not decomposable, so in-network aggregation techniques cannot be used directly to improve performance. We therefore propose an efficient algorithm, called Distributed Processing of Probabilistic Skyline (DPPS) query in WSNs. The algorithm divides the sensing data into candidate data (CD), irrelevant data (ID), and relevant data (RD). The ID on each sensor node can be filtered out directly to reduce data transmission cost, because CD and RD alone suffice to obtain the correct PS result at the base station. Experimental results show that the proposed algorithm effectively reduces data transmission by filtering out unnecessary data and greatly prolongs the lifetime of WSNs.


Introduction
Wireless sensor networks (WSNs) have an increasingly important impact on the ways in which information from the physical world is collected and used. With the rapid development of microelectronics, communication, and embedded technologies, WSNs have attracted wide attention from industry and academia because of their great commercial prospects and their value for academic research [1][2][3]. For example, forest fires can be prevented by monitoring temperature and humidity in real time. Influenced by manifold factors such as hardware devices, sensor technology, communication quality, and the surrounding environment, the sensing data collected by sensor nodes are often inaccurate or of low confidence. That is to say, the temperature and humidity data acquired by sensor nodes are not exact. Since uncertainty is an inherent property of sensing data, sensing data are essentially uncertain data.
In this paper, an efficient algorithm, called Distributed Processing of Probabilistic Skyline (DPPS) query in WSNs, is proposed. It addresses the problem of PS query processing in distributed WSNs in which alternative tuples exist. The basic idea is to perform data pruning and aggregation at the sensor nodes so that only the data required for final processing are transferred to the base station. We compare the data communication cost of DPPS with that of a Centralized Algorithm (CA) to examine the effectiveness of DPPS, and we perform sensitivity tests to observe the behavior of DPPS under various parameter settings. The results validate our ideas and show the superiority of our proposal.
In summary, the contributions of this paper are as follows: (i) The properties of PS are analyzed, and we prove theoretically that the PS query is not decomposable. (ii) An efficient algorithm, called Distributed Processing of Probabilistic Skyline (DPPS) query in WSNs, is proposed, which reduces the amount of in-network data transmission by filtering the irrelevant data on the sensor nodes. (iii) The experimental results show that DPPS transmits less data in WSNs than CA.
The rest of this paper is organized as follows. Related work is introduced briefly in Section 2. Section 3 introduces the important notions and theorems. In Section 4, DPPS is described in detail. We evaluate the performance of DPPS in Section 5. Finally, the conclusion of this paper is presented in Section 6.

Related Work
Here, we review representative work in the areas of (1) skyline query processing in WSNs and (2) skyline query processing on uncertain data.

Skyline Query Processing in Sensor Networks. A large number of research works in this area have appeared in the literature [9][10][11][12][13][14][15][16]. Because of the limited energy budget of sensor nodes, the primary issue is how to develop energy-efficient techniques that reduce communication and energy costs in the network. In [9], Wang et al. analyzed the properties of the reverse skyline query and presented a skyband-based approach to answer reverse skyline queries efficiently over WSNs. Chen et al. [10] addressed the problem of continuous skyline monitoring in WSNs and presented a hierarchical threshold-based approach, MINMAX, to minimize the transmission traffic. Two papers [11, 12] investigated sliding window skylines in sensor networks. The former put forward an energy-efficient algorithm, SWSMA, to continuously maintain sliding window skylines over a wireless sensor network. The algorithm employs a tuple filter or a grid filter within each sensor to reduce the amount of data to transmit and, as a consequence, to save energy. The latter proposed a method, EES, which uses a mapping function to map the data into a smaller range of integers and computes the skyline of the mapped set as the mapped skyline filter (MSF). Chen et al. [13] partitioned the entire data set into disjoint subsets and returned the skyline points progressively by examining the subsets one by one. In addition, a global filter consisting of some skyline points found in the processed subsets is used to filter out unlikely skyline points from the remaining subsets before transmission. Shen et al. [14] studied location-based skyline queries in WSNs and proposed an energy-efficient approach, Ring-Skyline (RS), which divides the monitoring area into several rings and adopts in-network query processing to reduce energy consumption. In [15], Xin et al. proposed an energy-efficient multiskyline evaluation (EMSE) algorithm to evaluate multiple skyline queries effectively in WSNs. EMSE utilizes both global and local optimization mechanisms to eliminate unnecessary data transmission. In [16], a new filter-based method, called SKYFILTER, was proposed for skyline query processing. The method improves efficiency by reducing the total wireless communication between sensor nodes.
Skyline Query Processing on Uncertain Data. In [17], bottom-up and top-down algorithms are put forward to process $p$-skyline queries; a $p$-skyline contains all the objects whose skyline probabilities are at least $p$. Reference [20] first studied the all skyline query problem over discrete uncertain data sets and proposed a space splitting algorithm and a dominating counting algorithm; unqualified objects can be filtered efficiently with the help of grid-based space division and weight counting. References [18, 19] investigated the PS query over uncertain data streams. The former proposed a candidate list approach to compute a PS on a large number of uncertain tuples within the sliding window, and the latter studied the problem of efficiently computing the skyline over sliding windows on uncertain data elements against probability thresholds. In [21], Böhm et al. modeled the uncertainty with probability density functions (pdfs) and investigated the skyline query over pdf-modeled uncertain data. Additionally, in [22], the objects are indexed with the Gauss-tree in the parameter space to improve the pruning efficiency, where the leaf nodes store the objects with expectation and variance. Ding and Jin [23] first addressed the distributed uncertain skyline query problem and proposed the DSUD and e-DSUD algorithms to process queries over tuple-level uncertain data in which the uncertain tuples are independent of each other. For skyline computation in highly distributed environments, Hose and Vlachou [24] provide a good survey of existing approaches, in which uncertain skyline queries and open research directions are discussed. The reverse skyline query over an uncertain database retrieves all the uncertain objects whose dynamic (or relative) skylines [25] contain a user-specified query object with a probability not less than a user-specified threshold. In [26], efficient exact and approximate algorithms are proposed to tackle the problem that skyline probability computation over uncertain preferences is #P-complete.
In contrast to our investigation, these studies either ignored the uncertainty of sensing data or did not consider the particularities of the wireless sensor network environment. None of them solves the PS query processing problem effectively in WSNs.

Problem Statement
In this section, some important concepts are defined and several theorems are proved. The variable $q$ is the threshold of the Probabilistic Skyline, and the meanings of frequently used symbols are listed in Table 1. Consider a WSN that consists of a large number of sensor nodes deployed in a geographical region. Feature readings (e.g., temperature and humidity) are collected from these distributed sensor nodes. Multiple sensors are deployed in certain zones in order to improve monitoring quality. Figure 1 shows a wireless sensor network (with a two-tier hierarchical topology) that monitors forest temperature and humidity in different zones (denoted by different colors). In this network, sensor nodes are grouped into clusters, and cluster heads are responsible for local processing and for reporting aggregated results to the base station. As shown, $s_2$ and $s_6$ denote the cluster heads for clusters A and B, respectively.
A table is shown in Figure 1, representing a snapshot of the temperature and humidity records collected from the sensor network. As shown, each tuple records a possible temperature and humidity for a location. The confidence value associated with a tuple indicates the existence probability of that particular temperature and humidity. For example, two data tuples are generated for Location A. The temperature and humidity in these two tuples are both valid (i.e., with measured confidences).
Definition 1 (possible world semantics [23]). We use $D$ to denote a $d$-dimensional space and $U$ to denote the universal set of all uncertain tuples in the $d$-dimensional space $D$. Each tuple $t_i$ has a probability $P(t_i)$ ($0 \le P(t_i) \le 1$) to occur, and $v_{ik}$ ($1 \le k \le d$) denotes its $k$th dimension value. The tuples that cannot exist at the same time are alternatives. A possible world $W$ is instantiated by taking a set of tuples from the alternative relation.
For example, uncertain tuples $t_1$ and $t_2$ in Figure 1 are alternatives. The values of $t_1$ and $t_2$ in the various dimensions describe region A. Because they are alternative tuples, either of them may occur, but they cannot occur simultaneously.
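The aggregate confidence of an alternative tuple set $\tau$ is the sum of the confidence values of all its alternative tuples; that is, $P(\tau) = \sum_{t \in \tau} P(t)$. For instance, corresponding to location A, $\tau_A = \{t_1, t_2\}$.
Condition (2) of Definition 3 requires that, for any possible world $W \in \mathcal{W} - \mathcal{W}_t$, the uncertain tuple $t$ does not belong to the skyline of $W$; that is, $t \notin \mathrm{Skyline}(W)$.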
Then, we conclude that the skyline probability of an uncertain tuple $t$ is the sum of the existence probabilities of all the possible worlds in the subset $\mathcal{W}_t$; that is, $P_{sky}(t) = \sum_{W \in \mathcal{W}_t} \Pr(W)$. For example, $P_{sky}(t_2) = \Pr(W_7) + \Pr(W_{10}) = 0.04 + 0.08 = 0.12$.
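The possible-world definition can be checked by brute force on a small instance. The sketch below is illustrative only: the confidences follow the Figure 1 example (0.3 and 0.4 for location A; 0.1, 0.4, 0.1, 0.2, 0.2 for location B), but the attribute values, the dictionary layout, and the function names are assumptions introduced here, chosen so that three alternatives of B dominate $t_2$.

```python
from itertools import product

# Hypothetical readings: (name, (temperature, humidity), confidence).
# Only the confidences follow the Figure 1 example; the attribute values are
# invented for illustration.
X_TUPLES = {
    "A": [("t1", (20, 30), 0.3), ("t2", (25, 40), 0.4)],
    "B": [("t3", (30, 45), 0.1), ("t4", (28, 50), 0.4), ("t5", (22, 35), 0.1),
          ("t6", (26, 42), 0.2), ("t7", (24, 20), 0.2)],
}


def dominates(a, b):
    """a dominates b: not worse in every dimension, better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def possible_worlds(x_tuples):
    """Yield (world, probability); a world holds at most one alternative per set."""
    options = []
    for alternatives in x_tuples.values():
        choices = [(alt, alt[2]) for alt in alternatives]
        absent = 1.0 - sum(alt[2] for alt in alternatives)
        if absent > 1e-12:  # the set may contribute no tuple at all
            choices.append((None, absent))
        options.append(choices)
    for combo in product(*options):
        world = [alt for alt, _ in combo if alt is not None]
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob


def skyline_probability(name, x_tuples):
    """Sum the probabilities of the worlds whose skyline contains `name`."""
    total = 0.0
    for world, prob in possible_worlds(x_tuples):
        target = next((alt for alt in world if alt[0] == name), None)
        if target is None:
            continue
        if not any(dominates(o[1], target[1]) for o in world if o is not target):
            total += prob
    return total


if __name__ == "__main__":
    print(round(skyline_probability("t2", X_TUPLES), 6))
```

With these assumed values, $t_2$ is a skyline tuple in exactly two worlds, with probabilities 0.04 and 0.08, which agrees with the sum quoted in the text. Enumeration is exponential in the number of alternative sets, so it is only useful as a correctness check on toy inputs.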
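Assume that there exist an uncertain tuple $t$ and an alternative tuple set $\tau_i = \{t_{i1}, t_{i2}, \ldots\}$ in the universal set $U$. We use $T$ to denote the set that is composed of all $\tau_i$ in $U$; that is, $T = \{\tau_1, \tau_2, \ldots, \tau_m\} \subseteq U$. Consequently, the skyline probability of the uncertain tuple $t$ is the product of the existence probability $P(t)$ of $t$ and the nonexistence probability of the tuples that dominate $t$ in each $\tau_i$:
$$P_{sky}(t) = P(t) \times \prod_{\tau_i \in T} \bigl(1 - P(\tau_i \prec t)\bigr), \qquad (2)$$
where $P(\tau_i \prec t) = \sum_{t' \in \tau_i,\ t' \prec t} P(t')$ is the probability that $\tau_i$ dominates $t$.
Definition 4 (Probabilistic Skyline). Given a set $U$ of uncertain tuples in the $d$-dimensional space $D$ and a threshold value $q$, the Probabilistic Skyline of $U$ contains all the uncertain tuples in $U$ whose skyline probability is greater than $q$, denoted as $\mathrm{PS}(U) = \{t \mid P_{sky}(t) > q\}$.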
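The product form of the skyline probability and the threshold test of Definition 4 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the data layout, the function names, the attribute values, and the threshold 0.15 are assumptions made for the example; only the confidences come from the Figure 1 example.

```python
# Hypothetical data: set id -> list of (name, values, confidence).
X_TUPLES = {
    "A": [("t1", (20, 30), 0.3), ("t2", (25, 40), 0.4)],
    "B": [("t3", (30, 45), 0.1), ("t4", (28, 50), 0.4), ("t5", (22, 35), 0.1),
          ("t6", (26, 42), 0.2), ("t7", (24, 20), 0.2)],
}


def dominates(a, b):
    """a dominates b: not worse in every dimension, better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline_probabilities(x_tuples):
    """P_sky(t) = P(t) * product over the other sets of (1 - P(set dominates t))."""
    result = {}
    for owner, alternatives in x_tuples.items():
        for name, values, p in alternatives:
            prob = p
            for other, others in x_tuples.items():
                if other == owner:
                    continue  # alternatives of t never coexist with t
                mass = sum(op for _, ovals, op in others if dominates(ovals, values))
                prob *= 1.0 - mass
            result[name] = prob
    return result


def probabilistic_skyline(x_tuples, q):
    """Names of the tuples whose skyline probability is greater than q."""
    probs = skyline_probabilities(x_tuples)
    return sorted(name for name, p in probs.items() if p > q)


if __name__ == "__main__":
    print(skyline_probabilities(X_TUPLES))
    print(probabilistic_skyline(X_TUPLES, q=0.15))
```

For $t_2$ the dominating alternatives of B carry a total confidence of 0.7 under the assumed values, so the product form gives $0.4 \times (1 - 0.7) = 0.12$, the same value as the possible-world sum in the text.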

Property Analysis
Theorem 5. Probabilistic Skyline query is not a decomposable operator.
Theorem 5 shows that the PS query is not a decomposable operator; thus, in-network computing techniques [11, 15] cannot be used directly to improve the efficiency of PS queries in WSNs.
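A small counterexample makes the non-decomposability concrete. The sketch below uses three invented tuples, each forming its own alternative set, and an assumed threshold of 0.5; it is not the example of Figure 2. It compares the PS computed over all data with the PS computed over the union of the local PS results.

```python
def dominates(a, b):
    """a dominates b: not worse in every dimension, better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline_probability(t, data):
    """Product form; tuples are (name, set id, values, confidence)."""
    _, xid, values, p = t
    mass = {}
    for _, oxid, ovals, op in data:
        if oxid != xid and dominates(ovals, values):
            mass[oxid] = mass.get(oxid, 0.0) + op
    prob = p
    for m in mass.values():
        prob *= 1.0 - m
    return prob


def ps(data, q):
    """Probabilistic Skyline of `data` for threshold q."""
    return [t for t in data if skyline_probability(t, data) > q]


# Hypothetical partition of the data over two clusters.
U1 = [("a", "x1", (10, 10), 0.3), ("b", "x2", (5, 5), 0.9)]
U2 = [("c", "x3", (8, 8), 0.3)]
Q = 0.5

global_result = ps(U1 + U2, Q)                  # PS over all the data
merged_result = ps(ps(U1, Q) + ps(U2, Q), Q)    # PS over the local PS results

if __name__ == "__main__":
    print([t[0] for t in global_result], [t[0] for t in merged_result])
```

Here tuple `b` passes the local test in `U1` (0.9 × 0.7 = 0.63) and survives the merge because its dominators `a` and `c` were discarded locally, yet over all the data its skyline probability is 0.9 × 0.7 × 0.7 = 0.441, below the threshold. Discarding low-probability tuples therefore changes the result, which is why tuples that still influence others have to be forwarded.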
Next, we further analyze the properties of the PS query.
Theorem 6. Given a set $U$ of uncertain tuples in the $d$-dimensional space $D$, a tuple $t \in U_i$, and a threshold value $q$, let $U_i = \{t_1, t_2, \ldots, t_n\}$ be the subset of $U$ that contains the tuples collected on the $i$th cluster, and let $T_i \subseteq T$ denote the set composed of the $\tau_j \subseteq U_i$. Then $t$ does not belong to the skyline of $U$ when it satisfies the following condition: $P(t) \times \prod_{\tau_j \in T_i} (1 - P(\tau_j \prec t)) < q$.
Proof of Theorem 6. This theorem can be proved by Definitions 2 and 3 directly.
Theorem 7. Given a set $U$ of uncertain tuples in the $d$-dimensional space $D$, a tuple $t \in U_i$, and a threshold value $q$, then $t$ should be excluded when it satisfies the following condition: $\prod_{\tau_j \in T_i} (1 - P(\tau_j \prec t)) < q$. Under this condition, any tuple $t_{new}$ dominated by $t$ satisfies $P_{sky}(t_{new}) = P(t_{new}) \times \prod_{\tau_j} (1 - P(\tau_j \prec t_{new})) < q$, so $t_{new}$ will not be judged as a skyline tuple by mistake.
Theorem 6 identifies the tuples in the subset $U_i$ that cannot belong to the skyline of $U$; equivalently, it identifies the tuples that may be skyline tuples of $U$. Theorem 7 shows that some tuples in $U_i$ can be deleted without affecting the calculation of the skyline of $U$. Not all the tuples excluded by Theorem 6 can be deleted: the tuples that do not satisfy the condition of Theorem 7 affect the calculation of the skyline probability of other tuples, so they must be kept.

DPPS Algorithm
In this section, we propose the notions of candidate data, irrelevant data, and relevant data according to Theorems 6 and 7. Next, we take the PS query as a test case to derive candidate data and relevant data while pruning the irrelevant data. Thus, irrelevant data tuples pruned in local sensor nodes will never appear in the final answer set.
Definition 8 (candidate data). In the sensing data subset $U_i \subseteq U$ on a sensor node, the candidate data (CD) of the Probabilistic Skyline query are the tuples that Theorem 6 does not exclude, that is, the tuples that may still belong to the skyline of $U$.
Definition 9 (irrelevant data). In the sensing data subset $U_i \subseteq U$ on a sensor node, the irrelevant data (ID) of the Probabilistic Skyline query are the tuples that satisfy the condition of Theorem 7, that is, the tuples that can be deleted without affecting the calculation of the skyline of $U$.
Definition 10 (relevant data). In the sensing data subset $U_i \subseteq U$ on a sensor node, the relevant data (RD) of the Probabilistic Skyline query are the tuples that Theorem 6 excludes from the skyline of $U$ but that do not satisfy the condition of Theorem 7; they must be kept because they affect the skyline probability of other tuples.
Algorithm 1 sketches the process of data aggregation, data classification, and ID filtering on sensor nodes. First, the algorithm merges all the data tuples sent by child nodes; that is, it merges CD into the candidate data set and RD into the relevant data set (Lines 4-7). Second, the algorithm adds the local data tuple to the candidate data set (Line 8). Then, the skyline probability of each tuple in the candidate data set and the relevant data set is calculated, and the tuples are classified according to the definitions in order to remove ID and to mark RD and CD (Lines 9-33). Finally, the partial relevant data set and candidate data set are submitted to the parent node (Line 34).
For data classification in the candidate data set, our algorithm works as follows: first, it initializes the cumulative probability variable (Line 10); second, the value of $n$ is calculated, where $n$ is the number of $\tau_i$ that dominate the tuple $t$ (Line 11); third, it finds all $\tau_i$ that dominate $t$ (Line 12), after which the dominant probability of each $\tau_i$ is calculated (Lines 13-15). Then, the data tuples are classified based on the definitions above. In this procedure, tuples that are ID are deleted, while tuples that are RD are transferred from the candidate data set to the relevant data set (Lines 16-22). The process of data classification in the relevant data set is similar. First, the cumulative probability variable is initialized (Line 24); second, the value of $n$ is calculated (Line 25); third, the algorithm finds all $\tau_i$ that dominate $t$ (Line 26); next, the dominant probability of each $\tau_i$ is worked out (Lines 27-29); finally, the algorithm deletes $t$ from the relevant data set if it is ID (Lines 30-33).
Continuing the running example of Theorem 5, we assume that the WSN is a two-tier hierarchical topology network. Let tuples $t_1$, $t_2$, and $t_5$ in $U_1$ be collected by sensor node $s_1$, and let $t_3$ and $t_4$ in $U_2$ be collected by sensor node $s_2$. According to Algorithm 1, we first calculate the Local Skyline Probability (denoted as $P_{sky}^{L}$) of the tuples and obtain $P_{sky}^{L}(t_1) = 0.15$, $P_{sky}^{L}(t_2) = 0.3$, $P_{sky}^{L}(t_5) = 0.4$, $P_{sky}^{L}(t_3) = 0.2$, and $P_{sky}^{L}(t_4) = 0.64$. Thus, the data classification on node $s_1$ is that $t_1$ is ID, $t_2$ is RD, and $t_5$ is CD. Similarly, $t_3$ is ID and $t_4$ is CD on node $s_2$. As a result, tuples $t_2$ and $t_5$ on node $s_1$ and $t_4$ on node $s_2$ are transmitted to the base station.
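The classification step on a sensor node can be sketched as follows. The rule used here is one reading of Theorems 6 and 7 and is an assumption, not a transcription of Algorithm 1: a tuple is CD when its local skyline probability is greater than the threshold, ID when the probability that no dominating tuple exists is already below the threshold, and RD otherwise. The tuple names, values, confidences, and the threshold 0.35 are invented for illustration.

```python
from math import prod


def dominates(a, b):
    """a dominates b: not worse in every dimension, better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def classify(local, q):
    """Label each local tuple (name, set id, values, confidence) as CD, RD, or ID.

    Assumed rule (not taken verbatim from the paper):
      CD: local skyline probability > q
      ID: probability that no dominating tuple exists < q
      RD: everything else
    """
    labels = {}
    for name, xid, values, p in local:
        mass = {}
        for _, oxid, ovals, op in local:
            if oxid != xid and dominates(ovals, values):
                mass[oxid] = mass.get(oxid, 0.0) + op
        survive = prod(1.0 - m for m in mass.values())  # empty product is 1
        if p * survive > q:
            labels[name] = "CD"
        elif survive < q:
            labels[name] = "ID"
        else:
            labels[name] = "RD"
    return labels


def to_forward(local, q):
    """Only CD and RD tuples are sent to the parent node; ID is dropped."""
    labels = classify(local, q)
    return [t for t in local if labels[t[0]] != "ID"]


# Hypothetical node data, each tuple in its own alternative set.
NODE = [("a", "x1", (5, 5), 0.5), ("b", "x2", (7, 7), 0.5), ("c", "x3", (9, 9), 0.4)]

if __name__ == "__main__":
    print(classify(NODE, q=0.35))
    print([t[0] for t in to_forward(NODE, q=0.35)])
```

Tuple `a` is dropped because the product (1 − 0.5)(1 − 0.4) = 0.3 is already below the threshold, so nothing that `a` dominates can reach it either; tuple `b` fails locally (0.5 × 0.6 = 0.3) but is kept as RD because tuples on other nodes that it dominates may still need its contribution.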
Query processing on the base station is described in detail in Algorithm 2. To begin with, the algorithm merges all the data tuples sent by child nodes; that is, it merges CD into the candidate data set and RD into the relevant data set (Lines 3-6). Second, the skyline probability of each tuple in the candidate data set is calculated, and ID are removed from the candidate data set (Lines 7-17). Finally, the remaining data tuples in the candidate data set are the final result of PS (Line 18).
To remove ID and RD from the candidate data set, the algorithm first initializes the cumulative probability variable (Line 8); second, the value of $n$ is calculated (Line 9); third, it finds all $\tau_i$ that dominate $t$ (Line 10); then, the dominant probability of each $\tau_i$ is calculated (Lines 11-13); last, any tuple that is not CD is removed from the candidate set (Lines 14-17).
For example, on the base station, the running example above is processed as follows: first, tuples $t_4$ and $t_5$ are merged into the candidate data set, and $t_2$ is merged into the relevant data set. Second, we have $P_{sky}^{L}(t_4) = 0.24$ and $P_{sky}^{L}(t_5) = 0.4$. Third, $t_4$ is deleted from the candidate data set. Finally, $t_5$ is the skyline result, which illustrates the correctness and feasibility of our algorithm.

Experimental Evaluations
In our experiments, $n$ sensor nodes were generated randomly in a region with an area of $n$; thus, the average area per node is 1. The communication radius between two nodes was set to $2\sqrt{2}$, and the maximum packet transmitted between two nodes was limited to 48 bytes. All the experiments were conducted on a computer with an Intel Core i7-3770 CPU at 3.40 GHz and 8.00 GB of RAM. We conducted our evaluation on the standard test data sets for the PS query, in which the probability of each tuple was generated uniformly. The performance of the algorithm is mainly studied on independent data and anticorrelated data.
Three parameters are mainly investigated in our experiments: the number of sensor nodes, the dimensionality of the sensing data, and the threshold value of the PS query. The algorithm adjusted the values of the parameters to minimize the overall data transmission in the network. The overall data transmission is the communication cost of all the sensor nodes in the network; that is, it is calculated as the dimensionality of the sensing data × the number of transmitted tuples × the hop count. The communication costs of DPPS and CA were explored with the number of sensor nodes ranging from 600 to 1000, with a default of 600. The dimensionality of the sensing data ranges from 2 to 6, with a default of 2. The threshold value of the PS query ranges from 0.1 to 0.3, with a default of 0.1.
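The cost measure described above is a simple sum, sketched below. The function name, the per-node `(tuple count, hop count)` representation, and the sample numbers are assumptions introduced for illustration; they are not measurements from the experiments.

```python
def transmission_cost(dimensionality, sends):
    """Overall data transmission: dimensionality x number of tuples x hop count,
    summed over all sensor nodes.

    `sends` is an iterable of (number_of_tuples, hop_count) pairs, one per node.
    """
    return sum(dimensionality * count * hops for count, hops in sends)


# Hypothetical three-node network with two-dimensional readings.
# Without filtering, every node forwards all of its tuples.
centralized = transmission_cost(2, [(3, 1), (2, 2), (1, 3)])
# With filtering, irrelevant tuples are dropped before transmission.
filtered = transmission_cost(2, [(2, 1), (1, 2), (0, 3)])

if __name__ == "__main__":
    print(centralized, filtered)
```

Because the hop count multiplies every transmitted tuple, dropping a tuple on a node far from the base station saves more than dropping one close to it.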
Figure 3 shows how the data communication cost of DPPS and CA changes with the number of sensor nodes under the independent and anticorrelated distributions. A larger number of sensor nodes leads to a higher communication cost, but the cost of DPPS grows more slowly than that of CA. As the amount of sensing data increases with the number of sensor nodes, the communication cost of CA increases quickly. In DPPS, however, the unnecessary sensing data are filtered out, which directly leads to a lower communication cost and a much slower rate of increase. The communication cost under the independent distribution is close to that under the anticorrelated distribution, which suggests that the data distribution has comparatively little impact on the communication cost. In other words, the confidence of the sensing data is the primary factor affecting the communication cost.
Figure 4 shows how the data communication cost of both algorithms changes with the dimensionality of the sensing data under the two data distributions. The higher the dimensionality, the higher the communication cost. The reason is that, as the dimensionality increases, the probability that a tuple is dominated by others decreases, which increases the number of skyline tuples and hence the data communication cost. The communication cost of DPPS is smaller than that of CA, which further verifies the effectiveness of DPPS. In addition, the results again indicate that the confidence of the sensing data plays the primary role in determining the communication cost.
Figure 5 shows how the data communication cost of DPPS and CA changes with the threshold value under the two distributions. A larger threshold value usually leads to a lower communication cost. This is intuitive: the larger the threshold value, the smaller the PS query result set, which in turn results in a lower communication cost. The communication cost of DPPS is always lower than that of CA, which confirms the effectiveness of DPPS. Similarly, the results again indicate that the confidence is the primary factor.
All the results show that DPPS outperforms CA as the number of sensor nodes, the dimensionality of the sensing data, and the PS threshold value change. DPPS is therefore well suited to sensor networks, since it improves efficiency and reduces the communication cost significantly.

Conclusion
In this paper, we examined the requirements of PS query algorithms in WSNs and summarized the existing problems in WSNs. According to the characteristics of applications in WSNs, we first studied the basic properties of the PS query and theoretically proved that the PS query is not decomposable. Then, an efficient algorithm, Distributed Processing of Probabilistic Skyline (DPPS) query in WSNs, was put forward. DPPS classifies the sensing data on sensor nodes and discards the irrelevant data, which do not affect the result of the PS query. DPPS thereby reduces the data transmission cost in WSNs significantly. Finally, the algorithm was verified by simulation experiments, and the results showed that DPPS saves considerably more communication cost in the network than CA.

Figure 3: The communication cost influenced by nodes' number.

Figure 4: The communication cost influenced by data dimensions.

Figure 5: The communication cost influenced by threshold value.

Table 1: The meanings of frequently used symbols.
For location A, $\tau_A = \{t_1, t_2\}$; that is, $t_1$ and $t_2$ are alternative tuple instances (or simply alternatives) of $\tau_A$. Consider $P(\tau_A) = 0.3 + 0.4 = 0.7$. In the same way, we get $\tau_B = \{t_3, t_4, t_5, t_6, t_7\}$ and $P(\tau_B) = 0.1 + 0.4 + 0.1 + 0.2 + 0.2 = 1$. The probability of all possible worlds in $\mathcal{W}$ is shown in Table 2.
Definition 2 (skyline). Given a set $U$ of uncertain tuples in the $d$-dimensional space $D$, a skyline query retrieves the tuples in $U$ that are not dominated by any other tuple. For two tuples $t_i$ and $t_j$ in $U$, tuple $t_i$ dominates $t_j$ (denoted as $t_i \prec t_j$) if it is not worse than $t_j$ in all dimensions ($\forall k \in [1, d]$, $v_{ik} \ge v_{jk}$) and better than $t_j$ in at least one ($\exists k \in [1, d]$, $v_{ik} > v_{jk}$). The probability that $t_i$ dominates $t_j$ is the existence probability of $t_i$, denoted as $P(t_i \prec t_j) = P(t_i)$.
Definition 3 (skyline probability). Given a set $U$ of uncertain tuples in the $d$-dimensional space $D$, the set of possible worlds based on $U$ is denoted as $\mathcal{W} = \{W_1, W_2, \ldots, W_n\}$. We assume that there exist an uncertain tuple $t$ and a possible world subset $\mathcal{W}_t = \{W_1, \ldots, W_k\} \subseteq \mathcal{W}$ such that $t$ and $\mathcal{W}_t$ satisfy (1) for any possible world $W \in \mathcal{W}_t$, the uncertain tuple $t$ belongs to the skyline of $W$, that is, $t \in \mathrm{Skyline}(W)$; and (2) for any possible world $W \in \mathcal{W} - \mathcal{W}_t$, $t \notin \mathrm{Skyline}(W)$.
Figure 1: An example of wireless sensor network.
Figure 2: Example of PS query is not decomposable; panel (d): $(\mathrm{PS}(U_1) \cup \mathrm{PS}(U_2))$.
For an alternative tuple set $\tau_i = \{t_{i1}, t_{i2}, \ldots, t_{ik}\}$, if there exists $t_{ij} \in \tau_i$ that dominates $t$, we say that $\tau_i$ dominates $t$ ($\tau_i \prec t$). Then, the probability that $\tau_i$ dominates $t$ can be calculated as $P(\tau_i \prec t) = \sum_{t_{ij} \in \tau_i,\ t_{ij} \prec t} P(t_{ij})$.
Proof of Theorem 7. Thus, $t \notin \mathrm{PS}(U_i)$ and $t \notin \mathrm{PS}(U)$. Only the skyline probability of the tuples dominated by $t$ will be affected if we delete $t$. Suppose $t_{new}$, dominated by $t$, is a tuple in another sensor node which will possibly be interleaved with tuples in $U_i$ at the base station, and let $P_{sky}(t_{new})$ indicate the skyline probability of $t_{new}$. There are two possible cases to consider.
Case 1. $t_{new}$ itself forms a new $\tau_{new}$. Because the tuples that dominate $t$ must dominate $t_{new}$ as well, it can be deduced that $P_{sky}(t_{new}) = P(t_{new}) \times \prod_{\tau_j} (1 - P(\tau_j \prec t_{new})) < q$, and $t_{new}$ will not be judged as the skyline tuple by mistake.
Case 2. $t_{new}$ is a member of an existing $\tau$ in $T_i$, named $\tau_{new}$. Due to the mutual exclusiveness of the tuple members in $\tau_{new}$, $t_{new}$ may appear in a possible world if and only if no other members of $\tau_{new}$ coexist in this possible world. By formula (2), it can be proved that $P_{sky}(t_{new}) = P(t_{new}) \times \prod_{\tau_j} (1 - P(\tau_j \prec t_{new})) < q$.
Algorithm 1: Processing on a sensor node.
// input: the message set $M_c$ of the child nodes, the local sensing data $t$,
// the threshold value $q$
// output: the data set $M$ which will be submitted to the parent node
For each element $t$ in CD Do
    $n$ = the number of $\tau_i$ that dominate $t$;
    $T$ = all $\tau_i$ that dominate $t$;
    For each $\tau_i$ in $T$ Do
        calculate $P(\tau_i \prec t)$;
Algorithm 2: Processing on the base station.
// input: the message set $M_c$ of the child nodes, the threshold value $q$
// output: the data set $M$ which will be submitted to the parent node
For each element $m$ in $M_c$ Do
    CD = CD + $m$.CD;
    RD = RD + $m$.RD;
end For
For each element $t$ in CD Do
    $P$ = 1;
    $n$ = the number of $\tau_i$ that dominate $t$;
    $T$ = all $\tau_i$ that dominate $t$;
    For each $\tau_i$ in $T$ Do