Reverse skyline queries have been used in many real-world applications such as business planning, market analysis, and environmental monitoring. In this paper, we investigated how to efficiently evaluate continuous reverse skyline queries over sliding windows. We first theoretically analyzed the inherent properties of reverse skyline on data streams and proposed a novel pruning technique to reduce the number of data points preserved for processing continuous reverse skyline queries. Then, an efficient approach, called Semidominance Based Reverse Skyline (SDRS), was proposed to process continuous reverse skyline queries. Moreover, an extension was also proposed to handle n-of-N and (n1,n2)-of-N reverse skyline queries. Our extensive experimental studies have demonstrated the efficiency as well as effectiveness of the proposed approach with various experimental settings.
1. Introduction
The skyline operator [1] and its variations [2–15] have been widely applied in many applications involving multicriteria decision-making. Specifically, given a set P of points, the skyline of P comprises all the points which are not dominated by any other point in P. Generally, a small value is assumed to be better in all dimensions. Here, we say a point x dominates another one y, if x is not worse than y for all dimensions (∀i,x[i]≤y[i]), and x is better than y for at least one dimension (∃j,x[j]<y[j]). Figure 1 illustrates an example of skyline in a 2-D space, where points a and g on the bottom-left line are the skyline points.
Figure 1: Example of skyline and its variations.
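The dominance test and the skyline definition above can be made concrete with a minimal Python sketch. The names `dominates` and `skyline` and the sample coordinates are illustrative assumptions, not part of the original work, and the quadratic scan only mirrors the definition rather than aiming at efficiency.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(x: Sequence[float], y: Sequence[float]) -> bool:
    """x dominates y: not worse in any dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def skyline(points: List[Point]) -> List[Point]:
    """Naive skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(o, p) for o in points)]


# Hypothetical 2-D data (smaller is better in both dimensions).
pts = [(1, 5), (2, 6), (3, 2), (4, 4), (5, 3)]
print(skyline(pts))  # [(1, 5), (3, 2)]
```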
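As a variation of skyline, given a query point, the dynamic skyline [4–7] contains all the points that are not dominated when each point is represented by its per-dimension distances to the query point; that is, no other point is at least as close to the query point in every dimension and strictly closer in at least one. As illustrated in Figure 1, the query point is b=(b[1],b[2]), and the dynamic skyline with respect to b is {h,q,c}. Each of the other points is dominated in this sense by h (or by q or c).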
Compared with dynamic skyline, reverse skyline is proposed from the opposite perspective. If q is one of the dynamic skyline points of x, then x is called a reverse skyline point of q [7]. As shown in Figure 1, b is a reverse skyline point of q, since q is contained in the dynamic skyline of b. The reverse skyline query is very useful for many applications. For example, in an online used car trading system, each used car is evaluated in various aspects such as brand, price, and mileage. If a dealer wants to put a new used car up for sale, it is desirable for the car to attract as many customers as possible. Intuitively, if a customer is interested in an existing car in the trading system, s/he might also be interested in the dynamic skyline cars of this car. Therefore, we can conduct a reverse skyline query with respect to the new car to find all the cars in the system whose dynamic skyline contains it. The larger the reverse skyline, the better.
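The relationship between dynamic skyline and reverse skyline can be illustrated with a brute-force sketch. The function names and the toy car coordinates below are assumptions made for illustration; the code follows the definitions directly and is not the method proposed in this paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dyn_dominates(a: Sequence[float], b: Sequence[float], center: Sequence[float]) -> bool:
    """a is at least as close to `center` as b in every dimension, strictly closer in one."""
    da = [abs(u - c) for u, c in zip(a, center)]
    db = [abs(v - c) for v, c in zip(b, center)]
    return all(u <= v for u, v in zip(da, db)) and any(u < v for u, v in zip(da, db))


def dynamic_skyline(points: List[Point], q: Point) -> List[Point]:
    """Points not dynamically dominated with respect to the query point q."""
    return [p for p in points if not any(dyn_dominates(o, p, q) for o in points)]


def reverse_skyline(points: List[Point], q: Point) -> List[Point]:
    """Points x for which q belongs to the dynamic skyline of x."""
    return [x for i, x in enumerate(points)
            if not any(dyn_dominates(p, q, x) for j, p in enumerate(points) if j != i)]


# Hypothetical cars described by two normalized attributes.
cars = [(1, 1), (4, 4), (6, 6)]
new_car = (5, 5)
print(reverse_skyline(cars, new_car))  # [(4, 4), (6, 6)]
```

Here the car (1, 1) is not in the reverse skyline because, seen from (1, 1), the car (4, 4) is closer than the new car in both attributes.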
In such applications, the dealer may want to continuously monitor the trading system to select the customers to whom the new used car should be recommended. As used car prices fluctuate constantly, information that is too old may no longer be relevant to the current recommendation. Therefore, we focus only on the most recent (e.g., within a week) used car information, that is, on the reverse skyline query over sliding windows.
Although reverse skyline query processing has been well studied in recent years [7, 10–13], there is only one previous work, Divide-and-Conquer Reverse Skyline (DCRS) [16], that has studied the reverse skyline over sliding windows. Once the sliding window moves, a new point is inserted into the window and the oldest point is expired from the window. Reverse skyline query over the sliding window will return the reverse skyline result in the current sliding window once the sliding window moves. The DCRS algorithm needs to preserve all points in the sliding window, which will consume a lot of memory space, and the maintenance of such a large number of points usually consumes a large amount of CPU time. In this paper, we propose an efficient algorithm, called Semidominance Based Reverse Skyline (SDRS), to continuously answer the reverse skyline on data streams. Our contributions can be summarized as follows:
We propose an efficient algorithm SDRS to process the reverse skyline queries over the sliding window. By using the semidominance relationships and first-in-first-out property of the sliding window, SDRS only maintains a small number of points in the sliding window. Also, by maintaining a reasonable structure of each reserved point, SDRS can quickly calculate the new reverse skyline once the sliding window moves.
By building and maintaining a 2D R-tree, SDRS is easily extended to deal with n-of-N and (n1,n2)-of-N reverse skyline queries over the sliding window.
Last but not least, extensive experiments show that our proposed SDRS approach can efficiently support continuous reverse skyline queries, including n-of-N reverse skyline queries and (n1,n2)-of-N reverse skyline queries.
2. Related Work
Dynamic skyline was first introduced by Papadias et al. [4, 5]. Given a query point, dynamic skyline can return all the points which are near to the query point in all dimensions. Sharifzadeh and Shahabi [17] proposed the concept of spatial skyline queries, and the derived spatial attributes of each point were defined by Euclidean distances from the point to some query points. Deng et al. [6] presented the multisource skyline query in the application of road networks, in which the dynamic attributes of each mapped point were defined as the relative network distances to multiple query points. Chen and Lian [18, 19] proposed generic metric skyline query, and the dynamic attributes of each point were defined in arbitrary metric space. Dellis and Seeger [7] considered the case that all the dynamic attributes were the absolute coordinate differences to the query point. For the sake of simplicity, in this paper, we adopt the definition of dynamic attributes the same as Dellis and Seeger [7].
Based on the concept of dynamic skyline, Dellis and Seeger [7] proposed the reverse skyline and presented an effective pruning method to reduce the search space of the reverse skyline computation. Lian and Chen [10, 11] formalized both monochromatic and bichromatic probabilistic reverse skyline queries on uncertain data and proposed effective pruning methods to reduce the search space of query processing. Wu et al. [13] investigated the bichromatic reverse skyline on precise data and proposed several nontrivial heuristics that optimize the access order of the R-tree to reduce the I/O cost considerably. Wang et al. [20] and Min [21] investigated reverse skyline queries in wireless sensor networks. Lim et al. [22–24] proposed a reverse skyline query processing method that efficiently processes queries over moving objects as well as continuous reverse skyline queries. Their algorithm builds a verification range to guarantee the result of the reverse skyline query whenever new objects appear or moving objects move. The algorithm proposed in [25] processes the reverse skyline by combining two pruning methods, search-area pruning and candidate-objects pruning; however, it must refine the candidate list afterwards, because false positives can remain after the pruning phase. None of the reverse skyline processing techniques above can be used directly in rapidly updated data stream applications.
There are some works that have been proposed to address the skyline query processing on data streams. Lin et al. [26] explored the problem of computing skyline against various different sliding windows. Tao and Papadias [27] studied the skyline computation in stream environments and developed efficient techniques to improve space/time efficiency. Morse et al. [28] introduced the continuous time-interval skyline operator to continuously compute the current skyline on a data stream and presented a LookOut algorithm for evaluating such queries efficiently. Sarkas et al. [29] studied the streaming categorical skylines and proposed some novel techniques for maintaining the skyline of categorical data in a streaming environment. Zhang et al. [30] investigated the problem of minimizing the communication overhead in client-server architectures, where the server continuously maintains the skyline of dynamic objects. Zhang et al. [31, 32] studied the problem of continuous, probabilistic skyline query and proposed some novel, efficient techniques to improve the efficiency. However, the existing works usually focus on skyline query processing on data streams, whereas ours concentrates on a more complex query (i.e., reverse skyline) processing on data streams.
Bai et al. [33] proposed a probabilistic reverse skyline algorithm over the sliding window by using some probability pruning methods. Liu et al. [34] proposed an algorithm to process reverse k-skyband. The reverse skyline computation on data streams was first studied by Zhu et al. [16]. They proposed an efficient algorithm, DCRS, to process continuous reverse skyline queries by employing an improved DC-Tree index [35]. Firstly, the dataset in the sliding window is divided into 2^d pieces according to query point q, and a modified DC-Tree is constructed for each piece. Secondly, the modified DC-Tree recursively divides its dataset into 2^d equal pieces until the termination condition is met. The intermediate nodes maintain the local first and second skylines by merging the corresponding results in their children, while the leaf nodes store the data points directly. Finally, the DCRS algorithm runs a window query on each global skyline point to get the current reverse skyline when a newly arriving point belongs to the global first or second skyline. DCRS therefore has to preserve all the data in the sliding window, which costs not only a lot of memory space but also a lot of CPU time. In this paper, we focus on reducing the number of data points to be preserved and thereby improve the space/time efficiency of computing the continuous reverse skyline. Moreover, an extension is also proposed to handle n-of-N and (n1,n2)-of-N reverse skyline queries.
3. Problem Statement
We first recall two important concepts called full-dominance and semidominance [20], respectively.
Definition 1 (full-dominance).
A point x full-dominates y with regard to query point q (denoted x⪯-qy) if it holds that (1) ∀i∈D: (x[i]-q[i])(y[i]-q[i])≥0 ∧ |x[i]-q[i]|≤|y[i]-q[i]| and (2) ∃j∈D: (x[j]-q[j])(y[j]-q[j])>0 ∧ |x[j]-q[j]|<|y[j]-q[j]|.
Definition 2 (semidominance).
A point x semidominates y with regard to query point q (denoted x⪯˙qy) if it holds that (1) ∀i∈D: (x[i]-q[i])(y[i]-q[i])≥0 ∧ |x[i]-q[i]|≤2|y[i]-q[i]| and (2) ∃j∈D: (x[j]-q[j])(y[j]-q[j])>0 ∧ |x[j]-q[j]|<2|y[j]-q[j]|.
Figure 2 illustrates an example of full-dominance and semidominance on dataset {a,b,c,d,e,f,g,h} in a 2-D space. We can see that point b full-dominates a with regard to q, since a and b are on the same side of q and b is closer to q than a in both dimensions. Although b is also closer to q than e in both dimensions, point b does not full-dominate e with regard to q, because they are not on the same side of q. Similarly, none of the points b to h is full-dominated by any other point.
Figure 2: Full-dominance and semidominance.
In order to explain semidominance more clearly, for each point x∈{a,b,c,d,e,f,g,h}, we use a square marker xm to denote the midpoint between the original point x and query point q. We can see that point f semidominates e with regard to q, since the midpoint fm between f and q full-dominates e. Similarly, point b semidominates a, point c semidominates d, point d semidominates c, and point h semidominates g, while the solid points b, f, and h are not semidominated by any others.
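The two relations and the midpoint interpretation can be checked with a short Python sketch. It assumes that the distance conditions compare absolute coordinate differences to q; the helper `_k_dominates`, the function names, and the coordinates are illustrative and do not correspond to the points of Figure 2.

```python
from typing import Sequence, Tuple


def _k_dominates(x: Sequence[float], y: Sequence[float], q: Sequence[float], k: int) -> bool:
    """k = 1 gives full-dominance, k = 2 gives semidominance (with regard to q)."""
    same_side = all((a - c) * (b - c) >= 0 for a, b, c in zip(x, y, q))
    within = all(abs(a - c) <= k * abs(b - c) for a, b, c in zip(x, y, q))
    strict = any((a - c) * (b - c) > 0 and abs(a - c) < k * abs(b - c)
                 for a, b, c in zip(x, y, q))
    return same_side and within and strict


def full_dominates(x, y, q) -> bool:
    return _k_dominates(x, y, q, 1)


def semi_dominates(x, y, q) -> bool:
    return _k_dominates(x, y, q, 2)


def midpoint(x: Sequence[float], q: Sequence[float]) -> Tuple[float, ...]:
    return tuple((a + c) / 2 for a, c in zip(x, q))


q = (0, 0)
e, f = (5, -1), (2, -2)  # hypothetical points on the same side of q
print(full_dominates(f, e, q))               # False: f is farther from q than e in dimension 2
print(semi_dominates(f, e, q))               # True
print(full_dominates(midpoint(f, q), e, q))  # True: the midpoint of f and q full-dominates e
```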
It has been theoretically proved in [20] that we can make use of semidominance to conduct reverse skyline queries, as shown in Theorem 3.
Theorem 3 (see [20]).
Given a query point q and a dataset P, any point p∈P is a reverse skyline point of q, if there does not exist any other point p′∈P such that p′⪯˙qp.
According to Theorem 3, all the points that are not semidominated by others constitute the reverse skyline. We use P to represent an append-only data stream in a d-dimensional space D (i.e., d=|D|), and each element (point) x∈P has a label κ(x) that indicates its position in P. Let PN denote the most recent N elements in P. Then, the reverse skyline on a data stream can be formally defined as in Definition 4.
Definition 4 (reverse skyline on data stream).
Given a query point q and a data stream P, a reverse skyline on data stream according to query point q continuously retrieves all the points in the most recent N points PN that are not semidominated by any other points in PN.
4. Semidominance Based Reverse Skyline
In this section, we present the details of SDRS approach. Specifically, some important properties and query processing techniques are discussed in Sections 4.1 and 4.2, respectively.
4.1. Preliminaries
Suppose there are two points p and p′ in data stream P such that p′ semidominates p. According to Definition 4, point p is not a reverse skyline point if p′ belongs to PN. As a result, point p will never be a reverse skyline point if point p′ is younger than p. Therefore, we can get Lemma 5 as follows.
Lemma 5.
A data point p will never be a reverse skyline point if it is semidominated by a younger point p′.
Proof.
Since we have p′⪯˙qp, point p is not a reverse skyline point when p′∈PN. Moreover, since we also have κ(p′)>κ(p), p will expire earlier than p′. Therefore, point p will never be a reverse skyline point.
Consider the example in Figure 2, where points arrive in alphabetical order. Suppose point h has just arrived and the length of the sliding window is 8; then the current PN={a,b,c,d,e,f,g,h}. Since it is semidominated by a younger point (i.e., point b), point a will never be a reverse skyline point according to Lemma 5. Similarly, points c, e, and g are each semidominated by some younger point, so none of them can be a reverse skyline point. Obviously, discarding any of a, c, e, and g cannot cause a false negative, but discarding all of them may lead to false positives.
As illustrated in Figure 3, suppose points e and f are in PN, f semidominates e, and κ(e)<κ(f). When a new point i arrives, it would be judged a reverse skyline point if we preserved only point f, because f does not semidominate i. However, this is wrong: point i is not a reverse skyline point, because point e semidominates i, and i cannot become a reverse skyline point before e expires. The problem arises because the preserved points do not semidominate every point that is semidominated by the discarded points. A new point is therefore misclassified as a reverse skyline point whenever it is semidominated by a discarded point but by no preserved point.
Figure 3: An example of false positive.
Now, we will present a very important characteristic of semidominance which helps us to solve the above problem.
Lemma 6 (see [20]).
If x⪯-qy and y⪯˙qz, then x⪯˙qz.
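The lemma is cited from [20] without proof. A short argument can be sketched as follows, assuming that full-dominance and semidominance compare absolute coordinate differences to q (with factor 1 and factor 2, respectively); the shorthand a_i, b_i, c_i is introduced here only for this sketch.

```latex
% Shorthand (assumed): a_i = x[i]-q[i],  b_i = y[i]-q[i],  c_i = z[i]-q[i].
\begin{align*}
&\textbf{Given: } a_i b_i \ge 0,\ |a_i| \le |b_i| \quad\text{and}\quad
  b_i c_i \ge 0,\ |b_i| \le 2|c_i| \qquad (\forall i \in D).\\[2pt]
&\textbf{Distance: } |a_i| \le |b_i| \le 2|c_i| \qquad (\forall i \in D).\\[2pt]
&\textbf{Side: } \text{if } b_i = 0 \text{ then } a_i = 0, \text{ so } a_i c_i = 0;\\
&\qquad \text{if } b_i \ne 0 \text{ then } c_i \ne 0 \text{ has the sign of } b_i
  \text{ and } a_i \text{ is } 0 \text{ or has the sign of } b_i, \text{ so } a_i c_i \ge 0.\\[2pt]
&\textbf{Strictness: } \text{choose } k \text{ with } a_k b_k > 0 \text{ and } |a_k| < |b_k|
  \text{ (it exists because } x \text{ full-dominates } y).\\
&\qquad \text{Then } b_k \ne 0 \Rightarrow c_k \ne 0 \text{ with the sign of } b_k,
  \text{ hence } a_k c_k > 0 \text{ and } |a_k| < |b_k| \le 2|c_k|.
\end{align*}
```

Together, the three parts give exactly the two conditions required for x to semidominate z.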
Lemma 6 shows that if a point x full-dominates another point y, all the points that are semidominated by y are also semidominated by x. Therefore, the false positive problem can be avoided if all the points that are not full-dominated by a younger point are preserved. Let SN denote the set of points which are not full-dominated by a younger point; that is,
SN = {p ∣ p∈PN, ∄p′∈PN: κ(p′)>κ(p) ∧ p′⪯-qp}. (1)
Then, the data points in SN can be divided into three categories:
Reverse Skyline Point (RP). The data point which is not semidominated by any other point in PN.
Candidate Point (CP). The data point which is semidominated by an older point rather than a younger point in PN.
Assistant Point (AP). The data point which is semidominated by a younger point in PN but not full-dominated by any younger point in PN.
We use RN, CN, and AN to denote the sets of reverse skyline points, candidate points, and assistant points, respectively. Continuing the example in Figure 2, since only point a is full-dominated by a point younger than it (i.e., point b), SN={b,c,d,e,f,g,h}. Points b, f, and h are not semidominated by any other point; therefore they are reverse skyline points; namely, RN={b,f,h}. Point d is a candidate point because it is semidominated by the older point c; namely, CN={d}. The remaining points c, e, and g are assistant points, as each of them is semidominated by a younger point; namely, AN={c,e,g}.
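The three categories can be reproduced by a direct, quadratic classification of a window. The sketch below assumes the absolute-difference reading of full-dominance and semidominance; the function `classify`, the helper `_k_dominates`, and the six coordinates are hypothetical (they are not the points of Figure 2, whose coordinates are not given).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _k_dominates(x: Sequence[float], y: Sequence[float], q: Sequence[float], k: int) -> bool:
    """k = 1: full-dominance, k = 2: semidominance (with regard to q)."""
    same_side = all((a - c) * (b - c) >= 0 for a, b, c in zip(x, y, q))
    within = all(abs(a - c) <= k * abs(b - c) for a, b, c in zip(x, y, q))
    strict = any((a - c) * (b - c) > 0 and abs(a - c) < k * abs(b - c)
                 for a, b, c in zip(x, y, q))
    return same_side and within and strict


def classify(window: List[Point], q: Point) -> List[str]:
    """Label each point of the window (given in arrival order) as RP, CP, AP, or discarded."""
    labels = []
    for i, p in enumerate(window):
        older, younger = window[:i], window[i + 1:]
        if any(_k_dominates(y, p, q, 1) for y in younger):
            labels.append("discarded")   # full-dominated by a younger point: not in S_N
        elif any(_k_dominates(y, p, q, 2) for y in younger):
            labels.append("AP")          # semidominated by a younger point
        elif any(_k_dominates(o, p, q, 2) for o in older):
            labels.append("CP")          # semidominated only by older points
        else:
            labels.append("RP")          # not semidominated at all
    return labels


q = (0, 0)
stream = [(7, 7), (3, 3), (-2, 2), (-3, 3), (5, -1), (2, -2)]  # hypothetical arrival order
print(classify(stream, q))  # ['discarded', 'RP', 'AP', 'CP', 'AP', 'RP']
```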
A candidate point p may be semidominated by several points, and it remains a candidate point as long as any of them belongs to PN. The youngest of these points determines when p can become a reverse skyline point; we call it the predecessor of p (denoted as pre(p)). We also keep an attribute ρ(p) for p that records the position of pre(p); namely, ρ(p)=κ(pre(p)).
Next, we will discuss the correctness of only keeping the data points in SN.
Lemma 7.
A data point p will never be a reverse skyline point if it is full-dominated by a younger point p′.
Proof.
It can be immediately deduced from Lemmas 5 and 6.
Theorem 8.
A newly arriving point p must be a reverse skyline point or a candidate point. Moreover, if p is a candidate point, then its predecessor pre(p) must belong to SN.
Proof.
Since p is a newly arriving point and the points which are younger than p do not exist in PN, p is either a reverse skyline point or a candidate point.
We prove the second part by contradiction. Suppose point p is a candidate point, while its predecessor pre(p) does not belong to SN. According to the definition of SN, there must be a point p′ in PN which is younger than pre(p) and satisfies p′⪯-qpre(p). According to Lemma 6, we can infer p′⪯-qpre(p)∧pre(p)⪯˙qp⇒p′⪯˙qp. We then have κ(p)>κ(p′)>κ(pre(p)) and p′⪯˙qp, which contradicts the premise that pre(p) is the predecessor of point p. Therefore, the proof is completed.
Lemma 7 shows that a discarded point will never be a reverse skyline point. Thus, false negative will not happen if we preserve all the points in SN. Theorem 8 shows that whether the new arriving point is a reverse skyline point or a candidate point can be decided by the points in SN. And if this point is a candidate point, we can get its predecessor correctly according to SN and also know when it may become a reverse skyline point. Therefore, only preserving the points in SN cannot result in any false positive.
Based on the analysis above, we can get Theorem 9 directly.
Theorem 9.
SN is the minimum information that needs to be preserved to process a continuous reverse skyline query over sliding windows.
Proof.
It can be immediately deduced from Lemma 7 and Theorem 8.
4.2. SDRS Algorithm
In this section, we will introduce the data structures and details of our SDRS approach successively.
Data Structure. The whole data space is divided into 2^d regions by the query point. According to Definitions 1 and 2, full-dominance and semidominance relationships can exist between two data points only when they lie in the same region. When a new data point arrives, we therefore only need to determine its full-dominance and semidominance relationships with the points in its own region (a new point that has the same value as the query point in some dimensions belongs to more than one region). To speed up this step, we partition each of RN, CN, and AN into 2^d regions with respect to the query point and build an in-memory R-tree for the points in each region, as shown in Figure 4.
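The assignment of a point to one or more of the 2^d regions can be sketched as follows. The bitmask encoding of a region (bit i set when the point lies above q in dimension i) and the function name `regions` are assumptions chosen for illustration; the paper does not prescribe a particular encoding.

```python
from itertools import product
from typing import List, Sequence


def regions(p: Sequence[float], q: Sequence[float]) -> List[int]:
    """Return the ids of all regions (orthants around q) that contain p.

    A point that coincides with q in some dimension lies on a boundary and
    therefore belongs to both neighbouring regions in that dimension.
    """
    choices = []
    for a, c in zip(p, q):
        if a > c:
            choices.append((1,))
        elif a < c:
            choices.append((0,))
        else:
            choices.append((0, 1))  # on the boundary: both sides
    return [sum(bit << i for i, bit in enumerate(bits)) for bits in product(*choices)]


q = (0, 0)
print(regions((3, 3), q))   # [3]
print(regions((-1, 2), q))  # [2]
print(regions((0, 3), q))   # [2, 3]
```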
Figure 4: Data structure.
Dominance Relationships. Since our techniques are based on R-trees, we need the relationships between a point p and an R-tree entry e. We use e.min and e.max to denote the lower-left corner and the upper-right corner of entry e. The main relationships are as follows.
If point p full-dominates e.min, all the points in e are full-dominated by p. If point p does not full-dominate e.max, no point in e is full-dominated by p. Otherwise (p full-dominates e.max but not e.min), some of the points in e may be full-dominated by p. As shown in Figure 5(a), the dark shaded region is the full-dominated region of e: when point p lies in this region, all the data points in e are full-dominated by p. The light shaded region is the candidate full-dominated region of e: when p lies in it, we need to access each child of e to determine which points are full-dominated by p.
Figure 5: Dominance relationships.
If point p semidominates e.min, all the points in e are semidominated by p. If point p does not semidominate e.max, no point in e is semidominated by p. Otherwise (p semidominates e.max but not e.min), some of the points in e may be semidominated by p. As shown in Figure 5(b), the dark shaded region is the semidominated region of e: when point p lies in this region, all the data points in e are semidominated by p. The light shaded region is the candidate semidominated region of e: when p lies in it, we need to access each child of e to determine which points are semidominated by p.
If e.max semidominates p, all the points in e semidominate p. If e.min does not semidominate p, no point in e semidominates p. Otherwise (e.min semidominates p but e.max does not), some of the points in e may semidominate p. In Figure 5(c), the dark shaded region is the semidominating region of entry e: when point p lies in this region, each point in e semidominates p. The light shaded region is the candidate semidominating region of e: when p lies in it, we need to access each child of e to determine which points semidominate p.
For each entry e in the R-tree, a label κ(e) is stored, which is the largest label among the children of e. For a leaf entry e′ that corresponds to point p, the label is κ(p).
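The point-versus-entry tests can be sketched in a simplified setting. The sketch assumes that all points of one region have been mapped to their absolute offsets from q, so that e.min is the corner of the entry nearest to q; the names `offset_dominates` and `classify_entry` and the three-valued result are illustrative choices, not the paper's interface.

```python
from typing import Sequence


def offset_dominates(a: Sequence[float], b: Sequence[float], k: int) -> bool:
    """In offset space, a k-dominates b (k = 1: full-dominance, k = 2: semidominance)."""
    return (all(u <= k * v for u, v in zip(a, b))
            and any(0 < u < k * v for u, v in zip(a, b)))


def classify_entry(p_off: Sequence[float], e_min: Sequence[float],
                   e_max: Sequence[float], k: int) -> str:
    """Decide how many points of entry [e_min, e_max] are k-dominated by p.

    Returns "all", "none", or "maybe" (children must be inspected).
    """
    if offset_dominates(p_off, e_min, k):
        return "all"
    if not offset_dominates(p_off, e_max, k):
        return "none"
    return "maybe"


e_min, e_max = (2, 2), (4, 4)  # hypothetical entry in offset space
print(classify_entry((1, 1), e_min, e_max, 1))  # all
print(classify_entry((3, 3), e_min, e_max, 1))  # maybe
print(classify_entry((5, 5), e_min, e_max, 1))  # none
print(classify_entry((3, 3), e_min, e_max, 2))  # all
```

The same function covers Figure 5(a) with k = 1 and Figure 5(b) with k = 2, which is why the factor is passed as a parameter.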
Query Processing. Algorithm 1 describes the framework of SDRS. When a new point p arrives with label κ(p)>N, the oldest point o in SN expires if κ(o)=κ(p)-N (Lines 2–4). We then find the set Dc of points whose predecessor has expired and move Dc from CN to RN (Lines 5–7). Next, we find the predecessor lp of p, find the set Df of points in SN that are full-dominated by p, and delete Df from SN (Lines 8–10). We then find the set Ds of points in SN that are semidominated by p and move Ds from RN and CN to AN (Lines 11–14). Finally, p is inserted into CN or RN depending on whether lp exists (Lines 15–19).
Algorithm 1: Semidominance Based Reverse Skyline.
while a new data point p arrives do
    if κ(p)>N then
        if the oldest point o satisfies κ(o)=κ(p)-N then
            SN=SN-{o};
        find Dc⊆CN whose predecessor is expired;
        CN=CN-Dc;
        RN=RN+Dc;
    find the predecessor lp of point p;
    find Df⊆SN full-dominated by p;
    SN=SN-Df;
    find Ds⊆SN semi-dominated by p;
    RN=RN-Ds;
    CN=CN-Ds;
    AN=AN+Ds;
    if lp exists in SN then
        ρ(p)=κ(lp);
        CN=CN+{p};
    else
        RN=RN+{p};
return;
Algorithm 2 describes how to find the predecessor of new point p. First, find the R-tree set SR that p belongs to (Line 2). Then deal with all the entries of R-trees in SR (Lines 3-4). If entry e is an intermediate entry, according to the dominance relationship from e to p, different methods are applied (Lines 7–13). If entry e is a leaf entry, the predecessor of p is found (Lines 14–17).
Algorithm 2: Find predecessor of new point.
initialize κ=0;
find the R-tree set SR that point p belongs to;
foreach R in SR do
    insert all entries in the root of R into a heap H by descending order of κ(e);
    while H is not empty do
        remove the top entry e from H;
        if e is an intermediate entry then
            if e.max⪯˙qp then
                κ=κ(e);
                break;
            foreach child ei of e do
                if ei.min⪯˙qp then
                    insert ei into H by descending order of κ(ei);
        else if e⪯˙qp then
            κ=κ(e);
            break;
return κ;
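The best-first search of Algorithm 2 can be sketched on a hand-built tree. The sketch makes several simplifying assumptions that are not in the paper: a single region is considered, points are stored as absolute offsets from q, all offsets are strictly positive (so that the corner tests are exact), and the `Entry` class stands in for an R-tree node. A leaf is modelled as an entry whose two corners coincide.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Vec = Tuple[float, ...]


def semi(a: Sequence[float], b: Sequence[float]) -> bool:
    """Offset-space semidominance: a semidominates b."""
    return (all(u <= 2 * v for u, v in zip(a, b))
            and any(0 < u < 2 * v for u, v in zip(a, b)))


@dataclass
class Entry:
    lo: Vec                      # e.min (corner nearest to q in offset space)
    hi: Vec                      # e.max
    kappa: int                   # largest label stored below this entry
    children: List["Entry"] = field(default_factory=list)


def leaf(pt: Vec, kappa: int) -> Entry:
    return Entry(pt, pt, kappa)


def node(children: List[Entry]) -> Entry:
    d = len(children[0].lo)
    lo = tuple(min(c.lo[i] for c in children) for i in range(d))
    hi = tuple(max(c.hi[i] for c in children) for i in range(d))
    return Entry(lo, hi, max(c.kappa for c in children), list(children))


def find_predecessor(root: Entry, p: Vec) -> int:
    """Return the label of the youngest point that semidominates p, or 0 if none."""
    heap = [(-root.kappa, 0, root)]   # max-heap on kappa via negated keys
    tie = 1                           # tie-breaker so entries are never compared
    while heap:
        _, _, e = heapq.heappop(heap)
        if semi(e.hi, p):             # every point below e semidominates p
            return e.kappa
        for c in e.children:
            if semi(c.lo, p):         # c may still contain a semidominator
                heapq.heappush(heap, (-c.kappa, tie, c))
                tie += 1
    return 0


# Hypothetical tree over four points with labels 1..4.
root = node([node([leaf((1, 1), 1), leaf((2, 5), 2)]),
             node([leaf((6, 6), 3), leaf((9, 2), 4)])])
print(find_predecessor(root, (3, 3)))      # 2
print(find_predecessor(root, (1, 1)))      # 1
print(find_predecessor(root, (0.4, 0.4)))  # 0
```

Because entries are popped in descending order of κ, the first entry whose far corner semidominates p already yields the youngest semidominator, so the search can stop there.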
Algorithm 3 describes how to find the set Ds that contains all the points which are semidominated by the new point p but not full-dominated by p. First, find the R-tree set SR that p belongs to (Line 2). Then process all the entries of the R-trees in SR (Lines 3-4). If entry e is an intermediate entry, it is handled according to the dominance relationship from p to e (Lines 7–13). If e is a leaf entry, it is handled according to whether p full-dominates or semidominates e (Lines 14–17).
Algorithm 3: Find the set Ds.
initialize Ds=∅;
find the R-tree set SR that point p belongs to;
foreach R in SR do
    insert all entries in the root of R into a heap H;
    while H is not empty do
        remove the top entry e from H;
        if e is an intermediate entry then
            if p⪯-qe.min then
                remove all points in e from SN;
            else
                foreach child ei of e do
                    if p⪯˙qei.max then
                        insert ei into H;
        else if p⪯-qe then
            remove e from SN;
        else if p⪯˙qe then
            add e to Ds;
    balance R in a bottom-up fashion;
return Ds;
Algorithms 1, 2, and 3 can continuously maintain the reverse skyline queries efficiently. SDRS can prune some redundant data and minimize the number of points which are reserved in the sliding window, so it can reduce the time and space cost greatly.
5. Extensions
In this section, we extend the proposed SDRS algorithm to support n-of-N and (n1,n2)-of-N reverse skyline queries.
Definition 10 (n-of-N reverse skyline).
Given the most recent N points PN in the sliding window, the n-of-N reverse skyline query computes the reverse skyline points among the most recent n points, that is, those which are not semidominated by the points in PN.
According to Definition 10, we introduce a new notion lp to solve this problem. For a data point p∈PN, we use lp to denote the newest point which semidominates p and arrived before p. In such a case, every point in PN has a predecessor point lp, and we use κ(lp) to denote the label of lp. If p has no lp, κ(lp) equals 0.
For every point in PN, we compute its lp and construct a 2D R-tree index on (κ(lp),κ(p)). If the current label is t, we execute the range query [0,t-n]×[t-n+1,t] on the 2D R-tree. For a point p∈PN, if κ(lp) is in the range [0,t-n] and κ(p) is in the range [t-n+1,t], p is an n-of-N reverse skyline point. By searching the 2D R-tree index in this way, we can find all the n-of-N reverse skyline points.
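The range test on (κ(lp),κ(p)) can be sketched with a linear scan in place of the 2D R-tree. The function names and the sample stream are hypothetical. One further assumption goes beyond the text above: the sketch also skips points that have a younger semidominator, in line with Lemma 5, since the range test by itself only looks at older semidominators.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def semi_dominates(x: Sequence[float], y: Sequence[float], q: Sequence[float]) -> bool:
    same_side = all((a - c) * (b - c) >= 0 for a, b, c in zip(x, y, q))
    within = all(abs(a - c) <= 2 * abs(b - c) for a, b, c in zip(x, y, q))
    strict = any((a - c) * (b - c) > 0 and abs(a - c) < 2 * abs(b - c)
                 for a, b, c in zip(x, y, q))
    return same_side and within and strict


def n_of_N(points: List[Point], q: Point, n: int) -> List[Point]:
    """points: the window in arrival order, so labels are 1..t with t = len(points)."""
    t = len(points)
    result = []
    for i, p in enumerate(points):
        label = i + 1
        # kappa(l_p): label of the newest older semidominator of p, 0 if none.
        lp = max((j + 1 for j in range(i) if semi_dominates(points[j], p, q)), default=0)
        # Assumption (not stated in the text): exclude points with a younger semidominator.
        has_younger = any(semi_dominates(points[j], p, q) for j in range(i + 1, t))
        if lp <= t - n and label >= t - n + 1 and not has_younger:
            result.append(p)
    return result


q = (0, 0)
window = [(7, 7), (3, 3), (-2, 2), (-3, 3), (5, -1), (2, -2)]  # hypothetical stream
print(n_of_N(window, q, 3))  # [(-3, 3), (2, -2)]
print(n_of_N(window, q, 6))  # [(3, 3), (2, -2)]
```

With n = 3, the point (-3, 3) qualifies because its only semidominator, (-2, 2), has label 3 and therefore lies outside the most recent three points.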
In this section, we first introduce the definition of the (n1,n2)-of-N reverse skyline. Let Pn1 denote the most recent n1 points in PN and Pn2 denote the most recent n2 points in PN.
Definition 11 ((n1,n2)-of-N reverse skyline).
Given the most recent N points PN in the sliding window and n1>n2, the (n1,n2)-of-N reverse skyline query computes the reverse skyline points in the set Pn1-Pn2.
According to Definition 11, we introduce a new notion rp to solve the problem. For a point p∈PN, we use rp to denote the oldest point which semidominates p and arrives after p. Specifically, (1) rp⪯˙qp, (2) κ(rp)>κ(p), and (3) ∄x∈PN, κ(p)<κ(x)<κ(rp) such that x⪯˙qp. κ(rp) denotes the label of rp. If a point p has no rp, κ(rp) equals ∞.
For every point in PN, we compute its lp and rp and construct a 2D R-tree index on (κ(lp),κ(p)). If the current label is t, we execute the range query [0,t-n1]×[t-n1+1,t-n2] on the 2D R-tree. For a point p∈PN, if κ(lp) is in the range [0,t-n1] and κ(p) is in the range [t-n1+1,t-n2], a further check is needed: if κ(rp) is in the range [t-n2+1,∞], p is an (n1,n2)-of-N reverse skyline point. By searching the 2D R-tree index and applying this further check, we can find all the (n1,n2)-of-N reverse skyline points.
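The combined test on κ(lp), κ(p), and κ(rp) can be sketched with a linear scan in place of the 2D R-tree. The function names and the sample stream are hypothetical, and `float("inf")` plays the role of κ(rp)=∞.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def semi_dominates(x: Sequence[float], y: Sequence[float], q: Sequence[float]) -> bool:
    same_side = all((a - c) * (b - c) >= 0 for a, b, c in zip(x, y, q))
    within = all(abs(a - c) <= 2 * abs(b - c) for a, b, c in zip(x, y, q))
    strict = any((a - c) * (b - c) > 0 and abs(a - c) < 2 * abs(b - c)
                 for a, b, c in zip(x, y, q))
    return same_side and within and strict


def n1_n2_of_N(points: List[Point], q: Point, n1: int, n2: int) -> List[Point]:
    """points: the window in arrival order (labels 1..t); requires n1 > n2."""
    t = len(points)
    result = []
    for i, p in enumerate(points):
        label = i + 1
        # kappa(l_p): newest older semidominator, 0 if none.
        lp = max((j + 1 for j in range(i) if semi_dominates(points[j], p, q)), default=0)
        # kappa(r_p): oldest younger semidominator, infinity if none.
        rp = min((j + 1 for j in range(i + 1, t) if semi_dominates(points[j], p, q)),
                 default=float("inf"))
        if lp <= t - n1 and t - n1 + 1 <= label <= t - n2 and rp >= t - n2 + 1:
            result.append(p)
    return result


q = (0, 0)
window = [(7, 7), (3, 3), (-2, 2), (-3, 3), (5, -1), (2, -2)]  # hypothetical stream
print(n1_n2_of_N(window, q, n1=5, n2=1))  # [(3, 3), (5, -1)]
```

In this example, (5, -1) is reported even though the newest point (2, -2) semidominates it, because that semidominator belongs to Pn2 and is therefore outside the set Pn1-Pn2.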
6. Experimental Evaluation
In this section, we experimentally compare our proposed SDRS algorithm against the only existing algorithm, DCRS [16], using a real dataset (stock data from the Yahoo Financial website) and several synthetic datasets. The stock data contains 10 M stock records. Each record contains 4 attributes: price, volume, growth, and capital size. We evaluate the performance of SDRS and DCRS on the 2D stock dataset (containing the attributes price and volume) and the 4D stock dataset for sliding window sizes of 10 K, 100 K, 1 M, and 10 M. The experimental results are shown in Figure 6.
Figure 6: Experimental results on the real stock dataset. (a) Space usage versus window size. (b) Result size versus window size. (c) Response time versus window size.
Figure 6(a) shows the space usage, measured as the number of points kept in the sliding window, against N. DCRS needs to preserve all the points in the sliding window, whereas SDRS preserves only a small fraction of them. As the dimensionality increases, the number of preserved points also increases. Figure 6(b) reports the result size against N. Figure 6(c) reports the response time of the two algorithms on the 2D and 4D stock datasets. The response time of SDRS is much shorter than that of DCRS because of the filtering policies. Overall, Figure 6 shows that SDRS performs much better than DCRS on the real stock dataset.
The synthetic datasets follow three different distributions: Uniformly Distributed, Clustered, and Anticorrelated Distributed Datasets [1, 7]. The Uniformly Distributed Dataset consists of points randomly generated from a unit square. The Anticorrelated Distributed Dataset consists of points located around the antidiagonal. The Clustered Dataset comprises ten randomly centered clusters, each of which contains an equal number of points and follows a multivariate Gaussian distribution whose covariance is a 0.05-diagonal matrix and whose mean vector is equal to the cluster centroid. All experiments are performed on a machine with an Intel Core 2 1.86 GHz CPU, 1 GB memory, and an 80 GB hard disk. Table 1 summarizes the parameters involved in our experiments.
Table 1: Experimental parameters.
Parameter | Range & default
Dimensionality | 2, 3, 4, 5, 6
Window size | 10 K, 100 K, 1 M, 10 M, 100 M
6.1. Reverse Skyline on Data Stream
We first evaluate the impact of the sliding window size N. The results are shown in Figures 7, 8, and 9. Figure 7 shows the space usage, measured as the maximal number of points kept in SN, against N. For all three data distributions, DCRS must preserve all the points in the sliding window, while SDRS preserves only a small part of them, since the filtering policies in SDRS prune many redundant points. Figure 8 reports the maximal result size against N for the three data distributions. Figure 9 reports the response time of the two algorithms against N. The response time of SDRS is much shorter than that of DCRS for all the data distributions because of the filtering policies. From Figures 8 and 9, we can see that both the result size and the response time of SDRS increase with N. The reason is that, as N increases, more points in the window have to be processed, so the result becomes larger and the response time longer.
Figure 7: Space usage versus window size. Panels: Uniform Dataset, Clustered Dataset, Anticorrelated Dataset.
Figure 8: Result size versus window size. Panels: Uniform Dataset, Clustered Dataset, Anticorrelated Dataset.
Figure 9: Response time versus window size. Panels: Uniform Dataset, Clustered Dataset, Anticorrelated Dataset.
Next, we evaluate the impact of dimensionality d. The results are shown in Figures 10, 11, and 12. Figure 10 shows the space usage, measured as the maximal number of points kept in SN, against d. SDRS preserves only a small part of the points in the sliding window, because the filtering policies prune many redundant points, while DCRS must keep all the points in the sliding window. Figure 11 reports the maximal result size against d for the three data distributions. Figure 12 reports the response time of the two algorithms against d; the response time of SDRS is much shorter than that of DCRS because of the filtering policies in SDRS. From Figures 11 and 12, both the result size and the response time of SDRS increase with d. The reason is that, as d increases, more calculations are needed to determine the dominance relationship between two points and the dominance relationship becomes harder to satisfy, so the result becomes larger and the response time longer.
In this section, maintenance time and processing time are used to evaluate the performance of the n-of-N reverse skyline query. Maintenance time is the time needed to construct the 2D R-tree index, while processing time is the time needed to answer an n-of-N reverse skyline query. To obtain stable measurements, we generate 1000 random values of n and carry out 1000 consecutive n-of-N reverse skyline queries. We then record the average time over the 1000 queries as the processing time.
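The measurement procedure described above can be sketched as a small timing harness. This is an illustrative sketch, not the authors' benchmark code: the `query` callable stands in for an n-of-N reverse skyline query, and the function name, the fixed seed, and the uniform choice of n in [1, N] are assumptions.

```python
import random
import time
from typing import Callable


def average_processing_time(
    query: Callable[[int], object],
    window_size: int,
    num_queries: int = 1000,
    seed: int = 0,
) -> float:
    """Run num_queries queries with random n and return the mean time in seconds.

    query(n) is a placeholder for an n-of-N reverse skyline query.
    """
    rng = random.Random(seed)
    # Draw all n values first so that random generation is not timed.
    ns = [rng.randint(1, window_size) for _ in range(num_queries)]
    start = time.perf_counter()
    for n in ns:
        query(n)
    elapsed = time.perf_counter() - start
    return elapsed / num_queries


if __name__ == "__main__":
    # Dummy query used only to exercise the harness.
    mean_seconds = average_processing_time(lambda n: sum(range(n)), 1000)
    print(f"average processing time: {mean_seconds:.6f} s")
```

Averaging over many randomly chosen n values smooths out the variation caused by individual query parameters, which is the purpose of the 1000-query setup.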
Figure 13 records the maintenance time in 2D and 5D against the sliding window size N for the three data distributions, and Figure 14 records the processing time in 2D and 5D against N for the same distributions. Both maintenance time and processing time increase slightly with N, because a larger window increases the number of points in the R-tree. The maintenance time and processing time in 5D are much longer than those in 2D, because the computation in 5D is more complex and many more points are reserved in 5D than in 2D.
In this section, maintenance time and processing time are used to evaluate the performance of the (n1,n2)-of-N reverse skyline query. Maintenance time is the time needed to construct the 2D R-tree index, while processing time is the time needed to answer an (n1,n2)-of-N reverse skyline query. To obtain stable measurements, we generate 1000 random values of n1, set n2 = n1 + 500 for each, and then carry out 1000 consecutive (n1,n2)-of-N reverse skyline queries. Finally, we record the average time over the 1000 queries as the processing time.
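As a rough illustration of the query parameters, the sketch below selects the points that an n-of-N or (n1,n2)-of-N query ranges over and generates query pairs with n2 = n1 + 500. It assumes the usual reading of these queries (the most recent n points, and the points between the n1-th and n2-th most recent, inclusive), a window stored as a list ordered from oldest to newest, and hypothetical function names; none of this is taken from the authors' implementation.

```python
import random
from typing import List, Sequence, Tuple


def points_for_n_of_N(window: Sequence, n: int) -> List:
    """Return the n most recent points (assumes 1 <= n <= len(window))."""
    return list(window[-n:])


def points_for_n1_n2_of_N(window: Sequence, n1: int, n2: int) -> List:
    """Return the points from the n2-th to the n1-th most recent, inclusive.

    Assumes 1 <= n1 <= n2 <= len(window); the newest point is the 1st.
    """
    size = len(window)
    return list(window[size - n2 : size - n1 + 1])


def generate_query_pairs(
    window_size: int, count: int = 1000, gap: int = 500, seed: int = 0
) -> List[Tuple[int, int]]:
    """Draw count random n1 values and pair each with n2 = n1 + gap."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        n1 = rng.randint(1, window_size - gap)  # keeps n2 inside the window
        pairs.append((n1, n1 + gap))
    return pairs


if __name__ == "__main__":
    window = list(range(10))  # 9 is the most recent element
    print(points_for_n_of_N(window, 3))
    print(points_for_n1_n2_of_N(window, 2, 4))
    print(generate_query_pairs(2000, count=3))
```

Keeping the gap between n1 and n2 fixed means every query covers the same number of positions in the window, so differences in processing time reflect the position of the range and the data rather than the range length.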
Figure 15 records the maintenance time in 2D and 5D against N for the three data distributions, and Figure 16 records the processing time in 2D and 5D against N for the same distributions. Maintenance time and processing time increase slightly with N, because the number of reserved points in the R-tree increases slightly. The computation in 5D is more complex than that in 2D, so the maintenance time and processing time in 5D are much longer than those in 2D.
Figure 15: Maintenance time for (n1,n2)-of-N query (panels: Uniform Dataset, Clustered Dataset, Anticorrelated Dataset).
Figure 16: Processing time for (n1,n2)-of-N query (panels: Uniform Dataset, Clustered Dataset, Anticorrelated Dataset).
7. Conclusions
Despite its importance in real-world applications, reverse skyline computation on data streams has not been well studied. In this paper, we therefore focus on the problem of efficiently computing reverse skylines over sliding windows on an append-only data stream. Specifically, we present an effective pruning technique that minimizes the number of points to be kept in the sliding window, and we propose an efficient semidominance-based approach, SDRS, for processing continuous reverse skyline queries. Moreover, we propose an extension for handling n-of-N and (n1,n2)-of-N reverse skyline queries. Our extensive experiments have demonstrated the efficiency and effectiveness of the proposed SDRS approach under various experimental settings.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research was partially supported by the National Natural Science Foundation of China under Grants nos. 61472069, 61402089, and 61100022 and the Fundamental Research Funds for the Central Universities under Grant no. N130404014.
References
[1] Borzsonyi S., Stocker K., Kossmann D. The Skyline operator. Proceedings of the 17th International Conference on Data Engineering, April 2001, Heidelberg, Germany, IEEE, pp. 421–430. doi:10.1109/ICDE.2001.914855
[2] Tao Y., Xiao K., Pei J. Efficient skyline and top-k retrieval in subspaces.
[3] Dellis E., Vlachou A., Vladimirskiy I., Seeger B., Theodoridis Y. Constrained subspace skyline computation. Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), November 2006, ACM, pp. 415–424. doi:10.1145/1183614.1183675
[4] Papadias D., Tao Y., Fu G., Seeger B. An optimal and progressive algorithm for skyline queries. Proceedings of the 22nd ACM SIGMOD International Conference on Management of Data (SIGMOD '03), June 2003, San Diego, Calif, USA, pp. 467–472.
[5] Papadias D., Tao Y., Fu G., Seeger B. Progressive skyline computation in database systems.
[6] Deng K., Zhou X., Shen H. Multi-source skyline query processing in road networks. Proceedings of the 26th ACM SIGMOD International Conference on Management of Data, June 2007, Beijing, China, pp. 796–805.
[7] Dellis E., Seeger B. Efficient computation of reverse skyline queries. Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07), September 2007, pp. 291–302.
[8] Ding X., Lian X., Chen L., Jin H. Continuous monitoring of skylines over uncertain data streams.
[9] Huang Z., Lu H., Ooi B. C., Tung A. K. H. Continuous skyline queries for moving objects.
[10] Lian X., Chen L. Monochromatic and bichromatic reverse skyline search over uncertain databases. Proceedings of the 27th ACM SIGMOD International Conference on Management of Data (SIGMOD '08), June 2008, Vancouver, Canada, pp. 213–226. doi:10.1145/1376616.1376641
[11] Lian X., Chen L. Reverse skyline search in uncertain databases.
[12] Deshpande P. M., Deepak P. Efficient reverse skyline retrieval with arbitrary non-metric similarity measures. Proceedings of the 14th International Conference on Extending Database Technology, March 2011, Uppsala, Sweden, ACM, pp. 319–330. doi:10.1145/1951365.1951404
[13] Wu X., Tao Y., Wong R. C.-W., Ding L., Yu J. X. Finding the influence set through skylines. Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology (EDBT '09), March 2009, Saint-Petersburg, Russia, pp. 1030–1041. doi:10.1145/1516360.1516478
[14] Xin J., Wang Z., Bai M., Ding L., Wang G. Energy-efficient β-approximate skylines processing in wireless sensor networks.
[15] Chen L., Zhao J., Huang Q., Yang L. H. Effective space usage estimation for sliding-window skybands.
[16] Zhu L., Li C., Chen H. Efficient computation of reverse skyline on data stream. Proceedings of the 2nd International Joint Conference on Computational Sciences and Optimization (CSO '09), April 2009, Hainan, China, IEEE, pp. 735–739. doi:10.1109/cso.2009.74
[17] Sharifzadeh M., Shahabi C. The spatial skyline queries. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06), September 2006, Seoul, Republic of Korea, pp. 751–762.
[18] Chen L., Lian X. Dynamic skyline queries in metric spaces. Proceedings of the 11th International Conference on Extending Database Technology, March 2008, ACM, pp. 333–343. doi:10.1145/1353343.1353386
[19] Chen L., Lian X. Efficient processing of metric skyline queries.
[20] Wang G., Xin J., Chen L., Liu Y. Energy-efficient reverse skyline query processing over wireless sensor networks.
[21] Min J.-K. Efficient reverse skyline processing over sliding windows in wireless sensor networks.
[22] Lim J., Park Y., Bok K., Yoo J. A continuous reverse skyline query processing considering the mobility of query objects. Proceedings of the 5th International Conference on Internet and Distributed Computing Systems, November 2012, Fujian, China, pp. 227–237.
[23] Lim J., Park Y., Bok K., Yoo J. An efficient continuous reverse skyline query processing method over moving objects.
[24] Lim J., Li H., Bok K., Yoo J. A continuous reverse skyline query processing method in moving objects environments.
[25] Han A., Park Y., Kwon D. An efficient pruning method to process reverse skyline queries.
[26] Lin X., Yuan Y., Wang W., Lu H. Stabbing the sky: efficient skyline computation over sliding windows. Proceedings of the 21st International Conference on Data Engineering (ICDE '05), April 2005, Tokyo, Japan, IEEE, pp. 502–513. doi:10.1109/icde.2005.137
[27] Tao Y., Papadias D. Maintaining sliding window skylines on data streams.
[28] Morse M., Patel J. M., Grosky W. I. Efficient continuous skyline computation.
[29] Sarkas N., Koudas N., Das G., Tung A. K. H. Categorical skylines for streaming data. Proceedings of the 27th ACM SIGMOD International Conference on Management of Data, June 2008, Vancouver, Canada, ACM, pp. 239–250. doi:10.1145/1376616.1376643
[30] Zhang Z., Cheng R., Papadias D., Tung A. K. H. Minimizing the communication cost for continuous skyline maintenance. Proceedings of the 28th International Conference on Management of Data, July 2009, Providence, RI, USA, ACM, pp. 495–507. doi:10.1145/1559845.1559898
[31] Zhang W., Lin X., Zhang Y., Wang W., Yu J. X. Probabilistic skyline operator over sliding windows. Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE '09), April 2009, Shanghai, China, IEEE, pp. 1060–1071. doi:10.1109/icde.2009.83
[32] Zhang W., Lin X., Zhang Y., Wang W., Zhu G., Yu J. X. Probabilistic skyline operator over sliding windows.
[33] Bai M., Xin J., Wang G. Probabilistic reverse skyline query processing over uncertain data stream.
[34] Liu Q., Gao Y., Chen G., Li Q., Jiang T. On efficient reverse k-skyband query processing. Proceedings of the 17th International Conference on Database Systems for Advanced Applications, April 2012, Busan, Republic of Korea, pp. 544–559.
[35] Yang J., Qu B., Li C.-P., Chen H. DC-tree: an algorithm for skyline query on data streams.