Optimized Query Algorithms for Top-K Group Skyline

Department of Information Engineering, Hebei University of Environmental Engineering, Qinhuangdao, China; School of Information and Management, Shanghai Lixin University of Accounting and Finance, Shanghai, China; Qinhuangdao Vocational and Technical College, Qinhuangdao, China; School of Information Science and Engineering, YanShan University, Qinhuangdao, China; Research Scholar, Department of Civil Engineering, Faculty of Engineering and Technology, Madhyanchal Professional University, Bhopal, India; Sanskriti University, Mathura, India; Engineering Dept., Misr Higher Institute of Engineering & Technology, Mansoura, Egypt


Introduction
Skyline queries, also known as maxima or Pareto queries [1] (in business management, Pareto optimality means gaining an advantage without harming the interests of others), are a query optimization problem. The Skyline query was proposed by Borzsonyi et al. [2] and was first introduced to the database domain at the 2001 ICDE conference. Since then, it has attracted extensive attention from researchers and has become one of the most difficult and active topics in database research. Skyline queries have many applications in multidimensional optimization analysis, such as choosing petrol stations and hotels in a road network, selecting players in social networks, and determining targets from multiple-attribute information.
The Skyline query has been extended in various ways in recent years and has become a research focus in the database domain. Much current work still addresses single-point queries based on the traditional Skyline, such as Skyline queries on data streams [3] and on subspaces [4-8].
In a Skyline query on a data stream, the tuples of the stream change dynamically, and for a given constrained query the task is to find the nodes that fall into the valid area or affect the result tuple set. Such queries are often applied to intelligent transportation, online monitoring, and other fields. For massive high-dimensional data, the full-space Skyline query suffers from an overly large result set and low efficiency, so the subspace Skyline query is of greater research significance. To reduce the size of the result set and return some representative Skyline points, variants of the Skyline definition have been proposed, such as the k-dominated Skyline [9,10], the Top-k Skyline query, and distance-based classical Skylines [11].
In many cases, what we search for is a point group made up of s points rather than a single point. For example, in a road network query, people want to find adjacent malls that meet their demands. These malls form a cluster and are connected along the shopping route. In turn, people can identify hotels and entertainment venues within the Skyline according to the distribution density of shopping malls, which is usually called site selection analysis. Liu et al. [12] first extended the Skyline from a single point to a point group and proposed the corresponding Skyline algorithm. In practical applications, such objective optimization problems can also be applied to path optimization [13] to calculate the minimum-cost path, to mobile trajectory tracking [14,15] to look for similar trajectories, to social networks to find close communities, and to graph correlation [16] to obtain the correlation degree of the target point.
Top-k is a typical query problem in large-scale data processing and is widely used in everyday queries, such as the analysis and summary of the top 10 query words in search engines. When Top-k is introduced into the group Skyline query, each query returns the k best Skyline point groups according to a chosen measurement index, which reduces the burden of further selection. To solve the Top-k group Skyline problem, this paper proposes several efficient algorithms. The main research work of this paper has four points.
(1) Motivated by practical application requirements, this paper introduces the Top-k group Skyline query problem, analyzes it theoretically, and puts forward a criterion for evaluating the quality of a point group. Taking the number of vertices in each Skyline layer as the basis for ranking the results, the SLGS algorithm is proposed.
(2) For the ranking strategy based on Skyline layers, the concept of vertex coverage is proposed to handle result point groups that receive the same rank. To avoid blind selection, the VCGS algorithm based on vertex coverage is proposed, which further ranks all result sets and returns the Top-k point groups.
Definition 3. (Group dominance). Given a set $P$ of $n$ data points in a $d$-dimensional space, let $G = \{p_1, p_2, \dots, p_s\}$ and $G' = \{p'_1, p'_2, \dots, p'_s\}$ be two different point groups of $s$ points of $P$. We say that $G$ dominates $G'$ if there exist two permutations of the $s$ points of $G$ and $G'$, $G = \{p_{u_1}, p_{u_2}, \dots, p_{u_s}\}$ and $G' = \{p'_{v_1}, p'_{v_2}, \dots, p'_{v_s}\}$, such that $p_{u_i}$ dominates or equals $p'_{v_i}$ for all $i$ ($1 \le i \le s$), and $p_{u_i}$ strictly dominates $p'_{v_i}$ for at least one $i$.
2.2. Related Work Analysis. This paper mainly focuses on how to obtain the top-k Skyline point groups. The Top-k Skyline query [12, 16-20] is a common problem in large-scale data processing. The group Skyline query computes the set of point groups that are not dominated by any other point group on a given data set. It is a further extension of the traditional Skyline query. So far there has been little research on the group Skyline query; it was proposed and studied in Refs. [21-24]. In recent years, the effectiveness of query results [25-27] has received more attention.
To reduce the size of the query result set and return more representative Skyline points, variations of the Skyline definition have been given, such as the k-dominated Skyline [24], the representative Skyline [28], the top-k Skyline query, and the distance-based representative Skyline [29]. Basic algorithms for the group Skyline query include PointWise [12], UnitWise [12], and UnitWise+ [12]. In Ref. [24], the definition of the group Skyline is first proposed, and the definition of group dominance depends on a certain aggregated point or a representative point in a point group. Although many aggregation functions, such as summation, minimum, and maximum, can be used to calculate aggregation points, finding all group Skyline sets is not easy.
Wireless Communications and Mobile Computing
The PointWise algorithm enumerates candidate group Skylines by dynamically generating a set enumeration tree containing the candidate groups and pruning the groups that are not group Skylines. First, the directed Skyline graph is preprocessed and redundant nodes are filtered out; then the remaining points in the graph are enumerated. The pruning strategy is as follows: if a point group is found not to be a group Skyline during enumeration, it need not be extended, and the subtree rooted at it can be pruned. Each candidate set corresponds to a set of extension points, from which some points can be filtered out to further reduce the enumeration. The verified point groups form the final group Skyline.
The UnitWise algorithm expands a candidate group by adding unit groups one by one. As in PointWise, the candidate groups are enumerated by dynamically generating a set enumeration tree containing the candidate groups, where each node in the tree is a set of unit groups. At the same time, useless point groups are pruned as far as possible while the candidate Skyline groups are listed. The pruning strategy is as follows: if a candidate point group G already contains at least s points, then every candidate group in G's subtree contains more than s points, so the subtree can be pruned; in addition, some points in the set of extension points corresponding to a candidate point group can be filtered out. Because the algorithm is based on unit group expansion, it reduces the number of enumerations and is more efficient than PointWise.
The UnitWise+ algorithm is an improvement on UnitWise. To delete more point groups that are not group Skylines in advance, the algorithm first processes the points on the higher Skyline layers, enumerates larger candidate point groups earlier, and also filters the set of extension points corresponding to each candidate point group to reduce its size. Moreover, depth-first traversal is used to examine candidate point groups so that the algorithm can terminate early, which further narrows the query range and improves the effectiveness of the result set.
Although these algorithms reduce the size of the candidate sets and the number of enumerated point groups, the result set is still large when the data size, the dimension, and the size of the point groups grow. Some appropriate point groups therefore need to be extracted and returned as the result. To overcome this shortcoming, the SLGS algorithm based on Skyline layers, the VCGS algorithm based on vertex coverage, and the improved algorithm VCGS+ are proposed.
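The group dominance test of Definition 3, on which all of the algorithms above rely, can be illustrated by brute force. The following is a minimal sketch and not the procedure used by PointWise or UnitWise; the function names `dominates_or_equal`, `dominates`, and `group_dominates` are hypothetical, and smaller attribute values are assumed to be better.

```python
from itertools import permutations


def dominates_or_equal(p, q):
    """p is no worse than q on every attribute (assumption: smaller is better)."""
    return all(a <= b for a, b in zip(p, q))


def dominates(p, q):
    """p dominates q: no worse on every attribute, strictly better on at least one."""
    return dominates_or_equal(p, q) and any(a < b for a, b in zip(p, q))


def group_dominates(g, h):
    """Brute-force check of group dominance for two groups of equal size s.

    Keeping the order of g fixed and trying every permutation of h covers
    all possible pairings, so only one side has to be permuted.
    The cost is O(s!), which is only acceptable for small s.
    """
    if len(g) != len(h):
        raise ValueError("groups must have the same size")
    for perm in permutations(h):
        pairs = list(zip(g, perm))
        if all(dominates_or_equal(p, q) for p, q in pairs) and any(
            dominates(p, q) for p, q in pairs
        ):
            return True
    return False
```

The factorial cost of this check is one reason why practical algorithms avoid pairwise group comparisons and work on the Skyline layer structure instead.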

Query Algorithm Based on Skyline Layer
3.1. Ranking Strategy. First, the characteristics of the result set are discussed, and criteria for measuring the quality of the result set are put forward. The analysis is illustrated with Example 3.1, in which Table 1 is a hotel data set.
The Skyline layers of the point set are constructed as follows. First, the 12 points are sorted in ascending order of the distance attribute (users can choose the sorting attribute according to their preferences), and the points are then processed sequentially. Point $p_1$ is the first point and is placed on the first layer. The next point, $p_7$, is then processed; no point on the first layer dominates $p_7$, so $p_7$ also belongs to the first layer. When $p_4$ is processed, a point on the first layer dominates it, so $p_4$ starts the next Skyline layer. The remaining points are processed in the same way until all points are assigned. The results are shown in Figure 1.
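The layer construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: smaller values are assumed to be better on every attribute, the function names are hypothetical, and the points are sorted lexicographically (not only by the first attribute) so that every dominator of a point is processed before the point itself.

```python
def dominates(p, q):
    """p dominates q (assumption: smaller is better on every attribute)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_layers(points):
    """Partition points into Skyline layers.

    Each point is placed on the first layer that contains no point
    dominating it; if every existing layer contains a dominator,
    the point opens a new layer.
    """
    layers = []
    for p in sorted(points):
        for layer in layers:
            if not any(dominates(q, p) for q in layer):
                layer.append(p)
                break
        else:
            layers.append([p])
    return layers


# Hypothetical (distance, price) pairs, not the values of Table 1.
hotels = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 2), (6, 6)]
print(skyline_layers(hotels))
# [[(1, 5), (2, 3), (4, 1)], [(3, 4), (5, 2)], [(6, 6)]]
```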
Based on the definition of the Skyline layer, the directed Skyline graph of the point set is constructed. In Figure 1, the index values of the points are omitted (the index value of point $p_1$ is 0, that of $p_7$ is 1, that of $p_{12}$ is 2, and so on). Following the Skyline layers, the nodes of the directed Skyline graph are numbered sequentially from the lower to the higher layers, as shown in Figure 2.
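A directed Skyline graph can be represented by recording, for every point, which points dominate it and which points it dominates. The sketch below is an assumption-laden illustration: it stores all dominance relations (not only edges between adjacent layers), uses a quadratic scan, assumes smaller values are better, and the name `build_dsg` is hypothetical.

```python
def dominates(p, q):
    """p dominates q (assumption: smaller is better on every attribute)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def build_dsg(points):
    """Return (parents, children) dictionaries keyed by point index.

    parents[j] lists the indices of all points that dominate point j;
    children[i] lists the indices of all points dominated by point i.
    """
    n = len(points)
    parents = {i: [] for i in range(n)}
    children = {i: [] for i in range(n)}
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j and dominates(p, q):
                children[i].append(j)
                parents[j].append(i)
    return parents, children
```

The length of `children[i]` is the number of points dominated by point `i`, and a point with an empty `parents` list lies on the first Skyline layer.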
According to the definition of the Skyline layer, the points on the first layer are the Skyline points of the whole point set P, and they dominate points on the other layers; the points on the second layer are the Skyline points of the subset of P that remains after the first-layer points are removed, that is, they dominate points on the layers above the second. By analogy, it can be concluded that points on lower layers dominate points on higher layers. According to the definition of point dominance, the values of a lower-layer point are not worse than those of the higher-layer point it dominates on any attribute, and are better on at least one attribute. Therefore, the more points a point group takes from the lower layers, the better the point group.
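One way to make this criterion concrete is to count, for each point group, how many of its points lie on each Skyline layer and to compare the resulting count vectors. The sketch below assumes a lexicographic comparison that looks at the lowest layer first; this ordering, like the names `layer_signature` and `is_better`, is an assumption made for illustration and is not stated formally in the text.

```python
def layer_signature(group, layer_of, num_layers):
    """Count how many points of the group lie on each Skyline layer.

    layer_of maps a point identifier to its 1-based layer number.
    """
    counts = [0] * num_layers
    for point in group:
        counts[layer_of[point] - 1] += 1
    return tuple(counts)


def is_better(sig_a, sig_b):
    """Assumed rule: more points on the lowest layer wins; ties move up a layer."""
    return sig_a > sig_b  # tuples compare lexicographically


# Hypothetical point identifiers and layer numbers.
layer_of = {"u": 1, "v": 1, "w": 2, "x": 3, "y": 2}
sig_1 = layer_signature(["u", "v", "w", "x"], layer_of, 3)  # (2, 1, 1)
sig_2 = layer_signature(["u", "w", "y", "x"], layer_of, 3)  # (1, 2, 1)
print(is_better(sig_1, sig_2))  # True
```

Two groups with identical count vectors are indistinguishable under this rule, which is the situation that motivates grouping equivalent point groups into blocks.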
The number of points that a Skyline point group takes from each Skyline layer is used to decide which groups are better or worse, so that the Skyline point groups can be sorted and the top-k Skyline point groups can be obtained. From the above analysis, it can be seen that the number of points from the lower Skyline layers is a key factor affecting the overall quality of a Skyline point group. Thus, the following definitions about Skyline point groups are derived.
3.2. Algorithmic Description. With the ranking strategy proposed above, the k optimal Skyline point groups can be obtained by processing the result set calculated by the UnitWise+ algorithm, but that result set is unordered. The best point group may be located anywhere in the result set, which makes the user's choice harder. Because the first result is not guaranteed to be the best, the whole result set needs to be sorted. Under the ranking strategy, some point groups are equivalent and indistinguishable during ranking; these are packaged into a block so that different sets of equivalent groups can be told apart. That is to say, after the result set is processed, it is organized into blocks of equivalent point groups. The first block contains the best point groups, the second block contains the next best, and the last block contains the worst. The results are thus well organized and hierarchical. Based on this idea, the SLGS query algorithm based on Skyline layers is proposed.
The basic idea of the algorithm is as follows. Given the set of result point groups R, each point group in R is traversed according to the ranking strategy proposed above, and equivalent point groups are divided into blocks; the blocks are then ordered, each block being dynamically inserted at its corresponding position. Finally, k point groups are extracted from the ordered block set. The following is a simple flow chart of the algorithm SLGS.
Table 1: Hotel data set.
Distance: 4, 30, 24, 14, 36, 26, 8, 34, 20, 40, 28, 16
Price: 400, 390, 380, 340, 300, 280, 260, 220, 210, 200, 120
For the result set $R = \{a, b, c, d, e, f, g, h, i, j, l, n\}$, suppose the user wants to select the 5 best groups from the 12 given result groups, that is, k = 5. The algorithm then executes as follows. First, the tag array is initialized as mark[] = {0}, which means that no point group has been accessed, and the block set C is empty. Second, point group a is traversed; it has 2, 1, and 1 points from the first, second, and third layers, respectively, and is equivalent to point groups j and l. The corresponding positions of the tag array mark are set to 1, and the block $C_1$ composed of the three point groups (a, j, l) is added at the head of C. Then the next unvisited point group b is processed; it has 1, 2, and 1 points from the first, second, and third layers, respectively, and is equivalent to point group n. The corresponding positions of mark are set to 1, and point groups b and n form a block $C_2$, which is to be inserted into C. At this time, the block $C_1$ already exists in C. Because the point groups in $C_1$ contain more points from the lowest layer than those in $C_2$, the point groups in $C_1$ are better than those in $C_2$. Therefore, the block $C_2$ is inserted at the end of C. By analogy, point groups c, d, and e are equivalent and form the block $C_3$. Because $C_3$ is better than both $C_1$ and $C_2$, $C_3$ is inserted at the head of C. Point group f is equivalent to g, h, and i, and together they form the block $C_4$. Comparing the numbers of points from the lower layers shows that $C_4$ is better than $C_1$ and $C_2$ but worse than $C_3$, so $C_4$ is inserted in front of $C_1$. At this time, all point groups in R have been accessed, and the complete set of 4 blocks $C = \{C_3, C_4, C_1, C_2\}$ has been obtained.
Given k = 5, because the first block $C_3$ in C contains only 3 point groups, which is fewer than k, two more point groups must be selected from the second block $C_4$. Finally, the five point groups c, d, e, f, and g are returned.
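The walk-through above can be reproduced with a short sketch. It is not the SLGS pseudocode: instead of a tag array and positional insertion of blocks, it groups equivalent point groups with a dictionary and sorts the blocks once. The layer counts for a, j, l and for b, n are taken from the example; the counts for the other groups are hypothetical values chosen only so that the block order matches the text. The function name `slgs_sketch` is also hypothetical.

```python
def slgs_sketch(signatures, k):
    """Partition point groups into blocks of equal layer counts and pick the top k.

    signatures maps a group name to its per-layer point counts
    (lowest layer first). Blocks are ordered lexicographically,
    which is an assumed reading of the ranking strategy.
    """
    blocks = {}
    for name, sig in signatures.items():
        blocks.setdefault(sig, []).append(name)
    ordered = [blocks[sig] for sig in sorted(blocks, reverse=True)]
    top_k = [name for block in ordered for name in block][:k]
    return ordered, top_k


signatures = {
    "a": (2, 1, 1), "b": (1, 2, 1),
    "c": (3, 1, 0), "d": (3, 1, 0), "e": (3, 1, 0),                  # hypothetical
    "f": (2, 2, 0), "g": (2, 2, 0), "h": (2, 2, 0), "i": (2, 2, 0),  # hypothetical
    "j": (2, 1, 1), "l": (2, 1, 1), "n": (1, 2, 1),
}
blocks, top5 = slgs_sketch(signatures, 5)
print(blocks)  # [['c', 'd', 'e'], ['f', 'g', 'h', 'i'], ['a', 'j', 'l'], ['b', 'n']]
print(top5)    # ['c', 'd', 'e', 'f', 'g']
```

Within a block the groups keep their input order, so which members of a partially used block are returned is arbitrary with respect to quality.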
Assuming that there are n elements and e blocks in the result set R, the time complexity of the algorithm SLGS involves two aspects: (1) traversing the n elements to perform the block processing, whose time complexity is $O(n)$; (2) finding the equivalent point groups, which requires traversing the n elements again while positioning each block in the block set by binary search, whose time complexity is $O(n + \log e)$. Therefore, the overall time complexity of the algorithm SLGS is $O(n(n + \log e))$.

Query Algorithm Based on Vertex Cover
4.1. The Basic Idea. The directed Skyline graph in Figure 2 shows that different points dominate different numbers of points on other layers, so different point groups also dominate different numbers of points. For example, the given point group $G = \{p_1, p_7, p_{12}, p_4\}$ has size 4, and the numbers of points dominated by $p_1$, $p_7$, $p_{12}$, and $p_4$ are 0, 5, 8, and 2, respectively. The sum of the points that this point group can dominate is 15. Similarly, the sum is 18 for another point group $G' = \{p_1, p_7, p_{12}, p_9\}$. Obviously, the point group $G'$ is better than $G$. This means that the total number of points a point group can dominate also affects the overall quality of the Skyline point group. Therefore, the concept of vertex coverage is proposed.
Definition 9. (Vertex coverage). Given a point group $G = \{p_1, p_2, \dots, p_s\}$ in the group Skyline, let $n_i$ ($1 \le i \le s$) denote the number of points dominated by $p_i$ in $G$. Then $S = \sum_{i=1}^{s} n_i$ is the number of vertex covers, and we call $S$ the vertex coverage of $G$, written $VC(G)$.
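Definition 9 amounts to a single sum, which the following sketch checks against the example above. The dominated counts for $p_1$, $p_7$, $p_{12}$, and $p_4$ are those given in the text; the count of 5 for $p_9$ is not stated and is inferred here from the total of 18. The function name `vertex_coverage` is hypothetical.

```python
def vertex_coverage(group, dominated_count):
    """VC(G): sum of the numbers of points dominated by each member of G."""
    return sum(dominated_count[p] for p in group)


dominated_count = {
    "p1": 0, "p7": 5, "p12": 8, "p4": 2,
    "p9": 5,  # inferred from VC(G') = 18, not given explicitly
}
print(vertex_coverage(["p1", "p7", "p12", "p4"], dominated_count))  # 15
print(vertex_coverage(["p1", "p7", "p12", "p9"], dominated_count))  # 18
```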
Combining the ranking strategy of Section 3.1 with vertex coverage, the Top-k Skyline point groups have the following characteristics:
(1) The point group contains more points from the lower Skyline layers.
(2) The vertex coverage of the point group is larger.
From the above analysis, accurate Top-k groups can be obtained by further sorting the block results of SLGS, and the corresponding VCGS (Vertex Coverage Group Skyline) algorithm is proposed. The basic idea is as follows. Given the set of result point groups R, first run the algorithm SLGS, which ranks the results by Skyline layer and the number of vertices in each layer, to obtain the result C composed of blocks; then traverse the point groups in each block of C. During the traversal, the better and worse point groups within a block are distinguished by the size of their vertex coverage, and the equivalent point groups are reordered accordingly. After every block has been processed, k point groups can be extracted. The algorithm VCGS is shown in Algorithm 2.
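The second stage of VCGS can be sketched as a per-block sort by vertex coverage. This illustration assumes the blocks are already ordered from best to worst (as produced by SLGS) and that the vertex coverage of every group has been precomputed; it uses Python's built-in stable sort in place of the quick sort named in Algorithm 2. The function name `vcgs_sketch` and the coverage values are hypothetical.

```python
def vcgs_sketch(blocks, vc, k):
    """Reorder each block by descending vertex coverage and return the top k groups.

    blocks: list of blocks, best block first, each a list of group names.
    vc: dictionary mapping a group name to its vertex coverage.
    """
    ranked = []
    for block in blocks:
        ranked.extend(sorted(block, key=lambda g: vc[g], reverse=True))
    return ranked[:k]


blocks = [["c", "d", "e"], ["f", "g", "h", "i"]]
vc = {"c": 10, "d": 14, "e": 12, "f": 9, "g": 11, "h": 11, "i": 7}  # hypothetical
print(vcgs_sketch(blocks, vc, 5))  # ['d', 'e', 'c', 'g', 'h']
```

Groups that tie on both layer counts and vertex coverage (g and h here) keep their original relative order.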

Algorithm Optimization.
The algorithm is very sensitive to the size n of the block set and the number of elements m in each block. This is mainly because n and m become very large when the size of the data set, the dimension, and the size of the point groups increase, so traversing these elements takes more time. In fact, it is not necessary to calculate the whole result set. The algorithm UnitWise+ enumerates all the points on the Skyline layers. Assuming that the number of Skyline points on the first layer is $n_1$, point groups of s points are enumerated from these $n_1$ points. If the number of enumerated groups $e_1$ is greater than or equal to k, that is, the point groups generated from the first layer are enough to find k optimal results, then the Skyline layers above the first can be pruned, and the point groups composed of first-layer points can be sorted directly. If $e_1$ is less than k, the point groups from the first layer are not enough to find k results, so the second layer is added and the Skyline points on the first two layers are enumerated. If the enumeration number $e_2$ is greater than or equal to k, the Skyline layers above the second can be pruned, the points on the first two layers are enumerated, and the resulting point groups are sorted to find the k optimal point groups, and so on. In this way, the algorithm UnitWise+ can stop earlier during execution and does not enumerate invalid point groups. Based on the above analysis, the optimized algorithm VCGS+ is introduced and shown in Algorithm 3.
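The layer-by-layer stopping rule can be sketched as follows. This is only an illustration of the control flow: the enumerator is pluggable, and the default simply lists all s-subsets of the candidate points as a stand-in for UnitWise+, which is an assumption and not how UnitWise+ works. At most the first s layers are considered, following the loop bound in Algorithm 3. The names `layers_needed` and `enumerate_groups` are hypothetical.

```python
from itertools import combinations


def layers_needed(layers, s, k, enumerate_groups=None):
    """Add Skyline layers one at a time until at least k groups are enumerated.

    Returns (number of layers used, enumerated groups). Layers beyond
    the returned depth are never touched, which is the pruning effect.
    """
    if enumerate_groups is None:
        # Stand-in for UnitWise+ (assumption): every s-subset of the candidates.
        def enumerate_groups(points, size):
            return list(combinations(points, size))

    candidates = []
    groups = []
    for depth, layer in enumerate(layers[:s], start=1):
        candidates.extend(layer)
        groups = enumerate_groups(candidates, s)
        if len(groups) >= k:
            return depth, groups
    return min(len(layers), s), groups


layers = [["a", "b", "c"], ["d", "e"], ["f"]]  # hypothetical layers
print(layers_needed(layers, s=2, k=3)[0])  # 1
print(layers_needed(layers, s=2, k=4)[0])  # 2
```

In this sketch the candidate groups are re-enumerated from scratch each time a layer is added, which keeps the code short at the price of repeated work.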

Comparison and Analysis of Three Algorithms.
Algorithm 1 suffers from excessive enumeration and computation, and within a set of equivalent point groups it cannot tell which group is better. Algorithm 2 can further distinguish between equivalent point groups, but it still requires a large amount of calculation, and many useless point groups take part in it. Therefore, Algorithm 3 is optimized in three respects: the enumerator, the pruning strategy, and the selection among equivalent point groups. In the experiments, the parameters d, n, s, and k represent, respectively, the dimension of the data set, the size of the data set, the size of the required point group, and the number of best groups returned.
    if G_i has been visited then
        continue;
    for each group G_j ∈ R do
        if G_j is not visited and G_i is equal to G_j then
            add G_j to an existing chunk c
        else if G_j is not visited and j == i then
            add G_i to a new chunk c
Algorithm 1: SLGS.
    for each group G_i ∈ C_i do
        use number[] to compute VC for G_i
    use quick sort to sort the groups in C_i
    S ← select k Group-Skyline groups from C
    return S
Algorithm 2: VCGS.

Experiment and Result Analysis

The Data Set and Evaluation Criteria.
The experiment uses two real data sets: NBA player statistics (http://stats.nba.com/leaders/alltime/?ls=iref:nba:gnav) and NHL player statistics (http://www.nhl.com/player/). Each data set is described in Table 2. The experiments vary the dimension, the data set size, the size of the point groups, and the number of returned results. The specific settings are shown in Table 3.
Dimension of the data set λ: the number of attributes contained in the target point set.
Data set size n: the number of target points.
Size of the point group s: the number of target points contained in each point group.
Number of optimal point groups returned k: the number of elements in the result set returned to the user.

The Performance Comparison and Analysis
5.3.1. The Influence of the Size of the Point Group. As can be seen from Figure 3, the execution time of each algorithm increases with s. When s is small, all three algorithms run very fast, and the execution times of SLGS and VCGS are similar. VCGS takes slightly longer than SLGS because it further sorts the point groups on top of the SLGS results, which adds to the execution time. As s increases, the execution time of SLGS and VCGS grows exponentially, because the number of points on the first Skyline layers increases sharply, which in turn increases the number of Skyline point groups and the data scale. However, the improved VCGS+ algorithm runs faster than the other two algorithms, and in the best case it is 10 times faster than the worst case.
As can be seen from Figure 4, the number of point groups enumerated by the three algorithms increases with s. When s is small, the enumeration results of the three algorithms differ little, and the enumeration number of VCGS+ is only slightly smaller than that of the other two algorithms. Moreover, because VCGS is a further ranking of equivalent point groups based on the SLGS results, the enumeration results of these two algorithms are equal. As s increases, the enumeration number of SLGS and VCGS grows considerably, whereas, thanks to the pruning of Skyline layers, the enumeration result of VCGS+ is less affected by s.
        ReConstruct_DSG(layer', s)
        G ← UnitWise+(layer', s)
    else
        for each layer i (1 < i ≤ s) in DSG do
            for each layer j (j ≤ i) in DSG do
                layer'' ← add layer j to the layer''
            ReConstruct_DSG(layer'', s)
            ConstructDSG(layer', s)
            G ← UnitWise+(layer'', s)
            if |G| ≥ k then
                break;
            if i == s and |G| < k then
                break;
Algorithm 3: VCGS+.
In Figures 5 and 6, on two different data sets, we can see that as the data dimension λ grows, the running time of the algorithms increases and their efficiency decreases rapidly. When the dimension of the NBA data set rises to 5 and that of the NHL data set rises to 7, SLGS and VCGS are affected more severely. The reason is that as λ increases, the number of Skyline points on each Skyline layer increases dramatically. These two algorithms then need more time to calculate the Skyline point groups, so their efficiency drops. Compared with the other two algorithms, VCGS+ is more efficient and performs better on the NHL data set.

The Influence of Dataset's Size. In Figures 7 and 8, we can see that as the number of target data points n increases, the performance of the algorithms is relatively stable, so the influence of n is not obvious. The running time of the algorithms increases linearly with n. The main reason is that only the points on the first s Skyline layers are used when computing the group Skyline, and the number of these points is much smaller than the data size n. For different data sets, the impact of data size on the overall algorithm is different. The running time of VCGS+ is less than that of SLGS and VCGS, and the advantage of VCGS+ is more pronounced on the NHL data set. The pruning strategy of this algorithm can improve its efficiency by two to three times.
5.6. The Influence of the Number of Point Groups Returned.
In Figures 9 and 10, as the number of point groups in the result set k grows, the efficiency of the three algorithms varies steadily and linearly. The number of enumerated result point groups is greatly affected by the size of the point groups, the data dimension, and the data set size, but it is independent of the value of k. This can also be explained directly from the time complexity of the algorithm.

Conclusions
Aiming at the problem of large result set and low query efficiency in existing group Skyline query algorithms, the following results are obtained.
(1) To address the large result set and the large number of meaningless result point groups in existing group Skyline algorithms, the Top-k group Skyline query problem is introduced, and the SLGS algorithm based on Skyline layers is proposed to return the k optimal Skyline point groups. This algorithm exploits the structural property of Skyline layers that points on lower layers dominate points on higher layers, and gives a quantitative criterion for deciding which of two groups is better. Based on this criterion, the group Skyline results are ranked, and the top k results are returned.
(2) To resolve results that receive the same rank in the SLGS algorithm, a ranking strategy based on Skyline layers and vertex coverage is proposed. The vertex coverage of a point group is used as the basis of ranking, and the results with the same rank are further processed. The corresponding VCGS algorithm is proposed to sort all the results, which makes the ranking more accurate. Because the algorithm adopts a traversal strategy, it is inefficient. To improve users' satisfaction with the returned results, an improved algorithm VCGS+, based on VCGS, is proposed. This algorithm provides a pruning strategy for Skyline layers and avoids accessing most Skyline points. Only a few results need to be computed to find the top-k Skyline point groups, which reduces the number of enumerated results and the number of points that need to be traversed, and thus improves the efficiency of the algorithm. The experimental results show that the algorithm can improve the efficiency by about ten times.