Efficient Processing of Continuous Skyline Query over Smarter Traffic Data Stream for Cloud Computing

The analysis and processing of multisource real-time transportation data streams lay a foundation for the sensing, interconnection, integration, and real-time decision-making of smart transportation. The strong computing ability and effective mass data management offered by cloud computing make it feasible to handle continuous Skyline queries over massive, distributed, uncertain transportation data streams. In this paper, we give a layered data processing architecture for smart transportation and formalize the continuous Skyline query over smart transportation data streams. In addition, we propose the mMR-SUDS algorithm (a Skyline query algorithm for uncertain transportation stream data based on micro-batch in Map Reduce), which builds on sliding window division and the proposed architecture.


Introduction
Recently, tremendous changes have taken place in city transportation data sources, transportation data services, and information infrastructure. Traditional ITS (intelligent transport systems) show many defects in handling the higher-dimensional, space-time continuous data streams collected and passed back from massive sensor networks, and in the storage, processing, and analysis of big data. With the advent of computing technologies such as the Internet of things and cloud computing [1], smarter transportation [2] has emerged as a new concept of comprehensive transportation system. As shown in Figure 1, a smarter transportation system covers various aspects of transportation and is a complex and comprehensive system consisting of many subsystems. Analytical processing of multisource, real-time transportation data streams [3] is the basis for realizing perceptible smarter transportation with interconnection, integration, and real-time decision-making. Besides, such analytical processing is critical to establishing global sustainable transportation surveillance, dynamic transportation network optimization, automatic response to accidents, and integration of location-based transportation services.
With the rapid development of information technology, monitoring platforms for various types of transportation information management collect complex, massive transportation stream data, including video information [4,5] from cameras, monitoring information from sensors, vehicle positioning information, and so on. Hence, transportation stream data have diverse sources, wide varieties, different forms, and typical data-intensive processing characteristics. For example, by December 28, 2012, there were 8842 pieces of fixed transportation monitoring equipment in Beijing, and the dispatch center for transportation operational monitoring (TOCC) in Beijing alone updated over 3500 data items and replaced more than 20 thousand video pictures in real time. Operational applications of environmental sensor stations are shown in Figure 2. Real-time transportation data streams lay an important data foundation for road traffic flow control, decision analysis, and emergency response in a smarter transportation system. The Skyline [6] query, as a key data mining technology, is of great significance in multiconstrained decision support, city navigation, user preference queries, visualization of data mining, and so on under dynamic environments [7][8][9]. Hence, such a query fits the practical data stream processing applications of smarter transportation. In addition, the collection and analytical processing of transportation stream data are geographically distributed and are often influenced by uncertain sources such as wireless sensor networks, radio frequency identification, location-based services, moving object management, and so on. Thus, data objects in the data stream present uncertainty. Therefore, uncertain [10] real-time transportation data streams are characterized by difficult prediction, variability, rapid arrival, massive and unbounded volume, and so forth. Meanwhile, analytical processing of transportation stream data requires multiservice parallel processing and very high timeliness. In the environment of cloud computing, this paper combines the processing requirements for complex, parallel, and real-time transportation stream data and investigates a continuous Skyline query algorithm with low cost, rapid response, and efficient scalability based on a parallel processing framework for mass data. Compared with the traditional Skyline query, the Skyline query over uncertain transportation stream data faces the following challenges.
(1) In the computational process of a Skyline query on an uncertain data stream, both the dominance relations between objects and the Skyline probabilities need to be calculated. However, traditional strategies fail to perform this process directly. Moreover, Skyline query calculation is CPU intensive [11,12], so very high processing ability is required.
(2) Transportation stream data arrive continuously and must be processed immediately. So, when the data stream is too rapid and users pay attention to a great number of objects (that is, the sliding window [13] is very large), traditional centralized stream processing algorithms can hardly satisfy the query demand.
Cloud computing, with its high storage capacity and computing ability, can fully satisfy the application requirements of Skyline queries on mass data. The main contributions of this research are as follows. Section 2 introduces the processing architecture of stratified stream data of smarter transportation, background information, relevant work, and the formal description of the problem. Section 3 explains the design and optimization strategy of the mMR-SUDS algorithm. Experimental results are compared in Section 4, and Section 5 summarizes the research.

Processing Architecture of Stratified Transportation Stream Data.
In smarter transportation, the processing architecture of stratified transportation stream data is shown in Figure 3. The bottom layer is the front end of sensing equipment, consisting of N acquisition nodes for remote real-time data monitoring. The middle layer consists of M coordinator nodes connected to a high-speed network, while all transportation data processing centers are placed on the top layer, providing transportation data services such as control, analysis, early warning, and so on.

Relevant Work.
Early Skyline queries were commonly applied to centralized databases. Relevant research mainly focused on centralized algorithms such as the block-nested-loops (BNL) algorithm [6], the divide-and-conquer (D&C) algorithm [6], the sort-filter-Skyline (SFS) algorithm [14], the nearest neighbor (NN) algorithm [15], the branch-and-bound Skyline (BBS) algorithm [16], and the bitmap algorithm [17]. Jian et al. [18] first proposed Skyline query technology on uncertain data and presented two query algorithms: a bottom-up algorithm and a top-down algorithm. In terms of uncertain data representation, relevant research usually pays more attention to discrete data. Literature [19], working with uncertain data at the attribute level, proposed three constraint methods, namely uncertainty reduction, pairwise comparison, and adaptive bound tightening, to optimize Skyline query calculation.
In the field of Skyline queries over data streams, for continuous Skyline queries based on the sliding window model, literature [20] proposed the Lazy algorithm and the Eager algorithm, which improve space and time efficiency by cleaning data in advance. In addition, literature [21] investigated Skyline queries under the n-of-N data stream model in a sliding window and proposed a continuous n-of-N algorithm that improves space performance by defining "key domination." In the field of Skyline queries over uncertain data streams, the data model in literature [22] was a data set consisting of certain objects, each represented by a variable number of examples. The concept of Skyline probability was proposed based on the Skyline probability of the examples of each object. Hence, the data model in literature [23] was virtually a discrete version of uncertain attributes, while this paper focuses on the case of uncertain tuples. On the other hand, literature [24] concentrated on static datasets, while this paper concentrates on data streams. Moreover, aiming at efficient Skyline calculation over uncertain data streams, literature [13] proposed a Skyline query based on a probability threshold and used optimization methods such as Skyline candidate sets to execute continuous Skyline queries efficiently. In contrast, literature [25] presented a Skyline algorithm over probabilistic data streams. Based on a grid index with better adaptability, heuristic rules such as probability delimitation, stepwise refinement, elimination in advance, optional indemnity, and so on were employed to optimize the algorithm in both time and space. By comparison, literature [26] investigated the expectation of the Skyline probability and presented the relation between the probability threshold and the expectation of the Skyline probability.
In the field of distributed parallel Skyline queries, current research mainly focuses on static data. Literature [27] suggested that the overall query performance of a system could be improved by defining the execution order of Skyline queries on each server. In addition, the parallel distributed Skyline algorithms proposed in literature [28,29] divided the relevant sites into several groups by a data division method, and queries among groups were executed in parallel.
To meet the processing requirements of mass data, several existing studies combined Map Reduce technology with Skyline query algorithms. Literature [4] proposed a preview Skyline query algorithm that attempts to reduce the size of the input data in the Map task and the Reduce task through preview filtration. Thus, the performance of Skyline queries based on the Map Reduce framework was improved.

Data Stream.
In a formal way, a data stream is any ordered pair $(s, \Delta)$ where $s$ is a sequence of tuples and $\Delta$ is a sequence of positive real time intervals. For instance, the management system of road transportation streams contains a data stream with the following tuple model (see Figure 4).
Road Stream is defined as a data stream of the tuple model processed in the data stream management system. In the tuple model, attribute Road Stream denotes the name of the data stream, while Vehicle ID denotes the unique identifier of a vehicle. Moreover, X Way denotes the road section of a vehicle; X Pos denotes the location of a vehicle; Express Way denotes the expressway number; Speed denotes the current speed of a vehicle; Timestamp denotes the value that the system assigns, in order of arrival, when the information dispatched by a vehicle reaches the data stream system.

Skyline.
Definition 3. Given a dataset $D$ with $n$ instances that belong to $m$ uncertain objects and a probability threshold $q$, the instance-level probabilistic Skyline analysis returns all instances with Skyline probabilities at least $q$. That is, return the Skyline set $SKY_q(D)$ such that
$$SKY_q(D) = \{\, u \in D \mid \Pr\nolimits_{sky}(u) \ge q \,\}.$$
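The instance-level probabilistic Skyline can be illustrated with a minimal Python sketch. It assumes that smaller values are preferred in every dimension and that an uncertain object is a list of (point, probability) instances; the function names `dominates`, `skyline_probabilities`, and `probabilistic_skyline` are hypothetical and chosen only for this illustration. It is a brute-force reading of the definitions, not the paper's algorithm.

```python
def dominates(u, v):
    """u dominates v: no worse in every dimension and strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def skyline_probabilities(objects):
    """objects: list of uncertain objects, each a list of (point, probability) instances.

    Returns {(object_index, instance_index): Skyline probability of that instance}.
    """
    result = {}
    for j, obj in enumerate(objects):
        for k, (u, p_u) in enumerate(obj):
            prob = p_u  # probability that the instance itself exists
            for i, other in enumerate(objects):
                if i == j:
                    continue  # instances of the same object do not compete
                # probability that no instance of object i dominating u exists
                prob *= 1.0 - sum(p_v for v, p_v in other if dominates(v, u))
            result[(j, k)] = prob
    return result


def probabilistic_skyline(objects, q):
    """Keep only the instances whose Skyline probability is at least the threshold q."""
    return {key: p for key, p in skyline_probabilities(objects).items() if p >= q}


# Two uncertain objects: the first has two instances, the second has one.
objects = [
    [((1, 1), 0.5), ((3, 3), 0.5)],
    [((2, 2), 0.6)],
]
print(probabilistic_skyline(objects, 0.25))
```

The brute-force loop compares every instance with every instance of every other object, which is the cost that the parallel scheme in the following sections tries to spread over several nodes.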

Division of Sliding Window.
According to the architecture of distributed transportation stream data processing, coordinator nodes collect the continuous uncertain data streams monitored by the remote monitoring nodes. In this paper, an interleaved division method based on the count-based sliding window model is used to divide the whole sliding window, so that the data in the large sliding window of the uncertain data stream are partitioned effectively. The data are then distributed to the parallel computational nodes so that each node corresponds to a valid part of the whole sliding window. The basic idea is as follows: coordinator nodes dispatch arriving data to the parallel nodes in turn, and each parallel node maintains a count-based sliding window part. Thereby, the sliding window parts on all parallel nodes, interleaved in turn, logically correspond to the whole sliding window of the uncertain data stream. The corresponding relations are shown in Figures 5 and 6.
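The interleaved division described above can be sketched in a few lines of Python. The class name `WindowPartitioner` and its methods are hypothetical, and the sketch assumes that the global window length is a multiple of the number of nodes so that every node keeps a local window of equal length.

```python
from collections import deque


class WindowPartitioner:
    """Round-robin division of a count-based sliding window over several nodes."""

    def __init__(self, window_length, num_nodes):
        if window_length % num_nodes != 0:
            raise ValueError("window_length must be a multiple of num_nodes")
        self.local_length = window_length // num_nodes
        self.parts = [deque() for _ in range(num_nodes)]
        self.seq = 0  # global arrival counter

    def dispatch(self, item):
        """Send the item to the next node in turn.

        Returns (node_index, expired_item); expired_item is None while the
        local window of that node is not yet full.
        """
        node = self.seq % len(self.parts)
        part = self.parts[node]
        expired = part.popleft()[1] if len(part) == self.local_length else None
        part.append((self.seq, item))
        self.seq += 1
        return node, expired

    def global_window(self):
        """Merge the local parts back into arrival order."""
        entries = [entry for part in self.parts for entry in part]
        return [item for _, item in sorted(entries, key=lambda e: e[0])]


partitioner = WindowPartitioner(window_length=6, num_nodes=3)
for value in range(10):
    partitioner.dispatch(value)
print(partitioner.global_window())  # the six most recent items, in arrival order
```

Because arrivals are dealt out in turn, the item evicted from a full local window is always the oldest item of the global window, so the union of the local parts equals the global count-based window.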

Processing Framework of Transportation Stream Data.
Based on the sliding window division and the micro-batch in Map Reduce model, the processing framework of transportation stream data is designed as shown in Figure 7. The framework consists of four types of nodes: Coordinator nodes, which receive the input data stream and dispatch data to Map-PE nodes (map-processing elements); Map-PE nodes, denoted $P_1, P_2, \ldots, P_n$, which maintain the data refresh in their local sliding windows, calculate Skyline probabilities, and can communicate with each other; Reduce-Q nodes (reduce-query), which receive the Skyline results from each computational node; and Master nodes, which maintain the status of Map-PE nodes and Reduce-Q nodes. Besides, $u$, $v$, and $w$ denote the uncertain data under investigation. According to the parallel data stream processing framework based on sliding window division, the Skyline query process over uncertain smarter transportation stream data is as follows.
(1) When uncertain data $u$ arrives at Coordinator nodes, Coordinator nodes dispatch $u$ to Map-PE node $P_1$.
(2) $P_1$ maintains the change in Skyline probability caused by overdue data $v$ and incoming data $u$ in its window. Then, $P_1$ dispatches overdue data $v$ and newly incoming data $u$ to the other Map-PE nodes.
(5) When new uncertain data $w$ arrives at Coordinator nodes, Coordinator nodes dispatch $w$ to $P_2$, which performs the above process in turn.

mMR-SUDS Algorithm.
The basic conception of the Skyline query algorithm on uncertain transportation stream data based on the micro-batch in Map Reduce framework is as follows.
The task of updating the Skyline probability of uncertain transportation data tuples in the whole sliding window is distributed to the parallel nodes. Then, parallelism among Map-PE nodes is employed to improve the operational efficiency of the overall system. The algorithm realization on each type of node is discussed in this section.
Coordinator nodes are responsible for data cache and data dispatch. The processing algorithm on Coordinator nodes is as follows.
Input. Uncertain data stream; response messages of all parallel nodes.
Output. Data block of uncertain data.
(1) Coordinator nodes receive and then cache the incoming uncertain transportation stream data.
(2) If Coordinator nodes receive a response message from a Map-PE node, the following results will be presented.
Reduce-Q nodes are responsible for receiving Skyline results from all parallel Map-PE nodes. The processing algorithm on Reduce-Q nodes is as follows.
Input.Skyline results dispatched from all Map-PE nodes.
Output.Global Skyline results.
(1) Skyline results from all Map-PE nodes are received and cached.
(2) Received information is synchronized and global Skyline results are output.
Finally, taking $P_i$ as an example, the processing algorithm on parallel Map-PE nodes is presented as follows.
Input. Data block of uncertain data; feedbacks from Map-PE nodes.
(2) If newly incoming data from Coordinator nodes are received, the results are as follows.
(5) Otherwise, if node $P_i$ receives an unrecognized command, an error message is presented.

The Optimization of Algorithm
(1) Reduction of Window Scanning Times. An analysis of the processing algorithm on the parallel computational nodes shows that the window is scanned three times, in procedures 2.3, 2.4, and 2.5, respectively. Likewise, the window is scanned three times in procedures 3.3, 3.4, and 3.5. To reduce the number of window scans, the three scans can be integrated into a single scan in which the data in the window are compared with both the new data and the overdue data.
Thus, the processing performance of the algorithm is improved by reducing the number of window scans.
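A merged scan can be sketched as follows. The exact contents of the three original scans are not reproduced in the text, so this Python sketch assumes a common tuple-level update rule: each window tuple keeps its existence probability `p`, a factor `p_old` for earlier dominators, and a factor `p_new` for later dominators. The function and field names are hypothetical, and existence probabilities are assumed to be strictly below 1 so that the division is defined.

```python
def dominates(u, v):
    """u dominates v: no worse in every dimension and strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def single_pass_update(window, new_point, new_p, expired=None):
    """Handle one arrival (and optionally one expiry) with a single window scan.

    window:  list of dicts with keys 'point', 'p', 'p_old', 'p_new'; the expired
             tuple must already have been removed from it.
    expired: the removed tuple (same dict layout) or None.
    """
    p_old_of_new = 1.0
    for t in window:
        # (a) the expired tuple no longer counts against tuples it dominated
        if expired is not None and dominates(expired["point"], t["point"]):
            t["p_old"] /= 1.0 - expired["p"]
        # (b) the new tuple may dominate a tuple already in the window
        if dominates(new_point, t["point"]):
            t["p_new"] *= 1.0 - new_p
        # (c) a tuple already in the window may dominate the new tuple
        if dominates(t["point"], new_point):
            p_old_of_new *= 1.0 - t["p"]
    entry = {"point": new_point, "p": new_p, "p_old": p_old_of_new, "p_new": 1.0}
    window.append(entry)
    return entry


def skyline_probability(t):
    """Product of existence probability and the two non-dominance factors."""
    return t["p"] * t["p_old"] * t["p_new"]


window = []
single_pass_update(window, (2, 2), 0.5)
single_pass_update(window, (1, 1), 0.4)
third = single_pass_update(window, (3, 3), 0.5)
oldest = window.pop(0)  # the tuple (2, 2) leaves the window
single_pass_update(window, (0, 0), 0.2, expired=oldest)
print(skyline_probability(third))
```

All three comparisons share the same loop body, so each window tuple is touched once per update instead of three times.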
(2) Intermediate Filtration. The computational process of the Skyline probability shows that
$$P_{sky}(a) = P(a) \times P_{old}(a) \times P_{new}(a).$$
That is, the Skyline probability of tuple $a$, $P_{sky}(a)$, equals the product of three probabilities: the existence probability of tuple $a$, $P(a)$; the probability that $a$ is not dominated by data arriving earlier, $P_{old}(a)$; and the probability that $a$ is not dominated by data arriving later, $P_{new}(a)$. As new data arrive and old data expire, $P_{old}(a)$ increases continuously, while $P_{new}(a)$ decreases constantly. Moreover, $P(a)$, $P_{old}(a)$, and $P_{new}(a)$ are all in the interval (0, 1) throughout. Therefore, if $P_{new}(a) < q$, where $q$ is the probability threshold, then $P_{sky}(a) < q$. In addition, during the life cycle of $a$ (the time when $a$ is in the sliding window), $P_{sky}(a) < q$ holds permanently, so it is unnecessary to calculate $P_{sky}(a)$. Hence, through intermediate filtration, the number of comparisons is reduced and the algorithm runs faster, because the result set is far smaller than the source dataset.
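The pruning condition rests on a short bound, written out here as a sketch. The symbols are assumptions made for this illustration: $q$ stands for the probability threshold, and $P_{sky}$, $P$, $P_{old}$, and $P_{new}$ stand for the Skyline probability and its three factors.

```latex
\begin{aligned}
P_{sky}(a) &= P(a)\cdot P_{old}(a)\cdot P_{new}(a) \\
           &\le P_{new}(a) && \text{since } P(a)\le 1 \text{ and } P_{old}(a)\le 1 \\
           &< q            && \text{once } P_{new}(a) < q .
\end{aligned}
```

Since $P_{new}(a)$ never increases while $a$ stays in the window, the bound keeps holding for the rest of the life cycle of $a$, which is why such a tuple can be dropped from further Skyline probability calculation.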
(3) Decrease of Idle Waiting Time of Nodes. It is presumed that all Map-PE nodes have the same processing ability. $t_c$ denotes the average time that a node takes to communicate once with another node, while $t_u$ denotes the average time of one calculation update of Skyline probability, excluding the consolidated calculation. Besides, $t_m$ denotes the average time of the consolidated calculation. The three satisfy $t_c < t_u < t_m$. In the basic scheme, the calculation period of a Skyline probability update caused by one data update is shown in Figure 8.
Figure 8 indicates that, when $P_1$ receives newly incoming data, it first updates the local Skyline probability and then dispatches the updated information to the other Map-PE nodes. Therefore, $P_1$ is completely idle before the consolidated calculation, and its idle waiting time is $t_u + t_c$. Similarly, the idle waiting time of the other nodes is $t_m + t_u + 2t_c$. So, in this condition, it takes $t_m + 2t_u + 3t_c$ to complete an entire calculation period.
To decrease the idle waiting time of all Map-PE nodes, a Map-PE node that receives newly incoming data can first dispatch the updated information to the other parallel nodes and then calculate the local Skyline probability update. The revised calculation period is illustrated in Figure 9.
Figure 9 shows that, in the optimized scheme, the idle waiting time of $P_1$ is $t_c$ and that of the other parallel nodes is $t_m + 2t_c$. As a result, it takes $t_m + t_u + 3t_c$ to achieve a complete calculation period. Therefore, compared with the basic scheme, the optimized scheme saves $t_u$ in a calculation period.
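As a rough check of the period arithmetic, the Python sketch below adds up the phases of one update cycle under an assumed timeline: a dispatch to the other nodes, a feedback message back to the receiving node, a consolidated calculation, and one final message to output the result. The names `t_c` (one communication), `t_u` (one local update), and `t_m` (consolidated calculation) are illustrative, as are the sample values.

```python
def basic_period(t_c, t_u, t_m):
    """Basic scheme: the receiving node updates locally before it dispatches."""
    local_update = t_u          # receiving node updates its own window first
    dispatch = t_c              # then forwards the data to the other nodes
    remote_update = t_u         # other nodes update only after the dispatch
    feedback = t_c              # other nodes report back
    merge = t_m                 # consolidated calculation
    output = t_c                # result is sent to the query node
    return local_update + dispatch + remote_update + feedback + merge + output


def optimized_period(t_c, t_u, t_m):
    """Optimized scheme: dispatch first, so all local updates run in parallel."""
    dispatch = t_c
    parallel_update = t_u       # receiving node and other nodes update together
    feedback = t_c
    merge = t_m
    output = t_c
    return dispatch + parallel_update + feedback + merge + output


t_c, t_u, t_m = 1.0, 2.0, 5.0   # sample values satisfying t_c < t_u < t_m
saving = basic_period(t_c, t_u, t_m) - optimized_period(t_c, t_u, t_m)
print(saving == t_u)
```

Under this timeline the only difference between the two schemes is that one local update is hidden behind the others, so the saving per period equals one update time.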

Experimental Evaluation
The algorithm in this paper was implemented in Java, and the experiments were conducted in a practical data center environment. Every processing node was configured with a 2.0 GHz Pentium 4 CPU, 2 GB of DDR memory, and the Ubuntu operating system. Synthetic data of the independently distributed type used in the literature was adopted in the experimental tests, and the existence probability of tuples followed a Gaussian distribution. In the synthetic data, the values in all dimensions are mutually independent and uniformly distributed in the interval [0, 1]. To test the real processing performance of the mMR-SUDS algorithm, this paper presumed that Coordinator nodes cache numerous data tuples. When parallel nodes finish processing a batch of data and send a data request to Coordinator nodes, Coordinator nodes dispatch new stream data to the parallel nodes for processing. In addition, the probability threshold in the experiments was set to 0.3, and the window length, measured by the number of data tuples contained in the window, was set to 10000, 100000, 500000, and 1000000, respectively. The ranges of the other experimental parameters were as follows: the data dimension ranged from 2 to 6; the size of the transmission data block (the number of data items) was set to 1, 10, 100, and 1000, respectively; and the number of nodes participating in the calculation was set to 1, 2, 4, 8, and 16, respectively. Each group of experiments was conducted 10 times, and the average value was taken as the result. In the comparison experiments, the Base algorithm, a single machine algorithm, includes two nodes: a data cache node and a computational node. The data cache node is responsible for data cache and data dispatch to the computational node, while the computational node is responsible for the maintenance of the sliding window and Skyline calculation. Besides, the computational node adopts the method of circular dominance comparison. That is, once data arrive or expire, the computational node compares the incoming or overdue data with all the data in the sliding window and then updates the Skyline probability.
Based on the experimental environment and experimental data mentioned above, the performance of the mMR-SUDS algorithm was tested under different sizes of transmission data block, window lengths, data dimensions, and numbers of nodes.

Influence Tests of Transmission Data Block.
The uncertain data stream is transmitted in data blocks between Coordinator nodes and Map-PE nodes as well as among Map-PE nodes. Therefore, the size of the transmission data block has a certain influence on the algorithm. To evaluate this influence, this group of experiments tested the algorithm performance under different sizes of transmission block. In the experiments, the transmission data block was set to 1, 10, 100, and 1000, respectively; the data dimension was set to 2, while the window length was set to 1000000. Moreover, 16 parallel Map-PE nodes participated in the calculation.
Experimental results are shown in Figure 10. As the transmission data block grows, the processing speed of the mMR-SUDS algorithm first increases and then decreases. The main reasons are as follows: when the data block is small, communication overhead increases due to frequent data transmission, while when the transmission data block is large, computation cost increases due to the growing complexity of dominance comparison within the block. In conclusion, when the size of the transmission data block takes the middle value of 100, the algorithm provides good processing performance.

Tests of Window Scalability.
In this group of experiments, the data dimension was set to 2 and the size of the transmission data block was set to 100, while the window length ranged from 10000 to 1000000. To compare the performance of the mMR-SUDS algorithm with that of the Base algorithm, 16 parallel Map-PE nodes participated in the calculation.
Experimental results are illustrated in Figure 11. As the window length increases, system processing performance declines rapidly. When the window length is 10000, the performance of the Base algorithm is even better than that of the mMR-SUDS algorithm. The main reasons are as follows. When the window length is small, the computing performance of a single machine fully satisfies the requirement of query processing. In the parallel algorithm, however, although each node participating in the calculation completes its query processing rapidly, the parallel computing system as a whole spends much time on communication, synchronization, and so on. When the window length is 100000 or more, a single computational node can no longer satisfy the performance requirement of query processing. For the parallel computing system as a whole, time is then mainly spent on calculation, and the system begins to show the advantage of parallelism.

Tests of Dimension Scalability.
To compare the dimension scalability of the mMR-SUDS algorithm with that of the Base algorithm, the window length was set to 1000000 and there were 16 parallel processing nodes in the system. In addition, the data dimension took values in the interval [2, 11] and the size of the transmission data block was set to 100 in this group of experiments. Figure 12 shows the experimental results. As the data dimension increases, the processing speed of both the mMR-SUDS algorithm and the Base algorithm declines slowly, but the processing speed of the mMR-SUDS algorithm remains about 12 times higher than that of the Base algorithm throughout. All in all, the mMR-SUDS algorithm provides better dimension scalability.

Parallel Scalability Tests.
To evaluate the parallel scalability of the mMR-SUDS algorithm, this group of experiments tested the processing performance of the algorithm with different numbers of nodes. In the experiments, the number of parallel nodes took the values of 1, 2, 4, 8, and 16, respectively, and the total window length was set to 1000000. Moreover, the data dimension was set to 2, while the size of the transmission data block was set to 100.
Experimental results are illustrated in Figure 13. As the number of nodes increases, the processing speed of the mMR-SUDS algorithm keeps increasing, but the gain gradually diminishes. The main reasons are as follows: with the increasing number of nodes, the window length on each node decreases gradually. Hence, the computation cost of each computational node gradually declines, while communication overhead gradually increases, which influences system processing performance. With 16 nodes, the processing ability of the mMR-SUDS algorithm was about 12 times that of the single machine algorithm, which is well below the theoretical optimum of 16 times. With 2 nodes, the processing ability of the mMR-SUDS algorithm was closest to the theoretical optimum, at nearly twice that of the single machine algorithm.
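The gap to the ideal speedup can be restated as parallel efficiency, a standard measure that is added here only for illustration. With $S(n)$ denoting the speedup over the single machine algorithm on $n$ nodes, and reusing the approximate speedups reported above for 16 and 2 nodes:

```latex
E(n) = \frac{S(n)}{n}, \qquad
E(16) \approx \frac{12}{16} = 0.75, \qquad
E(2) \approx \frac{2}{2} = 1 .
```

In other words, roughly a quarter of the aggregate capacity at 16 nodes goes to communication and synchronization rather than to Skyline calculation, while at 2 nodes almost none does.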

Conclusion
To meet the Skyline query requirements of real-time uncertain data streams of smarter transportation with high volume and large sliding windows in the environment of cloud computing, this paper proposed mMR-SUDS, a Skyline query algorithm over uncertain transportation stream data based on the micro-batch in Map Reduce framework. By dividing the data in the sliding window, the algorithm transforms the centralized processing of the whole global sliding window into parallel processing by many nodes, each on its corresponding window part. This transformation effectively improves overall query processing performance. Experimental results show that the mMR-SUDS algorithm offers not only high efficiency but also good scalability and load balancing. Therefore, the algorithm can satisfy the processing and analysis requirements of various real-time transportation stream data.
Within the parallel framework based on sliding window division, future research will further optimize the processing algorithm and improve its performance using index structures such as grids, R-trees, and so on. Meanwhile, the scope of uncertain data will be expanded to investigate Skyline query processing algorithms over uncertain transportation stream data at the attribute level.

(1) Processing architecture of stratified transportation stream data is demonstrated.
(2) In the environment of cloud computing, this paper proposes the issue of continuous Skyline query on mass distributed uncertain transportation data stream and provides formal description.
(3) This research develops an mMR-SUDS algorithm based on sliding window division and the architecture proposed.

Figure 3: Processing architecture of stratified transportation stream data.

Figure 8: The computing cycle in the basic scheme.

Figure 9: The computing cycle in the improved scheme.

Figure 11: The effect of the length of sliding window.

Figure 12: The effect of the dimension of data.

Figure 13: The effect of the number of processing nodes.
Definition 1. A point $u \in D$ is said to dominate another point $v \in D$, denoted as $u \prec v$, if (1) in every dimension $i$, $u_i \le v_i$; (2) in at least one dimension $j$, $u_j < v_j$. The Skyline is a set of points $SKY(D) \subseteq D$ which are not dominated by any other point. The points in $SKY(D)$ are called Skyline points.
Definition 2. The Skyline probability of an instance $u$, that is, $\Pr_{sky}(u)$, is the probability that $u$ exists and no instance of other uncertain objects that dominates $u$ exists. Let $m$ be the total number of uncertain objects and let $u \in U_j$; we have
$$\Pr\nolimits_{sky}(u) = \Pr(u) \times \prod_{i=1,\, i \ne j}^{m} \Bigl(1 - \sum_{v \in U_i,\; v \prec u} \Pr(v)\Bigr).$$
(3) Each Map-PE node maintains the change in Skyline probability resulting from overdue data $v$ and newly incoming data $u$ in its own window. This type of node is only in charge of updating Skyline probabilities and sending the updated results to Reduce-Q nodes. All the parallel nodes then send feedback about the Skyline probability of data $u$ in the corresponding node to $P_1$.
(4) Taking the feedback from all nodes about the Skyline probability of data $u$ into account, $P_1$ calculates the global Skyline probability of data $u$ and outputs the result to the query nodes.