CC_TRS: Continuous Clustering of Trajectory Stream Data Based on Micro Cluster Life

The rapid spread of positioning devices has led to the generation of massive amounts of spatiotemporal trajectory data. In some scenarios, spatiotemporal data are received as a stream. Clustering of stream data is beneficial for applications such as traffic management and weather forecasting. In this article, an algorithm for Continuous Clustering of Trajectory Stream Data Based on Micro Cluster Life (CC_TRS) is proposed. The algorithm consists of two phases. In the online phase, temporal micro clusters store summarized spatiotemporal information for each group of similar segments. Clustering in the online phase is based on the lifetime of each temporal micro cluster rather than on the time window technique, which divides stream data into time bins and clusters each bin separately. In the offline phase, a density based clustering approach generates macro clusters from the temporal micro clusters. Evaluation on real data sets shows the efficiency and effectiveness of the proposed algorithm and indicates that it is an efficient alternative to the time window technique.


Introduction
Recently, moving objects such as vehicles and animals have been equipped with GPS devices; these devices leave digital traces of position (latitude, longitude) at each moment. The low price of GPS devices has led to an exponential growth of trajectory data. Analysis of trajectory data yields crucial information that helps researchers find solutions to many challenges such as traffic congestion [1]. One of the most important analysis tools is clustering, which aims to aggregate data into clusters such that the similarity among members of the same cluster is high and the similarity between members of different clusters is very low [2,3]. Clustering of stream data is more complex than clustering of classical data, since it faces a set of challenges: (i) single pass processing due to the continuous arrival of data, (ii) unbounded size of the data stream with limited memory space and time, and (iii) evolving data, where the model underlying the data stream may change over time. Thus the clustering algorithm should be able to detect such changes [3,4]. Many data stream clustering algorithms depend on the object based paradigm, which consists of two phases: an online phase and an offline phase.
The online phase stores summarized information about the data stream in micro clusters, which act as representatives of the raw data stream. When the size of the micro clusters exceeds the memory limit, similar micro clusters are merged to reduce memory usage. The offline phase is invoked on user demand; it uses a density based clustering approach to cluster the representative lines of the micro clusters and present the current results for the stream data.
Problem Statement. Many existing algorithms such as TCMM and ConTraClu exploit the time window technique to incrementally cluster trajectory data streams. The time window technique partitions trajectory stream data into equal temporal periods (time bins or time stamps) and clusters each period separately, as illustrated in Figure 1. Starting clustering from scratch in each time bin leads to the following. (i) Clustering quality is disturbed in the border area between two adjacent time bins, especially if that area is dense with trajectory segments, since the clustering process creates new micro clusters ($MC_j$) for some segments at the start of each time bin $i+1$ even though these segments are very close (within the distance threshold) to micro clusters ($MC_i$) at the end of the previous time bin $i$. (ii) It is true that the TCMM algorithm combines some of $MC_i$ and $MC_j$ during the merging stage to reduce memory space, but this is time consuming. In addition to these problems, the TCMM framework merges similar micro clusters when their size exceeds a given memory space, and the merging process does not maintain temporal information. That is inconsistent with the complexity of a free moving object, since such an object may visit the same spatial area many times in different periods of time, as illustrated in Figure 2.
In this article, the Continuous Clustering of Trajectory Stream Data Based on Micro Cluster Life (CC_TRS) algorithm is proposed. The algorithm assigns a lifetime to each newly created micro cluster, and newly arriving segments affect only the nonexpired clusters; for example, any segment at the current time (dashed line in Figure 3) affects temporal micro clusters B and C and ignores A, since A is expired. Micro Cluster Life is a continuous clustering technique; therefore there is no need to divide stream data into time bins and start clustering from scratch as in the time window technique. The CC_TRS algorithm consists of two phases: micro clustering (online phase) and macro clustering (offline phase). In the online phase, the temporal micro cluster (TMC) data structure is defined to store summarized information for each group of similar segments; a TMC is similar to the micro cluster data structure (MC) in the TCMM framework [5] except that it has additional temporal features that describe its temporal existence. When the size of the TMCs exceeds a given memory space, similar TMCs are merged (spatiotemporally) to reduce memory usage. In the offline phase, any density based approach can be used to cluster the temporal micro clusters into macro clusters in response to a user request. Note that CC_TRS is intended for clustering the trajectories of free moving objects.
The main contributions of this paper are as follows:
(i) Propose the concept of the temporal micro cluster (TMC): a TMC exists for a period of time starting from its creation time, and the clustering task takes a TMC into account only as long as it exists.
(ii) Define a new data structure for the TMC.
(iii) Evaluate the proposed algorithm on real data sets, showing that it improves cluster quality and reduces execution time compared with existing algorithms.
The rest of this article is organized as follows. Section 2 presents related work on data stream clustering algorithms. Section 3 presents the problem definitions. Section 4 presents the proposed CC_TRS algorithm. Section 5 presents the performance evaluation. Finally, Section 6 concludes the article.

Related Works
Trajectory stream clustering aims to find representative paths and common tendencies shared by groups of moving objects. Numerous clustering methods have been presented for static trajectory data sets; these methods can be classified into five main categories [6,7]: spatial based clustering [8], time dependent clustering [9], partition and group based clustering [10], uncertain trajectory clustering [11], and semantic trajectory clustering [12].
Much research has been conducted on clustering of stream data. Aggarwal et al. [13] proposed the CluStream framework for clustering evolving data streams; CluStream uses the concept of a pyramidal time frame in conjunction with a micro clustering approach. However, the CluStream framework does not handle trajectory stream data. Elnekave et al. [14] presented an incremental clustering algorithm for finding evolving groups of similar mobile objects in spatiotemporal data. In this algorithm, each trajectory is represented by a set of minimal bounding boxes (MBBs), and the overlap between the MBBs of two trajectories represents the similarity between them. The algorithm uses a modified version of the incremental k-means algorithm to cluster moving object trajectories. Jensen et al. [15] presented a disk-based algorithm for continuous clustering of moving objects. The algorithm employs a clustering feature structure that can be updated incrementally. A moving object may be deleted from or inserted into a moving cluster during a period of time. The approach then merges and splits clusters by monitoring their compactness. Li et al. [5] suggested the TCMM framework for incremental trajectory clustering, which consists of two parts: online micro clustering and offline macro clustering. Online micro clustering stores statistical information about similar trajectory segments in a cluster feature (CF) data structure and updates the CFs when a new batch of segments is added. Similar CFs are merged to address the memory limitation. Offline clustering is performed on the set of micro clusters using density based clustering when the user requests the clustering results. Some studies use optimization strategies such as indexes or pruning to minimize search and enhance the efficiency of clustering. Yu et al. [16,17] proposed the ConTraClu algorithm to cluster continuous high speed trajectory data streams and discover moving patterns such as flocks. The algorithm consists of online clustering of trajectory segments using a density based approach and an updating process for closed clusters that depends on a bi-Tree index. Da Silva et al. [18,19] proposed an incremental algorithm for trajectory data streams. The algorithm uses a micro group structure to track moving objects and their evolution over consecutive time windows. A micro group describes the relationship among moving objects and evolves (merges or splits) in the next time period. Mao et al. [2] proposed the two-stage framework TSCluWin over the sliding window model. During the first stage, sufficient summarized information about the micro clusters is stored and maintained continuously in an EF data structure. During the second stage, a small number of macro clusters are produced from the micro clusters. Some rather different approaches also deserve mention. Costa et al. [20] interpret a trajectory as a discrete time signal and use the Fourier transform to measure the similarity between two trajectories.

Problem Definition
In this section, we will define some notations.
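Definition 1 (temporal micro cluster (TMC)). A vector that summarizes the spatiotemporal features of a set of similar directed segments. A TMC has the form ($LS_{cen}$, $SS_{cen}$, $LS_{\theta}$, $SS_{\theta}$, $LS_{len}$, $SS_{len}$, $N$, $t_c$, $s$), where $N$ is the number of line segments, $t_c$ is the creation time, and $s$ is the status (expired or nonexpired). Note that the lifespan of a TMC starts at TMC.$t_c$ and ends at TMC.$t_c$ + Ω, where Ω is a predefined time threshold.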
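Definition 2 (representative line of TMC). It represents the spatial average of the TMC members. The start point $(x_s, y_s)$ and end point $(x_e, y_e)$ of the representative line of a TMC can be calculated from its features:
$$x_s = x_{cen} - \frac{len}{2}\cos\theta, \quad y_s = y_{cen} - \frac{len}{2}\sin\theta, \quad x_e = x_{cen} + \frac{len}{2}\cos\theta, \quad y_e = y_{cen} + \frac{len}{2}\sin\theta,$$
where $x_{cen} = LS_{cen_x}/N$, $y_{cen} = LS_{cen_y}/N$, $\theta = LS_{\theta}/N$, and $len = LS_{len}/N$.
Definition 3. The distance between the representative line of a TMC ($L^*_i$) and a trajectory segment ($L_j$) is equal to the sum of three components: center distance ($d_{cen}$), angle distance ($d_{\theta}$), and parallel distance ($d_{\parallel}$), as illustrated in Figure 4:
$$d(L^*_i, L_j) = d_{cen} + d_{\theta} + d_{\parallel},$$
where $l_{\parallel 1}$ and $l_{\parallel 2}$, from which $d_{\parallel}$ is computed, are the Euclidean distances between the two point pairs marked in Figure 4, which are defined from the end points of the line segments.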
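As an illustration of Definition 3, the following minimal Python sketch computes a three-component distance between a representative line and a segment. The component formulas are an assumption: they follow the TRACLUS/TCMM-style definitions commonly used for directed segments and are not taken from this article. The function name `segment_distance` and the tuple layout are hypothetical.

```python
import math

def segment_distance(rep, seg):
    """Sum of center, angle, and parallel distance between two segments.

    rep and seg are ((x_start, y_start), (x_end, y_end)).
    ASSUMPTION: component formulas follow TRACLUS/TCMM-style definitions.
    """
    (rs, re_), (ss, se) = rep, seg

    # Center distance: Euclidean distance between the two midpoints.
    rc = ((rs[0] + re_[0]) / 2, (rs[1] + re_[1]) / 2)
    sc = ((ss[0] + se[0]) / 2, (ss[1] + se[1]) / 2)
    d_cen = math.dist(rc, sc)

    # Direction vectors and lengths of both segments.
    rv = (re_[0] - rs[0], re_[1] - rs[1])
    sv = (se[0] - ss[0], se[1] - ss[1])
    r_len, s_len = math.hypot(*rv), math.hypot(*sv)

    # Angle distance: |seg| * sin(theta) below 90 degrees, else |seg|.
    if r_len == 0 or s_len == 0:
        d_ang = 0.0
    else:
        cos_t = (rv[0] * sv[0] + rv[1] * sv[1]) / (r_len * s_len)
        cos_t = max(-1.0, min(1.0, cos_t))
        d_ang = s_len * math.sqrt(1 - cos_t ** 2) if cos_t > 0 else s_len

    def project(p):
        # Orthogonal projection of point p onto the line through rep.
        if r_len == 0:
            return rs
        t = ((p[0] - rs[0]) * rv[0] + (p[1] - rs[1]) * rv[1]) / r_len ** 2
        return (rs[0] + t * rv[0], rs[1] + t * rv[1])

    # Parallel distance: smaller of the two projection-to-end-point gaps.
    l_par1 = math.dist(project(ss), rs)
    l_par2 = math.dist(project(se), re_)
    d_par = min(l_par1, l_par2)

    return d_cen + d_ang + d_par
```

For two parallel horizontal segments of equal extent that are 3 units apart, only the center component is nonzero, so the sketch returns 3.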
Definition 4 (temporal micro cluster extent). The TMC extent is an indicator of its spatial tightness. The extent of a TMC comprises three components, $extent_{cen}$, $extent_{ang}$, and $extent_{len}$, since the representation of a TMC contains these three parts, as illustrated in Figure 5. Each extent is a standard deviation that can be computed from the corresponding LS and SS:
$$extent_a = \sqrt{\frac{SS_a}{N} - \left(\frac{LS_a}{N}\right)^2},$$
where $N$ is the number of line segments in the temporal micro cluster and $a$ represents center, length, or angle.
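The definitions above can be sketched as a small Python data structure. This is a minimal sketch under stated assumptions, not the authors' implementation: the field names, the choice to store the center sums as separate x and y values, and the decision to derive the expired status from the creation time instead of storing a flag are all hypothetical. Averaging angles arithmetically also ignores wraparound at 2π.

```python
import math
from dataclasses import dataclass

@dataclass
class TMC:
    created: float        # creation time of the cluster
    n: int = 0            # number of member segments
    ls_cx: float = 0.0    # linear sum of center x
    ls_cy: float = 0.0    # linear sum of center y
    ss_cen: float = 0.0   # square sum of centers (x^2 + y^2)
    ls_th: float = 0.0    # linear sum of angles (radians)
    ss_th: float = 0.0    # square sum of angles
    ls_len: float = 0.0   # linear sum of lengths
    ss_len: float = 0.0   # square sum of lengths

    def insert(self, cx, cy, theta, length):
        """Add one segment given its center, angle, and length."""
        self.n += 1
        self.ls_cx += cx
        self.ls_cy += cy
        self.ss_cen += cx * cx + cy * cy
        self.ls_th += theta
        self.ss_th += theta * theta
        self.ls_len += length
        self.ss_len += length * length

    def is_alive(self, now, omega):
        """Nonexpired while now lies inside [created, created + omega)."""
        return now - self.created < omega

    def representative_line(self):
        """Start and end point built from mean center, angle, and length."""
        cx, cy = self.ls_cx / self.n, self.ls_cy / self.n
        theta, length = self.ls_th / self.n, self.ls_len / self.n
        dx = math.cos(theta) * length / 2
        dy = math.sin(theta) * length / 2
        return (cx - dx, cy - dy), (cx + dx, cy + dy)

    def extents(self):
        """Standard deviations (center, angle, length) from LS and SS."""
        n = self.n
        var_cen = self.ss_cen / n - ((self.ls_cx / n) ** 2 + (self.ls_cy / n) ** 2)
        var_th = self.ss_th / n - (self.ls_th / n) ** 2
        var_len = self.ss_len / n - (self.ls_len / n) ** 2
        # Clamp tiny negative values caused by floating point rounding.
        return tuple(math.sqrt(max(v, 0.0)) for v in (var_cen, var_th, var_len))
```

Because only sums are stored, inserting a segment costs constant time and the raw segments never need to be kept, which is the point of a micro cluster summary.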

CC_TRS Algorithm
The core idea of the CC_TRS algorithm is to specify a lifetime (a predefined threshold) for each newly created TMC; a TMC can take part in the clustering task only during its lifetime.
When a new trajectory segment $L_j$ arrives, the CC_TRS algorithm finds the closest nonexpired $TMC_i$, since expired TMCs are already far, spatially or temporally or both, from newly arriving segments. If the distance between $TMC_i$ and $L_j$ is less than a user-defined distance threshold ($d_{max}$), $L_j$ is inserted into $TMC_i$ and the information of $TMC_i$ is updated. Otherwise, a new temporal micro cluster $TMC_{new}$ is created for $L_j$.
Basically, the algorithm maintains two arrays called TMC and E_TMC to control execution time; the data structure of each array entry is given in Definition 1. The CC_TRS algorithm continuously inserts each newly created $TMC_{new}$ into the TMC array with its status (existence) and creation time. After a while, the size of the TMC array increases and some $TMC_i$ expire, which affects the efficiency of the algorithm, since the most time-consuming part is finding the closest $TMC_i$ in the TMC array. Therefore, all expired $TMC_i$ are transferred periodically to the E_TMC array to minimize searching time in the TMC array. Eventually, if the combined size of TMC and E_TMC exceeds the memory constraint, some TMCs are merged based on their spatiotemporal information. Algorithm 1 illustrates the steps of the CC_TRS algorithm.
Algorithm 1 performs the creation and updating of temporal micro clusters. In lines 6-13, after the arrival of a new segment $L_j$, the algorithm finds the nearest distance between $L_j$ and all nonexpired TMCs. $TMC_i$ is nonexpired when the difference between the time of $L_j$ and the creation time of $TMC_i$ is less than the threshold Ω. In lines 14-19, if the distance between $L_j$ and $TMC_i$ is less than the threshold distance ($d_{max}$), $L_j$ is added to $TMC_i$ and the information of $TMC_i$ is updated; otherwise a new $TMC_{new}$ is created for $L_j$ and its creation time is set to $L_j$.time. Line 21 transfers all expired TMCs to the E_TMC array every Φ seconds. In lines 22-23, if the combined size of TMC and E_TMC exceeds the memory constraint, some TMCs are merged based on spatiotemporal information.
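The online steps described above can be sketched in Python as follows. This is a simplified illustration, not the article's Algorithm 1: each TMC is reduced to a count, a linear sum of centers, and a creation time; the distance is the center distance only; and the memory-triggered merge step is omitted. The names `run_online`, `center_distance`, `d_max`, `omega`, and `phi` are hypothetical stand-ins for the distance threshold, Ω, and Φ.

```python
import math

def center_distance(tmc, seg):
    """Distance from a segment center to the mean center of a TMC."""
    mean = (tmc["ls_x"] / tmc["n"], tmc["ls_y"] / tmc["n"])
    return math.dist(mean, seg["center"])

def run_online(stream, d_max, omega, phi):
    """Process segments in arrival order; return (active, expired) TMCs."""
    tmcs, expired = [], []
    last_sweep = None
    for seg in stream:
        now = seg["time"]
        if last_sweep is None:
            last_sweep = now

        # Find the closest TMC whose lifetime has not ended.
        best, best_d = None, math.inf
        for tmc in tmcs:
            if now - tmc["created"] >= omega:
                continue  # expired clusters are ignored
            d = center_distance(tmc, seg)
            if d < best_d:
                best, best_d = tmc, d

        if best is not None and best_d < d_max:
            # Absorb the segment into the closest nonexpired TMC.
            best["n"] += 1
            best["ls_x"] += seg["center"][0]
            best["ls_y"] += seg["center"][1]
        else:
            # Otherwise start a new TMC whose life begins now.
            tmcs.append({"n": 1, "ls_x": seg["center"][0],
                         "ls_y": seg["center"][1], "created": now})

        # Every phi time units, move expired TMCs out of the search array.
        if now - last_sweep >= phi:
            expired.extend(t for t in tmcs if now - t["created"] >= omega)
            tmcs[:] = [t for t in tmcs if now - t["created"] < omega]
            last_sweep = now
    return tmcs, expired
```

Keeping expired clusters in a separate list means the nearest-cluster search only scans clusters that can still absorb segments, which is the motivation the text gives for the E_TMC array.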

TMC Merging Algorithm.
Spatially, CC_TRS adopts the TCMM [5] technique to merge two TMCs. TCMM suggests taking the tightness of micro clusters into consideration when merging them. Furthermore, when the distance between representative lines is equal, the TCMM framework gives priority to merging two loose micro clusters rather than two tight micro clusters, since merging very tight micro clusters breaks their tightness, as in Figure 6(a), while merging two loose micro clusters does not hurt their already loose tightness, as illustrated in Figure 6(b).
The spatial distance between $TMC_i$ and $TMC_j$ is equivalent to the distance between their representative lines, $L^*_i$ with extent $extent_i$ and $L^*_j$ with extent $extent_j$. Note that the extent is used to strengthen the similarity of two loose micro clusters in order to give them priority to merge, as illustrated in Figure 7. The distance between $L^*_i$ and $L^*_j$ contains three parts, each taken with extent: the center distance, the angle distance, and the parallel distance.
When the size of the TMCs exceeds the memory limit, the TMCs must be merged to satisfy the memory constraint. The merging algorithm for $n$ given TMCs is illustrated in Algorithm 2. Note that CC_TRS maintains both temporal and spatial information during the merging process, while TCMM maintains only spatial features.

Trajectory Macro Clustering.
Macro clustering is invoked when the user requests to see the overall results. Any density based clustering algorithm can be used to achieve macro clustering by replacing the distance between spatial points with the distance between temporal micro clusters, as depicted in Figure 8. The distance between temporal micro clusters is defined in (6).
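To illustrate how a density based method can be driven by an arbitrary distance between micro clusters, here is a minimal DBSCAN-style sketch in pure Python. The choice of DBSCAN, the function name `dbscan`, and the parameters `eps` and `min_pts` are assumptions for illustration; the article only requires some density based algorithm, and the distance callable stands in for the TMC distance.

```python
def dbscan(items, eps, min_pts, dist):
    """Label each item with a cluster id (0, 1, ...) or -1 for noise.

    dist is any callable giving the distance between two items, so the
    items can be micro clusters instead of spatial points.
    """
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(items)

    def neighbors(i):
        # All items within eps of item i (including i itself).
        return [j for j in range(len(items))
                if dist(items[i], items[j]) <= eps]

    cluster = 0
    for i in range(len(items)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border item
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border item joins the cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:
                queue.extend(nj)  # j is a core item, keep expanding
        cluster += 1
    return labels
```

The neighbor search here is quadratic in the number of items, which is acceptable for a sketch because macro clustering runs on the much smaller set of micro clusters, not on raw segments.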

Performance Evaluation
In this section, the performance of the CC_TRS algorithm is tested and compared with the TCMM framework and ConTraClu. Two real data sets, Elk1993 and Deer1995, are used; Elk1993 has 33 trajectories and 47204 points, while Deer1995 has 32 trajectories and 20065 points. Any trajectory segmentation algorithm, such as those in [10,21], can be used to divide each trajectory into a set of segments. Matlab R2012b and Excel 2013 were used to implement the algorithm and plot the charts.

Clustering Quality Evaluation.
The sum of square distance (SSQ) was used to compare the clustering quality of CC_TRS with that of TCMM. The SSQ of $n$ trajectory segments is equal to the sum of squared distances between each segment $L_j$ and its closest TMC representative line $L^*_i$, as given in (10) and (11); the distance threshold ($d_{max}$) is set to 600.
Figure 9 shows an improvement in the clustering quality of CC_TRS compared with TCMM on the Deer1995 and Elk1993 data sets. For the Deer1995 data set, the maximum improvement is 4.7% when the number of time bins is 100, while the minimum improvement is 3.75% when the number of time bins is 200, as illustrated in Figure 9(a). For the Elk1993 data set, the maximum improvement is 3.8% when the number of time bins is 200, while the minimum improvement is 3.5% when the number of time bins is 100, as depicted in Figure 9(b). Therefore, the improvements in clustering quality range within 3.5-5% (the smaller the SSQ, the better the clustering quality).
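The SSQ measure described above can be sketched in a few lines of Python. The distance callable is a placeholder for the segment-to-representative-line distance, and `improvement_percent` is a hypothetical helper showing one plausible way to express a relative SSQ reduction; the article does not state how its percentages were computed.

```python
def ssq(segments, rep_lines, dist):
    """Sum over segments of the squared distance to the closest
    representative line."""
    return sum(min(dist(s, r) for r in rep_lines) ** 2 for s in segments)

def improvement_percent(ssq_baseline, ssq_new):
    """Relative SSQ reduction in percent (ASSUMED definition)."""
    return (ssq_baseline - ssq_new) / ssq_baseline * 100.0
```

With one-dimensional stand-ins, segments at 1 and 4 and representatives at 0 and 5 are each 1 unit from their closest representative, so the SSQ is 2.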
Algorithm 2: TMC merging.
Input: set of temporal micro clusters, temporally arranged, TMC = [$TMC_1$, $TMC_2$, ..., $TMC_n$]
(1) Calculate the spatial similarity between every two consecutive TMCs ($TMC_i$, $TMC_{i+1}$)
(2) Arrange the similarities from the most similar to the least similar TMCs
(3) Merge the most similar TMCs until the size of the TMCs satisfies the memory limit.
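A minimal Python sketch of the merging steps follows. Several details are assumptions, not taken from the article: each TMC is reduced to a count, one linear sum, one square sum, and a creation time; merged sums are added component-wise; the merged cluster keeps the earlier creation time; and the closest consecutive pair is re-selected after each merge instead of sorting all pairs once. The names `merge_pair` and `merge_until_fits` are hypothetical.

```python
def merge_pair(a, b):
    """Merge two TMC summaries; sums are additive.

    ASSUMPTION: the merged cluster keeps the earlier creation time.
    """
    return {
        "n": a["n"] + b["n"],
        "ls": a["ls"] + b["ls"],
        "ss": a["ss"] + b["ss"],
        "created": min(a["created"], b["created"]),
    }

def merge_until_fits(tmcs, max_count, dist):
    """Merge temporally consecutive TMCs, most similar pair first,
    until at most max_count remain."""
    tmcs = sorted(tmcs, key=lambda t: t["created"])
    while len(tmcs) > max_count and len(tmcs) > 1:
        # Index of the consecutive pair with the smallest distance.
        i = min(range(len(tmcs) - 1),
                key=lambda k: dist(tmcs[k], tmcs[k + 1]))
        tmcs[i:i + 2] = [merge_pair(tmcs[i], tmcs[i + 1])]
    return tmcs
```

Restricting candidates to temporally consecutive pairs is what lets the merge respect time: two clusters that cover the same area in distant periods are never adjacent in the sorted list unless nothing lies between them.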

Efficiency Evaluation (Time and Memory Space).
CC_TRS was compared with the TCMM and ConTraClu algorithms to evaluate its efficiency in terms of execution time and space requirements. TCMM and ConTraClu divide stream data into a set of time bins and cluster each time bin separately, while CC_TRS allocates a lifespan to each newly created cluster. To make the comparison fair, we set the cluster life equal to the time bin, as in Figures 1 and 3. Equation (12) can be used to calculate the cluster life (time bin):
$$\Omega = \frac{t_l - t_f}{NOB}, \quad (12)$$
where $t_f$ and $t_l$ are the time stamps of the first and last segments in the data stream and NOB is the number of time bins, which is specified by the user. In terms of memory, the tests show that CC_TRS needs 10-15% more memory space than TCMM, as illustrated in Figures 11(a) and 11(b). The size of a micro cluster in TCMM is 28 bytes, since the micro cluster has 7 fields and each field is declared as a Matlab uint32 variable (4 bytes). The TMC size is 32 bytes and one bit, since it has two extra fields: the first field, $t_c$, saves the creation time of the TMC, and a logical variable, $s$, saves the status of the TMC (expired, nonexpired). The TMC size is therefore about 1.15 times the micro cluster size, so most of the extra memory required by CC_TRS comes from the additional temporal field in the TMC data structure. Note that we compare the memory requirements of CC_TRS only with TCMM, since both algorithms have similar data structures.
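The arithmetic in this section can be checked with a short Python sketch. The function name `cluster_life` and the constant names are hypothetical; the field sizes are the ones stated in the text (seven 4-byte fields for a micro cluster, plus one 4-byte creation time and one status bit for a TMC).

```python
def cluster_life(t_first, t_last, nob):
    """Cluster life (time bin length): stream duration split into nob bins."""
    return (t_last - t_first) / nob

MC_BYTES = 7 * 4                   # seven uint32 fields
TMC_BYTES = MC_BYTES + 4 + 1 / 8   # plus creation time and one status bit
SIZE_RATIO = TMC_BYTES / MC_BYTES  # TMC size relative to micro cluster size
```

For instance, a stream spanning 4000 seconds with 400 bins gives a cluster life of 10 seconds, and the size ratio comes out slightly under 1.15, in line with the 10-15% extra memory reported.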

Parameter Φ Effect.
To explain the effect of the Φ parameter on the execution time of CC_TRS, the algorithm is run several times with values of Φ ranging from 1000 to 3500 seconds. We set NOB to 400 and $d_{max}$ to 600, since these values give the minimum execution time for the Deer1995 data set, as shown in Figure 10(a). Figure 12 shows that the minimum execution time is achieved when Φ is equal to 2000 seconds.

Parameter $d_{max}$ Effect.
In this section, we describe the effects of $d_{max}$ on clustering quality and running time using the Deer1995 data set. The smaller the $d_{max}$, the better the clustering quality, but the longer the execution time. On the other hand, the larger the $d_{max}$, the shorter the running time, but more information is lost in micro clustering. For example, the clustering quality when $d_{max}$ = 400 (red line) is better than when $d_{max}$ = 700 (blue line), as illustrated in Figure 13(a), while the running time is higher for $d_{max}$ = 400, as illustrated in Figure 13(b). As a consequence, a trade-off between clustering quality and running time is required to get the best results.

Conclusion
In this article, the CC_TRS algorithm is proposed for clustering trajectory stream data of free moving objects. The algorithm consists of two phases: an online phase and an offline phase. In the online phase, CC_TRS uses a lifespan technique that assigns a lifetime to each newly created temporal micro cluster, instead of the time window technique, which divides stream data into time bins and clusters each time window separately. In the offline phase, a density based clustering approach is used to present clustering results on user demand. In tests on the two data sets Deer1995 and Elk1993, CC_TRS reduces running time by 50% and 20% compared with TCMM and ConTraClu, respectively. In addition, the clustering quality is improved by 3.5-5% compared with TCMM, based on the sum of square distance (SSQ). On the other hand, the CC_TRS algorithm needs 10-15% more memory space than the TCMM framework, since the data structure of the temporal micro cluster has extra temporal fields.
$LS_{cen}$: the linear sum of the center points of the line segments. $SS_{cen}$: the square sum of the center points of the line segments.

Figure 4: Three components of the distance function.

Algorithm 1: Trajectories micro clustering (CC_TRS algorithm).

Figure 6: (a) Merging tight micro clusters A and B. (b) Merging loose micro clusters C and D.

Figure 13: The sensitivity of parameter $d_{max}$ (SSQ versus time).
where $|cen_i - cen_j|$ represents the Euclidean distance between the centers of the two segments ($L_i$, $L_j$).