An Approach to Spatiotemporal Trajectory Clustering Based on Community Detection

Nowadays, large volumes of multimodal data have been collected for analysis. An important type of data is trajectory data, which contains both time and space information. Trajectory analysis and clustering are essential to learn the pattern of moving objects. Computing trajectory similarity is a key aspect of trajectory analysis, but it is very time consuming. To address this issue, this paper presents an improved branch and bound strategy based on time slice segmentation, which reduces the time to obtain the similarity matrix by decreasing the number of distance calculations required to compute similarity. Then, the similarity matrix is transformed into a trajectory graph and a community detection algorithm is applied on it for clustering. Extensive experiments were done to compare the proposed algorithms with existing similarity measures and clustering algorithms. Results show that the proposed method can effectively mine the trajectory cluster information from the spatiotemporal trajectories.


Introduction
Nowadays, a huge amount of data is collected, and it is important to develop tools that analyze this data to extract useful knowledge. The collected data is often multimodal, that is, of different types (e.g., audio [1], video [2], text [3], and image [4]), and can be analyzed jointly or separately [5,6]. An emerging type of data that plays a key role in multimodal data analysis is trajectory data [7]. It consists of spatial and temporal information about moving objects. Common trajectory data can be divided into four categories, namely, human trajectories, vehicle trajectories, animal trajectories, and natural phenomenon trajectories. Analyzing and discovering patterns in trajectory data has applications in several fields such as intelligent transportation, human mobility analysis, urban planning, meteorology, and travel recommendation, and can reveal insights that cannot be discovered from other data types.
The process of trajectory data analysis mainly consists of obtaining and preprocessing trajectory data, trajectory data management, and a variety of mining tasks, including trajectory pattern mining, privacy protection, outlier detection [8,9], and clustering trajectories on complex road networks [10,11]. Many studies have been published, and trajectory data analysis is a very active research field. For example, a generative adversarial network (GAN) was used to predict pedestrian movement by analyzing multimodal trajectory data [12]. However, most techniques for trajectory data analysis require measuring trajectory similarity, which involves a large number of distance calculations on trajectory points; as a result, the time complexity of these similarity measurement methods is relatively high. Based on the idea of branch and bound, a novel similarity measurement method, called FTSM [13], was proposed that sets a distance threshold to prune certain mismatched points. Still, FTSM only considers spatial constraints.
More recently, there has been an increasing interest in time series clustering using graphs [14,15]. Traditional analysis methods only focus on the local relationships between data samples, while ignoring global information. Advanced trajectory data mining techniques take the network dynamics of trajectories into account, for example, to mine trajectory group patterns and to assess the importance of a moving object in trajectory networks [16-18]. A complex network is suitable for visually revealing important relationships in trajectory data and can provide global information, as it does for time series data. In addition, a network-based method places no restriction on the shape of clusters.
Based on the above advantages and limitations, we propose an approach to spatiotemporal trajectory clustering based on community detection (STTC-CD). The algorithm implements an improved branch and bound strategy based on time slice segmentation. While richer trajectory information is taken into consideration, redundant trajectory points are pruned. Then, the trajectory data is converted into a graph representation based on the similarity matrix. Finally, a suitable community detection algorithm is applied to perform clustering on the graph. The main contributions of this paper are as follows:
(i) An improved similarity calculation method is designed, which matches pairs of trajectory points and applies a pruning strategy based on time slicing to reduce the time complexity.
(ii) A method is proposed to convert trajectories into a graph format to which many types of trajectory data mining techniques can be applied. Based on this, a community detection algorithm is applied to cluster trajectories, which captures global relationships among trajectories from a graph-based perspective.
(iii) Experiments have been conducted to evaluate the proposed algorithm on several datasets and to verify the influence of multiple factors. It was found that the proposed algorithm is more efficient than the compared methods.
The rest of this paper is organized as follows: Section 2 surveys related work. Section 3 formally defines the trajectory clustering problem. Section 4 presents the designed STTC-CD algorithm. Then, Section 5 describes the experimental evaluation and Section 6 draws a conclusion.

Related Work
The key problem in trajectory clustering is how to measure trajectory similarity. This section first reviews techniques for trajectory similarity measurement and then surveys relevant work on community detection.

Trajectory Similarity Measure.
Most trajectory data analysis tasks require computing trajectory similarity measurements, such as trajectory clustering [19], transforming data for privacy preservation [20], movement pattern mining [21], and abnormal trajectory detection [22]. Traditional trajectory measurement techniques such as EDR (edit distance on real sequence), LCSS (longest common subsequence), and DTW (Dynamic Time Warping) compute the overall trajectory similarity by analyzing each trajectory as a whole rather than considering subtrajectories or random trajectory points. Among these techniques, DTW [23] aligns trajectories of different lengths by warping a trajectory sequence and can match a point at a certain time from one trajectory to a number of consecutive points from another trajectory. Hence, it places no restriction on the length of the compared trajectories. LCSS [24] calculates the longest common subsequence of two trajectories as their similarity. EDR calculates the minimum number of changes required to transform one trajectory into another as the similarity between the two trajectories. Clue-Aware Trajectory Similarity (CATS) [25] is aimed at overcoming the influence of trajectory deviations in time and space. Multidimensional Similarity Measure (MSM) [26] and Multiple-Aspect Trajectory Similarity Measure (MUITAS) [27] provide similarity measures for multidimensional sequences, adding information such as weather, user activity, and user interest into trajectory comparison.
However, DTW is a distance-based method, which directly accumulates the distances between trajectory point pairs. A problem of DTW is that the sum of the distances can greatly increase when there are noise points, which makes it sensitive to noise. In contrast, ε-threshold-based measures use an ε-threshold value to determine whether two points match, which can be more robust to noise. LCSS, EDR, CATS, and MSM all follow the ε-threshold-based strategy, and the computation of the similarity score is based on the point matching of two trajectories. They have an $O(n^2)$ time complexity and cause a performance bottleneck for trajectory clustering algorithms. Furtado et al. [13] proposed a branch and bound method (FTSM) to achieve fast similarity measurement by utilizing a transitive range pruning strategy to reduce the number of point pairs to be matched.
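The contrast between distance accumulation and ε-threshold matching can be illustrated with a minimal sketch. The function names, the toy trajectories, and the simple "has a partner within ε" count are assumptions made for illustration; the count is a simplified stand-in for the ε-threshold idea, not an implementation of LCSS, EDR, CATS, or MSM.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def dtw(a, b):
    """Classic dynamic-programming DTW: accumulates point-to-point distances."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = dist(a[i - 1], b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

def matched_points(a, b, eps):
    """Number of points of a with at least one point of b within eps."""
    return sum(1 for p in a if any(dist(p, q) <= eps for q in b))

# Hypothetical toy data: the second trajectory contains one outlier.
clean = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
noisy = [(0.0, 0.0), (1.0, 0.0), (2.0, 50.0), (3.0, 0.0)]

print(dtw(clean, noisy))                  # the outlier dominates the DTW cost
print(matched_points(clean, noisy, 0.5))  # only one match is lost
```

In this toy case the single outlier contributes its full distance to the DTW cost, whereas the threshold-based count changes by one matched point only.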

Community Detection in Networks.
A community is a subset of network nodes. Connections between nodes within a subset are relatively close, while connections between nodes from different subsets are relatively sparse, which is exactly in line with the needs and principles of clustering. Recently, community detection algorithms have been increasingly utilized for trajectory clustering.
Depending on whether a node can belong to multiple communities or only one, community detection methods can be categorized as finding overlapping or nonoverlapping communities. In a nonoverlapping community structure, each network node belongs to exactly one community. Algorithms that detect communities of this type include Fastgreedy [28], Louvain [29], Label Propagation [30], and Infomap [31]. Modularity is used to measure the quality of a community division. The Fastgreedy algorithm applies a bottom-up process. Initially, each node is regarded as a community. Then, at each iteration, the two communities providing the largest increase in modularity are merged until the entire network is merged into a single community. The final community structure is the division that maximizes the modularity. The Louvain algorithm improves upon the Fastgreedy algorithm by moving each node to the neighboring community that yields the largest modularity gain. When the community membership of the nodes no longer changes, the algorithm collapses each community into a single node to form a new network for the next iteration. The basic idea of the Label Propagation algorithm (LPA) is to predict labels of unlabeled network nodes from labeled nodes. Each node label is propagated to neighboring nodes according to their similarity. At each propagation step, a node updates its own label according to the labels of its neighboring nodes until the labels no longer change. Similar to K-means, the results of LPA are affected by the initial label selection. The Infomap algorithm introduces a coding-based technique based on random walks. A good community division leads to a shorter coding length.
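Since Fastgreedy and Louvain both optimize modularity, a small sketch of how modularity can be computed for a given partition may help. It uses the standard Newman definition for an undirected weighted graph; the function name and the toy graph are assumptions for illustration and are not taken from the cited algorithms' implementations.

```python
def modularity(weights, communities):
    """Newman modularity Q of a partition of an undirected weighted graph.

    weights: symmetric adjacency matrix (list of lists).
    communities: community label of each node, in node order.
    """
    n = len(weights)
    strength = [sum(row) for row in weights]  # weighted degree of each node
    two_m = sum(strength)                     # twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                # observed weight minus the weight expected at random
                q += weights[i][j] - strength[i] * strength[j] / two_m
    return q / two_m

# Hypothetical toy graph: two disconnected edges, 0-1 and 2-3.
toy = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]

print(modularity(toy, [0, 0, 1, 1]))  # natural split of the two edges
print(modularity(toy, [0, 0, 0, 0]))  # everything in a single community
```

Placing all nodes in one community gives a modularity of zero, which is why a merging procedure such as Fastgreedy keeps the intermediate division with the highest score rather than the final single community.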
A trajectory clustering algorithm based on an improved Label Propagation algorithm was proposed in which the road network is modeled as a dual graph to capture and characterize the similarity between nodes [10]. Liu and Guo proposed a semantic trajectory clustering algorithm based on community detection [32] and discussed different community detection algorithms.

Problem Statement
The following definitions are provided to facilitate the formulation of the problem under study:
Definition 1 (trajectory). A trajectory $TR_i = \{p_1^i, p_2^i, \ldots, p_{n_i}^i\}$ is a sequence of points, where each point $p = (x, y, t)$ represents the spatial location $(x, y)$ of an entity at a given time $t$ of trajectory $TR_i$, and $n_i$ is the number of points in $TR_i$.
Definition 2 (silhouette coefficient SI). The silhouette coefficient is a metric to evaluate the quality of a clustering, which considers two aspects: cohesion and separation. The $s(i)$ of each trajectory point $p_i$ is calculated as follows:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$
where $a(i)$ denotes the average distance from $p_i$ to all trajectory points in the cluster to which $p_i$ belongs, and $b(i)$ is the average distance between $p_i$ and trajectory points in other clusters. Given a trajectory dataset $TS = \{TR_1, TR_2, \ldots, TR_N\}$, the silhouette coefficient of $TS$ is the average of the silhouette coefficients of all trajectories, denoted as
$$SI = \frac{1}{N} \sum_{j=1}^{N} \left( \frac{1}{n_j} \sum_{i=1}^{n_j} s(i) \right),$$
where $N$ is the number of trajectories, $n_j$ is the number of trajectory points in $TR_j$, and $(1/n_j) \sum_{i=1}^{n_j} s(i)$ is the silhouette coefficient of trajectory $TR_j$.
The value of SI is between -1 and 1, and a higher SI value generally indicates a better clustering result. According to the above definitions, the trajectory clustering optimization problem is defined as follows:
Definition 3 (trajectory clustering optimization problem). Given a set of trajectories $TS = \{TR_1, TR_2, \ldots, TR_N\}$ in Euclidean space for the time period $[0, T]$, the goal is to divide $TS$ into groups $\{C_1, C_2, \ldots, C_{N_c}\}$ so as to maximize SI.
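As an illustration of Definition 2, the following sketch computes SI for a labeled set of trajectories. It is a direct, unoptimized reading of the definition under two assumptions: points are (x, y, t) tuples compared with the Euclidean distance, and b(i) averages over the points of all other clusters, as the definition is worded. The function name and the toy data are hypothetical.

```python
import math

def silhouette_index(trajectories, labels):
    """SI of a clustering.

    trajectories[k] is a non-empty list of (x, y, t) points and
    labels[k] is the cluster id assigned to trajectory k.
    """
    per_trajectory = []
    for k, tr in enumerate(trajectories):
        scores = []
        for m, p in enumerate(tr):
            same, other = [], []
            for j, tr2 in enumerate(trajectories):
                for n, q in enumerate(tr2):
                    if (j, n) == (k, m):
                        continue  # skip the point itself
                    target = same if labels[j] == labels[k] else other
                    target.append(math.dist(p, q))
            a = sum(same) / len(same) if same else 0.0    # cohesion
            b = sum(other) / len(other) if other else 0.0  # separation
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
        per_trajectory.append(sum(scores) / len(scores))
    return sum(per_trajectory) / len(per_trajectory)

# Hypothetical toy data: two trajectories near the origin, two far away.
trajs = [
    [(0, 0, 0), (1, 0, 1)],
    [(0, 1, 0), (1, 1, 1)],
    [(100, 100, 0), (101, 100, 1)],
    [(100, 101, 0), (101, 101, 1)],
]
good = silhouette_index(trajs, [0, 0, 1, 1])
bad = silhouette_index(trajs, [0, 1, 0, 1])
print(good, bad)
```

Grouping the nearby trajectories together gives a value close to 1, while mixing near and far trajectories in the same cluster gives a negative value.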

The Proposed STTC-CD Algorithm
This paper proposes an approach to spatiotemporal trajectory clustering based on community detection, named STTC-CD, which is applied in three steps: (1) trajectory partition, (2) graph generation, and (3) trajectory clustering, as illustrated in Figure 1.
Stage 1. Trajectory Partition. Given a collection of space-time trajectories $\{TR_1, TR_2, \ldots, TR_N\}$, STTC-CD divides them into time slices and then utilizes transitive range pruning to calculate the number of pairs of matching points between trajectories in each time period to generate a matching matrix.
Stage 2. Graph Generation. STTC-CD aggregates the matching matrix of each time period to generate a global matching matrix. Then, the algorithm transforms the matching matrix into a similarity matrix according to similarity rules, and a trajectory-connected graph is generated.
Stage 3. Trajectory Clustering. Based on the trajectory graph obtained in the second stage, we utilize a community detection algorithm for clustering to capture global relationships between trajectories from the perspective of the network.

Trajectory Partition.
An algorithm is proposed that takes the time characteristics of trajectories into account and utilizes a branch and bound strategy for fast trajectory similarity measurement. The algorithm is called STTC-CD. It not only improves the accuracy of similarity measurement but also compares each trajectory segment only with others from the same time slice instead of all trajectories, thereby improving computational efficiency through further pruning.
Given a trajectory dataset $TS = \{TR_1, TR_2, \ldots, TR_N\}$ and a partition threshold $\kappa$, the time range of $TS$ is divided into $\kappa$ time slices, and the trajectory points are allocated to the subdataset of the corresponding time slice, which yields $\kappa$ subdatasets $\{TS_1, TS_2, \ldots, TS_\kappa\}$. Let $t_{min}$ and $t_{max}$ be the minimum and maximum timestamps in the dataset, respectively. The length of each time slice is defined as follows:
$$\Delta t = \frac{t_{max} - t_{min}}{\kappa}.$$
Each trajectory $TR_i = \{p_1^i, p_2^i, \ldots, p_m^i, \ldots, p_{n_i}^i\} \in TS$ is divided among the subdatasets according to the time slice (as shown in Figure 2). The index of the subdataset to which a point $p_m^i$ with timestamp $t_m^i$ is assigned is $\lceil (t_m^i - t_{min}) / \Delta t \rceil$.

Graph Generation.
The graph is generated based on the similarity matrix. The calculation of similarity in each time slice is done based on the following definitions:
Definition 4 (point matching (PM)). Let there be two points $p_i$ and $p_j$, a matching threshold $\varepsilon$, and a distance function $dist(p_i, p_j)$. If $dist(p_i, p_j) \le \varepsilon$, then $p_i$ and $p_j$ match each other; otherwise, they do not match. The formula is defined as follows:
$$PM(p_i, p_j) = \begin{cases} 1, & dist(p_i, p_j) \le \varepsilon, \\ 0, & \text{otherwise}. \end{cases}$$
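The time slice assignment and the point matching test can be sketched together as follows. This is a minimal illustration rather than the authors' implementation: the function names and the dictionary layout are hypothetical, point matching is shown on the spatial coordinates only, and two boundary choices are assumptions (a point at the minimum timestamp is placed in the first slice, and the index is capped at κ to guard against rounding).

```python
import math
from collections import defaultdict

def partition_by_time_slice(trajectories, kappa):
    """Split trajectories into kappa time slices.

    trajectories: dict mapping a trajectory id to a list of (x, y, t).
    Returns a dict: slice index (1..kappa) -> {trajectory id: points}.
    """
    times = [t for tr in trajectories.values() for (_, _, t) in tr]
    t_min, t_max = min(times), max(times)
    delta_t = (t_max - t_min) / kappa
    slices = defaultdict(lambda: defaultdict(list))
    for tid, tr in trajectories.items():
        for (x, y, t) in tr:
            idx = math.ceil((t - t_min) / delta_t) if delta_t > 0 else 1
            idx = min(max(idx, 1), kappa)  # assumption: clamp to 1..kappa
            slices[idx][tid].append((x, y, t))
    return {i: dict(subs) for i, subs in slices.items()}

def point_match(p, q, eps):
    """PM(p, q): 1 if the spatial Euclidean distance is at most eps, else 0."""
    return 1 if math.dist(p[:2], q[:2]) <= eps else 0

# Hypothetical toy data with timestamps between 0 and 100.
data = {
    "a": [(0, 0, 0), (1, 1, 25), (2, 2, 26)],
    "b": [(5, 5, 50), (6, 6, 100)],
}
subdatasets = partition_by_time_slice(data, kappa=4)
print(subdatasets)
```

With κ = 4 the slice length is 25, so the point at t = 26 falls into the second slice while the point at t = 25 stays in the first.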

Wireless Communications and Mobile Computing
Definition 5 (trajectory segment matching (TM)). Given two trajectory segments $sTR_i = \{p_1, p_2, \ldots, p_m\}$ and $sTR_j = \{q_1, q_2, \ldots, q_n\}$, trajectory segment matching is defined as follows: where $m$ and $n$ are the numbers of points of $sTR_i$ and $sTR_j$, respectively.
Considering that trajectory elements are points in Euclidean space, the following definitions adopt the Euclidean distance as distance function to perform point matching. Hence, the matching threshold can be seen as the radius ε of a matching circle.
Definition 6 (pivot point). For a trajectory $TR_i$, the pivot point of $TR_i$ is the point at half of the trajectory as follows: where $n_i$ is the number of trajectory points of $TR_i$.
Definition 7 (pruning radius (PR)). Given a pivot point $p_k^i \in TR_i$ and a matching threshold $\varepsilon$, the pruning radius is the radius of a circle around $p_k^i$ that covers all the points that are at a distance of at most $\varepsilon$ from any point in $TR_i$, that is,
$$PR = \max_{p \in TR_i} dist(p_k^i, p) + \varepsilon.$$
A lemma from [13] states that for any point in $TR_j$, if its distance to a certain point of $TR_i$ is less than $\varepsilon$, then its distance to the pivot point of $TR_i$ must be less than PR. Therefore, if the distance from a point to the pivot point of $TR_i$ is greater than PR, its distance to every point of $TR_i$ is greater than $\varepsilon$, and the pruning operation can be performed accordingly.
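The pruning argument above can be sketched in a few lines. This is a simplified illustration of the pivot and pruning radius idea, not the full FTSM or STTC-CD procedure; the function names are hypothetical, the pivot is taken at index n // 2 (an assumption about how "half of the trajectory" is rounded), and points are treated as 2D positions.

```python
import math

def pruning_radius(tr, eps):
    """Return the pivot of tr and the radius PR = max dist(pivot, p) + eps."""
    pivot = tr[len(tr) // 2]  # assumption: middle index by integer division
    return pivot, max(math.dist(pivot, p) for p in tr) + eps

def count_matches_pruned(tr_i, tr_j, eps):
    """Count points of tr_j matching at least one point of tr_i.

    Returns (matches, comparisons), where comparisons counts the
    point-to-point distance checks made against tr_i after pruning.
    """
    pivot, pr = pruning_radius(tr_i, eps)
    matches = comparisons = 0
    for q in tr_j:
        if math.dist(q, pivot) > pr:
            continue  # q cannot be within eps of any point of tr_i
        for p in tr_i:
            comparisons += 1
            if math.dist(p, q) <= eps:
                matches += 1
                break
    return matches, comparisons

def count_matches_brute(tr_i, tr_j, eps):
    """Reference count without pruning."""
    return sum(1 for q in tr_j if any(math.dist(p, q) <= eps for p in tr_i))

# Hypothetical toy segments.
seg_i = [(0, 0), (1, 0), (2, 0)]
seg_j = [(0, 0.2), (1, 0.3), (10, 10), (2, 3)]
print(count_matches_pruned(seg_i, seg_j, 0.5))
```

In this toy case the two distant points of the second segment are discarded after a single distance check against the pivot, and the pruned count agrees with the brute-force count.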
Based on the subdatasets generated in Stage 1, the number of matching points in each subdataset is calculated. Given two subtrajectories $sTR_i$ and $sTR_j$, the calculation of point matching consists of three steps, as shown in Figure 3:

For instance, Figure 4 shows one of the subdatasets after partition. Figure 4(a) is a subdataset consisting of three subtrajectories, and Figure 4(b) shows its matching result, where the numbers of matching points between $sTR_2$ and the other two trajectories in subdataset 2 are 2 and 0.
The matching matrices of all time slices are then aggregated. According to the aggregated matching point matrix, the similarity matrix can be obtained. The similarity is defined as follows: where $m$ and $n$ are the numbers of subtrajectories in $TR_i$ and $TR_j$, respectively. The similarity measure satisfies the property of nonnegativity, which means $Sim(TR_i, TR_j) \ge 0$ in all cases, and a large score indicates a high similarity.
Then, the matching matrix is transformed by Equation (9) to obtain the similarity matrix $S$, where $Sim(TR_i, TR_j)$ represents the similarity between $TR_i$ and $TR_j$. A trajectory graph $G = (V, E)$ is constructed by exploiting the similarity matrix $S$. Firstly, $N$ vertices are constructed for a dataset with $N$ trajectories, and each trajectory corresponds to a vertex. For each $v_i$ corresponding to the trajectory $TR_i$ and $v_j$ corresponding to the trajectory $TR_j$, an edge is added between them if $Sim(TR_i, TR_j) > 0$. The weight of each edge is equal to the similarity between the two vertices. For instance, given the matrix [[0, 0.5, 0.3, 0], [0.5, 0, 0, 0.2], [0.3, 0, 0, 0.8], [0, 0.2, 0.8, 0]], the trajectory graph is as shown in Figure 5.
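The graph construction rule can be sketched with a plain adjacency dictionary, using the four-trajectory example matrix from the text. The function name and the dictionary representation are assumptions for illustration; any graph library could be used instead.

```python
def build_trajectory_graph(sim):
    """Build an undirected weighted graph from a symmetric similarity matrix.

    Returns {vertex: {neighbor: weight}}; an edge is added only when
    the similarity between two trajectories is greater than zero.
    """
    n = len(sim)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] > 0:
                graph[i][j] = sim[i][j]
                graph[j][i] = sim[i][j]
    return graph

# Example similarity matrix from the text (vertices are 0-indexed here).
S = [[0, 0.5, 0.3, 0],
     [0.5, 0, 0, 0.2],
     [0.3, 0, 0, 0.8],
     [0, 0.2, 0.8, 0]]
G = build_trajectory_graph(S)
for v, nbrs in G.items():
    print(v, nbrs)
```

The resulting graph has four weighted edges, and the pairs with zero similarity (the first and fourth trajectories, and the second and third) remain unconnected.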

Trajectory Clustering.
A community is composed of a group of closely connected nodes that are sparsely connected with nodes outside the community. Community detection is to discover these closely connected community structures in a complex network, which coincides with the objective of clustering. Therefore, the Infomap algorithm [31] is employed for clustering, which combines community detection with information encoding.
The basic idea of the Infomap algorithm is to find the shortest code describing the path generated by a random walk on the network. This is done using a two-level coding of all network nodes: the module partition with the shortest encoding length is found by minimizing entropy, which gives the optimal clustering. The two-level code assigns unique module names, and nodes in different modules are allowed to reuse codewords. The module code is inserted before the nodes of the same module, and a termination mark is inserted at the end. The average code length is
$$L(M) = q_{\curvearrowright} H(\mathcal{Q}) + \sum_{i=1}^{m} p_{\circlearrowright}^{i} H(\mathcal{P}^{i}), \tag{10}$$
where $M$ is a partition of the nodes into $m$ modules, $q_{\curvearrowright}$ represents the probability of switching from one module to another per step of the random walk, $H(\mathcal{Q})$ is the entropy of movements between modules, $p_{\circlearrowright}^{i}$ denotes the proportion of all nodes in group $i$ in the encoding, and $H(\mathcal{P}^{i})$ denotes the average code length required by all nodes in group $i$. The Infomap algorithm performs three steps:
Step 1. Initialization. Each graph node is treated as an independent group.
Step 2. Each node is traversed in a random order, and each node is assigned to the adjacent module that gives the largest decrease in Equation (10).
Step 3. Step 2 is repeated in a different random order until Equation (10) no longer decreases.
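To make the quantity minimized in Steps 2 and 3 concrete, the following sketch evaluates the two-level code length of a given partition. It assumes an undirected weighted graph, for which the stationary visit rate of a node is proportional to its weighted degree, and it only scores a partition; it does not perform the search itself. The function name and the 0-indexed example are hypothetical.

```python
import math

def map_equation(weights, modules):
    """Two-level code length L(M) for an undirected weighted graph.

    weights: symmetric adjacency matrix with at least one edge.
    modules: module label of each node, in node order.
    """
    n = len(weights)
    total = sum(sum(row) for row in weights)   # twice the total edge weight
    p = [sum(row) / total for row in weights]  # node visit rates
    ids = sorted(set(modules))
    exit_rate = {m: 0.0 for m in ids}          # rate of leaving each module
    for i in range(n):
        for j in range(n):
            if modules[i] != modules[j]:
                exit_rate[modules[i]] += weights[i][j] / total

    def plogp(x, norm):
        # one entropy term, -(x/norm) * log2(x/norm), with 0 log 0 = 0
        return -(x / norm) * math.log2(x / norm) if x > 0 else 0.0

    q = sum(exit_rate.values())
    h_q = sum(plogp(exit_rate[m], q) for m in ids) if q > 0 else 0.0
    length = q * h_q                           # between-module term
    for m in ids:
        members = [p[i] for i in range(n) if modules[i] == m]
        p_within = exit_rate[m] + sum(members)
        h_p = plogp(exit_rate[m], p_within)
        h_p += sum(plogp(x, p_within) for x in members)
        length += p_within * h_p               # within-module term
    return length

# Example similarity matrix from the graph generation step.
S = [[0, 0.5, 0.3, 0],
     [0.5, 0, 0, 0.2],
     [0.3, 0, 0, 0.8],
     [0, 0.2, 0.8, 0]]
print(map_equation(S, [0, 0, 0, 0]))  # all nodes in one module
print(map_equation(S, [0, 0, 1, 1]))  # a candidate two-module partition
```

With a single module the between-module term vanishes and the code length reduces to the entropy of the node visit rates, which gives a convenient sanity check for the sketch.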

Performance Evaluation
The performance of the proposed STTC-CD algorithm was evaluated in terms of silhouette coefficient and runtime. All algorithms were implemented in Java 14, and all experiments were conducted on a Windows PC workstation equipped with an Intel(R) Core(TM) i5-10400 CPU @ 2.90 GHz and 16 GB of memory.

Datasets.
The algorithm was evaluated on several widely used public datasets, described in Table 1. The trucks dataset (DS1) is a real-world dataset composed of 1,100 trajectories generated by 50 different trucks transporting concrete in Greece. The T-drive dataset [33] (DS2), provided by Microsoft Research Asia, is a collection of trajectories generated by 10,357 taxis located in Beijing within a week. The UCI dataset (DS3) was collected by the GoTrack Android app in 2016. It has a high sampling rate within a single trajectory, but the interval between trajectories is long. DS2 was collected in Beijing, which is located between longitudes 115.7°E and 117.4°E and latitudes 39.4°N and 41.6°N. Therefore, out-of-range points were deleted as abnormal points. The average trajectory length in DS2 is about 1,500 points. Yet, the longest trajectory has 150,000 points, and there are many repeated points and stay points, which we have removed from the dataset. Figure 6 presents the longest trajectory in DS2, with id 6275. Figure 6(a) is the original trajectory, and Figure 6(b) is the processed trajectory.
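The cleaning applied to DS2 can be approximated by a short filter. This is only a sketch under stated assumptions: points are (longitude, latitude, timestamp) tuples sorted by time, the bounding box uses the Beijing ranges quoted above, and only exact consecutive repeats of a position are dropped. Detecting stay points in general requires distance and duration thresholds that the text does not specify, so that step is not reproduced here. The function name and the sample points are hypothetical.

```python
def clean_trajectory(points, lon_range=(115.7, 117.4), lat_range=(39.4, 41.6)):
    """Drop out-of-range points and consecutive repeated positions.

    points: list of (lon, lat, t) tuples sorted by timestamp.
    """
    cleaned = []
    for lon, lat, t in points:
        in_lon = lon_range[0] <= lon <= lon_range[1]
        in_lat = lat_range[0] <= lat <= lat_range[1]
        if not (in_lon and in_lat):
            continue  # abnormal point outside the bounding box
        if cleaned and cleaned[-1][:2] == (lon, lat):
            continue  # repeated position: the object did not move
        cleaned.append((lon, lat, t))
    return cleaned

# Hypothetical sample: one repeated position and one out-of-range point.
raw = [
    (116.3, 39.9, 0),
    (116.3, 39.9, 10),
    (120.0, 39.9, 20),
    (116.4, 40.0, 30),
]
print(clean_trajectory(raw))
```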

Evaluation.
In our experiments, we ran STTC-CD with different values of the ε-threshold and different numbers of time slices to identify the optimal parameters. Figure 7 shows the influence of these parameters on the proposed algorithm. As shown in Figure 7(a), the SI index first rises and then falls as the number of time slices increases, and it reaches its maximum value when the number of time slices is 45. As shown in Figure 7(b), the value of ε was varied from 2 to 35, and the SI index reaches its maximum value when ε is 10.
The performance of the proposed STTC-CD algorithm was compared with several similarity measurement algorithms, namely, FTSM [13], DTW [23], MSM [26], and LCSS [24], on DS1 and DS3. The parameter ε was set to 10, and the number of time slices was set to 45. Results are presented in Figure 8(a).
It can be observed that the running time of STTC-CD and FTSM on both datasets is shorter than that of the other algorithms. For large datasets, the runtime gap is greater. The reason is that the other three algorithms are implemented using dynamic programming, which has quadratic time complexity. As the data size increases, the time required by these algorithms rises sharply. Since FTSM and STTC-CD prune the sequence of points to be matched on the trajectory, their complexity is close to linear in the best case. When the data size is small, STTC-CD prunes more pairwise trajectory points than FTSM by splitting trajectories into time slices. However, the operations of splitting and matching time slices take additional time, which results in STTC-CD spending more time than FTSM.
To further evaluate FTSM and STTC-CD, DS2 was split into six subdatasets of different sizes and the two algorithms were applied. It can be seen in Figure 8(b) that when the dataset is small, the runtimes of the two algorithms are almost the same. As dataset size increases, the gap becomes more obvious. This result is also consistent with the results for the other two datasets.
The performance of the algorithms was further compared in terms of the SI index. Because the time dimension of the dataset is considered in the algorithm, the three-dimensional Euclidean distance, which combines the spatial dimensions with the time dimension, is utilized as the distance measure of SI. Compared with DTW, MSM, and LCSS, which are implemented by dynamic programming, FTSM only prunes away some unnecessary comparisons, which improves the running speed of the algorithm without affecting its accuracy. Based on FTSM, the proposed algorithm further reduces the number of point matchings in the similarity calculation, but this also affects the accuracy of the algorithm. Therefore, the SI index was used to compare the accuracy of FTSM and STTC-CD, and K-means was used as the benchmark algorithm. As illustrated in Figure 9, the proposed algorithm was compared with FTSM and K-means with different numbers of trajectories on DS1 and DS2. It can be observed that the SI of FTSM and STTC-CD is greater than the SI of K-means on both datasets, and in most cases, STTC-CD gives better results than FTSM, which indicates that the proposed STTC-CD takes better account of time correlation.
The clustering results of FTSM and STTC-CD on DS1 are displayed using lines of different colors, while trajectories from the same cluster are represented using the same color.

Conclusion
This article presented an approach to spatiotemporal trajectory clustering based on community detection (STTC-CD), which relies on time slicing to reduce the time required for similarity calculation. STTC-CD uses a new trajectory representation, which enables various algorithms, such as community detection algorithms, to be applied to trajectory clustering.
Experimental results have shown that the proposed algorithm can effectively reduce runtimes on large datasets and that clustering results are more meaningful in the time dimension.
The approach proposed in this paper is designed to analyze and cluster trajectory data. An interesting research possibility for future work is to see this work as a building block to build a system for analyzing multimodal data consisting not only of trajectory but also text, video, and audio data. In particular, a hybrid system could be developed combining the proposed approach with a neural network or other machine learning models.

Data Availability
The T-drive dataset used to support the findings of this study has been deposited by Microsoft Research Asia (doi:10.1145/2020408.2020462). The trucks dataset used to support the findings of this study is included within the article "Clustering Trajectories of Moving Objects in an Uncertain World" (doi:10.1109/ICDM.2009.57). The UCI dataset used to support the findings of this study was collected by an Android app called GoTrack, which is available at the Google Play Store.