Nowadays, large volumes of multimodal data have been collected for analysis. An important type of data is trajectory data, which contains both time and space information. Trajectory analysis and clustering are essential to learn the pattern of moving objects. Computing trajectory similarity is a key aspect of trajectory analysis, but it is very time consuming. To address this issue, this paper presents an improved branch and bound strategy based on time slice segmentation, which reduces the time to obtain the similarity matrix by decreasing the number of distance calculations required to compute similarity. Then, the similarity matrix is transformed into a trajectory graph and a community detection algorithm is applied on it for clustering. Extensive experiments were done to compare the proposed algorithms with existing similarity measures and clustering algorithms. Results show that the proposed method can effectively mine the trajectory cluster information from the spatiotemporal trajectories.

Nowadays, a huge amount of data is collected, and it is important to develop tools that analyze this data to extract useful knowledge. The collected data is often multimodal, that is, of different types (e.g., audio [

The process of trajectory data analysis mainly consists of obtaining and preprocessing trajectory data, trajectory data management, and a variety of mining tasks, including trajectory pattern mining, privacy protection, outlier detection [

More recently, there has been increasing interest in time series clustering using graphs [

Based on the above advantages and limitations, we propose an approach to spatiotemporal trajectory clustering based on community detection (

An improved similarity calculation method is designed, which matches pairs of trajectory points and applies a pruning strategy based on time slicing to reduce the time complexity

A method is proposed to convert trajectories into a suitable data format to apply many types of techniques for trajectory data mining. Based on this, a community detection algorithm is applied to cluster trajectories, which captures global relationships among trajectories from a graph-based perspective

Experiments have been conducted to evaluate the proposed algorithm on several datasets to verify the influence of multiple factors. It was found that the proposed algorithm is more efficient than the compared methods

The rest of this paper is organized as follows: Section

The key problem in trajectory clustering is how to measure trajectory similarity. This section first reviews techniques for trajectory similarity measurement and then surveys relevant work on community detection.

Most trajectory data analysis tasks require computing trajectory similarity measurements, such as trajectory clustering [

However, DTW is a distance-based method that directly accumulates the distances between trajectory point pairs. As a result, noise points can greatly increase the summed distance, which makes DTW sensitive to noise. In contrast, the

A community is a subset of network nodes. Connections between nodes within a subset are relatively dense, while connections between nodes from different subsets are relatively sparse, which matches the goal of clustering. Recently, community detection algorithms have been increasingly used for trajectory clustering.

Depending on whether a node can belong to multiple communities or only one, community detection methods can be categorized as finding nonoverlapping or overlapping communities. In a nonoverlapping community structure, each network node belongs to exactly one community. Algorithms that detect communities of this type are Fastgreedy [

A trajectory clustering algorithm based on an improved Label Propagation algorithm was proposed, in which the road network is modeled as a dual graph to capture and characterize the similarity between nodes [

The following definitions are provided to facilitate the formulation of the problem under study:

A trajectory is a sequence of points in chronological order, denoted as

The silhouette coefficient is a metric to evaluate the quality of a clustering, which considers two aspects: cohesion and separation. The

The value of SI is between -1 and 1 such that a higher SI value indicates a better clustering result in general. According to the above definition, the road trajectory clustering optimization problem is defined as follows:

Given a set of trajectories

This paper proposes an approach to spatiotemporal trajectory clustering based on community detection, named

Algorithm flowchart.

An algorithm is proposed that takes the time characteristics of trajectories into account and utilizes a branch and bound strategy for fast trajectory similarity measurement. The algorithm is called

Given a trajectory dataset

Each trajectory

The schematic diagram of trajectory partition.
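The partition of trajectories into time slices can be illustrated with a minimal Python sketch. It is not the authors' implementation: the point layout `(x, y, t)`, the function name `partition_by_time_slice`, and the parameter `slice_length` are assumptions introduced here for illustration, and the slices are assumed to have a fixed, equal length starting at the earliest timestamp.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed point layout: (x, y, t). The paper's own notation is not reproduced here.
Point = Tuple[float, float, float]


def partition_by_time_slice(
    trajectories: Dict[str, List[Point]], slice_length: float
) -> Dict[int, Dict[str, List[Point]]]:
    """Group the points of every trajectory by the time slice they fall in.

    Returns a mapping: slice index -> {trajectory id -> points in that slice}.
    Each value plays the role of one subdataset.
    """
    if slice_length <= 0:
        raise ValueError("slice_length must be positive")
    # The earliest timestamp over all trajectories defines the start of slice 0.
    t0 = min(p[2] for pts in trajectories.values() for p in pts)
    subdatasets: Dict[int, Dict[str, List[Point]]] = defaultdict(dict)
    for traj_id, pts in trajectories.items():
        for p in pts:
            k = int((p[2] - t0) // slice_length)
            subdatasets[k].setdefault(traj_id, []).append(p)
    return dict(subdatasets)
```

Only trajectory segments that fall in the same slice are compared afterwards, which is where the reduction in distance calculations comes from.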

The graph is generated based on the similarity matrix. The calculation of similarity in each time slice is done based on the following definitions:

Let there be two points

Given two trajectory segments

Considering that trajectory elements are points in Euclidean space, the following definitions adopt the Euclidean distance as distance function to perform point matching. Hence, the matching threshold can be seen as the radius

For a trajectory

Given a pivot point

Let

This lemma [

Based on the subdatasets generated in Stage

Pruning step: the pivot point of

Splitting step:

Matching step: the points of

The schematic diagram of point matching.

For instance, Figure

Matching step for a subdataset. Panels: subdataset 2; matching result of subdataset 2.
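The pruning and matching steps can be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the paper's exact procedure: the pivot is assumed to be the centroid of a segment, the radius is the largest distance from the pivot to any point of the segment, and the splitting step is omitted. The names `pivot_and_radius`, `can_prune`, `match_points`, and `eps` (the matching threshold) are hypothetical. The pruning test relies only on the triangle inequality: if the pivots are farther apart than the two radii plus `eps`, no pair of points can be within `eps` of each other.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pivot_and_radius(segment: Sequence[Point]) -> Tuple[Point, float]:
    """Return the centroid of the segment (assumed pivot) and its covering radius."""
    cx = sum(p[0] for p in segment) / len(segment)
    cy = sum(p[1] for p in segment) / len(segment)
    radius = max(math.dist((cx, cy), p) for p in segment)
    return (cx, cy), radius


def can_prune(seg_a: Sequence[Point], seg_b: Sequence[Point], eps: float) -> bool:
    """True if no point of seg_a can lie within eps of any point of seg_b."""
    pivot_a, radius_a = pivot_and_radius(seg_a)
    pivot_b, radius_b = pivot_and_radius(seg_b)
    # Triangle inequality: d(p, q) >= d(pivot_a, pivot_b) - radius_a - radius_b.
    return math.dist(pivot_a, pivot_b) > radius_a + radius_b + eps


def match_points(
    seg_a: Sequence[Point], seg_b: Sequence[Point], eps: float
) -> Tuple[List[bool], List[bool]]:
    """Flag the points of each segment that have a match in the other segment."""
    matched_a = [False] * len(seg_a)
    matched_b = [False] * len(seg_b)
    if can_prune(seg_a, seg_b, eps):
        return matched_a, matched_b  # pruned: no distance is computed point by point
    for i, p in enumerate(seg_a):
        for j, q in enumerate(seg_b):
            if math.dist(p, q) <= eps:
                matched_a[i] = True
                matched_b[j] = True
    return matched_a, matched_b
```

When the pruning test succeeds, two pivot-to-pivot computations replace all pairwise point comparisons for that pair of segments.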

The matching matrices of all time slices are aggregated. From the aggregated matching point matrix, the similarity matrix can be obtained. The similarity is defined as follows:

For two trajectories

Then, the matching matrix is transformed by Equation (

Trajectory graph.
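As a rough illustration of how matching results could be turned into a trajectory graph, the Python sketch below uses a similarity ratio in the style of MSM-like measures (matched points divided by total points). This ratio, the edge threshold `min_weight`, and both function names are assumptions made here for illustration; the paper's own similarity definition and transformation are not reproduced.

```python
from typing import List, Sequence, Tuple


def similarity_from_matches(matched_a: Sequence[bool], matched_b: Sequence[bool]) -> float:
    """Assumed similarity: fraction of points of both trajectories that are matched."""
    total = len(matched_a) + len(matched_b)
    return (sum(matched_a) + sum(matched_b)) / total if total else 0.0


def build_trajectory_graph(
    sim: Sequence[Sequence[float]], min_weight: float = 0.0
) -> List[Tuple[int, int, float]]:
    """Turn a symmetric similarity matrix into weighted undirected edges.

    Each trajectory is a node; an edge (i, j, w) is kept when w > min_weight.
    """
    n = len(sim)
    return [
        (i, j, sim[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if sim[i][j] > min_weight
    ]
```

The resulting edge list is a format that a community detection algorithm can consume directly.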

A community is a group of closely connected nodes that are only sparsely connected to nodes outside the community. Community detection aims to discover these closely connected structures in a complex network, which coincides with the objective of clustering. Therefore, the Infomap algorithm [

The basic idea of the Infomap algorithm is to find the shortest code that describes the path generated by a random walk on the network. To this end, it uses a two-level coding of all network nodes and searches for the module partition with the shortest encoding length; the partition that minimizes this length is taken as the clustering. The two-level code assigns a unique name to each module, while nodes in different modules are allowed to reuse the same codewords. The module code is inserted before the nodes of the same module, and a termination mark is inserted at the end. The average code length is calculated as follows:

Step 1: Initialization. Each graph node is treated as an independent module.

Step 2: Each node is traversed in a random order, and each node is assigned to the adjacent module that gives the largest decrease in Equation (

Step 3: Step 2 is repeated in a different random order until Equation (
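To make the quantity minimized by these steps concrete, the following Python sketch evaluates the average code length of a given partition using the expanded form of the map equation published with Infomap, for an undirected weighted graph. It only scores a partition and does not perform the greedy node moves. The function name `map_equation` and the edge-list input format are assumptions for illustration; the graph is assumed to have no isolated nodes, so the stationary visit rate of a node is its weighted degree divided by twice the total edge weight.

```python
import math
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (node u, node v, weight)


def _plogp(p: float) -> float:
    """p * log2(p), with the usual convention 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0


def map_equation(edges: List[Edge], module_of: Dict[Hashable, int]) -> float:
    """Average per-step code length (in bits) of a partition of an undirected graph."""
    total = 2.0 * sum(w for _, _, w in edges)
    visit: Dict[Hashable, float] = {}  # node -> stationary visit rate
    exit_rate: Dict[int, float] = {}   # module -> probability of leaving it per step
    for u, v, w in edges:
        visit[u] = visit.get(u, 0.0) + w / total
        visit[v] = visit.get(v, 0.0) + w / total
        if module_of[u] != module_of[v]:
            for m in (module_of[u], module_of[v]):
                exit_rate[m] = exit_rate.get(m, 0.0) + w / total
    module_visit: Dict[int, float] = {}
    for node, p in visit.items():
        m = module_of[node]
        module_visit[m] = module_visit.get(m, 0.0) + p
    q_total = sum(exit_rate.values())
    return (
        _plogp(q_total)
        - 2.0 * sum(_plogp(q) for q in exit_rate.values())
        - sum(_plogp(p) for p in visit.values())
        + sum(_plogp(exit_rate.get(m, 0.0) + p_m) for m, p_m in module_visit.items())
    )
```

For two triangles joined by a single edge, placing each triangle in its own module gives a shorter code length than placing all six nodes in one module, which is the behavior the greedy moves exploit.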

The performance of the proposed STTC-CD algorithm was evaluated in terms of silhouette coefficient and runtime. All algorithms were implemented in Java 14, and all experiments were conducted on a Windows workstation equipped with an Intel(R) Core(TM) i5-10400 CPU @ 2.90 GHz and 16 GB of memory.

The algorithm was evaluated on several widely used public datasets, described in Table

Datasets.

| DS# | Dataset | Trajectory count | Avg. trajectory points | Time span |
|---|---|---|---|---|
| DS1 | Trucks | 1,100 | 85 | 39 days |
| DS2 | T-drive | 10,357 | 1,448 | 7 days |
| DS3 | UCI | 163 | 111 | 493 days |

DS2 was collected in Beijing, which spans longitudes 115.7°E to 117.4°E and latitudes 39.4°N to 41.6°N. Points outside this range were therefore deleted as abnormal points. The average trajectory length in DS2 is about 1,500 points, yet the longest trajectory has 150,000 points, and the data contain many repeated points and stay points, which we removed from the dataset. Figure

Comparison graph of trajectory processing. Panels: original trajectory 6275; processed trajectory 6275.
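The preprocessing of DS2 described above can be sketched in Python as follows. The bounding box comes from the text; the point layout `(longitude, latitude, timestamp)` and the function name `clean_trajectory` are assumptions for illustration. The sketch drops only consecutive points with an identical position, so stay-point detection with a distance or time tolerance is not covered.

```python
from typing import List, Tuple

# Assumed point layout: (longitude, latitude, timestamp).
Point = Tuple[float, float, float]

# Beijing bounding box as stated in the text.
LON_MIN, LON_MAX = 115.7, 117.4
LAT_MIN, LAT_MAX = 39.4, 41.6


def clean_trajectory(points: List[Point]) -> List[Point]:
    """Drop out-of-range points and consecutive points with an unchanged position."""
    cleaned: List[Point] = []
    for lon, lat, t in points:
        if not (LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX):
            continue  # abnormal point outside the study area
        if cleaned and cleaned[-1][0] == lon and cleaned[-1][1] == lat:
            continue  # repeated position: keep only the first occurrence
        cleaned.append((lon, lat, t))
    return cleaned
```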

In our experiments, we ran STTC-CD with different

Clustering quality on trucks dataset using SI with different parameters.

Different time slice (

Different

The performance of the proposed

Runtime. Panels: runtime on DS1 and DS3; runtime on DS2.

It can be observed that the running time of

To further evaluate FTSM and

The algorithms were further compared in terms of the SI index. Because the algorithm takes the time dimension of the dataset into account, a three-dimensional Euclidean distance that includes the time dimension is used as the distance measure of SI. Compared with DTW, MSM, and LCSS implemented by dynamic programming, FTSM only prunes unnecessary comparisons, which improves the running speed without affecting accuracy. Building on FTSM, the proposed algorithm further reduces the number of point-matching operations in the similarity calculation, but this reduction can also affect accuracy. Therefore, the SI index was used to compare the accuracy of FTSM and STTC-CD, with K-means as the benchmark algorithm. As illustrated in Figure

SI. Panels: SI index on DS1; SI index on DS2.
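For reference, the SI index used in this comparison can be computed from a pairwise distance matrix and a cluster assignment, as in the minimal Python sketch below. It follows the standard silhouette definition; how the distance between two trajectories is derived from the three-dimensional point distance is not specified here, so the matrix `dist` is assumed to be precomputed. The function name `silhouette_index` and the convention of scoring singleton clusters as 0 are also assumptions.

```python
from typing import Dict, List, Sequence


def silhouette_index(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette value over all items, given pairwise distances and labels."""
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores: List[float] = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: silhouette set to 0 by convention
            continue
        # a: mean distance to the other members of the same cluster (cohesion).
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster (separation).
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(labels)
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate items that are closer to another cluster than to their own.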

The clustering results of FTSM and

Clustering result on DS1.

FTSM clustering results on DS1

This article presented an approach to spatiotemporal trajectory clustering based on community detection (

The approach proposed in this paper is designed to analyze and cluster trajectory data. An interesting direction for future work is to use it as a building block of a system for analyzing multimodal data that consists not only of trajectories but also of text, video, and audio data. In particular, a hybrid system could be developed by combining the proposed approach with a neural network or other machine learning models.

The T-drive dataset used to support the findings of this study has been deposited in the Microsoft Research Asia (doi:

The authors declare that they have no conflicts of interest.

This research is sponsored by the Key Research and Development Program under Grant No. 2020YFS0169, Science and Technology Department of Sichuan Province and the Science and Technology Planning Project of Sichuan Province under Grant No. 2020YFG0054, and the Joint Funds of the Ministry of Education of China.