JSTC: Travel Time Prediction with a Joint Spatial-Temporal Correlation Mechanism

Accurate travel time prediction is one of the most promising intelligent transportation system (ITS) services, which can greatly support route planning, ride-sharing, navigation applications, and effective traffic management. Several factors, like spatial, temporal, and external, have big effects on traffic patterns, and therefore, it is important to develop a mechanism that can jointly capture correlations of these components. However, spatial sparsity issues make travel time prediction very challenging, especially when dealing with the origin-destination (OD) method, since the trajectory data may not be available. In this paper, we introduce a unified deep learning-based framework named joint spatial-temporal correlation (JSTC) mechanism to improve the accuracy of OD travel time prediction. First, we design a spatiotemporal correlation block that combines two modules: self-convolutional attention integrated with a temporal convolutional network (TCN) to capture the spatial correlations along with the temporal dependencies. Then, we enhance our model performance through adopting a multi-head attention module to learn the attentional weights of the spatial, temporal, and external features based on their contributions to the output and speed up the training process. Extensive experiments on three large-scale real-world traffic datasets (NYC, Chengdu, and Xi’an) show the efficiency of our model and its superiority compared to other methods.


Introduction
Travel time forecasting (TTF) is considered one of the most essential services in intelligent transportation systems (ITSs), which greatly supports route planning, ridesharing, navigation applications, and effective traffic management. TTF is widely used throughout location-based applications and has become one of the most important services in these applications. However, producing an accurate TTF is still challenging since understanding the effects of different dynamic factors (such as urban flows, jams, peak hours, and special situations like public holidays, events, and vacations) on the travel time is a complex task [1]. The dynamic factors can be categorized into four groups as follows: (1) Spatial dependencies: travel time is greatly affected by the traffic conditions of each region and its neighbors as well, so trips from areas with heavy traffic will take a longer time than others. (2) Temporal dependencies: traffic conditions during different periods of the day affect the time of travel. For example, road traffic congestion in downtown areas is more severe during the morning and evening peak hours. (3) Periodical dependencies: periodic patterns such as working hours, weekends, and public events can also affect travel time; for example, traffic is more congested during workdays and peak times. (4) External factors: several external factors, such as weather, holidays, and public events, also have a big impact on travel time fluctuations.
Due to the complexity of the spatiotemporal correlations, TTF is a very challenging problem, so accurately predicting travel time has become a vital task recently [2,3]. In general, TTF has been treated with one of two methods (route-based and OD-based) using statistical methods, classical machine learning, and deep learning approaches. First, for route-based approaches, GPS and time series datasets of trajectories are useful in estimating travel times for both road segments and the entire path. However, some complex issues in this technique, such as sparsity in trajectory data and GPS devices' errors, lead to inaccurate results and costly computations. Second, the OD-based approach is completely based on the shortest path between the origin and destination points, which reduces the heavy computations and minimizes accumulated error rates of GPS devices. Therefore, the aim of this work is to provide a solution that improves the forecasting accuracy of the OD-based travel time. Many methods have been proposed for TTF, including linear regression (LR) [4], time-varying [5], Kalman filtering (KF) [6,7], autoregressive integrated moving average (ARIMA) [8], seasonal ARIMA (SARIMA-KF) [9,10], and random forest (RF) with gradient boosting (GB) (RF-GB) [11]. However, the major disadvantage of these approaches is that they are inappropriate for capturing the relationships between the complicated traffic factors. More recently, researchers have proposed deep learning models that strive to enhance TTF results, such as backpropagation neural networks (BP-NNs) [12-14], long short-term memory (LSTM) [15,16], convolutional neural networks (CNNs) combined with LSTM (CNN-LSTM) [17], and attention mechanism [18].
Unfortunately, these approaches still suffer from some difficulties, e.g., slow and time-consuming training, because these methods cannot perform concurrent processing. The sparsity of traffic data represents another concern of TTF approaches, where the historical traffic data do not cover the entire region. On the other hand, the correlations between the spatial features have been considered in many existing works, but most of these methods only focused on the local spatial correlations by observing the relationships between nearby GPS coordinate points [19-21]. Sometimes there may not exist similar records with the same location in the historical traffic data. Therefore, we attempt to solve this issue by considering the records of distant neighbors. Besides, nearby regions can be relevant and very similar in terms of traffic patterns during various periods. Herein, finding a mechanism capable of integrating relevant spatial and temporal features and simultaneously capturing the complicated dependencies between them can be very helpful. The supplementary critical factors play a significant role in traffic pattern fluctuations, especially under extreme circumstances (such as weather conditions, public holidays, events, and vacations). Thus, we model these features according to their correlations and dependencies between each other and also consider the features' contributions to the output. The main contributions of our work can be summarized as follows:
(i) Since data sparsity is a key challenge in real traffic scenarios, we propose a method to solve this issue and achieve better results by splitting the city into N × N grids using geo-hashing techniques and dividing the city into different clusters using the K-means algorithm. This allows us to use neighboring trips if there are no historical records or if the historical records are insufficient.
(ii) We propose a new mechanism to capture both spatial and temporal dependencies. This mechanism comprises two modules: the spatial self-attention module (SSAM) that is used to infer the spatial relationships and the residual dilated convolutional module (RDCM) to capture dynamic time dependencies.
(iii) Moreover, we adopt a multi-head attention approach to learn the attentional weights of multimodal factors (spatial, temporal, and external) based on their contributions to the target. While many previous works use RNNs in their models, which are time-consuming in the training stage due to their recurrent nature, we use a multi-head attention mechanism that supports parallel computing in this work to dramatically reduce training time.
(iv) We conduct extensive experiments using three large-scale traffic datasets in three different cities (NYC, Chengdu, and Xi'an). The results demonstrate the efficiency of our model compared to other methods under various traffic conditions.
The rest of this paper is organized as follows. Section 2 reviews the related works about the TTF approaches. Section 3 contains the problem definition and formalization, followed by data processing and analysis. Thereafter, we describe our proposed framework (JSTC) in detail. Section 4 discusses the experimental results of our model compared to other models. Finally, a summarized conclusion of this paper is presented in Section 5.

Related Work
Generally, TTF methods can be classified into two categories: route-based and OD-based methods.

Route-Based Methods.
Route-based methods can be divided into two approaches.

Segment-Based Method.
This method divides the road into segments and then estimates the travel time for each segment individually. Finally, the total travel time for the entire path is the sum of the travel times of all segments [22,23]. Many researchers consider TTF as time series forecasting for a single road, using models such as ARIMA and KF [24,25], which have been applied in short-term forecasting for road section travel time. In addition, support vector regression (SVR) was used due to its competence and generalization compared to the historical average (HA) method [26]. The gradient boosting decision tree method (GBDT) has also been used to improve prediction accuracy on TTF problems [27]. Wang et al. [15] investigated the sequence relationship between the road segments. They treated the travel time of the segment as a sequence of time series data and then used the LSTM model to solve this sequence prediction problem. The spatiotemporal hidden Markov method (STHM) was also applied to capture the correlations among different traffic time series and then predict the travel time [28].

Path-Based Method.
Another group of researchers combined multiple route segments as an entire path instead of using one road segment to solve the TTF problem. This considers the impact of intersections and traffic lights, which leads to more accurate predictions in the path-based method [21,29]. A non-parametric technique for route TTF based on floating car data (FCD) is the first to use the path-based approach [30]. It accumulated the travel time of each road segment from low-frequency data instead of calculating the travel time of the subpath. Rahmani in [31] also proposed a route-based method for route TTF by combining multiple traffic data sources collected by FCD and automated number plate recognition (ANPR). In [32], the K-shortest path algorithm was developed to infer the possible paths from each OD trip and then predict the link travel time. However, these techniques frequently suffer from dispersed data or high cost [21]. Nowadays, a vast amount of taxi trajectory data is collected by GPS equipment, so a TTF model for a direct path was proposed based on a three-dimensional tensor with two essential steps: first, compute the travel time for each segment by tensor decomposition; then, find the most optimal elements that help to estimate the route's travel time [33,34]. In [35], a deepIST model was proposed that takes spatial and temporal dependencies of traffic patterns into account by using map image information of the trajectory to predict travel time. In this framework, two CNN-based modules were combined to make images of the route segments and then look for spatial and temporal traffic correlations. To address the data sparsity issue that may occur in some trajectory segments, a CNN with LSTM model named DeepTTE was proposed for raw trajectory data processing [17].

OD-Based Method.
Many scholars have chosen the OD-based methods to address the TTF issues, to minimize the time needed and avoid the complex computations and complicated implementation. In [20], the authors proposed a multi-task representation learning model (MURAT) based on OD data, which achieved promising results. However, this method requires a long processing time and needs a lot of data, which seems to be the main disadvantage of this model. The estimation of the average time of the urban routes based on the candidate paths expected between OD trip coordinates was proposed in [19,32]. They combined the trucks' and taxis' travel datasets to predict travel time between each grid zone, following the same methodology as in [32], while Faruk in [36] was the first to develop a model for TTF based on travel distance predicted directly through the OD coordinates' GPS data. However, that model ignored delays in intersection queuing, which can reduce the TTF prediction precision. Recently, an ensemble technique with a multi-modality data source model named TTE-Ensemble was proposed in [21]. In this model, the ensemble method was adopted with GBDT and DNN models. GBDT and DNN predicted the travel time separately. Then, each model's results are fed to a decision tree algorithm as a meta-learner model to achieve the final TTF for each OD trip. However, this model basically relied on converting the trajectory data into 2D square cells instead of real OD locations, which means that all trips with the same grid ID will have similar characteristics regardless of their distance. Moreover, the GBDT and decision tree approaches are unsuitable for big data due to the high computational cost.
Recently, the attention mechanism has been widely used for traffic forecasting. In [37], the authors proposed a pairwise self-attention mechanism for capturing the spatial and temporal dependency of traffic flow prediction. In [18], a deep learning model named FMA-ETA was proposed, which predicted travel time by combining a feed-forward network and self-attention. This model focuses on spatial dependencies, while temporal correlations were ignored. Besides, convolutional and graph neural networks have been used for spatial and temporal correlations in traffic speed forecasting [38]. A model called GSTGCN, which applies dilated convolutional network architectures to take advantage of the dilation rate by increasing the covered spaces between the inputs, was designed. The literature survey shows that most of the previously discussed methods did not completely handle the TTF issues or achieve high accuracy, due to the complexity of learning spatial-temporal correlations under differing road network topologies and extreme temporal conditions. Also, there are some techniques that could be beneficial for improving the accuracy of travel time prediction. Inspired by the aforementioned ideas, we propose a JSTC framework relying on the OD-based strategy, which can achieve high accuracy with promising performance in predicting the travel time for any given OD GPS points. Herein, our work mainly addresses the sparse spatial data problem and also focuses on the multi-component correlations between spatial, temporal, and external factors, which significantly affect the travel time.

Methodology
The aim of the traffic forecasting task in this paper is to predict travel time between any pair of locations by means of the observed historical traffic datasets. Our methodology consists of three main parts: data preparation and preprocessing, analysis of traffic pattern similarity, and a detailed description of our proposed model. To begin, data preparation and preprocessing are critical and include data cleaning, removal of noise and outliers, feature extraction, and geo-localization (clustering and grid partitioning). Then, we investigate the similarity of the spatial and temporal dependencies to observe the influence of these components on the fluctuation of traffic patterns.

Journal of Advanced Transportation 3
Finally, we introduce our prediction model, which aims to predict the total travel time of the OD trips accurately. The detailed descriptions of each of these parts are given in the following sections. Before that, we formalize the traffic forecasting problem in this work through the following key concepts and definitions.

Preliminaries.
We define and formalize the TTF problem as a travel time prediction task between two given points (A) and (B).
can be obtained from these coordinates. To find the matched historical trips for trip $P_i$, we define a query (Q) as follows:
Definition 2. Spatial and temporal tensors: after splitting a city into N × N grids (G) and K-clusters (C) as a geo-region based on the OD-GPS coordinates, the GPS points have been mapped into G and K as well. We define two 3D tensors $\delta_i \in \mathbb{R}^{H_\delta \times F_\delta \times 1}$ and $\tau_i \in \mathbb{R}^{H_\tau \times F_\tau \times 1}$. The spatial tensor ($\delta_i$) represents spatial features including pick-up locations, drop-off locations, speed, distance, cluster-ids, grid-ids, and other auxiliary features. The temporal tensor ($\tau_i$) represents temporal features including the day of the week (0-6), the hour of the day (0-23), and the day of the month (0-30). Here, H represents the historical record ID and F denotes spatial or temporal features. Note that we consider the trip features as a sequence.
Definition 3. TTF for trip $P_i$: we define the travel time $T_i$ as the total time for the trip $P_i$ from (A) to (B) as follows:
Hence, the main goal of our work is to estimate the total time ($T_i$) for an OD-trip ($P_i$) with the help of the historical trips retrieved by a query (Q).

Data Analysis and Preprocessing.
In this paper, we used three large-scale real-world traffic datasets (NYC, Chengdu, and Xi'an) to verify the efficiency of our model across various road network topologies and traffic patterns. The first dataset is the NYC taxi dataset, which is provided by the New York City Taxi and Limousine Commission (TLC) [39] with billions of trip records from 2009 until now and comprises 21 different variables, including GPS coordinates for pick-up and drop-off, pick-up and drop-off time-stamp, total trip distance in miles, and other features. Following [40], we extracted six months of the traffic data between 01/01/2016 and 30/06/2016 for analysis and experiments in our work. The data we have selected contain approximately 75 million records, with over 12 million trips per month and 416,666 trips per day. The other two datasets are Chengdu and Xi'an, which were provided by the "Didi Chuxing platform" containing 9,707,970 and 5,272,758 taxi trajectories in September and October 2018 for Chengdu and Xi'an, respectively. The average numbers of trips per day are 123,463 and 133,843, respectively (Table 1). The analysis of traffic data can greatly assist in recognizing the fluctuations in traffic patterns. Spatiotemporal data cleaning and anomalous value filtering were conducted by removing the invalid or uncharted trips' records that contain missing information in one or more parts of OD GPS location, passenger count, and pick-up/drop-off interval-time records. We consider the trips out of the city boundary as spatial outliers and clean them accordingly. Also, all trips with a distance less than 500 meters or more than 100 kilometers have been removed. The temporal components have been filtered by taking only the records with a travel time less than 24 hours (86,400 seconds) and over 3 minutes (180 seconds). In order to observe the traffic patterns over the whole city, 15 regions were classified according to the city's boundaries.
Then, each region was grouped by temporal dependencies (day of month and day of week) to obtain the similarity of week and day rhythms. Considering the time-interval of the day as (0-23), we measured the average rate of travel time for all trips within the same spatial and temporal information, as well as traffic intensity for all trips that flow in and flow out across these regions. Figure 1(a) represents the average rate of trip density, and we can see a low-density rate in the period from midnight up to 6 AM. In contrast, we can notice that the maximum density rate happens during two peak periods, from 7 AM to 9 AM as morning rush hours and from 6 PM to 8 PM as the evening rush period. For example, during the early morning and evening rush hours, there is heavy traffic congestion, which means the movement will be slow. By contrast, during the non-peak hours, traffic patterns seem to be normal. Note that the average rate of travel time in Figure 1(b) is quite similar to the density rhythm in terms of increase and decrease rate, except for trips with a long duration. This is because each trip was considered as one counted trip in the density rate computation, whereas the trip's duration was taken into account while calculating the average rate of travel time, which affects the total average time in this case. Moreover, to determine peak and non-peak periods for Chengdu and Xi'an cities, we did some statistical analysis over various given regions within the same conditions. We randomly chose regions to illustrate the influence of traffic patterns. Table 1 shows that the average traffic volume measured (historical records which enter or leave the cluster or grid) can be relatively low or high, especially in areas with heavy activity. The results show that the average travel time varies from one region to another according to the traffic rhythms during the hours of the day.
On the other hand, traffic density during morning and evening hours is much higher than during night and afternoon hours, which indicates that traffic congestion affects traffic speed and travel time.
Eventually, to ensure that our proposed model is capable of producing effective results, after investigating the traffic patterns' similarities, two peak periods have been adopted for NYC, Chengdu, and Xi'an as the morning and evening peak periods, which include (7 ∼ 10 AM and 5 ∼ 8 PM), respectively.
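The cleaning thresholds and peak periods described in this section can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Trip` record, the bounding-box test for the city boundary, and the treatment of the interval end points (inclusive distance bounds, strict duration bounds, half-open peak windows) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Thresholds quoted in the text.
MIN_DISTANCE_M, MAX_DISTANCE_M = 500.0, 100_000.0   # 500 m .. 100 km
MIN_DURATION_S, MAX_DURATION_S = 180.0, 86_400.0    # 3 min .. 24 h


@dataclass
class Trip:
    """Hypothetical trip record; field names are illustrative only."""
    origin: Optional[Tuple[float, float]]       # (lat, lon) of pick-up
    destination: Optional[Tuple[float, float]]  # (lat, lon) of drop-off
    distance_m: float
    duration_s: float


def in_bbox(point, bbox):
    """True if (lat, lon) lies inside bbox = (lat_min, lat_max, lon_min, lon_max)."""
    lat, lon = point
    lat_min, lat_max, lon_min, lon_max = bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


def is_valid(trip, city_bbox):
    """Apply the missing-value, spatial-outlier, distance, and duration filters."""
    if trip.origin is None or trip.destination is None:
        return False  # missing OD information
    if not (in_bbox(trip.origin, city_bbox) and in_bbox(trip.destination, city_bbox)):
        return False  # spatial outlier: outside the (assumed rectangular) city boundary
    if not (MIN_DISTANCE_M <= trip.distance_m <= MAX_DISTANCE_M):
        return False
    return MIN_DURATION_S < trip.duration_s < MAX_DURATION_S


def is_peak(hour):
    """Morning (7-10 AM) or evening (5-8 PM) peak; end hours excluded by assumption."""
    return 7 <= hour < 10 or 17 <= hour < 20
```

A real city boundary would normally be a polygon rather than a rectangle; the bounding box only keeps the sketch short.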

Feature Extraction and Data Preparation.
Similar to [21], we apply data preprocessing based on the perspective of multi-modality. Accurate prediction of TTF is greatly influenced by numerous dynamic components, including complicated spatial and temporal dependencies, and the influence of external factors such as weather status, social events, or public holidays [41,42]. Hence, to improve the prediction accuracy, we adopted three components in our proposed method: spatial, temporal, and external. We adopt two 3D tensors $\delta_i$ and $\tau_i$ for spatial and temporal components' representation, while the external components were divided into two subvectors: weather data and public holiday data.

Spatial Components.
The original dataset provides the trips' pick-up and drop-off GPS locations only, so we further extracted additional spatial features from these two points such as distance and speed, which are essential spatial features. We applied two different methods to calculate the distance between two GPS locations: the Manhattan and haversine distance approaches [40]. Manhattan distance is formulated as follows: where $\Delta lat_{P_i}$ and $\Delta lon_{P_i}$ denote the total distance difference between the ordered pairs of OD coordinates computed by the following equations: The haversine distance is also formulated as follows:
$2r \arcsin\left(\sqrt{\sin^2(\Delta\phi/2) + \cos(o_i^{lat}) \cos(d_i^{lat}) \sin^2(\Delta\lambda/2)}\right)$,
where ($\Delta\phi$) is ($\Delta lat_{P_i}$), ($\Delta\lambda$) is ($\Delta lon_{P_i}$), and $r$ is the Earth's radius. Furthermore, the average speed was calculated from the trip distance and trip duration. In addition, we extracted other supplementary spatial features from the GPS coordinates, for example, cluster and grid density, which are explained in Definitions 4 and 5, respectively. In the real-world road network, traffic patterns' variation is highly related to time (e.g., traffic tidal phenomena during the weekdays) and space, including neighboring regions. Thus, the traffic patterns in neighboring regions are more relevant. Generally, traffic in neighboring regions exhibits similar flows over the day-time intervals.
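The distance and speed features discussed above can be sketched in a few lines. The haversine function follows the formula in the text; the Manhattan variant is an assumption, since the paper's own equation is not recoverable here: it sums a north-south leg and an east-west leg, each measured with the haversine formula. The function names and the Earth radius value are illustrative choices.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value for r)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def manhattan_km(lat1, lon1, lat2, lon2):
    """Assumed Manhattan variant: latitude leg plus longitude leg."""
    lat_leg = haversine_km(lat1, lon1, lat2, lon1)  # move along the meridian
    lon_leg = haversine_km(lat1, lon1, lat1, lon2)  # move along the parallel
    return lat_leg + lon_leg


def average_speed_kmh(distance_km, duration_s):
    """Average trip speed from distance and duration."""
    return distance_km / (duration_s / 3600.0)
```

By construction the Manhattan value is never smaller than the haversine value for the same pair of points, which makes it a rough proxy for grid-like street layouts.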
To improve the proposed model's performance, we applied the K-means clustering method in the spatial component preprocessing phases. Since K-means groups places based solely on their Euclidean distance, it returns clusters of places that are close to each other and thereby assigns trips within nearby regions to the same cluster. In order to determine whether we are using the right number of clusters, we applied the elbow curve method [43], which calculates the sum of squared errors (SSE) for a range of values of K (60, 80, 100, 120, and 150) and then picks the elbow of the curve as the optimal number of clusters, that is, a small value of K that still has a low SSE. From Figure 2, we can observe that the optimal value of K is 100.
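The elbow procedure described above amounts to running K-means for several values of K and inspecting the SSE curve. The following is a small pure-Python sketch of Lloyd's algorithm under assumed settings (fixed iteration count, seeded random initialization); it is meant to illustrate the SSE computation, not to reproduce the clustering used in the paper, and a library implementation would be preferable at the scale of millions of trips.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans_sse(points, k, iters=20, seed=0):
    """Run Lloyd's algorithm and return (centers, sum of squared errors)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize with k distinct data points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


def elbow_curve(points, candidate_ks):
    """SSE for each candidate K; the elbow is then read off the curve."""
    return {k: kmeans_sse(points, k)[1] for k in candidate_ks}
```

For the paper's setting, `candidate_ks` would be `(60, 80, 100, 120, 150)` and `points` the pick-up and drop-off coordinates.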
Similarly, we mapped each OD-trip into 2D grid cells with an area of approximately 0.5 km × 0.5 km. Thus, we can represent each trip with two grid-ID features, one for pickup and the other for drop-off. Finally, after the clustering and geo-location mapping processing, the degree of crowding for each part (cluster and grid) throughout the city is computed depending on the following definitions.
Definition 4. Density score for cluster:
Definition 5. Density score for grid cell:
where N and M represent the total number of origin (o) and destination (d) trips' locations recorded within the same cluster (C) and grid (G) at the same time interval of the day. These two spatial features are essential to reflect the traffic flow of the region through different periods.
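One plausible reading of the grid mapping and the density score is sketched below. Two points are assumptions: the paper mentions geo-hashing for the grid, whereas this example swaps in a simple equirectangular approximation anchored at a hypothetical south-west corner (`lat0`, `lon0`); and the density score is taken to be the number of origins plus destinations falling in a cell during a given hour, since the exact formulas of Definitions 4 and 5 are not available here.

```python
import math
from collections import Counter

KM_PER_DEG_LAT = 111.195  # approximate length of one degree of latitude
CELL_KM = 0.5             # cell side from the text (0.5 km x 0.5 km)


def grid_id(lat, lon, lat0, lon0, cell_km=CELL_KM):
    """Map a GPS point to a (row, col) cell relative to the corner (lat0, lon0)."""
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(lat0))
    row = int(((lat - lat0) * KM_PER_DEG_LAT) // cell_km)
    col = int(((lon - lon0) * km_per_deg_lon) // cell_km)
    return row, col


def grid_density(trips, lat0, lon0):
    """Count origins and destinations per (cell, hour).

    Each trip is a tuple (o_lat, o_lon, d_lat, d_lon, hour).
    """
    counts = Counter()
    for o_lat, o_lon, d_lat, d_lon, hour in trips:
        counts[(grid_id(o_lat, o_lon, lat0, lon0), hour)] += 1
        counts[(grid_id(d_lat, d_lon, lat0, lon0), hour)] += 1
    return counts
```

The same counting applies to clusters by replacing `grid_id` with a function that returns the cluster label of a point.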

Temporal Components.
The temporal features are significant factors to understand travel time changes through time variation. Trip duration is affected by several temporal factors, which may occur daily, weekly, or seasonally [44]. The rhythm of commuters' flow over workplaces, schools, and even public places is an example of activities that cause traffic jams at various times. To this end, the following temporal features were extracted from the traffic datasets, using the one-hot encoding (OHE) and label-encoding techniques as follows:
(i) We represent the day of the month as a label value from 0 to 30.
(ii) We represent weekdays as a categorical value from 0 to 6.
(iii) We represent hours of the day as a label value from 0 to 23.
(iv) Working days and weekends take 0 or 1.
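The four temporal features listed above can be derived directly from a pick-up timestamp. The sketch below is illustrative: the feature names, the Monday = 0 weekday convention, and the choice of 1 for weekends (0 for working days) are assumptions, since the text does not fix them.

```python
from datetime import datetime


def temporal_features(ts):
    """Label-encoded temporal features for one pick-up timestamp."""
    return {
        "day_of_month": ts.day - 1,             # 0-30
        "day_of_week": ts.weekday(),            # 0-6, Monday = 0 (assumed convention)
        "hour": ts.hour,                        # 0-23
        "is_weekend": int(ts.weekday() >= 5),   # 1 = Saturday/Sunday (assumed)
    }


def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


features = temporal_features(datetime(2016, 1, 1, 8, 30))
weekday_ohe = one_hot(features["day_of_week"], 7)
```

Label encoding keeps the tensor compact, while the one-hot form avoids imposing an artificial order on categorical values such as the weekday.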

External Components.
The external factors were divided into two parts: weather conditions and public holidays. Generally, the trip is affected by one or more of the following weather conditions (heavy rain, snow, storms, and so on). Different weather conditions can also result in varying travel times with similar spatial patterns and different interval times. Hence, the weather is considered as an important external factor in this work. Table 2 shows the weather data categories, which are classified into 10 types (sunny, cloudy, rainy, windy, and so on). Also, three more features are used to describe the weather situation of trips in terms of extreme weather conditions (snowing, raining, or foggy). There are 16 different types of weather conditions, according to the historical weather data provided in [45].
Thus, this classification process makes similar weather conditions much closer and helps to reduce the data dimensions. Because of variable weather conditions, the same spatial locations in terms of OD-grids may not have the same trip times, as shown in Figure 3. This figure compares two different days and shows that, between the same origin and destination grids, travel takes less time when the weather is regular than during hours characterized by extreme weather conditions.
Besides the factors mentioned above, the traffic patterns during public holidays and events can differ from those of the daily routine, due to increased outdoor activities or variation in daily traffic patterns, leading to extreme traffic jams. As a result, two subcategorical features are derived from the NYC and China public holiday datasets to represent whether the day is a holiday or not. Eventually, the external features are classified into two types: categorical features, encoded using the OHE technique, and discrete features. Furthermore, data standardization and scaling techniques for features have been utilized.

JSTC Model Architecture.
Our proposed framework mainly comprises three modules, as shown in Figure 4. The first block is designed to learn the dependencies between spatial and temporal components and capture their complicated relations. This block also helps to capture the correlation between grids and clusters for OD-trips during different time patterns, especially when observing adjacent locations' properties and dealing with the sparse data. After processing the external features, we combine all feature representations and pass them to the last block, which is the multi-head attention module, to learn the attentional weights of all features based on their contribution to the output. Next, we describe each part in detail.

Spatial Self-Attention Module.
In this section, we develop a self-convolutional attention mechanism that captures the correlations across different spatial features and learns their attentional weights. To this end, we adopt a 1D convolutional layer followed by self-attention heads. Figure 5 shows our proposed spatial self-attention module, and the spatial feature's tensor includes a pair of GPS coordinates, a pickup cluster, a drop-off cluster, a pickup grid, a drop-off grid, distance, and speed {D and S}. First, we reshape the input into three dimensions as an input for the 1D convolutional layer. To do so, we used a reshape function to reshape the 2D features vector into the 3D tensor $\delta_i$. Then, we used the convolution filter and kernel size as shown in Figure 5 to handle the spatial input tensor. Thus, we can get Query ($Q_\delta$), Key ($K_\delta$), and Value ($V_\delta$) as an output from each 1D-Conv layer followed by the ReLU activation function as follows:
where $\chi$ denotes the tensor input, $i$ is the convolution processed index, $j$ refers to the filter ($f$) position, and $\kappa$ is the kernel size.
($\omega_f^j$) represents the filter ($f_j$) weight matrix, and ($\beta^j$) is the learnable parameter (bias). We set the filter and kernel size to 1 and 3, respectively. We set the padding to "same" to avoid dropping some information and verify that all inputs are completely represented. Therefore, the weight matrix ($U$) between $K_\delta$ and $Q_\delta$ is computed by using the scaled dot attention function, and then the final attention score ($W_\delta$) is computed as in the following equation:
Afterward, the final attention output is obtained over the multiple self-attention layers, and then we flatten the output of the spatial self-attention block and concatenate it with the temporal correlation output.
Figure 4: Joint spatial and temporal correlation (JSTC) mechanism architecture combines spatiotemporal correlation block, which includes the spatial self-attention module (SSAM) and residual dilated convolutional module (RDCM). Then, we used a multi-head attention module (MHAM).

Residual Dilated Convolutional Module.
The temporal convolutional module aims to capture the temporal patterns. Several previous studies have considered the temporal dependencies of traffic forecasting tasks. In [46,47], the RNN architecture was applied to capture temporal relations, while references [48,49] utilized the gated recurrent units (GRUs) and long short-term memory (LSTM) networks to model the temporal components on traffic pattern fluctuations. Although these approaches have shown good performance, they still suffer from many problems (e.g., exploding/vanishing gradients, time-consuming training, and some other limitations in modelling long sequences). Inspired by the recent success of the temporal convolutional network (TCN), we propose a residual dilated convolutional module (RDCM), which comprises multiple dilated 1D-Conv layers stacked together as shown in Figure 6. We exploit the TCN's ability to expand the domain of the convolutional operations by adjusting the dilation rate parameter on each layer. As with the preprocessing used for the spatial components, we construct a 3D tensor ($\tau_i$) for the temporal features. The traffic patterns during the different periods of the day are highly affected by the traffic flow in each region. Accordingly, while investigating the dependencies of temporal factors, some spatial features should be considered due to their significant impact on the output. In our case, the grid and cluster density scores for both pick-up and drop-off, which are measured hourly, have been adopted as supplementary features for the temporal correlation modelling. The temporal component of each trip record is thus represented by the ($\chi_{\tau_i}$) tensor, which includes the temporal features and the supplementary features.
In order to capture the interactions and patterns of temporal features in terms of long- and short-range dependencies between the input features, we built three dilated convolutional layers with dilation rates of {1, 2, 4} to address two key points: avoiding backpropagation issues (vanishing or exploding gradients) and expanding the receptive field to cover the entire input representation with a shallow hierarchy of layers. Thus, to achieve the normal convolution operation, we set the dilation rate $d_r = 1$ and the kernel size $K = 3$ in the first layer, followed by ReLU and dropout layers; the output is then used as the input for the next dilated convolution layer with $d_r = 2$ and $K = 3$. The last layer uses $d_r = 4$ and $K = 5$. Figure 6(b) illustrates the dilated convolution steps. As a result, relationships between the temporal factors at different ranges (long and short) are taken into account, and the features are represented efficiently without missing important information. The dilated convolutional layers were combined into a residual block, and an element-wise concatenation layer was used to add the last output to the input ($\chi_i^{\tau}$), which can improve training and maintain an optimal feature correlation distribution. In this paper, we formulated the RDCM block operations as follows:

where $d_r$ denotes the dilation rate and $s$ denotes the filter size. Eventually, the temporal correlation output is concatenated with the previous spatial correlation outputs and passed to a multi-head attention mechanism.
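The layer settings above (kernel sizes 3, 3, 5 with dilation rates 1, 2, 4) can be checked with a small NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each layer has a single channel, "same" zero padding is assumed, dropout is omitted, and the skip connection is modelled as a plain element-wise addition.

```python
import numpy as np

KERNEL_SIZES = (3, 3, 5)   # kernel size K of the three layers (from the text)
DILATIONS = (1, 2, 4)      # dilation rate d_r of the three layers (from the text)


def dilated_conv1d_same(x, kernel, dilation):
    """Single-channel dilated 1D convolution with "same" zero padding.

    Assumes an odd kernel length so the padding is symmetric.
    """
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)                      # zero padding on both sides
    out = np.zeros(len(x), dtype=float)
    for j, w in enumerate(kernel):
        # tap j looks at the input shifted by j * dilation positions
        out += w * xp[j * dilation: j * dilation + len(x)]
    return out


def receptive_field(kernel_sizes, dilations):
    """Number of input steps seen by one output step of the stacked layers."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


def residual_dilated_block(x, kernels, dilations):
    """Stack of dilated conv + ReLU layers with a skip connection (assumed additive)."""
    h = x
    for kernel, d in zip(kernels, dilations):
        h = np.maximum(dilated_conv1d_same(h, kernel, d), 0.0)
    return x + h


print(receptive_field(KERNEL_SIZES, DILATIONS))  # 23

# Empirical check: a unit impulse spreads over exactly 23 positions.
impulse = np.zeros(41)
impulse[20] = 1.0
ones_kernels = [np.ones(k) for k in KERNEL_SIZES]
spread = np.count_nonzero(residual_dilated_block(impulse, ones_kernels, DILATIONS))
```

With these settings, three shallow layers already cover 23 consecutive input steps, which is the receptive-field expansion the dilation rates are meant to provide.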

Multi-Head Attention Module.
The multi-head attention mechanism, illustrated on the right side of Figure 4 and adopted from [42], is used in our model to obtain accurate prediction results. First, because external features affect travel time as mentioned before, we apply a fully connected layer followed by ReLU and dropout layers as a subblock to represent the external factors (weather details and public holidays), and then we combine the external features' representation vector with the vector that represents the spatial and temporal correlation outputs (for more details, see Sections 3.4.1 and 3.4.2). This mechanism enhances our model's ability to learn the attentional weights of various features using multiple attention layers. It also makes the training process robust and fast, since processing is distributed across multiple ($H^{Att}_h$) heads. The idea is to learn the attentional weights of all features based on their contribution to the output. In this study, the attention scores represent the intercorrelations between the input features and the target (travel time). Therefore, we applied a scaled dot-product function to compute the attention score based on the contribution of each feature to the output target. To do so, we constructed query ($Q$), key ($K$), and value ($V$) vectors, which include the feature representations. First, we obtain the scores (weights) between each feature in $Q$ and the set of keys $K$; a second dot product then applies these scores to the values $V$ to calculate the final attention output. We formally define this process as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

$$\mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right)$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right)W^{O}$$

where $QW_i^{Q}$, $KW_i^{K}$, and $VW_i^{V}$ are the query, key, and value projections for each head $i$, $d_k$ is the dimension of the keys, and $W^{O}$ is the weight matrix that combines the heads' outputs. Here, $h$ is the number of attention heads; after several trials with $h \in \{4, 6, 8, 10\}$, we adopted $h = 6$, which leads to fast performance and achieves the best results.
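The scaled dot-product and multi-head steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the model's actual code: the input is a random matrix, the model width of 48 and the number of feature vectors are hypothetical, the projection weights are random instead of learned, and queries, keys, and values are all derived from the same input (self-attention), which is an assumption. Only the head count of 6 comes from the text.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, batched over the leading (head) axis."""
    d_k = q.shape[-1]
    weights = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k))
    return weights @ v, weights


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Project the input, attend per head, concatenate, and apply W^O."""
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (n, d_model) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    heads, weights = scaled_dot_attention(q, k, v)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o, weights


rng = np.random.default_rng(0)
N_FEATURES, D_MODEL, NUM_HEADS = 5, 48, 6   # 5 and 48 are illustrative; 6 heads as in the text
x = rng.normal(size=(N_FEATURES, D_MODEL))
w_q, w_k, w_v, w_o = (rng.normal(size=(D_MODEL, D_MODEL)) * 0.1 for _ in range(4))

out, attn = multi_head_attention(x, w_q, w_k, w_v, w_o, NUM_HEADS)
print(out.shape, attn.shape)  # (5, 48) (6, 5, 5)
```

Each row of `attn` sums to 1, so every head distributes a unit of attention across the feature vectors. The model width must be divisible by the number of heads, which is one practical constraint when comparing the candidate values of $h$.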
Finally, we use a dense layer followed by a linear operation to get the final prediction results ($y_i^{OD_{\tau}}$) as follows:

where ($\varphi$) is the linear activation function and ($W_f$) and ($b_f$) are learnable parameters.

Experimental Results and Analysis
We used three large-scale traffic datasets (NYC, Chengdu, and Xi'an) in our experiments. Section 3.2 describes the data analysis and preprocessing in detail. We randomly split each dataset into 80% for training and 20% for testing. The training set was then divided into two subsets: 70% for model training and 30% for validation. We searched over learning rates of (0.01, 0.001, and 0.0001), batch sizes of (128, 256, and 512), dropout rates of (0.1, 0.2, and 0.3), and numbers of attention heads ($h$) of (4, 6, 8, and 10). The optimal parameter values are as follows: the learning rate is 0.001, the number of training epochs is 60, the number of attention heads is 6, and the batch size is 512. In addition, to reduce overfitting, we applied both a kernel regularizer (L2 norm) and dropout (0.2). We also adopted the Adam optimizer with a linear activation function.
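The split and search ranges above can be written down as a short sketch. The helper name, the random seed, and the idea of enumerating the ranges as a full grid are assumptions made for illustration; the text only lists the candidate values and the chosen ones.

```python
import numpy as np
from itertools import product


def split_indices(n_records, seed=0):
    """Random 80/20 train/test split, then 70/30 train/validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_records)
    n_test = int(round(0.20 * n_records))
    test, train_full = idx[:n_test], idx[n_test:]
    n_val = int(round(0.30 * len(train_full)))
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test


# Candidate values listed in the text; a full grid over them is an assumption.
SEARCH_SPACE = {
    "learning_rate": (0.01, 0.001, 0.0001),
    "batch_size": (128, 256, 512),
    "dropout": (0.1, 0.2, 0.3),
    "num_heads": (4, 6, 8, 10),
}
GRID = [dict(zip(SEARCH_SPACE, values)) for values in product(*SEARCH_SPACE.values())]

# Values reported as optimal in the text.
BEST = {"learning_rate": 0.001, "batch_size": 512, "dropout": 0.2, "num_heads": 6}

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 560 240 200
```

In other words, the two-stage split leaves 56% of all records for fitting, 24% for validation, and 20% for testing.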

Evaluation Metrics.
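To evaluate our model, we use two common prediction metrics. Mean absolute percentage error (MAPE) is calculated as

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%$$

Mean absolute error (MAE) is calculated as

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

where $y_i$ and $\hat{y}_i$ are the actual and predicted OD-trip durations in seconds, respectively, and $N$ is the total number of records in the test dataset.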
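As a quick worked example of the two metrics, the following sketch computes them with NumPy. The function names and the three sample trips are made up for illustration and are not taken from the datasets.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


def mae(y_true, y_pred):
    """Mean absolute error, in the unit of the inputs (seconds here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Hypothetical trip durations in seconds.
actual = [600, 900, 1200]
predicted = [660, 810, 1200]

print(mae(actual, predicted))             # 50.0
print(round(mape(actual, predicted), 2))  # 6.67
```

The absolute errors are 60, 90, and 0 seconds, so the MAE is 50 seconds; the relative errors are 10%, 10%, and 0%, giving a MAPE of about 6.67%. MAPE is undefined for trips with a zero actual duration, so such records would need to be filtered beforehand.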

Comparison of Various Models' Results with JSTC Model.
To show the performance efficiency of our model, we compared it with the following models:
(i) LRM: we applied the LR model in [20] with almost all features except the grid and cluster, which have a high dimension and cause overflow.
(ii) XGBoost: a machine learning model widely used for both classification and regression problems. Although XGBoost with deeper trees may lead to better predictions, following [50] we set the max-depth parameter between 4 and 6 to avoid overfitting.
(iii) LightGBM: the LightGBM model is based on a decision tree algorithm with leaf-wise and level-wise growth. This model is more appropriate for large datasets with high-dimensional features [51]. Accordingly, we set the LightGBM parameters the same as in [21].
(iv) ST-NN: a spatiotemporal model proposed in [19], which combines two DNN modules to predict the trip distance and then uses this distance to predict the travel time.
(v) TTE-Ensemble: the collaborative model proposed in [21], which combines machine learning and neural network (GBDT and DNN) modules for modelling multi-modality data to predict the OD-trip travel time.
(vi) FMA-ETA [18]: a deep learning model based on a multi-self-attention technique integrated with a feed-forward network (FFN) structure for capturing spatial and temporal dependencies and obtaining TTF.
(vii) STTNs [37]: two spatial-temporal blocks are integrated into an approach based on a graph neural network and a transformer (STTNs), which jointly investigates the dynamic spatial and temporal dependencies to enhance the accuracy of traffic flow prediction.
Table 3 shows our model's results compared with the other models in terms of MAPE and MAE for the NYC, Chengdu, and Xi'an datasets. The results show that our model outperforms the other approaches. As previously mentioned, we divided the comparative models into two parts (ML and DL models).
The ML models (LR, XGBoost, and LightGBM) show worse accuracy than the DL models because these simple statistical ML algorithms have difficulty modelling the non-linear relations of complex traffic patterns. We notice that the LR model gives the worst results, with MAPE values of (26.12, 24.37, and 25.85) and MAE values of (168.34, 176.33, and 197.14 sec) for NYC, Chengdu, and Xi'an, respectively. The error rate (MAE) was reduced by (14.4, 14.14, and 9.11 sec) and (18.62, 20.94, and 20.88
On the other hand, ST-NN achieved the weakest results among the DL models because it only utilizes two MLP blocks. In comparison, our model reduced the errors (MAPE) by at least about 7% on NYC and Chengdu and by 6.31% on Xi'an. Furthermore, our model has also shown remarkable superiority over the TTE-Ensemble model by reducing the errors by (5.19%, 5.5%, and 4.73%) on NYC, Chengdu, and Xi'an, respectively. We can also observe that the ST-NN and TTE-Ensemble models achieved better results than the ML algorithms (LRM, XGBoost, and LightGBM). This is because deep learning approaches consider the non-linear relations between the variables. However, ST-NN applies two DNN modules that first estimate the trip distance and then use this distance to predict the time, which means it adopts only the spatial component (distance) while ignoring the temporal patterns. The TTE-Ensemble model was built by combining the DNN module with the ML (GBDT) model. These models are not sufficient to capture the complicated correlations.
Finally, as can be seen from the table, the FMA-ETA and STTN models give results that are closer to those of our proposed model because these models also adopt attention mechanisms to capture the non-linear correlations between the spatial and temporal features. The auxiliary spatial features that influence traffic patterns play a significant role when considering the dynamic scales of inner spatial and temporal correlations.
Moreover, to validate our model, two different datasets covering the morning peak (7 to 10 AM) and the evening peak (5 to 8 PM) have been used to test all models during these two periods in terms of MAPE and MAE, as shown in Tables 4 and 5 for NYC, Chengdu, and Xi'an. Prediction errors are typically higher during these two peak periods than during non-peak periods. These results demonstrate that our model provides more accurate predictions than the other models, even during the morning and evening peak hours. In addition, Figure 7 compares the actual values with the predictions of all models for 50 randomly selected test trips on NYC, Chengdu, and Xi'an. Each point on the x-axis represents a trip from the test set, while the y-axis indicates the trip duration in seconds.
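One simple way to build the two peak-period test sets is to label each trip by its pick-up hour. The sketch below assumes that trips are assigned by pick-up time and that each window includes its start hour and excludes its end hour; neither detail is specified in the text, and the function name is hypothetical.

```python
from datetime import datetime

MORNING_PEAK = (7, 10)    # 7 to 10 AM, end hour excluded (assumption)
EVENING_PEAK = (17, 20)   # 5 to 8 PM, end hour excluded (assumption)


def peak_period(pickup_time):
    """Return 'morning', 'evening', or 'off_peak' for a pick-up timestamp."""
    hour = pickup_time.hour
    if MORNING_PEAK[0] <= hour < MORNING_PEAK[1]:
        return "morning"
    if EVENING_PEAK[0] <= hour < EVENING_PEAK[1]:
        return "evening"
    return "off_peak"


# Hypothetical pick-up timestamps.
pickups = [
    datetime(2016, 1, 4, 8, 30),
    datetime(2016, 1, 4, 12, 0),
    datetime(2016, 1, 4, 17, 0),
]
labels = [peak_period(t) for t in pickups]
print(labels)  # ['morning', 'off_peak', 'evening']
```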

Ablation Analysis.
We built our model based on three main components (SSAM, RDCM, and MHAM). In addition, we consider external factors that influence travel time in order to improve the accuracy of our results. Therefore, additional experiments were conducted to verify the contribution of each component to our prediction task. The ablation models we use in this analysis are as follows:
(1) Without SSAM: in this model, we removed the spatial self-attention module (SSAM) and applied only the RDCM and MHAM modules with a fully connected layer and an output layer.
(2) Without RDCM: in this model, we removed the RDCM module and applied only the SSAM and MHAM modules with a fully connected layer and an output layer.
(3) Without externals: to verify the effect of external factors (weather and public holidays), we removed the block responsible for representing these factors' dependencies.
(4) Without MHAM: we removed the multi-head attention module (MHAM). After obtaining the correlations of the spatial and temporal components, we concatenate these blocks' outputs with the external features' representations and then apply the fully connected and output layers directly.
We should mention that MLP layers were adopted as an alternative to each module that was removed during ablations 1, 2, and 4, as shown in Table 6. However, the impact of the external factors was measured simply by removing their representation block in ablation 3. The results in Table 6 demonstrate that combining all modules in one model leads to better results. In contrast, removing some parts affects the process of capturing traffic pattern fluctuations. Removing the correlation mechanisms (SSAM and RDCM) increases the error rates more than removing the multi-head block, which means these two modules have a higher impact on our model since they are responsible for capturing the correlations of spatial and temporal traffic factors. In addition, external factors play an important role in improving our prediction results. In conclusion, these results emphasize the importance of each proposed block through its contribution to improving travel time prediction results.

Computational Cost Measurement.
We also measured the computational cost of our model. We compare the time consumption of our model with that of the deep learning-based models (ST-NN, TTE-Ensemble, FMA-ETA, and STTNs). Table 7 reports the average training and prediction time for one million trips (1M) with only one epoch on the NYC, Chengdu, and Xi'an datasets. Note that we performed all experiments on the same NVIDIA GPU (GeForce GTX 1050 Ti) with 4 GB of memory. Also, we set the batch size to 512 for the training phase of all models. We observe that models with more complicated structures took more training time than the simpler ones. One plausible reason is that this additional complexity is what allows these models to give more accurate prediction results. In comparison, the computation time of our model is close to that of the STTN model because both models have a similar structure.
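Averaged wall-clock timings of this kind can be collected with a small harness such as the one below. It is a generic sketch using only the Python standard library: the helper name, the number of repeats, and the stand-in workload are hypothetical, and on a GPU the timed function would also need to wait for the device to finish before the clock is read.

```python
import time


def average_runtime(fn, repeats=3):
    """Average wall-clock time in seconds of calling fn() `repeats` times."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


calls = []


def one_epoch_stand_in():
    """Placeholder for one training epoch (or one prediction pass)."""
    calls.append(1)
    sum(i * i for i in range(100_000))


avg_seconds = average_runtime(one_epoch_stand_in, repeats=3)
print(f"average time per run: {avg_seconds:.4f} s")
```

Using the same hardware, batch size, and number of trips for every model, as described above, is what makes the resulting averages comparable.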

Conclusion
In this paper, we first discussed the various characteristics of traffic patterns that affect travel time. Then, we presented a mechanism for capturing interactions between spatial and temporal factors based on self-convolutional attention and dilated convolutional techniques. In addition, we adopted spatial auxiliary features and integrated them with the temporal features, which play a significant role in capturing the dynamic traffic patterns and their correlations. Furthermore, we applied a multi-head attention mechanism to learn the attentional weights of the spatial, temporal, and external features based on their contribution to the output and speed up the training process. Extensive experiments using three large-scale real-world traffic datasets (NYC, Chengdu, and Xi'an) have shown that our JSTC model outperforms prior methods.

We denote our model's results in bold font as the best scores for each metric.