Multitask Learning with Graph Neural Network for Travel Time Estimation

Travel time estimation (TTE) is widely applied for ride dispatching, ride-hailing, and route navigation. Even for a given trajectory, the travel time is affected by many spatial-temporal factors, including static ones such as distance, road type, and so on and dynamic ones such as speed, traffic condition, and so on. Challenges of accurate estimation lie in proper representation of these spatial-temporal factors and more importantly capturing the complex relationship among them for TTE. To tackle such challenges, we present a framework based on the fact that the travel time of each road segment is affected by its adjacent segments. It features a graph convolutional neural network and a recurrent neural network for basic TTE for each road segment and a graph attention network for the relation to estimations on the adjacent road segments. Finally, a multitask learning model is proposed for the travel time of the entire given path and that for each road segment. Experimental results on real taxi trajectory datasets of two cities show that the percentage estimation error of the new approach is well controlled at 13.91% and the proposed method outperforms three state-of-the-art methods significantly.


Introduction
Travel time estimation (TTE) is a classic yet challenging problem using trajectory data. In urban cities, it plays a key role in route planning [1], vehicle dispatching [2], and ride-hailing [3] applications, such as Uber, Lyft, and DiDi. The accuracy of TTE is vital to user stickiness and activity. According to [4], inaccurate travel time estimation leads to 28.4% car-booking cancellation.
There exist many factors that affect the accuracy of TTE, which can be summarized into two categories: the static ones, such as road type (e.g., highway or byway), road width, speed limit, and in-degree and out-degree, and the dynamic ones, such as weather, accidents, traffic speed, time interval, and so on. It is worth noting that factors of road segments may have implicit dependencies, which affect TTE in a very complex way. For example, the speed on a road segment may be affected by an adjacent congested segment since the vehicles have to slow down and wait.
To accurately estimate the travel time, such factors should be considered together; however, there are three challenges in doing so. (1) How to investigate the effects of these factors on the travel time, e.g., how road type (main road or secondary road) affects the estimation. (2) How to encode the complex factors and learn effective features from them, especially for the implicit ones, such as the traffic condition characteristics. Inadequate understanding of these factors may cause inaccuracy in the estimation. (3) How to fuse spatial-temporal correlation factors for travel time estimation. Among all these factors, traffic condition is the most important. Existing work on TTE [5][6][7][8][9] mainly aims to estimate the travel time of a path considering factors such as traffic flow, weather condition, and road type, but lacks a study of the aforementioned implicit dependency among road segments.
To address the challenges, we present dependent relationship travel time estimation (DRTTE). We first analyze the relationship among various factors that may affect the estimation of the travel time. Based on the analysis, we then learn several features for TTE via a sequence of graph neural networks. We use a graph convolutional network (GCN) to obtain the spatial feature, followed by a gated recurrent unit (GRU) to capture the spatial-temporal feature. The extracted features, when combined with auxiliary information such as weather, are used to learn the traffic condition representation.
The traffic condition representation, along with the road segment information, generates a vector for the speed on each road segment. A graph attention network (GAT) is then applied to update the speed vector considering the dependency of the road segments. With a multitask learning model, these new speed vectors are used in the final step for travel time estimation over all road segments and the entire path.
We highlight the following contributions in this work: (i) Learning the road segment traffic conditions by exploring the static and the dynamic features. (ii) Proposing a multitask learning framework for learning the feature of each factor by exploiting the dependency and fusing them together to predict the travel time. (iii) Conducting extensive experiments to confirm the effectiveness of our proposed solution in comparison with the state-of-the-art baselines.
The rest of the paper is structured as follows. State-of-the-art solutions for TTE and related deep learning algorithms are reviewed in Section 2. The problem statement is given in Section 3. The methodology and computational framework are described in Section 4 and evaluated in Section 5. Finally, conclusions and discussions are given in Section 6.

Related Work
Machine learning and deep learning have been widely applied to spatial-temporal problems, including path inference [10], path query [11], path selection [12][13][14], crowdsourcing analysis [15,16], path traffic [5,17], and travel time estimation. However, the above work aims to infer, query, or select a path, and less attention is paid to estimating the travel time, which depends on the relationship among road segments. Our method estimates travel time at two levels, each road segment and the whole path, whereas the state-of-the-art (SOTA) methods focus on the whole path only. In recent years, many new approaches to TTE have appeared, and new machine learning methods that encode spatial-temporal features have been applied to TTE problems. ConLSTM [18] combined CNN and LSTM. The works in [7,9] proposed data-driven regression models considering complex factors. DEEPTRAVEL [9] extracted multiple features of TTE for a path. The works in [6,19] utilized only GPS data for TTE. However, limitations such as path scale, auxiliary information, correlation, and dependency among road segments are not well addressed, which reduces accuracy.

Definitions
Definition 1 (directed graph for road network). A road network is represented as a directed graph $G = (V, E, A)$, where $V$ is the vertex set of road segments with order $N = |V|$, $E$ is the edge set of connectivity between road segments, and $A$ is an $N \times N$ adjacency matrix that captures how the directed edges are connected.
Attributes of a road segment include static spatial geographic ones, such as ID, length, and direction, and a dynamic one, the vehicle speed as a function of time.
Several feature tensors of $G$ are defined based on the above attributes. They are $F \in \mathbb{R}^{N \times M \times J}$ for the original geographic features, its slice at time $t$, $F_t \in \mathbb{R}^{N \times M}$, and its static variant $F_s \in \mathbb{R}^{N \times (M-1)}$. Correspondingly, after representation learning, these become the feature tensor $S \in \mathbb{R}^{N \times D \times J}$, the static feature matrix $S_s$, and the dynamic feature matrix $S_t$.
Here $M$, $J$, and $D$ are the number of attributes of the road segment, the number of time steps of data available, and the number of features of the road segment after spatial representation learning, respectively. The traffic condition $C \in \mathbb{R}^{N \times D \times K}$ fuses the spatial-temporal correlations and the auxiliary data features.

Problem Statement
Problem Definition. For a given path $p$ and a departure time $ts$, a travel time query is to be performed. A multitask learning framework called DRTTE is proposed, which returns the travel time $t_i$ for each road segment $v_i$ and $t_{en}$ for the entire given path simultaneously.
Subproblem Definition. Prediction of traffic condition is a subproblem of TTE of each road segment and hence is carried out along with the prediction of spatial-temporal features.
The kernel of the feature prediction is the spatial-temporal correlation $st$ on the object road segment. For the time series $st$, the prediction task is to obtain the values of $K$ future time steps from the given values of $J$ past time steps:

$$[st_{t+1}, \ldots, st_{t+K}] = \Phi\left(G; (F_{t-J+1}, \ldots, F_t)\right)$$

where $F_t$ is the observation feature matrix at time $t$ on the object road segment and $\Phi$ denotes the mapping to be learned. The spatial-temporal features $st_{t+1}, \ldots, st_{t+K}$ are thus predicted from the dynamic feature matrices of the past $J$ time steps in the time sequence.
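The windowing implied by this formulation can be illustrated with a small sketch. It is not the authors' code: the helper name `make_windows` is hypothetical, and a single scalar speed per time step on one road segment stands in for the full feature matrix $F_t$.

```python
def make_windows(series, J, K):
    """Split a time-ordered list into (past J steps, next K steps) pairs."""
    if J <= 0 or K <= 0:
        raise ValueError("J and K must be positive")
    pairs = []
    # `end` is the index of the first future step in each window.
    for end in range(J, len(series) - K + 1):
        past = series[end - J:end]      # the J observed steps
        future = series[end:end + K]    # the K steps to be predicted
        pairs.append((past, future))
    return pairs


# Illustrative speeds (km/h) on one road segment; the values are made up.
speeds = [30, 32, 31, 28, 25, 27, 33]
windows = make_windows(speeds, J=3, K=2)
for past, future in windows:
    print(past, "->", future)
```

Each pair is one training sample: the model sees the `past` list and is asked to produce the `future` list.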

Computational Intelligence and Neuroscience
The path travel time $t_{en}$ depends on the path length and the path speed $sp_{en}$. The speed $sp_{en}$ depends on the traffic condition $C$ and the static feature $S_s$ defined previously.

Methodology
To solve the TTE problem defined above, a multitask learning framework is proposed, which consists of three major modules, namely, the traffic condition module, the speed module, and the travel time module. Figure 1 shows the logical structure of the framework. During the training phase, the features of traffic condition $C$ and speed $sp$ are extracted. During the test phase, $t_{en}$ is estimated for the given $p$ and $ts$.
The three kernel modules are specified below. Their inputs and outputs are detailed in Table 1. As listed in Table 1, with the original feature matrix $F_s$ and the adjacency matrix $A$, the static spatial geographic representation of the road segments is captured as $S_s$. Similarly, from $F_t$ and $A$, the dynamic feature $S_t$ can be captured, resulting in $st$, the spatial-temporal representation of road segments over time. The traffic condition $C$, as the core of the framework, is then obtained by fusing auxiliary features (e.g., weather) with $st$.

Spatial Feature Capturing.
Acquiring the traffic condition $C$ is a key issue in TTE. A road segment traffic condition module is designed to learn it using the adjacency matrix $A$ and the other feature matrices. Note that local characteristics of the road network are missing in the original matrices. To fix this, a convolution is used to obtain the spatial characteristics together with the structural information of road segments. However, because road network data do not lie on a regular grid, the intricate topological structure of the road network and the spatial dependency of the road segments cannot be captured by a traditional convolutional neural network (CNN).
Instead, the graph convolutional network (GCN) [20] is adopted for this purpose, applying a linear transformation after convolving each road segment with its surrounding road segments. With a filter in the spectral domain, the topological structure of the road network is captured at the same time, and the spatial dependence at a fixed time slice can be learned. Graph convolutional filters are used to extract the local features shared by topologically adjacent elements in graph $G$. With GCN filters, the input stochastic weights can be "propagated" to adjacent and correlated edges during convolutions via the road network topology.
Hence, both static and dynamic spatial features are captured by the GCN model from the corresponding feature matrices of road segments, i.e., $S_s$ from $F_s$ and $S_t$ from $F_t$. Mathematically, they can be described as

$$S_s = \sigma\left(\hat{A} F_s W\right), \qquad S_t = \sigma\left(\hat{A} F_t W\right)$$

where $\hat{A} = D^{-1/2} A D^{-1/2}$ denotes the graph convolution filter, $D$ is the degree matrix, $W$ is the weight matrix of the respective layer, and $\sigma(\cdot)$ represents the activation function. In Table 1, $s_s$ is a row of $S_s$ and denotes the learned static spatial vector for a road segment; $s_t$ is a row of $S_t$ and denotes the learned dynamic spatial vector for a road segment.
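A single graph convolution of this kind can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the helper names are hypothetical, ReLU is assumed for $\sigma$, and the three-segment chain uses a symmetric adjacency matrix for simplicity even though the road network is defined as directed.

```python
import math


def normalize_adjacency(A):
    """Return D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix of A."""
    n = len(A)
    deg = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def relu(X):
    return [[max(0.0, v) for v in row] for row in X]


def gcn_layer(A, F, W):
    """One graph convolution: sigma(A_hat F W), with sigma assumed to be ReLU."""
    return relu(matmul(matmul(normalize_adjacency(A), F), W))


# Three road segments in a chain 0 - 1 - 2 (illustrative values only).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
F = [[1.0], [2.0], [3.0]]   # N = 3 segments, M = 1 attribute
W = [[1.0, -1.0]]           # maps M = 1 attribute to D = 2 features
S = gcn_layer(A, F, W)
print(S)
```

Each output row mixes a segment's neighbours through the normalized adjacency, so the middle segment, which has two neighbours, receives contributions from both ends of the chain.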

Spatial-Temporal Feature Prediction.
The temporal feature is another key issue in the spatial-temporal correlation on each road segment. The collected observations form sequence data, which are commonly processed by a recurrent neural network (RNN). However, the traditional RNN has limitations for long-term prediction because of vanishing and exploding gradients. This problem is addressed by the long short-term memory (LSTM) and gated recurrent unit (GRU) models, both of which use a gating mechanism to retain as much long-term information as possible. Compared with GRU, LSTM takes longer to train because of its more complex structure and larger number of parameters. The GRU formulation is

$$\begin{aligned}
u_t &= \sigma\left(W_u [s_t, h_{t-1}] + b_u\right), \\
r_t &= \sigma\left(W_r [s_t, h_{t-1}] + b_r\right), \\
c_t &= \tanh\left(W_c [s_t, (r_t * h_{t-1})] + b_c\right), \\
h_t &= u_t * h_{t-1} + (1 - u_t) * c_t,
\end{aligned}$$

where $h_t$ is the hidden state at time $t$, $\sigma(\cdot)$ is the activation function defined as $\sigma(x) = (1 + \exp(-x))^{-1}$, $W_u$, $W_r$, $W_c$ are weights, $b_u$, $b_r$, $b_c$ are bias parameters, the operator $[\,,]$ represents vector concatenation, and $*$ denotes element-wise multiplication.
Consequently, the GRU model is chosen for temporal information processing. The spatial-temporal feature $st$ at time $t + K$ on the object road segment is predicted by a sequence of GRU cells, with the dynamic spatial feature vector $s_t$ at time $t$ as input. The spatial-temporal feature $st$ obtained in this way is fused with auxiliary data (such as weather $w$) to get the traffic condition $C = [st, w]$, where $[\,,]$ is again the concatenation operator. $C$ plays a key role in the remaining modules.
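The recurrence and the final fusion step can be sketched as follows. This is a hedged illustration, not the paper's code: it uses one common GRU formulation (update gate, reset gate, candidate state), and the weights, the input sequence, and the one-element weather encoding `w` are all made-up placeholders.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def gru_cell(s_t, h_prev, params):
    """One GRU step on input s_t and previous hidden state h_prev."""
    Wu, bu, Wr, br, Wc, bc = params
    x = s_t + h_prev                                   # concatenation [s_t, h_prev]
    u = [sigmoid(v) for v in affine(Wu, x, bu)]        # update gate
    r = [sigmoid(v) for v in affine(Wr, x, br)]        # reset gate
    x_reset = s_t + [ri * hi for ri, hi in zip(r, h_prev)]
    c = [math.tanh(v) for v in affine(Wc, x_reset, bc)]  # candidate state
    return [ui * hi + (1.0 - ui) * ci for ui, hi, ci in zip(u, h_prev, c)]


# Placeholder parameters: input size 2, hidden size 2 (values are arbitrary).
W = [[0.5, 0.0, 0.1, 0.0],
     [0.0, 0.5, 0.0, 0.1]]
b = [0.0, 0.0]
params = (W, b, W, b, W, b)

sequence = [[0.2, 0.4], [0.3, 0.1], [0.5, 0.5]]  # dynamic spatial vectors s_t
h = [0.0, 0.0]
for s_t in sequence:
    h = gru_cell(s_t, h, params)

w = [1.0]      # hypothetical weather encoding
C = h + w      # traffic condition C = [st, w] by concatenation
print(C)
```

The last hidden state plays the role of $st$ here; concatenating it with the auxiliary vector gives a traffic condition vector one element longer than the hidden state.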

Speed Module.
The purpose of the speed module is to learn the speeds on road segments in the next $K$ time steps. They are known to be highly dependent on the traffic condition. Hence, $C$ is taken as an influence factor of the speed feature $sp$ on a single road segment at time step $t + 1$, as shown in (4). The other factor is the static spatial feature $s_s$; in (4), $\theta_b$ is a parameter of the LSTM. Moreover, the connectivity of the road network implies that, along the whole path, the speeds of road segments are related, especially for adjacent segments. This higher level of dependency indicates that another update to the speed feature of the current road segment by its neighboring road segments at time $t + 1$ is necessary.
To handle this, a graph convolution variant known as the graph attention network (GAT) [21] is adapted to combine information about the neighbors of the object road segment. We embed the traffic from the road segment component into the path component using the GAT over time to get the traffic in the next time step along the path. The key idea is to weight the features of the neighbors using an attention mechanism. The attention coefficients from the GAT show the level of dependency between road segments, i.e., each weight is the level of influence of a neighbor on the target road segment. For a target road segment $v_j$ with $|N(j)|$ neighboring road segments, the graph has $|N(j)| + 1$ nodes. Features of the object road segment and its neighbors are combined. The dependency of the target road segment can be represented by the coefficients $\alpha^{ts}_{jk}$ from the GAT. Finally, the new speed on the target road segment $v_j$ at the next time step $ts + t$ is obtained through the activation function $\sigma$, as shown in Algorithm 1.
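In (5), the function $f(\cdot)$ applies the LeakyReLU nonlinearity (with negative input slope $\alpha = 0.1$). When expanded, the coefficients computed by the attention mechanism can be expressed as

$$\alpha^{ts}_{jk} = \frac{\exp\left(f\left(a^{\top}\left[W sp^{ts}_j \,\|\, W sp^{ts}_k\right]\right)\right)}{\sum_{l \in N(j)} \exp\left(f\left(a^{\top}\left[W sp^{ts}_j \,\|\, W sp^{ts}_l\right]\right)\right)}$$

where, following the GAT formulation [21], $W$ is a shared weight matrix, $a$ is the attention weight vector, $\|$ denotes concatenation, and $sp^{ts}_j$ is the speed representation of road segment $v_j$ at time $ts$. The traffic condition is affected by time and exhibits daily periodicity. Intuitively, $\alpha^{ts}_{jk}$ is the level of dependency, or weight, of road segment $v_k$ on road segment $v_j$. The above procedure for speed representation on the path is implemented in Algorithm 1. There are two major steps. The first (lines 7-11) captures the correlation between the object road segments and their neighboring road segments. The second (line 12) updates the speed of the object road segment in the next time step.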
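The two steps, scoring the neighbors and then updating the target segment, can be sketched as below. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: it follows the standard single-head GAT recipe, includes the target segment in its own neighborhood, assumes a sigmoid for $\sigma$, and uses made-up speed vectors, an identity weight matrix `W`, and an arbitrary attention vector `a`.

```python
import math


def leaky_relu(x, slope=0.1):
    return x if x > 0 else slope * x


def linear(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def attention_coefficients(sp, j, neighbors, W, a):
    """Softmax-normalized weight of every node in `neighbors` on target j."""
    hj = linear(W, sp[j])
    scores = []
    for k in neighbors:
        hk = linear(W, sp[k])
        e = sum(ai * v for ai, v in zip(a, hj + hk))  # a^T [W sp_j || W sp_k]
        scores.append(leaky_relu(e))
    m = max(scores)                                   # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_speed(sp, j, neighbors, W, a):
    """Attention-weighted sum of transformed neighbor features, then sigmoid."""
    alpha = attention_coefficients(sp, j, neighbors, W, a)
    agg = [0.0] * len(W)
    for coeff, k in zip(alpha, neighbors):
        hk = linear(W, sp[k])
        agg = [x + coeff * y for x, y in zip(agg, hk)]
    return [1.0 / (1.0 + math.exp(-x)) for x in agg]


# Illustrative speed features for three segments; segment 0 is the target.
sp = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 1.0, 1.0]
neighbors = [0, 1, 2]

alpha = attention_coefficients(sp, 0, neighbors, W, a)
new_sp = update_speed(sp, 0, neighbors, W, a)
print(alpha, new_sp)
```

With these placeholder values the third segment receives the largest coefficient, so its features dominate the updated speed vector of the target segment.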

Travel Time Module.
Travel time on a road segment ultimately depends on its length and travel speed. Here only the speed needs to be calculated since the length is fixed. Based on the multitask learning framework depicted in Figure 1, speed can be derived from the speed feature $sp$ learned by the previous two modules. This leads to travel time estimation for the road segments and the entire path with $sp_i$ and $sp_{en}$.
To achieve this, $sp_i$ on road segment $v_i$ is passed through fully connected layers and mapped to the segment-level estimate $t_i$. Here a two-layer model is adopted instead of the traditional LSTM model due to its better prediction. The speed feature $sp_{en}$ for the entire path is a comprehensive quantity over the $sp_i$ of each road segment. A simple way to obtain it is to use max pooling or mean pooling, i.e., $sp_{mean} = \frac{1}{n} \sum_{i=1}^{n} sp_i$. However, the largely uneven speed features across road segments lead to significant error with such pooling. To improve on this, the equal weight $1/n$ can be replaced by a set of specially designed weights, as in the following attention mechanism:

$$sp_{att} = \sum_{i=1}^{n} \alpha_i \, sp_i$$

where $\alpha_i$ is the normalized weight for the $i$-th road segment, with $\sum_{i=1}^{n} \alpha_i = 1$. The resulting $sp_{att}$ is then fed to residual fully connected blocks, which allow a very deep neural network to be trained [22]. Based on the above result, $sp_{en}$ is finally obtained via a simple multilayer perceptron (MLP).
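The difference between equal-weight pooling and attention pooling can be made concrete with a short sketch. How the raw attention scores are produced is not recoverable from the text, so here they are simply passed in as a hypothetical `scores` list and normalized with a softmax; the speed features are made-up values.

```python
import math


def mean_pool(vectors):
    """Equal-weight pooling: every segment contributes 1/n."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def attention_pool(vectors, scores):
    """Weighted pooling with softmax-normalized weights (they sum to 1)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * v for w, v in zip(weights, col)) for col in zip(*vectors)]
    return pooled, weights


# Speed features of three road segments on a path (illustrative values).
sp = [[10.0, 1.0], [20.0, 2.0], [60.0, 3.0]]

sp_mean = mean_pool(sp)
sp_att, alpha = attention_pool(sp, scores=[0.0, 0.0, 2.0])
print(sp_mean, sp_att, alpha)
```

When all scores are equal the attention pooling reduces to mean pooling; unequal scores let one segment, here the third, dominate the path-level feature.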

Experiment Settings.
The effectiveness and overall performance of the DRTTE model are evaluated on two large-scale real-world taxi datasets, namely Harbin and Chengdu. For convenience, continuous road networks are segmented into discrete parts, and two-dimensional GPS data are transformed accordingly, along with road segment IDs, by a map matching algorithm [23]. We use the Adam optimizer [24] to train the parameters of the model. The learning rate is 0.001. We select the best models by 3-fold cross-validation.

Evaluation Metrics.
The evaluation metrics we adopt include mean absolute percentage error (MAPE), root mean squared error (RMSE), and mean absolute error (MAE). MAPE expresses the estimation error as a percentage of the ground-truth value, while RMSE and MAE measure the absolute gap between the estimated and true values.
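For reference, the three metrics follow their standard definitions and can be computed as in the sketch below. The function names and the travel times in the example are illustrative and are not taken from the experiments.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    n = len(y_true)
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n


def rmse(y_true, y_pred):
    """Root mean squared error, in the unit of the inputs."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error, in the unit of the inputs."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


# Ground-truth and estimated travel times in seconds (made-up values).
truth = [100.0, 200.0, 400.0]
estimate = [110.0, 180.0, 400.0]

print(mape(truth, estimate), rmse(truth, estimate), mae(truth, estimate))
```

MAPE is scale-free, which makes short and long trips comparable, whereas RMSE penalizes large individual errors more heavily than MAE does.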

Comparisons with Baselines.
The performance of DRTTE is compared against baseline methods including ARIMA, TEMP [25], and DeepTTE [6]. Table 2 shows the details. ARIMA is the lowest performing method. TEMP gives medium performance and cannot cope with complicated traffic conditions either. TEMP and DeepTTE work better than ARIMA, but DRTTE outperforms them significantly on the two datasets. The reason is twofold. Firstly, static and dynamic spatial information can be obtained by DRTTE using graph convolution operations. Secondly, the dependency among the road segments with road properties can be captured by the graph attention network. These innovations help preserve the spatial-temporal characteristics of the traffic condition and the relationship between the road segments.

Algorithm 1: Speed representation on the path.
Input: sp, p, N, ts, t
Output: sp
Initialize matrix: SP randomly
Initialize vector: sp randomly
Initialize scalar: α randomly
// $sp \in \mathbb{R}^{N \times E \times K}$: the speed tensor.
// $sp \in \mathbb{R}^{N \times F \times K}$: the speed after the GAT operation.
// p: the given path.
// N: the neighbor road segments of the object road segment.
// ts: the start time of the given path.
// t: the returned travel time of the road segment.
// j: the ID number of the object road segment.
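To examine the contribution of each component, several variants of the model are compared: "LSTM"; "LSTM + GCN + GRU", with spatial-temporal information of the road segments; "LSTM + GCN + GRU + GAT", which adds the dependency among adjacent road segments; and the full DRTTE, with the attention mechanism in the multitask layer.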
Their effectiveness and efficiency are measured using the same set of metrics, with results given in Table 3. Several observations can be made. Firstly, "LSTM" exhibits the lowest performance. Secondly, "LSTM + GCN + GRU" is comparable to DeepTTE in performance because the two have similar model structures. However, the spatial-temporal feature time series prediction of each road segment is missing in DeepTTE. This limits its capacity for accurate travel time estimation of the entire path. Thirdly, "LSTM + GCN + GRU + GAT" performs better than DeepTTE since the latter does not model the dependency of the speed on adjacent road segments. Lastly, DRTTE performs even better than "LSTM + GCN + GRU + GAT" with the help of the attention mechanism. The above comparisons show that DRTTE is the best in the set of methods built on LSTM. It addresses both the spatial-temporal feature time series prediction of each road segment and the dependency of the speed on adjacent road segments, enabling it to estimate travel time more efficiently and with higher accuracy.

Travel Times and Distance Patterns.
The effects of travel distance on MAPE and MAE are depicted in Figure 2. The calculations are based on 9,870 road segments randomly selected from the validation datasets. Figure 2 shows that, with increasing path length, both DeepTTE and DRTTE lose accuracy to different degrees. This is expected, since the uncertainty of the traffic condition increases with the length of the path, which inevitably degrades the performance of any model. However, the percentage estimation error of DRTTE is well controlled (13%∼20%) for intermediate lengths (2∼7 km), while the corresponding range for DeepTTE is 17%∼30%. Also, in the field test, the MAE of DRTTE is around 2.4 minutes, while for DeepTTE it is around 3 minutes. This shows that DRTTE gains around 20%∼30% in accuracy on average compared to DeepTTE and is less sensitive to distance.
Results for MAPE and MAE with 20, 40, 60, 80, and 100 training epochs are depicted in Figure 3. Training for more epochs reduces the MAPE from (Chengdu 50.75%, Harbin 42.23%) to (Chengdu 13.91%, Harbin 11.64%) and reduces the MAE from (Chengdu 320.75 s, Harbin 280.23 s) to (Chengdu 155.71 s, Harbin 136.29 s). These results demonstrate that a larger number of epochs improves the accuracy of travel time estimation.
Figure 4 shows the effects of the kernel size of the graph convolutional operation. The MAPE, RMSE, and MAE follow the same trend, and the best results are obtained when the kernel size is intermediate. When the kernel size is less than 4, the spatial correlation cannot be captured entirely, but when it is greater than 4, more unnecessary information is captured, which damages the true correlation between road segments.

Conclusion
In this work, we proposed a novel multitask learning framework, DRTTE, to explore the effect of the spatial-temporal correlation of traffic on travel time estimation, considering traffic conditions and the dependency relationship of road segments. The effectiveness and efficiency of DRTTE are validated by experiments on two real taxi trajectory datasets. Our findings show that the proposed framework outperforms the existing methods with a higher level of accuracy. More importantly, it is demonstrated that the spatial features have significant effects on travel time estimation. Future work will focus on federated learning for travel time estimation to prevent privacy leakage.

Data Availability
The data underlying the results presented in the study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.