Multitime Resolution Hierarchical Attention-Based Recurrent Highway Networks for Taxi Demand Prediction

Taxi demand forecasting is an important consideration in building up smart cities. However, complex nonlinear spatiotemporal relationships in demand data make it difficult to construct an accurate prediction model. Considering that a single time resolution may not enable accurate learning of the time pattern of taxi demand, we expand the time series prediction model in our proposed multitime resolution hierarchical attention-based recurrent highway network (MTR-HRHN) model, using three time resolutions to model temporal closeness, period, and trend properties of demand data to capture a more comprehensive time pattern. We evaluate the MTR-HRHN on a taxi trip record dataset and the results show that the forecasting performance of the MTR-HRHN exceeds that of eight well-known methods in the short-term demand prediction in some high-demand regions.


Introduction
With the increasing travel demand of urban dwellers, taxis have become much more popular in urban areas, especially through the use of ride hailing services such as Didi Chuxing and Uber. However, the business still faces many inefficiencies, including long waits and numerous empty taxis [1][2][3]. The use of data technology and artificial intelligence to process massive taxi data can enable the construction of an accurate prediction model that can be used to estimate taxi demand and improve the efficiency of taxi services. For example, the number of passengers from different regions was predicted [4][5][6] through a linear time series model. The impact of the road network and meteorological conditions on taxi demand was studied [7,8] using machine learning methods. For the demand forecasting problem, the common approach is to consider the impact of historical demand data on future demand; that is, to predict demand $y_T$ at time $T$, given a series of historical demands $(y_1, y_2, \ldots, y_{T-1})$. The time interval $T$ is short-term, often a few hours or even shorter. However, for data such as taxi demand with nonlinear, unstable, and spatiotemporally related properties, linear or nonlinear methods considering only historical demand are insufficient. The following points should be considered when constructing the prediction model:
(1) Besides historical demand data, relevant exogenous data are necessary and should be applied to train the model. In this regional forecasting problem, exogenous data are often selected from other regions.
(2) The model should be nonlinear and should consider not only the temporal dependence of target data and exogenous data but also the relationship between target data and other exogenous data.
Figure 1 is a spatiotemporal dynamic structure that models both the historical target data and historical exogenous data. As pointed out in [9], $y_T$ is related to the historical observations $(y_1, y_2, \ldots, y_{T-1})$, the exogenous data $(x_1, x_2, \ldots, x_{T-1})$, and their spatiotemporal dynamics. Because of their excellent performance in learning the dynamic dependence in sequences, deep learning models, such as the recurrent neural network (RNN) and its extended variants, have been used to capture the nonlinear temporal relationships of time series data. In addition, the convolutional neural network (CNN) can be added to capture the spatial correlation [10]. The encoder-decoder architecture was recently used to model sequence data [11,12], and some attention-based models [13] have been proposed to exploit the temporal dynamics of exogenous data when predicting future targets. However, these models do not consider the correlation between different components of the exogenous data or the time factor in series data, and this affects the prediction results. Overcoming these issues is the motivation of our research.
In this paper, we extend a hierarchical attention-based recurrent highway network (HRHN) [9] and propose a multitime resolution model, MTR-HRHN. We select different lengths of sequence data from historical time series data (including target data and exogenous data) with three different time resolutions and input the sampled data to three HRHN networks to train the model to capture the spatiotemporal characteristics. We merge the output of each HRHN network to predict taxi travel demand at a certain time in the region. Compared with other spatiotemporal deep learning network models, our network has the ability to learn from three time resolutions. It can not only extract the spatiotemporal characteristics of time series data and their relationship with exogenous data, but can also capture the influence of recent, periodic, and trend factors on taxi demand. The organization of this paper is as follows. A brief overview of traditional prediction methods and deep learning models in traffic data prediction is given, followed by some definitions of demand prediction. The structure of MTR-HRHN is then described. We test the MTR-HRHN model on the New York City taxi dataset and compare it to other models. In the conclusion, we summarize the paper and provide some directions for improving the model.

Related Work
Statistics-based algorithms (such as ARIMA and its variants) [4,5,7] and machine learning regression models (such as linear regression and support vector machine) [6][7][8] are widely used in the research of traffic prediction. However, in the real world, the demand data of a certain region are often affected by other nonnumeric data (such as changes in weather), which prevents the linear model from completely digging out relevant information.
Recent superior performance of deep learning in computer vision and natural language processing has encouraged its application to traffic data prediction. Among them, the CNN can strongly extract the features of the input data, so it is naturally used for traffic prediction [14][15][16]. The RNN and some of its extended variants, such as LSTM [17] and the gated recurrent unit (GRU) [18], are outstanding at capturing dynamic time dependence and are widely used to predict time series data [19][20][21][22][23]. For example, Xu et al. encoded past taxi demand into week-long sequences, fed the sequential data to an LSTM network, and made the network learn the taxi demand patterns in each area. Rather than forecasting a deterministic taxi demand, it predicted the entire probability distribution of taxi demand in different areas through mixture density networks [22]. However, when dealing with regional demand prediction, different regions relate to each other, and the demand change of a certain region often has a certain correlation with the demand data of other regions. The inability to simultaneously capture spatial and temporal relations made these deep learning models inapplicable to our problem. Therefore, some researchers have chosen to build spatiotemporal deep learning models for traffic data prediction [10,24,25]. Among them, the combined deep network of CNN and LSTM is a classic spatiotemporal deep learning model. For example, Yao et al. proposed a novel local CNN method to consider spatially near regions and extract the sequential relations in a demand time series, and some LSTM networks were used to model sequential dependencies [10]. The encoder-decoder framework was also used by some researchers to deal with the spatiotemporal relationships of traffic data [24,25]. For example, Zhou et al. proposed an encoder-decoder framework with an attention mechanism to deal with the multistep citywide passenger demand prediction problem. They employed convolutional and ConvLSTM units in both the encoder and decoder and learned attention to emphasize the effects of representative citywide demand patterns on each step prediction during the decoding phase [24]. Some studies have expanded the spatiotemporal models to solve some traffic prediction problems that require more precision. For example, Rodrigues et al. proposed a deep learning architecture combining text information with time-series data and applied the approach to the problem of taxi demand forecasting in event areas [26]. Liu et al. proposed a contextualized spatial-temporal network to deal with the taxi origin-destination problem, integrating the local spatial context, temporal evolution context, and global correlation context in a united framework [27]. Although these spatiotemporal deep networks showed outstanding performance in the transportation field, they have some shortcomings, as they only sample historical traffic data from a single time resolution (such as a half hour or hour), which may lead to the inability to fully mine the possible multitime patterns of traffic data.
Apart from the above spatiotemporal models, HRHN, as an end-to-end deep learning model, has the ability to predict future target data by mining the spatial and temporal interaction information of historical exogenous and target data. It has been tested in several domains and proved able to not only achieve accurate prediction of time series but to better capture their sudden changes and oscillations [9]. Inspired by the capabilities of HRHN in the prediction of time series data, we chose it to learn the spatial and temporal correlation information between the demand data of the target region and the demand data of other regions. Moreover, to adapt to possible multitime patterns in demand data, unlike the original HRHN model, our model uses three time resolutions to sample the past target demand data and demand data of related regions and feeds them to three HRHN models to extract the corresponding spatiotemporal correlation information.

Definitions
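A taxi trip record can be represented as a tuple $(id, t_{start}, l_{start}, t_{end}, l_{end})$, where $id$ is the trip identification number, $t_{start}$ and $l_{start}$ are, respectively, the time and place a passenger gets on a taxi, and $t_{end}$ and $l_{end}$ are, respectively, the time and place the passenger gets off the taxi.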

Short-Term Demand Prediction Problem.
In this study, we set the length of each time interval to one hour and only predict the demand of the selected region at a specific future time. For a fixed region $i$ and time interval $T$, the one-step demand prediction problem can be defined as follows: given the historical demand data $(D^p_{i,T-h}, \ldots, D^p_{i,T-1})$ and the historical exogenous data $(X_{T-h}, \ldots, X_{T-1})$, the task is to predict the demand value of this region at future time interval $T$:

$$\hat{D}^p_{i,T} = F\left(D^p_{i,T-h}, \ldots, D^p_{i,T-1}, X_{T-h}, \ldots, X_{T-1}\right),$$

where $D^p_{i,t}$ represents the pick-up demand at a given region $i$ at time $t$, $h$ is the length of the input sequence data, $X_t \in \mathbb{R}^e$ is the exogenous data and $e$ is its dimension, and $F(\cdot)$ is a function to be learned that captures the complex spatiotemporal interaction between historical target and exogenous data.
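The one-step formulation above can be illustrated with a small sliding-window sketch. It assumes the hourly pick-up counts of the target region and the exogenous vectors are already aggregated into equally long sequences; the function name `make_one_step_samples` and the tuple layout are illustrative choices, not part of the original model code.

```python
from typing import List, Sequence, Tuple

# (past demand, past exogenous vectors, demand to predict); layout is an assumption
Sample = Tuple[List[float], List[List[float]], float]


def make_one_step_samples(
    demand: Sequence[float],
    exogenous: Sequence[Sequence[float]],
    h: int,
) -> List[Sample]:
    """Build one-step training samples with an input window of length h.

    demand[t] is the hourly pick-up demand of the target region and
    exogenous[t] is the e-dimensional exogenous vector at interval t.
    """
    if len(demand) != len(exogenous):
        raise ValueError("demand and exogenous must have the same length")
    samples: List[Sample] = []
    for T in range(h, len(demand)):
        past_demand = list(demand[T - h:T])                    # D_{T-h}, ..., D_{T-1}
        past_exog = [list(x) for x in exogenous[T - h:T]]      # X_{T-h}, ..., X_{T-1}
        samples.append((past_demand, past_exog, demand[T]))    # target is D_T
    return samples
```

Each sample pairs the last h intervals with the demand of the next interval, which is the input/output shape that the function F is expected to learn.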

Methods
As shown in Figure 2, MTR-HRHN has three layers: input, HRHN, and merge. In the input layer, we divide the historical target and exogenous data according to three time resolutions and select different lengths of sequence data to form the recent-, near-, and distant-time training samples. To match the time characteristics of the three HRHN networks, the time resolution of the recent-time samples is the smallest, followed by near-time samples and then distant-time samples.
In the HRHN layer, three HRHN networks train the model from three time-related perspectives: recent, period, and trend. Each HRHN network has an exogenous data capture part (X) and a demand forecast part (DP). Each X is linked to a sequence of historical exogenous data, and each DP is linked to a sequence of historical target data. The attention mechanism of the HRHN further learns the association between the target and exogenous data.
In the merge layer, the output of each HRHN undergoes the transformation of the fully connected layer. The transformed data are summed to obtain the final demand prediction data. The prediction data are used to construct a loss function together with the real data, and the model parameter training is completed through an optimization algorithm.
MTR-HRHN has an encoder-decoder structure and the ability to process sequence learning. Unlike most spatiotemporal deep network learning models that use LSTM, our model uses RHN to capture the temporal feature and embeds RHN in both the encoder and decoder. Compared to LSTM, RHN can offer a deeper understanding of the strengths of the LSTM cell and incorporate highway layers inside the recurrent transition, enabling the efficient use of substantially more powerful and trainable sequential models [28]. To our knowledge, HRHN has not been used in the field of taxi demand forecasting. For this new application, we employed a new model with multiple HRHNs, and the input layer, merge layer, and training algorithm are designed accordingly, so that the expanded new model has a better ability to learn spatiotemporal correlation sequences.

Input Layer.
We use three time resolutions to divide the historical demand data and historical exogenous data into three parts: closeness, period, and trend. The recent historical exogenous data are selected for the closeness part, where $L_c$ is the number of time intervals of the closeness fragment. The near historical exogenous data are selected for the period part, where $L_p$ is the number of time intervals of the period fragment, and the distant historical exogenous data are selected for the trend part, where $L_t$ is the number of time intervals of the trend fragment. Note that $P$ and $Q$ are different types of periods: $P$ is equal to 12 and reveals the half-daily periodicity, and $Q$ is equal to 24 and reveals the daily trend.
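As an illustration of the three-part sampling, the sketch below computes which past time intervals would feed each part. It assumes the common convention that the closeness part takes the $L_c$ most recent intervals, the period part takes intervals spaced $P$ hours apart, and the trend part takes intervals spaced $Q$ hours apart; the exact sampling formula is not recoverable from the text here, so this indexing and the function name `multi_resolution_indices` are assumptions.

```python
from typing import Dict, List


def multi_resolution_indices(
    T: int, L_c: int, L_p: int, L_t: int, P: int = 12, Q: int = 24
) -> Dict[str, List[int]]:
    """Return the past interval indices used to predict interval T.

    Assumed convention (not confirmed by the paper text):
      closeness: T - L_c, ..., T - 1
      period:    T - L_p * P, ..., T - P
      trend:     T - L_t * Q, ..., T - Q
    """
    indices = {
        "closeness": [T - k for k in range(L_c, 0, -1)],
        "period": [T - k * P for k in range(L_p, 0, -1)],
        "trend": [T - k * Q for k in range(L_t, 0, -1)],
    }
    # Each list is ascending, so its first element is the earliest index.
    earliest = min((v[0] for v in indices.values() if v), default=T)
    if earliest < 0:
        raise ValueError("not enough history before interval T")
    return indices
```

With $L_c = 4$, $L_p = 2$, $L_t = 2$ and hourly intervals, predicting interval 100 would then read intervals 96 to 99 for closeness, 76 and 88 for period, and 52 and 76 for trend.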

HRHN Layer.
We applied the HRHN [9] to the regional demand prediction problem.
The CNNs in the encoder learn spatial-related information from different components of demand data of other related regions, and RHNs in the encoder model and analyze the temporal dependence of demand data of related regions from the CNN at different semantic levels. RHNs in the decoder capture the time-dependent information of the historical demand of the region to be predicted. The decoder also includes a hierarchical attention mechanism, so that it can select the relevant multilevel semantic encoded information.
Mathematical Problems in Engineering

Encoder.
Convolutional neural networks and pooling layers are used in the encoder to learn spatial information from components of exogenous data. Suppose the number of convolutional layers corresponding to each moment is $K_c$, and the number of feature maps of the $u$-th layer is $F_u$. Assuming that the kernel size of each convolutional layer is set as $1 \times q$, the $i$-th convolution unit of the $f$-th feature map of the $u$-th layer is calculated from the data of layer $u-1$, where $k^{(u,f)}_j$ is the $j$-th unit of the convolution kernel of the $f$-th feature map of the $u$-th layer and $b^{(u,f)}$ is the bias term. For layer 1 (in this case, $u = 1$), the input data are the exogenous data. Each convolutional layer is followed by a max-pooling layer, where $s$ is the size of the max-pooling window. After processing by the $K_c$ convolutional and pooling layers, the local feature vectors $(w_1, w_2, \ldots, w_{T-1})$ are obtained. The RHN in the encoder analyzes the temporal dependence of the input data from the CNN. In the RHN recurrence, $I$ is an indicator function, $h^{[k]}_t$ is the intermediate output at time $t$ and depth $k$ in the RHN, and $I\{k = 1\}$ means that $w_t$ only participates in the transformation at the first layer. In addition, at the first layer, $h^{[k-1]}_t \in \mathbb{R}^l$ corresponds to the output of the last layer at time $t-1$.
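To make the recurrence concrete, the following is a minimal NumPy sketch of one RHN time step. It assumes the standard recurrent highway formulation (a tanh nonlinear transformation G, a sigmoid transformation gate R, and a sigmoid carry gate C, combined as g * r + h * c); the parameter names and the dictionary layout are illustrative, and the exact equations of the encoder may differ.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def rhn_step(w_t: np.ndarray, h_prev: np.ndarray, params: list) -> list:
    """One time step of a recurrent highway network of depth K = len(params).

    w_t    : CNN feature vector at time t, shape (m,); used only at depth k = 1
    h_prev : output of the last layer at time t - 1, shape (l,)
    params : one dict per depth with W_G, W_R, W_C of shape (l, m),
             V_G, V_R, V_C of shape (l, l) and b_G, b_R, b_C of shape (l,)
    Returns the intermediate outputs h_t^[1], ..., h_t^[K].
    """
    h = h_prev
    outputs = []
    for k, p in enumerate(params, start=1):
        # Indicator I{k = 1}: the input only enters the first layer.
        x = w_t if k == 1 else np.zeros_like(w_t)
        g = np.tanh(p["W_G"] @ x + p["V_G"] @ h + p["b_G"])   # nonlinear transformation
        r = sigmoid(p["W_R"] @ x + p["V_R"] @ h + p["b_R"])   # transformation gate
        c = sigmoid(p["W_C"] @ x + p["V_C"] @ h + p["b_C"])   # carry gate
        h = g * r + h * c
        outputs.append(h)
    return outputs
```

The last element of the returned list is the state carried to time t + 1, while all K intermediate outputs are kept because the hierarchical attention in the decoder reads every layer.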

Decoder.
The decoder contains another RHN used to capture the time-dependent information of the historical demand sequence data of the region to be predicted. An attention mechanism is introduced to solve the problem of encoding longer input sequences.
An attention model was originally used for machine translation [29] and has been widely used in natural language processing, statistical learning, speech, and other computing fields. A hierarchical attention mechanism, which performs better than the traditional attention mechanism, was developed based on the original attention model. For example, when processing document classification, the hierarchical attention mechanism can simultaneously build sentence- and word-level attention models, while the traditional attention mechanism can only construct a single level of attention model. The decoder of HRHN introduces a hierarchical attention mechanism, which can mine the information stored in different layers to capture temporal dynamics at different levels, which benefits the prediction of future target series compared to the traditional attention mechanism [9]. The alignment model $e^{[k]}_{t,i}$ is calculated from the decoder state and the encoder hidden states of layer $k$, where $s_{t-1} \in \mathbb{R}^p$ represents the output of the last layer of RHN in the decoder at time $t-1$, and $v_k \in \mathbb{R}^l$, $T_k \in \mathbb{R}^{e \times p}$, and $U_k \in \mathbb{R}^{l \times l}$ are all trainable parameters.
By computing the subcontext vector $d^{[k]}_t$ as a weighted sum of all the encoder's hidden states in the $k$-th layer, the soft alignment for layer $k$ is obtained. Then, the context vector $d_t$ that we feed to the decoder is formed from the subcontext vectors of all layers, where $K_r$ is the number of RHN layers. From the output of the encoder to the input of the decoder, $\tilde{D}^p_t$ is a time-dependent variable representing the interaction between $D^p_{i,t}$ and $d_t$, where $W \in \mathbb{R}^{1 \times 1}$ and $V \in \mathbb{R}^{1 \times K_r e}$ are the weight matrices and $b \in \mathbb{R}^1$ is the bias term. The RHN in the decoder is similar to that in the encoder, where $W_{G,R,C} \in \mathbb{R}^{p \times 1}$ and $V_{G,R,C} \in \mathbb{R}^{p \times p}$ are the weights of the nonlinear transformation $G$, transformation gate $R$, and carry gate $C$, and $b_{G,R,C} \in \mathbb{R}^p$ are bias terms. The estimated value $\hat{D}^p_{i,T}$ of the pick-up demand in time interval $T$ of the region $i$ to be predicted under this time mode is obtained as a linear function of $s_{T-1}$ and $d_{T-1}$, where $s_{T-1}$ is the output of the last layer of RHN in the decoder and $d_{T-1}$ is the associated context vector. The parameters $W \in \mathbb{R}^{1 \times p}$, $V \in \mathbb{R}^{1 \times K_r l}$, and $b \in \mathbb{R}^1$ are trainable parameters that characterize the linear dependence and produce the final prediction.
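The hierarchical attention described above can be sketched as follows. This is a hedged illustration that assumes an additive alignment score of the form $v_k^\top \tanh(T_k s_{t-1} + U_k h^{[k]}_i)$, a softmax over the encoder time steps, and concatenation of the per-layer subcontext vectors. For the shapes to agree, the sketch takes $T_k$ to be of size $l \times p$, which is an assumption, and the function name `hierarchical_context` is hypothetical.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def hierarchical_context(enc_states, s_prev, v, T, U):
    """Compute the context vector from all K_r encoder layers.

    enc_states[k] : encoder hidden states of layer k, shape (n_steps, l)
    s_prev        : decoder state s_{t-1}, shape (p,)
    v[k], T[k], U[k] : parameters of layer k with shapes (l,), (l, p), (l, l)
    Returns (context vector of length K_r * l, list of attention weights).
    """
    sub_contexts, weights = [], []
    for H_k, v_k, T_k, U_k in zip(enc_states, v, T, U):
        # Alignment score of every encoder step i (assumed additive form).
        scores = np.tanh(T_k @ s_prev + H_k @ U_k.T) @ v_k
        alpha = softmax(scores)             # soft alignment for layer k
        sub_contexts.append(alpha @ H_k)    # weighted sum of hidden states
        weights.append(alpha)
    return np.concatenate(sub_contexts), weights
```

Concatenating the subcontext vectors gives a context of length $K_r l$, which matches the stated size of $V \in \mathbb{R}^{1 \times K_r l}$ in the final linear prediction.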

Merge Layer.
The historical demand data and the historical exogenous data of the closeness, period, and trend parts are fed to the three HRHNs. Then we multiply the output of each HRHN by the corresponding weight matrix and add the results together to get the final prediction data, where $W^c_f, W^p_f, W^t_f \in \mathbb{R}^1$ are trainable weight matrices.
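A minimal sketch of this merge step is given below, assuming each HRHN branch produces a scalar estimate and each weight is a scalar; the dictionary keys and the function name `merge_branches` are illustrative.

```python
from typing import Dict

BRANCHES = ("closeness", "period", "trend")


def merge_branches(outputs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the three HRHN branch outputs.

    outputs[name] is the demand estimated by one branch and weights[name]
    is its trainable scalar weight (here passed in as a plain number).
    """
    return sum(weights[name] * outputs[name] for name in BRANCHES)
```

For example, branch outputs of 100, 80, and 60 with weights 0.5, 0.3, and 0.2 would merge to a demand of about 86. In training, the three weights are learned jointly with the HRHN parameters rather than fixed by hand.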

Loss Function and Optimizer.
After obtaining the predicted data $\hat{D}^p_{i,T}$, the mean square error is used as the loss function of the model:

$$\mathrm{Loss}_i = \frac{1}{N} \sum_{n=1}^{N} \left(\hat{D}^p_{i,n} - D^p_{i,n}\right)^2,$$

where $N$ is the number of training data points and $\hat{D}^p_{i,n}$ and $D^p_{i,n}$ represent the predicted demand and real demand data, respectively, of region $i$ at time interval $n$. $\mathrm{Loss}_i$ is the loss function of the pick-up demand forecast for the region $i$.
In addition, each region has an independent loss function. The model uses the Adam optimizer to complete the training [30]. During the training process, the loss on the validation set is calculated in each iteration. If the value is less than the minimum value of the previous iterations, then the parameters of MTR-HRHN at this iteration are saved and the value is updated as the new minimum value. Training terminates when the validation loss has been greater than or equal to the minimum value for several consecutive iterations.
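The loss and the stopping rule can be sketched independently of any deep learning framework. In the sketch below, `patience` stands for the "several consecutive iterations" in the text (its value is not given in the paper), and the training loop is abstracted as an iterable of (parameters, validation loss) pairs; both are assumptions made for illustration.

```python
from typing import Iterable, Optional, Sequence, Tuple, TypeVar

P = TypeVar("P")


def mse(pred: Sequence[float], real: Sequence[float]) -> float:
    """Mean square error between predicted and real demand."""
    return sum((p - r) ** 2 for p, r in zip(pred, real)) / len(real)


def train_with_early_stopping(
    steps: Iterable[Tuple[P, float]], patience: int
) -> Tuple[Optional[P], float]:
    """Keep the parameters with the lowest validation loss.

    steps yields (parameters, validation loss) once per training iteration.
    Training stops after `patience` consecutive iterations whose validation
    loss is greater than or equal to the best loss seen so far.
    """
    best_params, best_loss, stale = None, float("inf"), 0
    for params, val_loss in steps:
        if val_loss < best_loss:
            best_params, best_loss, stale = params, val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params, best_loss
```

Because the comparison is strict, an iteration that merely ties the best validation loss counts toward termination, which mirrors the "greater than or equal to" condition stated above.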

Results and Discussion
The dataset selected for the experiment was the New York City Yellow Taxi Trip Records (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) from January 1 to March 31, 2019.
Regarding the region division, there are many methods to divide cities into regions with different granularities and semantic meanings, such as road networks and ZIP code tabulation areas [31]. We used the New York City regional division scheme attached to the dataset to divide the city into six regions: The Bronx, Brooklyn, EWR, Manhattan, Queens, and Staten Island. We selected 12 high-demand subregions from Manhattan as the experimental objects shown in Table 1. We selected data from the last two weeks as test data and the remaining data as the training set. The last 20% of the data in the training set constituted the validation set.

Evaluation Metric.
Root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) were used to evaluate the prediction performance of the model in each region. They are computed over the test records, where $\hat{D}^p_{i,n}$ and $D^p_{i,n}$ are the predicted and real data, respectively, of the demand of region $i$ at time interval $n$ and $\xi$ is the number of test records.
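For reference, the three metrics can be written with their standard definitions as below. This sketch returns MAPE as a fraction rather than a percentage, which is an assumption, and it requires the real demand to be nonzero in every test record.

```python
import math
from typing import Sequence


def rmse(pred: Sequence[float], real: Sequence[float]) -> float:
    """Root mean square error over the test records."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, real)) / len(real))


def mae(pred: Sequence[float], real: Sequence[float]) -> float:
    """Mean absolute error over the test records."""
    return sum(abs(p - r) for p, r in zip(pred, real)) / len(real)


def mape(pred: Sequence[float], real: Sequence[float]) -> float:
    """Mean absolute percentage error as a fraction; real values must be nonzero."""
    return sum(abs(p - r) / abs(r) for p, r in zip(pred, real)) / len(real)
```

RMSE penalizes large deviations (such as those at daily peaks) more heavily than MAE, while MAPE normalizes by the real demand and is therefore sensitive to low-demand intervals.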

Parameter Settings.
The Pearson correlation coefficient was used to calculate the correlation between the pick-up demand data in the target region and the pick-up and drop-off demand data in other regions. Demand data with a strong linear correlation (absolute value of correlation coefficient greater than or equal to 0.7) were set as the exogenous data. $K_c$ (the number of layers of the CNN in the encoder) was set as 3. $q$ (the size of the convolution kernel) was set as 5. $F_u$ (the number of channels corresponding to each convolutional layer) was set as 64. $K_r$ (the number of layers of RHN in both the encoder and decoder) was set as 3. $l$ (the dimension of the RHN's hidden state in the encoder) was set as 128, as was $p$ (the dimension of the RHN's hidden state in the decoder). $L_c$, $L_p$, and $L_t$ (the length of the input data corresponding to different temporal properties) were set as 4, 2, and 2, respectively.
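The exogenous-data selection rule can be sketched with NumPy's `corrcoef`. The function name `select_exogenous` and the dictionary of candidate series are illustrative; only the 0.7 threshold on the absolute Pearson coefficient comes from the text.

```python
from typing import Dict, List, Sequence

import numpy as np


def select_exogenous(
    target: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    threshold: float = 0.7,
) -> List[str]:
    """Return the names of candidate series strongly correlated with the target.

    target     : pick-up demand series of the target region
    candidates : pick-up or drop-off demand series of other regions, by name
    A series is kept when |Pearson r| >= threshold.
    """
    selected = []
    for name, series in candidates.items():
        r = np.corrcoef(target, series)[0, 1]   # Pearson correlation coefficient
        if abs(r) >= threshold:
            selected.append(name)
    return selected
```

Using the absolute value keeps strongly negatively correlated regions as well, since they also carry linear information about the target demand.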

Methods for Comparison
(1) Historical average (HA): this uses the average of the previous demand at the given location in the training set at the same relative time interval (i.e., the same time of day) to predict demand.
(2) Autoregressive integrated moving average (ARIMA): a classic model in time series prediction, it combines a moving average and autoregressive components to model time series. The ARIMA model needs to determine three parameters (P, I, and Q). In this experiment, we used the pyramid library to automatically determine the relevant parameters.
(3) Linear regression (LR): LR uses the least squares loss function of the linear regression equation to model the relationship between one or more independent variables and the dependent variable. We used the Ridge and Lasso [32] linear regression models, and the tuning parameter of these models was set to 0.01.
(4) Multilayer perceptron (MLP): also known as an artificial neural network, the MLP has several hidden layers in addition to input and output layers. We used three hidden layers, each with 32 neurons.
(5) Extreme gradient boosting (XGBoost) [33]: XGBoost is a powerful boosting tree-based algorithm that is widely used in data mining. We set the learning rate to 0.1, and the remaining parameters took the default values.
(6) Long short-term memory (LSTM) [17]: this method can deal with the vanishing gradient problem of RNNs and has excellent performance in time series data processing. We selected a three-layer unidirectional LSTM network with 32 hidden nodes in each of the three layers.
(7) Temporal view + spatial (neighbors) view [10]: this spatiotemporal deep network uses CNN to extract spatially relevant information of the target region and its neighbor regions (those directly connected to the target region). The LSTM network processes the CNN output information to further extract temporal properties.
We compared the prediction performance under single and multiple time resolutions of the following two models with that of the MTR-HRHN.
(1) HRHN_One: it has only one HRHN, which models the closeness property of demand data.
(2) HRHN_Two: it has two HRHNs, which model the closeness and period properties of demand data.
Figure 4 shows the fitting results of the predicted values of MTR-HRHN in the test set of the 12 high-demand regions of Manhattan, New York City. The predicted results of MTR-HRHN are relatively accurate at most times. However, at the peak of each day, the deviation from the actual value is relatively large. This may be because the demand data in the peak times are more susceptible to nonnumeric data (such as sudden bad weather or a social event), and MTR-HRHN does not include such data in the analysis.
The proposed MTR-HRHN model can generally obtain more accurate prediction results than the other models mentioned above. Compared to nonlinear models, MTR-HRHN can not only capture the dynamic connection of sequences in time but also extract spatial information. Compared to other deep learning models, MTR-HRHN can further extract the connections between different components of exogenous data at the same time and can expand the observable time pattern by introducing multiple time resolutions, thereby further enhancing the prediction performance.
Table 3 shows the experimental results of the time resolution test. Compared to HRHN_One, HRHN_Two reduces the errors in RMSE, MAPE, and MAE by 11.59%, 0%, and 8.86%, respectively, and MTR-HRHN reduces them by 11.65%, 0%, and 7.83%, respectively. Thus, choosing two time resolutions (corresponding to HRHN_Two) can greatly improve prediction accuracy. According to the comparison of HRHN_Two and MTR-HRHN, the results in RMSE, MAPE, and MAE are almost equal. We can infer that simply using more time resolutions does not always improve the accuracy of prediction.
Figure 5 shows the relationship between the future demand forecast performance of 12 regions and the length of the input sequence, from which it can be found that the forecast performance and length of the input sequence are not proportional. In general, the prediction performance first increases with the length of the input sequence. The model achieves locally optimal prediction performance when the sequence length reaches a certain value, and performance begins to decline as the sequence length continues to increase. This is because RHN is essentially an extended LSTM network, and it faces the same disadvantage as RNN. So, when the sequence length is too short, the dynamic correlation information in time is not completely learned, and when the sequence length is too long, the difficulty of training convergence increases because many more parameters must be learned.

Conclusions
We applied the MTR-HRHN model to regional taxi demand prediction. Considering that real-world demand series typically exhibit patterns at multiple temporal scales, MTR-HRHN employs three HRHNs to hierarchically extract and select the most relevant input features. It can capture the closeness, periodic, and trend characteristics of time series data. The experimental results show that the MTR-HRHN model achieves more accurate demand prediction results than traditional time series prediction methods, classic machine learning regression models, and other deep learning models. We further compared and analyzed the impacts of the number of HRHN networks and the length of the input sequence on the prediction. These factors should be considered when applying the HRHN model or other spatiotemporal deep learning models to predict time series-related demands.
In subsequent research, we will optimize our model in two aspects. First, we will cluster the regions with the same demand patterns into one large region and use nonlinear correlation coefficient methods (such as the maximal information coefficient) to calculate the degree of correlation between the predicted region and other regions. Thus the strong correlation of exogenous sequences from the demand series of other regions can be captured. Second, many studies have shown that contextual data help to improve the prediction. We will collect some nonnumeric attributes (such as weather) and some point-of-interest information (such as functionalities of areas) and combine them with the historical exogenous data and/or historical target data. The newly formatted input and its effect on the prediction will be further analyzed.

Data Availability
The dataset selected for the experiment was the New York City Yellow Taxi Trip Records. The website is https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

Conflicts of Interest
The authors declare that they have no conflicts of interest.