An Improved Deep Spatial-Temporal Hybrid Model for Bus Speed Prediction



Introduction
With urbanization, urban populations are growing rapidly and the number of urban automobiles continues to increase, while urban public resources remain limited or insufficient. Together, these lead to increasingly serious urban transportation problems, such as traffic congestion, traffic overload, and traffic accidents, which widely affect daily life.
Thus, green travel, advocated by related organizations, and the urban public transport system are the main forces of urban transportation and play an important role. Urban buses are the basis of the urban public transport system, and it is crucial to analyze urban buses and their related information, such as bus speed and bus travel time.
Bus speed is vital information and is essential for improving the performance of the urban public transport system [1]. With it, public transportation agencies can manage the whole public transportation system and schedule buses and other resources to meet travel requirements efficiently, and hence mitigate crowding at bus stations and in buses and increase ridership [2]. Travelers can plan their trips reasonably based on this information. Besides, accurate and timely information about bus speed and travel time could attract more travelers, reduce waiting time, and improve satisfaction [3,4]. Therefore, to provide public transportation agencies and travelers with more timely and accurate information, efficient models need to be developed to predict the bus speed more accurately.
Bus speed prediction has gradually attracted the attention of researchers and public transportation agencies [5]. It differs significantly from traditional traffic speed prediction because it is affected by more factors, such as traffic congestion and bus operation schedules. Nowadays, however, many cities set up bus lanes during traffic peak hours, which alleviates these prediction problems to some extent. Thus, when scheduling and other related factors are not considered, bus speed prediction is at times similar to traffic speed prediction and can draw on traffic speed prediction models, but it still faces many challenges.
Similar to traffic speed prediction, sufficient factors must be considered when constructing the prediction model in order to achieve accurate results. Thus, this paper proposes an efficient prediction model for bus speed, called the Dynamic Hierarchical Spatial-Temporal Network (DHSTN). The contributions of the DHSTN mainly involve four aspects: (1) The dynamic weights method is proposed to calculate the average bus travel speed of a bus line section; it makes full use of the speed of each bus passing the line section and improves the accuracy of the average speed. (2) The Entropy-based Grey Relation Analysis (EGRA) is introduced to analyze the spatial correlations between nearby line sections and the target section, in order to select from the candidates the line sections to be analyzed together with the target section. (3) To make full use of the long-term temporal features, the attention mechanism is introduced to analyze them, so as to improve the prediction accuracy. (4) A multilayer structure is proposed that combines the CNN, GRU, and attention mechanism in order to capture the spatial and long-term/short-term temporal features within the travel speed.

Related Works
In recent years, bus speed prediction has gradually drawn researchers' attention, and various prediction models have been proposed. Unlike traditional traffic speed prediction, these models rarely use linear models (such as ARIMA [6,7]) to predict the bus speed, because bus speed prediction is not a standard problem and linear models do not yield good results. To analyze the nonlinear features within bus speed data, researchers have exploited many nonlinear models, such as the Bayesian network [8], Support Vector Machine (SVM) [9], radial basis function neural network (RBFNN) [10], Kalman filter [11], back propagation neural network (BPNN) [12], and extreme learning machine (ELM). The core idea of nonlinear models is to capture the nonlinear relationships within speed data and mine more potential information without prior knowledge [13]. Both the linear and nonlinear prediction models are based on the time-dependent features of bus speed data and only analyze the short-term temporal dependency.
Each prediction model mentioned above has its own limitations and advantages. Each captures one or more features of bus speed to some extent, but it is hard for a single model to cover all or most of the features and provide better prediction performance [14]. Thus, to make full use of each prediction model, researchers proposed hybrid models, which combine two or more prediction models, compensate for the limitations of a single model, and show better prediction performance. Yang et al. [15] proposed the GA-SVM model, based on SVM and the genetic algorithm (GA), to analyze the features of bus travel speed data; it also considered the time period, road length, weather, and other factors, and test results showed that GA-SVM is superior to the single SVM model and the Artificial Neural Network (ANN). Hashi et al. [16] proposed a novel hybrid model based on SVM and the Kalman filter to predict bus arrival time, and test results showed that it performs better than GA-SVM.
The linear, nonlinear, and hybrid prediction models mentioned above are exploited to capture the temporal dependency within the data and perform well. Bus speed has obvious temporal features, which many prediction models make full use of, but the speed also has spatial dependency. Thus, in analyzing the features within the data, prediction models should capture both the spatial and the temporal dependency. Liu et al. [17] proposed a comprehensive prediction model based on the long short-term memory (LSTM) and ANN, which utilized a spatial-temporal feature vector to predict the bus arrival time. Cui et al. [18] proposed a novel multivariate conditional autoregressive (MCAR) model to analyze the spatial and temporal correlations of bus speeds. Sun et al. [2] proposed a hybrid prediction model based on the LSTM and attention mechanism; it can capture the spatial and temporal features, but it is weak in capturing the long-term temporal dependency. Hu et al. [8] proposed a novel Bayesian model to characterize space-time interaction patterns and, further, to construct a bus speed prediction model. Gu et al. [19] proposed a fusion model based on LSTM, EGRA, and the gated recurrent unit (GRU), in which a double-layer structure is constructed from the LSTM and GRU in order to resolve the problem that a single-layer LSTM or GRU cannot capture the short-term temporal dependency.
Lu et al. [20] proposed a novel hybrid model for lane-level traffic prediction based on the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and extreme gradient boosting (XGBoost). CEEMDAN is exploited to decompose the raw data into two components, XGBoost is used to analyze each component and predict its changes, and the component predictions are then integrated to obtain the final prediction result; experimental results confirm that the hybrid model outperforms the state-of-the-art models in terms of prediction accuracy and stability. Guo et al. [21] proposed a novel prediction fusion framework that can combine multiple individual predictors and use different fusion strategies to improve model performance. Gu et al. [22] proposed an improved Bayesian combination model with deep learning (IBCM-DL); it analyzes the relevance between historical and current traffic data, and three subpredictors are incorporated into the IBCM framework to take advantage of each model. Bachu et al. [23] developed a new prediction method that considers both temporal and spatial variation in travel time. It first rewrites the conservation of vehicles equation in terms of flow and density instead of speed, then discretizes the resulting equation, and finally uses it in a prediction scheme based on the Kalman filter; experimental results confirm that the proposed method performs better. Ma et al. [24] also proposed a novel segment-based approach to predict bus travel time using a combination of real-time taxi and bus datasets, and the experimental results confirm that the approach improves the accuracy of bus travel time prediction, especially under abnormal traffic conditions.

Preliminaries
In this section, after analyzing the urban bus network and bus trajectory data, we give a novel method for obtaining the average speed of each bus line section.
Unlike the urban road network, urban bus lines are irregular and the urban bus network is complex, as shown in Figure 1, in which the black points are the stations, green means that traffic in the line section is smooth, yellow means that traffic is slow, and red means that it is congested. Before the bus running trajectory data are analyzed, the bus network needs to be gridded based on the stations. However, several bus lines may overlap in some areas; this places stations too close together and makes the gridded bus network too complex. Therefore, the stations in the overlapping areas of bus lines are merged in order to simplify the gridded bus network, and the result is shown in Figure 2.
Based on the gridded bus network, the bus running data can be matched to the corresponding line section by the following equations: where $(lon_A, lat_A)$ and $(lon_B, lat_B)$ are the positions of any two adjacent stations and $(lon_x, lat_x)$ is the position of a trajectory point of any running bus. When a bus runs from station $i$ to station $i + 1$, its running speed is not constant, and it is sometimes accelerating or decelerating, as shown in Figure 3. The acceleration and deceleration states are not conducive to obtaining the average speed from station $i$ to station $i + 1$ (that is, of the line section). Thus, the trajectory point data in the speed buffer area are not considered.
In general, the average speed $V_{i,j}$ of bus $j$ in line section $i$ is obtained by the following equation:

$$V_{i,j} = \frac{L(i)}{T(i,j)},$$

where $L(i)$ is the length of line section $i$, $T_{N_s}$ and $T_{N_e}$ are the projection times of bus $j$ passing endpoints $N_s(i)$ and $N_e(i)$ of line section $i$, and $T(i,j) = T_{N_e} - T_{N_s}$. However, for various reasons, it is hard to obtain the times $T_{N_s}$ and $T_{N_e}$ from real bus trajectory data. Based on the bus trajectory points in the line section, a novel method (Dynamic Weight Computing Method, DWCM) for computing the average speed is given, which performs two sequential steps.

The first step is to compute the average speed of each bus passing the line section. Based on the running trajectory of the bus, the first and last trajectory points in the line section are selected to compute its average speed as follows:

$$V_{i,j} = \frac{L_j(i)}{T_n - T_1},$$

where $T_1$ and $T_n$ are the sampling times of the first and last trajectory points in the line section, respectively, and $L_j(i)$ is the length of the line between the first and last trajectory points.

The second step is to compute the average speed of buses in the line section. In general, the average speed of one or several buses is partial and cannot fully reflect the real average speed of the line section. Thus, a novel concept, the impact factor $\lambda$, is proposed; it dynamically expresses the influence of each bus's average speed on the real average speed of the line section. It is obtained by the following equations: where $N_i$ is the number of buses passing line section $i$. Finally, based on $V_{i,j}$ and $\lambda(i, j)$, the average speed of the line section can be obtained by the following equation:

Methodology
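As a concrete illustration of the two-step average-speed computation (DWCM) described in the Preliminaries, a minimal Python sketch follows. The per-bus speed uses the first and last trajectory points in the section, as in the text. The weighting in the second step is an assumption made only for illustration: each bus is weighted by the share of the distance it covers, because the formula for the impact factor $\lambda$ is not reproduced here. The function names and the `(distance_m, time_s)` point layout are hypothetical.

```python
def bus_average_speed(first_point, last_point):
    """Step 1: average speed of one bus in the line section.

    Each point is (distance_m, time_s): the position along the section and
    the sampling time of the first / last trajectory point inside it.
    """
    (d1, t1), (dn, tn) = first_point, last_point
    if tn <= t1:
        raise ValueError("last point must be sampled after the first point")
    return (dn - d1) / (tn - t1)


def section_average_speed(bus_tracks):
    """Step 2: weighted average speed over all buses passing the section.

    ASSUMPTION: the weight of a bus is its covered length divided by the
    total covered length. This stands in for the impact factor lambda,
    whose actual definition is not given in the text above.
    """
    speeds = [bus_average_speed(first, last) for first, last in bus_tracks]
    lengths = [last[0] - first[0] for first, last in bus_tracks]
    total = sum(lengths)
    weights = [length / total for length in lengths]
    return sum(w * v for w, v in zip(weights, speeds))


# Two made-up buses: one covers 400 m in 50 s, the other 200 m in 40 s.
tracks = [((0.0, 0.0), (400.0, 50.0)),
          ((100.0, 10.0), (300.0, 50.0))]
print(section_average_speed(tracks))
```

With any other choice of weights that sum to one, only the body of `section_average_speed` changes; the first step stays the same.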

Temporal Dependency Mining.
As a basic part of the urban intelligent transportation system, urban bus speed exhibits higher correlation within a short time period, owing to the nature of the transportation system [25], and the short-term temporal dependency is the key factor for prediction. Moreover, similar to traffic flow, long-term temporal dependency also exists within bus speed data. A special example is that, because of extreme congestion or extreme weather, buses take a long time to travel from the previous line section to the current line section. Hence, during temporal dependency mining, both the short-term and long-term temporal dependency must be taken into account. Traditional methods only consider the short-term temporal dependency and ignore the coexistence of short-term and long-term temporal dependency. In recent years, many studies have explored fusion methods to mine the temporal dependency within the data [25–29]. The long short-term memory network (LSTM) is an important RNN that has been introduced to analyze the temporal features of traffic data [27,30]; it can fully capture both short-term and long-term memories within traffic data and avoid gradient vanishing or exploding problems.
The LSTM is an ideal structure [19]: it performs well and avoids some unnecessary problems, but it has somewhat more learning parameters, which affects its time performance to some extent. The Gated Recurrent Unit (GRU) is a variant of the LSTM; it is essentially an LSTM without an output gate, its structure is simpler, and it has fewer learning parameters, yet the GRU can achieve equal or even better performance [31]. The GRU has only two gates, the update gate $z_t$ and the reset gate $r_t$: $z_t$ controls the degree to which the previous hidden state $h_{t-1}$ enters the current new hidden state $h_t$, and $r_t$ controls the degree to which the previous hidden state is ignored. Thus, for the current input $x_t$, the values $z_t$, $r_t$, $h_t$, and the final output $y_t$ are calculated by the following equation [32]: where $w_z$, $w_r$, $w_h$, $w_y$, $w_{hr}$, $w_{hz}$, and $U_h$ are the weight matrices, $\sigma(\cdot)$ and $\tanh(\cdot)$ are the activation functions, and $\odot$ represents the Hadamard product of two vectors.
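To make the roles of the two gates concrete, the following is a minimal sketch of a single GRU step in plain Python. It uses the widely used textbook formulation (update gate, reset gate, candidate state, and a convex combination of the previous and candidate states) and is not necessarily the exact parametrization of [32]. Bias terms and the output projection $y_t$ are omitted, and the dictionary keys `Wz`, `Uz`, `Wr`, `Ur`, `Wh`, and `Uh` are hypothetical names chosen for this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, p):
    """One GRU step; p maps hypothetical names to weight matrices."""
    # Update gate z_t and reset gate r_t, both in (0, 1).
    z = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    # The reset gate scales the previous state element-wise (Hadamard product).
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate hidden state.
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    # z_t decides how much of h_{t-1} is carried into h_t.
    return [zi * hp + (1.0 - zi) * ci for zi, hp, ci in zip(z, h_prev, cand)]


# Toy example: input size 1, hidden size 2, arbitrary small weights.
params = {
    "Wz": [[0.1], [0.2]], "Uz": [[0.0, 0.1], [0.1, 0.0]],
    "Wr": [[0.3], [0.1]], "Ur": [[0.2, 0.0], [0.0, 0.2]],
    "Wh": [[0.5], [0.4]], "Uh": [[0.1, 0.1], [0.1, 0.1]],
}
print(gru_step([1.0], [0.0, 0.0], params))
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, every component of the hidden state stays within $[-1, 1]$ when the initial state does.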

Spatial Dependency Mining.
Analyzing only the temporal features of bus speed weakens the prediction performance. Traffic data clearly show spatial features; for example, the traffic state of the current line section may be affected by its upstream or downstream line sections. What follows focuses on exploring the spatial features within bus speed, for which the convolutional neural network (CNN) is selected. The CNN originates from image processing problems; it is a highly efficient tool and has naturally been utilized in many works [26,33–36]. Moreover, analyzing the whole urban bus network is a very complex task that needs large amounts of data, more resources, and much more time. In real applications, the task is often narrower, e.g., analyzing only bus arrival time, which calls for dividing the whole network into many small segments that are easier to analyze. In this study, based on the characteristics of the current bus lines and data, the 1D-CNN is exploited to mine the spatial features within the data, and thus, it performs the 1D convolution on the vectors:

$$h_m(x_k) = g\left(w_m \otimes h_{m-1}(x_k) + b_m\right),$$

where $h_m(x_k)$ and $h_{m-1}(x_k)$ are the output and input of the $m$-th convolution layer, respectively, $b_m$ represents the bias, $w_m$ represents the convolutional weights, $g(\cdot)$ represents the activation function, and $\otimes$ is the convolution. One advantage of the deep neural network is that its performance can be improved by adding hidden layers: the more hidden layers there are, the better its representation capability. However, more layers also increase the running time and data demands and make overfitting more likely, so an appropriate number of hidden layers balances these effects and gives better results. Thus, based on these factors, the number of hidden layers is set to $m = 3$.
As mentioned above, the CNN is weak in time performance. To converge quickly, Rectified Linear Units (ReLUs) are exploited as the nonlinear activation function, which can also tackle the gradient vanishing or exploding problems [37].
For the CNN, the convolutional layer is the key component, in which a filter is applied to capture the features from the input data. Moreover, Springenberg et al. [38] proposed a novel CNN whose pooling layer is replaced by a convolutional layer; it has state-of-the-art or competitive performance on object recognition with small images. Thus, after fully analyzing the existing data and its limited spatial dimension, a special 1D-CNN without the pooling layer is exploited to capture the spatial features.
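The layer structure described in this subsection (three stacked 1D convolution layers, ReLU activation, no pooling layer) can be sketched in plain Python as follows. This is a single-channel illustration under stated assumptions: zero padding keeps the output length equal to the input length, and the kernels, biases, and speed values are made up for the example rather than learned or measured.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def conv1d_same(x, kernel, bias):
    """1D convolution (cross-correlation form) with zero padding.

    The output has the same length as the input; the kernel length
    is assumed to be odd.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
            for i in range(len(x))]


def spatial_features(x, layers):
    """Apply each (kernel, bias) layer followed by ReLU; no pooling."""
    h = list(x)
    for kernel, bias in layers:
        h = relu(conv1d_same(h, kernel, bias))
    return h


# Speeds of five neighbouring line sections at one time step (made-up values).
speeds = [20.0, 22.0, 18.0, 25.0, 21.0]
layers = [([0.25, 0.5, 0.25], 0.0)] * 3  # m = 3 layers, hypothetical kernels
print(spatial_features(speeds, layers))
```

Omitting pooling keeps one output value per line section, which matters when the spatial dimension is as small as a handful of neighbouring sections.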

Attention.
The GRU described above only uses recent data, usually a few hours of data, and does not consider the long-term dependency within the data, although the long-term dependency has a great influence on the final prediction. Like other traffic data, urban bus speed data show obvious periodicity. Thus, both the long-term and short-term dependency need to be captured during prediction, using both the recent short-term data and long-term historical data. Since a shallow single-layer GRU does not capture the long-term memory [39], it needs to be extended further.
In this paper, the data of the previous $d$ days are used to capture the long-term information, and for each day, only limited data near the predicted time $t$ are chosen, which reduces the influence of redundant data on the final performance. In analyzing the periodicity of bus speed, the same problem as in traffic flow is found, namely the temporal shifting of periodicity [40].
Thus, the importance of each input on the final prediction result, $imp(x_k, y_t)$, needs to be analyzed, and the attention mechanism is exploited for this purpose. Some studies [40–42] show that $imp(x_k, y_t)$ can be analyzed by linear or nonlinear methods. In the attention mechanism, the importance of the data is analyzed before the data are input into the GRU, as shown in Figure 4. We set $x_k = (x_{k,1}, x_{k,2}, \ldots, x_{k,n})$, where it is assumed that the $n$ values near the time $t$ have impacts on the prediction, $w_{k,j}$ $(1 \le j \le n)$ is the weight of each input, and the relation between $y_t$ and $x_k$ is

$$y_t = \sum_{j=1}^{n} w_{k,j}\, x_{k,j} + b_k, \qquad (10)$$

where $b_k$ is the bias. Equation (10) is a multiple regression function; the optimal $w_k$ is obtained by minimizing the error $e$ between $y_t$ and the predicted value $y^0_t$, $e = f(y_t, y^0_t)$, where $f$ is the function calculating the error. For time performance, the idea of the Extreme Learning Machine (ELM) is exploited to solve the problem, and $w_k$ and $b_k$ are the learning parameters. Finally, the importance $\alpha_{k,j}$ is obtained by the following equation: and then, the new input of the GRU is $\alpha_k \cdot x_k$, where $w_{k,j}$ is the result of minimizing the error $e$.
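The reweighting step at the end of this subsection can be illustrated with a short Python sketch. It takes already fitted regression weights $w_{k,j}$ as given, turns them into importances $\alpha_{k,j}$ that sum to one, and scales each input value accordingly. The softmax normalization used here is an assumption made only for illustration, since the normalization formula itself is not reproduced above; the function names are hypothetical.

```python
import math


def importance(weights):
    """Map regression weights w_{k,j} to importances that sum to one.

    ASSUMPTION: a softmax is used as the normalisation.
    """
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]  # shifted for stability
    total = sum(exps)
    return [e / total for e in exps]


def reweight(x, weights):
    """New GRU input: each value x_{k,j} scaled by its importance."""
    alpha = importance(weights)
    return [a * v for a, v in zip(alpha, x)]


# Four speed values near the target time on one previous day (made-up),
# with hypothetical fitted weights favouring the later observations.
x_k = [21.0, 19.5, 18.0, 20.0]
w_k = [0.1, 0.2, 0.9, 1.2]
print(importance(w_k))
print(reweight(x_k, w_k))
```

Any other normalization that produces nonnegative importances could be substituted in `importance` without changing `reweight`.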

Fusion Method.
Based on the analysis above, a novel fusion model is proposed, called the Dynamic Hierarchical Spatial-Temporal Network (DHSTN). Figure 5 shows the model architecture. Considering the periodicity and continuity of urban bus speed, continuous time series data of the previous days and of the same day in the last three weeks are selected as inputs of the DHSTN, together with the near-term bus speed data, which are highly correlated with the future bus speed. The DHSTN includes six steps.

(a) The first step is feature extraction, shown as "a" in Figure 5. Bus speed data have obvious spatial features, so the nearby line sections need to be chosen from the candidates. To obtain more strongly correlated spatial variables, correlation analysis is exploited to recognize the line sections with high impacts on the target section. After analyzing many related methods, the Entropy-based Grey Relation Analysis (EGRA) is chosen to analyze the candidate line sections; a detailed description is given in [43,44].

(b) The second step is spatial dependency mining. In fact, spatial feature analysis has already started in the first step; for spatial mining, the CNN is chosen, as detailed in the section "Spatial Dependency Mining."

(c) The third step is the attention. In this paper, the data of the previous $d$ days, $x^s_k$, $1 \le s \le d$, are chosen to analyze the long-term dependency. For each previous day, the importance of the near-term data $x^s_{k,i}$ for the final prediction $y^0_t$ needs to be known, in order to generate more reasonable input data $x^s_k$.

(d) The fourth step is temporal dependency mining, including the short-term and long-term dependency. The short-term dependency analysis is performed in the first GRUs, and for each previous day $x^s_k$, the data $h^s_t$ at the final time $t$ are obtained. The long-term dependency analysis is performed in the second GRUs, and for each $h^s_t$, the long-term information $h_{t,ls}$ is preserved by $h^q_{t,ls} = \mathrm{GRU}(h^{q+1}_{t,ls}, h^q_t)$, $1 \le q \le d - 1$, where $h^{q+1}_{t,ls}$ is the output of the $(q + 1)$-th GRU and $h^q_t$ is the input of the $q$-th GRU.

(e) The fifth step is the short-term dependency analysis for the target line section. It also includes spatial and temporal dependency mining, similar to steps (b), (c), and (d); the main difference is that step (e) uses one GRU layer to analyze the short-term dependency $h^0_{t,st}$ of the target line section. The final output $h^0_{t,sl}$ is the concatenation of $h^0_{t,st}$ and $h_{t,ls}$, $h^0_{t,sl} = h^0_{t,st} \oplus h_{t,ls}$, which preserves the short-term and long-term temporal dependency information.

(f) The sixth step is the final prediction, which uses ELM to fit $y^0_{t+1} = f(w_f h^0_{t,sl} + b_f)$ and obtain the final result.

Note that the data during the whole processing are all centered on the target line section, and the superscript "0" indicates the day of the time to predict.

Figure 4: The attention mechanism before the GRU.

Mathematical Problems in Engineering
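The long-term recursion of step (d) and the concatenation of step (e) can be sketched in Python as follows. The sketch only shows the data flow: `toy_cell` is a hypothetical stand-in for a trained GRU, the recursion is assumed to start from the state of day $d$, and the ordering of `daily_states` (index 0 holds $h^1_t$) is likewise an assumption made for illustration.

```python
def fold_long_term(daily_states, cell):
    """Fold per-day states h_t^q, q = 1..d, into one long-term state.

    daily_states[q - 1] holds h_t^q. Following the recursion
    h_ls^q = cell(h_ls^{q+1}, h_t^q), the fold runs from q = d - 1 down to 1.
    ASSUMPTION: the recursion is initialised with h_t^d.
    """
    state = daily_states[-1]
    for h_q in reversed(daily_states[:-1]):
        state = cell(state, h_q)
    return state


def fuse(h_short, h_long):
    """Concatenate the short-term and long-term representations."""
    return list(h_short) + list(h_long)


def toy_cell(state, h_q):
    """Stand-in for a trained GRU: averages carried state and new input."""
    return [0.5 * s + 0.5 * h for s, h in zip(state, h_q)]


# Three previous days with one-dimensional states (made-up values).
daily = [[1.0], [2.0], [4.0]]
h_ls = fold_long_term(daily, toy_cell)
print(fuse([0.5], h_ls))
```

Passing the cell as a function keeps the sketch independent of how the GRU itself is implemented or trained.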

Experiments
To verify the proposed fusion model (DHSTN), experiments are conducted on real bus speed datasets, comparing it with other models: ARIMA, MLP [45], RBFNN [46], TM-CNN [47], FDL [19], DNN-BTF [25], RS-DHSTN (its feature extraction strategy is replaced with random selection), KNN-DHSTN (its feature extraction strategy is the k-nearest neighbor method), and NAM-DHSTN (a special DHSTN without the attention module). These data are sampled from the Huangpu road of Dalian city, shown in Figure 6, Figure 7. Moreover, the model performance is evaluated by the mean absolute error (MAE), the mean absolute percent error (MAPE), and the root mean-squared error (RMSE).
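For reference, the three evaluation metrics can be computed as in the following Python sketch, which uses their standard definitions. The function names and the sample values are illustrative only, and MAPE is expressed in percent and assumes that no observed speed is zero.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percent error, in percent; y_true must be non-zero."""
    total = sum(abs((a - b) / a) for a, b in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


def rmse(y_true, y_pred):
    """Root mean-squared error."""
    total = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    return math.sqrt(total / len(y_true))


# Made-up observed and predicted speeds for two time intervals.
observed = [10.0, 20.0]
predicted = [12.0, 18.0]
print(mae(observed, predicted), mape(observed, predicted), rmse(observed, predicted))
```

MAE and RMSE are in the unit of the speed data, whereas MAPE is relative to the observed value, so the same absolute error weighs more heavily at low speeds.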

Average Speed Analysis.
To test the average speed analysis method (DWCM) proposed in this paper, it is compared with the method based on the speed integral (SI method) proposed by Quiroga and Augusto [48]. Figure 8 shows the average speed analysis result. Compared with the SI method, the proposed method computes the average speed of line sections more accurately. In addition, the unstable states of buses entering and leaving stations affect the average speed analysis. To verify this, comparison tests are carried out with and without considering the unstable states (C-US and NC-US). Table 1 gives the resulting errors of the two methods. When the unstable states are considered, the MAPE of the DWCM is only 6.53%, which is far less than the 9.47% of the SI method. The results indicate that, regardless of whether the unstable states are considered, the DWCM is better than the SI method for analyzing the average speed.

... on weekdays, its peak time is delayed, and its changes are not much greater than those of traditional traffic speed at peak time. The reason is that bus lanes operate at peak time, from 6:30 to 8:30 in the morning and from 16:00 to 19:00. After these periods, the traffic peak is almost over. Because of work and other reasons, the traffic state on weekends is stable, and the bus speed is much smoother, as shown in Figure 7(b). Table 2 gives the results of extracting spatial features within the bus line sections shown in Figure 6. The GRG values of the candidate line sections near the target section are larger, which means that these sections are more likely to be chosen as inputs of the DHSTN. Besides, the GRG values of upstream sections are larger than those of downstream sections, which indicates that the upstream line state has a greater impact on the target section than the downstream line state.
If $\lambda = 4$, the candidates with the top 5 GRGs are selected to construct the input matrix. Figure 9 shows the prediction results of the DHSTN on a weekday and on a weekend. The figure shows that the DHSTN is capable of efficiently fitting the tendency of bus speed and of capturing sudden changes. Table 3 gives the analysis of the prediction performance of the DHSTN. It indicates that the DHSTN works better on weekends than on weekdays because of the relatively stable speed: the errors on weekends are lower by 0.61, 0.89%, and 0.56 on MAE, MAPE, and RMSE, respectively, because buses run more smoothly on weekends.

Prediction and Analysis.
To evaluate the performance of the DHSTN and test the importance of its two modules, the feature extraction module and the attention module, comparison experiments with the RS-DHSTN, KNN-DHSTN, and NAM-DHSTN are conducted. Table 4 shows the average evaluation results for the week datasets. Table 4 shows that the DHSTN is better and has smaller errors. Compared with random selection and the KNN, the DHSTN improves by 45.60% and 29.09% on MAE, respectively, and 38.80% and 24.03% on MAE, respectively. This indicates that the feature extraction module is fundamental and that a better candidate selection strategy can improve the performance of the DHSTN. Compared with the NAM-DHSTN, it improves by 16.93%, 15.88%, and 17.45% on MAE, MAPE, and RMSE, respectively, which means that the attention module can analyze the importance of each near-term data point and improve the prediction accuracy.
To verify the performance of the DHSTN further, several methods are used in comparative experiments. Table 5 ... time series information, but their performances are better than those of RBFNN, which considers spatial and temporal features. This means that the temporal features are essential for short-term prediction and that spatial information may even worsen a predictor, so reasonable utilization of spatial and temporal information is of vital importance. Figures 10 and 11 show part of the prediction results of each model on the weekday and weekend, respectively (from 7:30 to 9:30 on the weekday and from 16:30 to 18:30 on the weekend). Judging from the changing tendency, these models can capture the features of bus speed, while the spatial-temporal models are better. The difference between the real data and the prediction results of the DHSTN is around 2.6, and the smallest difference among the other predictors is about 3.1. Thus, the DHSTN proposed in this paper has better performance. Table 6 gives the comparison analysis of the different models on weekdays and weekends. The error of the DHSTN is the smallest among these models, with a smallest MAE of 2.3898; compared with the others, its accuracy is improved by 41.38%, 45.73%, 26.72%, 16.71%, and 12.25% on weekdays and 46.25%, 49.84%, 56.05%, 39.76%, 31.03%, and 19.18% on weekends, respectively. Note that the MAE and RMSE of each model on weekends are similar to those on weekdays, while the MAPE on weekdays is much larger than that on weekends, which means that the prediction results on weekends are better. The reason is that, unlike on weekdays, urban traffic on weekends is more stable and the bus speed is smoother.

Conclusions
In this paper, the original bus speed data were analyzed in depth and constructed as time series with a time interval of 5 minutes, using the proposed dynamic weight computing method to generate the average bus travel speed in each line section. Drawing on related models, the Dynamic Hierarchical Spatial-Temporal Network (DHSTN) model is proposed, which consists of the GRU, CNN, EGRA, and attention mechanism. First, it exploits EGRA to analyze the candidate line sections based on their correlations, in order to choose suitable candidates with high impacts on the target section. Next, the DHSTN constructs a multilayer structure using the CNN, GRU, and attention mechanism to analyze and capture the spatial dependency and the long-term and short-term temporal dependency. Finally, using ELM, it fuses the spatial dependency of nearby line sections and the temporal dependency of the target section to predict the bus speed variation in the next time interval. To verify the model performance, comparative experiments are conducted with ARIMA, MLP, RBFNN, TM-CNN, FDL, and DNN-BTF.

Data Availability
The urban bus travel speed data used to support the findings of this study have not been made available, because the data come from the operation system of the relevant enterprise. According to the operation specifications, these data are still confidential, and we are not authorized to release them.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.