DCAST: A Spatiotemporal Model with DenseNet and GRU Based on Attention Mechanism

The accurate prediction of crowd flow in urban areas is becoming increasingly important in many fields such as traffic management and public safety. However, the complex spatiotemporal relationships in traffic data and the influence of events, weather, and other factors make it very difficult to predict crowd flow accurately. In this study, we propose a spatiotemporal prediction model based on densely connected convolutional networks and gated recurrent units (GRU) with an attention mechanism to predict the inflow and outflow of crowds in the regions of a specific area. The DCAST model divides the time axis into three parts: short-term dependence, period rule, and long-term dependence. For each part, we employ densely connected convolutional networks to extract spatial characteristics and an attention-based GRU module to capture the temporal features. The outputs of the three parts are then fused by weighted elementwise addition. Finally, we combine the fusion result with external factors to predict the crowd flow in each region. The root mean square errors of the DCAST model on two real datasets, taxis in Beijing (TaxiBJ) and bikes in New York (BikeNYC), are 15.70 and 5.53, respectively. The experimental results show that the predictions are more accurate and reliable than those of the baseline models.


Introduction
Crowd flow prediction based on big data is a typical spatiotemporal prediction problem, which is of great significance for social stability [1][2][3]. In essence, crowd flow prediction uses historical data to predict the crowd flow values in a region. If the crowd flow in a certain area can be predicted in advance, measures can be taken early to ensure the safety of the crowds and reduce the loss caused by traffic congestion.
There are three difficulties in accurately predicting the flow of crowds in an area. (1) Spatial factor: crowd inflow and outflow in $r_2$ (as shown in Figure 1(a)) are affected not only by the surrounding areas $r_1$ and $r_3$ but also by crowd flow in geographically distant areas. (2) Temporal factor: crowd flow in an area is usually periodic (as shown in Figure 1(b)). For example, crowd flow is relatively high during the morning and evening rush hours and repeats roughly every 24 hours, and the morning rush hour starts later and later as the temperature drops. (3) External factors: weather and some events may have a significant impact on the flow of crowds in an area. For example, when a large party is held, people gather in large numbers in one area. The influence of these factors makes it difficult to predict the crowd flow.
This study mainly addresses the region-level crowd flow prediction problem [4], taking into account two types of crowd flow: inflow and outflow. As shown in Figure 1(a), inflow refers to the number of people entering a region in a certain time interval, while outflow refers to the number of people leaving a region in a certain time interval. Both types reflect changes in crowd flow, so it is important to keep track of both. Crowd flow data can be obtained from vehicle trajectory data, mobile phone signal data, public transportation data, and pedestrian data.
Many researchers address crowd flow prediction with mathematical equations or simulation techniques, but real crowd flow involves weather, events, crowds, and other factors, and it is difficult to express accurately with mathematical models. Traffic big data have become a basic resource with rich content and complex structure, and we must make good use of them. However, classical shallow learning algorithms cannot adapt well to this new situation. Williams et al. used ARIMA [5] to model and predict vehicle traffic flow. Castro-Neto et al. [6] applied online support vector regression, a supervised statistical learning technique, to short-term traffic flow prediction on expressways. Sun et al. [7] proposed a short-term traffic flow prediction method in which the traffic flow between adjacent road links in the traffic network is modeled as a Bayesian network. Chen et al. [8] presented a novel social media-based approach to traffic congestion monitoring. However, these methods fail to capture the complex temporal and spatial correlations in the data. Therefore, the prediction of crowd flow needs a data-driven model [9]. The most representative data-driven approach is deep learning [10], which can automatically extract deep features from data. For example, Zhang et al. [4] proposed a CNN-based method to predict the crowd flow in urban areas after dividing the city into a grid. Yao et al. [11] proposed a spatiotemporal dynamic network, introduced a flow gating mechanism to learn the dynamic similarity between locations, and designed a periodically shifted attention mechanism to deal with long-term periodic shifts. However, this model is large and complex.
To address the previously mentioned problems, we present DCAST, a traffic prediction model based on deep learning. This model utilizes the DenseNet [12] module to capture spatial correlation. DenseNet connects each layer of a CNN to every subsequent layer in a feed-forward manner, which strengthens feature reuse and feature propagation. In terms of temporality, the attention-based gated recurrent unit module is used to capture the potential time dependence in the data according to the temporal closeness to the predicted target. By combining the above modules, the model can not only better capture the temporal and spatial characteristics of the data but also make full use of the external information. In this way, the crowd flow can be predicted more accurately, which is of great significance for improving the operating efficiency of the city and strengthening the management of urban public safety. The model was validated on two real-world datasets (BikeNYC and TaxiBJ), which contain bike-sharing data from New York City in 2014 and taxi data from Beijing from 2013 to 2016. Compared with the existing methods, our model has better performance. The main contributions of this study are summarized as follows:
(1) We propose a spatiotemporal prediction model based on a densely connected convolutional network and a gated recurrent unit with the attention mechanism. The model is more robust and flexible when dealing with traffic forecasting.
(2) We design a gated recurrent unit module with the attention mechanism to capture the temporal features.
(3) We verify the validity of the model through a large number of experiments on two real-world datasets. The proposed model has better predictive ability than the typical shallow learning models and other models based on deep learning.
The rest of the study is structured as follows: Section 2 introduces the related work. Section 3 defines the problem and describes the techniques used in this study. Section 4 gives a detailed analysis of the motivation and structural design of the proposed DCAST model. Section 5 presents the experimental results of the DCAST model on real traffic datasets and analyzes and evaluates the performance of the proposed method. The last section summarizes the whole study and looks forward to future research directions.

Related Work
In recent years, spatiotemporal prediction has attracted wide attention, and crowd flow prediction is a typical spatiotemporal prediction problem. This section discusses work related to spatiotemporal prediction. The earliest work on this problem focused on time series prediction. Williams et al. used ARIMA [5] to model and predict vehicle traffic flow. Castro-Neto et al. [6] applied online support vector regression, a supervised statistical learning technique, to short-term traffic flow prediction on expressways. Chan et al. [13] proposed an optimized ANN model to predict short-term traffic flow using hybrid exponential smoothing and the Levenberg-Marquardt algorithm. Sun et al. [7] proposed a short-term traffic flow prediction method in which the traffic flow between adjacent road links in the traffic network is modeled as a Bayesian network. Chen et al. [8] presented a novel social media-based approach to traffic congestion monitoring. Bai and Chen [14] proposed a deep architecture to predict the short-term traffic flow in an urban traffic network. However, these methods fail to capture the complex temporal and spatial correlations in the data.
After the breakthrough in the research of Hinton et al. [15], deep learning models have achieved superior performance in computer vision [16], speech recognition [17], and natural language processing [18], and crowd flow prediction methods based on deep learning have attracted the attention of many researchers. These methods fall into three categories.
In the first category, fully connected layers are stacked, and data from multiple sources are combined. For example, Hua Wei et al. [19] proposed a zero-grid ensemble spatiotemporal model to predict traffic demand with four predictors, and Dong Wang et al. [20] presented an end-to-end framework called deep supply demand, which uses a novel deep neural network structure to find the gap between taxi supply and demand. These methods use a large number of features but do not explicitly model the spatial and temporal interactions.
In the second category, convolutional structures are applied to capture the spatial correlation in spatiotemporal prediction. For instance, Zhang et al. [4] proposed a deep learning-based prediction model for spatiotemporal data (DeepST), in which the spatiotemporal component employs convolutional neural networks to simultaneously model near and distant spatial dependencies. Zhang et al. [21] then presented the end-to-end structure ST-ResNet, which employs residual neural networks to model the temporal closeness, period, and trend properties of crowd traffic. Ziru Xu et al. [22] proposed the PredCNN model, which is based entirely on CNNs and uses cascade multiplicative units to predict future frames without any recurrent chain structure. These methods take into account the influence of spatial factors but do not fully consider the influence of temporal factors.
In the third category, models based on recurrent neural networks are used to model sequential dependencies. For example, H. Yao et al. [11] proposed a spatial-temporal dynamic network, which uses a flow gating mechanism to learn the dynamic similarity between locations and a periodically shifted attention mechanism to deal with long-term periodic temporal shifts. Sønderby et al. [23] proposed combining CNN and LSTM modules into a convolutional LSTM that captures spatial and temporal dependence for taxi demand forecasting. He et al. [24] proposed a spatiotemporal attentive neural network for network-wide and long-term traffic prediction, which exploits an encoder-decoder structure with the attention mechanism to forecast traffic speed. Le Nguyen and Ji [25] used the ConvLSTM structure to solve the traffic matrix prediction problem and fully modeled the spatiotemporal patterns. Yao et al. [26] further proposed a multiview spatiotemporal network for demand prediction, which integrates LSTM, local CNN, and structural embedding and comprehensively considers spatial, temporal, and semantic relations. Lin et al. [27] proposed a hybrid model called SpAE-LSTM, which uses a sparse autoencoder to extract the spatial characteristics of the spatial-temporal matrix through fully connected layers and captures the spatial-temporal features of traffic flow evolution with an LSTM network for prediction. Li et al. [28] proposed a novel spatiotemporal prediction model that uses a densely connected convolutional network to extract spatial characteristics, a fully connected network to extract features, and finally an attention-based long short-term memory module to capture the temporal pattern. Zhang et al. [29] proposed a novel spatial-temporal cross-domain neural network to effectively capture the complex patterns hidden in cellular data and adopted a convolutional long short-term memory network as its subcomponent, which has a strong spatiotemporal correlation modeling capability. Because of the large memory consumption and computational cost of the LSTM module, Li et al. [30] extended the traditional CNN and RNN structures to graph-based CNN and RNN for traffic prediction, such as the graph convolutional GRU. In the above studies, the spatial correlation between regions is modeled from a single perspective, and the different importance of each time interval is ignored. Du et al. [31] proposed a hybrid multimodal flow prediction model (HMDLF) based on deep learning, which combines CNN, GRU, and the attention mechanism. However, the DCAST model exceeds the ability of HMDLF to capture temporal and spatial correlation. The DCAST model automatically learns the dynamic temporal and spatial dependency features of crowd traffic data through the attention-based GRU module and the densely connected convolutional neural network.

Problem Definition.
Based on previous studies [21], we divide the whole city into an $a \times b$ grid map with $n$ regions, where $n = a \times b$ and each grid cell represents a region. Spatiotemporal prediction problems arise for many kinds of measurements, including air quality [32], weather, taxi orders, and bike rental/return. Here, we study the inflow and outflow of crowds. The set of crowd trajectories in time interval $t$ is recorded as $P$. Following [21], the inflow of a grid region in that interval counts the trajectory steps that enter the region, and the outflow counts the steps that leave it. In this definition, $g_i$ is a geographic coordinate; $g_i \in (a, b)$ denotes that the point $g_i$ lies in region $(a, b)$; $T_r: g_1 \to g_2 \to \cdots \to g_l$ represents the trajectory of the moving object $r$; $l$ represents the trajectory length; and $|\cdot|$ is the cardinality of a set.
If the grid map is regarded as an image of length $a$ and width $b$, the inflow and outflow in time interval $t$ can be represented as a two-channel image $X_t \in \mathbb{R}^{2 \times a \times b}$. As shown in Figure 2, the bar on the right represents the relationship between crowd flow and color brightness. The horizontal and vertical axes are used to identify specific areas.
Using the above notation, the crowd flow prediction problem can be defined as follows.
Definition 1 (Crowd flow prediction). Given the historical crowd flows $\{X_t \mid t = 1, 2, \ldots, n\}$, predict $X_{n+1}$.
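To make the inflow and outflow counts concrete, the following is a minimal sketch rather than the preprocessing code used in this study. It assumes that every trajectory has already been mapped to the sequence of grid cells (row, column) visited during one time interval; the function name `flow_maps` and this input format are hypothetical. A step from one cell into a different cell adds one to the outflow of the source cell and one to the inflow of the destination cell, in the spirit of the definition in [21].

```python
import numpy as np

def flow_maps(trajectories, a, b):
    """Return a (2, a, b) array: channel 0 = inflow, channel 1 = outflow.

    trajectories: iterable of lists of (row, col) grid cells visited in order
    during one time interval (hypothetical input format).
    """
    x = np.zeros((2, a, b), dtype=np.int64)
    for cells in trajectories:
        for prev, cur in zip(cells, cells[1:]):
            if prev != cur:                      # the object crossed a region boundary
                x[1, prev[0], prev[1]] += 1      # it left `prev`   -> outflow
                x[0, cur[0], cur[1]] += 1        # it entered `cur` -> inflow
    return x

# Two toy trajectories on a 2 x 2 grid.
X_t = flow_maps([[(0, 0), (0, 1), (0, 1), (1, 1)], [(1, 1), (0, 1)]], a=2, b=2)
```

The result has the two-channel layout of $X_t$, so a sequence of such arrays can be stacked along the time axis to form the model input.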

Attention Mechanism.
As one of the most influential ideas in deep learning, the attention mechanism aims to overcome the information loss caused by a fixed-length intermediate vector when the input sequence is relatively long. It was originally designed for the seq2seq model in natural language processing (NLP) and has since been rapidly applied to other fields. The output of the attention mechanism can be written as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \tag{2}$$

where $U = XW_U$ for $U \in \{Q, K, V\}$, $X$ is the input, $W_U$ is a learnable matrix, $d_k$ denotes the dimension of the keys, and $K^{T}$ is the transpose of the matrix $K$. $\mathrm{softmax}(\cdot)$ is an activation function defined as $\mathrm{softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$.
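The scaled dot-product form described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the implementation used in this study: the projection matrices `W_q`, `W_k`, `W_v` and the toy sizes (5 time steps, 8 input features, 4 key dimensions) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention with Q = X W_q, K = X W_k, V = X W_v."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))    # (steps, steps), each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 time steps, 8 features
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = attention(X, W_q, W_k, W_v)
```

Dividing by $\sqrt{d_k}$ keeps the scores in a range where the softmax does not saturate as the key dimension grows.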

Gated Recurrent Unit.
Gradient vanishing or gradient explosion easily occurs during backpropagation when a simple recurrent neural network (RNN) is relatively deep. Therefore, an RNN sometimes fails to capture long-term dependencies in sequences. The LSTM proposed by Hochreiter and Schmidhuber [33] is a variant of the RNN that can capture the long-term dependency features of sequence data. The GRU is a variant network based on LSTM, proposed by Cho et al. [34] (Figure 3). It combines the forget gate and input gate of the LSTM into an update gate, maintaining the effect of LSTM while making the structure simpler. Therefore, we use the GRU to learn long-term dependency features. Figure 3 is a typical GRU block diagram. The GRU has only two gates: the update gate $z_t$ and the reset gate $r_t$. The long-term dependency learning block GRU calculates the hidden states through a set of gate equations. In these equations, the update gate $z_t$ controls the extent to which state information from the previous moment is carried into the current state, and the reset gate $r_t$ controls the extent to which state information from the previous moment is ignored. $\sigma$ is the activation function. The candidate activation $\tilde{h}_t$ is computed with the reset gate $r_t$ (which controls how much of the previous information is retained), and $*$ denotes elementwise multiplication. Finally, $h_t$ represents the actual activation of the GRU unit at time $t$, which is a linear interpolation between the previous activation $h_{t-1}$ and the candidate activation $\tilde{h}_t$.

As shown in Figure 4, we first process the crowd flow of each region at time interval $t$ as an image of shape $(2, a, b)$. The time axis is then divided into three parts, which are used to model the short-term dependence, period rule, and long-term dependence, respectively. The three parts share the same network structure: a densely connected convolutional network followed by the attention-based gated recurrent unit module, which extracts the spatiotemporal characteristics of the crowd flow over the different time intervals. Features from external datasets, such as weather conditions and events, are fed into a two-layer fully connected neural network. The outputs of the first three parts are fused, with different weights given to the results of the different parts. The result of the fusion is integrated with the output of the external component and fed into an activation function to obtain the prediction.

Structure of the First Three Parts.
The first three parts (i.e., short-term dependence, period rule, and long-term dependence) share the same network structure, which is composed of a densely connected convolutional network and an attention-based GRU module, as shown in Figure 5. The city can be divided into many areas according to their spatial positions, and the crowd flows in nearby areas influence each other. Including weakly correlated areas hinders the prediction performance for the target area and wastes the central features of the CNN. To solve these problems, inspired by Huang et al. [12], we use a DenseNet module to capture the spatial correlation between all regions. As shown in Figure 5(a), the DenseNet structure is implemented by passing the outputs of the first $i-1$ layers to the $i$-th layer. The short-term dependency part in Figure 4 models the short-term dependency with several two-channel matrices of the recent time intervals. We first concatenate the short-term dependent sequence along the first axis (i.e., time interval) into a tensor $X_s^{(0)} \in \mathbb{R}^{2l_s \times a \times b}$, which is fed into $k$ convolution layers. The transformation at the $k$-th layer is defined as

$$X_s^{(k)} = \sigma\left(W_s^{(k)} * \left(X_s^{(0)}, X_s^{(1)}, \ldots, X_s^{(k-1)}\right) + b_s^{(k)}\right), \tag{4}$$

where $*$ denotes the convolution operation and $\sigma(\cdot)$ is an activation function. In this study, we use the rectifier function as the activation function, i.e., $\sigma(x) = \max(0, x)$. $W_s^{(k)}$ and $b_s^{(k)}$ are the two learnable parameters of the $k$-th layer. The $(\cdot)$ operation concatenates the input tensors along the first dimension (the channel dimension of the images).
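The dense connectivity pattern described above can be illustrated with a small NumPy sketch. It is not the network used in this study: the helper names `conv3x3` and `dense_block`, the toy grid size, and the number of filters per layer (`growth`) are assumptions made for illustration, and the convolution is written as the cross-correlation that deep learning frameworks commonly call convolution. Each layer receives the channel-wise concatenation of the input and all earlier layer outputs, and the sketch returns the output of the last layer.

```python
import numpy as np

def conv3x3(x, w, b):
    """'Same' 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3), b: (C_out,)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))      # zero padding keeps H and W
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            # contribution of kernel offset (i, j) to every output position
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out + b[:, None, None]

def dense_block(x0, params):
    """Each layer sees the concatenation of x0 and all earlier layer outputs."""
    feats = [x0]
    for w, b in params:
        inp = np.concatenate(feats, axis=0)        # concatenate along channels
        feats.append(np.maximum(0.0, conv3x3(inp, w, b)))   # ReLU activation
    return feats[-1]

rng = np.random.default_rng(0)
l_s, rows, cols, growth = 3, 4, 4, 8               # toy sizes (assumed)
x0 = rng.normal(size=(2 * l_s, rows, cols))        # l_s two-channel frames stacked
params, c_in = [], 2 * l_s
for _ in range(3):
    params.append((rng.normal(size=(growth, c_in, 3, 3)) * 0.1, np.zeros(growth)))
    c_in += growth                                 # next layer sees more channels
out = dense_block(x0, params)
```

Because the input channel count of every layer grows with the number of earlier layers, feature maps computed early remain directly available to all later layers.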
After $k$ densely connected convolutional layers, we reshape the output $X_s \in \mathbb{R}^{2l_s \times a \times b}$ into the feature vector $J_s \in \mathbb{R}^{2l_s ab}$. To reduce the feature dimension, a dense layer with learnable parameter sets $W_s$ and $b_s$ is applied to $J_s$ to generate the final spatial feature $S_s$. The attention mechanism is introduced into the GRU network to simplify the selection of the inputs from the previous layers that are critical to each subsequent step. Figure 5(b) shows an attention-based GRU module that takes the spatial characteristics of each time interval as input, where the length of the input sequence is equal to 5. We select $L$ time intervals to predict the next crowd flow. Combining the attention mechanism and the GRU introduced above, the spatial-temporal features $ST$ over the $L$ time intervals are obtained by applying $f$ to the spatial features, where $f$ is the GRU network and $S_s$ denotes the final spatial feature. The final feature of the short-term dependence under the attention mechanism is defined as follows:

$$\mathrm{Final}_s = \sum_{j=1}^{L} a_j h_j, \tag{7}$$

where the weight $a_j$ measures the importance of time interval $j \in \{1, 2, \ldots, L\}$, and $h_j$ is the hidden state of the GRU module at time interval $j$. The importance weights $a$ are obtained by learning from the spatiotemporal characteristics and the previous hidden state. By performing the same operations as above, we can construct the period rule and long-term dependent parts of Figure 4. Suppose the length of the period rule sequence is $l_p$ and the period is $p$; the period rule dependent sequence is then $[X_{t-(l_p-1)\cdot p}, \ldots, X_{t-1\cdot p}, X_t]$, which is processed as in formulas (4)-(7). Through the densely connected convolutional network and the attention-based GRU module, the output of the period rule part is $\mathrm{Final}_p$. Similarly, the long-term dependent sequence length is $l_{lt}$ and the long-term span is $lt$, so the long-term dependent sequence is $[X_{t-(l_{lt}-1)\cdot lt}, \ldots, X_{t-1\cdot lt}, X_t]$, and the output of the long-term dependent part is $\mathrm{Final}_{lt}$.
Note that p and lt are two different types of period spans. In the concrete implementation, p stands for daily periodicity and lt stands for weekly periodicity.
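The attention-based GRU module can be sketched as follows. This is a hedged illustration, not the implementation used in this study. Two choices are assumptions: the update rule follows the common convention $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$ (the opposite assignment of $z_t$ and $1 - z_t$ is also found in the literature), and the importance weights $a_j$ are produced by a softmax over a single learned scoring vector `w_a`, which is a hypothetical stand-in for the learned weighting described above. All sizes are toy values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU update; p holds weight matrices W_*, U_* and biases b_*."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])            # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])            # reset gate
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])  # candidate
    return (1.0 - z) * h_prev + z * h_cand     # interpolation (assumed convention)

def attention_gru(S, p, w_a):
    """Run the GRU over S (L, d_in) and pool the hidden states with weights a_j."""
    h = np.zeros(p["U_z"].shape[0])
    hidden = []
    for s_j in S:                              # one spatial feature per interval
        h = gru_step(s_j, h, p)
        hidden.append(h)
    H = np.stack(hidden)                       # (L, d_h)
    scores = H @ w_a                           # assumed scoring function
    a = np.exp(scores - scores.max())
    a /= a.sum()                               # softmax over the L intervals
    return a @ H, a                            # weighted sum of hidden states

rng = np.random.default_rng(0)
L, d_in, d_h = 5, 16, 8                        # toy sizes (assumed)
p = {}
for g in "zrh":
    p[f"W_{g}"] = rng.normal(size=(d_h, d_in)) * 0.1
    p[f"U_{g}"] = rng.normal(size=(d_h, d_h)) * 0.1
    p[f"b_{g}"] = np.zeros(d_h)
final_s, a = attention_gru(rng.normal(size=(L, d_in)), p, rng.normal(size=d_h))
```

The same function would be applied to the period rule and long-term sequences, each with its own parameters, to obtain the other two outputs.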

External Features Part and Fusion.
Crowd flow is influenced by external factors such as weather, holidays, and events. This part is optional, depending on whether the original data contain external information. Let $E_t$ be the feature vector that represents the external factors at the predicted time interval $t$. The implementation takes into account weather, holiday events, and metadata (i.e., Sundays, weekdays, and weekends). Formally, we stack two fully connected layers: the first can be seen as an embedding layer for each subfactor, followed by an activation, and the second maps the low-dimensional embedding to a high-dimensional output with the same shape as $X_t$. The output of the external part is denoted as $X_E^t$ in Figure 4.
Next, we discuss how to fuse the four parts of Figure 4. We first use the parameter matrix fusion method to fuse the first three parts (i.e., short-term dependence, period rule, and long-term dependence) and then combine the result with the external part. The fusion $\mathrm{Final}_f$ of the first three parts is expressed as

$$\mathrm{Final}_f = W_s \circ \mathrm{Final}_s + W_p \circ \mathrm{Final}_p + W_{lt} \circ \mathrm{Final}_{lt}, \tag{8}$$

where $\circ$ is the Hadamard product (i.e., elementwise multiplication), and $W_s$, $W_p$, and $W_{lt}$ are the learnable parameters that adjust the short-term dependence, period rule, and long-term dependence, respectively. As shown in Figure 4, the fusion $\mathrm{Final}_f$ of the first three parts is integrated with the external part $X_E^t$ as follows:

$$\hat{X}_t = \tanh\left(\mathrm{Final}_f + X_E^t\right), \tag{9}$$

where tanh is the hyperbolic tangent function, which ensures that the output values lie in $[-1, 1]$. The DCAST model is trained by minimizing the mean square error between the predicted value and the real value:

$$\mathcal{L}(\theta) = \left\| X_t - \hat{X}_t \right\|_2^2, \tag{10}$$

where $\theta$ denotes the learnable parameters of the DCAST model.

Figure 4: Structure diagram of the DCAST model.
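The fusion step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the code used in this study: the function names `fuse` and `mse_loss` are hypothetical, the external output is added to the fused tensor before the tanh, and the loss is averaged over all entries instead of being written as a squared norm.

```python
import numpy as np

def fuse(final_s, final_p, final_lt, W_s, W_p, W_lt, x_ext=None):
    """Weighted elementwise fusion of the three parts, then tanh.

    x_ext is the (optional) output of the external component.
    """
    final_f = W_s * final_s + W_p * final_p + W_lt * final_lt   # Hadamard products
    if x_ext is not None:
        final_f = final_f + x_ext
    return np.tanh(final_f)

def mse_loss(x_true, x_pred):
    """Mean squared error between the true and the predicted flow maps."""
    return float(np.mean((x_true - x_pred) ** 2))

shape = (2, 16, 8)                       # (channels, a, b), a BikeNYC-sized grid
ones = np.ones(shape)
x_hat = fuse(ones, ones, ones, 0.5 * ones, 0.25 * ones, 0.25 * ones,
             x_ext=np.zeros(shape))
loss = mse_loss(ones, x_hat)
```

Because the fusion weights have the same shape as the feature maps, each region can weight the three temporal parts differently.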

Model Training (Algorithm).
Algorithm 1 summarizes the training process of the DCAST model. We build a sample set $D_{sample}$ (lines 2-7) from historical observations and then divide $D_{sample}$ into the training set $D_{train}$ and the testing set $D_{test}$. The former is used to train the model and the latter to test it. A batch of training samples $D_{batch}$ is selected at each iteration to optimize the objective function (formula (10)) (lines 10-13).
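The sample-generation stage of the training process can be sketched as index bookkeeping. The sketch below is an assumption-laden illustration, not the procedure of Algorithm 1 itself: it uses the indexing convention of [21], in which the frames feeding the three parts all lie strictly before the predicted frame $t$, and the names `sample_indices` and `build_samples` are hypothetical. Frame indices are 0-based.

```python
def sample_indices(t, l_s, l_p, l_lt, p, lt):
    """Indices of the frames feeding the three parts when predicting frame t."""
    short = [t - i for i in range(l_s, 0, -1)]            # most recent frames
    period = [t - i * p for i in range(l_p, 0, -1)]       # same slot, earlier periods
    longterm = [t - i * lt for i in range(l_lt, 0, -1)]   # same slot, longer span
    return short, period, longterm

def build_samples(n, l_s, l_p, l_lt, p, lt):
    """Collect (short, period, longterm, target) index tuples for all valid t."""
    first = max(l_s, l_p * p, l_lt * lt)      # earliest t with a full history
    return [sample_indices(t, l_s, l_p, l_lt, p, lt) + (t,) for t in range(first, n)]

# Toy setting: 20 frames, period of 4 frames, long-term span of 8 frames.
samples = build_samples(n=20, l_s=3, l_p=2, l_lt=1, p=4, lt=8)
```

With half-hour intervals, a daily period would correspond to `p = 48` and a weekly span to `lt = 336`; these values are given only as an example of how the spans relate to the interval length.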

Dataset Description.
In this study, we use two large real datasets from New York City and Beijing to evaluate the proposed model. The details of each dataset are as follows:
BikeNYC: the bicycle trajectory data are taken from New York City [26] from April 1 to September 30, 2014 (183 days). No external information is provided, but the inflow and outflow data and their respective timestamps are included. The city is divided into a 16 × 8 grid map; the data of the last 10 days are selected as the test data and the data of the other days as the training data.
TaxiBJ: the taxi data are taken from Beijing between 2013 and 2016. The external information contains data about holidays and weather conditions. The city is divided into a 32 × 32 grid map; the data of the last four weeks are selected as the test data and the data of the other times as the training data.

Evaluation Criteria.
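We choose the root mean square error (RMSE) and the mean absolute error (MAE) to evaluate the experimental results. They are defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(v_i - \hat{v}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{M}\sum_{i=1}^{M}\left|v_i - \hat{v}_i\right|,$$

where $v_i$ and $\hat{v}_i$ are the ground truth and the predicted value, and $M$ is the number of all predicted values.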
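Both criteria are straightforward to compute. The short sketch below uses hypothetical function names and toy values; it is meant only to make the two definitions concrete.

```python
import numpy as np

def rmse(v, v_hat):
    """Root mean square error over all predicted values."""
    v, v_hat = np.asarray(v, dtype=float), np.asarray(v_hat, dtype=float)
    return float(np.sqrt(np.mean((v - v_hat) ** 2)))

def mae(v, v_hat):
    """Mean absolute error over all predicted values."""
    v, v_hat = np.asarray(v, dtype=float), np.asarray(v_hat, dtype=float)
    return float(np.mean(np.abs(v - v_hat)))

truth, pred = [10, 20, 30, 40], [12, 18, 33, 40]
scores = (rmse(truth, pred), mae(truth, pred))
```

RMSE penalizes large errors more strongly than MAE, so reporting both gives a view of typical errors as well as of occasional large misses.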

Baseline Models.
In this experiment, we compare the DCAST model with four traditional models and three based on the deep learning model:
(i) HA: the historical average method (HA) predicts crowd flow as the average of the historical inflow and outflow in the corresponding periods. For example, to predict the crowd flow in an area from 7:30 pm to 8:00 pm on Friday, we use the average of all historical data from that area from 7:30 pm to 8:00 pm on Fridays. The HA model is simple and easy to manage but imprecise.
(ii) ARIMA: the autoregressive integrated moving average model (ARIMA), which is composed of autoregressive and moving average models, is the most widely used typical model in the field of time series prediction.
(iii) SARIMA: the seasonal autoregressive integrated moving average model combines seasonal differencing with the ARIMA model to model time series data with periodic characteristics.
(iv) VAR: vector autoregression (VAR) is usually used to estimate the dynamic relationship between joint endogenous variables. It can predict spatiotemporal data but has many parameters and a heavy computational cost.
(v) ST-ANN: ST-ANN extracts spatial features (placing the area to be predicted in the center of a 3 × 3 unit) and temporal features (the previous time periods) and inputs them into an artificial neural network. In this experiment, the number of previous time periods was set to 8.
(viii) PredCNN: PredCNN is an entirely CNN-based architecture that models the dependencies between the next frame and the sequential video inputs. The cascade multiplicative unit (CMU) in PredCNN provides relatively more operations for previous video frames, and it enables PredCNN to predict future spatiotemporal data without any recurrent chain structures.
The DCAST model is compared with the baseline models according to whether each model considers the spatial, temporal, and external characteristics of the data. As shown in Table 1, the DCAST model considers all of these aspects of the data.
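As an illustration of the simplest baseline, the historical average (HA) can be sketched as follows. The function name and the layout of the history array, in which frame `k` falls in slot `k % slots_per_week`, are assumptions made for this example.

```python
import numpy as np

def historical_average(history, slots_per_week, target_slot):
    """Average all past frames that share the target's slot of the week.

    history: array (n, 2, a, b) of past flow maps, oldest first, where frame k
    falls in slot k % slots_per_week (hypothetical layout).
    """
    idx = np.arange(len(history))
    same_slot = history[idx % slots_per_week == target_slot % slots_per_week]
    return same_slot.mean(axis=0)

# Toy history: 3 "weeks" of 4 slots each, every frame filled with its own index.
history = np.arange(12, dtype=float).reshape(12, 1, 1, 1) * np.ones((1, 2, 1, 1))
pred = historical_average(history, slots_per_week=4, target_slot=12)
```

Frame 12 shares its slot with frames 0, 4, and 8, so the prediction is their average. The method uses no spatial or external information, which is why it serves only as a reference point.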

Preprocessing and Parameters.
At the output of the DCAST model, tanh is selected as the final activation, whose range is from -1 to 1. Accordingly, we use min-max normalization to scale the crowd flow to the range $[-1, 1]$. In the evaluation, we rescale the predicted values back to the original scale and compare them with the ground truth. For external information, one-hot encoding is used to transform discrete features (i.e., weather and holidays), and min-max normalization is used to scale the data to the range $[0, 1]$. The linear transformation of the initial data is as follows:

$$x_i^{*} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_i$ represents a sample from the initial data, $x_{\max}$ and $x_{\min}$ represent the maximum and minimum values in the data, respectively, and $x_i^{*}$ denotes the transformed data.
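The scaling and rescaling steps can be sketched with a small helper class. The class name and interface are hypothetical; the sketch assumes that the minimum and maximum are taken from the data passed to `fit` and that they differ, since constant data would cause a division by zero.

```python
import numpy as np

class MinMaxScaler:
    """Scale values linearly to [lo, hi] and back, using the fitted min and max."""

    def __init__(self, lo=-1.0, hi=1.0):
        self.lo, self.hi = lo, hi

    def fit(self, x):
        self.x_min, self.x_max = float(np.min(x)), float(np.max(x))
        return self

    def transform(self, x):
        unit = (np.asarray(x, dtype=float) - self.x_min) / (self.x_max - self.x_min)
        return unit * (self.hi - self.lo) + self.lo      # map [0, 1] to [lo, hi]

    def inverse_transform(self, y):
        unit = (np.asarray(y, dtype=float) - self.lo) / (self.hi - self.lo)
        return unit * (self.x_max - self.x_min) + self.x_min

flows = np.array([0.0, 50.0, 100.0, 200.0])
scaler = MinMaxScaler().fit(flows)           # range [-1, 1] matches the tanh output
scaled = scaler.transform(flows)
restored = scaler.inverse_transform(scaled)
```

Using `MinMaxScaler(lo=0.0, hi=1.0)` gives the $[0, 1]$ scaling applied to the continuous external features, and `inverse_transform` is the step that maps model outputs back to flow counts before the errors are computed.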
This experiment runs on 1080Ti GPUs and uses a Python 3.6 environment with TensorFlow and Keras (https://github.com/fchollet/keras) to build the model. As shown in Table 2, the DenseNets part contains 3 dense blocks with 32 filters of size (3, 3) in each dense block. The attention-based GRU part has two GRU layers, and the number of neurons in each hidden layer is 128. To maintain the generalization ability of the DCAST model, we add a dropout layer with a dropout rate of 0.5 after the dense layer. The batch size is set to 512, and the maximum number of epochs is set to 150. We employ the Adam [35] optimizer with a learning rate of 0.001. This optimizer has good universality and rapid convergence in deep learning models. The decay of the learning rate is 0.001. We use 90% of the data as the training set and the remaining 10% as the validation set in the training process. Early stopping with a patience of 20 is used to get the best results and avoid overfitting.

Performance Comparison with the Baseline Models.
[...] BikeNYC (11.76) and TaxiBJ (57.69) because it relies only on historical data without considering spatial correlation and external information. Models based on deep learning perform better than traditional methods because they make more efficient use of data and information. As shown in Table 3, the prediction results of ST-ResNet and DeepST are superior to those of the above models because ST-ResNet and DeepST use CNNs to capture spatial information and consider temporal periodicity.
PredCNN has no recurrent chain structures, so it can predict future spatiotemporal data with full parallelization, but it also loses some historical information.
In this study, the DCAST model, which captures the spatial-temporal characteristics between regions using a densely connected convolutional network and an attention-based GRU, outperforms the previously mentioned methods. Figure 6 shows how the loss of the DCAST model changes during training. It can be seen that when the epoch approaches 150, the loss no longer changes much. Therefore, early stopping with a patience of 20 is used to get the best results and avoid overfitting.
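The early stopping rule with a patience of 20 can be sketched independently of any deep learning framework. The function below is an illustrative assumption, not the training code of this study: it scans a list of per-epoch validation losses and stops once the loss has failed to improve for `patience` consecutive epochs. The loss curve is synthetic.

```python
def early_stopping(val_losses, patience=20, max_epochs=150):
    """Scan per-epoch validation losses and stop after `patience` epochs
    without improvement.

    Returns (best_epoch, best_loss, last_epoch_run), with epochs counted from 0.
    """
    best_epoch, best_loss, last = -1, float("inf"), -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last = epoch
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss      # new best: reset the wait
        elif epoch - best_epoch >= patience:
            break                                    # no improvement for too long
    return best_epoch, best_loss, last

# Synthetic curve: improves for 30 epochs, then plateaus at a slightly higher value.
losses = [1.0 / (e + 1) for e in range(30)] + [0.05] * 200
result = early_stopping(losses)
```

In practice the weights from the best epoch are kept, so the patience value trades extra training time against the risk of stopping on a temporary plateau.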

Performance Comparison with Model Variants.
We study the influence of different parts in the DCAST model to confirm their validity. As the BikeNYC dataset does not contain external auxiliary information, we use the TaxiBJ dataset to carry out experiments and verify each part by deleting or replacing to form relatively complete comparison results.
As shown in Table 4, the plain CNN model has the worst prediction performance because it focuses only on the spatial correlation of nearby areas. After replacing the CNN with the DenseNets structure, the prediction performance improves by 20.3%, which shows that DenseNets can model complex spatial relations better than the CNN. In addition, adding the GRU module or the LSTM module to DenseNets improves the performance of DenseNets and verifies the value of the temporal information.
DenseNets with attention-based GRU or LSTM outperforms the model variants without the attention mechanism, which shows that the attention mechanism helps the GRU or LSTM capture temporal patterns better. As can be seen from Table 4, using GRU as the basic module is better than using LSTM because GRU is a variant of LSTM that is simpler in structure and has fewer parameters while retaining a good effect. The DCAST model proposed in this study combines the densely connected convolutional network, GRU, attention mechanism, and external features to get the best prediction results. Therefore, the external features from auxiliary information are helpful for prediction.
In addition, we study how the RMSE varies with the number of epochs for different models. As shown in Figure 7, the RMSE of the DCAST model over the epochs is compared with that of the other model variants. As the number of epochs increases, the predictive performance of all models improves, while the DCAST model remains superior to the other model variants. We observe that the RMSE remains almost stable once the epoch exceeds 150 and does not change much as the epoch continues to grow. In other words, more epochs do not mean better prediction results; the generalization ability is not significantly improved when the epoch is greater than 150, and all models appear to overfit slightly when the epoch is greater than 250. To sum up, although increasing the number of epochs can improve the precision of model training, it leads to overfitting and heavy computation, which is not conducive to the application of the model.

ALGORITHM 1: DCAST algorithm training process.
Input: historical observations: $X_1, X_2, \ldots, X_{n-1}, X_n$; lengths of short-term dependence, period rule, and long-term dependence: $l_s$, $l_p$, $l_{lt}$; spans of period rule and long-term dependence: $p$, $lt$; external features: $[E_1, E_2, \ldots, E_{n-1}, E_n]$
Output: DCAST model
//Generate samples from historical crowd flow observations
(1)-(5) [...]
(6) Put a training instance $(P_s, P_p, P_{lt}, E_t, X_t)$ into $D_{sample}$
(7) End
(8) Divide $D_{sample}$ into $D_{train}$ and $D_{test}$
//Train the model
(9) Initialize all the parameters $\theta$ in DCAST
(10) Repeat
(11) Randomly choose a batch of samples $D_{batch}$ from $D_{train}$
(12) Find $\theta$ by minimizing the objective (10) with $D_{batch}$
(13) Until stopping criteria is met
(14) Output the learned DCAST model

Conclusions and Future Research
In this study, a new spatiotemporal prediction model based on densely connected convolutional networks and gated recurrent units with attention is proposed for crowd flow prediction. The DCAST model divides the time axis into three parts: short-term dependence, period rule, and long-term dependence. For each part, based on historical data, weather, and events, we use a dense convolutional network to capture spatial dependency and design an attention-based GRU that captures temporal dependency. The fusion results of these three parts are further combined with the external features extracted from the external auxiliary information. We combined these deep learning techniques to build a new traffic forecasting model that is more robust and flexible. The model can extract complex, deeply hidden spatiotemporal features. Two types of crowd flow in Beijing and New York were evaluated.
The experimental results showed that the predictions of the DCAST model were significantly better than those of the 7 benchmark models, which indicates that the model is better suited to crowd flow prediction.
In future studies, we plan to use graph neural networks (GNN) [36,37] to further study the dynamic correlation between regions. In real life, areas are not divided according to regular rules. Therefore, the GNN is more effective than the CNN in capturing complex spatiotemporal characteristics and obtaining stable prediction results. In addition, other types of data will be considered in the future, and an appropriate fusion mechanism will be used to better fuse different types of data, so as to achieve accurate prediction of regional crowd flow.

Data Availability
The research data used to support the findings of this study are from the study "Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction," from AAAI 2017 https://arxiv.org/pdf/1610.00081.pdf, and from https://www.jianguoyun.com/p/DesHv2UQs-HRBxi5gtYB.

Conflicts of Interest
The authors declare that there are no conflicts of interest.