Capsules TCN Network for Urban Computing and Intelligence in Urban Traffic Prediction

College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110016, China Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian University of Technology, Dalian 116024, China School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China University of Chinese Academy of Sciences, Beijing 100049, China


Introduction
Empowered by Internet of Things (IoT) technologies and advanced algorithms that can collect and handle massive traffic datasets, urban computing and intelligence can make more informed decisions and create feedback loops between the actual traffic situation and the management department in the urban environment [1]. It can bridge the gaps between ubiquitous sensing, intelligent computing, cooperative communication, and big data management technologies to create novel solutions that improve urban traffic environments, quality of life, and smart city systems [2]. In these urban computing methods, the large datasets used by researchers come from various sources, such as geographic information, taxi GPS, and online weather websites [3].
Urban traffic prediction has become a challenging and urgent task in the development of a smart city, as it can inform urban planning and traffic administration to improve the performance of urban transportation and can provide timely warnings for public security emergencies [4]. Moreover, urban traffic prediction is an important research issue with high social impact [5]. When emergencies happen, such as traffic accidents, an earthquake, a tornado, or a national holiday, urban traffic prediction becomes the top priority for authorities (e.g., law enforcement) and traffic management operators (e.g., bus/ferry/subway) to protect people's safety and keep social infrastructure working [6]. Particularly in cities with enormous populations such as New York and London, urban traffic is very heavy, which commonly raises the probability of traffic collisions and accidents [6].
To meet this challenge, we aim to derive the urban traffic prediction from period, trend, geospatial, and external influences and to generate an accurate prediction of urban traffic in the next time window, which we consider a practical approach to urban computing. We propose a neural network-based method called Capsules TCN Network, based on collected big traffic mobility data and two deep-learning architectures, TCN and Capsules Network. For real-time use, we also propose a further improvement method for spatial-temporal data processing to supervise vehicle density in urban areas.

Related Work
Traffic flow prediction has been considered as a key functional component of intelligent transportation systems. Meanwhile, artificial intelligence technology is rapidly growing and the fifth-generation communication technology is approaching [7][8][9][10][11][12][13][14][15]. Massive traffic data are being continuously collected through all kinds of sources, some of which can be treated and utilized as streaming data for understanding and predicting urban traffic [6]. All these stimulate us to take new efforts and achieve new success on this social issue by using such streaming mobility data and advanced artificial intelligence technologies [6].
The evolution of traffic flow can be considered a spatiotemporal process. As early as the 1970s, the autoregressive integrated moving average (ARIMA) model was used to predict the short-term traffic flow of expressways [16]. Time series methods are a widely used traffic flow prediction technology. Levin and Tsao applied Box-Jenkins time series analysis to predict highway traffic flow and found that the ARIMA (0, 1, 1) model was the most statistically significant for prediction [17]. Hamed et al. used the ARIMA model to predict the traffic volume of urban arterial roads [18]. To improve the prediction accuracy, many variants of ARIMA were proposed, such as Kohonen-ARIMA [19], subset ARIMA [20], ARIMAX [21], space-time ARIMA [22], and seasonal ARIMA [23]. In addition to ARIMA-type time series models, other types of time series models are also used for traffic flow prediction [24].
On account of the random and nonlinear nature of traffic flow, nonparametric methods have received widespread attention in the field of traffic flow prediction. Davis and Nihan used the KNN method for short-term traffic prediction on expressways [25]. Chang et al. proposed a dynamic multi-interval traffic forecasting model based on KNN nonparametric regression [26]. Faouzi developed an autoregressive function with a smooth kernel function for short-term traffic flow prediction, in which a function estimation technique was applied [27]. Sun et al. used a local linear regression model for short-term traffic prediction [28]. A traffic flow prediction method based on a Bayesian network was also proposed [29]. An online learning weighted support vector regression (SVR) was also proposed for short-term traffic flow prediction. Various artificial neural network models for predicting traffic flow have been established [30][31][32]. In one aggregation approach, the MA, ES, and ARIMA models are used to obtain three related time series, which form the basis of the aggregation phase [33]. Zargari et al. developed different linear programming, multilayer perceptron, and fuzzy logic models to estimate 5- and 30-minute traffic flows [34]. Cetin and Comert combined the ARIMA model with the expectation maximization and cumulative sum algorithms [35]. Yao et al. proposed combining principal component analysis with SVR and selecting urban multisection data to establish a short-term prediction model for the road network that takes into account the relationship between time and space across multiple sections [36]. Li et al. applied wavelet decomposition and wavelet reconstruction to the traffic flow sequence data and then used Kalman filtering for dynamic data prediction [37]. Sun et al. proposed applying gray system theory to intersection traffic volume prediction [38]. Xiong et al. combined traditional linear models with artificial intelligence prediction models and proposed a short-term traffic flow prediction method based on artificial neural networks and Kalman filtering [39].
The rest of this article is organized as follows. The first section describes the research background, significance, and purpose of traffic forecasting for urban vehicle traffic. The second section introduces the current state of research and the structure of this article. The third section models the traffic forecast in urban areas and introduces the structure of Capsules TCN Network, which has two main technologies: Capsules Network and Temporal Convolutional Network. The results of the Capsules TCN Network model are then reconstructed by superresolution to obtain a regional traffic flow forecast map with higher resolution. The fourth section introduces the dataset used in the experiments, the data preprocessing, the experimental criteria, and the comparative baselines. It also introduces the experimental environment and platform construction and demonstrates and analyzes the experimental results. The fifth section summarizes the whole research.

Analytical Model of Regional Traffic

Regional Flow Prediction Problem.
In urban areas, vehicle flow can be used as an indicator of the traffic in an area. This indicator reflects the traffic, population density, and public safety of a region well. This article predicts two types of vehicle group traffic: inflow and outflow, as shown in Figure 1(a). Inflow refers to the total volume of vehicles entering a certain area from other areas within a given time interval. Outflow represents the total flow of vehicles leaving the area in a given time interval. Both types of traffic are used to indicate the movement patterns of vehicle traffic in urban areas. Understanding them can be of great help in risk assessment and traffic management. Inflow and outflow can be measured by the number of cars driving near the road, the number of vehicles in public transportation systems (e.g., subway and buses), the number of taxis, or all available data. Figure 1(b) shows an example of using the GPS trajectories of taxis to measure the amount of traffic. The results show that the inflow in area B2 is 4 and the outflow in B5 is also 4. Predicting traffic flow can therefore be regarded as a spatial-temporal prediction problem.
There are three complex factors in the spatial-temporal prediction problem:

Space dependence.
As shown in Figure 1(a), the inflow in the B2 area is affected by the outflow in its vicinity (such as B5). Similarly, the outflow of B5 will affect the inflow of other regions (such as B2). The inflow of the B2 region will affect its own outflow. Urban traffic flow may even be affected by distant areas. For example, people who live far away from the office usually take a car or taxi to work, which means that the outflow of distant residential areas directly affects the inflow of office areas.

Time dependence.
The change of the traffic flow in any area is generally continuous over time. This means that the traffic flow at the next moment and the traffic flow at the previous moment have the strongest correlation. As the time interval increases, the correlation of traffic flow gradually decreases. Figure 2 shows the time-varying curves of the traffic flow in a typical residential area and a typical working area from our dataset. Both curves are relatively smooth, reflecting the continuous change described above. At the same time, Figure 2 shows that the traffic flow curve of the residential area differs from that of the working area, which reflects the regional differences.
Different regions have different population densities. Residential areas are for living and resting, so each person in a residential area has a larger unit space. Therefore, the lower the population density of a residential area, the better the residential area. In the work area, the closer the workers are, the more convenient communication is and the more efficient the work is. Therefore, the population density in the work area is much larger than that in residential areas. Different population densities determine different needs for public transportation. It can be seen from Figure 2 that although the trend of the number of taxis in the residential area and the working area is basically the same over time, there are obvious differences in magnitude between the two.
As shown in Figure 3, the traffic flow in both the work area and the residential area changes periodically. To further complicate matters, this periodicity differs across time scales. Observed in days, the traffic shows daily fluctuations from morning to night. Observed in weeks, it shows fluctuations between working and nonworking days. Observed in years, it shows the impact of climate and holidays across the four seasons. This paper divides time dependence into period and tendency. Period: traffic during the morning rush hour is similar on consecutive working days. The morning rush hour usually occurs from 8 AM to 10 AM, and the evening rush hour is usually from 5 PM to 9 PM, repeated every 24 hours. Tendency: there is a cyclical difference in traffic between working days and nonworking days, with a time interval of one week.

External factors.
Some external factors, such as weather conditions and holidays, can drastically change traffic flow in different areas of the city. As shown in Figure 4, a rainstorm affects the speed of traffic on the road. Figure 5 shows the impact of holidays on regional traffic.
There are many ways to divide a city into areas. According to function, it can be divided into working areas, residential areas, mixed areas, etc. It can also be divided according to the structure of the urban road network, with the city divided along main roads by using the map division method. The division method used in this paper is as follows. We use grids to divide the city according to latitude and longitude. As shown in Figure 6(a), j and k represent the row and column indices of the grid, respectively. In practice, the numbers of rows and columns can be adjusted according to city size and application scenario. In this paper, the scenario is divided into 16 × 16 grids.
Let R be the trajectory set of the ith time interval. For the grid $(j, k)$ located in the jth row and the kth column, the inflow $\alpha_i^{j,k}$ and outflow $\beta_i^{j,k}$ at time interval i are defined by formulas (1) and (2), where $L_i$ is the trajectory of all the vehicles in R at the ith time interval. Here, the trajectory is determined according to the GPS coordinates in the dataset and the grids divided by the map. In the ith time interval, the inflow of the 16 × 16 grids in the entire area can be represented by a matrix composed of $\alpha_i^{j,k}$, as shown in Figure 6(b). The traffic prediction problem is thus transformed into using the known historical data $\alpha_i^{j,k}$ and $\beta_i^{j,k}$ to predict $\alpha_{i+1}^{j,k}$ and $\beta_{i+1}^{j,k}$ at the next moment.
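To make the inflow and outflow definitions concrete, the following is a minimal sketch of how the two matrices could be counted for one time interval. It assumes that GPS points have already been mapped to grid cells and that a vehicle contributes one outflow to the cell it leaves and one inflow to the cell it enters; the function name `grid_flows` and the input layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def grid_flows(trajectories, rows=16, cols=16):
    """Count inflow and outflow per grid cell for one time interval.

    trajectories: list of trajectories, each a list of (row, col) tuples
    giving the grid cells a vehicle visited in order (assumed input format).
    Returns an array of shape (2, rows, cols): channel 0 is inflow,
    channel 1 is outflow.
    """
    inflow = np.zeros((rows, cols), dtype=int)
    outflow = np.zeros((rows, cols), dtype=int)
    for cells in trajectories:
        for prev, cur in zip(cells[:-1], cells[1:]):
            if prev != cur:           # the vehicle crossed a cell boundary
                outflow[prev] += 1    # it left the previous cell
                inflow[cur] += 1      # and entered the current cell
    return np.stack([inflow, outflow])

# Two toy trajectories on a 3 x 3 grid.
flows = grid_flows([[(0, 0), (0, 1), (0, 1), (1, 1)], [(2, 2), (0, 1)]],
                   rows=3, cols=3)
```

Stacking the two matrices gives the 2-channel representation of a time interval that the model takes as input.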

Algorithm Model of Capsules TCN.
Both recurrent neural networks (RNN) and long short-term memory (LSTM) networks are capable of learning long-range temporal dependencies. However, if an RNN or LSTM is used to model time periods and trends, it requires very long input sequences, which makes the entire training process very complicated. According to knowledge of the space-time domain, only a few previous key frames affect the next key frame. Therefore, we use time period, tendency, and geographic space to select key frames for modeling. Figure 7 shows the architecture of the Capsules TCN Network proposed in this paper. It consists of four primary parts, which model time period, tendency, geospatial, and external influences.
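The key-frame idea can be illustrated with a small index calculation. This is only a sketch under stated assumptions: the function name `key_frame_indices`, the number of frames per component, and the default of 48 intervals per day (30-minute intervals) are hypothetical choices, not values given in the paper.

```python
def key_frame_indices(i, n_period=3, n_trend=3, per_day=48):
    """Return indices of past intervals used as key frames for interval i.

    period: the same time of day on previous days (24-hour cycle).
    trend: the same time of day in previous weeks (one-week cycle).
    All default values are illustrative assumptions.
    """
    period = [i - p * per_day for p in range(1, n_period + 1)]
    trend = [i - q * 7 * per_day for q in range(1, n_trend + 1)]
    return period, trend

period, trend = key_frame_indices(1000, n_period=2, n_trend=1)
```

Only these few frames are fed to the network, instead of the full sequence that a recurrent model would need to cover a week of history.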
As shown in Figure 7, the methods introduced in formulas (1) and (2) are first used to convert the inflow and outflow of the entire city at each time interval into a 2-channel matrix. The 2-channel flow matrices sampled at intervals in each time segment are sent to the first two parts, respectively, which are modeled with the same Capsules TCN Network structure. This structure also captures the spatial dependence between nearby and distant areas. External factors are provided to the same neural network structure. The outputs of the four parts are fused in the manner of fully convolutional networks. Finally, the result is mapped to the range [-1, 1] by the tanh function, which produces faster convergence than the standard logistic function during backpropagation learning. The entire neural network structure consists of two important methods: Capsules Network [40] (CapsulesNet) and Temporal Convolutional Network [41] (TCN).
The Capsules Network was introduced by Sabour et al. in "Dynamic routing between capsules" [40]. We use the ideas from the reference when designing our Capsules TCN Network. Figure 8 shows the architecture of the Capsules Network (CapsulesNet). CapsulesNet, like ordinary neural networks, consists of many layers. The lowest capsule layer is called the primary capsule layer: each capsule unit in it receives a region of a matrix as input and detects the presence and posture of a specific object, and higher layers can detect larger and more complex objects.
Capsules are a group of neurons whose input and output vectors represent instantiation parameters of a specific entity type (that is, the probability of certain objects, conceptual entities, etc. appearing and certain attributes). The capsules at the same level use the transformation matrix to predict the instantiation parameters of higher-level capsules. When multiple predictions are consistent (this paper uses dynamic routing to make predictions consistent), higher-level capsules become active. The activation of the neurons in the capsule represents the various properties of the specific entities present in the matrix. These properties can include many different parameters, such as pose (position, size, and orientation), deformation, speed, reflectivity, color, texture, and more.

Wireless Communications and Mobile Computing
The length of the output vector represents the probability of an entity appearing, so its value must be between 0 and 1. To achieve this compression and complete capsule-level activation, Sabour et al. used a nonlinear function called "squashing." This nonlinear function ensures that the length of a short vector is shrunk to almost zero and the length of a long vector is compressed to close to, but not more than, 1 [40]. The expression for this nonlinear function is [40]
$$V_j = \frac{\|S_j\|^2}{1 + \|S_j\|^2} \, \frac{S_j}{\|S_j\|},$$
where $V_j$ is the output vector of capsule j and $S_j$ is the weighted sum of the vectors output by all capsules in the previous layer to capsule j in the current layer; that is, $S_j$ is simply the input vector of capsule j. The nonlinear function can be divided into two parts [40]: the first part is the scaling of the input vector $S_j$, and the second part is the unit vector of the input vector $S_j$. This nonlinear function not only retains the direction of the input vector but also compresses the length of the input vector to the interval [0, 1). When $\|S_j\|$ is zero, $\|V_j\|$ is 0, and when $\|S_j\|$ tends to infinity, $\|V_j\|$ approaches 1. This nonlinear function can be seen as a compression and reallocation of the vector length, so it can also be seen as a way to "activate" the output vector from the input vector. As mentioned above, the input vector of a capsule is equivalent to the scalar input of a classic neural network, and the calculation of this vector is equivalent to the propagation and connection between two layers of capsules. The calculation of the input vector is divided into two phases, namely, linear combination and routing. This process can be expressed by the following formula [40]:
$$S_j = \sum_i c_{ij} \, \hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} u_i,$$
where $\hat{u}_{j|i}$ is a linear combination of $u_i$, which can be seen as a general neuron in the previous layer sending outputs with different strengths to a neuron in the next layer [40].
The difference from a general neural network is that a capsule has a set of neurons (to generate a vector) at each node: $\hat{u}_{j|i}$ means that the output vector of the ith capsule in the previous layer is multiplied by the corresponding weight matrix $W_{ij}$. The resulting prediction vector $\hat{u}_{j|i}$ can also be understood as the strength of the connection from the ith capsule in the previous layer to the jth capsule in the next layer. After $\hat{u}_{j|i}$ is determined, routing is used for the second stage of allocation to calculate $S_j$ in the output nodes. This process involves iteratively updating $c_{ij}$ using dynamic routing. We obtain the $S_j$ of the next capsule layer through routing and then put $S_j$ into the "squashing" nonlinear function to get the output of the next layer. This completes the propagation between two capsule layers.
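As a rough illustration of the squashing nonlinearity from [40], the sketch below implements it in NumPy. The small constant `eps` is an added assumption to avoid division by zero for an all-zero input; it is not part of the formula in [40].

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Squash vectors along `axis` so that their length lies in [0, 1).

    The direction of each vector is kept; only its length is rescaled.
    eps is a numerical safeguard added for this sketch.
    """
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)          # scaling part
    return scale * s / np.sqrt(sq_norm + eps)  # times the unit vector

v = squash(np.array([3.0, 4.0]))   # input length 5, output length 25/26
z = squash(np.zeros(4))            # a zero vector stays zero
```

A long input vector ends up with a length just under 1, while a short one is shrunk toward zero, which is what lets the length act as a probability.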
The coupling coefficient $c_{ij}$ is updated and determined iteratively by a dynamic routing process. The coupling coefficients between capsule i and all capsules in the next level sum to 1. In addition, $c_{ij}$ is determined by a "routing softmax," in which $b_{ij}$ is initialized to 0. The softmax for $c_{ij}$ is calculated as [40]
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})},$$
where $b_{ij}$ depends on the position and type of the two capsules but does not depend on the current input matrix. The consistency between the current output $V_j$ of each capsule j in the subsequent layer and the prediction $\hat{u}_{j|i}$ can be measured, and the coupling coefficient is then refined with this measured consistency. This paper simply measures this consistency by the inner product
$$a_{ij} = V_j \cdot \hat{u}_{j|i}.$$
This part also involves using routing to update the coupling coefficient [40]. The routing process is the update process: it calculates the product of $V_j$ and $\hat{u}_{j|i}$, updates $b_{ij}$ by adding this product to the original $b_{ij}$, and then uses softmax($b_{ij}$, j) to update $c_{ij}$. Whenever a new output $V_j$ is available, $c_{ij}$ can be updated iteratively, so these parameters are updated directly by calculating the consistency of the input and output, without backpropagation.
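The routing loop described above can be sketched in a few lines of NumPy. This is a simplified illustration of routing-by-agreement as described in [40], not the authors' code: the array layout `(n_in, n_out, dim)` for the prediction vectors and the function names are assumptions made for this example.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squashing nonlinearity applied along the last axis."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return sq_norm / (1.0 + sq_norm) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis):
    """Numerically stable softmax along `axis`."""
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, iterations=3):
    """Route prediction vectors u_hat of shape (n_in, n_out, dim).

    Returns the output vectors v (n_out, dim) and the coupling
    coefficients c (n_in, n_out) used to compute them.
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                   # routing logits start at 0
    for _ in range(iterations):
        c = softmax(b, axis=1)                    # each row sums to 1 over j
        s = (c[:, :, None] * u_hat).sum(axis=0)   # weighted sum per output capsule
        v = squash(s)
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)  # add inner-product agreement
    return v, c

rng = np.random.default_rng(0)
v, c = dynamic_routing(rng.normal(size=(6, 3, 4)))
```

The loop contains no gradient step: the coefficients change only through the agreement term, while the weights that produce `u_hat` would be trained separately by backpropagation.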
For all capsules i and j, $b_{ij}$ is initialized to zero. The routing algorithm converges easily; in practice, 3 iterations already give good results. $c_{ij}$ is updated through consistency-based routing and does not need to be updated according to the loss function, but the other convolution parameters and $W_{ij}$ in the entire network do need to be updated according to the loss function. In general, these parameters can be updated directly from the loss function using standard backpropagation. The expression of this loss function is [40]
$$L_c = T_c \max(0, m^+ - \|V_c\|)^2 + \lambda \, (1 - T_c) \max(0, \|V_c\| - m^-)^2,$$
where c is the classification category, $T_c$ is the indicator function of the classification (1 if c exists and 0 if it does not), $m^+$ is the upper boundary, $m^-$ is the lower boundary, and $\lambda$ down-weights the loss for absent categories as in [40]. In addition, $\|V_c\|$ is the $L_2$ norm of the output vector of capsule c.
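For illustration, the margin loss can be written as a short function. The default values `m_plus=0.9`, `m_minus=0.1`, and `lam=0.5` are the ones reported in [40]; whether this paper uses the same values is not stated, so they should be read as assumptions.

```python
import numpy as np

def margin_loss(v_norm, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss summed over categories.

    v_norm: lengths of the output capsule vectors, shape (..., n_classes).
    targets: indicator T_c, 1 where the category is present, else 0.
    Default margins and lam follow [40] and are assumptions here.
    """
    present = targets * np.maximum(0.0, m_plus - v_norm) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_norm - m_minus) ** 2
    return np.sum(present + absent, axis=-1)

confident = margin_loss(np.array([0.95, 0.05]), np.array([1.0, 0.0]))
unsure = margin_loss(np.array([0.5, 0.5]), np.array([1.0, 0.0]))
```

A capsule whose length is already beyond the margins contributes no loss, so only uncertain predictions drive the weight updates.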

TCN in Capsules TCN Network.
TCN performs better than baseline recurrent architectures in a wide range of sequence modeling tasks. Because these tasks include various benchmarks that are often used to evaluate recurrent network designs, this shows that the recent success of convolutional architectures in sequence processing applications is not limited to those areas [41]. TCN is based on two principles: the network produces an output of the same length as the input, and no information can leak from the future to the past. To satisfy the first principle, TCN uses a one-dimensional fully convolutional network (1D FCN) architecture, in which each hidden layer has the same length as the input layer, and zero padding of length (kernel size − 1) is added to keep subsequent layers the same length as previous ones. To satisfy the second principle, TCN uses causal convolutions, in which the output at time t is convolved only with elements from time t and earlier in the previous layer. In short, TCN = 1D FCN + causal convolutions.
The major difference between TCN convolution and ordinary 1D convolution is the use of dilated convolutions. The higher the level, the larger the convolution window and the more "holes" in the convolution window. More formally, for a 1D sequence input $X \in \mathbb{R}^n$ and a filter $f : \{0, \ldots, k-1\} \to \mathbb{R}$, the dilated convolution operation F on element s of the sequence is defined as
$$F(s) = (X *_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}, \qquad (9)$$
where d is the expansion (dilation) factor, k is the size of the filter, and $s - d \cdot i$ accounts for the direction of the past [28]. Expansion is therefore equivalent to introducing a fixed step between every two adjacent filter taps. A simple causal convolution can only look back over a history whose size is linear in the depth of the network, which makes it challenging to apply causal convolution to time series in which a longer history is critical. Dilated convolution is a good solution for acquiring an exponentially large receptive field. As illustrated in formula (9), d is the expansion factor; when d = 1, the expanded convolution reduces to a regular convolution. Using a larger dilation enables an output at the top level to represent a wider range of inputs. This effectively expands the receptive field of the convolution, and deep networks further extend the effective history.
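Formula (9) can be checked with a direct, loop-based implementation. This is a deliberately plain sketch for a single channel; the function name and the choice to left-pad with (k − 1)·d zeros, so that the output keeps the input length under dilation, are assumptions of this example.

```python
import numpy as np

def dilated_causal_conv(x, f, d=1):
    """Apply F(s) = sum_i f[i] * x[s - d*i] with zero padding on the left.

    x: 1D input sequence, f: filter of size k, d: expansion factor.
    Positions before the start of the sequence are treated as zero.
    """
    x = np.asarray(x, dtype=float)
    n, k = len(x), len(f)
    pad = (k - 1) * d                       # left padding keeps output length n
    xp = np.concatenate([np.zeros(pad), x])
    out = np.zeros(n)
    for s in range(n):
        for i in range(k):
            out[s] += f[i] * xp[pad + s - d * i]
    return out

y1 = dilated_causal_conv([1, 2, 3, 4, 5], [1.0, 1.0], d=1)
y2 = dilated_causal_conv([1, 2, 3, 4, 5], [1.0, 1.0], d=2)
```

With d = 2 each output combines the current element with the one two steps earlier, and no output ever depends on a later input, which is the causality property required by TCN.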
Every two such convolutional layers and an identity mapping are encapsulated into a residual module (the residual module here is different from that of ResNet). The residual module contains the ReLU function, and a fully convolutional layer is used instead of a fully connected layer in the last few layers, as shown in Figure 9(a).
Generally, when using expanded convolution, d is increased exponentially as the depth of the network increases. When the expansion factor is 1, as shown in Figure 9(b), the expanded convolution degenerates into a causal convolution with a receptive field of 2. When the expansion factor is 2, the receptive field of the expanded convolution becomes 4. The final output contains all input information. Controlling the expansion factor enlarges the span covered by the convolution kernel and thereby increases the receptive field.
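The growth of the receptive field can be verified with a one-line calculation. The sketch assumes one dilated causal convolution per level with a kernel of size 2, which matches the receptive fields of 2 and 4 quoted above; a residual module with two convolutions per level would grow faster.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal convolutions.

    Each layer with expansion factor d adds (kernel_size - 1) * d
    past positions. Assumes one convolution per level.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

one_level = receptive_field(2, [1])
two_levels = receptive_field(2, [1, 2])
four_levels = receptive_field(2, [1, 2, 4, 8])
```

Doubling d at every level doubles the receptive field with each added layer, so the covered history grows exponentially with depth rather than linearly.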
Large-scale neural networks have two disadvantages: (1) they are time-consuming to train, and (2) they are prone to overfitting. The dropout layer prevents overfitting of the network. During training, dropout temporarily drops a portion of the neural network units from the network with a certain probability, which is equivalent to finding a more streamlined network within the original network.
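A minimal sketch of dropout is given below. It uses the common "inverted" variant, which rescales the kept units during training so that nothing changes at inference time; this rescaling is an assumption of the example and is not described in the text.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the units during training.

    Kept units are divided by (1 - rate) so the expected activation is
    unchanged (inverted dropout, an assumption of this sketch).
    At inference time the input is returned untouched.
    """
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones(10_000)
y = dropout(x, 0.5, rng)
```

Each training step therefore sees a different thinned network, which discourages units from co-adapting.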

Superresolution Matrix of Inflow and Outflow Based on GAN.
Unlike traditional time series prediction, the result of urban traffic flow prediction is a matrix rather than a single value. Suppose a high-resolution prediction result is needed, for example, with the city divided into 32 × 32 grids. Obtaining the final 32 × 32 prediction within minutes through the two neural network models, the Capsules Network and the Temporal Convolutional Network, cannot be achieved under current hardware conditions. Reducing the resolution (matrix dimension) of the input data can achieve an exponential time-saving effect, so it is reasonable to use a GAN-based superresolution reconstruction model to reconstruct the high-resolution prediction results. Although some prediction accuracy is sacrificed, a minute-level high-resolution prediction result can be obtained under the available hardware conditions. In urban traffic command and public safety scenarios, it is vital to obtain near-accurate results faster, which better supports decision makers in making timely and effective judgments.
Because the amount of data is huge and the calculation is complicated, only the vehicle traffic prediction of the experimental city on the 16 × 16 grid is calculated. However, in real-life applications, dividing the experimental city into 16 × 16 grids is not adequate. Dividing the city into a finer-grained grid is undoubtedly the solution to this problem. However, because more data must then be computed, the traffic flow at the next interval cannot be predicted in time.
The superresolution reconstruction takes the city traffic flow of the 16 × 16 experimental scene and obtains a 32 × 32 traffic flow prediction result. If a 32 × 32 result were predicted directly, the input data to be processed would increase by 4 times, and the overall calculation volume would also increase exponentially. Instead, we obtain the 32 × 32 results from the predicted 16 × 16 results with a Generative Adversarial Network (GAN). The overall structure and workflow of the GAN-based traffic superresolution reconstruction are shown in Figure 10. The 16 × 16 experimental scene of urban vehicle traffic is used as a low-resolution matrix sequence, which passes through convolution layers to form a set of arranged matrices. This set of matrices outputs a 32 × 32 high-resolution matrix after passing through the GAN.
The convolution layer takes one frame of the low-resolution matrix as input and convolves it. The training process of the convolutional layer network is the optimization process of its parameters. The spatial transformation can be expressed as
$$I'_{t+k} = T_{\theta_i}(I_{t+k}),$$
where the matrix $I'_{t+k}$ represents the high-resolution matrix obtained by the transformation $T_{\theta_i}(I_{t+k})$, and the transformation is $T(\cdot)$ [42]. The loss function of the convolutional layer network is expressed with a regularization method. The optimal parameter estimation process is given by formula (11) [42], in which $\theta_i^*$ represents the parameters of the optimization estimation, $\lambda$ is a regularization parameter, and Q is a Laplacian. Differentiate the right side of formula (11) with respect to the parameters $\theta_i^*$, and set the result equal to 0. Use the steepest gradient descent method to iteratively solve the equation until the error is less than a preset threshold. The output parameter $\theta_i^*$ is the estimated optimal parameter.
The weight representation of the reconstruction network refers to defining a weight for each input low-resolution matrix and then performing weight representation on the input low-resolution matrix to obtain a frame of high-frequency detail information. We add a convolution layer before the generative adversarial reconstruction network to complete the weight representation of the low-resolution matrix after the convolution layer. The mathematical expression of the weight representation can be expressed as [42] $X(m, n) = \sum \ldots$, where $\omega_k(m, n)$ represents the weight value corresponding to the matrix block of the low-resolution matrix sequence. Generally, the same weight is defined for the matrix block. K represents the number of input low-resolution matrices, and $(m, n)$ represents the serial number corresponding to the matrix block, with $m \in \{0, \ldots, M-1\}$ and $n \in \{0, \ldots, N-1\}$.

Experimental Data and Preprocessing.
In the experimental verification part, the urban taxi dataset (taxi GPS) of the experimental scenario is used, and the data is shown in Table 1. This article uses the hold-out method: (1) the dataset is divided into two disjoint parts, one being the training set and the other the test set; (2) the data distribution is kept roughly consistent, similar to stratified sampling; (3) in this paper, one year of data is used as the training set, and 4 months of data is used as the test set, so the training set accounts for 75% of the data.
We use historical taxi traffic data to forecast taxi traffic at a future moment. The experiment uses the urban taxi GPS track data of the experimental scene from June 10, 2018, to June 10, 2019, as the training set and the remaining data as the test set. To facilitate the display and calculation of the results, we select the period from 8:00 to 10:00 AM for analysis.
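The chronological hold-out split can be sketched as follows. The record layout (a timestamp followed by a payload), the function name, and the decision to treat the end date as exclusive and to place only later records in the test set are assumptions made for this example.

```python
from datetime import datetime

def holdout_split(records, train_start, train_end):
    """Split (timestamp, payload) records into disjoint train and test sets.

    Records in [train_start, train_end) form the training set; records at
    or after train_end form the test set (boundary handling is assumed).
    """
    train = [r for r in records if train_start <= r[0] < train_end]
    test = [r for r in records if r[0] >= train_end]
    return train, test

records = [
    (datetime(2018, 6, 9), "a"),
    (datetime(2018, 6, 10), "b"),
    (datetime(2019, 6, 9), "c"),
    (datetime(2019, 6, 10), "d"),
    (datetime(2019, 9, 1), "e"),
]
train, test = holdout_split(records, datetime(2018, 6, 10), datetime(2019, 6, 10))
```

Splitting by time rather than at random keeps future data out of the training set, which matters for a forecasting task.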
The city is divided into 16 × 16 grids, as shown in Figure 6(a). The GPS trajectory of each taxi is then mapped to the grid area, and a grid area map is developed, as shown in Figure 6(b). The grids represent regions, and the line segments connect two regions (connected by taxi in this article). The area map thus combines data from the road network and the taxi trajectories.
In Keras, learnable parameters are initialized with a uniform distribution with default parameters. The convolutions of the 1st CapsulesNet and all TCNs use 32 filters of size 3 × 3, and the 2nd CapsulesNet uses 2 filters of size 3 × 3. Each Capsules TCN Network unit consists of 4 TCNs and 2 CapsulesNets. See Table 2 for details; there are five additional hyperparameters in Capsules TCN Network.
In our superresolution reconstruction experiment, the 16 × 16 grid map is the low-resolution input frame, and a 32 × 32 grid map is obtained through the GAN-based traffic prediction network. The magnification of the reconstruction is 2 × 2. The initial learning rate is set to $10^{-4}$, and every 10,000 iterations the learning rate drops by 5%. To balance the convergence and training time of the network, the maximum number of iterations for superresolution reconstruction is set to $10^6$.
The experiments are run on the GPU server, and its detailed information is shown in Table 3.
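The learning-rate schedule for the superresolution network can be written as a step decay. The sketch assumes that "drops by 5%" means multiplying the current rate by 0.95 at every 10,000th iteration; the function name is illustrative.

```python
def learning_rate(iteration, base_lr=1e-4, drop=0.05, step=10_000):
    """Step decay: multiply the rate by (1 - drop) every `step` iterations."""
    return base_lr * (1.0 - drop) ** (iteration // step)

lr_start = learning_rate(0)
lr_after_one_drop = learning_rate(10_000)
lr_after_two_drops = learning_rate(25_000)
```

Under this reading, the rate is reduced 100 times over the full training run of one million iterations.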

Experimental Environment and Evaluation
We use the Root Mean Square Error (RMSE) to evaluate the model [43]:
$$\mathrm{RMSE} = \sqrt{\frac{1}{Z}\sum_{i=1}^{Z}\left(x_i - \hat{x}_i\right)^2},$$
where $x_i$ is the real value and $\hat{x}_i$ is the corresponding predicted value; $Z$ is the number of all available true values. The RMSE measures the deviation between the predicted value and the true value, which makes it suitable for this experiment.
Furthermore, evaluation indicators are needed to measure the quality of the superresolution reconstruction algorithm. The requirements for reconstruction results differ across application scenarios, so the evaluation standards differ as well. Evaluation methods are generally divided into two categories: subjective evaluation and objective evaluation. In objective evaluation, the two most commonly used indicators are the Peak Signal-to-Noise Ratio (PSNR) [44] and Structural Similarity (SSIM) [45].
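The RMSE described above can be computed in a few lines. This is a minimal sketch; the function name and the flat-list inputs are assumptions for illustration.

```python
import math


def rmse(true_values, predicted_values):
    """Root mean square error over Z paired true and predicted values."""
    if len(true_values) != len(predicted_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (x - x_hat) ** 2 for x, x_hat in zip(true_values, predicted_values)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a few large deviations raise the RMSE more than many small ones.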
PSNR [44] is calculated from the mean square error (MSE) between the reference matrix $f(x, y)$ and the matrix to be evaluated. In the experiment, the matrices are the 32 × 32 grids. By the definition of PSNR, a larger PSNR of the matrix to be evaluated indicates a better reconstruction result.
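As a rough illustration of the PSNR computation, the sketch below uses the standard definition, PSNR = 10 log10(peak^2 / MSE). The function names and the `peak` parameter (the largest value a cell can take) are assumptions; the text does not state which peak value was used.

```python
import math


def mse(reference, estimate):
    """Mean square error between two equally sized matrices (lists of rows)."""
    total, count = 0.0, 0
    for ref_row, est_row in zip(reference, estimate):
        for f, f_hat in zip(ref_row, est_row):
            total += (f - f_hat) ** 2
            count += 1
    return total / count


def psnr(reference, estimate, peak):
    """PSNR in decibels. `peak` is an assumed maximum cell value."""
    error = mse(reference, estimate)
    if error == 0:
        return math.inf  # identical matrices: no noise at all
    return 10 * math.log10(peak ** 2 / error)
```

A smaller MSE gives a larger PSNR, which is why a larger PSNR indicates a better reconstruction.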
SSIM [45] is calculated from the following statistics: $\mu_f$ is the average value of the reference matrix, $\hat{\mu}_f$ is the average value of the matrix to be evaluated, $\sigma_f$ is the variance of the reference matrix, and $\hat{\sigma}_f$ is the variance of the matrix to be evaluated.
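The SSIM statistics listed above can be combined as in the following sketch, which uses the commonly cited form of SSIM evaluated once over the whole matrix. Several details are assumptions not given in the text: the covariance term, the stabilizing constants `c1 = (0.01 * peak) ** 2` and `c2 = (0.03 * peak) ** 2`, the `peak` parameter, and the use of a single global window instead of sliding local windows.

```python
def ssim_global(reference, estimate, peak):
    """Single-window SSIM between two equally sized matrices (lists of rows)."""
    f = [v for row in reference for v in row]
    g = [v for row in estimate for v in row]
    n = len(f)
    mu_f = sum(f) / n
    mu_g = sum(g) / n
    var_f = sum((v - mu_f) ** 2 for v in f) / n
    var_g = sum((v - mu_g) ** 2 for v in g) / n
    cov = sum((a - mu_f) * (b - mu_g) for a, b in zip(f, g)) / n
    # Assumed default constants; they keep the ratio stable near zero.
    c1 = (0.01 * peak) ** 2
    c2 = (0.03 * peak) ** 2
    numerator = (2 * mu_f * mu_g + c1) * (2 * cov + c2)
    denominator = (mu_f ** 2 + mu_g ** 2 + c1) * (var_f + var_g + c2)
    return numerator / denominator
```

Identical matrices give a value of 1, and the value decreases as the means, variances, or structure of the two matrices diverge.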

Effect of Hyperparameters on Experimental Results
The number of CapsulesNets affects the results on the taxi GPS dataset, as shown in Figure 11(a). The network depth also greatly affects the experimental results. As shown in Figure 11(b), the RMSE of the model fluctuates as the number of TCNs increases. This indicates that a deeper network is not always better: although depth lets the network capture not only nearby but also distant spatial dependencies, training becomes very difficult when the network is very deep (for example, when the number of TCNs is 15). Based on the above comparison, the number of CapsulesNets is set to two, and the number of TCNs is set to four.
HA predicts the inflow and outflow of people based on the historical average of inflow and outflow at the same time and in the same area in the past. For example, to predict the inflow of a region from 10:00 to 10:30 AM this Thursday, it calculates the average of the inflow from 10:00 to 10:30 AM on every past Thursday in this region.
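The HA baseline can be sketched as a lookup table of averages keyed by weekday, time slot, and region. The record layout `(timestamp, region, inflow)` and the function names are assumptions for illustration; timestamps are assumed to mark the start of each interval (for example, 10:00 for the 10:00 to 10:30 AM slot).

```python
from collections import defaultdict
from datetime import datetime


def fit_historical_average(history):
    """Average inflow per (weekday, hour, minute, region) over past records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, region, inflow in history:
        key = (ts.weekday(), ts.hour, ts.minute, region)
        sums[key] += inflow
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def predict_ha(model, ts, region):
    """Predict inflow for a future slot from the matching historical average."""
    return model[(ts.weekday(), ts.hour, ts.minute, region)]
```

The same functions apply to outflow by passing outflow values in place of inflow.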
ARIMA (autoregressive integrated moving average) is a well-known model for understanding and predicting future values in a time series. Among traditional linear models, it has been widely used in passenger flow prediction. It generalizes autoregressive (AR), integrated (I), and moving average (MA) models [46].
LSTM is a special RNN that can learn long-term time dependencies [47].
We compute the RMSE between the Capsules TCN Network predictions and the true values and then compare it with that of the other prediction models to verify the validity of the Capsules TCN Network. The results are shown in Figure 12. The comparison shows that the proposed Capsules TCN Network has a smaller RMSE, and therefore higher prediction accuracy, than the other models, which indicates its effectiveness for traffic prediction tasks. Figure 13 shows the spatial-temporal distribution of taxi traffic during the morning rush hour, 8:30-10:00 AM, on September 30, 2018. The results in Figure 13 show that the proposed Capsules TCN Network captures the spatiotemporal characteristics of the changes in taxi traffic and makes predictions with sufficient accuracy.
The experimental results for the GAN-based superresolution matrix of inflow and outflow are also presented. Figure 14(a) is the result of the GAN-based superresolution reconstruction, and Figure 14(b) is the real value, where the low-resolution input matrix is taken from Figure 6(b). Subjectively, the GAN-based superresolution reconstruction achieves good results and is visually close to the real value. The objective assessment is as follows: PSNR is 33.844 and SSIM is 0.93. We also obtain the PSNR and SSIM of the GAN-based superresolution matrix of inflow and outflow at 64 × 64 and 128 × 128, respectively. In the 64 × 64 scene, PSNR is 28.94 and SSIM is 0.88. In the 128 × 128 scene, PSNR is 22.75 and SSIM is 0.79.

Conclusions
Traffic forecasting has been a core issue in transportation planning and management, and it has also been a major issue in urban computing. Predicting traffic volume can help improve urban traffic safety and make traffic flow more orderly. We propose a method based on the Capsules Network and the Temporal Convolutional Network to predict traffic flow in local areas of the city. This method is called the Capsules TCN Network. The Capsules TCN Network model can learn the spatial dependence, time dependence, and external factors of traffic flow prediction. We evaluated the model on the GPS track data of urban taxis in the experimental scenarios and verified that it is well suited to vehicle traffic prediction. Because the accuracy of regional traffic flow differs across scenarios, we propose a GAN-based superresolution reconstruction model of traffic flow to improve the accuracy of the Capsules TCN Network results. The experimental results show that the GAN-based traffic superresolution reconstruction model not only has a better subjective visual effect but also achieves better objective evaluation indicators.

Data Availability
The dataset used in this article is from a commercial company. If you need the dataset used in this study, you can send a usage request to centaureacyanus@foxmail.com. After being authorized by the company, the dataset will be transmitted to the applicant in the form of an email attachment.

Conflicts of Interest
The authors declare that they have no conflicts of interest.