Short-Term Traffic Flow Prediction with Recurrent Mixture Density Network



Introduction
Over the past few decades, the number of megalopolises has increased rapidly, especially in developing countries such as China. In the process of city development, urban layout planning, environmental protection, and traffic management remain difficult challenges [1]. Hence, there is great demand for smart city construction and intelligent transportation systems (ITS) [2]. Short-term traffic flow prediction aims to provide future traffic information for the road network, which is useful for both individual users and transportation management [3]. Timely and accurate traffic situation awareness contributes to a wide spectrum of applications, such as route planning, traffic congestion alleviation, and emission reduction. With the rapid development of sensors, communication techniques, and location acquisition techniques, such as the Global Navigation Satellite System (GNSS) and 5G technology, massive multisource data, including trajectory data and traffic flow data, are generated and communicated by prevalent traffic information acquisition devices [4]. The evolution of traffic conditions has both spatial and temporal characteristics [5].
There is abundant potential knowledge about traffic conditions contained in traffic data, which implies the spatiotemporal patterns of traffic flow. Ubiquitous traffic data have led to a growing number of data-driven traffic flow prediction methods. Hence, it is feasible to predict the evolution trend of traffic flow via knowledge discovery. Predicting short-term traffic flow refers to estimating the traffic flow indices of a certain road or transverse section at one or more following time instants, based on recent and current traffic flow data. Most existing traffic flow prediction methods can be categorized into parametric methods and nonparametric methods [6].
In the category of parametric methods, the prediction models maintain a fixed structure. The model parameters are estimated empirically from historical data. Widely used parametric methods include time-series models, the Autoregressive Integrated Moving Average (ARIMA) model [7], and the Kalman Filter [8]. The structure of ARIMA(p, d, q) is determined by three core parameters, where p denotes the autoregressive order, d denotes the order of integration (differencing), and q denotes the moving average order. The ARIMA model converts the time-series data into a stationary series without trend and captures both the autocorrelation features and the noise features of the stationary series. Kalman Filter based methods rely on a transition matrix and a dynamic equation. However, short-term traffic flow is highly nonlinear and stochastic, and its evolution is inherently uncertain. Hence, traditional parametric methods are limited in capturing the inherent features of traffic flow and fail to achieve the expected performance.
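To make the role of the integration order d concrete, the following minimal sketch applies first-order differencing to a short series. The function name `difference` and the sample values are illustrative assumptions, not data or code from this study.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing: y[t] = x[t] - x[t-1]."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Hypothetical index readings with a linear upward trend (illustrative only).
readings = [1.0, 1.2, 1.4, 1.6, 1.8]

# With d = 1 the linear trend is removed and a roughly constant series remains.
trend_free = difference(readings, d=1)
```

Each round of differencing shortens the series by one element, so an ARIMA model with d = 1 is fitted on one fewer observation than the raw series contains.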
In order to improve prediction precision, a large body of literature focuses on applying nonparametric models to traffic flow prediction. Nonparametric methods are mainly based on machine learning and deep learning technologies, which do not involve predefined parameters and structure [9]. Deep neural networks (DNN) [10], including the convolutional neural network (CNN) [11] and the recurrent neural network (RNN) [12], have achieved remarkable performance in a wide range of fields such as computer vision (CV) and natural language processing (NLP). The rapid development of deep learning technologies brings opportunities to improve the precision of traffic flow prediction more efficiently. An RNN can learn historical information and contextual dependency from traffic flow series. Furthermore, an RNN excels at fitting nonlinear time series by virtue of its nonlinear activation units. In practice, however, the concrete values of traffic flow indices are nonrepetitive and usually follow a certain distribution. Predicting concrete values does not match the fact that the future evolution of traffic flow is uncertain. Hence, the methods introduced above, which regard the anticipated traffic flow indices as deterministic results, could be confused by the continuous and nonrepetitive values.
Motivated by the current state of research, we dedicate this paper to a short-term traffic flow prediction approach based on the recurrent mixture density network. The recurrent mixture density network is constructed by integrating the long short-term memory (LSTM) network and the mixture density network (MDN) [13]. The LSTM network is a variant of the RNN with a special unit structure, which is capable of learning historical information and contextual dependency in time series while avoiding the vanishing gradient issue. The main purpose of introducing the MDN is to parameterize mixture distributions by the outputs of the neural network. In this way, the prediction model generates the probable distributions of anticipated traffic flow indices instead of concrete values. The ultimate prediction results can be obtained by sampling from the distributions. This manner of fuzzy prediction accords with the actual behavior of traffic flow evolution. To the best of our knowledge, this is the first time that the recurrent mixture density network has been applied to a real-world short-term traffic flow prediction task. Additionally, the recurrent mixture density model is easy to transfer to different application scenarios, owing to its generalization and learning ability. The prediction model is implemented on a real-world traffic flow prediction task. Unlike most existing work, which predicts the value of only one traffic flow index at the next time step, we predict both the traffic time indices (TTI) and the average speed (AS) of a certain road section at several following time steps simultaneously. It is reasonable to consider the coherent relationship between TTI and AS, and predicting traffic flow at more than one time step provides more valuable information for traffic situation awareness.
The remainder of this paper is organized as follows: the previous work related to short-term traffic flow prediction is reviewed in Section 2; Section 3 elaborates the prediction model; Section 4 presents the experimental results and the corresponding evaluation; and, finally, the discussion and conclusions are presented in Section 5.

Related Work
Short-term traffic flow prediction methods can be summarized as follows: build a mathematical model based on historical series data and then use the model to predict the future values of traffic flow indices. In general, the existing prediction methods can be categorized into two groups according to the way of modeling: parametric methods and nonparametric methods.
In the category of parametric methods, the model parameters are estimated empirically from historical data. The ARIMA model is a typical parametric model. As early as 1979, ARIMA was applied to represent freeway traffic time-series data and make forecasts one time interval in advance [14]. In recent years, various ARIMA models and ARIMA-based models have been proposed. A vector autoregressive model, which incorporated upstream as well as downstream traffic flow information, was found to be appropriate and better than the traditional ARIMA model in practical experiments [15].
The ARIMA model and its variants, such as autoregressive moving average (ARMA) and seasonal ARIMA (SARIMA), have also been compared with other widely used machine learning methods to evaluate their efficiency in traffic flow prediction tasks [16,17].
However, short-term traffic flow is difficult to predict due to its nonlinear and stochastic nature, and the performance of parametric methods is limited. Much existing work attempts to develop nonparametric methods with flexible structure and parameters. Nonparametric methods mainly include machine learning and deep learning techniques [18]. The RNN and its variants, such as the LSTM and gated recurrent unit (GRU) networks, are widely used in short-term traffic flow prediction tasks. One of the advantages of the LSTM network is that the format of the input data is flexible. Hence, in order to improve the prediction performance, several traffic flow indices such as flow, speed, and occupancy can be used as input simultaneously [19]. The LSTM network and the GRU network were also evaluated against ARIMA. The experiments demonstrated that the RNN based methods perform better than the ARIMA model [20]. Experiments including several prevailing parametric and nonparametric algorithms also suggest the excellent prediction ability of the LSTM network [21]. Experimental results of multiple prediction tasks with different time intervals verified the generalization ability of the LSTM network [22].
Additionally, other deep learning techniques are also widely used for traffic flow prediction. The CNN, an excellent technique in image recognition, has been utilized to learn traffic flow as an image and make predictions [23]. A deep architecture combining CNN and LSTM has been developed and applied to traffic flow forecasting [24]. Stacked autoencoders (SAE) have also been reported to perform better than other machine learning techniques [25].
In this work, we construct the recurrent mixture density network and apply it to a real-world short-term traffic flow prediction task, taking advantage of the LSTM network while integrating the parameterization function of the MDN. The recurrent mixture density network and the data processing procedure are elaborated below.

Recurrent Mixture Density Network.
The recurrent mixture density network is constructed by integrating the LSTM network and the MDN, so that it can learn historical information and contextual dependency and generate predicted mixture distributions. Unlike the ordinary feedforward neural network, which only has connections between adjacent layers, the LSTM network allows recurrent connections between adjacent cells of the same layer. Hence, the LSTM network is deep in the time-series dimension and can map the historical input series to the output via recurrent connections.
The most distinct characteristic of the LSTM network is the unique gate structure of the LSTM cell, including the input gate, forget gate, and output gate, as shown in Figure 1.
The three gates work collaboratively to encode the input signal and generate outputs, which enables the LSTM network to maintain historical information contained in series data. In the LSTM encoding of the input signal, $f_j$, $i_j$, and $o_j$ denote the forget gate, input gate, and output gate, respectively. $\sigma$ and $\tanh$ denote the logistic sigmoid function and the hyperbolic tangent function, respectively, both of which are used as nonlinear activation functions. $h_{j-1}$ is the output of the $(j-1)$th unit in the LSTM layer. $x_j$ denotes the input signal from the preceding layer.
The weight matrices $W_f$, $W_i$, $W_o$ and the corresponding bias $b$ associated with these gates are variables to be optimized during the training process. LSTM cells retain and update the cell state while receiving an input signal.
Then, the updated cell state is activated by the nonlinear activation function and fed to the next layer.
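For reference, the gate description above is consistent with the widely used LSTM formulation sketched below. This is an assumed standard form rather than the authors' exact equations, which are not reproduced in the text: the cell state $c_j$, the candidate state $\tilde{c}_j$, the weights $W_c$, the per-gate biases $b_f$, $b_i$, $b_o$, $b_c$, and the concatenation $[h_{j-1}, x_j]$ are notational assumptions.

```latex
\begin{aligned}
f_j &= \sigma\left(W_f\,[h_{j-1}, x_j] + b_f\right) \\
i_j &= \sigma\left(W_i\,[h_{j-1}, x_j] + b_i\right) \\
o_j &= \sigma\left(W_o\,[h_{j-1}, x_j] + b_o\right) \\
\tilde{c}_j &= \tanh\left(W_c\,[h_{j-1}, x_j] + b_c\right) \\
c_j &= f_j \odot c_{j-1} + i_j \odot \tilde{c}_j \\
h_j &= o_j \odot \tanh\left(c_j\right)
\end{aligned}
```

Here $\odot$ denotes element-wise multiplication. The additive update of $c_j$ is what lets gradients flow across many time steps.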
Outputs of all the LSTM layers are concatenated as the integrated output of the hidden layers and sent to the MDN layer, as shown in Figure 2. The MDN layer is a fully connected layer:

$$y_j = b_y + \sum_{n=1}^{N} W_{h^n y}\, h^n_j$$

where $N$ denotes the number of stacked LSTM layers and $h^n_j$ is the output of the $n$th bidirectional LSTM layer. $W_{h^n y}$ denotes the weight matrix connecting the $n$th LSTM layer and the MDN layer, $b_y$ is the corresponding bias, and $y_j$ denotes the outputs of the MDN layer.

The outputs of the MDN layer are used to parameterize mixture distributions. In order to capture the probable range of the two-dimensional vector $x = [\mathrm{TTI}, \mathrm{AS}]$ simultaneously, a bivariate Gaussian mixture model is adopted as the mixture distribution. Each individual component of the mixture is a bivariate Gaussian parameterized by means $\mu$, standard deviations $\sigma$, and a correlation $\rho$. Additionally, the Gaussian mixture model contains the weight parameters $\pi$, and $M$ denotes the number of individual components in the mixture. All the outputs generated by the network must be normalized to meaningful ranges before being used as parameters.

The probability density of the training label $x_{j+l}$ ($l \geq 1$) under the output mixture distribution is the sum of the $M$ bivariate Gaussian densities $G$, weighted by $\pi$. The negative log-likelihood is employed as the loss function to be optimized in the training process. The Adam optimizer is applied in backpropagation through time to tune the network variables. We also adopt dropout and gradient clipping to mitigate the overfitting and exploding gradient issues.
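The parameterization and loss described above can be sketched in plain Python as follows. The normalization choices (softmax for $\pi$, exponential for $\sigma$, hyperbolic tangent for $\rho$) follow a common convention for mixture density outputs and are assumptions here, since the paper's own normalization equations are not reproduced in the text. The function names and the sample values are hypothetical.

```python
import math


def normalize_params(raw):
    """Map raw outputs for M components to valid mixture parameters.

    raw: list of M tuples (pi_hat, mu1, mu2, s1_hat, s2_hat, rho_hat).
    Assumed convention: softmax -> weights, exp -> std devs, tanh -> correlation.
    """
    max_pi = max(r[0] for r in raw)  # subtract the max for numerical stability
    exp_pi = [math.exp(r[0] - max_pi) for r in raw]
    total = sum(exp_pi)
    return [
        (e / total, mu1, mu2, math.exp(s1_hat), math.exp(s2_hat), math.tanh(rho_hat))
        for e, (_, mu1, mu2, s1_hat, s2_hat, rho_hat) in zip(exp_pi, raw)
    ]


def bivariate_gaussian(x1, x2, mu1, mu2, s1, s2, rho):
    """Density G of a bivariate Gaussian with correlation rho."""
    d1, d2 = (x1 - mu1) / s1, (x2 - mu2) / s2
    z = d1 * d1 + d2 * d2 - 2.0 * rho * d1 * d2
    one_minus = 1.0 - rho * rho
    norm = 2.0 * math.pi * s1 * s2 * math.sqrt(one_minus)
    return math.exp(-z / (2.0 * one_minus)) / norm


def mixture_nll(x, params, eps=1e-12):
    """Negative log-likelihood of a label x = (TTI, AS) under the mixture."""
    density = sum(
        pi * bivariate_gaussian(x[0], x[1], mu1, mu2, s1, s2, rho)
        for pi, mu1, mu2, s1, s2, rho in params
    )
    return -math.log(density + eps)  # eps guards against log(0)


# Hypothetical raw outputs for M = 2 components (not values from the paper).
raw_outputs = [(0.0, 1.5, 40.0, 0.0, 0.0, 0.0), (0.0, 3.0, 20.0, 0.0, 0.0, 0.0)]
params = normalize_params(raw_outputs)
loss = mixture_nll((1.5, 40.0), params)
```

In training, this per-label loss would be summed over the time steps of each sequence; a deep learning framework would compute the same quantity on tensors so that gradients can be backpropagated.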

Data Processing.
In order to predict the short-term traffic flow indices $x = [\mathrm{TTI}, \mathrm{AS}]$ at several following time steps, we prepare both the training input and the training label as sequences. All the training data are generated through a sliding window strategy [13], as shown in Figure 3, where we assume an input sequence length $k = 3$ and a prediction length $h = 2$. The training label has the same sequence length as the training input and lags it by $h = 2$ time steps. In the prediction stage, the prediction model also outputs the mixture distributions of $k = 3$ vectors, but we only pick out the last $h = 2$ mixture distributions for sampling. The ultimate prediction results are obtained by sampling from the mixture distributions via a roulette strategy based on the weights of the individual Gaussian components.
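The sliding window construction and the roulette sampling step can be sketched as follows. This is one plausible reading of the description above, in which the label window is the input window shifted $h$ steps into the future. All function names are hypothetical, and the mixture parameters are assumed to be tuples of the form (weight, mean 1, mean 2, std 1, std 2, correlation).

```python
import math
import random


def sliding_windows(series, k, h):
    """Return (input, label) pairs; each label is the input shifted h steps ahead."""
    pairs = []
    for start in range(len(series) - k - h + 1):
        window = series[start:start + k]
        label = series[start + h:start + h + k]
        pairs.append((window, label))
    return pairs


def roulette_pick(weights, rng):
    """Pick a component index with probability proportional to its weight."""
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return index
    return len(weights) - 1  # fallback for floating-point rounding


def sample_mixture(params, rng):
    """Draw one (TTI, AS) sample from a bivariate Gaussian mixture."""
    chosen = params[roulette_pick([p[0] for p in params], rng)]
    _, mu1, mu2, s1, s2, rho = chosen
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Correlate the second coordinate with the first using rho.
    x1 = mu1 + s1 * z1
    x2 = mu2 + s2 * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return x1, x2


# Toy series of time-step indices, with k = 3 and h = 2 as in Figure 3.
pairs = sliding_windows(list(range(6)), k=3, h=2)
```

With $k = 3$ and $h = 2$, the last two elements of each label window lie beyond the input window, which corresponds to keeping only the last $h$ mixture distributions at prediction time.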

Experiments
In this section, we carry out a practical short-term traffic flow prediction task using the recurrent mixture density network. The optimal parameter configuration of the prediction model is determined after several experiments. In order to assess the performance of the prediction model, two widely studied methods are introduced for comparison.

Dataset and Experiment Setup.
Shenzhen is one of the first-tier cities in China, where the economy has developed rapidly in recent years. As a megalopolis, Shenzhen faces typical city development challenges, such as transportation management. Shenzhen North Railway Station is one of the most important transportation hubs of the city. The road networks around Shenzhen North Railway Station are also main transportation corridors of the city, which directly influence the urban traffic efficiency. The traffic flow dataset is generated by sensors installed on the road networks, which record the speed and traffic flow information of vehicles. It contains TTI and AS information for twelve road sections around Shenzhen North Railway Station in two periods: from 1 January 2019 to 31 March 2019 and from 1 October 2019 to 21 December 2019. The time granularity is ten minutes; that is, the record at each timestamp denotes the average indices over ten minutes. Figure 4 presents an example of typical traffic flow patterns over time in one day for a certain road section.
In the experiments, we select the records from 1 December 2019 to 21 December 2019 as test data and use the rest for model training. We use one hour of traffic flow information to predict the indices in the upcoming half hour; that is, the input length is 6 and the prediction length is 3. Three metrics are employed for performance evaluation: mean absolute error (MAE), mean relative error (MRE), and root mean square error (RMSE). In these metrics, $x'_i$ and $x_i$ denote the prediction result and the ground truth, respectively, and $n$ is the total number of test traffic flow segments.
The prediction errors are calculated at every predicted time instant.
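The three metrics can be sketched as below using their usual definitions. The MRE is taken here as the mean of the per-sample relative errors $|x'_i - x_i| / |x_i|$, which is an assumption, since the paper's exact formula is not reproduced in the text. The sample values are hypothetical.

```python
import math


def mae(pred, truth):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def mre(pred, truth):
    """Mean relative error (assumed form: mean of |error| / |truth|)."""
    return sum(abs(p - t) / abs(t) for p, t in zip(pred, truth)) / len(truth)


def rmse(pred, truth):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


# Hypothetical predictions and ground truth for one prediction horizon.
predicted = [1.1, 2.0, 3.3]
observed = [1.0, 2.0, 3.0]
scores = (mae(predicted, observed), mre(predicted, observed), rmse(predicted, observed))
```

Under this definition, the accuracy reported as 1 - MRE is simply `1 - mre(predicted, observed)`, computed separately for each predicted time step.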

Comparative Results.
Two widely used methods, the LSTM network and the ARIMA model, are included in the comparison experiments. The optimal parameter configuration of the recurrent mixture density network is determined after several experiments: the number of LSTM layers is set to 4, the number of LSTM units in each layer is set to 256, and the number of mixture components in the mixture density layer is set to 15. In the comparison experiments, the structure of the LSTM network is the same as that of the LSTM layers of the recurrent mixture density network. The three parameters of the ARIMA model (p, d, q) are set to (1, 1, 1) after several experiments. Note that the recurrent mixture density network and the LSTM network predict TTI and AS simultaneously, while the ARIMA model processes the two indices separately. Figure 5 depicts the performance evaluation of the model in the training process. According to the descending curve of the loss function, training converges in the vicinity of the 40th epoch.
The prediction performance comparison of the three methods is given in Table 1. The lowest errors for all metrics are highlighted in bold font and show that the recurrent mixture density network performs best. In Table 1, the accuracy (1 - MRE) of the recurrent mixture density network is over 0.92 at all the predicted timestamps for both the TTI and AS prediction tasks. The prediction errors increase as the number of prediction time steps increases. The performance of the recurrent mixture density network is promising in the 10 min, 20 min, and 30 min prediction tasks. Compared to the results of the LSTM network, the better performance of the recurrent mixture density network supports the superiority of the fuzzy prediction strategy. Combining the MDN and LSTM yields an accurate solution to the short-term traffic flow prediction problem. In the 10 min prediction, the recurrent mixture density network improves the accuracy of the TTI prediction and the AS prediction by 1.28% and 0.8%, respectively, over the ARIMA model, which indicates the advantage of the nonparametric method. The MAE is visualized in Figure 6, where we depict the cumulative distribution function (CDF) of the mean absolute errors of the 10 min, 20 min, and 30 min predicted results. The CDF curve of the recurrent mixture density network lies above the curves of the other two methods. Figure 6 thus provides strong evidence that the overall performance of the recurrent mixture density network is superior to that of the LSTM network and the ARIMA model. In the TTI prediction task, over 90% of the test data have a prediction error of less than 0.2, and over 80% have a prediction error of less than 0.1. For the prediction of AS, the prediction error of more than 92% of the test data is under 6 km/h, and that of more than 76% is under 3 km/h.
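The percentages quoted from the error CDF correspond to a simple empirical computation, sketched below. The function name `error_cdf` and the error values are hypothetical and are not taken from the experiments.

```python
def error_cdf(abs_errors, threshold):
    """Fraction of test samples whose absolute error is below the threshold."""
    below = sum(1 for e in abs_errors if e < threshold)
    return below / len(abs_errors)


# Hypothetical absolute TTI errors for four test samples (illustrative only).
tti_errors = [0.05, 0.08, 0.15, 0.30]
share_below_01 = error_cdf(tti_errors, 0.1)
share_below_02 = error_cdf(tti_errors, 0.2)
```

A method whose CDF curve lies above another's has, at every error threshold, a larger share of test samples below that threshold.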
It is notable that the proportions of test data with low prediction error in the LSTM experiments and the ARIMA experiments are close to that in the recurrent mixture density network experiments. However, there are more test data with large errors in the LSTM and ARIMA experiments. The CDF of error implies that, in most tests, all three methods yield promising results, whereas, in unexpected scenarios such as traffic congestion, the recurrent mixture density network is more reliable. The prediction results of the recurrent mixture density network on two different road sections are shown in Figure 7, which are typical instances of heavy traffic congestion and low traffic volume, respectively. The observed TTI and AS are also depicted for comparison and are clearly coherent throughout all 21 days. The predicted results match the observations well in both scenarios. However, the recurrent mixture density network performs better when traffic flows smoothly. Although the periodic pattern of the traffic flow indices is evident, there are still some abnormal values, especially in the road section with heavy traffic congestion, where the traffic condition can be extremely poor in rush hour. It is challenging to predict traffic congestion with a large TTI value and a low AS value; even so, most of the abrupt changes are matched well, as shown in Figure 7. Figure 8 presents the comparison of the prediction results of the three methods on a certain road section on 9 December. In the period from 8 o'clock to 9 o'clock, the traffic time index increased rapidly while the average speed decreased. In this rush hour, the recurrent mixture density network and the ARIMA model successfully capture the sudden change of the indices, while the performance of the LSTM network is not as good as that of the other two methods.
Meanwhile, the overall prediction results of the recurrent mixture density network and the LSTM network match the ground truth better than those of the ARIMA model.

Discussion and Conclusion
This paper addresses the problem of short-term traffic flow prediction. The main contributions of this paper are as follows: (1) we constructed the recurrent mixture density network by integrating the LSTM network and the mixture density network. To the best of our knowledge, this is the first time that the recurrent mixture density network has been used for short-term traffic flow prediction. (2) We implemented this method, together with two comparison methods, on a practical short-term traffic flow prediction task. The advantage of the recurrent mixture density network lies in two aspects: (1) it can learn historical information and contextual dependency in traffic flow data series, which is useful for yielding promising predictions. (2) The fuzzy prediction strategy, which generates probable distributions of the following traffic flow indices instead of concrete values, is closer to the actual behavior of traffic flow evolution. Unlike most existing work, which processes one traffic flow index at a time, we take the coherent relationship between the traffic time indices and the average speed into consideration. The recurrent mixture density network treats the two indices as a two-dimensional vector and processes them simultaneously. Meanwhile, we predict the traffic flow at several following time steps at once, which could provide more valuable information for traffic situation awareness.
The recurrent mixture density model is easy to apply to different application scenarios since it has strong generalization and learning ability. In practice, short-term traffic flow prediction is expected to play a crucial role in the development of intelligent transportation systems and the construction of smart cities.
Data Availability
The data can be accessed at https://www.sodic.com.cn/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.