Stacking Ensemble Learning Process to Predict Rural Road Traffic Flow

By predicting the future state of traffic and informing travelers through intelligent transportation systems, traffic congestion can be better avoided. In this study, an ensemble learning process is proposed to predict the hourly traffic flow. First, three base models, namely K-nearest neighbors, random forest, and recurrent neural network, are trained. The predictions of the base models are given to an XGBoost stacking model and to bagged averaging to determine the final prediction. Two groups of models predict the traffic flow of the short-term and mid-term future. In mid-term models, the predictor features are cyclical temporal features, holidays, and weather conditions. In short-term models, in addition to these features, the traffic flow observed in the past 3 to 8 hours is used. The results show that for both short-term and mid-term models, the lowest prediction error is obtained by the XGBoost model. In mid-term models, the root mean square error of the XGBoost for the Saveh to Tehran direction and Tehran to Saveh direction is 521 and 607 (veh/hr), respectively. For short-term models, these values decrease to 453 and 386 (veh/hr). This model also yields a lower prediction error for the first and fourth quartiles of the observed traffic flow, which represent rare events.


Introduction
Intelligent transportation systems are among the most efficient tools for managing traffic on transportation networks. These systems help achieve or maintain the balance between transportation supply and demand at low cost [1]. They include various subsystems, one of the most important of which is the advanced traveler information system. Through this system, available information about the transportation network is given to travelers so that they can plan their trips with more awareness. This information can describe the current state of the network, but it is more effective if it is predicted for the future state of the transportation network [2]. In such circumstances, the traveler is better prepared to choose an appropriate route and departure time, or even to decide whether to make the trip or cancel it. Generally, traffic parameters such as traffic volume [3], average speed [4], and travel time [5] are predicted and reported by intelligent systems. Because the time horizon of these predictions is limited to the near future, compared with the time horizon of classical four-step transportation planning, they are called short-term predictions.
Traffic parameters are predicted by analyzing past observations and identifying the features that affect their variation. For this purpose, time-series models, which are based on statistics and probability, have the longest history in previous studies. In time-series models, the variation of each traffic parameter is a function of that parameter's previously observed values, independent explanatory features, and a random term. For example, Kumar and Vanajakshi [6] predicted traffic flow using the seasonal autoregressive integrated moving average (SARIMA) model. Their results show that the model is more accurate than historical average models. In the study by Yan et al. [7], the autoregressive integrated moving average (ARIMA) model was used to predict subway passenger flow. Time-series models can only consider linear relationships between independent and dependent variables. Moreover, as the number of observations and features increases, traffic data become big data. These models are not compatible with big data characteristics, including volume, velocity, and variety [8].
Another approach to predicting traffic parameters is machine learning (ML). ML models are compatible with big data characteristics and can capture both linear and nonlinear relationships. Lack of interpretability and the inability to discover causal relationships are the main weaknesses of ML models, and time-series models are superior to ML models in this regard [9]. ML models are diverse; the artificial neural network (ANN) [10], support vector machine [11], and decision tree [12] are some of the widely used ones. To predict traffic flow, Ma et al. [13] use an ANN optimized by genetic algorithms and exponential smoothing. The results show that optimizing the artificial neural network improves prediction accuracy. A simple ANN treats consecutive observations independently. To capture the relationship between successive observations, Lu et al. [14] used a recurrent neural network (RNN) model. The RNN model emphasizes the time-series nature of the data by forming neural network blocks at different time intervals. Each block's input consists of the predictive features and the output of another block related to past times. The long short-term memory (LSTM) model is a type of RNN that considers the dependency of observations on both the near past (short term) and the far past (long term). This algorithm is used in the studies by Farahani et al. [15] and Chen [16]. Wang et al. [17] focus on the weakness of ANN models in interpreting results. After training a deep neural network model for traffic flow prediction, they interpret the model in two ways: first, by justifying the number of layers and nodes; second, by explaining the causality between historical data and the future state of traffic. The ML models used to predict traffic variables are not limited to neural network-based models or to the traffic flow prediction problem. As examples of other ML models and other traffic parameters, Xu et al. [18] predict the nominal traffic state using the Kalman filter, Zheng et al. [19] predict traffic speed with K-nearest neighbors (KNN), Liu et al. [20] predict traffic congestion with random forest (RF), and Yang et al. [21] predict travel time with the Markov chain method.
The variety of short-term prediction methods, together with the lack of a single technique that is the most accurate in all situations, has led researchers to use the ensemble learning process. In this process, the outputs of the base models are combined to provide one final prediction. In general, ensemble learning is divided into three categories: bagging, boosting, and stacking. In bagging, the base models are trained with the same training dataset, and the final prediction is determined by averaging or voting. In boosting, the base models are trained sequentially so that each model improves on the prediction accuracy of the previous one. In stacking, the predictions of the base models are the inputs of a supermodel, which can itself be an ML model, and the supermodel's output is the final prediction [22]. Using bagging ensemble modeling, Moretti et al. [23] combine predictions of statistical and neural network models to predict traffic flow. Yenru and Haghani [24] use a gradient boosting regression tree model to predict travel time. Ma et al. [25] use a contextual convolutional recurrent neural network to recognize inter- and intra-day traffic patterns. Lin et al. [26] propose a stacking ensemble learning process to predict public bicycle traffic flow. In these studies, ensemble learning leads to more accurate predictions than the base models.
In this study, hourly traffic flow is predicted using three ML base methods: KNN, RF, and RNN. The outputs of these models are given to XGBoost as a stacking supermodel and to bagged averaging to produce the final prediction in the ensemble learning process. The predictive models are divided into two categories: short-term and mid-term. The short-term models use, in addition to external predictive features (cyclical temporal features, holidays, and weather conditions), the traffic flow observed in the previous 3 to 8 hours; these models can predict the traffic flow only one and two hours into the future. The mid-term models use only the external predictive features and have no time horizon limitation. Finally, the accuracy of these two sets of models is evaluated and compared. The data used in this study are traffic data for both directions of Tehran-Saveh, a rural road in Iran. In general, identifying the dominant pattern of traffic parameters is more complicated on rural roads than on urban roads because, in contrast to urban trips, a significant share of rural trips is nonroutine.
This study's contribution is a stacking and bagging ensemble learning process consisting of three base ML algorithms (KNN, RF, and RNN) alongside XGBoost as a supermodel that combines the predictions of the base models. Although previous studies use ensemble learning for traffic parameter prediction, the architecture designed in this study is unique. XGBoost is a significant part of this structure; it is recommended as a stacking supermodel and has not been used in this role in previous studies on traffic parameter prediction. Also, short-term and mid-term models with different time horizons and different predictive features are trained and evaluated for a rural road, a setting that has been less investigated. Finally, employing cyclical features derived from temporal features is another novel idea for traffic flow prediction.

Data
This study's traffic data were collected by loop detectors for both directions of one section of the Saveh-Tehran rural road. Data collection was carried out for about three years, from 21 March 2017 to 10 March 2020. The data are divided into three parts: the first two years of observations are used to train the base models; the next six months, together with the corresponding predictions of the base models, are used to train the stacking model; and the last six months are used to test the performance of the base models and the stacking model. We call these datasets train 1, train 2, and test. The full set of training observations for the ensemble learning process, comprising train 1 and train 2, is called the train dataset. The raw data include hourly traffic flow and date. After exploring the relationship between hourly traffic flow and calendar attributes such as holidays and their type, new calendar-related features are added to the dataset. Since holidays in Iran are based on both the lunar and the solar calendar, and these two calendars are not aligned with each other, both are considered. Also, many passengers start their trips before a holiday and continue them until after it, so the effects of holidays on the traffic flow of the preceding and following days must be considered. Weather condition is another important factor affecting the traffic flow, and it is extracted and added to the features. Table 1 describes the candidate features for predicting traffic flow.
In Table 1, season, solar and lunar months, day of solar and lunar months, day of the week, and time of day (temporal features) are essentially cyclical and vary within fixed intervals. For instance, hour 23 and hour 0 are close to each other. The same applies to spring and winter, the first and last months of the year, and the first and last days of the week. The main difficulty is letting the algorithms know that these features vary in cycles. Computing sine and cosine components for each cyclical feature is an effective way to deal with this problem. For this purpose, sine and cosine transformations are applied to these features (equation (1)) [27].
The scatter plot of the temporal features after these transformations is shown in Figure 1.
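As a minimal sketch of such a transformation, the following assumes the common form in which a feature value x with period P is mapped to sin(2πx/P) and cos(2πx/P). The function names and the period of 24 hours are illustrative assumptions, not taken from the study.

```python
import math


def encode_cyclical(value, period):
    """Map a cyclical feature value to its (sine, cosine) components.

    Assumes the common encoding sin(2*pi*x/P), cos(2*pi*x/P), where
    `period` is the length of one full cycle (e.g. 24 for hour of day).
    """
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def cyclical_distance(a, b, period):
    """Euclidean distance between two feature values after the encoding."""
    sin_a, cos_a = encode_cyclical(a, period)
    sin_b, cos_b = encode_cyclical(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# Hour 23 and hour 0 end up close together, unlike on the raw 0-23 scale,
# while hour 12 and hour 0 are as far apart as possible.
d_adjacent = cyclical_distance(23, 0, 24)
d_opposite = cyclical_distance(12, 0, 24)
```

With this encoding, every pair of consecutive hours is equally far apart, including the pair that crosses midnight.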
Season, solar and lunar months, day of solar and lunar months, day of the week, and time of day are used in cyclical form in this study. The features introduced in Table 1 are used to train the mid-term models, which have an unlimited prediction time horizon. In short-term models, in addition to the features in Table 1, the traffic flow observed 3 to 8 hours earlier is also used as a predictor, and these models can predict only one and two hours into the future. Figure 2 depicts the traffic flow histogram for the ensemble learning train dataset (train 1 + train 2) and the test dataset. Table 2 presents a statistical summary of traffic flow.
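The lagged predictors of the short-term models could be built along the following lines. This is only a sketch under a stated assumption: the predictors for hour t are taken to be the flows observed at hours t-3 through t-8, and the study may align the lags with the prediction hour differently. The function name and the flow values are made up for illustration.

```python
def make_lagged_rows(flow, lags=range(3, 9)):
    """Build (lag_features, target) pairs from an hourly flow series.

    Assumption: the predictors for hour t are the flows observed at
    t-3, ..., t-8. Hours without a full lag history are skipped.
    """
    lags = list(lags)
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(flow)):
        features = [flow[t - lag] for lag in lags]
        rows.append((features, flow[t]))
    return rows


# Made-up hourly flows (veh/hr), for illustration only.
hourly_flow = [100, 120, 150, 170, 160, 140, 130, 125, 110, 105]
rows = make_lagged_rows(hourly_flow)
```

In practice these lag columns would be appended to the calendar and weather features of the same hour.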
In the current study, cyclical features are used to prepare and select predictive features. Several other input selection methods exist for this purpose, for example, genetic algorithms, forward or backward feature selection, and recursive feature elimination [28]. The effect of using cyclical features is presented in the rest of this paper.

Methods
This study proposes a stacking and bagging ensemble learning process consisting of three base ML algorithms (KNN, RF, and RNN) alongside XGBoost as a supermodel. The base models were chosen for their accuracy, and the selected base models outperform the other models considered. For example, we tried the LSTM algorithm as a deep learning base model, but its predictions were not accurate enough to include LSTM in the ensemble learning process.
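The two-stage training flow of this process (base models on train 1, supermodel on train 2 together with the base models' predictions) can be sketched as follows. The sketch is generic: any object with `fit` and `predict` methods can play the role of a base model or supermodel. `MeanModel` is a toy stand-in used only to make the example runnable; it is not KNN, RF, RNN, or XGBoost, and all names here are illustrative assumptions.

```python
class MeanModel:
    """Toy stand-in model: always predicts the mean of its training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def stack_predictions(base_models, X):
    """Return one row per observation and one column per base model."""
    columns = [model.predict(X) for model in base_models]
    return [list(row) for row in zip(*columns)]


def fit_stacking(base_models, meta_model, X1, y1, X2, y2):
    """Train base models on train 1, then the supermodel on train 2."""
    for model in base_models:
        model.fit(X1, y1)
    # The base models' predictions on train 2 are the supermodel's inputs.
    meta_model.fit(stack_predictions(base_models, X2), y2)
    return base_models, meta_model


def predict_stacking(base_models, meta_model, X):
    """Final prediction: supermodel applied to the base models' predictions."""
    return meta_model.predict(stack_predictions(base_models, X))


# Toy data standing in for train 1, train 2, and test.
base, meta = fit_stacking(
    [MeanModel(), MeanModel()], MeanModel(),
    X1=[[0], [1]], y1=[10.0, 20.0],
    X2=[[2], [3]], y2=[30.0, 50.0],
)
final = predict_stacking(base, meta, [[4]])
```

Keeping train 2 separate from train 1 means the supermodel learns from base-model predictions on data the base models have not seen during training.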

K-Nearest Neighbors.
The KNN model is an ML method used for both classification and regression problems. Its main objective is to find the labeled observations in the training dataset that have the smallest distance to an unlabeled observation in the test data. The label of the new observation is then assigned by averaging or voting [29]. The four main steps of this approach are as follows:
Step 1: the training dataset is placed in an n-dimensional coordinate system (n is the number of features).
Step 2: the Euclidean distance between each new observation and the training observations is calculated.
Step 3: the k training observations with the smallest distance to the new observation are selected.
Step 4: the average of the labels of these k observations is assigned as the label of the new observation.
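The four steps above can be sketched in a few lines for the regression case. The function name and the toy data are illustrative assumptions; ties in distance are broken by the order of the training data.

```python
import math


def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` as the mean label of its k nearest
    training observations, using the Euclidean distance."""
    # Step 2: distance from the query to every training observation.
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Step 3: keep the k observations with the smallest distance.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    # Step 4: average their labels.
    return sum(y for _, y in nearest) / len(nearest)


# Step 1: toy training data in a 2-dimensional feature space.
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [100.0, 200.0, 300.0, 1000.0]
prediction = knn_predict(train_X, train_y, query=(0.1, 0.1), k=3)
```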

Random Forest.
Similar to the KNN, the RF is an ML model used for regression and classification problems. The RF consists of a large number of decision trees. In this model, the training data are divided among the decision trees, and after training, each tree makes its own prediction. The average of these predictions is the RF's final prediction [31]. The following steps indicate how the algorithm works.
Step 1: random samples are selected from the training dataset.
Step 2: each sample is used to train a decision tree.
Step 3: the prediction of each decision tree model is made for the test data.
Step 4: the average of predictions is selected as the final prediction.
Each decision tree in the RF starts from a root node and branches to further nodes. This paper uses entropy to determine how the dataset is split at each node. Equation (3) presents the entropy formula [31].
$$E = -\sum_{i=1}^{c} p_i \log_2 p_i \qquad (3)$$
where $p_i$ is the relative frequency of label $i$, $i$ is the index of labels, and $c$ is the total number of labels.
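A short sketch of the entropy computation follows, assuming base-2 logarithms. The size-weighted entropy of the child nodes is included as one usual way to score a candidate split; both that choice and the function names are assumptions made for illustration, not details given in the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def split_entropy(left, right):
    """Size-weighted entropy of the two child nodes of a candidate split."""
    total = len(left) + len(right)
    return (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)


mixed = entropy(["low", "low", "high", "high"])          # evenly mixed node
after_split = split_entropy(["low", "low"], ["high", "high"])  # pure children
```

A split is more useful the further it lowers the weighted entropy below that of the parent node.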

Recurrent Neural Network.
RNN is a kind of deep neural network. Since successive observations depend on each other, using the RNN can help improve the accuracy of predictions. These ANNs are particularly useful for time-series analysis, where each neuron can maintain internal information of the connected nodes. This ability to maintain an internal state, or memory, helps the network discover the link between different successive observations [32].
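Let $X = (X_1, X_2, \ldots, X_T)$ denote the input time series with $D$ variables and length $T$, where $X_t$ is the $t$-th observation. The memory cell $c_t$ contains information at time step $t$ and is controlled by three gates. These gates control whether to forget the current cell value (forget gate $f_t$), to read the input (input gate $i_t$), and to output the new cell value (output gate $o_t$) [33]. In addition, $\tilde{c}_t$ is the input modulation gate. The gates, the cell update, and the output are computed by the following formulas [34]:
$$\begin{aligned}
f_t &= \sigma\left(W_{xf} X_t + W_{hf} h_{t-1}\right) \\
i_t &= \sigma\left(W_{xi} X_t + W_{hi} h_{t-1}\right) \\
o_t &= \sigma\left(W_{xo} X_t + W_{ho} h_{t-1}\right) \\
\tilde{c}_t &= \phi\left(W_{xc} X_t + W_{hc} h_{t-1}\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \phi\left(c_t\right)
\end{aligned} \qquad (4)$$
where $\odot$ denotes the element-wise product, the $W$s are the network parameter matrices, $h_t$ is the hidden state, $\phi$ is the hyperbolic tangent function, and $\sigma$ denotes the standard logistic sigmoid function.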
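To make the roles of the gates concrete, the following sketch performs one time step for a scalar input and a scalar hidden state. It assumes the standard LSTM update without bias terms; the dictionary layout of the weights and the gate names are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM time step for scalar input and scalar hidden state.

    `W` maps each gate name to a pair (weight on x_t, weight on h_prev).
    Bias terms are omitted; this layout is an illustrative assumption.
    """
    def pre_activation(gate):
        w_x, w_h = W[gate]
        return w_x * x_t + w_h * h_prev

    f_t = sigmoid(pre_activation("forget"))        # forget gate
    i_t = sigmoid(pre_activation("input"))         # input gate
    o_t = sigmoid(pre_activation("output"))        # output gate
    g_t = math.tanh(pre_activation("modulation"))  # input modulation gate
    c_t = f_t * c_prev + i_t * g_t                 # cell update
    h_t = o_t * math.tanh(c_t)                     # hidden state
    return h_t, c_t


# With all weights at zero, every sigmoid gate equals 0.5 and the
# modulation gate equals 0, so the cell simply halves its previous value.
zero_weights = {name: (0.0, 0.0) for name in ("forget", "input", "output", "modulation")}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, W=zero_weights)
```

Running the step repeatedly over $X_1, \ldots, X_T$, each time passing on the returned hidden state and cell value, gives the recurrent behavior described above.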

Bagged Averaging.
After training KNN, RF, and RNN, the predicted traffic flow is given to the ensemble learning algorithms to determine the final prediction. Bagged averaging is one of these algorithms and can be weighted or simple. In the weighted method, each model's prediction weight is inversely related to the model's root mean square error (RMSE). Equation (5) shows how the weights in bagged averaging are calculated.
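$$W_i = \frac{1/\mathrm{RMSE}_i}{\sum_{j=1}^{I} 1/\mathrm{RMSE}_j} \qquad (5)$$
where $W_i$ is the prediction weight of model $i$, $I$ is the total number of models, and $\mathrm{RMSE}_i$ is the root mean square error of model $i$.

Table 2: Statistical summary of hourly traffic flow (veh/hr).
| Direction | Dataset | Number of observations | Mean | Standard deviation | Minimum | Maximum |
| Tehran to Saveh | Train | 19908 | 2750 | 1500 | 8 | 7069 |
| Tehran to Saveh | Test | 3658 | 2578 | 1447 | 130 | 6058 |
| Saveh to Tehran | Train | 19901 | 2907 | 1455 | 19 | 6880 |
| Saveh to Tehran | Test | 3686 | 3270 | 1439 | 29 | 6079 |

Stacking.
XGBoost applies ML algorithms under the gradient boosting paradigm. It offers parallel tree boosting that easily and reliably addresses several data science issues [35]. The boosting tree is defined as follows:
$$\hat{y}_i = \sum_{k=1}^{n} f_k(x_i), \quad f_k \in F \qquad (6)$$
where $F$ is the set of decision trees, $\hat{y}_i$ is the model prediction, $x_i$ is the set of predictor features, and $n$ is the number of trees. The loss function of the model is as follows:
$$\mathrm{Obj} = \sum_{i} L\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{n} \Omega(f_k) \qquad (7)$$
where $L$ is a differentiable function that measures the difference between the predicted and actual values, and the first sum runs over the observations. Popular loss functions include the square, logarithmic, and exponential functions. $\Omega$ is used to regulate the complexity of the model.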
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \qquad (8)$$
where $\gamma$ and $\lambda$ are penalty coefficients, $T$ is the number of leaves of the tree, and $w$ is the vector of leaf weights. XGBoost aims to minimize the differentiable function. By rewriting the differentiable function and applying a Taylor expansion, the formula is as follows:
$$\mathrm{Obj}^{(t)} \simeq \sum_{i} \left[ L\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \qquad (9)$$
where $f_t$ is the tree added at iteration $t$, and $g_i$ and $h_i$ are the first and second derivatives of the loss function, respectively [36].

Results and Discussion
The tuned parameters are the number of neighbors in KNN, the number of trees (NT) and the number of variables randomly sampled as candidates at each split (NV) in RF, and the number of hidden layers (N) in the neural network model. To find the optimal values of these parameters, the models are trained with different values assigned to them. Accuracy on the test dataset is evaluated based on the RMSE. Equation (10) shows how the RMSE is calculated. Figures 3 and 4 show the sensitivity analysis performed to find the optimal parameter values of the short-term and mid-term models.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left(\hat{y}_t - y_t\right)^2} \qquad (10)$$
where $\hat{y}_t$ and $y_t$ are the predicted and actual values, and $n$ is the number of observations. Table 3 shows the optimal values selected for the final models. To assess the accuracy of the final models' predictions on the test dataset, the mean absolute percentage error (MAPE) is used in addition to the RMSE. Equation (11) shows how the MAPE is calculated. Table 4 presents the values of the error metrics obtained for the final models.
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \qquad (11)$$
The results in Table 4 show that for both short- and mid-term models and both directions of the Saveh-Tehran road, the lowest prediction error is achieved by the RF, and the KNN has the highest prediction error. The MAPE of the mid-term RF for the Saveh to Tehran and Tehran to Saveh directions is 21.23 and 27.14, respectively. In the short-term model, the MAPE of the RF for the Saveh to Tehran and Tehran to Saveh directions is 15.25 and 16.61. Figure 5 shows the difference between the RMSE of the short-term and mid-term models. The accuracy of the short-term models is higher than that of the mid-term models, and using previously observed traffic flows has increased the accuracy of the prediction. The limited time horizon of these models is considered their weakness.
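The two error metrics used above can be computed as in the following sketch, which follows the usual definitions of the RMSE and the MAPE. The function names and the sample values are illustrative; note that the MAPE is undefined for hours in which the observed flow is zero.

```python
import math


def rmse(actual, predicted):
    """Root mean square error, in the units of the data (veh/hr here)."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined (division by zero) if any actual value is zero.
    """
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


# Made-up observed and predicted flows, for illustration only.
observed = [100.0, 200.0]
forecast = [110.0, 180.0]
error_rmse = rmse(observed, forecast)
error_mape = mape(observed, forecast)
```

Because the MAPE divides by the observed value, errors in low-flow hours weigh more heavily in it than in the RMSE.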

Bagging and Stacking Ensemble Models Results.
After receiving the base models' predictions, the ensemble learning process is performed using the bagging and stacking methods, and the final results are obtained. In addition to the bagging and stacking methods, the maximum and minimum of the predicted traffic flow values are also evaluated as the final prediction. Like the base models, the ensemble learning process is examined for both short-term and mid-term predictions, with the short-term and mid-term outputs of the base models as inputs. Table 5 shows the results obtained by ensemble learning and by the RF as the most accurate base model. Table 5 indicates that for both the short-term and mid-term models and both directions of the Tehran-Saveh road, based on the RMSE and the MAPE, stacking ensemble learning with XGBoost decreases the prediction error and has the lowest prediction error. Based on the RMSE, in the mid-term model, predictions using the maximum and minimum of the predicted traffic flow values have higher and lower accuracy than the RF, respectively. This suggests that the base models underestimate traffic flow. Bagged averaging increases the prediction accuracy only for the Tehran to Saveh direction. In the short-term models, only the XGBoost model reduces the traffic volume prediction error, and the other methods do not improve the accuracy of traffic flow prediction.
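The simple combination rules compared here (simple and weighted bagged averaging, maximum, and minimum) can be sketched as follows. The weighted rule assumes weights proportional to 1/RMSE and normalized to sum to one, which is one reading of "inversely related to the RMSE"; the function names and all numbers are made up for illustration.

```python
def inverse_rmse_weights(rmses):
    """Weights proportional to 1/RMSE, normalized to sum to one (assumed form)."""
    inverse = [1.0 / r for r in rmses]
    total = sum(inverse)
    return [value / total for value in inverse]


def combine(predictions, method="mean", weights=None):
    """Combine base-model predictions hour by hour.

    `predictions` holds one list of hourly predictions per base model.
    """
    per_hour = list(zip(*predictions))
    if method == "mean":
        return [sum(row) / len(row) for row in per_hour]
    if method == "weighted":
        return [sum(w * p for w, p in zip(weights, row)) for row in per_hour]
    if method == "max":
        return [max(row) for row in per_hour]
    if method == "min":
        return [min(row) for row in per_hour]
    raise ValueError(f"unknown method: {method}")


# Two hypothetical base models predicting two hours (veh/hr).
base_predictions = [[300.0, 600.0], [600.0, 300.0]]
weights = inverse_rmse_weights([500.0, 1000.0])
weighted = combine(base_predictions, "weighted", weights)
```

The model with half the RMSE receives twice the weight, so the weighted result is pulled toward its predictions.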
Another critical point in traffic flow prediction is predicting the maximum and minimum traffic flow values, which indicate rare traffic events. Generally, reporting the hours with high and low traffic flow is more valuable to users and system operators than reporting normal traffic flows. To determine the models' performance in predicting rare events, the RMSE is calculated separately for the first and fourth quartiles of the observed traffic flow and presented in Figures 6 and 7. Figures 6 and 7 show that the lowest RMSE for the first and fourth quartiles is achieved by the XGBoost and Max methods, respectively. Notably, the XGBoost has a lower prediction error than the Min method in predicting the first quartile. The XGBoost predicts both the first and fourth quartiles more accurately than the base models, whereas the Max method predicts only the fourth quartile more accurately than the base models. Among the base models, the RF predicts the traffic flow in both quartiles more accurately than the other two base models.
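A quartile-wise evaluation of this kind can be sketched as follows. The sketch assumes that the first and fourth quartiles are the observations at or below the 25th percentile and at or above the 75th percentile of the observed flow; the exact cut-point convention used in the study is not stated, and the data are made up.

```python
import math
import statistics


def quartile_rmse(actual, predicted):
    """RMSE restricted to the first and fourth quartiles of the observed flow.

    Returns (rmse_first_quartile, rmse_fourth_quartile). Cut points come
    from statistics.quantiles with its default method (an assumption).
    """
    q1, _, q3 = statistics.quantiles(actual, n=4)

    def rmse(pairs):
        return math.sqrt(sum((p - a) ** 2 for a, p in pairs) / len(pairs))

    pairs = list(zip(actual, predicted))
    low = [(a, p) for a, p in pairs if a <= q1]
    high = [(a, p) for a, p in pairs if a >= q3]
    return rmse(low), rmse(high)


# Made-up observed and predicted flows, for illustration only.
observed = [10, 20, 30, 40, 50, 60, 70, 80]
forecast = [12, 18, 30, 40, 50, 60, 75, 70]
rmse_low, rmse_high = quartile_rmse(observed, forecast)
```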

Conclusion
One application of intelligent transportation systems is predicting the future state of traffic so that travelers can better plan whether to travel, when to depart, and which route to take. The transportation network operator will also be better prepared to deal with traffic congestion. In this study, traffic flow, as a parameter that describes the state of traffic, is predicted for both directions of a rural road in Iran using three ML-based base methods: KNN, RF, and RNN. Then, the final traffic flow is predicted using the bagging and stacking methods, the most important of which is the XGBoost. In the first step, preprocessing is performed by adding predictor features related to cyclical temporal features, holidays, types of holidays, and weather. In the second step, to find the optimal parameter values of the short-term and mid-term models, the models are trained with different parameter values, and the optimal values are selected based on prediction accuracy on the test data. After training the base models with the optimal parameter values, the initial predictions are evaluated and compared. In the next step, the ensemble learning process is applied to the base models' predictions to make the final prediction, which is expected to be more accurate than the base models' predictions. The results show that the highest prediction accuracy for both short-term and mid-term models is achieved using the XGBoost model in the stacking learning process.
This model predicts the first and fourth quartiles of the observed traffic flow more accurately than the base models. In general, the prediction error of the short-term models is lower than that of the mid-term models. However, the short-term models can only predict the traffic flow one and two hours into the future.
Finally, the traffic flow predicted by the short-term and mid-term models can be communicated to passengers via advanced traveler information systems. To combine the accuracy of the short-term models with the time horizon of the mid-term models, the next one and two hours can be predicted by the short-term models, and later hours by the mid-term models.

Data Availability
The traffic data used in this study are available from the corresponding author upon reasonable request.