Mass Rapid Transit System Passenger Traffic Forecast Using a Re-Sample Recurrent Neural Network



Introduction
Traffic congestion is a serious problem encountered in the implementation of Mass Rapid Transit (MRT) systems. Delays can seriously affect people's quality of life and urban development, while overcrowding remains a top source of customer dissatisfaction. As the number of passengers choosing MRT as their major mode of transportation continues to increase, overcrowding during peak hours has become a common occurrence, arising whenever ridership exceeds the capacity of the train. For this reason, it is very important to distribute congestion information to the public in a timely manner, so that passengers can rearrange their plans or departure times to avoid congestion. Subway operators, for their part, can also explore effective approaches to relieving congestion. They can create more standing room by removing seats in their existing fleets and can add train cars so that longer trains run during peak periods. They can also consider adding enhanced traction power and train control systems, as well as rail car storage space.
The increasing availability of data from system operators has created unique opportunities to predict congestion [1]. Predicting the crowdedness of a Rapid Transit System benefits both the train operators and the public. Most existing work on predicting traffic has focused on predicting crowd flows [2][3][4]. For instance, Sun et al. [2] selected nonparametric regression as a prediction method to forecast pedestrian congestion. Zhang et al. [3] developed a real-time system based on Microsoft Azure Cloud to monitor and forecast crowd flow. Polson et al. [4] developed a deep learning model to capture nonlinear, spatiotemporal effects and showed how deep learning can provide precise, short-term traffic flow predictions. Other works have focused on predicting traffic congestion. Min et al. [5] proposed an adaptive, data-driven, real-time congestion prediction method that employs adaptive K-means clustering to identify different traffic patterns. Ma et al. [6] extended deep learning theory to large-scale transportation networks. They utilized a deep Restricted Boltzmann Machine and Recurrent Neural Network architecture to model and predict the evolution of traffic congestion based on Global Positioning System (GPS) data from taxis. Deep Neural Networks can capture the nonlinearity of history data but are limited by data imbalance, which is a common phenomenon in the real world. For data-driven methods in particular, the accuracy of traffic pattern recognition is low due to data imbalance.
In our previous work [7], we presented re-sample Deep Neural Networks for congestion prediction. Although that work reported good forecasting results, it left room for improvement in prediction. In this work, we employ deep learning Recurrent Neural Networks in our model to predict the passenger traffic of MRT. Passenger traffic is studied and described by several patterns according to the level of congestion. As previously mentioned, the history data is imbalanced: the majority of the data reflects normal conditions with little to no congestion, and only a very small fraction of the data shows severe congestion, occurring during peak times. If a deep learning model is trained directly on such heavily imbalanced data, it tends to overfit to the large amount of normal data and fails to generalize to unseen severe congestion data. Thus, it cannot predict the true congestion level in a timely manner. In order to solve this problem, we propose a re-sample Recurrent Neural Network model (RRNN) to effectively predict congestion levels. We use the traffic data from the Rapid Transit System of San Francisco in the US as a case study. The experiments show that the model's predictions were able to achieve an accuracy of more than 90% within 20 minutes.
The remainder of this paper is organized as follows: related works and a literature review are given and discussed in Section 2. The definition of congestion level and a description of the data structure are introduced in Section 3. Section 4 discusses the modular re-sample Recurrent Neural Network used to predict passenger traffic patterns. We evaluate our results in Section 5 and summarize our research in relation to prior works in Section 6.

Literature Review
The Rapid Transit, or urban metro, System has been an extremely important component of urban infrastructure [8][9][10] and plays an essential role in the development of cities. According to statistics and prior studies, about 213 cities around the world have deployed MRT Systems since the first line opened in London in 1863 [11]. On the one hand, MRT provides a certain level of convenience for people, but on the other hand, the crowdedness of MRT results in delays and heavily affects people's day-to-day lives. This crowdedness is most common during peak hours, when people are heading to or from work. This phenomenon has sparked a large amount of research focused on overcrowding [12,13]. In addition, many research efforts have been made in predicting passenger flow [14,15]. For example, [15] proposes a multipattern deep fusion technique constructed by fusing deep belief networks (DBNs). For each pattern, a DBN is developed as a deep representation of passenger flow, which makes the overall model much more complex. Some works focus on short-term metro passenger flow using parametric [16] and nonparametric techniques [17]. Generally, the application of parametric methods is limited by the assumption of a linear relationship among time-lagged variables. Unlike parametric models, nonparametric ones construct a nonlinear relationship between input and output without any prior knowledge. Some hybrid models integrating both parametric and nonparametric methods are designed to improve prediction accuracy [18,19]. These models focus on short-term passenger flow prediction.
Our work mainly focuses on studying passenger flow patterns, especially for congestion level prediction. Timely and accurate recognition of congestion can lower the negative impacts on an MRT (Mass Rapid Transit) System [20]. In order to describe traffic conditions more accurately, many scholars apply networks and fuzzy theories, including the FCM algorithm, ANN algorithm, and DS-ANN algorithm [21].
Machine learning has also been widely applied and has shown considerable success in traffic pattern recognition [22]. Chen [23] used shallow neural networks for traffic applications, applying a dynamic neural network with a single hidden layer that used a Gaussian basis function as its activation unit. Zheng et al. [24] deployed multiple single-hidden-layer networks to predict traffic within the next fifteen-minute time span. In their work, they used both a tanh activation and a Gaussian radial basis function. Cetiner [25] proposed a neural network model using the day of the week and the time of day as inputs. Lv et al. [17] demonstrated that deep learning can be effectively used for traffic forecasting. A stacked auto-encoder was applied to recognize spatiotemporal patterns in the traffic data. Zhao et al. [26] proposed an LSTM network to forecast traffic by considering spatiotemporal correlations in traffic systems via a two-dimensional network.
Although some efforts have been made to model the spatiotemporal characteristics of traffic flows in prior studies [4,26], there have been no efforts to predict the passenger traffic patterns of MRT in a timely manner. Deep learning allows us to design flexible network structures with multiple layers, and models built on deep learning therefore tend to perform better than traditional Neural Network Models. In recent years, deep learning has become a very popular technology, especially in image recognition and natural language processing [27]. A Recurrent Neural Network (RNN) is a type of deep learning module that has been applied to a wide range of applications, including speech recognition [28] and text generation [29]. Some prior studies have also tried to use an RNN model for traffic accident prediction [30].
This work develops a re-sample RNN model to predict the passenger traffic patterns of MRT along with the spatiotemporal structure of passenger trip data. Generally, the training data is imbalanced because congestion occurs only during the peak period. During the other 90% of the time, seats are available and there is no congestion. If this kind of dataset is used directly as input to train the model, the model is highly likely to overfit to major patterns such as no congestion and to fail to represent minor patterns such as congestion, which are typically what researchers are interested in. In order to solve this problem, we propose a re-sample Recurrent Neural Network, referred to as RRNN.
Event indicators are identified from a list of special events, such as sporting events or entertainment events, recorded by the operator of the MRT. The schedule and location of events are assumed to be known in advance. Each period is defined to be 20 minutes for the modeling process. The operating hours are from 4:00 am to midnight on weekdays, from 6:00 am to midnight on Saturday, and from 8:00 am to midnight on Sunday. Every week, the whole system has 7 days × 60 time segments × 92 links, and the number of rides in every segment at every station is recorded by the operator. The output is the predicted level: severe, moderate, light, or normal. The training data includes 70% of the entire dataset, which translates to around 9 months of history data. The 2 remaining months of data are used as testing data. In total, the dataset consists of 833520 data samples. The output and input variable structure is listed in the Appendix.

Description of the
According to common understanding, passenger traffic patterns are defined according to the congestion level, which is the average number of passengers per train car. In this work, given that 40 seats are available per car on MRT, the passenger traffic patterns are defined as 4 different patterns: congested, moderate, light, and normal, which are separated by three respective cutoff points: 120, 80, and 40. As such, if the number of passengers per car is less than 40, the level is normal; between 40 and 80 it is light; between 80 and 120 it is moderate; and over 120 it is congested.
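The cutoff points above translate directly into a small labeling function. The following is a minimal sketch; the function name `congestion_level` is hypothetical, and the handling of the exact boundary values (40, 80, and 120 are assigned to the higher level) is an assumption, since the text does not state which side the boundaries fall on.

```python
def congestion_level(passengers_per_car):
    """Map the average number of passengers per car to one of the 4 patterns.

    Cutoffs 40, 80, 120 come from the text; treating each cutoff as the
    start of the higher level is an assumption made for this sketch.
    """
    if passengers_per_car < 40:
        return "normal"
    if passengers_per_car < 80:
        return "light"
    if passengers_per_car < 120:
        return "moderate"
    return "congested"


levels = [congestion_level(n) for n in (25, 60, 95, 130)]
print(levels)  # ['normal', 'light', 'moderate', 'congested']
```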

Module
Recurrent Neural Networks (RNNs) are neural networks with internal connections between hidden neurons and specifically designed feedback connections. The premise of this design is that human beings do not start their thought process from scratch. The human mind has the ability to associate previous information with current events, a phenomenon called persistence of memory. Traditional neural networks, however, are unable to replicate this and end up ignoring previous information. Using a movie scene classifier as an example, a traditional neural network cannot utilize any previous scenes to predict the current one.
In contrast with traditional neural networks, RNNs are networks with a loop that allows the network to retain prior information. An RNN introduces a transition weight W to transfer information between time slots. RNNs process the sequential input one element at a time and update a state vector that contains information about the past elements of the sequence. A neural network that takes a value X(t) as input and then outputs a value Y(t) is shown in Figure 1 [31].
As shown in Figure 1, the first layer of the neural network is characterized by the function $Z(t) = X(t) \times W_{in}$, where $X(t)$ is the input sequence and $Z(t)$ is the state of the hidden neuron. The output is $Y(t) = Z(t) \times W_{out}$. $Z(t)$ is updated over time, where $t = 1, 2, 3, \ldots$ indexes the time steps. A more explicit architecture can be found in Figure 2. For the hidden layer, the temporally shared weight $W_2$ is learned. At the same time, the input layer weight $W_1$ and the output layer weight $W_3$ are also learned [31].
After preprocessing, the data has 45 attributes and covers 11 months (2017.1.1-2017.11.30). The data are split into two parts. The first part, about 9 months' worth of data (2017.1.1-2017.9.30), is used as training data, while the 2 remaining months' data is used as test data. The total number of training data samples is 625140. The components of the training data and the proportion of each component are shown in Table 1 and Figure 3.
As shown in Table 1 and Figure 3, the proportions of both the congestion level data and the moderate level data are very small.
In order to speed up the training, minibatch gradient descent is used to split the training dataset into smaller batches that are used to calculate the model error and update the model coefficients. Minibatch gradient descent seeks a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. Thus, during training, every input sequence fed into the RNN is a minibatch of stochastic samples from the training dataset. When the training dataset is imbalanced, as ours is, it is very difficult for RNNs to learn the minor data patterns accurately. In order to solve this problem, we propose a re-sample method. First, we split our training data into 4 subdatasets with respect to their class labels. During training, every minibatch of input data is re-sampled from these 4 subdatasets according to the proportion assigned to every class, as shown in Figure 4.
As shown in Figure 4, at every training step the input data are re-sampled from the different subdatasets, where $w_i$ ($i = 1, 2, 3, 4$) is the re-sample weight of the $i$th subdataset. The minibatch is re-sampled using the following formula:

$$\text{minibatch} = \sum_{i=1}^{4} w_i \, B \, x_i$$

where $B$ is a predefined minibatch number, $w_i$ is the weight of the re-sample from the $i$th subdataset, and $x_i$ is the instance from the $i$th subdataset. In other words, $w_i B$ instances are drawn from the $i$th subdataset for every minibatch.
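The weighted re-sampling step can be sketched as follows. This is only an illustration under stated assumptions: the function name `resample_minibatch`, the toy one-dimensional instances, and the subdataset sizes are hypothetical, and drawing with replacement is an assumption (the text does not say how a small class supplies its share of each minibatch).

```python
import random


def resample_minibatch(subdatasets, weights, batch_size, rng):
    """Build one minibatch with round(w_i * batch_size) instances per class.

    subdatasets: dict mapping class label -> list of instances
    weights:     dict mapping class label -> re-sample weight (sums to 1)
    """
    batch = []
    for label, weight in weights.items():
        n = round(weight * batch_size)
        # Assumption: sample with replacement, so a rare class can still
        # contribute more draws than it has distinct instances.
        drawn = rng.choices(subdatasets[label], k=n)
        batch.extend((x, label) for x in drawn)
    rng.shuffle(batch)  # avoid feeding the classes in blocks
    return batch


rng = random.Random(0)
# Toy subdatasets whose sizes mimic a heavily imbalanced training set.
subdatasets = {
    "congested": [[rng.random()] for _ in range(20)],
    "moderate": [[rng.random()] for _ in range(23)],
    "light": [[rng.random()] for _ in range(97)],
    "normal": [[rng.random()] for _ in range(860)],
}
weights = {"congested": 0.15, "moderate": 0.15, "light": 0.2, "normal": 0.5}

batch = resample_minibatch(subdatasets, weights, batch_size=1000, rng=rng)
counts = {label: sum(1 for _, l in batch if l == label) for label in weights}
print(counts)
```

With these weights, the rare congested class makes up 15% of every minibatch even though it is only 2% of the toy data, which is the effect the re-sample method is meant to achieve.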
The whole architecture of our RRNN (re-sample Recurrent Neural Network) is shown in Figure 5.
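The calculations of the nodes in Figure 5 are done via the following equations:

$$Z(t) = \phi\big(X(t)\, W_{in} + Z(t-1)\, W_h + b_{in}\big)$$

$$\hat{y}(t) = \sigma\big(Z(t)\, W_{out} + b_{out}\big)$$

$W_{in}$ and $W_{out}$ are the weights of the input and output layers, respectively, and $W_h$ is the weight between hidden neurons. $\phi$ is the activation function, which is tanh in this research, and $b_{in}$ and $b_{out}$ are the biases of the input and output layers, respectively. $\sigma$ is also an activation function, which is Softmax in this instance. The above computation is the process of forward propagation. We use BPTT (Back Propagation Through Time) during back propagation, which is illustrated by the following equations.

Because there is a loss at every time slot, let us assume that the loss $L$ is

$$L = \sum_{t} L(t)$$

where $L(t)$ is the cross-entropy loss at step $t$. The gradients of $W_{out}$ and $b_{out}$ are calculated as in formulas (6) and (7):

$$\frac{\partial L}{\partial W_{out}} = \sum_{t} Z(t)^{T} \big(\hat{y}(t) - y\big) \tag{6}$$

$$\frac{\partial L}{\partial b_{out}} = \sum_{t} \big(\hat{y}(t) - y\big) \tag{7}$$

where $\hat{y}(t)$ is the output at step $t$ and $y$ is the true label.

Let us define the gradient of the hidden state at time $t$ as the following:

$$\delta(t) = \frac{\partial L}{\partial Z(t)} = \big(\hat{y}(t) - y\big)\, W_{out}^{T} + \Big(\delta(t+1) \odot \big(1 - Z(t+1) \odot Z(t+1)\big)\Big)\, W_h^{T}$$

where $\odot$ is the element-wise product and only the first term is present at the last time step. Then the gradients of the input weight $W_{in}$, input bias $b_{in}$, and hidden weight $W_h$ are calculated as follows:

$$\frac{\partial L}{\partial W_{in}} = \sum_{t} X(t)^{T} \Big(\delta(t) \odot \big(1 - Z(t) \odot Z(t)\big)\Big)$$

$$\frac{\partial L}{\partial b_{in}} = \sum_{t} \delta(t) \odot \big(1 - Z(t) \odot Z(t)\big)$$

$$\frac{\partial L}{\partial W_h} = \sum_{t} Z(t-1)^{T} \Big(\delta(t) \odot \big(1 - Z(t) \odot Z(t)\big)\Big)$$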
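As a rough illustration of the forward propagation described in this section (tanh hidden activation, Softmax output, weights shared across time steps), the following pure-Python sketch runs a tiny RNN over a 5-step sequence. The function names, the toy dimensions (2 inputs, 3 hidden units, 4 classes), the weight values, and the zero initial hidden state are all assumptions made for the example; it is not the trained model from this study.

```python
import math


def row_times_matrix(x, W):
    """Row vector x (length n) times matrix W (n x m) -> row vector (length m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def softmax(o):
    """Numerically stable Softmax over a list of scores."""
    m = max(o)
    e = [math.exp(v - m) for v in o]
    s = sum(e)
    return [v / s for v in e]


def rnn_forward(xs, W_in, W_h, W_out, b_in, b_out):
    """Return the hidden state and the class probabilities at every time step."""
    z = [0.0] * len(b_in)  # assumption: the hidden state starts at zero
    states, outputs = [], []
    for x in xs:
        # Hidden state: tanh(X(t) W_in + Z(t-1) W_h + b_in)
        a = [p + q + b for p, q, b in
             zip(row_times_matrix(x, W_in), row_times_matrix(z, W_h), b_in)]
        z = [math.tanh(v) for v in a]
        # Output: Softmax(Z(t) W_out + b_out)
        o = [p + b for p, b in zip(row_times_matrix(z, W_out), b_out)]
        states.append(z)
        outputs.append(softmax(o))
    return states, outputs


W_in = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W_h = [[0.2, 0.0, -0.1], [0.1, 0.3, 0.0], [-0.2, 0.1, 0.2]]
W_out = [[0.5, -0.3, 0.1, 0.0], [0.2, 0.4, -0.1, 0.3], [-0.4, 0.1, 0.6, 0.2]]
b_in = [0.0, 0.0, 0.0]
b_out = [0.0, 0.0, 0.0, 0.0]
xs = [[1.0, 0.5], [0.8, 0.1], [0.2, 0.9], [0.0, 0.3], [0.6, 0.6]]

states, outputs = rnn_forward(xs, W_in, W_h, W_out, b_in, b_out)
print([round(p, 3) for p in outputs[-1]])  # class probabilities at the last step
```

The same three weight matrices are reused at every step, which is what distinguishes the recurrent forward pass from a stack of independent layers.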

Evaluation
To validate the effectiveness and efficiency of the proposed RRNN model, we perform a case study of San Francisco in the US. The cross-entropy error is minimized as the optimization objective during the RRNN model training procedure. The cross-entropy error indicates the distance between the probability distributions of the network outputs and the target labels. It is defined in the following equation [12]:

$$H(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i$$

where $\hat{y}_i$ is the predicted probability of class $i$, and $y_i$ is the true probability for that class. A confusion matrix is a specific table layout that visualizes the performance of the model. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in a true class [31].
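To make the cross-entropy error concrete, the short sketch below evaluates it for a one-hot target and two hypothetical predictions. The function name `cross_entropy`, the clipping constant `eps`, and the example probabilities are assumptions chosen for illustration only.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a true distribution and a predicted one.

    eps clips predicted probabilities away from zero to keep log() finite.
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y = [0, 0, 0, 1]                 # one-hot target: the fourth class is true
good = [0.05, 0.05, 0.10, 0.80]  # most probability on the true class
bad = [0.70, 0.10, 0.10, 0.10]   # most probability on a wrong class

loss_good = cross_entropy(y, good)
loss_bad = cross_entropy(y, bad)
print(round(loss_good, 4), round(loss_bad, 4))
```

The error is small when the predicted probability of the true class is close to 1 and grows as that probability shrinks, which is why minimizing it pushes the Softmax output toward the target label.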
In pattern recognition, precision and recall are indicators of the performance of the model. Precision (positive predictive value) is the fraction of relevant instances among the retrieved instances, and recall (sensitivity) is the fraction of relevant instances retrieved over the total amount of relevant instances:

$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}$$

where $tp$ is the number of true positives, $fp$ is the number of false positives, and $fn$ is the number of false negatives.
The F1 score is a measure of the accuracy of a test. It combines the precision $p$ and the recall $r$ of the test into a single score, which is their harmonic mean:

$$F_1 = \frac{2pr}{p + r}$$

The hyperparameters in our model are determined through experiments.
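The three metrics can be read off a confusion matrix as in the sketch below. It follows the layout stated earlier (rows are predicted classes, columns are true classes); the function name `per_class_scores` and the counts in the matrix are made up for illustration and are not results from this study.

```python
def per_class_scores(cm, labels):
    """Compute (precision, recall, F1) per class.

    cm[i][j] is the number of instances predicted as class i whose true
    class is j (rows = predicted, columns = true).
    """
    scores = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[k]) - tp                 # predicted k, but true class differs
        fn = sum(row[k] for row in cm) - tp  # true class k, but predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


labels = ["congested", "moderate", "light", "normal"]
# Illustrative counts only.
cm = [
    [90, 5, 3, 2],
    [6, 70, 20, 4],
    [3, 20, 72, 5],
    [1, 5, 5, 89],
]
scores = per_class_scores(cm, labels)
for label in labels:
    print(label, [round(v, 3) for v in scores[label]])
```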

Discussion of the Performance of the Model on 4 Congestion Levels.
As mentioned previously in Section 3, the dataset consists of 833520 data points ranging from 01/01/2017 to 11/30/2017. In order to train and test the model, the dataset was split into two parts. The first part, covering 01/01/2017 to 09/30/2017, was used as training data. The remainder was used as test data.
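The chronological split can be sketched as follows. The record layout (a dictionary with `date` and `label` keys) and the function name `split_by_date` are hypothetical; only the cutoff date of 09/30/2017 comes from the text.

```python
from datetime import date

CUTOFF = date(2017, 9, 30)  # last day of the training period


def split_by_date(records, cutoff=CUTOFF):
    """Split records chronologically: up to the cutoff for training, later for testing."""
    train = [r for r in records if r["date"] <= cutoff]
    test = [r for r in records if r["date"] > cutoff]
    return train, test


# Toy records; real samples would also carry the 45 input attributes.
records = [
    {"date": date(2017, 1, 1), "label": "normal"},
    {"date": date(2017, 9, 30), "label": "light"},
    {"date": date(2017, 10, 1), "label": "congested"},
    {"date": date(2017, 11, 30), "label": "normal"},
]
train, test = split_by_date(records)
print(len(train), len(test))  # 2 2
```

Splitting by date rather than at random keeps all test samples strictly later than the training samples, so the evaluation reflects forecasting on unseen future periods.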
The re-sample weights of the model need to be determined with prior tests. For the RRNN input, the time step and element size should be predetermined. In our work, we reshape each 45-dimensional data item into a 9 × 5 matrix, setting the time step to 5, the element size to 9, the number of iteration steps to 12000, and the minibatch number to 1000. In order to determine the hidden layer size, we trained the model with different hidden layer sizes and found that 250 is an appropriate value. The cross-entropy loss and accuracy are shown in Figure 6.
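The reshaping of a 45-dimensional item into 5 time steps of 9 elements can be sketched as below. The function name `to_sequence` is hypothetical, and grouping consecutive attributes into the same time step is an assumption; the text does not specify how the 45 attributes are arranged in the matrix.

```python
def to_sequence(item, time_steps=5, element_size=9):
    """Cut a flat 45-value item into 5 consecutive chunks of 9 values each."""
    if len(item) != time_steps * element_size:
        raise ValueError("item length must equal time_steps * element_size")
    return [item[t * element_size:(t + 1) * element_size]
            for t in range(time_steps)]


item = list(range(45))  # stand-in for one preprocessed data item
sequence = to_sequence(item)
print(len(sequence), len(sequence[0]))  # 5 9
```

Each of the 5 chunks is then fed to the RNN as the input of one time step.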
From Figure 6, the training accuracy is 86.1%, the test accuracy is 86.4%, and the minibatch loss is 0.341623 when the re-sample weights (congestion, moderate, light, normal) are 0.15, 0.15, 0.2, 0.5 and the hidden layer size is 250. From the time axis at the bottom of Figure 6, the model training time is about 5 minutes when the hidden size is 250 and the number of iteration steps is 12000. We also notice that the training accuracy and test accuracy are close, which implies that the model has good generalization.
In order to further validate the performance of our model, we tried different re-sample weights to find the best weights for accurate predictions. The results are shown in Figure 7.
From Figure 7(a), when the re-sample weights (which decide how many instances are sampled from the congested, moderate, light, and normal subdatasets) are set to 0.15, 0.15, 0.2, 0.5 with hidden size 200, the average test accuracy is 83.7%. More specifically, the precision of the congested level is 89% and the precision of the normal level is 92%. The normal level data is represented well by the model because its fraction of the training data is large. For the moderate and light levels, the precision is 70% and 71%, respectively. The average precision and the average F1-score are both 84%. From Figure 7(b), when the weights are set to the average (0.25, 0.25, 0.25, 0.25) with hidden size 150, the model performs well, as the precision of the congested level is 94% and the precision of the normal level is 89%. This shows that the model can effectively predict the congested and normal levels, which are the most important to operators and commuters. From Figure 7(c), when the re-sample weights are set to 0.15, 0.15, 0.2, 0.5 and the hidden size is set to 250, the model also performs very well. The precisions of the congested and normal levels are both higher than 90%, which shows that the model can effectively predict congestion. Also, the minimum of all the F1-scores is 0.76, which shows that the model can predict unseen data effectively. From Figure 7(c), for the model without re-sampling, although the average precision is very high (more than 90%), the recall is very low, which implies that the model failed to gain a good representation of the congestion level. As such, this model cannot be used to predict the congestion of passenger traffic. As shown in Figure 7(d), we also set the weights to the same rates as the components of the training data, which are 86% normal data, 9.7% light level data, 2.3% moderate level data, and 2% congested data. The performance of the model was poor for the moderate level and the F1-score was very low, which shows that it has bad generalization.
The summary of the results of the model with different re-sample weights and hidden sizes is shown in Table 2.
In Table 2, the test accuracy represents the total accuracy on the test data. We notice that when the re-sample weights are set to (0.25, 0.25, 0.25, 0.25) with a hidden size of 200, the model performs better. Both the precision of severe congestion and the precision of the normal level are higher than 90%, and, at the same time, the performance on the moderate and light levels is also higher than that of the other models. Furthermore, the average recall of this model is better than that of the others, which indicates that the model generalizes well to unseen data. The model with the re-sample weights (0.86, 0.097, 0.023, 0.02), listed here in the order normal, light, moderate, congested, has the worst performance, as it cannot predict the crowdedness correctly. These weights are the exact class distribution of the training data, which consists of 86% normal data, 9.7% light data, 2.3% moderate data, and 2% congested data, so this setting is basically equivalent to not re-sampling. The model without re-sampling has a higher total accuracy, but it has bad recall on the congested and moderate levels, which implies that the model has bad generalization for unseen data. This essentially means that re-sampling the imbalanced data performs better than no re-sampling at all for this model. We also notice that when the hidden size increases, the precision increases only slightly. To show this more clearly, a comparison of congestion level prediction among models with different weights is given in Figure 8.
In Figure 8(a), all the models have hidden size 250. The model with re-sample weights (0.15, 0.15, 0.2, 0.5) has the best precision and also performs well in recall and F1-score, and the model with average weights also performs much better than the model without re-sampling. Even the model with re-sample weights (0.02, 0.023, 0.097, 0.86) is better than the model without re-sampling. As introduced previously, these weights are the exact proportions of the four classes in the dataset. The efficiency of the proposed weighted re-sample is also confirmed by Figure 8(b), which compares the F1-scores of models with different weights. According to Figure 8(b), the model without re-sampling performs worst when predicting the congestion level and the light level. The model with re-sample weights (0.2, 0.2, 0.2, 0.4) works better than the others when predicting the congestion level. The model with average re-sample weights also performs well on the congestion level but slightly worse on the normal level. This is because the training data of the normal level is large, so raising its re-sample weight makes the model better at predicting the normal level.

Discussion of the Performance of the Model on 3 Congestion Levels.
Another notable observation from the confusion matrices in Figure 7 is that there was more confusion in predicting the moderate and light levels. For instance, from Figure 7(b), 29/250 instances of moderate are misclassified as light and 53/250 instances of light are misclassified as moderate. We can also see from Table 2 that the model has lower precision and recall on these two levels. We think the distinction between light and moderate is not sufficient for the model to gain a good representation of them. Based on this analysis, we redefine the passenger traffic pattern as 3 patterns (congestion, moderate, and normal). That is, we relabel the congestion levels into 3 classes, with the former moderate and light levels combined into one level named moderate. Thus, the training data consists of 86% normal, 12% moderate, and 2% congestion data. The cross-entropy is shown in Figure 9, while the confusion matrix and corresponding performance are shown in Figure 10.
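The relabeling into 3 classes amounts to a simple label mapping, sketched below. The names `MERGE` and `relabel` are hypothetical; the class shares are the training data proportions quoted in the text, and the sketch checks that merging light (9.7%) and moderate (2.3%) yields the stated 12%.

```python
# Mapping from the 4 original levels to the 3 merged levels.
MERGE = {
    "congestion": "congestion",
    "moderate": "moderate",
    "light": "moderate",   # light is folded into moderate
    "normal": "normal",
}


def relabel(labels):
    """Replace each 4-level label with its 3-level counterpart."""
    return [MERGE[label] for label in labels]


# Class shares (percent) of the training data under the 4-level definition.
shares4 = {"congestion": 2.0, "moderate": 2.3, "light": 9.7, "normal": 86.0}
shares3 = {}
for label, share in shares4.items():
    merged = MERGE[label]
    shares3[merged] = shares3.get(merged, 0.0) + share

print({label: round(share, 1) for label, share in shares3.items()})
print(relabel(["light", "normal", "moderate", "congestion"]))
```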
In Figure 9, the re-sample weights (congestion, moderate, normal) are set to equal shares of the minibatch. The figure shows that the training accuracy and the test accuracy of the model are 91.7% and 91%, respectively. The accuracy and cross-entropy of training and test are very close, which indicates that the model has good robustness. Figure 10 shows that the precision and recall of the congested and normal levels are higher than 90%, while the precision of moderate is 86% and its recall is 91%, which means that the model can effectively predict unseen data. Figure 9 also shows that the training time is about 8 minutes. In order to decide the appropriate re-sample weights and hidden size, we tried different weights; the summary of those experiments can be found in Table 3, which compares the different weights and hidden sizes. When the weights are set to an average of 0.33 congested, 0.33 moderate, and 0.34 normal and the hidden size is 250 (the bold row), the model has better performance. The other two models (the italic row) also reach a similar level of performance. In essence, when the hidden size is increased, the model performance improves only slightly. However, when the hidden size is under 150, the model performance decreases significantly.

Discussion of the Influence of the Parameter Selection in Model.
In this study, the data consists of three categories of variables: demand, supply, and day attributes. After testing numerous variables, 45 were included in the final model, categorized into three types. The first type is demand variables. The second type is supply-related variables. The third type consists of period, day of the week, month, week of the month, day type, raining day, and event indicator variables. In our work, we reshape each 45-dimensional data item into a 9 × 5 matrix, setting the time step to 5 and the element size to 9. We did not deliberately arrange the sequence of the variables. In terms of the RRNN, there are 5 sequence steps of 9 variables each, and the output of these 5 steps is the same. In order to verify the efficiency of the model, we scramble the variables. The results show that the sequence of the time step has little influence on the prediction performance. A comparison is shown in Table 4.
In Table 4, model 1 is the result with the input variables scrambled, and model 2 uses the original input variable sequence discussed in the aforementioned experiments. The two models therefore differ only in the order of the variables across time steps. They share the same hidden size of 200 and the same re-sample weights (0.15, 0.15, 0.2, 0.5), which means that every iteration re-samples 15% Congestion, 15% Moderate, 20% Light, and 50% Normal. Both models have similar prediction performance. So in this scenario the variables are independent, and the selection of elements within each time step does not affect the performance.
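The scrambling experiment can be pictured as applying one fixed permutation of the 45 variables to every data item before it is cut into time steps. The sketch below is an assumption about how such a scramble might be done; the function names and the seed are hypothetical, and the text does not describe the actual procedure used.

```python
import random


def make_permutation(n, seed):
    """Create one fixed random ordering of n variable positions."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm


def scramble(item, perm):
    """Reorder the variables of one data item according to perm."""
    return [item[i] for i in perm]


perm = make_permutation(45, seed=0)
item = list(range(45))  # stand-in for one 45-dimensional data item
scrambled = scramble(item, perm)

# The same permutation must be reused for every training and test item,
# so that each variable always lands in the same position.
other = scramble([v * 10 for v in item], perm)
print(scrambled[:9])
```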
The influence of the re-sample weights of the different subdatasets on the prediction results is shown in Table 2 for the 4-level definition of congestion and in Table 3 for the 3-level definition.
According to Tables 2 and 3, as the re-sample weight of the congestion level (which has the smallest fraction of the samples) increases, its prediction accuracy increases accordingly when the hidden size is fixed, but the performance on other levels, such as the normal level, may decrease. Thus a tradeoff has to be made when designing the re-sample weights. For this project, both customers and planners are more concerned with the congested level than with the normal level. According to Table 2, the model with weights set to (0.25, 0.25, 0.25, 0.25) and hidden size 150 (italic row) achieves 94% precision and 84% recall when forecasting the severe level. According to Table 3, the model with re-sample weights set to (0.33, 0.33, 0.34) and hidden size 250 (bold row) achieves 95% precision and 92% recall when predicting the severe level.

Summary
Congestion in Rapid Transit Systems has presented a major problem in many cities, and therefore methods are needed to mitigate the problem. This work has three contributions. First, Recurrent Neural Networks are studied to build a congestion prediction model. In order to solve the imbalanced training data problem, which is a common phenomenon in the real world, an RRNN (re-sample Recurrent Neural Network) is proposed. Through introducing a measurement model, this study suggests that the level of congestion of an MRT System, combined with spatiotemporal traffic information, can be measured and predicted effectively. Second, the results from this analysis demonstrate that an appropriate passenger traffic pattern of MRT is studied. According to the results of the experiments, the traffic pattern can be measured and predicted by the model. The effectiveness of the proposed model is demonstrated by the case study of the MRT System of San Francisco. From the case study, it

Figure 2: An RNN architecture where all the weights in all the layers must be learned with time.

Figure 7: The confusion matrices with different re-sample weights and different hidden sizes. The confusion matrices are set with rows from congestion to normal from top to bottom.

Figure 8: A comparison of performance among models with different weights.

Figure 9: The accuracy and cross-entropy of the model with the weights of re-sample set to average and the hidden size set to 250.

Figure 10: Confusion matrix and the resulting report with re-sample. Weights set to average and hidden size set to 250.

Data Structure and Passenger Traffic Patterns Definition

Table 1: The data components of the 4 passenger traffic patterns.

Table 2: The performance of models with different re-sample weights (congestion, moderate, light, and normal) and different hidden sizes.

Table 3: The performance of models with different re-sample weights (congestion, moderate, and normal) and different hidden sizes.

Table 4: A comparison between two models with different sequences of time steps.