Passenger Flow Prediction of Subway Transfer Stations Based on Nonparametric Regression Model

Passenger flow is increasing dramatically as subway networks are completed in the big cities of China. As convergence nodes of subway lines, transfer stations must handle more passengers because of the large transfer demand among different lines. Transfer facilities therefore face great pressure, such as pedestrian congestion or other abnormal situations. To avoid pedestrian congestion, or to warn the management before it occurs, it is necessary to predict the transfer passenger flow. Thus, based on nonparametric regression theory, a transfer passenger flow prediction model was proposed. To test and illustrate the prediction model, one month of transfer passenger flow data from XIDAN transfer station was used to calibrate and validate the model. Comparison with a Kalman filter model and a support vector machine regression model shows that the nonparametric regression model has the advantages of high accuracy and strong portability and can predict transfer passenger flow accurately for different intervals.


Introduction
Most cities in China are facing serious traffic problems, such as traffic congestion, pollution, and accidents. It is widely agreed that the subway system is one of the efficient countermeasures to these problems. However, passenger flow is increasing dramatically as subway networks are completed in big cities. As convergence nodes of subway lines, transfer stations must handle more passengers because of the large transfer demand among different lines. Transfer facilities face great traffic pressure because passengers always arrive within a very short time. Consequently, pedestrian congestion or other abnormal situations occur more easily. So, to avoid pedestrian congestion, or to warn the management before it occurs, it is necessary to predict the transfer passenger flow.
Nonparametric regression was selected as the prediction method because the authors have demonstrated, in previous research, its advantages over other approaches such as Kalman filtering [1, 2] and neural networks [3, 4] when sufficient historical data are available.
Nonparametric regression is suitable for uncertain and nonlinear dynamic systems. It is founded on chaotic system theory. Earlier work by Smith [5] found that a simple implementation of the nearest neighbor forecasting approach provided reasonably accurate traffic condition forecasts. In 1987, Yakowitz [6] suggested using the $k$-nearest neighbor method in time series forecasting. The basic approach of nonparametric regression is heavily influenced by its roots in pattern recognition [7]. In essence, the approach locates the state of the system (defined by the independent variables) in a neighborhood of past, similar states. Once this neighborhood has been established, the past cases in the neighborhood are used to estimate the value of the dependent variable.
The nonparametric regression model is well suited to deterministic, nonlinear prediction. It can be used when no prior knowledge is available, provided that there are enough historical data. It searches for the nearest neighbors of the current data in the historical data and uses them to predict the flow in the next interval.
The algorithm assumes that the intrinsic links among all factors are contained in the historical data. So, the information can be obtained directly from the historical data instead of through an approximate model. In other words, nonparametric modeling does not smooth the historical data. Therefore, its predictions are more precise than those of parametric modeling, especially during special events. As a parameter-free, portable, and highly accurate algorithm, nonparametric regression has a relatively small error. What is more, the model is well suited to computer programming and can be applied in complex environments.
The basic idea of nonparametric regression is to form a typical historical database from a comprehensive analysis of a large amount of historical data. The historical database contains a variety of traffic state trends as well as the typical rules. Each type of data in the sample library represents one traffic evolution trend. The latest traffic data collected in real time are matched with the historical data to find the $k$ nearest groups of data. The prediction of the coming traffic state is determined by the trends of these nearest neighbors. Accordingly, the whole algorithm has no fixed parameters or coefficients. It predicts the traffic state in the next period entirely from the evolution trends in the sample database and the values of the real-time data. The historical data series are the typical modes of traffic evolution and play an important role in short-term prediction. Figure 1 shows the principle of nonparametric regression theory.
Because of their good prediction ability, various nonparametric regression models have gradually been used to forecast traffic states. In 1991, Davis and Nihan [8] used nonparametric regression in traffic forecasting. In 1997, Smith and Demetsky [9] used the previous 5 months of data to forecast traffic flow. The state vector definition included historical average flows, and the results were better than those of the historical average and neural network methods. Oswald et al. [10] studied how to speed up the runtime of nonparametric regression, but the accuracy was degraded. Qi and Smith [11] developed a distance metric that can be used effectively with categorical data, which commonly make up traffic event data. The metric was based on the influence of variable values on a measurable objective, for the purpose of selecting the nearest neighbors. When this method was incorporated in a nonparametric regression forecasting model, it was demonstrated to outperform parametric forecasting models significantly.
Tang and Gao [12] enhanced automatic incident detection when forecasting traffic flows, based on improved nonparametric regression algorithms and standard deviation algorithms. Turochy [13] coupled nonparametric regression with a condition monitoring method that characterized the extent to which current traffic conditions deviate from those expected from historical data. The mean absolute percentage errors for two of the four nearest neighbor forecasting procedures were reduced. Kindzerske and Ni [14] introduced a composite approach based on nonparametric regression to predict traffic conditions. The composite approach performed the nearest neighbor search for each loop detector station using only the data in proximity to the detector's position on the roadway. This method treated every detector station individually to minimize the forecast error over the entire roadway. The composite approach can also predict the onset and propagation of traffic shock waves.
Liu et al. [15] proposed a recursive nonparametric regression model and used it to forecast traffic flows and queue evolution at a congested actuated intersection. The model can substitute for traditional simulation software in the lower level of a real-time traffic control system to search for the optimal control variables; the solutions found are then used as inputs to the simulation software in the upper level of that control system to obtain the system performance. Shi and Ren [16] proposed a new method, called the MW model, to improve the accuracy and computing speed of the nonparametric regression model in short-term traffic flow forecasting when the database is too large to search easily. Zhang et al. [17] proposed a rule-based nearest neighbor nonparametric regression model to forecast large-scale traffic flow in urban road networks. Rules were extracted from the historical data using Rough Set Theory, which assisted in finding the near neighbors.
Sun and Zhang [18] also proposed a selective random subspace predictor (SRSP), which is very similar to the nonparametric regression model. The SRSP built a selective input space based on Pearson correlation coefficients and then generated random input subspaces for forecasting. The method that the SRSP used to select relevant variables could also be used in a nonparametric regression model.
From the literature review above, it can be seen that various nonparametric regression models have been widely used to predict the traffic conditions of motor vehicles. However, few studies have addressed pedestrian traffic. So, to test and verify the applicability of nonparametric regression to pedestrian traffic prediction, the nearest neighbor nonparametric regression model was used to forecast the transfer passenger flow of subway stations. The advantages of nonparametric regression, namely high accuracy and strong portability, are shown by comparison with a Kalman filter model and a support vector machine regression model.

Procedure of Nonparametric Regression Prediction
The application of nonparametric regression prediction contains five key steps: choosing the clustering method for the historical database, defining the state vector, determining the similarity mechanism, choosing the nearest neighbor mechanism, and choosing the prediction function.

Choosing Clustering Methods of Historical Database.
The first and most critical step in nonparametric regression is historical data preparation, whose quality directly determines the prediction effect. The prediction effect is also closely related to the choice of clustering method and to computational time. Therefore, first, in order to find enough nearest neighbors, the historical database built by the clustering method must cover all states of the system. Second, the clustering method should meet the requirements of real-time classification of dynamic data and of real-time, online programming. However, traditional clustering methods take the average state vector or a single historical value as the clustering object, which makes it difficult to reflect the changing trends of the data. Thus, this paper focuses on improving the clustering method and the computational speed of the model.

Definition of State Vector.
A state vector is composed of the minimum number of state variables associated with the predictor variables. Because many state variables may be associated with the predictor variables, it is necessary to select the number of state variables properly to achieve the best balance between accuracy and computational speed.

Similarity Mechanism.
Similarity is an important concept in nonparametric regression: it describes how to evaluate the similarity between the current point and the points in the historical database. The most commonly used metrics are the Euclidean distance and the weighted Euclidean distance.

Choosing the Nearest Neighbor Mechanism.
As a core concept of nonparametric regression, the nearest neighbor mechanism defines how a point in the historical database becomes a close neighbor of the current point. There are two mechanisms: the minimum $k$-nearest neighbor method and the kernel nearest neighbor method. The minimum $k$-nearest neighbor method selects the $k$ points in the historical database whose similarity to the current point is the greatest. The kernel nearest neighbor method takes the current point as the core; all points within a given radius become the nearest neighbors of the current point.
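The two neighbor mechanisms can be contrasted with a minimal sketch. It assumes plain (unweighted) Euclidean distance and uses made-up state vectors; the function names `k_nearest` and `within_radius` are illustrative and do not come from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ranked_distances(current, history):
    """(distance, index) pairs for every historical vector, closest first."""
    return sorted((euclidean(current, h), i) for i, h in enumerate(history))


def k_nearest(current, history, k):
    """k-nearest rule: keep the k closest historical vectors."""
    return ranked_distances(current, history)[:k]


def within_radius(current, history, radius):
    """Radius rule: keep every historical vector no farther than `radius`."""
    return [(d, i) for d, i in ranked_distances(current, history) if d <= radius]


# Illustrative values only, not data from the study.
history = [[10, 12, 14], [30, 28, 26], [11, 13, 15], [50, 50, 50]]
current = [10, 13, 15]

print(k_nearest(current, history, k=2))
print(within_radius(current, history, radius=1.2))
```

The first rule always returns a fixed number of neighbors, while the second may return none or many depending on how densely the database covers the current state.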

Prediction Function Selection.
After the nearest neighbor points are found, a function is needed to use these points to predict the value in the next period. Commonly used methods include the average and the weighted average.
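The two prediction functions named here can be sketched as follows. The function names and the sample numbers are hypothetical, and the weights are supplied by the caller, since this step does not fix how they are derived.

```python
def average_forecast(next_values):
    """Plain average of the values that followed each nearest neighbor."""
    return sum(next_values) / len(next_values)


def weighted_average_forecast(next_values, weights):
    """Weighted average; weights need not be normalized beforehand."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * v for w, v in zip(weights, next_values)) / total


# Illustrative values only.
print(average_forecast([100, 120, 140]))
print(weighted_average_forecast([100, 120], weights=[3, 1]))
```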

Improvement of Typical Model
Improvement of Historical Data Clustering.
The basic procedure of nonparametric regression prediction is to compare the recent data with the historical data and to find the most similar data series, which are then used to predict the future data. So, to provide the most similar data series, the historical database should include enough historical information. To reflect as many trends of the data series as possible, all the historical data were stored in the database without any processing. The way the data series are organized in the historical database therefore determines the calculation efficiency of the prediction model. The historical database is the foundation of transfer passenger flow prediction. The core concept of nonparametric regression is to match recent data against the historical database. From all the matches, either the $k$ nearest matches or all the matches below a given distance threshold are located. Drawing on data storage methods from computer science, an improved method of organizing the historical data is proposed. This method quantifies the trend of each historical data series and assigns a different value to each trend; these values are used to cluster the historical data series.
Each element of the trend description series of a data series takes a value in {0, 1, 2}. The value of the trend description series is given by (1), the number of clustering types of the historical database by (2), and the clustering label of one data series by (3). Figure 2 shows the trend of one data series of length 4. Based on (2), the number of clustering types in the historical database for this length is given by (4), and the clustering label of the series in Figure 2 by (5).
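The labeling idea can be illustrated with a small sketch. It rests on assumptions that the text does not confirm: that the codes mean 0 for a fall, 1 for no change, and 2 for a rise between consecutive values, and that the codes of one series are read as a base-3 number. Under those assumptions a series of length 4 has three trends and 27 possible labels. All function names are hypothetical.

```python
def trend_codes(series, tolerance=0):
    """Code each consecutive change: 0 = fall, 1 = flat, 2 = rise (assumed coding)."""
    codes = []
    for prev, cur in zip(series, series[1:]):
        diff = cur - prev
        if diff > tolerance:
            codes.append(2)
        elif diff < -tolerance:
            codes.append(0)
        else:
            codes.append(1)
    return codes


def cluster_label(series, tolerance=0):
    """Read the trend codes as a base-3 number (assumed encoding)."""
    label = 0
    for code in trend_codes(series, tolerance):
        label = label * 3 + code
    return label


def cluster_count(length):
    """Number of distinct labels for a series with `length` points."""
    return 3 ** (length - 1)


# Illustrative values only: rise, flat, fall.
series = [120, 135, 135, 110]
print(trend_codes(series), cluster_label(series), cluster_count(len(series)))
```

A label of this kind lets the search skip every historical series whose shape differs from the recent one, which is the efficiency gain the clustering step aims for.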

The Selection of Data Series.
Based on experimental analysis, adjacent data points are chosen as the state vector. The vector contains four current transfer passenger flow trend data and five historical transfer passenger flow trend data. Four adjacent data points are selected as one data series. The prediction model calculates the clustering label from the trend of these four points and searches the historical database for the most similar data series. Then, the future data are predicted according to the next trend of the most similar data series.
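One way to read this search step is sketched below. It assumes windows of four adjacent flow values, a hypothetical base-3 trend code (0 fall, 1 flat, 2 rise) as the clustering label, and Euclidean distance inside the matching cluster. The names `build_index` and `nearest_in_bucket` and the flow values are illustrative only.

```python
import math
from collections import defaultdict

WINDOW = 4  # four adjacent values form one data series


def label(window):
    """Hypothetical clustering label: base-3 code of the consecutive trends."""
    value = 0
    for prev, cur in zip(window, window[1:]):
        code = 2 if cur > prev else 0 if cur < prev else 1
        value = value * 3 + code
    return value


def build_index(flows):
    """Group every historical window, with the value that followed it, by label."""
    index = defaultdict(list)
    for start in range(len(flows) - WINDOW):
        window = flows[start:start + WINDOW]
        index[label(window)].append((window, flows[start + WINDOW]))
    return index


def nearest_in_bucket(index, recent, k):
    """Search only the cluster that shares the recent window's label."""
    candidates = index.get(label(recent), [])
    scored = sorted((math.dist(recent, w), nxt) for w, nxt in candidates)
    return scored[:k]


# Illustrative flows only, not data from the study.
flows = [10, 12, 15, 19, 18, 16, 17, 20, 24, 23]
index = build_index(flows)
recent = [11, 13, 16, 21]
print(nearest_in_bucket(index, recent, k=2))
```

Restricting the distance computation to one cluster trades a small risk of missing a close neighbor with a different label for a much smaller search.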

The Similarity Mechanism.
The Euclidean distance, given by (6), is used to measure the similarity between the recent data series and the historical data series. In addition to the Euclidean distance, the weights of the most similar historical data series are used in the prediction model. As shown in (7), $w_i$ is the weight of the most similar historical data series $i$. The larger $w_i$ is, the greater the influence of data series $i$ on the prediction result. In (7), $k$ is the number of most similar data series, that is, the number of nearest neighbors selected from the historical database; it is closely related to the character of the database. Based on previous research results [13, 14, 19], $k$ is set to 5.
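For orientation, the standard Euclidean distance and one common form of distance-based weight are written out below. The symbols $x_j$ (the $j$th value of the recent data series), $h_{ij}$ (the $j$th value of historical series $i$), $d_i$, and $n$ are assumed names, and the normalized-reciprocal form of the weight is an assumption suggested by the reciprocal-distance weighting described in the text, not a formula taken from it.

```latex
\[
d_i = \sqrt{\sum_{j=1}^{n} \left( x_j - h_{ij} \right)^2},
\qquad
w_i = \frac{1/d_i}{\sum_{m=1}^{k} 1/d_m},
\qquad i = 1, \dots, k .
\]
```

With this form the weights sum to 1, and a neighbor at half the distance of another receives twice its weight.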

The Improvement of Selection Model.
The weighted average method based on the reciprocal of the matching distance is chosen as the prediction function. A point at a shorter distance is more similar, so its weight is larger. In most nonparametric regression prediction models, the next value of the most similar historical series is used as the predicted value for the recent data series. In this prediction algorithm, the next value and a weighting coefficient based on the historical data are used to predict the transfer passenger flow. In the state vector of the prediction model, the historical data at the current time and the nearest times are used to identify the different prediction coefficients, and the historical data of the next trend are used to calculate the predicted value directly.
However, for reasons such as a lack of historical data or abnormal flow, the next value of the recent data series may change dramatically; Figure 3 gives an example. So, to improve the prediction accuracy, an amending coefficient based on the average value of the recent data series is proposed. The improved model uses two averages: the average value of the recent data series and the average value of each neighbor data series $i$ (marked with the subscript $h$).
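A possible reading of the amended prediction is sketched below. It assumes that the amending coefficient is the ratio of the recent series average to the neighbor series average, applied to each neighbor's next value before reciprocal-distance weighting. That ratio form, the function name `improved_forecast`, and the small `eps` guard are assumptions rather than the model's exact definition.

```python
def improved_forecast(recent, neighbors, eps=1e-9):
    """Reciprocal-distance weighted forecast with a mean-ratio correction.

    neighbors: list of (window, next_value, distance) tuples.
    Assumes positive flows, so every neighbor average is nonzero.
    """
    recent_mean = sum(recent) / len(recent)
    # eps avoids division by zero when a neighbor matches exactly.
    weights = [1.0 / (distance + eps) for _, _, distance in neighbors]
    total = sum(weights)

    forecast = 0.0
    for (window, next_value, _), weight in zip(neighbors, weights):
        neighbor_mean = sum(window) / len(window)
        # Rescale the neighbor's next value to the level of the recent series.
        amended = next_value * recent_mean / neighbor_mean
        forecast += (weight / total) * amended
    return forecast


# Illustrative values only: both neighbors have the same shape at other levels.
recent = [100, 110, 120, 130]
neighbors = [
    ([50, 55, 60, 65], 70, 1.0),
    ([200, 220, 240, 260], 280, 1.0),
]
print(improved_forecast(recent, neighbors))
```

In this example the uncorrected weighted average of the next values would be 175, far above the recent level, while the rescaled forecast stays consistent with the recent series.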

Application
To test the accuracy of the prediction model, the transfer passenger flow of XIDAN station was used to calibrate the model. The historical database was built from the transfer passenger flow from July 26 to August 25, 2011. The prediction data were the passenger flow of August 25, 2011. The prediction results are illustrated in Figures 4 to 9. The prediction accuracy and the stability of the error of the improved nonparametric regression model have been improved significantly. So, the improved nonparametric regression prediction model can be used in real applications.
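The accuracy of the model is reported as an average relative error. A minimal sketch of such a measure follows, assuming it is the mean of |predicted - observed| / observed expressed in percent; the exact definition used in the study is not given, and the sample values are invented.

```python
def average_relative_error(observed, predicted):
    """Mean of |predicted - observed| / observed, in percent.

    Intervals with zero observed flow are skipped to avoid division by zero.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    terms = [abs(p - o) / o for o, p in zip(observed, predicted) if o != 0]
    if not terms:
        raise ValueError("no nonzero observations to compare against")
    return 100.0 * sum(terms) / len(terms)


# Illustrative values only.
observed = [100, 200, 400]
predicted = [110, 190, 400]
print(average_relative_error(observed, predicted))
```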

Conclusions
As convergence nodes of subway lines, transfer stations must handle more passengers because of the large transfer demand among different lines. So, it is necessary to predict the transfer passenger flow in order to avoid pedestrian congestion or to warn the management before it occurs.
Based on nonparametric regression theory, a transfer passenger flow prediction model was proposed. One month of transfer passenger flow data from XIDAN transfer station was used to calibrate and validate the model. The results show that the model can predict transfer passenger flow accurately for different intervals. What is more, its prediction accuracy is much better than that of the Kalman filter model and the support vector machine regression model. The longer the interval is, the more accurate the prediction result is. The maximum average relative error is 12.20%, which means that the prediction model can be used in real applications.

Figure 1: The schematic illustration of nonparametric regression theory.

Figure 2: Illustration of trend label of state vector for nonparametric regression.

Figure 3: Comparison of state vector of prediction and similar neighborhood.

Figure 4: Forecasting result for each 5 minutes in morning peak hour using nonparametric regression.

Figure 5: Forecasting result for each 3 minutes in morning peak hour using nonparametric regression.

Figure 9: Forecasting result for each 1 minute in evening peak hour using nonparametric regression.

Table 1: Precision of nonparametric regression forecasting model.

As Table 1 shows, the improved nonparametric regression model has very high prediction accuracy. The maximum average relative error is less than 15%. To compare the applicability of different prediction models, the Kalman filter model and the support vector machine regression model were chosen to predict the transfer passenger flow. Figures 10, 11, and 12 show the comparison of the prediction capability of the three models. Compared with the Kalman filter model and the support vector machine regression model, the improved nonparametric regression model predicts the transfer passenger flow more accurately.

Figure: Forecasting result for each 3 minutes in evening peak hour using nonparametric regression.