Prediction and Analysis of Train Passenger Load Factor of High-Speed Railway Based on LightGBM Algorithm

In order to improve the prediction accuracy of train passenger load factor of high-speed railway and meet the demand of different levels of passenger load factor prediction and analysis, the influence factor of the train passenger load factor is analyzed in depth. Taking into account the weather factor, train attribute, and passenger flow time sequence, this paper proposed a forecasting method of train passenger load factor of high-speed railway based on LightGBM algorithm of machine learning. Considering the difference of the influence factor of the passenger load factor of a single train and group trains, a single train passenger load factor prediction model based on the weather factor and passenger flow time sequence and a group of trains’ passenger load factor prediction model based on the weather factor, the train attribute, and passenger flow time sequence factor were constructed, respectively. Taking the train passenger load factor data of high-speed railway in a certain area as an example, the feasibility and effectiveness of the proposed method were verified and compared. It is verified that LightGBM algorithm of machine learning proposed in this paper has higher prediction accuracy than the traditional models, and its scientific and accurate prediction can provide an important reference for the calculation of passenger ticket revenue, operation benefit analysis, etc.


Introduction
High-speed railway has become the main mode of transportation for passengers. According to the relevant statistical data, in 2019, the passenger volume of the national railway reached 3.57 billion, of which the passenger volume of multiple unit trains was 2.29 billion, accounting for 64.15%. High-speed railway is a significant driver of railway passenger operation revenue and passenger flow growth, and its profit and loss analysis is critical to train operation decisions. The passenger load factor is a direct measure of train operation efficiency and an important basis for calculating passenger ticket revenue. Scientific and accurate prediction of the train passenger load factor can provide a significant reference for the train operation scheme, ticket revenue calculation, operational benefit analysis, and so on [1].
Passenger load factor prediction for passenger trains is usually based on historical passenger ticket data. The traditional method is to enter the train information into a spreadsheet, process and classify it manually, estimate the passenger load factor, and form a decision table [2]. However, this approach suffers from substantial error and inconsistent decision information. At present, there are quite a few studies on passenger load factor prediction for high-speed railway multiple units. Scholars have used a variety of models, such as multiple regression, time series, neural network, decision tree, gray theory, and ensemble learning models. Aiming at the competitive relationship between high-speed trains and air transport, Wang et al. [3] proposed a prediction model of the passenger load factor of high-speed railway trains based on the Adaboost-CART algorithm, from the perspective of the impact of air fare level and dynamic fluctuation on high-speed railway passenger flow. From the perspective of train attributes, Zhang et al. [4] proposed a classification and prediction model of the passenger train load factor based on the random forest algorithm. On the basis of analyzing the influencing factors of the train passenger load factor, Xu and Nie [5] established a BP neural network prediction model of the train passenger load factor, considering train attributes and the operation period. Based on two single prediction models, the ARIMA model and the BP neural network, Zhang and Bai [6] constructed a linear combination prediction model of the railway passenger load factor according to the principle of the minimum sum of squared errors [7]. In machine learning prediction research, quite a few studies use machine learning algorithms for short-term traffic flow prediction [1,8,9] and use LightGBM and XGBoost for prediction and classification [10][11][12][13][14][15].
Dong et al. [16] established a short-term traffic flow prediction model based on XGBoost algorithm. After analyzing the vehicle speed, road, and weather features in the course of the operation of the bus, Wang et al. [17] established a prediction model of bus travel time based on LightGBM algorithm. Huang et al. [18] constructed deep belief neural network based on multitask learning to predict traffic volume [19]. Zhang et al. [2] constructed a short-term traffic flow prediction model based on the fusion algorithm of XGBoost and LightGBM.
In studies of passenger load factor or passenger volume forecasting, various models are mainly used to predict from the historical passenger load factor or passenger volume, and the rules that generate the target variable are rarely derived from the attributes of trains [20][21][22][23]. In this paper, for high-speed railway trains, influence factors such as train attributes, historical weather, and the passenger flow sequence are considered, and a single train passenger load factor prediction model and a group train passenger load factor prediction model based on the LightGBM algorithm are proposed, which can provide a decision-making basis for ticket revenue calculation and operation benefit analysis.

Influence Factors.
The passenger load factor of high-speed railway is an index reflecting the degree of utilization of passenger carrying capacity. It is the ratio of passenger turnover (passenger-kilometers) to the carrying capacity offered (seat-kilometers), expressed as a percentage [24]. The passenger load factor data come from the analysis and statistics of passenger flow; in essence, the passenger flow determines the passenger load factor, and the factors that influence passenger travel choice are the factors that affect the train occupancy rate. Hence, analyzing the influencing factors of the passenger load factor amounts to analyzing passenger travel choice.
From the perspective of demand and the macrodistribution of passenger flow, the spatial distribution of passenger flow is determined by the regional economic development level, population, and function orientation along the high-speed railway. Over a given period, macrofactors such as the regional economic development level are relatively stable, so the spatial distribution of passenger flow is also relatively stable. Passengers have evident time preferences for specific travel behavior, and the departure and arrival time of a train directly affect its load factor. In addition, travel time and weather affect the choice of travel mode. At different times, travel purposes differ: on weekdays, most passengers travel for business, whereas on weekends and holidays, most passengers travel for tourism, family visits, etc. Therefore, the difference in travel time also affects the load factor of the train.
From the perspective of transportation supply, first, the distribution of passenger flow by direction is unbalanced, so the train operation direction affects the train passenger load factor. Second, the number of trains running between an origin-destination (OD) pair, namely, the service frequency between the OD pair, is one of the main factors affecting the choice of passengers. Furthermore, the departure and arrival time, running mileage, stations along the way, train capacity, and type of the train all affect passenger travel choice and thus the load factor of the train [21].
By and large, the influencing factors of the train passenger load factor can be divided into internal factors and external factors, as shown in Figure 1. Over a given period, the regional economic level, population, and function orientation of a city are relatively stable, so the main factors affecting the passenger load factor of high-speed railway are the direction of train operation, the service frequency of the OD pair, train attributes, weather, and travel time [14]. Accordingly, this paper mainly considers train attributes, weather, and travel time.

Prediction Model
The passenger load factor prediction model of high-speed railway is mainly composed of data acquisition, data processing, feature engineering, model training, and prediction.
(1) Data acquisition: the historical weather data, train attribute data, and passenger load factor data are obtained from various sources, and the original data set is formed by data fusion.
(2) Data processing: the types and dimensions of the data variables are inconsistent; therefore, the categories and features of the original data must be transformed before modeling so that the data meet the requirements of the algorithm's data structure. Data processing mainly includes data transformation and data cleaning, such as removing the character "°C" from the maximum and minimum temperature in the weather data. Trains that ran only a few times because they operated temporarily or were suspended are deleted (in this paper, trains with fewer than 60 runs are deleted). As the passenger load factor changes very differently across operation periods, this paper mainly selects weekday data for research.
(3) Feature engineering: all features obtained after data preprocessing, including the historical weather and train attribute features of the passenger load factor data, are shown in Table 1.
(4) Model training and prediction: after feature engineering, the sample data set is constructed and divided into the training set and the test set, and the model is then trained and tested. The framework of the prediction model is shown in Figure 2.
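The data processing and feature engineering steps above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the record keys (`train`, `date`, `max_temp`, `min_temp`, `load_factor`) are hypothetical, `Avgtemperature` is assumed here to be the mean of the daily maximum and minimum, and public holidays that fall on weekdays are not handled.

```python
from collections import Counter
from datetime import date

MIN_RUNS = 60  # threshold stated in the text: trains with fewer runs are dropped


def parse_temperature(text):
    """Turn a string such as '23°C' into the number 23.0."""
    return float(text.replace("°C", "").strip())


def preprocess(records, min_runs=MIN_RUNS):
    """Clean raw records and derive calendar features.

    Each record is a dict with the hypothetical keys
    'train', 'date' (datetime.date), 'max_temp', 'min_temp', 'load_factor'.
    """
    runs = Counter(r["train"] for r in records)
    cleaned = []
    for r in records:
        if runs[r["train"]] < min_runs:
            continue  # temporary or suspended service
        if r["date"].weekday() >= 5:
            continue  # keep weekdays only (Monday=0 ... Friday=4)
        max_t = parse_temperature(r["max_temp"])
        min_t = parse_temperature(r["min_temp"])
        cleaned.append({
            "train": r["train"],
            "load_factor": r["load_factor"],
            "Avgtemperature": (max_t + min_t) / 2,  # assumption: mean of max and min
            "Month": r["date"].month,
            "Day": r["date"].day,
            "DayOfWeek": r["date"].weekday(),
            "WeekOfYear": r["date"].isocalendar()[1],
        })
    return cleaned


# Example with a single record (min_runs lowered so the record survives).
sample = [{"train": "G1", "date": date(2020, 1, 1),
           "max_temp": "10°C", "min_temp": "2°C", "load_factor": 0.8}]
print(preprocess(sample, min_runs=1))
```

The feature names mirror those reported later in the feature importance analysis (WeekOfYear, DayOfWeek, Day, Month, Avgtemperature); categorical weather and train attribute columns would be added in the same way.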
For a single train, the attributes of the train are fixed, and the main factors affecting its passenger load factor are weather and time-series characteristics. For group trains, the attributes of the train are among the cardinal factors affecting the passenger load factor; therefore, the prediction of the passenger load factor of group trains needs to consider train attributes, weather features, and time-series characteristics.

Case Analysis
To verify the effectiveness of the single train and group train passenger load factor prediction models based on the LightGBM algorithm proposed in this paper, the passenger load factor data of all down-direction trains from station A of a high-speed railway in a certain area are taken as an example. The load factor data come from all trains that departed from station A in the down direction from October   data of train passenger load factor, and the train load factor data of the first 600 days are taken as the training set and those of the last 67 days as the test set.
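A chronological split of this kind can be sketched as below. The function name `split_by_day` and the `date` key are hypothetical; the point is only that the split is made by calendar day rather than by random shuffling, so that no test-period observation leaks into training.

```python
def split_by_day(rows, n_train_days=600):
    """Split records into train/test sets by distinct calendar day.

    rows: list of dicts with a comparable 'date' value (hypothetical key).
    The first n_train_days distinct days form the training set,
    all later days form the test set.
    """
    days = sorted({r["date"] for r in rows})
    if len(days) <= n_train_days:
        raise ValueError("not enough distinct days to leave a test set")
    cutoff = days[n_train_days]  # first day of the test period
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test
```

With 667 distinct days and the default `n_train_days=600`, the test set covers the last 67 days, matching the proportions described above.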

Passenger Load Factor Prediction of a Single Train Based on LightGBM Algorithm
3.1.1. Model Training. The passenger load factor of a train is predicted with the LightGBM algorithm and compared with the XGBoost and ARIMA algorithms. During training on the training set, the lightgbm.cv() function is used to optimize the parameters with 10-fold cross validation: learning_rate is set to 0.01, and the parameters {n_estimators, num_leaves, bagging_fraction, bagging_freq, feature_fraction, max_bin, min_data_in_leaf, lambda_l1, lambda_l2, min_split_gain, max_depth} are adjusted in turn; finally, a set of optimal parameters is obtained and then fine-tuned. MAE is used as the performance evaluation index in the training process:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

where $y_i$ is the actual value of the train passenger load factor, $\hat{y}_i$ is the forecast value, and $n$ is the number of samples.
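The evaluation index and the "adjust the parameters in turn" procedure can be illustrated with a short sketch. The MAE function follows the standard definition (mean of absolute differences between actual and forecast values). The tuning loop is a generic one-parameter-at-a-time search, not the authors' exact script: `cv_score` is a hypothetical callable that would wrap a cross-validation run and return the mean validation MAE, and `toy_cv_score` is a stand-in used only to make the example runnable.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def tune_in_turn(base_params, grid, cv_score):
    """Tune one parameter at a time, keeping the best value found so far.

    grid: dict mapping parameter name -> list of candidate values,
          visited in insertion order.
    cv_score: callable(params) -> validation error (lower is better).
    """
    params = dict(base_params)
    best = cv_score(params)
    for name, candidates in grid.items():
        for value in candidates:
            trial = {**params, name: value}
            score = cv_score(trial)
            if score < best:
                best, params = score, trial
    return params, best


def toy_cv_score(p):
    """Stand-in for a real cross-validation run (illustration only)."""
    return abs(p.get("num_leaves", 31) - 15) / 100 + abs(p.get("max_depth", -1) - 6) / 100


best_params, best_score = tune_in_turn(
    {"learning_rate": 0.01},
    {"num_leaves": [7, 15, 31], "max_depth": [4, 6, 8]},
    toy_cv_score,
)
print(best_params, best_score)
```

Tuning parameters one after another is much cheaper than a full grid search, but it can miss interactions between parameters, which is one reason a final fine adjustment is usually made afterwards.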

Result Analysis.
After the model is trained on the training set with the optimal parameters, the fitting process is visualized in Figure 3 and compared with the XGBoost algorithm. It can be seen from Figure 3 that the model cannot identify and fit the mutation points of the passenger load factor well, but it fits the other, relatively stable points well. Meanwhile, since the passenger load factor sequence is a time series, the ARIMA model is selected for a comparison test in order to further verify the effectiveness of the model constructed in this paper.
For the ARIMA (p, d, q) model, the order is selected through the ADF unit root test, the Ljung-Box test, and the ACF (autocorrelation coefficient) and PACF (partial autocorrelation coefficient) charts, with minimum AIC and BIC as the target; ARIMA (7,8) is determined as the final model, and the fit of the model is visualized in Figure 4, where the red sequence is the true values and the yellow sequence is the fitted values. The trained LightGBM and XGBoost models are used to predict the test set, and ARIMA is applied with rolling prediction; the MAE of the three models on the training set and test set is then obtained, as shown in Table 2. According to Table 2, LightGBM has the best prediction performance. The ARIMA model has the worst fit and the lowest prediction accuracy; moreover, its rolling prediction must add the actual value at each step and then retrain, so it is not suitable for multistep prediction.
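The rolling prediction scheme described above can be sketched generically. The loop below refits on the full history at every step and only then reveals the true value, which is why the scheme cannot forecast several steps ahead without observed data. The `last_week_mean` forecaster is a deliberately simple stand-in, not ARIMA; any function that maps a history to a one-step forecast could be passed in its place.

```python
def rolling_forecast(train, test, fit_and_predict):
    """One-step-ahead rolling forecast.

    fit_and_predict: callable(history) -> forecast for the next step.
    After each forecast, the actual value is appended to the history,
    so every step requires the true observation of the previous step.
    """
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(fit_and_predict(history))  # refit on all data seen so far
        history.append(actual)  # reveal the true value before the next step
    return predictions


def last_week_mean(history, window=7):
    """Stand-in forecaster (NOT ARIMA): mean of the most recent observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


train_series = [0.70, 0.72, 0.68, 0.75, 0.71, 0.69, 0.73]
test_series = [0.74, 0.70, 0.72]
print(rolling_forecast(train_series, test_series, last_week_mean))
```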
The LightGBM and XGBoost prediction results are compared with the true values, as shown in Figure 5.
The LightGBM model is used to predict the passenger load factor of the selected train, and the importance of its features is visualized in Figure 6. It can be seen from Figure 6 that the features of the train's passenger load factor, sorted by importance, are WeekOfYear, DayOfWeek, Avgtemperature, Day, Weather, Month, and AQIlevel (air quality level), of which the most important is WeekOfYear and the least important is AQIlevel.

Group Train Passenger Load Factor Prediction Based on LightGBM Algorithm.
Before the passenger load factor of group trains is predicted, a histogram of the train passenger load factor to be predicted is drawn to check the distribution of its values. The x-axis represents the passenger load factor from 0 to 1 divided into 100 bins, and the y-axis represents the count in each bin, as shown in Figure 7. According to the histogram, the passenger load factor of the sample follows a long-tailed distribution and is highly unbalanced.
Compared with other traditional machine learning algorithms, the LightGBM algorithm can set data sampling parameters during training, which ensures that the sampled training data keep the original class proportions and makes it more suitable for unbalanced sample distributions. LightGBM also provides the parameter class_weight; when the value "balanced" is passed to this parameter, the class weights are calculated automatically from the classification label values. This adjusts for the unbalanced sample distribution and helps training converge when the samples are unbalanced. Therefore, to ensure that the passenger load factor of group trains is predicted with a small error, a 10-class prediction of the group train passenger load factor is constructed [7], as shown in Figure 8, and good classification results are obtained.
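To make the effect of "balanced" weighting concrete, the sketch below discretizes load factors into 10 classes and computes class weights as n_samples / (n_classes * count of the class), which is the formula documented for the "balanced" mode. The equal-width binning in `to_class` is an assumption for illustration; the text does not state how the 10 classes are delimited.

```python
from collections import Counter


def to_class(load_factor, n_classes=10):
    """Map a load factor in [0, 1] to a class index 0..n_classes-1.

    Assumption: equal-width bins; a value of exactly 1.0 goes to the top class.
    """
    if not 0.0 <= load_factor <= 1.0:
        raise ValueError("load factor must lie in [0, 1]")
    return min(int(load_factor * n_classes), n_classes - 1)


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


load_factors = [0.95, 0.91, 0.88, 0.97, 0.93, 0.42]
labels = [to_class(x) for x in load_factors]
print(labels)
print(balanced_class_weights(labels))
```

Rare classes receive proportionally larger weights, so misclassifying a sample from the thin tail of the distribution costs more during training than misclassifying one from the dominant classes.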

Model Training.
The data processed by feature engineering are divided into the training set and the test set, and the test set consists of the data of the last month. In the given parameter space, the cross-validation function lightgbm.cv() described in the official LightGBM documentation is used to optimize the parameters.
For binary classification, 1 is used as the positive class and 0 as the negative class, and the four outcomes of a prediction are true positive (TP), false positive (FP), false negative (FN), and true negative (TN). For multiclass classification, the confusion matrix can be used to obtain TP, FP, FN, and TN for each category, which are recorded as TP_i, FP_i, FN_i, and TN_i, respectively. For multiclass classification models, different problems call for different evaluation indexes. In the official scikit-learn documentation on classification metrics, sklearn.metrics.f1_score with the parameter average = "weighted" addresses the problem of unbalanced samples in multiclass evaluation.
This index is denoted as Weighted_F1. Formula (4) gives the precision weighted across the categories, and formula (5) gives the recall weighted across the categories. Formula (6) gives Weighted_F1, which addresses the imbalance of the sample evaluation index in multiclass classification.
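As an illustration of support-weighted multiclass metrics, the sketch below computes the weighted precision, recall, and F1 from per-class TP_i, FP_i, and FN_i. It follows the convention of scikit-learn's average = "weighted" option, where each class's score is weighted by its share of the true labels; whether this matches the paper's formulas (4) to (6) term by term is an assumption, and the function name `weighted_scores` is hypothetical.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return (weighted precision, weighted recall, weighted F1).

    Each class is weighted by its support (share of true labels),
    following the 'weighted' averaging convention.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    support = Counter(y_true)
    n = len(y_true)
    w_precision = w_recall = w_f1 = 0.0
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = support[c] / n  # zero for classes that never occur in y_true
        w_precision += weight * precision
        w_recall += weight * recall
        w_f1 += weight * f1
    return w_precision, w_recall, w_f1


print(weighted_scores([0, 0, 0, 1], [0, 0, 1, 1]))
```

Note that under this convention the weighted F1 is the weighted mean of the per-class F1 scores, which is generally not equal to the harmonic mean of the weighted precision and weighted recall.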

Result Analysis.
The 10-class classification model is based on the LightGBM machine learning algorithm, and the optimal model is obtained after cross validation and parameter adjustment. The feature_importance function in the LightGBM algorithm can calculate, output, and visualize the importance of each feature. To verify the prediction model of the group trains' passenger load factor based on the unbalanced 10-class samples, the XGBoost and RandomForest algorithms are used to compare the prediction results. In the XGBoost algorithm, one-hot encoding is needed for the categorical features, and the parameter tuning of the RandomForest and XGBoost algorithms is based on the GridSearchCV function of the Python machine learning library. The prediction results are shown in Table 3. It can be seen from Table 3 that the classification result of the LightGBM algorithm is the best; the feature importance of the optimal LightGBM model is visualized in Figure 9.
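The one-hot encoding step mentioned for the comparison models can be sketched in plain Python. The helper `one_hot` and the example column values are hypothetical; the sketch only shows how one categorical column is expanded into one binary indicator column per category.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category.

    rows: list of dicts; column: name of the categorical key to expand.
    """
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"Weather": "rain", "Mileage": 100},
    {"Weather": "sunny", "Mileage": 200},
]
print(one_hot(rows, "Weather"))
```

The number of columns grows with the number of distinct categories, which is one practical reason why native handling of categorical features can be convenient for attributes with many levels.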
It can be seen from Figure 9 that, in the classification and prediction model of the train passenger load factor, the top five important features are DepaTime, Mileage, OperTime, WeekOfYear, and StatNumber, in which DepaTime, Mileage, OperTime, and StatNumber are the train attribute features and WeekOfYear is the time sequence feature of the train passenger load factor. In addition, Capacity is the least important in the features of train attributes, Month is insignificant in the time-series characteristics, and WIntensity and AQIlevel have little influence on a train passenger load factor.

Conclusion
In this paper, factors that affect the passenger load factor of high-speed railway trains, such as train attributes, historical weather, and the passenger flow time sequence, are considered. A single train passenger load factor prediction model and a group train passenger load factor prediction model based on the LightGBM algorithm are constructed for different prediction requirements and compared with the XGBoost, RandomForest, and ARIMA algorithms, and the feasibility and effectiveness of the prediction models constructed in this paper are verified.
By analyzing the feature importance of the passenger load factor output by the LightGBM algorithm, the influencing factors of the passenger load factor of high-speed railway trains in the region can be obtained. For a single train, the crucial factors that affect the passenger load factor are WeekOfYear, DayOfWeek, and average temperature, the first two of which are features of the passenger flow time sequence. For high-speed railway trains in a certain area, the main factors affecting the passenger load factor are the attributes of the train, chiefly departure time, mileage, and operation time.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.