Optimization of Metro Trains Operation Plans Based on Passenger Flow Data Analysis

Intelligent metro systems produce massive passenger flow and traffic data every day, among which route, station, and operation data are important for optimizing the train operation scheme. In this paper, we collect passenger flow information from the Shenzhen metro, analyze the passenger flow pattern and its distribution characteristics using a data warehouse on the Hadoop platform, and optimize the train operation scheme. Using dynamic passenger flow data, we develop an optimization model with train departure and dwell time as decision variables and passenger waiting time, passenger ride time, train full load ratio, and train operation balance as objectives. An improved parallel genetic algorithm (GA) incorporating a simulated annealing algorithm (SAA) and an optimal individual retention strategy is used to find the optimal result. To verify the usefulness of the method, simulation experiments are conducted on the optimization model and method using the real passenger flow and train operation data of the Shenzhen metro, and the simulation results are compared with the original plan.

1. Introduction

The metro system is characterized by large capacity, high speed, high frequency, and punctuality, and it has become one of the best schemes to alleviate urban traffic congestion [1]. Metro systems produce a large amount of passenger flow data [2], such as passenger origin-destination (OD) information and train operation data. Using big data techniques to analyze passenger flow data can improve rail transit transportation efficiency [3] and passenger satisfaction. The intelligent construction of the metro is an important means of relieving the pressure of urban traffic, and train schedule optimization is one of its important components [4]. Passenger OD information is particularly important because it can be used to optimize the metro train operation plan. Train operation plans are developed from historical traffic data. A plan determines each train's departure time at each station, its dwell time at the station, and its arrival time at the station, and it must meet operational constraints such as the train full load factor and travel time. Through the analysis of OD data and passenger flow data, we can optimize the train operation scheme to improve passenger satisfaction [5] and reduce the operation cost of the metro.
Many scholars have studied metro schedule optimization. In terms of optimization models and objectives, Wang et al. [6] proposed a mixed integer programming model based on time-varying demand, which minimizes the passenger waiting time and the number of passengers unable to transfer, using train capacity as a constraint. Zhang et al. [7] developed two nonlinear nonconvex programming models that consider the variation of train frequency, train running time, and stopping time; under the constraints of train operation and the passenger boarding and alighting process, they designed the train timetable with the minimum total passenger travel time. Qu et al. [8] proposed a two-step optimization model to adjust the metro schedule. In the first step, the train departure interval is used as the decision variable to reduce passenger waiting time. In the second step, the total energy consumption of all trains is minimized by taking the train departure and arrival times at the various stations as the decision variables. Wu et al. [9] proposed a multi-objective train schedule optimization method with the objectives of minimizing total energy consumption, average waiting time, and average maximum load deviation, and demonstrated through a case study that the method can reduce the total energy consumption, the maximum load deviation, and the waiting time of passengers. Xie et al. [10] designed a synchronized metro schedule and stopping timetable optimization model for passengers and energy saving and demonstrated experimentally that it is very effective in reducing train energy consumption, running time, and delay probability. In terms of optimization methods, Wihartiko et al. [11] used an improved genetic algorithm for an integer programming model to solve the bus schedule problem, addressing chromosome design, the initial population recovery technique, chromosome reconstruction, and generation-specific chromosome extinction.
Shang et al. [12] established a model to minimize the total passenger travel time and proposed a spatial branch-and-bound algorithm to solve it. Wang et al. [13] proposed a linear weighted compromise algorithm and a heuristic algorithm to find the best solution for a bi-objective integer programming model with train stopping time control. Guo et al. [14] proposed a mixed integer nonlinear programming model for generating optimal train schedules and maximizing interchange synchronization events, then designed a hybrid optimization algorithm (PSO-SA) combining particle swarm optimization and simulated annealing and demonstrated its superiority by comparison with many algorithms. Tang et al. [15] combined the genetic algorithm and the simulated annealing algorithm to find the best result of an optimization model with multiple constraints. Liu et al. [16] developed a mathematical model considering headway and dwell time and then designed an improved artificial bee colony algorithm to solve it. Tang et al. [17] developed a bi-objective optimization model that minimizes total passenger waiting time and departure time and designed an improved nondominated sorting genetic algorithm (NSGA-II) with a specific coding scheme for fast search of Pareto optimal solutions. Huang et al. [18] proposed a two-step model for matching metro passenger relationships and reducing the total waiting time of passengers, respectively, and designed a hybrid MCMC-GASA (Markov chain Monte Carlo, genetic algorithm, simulated annealing) approach to solve the problem.
A review of the literature shows that the subway train schedule optimization problem has been discussed and studied extensively. Previous studies commonly assumed a constant passenger flow at a particular moment and then optimized the train plan for that moment. In reality, passenger flows vary dynamically over time [19], and previous train schedule optimization often first assumed that the passenger flow follows a normal or some other distribution. Modeling passenger flow patterns in complex scenarios with such approximate estimation models is inaccurate, which may make the optimization model inapplicable to routine operating conditions. With the rapid development of big data technology, big data analysis provides new methods and techniques for metro train schedule optimization. We collect historical passenger ticket card data from the metro AFC system, clean the data on a Hadoop big data platform, and then calculate the time-dependent passenger arrival rate at each station and the time-dependent passenger disembarkation rate between stations. A multi-objective train schedule optimization model that takes into account train movements and passenger demand is proposed. Then a parallel genetic algorithm (GA) incorporating a modified simulated annealing algorithm is designed, and an optimal individual retention strategy is added to obtain the best result. We use measured data from the Shenzhen metro to evaluate the proposed model and solution method, and the results show that the method is effective and accurate.
The rest of this article is organized as follows. In Section 2, we describe the methodology for AFC data acquisition and processing. In Section 3, we develop a multi-objective optimization model considering metro operations and passenger travel demand. In Section 4, we propose a parallel improved genetic algorithm incorporating a simulated annealing algorithm to solve the multi-objective optimization function. Section 5 applies the multi-objective optimization model to real historical passenger flow data of the Shenzhen metro and solves for the optimal solution. Finally, Section 6 gives the conclusion of this paper.

Description of Data.
The raw data we capture is the ticket card information from the metro automatic fare collection (AFC) system. When a passenger passes through the gate to ride the subway, the passenger information is saved in the AFC system and a corresponding travel record is generated. Each record includes the start station address, start line, start station time, destination station address, destination line, and destination time. The Shenzhen metro generates approximately 5.9 million records per day, each containing more than 60 attributes. To facilitate later statistics, the source data is cleaned and transformed, and only the fields we use are retained, as shown in Table 1.

Data Processing.
In recent years, big data analysis technology has developed rapidly, and big data platforms have accordingly become more advanced and mature [20]. The core features of big data platforms are scalable distributed storage and efficient parallel data processing and computing capabilities. In this paper, we set up a multinode Hadoop platform, add the corresponding ecosystem components, such as Hive and HBase, and then complete data processing and model building on this big data platform.
To reduce data interference and computational effort, we take the raw data stored in HDFS for data cleaning and then use Hive to store the data. Calculations are performed using Hive to get the passenger arrival and disembarkation rates.
(1) Calculate the number of passengers $C_j^{\mathrm{in}}$ who enter the metro at station $j$ on the same line in the period $t_1$.
(2) Count the number of passengers who leave station $j$ on the same line during the period $t_2$.
(3) Calculate the number of passengers $C_{ij}$ who board at station $i$ and get off at station $j$ in the period $t_2$.
(4) The passenger arrival rate at station $j$ is obtained by dividing $C_j^{\mathrm{in}}$ by $t_1$.
(5) The proportion of passengers boarding at station $i$ who leave at station $j$ is obtained by dividing $C_{ij}$ by $C_i^{\mathrm{in}}$, the number of passengers who entered at station $i$.
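The counting steps above can be illustrated with a minimal Python sketch. It is not the Hive job used in the paper: the record layout `(entry_station, entry_time_s, exit_station, exit_time_s)`, the sample values, and the function names are all hypothetical, and for simplicity a single period filtered on entry time stands in for both $t_1$ and $t_2$.

```python
from collections import Counter

# Hypothetical cleaned AFC records:
# (entry_station, entry_time_s, exit_station, exit_time_s)
records = [
    (1, 10, 3, 900),
    (1, 50, 2, 400),
    (1, 200, 3, 1100),
    (2, 120, 3, 800),
]

def arrival_rates(records, t_start, t_end):
    """Passengers entering each station per second within [t_start, t_end)."""
    entries = Counter(o for o, t_in, _, _ in records if t_start <= t_in < t_end)
    period = t_end - t_start
    return {station: n / period for station, n in entries.items()}

def alighting_ratios(records, t_start, t_end):
    """Share of passengers entering at station i (in the period) who exit at j."""
    in_period = [(o, d) for o, t_in, d, _ in records if t_start <= t_in < t_end]
    entries = Counter(o for o, _ in in_period)   # entries per origin station
    od_counts = Counter(in_period)               # trips per (origin, destination)
    return {(o, d): n / entries[o] for (o, d), n in od_counts.items()}

rates = arrival_rates(records, 0, 300)
ratios = alighting_ratios(records, 0, 300)
```

On a Hive table the same quantities would come from grouped counts per station and per station pair; the dictionaries here only make the two divisions explicit.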

Multi-Objective Optimization Model
To improve the operational efficiency of the metro, in this section we develop a passenger flow data-driven dynamic optimization model of the metro train operation plan, based on the passenger flow and travel data preprocessed on the Hadoop platform as described in the previous section. The optimization model considers both metro operation, including train operation stability and train loading efficiency, and passenger experience, including passenger ride time, waiting time, and the number of passengers on the train. We study a metro line consisting of k metro stations and l trains [21], taking station 1 as the starting station and station k as the ending station. To quantify the parameters of the mathematical model, to match the actual situation of metro operations, and to simplify the overall optimization model, the following assumptions about passengers and metro trains are made.
(1) Only one train can stop at the same station in the same direction of subway operation at the same time, and there is no overtaking when trains run in parallel on the subway line.
(2) When the train enters the metro station, all passengers line up to get off and get on following the principle of "first off, then on, first to arrive, first to serve."
(3) The maximum capacity of each train is a fixed value. When the number of passengers waiting on the platform exceeds the capacity of the train, the remaining passengers need to continue to wait on the platform for the next train to arrive.
Assumption (1) is generally applicable to most urban transportation systems to ensure that trains operate in sequence. Assumption (2) is in line with the mainstream passenger queuing principle, and assumption (3) can improve the running stability of the train and the comfort of passengers.

Model of Train Operation.
Train operation is generally described by the train exit time, interstation running time, entry time, and dwell time [22]. Given a train $l$ and a subway station $k$, the travel interval between train $l$ and its preceding train $l-1$ can be expressed as the difference between the exit times of the two trains at station $k$, $d_{(l,k)} - d_{(l-1,k)}$, where $d_{(l,k)}$ is the moment of departure of train $l$ from station $k$ and $d_{(l-1,k)}$ is the moment of departure of train $l-1$ from station $k$. $d_{(l,k)}$ can be represented by the moment $a_{(l,k)}$ when train $l$ arrives at station $k$ and the stop time $s_{(l,k)}$ at station $k$:

$$d_{(l,k)} = a_{(l,k)} + s_{(l,k)}.$$

Mathematical Problems in Engineering
The time $a_{(l,k)}$ at which the train arrives at station $k$ is the sum of the train's departure time $d_{(l,k-1)}$ from the previous station and the running time $r_{(l,k-1)}$ between the two stations:

$$a_{(l,k)} = d_{(l,k-1)} + r_{(l,k-1)}.$$
The running time is usually a preset fixed value because the distance between stations is fixed and the train runs in autopilot mode between the two stations. The stopping time $s_{(l,k)}$ of train $l$ at station $k$ depends on the following quantities: $s_{\min}$ is the minimum stopping time of the train; $a$ and $b$ are two parameters that denote the time required for a passenger to board and alight, respectively, which can be obtained analytically; $N_{\mathrm{door}}$ is the number of train doors opened at the station; and $U_{(l,k)}$ and $D_{(l,k)}$ denote the numbers of passengers getting on and getting off train $l$ at station $k$, respectively, which can be estimated from the historical data. For convenience of calculation, we assume that the passengers who are going to board consciously form two lines and that the passengers who are going to alight form one line in the train.
In addition, to ensure safe train operation, two adjacent trains need to satisfy the minimum headway constraint, i.e., the difference between the arrival time of train $l$ at station $k$ and the departure time of the previous train $l-1$ from station $k$ should be no less than a constant, which can be described as $d_{(l,k)} - d_{(l-1,k)} \geq H_{\min}$.
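The arrival and departure recursion and the headway check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names and all numeric values are hypothetical, dwell times are passed in directly (they are decision variables in the model), and the headway test follows the inequality as printed, on departure times.

```python
def build_timetable(first_departures, run_times, dwell_times):
    """Return (arrivals, departures), both indexed [train][station].

    first_departures[l]: departure time of train l from the first station (s)
    run_times[k]:        running time from station k to station k + 1 (s)
    dwell_times[l][k]:   stopping time of train l at station k (s)
    """
    arrivals, departures = [], []
    for l, d0 in enumerate(first_departures):
        a_row = [d0 - dwell_times[l][0]]          # arrival at the first station
        d_row = [d0]
        for k in range(1, len(run_times) + 1):
            a = d_row[k - 1] + run_times[k - 1]   # a(l,k) = d(l,k-1) + r(l,k-1)
            a_row.append(a)
            d_row.append(a + dwell_times[l][k])   # d(l,k) = a(l,k) + s(l,k)
        arrivals.append(a_row)
        departures.append(d_row)
    return arrivals, departures

def headway_violations(departures, h_min):
    """List (train, station) pairs where d(l,k) - d(l-1,k) < h_min."""
    return [
        (l, k)
        for l in range(1, len(departures))
        for k in range(len(departures[l]))
        if departures[l][k] - departures[l - 1][k] < h_min
    ]

# Hypothetical example: 3 stations, 2 trains.
run_times = [120, 150]
dwell = [[30, 30, 30], [30, 60, 30]]
arr, dep = build_timetable([0, 400], run_times, dwell)
```

Because the running times are fixed, the whole timetable follows from the first-station departures and the dwell times, which is why those two quantities serve as the decision variables.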

Model of Passenger Demand.
The number of passengers in train $l$ when it leaves station $k$ is $P_{(l,k)}$. It can be represented by the number of passengers $P_{(l,k-1)}$ in train $l$ when it leaves station $k-1$, the number of passengers $D_{(l,k)}$ who get off at station $k$, and the number of passengers $U_{(l,k)}$ who get on board at station $k$:

$$P_{(l,k)} = P_{(l,k-1)} - D_{(l,k)} + U_{(l,k)}.$$

There is a maximum number of passengers that a train can carry when it is running. As a result, passengers may become stranded at stations during peak traffic. The number of passengers boarding the train at station $k$ is $U_{(l,k)}$. It is bounded by the remaining capacity $P^{\mathrm{remain}}_{(l,k)}$ of the train at station $k$ and by the number of passengers $W^{\mathrm{wait}}_{(l,k)}$ waiting at station $k$:

$$U_{(l,k)} = \min\left(P^{\mathrm{remain}}_{(l,k)},\; W^{\mathrm{wait}}_{(l,k)}\right),$$

where the remaining capacity $P^{\mathrm{remain}}_{(l,k)}$ of train $l$ at station $k$ can be represented by the maximum number of passengers on board $Q_{(l,\max)}$, the number of passengers on board $P_{(l,k-1)}$, and the number of passengers leaving the train $D_{(l,k)}$:

$$P^{\mathrm{remain}}_{(l,k)} = Q_{(l,\max)} - P_{(l,k-1)} + D_{(l,k)}.$$

The number of passengers waiting for train $l$ at station $k$ is $W^{\mathrm{wait}}_{(l,k)}$. It can be expressed by the number of passengers $W^{\mathrm{remain}}_{(l-1,k)}$ stranded at station $k$ by the previous train $l-1$ and the number of passengers $\lambda_k (d_{(l,k)} - d_{(l-1,k)})$ arriving in the travel interval between the adjacent trains $l$ and $l-1$:

$$W^{\mathrm{wait}}_{(l,k)} = W^{\mathrm{remain}}_{(l-1,k)} + \lambda_k \left(d_{(l,k)} - d_{(l-1,k)}\right),$$

where $\lambda_k$ is the passenger arrival rate in the interval $(d_{(l,k)} - d_{(l-1,k)})$ between two adjacent trains [23].
The number of passengers $W^{\mathrm{remain}}_{(l,k)}$ stranded by train $l$ at station $k$ is the number of waiting passengers who could not board:

$$W^{\mathrm{remain}}_{(l,k)} = W^{\mathrm{wait}}_{(l,k)} - U_{(l,k)}.$$

The number of passengers on train $l$ who get off at station $k$ is $D_{(l,k)}$. It can be represented by the numbers of passengers $U_{(l,i)}$ who boarded at the previous stations $i = 1, \ldots, k-1$ and the passenger boarding and alighting ratio (O-D matrix) $E_{(i,k)}$:

$$D_{(l,k)} = \sum_{i=1}^{k-1} U_{(l,i)} E_{(i,k)}.$$

Big data analysis techniques can be used to statistically analyze historical passenger flow data to determine the proportion of passengers boarding and disembarking at each stop.
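As a rough illustration of how these passenger quantities interact, the following Python sketch moves one train along a line and tracks boarding, alighting, and stranded passengers. It assumes that boarding is the smaller of the remaining capacity and the number of waiting passengers and that stranded passengers are those left over; the function name, argument layout, and sample values are hypothetical.

```python
def load_train(headways, stranded, arrival_rate, od_ratio, capacity):
    """Simulate one train along the line.

    headways[k]:     d(l,k) - d(l-1,k) at station k (s)
    stranded[k]:     W_remain(l-1,k), passengers left behind by the previous train
    arrival_rate[k]: lambda_k, passengers per second
    od_ratio[i][k]:  E(i,k), share of passengers boarding at i who alight at k
    Returns (on_board, boarded, left_behind), one value per station.
    """
    on_board, boarded, left_behind = [], [], []
    load = 0.0
    for k in range(len(headways)):
        alight = sum(boarded[i] * od_ratio[i][k] for i in range(k))  # D(l,k)
        room = capacity - load + alight                              # P_remain(l,k)
        waiting = stranded[k] + arrival_rate[k] * headways[k]        # W_wait(l,k)
        board = min(room, waiting)                                   # U(l,k)
        load = load - alight + board                                 # P(l,k)
        on_board.append(load)
        boarded.append(board)
        left_behind.append(waiting - board)                          # W_remain(l,k)
    return on_board, boarded, left_behind

# Hypothetical 3-station line with a small train so that stranding occurs.
od = [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
on_board, boarded, left = load_train(
    headways=[400, 400, 400],
    stranded=[0, 0, 0],
    arrival_rate=[0.5, 0.25, 0.0],
    od_ratio=od,
    capacity=150,
)
```

The `left` list produced for one train would be fed in as `stranded` for the following train, which is how a long headway at one station propagates into longer waits later.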

Multi-Objective Optimization Function.
The optimization of train schedules based on dynamic and uneven passenger flows mainly includes train operation optimization and passenger satisfaction optimization. Train operation optimization mainly consists of reducing the deviation of the actual train capacity from the desired capacity and ensuring the balance of train operation. Passenger satisfaction optimization consists of reducing the waiting time in the station and the travel time between stations.

The waiting time $J_1$ of passengers at the platform is the sum of the waiting time of passengers who are stranded after the departure of the previous train and the waiting time of new arrivals in the interval between the operation of two trains. The passenger travel time $J_2$ is the sum of the time passengers spend on board while the train is running and the time they wait on board while the train stops at each station. The train running balance $J_3$ is expressed through the difference between the stopping times of two adjacent trains at each station. $J_4$ is the difference between the actual capacity of the train and the desired capacity of the train.

Considering the above elements to be optimized, the multi-objective optimization function can be described as

$$J = aJ_1 + bJ_2 + cJ_3 + dJ_4,$$

where $a$, $b$, $c$, $d$ denote the weights of the objectives, which are set differently according to different optimization needs. During peak passenger periods, the values of $a$ and $b$ should be increased suitably in order to carry passengers rapidly and decrease waiting and journey times. During off-peak periods of passenger flow, the stability of train operation should be improved and the operating cost decreased, so the values of $c$ and $d$ need to be suitably increased.
During the stable period of passenger flow, the weights can be set in a balanced manner, taking into account both the stability of train operation and the length of time passengers must wait. In conclusion, when choosing the weights for each optimization target, it is important to take into account both the passenger flow and the optimization requirements. The best weights should be chosen after conducting numerous tests.
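The weighting idea can be made concrete with a short Python sketch. Two parts of it are assumptions rather than the paper's formulas: each objective is divided by its value under a baseline timetable so that terms with different units become comparable, and the waiting time at one station is computed for uniformly arriving passengers (average wait of half a headway). The default weights are the values 0.4, 0.3, 0.2, 0.1 used later in the numerical experiment; the function names are hypothetical.

```python
def waiting_time(stranded, arrival_rate, headway):
    """Assumed waiting-time term for one train at one station (passenger-seconds).

    Stranded passengers wait the whole headway; new arrivals, assumed to
    arrive uniformly, wait half a headway on average.
    """
    return stranded * headway + arrival_rate * headway ** 2 / 2

def weighted_objective(values, baseline, weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted sum a*J1 + b*J2 + c*J3 + d*J4 on baseline-normalized objectives."""
    return sum(w * v / b for w, v, b in zip(weights, values, baseline))

j1_example = waiting_time(stranded=50, arrival_rate=0.5, headway=400)
score = weighted_objective(values=(80, 90, 100, 100),
                           baseline=(100, 100, 100, 100))
```

With this normalization the baseline timetable scores 1, so any candidate scoring below 1 improves on it under the chosen weights; shifting weight from the first two terms to the last two reproduces the peak versus off-peak trade-off described above.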

Solution Method
To find the best solution for the multi-objective optimization model proposed in the previous section, we designed an improved parallelized genetic algorithm and implemented it on the Hadoop big data platform.

Improved Genetic Algorithm.
The genetic algorithm is a computing model of natural selection and biological evolution that searches for optimal solutions by simulating the natural evolutionary process. GA provides a number of benefits, including the capacity to handle continuous and discrete variables, flexibility in defining constraints, the capacity to handle huge search spaces, and the capacity to provide numerous optimal or good solutions [24]. The simulated annealing algorithm is derived from the principle of annealing in solids and has been shown to be quite successful in locating the global optimum for a variety of NP-hard combinatorial problems [25]. Starting from a certain initial temperature, the probabilistic jump property of SA can help the objective function reach the global optimal solution in the desired time as the temperature decreases [26]. Given the benefits of these two methods, Gandomkar et al. [27] presented a hybrid algorithm that combines GA and SAA to optimize the distributed generation resource allocation problem. The advantages of the genetic algorithm are that it searches the whole solution space quickly, has excellent global search ability, avoids the traps that fast-descent algorithms fall into, and is naturally parallel, which suits distributed computing and speeds up convergence. However, its local search ability is insufficient, and a simple genetic algorithm is time-consuming and inefficient in the late stage of evolution. SAA has relatively powerful local search ability [28], but it cannot steer the search toward the most promising region.
Therefore, we improved the genetic algorithm and designed an adaptive genetic algorithm that incorporates a simulated annealing algorithm and an optimal individual replacement strategy, as follows. The genetic algorithm uses the roulette wheel selection method, but because the probabilistic selection is random, good individuals may be lost. To retain them, we use the best individual replacement strategy, i.e., we replace individuals with low fitness values with those with high fitness values, thus increasing the fitness of the offspring. The specific selection method is as follows:
(1) Find the individual $x_b$ with the highest fitness by calculating the fitness of each individual in the current population, assuming that the number of individuals in the population is $N$.
(2) Calculate the probability $p(x_i)$ that an individual is selected and the cumulative probability $q(x_i)$.
(c) Crossover: Two individuals are selected for the simulated binary crossover operation based on the set crossover probability, and then the child fitness value $\mathrm{Fit}(c)$ and the parent fitness value $\mathrm{Fit}(p)$ are calculated for the simulated annealing operation. Let $T_0$ denote the initial temperature and let $\alpha$ be a positive number less than 1, generally between 0.8 and 0.99, that controls how the temperature decreases. During annealing, the new state is accepted with a probability given by the Metropolis criterion.
(d) Mutation: Polynomial mutation with real-number encoding is applied, according to a set mutation probability, to the chromosomes that have completed the crossover operation.
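The operators described above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' code: fitness values are assumed positive and larger-is-better, the replacement rule swaps only the weakest selected individual for $x_b$, the cooling schedule is assumed to be geometric ($T = T_0 \alpha^{g}$ at generation $g$), and the distribution index `eta` of the simulated binary crossover is a hypothetical parameter. Polynomial mutation is omitted for brevity.

```python
import math
import random

def roulette_select(population, fitness, rng):
    """Roulette-wheel selection followed by best-individual replacement."""
    total = sum(fitness)
    cumulative, acc = [], 0.0
    for f in fitness:
        acc += f / total                  # p(x_i) accumulated into q(x_i)
        cumulative.append(acc)
    chosen = []
    for _ in population:
        r = rng.random()
        idx = next((i for i, q in enumerate(cumulative) if r <= q),
                   len(population) - 1)   # guard against rounding in q
        chosen.append(idx)
    # Assumed replacement rule: the weakest selected individual becomes x_b.
    best = max(range(len(population)), key=lambda i: fitness[i])
    worst_pos = min(range(len(chosen)), key=lambda j: fitness[chosen[j]])
    chosen[worst_pos] = best
    return [population[i] for i in chosen]

def sbx_crossover(p1, p2, rng, eta=2.0):
    """Simulated binary crossover for two real-valued genes."""
    u = rng.random()
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (eta + 1.0))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
    c1 = 0.5 * ((1 + beta) * p1 + (1 - beta) * p2)
    c2 = 0.5 * ((1 - beta) * p1 + (1 + beta) * p2)
    return c1, c2

def temperature(t0, alpha, generation):
    """Geometric cooling (assumed form): T = T0 * alpha ** generation."""
    return t0 * alpha ** generation

def metropolis_accept(fit_child, fit_parent, temp, rng):
    """Always accept a better child; accept a worse one with prob. exp(delta / T)."""
    delta = fit_child - fit_parent
    return delta >= 0 or rng.random() < math.exp(delta / temp)
```

Early on, when the temperature is high, worse children are accepted fairly often, which keeps diversity; as the temperature falls the acceptance rule approaches plain elitism, which is where the local search strength of simulated annealing is meant to help.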

Parallel Genetic Algorithm.
Based on the improved genetic algorithm proposed earlier, we propose an improved parallel genetic algorithm. The specific algorithm is described as follows.

Algorithm 1
Input: <key, value>, where the key is an individual in one population and the value is its fitness in that population.
Output: <key′, value′>, where key′ is the best individual in the iterative process and value′ is the best fitness value, corresponding to key′.
Algorithm Procedure:
(1) Set the number of iterations to M.
(2) Initialize the integer i = 0.

In Algorithm 1, Step (3) is the regular genetic operation, which includes selecting individuals with high fitness from the population and eliminating individuals with low fitness, crossing chromosomes with a certain probability, and mutating chromosomes with a certain probability. Step (4) computes the individual fitness after the iteration. Step (5) chooses the optimal chromosome and fitness. Step (6) produces the output.

In the procedure that follows, Step (2) finds the optimal chromosome and fitness among the optimal solutions of the populations. Step (3) outputs the final chromosome <key, value> pairs and fitness value <key′, value′> pairs to the sequence file on HDFS.
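The two-stage structure described above, evolving each population independently and then picking the best of the per-population results, can be sketched in plain Python. This is only a stand-in for the Hadoop job: the built-in `map` replaces the distributed map phase, the toy fitness function replaces the timetable objective, and the genetic operation is reduced to mutation with elitist survival. All names and values are hypothetical.

```python
import random

def fitness(x):
    """Toy fitness with its maximum at x = 3 (stand-in for the real objective)."""
    return -(x - 3.0) ** 2

def evolve_subpopulation(args):
    """Map step: run the iterations on one subpopulation.

    Returns a (best individual, best fitness) pair, mirroring <key', value'>.
    """
    seed, population, iterations = args
    rng = random.Random(seed)
    best = max(population, key=fitness)
    for _ in range(iterations):
        # Simplified genetic operation: mutate, keep the fitter of parent and child.
        children = [x + rng.gauss(0.0, 0.5) for x in population]
        population = [c if fitness(c) > fitness(p) else p
                      for p, c in zip(population, children)]
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best, fitness(best)

def reduce_best(results):
    """Reduce step: best (individual, fitness) pair over all subpopulations."""
    return max(results, key=lambda kv: kv[1])

init_rng = random.Random(0)
subpops = [[init_rng.uniform(-10, 10) for _ in range(20)] for _ in range(4)]
tasks = [(seed, pop, 50) for seed, pop in enumerate(subpops)]
results = list(map(evolve_subpopulation, tasks))
best_x, best_fit = reduce_best(results)
```

Because each task carries its own seed and population, the subpopulations do not share state, so the `map` call could be handed to a process pool or a cluster framework without changing the two functions.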

Numerical Results
To verify the performance of our optimization method on the multi-objective optimization model of metro schedules, we collected the AFC data of Shenzhen metro line 6. The dataset [...] (Figure 1). The existing train schedules have fixed stopping times at each station, as shown in Table 2.
Since the subway trains are in automatic mode, the train runs between two adjacent stations for a fixed period of time.
This is shown in Table 3. The train is a six-car type-A train, and other information about the train is listed in Table 4. The passenger arrival rate over time is obtained from the historical passenger flow statistics computed on the Hadoop big data platform for the study period. Figure 2 shows the distribution of the passenger arrival rate at each station of Shenzhen metro line 6 over a day.
We focus on two hours of the morning peak period to perform more precise schedule optimization research. Figure 3 displays the statistical exit ratios between stations, where the final station is on the horizontal axis and the starting station is on the vertical axis. A value of 0 in the figure means that few or no passengers get off at that station during the period.
A total of 17 trains are scheduled to depart during this period with a departure interval of 435 s. Using the departure interval of trains at the first station and the stopping time at each station as the decision variables, the improved genetic algorithm introduced above is used to find the best result. The input information for setting up the genetic algorithm is listed in Table 5. The waiting time and travel time of passengers are the primary optimization objectives, and the train operation balance is the secondary optimization objective. Therefore, the weights of the optimization function are set as a = 0.4, b = 0.3, c = 0.2, d = 0.1, respectively. The optimized train schedule does not increase the number of departures, and the departure interval of each train at the departure station is shown in Table 6.
The results of the comparison between the original train timetable and the optimized timetable are shown in Figure 4, where the horizontal axis is the arrival and departure time of trains at the various stations and the vertical axis shows the stations of line 6 (Table 7). The experimental results show that, compared to the existing schedule, the optimized metro schedule reduces passenger waiting time by 21.42%, reduces passenger travel time by 22.56%, and increases the train full load ratio by 2.65%. It can be seen that the optimized metro timetable driven by passenger flow data improves passenger satisfaction and train operation efficiency more than the existing planned schedule.

Table 6: Departure interval of each train at the departure station.

Train      Interval (s)    Train       Interval (s)
Train 1    400             Train 10    395
Train 2    468             Train 11    376
Train 3    394             Train 12    366
Train 4    421             Train 13    409
Train 5    411             Train 14    384
Train 6    399             Train 15    395
Train 7    386             Train 16    435
Train 8    401             Train 17    435
Train 9    420
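The optimized departure intervals reported above can be summarized with a few lines of arithmetic. The sketch below assumes the listed values are in seconds, like the original 435 s interval; the variable names are hypothetical.

```python
# Optimized departure intervals of trains 1..17, as listed in the text (assumed seconds).
intervals_s = [400, 468, 394, 421, 411, 399, 386, 401, 420,
               395, 376, 366, 409, 384, 395, 435, 435]
original_interval_s = 435

total = sum(intervals_s)                       # total span of the 17 intervals
mean = total / len(intervals_s)                # average optimized interval
shorter = sum(1 for h in intervals_s if h < original_interval_s)
shortest_train = intervals_s.index(min(intervals_s)) + 1
longest_train = intervals_s.index(max(intervals_s)) + 1

print(f"mean interval: {mean:.1f} s over {len(intervals_s)} trains")
print(f"{shorter} trains depart at a shorter interval than {original_interval_s} s")
print(f"shortest: train {shortest_train}, longest: train {longest_train}")
```

Most intervals fall below the original fixed value while the number of departures stays the same, which is consistent with the reported reduction in waiting time.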

Conclusion
By analyzing and mining the past passenger flow data that the metro system creates in large quantities, it is possible to significantly increase operational efficiency and passenger satisfaction. In this paper, we built a Hadoop big data platform to process and analyze the large volume of historical passenger flow data of the Shenzhen metro, and then we built a data warehouse and used the Hive component to calculate the time-varying passenger inbound rate of each station and the station-to-station disembarkation ratio over the day.
A multi-objective model considering both trains and passengers is proposed to optimize the train timetable. We have designed a parallel genetic algorithm incorporating simulated annealing algorithm improvements, using the best individual replacement strategy to retain the best individuals to get the best solution. Results of experiments using actual data from Shenzhen metro line 6 show that an improved train timetable can decrease passengers' waiting and transit times while also enhancing the balance of train operations and transportation effectiveness.
In future studies, we will further develop the proposed model with AFC data for multiple line interchanges. We will consider train operations for train turnarounds and turn-backs for the study, and another task to be performed is to analyze the passenger travel characteristics on holidays and weekends to optimize various nonworking day train schedules based on it.

Data Availability
Due to the nature of this research, participants of this study did not agree for their data to be shared publicly, so supporting data is not available.