GMM Clustering Based on WOA Optimization and Space-Time Coupled Urban Rail Traffic Flow Prediction by CEEMD-SE-BiGRU-AM

Accurately predicting the short-term passenger ow of urban subways is very important for urban subway stations to formulate passenger ow organization and evacuation plans eectively and rationally plan passenger travel routes.is paper establishes a novel framework to predict the hourly inbound and outbound passenger ow of subway stations based onWOA-GMM station classiers, CEEMD-SE noise reduction, and BiGRU optimized by attention. Firstly, this paper classies subway stations using the improved Gaussian mixture model (GMM) with Whale Optimization Algorithm (WOA) to realize the feature extraction of dierent types of subway stations. Secondly, this paper uses the Complementary Ensemble Empirical Mode Decomposition (CEEMD) to decompose the noise reduction of each station’s hourly inbound and outbound passenger ow. It combines the empirical modal components by calculating the sample entropy (SE), which makes the time series stable and reduces the time cost of forecasting. Finally, a Bidirectional Gated Recurrent Unit (BiGRU) model improved by attention mechanism (AM) is established for each station’s inbound and outbound passenger ow.e prediction model established in this paper is veried by subway passenger ow data in Shanghai, China, within 24 days. Finally, it is concluded that the model can predict the passenger ow of subway stations. Compared with the traditional BackpropagationNeural Network (BP), the Long Short-TermMemory (LSTM), and the normal BiGRUmodel, the model proposed in this paper has an average reduction of 65.90%, 64.54%, and 49.06% in Mean Absolute Percentage Error (MAPE) in the prediction of the hourly inbound and outbound passenger ow of each type of station, respectively.


Introduction
e ability to accurately forecast the short-term passenger ow of city subways is critical to inhabitants' quality of life. With the rapid advancement of global urbanization, many urban rail transit systems have achieved explosive growth [1]. e number of newly opened transportation systems has surged during the past decade. In 2019 alone, the total number of passengers transported by subway and light rail globally reached 70.794 billion. By the end of 2020, 538 cities in 77 nations and regions have inaugurated urban rail transportation, with a total operating mileage of over 33,346 km and over 34,220 stations [2]. e rapid expansion of the rail transit network has also caused passenger ow organization and train scheduling problems. Excessive passenger ow on the subway and excessive inbound and outbound passenger ow of the station may cause safety problems such as stampede incidents and are not conducive to passengers reaching their destination according to the target time. However, due to the di erent external characteristics such as spatial layout and commercial e ect of each station, excessive attention to these characteristics will complicate the prediction of passenger ow. erefore, making the best use of the time and space features of subway stations and accurately anticipating the passenger ow of rail transit stations is a pressing issue that must be addressed. stations greatly correlates. erefore, it is necessary to perform cluster analysis on stations from time and space perspectives. Zhou et al. [2] thoroughly considered each station's topology structure and passenger flow information, which used complex network evaluation and the k-means method to cluster the stations. It proposed a method to balance and control passenger flow. However, in practical applications, the k-means algorithm [3], Affinity Propagation (AP) algorithm [4], Spectral Clustering (SC) algorithm [5], and other methods are more suitable for processing spherical data, and the processing effect for data with irregular shapes is not satisfactory. In contrast, the GMM algorithm assumes that a combination of multiple Gaussian distributions forms the data points, and there is no restriction on the shape of the data. Multiple Gaussian distribution functions can be used to fit the shape of the data points to the greatest extent. However, since the GMM algorithm is based on the Expectation-Maximum (EM) algorithm, it is simple to attain a local optimum and it is extremely sensitive to the initial value. For such a problem, Guo et al. [7] presented utilizing the Particle Swarm Optimization (PSO) optimized GMM as a correction approach for the parameters of the electromagnetic transient model. It also verifies the effectiveness of optimizing the initial parameters of the GMM algorithm through the heuristic algorithm. WOA [8] is a new variant of the heuristic optimization algorithm that outperforms the PSO and Grey Wolf Optimizer (GWO) algorithm [9,10] in terms of performance. Abbas et al. [11] used the WOA to extract essential features in the breast cancer dataset, which greatly improved the prediction accuracy. Gadekallu et al. [7] used the combination of Principal Component Analysis (PCA) and WOA to accurately extract the important features of tomato disease data and used a deep neural network to identify the disease types. Among them, using the WOA improves the model's accuracy by 4% on the test set. Yan et al. [13] used WOA to improve the extreme gradient boosting (XGB) model and found that the optimized model predicted daily reference evapotranspiration considerably better than the basic FAO-56 PM model. To overcome the problems, the parameters of GMM based on the EM algorithm attain the local optimum and rely too much on the selection of the initial value. e WOA is used to optimize the initial parameters of GMM in this study. At the same time, it also solves the problem of low fitting accuracy and robustness of the k-means algorithm.
In addition, changes in the external environment will have a particular impact on passenger flow data. e data noise created in this manner is unpredictable, and too much random fluctuation will diminish the predictability of the results. erefore, it is necessary to decompose and denoise the original data. In recent years, researchers have developed various processing methods to identify the characteristics of nonlinear sequences to maximize the universality of the prediction results and reduce the impact of external environmental factors on the results. ese include wavelet transform (WT) [14], Singular Spectrum Analysis (SSA) [15], Empirical Mode Decomposition (EMD) [16], Ensemble Empirical Mode Decomposition (EEMD) [17], etc. Some researchers have also proved the higher accuracy of the hybrid prediction model with decomposition and noise reduction processing. Zhu et al. [18] used the Autoregressive Integrated Moving Average model (ARIMA) to predict the passenger flow and combined WT to decompose and denoise the data, which achieved higher accuracy than the ARIMA model alone. However, the wavelet transform needs to specify the wavelet basis function artificially, and the decomposition function lacks directionality [19]. In contrast, the empirical mode decomposition method is more adaptive. Zhao et al. [20] used the empirical mode decomposition method to convert the time series into a series of eigenmode functions and residuals. ey used a long and short-term memory neural network to estimate subway passenger flow, and the results were more accurate than previous predictions. However, EMD is prone to modal aliasing, and CEEMD [21] solves this problem by introducing complementary noise. erefore, this paper uses CEEMD with high computational efficiency and high reconstruction accuracy of the original signal to denoise the original data.
Parametric and nonparametric models are the two types of traditional short-term passenger flow prediction models. erefore, the traditional short-term passenger flow prediction models are mainly divided into parametric and nonparametric models. One of the representative parametric models is ARIMA [22]. ARIMA considers the periodic characteristics of time, and the prediction method is direct and does not require too much preprocessing. However, ARIMA ignores the influence of randomness on the overall forecasting results, and the forecasting accuracy of time series with strong randomness is not high. Compared with parametric models, nonparametric models are more flexible and can deal with more complex situations, including Support Vector Machine (SVM) [23], LSTM [24], and Gated Recurrent Unit (GRU) [25]. e WT-SVM model proposed by Sun et al. [26] complemented the advantages of WT and SVM to predict the subway passenger flow. erefore, SVM still has a big limitation in kernel selection. Yang et al. [27] proposed the Wave-LSTM model, which used the method of controlling variables to determine the model parameters and verified the effectiveness and prediction accuracy of the hybrid model by comparing it with traditional models. In contrast, Deep Neural Networks (DNNs) [28] have a greater advantage in sequence learning, so the improved LSTM model captures the long-term time dependence of sequences mainly through gated recursive units. [29], used LSTM and GRU to forecast the short-term passenger flow of urban rail transit and found that 5 minutes was the best temporal granularity for both models to predict short-term passenger flow. Under this time granularity, the overall performance of GRU is better than that of LSTM. However, it is often difficult for recurrent neural network models to capture local features accurately. In this regard, Wu and Tan [30] used a combination of CNN and LSTM to predict traffic flow, but this method would increase the difficulty of training and be less efficient for long-term sequence prediction. e AM [31] method extracts valuable features globally. Reference [32] improved the CNN model mainly used for local feature perception by using the features with a strong global perception of AM and effectively extracting global features. erefore, this paper adopts the BiGRU model improved by AM to make the final passenger flow prediction, which greatly improves information utilization and accuracy.

Paper Contribution
is paper establishes the WOA-GMM-CEEMD-SE-BiGRU-AM model to predict the short-term passenger flow of urban rail transit and compares it with other models to verify its accuracy and effectiveness. e contributions of this paper are as follows: (1) Since different stations have different spatiotemporal characteristics, this paper employs GMM to group stations with comparable characteristics. However, the EM algorithm is the core part of the GMM, and its clustering effect largely depends on the initial value of the EM algorithm. erefore, the initial parameters of the GMM are optimized by WOA. e stations are finally classified according to their spatiotemporal characteristics.
(2) e passenger flow data of rail transit fluctuates greatly, and the regularity is poor. erefore, this paper decomposes and denoises the passenger flow data of each type of station through CEEMD and uses a more regular component as the target variable of prediction to increase the prediction accuracy. However, considering the time cost of prediction, similar variables are combined by calculating SE.
(3) In order to predict the passenger flow of rail transit, this paper firstly establishes the BiGRU, but to extract the local features of the time series more accurately, AM is used to improve the BiGRU. Finally, compared with other models, it can be verified that the model established in this paper has high accuracy. e rest of the chapters are arranged as follows: Chapter 4 introduces the model, 4.1 introduces the processing flow of the station clustering algorithm, 4.2 introduces the CEEMD-SE method for decomposing noise reduction, and 4.3 introduces the building process of the BiGRU-AM model. e third chapter carries on the calculation of the example and the validation of the model, and the fourth chapter concludes this paper 4. Since GMM is classified based on the EM algorithm, this model can easily achieve local convergence [33]. erefore, this paper uses WOA to solve this problem. e whale optimization algorithm is a metaheuristic algorithm based on swarm intelligence. By simulating the predation behavior of humpback whales, the random learning strategy is used to achieve global shrinkage, and the convergence rate is fast, which is suitable for solving global optimization problems. e predation process of the humpback whale is as follows:

Model Introduction
Step1: Determine the prey location and select the hunting method.
(1) Shrinkage and encirclement mechanism: When the humpback whale surrounds the prey, it will choose to swim toward the humpback whale with the best position, identify the position of the prey, and surround it. Such hunting is realized by reducing the value of a.
where X(t) is the position of the t-th iteration of the humpback whale. X p (t) is the optimal position currently searched; t is the current iteration number; D is the distance from the humpback whale to the prey. t max is the maximum number of iterations; a is a value linearly decreasing from 2 to 0; r is a random vector in the range 0 to 1. (2) Spiral spit mechanism Calculate the distance between itself and the optimal individual in the current population, and spit out bubbles while swimming in a spiral.
b is the logarithmic spiral shape constant; l is a random number in [−1, 1].
Step 2: Search for prey. When humpback whales search for prey, they will randomly swim according to the position of the same species to achieve a global search.
X rand (t) is a random position.

Gaussian Mixture
Model. e Gaussian mixture model (GMM) [34] assumes that all data points come from different distributions, and by finding a mixture of multidimensional Gaussian probability distributions, it eventually generates a distribution that can satisfy any shape. Data points are judged to be from the same distribution in one group. Compared with the k-means algorithm to calculate the Euclidean distance between data points, the GMM algorithm can measure clusters' distribution probability, which makes the fitting accuracy higher. In addition, GMM has lower time complexity than traditional clustering algorithms and is more in line with the central limit theorem under the condition of large samples.
In the multidimensional case, the objective function of GMM is p(X) is the product of the likelihood functions of all Gaussian distributions. x i is the i-th sample (dimension p); p(x i |μ k , k ) is the conditional distribution under the specified parameters, which obeys the expectation μ k (dimension p), and the covariance matrix is k (dimension p × p) which is usually distributed; L is the log-likelihood function; K is the number of clusters; N is the number of samples. e calculation steps of GMM are as follows: Step 1: Set the number of sample clusters to K, and the samples obey a Gaussian distribution: where p(x i ) is a multivariate Gaussian distribution; α i is the probability of the i-th mixture component.
Step 2: Calculate the probability of x j : where c ij represents the probability that the j-th sample belongs to the i-th class.
Step 3: Compute the maximum likelihood estimate for each cluster to update the model parameters, including the expectation, variance, and probability of each data belonging to class i of Gaussian mixture distribution.
Step 4: Repeat steps 2 and 3 until the objective function converges.
Step 5: e j-th template is divided into the corresponding category according to the rule of taking the maximum value of c ij .

WOA-GMM.
GMM is established based on the EM algorithm. e EM algorithm is an expectation-maximization algorithm, but it is easy to fall into a local optimum, and the result strongly depends on the initial parameters. erefore, this paper uses the WOA to initialize the parameters of the EM algorithm and then establishes the GMM.
Step 1: Determine the number of clusters.
Step 2: Construct a Gaussian mixture model objective function.
Step 3: WOA is used to solve the maximum likelihood estimation of the Gaussian mixture model constructed in Step 2.
Step 4: e optimization parameters obtained in Step 3 are used as the initial parameters of the EM algorithm.
Step 5: Iterate until optimal. e model flow is shown in Figure 1. t1 and itermax1 are the current and maximum iterations of the WOA. t2 and itermax2 are the current number of iterations and the maximum number of iterations of the GMM algorithm.

Complementary Ensemble Empirical Mode
Decomposition. EMD is a decomposition noise reduction method used to smooth and linearize data. However, due to the discontinuity of the Intrinsic Mode Function (IMF) components, the method will produce modal aliasing during the decomposition process, and the improved EEMD model solves this problem. At the same time, due to the white noise introduced by EEMD, its residual parts will affect the original data to a large extent [35]. As a result, this work uses the CEEMD model, including complimentary noise to improve the computational efficiency and signal reconstruction accuracy. e calculation process of the CEEMD model is as follows: Step 1: A set of positive and negative Gaussian white noise sequences ϵ + i (t) and ϵ − i (t) is added, which obtains a new passenger flow sequence based on the original passenger flow sequence x(t). 4 Mobile Information Systems x + i (t) and x − i (t) represent the result of adding positive and negative white noise to the original passenger flow sequence.
Step 2: Perform empirical mode decomposition on the sequences in Step 1 separately to obtain n-1 IMF components and one residual component.
c + ij (t) and c − ij (t) are, respectively, the j-th modal component obtained by empirical modal decomposition after adding Gaussian white noise for the i-th time; when j � n, it is the residual component.
Step 3: Replace the Gaussian white noise sequence and repeat Step 1 and Step 2 until N groups of eigenmode components are obtained.
Step 4: Calculate the average value of N groups of IMF components in Step 3 to obtain the final decomposition result.

Sample Entropy.
After CEEMD decomposes the time series data, multiple IMF components can be obtained. SE [36] can measure the complexity of different IMF components, which improves the Approximate Entropy (AE), reduces the dependence on the length of the time series in the calculation process, and reduces the impact of the loss of original data on the complexity calculation. e sample entropy calculation steps are as follows: Step1: Let a particular IMF component obtained by Step 2: Define the distance between X i and X j as Step 3: Given a similarity tolerance r (r > 0), count the number of d(X i , X j ) < r for each vector X i , denoted by G i , and calculate its ratio to N-u expressed as G u i (r).
Step 4: Calculate the mean of all G u i (r) expressed as G u (r).
However, when SE is applied to actual data, N is always finite, thus giving an estimate of sample entropy as where z t is the update gate. r t is the reset gate; h t is the newly generated information after processing the historical information through the update gate. w z , w r , and w h are the weight parameters of the GRU network; b z , b r , b h are the thresholds of the GRU network. e GRU structure diagram is shown in Figure 2. e hidden state of the GRU model layer only considers the impact of historical information on the current input, and the impact of future information is equally important, so this paper uses BiGRU to solve this problem. BiGRU consists of two unidirectional and opposite GRUs. e model integrates the information before and after the current state, and the final output is the weighted summation of the output results of the forward GRU and the backward GRU and is linearly superimposed with the threshold. e main calculation formula of BiGRU is where h (t) f and h (t) b are the forward and backward GRU output vectors for the current state, and β are the weight parameters of the forward and backward GRU; c is the linear threshold. e BiGRU network structure is shown in Figure 3.

Attention Mechanism.
Although BiGRU improves the efficiency of information utilization to a certain extent, it still has the problem that local features cannot be accurately extracted when dealing with long-term sequence data. erefore, when building the model, this paper adds an attention layer to the hidden layer to solve this problem. e attention layer further extracts the key information of the time series by calculating the weight of the output result in the BiGRU layer, thereby reducing the length of the output result and improving the robustness of the model: is the unnormalized attention score. e higher the attention score, the better the match between the input vector and the target vector; u s is the randomly initialized attention matrix; w is the weight coefficient; b is the bias coefficient; c (t) is the attention vector; h (t) i is the value represented by the i-th output result at the t-th time.

Data Sources and Descriptions.
is paper selects Automatic Fare Collection (AFC) data of Shanghai Metro Lines 1, 11, 2, 8, 9, 12, and 16 from April 1, 2015, to April 24, 2015, to test the established subway station passenger flow prediction model. e dataset contains timestamps, station names, and passenger arrival and departure records.

Station Clustering.
Considering the large differences in passenger flow data in different time periods, the original data needs to be standardized first. Second, the optimal number of clusters needs to be determined. Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are always used for trading off the complexity of the clustering model and the goodness of fitting the data: L is the log-likelihood function in equation (4). K is the number of clusters; N is the sample size. e second term in (19) and (20) is the penalty term. It can be seen that the penalty term of BIC considers the sample size, and the overall value is larger than that of AIC. erefore, BIC can more effectively balance the model accuracy and complexity when the number of samples is too large, preventing the model from becoming too complex. Since 162 stations need to be clustered and due to the large sample size, only BIC is used to determine the optimal number of classifications. e smaller the BIC, the better the model classification effect. Set the number of clusters to a range from 1 to 20, and take the number of clusters corresponding to the minimum BIC value. From Figure 4, it can be determined that the optimal number of clusters is 5.
Secondly, use WOA to find GMM parameters for the aggregated passenger flow data of 5 types of stations: the number of humpback whales is 80, and the maximum number of iterations is 500. Finally, the optimal initial weights of each Gaussian distribution are 0.09, 0.35, 0.31, 0.09, and 0.16.

Forward layer
Backward layer

Mobile Information Systems
To illustrate the advantages and efficiency of the WOA, we still use AIC and BIC as evaluation metrics to compare the GMM and WOA-GMM. As shown in Table 1, the model optimized by the WOA is smaller in both AIC and BIC indicators, which proves its excellent improvement effect. Figures 5 and 6, respectively, represent the 24-hour cumulative passenger flow changes in 5 clusters within 24 days. e typical characteristics of these clusters of stations are shown in Table 2.

Decomposition and Noise Reduction of Station Passenger
Flow Sequence. Next, this paper takes the inbound passenger flow of the first cluster of the station as an example to illustrate the process of data decomposition and noise reduction.
Using CEEMD, the empirical mode decomposition process of adding complementary white noise is carried out step by step for the daily hourly inbound and outbound passenger flow data of five clusters of subway stations. Figure 7 represents the decomposition of the corresponding data of the first cluster of subway station inbound passenger flow into 10 IMF components, the last of which represents the residual.
As shown in Figure 7, many of the components have similar complexities. IMF1∼IMF5 have high complexity, IM6∼IMF7 have lower complexity and a particular trend, and IMF8∼R have obvious trends. However, conclusions based on   observation alone are not accurate, so this paper uses sample entropy (SE) to determine the complexity of these components. Figure 8 represents the changing trends of inbound passenger flow of the first cluster of the station, respectively. Next, to reduce the training time and improve the model prediction efficiency, this paper combines the IMF components of the inbound passenger flow of various types of stations into the combined components in Table 3 according to the IMF components with similar complexity as shown in Figure 8. e trend of the combined components over time is shown in Figure 9. In this paper, the recombination components Comp1 and Comp2 with high frequency and randomness are regarded as high-frequency components; the recombination components Comp3 with low frequency and certain randomness are regarded as low-frequency components; the recombination components Comp4 with the lowest frequency are regarded as trend components. e assignment results are shown in Table 3.

Passenger Flow Forecasting Process.
is paper optimizes BiGRU based on the attention mechanism, implements it in Python, and uses the TensorFlow framework to build the model. e experimental hardware device is configured as an 11th Gen Intel(R) Core(TM) i7-1165G7 with 16G memory. According to the prediction module in the model process, the model parameters are first initialized. e number of nodes in the input layer is set to three according to the number of auxiliary prediction variables (date, hour, subway station type), and the number of hidden layers of the BiGRU-AM network is selected to be two. Adam is used to continuously adjust the hyperparameters of the BiGRU-AM model in the iterative process until the error value is small.
In the process of inbound passenger flow prediction for the first station cluster, the Root Mean Square Error (RMSE) is used to evaluate the prediction accuracy, and its expression is shown in equation (21). e smaller the e number of such stations is small, but the passenger flow is large. e morning and evening peaks of passenger flow are obvious, and the inbound volume in the evening peak is higher than in the morning peak. e outbound volume in the morning peak reaches an obvious peak, and the inbound volume is still large after 20 : 00.
Most of these stations are subway transfer stations, railway stations, airport transportation hubs, or located in the city center, which belong to the city's core business district. After the evening peak, the passenger flow is still relatively large, which further proves that the commercial development around these stations is relatively good, and more people work overtime at night and participate in entertainment activities.

Cluster 2 57
e number of such sites is large, and the traffic is low. Compared with other categories, the morning and evening peaks of cluster 2 are not very prominent, and the overall passenger flow is relatively low.
Such stations are mostly located in underdeveloped areas, where buildings are mostly nonresidential, nonoffice, or in scenic spots. e areas where such stations are located are primarily residential, and some are located in the outer suburbs, with many houses and relatively low house prices. erefore, many citizens go to work during the morning rush hour and leave work during the evening rush hour.

Cluster 4 14
e overall passenger flow of such stations is low, and the inbound volume in the morning peak is the most obvious.
is station is also surrounded by residential areas, similar to the third cluster, but the overall passenger flow is low. RMSE, the higher the prediction accuracy. Figure 10 shows the effect of setting the number of neurons in the two layers in BiGRU on the prediction accuracy of each combined component. It can be seen that when the number of neurons in the two layers is 13 and 11, respectively, the overall predicted RMSE reaches the minimum. e same method is used for the number of neurons in the BiGRU layer of other clusters of inbound and outbound passenger flow prediction models. Next, choose the optimal learning rate with the optimal number of neurons already determined. To simplify the model, a unified learning rate is determined separately in the prediction process of inbound passenger flow and outbound passenger flow. Taking the inbound passenger flow as an example, Figure 11 shows the variation of the inbound passenger flow prediction accuracy with the learning rate for each station cluster. It can be seen that the RMSE is the smallest when the learning rate is 0.006. Similarly, it can be found that the optimal learning rate for inbound passenger flow prediction should also be set to 0.006. is paper sets the step size to 20 and the batch size to 110. According to experience, the number of neurons in the attention layer is set to 64. In this process, the optimal number of iterations for inbound and outbound passenger flow and the number of neurons in the two GRU layers in BiGRU are shown in Tables 4 and 5, respectively. Among them, the number of iterations is the average number of iterative predictions for each IMF combination, and the ratio of the training set and test set is 9 : 1.
In order to verify the validity of the BiGRU-AM model in the prediction of subway passenger flow, experiments are   Figure 12. Figure 12 shows that the model's prediction results for Comp1∼Comp3 are close to the true value, while the prediction for Comp4 is slightly different from the true value. It further illustrates the advantages of decomposing the passenger flow series into more trending components. In later chapters, we will further evaluate the model with more accurate metrics.     number of iterations  499  421  437  322  323  BiGRU hidden units 1  13  16  22  16  18  BiGRU hidden units 2  11  14  21 14 17  where n is the number of test set data; y i and y i are the actual and predicted values at a time, respectively; y is the sample mean. Among them, MAPE represents the difference between the actual value and the predicted value as a percentage of the predicted value itself, RMSE represents the average difference between the predicted value and the actual value, and MAE represents the average absolute difference between the predicted value and the actual value. e smaller these three indicators are, the better the model prediction effect is. R 2 represents the proportion of the total variation of predicted passenger flow that can be explained by independent variables such as station location, entry, and exit time. e larger the R 2 , the better the prediction effect of the model. Figure 13 compares the forecasting effects of different models for the first cluster of subway station inbound passenger flow in a short period of time. It can be seen from the figure that, compared with other models, the CEEMD-BIGRE-AM model has the highest fitting degree to the passenger flow sequence and is closer to the actual value. In comparison, other models are more different from the actual value. e basic hyperparameters of the comparison models in this paper are the same as those of the main model, and the settings mentioned in Section 5.4 are the same.

Predictive Model Comparative
Next, we will use the four model evaluation indicators MAPE, RMSE, |R 2 , and MAE to comprehensively analyze the prediction accuracy of different models. e prediction errors of different models for inbound and outbound passenger flow are shown in Tables 6 and 7.
It can be seen from the comparison that the CEEMD-BiGRU-AM model outperforms the other five models on the four error evaluation indicators of MAPE, RMSE, MAE, and R 2 . For predicting each type of passenger flow, the model established in this paper has reached the minimum in MAPE, RMSE, and MAE indicators and the maximum in R 2 . Regardless of whether the overall passenger flow of the station is large, the morning and evening peak passenger flow is large, or the passenger flow is small.
(1) BP neural network easily falls into local optimum, resulting in problems such as gradient disappearance. Although LSTM solves this problem, its hidden layer only considers the impact of historical information on the current input, and GRU as a simplified version also has this problem. (3) e data becomes more stable after using the CEEMD-SE model to decompose and reconstruct the passenger flow time series data, and the prediction accuracy is further improved. In terms of the RMSE index, the station's entry and exit prediction accuracy are reduced by more than 40% on average compared to BP and LSTM and increased by more than 0.8% on average.

Statistical Testing of Predictive Models.
Considering that the above four evaluation indicators may have generalization and randomness, this paper further uses the statistical test method to verify the model's overall prediction effect. In order to verify the overall statistical significance of the combined model proposed in this paper, the Friedman test, Nemenyi test [37], and paired t-test [38] were used to verify the differences between different models. e Friedman test can determine whether different algorithms perform significantly on multiple datasets without the need for normality assumptions. In this section, this paper uses the inbound and outbound traffic of each type of passenger flow as a whole to consider 10 sets of data sets and studies the differences between different models on the four measurement indicators MAPE, RMSE, MAE, and R 2 , under the condition of a significance level α � 0.05. e null hypothesis is that there is no significant difference in the performance of the five algorithms, and the alternative hypothesis is that the performance of the five models is significantly different. It is necessary to sort each algorithm and its corresponding evaluation index corresponding to each data, which set it as 1, 2, 3, 4, and 5 from the 1st to the 5th.
Due to the small sample size, it cannot be calculated directly using the chi-square statistic, but the F statistic is constructed from it: k is the number of algorithms to be compared, which is 5 here; N is the number of datasets, 10 here. r i is the average ordinal value of the i-th algorithm. τ χ 2 is the chi-square statistic. τ F is the F statistic and obeys the F distribution with k-1 and (k-1) (N-1) degrees of freedom. e F-test here has a critical value of F 0.05 (4, 36) � 2.634 and a significance level of 0.05. rough calculation, the pvalues (probability that the test statistic is greater than the critical value) corresponding to different metrics are shown in Table 8.
It can be found that, for all indicators, the p-values of all indicators are much less than 0.01. erefore, the null hypothesis is rejected, and the prediction effects among the five models are considered to be significantly different.
On this basis, it is necessary to continue to compare whether there are significant differences between the two models. In this regard, the Shapiro-Wilk test [39] is firstly performed on the evaluation index corresponding to each algorithm to judge whether it conforms to the normality assumption. e null hypothesis for this test is that the data follow a normal distribution. So if the p-value is less than 0.05, the null hypothesis is rejected, indicating that the data are not from a normal distribution. Its test statistic is x (i) is the i-th smallest sample value in the data; x is the average value of the sample; x i is the i-th sample value; a i is a constant that meets certain conditions, which will not be repeated here. Next, using the Shapiro-Wilk test, it can be seen that, among the four indicators, only the five models corresponding to the RMSE after logarithmic transformation do not reject the assumption of normal distribution. e pvalues of the tests are shown in Table 9.
As can be seen in Table 9, most of the models have p-values less than 0.05 for MAPE, MAE, and metrics. erefore, we performed a further test using Nemenyi's test, which does not require normality to be assumed. e null hypothesis is that there is no significant difference between the two models, and the alternative hypothesis is that there is a significant difference between the two models. e statistic is q α is the critical value of α of the Tukey distribution, where α � 0.05 is taken and the value is 2.728. k is the number of algorithms to be compared, which is 5. N is the number of datasets, 10 here. e critical value is 1.9290 by calculation, and the average sequence value of the five models is shown in Table 10.
e difference in mean ordinal values between the two pairs is shown in Figure 14.
In the case of a small amount of data, the Nemenyi test rejected the null hypothesis that the two models are inconsistent. When the average ordinal difference between the two models is only greater than the critical value of 1.929, the null hypothesis is rejected, and there is a significant difference between the two contrasting models. It can be seen from Figure 14 that the combined model established in this paper is significantly different from other models, and the differences between the BiGRU-AM model and BP and LSTM are also significant, while other models cannot directly reject the null hypothesis.
en, for the ln(RMSE) that meets the normality assumption, a paired t-test with the same null hypothesis is performed between the two models, and its statistic is    d is the mean of the sample differences; N is the sample size of 10; s is the standard deviation of the sample differences. e statistic (25) follows a t-distribution with 9 degrees of freedom. It is obtained that the significance level is α � 0.05 the p-value of the paired t-test between the models, as shown in Table 11.
At a significance level α � 0.05, the above test can prove the following: (1) e model after comprehensively using CEEMD and AM to improve BiGRU is significantly different from other comparable models. (2) BiGRU already has a significant difference compared with BP and LSTM and has a significant difference with BiGRU-AM under the significance level α � 0.05. (3) Combining the results of Tables 9 and 10, it can be found that the model established in this paper is significantly better than other comparative models. AM and CEEMD have a noticeable improvement effect on the BiGRU model.

Conclusion
is paper proposes a clustering algorithm based on the GMM and WOA. On this basis, CEEMD decomposes the passenger flow sequence and then uses the AM-optimized BiGRU model to predict the short-term passenger flow. is paper uses the AFC data of some lines of the Shanghai Metro from April 1, 2015, to April 24, 2015, as a calculation example. is paper draws the following conclusions: (1) Using WOA to optimize the parameters of the GMM, the stations are divided into 5 categories according to their spatiotemporal characteristics to save computational costs. (2) Use the CEEMD method combined with SE to stabilize the time series and decompose and denoise the time series of inbound and outbound passenger flow in this method. e actual measurement example shows that, compared with the BiGRU-AM model without decomposition and noise reduction, the method improves the model's measurement of MAPE by an average of 49.92%.
(3) Using BiGRU as the prediction model's main body and the attention mechanism to capture local features of long-term sequences, the model reduces the MAPE indicators by 31.19%, 52.18%, and 27.71%, respectively, compared to BP, LSTM, and BiGRU.
is paper predicts the hourly inbound and outbound passenger flow of subway stations in different geographic locations.
6.1. Future Work. Some further research work needs to be considered for the current research framework. For example, a model is established based on the structure of the subway station network to form a more complex spatiotemporal system to improve the efficiency of this theoretical framework and broaden its application field, but it may increase the time cost. Of course, further verification of this theoretical approach is needed.

Data Availability
e data used to support the findings of this study are included within the article.

Conflicts of Interest
e author declares that there are no conflicts of interest.