Airport Arrival Flow Prediction considering Meteorological Factors Based on Deep-Learning Methods

This study presents a combined Long Short-Term Memory and Extreme Gradient Boosting (LSTM-XGBoost) method for flight arrival flow prediction at the airport. Correlation analysis is conducted between the historic arrival flow and input features. The XGBoost method is applied to identify the relative importance of various variables. The historic time-series data of airport arrival flow and selected features are taken as input variables, and the subsequent flight arrival flow is the output variable. The model parameters are sequentially updated based on the recently collected data and the new predicting results. It is found that the prediction accuracy is greatly improved by incorporating the meteorological features. The data analysis results indicate that the developed method can characterize well the dynamics of the airport arrival flow, thereby providing satisfactory prediction results. The prediction performance is compared with benchmark methods including backpropagation neural network, LSTM neural network, support vector machine, gradient boosting regression tree, and XGBoost. The results show that the proposed LSTM-XGBoost model outperforms baseline and state-of-the-art neural network models.


Introduction
The airport is the terminal for aircraft taking off and landing. It is also the transfer point for passenger distribution. The daily air traffic flow has strong periodicity and randomness. Many factors influence the airport arrival flow, among which the most widely acknowledged are complex meteorological factors, for example, short-term changes in arrival flow caused by severe weather such as thunderstorms in summer and blizzards in winter, as well as unfavorable weather conditions that may affect visibility [1,2]. Real-time and high-precision arrival flow prediction at the airport is of great significance for identifying similar patterns, implementing passenger evacuation strategies, alleviating airport congestion, and improving air transportation management systems [3][4][5]. It can also help passengers make better travel mode choices. Therefore, it is necessary to take meteorological factors into account when forecasting the short-term arrival flow at the airport.
the Bay Area of California. A multikernel convolutional layer is designed to maintain the network structure and extract short-term and spatial patterns [13]. Li et al. proposed an adaptive real-time prediction model under uncontainable conditions. The model consists of two stages: an online sequential extreme learning machine with a forgetting factor for noise processing and a hidden Markov model for traffic flow prediction [14].
Compared with highway traffic flow prediction, the short-term prediction of airport arrival flow tends to be more complicated, owing to the stochastic and dynamic nature of air traffic flow and the various influencing factors such as weather conditions [15][16][17]. The short-term prediction of air traffic flow therefore remains an active research topic. Although different statistical approaches have been used in past studies, each has suggested that there are meaningful relationships between various input variables and traffic flow rate [18][19][20]. Further development is still needed to advance the predictive aspects of the linkage between airport arrival flow and the input variables, including meteorological variables, and then to predict future arrival flow using data mining techniques. The primary objectives of this paper are, first, to discover whether there are significant relationships between airport arrival flow and various meteorological variables; second, to identify which factors can be used as inputs to estimate airport arrival flow; and third, to select an appropriate model that can predict the airport arrival flow with decent performance. To this end, the correlation between historic arrival flow and various features is calculated. Then, a combined Long Short-Term Memory and Extreme Gradient Boosting (LSTM-XGBoost) method is proposed for airport arrival flow prediction. The selected features, including meteorological variables, are input into the network. The rest of the paper is organized as follows. Section 2 illustrates the data collection and preparation procedure. Section 3 presents the proposed framework incorporating the long short-term memory neural network and the extreme gradient boosting algorithm components. Section 4 describes the data analysis results by comparing the performance of the proposed method with that of commonly used benchmark methods. Section 5 discusses the conclusions and future work.

Data Preparation
To meet the research objective, the airport performance data and various factors required in the data mining procedure are collected. The data sources for analysis can be divided into two categories: flight arrival data and airport meteorological information.

Flight Arrival Data.
This paper uses the flight arrival data of Nanjing Lukou International Airport (NKG) from January 1, 2018, to December 31, 2018, with a total of 113,243 records for data extraction and analysis. The flight information includes flight ID, aircraft type, departure airport, destination airport, estimated departure time, estimated arrival time, actual departure time, actual arrival time, and the status of the flight for that day. The daily flight information is divided into 48 records, each covering a 30 min interval. From the flight information provided, the flight date, the planned and actual arrival times of the aircraft, and the final status of the flight are used to calculate the planned and actual flow of each time slice of the day. Flights canceled or changed on that day are excluded. Figure 1 illustrates the daily arrival and canceled flights in 2018. The number of flight arrivals fluctuates periodically, while the number of canceled flights tends to be stochastic and unscheduled. In addition to canceled flights, other cases may cause a difference between the scheduled and actual flight counts, namely, changes of flight routes, diversions to alternate airports, and missing values. For the 30 min data records, the difference between the scheduled flight counts and the actual flight counts ranges from 0.014 to 6.803 with a mean value of 2.027, which accounts for 17.56% to 88.47% with an average of 34.94%.
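The slicing of a day into 48 half-hour records can be sketched as follows. This is a minimal illustration, not the authors' processing code: the `(actual_arrival, status)` record layout, the `"canceled"` status label, and the function names are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

SLOT_MINUTES = 30
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 48 time slices per day


def slot_index(ts: datetime) -> int:
    """Return the 0-based half-hour slice (0..47) that a timestamp falls in."""
    return (ts.hour * 60 + ts.minute) // SLOT_MINUTES


def arrivals_per_slot(flights):
    """Count actual arrivals per (date, slice), skipping canceled flights.

    `flights` is an iterable of (actual_arrival, status) pairs; this layout
    and the "canceled" label are hypothetical.
    """
    counts = Counter()
    for actual_arrival, status in flights:
        if status == "canceled" or actual_arrival is None:
            continue
        counts[(actual_arrival.date(), slot_index(actual_arrival))] += 1
    return counts


flights = [
    (datetime(2018, 6, 28, 10, 5), "arrived"),
    (datetime(2018, 6, 28, 10, 29), "arrived"),
    (datetime(2018, 6, 28, 10, 30), "arrived"),
    (None, "canceled"),
]
counts = arrivals_per_slot(flights)
```

The same function applied to the estimated arrival times would give the planned flow per slice, so that planned and actual counts share one index.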

Airport Meteorological Information.
The airport meteorological information comes from OGIMET [21], which provides local weather conditions. Data from the Meteorological Report of Aerodrome Conditions (METAR) of Nanjing airport in 2018 are collected, including the four-character code of the airport, UTC time, wind direction, wind speed, wind gusts, temperature, dew point temperature, visibility (runway visual range), air pressure, cloud height, cloud cover, humidity, pressure, and weather phenomena such as precipitation, thunderstorm, fog, snowfall, and haze. Variables describing weather phenomena are set as dummy variables. Taking rainfall as an example, 1 indicates the presence of rainfall and 0 indicates no rainfall. The collected METAR messages are summarized. Table 1 presents partial data of the real-time meteorological indicators of Nanjing Lukou International Airport from 10:00 to 14:00 on June 28, 2018, for illustration.
As the METAR information is issued roughly hourly, interpolation is used to obtain meteorological data at 30 min granularity, matching the flow data of 48 time slices per day. Because the meteorological information includes not only continuous factors such as wind speed, temperature, and visibility but also discrete factors such as rain, snow, and thunderstorm, piecewise linear interpolation is applied to the hourly continuous meteorological data, while the weather phenomena are regarded as constant within the current one-hour period. Figure 2 illustrates the daily arrival flights as well as the occupied time duration of rain and thunderstorm of NKG in May 2018.
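A minimal sketch of this resampling step is given below. It assumes, for simplicity, that the hourly observations fall exactly on the hour; the function name and the sample values are illustrative only.

```python
def to_half_hourly(hourly_values, continuous=True):
    """Expand hourly observations to a 30 min grid.

    Continuous variables receive the midpoint of the two neighbouring
    hourly values (piecewise linear interpolation); dummy variables are
    held constant for the whole hour.
    """
    out = []
    for k, value in enumerate(hourly_values):
        out.append(value)  # the on-the-hour observation itself
        if k + 1 < len(hourly_values):
            nxt = hourly_values[k + 1]
            out.append((value + nxt) / 2 if continuous else value)
    return out


temperature = to_half_hourly([20.0, 22.0, 21.0])      # continuous factor
rain = to_half_hourly([0, 1, 1], continuous=False)    # dummy variable
```

With three hourly observations, both calls return five half-hourly values; the half-hour slot after the last observation is left out because no later value is available to interpolate towards.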

Data Preprocessing.
The collected data are preprocessed by filtering, normalizing, and reconstructing, which effectively improves the convergence speed and prediction accuracy of the model. The final dataset includes one actual inflow as the output variable and twelve features, comprising eleven real-time weather features and one planned flow volume, as the input variables. All the variables are normalized using the following equation to transform them into dimensionless values ranging from 0 to 1:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

where $x'$ represents the normalized dimensionless value, $x$ represents the original value, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum of the variable. The model is calibrated using data from January to September with a total of 13,104 30 min records and then validated using data from October to December with a total of 4,416 30 min records.
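The normalization and the record counts of the chronological split can be checked with a short sketch. The handling of a constant variable (mapped to 0) is an assumption of this example, not something stated in the paper.

```python
from datetime import date


def min_max_normalize(values):
    """Scale values to the range [0, 1] using the minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant variable: assumption, map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


SLOTS_PER_DAY = 48  # 30 min records per day

# January to September 2018 for calibration, October to December for validation
train_days = (date(2018, 10, 1) - date(2018, 1, 1)).days
test_days = (date(2019, 1, 1) - date(2018, 10, 1)).days
train_records = train_days * SLOTS_PER_DAY
test_records = test_days * SLOTS_PER_DAY
```

The 273 calibration days and 92 validation days give 13,104 and 4,416 records, which matches the counts reported above.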

Methodology
In this section, a combined LSTM-XGBoost method is constructed for short-term airport arrival flow prediction. The proposed LSTM-XGBoost method contains two components, the long short-term memory neural network and the extreme gradient boosting algorithm. The methods used in each component are briefly discussed.

The LSTM Method.
LSTM is one of the important variants of Recurrent Neural Networks (RNNs). It has been shown that LSTM works well on sequence-based tasks with long-term dependencies. Compared with the traditional artificial neural network, the LSTM network realizes the combination of long-term and short-term memory by setting special structures such as the forget gate, input gate, and output gate [22]. In recent years, the LSTM method has been frequently applied in short-term prediction with good performance [23,24].
As shown in Figure 3, $x_t$ is the input variable and $h_t$ is the output variable at time $t$. $\sigma$ and $\tanh$ are the activation functions of the network, where $\sigma$ represents the sigmoid function and $\tanh$ is the hyperbolic tangent function. Their role is to introduce nonlinear transformations so that the network has stronger nonlinear expression capabilities. The data processing procedure of a unit in the LSTM network structure is as follows. First, $x_t$ is input into the network together with the output at the previous time step. Then, the long-term memory state variables are selectively remembered through the forget gate, and a new memory state variable is formed by superposing the current state with the long-term state at the previous time step through the input gate. Finally, the output variable at time $t$ is obtained from the long-term memory state variable through the output gate. In equations (2) to (6), $W_{*i}$, $W_{*f}$, $W_{*c}$, $W_{*o}$, and $b_{*}$ are learnable parameters, and $\sigma(\cdot)$ and $\tanh(\cdot)$ are two commonly used nonlinear activation functions.
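For reference, the gate description above corresponds to the standard LSTM cell update, which can be written as in the sketch below. The split of the weights into input weights $W_{x\cdot}$ and recurrent weights $W_{h\cdot}$, the symbols $f_t$, $i_t$, $o_t$, $\tilde{c}_t$, and $c_t$, and the element-wise product $\odot$ are notational assumptions; the grouping and numbering used in equations (2) to (6) may differ.

```latex
\begin{aligned}
f_t &= \sigma\left(W_{xf}\,x_t + W_{hf}\,h_{t-1} + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_{xi}\,x_t + W_{hi}\,h_{t-1} + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\left(W_{xc}\,x_t + W_{hc}\,h_{t-1} + b_c\right) && \text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(long-term memory state)}\\
o_t &= \sigma\left(W_{xo}\,x_t + W_{ho}\,h_{t-1} + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(output at time } t\text{)}
\end{aligned}
```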

The XGBoost Method.
The extreme gradient boosting (XGBoost) method is an improved method based on the Gradient Boosted Decision Tree (GBDT), proposed by Chen and Guestrin (2016) [25]. The salient features that distinguish XGBoost from other gradient boosting algorithms include clever penalization of trees, proportional shrinking of leaf nodes, Newton boosting, and an extra randomization parameter. In this paper, the XGBoost method is used to extract features and evaluate relative feature importance. The procedures are presented as follows.
For a given dataset with $n$ samples and $m$ characteristics, represented as $D = \{(m_i, y_i)\}$ ($|D| = n$, $m_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), and assuming that the XGBoost model has $K$ decision trees, the flight flow prediction model is represented as follows:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(m_i),$$

where $\hat{y}_i$ is the predicted value at time $i$; $m_i$ is the vector of input variables corresponding to $\hat{y}_i$; and $f_k$ is the prediction function corresponding to the $k$th decision tree, which is defined as follows:

$$f_k(m_i) = \omega_{q(m_i)},$$

where $q(m_i)$ represents the structure function mapping $m_i$ to the corresponding leaf node of the $k$th decision tree; $\omega$ is the weight vector of the leaf nodes; and $M$ is the number of leaf nodes in the tree. The loss function $L$ of the XGBoost algorithm includes an error term $l$ and a regularization term $\Omega$.
The prediction model is learned by minimizing the loss function. In this paper, the root-mean-square error is selected as the error term $l$, which is defined as follows:

$$l = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.$$

The regularization term prevents the model from overfitting.
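For reference, the regularized objective in the standard XGBoost formulation can be sketched as below. The penalty coefficients $\gamma$ (per leaf) and $\lambda$ (on the leaf weights) come from that standard formulation and are assumptions here, since the text above names only the error term $l$ and the regularization term $\Omega$; $M$ denotes the number of leaf nodes of a tree, as above.

```latex
\begin{aligned}
L &= \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),\\
\Omega\left(f_k\right) &= \gamma M + \frac{1}{2}\,\lambda \sum_{j=1}^{M} \omega_j^{2}.
\end{aligned}
```

A larger $\gamma$ discourages trees with many leaves, and a larger $\lambda$ shrinks the leaf weights, which is how the regularization term limits overfitting.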

The Combined LSTM-XGBoost Method.
As mentioned above, the daily air traffic flow has strong periodicity and randomness. Data analysis indicates that there are several peak periods for arrival flights: from 8:30 to 11:00, from 12:30 to 13:30, and from 17:00 to 19:00. The airport arrival flow is influenced by many external factors, among which meteorological factors are commonly recognized as potentially significant. The LSTM model has been widely used to deal with time-series problems, as it can capture the temporal correlation of time-series data. However, the traditional LSTM lacks the ability to extract the external features that may affect the predicted variables. To this end, this paper proposes an LSTM-XGBoost model, which can characterize both the temporal correlation and the influence of external characteristics. The structure of the LSTM-XGBoost model is shown in Figure 4. The input data of the LSTM cell consist of two parts, the scheduled flight flow data $z_i$ and the historic flight flow data $x_i$, which constitute the input matrix $X_i$, where $X_i \in \mathbb{R}^{2 \times T}$ and $T$ represents the prediction timestep. After the LSTM layer, the Rectified Linear Unit (ReLU) is used as the activation function to output the predicted value $\hat{y}_{T+i}$ at time $T+i$. Then, the XGBoost model is used to predict the arrival flow at time $T+i$ from the input features $m_{T+i}$, which incorporate the predicted value from the LSTM at time $T+i$ ($\hat{y}_{T+i}$) and the external meteorological characteristics $E_{T+i}$.
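The data flow between the two stages can be sketched as follows. This is only an illustration of how the inputs might be assembled: the alignment of the scheduled-flow window with the historic-flow window, the function names, and the sample values are assumptions, and the trained LSTM and XGBoost models themselves are not shown.

```python
def make_windows(scheduled, historic, T):
    """Build stage-one samples: a 2 x T input matrix and the next flow value.

    Row 0 holds T scheduled flows and row 1 the T historic flows of the same
    slices (hypothetical alignment); the target is the flow after the window.
    """
    samples = []
    for start in range(len(historic) - T):
        X = [scheduled[start:start + T], historic[start:start + T]]
        target = historic[start + T]
        samples.append((X, target))
    return samples


def stage_two_features(lstm_prediction, weather):
    """Stage-two input: the LSTM prediction followed by the weather features."""
    return [lstm_prediction] + list(weather)


scheduled = [5, 6, 7, 8, 9]
historic = [4, 6, 6, 9, 8]
samples = make_windows(scheduled, historic, T=3)

# 7.5 stands in for an LSTM output; the weather values are made up
m = stage_two_features(7.5, [0.8, 0.3, 1])
```

Each vector `m` would then be one row of the training matrix for the second-stage model, with the actual arrival flow of the same slice as its label.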

Evaluation
In the evaluation metrics, $y_i$ represents the actual value of sample $i$; $\hat{y}_i$ represents the predicted value of sample $i$; $\bar{y}$ represents the average value of the real data; and $n$ is the sample size.
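The four metrics used in the comparison (MAE, MSE, RMSE, and MAPE) can be computed as in the following sketch, which uses their standard definitions. Skipping samples whose actual value is zero when computing MAPE is an assumption of this example, since the percentage error is undefined there.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE, and MAPE (in percent) for two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE is undefined for zero actual values; they are skipped (assumption)
    nonzero = [(a, e) for a, e in zip(actual, errors) if a != 0]
    mape = 100 * sum(abs(e / a) for a, e in nonzero) / len(nonzero)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


metrics = regression_metrics([10, 20], [12, 18])
```

Because MAE, MSE, and RMSE scale with the magnitude of the flow while MAPE is relative, the two groups of metrics can rank prediction horizons differently.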

Correlation Analysis of Input Features.
As mentioned above, twelve features are collected and incorporated in the proposed model, including scheduled flights, wind speed, temperature, dew point temperature, visibility, atmospheric pressure at nautical height (QNH), cloud, rain, thunderstorm, fog, snowfall, and haze. To identify the relationship of various factors, the Pearson correlation coefficient ($r$) between actual arrival flow and the explanatory variables, as well as the correlation between different explanatory variables, is calculated. The equation is shown as follows:

$$r = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}}.$$

In this formula, $x$ is the independent variable; $y$ is the dependent variable; $\bar{x}$ is the mean of the independent variable; and $\bar{y}$ is the mean of the dependent variable. The Pearson correlation coefficient ($r$) ranges from −1 to 1 and represents the strength of the linear correlation between two variables. The results are shown in Figure 5.
As shown in Figure 5, besides scheduled flights, which are highly related, the actual flights are also positively related to visibility, wind speed, and temperature, while negatively related to fog. In addition, the visibility is positively related to temperature, wind speed, scheduled flights, and dew point temperature, while negatively related to fog, cloud, rain, QNH, and haze. It should also be noted that although thunderstorm and snowfall have a weak correlation with the other features in the current data, this does not indicate that these two factors can be excluded from consideration. On the contrary, as rare events, these extreme weather conditions may seriously affect the arrival of flights. Because temperature is highly positively related to dew point temperature and highly negatively related to QNH, these two variables (dew point temperature and QNH) are removed from the input features in the subsequent models.
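The correlation screening can be sketched as follows. The sample values are made up for illustration and the 0.8 threshold is an assumption; the paper does not state the cutoff used to call two features highly related.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def highly_correlated_pairs(features, threshold=0.8):
    """List feature pairs whose |r| reaches the (assumed) threshold."""
    names = list(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Made-up values, chosen only so that the example mirrors the text
features = {
    "temperature": [5, 12, 20, 28, 31],
    "dew_point": [1, 8, 15, 22, 25],
    "qnh": [1030, 1024, 1015, 1006, 1003],
    "visibility": [3, 9, 4, 10, 6],
}
pairs = highly_correlated_pairs(features)
```

In this toy data, temperature, dew point temperature, and QNH form the only highly correlated pairs, so keeping temperature and dropping the other two removes the redundancy while visibility is retained.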

Analysis of Variable Importance.
With the selected features, the XGBoost method is applied to identify the relative importance of various variables. The results are shown in Figures 6(a)-6(c) for the 30 min, 60 min, and 120 min prediction time horizons, respectively. Generally, the meteorological variables have a similar impact on the arrival flow in all three scenarios. The most influential feature is scheduled flights, which is consistent with common sense. The other two important features are temperature and visibility. The importance of temperature has two explanations. First, the collected data indicate that, in general, people travel more on warmer days, except for traditional holidays. Second, there are more flights in the daytime, when temperatures are higher, than at night. As for visibility, it is acknowledged that there are visibility requirements for the operation of aircraft. Flights tend to be delayed under poor visibility until conditions return to normal. The relative importance of the variables differs slightly across prediction horizons: temperature, followed by visibility, wind speed, cloud, and snow for the 30 min prediction model; visibility, temperature, wind speed, cloud, and snow for the 60 min prediction model; and visibility, temperature, wind speed, snow, and thunderstorm for the 120 min prediction model.
It is also found that the F-scores for the meteorological features are relatively low, even though extreme weather conditions may have strong impacts on the actual flight arrival rate. The collected data indicate that the difference between the actual flow rate and the scheduled flow rate fluctuates more under bad weather conditions. The reason for the small F-scores is that almost all extreme weather conditions are rare events. The feature importance is generated according to the degree of influence of the feature on the accuracy of the prediction during the process of generating the model. Besides, some weather conditions occur at specific times of the day. For example, fog usually appears in the early morning, when the arrival flow rate is lower. Thus, the calculated importance of the feature will be small according to the collected data. In addition, it is acknowledged that most of the meteorological features are associated with visibility. The impacts of these bad weather conditions are reflected to a certain extent through the visibility feature, rather than through the dummy variables for the occurrence of snow, thunderstorm, rain, haze, fog, and so on. Table 2.

Complexity
To test the performance of the proposed LSTM-XGBoost model, several benchmark methods are also evaluated and compared. The selected benchmark methods include the backpropagation (BP) neural network, LSTM neural network, support vector machine (SVM), gradient boosting regression tree (GBRT), and XGBoost, which were commonly used in previous studies of short-term traffic flow prediction.
The hyperparameters for BP and LSTM are selected in a similar way as those for the LSTM-XGBoost model. All the benchmark methods are trained and tested with the same data and input variables, so as to ensure that the models are comparable. The results are summarized in Table 3.
As shown in Table 3, for each method, six short-term arrival flow prediction models are developed, with 30 min, 60 min, and 120 min as the prediction time horizon and with two sets of input features: historic and scheduled flights alone, and historic and scheduled flights together with meteorological variables. Based on the data analysis results, the following findings can be obtained.
First, for each method, MAE, MSE, and RMSE increase sharply with the increase in prediction time horizon, while MAPE slightly decreases. Specifically, MAE and RMSE are the lowest for the 30 min prediction time horizon, as the two metrics increase with the magnitude of the original arrival flow data, while in terms of MAPE, the model exhibits the best performance for the 120 min prediction time horizon. Second, for all the five methods, the model performance can be improved by incorporating meteorological variables, especially for the 120 min prediction time horizon, indicating that these factors, especially extreme weather conditions, may have a significant impact on airport arrival flow. The improvement is the most prominent for the proposed LSTM-XGBoost method.
Third, the proposed LSTM-XGBoost method generally outperforms all the other machine learning techniques in terms of lower MAE, MSE, RMSE, and MAPE, followed by XGBoost, GBRT, and LSTM. This confirms the superiority and feasibility of the proposed model, which can successfully capture both the temporal features and influencing factors.
To further investigate the performance of the proposed model affected by various meteorological factors, the prediction accuracy of the airport arrival flow for different weather conditions is tested and compared, as shown in Figure 7.
In Figure 7, the x-axis represents the randomly selected samples with 30 min data for each sample. The y-axis represents the number of flights. The prediction results from LSTM, XGBoost, and LSTM-XGBoost methods are compared with the actual data. It is found that the proposed LSTM-XGBoost model outperforms the other two methods for all scenarios.
The results further demonstrate the robustness and applicability of the proposed model.

Conclusions
This paper proposed a combined Long Short-Term Memory and Extreme Gradient Boosting (LSTM-XGBoost) method for arrival flow prediction at the airport. The traditional Long Short-Term Memory (LSTM) network and the XGBoost model are combined by taking both the time-series information and the meteorological features into account. The Pearson correlation coefficients are calculated to describe the strength of the linear correlation between two variables, and the importance of variables is identified. The prediction results are compared with some benchmark methods, including BP, LSTM, SVM, GBRT, and XGBoost. The proposed algorithm improves the accuracy and stability of short-term airport arrival flow prediction.
Even though the proposed LSTM-XGBoost approach has exhibited great potential for short-term prediction of airport arrival flow, several limitations of this study still need to be addressed. First, this study is focused on incorporating the meteorological factors in airport arrival flow prediction. As a matter of fact, the real-time airport arrival flow is affected by a series of factors. Future research is still needed to identify the impacts of other significant variables. Second, the paper used the data from Nanjing Lukou International Airport as a case study. Data from other airports, especially those with extreme weather conditions, can also be applied to further investigate the robustness and applicability of the proposed model. The authors recommend that future studies focus on these issues.

Data Availability
The Flight Data.rar file is provided as supplementary material, containing all the flight arrival data for Nanjing Lukou Airport in 2018. The airport meteorological information is collected from OGIMET (http://ogimet.com/metars.phtml.en).

Conflicts of Interest
The authors declare that they have no conflicts of interest.