Flight Delay Classification Prediction Based on Stacking Algorithm

With the development of civil aviation, the number of flights keeps increasing, and flight delay has become a serious and increasingly routine problem. This paper aims to show that the Stacking algorithm has advantages in airport flight delay prediction, especially with respect to the algorithm selection problem in machine learning. The principle of the Stacking classification algorithm is introduced, the SMOTE algorithm is selected to process imbalanced datasets, and the Boruta algorithm is utilized for feature selection. The first-level learner of Stacking contains five supervised machine learning algorithms: KNN, Random Forest, Logistic Regression, Decision Tree, and Gaussian Naive Bayes. The second-level learner is Logistic Regression. To verify the effectiveness of the proposed method, comparative experiments are carried out on Boston Logan International Airport flight datasets from January to December 2019. Multiple indexes are used to comprehensively evaluate the prediction results, including Accuracy, Precision, Recall, F1 Score, ROC curve, and AUC Score. The results show that the Stacking algorithm not only improves the prediction accuracy but also maintains great stability.


Introduction
Airports are significant nodes of air transportation. The number of airport flight delays has been increasing in recent years. The Federal Aviation Administration defines a flight as delayed when it arrives or departs more than 15 minutes later than scheduled. In 2019, the arrival delay rate was 19.2% and the departure delay rate was 18.18% in the United States [1]. Flight delays cause many negative effects, such as passenger inconvenience, increased airport pressure, and airline losses [2]. Effective flight delay prediction could support flight planning and emergency plan formulation, reduce economic loss, and alleviate the negative impact. The Bureau of Transportation Statistics records nationwide flight operation data in the United States, which provides valuable and reliable datasets for studying flight delay issues. Meanwhile, with the development of artificial intelligence, machine learning technology has been widely used in airport flight delay prediction. Machine learning involves multiple disciplines, such as probability, statistics, and computer science [3]. Machine learning can break the limitations of mathematical formulas and improve the accuracy of flight delay prediction. In general, machine learning technology can be roughly divided into supervised learning, unsupervised learning, deep learning, reinforcement learning, and ensemble learning. Each of these learning methods has its own characteristics, so appropriate methods and algorithms must be selected for the research at hand. Poorly performing algorithms not only fail to produce accurate results but also waste computing power. Therefore, algorithm selection is an important process in machine learning. This paper aims to provide an applicable flight delay classification prediction method, especially for solving the algorithm selection problem.
Many scholars have studied the flight delay issue with different machine learning methods. Esmaeilzadeh and Mokhtarimousavi used a support vector machine to mine the nonlinear relationship between flight delay and various features. Given the black-box nature of machine learning, a sensitivity analysis of the dependent and independent variables was conducted, and weather factors, airport scene operation, demand, and other factors were comprehensively considered. This research provided a new idea for studying the causes of flight delay [3]. Kalyani et al. proposed a flight arrival delay classification model based on XGBoost and a flight arrival delay regression model based on linear regression. As one of the most widely used algorithms in the machine learning field, linear regression has the advantages of a simple principle and easy application, while XGBoost is an ensemble learning algorithm based on the Decision Tree, whose results can be improved by tuning its hyperparameters [4]. Zhang and Ma established a flight delay prediction model based on the Catboost algorithm, and the prediction accuracy reached 0.77. The SHAP value was used to analyze the contribution of each feature [5]. Khaksar and Sheikholeslami developed a hybrid method combining the J48 Decision Tree with K-means, trained it on flight datasets from the United States and Iran, respectively, compared it with four algorithms, and obtained the best results with the hybrid method [6].
When utilizing machine learning techniques, most scholars train multiple machine learning algorithms on the same datasets and identify the optimal algorithm and prediction result by comparing evaluation indexes [7,8]. Moreover, with the development of machine learning technology, the variety of algorithms is increasing, and most scholars tend to use at least three algorithms in one study. Henriques and Feiteira presented a classification model for Hartsfield-Jackson International Airport that utilized Decision Tree, Random Forest, and Multilayer Perceptron. The Multilayer Perceptron provided the highest accuracy [9]. Choi et al. tried two supervised learning algorithms, Decision Tree and KNN, and two ensemble learning algorithms, Random Forest and Adaboost, and the results showed that the ensemble classifiers outperformed the single-algorithm classifiers [10]. Stefanovič et al. took Lithuania Airport flight delay datasets as the research object and selected seven machine learning algorithms, including probabilistic neural network, multilayer perceptron neural network, Gradient-Boosted Tree, and Decision Tree; the Gradient-Boosted Tree obtained the best results [11]. The above studies are inspirational, but most of them obtain one optimal model through model comparison while the other models are discarded, which wastes computing power. In addition, flight datasets are enormous and varied, and the stability of an algorithm is significant for real-world applications. However, most studies did not pay attention to algorithm stability, especially for some novel algorithms. In this study, we build a flight delay classification model based on Stacking and design experiments to verify the stability of Stacking. Flight delay prediction methods based on machine learning technology are gradually becoming mature.
However, one core process that is often neglected in previous studies is feature selection [12]. Feature selection is an essential step in machine learning [13]. The main purpose of feature selection is to remove redundant features and improve model efficiency by calculating feature importance. Onan and Korukoglu presented a feature selection model based on the ensemble method. Their experimental results show that the proposed method not only processed complex features effectively but also improved the classification accuracy [14]. In addition, considering weather information could effectively improve the prediction accuracy [15], but exact weather information might not be available until a few hours before the flight. Therefore, weather features are not included in this research for the time being.
The rest of this paper is organized as follows. Section 2 elaborates the research methods and principles used in this study, including the Stacking classification algorithm, the SMOTE algorithm, the Boruta algorithm, and several evaluation indexes. Section 3 describes the data sources and the data preprocessing method. Section 4 discusses comparative experiments and comprehensively evaluates the prediction results through Accuracy, Precision, Recall, F1 Score, ROC curve, and AUC Score. In Section 5, the conclusions and expectations of this research are discussed.

Stacking Classification Methods.
Stacking methods are derived from the idea of ensemble learning based on combinations of learners [16]. A Stacking learner usually contains two levels: the first-level learner consists of multiple base learners trained on the same dataset, and their predicted outputs form a new dataset that is passed to the second-level learner [17]. To avoid overfitting, cross-validation can be used when training the first-level learner, and we select the k-fold cross-validation method in this paper [18]. The main process of Stacking is shown in Figure 1. The initial dataset is divided into a training dataset $D_{ta}$ and a testing dataset $D_{ts}$, and the training dataset $D_{ta}$ is then divided into k subdatasets $D_{ta1}, D_{ta2}, \ldots, D_{tak}$. In k-fold cross-validation, each Model i is trained k times: each subdataset serves as the held-out test subset in turn, and the remaining subdatasets are used for training. For each model, the k prediction results $T_{ir}$ ($r = 1, 2, \ldots, k$) are combined, and together they form a new training dataset $N_{ta}$, which is passed to the second-level learner.
When k-fold cross-validation is carried out in the first-level learner, every time Model i is trained on part of the training dataset $D_{ta}$, it also predicts the testing dataset $D_{ts}$. Therefore, k prediction results $R_{ik}$ are obtained for the same testing dataset $D_{ts}$. When solving a regression problem, the averaging method is usually adopted to process the k prediction results. In the classification problem, the processing of the prediction results is shown in Figure 2.
In machine learning, a binary classifier first outputs the probabilities of the positive and negative classes. The class with the higher probability is assigned to the data sample, and the two probabilities sum to 1. In Stacking classification, Model i predicts that the probability of a data sample being positive is $P(p) = (p_1 + p_2 + \cdots + p_k)/k$ and the probability of it being negative is $P(n) = (n_1 + n_2 + \cdots + n_k)/k$. Thus, the prediction results $R_i$ of each Model i on the testing dataset $D_{ts}$ form a new testing dataset $N_{ts}$ for the second-level learner. The second-level learner can use a relatively simple algorithm; it is trained on the new training dataset $N_{ta}$ and tested on the new testing dataset $N_{ts}$.
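The construction of the new datasets for the second-level learner can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: features are one-dimensional, the two first-level learners (`centroid_learner` and `prior_learner`) are hypothetical toy stand-ins for the five algorithms named above, the folds are interleaved rather than shuffled, and the function name `stack_features` is ours. Every fold is assumed to leave both classes in the training part.

```python
from statistics import mean


def centroid_learner(train_x, train_y):
    """Toy first-level learner (hypothetical): P(positive) from distance to class means."""
    c_pos = mean(x for x, y in zip(train_x, train_y) if y == 1)
    c_neg = mean(x for x, y in zip(train_x, train_y) if y == 0)

    def predict_proba(x):
        d_pos, d_neg = abs(x - c_pos), abs(x - c_neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return predict_proba


def prior_learner(train_x, train_y):
    """Toy first-level learner (hypothetical): always predicts the positive rate."""
    rate = sum(train_y) / len(train_y)
    return lambda x: rate


def stack_features(learners, x_train, y_train, x_test, k):
    """Build the second-level training set (N_ta) and testing set (N_ts).

    Each learner contributes one column: out-of-fold probabilities for the
    training rows, and the average of its k test predictions for the test rows.
    """
    n = len(x_train)
    folds = [list(range(r, n, k)) for r in range(k)]  # interleaved folds (assumption)
    n_ta = [[0.0] * len(learners) for _ in range(n)]
    n_ts = [[0.0] * len(learners) for _ in x_test]

    for i, make_learner in enumerate(learners):
        test_runs = []
        for held_out in folds:
            held = set(held_out)
            fit_x = [x_train[j] for j in range(n) if j not in held]
            fit_y = [y_train[j] for j in range(n) if j not in held]
            model = make_learner(fit_x, fit_y)
            # Out-of-fold prediction: each training row is scored by a model
            # that never saw it during fitting.
            for j in held_out:
                n_ta[j][i] = model(x_train[j])
            # The same fold model also predicts the whole testing dataset.
            test_runs.append([model(x) for x in x_test])
        # Average the k probability predictions, as in P(p) = (p1 + ... + pk) / k.
        for t in range(len(x_test)):
            n_ts[t][i] = mean(run[t] for run in test_runs)
    return n_ta, n_ts
```

The rows of `n_ta`, paired with the original training labels, would then train the second-level learner, and the rows of `n_ts` would be used for testing it.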

Imbalanced Datasets Processing.
Imbalanced datasets are one of the common problems in machine learning classification. The imbalance is mainly reflected in the fact that the number of samples belonging to a certain category in the dataset is far greater than that of other categories. To improve accuracy, most classification algorithms tend to identify minority class samples as majority class samples when trained on imbalanced datasets. Although such a classifier can achieve a certain accuracy, it has little practical applicability [19]. The flight delay datasets in this paper are typical imbalanced datasets: the data volume of on-time flights is nearly four times that of delayed flights (3.78 : 1).
Oversampling and undersampling are the commonly used techniques to deal with imbalanced datasets [20]. The main idea of both techniques is to rebalance the sample sizes. Undersampling achieves balance by reducing the number of majority samples, while oversampling achieves balance by increasing the number of minority samples.
In this paper, the SMOTE (synthetic minority oversampling technique) algorithm is selected to process the imbalanced datasets [21]. The SMOTE algorithm is an oversampling technique based on the KNN algorithm. It improves on simple random oversampling, which increases the sample size merely by copying minority samples at random, and can thereby avoid overfitting and effectively improve the generalization ability of the model. The main process of the SMOTE algorithm is as follows:
(1) The Euclidean distance is calculated from each minority sample $x$ to the other minority samples.
(2) The sampling rate is set according to the difference between the minority sample size and the majority sample size, and neighbors are randomly selected from the k nearest neighbors of the minority sample $x$.
(3) Between the minority sample $x$ and a selected neighbor $x_i$, new samples are synthesized according to the sampling rate set in Step (2):
$$x_{new} = x + \mathrm{rand}(0, 1) \times (x_i - x) \tag{1}$$
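The three SMOTE steps can be sketched in a few lines. This is a simplified illustration under assumptions, not the implementation used in this paper: the function name `smote`, the parameter `n_new` (the number of synthetic samples requested, standing in for the sampling rate), and the random choice of the base sample are ours, and the interpolation uses the standard SMOTE form x + rand(0, 1) × (x_i − x).

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Step 1: Euclidean distance from x to every other minority sample.
        others = [s for s in minority if s is not x]
        others.sort(key=lambda s: math.dist(x, s))
        # Step 2: pick one neighbor at random among the k nearest.
        x_i = rng.choice(others[:k])
        # Step 3: interpolate a new point on the segment between x and x_i.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_i)])
    return synthetic
```

To balance a dataset, `n_new` would typically be set to the difference between the majority and minority sample sizes. Because each synthetic point lies between two existing minority samples, the new samples are not exact copies, which is the property that distinguishes SMOTE from random duplication.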

Features Selection.
Feature selection is one of the core contents of machine learning, which aims to eliminate redundant features, improve model accuracy, and reduce operation time.
The commonly used feature selection methods include Filter, Wrapper, and Embedded methods [22]. The Boruta algorithm is utilized in this research to select features. Boruta is a wrapper feature selection algorithm based on Random Forest. The importance of each feature to the dependent variable is calculated to determine whether the feature should be retained. The main process of the Boruta algorithm is as follows:
(1) Establish shadow features: the values of each original feature are randomly shuffled to form a shadow feature matrix, and a new feature matrix is obtained by splicing the shadow feature matrix with the original feature matrix.
(2) The new feature matrix is fed into a Random Forest classifier for training, and the importance of each feature $v$ is output.
(3) The Z score of each original feature and shadow feature is calculated as follows:
$$Z = \frac{A_v}{S_v}$$
where $A_v$ represents the average value of feature importance and $S_v$ represents the standard deviation of feature importance.
(4) The maximum Z score among the shadow features is found and denoted as $Z_{max}$.
(5) If the Z score of an original feature is greater than $Z_{max}$, the feature is recorded as "important." Otherwise, if the Z score of the original feature is less than $Z_{max}$, the feature is marked as "unimportant" and deleted.
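Steps (1) and (3) to (5) can be illustrated with a short sketch. It is a simplification under assumptions: the Random Forest training of step (2) is not reproduced, so the per-run importance values passed to `boruta_decision` are hypothetical inputs, and the function names are ours. The full Boruta procedure also repeats these steps with statistical testing, which is omitted here.

```python
import random
from statistics import mean, stdev


def make_shadow_features(rows, seed=0):
    """Step 1: shuffle each column independently and append the copies."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    shadows = []
    for col in columns:
        shuffled = col[:]
        rng.shuffle(shuffled)  # breaks any link between the feature and the label
        shadows.append(shuffled)
    return [list(row) for row in zip(*(columns + shadows))]


def z_score(importances):
    """Step 3: Z = A_v / S_v over the importance values of repeated runs."""
    return mean(importances) / stdev(importances)


def boruta_decision(original_runs, shadow_runs):
    """Steps 4 and 5: compare each original feature with the best shadow.

    Both arguments map a feature name to its list of importance values,
    one value per Random Forest run (hypothetical inputs in this sketch).
    """
    z_max = max(z_score(values) for values in shadow_runs.values())
    return {
        name: "important" if z_score(values) > z_max else "unimportant"
        for name, values in original_runs.items()
    }
```

A shadow feature keeps the value distribution of its source column but carries no information about the label, so its importance gives a baseline for what a useless feature can score by chance.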

Evaluation Indexes.
In this paper, Accuracy, Precision, Recall, and F1 Score are calculated from the confusion matrix to evaluate the prediction results. The confusion matrix is shown in Figure 3 [23].
TP is True Positive, indicating that both the true value and the predicted value are positive, that is, the number of positive samples predicted correctly. FP is False Positive, indicating that the true value is negative, but the predicted value is positive, that is, the number of negative samples is wrongly predicted to be positive. TN is True Negative, indicating that both the true value and the predicted value are negative, that is, the number of negative samples that are correctly predicted. FN is False Negative, indicating that the true value is positive, but the predicted value is negative, that is, the number of positive samples that are wrongly predicted to be negative.
Accuracy is the ratio of correctly predicted samples to the total number of samples, and its calculation formula is as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is one of the most used evaluation indexes in classification. However, the flight delay dataset is imbalanced, that is, the sample size of on-time flights is much larger than that of delayed flights. To improve accuracy, a model tends to identify minority samples as majority samples; it can then obtain a high accuracy while its prediction of delayed samples is almost ineffective. Therefore, the predicted results also need to be evaluated by Precision, Recall, and F1 Score in the classification problem.
Precision indicates the percentage of correct predictions among the samples with a positive predicted value. The calculation formula is as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall indicates the percentage of correct predictions among the samples with a positive true value. The calculation formula is as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Precision and Recall typically trade off against each other: for a given model, raising one tends to lower the other. In this paper, Precision focuses on how many of the flights predicted as delayed were actually delayed, while Recall focuses on how many of all delayed flights were successfully predicted. Moreover, the F1 Score, as the harmonic mean of Precision and Recall, takes both into account. The calculation formula is as follows:
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Figure 4.
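The four indexes follow directly from the confusion matrix counts, and a small worked example shows why Accuracy alone is misleading on imbalanced data. The sketch below is illustrative: the function name `classification_metrics` is ours, and the counts in the usage note (378 on-time and 100 delayed flights, chosen to match the 3.78 : 1 ratio) are hypothetical, not results from this paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Accuracy, Precision, Recall, and F1 Score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

With delayed flights as the positive class, a classifier that labels every flight as on-time gives `classification_metrics(tp=0, fp=0, tn=378, fn=100)`: the Accuracy is 378/478, about 0.79, while Recall and F1 Score are both 0, so no delayed flight is detected.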

Data Acquisition and Preprocessing
Both datasets include 9 features, and the input features and descriptions are shown in Table 1.

Uniformization Processing.
To avoid the impact of differences in scale among the features in the dataset, the data are normalized in this paper. The aim is to center each feature at a mean of 0 and scale it by its range. The calculation formula is as follows:
$$X' = \frac{X - X_{mean}}{X_{max} - X_{min}}$$
where $X_{mean}$ is the mean value, $X_{max}$ is the maximum value, and $X_{min}$ is the minimum value.
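A minimal sketch of this normalization follows, assuming the formula is mean normalization, (X − X_mean) / (X_max − X_min), which is the form implied by the three quantities defined above. The function name `mean_normalize` and the sample values are hypothetical.

```python
def mean_normalize(values):
    """Center a feature column at 0 and scale it by its range."""
    x_mean = sum(values) / len(values)
    x_range = max(values) - min(values)
    if x_range == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - x_mean) / x_range for x in values]
```

For example, `mean_normalize([600, 900, 1200, 1500, 1800])` returns `[-0.5, -0.25, 0.0, 0.25, 0.5]`: the mean becomes 0 and all values fall within an interval of width 1.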

Features Selection Results.
In this research, the Boruta algorithm is utilized to select features for the departure delay dataset and the arrival delay dataset, respectively, and the results are shown in Figure 5. In the departure dataset, all features are marked as important, and CRS_DEP_TIME is the most important feature. In the arrival dataset, 8 features are estimated to be important, and Diverted is rejected. The feature importance values of the departure dataset are shown in Table 2, and those of the arrival dataset are shown in Table 3.
To explore the influence of feature importance on the prediction results, the following experiment was carried out. At first, only the most important feature is input for training, and then one feature at a time is added in order of importance until all the features are input. The results are shown in Figure 6.
In the departure dataset, when the fifth most important feature is given as input, Accuracy, Precision, Recall, and F1 Score exceed 0.8. When the sixth most important feature is given as input, the indexes decrease slightly, but the overall trend is stable without significant increase or decrease. In other words, the last four features contribute little to the prediction model, which is consistent with the Boruta feature selection results. In the arrival dataset, when the fourth most important feature is given as input, the evaluation indexes exceed 0.8 and then tend to be stable, with no significant change as further features are added. It is worth mentioning that as features are added, Recall changes from the highest to the lowest among the four indexes, while Precision changes from the lowest to the highest.

Comparison between Algorithms.
There is no "multipurpose algorithm" or "greatest algorithm" in machine learning, so it is necessary to try multiple algorithms. In this research, six algorithms are selected to train the same dataset, respectively: KNN, Random Forest, Logistic Regression, Decision Tree, Gaussian Naive Bayes, and Stacking. The experimental results are shown in Figure 7. In addition to Stacking, Random Forest also shows a strong prediction result, with all four evaluation indexes exceeding 0.8. The differences among the four indexes of KNN are larger than those of the other algorithms, but all of them reach 0.7. Meanwhile, Gaussian Naive Bayes and Logistic Regression perform relatively poorly, with the four indexes around 0.6. The ROC (receiver operating characteristic) curve measures the generalization ability of an algorithm. The AUC (area under curve) is the area under the ROC curve [24]. The closer the AUC is to 1, the better the algorithm. We output the ROC curve for each algorithm and calculate the AUC Score, and the results are shown in Figure 8. Stacking reaches 0.823 in the departure dataset and 0.821 in the arrival dataset. The result of Random Forest is similar to that of Stacking. Given this result, we consider that Random Forest contributes more to Stacking than the other algorithms do. However, if we remove Random Forest from the Stacking algorithm, will the performance of Stacking decrease? Conversely, if we remove the weakly performing Gaussian Naive Bayes, will the performance of Stacking increase? In Section 4.3, we carry out an experiment to explore the impact of strong and weak algorithms on the performance of Stacking.
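The AUC Score can be computed without drawing the ROC curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample. The sketch below uses this pairwise formulation; the function name `auc_score` is ours, and the quadratic loop is chosen for clarity rather than efficiency.

```python
def auc_score(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

For labels `[0, 0, 1, 1]` and predicted probabilities `[0.1, 0.4, 0.35, 0.8]`, three of the four positive/negative pairs are ordered correctly, so the function returns 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.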

First-Level Learners Analyses.
In the single-algorithm comparison, we find that Random Forest performs strongly, while Gaussian Naive Bayes and Logistic Regression perform poorly. In this section, one algorithm is removed in turn to figure out how strong or weak algorithms affect the Stacking prediction results. The results are shown in Tables 4 and 5. Overall, there is no significant difference between the six groups with different first-level learners. In both the departure dataset and the arrival dataset, the four evaluation indexes are similar among the six scenarios; only the Recall and F1 Score of the third scenario decrease below 0.8. The overall accuracy is shown in Figure 9. The prediction accuracy is around 0.8, which is close to the result of the full Stacking model. It can be concluded that the Stacking algorithm not only ensures the prediction accuracy but also maintains great stability. Random Forest has a strong performance, but when we remove Random Forest from the first-level learner, the model still achieves good prediction results. As mentioned before, there is no "multipurpose algorithm" or "greatest algorithm" in machine learning. Therefore, the Stacking algorithm could be a good solution to the algorithm selection problem, especially for enormous and complex datasets like flight datasets.
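The set of first-level learner combinations for such an experiment can be enumerated programmatically. The sketch below assumes that the six scenarios are the full set of five learners plus one scenario per removed learner; this reading, the scenario labels, and the function name `ablation_scenarios` are our assumptions and are not stated in the text.

```python
FIRST_LEVEL_LEARNERS = [
    "KNN",
    "Random Forest",
    "Logistic Regression",
    "Decision Tree",
    "Gaussian Naive Bayes",
]


def ablation_scenarios(learners):
    """Return (label, learner list) pairs: the full set, then each leave-one-out set."""
    scenarios = [("all learners", list(learners))]
    for removed in learners:
        kept = [name for name in learners if name != removed]
        scenarios.append((f"without {removed}", kept))
    return scenarios
```

Each learner list would then be used to build one Stacking model, and the four evaluation indexes would be compared across the scenarios.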

Conclusion
In this research, we propose a flight delay classification prediction method based on the Stacking algorithm. The SMOTE algorithm is introduced to process the imbalanced datasets, and the Boruta algorithm is utilized to select input features. The Logan International Airport flight data from 2019 are collected to carry out comparative experiments, and the Accuracy, Precision, Recall, and F1 Score are above 0.8. The main contributions are as follows:
(1) The Boruta algorithm is used to select features. Feature selection is an essential process when utilizing machine learning technology. According to Section 4.1, the comparative experimental results are consistent with the Boruta feature selection results, which verifies the effectiveness of the Boruta algorithm. The importance values of the 9 features are obtained based on the Random Forest classifier, and the experiments are designed to input the features into the model in order of their importance. In the departure dataset, all features are confirmed, while Diverted is rejected in the arrival dataset.
(2) A flight delay classification prediction method based on Stacking is proposed in this study. The first-level learner includes KNN, Random Forest, Logistic Regression, Decision Tree, and Gaussian Naive Bayes, and the second-level learner utilizes Logistic Regression. To distinguish the contributions of the five first-level learners, each learner is also trained separately on the same dataset. The results show that Random Forest has the best individual performance, which is similar to that of Stacking.
(3) The main aim of this study is to explore the stability of the Stacking algorithm. Stacking is a combination of different algorithms with different performances. In Section 4.3, we design an experiment to verify how strong or weak learners affect the Stacking performance. The experimental results show that whether strong learners or weak learners are removed, the overall accuracy of Stacking shows no obvious difference. Therefore, we believe that Stacking provides a reliable solution for algorithm selection in machine learning applications, especially for enormous and complex datasets like flight datasets.
In future research, other machine learning technologies can be utilized to study flight delay prediction. Moreover, close attention can also be paid to the influence of weather on flight delay. In this research, we do not add exact weather-related features to the prediction model, but that does not mean weather influence is unimportant. On the contrary, we believe that studying the influence of weather on flight delays is a significant and complex issue. We will focus more on establishing reasonable features to measure the impact of weather on flight delays, especially for high-impact weather, and use machine learning correlation analysis technology to explore the relationship between weather and flight delay.
Data Availability
The flight dataset used in this paper is from the Bureau of Transportation Statistics website (https://www.transtats.bts.gov/homepage.asp).

Conflicts of Interest
The authors declare that they have no conflicts of interest.
Figure 9: The accuracy of different first-level learners.
Journal of Advanced Transportation