Diagnosing companies at risk of bankruptcy is extremely important for business owners, banks, governments, securities investors, and other economic stakeholders who want to optimize profitability and minimize investment risk. Many studies have addressed bankruptcy prediction using different machine learning approaches on various datasets around the world. Because bankruptcy datasets suffer from the class imbalance problem, special techniques are needed to improve prediction performance. Oversampling and cost-sensitive learning are two common methods for dealing with class imbalance, and each improves predictability when used independently. However, for datasets with very small balancing ratios, combining the two techniques produces better results. Therefore, this study develops a hybrid approach using oversampling and cost-sensitive learning, namely, HAOC, for bankruptcy prediction on the Korean Bankruptcy dataset. The first module of HAOC is an oversampling module whose balancing ratio is chosen experimentally to give the best overall performance on the validation set. The second module uses a cost-sensitive learning model, namely, the CBoost algorithm, for bankruptcy prediction. The experimental results show that HAOC gives the best performance for bankruptcy prediction compared with the existing approaches.
1. Introduction
Machine learning and data mining [1–9], the process of finding patterns in observations or data in order to make better decisions in the future based on training samples, are widely used in fields such as cybernetics [10–14], engineering [15–18], bioinformatics [19], medical informatics [20], and economics [21–27]. In economics in particular, many models aim to optimize business profits, including customer lifetime value modeling (CLVM), churn customer modeling (CCM), dynamic pricing, customer segmentation, and recommendation systems. CLVM [23] is one of the most important models for eCommerce businesses. It identifies the most valuable customers so that the business can understand and retain them, and managers can use its results to build a strategy that optimizes profitability. CCM [24] helps companies determine which customers will stop using their services. Its output, a list of customers, is an important input to an algorithmic retention strategy because it helps optimize discount offers, marketing campaigns, and other targeted marketing initiatives. Dynamic pricing models [25] price products flexibly based on factors such as the level of interest of the target customer, market demand at the time of purchase, and whether the customer has engaged with a marketing campaign. Meanwhile, customer segmentation models [26, 27] group customers into personas based on specific variations among them using clustering and classification algorithms. Recommendation systems are another major way in which machine learning proves its business value. They sift through large quantities of data to predict how likely a given customer is to purchase an item or enjoy a piece of content and then suggest those things to the user. The result is a customer experience that encourages better engagement and reduces churn.
Bank lending and systemic risk [28, 29] is another issue in the economics sector that has attracted a lot of attention. Work on this issue seeks empirical evidence against diversification as a means of reducing systemic risk.
Bankruptcy prediction is also a hot topic in business that has attracted many researchers in both computer science and economics around the world. In computer science, bankruptcy prediction is a predictive machine learning task that analyzes the financial statements of a firm to predict its future fate. Based on the results, investors and managers can devise appropriate strategies for companies that are at risk of going bankrupt. Many studies have been developed in recent years to predict firm bankruptcy using various approaches [30–32]. In 2015, Kim et al. [30] introduced an efficient boosting algorithm, namely, GMBoost, which uses the geometric mean to deal with the imbalanced data found in bankruptcy datasets. This algorithm calculates the error of the majority class and the error of the minority class separately and then uses the geometric mean of these two errors to calculate the weights for the next phase. Next, a novel approach [31] utilizing eXtreme Gradient Boosting (XGB) with synthetic features was proposed for bankruptcy prediction. In that study, each synthetic feature is generated automatically by randomly selecting two existing features and an arithmetical operation, which helps to improve the prediction performance. Recently, Barboza et al. [32] evaluated several existing classification models, including SVC (linear and RBF kernels), artificial neural networks (ANN), logistic regression, boosting, Random Forest, and Bagging, for forecasting bankruptcy. The authors trained these classifiers on a balanced bankruptcy dataset that includes 449 bankrupt firms and 449 nonbankrupt firms from 1985 to 2005. The trained models were evaluated on an imbalanced bankruptcy dataset collected between 2006 and 2013 that consists of 133 bankruptcy cases and 13,300 nonbankruptcy cases. The experimental results of that study indicate that three classifiers, namely, boosting, bagging, and random forest, provide better results for bankruptcy prediction.
In many datasets across various domains, the class distribution is imbalanced, which is known as the class imbalance problem. The minority class in these datasets consists of a small number of data points while the majority class has a very large number of data points. In bankruptcy datasets specifically, the number of bankrupt companies is extremely small compared with the number of normal companies. Traditional classification models are strongly biased towards the majority class in such datasets, which reduces their performance. Therefore, many methods have been proposed to deal with the class imbalance problem; they are grouped into the following four categories [33]. (1) Algorithm level approaches adapt existing classifiers to bias the learning toward the minority class [34, 35] without changing the training data. (2) Data level approaches change the class distribution by resampling the data space [36, 37] to improve the predictive performance. This group has three subcategories: undersampling, oversampling, and hybrid techniques. Undersampling techniques balance the data distribution by removing real data samples from the majority class while oversampling techniques add synthetic data samples to the minority class. Hybrid techniques combine both. (3) Cost-sensitive learning frameworks are hybrid methods that combine data and algorithm level approaches. They add costs to data samples (data level) and modify the learning process to accept costs (algorithm level) [38, 39]. The classifier is biased toward the minority class by assuming higher misclassification costs for this class and seeking to minimize the total cost of the errors on both classes. (4) Ensemble-based methods usually combine an ensemble learning algorithm with one of the techniques above, specifically, data level or cost-sensitive ones [40]. When a data level approach is combined with an ensemble learning algorithm, the hybrid method usually preprocesses the data before training each classifier, whereas cost-sensitive ensembles, instead of modifying the base classifier to accept costs in the learning process, guide the cost minimization via the ensemble learning algorithm. Which of these four kinds of methods is used to improve performance depends on the dataset.
In 2018, Le et al. [41] first introduced the Korean Bankruptcy dataset, denoted by KRBDS. The authors presented the oversampling-based (OSB) framework, which uses oversampling techniques, a data level approach, to deal with the class imbalance problem in bankruptcy prediction. This framework found that SMOTE-ENN is the best oversampling technique for KRBDS. Then, Le et al. [42] proposed a cluster-based boosting (CBoost) algorithm, which is considered a cost-sensitive learning framework for dealing with the class imbalance problem. The framework based on CBoost, namely, RFCI, achieves the best AUC (the area under the receiver operating characteristic curve) with a shorter processing time compared with the first framework and several other methods for bankruptcy prediction. In this study, we propose a hybrid approach, namely, HAOC, that combines the oversampling technique and the cost-sensitive learning framework for bankruptcy prediction. HAOC first uses SMOTE-ENN to adjust the class distribution of KRBDS to a specific balancing ratio and then uses the CBoost algorithm to predict bankruptcy. The first experiment finds the best normalization technique among StandardScaler, MinMaxScaler, and RobustScaler for KRBDS. The second experiment finds the optimal balancing ratio for the oversampling phase. The third experiment compares HAOC with the existing approaches.
The rest of this manuscript is structured as follows. Section 2 first summarizes the experimental dataset, namely, KRBDS, the oversampling technique, namely, SMOTE-ENN, and the CBoost algorithm and then, as the main contribution of this study, introduces the hybrid approach for bankruptcy prediction, namely, HAOC. Section 3 reports the experiments conducted to find the optimal balancing ratio and to show the effectiveness of the proposed approach for bankruptcy prediction. Finally, the conclusions as well as several future research issues related to bankruptcy prediction are given in Section 4.
2. Materials and Methods
This section firstly introduces the experimental dataset, namely, KRBDS. Then, we summarize the oversampling technique named SMOTE-ENN and the cost-sensitive learning framework named CBoost algorithm. Finally, the proposed approach, namely, HAOC, will be introduced.
2.1. The Experimental Dataset
KRBDS, which was provided by a Korean financial company, was first introduced by Le et al. [41]. From the financial statements released by Korean companies from 2016 to 2017, nineteen financial features that have frequently been used in previous bankruptcy prediction studies, including assets, liabilities, capital, and profit, were extracted. Assets are any resources owned by the business, such as buildings, equipment, and stocks, while a liability is any type of borrowing from persons or banks used to improve the business. Capital is any economic resource used by entrepreneurs and businesses to buy what they need to make their products or to provide their services. Profit is a financial benefit that is realized when the revenue gained from a business activity exceeds the expenses, costs, and taxes needed to sustain the activity. These values are extremely important in finance for assessing a company's performance, especially for bankruptcy prediction. The features and their summary statistics (maximum, minimum, mean, standard deviation, median, and the 25th and 75th percentiles) are described in Table 1.
Table 1: The statistical information of KRBDS.

| Feature | Description | Max | Min | Mean | Standard Deviation | Median | P25 | P75 |
|---|---|---|---|---|---|---|---|---|
| F1 | Current assets | 2.2×10^11 | 0 | 2.2×10^7 | 9.2×10^8 | 2.2×10^6 | 8.0×10^5 | 6.5×10^6 |
| F2 | Fixed assets, or fixed capital property | 9.5×10^10 | 0 | 2.9×10^7 | 6.5×10^8 | 1.4×10^6 | 2.9×10^5 | 6.8×10^6 |
| F3 | Total assets | 2.5×10^11 | 0 | 6.2×10^7 | 1.7×10^9 | 4.5×10^6 | 1.5×10^6 | 1.5×10^7 |
| F4 | Current liabilities within one year | 2.1×10^11 | -1.2×10^6 | 1.8×10^7 | 8.9×10^8 | 1.1×10^6 | 2.9×10^5 | 5.2×10^6 |
| F5 | Non-current liabilities | 6.5×10^11 | -7.7×10^5 | 2.2×10^7 | 2.5×10^9 | 4.2×10^5 | 1.2×10^4 | 2.2×10^6 |
| F6 | Total liabilities | 6.5×10^11 | -2.1×10^5 | 4.9×10^7 | 2.9×10^9 | 2.1×10^6 | 5.5×10^5 | 8.3×10^6 |
| F7 | Capital | 1.6×10^10 | -2.9×10^7 | 5.1×10^6 | 1.2×10^8 | 4.0×10^5 | 1.5×10^5 | 1.0×10^6 |
| F8 | Earned surplus | 4.8×10^10 | -6.4×10^11 | 1.4×10^6 | 2.5×10^9 | 8.3×10^5 | 1.2×10^5 | 3.2×10^6 |
| F9 | Total capital | 5.5×10^10 | -6.3×10^11 | 1.3×10^7 | 2.5×10^9 | 1.7×10^6 | 5.4×10^5 | 5.5×10^6 |
| F10 | Total capital after liabilities | 2.5×10^11 | -4.3×10^4 | 6.2×10^7 | 1.7×10^9 | 4.5×10^6 | 1.4×10^6 | 1.5×10^7 |
| F11 | Sales revenue | 6.0×10^10 | -1.4×10^9 | 3.6×10^7 | 5.2×10^8 | 5.1×10^6 | 1.8×10^6 | 1.5×10^7 |
| F12 | Cost of sales | 5.4×10^10 | -4.7×10^6 | 2.7×10^7 | 4.2×10^8 | 3.4×10^6 | 8.6×10^5 | 1.1×10^7 |
| F13 | Net profit | 2.5×10^10 | -2.6×10^10 | 7.3×10^6 | 1.6×10^8 | 1.1×10^6 | 4.2×10^5 | 3.1×10^6 |
| F14 | Sales and administrative expenses | 1.3×10^10 | -5.2×10^6 | 5.5×10^6 | 9.6×10^7 | 8.8×10^5 | 3.4×10^5 | 2.4×10^6 |
| F15 | Operating profit that refers to the profits earned through business operations | 2.5×10^10 | -2.6×10^10 | 1.9×10^6 | 1.1×10^8 | 1.9×10^5 | 3.6×10^4 | 6.5×10^5 |
| F16 | Non-operating income | 1.0×10^10 | -4.4×10^5 | 1.6×10^6 | 5.1×10^7 | 4.3×10^4 | 8.1×10^3 | 2.2×10^5 |
| F17 | Non-operating expenses | 3.0×10^9 | -5.5×10^5 | 1.6×10^6 | 2.8×10^7 | 6.6×10^4 | 1.2×10^4 | 3.2×10^5 |
| F18 | Income and loss before income taxes | 2.8×10^10 | -2.3×10^10 | 2.0×10^6 | 1.2×10^8 | 1.6×10^5 | 3.3×10^4 | 5.8×10^5 |
| F19 | Net income | 2.8×10^10 | -2.3×10^10 | 1.5×10^6 | 1.2×10^8 | 1.4×10^5 | 2.9×10^4 | 5.0×10^5 |
KRBDS contains 307 bankrupt firms and 120,048 normal firms, which gives a balancing ratio of 0.0026. This ratio is extremely small for an ordinary classifier to predict bankruptcy correctly. Therefore, specific techniques are needed to improve the performance.
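The balancing ratio quoted above is simply the minority count divided by the majority count. The short sketch below reproduces that arithmetic and also shows how many minority samples a given target ratio would require. The function names and the target ratio of 0.1 are illustrative assumptions, not values taken from the study.

```python
import math


def balancing_ratio(n_minority, n_majority):
    """Ratio of minority-class to majority-class sample counts."""
    return n_minority / n_majority


def minority_needed(n_majority, target_ratio):
    """Minority count required to reach target_ratio, rounded up."""
    return math.ceil(n_majority * target_ratio)


n_bankrupt, n_normal = 307, 120048  # class sizes reported for KRBDS
ratio = balancing_ratio(n_bankrupt, n_normal)
print(round(ratio, 4))  # 0.0026

# Hypothetical target ratio, chosen only to illustrate the calculation.
target = 0.1
extra = minority_needed(n_normal, target) - n_bankrupt
print(extra)  # synthetic minority samples an oversampler would have to add
```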
2.2. Oversampling Technique with SMOTE-ENN
Resampling, which belongs to the data level approaches, is the most common way of dealing with the class imbalance problem; it works by adjusting the class distribution. Resampling consists of three subcategories: oversampling techniques, undersampling techniques, and hybrid techniques, as illustrated in Figure 1. Undersampling balances the data distribution by removing real data samples from the majority class while oversampling accomplishes the same purpose by adding synthetic data samples to the minority class. Hybrid methods combine both undersampling and oversampling techniques.
Figure 1: The illustration of oversampling, undersampling, and hybrid techniques.
The advantage of these techniques is that they balance the class distribution to improve the predictive performance. However, no resampling method has an absolute advantage over another; which technique to apply depends on the use case and on the dataset itself. The disadvantage of undersampling techniques is that they can remove potentially useful data samples that could be important for the induction process. When the number of samples in the minority class is too small compared with the number of samples in the majority class, as in KRBDS, undersampling techniques become ineffective because many samples in the majority class are deleted. The main disadvantage of oversampling is that, by making exact copies of existing examples, it makes overfitting likely. A second disadvantage of oversampling is that it increases the number of training examples and therefore the training time and the amount of memory required to hold the training set.
In 2018, Le et al. [41] presented an oversampling framework with an empirical evaluation of oversampling techniques for bankruptcy prediction on KRBDS. Several oversampling techniques, namely, Synthetic Minority Over-sampling Technique (SMOTE) [36], Borderline-SMOTE [44], the Adaptive Synthetic (ADASYN) sampling approach [45], SMOTE-ENN [46], and SMOTE-Tomek [46], were used to improve the bankruptcy prediction performance. The experiments conducted in that study found that SMOTE-ENN is the best oversampling technique for KRBDS. This approach is summarized as follows.
The SMOTE algorithm, first proposed by Chawla et al. [36] in 2002, generates synthetic minority samples based on the feature similarities between the original minority samples. First, for each minority sample $x_i \in \chi_{\min}$, SMOTE determines the k-nearest neighbors (NNs), denoted by $K_{x_i}$.
Figure 2(a) demonstrates the three NNs of $x_i$, each connected to $x_i$ by a line. To generate a synthetic data sample $x_{\mathrm{new}}$ for $x_i$, SMOTE randomly selects an element $\hat{x}_i$ in $K_{x_i}$, with $\hat{x}_i \in \chi_{\min}$. The feature vector of $x_{\mathrm{new}}$ is the sum of the feature vector of $x_i$ and the vector difference between $\hat{x}_i$ and $x_i$ multiplied by a random value $\delta \in [0, 1]$:
$$x_{\mathrm{new}} = x_i + (\hat{x}_i - x_i) \times \delta \tag{1}$$
where $\hat{x}_i$ is an element in $K_{x_i}$ with $\hat{x}_i \in \chi_{\min}$.
Figure 2: A toy example of the three nearest neighbors of $x_i$ (a) and the generation of $x_{\mathrm{new}}$ using SMOTE (b).
According to (1), the synthetic sample is a point on the line segment joining $x_i$ and the randomly selected $\hat{x}_i \in K_{x_i}$. Figure 2(b) shows a toy example of the SMOTE algorithm in which the new sample $x_{\mathrm{new}}$ lies on the line between $x_i$ and $\hat{x}_i$.
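The interpolation in equation (1) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the helper names `k_nearest` and `smote_sample`, the toy points, and the Euclidean distance are all assumptions made for the example (it also requires Python 3.8 or later for `math.dist`).

```python
import math
import random


def k_nearest(x, others, k):
    """Return the k points in `others` closest to x (Euclidean distance)."""
    return sorted(others, key=lambda p: math.dist(x, p))[:k]


def smote_sample(x_i, minority, k=3, rng=random):
    """Generate one synthetic sample for x_i following equation (1)."""
    # Neighbours are searched among the other minority samples only.
    neighbours = k_nearest(x_i, [p for p in minority if p is not x_i], k)
    x_hat = rng.choice(neighbours)   # randomly selected neighbour
    delta = rng.random()             # random value in [0, 1)
    return [a + (b - a) * delta for a, b in zip(x_i, x_hat)]


# Toy minority class (hypothetical points).
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [8.0, 8.0]]
rng = random.Random(0)
x_new = smote_sample(minority[0], minority, k=2, rng=rng)
print(x_new)  # a point on the segment between x_i and one of its two NNs
```

Because delta is drawn from the unit interval, the synthetic point never leaves the segment between the two real minority samples.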
Then, SMOTE-ENN applies the neighborhood cleaning rule based on the edited nearest neighbor (ENN) [46] to clean unwanted overlapping between classes: it removes any sample whose class differs from that of at least two of its three nearest neighbors. Figure 3 shows an example of ENN. In summary, SMOTE-ENN uses SMOTE for the oversampling step and then uses ENN to remove the overlapping examples, as shown in Figure 4.
Figure 3: Example of an ENN.
Figure 4: The flowchart of SMOTE-ENN algorithm.
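The ENN cleaning rule described above can be illustrated with a small pure-Python sketch. It assumes binary labels, Euclidean distance, and a majority vote over the three nearest neighbours; the function name `enn_filter` and the toy data are hypothetical, and library implementations may use a different selection rule.

```python
import math
from collections import Counter


def enn_filter(X, y, k=3):
    """Keep a sample only if its label matches the majority label of its k nearest neighbours."""
    keep_X, keep_y = [], []
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        # Indices of the k nearest samples, excluding the sample itself.
        nearest = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(x_i, X[j]),
        )[:k]
        majority_label = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if majority_label == y_i:
            keep_X.append(x_i)
            keep_y.append(y_i)
    return keep_X, keep_y


# Toy data: the last class-1 point sits inside the class-0 cluster.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [0.5, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
X_clean, y_clean = enn_filter(X, y)
print(len(X_clean))  # 7: only the overlapping point is removed
```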
2.3. Cluster-Based Boosting Algorithm
Recently, Le et al. [42] proposed the CBoost algorithm, which is based on the cost-sensitive learning framework, to deal effectively with the class imbalance problem occurring in bankruptcy datasets. CBoost first clusters the majority class in the bankruptcy dataset, i.e., the nonbankrupt firms, by applying k-means clustering with k = 45, which is considered the best k value based on the experimental results in [42]. Then, for each sample belonging to the majority class, the algorithm determines the distance from this sample to the nearest center point. Let $d_{\max}$ be the maximum value of the distances of data samples in the class of bankruptcy firms. CBoost then assigns $d_{\max}$ as the distance value of each data sample in the minority class. Next, CBoost determines the initial weights, denoted by $W_1$, as follows:
$$W_1(i) = \ln\frac{1}{d(x_i)} \tag{2}$$
where $d(x_i)$ refers to the distance between data point $x_i$ and the nearest center point for the majority class and $d(x_i) = d_{\max}$ for the minority class. Equation (2) gives the majority-class samples close to the center points and the minority-class samples higher weights than the majority-class samples farther from the center points. CBoost then normalizes these values by the following equation:
$$W_1(i) = \frac{W_1(i)}{\sum_{j=1}^{m} W_1(j)} \tag{3}$$
where m is the total number of data points in the training set. This step ensures that
$$\sum_{i=1}^{m} W_1(i) = 1 \tag{4}$$
The initial weight $W_1$ helps the weak classifier classify more accurately the samples in the majority class close to the center points as well as the samples in the minority class. Therefore, it improves the overall performance on class imbalance problems such as the bankruptcy dataset.
In each iteration, CBoost identifies the weak learner $h_t(x)$ that produces the lowest classification error $\epsilon_t$, calculates the weight $\alpha_t$ for this classifier, and determines the weights $W_{t+1}$ for the next iteration as follows:
$$\begin{aligned} h_t &= \arg\min_{h_j \in H} \epsilon_j, \qquad \epsilon_j = \sum_{i=1}^{m} W_t(i)\,[y_i \neq h_j(x_i)] \\ \alpha_t &= \eta \log\frac{1 - \epsilon_t}{\epsilon_t} \\ W_{t+1}(i) &= \frac{W_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t} \end{aligned} \tag{5}$$
where $Z_t$ is a normalization factor. Finally, the algorithm combines all weak learners to make the final classifier H as follows:
$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \tag{6}$$
where $h_t(x)$ is the weak learner at the t-th iteration and $\alpha_t$ is its weight.
In short, CBoost is a greedy algorithm that finds one weak learner per iteration, optimizes the weight of this learner, and updates the weighted distribution for the next iteration. The algorithm combines all weak learners as in (6) to create the final classifier. The flowchart of the CBoost algorithm is shown in Figure 5.
Figure 5: The flowchart of CBoost algorithm.
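The per-iteration update in (5) and the final vote in (6) follow the usual boosting pattern, which the sketch below traces for a single round. It is an illustration under several assumptions rather than the CBoost implementation: labels and predictions are in {-1, +1}, the weak learners are represented only by their prediction lists, η is set to 0.5, and the starting weights are uniform instead of the cluster-based weights of (2) to (4). The function names are hypothetical.

```python
import math


def boosting_round(weights, y, candidate_preds, eta=0.5):
    """One round of (5): pick the weak learner with the lowest weighted
    error, compute its weight alpha, and return the updated sample weights."""
    errors = [
        sum(w for w, yi, pi in zip(weights, y, preds) if yi != pi)
        for preds in candidate_preds
    ]
    best = min(range(len(errors)), key=errors.__getitem__)
    eps = errors[best]  # assumed to lie strictly between 0 and 1
    alpha = eta * math.log((1 - eps) / eps)
    unnormalised = [
        w * math.exp(-alpha * yi * pi)
        for w, yi, pi in zip(weights, y, candidate_preds[best])
    ]
    z = sum(unnormalised)  # normalization factor Z_t
    return best, alpha, [u / z for u in unnormalised]


def final_classifier(alphas, round_preds):
    """Equation (6): sign of the alpha-weighted vote for every sample."""
    n = len(round_preds[0])
    scores = [
        sum(a * preds[i] for a, preds in zip(alphas, round_preds))
        for i in range(n)
    ]
    return [1 if s >= 0 else -1 for s in scores]


y = [1, 1, -1, -1, -1]
weights = [0.2] * 5  # uniform start, for illustration only
candidates = [[1, -1, -1, -1, -1], [1, 1, 1, 1, -1]]
best, alpha, new_weights = boosting_round(weights, y, candidates)
print(best, new_weights)
```

In this toy run the single misclassified sample ends up carrying half of the total weight, which is what forces the next weak learner to focus on it.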
2.4. The Hybrid Approach for Bankruptcy Prediction on KRBDS
The balancing ratio of KRBDS is very small which leads to a reduction in performance of oversampling and cost-sensitive learning independently. Therefore, this study proposes a hybrid approach that combines oversampling technique and cost-sensitive learning (HAOC) for bankruptcy prediction on KRBDS to improve the overall performance.
The flowchart of HAOC is presented in Figure 6. KRBDS is first normalized by a normalization module that uses the best normalization technique found in the data preprocessing experiment (Section 3.3). Next, the fivefold cross-validation module splits KRBDS into five parts; in turn, four parts are used for training and the remaining part is used for testing.
Figure 6: The flowchart of HAOC.
The training set is then passed to the module that finds the optimal balancing ratio. This module divides the training set into two subsets: a training subset and a validation set. Using these subsets, it tries various balancing ratios for SMOTE-ENN and finds the optimal balancing ratio for KRBDS, as presented in Section 3.4. The training set is then balanced by SMOTE-ENN with the best balancing ratio found in the previous step. After this phase, the resampled training set is used to train the CBoost algorithm for bankruptcy prediction. The testing set is used to evaluate the proposed approach.
3. Experimental Results
3.1. Experiment Setup
The experimental methods were implemented in a Python 2.7 environment and run on a computer with an Intel Core i7-2600 CPU (3.40 GHz × 2 cores) and 8 GB RAM running Ubuntu 16.04 LTS. SMOTEENN was taken from the imbalanced-learn package [47], and Bagging, AdaBoost, Random Forest, and MLP were taken from the Scikit-learn package [48]. The imbalanced-learn package is an open-source Python toolbox that provides several methods for dealing with the problem of class imbalance, while Scikit-learn is a free machine learning library for the Python programming language.
To show the effectiveness of the proposed approach, we compare the performance of the state-of-the-art methods and HAOC for bankruptcy prediction on KRBDS. The first four approaches are Bagging (BG), AdaBoost (AB), Random Forest (RF), and Multilayer Perceptron (MLP), which were recommended by Barboza et al. [32]. These approaches predict bankruptcy directly; i.e., no resampling is applied to adjust the class distribution. The 5th to 8th approaches combine the undersampling method based on clustering technique [43] with the BG, AB, RF, and MLP classifiers. The 9th to 12th approaches combine the oversampling method using SMOTE-ENN (with balancing ratio = 1) with the BG, AB, RF, and MLP classifiers. The 13th approach is RFCI introduced by Le et al. [42], and the 14th approach is the proposed approach (HAOC). Moreover, the study repeats the fivefold cross-validation 10 times, with a different configuration of folds for each run, to obtain the average performance.
Next, we use GridSearchCV in the Scikit-learn package [48] to tune several parameters of Bagging, AdaBoost, Random Forest, and MLP. We tuned n_estimators (150) and max_samples (0.2) for Bagging, learning_rate (0.1) for AdaBoost, max_depth (5) for Random Forest, and max_iter (150), learning_rate_init (0.01), and hidden_layer_sizes (50, 5) for MLP.
3.2. Evaluation Metrics
This study uses two evaluation metrics, AUC (Area under the ROC Curve) and G-mean (Geometric Mean), to compare the performance of the experimental methods. A ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), computed as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \tag{7}$$
where TP, FN, FP, and TN are true positives, false negatives, false positives, and true negatives, respectively. Lowering the classification threshold classifies more items as positive, thus increasing both false positives and true positives. AUC provides an aggregate measure of performance across all possible classification thresholds. An algorithm with a larger AUC than another is considered better.
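As a concrete illustration of (7), the sketch below builds the ROC points by lowering the threshold over the distinct scores and integrates them with the trapezoidal rule. The function names and the toy labels and scores are assumptions made for the example; in practice a library routine would normally be used.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold
    over all distinct scores, starting from the point (0, 0)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))  # (FPR, TPR) as in equation (7)
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy labels (1 = bankrupt) and classifier scores, hypothetical values.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(y_true, scores)))  # 8 of the 9 positive/negative pairs are ranked correctly
```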
From the ROC curve, the Youden index, which is the vertical distance between the 45-degree line and a point on the ROC curve, is used to determine the optimal cut-off threshold. The Youden index is determined as follows:
$$J = \mathrm{sensitivity} + \mathrm{specificity} - 1 \tag{8}$$
The optimal cut-off threshold corresponds to the point with the maximum value of J. The sensitivity and specificity are then determined at that threshold. G-mean is the root of the product of the class-wise sensitivities. This measure tries to maximize the accuracy on each of the classes while keeping these accuracies balanced. For binary classification, G-mean is the square root of the product of the sensitivity and specificity. As with the AUC, the algorithm with a larger G-mean is better.
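The threshold selection by (8) and the resulting G-mean can be sketched as follows. This is a minimal example under stated assumptions: a sample is predicted positive when its score is at least the threshold, ties in J keep the lowest threshold, and the function names and toy data are hypothetical.

```python
import math


def confusion(y_true, scores, thr):
    """Confusion counts when scores >= thr are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= thr)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < thr)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < thr)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= thr)
    return tp, fn, tn, fp


def youden_cutoff(y_true, scores):
    """Return (threshold, sensitivity, specificity) maximising J in (8)."""
    best = None
    for thr in sorted(set(scores)):
        tp, fn, tn, fp = confusion(y_true, scores, thr)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, thr, sens, spec)
    return best[1:]


def g_mean(sens, spec):
    """Square root of sensitivity times specificity (binary case)."""
    return math.sqrt(sens * spec)


# Toy labels (1 = bankrupt) and scores, hypothetical values.
y_true = [1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
thr, sens, spec = youden_cutoff(y_true, scores)
print(thr, sens, spec, g_mean(sens, spec))
```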
3.3. Data Preprocessing
In this section, we apply some normalization techniques including StandardScaler, MinMaxScaler, and RobustScaler to the original features. StandardScaler normalizes the original features to create standardized features by removing the mean and scaling to unit variance. MinMaxScaler transforms the features by scaling each feature to a given range while RobustScaler scales the features using statistics that are robust to outliers. HAOC is then used to predict the bankruptcy from the normalized features. The performance results in Table 2 show that the StandardScaler is the best normalization technique for KRBDS. Therefore, we apply the StandardScaler for the next experiments. Please note that the settings of StandardScaler were found only using training data and then we used these settings for the training and testing data.
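Table 2: Performance results of HAOC using several normalization techniques for KRBDS.

| No | Normalization technique | Normalization formula | AUC |
|---|---|---|---|
| 1 | None | None | 50.0±0.0 |
| 2 | StandardScaler | $x' = \frac{x - x_{\mathrm{mean}}}{x_{\mathrm{stdev}}}$ | 87.1±0.6 |
| 3 | MinMaxScaler | $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ | 73.0±4.0 |
| 4 | RobustScaler | $x' = \frac{x - x_{Q1}}{x_{Q3} - x_{Q1}}$ | 50.0±0.0 |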
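The point that the scaler settings are found on the training data only and then reused for the testing data can be sketched for a single feature as follows. The helper names and the toy values are assumptions for illustration; the study itself used the Scikit-learn scalers, and the population standard deviation is used here on the assumption that it matches StandardScaler's behaviour.

```python
import statistics


def fit_standard(train):
    """Mean and population standard deviation of the training values."""
    return statistics.fmean(train), statistics.pstdev(train)


def apply_standard(values, mean, std):
    """x' = (x - mean) / stdev, using statistics fitted elsewhere."""
    return [(v - mean) / std for v in values]


def fit_min_max(train):
    """Minimum and maximum of the training values."""
    return min(train), max(train)


def apply_min_max(values, lo, hi):
    """x' = (x - min) / (max - min), using statistics fitted elsewhere."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical single-feature data.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [1.0, 5.0, 11.0]

mean, std = fit_standard(train)          # fitted on the training data only
print(apply_standard(test, mean, std))   # [-2.0, 0.0, 3.0]

lo, hi = fit_min_max(train)
print(apply_min_max(test, lo, hi))       # test values may fall outside [0, 1]
```

Fitting on the training portion alone keeps information about the testing data from leaking into the preprocessing step.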
3.4. Finding the Optimal Balancing Ratio
This section finds the optimal balancing ratio of HAOC for KRBDS. Using different balancing ratios from 0.003 to 1 for the oversampling module, we obtain the AUCs for the validation sets in five folds, shown in Figure 7. According to the results, a balancing ratio of 0.08 gives the best average AUC for the validation sets. Therefore, we use this value for our proposed approach in the final experiment.
Figure 7: Performance of HAOC in terms of AUC for validation sets in five folds.
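The selection rule used in this experiment, choosing the ratio with the best average validation AUC across folds, can be expressed in a few lines. The sketch below assumes the per-fold validation AUCs have already been computed; the function name is hypothetical and the numbers are made up purely for illustration and are not results from the study.

```python
from statistics import fmean


def best_balancing_ratio(val_auc_by_ratio):
    """Return the ratio whose mean validation AUC across folds is highest."""
    return max(val_auc_by_ratio, key=lambda r: fmean(val_auc_by_ratio[r]))


# Made-up validation AUCs for three candidate ratios over five folds.
val_auc = {
    0.01: [0.70, 0.72, 0.69, 0.71, 0.70],
    0.10: [0.78, 0.80, 0.79, 0.77, 0.81],
    1.00: [0.74, 0.73, 0.75, 0.72, 0.74],
}
print(best_balancing_ratio(val_auc))  # ratio with the highest mean AUC
```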
3.5. Performance Results
Figure 8 shows the box plot of the AUC of the experimental approaches for KRBDS in five folds. USC-BG, USC-AB, USC-RF, USC-MLP, and OSE-MLP did not achieve good results, while the remaining approaches obtain better results.
Figure 8: The box plot in terms of AUC of experimental approaches for KRBDS in five folds.
Figure 9 presents the box plot of the G-mean of all the experimental approaches, which indicates that AB, RF, MLP, OSE-RF, RFCI, and HAOC are the best methods in terms of G-mean.
Figure 9: The box plot in terms of G-mean of experimental approaches for KRBDS in five folds.
Table 3 presents the average AUC and G-mean of these approaches with standard deviations. According to these results, Bagging without resampling gives a poor AUC of 78.8. Meanwhile, AdaBoost, Random Forest, and MLP show acceptable AUCs of 84.9, 86.2, and 86.7, respectively. In addition, the undersampling method based on clustering technique (USC) [43] reduces the performance of all four classification algorithms (Bagging, AdaBoost, Random Forest, and MLP). Therefore, USC is not suitable for KRBDS, whose balancing ratio is very small. The 9th–12th approaches, OSE-BG, OSE-AB, OSE-RF, and OSE-MLP, give overall AUCs of 83.9, 85.4, 86.6, and 72.8, respectively. Meanwhile, RFCI [42], which uses the cost-sensitive learning algorithm CBoost, achieved an AUC of 86.6. Our proposed method outperforms the other approaches with an overall AUC of 87.1. Table 3 also reports the G-mean of all experimental approaches. HAOC achieves the best G-mean (81.1) while OSE-RF obtains the second best (80.2). MLP, RF, and RFCI also have good results. In general, the proposed approach has the best values of both AUC and G-mean for KRBDS.
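Table 3: The overall results of all experimental approaches for KRBDS.

| No | Method | Resample approach | Classifier | AUC | G-mean | Average Rank | p-value |
|---|---|---|---|---|---|---|---|
| 1 | BG | None | Bagging | 78.8±0.4 | 70.8±0.8 | 9.0 | 3.9×10^−5 |
| 2 | AB | None | AdaBoost | 84.9±0.8 | 78.2±0.6 | 7.0 | 0.0023 |
| 3 | RF | None | Random Forest | 86.2±0.6 | 79.9±0.6 | 4.7 | 0.069 |
| 4 | MLP | None | MLP | 86.7±0.8 | 80.1±1.0 | 2.6 | 0.487 |
| 5 | USC-BG | Under-sampling method based on clustering technique (USC) [43] | Bagging | 65.1±1.6 | 53.6±4.9 | 11.2 | 1.2×10^−7 |
| 6 | USC-AB | Under-sampling method based on clustering technique (USC) [43] | AdaBoost | 59.7±3.0 | 56.3±5.0 | 12.9 | 5.6×10^−10 |
| 7 | USC-RF | Under-sampling method based on clustering technique (USC) [43] | Random Forest | 64.7±1.0 | 62.6±1.9 | 11.9 | 1.5×10^−8 |
| 8 | USC-MLP | Under-sampling method based on clustering technique (USC) [43] | MLP | 46.9±2.7 | 36.5±3.7 | 14.0 | 1.1×10^−11 |
| 9 | OSE-BG | Oversampling method using SMOTE-ENN (OSE) [41] | Bagging | 83.9±0.3 | 77.4±0.3 | 7.8 | 5.1×10^−4 |
| 10 | OSE-AB | Oversampling method using SMOTE-ENN (OSE) [41] | AdaBoost | 85.4±0.7 | 78.5±0.4 | 6.2 | 0.009 |
| 11 | OSE-RF | Oversampling method using SMOTE-ENN (OSE) [41] | Random Forest | 86.6±0.7 | 80.2±1.0 | 3.3 | 0.285 |
| 12 | OSE-MLP | Oversampling method using SMOTE-ENN (OSE) [41] | MLP | 72.8±2.1 | 69.8±1.8 | 10.0 | 3.3×10^−6 |
| 13 | RFCI [42] | Under-sampling method using IHT concept | CBoost | 86.6±0.7 | 79.1±3.5 | 3.1 | 0.336 |
| 14 | HAOC | Oversampling method using SMOTE-ENN (with balancing ratio = 0.08) | CBoost | 87.1±0.6 | 81.1±0.8 | 1.3 | - |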
In addition, we employ the MULTIPLETEST package [49] to conduct multiple comparisons over all possible pairs of experimental methods; the results are also presented in Table 3. The average rank of the proposed method is 1.3, which is the best rank in terms of AUC. The results of our proposal are not statistically different from those obtained by Random Forest, MLP, OSE-RF, and RFCI, since the corresponding p-values are greater than 0.05. For the remaining tested classifiers, the p-values (≤0.05) show that the differences from the results of HAOC are statistically significant.
Finally, Figure 10 presents the feature importance of HAOC approach on KRBDS. We can easily see that F3 (total assets), F4 (current liabilities within one year), F6 (total liabilities), F7 (capital), F8 (earned surplus), and F16 (nonoperating income) are the most important features. On the contrary, F1 (current assets), F2 (fixed assets, or fixed capital property), F9 (total capital), F10 (total capital after liabilities), F13 (net profit), F14 (sales and administrative expenses), and F19 (net income) are unimportant features and therefore they can be removed in the proposed model.
Figure 10: Feature importance of HAOC approach.
4. Conclusions
This study proposed a hybrid approach using an oversampling technique and a cost-sensitive learning framework for bankruptcy prediction on the Korean Bankruptcy dataset. In the first phase, the training set is balanced by an oversampling module that uses the SMOTE-ENN algorithm with an optimal balancing ratio. Then, the second module uses the cost-sensitive learning framework, namely, CBoost, for bankruptcy prediction. Two experiments were conducted in this study to show the effectiveness of the proposed approach. The first experiment finds the optimal balancing ratio, i.e., the one that gives the best overall performance for bankruptcy prediction on the validation set. Using this optimal balancing ratio, we compare the performance of our proposed approach with that of the existing approaches in terms of AUC and G-mean. The results indicate that HAOC outperforms the existing approaches for bankruptcy prediction on KRBDS.
In the future, we will focus on how to find the optimal feature selection methods using evolutionary algorithms. In addition, several advanced methods for forecasting bankruptcy from multiple information sources to improve performance will be studied.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00938) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).
References
[1] T. H. Cupertino, M. Guimarães Carneiro, Q. Zheng, J. Zhang, and L. Zhao, "A scheme for high level data classification using random walk and network measures," vol. 92, pp. 289–303, 2018. doi:10.1016/j.eswa.2017.09.014
[2] T. C. Silva and L. Zhao, Springer, 2016. MR3821553
[3] T. Le, B. Vo, P. Fournier-Viger, M. Y. Lee, and S. W. Baik, "SPPC: a new tree structure for mining erasable patterns in data streams," vol. 49, no. 2, pp. 478–495, 2019. doi:10.1007/s10489-018-1280-5
[4] T. Le, B. Vo, and S. W. Baik, "Efficient algorithms for mining top-rank-k erasable patterns using pruning strategies and the subsume concept," vol. 68, pp. 1–9, 2018. doi:10.1016/j.engappai.2017.09.010
[5] T. Le, A. Nguyen, B. Huynh, B. Vo, and W. Pedrycz, "Mining constrained inter-sequence patterns: a novel approach to cope with item constraints," vol. 48, no. 5, pp. 1327–1343, 2018. doi:10.1007/s10489-017-1123-9
[6] T. Kieu, B. Vo, T. Le, Z. Deng, and B. Le, "Mining top-k co-occurrence items with sequential pattern," vol. 85, pp. 123–133, 2017. doi:10.1016/j.eswa.2017.05.021
[7] B. Vo, T. Le, F. Coenen, and T.-P. Hong, "Mining frequent itemsets using the n-list and subsume concepts," vol. 7, no. 2, pp. 253–265, 2016.
[8] B. Vo, T. Le, G. Nguyen, and T. Hong, "Efficient algorithms for mining erasable closed patterns from product datasets," vol. 5, pp. 3111–3120, 2017. doi:10.1109/ACCESS.2017.2676803
[9] G. Nguyen, T. Le, B. Vo, and B. Le, "EIFDD: An efficient approach for erasable itemset mining of very dense datasets," vol. 43, no. 1, pp. 85–94, 2015. doi:10.1007/s10489-014-0644-8
[10] B. L. R. Stojkoska and K. V. Trivodaliev, "A review of Internet of things for smart home: challenges and solutions," vol. 140, pp. 1454–1464, 2017. doi:10.1016/j.jclepro.2016.10.006. Scopus: 2-s2.0-85028237634
[11] M. Wollschlaeger, T. Sauter, and J. Jasperneite, "The future of industrial communication: Automation networks in the era of the internet of things and industry 4.0," vol. 11, no. 1, pp. 17–27, 2017. doi:10.1109/MIE.2017.2649104. Scopus: 2-s2.0-85017596114
[12] N. P. Nguyen and S. K. Hong, "Sliding mode Thau observer for actuator fault diagnosis of quadcopter UAVs," vol. 8, no. 10, article 1893, 2018.
[13] N. P. Nguyen and S. K. Hong, "Fault-tolerant control of quadcopter UAVs using robust adaptive sliding mode approach," vol. 12, no. 1, article 95, 2019. Scopus: 2-s2.0-85060065542
[14] N. Nguyen and S. Hong, "Fault diagnosis and fault-tolerant control scheme for quadcopter UAVs with a total loss of actuator," vol. 12, no. 6, article 1139, 2019. doi:10.3390/en12061139
[15] T. N. Nguyen, S. Lee, H. Nguyen-Xuan, and J. Lee, "A novel analysis-prediction approach for geometrically nonlinear problems using group method of data handling," vol. 354, pp. 506–526, 2019. doi:10.1016/j.cma.2019.05.052
[16] T. N. Nguyen, C. H. Thai, A. Luu, H. Nguyen-Xuan, and J. Lee, "NURBS-based postbuckling analysis of functionally graded carbon nanotube-reinforced composite shells," vol. 347, pp. 983–1003, 2019. doi:10.1016/j.cma.2019.01.011
[17] T. N. Nguyen, C. H. Thai, H. Nguyen-Xuan, and J. Lee, "NURBS-based analyses of functionally graded carbon nanotube-reinforced composite shells," vol. 203, pp. 349–360, 2018. doi:10.1016/j.compstruct.2018.06.017
[18] T. N. Nguyen, C. H. Thai, H. Nguyen-Xuan, and J. Lee, "Geometrically nonlinear analysis of functionally graded material plates using an improved moving Kriging meshfree method based on a refined plate theory," vol. 193, pp. 268–280, 2018. doi:10.1016/j.compstruct.2018.03.036
[19] D. Le and V. Pham, "HGPEC: a Cytoscape app for prediction of novel disease-gene and disease-disease associations and evidence collection based on a random walk on heterogeneous network," vol. 11, no. 1, article 61, 2017. doi:10.1186/s12918-017-0437-x
[20] D. J. Hemanth, J. Anitha, and L. H. Son, "Brain signal based human emotion analysis by circular back propagation and deep Kohonen neural networks," vol. 68, pp. 170–180, 2018. doi:10.1016/j.compeleceng.2018.04.006. Scopus: 2-s2.0-85045264699
[21] D. M. Fazio, T. C. Silva, B. M. Tabak, and D. O. Cajueiro, "Inflation targeting and financial stability: Does the quality of institutions matter?" vol. 71, pp. 1–15, 2018. doi:10.1016/j.econmod.2017.09.011
[22] T. Le, B. Vo, H. Fujita, N. Nguyen, and S. W. Baik, "A fast and accurate approach for bankruptcy forecasting using squared logistics loss with GPU-based extreme gradient boosting," vol. 494, pp. 294–310, 2019. doi:10.1016/j.ins.2019.04.060
[23] A. Vanderveld, A. Pandey, A. Han, and R. Parekh, "An engagement-based customer lifetime value system for E-commerce," in Proceedings of the 22nd ACM SIGKDD International Conference, pp. 293–302, San Francisco, Calif, USA, August 2016. doi:10.1145/2939672.2939693
[24] B. Zhu, B. Baesens, and S. K. vanden Broucke, "An empirical comparison of techniques for the class imbalance problem in churn prediction," vol. 408, pp. 84–99, 2017. doi:10.1016/j.ins.2017.04.015
[25] D. A. Chekired, L. Khoukhi, and H. T. Mouftah, "Decentralized cloud-SDN architecture in smart grid: a dynamic pricing model," vol. 14, no. 3, pp. 1220–1231, 2018. doi:10.1109/TII.2017.2742147
[26] X. Chen, Y. Fang, M. Yang, F. Nie, Z. Zhao, and J. Z. Huang, "PurTreeClust: a clustering algorithm for customer segmentation from massive customer transaction data," vol. 30, no. 3, pp. 559–572, 2018. doi:10.1109/TKDE.2017.2763620
[27] H. V. Long, L. H. Son, M. Khari, K. Arora, S. Chopra, R. Kumar, T. Le, and S. W. Baik, "A new approach for construction of geodemographic segmentation model and prediction analysis," vol. 2019, Article ID 9252837, 10 pages, 2019. doi:10.1155/2019/9252837
[28] T. C. Silva, M. D. S. Alexandre, and B. M. Tabak, "Bank lending and systemic risk: A financial-real sector network approach with feedback," vol. 38, pp. 98–118, 2018. doi:10.1016/j.jfs.2017.08.006. Scopus: 2-s2.0-85030705994
[29] B. M. Tabak, T. C. Silva, and A. Sensoy, "Financial Networks," vol. 2018, Article ID 7802590, 2 pages, 2018. doi:10.1155/2018/7802590
[30] M. Kim, D. Kang, and H. B. Kim, "Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction," vol. 42, no. 3, pp. 1074–1082, 2015. doi:10.1016/j.eswa.2014.08.025
[31] M. Zięba, S. K. Tomczak, and J. M. Tomczak, "Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction," vol. 58, pp. 93–101, 2016. doi:10.1016/j.eswa.2016.04.001
[32] F. Barboza, H. Kimura, and E. Altman, "Machine learning models and bankruptcy prediction," vol. 83, pp. 405–417, 2017. doi:10.1016/j.eswa.2017.04.006
[33] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, Springer, 2018.
[34] Y. Lin, Y. Lee, and G. Wahba, "Support vector machines for classification in nonstandard situations," vol. 46, no. 1-3, pp. 191–202, 2002. doi:10.1023/A:1012406528296. Scopus: 2-s2.0-0036161029. Zbl 0998.68103
[35] B. Liu, Y. Ma, and C. Wong, "Improving an association rule-based classifier," pp. 293–317, 2000.
[36] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," vol. 16, pp. 321–357, 2002. doi:10.1613/jair.953. Zbl 0994.68128. Scopus: 2-s2.0-0346586663
[37] T. Le and S. W. Baik, "A robust framework for self-care problem identification for children with disability," vol. 11, no. 1, article 89, 2019. Scopus: 2-s2.0-85061092091
[38] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi, "Automatically countering imbalance and its empirical relationship to cost," vol. 17, no. 2, pp. 225–252, 2008. doi:10.1007/s10618-008-0087-0. MR2434765. Scopus: 2-s2.0-50549101751
[39] C. Ling, V. Sheng, and Q. Yang, "Test strategies for cost-sensitive decision trees," vol. 18, no. 8, pp. 1055–1067, 2006. doi:10.1109/TKDE.2006.131
[40] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," vol. 42, no. 4, pp. 463–484, 2012. doi:10.1109/TSMCC.2011.2161285. Scopus: 2-s2.0-84862515469
[41] T. Le, M. Y. Lee, J. R. Park, and S. W. Baik, "Oversampling techniques for bankruptcy prediction: Novel features from a transaction dataset," vol. 10, no. 4, article 79, 2018. Scopus: 2-s2.0-85046108283
[42] T. Le, L. H. Son, M. T. Vo, M. Y. Lee, and S. W. Baik, "A cluster-based boosting algorithm for bankruptcy prediction in a highly imbalanced dataset," vol. 10, no. 7, article 250, 2018. Scopus: 2-s2.0-85050389409
[43] W. Lin, C. Tsai, Y. Hu, and J. Jhang, "Clustering-based undersampling in class-imbalanced data," vol. 409-410, pp. 17–26, 2017. doi:10.1016/j.ins.2017.05.008
[44] H. Han, W.-Y. Wang, and B.-H. Mao, "Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning," in Proceedings of the International Conference on Intelligent Computing (ICIC '05), vol. 3644 of Lecture Notes in Computer Science, pp. 878–887, August 2005. Scopus: 2-s2.0-27144501672
[45] H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: adaptive synthetic sampling approach for imbalanced learning," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '08), pp. 1322–1328, June 2008. doi:10.1109/ijcnn.2008.4633969. Scopus: 2-s2.0-56349089205
[46] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A study of the behaviour of several methods for balancing machine learning training data," vol. 6, no. 1, pp. 20–29, 2004. doi:10.1145/1007730.1007735
[47] G. Lemaître, F. Nogueira, and C. K. Aridas, "Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning," vol. 18, no. 17, pp. 1–5, 2017. Scopus: 2-s2.0-85016274615
[48] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: machine learning in Python," vol. 12, pp. 2825–2830, 2011. MR2854348
[49] S. García and F. Herrera, "An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons," vol. 9, pp. 2677–2694, 2008. Zbl 1225.68178