A Rule-Based Model for Bankruptcy Prediction Based on an Improved Genetic Ant Colony Algorithm

In this paper, we propose a hybrid system to predict corporate bankruptcy. The procedure consists of four stages: first, sequential forward selection was used to extract the most important features; second, a rule-based model was chosen to fit the given dataset, since its rules carry physical meaning; third, a genetic ant colony algorithm (GACA) was introduced, and a fitness scaling strategy and a chaotic operator were incorporated into it, forming a new algorithm, fitness-scaling chaotic GACA (FSCGACA), which was used to seek the optimal parameters of the rule-based model; and finally, the stratified K-fold cross-validation technique was used to enhance the generalization of the model. Simulation experiments on data from 1000 corporations collected from 2006 to 2009 demonstrated that the proposed model was effective. It selected the 5 most important factors as “net income to stock broker’s equality,” “quick ratio,” “retained earnings to total assets,” “stockholders’ equity to total assets,” and “financial expenses to sales.” The total misclassification error of the proposed FSCGACA was only 7.9%, lower than that of the genetic algorithm (GA), the ant colony algorithm (ACA), and GACA. The average computation time of the model is 2.02 s.


Introduction
Corporate bankruptcy is an economic phenomenon of great importance. The health and success of businesses are of widespread concern to policy makers, industry participants, investors, managers, and consumers [1]. It is a problem that affects the economy on a global scale. Accurate prediction of the number and probability of failing firms serves as an index of the development and robustness of a country's economy [2,3]. The high individual, economic, and social costs of corporate failures or bankruptcies have spurred searches for better understanding and prediction capabilities [4].
Bankruptcy prediction is the technique of predicting bankruptcy and various measures of financial distress of public firms [5]. It is a vast area of finance and accounting research. The quantity of research is also a function of the availability of data; for public firms that did or did not go bankrupt, numerous accounting ratios that might indicate danger can be calculated, and numerous other potential explanatory variables are also available.
1.1. Previous Works. Historically, numerous methods have been developed for predicting bankruptcy. Early research focused primarily on one-variable models such as financial ratios. The ratios were used individually, and a cutoff score was established for each ratio based on minimizing misclassification. The one-variable methods were later criticized, in spite of their considerable results, because the ratios are correlated with one another and because different ratios can give conflicting signals for the same firm [1].
Later research turned to multivariable models that used statistical techniques such as multiple discriminant analysis (MDA) [6], logit [7], and quadratic interval logit [8]. Recently, research has shown that artificial intelligence methods such as feedforward neural networks (FNNs) can be an alternative methodology for classification problems to which traditional statistical methods have long been applied [9].

Our Contribution
1.2.1. Feature Selection. Large, high-dimensional data sets are common in the financial field. High-dimensional data presents many challenges for analysis; a fundamental challenge is the so-called curse of dimensionality. Observations in a high-dimensional space are sparser and less representative than those in a low-dimensional space [10]. In this paper, we use the sequential feature selection method [11] to select only 5 features out of the original 20 features.

Rule-Based Model.
Although studies and experiments demonstrate the usefulness of FNNs, there are some shortcomings in building and using the model. First, it is not easy for users to find an appropriate FNN model that can represent the problem characteristics, since this involves choosing network architectures, learning methods, and parameters. Second, the FNNs must be retrained even when the data change only slightly. Third and most important, the user cannot readily comprehend the final rules that the NN models acquire. For this reason, FNNs are referred to as "black boxes." We have found that the "black box" problem can be solved successfully using a rule-based approach, which is capable of extracting classification rules that are easy for users to recognize [12]. Rule-based solutions are widely used for classification problems, through either supervised or unsupervised learning.

Fitness-Scaling Chaotic GACA.
Zhang and Wu proposed the genetic ant colony algorithm (GACA), which combines the genetic algorithm (GA) and the ant colony algorithm (ACA) [3]. The GACA performs well at finding global optima, yet it is easily trapped in local minima under some extreme conditions. In order to improve the performance of GACA, two improvements are introduced. (1) The traditional selection function uses fitness values to select the individuals of the next generation. It assigns a higher probability of selection to individuals with higher fitness values. However, individuals with smaller fitness values then have little chance of being selected, which forces the population to gather near the best individual. The power-rank scaling method adjusts the fitness values in order to keep the population diverse. (2) Chaos is introduced to improve the robustness of basic GACA, considering its outstanding ability to jump out of stagnation. The improved algorithm is called fitness-scaling chaotic GACA (FSCGACA).

Cross-Validation.
Constructing the best rule is a challenge due to the following two problems. The first is the overfitting problem, namely, the rules fit the training data well but perform poorly out of sample. The other is the underfitting problem: the optimization algorithm may fail to determine the global minimum because it can be misled by local minima [13].
The solution to the first problem is to use cross-validation [14], which divides the dataset into a training subset and a validation subset. The validation subset is then used to monitor overfitting. The solution to the second problem is to develop a novel, powerful global optimization method.
1.3. Structure. The rest of this paper is organized as follows. Section 2 introduces the basic concepts of ACA and GACA. Section 3 proposes a novel fitness-scaling chaotic genetic ant colony algorithm (FSCGACA) and gives its pseudocode. Section 4 discusses the sequential feature selection method, particularly sequential forward selection. Section 5 employs the rule-based model in the application of bankruptcy prediction. Section 6 presents the stratified K-fold cross-validation used to avoid overfitting. Experiments in Section 7 show every step of the proposed bankruptcy prediction system and compare the proposed FSCGACA with GA, ACA, and GACA in terms of classification accuracy and computation time. In addition, we demonstrate the necessity of feature selection and compare the proposed rule-based model with the FNN model. Finally, Section 8 is devoted to conclusions.

Background
As GA is well known to the readers, we discussed the basic concepts of ACA and GACA in this section.

Introduction of ACA.
ACA is an algorithm that simulates the behavior of real ants, which rapidly establish the shortest route from a food source to their nest and vice versa [12]. Ants begin by randomly searching for food in the area surrounding their nest. When an individual ant encounters food along its path, it deposits a small quantity of pheromone at that location [15]. Other ants in the neighborhood detect this marked pheromone trail. As more ants follow the pheromone-rich trail, the probability of the trail being followed by other ants is further enhanced by increased pheromone deposition [16]. This autocatalytic process, reinforced by a positive feedback mechanism, helps the ants to establish the shortest route. The procedure of the algorithm is as follows.
Suppose an undirected graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of arcs connecting the nodes. The density of the nodes determines both the precision of a solution and the memory and computation time demands of the algorithm. All arcs in $E$ are initialized with a small amount of pheromone $\tau_0$. The target is to find the shortest path from the source node $v_1$ to the destination node $v_2$.
In the second step, $N$ ants are sequentially launched from $v_1$, where $N$ is the number of ants in the colony. Each ant walks pseudorandomly from node to node via connecting arcs until $v_2$ or a dead end is reached. When deciding which node $j$ to go to from a specific node $i$, the probability $p_{ij}$ is assigned as follows:
$$p_{ij}(t) = \frac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}(t)]^{\beta}}{\sum_{l} [\tau_{il}(t)]^{\alpha}\,[\eta_{il}(t)]^{\beta}}, \qquad (1)$$
where the sum runs over the nodes $l$ reachable from $i$. Here, the trail level $\tau_{ij}(t)$ is the amount of pheromone currently available at step $t$ in the arc from node $i$ to node $j$. It indicates how proficient the ant has been in the past in making the move from $i$ to $j$. The attractiveness $\eta_{ij}(t)$ is the desirability of the move from $i$ to $j$. Parameters $\alpha$ and $\beta$ control the relative importance of trail level and attractiveness, respectively. The trail levels of all arcs are updated according to moves that were part of "good" or "bad" solutions. Consider that
$$\tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \Delta\tau_{ij}(t). \qquad (2)$$
Here, $\rho$ denotes the pheromone evaporation coefficient, $\Delta\tau_{ij}(t)$ is the pheromone deposited on arc $(i, j)$ at step $t$, and $Q$ denotes a pheromone constant that scales this deposit. Pheromone evaporation also has the advantage of avoiding convergence to a locally optimal solution. If there were no evaporation at all, the paths chosen by the first ants would tend to be excessively attractive to the following ones. At any iteration, the best route is calculated from the $N$ routes. The pheromones of the best route are reinforced while the others evaporate. It should be noted that local updates exist in some models; however, the models with local updates cannot guarantee convergence.
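The node-selection rule and the trail update described above can be sketched in Python. This is a minimal illustration, assuming the standard ant-system form in which the probability of a move is proportional to the trail level raised to one exponent times the attractiveness raised to another; the function names and the uniform fallback are hypothetical and not taken from the paper.

```python
def transition_probabilities(tau, eta, alpha=1.0, beta=1.0):
    """Probability of moving to each candidate node.

    tau[j] is the trail level and eta[j] the attractiveness of the arc
    leading to candidate j; alpha and beta weight the two terms.
    """
    weights = [(t ** alpha) * (e ** beta) for t, e in zip(tau, eta)]
    total = sum(weights)
    if total == 0:
        # Assumption: with no information at all, choose uniformly.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def evaporate_and_deposit(tau, rho, deposit):
    """Evaporate a fraction rho of every trail, then add the deposits.

    deposit[j] is the amount added to arc j (zero for arcs that are not
    part of a reinforced route).
    """
    return [(1.0 - rho) * t + d for t, d in zip(tau, deposit)]
```

With equal attractiveness, an arc carrying three times the pheromone of its alternative is chosen three times as often, which is the positive feedback the text describes.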
It should be noted that combinatorial optimization problems can be solved directly by ACA; however, our task is a continuous-space optimization problem, so ACA cannot be applied to it directly. A special coding strategy can be used to transform the continuous space into a route search problem [17]. Figure 1 gives a simple example that codes the value 4.85 as a route.
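One way to picture the coding strategy is as a path through layers of ten nodes, one layer per decimal digit. The sketch below is an assumption about how such a coding could work, not the exact scheme of Figure 1: it handles non-negative values only, and the helper names and the `int_digits` parameter are hypothetical.

```python
def encode_route(value, n_digits, int_digits=1):
    """Turn a non-negative number into a list of digits, one node per layer."""
    scaled = round(value * 10 ** (n_digits - int_digits))
    digits = []
    for _ in range(n_digits):
        digits.append(scaled % 10)  # least significant digit first
        scaled //= 10
    return digits[::-1]


def decode_route(digits, int_digits=1):
    """Recover the number represented by a route of digits."""
    value = 0
    for d in digits:
        value = value * 10 + d
    return value / 10 ** (len(digits) - int_digits)
```

Under these assumptions, the value 4.85 corresponds to visiting node 4 in the first layer, node 8 in the second, and node 5 in the third.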

Introduction of GACA.
ACA converges relatively slowly due to the lack of pheromones during the initial stages. The ants search tediously at first, and once the pheromones have accumulated to some degree, the ants converge to the optimal location relatively quickly. Conversely, GA converges toward the optimal solutions quickly at first, after which the population oscillates near the optimal solution.
Combining the advantages of the two algorithms, the GACA was proposed by Zhang and Wu [3]. It can be divided into two stages. In the coarse-searching stage, GA approximates the neighborhood of the optimal point. The process is repeated for MaxGAEpoch iterations. In the fine-searching stage, ACA seeks the exact position of the optimal point. The concept and flowchart of GACA are depicted in Figure 2.
The parameter MaxGAEpoch is determined through trial and error. If it is too large, the GA will be excessively time-consuming. Conversely, if it is too small, GA does not approximate the optimal area and the algorithm may be misled by local minima. Setting MaxGAEpoch to 10 is appropriate to produce both efficient and inclusive results. The ACA terminates when any of the following criteria is met: the maximum number of ACA epochs (MaxACAEpoch), the maximum stagnation (MaxStag), or the fitness tolerance (FitTol). These values are also determined by trial and error.
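The two-stage control flow can be sketched as follows. This is only an outline under stated assumptions: `ga_step` and `aca_step` are hypothetical callables standing in for one epoch of each algorithm, fitness is minimized, FitTol is interpreted as a threshold on the best fitness value, and the default values other than MaxGAEpoch = 10 are placeholders rather than the paper's settings.

```python
def run_gaca(population, ga_step, aca_step, fitness,
             max_ga_epoch=10, max_aca_epoch=100, max_stag=20, fit_tol=1e-6):
    """Run GA for a fixed number of epochs, then ACA until a stop rule fires."""
    # Coarse-searching stage: GA moves the population toward promising areas.
    for _ in range(max_ga_epoch):
        population = ga_step(population)

    # Fine-searching stage: ACA refines the candidates found by GA.
    best = min(population, key=fitness)
    stagnation = 0
    for _ in range(max_aca_epoch):           # MaxACAEpoch criterion
        population = aca_step(population)
        candidate = min(population, key=fitness)
        if fitness(candidate) < fitness(best):
            best, stagnation = candidate, 0
        else:
            stagnation += 1
        if stagnation >= max_stag:            # MaxStag criterion
            break
        if fitness(best) <= fit_tol:          # FitTol criterion (assumed meaning)
            break
    return best
```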

Fitness-Scaling Chaotic GACA
GACA has been shown to perform better than GA and ACA in both theory and simulation [18]. We improve it further in the following two aspects. Fitness scaling converts the raw fitness scores that are returned by the fitness function to values in a range that is suitable for the selection function [19]. The selection function uses the scaled fitness values to select the individuals of the next generation. The selection function assigns a higher probability of selection to individuals with higher scaled values [20]. There exist numerous fitness scaling methods, four popular ones of which are shown in Table 1.
Among those fitness scaling methods, power scaling finds a solution nearly the most quickly owing to its improvement of diversity, but it suffers from instability [21]. Meanwhile, rank scaling shows stability on different types of tests [22]. Therefore, a new power-rank scaling method was proposed, combining both the power and rank strategies as follows:
$$f_i = \frac{r_i^{k}}{\sum_{j=1}^{N} r_j^{k}}, \qquad (3)$$
where $r_i$ is the rank of the $i$th individual/ant and $N$ is the population size. Our strategy is a three-step process. First, all individuals/ants are sorted to obtain the corresponding ranks. Second, the ranks are raised to the power of the exponent $k$. Third, the scaled values are normalized by dividing by the sum of the scaled values over the entire population. Chaos theory is epitomized by the so-called butterfly effect established by Lorenz [23]. Attempting to simulate a global weather system numerically, Lorenz discovered that minute changes in initial conditions steered subsequent simulations towards radically different final results, rendering long-term prediction impossible in general [24]. Sensitive dependence on initial conditions is observed not only in complex systems but also in the simple logistic equation. The well-known logistic equation is
$$x_{n+1} = 4\,x_n\,(1 - x_n), \qquad (4)$$
where $x_0 \in (0, 1)$ and $x_0 \notin \{0.25, 0.5, 0.75\}$. The chaotic series can be used to generate the mutation operation. In all, the pseudocode of FSCGACA is listed as follows.
Step 3 (elitist selection). Select the best $p_e \times N$ individuals to replace the same number of worst individuals.
Step 4 (crossover). Choose $p_c \times N$ individuals and do one-point crossover. That is, a locus is chosen randomly, and all data beyond the locus in either parent is swapped. The resulting individuals are the offspring.
Step 5 (mutation). Choose $p_m \times N$ individuals and do uniform mutation. This operator replaces the value of the chosen individual with the chaotic number generated by formula (4) and maps it to the user-specified upper and lower bounds.
Step 11 (path selection). For each ant, select the new path by formula (1).
Step 14. If termination conditions are met, jump to Step 15; otherwise, jump to Step 11.
Step 15. Select and output the best route $x^{*}(t)$.
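The two modifications that distinguish FSCGACA, power-rank scaling and chaotic mutation, can be sketched in Python as follows. The sketch assumes that scaling raises each rank to an exponent and normalizes by the population sum, as the three-step description states, and that the logistic map is used with coefficient 4. Which end of the ordering receives rank 1 is not stated in the text, so the choice below (rank 1 for the smallest raw value) is an assumption, as are all function names.

```python
def power_rank_scale(raw_fitness, k):
    """Scale raw fitness values by rank**k, normalized to sum to 1."""
    n = len(raw_fitness)
    # Step 1: sort to obtain ranks (assumption: rank 1 = smallest raw value).
    order = sorted(range(n), key=lambda i: raw_fitness[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    # Step 2: raise the ranks to the power k.
    powered = [r ** k for r in ranks]
    # Step 3: normalize by the sum over the whole population.
    total = sum(powered)
    return [v / total for v in powered]


def logistic_series(x0, length):
    """Chaotic sequence from the logistic map x_{n+1} = 4 x_n (1 - x_n)."""
    if not 0.0 < x0 < 1.0 or x0 in (0.25, 0.5, 0.75):
        raise ValueError("x0 must lie in (0, 1) and avoid 0.25, 0.5, 0.75")
    series, x = [], x0
    for _ in range(length):
        x = 4.0 * x * (1.0 - x)
        series.append(x)
    return series


def chaotic_mutation(chaos_value, lower, upper):
    """Map a chaotic number in [0, 1] onto the interval [lower, upper]."""
    return lower + chaos_value * (upper - lower)
```

Because the scaled values depend only on ranks, a single individual with an extreme raw fitness cannot dominate selection, which is the diversity argument made for the scaling step.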

Feature Selection
Classification methods often begin with some type of dimension reduction, by which high-dimensional data are approximated by points in a lower-dimensional space. In this paper, a sequential feature selection method [25] was applied. The objective function in feature selection is called the criterion. Feature selection seeks to minimize the criterion over all feasible feature subsets. Common criteria include "mean squared error" and "misclassification rate" for regression models and classification models, respectively.
Sequential feature selection has two variants, shown in Figure 3. In sequential forward selection (SFS), features are sequentially added to an empty candidate set until the addition of further features does not decrease the criterion [26]. In sequential backward selection (SBS), features are sequentially removed from a full candidate set until the removal of further features increases the criterion [27].
In this paper, since there are as many as 20 original features, we use SFS to determine the important features. We defined the criterion as the deviance of the fit using the "binomial" distribution [28]. We use the binomial criterion because the prediction has two outcomes (either bankrupt or nonbankrupt), and we use "deviance of fit" since it can represent the classification margin while the "misclassification rate" cannot [29].
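The greedy loop behind SFS can be sketched in a few lines of Python. The sketch is generic: `criterion` is a hypothetical callable that scores a list of feature indices (lower is better), standing in for the binomial deviance of a fitted model, which is not computed here.

```python
def sequential_forward_selection(n_features, criterion):
    """Add the feature that lowers the criterion most; stop when none does."""
    selected = []
    best_score = float("inf")
    remaining = set(range(n_features))
    while remaining:
        # Score every candidate set obtained by adding one more feature.
        scored = [(criterion(selected + [f]), f) for f in sorted(remaining)]
        score, feature = min(scored)
        if score >= best_score:
            break  # no addition decreases the criterion any further
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score
```

The final pass through the loop adds nothing; it only confirms that every remaining feature would fail to lower the criterion, so selecting five features takes six passes.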

Bankruptcy Rule
If $(\mathrm{CV}_1^{\min} \le x_1 \le \mathrm{CV}_1^{\max})$ and $(\mathrm{CV}_2^{\min} \le x_2 \le \mathrm{CV}_2^{\max})$ and $\cdots$ and $(\mathrm{CV}_n^{\min} \le x_n \le \mathrm{CV}_n^{\max})$, then the firm will go bankrupt, where $n$ is the number of attributes and $\mathrm{CV}_i^{\min}$ and $\mathrm{CV}_i^{\max}$ are the minimum and maximum bounds of the $i$th attribute $x_i$, respectively. The rule is then encoded as in Table 2.
We choose the misclassification error (ME) as the fitness evaluation, defined as the ratio of the number of misclassified firms to the number of all firms. The goal is to find the optimal parameters in Table 2 that make the ME as small as possible. Classification accuracy (CA) is defined as the ratio of the number of correctly classified firms to the number of all firms. The sum of ME and CA is 1.
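As an illustration, the interval rule and its misclassification error can be evaluated as follows. This is a minimal sketch; the function names are hypothetical, and labels are assumed to be 1 for bankrupt and 0 for nonbankrupt.

```python
def rule_predicts_bankrupt(x, cv_min, cv_max):
    """True when every attribute lies inside its [cv_min, cv_max] interval."""
    return all(lo <= v <= hi for v, lo, hi in zip(x, cv_min, cv_max))


def misclassification_error(samples, labels, cv_min, cv_max):
    """Fraction of firms whose predicted status differs from the label."""
    wrong = sum(
        rule_predicts_bankrupt(x, cv_min, cv_max) != bool(y)
        for x, y in zip(samples, labels)
    )
    return wrong / len(labels)


def classification_accuracy(samples, labels, cv_min, cv_max):
    """CA is the complement of ME."""
    return 1.0 - misclassification_error(samples, labels, cv_min, cv_max)
```

An optimizer such as FSCGACA would treat the concatenated bounds as the decision vector and `misclassification_error` as the fitness to minimize.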

Stratified K-fold Cross-Validation
Typically, a statistical model that deals with the inherent data variability is inferred from the database (i.e., the training set) and employed by statistical learning machines for the automatic construction of classifiers. The model has a set of adjustable parameters that are estimated in the learning phase using a set of examples. Nevertheless, the learning machine must ensure a reliable estimation of the parameters and consequently good generalization, that is, correct responses to new examples. Hence, the learning device must efficiently find a trade-off between its complexity, which is measured by several variables, such as the effective number of free parameters of the classifier and the feature input space dimension, and the information on the problem given by the training set (e.g., measured by the number of samples). Cross-validation methods are usually employed to assess the statistical relevance of the classifiers. There are four types: random subsampling, K-fold cross-validation, leave-one-out validation, and Monte Carlo cross-validation [30]. K-fold cross-validation is applied here because it is simple and uses all data for both training and validation. The mechanism is to create a K-fold partition of the whole dataset, repeat K times the use of K − 1 folds for training and the remaining fold for validation, and finally average the error rates of the K experiments. The schematic diagram of 5-fold cross-validation is shown in Figure 4.
The K folds can be partitioned purely at random; however, some folds may then have quite different distributions from other folds. Therefore, the stratified K-fold cross-validation was employed, in which every fold has nearly the same class distribution [31]. The folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.
(Table 3, partial rows: retained earnings to total assets; $x_5$, stockholders' equity to total assets; $x_{12}$, financial expenses to sales.)
Another challenge was to determine the number of folds. If K is set too large, the bias of the true error rate estimator will be small, but the variance of the estimator will be large and the computation will be time-consuming. Alternatively, if K is set too small, the computation time will decrease and the variance of the estimator will be small, but the bias of the estimator will be large [32]. In this study, we empirically determined K to be 5 through trial and error.
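A stratified partition can be built by dealing the samples of each class round-robin into the folds. The sketch below is one simple way to do this under stated assumptions, not necessarily the procedure used in the experiments; `fit` and `error` are hypothetical callables supplied by the user, and the fixed seed is only there to make the shuffle repeatable.

```python
import random


def stratified_k_fold(labels, k, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validated_error(labels, k, fit, error, seed=0):
    """Average the validation error over k train/validation splits."""
    folds = stratified_k_fold(labels, k, seed)
    errors = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        model = fit(train)                    # train on K - 1 folds
        errors.append(error(model, held_out)) # validate on the remaining fold
    return sum(errors) / k
```

For a balanced two-class dataset and K = 5, every fold produced this way contains equal numbers of bankrupt and nonbankrupt firms.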

Experiments and Discussions
The program was developed in-house in MATLAB 2013a and run on an IBM P4 machine with a 2 GHz processor and 1 GB of RAM. The data set contains 1000 externally audited midsized manufacturing firms, 500 of which filed for bankruptcy during the period 2006-2009 and 500 of which did not. Each observation contains 21 financial indicators, of which 20 variables are financial statistical measurements of the corporation and the last variable indicates bankruptcy status.
7.1. Feature Selection. In the feature selection stage, we normalized all 20 variables to the range [0, 1] and then selected 5 financial indicators using the SFS method. The algorithm consists of only 6 steps, in which variables 5, 1, 2, 4, and 12 are sequentially added to the model, as shown in Figure 5. The criterion value decreases gradually as the SFS progresses until any addition of a new variable would increase the criterion value. The physical meanings of the 5 selected variables are listed in Table 3. Other literature discussing bankruptcy also includes those 5 features [33,34], which supports the effectiveness of SFS and the importance of the 5 selected features.

Algorithm Comparison.
In this section, the rule-based model was established based on the 5 selected variables.
The proposed FSCGACA was employed to optimize the parameters of the rule-based model. In addition, we ran GA [34], ACA [3], and GACA [3] for comparison. The parameters and termination criteria of all algorithms were obtained through trial and error and are listed in Table 4.
Each algorithm was run 100 times to reduce the effect of randomness. The average of the results of the 100 runs is listed in Table 5. It shows the rule model established by each algorithm and the corresponding classification accuracy. The proposed FSCGACA performs best and achieves the highest CA of 92.1%, followed by the GACA with a CA of 91.8%. The ACA achieves a CA of 91.4%. The GA is the worst algorithm, with a CA of 91.1%. The rule-based model established by the proposed FSCGACA can be translated as follows: if "net income to stock broker's equality" is between 0.1324 and 0.6573 and "quick ratio" is between 0.0257 and 0.8038 and "retained earnings to total assets" falls between 0.0138 and 0.8957 and "stockholders' equity to total assets" falls between 0.0226 and 0.8168 and "financial expenses to sales" falls between 0.0522 and 0.5805, then the firm will go bankrupt.

Convergence Performance.
A typical run is shown in Figure 6. The convergence curve of the proposed FSCGACA is distinct from the others. In the beginning (from the 1st to the 25th epoch), the curve exhibits the slowest decline, followed by the fastest decline (from the 25th to the 75th epoch) until the global minimal point is found (after the 75th epoch). This pattern of decline agrees with our expectation. FSCGACA spends more individuals on exploring areas close to global minimal points during the coarse-searching stage, so it does not perform as well as expected in terms of the best fitness value, but it locates more potential areas. Afterwards, the fitness curve exhibits the sharpest decline when those areas are exploited in the fine-searching stage. Subsequently, the FSCGACA becomes dominant among all algorithms from the 65th epoch. In all, Figure 6 indicates that the FSCGACA regulates the tradeoff between exploration and exploitation efficiently, so it outperforms the other algorithms, including GA, ACA, and GACA.

Computation Time Comparison.
The distribution of the computation time of the 100 runs is shown in Figure 7. The central mark denotes the median, the edges of the box denote the 25th and 75th percentiles, the whiskers extend to the most extreme data points, and the outliers are plotted with the plus symbol. The average computation times of the GA, ACA, GACA, and FSCGACA are 1.91, 2.07, 2.01, and 2.02 s, respectively.
The GA costs the least time, while the ACA costs the most, because GA generates new individuals using either crossover (1 operation generates 2 offspring) or mutation (1 operation generates 1 offspring), whereas the ACA generates a new path by combining arcs selected by formula (1). Since the precision of our model is $10^{-4}$, a new path must combine 4 arcs, that is, formula (1) is repeated four times, as shown in Figure 8. Therefore, ACA costs more time than GA. The GACA and FSCGACA run GA first, followed by ACA, so their computation times lie between those of GA and ACA.
The omission of feature selection enlarges the parameter space by adding 20 logical variables t and increasing the number of antecedent elements CV from 5 to 20, so the optimization algorithms become unstable, time-consuming, and prone to stagnation. We ran the proposed FSCGACA with and without feature selection 100 times each. The averaged classification accuracy of FSCGACA without feature selection is only 19.2%, compared with 92.1% with feature selection. Therefore, feature selection is essential and effective.

Comparison with Neural Network.
In this section, we use an FNN with structure {5-4-1} to predict bankruptcy. The number of input neurons is set to 5 because only 5 features are selected (Table 3). The number of hidden neurons is chosen by the Bayesian probability method [35]. The number of output neurons is set to 1 because this is a binary classification problem, for which a single output neuron is able to indicate bankruptcy or not. The Levenberg-Marquardt method [36] was used to train the neural network. The convergence plot of the neural network is shown in Figure 9. We found that at the 5th epoch the validation error begins to increase, which is an indication of overfitting.
After training, all data are submitted to the FNN. The total misclassification error is 7.4%, better than the 7.9% of our method. This raises the question "why not use the FNN?" The answer, as stated in the introduction, is that the FNN does not have an explicit physical meaning due to the nonlinear interaction between the input layer and the hidden layer. Although the performance of the proposed rule-based method is slightly lower, it provides a detailed physical model for economists to analyze (Table 5).

Conclusion
In this paper, we proposed a novel system to predict corporate bankruptcy. The procedure consists of four stages: first, sequential forward selection was used to extract the 5 most important features out of 20; second, a rule-based model was used to approximate the given dataset; third, the proposed FSCGACA algorithm was used to find the optimal parameters of the model; and fourth, the cross-validation technique was employed to prevent overfitting. The importance of feature selection was demonstrated in the Experiments and Discussions section. If we omit feature selection, the classification accuracy of final system

Figure 1: Coding strategy for ACA solving continuous optimization.

Figure 2: The concept and flowchart of GACA.
Step 10 (transform). Generate a 10 × 10 × ⋅ ⋅ ⋅ × 10 graph according to the problem precision. Individuals are transformed to $N$ ants. The paths corresponding to the values of individuals $x_i(t)$ (see Figure 1) are spread with pheromones $\tau$ in the amount of $f[x_i(t)]$. The heuristic function values $\eta$ are set equal to the scaled fitness values $f[x_i(t)]$.

Figure 3: Two variants of sequential feature selection: the sequential forward selection and sequential backward selection.
$$\mathrm{ME} = \frac{\text{Number of misclassified firms}}{\text{Number of all firms}}.$$

Table 1: Summaries of fitness scaling techniques.
Linear scaling: $f_{\text{linear}} = a + b \times f_{\text{raw}}$; a and b are constants defined by the users.
Rank scaling: $f_{\text{rank}} = r$; r denotes the rank of the individual/ant.
Power scaling: $f_{\text{power}} = f_{\text{raw}}^{\,p}$; p is a problem-dependent exponent that might change during a run to stretch or shrink the range as needed.
Top scaling: $f_{\text{top}} = s$ if $f_{\text{raw}} \ge c$ and $f_{\text{top}} = 0$ if $f_{\text{raw}} < c$; s is the user-defined constant; c is the threshold.

Step 1 (parameter setting). Determine the population size $N$, crossover probability $p_c$, mutation probability $p_m$, elite selection probability $p_e$, trail level factor $\alpha$, attractiveness factor $\beta$, pheromone evaporation coefficient $\rho$, pheromone constant $Q$, power scale factor $k$, and initial logistic point $x_0$, and set the iteration epoch $t = 0$.
Step 2 (initialization). Generate feasible solutions $x_i(t)$, $i \in [1, 2, \ldots, N]$, randomly. Their corresponding fitness values $f_{\text{raw}}[x_i(t)]$ are evaluated and scaled by formula (3) as $f[x_i(t)]$.

Table 3: Five selected variables by SFS method.

Table 5: Rules extracted by algorithms (average of 100 runs).

Table 6: Encoding mechanism without feature selection.
Figure 8: GA versus ACA in terms of generating new offspring/paths.
7.5. Effect of Feature Selection. In this section, we discuss the advantages of feature selection. If we omit feature selection in the rule-based model, the coding strategy will include 20 logical indication elements $t = [t_1, t_2, \ldots, t_{20}]$, in which 1 indicates that the corresponding feature is included and 0 denotes that it is neglected. Table 6 gives the encoding mechanism without using feature selection.
Bankruptcy Rule with Logical Indicators
If $[(\mathrm{CV}_1^{\min} \le x_1 \le \mathrm{CV}_1^{\max})$ or $(!t_1)]$ and $[(\mathrm{CV}_2^{\min} \le x_2 \le \mathrm{CV}_2^{\max})$ or $(!t_2)]$ and $\cdots$ and $[(\mathrm{CV}_n^{\min} \le x_n \le \mathrm{CV}_n^{\max})$ or $(!t_n)]$, then the firm will go bankrupt.
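The rule with logical indicators can be sketched in Python as below. The sketch assumes that each interval test is paired with its own indicator, so that a feature whose indicator is 0 never blocks the rule; the function names are hypothetical.

```python
def rule_with_indicators(x, cv_min, cv_max, t):
    """Interval rule in which feature i is checked only when t[i] is 1."""
    return all(
        (not use) or (lo <= v <= hi)
        for v, lo, hi, use in zip(x, cv_min, cv_max, t)
    )


def n_rule_parameters(n_features, with_indicators):
    """Two bounds per feature, plus one indicator per feature if used."""
    return 2 * n_features + (n_features if with_indicators else 0)
```

Counting two bounds per feature, the search space grows from 10 parameters with the 5 selected features to 60 when all 20 features and their indicators are optimized together, which is consistent with the instability reported for the variant without feature selection.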