An Analytic Hierarchy Model for Classification Algorithms Selection in Credit Risk Analysis

This paper proposes an analytic hierarchy model (AHM) to evaluate classification algorithms for credit risk analysis. The proposed AHM consists of three stages: data mining stage, multicriteria decision making stage, and secondary mining stage. For verification, 2 public-domain credit datasets, 10 classification algorithms, and 10 performance criteria are used to test the proposed AHM in the experimental study. The results demonstrate that the proposed AHM is an efficient tool to select classification algorithms in credit risk analysis, especially when different evaluation algorithms generate conflicting results.


Introduction
The main objective of credit risk analysis is to classify samples into good and bad groups [1,2]. Many classification algorithms have been applied to credit risk analysis, such as decision tree, K-nearest neighbor, support vector machine (SVM), and neural network [3][4][5][6][7][8][9]. How to select the best classification algorithm for a given dataset is an important task in credit risk prediction [10][11][12]. Wolpert and Macready [13] pointed out in their no free lunch (NFL) theorem that there exists no single algorithm or model that could achieve the best performance for a given problem domain [14,15]. Thus, a ranked list of algorithms is more useful than a search for a single best-performing algorithm for a particular task. Algorithm ranking normally needs to examine several criteria, such as accuracy, misclassification rate, and computational time. Therefore, it can be modeled as a multicriteria decision making (MCDM) problem [16].
This paper develops an analytic hierarchy model (AHM) to select classification algorithms for credit risk analysis. It constructs a performance score to measure the performance of classification algorithms and ranks algorithms using multicriteria decision analysis (MCDA). The proposed AHM consists of three hierarchy stages: data mining (DM) stage, MCDM stage, and secondary mining stage. An experimental study, which selects 10 classic credit risk evaluation classification algorithms (e.g., decision trees, K-nearest neighbors, support vector machines, and neural networks) and 10 performance measures, is designed to verify the proposed model over 2 public-domain credit datasets.
The remaining parts of this paper are organized as follows: Section 2 briefly reviews related work. Section 3 describes some preliminaries. Section 4 presents the proposed AHM. Section 5 describes experimental datasets and design and presents the results. Section 6 concludes the paper.

Related Work
Classification algorithm evaluation and selection is an active research area in the fields of data mining and knowledge discovery (DMKD), machine learning, artificial intelligence, and pattern recognition. Driven by strong business benefits, many classification algorithms have been proposed for credit risk analysis in the past few decades [17][18][19][20][21][22], which can be summarized into four categories: statistical analysis (e.g., discriminant analysis and logistic regression), mathematical programming analysis (e.g., multicriteria convex quadratic programming), nonparametric statistical analysis (e.g., recursive partitioning, goal programming, and decision trees), and artificial intelligence modeling (e.g., support vector machines, neural networks, and genetic algorithms).
The advantages of applying classification algorithms for credit risk analysis include the following. First, it is difficult for traditional methods to handle large databases, while classification algorithms, especially artificial intelligence modeling, can be used to quickly predict credit risk even when the dataset is huge. Second, classification algorithms may provide higher prediction accuracy than traditional approaches [23]. Third, decision making based on the results of classification algorithms is objective, reducing the influence of human biases.
However, the no free lunch theorem states that no algorithm can outperform all other algorithms when performance is averaged over all possible problems. Many studies indicate that classifiers' performances vary under different datasets and circumstances [24][25][26]. How to provide a comprehensive assessment of algorithms is therefore an important question. Algorithm evaluation and selection normally need to examine multiple criteria. Therefore, classification algorithm evaluation and selection can be treated as an MCDM problem, and MCDM methods can be applied to systematically choose the appropriate algorithms [16].
As defined by the International Society on Multiple Criteria Decision Making, MCDM is the study of methods and procedures by which concerns about multiple conflicting criteria can be formally incorporated into the management planning process [27,28]. MCDM is concerned with the elucidation of the levels of preference of decision alternatives, through judgments made over a number of criteria [29,30]. MCDM methods have been developed and applied in evaluation and selection of classification algorithms. For instance, Nakhaeizadeh and Schnabl [31] suggested a multicriteria-based measure to compare classification algorithms. Smith-Miles [32] considered the algorithm evaluation and selection problem as a learning task and discussed the generalization of metalearning concepts. Peng et al. [33] applied MCDM methods to rank classification algorithms. However, these research efforts face the difficulty that different MCDM methods produce conflicting rankings. This paper proposes and develops AHM, a unified framework based on MCDM and DM, to identify robust classification algorithms, especially when different evaluation algorithms generate conflicting results.

Performance Measures.
This paper utilizes the following ten commonly used performance measures [33,35].
(i) Overall accuracy (Acc): accuracy is the percentage of correctly classified instances. It is one of the most widely used classification performance metrics:
$$\mathrm{Acc} = \frac{TN + TP}{TP + FP + FN + TN},$$
where TN, TP, FN, and FP stand for true negative, true positive, false negative, and false positive, respectively.
(ii) True positive rate (TPR): TPR is the proportion of positive (abnormal) instances that are correctly classified. TPR is also called the sensitivity measure:
$$\mathrm{TPR} = \frac{TP}{TP + FN}.$$
(iii) True negative rate (TNR): TNR is the proportion of negative (normal) instances that are correctly classified. TNR is also called the specificity measure:
$$\mathrm{TNR} = \frac{TN}{TN + FP}.$$
(iv) Precision: this is the proportion of instances classified as positive that actually are positive:
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
(v) The area under the receiver operating characteristic curve (AUC): ROC stands for receiver operating characteristic, which shows the tradeoff between TP rate and FP rate. AUC represents the accuracy of a classifier. The larger the area, the better the classifier.
(vi) F-measure: it is the harmonic mean of precision and recall, where recall is the TPR. F-measure has been widely used in information retrieval:
$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
(viii) Kappa statistic (Kaps): this is a classifier performance measure that estimates the similarity between the members of an ensemble in multiclassifier systems:
$$\mathrm{Kaps} = \frac{P(A) - P(E)}{1 - P(E)},$$
where $P(A)$ is the accuracy of the classifier and $P(E)$ is the probability that agreement among classifiers is due to chance.
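As an illustration of the confusion-matrix measures in this list, the following minimal Python sketch computes Acc, TPR, TNR, precision, F-measure, and the kappa statistic from the four counts. It assumes the standard definitions of these measures. The function name `confusion_metrics` is hypothetical, and the chance-agreement term is computed as in Cohen's kappa between predicted and actual labels, which is an assumption rather than a detail stated in the paper. The sketch does not guard against empty classes (zero denominators).

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute closed-form performance measures from confusion-matrix counts."""
    n = tp + fp + fn + tn
    acc = (tp + tn) / n                  # overall accuracy
    tpr = tp / (tp + fn)                 # sensitivity (recall)
    tnr = tn / (tn + fp)                 # specificity
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    # Assumption: chance agreement as in Cohen's kappa, i.e. the sum over
    # both classes of (predicted share) * (actual share).
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (acc - p_e) / (1 - p_e)
    return {
        "Acc": acc,
        "TPR": tpr,
        "TNR": tnr,
        "Precision": precision,
        "F-measure": f_measure,
        "Kaps": kappa,
    }
```

For the hypothetical counts TP = 40, FP = 10, FN = 20, TN = 30, this gives an accuracy of 0.7, a precision of 0.8, and a kappa of about 0.4.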

Evaluation Approaches
The DM stage of AHM selects 10 classification algorithms, which are commonly used algorithms in credit risk analysis, to predict credit risk.

MCDM Method.
Multiple criteria decision making is a subdiscipline of operations research that explicitly considers multiple criteria in decision making environments. When evaluating classification algorithms, multiple criteria normally need to be examined, such as accuracy, misclassification rate, and computational time. Thus algorithm evaluation and selection can be modeled as an MCDM problem.
The MCDM stage of AHM selects four MCDM methods, that is, technique for order preference by similarity to ideal solution (TOPSIS) [49], preference ranking organization method for enrichment of evaluations II (PROMETHEE II) [50], VIKOR [51], and grey relational analysis (GRA) [52] to evaluate the classification algorithms, based on the 10 performance measures described in Section 3.
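To make the idea of ranking alternatives over several criteria concrete, the following is a minimal sketch of the standard TOPSIS procedure in plain Python. It is not the MATLAB implementation used in this study. The function name `topsis`, the use of vector normalization, and the sample values in the usage note are assumptions made for illustration. Rows are alternatives (classification algorithms) and columns are criteria (performance measures).

```python
from math import sqrt


def topsis(matrix, weights, benefit):
    """Return TOPSIS closeness scores, one per row of `matrix`.

    matrix  : list of rows, one row per alternative
    weights : one weight per criterion (assumed to sum to 1)
    benefit : benefit[j] is True when larger values of criterion j are better
    """
    n_crit = len(weights)
    # Vector-normalize each column, then apply the criterion weight.
    norms = [sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
         for row in matrix]
    # Ideal and anti-ideal points, taken column by column.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_pos = sqrt(sum((x - a) ** 2 for x, a in zip(row, ideal)))
        d_neg = sqrt(sum((x - b) ** 2 for x, b in zip(row, anti)))
        # Closeness to the ideal: 1 is best, 0 is worst.
        scores.append(d_neg / (d_pos + d_neg))
    return scores
```

A call such as `topsis([[0.9, 0.8], [0.5, 0.4], [0.7, 0.6]], [0.5, 0.5], [True, True])` scores the first (dominating) row highest and the second (dominated) row lowest. The sketch assumes that no column is all zeros and that the alternatives are not all identical, since either case would lead to a division by zero.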

The Proposed Model
The proposed AHM is developed to evaluate and select classification algorithms for credit risk analysis. It is designed to deal with situations when different MCDM methods produce conflicting rankings [33,53]. The approach combines MCDM, DM, knowledge discovery in database (KDD) process, and expert opinions to find out the best classification algorithm. The proposed AHM consists of three stages: DM stage, MCDM stage, and secondary mining stage. The framework is presented in Figure 1.
In the first stage, DM stage, 10 commonly used classification algorithms in credit risk analysis, including Bayes network (BNK), naive Bayes (NBS), logistic regression (LRN), J48, NBTree, IB1, IBK, SMO, RBF network (RBF), and multilayer perceptron (MLP), are implemented using WEKA 3.7. The performance of algorithms is measured by the 10 performance measures introduced in Section 3.1. The DM stage can be extended to other functions, such as clustering analysis and association rules analysis.
The MCDM stage applies four MCDM methods (i.e., TOPSIS, VIKOR, PROMETHEE II, and GRA) to produce an initial ranking of the classification algorithms, taking the results of the DM stage as input. This stage uses more than one MCDM method because a ranking agreed on by several MCDM methods is more credible and convincing than one generated by a single MCDM method. All these MCDM methods are implemented using MATLAB 7.0.
In the third stage, secondary mining is used to derive a list of algorithm priorities, and multicriteria decision analysis (MCDA) is applied to measure the performance of the classification algorithms. Expert consensus on the importance of each MCDM method is applied to the algorithm evaluation and selection, which can reduce the knowledge gap arising from different experiments and from differences in expert knowledge, especially when different evaluation algorithms generate conflicting results.

Datasets.
The experiment uses 2 public-domain credit datasets: the Australian credit dataset and the German credit dataset (Table 1). Both datasets are publicly available at the UCI machine learning repository (http://archive.ics.uci.edu/ml).
The German credit card application dataset contains 1000 instances with 20 predictor variables, such as age, gender, marital status, education level, employment status, credit history records, job, account, and loan purpose. 70% of the instances are accepted to be credit worthy and 30% are rejected.
The Australian dataset concerns consumer credit card applications. It has 690 instances with 44.5% examples of credit worthy customers and 55.5% examples for credit unworthy customers. It contains 14 attributes, where eight are categorical attributes and six are continuous attributes.

Experimental Design.
The experiment is carried out according to Algorithm 1.
Input: 2 public-domain credit datasets.
Output: Ranking of classification algorithms.
Step 1. Prepare target datasets: data cleaning, data integration, and data transformation.
Step 2. Train and test the selected classification algorithms on randomly sampled partitions (i.e., 10-fold cross-validation) using WEKA 3.7 [34].
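The 10-fold cross-validation in Step 2 of Algorithm 1 can be sketched as follows. This is an illustrative partitioning routine in plain Python, not the WEKA procedure used in the experiment. The function names `stratified_folds` and `train_test_splits`, the fixed random seed, and the choice to stratify by class label are all assumptions.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign instance indices to k folds, preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds,
        # continuing from where the previous class stopped.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once for testing."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With the class balance reported for the German dataset (700 accepted and 300 rejected instances), each of the 10 folds produced by this sketch holds 100 instances, 30 of which are rejected. Each classifier would then be trained on 900 instances and tested on the remaining 100, once per fold.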

Experimental Results.
The standardized classification results of the two datasets are summarized in Tables 2 and 3. The best result of each performance measure of the two datasets is highlighted in boldface. No classification algorithm has the best result on all measures.
The initial ranking of the classification algorithms of the two datasets is generated by TOPSIS, VIKOR, PROMETHEE II, and GRA. The results are summarized in Tables 4 and 5, respectively. The weights of the performance measures used in TOPSIS, VIKOR, PROMETHEE II, and GRA are defined as follows: TP rate and AUC are set to 10 and the other measures are set to 1; the weights are then normalized so that they sum to 1 [33]. From Tables 4 and 5, no regular pattern in the performance of the classification algorithms can be identified by intuition alone. Moreover, intuition is not always correct, and different people often reach different conclusions. Based on these observations, the secondary mining stage is proposed in our developed AHM.
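The weight normalization described above is simple arithmetic: each raw weight is divided by the sum of all raw weights. The following sketch shows it in Python. The function name `normalize_weights` is hypothetical, and the example dictionary covers only the seven measures named explicitly in the list of performance measures, which is an assumption made for illustration. Any further measures would be added with a raw weight of 1 in the same way.

```python
def normalize_weights(raw):
    """Scale raw criterion weights so that they sum to 1."""
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Illustrative raw weights: TP rate and AUC are emphasized (10),
# the remaining listed measures receive 1.
raw_weights = {
    "Acc": 1,
    "TPR": 10,
    "TNR": 1,
    "Precision": 1,
    "AUC": 10,
    "F-measure": 1,
    "Kaps": 1,
}
weights = normalize_weights(raw_weights)
```

With these seven entries the raw weights sum to 25, so TPR and AUC each receive a normalized weight of 10/25 and every other measure receives 1/25.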
The final ranking of classification algorithms is calculated by TOPSIS, one of the MCDA methods, which is implemented in the secondary mining stage. The weights are obtained by decision making with expert consensus. That is, all algorithms are equally important over all measures, having their own advantages and weaknesses. Three invited experts agree on the fact that each MCDM method is equally important; namely, the weight of each MCDM method is 0.25. The final ranking results are presented in Table 6.
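One way to picture the secondary mining stage is to treat the four initial rankings as four criteria and to apply TOPSIS once more with a weight of 0.25 per MCDM method. The sketch below follows that reading. It assumes that the inputs are rank positions (1 = best), treated as cost-type criteria. The function name `final_ranking`, the algorithm labels, and the rank values in the usage note are hypothetical and do not reproduce the results in Tables 4 to 6.

```python
from math import sqrt


def final_ranking(initial_ranks, method_weights):
    """Merge per-method rankings into one final ranking with TOPSIS.

    initial_ranks  : {algorithm: [rank under each MCDM method]}, 1 = best
    method_weights : one weight per MCDM method (e.g. 0.25 each)
    """
    names = list(initial_ranks)
    rows = [initial_ranks[name] for name in names]
    m = len(method_weights)
    # Vector-normalize each method's column and apply the method weight.
    norms = [sqrt(sum(r[j] ** 2 for r in rows)) for j in range(m)]
    v = [[method_weights[j] * r[j] / norms[j] for j in range(m)] for r in rows]
    # Ranks are cost-type: the ideal point has the smallest value per column.
    ideal = [min(col) for col in zip(*v)]
    anti = [max(col) for col in zip(*v)]
    closeness = {}
    for name, row in zip(names, v):
        d_pos = sqrt(sum((x - a) ** 2 for x, a in zip(row, ideal)))
        d_neg = sqrt(sum((x - b) ** 2 for x, b in zip(row, anti)))
        closeness[name] = d_neg / (d_pos + d_neg)
    # Highest closeness first.
    return sorted(names, key=lambda name: closeness[name], reverse=True)
```

For example, with hypothetical ranks `{"alg_A": [1, 2, 1, 1], "alg_B": [2, 1, 3, 2], "alg_C": [3, 3, 2, 3]}` and weights `[0.25, 0.25, 0.25, 0.25]`, the function places `alg_A` first and `alg_C` last, even though the four methods disagree on individual positions.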
The rankings of the classification algorithms for the two datasets are basically the same, except for Bayes network (BNK) and naive Bayes (NBS). Compared with the initial ranking, the degrees of disagreement in the final ranking are greatly reduced.

Conclusion
This paper proposes an AHM, which combines DM and MCDM, to evaluate classification algorithms in credit risk analysis. To verify the proposed model, an experiment is implemented using 2 public-domain credit datasets, 10 classification algorithms, and 10 performance measures. The results indicate that the proposed AHM is able to identify robust classification algorithms for credit risk analysis. The proposed AHM can reduce the degrees of disagreements for decision optimization, especially when different evaluation algorithms generate conflicting results. One future research direction is to extend the AHM to other functions, such as clustering analysis and association analysis.

Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.