A Novel Imbalanced Data Classification Approach Based on Logistic Regression and Fisher Discriminant

We introduce an imbalanced data classification approach based on logistic regression significant discriminant and Fisher discriminant. First, a key indicators extraction model based on logistic regression significant discriminant and correlation analysis is derived to extract features for customer classification. Second, a customer scoring model is established by linear weighting, with the weights obtained from the Fisher discriminant. Third, a customer rating model is constructed in which the number of customers in each rating follows a normal distribution. The proposed model and the classical SVM classification method are evaluated in terms of their ability to correctly classify customers as default or nondefault. Empirical results using the data of 2157 customers in financial engineering suggest that the proposed approach performs better than the SVM model in dealing with imbalanced data classification. Moreover, our approach helps banks and bond investors locate qualified customers.


Introduction
Imbalanced data sets are common in practice [1]. A data set is imbalanced when one class contains far more samples than another. Many classification approaches built on imbalanced data sets perform well on one class but poorly on the other [2], so more attention should be paid to the rare class. For example, in risk management, default customers account for only 1% to 4% of all loan customers, yet they cause hundreds of billions of loan losses for banks. What is more, such classifiers cannot identify the real default customers. Therefore, high prediction accuracy on default customers is especially useful in helping bankers, society, and bond investors reduce losses.
The main customer classification approaches proposed to handle imbalanced data problems can be divided into three categories. The first, and one of the best known, is the classification model based on econometric techniques.
Using an empirical study of imbalanced bankrupt and nonbankrupt enterprises in the U.S., Altman established the five-variable Z-score model [3]. Combining collinearity diagnostics with logistic regression significant discriminant (LRSD), Shi and Chi established a classification model for handling imbalanced financial data [4]. Altman et al. employed a statistical discriminant technique to revise the Z-score model and created the ZETA rating model to deal with imbalanced bankrupt and nonbankrupt data [5]. Ju and Sohn created a credit-scoring classification model for selecting appropriate funding beneficiaries [6]. Elliott et al. built a model based on a double hidden Markov model (DHMM) to extract information about the "true" credit qualities of firms from all of the loan firms [7]. In order to screen good customers, Hwang et al. created an ordered semiparametric probit credit rating model by substituting an ordered semiparametric function for the linear regression function [8]. In comparison with conventional models such as multiple discriminant analysis, logistic regression analysis, and neural networks for business failure prediction, Min and Lee proposed a DEA credit scoring model for loan customers' classification [9]. The linear discriminant analysis (LDA) model has also been applied to client classification [10].
The second category is the classification approach based on stochastic probability. To infer credit-quality data that are missing not at random (MNAR), Chen and Åstebro proposed a flexible method to generate the probability of missingness within a model-based bound and collapse Bayesian technique. Empirical results show that the method improves the classification power of credit scoring models under MNAR conditions [11]. Carmona et al. used a structural model with stochastic volatility to compute rare credit portfolio losses, and they demonstrated the efficiency of their method in situations where importance sampling is not possible or numerically unstable [12]. Kim and Sohn proposed a random effects multinomial regression model to estimate the transition probabilities of different types of customers [13]. Carling et al. established a continuous-time model incorporating macroeconomic factors, such as the GDP growth rate and the unemployment rate, to readjust the debtor's credit rating transition probability [14]. KMV Company created the KMV model, which measures an enterprise's default probability as the probability that its asset value is less than its debt value [15]. JPMorgan used a transition matrix to describe the probability of a debtor's credit rating change, establishing the CreditMetrics credit rating model [16]. Credit Suisse First Boston developed a stochastic probability measure of default to establish the CreditRisk+ model [17].
The third category is imbalanced data classification based on artificial intelligence. To extract features for classification, Li et al. created a generalized linear discriminant analysis model based on the trace ratio criterion algorithm (GLDA-TRA) [18]. Abellán and Mantas studied ensembles of classifiers for bankruptcy prediction and credit scoring using the random subspace method; their experimental study showed that a bagging scheme on decision trees gives the best results for bankruptcy prediction and credit scoring [19]. Wang established a hybrid sampling SVM model for imbalanced data classification [20]. Zhong et al. carried out a comprehensive experimental comparison of the effectiveness of four learning algorithms (i.e., BP, ELM, I-ELM, and SVM) on an imbalanced data set consisting of real financial data for corporate credit ratings [21]. Akkoç proposed a three-stage hybrid adaptive neurofuzzy inference system (ANFIS) credit scoring model to deal with imbalanced loan customers [22]. To classify loan customers, Finlay compared the performance of several multiple classifiers and found that error trimmed boosting outperformed all other multiple classifiers on UK credit data [23]. Twala explored the predicted behaviour of five classifiers for different types of noise in terms of credit risk prediction accuracy and how such accuracy could be improved by using classifier ensembles. The experimental evaluation showed that the ensemble of classifiers technique had the potential to improve prediction accuracy [24].
Although existing research has made great progress in handling imbalanced data, some drawbacks remain. First, the real default status of customers is not taken into account in existing loan customer classification. Second, the collinearity between indicators, which could confuse the information in the index system, is not excluded in existing classification research.
To overcome these shortcomings, this paper creates a novel imbalanced data classification approach based on logistic regression significant discriminant (LRSD) and Fisher discriminant. Using data on 2157 microfinance loans for small private businesses from a Chinese state-owned commercial bank, the empirical results show that the average accuracy rate of the proposed model is 96.27%. The proposed model performs well on imbalanced customer classification.
The rest of the paper is organized as follows. Section 2 introduces the methodology of this paper. Section 3 presents the data and empirical analysis of our imbalanced data classification model for small private businesses. We conclude the paper in Section 4.

A Novel Imbalanced Data Classification Approach
It has to be noted that no technique has been shown to be optimal for all kinds of data. Because max-min normalization has been widely used to standardize quantitative indicators [4,25,26], it is applied here to transform the positive and negative indicators. Let $x_{ij}$ denote the standard score of the $i$th customer on the $j$th indicator. Let $v_{ij}$ denote the original data of the $i$th customer on the $j$th indicator. Let $n$ denote the number of customers. The standardization equations of positive indicators and negative indicators are shown in (1) and (2), respectively [4]:

$$x_{ij} = \frac{v_{ij} - \min_{1 \le i \le n}(v_{ij})}{\max_{1 \le i \le n}(v_{ij}) - \min_{1 \le i \le n}(v_{ij})}. \tag{1}$$

Equation (1) is the ratio of the deviation between the original data $v_{ij}$ and the minimum value $\min(v_{ij})$ to the range $\max(v_{ij}) - \min(v_{ij})$. It indicates that the closer the original data $v_{ij}$ is to the maximum value $\max(v_{ij})$, the bigger the standardized value $x_{ij}$ is.

$$x_{ij} = \frac{\max_{1 \le i \le n}(v_{ij}) - v_{ij}}{\max_{1 \le i \le n}(v_{ij}) - \min_{1 \le i \le n}(v_{ij})}. \tag{2}$$

The variables in (2) have the same meanings as in (1). Equation (2) indicates that the closer the original data $v_{ij}$ is to the minimum value $\min(v_{ij})$, the bigger the standardized value $x_{ij}$ is.
Let $q_1$ denote the left boundary of the ideal interval. Let $q_2$ denote the right boundary of the ideal interval. The standard score equation of the interval indicators is shown as follows [4]:

$$x_{ij} = 1 - \frac{q_1 - v_{ij}}{\max\left(q_1 - \min(v_{ij}),\ \max(v_{ij}) - q_2\right)}, \quad v_{ij} < q_1, \tag{3a}$$

$$x_{ij} = 1 - \frac{v_{ij} - q_2}{\max\left(q_1 - \min(v_{ij}),\ \max(v_{ij}) - q_2\right)}, \quad v_{ij} > q_2, \tag{3b}$$

$$x_{ij} = 1, \quad q_1 \le v_{ij} \le q_2. \tag{3c}$$

The remaining variables in (3a), (3b), and (3c) have the same meanings as in (1).
Equations (3a), (3b), and (3c) standardize the interval indicators. From (3c), if the original data $v_{ij}$ belongs to the interval $[q_1, q_2]$, the standardized value $x_{ij}$ identically equals 1. From (3a), if the original data $v_{ij}$ is less than the left boundary $q_1$, the numerator $q_1 - v_{ij}$ is the deviation between the original data $v_{ij}$ and the left boundary $q_1$, and the denominator $\max(q_1 - \min(v_{ij}), \max(v_{ij}) - q_2)$ is the larger of $q_1 - \min(v_{ij})$ and $\max(v_{ij}) - q_2$. Equation (3a) illustrates that the smaller the distance between the original data $v_{ij}$ and the left boundary $q_1$ is, the bigger the standardized value $x_{ij}$ is. Similarly, if $v_{ij} > q_2$, (3b) indicates that the smaller the distance between the original data $v_{ij}$ and the right boundary $q_2$ is, the bigger the standardized value $x_{ij}$ is.
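The three standardization rules can be illustrated with a minimal Python sketch. It assumes that the score of an out-of-interval value is one minus the scaled distance to the nearest boundary, which is how the text above describes (3a) and (3b); the function names and the handling of a constant column are illustrative assumptions, not part of the paper.

```python
from typing import List, Sequence


def standardize_positive(values: Sequence[float]) -> List[float]:
    """Min-max score for a positive indicator: larger raw value, larger score."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale (assumed convention, not from the paper).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize_negative(values: Sequence[float]) -> List[float]:
    """Min-max score for a negative indicator: smaller raw value, larger score."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(hi - v) / (hi - lo) for v in values]


def standardize_interval(values: Sequence[float], q1: float, q2: float) -> List[float]:
    """Score for an interval indicator with ideal interval [q1, q2]."""
    lo, hi = min(values), max(values)
    # Largest possible distance from the ideal interval on either side.
    scale = max(q1 - lo, hi - q2)
    scores = []
    for v in values:
        if v < q1:
            scores.append(1 - (q1 - v) / scale)
        elif v > q2:
            scores.append(1 - (v - q2) / scale)
        else:
            scores.append(1.0)  # inside the ideal interval
    return scores
```

For instance, with raw values 0, 5, 10, and 20 and an ideal interval of [5, 10], the two values inside the interval score 1, while 0 and 20 are penalized in proportion to their distance from the nearest boundary.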

The Standardization of Qualitative Indicators.
By rational analysis and expert investigation for qualitative indicators, the scoring standard of qualitative indicators can be obtained.

The Key Indicators Extraction Approach in Imbalanced Data Classification
Next, this paper gives a selection approach based on logistic regression significant discriminant (LRSD). The null hypothesis is that the $i$th indicator has no effect on customers' default status, so that its logistic regression coefficient $\beta_i$ is equal to zero. The alternative hypothesis is that the $i$th indicator has a significant effect on customers' default status, so that $\beta_i$ is not equal to zero. We establish the Wald statistics and judge whether the coefficients $\beta_1, \ldots, \beta_m$ in (4) equal zero or not. In other words, we judge whether the $i$th indicator significantly affects customers' default status.
Let $W_i$ denote the Wald test value of the $i$th indicator. Let $b_i$ denote the estimated coefficient of the $i$th indicator in (4). Let $S_{b_i}$ denote the standard deviation of $b_i$. Thus, the Wald test value $W_i$ is given by

$$W_i = \left(\frac{b_i}{S_{b_i}}\right)^2. \tag{5}$$

The standard process of selecting the indicators is as follows. Comparing the test probability $\mathrm{sig}_i$ of the Wald test value $W_i$ with the given significance level $\mathrm{Level}_0 = 0.05$ [27], we can determine whether an indicator has a significant effect on customers' default status. If $\mathrm{sig}_i < \mathrm{Level}_0$, then $\beta_i \neq 0$, which means the $i$th indicator affects customers' default status significantly, and therefore the $i$th indicator should be reserved. On the contrary, if $\mathrm{sig}_i \geq \mathrm{Level}_0$, then $\beta_i = 0$ cannot be rejected, which means the $i$th indicator does not have a significant effect on customers' default status, and therefore the $i$th indicator should be deleted.
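The screening rule can be sketched in a few lines of Python. The sketch assumes the Wald value is the squared ratio of the estimated coefficient to its standard deviation and that the test probability is taken from a chi-square distribution with one degree of freedom, which is the usual convention but is not stated explicitly in the text; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def wald_statistic(estimate: float, std_dev: float) -> float:
    """Wald test value: squared ratio of the estimate to its standard deviation."""
    return (estimate / std_dev) ** 2


def wald_p_value(wald: float) -> float:
    """Test probability, assuming a chi-square distribution with 1 degree of freedom.

    For one degree of freedom, P(X > w) = erfc(sqrt(w / 2)).
    """
    return math.erfc(math.sqrt(wald / 2.0))


def screen_indicators(
    estimates: Sequence[float],
    std_devs: Sequence[float],
    level: float = 0.05,
) -> List[int]:
    """Return the positions of the indicators whose test probability is below `level`."""
    reserved = []
    for idx, (b, s) in enumerate(zip(estimates, std_devs)):
        sig = wald_p_value(wald_statistic(b, s))
        if sig < level:
            reserved.append(idx)  # significant: reserve this indicator
    return reserved
```

With illustrative estimates 0.9, 0.1, and -1.2 and standard deviations 0.3, 0.2, and 0.4, only the first and third indicators pass the 0.05 level.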

Deleting the Repeated Information Indicators Based on Correlation Analysis.
The aim of the correlation analysis is to delete highly correlated indicators from the extensive indicator set, avoiding repeated information.
Let $x_{ij}$ and $x_{ik}$ denote the standard scores of the $i$th customer on the $j$th indicator and the $k$th indicator. Let $\bar{x}_j$ and $\bar{x}_k$ denote the mean values corresponding to the $j$th indicator and the $k$th indicator, respectively. Let $r_{jk}$ denote the correlation coefficient between the $j$th indicator and the $k$th indicator. Then,

$$r_{jk} = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}}. \tag{6}$$

As a matter of experience, the threshold of the correlation coefficient is set to 0.80 [25]. In other words, if the absolute value of the correlation coefficient $|r_{jk}|$ is more than 0.8, the two indicators reflect repeated information, and one of them should be deleted.
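A minimal Python sketch of this filter follows. It uses the ordinary Pearson correlation coefficient and the 0.8 threshold from the text. Which indicator of a correlated pair to delete is a judgment call in the paper; the sketch simply keeps whichever indicator comes first, which is an assumption made only for illustration.

```python
import math
from typing import Dict, List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long columns.

    Assumes neither column is constant (otherwise the denominator is zero).
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(columns: Dict[str, Sequence[float]], threshold: float = 0.8) -> List[str]:
    """Keep an indicator only if |r| with every already kept indicator is <= threshold."""
    kept: List[str] = []
    for name, col in columns.items():
        if all(abs(pearson(col, columns[other])) <= threshold for other in kept):
            kept.append(name)
    return kept
```

In practice the choice inside each correlated pair would follow the reasoning the paper applies to its own indicators, for example preferring the indicator that better reflects the debtor's basic information.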

The Indicator Empowerment Based on Fisher Discriminant.
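The weights are based on the distance between the default sample and the nondefault sample: the more an indicator separates the two samples, the bigger its weight. The weight of every selected indicator can then be calculated by using the Fisher discriminant method.
Let $S = (s_{jk})_{m \times m}$ denote the deviation matrix among indicators in the same group. Let $s_{jk}$ denote the deviation value between the $j$th indicator and the $k$th indicator. Let $m$ denote the number of indicators. Let $n_1$ denote the number of nondefault customers. Let $n_2$ denote the number of default customers. Let $s^{(1)}_{jk}$ denote the deviation value between the $j$th indicator and the $k$th indicator in the nondefault sample group. Let $s^{(2)}_{jk}$ denote the deviation value between the $j$th indicator and the $k$th indicator in the default sample group. Let $x_{ij}$ denote the standard score of the $i$th customer on the $j$th indicator. Let $x_{ik}$ denote the standard score of the $i$th customer on the $k$th indicator. Let $\bar{x}_j$ and $\bar{x}_k$ denote the mean values. Let $\bar{x}^{(1)}_j$ and $\bar{x}^{(1)}_k$ denote the mean values in the nondefault sample group. Let $\bar{x}^{(2)}_j$ and $\bar{x}^{(2)}_k$ denote the mean values in the default sample group. We have [28]

$$s_{jk} = s^{(1)}_{jk} + s^{(2)}_{jk},$$

where

$$s^{(1)}_{jk} = \sum_{i=1}^{n_1} \left(x^{(1)}_{ij} - \bar{x}^{(1)}_j\right)\left(x^{(1)}_{ik} - \bar{x}^{(1)}_k\right), \qquad s^{(2)}_{jk} = \sum_{i=1}^{n_2} \left(x^{(2)}_{ij} - \bar{x}^{(2)}_j\right)\left(x^{(2)}_{ik} - \bar{x}^{(2)}_k\right).$$

Let $d = (d_j)_{m \times 1}$ denote the deviation matrix between the default sample and the nondefault sample. Let $d_j$ denote the deviation value of the $j$th indicator between the default sample and the nondefault sample. Thus [28],

$$d_j = \bar{x}^{(1)}_j - \bar{x}^{(2)}_j,$$

where

$$\bar{x}^{(1)}_j = \frac{1}{n_1} \sum_{i=1}^{n_1} x^{(1)}_{ij}, \qquad \bar{x}^{(2)}_j = \frac{1}{n_2} \sum_{i=1}^{n_2} x^{(2)}_{ij}.$$

The meanings of the rest of variables in (10) are the same as the variables in (8).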
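The following Python sketch shows one common way to turn these quantities into indicator weights. It assumes the classical Fisher solution, in which the discriminant coefficients solve the linear system formed by the within-group deviation matrix and the between-group mean differences, and it assumes the weights are the absolute coefficients normalized to sum to one. The paper's exact weighting formula is not recoverable from the text above, so both steps are assumptions, and all names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column means of a group (rows are customers, columns are indicators)."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def deviation_matrix(rows: Sequence[Sequence[float]]) -> Matrix:
    """Within-group sums of cross products of deviations from the group mean."""
    mu = mean_vector(rows)
    m = len(mu)
    return [[sum((r[j] - mu[j]) * (r[k] - mu[k]) for r in rows) for k in range(m)]
            for j in range(m)]


def solve(a: Matrix, b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes `a` is nonsingular; a singular matrix raises ZeroDivisionError.
    """
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= factor * aug[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fisher_weights(nondefault: Sequence[Sequence[float]],
                   default: Sequence[Sequence[float]]) -> List[float]:
    """Indicator weights from the Fisher discriminant (assumed normalization)."""
    s1, s2 = deviation_matrix(nondefault), deviation_matrix(default)
    m = len(s1)
    s = [[s1[j][k] + s2[j][k] for k in range(m)] for j in range(m)]
    d = [a - b for a, b in zip(mean_vector(nondefault), mean_vector(default))]
    coef = solve(s, d)
    total = sum(abs(c) for c in coef)
    return [abs(c) / total for c in coef]
```

An indicator whose group means differ strongly relative to its within-group spread receives most of the weight, which matches the intuition stated at the start of this subsection.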

The Calculation of Customers' Credit Scores.
Let $Z_i$ denote the score of the $i$th customer, and let $w_j$ denote the weight of the $j$th indicator. We have [26]

$$Z_i = \sum_{j=1}^{m} w_j x_{ij}. \tag{15}$$

The meanings of the rest of variables in (15) are the same as the variables in (14).
Because the credit score calculated by (15) lies in the interval [0, 1], it is not on the generally accepted scale of [0, 100]. The credit score can be converted to a number between 0 and 100 by using (16). Let $P_i$ denote the standard score of the $i$th customer. Let $n$ denote the number of customers. Thus, the standard score $P_i$ is given by

$$P_i = \frac{Z_i - \min_{1 \le i \le n}(Z_i)}{\max_{1 \le i \le n}(Z_i) - \min_{1 \le i \le n}(Z_i)} \times 100. \tag{16}$$

With the customer numbers of the credit ratings following a normal distribution [9,26], all loan customers can be divided into nine ratings. A step-by-step instruction is provided.
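The scoring and rescaling steps can be sketched as follows. The sketch assumes that the score is the weighted sum of the standardized indicator values and that the conversion to the 0 to 100 scale is a min-max rescaling of the scores, which agrees with the later use of the column maximum and minimum in the empirical section; the function names are illustrative.

```python
from typing import List, Sequence


def credit_scores(rows: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Linear weighted score of each customer (rows are customers)."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]


def to_hundred_scale(scores: Sequence[float]) -> List[float]:
    """Rescale scores so that the lowest maps to 0 and the highest to 100.

    Assumes at least two distinct scores, so the range is nonzero.
    """
    lo, hi = min(scores), max(scores)
    return [100.0 * (s - lo) / (hi - lo) for s in scores]
```

For example, with a minimum score of 0.324 and a maximum of 0.699 (the values reported for the empirical data), a score halfway between them maps to 50.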
Step 1. According to (16), the customers' standard scores are obtained and sorted in descending order.
Step 2. On the basis that the customer numbers of the credit ratings follow a bell-shaped normal distribution, we can compute the sample proportion of every rating, as shown in the second column of Table 1. The third column and the fourth column of Table 1 describe every rating. The sample frequency distribution is shown in Figure 1.
Step 3. Because the first credit rating accounts for 8% of the total sample, the first scoring interval can be obtained from the customers' sorted credit scores. If a customer's credit score belongs to the first scoring interval, the customer is assigned to the first credit rating. In the same way, all of the small private businesses can be divided into nine ratings.
commercial bank [26], this paper selects 64 indicators of microfinance for small private businesses, which include six feature layers, that is, "1 basic information," "2 guarantee and joint guarantee," "3 capacity of repayment," "4 capacity of profitability," "5 capacity of operation," and "6 microenvironment," as shown in Table 2, columns 1, 2, and 4. All of these 64 indicators come from .

Empirical Study of the Imbalanced Data Classification Model
At the beginning of screening indicators, we removed six unavailable indicators, such as "4,9 industry experience" and "5,10 business capacity." Another 58 indicators are left. These  3.

Indicators Extraction Based on LRSD.
In order to create a logistic regression significant discriminant (LRSD) model, the training sample and test sample need to be determined. Of the 2157 customers, 80% (i.e., 1726 customers) are randomly selected as the training sample. In the training sample, the selected 1529 nondefault customers are shown in columns 2158 to 3686 of Table 3, and the selected 197 default customers are shown in columns 4069 to 4265 of Table 3. Meanwhile, all of the 2157 customers are used as the test sample.
Taking the standardized data from columns 2158 to 3686 and columns 4069 to 4265 of Table 3 into (4) and (5), the regression results are given in the third to sixth columns of Table 5. The given significance level $\mathrm{Level}_0$ equals 0.050 [27], as shown in Table 5, column 7. According to the selection standard in Section 2.2.1, an indicator is reserved if its test probability $\mathrm{sig}_i$ is less than $\mathrm{Level}_0$. Comparing $\mathrm{Level}_0$ with the test probability $\mathrm{sig}_i$ in the sixth column of Table 5, 24 indicators are reserved. The screening results are listed in the eighth column of Table 5.

Indicators Extraction Based on Correlation Analysis.
Substituting the 24 reserved indicators' data in Table 3 into (6), the correlation coefficients among these indicators are obtained, as shown in Table 6. As mentioned in Section 2.2.2 above, if the absolute value of the correlation coefficient of two indicators is larger than the threshold 0.8 [25], the two indicators reflect repeated information and one of them can be deleted. From Table 6, the correlation coefficient of "2,5 strength of the guarantor" and "2,8 credit status of joint guarantor" is 0.925. Because 0.925 is larger than the threshold 0.8, these two indicators reflect repeated information. Because "2,5 strength of the guarantor" reveals more of the debtors' basic information than "2,8 credit status of joint guarantor," it is reasonable to delete "2,8 credit status of joint guarantor." In the same way, we deleted another four indicators, including "2,9 the relationship of coinsurance group membership," "4,2 net income," "5,5 fixed assets turnover," and "6,3 consumer price index." All five deleted indicators are marked with "deleted by correlation analysis" in column 3 and column 5 of Table 2. In summary, we select 19 indicators which can effectively distinguish nondefault customers from default ones, as shown in Table 2.
In order to test the classification ability of the key indicators extraction model, we use all of the 2157 customers as the test sample. Substituting the 19 selected indicators into (4), the regression results are obtained, as shown in Table 7. Table 7 shows that the average accuracy rate of our model is 96.27%. The model has good classification ability for small private businesses.
It should be pointed out that many evaluation metrics can be applied in measuring a model's performance, such as AUC, recall, precision, F-measure, and overall success rate. This study evaluates the model's performance by the average of the default customers' accuracy rate and the nondefault customers' accuracy rate. This metric reflects the advantage of the proposed classification approach in dealing with imbalanced data. For instance, suppose there are ten customers, of which two are default customers and the other eight are nondefault customers. If both default customers are misclassified and all eight nondefault customers are classified correctly, the overall success rate is 80% (=8/10). However, under the metric used in this paper, the default customers' accuracy rate is 0% and the nondefault customers' accuracy rate is 100%, so the average accuracy rate equals 50% [=(100% + 0%)/2]. The average accuracy rate therefore measures classification performance on imbalanced data more faithfully than the overall success rate.
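The ten-customer example can be reproduced with a short Python sketch. The label coding (1 for default, 0 for nondefault) follows the paper; the function names are illustrative.

```python
from typing import Sequence


def class_accuracy(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """Share of customers with true class `label` that are predicted as `label`."""
    members = [i for i, y in enumerate(y_true) if y == label]
    correct = sum(1 for i in members if y_pred[i] == label)
    return correct / len(members)


def overall_success_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of all customers that are classified correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def average_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of the default (1) and nondefault (0) accuracy rates."""
    return (class_accuracy(y_true, y_pred, 1) + class_accuracy(y_true, y_pred, 0)) / 2


# Two default customers, eight nondefault customers, everyone predicted nondefault.
y_true = [1, 1] + [0] * 8
y_pred = [0] * 10
```

Here `overall_success_rate(y_true, y_pred)` is 0.8 while `average_accuracy(y_true, y_pred)` is 0.5, matching the figures in the text.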

Comparative Analysis of LRSD and SVM in the Customer Classification.
Based on the support vector machine (SVM) classification method in [29], this section constructs an SVM classification model for discriminating the customers' default status. In order to obtain the most accurate classification function, we let the penalty parameter $C$ range from 1 to 5 (i.e., $1 \le C \le 5$) and the kernel parameter $\sigma^2$ range from 0 to 3 (i.e., $0 \le \sigma^2 \le 3$), both with a step of 0.5. This gives a total of 9 × 7 = 63 combinations of the penalty parameter $C$ and the kernel parameter $\sigma^2$, as shown in columns 2 to 3 of Table 8. The combination of $C$ and $\sigma^2$ at which the average accuracy rate achieves its maximum gives the most accurate classification function.
As mentioned in Section 3.2.2, 80% of the customers are randomly selected as the training sample, and all of the 2157 customers are used as the test sample. Using the standardized data from columns 2158 to 3686 and columns 4069 to 4265 of Table 3, the 63 default customers' accuracy rates, the 63 nondefault customers' accuracy rates, and the 63 average accuracy rates are calculated separately, as shown in Table 8, columns 4 to 6. From the sixth column of Table 8, the maximum of the average accuracy rate is 93.10%. Therefore, the most accurate classification function can be obtained.
From Tables 7 and 8, the average accuracy rate of our proposed model is 96.27%, and the average accuracy rate of the SVM model is 93.10%. The proposed model based on logistic regression significant discriminant and correlation analysis performs better than the SVM model in dealing with imbalanced data.
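The parameter grid described above can be enumerated with a small Python sketch. The names `penalty` and `kernel` for the two parameters and the `evaluate` callback, which stands in for training the SVM and returning its average accuracy rate, are hypothetical; no actual SVM training is shown.

```python
from typing import Callable, List, Tuple


def stepped_range(start: float, stop: float, step: float) -> List[float]:
    """Values from start to stop inclusive, in increments of step."""
    count = int(round((stop - start) / step))
    return [start + i * step for i in range(count + 1)]


def parameter_grid() -> List[Tuple[float, float]]:
    """All (penalty, kernel) pairs: penalty in 1..5 and kernel in 0..3, step 0.5."""
    penalties = stepped_range(1.0, 5.0, 0.5)  # 9 values
    kernels = stepped_range(0.0, 3.0, 0.5)    # 7 values
    return [(p, k) for p in penalties for k in kernels]


def best_combination(evaluate: Callable[[float, float], float]) -> Tuple[float, float]:
    """Pair that maximizes the score returned by `evaluate` (a hypothetical callback)."""
    return max(parameter_grid(), key=lambda pair: evaluate(*pair))
```

Calling `len(parameter_grid())` confirms the 63 combinations reported in the text.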

The Calculation of Credit Scoring for Small Private Business
Taking the data of the third column in Table 9, together with the maximum value 0.699 and the minimum value 0.324 of this column, into (16), the 2157 customers' standard credit scores are obtained, as shown in the fourth column of Table 9.

The Credit Rating for Small Private Business.
As mentioned in Section 2.4 above, the customers' credit scores in the fourth column of Table 9 are ranked in descending order, and the results are given in the third column of Table 10. Take the first credit rating as an example. From the second column of Table 1, the first credit rating accounts for 8% of the total sample, so the number of customers in the first credit rating is 173 (≈ 2157 × 8%), as shown in column 5 of Table 10. The credit score of the 173rd customer, 72.79, can be found in the third column, so the first scoring interval is $72.79 \le P_i \le 100$, where $P_i$ is the standard credit score of the $i$th customer; this interval is listed in column 7 of Table 10. That is to say, the customers whose credit scores lie between 72.79 and 100 are the customers of rating AAA. In the same way, the credit rating results of the other eight ratings are obtained, as shown in column 7 of Table 10.
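The interval construction can be sketched in Python as follows. Only the 8% share of the first rating and the label AAA come from the text; the remaining proportions and rating labels below are hypothetical placeholders for the values in Table 1, and rounding the cumulative customer count is likewise an assumption.

```python
from typing import List, Sequence

# Hypothetical bell-shaped proportions (only the first 8% is stated in the text).
PROPORTIONS = [0.08, 0.10, 0.12, 0.13, 0.14, 0.13, 0.12, 0.10, 0.08]
# Hypothetical labels (only AAA is stated in the text).
LABELS = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C"]


def rating_cutoffs(scores: Sequence[float], proportions: Sequence[float]) -> List[float]:
    """Lowest score admitted to each rating except the last one."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    cutoffs = []
    cumulative = 0.0
    for share in proportions[:-1]:
        cumulative += share
        count = round(n * cumulative)      # customers in this rating or better
        cutoffs.append(ranked[count - 1])  # score of the last customer admitted
    return cutoffs


def assign_rating(score: float, cutoffs: Sequence[float], labels: Sequence[str]) -> str:
    """First rating whose cutoff the score reaches, else the lowest rating."""
    for cutoff, label in zip(cutoffs, labels):
        if score >= cutoff:
            return label
    return labels[-1]
```

With 2157 customers, the first rating contains `round(2157 * 0.08)` customers, that is, 173, so the first cutoff is the score of the 173rd customer in the ranked list.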

Conclusion
Small private businesses are important cornerstones of China's current economic development. At the end of 2013, statistical data showed that there were 44.36 million small private businesses in China, and their funds amounted to 2.43 trillion Yuan [30]. However, most Chinese small private businesses had difficulty raising funds because of their poor financial structures. The Chinese government has led financial innovation by supporting small private businesses through a series of inclusive financial measures. However, the default rates of small private businesses were very high for several reasons. One major reason is that the credit rating system of microfinance for small private businesses is not sound, and most banks in China have not yet even established such a rating system. Another primary reason is that the real default status of customers is not taken into account in existing credit rating systems.
In order to resolve the customer classification problem effectively, we propose a novel imbalanced data classification approach of microfinance for small private businesses. First, this paper sets up a key indicators extraction model that uses logistic regression significant discriminant to select indicators which can effectively distinguish default customers from nondefault ones and uses correlation analysis to delete the indicators carrying repeated information. Second, on the basis of linear weighted evaluation using the Fisher method, a credit scoring model for small private businesses is established, which reflects the ability to discriminate default customers from nondefault customers. Third, a credit rating model is established in which the number of customers in each credit rating follows a normal distribution.
The proposed approach has been verified using the data of 2157 small private businesses of a Chinese state-owned commercial bank. The results of our empirical analysis show that the proposed approach can accurately divide customers into credit ratings. There are 19 indicators which can effectively distinguish default small private businesses from nondefault ones, such as "1,2 marital status," "3,2 asset-liability ratio," and "6,7 industry cycle index." Our approach can also help banks and bond investors find quality customers.
Moreover, the performance of the two classifier systems is evaluated in terms of their ability to correctly classify customers as default (i.e., bad customer) or nondefault (i.e., good customer) credit risks. Empirical results suggest that the proposed approach performs better than the SVM model in dealing with imbalanced data classification.
2.2.1. Screening the Key Indicators Based on Logistic Regression Significant Discriminant.
Using the logistic regression model for selecting indicators, we ensure that the reserved indicators can effectively distinguish default customers from nondefault ones. Let $y$ be the dependent variable of the logistic regression model, that is, the default status of a customer's loan. Use $y = 1$ to denote a default customer and $y = 0$ to denote a nondefault customer. Let $P(y = 1 \mid x_1, \ldots, x_m)$ denote the corresponding default probability when conducting credit rating by the indicators $x_1, \ldots, x_m$. Let $\beta_0$ denote the constant term, and let $\beta_1, \ldots, \beta_m$ denote the regression coefficients. The logistic regression model is as follows [6]:

$$P(y = 1 \mid x_1, \ldots, x_m) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m)}}. \tag{4}$$

Table 1: The credit grade standard of microfinance for small private businesses.

Table 2: Extensive index set of small private business credit rating.

According to the indicator type in column d of Table 3, taking the original data of positive indicators $v_{ij}$ from columns 1 to 2157 of Table 3 into (1), the original data of negative indicators $v_{ij}$ into (2), and the original data of interval indicators $v_{ij}$ into (3a), (3b), and (3c), the standardized data of indicators $x_{ij}$ are obtained. The results are illustrated in columns 2158 to 4314 of Table 3. Next, we compute the standardized scores of the qualitative indicators. The scoring standard of qualitative indicators can be obtained by rational analysis, as shown in columns 2 to 6 of Table 4. In accordance with the scoring standard of qualitative indicators in Table 4, the standardized scores of qualitative indicators are obtained according to the indicator type in column d of Table 3, as shown in columns 2158 to 4314 of Table 3.

Table 3: The original data and standardized data of microfinance credit rating indicators for small private business.

Table 4: The scoring standard of the qualitative indicators.

Table 5: Significance screening on credit rating indicators based on logistic regression. The test probability $\mathrm{sig}_i$ is given in column 6 and the significance level $\mathrm{Level}_0$ in column 7; according to the selection standard in Section 2.2.1, the $i$th indicator is reserved when $\mathrm{sig}_i$ is less than $\mathrm{Level}_0$.

Table 6: The correlation coefficient matrix between the 24 reserved indicators.