Empirical Study on Indicators Selection Model Based on Nonparametric K-Nearest Neighbor Identification and R Clustering Analysis

The nonparametric K-nearest neighbor discriminant method is combined with R cluster analysis to construct a double-combination indicator screening model. The paper has two distinguishing features. First, the nonparametric K-nearest neighbor discriminant method is used to select the indicators that can significantly discriminate the default loss rate, which addresses a shortcoming of previous research that focuses only on indicators that discriminate the default state. Second, the R cluster analysis applied in this paper clusters the indicators within each criterion class rather than across the whole index system. This ensures that indicators clustered into one class share both economic implications and data characteristics, and it avoids classes whose indicators share data characteristics but differ in economic implications.


Introduction
The existing research on the influencing factors of credit risk in microenterprises is divided into the following two categories.
(1) Existing Studies on Credit Evaluation Indicators System. Reusens and Croux (2017) argue that government debt, GDP growth rate, inflation, and other macroeconomic factors play a significant role in promoting corporate credit, so they use these variables to build a credit evaluation index system [1]. Anand et al. (2016) argue that indicators such as profitability, liquidity, firm size, and credit rating influence the stability of the firm and play a vital role in credit evaluation, so these indicators should be included in the credit evaluation index system [2]. Jones et al. (2015) built a corporate credit rating system using financial indicators such as total assets. In addition to these financial variables, they also introduced market variables such as enterprise scale and years of establishment into the index system [3]. Doumpos et al. (2015) mainly examined the impact of financial indicators on corporate credit and built a credit evaluation index system including asset returns, interest income, solvency, long-term debt leverage, and the size of the company [4].
(2) Existing Studies on Indicators Selection Methods. Many existing studies build classifiers based on fuzzy methods to solve the credit evaluation problem [5][6][7]. Sohn et al. (2016) use the fuzzy logic regression method to establish a credit rating equation [8]. Abiyev (2014) develops fuzzy logic and neural network methods to extract important credit risk assessment information [9]. Ju and Sohn (2014) established a credit rating equation to pick appropriate funding beneficiaries [10]. Elliott et al. (2014) screen out the true information that reflects the credit state of a company based on a double hidden Markov model (DHMM) [11]. Abellán and Mantas (2014) construct ensembles of classifiers for bankruptcy prediction and credit scoring based on the random subspace method. Their experimental studies show that bagging schemes built on decision trees provide the best results for bankruptcy prediction and credit scoring [12]. Bijak and Thomas (2015) use improved Bayesian analysis techniques to deal with the problem of loss from bad loans [13]. Gorzałczany and Rudziński (2016) are more concerned than other scholars with the supervision and division of customer credit ratings, which helps banks make better lending decisions [14]. Jones et al. (2015) predict the variation tendency of customer credit levels and determine the credit threshold through a binary classifier [3].
The defects of the existing research are as follows. First, most of the existing research constructs the indicator system from the perspective of default versus nondefault and lacks the perspective of the default loss rate. Second, when using R cluster analysis, some of the existing research does not classify the indicators according to their economic meaning, so it cannot remove the indicators that carry redundant information.
Contributions of This Paper. First, this paper uses the nonparametric K-nearest neighbor discriminant method to remove indicators that cannot significantly distinguish samples with different default loss rates. Second, the paper classifies indicators by R cluster analysis and, from each class, selects the indicator that carries the most information as measured by the coefficient of variation. This ensures that duplicate information is removed.

The Difficulty of the Problem
Difficulty 1. The first difficulty is how to ensure that the selected indicators can significantly differentiate samples that have different default loss rates. In the existing studies, the indicators selected by many classic methods can only distinguish between different default states.
Difficulty 2. The second difficulty is how to delete the indicators that suffer from information overlap and redundancy.

The Method to Solve the Difficulty
The Method to Solve the Difficulty 1. The nonparametric K-nearest neighbor discriminant method screens out the indicators that have significant discrimination ability on samples with different default loss rates. If there are $h$ indicators, then $h$ discrimination accuracies are calculated, each obtained after one indicator is removed. Each of the $h$ accuracies is compared with the accuracy obtained with all the indicators, which gives the accuracy difference between each indicator and the full set. If the accuracy difference for a certain indicator is greater than or equal to 0, the indicator is deleted; if the accuracy difference is less than 0, the indicator is retained. After these steps, the indicators that have significant discrimination ability on different default loss rates are selected.
The Method to Solve the Difficulty 2. The indicators are screened again by R cluster analysis to exclude collinearity. The indicators retained by the nonparametric K-nearest neighbor discriminant method are reclassified by R cluster analysis within each criterion layer. The indicator with the largest coefficient of variation in each class of each criterion layer enters the final indicator system, so the final indicator system does not suffer from information redundancy. In this paper, the optimal $K$ value is selected by the error balance method (Xing and Tingjin, 2014) [15], subject to a constraint on the range of $K$. Compared with generalized cross validation, the error balance method not only yields the optimal $K$ value but also greatly reduces the computational cost.

Construction of Indicator System
The error balance method increases the $K$ value from 1 and combines the test errors of all the samples to draw the trend of the test error. The $K$ value that minimizes the test error is then read from this trend. This method not only specifies the direction in which the optimal $K$ value is searched but also ensures that the optimal $K$ value is chosen within a reasonable range. This paper combines the idea of Góra and Wojna (2002) with the error balance method to find the best $K$ value [16].
Assume that $E_i$ is the test error of the $i$th type of sample, $m_i$ is the number of $i$th type samples misjudged into other classes, and $n_i$ is the actual number of $i$th type samples ($i = 1, 2, 3$). Then

$$E_i = \frac{m_i}{n_i}. \tag{1}$$

Assume that $E$ is the test error of all the samples; $E_1$ is the test error of the high default loss rate sample; $E_2$ is the test error of the low default loss rate sample; $E_3$ is the test error of the nondefault sample; and $n_1$, $n_2$, and $n_3$ are the sample sizes of the high default loss rate sample, the low default loss rate sample, and the nondefault sample. Then

$$E = \frac{n_1 E_1 + n_2 E_2 + n_3 E_3}{n_1 + n_2 + n_3}. \tag{2}$$

The meanings of formulas (1) and (2) are as follows: the ratio of the number of misjudgments to the actual sample size represents the test error, and the weighted average of the test errors of the three types of sample is the total sample test error.
Assume that $K$ is the number of nearest neighbors and $n = n_1 + n_2 + n_3$ is the sample size. The optimal $K$ value solves

$$\min_{K} E = \frac{n_1 E_1 + n_2 E_2 + n_3 E_3}{n} \quad \text{s.t. } 1 \le K \le \sqrt{n}. \tag{3}$$

The Meaning of (3). According to Góra and Wojna's theory, the optimal $K$ value should lie in the range from 1 to $\sqrt{n}$. Under this constraint, the optimal $K$ value is the value that minimizes the total sample test error.
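The search for the optimal $K$ can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes Euclidean distance, majority voting, and a leave-one-out estimate of the test error, none of which the paper specifies, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    # Ties go to the label met first among the nearest neighbours.
    return votes.most_common(1)[0][0]

def total_test_error(X, y, k):
    """Leave-one-out estimate (an assumption) of the total test error E for one k."""
    n_i = Counter(y)          # actual number of samples per class
    m_i = Counter()           # misjudged samples per class
    for idx in range(len(X)):
        rest_X = X[:idx] + X[idx + 1:]
        rest_y = y[:idx] + y[idx + 1:]
        if knn_predict(rest_X, rest_y, X[idx], k) != y[idx]:
            m_i[y[idx]] += 1
    class_err = {c: m_i[c] / n_i[c] for c in n_i}                # per-class error
    return sum(n_i[c] * class_err[c] for c in n_i) / len(X)      # size-weighted average

def best_k(X, y):
    """Try k = 1 .. floor(sqrt(n)) and keep the smallest k with minimal error."""
    k_max = max(1, math.isqrt(len(X)))
    errors = {k: total_test_error(X, y, k) for k in range(1, k_max + 1)}
    return min(errors, key=lambda k: (errors[k], k)), errors

# Hypothetical one-dimensional toy data with three well-separated classes.
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [10.0], [10.1], [10.2]]
y = ["high"] * 3 + ["low"] * 3 + ["none"] * 3
k_opt, errors = best_k(X, y)
```

The dictionary `errors` holds the error for every candidate `k`, so it can be plotted directly to reproduce a trend curve of test error against $K$.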

The Process of Index Screening through Nonparametric K-Neighbor Identification Method
(1) Calculate the Prior Probabilities. Assume that $p_i$ is the prior probability of class $i$, where $i = 1, 2, 3$, $p_i \ge 0$, and $p_1 + p_2 + p_3 = 1$; $n_i$ is the sample size of class $i$; and $n$ is the sum of the sample sizes over all classes (Ganjiang, 2007) [17]:

$$p_i = \frac{n_i}{n}. \tag{4}$$

The Meaning of (4). The prior probability of each class is the ratio of the number of samples in the class to the total number of samples. The smaller the result, the smaller the likelihood that a sample will be classified into the class.
The Meaning of (5). Formula (5) gives the probability that $x$ falls within the established range.
(3) Calculate Posterior Probability. Assume that $P(i \mid x)$ is the posterior probability of class $i$ for a sample $x$; $p_i$ is the prior probability of class $i$, with $p_i \ge 0$ and $p_1 + p_2 + p_3 = 1$; $f_i(x)$ is the probability density function of class $i$; and $\sum_{j=1}^{3} p_j f_j(x)$ is the sum over the classes of the product of the probability density function and the prior probability (Ganjiang, 2007) [17]:

$$P(i \mid x) = \frac{p_i f_i(x)}{\sum_{j=1}^{3} p_j f_j(x)}. \tag{6}$$

If $P(1 \mid x)$ is the largest of the three, the sample is assigned to the high default loss rate class; if $P(2 \mid x)$ is the largest of the three, the sample is assigned to the low default loss rate class; if $P(3 \mid x)$ is the largest of the three, the sample is assigned to the nondefault class. The accuracy equals 1 minus the error rate.
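The prior and posterior calculations can be illustrated with a short Python sketch. The class sizes (24, 6, and 830) are the ones reported in the empirical section of the paper; the density values and all function and variable names are hypothetical and serve only to show the Bayes-rule step and the assignment to the class with the largest posterior.

```python
def priors(class_sizes):
    """Prior of each class: class sample size divided by total sample size."""
    n = sum(class_sizes.values())
    return {c: n_c / n for c, n_c in class_sizes.items()}

def posteriors(prior, density):
    """Bayes' rule: P(i | x) = p_i f_i(x) / sum_j p_j f_j(x)."""
    evidence = sum(prior[c] * density[c] for c in prior)
    return {c: prior[c] * density[c] / evidence for c in prior}

# Class sizes reported in the paper for the 860 customers.
sizes = {"high_lgd": 24, "low_lgd": 6, "nondefault": 830}
p = priors(sizes)

# Hypothetical density values f_i(x) at a single sample x, for illustration only.
f_x = {"high_lgd": 0.50, "low_lgd": 0.10, "nondefault": 0.01}
post = posteriors(p, f_x)

# The sample is assigned to the class with the largest posterior probability.
assigned = max(post, key=post.get)
```

Because the nondefault prior is so large, a sample is assigned to a default class only when its density under that class is much higher than under the nondefault class.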
(4) Measure the Identification Accuracy of the Default Loss Rate. Assume that $A_i$ is the accuracy of the $i$th type of sample; $c_i$ is the number of $i$th type samples correctly judged by the nonparametric K-nearest neighbor discriminant method; and $n_i$ is the actual number of $i$th type samples. Then $A_i$ is

$$A_i = \frac{c_i}{n_i}. \tag{7}$$

The Meaning of (7). The larger the calculated value, the better the nonparametric K-nearest neighbor discriminant method identifies samples of the different classes.
Assume that $A$ is the identification accuracy of all the samples; then

$$A = \frac{n_1 A_1 + n_2 A_2 + n_3 A_3}{n_1 + n_2 + n_3}. \tag{8}$$

The Meaning of Formula (8). The discrimination accuracy $A$ of all the samples is equal to the weighted average of the discrimination accuracy $A_1$ of the high default loss rate sample, the discrimination accuracy $A_2$ of the low default loss rate sample, and the discrimination accuracy $A_3$ of the nondefault sample. The higher $A$ is, the higher the discrimination accuracy over all samples.
(5) Calculate the Degree of Influence of the $j$th Indicator on the Discrimination Accuracy. Assume that $D_j$ is the degree of influence of the $j$th indicator on the discrimination accuracy; $A_j$ is the discrimination accuracy of the remaining indicators after the $j$th indicator is eliminated; and $A_0$ is the discrimination accuracy of all the indicators. Then $D_j$ is

$$D_j = A_j - A_0. \tag{9}$$

Formula (9) reflects the degree of influence of the $j$th indicator on the discrimination accuracy.
Assume that $S$ is the total sum of squared deviations over all classes of indicators and $S_i$ is the sum of squared deviations of the $i$th class ($i = 1, 2, 3, \ldots, l$):

$$S = \sum_{i=1}^{l} S_i. \tag{11}$$

Step 1. Treat the $m$ indicators as $m$ classes.
Step 2. Combine any two of the $m$ indicators into one class and leave the remaining indicators unchanged. There are $m(m-1)/2$ possible combinations. According to (10), calculate the sum of squared deviations $S_i$ of each class of indicators.
Step 3. Calculate the total sum of squared deviations over all classes by (11), and reclassify the indicators using the combination that minimizes the total sum of squared deviations.
Step 4. Repeat Steps 2 and 3 until the number of classes is $l$.
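Steps 1 to 4 amount to a greedy agglomerative procedure that can be sketched as follows. This is an illustrative Python sketch rather than the authors' code: it assumes each indicator is given as an already standardized vector of sample values, evaluates every pairwise merge by brute force, and uses hypothetical names and toy values throughout.

```python
from itertools import combinations

def class_ss(vectors):
    """Sum of squared deviations of the vectors in one class from the class mean vector."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return sum((v[d] - mean[d]) ** 2 for v in vectors for d in range(dim))

def r_cluster(indicators, n_classes):
    """indicators: dict mapping indicator name -> standardized sample vector.
    Returns a list of classes, each a list of indicator names."""
    classes = [[name] for name in indicators]                 # Step 1: one class per indicator
    while len(classes) > n_classes:                           # Step 4: stop at the target count
        best = None
        for a, b in combinations(range(len(classes)), 2):     # Step 2: try every pairwise merge
            merged = classes[a] + classes[b]
            rest = [c for i, c in enumerate(classes) if i not in (a, b)]
            total = sum(class_ss([indicators[n] for n in c])  # Step 3: total sum of squares
                        for c in rest + [merged])
            if best is None or total < best[0]:
                best = (total, a, b)
        _, a, b = best
        classes = ([c for i, c in enumerate(classes) if i not in (a, b)]
                   + [classes[a] + classes[b]])
    return classes

# Hypothetical indicators: x1 and x2 move together, x3 behaves differently.
toy = {
    "x1": [1.0, 2.0, 3.0],
    "x2": [1.1, 2.1, 2.9],
    "x3": [-3.0, 0.0, 3.0],
}
result = r_cluster(toy, 2)
```

The brute-force search recomputes every class at each merge, which is acceptable for the handful of indicators inside one criterion layer but would be slow for a large index system.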
In the R cluster analysis, the reasonable number of categories is between 2 and 4. To avoid choosing the number of categories subjectively, the nonparametric K-W test is applied to each class after clustering to judge whether the number of classes $l$ is reasonable. The null hypothesis of the nonparametric K-W test is that there are no significant differences in the numerical characteristics of the different indicators.
If the significance level of each category satisfies sig > 0.05, the null hypothesis is accepted. That is to say, there is no significant difference between the indicators in the same class, and the number of classes is reasonable. Otherwise, the indicators should be reclustered.
(2) Analysis of the Size of the Discriminant Force Based on the Coefficient of Variation. An indicator's coefficient of variation reflects its identification ability. The bigger an indicator's coefficient of variation, the more information it contains. Therefore, the indicator with the biggest coefficient of variation within a class should be retained.
Assume that $\sigma_j$ is the overall standard deviation of the $j$th indicator and $\bar{x}_j$ is the mean of the $j$th indicator; the coefficient of variation $v_j$ of the $j$th indicator is

$$v_j = \frac{\sigma_j}{\bar{x}_j}. \tag{12}$$

The advantage of the coefficient of variation is that the indicator with the largest coefficient of variation has a strong ability to distinguish different information and plays the largest role in the comprehensive evaluation. Removing the indicators whose coefficient of variation is small keeps the index system simple and effective.
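As a small illustration of this selection rule, the sketch below computes the coefficient of variation with the population standard deviation and keeps the indicator with the largest value in a class. The indicator names and values are hypothetical, and the sketch assumes every indicator has a nonzero mean.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (mean must be nonzero)."""
    return pstdev(values) / mean(values)

def pick_representative(cluster):
    """cluster: dict mapping indicator name -> list of sample values.
    Returns the name of the indicator with the largest coefficient of variation."""
    return max(cluster, key=lambda name: coefficient_of_variation(cluster[name]))

# Hypothetical class of two indicators: "a" is constant, "b" varies across samples.
cluster = {
    "a": [10, 10, 10, 10],
    "b": [5, 10, 15, 10],
}
kept = pick_representative(cluster)
```

A constant indicator has a coefficient of variation of zero and carries no discriminating information, so the varying indicator is the one retained.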
Assume that $\mathrm{LGD}_i$ is the default loss rate of the $i$th sample; $L_i$ is the receivable principal and interest of the $i$th sample that has not yet been repaid; and $R_i$ is the receivable principal and interest of the $i$th sample. Then

$$\mathrm{LGD}_i = \frac{L_i}{R_i}. \tag{13}$$

According to the type of default loss rate, the 860 customers are divided into three categories, which are placed in column 73 of Table 1.

Screening of Indexes Based on Nonparametric K-Nearest Neighbor Discriminant Method.
Select the optimal value of $K$. In this paper, the sample size is 860, and the $K$ value should be smaller than the square root of the sample size, so $K$ is less than $\sqrt{860} \approx 29.33$ and should belong to [1, 29].
Find the best $K$ value. According to the objective function, the best value of $K$ minimizes the test error of all the samples. The test error for each value $K = 1, 2, 3, \ldots, 29$ is used to draw the trend of the test error against the $K$ value. Determine the optimal $K$ value. It can be seen from Figure 1 that the $K$ value corresponding to the minimum test error is 1, so the value of $K$ is 1.
The specific process of screening indicators is based on the nonparametric K-nearest neighbor discriminant method. The indicators are placed in column 1 of Table 2.
The discrimination accuracy $A_0$ of the 68 indicators is obtained by the nonparametric K-nearest neighbor discriminant method.
Step 1. Calculate the discriminant accuracy HDA$_0$ of the high default loss rate sample. Among the 24 high default loss rate samples, 13 were accurately discriminated by the nonparametric K-nearest neighbor discriminant method. According to formula (7), the discriminant accuracy of the high default loss rate sample is HDA$_0$ = 13/24 = 54.17%, placed in column 2 of Table 2.
Step 2. Calculate the discriminant accuracy LDA$_0$ of the low default loss rate sample. Among the 6 low default loss rate samples, none was accurately discriminated by the nonparametric K-nearest neighbor discriminant method. According to formula (7), the discriminant accuracy of the low default loss rate sample is LDA$_0$ = 0/6 = 0%, placed in column 2 of Table 2.
Step 3. Calculate the discriminant accuracy UA$_0$ of the nondefault sample. Among the 830 nondefault samples, 823 were accurately discriminated by the nonparametric K-nearest neighbor discriminant method. According to formula (7), the discriminant accuracy of the nondefault sample is UA$_0$ = 823/830 = 99.16%, placed in column 2 of Table 2.
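The three class accuracies can be checked, and combined into an overall accuracy, with the short Python sketch below. The counts are the ones reported in Steps 1 to 3; the overall figure assumes that the weights in the weighted average are the class sample sizes, and the variable names are illustrative.

```python
def class_accuracy(correct, actual):
    """Share of a class's samples that were correctly discriminated."""
    return correct / actual

# (correctly discriminated, actual) counts reported in Steps 1 to 3.
counts = {
    "high_lgd": (13, 24),
    "low_lgd": (0, 6),
    "nondefault": (823, 830),
}

acc = {c: class_accuracy(k, n) for c, (k, n) in counts.items()}

# Overall accuracy as a sample-size-weighted average (assumed weighting).
n_total = sum(n for _, n in counts.values())
overall = sum(n * acc[c] for c, (_, n) in counts.items()) / n_total

for name, value in acc.items():
    print(f"{name}: {value:.2%}")
print(f"overall: {overall:.2%}")
```

Under this weighting the overall accuracy equals the total number of correctly discriminated samples divided by the total sample size, 836/860, or about 97.21%. The high overall figure is driven almost entirely by the large nondefault class.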
Each of the 68 indicators is deleted in turn, and the discrimination accuracy $A_j$ of the remaining 67 indicators is calculated by the nonparametric K-nearest neighbor discriminant method.
The discriminant accuracy of the high default loss rate sample, of the low default loss rate sample, of the nondefault sample, and of the total sample is obtained from the 67 indicators that remain after removing indicator $X_1$, and the results are placed in the first row of Table 2. Similarly, $X_2$ to $X_{68}$ are removed one by one, and the discriminant accuracy of the high default loss rate sample, the low default loss rate sample, the nondefault sample, and the total sample is calculated and placed in the other rows of Table 2. Substituting $A_j$ and $A_0$ into (9), $D_j = A_j - A_0$, gives the degree of influence of the $j$th indicator on the discrimination accuracy; the degree of influence is placed in column 7 of Table 2.
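The delete-one-indicator loop and the resulting retain/remove decision can be sketched as below. The accuracy function is passed in as a callback because the actual discriminant model is outside the scope of this sketch; `toy_accuracy` is a hypothetical stand-in, not the paper's model, and all names are illustrative.

```python
def influence_degrees(indicators, accuracy_fn):
    """D_j = A_j - A_0 for each indicator j, where A_j is the accuracy
    obtained after dropping j and A_0 is the accuracy with all indicators."""
    base = accuracy_fn(indicators)
    return {
        j: accuracy_fn([i for i in indicators if i != j]) - base
        for j in indicators
    }

def screen(degrees):
    """Retain only indicators whose removal lowers accuracy (D_j < 0)."""
    return [j for j, d in degrees.items() if d < 0]

def toy_accuracy(subset):
    # Hypothetical stand-in for the discrimination accuracy of a model
    # fitted on `subset`: x1 helps, x2 hurts, x3 makes no difference.
    return 0.5 + 0.1 * ("x1" in subset) - 0.05 * ("x2" in subset)

degrees = influence_degrees(["x1", "x2", "x3"], toy_accuracy)
retained = screen(degrees)
```

In this toy setting only `x1` is retained: dropping `x2` raises accuracy and dropping `x3` leaves it unchanged, so both are removed.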
Screen indicators based on the degree of discrimination of different indicators.
Standard 1 (remove indicators whose $D_j > 0$). According to the degree of influence $D_j$ in the second column of Table 3, the degree of influence of $X_7$, $X_{14}$, and $X_{48}$ is larger than 0. Discrimination accuracy improves if indicators of this type are eliminated, and the results are placed in the corresponding rows of column 3 of Table 3.
Standard 2 (remove indicators whose $D_j = 0$). According to the degree of influence $D_j$ in the second column of Table 3, the degree of influence of $X_1$, $X_{21}$, $X_{27}$, and $X_{68}$ is equal to 0. Discrimination accuracy does not change if indicators of this type are eliminated, and the results are placed in the corresponding rows of column 3 of Table 3.
Standard 3 (retain indicators whose $D_j < 0$). According to the degree of influence $D_j$ in the second column of Table 3, the degree of influence of $X_{51}$, $X_{64}$, and $X_{65}$ is less than 0. Discrimination accuracy decreases if indicators of this type are eliminated, and the results are placed in the corresponding rows of column 3 of Table 3.
(1) In the criterion layer of solvency, there is only one indicator, so it forms a single class and no K-W test is needed. The criterion layers of the basic situation of the legal representative and of nonfinancial factors within the enterprise are in the same situation, so $X_9$, $X_{51}$, and $X_{37}$ should be reserved.
(2) In the criterion layer of operating capacity, there are two indicators. First, the two indicators are placed in one class. The K-W test for these two indicators gives a significance level below 0.05, so the null hypothesis that $X_{28}$ and $X_{29}$ have the same data features is rejected. $X_{28}$ and $X_{29}$ therefore have different data features and should both be reserved.
(3) In the criterion layer of enterprise external macroconditions, there are two indicators. First, the two indicators are placed in one class. The K-W test for these two indicators gives a significance level below 0.05, so the null hypothesis that $X_{64}$ and $X_{65}$ have the same data features is rejected. $X_{64}$ and $X_{65}$ therefore have different data features and should both be reserved.
(4) In the criterion layer of profitability, there are three indicators. First, the three indicators are divided into 2 classes. The K-W test for the two indicators that share a class gives a significance level below 0.05, so the null hypothesis that they have the same data features is rejected.
(2) In the criterion layer of the basic situation of the legal representative, $X_{51}$ should be reserved. (3) In the criterion layer of nonfinancial factors within the enterprise, $X_{37}$ should be reserved. (4) In the criterion layer of operating capacity, $X_{28}$ and $X_{29}$ should both be reserved. (5) In the criterion layer of enterprise external macroconditions, $X_{64}$ and $X_{65}$ should both be reserved. (6) In the criterion layer of profitability, $X_{19}$, $X_{20}$, and $X_{23}$ should all be reserved.

Analysis of the Size of the Discriminant Force Based on the Coefficient of Variation.
R cluster analysis shows that there is no redundant information in any criterion layer, so there is no need to use the coefficient of variation to delete indicators with weaker identification ability. This completes the second stage of indicator screening.
By applying the nonparametric K-nearest neighbor discriminant method and R cluster analysis, the paper establishes a small enterprise credit evaluation indicator system that contains 6 criterion layers and 10 indicators.

Comparative Analysis.
To demonstrate the superiority of the combined model of the nonparametric K-nearest neighbor discriminant method and R cluster analysis proposed in this paper, the combined model is compared with a stepwise discriminant analysis model and a neural network model. The superiority of an indicator screening model is reflected in the identification ability of the indicators it selects. Therefore, this article compares the discriminatory power of the three models.
Comparative analysis includes the following two steps.
Step 1. The combined model, the stepwise discriminant analysis model, and the neural network model are each used to screen the indicators that have significant discriminating ability on the default loss rate.
Step 2. Use each selected index system to test the discrimination ability of the corresponding model. The higher the discriminative power of the model, the greater its superiority.
Table 6 shows the discriminating ability of the three models for all types of samples. The discriminatory power of the combined model is higher than that of the stepwise discriminant analysis model and the neural network model, both for individual types of samples and for all the samples together. Therefore, the combined model is superior to the other two models; that is to say, the index system screened by the combined model has stronger identification ability. In addition, the combined model is more suitable for multiclass problems because it has higher discriminative power when dealing with them.

Conclusion
The Main Conclusions.
The credit index system, which contains 10 indicators, is selected by the combination of the nonparametric K-nearest neighbor discriminant method and R cluster analysis.

The Characteristics of This Article.
First, the nonparametric K-nearest neighbor discriminant method is used to select the indicators that have significant discriminating ability on samples with different default loss rates. This makes up for the deficiency of previous studies, which mainly focus on the default state.
Second, the R cluster analysis applied in this paper is carried out within each criterion layer rather than over the whole index system. This ensures that the indicators clustered into the same class share both economic implications and data features, and it avoids clusters of indicators that share data characteristics but differ in economic implications.

Figure 1: The trend of the $K$ value. $E_1$ is the test error of the high default loss rate sample; $E_2$ is the test error of the low default loss rate sample; $E_3$ is the test error of the nondefault sample; $K$ is the number of nearest neighbors; $n$ is the sample size; $n_1$, $n_2$, and $n_3$ are the sample sizes of the high default loss rate sample, the low default loss rate sample, and the nondefault sample.
(6) Three Criteria of Indicator Screening Based on Nonparametric K-Nearest Neighbor Discriminant.
Criterion 1. If the discrimination accuracy $A_j$ of the remaining indicators after the $j$th indicator is excluded is larger than the discrimination accuracy $A_0$ of all the indicators, that is to say $D_j > 0$, the discrimination accuracy improves after the indicator is deleted, so the indicator should be removed. Mark this as standard one. All the indicators that meet Criterion 1 should be removed.
Criterion 2. If the discrimination accuracy $A_j$ of the remaining indicators after the $j$th indicator is excluded is equal to the discrimination accuracy $A_0$ of all the indicators, that is to say $D_j = 0$, the discrimination accuracy does not change after the indicator is deleted, so the indicator should be removed. Mark this as standard two. All the indicators that meet Criterion 2 should be removed.
Criterion 3. If the discrimination accuracy $A_j$ of the remaining indicators after the $j$th indicator is excluded is smaller than the discrimination accuracy $A_0$ of all the indicators, that is to say $D_j < 0$, the discrimination accuracy decreases after the indicator is deleted, so the indicator should be retained. Mark this as standard three. All the indicators that meet Criterion 3 should be retained.
The R cluster analysis of the indicators in the same criterion layer is carried out by the sum of squared deviations method. Assume that $S_i$ is the sum of squared deviations of the $i$th class of indicators ($i = 1, 2, 3, \ldots, l$); the $m$ indicators are divided into $l$ classes; $n_i$ is the number of indicators in the $i$th class; $X_j^{(i)}$ is the standardized sample value vector of the $j$th indicator in the $i$th class ($j = 1, 2, \ldots, n_i$); and $\bar{X}^{(i)}$ is the mean vector of the $i$th class of indicators:

$$S_i = \sum_{j=1}^{n_i} \left(X_j^{(i)} - \bar{X}^{(i)}\right)^{\mathsf{T}} \left(X_j^{(i)} - \bar{X}^{(i)}\right). \tag{10}$$

The default loss rate calculated by formula (13) is set out in column 72 of Table 1.

Table 1: Loan data of 860 microenterprise customers.

Table 2: The result of the nonparametric K-nearest neighbor discriminant model.

Table 3: The result of indicator screening based on the nonparametric K-nearest neighbor identification model.

Table 4: The retained indicators after the first indicator screening.

Indicator Screening Results. 58 of the 68 indicators were excluded, and 10 indicators were retained. Table 4 shows the indicators retained by the nonparametric K-nearest neighbor discriminant method.

Table 5: Indicators selection based on R clustering analysis.

Table 6: Comparative analysis of 3 models.

… be clustered into 2 classes. In this case, there is no need to divide the three indicators into one category. Finally, in this criterion layer, the three indicators should be divided into three categories, so $X_{19}$, $X_{20}$, and $X_{23}$ should all be reserved. The classification results of the indicators are as follows. (1) In the criterion layer of solvency, $X_9$ should be reserved.