Establishment of the Credit Indicator System of Micro Enterprises Based on Support Vector Machine and R-Type Clustering

Micro enterprise credit indicators with credit identification ability are selected in two rounds. The first round uses the two-class classification model of the Support Vector Machine to retain the indicators that can identify credit status; the second round deletes credit indicators with redundant information by clustering the variables according to the principle of minimum sum of deviation squares. This paper provides a screening model for credit evaluation indicators of micro enterprises and applies it to the credit data of 860 micro enterprise samples in Inner Mongolia in western China. The test results show that, first, the final credit indicator system constructed for micro enterprises is in line with the 5C model; second, the validity test based on the ROC (Receiver Operating Characteristic) curve reveals that each of the screened credit evaluation indicators is valid.


Introduction
The large number of micro enterprises plays an irreplaceable role in promoting economic growth, social employment, and people's livelihood in China. However, the financing difficulty of micro enterprises is becoming increasingly prominent and seriously inhibits their healthy development. Constructing a scientific credit evaluation indicator system that helps measure the credit risk of micro enterprises, and thereby eases their financing problem and promotes their healthy development, has become an urgent problem.
Regarding foreign research, SBSS (Small Business Scoring Service) is a credit evaluation model for micro enterprises created by Fair Isaac Corporation (USA), constructed by methods of mathematical statistics and historical data analysis. The SOHO (Small Office Home Office) model, a credit evaluation method for micro enterprises established by the Yachiyo Bank of Japan, mainly focuses on the analysis of qualitative nonfinancial indicators. The evaluation model of the CRD (Credit Risk Database) Operations Agreement rates each of the negative aspects of micro enterprises and their financing. By virtue of its corporate asset credit database and investigators, the Imperial Data Bank determines whether to lend to a micro enterprise through field interviews, visits, and indirect surveys. The micro enterprise credit indicators designed by India's credit evaluation company SMERA cover 6 aspects and apply different benchmarks to enterprises of different industries and different registered capital sizes.
Regarding domestic research, Zhanjiang [1] selected micro enterprise credit indicators through the Brown-Mood median test, the Moses variance test, and the Kendall rank correlation test. Guotai et al. [2] selected the indicator system according to the ability of an evaluation indicator to discriminate an enterprise's credit status, based on probit regression. Zhang et al. [3] studied the comprehensive evaluation indicator system of low-carbon road transport by using the analytic hierarchy process and the Delphi fuzzy evaluation method. Honghai [4] selected the indicators that contain more information and less redundant information according to the relative discrete coefficient, Pearson's correlation coefficient, and the cumulative information contribution rate. Youxi [5] selected indicators by combining the chi-square test with two further statistical tests after the initial construction of the indicator system.
The previous studies have three shortcomings. Firstly, enterprise credit evaluation is mainly focused on large enterprises, and research on micro enterprise credit evaluation is lacking. Secondly, the indicators that have been screened out cannot be guaranteed to significantly identify the micro enterprises' credit status, which leads to a higher false-positive rate in the final enterprise credit evaluation results. Thirdly, the final credit evaluation indicator systems contain indicators with redundant information; that is, the selection of indicators does not consider eliminating indicators that repeat information.
In this paper, we first select the indicators that can identify the credit status of micro enterprises based on SVM (Support Vector Machine). We then construct an indicator system based on R-type clustering by deleting the indicators with redundant information and retaining the indicators with strong credit identification ability, so that the selected credit indicators significantly identify the credit status of micro enterprises and do not carry duplicate information. Finally, we apply the constructed model to the credit data of micro enterprises in Inner Mongolia in western China.
The innovation of this paper lies in the following. First, the nonlinearity of the credit indicators is mapped to a high-dimensional space by the Gaussian kernel SVM and the evaluation indicators with credit identification ability are then filtered out, which solves the problem that the traditional linear weighting model cannot reflect the nonlinear relationship between the credit indicators and the evaluation results. Second, we use the statistic of Levene's test for homogeneity of variance, the $F$ value, to measure the credit identification ability of each indicator, cluster the indicators within each criterion layer by R-type hierarchical clustering, and keep the indicator with the largest $F$ value in each cluster, thereby both deleting the indicators with redundant information and retaining the indicators with significant credit identification ability.

Difficulties of the Problem
Difficulty 1. The first difficulty is how to ensure that each selected micro enterprise credit indicator has the ability to identify the micro enterprises' credit status. The indicators commonly used in credit evaluation do not necessarily have significant identification ability in micro enterprise credit ratings. In order to prevent companies with high default risk from obtaining a higher credit score, it is necessary to ensure that the selected indicators can identify the credit status of micro enterprises.
Difficulty 2. The second difficulty is how to avoid micro enterprise credit indicators that reflect repeated information while not mistakenly deleting the indicators with strong ability to identify the micro enterprises' credit status when the indicators with redundant information are eliminated. A good micro enterprise credit indicator system must not contain indicators with redundant information, and each indicator in the final model must have significant credit identification ability. Therefore, in the process of constructing the micro enterprise credit indicator model, avoiding overlapping information among credit indicators is necessary, but retaining the indicators with significant credit identification ability is even more important.

Ideas to Solve the Difficulties
(1) Ideas to Solve Difficulty 1. The credit identification ability of a set of credit indicators is the percentage of micro enterprises whose credit status it identifies correctly. In this paper, we predict the credit status of micro enterprises with the two-class classification model of SVM and obtain the credit identification ability of all the indicators, $A$, and the credit identification ability of the remaining indicators after deleting the $j$th indicator, $A_j$. The difference between $A_j$ and $A$ is defined as $d_j = A_j - A$, which is taken as the impact of the $j$th indicator on the evaluation results.
The $j$th indicator is removed or retained according to the sign of $d_j$, which filters the credit evaluation indicators. Specifically, if $d_j \geq 0$, the credit identification ability of the remaining indicators after deleting the $j$th indicator is stronger than or equal to the credit identification ability of all the indicators; this indicates that the $j$th indicator cannot identify the enterprises' credit status, so it is deleted. If $d_j < 0$, the credit identification ability of the remaining indicators after deleting the $j$th indicator is weaker than the credit identification ability of all the indicators; this indicates that the $j$th indicator can identify the credit status of the micro enterprise, so it is retained. The ideas to solve difficulty 1 are shown in Figure 1.
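The delete-one-indicator screening described above can be sketched as follows. This is a minimal illustration, not the paper's MATLAB/LIBSVM implementation: the classifier is passed in as a callable, and the default `nearest_centroid_predict` is a hypothetical stand-in used only so that the sketch runs without third-party libraries; a Gaussian-kernel SVM would be supplied in its place. All function names are assumptions made for this example, and at least two indicators are assumed.

```python
def accuracy(y_true, y_pred):
    """Share of enterprises whose credit status is predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def nearest_centroid_predict(train_X, train_y, test_X):
    """Stand-in classifier (NOT an SVM): assign each row to the nearest class mean."""
    dims = len(train_X[0])
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
    preds = []
    for x in test_X:
        nearest = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )
        preds.append(nearest)
    return preds


def first_round_selection(train_X, train_y, test_X, test_y,
                          fit_predict=nearest_centroid_predict):
    """Return A and, per indicator j, the decision and d_j = A_j - A."""
    n_indicators = len(train_X[0])
    A = accuracy(test_y, fit_predict(train_X, train_y, test_X))
    decisions = {}
    for j in range(n_indicators):
        keep = [k for k in range(n_indicators) if k != j]
        tr = [[row[k] for k in keep] for row in train_X]
        te = [[row[k] for k in keep] for row in test_X]
        A_j = accuracy(test_y, fit_predict(tr, train_y, te))
        d_j = A_j - A
        # d_j >= 0: removing indicator j does not hurt, so it is deleted.
        decisions[j] = ("Delete" if d_j >= 0 else "Retain", d_j)
    return A, decisions
```

Passing the classifier as `fit_predict` keeps the screening rule separate from the model, so the same loop applies whichever SVM library supplies the predictions.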
(2) Ideas to Solve Difficulty 2. After R-type clustering, indicators in the same cluster are considered to reflect similar information and indicators in different clusters are considered to reflect different information. In this paper, the R-type hierarchical clustering method is used to cluster the indicators of the same criterion layer, which reflect the same type of information, according to the principle of the minimum sum of deviation squares, so that the indicators reflecting repeated information fall into one cluster. Within each cluster, the indicator with the strongest credit identification ability is retained and all the other indicators are deleted, which preserves the indicators with strong credit identification ability while deleting the indicators that reflect redundant information. The statistic of Levene's test for homogeneity of variance (hereinafter referred to as the $F$ value) is used to measure the credit identification ability of a credit indicator. The $F$ value reflects the idea that the more the mean value of a credit indicator in the default enterprise samples deviates from its mean value in all enterprise samples, the stronger the ability of the indicator to significantly identify the micro enterprises' credit status. The ideas to solve difficulty 2 are shown in Figure 2.

Principle of Building the Model.
The principle of building the credit indicator model of micro enterprise based on the methods of SVM and R-type clustering is shown in Figure 3.

Initial Selection and Standardization of Credit Indicators.
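There are two principles in the mass selection of indicators: retaining classic and high-frequency indicators and reflecting the characteristics of micro enterprises. Indicators that are unobservable, for which data cannot be obtained, or for which the original data are missing for more than 10% of the total sample are deleted directly. Interpolation is used for indicators whose data are missing for less than 10% of the total number of samples.
Set $x_{ij}$ as the standardized value of the $j$th indicator of the $i$th enterprise, $v_{ij}$ as the original value of the $j$th indicator of the $i$th enterprise, $n$ as the total number of micro enterprise samples, $q_1$ as the left border of the indicator's interval, and $q_2$ as the right border of the indicator's interval.
Then the standardized value of a positive indicator, $x_{ij}$, is
$$x_{ij} = \frac{v_{ij} - \min_{1 \le i \le n} v_{ij}}{\max_{1 \le i \le n} v_{ij} - \min_{1 \le i \le n} v_{ij}}. \tag{1}$$
The standardized value of a negative indicator, $x_{ij}$, is
$$x_{ij} = \frac{\max_{1 \le i \le n} v_{ij} - v_{ij}}{\max_{1 \le i \le n} v_{ij} - \min_{1 \le i \le n} v_{ij}}. \tag{2}$$
The standardized value of an interval indicator, $x_{ij}$, is
$$x_{ij} = \begin{cases} 1 - \dfrac{q_1 - v_{ij}}{\max\left(q_1 - \min_{1 \le i \le n} v_{ij},\ \max_{1 \le i \le n} v_{ij} - q_2\right)}, & v_{ij} < q_1, \\ 1 - \dfrac{v_{ij} - q_2}{\max\left(q_1 - \min_{1 \le i \le n} v_{ij},\ \max_{1 \le i \le n} v_{ij} - q_2\right)}, & v_{ij} > q_2, \\ 1, & q_1 \le v_{ij} \le q_2. \end{cases} \tag{3}$$
The standardization rules for qualitative indicators are shown in Table 1.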
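The standardization step can be sketched in a few lines. The sketch assumes the usual min-max forms for positive, negative, and interval indicators (the larger the better, the smaller the better, and best inside $[q_1, q_2]$, respectively); the function names are chosen for illustration only, and each function takes the original values of one indicator across all enterprises.

```python
def _bounds(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("indicator is constant; min-max scaling is undefined")
    return lo, hi


def standardize_positive(values):
    """Larger original values map closer to 1."""
    lo, hi = _bounds(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize_negative(values):
    """Smaller original values map closer to 1."""
    lo, hi = _bounds(values)
    return [(hi - v) / (hi - lo) for v in values]


def standardize_interval(values, q1, q2):
    """Values inside [q1, q2] map to 1; the score falls off outside the interval."""
    lo, hi = min(values), max(values)
    # Largest distance from the interval on either side (assumed scaling term).
    scale = max(q1 - lo, hi - q2)
    out = []
    for v in values:
        if v < q1:
            out.append(1 - (q1 - v) / scale)
        elif v > q2:
            out.append(1 - (v - q2) / scale)
        else:
            out.append(1.0)
    return out
```

For example, `standardize_interval([0, 5, 10, 20], 5, 10)` scores the two values inside the interval as 1 and penalizes 20 more than 0, because 20 lies further from the interval.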

The Method of the First Round of Indicator Selection Based on SVM
(1) The Determination of Kernel Function. In this paper, the Gaussian radial basis function is selected as the kernel function of the SVM used for classification prediction in the first round of indicator selection, for three main reasons. Firstly, a linear kernel function is suitable for linearly separable situations, whereas the Gaussian radial basis function is suitable for linearly inseparable situations; for the nonlinear relationship between credit indicators and evaluation results, the Gaussian radial basis function can give more accurate results than a linear kernel function. Secondly, the number of parameters in the kernel function affects the accuracy of the model: kernel functions with fewer parameters help to improve accuracy, and the Gaussian radial basis function has fewer parameters than other kernel functions. Thirdly, using the Gaussian radial basis function as the SVM's kernel function also reduces the difficulty of the calculation.
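For reference, the Gaussian radial basis function kernel has a single parameter. The sketch below uses the parametrization $K(x, z) = \exp(-g \lVert x - z \rVert^2)$, which is the form LIBSVM uses for its RBF kernel; the function name is chosen for this example.

```python
import math


def gaussian_rbf_kernel(x, z, g):
    """Gaussian RBF kernel value for two indicator vectors and parameter g > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-g * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart; a larger $g$ makes the decay faster.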
(2) The Criteria of Selection.
Criterion 1. $d_j > 0$, that is, $A_j > A$, indicating that the credit identification ability of the remaining indicators after deleting the $j$th indicator is stronger than the credit identification ability of all the indicators when the $j$th indicator is not deleted; the $j$th indicator cannot distinguish default enterprises from nondefault enterprises and is to be deleted.
Criterion 2. $d_j = 0$, that is, $A_j = A$, indicating that the credit identification ability of the remaining indicators after deleting the $j$th indicator is equal to the credit identification ability of all the indicators when the $j$th indicator is not deleted; the $j$th indicator cannot distinguish default enterprises from nondefault enterprises and is to be deleted.
(3) Calculation of Credit Identification Ability of Credit Indicator. Set $A$ as the credit identification ability of all the indicators for all micro enterprise samples, $n_0$ as the total number of nondefault enterprises, $n_1$ as the total number of default enterprises, $y_i$ as the true value of the default status of the $i$th enterprise ($y_i = 0$: the $i$th enterprise is nondefault; $y_i = 1$: the $i$th enterprise is default), and $\hat{y}_i$ as the predictive value of the default status of the $i$th enterprise. Then $A$ is given by formula (4). In this paper, $A_j$ (the credit identification ability, for all micro enterprise samples, of the indicators remaining after deleting the $j$th indicator) is obtained by replacing $\hat{y}_i$ in the numerator of formula (4) with $\hat{y}_i^{(j)}$ (the predictive value of the credit status of the $i$th enterprise calculated from the indicators remaining after deleting the $j$th indicator).
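Formula (4) itself is not reproduced in the text above. One form that would be consistent with the stated definitions (credit identification ability as the percentage of enterprises whose status is identified correctly, with the predictive value appearing in the numerator) is sketched below. The exact expression for $A$ is an assumption, as is the notation $\hat{y}_i^{(j)}$ for the prediction made without the $j$th indicator; the last relation follows directly from the definition of the influence of the $j$th indicator as the difference between its two abilities.

```latex
A = \frac{\sum_{i=1}^{n_0+n_1} \left( 1 - \lvert y_i - \hat{y}_i \rvert \right)}{n_0 + n_1},
\qquad
A_j = \frac{\sum_{i=1}^{n_0+n_1} \left( 1 - \lvert y_i - \hat{y}_i^{(j)} \rvert \right)}{n_0 + n_1},
\qquad
d_j = A_j - A .
```

Because both $y_i$ and the predictions take values in $\{0, 1\}$, each term of the sum is 1 for a correct prediction and 0 for an incorrect one, so $A$ is simply the share of correctly classified enterprises.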

The Method of the Second Round of Indicator Selection Based on R-Type Clustering
(2) The Calculation of Deviation Sum of Squares. Set $S_h$ as the deviation sum of squares of the $h$th criterion layer, $L_h$ as the number of clusters in the $h$th criterion layer, $m_k$ as the number of indicators in the $k$th cluster of the $h$th criterion layer, $X_j^{(k)}$ as the vector of the $j$th indicator in the $k$th cluster of the $h$th criterion layer, and $\bar{X}_k$ as the mean vector of all the indicators in the $k$th cluster of the $h$th criterion layer. Then $S_h$ is given as follows:
$$S_h = \sum_{k=1}^{L_h} \sum_{j=1}^{m_k} \left( X_j^{(k)} - \bar{X}_k \right)^{\mathrm{T}} \left( X_j^{(k)} - \bar{X}_k \right). \tag{6}$$
(3) Nonparametric Test. In this paper, a nonparametric test is used to test the rationality of the number of clusters, that is, to test whether there is a significant difference between the credit indicators of the same cluster. If the test is not passed, there is a significant difference between the indicators of the same cluster and they cannot be clustered into one cluster; in this case, the number of clusters needs to be reset. If the test is passed, there is no significant difference between the indicators of the same cluster and they can be clustered into one cluster; in this case, the indicator with the largest $F$ value in each cluster is retained and all the other indicators are deleted, which keeps the indicator with the strongest credit identification ability, removes the indicators with redundant information, and completes the second round of indicator selection. Specifically, the test is as follows:
H0: there is no significant difference between the indicators within the cluster.
H1: there is a significant difference between the indicators within the cluster.
The significance level is set to 0.01.
When sig. > 0.01, accept H0, so these indicators can be clustered into one cluster.
When sig. < 0.01, reject H0, so these indicators cannot be clustered into one cluster.
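(4) The Calculation of $F$ Value. Set $F_j$ as the $F$ value of the $j$th indicator, $n$ as the total number of enterprise samples, $n_0$ as the total number of nondefault enterprise samples, $n_1$ as the total number of default enterprise samples, $Z^{0}_{ij}$ as the absolute value of the difference between the $j$th indicator of the $i$th nondefault enterprise and the mean value of the $j$th indicator of all nondefault enterprises, and $Z^{1}_{ij}$ as the absolute value of the difference between the $j$th indicator of the $i$th default enterprise and the mean value of the $j$th indicator of all default enterprises. Write $\bar{Z}^{0}_{j}$ and $\bar{Z}^{1}_{j}$ for the means of $Z^{0}_{ij}$ and $Z^{1}_{ij}$ over the nondefault and default samples, respectively, and $\bar{Z}_{j}$ for the mean over all $n$ samples. Then the $j$th indicator's $F$ value, $F_j$, is given as follows:
$$F_j = \frac{(n - 2) \left[ n_0 \left( \bar{Z}^{0}_{j} - \bar{Z}_{j} \right)^2 + n_1 \left( \bar{Z}^{1}_{j} - \bar{Z}_{j} \right)^2 \right]}{\sum_{i=1}^{n_0} \left( Z^{0}_{ij} - \bar{Z}^{0}_{j} \right)^2 + \sum_{i=1}^{n_1} \left( Z^{1}_{ij} - \bar{Z}^{1}_{j} \right)^2}. \tag{7}$$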
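As an illustration of the $F$ value, the sketch below computes the standard two-group Levene statistic with deviations taken from the group means, which is the form the definitions above describe. The function name and the toy inputs are assumptions made for this example; the inputs are the standardized values of one indicator for the nondefault and the default samples.

```python
def levene_f_value(nondefault, default):
    """Two-group Levene statistic for one indicator (deviations from group means)."""
    n0, n1 = len(nondefault), len(default)
    n = n0 + n1
    mean0 = sum(nondefault) / n0
    mean1 = sum(default) / n1
    # Absolute deviations of each sample from its own group mean.
    z0 = [abs(v - mean0) for v in nondefault]
    z1 = [abs(v - mean1) for v in default]
    zbar0 = sum(z0) / n0
    zbar1 = sum(z1) / n1
    zbar = (sum(z0) + sum(z1)) / n
    between = n0 * (zbar0 - zbar) ** 2 + n1 * (zbar1 - zbar) ** 2
    within = sum((z - zbar0) ** 2 for z in z0) + sum((z - zbar1) ** 2 for z in z1)
    # With two groups the degrees-of-freedom factor (n - k) / (k - 1) is n - 2.
    return (n - 2) * between / within
```

The statistic is 0 when the two groups have the same average absolute deviation and grows as their spreads differ; the function raises `ZeroDivisionError` if the deviations do not vary within either group.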

The Validity Test of Credit Indicators.
The ROC curve is a comprehensive indicator that reflects the sensitivity and specificity of continuous variables. The vertical coordinate of the ROC curve, sensitivity, is the proportion of default samples that are judged correctly. Specificity is the proportion of nondefault samples that are judged correctly, so the horizontal coordinate of the ROC curve, 1 − specificity, is the proportion of nondefault samples that are judged incorrectly. For a fixed horizontal coordinate, the larger the vertical coordinate, the higher the proportion of default samples judged correctly, the larger the AUC (area under the ROC curve) of the corresponding credit indicator, the stronger the ability of the indicator to identify default samples, and the more effective the indicator. Based on the ROC curve, this paper tests the validity of the screened indicators. The criteria for whether an indicator can accurately identify the credit status of the enterprise samples are as follows: when 0 ≤ AUC < 0.5, it does not have the accuracy of identification; when 0.5 ≤ AUC < 1, it has the accuracy of identification.
The indicators for the first round are filtered using the classification and prediction of SVM so as to pick out the indicators that can identify the credit status of micro enterprises. The division of micro enterprise samples is shown in Table 3.
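The AUC criterion from the validity test can be illustrated with a small sketch. It computes the AUC as the probability that a randomly chosen default sample scores higher than a randomly chosen nondefault sample, counting ties as one half, which equals the area under the empirical ROC curve. The sketch assumes that a higher score points toward default; for an indicator where higher values mean better credit, the scores would need to be negated first. The function names are chosen for this example and the paper itself uses SPSS for this step.

```python
def auc(scores, labels):
    """Area under the ROC curve; labels: 1 = default, 0 = nondefault."""
    default_scores = [s for s, y in zip(scores, labels) if y == 1]
    nondefault_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for d in default_scores:
        for nd in nondefault_scores:
            if d > nd:
                wins += 1.0
            elif d == nd:
                wins += 0.5
    return wins / (len(default_scores) * len(nondefault_scores))


def has_identification_accuracy(auc_value):
    """Validity criterion from the text: an indicator passes when AUC >= 0.5."""
    return auc_value >= 0.5
```

The pairwise form is quadratic in the sample size, which is acceptable for a few hundred enterprises; a rank-based computation gives the same value faster on larger samples.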

The Application of the Model
(2) Determination of Optimal Parameters. To calculate $\hat{y}_i$ in formula (4), and $\hat{y}_i^{(j)}$ after deleting the $j$th indicator, by the classification and prediction of SVM, it is necessary to determine the penalty coefficient $c$ and the Gaussian radial basis function parameter $g$. MATLAB and the LIBSVM toolbox are used to determine them: $c$ is selected in steps of 0.5 between $2^{-4}$ and $2^{6}$ and $g$ is selected in steps of 0.5 between $2^{-5}$ and $2^{5}$; the cross validation is set to 3-fold; the accuracy rate discretization display step is set to 0.9. The program is then run in MATLAB with these settings and with the training set and test set selected according to Table 2, columns (1)-(860), which gives the optimal Gaussian radial basis function parameter $g = 5.6569$ and the optimal penalty coefficient $c = 0.125$.
In Table 4, "Delete" indicates that the corresponding credit indicator is deleted and "Retain" indicates that the corresponding credit indicator is retained in the first round of indicator selection based on the SVM.
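The search grid for the two parameters can be enumerated as in the sketch below. It assumes that the step of 0.5 applies to the exponent of 2, which is the usual convention for LIBSVM grid searches and is consistent with the reported optima ($0.125 = 2^{-3}$ and $5.6569 \approx 2^{2.5}$); the function name is chosen for this example and the cross-validation itself is not shown.

```python
def exponent_grid(low, high, step=0.5):
    """Exponents from low to high inclusive, spaced by step."""
    count = int(round((high - low) / step)) + 1
    return [low + k * step for k in range(count)]


# Candidate values for the penalty coefficient c and the kernel parameter g.
c_grid = [2.0 ** e for e in exponent_grid(-4, 6)]
g_grid = [2.0 ** e for e in exponent_grid(-5, 5)]
```

Each grid has 21 values, so a full search evaluates 441 parameter pairs, each with 3-fold cross validation.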
After the first round of indicator selection, we delete 25 indicators and keep 43 indicators that can identify the credit status of micro enterprises.
(2) Clustering the Indicators within the Criteria Layer. The indicators in the first criterion layer, internal financial factors, are used as an example for clustering; the other two criterion layers are processed similarly.
The indicators marked "Retain" in Table 4, column (3), are clustered within each criterion layer by R-type hierarchical clustering according to the principle of minimum sum of deviation squares; the clustering results are shown in Table 5, column (1). To avoid indicators being mistakenly deleted in the second round of selection because of significant differences between the evaluation indicators within a cluster, the nonparametric test is run in SAS software on the clustered credit indicators at a significance level of 0.01 (except for clusters with only one indicator); the sig. values of the test for each cluster are shown in Table 5, column (2). According to the test criterion, all 20 clusters of indicators are reasonable, so there is no need to reset the number of clusters.
For the standardized values of the 20 credit indicators remaining after the final selection, shown in column (b) of Table 6, the ROC curve in SPSS software is used to test the validity of the indicators in the constructed micro enterprise credit indicator system; the ROC curve of each indicator is shown in Figure 4 and the AUC of each indicator is shown in column (2) of Table 6. As shown in column (2) of Table 6, the AUC values of the 20 credit indicators remaining after the final selection are all greater than the critical value of 0.5; as shown in column (3) of Table 5, the results of the validity test show that all the indicators remaining after the final selection have passed the validity test.
(2) Compared with the 5C element model, the results show that all the credit indicators of the micro enterprise credit indicator model in this paper can be related to the elements of the 5C element model, so the information in the constructed micro enterprise credit indicator model covers all the elements of the 5C element model.

Conclusions
[Figure: flowchart of the first round of indicator selection. Calculate the credit identification ability of all indicators, A; delete the jth indicator and calculate the credit identification ability of the remaining indicators, A_j; calculate the influence of the jth indicator on the evaluation results, d_j; if d_j ≥ 0, delete the jth indicator, otherwise keep it; repeat until all the indicators have been calculated.]

Figure 3: Principle of building the credit indicator model of micro enterprise. The classification and prediction of SVM is used to filter out the indicators with the ability to identify the credit status of micro enterprises; R-type clustering is used to delete the indicators with redundant information and retain the indicators with strong ability to identify the credit status of micro enterprises; the result is the micro enterprise credit evaluation indicator system, followed by the validity test of the credit indicators based on the ROC curve.

(1) The Criteria of Selection. After the first round of indicator selection, the indicators inside the same criterion layer are clustered by R-type hierarchical clustering according to the principle of minimum deviation sum of squares. When the total number of clusters reaches the preset value, $L$, the validity of the number of clusters is verified by the nonparametric test: if the test is not passed, the number of clusters is reset; if the test is passed, the indicator with the largest $F$ value in each cluster is retained and all the other indicators are deleted, which keeps the indicator with the strongest credit identification ability and removes the indicators with redundant information.
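The retention rule in this criterion can be sketched as follows, assuming the clusters and the $F$ values have already been computed. The function name and the indicator names used in the usage note are hypothetical.

```python
def second_round_selection(clusters, f_values):
    """Keep the indicator with the largest F value in each cluster, delete the rest."""
    decisions = {}
    for members in clusters:
        # On a tie, max() keeps the first indicator listed in the cluster.
        best = max(members, key=lambda name: f_values[name])
        for name in members:
            decisions[name] = "Retain" if name == best else "Delete"
    return decisions
```

For instance, with clusters `[["X1", "X2"], ["X3"]]` and $F$ values `{"X1": 2.1, "X2": 7.4, "X3": 0.9}`, the rule retains `X2` and `X3` and deletes `X1`; a cluster with a single indicator always keeps that indicator.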

(3) Calculation of the Degree of the Influence of Credit Indicator on Evaluation Results. The training model is established in MATLAB using the selected training set and the optimal parameters $c$ and $g$; the value of $\hat{y}_i$ in formula (4) is obtained by predicting the credit status of the enterprises in the test set. The $j$th indicator is then deleted from the training set and the test set at the same time, and the training model is established in MATLAB with the optimal parameters $c$ and $g$ using the training set from which the $j$th indicator has been removed; the value of $\hat{y}_i^{(j)}$ is obtained by predicting the credit status of the enterprises in the test set from which the $j$th indicator has been deleted. The values shown in Table 4, column (1), are obtained by substituting the two credit status predictive values obtained above and the credit status true values shown in Table 2, 69th row, into formula (4); the degree of the influence of each credit indicator on the evaluation results, $d_j$, shown in Table 4, column (2), is obtained by substituting the values shown in Table 4, column (1), into formula (5).

(4) The First Round of Credit Indicator Selection. $d_j$, shown in Table 4, column (2), represents the degree of the influence of the $j$th credit indicator on the evaluation results; the selection results obtained according to the first-round indicator selection criteria are shown in column (3) of Table 4.

4.3. The Second Round of Indicator Selection Based on R-Type Clustering. The second round of indicator selection, based on R-type clustering, is applied to the 43 indicators remaining after the first round; it filters out the indicators with strong credit identification ability and deletes the indicators with redundant information.

(1) Determine the Number of Clusters in Each Criterion Layer. The number of clusters in each criterion layer is calculated so that 20 credit indicators are retained in the final indicator model. Specifically, 43 indicators remain after the first round of indicator selection. Of these, 18 indicators remain in the first criterion layer, the internal financial factors, and they are divided into (18/43) × 20 ≈ 8 clusters. There are 20 indicators remaining in the second criterion layer, the internal nonfinancial factors; the collateral score indicator is kept directly, so as to correspond to the 5C factor analysis model, and is treated alone as a cluster, and the remaining 19 indicators are divided into (19/43) × 20 ≈ 9 clusters. There are 5 indicators remaining in the third criterion layer, external macro environmental factors, and they are divided into (5/43) × 20 ≈ 2 clusters.
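The proportional allocation above is simple arithmetic and can be checked with a short sketch; the function name is chosen for this example, and rounding to the nearest integer is assumed to be what the "≈" in the text denotes.

```python
def allocate_clusters(indicators_per_layer, total_indicators, target_total):
    """Share the target number of indicators across layers in proportion to layer size."""
    return [round(m / total_indicators * target_total) for m in indicators_per_layer]


# 18 financial, 19 nonfinancial (collateral score set aside), 5 macro indicators.
clusters_per_layer = allocate_clusters([18, 19, 5], 43, 20)
# The collateral score forms one extra cluster on its own.
total_clusters = sum(clusters_per_layer) + 1
```

With these inputs the three layers receive 8, 9, and 2 clusters, and the separately kept collateral score brings the total to the 20 indicators targeted for the final system.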
Firstly, treat each of the indicators marked "Retain" in Table 4, column (3), numbers 1-35, from the first criterion layer, the internal financial factors, as a cluster of its own, which forms 18 clusters. Then merge any two clusters of indicators into a new cluster, which groups the indicators within the first criterion layer into 17 clusters; there are $C_{18}^{2} = 153$ such clustering schemes. Substitute the standardized values of the indicators of each clustering scheme into formula (6) to calculate the deviation sum of squares of each scheme; the clustering scheme with the smallest deviation sum of squares is chosen, which completes the first round of clustering. Continue clustering in this way until the number of clusters in the first criterion layer reaches the preset quantity, 8. The clustering results of all the indicators are shown in Table 5, column (1).
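The merging procedure described above can be sketched as a brute-force agglomeration: at each round every pair of clusters is tried, the total deviation sum of squares of the resulting scheme is computed, and the scheme with the smallest total is kept. This is a minimal illustration under the assumption that each indicator is represented by its vector of standardized values across enterprises; the function names and the indicator names in the usage note are hypothetical.

```python
from itertools import combinations


def cluster_ss(cluster, data):
    """Deviation sum of squares of one cluster of indicators around its mean vector."""
    vectors = [data[name] for name in cluster]
    n_obs = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(n_obs)]
    return sum((v[i] - mean[i]) ** 2 for v in vectors for i in range(n_obs))


def r_type_clustering(data, target_clusters):
    """Merge indicators until target_clusters remain, minimizing the total sum of squares."""
    if target_clusters < 1:
        raise ValueError("target_clusters must be at least 1")
    clusters = [[name] for name in data]  # start with one cluster per indicator
    while len(clusters) > target_clusters:
        best_total, best_scheme = None, None
        # Try every pair of clusters; 18 clusters give C(18, 2) = 153 schemes.
        for a, b in combinations(range(len(clusters)), 2):
            scheme = [c for k, c in enumerate(clusters) if k not in (a, b)]
            scheme.append(clusters[a] + clusters[b])
            total = sum(cluster_ss(c, data) for c in scheme)
            if best_total is None or total < best_total:
                best_total, best_scheme = total, scheme
        clusters = best_scheme
    return clusters
```

For example, with `data = {"X1": [0.1, 0.2, 0.3], "X2": [0.1, 0.2, 0.35], "X3": [0.9, 0.8, 0.7], "X4": [0.5, 0.1, 0.9]}` and `target_clusters=3`, the nearly identical indicators `X1` and `X2` are merged and the other two stay on their own.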

(3) The Test of the Rationality of the Number of Clusters. Cluster the 43 credit indicators marked "Retain" shown in Table 4, column (3), within each criterion layer.

(4) The Calculation of $F$ Value. Substitute the standardized values of the 43 indicators marked "Retain" in column (3) of Table 4 into formula (7) to calculate the $F$ values of the 43 indicators; they are shown in Table 5, column (3).

(5) The Second Round of Credit Indicator Selection. The second round of indicator selection is achieved by keeping the indicator with the largest $F$ value in each cluster according to the clustering results shown in Table 5, column (1); the results of the selection are shown in Table 5, column (4), in which the indicators marked "Delete" are deleted and the indicators marked "Retain" are retained in the second round of indicator selection based on R-type clustering. After the second round of indicator selection, we delete 23 indicators and keep 20 indicators that can significantly identify the credit status of enterprises and do not contain redundant information.
4.4. Contrast with the 5C Model. The constructed micro enterprise credit indicator model is compared with the 5C factor analysis model; the results are shown in column (1) of Table 6. The legal representative's loan default records and four other evaluation indicators reflect the moral quality of the 5C elements; the cash recovery rate of all assets and 11 other evaluation indicators reflect the repayment ability of the 5C elements; the fixed rate of capital and 2 other evaluation indicators reflect the capital strength of the 5C elements; the collateral score reflects the secured collateral of the 5C elements; the industry sentiment indicator and the Engel coefficient reflect the operating environment conditions of the 5C elements.

4.5. The Validity Test of Credit Indicators and the Final Indicator System. After the pretreatment of the micro enterprises' credit indicators and two rounds of indicator selection, the paper constructs a credit indicator system of micro enterprises with the 20 credit indicators shown in column (b) of Table 6.

(1) In this paper, the micro enterprise credit indicator model is constructed through the double combination selection model based on SVM and R-type clustering, where internal financial factors, nonfinancial factors, and external macro environmental factors form the criterion layers and the cash recovery rate of all assets and 20 other credit indicators form the indicator layer.

(3) The results of the validity test of the credit indicators of micro enterprises based on the ROC curve show that all the credit indicators of the micro enterprise credit indicator model constructed in this paper pass the validity test, so all the indicators in the micro enterprise credit indicator model are valid.

Table 5: The second round of indicator selection based on R-type clustering.

Table 1: The standardization rules for qualitative indicators.

Criterion 3. $d_j < 0$, that is, $A_j < A$, indicating that the credit identification ability of the remaining indicators after deleting the $j$th indicator is weaker than the credit identification ability of all the indicators when the $j$th indicator is not deleted; the $j$th indicator can distinguish default enterprises from nondefault enterprises and is to be kept.

Table 2: Indicators and standardized values.

Table 3: The division of micro enterprise samples.

Table 4: The results of the first round of indicator selection based on SVM.