A Novel Approach for Reducing Attributes and Its Application to Small Enterprise Financing Ability Evaluation

Attribute reduction is viewed as a preprocessing step for reducing high dimensionality in data mining of complex systems. Many researchers have proposed approaches to reduce attributes or select key features in multicriteria decision making evaluation. In practice, the existing approaches for attribute reduction focus on improving classification accuracy or saving computational time, without considering the influence of the reduction results on the original data set. To help address this gap, we develop a novel attribute reduction approach that combines Pearson correlation analysis with F test significance discrimination to screen and identify the key characteristics related to the original data set. The proposed model has been verified using the financing ability evaluation data of 713 small enterprises of a city commercial bank in China. The experimental results show that the proposed reduction model is efficient and effective: the reserved attributes reflect 94.7% of the original information using 27.54% of the original attributes. Moreover, our experimental findings help to locate qualified partners and alleviate the difficulties faced by enterprises when applying for loans.


Introduction
With the coming of the era of big data, the size of data sets has been increasing sharply, making it difficult for decision makers and management to make decisions based on those data [1]. The most important task for decision makers is then to reduce the large number of attributes, or the high dimensionality, in data sets. Attribute reduction, also called indicator selection or feature screening, identifies a subset of attributes to reduce the dimensionality of the original data sets. By reducing attributes, one can select the attributes with the highest information content and save computational time and memory [2]. It also helps to improve classification accuracy by removing information chaos and irrelevant attributes [3]. In practice, attribute reduction has been applied in many fields, such as decision making, pattern recognition, and economic and social system evaluation [4][5][6][7].
The main attribute reduction approaches can be divided into three categories. One of the most famous methods for attribute reduction is based on rough set theory. The rough set approach proposed by Pawlak provides useful tools for reasoning from data [8]. It has an advantage over other approaches for attribute reduction, which typically use multivariate statistics that require specific parametric assumptions [9,10]. Degang et al. established a model to reduce the attributes of covering decision systems by building on traditional rough sets. Their empirical study indicated that the proposed attribute reduction approach achieved better classification performance than existing rough set methods [11]. In order to improve the classification accuracy for data containing hybrid-type attributes, such as discretized numerical attributes or categorical attributes, Hu et al. introduced a simple and efficient greedy algorithm for hybrid attribute reduction [12]. When decision or evaluation systems contain errors, missing data, or missing attributes in observation, neither DRSA (dominance-based rough set approach) [13] nor VC-DRSA (variable-consistency dominance-based rough set approach) [14] works appropriately. Inuiguchi et al. created a variable-precision dominance-based rough set approach (VP-DRSA) to deal with these problems [15]. Tsang et al. presented an attribute reduction model with covering rough sets based on the discernibility matrix to compute all attribute reducts [16]. Furthermore, Wang et al. developed a novel approach for constructing a simpler discernibility matrix with covering rough sets, which improved some characterizations of attribute reduction proposed by Wang et al. [17]. In addition, two of the most important attribute reduction models extend Pawlak's rough set: the neighborhood rough set (NRS) model [18] and the fuzzy rough set model [19]. They can handle continuous numeric data and fuzzy information granulation, and they allow some flexibility in determining which objects should be included in a rough set [20].
The second category consists of attribute reduction models based on statistical or econometric techniques. In order to obtain preference information of the decision maker in multiobjective search, Zitzler and Künzli defined an optimization goal in terms of a binary performance measure and used this measure directly to select key information [21]. Polat and Krmac screened the most important attributes using the pairwise Fisher score attribute reduction approach (PFSAR) and correlation-based attribute reduction [22]. Ju and Sohn developed a technology attribute reduction model that uses logistic regression based on exploratory factor analysis (EFA) of 16 technology-related attributes [23]. Elliott et al. developed a model based on a double hidden Markov model (DHMM) to extract information about the "true" credit qualities of firms [24]. Shi et al. created an indicator extraction model based on Pearson correlation analysis and logistic regression significance discrimination for customer classification. The proposed approach ensured that the reserved indicators can effectively distinguish default customers from nondefault customers [25].
In addition, there are other attribute reduction methods, such as the concept lattice model, heuristic algorithms, and the ant colony optimization algorithm. Some researchers developed new attribute reduction models by using the concept lattice classification theory [26][27][28]. Wei et al. discussed attribute reduction in information systems by establishing three equivalence relations on the attribute set and its power set [29]. In the overwhelming majority of data analysis and machine learning studies, existing attribute reduction work focused on improving classification accuracy and neglected the problem of how to decrease the test cost. Min et al. proposed a heuristic algorithm to handle this problem in attribute reduction [30]. Chi et al. created an indicator screening model based on correlation analysis and component analysis [31]. Minimal test cost attribute reduction is very important in cost-sensitive machine learning. However, in many cases these heuristic algorithms cannot find the optimal solution. In order to deal with this problem, Xu et al. established an ant colony optimization algorithm for attribute reduction. Experimental results on UCI data sets showed that the proposed method outperforms the information gain-based approach [32]. According to the principle of eliminating redundant information and the principle of maximum information content, Shi and Chi proposed an attribute reduction model combining cluster analysis and the coefficient of variation [33].
Because people are interested in the maximal rules implicated in attribute reduction, Li et al. developed two new kinds of attribute reduction approaches in the decision formal context based on maximal rules [34].
The existing findings offer important references for reducing attributes. However, there are still some limitations. First, in the evaluation of complex systems, the aim of attribute reduction is to eliminate the factors that do not have a significant effect on the comprehensive evaluation results. However, the existing attribute reduction approaches have not established a comprehensive index $y$ (i.e., the comprehensive score vector) that reflects the characteristics of all attributes. This means that the existing approaches have not established the relationship between the attributes and the comprehensive index $y$ (i.e., the comprehensive evaluation result). As a result, some reserved attributes do not have a significant effect on the comprehensive evaluation result. Second, most existing attribute reduction approaches judge performance by the standard of saving computational time. This standard does not analyze the information contribution degree of the reserved attributes to the mass-election attributes. Third, most existing studies verify the applicability of the proposed attribute reduction methods using numerical simulation rather than actual data.
To address these shortcomings, this study creates a novel attribute reduction model to screen the key influencing factors. We advance in three aspects. First, this paper establishes an attribute reduction approach by combining Pearson correlation analysis with F test significance discrimination. Pearson correlation analysis is used to calculate the correlation among attributes and delete similar attributes. F test significance discrimination is used to select the key attributes that have the greatest influence on the comprehensive index $y$. Second, we define an information contribution ratio to assess this attribute reduction approach from a statistical viewpoint. Third, the proposed attribute reduction approach has been verified using the financing ability evaluation data of 713 small enterprises of a city commercial bank in China. The empirical evidence shows that the selected attributes reflect 94.7% of the original information with 27.54% of the original attributes. Furthermore, this paper selects 19 key influencing factors for assessing the financing ability of small enterprises.
The remainder of this paper is organized as follows. Section 2 introduces the design and methodology of this study. Section 3 presents the data and empirical analysis of our attribute reduction model for 713 small enterprises. Section 4 concludes and highlights future research directions.

Design and Methodology of the Study
In this section, we introduce a novel attribute reduction model that combines Pearson correlation analysis with the F test significance discrimination approach. First, in order to eliminate the influence of differing attribute units and dimensions on attribute reduction, the original data are transformed into real numbers within the interval [0, 1]. Second, we use Pearson correlation analysis to delete highly correlated attributes from the whole mass-election attribute set, avoiding repeated information. Third, the F test significance discrimination approach is used to select the attributes with the highest information content, which ensures that the selected attributes have the greatest influence on small enterprise financing performance. The step-by-step procedure is as follows.
2.1. Standardization of Attribute Data. In our attribute reduction model, the first step is the standardization of attribute data, so that the subsequent calculations and parameters use the same standard. According to their features, the attributes can be divided into two types: quantitative attributes and qualitative attributes. The quantitative attributes include positive attributes, negative attributes, and interval attributes. Positive attributes are those for which the greater the value is, the better the small enterprise financing capacity is. Negative attributes are those for which the smaller the value is, the better the small enterprise financing capacity is. Interval attributes are those that are reasonable only when the original index data fall within a certain range.
The standardization equations of positive attributes, negative attributes, and interval attributes are represented by (1), (2), and (3), respectively [35], where $x_{ij}$ is the standardized score of the $i$th small enterprise on the $j$th attribute, $v_{ij}$ is the original data of the $i$th small enterprise on the $j$th attribute, $n$ is the number of small enterprises, $q_1$ is the left boundary of the ideal interval, and $q_2$ is the right boundary of the ideal interval. Qualitative attributes are attributes whose values are described by text rather than by a numerical value. The standard scores of qualitative attributes can be obtained by rational analysis and expert investigation.
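The three standardization rules can be sketched in a few lines of Python. Equations (1) to (3) are not legible in this copy, so the sketch assumes the usual min-max forms: positive attributes are scaled by their distance from the minimum, negative attributes by their distance from the maximum, and interval attributes by their distance from the ideal interval $[q_1, q_2]$. The function names and the sample values are illustrative only.

```python
def standardize_positive(values):
    """Min-max scaling: the largest raw value maps to 1, the smallest to 0."""
    lo, hi = min(values), max(values)
    # Assumes the attribute is not constant (hi > lo).
    return [(v - lo) / (hi - lo) for v in values]


def standardize_negative(values):
    """Reversed min-max scaling: the smallest raw value maps to 1."""
    lo, hi = min(values), max(values)
    return [(hi - v) / (hi - lo) for v in values]


def standardize_interval(values, q1, q2):
    """Score 1 inside [q1, q2]; the score decays linearly outside it."""
    lo, hi = min(values), max(values)
    # Largest distance any observation can have from the ideal interval.
    span = max(q1 - lo, hi - q2)
    scores = []
    for v in values:
        if v < q1:
            scores.append(1 - (q1 - v) / span)
        elif v > q2:
            scores.append(1 - (v - q2) / span)
        else:
            scores.append(1.0)
    return scores


# Hypothetical ages of legal persons, scored against the ideal range [31, 45].
ages = [25, 31, 40, 45, 55]
print(standardize_interval(ages, 31, 45))
```

All three functions return values in [0, 1], which is the property the later steps rely on.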

2.2. Pearson Correlation Coefficients. The Pearson product-moment correlation coefficient was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s [36]. It is a measure of the linear correlation (dependence) between two random variables. It is also called the PPMCC, PCC, or Pearson's $r$. Historically, it is the first formal measure of correlation, and it is still one of the most widely used measures of relationship.
The Pearson correlation coefficient of two attributes $x$ and $y$ is defined as the covariance of the two variables divided by the product of their standard deviations. It is commonly represented by the letter $r$ and can be equivalently defined by [37]
$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, \tag{4}$$
where $\bar{x} = (1/n)\sum_{i=1}^{n} x_i$ and $\bar{y} = (1/n)\sum_{i=1}^{n} y_i$ are the means of $x$ and $y$, respectively. Equation (4) is applied to calculate the correlation between two variables $x$ and $y$. The coefficient $r_{xy}$ ranges from −1 to 1 and is invariant to changes in the location and positive scale of either variable. A value of 1 indicates a total positive correlation between $x$ and $y$, a value of 0 implies no linear correlation between $x$ and $y$, and a value of −1 indicates a total negative correlation. Some authors have offered guidelines for the interpretation of the Pearson correlation coefficient [38][39][40][41]. If the Pearson correlation coefficient of two attributes is greater than 0.8 [40,41], we conclude that these attributes carry redundant information, and one of them should be removed. Conversely, if the Pearson correlation coefficient is smaller than 0.8, the attributes do not carry redundant information, and both should be kept.
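As an illustration of the redundancy rule above, the following sketch computes the coefficient in (4) and flags every pair of attributes whose correlation exceeds 0.8. It only reports the pairs; which attribute of a flagged pair is deleted is left to judgment, as in the paper. The attribute labels and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = sqrt(sum((b - mean_y) ** 2 for b in y))
    # Assumes neither sequence is constant, so sx and sy are nonzero.
    return cov / (sx * sy)


def redundant_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with r above the threshold.

    columns maps an attribute name to its list of standardized scores.
    """
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if r > threshold:
                flagged.append((a, b, r))
    return flagged


# Hypothetical standardized scores of three attributes for four enterprises.
data = {
    "x53": [0.10, 0.20, 0.30, 0.40],
    "x54": [0.12, 0.21, 0.33, 0.41],
    "x45": [0.90, 0.10, 0.50, 0.30],
}
for a, b, r in redundant_pairs(data):
    print(a, b, round(r, 3))
```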

2.3. Attribute Reduction Model.
In our attribute reduction model, the third step is to select the key attributes that have the greatest influence on the comprehensive index $y$ and to delete the uncorrelated attributes. In this part, we first calculate the attribute weights using the entropy weight approach. Then, we obtain the financing ability evaluation score (i.e., the comprehensive index $y$) for every small enterprise. Subsequently, the multiple determination coefficient $R^2_{m-1}$ between the comprehensive index $y$ and all $m-1$ attributes can be obtained, and the multiple determination coefficient $R^2_{m-2}$ between the comprehensive index $y$ and the remaining $m-2$ attributes after removing an attribute $x_j$ can be calculated. By using F test significance discrimination, the key attributes that have the greatest influence on small enterprise financing ability evaluation are selected. The reduction idea is that the bigger the difference $\Delta R^2$ between $R^2_{m-1}$ and $R^2_{m-2}$, the more significant the removed attribute is to the comprehensive evaluation results. This makes up for the shortcoming that existing attribute reduction approaches cannot reflect the influence of attributes on the comprehensive index $y$, because their reduction process has nothing to do with the comprehensive index $y$.

2.3.1. Weighting Attributes Utilizing Entropy Weight Method.
Let $f_{ij}$ denote the weight of the $j$th attribute in the $i$th small enterprise, let $x_{ij}$ denote the standard score of the $j$th attribute in the $i$th small enterprise, let $n$ denote the number of small enterprises, and let $m$ denote the number of attributes.
The subordinate degree function $f_{ij}$ of the attribute $x_{ij}$ is given by (5). Then, the entropy $e_j$ of the $j$th attribute can be calculated with (6), and the entropy weight $w_j$ of the $j$th attribute is given by (7) [42], where $\sum_{j=1}^{m} w_j = 1$.
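A minimal sketch of the entropy weight calculation follows. Equations (5) to (7) are not legible in this copy, so the code assumes the standard entropy weight method: $f_{ij}$ is the share of enterprise $i$ in the column total of attribute $j$, $e_j = -\frac{1}{\ln n}\sum_i f_{ij}\ln f_{ij}$, and $w_j$ is $1 - e_j$ normalized so that the weights sum to 1. The function name and the sample matrix are illustrative.

```python
from math import log


def entropy_weights(scores):
    """Entropy weights for a matrix of standardized scores.

    scores is a list of rows (one per enterprise); each row lists the
    standardized scores of the attributes. Returns one weight per attribute.
    """
    n = len(scores)
    m = len(scores[0])
    entropies = []
    for j in range(m):
        column = [row[j] for row in scores]
        total = sum(column)  # assumed to be positive
        e = 0.0
        for x in column:
            f = x / total      # share f_ij of enterprise i on attribute j
            if f > 0:          # convention: 0 * ln(0) is treated as 0
                e -= f * log(f)
        entropies.append(e / log(n))
    # Attributes whose scores vary more across enterprises get more weight.
    diversity = [1 - e for e in entropies]
    total_diversity = sum(diversity)
    return [d / total_diversity for d in diversity]


# Hypothetical scores: the first attribute is identical for all enterprises.
sample = [[0.5, 0.1], [0.5, 0.9], [0.5, 0.2]]
print(entropy_weights(sample))
```

In the sample, the first attribute does not discriminate between enterprises, so nearly all of the weight goes to the second attribute.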

2.3.2. Reducing Attributes Based on F Test Significance Discrimination.
After eliminating redundant information in Section 2.2, this section selects the key attributes that have the greatest influence on the comprehensive index $y$ using the F test significance discrimination approach. We now outline the steps to build an attribute reduction model based on F test significance discrimination.
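Step 1. Calculate the comprehensive index $y$. Let $y_i$ denote the comprehensive index, or comprehensive score, of the financing ability evaluation of the $i$th small enterprise. We have
$$y_i = \sum_{j=1}^{m} w_j x_{ij}. \tag{8}$$
The meanings of the variables in (8) are the same as in (1) and (7).

Step 2. Rank the attributes in descending order of the absolute value of their Pearson correlation coefficient with the comprehensive index $y$, and denote the ranked attributes by $x^*_1, x^*_2, \ldots, x^*_m$.

Step 3. Calculate the multiple determination coefficient $R^2_{m-1}$. The comprehensive index $y$ is regressed on the $m-1$ attributes $x^*_2, x^*_3, \ldots, x^*_m$ in (9). In (9), the estimated values of the parameters $\beta_0, \beta_2, \beta_3, \ldots, \beta_m$ can be obtained using least squares estimation. Furthermore, the estimated value vector $\hat{y}$ of the comprehensive index $y$ can be calculated. Then, the multiple determination coefficient $R^2_{m-1}$ is given by (10) [43], where $\bar{y} = (1/n)\sum_{i=1}^{n} y_i$ and $n$ denotes the number of small enterprises.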
It should be pointed out that the attribute $x^*_1$ should be reserved in attribute reduction, because it has the strongest correlation with the comprehensive evaluation results. This also indicates that the attribute $x^*_1$ has the biggest impact on small enterprise financing ability evaluation.
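The comprehensive score and the ranking that singles out $x^*_1$ can be sketched as follows. The sketch assumes the weighted-sum form of the comprehensive score and ranks attributes by the absolute value of their Pearson correlation with it, as described in the empirical section. Function names, weights, and scores are illustrative.

```python
from math import sqrt


def comprehensive_scores(scores, weights):
    """Weighted sum of standardized scores for each enterprise (row)."""
    return [sum(w * x for w, x in zip(weights, row)) for row in scores]


def pearson(x, y):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)


def rank_by_correlation(scores, y):
    """Attribute indices sorted by |r| with y, strongest first."""
    m = len(scores[0])
    strength = []
    for j in range(m):
        column = [row[j] for row in scores]
        strength.append((abs(pearson(column, y)), j))
    return [j for _, j in sorted(strength, reverse=True)]


# Hypothetical standardized scores (4 enterprises, 3 attributes) and weights.
scores = [
    [0.1, 0.9, 0.3],
    [0.4, 0.6, 0.9],
    [0.7, 0.5, 0.2],
    [1.0, 0.1, 0.6],
]
weights = [0.6, 0.3, 0.1]
y = comprehensive_scores(scores, weights)
order = rank_by_correlation(scores, y)
print(order)  # the first index plays the role of x*_1 and is always reserved
```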
Step 4. Calculate the multiple determination coefficient $R^2_{m-2}$. After the attribute $x^*_2$ is removed, the comprehensive index $y$ is regressed on the remaining $m-2$ attributes $x^*_3, x^*_4, \ldots, x^*_m$ in (11). In the same way, we can calculate the estimated value vector $\hat{y}$ of the comprehensive index $y$ for (11), and the multiple determination coefficient $R^2_{m-2}$ is given by (12).

Step 5. Calculate $\Delta R^2$. Let $\Delta R^2$ denote the difference between the multiple determination coefficient $R^2_{m-1}$ and the multiple determination coefficient $R^2_{m-2}$; namely,
$$\Delta R^2 = R^2_{m-1} - R^2_{m-2}. \tag{13}$$
In (13), the difference $\Delta R^2$ reflects the influence of the attribute $x^*_2$ on the comprehensive index $y$. If $\Delta R^2$ differs significantly from zero, the attribute $x^*_2$ affects the comprehensive evaluation result $y$ significantly, and therefore $x^*_2$ should be reserved. On the contrary, if $\Delta R^2$ does not differ significantly from zero, the attribute $x^*_2$ does not have a significant effect on the comprehensive evaluation result $y$, and $x^*_2$ should be deleted.
Step 6. Reduce attributes using F test significance discrimination.
Hypothesis $H_0$: $\Delta R^2 \neq 0$; $H_1$: $\Delta R^2 = 0$. Let $F_j$ denote the F test value of the $j$th attribute $x_j$, which is given by (14) [44]. The meaning of (14) can be understood from the following three aspects. Firstly, the bigger the multiple determination coefficient $R^2_{m-1}$ is, the smaller the deviation between the estimated value $\hat{y}$ and the actual comprehensive index $y$ would be. The smaller the multiple determination coefficient $R^2_{m-2}$ is, the bigger the deviation between $\hat{y}$ and $y$ after removing the attribute $x^*_2$ would be. That is to say, when we remove the attribute $x^*_2$, the explanation ability of the $m-2$ attributes $x^*_3, x^*_4, \ldots, x^*_m$ to the comprehensive evaluation score $y$ decreases significantly. This indicates that the attribute $x^*_2$ has a significant effect on the comprehensive evaluation result $y$ of small enterprises; thus $x^*_2$ should be reserved. Secondly, the bigger the difference $\Delta R^2$ between $R^2_{m-1}$ and $R^2_{m-2}$ is, the bigger the gap between the explanation ability of the $m-1$ attributes and that of the $m-2$ attributes with respect to the comprehensive evaluation score $y$, which again means that $x^*_2$ should not be deleted. Thirdly, the bigger the difference $\Delta R^2$ (i.e., the bigger the value $R^2_{m-1} - R^2_{m-2}$) is, the bigger the F test value $F_j$ would be. In this situation, the F test is passed easily, which indicates that the attribute affects the comprehensive evaluation result $y$ significantly.
Under the hypothesis $H_0$, $F_j$ follows the F distribution; that is, $F_j \sim F(1, n-(m-2))$. Let the significance level $\alpha$ be equal to 0.05 [45]. Once the first attribute is found whose F test value does not exceed the critical value at this level, the attribute reduction can be stopped. This suggests that the remaining attributes do not have a significant influence on the comprehensive evaluation result $y$.
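The F test step can be sketched in plain Python. Equation (14) is not legible in this copy, so the sketch assumes the standard partial F statistic, $F = \Delta R^2 / \big((1 - R^2_{\text{full}})/df\big)$, with $df = n - p - 1$ for a full model with $p$ regressors and an intercept; if the paper's degrees of freedom differ, `df_resid` should be adjusted accordingly. The critical value is passed in by the caller, as it would be when read from an F table. All names and data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]  # assumes a nonsingular system
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(columns, y):
    """R^2 of a least squares regression of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)]
           for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(p)]
    beta = solve(xtx, xty)  # normal equations
    fitted = [sum(bk * v for bk, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def partial_f(columns, y, drop):
    """Partial F statistic for removing columns[drop] from the regression."""
    r2_full = r_squared(columns, y)
    r2_reduced = r_squared(columns[:drop] + columns[drop + 1:], y)
    df_resid = len(y) - len(columns) - 1
    return (r2_full - r2_reduced) / ((1 - r2_full) / df_resid)


def keep_attribute(f_value, f_critical):
    """Reserve the attribute when its F value reaches the critical value."""
    return f_value >= f_critical


# Hypothetical data: y depends strongly on the first column only.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.2, -0.05, -0.1]
y = [2 * a + e for a, e in zip(x1, noise)]
f_critical = 6.61  # tabulated F value at the 0.05 level for (1, 5) degrees of freedom
for j in range(2):
    f_j = partial_f([x1, x2], y, j)
    print(j, keep_attribute(f_j, f_critical))
```

Solving the normal equations directly is adequate for a small illustration; for many attributes a numerically stabler least squares routine would be preferable.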

2.4. The Judgment of Reasonability of the Proposed Attribute Reduction Approach. Since the multiple determination coefficient $R^2$ describes the explanation ability of the independent variables on the dependent variable, this paper uses an information contribution ratio to assess the performance of the attribute reduction model. The information contribution ratio is defined as the ratio of the explanation ability $R^2_{\text{Reserved}}$ of the reserved attributes to the explanation ability $R^2_{\text{Mass-election}}$ of the mass-election attributes, both with respect to the comprehensive evaluation score $y$.
Let $In$ denote the information contribution ratio of the reserved attributes to the mass-election attributes, let $R^2_{\text{Reserved}}$ denote the multiple determination coefficient of the reserved attributes to the comprehensive evaluation score $y$, and let $R^2_{\text{Mass-election}}$ denote the multiple determination coefficient of the mass-election attributes to the comprehensive evaluation score $y$. The information contribution ratio $In$ is given by
$$In = \frac{R^2_{\text{Reserved}}}{R^2_{\text{Mass-election}}}. \tag{15}$$
Equation (15) is applied to judge the reasonability of the proposed attribute reduction model. The numerator reflects the explanation ability of the reserved attributes to the comprehensive evaluation score $y$, and the denominator reflects the explanation ability of the mass-election attributes; their ratio reveals the information contribution degree of the reserved attributes to the mass-election attributes.
As a decision criterion, the proposed attribute reduction approach is considered reasonable if the reserved attributes contribute more than 90% of the information of the mass-election attributes while using less than 30% of the attributes in the mass-election attribute set.
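The ratio and the decision criterion amount to a few lines of arithmetic. The sketch below uses hypothetical function names; the example call plugs in the figures reported in the empirical section of the paper (determination coefficients 0.947 and 1.000, 19 reserved attributes out of 69).

```python
def information_contribution(r2_reserved, r2_mass_election):
    """Ratio of the explanation ability of reserved to mass-election attributes."""
    return r2_reserved / r2_mass_election


def is_reasonable(r2_reserved, r2_mass_election, n_reserved, n_total,
                  min_ratio=0.90, max_share=0.30):
    """Criterion: more than 90% of the information with under 30% of attributes."""
    ratio = information_contribution(r2_reserved, r2_mass_election)
    share = n_reserved / n_total
    return ratio > min_ratio and share < max_share


ratio = information_contribution(0.947, 1.000)
share = 19 / 69
print(f"In = {ratio:.1%}, attribute share = {share:.2%}")
print(is_reasonable(0.947, 1.000, 19, 69))
```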

3.1. Sample Selection and Data Sources.
To verify the applicability of the proposed attribute reduction model, this subsection carries out an empirical study based on the financing ability data of 713 small enterprises. In order to guarantee the representativeness of the empirical results, this paper collected the data from the headquarters and all of the branches of a city commercial bank in China, including the Beijing, Tianjin, Shanghai, Chongqing, Shenyang, Dalian, and Dandong branches. The data are shown in Columns 5 to 717 of Table 1 [46].
The mass-election attribute set for small enterprise financing ability evaluation contains six criterion layers, as shown in Column 2 of Table 1. The data standardization for quantitative attributes is as follows. According to the attribute type in Column 4 of Table 1, substituting the original data $v_{ij}$ of positive attributes from Columns 5 to 717 of Table 1 into (1), the original data of negative attributes into (2), and the original data of interval attributes into (3) yields the standardized data $x_{ij}$ of the attributes. The results are shown in Columns 718 to 1430 of Table 1.
Subsequently, we compute the standardized scores of the qualitative attributes. Drawing on a commercial bank's scoring standard for nonfinancial attributes [46], the scoring standard of qualitative attributes can be obtained by rational analysis, as shown in Table 2. Then, the standardized scores of qualitative attributes are obtained according to the attribute type in Column 4 of Table 1, as shown in Columns 718 to 1430 of Table 1.

3.3. Attribute Reduction Utilizing Pearson Correlation Analysis.
In practice, attributes whose values are correlated but whose meanings are independent might be mistakenly deleted. To limit this risk, this paper calculates the Pearson correlation coefficients only among attributes in the same criterion layer. To explain the process of Pearson correlation analysis, we take the 10 attributes of the fourth criterion layer "$X_4$ operation ability" as an example.
After substituting the data from Rows 45 to 54 and Columns 718 to 1430 of Table 1 into (4), the correlation coefficients can be obtained for any two attributes, as shown in Table 3. As Table 3 shows, the correlation coefficient 0.998 between "$x_{53}$ accounts payable turnover speed" and "$x_{54}$ cash cycle" is greater than the threshold value 0.8, which means that the two attributes carry highly repetitive information. Because there are other attributes representing cash flow in the attribute set, such as "$x_{17}$ the main business income cash ratio" and "$x_{20}$ all assets cash recovery rate," we delete the attribute "$x_{54}$ cash cycle." Similarly, we can obtain the attributes'  1.
After reduction by Pearson correlation analysis, 53 attributes remain, and their standardized data are listed in Columns 3 to 55 of Table 4.

3.4. Attribute Reduction Using F Test Significance Discrimination.
Taking the data of Table 4 into (5) to (7), the entropy weights of the 53 attributes can be obtained: $w = (0.0178, 0.0102, 0.0030, \ldots, 0.0003)$. Substituting the data of the first row in Table 4 and the entropy weights of the 53 attributes into (8), the comprehensive score $y_1 = 0.155$ of enterprise 1 can be calculated. Similarly, we can calculate the comprehensive scores $y_i$ of the remaining 712 enterprises, as shown in the last column of Table 4.
After taking the data from Table 4 into (4), 53 Pearson correlation coefficients between the 53 attributes $x_j$ and the comprehensive score $y$ can be obtained. The attributes are ranked in descending order of the absolute value $|r_j|$ of the correlation coefficient, and the ranking results are listed in Table 5. The attribute $x_{61}$ has the maximum correlation coefficient with the comprehensive score $y$; therefore the attribute $x_{61}$ should be reserved.

The selected 19 attributes are marked with "reserve" in the last column of Table 5, and the 34 deleted attributes are marked with "delete by F test significance discrimination" in the same column. The selected 19 key influencing factors of small enterprise financing ability are shown in Table 6, and the detailed attribute reduction process of financing ability evaluation for the 713 small enterprises is shown in Table 7.

3.5. The Reasonability Judgment for the Proposed Attribute Reduction Model. Taking the data of the 19 reserved attributes in Table 4 and the comprehensive evaluation score $y$ into (9) and (10), the multiple determination coefficient $R^2_{\text{Reserved}} = 0.947$ is obtained. Taking the data of the 69 original attributes in Table 1 and the comprehensive evaluation score $y$ into (9) and (10), the multiple determination coefficient $R^2_{\text{Mass-election}} = 1.000$ is obtained. Thus $In = R^2_{\text{Reserved}} / R^2_{\text{Mass-election}} = 0.947/1.000 = 94.7\%$. This shows that, with the proposed attribute reduction model, the selected attributes reflect 94.7% of the original information using 27.54% of the attributes (19/69 = 27.54%). Both thresholds of the decision criterion (more than 90% of the information with less than 30% of the attributes) are met, so the experimental results show that the proposed reduction model is efficient and effective.
3.6. Some Notes about the Proposed Model. In Section 2.3.1, this paper takes the entropy weight method as an example to illustrate the feasibility and rationality of the proposed attribute reduction idea. In fact, the weighting method can be replaced according to the needs of decision makers, who can select other weighting methods, such as AHP, G1, G2, and interval number weight approaches [47].
In Section 2.3.2, the paper takes the linear regression model as an example to explain the feasibility of the proposed model. In practice, decision makers can also select nonlinear regression models [48].

Conclusions and Future Work
In order to reduce high dimensionality in complex data sets, we create an attribute reduction approach based on Pearson correlation analysis and F test significance discrimination. First, we delete redundant attributes using the Pearson correlation coefficient, avoiding information chaos in the original attribute data sets. Second, the attribute reduction methodology based on F test significance discrimination finds the key attributes that have the greatest influence on the evaluation results. Third, the paper defines an information contribution ratio to assess the performance of the attribute reduction model from a statistical viewpoint.
The proposed attribute reduction model has been verified using the financing ability evaluation data of 713 small enterprises of a city commercial bank in China. The empirical evidence shows the accuracy and applicability of the proposed model. Moreover, we establish an evaluation indicator system for small enterprise financing ability. It will help the downstream organizations of the supply chain to choose more qualified partners and alleviate the difficulties faced by enterprises when applying for loans. Further applications of the proposed model to real-world data are expected in the future.
It is well known that attribute reduction problems are ubiquitous in data mining activities. The empirical study in this paper is only one example used to verify the accuracy of the proposed model. A topic of future research is the application of the proposed approach to data sets in other attribute reduction areas. Researchers can easily conduct attribute reduction through cases and empirical studies.
Regarding the second aspect of (14): the bigger $\Delta R^2$ is, the bigger the difference between the explanation ability $R^2_{m-1}$ of the $m-1$ attributes $x^*_2, x^*_3, \ldots, x^*_m$ to the comprehensive evaluation score $y$ and the explanation ability $R^2_{m-2}$ of the $m-2$ attributes $x^*_3, x^*_4, \ldots, x^*_m$ to the comprehensive evaluation score $y$ would be. It means that the attribute $x^*_2$ affects the comprehensive evaluation result $y$ of small enterprises significantly, and the attribute $x^*_2$ should not be deleted.
With $\alpha = 0.05$, the critical value $F_\alpha$ can be looked up in the F distribution table. If $F_j \geq F_\alpha$, accept hypothesis $H_0$: $\Delta R^2 \neq 0$. It means that $\Delta R^2$ differs significantly from zero, and the attribute $x^*_2$ should be reserved. Conversely, if $F_j < F_\alpha$, reject hypothesis $H_0$: $\Delta R^2 \neq 0$, which indicates that $\Delta R^2$ does not differ significantly from zero, and the attribute $x^*_2$ should be deleted.

Step 7. Repeat Step 3 to Step 6, and select other attributes. For the remaining $m-2$ attributes $x^*_3, x^*_4, \ldots, x^*_m$, we can reduce attributes by repeating Step 3 to Step 6 until the first attribute $x^*_k$ is found whose F test value does not exceed the critical value, that is, $F^*_k(x^*_k) \leq F_{0.05}$.

The six criterion layers of the mass-election attribute set are $X_1$, enterprise basic situation; $X_2$, debt paying ability; $X_3$, enterprise profitability; $X_4$, operation ability; $X_5$, development potential; and $X_6$, enterprise external macroconditions, as shown in Column 2 of Table 1. All 69 attributes are listed in Column 3 of Table 1. As shown in Column 4 of Table 1, there are 46 positive attributes, 7 negative attributes, 2 interval attributes, and 14 qualitative attributes.

3.2. The Attribute Data Standardization. In this paper, we have two interval attributes: "$x_4$ the age of enterprise legal person" and "$x_{67}$ consumer price index (CPI)." The ideal range of "$x_4$ the age of enterprise legal person" is [31, 45] [25]. It means that if the age of the business owner is within the interval [31, 45], the repayment ability and repayment willingness of the small enterprise are strong. The ideal range of "$x_{67}$ consumer price index (CPI)" is [101, 105] [25], which indicates that there is neither deflation nor inflation when the CPI is within this range.

Table 2: The scoring criteria of the qualitative attributes.

Table 3: The correlation coefficients between attributes for "$X_4$ operation ability".

Table 5: Attribute reduction process based on F test significance discrimination.

Table 6: The key influencing factors of small enterprise financing ability.

Table 7: The attribute reduction process of financing ability evaluation for 713 small enterprises.