Default Risk Prediction of Enterprises Based on Convolutional Neural Network in the Age of Big Data: Analysis from the Viewpoint of Different Balance Ratios



Introduction
Default risk prediction means the prediction of the repayment ability of enterprise loans. Default risk prediction can effectively avoid default risk and reduce the loss of debtors and investors. In the age of big data, deep learning algorithms, such as convolutional neural networks, are widely applied to various fields. In the area of default risk prediction, there are two main problems: one is the balance ratio of the dataset, and the other is the selection of the best feature combination. To address these two problems, this study proposes a comprehensive metric model based on multimachine learning algorithms to find the best balance ratio and optimal feature combination and to improve the performance of the deep learning model in the default risk prediction.

1.1. The Best Balance Ratio of the Imbalanced Dataset.
Because nondefault samples far outnumber default samples, the problem of data imbalance is prevalent. The performance of traditional machine learning methods on imbalanced datasets is poor.
There are many studies on rebalancing methods for imbalanced datasets, but few studies on the optimal balance ratio. Different balance ratios greatly influence the prediction result of the machine learning model. For a deep learning model, training takes a long time, so searching for the optimal balance ratio directly would be very time-consuming. Therefore, finding the best balance ratio is a significant problem.

The Best Feature Combination and Selection Order.
Different feature combinations can make the default risk prediction quite different. Feature selection can improve the model performance and save computational costs. Therefore, feature selection is an essential problem in the task of default risk prediction. At the same time, datasets with the same balance ratio can have different optimal feature combinations depending on whether rebalancing or feature selection is performed first. Therefore, the order of feature selection and data rebalancing is also a problem worth discussing. The differences between our study and existing studies are mainly reflected in the following two aspects.
Our study on balance ratio is most related to Hou et al. [1]. Hou et al. applied synthetic minority oversampling and minority sample weighting to address the imbalance problem [1]. In this study, from the perspective of searching for the best balance ratio, the score ranking of the comprehensive metric model based on multiple machine learning algorithms, computed on training sets with different balance ratios, is used to infer the optimal balance ratio and improve the performance of the classifier.
Our study on feature selection is most related to Song et al., but the research of Song et al. on feature selection does not consider the influence of data balance processing [2]. In this study, the feature combination selection is conducted both before and after the data balancing. Then the optimal feature combination is obtained through the comparative analysis of multiple groups of experiments.
In recent decades, researchers have introduced many resampling strategies and imbalance-proof algorithms to address the data imbalance problem in default risk prediction. Unfortunately, the efficiency of the abovementioned methods has almost reached its limit, which means that we have to find a new facet to push the limit. In this study, the optimal balance ratio is found through the classification accuracy, which addresses a deficiency of existing research, where samples are either left imbalanced or balanced to exactly 1 : 1, and ensures the accuracy of the classification model. A comprehensive metric model based on machine learning algorithms is proposed, which can simultaneously find the best balance ratio and perform the optimal feature selection. Additionally, our framework is flexible and universal across different resampling methods and machine learning algorithms and obtains a more satisfactory result than the original algorithms. In conclusion, a framework to explore the best balance ratio and feature combination is promising and meaningful. This makes the strategy for the data imbalance problem more systematic.
The rest of this paper is organized as follows. Section 2 is the literature review. Section 3 presents the problem statement and solution idea. Section 4 introduces the methodology. Section 5 shows the experimental analysis. Section 6 outlines the conclusion.

Research on Data Imbalance.
The solutions to data imbalance can be divided into the algorithm level, the data level, and the combination of the algorithm level and the data level.
At the algorithm level, researchers improve the performance of models by making the algorithm focus on the minority class. Huang et al. evaluated the ability of neural networks to deal with imbalance problems in the field of default risk prediction. The experimental results show that the proposed methods can obtain consistently high TP prediction rates for each class [3]. Huang et al. used three strategies to construct the hybrid SVM-based credit scoring models. Compared with other tree classifiers, the SVM classifier achieved an identical classificatory accuracy [4]. I. Mues and C. Mues used resampling to process imbalanced credit data and applied LR, DT, and other models to predict default. This empirical study indicates that the random forest and gradient boosting classifiers perform very well in credit scoring [5]. Kvamme et al. used convolutional neural networks to predict mortgage defaults. Experimental results show that the threshold is essential for accuracy [6]. Yu et al. proposed an integrated model based on the deep belief network (DBN) and SVM. The experimental results indicate that the proposed model can be a promising tool for credit risk classification with imbalanced data [7]. Ma et al. applied LightGBM and XGBoost to deal with unbalanced data to predict the default risk of loans [8]. Li et al. explored the application of transfer learning in the financial field. Their study highlights the commercial value of the transfer learning concept and provides practitioners and management personnel with a decision basis [9]. M. Lardy and J. P. Lardy offered a simple, global, and transparent CDS structural approximation for the energy industry based on random forest [10]. Zhao et al. constructed a DMDP system that can capture a company's historical performance and use long short-term memory to dynamically consider news from social media and public opinion [11]. Fu et al. utilized BiLSTM to predict the default risk of the platform based on the extracted keywords of investor comments. Experimental results show that the proposed model can better capture semantic features and achieve significant improvement [12].
At the data level, resampling methods are widely used, including undersampling, oversampling, and hybrid sampling. Among them, the oversampling technique has recently become a research hotspot. The synthetic minority oversampling technique (SMOTE), proposed by Chawla et al., is the bedrock of oversampling methods [13]. Hernandez et al. presented an empirical study about using oversampling and undersampling methods to improve the accuracy of instance selection methods on imbalanced databases. The experimental results show that using oversampling and undersampling methods significantly improves the accuracy for the minority class [14]. I. Lai-Yuen and S. K. Lai-Yuen presented a novel oversampling approach called A-SUWO which can avoid sample overlapping. The experimental results indicate that the proposed method performs significantly better than other sampling methods [15]. Liang ... The experiments show that the proposed model improves the AUC of seven known and popular DES algorithms [19].
A common difference between our study and the abovementioned literature is that different balance ratios are considered in this study.

Research on Feature Selection.
Feature selection includes single feature selection and feature combination selection. Single feature selection only focuses on the function of one feature but ignores the underlying relationship among different features, while feature combination selection considers the influence of multiple features on the results simultaneously.
Research on single feature selection. Chen and Li proposed four approaches combined with the support vector machine classifier for feature selection that retain enough information for classification purposes. The results suggest that the hybrid credit scoring approach is robust and effective in finding optimal subsets [20]. S. Oreski and G. Oreski proposed a feature selection method based on genetic algorithms. The preset threshold eliminates the unqualified features in terms of FDAF-score. Experimental results show that the proposed classifier is promising for feature selection and classification [21]. Song et al. calculated the FDAF-score of each feature to evaluate the feature's contribution to classification. Experiments demonstrate that the proposed FDAF-score algorithm can obtain good results and deal with classification problems with noise [2]. Sheikhpour et al. studied the semisupervised feature selection method to reduce the cost of dataset annotation and simplify data collection. Based on the literature review, it was observed that most of the semisupervised feature selection methods had been presented for classification problems [22]. Urbanowicz et al. placed Relief-based algorithms in the context of other feature selection methods and provided an in-depth introduction to the Relief-algorithm concept [23]. Hu et al. conducted feature selection by calculating the correlation between features and classification results. The classification results show that the proposed model performs better than the other five methods [24]. Kozodoi et al. proposed a profit-driven multiobjective feature selection method. Experiments demonstrate that the proposed approach can yield a higher expected profit using fewer features [25].
The above are single feature selection methods; many scholars have also contributed significantly to feature combination selection. Luo et al. presented an adaptive unsupervised feature selection method that can generate a picture of the original feature space and output a reliable feature combination [26]. Zhang et al. applied multiobjective particle swarm optimization to a multiclass dataset, verifying that the proposed algorithm is a helpful approach to feature selection for multilabel classification problems [27]. Mafarja et al. presented a wrapper feature selection method based on the binary dragonfly algorithm. The results show its ability to search for the most informative features for classification tasks [28]. Gu et al. presented a competitive swarm optimizer that deals with high-dimensional feature selection. Experiments demonstrate that the proposed feature selection algorithm can select a much smaller number of features with better classification performance [29]. M. Mirjalili and S. Mirjalili proposed a wrapper feature selection method modified by the whale algorithm, which proved efficient in searching for the optimal feature subsets [30]. Abualigah et al. used the particle swarm optimization algorithm to perform feature selection, which can eliminate the uninformative features of the text [31]. Sayed et al. proposed a chaotic crow search algorithm to improve the convergence rate and find the optimal feature combination [32]. Zhang et al. proposed a two-archive multiobjective artificial bee colony algorithm that proved to be an efficient and robust optimization method for solving cost-sensitive feature selection problems [33]. Ghosh et al. proposed a wrapper-filter combination of ant colony optimization for feature selection. The experimental results clearly show that their method outperforms most state-of-the-art algorithms used for feature selection [34]. Zhang et al. proposed a self-learning and multiobjective algorithm to balance local and global searching [35].
The difference between the research on feature selection in this study and the abovementioned studies is that our study first applies the feature combination selection method to datasets with different balance ratios and then selects the optimal feature combination from the resulting feature combinations. In addition, this paper studies the selection order.

Research on Default Risk Prediction Methods.
There is extensive literature on various approaches to credit scoring, default risk prediction, and fraud detection. Credit risk prediction indicates the level of the risk in investing with the company. It represents the likelihood that the company pays its financial obligations on time.
Credit risk prediction models can be divided into qualitative and quantitative analysis, machine learning models, and deep learning models. Standard and Poor's, Moody's, and Fitch use a twofold analysis (qualitative and quantitative) to assign a credit score to a company. Qualitative analysis is based on different factors such as company strategy and economic market outlook, while quantitative analysis is based only on financial statements. However, how these analyses lead to the final credit score is still unclear [36].
Many machine learning methods have been extensively employed to forecast corporate default risk. Linear default risk prediction models determine which category a new individual sample belongs to by summarizing a classification rule from a large number of samples. The representative models include linear discriminant analysis [37], multivariate discriminant analysis [38], logistic regression [39], and probit models [40], among others. Although the linear models have the advantages of simplicity and ease of use, they cannot effectively deal with the nonlinear relationship between variables. Thus, there is a significant difference between the classification results and the actual default status. With the advent of the big data era, artificial intelligence-based default risk prediction methods are increasingly advantageous. Machine learning models involve support vector machines [41], decision trees [42], k-nearest neighbors [43], BP neural networks [44], and so on. Compared with the linear models, which require strict assumptions and are sensitive to noisy data, the artificial intelligence models effectively overcome linear models' limitations with strong robustness and unstructured characteristics.
In addition to standard neural networks, some researchers focus on deep learning models in many financial fields. Compared with an artificial neural network (ANN), a deep neural network (DNN) is a model with more than one hidden layer between the input and output layers [45]. Deep learning models such as Convolutional Neural Networks (CNN) [46], Generative Adversarial Networks (GAN) [47], and Recurrent Neural Networks (RNN) [48] have been proven to significantly improve the accuracy of classification in various financial problems [49]. Previous applications of CNN include image processing and sequence and time-series analysis [50,51], while in financial problems, they are mainly used in stock market analysis [52]. Tsantekidis et al. used CNN to predict mid-price movements of the limit order book; their empirical evidence reveals that CNN can obtain more accurate results than the multilayer perceptron model [53]. Chung and Shin applied one of the representative deep learning techniques, multichannel convolutional neural networks, to predict the fluctuation of the stock index. The experimental results show that CNN outperforms the comparative models, which demonstrates the effectiveness of CNN [54]. Chen et al. proposed a novel method for stock trend prediction using a graph convolutional feature based convolutional neural network model, in which both stock market information and individual stock information are considered [55]. Some studies have tried applying CNN to the default risk prediction field. Kvamme et al. predicted mortgage default by applying CNN to consumer transaction data [6]. Carrasco and Sicilia-Urbán used DNN to measure their ability to detect false positives by processing alerts triggered by a fraud detection system [56]. J. I. Z. Lai and K. L. Lai proposed a deep convolutional neural network scheme for financial fraud detection. Over a time duration of 45 s, a detection accuracy of 99% was obtained using the proposed model, as observed in the experimental results [57].
In our study, CNN is used to predict the default risk of the enterprises, and the experimental results show that CNN outperforms KNN, DT, SVM, and LR. As a deep learning model, CNN can accurately predict enterprises' credit risk.

The Determination of the Best Balance Ratio of Data.
In order to solve the problem that the default samples are far fewer than the nondefault samples, the dataset is often rebalanced before constructing the default risk prediction model. As an excellent sampling method, the synthetic minority oversampling technique (SMOTE) is widely used to process imbalanced datasets. In this study, the SMOTE algorithm is used to balance the dataset. In the oversampling process, the difference in balance ratios between default and nondefault samples will affect the model performance.
In this study, the approach to solve the problem is to use default and nondefault samples with different balance ratios to establish a prediction model. The best balance ratio is obtained through the comprehensive ranking of multiple metrics of the prediction model. The approach to constructing the balance ratio is as follows. Assume that A represents the number of majority (nondefault) samples in the training set and B represents the number of minority (default) samples. The ratio of the majority class to the minority class in the training set is then A : B.
The SMOTE algorithm is used to perform the oversampling operation on default samples. It is assumed that every round of sampling is performed on all minority samples, i.e., repeated B times. The maximum sampling round $N_{\max}$, obtained by rounding up, is the number of times the SMOTE algorithm must be applied to the minority samples to balance the dataset to 1 : 1. Likewise, the sampling round $N_{\alpha}$ is the number of times the SMOTE algorithm must be applied to balance the dataset to α : 1, where α is the balance ratio and is set to 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. The datasets with different balance ratios are then fed into a support vector machine (SVM), logistic regression (LR), a decision tree (DT), and k-nearest neighbor (KNN). The optimal balance ratio is obtained by comparing the comprehensive ranking of multiple metrics of the four models across the different balance ratios.
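The round counting can be illustrated with a minimal sketch. It rests on an assumed reading of the text, not on the authors' exact formulas: each round adds exactly B synthetic minority samples, and a balance ratio of α : 1 means A majority samples to A/α minority samples. The function name `smote_rounds` and the sample counts are hypothetical.

```python
import math

def smote_rounds(a: int, b: int, alpha: float = 1.0) -> int:
    """Rounds of SMOTE (each adding b synthetic samples) needed so that
    majority : minority is at most alpha : 1. Assumed reading of the text."""
    target_minority = a / alpha              # minority size that gives alpha : 1
    missing = max(0.0, target_minority - b)  # synthetic samples still needed
    return math.ceil(missing / b)            # round up to whole rounds

# Hypothetical counts: 1000 nondefault and 100 default samples.
rounds = {alpha: smote_rounds(1000, 100, alpha) for alpha in range(1, 11)}
```

With these counts the original ratio is already 10 : 1, so α = 10 needs no oversampling, while smaller α values need progressively more rounds.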

The Determination of the Best Feature Combination and the Selection Order.
In the task of enterprise default risk prediction, enterprise data often have many features, which creates the problem of feature redundancy. Selecting the best feature combination can optimize the performance of machine learning models and reduce the calculation cost. Different feature combinations significantly influence the results, and there are always several feature combinations that give the models higher prediction accuracy. Feature selection can be divided into single feature selection and feature combination selection. Feature combination selection refers to the selection of multiple different feature combinations simultaneously. In this study, the feature selection technique based on the genetic algorithm is a feature combination selection method. In the feature selection process, we often face the problem of the order of feature selection and rebalancing. The selection order is also studied in this work.

Synthetic Minority Oversampling Technique.
Chawla et al. proposed the synthetic minority oversampling technique (SMOTE), which is a type of oversampling method to solve the problem of data imbalance [13]. The basic principle is to use the existing minority samples to synthesize new samples, based on the distance between the samples in the sample space. Given an imbalanced dataset, for each sample $x_i$ of the minority class, we calculate the Euclidean distance between $x_i$ and the other minority samples. SMOTE finds the K nearest neighbors of $x_i$, selects a random point $x_j$ from these K points, and then synthesizes a new sample point $x_{new}$. The calculation formula of the new sample is as follows.
$$x_{new} = x_i + \delta \, (x_j - x_i), \tag{3}$$
where δ is a random number between 0 and 1. The abovementioned formula indicates that a new sample is synthesized between two minority samples, and the coordinate value of the new sample point is calculated.
In Figure 1, the circles represent the minority samples, the rectangles represent the majority samples, and the crosses represent the synthesized samples.
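The synthesis step described above can be sketched in a few lines. This is a minimal illustration of the interpolation between a minority sample and one of its nearest minority neighbors, not a full SMOTE implementation; the function name `smote_sample` and the toy points are hypothetical.

```python
import math
import random

def smote_sample(minority, i, k=5, rng=random):
    """Synthesize one point on the segment between minority[i] and one of
    its k nearest minority neighbours (Euclidean distance)."""
    x_i = minority[i]
    others = [x for j, x in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
    x_j = rng.choice(neighbours)              # random neighbour among the k
    delta = rng.random()                      # random number in [0, 1)
    return [a + delta * (b - a) for a, b in zip(x_i, x_j)]

# Hypothetical minority samples in a two-dimensional feature space.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]]
new_point = smote_sample(minority, 0, k=2, rng=random.Random(0))
```

Because the far-away point `[5.0, 5.0]` is not among the two nearest neighbors of the first sample, the synthesized point stays inside the local cluster.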

Genetic Algorithm.
In this study, the genetic algorithm is used to implement the operation of feature selection, which is a kind of feature combination selection. The genetic algorithm is introduced as follows.
A genetic algorithm is an optimization algorithm that simulates natural evolution. In nature, organisms evolve through natural selection. In genetic algorithms, computer-generated "creatures" are selected and evolved by fitness functions.
The basic process of the genetic algorithm is shown in Figure 2.
(1) The population is generated randomly, and numerical arrays represent the individuals. Each individual represents a combination of features. The elements in the array are called genes, which take values of 0 or 1. In the feature selection task, 0 means that the feature corresponding to the position is removed, and 1 means that the feature corresponding to the position is selected.
(2) The fitness function is used to calculate the fitness of individuals. In our study, the fitness function is the average F1-score of the fivefold cross-validation results of the SVM on the training set. The details of the SVM are described in Section 4.3.3. The F1-score is introduced in Section 4.5.
(3) According to the preset threshold, the individuals whose fitness is unqualified are removed.
(4) The remaining individuals are crossed ("hybridized") to produce a new individual, and the offspring's genes are the binary sum (modulo 2) of its parents' genes at the same locus. For example, an individual with a genome of [1, 0, 1, 1, 0] crosses with one of [0, 1, 1, 1, 1] to obtain [1, 1, 0, 0, 1]. Then, some individuals are randomly selected for gene mutation at a random locus; the gene is changed from 0 to 1 or from 1 to 0. Then, the population is updated.
The genetic algorithm is a heuristic search and optimization technique that mimics the process of natural evolution. The algorithm is easy to understand, good for noisy environments, and robust with respect to local maxima/minima. However, the genetic algorithm may fall into a local optimum when dealing with a complex optimization problem, and the result of the solution strongly depends on the initial value.
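The encoding, crossover, and mutation steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function names and the feature names `a` to `e` are hypothetical, and the fitness evaluation (cross-validated F1-score of an SVM) is left out.

```python
import random

def crossover(parent_a, parent_b):
    """Offspring gene = binary sum (XOR) of the parents' genes at each locus."""
    return [a ^ b for a, b in zip(parent_a, parent_b)]

def mutate(individual, rng=random):
    """Flip the gene at one randomly chosen locus (0 -> 1 or 1 -> 0)."""
    child = list(individual)
    locus = rng.randrange(len(child))
    child[locus] = 1 - child[locus]
    return child

def selected_features(individual, names):
    """Names of the features whose gene is 1."""
    return [name for gene, name in zip(individual, names) if gene == 1]

# The example from the text: [1, 0, 1, 1, 0] crossed with [0, 1, 1, 1, 1].
child = crossover([1, 0, 1, 1, 0], [0, 1, 1, 1, 1])   # [1, 1, 0, 0, 1]
kept = selected_features(child, ["a", "b", "c", "d", "e"])
```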
The genetic algorithm relies on crossover and mutation to jump out of local optima and reach the global optimum. Compared with the standard genetic algorithm, this study makes the following two improvements.

4.2.1. The Improved Crossover.
First, according to the matching principle, the parents are queued, i.e., sorted by the fitness function. Individuals with small objective function values are paired with each other, and individuals with large objective function values are paired with each other.
Then, the position of the crossover point is determined using the logistic chaotic sequence $x(n+1) = 4x(n)(1 - x(n))$. Finally, the crossover is performed at the determined position.
For example, for a pair of individuals $(\Omega_1, \Omega_2)$, an initial value between 0 and 1 is taken first, and $x(n+1) = 4x(n)(1 - x(n))$ is used to generate one chaotic value on (0, 1) at a time; this value is multiplied by k and then rounded to give the crossover position.
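A minimal sketch of how the logistic chaotic sequence can yield crossover positions is given below. The text does not define k explicitly; the sketch assumes it is the chromosome length. The function names and the initial value 0.3 are hypothetical.

```python
def logistic_map(x):
    """One step of the logistic chaotic sequence x(n+1) = 4 x(n) (1 - x(n))."""
    return 4.0 * x * (1.0 - x)

def chaotic_crossover_points(x0, k, count):
    """Generate `count` crossover positions by iterating the chaotic sequence,
    scaling each value by k (assumed chromosome length) and rounding."""
    points, x = [], x0
    for _ in range(count):
        x = logistic_map(x)
        points.append(round(x * k))
    return points

points = chaotic_crossover_points(x0=0.3, k=10, count=5)
```

Successive values of the sequence jump irregularly across (0, 1), so the crossover positions do not settle into a fixed pattern, which is what helps maintain diversity in the population.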

The Improved Mutation.
Mutation is also a means to achieve group diversity and is an essential guarantee for jumping out of local optima. The improved mutation in this study is designed as follows. According to the given mutation rate, two integers between 2 and k are randomly selected, and the genes at the positions corresponding to these two numbers are mutated. The mutation takes the current gene value as the initial value, and the chaotic sequence $x(n+1) = 4x(n)(1 - x(n))$ is iterated a number of times to obtain the new gene after mutation; thereby, a new chromosome is obtained.

Machine Learning Model.
The following four well-known machine learning models are not the focus of this article; thus, they are only briefly described.

K-Nearest Neighbor.
K-nearest neighbor (KNN) is a classification algorithm based on the distance between samples in space. The basic idea of the KNN algorithm is as follows. Given a test sample, the k training samples closest to it are found, and the prediction is made based on the information of these k neighbors. Generally, the voting method can be used for the prediction. The prediction result is the class label that appears most often among the k samples. The schematic diagram of the KNN algorithm is shown in Figure 3.
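The voting rule can be sketched as follows. This is a minimal illustration using Euclidean distance; the function name `knn_predict`, the toy samples, and the label coding are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label that appears most often among the k nearest training samples."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
train_y = [0, 0, 0, 1, 1]   # hypothetical coding: 1 = default, 0 = nondefault
label = knn_predict(train_x, train_y, [0.95, 1.0], k=3)
```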

Decision Tree.
A decision tree is a tree-like structure that can classify a sample through multiple levels of nodes. Each internal node represents a judgment on a feature, and the classification result is finally output by a leaf node. The decision tree usually divides the sample attributes according to the purity of the dataset. The decision tree used in our paper uses the Gini index to measure the purity of the dataset. The Gini index formula is as follows.
$$\mathrm{Gini}(X) = \sum_{k=1}^{n} P_k (1 - P_k) = 1 - \sum_{k=1}^{n} P_k^2, \tag{4}$$
where Gini(X) is the Gini index of dataset X, n is the number of classes in the dataset, and $P_k$ is the probability that a sample belongs to class k. The above formula gives the probability that two samples randomly selected from the dataset belong to different classes. The smaller the Gini index, the higher the purity of the dataset. The generation process of a decision tree is mainly divided into the following three parts.
(1) Feature selection. A feature is chosen from the many features in the training data as the split criterion of the current node. Different feature selection methods lead to different decision tree algorithms. This article uses information gain to divide nodes.
(2) Decision tree generation. According to the selected feature evaluation criterion, child nodes are generated recursively from top to bottom, and the decision tree grows until the dataset is indivisible.
(3) Pruning. The decision tree is prone to overfitting, so it is necessary to reduce the size of the tree structure through pruning. There are two commonly used approaches: prepruning and postpruning (backward pruning).
Figure 4 is a schematic diagram of a decision tree. Rectangles represent the internal nodes, and ellipses represent the leaf nodes.
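The Gini index used above to measure purity can be computed directly from class labels. The sketch below is an illustration only; the size-weighted score of a split is the usual CART convention and is an assumption here, and both function names are hypothetical.

```python
from collections import Counter

def gini_index(labels):
    """Gini(X) = 1 - sum_k P_k^2, where P_k is the share of class k in labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini index of the two child nodes of a candidate split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)

mixed = gini_index([0, 0, 1, 1])        # evenly mixed node
pure = gini_index([1, 1, 1])            # pure node
clean_split = split_gini([0, 0], [1, 1])
```

A pure node has a Gini index of 0, and an evenly mixed two-class node has the two-class maximum of 0.5, so a split that lowers the weighted value increases purity.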

Support Vector Machine.
Support vector machine (SVM) is an algorithm that classifies samples with a maximum-margin hyperplane. Its goal is to find the hyperplane that is the most distant from the nearest samples (the support vectors). The maximum-margin hyperplane is found by solving the optimization problem in (5). The objective function of the support vector machine is shown as follows.
$$\min_{w,\, b,\, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i \left( w^{T} x_i + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, N, \tag{5}$$
where w is the weight vector, b is the displacement term, ξ is the slack variable, C > 0 is the penalty factor, and $x_i$ and $y_i \in \{-1, +1\}$ are the feature vector and label of the ith sample. In (5), the first expression is the optimization goal, and the second expression is the constraint condition. (5) means minimizing the squared norm of the weight vector, together with the penalty on the slack variables, under the constraint conditions. Figure 5 is a schematic diagram of SVM.

Logistic Regression.
The basic idea of logistic regression (LR) is to judge the classification by finding the relationship between the classification probability and the input vector. First, it is assumed that the data obey a certain distribution, and then maximum likelihood estimation is used for parameter estimation.
$$\ln \frac{p}{1 - p} = \alpha + \sum_{h} \beta_h x_h, \tag{6}$$
where p is the default probability, 1 − p is the nondefault probability, α is the constant in the model, $\beta_h$ is the logistic regression coefficient, and $x_h$ is the hth independent variable. The maximum likelihood method is used to estimate the parameters α and β. Given the dataset $\{(x_i, y_i)\}_{i=1}^{N}$, the log-likelihood function can be expressed as
$$\ln L(\alpha, \beta) = \sum_{i=1}^{N} \left[ y_i \ln s_i + (1 - y_i) \ln (1 - s_i) \right], \tag{7}$$
where $s_i$ is the predicted default probability of sample $x_i$. Maximizing the likelihood function in (7) is equivalent to minimizing the following cost function:
$$J(\alpha, \beta) = -\sum_{i=1}^{N} \left[ y_i \ln s_i + (1 - y_i) \ln (1 - s_i) \right]. \tag{8}$$
In this paper, the logistic regression parameters are estimated using gradient descent. The main goal of parameter optimization is to find a direction such that the value of the cost function is reduced after the parameters move in this direction.
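The gradient-descent estimation can be sketched on toy data. This is a minimal illustration, assuming the cost is the negative log-likelihood of a sigmoid model; the learning rate, the iteration count, the data, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(alpha, beta, xs, ys):
    """Negative log-likelihood: -sum(y ln s + (1 - y) ln(1 - s))."""
    total = 0.0
    for x, y in zip(xs, ys):
        s = sigmoid(alpha + sum(b * v for b, v in zip(beta, x)))
        total -= y * math.log(s) + (1 - y) * math.log(1 - s)
    return total

def gradient_step(alpha, beta, xs, ys, lr=0.1):
    """One gradient-descent update of (alpha, beta)."""
    g_alpha, g_beta = 0.0, [0.0] * len(beta)
    for x, y in zip(xs, ys):
        err = sigmoid(alpha + sum(b * v for b, v in zip(beta, x))) - y
        g_alpha += err                                    # d cost / d alpha
        g_beta = [g + err * v for g, v in zip(g_beta, x)]  # d cost / d beta
    return alpha - lr * g_alpha, [b - lr * g for b, g in zip(beta, g_beta)]

xs = [[0.2], [0.4], [1.6], [1.8]]   # hypothetical one-feature data
ys = [0, 0, 1, 1]                   # 1 = default, 0 = nondefault
alpha, beta = 0.0, [0.0]
before = cost(alpha, beta, xs, ys)
for _ in range(500):
    alpha, beta = gradient_step(alpha, beta, xs, ys)
after = cost(alpha, beta, xs, ys)
```

Each step moves the parameters against the gradient of the cost, so the cost after training is lower than at the starting point.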

Convolutional Neural Network.
To explore the application of deep learning algorithms to default risk prediction, we choose the convolutional neural network (CNN), a well-known model in computer vision, and compare it with the abovementioned four traditional machine learning algorithms.
As a representative algorithm of deep learning, CNN is widely used in the fields of image and audio processing in the era of big data. In recent years, convolutional neural networks have also produced many research results in the area of default prediction. The CNN constructed in this study includes the convolutional layer (CONV), the rectified linear unit (ReLU), and the fully connected layer (FC) for extracting features, as shown in Figure 6.

Convolutional Layer.
The convolutional layer is a structure unique to CNN, which can be used to extract data features. The convolutional layer contains multiple convolution kernels. Since the object of study in this paper is one-dimensional data, a one-dimensional convolution kernel is used. The convolutional layer uses the convolution kernel to perform the convolution operation on the input matrix and convert the input matrix to the output matrix. The convolution operation process shown in Figure 7 is illustrated by taking one-dimensional convolution as an example. The weight matrix of the convolution kernel starts from the initial position of the input matrix, calculates the sum of the products of the weight matrix and the elements at the corresponding positions of the input matrix, and fills the result into the output matrix. The model adjusts the weight matrix itself through the backpropagation (BP) algorithm. After each calculation, the convolution kernel moves on to calculate the value of the next element of the output matrix until all calculations are completed.
A 10-dimensional vector shown in Figure 8 represents the input matrix. Moreover, the weight matrix of the convolution kernel is represented by a three-dimensional vector shown in Figure 9, and the size of the convolution kernel is 1 × 3.
Then the first element of the output matrix is the sum of the products of the three kernel weights and the first three elements of the input.
The remaining elements of the output matrix are obtained by analogy, as shown in Figure 10.
Let the dimension of the output vector of the convolutional layer be $n_{out}$ (in the one-dimensional case); then the size of the output matrix of the convolutional layer can be calculated by the following formula.
$$n_{out} = \left\lfloor \frac{n_{in} + 2p - k}{s} \right\rfloor + 1,$$
where $n_{in}$ is the dimension of the input vector, p is the padding, k is the size of the convolution kernel, and s is the step (stride). In this paper, p = 0 and s = 1.
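The sliding-window computation and the output size can be checked with a short sketch. The input and kernel values below are hypothetical and are not those of Figures 8 and 9; the function names are also illustrative.

```python
def conv_output_size(n_in, k, p=0, s=1):
    """n_out = floor((n_in + 2p - k) / s) + 1."""
    return (n_in + 2 * p - k) // s + 1

def conv1d(inputs, kernel, s=1):
    """One-dimensional convolution with padding 0: slide the kernel over the
    input and sum the element-wise products at each position."""
    k = len(kernel)
    n_out = conv_output_size(len(inputs), k, p=0, s=s)
    return [sum(w * x for w, x in zip(kernel, inputs[i * s:i * s + k]))
            for i in range(n_out)]

inputs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # hypothetical 10-dimensional input
kernel = [1, 0, -1]                        # hypothetical 1 x 3 kernel
outputs = conv1d(inputs, kernel)
```

With a 10-dimensional input, a kernel of size 3, no padding, and a step of 1, the output has 8 elements.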

Linear Rectifying Function.
The linear rectifying function (rectified linear unit, ReLU) is a commonly used activation function. In this paper, it is the ramp function, and its expression is
$$f(x) = \max(0, x),$$
where f(x) is the linear rectifying function and x is an input matrix. The meaning of the above equation is that each element of the input matrix remains unchanged if it is greater than 0; otherwise, it becomes 0.

Fully Connected Layer.
The fully connected (FC) layer lies behind the convolution layer and acts as a feed-forward neural network. After the output matrix of the convolution layer is input to the FC layer, the Softmax function can calculate the default probability. The expression of the Softmax function is
$$P(y = k \mid x) = \frac{\exp(x^{T} w_k)}{\sum_{j} \exp(x^{T} w_j)},$$
where $x^{T}$ is the transpose of the input matrix, $P$ is the probability that the sample belongs to class $k$, $w_k$ is the weight vector under class $k$, and the sum runs over all classes $j$. The above equation means that for each input $x$, Softmax calculates the probability $P$ of its being assigned to each class $k$. When the probability of default exceeds the threshold, the sample is predicted as a default sample.
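The Softmax step and the threshold rule can be sketched as follows. This is a simplified illustration rather than the network used in the paper: the input `x`, the class weight vectors in `weights`, the choice of class index 1 for "default", and the threshold of 0.5 are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_default(x, weights, default_class=1, threshold=0.5):
    """Return (default probability, predicted-as-default flag) for one sample."""
    # One score x^T w_k per class k.
    scores = [sum(xi * wi for xi, wi in zip(x, w_k)) for w_k in weights]
    p_default = softmax(scores)[default_class]
    return p_default, p_default > threshold


# Hypothetical input and weights (assumption): class 0 = nondefault, class 1 = default.
x = [0.2, 1.0, -0.5]
weights = [[0.1, -0.3, 0.2], [-0.1, 0.4, -0.2]]
p_default, is_default = predict_default(x, weights)
print(round(p_default, 3), is_default)
```

Lowering `threshold` makes the classifier flag more samples as default, which trades a lower Type-II error for more false alarms.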

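Model Testing.
The evaluation metrics adopted in this paper are as follows. Taking default as the positive class, $TP$ and $FN$ denote the numbers of default samples predicted as default and as nondefault, respectively, and $TN$ and $FP$ denote the numbers of nondefault samples predicted as nondefault and as default, respectively.
(1) The F1-score is the harmonic mean of precision and recall:
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
(2) The Type-II error is the proportion of default samples that are wrongly predicted as nondefault samples:
$$\text{Type-II error} = \frac{FN}{TP + FN}.$$
(3) Accuracy is the ratio of the number of correctly classified samples to the total number of samples:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
(4) The G-mean is the geometric mean of the correct rates of nondefault and default samples:
$$G\text{-}mean = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}.$$
(5) The AUC value is equivalent to the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example [58].
The confusion matrix of prediction results is shown in Table 1.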
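The five evaluation metrics can be computed from confusion-matrix counts and prediction scores as in the sketch below. It assumes that default is the positive class; the function names, the count arguments `tp`, `fn`, `fp`, `tn`, and the sample numbers are hypothetical and chosen only for illustration. The AUC is computed directly from its probabilistic definition, with ties counted as one half.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Metrics from confusion-matrix counts, with default as the positive class.

    tp: defaults predicted as default       fn: defaults predicted as nondefault
    fp: nondefaults predicted as default    tn: nondefaults predicted as nondefault
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # correct rate of defaults
    specificity = tn / (tn + fp) if tn + fp else 0.0     # correct rate of nondefaults
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "f1": f1,
        "type2_error": fn / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "g_mean": math.sqrt(recall * specificity),
    }


def auc(pos_scores, neg_scores):
    """Probability that a random positive is scored above a random negative."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores (assumption).
m = classification_metrics(tp=30, fn=10, fp=20, tn=140)
a = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
print(m, a)
```

A classifier that labels every sample as default has a Type-II error of 0 but a G-mean of 0, which is why the two metrics are read together.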

Comprehensive Metric Model Based on Multi-Machine Learning Algorithms.
The training of a CNN is time-consuming. Compared with the convolutional neural network, the four machine learning models mentioned above can be trained easily and are simpler to debug. Therefore, before training the CNN, we first use the machine learning models to evaluate the datasets with different balance ratios and feature combinations. The average ranking over multiple metrics determines the best balance ratio and feature combination. This result is then applied to the CNN to improve the performance of the deep learning model. The basic process of our approach is as follows.
(1) An unprocessed copy of the original training set is denoted by $T$. The copy of the original training set processed by feature selection is denoted as $T_0$, and its corresponding feature combination is denoted as $F_0$.
(2) $T$ is rebalanced with the balance ratios set to 1 : 1, 2 : 1, . . ., 10 : 1, and the ten resulting training sets are denoted as $T_1, T_2, \ldots, T_{10}$.
(3) The ten training sets are processed by feature selection to obtain $T_{11}, T_{12}, \ldots, T_{20}$, and their corresponding feature combinations are $F_1, F_2, \ldots, F_{10}$.
(4) Applying the same rebalancing process as above to $T_0$, the ten training sets obtained are denoted as $T_{21}, T_{22}, \ldots, T_{30}$. Ultimately, we get 32 different training sets and 11 different feature combinations. Namely, $T$ is a dataset without rebalancing or feature selection (a copy of the original dataset), $T_0$ is a feature-selected dataset without rebalancing, $T_1$-$T_{10}$ are rebalanced at different balance ratios without feature selection, $T_{11}$-$T_{20}$ are rebalanced and then feature selected, and $T_{21}$-$T_{30}$ are feature selected and then rebalanced. We call these 32 training sets the derived datasets of the original dataset.
(5) We input the 32 derived datasets into KNN, DT, SVM, and LR and, after training on each derived dataset, calculate the accuracy, G-mean, Type-II error, and AUC of the four models on the same test set.
(6) For each machine learning algorithm, the training sets are ranked on each of the four metrics, and the comprehensive ranking of the algorithm's performance on the different training sets is obtained by averaging the rankings of the four metrics.
(7) Averaging the comprehensive rankings of the four machine learning algorithms on each derived training set gives the score of the comprehensive metric model of multimachine learning algorithms. The derived dataset with the highest rank can then be found, and its balance ratio and feature combination are the best.
(8) The CNN is trained with the selected derived dataset and tested on the same test set.
The flowchart of our approach is shown in Figure 11.
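Steps (5) to (7) amount to ranking the derived datasets per model and per metric and then averaging those ranks. The sketch below is one possible reading of that procedure, not the authors' implementation: the nested dictionary layout `results[model][dataset][metric]`, the function names, the tie handling (tied values share their mean rank), and all numbers are assumptions. The averaged ranks are used directly as scores, so a lower score is better.

```python
# Direction of each metric: True if a larger value is better.
HIGHER_IS_BETTER = {"accuracy": True, "g_mean": True, "auc": True, "type2_error": False}


def rank_values(values, higher_is_better):
    """Return 1-based ranks (1 = best); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def comprehensive_scores(results):
    """Average rank of each dataset over all models and metrics (lower is better)."""
    models = list(results)
    datasets = list(results[models[0]])
    totals = {d: 0.0 for d in datasets}
    for model in models:
        for metric, hib in HIGHER_IS_BETTER.items():
            values = [results[model][d][metric] for d in datasets]
            for d, r in zip(datasets, rank_values(values, hib)):
                totals[d] += r
    # Every model contributes the same four metrics, so averaging per model
    # and then over models equals one overall mean.
    n = len(models) * len(HIGHER_IS_BETTER)
    return {d: totals[d] / n for d in datasets}


# Hypothetical metric values for two models on three derived datasets (assumption).
results = {
    "KNN": {
        "T1": {"accuracy": 0.80, "g_mean": 0.70, "auc": 0.75, "type2_error": 0.30},
        "T2": {"accuracy": 0.82, "g_mean": 0.78, "auc": 0.80, "type2_error": 0.20},
        "T12": {"accuracy": 0.85, "g_mean": 0.80, "auc": 0.83, "type2_error": 0.15},
    },
    "LR": {
        "T1": {"accuracy": 0.78, "g_mean": 0.72, "auc": 0.77, "type2_error": 0.28},
        "T2": {"accuracy": 0.84, "g_mean": 0.75, "auc": 0.79, "type2_error": 0.22},
        "T12": {"accuracy": 0.83, "g_mean": 0.81, "auc": 0.85, "type2_error": 0.18},
    },
}
scores = comprehensive_scores(results)
best = min(scores, key=scores.get)
print(scores, best)
```

Because only the four inexpensive models are trained on all derived datasets, the CNN needs to be trained once, on the dataset returned as `best`.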

The Empirical Research.
In this section, the dataset (China Dataset) is collected from the credit database of a regional commercial bank in China. The data contain 80 features, which can be divided into three levels (i.e., internal financial, nonfinancial, and external macro factors) and nine second-level criterion layers, including solvency,  Table 2. Table 3 shows the parameters we used for the algorithms. In this study, the random search method [59] is used to select the best hyperparameters of the CNN.
The original training set of the China Dataset is denoted by $C$. $C_0$ is obtained by feature selection on $C$. After rebalancing $C$, ten training sets are obtained, which are denoted as $C_1, C_2, \ldots, C_{10}$. By performing feature selection on these ten datasets, we obtain $C_{11}, C_{12}, \ldots, C_{20}$. Then, the ten datasets obtained by rebalancing $C_0$ are denoted as $C_{21}, C_{22}, \ldots, C_{30}$. These 32 derived datasets are input into our proposed approach. Table 4 shows the rank of the comprehensive metric model based on multimachine learning algorithms on the 32 derived datasets. From Table 4, we can conclude that the optimal balance ratio of the default data of small enterprises is 2 : 1 and that, for the default prediction of Chinese small enterprises, the optimal feature combination includes the 31 features shown in Table 5.
Next, we train the CNN with the $C_{12}$ dataset. In this experiment, the CNN has two convolutional layers and four fully connected layers. The first convolution layer consists of two one-by-five convolution kernels, and the second convolution layer is the same as the first one. Because the $C_{12}$ dataset contains only 35 features, the padding for the convolution operation is 0 and the step size is 1, which means that the size of the output matrix is 1 × 27 after processing by the two convolution layers. The number of input neurons in the first FC layer is 27, and the number of output neurons is 15. The numbers of output neurons in the second, third, and fourth FC layers are 9, 3, and 1, respectively. The performance of the CNN and its ranking with the other four models are shown in Table 6.
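The layer sizes quoted above can be checked with a short calculation. The sketch below only traces vector lengths through the network, using the input size of 35, the two 1 × 5 convolution layers, and the FC output sizes given in the text; the function and variable names are hypothetical, and how the two kernels of each layer are combined into a single 1 × 27 output is not specified in the text and is not modeled here.

```python
def conv1d_output_size(n_in, k, p=0, s=1):
    """Output length of a 1-D convolution: floor((n_in + 2p - k) / s) + 1."""
    return (n_in + 2 * p - k) // s + 1


n_features = 35                  # input length stated in the text
size = n_features
for kernel_size in (5, 5):       # two convolution layers with 1 x 5 kernels
    size = conv1d_output_size(size, kernel_size, p=0, s=1)
print(size)                      # 35 -> 31 -> 27

# (inputs, outputs) of the four fully connected layers.
fc_outputs = [15, 9, 3, 1]
fc_shapes = []
n_in = size
for n_out in fc_outputs:
    fc_shapes.append((n_in, n_out))
    n_in = n_out
print(fc_shapes)                 # [(27, 15), (15, 9), (9, 3), (3, 1)]
```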
It can be concluded from Table 6 that it is feasible to use our method to find the best balance ratio and the optimal feature combination. In Table 6, a bold value means that the method ranks first on that evaluation criterion. The CNN ranks first in terms of G-mean, accuracy, and AUC, which indicates that its comprehensive ranking is also first. The CNN shows excellent predictive ability on the selected dataset.

The Comparison of Different Datasets.
In this section, three datasets are collected from the UCI Machine Learning Repository to evaluate the robustness of our method. Table 7 shows the details of the datasets.
To verify the generalization ability of our method, we apply it to the Japan Dataset, Australia Dataset, and Chile Dataset to perform research. e three datasets are denoted as J, A, and CH, respectively. We pretreated the abovementioned three datasets in the same way as in Section 5.1.
The experimental results are shown in Table 8. These results show that the best balance ratio is 3 : 1 for the Japanese dataset, 1 : 1 for the Australian dataset, and 3 : 1 for the Chilean dataset. Next, we apply these results to the CNN. Because these three datasets have few features, the CNN in this experiment contains only one convolutional kernel and three fully connected layers. Because the three datasets have different numbers of features, the number of neurons in the FC layers differs slightly, but the three networks share the same structure. The experimental results are shown in Tables 9-11.
Figure 11: Flowchart of the comprehensive metric model based on multimachine learning algorithms.

For the Japanese dataset, the CNN's G-mean, Type-II error, and AUC rank first. Although DT obtains the highest accuracy, its Type-II error is 3.3 times higher than that of the CNN. For the Australian dataset, the accuracy of the CNN is not the highest but is close to first place, while its other three metrics all rank first. With regard to the Chilean dataset, the G-mean of SVM is 0, so although its Type-II error ranks first, this result is worthless. The AUC of the CNN is very close to first place, while its G-mean and accuracy rank first. It can be considered that the comprehensive performance of the convolutional neural network is the best. We can conclude that the convolutional neural network performs well on the three public datasets processed by our method and is superior to the other three machine learning models. Thus far, we have found that our proposed comprehensive metric model based on multimachine learning algorithms has good robustness.

The Comparison of Different Resampling Methods.
To evaluate the performance of our proposed method, we perform comparison experiments on the China Dataset in terms of undersampling and oversampling [14]. In this section, the SMOTE method is replaced by two methods to evaluate the robustness of our method.
For undersampling, the CNN's accuracy and G-mean rank first, DT's AUC ranks first, and SVM's Type-II error ranks first. For oversampling, the CNN's G-mean and AUC rank first, DT's Type-II error ranks first, and SVM's accuracy ranks first. It can be seen from the comparison in Table 12 that CNN has better classification performance than the other four models in both undersampling and oversampling, which verifies the robustness of the model proposed in this study from the aspect of sample processing.
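The section does not detail the undersampling and oversampling procedures taken from [14]. As a rough illustration of how a training set can be brought to a chosen balance ratio, the sketch below uses plain random resampling. The function names, the reading of the ratio as nondefault : default, and the toy data are all assumptions, and the methods actually used in the paper may differ.

```python
import random


def random_undersample(nondefault, default, ratio, seed=0):
    """Keep all default samples and draw ratio * len(default) nondefault samples."""
    rng = random.Random(seed)
    n_keep = min(len(nondefault), int(ratio * len(default)))
    return rng.sample(nondefault, n_keep), list(default)


def random_oversample(nondefault, default, ratio, seed=0):
    """Keep all nondefault samples and duplicate default samples up to ratio : 1."""
    rng = random.Random(seed)
    n_target = max(len(default), round(len(nondefault) / ratio))
    extra = [rng.choice(default) for _ in range(n_target - len(default))]
    return list(nondefault), list(default) + extra


# Toy data (assumption): 100 nondefault and 10 default sample identifiers.
nondefault = list(range(100))
default = list(range(100, 110))

under_n, under_d = random_undersample(nondefault, default, ratio=2)
over_n, over_d = random_oversample(nondefault, default, ratio=2)
print(len(under_n), len(under_d))  # 20 10
print(len(over_n), len(over_d))    # 100 50
```

Undersampling discards nondefault samples, whereas oversampling repeats default samples; both reach the same ratio with very different training-set sizes.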

The Comparison of Different Feature Selection Methods.
To evaluate the performance of our proposed method, we perform comparison experiments on the China Dataset in which the GA method is replaced by four other feature selection methods: the FDAF-score [2], the correlation coefficient [24], the particle swarm optimization algorithm [31], and the artificial bee colony algorithm [33]. These four methods are used to verify the robustness of the model proposed in this study from the aspect of feature selection. For the FDAF-score, correlation coefficient, and particle swarm optimization algorithm, the CNN ranks first in terms of G-mean, accuracy, and AUC. For the artificial bee colony algorithm, the CNN ranks first in terms of G-mean and Type-II error.
The empirical results in Table 13 show that CNN performs better than the other four classification models.

Conclusion and Future Research
The main conclusions of this study are as follows.
(1) For an enterprise credit dataset, model performance differs across balance ratios and feature combinations. For each dataset, there is always a best balance ratio and a corresponding optimal feature combination. Based on our proposed method, we can find the best result among the 32 derived datasets.
(2) Overall, the dataset whose selection order is data rebalancing followed by feature selection always achieves a better result, which means that it is the best selection order.
(3) The best balance ratio and the corresponding optimal feature combination found by our method are also suitable for the convolutional neural network. In addition, the performance of CNN is better than that of traditional machine learning models in the task of default risk prediction.
The primary contributions of this study are as follows.
(1) Different balance ratios result in different classification accuracies, and there is bound to be an optimal balance ratio. In this study, the optimal balance ratio is found through the classification accuracy, which remedies the deficiency of existing research, in which samples are left imbalanced or the balance ratio is fixed at 1 : 1, and ensures the accuracy of the classification model.
(2) The order of data balancing and feature selection affects the model accuracy. This study proposes a comprehensive metric model based on machine learning algorithms, which can simultaneously find the best balance ratio and perform the optimal feature selection.
Further studies may include the use of a new feature selection method. Although we make two improvements to the GA to avoid falling into a local optimum, it cannot be guaranteed that the results are globally optimal. As a heuristic algorithm, GA depends on its initial conditions, which makes it unstable. Thus, a robust and effective optimization method is our next goal. In the future, feature selection methods such as principal component analysis, rough sets, or lasso regression may be options for handling redundant features.

Data Availability
The Chinese data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Conflicts of Interest
The authors declare no conflicts of interest.