Financial Credit Risk Control Strategy Based on Weighted Random Forest Algorithm

In order to improve the effectiveness of financial credit risk control, a financial credit risk control strategy based on the weighted random forest algorithm is proposed. The weighted random forest algorithm is used to classify the financial credit risk data, an evaluation index system is constructed, and the analytic hierarchy process is used to evaluate the financial credit risk level. Targeted risk control strategies are then taken according to the different risk assessment results. We compared the proposed method with two other methods, and the experimental results show that the proposed method has higher classification accuracy on financial credit data and that its risk assessment threshold is basically consistent with the actual results.


Introduction
In recent years, with the rapid development of Internet finance, online credit has grown very quickly and its participants have become more and more diversified. Online credit helps a growing number of users with emergency and short-term capital turnover. However, as user loan business expands, the total amount of loans is increasing, the default probability of users is rising, and the risk of user loans is gradually being revealed. Therefore, in-depth research on user loan risk and a scientific, rational evaluation of that risk are extremely important for micro online credit enterprises to prevent Internet financial risk [1][2][3]. Studies of user loan risk prediction and analysis have found that user data have three important characteristics: unbalanced distribution, a large amount of noise, and high dimensionality. User loan risk is caused by the interaction of many features across different dimensions of the user. Traditional statistical methods can relate user loan risk to only a single feature or a small number of features, so this kind of data poses a challenge to them [4].
With the rapid development of the financial industry and the Internet industry, Internet finance, as an emerging business model, has gradually come into public view. The online business volume of Internet finance, such as small and medium loans, is also increasing rapidly with the growth of the Internet. The loan applications of individuals, businesses, and even enterprises are gradually becoming faster and paperless [5]. In the early days, loan default forecasts relied entirely on manual review. This forecasting method is only suitable for small-scale credit audit and cannot cope with large-scale loan audit. In order to remove the uncertainty of human factors in manual audit, financial institutions proposed judging whether to make a loan according to detailed rules. This method gives a conclusion according to the rules, so that the credit auditor can decide whether to lend according to clear indicators. It also greatly reduces the requirements placed on the credit auditor, so that ordinary staff can start credit work after relevant training. In addition, many researchers have proposed data mining models to study credit default [6]. The random forest (RF) algorithm has been very successful as a general-purpose classification and regression method. It is a machine learning algorithm that can be applied to large-scale problems involving big data and can be easily adapted to various ad hoc learning tasks [7][8][9]. The voting mechanism of the random forest algorithm assigns the same weight to all of the base classifiers, which is not always appropriate, because the trees do not all have the same ability to make decisions and some of them deserve lower weights than others. Therefore, the weighted RF algorithm performs better than the RF algorithm in most cases [10,11].
In order to control the financial credit risk more effectively, a financial credit risk control strategy based on weighted random forest algorithm is proposed.

Related Work
The study in [12] proposed a financial credit risk assessment method based on the particle swarm optimization algorithm. On the basis of a fully demonstrated supply chain financial risk characteristic index system, a binary particle swarm optimization algorithm is used to optimize the feature subset, and the support vector machine (SVM) parameters are co-optimized to obtain the financial credit risk classification assessment results. The work in [13] proposed a financial credit risk assessment method based on a stack noise reduction self-coding network (that is, a stacked denoising autoencoder). It fully considers the correlation between data features, improves the network model, introduces the truncated Karhunen-Loève expansion as the noise input term, and eliminates the noise in the financial credit risk data to obtain more effective evaluation results. The study in [14] proposed a financial credit risk assessment method based on XGBFS (XGBoost feature selection), which uses a series of data preprocessing methods and the embedded feature selection method XGBFS to reduce the dimension of the user's credit data, trains an XGBoost assessment model, and finally realizes the user's credit risk assessment. The work in [15] argued that integrating supervised and unsupervised machine learning strategies produces better results than using only one of them. The authors proposed a credit risk assessment system that integrates the supervised and unsupervised learning strategies at the consensus stage and the dataset clustering stage. The study in [16] concluded that traditional approaches to forecasting credit risk are not well suited to financial institutions, which need ML-based techniques instead. The authors proposed a hybrid ensemble machine learning approach that incorporates two classic machine learning approaches, random subspace (RS) and multiboosting.
The work in [17] used the standard probit algorithm along with various machine learning algorithms such as neural networks and k-nearest neighbors (KNN). The machine learning techniques achieved a lower error rate than the classic methods used for credit risk assessment. The work in [18] used a loan dataset from a commercial bank to test five machine learning algorithms for credit risk assessment: KNN, decision tree (DT), RF, naive Bayes (NB), and logistic regression. Their results show that the random forest algorithm performs better than the other algorithms tested.

Financial Credit Risk Control Based on Weighted Random Forest Algorithm
In order to effectively control the financial credit risk, it is necessary to accurately classify the financial credit risk level, classify the financial credit risk data by using the weighted random forest algorithm, and construct the credit risk evaluation model. Targeted control measures shall be taken for different financial credit risk levels.

Data Classification.
A decision tree is a machine learning algorithm with a tree structure, composed of a root node, nonleaf nodes, and leaf nodes. Its structure is shown in Figure 1. The decision tree is a recursive structure: training and prediction proceed from the root node downward. According to the chosen similarity criterion, the data are divided into subsets with a certain similarity, which generates multiple branches. Division stops when a leaf node is reached. Leaf nodes are set according to the maximum tree depth or the minimum number of samples per leaf. Each leaf node represents one classification result. Depending on the basis of feature division, decision trees can be based on information entropy, information gain, information gain rate, or Gini impurity. Assuming that the training sample set D with N financial credit data can be divided into K categories, the information entropy is
$$H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k, \qquad p_k = \frac{n_k}{N},$$
where $H(D)$ denotes the information entropy, $n_k$ denotes the number of samples in the kth category, and $p_k$ represents the probability of the kth category. Similarly, for the information gain of feature a, we have
$$\mathrm{Gain}(D, a) = H(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} H(D^v).$$
For the information gain rate, we have
$$\mathrm{GainRatio}(D, a) = \frac{\mathrm{Gain}(D, a)}{-\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}}.$$
For the Gini impurity,
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2.$$
Here, $n_k$ represents the number of samples in the kth category, $D^v$ is the vth subset obtained by dividing the sample set according to feature a, and v takes values in [1, V] [19].
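As an illustration of the four division criteria named above, the following minimal sketch computes them with the standard textbook definitions on a toy label set. The labels and the `has_account` feature are hypothetical and are not taken from the paper's dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Information entropy H(D) = -sum_k p_k * log2(p_k), with p_k = n_k / N."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity Gini(D) = 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_by_feature(feature_values, labels):
    """Group the labels into subsets D^v, one per distinct value v of the feature."""
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    return list(subsets.values())


def information_gain(feature_values, labels):
    """Gain(D, a) = H(D) - sum_v |D^v|/|D| * H(D^v)."""
    n = len(labels)
    subsets = split_by_feature(feature_values, labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)


def gain_ratio(feature_values, labels):
    """Information gain divided by the entropy of the split itself."""
    n = len(labels)
    subsets = split_by_feature(feature_values, labels)
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets)
    return information_gain(feature_values, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: six borrowers, two of whom defaulted.
labels = ["default", "default", "repaid", "repaid", "repaid", "repaid"]
has_account = ["no", "no", "no", "yes", "yes", "yes"]

print("entropy:", entropy(labels))
print("gini:", gini(labels))
print("information gain:", information_gain(has_account, labels))
print("gain ratio:", gain_ratio(has_account, labels))
```

A feature that separates the classes well yields a larger information gain, whereas for Gini impurity a smaller value after the split indicates purer child nodes.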
Random forest is an ensemble learning algorithm based on decision trees. It randomly generates multiple uncorrelated decision trees, each of which learns and predicts independently. These predictions are combined into a single prediction by voting: the category with the most votes is the model's prediction, and the result is better than that of a single decision tree.
Suppose the input financial credit training dataset is D and the number of decision trees to generate is M. The generation steps of the random forest are as follows:
(1) Conduct the mth sampling on the training set, where m is an integer in [1, M]. Randomly draw n times with replacement to obtain the training set $D_m$ containing n samples.
(2) When a node of the decision tree is split, the input variables do not all participate in the split. Instead, k (k ≤ n) feature variables are randomly selected, where n here denotes the number of input variables and k is generally $2 \log_2 n + 1$. The best of the k features is used to split the node, and the mth decision tree $G_m$ is trained with the classification and regression tree (CART) algorithm.
(3) The CART decision tree selects split features based on Gini impurity. The smaller the Gini impurity, the purer the node and the better the feature. Finally, M CART decision trees are generated to form a random forest.
(4) Determine the class of the data by counting the votes.
The flow of the random forest algorithm is shown in Figure 2. The decision tree algorithm is fast and easy to understand, but it overfits easily, and when dealing with unbalanced data, feature division tends to choose features with more values. Compared with the decision tree algorithm, random forest has high accuracy and does not overfit easily, but on unbalanced datasets the classification accuracy for minority classes is still not high. To address the classification of unbalanced datasets, an accurate classification method based on the weighted random forest is proposed.
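The four generation steps above can be sketched with scikit-learn's `DecisionTreeClassifier` as the CART learner. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, M = 25 is arbitrary, and k is taken as int(log2 n) + 1, a common default assumed here so that k stays below the number of features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for a credit dataset: 200 samples, 8 features, binary label.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

M = 25                                  # number of trees (assumed value)
n_samples, n_features = X.shape
k = int(np.log2(n_features)) + 1        # candidate features per split (assumed rule)

forest = []
for m in range(M):
    # Step 1: bootstrap sample D_m, n draws with replacement.
    idx = rng.integers(0, n_samples, size=n_samples)
    # Steps 2-3: CART tree, Gini criterion, k random candidate features per split.
    tree = DecisionTreeClassifier(criterion="gini", max_features=k, random_state=m)
    tree.fit(X[idx], y[idx])
    forest.append(tree)

# Step 4: majority vote over the M trees (M is odd, so binary ties cannot occur).
votes = np.stack([t.predict(X) for t in forest])    # shape (M, n_samples)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("agreement with training labels:", (y_pred == y).mean())
```

In practice `sklearn.ensemble.RandomForestClassifier` performs the same bootstrap, feature subsampling, and aggregation internally; the explicit loop is shown only to mirror the steps in the text.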
In order to minimize the overall error rate, a machine learning algorithm tends to ignore classes with few samples, and training easily produces a model that favors the majority classes [2]. To improve this situation, the weighted random forest follows the idea of cost-sensitive learning: it gives more weight to minority classes to increase their influence and balance the relationship between samples. This makes the generated model more suitable for unbalanced data and improves the classification accuracy for minority classes.
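The text does not state how the class weights are chosen. One common heuristic, assumed here purely for illustration, makes each weight inversely proportional to the class frequency, N / (K · n_k); this is the same rule that scikit-learn documents for `class_weight="balanced"`. The 90/10 label split below is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by N / (K * n_k), so rarer classes get larger weights."""
    counts = Counter(labels)
    n_total, n_classes = len(labels), len(counts)
    return {cls: n_total / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical unbalanced sample: 90 repaid loans and 10 defaults.
labels = ["repaid"] * 90 + ["default"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights the total weighted mass of each class is equal, so a misclassified default costs as much as nine misclassified repaid loans.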
Class weight is mainly reflected in the following two places:
(1) During the growth of a decision tree, the reduction Δgi of the weighted Gini impurity GI is used to find the optimal split feature. The larger the reduction, the lower the impurity after the split and the better the separation. In the calculation, K represents the total number of categories, J represents the sample set at the node before splitting, $J_L$ represents the sample set of the left child node, $J_R$ represents the sample set of the right child node, $n_i$ represents the number of samples of class i in the node, and $W_i$ represents the class weight assigned to class i.
(2) When determining the category at the leaf node, the final classification result c is determined by combining the weighted votes of the decision trees.
Through the above calculation, the classification of financial credit data for subsequent financial credit risk assessment is completed.
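To make the two uses of class weights concrete, the sketch below assumes one plausible form of the calculation: class proportions at a node are computed from the weighted counts W_i · n_i, the impurity reduction is the parent impurity minus the weighted-size average of the child impurities, and the final vote sums W_i · n_i over the leaves reached in each tree. These forms, the weights, and the counts are illustrative assumptions and may differ from the paper's exact formulas.

```python
def weighted_size(counts, class_weights):
    """Total weighted mass of a node: sum_i W_i * n_i."""
    return sum(class_weights[c] * n for c, n in counts.items())


def weighted_gini(counts, class_weights):
    """Gini impurity with proportions taken from the weighted counts W_i * n_i."""
    total = weighted_size(counts, class_weights)
    if total == 0:
        return 0.0
    return 1.0 - sum((class_weights[c] * n / total) ** 2 for c, n in counts.items())


def gini_decrease(parent, left, right, class_weights):
    """Reduction in weighted Gini impurity when node J is split into J_L and J_R."""
    w_parent = weighted_size(parent, class_weights)
    w_left = weighted_size(left, class_weights)
    w_right = weighted_size(right, class_weights)
    return (weighted_gini(parent, class_weights)
            - w_left / w_parent * weighted_gini(left, class_weights)
            - w_right / w_parent * weighted_gini(right, class_weights))


def weighted_vote(leaf_counts_per_tree, class_weights):
    """Sum W_i * n_i over the leaf reached in every tree and return the top class."""
    totals = {}
    for counts in leaf_counts_per_tree:
        for c, n in counts.items():
            totals[c] = totals.get(c, 0.0) + class_weights[c] * n
    return max(totals, key=totals.get)


# Hypothetical weights and class counts at a node and its two children.
class_weights = {"repaid": 1.0, "default": 9.0}
parent = {"repaid": 90, "default": 10}
left = {"repaid": 80, "default": 2}
right = {"repaid": 10, "default": 8}
delta = gini_decrease(parent, left, right, class_weights)

# Class counts in the leaves that one test sample reaches in three trees.
leaves = [{"repaid": 6, "default": 1}, {"repaid": 4, "default": 1}, {"repaid": 7, "default": 0}]
prediction = weighted_vote(leaves, class_weights)
print("impurity decrease:", delta, "prediction:", prediction)
```

In this toy case an unweighted count would favor "repaid" (17 samples against 2), while the weighted vote favors "default" (18 against 17), which shows how the weights shift decisions toward the minority class.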

Construction of Financial Credit Risk Evaluation Index
(1) Principles for Construction of the Evaluation Index System
(1) Principle of comprehensiveness. The evaluation index system of personal credit risk of commercial banks must cover all factors affecting credit risk control. On the one hand, the impact of personal factors on credit risk should be considered; on the other hand, the impact of the borrower's social environment should be considered. Only by ensuring the comprehensiveness of the evaluation index system can we ensure the accuracy of the credit evaluation of borrowers [20,21].
(2) System optimization principle. Ensuring the comprehensiveness of the evaluation index system does not mean that more indicators are always better. The selection of indicators must follow the system optimization principle, that is, select the important indicators and eliminate redundant indicators with the same impact level, so that the whole index system fully reflects the actual situation of the borrower and forms an interconnected whole.
(3) Operability. The selection of personal credit risk evaluation indicators must be combined with the actual situation: select indicators for which actual data can be obtained, and ensure the reliability and accuracy of these data.
(4) Legitimacy and fairness. Legitimacy means that the establishment of the index system must be consistent with national macroeconomic policies and comply with the provisions of relevant laws and regulations. Fairness means that the personal credit risk evaluation index system is objective and fair: it is based on objective facts, scientifically and accurately reflects the basic attributes of commercial banks and loan applicants, and does not favor either party.
The personal credit risk evaluation index system is built by analyzing the evaluation index systems established by relevant research institutes at home and abroad, drawing on the experience of domestic and foreign experts in personal credit risk evaluation research, and considering whether correct data can be obtained for each specific index. The resulting system has three first-level indexes (basic personal information, economic status, and reputation status), each with its own second-level indexes. The specific index system is shown in Table 1.

(2) Basic Personal Information Indicators
(1) Age. Generally, the probability of default decreases as the borrower's age increases. Borrowers under the age of 20 may have no fixed income, but there are very few borrowers in this age group, so the overall risk of default is not high. Borrowers aged 35-55 have generally made achievements in their careers, their income is relatively stable, and their ability to repay the loan is generally strong. After the age of 55, expenditure on medical treatment and daily life may gradually increase with age, and the default risk of this group may be relatively high.

Scientific Programming
(3) Gender. Combining the two groups of statistical data on male and female loan default rates, the default rate of male borrowers is higher than that of female borrowers, and the difference is about 6%.
(4) Education level. Usually, the probability of default decreases as the borrower's education level rises. The personal quality of the borrower is generally positively correlated with the level of education. A borrower with a high level of education may have stronger self-restraint, so his or her personal reputation is relatively high.
(5) Occupation type. The occupation type of the applicant is closely related to the bank's decision on whether to lend. The unemployed will not become loan customers of the commercial bank. Borrowers have different occupations, and their incomes also vary greatly. Borrowers with good career prospects and stable income will generally become potential customers of commercial banks.
(6) Position. The loan default rate is also related to the position of the borrower. A borrower with a relatively high position may pay more attention to personal reputation, so the probability of default may be low.

(3) Economic Indicators.
(1) Monthly income. Monthly income is of great significance in judging the default risk of loan applicants. Generally speaking, the higher the monthly income, the lower the default risk, and vice versa. When loan applicants have the same willingness to repay, the higher the income level, the stronger the repayment ability. There is a positive correlation between income and the borrower's repayment ability.
(2) Proportion of monthly repayment in monthly disposable income. This proportion is also an important indicator of the borrower's repayment ability. The monthly repayment amount refers to the loan interest and principal repaid by the borrower to the bank every month, and the monthly disposable income is the income remaining after this repayment is deducted from the borrower's monthly income. Theoretically, the ratio fluctuates between 0 and 1. The smaller the ratio, the better the financial situation of the borrower.
(3) Loan term. The loan term refers to the period from the time when the commercial bank issues the loan to the borrower to the time when the borrower pays off the entire loan and interest. The length of the loan term is closely related to the loan interest rate. Generally, the longer the loan term, the higher the loan interest rate. The probability of default is higher for customers with long loan terms than for those with short loan terms. The longer the loan term, the worse the liquidity and the higher the risk.
(4) Loan amount. The loan amount is closely related to the amount the applicant applies for and the applicant's credit rating. Generally, borrowers with relatively small loan amounts have a relatively low probability of default, while customers with a high probability of default are mainly concentrated in the range of relatively high loan amounts. Therefore, the loan amount is included in the personal credit risk evaluation.
(5) Bank account records. Bank account records mainly refer to the records of customers opening accounts and handling bank cards in this commercial bank. Generally, the default risk of applicants who have not opened an account at the bank where they apply for the loan is higher than that of borrowers who have.

(4) Personal Reputation Indicators
(1) Historical credit records. Historical credit records refer to the loan applicant's past loan repayments, credit card overdrafts, guarantees, and other such information. Through the review of the loan applicant's historical credit records, we can preliminarily judge the credit situation of the applicant and the probability of default in the future.
(2) Personal judicial records. The review of personal judicial records mainly investigates whether the loan applicant has a criminal record; it is only a supplementary indicator.
(3) Other records of reputation damage. This covers personal reputation indicators other than historical credit records and personal judicial records, including whether the borrower has evaded taxes or is in arrears with water or electricity bills.

Financial Credit Risk Assessment Based on Principal Component Analysis.
The principal component analysis method is used for financial credit risk assessment. Principal component analysis is a basic means of multivariable data analysis. Its main idea is to map a group of multivariable information onto several principal components, which are then analyzed and processed as the overall information. Because there is no linear correlation between the principal components, the dimension of the original data is reduced, redundant information is effectively eliminated, and the noise in the original information is effectively suppressed. The method is especially suitable for situations with many complex original data samples. It can effectively reduce the dimension, remove noise, and improve the convergence accuracy and risk assessment accuracy for financial credit data.
If X is a data sample set of n samples described by m variables, the input is X (n × m) and the output is Y (n × m).
The kth principal component is $P_k = (P_{1k}, P_{2k}, \ldots, P_{mk})^T$, $k = 1, 2, \ldots, m$; the principal component matrix (i.e., the principal component score matrix) is the product of X and the principal component loadings:
$$T = XP.$$
Here, a single entry of the score matrix is $T_{ij} = \sum_{k=1}^{m} X_{ik} P_{kj}$, and the score vector of the jth principal component is $T_j = \sum_{k=1}^{m} X_k P_{kj}$, where $X_k$ denotes the kth column of X.
(1) Data Standardization Processing. The original financial loan risk data shall be processed as follows:
$$X'_{ij} = \frac{X_{ij} - M_j}{S_j}.$$
Here, $X'_{ij}$ represents the standardized financial credit risk data, and $M_j$ and $S_j$ represent the arithmetic mean and standard deviation of the jth column of the original financial credit risk data, respectively.
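A minimal sketch of this column-wise standardization follows. The three columns (age, monthly income, loan term in months) and their values are hypothetical, and the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which estimator is used.

```python
import statistics


def standardize_columns(rows):
    """Return rows with every column shifted to mean 0 and scaled to unit std."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]   # sample std, assumed
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical raw indicators: age, monthly income, loan term in months.
raw = [
    [28, 5200.0, 12],
    [41, 8800.0, 36],
    [35, 6100.0, 24],
    [52, 12500.0, 60],
]
Z = standardize_columns(raw)
for row in Z:
    print([round(v, 3) for v in row])
```

After this step, indicators measured on very different scales, such as income and age, contribute comparably to the covariance matrix computed next.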

(2) Calculation of Characteristic Matrix D of the Original Matrix
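The covariance matrix D of the standardized data has entries $D_{jk} = \operatorname{cov}(X'_j, X'_k)$, where $X'_j$ denotes the jth column of the standardized data.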

(3) Calculation of Eigenvector P and Eigenroot
When only the jth eigenvalue is considered, we have $D P_j = P_j \lambda_j$, where $\lambda_j$ satisfies the characteristic equation $|D - \lambda_j I| = 0$. Solving for λ and arranging the eigenvalues in descending order gives
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m.$$
Then, the eigenvector $P_j$ corresponding to each eigenvalue can be obtained by solving the eigenequation.
After the loadings of each principal component are obtained, the score matrix of the financial credit risk evaluation is obtained:
$$T = X'P.$$
According to the calculated financial credit risk results, the evaluation of financial credit risk can be completed, so that targeted financial credit risk control strategies can be formulated.
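The procedure described above (standardize, form the covariance matrix D, solve the eigenproblem, sort the eigenvalues, project onto the loadings) can be sketched with NumPy as follows. The data are synthetic, the 61 × 5 shape is only a stand-in, and the use of `numpy.linalg.eigh` and of the sample covariance (ddof=1) are implementation assumptions rather than details given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 61 samples, 5 indicators, two of them correlated.
X = rng.normal(size=(61, 5))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]

# Standardize each column to mean 0 and unit sample standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix D of the standardized data (columns are variables).
D = np.cov(Z, rowvar=False)

# D is symmetric, so eigh applies; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(D)

# Rearrange so that lambda_1 >= lambda_2 >= ... >= lambda_m.
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], eigvecs[:, order]

# Score matrix: projection of the standardized data onto the loadings.
T = Z @ P

# Share of total variance carried by each principal component.
explained = eigvals / eigvals.sum()
print("explained variance ratio:", np.round(explained, 3))
```

Keeping only the leading columns of `T` whose cumulative share in `explained` reaches a chosen level gives the reduced, decorrelated representation the text describes; the cutoff itself is a modeling choice that the paper does not specify.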

Experimental Setup.
We used Python's machine learning library Scikit-learn [22] for our experiments. Scikit-learn is an open source machine learning library written in Python that provides efficient implementations of various machine learning algorithms. It takes advantage of Python's interactivity and modularity to support fast and easy prototyping [22,23]. The library is used in a Windows environment on a standalone system with an Intel® Core™ i3-4010U CPU @ 1.70 GHz to simulate the credit risk assessment problem.

The Dataset.
The sample data selected in this paper are real data from the personal credit database of a branch of a commercial bank. A total of 75 data samples are randomly selected from the personal credit database. Some of them have missing values or are obviously inconsistent with the actual situation. After removing these invalid samples, 61 valid samples remain, of which 40 are training samples and 21 are test samples. The training samples are shown in Table 2.
The quantitative table of qualitative indicators is shown in Table 3.
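The sample counts above (75 drawn, 61 valid, 40 for training, 21 for testing) can be reproduced with a simple shuffle-and-slice split. The sketch below uses placeholder record IDs, and both the random choice of which 14 records are invalid and the random assignment to the two sets are assumptions, since the text does not describe how the split was made.

```python
import random

random.seed(0)

# Placeholder IDs for the 75 records drawn from the credit database.
sample_ids = list(range(75))

# Stand-in for the 14 records removed as incomplete or implausible.
invalid_ids = set(random.sample(sample_ids, 14))
valid_ids = [i for i in sample_ids if i not in invalid_ids]

# Shuffle the valid records, then take 40 for training and the rest for testing.
random.shuffle(valid_ids)
train_ids, test_ids = valid_ids[:40], valid_ids[40:]

print(len(valid_ids), len(train_ids), len(test_ids))
```

With real data, the invalid records would be identified by explicit checks on missing or out-of-range fields rather than drawn at random.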
In order to fully verify the performance of the proposed method, comparative verification experiments are carried out. The experimental scheme is as follows: taking the accuracy of financial credit risk data classification and the accuracy of risk assessment as the comparison indexes, the proposed method is compared with the financial credit risk assessment method based on the particle swarm optimization algorithm proposed in [12] and the financial credit risk assessment method based on the stack noise reduction self-coding network proposed in [13].

Accuracy of Financial Credit Risk Assessment.
The comparison results of the financial credit risk assessment accuracy of the three methods are shown in Figure 4.
The results in Figure 4 show that the risk threshold of the proposed method is basically consistent with the actual results, while the risk thresholds of the two comparison methods from the literature differ considerably from the actual values.

Conclusion
The awareness of credit risk and the need for a virtuous asset cycle force China's commercial banks to establish an effective credit risk monitoring and control platform. From the perspective of the virtuous cycle of commercial bank loans, establishing a scientific and reasonable credit risk evaluation index system and implementing a credit risk evaluation model for commercial banks has practical significance for the benign development of the financial industry, the healthy development of China's national economy, and the prevention of a global credit crisis. Therefore, a financial credit risk control strategy based on the weighted random forest algorithm is proposed, and its performance is verified from both theoretical and experimental aspects. The method has high accuracy in financial credit risk data classification and financial credit risk assessment. Specifically, compared with the method based on particle swarm optimization, the classification accuracy on financial credit risk data is significantly improved and always stays above 95%, and compared with the method based on the stack noise reduction self-coding network, the risk threshold is consistent with the actual results.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The author declares no conflicts of interest.