A Study on Early Warning of Financial Indicators of Listed Companies Based on Random Forest

Financial crises can have a negative impact on business operations, and in serious cases, they directly affect the survival and growth of a company. Therefore, the study of financial early warning based on financial indicators is particularly important. However, current research on financial early warning still has shortcomings: for example, it still relies on scoring methods or uses only a single model to construct the financial early warning algorithm. In view of these problems, this study mainly uses the random forest method combined with the decision tree algorithm to study the financial early warning problem of listed companies in China. Firstly, this paper uses the literature review method to analyse the relevant literature and generate the financial indicator system for this study. Subsequently, the financial data of A-share listed companies in China from 2013 to 2018 were collected as the research object, and, after data preprocessing, the importance ranking of financial indicators was generated by random forest modelling. On this basis, CART decision tree modelling was applied to generate and analyse early warning determination rules for the financial indicators. The results of the study show the importance ranking of financial indicators and six financial warning rules based on the CART decision tree. Through this research, it is expected to achieve the objective of providing early warning of the risk of financial crisis and to provide constructive financial warning solutions for relevant stakeholders.


Introduction
The establishment of early warning indicators for financial risks and the exploration of related models have always been a priority in corporate financial management. As economic globalisation progresses, competition between enterprises intensifies, the market becomes saturated, and the risks to the survival and development of enterprises increase. If an enterprise is not prudent enough in the internal control of financial risks, it may trigger an avalanche of financial crises. In many cases, it is the internal financial situation that has led to the downfall of a company. Therefore, early warning analysis, alerting, and control of the financial situation have become the key to the financial management of many companies.
There are many kinds of methods and models for conducting financial early warning. This paper uses the random forest method to rank the importance of a number of financial indicators on the screened financial data of listed companies in China. In addition, this paper uses the CART algorithm to analyse the factors influencing the generation of financial crises of listed companies in China based on the ranking of the importance of financial indicators. Based on the results of the final financial warning rule construction, the paper is summarised and evaluated in order to obtain a reasonable and effective financial warning solution that is practically meaningful to the relevant stakeholders of the listed companies in China.
In the early stage of financial early warning research, Fitzpatrick [1] was a pioneer in introducing quantitative analysis to financial risk early warning, with his innovative classification of financial indicators and the related study of a sample divided into insolvency and noninsolvency data sets. At the same time, he used a single ratio of corporate financial data as the independent variable factor and combined it with an innovative method of multiratio fusion analysis. Beaver [2] then identified better financial warning indicators such as gearing and return on assets. This led to the development of an effective early warning model using a fusion of multiple ratios.
Subsequently, around 1980, Altman [3] designed the Z-value model, which was very new at the time, and applied it to the field of financial early warning. The Z-score was generated by selecting factors as discriminatory variables to determine the financial condition (insolvency or otherwise) of a company. A few years later, Altman et al. [4] developed the ZETA model after nearly six years of research, collecting and studying financial data from over fifty insolvent companies at the time. Later, in order to overcome the shortcomings inherent in linear discriminant models, Martin [5] made an innovative use of logistic models in his study on predicting bank failures.
Around the 1990s, Zmijewski [6] followed the logistic model made by Martin and developed the complex Probit model. In addition, the study of financial early warning based on neural network technology is also worthy of attention. Lapedes and Farber [7] first applied neural network techniques to early warning and analysis of credit risk in banking practice. Several years later, Wilson and Sharda [8] also used neural network techniques in their study of corporate operating insolvency risk, building models with an accuracy of 97%. At the same time, they compared various models used to identify corporate insolvency and finally found that the neural network-based technique had some advantages.
At the beginning of the 21st century, Breiman [9] first proposed the random forest algorithm, which builds on decision trees, a basic machine learning algorithm. Since then, Richard et al. [10] have argued that the random forest algorithm has high classification accuracy, as well as the advantage of high flexibility in unsupervised learning, regression, and classification. In terms of innovation, Bernard et al. [11] proposed a dynamic random forest algorithm (DRF), which is a random forest induction algorithm developed based on adaptive decision tree induction. By guiding decision tree induction to enrich existing decision trees as much as possible to enhance the sampling rate, satisfactory prediction rates are eventually obtained. Based on previous research, Hapfelmeier et al. [12] concluded, after a large number of experiments, that the random forest algorithm can achieve good prediction rates on indicator data that need to be integrated and that have complex indicator correlations. In addition, in terms of neural network-based technology, Yang et al. [13] applied a forward neural network model (a three-layer model with an input layer, a hidden layer, and an output layer) for early warning of financial crises of enterprises, and the results show that the method is effective. Thereafter, Liu and He [14] optimized and built an artificial neural network financial early warning model.
In the last three years, research on financial early warning using data mining methods has remained popular. In 2020, Liu et al. [15] proposed a financial early warning model based on the AdaBoost strong classifier and selected 1350 groups of enterprise financial data for classification. The experimental results showed that the accuracy of the AdaBoost-based strong classifier was higher than that of the BP neural network-based weak classifier. In 2021, Jia [16] constructed a corporate financial crisis early warning model based on time series and the random forest algorithm and proposed an improved K-fold random forest algorithm, whose model accuracy improved by 1.54% compared with the traditional random forest model. In 2022, Wang et al. [17] applied the Wilcoxon rank sum nonparametric test and principal component analysis for feature engineering and used the logistic regression model to study the financial early warning of forestry companies listed on the Shanghai and Shenzhen A-share markets in China, and the accuracy of their model reached more than 80%.
It is clear from the above literature that financial early warning is still highly dependent on the construction of models and that the key to the construction of each model is the set of input financial indicators. At the same time, we can also see that random forest has significant advantages in the selection of indicators for the models. Therefore, this study aims to combine the random forest algorithm and the CART tree model in order to analyse the factors influencing the financial crisis of the listed companies in China.

Approach to Financial Early Warning.
Financial early warning is a key task in the financial management of an enterprise. In the financial practice of enterprises, financial early warning usually requires indicators to be selected and set in advance from the relevant financial data. Based on these predefined sets of indicators and according to certain judgment rules, the occurrence of financial crises is monitored and predicted in advance. The use of computer technology (e.g., decision trees, neural networks, and random forests) is therefore almost always an important part of the process of detecting and forecasting financial indicators.

Design of the Financial Early Warning Method Based on Random Forest

The Idea of a Random Forest.
A random forest is the result of repeated optimisation of decision trees: it combines the integration of decision trees into a forest with extensive randomisation. The basic principle is that the basic elements of each random forest are mutually unrelated decision trees $t_1(x), t_2(x), t_3(x), \ldots, t_n(x)$.
Since they are called decision trees, they may include both binary trees, which are less precise but efficient, and multiway trees, which are precise but less efficient. We then introduce our dataset into these randomly generated decision trees to determine the classification, which we refer to as building a random forest classifier. At the same time, we vote on the classification results of the relevant datasets from the many randomly generated decision trees to obtain the final classification results. In short, the purpose of building a random forest classifier is to determine which class an input dataset belongs to, and the purpose of voting on the classification results is to find the classification that is chosen most often. Here, we define the total data set as $S(x) = S(x_{a1}, x_{a2}, \ldots, x_{aF})$. There are N subdata in the S(x) dataset, and each corresponds to F features $x_{a1}, x_{a2}, \ldots, x_{aF}$.
Usually, the F discriminatory features $\{x_{a1}, x_{a2}, \ldots, x_{aF}\}$ are evaluated by calculating the degree of uncertainty of the source between the different features (also known as information entropy). There are generally three decision tree methods in chronological order: ID3, C4.5 (with its successor C5.0), and CART, whose corresponding criteria are the information gain, the information gain ratio, and the Gini index, respectively. By calculating these values, we can obtain the optimal splitting attribute at a node. If the relevant attribute value meets the discrete condition, then forking can continue. The relevant formulas are as follows.
The information entropy of the total sample S(x), which Shannon's information theory relates directly to the purity of the source data, is as follows:
$$H(S) = -\sum_{i} P_i \log_2 P_i$$
In the above formula, $P_i$ is the proportion of samples of type i in the total sample S(x). When one of the F features $x_{a1}, x_{a2}, \ldots, x_{aF}$ is used for decision bifurcation, the total sample S(x) is divided into a total of k parts $S_1, S_2, \ldots, S_k$. The resulting information entropy after the split and the information gain are as follows:
$$H_A(S) = \sum_{v=1}^{k} \frac{|S_v|}{|S|} H(S_v), \qquad \mathrm{Gain}(A) = H(S) - H_A(S)$$
Thereafter, the information gain ratio, which improves on the information gain, is judged to have even better accuracy because of the penalty factor attached to its denominator:
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{-\sum_{v=1}^{k} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}$$
The Gini coefficient is a discriminant of information purity introduced after the information gain ratio and has higher computational efficiency because it does not require the calculation of logarithms. It is as follows:
$$\mathrm{Gini}(S) = 1 - \sum_{i} P_i^2$$
As the three information purity discriminants above differ little in their influence on the final result, the Gini coefficient is chosen as the criterion for discriminating at the random forest nodes. Finally, the random forest ends when the feature attributes have been exhausted, the decision tree has reached its maximum depth, the Gini coefficient has reached a previously determined threshold, or the number of data items at a terminal node has reached a previously given value.
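As a rough illustration of the purity criteria described above, the following Python sketch computes the information entropy, information gain, gain ratio, and Gini index for a small list of class labels. The function names and the toy labels are assumptions made for illustration; they are not taken from the study's implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(S) = -sum(P_i * log2(P_i)) over the class shares."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index 1 - sum(P_i^2); no logarithm is needed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(labels, partitions):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(labels)
    after = sum(len(p) / n * entropy(p) for p in partitions if p)
    return entropy(labels) - after


def gain_ratio(labels, partitions):
    """Information gain divided by the split information (penalty factor)."""
    n = len(labels)
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in partitions if p)
    return information_gain(labels, partitions) / split_info if split_info > 0 else 0.0


# Hypothetical toy sample: 1 = ST company, 0 = non-ST company.
labels = [1, 1, 0, 0]
partitions = [[1, 1], [0, 0]]  # a perfect split into k = 2 parts
print(entropy(labels), gini(labels))
print(information_gain(labels, partitions), gain_ratio(labels, partitions))
```

For a perfectly mixed two-class sample the entropy is at its maximum and the Gini index equals 0.5; a split that separates the classes completely recovers all of that uncertainty as gain.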

Indicator Screening Design for Random Forests.
Due to the ease of operation of the random forest, the process design consists of only an input layer and an output layer. The input data contains the number of feature classes F, the total sample set S(x), the number of decision trees nTree, the depth of the tree h, and related parameters, as well as the filtering end condition. The output layer contains the results of the random forest feature selection and the model built by the random forest. The specific random forest process has seven steps as follows:
Step 1: for the total data set S(x), expect to generate a random forest with a total of nTree trees, indexed j = 1, ..., nTree.
Step 2: using bootstrap-based bagging, sample repeatedly with replacement to select a training set of size X.
Step 3: select F features at the node forks of the decision tree, filter them to find the best feature, and divide the data set accordingly.
Step 4: generate all decision trees accordingly, completing the algorithm.
Step 5: calculate the probability that a given uncertain sample a in the test set is classified as N.
Step 6: obtain the classification error while selecting the best category N after voting.
Step 7: return the classification results generated by the random forest and the model.
Discrete Dynamics in Nature and Society
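The steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the R implementation used in the study: each "tree" is only a depth-1 stump, the candidate features are drawn once per stump, and the names `fit_stump`, `fit_forest`, and `predict` as well as the toy data are assumptions.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(rows, labels, feature_ids):
    """Depth-1 tree: pick the (feature, threshold) with the lowest weighted Gini."""
    majority = Counter(labels).most_common(1)[0][0]
    # Fallback when no split exists: always predict the majority class.
    best = (float("inf"), feature_ids[0], float("inf"), majority, majority)
    for f in feature_ids:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best[1:]  # (feature, threshold, left class, right class)


def fit_forest(rows, labels, n_trees, mtry, seed=0):
    """Steps 1-4: bootstrap a sample and fit one stump per tree."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # sampling with replacement
        feats = rng.sample(range(n_features), mtry)   # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Steps 5-6: every tree votes and the most frequent class wins."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Hypothetical, already normalised toy data: 1 = ST, 0 = non-ST.
rows = [[0.1, 1.0], [0.2, 2.0], [0.3, 3.0], [0.7, 7.0], [0.8, 8.0], [0.9, 9.0]]
labels = [1, 1, 1, 0, 0, 0]
forest = fit_forest(rows, labels, n_trees=25, mtry=1)
print([predict(forest, r) for r in rows])
```

A full random forest grows deep trees and redraws the candidate features at every node; the stump version keeps the sketch short while still showing bootstrap sampling, random feature selection, and majority voting.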

Evaluation of Indicator Screening Results for Random Forests.
The process of constructing a financial early warning model using the random forest algorithm generates two pieces of data, the OOB (out-of-bag) error rate and the AUC value under the ROC curve. The smaller the out-of-bag error rate or the higher the AUC value, the more effective the random forest model will be.
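To make the two evaluation measures concrete, the sketch below computes an out-of-bag error rate from per-sample OOB votes and an AUC value from predicted scores. The helper names and all input values are hypothetical and serve only to show the calculations; they are not results from this study.

```python
from collections import Counter


def oob_error(oob_votes, labels):
    """Share of samples misclassified by the trees that did NOT train on them.

    oob_votes[i] holds the predictions for sample i from those trees whose
    bootstrap sample excluded it; samples without any OOB vote are skipped.
    """
    wrong = counted = 0
    for votes, y in zip(oob_votes, labels):
        if not votes:
            continue
        counted += 1
        wrong += Counter(votes).most_common(1)[0][0] != y
    return wrong / counted


def auc(scores, labels):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical values for four companies (1 = ST, 0 = non-ST).
labels = [1, 1, 0, 0]
oob_votes = [[1, 1, 0], [0], [], [1, 0, 0]]
scores = [0.9, 0.3, 0.8, 0.1]  # predicted probability of being ST
print(oob_error(oob_votes, labels), auc(scores, labels))
```

The pairwise definition of AUC used here is equivalent to the area under the ROC curve, which is why a value close to 1 indicates that ST companies are almost always scored above non-ST companies.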

The Idea of a CART Decision Tree.
The core of the CART decision tree algorithm is the selection of features for the original dataset and the pruning of the decision tree after forking.
For a CART decision tree T constructed on a dataset S(x), a total of N categories are covered in the dataset, and $P_j$ is defined as the proportion of data of category j in the total dataset S(x). The Gini index is given by the following formula:
$$\mathrm{Gini}(S(x)) = 1 - \sum_{j=1}^{N} P_j^2$$
The data set at the node S(x) is then split into the subsets $S(x_1)$ and $S(x_2)$, whose Gini index is calculated as follows:
$$\mathrm{Gini}_{split}(S(x)) = \frac{|S(x_1)|}{|S(x)|}\,\mathrm{Gini}(S(x_1)) + \frac{|S(x_2)|}{|S(x)|}\,\mathrm{Gini}(S(x_2))$$
After the best feature indicator and its value at the bifurcation node have been selected by the above formula, the data set S(x) is divided. If the Boolean value of a piece of data is true, the data is placed in the left child node at that node, and if it is false, it is placed in the right child node. By calculating and placing the data in this order, a CART decision tree can eventually be constructed.

Design of the CART Decision Tree.
The data set is now defined as S(x). A total of N data exist for each category of indicators. The algorithm consists of an input layer and an output layer. The data that should be input are the total data set S(x) and the relevant threshold conditions. The specific process steps are as follows:
Step 1: enter the relevant dataset and the associated threshold conditions.
Step 2: calculate the overall Gini index of the data set S(x) at the node; meanwhile, for each sample feature K, determine its candidate feature values k. Then divide the data into the subsets S(x 1 ) and S(x 2 ) according to the Boolean test of each sample on feature K against the value k, and calculate the Gini index after the division.
Step 3: after traversing all features K and feature values k, select the feature with the smallest Gini index and its feature value as the node segmentation indicator to assign the data.
Step 4: repeat steps 1 to 3 until the stopping condition is met.
Step 5: return the classification results from the CART decision tree and the model.
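The five steps can be illustrated with a small recursive implementation. The sketch below is a minimal version under stated assumptions: it uses `max_depth` and `min_samples` as the threshold conditions, omits pruning, and the function names and toy values are hypothetical rather than taken from the SPSS Modeler workflow used in this paper.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Steps 2-3: try every feature K and value k, keep the lowest split Gini."""
    best = None
    for k_feature in range(len(rows[0])):
        for value in sorted({r[k_feature] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[k_feature] <= value]
            right = [y for r, y in zip(rows, labels) if r[k_feature] > value]
            if not right:
                continue  # the largest value cannot separate anything
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, k_feature, value)
    return best


def build_tree(rows, labels, max_depth=3, min_samples=2, depth=0):
    """Step 4: split recursively until a stopping condition is met."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(labels) < min_samples or gini(labels) == 0.0:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:
        return {"leaf": majority}
    _, feature, value = split
    left = [i for i, r in enumerate(rows) if r[feature] <= value]
    right = [i for i, r in enumerate(rows) if r[feature] > value]
    return {
        "feature": feature,
        "value": value,
        # True (<= value) goes to the left child, False to the right child.
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           max_depth, min_samples, depth + 1),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            max_depth, min_samples, depth + 1),
    }


def classify(tree, row):
    """Step 5: follow the Boolean tests down to a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["value"] else tree["right"]
    return tree["leaf"]


# Hypothetical normalised data: two indicators per company, 1 = ST, 0 = non-ST.
rows = [[0.20, 0.05], [0.25, 0.01], [0.40, 0.06], [0.50, 0.02]]
labels = [1, 1, 0, 0]
tree = build_tree(rows, labels)
print(tree)
```

Each internal node stores one feature and one threshold, so every path from the root to a leaf can be read off directly as an "if ... then ..." warning rule.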

Evaluation of the Results of the CART Decision Tree.
As the CART decision tree is constructed, the corresponding confidence values are calculated. In layman's terms, the confidence level of a rule reflects how often the class predicted by the rule agrees with the actual class of the data items that the rule covers. It can be interpreted as both the accuracy and the reliability of the algorithm. The higher the confidence level of an algorithm or rule, the better the prediction of the system.
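One common way to compute such a confidence value, shown here as an assumption rather than as the exact definition used by the modelling software, is the share of training items in a leaf that carry the class the leaf predicts. The function name and the toy leaf are hypothetical.

```python
from collections import Counter


def rule_confidence(leaf_labels):
    """Return the class a leaf predicts and the share of its items in that class."""
    predicted, hits = Counter(leaf_labels).most_common(1)[0]
    return predicted, hits / len(leaf_labels)


# Hypothetical leaf holding four training companies (1 = ST, 0 = non-ST).
print(rule_confidence([1, 1, 1, 0]))
```

Under this reading, a rule that covers only a handful of items can show a very high confidence value, so the number of covered items should be considered alongside the percentage.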

Example Analysis of Financial Early Warning Based on Random Forest

Selection of the Sample.
The method for determining whether a listed company in China is in financial crisis has, at this stage, reached a consensus in the academic community and the securities industry.
Therefore, this paper adopts the indicator of "whether a listed company is on the ST warning board" as the basis for judging whether a company is in financial crisis. Companies on the ST warning board (including ST companies and *ST companies) are considered to be in financial crisis, while companies not on the ST warning board are considered to be financially healthy. Generally speaking, a company will only be placed on the ST warning board if its net asset value per share is lower than its value per share in the market, and if the company's annual reports for two consecutive financial periods show consecutive losses (i.e., the net profit in the company's financial statements is negative for two consecutive years).
In this paper, the relevant financial data of all A-share listed companies from 2013 to 2018 were collected from the Wind database, and the data of all ST companies (148 companies in total) in these six years were extracted and sorted. Subsequently, in order to avoid the impact of data imbalance (i.e., too few ST companies) on the subsequent experiments, 302 non-ST companies in the A-share market from 2013 to 2018 were randomly selected, and together these form the data pool of A-share listed companies in China (450 in total) used in this paper.

Determination of Financial Indicators.
By combining the statistics of the indicators used in the financial early warning indicator systems found in the literature review (see Table 1) with an analysis of the specific financial indicators of A-share listed companies in China, we finally identified a total of 24 financial indicators (see Table 2) to participate in the construction of the random forest model.

Data Standardisation of Financial Indicators.
Firstly, since the missing values account for a small proportion of the total data and are numeric, they can be filled in with the mean value. In addition, the data are in line with a normal distribution, and the principle of missing value filling proposed by Anderson et al. [18] is that "under normal distribution, the sample mean is the best possible value to be estimated." Therefore, in order to ensure the integrity of the experimental data and the reasonableness of the results as far as possible, filling in the mean value is the best solution at this stage.
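Mean filling for a single indicator column can be sketched as follows. Representing a missing value as `None`, the function name, and the sample column are assumptions made for illustration.

```python
from statistics import mean


def fill_missing_with_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    column_mean = mean(observed)
    return [column_mean if v is None else v for v in column]


# Hypothetical indicator column with one missing value.
print(fill_missing_with_mean([1.0, None, 3.0]))
```

The mean is computed from the observed values only, so the filled column keeps the same mean as the original observations.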
At the same time, as the financial indicators finally identified in this paper cover a large number of data rows and the data vary greatly in order of magnitude, a data standardisation operation should be performed prior to data analysis. If raw data with such widely different scales are used directly, the final results will be biased towards indicators with large numerical values, which leads to poor model building results.
There are many methods of data normalisation, such as extreme value normalisation, Z-score normalisation, and so on. Each of them has its own corresponding advantages. Because extreme value normalisation is bounded and effective, we focus on it here. Extreme value normalisation, also known as extreme value difference normalisation, is the process of scaling data to between 0 and 1 by an equal-proportion transformation. For a data column $x_1, x_2, x_3, \ldots, x_n$, extreme value normalisation is performed to produce a normalised data column $y_1, y_2, y_3, \ldots, y_n$, and the transformation formula is as follows:
$$y_i = \frac{x_i - \min_{1 \le j \le n} x_j}{\max_{1 \le j \le n} x_j - \min_{1 \le j \le n} x_j}$$
After this transformation, the raw data fall between 0 and 1, which eliminates the original dimensional limits and the effects of different orders of magnitude and facilitates the subsequent experiments.
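The transformation can be written as a short function. This is a minimal sketch; the function name, the handling of a constant column, and the sample values are assumptions.

```python
def min_max_normalise(column):
    """Scale a data column to the range [0, 1] by extreme value normalisation."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumed convention: a constant column carries no information.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical raw indicator values on an arbitrary scale.
print(min_max_normalise([2.0, 4.0, 6.0]))
```

Because the smallest value maps to 0 and the largest to 1, thresholds reported on normalised indicators refer to a position within the observed range rather than to raw financial values.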

Optimization of Parameters for Random Forest Modelling.
Due to the advantage of random forest in being able to handle a large number of features well, this paper puts all the financial indicators identified above into the random forest algorithm to construct the corresponding financial warning model. The 24 identified financial indicators, such as gross sales margin, net sales margin, return on net assets, return on total assets, and current ratio, are used as independent variables, while whether the listed company is placed on the ST warning board (hereinafter referred to as ST, with a value of 0 for non-ST companies and 1 for ST companies) is used as the dependent variable for prediction. On top of this, 70 per cent of the sample data set was used as training samples, and the remaining 30 per cent was used as test samples to participate in the construction of the random forest model.
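A 70/30 split of the 450 companies can be sketched as below. The simple random shuffle, the fixed seed, and the function name are assumptions; the paper does not state whether the split was stratified by ST status.

```python
import random


def train_test_split(rows, labels, train_share=0.7, seed=0):
    """Shuffle the sample indices and cut them into training and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(train_share * len(indices))
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


# Stand-in rows for the 450 companies: 148 ST (1) and 302 non-ST (0).
rows = list(range(450))
labels = [1] * 148 + [0] * 302
x_train, y_train, x_test, y_test = train_test_split(rows, labels)
print(len(x_train), len(x_test))
```

With 450 companies this yields 315 training samples and 135 test samples.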
Subsequently, we obtained the in-bag error rate of the model corresponding to each mtry value by traversing all mtry values (i.e., the number of random variables in the process of constructing a random forest) in the range of 1 to 24. As shown in Figure 1, the in-bag error rate reaches its lowest value at the eighth step of the parameter search, i.e., when the mtry value is 8, the in-bag error rate is 0.0036. At the same time, the in-bag error rate for subsequent iterations increases successively, so the optimal value of mtry for this model is 8. Similarly, with mtry fixed at the value that gave the lowest in-bag error rate (i.e., 8), we modelled and traversed a range of ntree values (i.e., the number of decision trees in the random forest construction process) to obtain a graph of the change in the in-bag error rate of the model as the ntree value changed. As shown in Figure 2, the in-bag error rate tends to stabilise when the ntree value is greater than 800.
[Table fragment from the literature indicator statistics. Columns: Zhou [24], Meng [25], Zhang [26], Yang [27], Wu [28], Song [29], Yang [30], Lu [31], Fu [32], Wang [33], Feng [34], Chen [35], Wang [36], Counting. Recoverable rows: inventory turnover ratio; net asset growth rate.]
Through the parameter search process, we found that mtry values that are either too high or too low will affect the final random forest prediction. At the same time, a high ntree value will cause a rapid increase in model complexity and affect the computational efficiency, while a low ntree value will directly affect the model performance. The result of the above parameter search is that, with an mtry value of 8 and an ntree value of 800, the error rate of the random forest model constructed in this paper is the lowest.

Random Forest Model Results.
Using the final optimisation results above for the mtry and ntree values, we determined that the random forest achieves its best prediction efficiency when the mtry value is 8 and the ntree value is 800, and the model was therefore built with these optimal parameters. Figures 3-5 show the final results of random forest modelling.
At the same time, the results of the R-language random forest run in Figure 3 show that when the random forest-based financial early warning model is constructed with the settings described above, the out-of-bag error rate of the final model is 8.41%. From the perspective of the out-of-bag error rate, this random forest model has excellent prediction results. The final ranking of the indicators generated by the random forest algorithm takes two forms. The first, the left-hand panel of Figure 4, is ranked according to mean decrease accuracy (MDA). This evaluation scale is generated based on the out-of-bag (OOB) error rate. Under this measure, the horizontal coordinate indicates the extent to which the prediction accuracy of the constructed model decreases when a variable is replaced with a random number. Thus, the larger the value of an indicator in this ranking, the more important the indicator is.
Secondly, the right-hand panel of Figure 4 is constructed on the basis of the mean decrease Gini. This evaluation scale is based on the Gini coefficient. The horizontal coordinate indicates the differential impact of a feature on the observed training values at the nodes of the random decision tree when it is replaced. Similarly, a larger value of the horizontal coordinate of a feature indicator indicates that the feature is more important.
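The idea behind mean decrease accuracy can be sketched as a permutation test: shuffle one feature column and measure how much the accuracy falls. The sketch below is a simplified illustration that evaluates a given prediction function on one data set, whereas random forest implementations typically compute this per tree on the out-of-bag samples. The stand-in model, the function names, and the data are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling the values of one feature column."""
    rng = random.Random(seed)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)


def model(row):
    """Hypothetical stand-in model that only looks at feature 0."""
    return 1 if row[0] <= 0.277 else 0


rows = [[0.1, 0.9], [0.2, 0.1], [0.3, 0.8], [0.4, 0.2]]
labels = [model(r) for r in rows]
for feature in range(2):
    print(feature, permutation_importance(model, rows, labels, feature))
```

A feature that the model never uses shows an importance of exactly zero, while shuffling a feature the model relies on can only lower the accuracy in this example.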
The evaluation rule based on mean decrease accuracy relies on the out-of-bag error rate (OOB error), which embodies the core bagging idea of the random forest algorithm. It ensures that the distribution of the data after feature replacement is infinitely close to the original (i.e., resampling with replacement). In contrast, the mean decrease Gini-based evaluation rule is based on the binary splitting idea of the CART algorithm, which is less compatible with the overall idea of the random forest algorithm. Therefore, the importance ranking of financial indicators based on mean decrease accuracy is more appropriate. The results of the specific indicator importance ranking are as follows: (1) net assets per share; (2) total assets (year-on-year growth rate); (3) net profit (year-on-year growth rate); (4) total asset turnover ratio; (5) earnings per share (diluted). Figure 5 shows the ROC plot for the final random forest financial early warning model of this paper. As can be seen, the AUC value of the financial early warning model constructed in this paper is 0.909, which is a fairly high level. Therefore, from the perspective of the ROC curve, this model has a very high predictive performance.
Taken together, the financial early warning model constructed in this paper has a low out-of-bag error rate (OOB error = 8.41%) as well as a very high AUC value (AUC = 0.909). By comparison, the OOB error rate for the risk evaluation of real estate projects using the random forest algorithm by Li and Shenjiang et al. [19] is 10.53%. The OOB error rate for the financial failure prediction of listed companies using the random forest algorithm by Zhou et al. [20] is 26.37%. Zhang [21] obtained an AUC value of 0.8666 with the shuffle-based random forest technique and an AUC value of 0.8404 with the embedded-based random forest technique.
In addition, among the latest financial early warning research results in the past three years, Liu et al. [15] used a financial early warning model based on the AdaBoost strong classifier with an accuracy of 96%, which was higher than the accuracy of the weak classifier model based on the BP neural network (91.54%). Jia [16] used an improved K-fold random forest algorithm based on time series for financial early warning with an accuracy of 90.327%. The final accuracy of the financial early warning model of Wang and Lu [17], which used a logistic model and a PCA method based on the Wilcoxon rank sum nonparametric test to process the financial data of A-share listed forestry companies, was 86.7%. Therefore, with an OOB error of 8.41% and an AUC of 0.909, it can be concluded that the financial early warning model of listed companies in this paper possesses a good predictive effect.

Discovery of Financial Warning Rules Based on CART Decision Tree Construction.
Through the screening process for the listed companies' financial data columns after the standardisation described above, we imported these data into the SPSS Modeler software for rule building based on the CART decision tree. After computation, the CART tree diagram was drawn as shown in Figure 6, in which four financial indicators, namely net assets per share, total assets (year-on-year growth rate), gross sales margin, and current assets turnover ratio, were retained to participate in the construction of the CART decision tree (see Figure 7). The early warning rules based on the CART decision tree and their confidence values are shown in Figure 6. There are three rules for determining that a company is in financial crisis (i.e., the company is ST) and three rules for determining that a company is financially healthy (i.e., the company is not ST).
The rules for determining that an enterprise is in financial crisis are as follows. Firstly, an enterprise is considered to be in financial crisis when its net assets per share (normalised by the extreme value method described above, as are all the indicators below) are less than or equal to 0.277. The confidence level of this rule is 95.8%, and 71 data items were involved in the training. Secondly, an enterprise is considered to be in financial crisis when its net assets per share are greater than 0.277 but its total assets (year-on-year growth rate) are less than or equal to 0.031. The confidence level of this rule is 68%, and 25 data items were involved in the training. Thirdly, an enterprise is considered to be in financial crisis when its net assets per share are greater than 0.277 and less than or equal to 0.329, its total assets (year-on-year growth rate) are greater than 0.031, its gross sales margin is less than or equal to 0.466, and its current asset turnover ratio is less than or equal to 0.113. The confidence level of this rule is 100%, and seven data items were involved in the training.
The rules for determining that an enterprise is financially healthy are as follows. Firstly, an enterprise is considered to be financially healthy when its net assets per share are greater than 0.277 and less than or equal to 0.329, its total assets (year-on-year growth rate) are greater than 0.031, its gross sales margin is less than or equal to 0.466, and its current asset turnover ratio is greater than 0.113. The confidence level of this rule is 75%, and 4 data items were involved in the training. Secondly, an enterprise is considered to be financially healthy when its net assets per share are greater than 0.277 and less than or equal to 0.329, its total assets (year-on-year growth rate) are greater than 0.031, and its gross sales margin is greater than 0.466. The confidence level of this rule is 86.2%, and 29 data items were involved in the training. Thirdly, an enterprise is considered to be financially healthy when its net assets per share are greater than 0.329 and its total assets (year-on-year growth rate) are greater than 0.031. The confidence level of this rule is 98.4%, and 191 data items were involved in the training.
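The six rules above can be written as a single decision function. The sketch below encodes the thresholds and confidence levels reported in the text; the function name, the argument names, and the returned labels are assumptions, and all inputs are expected to be normalised to the range 0 to 1 in the same way as the training data.

```python
def classify_company(net_assets_per_share, total_asset_growth,
                     gross_sales_margin, current_asset_turnover):
    """Return (status, confidence) according to the six CART warning rules."""
    if net_assets_per_share <= 0.277:
        return ("crisis", 0.958)        # crisis rule 1
    if total_asset_growth <= 0.031:
        return ("crisis", 0.68)         # crisis rule 2
    if net_assets_per_share > 0.329:
        return ("healthy", 0.984)       # healthy rule 3
    # From here on: 0.277 < net assets per share <= 0.329, growth > 0.031.
    if gross_sales_margin > 0.466:
        return ("healthy", 0.862)       # healthy rule 2
    if current_asset_turnover <= 0.113:
        return ("crisis", 1.0)          # crisis rule 3
    return ("healthy", 0.75)            # healthy rule 1


# Hypothetical normalised indicator values for one company.
print(classify_company(0.30, 0.05, 0.40, 0.20))
```

Written this way, the rules are mutually exclusive and cover every combination of the four indicators, which mirrors the fact that each company ends up in exactly one leaf of the tree.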

Conclusion
In view of the actual situation of the financial data of the listed companies in China, we constructed a financial early warning model based on the random forest algorithm and the CART decision tree algorithm. The method uses the random forest algorithm to rank the importance of many financial indicators, which can significantly improve the predictive effect of subsequent modelling and is in line with the principles underlying the establishment of financial warning models. In addition, the quantitative analysis of the filtered indicators using the CART decision tree algorithm to determine the decision thresholds for specific financial warning indicators can significantly improve the operability of the model. The results of the study show that the method has an excellent predictive effect. At the same time, the method can provide the listed companies' management and investors with a prediction model that can quantify the financial situation of the listed companies.

Data Availability
The Wind China Listed Companies Dataset was used to support the findings of this study.

Conflicts of Interest
The author declares that there are no conflicts of interest.