A Study on RB-XGBoost Algorithm-Based e-Commerce Credit Risk Assessment Model

Current e-commerce credit risk assessment methods suffer from poor data balance and low evaluation accuracy. This study proposes an e-commerce credit risk assessment model based on the RB-XGBoost algorithm. The adaptive random balance (RB) method is used to sample and process the obtained data to improve the balance of the data. An assessment index system is constructed from the processed data. Based on this risk assessment index system and the XGBoost algorithm, an e-commerce risk assessment model is constructed and used to assess e-commerce credit risk. The experimental results show that the proposed method achieves good data balance, a high kappa coefficient, and a large area under the receiver operating characteristic (ROC) curve, which effectively improves the accuracy of e-commerce credit risk assessment.


Introduction
E-commerce is now widespread in society, and informatization has become an inevitable trend and a core component of e-commerce, with a significant impact on culture, society, and politics [1,2]. In network economic activities, e-commerce technology improves resource allocation and enhances China's economic competitiveness. Progress in e-commerce technology is therefore of great significance for economic growth, industrial structure optimization, and the quality and efficiency of economic operation in China. However, credit crises introduce great risks into the practical application of e-commerce and seriously restrict its steady development. It is therefore necessary to study e-commerce credit risk assessment methods to avoid the risks in e-commerce transactions.
Wu et al. reduce the e-commerce credit assessment indicators with a rough set method to obtain the important influencing factors of the assessment [3]. In [4], a C-XGBoost model is established to forecast each of the clusters produced by a two-step clustering algorithm, with sales features incorporated into the C-XGBoost model as influencing factors of the forecast. Aiming at the customer characteristics of social network e-commerce, Zhuang builds a customer value model that integrates the value of the social network to help companies segment customers accurately [5]. To improve the prediction of consumer purchasing behaviour on e-commerce platforms, a new prediction method is proposed in [6]. In the support vector regression method, a particle swarm optimization algorithm is introduced to optimize the model parameters, and the optimized model is used to complete the assessment of e-commerce credit risk. This method is effective, but the data imbalance rate it produces is high, leading to poor data balance. Chang et al. determine the risk assessment indicators from the actual transaction situation and relevant literature and construct a two-layer hybrid model that combines the back propagation (BP) neural network and the naive Bayesian algorithm to evaluate the credit risk of e-commerce [7]. This method has relatively high assessment stability but does not process the data set before assessment, which results in an unsatisfactory ROC curve and low assessment accuracy. An e-commerce credit risk assessment model based on the RB-XGBoost algorithm is proposed to solve the problems of the above methods.
The e-commerce credit risk assessment model based on the RB-XGBoost algorithm samples and processes e-commerce risk data with the adaptive random balance (RB) method to reduce data imbalance [8][9][10]. The specific process is shown in Figure 1.
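The sampling step can be illustrated with a minimal sketch. The paper does not give the details of its adaptive RB procedure, so the function name `balanced_subsets`, the choice of the median class size as the common target, and plain random resampling are all assumptions made for illustration; they are not the authors' exact method.

```python
import numpy as np

def balanced_subsets(X, y, n_subsets=3, seed=0):
    """Build n_subsets resampled copies of (X, y) in which all classes have equal size.

    Assumption: every class is resampled to the median class size.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = int(np.median(counts))  # hypothetical common class size
    subsets = []
    for _ in range(n_subsets):
        picked = []
        for c, count in zip(classes, counts):
            idx = np.flatnonzero(y == c)
            # Undersample large classes without replacement,
            # oversample small classes with replacement.
            replace = bool(count < target)
            picked.append(rng.choice(idx, size=target, replace=replace))
        picked = np.concatenate(picked)
        subsets.append((X[picked], y[picked]))
    return subsets
```

With 90 majority and 10 minority samples, each generated subset contains 50 samples of each class under these assumptions.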

(II) Grey correlation analysis of data
Let m denote the number of e-commerce enterprises and n the number of risk assessment indicators, and let $x_i = \{x_i(1), x_i(2), \cdots, x_i(n)\}$ describe the ith e-commerce enterprise sample, where $i = 1, 2, \cdots, m$.
An ideal sequence $x_0$ is set, in which $x_{0j} = \max_i \{x_{ij}\}$ for a positive index and $x_{0j} = \min_i \{x_{ij}\}$ for a negative index.
The dimensions of different risk assessment indices differ, so the data dimensions must be eliminated before the data are compared [11,12]. Each negative index is converted into a positive index, and the data are normalized using $x_{\min}$ and $x_{\max}$, the minimum and maximum values of the jth risk assessment index, where $x_{ij}$ is the value of the jth index for the ith e-commerce enterprise. A correlation coefficient $\xi_{ij}$ is then calculated, in which ∂ represents the resolution coefficient. The correlation degree $r_j$ is calculated from the correlation coefficients.

(III) Risk assessment index system
The risk assessment indices are sorted according to their relevance. In the assessment process, the assessment indices with $r_j > r_0$ are selected to build the risk assessment index system [13,14], as shown in Figure 2.
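The grey correlation screening described above can be sketched as follows. The display formulas are not available in this text, so the sketch assumes the common forms: min-max normalization, Deng's grey relational coefficient $\xi_{ij} = (\Delta_{\min} + \rho\,\Delta_{\max}) / (\Delta_{ij} + \rho\,\Delta_{\max})$ with $\Delta_{ij} = |x_{0j} - x_{ij}|$, and the correlation degree $r_j$ as the mean of $\xi_{ij}$ over the enterprises. The function name, the parameter `rho` (the resolution coefficient, written ∂ in the text), and the threshold value are hypothetical.

```python
import numpy as np

def grey_relational_degrees(X, negative=(), rho=0.5):
    """Return one correlation degree r_j per index (column of X).

    Assumptions: min-max normalization, negative indices flipped to positive,
    Deng's coefficient, and r_j taken as the column mean of xi_ij.
    """
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    Z = (X - x_min) / (x_max - x_min)      # requires x_max > x_min in every column
    neg = np.array(list(negative), dtype=int)
    Z[:, neg] = 1.0 - Z[:, neg]            # convert negative indices to positive ones
    delta = np.abs(1.0 - Z)                # ideal sequence equals 1 after normalization
    d_min, d_max = delta.min(), delta.max()
    xi = (d_min + rho * d_max) / (delta + rho * d_max)
    return xi.mean(axis=0)

def select_indices(r, r0):
    """Keep the indices whose correlation degree exceeds the threshold r0."""
    return np.flatnonzero(np.asarray(r) > r0)
```

For example, `select_indices(grey_relational_degrees(X), r0=0.6)` returns the column positions of the indices that would enter the index system under these assumptions.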

e-Commerce Credit Risk Assessment Model
The e-commerce credit risk assessment model based on the RB-XGBoost algorithm is built with the XGBoost algorithm.
The basic element of the XGBoost model is a set of trees. The binary tree structure of the classification and regression tree reflects the actual results of the decision tree. Each node of the decision tree has two branches, "yes" and "no," which correspond to the left and right branches, respectively. Each feature variable is split by the binary tree, and the feature space is partitioned to obtain several leaf nodes.
A data set $D = \{(x_i, y_i)\}$ is given, in which there are m variables and n samples. The prediction model is obtained from the regression tree ensemble through K functions, and its output $\hat{y}_i$ is
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \Gamma,$$
where $\Gamma = \{ f(x) = \omega_{q(x)} \}$ $(q : \mathbb{R}^m \to T,\ \omega \in \mathbb{R}^T)$ represents the regression tree space, $\omega_i$ represents the score of the ith leaf, T represents the number of leaf nodes in the tree structure, q stands for the tree structure, $f_k$ stands for the kth tree, and $x_i$ represents the independent variable of the ith sample. The tree model is trained with the objective function $\vartheta$:
$$\vartheta = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k),$$
where l is a convex loss function that measures the difference between the real value $y_i$ and the predicted value $\hat{y}_i$, and $\Omega$ represents the penalty term:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \|\omega\|^2,$$
where $\frac{1}{2}\lambda\|\omega\|^2$ is the regularization term and $\gamma$ is the leaf node penalty; both are mainly used to avoid overfitting.
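A small sketch may help make the tree ensemble concrete. It represents each tree as a nested dictionary, sums the leaf scores of K trees to obtain the prediction, and evaluates the penalty $\gamma T + \frac{1}{2}\lambda\|\omega\|^2$ of one tree. The dictionary layout and the function names are hypothetical and chosen only for illustration; they are not the XGBoost library's internal representation.

```python
import numpy as np

def predict_tree(tree, x):
    """Follow the 'yes' (left) or 'no' (right) branch until a leaf score is reached."""
    while "leaf" not in tree:
        branch = "yes" if x[tree["feature"]] < tree["threshold"] else "no"
        tree = tree[branch]
    return tree["leaf"]

def predict_ensemble(trees, x):
    """Prediction of the ensemble: the sum of the scores given by the K trees."""
    return sum(predict_tree(t, x) for t in trees)

def leaf_scores(tree):
    """Collect the leaf scores of one tree."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaf_scores(tree["yes"]) + leaf_scores(tree["no"])

def penalty(tree, gamma=1.0, lam=1.0):
    """Penalty of one tree: gamma * T + 0.5 * lam * ||w||^2."""
    w = np.array(leaf_scores(tree))
    return float(gamma * len(w) + 0.5 * lam * np.sum(w ** 2))
```

A tree with two leaves scoring -1 and 2 has a penalty of 2 + 0.5 * (1 + 4) = 4.5 when both `gamma` and `lam` are 1.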
In e-commerce credit risk assessment, the objective function cannot be optimized directly in Euclidean space [15,16]. Therefore, the RB-XGBoost algorithm-based e-commerce credit risk assessment model is trained with a boosting learning strategy. The specific process is as follows:
$$\hat{y}_i^{(0)} = 0, \qquad \hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i),$$
where $\hat{y}_i^{(t)}$ represents the output of the accumulated model in the tth round of training and $f_t(x_i)$ represents the function newly added in the tth round of training.
According to the above process, the objective function is transformed into
$$\vartheta^{(t)} = \sum_{i=1}^{n} l\big(\hat{y}_i^{(t-1)} + f_t(x_i),\, y_i\big) + \Omega(f_t) + \text{constant},$$
where constant is a constant term.
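The fit between the model and the training data can be measured by the loss function $L = \sum_{i=1}^{n} l(\hat{y}_i, y_i)$. The logistic loss function $l(\hat{y}_i, y_i) = y_i \ln(1 + e^{-\hat{y}_i}) + (1 - y_i)\ln(1 + e^{\hat{y}_i})$ and the square loss function $l(\hat{y}_i, y_i) = (y_i - \hat{y}_i)^2$ are widely used in the assessment process [17,18]. The RB-XGBoost algorithm-based e-commerce credit risk assessment model substitutes the square loss function into the objective function to obtain
$$\vartheta^{(t)} = \sum_{i=1}^{n} \Big[ 2\big(\hat{y}_i^{(t-1)} - y_i\big) f_t(x_i) + f_t(x_i)^2 \Big] + \Omega(f_t) + \text{constant},$$
where $\hat{y}_i^{(t-1)} - y_i$ represents the residual and the constant collects the terms that do not depend on $f_t$. For a general loss function, the loss can be approximated by a second-order Taylor expansion:
$$l\big(\hat{y}_i^{(t-1)} + f_t(x_i),\, y_i\big) \approx l\big(\hat{y}_i^{(t-1)}, y_i\big) + \frac{\partial\, l\big(\hat{y}_i^{(t-1)}, y_i\big)}{\partial \hat{y}_i^{(t-1)}}\, f_t(x_i) + \frac{1}{2}\, \frac{\partial^2 l\big(\hat{y}_i^{(t-1)}, y_i\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}\, f_t(x_i)^2.$$

Figure 1: Data balance sampling processing flow (flowchart steps: the training set is divided into n categories; the data set is balanced with undersampling and oversampling; the number of balanced data subsets generated is $N_{num}$).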

Journal of Sensors
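The first and second derivatives of the loss function with respect to the previous prediction are denoted by
$$g_i = \frac{\partial\, l\big(\hat{y}_i^{(t-1)}, y_i\big)}{\partial \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 l\big(\hat{y}_i^{(t-1)}, y_i\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}.$$
When the loss function is the square loss, these become
$$g_i = 2\big(\hat{y}_i^{(t-1)} - y_i\big), \qquad h_i = 2.$$
The parameters $g_i$ and $h_i$ are substituted into the objective function to obtain
$$\vartheta^{(t)} \approx \sum_{i=1}^{n} \Big[ l\big(\hat{y}_i^{(t-1)}, y_i\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t) + \text{constant},$$
where $\hat{y}_i^{(t-1)}$ describes the output of the model in round $t-1$ of training and $y_i$ describes the dependent variable in the objective function. Because the dependent variable $y_i$ is known, $l(\hat{y}_i^{(t-1)}, y_i)$ is a constant in round t, and the objective function can be simplified to
$$\vartheta^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t).$$
Here $g_i$ and $h_i$ are parameters of the loss function. Their values differ between loss functions, so they are determined by the form of the loss function.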
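The dependence of $g_i$ and $h_i$ on the loss can be illustrated with a short sketch. It computes the first and second derivatives with respect to the previous-round prediction for the square loss $(y_i - \hat{y}_i)^2$ and for the standard logistic loss, where $\hat{y}_i$ is treated as a raw score passed through a sigmoid. The function names are hypothetical.

```python
import numpy as np

def grad_hess_square(y, y_prev):
    """Square loss (y - y_hat)^2: g = 2 * (y_hat - y), h = 2."""
    g = 2.0 * (y_prev - y)
    h = np.full_like(g, 2.0)
    return g, h

def grad_hess_logistic(y, y_prev):
    """Logistic loss on raw scores: g = p - y, h = p * (1 - p), p = sigmoid(y_hat)."""
    p = 1.0 / (1.0 + np.exp(-y_prev))
    return p - y, p * (1.0 - p)
```

For the square loss the second derivative is the constant 2, while for the logistic loss it depends on the current prediction, which is why the two losses lead to different updates.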
Each tree is redefined by the following formula:
$$f_t(x) = \omega_{q(x)}, \quad \omega \in \mathbb{R}^T, \quad q : \mathbb{R}^m \to \{1, 2, \cdots, T\},$$
where $\omega$ describes the weights of the leaf nodes in the tree structure, $\omega_{q(x)}$ describes the predicted value obtained by the tree model, and q represents the structure of the tree. Model complexity includes the L2 regularization of the leaf node scores and the total number of leaf nodes T [19,20]. The model complexity $\Omega(f_t)$ follows from the tree definition:
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2.$$
L2 regularization smooths the leaf node scores and thereby mitigates overfitting [21,22]. After the complexity term is added, the objective function contains two kinds of summation, one over the samples and one over the leaf nodes. They are unified by defining $I_j = \{ i \mid q(x_i) = j \}$, where $I_j$ represents the set of samples in leaf node j. After adding the complexity to the objective function, the final objective function is obtained, that is, the e-commerce credit risk assessment model [23,24]:
$$\vartheta^{(t)} \approx \sum_{j=1}^{T} \Big[ \Big( \sum_{i \in I_j} g_i \Big) \omega_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) \omega_j^2 \Big] + \gamma T.$$
Based on the selected risk assessment indices, the risk assessment is performed using the e-commerce credit risk assessment model.
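The text stops at the final objective. As a hedged illustration of how that objective is typically used, note that for a fixed tree structure it is a separate quadratic in each leaf weight, so setting its derivative to zero gives the standard XGBoost closed form $\omega_j^* = -G_j / (H_j + \lambda)$, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. This step is not stated in the text. The sketch below evaluates the objective and the closed-form weights; the function names and the integer array `leaf_of` (the leaf index $q(x_i)$ of each sample) are hypothetical.

```python
import numpy as np

def leaf_sums(g, h, leaf_of, T):
    """Sum the gradients and second derivatives of the samples falling in each leaf."""
    G = np.bincount(leaf_of, weights=g, minlength=T)
    H = np.bincount(leaf_of, weights=h, minlength=T)
    return G, H

def leaf_objective(g, h, leaf_of, w, lam=1.0, gamma=0.0):
    """Evaluate sum_j [G_j w_j + 0.5 (H_j + lam) w_j^2] + gamma * T."""
    w = np.asarray(w, dtype=float)
    G, H = leaf_sums(g, h, leaf_of, len(w))
    return float(np.sum(G * w + 0.5 * (H + lam) * w ** 2) + gamma * len(w))

def optimal_leaf_weights(g, h, leaf_of, T, lam=1.0):
    """Closed-form minimizer of the quadratic objective: w_j = -G_j / (H_j + lam)."""
    G, H = leaf_sums(g, h, leaf_of, T)
    return -G / (H + lam)
```

Any other choice of leaf weights for the same tree structure yields an objective value that is at least as large, which can be checked numerically by perturbing the returned weights.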

Experiments and Results
To verify the effectiveness of the RB-XGBoost algorithm-based e-commerce credit risk assessment model, a test is necessary. The proposed method, the literature [3] method, and the literature [7] method are used for comparative experiments. The imbalance rate τ is used as the experimental index to test the data balance of the different methods. It is calculated from $N_{\max}$ and $N_{\min}$, which represent the maximum and minimum values of the sample data in the set. The larger the imbalance rate τ, the more unbalanced the data. The imbalance rates τ of the proposed method, the literature [3] method, and the literature [7] method are shown in Figure 3. According to Figure 3, the data imbalance rate of the proposed method is less than 5% on the different data sets, while the imbalance rates of the methods of literature [3] and literature [7] fluctuate around 10% and 15%, respectively. The imbalance rate of the proposed method is therefore low, indicating that the data obtained by the proposed method are well balanced. This is because the data are sampled and processed with the adaptive random balance (RB) method before the e-commerce credit risk assessment model is constructed, which ensures the balance of the data.
The assessment accuracy of the proposed method, the literature [3] method, and the literature [7] method is verified with the kappa coefficient and the ROC curve. The kappa coefficient measures the difference between the assessment results and the real results. The kappa coefficient K is calculated as
$$K = \frac{p_o - p_e}{1 - p_e},$$
where $p_o$ represents the proportion of correctly evaluated samples in the total number of samples and $p_e$ represents the randomness ratio, that is, the proportion of agreement expected by chance. The higher the kappa coefficient K, the more accurate the evaluation results of the method. The kappa coefficients of the proposed method, the literature [3] method, and the literature [7] method are shown in Table 1.
From the data in Table 1, we can see that the kappa coefficients of the proposed method in multiple iterations are higher than those obtained by the methods in literature [3] and literature [7], indicating that the proposed method can accurately complete the assessment of e-commerce credit risk. This is because this method constructs a risk assessment index system based on the data with a high balance and completes the assessment of the e-commerce credit risk based on the high-precision risk assessment indices.
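The kappa coefficient can be computed from predicted and true labels with a few lines. The sketch below uses the standard Cohen's kappa definition, $K = (p_o - p_e)/(1 - p_e)$, with $p_e$ estimated from the marginal label frequencies; the function name and the example labels are hypothetical and are not data from the experiments.

```python
import numpy as np

def kappa(y_true, y_pred):
    """Cohen's kappa: agreement between predictions and labels, corrected for chance."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)  # observed agreement
    # Agreement expected by chance from the marginal frequencies of each label.
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
    return float((p_o - p_e) / (1.0 - p_e))
```

For the labels `[1, 1, 0, 0]` and predictions `[1, 0, 0, 0]`, the observed agreement is 0.75 and the chance agreement is 0.5, which gives a kappa of 0.5.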
In the ROC curve, the abscissa is the false positive rate and the ordinate is the true positive rate. The larger the area enclosed by the ROC curve and the abscissa, the higher the accuracy of the assessment results of the method. The proposed method, the literature [3] method, and the literature [7] method are each used to evaluate the credit risk of different e-commerce enterprises, and the obtained ROC curves are shown in Figure 4.
Figure 4 shows that the area enclosed by the ROC curve of the proposed method and the abscissa is larger than the areas enclosed by the ROC curves of the methods of literature [3] and literature [7], indicating that the proposed method yields more accurate assessment results and can accurately complete credit risk assessment for e-commerce enterprises.
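The area under the ROC curve used in this comparison can be computed as in the following sketch. It sorts the samples by decreasing score, accumulates true and false positives to obtain the curve, and integrates with the trapezoidal rule. The function names are hypothetical, the labels are assumed to be 0 and 1, and the scores are assumed to be distinct (tied scores would need to be grouped into a single threshold).

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by lowering the threshold one sample at a time."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y_sorted = y_true[order]
    tps = np.concatenate([[0], np.cumsum(y_sorted == 1)])  # true positives so far
    fps = np.concatenate([[0], np.cumsum(y_sorted == 0)])  # false positives so far
    return fps / fps[-1], tps / tps[-1]

def auc(fpr, tpr):
    """Area between the ROC curve and the false-positive-rate axis (trapezoidal rule)."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

An area of 1 corresponds to a perfect ranking of risky and non-risky enterprises, and 0.5 corresponds to a random ranking.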

Conclusion
To address the high data imbalance rate and low assessment accuracy of current e-commerce credit risk evaluation methods, an e-commerce credit risk evaluation model based on the RB-XGBoost algorithm is proposed. The risk assessment index system is first constructed from well-balanced data, and the risk assessment model is then established with the XGBoost algorithm. This model realizes the assessment of e-commerce credit risk, solves the problems of the current methods, ensures data balance, and improves the accuracy of risk assessment. Future work includes improving the risk assessment model and further enhancing the accuracy of risk assessment.

Data Availability
The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest
The authors declare no competing interests.