XGBoost Optimized by Adaptive Particle Swarm Optimization for Credit Scoring

Personal credit scoring is a challenging issue. In recent years, research has shown that machine learning achieves satisfactory performance in credit scoring. Because of their advantages in feature combination and feature selection, decision trees can match credit data, which have high dimension and complex correlations. However, decision trees tend to overfit. eXtreme Gradient Boosting is an advanced gradient boosting tree method that overcomes this shortcoming by integrating tree models. The structure of the model is determined by hyperparameters; to address the time-consuming and laborious problem of manual tuning, optimization methods are employed for tuning. Because particle swarm optimization describes the particle state and its motion law as continuous real numbers, the hyperparameters applicable to eXtreme Gradient Boosting can find their optimal values in a continuous search space. However, classical particle swarm optimization tends to fall into local optima. To solve this problem, this paper proposes an eXtreme Gradient Boosting credit scoring model based on adaptive particle swarm optimization. A swarm split based on the clustering idea, together with two kinds of learning strategies, is employed to guide the particles and improve the diversity of the subswarms, in order to prevent the algorithm from falling into a local optimum. In the experiments, several traditional machine learning algorithms and popular ensemble learning classifiers, as well as four hyperparameter optimization methods (grid search, random search, tree-structured Parzen estimator, and particle swarm optimization), are considered for comparison. Experiments were performed on four credit datasets and seven KEEL benchmark datasets over five popular evaluation measures: accuracy, error rate (type I error and type II error), Brier score, and F1 score. The results demonstrate that the proposed model outperforms the other models on average. Moreover, adaptive particle swarm optimization performs better than the other hyperparameter optimization strategies.


Introduction
Granting loans to potential borrowers is the core operation of lending establishments around the world. The loan business brings huge profits to a company, but it can also expose the company to huge financial losses. Therefore, lending institutions need to comprehensively analyse the basic information and credit histories of applicants to estimate the possibility of repayment and then decide whether to approve the application.
An acceptable credit scoring method can help lending institutions distinguish good applicants among loan applications and reject the unacceptable ones. Machine learning (ML) technology has attracted increasing attention, in particular the commonly used neural network (NN) [1], support vector machine (SVM) [2], and decision tree (DT) [3]. These methods have been increasingly applied in credit scoring. The NN is an information-processing model that uses structures similar to synaptic connections in the brain and is improved by iteratively adjusting weights to minimize the prediction error. The capability of the NN to treat nonlinear data is beneficial for identifying intrinsic patterns in complex financial credit data. Chuang and Huang [4] proposed a hybrid NN-based credit scoring model, in which the first part of the model divides applications into an accepted group and a rejected group. The results show that the model obtains more accurate results than the other compared methods; moreover, it has been proven to reverse potential customer churn. The proposed model reinforced the NN primarily to enhance accuracy, without much help in reducing misclassification. Fonseca [5] proposed a two-stage process that employs a fuzzy inference model as an input for an NN model, using a credit score rating as the response to conduct the fuzzy reasoning step of the analysis. The results indicated satisfactory predictability of the model. Although these methods add different models to improve the performance of the NN, the NN still has limitations: it lacks the capability to explain lending decisions, it is time-consuming, and it is prone to overfitting [6]. The SVM is a suitable technique for credit scoring as binary classification [7]. The goal of the SVM is to identify the hyperplane that correctly separates the training instances into their classes in the feature space. Hens and Tiwari [8] utilized an SVM in combination with feature selection by the F score, which measures the ability of features to distinguish between two categories. The results show that the method is superior to the other experimental methods in terms of precision and calculation time.
The results obtained by the SVM are still sensitive to kernel parameter selection. Luo et al. [9] proposed a kernel-free SVM model that avoids selecting kernels and related kernel parameters. The promising results demonstrate the effectiveness of the proposed method on personal credit data. Nevertheless, limits of the SVM remain: the boundary of the hyperplane lies close to the minority instances when the dataset is imbalanced, which decreases prediction accuracy, and the SVM still does not reveal the explanatory nature of lending decision-making. The DT is favored by decision makers because of its reasonable interpretability [6]. By splitting nodes, the DT forms a tree-shaped decision structure. The DT is good at capturing interactive information between credit features and indicates which variables are important. Kao et al. [10] established a Bayesian latent variable model based on a classification and regression tree (CART) for bank credit data. Compared with the other experimental ML models, the prediction accuracy of the proposed model is the highest, and its type I error is significantly lower. One of the reasons for the improvement is pruning, which reduces the complexity of the model and avoids overfitting. The importance of the variables and the deduction rules are derived from the tree structure, which supports credit decisions. Xia et al. [11] presented a novel model employing an advanced tree-based method. The model is evaluated over several metrics (accuracy, the area under the curve (AUC), and Brier score) on five public credit datasets. The results demonstrate that the proposed model significantly outperforms most of the benchmark models in predictive accuracy. The DT makes few assumptions on the data distribution, which allows it to handle complex financial data, and its intuitive tree representation makes the results interpretable. The proposed ensemble method can dynamically assign weights to the base classifiers based on model performance, which overcomes the inherent sensitivity of a single DT to noisy data or redundant features. Ensemble modelling gives classifiers the opportunity to express their learning ability on different parts of the data and the feature space. On the basis of the "no free lunch" theorem [12], and because the structure and characteristics of credit data vary, prediction accuracy is greatly limited by a single classifier.
Ensemble learning combines multiple classifiers that process different hypotheses to construct a better hypothesis and obtain excellent predictions. Ensemble models have been extensively applied to credit scoring [13]. Hsieh and Hung [14] employed class-wise bagging as a data augmentation strategy and combined the NN, SVM, and other single classifiers into an ensemble classifier. The proposed ensemble approach has significantly better accuracy than these single classifiers on the German dataset. Xiao et al. [15] proposed ensemble classification based on supervised clustering for credit scoring. In this method, the base classifiers on the subsets produced by supervised clustering are combined by weighted voting.
The proposed approach has higher prediction accuracy than the base classifiers DT, logistic regression (LR), and SVM on the German and Australian datasets. Xia et al. [16] proposed an ensemble credit model that combines bagging with stacking. On the German, Australian, and P2P datasets, the ensemble model is better than the single classifiers (LR, SVM, and DT) in accuracy, AUC, and Brier score. Numerous studies have shown that models integrating single classifiers achieve better performance [6,17], which shows that ensemble learning has a satisfactory learning ability for credit scoring.
Furthermore, Chen and Guestrin [18] proposed an advanced gradient boosting algorithm, the eXtreme Gradient Boosting (XGBoost) tree, which has obtained good results in Kaggle data competitions. Zieba and Tomczak [19] applied XGBoost to bankruptcy prediction. For the real, unbalanced data of Polish companies, the XGBoost model exploits its insensitivity to imbalanced data, as it enables the selection of an AUC measure for evaluation and forces proper ordering of the imbalanced data. The model performs significantly better than the other methods, including the NN, LR, SVM, and random forest (RF). Based on the notion that credit data and bankruptcy data belong to the same financial risk domain [6], XGBoost provides a viable scheme for credit scoring. Chang [20] used XGBoost to construct a credit scoring model with cluster-based undersampling. XGBoost is based on the DT and is supported by gradient boosting, which improves its predictive power. On real loan data from a financial institution, the assessment model exhibits superior accuracy and AUC values compared to LR, the NN, and the SVM. Xia et al. [21] proposed a cost-sensitive credit risk assessment model using XGBoost to predict the risk of a loan. The default probability prediction of the proposed model is better than that of LR and RF on P2P lending data from the LendingClub and Renrendai platforms. An explanatory data analysis shows that variables differ in predicting the probability of default and profitability, owing to the interpretability of the XGBoost tree structure.
Credit data, as financial data, have inherent characteristics, such as imbalance caused by the small number of default applications, a large number of credit features with complex relationships, a sparse matrix problem caused by many missing values and features, and interpretability issues for feature decisions. Traditional ML methods and other ensemble learning methods can address some of these problems. The field of credit scoring is directly related to economic returns, which requires models with greater ability to address the characteristics of credit data in more respects. XGBoost allows custom objective functions and evaluation indicators to guide the model in special situations, such as dealing with imbalance by using the AUC indicator as the objective function. By calculating the gain of samples in the left and right subtrees to decide the default direction for missing data, XGBoost learns how to deal with missing values. With this mechanism, XGBoost is appropriate for credit data with missing values and sparse characteristics. XGBoost's intuitive interpretation capabilities can explain the contribution of features to the prediction and identify common or opposite relationships between two features, which is meaningful for understanding decision-making.
Although XGBoost performs well, its satisfactory performance depends on proper hyperparameter settings. The hyperparameters of XGBoost directly affect the structure and performance of the model, so appropriate tuning is particularly important. Manual tuning generally relies on experience and consumes a substantial amount of time and computing resources. Hyperparameter optimization methods have been employed to overcome the disadvantages of manual tuning. The hyperparameter optimization methods commonly utilized in machine learning include grid search (GS) and random search (RS) [22]. Within limited resources, GS and RS, as discrete optimization methods, may overlook the optimal value when the objective function is nonconvex. Different from RS and GS, the tree-structured Parzen estimator (TPE) [23] is a Bayesian method whose proxy model is a tree structure; it builds a probabilistic model from the evaluation results of past targets to minimize the objective function. The algorithm assumes that the hyperparameters are completely independent, a condition that is difficult to meet. Particle swarm optimization (PSO), originally proposed by Kennedy and Eberhart [24], is a computational intelligence technique that performs hyperparameter optimization in a continuous search space. Shen et al. [25] proposed an ensemble model combining the AdaBoost method with an NN base classifier and employed PSO to search for the optimal connection weights of the NN. The results show that this model is more effective on the German and Australian datasets than the other models examined. The original PSO algorithm was mainly designed for the optimization of a continuous space, because the quantities that describe the particle state and its motion laws are continuous real numbers [26]. Therefore, PSO, rather than GS, RS, or TPE, is more suitable for the hyperparameter optimization of XGBoost. PSO finds the optimal solution through iteration and has a fast convergence speed. This characteristic easily causes the particle states to fall into a local optimum, thereby causing premature convergence. In response to this problem, we use the idea of clustering [27] to adaptively divide the particle swarm into different populations and guide the populations with different update strategies. This enhances the diversity of the particles and helps them jump out of local optima.
On the basis of the above considerations, we use the adaptive particle swarm optimization (APSO)-XGBoost model for credit scoring. APSO is more suitable for the parameter optimization of XGBoost, and it improves the model's prediction accuracy. The contributions of this paper are as follows. First, XGBoost is optimized by an APSO algorithm to establish a credit scoring model. The hyperparameters of the model are set to more reasonable values, and more accurate prediction results are obtained.
Second, the proposed APSO method is based on adaptive subgroup division and uses two different learning strategies to update different types of particles, enhancing the diversity of the particle populations. The rest of the paper is organized as follows: Section 2 explains the related work on the methods used. Section 3 introduces the principle of the APSO-XGBoost model. Section 5 describes the experimental setup. Section 6 reports the experimental analysis results. Finally, Section 7 concludes the paper and discusses future work.

Related Work
Much credit scoring research concentrates on proposing new algorithms to improve classification accuracy. In general, credit scoring methods fall into two major categories: statistical methods and artificial intelligence (AI) methods [6]. The aim of a statistical method is to obtain an optimal linear combination of the input variables to predict default risk. LR, a statistical method, is a popular model utilized in credit scoring because it is easy to implement [28]. LR classifies applications as default or nondefault by maximizing the likelihood function and solves the parameters of the expression by gradient descent. Note that the method assumes a linear relationship among the input variables, and its performance is limited when addressing real, complex nonlinear credit data.
ML has more outstanding predictive ability for complex nonlinear data [29]. Several studies have shown that ML methods have better prediction accuracy than statistical methods on financial data. As traditional ML methods, the NN, SVM, and DT are commonly applied to predict personal default [30]. Oreski and Oreski [31] proposed a hybrid genetic algorithm based on the NN, which achieved good prediction performance. Maldonado et al. [32] proposed a profit-driven model based on the SVM that simultaneously constructs classifiers and selects characteristics, which performs well in the credit scoring business of a Chilean bank. Cai and Zhang [33] obtained satisfactory default prediction performance by applying the DT and LR to LendingClub data. The amount of training data for a single classifier is insufficient, and its hypothesis space is small, so it easily settles on a local optimal value. Based on reasonable but different ideas, ensemble learning can solve this issue by combining different single classifiers. Single classifiers applied to credit scoring commonly suffer from the negative effects of noise and redundant attributes, as in other research areas. Single classifiers show a large deviation when processing complex data with noise and redundancy, which makes the performance of the model unstable across various indicators.
To compensate for the limitations of single classifiers, applying ensemble learning to credit scoring is becoming a positive trend [16]. Ensemble learning has two main methods: bagging and boosting [13]. Bagging adopts repeated random sampling to generate random subsets. The final result is obtained from the agreement of the prediction models on each subset. As a famous bagging classifier, the RF trains many trees on random subsets and then adopts the mean of the result set of the various trees as the final output. Since bagging trains classifiers on each bootstrap sample and combines the results by majority voting or averaging, it is less influenced by noise than a single classifier. Bagging-DT and Bagging-NN introduce the bagging strategy into a single classifier to achieve better performance [17]. Unlike bagging, which constructs the base models in parallel, boosting builds models sequentially. Boosting is an iterative algorithm that combines weak learners based on weights. AdaBoost [34] is the most influential boosting algorithm; in each iteration, it increases the weights of weak classifiers with low error rates and reduces the weights of those with high error rates. AdaBoost is a framework that can use a variety of base learners. The DT is the most prevalent default base classifier, and the AdaBoost-NN base classifier is a multilayer perceptron; both obtain satisfactory results for credit scoring [35]. Because of the simple construction of the weak base classifiers, AdaBoost does not overfit easily, but its classification accuracy decreases under data imbalance. The Gradient Boosting Decision Tree (GBDT) [36] adopts the regression tree as the weak classifier and adds a new tree to the model in each iteration to fit the residuals. GBDT has the flexibility to handle various types of credit data, and it handles exceptional or fraudulent applications better owing to the robustness of its loss function. Compared with GBDT, XGBoost uses both first and second derivatives, and its loss function is more accurate, which splits the trees more precisely and makes the tree structure more suitable for the characteristics of financial data. XGBoost adds a regularization term to the cost function to control the complexity of the model and prevent overfitting when learning credit data with small samples. In the case of missing eigenvalues, the split direction of a credit sample is learned automatically. These advantages allow XGBoost to achieve reasonable predictions in credit scoring [20].
To take advantage of the efficacy of XGBoost, the model needs to be optimized by a hyperparameter optimization approach. GS attempts every combination of the candidate parameters via loop traversal and selects the best-performing parameters as the final result. Instead of testing all values in the search area, RS randomly selects sample points, which is faster than GS. Owing to the relatively large number of hyperparameters, the two algorithms may be infeasible for XGBoost. Chang et al. [20] proposed XGBoost optimized by the TPE to predict credit data. The experimental results show that the model outperforms the baseline models on average with regard to accuracy, error rate, AUC, and Brier score. However, the TPE requires that the hyperparameters be independent of each other [23]. In the experiment, hyperparameters that are highly correlated cannot be optimized simultaneously; one of them needs to be fixed first, which removes the relation between the hyperparameters. The aim of PSO is to obtain the optimal value via cooperation and information sharing among the individuals in the swarm, and it was originally designed to address continuous optimization problems. The algorithm searches according to individual fitness information, so PSO is not limited by objective function constraints such as continuity and differentiability. Moreover, the algorithm does not require mutual independence of the hyperparameters.
Based on the advantages of XGBoost in reasonably addressing the missing values of credit data, its more accurate approximation of the loss function, and its mechanisms to prevent overfitting, together with the advantages of PSO in optimizing parameters in a high-dimensional continuous space, we build an XGBoost model optimized by an improved APSO to realize high-precision credit scoring. APSO sets the objective function as the fitness of the particles, directs the particles to simultaneously optimize multiple hyperparameters in the continuous space, and obtains the hyperparameter values that lower the objective function.
This reasonable hyperparameter setting enables XGBoost to fully exert its effect, reduces the influence of credit data imbalance on the model, learns reasonable split directions for the missing values of the credit data, and improves the accuracy of the model.

Materials and Methods
Hyperparameter optimization has always been an important aspect of prediction models. In this section, we introduce the related work on these techniques.

XGBoost.
XGBoost is a powerful methodology for regression as well as classification, and it has been used by a group of winning entries in Kaggle competitions. XGBoost, which is based on the gradient boosting framework, iteratively adds new decision trees that fit the residuals, improving the efficiency and performance of the learners. Unlike gradient boosting, XGBoost uses a Taylor expansion to approximate the loss function, so the model achieves a better tradeoff between bias and variance, usually using fewer decision trees to obtain higher accuracy. Below is a description of XGBoost.
Assume that a given sample set has $n$ samples and $m$ features; it can be expressed as $D = \{(x_i, y_i)\}$ with $|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$, where $x$ is the feature vector and $y$ is the true value. The algorithm sums the results of $K$ trees as the final predicted value, which is expressed as

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F, \tag{1}$$

where $F = \{f(x) = w_{q(x)}\}$ is the set of decision trees, $f(x)$ is one of the trees, and $w_{q(x)}$ is the weight of the leaf node to which sample $x$ is mapped. $T$ is the number of leaf nodes, and $q$ represents the structure of each tree, which maps the sample to the corresponding leaf node. Therefore, the predicted value of XGBoost is the sum of the values of the leaf nodes of each tree. The goal of the model is to learn these $K$ trees, so we minimize the following objective function:

$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \tag{2}$$

where $l$ is the loss measuring the difference between the estimated value $\hat{y}_i$ and the true value $y_i$; common loss functions include the logarithmic loss function, the square loss function, and the exponential loss function. The regularization term $\Omega$ sets the penalty on the decision tree, which can prevent overfitting. $\Omega$ is expressed as follows:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2, \tag{3}$$

where $\gamma$ is a hyperparameter that controls the complexity of the model and $T$ is the number of leaf nodes. $\lambda$ is the penalty coefficient for the leaf weights $w$, which is usually constant. $\lambda$ and $w$ determine the complexity of the model and are usually given empirically. During training, a new tree is added to fit the residuals of the previous round.
Therefore, when the model has $t$ trees, the prediction is expressed as follows:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i). \tag{4}$$

Substituting (4) into the objective function (2) yields the following function:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t). \tag{5}$$

XGBoost carries out the Taylor expansion of the objective function, takes the first three terms, removes the higher-order infinitesimal terms, and transforms the objective function into

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t), \tag{6}$$

where $g_i$ is the first derivative and $h_i$ is the second derivative of the loss function. The residual between the prediction score $\hat{y}^{(t-1)}$ and $y_i$ does not affect the optimization of the objective function, so it is removed:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t). \tag{7}$$

The iteration of the tree model is transformed into the iteration of the leaf nodes. Writing $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ over the samples $I_j$ in leaf $j$, the calculated optimal leaf node score is $G_j^2 / (H_j + \lambda)$. By substituting the optimal value into the objective function, the final objective function is obtained:

$$\tilde{\mathcal{L}}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T. \tag{8}$$

Overall, XGBoost adds regularization to the standard objective, thereby reducing model complexity, and applies the first and second derivatives to fit the residual error. The method also supports column sampling, which both reduces overfitting and reduces computation. These improvements lead to a larger number of hyperparameters than the GBDT, and it is difficult to tune them reasonably by hand. A reasonable setting requires not only the prior knowledge of researchers and their experience in parameter tuning but also a great deal of time. Hyperparameter optimization is an effective solution to this problem.
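The split-finding rule follows directly from (8): each candidate split is scored by how much it reduces the objective. The listing below is a minimal sketch of these formulas in NumPy (the function names and implementation are ours, not from the original paper):

import numpy as np

def leaf_objective(g, h, lam=1.0, gamma=0.0):
    # Optimal weight and objective contribution of a single leaf.
    # g, h: first/second derivatives (g_i, h_i) of the loss for the
    # samples routed to this leaf.
    G, H = g.sum(), h.sum()
    w_star = -G / (H + lam)                    # optimal leaf weight
    score = -0.5 * G ** 2 / (H + lam) + gamma  # per-leaf objective term
    return w_star, score

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    # Gain of splitting a node into left/right children, per eq. (8).
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam)
                  - (GL + GR) ** 2 / (HL + HR + lam)) - gamma

XGBoost evaluates this gain for every candidate split and keeps the one with the highest score; a split whose gain is negative (because of the gamma penalty) is pruned.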

Parameter Optimization.
The parameter is one of the most significant concepts in ML, and training a model essentially means finding the appropriate parameters to achieve better results. Parameters are divided into model parameters and hyperparameters. The model parameters are obtained by learning the distribution of the training data, without the need for human experience. A hyperparameter is a higher-level concept about the model, such as its complexity or ability to learn. A set of satisfactory hyperparameters improves the performance of a learning model, so tuning them is important. However, hyperparameter tuning is subjective and relies on empirical judgement and trial-and-error approaches [23]. Hyperparameter optimization algorithms overcome this dependence of manual search on experience and trial and error. Common hyperparameter optimization algorithms include GS, RS, and Bayesian optimization. The booster parameters of XGBoost are commonly tuned to improve the performance of the model. These parameters directly control the generation of the trees at each step, and the frequently selected booster parameters are described in Table 1.
Below is a brief introduction to these methods. GS: within a specified range of the hyperparameters, GS exhaustively trains and validates the model on every combination of candidate values and keeps the combination that performs best in validation. However, GS cannot widely explore a hyperparameter space, because its computational complexity grows exponentially with the number of hyperparameters. Therefore, GS is not suitable for the optimization of models with many dimensions [37].

RS: RS samples a certain number of hyperparameter sets from a specified distribution by randomly sampling within a search range. The theoretical basis is that if the set of random sample points is large enough, the global optimal value or a good approximation of it will be found.
Bayesian optimization [38] tunes hyperparameters by evaluating the objective function at sample points and using the results to update the posterior distribution over the objective. The algorithm typically models the objective as a Gaussian process: it considers the information from previous hyperparameter evaluations and then adjusts the hyperparameters to gradually refine the joint posterior distribution. Bayesian hyperparameter optimization assumes that there is a true, noisy mapping from the hyperparameters to a specific target function.
Bergstra et al. [23] proposed the TPE, which is a Bayesian optimization method for tuning the hyperparameters of XGBoost.
The results show that the model outperforms the other models according to the evaluation measures. However, Bayesian optimization is established on the basis of an independent prior distribution and the idealized hypothesis that the properties are independent of each other. This condition is difficult to attain in practical applications; the number of properties may be large, or the correlation between the properties may be high, thereby causing performance degradation.

PSO.
PSO simulates a bird in a flock by designing a massless particle that has only two properties: its speed, which represents how fast it moves, and its position, which guides the direction in which it moves. Each particle searches for the optimal solution in its individual search space and stores it as the current individual extremum. From the individual extrema of all particles, the current global optimal solution is obtained, and the whole particle swarm adjusts its speeds and positions accordingly. The process of PSO is as follows.
First, initialize the particle swarm; then, evaluate the particles, calculate the adaptive value, search for individual extrema, and find the global optimal solution. Last, modify the speed and position of the particles.
The standard PSO algorithm is as follows. Assume that in a $D$-dimensional search space, there is a population of $m$ particles $x_1, x_2, \ldots, x_m$, where particle $i$ is expressed as $x_i = (x_i^1, x_i^2, \ldots, x_i^D)$. Then, the speed and position of particle $i$ are updated at time $t+1$ by the following formulas:

$$v_i^d(t+1) = \omega v_i^d(t) + c_1 r_1 \big(pbest_i^d - x_i^d(t)\big) + c_2 r_2 \big(gbest^d - x_i^d(t)\big), \tag{9}$$

$$x_i^d(t+1) = x_i^d(t) + v_i^d(t+1), \tag{10}$$

where $\omega$ is the inertia weight that maintains an effective balance between global exploration and local exploration, and $c_1$ and $c_2$ are the learning factors (with $r_1, r_2$ random numbers from a uniform distribution) that adjust the step lengths toward the particle's own best position and the global best position, respectively. To avoid a blind search by the particle, its speed and position are confined to the bounds of the search space.
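A minimal sketch of this update rule, vectorized over all particles in NumPy, is given below (the function name and the default coefficient values are our illustrative assumptions, chosen from commonly used PSO settings):

import numpy as np

def pso_step(x, v, pbest, gbest, w=0.72, c1=1.49, c2=1.49, bounds=None):
    # One velocity/position update of standard PSO, per eqs. (9)-(10).
    # x, v, pbest: (m, D) arrays; gbest: (D,) array.
    m, D = x.shape
    r1, r2 = np.random.rand(m, D), np.random.rand(m, D)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = x + v
    if bounds is not None:          # clamp particles to the search space
        lo, hi = bounds
        x = np.clip(x, lo, hi)
    return x, v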

APSO-XGBoost Credit Scoring Model
Hyperparameters control the complexity or regularization that refines the model, so we need to employ parameter optimization carefully to tune them and obtain better prediction ability. As a powerful ML algorithm, XGBoost has numerous hyperparameters. APSO can simultaneously obtain the optimal values of multiple XGBoost hyperparameters in an n-dimensional space. Owing to the characteristics of its position updating formula, APSO addresses the continuous optimization problem of searching for the optimal hyperparameters of XGBoost. Additionally, the improved APSO divides the population by clustering to improve the diversity of the subgroups and adopts two updating strategies for different types of particles, averaging the information of the locally optimal particles to prevent the local convergence caused by local particle aggregation. APSO thus enables XGBoost to adequately fit the credit data and improves its prediction accuracy. First, we describe APSO; second, we introduce the overall framework of the credit scoring model.

Adaptive PSO.
In this section, we introduce PSO improved by adaptive learning strategies. In the process of searching, the swarm is adaptively divided into subgroups according to the particle distribution. In each subgroup, we use two different learning strategies to guide the search directions of two different types of particles. The search process stops when a global optimal value is found or a termination condition is met.
Relevant studies have shown that the diversity of the population is the key to avoiding the premature convergence of PSO; the core guiding principle of the algorithm is clustering [26]. According to the distribution of the particles, the fast search clustering method [39] is adopted to adaptively divide the population into several subgroups. This method can automatically discover the cluster centres of a set of samples. The basic principle is that a cluster centre has two basic features: first, it is surrounded by points with a lower local density, and second, it has a greater distance from points with a higher local density. Therefore, for a population of $H$ particles, $S = \{x_i\}_{i=1}^{H}$, the two properties $\rho_i$ and $\delta_i$ are defined for each particle. The local density $\rho_i$ of a particle is defined as follows:

$$\rho_i = \sum_{j \neq i} \chi\big(d_{ij} - d_c\big), \qquad \chi(a) = \begin{cases} 1, & a < 0, \\ 0, & \text{otherwise}, \end{cases} \tag{11}$$

where $d_{ij}$ is the Euclidean distance between particles $x_i$ and $x_j$ and $d_c$ is the truncation distance. The truncation distance is $d_c = d_{R \ast M}$, where $R$ represents a proportion and $M = \frac{1}{2}N(N-1)$ is the number of values in the distance matrix $d_{ij}$; that is, $d_c$ is the distance corresponding to the $(R \ast M)$-th smallest value of $d_{ij}$. Equation (12) gives the expression of the distance $\delta_i$, representing the minimum distance from particle $i$ to any particle that has a higher $\rho$:

$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}. \tag{12}$$

For the particle with the maximum local density $\rho$, $\delta_i = \max_j d_{ij}$.
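The two quantities can be computed directly from the pairwise distances. The sketch below (our illustrative implementation, using SciPy for the distance matrix) follows the definitions in (11) and (12) with the cut-off kernel:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_peaks(X, R=0.15):
    # rho: local density, eq. (11); delta: distance to the nearest
    # particle of higher density, eq. (12); gamma = rho * delta scores
    # candidate centres. X: (H, D) positions; R: proportion in [0.1, 0.2].
    d = squareform(pdist(X))
    H = len(X)
    M = H * (H - 1) // 2
    dc = np.sort(pdist(X))[int(R * M)]       # truncation distance d_c
    rho = (d < dc).sum(axis=1) - 1           # exclude the particle itself
    delta = np.empty(H)
    for i in range(H):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta, rho * delta

Particles whose gamma value stands far above the rest are taken as subgroup centres, and every other particle joins the subgroup of its nearest neighbour of higher density.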
According to equation (11), if the density of particle $x_i$ is the maximum, $\delta_i$ is much larger than the $\delta$ of its nearest particles. Therefore, the centre of a subgroup consists of particles that have an unusually large distance $\delta$ and a relatively high density as well; in other words, the particles with larger $\rho$ and $\delta$ values are selected as the cluster centres. Following the idea of [39], the score $\gamma_i = \rho_i \ast \delta_i$ is used to filter out particles that may become cluster centres. We arrange the $\gamma_i$ values in descending order and then use the truncation distance to select the cluster centres from this order, because the $\gamma$ value of a true centre tends to grow markedly (often exponentially) above those of the remaining particles, which distinguishes it from the $\gamma$ value of the next particle. Referring to [39], $R$ is set between 0.1 and 0.2. Through a parameter sensitivity analysis, we found that the value of this distribution parameter has little effect on the performance of the particle swarm algorithm; the default value in this article is 2. After the cluster centres are obtained by the truncation-distance division, each remaining particle $x_j$ is assigned to the subgroup of the particle whose density $\rho$ is larger than that of $x_j$ and whose $\gamma$ is closest to that of $x_j$. The particles of each subgroup are then divided into ordinary particles and locally optimal particles based on the result of the subgroup division. Under the primary guidance of the optimal particles, the ordinary particles exert their local search ability, and the update formula is given as follows:

$$v_i^d(t+1) = \omega v_i^d(t) + c_1 \, rand_1^d \big(pbest_i^d - x_i^d(t)\big) + c_2 \, rand_2^d \big(cgbest_c^d - x_i^d(t)\big), \tag{13}$$

where $\omega$ is the inertia weight, $c_1$ and $c_2$ are the learning factors, $rand_1^d$ and $rand_2^d$ are uniformly distributed random numbers in the interval $[0, 1]$, $pbest_i^d$ is the best position of particle $i$, and $cgbest_c^d$ is the current best position of the particles in subgroup $c$. To enhance the exchange of information between subgroups, the locally optimal particles are updated mainly by integrating the information of each subgroup:

$$v_i^d(t+1) = \omega v_i^d(t) + c_1 \, rand_1^d \big(pbest_i^d - x_i^d(t)\big) + c_2 \, rand_2^d \Big(\frac{1}{C}\sum_{c=1}^{C} cgbest_c^d - x_i^d(t)\Big), \tag{14}$$

where $C$ is the number of subgroups. Ordinary particles search for local optima, but more importantly, they serve as the medium for information exchange between subgroups, modifying the direction of the population search and further improving population diversity. Within a subgroup, unlike a learning strategy that gathers too many particles locally, this strategy integrates the information of the locally optimal particles from different subgroups to obtain more information and help avoid local optima. However, learning from too much information may make the update direction too fuzzy, which may counteract the convergence of the particles. Considering that the locally optimal particles have the greatest probability of finding the optimal solution in their subgroup, their information provides valuable guidance toward the optimal solution. Therefore, the $cgbest_c^d$ of each subgroup contribute their average information to guide the update of the locally optimal particles (see (14)). This approach improves the transmission of the optimization information between subgroups, further increases population diversity, and prevents particles from falling into local optima.
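The two strategies can be written compactly by switching the social attractor per particle. Below is an illustrative vectorized sketch (our naming: labels holds each particle's subgroup index as an integer array, and is_local_best is a boolean array flagging the locally optimal particles):

import numpy as np

def apso_update(x, v, pbest, cgbest, labels, is_local_best,
                w=0.72, c1=1.49, c2=1.49):
    # Ordinary particles follow their subgroup best, eq. (13); locally
    # optimal particles follow the average of all subgroup bests,
    # eq. (14). x, v, pbest: (m, D) arrays; cgbest: (C, D) array.
    m, D = x.shape
    r1, r2 = np.random.rand(m, D), np.random.rand(m, D)
    mean_best = cgbest.mean(axis=0)              # average over C subgroups
    target = np.where(is_local_best[:, None], mean_best, cgbest[labels])
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (target - x)
    return x + v, v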

APSO-XGBoost.
This paper proposes a credit scoring model based on XGBoost optimized by APSO. The model is divided into three parts: data preprocessing, feature engineering, and model training. First, the data preprocessing involves standardizing the datasets and marking missing values. Second, the feature engineering is based on the feature importance scores obtained from the model with initial hyperparameters; according to the rank of feature importance, redundant features are removed. Last, the model is built with the selected features and the hyperparameters tuned by APSO. The flow chart is shown in Figure 1. The process is described in detail below.

Data Preprocessing.
Data preprocessing is divided into two steps: data standardization, namely, 0-1 scaling, and missing value processing. Although tree-based algorithms are not affected by scaling, feature normalization can greatly improve the accuracy of classifiers, especially those based on distance or margin calculations. Therefore, standardizing the datasets in preprocessing renders the model more accurate and persuasive. The training set is described as $D = \{(x_i, y_i)\}$, where $y_i \in \{0, 1\}$ represents the target value: $Y = 0$ represents a poor application, and $Y = 1$ represents a good application. If $x$ is a certain feature, it is transformed by 0-1 scaling as follows:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}, \tag{15}$$

where $x'$ expresses the standardized value. Credit data often have missing values. XGBoost has its own sparsity-aware split algorithm, which can learn the best way to deal with missing values and is more suitable for modelling than traditional methods of handling missing values. If there are outliers and noise in the data, standardization can indirectly limit the influence of outliers, and centring can deal with extreme values.
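A one-line sketch of the scaling step (our own helper; NaNs are left untouched so that XGBoost's sparsity-aware splits can still route them):

import numpy as np

def min_max_scale(x):
    # 0-1 scaling of one feature column, per eq. (15).
    x = np.asarray(x, dtype=float)
    lo, hi = np.nanmin(x), np.nanmax(x)
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)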

Feature Selection.
First, the relative feature importance scores are calculated with the initial hyperparameters, and the redundant features are discarded by the feature selection algorithm. The importance of feature selection lies in eliminating redundant features, highlighting effective features, improving the calculation speed, and eliminating the influence of adverse features on the prediction results.

Training the Model.
To keep the hyperparameters consistent with the training dataset as much as possible, we perform cross-validation on the dataset. We tested several cross-validation methods and, from the experimental results, ultimately decided to use 10-fold cross-validation to divide the datasets. First, the hyperparameters of the XGBoost model, including the maximum tree depth, subsample ratio, column subsample ratio, minimum child weight, maximum delta step, and gamma, are the optimization targets, and the position of each particle is randomly initialized in the hyperparameter search space. Second, the particles are divided into adaptive subpopulations; this step is achieved by calculating the local density of the particles and the distances to particles of higher local density. According to the values encoded by the position of a particle, we assign the hyperparameters of the XGBoost model and feed the validation data into the model for prediction. Last, the loss function on the validation dataset is the fitness function of the particles. Credit scoring is, in simplified form, a two-class problem; as our model labels are 0 and 1, the logistic loss is expressed as follows:

$$L_{\log} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big], \tag{16}$$

where $p$ is the predicted value and $y$ represents the actual value. The particles are divided into ordinary particles and optimal particles in accordance with their fitness values. Different update strategies update the information of the corresponding particles, and the algorithm checks whether the termination condition is reached; if so, we obtain the optimized values. If not, based on the positions of the particles, the model reclassifies the population, calculates the fitness values, and updates the position information of each particle until the termination condition is reached. The optimal hyperparameters are then utilized to construct the model, and training and prediction are carried out with the data.
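A sketch of the particle fitness function is given below (our illustrative implementation using the xgboost and scikit-learn packages; the ordering of the hyperparameters inside the position vector is an assumption):

import xgboost as xgb
from sklearn.model_selection import cross_val_score

def fitness(position, X, y):
    # 10-fold cross-validated logistic loss, eq. (16), of the XGBoost
    # model encoded by one particle position.
    max_depth, subsample, colsample, min_child_w, max_delta, gamma = position
    model = xgb.XGBClassifier(
        max_depth=int(round(max_depth)),
        subsample=subsample,
        colsample_bytree=colsample,
        min_child_weight=min_child_w,
        max_delta_step=max_delta,
        gamma=gamma,
        n_estimators=100,
        eval_metric="logloss",
    )
    scores = cross_val_score(model, X, y, cv=10, scoring="neg_log_loss")
    return -scores.mean()   # lower fitness is better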
The diagram of the model architecture is shown in Figure 2. The algorithm steps are as follows:
(1) Divide the datasets: training data for training the model and validation data for prediction. Initialize the adaptive PSO algorithm, and divide the particle swarm into subgroups according to equations (11) and (12).
(2) Take the logistic loss function as the fitness function, and calculate the fitness value of each particle according to (16). Build the XGBoost model with the hyperparameters determined by the current best particle, train it, predict on the datasets, and update the fitness values from the resulting losses.
(3) Determine the position of the global optimal particle pbest and the local optimal particle gbest according to the result of the population division and the fitness values of the particles.
(4) According to (13) and (14), update the positions of the ordinary particles and the locally optimal particles, respectively.
(5) Judge whether to terminate: when the maximum number of iterations n is reached, return the optimal values of the hyperparameters; otherwise, return to step (2).
(6) Build the XGBoost model with the optimal hyperparameters and calculate the evaluation indexes.
As shown in Figure 4, when the number of iterations is 200, the curve of APSO tends to converge, so we set the maximum number of iterations to n = 200. The pseudocode of APSO-XGBoost is given in Figure 3, and a short description follows. First, the initialization process of APSO starts: the particles with positions X and velocities V in the search space are initialized as the basic input. Subsequently, the adaptive partitioning algorithm starts: the density ρ and distance δ of each particle are calculated, the particles with high ρ and δ are selected as the centres, the remaining particles are assigned, and the swarm is divided into C subgroups. XGBoost has its node set I initialized on the training data, and the hyperparameters of the model are determined by the current optimal values. During modelling, the gain of each node is calculated, and the node with the highest score is split to generate a tree. After the modelling is completed, the type of each particle is updated based on its fitness value (the loss function), and then the two types of particles are updated using the different strategies in each subgroup. After exiting the optimization process, XGBoost with the best hyperparameter settings is applied to the test set.

Experimental Setup
In this section, we evaluate the performance of the APSO-XGBoost model by experiments. First, the credit datasets are introduced. Second, we introduce the baseline models and set up the hyperparameters of the XGBoost group. Third, the evaluation measures are given. Finally, we compare the proposed APSO-XGBoost model with other commonly used ML models in terms of these measures.

Datasets.
In this section, the performance of the model is verified with UCI credit datasets. Two credit datasets, German and Australian, from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets.html) are used. In addition to the above datasets, P2P credit data from two platforms (LendingClub in the US and Ren.com in China) were also used to verify the effectiveness of our model in providing decision support for P2P lending businesses and to verify the generalization of the model. The formulation of the data is listed in Table 2.

Feature Engineering.
Feature engineering selects important characteristics and removes irrelevant features to build the model. It can greatly reduce the curse of dimensionality, improve operational efficiency, reduce the difficulty of the learning task, make the model simpler, and reduce the computational complexity.
By calculating the importance of the features, the features that are more favorable to the model are selected. XGBoost offers three measures of feature importance: gain, cover, and frequency. The gain accounts for the contribution of each feature to the model: when a feature is used to split a node, its contribution relative to the other features is calculated, and the higher the value is, the more important the feature is for prediction. The cover reflects the average number of samples affected by the splits on a feature. The frequency is the percentage of times a feature is used for splitting across the trees of the model, indicating the relative importance of the relevant properties. Therefore, we choose the gain as the feature importance measure. XGBoost adopts a sequential forward selection (SFS) procedure according to the rank of the importance of the features, adding features into the dataset to form subsets one by one. With the default hyperparameters of XGBoost, the subset that minimizes the logistic loss under 10-fold cross-validation is selected as the final subset of features.
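The procedure can be sketched as follows (our illustrative implementation; we assume the scikit-learn wrapper of recent xgboost versions, whose feature_importances_ reports gain-based importances by default):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def select_features_by_gain(X, y):
    # Rank features by gain importance, then grow the subset one
    # feature at a time, keeping the subset with the lowest
    # 10-fold cross-validated logistic loss.
    base = xgb.XGBClassifier(n_estimators=100, eval_metric="logloss")
    order = np.argsort(base.fit(X, y).feature_importances_)[::-1]
    best_loss, best_subset = np.inf, order[:1]
    for k in range(1, len(order) + 1):
        loss = -cross_val_score(
            xgb.XGBClassifier(n_estimators=100, eval_metric="logloss"),
            X[:, order[:k]], y, cv=10, scoring="neg_log_loss").mean()
        if loss < best_loss:
            best_loss, best_subset = loss, order[:k]
    return best_subset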

Baseline Models.
To explore the performance differences between XGBoost and other methods, the first set of experiments is arranged as follows. Ten competitive algorithms that are widely used for credit scoring, including the statistical algorithm LR, traditional ML algorithms (DT, NN, SVM, and RF), and ensemble learning algorithms (AdaBoost, AdaBoost-NN, Bagging-DT, Bagging-NN, and GBDT), were tested for comparison with default XGBoost. Moreover, to reveal the effectiveness of APSO for optimizing the XGBoost model, the proposed model is compared with XGBoost models optimized by several hyperparameter optimization methods: XGBoost-GS, XGBoost-RS, XGBoost-TPE, and PSO-XGBoost. To ensure the fairness of the experiment, while taking into account the accuracy and computational complexity of the models, the number of iterations of each hyperparameter optimization algorithm is 200. The number of computations is the number of times that the optimization algorithm calls the objective function in an iteration; since the GS method is based on traversal, its number of computations is determined by the parameter space. The search space and number of computations of each optimization method are described in Table 3. The baseline models are described below. NN: this model refers to neural principles, where each neuron can be regarded as a learning unit. The NN is constructed from many neurons, organized into an input layer, hidden layers, and an output layer. These neurons take certain characteristics as input and produce output according to their own model. For the NN model of credit scoring, the input is the applicant's attribute vector, and the output is the default or nondefault category, +1 or −1.
The weight assigned to each attribute varies according to its relative importance, and the weights are adjusted iteratively to make the predicted output closer to the actual target.
SVM: by mapping the feature vector of an instance to a point in space, the purpose of the SVM is to draw a line that best distinguishes the two types of points. In credit scoring, the data are correctly divided into the default type and the nondefault type. The SVM finds the hyperplane that separates the data; to best distinguish them, the sum of the distances from the closest points on both sides of the hyperplane is required to be as large as possible.
DT: this is commonly used in credit scoring. The DT is a process for classifying instances based on features, where each internal node represents a judgement on an attribute, each branch represents the output of a judgement result, and each leaf node represents a classification result. The classification result in credit scoring is default or nondefault. The decision-making algorithm loops over all splits and selects the best-partitioned subtree based on the error rate and the cost of misclassification.
RF [40]: the principle of the random forest is to generate multiple decision tree models, where each tree learns and makes predictions independently on a bootstrap sample. The results of the decision trees are aggregated, and the result with the most votes is selected as the final prediction.
LR: the statistical technique of logistic regression is usually used to solve binary classification problems, and it is regarded as the benchmark of credit scoring [28]. Regression analysis describes the relationship between the independent variables x and the dependent variable Y and predicts the dependent variable Y; LR adds a logistic function on top of the regression. In credit scoring, the probability of future results is predicted from the historical data of applicants. The goal of LR is to judge whether the feature vector of a customer belongs to a certain category, default or nondefault.
Bagging and AdaBoost are two popular methods in ensemble learning.
Bagging [41]: this method randomly samples, with replacement, a training set with m samples to generate multiple new datasets and then builds a model for each bootstrap sample. These models are merged through a certain strategy, such as majority voting.
Among the boosting methods, the most important include AdaBoost and GBDT.
AdaBoost [42]: the misclassified samples in the training set are assigned greater weights. AdaBoost sequentially combines weak learners into a strong learner to boost the model. Based on the error of the previous classifier, AdaBoost adjusts the weights of the samples: through the iterations, it gives larger weights to the misclassified samples in the training set.
GBDT: this is an iterative decision tree algorithm in which the conclusions of all trees are summed to obtain the final answer. The core of the algorithm is to use the steepest-descent approximation, namely, the negative gradient of the loss function, as an approximation of the residual to fit a regression or classification tree. The configurations of the parameters of the traditional ML algorithms and ensemble learners employed in the experiments are displayed in Table 4.
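For reference, the following sketch builds such a baseline pool with scikit-learn (the settings are illustrative assumptions of our own, not the values of Table 4; AdaBoost-NN is omitted because scikit-learn's MLP does not accept the sample weights AdaBoost requires):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)

baselines = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(),
    "NN": MLPClassifier(max_iter=1000),
    "SVM": SVC(probability=True),               # probabilities for Brier score
    "RF": RandomForestClassifier(n_estimators=100),
    "Bagging-DT": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    "Bagging-NN": BaggingClassifier(MLPClassifier(max_iter=500), n_estimators=10),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),   # DT stumps by default
    "GBDT": GradientBoostingClassifier(n_estimators=100),
}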

Evaluation Measures.
In credit scoring, the average accuracy is one of the most popular evaluation indices, representing the overall performance of the model. To better explore the ability of the model to distinguish between nondefault and default applications, the type I error and type II error from the confusion matrix (see Table 5) are often applied to evaluate the prediction performance in detail. A type I error occurs when a default loan application is wrongly classified as a nondefault loan application; conversely, a type II error occurs when a nondefault application is misclassified as default. TP and TN in the confusion matrix represent the numbers of correctly classified good borrowers and bad borrowers, respectively; FP and FN represent the numbers of misclassified loan applications. The formulas are defined as follows. The average accuracy (ACC):

$$\text{ACC} = \frac{TP + TN}{TP + TN + FP + FN}. \tag{17}$$

The type I error:

$$\text{Type I error} = \frac{FP}{FP + TN}. \tag{18}$$

The type II error:

$$\text{Type II error} = \frac{FN}{FN + TP}. \tag{19}$$

The Brier score measures the accuracy of the predicted probabilities and the calibration of the prediction performance. The Brier score ranges from 0 to 1, with the interval representing probabilistic predictions from perfect to poor. It is defined as follows:

$$\text{Brier} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2, \tag{20}$$

where $N$ is the number of samples and $p_i$ and $y_i$ denote the probability prediction and the true label, respectively, of sample $i$. The F1-score takes into account both the precision and recall of a classification model. It is the harmonic mean of these two indicators, and it ranges from 0 to 1:

$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, \tag{21}$$

where precision is the proportion of predicted positive cases that are truly positive,

$$\text{precision} = \frac{TP}{TP + FP}, \tag{22}$$

and recall is the proportion of all positive cases that are correctly predicted,

$$\text{recall} = \frac{TP}{TP + FN}. \tag{23}$$
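These measures can be computed directly from the predictions; below is a small helper of our own for all five (labels follow the convention above: 1 = good/nondefault, 0 = bad/default):

import numpy as np

def credit_metrics(y_true, y_pred, p_pred):
    # ACC, type I/II error, Brier score, and F1, per eqs. (17)-(23).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    type1 = fp / (fp + tn)      # default misclassified as nondefault
    type2 = fn / (fn + tp)      # nondefault misclassified as default
    brier = np.mean((np.asarray(p_pred) - y_true) ** 2)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, type1, type2, brier, f1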

Comparisons among Hyperparameter Optimization Methods.

To demonstrate the performance of the tuning strategies, Figure 4 shows the convergence curves of the average loss function of the parameter optimization methods over the four credit datasets. The ordinate represents the average minimum value of the loss function, and the abscissa represents the number of iterations. It can be seen from the figure that the convergence speed of GS and RS is slower. TPE converges faster, and its result is better than those of GS and RS. The loss value reached by PSO is slightly lower than that of TPE, and its convergence speed is much faster; PSO and APSO have the fastest convergence speeds among the parameter optimization methods. However, PSO enters the convergence state early, which leads to an unsatisfactory final error rate and failure to find the global optimum. APSO achieves the best convergence performance of all the methods: it continues to decline after many iterations, indicating that the optimization mechanism helps the subgroups increase diversity, prevents particles from clustering locally, and helps the particles escape local optima.

Results and Discussion
The ACC is one of the most mainstream and intuitive indicators and indicates the overall predictive ability of the model. In credit scoring, even a small improvement in performance may help institutions reduce their losses by catching a large number of applications at risk of default.
On the German dataset (see Table 6), the XGBoost model obtained the best ACC (76.85%). Although the single classifier NN and Bagging-NN perform best in terms of type I error, the XGBoost model achieves better results on the other four indicators. For the type II error, the SVM achieves the best value, but the SVM does not maintain balanced classification performance, as it has the worst type I error rate of all the models. In addition, the XGBoost model achieved a satisfactory Brier score and F1-score.
For the Australia dataset (see Table 7), the XGBoost credit scoring model achieves the best performance on all measurements except the type I error rate. LR has the best performance for the type I error but the worst for the type II error, which shows that, on this balanced dataset, LR is inclined to classify samples as bad applications; this is not acceptable for exploring potential applicants. XGBoost obtained the best Brier score, and its F1-score is higher than that of the second-best classifier, Bagging-NN, by 0.8%.
On the balanced P2P-LC dataset (see Table 8), the ACC of XGBoost is the best (66.70%). XGBoost also achieves the lowest Brier score and the best F1-score, as XGBoost modified the training loss function.
XGBoost assigns weights to the samples of the minority class, which affects each sample's contribution to the loss during the training process and in turn changes the first-order and second-order derivatives.
For the P2P-ren dataset (see Table 9), the XGBoost model is second only to GBDT regarding the type I error rate; all its other indicators exceed those of GBDT. LR, SVM, and AdaBoost obtain a low type II error rate but suffer very serious imbalance-related misclassification, as their type I error reaches 90%. The type I error and type II error of XGBoost are both acceptable, which demonstrates that XGBoost sufficiently learns the imbalanced data.
Overall, the ensemble classifiers perform better than the single models on the four datasets on average. An ensemble classifier generates better prediction accuracy by combining individual weak classifiers with different performances. Under the premise of ensuring accuracy, XGBoost is able to take various indicators into account. In credit scoring, each indicator is meaningful for making decisions. For example, a low type I error indicates that the model has a stronger ability to identify nonperforming loan applications, while a model with a low type II error is likely to maximize the profits of credit institutions by better exploiting potential creditworthy applicants. XGBoost achieved the highest F1-score over the four credit datasets, which indicates that the model is less affected by data imbalance than the other models. The satisfactory performance on the Brier score indicates that the XGBoost model obtained more accurate probability predictions than the other baseline models. Table 10 summarizes the results of the XGBoost models using different optimization algorithms, which indicates that XGBoost optimized by APSO achieved promising performance on the four credit datasets.
For the German dataset, the APSO-XGBoost algorithm achieved the best prediction accuracy with a score of 77.48%, which is 0.8% higher than the default XGBoost model. The proposed model ranks second in type I error, behind only XGBoost-GS, but it surpasses XGBoost-GS on all other indicators. The model obtains acceptable results for type II error and Brier score. Moreover, the proposed model obtained the best F1-score, which demonstrates that it can adequately learn the unbalanced data.
For the Australia dataset, APSO-XGBoost obtained the highest ACC and the lowest type I error rate. A low type I error indicates that the model is less likely to misjudge a poor credit application as good credit and is therefore likely to reduce an enterprise's high-debt-rate risk. APSO-XGBoost achieves the third-lowest Brier score, behind those of XGBoost-RS and XGBoost-TPE. In addition, the F1-score of APSO-XGBoost is the best, owing to the more appropriate settings obtained by APSO. This finding can be explained by the XGBoost hyperparameter Maximum Delta Step, which resists the problem of unbalanced data to a certain extent; APSO found a more accurate value for this hyperparameter.
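For reference, a hedged sketch of how this hyperparameter is set in practice (the value 1 is a commonly suggested starting point, not the value found by APSO in the paper, and the toy data are illustrative):

```python
import numpy as np
import xgboost as xgb

# Imbalanced toy data standing in for the credit labels (~10% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (rng.random(500) < 0.1).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "max_delta_step": 1,   # 0 disables the cap; small values (1-10) bound
                           # each leaf's output and steady the logistic
                           # updates on imbalanced data
    "eval_metric": "logloss",
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```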
For the P2P-LC dataset, APSO-XGBoost outperforms the other optimized credit models on all indicators, which indicates that its hyperparameter settings are more reasonable than those of the other models and promote the match between the model's tree structure and the characteristics of the credit data.
The APSO-XGBoost model shows a clear improvement over PSO-XGBoost on every indicator; APSO can further exploit the predictive performance of the XGBoost model.
In the P2P-ren dataset, APSO-XGBoost also achieves the best results on all indicators except the type II error rate. Although the default, nonoptimized model has the lowest type II error rate, its type I error rate is too high because the unbalanced data contain more labels with good credit, and poor hyperparameter settings prevent the model from learning sufficiently to resolve this situation. APSO-XGBoost also achieves the best probability prediction capability, as it obtains the lowest Brier score.
APSO-XGBoost performs better than most models on average. The results show that APSO can promote the matching of the XGBoost structure with the characteristics of the credit data. In terms of ACC and type errors, PSO-XGBoost is slightly better than XGBoost-TPE. XGBoost-TPE also achieved acceptable performance on the various datasets, especially the German dataset, which demonstrates that it is an acceptable model for credit scoring.
The APSO-XGBoost model is better than the PSO-XGBoost model, which indicates that APSO overcomes the local optimality of the particle swarm to a certain extent and allows the particles to obtain hyperparameters that render the model more accurate.

Statistical Significance Analysis.
We use the Friedman nonparametric test and the Nemenyi post hoc test to verify the statistical difference in accuracy among the models over the four credit datasets. The null hypothesis is that there is no difference between the models. First, the Friedman test rejects the null hypothesis, which means that the models differ significantly. Next, the Nemenyi post hoc test is applied to evaluate the differences among individual models: when the average ranks of two models differ by at least the critical difference (CD), the models are considered significantly different. With the number of samples N = 4, the number of algorithms K = 16, and a significance level of 0.1, the critical value $q_{\alpha}$ is 3.196 from the Nemenyi critical value table. The critical difference is then computed as 10.759 according to formula (25):

$$\mathrm{CD} = q_{\alpha}\sqrt{\frac{K(K+1)}{6N}}, \tag{25}$$

where N is the number of samples and K is the number of algorithms.
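A quick check of this arithmetic (a worked example, not part of the original experiments):

```python
import math

K, N, q_alpha = 16, 4, 3.196          # 16 models, 4 datasets, alpha = 0.1
cd = q_alpha * math.sqrt(K * (K + 1) / (6 * N))
print(round(cd, 3))                   # 10.759
```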
In Figure 5, the bars represent the average ranks of the models over the four datasets. APSO-XGBoost is the benchmark model, with an average rank of 1; the horizontal line represents the significance threshold, i.e., the benchmark's rank plus the CD (1 + 10.759 = 11.759). The XGBoost models rank better than the other models, and our model ranks first. The models whose bars exceed the horizontal line, namely, AdaBoost, SVM, AdaBoost-NN, DT, and NN, are significantly worse than the APSO-XGBoost model.

Conclusions
The core of ensemble learning is to gather a series of learners and combine them into a strong learner; this reduces errors and achieves the goal of accurate prediction. The accuracy of financial credit prediction directly determines the revenue of a financing institution. The ensemble model XGBoost is based on GBDT and adds regularization constraints, an improved loss function, and other refinements; it obtains acceptable prediction results in the field of credit prediction. The structure of the model depends on the XGBoost hyperparameter settings. To obtain the optimal hyperparameters and improve the accuracy of the model, our model construction and evaluation process is divided into four main steps: first, the data are preprocessed and standardized. Second, feature engineering removes redundant features.
Third, we use the APSO algorithm to optimize the hyperparameters and apply the optimal hyperparameters obtained from training. Last, the trained model is tested on two UCI datasets and two P2P datasets with four evaluation indicators. The results show that the proposed APSO-XGBoost model has the best ranking among all classifiers in terms of ACC. The experimental results also show that APSO-XGBoost improves performance on the credit datasets compared with XGBoost optimized by RS, GS, and TPE. A comparison with the other baseline models shows that our model is superior in credit scoring; furthermore, the type I error rate of our model is almost always lower than that of the XGBoost models tuned with the other hyperparameter optimizations. This shows that the choice of hyperparameter optimization matched with the classifier is reasonable and effective, and our model offers stronger guidance to assist institutions in avoiding the risk of loss.
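To make the third step concrete, the following is a minimal sketch (assumed, not the authors' implementation) of the fitness function a particle swarm minimises: each particle encodes a hyperparameter vector, and its fitness is the cross-validated error of the resulting XGBoost model. The parameter names and bounds are illustrative, not the paper's exact search space.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def fitness(particle, X, y):
    """Map a particle position to the cross-validated error to minimise."""
    model = XGBClassifier(
        learning_rate=float(particle[0]),    # e.g. searched in [0.01, 0.3]
        max_depth=int(round(particle[1])),   # e.g. searched in [2, 10]
        subsample=float(particle[2]),        # e.g. searched in [0.5, 1.0]
        n_estimators=100,
        eval_metric="logloss",
    )
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    return 1.0 - acc                         # loss the swarm minimises
```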
In future work, we will mainly focus on imbalanced data research.
Data are the basis of credit scoring research. A large portion of public data is class-imbalanced, which limits the research scope. Most related works have been oriented towards supervised algorithms or the construction of ensemble methods for increasing classification accuracy. Supervised algorithms require a large amount of data. Semisupervised learning (SSL) provides an efficient solution, since only a small amount of labelled data is needed to achieve similar or even better learning ability and more robust classification. Karlos et al. [43] compared many semisupervised schemes against well-known supervised algorithms on data from Greek firms. The promising results show that SSL is effective in mitigating the imbalance of enterprise bankruptcy data, which also suggests that SSL is feasible for the same problem in financial credit scoring. Moreover, Fazakis et al. [44] propose a novel SSL scheme for self-labelled classification, which utilizes the efficacy of XGBoost to build a model with high accuracy and robustness. Based on this research, we plan to apply XGBoost to estimate the labels of unlabelled data to overcome class imbalance and to dynamically add the most confident predicted instances to the initial labelled training set by evaluating the classifier in each iteration, as sketched below.
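A minimal sketch of such a self-labelling loop (assumptions: a simple confidence threshold and a fixed number of rounds; the eventual method may evaluate the classifier per iteration differently):

```python
import numpy as np
from xgboost import XGBClassifier

def self_train(X_lab, y_lab, X_unlab, rounds=5, threshold=0.95):
    """Self-training: XGBoost labels the pool; confident instances move in."""
    model = XGBClassifier(n_estimators=100, eval_metric="logloss")
    for _ in range(rounds):
        model.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        probs = model.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= threshold   # most confident instances
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, probs[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]                # shrink the unlabelled pool
    return model
```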
Cost-sensitive learning (CSL) is another widely used solution for imbalanced data classification; it is an ML method that minimizes the cost of misclassification. CSL implemented by instance weighting or instance relabelling is considered a data-level solution [6]. Akila [45] proposed a transaction window bagging model that uses a parallel bagging approach and incorporates an incremental learning model, a cost-sensitive base learner, and a weighted voting-based combiner to effectively address the data imbalance in credit data. Moreover, applying the idea of cost-sensitive boosting, we plan to introduce cost items into XGBoost and to incorporate cost as a performance metric alongside regular indicators to trade off the decision-making, which helps institutions determine whether to grant a loan to an applicant.
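One way such a cost item could be monitored as a performance metric (a sketch under an assumed cost matrix, using the custom_metric hook of recent xgboost versions; older versions use the feval argument instead):

```python
import numpy as np
import xgboost as xgb

COST_FN, COST_FP = 5.0, 1.0   # assumed costs: missing a bad loan vs.
                              # rejecting a good applicant

def total_cost(preds, dtrain):
    """Custom eval metric: total misclassification cost at threshold 0.5."""
    y = dtrain.get_label()
    yhat = (preds > 0.5).astype(float)   # preds are probabilities here
    cost = (np.sum((y == 1) & (yhat == 0)) * COST_FN
            + np.sum((y == 0) & (yhat == 1)) * COST_FP)
    return "total_cost", float(cost)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (rng.random(300) < 0.15).astype(int)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain,
                    num_boost_round=50, evals=[(dtrain, "train")],
                    custom_metric=total_cost)
```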

Data Availability
The data used to support the findings of this study are available at https://www.lendingclub.com/info/demand-and-credit-profile.action and https://www.renrendai.com/.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.