Study on Evaluation Model of Chinese P2P Online Lending Platform Based on Hybrid Kernel Support Vector Machine

Accurate evaluation of the risk level and operation performance of P2P online lending platforms helps these information intermediaries function better and protects investors' interests. This paper proposes a hybrid kernel support vector machine (SVM) improved by a genetic algorithm (GA), together with an index system, to construct such an evaluation model. The hybrid kernel combines a polynomial function and a radial basis function; its kernel parameters and the weight of the two kernels are optimized by the GA method, which offers strong global optimization and rapid convergence. Empirical testing based on cross-sectional data from the Chinese P2P lending market demonstrates the superiority of the improved hybrid kernel SVM model. Its classification accuracy for credit risk level and operation quality is higher than that of the single kernel SVM models as well as the hybrid kernel model with empirical parameter values.


Introduction
The Chinese P2P online lending industry operated without supervision and regulation for more than five years, during which most platforms acted as credit intermediaries, providing credit enhancement measures such as principal guarantees and third-party guarantees [1,2]. With the increasing number of platform bankruptcies and disappearances, investors have become more and more sensitive to platform characteristics in their decision-making. Risk management focusing on platforms is expected to be a new trend in the regulation of the P2P online lending industry [3,4]. The Interim Measures for the Administration of the Business Activities of Online Lending Information Intermediary Institutions, issued jointly by four ministries and commissions of the Chinese government in August 2016, clarified the content of P2P lending, the regulatory system, and business rules; subsequently, a series of detailed rules and regulations on third-party depository, filing and registration, and information disclosure were promulgated to standardize the development of the P2P online lending industry [5,6]. Accurate evaluation of the risk level and operation performance of platforms not only provides a solid basis for the practical measures adopted by regulatory authorities but also acts as an important reference for investors' decision-making. Therefore, constructing an advanced evaluation model for P2P online lending platforms is of vital practical significance [7].
Risk level and operation performance evaluation have been hot topics in recent research given the unstable market environment. Tsolas applied a new series two-stage DEA method to evaluate the credit risk of enterprises [8]. Luo Sirong et al. introduced a regression spline-based discrete time survival model to assess the comprehensive performance of credit card applicants [9]. Dahira et al. presented a feature selection-based hybrid-bagging algorithm (FS-HB) for improved credit risk evaluation [10]. With respect to Chinese P2P platforms, existing studies usually adopt statistical methods such as factor analysis, principal component clustering, and the analytic hierarchy process. Zhu Zongyuan and Wang Jingyu applied the analytic hierarchy process and data envelopment analysis to measure the technical, scale, and overall efficiencies of 22 P2P online lending platforms, finding those efficiencies to be generally low [11]. Shan Peng et al. applied the factor analysis method to score and rank the comprehensive strength and risk levels of the sample platforms [12]. Yan Xin et al. constructed a complex evaluation index system for P2P online loan platforms and used the two-step and Kohonen models to cluster 516 platforms, classifying them to provide references for investors' decision-making [13]. Liu Ao et al. determined optimal weights by means of the teaching and learning optimization algorithm and ranked the efficiencies of 100 P2P online loan platforms [14].
There are mainly two defects in the existing studies. Firstly, most studies rank platforms according to a certain criterion. The boundary of platforms suitable for investment remains ambiguous, so intuitive support for investors' decision-making is missing. Secondly, studies adopting statistical models overemphasize data modeling, so the accuracy of model-based prediction suffers as the data dimension grows.
Therefore, a machine learning algorithm integrating GA and hybrid kernel SVM is proposed in this study. By classifying risk level and operation quality, the improved algorithm sets a clear boundary for whether a platform is credible enough for investors to trade on. Moreover, applying the GA method and hybrid kernel SVM not only reaches a higher classification accuracy than statistical and traditional machine learning models but also suits the analysis of large data volumes. The rest of this paper is organized as follows. Section 2 discusses the design of the evaluation model based on the GA optimized hybrid kernel SVM. Section 3 presents the simulation experiment results, including the labeling process by the principal component method and the platform evaluation process by the optimized hybrid kernel SVM method. Section 4 concludes the paper with a summary and future research directions.

Establishment of SVM Hybrid Kernel.
The principle of SVM as a classification algorithm is to find the separating hyperplane $w^T x + b = 0$ with the maximum margin, that is, the hyperplane that maximizes the distance between itself and the nearest points $x$. A slack variable, namely a nonnegative parameter $\xi_i$, and a penalty factor $C$ are introduced to describe inseparability losses and the penalty for sample misclassification. With training samples $(x_i, y_i)$, $i = 1, \ldots, n$ ($x_i$: input index; $y_i$: classification tag value), the basic model can be described as

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n. \tag{1}$$

The kernel function maps the data implicitly to a high-dimensional feature space so that problems that are linearly inseparable in the original low-dimensional space may be solved; its form and parameter values significantly influence the classification accuracy of the SVM algorithm. Kernel functions may generally be divided into two types, global and local kernels; the former have strong generalization capacity but weak learning ability, while the latter are the opposite. Among common kernel functions, the polynomial and sigmoid types are global kernels, and the RBF type is a local kernel. In this study, the polynomial and RBF kernel functions were linearly combined, with weight coefficient λ, to obtain a hybrid kernel function (equation (2)) that has both learning and generalization capacities and overcomes the limitations of the single kernel functions.
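A minimal sketch of such a polynomial-RBF hybrid kernel is given below for illustration. It assumes the polynomial kernel takes the form (a·⟨x, z⟩ + c)^d, that the RBF kernel is scaled as exp(−‖x − z‖²/σ²), and that λ weights the RBF term while (1 − λ) weights the polynomial term; the function names are hypothetical, and the default values are the empirical parameter values quoted later in the text.

```python
import math


def poly_kernel(x, z, a=1.0, c=1.0, d=3):
    """Polynomial (global) kernel: (a * <x, z> + c) ** d."""
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return (a * dot + c) ** d


def rbf_kernel(x, z, sigma2=10.0):
    """RBF (local) kernel; the 1/sigma2 scaling is an assumption."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / sigma2)


def hybrid_kernel(x, z, lam=0.5, a=1.0, c=1.0, d=3, sigma2=10.0):
    """Weighted sum of both kernels; which term receives lam is an assumption."""
    return lam * rbf_kernel(x, z, sigma2) + (1.0 - lam) * poly_kernel(x, z, a, c, d)


def gram_matrix(X, **params):
    """Kernel (Gram) matrix of the hybrid kernel over a list of samples."""
    return [[hybrid_kernel(xi, xj, **params) for xj in X] for xi in X]
```

Because both component kernels are symmetric, the resulting Gram matrix is symmetric as well, and a matrix of this kind is what a solver accepting precomputed kernels would consume.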

Optimization of SVM Parameters.
When the hybrid kernel function is applied for classification, the parameters to be determined include λ (hybrid kernel weight coefficient), a, c, and d (polynomial kernel parameters), σ² (RBF kernel parameter), and C (penalty factor). Firstly, the hybrid kernel weight coefficient is determined by the principle of minimizing feature-space distances between similar samples and maximizing feature-space distances between dissimilar samples, which was put forward by Wang Xingfu and Yu Lu [2]. The evaluation function L(λ) is defined as the difference between the spacing of any two dissimilar samples and that of any two similar samples; $\phi_1$ and $\phi_2$ represent the mappings corresponding to the RBF and polynomial kernel functions, respectively. The distance between samples i and j is expressed in terms of these mappings (equation (3)), where $x_i$ stands for the sample value and $y_i$ stands for the sample type.
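As an illustration of the characteristic distance idea, the sketch below evaluates a candidate λ. It rests on several assumptions that are not taken from the paper: the squared feature-space distance is computed through the kernel identity K(x, x) − 2K(x, z) + K(z, z), L(λ) is taken as the mean squared distance over dissimilar pairs minus the mean over similar pairs, and λ weights the RBF term of the hybrid kernel. All function names are hypothetical.

```python
import math
from itertools import combinations


def hybrid_kernel(x, z, lam, a=1.0, c=1.0, d=3, sigma2=10.0):
    """Hybrid kernel; lam on the RBF term and the 1/sigma2 scaling are assumptions."""
    dot = sum(p * q for p, q in zip(x, z))
    sq_dist = sum((p - q) ** 2 for p, q in zip(x, z))
    return lam * math.exp(-sq_dist / sigma2) + (1.0 - lam) * (a * dot + c) ** d


def feature_distance_sq(x, z, lam, **kw):
    """Squared distance in the induced feature space, via the kernel identity."""
    return (hybrid_kernel(x, x, lam, **kw)
            - 2.0 * hybrid_kernel(x, z, lam, **kw)
            + hybrid_kernel(z, z, lam, **kw))


def evaluation_function(X, y, lam, **kw):
    """Assumed L(lam): mean dissimilar-pair distance minus mean similar-pair distance.

    Requires at least one pair with equal labels and one pair with different labels.
    """
    same, diff = [], []
    for i, j in combinations(range(len(X)), 2):
        d2 = feature_distance_sq(X[i], X[j], lam, **kw)
        (same if y[i] == y[j] else diff).append(d2)
    return sum(diff) / len(diff) - sum(same) / len(same)
```

A larger value of this function indicates that, under the candidate λ, dissimilar samples lie farther apart than similar ones in the feature space.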
Plugging equation (2) into equation (3) expresses the distance in terms of λ and the two kernel functions.
Secondly, GA with its global optimization ability is used to optimize the kernel parameters, and its basic principles are as follows:
(1) Initialization of the SVM parameters and setting of the search space for the kernel parameters and the penalty factor, followed by initialization of the GA parameters: population size, encoding length, crossover and mutation probabilities, and maximum number of iterations.
(2) Random generation of the individuals of the initial population, which are binary coded according to the following equation:

$$x = a + \frac{b - a}{2^{l} - 1} \, M,$$

where M represents a binary code string (read as an integer); x represents the independent variable, whose value range is [a, b]; and l represents the encoding length.
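The binary coding in item (2) can be sketched as follows. This is an illustrative sketch that assumes the usual linear mapping of an l-bit string onto the interval [a, b]; the bounds are named lo and hi here to avoid clashing with the polynomial kernel parameter a, and all function names are hypothetical.

```python
import random


def decode(bits, lo, hi):
    """Map a binary string of length l to a real value in [lo, hi]."""
    m = int(bits, 2)  # integer value of the binary code string M
    return lo + (hi - lo) * m / (2 ** len(bits) - 1)


def random_individual(length, rng):
    """Draw one random binary code string of the given encoding length."""
    return "".join(rng.choice("01") for _ in range(length))


def init_population(size, length, rng):
    """Create the initial population of randomly coded individuals."""
    return [random_individual(length, rng) for _ in range(size)]
```

For example, `init_population(20, 16, random.Random(0))` draws 20 individuals of 16 bits each; the all-zero string decodes to the lower bound and the all-one string to the upper bound of the search interval.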

Construction of the Comprehensive Evaluation System and Index Preprocess.
This study focused on evaluating the monthly operation level of a P2P platform with reference to the industry average level. Taking data availability and index stability into account, the evaluation indexes were selected in the following four dimensions:
(1) Transaction level: it was decomposed into two subdimensions, trading scale and cost of capital, in which 3 indexes (namely, turnover, average reference rate of return, and net capital inflow) were examined.
(2) Platform popularity: it primarily examines the platform's attractiveness to investors and borrowers through brand effects, public opinion communication, and other channels, and it is directly reflected by the numbers of investors and borrowers and the investment and loan amounts per capita.
(3) Loan decentralization: explosive growth in trading volume and a high concentration of borrowing transactions lead to heavy payment pressure on platforms. This study focused on the degree of decentralization of borrowers; thus, two indexes (the per capita amount to be paid and the percentage of the amount to be paid by the top ten borrowers) were selected to represent it.
(4) Liquidity level: it refers to the ability to liquidate any asset at a reasonable price. For any asset, the worse its liquidity is, the less active its transactions are. The average loan term is generally used to reflect the liquidity level: the shorter the term is, the stronger the fund liquidity is.
Our platform and industry data are derived from the October 2017 statistics of the website http://www.wangdaizhijia.com, and 463 valid samples were obtained after deleting samples with incomplete data. Software environment: Windows 7/SPSS 19.0/Matlab R2016b.
The statistical description of the original indexes is shown in Table 1.
Original indexes are preprocessed in two steps: relativization and reversing of negative indexes. Due to the imperfect supervision system of the Chinese P2P industry, regulatory authorities have set neither caps nor floors for platform operation indexes. In this paper, the ratio of each index's absolute value to the industry average serves as the input index for the kernel principal component analysis, which represents a relative level against the industry in an economic sense. Due to the lack of industry statistics for the index X9, a proportion of 50% is used here, which is the cap proportion authoritatively set for commercial banks in China.
The ten original indexes consist of positive and negative ones. The latter include the per capita amount to be paid, the percentage of the amount to be paid by the top ten borrowers, and the average loan term, whose absolute values are negatively correlated with the operation level of a platform. Thus, the reciprocal of each original negative index is adopted so that all index values vary in the same direction as the platform operation level.
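The two preprocessing steps can be sketched as below. This is a minimal illustration, assuming that relativization is applied first and the reciprocal is then taken of the relative value, and that all values are nonzero; the function names are hypothetical.

```python
def relativize(values, industry_average):
    """Express each raw index value as a ratio to the industry average."""
    return [v / industry_average for v in values]


def reverse_negative(values):
    """Take reciprocals so that a larger value always means a better level.

    Assumes every value is nonzero.
    """
    return [1.0 / v for v in values]


def preprocess(column, industry_average, negative=False):
    """Relativize one index column and reverse it if it is a negative index."""
    relative = relativize(column, industry_average)
    return reverse_negative(relative) if negative else relative
```

Under these assumptions, a platform exactly at the industry average maps to 1 for every index, whether the index is positive or negative.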

Classification Evaluation Mechanism Based on Principal Component Analysis.
To begin with, sample data are scored and labeled using the principal component analysis method to generate the target outputs for the supervised learning of the SVM algorithm.
The corresponding results are shown in Table 2. The top six components, whose accumulated variance contribution rate reaches 85%, are extracted as the principal components, namely, F1, F2, ..., and F6 in sequence. The score matrix is shown in Table 3.
Each component is expressed as a linear combination of the indexes X according to the following equation, whose coefficient matrix is the score matrix of principal components in Table 3:

$$F_k = \sum_{j=1}^{10} s_{kj} X_j, \quad k = 1, \ldots, 6,$$

where $s_{kj}$ is the score coefficient of index $X_j$ for component $F_k$. The comprehensive score function was established as follows, as a weighted sum of the scores of all principal components, where the weight $w_k$ is the variance contribution rate of the corresponding principal component:

$$F = \sum_{k=1}^{6} w_k F_k.$$
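The scoring described above can be sketched in a few lines. The sketch assumes that the score coefficient matrix (one row per principal component) and the variance contribution rates have already been obtained from a statistics package; the function names and the toy numbers used to exercise it are hypothetical and are not the values of Tables 2 and 3.

```python
def component_scores(x, coeffs):
    """Score of each principal component as a linear combination of the indexes."""
    return [sum(s * xj for s, xj in zip(row, x)) for row in coeffs]


def comprehensive_score(x, coeffs, weights):
    """Weighted sum of component scores, weighted by variance contribution rate."""
    scores = component_scores(x, coeffs)
    return sum(w * f for w, f in zip(weights, scores))
```

For a toy case with two components and three indexes, `comprehensive_score([1, 1, 1], [[1, 0, 1], [0, 2, 0]], [0.6, 0.25])` combines component scores of 2 and 2 with weights 0.6 and 0.25.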

Scientific Programming
When X(i) is taken as 1 for every i, the industry average score is calculated as $\bar{F} = 0.812$. When the comprehensive score $F \in (-\infty, \bar{F})$, the platform is below the industry average level; it belongs to the "ALERT" type platforms and is labeled "−1". In contrast, when X(i) is taken as 10 for every i, the "EXCELLENT" type score is calculated as $F^* = 9.973$. When the comprehensive score $F \in (F^*, +\infty)$, the platform is labeled "1". When the comprehensive score $F \in (\bar{F}, F^*]$, the platform belongs to the "GENERAL" type platforms and is labeled "0". Applying this rule to the principal component scores gives 107 "EXCELLENT" type platforms, 334 "GENERAL" type platforms, and 22 "ALERT" type platforms.
In addition, in order to assess the early warning ability of the optimized evaluation model, a second classification standard is constructed. A binary classifier gives a definite answer to whether investors could trade on the platform based on its risk level, which differs from the ternary classifier built above, whose aim is to choose the most outstanding platforms. "EXCELLENT" and "GENERAL" platforms are collectively called "NONALERT" platforms and labeled "1", while "ALERT" platforms are labeled "0". Accordingly, there are 22 "ALERT" platforms and 441 "NONALERT" platforms.
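The two labeling rules can be sketched as follows, using the thresholds 0.812 and 9.973 quoted above. The text leaves a score exactly equal to the industry average unassigned; treating it as "GENERAL" is an assumption of this sketch, and the function names are hypothetical.

```python
F_BAR = 0.812   # industry average score quoted in the text
F_STAR = 9.973  # "EXCELLENT" threshold score quoted in the text


def ternary_label(score):
    """Return -1 (ALERT), 0 (GENERAL), or 1 (EXCELLENT) for a comprehensive score."""
    if score < F_BAR:
        return -1
    if score > F_STAR:
        return 1
    return 0  # includes score == F_BAR, which is an assumption


def binary_label(score):
    """Return 0 for ALERT platforms and 1 for NONALERT platforms."""
    return 0 if ternary_label(score) == -1 else 1
```

The binary label is derived from the ternary one so that both classification standards stay consistent with each other.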

Classification Evaluation Results for Determining SVM Parameters Based on Empirical Values.
Empirical parameter values were first selected to test the accuracies of the single and hybrid kernel SVM models. Taking λ = 0.5, a = c = 1, d = 3, σ² = 10, and C = 1, the 5-fold cross-validation results for binary and ternary classification are shown in Table 4.
As shown in Table 4, the classification accuracy of the polynomial-RBF hybrid kernel SVM evaluation model with empirical parameters is slightly better than that of the four common single kernel models in both binary and ternary classification. However, the results are still unsatisfactory, especially for ternary classification. GA is therefore introduced to optimize the hybrid kernel weight coefficient and kernel parameters to achieve higher classification accuracy.
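The 5-fold cross-validation accuracy used in these comparisons can be sketched generically as below. The sketch does not train an SVM: any classifier is passed in as a `fit_predict` callable, and the majority-class classifier shown is only a hypothetical stand-in to make the example self-contained. Pooling the correct predictions over all folds, rather than averaging per-fold accuracies, is also an assumption.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(fit_predict, X, y, k=5, seed=0):
    """Fraction of samples predicted correctly when each fold is held out once."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)


def majority_class(X_train, y_train, X_test):
    """Stand-in classifier: always predict the most frequent training label."""
    top = max(set(y_train), key=y_train.count)
    return [top] * len(X_test)
```

In practice the stand-in would be replaced by a function that fits an SVM with the chosen kernel on the training part and returns its predictions for the held-out part.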

Optimization of SVM Parameters Based on GA.
Parameters are optimized with the LIBSVM toolkit, and gamma = a = 1/σ² is taken when applying the hybrid kernel function. SVM parameters are optimized by the GA in accordance with the following steps:
Input: the 463 samples after feature extraction.
Step 2: λ is solved based on the characteristic distance method.
Step 3: the SVM classification accuracy under 5-fold cross-validation is calculated and defined as the fitness function of the GA.
Step 4: selection is performed by the roulette wheel selection method, so that the greater the fitness of an individual, the higher its probability of being selected. The generation gap is set to 0.9, which means that 90% of the individuals are copied to the next generation. The probability of individual i being selected is

$$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j},$$

where $f_i$ is the fitness of individual i and N is the population size.
Step 5: crossover is performed by the two-point crossover method. Two crossover points are set randomly in the two paired encoded strings, between which genes are exchanged. The crossover probability is pc = 0.7.
Step 6: mutation is performed by the discrete mutation method, with the mutation probability taken as pm = 0.01.
Step 7: the current optimal solution is kept, and the offspring are reinserted into the parent population to generate a new population. If the number of iterations has not reached the maximum of 100, the procedure returns to Step 2; otherwise, Step 8 is performed.
Step 8: the decoded parameters (λ, a, c, d, σ², C) and the classification accuracy are output.
The warning capability of the optimized model for "ALERT" platforms is investigated first. The best binary classification accuracy (fitness) during the evolution process is shown in Figure 1. By the fiftieth generation, the accuracy reaches 98.9201% and converges to that value, which is significantly higher than the accuracy obtained with empirical parameters in Table 4. The ROC curve of the binary classifier is shown in Figure 2, from which we can see that the AUC value reaches 0.9817. This shows that the hybrid kernel SVM evaluation model optimized by the genetic algorithm has outstanding warning ability for "ALERT" platforms. Optimal parameter values of the binary classifier are shown in Table 5.
The fitness curve of the optimized ternary classifier is shown in Figure 3. By the 26th generation, the ternary classification accuracy reaches 96.7603% and converges to that value. This accuracy is significantly higher than those of the single kernel (72.14%-75.59%) and hybrid kernel (76.89%) support vector machines with empirical parameters presented in Table 4. This shows that the GA optimized hybrid kernel SVM algorithm is effective in accurately classifying the risk level and operation quality of Chinese P2P online lending platforms. Parameter values of the ternary classifier during evolution are shown in Table 6.
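The genetic operators of Steps 4 to 6 can be sketched on binary code strings as follows. This is an illustrative sketch rather than the toolbox routines used in the experiments: selection draws with replacement in proportion to fitness, and the discrete mutation is assumed here to be an independent bit flip; the function names are hypothetical, while pc = 0.7 and pm = 0.01 are the values given in the steps.

```python
import random


def roulette_select(population, fitness, n, rng):
    """Roulette wheel selection: draw n individuals with probability f_i / sum(f)."""
    return rng.choices(population, weights=fitness, k=n)


def two_point_crossover(p1, p2, rng, pc=0.7):
    """With probability pc, swap the genes between two random cut points."""
    if rng.random() >= pc:
        return p1, p2
    i, j = sorted(rng.sample(range(len(p1) + 1), 2))
    return p1[:i] + p2[i:j] + p1[j:], p2[:i] + p1[i:j] + p2[j:]


def mutate(bits, rng, pm=0.01):
    """Flip each bit independently with probability pm (assumed mutation form)."""
    return "".join(
        ("1" if b == "0" else "0") if rng.random() < pm else b for b in bits
    )
```

A generation then consists of selecting parents, pairing them for crossover, mutating the offspring, and reinserting them into the population while keeping the current best individual.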

Conclusions
How P2P platforms operate is closely related to investors' fund safety and their investment decisions, which creates requirements for rating and classification of platforms. An improved hybrid kernel SVM evaluation model is put forward to effectively increase the accuracy of traditional SVM algorithm. A hybrid kernel function is introduced in which the weight is solved by the characteristic distance method and the parameter value is determined by the GA algorithm.
The test on transaction data indicates that this improved model has strong learning and generalization abilities, and its prediction accuracy is significantly higher than that of either the single kernel SVM models or the hybrid kernel model with empirical parameter values, which makes the evaluation and classification of Chinese P2P online lending platforms more accurate and more objective.
Nonetheless, the premature convergence defect of the GA is not solved in this study. The improved hybrid kernel model has limited ability to explore unknown regions of the search space and a tendency to converge to a local optimal solution. Future work could further develop the optimization in these respects.

Data Availability
The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.