Comparison of Two Methods Forecasting Binding Rate of Plasma Protein

Using descriptors calculated from molecular structure, the binding rates of plasma protein (BRPP) of seventy diverse drugs are modeled with a quantitative structure-activity relationship (QSAR) technique. Two algorithms, the heuristic algorithm (HA) and the support vector machine (SVM), are used to establish linear and nonlinear models, respectively, to forecast BRPP. Empirical analysis shows that both HA and SVM perform well, with cross-validation correlation coefficients $R^2_{cv}$ of 0.80 and 0.83, respectively. A comparison of the two shows that SVM is more stable and more robust in forecasting BRPP.


Introduction
Pharmacokinetics (PK) uses mathematical models and equations to study how the quantity of a medicine changes with time [1,2]. PK is divided into several areas, including the extent and rate of absorption, distribution, metabolism, and excretion [3]. It is mainly used to build mathematical expressions that track an individual's in vivo dose or drug regimen over time, to work out PK parameters, and to design and adjust individual regimens so that treatment is effective and safe [4,5]. After a drug is absorbed into the bloodstream, most of it is bound to plasma protein. The percentage of a drug bound to plasma protein at a therapeutic dose is called the binding rate of plasma protein (BRPP) [6]. In this paper, BRPP is treated as stable because it is a value measured in healthy people at a normal dose. Free drug can diffuse into tissues across lipid membranes, and it can be filtered by the tubules or metabolized by the liver [7]. Consequently, the binding of drug to protein can markedly affect drug distribution and elimination and decrease drug potency at the target site. Some studies indicate that the pharmacodynamics and pharmacokinetics of a drug, as well as its bioavailability, are mainly influenced by its protein binding [7-10]. The higher the BRPP, the longer the half-life. In R&D projects, about forty percent of candidate compounds are abandoned because of poor PK parameters, such as slow absorption, low bioavailability, high BRPP, rapid metabolism leading to a short duration of drug action, and toxic metabolites or slow excretion leading to accumulated toxicity in the body [11]. For these reasons, compounds that are active in vitro lose their value for development as clinical drugs [12]. Therefore, feasible drug design must consider both pharmacodynamic and pharmacokinetic characteristics to achieve the best balance between them.
Quantitative structure-pharmacokinetic relationship (QSPKR) [13,14] and quantitative structure-activity relationship (QSAR) [15,16] studies are an important part of drug design. They have also been used successfully to forecast drug characteristics such as metabolism, toxicity, and actual bioavailability. Computer-aided drug design (CADD) is becoming an important research field in new drug development [17]; it applies existing knowledge of drug molecules and biological targets to find and design new kinds of drug molecules by theoretical simulation and calculation [18]. At present, the study of PK models is a very active area in the pharmaceutical industry. Because BRPP is influenced by many factors, the causal mechanism linking it to the molecular structure of a drug is not clear. At the present level of scientific knowledge, it is still difficult to clarify the relation between them from basic principles. Classical forecasting methods (e.g., multiple linear regression) face more and more difficulties. Artificial intelligence methods, however, provide stronger tools to analyze existing PK data and construct a QSPKR between BRPP and the molecular structure variables of a drug. In particular, results from practical applications in other fields indicate that the support vector machine (SVM) outperforms artificial neural networks (ANN) and largely overcomes the overfitting and local-minimum problems of traditional neural networks.
To find a new method for constructing a PK model of BRPP, we establish QSAR models with the heuristic algorithm (HA) and SVM using the BRPP of seventy drugs and test the forecasting performance and stability of the SVM model.
The remainder of the paper is organized as follows. The principles of the research methods are introduced in Section 2. The empirical study is presented in Section 3. Finally, conclusions are drawn in Section 4.

Data Source and Structure Parameters.
All experimental data for the seventy drugs and their BRPP values are taken from [19]. The models are built on a training set of fifty-six drugs chosen at random. The remaining fourteen drugs form the test set, which is used to examine the stability and forecasting performance of the two models. All compounds are first optimized with the molecular mechanics method (MM+) in HyperChem 4.0 and then geometrically optimized further with the semiempirical AM1 method. The optimized molecular structures are processed in MOPAC 6.0, and the results are transferred to the CODESSA program to calculate five kinds of descriptors (independent variables): composition, topological, geometric, electrostatic, and quantum chemical descriptors.
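The random division into fifty-six training drugs and fourteen test drugs can be sketched in Python as follows. This is only an illustration: the identifiers `drug_01` ... `drug_70`, the function name, and the fixed seed are assumptions, and the paper does not state how its random selection was performed.

```python
import random

def split_train_test(items, n_train, seed=0):
    """Randomly partition items into a training list and a test list."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(items)         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical identifiers standing in for the seventy drugs.
drug_ids = [f"drug_{i:02d}" for i in range(1, 71)]
train_ids, test_ids = split_train_test(drug_ids, n_train=56)
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the same fourteen test drugs for every model, which is what allows two models to be compared on identical data.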

Heuristic Algorithm.
HA can search thoroughly through a large number of molecular descriptors in the CODESSA software and establish an optimal linear regression equation [20]. HA has to control the collinearity of molecular descriptors [21]. For example, if the correlation coefficient of any two descriptors exceeds 0.8, they are not included in the same model. The optimal model is built by rapid filtering and selection of descriptors rather than by considering all possible combinations of descriptors. In a pretreatment step, HA eliminates descriptors according to four rules: (1) descriptors that are not available for every compound; (2) descriptors whose values vary little across the compounds; (3) descriptors with an $F$-test value less than 1.0 in an equation; (4) descriptors with a $t$-test value less than a specified value [16]. The heuristic regression method (HRM) ranks molecular descriptors in descending order of the correlation coefficient of the model. At each step, the descriptor with the largest correlation coefficient among the remaining descriptors is introduced, and the procedure repeats until the end. The performance of a model is judged by the multiple correlation coefficient ($R^2$), the $F$-test value ($F$), the standard deviation ($s$), and so forth [22]. The stability of a model is tested by the correlation coefficient $R^2_{cv}$ of leave-one-out (LOO) cross-validation [23]. Briefly, one sample is removed from the data set and forecasted by a new model built with the same descriptors; this is repeated until every sample has been removed and forecasted once, and the correlation coefficient between the forecasted and observed values is then calculated. Generally, HRM is faster and gives models of higher quality than other methods, which makes it the first choice in practice [24].
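The collinearity control and the leave-one-out check described above can be illustrated with a small standard-library Python sketch. It is a simplified stand-in, not the CODESSA procedure: descriptors are ranked by their correlation with the response, a descriptor is skipped when its correlation with an already chosen one exceeds 0.8, and the squared correlation between LOO forecasts and observations is reported. All names (`select_descriptors`, `fit_ols`, `loo_r2`) and the toy data are assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient (inputs must not be constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def select_descriptors(columns, y, max_pair_r=0.8):
    """Greedy choice: best-correlated first, skipping collinear descriptors."""
    ranked = sorted(columns, key=lambda k: abs(pearson(columns[k], y)), reverse=True)
    chosen = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[c])) <= max_pair_r for c in chosen):
            chosen.append(name)
    return chosen

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y], solved by Gauss-Jordan elimination.
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def loo_r2(rows, y):
    """Squared correlation between leave-one-out forecasts and observations."""
    preds = []
    for k in range(len(rows)):
        coef = fit_ols(rows[:k] + rows[k + 1:], y[:k] + y[k + 1:])
        preds.append(coef[0] + sum(c * v for c, v in zip(coef[1:], rows[k])))
    return pearson(preds, y) ** 2

# Toy data: d2 is nearly collinear with d1, d3 is almost independent of both.
columns = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "d2": [1.1, 2.1, 2.9, 4.2, 5.1, 5.9],
    "d3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
y = [7.2, 4.9, 10.1, 9.8, 16.1, 15.0]
chosen = select_descriptors(columns, y)
rows = [list(r) for r in zip(*(columns[name] for name in chosen))]
r2_cv = loo_r2(rows, y)
print(chosen, round(r2_cv, 3))
```

Only one of the two nearly collinear descriptors survives the filter, which mirrors the 0.8 rule. Constant descriptors must be removed beforehand (rule 2), since their correlation is undefined.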
In this paper, the errors of the heuristic regression results are measured by the root mean square (RMS) error:

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{n}},$$

where $y_i$ is the target value, $\hat{y}_i$ is the forecasted value, $n$ is the number of compounds, and $i$ denotes a compound.
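As a worked illustration of the RMS error, the following Python sketch assumes the usual definition: the square root of the mean squared difference between target and forecasted values. The function name and the four BRPP values are made up for the example.

```python
import math

def rms_error(targets, forecasts):
    """Root mean square error between target and forecasted values."""
    if len(targets) != len(forecasts) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(targets, forecasts)]
    return math.sqrt(sum(squared) / len(targets))

# Made-up BRPP percentages, for illustration only.
observed = [90.0, 50.0, 20.0, 75.0]
forecast = [85.0, 55.0, 30.0, 75.0]
rms = rms_error(observed, forecast)
print(rms)  # squared errors 25 + 25 + 100 + 0 = 150, so RMS = sqrt(150 / 4)
```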

Support Vector Machine.
The principle of SVM is to map the input vector into a high-dimensional feature space by a predetermined nonlinear mapping and then construct the optimal hyperplane in that space [25]. Thus, the problem is transformed into a quadratic programming problem. Both the objective function and the classification function involve the data only through inner products. With a kernel function, the complex calculations in the high-dimensional space can be avoided, because the inner products are computed by a function in the original space. Consequently, selecting an appropriate inner-product kernel $K(x_i, x_j)$ realizes a linear calculation after a nonlinear transformation without increasing computational complexity [26]. Support vector machine regression (SVMR) maps the variables into a high-dimensional feature space by a nonlinear mapping $\Phi$, and the regression is carried out in that space [27].
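The claim that a kernel evaluates a feature-space inner product from the original space can be checked on a small case. The sketch below uses a degree-2 polynomial kernel on two-dimensional inputs, chosen only because its feature map can be written out explicitly. It is not the kernel used in this paper, and the function names are illustrative.

```python
import math

def poly_kernel(u, v):
    """Degree-2 homogeneous polynomial kernel K(u, v) = (u . v)^2."""
    return sum(a * b for a, b in zip(u, v)) ** 2

def feature_map(u):
    """Explicit map Phi for 2-D inputs with Phi(u) . Phi(v) = (u . v)^2."""
    x1, x2 = u
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

u, v = (1.0, 2.0), (3.0, 4.0)
in_original_space = poly_kernel(u, v)
in_feature_space = sum(a * b for a, b in zip(feature_map(u), feature_map(v)))
print(in_original_space, in_feature_space)  # both equal 121 up to rounding
```

The kernel needs one two-dimensional inner product, whereas the explicit route first builds three-dimensional images. For kernels whose feature space is very high-dimensional, only the first route is practical.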
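Assume that each input sample is a $d$-dimensional vector; the $n$ samples and their output values are denoted as follows:

$$(x_1, y_1), \ldots, (x_n, y_n), \quad x_i \in \mathbb{R}^d,\ y_i \in \mathbb{R}.$$

Regression analysis, also called function estimation, is a statistical process for estimating the relationships among variables. Consider a given sample set $\{(x_i, y_i),\ i = 1, \ldots, n\}$, where $x_i$ is the independent factor (descriptor) and $y_i$ is the dependent factor (BRPP). A regression model relates $y$ to a function of $x$, $y = f(x)$. If the function $f(x)$ is linear, the regression is called linear regression; otherwise it is called nonlinear regression [28]. Unlike classification, SVMR has only one kind of sample point, and the optimal hyperplane is the one that minimizes the total deviation between all sample points and the hyperplane. Thus, the sample points lie between two borders. If the $\varepsilon$-insensitive function is taken as the error function, finding the optimal regression hyperplane becomes a convex quadratic programming problem when the distances of all sample points to the sought hyperplane are not more than $\varepsilon$ [25]. Namely,

$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad \left|y_i - (w \cdot x_i + b)\right| \le \varepsilon, \quad i = 1, \ldots, n.$$

When the distances of several sample points to the hyperplane are more than $\varepsilon$, the deviation under the $\varepsilon$-insensitive function plays the same role as the slack variable introduced in SVM classification. Introducing the fault-tolerant penalty $C$ and the slack variables $\xi_i$ and $\xi_i^*$, the convex quadratic programming problem of finding the optimal regression hyperplane can be transformed as follows:

$$\min_{w,\,b,\,\xi,\,\xi^*}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right) \quad \text{s.t.} \quad \begin{cases} y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i, \\ (w \cdot x_i + b) - y_i \le \varepsilon + \xi_i^*, \\ \xi_i,\ \xi_i^* \ge 0, \quad i = 1, \ldots, n. \end{cases}$$

Then, the linear regression function of the optimal hyperplane is

$$f(x) = \sum_{\mathrm{S.V.}} \left(\alpha_i - \alpha_i^*\right)(x_i \cdot x) + b,$$

where $\alpha_i$, $\alpha_i^*$, and $b$ can be calculated through the constraints, and S.V. denotes the support vectors. The parameters of the optimal hyperplane can be determined by carrying out the above solution process in a MATLAB program. The final result indicates that the optimal regression hyperplane is determined only by the sample points that are support vectors. If the points $x_i$ and $x_j$ in the sample space are replaced by their mapped images $\Phi(x_i)$ and $\Phi(x_j)$ through a kernel function, let $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, and let $n$ denote the number of points [29]; then,

$$f(x) = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^*\right) K(x_i, x) + b.$$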
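The role of the insensitive error function and the penalty can be made concrete with a short Python sketch. It assumes the standard soft-margin form of support vector regression, in which the objective is half the squared norm of the weight vector plus the penalty constant times the summed deviations beyond the insensitive band. The function names, the symbol names `eps` and `C`, and the toy data are illustrative and not taken from the paper.

```python
def eps_insensitive(residual, eps):
    """Deviation beyond the insensitive band; zero inside the band."""
    return max(0.0, abs(residual) - eps)

def svr_objective(w, b, X, y, eps, C):
    """0.5 * ||w||^2 + C * sum of eps-insensitive deviations (linear model)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    deviations = 0.0
    for xi, yi in zip(X, y):
        forecast = sum(wj * xj for wj, xj in zip(w, xi)) + b
        deviations += eps_insensitive(yi - forecast, eps)
    return regularizer + C * deviations

# Toy one-descriptor data; the candidate hyperplane is f(x) = 2x.
X = [[1.0], [2.0], [3.0]]
y = [2.05, 4.5, 5.0]
objective = svr_objective(w=[2.0], b=0.0, X=X, y=y, eps=0.1, C=10.0)
print(objective)  # deviations 0 + 0.4 + 0.9 = 1.3, so about 2 + 10 * 1.3 = 15
```

The first point lies inside the band and contributes nothing, which is why only points on or outside the band (the support vectors) shape the solution. A solver would minimize this objective over `w` and `b`; the sketch only evaluates it for one candidate.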

HA Model.
CODESSA can calculate five hundred to six hundred descriptors for each molecule, including composition, topological, geometric, electrostatic, and quantum chemical descriptors. Composition descriptors reflect the composition of a molecule, including the numbers of atoms, bonds, and rings and the molecular weight. Topological descriptors describe how the atoms in a molecule are connected, including the Wiener index, Randic index, and Kier-Hall index. Geometric descriptors reveal the size and shape of a molecule, including moment of inertia, molecular volume, and surface area. Electrostatic descriptors describe the distribution of electric charge in a molecule, including maximum and minimum partial charges, polarity, and charged partial surface area (CPSA). Quantum chemical descriptors describe the charge distribution in a molecule and the energies of its molecular orbitals, including reaction index, dipole moment, and the energies of the lowest unoccupied molecular orbital (LUMO) and the highest occupied molecular orbital (HOMO), which have an important effect on molecular reactivity, electrostatic interaction between molecules, and interaction between molecular orbitals.
After HA filtering, six parameters are introduced into the model. Their interrelations and forecasting results are shown in Tables 1 and 2. In the HA model, the correlation coefficient $R^2$ = 0.85, the $F$-test value $F$ = 63.64, the RMS error = 12.24, and the cross-validation correlation coefficient $R^2_{cv}$ = 0.80 (see Figure 1 and Table 3). There are six descriptors in the HA linear model. WPSA-3 weighted PPSA (Zefirov's PC), HASA-1/TMSA (Zefirov's PC), and PNSA-2 total charge weighted PNSA are electrostatic descriptors. ALFA polarizability (DIP), Tot point-charge compd. of the molecular dipole, and final heat of formation are quantum chemical descriptors. $\alpha$-polarizability is the molecular polarizability, which reflects molecular volume and the interaction between agent and molecule.
The polarizability scale is closely related to hydrophobicity and electrophilicity. In the model, only the sign of the $\alpha$-polarizability parameter is positive, which indicates that polarizability has a positive effect on the binding of drug to plasma protein. Hydrophobicity directly influences the binding of drug to plasma protein. Because protein consists of charged polypeptides, the stronger the electrophilicity, the easier the binding with plasma protein. Final heat of formation (FHF) is related to molecular stability and expresses molecular reactivity. A change in FHF influences molecular structure and function and thereby the binding of the drug molecule to plasma protein. WPSA-3 weighted PPSA describes the partial positive surface charge. PNSA-2 total charge weighted PNSA is weighted by the total charge and determined by the surface area and functional groups of the molecule; it reflects interactions between polar molecules. On the surface of plasma protein, there are enzymes with specific functional groups. There is also a series of receptors. As a ligand, the drug binds to the receptors on the surface of plasma protein.

SVM Model.
To compare the performance of SVM with that of the HA model, we use the same training set, test set, and descriptors as in the HA model. In an SVM model, the choice of kernel function is crucial. There are four kinds of kernel functions: linear, polynomial, Gaussian, and sigmoid. When the size and dimension of the samples are small, all four kernel functions perform well. Otherwise, the Gaussian kernel is a better choice [30]; it is the kernel most commonly used in SVMR, namely,

$$K(u, v) = \exp\left(-\gamma \|u - v\|^2\right),$$

where $\gamma$ is a constant and $u$ and $v$ are two independent variables. $\gamma$ controls the generalization ability of SVM by adjusting the shape of the Gaussian function. Because the size and dimension of the samples are large in our study, the Gaussian kernel function is the preferred choice. The forecasting results are shown in Table 2. After adjusting $\gamma$, $\varepsilon$, and $C$ simultaneously, we obtain three useful results.
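How the kernel constant changes the shape of the Gaussian function can be seen in a few lines of Python. The sketch assumes the common parameterization in which the kernel equals the exponential of minus the constant times the squared distance between the two inputs; the name `gamma`, the function name, and the sample points are illustrative.

```python
import math

def gaussian_kernel(u, v, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||u - v||^2); assumed form."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

u, v = (0.0, 0.0), (1.0, 1.0)           # squared distance is 2
similarities = [gaussian_kernel(u, v, g) for g in (0.01, 0.1, 1.0)]
print(similarities)                      # decreases as gamma grows
```

A small constant gives a wide, flat kernel in which distant samples still look similar (a smoother model), while a large constant makes each sample influence only its close neighbors, which raises the risk of overfitting.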
First, as shown in Figure 2, the error is minimal when $\gamma$ is 0.035. The optimal value of $\varepsilon$ depends on the type of data and also affects the number of support vectors. Because the $\varepsilon$-insensitive function controls the border for the whole training set, the choice of $\varepsilon$ is very important for SVM. Second, the relation between $\varepsilon$ and the error is shown in Figure 3. When $\varepsilon$ = 0.173, the error is the least. Third, another important parameter, $C$, controls the trade-off between minimizing the training error and maximizing the margin of the hyperplane. If $C$ is too small, training is insufficient and it is very difficult to reach the optimum. If $C$ is too large, overfitting occurs. The relation between $C$ and the error is shown in Figure 4. When $C$ is equal to 130, the error is the least.
According to the above training results, when the parameters $\gamma$, $\varepsilon$, and $C$ are equal to 0.035, 0.173, and 130, respectively, the forecasting ability of the model is the most robust and stable. As shown in Figure 5, the RMS error is 11.40. For the training and test sets, $R^2$ is 0.97 and 0.92, respectively. The overall $R^2_{cv}$ is 0.83. Comparing HA with SVM, their cross-validation correlation coefficients ($R^2_{cv}$) are 0.80 and 0.83, respectively, and their RMS errors are 12.24 and 11.40, respectively. A higher $R^2$ value and a lower RMS value indicate better predictability of the dependent variable from the independent variables [31]. Therefore, we conclude that the SVM model has better stability and more robust forecasting ability for BRPP than the HA model and that it is a good tool for constructing a PK model.

Conclusion
In this paper, we construct an HA model and an SVM model to forecast BRPP. Using descriptors calculated from molecular structure, we find that both the nonlinear QSAR model based on SVM and the linear QSAR model based on HA give satisfactory forecasting results. A comparison of the two methods shows that the nonlinear model based on SVM is more stable and more robust in forecasting BRPP than the linear model based on HA. Therefore, the SVM model is a more effective tool for studying QSAR and the BRPP of a drug.
However, because the comparison is based primarily on the analysis of one real dataset, our research has certain limitations. The conclusions need to be supported by more datasets, and the performance of SVM in predicting BRPP should be studied further on additional datasets in the future.