A QSAR Study Based on SVM for the Compound of Hydroxyl Benzoic Esters

Hydroxyl benzoic esters are preservatives widely used in food, medicine, and cosmetics. To explore the relationship between the molecular structure and antibacterial activity of these compounds and to predict the activity of compounds with similar structures, Quantitative Structure-Activity Relationship (QSAR) models of 25 hydroxyl benzoic esters are built from quantum chemical parameters and molecular connectivity indexes using a support vector machine (SVM) in the R language. The External Standard Deviation Error of Prediction (SDEPext), the fitting correlation coefficient (R2), and the leave-one-out cross-validation coefficient (Q2LOO) are used to evaluate the reliability, stability, and predictive ability of the models. The results show that R2 and Q2LOO of the 4 nonlinear models are greater than 0.6 and that SDEPext is 0.213, 0.222, 0.189, and 0.218, respectively. Both the correlation coefficient and the standard deviation are better than those of the multiple linear regression (MLR) model (R2 = 0.421, RSD = 0.260). The reliability, stability, robustness, and external predictive ability of the models are good, particularly for the model with linear kernel function and eps-regression type. This model can predict the antimicrobial activity of compounds with similar structures within the applicability domain.


Introduction
QSAR is used to study the relationship between molecular structure and biological activity or physicochemical characteristics, to reveal the quantitative relationship, to predict the activity of unknown compounds, and to direct the synthesis of new materials [3][4][5]. QSAR is considered a promising technology and is widely used at present because it compensates for missing experimental data, reduces the cost of testing, and enables high-throughput prediction and screening [6]. Many international organizations and regulatory agencies have supported and promoted the use of QSAR and regard it as a possible alternative to animal experiments. Health Canada, the United States Food and Drug Administration (FDA), the Environmental Protection Agency (EPA), the European Union, and the Organization for Economic Cooperation and Development (OECD) apply QSAR to identify potential health hazards and for screening and prioritization [7]. After years of development, QSAR has become a frontier topic in medicinal chemistry, environmental chemistry, life science, analytical chemistry, computational chemistry, and even pesticide research [8][9][10][11].
Hydroxyl benzoic esters are an important class of preservatives, widely used in medicine, food, cosmetics, pesticides, and other fields [12]. At present, there are about 60 kinds of food preservatives in the world [13]. Benzoic acid and sorbic acid are produced in China, but their use is limited because of the high toxicity of benzoic acid and the high price of sorbic acid. Hydroxyl benzoic esters offer high efficiency, low toxicity, compatibility, and other advantages; their antibacterial performance is stronger than that of benzoic acid and sorbic acid because they contain a phenolic hydroxyl group [14]. It is therefore of great significance to study and apply the antibacterial activity of hydroxyl benzoic esters.

Research Status of SVM in QSAR.
SVM is a machine learning algorithm based on statistical learning theory proposed by Cortes et al. [15][16][17]. SVM can be used for pattern recognition, regression analysis, function fitting, and so forth because it possesses favorable mathematical properties, such as the uniqueness of the solution and independence from the dimension of the input space. The optimal solution of SVM is considered superior to that of traditional learning methods. In recent years, SVM has been applied to QSAR studies of compounds. Hou et al. [18] investigated the QSAR of the antimalarial activity of PfDHODH inhibitors by generating four computational models using multiple linear regression (MLR) and SVM on a dataset of 255 PfDHODH inhibitors. Sharma et al. [19] used SVM and MLR to study the activity of HIV-1 capsid inhibitors and found the SVM model more efficient in prediction. Khuntwal et al. [20] used MLR and SVM to develop QSAR models for a dataset of 34 tetrahydrobenzothiophene derivatives. Zhiming et al. [21] used ridge regression (RR) and SVM to build QSAR models of bitter tasting thresholds (BTT) and cytotoxic T lymphocyte (CTL) and predicted independent test data. Their results showed that the fitting, LOOCV, and external prediction accuracies were superior to the results reported in the existing literature. Zhang et al. [22] took benzene compounds as the research object and combined a quantitative description of molecular structure with MLR or the nonlinear regression method SVM to build acute toxicity and mutagenicity QSAR models of benzene compounds. By comparing the linear and nonlinear QSAR models, Zhang et al. found that the stability and predictive ability of the nonlinear QSAR models are better than those of the multiple linear QSAR models.
In the literature, there are very few QSAR studies of hydroxyl benzoic esters. Jiang et al. [23] used MLR to build a QSAR model that predicts MIC and t0.5 well within a limited range of carbon atom numbers (1-4 C atoms on the ester chain for MIC and 1-3 for t0.5). Qiu et al. [24] optimized the molecular structures of eleven p-hydroxyl benzoic esters using the density functional theory (DFT) B3LYP method of quantum chemistry and then used stepwise multiple linear regression to select the descriptors and generate the best prediction model relating the structural features to inhibitory activity. The QSAR results showed that the lowest unoccupied molecular orbital (LUMO) and the increase of dipole moment were the main independent factors contributing to the antifungal activity of the compounds. SVM has shown obvious advantages in QSAR research, but QSAR studies of hydroxyl benzoic esters are confined to linear models at present; there is no literature on nonlinear QSAR analysis of this system.
In this paper, we use quantum chemical parameters and molecular connectivity indexes to analyze the antibacterial activity of the hydroxyl benzoic esters. The QSAR model is established with the SVM algorithm in the R software. We obtain the structure-activity relationship between the molecular structural parameters and the antibacterial activity against Escherichia coli under the most stable configuration, which provides a basis for predicting the antibacterial activity of similar compounds.
... [23], with the antibacterial activity expressed in logarithmic form (lg t1/2). The results are shown in Table 2.

Calculation and Selection of Molecular Descriptors.
The quantum chemical parameters [25] and molecular connectivity indexes [26] explain the antibacterial activity of compounds well and correlate well with it; therefore, this paper selects them as descriptors with a clear physical meaning.

The Quantum Chemical Parameters.
In this paper, the quantum chemical parameters are calculated with Gaussian 09 [27], a quantum chemistry package from Gaussian, Inc. (United States) for semiempirical and ab initio calculations. Molecular structures can be built directly in GaussView 5, which creates the input files of the molecular structures. In the calculation, Gaussian 09 reads the input file directly and automatically translates it into redundant internal coordinates. The results of the calculation are written to a text output file. Before each calculation, a suitable chemistry model (computational method) should be chosen for the system in order to balance computational cost and accuracy [27,28]. The method used in this paper is DFT B3LYP/6-31G(d). All the molecular configurations are optimized configurations, the geometry optimizations converge, and the frequency analysis shows no imaginary frequency; therefore, the data are considered reliable. The relevant quantum chemical parameters are extracted from the output file. The values are shown in Table 3.

The Molecular Connectivity Indexes.
Molecular connectivity indexes are constants calculated from the molecular structure; they mainly reflect the number of atoms in the molecule, the valence bonds, the branching, and so forth. Each order of index has a different meaning. Many studies show that 5 X k P can characterize a lot of information, which is of great significance in explaining the influence of structure on biological activity [29,30]. So, this paper includes molecular connectivity indexes among the descriptors. The results are shown in Table 4.

Partition of Dataset.
The rational division of datasets is an active research topic in the field of QSAR, and a variety of methods exist. In this paper, Random Sampling (RS) [31] is used to divide the raw data into a training set (22 compounds) and a test set (3 compounds: o-hydroxyl benzoic esters, m-hydroxyl benzoic esters, and p-hydroxyl benzoic esters). The training set is used to establish the SVM nonlinear models, and the test set is used to test the external predictive ability of the models.
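As an illustration of the partition step, the following minimal Python sketch holds out 3 of 25 items at random. It is not the authors' R code: the compound labels, the function name `random_split`, and the seed are hypothetical, and the sketch assumes a plain uniform random draw.

```python
import random

def random_split(items, n_test, seed=0):
    """Randomly partition items into (training, test) with n_test items held out."""
    rng = random.Random(seed)      # fixed seed (an assumption) so the split is reproducible
    shuffled = list(items)         # copy, so the caller's list is left unchanged
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labels standing in for the 25 hydroxyl benzoic esters.
compounds = [f"ester_{i:02d}" for i in range(1, 26)]
train, test = random_split(compounds, n_test=3)
print(len(train), len(test))
```

Fixing the seed makes the same training and test sets reappear on every run, which helps when several models are compared on one split.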

Modeling Method.
In the R software, the training set of 22 compounds is used to build the nonlinear models with the SVM algorithm based on the selected descriptors. We first standardize the data and then establish 4 models, combining a radial or linear kernel function with the eps-regression or nu-regression type.
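The two preparatory steps described here, standardizing the descriptors and enumerating the four kernel/type combinations, can be sketched in Python as follows. This is a hedged illustration, not the authors' program: it assumes z-score standardization with the sample standard deviation, and the function name `standardize` and the example column are hypothetical. The kernel and type labels follow the wording of the text; the source does not say which R package was used.

```python
from itertools import product
from statistics import mean, stdev

def standardize(column):
    """Z-score one descriptor column: subtract the mean, divide by the sample SD."""
    m, s = mean(column), stdev(column)   # stdev uses the n - 1 denominator
    return [(x - m) / s for x in column]

# The four model configurations named in the text: kernel x regression type.
KERNELS = ("radial", "linear")
TYPES = ("eps-regression", "nu-regression")
configs = list(product(KERNELS, TYPES))

# Hypothetical descriptor values, for illustration only.
z = standardize([1.0, 2.0, 3.0])
print(z)
print(configs)
```

Standardizing puts descriptors with very different units on a comparable scale before they enter the kernel.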

Model Validation.
Model validation is very important for QSAR research, which consists of two aspects: internal validation to test the fitting ability and robustness of models and external validation to test the model's predictive ability. Both internal and external validations are equally important [32].

Internal Validation.
There are many methods to estimate a model's stability, robustness, and internal predictive ability, such as the fitting correlation coefficient, cross-validation, random model test, Y-randomization, and various residual errors (such as Root Mean Squared Error (RMSE) and standard residual error) [33]. In this paper, the fitting correlation coefficient (R2) between the experimental and predicted values of the training dataset and the leave-one-out cross-validation coefficient (Q2LOO) are used to test the reliability, robustness, and stability of the models and whether they are overfitted.
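The two internal validation statistics can be sketched in Python as follows. The sketch rests on several assumptions: R2 is taken as 1 - SSres/SStot on the training set, Q2LOO as 1 - PRESS/SStot, and a one-descriptor least-squares line (`fit_line`, a hypothetical helper) stands in for the SVM so that the example stays self-contained. The data values are invented.

```python
def fit_line(xs, ys):
    """Least-squares line on one descriptor; a simple stand-in for the SVM model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def r2_fit(xs, ys, fit=fit_line):
    """Fitting coefficient: 1 - SS_res / SS_tot for a model trained on all points."""
    model = fit(xs, ys)
    my = sum(ys) / len(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def q2_loo(xs, ys, fit=fit_line):
    """Leave-one-out coefficient: 1 - PRESS / SS_tot, refitting without each point."""
    my = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])  # train without point i
        press += (ys[i] - model(xs[i])) ** 2                   # error on the held-out point
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - press / ss_tot

# Invented data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
r2, q2 = r2_fit(xs, ys), q2_loo(xs, ys)
print(r2, q2)
```

Because each held-out point is predicted by a model that never saw it, Q2LOO is a stricter measure than R2, and a large gap between the two is a common sign of overfitting.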

External Validation.
An important purpose of QSAR models is to predict the activity data of new or not yet synthesized compounds, in order to guide the design and synthesis of compounds with desirable activity or to screen compounds. This requires that the model have good predictive and generalization ability. However, cross-validation can only describe the internal predictive ability of a model, and good internal predictive ability does not guarantee good external predictive ability [34][35][36]; that is, a good cross-validation Q2cv is a necessary but not sufficient condition for high external predictive ability [35]. The only way ..., and SDEPext. In this paper, the test set is used to predict the corresponding lg t1/2, and the external predictive ability of the models is evaluated by SDEPext.
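The text does not give the formula for SDEPext. A commonly used definition is the root of the mean squared prediction error over the external test set, and the Python sketch below assumes that definition; the function name `sdep_ext` and the numbers are hypothetical.

```python
from math import sqrt

def sdep_ext(observed, predicted):
    """Assumed definition: sqrt(sum((y_obs - y_pred)^2) / n) over the external test set."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared_errors) / len(observed))

# Hypothetical observed and predicted values, not taken from the paper's tables.
value = sdep_ext([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
print(value)
```

A smaller SDEPext means the predictions for compounds outside the training set lie closer to the experimental values.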

Extraction of Key Descriptors.
We use principal component analysis to extract the most critical molecular descriptors of the hydroxyl benzoic esters for antibacterial half-life.
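To make the extraction step concrete, the sketch below computes the loadings of the first principal component and its proportion of variance in plain Python. It assumes the analysis is run on the covariance matrix and uses power iteration in place of a full eigendecomposition; whether the authors used the covariance or the correlation matrix is not stated. All names and data values are hypothetical.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator) of rows = observations x descriptors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_component(cov, iterations=500):
    """Power iteration: loadings of the first principal component and its variance share."""
    d = len(cov)
    # Start vector; power iteration assumes it is not orthogonal to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the leading eigenvalue.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    total_variance = sum(cov[a][a] for a in range(d))   # trace = sum of all eigenvalues
    return v, eigenvalue / total_variance

# Invented, perfectly collinear data: all variance lies along the direction (1, 2).
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, proportion = first_component(covariance_matrix(rows))
print(loadings, proportion)
```

The descriptors with the largest absolute loadings on the first component are the ones the component mainly represents, and the proportion shows how much of the total variance that single component carries.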

Internal Prediction and Scatter Plot.
Four nonlinear SVM models based on the selected descriptors are established using the training set. The experimental values and internal prediction results of lg t1/2 are shown in Table 5 and the scatter plot in Figure 1. R2 and Q2LOO of the models are given in Table 6.

Results of External Validation.
lg t1/2 of the test set is predicted by each of the 4 SVM models, and the results are shown in Table 7. SDEPext of the models and the residuals between the experimental values and the predicted results of lg t1/2 are displayed in Table 8. Scatter plots of the experimental values and the prediction results of lg t1/2 by the 4 SVM models for the 25 compounds are shown in Figure 2.

Discussion and Conclusion
The degree of freedom and the speed of the preservative molecule determine the effective collisions between its reactive central atom and the active groups or atoms of the microbial molecules. As a result, the antimicrobial property of a preservative is essentially determined by the electronic behavior of the preservative and the microorganism, that is, by the quantum biochemical characteristics of the preservative. Therefore, studying the relationship between the structure and properties of a compound from the perspective of quantum chemistry can explain the effective antimicrobial groups of the preservative at a fundamental level [37]. Jiang et al. [23] used multiple linear regression to establish a linear model of 25 hydroxyl benzoic esters. The parameters are shown in Table 9. The results showed that R2 was only 0.421, although the equation had a good linear relationship when the number of C atoms was less than 4. When the number of C atoms in the ester group is more than 4, the influencing factors become more complex and can no longer be described by a simple linear relationship; the relationship may be nonlinear or more varied. We therefore write a program in the R language, establish 4 nonlinear models with the SVM algorithm for the 25 hydroxyl benzoic esters, and predict lg t1/2. The predicted results for the training set are shown in Table 5. The scatter plot of experimental and predicted lg t1/2 is drawn in R. Figure 1 shows that the predicted and experimental values are in good agreement and that the linear relationship is clear. According to the literature, a model is good if R2 is greater than 0.6 [35,38] and Q2 is greater than 0.5, and excellent when the values exceed 0.9 [39]. Tropsha et al. [6] recommend that both R2 and Q2 be greater than 0.6.
Table 6 shows that both R2 and Q2LOO are greater than 0.6 and that R2 and Q2LOO of the two models with linear kernel function are close to 0.75. The stability, robustness, and internal predictive ability of the 4 models can therefore be considered good, and the models are not overfitted because R2 exceeds Q2LOO by no more than 25%. The para-, ortho-, and meta-compounds extracted by RS from the 25 hydroxyl benzoic esters make up the external test set used to test the models, and the prediction results are shown in Table 7. Table 8 shows that the residuals of lg t1/2 for the test set range from −0.037244 to 0.322733 and that SDEPext is 0.213, 0.222, 0.189, and 0.218, respectively. The results indicate that all 4 models have high external predictive ability; in particular, the model with linear kernel function and eps-regression type is better than the other 3 models. Scatter plots of the experimental values and the prediction results of lg t1/2 by the 4 SVM models for the 25 compounds are shown in Figure 2. The results show that the overall prediction of the 4 SVM models is good and, in particular, that the linear relationship between predicted and experimental values is best for the model with linear kernel function and eps-regression type.
Note. Radial + eps-reg, radial + nu-reg, linear + eps-reg, and linear + nu-reg, respectively, represent the 4 SVM models where the kernel function is radial or linear and the type is eps-regression or nu-regression.
In Table 10, the principal component analysis shows that the proportion of variance of the first principal component reaches 96.03%; therefore, only the first principal component is retained.
Table 11 shows that the first principal component includes E (total energy), ZPE (zero-point vibrational energy), and p (polarizability). We consider E, ZPE, and p to be the key factors for the antibacterial half-life of hydroxyl benzoic esters. The polarizability p is a structural parameter described by the deformation tensor of the molecule under an external electric field. It is the most important property related to the volume of the molecule and carries information about molecular interactions, so it can characterize the properties of the molecule as an electron acceptor. Since the coefficients of p and ZPE are negative, the larger the values of p and ZPE, the shorter the antibacterial half-life of hydroxyl benzoic esters; E has the opposite effect because its coefficient is positive.
In summary, the nonlinear QSAR model built from quantum chemical parameters and molecular connectivity indexes can better predict the antibacterial activity of hydroxyl benzoic esters. The introduction of the SVM algorithm addresses the poor correlation of the QSAR model and the complex nonlinear relationship between the molecular descriptors when the formula weight is large, which provides a basis for predicting the antibacterial activity of compounds with similar structures. The main conclusions of this paper are as follows:
(1) Four nonlinear models of 25 hydroxyl benzoic acid esters are established by the SVM method. Internal and external validation show that the stability, robustness, and internal and external predictive ability of the 4 models are good; that is, the models are usable and may predict new compounds within the applicability domain.
(2) Among the 4 SVM models, the model with linear kernel function and eps-regression type has the largest R2 and Q2LOO, the smallest SDEPext, and the best linear relationship between predicted and experimental values of lg t1/2, and is therefore the optimal model.
(3) The SVM algorithm is a good method for handling multicollinearity and complex nonlinear relationships between molecular descriptors in QSAR modeling.
(4) E, ZPE, and p are the key factors for the antibacterial half-life of hydroxyl benzoic esters.

Conflicts of Interest
The authors confirm that this article's content has no conflicts of interest.