Comparison of Quantitative Structure-Activity Relationship Model Performances on Carboquinone Derivatives

Quantitative structure-activity relationship (qSAR) models are used to understand how the structure and activity of chemical compounds relate. In the present study, 37 carboquinone derivatives were evaluated and two different qSAR models were developed using members of the Molecular Descriptors Family (MDF) and the Molecular Descriptors Family on Vertices (MDFV). The usual parameters of regression models and the following estimators were defined and calculated in order to analyze the validity of the models and to compare them: Akaike's information criteria (three parameters), Schwarz (or Bayesian) information criterion, Amemiya prediction criterion, Hannan-Quinn criterion, Kubinyi function, Steiger's Z test, and Akaike's weights. The MDF and MDFV models proved to have the same goodness-of-fit according to Steiger's Z test. The MDFV model proved to be the best model for the considered carboquinone derivatives according to the defined information and prediction criteria, Kubinyi function, and Akaike's weights.


INTRODUCTION
Quantitative structure-property/activity relationship (QSPR/qSAR) models may be considered data mining applications [1]. These methods are used to estimate/predict physical-chemical properties [2,3] and/or biological activities [4] of compounds, or to classify molecules [5] based on structural features. Besides their usefulness in compound screening [6], QSPR/qSAR models are also used for their ability to explain the mechanisms of action of the investigated compounds [7].
The antileukemic activity of carboquinones, expressed as the minimum effective dose (MED) and the optimum effective dose (OED), was previously modeled using the electrotopological state and the molecular connectivity indices with multiple linear regression (MLR) [20]. A four-descriptor model was identified for MED (R² = 0.90 and s = 0.21; R² is the determination coefficient and s is the standard error of estimate). The same model was also able to estimate the OED (R² = 0.88, s = 0.19).
Srivastava and Khan showed in a qSAR study that -OH and -NH2 groups made an important contribution to the biological activity as terminal substituents [21]. Kawakami et al. [22] used a self-organizing map to analyze qSARs on carboquinone derivatives. The identified model proved able to predict biological activity (MED) with an average error of 4.2% (squared cross-validation correlation coefficient of 0.87). The relationship between the structure and activity of carboquinone derivatives was also investigated by using neural networks [23,24].
The main differences of the approaches applied in investigation of carboquinone derivatives consisted of the use of different methods to generate descriptors and/or to identify the descriptors better able to explain the activity of the compounds. In addition, models with improved statistical quality as compared with previously reported models on carboquinone derivatives were published; unfortunately, the significance of this improvement was not quantified.
Our research reports the results of the MED of carboquinone derivatives for the same molecular set studied by Kawakami et al. [22]. Two families of structural descriptors, the Molecular Descriptors Family (MDF) and the Molecular Descriptors Family on Vertices (MDFV), were used to generate descriptors. Forward stepwise regression was applied for descriptor selection. The models (MDF, MDFV, and the previously reported model [22]) were compared in order to identify the method with the highest performance.

Data Set: Carboquinone Derivatives
The inverse of molar concentration, expressed in logarithmic scale, was taken from previously published research [22]. Molar concentration is the MED per 1 kg of mice able to prolong life by 40% compared with controls (administration of a small-quantity dosage in chronic injection) [19]. The generic structure of the investigated compounds is presented in Fig. 1. The abbreviation of the compounds, the substituent, and the observed and estimated activities are presented in Table 1. The observed activity of interest [22] was subject to statistical analysis in order to test the normality of data (an assumption of multiple regression and a condition for inference making). The observed activity had a mean of 5.76, a standard deviation of 0.63, a skewness of -0.12, and a kurtosis of 0.41. The Jarque-Bera test [25] (two degrees of freedom) was applied to test the normality of observed data and a value of 1.66 was obtained (p = 0.44). The Grubbs test [26] did not identify any outlier in the observed data (Grubbs value = 2.25 for the furthest data point from the rest (cqd_01), p > 0.05).
The approaches used to calculate molecular descriptors (MDF and MDFV) are detailed in the Appendix.
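The two checks described above can be sketched as follows. This is a minimal illustration that assumes the textbook moment-based Jarque-Bera statistic and the basic Grubbs statistic; the function names and the sample values are hypothetical, and the study does not state which skewness and kurtosis estimators it used, so the statistic itself may not reproduce the reported 1.66. The p-value step is exact for two degrees of freedom, where the chi-square upper tail is exp(-x/2); for the reported statistic of 1.66 this gives about 0.44, in agreement with the reported p-value.

```python
import math


def chi2_2df_tail(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def jarque_bera(values):
    """Return (JB statistic, p-value) from population (biased) moments.

    Assumption: textbook form JB = n/6 * (S^2 + K^2/4), with K the excess kurtosis.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    return jb, chi2_2df_tail(jb)


def grubbs_statistic(values):
    """Largest absolute deviation from the mean, in sample standard deviations."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return max(abs(v - mean) for v in values) / sd


# Hypothetical activities, not the carboquinone data.
sample = [5.1, 5.4, 5.8, 6.0, 6.3, 6.9]
print(jarque_bera(sample), grubbs_statistic(sample))
print(round(chi2_2df_tail(1.66), 2))  # p-value for the reported statistic
```

The Grubbs statistic still has to be compared with a critical value for the given sample size, which is not computed here.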

Models Search, Validation, and Comparison
Multivariate regressions were obtained through systematic or random search for MDF and MDFV members by using client-server applications developed in Borland Delphi (v.6) and FreePascal (v.2). The task was performed after the filtration, identification, and removal of biased descriptors (as in the above-stated statistical validation of descriptors).
The best model obtained by each approach was selected according to the following criteria [33,34]:
• The highest explanation of the observed variance (highest values of significant correlation coefficients between the observed and estimated activity). A model was considered valid if all correlation coefficients (Pearson (r), semi-quantitative (r_sQ), Spearman (ρ), Kendall's (τ_a, τ_b, τ_c), and Gamma (Γ) [35]) were statistically significant. The presence of at least one correlation coefficient that is not statistically significant leads to the exclusion of the model from further analysis.
• The smallest number of descriptors in the model.
• The lowest standard error of estimate (s_est).
• The highest Fisher value (the lowest p-value) and significant coefficients of the regression model (highest t-value, lowest associated p-value).
The following parameters and/or tests were used as validation and comparison methods:
• Akaike information criterion (AIC [36]) and related formulas, which consider the statistical goodness-of-fit and the number of parameters needed to achieve that degree of fit. The corrected formula (AIC_c) [37] proved to be a better model selection criterion [38] and was used in the study. The following related criteria were calculated to select the best models: AIC_c = corrected AIC for bias adjustment in small sample sizes (applied when the n/k ratio is below 40); AIC_R2 = AIC based on the determination coefficient; AIC_u = McQuarrie and Tsai corrected AIC; BIC = Schwarz (or Bayesian) Information Criterion (also abbreviated as SIC); APC = Amemiya Prediction Criterion; HQC = Hannan-Quinn Criterion; n = sample size; k = number of parameters in the model; RSS = residual sum of squares. The preferred model was the one with the lowest AIC, BIC, APC, and HQC values.
• Kubinyi function (FIT) [43,44]: the higher the FIT value, the better the model was considered.
• Akaike's weights: the best model is considered the one with the smallest relative distance from the "truth". The weights are based on the difference between each model and the model with the lowest AIC (Δ_i = AIC_i - min(AIC), where Δ_i = difference between the AIC of the best fitting model and that of model i; AIC_i = AIC corrected for model i; min(AIC) = minimum AIC value of all models). In the formula used in this analysis [45], w_i = Akaike weight for model i; denominator = sum of the relative likelihoods for all candidate models; j = number of models. The Akaike weights were calculated based on Eqs. 1-3.
• Steiger's Z test: the comparison of correlation coefficients obtained by two models was performed by applying Steiger's Z test at a significance level of 5% [46].
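As an illustration of how such criteria are computed from a fitted regression, the sketch below uses common textbook least-squares forms of AIC, AIC_c, BIC, HQC, and APC, the usual form of the Kubinyi FIT function, and the standard Akaike weights. These forms are assumptions and may differ in detail from the expressions used in the study (Eqs. 1-8); AIC_R2 and AIC_u are left out rather than guessed. All function names and the numbers in the usage lines are hypothetical.

```python
import math


def information_criteria(rss, n, k):
    """Textbook least-squares forms (assumed, not taken from the study).

    rss: residual sum of squares; n: sample size; k: number of parameters.
    """
    log_term = n * math.log(rss / n)
    aic = log_term + 2 * k
    return {
        "AIC": aic,
        "AICc": aic + 2 * k * (k + 1) / (n - k - 1),  # small-sample correction
        "BIC": log_term + k * math.log(n),
        "HQC": log_term + 2 * k * math.log(math.log(n)),
        "APC": rss / (n - k) * (1 + k / n),
    }


def kubinyi_fit(r2, n, k_vars):
    """Kubinyi FIT; k_vars is the number of descriptors. Higher is better."""
    return r2 * (n - k_vars - 1) / ((n + k_vars ** 2) * (1 - r2))


def akaike_weights(aic_values):
    """w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2), delta_i = AIC_i - min(AIC)."""
    best = min(aic_values)
    relative_likelihoods = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(relative_likelihoods)
    return [r / total for r in relative_likelihoods]


# Hypothetical residual sums of squares for three candidate models on n = 37 compounds.
candidates = {"model_A": (1.9, 5), "model_B": (2.4, 6), "model_C": (4.1, 7)}
aicc = [information_criteria(rss, 37, k)["AICc"] for rss, k in candidates.values()]
for name, weight in zip(candidates, akaike_weights(aicc)):
    print(name, round(weight, 3))
```

The weights sum to one, so each can be read as the relative support for a model within the candidate set only; adding or removing a candidate changes every weight.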

RESULTS
The valid MDF and MDFV members on the carboquinone sample were included in the multivariate regression analysis in order to obtain qSAR models. One MDF (see Eq. 9) and one MDFV (see Eq. 11) model with the best performances were chosen from statistically significant models and are presented.
The estimated activity values associated with each model and the residuals are shown in Table 1. The values of descriptors used in the MDF and MDFV models are presented in Table 2.
In the MDF model (Eq. 9), Ŷ_MDF = activity estimated by the MDF model; IGDMlQt (X1), IbMDpHg (X2), IHMmlHt (X3), lHDDfHg (X4), and IHDMkMg (X5) = MDF members; the values in round brackets give the lower (by subtraction) and upper (by addition) confidence boundaries for the slope parameters; R² = determination coefficient; s_est = standard error of estimate; n = sample size; F_est (p) = Fisher value of the MDF model (p-value); t = t-value; int = intercept; p = p-value; R²_loo = cross-validation leave-one-out squared correlation coefficient; s_loo = standard error of prediction; F_loo = Fisher value of the cross-validation leave-one-out model; r = Pearson correlation coefficient between the observed activity and the activity estimated by the model; r_sQ = semi-quantitative correlation coefficient; ρ = Spearman rank correlation coefficient; τ_a, τ_b, τ_c = Kendall's correlation coefficients; Γ = Gamma correlation coefficient.
The MDF descriptors in Eq. 9 did not correlate significantly with the observed activity or with one another when all correlation coefficients were investigated (see Table 3).
The MDFV descriptors in Eq. 11 did not correlate significantly with the observed activity or with one another when all correlation coefficients were investigated (see Table 4).
The values obtained by applying the validation and comparison parameters (Eqs. 1-8) for MDF and MDFV, as well as for the linear regression model obtained by using the previously reported descriptors (molar refractivity of the steric effects of R1 and R2, hydrophobicity of the steric effects of R1 and R2, hydrophobicity of the steric effect of R2, molar refractivity of the steric effect of R1, and two substituent constants) [22], are shown in Table 5.
The goodness-of-fit of the MDF and MDFV models is presented in Figure 2. The results of the Steiger's Z test are presented in Table 6. An external set of compounds was used in order to predict the inverse of molar concentration using the best identified model (Eq. 11). The values of the descriptors and the predicted inverse of molar concentration (logarithmic scale) are presented in Table 7.

DISCUSSION
Three qSAR models were investigated in order to assess their ability to estimate the antileukemic activity of a sample of 37 carboquinone derivatives. Two approaches were used to calculate the molecular descriptors for the carboquinone derivatives: MDF and MDFV. The MDF approach proved able to estimate properties and activities [47,48,49,50,51,52]. The MDFV is a new approach that implements the fragmentation of vertices on the molecular graph. A similar approach on vertex cut proved its usefulness on b-ary trees [53]. The third analyzed qSAR model was obtained by using the physical-chemical descriptors reported by Kawakami et al. [22]. A series of classical and newly defined parameters were computed (Eqs. 1-8) in order to compare the models. The qSAR models were selected according to the Hawkins principles [54]. The models with the highest correlation coefficient, the highest Fisher parameter, the lowest standard error of estimate, and the smallest possible number of significant parameters were chosen. The estimation ability of the MDF and MDFV models was demonstrated by the presence of statistically significant correlation coefficients between the observed and estimated activity (see Eq. 9 for the MDF model and Eq. 11 for the MDFV model).
The analysis of the MDF (Eq. 9) and MDFV (Eq. 11) models in terms of the descriptors' contribution to the activity of carboquinone derivatives revealed the following: The investigated activity of the carboquinone derivatives proved to be of geometric and topological nature. It depended on compound charge, number of directly bonded hydrogen atoms, and relative atomic mass in the MDF model (see Eq. 9) and on compound electronegativity, melting point, and electronic affinity in the MDFV model (see Eq. 11).
The absence of collinearity between the descriptors used by the MDF and MDFV models (see Tables 3 and 4), and the parameters obtained in leave-one-out and leave-many-out analyses (see Eqs. 10 and 12) supported the validity of these models.
As far as the comparison of models is concerned, a series of parameters were computed in order to identify the best qSAR model for carboquinone derivatives (see Table 5). The analysis of parameters presented in Table 5 leads to the following observations: The MDFV model (Eq. 11) systematically obtained the best expected values: the smallest values of the prediction criteria (AIC_c, AIC_R2, AIC_u, BIC, APC, and HQC) and the highest values of Akaike's weights (w_i(AIC_c), w_i(AIC_R2), w_i(AIC_u)) and of the Kubinyi function (FIT). The overall classification of models in descending order of their performances according to all the parameters (Eqs. 1-8) is: MDFV, MDF, regression model obtained from the previously reported physical-chemical descriptors [22].
In most cases, the MDF model ranked second. Two exceptions were observed: the model ranked third according to the AIC_R2 and w_i(AIC_R2) criteria. The lowest value of BIC obtained for the MDFV model implied fewer descriptors and a better fit when the model was compared to the MDF model. It implied only better goodness-of-fit when compared with the model obtained from the previously reported physical-chemical descriptors [22].
The analysis of the results presented in Table 1 revealed that the means of the observed and estimated activities are equal, but the standard deviations of the activity estimated by the MDF and MDFV models were slightly lower (a difference of 0.01 for the MDFV model and of 0.02 for the MDF model) than the standard deviation of the observed data [22]. This observation points to a possible risk of overprediction, which could be attributed to random or systematic experimental errors. The intrinsic variability of experimental measurements carries over into the intrinsic variability of the model: if the experimental measurements are not valid, the model is not valid. The Jarque-Bera test [25] was applied to the observed data in order to investigate their normality and membership of the same population, as a measure for minimizing overprediction (normality is also a condition for MLR). The experimental data proved to be normally distributed and no outlier was identified by the Grubbs test, even though the value of the compound furthest from the rest was included in the analysis.
As far as the goodness-of-fit of the MDF and MDFV models according to Steiger's Z test was concerned, these two models were not statistically different (see Table 6). The MDF and MDFV models proved to have significantly higher correlation coefficients compared to the regression model obtained from the previously reported physical-chemical descriptors [22] (see Table 6, p < 0.01).
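A comparison of this kind can be sketched as below. The study does not state which variant of the test was applied, so this uses one commonly cited form of Steiger's Z for two dependent correlations that share a variable (here, the observed activity), with the pooled-correlation covariance term; treat the formula choice, the function name, and the input values as assumptions. Here r12 and r13 are the correlations of the observed activity with the two models' estimates, and r23 is the correlation between the two sets of estimates.

```python
import math


def steiger_z(r12, r13, r23, n):
    """Z statistic and two-sided p-value for H0: rho12 == rho13 (dependent correlations).

    Assumed form: Fisher z-transform with a covariance term based on the pooled
    correlation r_bar = (r12 + r13) / 2.
    """
    z12 = math.atanh(r12)
    z13 = math.atanh(r13)
    r_bar = (r12 + r13) / 2.0
    psi = (r23 * (1 - 2 * r_bar ** 2)
           - 0.5 * r_bar ** 2 * (1 - 2 * r_bar ** 2 - r23 ** 2))
    s_bar = psi / (1 - r_bar ** 2) ** 2
    z = (z12 - z13) * math.sqrt(n - 3) / math.sqrt(2 - 2 * s_bar)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# Hypothetical correlations for two models fitted to the same 37 compounds.
z, p = steiger_z(r12=0.95, r13=0.93, r23=0.90, n=37)
print(round(z, 3), round(p, 3))
```

A p-value above 0.05 would be read, as in the text, as no statistically significant difference between the two correlation coefficients at the 5% level.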
The MDFV model was considered the best model (considering the number of descriptors and the information criteria). Thus, this model was applied to an external sample of 30 compounds in order to predict the inverse of molar concentration (logarithmic scale). The values of the descriptors (see Table 7) had the same order of magnitude, and the average values of two descriptors fell within the 95% confidence interval of the descriptors' values in the sample of 37 compounds. The predicted values of the inverse of molar concentration expressed in logarithmic scale were higher (more potent compounds) than those of the sample of 37 compounds. The standard deviation and the average of the predicted values were also slightly higher. Note that the predicted values need to be experimentally validated in order to confirm the potency of these compounds, the absence of this validation being the main limitation of the present study.
The present study aimed to compare three qSAR models in order to understand the relationship between the structure of the investigated carboquinone derivatives and the MED expressed in logarithmic scale. Two models were obtained by applying the MDF and MDFV approaches, while the third model was obtained from the physical-chemical descriptors reported by Kawakami et al. [22]. Useful information related to the structural nature of the investigated activity of carboquinone derivatives was obtained once the MDF and MDFV models were constructed. While the MDF approach has already proved its estimation and prediction potential [44,45,46,47,48,49], current research in our laboratory aims to characterize other activities and/or other chemical compounds in order to test the usefulness of fragmentation on vertices in the investigation of structure-activity relationships.
The statistical parameters of the MDF and MDFV models supported their validity. The MDF and MDFV models were not significantly different. Both models proved to have better goodness-of-fit compared with the model obtained from the previously reported physical-chemical descriptors [22]. The MDFV model proved to be the best model for the studied carboquinone derivatives according to the prediction criteria and to the value of the Kubinyi function.
The modeling process in qSARs is widely used by computational chemists, but unfortunately, different models obtained on the same class of compounds are not usually compared. The research used a series of information parameters besides the Steiger's Z test in order to assess and compare different qSAR models. The proposed concept was evaluated on a set of carboquinone derivatives. Future research is required in order to develop guidelines for comparing different qSAR models.
SAR modeling using the MDFV approach has an advantage that follows from its construction: a systematic pool of unique descriptors (the same descriptors with the same values are obtained whenever the approach is applied to the same structures) is obtained from the structure of a given set of compounds using two extreme (minimal and maximal) and three intermediate (harmonic, geometric, and arithmetic) operations, which are able to cumulate the physical contribution of the atoms to the activity of compounds. A small part of the descriptors explains (correlates with) the activity/property based on structural information in a sample of compounds. The explanatory power of the SAR model increases by embedding as much information as possible, as was shown in the text (the goodness-of-fit of the MDFV model presented in Eq. 11 is higher than the goodness-of-fit obtained in the training set model presented in Eq. 12). Thus, the described approach should be conducted by using as much information as possible in order to construct the relationships between the compound structure and activity/property (model), and the prediction should be limited to compounds similar to the ones in the training set, as was done in this study. Using the proposed approach, the prediction of antileukemic activity was performed on a sample of compounds whose structure was similar to that of the compounds used to obtain the MDFV model. Note that the experimental values for the compounds included in the external validation set could not be found in the specialty literature using the available resources. Even though the internal validation of the MDFV model gave good results, the predicted antileukemic activity needs to be correlated with experimental data and could lead to more active carboquinone derivatives with antileukemic activity.
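The five operations named above (two extreme and three intermediate) can be illustrated on a list of positive values. This is only a sketch of the operations themselves: the function name and the example contributions are hypothetical, and how MDFV applies these operations to atom or fragment contributions is not specified here.

```python
import statistics


def aggregate(contributions):
    """Apply the two extreme and three intermediate operations to positive values."""
    return {
        "minimum": min(contributions),
        "maximum": max(contributions),
        "harmonic": statistics.harmonic_mean(contributions),
        "geometric": statistics.geometric_mean(contributions),
        "arithmetic": statistics.mean(contributions),
    }


# Hypothetical per-atom contributions (must be positive for the geometric mean).
print(aggregate([1.0, 2.0, 4.0]))
```

For positive inputs the results are always ordered minimum ≤ harmonic ≤ geometric ≤ arithmetic ≤ maximum, so the three means interpolate between the two extremes.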

CONCLUSIONS
The MDF and MDFV approaches provided reliable and valid models in terms of statistical characterization, collinearity, leave-one-out and leave-many-out analyses. The MDF and MDFV models proved equally able to estimate the activity of carboquinone derivatives according to Steiger's Z test. The MDFV model proved to be the best model for the considered carboquinone derivatives according to the information and prediction criteria, Kubinyi function, and Akaike's weights.

Molecular Descriptors Calculation
Two approaches were used to calculate the molecular descriptors for the sample of carboquinone derivatives: Molecular Descriptors Family (MDF) [27] and Molecular Descriptors Family on Vertices (MDFV). Both approaches integrate the complex topological and geometrical information obtained from the structure of the compounds by computing the family of descriptors used to explain the activity of interest.
The topological and geometrical models of the compounds were the input data in the investigation of carboquinone derivatives. The three-dimensional structures were drawn by using HyperChem version 7.01 [28]. The compounds' partial charges were calculated by using the semi-empirical extended Hückel model [29]. The geometry of the compounds was optimized by applying the Austin Model 1 (AM1) method [30]. The *.hin files were the input molecular files and the *.txt file was the input activity file used by both methods in order to generate and calculate the pools of descriptors. Brief descriptions of the MDF and MDFV methods are presented below.

Molecular Descriptors Family
• Method principle: candidate fragments obtained using pairs of vertices.
• Physical model of interaction: for a pair of atoms.
• Physical model of atomic overlapping interaction:
▪ in fragments;
▪ cumulated for pairs of atoms;
▪ cumulated for entire molecule.
• Molecular topology: matrix representation of the molecular graphs.
o Delete all descriptors with a Jarque-Bera value higher than the critical value for the observed activity [25].
o Delete all descriptors with an intercorrelation higher than 0.99.
The molecular descriptors were calculated by using a series of PHP programs, run on an intranet network on a FreeBSD server. The applications used MySQL dynamic libraries to connect to MDF and MDFV databases where the descriptors and identified models were stored.
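The intercorrelation filter in the last step can be sketched as follows. This is a minimal illustration under stated assumptions: intercorrelation is taken to be the absolute Pearson correlation between two descriptors, and of two highly correlated descriptors the one encountered first is kept; neither choice is specified in the text. The function names and descriptor values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_intercorrelated(descriptors, threshold=0.99):
    """Keep a descriptor only if |r| with every already-kept descriptor is <= threshold."""
    kept = {}
    for name, values in descriptors.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical descriptor values for four compounds.
pool = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with d1, so it is dropped
    "d3": [4.0, 1.0, 3.0, 2.0],
}
print(list(drop_intercorrelated(pool)))
```

Because the first-seen descriptor is kept, the result depends on the order of the pool; a production filter would need an explicit rule for which member of a correlated pair to retain.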