QSAR Study of Anthra[1,9-cd]pyrazol-6(2H)-one Derivatives as Potential Anticancer Agents Using Statistical Methods

In this study, the anticancer activity of a series of 32 molecules based on anthra[1,9-cd]pyrazol-6(2H)-one was studied by three-dimensional quantitative structure-activity relationship (QSAR) analyses: multiple linear regression (MLR), partial least squares (PLS), multiple nonlinear regression (MNLR), cross-validation analyses, and Y-randomization. The series was first studied theoretically using density functional theory (DFT) calculations at the B3LYP/6-31G level of theory to determine the structural parameters and electronic properties. The topological descriptors were then computed using the ACD/ChemSketch and ChemDraw 8.0 programs. The MNLR model, built on the descriptors obtained from MLR and PLS, exhibited a correlation coefficient close to 0.91. The prediction models were confirmed by two methods: cross-validation and scrambling (or Y-randomization). A strong correlation between experimental and predicted activity values was observed, indicating the validity and good quality of the derived QSAR model.


Introduction
The heterocycles and their derivatives constitute a class of cyclic compounds in which one or more carbon atoms of a reference carbocycle (e.g., cyclohexane, benzene, cyclopentane, and cyclopentadiene) are replaced by a heteroatom. The rapid development of heterochemistry comes from the study of living organisms (several bioactive heterocyclic compounds are extracted from animal and plant organisms).
Heterocyclic compounds find wide practical application in animal and human medicine (various drugs), in improving crops in agriculture (herbicides, fungicides, and insecticides), or are used as detergents, dyes, and explosives. They are also present in polymers, semiconductors, and photovoltaic cells [1][2][3][4].
The chemistry of heterocycles is a very broad field, given the number of heterocyclic compounds listed, which continues to expand. Among the different classes of heterocyclic compounds, nitrogenous structures in particular are present in many compounds of plant, animal, or synthetic origin. These structures are sometimes associated with each other, but in most cases they are linked to very diverse structural patterns. A number of hybrid compounds comprising mainly heterocycles containing nitrogen, sulfur, and/or oxygen atoms have shown remarkable pharmacological activity [5][6][7][8].
Pyrazoles are chemical compounds of synthetic origin that have a five-membered heterocycle with two adjacent nitrogen atoms and three carbon atoms; this structure is particularly rare in nature. Several pyrazole derivatives have shown good pharmacological effects or potential biological activities, such as anti-inflammatory [9], antiviral [10], antimicrobial [11], anticonvulsant [12], antitumor [13], fungicidal [14], and antihistaminic [15] activities.
The pyrazole ring is a structural isomer of imidazole; the name pyrazole comes from the pyrrole ring to which a nitrogen atom was added ("azole"). The two nitrogen atoms have different properties: one behaves like the pyridine nitrogen and can undergo protonation in an acid medium; the other behaves like the pyrrole nitrogen, whose lone pair participates in the aromaticity of the ring [16].

Advances in Chemistry
Pyrazoles variously substituted with aromatic and heteroaromatic groups have many biological activities, making them particularly interesting [17].
With regard to anticancer activity, pyrazole derivatives act on a variety of pharmacological targets. Drug discovery is a long and complex process. It is recognized that, on average, for one molecule that comes onto the market as an innovative drug, 10,000 molecules are synthesized and tested. In addition, the development of a drug usually requires between 10 and 15 years of research. It is indeed a matter of finding a molecule that has particular therapeutic properties and, at the same time, the minimum of undesirable side effects. The cost of a drug is largely due to the many long and expensive syntheses that ultimately prove useless. The development of reliable computer tools, coupled with the growth of computing power, has enabled the implementation of molecular modeling techniques, which have become indispensable tools in the field of drug design. Among the techniques of chemoinformatics are QSAR techniques, which seek a correlation between the biological activity measured for a panel of compounds and some molecular descriptors. Quantitative structure-activity relationship (QSAR) methodology is an essential tool in medicinal chemistry [18,19].
Two disciplines of "computational chemistry" have been developed in response to this need: quantitative structure-activity relationships (QSARs) and quantitative structure-property relationships (QSPRs). They essentially consist of the search for similarities between molecules in large databases of existing molecules whose properties are known [20,21]. The discovery of such a relationship makes it possible to predict the physical, chemical, and biological properties of compounds, to develop new theories, or to understand the phenomena observed. Our main objective in this work was to develop a novel model for studying the relationship between the structure and anticancer activity of pyrazole and its derivatives [22,23].
Mathematical methods can be used to establish the relation between the structural characteristics of a molecule and its properties. Multiple linear regression (MLR), partial least squares (PLS), multiple nonlinear regression (MNLR), and cross-validation analyses were applied to a series of pyrazole inhibitors in order to develop a QSAR model that reliably predicts anticancer activity.

Experimental Data.
In the present study, we chose 32 substituted derivatives of anthra[1,9-cd]pyrazol-6(2H)-one whose anticancer activities are reported in the literature by Chen et al. [24]. For the 2D-QSAR study, the reported IC50 values were converted into PIC50 by taking the negative logarithm (PIC50 = -log10 IC50) and subsequently used as the dependent variable for the 3D-QSAR model development. Figure 1 represents the basic structure of the pyrazole and Table 1 shows the studied substituents of the compounds and the corresponding experimental PIC50 activities.
Scree plot: eigenvalue and cumulative variability (%) by axis.
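The conversion from IC50 to PIC50 can be sketched as follows. This is a minimal illustration, assuming that IC50 is expressed in mol/L (the unit is not stated in the text); the function name and the example values are hypothetical and are not taken from Table 1.

```python
import math


def pic50_from_ic50(ic50_molar: float) -> float:
    """Return PIC50 = -log10(IC50).

    Assumption: IC50 is given in mol/L; the source does not state the unit.
    """
    if ic50_molar <= 0:
        raise ValueError("IC50 must be strictly positive")
    return -math.log10(ic50_molar)


# Hypothetical IC50 values, for illustration only.
for ic50 in (1e-6, 2.5e-7):
    print(f"IC50 = {ic50:.2e} mol/L -> PIC50 = {pic50_from_ic50(ic50):.2f}")
```

A lower IC50 (a more potent compound) gives a higher PIC50, which is why PIC50 is the more convenient dependent variable for regression.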

Validation of QSAR Models.
The stability and robustness of the developed model are evaluated using the squared correlation coefficient (R2), the adjusted R2, the RMSE (root of the average of the squared errors), the standard deviation (SD), and the Fisher criterion (F). In addition, the choice of descriptors was supported by Student's t-test at a 95% confidence level. All the models were validated by cross-validation, according to a leave-one-out (LOO) procedure, and a Y-randomization procedure was used to check that the results obtained are not due to chance. The model was also evaluated by external validation on data that are not part of the training set; the predictive power is then characterized by the correlation coefficient for the validation set (R2test) [22,23,25].
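The fit statistics named above can be computed from observed and predicted activities as sketched below. This is an illustration under stated assumptions: the function name and the sample values are hypothetical, the RMSE divides by n (some packages divide by n - p - 1 instead), and the F expression assumes an ordinary least-squares model with an intercept and p descriptors.

```python
import numpy as np


def regression_statistics(y_obs, y_pred, n_descriptors):
    """Return R2, adjusted R2, RMSE, and the Fisher F value of a fitted model."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_obs.size
    p = n_descriptors
    ss_res = np.sum((y_obs - y_pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    rmse = np.sqrt(ss_res / n)                    # assumption: divide by n
    f_value = (r2 / p) / ((1.0 - r2) / (n - p - 1))
    return {"R2": r2, "R2_adj": r2_adj, "RMSE": rmse, "F": f_value}


# Hypothetical observed and predicted activities, for illustration only.
stats = regression_statistics(
    [1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.8, 5.0], n_descriptors=1
)
print({name: round(float(value), 4) for name, value in stats.items()})
```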

Calculation of the Molecular Descriptors.
Before any modeling, it is necessary to calculate a certain number of descriptors, because the parameters that describe the anticancer activity of the pyrazoles are poorly known. Part of the success of any QSAR model lies in the choice of the molecular descriptors used. In general, the standard descriptors used for such an analysis are constitutional, topological, or geometric descriptors. However, it is often difficult to link these parameters to the reactivity of the inhibitors with the target cells. Descriptors derived from quantum chemistry are used less frequently in QSAR, although they have the advantage of being directly related to the reactivity properties of molecular systems [26,27]. The thirty-two molecules were optimized quantum mechanically using DFT with the B3LYP functional and the 6-31G basis set in the Gaussian 03 software. A number of electronic descriptors were then computed from the optimized molecules, including the dipole moment (DM), the energies of the frontier orbitals (EHOMO, ELUMO), the total energy (Etotal), and the repulsion energy (RE) [28,29]. ChemBio Office (2015) was used to calculate the following parameters: molecular weight (MW), lipophilicity (log P), hydrogen bond acceptors (HA), and hydrogen bond donors (HD). The ChemSketch program was used to calculate the following parameters: molar volume (MV, cm3), molar refractivity (MR, cm3), parachor (Pc, cm3), density (g/cm3), refractive index, surface tension (dyne/cm), and polarizability (cm3) [30,31].

Statistical Analysis.
Structure-activity models were generated using XLSTAT version 14 software, starting with principal component analysis to reduce the data matrix and then applying the multiple linear regression (MLR) method to study the relationship between a dependent variable and several independent variables. MLR is a mathematical technique that minimizes the sum of the squared differences between observed and predicted values. It is also used to select the descriptors used as input parameters in multiple nonlinear regression and the neural network to account for the nonlinear correlation between activity and structure [32]. Cross-validation is one of the best-known techniques for selecting regression models; here it is based on the "leave-one-out" criterion. The leave-one-out procedure successively removes one molecule from the training set containing 32 molecules. A QSAR model is built on the remaining 31 compounds and the activity of the removed molecule is predicted by this model. This procedure is repeated 32 times to predict the properties of all molecules [33][34][35][36].
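The leave-one-out procedure described above can be sketched with an ordinary least-squares MLR model. The study itself used XLSTAT; the NumPy code below is only an illustration, and the function names, the synthetic descriptor matrix, and the synthetic activities are all hypothetical.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares fit of y = b0 + X @ b; returns [b0, b1, ..., bp]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]


def loo_predictions(X, y):
    """Predict each molecule with a model trained on all the others."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i              # leave molecule i out
        coef = fit_mlr(X[keep], y[keep])
        preds[i] = predict_mlr(coef, X[i:i + 1])[0]
    return preds


def q2(y, preds):
    """Cross-validated coefficient: 1 - PRESS / total sum of squares."""
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)


# Synthetic stand-in for 32 molecules and 4 descriptors (not the study data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(32, 4))
y_demo = 5.0 + X_demo @ np.array([0.8, -0.5, 0.3, 0.2]) + rng.normal(scale=0.1, size=32)

loo = loo_predictions(X_demo, y_demo)
print(f"Q2 (LOO) = {q2(y_demo, loo):.3f}")
```

Each of the 32 predictions comes from a model that never saw the corresponding molecule, which is what makes the resulting coefficient a measure of predictive rather than fitting ability.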
Y-randomization tests are among the most widely used techniques for ensuring that a QSAR model is reliable. Indeed, it is not uncommon to obtain fortuitous correlations (or "chance correlations"), that is to say, a model displaying good statistical results (R2, MAE) on the training set but involving descriptors that in reality are not related to the modeled property. Such chance models can be detected by the Y-randomization procedure, which consists in randomly shuffling the experimental properties of the training set and, using the same descriptors, training the learning algorithm again to try to obtain a model. Normally, the models obtained in this way must perform very poorly. The distribution of their performances makes it possible to fix a heuristic significance threshold for the models. Thus, one can choose models that have at most a 1% chance of being confused with a fortuitous model [34,37].
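The shuffling procedure described above can be sketched as follows, again with an ordinary least-squares model standing in for the software actually used. The function names, the number of permutations, and the synthetic data are assumptions made for illustration; the empirical p-value is one common way of expressing the "chance of being confused with a fortuitous model".

```python
import numpy as np


def r2_of_fit(X, y):
    """R2 of an ordinary least-squares fit of y on X (with intercept)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def y_randomization(X, y, n_trials=200, seed=0):
    """Refit the model on shuffled responses and compare with the true fit."""
    rng = np.random.default_rng(seed)
    r2_true = r2_of_fit(X, y)
    r2_perm = np.array([r2_of_fit(X, rng.permutation(y)) for _ in range(n_trials)])
    # Empirical probability that a shuffled model does at least as well.
    p_emp = (1 + np.sum(r2_perm >= r2_true)) / (n_trials + 1)
    return r2_true, r2_perm, p_emp


# Synthetic stand-in for 32 molecules and 4 descriptors (not the study data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(32, 4))
y_demo = 5.0 + X_demo @ np.array([0.8, -0.5, 0.3, 0.2]) + rng.normal(scale=0.1, size=32)

r2_true, r2_perm, p_emp = y_randomization(X_demo, y_demo)
print(f"R2 (true) = {r2_true:.3f}, mean R2 (shuffled) = {r2_perm.mean():.3f}, p = {p_emp:.4f}")
```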

Results and Discussion
3.1. Dataset for Analysis. A QSAR study was carried out for a series of 32 substituted derivatives of anthra[1,9-cd]pyrazol-6(2H)-one, in order to determine a quantitative relationship between the structure and the anticancer activity. The values of the 16 descriptors are shown in Table 2. The results obtained for 3D-QSAR using PCA, MLR, MNLR, ANN, CV, and Y-randomization are presented in Tables 3 and 4.

Principal Component Analysis.
The full set of 16 descriptors coding the 32 molecules was submitted to a principal component analysis (PCA) [37]. 16 principal components were obtained (Figure 1).
The first three principal axes are sufficient to describe the information provided by the data matrix. Indeed, the percentages of variance are 52.6%, 16.62%, and 15.25% for axes 1, 2, and 3, respectively, so together they account for 84.47% of the total information.
The PCA was conducted to identify the links between the different variables. The Pearson correlation coefficients between the 16 descriptors are summarized in Table 5 as a correlation matrix, which indicates whether each pair of variables is negatively or positively correlated; in Figure 2 these descriptors are represented in a correlation circle.
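The percentages of variance per axis come from the eigenvalues of the descriptor correlation matrix. A minimal sketch is given below, assuming a normalized (correlation-based) PCA; the function name and the random 32 x 16 matrix are hypothetical stand-ins for the descriptor table, and only the three reported percentages are taken from the text.

```python
import numpy as np


def pca_variance_table(X):
    """Eigenvalues of the correlation matrix with % and cumulative % variance."""
    corr = np.corrcoef(X, rowvar=False)        # Pearson correlation matrix
    eigvals = np.linalg.eigvalsh(corr)[::-1]   # sorted in descending order
    percent = 100.0 * eigvals / eigvals.sum()
    return eigvals, percent, np.cumsum(percent)


# Random stand-in for the 32 molecules x 16 descriptors table (not the study data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(32, 16))
eigvals, percent, cumulative = pca_variance_table(X_demo)
print("cumulative % for the first three axes:", np.round(cumulative[:3], 2))

# Check of the cumulative percentage reported for axes 1 to 3.
reported = [52.6, 16.62, 15.25]
print(round(sum(reported), 2))  # 84.47
```

The eigenvalues plotted against the axis number give the scree plot, and the cumulative percentages give the criterion used to retain the first three axes.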

Multiple Linear Regression (MLR).
In order to select the predominant descriptors that affect the inhibitory activities of these compounds, correlation analysis was performed with the statistical software XLSTAT 2014, taking every calculated descriptor as an independent variable and PIC50 as the dependent variable. Based on the correlation analysis, the stepwise multiple linear regression technique was used to establish the QSAR model. Several statistical parameters, namely the correlation coefficient (R), the squared correlation coefficient (R2), the adjusted squared correlation coefficient (R2adj), the mean squared error (MSE), the Fisher value (F), and the significance level (p < 0.05), are used to verify the credibility of the developed models. A large F, a small MSE, a very small p-value, and R and R2 close to one indicate a good QSAR model. In this study, all developed QSAR models are statistically significant, with a significance level p < 10^-3. Given that the p-value is much smaller than 0.05, we take less than a 0.1% risk in rejecting the null hypothesis. The values of the multiple correlation coefficient (R) and of the squared correlation coefficient (R2), which are greater than 0.87 and 0.75, respectively, support the estimated capacity of the QSAR models.
ELUMO, the molar volume (MV), the density, and the molecular weight (MW) were the descriptors on which the anticancer activity of the pyrazole derivatives was found to depend.
The QSAR model built using the multiple linear regression (MLR) method is represented by equation (1). The comparison between the experimental values and the values predicted by the MLR model, given in Table 3, shows that the predicted values are close to the experimental ones. This indicates that the developed models can be applied to predict the inhibition for other derivatives.
Negative coefficients in the model indicate that an increase in the value of the corresponding descriptor leads to a decrease in PIC50. The way PIC50 changes with the descriptor values in (1) shows that MV, ELUMO, and the density vary in the same direction as the activity, whereas MW varies in the opposite direction.
PIC50 activity was linked with the frontier orbital energies, especially ELUMO, the energy of the lowest unoccupied molecular orbital, which reflects the electrophilic reactivity. This parameter is widely used to explain antiviral activity. The LUMO energy suggests that highly electrophilic compounds show high cell penetration. ELUMO is directly related to the electron affinity and characterizes the susceptibility of the molecule to attack by nucleophiles.
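As a reading aid, an MLR model built on the four retained descriptors has the general form below. The coefficients a_0 to a_4 are placeholder symbols introduced here and d denotes the density; only their signs are inferred from the discussion of the direction of variation (positive for ELUMO, MV, and the density, negative for MW), and no numerical values are implied.

```latex
\mathrm{PIC}_{50} = a_0 + a_1\,E_{\mathrm{LUMO}} + a_2\,\mathrm{MV} + a_3\,d + a_4\,\mathrm{MW},
\qquad a_1,\ a_2,\ a_3 > 0,\quad a_4 < 0
```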

Partial Least Squares (PLS). PLS has two objectives: to approximate the matrix of molecular structure descriptors to the matrix of dependent variables, and to maximize the correlation between them.
The data matrix built from the descriptors selected by MLR (Figure 3) for the 32 molecules was submitted to partial least squares (PLS) regression (Figure 4). This method used the coefficients R and R2 and the F-values to select the best regression performance.
The molecular descriptors used were ELUMO, MV, MW, and the density. The linear relation between these descriptors and PIC50 is given by equation (2). The correlation coefficient obtained with (2) is 0.69. To improve the quantitative description of the anticancer activity, taking into account several parameters, we used the nonlinear regression technique.

Multiple Nonlinear Regression (MNLR).
The descriptors selected by MLR for the 32 compounds were used as the data matrix for the MNLR (Figure 5). The coefficients R and R2 and the mean squared error are used to select the best performance of the regression.
The resulting model is given by equation (3). The predicted values of PIC50 calculated from (3) are added to Table 3 for comparison with the observed values. The correlation between the predicted and observed activities is shown in Figure 6.
The correlation coefficient obtained with equation (3) is high (0.91). The values obtained from nonlinear regression are thus highly correlated with the observed activity compared with the results obtained by the MLR method. The MNLR model is validated by dividing the dataset into a training set and a test set; the correlation coefficient obtained in external validation on the test set is 0.7 for MNLR.
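A nonlinear regression with external validation can be sketched as follows. The functional form is an assumption: a second-order polynomial in each descriptor without cross terms, which is a common choice for this kind of model but is not confirmed by the text. The 24/8 split, the function names, and the synthetic data are likewise hypothetical.

```python
import numpy as np


def quadratic_design(X):
    """Design matrix [1, x_i, x_i^2] (assumed MNLR form, no cross terms)."""
    return np.column_stack([np.ones(len(X)), X, X ** 2])


def fit_quadratic(X, y):
    coef, *_ = np.linalg.lstsq(quadratic_design(X), y, rcond=None)
    return coef


def predict_quadratic(coef, X):
    return quadratic_design(X) @ coef


def r2_score(y_obs, y_pred):
    return 1.0 - np.sum((y_obs - y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)


# Synthetic stand-in for 32 molecules and 4 descriptors (not the study data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(32, 4))
y_demo = (5.0 + 0.6 * X_demo[:, 0] - 0.4 * X_demo[:, 1] ** 2
          + 0.3 * X_demo[:, 2] + rng.normal(scale=0.05, size=32))

# Hypothetical split: 24 training molecules, 8 external test molecules.
X_train, y_train = X_demo[:24], y_demo[:24]
X_test, y_test = X_demo[24:], y_demo[24:]

coef = fit_quadratic(X_train, y_train)
r2_train = r2_score(y_train, predict_quadratic(coef, X_train))
r2_test = r2_score(y_test, predict_quadratic(coef, X_test))
print(f"R2 (training) = {r2_train:.3f}, R2 (test) = {r2_test:.3f}")
```

The test-set coefficient is the quantity to compare with the external validation value quoted above, since it is computed on molecules that played no part in the fit.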
3.6. Validation. We use the "leave-one-out" procedure, which successively removes one molecule from the training set containing 24 molecules. This procedure is repeated 24 times in order to predict the properties of all the molecules.
The consistency and reliability of the MLR, MNLR, and PLS models are validated using the cross-validation technique, a good correlation being obtained with the cross-validation coefficient Rcv = 0.86. The predictive power of this model is therefore very significant.
3.7. Scrambling or Y-Randomization. Y-randomization is broadly used in QSAR studies to ensure the robustness of the obtained models. This method is applied after the "best" regression model has been selected, to make sure that it is not the result of a chance correlation. Scrambling validates the QSAR model by comparing the performance of the original model with that of models built for permuted (randomly shuffled) responses, based on the original descriptor pool and the original procedure used to build the model. If the correlation coefficient of the models built for permuted responses is close to that obtained with the full model, this result indicates that there is independence between the molecules: the measurement points nearest to the target point do not obscure other experimental data and are not almost exclusively involved in the estimate, and the data used in this validation are evenly distributed in space. Therefore, the resulting model can be extrapolated to the entire series (Table 3 and Figure 7).
The correlation coefficient of the models built on the shuffled molecules was close to that obtained with the full model. This result demonstrates the absence of dependence between the descriptors included in the model. Additionally, the measurement point closest to the target point does not hide other experimental data and is not exclusively involved in the estimate, and the data used in this validation are regularly distributed in space, so the resulting model can be extrapolated to the entire series.
... Table 3). It has been observed that the designed compounds have higher PIC50 values with PLS than with MLR and MNLR (Table 1). Additionally, compounds X1 and X15 have higher PIC50 values than the existing compounds among the 32 studied compounds.
... grouped under the name of the "rule of five" [38]. This rule is the most widely used for the identification of "drug-like" compounds [38]: a substance will be better absorbed or will penetrate better if (1) its molecular weight is less than or equal to 500 Da, (2) it has 5 or fewer hydrogen bond donors (sum of OH and NH), (3) it has 10 or fewer hydrogen bond acceptors (sum of O and N), and (4) its log P value is less than or equal to 5.
The empirical conditions of the Lipinski rule for good oral bioavailability reflect a balance between the aqueous solubility of a compound and its ability to diffuse passively through various biological barriers. When the molecule evaluated follows the rule of Lipinski, these parameters indicate acceptable oral absorption or membrane permeability [39,40].
Molecules that violate several of these rules can have problems with bioavailability. Therefore, this rule establishes some relevant structural parameters for the theoretical prediction of the oral bioavailability profile and is widely used in the design of new drugs [41].
The results of the calculation (Table 7) show that all compounds satisfy the rules of Lipinski, suggesting that these compounds theoretically do not have problems with oral bioavailability, except molecules 5, 8, 13, and 29, which have a log P value of 5.
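The four rule-of-five criteria can be checked programmatically as sketched below. The class, the function, and the two example molecules are hypothetical and are not taken from Table 7; the thresholds are those stated in the text, with equality counted as satisfying a criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Molecule:
    name: str
    mol_weight: float   # molecular weight in Da
    h_donors: int       # hydrogen bond donors (sum of OH and NH)
    h_acceptors: int    # hydrogen bond acceptors (sum of O and N)
    log_p: float        # lipophilicity


def lipinski_violations(mol: Molecule) -> List[str]:
    """Return the list of rule-of-five criteria that the molecule violates."""
    violations = []
    if mol.mol_weight > 500:
        violations.append("MW > 500 Da")
    if mol.h_donors > 5:
        violations.append("more than 5 H-bond donors")
    if mol.h_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    if mol.log_p > 5:
        violations.append("log P > 5")
    return violations


# Hypothetical molecules, for illustration only.
for mol in (Molecule("A", 350.0, 2, 5, 3.2), Molecule("B", 620.0, 6, 11, 5.5)):
    found = lipinski_violations(mol)
    print(mol.name, "passes" if not found else f"violates: {', '.join(found)}")
```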

Conclusion
A quantitative structure-activity relationship (QSAR) analysis was performed on 32 pyrazole derivatives. A QSAR model was established using multiple linear regression (MLR), partial least squares (PLS), and multiple nonlinear regression (MNLR). Assessing the quality of the MLR, PLS, and MNLR models showed that the predictive capability of MNLR was substantially better than that of the other methods. The predictive power of the model obtained was confirmed by LOO cross-validation. A strong correlation was observed between the experimental and predicted values of the biological activities, which indicates the validity and quality of the QSAR model developed in this work. We conclude that the most important finding from this research is that we have been able to design and predict new compounds with higher or lower values than the existing compounds (Table 6) by adding suitable substituents and calculating their properties using the MLR, MNLR, and PLS equations. Thus, the proposed models will reduce the time, the cost, and the human effort involved.
Table 5: Pearson correlation matrix between the different descriptors obtained. The variables subsequently removed are parachor, RE, and MR.