Linear and Nonlinear QSAR Study of N 2 and O 6 Substituted Guanine Derivatives as Cyclin-Dependent Kinase 2 Inhibitors

The inhibitory activities (pIC 50 ) of N2 andO6 substituted guanine derivatives as cyclin-dependent kinase 2 (CDK2) inhibitors have been successfully modeled using calculated molecular descriptors. Two linear (MLR) and nonlinear (ANN) methods were utilized for construction ofmodels to predict the pIC 50 activities of those compounds.TheQSARmodels were validated by cross-validation (leave-one-out) as well as application of the models for prediction of pIC 50 of external set compounds. Also, the models were validated by calculation of statistical parameters and Y-randomization test. Two methods provided accurate predictions, although more accurate results were obtained by ANNmodel.Themean-squared errors (MSEs) for validation and test sets ofMLR are 0.065, 0.069 and of ANN are 0.017 and 0.063, respectively.


Introduction
The cyclin-dependent kinases (CDKs) are a class of enzymes which play a fundamental role in cell cycle regulation [1,2].Particularly as their name suggests CDKs activation partially depends on the binding of another class of proteins named cyclins, for example, cyclins of the D family complex with CDK4 and CDK6 during G1 phase, cyclin E with CDK2 in late G1, cyclin A with CDK2 in S phase, and cyclin B with CDK1 (also known as cdc2) in late G2/M.Then, aberrant CDK control and consequent loss of cell cycle check point function have been directly linked to the molecular pathology of cancer [3].It is well known that phosphorylation in a conserved threonine residue of the CDK subunit is required for its complete activation.This task is performed by the CDK activating kinase.These proteins properly regulate the cell cycle progress and DNA synthesis only as an active complex (T160pCDK/cyclin) [4].Overall, the activity of the CDK/cyclin complex can be depleted by at least two different mechanisms that contain the phosphorylation of the CDK subunit at the inhibitory sites or the binding of the specialized natural inhibitors known as CDK inhibitors.In the first mechanism, the amino acid residue Y15 and to a lesser extend T14 (in CDK2) are phosphorylated by human Wee 1 Hu [5].This inhibitory phosphorylation is independent of previous cyclin binding [6].The second mechanism involves the binding of natural CDK inhibitors.Four major mammalian CDK inhibitors have been discovered: P21 (CIP1/WAF190) and P27 (KIP1) inactive CDK2 and CDK4 cyclin complexes by binding to them.The two other inhibitors are P16 INK4 and P15 INK4B that are specific for CDK4 and CDK6.They inhibit the formation of the active cyclin complexes by binding to the inactive CDK, and they can also bind to the active complex [2,7].However, it has been shown that natural CDK inhibitors are subexpressed in some carcinogenic cells, and medicinal chemists have put some of their effort in the search for new synthetic inhibitors to replace them [8][9][10][11][12].Some of them have entered in clinical field; for instance, flavopiridol induces cell cycle arrest and tumor growth inhibition [13].
The search for more potent and selective CDK inhibitors is a daunting challenge due to the similarity of the ATP binding site along the different CDK subtypes [14].According to this, the development and use of new strategies to overcome this problem are urgently needed.Nowadays, new and exciting strategies have emerged and become available to find more potent and selective inhibitors, and they normally use quantitative structure-activity relationships (QSARs) derived from different computational calculation approaches [15][16][17][18][19].
Quantitative structure-activity/property relationship (QSAR/QSPR) was used for correlation of different activities and properties to characteristics of molecular structures.In recent years, several QSAR and QSPR models based on both linear and nonlinear methods that aimed to predict different activities and properties were used [20][21][22][23][24][25][26][27][28][29][30].The reliable prediction of inhibition of CDK2 has an important role in medicinal researches.The ultimate role of the different formulations of the QSAR theory is to suggest mathematical models for estimating relevant endpoints of interest, especially when these cannot be experimentally determined for some reason.These studies simply rely on the assumption that the structure of a compound determines the related activity.The molecular structure is therefore translated into the so-called molecular descriptors through mathematical formulae obtained from several theories, such as chemical graph theory, information theory, and quantum mechanics [31,32].In this work, we carried out a QSAR study by predicting the inhibitory activity of a set of N2 and O6 substituted guanine derivatives by using multiple linear regression (MLR) and artificial neural network (ANN) dealing with linearity and nonlinearity, respectively.

Multiple Linear Regression (MLR).
The general goal of multiple linear regression (MLR) is to model the relationship between some independent variables and a dependent variable by fitting a linear equation to observed data.Generally, the multiple linear regression model is as the following equation: where  is the number of independent variables,  1 , . . .,   are the regression coefficients, and  is the dependent variable [33].

Artificial Neural Network (ANN).
ANNs have large numbers of computational units connected in a vast parallel construction.Neural networks do not need an obvious formulation of the handled problem.They act as a means to introduce scaled data to the network.The data from the input layer (input neurons) propagate through the network via interconnections.Scalar weights are specialized to each connection.A remarkable aspect of the neural networks is their learning step.In this step, the value of weights and biases would be optimized based on a set of measured numerical values (training set).More details about neural networks are given in [34,35].

Descriptor Generation and Screening.
The inhibitory activation of compounds is related to some of their structural, electronic, and geometric properties.The value of these properties can be encoded quantitatively by numerical values named molecular descriptors.These molecular parameters are to be used to search for the best QSAR model of the inhibitory activation.The 2D structures of the molecules were drawn using Hyperchem8 software [37].The molecular structures were optimized using the Polak-Ribiere algorithm until the root-mean-square gradient was 0.001 kcal mol −1 .The resulting geometry was transferred into the Dragon program package, and 1481 descriptors were produced [38].Then, these descriptors were given to SPSS 17 for statistical work [39].It is worth mentioning that in the first preselected analysis, some descriptors were removed because many of them included zero or other constant/near-constant values and did not have enough information of structure.On the other hand, to decrease the redundancy existing in the descriptor data matrix, the correlation coefficient  of the descriptors with each other (Pearson's correlation) was examined, and the collinear descriptors (with  > 0.9) were removed.By doing so, 238 descriptors were remained.Then by using the stepwise mode for regression,   The correlation matrix for the selected 10 descriptors presented in the model is shown in Table 3.These results show there is not any correlation between the selected descriptors.

Results and Discussion
The prediction ability of QSAR/QSPR models is affected by two factors: one is the descriptors, which should carry enough information of molecular structure for the interpretation of the activity/property.The other is the modeling method employed [20].The number of descriptors available for QSAR/QSPR studies is often so large that it is difficult to obtain a model including all of them.Therefore, identifying important descriptors certainly plays an important role in QSAR/QSPR.Descriptors should represent the maximum information in activity variations, and collinearity among them must be kept to a minimum.As can be seen from the correlation matrix (Table 3), there is no significant correlation between the selected descriptors.In the present work, these descriptors were used for construction of both linear and nonlinear models.This model was then used to predict the validation and test sets of data.Then, artificial neural network (ANN) was used to make a nonlinear model to calculate the inhibitory activities (pIC 50 ) of the compounds.To do so, a 3-layer feedforward network with backpropagation pattern was used in which mean squared error (MSE) was applied as the performance function.The MLR selected descriptors were used as the input layer of the network.To have a strong network, 5 parameters were optimized.The optimized parameters are (1) the number of descriptors (between 2 and 10), (2) the number of nodes in the hidden layer, (3) the transfer function (including log sigmoid and tan sigmoid), (4) training function (including Bayesian regulation (trainbr) and Levenberg-Marquardt (trainlm)), and finally (5) number of epochs.Table 4 shows the training settings of the optimized network.It should be noted that the training of the network for the prediction of pIC 50 was interrupted when the MSE of the validation set started to increase, to avoid overfitting.
According to Table 4, a network with a Levenberg-Marquardt training function and log-sigmoid transfer function with 10 descriptors (the same MLR descriptors) has the least MSE value (0.0171).In order to evaluate the predictive ability of the linear and nonlinear models and to compare them, we employed the percentage of mean absolute error (MAE), mean squared error (MSE), predictive residual sum of squares (PRESSs), standard error of prediction (SEP), determination coefficient ( 2 ), percentage of relative error prediction (REP (%)), and relative mean absolute error (RMAE).These statistical parameters for MLR and ANN are listed in Table 5.
As can be seen from Table 5, all the error parameters' values of ANN for both test and validation sets are smaller than those of MLR.This is believed to be due to the nonlinear capabilities of the ANN model.
The used statistical parameters are defined as:  where   is the experimental value, ŷ is the predicted value,  is the mean value, and  is the number of compounds.
To avoid chance correlation and to guarantee the network's predictability power, Y-randomization test was carried out.The results of several repetition of this test are shown in Table 6.The low values of  2 show that there is no chance correlation in the developed model.
Figures 1 and 2 show plots of the predicted values versus experimental ones of ANN and MLR models for validation and test sets.The obtained results show the superiority of ANN model than MLR to predict of pIC 50 of these compounds.The ANN and MLR residuals of leave-oneout are plotted against the experimental values in Figure 3.The symmetric distribution of residuals at both sides of the zero line indicates that no systematic error exists in the development of the MLR and ANN models.

Figure 1 :
Figure 1: Plot of the predicted values versus the experimental ones for the validation set.

Figure 2 :Figure 3 :
Figure 2: Plot of the predicted values versus the experimental ones for the test set.
[36]experimental data used in the present study to model IC 50 were taken from[36].The whole data set included 56 compounds, whose biological activities (pIC 50 values) were determined for inhibition of CDK2.It is worthy to say that pIC 50 values span a broad range from 4.11 to 7.22 M. In this work, the structure-activity model is generated by ANN and MLR.The names of these compounds, their experimental and calculated pIC 50 values by ANN and MLR methods, and also their values using leave-one-out are shown in Table1.As can be seen, this set contains 56 inhibitory activity (pIC 50 ) data of CDK2s.The data set was split into training, validation, and test sets to increase the network's generalization.The training set of 34 compounds, with pIC 50 values ranging from 4.11 to 7.22, was used to construct the model.The validation set of 11 compounds, with pIC 50 values ranging from 4.19 to 6.62, was used to prevent overtraining/over fitting of the ANN model.The test set of 11 compounds, with pIC 50 values ranging from 4.19 to 6.96, was used as an external set to evaluate the predictive ability of the model. 3.1.Data.

Table 1 :
The names of compounds, their experimental and calculated pIC 50 values by ANN and MLR methods, and also their values were calculated using leave-one-out.

Table 2 :
Descriptors were selected for construction of model.

Table 3 :
Correlation matrix for the selected descriptors.

Table 4 :
The training settings for the ANN model.

Table 5 :
Comparison of the statistical parameters obtained from ANN and MLR models.

Table 6 :
2 values of the test set after several Y-randomization tests.