Quantitative Structure Activity Relationship Studies of Topoisomerase I Inhibitors as Potent Antibreast Cancer Agents

Topoisomerase I (TOP I) is a valuable molecular target for the development of clinically used anticancer agents. Indenoisoquinolines have emerged as potent topoisomerase I inhibitors. To elucidate the features responsible for their activity, QSAR studies were performed on breast cancer cell line data using stepwise multiple linear regression (MLR), partial least squares (PLS), and a neural network. The MLR and PLS models showed good correlation values of r² = 0.932, r²cv = 0.897 and r² = 0.932, r²cv = 0.913, respectively. The models revealed the importance of the steric arrangement of functional groups and the number of H-bond acceptors. In addition to MLR and PLS, a neural network architecture was constructed using the selected descriptors and the inhibitory activities in order to evaluate how the biological activity depends on the obtained descriptors.


Introduction
Cancer is a leading cause of death worldwide. It is a disease of cells characterized by progressive, persistent, abnormal, and uncontrolled proliferation of tissues. Deaths from cancer worldwide are projected to continue rising. By 2030, there will be an estimated 26 million new cancer cases and 17 million cancer deaths per year [1]. Breast cancer is a malignant tumor that starts from cells of the breast. It is the most frequently diagnosed cancer in women around the world. Breast cancer deaths among women in the US were estimated to reach 39,520 in 2011 [2]. The successful treatment of this disease is limited by the fact that essentially all breast cancers become resistant to chemotherapy. Therefore, new chemotherapeutic agents need to be designed that not only target breast cancer but also display increased efficacy and decreased overall systemic toxicity.
Topoisomerase I (TOP I) is a valuable molecular target for the development of clinically used anticancer agents. Camptothecin (CPT) was the first agent identified as a TOP I inhibitor. Camptothecin and its derivatives exert their pharmacological activity by binding at the interface of the TOP I-DNA complex [3,4]. Although CPTs are very potent, they often show dose-related toxicities and pharmacokinetic problems [5][6][7][8]. Compared with camptothecins, the indenoisoquinolines as a class of cytotoxic TOP I inhibitors offer certain advantages, including greater stability of the compounds themselves, as well as greater stability of their drug-enzyme-DNA cleavage complexes [9][10][11][12][13].
In line with the above discussion, we felt that there is a need to evaluate the binding requirements of indenoisoquinolines as TOP I inhibitors by employing a computational approach. The results of such studies should be helpful in designing more potent TOP I inhibitors. Indenoisoquinoline compounds and their derivatives possess antitumor and other biological activities, such as antitrypanosomal activity; however, the relationships between their structure and activity are still not well understood [14]. Therefore, correlating the physicochemical properties or structural features of the compounds with their cytotoxicities, expressed as GI50, should provide useful information for the design of new antitumor drugs. QSAR is a powerful computational approach for relating biological activities to properties or molecular structures, and it helps to explore the relationship between the structures of ligands and their activities [15][16][17]. It also offers higher speed and lower cost for bioactivity evaluation compared with experimental testing. Among classical QSAR methods, multiple linear regression (MLR) and partial least squares (PLS) are widely used. However, regression analysis assumes a linear relationship between the biological activity and one or more descriptors. Biological phenomena, on the other hand, are considered nonlinear by nature, and therefore the contribution of some parameters to a specific biological activity can be nonlinear. Neural networks can address this problem owing to their nonlinear mapping.
In the present study, we have developed the QSAR models using multiple linear regression (MLR), partial least squares (PLS), and artificial neural network (ANN) approaches on the same series of compounds.

Biological Activity Data.
A series of indenoisoquinoline derivatives was adopted from the literature for the present QSAR studies [18]. The experimentally determined cytotoxicity GI50 values, obtained with selected breast cancer cell lines, are the concentrations corresponding to 50% growth inhibition. The reported inhibitory values were converted into the corresponding −log values (log 1/GI50). All computational studies were performed using the tools for structure activity relation (TSAR) version 3.3 software and ChemDraw Ultra 10.0. The calculations in TSAR derive a wide range of structural descriptors from the simple 2D and 3D structural information available from a structure; some calculations use partial atomic charges to derive further dipole-related parameters. Model development in TSAR is mainly based on a numerical description of all descriptors and employs statistical methods to perform correlation.
All chemical structures of the indenoisoquinoline derivatives were sketched and imported into the TSAR 3.3 (http://accelrys.com/) spreadsheet via .mol files. Some compounds were excluded from the QSAR study because their GI50 values were uncertain.
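The −log conversion described above can be sketched as follows. This is a minimal illustration, assuming GI50 is expressed in mol/L (the unit is not stated in the text); the function name `p_gi50` and the example value are hypothetical and not taken from the dataset.

```python
import math


def p_gi50(gi50_molar):
    """Return -log10(GI50), i.e. log(1/GI50), for a GI50 given in mol/L (assumed unit)."""
    if gi50_molar <= 0:
        raise ValueError("GI50 must be a positive concentration")
    return -math.log10(gi50_molar)


# Hypothetical example: a GI50 of 1 micromolar maps to a value of 6.
example = p_gi50(1e-6)
```

Working on the log scale means that one unit of the dependent variable corresponds to one order of magnitude in GI50.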

Defining Substitutions and Generation of 3D Structure and Charge Calculation.
Structure entry and the definition of substitutions are an important stage in QSAR methodology. The substituents of each chemical structure were defined as four groups (namely R1, R2, R3, and R4). All substituents were numbered according to their position in the molecule, and each molecule had a defined number of substituents attached to the nucleus by a single bond. The substituent pattern adopted is depicted in Table 1. Three-dimensional structures of all the molecules and their substituents were generated. Charges were calculated using the Charge-2 option, and the geometries of all the structures were optimized using the Cosmic module of TSAR.

Descriptor Generation.
TSAR affords calculation of the following descriptors: molecular surface area, volume, moments of inertia, ellipsoidal volume, Verloop parameters, dipole moments, lipole moments, molecular mass, Wiener index, molecular connectivity indices, molecular shape indices, electrotopological state indices, log P, number of defined atoms (carbon, nitrogen, etc.), rings (aromatic and aliphatic) and groups (methyl, hydroxyl, etc.), electrostatic properties such as the total energy, electronic energy, nuclear repulsion energy, accessible surface area, atomic charge, mean polarizability, heat of formation, total dipole, polarizability, and dipole components.
As an initial approach, more than 250 molecular descriptors were calculated for the whole molecule and for the substituents, which vary from one molecule to another at a common point on the generic structure. To reduce data redundancy, pairwise correlation analysis was performed. Descriptors with high intercorrelation were examined for their correlation with biological activity, and the descriptor with the lower correlation was discarded. This limited false predictions by the QSAR model, as high collinearity among descriptors can lead to statistical instability and overprediction and can also make mechanistic interpretation difficult [19]. Descriptors whose values were clumped about a few distinct values were also removed, since these are not useful for explaining a continuous variation of activity.
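The pairwise correlation filtering described above can be sketched as a simple greedy procedure. This is only an illustration, not the TSAR implementation: the cutoff of 0.9, the function names, and the choice to skip constant descriptors are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(descriptors, activity, threshold=0.9):
    """Keep descriptors so that no kept pair has |r| >= threshold (assumed cutoff).

    descriptors: dict mapping descriptor name -> list of values.
    Within a highly intercorrelated pair, the descriptor that correlates
    better with activity survives, because candidates are visited in
    decreasing order of |r| with activity.
    """
    ranked = sorted(
        descriptors,
        key=lambda name: abs(pearson(descriptors[name], activity)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        values = descriptors[name]
        if len(set(values)) < 2:
            continue  # constant column carries no information
        if all(abs(pearson(values, descriptors[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

In this sketch the order of visiting matters: ranking by correlation with activity first is what ensures that the weaker member of each collinear pair is the one discarded.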

Training and Test Set Assembly.
Molecules lacking a numerical growth inhibitory activity value were removed from the analysis. The dataset was randomly partitioned into a training set of 30 compounds and a test set of 9 compounds with adequate coverage in terms of both chemical and biological diversity. During model development and validation, 6 molecules (5, 40, 50, 59, 95, and 108) were found not to fit either the training or the test set of compounds. These six compounds were detected as outliers since their residual values exceeded two orders of magnitude. For this reason, these compounds were finally omitted from the training set.
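A minimal sketch of the random partition and of a residual-based outlier flag is given below. The seed, the function names, and the reading of "two orders of magnitude" as an absolute residual above 2 log units are assumptions made for illustration; the text does not specify how the partition or the cutoff was implemented.

```python
import random


def split_dataset(ids, n_test, seed=0):
    """Randomly split compound identifiers into (training, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


def flag_outliers(observed, predicted, max_abs_residual=2.0):
    """Return indices whose |observed - predicted| exceeds the cutoff.

    On a -log(GI50) scale, a residual of 2 corresponds to two orders of
    magnitude in GI50 (assumed interpretation of the criterion).
    """
    return [
        i
        for i, (o, p) in enumerate(zip(observed, predicted))
        if abs(o - p) > max_abs_residual
    ]


# 39 hypothetical compound indices split 30/9, as in the study design.
train_ids, test_ids = split_dataset(range(1, 40), n_test=9)
```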

Multivariate Statistical Analysis.
The computational tools used in the present drug-design study are rather diverse and comprise equation-based models, namely multiple linear regression (MLR) and partial least squares (PLS), and a nonequation-based neural network (NN) model.

Multiple Linear Regression Analysis (MLR).
The multiple linear regression protocol builds a model for a dependent property using the selected molecular descriptors. The relationship between the physicochemical and structural parameters and the cytotoxic activity (log 1/GI50) was quantified by multiple linear regression. The acceptability, robustness, and predictive power of the model were decided on the basis of statistical significance parameters such as the correlation coefficient (r) of the training and test sets, the square of the correlation coefficient (r²), the cross-validated correlation coefficient (r²cv), the Fisher ratio (F), and the standard deviation (s). The chosen models were validated by the leave-one-out cross-validation procedure [20] and by test set prediction.
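The MLR fit and its leave-one-out cross-validation can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the TSAR procedure: the model includes an intercept, is solved through the normal equations, and r²cv is computed as 1 − PRESS/SS_total. All function names are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_mlr(X, y):
    """Ordinary least squares with intercept; returns [b0, b1, ..., bp]."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, x):
    """Predicted activity for one descriptor vector x."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def r_squared(X, y):
    """Fitted r2 of the MLR model on its own training data."""
    coef = fit_mlr(X, y)
    mean = sum(y) / len(y)
    ss_res = sum((t - predict(coef, x)) ** 2 for x, t in zip(X, y))
    return 1 - ss_res / sum((t - mean) ** 2 for t in y)


def loo_q_squared(X, y):
    """Leave-one-out cross-validated r2: each compound is predicted by a model fitted without it."""
    mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    return 1 - press / sum((t - mean) ** 2 for t in y)
```

Both functions expect `X` as a list of descriptor rows and `y` as a list of activities; the normal-equation route is adequate for a handful of weakly correlated descriptors but would be a poor choice for a large collinear pool.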

Partial Least Square Analysis (PLS).
PLS is a robust multivariate generalized regression method which uses projections to summarize multitudes of potentially collinear variables [21]. It has been recommended as an alternative approach to enlarge the information content in each model and to avoid the danger of overfitting [22]. PLS models were developed for the selected set of descriptors and checked for statistical significance.
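To make the projection idea concrete, the sketch below implements a single-response PLS (PLS1) with the NIPALS deflation scheme in plain Python. It is an illustration under assumptions, not the TSAR routine: descriptors are mean-centred but not scaled, and the function names are hypothetical.

```python
def fit_pls1(X, y, n_components):
    """Fit PLS1 by NIPALS; returns (x_mean, y_mean, [(weights, loadings, q), ...])."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred X residual
    f = [v - y_mean for v in y]                                 # centred y residual
    comps = []
    for _ in range(n_components):
        # Weight vector: direction in descriptor space most covariant with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate: remove the part of X and y explained by this component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps


def predict_pls1(model, x):
    """Predict one sample by replaying the deflation steps on its centred descriptors."""
    x_mean, y_mean, comps = model
    e = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in comps:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat
```

When as many components are extracted as there are independent descriptors, PLS1 reproduces the ordinary least squares fit; using fewer components is what gives the method its tolerance to collinear variables.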

Feed Forward Neural Network Analysis (FFNN).
It has been reported that artificial neural networks (ANN) sometimes provide more accurate estimates than multiple linear regression [23]. This provides evidence for the dependency of biological activity on structural features.
A feed forward neural network (FFNN) [24] consists of three layers: an input layer, a hidden layer, and an output layer. The input layer does not process the information; it only distributes the input vectors to the hidden layer. Each neuron in the hidden layer employs a nonlinear transfer function to operate on the input data. NNs were used as the evaluation function for mapping the molecular descriptors to the activity of interest (dependent variable) by using the Monte Carlo algorithm.
The best set of descriptors selected on the basis of linear regression was used to build the FFNN model. Models with different net configurations were generated to improve the rms error and the predictive power of the model. Although the number of variables is the same in the regression model and the FFNN model, there are more adjustable parameters in the FFNN model, since each connection is considered adjustable.
An initial weighting value of 1.0 was applied to all connections. Starting weights in the range of −0.03 to +0.03, and initial node biases in the range of −1 to +1, were selected. The results were visualized on a 2D plot of output node against input (dependency graph). The FFNN architecture was 3-1-1 for the final model. In the present study, the objective of developing the FFNN-based model was to validate the results of MLR and PLS.
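The 3-1-1 architecture can be illustrated with a forward pass and a count of adjustable parameters. This is a sketch under assumptions: the text does not name the transfer function, so a logistic sigmoid in the hidden node and a linear output node are assumed, and the weights shown are arbitrary placeholders rather than trained values.

```python
import math


def sigmoid(z):
    """Logistic transfer function (assumed; the source does not specify one)."""
    return 1.0 / (1.0 + math.exp(-z))


def ffnn_3_1_1(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: 3 inputs -> 1 sigmoid hidden node -> 1 linear output node."""
    hidden = sigmoid(sum(w * v for w, v in zip(w_hidden, x)) + b_hidden)
    return w_out * hidden + b_out


def n_parameters(layers):
    """Number of adjustable weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))


# Placeholder weights inside the stated starting range of -0.03 to +0.03.
output = ffnn_3_1_1([0.0, 0.0, 0.0], [0.01, -0.02, 0.03], 0.0, 2.0, 1.0)
```

Under these assumptions a 3-1-1 network has 6 adjustable parameters (3 input weights, 1 hidden bias, 1 output weight, 1 output bias), compared with 4 for a three-descriptor linear regression with intercept.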

Results and Discussion
Multivariate Models.
Linear multivariate analyses, namely multiple linear regression (MLR) and partial least squares (PLS), and a nonlinear analysis, the feed forward neural network (FFNN), were carried out on the breast cancer cell line data. A total of 39 compounds were considered for the present QSAR study (Table 1) and were divided into a training set of 30 compounds and a test set of 9 compounds.

Linear Regression Analysis.
First, multivariate analysis (MLR and PLS) was performed on the whole descriptor pool, which showed insignificant predictive power, with r²cv (MLR) = 0.384 and r²cv (PLS) = 0.583. From the initial descriptor pool, a final set of three descriptors was selected for the compounds under investigation; these were independent of each other and were useful in generating the model (Tables 2 and 3).
Outliers are commonly encountered in QSAR; they may arise when compounds act on a different binding site of the same enzyme or from limitations in the quality of the biological data. To maximize the predictability of the model and to gradually improve its statistical significance, the identified outliers were deleted. In the present study, 6 outliers (compounds 5, 40, 50, 59, 95, and 108) were detected on the basis of their high residual values and deleted one by one.
The final regression equation obtained from the MLR analysis for the breast cancer cell line is represented in (2), where descriptor 1 is molecular mass (substituent 3), descriptor 2 is molecular surface area (substituent 2), and descriptor 3 is the number of H-bond acceptors (substituent 4).
The training set without outliers and with the reduced number of descriptors showed improved statistical values and high predictivity, with r²cv (MLR) = 0.897 and r²cv (PLS) = 0.91 (Table 4).
The best fit equation was selected on the basis of the highest correlation coefficients and the lowest standard deviations, as depicted in Table 4. The s value is the standard deviation about the regression line, or standard error of the regression model. The value of s for the best MLR model is 0.269. The smaller the value of s, the better the QSAR model. The Fisher statistic (F) is the ratio between explained and unexplained variance for a given number of degrees of freedom. The F value for the MLR model is 92.570, which is statistically significant. The larger the value of F, the greater the probability that the QSAR model is significant.
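As a rough worked check of the F statistic described above, the sketch below computes F from r² using the usual regression relation F = (r²/p) / ((1 − r²)/(n − p − 1)). It assumes n = 24 training compounds (30 minus the 6 deleted outliers) and p = 3 descriptors; both counts are inferred from the text rather than stated next to the reported F, and the function names are hypothetical.

```python
def f_statistic(r2, n, p):
    """Ratio of explained to unexplained variance per degree of freedom."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


def standard_error(ss_res, n, p):
    """Standard deviation about the regression line from the residual sum of squares."""
    return (ss_res / (n - p - 1)) ** 0.5


# With the rounded r2 = 0.932, n = 24 and p = 3 (assumed counts).
f_value = f_statistic(0.932, n=24, p=3)
```

With the rounded r² this gives roughly 91.4, close to the reported 92.570; the small gap is plausibly due to rounding of r².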
In addition to MLR, PLS was performed on the selected significant descriptors [25,26]. As there may be redundancy of information when all the descriptors are analyzed, the PLS analysis was performed after variable selection. According to Cramer, PLS regression can be used with more than one dependent variable, and for a well-defined problem both MLR and PLS should give comparable results [27].
The best PLS model is represented by (3), where descriptor 1 is molecular mass (substituent 3), descriptor 2 is molecular surface area (substituent 2), and descriptor 3 is the number of H-bond acceptors (substituent 4). The statistical significance of the generated QSAR models was evaluated in terms of the squared correlation coefficient: the r² value of 0.932 for both MLR and PLS explains 93.2% of the variance in biological activity in both analyses. This indicates the goodness of fit of the models. The r²cv values of the MLR and PLS models were 0.897 and 0.913, respectively, so the two models have comparable r²cv. The r²cv is an important measure of the predictive power of a model. The closer the r²cv value is to 1.0, the better the predictive power of the model. For a good model, r²cv should be fairly close to r². If r²cv is much lower than r², the regression is probably overfitting the data. The predictive ability of the model was also validated using the test set of compounds. According to Tropsha et al. [28,29], the predictive ability of a QSAR model can be estimated conveniently by an external r²ext (4):

$$ r^2_{ext} = 1 - \frac{\sum_{i=1}^{test} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{test} (y_i - \bar{y}_{tr})^2} \qquad (4) $$

where y_i and ŷ_i are the observed and predicted activities of the test set compounds and ȳ_tr is the averaged value of the dependent variable for the training set.
Ideally, the following criteria should be satisfied: 0.85 < k < 1.15 or 0.85 < k′ < 1.15. Here r² is the squared correlation coefficient of the regression between the predicted and observed activities of compounds in the training and test sets. The mathematical definitions of r0², r′0², k, and k′ are based on regression of observed activities against predicted activities and regression of predicted activities against observed activities [28]. The robust statistical values for the models are indicative of the high predictive ability of the developed models (Table 4).
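The external validation quantities discussed above can be sketched as follows. The external r² uses the training set mean in the denominator, and the slopes k and k′ are taken as the slopes of the regression lines through the origin (observed against predicted, and predicted against observed), which is the commonly used definition and is assumed here. The function names and the numerical example are hypothetical and do not come from Tables 5 or 6.

```python
def r2_external(observed, predicted, y_train_mean):
    """External r2: 1 - sum((y - y_hat)^2) / sum((y - mean_train)^2) over the test set."""
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_ref = sum((o - y_train_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_ref


def origin_slopes(observed, predicted):
    """Slopes k and k' of the regression lines through the origin."""
    sxy = sum(o * p for o, p in zip(observed, predicted))
    k = sxy / sum(p * p for p in predicted)        # observed against predicted
    k_prime = sxy / sum(o * o for o in observed)   # predicted against observed
    return k, k_prime


# Hypothetical test set values on the -log(GI50) scale.
obs = [5.0, 6.0, 7.0]
pred = [5.1, 5.9, 7.2]
r2_ext = r2_external(obs, pred, y_train_mean=6.0)
k, k_prime = origin_slopes(obs, pred)
```

In this example both slopes fall inside the 0.85 to 1.15 window, which is the condition quoted above.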

Feed Forward Neural Network Analysis.
The FFNN model in this study was developed with the same descriptor set that was used in the MLR and PLS analyses in order to check the dependence of biological activity on structural features. A multiple-layer FFNN, which undergoes supervised training by back propagation of error, was used. The number of neurons in the hidden layer and the number of rows in the training set were balanced to achieve the optimum predictive power for the neural network. The statistics obtained for the FFNN treatment were n = 24, input columns (descriptors) = 3, net configuration = 3-1-1 (3 input nodes, 1 processing node, 1 output node), with test rms = 0.088, best rms = 0.068, and r² = 0.931 for training and r² = 0.829 for test. Although it uses the same descriptors as the MLR model, the FFNN treatment appears to slightly improve the predictions obtained (Table 4). The actual and predicted activities obtained from the MLR/PLS and FFNN analyses for the training and test sets of compounds are shown in Tables 5 and 6, respectively. The QSAR model obtained exhibited strong negative dependencies on the molecular mass (substituent 3), molecular surface area (substituent 2), and number of H-bond acceptors (substituent 4) (Figures 7, 8 and 9).

Study of Entered Descriptors.
Molecular mass (substituent 3) is a steric parameter related to the bulkiness of the molecule. It defines the size and total number of hydrogen bonds for each molecule. The molecular mass descriptor has also been a ubiquitous variable in QSAR studies. This descriptor is used because it closely approximates the size (radii) of the drugs involved in the study and their interaction with the enzyme. The regression equation shows that it is negatively correlated with activity; that is, an increase in the molecular mass of substituent 3 decreases the biological activity and vice versa, as can be seen when compounds 49 and 68 are compared.
The molecular surface area (substituent 2) depends on the structure connectivity and conformation. It is one of the most popular geometrical descriptors of a compound. It has been closely related, through various physical and quantum mechanical models, to the intermolecular dispersion energy and the free energy of cavity formation in condensed media. Surface area has a prominent effect on the interactions that occur between a drug molecule and its surroundings. Such descriptors are useful in encoding steric effects that can occur when drugs bind to DNA. In the present study, it is negatively correlated with the activity; that is, an increase in the surface area of substituent 2 would cause a decrease in the biological activity, which can be verified when compounds 49 and 68 are compared.
The number of H-bond acceptors is related to the acidity of the molecule. A less acidic functional group shows a higher tendency to accept hydrogen. This descriptor is negatively correlated with the biological activity: as hydrogen-accepting groups are introduced at substituent 4, the biological activity decreases. Thus the introduction of a less acidic group would have a negative effect on the biological activity.
For drugs binding reversibly to DNA, both their strength of binding and their cytotoxicity have been fairly well predicted from the identified molecular descriptors. The derived equations provide information about the importance of physicochemical molecular descriptors, such as molecular weight, surface area, and the number of hydrogen bond acceptors, which may be useful for the rational design of novel TOP I inhibitors.

Conclusions
Topoisomerase I (TOP I) has become a valuable molecular target for the development of clinically used anticancer agents. In the present study, QSAR models were developed using multiple linear regression (MLR), partial least squares (PLS), and artificial neural network (ANN) approaches on the indenoisoquinoline-based TOP I inhibitors. All these models revealed the importance of molecular mass, molecular surface area, and the number of H-bond acceptors in predicting the bioactivity of these inhibitors as antibreast cancer agents. Thus, these QSAR models should be useful for the design and development of novel TOP I inhibitors as anticancer agents.

Figure 1: Plot of actual versus predicted values for the training set molecules with the help of MLR statistical method.

Figures 1, 2, 3, 4, 5, and 6 depict the graphs plotted between actual and predicted activities of training and test set obtained by MLR, PLS and FFNN respectively.

Figure 2: Plot of actual versus predicted values for the training set molecules with the help of PLS statistical method.

Figure 3: Plot of actual versus predicted values for the training set molecules with the help of FFNN statistical method.

Figure 4: Plot of actual versus predicted values for the test set molecules with the help of MLR statistical method.

Figure 5: Plot of actual versus predicted values for the test set molecules with the help of PLS statistical method.

Figure 6: Plot of actual versus predicted values for the test set molecules with the help of FFNN statistical method.

Figure 7: Dependency graph illustrating the correlation between the molecular mass (substituent 3) used to train the neural network architecture versus the actual activity data.

Table 1: Structures of the indenoisoquinoline derivatives as topoisomerase I inhibitors.

Table 2: Correlation matrix of the independent variables used in the final model demonstrating the degree of correlation.

Table 3: Statistical data of the independent variable illustrating the significance in terms of statistical parameters.

Table 4: Statistical validation parameters obtained in the MLR, PLS, and FFNN analyses.

Table 5: Actual and predicted activity data of derivatives obtained from multivariate analysis for training set of breast cell line.

Table 6: Actual and predicted activity data of derivatives obtained from multivariate analysis for test set of breast cell line.