Computational Molecular Modeling of Pin 1 Inhibition Activity of Quinazoline , Benzophenone , and Pyrimidine Derivatives

Universidad San Francisco de Quito, Departamento de Matemática, Diego de Robles y Vı́a Interoceánica, Quito 17-1200-841, Ecuador Universidad San Francisco de Quito, Grupo de Quı́mica Computacional y Teórica (QCT-USFQ), Departamento de Ingenieŕıa Quı́mica, Diego de Robles y Vı́a Interoceánica, Quito 17-1200-841, Ecuador Departamento de Quı́mica y Bioloǵıa, División de Ciencias Básicas, Universidad del Norte, Km 5 vı́a Puerto Colombia, Barranquilla, Colombia


Introduction
Pin1 has been used as a target for treating cancer since its discovery [1] because it plays a critical role in cell-cycle regulation, it catalyses the cis-trans isomerization of prolyl amide bonds in its substrate proteins, and deregulated proteins are common human cancer cells [2].Also, Pin1 induces apoptosis and mitotic arrest.en, the inhibition of Pin1 presents new opportunities for the development of new anticancer treatments [3].A potential prognostic marker in human cancers should be the overexpression of Pin1, as demonstrated for the breast [4], prostate [5], and lung [6].Moreover, it has been reported that 38 of 60 tumours have more than 10% of Pin1 overexpression, compared with the corresponding normal controls [7].
Focused on the importance of Pin1 in the cancer treatment, some inhibitors for Pin1 have been reported such as 2-{[4-(4-tert-butylbenzenesulfonamido)-1-oxo-1,4dihydronaphthalen-2-yl]sulfanyl}acetic acid (KPT-6566) [8], all-trans retinoic acid (ATRA) [9], and inhibitors based on aromatic compounds [10].With respect to the lastmentioned compounds, the Bailing Xu's group had reported the synthesis of several compounds as potential Pin1 inhibitors.In their earlier e orts, in 2011, they synthesized 2,4disubstituted quinazoline derivatives (Scheme 1).All quinazoline synthesized structures are available in Table S1 in the supplementary material.e most potential inhibitor, compound 13, with the 50% inhibitory concentration (IC 50 ) equal to 2.90 μM, has two chlorine atoms bonded in the position 3 of the aromatic ring on the substituent R 4 , a carboxylic acid linked to the benzene in the position R 5 , and an NO 2 group in the quinazoline nucleus.Clearly, the set of molecules depicted in Table S1 in the supplementary material have several heteroatoms present in their structures [11], which suggest a large dipole moment due to the electronegative di erences between the atoms involved in the compound.ese set of compounds were separated into two parts, the rst containing an amine group to link the R 1 , while in the second, an amide functional group is linked with the R 4 group.
In the same context, in 2012, Liu et al. [12] prepared a series of Pin1 inhibitors with benzophenone skeleton (Scheme 2).
e investigation on structure-activity relationships (SAR), varying the substituents on the position (R 6 and R 7 ), the binding (1 and 2), and the molecules (X and Y) was performed (complete structures are shown in Table S2 in the supplementary material).e author suggested that the Pin1 inhibition could be related to two molecular characteristics: an oxo-acetic group linked to the benzophenone moiety and the aromatic bicyclic ring, having di erent heteroatoms, linked to amide group moiety.In addition, the author suggested the methoxy group could enhance the activity substantially.e most active substrate of this set was compound 24 with an IC 50 value of 5.99 μM, which contains a bicycle nitro-benzothiophene, which can increase substantially the polarity of the molecules in the corresponding evaluated set.
On the contrary, recently Cui et al. [13] reported the synthesis and Pin1 inhibitory activities of pyrimidine derivatives, and their core structures are shown in Scheme 3. A set of twenty-six compounds was prepared by the authors, and di erent aromatic substituents including the heteroatoms nitrogen, oxygen, sulphur, and some halogen were used as substituents, presenting a very interesting dataset with important variations between their structures (Table S3 in the supplementary material).
e authors suggest that compounds 28, 33, 38, and 49 with IC 50 values lower than 3 μM demonstrate potent inhibitory activities against Pin1, compound 38 being the most active with an IC 50 value of 1.68 μM.
is compound in analogy to the most active compounds of the other two sets is shown in Tables S1 and  S2 in the supplementary material (13 and 24), and it also presents a bicyclic structure as substituent involving oxygen and nitrogen as heteroatoms in R 9 (4-benzoxazole).Additionally, the chlorine and nitro groups on 38 also suggest a large polarity of the molecule.
Based on the hypothesis of the existence of a relationship between the molecular structure and the biological activity, the drug design in general terms can be assisted by quantitative structure-activity relationship (QSAR) modeling, which has become a widely used tool in computer-aided drug design (CADD), fate modeling and predictive environmental risk assessment, property prediction, and toxicity of pharmaceuticals and chemicals [14].Several molecular descriptors can be used to obtain QSAR models, and they can be achieved from quantum mechanics calculation and/ or topological indexes for two-and three-dimensional structures [15][16][17].In this sense, Ghalia et al. [18] reported a 3D-QSAR study of the TC 50 , which was calculated by the Reed and Muench method and represents the concentration that inhibits 50% cellular growth compared to that untreated control, and IC 50 (antiviral activity) by using quantum chemical descriptors, which were estimated on twenty-one molecules of novel N-phenyl benzamide and Nphenylacetophenone derivatives.Multilinear regression and non-multilinear regression models were obtained for the dataset, which was divided as the training and test set.e authors suggest that the results represent an excellent stability for high values based on correlation coe cients R pIC50 0.91 R pTC50 0.96 for the RNLM and R pIC50 0.87 and R pTC50 0.95 for MLR.In addition, a QSAR predictive analysis through an assembly of a regression model to predict the inhibition of aldose reductase for avonoids was carried out on 55 molecular structures, including parameters of all types calculated using the software DRAGON 5.0 [19].
e predicted power of this model was measured with the following parameters Q loo 0.934 and l−n%-o R l-30%-o 0.803 [20].Furthermore, another study has developed QSAR models to predict the inhibitory activity of 88 organic bromodomain modulators.In this case, the descriptors were developed using QuBiLs-MIDAS and MAS  Journal of Chemistry (Quadratic, Bilinear and N-Linear Maps based on N-tuple Spatial Metric [(Dis)-Similarity] Matrices and Atomic Weightings).One of the best models with 9 variables showed the following statistical parameters R 2 0.794, Q 2 loo 712, Q 2  Boot 0.683, Q 2 EXT 0.8563, and 1 outlier [21].On the contrary, a useful tool and widely used on drug design is the molecular docking, which is a computational procedure used to predict the binding a nity between a micromolecule (ligand) and a macromolecule (receptor), which has a particular importance in drug development [22].In a recent study was reported a molecular docking study on the evaluation of potential anticancer agents, related with the half maximal inhibitory concentrations (IC 50 ) and their e ect on microtubule assembly [23].Docking programs have become popular to nd the proper position ligands (orientation and conformation) into a protein-binding site [24].Some scoring functions predict the ligand's biological and complementary activity; usually, docking scores are more important than having the correct position [25,26].One of the widely used docking programs is AutoDock Vina, which uses a sophisticated gradient optimization method in the optimization procedure [22].By using AutoDock Vina docking algorithm, the platform Mcule's online app 1-click docking provides the highest quality purchasable molecular modeling and compound database tools, where the calculations are running on cloud machines [26].
In reference to the above description, this work is seeking a reasonable computational modeling for the inhibition of the Pin1 by six-membered aromatic derivative compounds involving the three set of molecules included in Tables S1-S3 in the supplementary material (quinazoline, benzophenone, and pyrimidine derivatives).en, a total of fty-two compounds with important variations into their structures, which imply a robust dataset, have been used for a molecular modeling simulation.Multilinear algebraic map descriptors were used in the modeling process, and di erent regression techniques were employed in the model construction as well as for the aggregation of these models through the construction of an ensemble.

Dataset.
A dataset was employed and separated randomly as the test set (22%) and training (78%).Compound 52 was considered as outlier based on the statistical parameters and adjusted on the training and test set; in addition, it is well-known in the literature that it failed to show cellular e ects due to the poor permeability of the phosphate group [27].eir biological activity expressed as IC 50 was collected from the literature by the Bailing Xu research group, where 17 possess quinazoline structures (Table S1 in the supplementary material) [11], 9 are benzophenone structure (Table S2 in the supplementary material) [12], and 26 of the molecules possess pyrimidine and naphthalemic nucleus (Table S3 in the supplementary material) [13].In Table S4 (Supplementary Materials) is shown all the IC 50 values on μM, and a logarithmic transformation was applied (equation ( 1)), which was used as a dependent variable.pIC 50 − log 10 −6 IC 50 . (1)

Descriptors Calculation.
All the molecules were drawn into the GaussView (Version 5.0) software, and the 3D structure was optimized with the semiempirical PM6 (parametric method 6) [28] by using the software Gaussian 16 suite [29], where the convergence criterion for the self- consistent field (SCF) was set as default.
e molecules were characterized as minimum stationary points, which were obtained by a frequency calculation on the optimized structures at 298.15 K [30].892 topological descriptors were calculated by using the free software QuBiLs-MIDAS and MAS, available on http://tomocomd.com/[31], and that pool wad enlarged by the addition of 5 descriptors: cLogP, cLogS, druglikeness, total surface area, and polar surface area were calculated using the software OSIRIS DataWarrior [32].e hydrophilicity of a drug is measured by the logarithm of the concentration of a drug in n-octanol over water (equation ( 2)); values of marked drugs are between −10 and 8. Another feature used to measure the drug effect is cLogS, which measures its distribution and absorption.
One aim for drug design is to avoid poorly soluble compounds; typical clogS values of traded drugs are greater than −8 and smaller than 2. It is calculated by applying a base 10 logarithm to solubility (S) in mol/liter.In addition, druglikeness is a qualitative concept for drug design and is estimated based on topological descriptors, clogP, molecular weighs, and other properties.Although positive values are recommended for traded drugs, it is not mandatory because it does not measure the biological activity or specific effect [33].
Also, the total surface area and the polar surface area (psa) were also estimated.e total surface area, which considers all polar and nonpolar fragments of the molecule, as well as polar surface area (psa), which is a measure of the degree polarity of molecules, was estimated.psa is equal to the sum of surface contributions from polar fragments."psa" of nontoxic compounds, which do not cause death or an adverse histological change, is greater than 75 Å2 , and compounds with psa <75 Å2 are more likely to be toxic drugs.[34].

MLR.
It is a classical statistical method that calculates the "weights" or coefficients of the dependent variables of a linear expression, and the predicted value is the sum of the attributes multiplied by its weight and the Akaike criterion for model selection.

SMOreg.
is method overcomes the sources of inefficiency and confusion caused by SMO, which maintains a single threshold value, while the SMOreg uses two criteria parameters, which significantly improve the adjusted value on regression as well as the model predictability.

IBK.
is method is amended in the lazy algorithms set to implement in Weka and widely used for classification and regression, which uses cross validation to select the best number of k, which is the same value for k nearest neighbour (KNN) approach.It measures simple distances to find the training instance closest to the test set.In case of same distance obtained in multiple instances, the first one found will be used.e parameter KNN specifies the number of the nearest neighbors to use when predicting a test instance and the outcome is determined by majority vote.

Random Forest. It consists of unpruned classification
or regression trees by using a bootstrap sample of random feature and training data selection.e prediction values are made by the averaging or majority of votes of the ensemble.In addition, the relation with the dependent variable and the descriptors are hidden inside a "black box" and does not produce an explicit model.RF algorithm overcomes the instability of decision caused by its hierarchical nature applying subset selection and bagging techniques and reduces the bias due to class imbalance and overfitting.

Statistical Analysis.
In order to determinate the robustness of a model, several statistical parameters must be calculated [40].First, in case the coefficient of determination for adjusting (R 2 ) value is close to 1, the model is considered robust.Second, a model is considered suitable when the average bootstrapping (Q 2 boot ), which provides information about the predictability of a particular model, is close to 10fold cross validation (Q 2 CV ).ird, the values of a(R 2 ) < 0.3 and b(Q 2 ) < 0.05 are accepted to validate the model.Also, the difference between the total correlation in the attributes (K xx ), which value is lower than 50, and the correlation in the set specified by attributes plus the dependent variable (K xy ) must be positive (ΔK � (K xy − K xx ) > 0).For the purpose of evaluating the internal predictability of each model, standard deviation error of prediction (SDEP), and standard deviation error of calculation (SDEC) values must be close to zero.Finally, models good fitting are corroborated by a high value of Fisher ratio (F) and a low-value standard deviation (s) [41].
Scheme 4 summarizes the process of this work step by step.To begin with, molecules are drawn in GaussView, followed by an optimization in Gaussian 16 at the PM6 level.
en, the following software is used to calculate features: QuBiLs-MIDAS, MAS, and DataWarrior.Subsequently, regression algorithms are applied in Weka 3.8 and the most robust models are selected.Finally, an ensemble by using IBK and/or RF regression techniques was assembled.e parameters to select the best assemble are coefficient of determination of adjust (R 2 ADJ ), cross validation (Q 2 CV ), and test set (Q 2 EXT ), as well as the corresponding mean square errors (MAE).

Docking Analysis.
e docking analysis was done by using the platform Mcule's online app 1-click docking, where the 3D structure of the receptor description file (RDF), Pin1, can be found as 3jyj on the PDB library.e protein 3jyj was selected because in the previous reports in the literature it has been identi ed as the most adequate receptor binding site for the evaluation and screening of possible active organic compound in pin1 inhibition [42].
e cartesian 3D coordinates were identi ed for the binding site as X: 1.1902, Y: 29.3651, and Z: 22.2862, which was established as default, and the size of the binding site was 22 Angstrom.e water molecules in Pin1 protein was removed, hydrogen was added, and incomplete residues were corrected.According to the binding free energy (BE) of the molecules, the 6 nal docked conformations were ranked.

Result and Discussion
A total of 51 compounds were used in this study.In order to show the random distribution of the pIC 50 values of the training and test datasets, a histogram is represented in Figure 1.It shows an adequate distribution taking values between the application domains de ned by the training dataset.
A total of seventeen models were selected by applying the rst condition, which implies that the models must contain a maximum of nine descriptors in order to obtain a ratio number of molecules/descriptors >5.In this sense, 5 models were obtained by applying the regression technique IBK, four with MLR, six with RF, and two with SMOreg.e adjusted and cross-validation correlation coe cient only for the training dataset is available in Supplementary Materials (Table S5).To select the most robust models, some criteria were applied as follows: for R 2 ADJ > 0.78, for 10-fold Q 2 CV > 0.51, and for the test set, a Q 2 EXT > 0.64 .In this sense, in Table 1, the most robust models comply with all the conditions are shown with the corresponding correlation coe cients and MAE for adjusting, cross validation, and test set.Four of these models were obtained by using the technique MLR, and the remaining one was found by using random forest.
For the case of models obtained by MLR techniques, the equations to calculate pIC 50 values are presented as follows: pIC 50−LR1 8.26 + 0.0686AMh + 0.332ACIch − 4.82TSch e independent variables included in the robust models described above were named only by the invariants and the physical-chemistry (PC) properties and abbreviated in capital letters and lowercase letters, respectively.In case of two or more descriptors having the same invariant and PC property but di er at least in one characteristic, the name includes a capital letter in the middle.e description and abbreviations of the independent variables invariants used in equations ( 3)-( 6) are presented in for each feature is available in the supplementary materials (Table S6).
e PC parameters found in the models described above are shown in Table 3, and they have become popular in the description of many biological activities; for example, van der Waals volume (v) performs a key role in the interaction/ orientation for biological activity of an organic compound.While "s" and "h" are linked with donor-acceptor properties of a molecule, "c" has a signi cant e ect on the enzymesubstrate electrostatic interaction.Additionally, "psa" provides information about the capability of bond formation of a particular compound and has exhibited as an essential property on QSAR studies [41].Furthermore, "r" is the refractivity calculated by Lorentz-Lorenz Formula, related not only to the London dispersive forces but also to the volume of the molecules [43].Finally, polarizability (p) delineates the interactions between molecules, nonpolar atoms, and polar and ions molecules with dipole moments [44].
In order to validate LR models, all statistical parameters were calculated and reported in Table 4.In all the cases, the values of Q 2 boot are greater than 0.59; SDEP and SDEC values and the di erence between Q 2 loo and Q 2 boot are all near to zero.e K values smaller than 50 suggest a noncollinearity between the selected attributes, and the ΔK values are positive except for those of the LR2 model, which was discarded for further analysis.
By taking into account the statistical parameters described in Table 4, the most robust model is LR4, which is composed by a total of nine attributes, with a Q 2 boot value of 0.803.It is important to highlight that all the physicalchemical parameters described above are present in the features of this model.In Figure 2 shows the graphical correlation between experimental and predicted pIC 50 values for the training and test dataset.is result suggests a good tting and predictability for this model.For example, for the three most potent inhibitor compounds, 13, 24, and 38, the predicted pIC 50 values were 6.64, 6.78, and 6.33, correspondingly, while the experimental values were 6.46, 6.78, and 6.23. is result suggests an excellent prediction of the absolute values as well as the order of pIC 50 (24 > 13 > 38), with a small error in the prediction of the Pin1 biological activity.
It is important to note that individual models normally present highly sensitive to a small perturbation in the training set.en, to tackle this problem, the construction of an ensemble modeling, which has become popular in recent years, aggregates results from di erent individual models [45].IBK and RF machine learning techniques were used to construct the ensemble model where predicted values from individual models were taken as independent variables and experimental pIC 50 as the dependent variable.Because R 2 ADJ , Q 2  CV , and Q 2 EXT are all over 0.89 for obtained ensembles (Table 5), the criterion of selection for the ensemble was the smaller number of variables.

6
Journal of Chemistry Consequently, the ensemble model IBK was chosen as the most robust one, and its independent variables, which are individual models, are LR3 and RF1.Similar to the results found with the most robust individual model (LR4), the predicted pIC 50 by the ensemble for the molecules 13, 24, and 38 were 6.462, 6.739, and 6.313 correspondingly, which are also in agreement with the experimental values and the order of the inhibitory activity.e graphical plot for the experimental and predicted values obtained by the ensemble is depicted in Figure 3, where an excellent fitting was obtained.
OSIRIS DataWarrior descriptors: clogP, clogS, druglikeness, and polar surface area values are shown in Figure 4 as histograms, and they were used in order to corroborate if the compounds used in this study can be considered as drugs.clogP and clogS values calculated are in the range for being declared as drugs from −9.9 to 3.6 and between −5.5 and 0, respectively.With respect to druglikeness values, most of the molecules are in the range from −17 to 2, which also support that these compounds can be considered as drugs.However, molecules 9 and 10 cannot be considered as drug trades because of present values <−17.Lastly, psa values showed a uniform distribution from 54 to 194.4,where more than 92% of molecules have values greater than 75, and consequently can be considered nontoxic drugs.e larger psa, as well as the lower clogP, is in agreement with the high polarity of these molecules due to the presence of heteroatoms in their structures.e most active compound (38) has values of clogP, clogS, and psa of −4.54, −2.74, and 100.7 Å2, respectively; consequently, this compound complies with all the necessary requirements to be considered as a drug.

Docking Simulation.
e docking simulation represents a powerful tool in the drug design; thus, in the present study, all structures were docked into the binding site described for the 3jyj protein, which is reported to be related with the most common action mechanism for the Pin1 inhibitor biological activity.In Figure 5 is presented, as a histogram, the distribution of values for the binding free energy (BE), which suggest a strong-affinity protein drugs with negative values from the range −5.2 to −8.2 kcal/mol.Also, a good distribution for the test set into the training set was found, which suggests a good representation of the data by the selected training and test.
In order to gain more insight into the binding affinity on these series of compounds, three compounds from the total (13, 24, and 38) were selected and are presented in Figure 6.ese molecules were selected because they are representative of the three subsets described in Tables S1-S3 in the supplementary material and are the most active compounds in each subset.e compounds 13, 24, and 38 have values of pIC 50 of 6.5, 6.8, and 6.2, respectively; compound 38 is the most active compound in the total dataset.In contrast, values on the binding free energy of −7.5, −7.9, and −6.2 kcal/mol were found for the docked 13, 24, and 38, respectively, which suggest a good affinity interaction between the receptor and the organic compound.ey have in common a bicyclic compound in their structures, where compound thirteen is a quinazoline derivative (two six-membered merged rings), twenty-four is a benzophenone derivative with a bicyclic as substituent (a five and six merged rings), and thirty-eight is a pyrimidine derivative with a bicyclic compound (five and sixmembered rings).With respect to the ligand interaction diagram of these three compounds with the different present amino acids (right on Figure 6), the interaction with the residues lysine (LYS), arginine (ARG), serine (SER), leucine (LEU), aspartame (ASP), methionine (MET), glutamine (GLN), histidine (HIS), glycine (GLY), and phenylalanine (PHE) can be observed.e interaction of the three evaluated compounds with the mentioned amino acids are almost the same in each one and for the smaller one, which is the most active compound and presents a psa value of 100.7 Å2, also a good wrapper of these amino acids in the ligand is observed.All the amino acids mentioned are polar in nature, and as expected, they can have a strong interaction with the drugs considered in

Journal of Chemistry
this study due to their polar nature because in the structures can be found some heteroatoms, which increase reasonably their polarity.

Conclusions
A molecular modeling simulation for the Pin1 inhibition by an organic compound containing an aromatic ring in their structure was evaluated by using QSAR approach.A total of 51 compounds, divided randomly as training (78%) and test set (22%), were used in the calculations and topological descriptor was employed for the models construction.Models were obtained by di erent regression techniques such as MLR, SMOreg, IBF, and RF.Five individual models were selected based on the statistical parameters, and the most robust one was constructed by using MLR approach with a total of 9 descriptors, which are weighted by the physical-chemical properties, which a ect signi cantly the biological activity, such as aLogP, charge, electronegativity, hardness, mass, polarizability, topological polar surface area, refractivity, softness, and van der Waals volume.ese properties are closely related to the biological activity of organic compounds.Regression coe cients of 0.910, 0.819, and 0.841 were obtained for adjusting, 10-fold cross validation, and test set, respectively, while the MAE values are less than 0.156.In order to improve the predictability of these models, an ensemble was constructed by using the ve obtained employed IBK and RF techniques.A signi cant improvement was obtained in the predictability by using a multiclassi er constructed with IBk involving only two individual models (LR3 and RF1), with values of R 2 ADJ 0.982, Q 2 CV 0.962, and Q 2 EXT 0.918.Consequently, this ensemble can be used for the prediction of the Pin1 inhibition activity of analogs compounds to the series used in this study.With respect to the druglikeness, clogP, clogS, and psa values, it is possible to conclude that the majority of   Data Availability e data used to support the ndings of this study are available from the corresponding author upon request.computing (HPC) system available in the USFQ for the development of this project.

Figure 2 :
Figure 2: Graphical plot between experimental and calculated pIC 50 by LR4.

Figure 5 :
Figure 5: Distribution of binding free energy values of the whole compounds.

Table 2
. e whole name

Table 1 :
e ve most robust models based on regression coe cients and MAE values for adjusting, cross validation, and test set estimated by using Weka 3.8.

Table 2 :
Invariants used on the attributes calculations.

Table 4 :
Statistics parameters used for robustness evaluation of the MLR selected models by using MATLAB.

Table 5 :
e ensemble models with the corresponding regression coefficients and MAE values.