Combined Ligand/Structure-Based Virtual Screening and Molecular Dynamics Simulations of Steroidal Androgen Receptor Antagonists

The antiandrogens, such as bicalutamide, targeting the androgen receptor (AR), are the main endocrine therapies for prostate cancer (PCa). But as drug resistance to antiandrogens emerges in advanced PCa, there presents a high medical need for exploitation of novel AR antagonists. In this work, the relationships between the molecular structures and antiandrogenic activities of a series of 7α-substituted dihydrotestosterone derivatives were investigated. The proposed MLR model obtained high predictive ability. The thoroughly validated QSAR model was used to virtually screen new dihydrotestosterones derivatives taken from PubChem, resulting in the finding of novel compounds CID_70128824, CID_70127147, and CID_70126881, whose in silico bioactivities are much higher than the published best one, even higher than bicalutamide. In addition, molecular docking, molecular dynamics (MD) simulations, and MM/GBSA have been employed to analyze and compare the binding modes between the novel compounds and AR. Through the analysis of the binding free energy and residue energy decomposition, we concluded that the newly discovered chemicals can in silico bind to AR with similar position and mechanism to the reported active compound and the van der Waals interaction is the main driving force during the binding process.


Introduction
According to the latest World Cancer Report 2014 [1], prostate cancer (PCa) has become the second most common cancer among men in the world. The morbidity rate of PCa has reached 15%, which is merely 1.7% lower than the leading lung cancer. It is reported that about 1100,000 people were diagnosed as new PCa patients in 2012 [2]. Additionally researchers pointed out that prostate cancer is not the privilege of men; women have similar prostate tissue, which also has the risk of cancer [3].
The androgen receptor (AR), a ligand inducible transcription factor in the nuclear hormone receptor super family [4], plays a critical role in the development and progress of PCa. Natural hormone testosterone (T) and dihydrotestosterone (DHT), known as androgens, are the endogenous ligands of AR. When bound to AR, androgens play significant roles in the sexual development, function, and musculoskeletal growth of male. The main mechanism of androgen action is to regulate the gene expression by means of binding to AR, changing the level of specific proteins in cells, and controlling cell behavior [5]. Therefore, a rational approach to cure PCa is the use of antiandrogens to prevent the interaction of T or DHT with AR.
At present, androgen receptor antagonists, such as bicalutamide and flutamide, are used as main hormone therapies for prostate cancer [6]. Although these antiandrogens exhibit good efficacy in many cases and comprise an important part of effective therapeutics [7][8][9][10], the emergence of recurrent and metastatic forms of castration-resistant PCa (CRPC) becomes a major challenge, with a median survival of only 1∼2 years [11]. A possible reason is that these antiandrogens have partial agonistic activities at high concentration in vitro [12]. Therefore, the discovery of new AR antagonists with high antiandrogen activities is highly expected.
Here, in this study, to aid the research and development of steroidal antiandrogens, we investigated the relationships between a series of 7 -substituted dihydrotestosterone derivatives and corresponding antiandrogen activities. The vital features related to the bioactivities were explored, and a linear quantitative structure-activity relationship (QSAR) model was established according to OECD principles [13], using the QSARINS program [14,15]. Then the QSAR model, thoroughly and strictly validated by various internal and external validation techniques, is used to virtually screen new dihydrotestosterones, without experimental bioactivities, downloaded from PubChem database [16]. Besides, molecular docking and molecular dynamics (MD) simulations are used to study the possible binding mode of compounds owning high in silico activities with androgen receptor. At last, the most active compounds with good binding affinities to AR, as highlighted by the Insubria graph [17], are proposed for experimental research group to test the antiandrogen activities in the future.

Materials and Methods
2.1. Data Set. The success of any QSAR model depends on accurate and clean training data, proper representative descriptor selection methods, suitable statistical methods, and, most critically, both internal and external validation of resulting methods [18,19]. Here, in this work, a set of 36 7substituted dihydrotestosterones derivatives were taken from literatures [20,21]. The skeleton structure of these derivatives is shown in Figure 1, in which R group represents amine, carboxylic acids, and halogens, and so forth.
These molecules were divided into a training set and a prediction set according to the structure diversity in QSARINS. Finally, 29 compounds were included in the training set and 7 compounds were in the prediction set (prediction set a). The experimental values, half maximal inhibitory concentration (IC 50 ) expressed in nM, were converted into negative logarithmic units marked as pIC 50 , which was used as dependent variables in the QSAR analyses. The studied molecular structures and corresponding antiandrogen activities were listed in Table 1.

Descriptors Calculation.
To describe a molecule, the molecular structures were firstly sketched in HyperChem program [22] and then were optimized to the minimum energy conformation by using AM1 method. After minimization, we submit the structures to DRAGON 5.5 software [23] to calculate 2914 descriptors including zero-, one-, two-, and three-dimensional (0D, 1D, 2D, and 3D), charge descriptors, and molecular properties. The related theories of the molecular descriptors are provided by DRAGON software, and the calculation procedure is clarified in detail, in the Handbook of Molecular Descriptors [24].
In order to facilitate the successive feature selection process, the constant and near constant descriptors were removed. Besides, if pairwise correlation of two descriptors is larger than 0.98, the one showing the highest pairwise correlation with others will be excluded. Finally, 358 descriptors remained for the next variable selection process.

QSAR Model Generation.
After descriptor calculation, genetic algorithm (GA) implemented in QSARINS software was used to select descriptors. The final model was built by using MLR method based on the selected descriptors, named GA-MLR. The first step of GA is to produce a set of solutions randomly which is called initial population. Each solution, a model based on the contained descriptors by using multiple linear regressions method, is called a chromosome. Subsequently, the fitness function, Friedman LOF, is used to evaluate the fitness of these individuals: (1) Here, SSE represents the sum of squares of errors, is the number of basis function, is the smoothness factor (default 0.5), is the number of features in the model, and is the number of samples for model construction. In the successful selection stage, the fittest individuals evolve to the next generation. Then crossover and mutation operators were performed to generate new individuals. Finally a new population is formed consisting of the fittest chromosomes. The above evolution continues until the stop conditions are satisfied. The related parameters that control the GA performance are list as follows: population size (200), maximum generations (10000), and mutation probability (0.05).

Model Validation.
QSARINS is based on GA-MLR method and performed various tools to a rigorous internal and external validation, based on the different validation criteria, as well as for the check of model applicability domain. The robustness and stability of the built model were validated by several statistical parameters, such as determination coefficient ( 2 ), leave-one-out (LOO) cross-validation 2 LOO , and root mean squared error (RMSE). Besides, leave-manyout (LMO) cross-validation method was also performed and 2 LMO was reported. Randomization technology, by reordering the independent variable, was used to exclude the possibility of the chance correlation. Generally, the correlation coefficient of the built QSAR model should exceed Y randomized generated model. After Y scrambling was BioMed Research International 3   carried out with iterations of 5,000, the average value of squared correlation coefficient of the randomized models 2 and 2 LOO was reported. A good QSAR model should also have satisfactory predictive ability. The best way to evaluate the predictive ability of a model is its validation by new compounds, called prediction set, which do not participate in the process of model building. After the activities of the prediction set samples were predicted, the agreement between the experimental and predicted values was calculated as a measure of a QSAR model quality. Here we adopt several ways to calculate this agreement, 2 1 , 2 2 [25], 2 3 [26], and CCC [27][28][29]. All the above external validation parameters were calculated in the software QSARINS and were combined to evaluate the predictive ability of the proposed model.

Applicability Domain.
To validate the practical applicability of a model to a new chemical, the applicability domain (AD), a theoretical domain which is defined by means of the selected descriptors in the process of modeling, should be defined properly. In this research, the AD was quantitatively assessed by the leverage approach [30,31]. The leverage (hat, ℎ) was calculated by ℎ = ( ) ( = 1, . . . , ), where was the descriptor row-vector of the query compound and was the * matrix of the training set ( is the number of model descriptors). The limit of model domain was quantitatively defined by the leverage cutoff (ℎ * ), set as 3( + 1)/ . A leverage greater than ℎ * means that the query was outside of the model structural AD, so the predictions were extrapolations of the model and could be less reliable. The AD for chemicals with experimental data was verified by the Williams plot, where the hat values versus the standardized residuals were plotted, while the AD for chemicals without experimental data, which were analyzed in the virtual screening, was verified by the Insubria graph where the hat values were plotted versus the predicted responses [14,18].

Virtual Screening of Potent Steroidal Antiandrogens.
To explore more 7 -substituted dihydrotestosterones and to find similar derivatives with high antiandrogen activities, the studied skeleton structure was used as a query to search PubChem database for new dihydrotestosterones, without experimental bioactivities. Then the established MLR model, after thoroughly being validated internally and externally, was used to predict the antiandrogenic activities of these new dihydrotestosterones, verifying the AD.
Besides, molecular docking was employed to investigate the possible interaction mechanisms of the samples owning high in silico antiandrogenic activities with AR. Particularly, comprehensively considering the docking speed and accuracy, LigandFit, which is commonly used as a flexible docking method executed in the commercial software Discovery Studio 2.5 [32], was applied into the progress of structurebased virtual screening. The protein structure of AR was firstly downloaded from RCSB Protein Data Bank [33] (PDB entry code: 1T65) and imported in docking process. All ligands and water molecules were removed at first, the charge and polar hydrogen atoms were added, and the incomplete residues were corrected.

Molecular Dynamics (MD)
Simulations. The molecular dynamics (MD) simulations were carried out using the Amber 14 software package [34]. MD is a commonly used methodology in exploring the interaction between ligand and protein. We have investigated the interaction mechanisms of R-bicalutamide/S-1 with WT/W741L AR using molecular dynamics simulations [35]. The docked structures of AR (PDB ID: 1T65) with the reported most active compound number 4 and novel chemicals with high in silico activities were used as the initial structures for MD simulations. During the process of docking, taking into consideration the fact that these residues collide significantly with the compounds, Helix 12 of AR was removed in the model as executed in the literature [36].
All missing hydrogen atoms of the AR were added by the LEaP module of the Amber 14 package. To maintain the electroneutrality of all the studied complexes, the appropriate number of chloride counterions was added. Then each complex was immersed into a cubic periodic box of TIP3P water model [37] with at least 10Å distance around the complex.
For the ligand, the GAFF parameter assignments [38] were made by using Antechamber program and the partial charges were assigned by using the AM1-BCC method [39].
Amber 14 package and the Amberff03 force field were used for all molecular dynamics simulations. Sander program was carried out for the energy minimization and equilibration protocol. First, energy minimization of four complexes was done through three stages, using the steepest descent method switched to a conjugate gradient every 2500 steps for a total of 5000 steps with a nonbonded cutoff of 10Å. In the first stage, to enable the added TIP3P water molecules to adjust to their proper orientations, the AR and ligand were restrained with 5.0 kcal mol −1Å −2 . In the second stage, to enable the AR to find a better way of accommodating ligand, the protein backbone was restrained with 3.0 kcal mol −1Å −2 . In the third stage, the entire solvated system was minimized without any restraint. Additionally, gradual heating, density, and equilibration protocols were performed. During the 100 ps heating procedure, the system was gradually heated from 0 to 310 K, and then the density was at 310 K for 400 ps, and at last the equilibration was at 310 K for 400 ps.
Afterwards, four 20 ns production MD simulations were carried out with the PMEMD program without any restraints in the isothermal isobaric ensemble (NPT, = 1atm, = 310 K) MD. The time step was set at 2 fs. 10Å cutoff was applied to treat nonbonding interactions. During the simulations periodic boundary conditions were employed and all electrostatic interactions were calculated using the particle mesh Ewald (PME) method. The SHAKE algorithm was used to restrain all bond lengths containing hydrogen atoms. All of the coordinate trajectories were recorded every 2 ps throughout all MD runs. To analyze the energy and structure, a total of 500 snapshots of the simulated structures stripped in the last 2 ns stable MD production trajectory at 4 ps intervals were extracted.

Binding Free Energy Calculations.
For each proteinligand complex, the binding free energy was analyzed by the MM/GBSA method [40]. To compare the AR binding free energies with different ligands, MM/GBSA calculation was applied to the 500 snapshots extracted from the final 2 ns of the MD trajectories. The total free energy of binding free energy was composed of the following molecular species (complex): where complex , protein , and ligand are the free energy of complex, receptor, and ligand, respectively. The free energy for each species (complex, ligand, or receptor) can be decomposed into a gas phase energy (Δ MM ), a solvation-free energy (Δ sol ), and an entropy term ( Δ ).
where the Δ MM is the sum of the internal energy of bonds, angle, and torsion (Δ val ), electrostatic interaction energy (Δ ele ), and van der Waals interaction energy (Δ vdw ). Δ sol is solvation-free energy and can be divided into two parts, the polar solvation-free energy (Δ p ) and the nonpolar solvation-free energy (Δ np ). The polar solvationfree energy Δ p is determined by generalized Born (GB) equation. The values of the dielectric constant for solute and solvent were set as 1 and 80. Δ np is the nonpolar solvation contribution and was calculated with constants 0.0072 kcal mol −1Å −2 for surface tension proportionality constant and 0.92 kcal mol −1Å −2 for the nonpolar free energy for a point solute . SASA is the solvent accessible surface area and is determined by recursively approximating a sphere around each atom, starting from icosahedra (ICOSA method). Δ is the entropy term, including the translational, rotational, and vibrational terms of the solute molecules.
2.9. Energy Decomposition. Furthermore, to obtain the contribution of each residue to the binding process, we performed binding free energy decomposition. The MM/GBSA approach was used to calculate the per-residue free energy decomposition, which is based on the same 500 snapshots we have extracted from the last 2 ns of the stable MD trajectory.

Normal Mode Calculation.
Entropy was analyzed by normal mode with AMBER14 NMODE module. Due to the high computational cost in the entropy calculation, 50 snapshots were extracted from the last 2 ns trajectory of the simulation with 40 ps time intervals.   was higher. Besides, the value of CCC was as high as 0.891, surpassing the threshold value of 0.85 as suggested in literature [29] for predictive model. Additionally, the RMSE values for the training set and prediction set were similarly very low. All these parameters indicated the higher external prediction ability of the final model. The interrelation coefficients of the selected descriptors were presented in the Table 3. It could  be seen that the highest intercorrelation coefficient was −0.61 between IC5 and HATS3e, which indicated that the used variables were independent. All results proved that the selected model was reliable, stable, and predictive.

Results and Discussion
Y randomization technique was carried out with iterations of 5000 in QSARINS. Figure 2 showed the plot of 2 and 2 values versus , automatically obtained in QSARINS. From Figure 2, we could find that the 2 and 2 values of the final model were much higher than the models from scrambled Y-column, because the relationship between molecular structure and response was broken. This result indicated that the relationships between structures of 7substituted dihydrotestosterones and corresponding pIC 50 values did exist in the proposed model, and it was really not obtained by chance. The predicted pIC 50 values by MLR model were listed in Table 1. Figure 3 was the scatter plot of the experimental versus the predicted pIC 50 values. It was obvious that, in Figure 3, all predicted pIC 50 values were close to the line = , which indicated that the linear model can accurately predict the antiandrogenic values of these derivatives.
The model applicability domain was evaluated by means of leverage analysis, namely, Williams plot, shown in Figure 4, in which the standardized residuals ( ) and leverage values (ℎ) were plotted. In Figure 4, we could see that all compounds were inside the model structural applicability domain (ℎ * = 0.517) and reasonably well predicted with standard residue smaller than 2.5 . After the MLR model was built, we luckily found two new 7 -substituted dihydrotestosterones, showed in Table 1 (marked as "b," prediction set b), with experimental antiandrogen activities from another published literature [41]. These two compounds were additionally used to validate our model. Both of them (numbered in the Williams plot of   information content as follows: IC5 = − ∑ −1 ( /nAT) ⋅ log 2 ( /nAT), where runs over the equivalence classes, was the cardinality of the th equivalence class, and nAT was the total number of atoms. This index represented a measure of structural complexity per vertex. GATS5e belonged to 2D autocorrelations and was Geary autocorrelation, lag 5/weighted by atomic Sanderson electronegativities. This descriptor was favorable to the antiandrogen activities of these steroidal derivatives. DISPp, geometrical descriptors, indicated the displacement between the geometric center and the center of the polarizability, calculated with respect to the molecular principal axes. HATS3u is a GETAWAY descriptor [42], representing the leverage-weighted autocorrelation of lag 5/weighted by atomic polarizabilities. With the increase of these two descriptors, the bioactivities of the studied compounds decreased.

Virtual Screening.
From PubChem database, we found 110 new 7 -substituted dihydrotestosterone derivatives, without experimental data. By exploring the leverage ℎ values, 77.27% of them were located in the structural applicability domain of the proposed MLR model. Figure 5 is the Insubria graph of these dihydrotestosterones, the plot of leverage values versus predicted pIC 50 , which was proposed especially for exploring the unknown samples. In Figure 5, most chemicals are in the range of the hat cutoff (ℎ * = 0.517). Inside the model AD, the most active compound is CID 70128824, which has in silico PIC 50 of 7.37, higher than the reported most active compound number 4 (pIC 50 = 6.77). Outside the model AD, we luckily obtained several samples with higher in silico activities, especially CID 70126881 and CID 70127147, which showed excellent in silico antiandrogen activities as high as 7.90 and 7.76, respectively, even higher than bicalutamide and hydroxyflutamide. The ID of these   compounds and corresponding predicted pIC 50 values are listed in Table 4. Though only these three compounds were highlighted here, other samples with high in silico activities, especially those located in the model structural AD, were also worthy of our attention.
To further explore the possible binding mode of the screened compounds, molecular docking was employed to study the interaction between compounds owning high in silico activities (especially CID 70128824, CID 70126881, and CID 70127147), together with the reported most active compound 4 as a comparison, and androgen receptor (PDB ID: 1T65) by using LigandFit module in Discovery Studio 2.5. Firstly, DHT was extracted from crystal structure and redocked into ligand binding pocket to obtain the optimal docking parameters. Secondly, the ligand binding site was defined with the same parameters as DHT. At this point, the radius of SBD Site Sphere was set to 10Å. The other parameters were set by default.
To obtain reasonable conformations of different complex, the top-ranked compounds with lowest RMSD values were extracted. The binding mode of compound 4, CID 70128824, CID 70126881, and CID 70127147 with AR were presented in Figure 6. From this Figure, it could be seen that the docked pose of CID 70128824, CID 70126881, and CID 70127147 located in the same position with similar orientation in the AR ligand binding site to compound 4. All these results indicated that though two of them were outside of the model AD, these three compounds CID 70128824, CID 70126881, and CID 70127147 might have good performance to antagonize androgen receptor and have the potency for further research and development for PCa therapy.

System Equilibration.
In order to verify whether the studied systems reach equilibrium, the root mean square deviations (RMSDs) of all the backbone atoms of the protein, the C atoms for the residues of the active site (residues within 5Å around ligand), and the heavy atoms of ligand from the initial structure were monitored to examine the dynamic stability of the systems and plotted against time, as shown in  implying that the studied systems have reached stability. We used the last 2 ns to analyze the energy and binding modes for the four complexes.

Validation of the MD Simulations.
We calculated the binding free energy by MM/GBSA method between the four ligands and AR to validate the reliability of the MD simulation. Table 5 lists the binding free energy and all of the energy terms for the four compounds. From Table 5, the four complexes had different binding free energy; the ranking order is CID 70126881 (−41.62 kcal mol −1 ), CID 70127147 (−33.06 kcal mol −1 ), CID 70128824 (−31.86 kcal mol −1 ), and compound 4 (−22.69 kcal mol −1 ). Different binding free energy means different binding affinity between the four complexes. We have proved that the antiandrogen activity of compounds CID 70126881, CID 70127147, and CID 70128824 are higher than the published best one (compound 4) by MLR model. In addition, the ranking of the calculated binding free energy was consistent with their in silico bioactivities order.

Analysis of the Interaction Mechanism.
According to the calculated binding free energy, CID 70126881 holds the strongest binding affinity; on the contrary, compound 4 has the lowest binding affinity. As can be seen from Table 5, the nonpolar interactions (Δ nonpolar ) including van der Waals ( vdw ) and nonpolar solvation (Δ np ) terms are the driving force for the binding of the four ligands to AR, and the total polar contributions (Δ p ) are unfavorable for their binding. In addition, CID 70126881, CID 70127147, and   CID 70128824 have almost the same van der Waals interactions toward AR (−69.32 kcal mol −1 , −71.34 kcal mol −1 , and −70.56 kcal mol −1 for CID 70126881-AR, CID 70127147-AR, and CID 70128824-AR), while compound 4 has a low van der Waals value, which may partly explain the reduced binding affinity of compound 4 and prove that the newly discovered chemicals could possess higher antiandrogen activities.
To obtain the detailed interaction between four ligands and AR, the decomposition of binding free energy, which is calculated by MM/GBSA method, was executed to identify key residues during the binding process. The result of energy decomposition contains van der Waals, electrostatic, solvation-free energy, and total energy contribution terms, respectively, for four systems, shown in Figure 8. All the residues with great energy contributions were almost more than 1.5 kcal mol −1 . As shown in Figure 8(a), residues L701, L704, M780, L873, T877, L880, and L881 of AR make a significant contribution to the CID 70126881-AR binding, as well as those for the CID 70127147-AR which are L704, N705, L712, W741, M745, and F876 (Figure 8(b)). In Figure 8(c), residues L704, N705, G708, L712, F764, and F891 of AR make a substantial contribution to the CID 70128824-AR binding. However, only two key residues (L704 and N705) were the major energy contributions to compound 4-AR binding as shown in Figure 8(d). As mentioned previously, the vast majority of key residues of AR were nonpolar; it was reasonable to speculate that these residues can form greater van der Waals interactions with hydrophobic ligand and exhibit more favorable nonpolar interaction contribution to the binding free energy.
The MD simulation, together with the docking results, confirmed that the newly discovered chemicals CID 70126881, CID 70127147, and CID 70128824 share similar binding mode with the reported compound 4, and the in silico antiandrogen activities of them are higher through the calculated binding free energy and decomposition of binding free energy.

Conclusions
In this study, the relationships between a series of 7substituted dihydrotestosterone derivatives and corresponding antiandrogen activities were explored. A reliable, stable, and robust linear MLR model with four descriptors was built and validated in QSARINS. The predictive ability of the final model, fully evaluated by using two different prediction sets, is excellent enough to be used to virtually screen novel 7 -substituted dihydrotestosterones from PubChem database. After antiandrogenic activity prediction, molecular docking, and molecular dynamic simulations, CID 70126881, CID 70127147, and CID 70128824, as the most potent chemicals with good binding affinities to androgen receptor, were proposed. Of course, bioassay experimental researches are needed to evaluate the virtual screening results. This study provides the theoretical basis and specific chemicals for AR antagonists, which can help the experimental research groups to search for potential antiandrogens.