LASSO-ing Potential Nuclear Receptor Agonists and Antagonists: A New Computational Method For Database Screening

1 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, U.S.A. 2 National Exposure Research Laboratory, US-Environmental Protection Agency, 109 TW Alexander Dr., Research Triangle Park, NC 27711, U.S.A. 3 To whom correspondence should be addressed. (e-mail: goldsmith.rocky@epa.gov) SimBioSys, Inc. 135 Queen’s Plate Drive, Suite 520, Toronto, Ontario M9W 6V1, Canada. Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC 27587, U.S.A.


Introduction
The nuclear receptor (NRs) family of transcription factors are important targets for therapeutic interventions for multiple diseases [1] and also may interact with other xenobiotics that are endocrine disruptors present in the environment [2]. It is therefore important to identify compounds that may specifically bind NRs and act as endocrine disruptors as well as develop synthetic compounds that can selectively (in a cell-type and/or tissue-selective manner) modulate NR pharmacology (reviewed in [3][4][5][6][7][8][9]). NRs including the androgen receptor (AR; NR3C4), estrogen receptor α and β (ERα and ERβ; NR3A1 and NR3A2) and pregnane X receptor (PXR; NR1I2) are particularly important as both therapeutic targets and for xenobiotics to mediate off-target effects.
Multiple QSAR and machine learning models have been described for these NRs, to address endocrine disruptor risk assessment [26-28] and toxicological screening [29] .
For example, a recent QSAR analysis of 74 natural or synthetic estrogens provided information on structural features for the activation of ER and ER [30]. Non-linear statistical machine learning methods have been applied to separate NR activators from non-activators [31]. A virtual screening protocol identified ER specific ligands from a plant product-based database [32], from 12 candidates evaluated by a fluorescence polarization binding assay, 3 had >100-fold selectivity to ER over ER. The same approach has also been used to find compounds with good selectivity for ER over ER [32,33]. Bisson et al., have used computational methods that led to a non-steroidal antiandrogen with improved AR antagonistic activity based on an initial screening of FDA approved new drugs [33]. Several [46][47][48][49]. PXR pharmacophores, have been used to predict interactions for antibiotics [50] verified in vitro, and machine learning methods have also been evaluated [25,38,[51][52][53].
Several protein-based docking studies have also been used to predict PXR agonists [25, [54][55][56], although machine learning methods appear to be advantageous to date.
We have recently described troubleshooting various computational methods [57] and specifically compared different methods for PXR [25]. There is a continuous search for new methods that might offer advantages for computational modeling to overcome some of these limitations and specifically for NRs [58]. A new ligand based software called LASSO (Ligand Activity in Surface Similarity Order) has recently been described that is focused on similarities in biomolecular activity rather than structural similarity [59]. The key components describing LASSO are the 23 kinds of Interacting Surface Point Type (ISPT) molecular descriptors (see Supplemental Table 1), which capture the 5 essence of the surface point information in a feature vector containing the counts of each surface point type, and creates the feature vector for that ligand. This vector serves as the descriptor of that molecule with the assumption that ligands with similar feature vectors will have similar activity. As only the 3D orientation of the Interacting Surface Point depends on the conformation, LASSO is conformation independent. LASSO has been shown to be able to readily screen over 1 million structures/minute, identify active molecules by enriching screened databases and provides a means for scaffold hopping [59]. The current study applies LASSO to various NR datasets to generate models, validate them and make the models available on a public website to illustrate how the method can be used. This work can be considered an extension of our previous troubleshooting studies [25,57]

I) Training and Test Set Molecular Dataset Selection
One of the goals of this study was to determine what level of enrichment for binders (at weak or strong binding threshold) can be afforded using the ChemSpider LASSO descriptors (ligand-based approach) and compared with enrichment from a structure-based docking approach (eHiTS). For AR, the dataset consisting of 203 molecules with relative binding affinities and activity threshold classes of (a) strong (b) moderate (c) weak and (d) inactive/non-binding ligands, we evaluated the ability of LASSO to differentiate both (a) strong and (a+b) strong and moderate binders (all others considered to be non-binding). The training set for AR, derived with the LASSO descriptors were obtained from the DUD set [60] and differ considerably from the test 6 set. To evaluate the LASSO-descriptors for the ER dataset consisting of 50 molecules with 15 "hits" (i.e. considerably weak binders) and 35 "non-hits" for the estrogen binding that differ considerably from the training set obtained again we used the DUD ER specific (default or agonist and antagonist) as a training set [60].
In addition to the ChemSpider-LASSO approach for the AR test set we used the eHiTS structure-based (molecular docking) screening strategy on two conformations of the AR (using PDB structures 2AMA and 1XNN) and reported the minimum score across the two-conformations examined (this approach was used to add flexibility to the receptor). Similarly, for the ER dataset we docked against two functionally distinct conformations of the Estrogen receptor (3ERT and 1GWR). Dataset 2 represented 93 actives and 75 inactives that were drug-like molecules from a previous study [61]. The molecular structures encoded as SMILES strings [62] were downloaded from the supplementary information tables in the original publication [61]. . Human PXR activation was determined by a luciferase-based reporter assay as has been previously described in these and other publications.

LASSO models for ER and AR
The methodology of LASSO has already been previously described in detail, [59] and the method performance in terms of diversity of test set and % enrichment of a database has also already been evaluated for the DUD set in paper just mentioned and also for other targets published elsewhere.
(http://www.simbiosys.com/ehits_lasso/ehits_lasso_table.html) To examine the performance of eHiTS LASSO, with this endocrine panel subset of target proteins of the total ~48 nuclear receptors.We used the newly assembled directory of useful decoys (DUD) [60] dataset to augment both the KIERBL and NCTR AR datasets.
The above 25 models were then tested for enrichment factor using the actives from the total active set and leaving out the ones used for training (this was 8, 16,32,64 or 128 ligands respectively) mixed with drug-like decoys, that were obtained from another recent screening study [63]. To assess the effect of the size of the decoy set upon the prediction model, random 3000 (3k), 8000 (8k) and the whole 24,000 (24k) decoy sets were used. In each case the decoys from all three sets received (228 structures in total) were added into the decoy test set.

Results
The enrichment plots shown for AR ( into a real scenario would be to screen ChemSpider for AR with a descriptor above a threshold (in this case 0.07) from a specific dataset on ChemSpider and follow up these "hits" only with a more costly structure-based approach.
For the ER dataset where all 15 binders are in fact weak binders (i.e. 3 to 5 orders of magnitude weaker binders than the natural ligand 17-estradiol) the default LASSO descriptors outperform (3 fold enrichment) the structure-based approach (2 fold enrichment) and the agonist-trained LASSO method outperforms the antagonist LASSO method (most likely due to a large diversity among antagonists than agonists). Here we can see that even for weakly interacting partners (i.e. low affinity binders for ER) we can still obtain enrichment that is substantially better than random.
These tandem virtual screening approaches combine computationally-efficient ligand-based ChemSpider LASSO descriptors 1 prior to more costly structure-based virtual screening strategies, dramatically improving virtual screening and "earlyrecognition problem" workflow efficiency.
Piggy-backing more costly structure-based virtual screening strategies on top of an initial screen with could dramatically assist in virtual screening endeavors and the early recognition problem. 1 Since ChemSpider is at its core a rich and diverse collection of chemical structures, these were used in order to produce LASSO predictions for over 14 million compounds against a series of 40 targets including AR and ER. A LASSO search feature was added to ChemSpider to allow users to search the database by LASSO value (see Figure 3A). Scientists can readily search for the top 1000 compounds (or less) with the highest LASSO value for a particular target of interest. An advanced search in ChemSpider can combine LASSO value searches with other parameters such as molecular weight, Rule of 5 values and specific data sources (for example, selecting molecules from commercial data sources only).
We have also shown an example of a molecule, mibolerone, a strong AR and ER binder based on LASSO ( Figure 3B) which is known as a potent AR binder [64]. The Table 3 and more visually in Supplemental Figure 2.

LASSO surface point type values are shown in Supplemental
When we used LASSO models with hPXR in method I (Figure 4) we found the best results with Model 1 which suggested 40% of the ligands can be pushed into the top 10% of the screened database resulting with an enrichment factor 4 fold better than random ( Figure 5). In Method II we found the same enrichment factor using 64 actives in a 24 thousand compound decoy set ( Figure 6). Another way to evaluate the models is to present the statistics for using dataset 1 to predict dataset 2 (N = 168) for which we obtained Sensitivity 12%, Specificity 99%, Accuracy 51%, Matthew's correlation 0.2.
Using dataset 2 to predict dataset 1 gives similar results. These results suggest that the models could identify potential human PXR agonists in databases similar to other target proteins [59].

Discussion
For both the AR and ER ligands the main objective was to see how ligand-based screening tools, such as LASSO's ChemSpider implementation, perform such that they could be used for prioritizing chemicals for testing. The AR dataset contained a mixture of drug-like and environmental receptor modulators, whereas the ER dataset contained primarily environmental chemicals. Even in light of the relatively weak binding affinity of the "actives", i.e. K i of 10 -4 -10 -6 , while these would be poor candidates for lead optimization into drugs, they still pose an interaction potential with biological systems such as NRs if they bioaccumulate. Using these leads from LASSO screening with other methods such as molecular docking or free-energy perturbation simulations may also be useful. The validation of the approaches outlined above was pursued by examining two real datasets. These were the FDA's NCTR AR [38] and a 50 chemical dataset containing environmental molecules evaluated for ER binding affinity [35]. In addition we have used multiple sets of PXR agonists described previously. Our results show enrichments of between 2 fold and 4 fold depending on the NR. For PXR there have been numerous recent studies using different machine learning methods and descriptors [25, 54,56], and while the Matthews correlation coefficient in this study is lower than previous studies, the level of enrichment from between 4 fold (40% of actives in top 10% of compounds screened) to 10 fold (10% actives in the top 1% of compounds screened) was very encouraging.
The molecular descriptors used in eHiTS LASSO are independent of ligand conformation and have been shown to successfully enrich screened databases across a wide range of target families [59]. Lying somewhere between a 2D and a 3D descriptor the ISPT descriptor does not contain any shape or 2D connectivity information. There may however be some molecular size information implicit in the descriptor due to capturing the counts of surface points and larger molecules will have more surface points than smaller molecules (and eHiTS LASSO may be somewhat sensitive to this).
The relatively high speed of eHiTS LASSO on a single CPU [59], makes it an ideal tool to be used as a pre-docking screen. From a troubleshooting perspective, eHiTS LASSO will return a high percentage of false positives, due to not considering 3D relationships of surface properties. Because of this, it will also return a higher percentage 13 of different scaffolds, enabling scaffold hopping. It is also important to note that LASSO would not be able to differentiate stereoisomerism apart from perhaps diastereomeric pairs which have structurally (configurationally) different features rather than conformationally different features, for which this method is conformation invariant.
Taking the results of eHiTS LASSO and feeding the top N% into a docking program would allow the docking program to weed out many of the false positives binders. For this reason, eHiTS LASSO is currently available integrated with the commercially available eHiTS docking tool, and can be readily used as a pre-docking screening tool for large virtual screens.
From the current study we have shown significant enrichments when testing computational models for AR, ER and PXR. While AR and ER predictions are currently already implemented in ChemSpider, it is clear that adding predicted values for PXR and other NRs as they become available would be beneficial to the community in terms of accessing an open source of chemical structures with pre-generated descriptors. It should be noted however that the generation of model data for a database as large as that hosted by ChemSpider (now well over 25 million compounds) is not a small undertaking and consumes a significant amount of compute time, data preparation and handling in order to deliver the models to the community for consumption.
The use of such ligand-based computational methods as exemplified by LASSO in this study could also be useful for the design and selection of chemical products that are less hazardous to human health and the environment. This may make them useful in green chemistry [65] (http://www.epa.gov/gcc/pubs/about_gc.html) as well as in biomedical research. The ready accessibility of such NR binding predictions from 14 computational models like LASSO will be key in future for both pharmaceutical and environmental applications and databases like ChemSpider can have an important role in providing them to the public as a pre-docking criteria, as we have demonstrated in this study. This study used published rat ER, AR as well as human PXR data. LASSO could also be applied to build models for the same NRs across multiple species, such that they could be used to estimate interspecies variation in ligand binding.   Mibolerone, a strong AR and ER binder found with a LASSO search on ChemSpider.

27
The LASSO score is significant only for 4 targets: MR, AR, ER and PR while the rest of the 40 targets are all 0.10 or below for this molecule.
29 Figure 4. PXR LASSO 7 models derived in training method I.