Dataset Paper DockScreen: A Database of In Silico Biomolecular Interactions to Support Computational Toxicology

We have developed DockScreen, a database of in silico biomolecular interactions designed to enable rational molecular toxicological insight within a computational toxicology framework. This database is composed of chemical/target (receptor and enzyme) binding scores calculated by molecular docking of more than 1000 chemicals into 150 protein targets and contains nearly 135 thousand unique ligand/target binding scores. The dataset was obtained using eHiTS (Simbiosys Inc.), a fragment-based molecular docking approach with an exhaustive search algorithm, on a heterogeneous distributed high-performance computing framework. The chemical landscape covered in DockScreen comprises selected environmental and therapeutic chemicals. The target landscape covered in DockScreen was selected based on the availability of high-quality crystal structures that covered the assay space of Phase I ToxCast in vitro assays. These in silico data provide continuous information that establishes a means for quantitatively comparing, on a structural biophysical basis, a chemical’s profile of biomolecular interactions. The combined minimum-score chemical/target matrix is provided.


Introduction
A major challenge with chemicals in consumer products, including but not limited to pharmaceutical and environmental chemicals, is the ability to fully discover, characterize, and anticipate adverse effects that may result from exposure to these chemicals. Classical safety assessment and animal studies are not only cost-prohibitive and lengthy [1,2] but also often fail to include the data required for the extrapolations that are inherent in human risk assessment [3]. Developing and evaluating predictive strategies to elucidate the mode of biological activity of environmental chemicals is a major undertaking of the US Environmental Protection Agency's Computational Toxicology program (http://www.epa.gov/comptox/). Aligning these strategies with the Agency's ongoing chemical-specific risk assessment needs provides additional incentive to develop new means of elucidating key determinants of toxicity in the chemical source-to-outcome continuum at a molecular level of accountability. This has provided the motivation for the development of tools such as the Aggregated Computational Toxicology Resource (http://www.epa.gov/actor) [1], the DSSTox Toxico-Chemoinformatics initiative (http://www.epa.gov/ncct/dsstox/) [4], and the ToxRefDB (http://www.epa.gov/ncct/toxrefdb/) [5] in vivo animal effects database.
In an attempt to fill the inherently large data gaps in modern risk assessment [6] and to develop both time- and cost-effective approaches for prioritizing the toxicity testing of large numbers of chemicals, the ToxCast program was initiated (http://www.epa.gov/ncct/toxcast/) [7]. In Phase I of ToxCast, >300 well-characterized chemicals (primarily pesticides) were profiled in over 400 HTS endpoints. However, even such large-scale in vitro screening may not be enough to understand the complex systemic effects seen in vivo. Virtual screening has been shown to greatly enhance the success rates of screening experiments [8,9], and virtual molecular profiling has been shown to be an effective tool for understanding the potential polypharmacology of a chemical, leading to probabilistic, data-driven drug discovery [10-12]. It is therefore appropriate to apply similar techniques when attempting to understand the polypharmacology that may lead to chemical hazard.
The unparalleled amount of data, low cost, high speed, and rich information content (i.e., high content data) afforded by in silico structure-based inquiry (e.g., molecular docking), together with the large number of public resources for both target crystal structures and chemical libraries, prompted us to develop a structure-based in silico database, DockScreen, to complement both the ToxCast program's screening/prioritization effort and computational toxicology in general. This database contains the biophysical evaluation of molecules within relevant structural constraints of the target proteins (receptors and enzymes) through multiple-chemical, multiple-target molecular docking experiments. We have created a web interface for accessing the data, including multiple binding poses and scores for each protein/ligand pairing, but this report is limited to the most generally useful data: a table containing the best (minimum) score obtained for the docking of each ligand to each crystal structure.

Methodology
Chemicals were collected and prepared as follows. The chemical landscape covered in DockScreen comprises a selected set of environmental chemicals from ToxCast Phase I (http://www.epa.gov/ncct/dsstox/sdftoxcst.html) [7] and therapeutic chemicals from the FDA MDD database (http://www.epa.gov/ncct/dsstox/sdffdamdd.html) [13], as drawn from DSSTox [4]. Multiple stereoisomeric forms of ToxCast Phase I chemicals were generated using FLIPPER [14], since many chiral anthropogenic environmental chemicals are unresolved racemic mixtures. Chirality is an important factor in nearly all biomolecular interactions [15], and docking must therefore be carried out using only single isomers. Pregnancy categories for many of the therapeutics were manually extracted from Briggs et al. [16]. Parent-SMILES fields for all chemicals were imported into MOE [17], structures were cleaned, hydrogens were added, and geometries were optimized in a molecular mechanics framework using the MMFFx force-field parameters [18]. (Note: for docking, all 3D ligand chemical structure files were submitted in .SDF (MDL) format; however, ligand IDs and SMILES codes are provided for brevity in supporting information under the SMILES field in Dataset Item 1.)
Targets were selected and prepared as follows. The target landscape covered in DockScreen was selected based on the availability of high-quality crystal structures that covered the assay space of ToxCast Phase I in vitro assays (http://www.epa.gov/ncct/toxcast/files/ToxCastAssays 01aug2007.pdf). A breakdown of the targets selected for study by class is available in Figure 1, and more detailed information is contained in the dataset. The 3D structure files of target proteins were obtained from the Protein Data Bank (PDB), visually inspected in COOT [19], and cleaned up by removing HETATM records and solvent waters. HETATM structures (primarily bound ligands) were used as the starting point for clip-file geometries. In some instances (e.g., 2BXK and 1LFO, human serum albumin and fatty acid binding protein, respectively), multiple binding sites within the same crystal structure were evaluated, in which case the PDB ID was augmented by a letter code (a-g) designating a different binding region. There were several redundant target sequences; however, each pocket's 3D conformation is different, providing a unique computational docking experiment.
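The cleanup step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' workflow (which relied on COOT and visual inspection), and it does not reproduce the eHiTS clip-file format. The function names, the `make_record` helper, and the sample records are all hypothetical. The sketch separates protein ATOM records from HETATM records, drops waters, and computes a padded bounding box around the bound ligand of the kind that could seed a docking box.

```python
def make_record(record, serial, atom, resname, chain, resseq, x, y, z):
    """Build one fixed-column PDB coordinate line (hypothetical helper)."""
    return "%-6s%5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
        record, serial, atom, resname, chain, resseq, x, y, z)


def split_pdb(pdb_text):
    """Split PDB text into protein ATOM lines and non-water HETATM lines."""
    protein, ligands = [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "ATOM":
            protein.append(line)
        elif record == "HETATM":
            # Columns 18-20 hold the residue name; HOH marks solvent water.
            if line[17:20].strip() != "HOH":
                ligands.append(line)
    return protein, ligands


def bounding_box(hetatm_lines, padding=0.0):
    """Axis-aligned box (min_xyz, max_xyz) around the given atoms."""
    coords = [
        (float(l[30:38]), float(l[38:46]), float(l[46:54]))
        for l in hetatm_lines
    ]
    mins = tuple(min(c[i] for c in coords) - padding for i in range(3))
    maxs = tuple(max(c[i] for c in coords) + padding for i in range(3))
    return mins, maxs


# Made-up example: one protein atom, a two-atom ligand, one water.
SAMPLE_PDB = "\n".join([
    make_record("ATOM", 1, "CA", "ALA", "A", 1, 1.0, 2.0, 3.0),
    make_record("HETATM", 2, "C1", "LIG", "A", 101, 10.0, 10.0, 10.0),
    make_record("HETATM", 3, "O1", "LIG", "A", 101, 12.0, 11.0, 9.0),
    make_record("HETATM", 4, "O", "HOH", "A", 201, 30.0, 30.0, 30.0),
])
protein_lines, ligand_lines = split_pdb(SAMPLE_PDB)
box = bounding_box(ligand_lines, padding=2.0)
```

Keeping the ligand records separate, rather than simply discarding them, mirrors the text: the bound ligand is removed from the receptor but still defines where the docking box is placed.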
We chose eHiTS [20] as our initial docking platform since it has a flexible ligand docking method that is exhaustive on the conformations and poses that avoid severe steric clashes between receptor and ligand [21,22], a potential benefit to computational toxicology where minimizing false negatives is one of many goals. It has also been shown to compete well with other docking software programs in accuracy of docking and enrichment of chemical libraries [23].
In eHiTS, the binding pocket is determined by building a steric grid for the specified receptor binding site, and a cavity description is built that consists of thousands of geometric shapes. The ligand is divided into rigid fragments and connecting flexible chains, where the rigid fragments are docked to all possible places in the defined cavity independently of each other. Then, exhaustive matching of compatible rigid fragment pose sets is performed by a rapid hyper-graph clique detection algorithm that enables the elucidation of acceptable combinations of poses and respective scores. The flexible chains are then fitted to the specific rigid fragment poses that comprise a matching pose set, driven by a scoring function built upon a local energy minimization in the active site of the receptor.
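As a toy illustration of the matching step, the sketch below picks one pose per rigid fragment such that every pair of chosen poses is mutually compatible, and returns the lowest-scoring set. It uses plain brute-force enumeration as a stand-in for the hyper-graph clique detection described above. The pose data, the distance-based compatibility rule, and all names are hypothetical; eHiTS is proprietary and its internals are not reproduced here.

```python
from itertools import combinations, product
from math import dist

# Hypothetical candidate poses: fragment name -> list of (position, score).
POSES = {
    "ring": [((0.0, 0.0, 0.0), -2.0), ((5.0, 0.0, 0.0), -3.0)],
    "acid": [((1.5, 0.0, 0.0), -1.0), ((9.0, 0.0, 0.0), -2.5)],
}


def compatible(pose_a, pose_b, min_sep=1.0, max_sep=2.0):
    """Toy rule: two poses are compatible when they neither clash
    (too close) nor drift too far apart to be connected."""
    return min_sep <= dist(pose_a[0], pose_b[0]) <= max_sep


def best_pose_set(poses):
    """Return (total_score, {fragment: pose}) for the lowest-scoring
    pairwise-compatible choice of one pose per fragment, or None."""
    names = list(poses)
    best = None
    for choice in product(*(poses[n] for n in names)):
        if all(compatible(a, b) for a, b in combinations(choice, 2)):
            total = sum(score for _, score in choice)
            if best is None or total < best[0]:
                best = (total, dict(zip(names, choice)))
    return best


result = best_pose_set(POSES)
```

In this made-up example the individually best poses of the two fragments are too far apart to be joined, so the selected set combines the two poses that are mutually compatible even though neither is the best in isolation. Brute force scales exponentially with the number of fragments, which is why a dedicated clique-detection algorithm is needed in practice.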
The default clip-file parameters were used for docking in eHiTS, using a square docking box around the desired ligand. An intermediate pose-reconstruction with "Accuracy level" 3 was used to evaluate all poses, balancing accuracy with speed. The minimum energy score for each ligand/receptor complex is included in this dataset; however, all poses and scores are retained in the internal DockScreen database. All scores are stored and reported in units of log(  []).
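Reducing the stored poses to the single value reported in the dataset amounts to taking a minimum per ligand/target pair. A minimal sketch, assuming pose records arrive as (ligand ID, target ID, score) tuples; the record layout, the IDs, and the score values are hypothetical and do not reflect the internal DockScreen schema.

```python
def minimum_scores(pose_records):
    """Map (ligand_id, target_id) -> lowest score over all poses."""
    best = {}
    for ligand_id, target_id, score in pose_records:
        key = (ligand_id, target_id)
        if key not in best or score < best[key]:
            best[key] = score
    return best


# Made-up pose records for illustration only.
poses = [
    ("L0001", "2BXKa", -4.2),
    ("L0001", "2BXKa", -5.1),
    ("L0001", "1LFO", -3.3),
    ("L0002", "2BXKa", -2.7),
]
matrix = minimum_scores(poses)
```

Pairs for which no pose was recorded simply have no key in the result, which corresponds to an empty cell in the published table.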
The total project comprises ∼1100 ligands docked to 150 unique protein (receptor/enzyme) binding sites that cover a total of ∼100 unique targets. At a 32-pose-per-ligand (maximum) storage capacity, this puts the upper range of calculations at 1.6 × 10^5 unique ligand/target complex combinations. At an average run time of >5 minutes per complex, we anticipated a total run time of >1.5 years on a single processor, clearly a job suited to a distributed computing architecture. Calculations were run at the US Environmental Protection Agency's National Computing Center and deployed over a heterogeneous distributed high-performance computing cluster using primarily idle time, which resulted in an average of ∼20 nodes running at any given instant over a period of two months.
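The back-of-the-envelope estimate above can be reproduced directly. The sketch below uses the approximate figures quoted in the text (about 1100 ligands, 150 binding sites, 5 minutes per docking run, about 20 nodes). Because the text gives these only as approximations and lower bounds, the outputs are rough estimates, not reported results.

```python
LIGANDS = 1100          # approximate ligand count from the text
BINDING_SITES = 150     # unique binding sites
MINUTES_PER_RUN = 5     # lower bound on the average run time
NODES = 20              # approximate average number of busy nodes

MINUTES_PER_DAY = 60 * 24

# One docking run per ligand/binding-site pair.
complexes = LIGANDS * BINDING_SITES
serial_minutes = complexes * MINUTES_PER_RUN
serial_years = serial_minutes / (MINUTES_PER_DAY * 365)
# Idealized wall-clock time if the work divides evenly across nodes.
parallel_days = serial_minutes / NODES / MINUTES_PER_DAY

print(f"{complexes} ligand/target complexes")
print(f"{serial_years:.2f} years on a single node")
print(f"{parallel_days:.1f} days on {NODES} nodes")
```

The idealized parallel figure is a lower bound, since it assumes exactly 5 minutes per run and perfectly even load. The two months reported in the text is plausibly explained by average run times above 5 minutes and by the reliance on idle time.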
Biomolecular interaction profiles require the ligand to reach its target. For many chemicals, absorption, distribution, metabolism, and excretion (ADME) impose a limit on the interactions available. QikProp [24] was applied to the cleaned, stereospecific set of chemicals to provide an initial set of predictions regarding these chemicals' ADME characteristics. All chemicals were energy minimized using OPLS-AA [25] in MOE [17] to match the energy minimization technique used in QikProp model generation.

Dataset Description
The dataset associated with this Dataset Paper consists of 4 items which are described as follows.
Dataset Item 1. Listing of 1094 ligands for which docking results are available. Documented for each ligand are its name and stereospecific SMILES along with whether it came from the ToxCast or FDA dataset. DSSTox CID is included for linking to DSSTox (http://www.epa.gov/ncct/dsstox/). FDA therapeutic categories and pregnancy class are included for therapeutic chemicals.
Dataset Item 2. Listing of 140 PDB entries from which target protein structures were extracted for docking studies. Contained are the PDB ID for linking to the Protein Data Bank (http://www.pdb.org/), a description of the protein, and details on the deposition, origin, and quality of the structure. Included are a target class annotation made manually by the authors and comments on whether multiple chains were used in docking studies.

Concluding Remarks
Potential methods by which 3D modeling techniques could inform mechanistic toxicology have previously been documented [26], but the wealth of data contained in DockScreen may provide even more options. As can be seen from comparison studies of different docking methods [22], the use of only a single software program has its limitations; however, the creation of a large database of docking results across many targets yields advantages for data mining and analysis which are unavailable elsewhere. One apparent use is combining DockScreen with chemical descriptors to model and understand in vitro or in vivo assay results, as reported previously [27]. This application as a unique source of knowledge in modeling may improve the linking of chemical structures with in vitro and in vivo effects in a fully computational approach, thereby increasing in silico predictive power and reducing our reliance on animal models. A second use for DockScreen is the population of information in data fusion tools intended to enable decision support in chemical screening and prioritization. Efforts to build such tools have increased recently, resulting in the creation of "Dashboards" in the EPA's Chemical Safety and Sustainability program (http://actor.epa.gov/actor/faces/CSSDashboardLaunch.jsp). Third, DockScreen contains molecular-level representations that are readily searchable and can be a valuable resource for scientists within the EPA seeking molecular-level insight into some of their in vitro data. For instance, a DockScreen user can use the system to search for structural analogues of novel compounds. Similarly, the nature of the data is amenable to probing molecular similarity based on 3-dimensional biophysical interaction profiles (e.g., multiple target vector scores for a given chemical) [11], which are significantly different from 2D Tanimoto similarity based on chemical fingerprints.
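To make the contrast between profile-based and fingerprint-based similarity concrete, the sketch below compares two chemicals in two ways: a Pearson correlation over their docking-score vectors (using only targets where both have a score) and a Tanimoto coefficient over sets of 2D fingerprint bits. The choice of Pearson correlation, the score values, and the fingerprint bits are illustrative assumptions; the text does not prescribe a particular profile-similarity metric.

```python
from math import sqrt


def profile_similarity(scores_a, scores_b):
    """Pearson correlation over targets scored for both chemicals.
    Missing scores are None. Returns None if it cannot be computed."""
    pairs = [(a, b) for a, b in zip(scores_a, scores_b)
             if a is not None and b is not None]
    if len(pairs) < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


# Made-up data: docking scores over four targets and fingerprint bit sets.
chem_x = [-5.0, -3.0, None, -4.0]
chem_y = [-4.0, -2.0, -6.0, -3.0]
fp_x, fp_y = {1, 4, 9}, {4, 9, 12, 20}

profile_sim = profile_similarity(chem_x, chem_y)
fingerprint_sim = tanimoto(fp_x, fp_y)
```

In this made-up case the two chemicals share fewer than half of their fingerprint bits yet rank the shared targets identically, which is the kind of relationship that a 2D comparison alone would miss.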

Figure 1: Biomolecular target class breakdown used in computational molecular docking.

Table: Minimum scores resulting from docking the 1094 ligands defined in Dataset Item 1 to the 150 targets (from 140 PDB entries) denoted in Dataset Item 2. Each row contains data for one ligand, linkable by Ligand ID. Column names correspond to PDB ID with an additional chain letter if applicable. Data values are the minimum scores obtained from docking that ligand to that target. If no score is listed, no poses were found to have sufficient binding to warrant scoring.
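A minimal sketch of reading such a table, assuming it has been exported as CSV with a `Ligand ID` column followed by one column per target. The column header, the file layout, and the sample values are hypothetical. Empty cells are mapped to `None` so that "no pose found" is never confused with a numeric score.

```python
import csv
import io


def read_score_matrix(handle, id_column="Ligand ID"):
    """Return {ligand_id: {target: score or None}} from a CSV table."""
    matrix = {}
    for row in csv.DictReader(handle):
        ligand_id = row.pop(id_column)
        matrix[ligand_id] = {
            # Blank (or missing) cells mean no pose warranted scoring.
            target: float(value) if value and value.strip() else None
            for target, value in row.items()
        }
    return matrix


# Made-up excerpt for illustration only.
SAMPLE_CSV = """Ligand ID,1LFO,2BXKa,2BXKb
L0001,-3.3,-5.1,
L0002,,-2.7,-4.0
"""
scores = read_score_matrix(io.StringIO(SAMPLE_CSV))
```

Downstream analyses then need to decide explicitly how to treat `None`, for example by restricting comparisons to targets where both chemicals have a score.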