Rodent Carcinogenicity Dataset

The rodent carcinogenicity dataset was compiled from the Carcinogenic Potency Database (CPDBAS) and was used to build classification quantitative structure-activity relationship (QSAR) models for the prediction of carcinogenicity based on the counter-propagation artificial neural network (CP ANN) algorithm. The models were developed within the EU-funded project CAESAR for regulatory use. The dataset contains the following information: common information about chemicals (ID, chemical name, and CASRN), molecular structure information (SDF files and SMILES), and carcinogenic (toxicological) properties: carcinogenic potency (TD50_Rat_mg; carcinogen/noncarcinogen) and structural alerts (SAs) for carcinogenicity based on mechanistic data. Molecular structure information was used to calculate molecular descriptors (254 MDL and 784 Dragon descriptors), which were further used in predictive QSAR modeling. The dataset presented in the paper can be used in future research in oncology, ecology, or chemical risk assessment.


Introduction
Rodent carcinogenicity datasets were used to build models to predict carcinogenicity within the EC-funded project CAESAR (Project no. 022674 (SSPI)) [1]. The CAESAR project aimed to develop quantitative structure-activity relationship (QSAR) models for the REACH (Registration, Evaluation, Authorization, and restriction of CHemicals) legislation for five endpoints: bioconcentration factor, skin sensitization, carcinogenicity, mutagenicity, and developmental toxicity. The REACH regulation requires the evaluation of the risks resulting from the use of chemicals produced in industry and testing of their toxicity. Carcinogenicity is among the toxicological endpoints that pose the highest public concern. The standard bioassays in rodents used to assess the carcinogenic potency of chemicals are time consuming and costly and require the sacrifice of a large number of animals. Cancer bioassays should be reduced according to the REACH regulation [2], while the Seventh Amendment to the EU cosmetics directive will ban the bioassay for cosmetic ingredients from 2013 [3].
The aim of the CAESAR project was to reduce the use of animals as well as the cost associated with toxicity tests. The models predicting carcinogenicity meet the requirements for QSAR models intended for regulatory use. Great attention was paid to the quality of the data used to build the models; the models were then validated. They are transparent and reproducible and were checked against the OECD principles.
The models on the CAESAR website have been implemented in Java and are freely accessible for public use [1].
Models for the prediction of carcinogenicity using the rodent carcinogenicity database have been described previously [4][5][6][7][8]. The state of the art and perspectives of predictive models for carcinogenicity are discussed in the paper by Benfenati et al. [9].

Methodology
The chemicals involved in the study belong to different chemical classes, so-called noncongeneric substances. The aim was to cover as much of the chemical space as possible. The list of 805 chemicals (see Dataset Item 1 (Table)) was extracted from rodent carcinogenicity study findings for 1481 chemicals taken from the Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network, which was built from the Lois Gold Carcinogenic Database (CPDBAS). It should be stressed that, in order to obtain data suitable for QSAR modeling, the initial dataset (1481 chemicals) was cleaned of all incorrect structures, ambiguous or mixed structures, polymers, inorganic compounds, metallo-organic compounds, salts, and complexes, as well as compounds without a well-defined structure. The obtained data and structures of chemicals were cross-checked by at least two partners using the following online databases: ChemFinder [11], ChemIDplus [12], and PubChem Compound [13]. We selected chemicals with available information about carcinogenic potency in rats. Thus, the final dataset of 805 chemicals, with their ID number, chemical name, CASRN, experimental TD50 values for rat, and corresponding binary carcinogenicity classes (P: positive; NP: not positive), is available in Dataset Item 1 (Table). For each substance, it is indicated whether it belongs to the training or the test set.
The use of rat data only was suggested because a dataset based on data for a single species is more consistent and has less variation than a dataset based on two or more species.
Additionally, in our latest study we complemented the dataset with the following alerts collected from the Toxtree program: GA, genotoxic alert; nGA, non-genotoxic alert; and NA, no carcinogenic alert. Structural alerts (SAs) for carcinogenicity indicating a possible mechanism of carcinogenicity were also collected for each chemical and are presented in Dataset Item 2 (Table), and the list of SAs for carcinogenicity is presented in Table 1 (for a more detailed explanation of the terms listed, see http://www.epa.gov/ncct/dsstox/sdf_cpdbas.html). The Toxtree expert system with the 33 SAs for carcinogenicity was reported in the Benigni/Bossa rulebase for mutagenicity and carcinogenicity [14]. In a broad sense, the set of chemicals characterized by the same SA could compose a family of compounds with the same mechanism of action (see the recent review by Benigni and Bossa [15]). From our point of view, the SA for carcinogenicity is valuable information for the mechanistic interpretation of models [8].
To prepare data for modeling, the dataset of 805 chemicals was subdivided into training (644 chemicals) and test (161 chemicals) sets. The chemicals were first sorted according to a hierarchical system of compound classes defined by functional groups; within classes, the compounds were sorted according to halogen substitution, aromaticity, bond orders, ring contents, and number of atoms, followed by a procedure aimed at distinguishing between connectivity aspects. This sorting of compounds was implemented with the software system ChemProp [16,17].
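The 644/161 partition corresponds to an 80/20 split of the 805 compounds. The following minimal sketch shows one way such a split could be drawn from a class-sorted list, by assigning every fifth compound to the test set so that each compound class is represented roughly proportionally. This systematic sampling rule, the function name `systematic_split`, and the `step` parameter are assumptions made for illustration; the actual ChemProp procedure is not reproduced here.

```python
def systematic_split(sorted_ids, step=5):
    """Split a class-sorted list of compound IDs into training and test sets.

    Every `step`-th compound goes to the test set (assumed rule, not the
    ChemProp algorithm); all remaining compounds go to the training set.
    """
    training, test = [], []
    for position, chem_id in enumerate(sorted_ids):
        if position % step == step - 1:
            test.append(chem_id)
        else:
            training.append(chem_id)
    return training, test


# Hypothetical IDs standing in for the 805 sorted chemicals.
ids = list(range(1, 806))
training, test = systematic_split(ids)
print(len(training), len(test))  # 644 161
```

Because the input list is sorted by compound class, neighbouring compounds are structurally similar, so a regular sampling step tends to spread the test compounds over all classes.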
External validation of the models was performed using an external validation set of 738 chemicals different from those in our dataset of 805 compounds described earlier [4]. ChemFinder Ultra 10.0 software was used [18].
Nowadays, thousands of chemical descriptors such as constitutional, quantum chemical, topological, geometrical, charge related, semiempirical, thermodynamic, and others can be calculated for a given chemical structure [19,20]. In the present study, the following sets of descriptors for 805 compounds were generated for modeling: 254 MDL descriptors computed using MDL QSAR version 2.2 [21] and 835 Dragon descriptors calculated by DRAGON professional 5.4 software [22].
The obtained descriptors include physicochemical, electrotopological E-state, connectivity, and other descriptors. It should be noted that E-state indices are a combination of electronic, topological, and valence state information [23][24][25].
To develop robust and reliable models, the descriptor space should be reduced by extracting the most significant variables correlated with carcinogenicity. The Hybrid Selection Algorithm (HSA) method was used to select, among the different molecular descriptor series, the best parameters to classify chemicals by their carcinogenic potency. It combines Genetic Algorithm (GA) concepts and stepwise regression [26]. In this way, the descriptor space was reduced from 254 to 8 MDL descriptors [4]. Thus, we used topological descriptors, including atom-type and group-type E-state and hydrogen E-state indices, molecular connectivity Chi indices, and topological polarity, to obtain the molecular structure information that is correlated with carcinogenic potency. Among the 8 MDL descriptors, there are two connectivity indices (dxp9 and nxch6), three constitutional parameters (SdssC_acnt, SdsN_acnt, and SHBint2_acnt), and three electrotopological parameters (SdsCH, Gmin, and SHCsats).
Among statistical approaches such as linear multivariate regressions, GMDH, and fuzzy logic, artificial neural networks (ANNs), particularly the CP ANN, appeared to be one of the most suitable approaches for predicting a complex endpoint such as carcinogenicity for noncongeneric datasets of chemicals, giving the most reproducible results. The main advantage of neural network modeling is that complex, nonlinear relationships can be modeled without any assumptions about the form of the model. Large datasets can be examined. Neural networks are able to cope with noisy data and are fault tolerant. However, the interpretation of the acquired knowledge is often a challenge [29].
A detailed description of the CP ANN can be found in the literature [30][31][32][33]. The models used to predict carcinogenicity using 8 MDL descriptors as well as 12 Dragon descriptors, together with their characterization, have been published [4].
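To make the general idea of a counter-propagation network concrete, the sketch below implements a small generic version: a Kohonen layer maps descriptor vectors onto a grid of neurons, and an output layer stores the class value associated with each neuron. It is an illustration under stated assumptions, not the CAESAR implementation: the class name, grid size, linearly decreasing learning rate, shrinking square neighbourhood, and triangular neighbourhood weighting are all hypothetical choices, and the toy descriptor values are invented.

```python
import random


class CounterPropagationNetwork:
    """Minimal counter-propagation network (generic sketch, assumed settings)."""

    def __init__(self, n_inputs, grid_size=3, seed=0):
        rng = random.Random(seed)
        self.grid_size = grid_size
        self.positions = [(r, c) for r in range(grid_size) for c in range(grid_size)]
        # Kohonen layer: one weight vector per neuron, randomly initialised.
        self.kohonen = [[rng.random() for _ in range(n_inputs)] for _ in self.positions]
        # Output layer: one response value per neuron, started at the midpoint.
        self.output = [0.5 for _ in self.positions]

    def _winner(self, x):
        # Index of the neuron whose weights are closest to x (squared Euclidean).
        def sq_dist(w):
            return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        return min(range(len(self.kohonen)), key=lambda j: sq_dist(self.kohonen[j]))

    def fit(self, samples, targets, epochs=20, rate_start=0.5, rate_end=0.01):
        max_radius = self.grid_size - 1
        for epoch in range(epochs):
            progress = epoch / max(1, epochs - 1)
            rate = rate_start + (rate_end - rate_start) * progress
            radius = round(max_radius * (1 - progress))
            for x, y in zip(samples, targets):
                wr, wc = self.positions[self._winner(x)]
                for j, (r, c) in enumerate(self.positions):
                    d = max(abs(r - wr), abs(c - wc))  # grid distance to winner
                    if d > radius:
                        continue
                    a = rate * (1 - d / (radius + 1))  # triangular neighbourhood
                    # Move Kohonen weights toward the input ...
                    self.kohonen[j] = [w + a * (xi - w) for w, xi in zip(self.kohonen[j], x)]
                    # ... and the output weight toward the target class value.
                    self.output[j] += a * (y - self.output[j])

    def predict(self, x):
        # The response is read from the output layer at the winning neuron.
        return self.output[self._winner(x)]


# Toy example with two invented, scaled descriptors; 1.0 = P, 0.0 = NP.
samples = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.9], [0.8, 0.9]]
targets = [0.0, 0.0, 1.0, 1.0]
net = CounterPropagationNetwork(n_inputs=2)
net.fit(samples, targets)
for x in samples:
    print(x, round(net.predict(x), 3))
```

A predicted value can be turned into a P or NP class by comparing it with a threshold such as 0.5; the choice of threshold is again an assumption of this sketch.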

Dataset Description
The dataset associated with this Dataset Paper consists of 7 items, which are described as follows.
Dataset Item 1 (Table). A list of 805 chemicals from CPDBAS used for carcinogenicity modeling with indication of training and test sets, which were extracted from the original dataset of 1481 chemicals downloaded from the Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network (http://www.epa.gov/ncct/dsstox/sdf_cpdbas.html). The column ID_v5 presents the codes of the chemicals used in the CAESAR project (ID of chemicals in database version 5); ID_CPDBAS-Original, the ID number taken from the DSSTox Public Database Network version 3b; Chemical Name, the chemical names taken from DSSTox and double checked against PubChem Compound (NCBI) (http://www.ncbi.nlm.nih.gov/sites/entrez?db=pccompound); CASRN, the registry number of the Chemical Abstracts Service taken from DSSTox and double checked against PubChem Compound (NCBI). In the column Carcinogenic Potency Expressed as TD50, TD50 is the dose rate in milligrams per kilogram of body weight per day which, if administered chronically for the standard lifespan of the species, will halve the probability of remaining tumorless throughout that period. The TD50 value reported is the harmonic mean of the most potent TD50 values from each positive experiment in the species. All the values were derived from the Carcinogenic Potency Database (http://potency.berkeley.edu/cpdb.html). In the column Carcinogenic Potency Expressed as P or NP, "P" means positive or active (carcinogens) and "NP" means not positive or inactive (noncarcinogens). In the column Set, "Training" is for the training set and "Test" is for the test (prediction) set.
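As a small worked illustration of the TD50 summary described above, the following sketch computes the harmonic mean of the most potent TD50 values from several positive experiments. The function name `summary_td50` and the numerical values are hypothetical and are not taken from the dataset.

```python
from statistics import harmonic_mean


def summary_td50(most_potent_td50_per_experiment):
    """Harmonic mean of the most potent TD50 values (mg/kg body weight/day),
    one value per positive experiment in the species."""
    if not most_potent_td50_per_experiment:
        raise ValueError("at least one positive experiment is required")
    return harmonic_mean(most_potent_td50_per_experiment)


# Hypothetical TD50 values from two positive rat experiments.
print(summary_td50([2.0, 8.0]))  # 2 / (1/2 + 1/8) = 3.2
```

The harmonic mean is dominated by the smaller (more potent) TD50 values, so the summary stays close to the most potent experiments rather than being pulled upward by a single weak result.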

Concluding Remarks
The CPDB rodent carcinogenicity database was used for the development of models for the categorization of carcinogenic potency. Initial preprocessing of the data and selection of data with carcinogenic potency for rats gave us consistent data suitable for QSAR modeling, with a carcinogenic potency response closer to human. The MDL and Dragon software programs were applied for calculating the molecular descriptors. The topological structure descriptors provided a sound basis for classifying molecular structures.
The CP ANN model for the prediction of carcinogenicity demonstrated good prediction statistics on the test set of 161 compounds, with a sensitivity of 75% and a specificity of 61%-69% in addition to an accuracy of 69%-73%. A diverse external validation set of 738 compounds confirmed the robustness of our models regarding a large applicability domain, yielding an accuracy of 60.0%-61.4%, a sensitivity of 61.8%-64.0%, and a specificity of 58.4%-58.9%.
The carcinogenicity models presented in the study [4] can be used as a support in risk assessment, for instance, in setting priorities among chemicals for further testing. The dataset and additional information presented in the paper can be used in QSAR modeling, in future research in oncology, and in the risk assessment of chemicals.
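For readers who wish to recompute such statistics from the P/NP labels, the sketch below derives sensitivity, specificity, and accuracy from observed and predicted classes using the standard confusion-matrix definitions. The function name and the example labels are hypothetical and do not reproduce the reported results.

```python
def classification_metrics(observed, predicted, positive="P"):
    """Return sensitivity, specificity, and accuracy for binary class labels."""
    tp = fn = tn = fp = 0
    for obs, pred in zip(observed, predicted):
        if obs == positive:
            if pred == positive:
                tp += 1  # carcinogen correctly predicted
            else:
                fn += 1  # carcinogen missed
        else:
            if pred == positive:
                fp += 1  # noncarcinogen wrongly flagged
            else:
                tn += 1  # noncarcinogen correctly predicted
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the observed labels")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical labels for eight compounds.
observed = ["P", "P", "P", "P", "NP", "NP", "NP", "NP"]
predicted = ["P", "P", "P", "NP", "NP", "NP", "P", "P"]
print(classification_metrics(observed, predicted))
# {'sensitivity': 0.75, 'specificity': 0.5, 'accuracy': 0.625}
```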

Table 1: A list of 33 SAs for carcinogenicity extracted from Toxtree with number of chemicals in carcinogenicity dataset.

Table. A list of 784 Dragon descriptors with their signs and definitions.