Identification of Chemical Toxicity Using Ontology Information of Chemicals

With the advance of combinatorial chemistry, the number of synthetic compounds has surged, yet our knowledge about them remains limited. Meanwhile, the design of new drugs is very slow, and one of the key causes is the unacceptable toxicity of candidate chemicals. If the toxicity of chemicals can be identified correctly, unsuitable chemicals can be discarded at an early stage, thereby accelerating the study of new drugs and reducing R&D costs. In this study, a new prediction method based on the ontology information of chemicals was built to identify chemical toxicities. Compared with a previous method on the same chemicals, our method achieved a slightly higher first prediction accuracy (75.40% versus 75.14%). We hope that the proposed method may give new insights into the study of chemical toxicity and other attributes of chemicals.


Introduction
In drug discovery, detecting the toxicity of candidate drugs is a very important procedure. Some approved drugs that had passed Phase III clinical trials, such as phenacetin [1] and troglitazone [2], had to be withdrawn from the market because unexpected toxicities were detected, costing pharmaceutical companies millions of dollars. In view of this, it is necessary to detect the toxicity of chemicals before they are selected as candidate drugs. However, evaluating the toxicity of a chemical requires comprehensive experimental testing, which costs millions of dollars and takes many years. At the same time, with the advance of combinatorial chemistry, the number of synthetic compounds has surged, making the detection of chemical toxicities through traditional methods an impossible task. Thus, quick, effective prediction methods that do not involve animals are urgently needed.
In recent years, some prediction methods have been built for detecting chemical toxicities. Most of them can deal with only a single toxicity at a time [3,4]; that is, they predict whether a chemical is toxic or nontoxic with respect to one type of toxicity. To detect all toxicities of a chemical, these methods have to be executed many times. Recently, Chen et al. built a multiclass prediction method using chemical-chemical interaction information [5], which provides a candidate toxicity sequence ranging from the most likely toxicity to the least likely one. Their method was applied to detect the toxicities of chemicals listed in the Accelrys Toxicity Database [6], in which six types of toxicity are reported: (1) acute toxicity; (2) mutagenicity; (3) tumorigenicity; (4) skin and eye irritation; (5) reproductive effects; (6) multiple dose effects. In this study, we employed the data in Chen et al.'s study [5] and adopted a new kind of chemical information to identify chemical toxicities. The ChEBI ontology, integrated in the well-known database ChEBI (Chemical Entities of Biological Interest) [7], reports the ontology information of chemicals and is composed of the following subontologies: (1) molecular structure; (2) biological role; (3) application; (4) subatomic particle. Gene ontology [8], the ontology information for proteins, has been deemed a useful tool for investigating protein-related problems [9][10][11][12]. It is believed that the ChEBI ontology is likewise a useful tool for studying chemicals and building effective prediction methods to identify chemical attributes. Here, we established a prediction method based on this information and compared it with the method reported in [5]. The results indicate that this information is suitable for identifying chemical toxicity. We hope that the proposed method may stimulate extensive investigation based on this information, thereby promoting the study of chemicals and drug discovery.

Dataset.
The toxicity information of chemicals was retrieved from a previous study [5], in which it was collected from the Accelrys Toxicity Database [6]. Six types of toxicity are reported in this database: (1) acute toxicity; (2) mutagenicity; (3) tumorigenicity; (4) skin and eye irritation; (5) reproductive effects; (6) multiple dose effects. Thus, the toxic chemicals in the Accelrys Toxicity Database can be assigned to six classes. To investigate the problem of predicting chemical toxicity more thoroughly, we also employed nontoxic chemicals, which were likewise retrieved from Chen et al.'s study [5]. These chemicals were collected from DrugBank (http://www.drugbank.ca/) [13] and the Human Metabolome Database (HMDB) (http://www.hmdb.ca/) [14]. In total, 174,137 chemicals were collected, each of which was nontoxic or had at least one type of toxicity.
To obtain a well-defined dataset, the chemicals with no ontology information were excluded, leaving a dataset S of 4,177 chemicals, of which 3,769 were toxic and 408 were nontoxic. As mentioned in the above paragraph, each toxic chemical has at least one type of toxicity. For convenience, let us tag the six types of toxicity as $T_1, T_2, \ldots, T_6$ and nontoxicity as $T_7$. Accordingly, the dataset S can be separated into seven subsets formulated by
$S = S_1 \cup S_2 \cup \cdots \cup S_7$, (1)
where $S_i$ consisted of the chemicals having toxicity $T_i$. The number of chemicals in each subset (i.e., the number of chemicals having each type of toxicity) is listed in Table 1, column 3, from which we can see that acute toxicity was the largest class, followed by mutagenicity, multiple dose effects, and so forth, while the nontoxic class was the smallest. Since some chemicals have more than one type of toxicity, that is, they occur in more than one of the sets $S_1, S_2, \ldots, S_6$, the sum of the sizes of the seven subsets was larger than the total number of chemicals in S. Thus, this is a multilabel classification problem. Figure 1 gives the number of chemicals having 1-7 types of toxicity. Like many previous studies dealing with multilabel classification problems [5,15,16], the proposed method gives a series of candidate toxicities for each query chemical, ordered from the most likely toxicity to the least likely one.

Construction of a Graph by Ontology Information of Compound.
The ontology information of compounds was retrieved from ChEBI (http://www.ebi.ac.uk/chebi/init.do) [7]. We downloaded a file named "chebi.obo" (accessed November 2014) from its ftp website, ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/, which contains a large number of ontology terms and their descriptions. Since ontology terms can be conceived as graph-theoretical structures, a graph can be constructed from the information of all ontology terms, in which nodes represent ontology terms and edges denote the relationship between two terms. By using the entries "is a" and "relationship" in the obtained file to indicate the relationship between two terms, we constructed a large graph $G$ with 45,206 nodes and 113,549 edges.
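The graph construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_ontology_graph`, the sample text, and the `EX:` identifiers are hypothetical; the "is a" entry is assumed to correspond to the OBO tag `is_a:`; and edges are treated as undirected, since the source does not state a direction.

```python
from collections import defaultdict

def build_ontology_graph(obo_text):
    """Return an undirected adjacency map {term_id: set(neighbor_ids)}.

    Only `id:`, `is_a:` and `relationship:` lines inside [Term]
    stanzas are read; everything else is ignored.
    """
    graph = defaultdict(set)
    current = None      # id of the term stanza being read
    in_term = False     # True while inside a [Term] stanza

    def link(a, b):
        graph[a].add(b)
        graph[b].add(a)

    for raw in obo_text.splitlines():
        line = raw.split("!", 1)[0].strip()   # drop trailing "! comment"
        if line.startswith("["):
            in_term = (line == "[Term]")
            current = None
        elif not in_term or not line:
            continue
        elif line.startswith("id:"):
            current = line[3:].strip()
            graph[current]                    # register the node itself
        elif current and line.startswith("is_a:"):
            link(current, line[5:].split()[0])
        elif current and line.startswith("relationship:"):
            parts = line[len("relationship:"):].split()
            if len(parts) >= 2:               # "<type> <target id>"
                link(current, parts[1])
    return dict(graph)

# Hypothetical three-term sample in OBO layout.
SAMPLE_OBO = """\
[Term]
id: EX:1
name: first term
is_a: EX:2 ! parent term

[Term]
id: EX:2
relationship: has_role EX:3

[Typedef]
id: has_role
"""

graph = build_ontology_graph(SAMPLE_OBO)
n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
```

On the real file the same loop would be fed the contents of "chebi.obo"; whether the resulting counts match the 45,206 nodes and 113,549 edges reported above depends on parsing details that the source does not specify.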

Prediction Method.
As mentioned in Section 2.2, a graph $G$ was constructed according to the ontology information of compounds. The ontology terms corresponding to two adjacent nodes in $G$ have some special relationship, and it can be further inferred that if two nodes are a small distance apart in $G$, the corresponding ontology terms are closely linked. In view of this, it is reasonable to use the distance in $G$ to quantitatively measure the relationship between two ontology terms. For two terms $o_1$ and $o_2$, let us denote the distance between the corresponding nodes in $G$ by $d(o_1, o_2)$, which can be computed by Dijkstra's algorithm [17]. The smaller $d(o_1, o_2)$ is, the closer the relationship between $o_1$ and $o_2$.
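The term-level distance can be illustrated with a short sketch of Dijkstra's algorithm. The function name `dijkstra_distance` and the toy graph are hypothetical, and every edge is assumed to have unit length because the source does not mention edge weights.

```python
import heapq

def dijkstra_distance(graph, source, target):
    """Shortest-path length from source to target, or None if unreachable.

    graph maps each node to an iterable of neighbors; every edge is
    assumed to have length 1.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry
        for nxt in graph.get(node, ()):
            nd = d + 1                    # unit edge length (assumption)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None

# Toy graph: a path a-b-c-d plus an isolated node e.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}, "e": set()}

d_ad = dijkstra_distance(toy, "a", "d")   # three edges apart
d_aa = dijkstra_distance(toy, "a", "a")   # same node
d_ae = dijkstra_distance(toy, "a", "e")   # no path
```

With unit edge lengths a breadth-first search would give the same distances; Dijkstra's algorithm is shown because it is the one named in the text.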
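The proposed prediction method relied heavily on the result of (2). To introduce the method clearly, some notation is needed. Let S be a training set consisting of $n$ chemicals, say $c_1, c_2, \ldots, c_n$; that is, $S = \{c_1, c_2, \ldots, c_n\}$. The toxicity information of each $c_i$ ($1 \le i \le n$) can be represented by
$\mathbf{t}(c_i) = (t_{i,1}, t_{i,2}, \ldots, t_{i,7})$, (3)
where $t_{i,j}$ ($1 \le j \le 7$) was defined by
$t_{i,j} = 1$ if $c_i$ has toxicity $T_j$, and $t_{i,j} = 0$ otherwise. (4)
Here $T_j$ denotes the $j$th of the seven classes (six types of toxicity and nontoxicity). For a query chemical $c_q$, its score of having toxicity $T_j$ was calculated as follows.
(1) The training chemicals most closely related to $c_q$ according to (2), say $c_{i_1}, c_{i_2}, \ldots, c_{i_h}$, were identified.
(2) For each $T_j$, the score of $c_q$ having toxicity $T_j$ was calculated by
$\mathrm{Score}(c_q \rhd T_j) = \sum_{k=1}^{h} t_{i_k,j}$. (5)
It is easy to observe that the score of $c_q$ having toxicity $T_j$ is the number of chemicals among $c_{i_1}, c_{i_2}, \ldots, c_{i_h}$ that have toxicity $T_j$. Since $c_{i_1}, c_{i_2}, \ldots, c_{i_h}$ are highly related to $c_q$, a larger $\mathrm{Score}(c_q \rhd T_j)$ indicates that many closely related training chemicals of $c_q$ have toxicity $T_j$, suggesting that the probability of $c_q$ having toxicity $T_j$ is high. In particular, $\mathrm{Score}(c_q \rhd T_j) = 0$ means that none of these chemicals has toxicity $T_j$, and the possibility of $c_q$ having this toxicity is regarded as zero.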
As mentioned in Section 2.1, the investigated problem is a multilabel classification problem, so giving only the most likely candidate toxicity is not enough. Fortunately, a series of candidate toxicities can be output according to the scores of the query chemical for the 7 classes. The toxicity that receives the highest score is the most likely toxicity, the toxicity receiving the second highest score is the second most likely toxicity, and so forth. For example, if the seven scores of a query chemical $c_q$, written $\mathrm{Score}(c_q \rhd T_j)$ for class $T_j$, are ranked as
$\mathrm{Score}(c_q \rhd T_1) \ge \mathrm{Score}(c_q \rhd T_4) \ge \mathrm{Score}(c_q \rhd T_2) > \mathrm{Score}(c_q \rhd T_3) = \mathrm{Score}(c_q \rhd T_5) = \mathrm{Score}(c_q \rhd T_6) = \mathrm{Score}(c_q \rhd T_7) = 0$, (6)
it suggests that $T_1$ (i.e., acute toxicity) is the most likely toxicity for $c_q$, followed by $T_4$ (i.e., skin and eye irritation) and $T_2$ (i.e., mutagenicity), while the other types of toxicity are not predicted to be candidate toxicities for $c_q$. Furthermore, $T_1$ is called the first prediction, $T_4$ the second prediction, and so forth.
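The scoring and ranking steps can be sketched as a simple vote count over the closely related training chemicals. The function name `rank_toxicities` and the neighbor label sets are hypothetical, chosen only to reproduce the ordering in the example above; breaking ties by the smaller class index is an assumption, since the source does not say how ties are resolved.

```python
def rank_toxicities(neighbor_labels, n_classes=7):
    """Count votes per class and rank the classes that received votes.

    neighbor_labels: one set of class indices (1..n_classes) for each
    training chemical closely related to the query chemical.
    Returns (scores, ranked), where ranked omits zero-score classes.
    """
    scores = {j: 0 for j in range(1, n_classes + 1)}
    for labels in neighbor_labels:
        for j in labels:
            scores[j] += 1            # one vote per neighbor having class j
    ranked = sorted(
        (j for j in scores if scores[j] > 0),
        key=lambda j: (-scores[j], j),   # tie-break by index (assumption)
    )
    return scores, ranked

# Hypothetical neighbors of a query chemical (not taken from the dataset).
neighbors = [{1, 4}, {1, 2, 4}, {1}]
scores, ranked = rank_toxicities(neighbors)
```

Here class 1 collects three votes, class 4 two, and class 2 one, so the candidate sequence is 1, 4, 2 and the remaining classes are not predicted.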

Accuracy Measurements.
For a query chemical, the proposed method provides a series of candidate toxicities. In view of this, the accuracy should be calculated for each prediction order. The $j$th prediction accuracy can be computed by [5,15]
$\mathrm{ACC}_j = \frac{CP_j}{N}, \quad j = 1, 2, \ldots, 7$, (7)
where $CP_j$ is the number of chemicals whose $j$th prediction is correct and $N$ is the total number of chemicals predicted by the method. Since it is difficult to know the number of toxicities of a query chemical in advance, the first prediction accuracy is the most important measure of the method's performance. In addition, an effective prediction method for a multilabel classification problem should rank the candidate toxicities well; that is, the prediction accuracies should decrease as the prediction order increases.
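The per-order accuracy can be computed as in the sketch below. The function name `order_accuracies` and the three toy chemicals are hypothetical; a chemical with fewer than $j$ candidate toxicities is counted as having no correct $j$th prediction, which is an assumption consistent with dividing by the total number of predicted chemicals.

```python
def order_accuracies(predictions, truths, n_orders=7):
    """Return [ACC_1, ..., ACC_n_orders].

    predictions: ranked candidate classes for each chemical.
    truths: set of true classes for each chemical.
    """
    n = len(predictions)
    acc = []
    for j in range(n_orders):
        correct = sum(
            1
            for pred, true in zip(predictions, truths)
            if j < len(pred) and pred[j] in true   # j-th prediction exists and is right
        )
        acc.append(correct / n)
    return acc

# Hypothetical toy data: three chemicals.
predictions = [[1, 4, 2], [2, 1], [7]]
truths = [{1, 2}, {1}, {7}]
acc = order_accuracies(predictions, truths)
```

In this toy case two of the three first predictions are correct, and one each of the second and third predictions, so the accuracies are 2/3, 1/3, 1/3 and zero for the remaining orders.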
Besides, to evaluate the performance of the prediction method as a whole, another measurement was also adopted [5,15]. It measures the proportion of the true toxicities covered by the first $m$ predictions of chemicals, which can be calculated by
$\Theta_m = \frac{\sum_{i=1}^{N} \Psi_{i,m}}{\sum_{i=1}^{N} N_i}$, (8)
where $N$ is the total number of chemicals predicted by the method, $\Psi_{i,m}$ is the number of true toxicities of the $i$th chemical that are listed among its first $m$ predictions, and $N_i$ is the total number of true toxicities of the $i$th chemical. Generally, $m$ is taken as the smallest integer greater than or equal to the average number of toxicities of the chemicals processed by the method; that is, $m = \lceil \sum_{i=1}^{N} N_i / N \rceil$. A larger $\Theta_m$ indicates that the true toxicities are ranked near the front of the candidate toxicities.
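A minimal sketch of this coverage measurement follows. The function name `coverage` and the toy data are hypothetical; the cutoff is computed from the data as the ceiling of the average number of true toxicities, as described above.

```python
import math

def coverage(predictions, truths):
    """Return (m, theta): the cutoff m and the share of true classes
    found among the first m predictions of each chemical."""
    n = len(predictions)
    total_true = sum(len(true) for true in truths)
    m = math.ceil(total_true / n)     # smallest integer >= average label count
    covered = sum(
        len(set(pred[:m]) & true)     # true classes within the first m predictions
        for pred, true in zip(predictions, truths)
    )
    return m, covered / total_true

# Hypothetical toy data: three chemicals with four true classes in total.
predictions = [[1, 4, 2], [2, 1], [7]]
truths = [{1, 2}, {1}, {7}]
m, theta = coverage(predictions, truths)
```

The average number of true classes here is 4/3, so the cutoff is 2, and three of the four true classes fall within the first two predictions.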

Performance of the Method.
For the 4,177 chemicals in S, the prediction method was executed to identify their toxicities, and its performance was evaluated by the jackknife test [15]. The seven prediction accuracies obtained by (7) are listed in Table 2. It can be observed that the first prediction accuracy was 75.17%, the second was 43.52%, and the third was 28.47%. Furthermore, the seven prediction accuracies decreased monotonically as the prediction order increased, indicating that the proposed method ranked the candidate toxicities of the tested chemicals quite well.
In addition, the average number of toxicities of chemicals in S was about 2.38. Thus, the first three predictions of each chemical in S were collected, and (8) gave 61.87%; that is, 61.87% of the true toxicities of the chemicals in S were covered by their first three predictions. These results indicate that the proposed method is quite effective for identification of chemical toxicities.

Understanding the Method by Listing an Example.
To make the method easier to understand, this section gives an example. CID104975 is a chemical with toxicities $T_2$ (mutagenicity) and $T_3$ (tumorigenicity). Its ontology term is CHEBI:25957. According to the method, we computed the distance between CHEBI:25957 and the ontology terms of the other chemicals in S, thereby calculating the relationship between CID104975 and the other chemicals by (2). Four chemicals were found to be closely related to CID104975: CID995, CID2236, CID6763, and CID13257. Their toxicities and ontology terms are listed in Table 3, columns 2 and 3, respectively. By the method, $T_1$ (acute toxicity) received 3 votes, $T_2$ (mutagenicity) 4 votes, $T_3$ (tumorigenicity) 3 votes, $T_6$ (multiple dose effects) 2 votes, and the other classes no votes. Accordingly, the candidate toxicities for CID104975 were, in order, $T_2$, $T_1$, $T_3$, and $T_6$. The first and third predictions were correct, while the second prediction was incorrect.

Comparison with Other Methods.
In this section, we employed another kind of chemical information, which was applied to the identification of chemical toxicities in Chen et al.'s study [5]. Their method was built on chemical-chemical interaction information, which has been deemed useful for studying chemical-related problems [5,15,18,19], and gave good performance.
To compare our method with Chen et al.'s method fairly, a set of 3,955 chemicals, called $S_c$, was extracted from S such that each chemical in $S_c$ has both ontology information and interaction information; that is, each chemical can be predicted by both methods. The number of chemicals in $S_c$ for each type of toxicity is listed in Table 1, column 4, from which we can see that the distribution of the 3,955 chemicals over the seven classes is similar to that of the chemicals in S. Again, some chemicals have two or more toxicities. Both methods were executed on $S_c$, and their performance was evaluated by the jackknife test. The seven prediction accuracies are listed in Table 2, columns 3 and 4. The first prediction accuracy of our method was 75.40%, slightly higher than the 75.14% of Chen et al.'s method. However, for the higher prediction orders, the prediction accuracies of Chen et al.'s method were higher than those obtained by our method. This is reasonable because the ontology information of chemicals is not yet complete, so many relations between ontology terms have not been detected. Furthermore, we also calculated the measurement defined in (8). Since the average number of toxicities of chemicals in $S_c$ was about 2.44, the first three predictions of the chemicals in $S_c$ obtained by the two methods were collected, giving 61.70% for our method and 65.31% for Chen et al.'s method. This difference is attributed to the same reason. Thus, when more than one toxicity is considered for a chemical, our method is not better than Chen et al.'s method. Nevertheless, its first prediction accuracy is higher, and this is the most important measure because one always pays most attention to the most likely toxicity of a chemical. In view of this, we believe that our method has an advantage in identifying the most likely toxicity of a chemical.

Conclusions
This study presented a new prediction method to identify chemical toxicities. By utilizing the ontology information of chemicals reported in ChEBI, the toxicities of a chemical can be predicted with a first prediction accuracy of about 75%. We hope that this method may promote the study of chemicals.