The substrates of a transporter are useful not only for inferring the function of the transporter but also for discovering compound-compound interactions and reconstructing metabolic pathways. Though plenty of data have been accumulated with the development of new technologies such as
Metabolic network analysis and reconstruction have become increasingly prevalent, drawing on diverse data sources from functional genomics experiments. Many bioinformatic tools have been developed to generate high-quality metabolic models from metabolic enzyme and pathway annotation for different organisms [
Although many transporter databases have been developed to store and classify reported transporters, such as TCDB [
In the postgenomic era, biomedical data and literature are growing exponentially. To date, PubMed, the most comprehensive biomedical literature repository, includes over 21 million abstracts [
Here we constructed a standalone tool, METSP, a maximum-entropy text mining classifier, to extract TSPs from semistructured text in UniProt protein annotation. As one of the most comprehensive and trusted protein databases, UniProt provides reliable substrate data. In addition, its semistructured protein annotation text makes information extraction more reliable than extraction from free text. We believe that METSP will be useful for metabolic network reconstruction [
The main goal of METSP is to identify and extract sentences with transporter-substrate information from UniProt entries. Thus, METSP focuses on two tasks. The first task is to extract semistructured annotation sentences of transporters in UniProt and then map the transporter and compound names in those sentences to standardized protein identifiers in UniProt and compound identifiers in the KEGG LIGAND database [
Workflow of the design and function of METSP. Step I (highlighted in pink): explicit TSPs were manually collected from the UniProt, TCDB, and TransportDB databases. Step II (in blue): the UniProt annotation text of proteins in the explicit TSPs and in a randomly selected protein set was processed to obtain positive and unlabeled sentence training sets. The maximum-entropy model was used to train and retrain the classifier. Step III (in green): the classifier was used to recognize TSPs from query protein annotation text. The new TSPs were obtained after further expert checking.
To obtain a comprehensive and reliable TSP dataset, we first retrieved all known transporter and substrate information from UniProt and two other popular transporter databases, TCDB and TransportDB. We then mapped substrate names to compound IDs in the KEGG LIGAND database and manually checked all transporter names and their corresponding substrate names and compound IDs one by one. Finally, we compiled 6955 reliable TSPs from the three databases (Additional file 1, in Supplementary Material available online at
Summary of reliable TSPs from UniProt, TCDB, and TransportDB.
| | UniProt | TCDB | TransportDB | Sum from formula ( |
---|---|---|---|---|
TSPs | 35586 | 2641 | 86726 | 6955 |
Transporters | 25056 | 1501 | 57070 | 5042 |
Substrates | 528 | 229 | 351 | 275 |
Note: (
The training sentences that express a transport relationship between a transporter and a substrate come from the semistructured protein annotation text of UniProt. To obtain a reliable training dataset, we first retrieved the full text of transporter entries from UniProt based on the accession numbers of transporters in our curated reliable TSPs (Additional file 1). As only annotations from the protein name (DE), function annotation (CC), and gene ontology (DR) fields were informative for extracting substrate information, we deleted annotations in other fields before training the classifier, both to reduce the size of the data to preprocess and to avoid the negative influence of irrelevant fields. A previous study indicated that better performance can be achieved by using sentences as input instead of sentence pairs [
Collecting a negative dataset manually was expensive, so we collected unlabeled instances to serve as negative data and combined them with the positive instances to make up the training set. There were 525,997 reviewed protein accession numbers in UniProt (checked on May 4, 2011), from which 5,042 were chosen randomly. Then 28,120 sentences were extracted as unlabeled instances from the annotations of the selected proteins, none of which overlap with the proteins in our training set (Additional file 2). A previous study indicated that a training set consisting of positive and completely randomly chosen instances could achieve classification performance similar to that of a training set consisting of positive and negative instances [
In this study, all marked compound names must have compound IDs in the KEGG LIGAND database, under the hypothesis that a compound name appearing in a sentence can be mapped to a unique identifier in the KEGG LIGAND database. In addition, compound names in UniProt are expected to be general, whereas those in the KEGG LIGAND database are expected to be comprehensive. For example, UniProt refers to sucrose as “sucrose,” whereas others name it “cane sugar,” “saccharose,” and “1-alpha-D-Glucopyranosyl-2-beta-D-fructofuranoside.” For these reasons, we rapidly tagged compound names in sentences using a Trie data structure [
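The trie-based tagging of compound names can be sketched as follows. This is a minimal illustration and not the METSP implementation: the class `CompoundTrie`, the word-level tokenization, and the longest-match rule are assumptions made for the example, and the synonym-to-identifier mapping (C00089 for sucrose) is illustrative.

```python
import re

_END = object()  # sentinel key marking the end of a complete compound name


def _tokens(text):
    """Lowercase the text and split it into word-like tokens, keeping inner hyphens."""
    return re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())


class CompoundTrie:
    """Word-level trie that maps compound names (and synonyms) to identifiers."""

    def __init__(self):
        self.root = {}

    def add(self, name, compound_id):
        node = self.root
        for token in _tokens(name):
            node = node.setdefault(token, {})
        node[_END] = compound_id

    def tag(self, sentence):
        """Return (matched name, identifier) pairs, preferring the longest match."""
        tokens = _tokens(sentence)
        hits = []
        i = 0
        while i < len(tokens):
            node, j, best = self.root, i, None
            while j < len(tokens) and tokens[j] in node:
                node = node[tokens[j]]
                j += 1
                if _END in node:
                    best = (j, node[_END])
            if best is None:
                i += 1
            else:
                end, compound_id = best
                hits.append((" ".join(tokens[i:end]), compound_id))
                i = end
        return hits


# Illustrative synonyms for sucrose, all mapped to one identifier.
trie = CompoundTrie()
for synonym in ("sucrose", "cane sugar", "saccharose"):
    trie.add(synonym, "C00089")

hits = trie.tag("Mediates the uptake of cane sugar and saccharose.")
```

Because every synonym ends at the same identifier, general and specific names for one compound resolve to a single ID in one pass over the sentence.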
To assess how robust the predictions are on independent data, we conducted a tenfold cross-validation. The collected training set (13,212 positive instances from the annotations of 5,042 transporters) was randomly partitioned into 10 equal-sized subsets. In each run, a single subset was used as validation data to test the output of our statistical model, and the remaining 9 subsets were used as training data. This process was repeated 10 times. As each subset was used exactly once as validation data, we obtained 10 results, which were averaged to generate the final rotation estimate. We chose tenfold cross-validation over repeated random subsampling because all 13,212 collected positive instances are used for both training and validation, and each observation is used for validation exactly once. In each cross-validation run, roughly 1,320 positive instances were used as the testing set. We calculated the number of correct predictions by counting how many prediction results from the testing set exactly matched their original labels (positive or negative, as shown in Additional file 2).
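The partitioning described above can be sketched in a few lines. The function name `k_fold_splits` and the fixed random seed are assumptions for illustration; the sketch only shows how 13,212 instances fall into ten folds of roughly 1,320 each, with every instance validated exactly once.

```python
import random


def k_fold_splits(instances, k=10, seed=0):
    """Yield (training, validation) pairs; each instance is validated exactly once."""
    items = list(instances)
    random.Random(seed).shuffle(items)
    # Striding by k yields k folds whose sizes differ by at most one.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation


# Stand-in for the 13,212 positive instances: integer IDs instead of sentences.
splits = list(k_fold_splits(range(13212), k=10))
fold_sizes = sorted(len(validation) for _, validation in splits)
```

Averaging a metric over the ten `(training, validation)` pairs gives the rotation estimate mentioned in the text.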
We derived features from sentences following the bag-of-words idea [
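To make the bag-of-words idea concrete, the sketch below counts word occurrences per sentence and fits a binary maximum-entropy model, which for two classes is equivalent to logistic regression. The tokenization, learning rate, number of epochs, and the four toy sentences are all assumptions for illustration; the actual feature set and training procedure of METSP are not reproduced here.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Map a sentence to word counts, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def probability(weights, bias, features):
    """P(positive | features) under a binary maximum-entropy (logistic) model."""
    z = bias + sum(weights.get(w, 0.0) * c for w, c in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(examples, epochs=200, rate=0.1):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights, bias = {}, 0.0
    data = [(bag_of_words(text), label) for text, label in examples]
    for _ in range(epochs):
        for features, label in data:
            error = label - probability(weights, bias, features)
            bias += rate * error
            for word, count in features.items():
                weights[word] = weights.get(word, 0.0) + rate * error * count
    return weights, bias


# Toy training sentences (hypothetical, written for this example).
examples = [
    ("Mediates the transport of sucrose across the membrane", 1),
    ("Sodium-dependent transporter of glucose", 1),
    ("Binds DNA and regulates transcription", 0),
    ("Located in the nucleus", 0),
]
weights, bias = train_maxent(examples)
p_transport = probability(weights, bias, bag_of_words("Mediates transport of glucose"))
p_other = probability(weights, bias, bag_of_words("Binds DNA in the nucleus"))
```

A sentence is then labeled positive when its probability exceeds a chosen threshold, for example 0.5.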
Since the unlabeled dataset contained some positive instances, we iteratively tagged the false negative instances in the unlabeled dataset to reduce their negative effect on the classifier. In each iteration, we rescued the sentences in the unlabeled dataset that expressed a real transport relationship, added them to the positive set to construct a new training dataset, and then obtained classification results from tenfold cross-validation experiments on the new training set. The precision and recall [
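One round of this relabeling, together with the standard precision and recall definitions, might look like the sketch below. The callables `predict` and `confirm` are hypothetical stand-ins for the trained classifier and for manual checking, and treating the unlabeled sentences flagged by the classifier as the candidates for rescue is an assumption about the procedure.

```python
def precision_recall(true_labels, predicted_labels):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def relabel_round(positives, unlabeled, predict, confirm):
    """Move unlabeled sentences that are predicted positive and confirmed by review."""
    candidates = [s for s in unlabeled if predict(s)]
    rescued = [s for s in candidates if confirm(s)]
    rescued_set = set(rescued)
    new_positives = positives + rescued
    new_unlabeled = [s for s in unlabeled if s not in rescued_set]
    return new_positives, new_unlabeled, len(rescued), len(candidates)


# Hypothetical toy data; the lambdas stand in for the classifier and the curator.
positives = ["A transports glucose"]
unlabeled = ["B transports sucrose", "C binds DNA", "D transports something unclear"]
new_pos, new_unl, n_rescued, n_candidates = relabel_round(
    positives,
    unlabeled,
    predict=lambda s: "transports" in s,
    confirm=lambda s: "unclear" not in s,
)
metrics = precision_recall([1, 1, 0, 0], [1, 0, 1, 0])
```

Repeating the round after retraining on `new_pos` and `new_unl` gives the iteration scheme described in the text.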
Precision and recall of the ME classifier, and the number of “false” negative instances that were actually positive instances, in the four iterations.
| | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 |
---|---|---|---|---|
Precision | 94.93% | 98.17% | 98.50% | 98.54% |
Recall | 97.52% | 97.95% | 98.00% | 98.02% |
FP ratio | 546/688 | 70/250 | 24/205 | 16/201 |
Note:
The performance comparison of ME and NB classifiers. ROC curves of the maximum-entropy classifier and the Naïve Bayes classifier on the original (a) and relabelled (b) datasets. Panel titles: (a) ROC curves before retraining; (b) ROC curves after retraining.
Extracting specific transporter and substrate information from the wealth of biomedical knowledge in free text is promising, given the increasing amount of literature stored in databases such as PubMed and UniProt. To evaluate METSP, we applied it to the 23,204 reviewed human protein annotations in UniProt (23 August 2011), which supplied 182,829 sentences as classifier input; 3942 TSPs were extracted (Additional file 3, which lists the 3942 human TSPs extracted by METSP).
To identify novel transporter-substrate relationships, we compared our predicted human TSPs with the human TSP data in TCDB, TransportDB, and KEGG (Figure
Comparison of TSP data. Human TSPs extracted by METSP are compared with those in three existing transporter-substrate databases (TCDB, TransportDB, and KEGG). Blue bars represent the number of TSPs both extracted by METSP and present in the three databases; red bars represent the number of TSPs present in the three databases but not extracted by METSP; green bars represent the number of TSPs extracted by METSP but absent from the three databases.
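The three bar categories in this comparison correspond to simple set operations on (transporter, compound) pairs. The sketch below assumes each TSP is represented as a tuple of a protein accession and a compound ID; apart from the pair Q4U2R8 and C00954 reported in the text, all identifiers are hypothetical placeholders.

```python
def compare_tsps(extracted, database):
    """Split TSPs into shared, database-only, and extracted-only sets."""
    extracted, database = set(extracted), set(database)
    return {
        "shared": extracted & database,          # blue bars
        "database_only": database - extracted,   # red bars
        "extracted_only": extracted - database,  # green bars
    }


# Placeholder pairs for illustration; only ("Q4U2R8", "C00954") comes from the text.
extracted_by_tool = [
    ("Q4U2R8", "C00954"),
    ("TRANSPORTER_A", "COMPOUND_1"),
]
in_database = [
    ("TRANSPORTER_A", "COMPOUND_1"),
    ("TRANSPORTER_B", "COMPOUND_2"),
]
comparison = compare_tsps(extracted_by_tool, in_database)
```

Pairs in `extracted_only` are the candidates for novel transporter-substrate relationships.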
The comprehensive TSPs extracted by METSP not only facilitate research on transporters and substrates but also help link transporters to metabolic pathways in the KEGG PATHWAY database. For instance, a novel TSP (Q4U2R8 and C00954) identified by METSP has not been recorded in existing transporter databases such as TCDB and TransportDB. TCDB does not include the entry Q4U2R8. In addition, TransportDB and KEGG annotate the substrate of Q4U2R8 (named “NP_004781” in TransportDB and “hsa: 9356” in KEGG) only with the general term “organic anion,” which cannot be assigned a compound ID in the KEGG LIGAND database. However, the function annotation of Q4U2R8 in UniProt names the precise transported substrate, “indoleacetate,” which belongs to the “organic anion” class. Based on the substrate name extracted from the protein annotation, Q4U2R8 can be more easily associated with four KEGG pathways: “ko00380,” “map01070,” “ko01100,” and “ko04075.” Linking Q4U2R8 to these four pathways may give more accurate clues about the substrate flux balance of this transporter. In total, our curated human TSPs contain 136 compound IDs that are not included in other databases, of which 75 are annotated to 88 metabolic pathways. Therefore, METSP is a powerful means of discovering potential novel links between transporters and metabolic pathways.
METSP is implemented in Java and consists of a data download module, a preprocessing module, an ME classifier module, a compound name mapping module, and a module that assists manual validation. The three main features of METSP are extracting accurate TSPs using UniProt accession numbers as input; extracting accurate TSPs from local semistructured text that follows a format similar to UniProt text and contains transporter and substrate information; and a command-line interface that lets users process large datasets with minimal deployment effort.
METSP was packed for downloading at
METSP provides a few parameters for configuring a run. The classifier threshold can be set with the parameter “
“java -jar METSP.jar -t 0.5 -f input.txt result.pdf”.
In cellular metabolism, transporters are a class of molecules that control metabolite homeostasis and drug delivery. For transporter studies, it is crucial to identify their substrates precisely. In this study, we present METSP, a tool focused on extracting transporter and substrate knowledge. The resulting knowledge is easy to convert into standardized compound names from the KEGG LIGAND database. The multitude of possible applications of METSP makes it a complementary approach for more comprehensive metabolic reconstruction covering metabolic enzymes and transporters. It should be noted that organism-specific substrate data for transporters are still scarce. With the rapid expansion of transporter and substrate data, the ability to predict transporter and substrate information based on data from phylogenetic neighbours may be of great help. We believe that combining METSP with sequence alignment tools such as BLAST can achieve more comprehensive transporter and substrate reconstructions for many uncurated metabolic networks.
Our TSPs were collected mainly from TCDB, TransportDB, and UniProt. As many TSPs reported in the literature still lack annotation, we will focus in the future on porting METSP to an abstract-based text mining system. To achieve this, it is necessary to obtain reliable protein-literature mapping relations. Starting from these mapped relations, more accurate and comprehensive TSPs will be collected with the improved METSP system. In general, the next version of METSP will focus on extracting substrate information from free text to support the growing body of freely available full-text literature.
We present METSP, the first text mining tool to extract transporter and substrate information from the semistructured text in UniProt annotation. Using a maximum-entropy model, METSP achieves high precision and recall in cross-validation experiments for the identification of TSPs. We believe that METSP can be widely applied to help elucidate the relationships between transporters and their substrates, including clinical drugs. This tool could also have implications for the further development of semistructured text mining tools that focus on other high-quality UniProt annotations such as disease and tissue specificity. METSP is flexible and freely available at
The authors declare that they have no competing interests.
Min Zhao and Yanming Chen carried out analyses and helped write the paper. Dacheng Qu and Hong Qu conceived of the analysis and helped write the paper. Min Zhao and Yanming Chen contributed equally.
This work was supported by the National High-tech 863 Program of China (no. 2006AA02A312, no. 2008BAI64B01) and the National Natural Science Foundation of China (no. 31171270, no. 61370136).