
ISRN Computational Biology, ISSN 2314-5420, Hindawi Publishing Corporation

Article ID 581245, doi:10.1155/2014/581245

Research Article

Application of Hybrid Functional Groups to Predict ATP Binding Proteins

Andreas N. Mbah

Marashi

S.-A.

Oliva

Center for Bioinformatics & Computational Biology, Department of Biology, Jackson State University, Jackson, MS 39217, USA

Received 2 September 2013; Accepted 29 October 2013; Published 8 January 2014

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

ATP binding proteins comprise two groups: proteins with the Walker A motif and universal stress proteins (USPs), which bind ATP through an alternative motif. A reliable and comprehensive hybrid predictor for ATP binding proteins based on whole sequence information is needed. In this paper the open source LIBSVM toolbox was used to build a classifier evaluated with 10-fold cross-validation. The best hybrid model was the combination of amino acid and dipeptide composition, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) of 0.693. This classifier proves to be better than many classical ATP binding protein predictors. The general trend observed is that combinations of descriptors performed better than individual descriptors, particularly when combined with amino acid composition. The work developed a comprehensive model for predicting ATP binding proteins irrespective of their functional motifs. This model gives molecular biologists a high probability of success in predicting and selecting diverse groups of ATP binding proteins.

1. Introduction

Recent advances in next generation sequencing and the human genome projects have produced a rapid increase in the number of protein sequences, widening the protein sequence-structure gap [1, 2] and revealing diverse protein functions within common families. Computational tools for predicting protein structure and function are needed to narrow this gap [3]. The ATP binding proteins (ATP-BPs) are a diverse family of proteins in terms of amino acid sequence, function, and three-dimensional structure. These proteins hydrolyze ATP to provide the energy necessary to drive biochemical reactions in the cell [4]. There are two distinct functional groups of ATP binding proteins.

The first functional group has the Walker A motif [GXXXXGK (T/S) or G-4X-GK (T/S)] in its sequences for ATP binding [5]. Many members are transmembrane proteins responsible for transporting a wide variety of substrates across extra- and intracellular membranes [6]. The biochemical functions of ATP binding proteins are well illustrated by the ABC transporters. In bacterial cells, ABC transporters pump substances such as sugars, vitamins, and metal ions into the cell, while in eukaryotes they transport molecules out of the cell [7]. They are also known to transport lipids and to protect the developing fetus against xenobiotics [7]. ABC transporters are crucial in the development of multidrug resistance, and their ATP binding sites are exploitable as targets for chemotherapeutic agents [8]. The mechanism of action in multidrug transport is unclear. However, one model, called the hydrophobic vacuum cleaner, states that in P-glycoprotein the drugs are bound indiscriminately from the lipid phase based on their hydrophobicity [9].

The second, evolutionarily diverse functional class of ATP binding proteins is the universal stress proteins (USPs). USPs are found in diverse groups of organisms such as archaea, eubacteria, yeast, fungi, and plants; their expression is triggered by a variety of environmental stressors [10]. These stressors include, but are not limited to, starvation of nutrients such as carbon, nitrogen, phosphate, sulfate, and required amino acids, as well as a variety of toxicants and other agents such as heavy metals, oxidants, acids, heat shock, DNA damage, phosphate, uncouplers of the electron transport chain, and ethanol [11, 12]. The USPs bind ATP through the ATP binding motif [G-2X-G-9X-G(S/T)] [13]. Members of the USPs segregate into two groups based on whether or not they bind ATP [13].

Experimental efforts are underway to determine the function of newly discovered proteins [14], but these experimental methods are costly and time consuming and at times unsuccessful, owing to the complexity of the protein crystallization process. Several methods have been studied that predict ATP binding residues from known structural features, but with low accuracies [15, 16]. Some predictors of ATP binding proteins have been developed with promising results, such as those in [17, 18] and the article by Green et al. [19] on recognizing ATP binding proteins by testing parallel cascade identification and KNN. Unfortunately, these methods were adapted to ATP binding proteins containing only the classical Walker A motif [G-4X-GK (T/S)] in their sequences. The objective of the research reported here was to introduce a classifier built from a pool of protein sequences containing both ATP binding motifs, G-4X-GK (T/S) and G-2X-G-9X-G(S/T). To achieve this objective, a support vector machine (SVM) approach is proposed, which predicts protein function from discriminative features that map protein sequences to biological functions [20–23], using the pooled sequences with hybrid ATP binding motifs.

There is a need to develop an automated predictor for ATP binding USP encoded proteins to speed experimental design and to study how these proteins function under diverse environmental stressors. This research developed a hybrid ATP binding protein predictor using the open source LIBSVM classification toolbox. The best model was the combination of amino acid and dipeptide composition of the sequences, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) of 0.693. This model shows strong overall performance in sensitivity (82.46%), specificity (87.00%), and precision (87.85%), with an area under the ROC curve (AUC) of 0.849219. The general trend shows that combinations of descriptors perform better than individual descriptors, particularly when combined with amino acid composition. This model gives molecular biologists a high probability of success in predicting and selecting diverse motif groups of ATP binding proteins.

2. Materials and Methods

2.1. Datasets

Balanced datasets of ATP and non-ATP binding proteins were constructed from the UniProt protein database (UniProt release 2011_11) (http://www.uniprot.org/), the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do), the IMG/M database (http://img.jgi.doe.gov/cgi-bin/m/main.cgi), and published literature [24–26] describing diverse universal stress proteins.

2.1.1. Extraction of Walker A Motif Dataset

A total of 2000 protein sequences belonging to the Walker A motif positive dataset were retrieved. Redundancy due to homologous sequences was removed using the CD-HIT [27] and PISCES [28] servers at a threshold of 25%. This threshold retains an adequate number of protein sequences for analysis while avoiding the bias that might result from high homology. The resulting dataset was manually reviewed through literature search and information from the Protein Data Bank [2] to ensure that the sequences represent ATP binding proteins. A total of 100 sequences were randomly selected from this dataset and retained for training and testing to represent the Walker A motif positive (ATP binding) dataset. The Walker A motif negative dataset (non-ATP binding) was taken from Yu et al. 2006 [29], where it served as the “negative” dataset for nucleic acid binding proteins. Because ATP binding proteins are members of the nucleotide binding protein family, the negative dataset used in [29] for predicting the nucleotide binding protein family was considered suitable. Redundancy was also maintained at the 25% threshold, and each protein was verified to be non-ATP binding using both the literature and Protein Data Bank information. A total of 100 sequences were randomly selected from [29] and retained for training and testing to represent the Walker A motif negative (non-ATP binding) dataset.

2.1.2. Extraction of USP Protein Dataset

The extracted USP sequences were tested for the presence or absence of the G-2X-G-9X-G(S/T) motif in their sequences using the NCBI conserved domain search tool [30]. The USP sequences were divided into two groups based on the presence or absence of ATP binding motif [13]. The redundancy was also maintained at 25% threshold and 100 sequences were selected for each class of proteins (200 sequences in total).

The data prepared for analysis were as follows: (i) 100 ATP binding proteins with the Walker A motif; (ii) 100 non-ATP binding proteins without the Walker A motif; (iii) 100 USP sequences with the ATP binding motif [G-2X-G-9X-G(S/T)]; and (iv) 100 USP sequences without the ATP binding motif [G-2X-G-9X-G(S/T)]. The 400 sequences were separated into two hybrid groups, 200 ATP binding sequences and 200 sequences without ATP binding motifs, and were used to generate the feature vector. The feature vector was generated from the entire sequences of the proteins (not only the ATP-binding domains) via the PROFEAT server using a set of 1497 descriptors [31]. Biologically informative physicochemical and sequence attributes were prioritized for investigation. The attributes were incorporated into the LIBSVM classifier to find the best hybrid model for predicting ATP binding proteins.
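As an illustration of the two descriptor families that later prove most useful, the following is a minimal sketch of amino acid composition (20 values) and dipeptide composition (400 values) computed from a whole sequence. It is not the PROFEAT implementation: the function names are hypothetical, values are expressed as fractions (PROFEAT may scale them differently), and non-standard residue symbols are simply ignored.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (20 values)."""
    seq = seq.upper()
    n = sum(seq.count(a) for a in AMINO_ACIDS)  # non-standard symbols are ignored
    return [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered pair of adjacent residues in seq (400 values)."""
    seq = seq.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - 1):
        dipeptide = seq[i:i + 2]
        if dipeptide in counts:
            counts[dipeptide] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in pairs]


def hybrid_features(seq):
    """Concatenate both descriptor sets into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)
```

Concatenating the two vectors is one simple way to form a combined descriptor set; how PROFEAT output was combined in the original work is not stated here, so this layout is an assumption.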

2.2. LIBSVM Classifier

Support vector machines (SVMs) treat the objects to be classified as points in a high-dimensional space and seek a hyperplane that separates them [32]. The biological molecules are represented by a descriptor set. With a proper mapping furnished by a kernel function, SVM classifiers separate the transformed data with a hyperplane in a high-dimensional space to predict the correct protein functional class. SVMs have been widely used in supervised classification problems in bioinformatics, such as [33–36]. The LIBSVM package, which is freely downloadable at http://www.csie.ntu.edu.tw/~cjlin/libsvm, was used to evaluate the attributes and build the final classifier, with the radial basis function (RBF) as the kernel function [37–39].

A “grid-search” was employed to select proper values of the RBF kernel parameter (γ) and the penalty parameter (C) of the soft margin SVM. C was set to $2^{-5}, 2^{-3}, \ldots, 2^{15}$ and γ to $2^{-15}, 2^{-13}, \ldots, 2^{3}$. All combinations of C and γ were tested, and the pair with the best cross-validation accuracy for each feature set or combination of feature sets was selected. A smaller γ value makes the decision boundary smoother. The SVM training parameter C is the regularization factor, which controls the tradeoff between low training error and large margin [37, 40]. After trial-and-error assessment, C = 4 was taken as the best value and held fixed throughout this work, while the optimal value of γ was obtained separately for each descriptor set. The entire set of attributes was evaluated in terms of association with ATP binding proteins, and a final subset with good predictive power was selected. In this research a 10-fold cross-validation (10CV) was implemented. The objective of training is to maximize the ability of the SVM predictor to discriminate between classes while avoiding overfitting.
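The grid described above can be sketched as follows. This is an illustrative outline rather than the LIBSVM grid tool: the exponent step of 2 is read off the listed values, `grid_search` is a hypothetical helper, and `cv_accuracy` stands for any user-supplied function that trains an RBF SVM with the given (C, γ) and returns its cross-validation accuracy.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Candidate values: exponents increase in steps of 2, as in the text.
C_GRID = [2.0 ** k for k in range(-5, 16, 2)]      # 2^-5, 2^-3, ..., 2^15
GAMMA_GRID = [2.0 ** k for k in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3


def grid_search(cv_accuracy, c_grid=C_GRID, gamma_grid=GAMMA_GRID):
    """Try every (C, gamma) pair and return (best_C, best_gamma, best_score).

    cv_accuracy is an assumed callable (C, gamma) -> score supplied by the user.
    """
    best = None
    for c, gamma in product(c_grid, gamma_grid):
        score = cv_accuracy(c, gamma)
        if best is None or score > best[2]:
            best = (c, gamma, score)
    return best
```

The kernel function shows why a smaller γ gives a smoother boundary: as γ decreases, the kernel value decays more slowly with distance, so each training point influences a wider neighborhood.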

2.3. Tenfold Cross-Validation Analysis

Choosing a technique to evaluate a newly developed method is a major challenge for investigators. Jackknife leave-one-out cross-validation (LOOCV) [41–43] is a popular technique for evaluating models. In this procedure one sequence is used for testing and the remaining sequences are used for training. The process is repeated until each sequence has been used once for testing. Although this method is popular, it is computationally intensive and time consuming.

In this work, 10-fold cross-validation was used to train and test the dataset, with sequences randomly partitioned into ten sets. The dataset was split at the protein level in addition to the stratified partition, thus ensuring a more rigorous evaluation. During the procedure, the positive and negative data samples are distributed randomly into 10 sets, the so-called folds. In each of the 10 rounds, 9 of the 10 sets are used to construct a classifier (training), and the classifier is then evaluated using the remaining set (testing). This procedure was repeated ten times so that each set was used once for testing [44, 45]. The overall performance was the average of the performances over all 10 sets.
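The partitioning described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by LIBSVM: `stratified_folds` and `cross_validate` are hypothetical helpers, and `train_and_score` stands for any user-supplied function that trains on one index list and returns a score on the other.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping class proportions balanced."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class out evenly across folds
    return folds


def cross_validate(labels, train_and_score, k=10, seed=0):
    """Average the score over k rounds; each fold is the test set exactly once."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k
```

With 200 positive and 200 negative sequences, each fold then holds 20 of each class, so every round trains on 360 sequences and tests on 40.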

2.4. The LIBSVM Performance Evaluation

The standard parameters used in evaluating the performance of LIBSVM are indicated below. The overall accuracy (Acc) is the intuitive measure of performance on a balanced dataset, whereas the Matthews correlation coefficient (MCC) [46] is more realistic than Acc when using an unbalanced dataset [47, 48]. When both MCC and Acc values are high, the overall performance of the predicted model is better. In addition to Acc and MCC, the following parameters were also calculated. Sensitivity is the percentage of correctly predicted binding proteins out of the total binding proteins.

True positive (TP).

True negative (TN).

False positive (FP) (false alarm).

False negative (FN).

False positive rate (FPR) FPR = FP/N = FP/(FP + TN).

Sensitivity/recall or True positive rate (TPR) TPR = TP/P = TP/(TP + FN).

Precision = TP/(TP + FP).

Accuracy (Acc) = (TP + TN)/(P + N) = (TP + TN)/(TP + TN + FP + FN).

Specificity (SPC) SPC = TN/N = TN/(FP + TN) = 1 – FPR.

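Matthews correlation coefficient (MCC):

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} = \frac{TP \times TN - FP \times FN}{\sqrt{P\,N\,P'\,N'}}, \tag{1}$$

where P = TP + FN, N = TN + FP, P′ = TP + FP, and N′ = TN + FN.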

Here TP is the number of true positives (ATP-BPs), TN is the number of true negatives (non ATP-BPs), FP is the number of false positives, and FN is the number of false negatives.

2.5. Area under the ROC Curve (AUC) for LIBSVM

The ROC curve is a plot of the true positive proportion, TP/(TP + FN), against the false positive proportion, FP/(FP + TN). The StatsDirect package was used to plot the ROC curve and to calculate the area under it directly by an extended trapezoidal rule [49]. The confidence interval was constructed using DeLong’s variance estimate [50], which is embedded in the statistics package.
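To make the calculation concrete, the sketch below builds ROC points from classifier scores and integrates them with a plain trapezoidal rule. It is an assumption-laden illustration, not the StatsDirect routine: the function names are hypothetical, labels are assumed to be 1 (ATP binding) or 0, and the DeLong confidence interval is not reproduced.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

The resulting area equals the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half.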

3. Results and Discussion

The ATP binding proteins are known to play key roles in the biochemical functioning of the cell. In signaling pathways ATP molecules are substrates for protein kinase phosphorylation. It is difficult to identify ATP binding proteins because of the lack of experimentally determined protein structures [51–53]. The growth of protein sequences from various genomic projects exceeds the capacity of experimental techniques for determining protein structures and their binding reactions, which are time consuming and at times unsuccessful. Therefore there is an urgent need to develop automated expert methods for determining the functional class of proteins, such as ATP binding proteins, from their primary sequence information.

The general assumption here is that every protein that binds ATP, whether a USP or a protein with the Walker A motif, has some common features embedded in its sequence. In both the USP (G-2X-G-9X-G(S/T)) and Walker A (G-4X-GK (T/S)) motifs, G, K, T, and S denote glycine, lysine, threonine, and serine, respectively, and X denotes any amino acid residue. The lysine (K) residue in the Walker A motif is crucial for nucleotide binding [54] in this class of proteins. It interacts with the phosphate groups of the nucleotide and with the magnesium ion, which coordinates the β- and γ-phosphates of the ATP molecule [55, 56].
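The two motif notations can be written as regular expressions, with X mapped to "any single character". This is only a sketch of the notation, with hypothetical names; in this work motif presence in USPs was assessed with the NCBI conserved domain search tool, not with pattern matching.

```python
import re

# Walker A: G-4X-GK(T/S); USP ATP binding motif: G-2X-G-9X-G(S/T)
WALKER_A = re.compile(r"G.{4}GK[TS]")
USP_ATP = re.compile(r"G.{2}G.{9}G[ST]")


def has_walker_a(seq):
    """True if the sequence contains at least one Walker A pattern."""
    return WALKER_A.search(seq.upper()) is not None


def has_usp_atp_motif(seq):
    """True if the sequence contains at least one G-2X-G-9X-G(S/T) pattern."""
    return USP_ATP.search(seq.upper()) is not None
```

A pattern match alone does not establish ATP binding, so such a check is best viewed as a quick screen ahead of the sequence-based classifier.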

The universal stress proteins bind ATP through the ATP binding motif G-2X-G-9X-G(S/T), with the terminal G(S/T) residues essential for ATP binding and phosphorylation [13]. Members of this class of proteins therefore segregate into two groups, based on whether or not they bind ATP [13, 57]. Thus, it is important to identify ATP binding USPs and other ATP binding proteins. Several methods have been studied that predict ATP interacting residues when the protein structures are known, with some results showing very low accuracies [15, 16, 58, 59]. This work predicted ATP binding proteins in general with high accuracy, irrespective of their structural information, using an SVM classifier. The training and prediction statistics for each of the descriptor sets used are visualized and discussed below. The visualizations were constructed using Tableau Public Software (http://www.tableausoftware.com/public).

The objective in this report was to find the best descriptor set that can be used to build a predictive model for a reliable and effective server for predicting ATP-BPs in general, irrespective of their subfunctional classes. Throughout this work, the parameter C was maintained at C=4, while the optimal value of γ for each descriptor was obtained and used in evaluating its performance. Performance was evaluated on five computed parameters, namely accuracy, sensitivity, specificity, precision, and MCC, after a 10-fold cross-validation (10CV).

The performance of pseudo amino acid composition was evaluated with accuracy only, owing to a lack of sufficient sequence information. The lengths of the color coded descriptors were used as a measure of their performances. In terms of accuracy the best descriptor was the combination of amino acid and dipeptide composition (84.57%), followed by amino acid composition alone (83.64%), dipeptide composition (83.17%), and Norm M-B autocorrelation, in that order (Figure 1). The pseudo amino acid and Quasi sequence order descriptors performed poorly compared to the other descriptors. The overall performances of the other descriptors were better, as most of them registered accuracy values greater than 70.00%. These high performances might be due to the rigorous refinement of the protein sequences. Thus protein function classification with SVM classifiers may be improved substantially by using rigorously refined protein sequences.

Figure 1

The performances of descriptors with LIBSVM in terms of accuracy. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of accuracy (Accsvm). In terms of accuracy the best descriptor was combination of amino acid and dipeptide composition (84.57%), followed by amino acid composition (83.64%), dipeptide composition (83.17%), and Norm M-B autocorrelation in that order. The pseudo amino acids and Quasi sequence order descriptors perform poorly.

The individual accuracies of amino acid composition (83.64%) and dipeptide composition (83.17%) increased to 84.57% when the two descriptors were combined. This indicates that combining descriptors can enhance the performance of individual descriptors, particularly when they are combined with amino acid composition. This is a binary classification problem involving a balanced dataset, for which accuracy (Acc) is the best parameter for evaluating performance, whereas the Matthews correlation coefficient (MCC) is more realistic than Acc when using an unbalanced dataset [47, 48]. When both MCC and Acc values are high, the overall performance of the predicted model is better.

The performances of the models were evaluated based on MCC (Figure 2). The pyramidal view and the length of the color coded descriptors were used for performance visualization. The best performer was amino acid and dipeptide composition in combination (0.6931) followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449) in that order. This order is in line with their performances measured using accuracy as the parameter. This result justifies the performance of the overall model. In general the combination of descriptor sets performs better than individual descriptors, particularly when combined with amino acid composition.

Figure 2

The performances of descriptors with LIBSVM in terms of the Matthews correlation coefficient (MCC). The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of MCC. The best performer was amino acid and dipeptide composition in combination (0.6931) followed by amino acid composition (0.6765), dipeptide composition (0.6637), and Norm M-B autocorrelation (0.6449) in that order.

Therefore, from the statistical point of view, the use of combination sets, particularly with amino acid composition, tends to give better prediction performance than individual sets [53]. The amino acid composition generally increases the overall accuracies of other descriptors in combination. One of the shortcomings of amino acid composition as a descriptor is that the same amino acid composition may correspond to diverse sequences because sequence order is lost [28, 60]. This sequence order information can be partially recovered by combining it with dipeptide composition, but dipeptide composition itself lacks information on the fraction of each individual residue in the sequence. A combination set is therefore expected to give a better prediction result [27, 61], as shown above, because each descriptor masks the weakness of the other.
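A small example makes the loss of sequence order concrete. The two short sequences below are made up for illustration; they contain the same residues in a different order, so their residue counts agree while their dipeptide counts do not.

```python
from collections import Counter


def residue_counts(seq):
    """Count each residue; order information is discarded."""
    return Counter(seq)


def dipeptide_counts(seq):
    """Count each pair of adjacent residues; local order is retained."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


seq_a, seq_b = "GKTS", "STKG"  # hypothetical sequences with identical residues
same_composition = residue_counts(seq_a) == residue_counts(seq_b)
same_dipeptides = dipeptide_counts(seq_a) == dipeptide_counts(seq_b)
```

Here `same_composition` is true and `same_dipeptides` is false, which is the sense in which dipeptide composition partially restores the order information that amino acid composition lacks.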

The models were further investigated based on their sensitivity to predict ATP-BPs and the results displayed in pyramidal view (Figure 3). The most sensitive descriptor was amino acid composition (0.875) followed by dipeptide composition (0.8381), amino acid/dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224) in that order.

Figure 3

The performances of descriptors with LIBSVM in terms of sensitivity. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of sensitivity. The most sensitive descriptor was amino acid composition (0.875) followed by dipeptide composition (0.8381), amino acid and dipeptide composition in combination (0.8246), and Norm M-B autocorrelation (0.8224) in that order.

These descriptors were the four best performers in terms of Acc and MCC. Evaluation based on specificity indicates that amino acid composition (0.87) was the most specific, followed by the entire feature set (0.8478), Quasi sequence order descriptors (0.8333), and dipeptide composition (0.8257), in that order (Figure 4). This highlights the vital role played by amino acid composition in protein function prediction in general. Interestingly, the Quasi sequence order descriptors (0.9626) had the highest precision, followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411), in that order (Figure 5).

Figure 4

The performances of descriptors with LIBSVM in terms of specificity. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of specificity. The most specific descriptors were amino acid composition and amino acid/dipeptide composition (0.87), followed by the entire feature set (0.8478), Quasi sequence order descriptors (0.8333), and dipeptide composition (0.8257) in that order.

Figure 5

The performances of descriptors with LIBSVM in terms of precision. The length of each color coded descriptor and the pyramidal view are a measure of their performances in terms of precision. The most precise descriptor was Quasi sequence order descriptors (0.9626) followed by amino acid and dipeptide composition in combination (0.8785), the entire feature set (0.8692), and Transition (0.8411) in that order.

The overall model evaluation shows that the combination of amino acid and dipeptide composition was the best model for predicting ATP-BPs from diverse functional classes using whole sequence information. Using the “all features” descriptor set did not generally result in a better classification model. The accuracy of the “all features” set was 79.9%, against 84.57% for amino acid/dipeptide composition in combination. This finding is in accordance with [62, 63], which studied molecular descriptors for predicting compounds of specific properties using an “all features” set. The reduction in accuracy might be due to noise generated by the use of many overlapping and redundant descriptors. The accuracy of classifier algorithms can be severely degraded by the presence of noisy or irrelevant features, or by feature scales that are not consistent with their importance in solving the classification problem in question. The ROC plot of the SVM model (Figure 6) has an AUC of 0.849219, which indicates good performance for a model based on whole sequence analysis.

Figure 6

The ROC plot: the plot shows the performance of the LIBSVM model, generated with the StatsDirect package using an extended trapezoidal rule and a nonparametric method analogous to the Wilcoxon/Mann-Whitney test to calculate the area under the ROC curve. The calculated AUC was 0.849219.

4. Conclusions

The prediction of ATP-binding proteins has been explored using a battery of descriptor sets and a hybrid functional group. Also, for the first time, the prediction of ATP binding in universal stress proteins has been investigated using the support vector machine. The best hybrid model was the combination of amino acid and dipeptide composition of the sequences, with an accuracy of 84.57% and a Matthews correlation coefficient (MCC) of 0.693. The general trend is that combinations of descriptors perform better than individual descriptors, particularly when combined with amino acid composition. This model gives molecular biologists a high probability of success in predicting and selecting diverse groups of ATP binding proteins.

Conflict of Interests

The author reports no conflict of interests in this work including the mentioned trademarks.

Acknowledgments

The research reported was supported by the National Institutes of Health (NIH-NIGMS-1T36GM095335) and the National Science Foundation (EPS-0903787; EPS-1006883). The content is solely the responsibility of the author and does not necessarily represent the official views of the funding agencies.

References

[1] Bairoch, Apweiler, “The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45–48, 2000.

[2] Berman H. M., Westbrook, Feng, Gilliland, Bhat T. N., Weissig, Shindyalov I. N., Bourne P. E., “The protein data bank,” Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.

[3] Guo, Chen, Sun, Lin, “A novel method for protein secondary structure prediction using dual-layer SVM and profiles,” Proteins, vol. 54, no. 4, pp. 738–743, 2004. doi:10.1002/prot.10634

[4] Bustamante, Chemla Y. R., Forde N. R., Izhaky, “Mechanical processes in biochemistry,” Annual Review of Biochemistry, vol. 73, pp. 705–748, 2004. doi:10.1146/annurev.biochem.72.121801.161542

[5] Walker J. E., Saraste, Runswick M. J., Gay N. J., “Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold,” The EMBO Journal, vol. 1, no. 8, pp. 945–951, 1982.

[6] Hirokawa, Takemura, “Biochemical and molecular characterization of diseases linked to motor proteins,” Trends in Biochemical Sciences, vol. 28, no. 10, pp. 558–565, 2003. doi:10.1016/j.tibs.2003.08.006

[7] Gedeon, Behravan, Koren, Piquette-Miller, “Transport of glyburide by placental ABC transporters: implications in fetal drug exposure,” Placenta, vol. 27, no. 11-12, pp. 1096–1102, 2006. doi:10.1016/j.placenta.2005.11.012

[8] Maxwell, Lawson D. M., “The ATP-binding site of type II topoisomerases as a target for antibacterial drugs,” Current Topics in Medicinal Chemistry, vol. 3, no. 3, pp. 283–303, 2003.

[9] Ashida, Oonishi, Uyesaka, “Kinetic analysis of the mechanism of action of the multidrug transporter,” Journal of Theoretical Biology, vol. 195, no. 2, pp. 219–232, 1998. doi:10.1006/jtbi.1998.0787

[10] Kvint, Nachin, Diez, Nystrom, “The bacterial universal stress protein: function and regulation,” Current Opinion in Microbiology, vol. 6, no. 2, pp. 140–145, 2003. doi:10.1016/S1369-5274(03)00025-0

[11] Nystrom, Neidhardt F. C., “Cloning, mapping and nucleotide sequencing of a gene encoding a universal stress protein in Escherichia coli,” Molecular Microbiology, vol. 6, no. 21, pp. 3187–3198, 1992. doi:10.1111/j.1365-2958.1992.tb01774.x

[12] Diez, Gustavsson, Nystrom, “The universal stress protein A of Escherichia coli is required for resistance to DNA damaging agents and is regulated by a RecA/FtsK-dependent regulatory pathway,” Molecular Microbiology, vol. 36, no. 6, pp. 1494–1503, 2000. doi:10.1046/j.1365-2958.2000.01979.x

[13] Sousa M. C., Mckay D. B., “Structure of the universal stress protein of Haemophilus influenzae,” Structure, vol. 9, no. 12, pp. 1135–1141, 2001. doi:10.1016/S0969-2126(01)00680-3

[14] Promponas V. J., Ouzounis C. A., Iliopoulos, “Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey,” Briefings in Bioinformatics, 2012. doi:10.1093/bib/bbs072

[15] Chauhan J. S., Mishra N. K., Raghava G. P., “Identification of ATP binding residues of a protein from its primary sequence,” BMC Bioinformatics, vol. 10, article 434, 2009. doi:10.1186/1471-2105-10-434

[16] Guo, Shi, Sun, “A novel statistical ligand-binding site predictor: application to ATP-binding sites,” Protein Engineering, Design and Selection, vol. 18, no. 2, pp. 65–70, 2005. doi:10.1093/protein/gzi006

[17] Chen, Mizianty M. J., Kurgan, “ATPsite: sequence-based prediction of ATP-binding residues,” Proteome Science, vol. 9, supplement 1, article S4, 2011. doi:10.1186/1477-5956-9-S1-S4

[18] Zhang Y. N., D. J., S. S., Fan Y. X., Huang, Shen H. B., “Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features,” BMC Bioinformatics, vol. 13, article 118, 2012. doi:10.1186/1471-2105-13-118

[19] Green J. R., Korenberg M. J., David, Hunter I. W., “Recognition of adenosine triphosphate binding sites using parallel cascade system identification,” Annals of Biomedical Engineering, vol. 31, no. 4, pp. 462–470, 2003. doi:10.1114/1.1561293

[20] Garg, Bhasin, Raghava G. P. S., “Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search,” The Journal of Biological Chemistry, vol. 280, no. 15, pp. 14427–14432, 2005. doi:10.1074/jbc.M411789200

[21] Ahmad, Gromiha M. M., Sarai, “Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information,” Bioinformatics, vol. 20, no. 4, pp. 477–486, 2004. doi:10.1093/bioinformatics/btg432

[22] Xiao, Wang, Chou K.-C., “GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes,” Journal of Computational Chemistry, vol. 30, no. 9, pp. 1414–1423, 2009. doi:10.1002/jcc.21163

[23] Kumar, Gromiha M. M., Raghava G. P. S., “Prediction of RNA binding sites in a protein using SVM and PSSM profile,” Proteins, vol. 71, no. 1, pp. 189–194, 2008. doi:10.1002/prot.21677

[24] Williams B. S., Isokpehi R. D., Mbah A. N., Hollman A. L., Bernard C. O., Simmons S. S., Ayensu W. K., Garner B. L., “Functional annotation analytics of Bacillus genomes reveals stress responsive acetate utilization and sulfate uptake in the biotechnologically relevant Bacillus megaterium,” Bioinformatics and Biology Insights, vol. 6, pp. 275–286, 2012. doi:10.4137/BBI.S7977

[25] Isokpehi R. D., Mahmud, Mbah A. N., Simmons S. S., Avelar, Rajnarayanan R. V., Udensi U. K., Ayensu W. K., Cohly H. H., Brown S. D., Dates C. R., Hentz S. D., Hughes S. J., Smith-McInnis D. R., Patterson C. O., Sims J. N., Turner K. T., Williams B. S., Johnson M. O., Adubi, Mbuh J. V., Anumudu C. I., Adeoye G. O., Thomas B. N., Nashiru, Oliveira, “Developmental regulation of genes encoding universal stress proteins in Schistosoma mansoni,” Gene Regulation and Systems Biology, vol. 5, pp. 61–74, 2011. doi:10.4137/GRSB.S7491

[26] Mbah A. N., Mahmud, Awofolu O. R., Isokpehi R. D., “Inferences on the biochemical and environmental regulation of universal stress proteins from Schistosomiasis parasites,” Advances and Applications in Bioinformatics and Chemistry, vol. 6, pp. 15–27, 2013. doi:10.2147/AABC.S37191

[27] Jaroszewski, Godzik, “Clustering of highly homologous sequences to reduce the size of large protein databases,” Bioinformatics, vol. 17, no. 3, pp. 282–283, 2001.

[28] Wang, Dunbrack R. L. Jr., “PISCES: a protein sequence culling server,” Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003. doi:10.1093/bioinformatics/btg224

[29] Cao, Cai, Shi, “Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines,” Journal of Theoretical Biology, vol. 240, no. 2, pp. 175–184, 2006. doi:10.1016/j.jtbi.2005.09.018

[30] Marchler-Bauer, Zheng, Chitsaz, Derbyshire M. K., Geer L. Y., Geer R. C., Gonzales N. R., Gwadz, Hurwitz D. I., Lanczycki C. J., Marchler G. H., Song J. S., Thanki, Yamashita R. A., Zhang, Bryant S. H., “CDD: conserved domains and protein three-dimensional structure,” Nucleic Acids Research, vol. 41, pp. D348–D352, 2013. doi:10.1093/nar/gks1243

[31] Z. R., Lin H. H., Han L. Y., Jiang, Chen Y. Z., “PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence,” Nucleic Acids Research, vol. 34, pp. W32–W37, 2006. doi:10.1093/nar/gkl305

Bikadi

Hazai

Malik

Jemnitz

Veres

Hari

Loo

T. W.

Clarke

D. M.

Hazai

Mao

Predicting P-glycoprotein-mediated drug transport based on support vector machine and three-dimensional crystal structure of P-glycoprotein

PLoS ONE 2011 6 10, article e25815

2-s2.0-80053552409

10.1371/journal.pone.0025815

S. L.

Cai

C. Z.

Chen

Y. Z.

Chung

M. C. M.

Effect of training datasets on support vector machine prediction of protein-protein interactions

Proteomics 2005 5 4 876 884

2-s2.0-16344384583

10.1002/pmic.200401118

Brown

M. P.

Grundy

W. N.

Lin

Cristianini

Sugnet

C. W.

Furey

T. S.

Ares

Jr.

Haussler

Knowledge-based analysis of microarray gene expression data by using support vector machines

Proceedings of the National Academy of Sciences of the United States of America 2000 97 1 262 267

2-s2.0-0034602774

10.1073/pnas.97.1.262

Furey

T. S.

Cristianini

Duffy

Bednarski

D. W.

Schummer

Haussler

Support vector machine classification and validation of cancer tissue samples using microarray expression data

Bioinformatics 2000 16 10 906 914

2-s2.0-0033636139

Chou

K.-C.

Cai

Y.-D.

Predicting protein-protein interactions from sequences in a hybridization space

Journal of Proteome Research 2006 5 2 316 322

2-s2.0-32344433486

10.1021/pr050331g

Matheny

M. E.

Resnic

F. S.

Arora

Ohno-Machado

Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality

Journal of Biomedical Informatics 2007 40 6 688 697

2-s2.0-36048958298

10.1016/j.jbi.2007.05.008

Javed

Chan

G. S.

Savkin

A. V.

Middleton

P. M.

Malouf

Steel

Mackie

Lovell

N. H.

RBF kernel based support vector regression to estimate the blood volume and heart rate responses during hemodialysis

Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09)

2009

4352 4355

2-s2.0-77951017965

10.1109/IEMBS.2009.5332739

Chang

C.-C.

Lin

C.-J.

Training nu-support vector classifiers: theory and algorithms

Neural Computation 2001 13 9 2119 2147

2-s2.0-0000667930

10.1162/089976601750399335

Cherkassky

Practical selection of SVM parameters and noise estimation for SVM regression

Neural Networks 2004 17 1 113 126

2-s2.0-0346250790

10.1016/S0893-6080(03)00169-2

Chou

K. C.

Zhang

C. T.

Prediction of protein structural classes

Critical Reviews in Biochemistry and Molecular Biology 1995 30 275 349

10.3109/10409239509083488

Chen

Zou

Cai

Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine

Protein and Peptide Letters 2009 16 1 27 31

2-s2.0-61549096128

10.2174/092986609787049420

Ding

Luo

Lin

Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition

Protein and Peptide Letters 2009 16 4 351 355

2-s2.0-65349165718

10.2174/092986609787848045

Bondia

Tarin

Garcia-Gabin

Using support vector machines to detect therapeutically incorrect measurements by the MiniMed CGMS

Journal of Diabetes Science and Technology 2008 2 622 629

Chen

Zhou

Yin

F.-F.

Marks

L. B.

Das

S. K.

Investigation of the support vector machine algorithm to predict lung radiation-induced pneumonitis

Medical Physics 2007 34 10 3808 3814

2-s2.0-34748905317

10.1118/1.2776669

Matthews

B. W.

Comparison of the predicted and observed secondary structure of T4 phage lysozyme

Biochimica et Biophysica Acta 1975 405 2 442 451

2-s2.0-0016772212

Bao

Cui

Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information

Bioinformatics 2005 21 10 2185 2190

2-s2.0-19544392545

10.1093/bioinformatics/bti365

Dobson

R. J.

Munroe

P. B.

Caulfield

M. J.

Saqi

M. A. S.

Predicting deleterious nsSNPs: an analysis of sequence and structural attributes

BMC Bioinformatics 2006 7, article 217

2-s2.0-33745862379

10.1186/1471-2105-7-217

Hanley

J. A.

McNeil

B. J.

The meaning and use of the area under a receiver operating characteristic (ROC) curve

Radiology 1982 143 1 29 36

2-s2.0-0020083498

DeLong

E. R.

DeLong

D. M.

Clarke-Pearson

D. L.

Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach

Biometrics 1988 44 3 837 845

2-s2.0-0023710206

Chothia

Lesk

A. M.

The relation between the divergence of sequence and structure in proteins

The EMBO Journal 1986 5 4 823 826

2-s2.0-0022706389

Lesk

A. M.

Chothia

How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins

Journal of Molecular Biology 1980 136 3 225 270

2-s2.0-0019332167

Hilbert

Böhm

Jaenicke

Structural relationships of homologous proteins as a fundamental principle in homology modeling

Proteins 1993 17 2 138 151

2-s2.0-0027363912

10.1002/prot.340170204

Hanson

P. I.

Whiteheart

S. W.

AAA+ proteins: have engine, will work

Nature Reviews Molecular Cell Biology 2005 6 7 519 529

2-s2.0-21744446127

10.1038/nrm1684

Ferguson

K. M.

Higashijima

Smigel

M. D.

Gilman

A. G.

The influence of bound GDP on the kinetics of guanine nucleotide binding to G proteins

The Journal of Biological Chemistry 1986 261 16 7393 7399

2-s2.0-0022876166

Jurnak

McPherson

Wang

A. H. J.

Rich

Biochemical and structural studies of the tetragonal crystalline modification of the Escherichia coli elongation factor Tu

The Journal of Biological Chemistry 1980 255 14 6751 6757

2-s2.0-0018950801

Zarembinski

T. I.

Hung

L. I.-W.

Mueller-Dieckmann

H.-J.

Kim

K.-K.

Yokota

Kim

S.-H.

Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics

Proceedings of the National Academy of Sciences of the United States of America 1998 95 26 15189 15193

2-s2.0-0032431024

10.1073/pnas.95.26.15189

Saito

Shirai

An empirical approach for detecting nucleotide-binding sites on proteins

Protein Engineering, Design and Selection 2006 19 2 67 75

2-s2.0-31544440473

10.1093/protein/gzj002

Sobolev

Sorokine

Prilusky

Abola

E. E.

Edelman

Automated analysis of interatomic contacts in proteins

Bioinformatics 1999 15 4 327 332

2-s2.0-13044272912

10.1093/bioinformatics/15.4.327

Schapire

R. E.

Singer

BoosTexter: a boosting-based system for text categorization

Machine Learning 2000 39 2-3 135 168

2-s2.0-0033905095

Ong

S. A.

Lin

H. H.

Chen

Y. Z.

Z. R.

Cao

Efficacy of different protein descriptors in predicting protein functional families

BMC Bioinformatics 2007 8, article 300

2-s2.0-34948861260

10.1186/1471-2105-8-300

Xue

Bajorath

Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening

Combinatorial Chemistry and High Throughput Screening 2000 3 5 363 372

2-s2.0-0033779243

Xue

Godden

J. W.

Bajorath

Evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity

Journal of Chemical Information and Computer Sciences 2000 40 5 1227 1234

2-s2.0-0034265657