Incorporating Amino Acids Composition and Functional Domains for Identifying Bacterial Toxin Proteins

Aside from their role in pathogenesis, bacterial toxins have also been used for medical purposes, such as drugs for cancer and immune diseases. Correctly identifying bacterial toxins and their types (endotoxins and exotoxins) has a great impact on cell biology studies and therapy development. However, experimental methods for bacterial toxin identification are time-consuming and labor-intensive, implying an urgent need for computational prediction. Thus, we are motivated to develop a method for computational identification of bacterial toxins based on amino acid sequences and functional domain information. In this study, a nonredundant dataset of 167 bacterial toxins, including 77 exotoxins and 90 endotoxins, is adopted to learn the predictive model by using support vector machines (SVMs). The cross-validation evaluation shows that the SVM models trained with amino acid and dipeptide composition could yield an accuracy of 96.07% and 92.50%, respectively. For discriminating endotoxins from exotoxins, the SVM models trained with amino acid and dipeptide composition achieved an accuracy of 95.71% and 92.86%, respectively. After incorporating functional domain information, the predictive performance is further improved. On an independent dataset, the proposed method identified and classified bacterial toxins more effectively than models based on the other two features, which may aid in bacterial biomedical development.


Introduction
Bacterial toxins are capable of causing human diseases. Many pathogenic bacteria produce protein toxins that are important or essential virulence factors [1][2][3]. Bacterial toxins come in many different types that can overcome the defense mechanisms of hosts. Typically, they can be classified into endotoxins and exotoxins. The cell-associated toxins are referred to as endotoxins, which in most cases reside within the cell wall and are liberated into host tissues upon cell death. Endotoxins are an integral part of the outer membrane of Gram-negative bacteria. Certain proteins and, particularly, the lipopolysaccharides (LPS) comprise the outer layer of this membrane, whereas its inner layer is composed of phospholipids and proteins [4]. The exotoxins are extracellular diffusible toxins, soluble substances secreted by bacteria into the host tissues [5]. They are usually secreted proteins or polypeptides and act enzymatically or through direct interaction with host cell receptors to stimulate a variety of immune responses, possibly at a site remote from that of bacterial growth or pathogen colonization [1].
Bacterial toxins are used for medical purposes today. For instance, botulinum toxin, an exotoxin, has been employed in physiatrics, orthopedics, gynecology, pediatrics, neurology, general surgery, plastic surgery, and gastroenterology and also to treat hyperhidrosis and wrinkles in dermatology [6]. Botulinum toxin type B is a safe and effective treatment for upper extremity dystonia in children with cerebral palsy [7]. Cholera toxin (CT) and the heat-labile toxin (LT) have been used as strong mucosal adjuvants in experimental models [8]. Some studies demonstrated that similar bacterial toxins can be effective in cancer therapy, and the feasibility of this approach has been supported by the use of bacterial toxins to develop drugs for cancer, receptor disease, and immune disease [9][10][11][12]. With the advancement of biomedical technology, the pathogenic mechanisms of toxins have gradually been revealed, making bacterial toxins potentially powerful experimental and clinical tools. Because bacterial toxin identification is traditionally achieved by costly and time-consuming experimental methods, a high-throughput method to rapidly and accurately identify bacterial toxins is urgently needed. Besides, endotoxins and exotoxins have different roles and mechanisms in the host; thus, classification of bacterial toxins provides vital clues for understanding basic cell biology and for further development of toxoid-based therapeutics. To address the above issues, BTXpred was proposed; using support vector machines (SVM) with amino acid and dipeptide composition, it achieved an accuracy of 96.07% and 92.50%, respectively, for discriminating bacterial toxins from nontoxins, and an accuracy of 95.71% and 92.86%, respectively, for classifying endotoxins and exotoxins [13]. Yang et al. achieved a higher MCC on the same dataset by using increment of diversity and SVM [14]. Based on the concept of pseudo amino acid composition, combined with the methods of approximate entropy and the IB1 algorithm, Song demonstrated higher accuracy than the previous two methods [15].
In this study, we attempted not only to use amino acid and dipeptide composition but also to develop a method that incorporates functional domain mapping for the identification of bacterial toxins and the classification of bacterial toxins into endotoxins and exotoxins. Figure 1 presents the systematic workflow of the proposed method. It consists of the following steps: data collection and preprocessing, feature extraction, model learning and cross validation, and independent testing. The details of each step are described as follows.

Data Collection and Preprocessing
The training dataset used in this study was obtained from BTXpred [13], which was originally collected from the UniProt database [16]. In BTXpred, nontoxin protein sequences were also obtained from UniProt by a combined search using SRS. The retrieved data were checked in order to eliminate toxin proteins [17]. After removing sequences that have more than 90% sequence identity, the nonredundant dataset in BTXpred consists of 150 bacterial toxins, comprising 77 exotoxins and 73 endotoxins. The exotoxins were further classified based on their molecular targets. Here, we removed proteins annotated as "Obsolete" in UniProt and added data released on UniProt by December 31, 2007. The final nonredundant dataset for model training consists of 456 nontoxin proteins and 167 bacterial toxins, the latter containing 90 endotoxins and 77 exotoxins.
For the independent testing dataset, we collected additional bacterial toxin data following previous work [13], including the updated data released on UniProt after January 01, 2008. Homologous sequences in the testing data were removed by using CD-HIT [18]. The final testing dataset included 810 nontoxin proteins and 271 bacterial toxin proteins, the latter containing 162 endotoxins and 109 exotoxins. The 90% sequence similarity level also applies between the training dataset and the testing dataset.

Compositions of Amino Acids and Amino Acid Dipeptide.
Each protein sequence in the dataset was represented using a vector $\{x_i,\ i = 1, \ldots, n\}$ labeled according to its corresponding protein group (e.g., bacterial toxins or nontoxin proteins). The vector has 20 elements for the amino acid composition and 400 elements for the amino acid dipeptide composition. For amino acid composition, the 20 elements specified the numbers of occurrences of 20 amino acids normalized with the total number of residues in the protein.
On the other hand, for amino acid dipeptide composition, the 400 elements specified the numbers of occurrences of 400 amino acid dipeptides normalized with the total number of dipeptides in the protein.
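The two encodings described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the alphabetical residue ordering, and the handling of nonstandard residues (they count toward the totals but have no element of their own) are assumptions made here.

```python
from itertools import product

# Assumed ordering: the 20 standard residues, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, in the same ordering.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return 20 values: residue counts divided by the sequence length."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    total = len(seq)
    return [seq.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 values: adjacent-pair counts divided by the number of pairs."""
    seq = seq.upper()
    total = len(seq) - 1  # number of overlapping adjacent pairs
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with nonstandard residues are skipped
            counts[pair] += 1
    return [counts[dp] / total for dp in DIPEPTIDES]
```

For a toy sequence such as "AAN", the amino acid vector holds 2/3 at the position of A and 1/3 at the position of N, and the dipeptide vector holds 0.5 at each of "AA" and "AN". A short protein yields few dipeptides, so most of its 400 elements are zero.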

Information of Functional Domains.
Previous works on protein prediction have shown that distinguishable domain regions are useful in the classification of proteins [19,20]. In this work, domain information was investigated as a feature for distinguishing bacterial toxins from nontoxin proteins. To investigate the preference of functional domains in bacterial toxins, this study referred to the annotations in InterPro [21]. InterPro is an integrated resource, which was developed initially as a means of rationalizing the complementary efforts of the PROSITE [22], PRINTS [23], Pfam [24], and ProDom [25] databases, for providing protein "signatures" such as protein families, domains, and functional sites. The domain information of each bacterial toxin in the training data was collected by referring to its corresponding InterPro ID in the UniProt database. The collected domains were then analyzed in order to identify the most distinguishable domains in bacterial toxins. In this work, functional domains present in more than five bacterial toxins were considered significant domains.
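The domain selection rule stated above (a domain is kept if it is present in more than five bacterial toxins) can be sketched as follows. The input format (a mapping from protein identifier to InterPro IDs) and both function names are hypothetical. The text does not state how the selected domains are encoded for the classifier, so the binary presence/absence vector below is only one plausible assumption.

```python
from collections import Counter


def select_significant_domains(protein_domains, min_proteins=6):
    """Count in how many proteins each domain occurs and keep frequent ones.

    protein_domains: dict mapping a protein ID to an iterable of InterPro IDs.
    min_proteins=6 corresponds to "more than five" proteins.
    """
    counts = Counter()
    for domains in protein_domains.values():
        counts.update(set(domains))  # count each domain once per protein
    return {dom: n for dom, n in counts.items() if n >= min_proteins}


def encode_domains(domains, selected):
    """Assumed encoding: 1 if a selected domain is present in the protein, else 0."""
    present = set(domains)
    return [1 if dom in present else 0 for dom in sorted(selected)]


# Toy annotations with hypothetical protein IDs; not taken from the dataset.
toy = {f"P{i}": ["IPR005639", "IPR005638"] for i in range(6)}
toy["P6"] = ["IPR005639", "IPR001869"]
selected = select_significant_domains(toy)
```

In this toy input, IPR005639 occurs in seven proteins and IPR005638 in six, so both are kept, while IPR001869 occurs once and is dropped.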

Model Learning and Cross Validation.
A support vector machine (SVM) was applied to generate computational models that incorporate the encoded features, namely amino acid composition, dipeptide composition, and functional domain information. Based on binary classification, the concept of SVM is to map the input samples into a higher dimensional space using a kernel function and then to find a hyperplane that discriminates between the two classes with maximal margin and minimal error. A public SVM library, LibSVM [26], was used to train the predictive model with positive and negative training sets, which were encoded with reference to the various training features. The radial basis function (RBF) $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ was selected as the kernel function of SVM. Moreover, two SVM parameters, cost and gamma value, were optimized to maximize predictive accuracy.
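As a small illustration of the RBF kernel named above, the following sketch evaluates exp(-gamma * squared Euclidean distance) for two feature vectors. It is a stand-alone helper written for this explanation, with an assumed function name, and is not LibSVM code; LibSVM computes the kernel internally once the RBF kernel type and a gamma value are chosen.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart. A larger gamma makes the decay faster, which is why gamma is tuned together with the cost parameter.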
Cross-validation evaluation is important for the application of a predictor [27]. The predictive performance of the constructed models was evaluated by performing $k$-fold cross validation. The training data were divided into $k$ groups by splitting each dataset into approximately equal-sized subgroups. In one round of cross validation, one subgroup was regarded as the test set, and the remaining $k-1$ subgroups were regarded as the training set. The cross-validation process was repeated for $k$ rounds, with each of the $k$ subgroups used as the test set in turn. Then, the results were combined to produce a single estimation. The advantage of $k$-fold cross validation is that all original data are used for both training and testing, and each data point is used for testing exactly once [28]. In this study, $k$ was set to five. The following measures of predictive performance of the trained models were defined:
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN},$$
where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. Additionally, the cost and gamma values of the SVM models were optimized to maximize predictive accuracy. Finally, the parameters that yielded the highest accuracy were employed to construct predictive models for independent testing.
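The evaluation procedure can be sketched as follows, using the standard definitions of sensitivity, specificity, and accuracy in terms of TP, TN, FP, and FN. The function names are hypothetical, and the interleaved assignment of samples to folds is an assumption: the text only says that the subgroups are approximately equal in size, and in practice the data are usually shuffled before splitting.

```python
def k_fold_indices(n_samples, k=5):
    """Split indices 0..n_samples-1 into k folds of approximately equal size."""
    return [list(range(start, n_samples, k)) for start in range(k)]


def cross_validation_rounds(n_samples, k=5):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_samples, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def performance(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```

With the 623 training proteins (167 toxins and 456 nontoxins) and five folds, this split produces three folds of 125 samples and two of 124. In each round a model would be trained on the `train` indices and scored on the `test` indices, and the confusion counts from the five rounds would be pooled before calling `performance`.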

Independent Testing.
In order to further evaluate the trained models for identifying bacterial toxins, an independent test set was obtained as described previously, resulting in 271 positive and 810 negative data. In addition, this work also investigated the ability of the predictive model to identify the bacterial toxin subtypes "exotoxin" and "endotoxin".

Investigation of Amino Acid Composition in Bacterial Toxins.
The difference between bacterial toxin and nontoxin proteins was analyzed in terms of amino acid composition, and the result is shown in Figure 2. To determine the differentially represented amino acids, the occurrence of each amino acid, except for N, which has the highest frequency (0.03772), was averaged. Adding the standard deviation (0.004225) to the average (0.00767) gave a threshold of 0.011895. It can be observed that bacterial toxins are significantly distinguishable from nontoxin proteins at the amino acid composition level. For instance, alanine (A, 0.01317), asparagine (N, 0.03772), leucine (L, 0.01475), and tyrosine (Y, 0.01416) residues all exhibit a remarkable difference between bacterial toxin and nontoxin proteins. Asparagine (N) is the most significantly distinguishable among all residues. In order to examine the effectiveness of amino acid composition in identifying bacterial toxins, an SVM model was trained using a 20-dimensional vector consisting of the composition scores for the twenty amino acids. The amino acid composition-based model was evaluated by means of five-fold cross validation. As shown in Table 1, the model achieved sensitivity of 92.81%, specificity of 99.56%, and accuracy of 97.75%. An amino acid composition comparison between endotoxin and exotoxin was also performed and is shown in Figure 3. The occurrence of each amino acid, except for the most distinguishable residue, K, was used to obtain an average (0.006347). After adjusting the average by the standard deviation (0.004226), a frequency larger than 0.010573 was considered to be differential. Arginine (R, 0.017043), lysine (K, 0.03654), and threonine (T, 0.014236) residues were found to have differential frequency between endotoxin and exotoxin proteins. Similarly, an SVM model was trained using a 20-dimensional vector consisting of the composition scores for the twenty amino acids and evaluated by means of five-fold cross validation. As shown in Table 2, the model achieved sensitivity of 93%, specificity of 93.93%, and accuracy of 94.02%.
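The thresholding rule used in this section adds the standard deviation to the average after leaving out the most extreme residue; for the toxin versus nontoxin comparison this gives 0.00767 + 0.004225 = 0.011895. A minimal sketch of that rule is given below. The function name and the toy input values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def differential_residues(values, exclude):
    """Flag residues whose value exceeds mean + SD of the remaining residues.

    values: dict mapping a residue to its frequency value.
    exclude: the residue with the highest value, left out of the mean and SD.
    """
    rest = [v for residue, v in values.items() if residue != exclude]
    threshold = mean(rest) + pstdev(rest)  # population SD is an assumption
    flagged = sorted(r for r, v in values.items() if v > threshold)
    return threshold, flagged


# Toy values for illustration only; they are not taken from Figure 2 or 3.
toy = {"N": 0.9, "A": 0.3, "C": 0.1, "D": 0.1, "E": 0.1}
threshold, flagged = differential_residues(toy, exclude="N")
```

In the toy input the mean of the remaining values is 0.15, so only A and the excluded residue N exceed the threshold. Leaving the top residue out keeps a single outlier from inflating the threshold and masking the other residues.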

Investigation of Amino Acid Dipeptide Composition in Bacterial Toxins.
Protein dipeptide composition has been widely used in protein identification. Previous studies have reported that dipeptide composition-based methods can yield better performance than amino acid composition-based methods [29,30]. To investigate this claim in terms of identifying bacterial toxins, an SVM model was trained by using amino acid dipeptide composition as features. First, the composition of all possible amino acid pairs was calculated in bacterial toxins and nontoxin proteins, so that each protein sequence was encoded as a 400-dimensional vector consisting of the composition scores for the 20 × 20 amino acid pairs. Using the resulting 400-dimensional dipeptide vectors, an SVM model was trained and evaluated by means of five-fold cross validation. The dipeptide composition-based model for identifying bacterial toxins has sensitivity of 87.42%, specificity of 96.71%, and accuracy of 94.06% (as shown in Table 1). It can be observed that the amino acid composition-based method yields higher accuracy in identifying bacterial toxins. This may be due to the short sequence length of toxins, as it is difficult to obtain a significant number of dipeptides for small proteins [13]. The amino acid dipeptide composition of bacterial toxins and nontoxin proteins was further analyzed by selecting statistically significant dipeptides among the 400 amino acid pairs. Figure 4 shows the probability difference of the 400 amino acid pairs between bacterial toxins and nontoxin proteins. In the 20 × 20 matrix, amino acid pairs marked in red indicate overrepresentation in bacterial toxins, while amino acid pairs marked in green indicate underrepresentation. As illustrated in Figure 4, NN pairs are overrepresented in bacterial toxins, as are N residues paired with I, L, and T.
Similarly, the amino acid dipeptide composition-based method also yields lower accuracy in classifying exotoxin and endotoxin than the amino acid composition-based method. The model achieved sensitivity of 92.22%, specificity of 85.71%, and accuracy of 89.22%, as shown in Table 2. Figure 5 portrays the probability difference of the 400 amino acid pairs between endotoxin and exotoxin proteins. Amino acid pairs marked in red indicate overrepresentation in endotoxin, while amino acid pairs marked in green indicate overrepresentation in exotoxin. It can be observed that LE and TD pairs are overrepresented in endotoxin, while SK, KK, and NK pairs are overrepresented in exotoxin.
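A comparison of the kind shown in Figures 4 and 5 can be sketched as a difference of group-averaged composition vectors. The text does not give the exact definition of the probability difference, so averaging per-protein compositions and subtracting the two group means is an assumption, as are the function names and the three-pair toy vectors.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length composition vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def composition_difference(group_a, group_b, labels):
    """Map each label to (mean in group A) minus (mean in group B)."""
    mean_a = mean_vector(group_a)
    mean_b = mean_vector(group_b)
    return {lab: a - b for lab, a, b in zip(labels, mean_a, mean_b)}


def rank_by_difference(diff, top=3):
    """Return the labels most overrepresented in group A and in group B."""
    ordered = sorted(diff, key=diff.get, reverse=True)
    return ordered[:top], ordered[-top:][::-1]


# Toy three-pair vectors for illustration; real vectors have 400 elements.
labels = ["NN", "LE", "KK"]
group_a = [[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]
group_b = [[0.25, 0.25, 0.5]]
diff = composition_difference(group_a, group_b, labels)
over_a, over_b = rank_by_difference(diff, top=1)
```

Positive values correspond to pairs that would be marked red (overrepresented in the first group) and negative values to pairs marked green.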

Investigation of Functional Domain Information in Bacterial Toxins.
In order to analyze functional domain information in bacterial toxins, the experimentally verified domains of each bacterial toxin in the training data were collected by referring to the "InterPro" field in UniProt, resulting in a total of 100 functional domains. To capture the representative functional domains in bacterial toxins, functional domains that are present in more than 5 bacterial toxins (that is, in at least 6) were selected as distinguishable domains, and domains below this threshold were filtered out. Accordingly, a total of 40 functional domains were obtained, as shown in Table 3. It is observed that the most distinguishable functional domain is "Endotoxin N" (InterPro ID: IPR005639), which is present in 84 bacterial toxins. Other distinguishable functional domains that are present in more than 60 bacterial toxin proteins are Galactose-bd-like, Endotoxin C, Endotoxin cen dom, and Endotoxin cen dom subgrl, with InterPro IDs IPR008979, IPR005638, IPR001178, and IPR015790, respectively. It is noticeable that these discernible functional domains all exist in endotoxins, implying that most of the endotoxins have similar functions, while exotoxins can still be divided into more subcategories according to their different functions.
Table 3: Statistics of InterPro functional annotations in 167 bacterial toxin proteins. InterPro classifies sequences at superfamily, family, and subfamily levels and annotates the occurrence of functional domains, repeats, and important sites. The annotations that occur in more than five bacterial toxins are presented with the InterPro ID, the description, and the number of bacterial toxin proteins.
InterPro ID   Description                  Bacterial toxin proteins
              Heat-stable enterotox STa    11
IPR008985     ConA-like lec gl sf          11
IPR013320     ConA-like subgrp             11
IPR018511     Hemolysin-typ Ca-bd CS       11
IPR019806     Heat-stable enterotox CS     11
IPR000395     Neurotox Zn protease         9
IPR001869     Thiol cytolysin              9
IPR011065     Kunitz inhibitor ST1-like    9
IPR012500     Toxin trans                  9
IPR012928     Toxin rcpt-bd N              9
IPR013104     Toxin rcpt-bd C              9
IPR013550     RTX C                        9
IPR015214     Endotoxin

Independent Testing.
The feasibility of this method was further evaluated by using an independent dataset composed of bacterial toxin proteins of two subtypes (endotoxin and exotoxin) and nontoxin proteins, as described above. The independent data were first tested on the bacterial toxin versus nontoxin models trained on each feature, as shown in Table 4. The amino acid composition-based model has sensitivity of 30.99%, specificity of 93.95%, and accuracy of 78.17%. The dipeptide composition-based model obtained sensitivity of 27.31%, specificity of 88.89%, and accuracy of 73.45%. Both models yielded a much lower performance than the model based on functional domains, which has sensitivity of 99.63%, specificity of 93.95%, and accuracy of 95.37%. The independent data were further tested for classifying exotoxin and endotoxin by each feature. As presented in Table 5, the functional domain-based model still yielded better performance than the other two models, with sensitivity of 94.44%, specificity of 73.39%, and accuracy of 85.98%. The amino acid composition-based model performs with sensitivity of 0%, specificity of 94.49%, and accuracy of 38.01%. The dipeptide composition-based model performs with sensitivity of 11.72%, specificity of 89%, and accuracy of 43.17%.
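Because accuracy is the class-size-weighted combination of sensitivity and specificity, the toxin versus nontoxin accuracies reported above can be recomputed from the reported rates and the sizes of the independent set (271 toxins and 810 nontoxins). The helper below is a small arithmetic sketch with an assumed function name, not part of the original method.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Accuracy = (Sn * positives + Sp * negatives) / (positives + negatives)."""
    true_positives = sensitivity * n_pos
    true_negatives = specificity * n_neg
    return (true_positives + true_negatives) / (n_pos + n_neg)


# Rates as reported in the text for the toxin versus nontoxin task.
acc_aac = accuracy_from_rates(0.3099, 0.9395, 271, 810)     # about 0.7817
acc_dpc = accuracy_from_rates(0.2731, 0.8889, 271, 810)     # about 0.7345
acc_domain = accuracy_from_rates(0.9963, 0.9395, 271, 810)  # about 0.9537
```

The low sensitivity of the two composition-based models is partly hidden in their accuracy because nontoxins make up about three quarters of the independent set, which is why sensitivity and specificity are reported alongside accuracy.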

Conclusion
Bacterial toxins have wide-ranging effects not only on host defense and immune response but also on therapeutic interventions. Identifying bacterial toxins and their subtypes rapidly and accurately by taking advantage of computational methods is therefore significant and helpful. Many research efforts in this field have aimed to facilitate the understanding of bacterial toxicology and the development of therapeutic drugs and vaccines. The results of these studies indicate that machine learning methods are suitable for toxin protein prediction and reach high accuracy, owing to the sequence data made available by the sequencing of most genomes of 296 bacteria.
Here we have proposed a different computational method to identify bacterial toxin proteins on the basis of the amino acid sequence and functional domain information of a protein. Since short toxin protein sequences are common and different types of toxins have different specific functions, functional domain information is more suitable for identifying the type of bacterial toxin than the amino acid composition of the protein sequence. In the independent test results, the functional domain information provided a more stable performance for the identification of bacterial toxins. It is speculated that even if the amino acid composition is different, the specific functional domains are conserved. This work shows that in silico identification could be a feasible means of conducting preliminary analyses as well as significantly reducing the number of potential targets that require further in vivo or in vitro experimental confirmation. It is envisaged that the functional domain-based model may have potential for predicting toxin proteins produced by other species.