Identifying the Types of Ion Channel-Targeted Conotoxins by Incorporating New Properties of Residues into Pseudo Amino Acid Composition

Conotoxins are neurotoxins that interact specifically with potassium, sodium, and calcium channels. They have become potential drug candidates for treating diseases such as chronic pain, epilepsy, and cardiovascular diseases. Correctly identifying the types of ion channel-targeted conotoxins will therefore provide important clues for understanding their function and finding potential drugs. Based on this consideration, we developed a new computational method to rapidly and accurately predict the types of ion channel-targeted conotoxins. Three new kinds of residue properties were incorporated into pseudo amino acid composition to formulate conotoxin samples. A support vector machine was used as the classifier, and a feature selection technique based on the F-score was used to optimize the features. Jackknife cross-validation showed that an overall accuracy of 94.6% was achieved, which is higher than other published results. Hence the current method may play a complementary role to existing methods for recognizing the types of ion channel-targeted conotoxins.


Introduction
The marine cone snail secretes venom for predation and defense. A key component of the venom is conotoxin, a disulfide-rich neurotoxic peptide 10-30 residues long. The high diversity of conotoxin sequences makes them difficult to study systematically. It has been reported that over 100,000 conotoxins exist in approximately 700 species of cone snails [1]. Conotoxins can target G protein-coupled receptors (GPCRs) [2], nicotinic acetylcholine receptors, and neurotensin receptors. In particular, they interact with ion channels with extremely high specificity and affinity [3]. Thus, they have been regarded as important drug candidates for treating chronic pain, epilepsy, spasticity, and cardiovascular diseases [4,5].
As more and more conotoxins are discovered, investigating their function by biochemical experiments alone becomes increasingly difficult because wet experiments are costly and time-consuming. Computational methods for predicting the function of conotoxins provide a convenient way to analyze them systematically. In 2006, Mondal et al. combined support vector machine (SVM) with pseudo amino acid composition (PseAAC) to predict the superfamily of conotoxins [6]. Subsequently, Lin and Li developed a novel method called increment of diversity (ID) to describe dipeptide sequences and used quadratic discriminant (QD) to predict the superfamily and family of conotoxins [7]. Zaki et al. used sequence alignment, which was also used by Zou et al. [8], combined with amino acid composition to predict the superfamily of conotoxins with SVM [9]. They further proposed an SVM-Freescore method to improve accuracy [10]. Recently, Yin et al. developed a method called dHKNN to predict the superfamily of conotoxins and achieved an overall accuracy of 90.3% by using a hidden Markov model to select the best features [11,12].
Although the methods and results mentioned above can guide the study of conotoxins, they provide little information for predicting the function of conotoxins. For example, two conotoxins (delta-conotoxin-like Ac6.1 and omega-conotoxin-like Ai6.2) belong to the same superfamily but target different ion channels [16]. Thus, it is necessary to develop new bioinformatics tools to identify the function of conotoxins. In 2007, Saha and Raghava proposed a method based on SVM and PSI-BLAST to predict the function of neurotoxins [17]. Soli et al. developed a statistical model to predict the activity of scorpion toxins by using motifs and secondary structure information [18]. Recently, Yuan et al. developed a feature selection technique based on the binomial distribution to predict the types of ion channel-targeted conotoxins by using a radial basis function network [19]. Subsequently, they improved the accuracy by using SVM with optimal dipeptide composition [20]. However, the prediction accuracy can be further improved.
Thus, the present study aimed to develop a new method to improve the prediction quality for the types of conotoxins. We incorporated three new kinds of residue properties into PseAAC to formulate conotoxin samples. Subsequently, we used SVM to perform classification. After feature selection, we found that the accuracy was dramatically improved in jackknife cross-validation. In the following sections, we introduce the process of model construction in detail.

Benchmark Dataset.
The benchmark dataset, extracted from UniProt [21], was constructed by Lin's group [19,20]. The dataset is reliable and objective because (i) conotoxins with ambiguous annotations have been excluded, (ii) the function of every conotoxin in the benchmark dataset has been experimentally confirmed, and (iii) highly similar sequences (cutoff = 80%) have been pruned by using the CD-HIT program. The benchmark dataset contains 112 mature conotoxin peptide sequences, including 24 potassium ion channel-targeted conotoxins (K-conotoxins), 43 sodium ion channel-targeted conotoxins (Na-conotoxins), and 45 calcium ion channel-targeted conotoxins (Ca-conotoxins). All calculations and model construction in the following sections are based on these data.

Feature Extraction.
A key point in protein prediction is how to extract important information from peptide sequences. In past studies, amino acid composition has been widely used in protein prediction. To account for the correlation between residues, dipeptide composition was introduced into prediction models. Chou proposed a very popular and elegant descriptor called PseAAC, which describes not only the amino acid composition but also the correlation of physicochemical properties of residues [22]. Furthermore, some web servers and stand-alone tools have recently been proposed to generate different modes of PseAAC, such as PseKNC [23], PseKNC-General [24], Pse-in-One [25], repRNA [26], and repDNA [27]. In this study, we proposed three new kinds of properties, namely, rigidity, flexibility, and irreplaceability. The flexibility and rigidity of residues correlate with protein structure and function. The irreplaceability of residues can reflect the evolution of life. The values of the three properties for the 20 residues [28] are listed in Table 1.

In the following, we describe how to formulate conotoxins with PseAAC [22]. Consider a conotoxin $P = R_1 R_2 R_3 R_4 \cdots R_L$, where $R_1$, $R_2$, and $R_L$ denote the 1st, 2nd, and $L$th residue of the conotoxin sample P; it can be defined by a $(400 + 3\lambda)$-dimensional vector as shown by

$$P = [x_1, x_2, \ldots, x_{400}, x_{401}, \ldots, x_{400+3\lambda}]^{T}, \tag{1}$$

$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{400} f_i + w \sum_{j=1}^{3\lambda} \theta_j}, & 1 \le u \le 400, \\[2ex] \dfrac{w\,\theta_{u-400}}{\sum_{i=1}^{400} f_i + w \sum_{j=1}^{3\lambda} \theta_j}, & 401 \le u \le 400 + 3\lambda, \end{cases} \tag{2}$$

where $f_u$ is the normalized frequency of the $u$th of the 400 dipeptides in conotoxin P and can be defined as

$$f_u = \frac{n_u}{L - 1}, \tag{3}$$

where $n_u$ denotes the number of occurrences of the $u$th dipeptide in conotoxin P.
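The dipeptide part of the descriptor can be illustrated with a minimal sketch. It assumes that each count is normalized by the number of overlapping dipeptides in the sequence (the sequence length minus one); the function name, the alphabetical ordering of the 400 dipeptides, and the toy peptide are illustrative choices and are not taken from the original study.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides, in a fixed order so feature indices are stable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequencies(sequence):
    """Return the 400 normalized dipeptide frequencies of a peptide."""
    sequence = sequence.upper()
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of width 2 over the sequence (overlapping dipeptides).
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    total = len(sequence) - 1  # assumed normalization: L - 1 dipeptides
    return [counts[d] / total for d in DIPEPTIDES]


# Hypothetical toy peptide, not a sequence from the benchmark dataset.
freqs = dipeptide_frequencies("CCAC")
```

For the toy peptide, the three dipeptides CC, CA, and AC each occur once, so each receives a frequency of 1/3 and the remaining 397 components are zero.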

BioMed Research
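In (2), $w$ is the weight factor for the sequence order effect, and the $\theta_j$ are the sequence correlation factors. The $t$-tier sequence correlation factor ($t = 1, 2, \ldots, \lambda$) for the $\xi$th property is computed by the following formula:

$$\theta_{3(t-1)+\xi} = \frac{1}{L - t} \sum_{i=1}^{L-t} \Theta^{\xi}_{i,i+t}, \tag{4}$$

where $\Theta^{\xi}_{i,i+t}$ ($\xi = 1, 2, 3$ denotes rigidity, flexibility, and irreplaceability) is called the correlation function and can be given by

$$\Theta^{\xi}_{i,i+t} = h^{\xi}(R_i)\, h^{\xi}(R_{i+t}), \tag{5}$$

where $h^{\xi}(R_i)$ is the $\xi$th kind of physicochemical value of the amino acid $R_i$. The values should be converted to standard type by

$$h^{\xi}(R_i) = \frac{h_0^{\xi}(R_i) - \frac{1}{20}\sum_{m=1}^{20} h_0^{\xi}(A_m)}{\sqrt{\frac{1}{20}\sum_{m=1}^{20}\left[h_0^{\xi}(A_m) - \frac{1}{20}\sum_{n=1}^{20} h_0^{\xi}(A_n)\right]^{2}}}, \tag{6}$$

where $h_0^{\xi}(R_i)$ is the original physicochemical value of the $i$th amino acid and $A_m$ runs over the 20 amino acids.

To find the feature subset that produces the maximum accuracy, we performed feature selection by using the F-score, which can be defined as

$$F(i) = \frac{\sum_{j=1}^{3}\left(\bar{x}_i^{(j)} - \bar{x}_i\right)^{2}}{\sum_{j=1}^{3} \frac{1}{n_j - 1} \sum_{k=1}^{n_j}\left(x_{k,i}^{(j)} - \bar{x}_i^{(j)}\right)^{2}}, \tag{7}$$

where $\bar{x}_i$ and $\bar{x}_i^{(j)}$ are the average values of the $i$th feature in the whole dataset and in the $j$th dataset, respectively; $x_{k,i}^{(j)}$ is the value of the $i$th feature of the $k$th conotoxin in the $j$th dataset; and $n_j$ is the number of conotoxins in the $j$th dataset. The larger the $F(i)$ value is, the better the predictive capability of the $i$th feature. We used the Python script fselect.py, downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/, to calculate the F-score.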
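The sequence correlation factors described above can be sketched as follows. This is a minimal illustration under stated assumptions: the property values are standardized with the population standard deviation over the residues in the table, the factors are ordered tier by tier with the properties cycling inside each tier, and the three-letter alphabet with its raw values is hypothetical (the real values for the 20 residues are those of Table 1). The function names are illustrative.

```python
from statistics import mean, pstdev


def standardize(raw_values):
    """Convert a {residue: value} table to zero mean and unit variance."""
    values = list(raw_values.values())
    mu, sigma = mean(values), pstdev(values)
    return {residue: (v - mu) / sigma for residue, v in raw_values.items()}


def correlation_factors(sequence, property_tables, lam):
    """Return lam * len(property_tables) tier correlation factors.

    Assumed ordering: tier 1 for every property, then tier 2, and so on.
    """
    length = len(sequence)
    if not 0 < lam < length:
        raise ValueError("lam must satisfy 0 < lam < len(sequence)")
    factors = []
    for tier in range(1, lam + 1):
        for table in property_tables:
            # Average product of standardized values of residues `tier` apart.
            total = sum(table[sequence[i]] * table[sequence[i + tier]]
                        for i in range(length - tier))
            factors.append(total / (length - tier))
    return factors


# Hypothetical raw values for a toy three-letter alphabet (not Table 1).
toy_property = standardize({"A": 1.0, "C": 2.0, "G": 3.0})
thetas = correlation_factors("ACGA", [toy_property], lam=2)
```

With one toy property and two tiers the sketch returns two factors, approximately -0.5 and -0.75. With the three properties of this study and a given number of tiers, the same loop yields three factors per tier.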

Support Vector Machine.
SVM is a very popular machine learning method that is well suited to small-sample classification [29-31] and regression [32,33]. Its basic idea is to map the original samples into a high-dimensional space and search for the best hyperplane in this space that separates the different samples. In this study, the LIBSVM software package was used to implement SVM. The radial basis function (RBF) usually exhibits excellent performance in nonlinear classification [34]. Thus the RBF kernel function was used in the current work. We used a grid search to find the best values of the regularization parameter $C$ and the kernel parameter $\gamma$ via jackknife cross-validation. The search spaces for $C$ and $\gamma$ are $[2^{15}, 2^{-5}]$ and $[2^{-5}, 2^{-15}]$ with steps of $2^{-1}$ and $2$, respectively.
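The grid search over the two SVM parameters can be sketched as below. The sketch is not the LIBSVM procedure itself: it assumes, for illustration only, that both exponents are decreased by one per grid point, and it replaces the jackknife accuracy of a trained SVM with a hypothetical scoring function (`toy_score`) so that the example is self-contained. All function names are illustrative.

```python
import math
from itertools import product


def power_of_two_grid(start_exp, stop_exp):
    """Return 2**e for integer exponents from start_exp to stop_exp inclusive."""
    step = 1 if stop_exp >= start_exp else -1
    return [2.0 ** e for e in range(start_exp, stop_exp + step, step)]


def grid_search(evaluate, c_values, gamma_values):
    """Return (best score, best C, best gamma) over all parameter pairs."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        score = evaluate(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


C_VALUES = power_of_two_grid(15, -5)       # 2^15 down to 2^-5
GAMMA_VALUES = power_of_two_grid(-5, -15)  # 2^-5 down to 2^-15


def toy_score(c, gamma):
    """Hypothetical stand-in for jackknife accuracy, peaked at C=2^3, gamma=2^-7."""
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) + 7)


best_score, best_c, best_gamma = grid_search(toy_score, C_VALUES, GAMMA_VALUES)
```

In practice `evaluate` would train an RBF-kernel SVM with the given pair of parameters and return its cross-validated accuracy; the exhaustive loop over pairs stays the same.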

The Evaluation of Model Performance.
We used jackknife cross-validation to evaluate the performance of the proposed method. Three metrics, namely, sensitivity (Sn), overall accuracy (OA), and average accuracy (AA) as defined in [19,20], were used to quantitatively estimate the accuracy of the model:

$$Sn_i = \frac{c_i}{n_i}, \qquad OA = \frac{\sum_{i=1}^{3} c_i}{\sum_{i=1}^{3} n_i}, \qquad AA = \frac{1}{3}\sum_{i=1}^{3} Sn_i, \tag{8}$$

where $n_i$ is the total number of conotoxins of the $i$th type and $c_i$ denotes the number of conotoxins of the $i$th type that were correctly recognized.
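The jackknife test and the three metrics can be sketched as follows. The sketch assumes the usual reading of the metrics (per-class fraction of correctly recognized samples, overall fraction of correct predictions, and the mean of the per-class values). A 1-nearest-neighbour rule is used purely as a stand-in classifier so that the example runs without an SVM library; the toy data and all function names are hypothetical.

```python
def jackknife_predictions(samples, labels, train_and_predict):
    """Predict each sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(train_and_predict(train_x, train_y, samples[i]))
    return predictions


def sn_oa_aa(labels, predictions):
    """Return (per-class Sn dict, overall accuracy, average accuracy)."""
    classes = sorted(set(labels))
    sn = {}
    for cls in classes:
        total = sum(1 for y in labels if y == cls)
        correct = sum(1 for y, p in zip(labels, predictions)
                      if y == cls and p == cls)
        sn[cls] = correct / total
    oa = sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)
    aa = sum(sn.values()) / len(classes)
    return sn, oa, aa


def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier (1-nearest neighbour); the study itself uses SVM."""
    distances = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[distances.index(min(distances))]


# Hypothetical toy data: one feature, two classes.
samples = [[0.0], [1.0], [2.0], [10.0], [11.0], [5.0]]
labels = ["K", "K", "K", "Na", "Na", "Na"]
predictions = jackknife_predictions(samples, labels, nearest_neighbour)
sn, oa, aa = sn_oa_aa(labels, predictions)
```

In the toy data the last sample lies closer to the K cluster and is misclassified, so Sn is 1 for K and 2/3 for Na, while OA and AA both equal 5/6.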

Results and Discussion
As we can see from (2), the results of the proposed method depend on two parameters, $\lambda$ and $w$, where $\lambda$ represents the long-range sequence order effect and $w$ is the weight factor, which reflects the weight imposed between the local and global effects. Generally speaking, the greater $\lambda$ is, the more global sequence order information the descriptor contains. However, if $\lambda$ is too large, it causes the high-dimensional disaster. Therefore, our search for the optimal values of the two parameters was carried out in the following regions:

$$1 \le \lambda \le 10 \ \text{with step } \Delta = 1, \qquad 0.1 \le w \le 1 \ \text{with step } \Delta = 0.1. \tag{9}$$

From (9), a total of 10 × 10 = 100 individual combinations needed to be considered to find the optimal parameter combination. This is a routine but tedious process of optimizing the model via a 2-dimensional grid search. We used the jackknife cross-validation approach for the parameter optimization. The results show that the accuracy reaches its maximum when $\lambda = 6$ and $w = 0.2$.

The resulting model contains 400 + 3 × 6 = 418 features, which is still large enough that high-dimensionality and overfitting problems may appear. Therefore, we must select the key features, those that produce the maximum accuracy, from the 418 components. In principle, the best feature subset could be obtained by investigating all combinations of features. However, examining all possible combinations is time-consuming and beyond the computational capability of most computers. For this reason, we used the F-score defined in (7) to perform feature selection. First, all 418 features were ranked by their F-scores from largest to smallest. Second, the SVM was used to classify the three types of samples and calculate the accuracy based on the single feature with the maximum F-score. Third, a new feature subset was produced by adding the feature with the second highest F-score to the former feature subset. We repeated the process, adding one feature at a time in rank order, until all 418 features had been included, and calculated the accuracy of each subset.
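The ranking-and-adding procedure described above can be sketched as follows. The F-score is written here in its multi-class form (squared deviations of the class means from the overall mean, divided by the sum of the within-class sample variances), which is assumed to correspond to (7). The scoring function is a stand-in: a leave-one-out 1-nearest-neighbour accuracy replaces the jackknife SVM accuracy, and the toy data and all function names are hypothetical.

```python
def f_scores(samples, labels):
    """Return one F-score per feature column."""
    classes = sorted(set(labels))
    scores = []
    for i in range(len(samples[0])):
        column = [row[i] for row in samples]
        overall_mean = sum(column) / len(column)
        between, within = 0.0, 0.0
        for cls in classes:
            values = [row[i] for row, y in zip(samples, labels) if y == cls]
            class_mean = sum(values) / len(values)
            between += (class_mean - overall_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in values) / (len(values) - 1)
        scores.append(between / within if within > 0 else 0.0)
    return scores


def incremental_feature_selection(samples, labels, accuracy_of):
    """Add features in decreasing F-score order; keep the best-scoring subset."""
    scores = f_scores(samples, labels)
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best_accuracy, best_subset = -1.0, []
    for size in range(1, len(ranking) + 1):
        subset = ranking[:size]
        reduced = [[row[i] for i in subset] for row in samples]
        accuracy = accuracy_of(reduced, labels)
        if accuracy > best_accuracy:  # ties keep the smaller subset
            best_accuracy, best_subset = accuracy, subset
    return best_subset, best_accuracy


def loo_nearest_neighbour_accuracy(samples, labels):
    """Stand-in scorer: leave-one-out accuracy of a 1-nearest-neighbour rule."""
    hits = 0
    for i, query in enumerate(samples):
        best_dist, best_label = None, None
        for j, other in enumerate(samples):
            if j == i:
                continue
            dist = sum((a - b) ** 2 for a, b in zip(query, other))
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        hits += best_label == labels[i]
    return hits / len(samples)


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
samples = [[0.0, 5.0], [0.2, 1.0], [1.0, 4.0], [1.2, 2.0], [2.0, 3.0], [2.2, 3.0]]
labels = ["K", "K", "Na", "Na", "Ca", "Ca"]
scores = f_scores(samples, labels)
best_subset, best_accuracy = incremental_feature_selection(
    samples, labels, loo_nearest_neighbour_accuracy)
```

In the toy example the informative feature receives a large F-score and the noise feature a score of zero, so the procedure keeps only the first feature. Only as many subsets as there are features are evaluated, rather than every possible combination.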
We plotted the accuracy against the feature dimension in Figure 1 and found that the maximum accuracy is 94.6% when the 180 best features were used. The detailed results are recorded in Table 1. Other published results are also listed in Table 2. The Sn values of Na- and Ca-conotoxins obtained by our method are 95.3% and 95.6%, respectively, which are higher than those of the RBF network-based method [19]. The Sn values of K- and Ca-conotoxins obtained by our method are 91.7% and 95.6%, respectively, which are higher than those of iCTX-Type [20]. Thus, in summary, our proposed method is superior to the other published methods on this benchmark dataset.
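As a consistency check, the reported sensitivities can be related to the class sizes of the benchmark dataset (24 K-, 43 Na-, and 45 Ca-conotoxins). The numbers of correctly recognized conotoxins used below (22, 41, and 43) are not stated in the text; they are inferred here as the integers that reproduce the reported percentages, and the AA value is derived from them rather than quoted.

```latex
\begin{align*}
Sn_{\mathrm{K}}  &= \frac{22}{24} \approx 91.7\%, &
Sn_{\mathrm{Na}} &= \frac{41}{43} \approx 95.3\%, &
Sn_{\mathrm{Ca}} &= \frac{43}{45} \approx 95.6\%, \\
OA &= \frac{22 + 41 + 43}{112} = \frac{106}{112} \approx 94.6\%, &
AA &= \frac{1}{3}\left(\frac{22}{24} + \frac{41}{43} + \frac{43}{45}\right) \approx 94.2\%.
\end{align*}
```

Under this reading, six of the 112 conotoxins are misclassified in the jackknife test, which agrees with the reported overall accuracy of 94.6%.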

Conclusion
In this paper, we designed a new method based on three new kinds of residue properties to predict three kinds of ion channel-targeted conotoxins. By using a feature selection technique, the prediction accuracy was dramatically improved. Comparison with published methods demonstrated the advantage of our method. The residue properties used in this paper can also be used in other fields of protein classification. In the future, we will construct a free web server based on the proposed method for the convenience of experimental scientists.