A Multifeatures Fusion and Discrete Firefly Optimization Method for Prediction of Protein Tyrosine Sulfation Residues

Tyrosine sulfation is one of the ubiquitous protein posttranslational modifications, where some sulfate groups are added to the tyrosine residues. It plays significant roles in various physiological processes in eukaryotic cells. To explore the molecular mechanism of tyrosine sulfation, one of the prerequisites is to correctly identify possible protein tyrosine sulfation residues. In this paper, a novel method was presented to predict protein tyrosine sulfation residues from primary sequences. By means of informative feature construction and elaborate feature selection and parameter optimization scheme, the proposed predictor achieved promising results and outperformed many other state-of-the-art predictors. Using the optimal features subset, the proposed method achieved mean MCC of 94.41% on the benchmark dataset, and a MCC of 90.09% on the independent dataset. The experimental performance indicated that our new proposed method could be effective in identifying the important protein posttranslational modifications and the feature selection scheme would be powerful in protein functional residues prediction research fields.


Introduction
Tyrosine sulfation is one of the most prevalent posttranslational modifications in transmembrane and secreted proteins. Many lines of evidence have suggested that nearly 1% of all tyrosine residues of the total proteins in an organism can be sulfated [1]. Tyrosine sulfation has been found to be participating in the interactions between proteins and the modulations of intracellular proteins [2,3]. Malfunction or dysregulation of tyrosine sulfation would lead to several serious diseases, such as atherosclerosis [4], lung diseases [5], and HIV infections [6]. Therefore, identification of possible protein tyrosine sulfation substrates and their accurate residues is valuable in exploring the intrinsic mechanism of tyrosine sulfation in biological processes and therefore arouses interests of biologists in these fields.
In view of the laborintensive and time-consuming biochemical experiments, computational intelligence technologies are becoming more and more popular due to their conveniences as well as efficiencies. In the past decades, many computational methods have been proposed and successfully applied in this field [7][8][9][10][11][12][13][14].
In 1997, Bundgaard et al. made the first attempt to predict the tyrosine sulfation residues based on sequence comparisons by using synthetic peptides [7]. They pointed out that the tyrosylprotein sulfotransferase was cell-specifically expressed. In 2002, Monigatti et al. constructed the first software tool named Sulfinator based on four different hidden Markov models to identify tyrosine sulfation residues [8]. Yu et al. developed a log-odds position-specific scoring matrix (PSSM) to construct the prediction model [9]. They found that tyrosine sulfation residues mostly located in extracellular tail and extracellular loop 2. Subsequently, Monigatti et al. gave an overview of sulfation in the context of modificomics [10]. Chang et al. proposed a computational method named SulfoSite based on support vector machine (SVM) [11]. Niu et al. developed a method by using maximum relevance minimum redundancy (mRMR) method to select the best feature subset and nearest neighbor algorithm to construct the predictor [12]. PredSulSite introduced two new encoding schemes, namely, grouped weight and autocorrelation function [13]. Jia et al. proposed a novel method named SulfoTyrP by using undersampling approach and weighted support vector machine [14].
All abovementioned methods facilitated the investigations on tyrosine sulfation; however, the accuracy was still far from satisfactory and detailed analyses of the features are lacking. Thus, it was significant to develop a powerful predictor to identify the tyrosine sulfation residues.
In this paper, we focused on the challenging problem of predicting tyrosine sulfation residues based on protein sequences. Firstly, several informative sequence-derived features were combined to construct the feature vector. Secondly, relative entropy selection and incremental feature selection (RES + IFS) were adopted to perform the preevaluation of the features, and then discrete firefly algorithm (DFA) and SVM were introduced to perform the second-round feature selection as well as build the predicted model. Experimental results on the benchmark datasets and independent datasets proved that our method was a powerful tool for tyrosine sulfation residues prediction. A web-server of DFA PTSs was constructed and freely accessible at http://biolabxynu.zicp .net:9090/DFA PTSs/ for academic use.

Datasets.
To reach a consensus assessment with previous researches [8,12,13], two benchmark datasets were collected in this work. The datasets were compiled from UniProtKB (2013.09) [15] with the keyword "sulfotyrosine." Then, these proteins were clustered with no more than 30% similarities by CD-HIT [16]. Finally, 137 experimental tyrosine sulfation residues from 79 protein sequences were collected. 68 proteins (119 tyrosine sulfation residues) were selected as a training set and the remaining 11 proteins (18 tyrosine sulfation residues) were selected as an independent test.
The peptide segments of tyrosine sulfation residues and non-tyrosine sulfation residues could be formulated by where indicated the distance between the furthest residue and the target residue and 2 + 1 would be the sliding window length. In order to test the proposed model fairly as well as keep consistent with previous studies [8,12,13], was set as 4 and the corresponding sliding window length would be 9. However, in some cases, the upstream or downstream number of residues for a tyrosine may be less than 4. The lacking residues would be filled with dummy code .

PSI-BLAST-Based Features.
As is well known, the life originated from ancient limited peptides. With the development of evolution and nature selection, various sequences began to appear and form the complex organisms. In the process of sequence evolution, some unimportant peptides disappeared while the important function-determinate regions were kept. Considering this, evolutionary conservation had been widely used to explore the attributes of proteins, such as predicting the extracellular matrix proteins [17] and identifying the epitopes [18] and cysteine S-nitrosylation residues [19].
To obtain evolutional conservation profiles, PSSM was generated by the program PSI-BLAST [20] with default parameters (3 iterations and 0.0001 of -value cutoff) against the Swiss-Prot database (http://www.ebi.ac.uk/swissprot/). The evolution conservation for a protein with residues would be given as the following matrix: where → , = 1, 2, 3, . . . , , represented the frequency of the th position residues which was substituted by amino acid ( = 1, 2, 3, . . . , 20) in the evolution history. The positive scores indicated that this substitution appeared more frequently than that expected, while the negative scores meant the opposite. Usually, the aggregation of positive scores indicated the important function zones in the proteins. Considering this, to make the descriptor uniformly cover the peptides, elements in the above equation for PSSM were used to define a new matrix PSSM , which was formulated by where ∑ → indicated the sum of amino acids type being changed to amino acids type in PSSM . Finally, 400 features were obtained to describe the evolutionary conservation of the adjacent regions of the tyrosine sulfation residues.

PSIPRED-Based Features.
Previous researches figured out that the proteins with the same structural class but low sequence similarity may still keep some attributes in their secondary structure. Hence, in this paper, the information of secondary structure was adopted for identifying the tyrosine sulfation residues. PSIPRED [21], which applies twostage neural networks to predict secondary structures, has found wide applications in computational biology, such as solvent accessibility [22], epitope recognition [18], cysteine Snitrosylation sites [19], and protein folding kinetic types [23]. According to [21], the output files of PSIPRED were encoded with terms of " " for coil, " " for helix, and " " for strand.
Here, we quantified the total number, average length, and percentage of each peptide, which were defined as follows: where = { , , }. Finally, 9 features were derived to construct the predicted secondary structure features.

Native Disorder Features.
Natively disordered zone has been proved to be connected with many various physiological activities, such as epitope recognition, solvent accessibility, and protein interaction [18,24,25]. Hence, they were often used in researches of protein structures and functions. Here, DISOPRED [26] was used to predict the disorder status for each residue in the peptides. In summary, 9 features were obtained to construct the native disorder features.

Protein Physicochemical Features.
As is well known, the hydrophobic residues tended to form small patches on the surface of the proteins to participate in the interaction. Some residues with polarity and charge could play a critical role in protein binding [22]. In addition, the flexibility and accessibility of a residue strongly affected the protein functional residues. Therefore, in this work, 6 physicochemical properties (hydrophilicity, flexibility, accessibility, polarity, exposed surface, and turns) were collected to predict protein tyrosine sulfation residues.

Discrete Firefly Optimization
Algorithm. The firefly algorithm (FA) [27] is a novel heuristic optimization algorithm inspired by the natural behaviors of fireflies. FA has been proved to be a very effective optimization algorithm to search the global optima. The DFA is the modified traditional firefly algorithm which could be used in solving discrete optimization problems. The pseudocode of the DFA was shown in Procedure 1.
Distance. The distance between any two fireflies and was defined as follows: where , was the th component of the th firefly.
Attractiveness. The attractiveness of a firefly was determined by its lightness, which implied how strong it attracted the adjacent fireflies: where was the distance between two fireflies, 0 was the attractiveness, and was a fixed light absorption coefficient.
Movement. The movement of a firefly was determined by the attractiveness from other fireflies. It was formulated as Discretization. If firefly moved toward , the position of firefly changed from a binary number to a real number. In this study, the sigmoid function was used to constrain the position value to the interval [0, 1]: where ( ) indicated the probability of .
Fitness Definition. In this paper, the prediction accuracy and the number of selected features were the two criteria to design a fitness function. Therefore, the fitness function had two predefined weights, for the prediction accuracy (in this paper, we chose the MCC) and for the selected features, which were formulated as follows:

Relative Entropy Selection and Incremental Feature Selection.
Although the combination of different types of features would provide a more powerful predictor, some unwanted noise features which were called "bad" features may also be brought in at the same time. These unwanted noise features may decrease the prediction and generalization performance of the classifiers. To reject the bad features as well as keep the good features, we here adopted relative entropy selection (RES) (i.e., Kullback-Leibler divergence) [28] to select the optimal feature subset. For the algorithm, relative entropy was defined as follows: where and were the conditional probability density functions of a feature under two various categories; DKL( ‖ ) was the K-L divergence of from and DKL( ‖ ) was the K-L divergence of from [19]. A feature list would be obtained after the relative entropy selection: where the index indicated the importance of in the feature list .
Once the ranked feature list was obtained, the incremental feature selection (IFS) procedure was used to search for the optimal feature subset for the predictor. During the IFS, the features in the list would be added one by one from the head to the tail. In each iteration, a new feature would be added and form a new feature subset. For each new feature subset, we built a new classifier using 10-fold cross-validation. Then, 472 individual classifiers would be obtained for the 472 feature subsets. As a result, a table named IFS, with one column for the feature index and the other columns for the prediction performance, was produced. The IFS curve was drawn based on the IFS list to identify the best prediction efficiency as well as the corresponding optimal feature subsets. ( -fold cross-validation) test, and jackknife test, are often adopted to assess the performance of a predictor. In order to remain consistent with [8,12,13], 10-fold cross-validation was used to assess the proposed method. The benchmark dataset was initially randomly divided into 10 equal subsets. In each iteration, nine subsets were used for training and the remaining one was used for testing. The procedure would be repeated 10 times and the final results were calculated by averaging the 10 testing results. Support vector machine (SVM) was a successful supervised learning method which found extensive use in classification and regression problems. In this work, LibSVM [29] was adopted to perform all the experiments. The system architecture of the proposed model was illustrated in Figure 1.

Assessment of Prediction
where TP, TN, FP, and FN were the abbreviations of true positives, true negatives, false positives, and false negatives. In this paper, MCC was used as the major evaluation index to evaluate the performance of the new proposed predictor. The ROC (Receiver Operating Characteristic) curve was to plot the true positive rate against false positive rate, and the AUC was a reliable measure for evaluating performance.

Preevaluation of the Features.
After finishing the relative entropy selection, two lists, one called coefficient value list and the other called feature list, were obtained. In the relative entropy feature lists, a feature with a bigger coefficient index indicated that it is more important for predicting tyrosine sulfation residues. Subsequently, 472 predictors were built one after another by adding features one by one from the top of the list to the bottom. The mean MCC value for each predictor was given in Figure 2. When 103 features were given, the mean MCC values reach the peak value of 0.88738.

Features Selection and Parameters Optimization.
In this work, we used RES + IFS to perform preevaluation of initial feature set and DFA to perform feature selection and parameters optimization. To evaluate the performance of this scheme, we compared our method with minimum Redundancy Maximum Relevance together with incremental feature selection (mRMR + IFS) in the preevaluation procedure and genetic algorithm (GA) [30] and discrete particle swarm optimization (DPSO) [31] in the second-round feature selection procedure. The experiments of RES + IFS and mRMR + IFS would use grid search to search parameters. GA, DPSO, and DFA would use the preselected 103 features obtained from RES+IFS to perform the second-round feature selection procedure. The parameter configurations were listed in Table 1.
The experimental results were given in Table 2 and DPSO algorithm produced a slight improvement of a MCC of 92.66% and an AUC of 91.79% while it selected the least 62 features. Generally, the DFA performed the best among these three optimization algorithms (a MCC of 94.41% and an AUC of 92.45%). Although DFA selected 3 more features than DPSO did, it produced the highest MCC of 94.41%. Actually, DFA used the least computational time to converge. Thus, in this work, the DFA was chosen as the final optimization algorithm.

Analysis of the Optimal Feature Subset.
In this part, we analyzed the final optimal feature subset in detail and  Figure 4 displayed the various contributions of different types of features. Among the 65 best features, 49 pertained to the evolutionary conservation, 3 to the secondary structure, 2 to the native disorder, and 11 to the physicochemical properties. Obviously, evolutionary conservation occupied the largest part in prediction of tyrosine sulfation residues. As is known to all, various biological species originated from the limited peptides in ancient oceans. Evolution and selection existed in the whole story of life. The evolution in protein includes the mutations, insertions, and deletions of a single residue or some peptides. With the accumulation = 64, = 0.03125 using Gauss kernel function; 2 = 64, = 0.04268 using Gauss kernel function; 3 = 128, = 0.003790 using Gauss kernel function; 4 = 128, = 0.01136 using Gauss kernel function; 5 = 128, = 0.005062 using Gauss kernel function. of time, some unimportant zone may disappear, but the functional regions may remain because they always share some common attributes. This explains why evolutionary conservation played the most important role in the optimal subset. Although only 3 and 2 features were selected from the secondary structure and native disorder, respectively, one could not regard that the secondary structure and native disorder played less important roles in identifying the tyrosine sulfation residues. Actually, nearly 84.75% of features were from the evolution conservation, and only 1.91% of features were from the secondary structure and native disorder. In addition, almost 33.33% and 22.22% among the secondary structure and native disorder were selected in the optimal feature subset. Listed in Supporting Information S1 (see Supplementary Material available online at http://dx.doi.org/10.1155/ 2016/8151509) were the selected features. Table 3 were the experimental results performed by state-of-the-art methods on the independent dataset. Sulfinator [8] used sequence alignment; SulfoSite [12] used solvent accessibility area and maximum weight algorithm; and PredSulSite [13] used secondary structure, grouped weight, and autocorrelation function to construct the training features, respectively. In this paper, we adopted various informative sequence-derived features, namely, evolutional conservation, secondary structure, native disorder and physicochemical properties, and DFA algorithm and SVM, to construct the predicted model. Overall, our method exhibited the best prediction performance.

Comparison with Other Methods. Listed in
The excellent performance could be ascribed to two aspects: (i) the informative features, which included evolutional conservation, secondary structure, native disorder, and physicochemical properties (these features have been proven to be able to successfully distinguish the tyrosine sulfation residues from nonsulfation residues), and (ii) the powerful feature selection and parameter optimization method (this  method included the preevaluation of the features using RES + IFS procedure and the second-round feature selection together with parameters optimization by using DFA).

3.5.
Web-Server of DFA PTSs. DFA PTSs has been constructed and deployed as a free available web-server at http://biolabxynu.zicp.net:9090/DFA PTSs/. Here, we provided a step-by-step guide for biology experimental scientists.
Step 1. Open the web-server and you will find the home page ( Figure 5). Click on the "Introduction" link to see a detailed description about the server. Step 2. Either type or copy and paste the query protein sequences into the input box. DFA PTSs accepts both single or multiple sequences input, which accords with standard FASTA format.
Step 3. Type your email address, and the predicted results will be sent to your email after finishing calculation.
Step 4. Click on the Query button to submit the request. In general, it takes no more than 2 minutes for a protein sequence with less than 300 amino acids.

Conclusions
In this paper, we presented a novel method to identify protein tyrosine sulfation residues. The proposed predictor achieved promising results and outperformed many other state-ofthe-art predictors. The excellent performance should be ascribed to two aspects. The first aspect was the introduction of the informative features. These features included evolutional conservation, secondary structure, native disorder, and physicochemical properties. The second was the effectiveness of elaborate feature selection and parameter optimization schemes. This scheme included two procedures, namely, preevaluation of the features using RES + IFS procedure and the second round of feature selection using DFA. Finally, an optimal set of 67 features, which significantly contributed to the identification of tyrosine sulfation residues, were selected. Our predictor achieved the mean MCC of 94.41% on the benchmark dataset using 10-fold cross-validation, and a MCC of 90.09% on the independent dataset. The experimental performance indicated that our new proposed method could be useful in assisting the discovery of important protein modifications and the feature selection scheme would be powerful in protein function and structure prediction research domains.