Prediction of S-Nitrosylation Modification Sites Based on Kernel Sparse Representation Classification and mRMR Algorithm

Protein S-nitrosylation plays a very important role in a wide variety of cellular biological activities. Hitherto, accurate prediction of S-nitrosylation sites is still of great challenge. In this paper, we presented a framework to computationally predict S-nitrosylation sites based on kernel sparse representation classification and minimum Redundancy Maximum Relevance algorithm. As much as 666 features derived from five categories of amino acid properties and one protein structure feature are used for numerical representation of proteins. A total of 529 protein sequences collected from the open-access databases and published literatures are used to train and test our predictor. Computational results show that our predictor achieves Matthews' correlation coefficients of 0.1634 and 0.2919 for the training set and the testing set, respectively, which are better than those of k-nearest neighbor algorithm, random forest algorithm, and sparse representation classification algorithm. The experimental results also indicate that 134 optimal features can better represent the peptides of protein S-nitrosylation than the original 666 redundant features. Furthermore, we constructed an independent testing set of 113 protein sequences to evaluate the robustness of our predictor. Experimental result showed that our predictor also yielded good performance on the independent testing set with Matthews' correlation coefficients of 0.2239.


Introduction
Nitric oxide (NO) has been reported to be an important signaling molecule which involves physiological and pathophysiological regulations of some cellular processes, such as cardiovascular, respiratory, gastrointestinal, reproductive, and host defense [1][2][3][4]. Protein S-nitrosylation which is covalently modified by NO has recently been discovered to play important roles in regulating diverse pathways [5][6][7] and other biological activities [8], such as chromatin remodeling [9], transcriptional regulation [10], cellular trafficking [11], and apoptosis [12]. Also, it has been reported that aberrant S-nitrosylation might contribute to some diseases such as neurodegenerative disorders [1,13] and cancers [14]. Several biochemical approaches have been developed to identify Snitrosylation sites; for example, Forrester et al. [15] used RAC (resin-associated capture) method to isolate SNO protein, and Foster et al. [16] utilized an approach based on protein microarray to screen S-nitrosylation sites.
In contrast to time-consuming and labor-intensive experiments, computational approach is fast and cost-effective. It is reported that there have been at least 170 databases and computational tools concerned with posttranslational modification including protein S-nitrosylation modification [17]. With regard to predicting S-nitrosylation modification sites, Xue et al. [17] developed a software tool named GPS-SNO 1.0; Hao et al. [18] applied support vector machine (SVM), Lee et al. [19] used the maximal dependence decomposition-(MDD-) clustered SVMs, and Li et al. [20] utilized k-nearest neighbor algorithm to deal with the problem. Although computational approach is becoming more and more attractive, prediction of S-nitrosylation sites still remains a great challenge due to the complications of effectively protein encoding.
In the paper, we presented a new computational framework based on kernel sparse representation theory to predict S-nitrosylation sites. The framework consists of two steps: feature extraction and feature selection. Firstly, 666 features were extracted from five categories of amino acid properties, that is, sequence conservation, amino acid factor, secondary structure, solvent accessibility, and amino acid occurrence frequency, and one protein structure feature, the residual disorder. Then, a two-stage feature selection procedure was applied to select an optimal subset from the 666 redundant features. Finally, a webserver for the prediction of S-nitrosylation sites based on kernel sparse representation classification and minimum Redundancy Maximum Relevance algorithm is available at http://www.zhni.net/ snopred/index.html.

Materials
The training and testing sets adopted in the paper were constructed as follows. A total of 645 protein sequences (see Supplementary Material S1 available online at http://dx.doi .org/10.1155/2014/438341) containing S-nitrosylation sites (see Supplementary Material S2) were first collected from open-access databases and the published literatures. Among the 645 protein sequences, 25 were from Uniprot database (version 2011 7) [21], 327 were from a research done by Xue et al. [17], and the other 293 protein sequences were from three recent reviews [22][23][24] on S-nitrosylation identification. The S-nitrosylation sites on the 645 protein sequences are all verified by experiments. Then, the sequence-clustering program CD-HIT [25] was applied to screen the 645 protein sequences. The cutoff value of CD-HIT was 0.4, meaning that the protein sequences having pairwise sequence identity greater than 40% to one another were removed. Finally, 529 protein sequences were left for analysis. Samples were then collected by taking peptides composed of 21 continuous residues with the central residue as cysteine; that is, peptides including a central cysteine and with each 10 residues in the upstream and downstream of the cysteine were picked out. For peptides with cysteine but which were less than 21 residuals, labels "X" were appended to end of the peptides. Thus, there were totally 2516 peptides obtained from the 529 proteins. 827 peptides with S-nitrosylation modification sites were labeled as positive samples and the remaining 1689 peptides were labeled as negative ones. More detailed information about collecting data can be found in our previous work [20]. The 2516 samples were grouped into training dataset and testing dataset at the ratio of 4 : 1; that is, we used 80% of the samples as the training samples, because sufficient samples were needed to train the predictor. Meanwhile, to evaluate the robustness, 20% of the samples were left for the testing. During sample grouping, positive samples and negative samples are distributed in a way so that the ratios of positive-to-negative samples in the training and testing datasets remained the same as that of the whole dataset which is about 1 : 2 (positive-to-negative ratio was 827 : 1689 in the whole date set). Consequently, the training set was composed of 662 positive and 1351 negative samples, and the testing set was composed of 165 positive and 338 negative samples (see Supplementary Materials S3 and S4).
Besides the training and testing sets mainly collected from published literatures, we also constructed an independent testing set with the Uniprot database of the latest version (version 2014 05). We searched the Uniprot database for those protein sequences with S-nitrosylation identification. Then, by deleting the proteins which had been used in the training and testing sets, totally 113 sequences containing Snitrosylation sites were obtained. The 113 sequences were used as the independent testing set (see Supplementary Material S6). Thus, we could do comparison between different methods based on the independent testing set.
The evolutionary conservation of amino acid is very important, which is generally represented as the probability that it would mutate into other 20 kinds of amino acid. By using PSI-BLAST program [26], a 21 × 20 = 420 dimensional vector describing conservation of each peptide was obtained.
Physicochemical or biochemical properties of amino acid were characterized quantitatively as a 5-dimensional vector using amino acid index database [27], whose elements represent properties of polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge, respectively. Except the cysteine, 20 amino acids in a peptide were represented as a 100-dimensional vector.
Disorder status of amino acid was quantified as a disorder score by the predictor of protein disorder [28], and thus, for a peptide, its disorder status was represented by a 21dimensional vector.
Secondary structural properties, that is, "helix, " "strand, " and "others, " and the solvent accessibility, that is, "buried" and "exposed, " of an amino acid were calculated by the predicting software of protein structure and structural feature [29], resulting in a 5-dimensional encoding vector consisting of BioMed Research International 3 0 or 1. A 21 × 5 = 105 dimensional vector represented the secondary structural and solvent accessibility properties of a peptide.
Frequency of the twenty amino acids around nitrosylated cysteine (nitrosylation site was excluded) was also taken into consideration.
Hence, each sample could be represented as a numerical vector containing as many as 666 (420 + 100 + 21 + 105 + 20) features. Table 1 shows the distribution of features. Details of feature construction could be found in our previous work [20].

Feature Selection.
A two-stage feature selection procedure is used to select optimal feature subset from the feature space. The predictor constructed by the optimal feature subset is our final S-nitrosylation sites predictor. The procedure is described as follows.
Stage 1. All features are evaluated by the minimum Redundancy Maximum Relevance (mRMR) algorithm [30] and then ranked according to their mRMR scores.

Stage 2.
Based on the mRMR evaluation, incremental feature selection procedure [31,32] is adopted to search for the optimal feature subset with the help of kernel sparse representation classification (KSRC) algorithm.

mRMR Algorithm. The mRMR algorithm proposed by
Peng et al. [30] is a feature evaluation method based on mutual information. Mutual information is able to quantify the dependency between two variables. The larger the mutual information is, the more the dependency between the two variables is. Mutual information between two random variables and is defined as follows: where function denotes probabilistic or joint probabilistic density.
Mutual information between the feature space Ω = ( 1 , 2 , . . . , ) and the target variable is defined as follows: The mRMR algorithm aims to evaluate feature subsets and then selects the optimal feature subset that meets the minimal redundancy and maximal relevance criteria, that is, the minimal dependency to the entire feature space and Maximal dependency to the target variable can be calculated by the following equation: Thus, the mRMR evaluation can be quantified as score by integrating (3) and (4) into the following equation:

Incremental Feature Selection.
In the implementation, the mRMR criterion is hard to satisfy, especially when the feature space is large. Hence, to attain an optimal feature subset of minimal redundancy and maximal relevance, a heuristic strategy named incremental feature selection [31,32] is adopted for the search of feature subset. Firstly, all the features are scored by (5), by shrinking feature subset to contain only one feature. Secondly, arrange all the features according to their mRMR scores. Thirdly, search for optimal feature subset by an increment means as follows.
Suppose all the features in the feature space Ω have been arranged in the order from high mRMR score to low mRMR score. Beginning from the feature of the highest mRMR score, move features from the scored feature space to the selected feature subset sequentially. When one feature is added, evaluate the classification performance of the feature subset by predictors which are constructed by the KSRC algorithm (see Section 3.2.4 for details). Finally, the feature subset of the highest classification performance is selected as the optimal feature subset and the predictor constructed by the optimal feature subset is the final predictor. In this study, the method used to evaluate the classification performance is presented in Section 3.2.3.

Evaluation
TP and TN represent the numbers of true positive and true negative, respectively. FP and FN represent the numbers of false positive and false negative, respectively. Among the four indicators, MCC is the most significant indicator, which is used to optimize the procedure of feature selection in this study.

KSRC Algorithm.
In this paper, KSRC algorithm is applied to construct predictor. The KSRC algorithm integrates the sparse representation classification (SRC) algorithm and the kernel function technique to fulfill classification task [33,34]. In the following section, we will introduce the SRC algorithm and the kernel function technique, respectively, and then illustrate how to integrate the two techniques.
In the recent years, the SRC algorithm has been successfully applied in these fields of signal recovery, signal encoding, and signal classification [33][34][35][36][37][38][39][40][41]. The principle underlying the SRC algorithm is that testing samples can be represented as linear combination of training samples if the testing and training samples belong to the same category so that the representation coefficient of a testing sample under all training samples might supply sufficient information to determine the category of the testing samples. min ‖ ‖ 0 , subject to = or min ‖ ‖ 0 , subject to where operator ‖ • ‖ 0 denotes the 0 norm, which counts nonzero entries, and operator ‖ • ‖ 2 denotes the 2 norm of a vector, respectively. Since the pursuit of exact solution of (7) and (8) is an NPhard problem [42], the orthogonal matching pursuit (OMP) [43,44] algorithm is used to seek an approximate solution to (7) and (8) in our works. The OMP is an iterative greedy method. Each step of iteration in OMP algorithm contains three operations: (1) computing residual referring to difference between original signal and recovery one, (2) selecting the column with the highest correlation to the current residual, and (3) projecting original signal into the linear subspace spanned by these already selected columns. For convenient description, the following symbols were used. The symbol specified a matrix, referred to the column in the matrix, and Θ consisted of columns of the matrix with the indices Θ. The OMP algorithm is described in Algorithm 1.
Once a coefficient vector was gained by the OMP algorithm, the category of the corresponding testing sample was determined by the following rule: where = (0, 0, . . . , 0, 1 , 2 , . . . , , . . . , 0) was a coefficient whose entries were all zero except (1 ≤ ≤ ) which corresponds to the samples from the class and is equal to the corresponding element from . The details of the SRC algorithm were shown in Algorithm 2.
Nevertheless, the performance of the SRC algorithm might be limited, if the testing samples are not linearly representable in the space of training sample [34]. Therefore, in our work, kernel function technique is applied to project testing sample into higher-dimensional space so as to alter the distributed structures of the samples.
Kernel function technique is a widely used technique that is able to map data from low-dimensional space to higher-dimensional space [34]. A well-chosen kernel function enables original linearly inseparable samples to become linearly separable in the high-dimensional feature space. In our work, the Laplacian kernel function Ψ( , ) = −| − |/ was employed.
Assume that the training samples with classes ] .

Optimal Feature Subset Selection.
First, the mRMR algorithm [30] was applied to the training set, producing a sequence of 666 scored features. Details of the results can be found in Supplementary Material S5. Second, apply incremental feature selection procedure to search optimal feature subset. Figure 1 shows MCC values of each candidate feature subset by using 10-fold cross validation on the training set. The best MCC value is 0.1634, corresponding to the combination of the first 134 features. Therefore, this candidate feature subset was regarded as the optimal subset.
In the implementation, the factor of the Laplacian kernel function in the KSRC algorithm is 100. The sparsity in OMP algorithm was 50. The used OMP algorithm codes are available at the following site: http://www.cs.technion.ac .il/∼ronrubin/software.html [45]. The used mRMR codes are available at http://penglab.janelia.org/proj/mRMR/ [30].

Comparison with Other Algorithms.
As was mentioned in Section 1, quite a few methods have been developed to predict the S-nitrosylation sites in recent years. However, it was difficult to make direct comparisons between them due to the following two reasons. First, different methods usually employed different datasets. It was biased to compare their overall performances based on different datasets. Secondly, we did not know what parameters they used to optimize the predictors. So, it was difficult for us to compare other methods with ours based on the same training and testing datasets.
Notwithstanding this, we attempted to compare our methods with other data mining methods based on our training and testing datasets. Hence, the KSRC algorithm proposed in this paper was compared to five other data mining algorithms: SRC [38], k-nearest neighbor algorithm (KNN) [46], random forest (RF) [47], sequential minimal optimization (SMO) [48], and Dagging [49]. KNN is an instance-based learning algorithm, which is widely used due to its simplicity and efficiency in training. RF is an integration method by combining many tree predictors together. Each tree predictor performs computation based on the values of a random vector sampled independently and with the same distribution for all trees in the forest. SMO is an algorithm that trains the support vector machine. Dagging is an algorithm that ensembles weak classifiers. In terms of implementation, KSRC and SRC were coded in Matlab language by virtue of the OMP package [45]. The computation of KNN, RF, SMO, and Dagging algorithms was performed by Weka (version 3-6-1) [50], which is a collection of learning machine algorithms and is available at http://www.cs.waikato.ac.nz/ml/weka/. In this work, the number of the nearest neighbors in the KNN is 3. The RF, SMO, and Dagging use the default parameters in the Weka. The sparsity of the OMP in the SRC is 50, the Input: the training set with distinct classes { } =1 , the testing sample , and kernel function Ψ Output: the category of the test sample (1) compute the matrix and the test sample such that = by using (12) (2) normalize columns of the matrix with the 2 norm (3) solve (13) using the OMP in Algorithm 1, and obtain the coefficient vector (4) compute = arg min =1,2,...,  Figure 2. The five algorithms attained optimal feature subsets containing 76, 52, 38, 127, and 103 features, respectively. All six algorithms were compared both on the training set and on the testing set with optimal feature subsets of their own. Tables 2 and  3 showed their performances on the training and testing datasets, respectively. As indicated by Table 2 and Figure 2, KSRC could achieve MCC that exceeded 0.16 on the training set. Although SMO and Dagging performed better in terms of the MCC, KSRC showed better SN than that of SMO and Dagging. Table 3 presented the performances of the six algorithms on the testing dataset, which were not previously used in the training. As shown in Table 3, KSRC yielded the highest MCC and SN among all of the six algorithms, while SMO and Dagging showed poor MCC on the testing set. The high MCC and SN of KSRC on both the training and testing datasets indicated that KSRC was more effective and robust than the other five data mining algorithms.
To compare the predictive performances of the 134 optimal features with that of the original 666 features, the 10-fold cross validation and independent tests were also conducted on the training and testing sets by the 666 original features, respectively. Table 4 shows the performance of using original    Table 4 that SN and MCC with the 134 optimal features were much better than those of the original features, though SP is a bit worse. Since the MCC is the most important criterion among the adopted metrics, we conclude that the 134 optimal features performed better than the original 666 features.

Comparison of Algorithms on Independent Testing Set.
Since the training and testing sets were mainly collected from published literatures, we constructed an independent  testing set for the comparison between our method and other methods. The independent testing set contained 113 protein sequences from the latest version of Uniprot database (version 2014 05) (see Section 2 for details). Two existing S-nitrosylation predictors, iSNO-AAPair [51] and iSNO-PseAAC [52], were used for comparison. The comparison results of our predictor, iSNO-AAPair, iSNO-PseAAC, and other five data mining algorithms on the independent testing set were presented in Table 5. As shown in Table 5, the SRC algorithm achieved the highest MCC of 0.2617, and our proposed KSRC algorithm was the second with MCC of 0.2239. The iSNO-AAPair and iSNO-PseAAC predictors attained MCC of 0.1125 and 0.1190, respectively, both of which were only approximately half of the KSRC algorithm. Although the MCC of KSRC algorithm was a little lower than that of SRC algorithm, the KSRC algorithm was the one algorithm that could achieve high and stable performance in both of the testing set and the independent set (as shown in Tables 3 and 5), demonstrating the robustness of the KSRC algorithm among different datasets.

Conclusions
In the paper, we proposed a framework based on the KSRC to computationally identify S-nitrosylation modification sites.
Our experimental results show that KSRC outperforms other state-of-the-art algorithms in terms of the key prediction metrics. The KSRC is an application of kernel function technique to the SRC. Kernel approach can project linearly inseparable samples into high-dimensional feature space with the use of kernel functions. If an appropriate kernel function is selected, the original linearly inseparable samples could become linearly separable in the high-dimensional feature space. Kernelizing of the sparse representation by Laplacian function could improve the separability of the samples and yields higher MCC than those linear classification algorithms, such as KNN and SRC. We believe that the proposed KSRC based framework could become a helpful tool for the prediction and analyses of protein S-nitrosylation.