Predicting the Types of J-Proteins Using Clustered Amino Acids

J-proteins are molecular chaperones present in a wide variety of organisms, from prokaryotes to eukaryotes. Based on their domain organization, J-proteins can be classified into four types: Type I, Type II, Type III, and Type IV. Different types of J-proteins play distinct roles in influencing cancer properties and cell death. Thus, reliably annotating the types of J-proteins is essential for better understanding their molecular functions. In the present work, a support vector machine based method was developed to identify the types of J-proteins using the tripeptide composition of a reduced amino acid alphabet. In the jackknife cross-validation, a maximum overall accuracy of 94% was achieved on a stringent benchmark dataset. We also analyzed the amino acid compositions using analysis of variance and found distinct distributions of amino acids in each family of J-proteins. To enhance the practical value of the proposed model, an online web server was developed and can be freely accessed.


Introduction
J-protein, also known as Hsp40 (heat shock protein 40 kDa), is a molecular chaperone protein found ubiquitously in both prokaryotes and eukaryotes [1,2]. J-proteins represent a large family of molecular chaperones and function cooperatively with Hsp70. Most J-proteins contain a "J" domain through which they can interact with and stimulate Hsp70. Based on their structural differences, J-proteins can be classified into four types: Type I, Type II, Type III, and Type IV. Type I J-proteins contain an N-terminal J-domain that is separated from the rest of the protein by a linker "G/F" region (glycine/phenylalanine region) [3,4]. Distal to the G/F region is a zinc-binding cysteine-rich sequence named the "zinc-finger domain", which distinguishes Type I proteins from the other types of J-proteins [4]; the zinc-finger domain is followed by the C-terminal domain [1,2]. Type II proteins possess all the domains of Type I except the zinc-finger domain [3]. Type III J-proteins contain a C-terminal J-domain but lack both the G/F and zinc-finger domains [3]. Type IV, also known as the J-like protein [5], is a group of recently identified proteins that lack the histidine, proline, and aspartate signature motifs in their sequences [4].
By binding Hsp70 and Hsp90, J-proteins play important roles in chaperone cycle regulation and control many physiological functions [4], such as assisting the folding of nascent and damaged proteins, the translocation of polypeptides across cellular membranes, and the degradation of misfolded proteins [6]. Studies carried out in the past decade have also shown the regulatory roles of J-proteins in cell death. In association with Hsp70, J-proteins are not only involved in the folding of caspase-activated DNase, which is responsible for apoptosis-induced DNA fragmentation [7], but also protect macrophages from nitric-oxide-mediated apoptosis [8]. Gotoh and his colleagues have demonstrated the role of J-protein in inhibiting Bax translocation to the mitochondria to prevent nitric-oxide-induced cell apoptosis [9]. Kurisu et al. found that MDG1/ERdj4, a member of the human J-protein family, can interact with GRP78/BiP and protect against the cell death induced by endoplasmic reticulum stress in humans [10]. The regulation of cell death by J-protein has also been reported in plants. Liu and Whitham found that the overexpression of J-protein stimulated hypersensitive response (HR)-like cell death in soybean [11]. Cancer progression is also reported to be closely related to J-proteins, but different types of J-proteins play distinct roles [12,13]. Type I J-proteins are tumour promoting, while Type II J-proteins act as tumour suppressors [13]. Therefore, reliably annotating the types of J-proteins is of major importance in order to clarify their distinct biological functions in cell death. However, to the best of our knowledge, there is no computational method for predicting the types of J-proteins.

Table 1
Type I J-protein      63
Type II J-protein     53
Type III J-protein    1107
Type IV J-protein     22
Keeping this in mind, in the present work we propose a model to predict the four functional types of J-proteins based on reduced amino acid alphabet compositions. According to a recent review [14], the rest of the paper is organized as follows: (i) construct a valid benchmark dataset to train and test the predictor; (ii) formulate the samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) select a powerful machine learning method to operate the prediction; (iv) perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) provide a web server for the prediction method.

Dataset.
The J-protein sequences were taken from the HSPIR database at http://pdslab.biochem.iisc.ernet.in/hspir/, which currently contains 3,901 J-protein sequences [15]. To reduce homology bias, J-proteins that have ≥40% pairwise sequence identity to each other were removed using the CD-HIT program [16]. By doing so, we obtained a benchmark dataset containing 1,245 J-proteins that were classified into four types: 63 Type I J-proteins, 53 Type II J-proteins, 1,107 Type III J-proteins, and 22 Type IV J-proteins (Table 1). The benchmark dataset can be freely downloaded from http://lin.uestc.edu.cn/server/iJPred/data.

Reduced Amino Acid Alphabet.
Based on their physicochemical properties, the 20 native amino acids can be clustered into a smaller number of representative residues called a reduced amino acid alphabet (RAAA) [17–19]. Compared with the traditional amino acid composition, RAAA not only reduces the complexity of the protein system but also improves the ability to find structurally conserved regions and the structural similarity of entire proteins.
Hence, in the present study, the J-proteins were encoded using the RAAA as formulated by the discrete feature vector $\mathbf{P}$:

$$\mathbf{P} = [f_1, f_2, \ldots, f_i, \ldots, f_D]^{\mathrm{T}}, \tag{1}$$

where T is the transposing operator and $f_i$ is the occurrence frequency of the $i$th $n$-peptide RAAA, defined as

$$f_i = \frac{n_i}{L - n + 1}, \tag{2}$$

where $n_i$ is the number of the $i$th $n$-peptide ($n$ = 1, 2, or 3) RAAA in a J-protein with length of $L$. For the different cluster profiles (Table 2) and different values of $n$, the vector dimension ($D$) in (1) will be different. The corresponding dimensions of the reduced amino acid ($n$ = 1) composition, reduced dipeptide ($n$ = 2) composition, and reduced tripeptide ($n$ = 3) composition are listed in Table 3.
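The encoding described above can be illustrated with a minimal Python sketch. The eight-cluster grouping in `CLUSTERS` is a hypothetical profile chosen only for illustration (the actual cluster profiles are those of Table 2), the function names are our own, and the normalization by the number of overlapping n-peptide windows is an assumption about how the occurrence frequency is computed.

```python
from itertools import product

# Hypothetical 8-cluster profile, for illustration only; the cluster
# profiles actually used in the paper are those of its Table 2.
CLUSTERS = ["AGST", "C", "DE", "FWY", "H", "ILMV", "KR", "NPQ"]


def reduce_sequence(seq, clusters=CLUSTERS):
    """Replace each residue by the first letter of its cluster.

    Characters that belong to no cluster (e.g. 'X') are silently dropped.
    """
    lookup = {aa: group[0] for group in clusters for aa in group}
    return "".join(lookup[aa] for aa in seq.upper() if aa in lookup)


def raaa_composition(seq, n=3, clusters=CLUSTERS):
    """Return the reduced n-peptide frequency vector of a sequence.

    The vector has len(clusters) ** n entries. Each count is divided by
    the number of overlapping n-peptide windows (assumed normalization).
    """
    reduced = reduce_sequence(seq, clusters)
    alphabet = [group[0] for group in clusters]
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    windows = len(reduced) - n + 1
    if windows <= 0:
        return [0.0] * len(keys)
    for i in range(windows):
        counts[reduced[i:i + n]] += 1
    return [counts[k] / windows for k in keys]


if __name__ == "__main__":
    vector = raaa_composition("MGKDYYEILGV", n=3)
    print(len(vector))  # 8 clusters and n = 3 give 8 ** 3 entries
```

With eight clusters, the tripeptide vector has 8^3 = 512 entries, which matches the dimension reported for CP(8).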

Support Vector Machine (SVM).
SVM is a powerful and popular method for pattern recognition that has been widely used in the realm of bioinformatics [28–41]. The basic idea of SVM is to transform the data into a high dimensional feature space and then determine the optimal separating hyperplane using a kernel function. To handle a multiclass problem, "one-versus-one (OVO)" and "one-versus-rest (OVR)" methods are generally applied to extend the traditional SVM. For a brief formulation of SVM and how it works, see the papers [28,29].
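As a rough illustration of the one-versus-one idea, the sketch below builds one binary classifier per pair of classes (six pairs for four types) and assigns a sample to the class with the most votes. The toy one-dimensional classifiers, the `centers` values, and all function names are hypothetical stand-ins and are not trained SVMs.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["Type I", "Type II", "Type III", "Type IV"]

# Toy 1-D class centers, hypothetical values used only to build the demo.
centers = {"Type I": 0.0, "Type II": 1.0, "Type III": 2.0, "Type IV": 3.0}


def make_toy_classifier(a, b):
    """Binary stand-in classifier: pick whichever class center is closer."""
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b


def ovo_predict(sample, pairwise_classifiers):
    """Majority vote over all pairwise classifiers.

    Ties go to the class listed first in CLASSES.
    """
    votes = Counter(clf(sample) for clf in pairwise_classifiers.values())
    return max(CLASSES, key=lambda c: votes[c])


pairs = list(combinations(CLASSES, 2))  # 4 * 3 / 2 = 6 class pairs
classifiers = {pair: make_toy_classifier(*pair) for pair in pairs}

if __name__ == "__main__":
    print(len(pairs), ovo_predict(1.9, classifiers))
```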
In the current study, the LIBSVM 2.84 package [42] was used as an implementation of SVM, which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. The OVO method was employed for making predictions using the popular radial basis function (RBF) kernel. The regularization parameter $C$ and the kernel width parameter $\gamma$ were determined by a grid search with fivefold cross-validation. In the grid search, the search spaces for $C$ and $\gamma$ range from $2^{15}$ to $2^{-5}$ and from $2^{-5}$ to $2^{-15}$, with steps of $2^{-1}$ and $2$, respectively.

The performance of the model was evaluated using Matthew's correlation coefficient (MCC) and overall accuracy (OA), defined as follows:

$$\mathrm{MCC}(i) = \frac{\mathrm{TP}(i)\times\mathrm{TN}(i) - \mathrm{FP}(i)\times\mathrm{FN}(i)}{\sqrt{[\mathrm{TP}(i)+\mathrm{FP}(i)]\,[\mathrm{TP}(i)+\mathrm{FN}(i)]\,[\mathrm{TN}(i)+\mathrm{FP}(i)]\,[\mathrm{TN}(i)+\mathrm{FN}(i)]}},$$

$$\mathrm{OA} = \frac{1}{N}\sum_{i=1}^{M}\mathrm{TP}(i),$$

where TP($i$), TN($i$), FP($i$), and FN($i$) represent the true positives, true negatives, false positives, and false negatives of family $i$; $M$ is the number of subsets and equals 4, while $N$ is the total number of J-proteins in the benchmark dataset.
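The two measures can be computed from a multiclass confusion matrix as in the following sketch. The matrix `toy` contains made-up counts for illustration only (it is not a result from this study), and the function names are our own; rows are assumed to be true classes and columns predicted classes.

```python
from math import sqrt


def per_class_counts(confusion, i):
    """Return (TP, TN, FP, FN) for class i, treating it as one-vs-rest."""
    size = len(confusion)
    tp = confusion[i][i]
    fn = sum(confusion[i]) - tp
    fp = sum(confusion[r][i] for r in range(size)) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp
    return tp, tn, fp, fn


def mcc(confusion, i):
    """Matthew's correlation coefficient of class i (0.0 if undefined)."""
    tp, tn, fp, fn = per_class_counts(confusion, i)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(map(sum, confusion))
    return sum(confusion[i][i] for i in range(len(confusion))) / total


# Made-up 4-class confusion matrix, illustrative counts only.
toy = [
    [5, 1, 0, 0],
    [1, 4, 1, 0],
    [0, 0, 8, 0],
    [0, 0, 1, 3],
]

if __name__ == "__main__":
    print(round(overall_accuracy(toy), 4))
    print([round(mcc(toy, i), 3) for i in range(4)])
```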

Cross-Validation.
Three cross-validation methods, namely, the subsampling (or K-fold cross-validation) test, the independent dataset test, and the jackknife test, are often used to evaluate the quality of a predictor [43]. Among the three methods, the jackknife test is deemed the least arbitrary and most objective, as elucidated in [44], and hence has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors [31,34,45–50]. Accordingly, the jackknife test was used to examine the performance of the model proposed in the current study.
In the jackknife test, each sequence in the training dataset is in turn singled out as an independent test sample and all the rule parameters are calculated without including the one being identified. The jackknife results obtained by the proposed model on the benchmark dataset based on the five different cluster profiles for the tripeptide (i.e., n = 3) case are listed in Table 4. As can be seen from Table 4, the best success rate of 94.06% was achieved when the predictions were based on CP(8) with a dimension of 512. For comparison, the results of the amino acid (i.e., n = 1) and dipeptide (i.e., n = 2) cases were also calculated and are listed in Table 5, from which we can see that none of them has a higher success rate than the case of n = 3.
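The jackknife (leave-one-out) procedure can be sketched generically as below. A nearest-centroid classifier is used purely as a lightweight stand-in so that the example runs without external libraries; it is not the SVM used in this study, and the toy data and all function names are hypothetical.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave each sample out in turn, retrain, and test on the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_centroids(xs, ys):
    """Stand-in learner: the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Assign x to the class whose centroid is nearest (squared distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))


# Toy two-class data, made up for illustration.
X = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]
y = ["A", "A", "A", "B", "B", "B"]

if __name__ == "__main__":
    print(jackknife_accuracy(X, y, train_centroids, predict_centroid))
```

Because the model is retrained once per sample, the cost grows linearly with the dataset size, which is manageable for a benchmark of 1,245 sequences.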
In our previous study [27], the six HSP families were successfully classified using the dipeptide composition of RAAA. For the classification of the J-protein subfamilies in the present work, however, the best predictive result was obtained using the tripeptide composition of RAAA. Hsps belonging to the same family share more sequence identity than those of different families [5]; hence, more suitable parameters, such as those used in the current study, are needed to encode the protein sequences.

Comparison with Other Methods.
Since there is no published work on predicting the types of J-proteins, we could not compare the present model with existing results to confirm that it is superior to other methods. However, for the purpose of comparison, we compared the results of the present model with those of Random Forest and Naïve Bayes using the same optimal features (the reduced tripeptide compositions based on CP(8)). The results of the jackknife test on the benchmark dataset for Random Forest and Naïve Bayes are listed in Table 6. The accuracy of SVM is higher than that of both Random Forest and Naïve Bayes.

Amino Acids Composition Analysis.
To provide an overall view, the frequencies of the 20 native amino acids were compared among the four types of J-proteins using analysis of variance (ANOVA), and the average frequency of each amino acid in one type of J-protein was further compared with that in another type using Fisher's least significant difference (LSD) test. The result is given in Figure 1, where the green boxes indicate that the frequency differences among different types of J-proteins are not significant, while the blue and red boxes indicate that the frequency differences are significant (p < 0.05; LSD test) among different types of J-proteins (see Figure 1 for more details). We found that, except for Asn (N), the frequencies of all the other 19 amino acids are significantly different among the four types of J-proteins. Compared with the other three types, Type I J-proteins are enriched in Cys (C), Gly (G), and Thr (T), Type II J-proteins are enriched in Phe (F), and Type III J-proteins are enriched in Ala (A) and Leu (L), while Type IV J-proteins are enriched in Met (M), Gln (Q), Glu (E), and Pro (P) but lack Asp (D) and His (H). The lack of D and H residues in Type IV J-proteins leads to their inability to stimulate ATP hydrolysis [5].

Figure 1: The green boxes indicate that the frequency differences among different types of J-proteins are not significant. The blue boxes indicate that the amino acid is significantly enriched (p < 0.05; LSD test) in one type of J-proteins compared with its counterpart. Taking W as an example, the blue box with the coordinate (W, I-IV) indicates that W is enriched in Type I J-proteins compared with Type IV J-proteins. The red boxes indicate that the amino acid is lacking in one type of J-proteins but significantly enriched (p < 0.05; LSD test) in its counterpart. Also taking W as the example, the two red boxes with the coordinates (W, I-III) and (W, II-III) indicate that W is lacking in both Type I and Type II J-proteins compared with Type III J-proteins.
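For readers unfamiliar with the test, the one-way ANOVA F statistic for a single amino acid can be computed as in the sketch below, where each group would hold the per-sequence frequencies of that amino acid in one J-protein type. The numbers in `toy_groups` are made up for illustration, the function name is our own, and converting F to a p-value (which requires the F distribution) and the LSD post hoc comparison are left out.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Made-up values, one list per group, for illustration only.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]

if __name__ == "__main__":
    print(one_way_anova_f(toy_groups))
```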
Moreover, according to the binomial distribution [51], we also identified the over-represented tripeptides in each family and listed them, together with their confidence levels, in Supporting Table S1 in the Supplementary Material available online at http://dx.doi.org/10.1155/2014/935719. These results indicate that the distinct distributions of amino acids in the four types of J-proteins may account for their distinct functions in biological processes.

Figure 2: A semiscreenshot to show the top page of the web server. It is available at http://lin.uestc.edu.cn/server/Jpred.
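One common way to attach a confidence level to an over-represented tripeptide with the binomial distribution is sketched below. It assumes that the tripeptide occurs `n` times in the whole dataset, `k` of them in the family of interest, and that `p` is the fraction of all tripeptide occurrences contributed by that family; this particular formulation, the example numbers, and the function names are assumptions for illustration and may differ from the exact procedure in [51].

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, m) * p ** m * (1 - p) ** (n - m) for m in range(k, n + 1))


def confidence_level(k, n, p):
    """1 - P(X >= k): close to 1 when seeing k or more by chance is unlikely."""
    return 1.0 - binomial_tail(k, n, p)


if __name__ == "__main__":
    # Hypothetical counts: 30 occurrences overall, 12 of them in a family
    # that contributes 10% of all tripeptide occurrences.
    print(confidence_level(12, 30, 0.10))
```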

Web Server Guide.
To enhance the value of the practical applications of the proposed model and for the convenience of the vast majority of experimental scientists, an online predictor was developed. The step-by-step guide on how to use it is provided as follows.
(1) Open the web server at http://lin.uestc.edu.cn/server/Jpred and you will see the top page as shown in Figure 2. Click on the Read Me button to see a brief introduction about the predictor and the caveat when using it.
(2) Either type or copy/paste the query J-protein sequence into the input box at the center of Figure 2. The input protein sequence should be in the FASTA format, an example of which can be seen by clicking on the Example button right above the input box.
(3) Click on the Submit button to see the predicted result. For example, if you use the four query J-protein sequences in the Example window as the input, after clicking the Submit button, you will obtain the following results: the outcome for the 1st query sample is "Type I J-protein"; the outcome for the 2nd query sample is "Type II J-protein"; the outcome for the 3rd query sample is "Type III J-protein"; the outcome for the 4th query sample is "Type IV J-protein".
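For users preparing their own input, the sketch below shows what FASTA-formatted text looks like and how it can be checked locally before submission. The record names and sequences in `example` are made up, and `parse_fasta` is our own helper, not part of the web server.

```python
def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs.

    A record starts with a '>' header line; the sequence lines that follow
    are concatenated. Blank lines and text before the first header are ignored.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up query records, for illustration only.
example = """>query1
MGKDYYEILGV
SKDASE
>query2
MAKRDYYEV
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```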

Conclusion
Cell death is a common phenomenon in developmental processes and under normal physiological conditions and is induced by an array of extra- or intracellular stimuli [7]. However, organisms are equipped with their own physiological defenses to cope with environmental stress in order to prevent or induce cell death depending upon the severity of the stress [7]. In mammalian cells, the stress response involves the induction of Hsps, such as Hsp70 and Hsp90. By interacting with J-proteins, these Hsps play pivotal roles in the regulation of cell death. Since J-proteins act as intermediates, the analysis of J-protein functions is urgently needed in order to clarify the regulatory roles of Hsps in cell death. Based on a combination of whole-genome analyses and biochemical evidence, a large number of J-proteins have been identified [6]. However, the exact roles of many of the J-proteins are far from being understood [2,52]. To understand the biological functions of a given J-protein, it is highly desirable to know which family it belongs to.
By encoding the sequences using the reduced amino acid alphabet information, a predictor was developed to identify the four different families of J-proteins in the present work. To enhance the value of the practical applications of the proposed model and for the convenience of the experimental scientists, an online web server was provided and can be freely accessed at http://lin.uestc.edu.cn/server/Jpred. We hope that the present model will be helpful for scientists who focus on J-proteins and will provide novel insights into the research of cell death.

Conflict of Interests
There is no conflict of interests with any financial organization regarding this paper.