Prediction of IL4 Inducing Peptides

The secretion of interleukin-4 (IL4) is characteristic of T-helper 2 responses. IL4 is a cytokine produced by CD4+ T cells in response to helminths and other extracellular parasites. It has a critical role in guiding antibody class switching, hematopoiesis and inflammation, and the development of appropriate effector T-cell responses. This study is the first attempt to determine whether IL4 inducing peptides can be predicted. The dataset used in this study comprises 904 experimentally validated IL4 inducing and 742 noninducing MHC class II binders. Our analysis revealed that certain types of residues are preferred at certain positions in IL4 inducing peptides. It was also observed that IL4 inducing and noninducing epitopes differ in composition and motif patterns. Based on our analysis, we developed classification models, of which the hybrid method combining amino acid pairs and motif information performed best, with a maximum accuracy of 75.76% and an MCC of 0.51. These results indicate that it is possible to predict IL4 inducing peptides with reasonable precision. These models would be useful in designing peptides that may induce a desired Th2 response.


Introduction
The cellular immune response to a pathogen is mediated through the processing and presentation of antigen on the cell surface via the major histocompatibility complex (MHC). Exogenous antigens are processed through the lysosome and presented by MHC class II. The peptide loaded on MHC class II interacts with CD4+ T cells, and a pattern of cytokines is synthesized and secreted. Depending upon the cytokines secreted, the T-helper cells polarize into diverse T-cell populations such as Th1, Th2, Th17, or iTregs [1]. In the Th2 cell population, interleukin-4 is the major cytokine secreted. IL4 has been shown to play a critical role in diverse biological activities. This cytokine promotes the proliferation and differentiation of antigen presenting cells [2]. IL4 also plays a pivotal role in antibody isotype switching and stimulates the production of IgE. This cytokine has been applied in the treatment of diseases such as multiple myeloma [3], cancer [4], psoriasis [5], and arthritis [6]. IL4 has also been extensively applied to inhibit the detrimental effects of Th1 [7]. Hence, for the rational development of better immunotherapies and vaccines that protect against infection, it is pivotal to assess the immune response generated by candidate antigens.
Although this assessment requires experimental validation of the immune response generated by each antigen, such investigation is a time-consuming, cumbersome, and expensive task, since the possible antigens and corresponding fragments number in the millions [8][9][10][11]. Thus, the initial screening of a pathogen's whole proteome for potential antigens or antigenic regions demands a systematic computational approach. Over the last decade, tremendous efforts have been made to identify antigenic regions or epitopes within antigens that can activate the desired arm of the immune response against a number of pathogens. This has resulted in the development of a number of software applications, databases, and webservers that assist researchers in designing and selecting antigens to activate various arms of the host immune system, such as humoral, cellular, and innate immunity. A number of methods have been developed to predict antigenic regions/peptides or different types of epitopes, such as MHC class I/II binders, TAP binders, linear/conformational B-cell epitopes, and pathogen associated molecular patterns [12][13][14][15].
In the past, several methods have been developed for identification of MHC class II binders that may activate T-helper cells. These T-helper cells induce different types of cytokines such as IL-4 and IFN-gamma. Presently available methods provide no information about the type of cytokine an MHC II binding peptide will induce. In order to address this issue, an attempt has been made to develop a method for predicting IL-4 inducing MHC II binding peptides.

Datasets Source and Processing
It is important to select the right dataset for developing a prediction method, since the performance of the method is largely dependent on the datasets used for training a model. In this study, datasets were generated from the publicly available Immune Epitope Database (IEDB) [16]. We extracted experimentally validated MHC class II binding T-helper epitopes; peptides shorter than 8 residues or longer than 22 residues were removed. Finally, we obtained 904 unique IL4 inducing and 742 unique noninducing MHC class II peptide sequences; we refer to these peptide sets as the positive and negative sets, respectively. This dataset was created without any restrictions on the host or source of the epitopes.
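The length filter and de-duplication described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `filter_peptides` and the upper-casing step are assumptions, and the input is taken to be a plain list of peptide strings already exported from IEDB.

```python
def filter_peptides(peptides, min_len=8, max_len=22):
    """Keep unique peptides whose length lies in [min_len, max_len].

    The 8-22 residue window follows the text; the function name and the
    normalization to upper case are illustrative assumptions.
    """
    seen = set()
    kept = []
    for pep in peptides:
        pep = pep.strip().upper()
        # Drop peptides outside the length window and exact duplicates.
        if min_len <= len(pep) <= max_len and pep not in seen:
            seen.add(pep)
            kept.append(pep)
    return kept
```

Applying the same function separately to the IL4 positive and IL4 negative exports would yield the two sets used for training.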

Alternative Dataset.
Since the main dataset includes only MHC class II binders, a prediction algorithm based on it can only predict IL4 inducing peptides among MHC class II binders. We were also interested in discriminating IL4 inducing peptides from random peptides. Thus, we created an alternate dataset for building prediction models that can be used for mapping IL4 inducing peptides in antigens. The alternate dataset has random peptides as the negative set instead of non-IL4-inducing peptides. We generated these negative examples from SwissProt proteins.
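One way to draw random negative peptides from a protein collection is sketched below. The source does not state how the random peptides were sampled, so the choices here (uniform choice of protein and start position, lengths supplied by the caller, the name `sample_random_peptides`) are assumptions made only for illustration.

```python
import random


def sample_random_peptides(proteins, lengths, seed=0):
    """Cut one random fragment per requested length from the given proteins.

    `proteins` is a list of protein sequences (for example, read from a
    SwissProt FASTA file); `lengths` lists the desired fragment lengths.
    The sampling scheme is a hypothetical one, not taken from the paper.
    """
    rng = random.Random(seed)
    peptides = []
    for n in lengths:
        # Only proteins at least as long as the fragment can supply it.
        candidates = [p for p in proteins if len(p) >= n]
        if not candidates:
            raise ValueError(f"no protein is long enough for length {n}")
        protein = rng.choice(candidates)
        start = rng.randrange(len(protein) - n + 1)
        peptides.append(protein[start:start + n])
    return peptides
```

Passing the lengths of the positive peptides as `lengths` would give a negative set with a matching length distribution, which is one plausible design choice.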

Peptide Length and Amino Acid Position Analysis.
We first analyzed the IL4 inducing (positive) and noninducing (negative) sequences to determine the preferred peptide length in each set, using an R package to create boxplots [17]. We also examined the preference for specific amino acids at specific positions. For this, we created a two-sample logo from the first 15 N-terminal amino acids of all the peptides, using the Two Sample Logo software [18].

Motif Analysis.
The recognition of functional motifs in peptides or proteins constitutes an important element in the functional annotation of sequences [19]. In the present study, we employed the publicly available software MERCI to select exclusive motifs in IL4 inducing and noninducing MHC class II binding peptides [20]. MERCI compares the positive and negative input sequences and selects the motifs specific to the positive dataset. Thus, to identify specific motifs for both IL4 inducing and noninducing peptides, we analyzed our datasets using a two-step strategy. We first provided MERCI with IL4 inducing peptides as positive input and IL4 noninducing peptides as negative input and extracted the motifs for IL4 inducing peptides. In the next step, we reversed the datasets; that is, we provided MERCI with IL4 noninducing peptides as the positive dataset and IL4 inducing peptides as the negative dataset and obtained the motifs important for IL4 noninducing peptides.
We further explored 100 degenerate motifs using three kinds of classification: (i) none, (ii) Koolman-Rohm [21], and (iii) Betts-Russell [22]. Top 10 motifs were extracted based on their unique sequence coverage. These different classification methods were employed to further discover different motifs in positive and negative peptides. Finally, unique motif containing peptides from both IL4 inducing positive dataset and IL4 noninducing negative dataset were selected, to calculate overall motif coverage in these sequences.

Amino Acid and Dipeptide Compositions.
We next analyzed the residue composition of these IL4 inducer and noninducer peptides. For this, we used in-house Perl scripts to calculate the amino acid composition of the peptide and summarize the intact epitope information in a fixed vector length. The algorithm calculates amino acid composition (AAC) using the following formula, and a vector of dimension 20 is used to represent the amino acid composition of a peptide:

\[
\text{composition of amino acid } (i) = \frac{\text{total number of amino acid } (i) \times 100}{\text{total number of all amino acids in epitope}},
\]

where i can be any amino acid. Likewise, the algorithm calculates dipeptide composition (DPC) and a vector of dimension 400, representing a peptide, using the following formula:

\[
\text{composition of dipeptide } (i + 1) = \frac{\text{total number of dipeptide } (i + 1) \times 100}{\text{total number of all possible dipeptides in epitope}},
\]

where i can be any amino acid and (i + 1) is the dipeptide pair formed with the next residue in the epitope.
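The two composition vectors can be computed in a few lines. The sketch below is written in Python rather than the in-house Perl used in the study, and the names `aac` and `dpc` and the alphabetical residue ordering are assumptions. The number of possible dipeptides in a peptide of length L is taken to be L − 1 (overlapping adjacent pairs).

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, assumed ordering
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aac(peptide):
    """Amino acid composition: percentage of each residue (20 values)."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    n = len(peptide)
    return [100.0 * peptide.count(a) / n for a in AMINO_ACIDS]


def dpc(peptide):
    """Dipeptide composition: percentage of each adjacent pair (400 values)."""
    if len(peptide) < 2:
        raise ValueError("peptide must have at least two residues")
    # Enumerate overlapping adjacent pairs explicitly; str.count would
    # miss overlapping occurrences such as the two "AA" pairs in "AAA".
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    total = len(pairs)
    return [100.0 * pairs.count(d) / total for d in DIPEPTIDES]
```

Because both vectors have a fixed size regardless of peptide length, peptides of 8 to 22 residues can be fed to the same classifier without padding.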

Amino Acid Pairs.
The amino acid pair (AAP) based method represents the input epitope by a vector of fixed length (400) by incorporating the information from each amino acid pair and its propensity in the given dataset. This approach has previously shown its potential for predicting B-cell epitopes [23].
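The text does not give the propensity formula, so the following is only a hedged sketch of one common formulation: each pair is scored by the log ratio of its smoothed frequency in the positive and negative sets, and a peptide is encoded by weighting its pair frequencies with those scores. The add-one smoothing, the function names, and the toy peptides in the usage note are all assumptions.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def pair_counts(peptides):
    """Count overlapping adjacent residue pairs over a list of peptides."""
    counts = dict.fromkeys(PAIRS, 0)
    for pep in peptides:
        for i in range(len(pep) - 1):
            pair = pep[i:i + 2]
            if pair in counts:  # skip pairs with non-standard residues
                counts[pair] += 1
    return counts


def propensity_scale(positives, negatives):
    """Log ratio of add-one-smoothed pair frequencies (assumed formula)."""
    pos = pair_counts(positives)
    neg = pair_counts(negatives)
    pos_total = sum(pos.values()) + len(PAIRS)
    neg_total = sum(neg.values()) + len(PAIRS)
    return {
        p: math.log(((pos[p] + 1) / pos_total) / ((neg[p] + 1) / neg_total))
        for p in PAIRS
    }


def aap_vector(peptide, scale):
    """400-dimensional vector: pair frequency weighted by pair propensity."""
    n_pairs = len(peptide) - 1
    if n_pairs < 1:
        raise ValueError("peptide must have at least two residues")
    counts = pair_counts([peptide])
    return [scale[p] * counts[p] / n_pairs for p in PAIRS]
```

For instance, `propensity_scale(["KEKE"], ["LLLL"])` on these toy inputs gives a positive score to the pair KE and a negative score to LL. Note that if the scale is computed from the whole dataset before cross-validation, information from test folds leaks into the features; computing it on the training folds only avoids this.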

Support Vector Machine Learning Approach.
In this study, classification models were developed using the machine learning technique support vector machine (SVM). To implement the SVM models, we used the software SVMlight [24]. SVM models were developed using different features such as amino acid composition and amino acid pairs. To optimize performance, we tuned the SVM parameters for three types of kernels (linear, polynomial, and radial basis function).

Hybrid Approach.
We further employed a hybrid approach, where we combined the predictions from both motif and model based methods. In a hybrid approach, the weight of +1 was given to the peptide having IL4 motif (exclusively found in IL4 inducing peptides) and −1 was given to peptide having a non-IL4-motif (exclusively found in non-IL4-inducing peptides). We developed several hybrid models depending on the type of vector inputs used for SVM based prediction.
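The scoring rule of the hybrid approach can be sketched as below. Motif matching is reduced to plain substring search purely for illustration (MERCI motifs may contain gaps and residue classes), and the function names, the zero threshold, and the example motifs in the usage note are hypothetical.

```python
def hybrid_score(svm_score, peptide, il4_motifs, non_il4_motifs):
    """Add +1 for an IL4 motif and -1 for a non-IL4 motif to the SVM score.

    Motifs are matched as plain substrings here, which is a simplifying
    assumption; gapped or class-based motifs would need a real matcher.
    """
    score = svm_score
    if any(motif in peptide for motif in il4_motifs):
        score += 1.0
    if any(motif in peptide for motif in non_il4_motifs):
        score -= 1.0
    return score


def is_il4_inducer(score, threshold=0.0):
    """Classify a peptide as an IL4 inducer when its score exceeds the threshold."""
    return score > threshold
```

With a hypothetical IL4 motif `"KEK"`, a peptide containing it and an SVM score of −0.2 would receive a hybrid score of 0.8 and be called an inducer at threshold 0, which shows how a motif hit can overturn a weak SVM decision.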

Evaluating the Performance of Models Validation.
In order to develop reliable prediction models, we trained and tested our models using fivefold cross-validation technique.
In this analysis, the whole dataset is divided randomly into five equal parts; each time, four sets are used for training our models and the remaining set is used for testing. This procedure is repeated five times so that each set is used once for testing and four times for training. The final performance of the model is evaluated by averaging the performance of the models on each set. The performance of the models was measured using the following standard parameters: sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC).
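A minimal sketch of the fold construction and of the four performance measures is given below, using their standard definitions from the confusion matrix (TP, TN, FP, FN). The function names and the seeded shuffle are illustrative assumptions, not details taken from the study.

```python
import math
import random


def five_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k near-equal random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def evaluate(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion counts.

    Assumes at least one positive and one negative example are present.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

Averaging the dictionaries returned by `evaluate` over the five test folds gives the cross-validated figures of the kind reported in the tables.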

Data Analysis.
In order to understand the properties of the IL4 inducers, we analyzed both IL4 inducer (IL4+) and noninducer (IL4−) MHC class II binding epitopes extracted from IEDB. Several studies have exploited physicochemical properties (PCPs) of peptides to discriminate one class of peptide from other classes [25]. We examined various PCPs (such as hydrophilicity, hydrophobicity, charge, steric effect, side bulk, pI, hydropathy, and amphipathy) of IL4+ and IL4− peptides [26]. In our analysis, we calculated the average of each PCP in three different ways: (i) the average of that PCP at a particular position among the 15 N-terminal or C-terminal residues; (ii) the average of the sum of the PCP over all the peptides; for example, the hydrophilicity of every peptide was calculated as the sum of the hydrophilicity of every residue, and the average over all IL4+ peptides was taken; (iii) the average of the sum of the PCP of selected residues of the N and C terminals, after analysis of discriminating residue positions; for example, positions 1, 2, 3, 5, and 12 of the N terminal were selected for hydrophilicity (see Figure 1S).

Peptide Length Analysis.
We first compared the lengths of the two types of peptides and observed that the average length of IL4 inducing and noninducing peptides is not significantly different. We could not find any significant relation between the length of a sequence and its potential to induce IL4 production (Figure 1).

Amino Acid Composition.
We also compared the amino acid composition of IL4+ and IL4− peptides and found a compositional bias between the two types of peptides. It was observed that residues E, F, K, and I are more abundant in IL4 inducing peptides, while IL4 noninducing sequences mainly include G, D, and L (Figure 2).
[...] assays (Figure 2S). We also found 10 alleles for which the IL4 assays were exclusively negative and 47 alleles for which all the IL4 assays were exclusively positive (Figure 3). The most promiscuous MHC allele among those with exclusively IL4 positive assays was HLA-DR7, which could bind 9 different epitopes and induce IL4.

Positional Preference of Residues.
The amino acid compositional analysis described the overall dominant residues in IL4 inducing and noninducing peptides. However, this information does not reveal the positional preference of specific amino acid residues. In order to determine the preference of particular amino acids at different positions of the N- or C-terminus, we created a two-sample logo for our positive and negative IL4 peptides. The two-sample logo depicted in Figure 4 showed that certain residues are preferred at specific positions: in IL4 inducers, charged residues are preferred at the 2nd, 5th, 9th, 10th, and 15th positions, while leucine or proline residues are abundant in non-IL4-inducing peptides at the 1st, 2nd, 5th, 6th, 7th, 12th, and 13th positions. These results suggest that IL4 inducing and IL4 noninducing MHC class II binders can be discriminated on the basis of residue preferences. In order to examine the position-specific tendency of the different PCPs, we calculated the average of every PCP at every position of the N and C terminals, as described in the Methods section. For every PCP, we found various positions showing discriminating values in the plot (Figure 1S); for example, positions 1, 2, 3, 5, and 12 of the N terminal show high hydrophilicity in IL4+ peptides. Based on these observations, we selected discriminating residues for each PCP, as listed in Table 1S. We selected some of the discriminating PCPs and looked at the average of the sum of PCP of IL4+, IL4−, IL4+ Nt, IL4− Nt, IL4+ Ct, and IL4− Ct amino acid sequences (Figure 3S) (see the Methods section). We found that hydrophilicity, pI, amphipathicity, steric, and charge properties discriminate between IL4+ and IL4− data. On the other hand, the sum of PCP of selected residue positions (see the Methods section) showed a significant difference in all the PCPs (Figure 4S).

Motif Search.
We next tried to determine exclusive motifs or patterns in IL4 inducing peptides by using the MERCI software. We used three types of classification, that is, none, Koolman-Rohm, and Betts-Russell, to determine 100 motifs in the peptides. It was observed that the Betts-Russell classification significantly discriminated 205 IL4 inducers from noninducers, and Koolman-Rohm significantly distinguished 150 non-IL4-inducers from IL4 inducers (Table 1).
Collectively, motifs generated from all types of classification discriminated 333 positive and 237 negative peptides. The best motifs generated from each classification are listed in [...] "[aliphatic]-L-[aliphatic]" motif was repeated in 41 IL4 noninducing sequences. Both of these motifs were found to be absent from the alternative datasets.

SVM Based Prediction Model.
We developed prediction models using SVM, which has been widely used in the past for classification [27][28][29][30]. In the present work, we first developed SVM based models using the amino acid and dipeptide composition of the IL4 inducers and noninducers. With these models, we attained a maximum MCC of 0.29 and 0.31, respectively (Table 3). Since we had initially observed that the length of the sequence does not contribute to the IL4 inducing or noninducing potential of the peptide, we also developed SVM models based on amino acid composition and dipeptide composition together with the length of the peptide, and observed no significant improvement in performance (Table 2S).
We further developed another SVM model based on binary profile of amino acids of the peptides, where a vector of dimension 20 represents each residue. The compositional variation plot for each residue in IL4 inducing and noninducing peptides and the performance of SVM model based on binary profile of N-/C-terminal residues were also analyzed (depicted in Table 3S).
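A binary profile of this kind is a one-hot encoding of a fixed-length window taken from one end of the peptide. The sketch below assumes an alphabetical residue ordering and a default window of 9 residues; the function name and the all-zero encoding of non-standard residues are illustrative choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def binary_profile(peptide, window=9, terminus="N"):
    """One-hot encode the first (N) or last (C) `window` residues.

    Each residue becomes a 20-dimensional 0/1 vector, so the result has
    20 * window entries. Peptides shorter than the window are rejected.
    """
    if len(peptide) < window:
        raise ValueError("peptide is shorter than the requested window")
    if terminus == "N":
        segment = peptide[:window]
    elif terminus == "C":
        segment = peptide[-window:]
    else:
        raise ValueError("terminus must be 'N' or 'C'")
    vector = []
    for residue in segment:
        one_hot = [0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # non-standard residues are left as all zeros
            one_hot[idx] = 1
        vector.extend(one_hot)
    return vector
```

Unlike composition features, this encoding preserves residue positions, but it requires every peptide to be truncated to the same window length.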

Hybrid Prediction Model.
We adopted a hybrid approach for the prediction of IL4 peptides by combining the predictions of the SVM model and the motif search. The datasets were first sorted based on the exclusive motifs found in positive and negative peptides using MERCI. The initial search identified 333 IL4 inducing and 237 IL4 noninducing MHC class II binding peptides, and these sequences were weighted by adding +1 and −1, respectively, to the SVM score in the hybrid method. Additionally, in this hybrid approach, we developed four different models using different input features each time (Table 4). This technique resulted in better performance of each of the four hybrid models over the motif or SVM model alone. In the hybrid model combining amino acid pairs and motif search, we obtained a maximum MCC of 0.51. Furthermore, the fivefold cross-validation technique was used to test the robustness of all prediction models.

Models for Discovering IL4 Inducing Peptides.
All the models described above were developed on the main dataset, which contains experimentally validated IL4 inducing and noninducing MHC class II binders. These models can only be used for predicting IL4 inducing peptides if users know that their query peptide is an MHC class II binder. In order to provide a service to the community, we developed models on the alternate dataset that can be used to discover IL4 peptides in proteins/antigens. As described in the Materials and Methods section, the negative set of our alternate dataset consists of random peptides. The models developed on the alternate dataset achieved a maximum accuracy of 70% (Table 5).

Model Validation.
Performance on an independent dataset is one of the best ways to validate a prediction model. As the IEDB database is continuously updated, we extracted 71 novel peptide entries (deposited after extraction of our dataset) from IEDB for MHC class II binders for which the IL4 assay was positive. Out of these 71 peptides, our best model correctly predicted 49 epitopes at the default threshold.

Discussion
With the advent of next generation sequencing techniques, the rational design of vaccines based on the immunogenic features of peptides has become a need of the modern era. Identification of peptides or antigenic regions that can activate all arms of the immune system is important for designing effective immunotherapy or epitope/subunit vaccines. This means that vaccine candidates (peptides/antigens) should contain both B-cell and T-cell epitopes (MHC class I or II peptides). The Th2 response is very important in vaccine or immunotherapy design against extracellular pathogens. IL4 is the principal cytokine that directs commitment of T cells to the Th2 phenotype [31]. Therefore, in this study an attempt has been made to predict IL4 inducing MHC class II binders.
We extracted experimentally validated MHC class II binding peptides with and without IL4 inducing potential from IEDB. We initially analyzed both these datasets to select important features that could underlie the IL4 inducing capability of a peptide. It is well documented that the binding of peptides to the MHC complex is largely dependent on the length of the peptides [32]; thus we also examined the length of both IL4 inducing and noninducing peptides. We observed that the lengths of IL4 inducing and noninducing peptides are in the same range. This is the reason that our peptide composition based SVM models developed with peptide length as an additional feature did not show improved performance (Table 2S). Likewise, the length of the peptide and the conservation of amino acid residues at a specific position also play a crucial role in describing the IL4 inducing properties of peptides.
MHC alleles are well documented in the literature for their capability to skew the immune response [33][34][35]. Our analysis also supports this notion, and we observed 47 MHC alleles that are shown to induce IL4. On comparison of IL4 inducing and noninducing reference sequences, it was observed that charged residues preferentially occupy the 2nd, 5th, 9th, 10th, and 15th positions in IL4 inducing sequences, while aliphatic and aromatic residues largely reside at the 1st, 2nd, 5th, 6th, 7th, 12th, and 13th positions in IL4 noninducing peptides. It is conceivable that such a differential preference of amino acids might be responsible for activating different factors for downstream signaling. Here, it is important to mention that these positional preferences could not be related to the MHC groove, as the information was extracted from sequence comparison of the epitopes. Distinction of different immune epitopes by PCPs has already been reported in the past [36]. In our analysis, PCPs such as hydrophilicity, amphipathy, charge, and pI (Figures 3S and 4S) differed between IL4+ and IL4− peptides, both in the full-length peptides and in the N/C terminal residues. The study with selected residues showed that hydrophilicity, pI, amphipathicity, steric, and charge properties are more pronounced in IL4+ peptides (Table 1S). We next analyzed the IL4 inducing and noninducing reference sequences for the presence of exclusive motifs that may distinguish these two types of sequences, using the MERCI software. With MERCI, exclusive motifs can be searched for by employing the different classifications of amino acids proposed in the literature. We analyzed our reference IL4 inducing and noninducing datasets with these classifications and found that the best results were obtained with the Betts-Russell classification.
The top ten motifs from each classification, based on the uniqueness of their sequence coverage, were capable of distinguishing 333 IL4 inducing epitopes and 237 non-IL4-inducing epitopes.
Table footnote: Negative sign (−) represents gaps with a length of 1-5 residues at that position.
Next, we tried to discriminate IL4 inducers and noninducers by use of machine learning techniques. For analysis of the positional features of a sequence by SVM, binary patterns of the sequences were used as input. Since binary patterns of peptides can only be applied at a fixed length, we generated different binary inputs by varying the window length from 9 to 15 residues at both the N and C terminals of a peptide. The SVM model based on these inputs showed an MCC of 0.18. Further, we analyzed the residue and dipeptide composition vectors of IL4 inducer and noninducer sequences. It was observed that the models based on compositional profiles performed better than the models trained on binary patterns, possibly because this attribute of the sequence does not depend upon the length of the peptide, as it has a fixed feature input of 20 and 400 for the residue composition and dipeptide composition vectors, respectively. We have also developed a hybrid model by combining the information from motifs and machine learning. In the hybrid approach, weightage was given to sequences that could be predicted with the exclusive motifs found using MERCI. We observed that the performance improved up to an MCC of 0.51 using the hybrid of MERCI and amino acid pairs. This could be attributed to the role of propensity and exclusive motifs in the prediction of IL4 inducing epitopes, as was also reported for B-cell epitopes [23]. The AAP feature may carry some bias, as it incorporates weightage information from the whole dataset.
Clinical and Developmental Immunology
The performance of our model on the independent dataset, 69% (49 out of 71), is satisfactory. This performance is comparable with the 78.76% sensitivity obtained in fivefold cross-validation on the training dataset. In summary, we have developed an in silico prediction method that can aid in understanding the IL4 inducing potential of antigens in computer aided rational vaccine design for better control of diseases.

Conclusion
The tendency of an epitope to induce IL4 and skew the immune response towards Th2 makes it of great significance in immunotherapy and vaccine design. Although the induction of an IL4 response is a very complex issue that depends on a number of factors such as the cytokine milieu, the MHC haplotype, the costimulatory molecules, and the peptide itself, the peptide is an important factor that can be controlled easily while designing a vaccine or immunotherapy. The Th2 response also includes other cytokines such as IL5 and IL10; however, we focused only on IL4. Although the experimental evidence for Th2 peptides is limited, our computational analysis appears to support their existence.
Keeping this limitation in mind, we have made an attempt to predict the peptides that may induce an IL4 response. In this study, we evaluated the performance of our models using fivefold cross-validation as well as on an independent dataset. It was observed that our model predicts IL4 inducing peptides with reasonable accuracy. In order to facilitate the scientific community working in the area of subunit vaccines, we have used the above models to develop a webserver, IL4pred (http://crdd.osdd.net/raghava/il4pred/).