Prediction of B-cell Linear Epitopes with a Combination of Support Vector Machine Classification and Amino Acid Propensity Identification

Epitopes are antigenic determinants that are useful because they induce B-cell antibody production and stimulate T-cell activation. Bioinformatics can enable rapid, efficient prediction of potential epitopes. Here, we designed a novel B-cell linear epitope prediction system called LEPS, Linear Epitope Prediction by Propensities and Support Vector Machine, that combined physico-chemical propensity identification and support vector machine (SVM) classification. We tested the LEPS on four datasets: AntiJen, HIV, a newly generated PC, and AHP, a combination of these three datasets. Peptides with globally or locally high physicochemical propensities were first identified as primitive linear epitope (LE) candidates. Then, candidates were classified with the SVM based on the unique features of amino acid segments. This reduced the number of predicted epitopes and enhanced the positive prediction value (PPV). Compared to four other well-known LE prediction systems, the LEPS achieved the highest accuracy (72.52%), specificity (84.22%), PPV (32.07%), and Matthews' correlation coefficient (10.36%).


Introduction
Epitopes, also called antigenic determinants, are clusters of amino acid segments located on the surfaces of an antigen. Epitopes can elicit the immune response and are recognized by specific antibodies [1]. Basically, B-cell epitopes are categorized into two types: linear and conformational. Linear epitopes (LEs) are composed of contiguous amino acid residues within a continuous stretch of a primary protein sequence. Conformational epitopes (CEs) consist of amino acids that are dispersed among discontinuous regions but become aggregated on the protein surface [2,3]. In general, over 90% of B-cell epitopes are discontinuous [4,5]; thus, CEs play critical roles in biological and biomedical applications, including the prevention and neutralization of pathogen infections, and the design of therapeutic drugs.
However, the prediction and identification of CEs within a protein depend on resolved three-dimensional structural information. One major, generally accepted concept is that conformational epitopes cannot be properly formed without binding to a corresponding antibody [6]. Therefore, antigen-antibody cocrystallographic information is a major concern in CE prediction. Moreover, because CEs are discontinuous epitopes, it is difficult to design a peptide that forms the same conformation as the predicted CE. Thus, CEs that are predicted by computational analysis may not be verifiable in biochemical experiments, except with the cocrystallographic approach. Although B-cell LEs account for only a small part of the entire epitope group, they are important in biochemistry [7], virology [8], immunology [9], and vaccine research [10]. Therefore, research and development of accurate computational approaches for LE prediction remains a critical challenge in bioinformatics and computational biology [6]. Most published B-cell LE predictors have been based on the characteristics of amino acids, such as hydrophobicity, surface accessibility, mobility, protrusion area, physico-chemical properties, antigenicity, and pocket characteristics [1,3,11-16]. For example, BcePred [16], BEPITOPE [17], PEOPLE [11], VaxiJen [18], and LEP [12] are bioinformatics tools that use various mathematical approaches to predict LEs according to the physico-chemical propensities of amino acids. Nevertheless, in 2005, Blythe and Flower evaluated the use of physico-chemical propensities of amino acids to predict LEs in proteins; they reported that even the best physico-chemical propensity scales available performed only slightly better than a random model [19]. Hence, it was proposed that, instead of using the antigenicity scale alone, LE prediction may be improved by integration with other computational approaches.

Journal of Biomedicine and Biotechnology
Several machine learning methods have been applied to improve the accuracy of LE prediction. For example, BepiPred combined a hydrophilicity scale with a hidden Markov model [20]; BCPred [21] and FBCPred [22] employed SVM with a subsequence kernel; Söllner and Mayer utilized a molecular operating environment with the decision tree and nearest neighbour approaches [6]. However, these machine learning approaches were mostly set to predict peptides of fixed lengths. This makes it difficult to analyze true LEs, which generally range from 8 to 20 amino acid residues in length [11,23-25]. Epitopes with fixed lengths are not typically sufficient to represent the whole region of antigenic determinants. To overcome the drawbacks of training and/or predicting fixed length epitopes, ABCPred used two artificial neural network methods, the feed-forward network and the recurrent neural network, for the prediction of B-cell LEs [26]. Both networks were used with different window lengths from 10 to 20 amino acids and a two-residue interval.
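The variable-window idea mentioned for ABCPred can be illustrated with a short sketch. It only enumerates candidate windows of lengths 10, 12, ..., 20 along a sequence; the function name `variable_windows` and the example sequence are hypothetical, and this is not ABCPred's actual implementation.

```python
def variable_windows(sequence, min_len=10, max_len=20, step=2):
    """Yield (start, length, peptide) for every window of each allowed length."""
    for length in range(min_len, max_len + 1, step):
        # Slide a window of this length along the whole sequence.
        for start in range(len(sequence) - length + 1):
            yield start, length, sequence[start:start + length]


# Hypothetical 24-residue sequence, used only for illustration.
seq = "MKTAYIAKQRQISFVKSHFSRQLE"
windows = list(variable_windows(seq))
lengths = sorted({length for _, length, _ in windows})
```

Each window would then be scored by a classifier; a fixed-length predictor corresponds to the special case `min_len == max_len`.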
Although bioinformaticians have expended great effort on developing LE predictors, there remains much room for improvement. Theoretically, an epitope identified by experimental immunological or biochemical methods must possess biological antigenicity that can induce antibody production in animals. However, when computational methods are used for the prediction, some experimentally identified epitopes can be missed or ignored. This raises the question of how to retrieve these missed epitopes and enhance their antigenicity scores in silico.
In 2008, LEP was developed for predicting LEs based on physico-chemical propensities combined with a mathematical morphology approach. LEP could retrieve some of the LEs that were locally embedded in the noise signals of the antigenic index [12]. We reasoned that prediction accuracies could be further improved and retain the advantage of variable length conditions, by combining the LEP with machine learning technologies.
As mentioned above, the machine learning methods used in previous LE prediction methods were often trained to predict epitopes with fixed lengths. Chen's study showed that the frequencies of occurrence for some amino acid pairs in the epitope dataset were significantly higher than in non-epitope datasets, or vice versa [23]. We noticed this important statistical feature and applied it to enhance the performance of LE prediction systems. Hence, in order to explore the statistical advantages of verified epitopes and retain the antigenic characteristics of candidate peptides, we decided to extend the concept of amino acid pairs from Chen's study, which only considered peptides with 2 residues.
In this study, we developed a novel B-cell LE prediction system called LEPS (Linear Epitope Prediction by Propensities and Support Vector Machine). The LEPS is freely available for academic use at http://leps.cs.ntou.edu.tw. We adopted the library for SVM (LIBSVM) tool and trained it to recognize features of amino acid segments (AASs) with lengths from 2 to 4 residues. Then, SVM was used to characterize those patterns as epitope and non-epitope clusters [27]. Accordingly, the LEPS approach first applied the physico-chemical propensity and mathematical morphology approaches to obtain LE candidates and then used the AAS features to cluster the predicted LE candidates and remove the less probable LEs.

Testing Datasets and Predictors.
Four datasets were used in this study. The AntiJen dataset was recommended at an international meeting sponsored by the National Institute of Allergy and Infectious Diseases [6] and contained 171 protein sequences with 691 verified, nonoverlapping epitopes [19]. The HIV dataset was a collection of the antigenic determinants located on 10 HIV proteins with 54 nonoverlapping, verified epitopes [39]. The PC dataset, generated in this study, was a collection of 12 protein sequences with 98 nonoverlapping, verified epitopes (Table 1). In order to balance out the variation of each dataset in quantity and antigen diversity, these three datasets were merged into one comprehensive dataset called the "AHP dataset." These datasets were analyzed with four other LE predictors, BepiPred [20], ABCPred [26], BCPred [21], and FBCPred [22], to compare their performance with that of the LEPS developed here.

System Flow.
The proposed system was divided into three main steps (Figure 1(a)). The first step retrieved primitive epitope candidates from a query protein sequence with LEP [12], which was developed in our previous work and was used with the default settings. Then, an SVM classifier was applied to remove less probable epitope candidates and improve prediction accuracies. In the final step, the predicted epitope residues were highlighted in the query sequence and visualized in a predicted structure. The virtual structure was generated from Modeller 9.9, based on homologous protein structure modeling approaches [40].

Training Datasets and SVM Model.
The process of training the SVM model comprised two major steps (Figure 1(b)). The first step (step 1(b)) evaluated the statistical characteristics that determined the frequencies of occurrence of AASs with various lengths from an independent B-cell epitope dataset (Bcipep [41]) and a non-epitope dataset (Chen et al. [23]). The second step (step 2(b)) produced an SVM model that recognized the epitopes and non-epitopes of the Chen dataset based on the statistical features derived from step 1(b).

Table 1, footnote a: Because some of the epitopes in the PC dataset were partial antigen fragments, the serial numbers for the residues in each epitope were assigned according to the sequence information retrieved from the UniProt database [38]. The overlapping amino acids between the experimentally verified and predicted epitopes are shown in bold.

Figure 1 (diagram labels: SVM classifier; Step 2: SVM training): (a) Step 1(a): primitive epitope candidates with globally and locally high antigenicity were extracted by calculating weighting coefficients for various physicochemical propensities of each amino acid. After the filtering process with the SVM classifier (step 2(a)), predicted epitopes were highlighted (step 3(a)) in the query sequence and the simulated structure. (b) Step 1(b): 1230 experimentally verified epitopes and 872 non-epitopes were analyzed to determine the statistical characteristics of AASs. Step 2(b): subsequently, epitope indexes of 872 epitopes and 872 non-epitopes were used to train the SVM model to predict candidate epitopes based on the statistical characteristics defined in step 1(b).
The Bcipep dataset comprised 1230 experimentally verified, nonredundant B-cell LEs with lengths that ranged from 3 to 56 residues and that were identified in over 1000 antigen proteins. This dataset was used in step 1(b) to analyze the statistical characteristics associated with the frequencies of occurrence of AASs of 2 to 4 residues in length that represented epitopes. The Chen dataset contained 872 epitopes and 872 non-epitopes. All epitopes and non-epitopes within this dataset were restricted to a length of 20 residues. These verified epitopes were retrieved from the Bcipep dataset by applying a "truncation-extension treatment." That is, when the length of an LE was longer than 20 residues, an equal number of superfluous residues were truncated from both the N- and C-termini to preserve the central 20 residues. Conversely, when the length of an LE was shorter than 20 residues, an equal number of residues were added to both the N- and C-termini until the epitope comprised 20 residues. The 872 non-epitopes, in contrast, were generated by randomly selecting peptide segments from the Swiss-Prot database [42], with the stipulation that none was the same as any of the 872 epitopes. The 872 non-epitopes were used to analyze the statistical characteristics of AASs for non-epitopes in step 1(b). After determining the statistical features that were associated with frequencies of occurrence, the proposed system applied these features (step 2(b)) to produce an SVM model in a 5-fold cross-validation on the Chen dataset.
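The "truncation-extension treatment" can be sketched as follows. This is a minimal illustration, not the procedure used to build the Chen dataset: the function name `truncate_or_extend`, the 0-based end-exclusive coordinates, the handling of odd length differences, and the shifting of the window at antigen ends are all assumptions of this sketch.

```python
def truncate_or_extend(antigen, start, end, target=20):
    """Return a `target`-residue window centred on the epitope antigen[start:end].

    `start` is 0-based inclusive and `end` is exclusive (assumed convention).
    Longer epitopes lose residues at both termini; shorter ones gain flanking
    residues taken from the parent antigen.
    """
    diff = target - (end - start)    # > 0: extend, < 0: truncate
    left = diff // 2                 # odd differences are split unevenly by one residue
    new_start = start - left
    new_end = new_start + target
    # Assumption: shift the window back inside the antigen if it overruns an end.
    if new_start < 0:
        new_start, new_end = 0, target
    if new_end > len(antigen):
        new_end = len(antigen)
        new_start = max(0, new_end - target)
    return antigen[new_start:new_end]


# Hypothetical 40-residue antigen, used only for illustration.
antigen = "ACDEFGHIKLMNPQRSTVWY" * 2
short_case = truncate_or_extend(antigen, 15, 21)   # 6-residue epitope, extended
long_case = truncate_or_extend(antigen, 5, 35)     # 30-residue epitope, truncated
```

In both cases the result is a 20-residue peptide whose centre coincides with the centre of the original epitope, provided the antigen has enough flanking sequence.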

Statistical Analysis of AASs and Epitope Indexes.
For LE verification, we considered the statistical features to be AASs of 2 ($\mathrm{AAS}^2$), 3 ($\mathrm{AAS}^3$), and 4 ($\mathrm{AAS}^4$) residues in length for both epitopes and non-epitopes. For $\mathrm{AAS}^2$, 400 possible combinations of residue pairs were analyzed for occurrence frequencies within both the epitope and non-epitope datasets. The epitope index ($\mathrm{Epidex}^2_i$) of the $i$th pattern ($\mathrm{AAS}^2_i$) was calculated by taking the logarithm of the ratio of the number of $\mathrm{AAS}^2_i$ among all epitope $\mathrm{AAS}^2$ to the same ratio in the non-epitope $\mathrm{AAS}^2$ group, with the following equation:

$$\mathrm{Epidex}^2_i = \log\left(\frac{f^{2+}_i \big/ \sum_i f^{2+}_i}{f^{2-}_i \big/ \sum_i f^{2-}_i}\right),$$

where $f^{2+}_i$ and $f^{2-}_i$ were the numbers of $\mathrm{AAS}^2_i$ in the epitope and non-epitope datasets, respectively, and $\sum_i f^{2+}_i$ and $\sum_i f^{2-}_i$ denoted the total number of $\mathrm{AAS}^2_i$ in the corresponding dataset. Finally, the values of $\mathrm{Epidex}^2_i$ were normalized to the range of [0, 1] to avoid dominance of any individual $\mathrm{Epidex}^2_i$ in the classifier learning processes.

There were a total of 8000 and 160,000 possible combinations for $\mathrm{AAS}^3$ and $\mathrm{AAS}^4$, respectively. A large portion of $\mathrm{AAS}^3$ or $\mathrm{AAS}^4$ did not appear in the non-epitope dataset; this would cause a problem, because it could lead to a zero in the denominator. Hence, the definitions of $\mathrm{Epidex}^3_i$ and $\mathrm{Epidex}^4_i$ were modified from the definition for $\mathrm{Epidex}^2_i$, and the corresponding epitope indexes $\mathrm{Epidex}^l_i$ for $\mathrm{AAS}^3$ and $\mathrm{AAS}^4$ were defined with $l$ equal to 3 or 4. Again, the values of $\mathrm{Epidex}^3_i$ and $\mathrm{Epidex}^4_i$ were normalized to the range of [0, 1].
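The log-ratio index for residue pairs described above can be sketched in a few lines. The function names `count_segments` and `epidex2`, the toy peptides, the natural logarithm, the min-max scaling, and the choice to skip pairs missing from either dataset are assumptions of this sketch; the modified definition used for segments of 3 and 4 residues is not reproduced here.

```python
import math
from collections import Counter


def count_segments(peptides, length):
    """Count every overlapping amino acid segment (AAS) of the given length."""
    counts = Counter()
    for peptide in peptides:
        for i in range(len(peptide) - length + 1):
            counts[peptide[i:i + length]] += 1
    return counts


def epidex2(epitopes, non_epitopes):
    """Log-ratio epitope index for residue pairs, min-max scaled to [0, 1].

    Pairs absent from either dataset are skipped (an assumption of this
    sketch), because their log ratio is undefined.
    """
    pos = count_segments(epitopes, 2)
    neg = count_segments(non_epitopes, 2)
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    raw = {
        pair: math.log((pos[pair] / pos_total) / (neg[pair] / neg_total))
        for pair in pos if pair in neg
    }
    if not raw:
        return {}
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0          # avoid 0/0 when all raw values coincide
    return {pair: (value - lo) / span for pair, value in raw.items()}


# Toy peptides, invented for illustration only.
epitopes = ["AKAKAK", "KAKD"]
non_epitopes = ["AKDDDD", "DKAD"]
index = epidex2(epitopes, non_epitopes)
```

With 20 amino acids there are 20**2 = 400 possible pairs, 20**3 = 8000 triplets, and 20**4 = 160,000 quadruplets, which is why sparse counts become a concern for the longer segments.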

SVM Features and Model Selection.
In this study, we adopted the SVM as a learning method to classify the epitope and non-epitope peptides. We employed the open source LIBSVM toolbox for executing this classification. In LIBSVM, each instance in the training set possessed one target value (class label) and several features (attributes).
In the testing set, only the features were required for each instance. The objective of SVM was to generate a model from the training set that facilitated the prediction of the target value of each instance in the testing set. In this study, a peptide corresponded to an instance, and the target value (1 […]) marked it as an epitope or a non-epitope. The Chen dataset was used to construct an SVM model based on three feature values and the target values of each epitope and non-epitope. LIBSVM provides four common kernel functions: linear, polynomial, radial basis function (RBF), and sigmoid. We examined these four kernel functions with a 5-fold cross-validation. The training dataset was divided equally into 5 subsets; four of the subsets were used for training the model, and the remaining one was used for testing it. This process was repeated five times, with each subset used once as the testing subset. The RBF kernel was selected as the default kernel function because it provided the best cross-validation accuracy with the training data. Subsequently, the RBF kernel function was applied to the whole training dataset to construct the final SVM classifier in the LEPS.
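The 5-fold cross-validation procedure can be sketched independently of any SVM library. In the sketch below, `fit` and `predict` are caller-supplied stand-ins for a trainer and predictor (for instance, an SVM with a given kernel); the threshold classifier and the toy data are hypothetical and exist only to make the example runnable.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous, near-equal test folds.

    Real data should be shuffled first; this sketch assumes n_samples >= k.
    """
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(features, labels, fit, predict, k=5):
    """Return the mean test accuracy over k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(features), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = fit(train_x, train_y)
        correct = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Hypothetical stand-in classifier: threshold midway between the class means.
def fit_threshold(train_x, train_y):
    pos = [x[0] for x, y in zip(train_x, train_y) if y == 1]
    neg = [x[0] for x, y in zip(train_x, train_y) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x[0] > threshold else 0


# Toy one-feature data, invented for illustration only.
features = [[0.9], [0.1], [0.8], [0.2], [0.7], [0.3], [0.95], [0.05], [0.85], [0.15]]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
mean_accuracy = cross_validate(features, labels, fit_threshold, predict_threshold)
```

Kernel selection then amounts to calling `cross_validate` once per candidate kernel and keeping the one with the highest mean accuracy, before refitting on the full training set.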

Performance Measurement.
To evaluate the performance of the LEPS at the level of the amino acid residue, five indicators were used to measure effectiveness at the default settings. These indicators were (1) sensitivity (SEN), defined as the percentage of epitopes that were correctly predicted as epitopes; (2) specificity (SPE), defined as the percentage of non-epitopes that were correctly predicted as non-epitopes; (3) positive predictive value (PPV), defined as the probability that a predicted epitope was, in fact, an epitope; (4) accuracy (ACC), defined as the proportion of correctly predicted peptides; (5) Matthews' correlation coefficient (MCC), which was a measure of the predictive performance that incorporated both SEN and SPE into a single value between −1 and +1 [26]. These parameters were calculated with the following equations:

$$\mathrm{SEN} = \frac{TP}{TP + FN}, \quad \mathrm{SPE} = \frac{TN}{TN + FP}, \quad \mathrm{PPV} = \frac{TP}{TP + FP}, \quad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP represented the number of true positives; TN, the true negatives; FP, the false positives; FN, the false negatives.

The average epitope length in the PC dataset was 18.9 residues. This was considered a practical length for an epitope to be used in peptide vaccine development or antibody generation. The average epitope lengths in the HIV and AntiJen datasets were 26.4 and 16.3 residues, respectively. All sequences in the PC dataset were analyzed with the LEPS, and the predicted and experimentally verified epitopes are listed in Table 1.
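The residue-level indicators from the Performance Measurement section (SEN, SPE, PPV, ACC, MCC) can be computed with a short sketch that uses their standard confusion-matrix definitions. The function names, the 0-based end-exclusive span convention, the choice to report undefined ratios as 0.0, and the example spans are assumptions of this sketch.

```python
import math


def residue_counts(length, true_spans, predicted_spans):
    """Count TP, TN, FP, FN per residue from 0-based, end-exclusive spans."""
    truth, pred = set(), set()
    for start, end in true_spans:
        truth.update(range(start, end))
    for start, end in predicted_spans:
        pred.update(range(start, end))
    tp = len(truth & pred)           # epitope residues predicted as epitope
    fp = len(pred - truth)           # non-epitope residues predicted as epitope
    fn = len(truth - pred)           # epitope residues that were missed
    tn = length - tp - fp - fn       # everything else in the protein
    return tp, tn, fp, fn


def indicators(tp, tn, fp, fn):
    """Return SEN, SPE, PPV, ACC and MCC; undefined ratios are reported as 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SEN": ratio(tp, tp + fn),
        "SPE": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical 100-residue protein with one verified and one predicted epitope.
counts = residue_counts(100, true_spans=[(10, 30)], predicted_spans=[(20, 35)])
scores = indicators(*counts)
```

In this toy case half of the verified epitope residues are recovered, so SEN is 0.5 while SPE stays above 0.9, which mirrors the trade-off between sensitivity and specificity discussed for LE predictors.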

The Performance of LEPS.
The epitope information collected from the PC, AntiJen, and HIV datasets was used to verify the performance of LEPS. The PC dataset was described in the previous section. The original AntiJen dataset comprised 3619 epitopes, of which 3168 were found in the Swiss-Prot database. As in our previous report, we regenerated the original AntiJen dataset by removing the repeated epitopes [12]. The HIV dataset focused on one infectious pathogen and was recognized as a useful tool in the field of HIV immunology [39]. The AHP dataset combined these three datasets to balance the variations in each dataset, including variations in epitope length and the physico-chemical properties of antigens. With these 4 datasets, we compared the performance of five LE predictors: LEPS, BepiPred [20], ABCPred [26], BCPred [21], and FBCPred [22]. LEPS provided favorable results in all four datasets (Figure 2). Table 2 shows that LEPS displayed the best specificity (SPE), with values of 88.33%, 84.48%, 74.84%, and 84.22% in the PC, AntiJen, HIV, and AHP datasets, respectively. Moreover, LEPS showed the best PPVs, with values of 45.12%, 28.85%, 71.44%, and 32.07% in the PC, AntiJen, HIV, and AHP datasets, respectively. The PPV indicates the rate of identifying real epitopes among all positively predicted candidates and is one of the most important factors in vaccine development. Reducing the number of false positive candidates can improve the effectiveness and efficiency of identifying the real epitopes. Therefore, the LEPS should outperform the other predictors in terms of the cost-effectiveness of biological experiments. In computational science, prediction accuracy is one of the most important criteria for system evaluation. Except in the HIV dataset, LEPS displayed the best ACCs, with values of 61.66%, 73.81%, and 72.52% for the PC, AntiJen, and AHP datasets, respectively. These results showed that LEPS displayed excellent performance for LE prediction.
The LEPS also showed the best performance in the MCC for the AntiJen and AHP datasets (10.10% and 10.36%), and its MCC for the HIV dataset (22.76%) was only a little lower than those of BCPred (29.80%) and FBCPred (27.81%). Taken together, LEPS displayed excellent performance in SPE and PPVs for all four datasets; it also showed the best or equivalent ACCs for all datasets. However, it showed relatively low SEN compared to the other predictors, mainly because it predicted fewer LEs.

The LEPS Platform.
The LEPS provides a user-friendly interface for biologists to predict linear epitope candidates (Figure 3(a)). LEPS accepts input either in FASTA format or as plain text, and the default parameters are set as indicated. In this system, several physicochemical propensities can be dynamically modified by users, including secondary structures, hydropathy, surface accessibility, flexibility, polarity, and other factors. The scanning window size for each parameter is also adjustable. After executing the prediction, the overall antigenicity of the query protein and the predicted LE candidates are displayed. For example, Figure 3 […] respectively. To verify the surface conditions of the predicted LEs within the query protein sequence, a protein structure was simulated based on homologous modeling approaches. This structure can be viewed and analyzed by clicking on the button labeled "predicted structure."

Visualization of the Predicted LEs on 3D Structures.
Predicted structures of the query sequences can be rendered by Jmol (http://www.jmol.org/) in LEPS, and the corresponding PDBs and PyMOL script files (http://www.pymol.org/) are downloadable by request. For example, Figure 4 shows the simulated structure of HIV integrase as predicted by Modeller, with the predicted epitope segments displayed in yellow solid spheres. Because there is a high probability that true epitopes will be exposed on the protein surfaces for binding with antibodies, visualization of the predicted LEs on 3D structures can facilitate the selection of suitable epitopes from predicted candidates according to their surface distributions. Figure 5 shows an example of the experimentally verified epitopes and predicted epitopes for the 10 kDa chaperonin protein in the AntiJen dataset. The yellow spheres in both Figures 5(a) and 5(b) show the true and predicted epitope atoms, respectively. The position of the remaining protein is shown in red and blue solid balls in the two simulated structures. In both cases, most of the epitope residues are located on the protein surface.

Acceptability of Low Sensitivities.
Although LEPS can provide a highly accurate prediction of LEs, its low sensitivity is an issue that remains to be investigated. In general, epitope datasets face the problem that biological experiments do not cover all the true epitopes within an individual antigen. Peptide scanning data can only identify potential epitopes that are recognized by a specific antibody, whereas different antibodies to the same antigen might recognize different epitopes. These biological variations result in low coverage of the epitopes within an antigen [43]. This situation implies that the measured sensitivities of an LE predictor will generally be low. Alternatively, an LE predictor might predict more epitopes across the board to regain sensitivity, at the cost of reduced specificity. This would generally lead to higher experimental costs. Nevertheless, to persuade biologists to conduct in vitro experiments on the predicted potential LEs, the accuracy and MCC values could provide balanced statistics for evaluating the performance of a prediction system. In this study, LEPS displayed high accuracy, MCC, specificity, and PPV, although the sensitivity was a little low. However, the reduced sensitivity was offset by the high PPV. Therefore, the LEPS provides a high probability of success for molecular biologists in predicting and selecting functional epitopes effectively and efficiently.