INeo-Epp: A Novel T-Cell HLA Class-I Immunogenicity or Neoantigenic Epitope Prediction Method Based on Sequence-Related Amino Acid Features

In silico T-cell epitope prediction plays an important role in immunization experimental design and vaccine preparation. Currently, most epitope prediction research focuses on peptide processing and presentation, e.g., proteasomal cleavage, transporter associated with antigen processing (TAP), and major histocompatibility complex (MHC) binding. To date, however, the mechanism underlying the immunogenicity of epitopes remains unclear. It is generally agreed that T-cell immunogenicity may be influenced, to different degrees, by the foreignness, accessibility, molecular weight, molecular structure, molecular conformation, chemical properties, and physical properties of target peptides. In this work, we tried to combine these factors. First, we collected experimental HLA-I T-cell immunogenic peptide data, as well as amino acid properties potentially related to immunogenicity. Several characteristics were extracted, including the amino acid physicochemical properties of the epitope sequence, peptide entropy, the eluted ligand likelihood percentile rank (EL rank(%)) score, and a frequency score for immunogenic peptides. Subsequently, a random forest classifier for T-cell immunogenic HLA-I-presented antigen epitopes and neoantigens was constructed. The classification results for the antigen epitopes outperformed previous research (the optimal AUC = 0.81, external validation data set AUC = 0.77). As mutational epitopes generated by the coding region contain alterations of only one or two amino acids, we assume that these characteristics might also be applied to the classification of endogenic mutational neoepitopes, also called "neoantigens." Based on mutation information and sequence-related amino acid characteristics, a prediction model for neoantigens was established as well (the optimal AUC = 0.78). Further, an easy-to-use web-based tool, "INeo-Epp," was developed for the prediction of human immunogenic antigen epitopes and neoantigen epitopes.


Introduction
An antigen consists of several epitopes, which can be recognized by B- or T-cells and/or molecules of the host immune system. However, usually only a small number of the amino acid residues that comprise a specific epitope are necessary to elicit an immune response [1]. The properties of the amino acid residues that cause immunogenicity are unknown. HLA-I antigen peptides are processed and presented as follows: (a) cytosolic and nuclear proteins are cleaved into short peptides by intracellular proteinases; (b) some are selectively transferred to the endoplasmic reticulum (ER) by the TAP transporter and are subsequently processed by endoplasmic reticulum aminopeptidase; and (c) antigen-presenting cells (APCs) present peptides containing 8-11 amino acid (AA) residues on HLA class I molecules to CD8+ T-cells [2]. Researchers can now simulate antigen processing and presentation by computational methods to predict peptide-MHC complexes (p-MHC). Several software systems have been developed, including NetChop [3], NetCTL [4], NetMHCpan [5], and MHCflurry [6]. However, although most of these peptides are predicted to bind MHC molecules, only 10%~15% of them have been shown to be immunogenic [7][8][9][10]. For neoantigens, the proportion is approximately 5% (range: 1%-20%) owing to central immunotolerance [11,12]. As a result, the cycle for vaccine development and immunization research is extended. Here, we aim to develop a T-cell HLA class-I immunogenicity prediction method that further identifies real epitopes/neoepitopes among p-MHC and thereby shortens this cycle.
Many experimental human epitopes have been collected and summarized in the immune epitope database (IEDB) [13], which makes it feasible to predict human epitopes mathematically. However, two limitations remain: (i) the high level of MHC polymorphism poses a severe challenge for T-cell epitope prediction and (ii) the data are distributed extremely unequally between epitopes and nonepitopes. This makes it difficult to analyze the potential deviation in TCR recognition that arises from the presentation of peptides by different HLA molecules. A general analysis of all HLA-presented peptides, ignoring the specific pattern of TCR recognition of individual HLA-presented peptides, may result in a lower predictive accuracy.
With the advances in HLA research, Sette and Sidney [14] classified, for the first time, overlapping peptide binding repertoires into nine major functional HLA supertypes (A1, A2, A3, A24, B7, B27, B44, B58, and B62). In 2008, Sidney et al. [15] made a further refinement, in which over 80% of the 945 different HLA-A and B alleles can be assigned to the original nine supertypes. It has not been reported whether peptides presented by different HLA alleles influence TCR recognition. Hence, we collected experimental epitopes according to HLA alleles and assumed that epitopes belonging to the same HLA supertypes have similar properties.
Moreover, screening for endogenic mutational neoepitopes is one of the core steps in tumor immunotherapy. In 2017, Ott et al. [16] and Sahin et al. [17] confirmed that peptide and RNA vaccines made up of neoantigens in melanoma can stimulate CD8+ and CD4+ T-cells and induce their proliferation. In addition, recent research suggests that neoantigen vaccination not only can expand the existing specific T-cells but also can induce a wide range of novel T-cell specificities in cancer patients and enhance tumor suppression [18]. Meanwhile, a tumor can be better controlled by combining a neoantigen vaccine with programmed cell death protein 1 (PD-1)/PD-1 ligand 1 (PD-L1) therapy [19,20]. Nevertheless, a considerable number of predicted candidate p-MHC from somatic cell mutations may be false positives, which would fail to stimulate TCR recognition and an immune response. This is undoubtedly a challenge for designing vaccines against neoantigens.
In our study, based on HLA-I T-cell peptides collected from experimentally validated antigen epitopes and neoantigen epitopes, we aim to build a novel method that further narrows the range of immunogenic epitope screening based on predicted p-MHC. Finally, a simple web-based tool, INeo-Epp (immunogenic epitope/neoepitope prediction), was developed for the prediction of human antigen and neoantigen epitopes.

Materials and Methods
The flow chart for "INeo-Epp" prediction is shown in Figure 1.

Construction of Immunogenic and Nonimmunogenic Epitopes. Peptides that can promote cytokine proliferation are considered to be immunogenic epitopes. However, nonimmunogenic epitopes may result from the following reasons: (a) p-MHC is truly unrecognized by TCR, (b) peptides are not presented by MHC (quantitatively expressed as rank(%) > 2, see Rank(%) Score (C24) for details), and (c) negative selection/clonal presentation is induced by excessive similarity to autologous peptides [21]. In this work, to further study the recognition preferences of T-cells, peptides with rank(%) > 2 were regarded as not in contact with TCR, and sequences 100% matching the human reference peptides (ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/pep/) were regarded as exhibiting immune tolerance. Hence, we removed these from the definition of nonimmunogenic peptides.
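The two exclusion rules described above can be illustrated with a short Python sketch. It is not the authors' pipeline: the function name, the input layout (peptide, rank(%) pairs), and the reading of "100% matching" as an exact substring match against a reference protein sequence are all assumptions made for illustration.

```python
def build_nonimmunogenic_set(candidates, human_reference, max_rank=2.0):
    """Filter T-cell-negative peptides down to usable nonimmunogenic examples.

    candidates: iterable of (peptide, rank_percent) pairs (hypothetical layout).
    human_reference: iterable of human reference protein sequences.
    max_rank: rank(%) cutoff above which a peptide counts as not presented.
    """
    reference = list(human_reference)
    kept = []
    for peptide, rank_percent in candidates:
        # Rule (b): rank(%) > 2 is treated as "not presented by MHC".
        if rank_percent > max_rank:
            continue
        # Rule (c): an exact match to a human reference sequence is treated
        # as immune tolerance (assumption: exact substring match).
        if any(peptide in protein for protein in reference):
            continue
        kept.append(peptide)
    return kept
```

With this reading, a peptide is retained as nonimmunogenic only if it is predicted to be presented and is not identical to a self peptide.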

Construction of Data Sets: Epitopes, External Validation of Epitopes, and Neoepitopes. Antigen epitope data were collected from IEDB (linear epitope, human, T-cell assays, MHC class I, any disease was chosen). The data collection criteria required, for each HLA allele, quantity > 50 and frequency > 0.5% (refer to the allele frequency database [22]) (Table 1; see Table S1 for detailed information).
Here, we removed peptides for which HLA supertypes do not appear in the training set, because we assume peptides belonging to the same HLA supertypes to have similar properties. In the external validation set, some peptides bind to rare HLA supertypes. Their characteristics were not included in the training set. Hence, these peptides in the external validation data might lead to a classification bias.
where $P$ is the peptide and $c$ is the characteristic. $P_c$ represents the characteristics of peptides, $A$ represents amino acids, $N$ represents the N-terminal in a peptide, $C$ represents the C-terminal in a peptide, $Pos$ represents the amino acid position in a peptide, and $P_A^c$ represents the characteristics of amino acids in peptides.

Frequency Score for Immunogenic Peptide (C22). Amino acid distribution frequency differences between immunogenic and nonimmunogenic peptides at TCR contact sites (excluding anchor sites) were considered as a feature, where $P^{+}_{ie}$ represents immunogenic peptides, $P^{-}_{ie}$ represents nonimmunogenic peptides, $f_{A}'$ represents the amino acid frequency in the TCR contact position, and $P^{+}_{ie}(f_{A}')$ represents the frequency of amino acids in immunogenic peptides at TCR contact sites.
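As a rough illustration of this feature, the Python sketch below tallies amino acid frequencies at TCR contact sites (dropping the N-terminal, position 2, and the C-terminal as anchor sites) and takes the difference between immunogenic and nonimmunogenic peptides. The function names are hypothetical, and the per-peptide aggregation in `frequency_score` (a mean of per-residue differences) is an assumption, not the formula used in this study.

```python
from collections import Counter


def contact_residues(peptide):
    """Residues left after dropping the N-terminal, position 2, and C-terminal."""
    return peptide[2:-1]


def contact_frequencies(peptides):
    """Relative amino acid frequencies at TCR contact sites over a peptide set."""
    counts = Counter()
    for peptide in peptides:
        counts.update(contact_residues(peptide))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def frequency_differences(immunogenic, nonimmunogenic):
    """Per-amino-acid frequency difference (immunogenic minus nonimmunogenic)."""
    pos = contact_frequencies(immunogenic)
    neg = contact_frequencies(nonimmunogenic)
    return {aa: pos.get(aa, 0.0) - neg.get(aa, 0.0) for aa in set(pos) | set(neg)}


def frequency_score(peptide, differences):
    """Assumed aggregation: mean difference over the peptide's contact residues."""
    residues = contact_residues(peptide)
    if not residues:
        return 0.0
    return sum(differences.get(aa, 0.0) for aa in residues) / len(residues)
```

A positive score under this sketch means the contact residues of a peptide are, on average, more common among immunogenic peptides than among nonimmunogenic ones.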

Calculating Peptide Entropy (C23). Peptide entropy [41] was used as a feature, where $P_H$ represents the peptide entropy, $f_A$ represents the amino acid frequency in the human reference peptide sequence, and $P_{f_A}$ represents the frequency in the human reference peptide sequence of the amino acids in epitope peptides.

For the rank(%) score (C24), rank(%) was recommended as an evaluation standard: rank(%) < 0.5 indicates strong binders, 0.5 < rank(%) < 2 indicates weak binders, and rank(%) > 2 indicates nonbinders.
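The rank(%) thresholds quoted above translate directly into a small lookup. The Python sketch below is illustrative only: the function name is hypothetical, and the handling of the exact boundary values 0.5 and 2, which the text leaves open, is an assumption (both are assigned to the weak-binder class here).

```python
def binding_class(rank_percent):
    """Map a rank(%) value to the binder classes described in the text.

    Assumption: the boundary values 0.5 and 2.0 are counted as weak binders.
    """
    if rank_percent < 0:
        raise ValueError("rank(%) cannot be negative")
    if rank_percent < 0.5:
        return "strong binder"
    if rank_percent <= 2.0:
        return "weak binder"
    return "nonbinder"
```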

Fivefold Cross-Validation, Feature Selection, Random Forests, and ROC Generation. The fivefold cross-validation was implemented in R using the caret package [42] (method = "repeatedcv", number = 5, repeats = 3). Feature screening was performed in R using the Boruta package [43], a random forest-based feature selection algorithm for finding all relevant variables, which provides unbiased and stable selection of important and nonimportant attributes from an information system. It iteratively removes the features that a statistical test shows to be less relevant than random probes. It uses the Z score (computed by dividing the average accuracy loss by its standard deviation) as the importance measure, which takes into account the fluctuations of the mean accuracy loss among trees in the forest. The R package randomForest [44] was used for training (the caret package provides automatic iterative selection of optimal parameters: mtry = 15 for the antigen epitope model and mtry = 14 for the neoantigen epitope model; the remaining parameters use default values). The R package ROCR [45] was used for drawing ROC curves.
Web Tool Implementation. The front end of INeo-Epp was constructed via HTML/JavaScript/CSS. The back end was written in PHP, connecting the web interface and
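To make the resampling scheme concrete, the following Python sketch generates the train/test index splits for fivefold cross-validation repeated three times. It is a plain standard-library illustration of the idea, not the caret implementation used in this study: the function name and seed are hypothetical, and the folds here are not stratified by class.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Each repeat reshuffles the samples and cuts them into n_folds disjoint
    folds; every fold serves exactly once as the test set in that repeat.
    """
    if n_samples < n_folds:
        raise ValueError("need at least one sample per fold")
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Striding through the shuffled order gives folds of near-equal size.
        folds = [order[i::n_folds] for i in range(n_folds)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield repeat, k, train_idx, test_idx
```

With number = 5 and repeats = 3, a model is fitted and evaluated 15 times, and the reported performance is averaged over those resamples.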

Results
Ultimately, 11,297 validated epitopes and nonepitopes with lengths of 8-11 amino acids were collected from IEDB. T-cell responses included activation, cytotoxicity, proliferation, IFN-γ release, TNF release, granzyme B release, IL-2 release, and IL-10 release. Seventeen different HLA alleles were collected (Figure 2(a)), and the detailed antigen length distribution is shown in Figure 2(b). Additionally, we collected neoantigen data from 12 publications, including 2837 nonneoepitopes and 164 neoepitopes (Figure 2(c)); the detailed neoantigen length distribution is shown in Figure 2(d).
The TCR contact position plays a crucial role in the analysis of immunogenicity, as TCRs might be more sensitive to some amino acids. The amino acid preference in antigen epitope peptides and antigen nonepitope peptides was therefore analyzed after excluding anchor sites (N-terminal, position 2, and C-terminal) (Figure 3). We found that TCRs tend to recognize hydrophobic amino acids. For example, three-quarters of the hydrophobic amino acids (L, W, P, A, V, and M) occur more frequently in immunogenic epitopes. Some charged amino acids (e.g., D and K) are enriched in nonepitopes, whereas the remaining charged amino acids (R, H, and E) show no difference. Based on the result in Figure 3, we regarded the amino acid distribution difference at the TCR contact sites as one of the immunogenicity features (i.e., Frequency Score for Immunogenic Peptide (C22)).
The receiver operating characteristic (ROC) curves of the models are shown in Figure 4. The fivefold cross-validation AUC was 0.81 in the prediction model for the antigen epitope (line in red, Figure 4(b)), and the externally validated (see Table 2) AUC was 0.75 (line in purple, Figure 4(c)). Here, we removed from the externally validated antigen data the peptides for which HLA supertypes did not appear in the training set, and the AUC, specificity, and sensitivity increased to 0.78, 0.71, and 0.72, respectively (line in pink, Figure 4(c)). This, to some extent, supports our conjecture that TCR recognition is specific to peptides presented by different HLA alleles.
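For readers less familiar with these metrics, the Python sketch below shows how AUC, sensitivity, and specificity can be computed from labels and predicted scores. It is a standard-library illustration rather than the ROCR code used here; the function names are hypothetical, and the default cutoff of 0.5 is an assumption for the example.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores above the threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s <= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s <= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return tp / (tp + fn), tn / (tn + fp)
```

The pairwise definition of AUC is quadratic in the number of samples, which is acceptable for a sketch but slower than rank-based implementations on large data sets.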

Classification Prediction Model for Neoantigen Epitopes. Neoantigens derived from somatic mutations differ from the wild-type peptide sequences. Therefore, some mutation-related characteristics were also taken into account: the difference in hydrophobicity before and after mutation (C25), the differential agretopicity index (DAI, C26) [62], and whether the mutation position was anchored (C27). In total, 27 features were considered for the neoantigen epitope prediction model. However, only 25 neoantigen-related features were retained after running Boruta, because C25 and C27 were removed. Also, rank(%) showed a marked effect (Figure 5(a)). In the fivefold cross-validation of the prediction model for neoantigen epitopes, the AUC was 0.78 (Figure 5(b)).

Figure 3: Amino acid distribution frequency at the TCR contact sites of antigen epitope and nonepitope peptides. The amino acids below the dotted line are preferred by the epitope.

The INeo-Epp tool can be used to predict both immunogenic antigen and neoantigen epitopes. For antigens, the nine main HLA supertypes can be used. We recommend peptides with lengths of 8-12 residues, and not fewer than 8. The N-terminal, position 2, and C-terminal are treated as anchor sites by default. A predictive score greater than 0.5 is considered immunogenic (positive-high), a score between 0.4 and 0.5 is considered positive-low, and a score less than 0.4 is considered negative-high. It is critical to make sure that the HLA subtype matches your peptides (rank(%) < 2). Where HLA subtypes mismatch, a large deviation of the rank(%) value may strongly influence the results.
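The score interpretation described for the tool can be summarized in a few lines of Python. This is a sketch of the stated cutoffs only, not code from the INeo-Epp back end: the function names are hypothetical, and assigning the boundary values 0.4 and 0.5 to the positive-low class is an assumption.

```python
def interpret_score(score):
    """Translate a predictive score into the labels described in the text.

    Assumption: the boundary values 0.4 and 0.5 fall in the positive-low class.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score is expected to lie in [0, 1]")
    if score > 0.5:
        return "positive-high"
    if score >= 0.4:
        return "positive-low"
    return "negative-high"


def hla_matches_peptide(rank_percent):
    """Check the stated requirement that the chosen HLA subtype gives rank(%) < 2."""
    return rank_percent < 2.0
```

Checking `hla_matches_peptide` before reading the score reflects the warning that a mismatched HLA subtype can distort the rank(%) feature.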

Discussion
Owing to the complexity of antigen presentation and TCR binding, the mechanism of TCR recognition has not been clearly revealed. In 2013, Calis et al. [63] developed a tool for epitope identification in mice and humans (AUC = 0.68). Although mice and humans are highly homologous, the inclusion of murine epitopes is likely to limit the identification of human epitopes. Inspired by Calis et al., our research focused on human epitopes and was conducted on a larger data set.
By analyzing epitope immunogenicity from the perspective of amino acid molecular composition, we observed that TCRs do have a preference for recognizing hydrophobic amino acids. For short peptides presented by different HLA supertypes, TCRs may have different identification patterns. Immunogenicity prediction based on all HLA-presented peptides may therefore reduce the accuracy of the prediction results. That is, if the prediction could focus on peptides presented by specified HLA molecules, the results may improve. Therefore, in our work we used HLA supertypes to improve the prediction of HLA-presented epitopes, including antigen epitopes and neoantigen epitopes, for better recognition by TCRs. At present, the neoantigen epitopes that can be collected in accordance with the standard for experimental verification are too few, the data for positive and negative neoantigens are unbalanced, and there is not enough data for an external verification set. In the future, we will continue to refine and expand our training and verification data sets. Recently, Laumont et al. [64] demonstrated that aberrantly expressed tumor-specific antigens (aeTSAs) derived from noncoding regions may represent ideal targets for cancer immunotherapy. These epitopes can also be studied in the future. Increased epitope data may also help strengthen the prediction of potentially immunogenic peptides or neopeptides.

Conclusions
Neoantigen prediction is the most important step at the start of the preparation of a neoantigen vaccine. Bioinformatics methods can be used to extract tumor mutant peptides and predict neoantigens. Most current strategies stop at predicting presented peptides, and among the results of these predictions, probably fewer than 10 neoantigens might be clinically immunogenic and produce an effective immune response. It is time-consuming and costly to eliminate falsely predicted peptides experimentally. The methods developed in this study and the INeo-Epp tool may help eliminate false positive antigen/neoantigen peptides and greatly reduce the number of candidates to be verified by experiments. We believe that in the age of biological data explosion, computational approaches are a good way to enhance research efficiency and direct biological experiments. With the development of machine learning and deep learning, we expect that the prediction of epitope immunogenicity will be continually improved.
In summary, this study provides a novel T-cell HLA class-I immunogenicity prediction method from epitopes to neoantigens, and the INeo-Epp can be applied not only to identify putative antigens, but also to identify putative neoantigens.
It needs to be stated here that we published the preprint [65] of this article in July 2019. This is a modified version.

Data Availability
The data used to support the findings of this study are included within the supplementary information file(s).

Disclosure
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.