Identifying the human papillomavirus (HPV) type plays an important role in the early diagnosis of cervical cancer. Most prediction methods use protein sequence and structure information, but reduced amino acid modes have not previously been used. In this paper, we introduced reduced amino acid modes to predict high-risk HPV types. We first reduced the 20 amino acids into several nonoverlapping groups and calculated their structural and physicochemical modes for high-risk HPV prediction; the method was tested and compared with existing methods on 68 samples of known HPV types. The experimental results show that the proposed method achieved better performance, with an accuracy of 96.49%, indicating that reduced amino acid modes might be used to improve the prediction of high-risk HPV types.
Cervical cancer has high morbidity and mortality rates among women worldwide [
Human papillomavirus belongs to the papillomavirus family. It is an icosahedral, non-enveloped particle containing a double-stranded DNA genome of approximately 8,000 base pairs [
To date, many epidemiological and experimental methods have been developed to identify HPV types [
In recent years, several computational models have been proposed to predict high-risk HPV types. Eom et al. studied the sequence fragments and introduced genetic algorithms to predict the HPV types [
These methods have performed well in the prediction of high-risk HPV types, but the challenge of extracting informative HPV features remains. Existing predictors rely mainly on sequence information; features restricted to the properties of the 20 amino acids (AAs) and their reduced groups have not been explored so far. In this paper, we proposed a novel method to predict high-risk HPV types based on reduced amino acid modes. We classified the 20 amino acids into several groups and extracted their structural and physicochemical properties. These extracted features were then used to predict high-risk HPV types with a support vector machine. Through experiments and comparative analysis, we evaluated the efficiency of the proposed method as well as that of the various reduced amino acid modes.
There are eight open reading frames that encode early and late genes of the HPVs [
The 20 amino acids differ subtly from one another, but some of them share similar basic structures and functions. AAindex is a database of physical and biochemical indices of amino acids established by Tomii and Kanehisa [
Here, we introduced BLOSUM62 to classify amino acids to simplify sequence analysis [
The 20 amino acids were divided into the following nonoverlapping groups according to their physicochemical properties in AAindex, and four types of reduced amino acid (RedAA) modes were calculated as protein structural and physicochemical features.
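As a minimal sketch of the reduction step, the snippet below rewrites a protein sequence in a reduced alphabet. The six groups in `EXAMPLE_GROUPS` are an illustrative assumption and are not the AAindex-derived groups used in this work; the helper names `build_reduction_map` and `reduce_sequence` are likewise hypothetical.

```python
# Illustrative grouping only: these six groups are an assumption for the
# example, not the groups derived from AAindex in the paper.
EXAMPLE_GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]


def build_reduction_map(groups):
    """Map each amino acid letter to a one-letter label for its group."""
    mapping = {}
    for index, group in enumerate(groups):
        label = chr(ord("a") + index)
        for residue in group:
            # Groups must not overlap: each residue belongs to one group.
            if residue in mapping:
                raise ValueError(f"{residue} appears in more than one group")
            mapping[residue] = label
    return mapping


def reduce_sequence(sequence, mapping):
    """Rewrite a sequence in the reduced alphabet, skipping unknown letters."""
    return "".join(mapping[r] for r in sequence.upper() if r in mapping)


mapping = build_reduction_map(EXAMPLE_GROUPS)
reduced = reduce_sequence("MKDEGF", mapping)
print(reduced)  # adeefb
```

Any other nonoverlapping partition of the 20 amino acids can be substituted by changing the list of groups; the overlap check guards against a residue being assigned twice.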
The first mode is associated with the content-specific features, including the distribution of the RedAA and RedAA pattern in protein sequences.
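A hedged sketch of such content-specific features is given below. It assumes that the "distribution" is the relative frequency of each reduced letter and that the "pattern" is the relative frequency of adjacent letter pairs; both readings, the toy alphabet `"abc"`, and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def composition(reduced, alphabet):
    """Fraction of each reduced letter in the reduced sequence."""
    counts = Counter(reduced)
    total = len(reduced)
    return {a: (counts[a] / total if total else 0.0) for a in alphabet}


def pair_frequencies(reduced, alphabet):
    """Fraction of each adjacent pair of reduced letters."""
    pairs = Counter(reduced[i:i + 2] for i in range(len(reduced) - 1))
    total = max(len(reduced) - 1, 0)
    return {
        x + y: (pairs[x + y] / total if total else 0.0)
        for x, y in product(alphabet, repeat=2)
    }


# Toy reduced sequence over a hypothetical three-letter alphabet.
comp = composition("aabca", "abc")
pairs = pair_frequencies("aabca", "abc")
print(comp["a"], pairs["ab"])  # 0.6 0.25
```

With k reduced groups this yields k composition features and k*k pair features, which is far fewer than the 20 and 400 obtained from the full alphabet.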
Type I PRseAAC, proposed by Kuo-Chen Chou, is defined as follows:
Type II PRseAAC can be calculated as
The second RedAA mode is based on the characteristics of correlation, which describes the correlation among the RedAAs. In the proposed RedAA mode, three different autocorrelation features are implemented: normalized Moreau–Broto autocorrelation (NMB) [
The order mode reflects the physical and chemical interaction among the RedAA pairs. There are two kinds of order modes: sequence coupling score and quasi-sequence score [
The quasi-sequence score can be calculated as
The position mode represents the distribution of RedAA positions of protein sequences based on the coefficient of variations [
We then calculated the positional information
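The idea of a position feature based on the coefficient of variation can be sketched as follows. This is only one plausible reading: it assumes 1-based residue positions and the population standard deviation divided by the mean, and it assigns 0.0 to groups that do not occur; the exact definition used for the position mode may differ.

```python
from statistics import mean, pstdev


def position_cv(reduced, alphabet):
    """Coefficient of variation of the positions of each reduced letter."""
    features = {}
    for letter in alphabet:
        # 1-based positions at which this reduced letter occurs.
        positions = [i + 1 for i, ch in enumerate(reduced) if ch == letter]
        if not positions:
            features[letter] = 0.0  # assumption: absent groups score zero
        else:
            features[letter] = pstdev(positions) / mean(positions)
    return features


cv = position_cv("abab", "abc")
print(cv)
```

For the toy sequence `"abab"`, letter `a` occurs at positions 1 and 3 (mean 2, standard deviation 1), giving a coefficient of variation of 0.5.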
We used a support vector machine (SVM) to predict the HPV type; the classifier is expressed as follows:
Here, the Gaussian kernel function is used to calculate
The training model can predict the risk type of the test sample
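For illustration, the standard form of an SVM decision rule with a Gaussian kernel, K(x, z) = exp(-gamma * ||x - z||^2), can be sketched in plain Python. The support vectors, dual coefficients, bias, and `gamma` below are made-up toy values standing in for a trained model, and the function names are hypothetical; in practice these quantities come from an SVM solver.

```python
import math


def gaussian_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Decision value; dual_coefs[i] is alpha_i * y_i with y_i in {-1, +1}."""
    return bias + sum(
        coef * gaussian_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


def predict_risk(x, support_vectors, dual_coefs, bias, gamma):
    """Label a feature vector as High (+1 side) or Low (-1 side)."""
    score = svm_decision(x, support_vectors, dual_coefs, bias, gamma)
    return "High" if score >= 0 else "Low"


# Toy "trained" model (assumed values, not fitted to HPV data).
toy_svs = [(0.0, 0.0), (1.0, 1.0)]
toy_coefs = [-1.0, 1.0]
print(predict_risk((0.9, 0.9), toy_svs, toy_coefs, bias=0.0, gamma=1.0))
```

A test sample close to the positive support vector receives a positive decision value and is labeled High; one close to the negative support vector is labeled Low.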
There are three popular methods to evaluate the efficiency of prediction models: subsampling test, independent test, and jackknife test. Since the jackknife test can evaluate the efficiency of various predictor variables, we used it to evaluate the efficiency of the proposed method and calculated the class accuracies and overall accuracies:
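A minimal sketch of the jackknife (leave-one-out) procedure and the two accuracy measures is shown below. Class accuracy is taken as the fraction of samples of a class that are predicted correctly, and overall accuracy as the fraction of all samples predicted correctly. A 1-nearest-neighbor rule is used purely as a stand-in classifier to keep the example self-contained; it is not the SVM used in this work, and the toy features are invented.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Stand-in classifier: label of the closest training sample."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def jackknife(features, labels, predict):
    """Hold out each sample in turn, train on the rest, predict the held-out one."""
    predictions = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(predict(train_x, train_y, features[i]))
    return predictions


def accuracies(labels, predictions):
    """Per-class accuracy and overall accuracy."""
    per_class = {}
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        correct = sum(predictions[i] == cls for i in members)
        per_class[cls] = correct / len(members)
    overall = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    return per_class, overall


# Invented one-dimensional toy features; the last High sample lies near the Low ones.
toy_x = [(0.0,), (0.1,), (0.2,), (1.0,), (1.2,), (0.5,)]
toy_y = ["Low", "Low", "Low", "High", "High", "High"]
toy_pred = jackknife(toy_x, toy_y, nearest_neighbor_predict)
per_class, overall = accuracies(toy_y, toy_pred)
print(per_class, overall)
```

Because every sample is held out exactly once, the result does not depend on a random split, which makes the jackknife test convenient for comparing feature sets on a small data set.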
We used the jackknife test to evaluate the performance of the proposed RedAA modes. We divided the 20 amino acids into 5 to 19 groups and calculated their RedAA modes as protein features and then input them into the support vector machine to predict the HPV type. Table
Comparison of the real risk types (REAL) and the prediction results using the proposed approach.
Types | Real | Predicted | Types | Real | Predicted | Types | Real | Predicted | Types | Real | Predicted |
---|---|---|---|---|---|---|---|---|---|---|---|
HPV-39 | High | High | HPV-7 | Low | Low | HPV-34 | Low | Low | HPV-50 | Low | Low |
HPV-72 | High | Low | HPV-30 | Low | High | HPV-44 | Low | Low | HPV-5 | Low | Low |
HPV-33 | High | High | HPV-73 | Low | Low | HPV-43 | Low | Low | HPV-20 | Low | Low |
HPV-51 | High | High | HPV-6 | Low | Low | HPV-32 | Low | Low | HPV-23 | Low | Low |
HPV-16 | High | High | HPV-27 | Low | Low | HPV-24 | Low | Low | HPV-19 | Low | Low |
HPV-56 | High | High | HPV-13 | Low | Low | HPV-8 | Low | Low | HPV-47 | Low | Low |
HPV-18 | High | High | HPV-55 | Low | Low | HPV-48 | Low | Low | HPV-22 | Low | Low |
HPV-59 | High | High | HPV-2 | Low | Low | HPV-12 | Low | Low | HPV-25 | Low | Low |
HPV-52 | High | High | HPV-10 | Low | Low | HPV-49 | Low | Low | HPV-9 | Low | Low |
HPV-35 | High | High | HPV-42 | Low | Low | HPV-15 | Low | Low | HPV-36 | Low | Low |
HPV-68 | High | High | HPV-28 | Low | Low | HPV-21 | Low | Low | HPV-41 | Low | Low |
HPV-58 | High | High | HPV-40 | Low | Low | HPV-4 | Low | Low | HPV-63 | Low | Low |
HPV-31 | High | High | HPV-3 | Low | Low | HPV-65 | Low | Low | HPV-1 | Low | Low |
HPV-66 | High | High | HPV-11 | Low | Low | HPV-37 | Low | Low | HPV-80 | Low | Low |
HPV-45 | High | High | HPV-29 | Low | Low | HPV-38 | Low | Low | HPV-77 | Low | Low |
HPV-61 | High | High | HPV-74 | Low | Low | HPV-60 | Low | Low | HPV-76 | Low | Low |
HPV-67 | High | High | HPV-53 | Low | Low | HPV-17 | Low | Low | HPV-75 | Low | Low |
It can be seen from Table
We further compared our method with the following methods: SVM based on the mismatch [
The early HPV proteins are E1, E2, E4, E5, E6, and E7, and the late proteins are L1 and L2 [
Comparison of prediction accuracy of each class based on all the early and late proteins.
Figure
The proposed method reduces the 20 AAs into several nonoverlapping groups, and this reduction relies heavily on the physical and biochemical indices of amino acids. The 522 characteristics of AAindex are divided into seven categories according to their physical and biochemical features [
Comparison of the mean of the overall accuracies of HPV type prediction based on seven physicochemical property classes and six RedAA modes for all the early and late proteins.
From Figure
To evaluate the performance of the different modes, we used the 522 physicochemical properties to calculate the RedAA modes of all the early and late proteins and calculated the average overall accuracy of HPV type prediction, which is shown in Figure
The proposed method uses the structural and physicochemical features of reduced amino acids, which reduces the dimension of the input information and improves the efficiency of the prediction model. However, the RedAA modes depend on the number of reduced amino acids. To examine the influence of the RedAA size, we reduced the 20 amino acids into 5 to 19 classes based on the 522 physicochemical properties and calculated the RedAA modes PRseAAC and RTCD for all the early and late proteins. The average accuracies of the RedAA modes PRseAAC and RTCD with 5 to 19 RedAAs are summarized in Figure
Performance comparison of the RedAA modes PRseAAC and RTCD with different reduced amino acids: (a) the average accuracies of the PRseAAC and RTCD with 5-19 reduced amino acids for E1, E2, E4, E5, and E7 and (b) the average accuracies of the PRseAAC and RTCD with 5-19 reduced amino acids for E6, L1, and L2.
Figure
Genital papillomaviruses, especially high-risk HPV types, are closely related to cervical cancer. Identifying the HPV risk type is therefore of great significance for cervical cancer. We proposed a computational method for predicting high-risk HPV types based on the RedAA modes. Using the physicochemical properties of the amino acids, we reduced the 20 amino acids into several nonoverlapping groups and calculated the structural and physicochemical characteristics of the reduced AAs (RedAA) as the RedAA modes. We then used this reduced sequence information to predict high-risk HPV types. Experiments with 68 known HPV types show that the proposed method performs better than previous methods.
The first finding is that the L1 protein performs better in the prediction of high-risk HPV types, while the L2 protein is more suitable for low-risk HPV types. The second finding concerns the influence of the physicochemical properties of amino acids: the E5, E6, E7, L1, and L2 proteins show a preference for beta physicochemical properties when reducing amino acids. The third finding comes from the comparison of the reduced amino acid modes: PRseAAC and RTCD outperform the other four RedAA modes and perform better with the beta physicochemical properties of the amino acids. The final finding concerns the influence of the number of reduced amino acids: the combination of RTCD and beta physicochemical properties achieves the best performance with 8, 15, and 11 reduced amino acids for the E6, L1, and L2 proteins, respectively.
All the data used to support the findings of this study are available from the Los Alamos National Laboratory (
The authors declare that they have no conflicts of interest.
This work is supported by the National Natural Science Foundation of China (61772028) and research grants from the Zhejiang Provincial Natural Science Foundation of China (LY20F020016).