Prediction of High-Risk Types of Human Papillomaviruses Using Statistical Model of Protein “Sequence Space”

Discrimination of high-risk types of human papillomaviruses plays an important role in the diagnosis and remedy of cervical cancer. Recently, several computational methods have been proposed based on protein sequence-based and structure-based information, but the information of their related proteins has not been used until now. In this paper, we proposed using protein “sequence space” to explore this information and used it to predict high-risk types of HPVs. The proposed method was tested on 68 samples with known HPV types and 4 samples without HPV types and further compared with the available approaches. The results show that the proposed method achieved the best performance among all the evaluated methods with accuracy 95.59% and F1-score 90.91%, which indicates that protein “sequence space” could potentially be used to improve prediction of high-risk types of HPVs.


Introduction
Cervical cancer is one of the leading causes of cancer morbidity and mortality in women worldwide [1]. Approximately, 500,000 new cases of cervical cancer were diagnosed each year, with 280,000 deaths [2]. It has become the second most common cancer among women especially for developing countries [3,4]. Some studies have shown that human papillomavirus (HPV) is strongly associated with cervical cancer, and some types of HPVs can cause abnormal tissue growth in the form of warts (papillomas) and some HPVs are associated with certain cancers and precancerous conditions [5][6][7].
Human papillomaviruses are icosahedral, nonenveloped particles that contain a small, double-stranded circular DNA of approximately 8000 nucleotide base pairs [8] and belong to the Papillomavirus family (papilloma, polyoma, and simian vacuolating viruses) [9]. The diameter of circular DNA is approximately 55 nm [10][11][12][13]. Up to now, there are more than 150 HPV types, and some new types will be identified when they have significant homology differences with the defined HPV types [14][15][16]. Epidemiologic studies have shown that genital human papillomaviruses have a strong relationship with cervical cancer, independent of other risk factors. According to their relative malignancy, the genital tract HPVs can be grouped into two or three types: low-risk type, intermediate-risk type, and high-risk types [17]. But HPVs are usually divided into two types in clinical association study: high-risk or low-risk types. Low-risk viral types are more closely related with low-grade lesions, and high-risk viral types are associated with high-grade cervical lesions and cancers [17]. High-risk type is composed of 20 HPV types, such as HPV-16, HPV-18, HPV-26, HPV-31, HPV-33, HPV-35, HPV-39, HPV-45, HPV-51-53, HPV-56, HPV-58, HPV-59, HPV-66, HPV-68, HPV-70, HPV-73, HPV-82, and HPV-85 [18]. HPV-16 and HPV-18 are responsible for about 62.6% and 15.7% of cervical cancers [19]. Therefore, discrimination of high-risk types of HPVs becomes one of the most important things for diagnosis and therapy of cervical cancers.
Because of the importance of the HPV types, many epidemiological and experimental methods have been proposed to identify them [5,[20][21][22]. They are mostly based on the polymerase chain reaction (PCR), a sensitive technique for the detection of very small amounts of HPV nucleic acids in clinical specimens. With rapid increasing of the HPV data in protein and DNA databank, there is a great need to develop some reliable and effective computational methods to predict the high-risk types of HPVs directly from the available data.
Recently, some research works found the correlations between these data and high-risk types of HPVs and proposed some computational methods to predict the high-risk types of HPVs. Eom et al. learned the most informative subsequence segment sets of DNA sequences and used genetic algorithm to classify the risk types of each HPV [23]. Joung et al. classified the risk type of HPVs based on the hidden Markov model and the support vector machines using the protein sequences [24,25]. Park et al. proposed a classification of the risk type of human papillomavirus by decision tree [26]. Kim and Zhang introduced the string kernel and Gapspectrum kernel to compute the distances of amino acids pairs and further used them to classify HPV risk types based on E6 protein sequences [7,9]. Kim et al. proposed an Ensemble support vector machine to classify HPV risk types based on the differential subsequences of protein secondary structures [13]. Esmaeili et al. calculated Chou's pseudo amino acid composition of E6 protein sequences and used ROC to predict HPV risk types [27]. Alemi et al. analyzed the physiochemical properties of all early and late proteins in high-and low-risk HPV types and introduced support vector machines to classify high-risk HPV types based on receiver operating characteristic analysis [28].
These methods have achieved promising results in highrisk types of HPVs prediction, but challenges for information extraction of HPVs still remain. The widely used information of HPVs in high-risk type prediction is sequence-based or structure-based information from the given DNA or protein sequence, and the information from related proteins or family has not been explored until now. With this problem in mind, we presented a novel scheme to predict high-risk types of HPVs using word statistical model of protein sequence space and support vector machine. We first constructed a "sequence space" of the given protein sequence with help of mutation matrices. We then extracted the information of HPV from the protein "sequence space" with the proposed word statistical model. At last, the extracted information was fed into support vector machine to predict high-risk types of HPVs. Through several experiments, we want to address how well the proposed prediction method performed when comparing with the available ones and whether the prediction abilities of the proposed prediction method depends on the choice of the mutation matrices.

Datasets.
All types of HPV share a common genomic structure which is arranged into the upstream regulatory region (URR) and eight open reading frames (ORFs) encoding the viral early and late genes [11]. URR contains long control region, TATA signal 1 and TATA signal 2. There are polyA signal 1 and polyA signal 2 between early and late genes. Late gene expression produces the structural proteins L1 and L2 [12], which assemble into the viral capsid structure, whereas early gene activity translates into the regulatory proteins E1, E2, E4, E5, E6, and E7. In this paper, we constructed seven datasets of HPV protein sequences: E1, E2, E4, E6, E7, L1, and L2, respectively. Here, we did not use HPV E5 because the lengths of its protein sequences are too small. All the HPV datasets were downloaded from the Human Papillomaviruses Compendium published by Los Alamos National Laboratory (LANL).
There are total 72 types of HPVs in each dataset, but some HPV sequences are missing in LANL. So we downloaded the missing sequences from taxonomy browser in the National Center of Biotechnology Information. For example, HPV 43, 67, 75, 76, 77, and 80 protein sequences are missing in L2 dataset; we obtained these sequences from taxonomy browser. But we could not find the missed sequences of the E4 dataset in the National Center of Biotechnology Information, so the total number of HPV sequences is 71 in the E4 dataset. Among HPV sequences, four sequences (HPV 26, 54, 57, and 70) are selected as the predicting data and others are the training data [13]. Here, HPV risk types are manually determined based on the HPV compendium, in which seventeen HPV types are classified as high-risk types (HPV 16,18,31,33,35,39,45, 51, 52, 56, 58, 59, 61, 66, 67, 68, and 72) and the remaining are low-risk types.

Construction of Protein "Sequence Space".
It is well known that there are over 20 amino acids and each one is different from the others. Mutation matrices represent the similarities among amino acids. Let and denote two amino acids from the set Ω, and their score was defined as follows: where Mutation( , ) represents the "normalized probability" that the amino acid mutates into the amino acid . In evolutionary biology, the score describes the rate at which one amino acid in a protein sequence changes to other amino acids states over time. That is to say the sequence similarity depends on the amino acids' scores represented in above definition. Usually, two amino acids and are considered similar if their score is more than zeros. It is worth noting that the similarity between and is symmetric, but it is not a transitive relation. For example, is similar to and is similar to , but is not similar to . Taking amino acids' scores into mind, we classified 20 amino acids into several overlapping classes. Here, star sets were introduced, in which the properties are known between vertices and center. Given an amino acid , its star set was defined as follows: where sig is a function that returns the sign of a number, indicating whether the number is positive or zero. If number is greater than zero, 1, otherwise, to zero. For example, 20 amino acids can be partitioned into several star sets based on PAM250 mutation matrix, which was presented in Table 1.
We wanted to go further with the star sets and found some related protein sequences that have a high similarity among them. Suppose 1 = 1 1 1 2 ⋅ ⋅ ⋅ 1 and 2 = 2 1 2 2 ⋅ ⋅ ⋅ 2 are two given protein sequences; they are related if they satisfy the following condition: From the above definition, it is easy to note that if two protein sequences have more similar sequences, they should be more closely related. With help of definition of related sequences, we constructed the "sequence space" of given sequence = 1 2 ⋅ ⋅ ⋅ , denoted by SP , as follows.
Step 1. Given a null-set, denoted by , add its star sets to and obtain protein "sequence space" SP .
Step 2. A prefix 1 was added to and obtained its star set Star ( 1 ). We checked whether the star set of the prefix 1 is empty or not. If its star set is a nonempty set, we added a symbol "−" after Star ( 1 ) and updated the protein "sequence space" SP .
Step 3. We repeated Step 2 until the end of the given sequence = 1 2 ⋅ ⋅ ⋅ and obtained its protein "sequence space" SP as follows: In the construction of the protein sequence space, all the protein sequences were closely related to the given protein sequence. That is to say all the information on the related proteins or family could be explored through the construction of the protein "sequence space. "

Word Statistical Model in Protein "Sequence Space".
Word statistical model is one of the most widely used methods for sequence analysis [29][30][31][32]. In this model, each sequence is first mapped into an -dimensional vector according to its word frequencies, and sequence similarity can be measured by distance measures, such as Euclidean distance [33], Mahalanobis distance [34], Kullback-Leibler discrepancy [35], and Cosine distance [36]. When the words occurring in biological sequence are estimative probabilities rather than the frequencies, they are more readily optimized by more complex models, such as Markov model [37][38][39], mixed model [40], and Bernoulli model [41]. These complex models could be considered to be the modification of traditional word-based models.
A biological sequence can be described as a succession of symbols, and a word is a series of consecutive letters in the sequence. For a sequence = 1 2 ⋅ ⋅ ⋅ , the count of its word = 1 2 ⋅ ⋅ ⋅ , denoted by ( ), is the number of occurrence of the word in the sequence . Here, we constructed a word statistical model in protein "sequence space. " First of all, a position function of an occurrence of the word was defined as follows: The count of the word in the protein "sequence space" can be defined from the random indicators of occurrence as follows In order to eliminate the effects of space size, we normalized the word contents with the size of the space and got word frequencies of protein "sequence space, " denoted as SP . Consider , . . . , where |Star | is the size of the star set and is the total number of the words that appear in the protein "sequence space" SP . indicates the th sample being risk type , where = 1, 2 denotes two different risk types ( = 1 indicates the th sample being high-risk type, and = 2 indicates the th sample being low-risk type). Let be the th word frequency in protein "sequence space" for the th sample, where = 1, 2, . . . , ; = ( ) , denotes all the statistical information of "sequence space" for all samples, 11 12 ⋅ ⋅ ⋅ where is defined as a Linear combination of the set of nonlinear data transformations is a bias term, is a regularization metaparameter, and denotes the training error for the th sample. This optimization problem derived in a dual space can be written as In this paper, we used the Gaussian radius basis function kernel to calculate the ( ) ( ) instead of calculating either ( ) or ( ) explicitly. Then the optimal separating problem was modeled as And the classifier takes the form After training the model, a test sample ∈ will be assigned to a risk type according to the following decision function: When ( ) is 1, it means that the test sample is the highrisk type of HPV; otherwise, should be low-risk type. Here, we selected the parameters for the sake of getting the highest overall prediction as possible. A simple grid search strategy based on 10-fold cross-validation for each dataset was performed to get the optimal values of and for prediction algorithm.

Evaluation Measures.
Subsampling test, independent dataset test, and jackknife test are three widely used crossvalidation methods to evaluate prediction's capability. The jackknife test always yields a unique outcome, which facilitates examining the quality of various predictors. Hence, we chose jackknife test to evaluate the performance of the proposed method and introduced the accuracy for each class, overall accuracy, and 1-score as standard performance measures, which were defined as follows: specificity (accuracy of high − risk type) = + , sensitivity (accuracy of low − risk type) = + , accuracy of totality = + + + + ⋅ 100%, 1-score = 2 ⋅ / ( + ) ⋅ / ( + ) ( / ( + )) + ( / ( + )) ⋅ 100%, where is the number of true positives, is the number of false positives, is the number of true negatives, and is the number of false negatives. From their definition, it is interesting to note that 1-score will be higher if is bigger. That is to say 1-score will be better to reflect the efficiency of HPV risk type prediction capacity.

Comparison of Early and Late Proteins' Performances in HPV Type Prediction.
The HPV genome encodes a number of early (E1, E2, E4, E5, and E6) and late (L1 and L2) proteins [3,5]. Several methods classified the high-risk and lowrisk HPVs using the information from protein sequences, secondary structure, and pseudo amino acid composition [23][24][25][26][27][28].  study, we constructed seven protein datasets of E1, E2, E4, E6, E7, L1, and L2 and compared their performance in HPV type prediction. The proteins of E5 were not included because their lengths are too small. The accuracy of each class, overall accuracy, and 1-score of all the early and late proteins were summarized in Figure 1. From Figure 1, it is easy to observe that the accuracies of low-risk type are higher than that of high-risk type. For the low-risk type prediction experiment, E7 performs better than other HPV proteins expect for mutation matrix p200. But as for high-risk type prediction and all-type prediction experiments E6 achieves the best performance among all the HPV proteins according to the accuracies and 1-scores. Some experiment studies have shown that E5, E6, and E7 proteins of high-risk HPV play an important role in disease progression and cancer [14]. E5 protein enhances half-life and activity of epidermal growth factor receptor (EGFR). E6 and E7 proteins inactivate p53 and Rb functions [42]. The results also highlight that the sequences of E6 protein are more suitable for HPV high-risk type prediction and E7 protein is more reliable for HPV low-risk type prediction in the proposed model.

Comparison of Mutation Matrices in HPV Type Prediction.
The proposed word statistical model was constructed based on protein "sequence space" that relies heavily on the mutation matrix. In order to evaluate the influence of different mutation matrices, we adopted ten mutation matrices including PAM 40, PAM 80, PAM 120, PAM 200, PAM 250, BLOSUM 40, BLOSUM 45, BLOSUM 62, BLOSUM 80, and BLOSUM 100. The accuracy of each class, overall accuracy, and 1-score of the proposed prediction method based on ten mutation matrices were represented in Figure 1. Figure 1 largely confirms that the proposed prediction method possesses different performances based on the different mutation matrices. The changes of high-risk type and all-type prediction experiments are similar, but there is a bit of difference in the low-risk type prediction experiments. As for the BLOSUM mutation matrices, BLOSUM 45 and BLOSUM 62 perform better in the prediction of high-risk type of HPVs. For PAM mutation matrices, PAM 40 and PAM 80 achieve the better performance in the high-risk type prediction experiments. Judging from prediction accuracy, it is easier to recognize that PAM 40 achieves the best performance based on E6 protein among PAM and BLOSUM

HPV Classification.
In this study, we extracted the information using the word statistical model of E6 protein "sequence space" that was constructed based on PAM 40 mutation matrix. Leave-one-out cross-validation was applied to determine the prediction performance for all experimental results. HPV types were grouped into two classes, high-risk and low-risk. Table 2 shows the comparison of the manually tagged answer and the results from the proposed prediction approach. Table 2 shows that the proposed prediction method achieves better performance, in which the prediction results of 65 HPV types are consistent with their real risk types. HPV 66 and HPV 72 are high-risk types, but they are predicted as low-risk type, and HPV 30 is low-risk type, but predicted as high-risk type using the proposed prediction method. In order to highlight the prediction differences, we further compared our results with Kim's results [13]. As for Kim's prediction, HPV 72 was predicted as possible highrisk type, but it was predicted as "low-risk type" in the proposed method; HPV 56 was predicted as possible highrisk type, while we predicted it as high-risk type; HPV 53 and HPV 73 were predicted as possible high-risk types, but they are low-risk types in our results. Phylogenetic analysis showed that HPV 30 was grouped closely with the established carcinogenic type HPV 56, which indicates that HPV 30 is more likely high-risk type. From the comparison, it is easy to note that the results obtained with the proposed method are more consistent with the real risk types.
To further evaluate the performance of the proposed prediction method, we computed the overall accuracy and 1-score and compared them with the published results in    [24], SVM with Linear kernel method (Linear) [13], SVM classifier with the Gap-spectrum kernel (Gap) [7], BLAST predictions with a slight modification of thenearest neighbor method [13], Ensemble SVM (Ensemble) based on protein secondary structures [13], and two textbased prediction methods AdaCost [26] and nave Bayes [26].
The proposed approach achieved 95.59% accuracy and 90.91% 1-score, while the Ensemble SVMs obtained 94.12% accuracy and 88.89% 1-score, and SVM with Mismatch kernel achieved 92.70% accuracy and 85.70% 1-score, and SVM using Linear kernel with 90.28% accuracy and 83.72% 1score and BLAST with 91.18% accuracy and 88.24% 1-score. Table 3: Prediction results of the HPVs with unknown types using the proposed prediction methods and from the available methods.

Types
Prediction methods Mismatch [24] Li n ea r [ 13] G a p [ 7] Genetic [23] PseAAC [27] E n se m b l e [ 13] Th As for text-based prediction method, AdaCost [26] achieved better performance with 93.05% accuracy and 84.490 % 1score, and Naive Bayes [26] lags behind with 81.94% accuracy and 63.64% 1-score. According to the prediction accuracy and 1-score, the proposed prediction method achieved the best performance among all the evaluated prediction methods; the next best prediction approach is Ensemble SVMs, and the others lag behind. It is worth mentioning that the proposed approach is based on protein sequences as well as Mismatch, Linear, and Gap, while Ensemble uses the information from the predicted protein secondary structures. We also noted that text-based prediction does not provide superior results compared with other prediction methods.
Although text-based prediction methods have an advantage in having explicit key words in the document, but they rely on the evidence obtained from the literature. When there is no available document for HPVs with unknown risk type, it is impossible to predict them. This comparison also indicates that the proposed word statistical model based on protein "sequence space" is more effective to classify risk types of human papillomaviruses.

Prediction for Unknown HPV Types.
The most important task of this paper is to predict high-risk types of new HPV.
Here, we downloaded the E6 protein sequences of HPVs with unknown types from the LANL database and used them to further evaluate the performance of the proposed approach. Table 3 shows the prediction results of all the HPVs with unknown types. From Table 3, HPV 26 and HPV 70 were predicted as high-risk types and HPV 54 and HPV 57 as low-risk types using the proposed method. In order to compare with the existing methods, we also represented the prediction results of available approaches in Table 3. From the HPV classification, we knew that the proposed prediction method achieved the best performance, and it is followed by Ensemble SVMs. From Table 3, it is easy to note that the proposed method and Ensemble SVMs achieve the same results. As for the HPV 54 and HPV 57, all the methods predicted them as low-risk types. For HPV 26, the proposed method, PseAAC [27], Ensemble [13], and Gap [7] predicted it as high-risk type, while Mismatch [24], Linear [13], and Genetic [23] predicted it as low-risk type. According to the reliabilities of the prediction approaches, HPV 26 should be high-risk type. As for HPV 70, all the prediction methods predicted it as high-risk type except for Genetic [23] and PseAAC [27]. These results show that the proposed method can provide a simple but efficient guideline for the investigation of potentially high-risk HPVs.

Conclusion
Genital human papillomaviruses have a strong relationship with cervical cancer, especially high-risk viral types of HPVs. Therefore, discrimination of HPV risk type plays an important role in the diagnosis and remedy of cervical cancer. This paper proposed a computational scheme to predict high-risk types of HPVs with word statistical model of protein "sequence space. " With help of mutation matrices, we first constructed a sequence space of the given protein sequences. Instead of only using sequence-based or structurebased information of protein sequences, we extracted the information of HPV from the protein "sequence space" with word statistical model to predict high-risk types of HPVs. The proposed method was tested on 68 samples with known HPV types and 4 samples with unknown HPV types. The results show that the proposed method achieved better performance in comparison to the previous methods.
The main goal of our research is to investigate a new prediction method based on protein "sequence space. " The first contribution can be seen from comparison of early and late proteins' performances in HPV type prediction; we found that the "sequence space" of E6 protein is more suitable for HPV high-risk type prediction, while that of E7 protein is more reliable protein for HPV low-risk type prediction. The second contribution can be indicated from comparison of mutation matrices in HPV type prediction; we noticed that PAM 40 achieves the best performance with the sequences of E6 protein among PAM and BLOSUM matrices except for PAM 80 with E4 protein. The third contribution can be deduced from HPV classification and prediction for unknown HPV types; we found that the proposed prediction method achieved the best performance among all the evaluated prediction methods, with 95.59% accuracy and 90.91% 1-score, which can be contributed to the introduction of the protein "sequence space. " Thus, this understanding can be used to guide development of more powerful method for prediction of high-risk types of HPVs.