Understanding the specificity of HIV-1 protease is crucial for designing HIV-1 protease inhibitors. In this paper, a new feature selection method combined with neural network structure optimization is proposed to analyze the specificity of HIV-1 protease and to find the positions in an octapeptide that determine its cleavability. Two newly proposed kinds of features based on the Amino Acid Index database are used together with traditional orthogonal encoding features, so that both physicochemical and sequence information are taken into account. Results of feature selection prove that
Acquired immune deficiency syndrome (AIDS) is a severe disease that is usually fatal in its terminal stage. Most patients develop this disease because they are infected by HIV-1. Although much research has been carried out, no medicine or method that entirely cures AIDS has been found. However, some medicines and therapies can relieve patients' symptoms. HIV-1 protease inhibitors are one such kind of medicine used to treat AIDS. HIV-1 protease is an enzyme that plays an important role in the replication process. It cleaves proteins into smaller peptides, and these peptides are used to make up some important proteins that are essential for the replication of HIV-1 [
Much research on HIV-1 protease cleavage site prediction has been carried out during the past two decades [
A typical HIV-1 protease cleavage site prediction framework can be described as follows: extract features from octapeptides, train a classifier on the training samples, and then predict the label of a new unlabeled sample with the trained classifier. Because the amount of information provided by a single kind of feature is limited, prediction accuracy hits a bottleneck when only one kind of feature is used. Three kinds of original features are used in this paper, and experiments are carried out to test their classification performance. It is reasonable to fuse the three kinds of features as the input of the classifier to improve classification accuracy [
According to statistical learning theory, the generalization capability of a classifier is determined by its Vapnik-Chervonenkis dimension [
In this paper, the feature representation is enriched by fusing three kinds of features to improve classification performance, and feature selection improves the generalization capability of the classifiers. Decision fusion based on the three feature subsets achieves the best classification performance in our experiments. The important positions and amino acid residues in peptides that demonstrate the cleavage specificity of HIV-1 protease are identified. Our work can provide guidance for designing HIV-1 protease inhibitors.
Several classic data sets have been collected and published. Cai and Chou [
You et al. [
Kim and his colleague [
Kontijevskis and his colleague [
In our research, these previously used data sets are combined to enlarge the data set, yielding 3618 samples. After contradictory and redundant samples are removed, the data set contains 1922 octapeptides: 596 positive samples and 1326 negative samples. This data set is called the 1922 dataset.
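The cleaning step can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes that a "redundant" sample is an exact duplicate of an octapeptide and that a "contradictory" sample is an octapeptide appearing with both labels, which is then dropped entirely. The function name `clean_dataset` and the toy peptides are hypothetical.

```python
from collections import defaultdict


def clean_dataset(samples):
    """Merge (octapeptide, label) pairs from several data sets.

    Assumed rules (not stated in the text): exact duplicates are
    collapsed, and a peptide seen with both labels is discarded.
    """
    labels_by_peptide = defaultdict(set)
    for peptide, label in samples:
        labels_by_peptide[peptide].add(label)
    # Keep only peptides that were always given the same label.
    return {
        peptide: next(iter(labels))
        for peptide, labels in labels_by_peptide.items()
        if len(labels) == 1
    }


# Toy input: one duplicate pair and one contradictory pair.
toy = [
    ("AAAAAAAA", 1), ("AAAAAAAA", 1),
    ("CCCCCCCC", 0),
    ("DDDDDDDD", 1), ("DDDDDDDD", 0),
]
cleaned = clean_dataset(toy)
```

On the toy input, the duplicate is collapsed into one positive sample and the contradictory peptide is removed, leaving two samples.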
Numerous kinds of feature extraction methods for peptides have been proposed [
Feature extraction based on peptide sequence is a classical and commonly used way to represent a peptide for HIV-1 protease cleavage site prediction. Some feature extraction methods are based on protein sequence, such as amino acid composition, n-order couple composition, pseudo-amino acid composition, and residue couple. However, these methods were originally proposed for proteins, not for peptide sequences. A protein molecule is usually much larger than a peptide, so a protein sequence contains much more structural information than a peptide, and these methods cannot extract enough useful information from a small peptide molecule. Methods proposed specifically for peptides are therefore considered in our research. Orthogonal encoding (OE) is one of the most frequently used methods and employs a sparse representation. A 20-bit vector represents one kind of amino acid, with 19 bits set to zero and one bit set to one. Each vector denoting an amino acid is orthogonal to the others. In this way, an amino acid sequence is mapped into a sparse orthogonal vector space. If a peptide sequence contains
Although OE features can provide good accuracy for HIV-1 protease cleavage site prediction, features based on sequence alone cannot provide a comprehensive feature representation. Features based on the physicochemical properties of amino acids provide different but useful information, which can effectively improve prediction accuracy. The inherent characteristics of amino acids can help us understand the specificity of HIV-1 protease [
The AAindex Database is a collection of amino acid indices in published papers [
Another important property of amino acids that can be represented numerically is the similarity between amino acids. A similarity matrix, also called a mutation matrix, contains a set of 210 numerical values (20 diagonal and 20 × 19/2 off-diagonal elements) and is used for sequence alignments and similarity searches. The AAindex2 section of the AAindex Database is a collection of published amino acid mutation matrices together with the result of cluster analysis. This section currently contains 94 matrices.
Up to now, most methods of extracting features from peptides based on the AAindex Database employ the amino acid indices. Many methods proposed for proteins can be used here, such as the autocorrelation function and pseudo-amino acid composition. In our research, features extracted from the AAindex Database by principal component analysis (PCA) and the nonlinear Fisher transformation (NLF) are used.
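The count of 210 values follows from the symmetry of a 20 × 20 matrix, in which each off-diagonal pair is stored once:

```latex
\[
\underbrace{20}_{\text{diagonal}}
+ \underbrace{\frac{20 \times 19}{2}}_{\text{off-diagonal}}
= 20 + 190 = 210 = \frac{20 \times 21}{2}
\]
```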
Nanni and Lumini utilize [
PCA based feature extraction is used to transform the original feature space into an orthogonal principal component space. The principal components are the
NLF-based feature extraction uses the objective function of the nonlinear Fisher transformation to separate patterns of different classes well. Each of the 20 kinds of amino acids can be given its own label, so discriminating the 20 amino acids becomes a supervised classification problem. The original Fisher transformation suffers from occlusion of neighboring classes, so the nonlinear Fisher transformation was proposed. After NLF is applied to the original features, each kind of amino acid is represented by an 18-feature vector.
In our research, three feature extraction methods are used: OE-, PCA-, and NLF-based feature extraction, because they were proposed specifically for peptide encoding. The experiments in the following part of this paper indicate that all three kinds of features provide good prediction performance.
In a machine learning framework, dimensionality reduction is usually an important part, aiming to reduce classifier complexity and improve classification accuracy. In some cases both aims are pursued, while in others one is the main focus. There are two ways to implement dimensionality reduction: feature transformation and feature selection, and understanding the relationship and difference between them is important. Feature transformation maps or combines features of the original feature space, a process that changes the original features and generates new ones. Feature selection finds the optimal (or suboptimal) feature subset of the original feature set and does not change the original features [
Feature selection is a relatively new problem in HIV-1 protease cleavage site prediction, and it is the key idea of our research. It helps us find the positions in octapeptides that play more important roles in deciding whether an octapeptide can be cleaved by HIV-1 protease. The roles of the amino acid residues that constitute octapeptides are also investigated. In our research, the feature selection task is separated into two steps: the preliminary step and the complete step. In the preliminary step, a wrapper feature selection method built around a neural network is applied, and structure optimization of the neural network is accomplished at the same time. Feature selection combined with neural network optimization has been conducted in previous research such as CAFS [
A wrapper method is designed to provide better classification accuracy for the prediction task in the preliminary step. A neural network inside the feature selection algorithm examines the classification accuracy of different subsets, so that useful and effective features can be selected for the subsequent prediction process. It is important but difficult to determine the number of hidden-layer nodes for a BP neural network. Too many nodes cause high computational complexity and consume many resources, while too few nodes cannot provide enough classification ability. Our method offers a solution to this problem. It is also used to choose effective features from the data set, which reveals the more important positions in octapeptides and the more important amino acid residues at different positions that indicate HIV-1 protease specificity.
The following feature selection scheme finds the useful features and accomplishes network structure optimization at the same time. One severe drawback of a neural network is that its optimal structure is not known in advance. A structure-optimized neural network can provide reliable classification ability and good generalization capability. Therefore, feature selection is combined with neural network structure optimization in this paper. The feature selection scheme is divided into two steps. The preliminary selection with network optimization accomplishes the initial selection of useful features. The complete selection confirms the final subset and network structure.
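As an illustration of the OE representation, the sketch below maps an octapeptide to a 160-dimensional vector (8 positions × 20 bits), which matches the OE feature number reported in the results. The ordering of amino acids in `AMINO_ACIDS` and the function name are assumptions made for this example; any fixed ordering gives an equivalent encoding. The PCA- and NLF-based encodings would replace each 20-bit vector with a per-residue vector taken from a lookup table, which is not reproduced here.

```python
import numpy as np

# Assumed fixed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def orthogonal_encode(peptide):
    """Return the concatenated one-hot (orthogonal) encoding of a peptide."""
    encoded = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for position, residue in enumerate(peptide):
        # Exactly one bit per position is set to one; the other 19 stay zero.
        encoded[position, AA_INDEX[residue]] = 1.0
    return encoded.ravel()


# An arbitrary example octapeptide.
x = orthogonal_encode("SQNYPIVQ")
```

Each position contributes exactly one nonzero entry, so the vector for an octapeptide has 160 entries of which 8 are equal to one.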
An octapeptide is denoted by
For the task of feature selection, data samples are randomly divided into two groups: a training set and a validation set. The training set is used to train the neural network; the validation set is used to evaluate the performance of feature subsets and to guarantee the generalization capability of the neural network during the feature selection process. The training set and the validation set contain 0.6 and 0.4 of all samples, respectively. The general information of the feature selection method is shown in Figure
Flowchart of the feature selection method. In the flowchart, the subset shown as the suboptimal subset is obtained after the preliminary step; the part before the suboptimal subset is obtained is the preliminary step. A final subset is then obtained in the complete step according to the classification accuracy based on this subset; the part after the suboptimal subset is obtained is the complete step.
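The random 0.6/0.4 division into training and validation sets can be sketched as below. The function name, the use of a fixed seed, and the rounding of the training-set size are assumptions for illustration; the text only specifies the proportions.

```python
import numpy as np


def split_train_validation(n_samples, train_fraction=0.6, seed=0):
    """Return two disjoint index arrays for training and validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)  # random order of sample indices
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]


# With the 1922 octapeptides of the combined data set:
train_idx, valid_idx = split_train_validation(1922)
```

With 1922 samples this gives 1153 training indices and 769 validation indices.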
An octapeptide is denoted by
Sort the features at each pair of symmetrical positions in ascending order of correlation value. This yields a new feature array, called the candidate feature array, from which features are added to the subset during feature selection. When a feature is added to the subset, it is removed from the candidate feature array. Feature selection generally needs an initial subset as a starting point; here the initial subset is the first feature in the candidate feature array.
Determine whether feature selection should stop by examining whether the candidate feature array is empty. If it is empty, the algorithm stops. This is termination criterion 1 of the algorithm. After the algorithm stops, the current feature subset is the final subset of the preliminary step and the initial subset for the complete step.
Feature selection starts from the first feature in the candidate array. Features are picked from this array in order and temporarily added to the current subset, creating a so-called temporary subset. If the temporarily added feature is evaluated as useful in Step
Initialize a BP neural network. The number of input layer nodes is the size of temporary subset and the number of hidden layer nodes is initially set to one. In the following process, this number may be increased as needed. The number of output layer nodes is set to one. The true label of a sample in our data is 0 or 1, so the outputs of network are real numbers close to 0 or 1 after training.
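A minimal NumPy sketch of this initialization is given below. The sigmoid activations, the small random initial weights, and the helper names `init_network`, `forward`, and `add_hidden_node` are assumptions for illustration; the text fixes only the layer sizes (inputs equal to the temporary subset size, one hidden node at the start, one output node).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_network(n_inputs, n_hidden=1, seed=0):
    """One-hidden-layer network with a single output node."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (n_inputs, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, 1)),
        "b2": np.zeros(1),
    }


def forward(net, X):
    """Return one real-valued output in (0, 1) per sample."""
    hidden = sigmoid(X @ net["W1"] + net["b1"])
    return sigmoid(hidden @ net["W2"] + net["b2"]).ravel()


def add_hidden_node(net, seed=1):
    """Grow the hidden layer by one node, keeping the existing weights."""
    rng = np.random.default_rng(seed)
    n_inputs = net["W1"].shape[0]
    return {
        "W1": np.hstack([net["W1"], rng.normal(0.0, 0.1, (n_inputs, 1))]),
        "b1": np.append(net["b1"], 0.0),
        "W2": np.vstack([net["W2"], rng.normal(0.0, 0.1, (1, 1))]),
        "b2": net["b2"].copy(),
    }


net = init_network(n_inputs=5)          # temporary subset of 5 features
outputs = forward(net, np.zeros((3, 5)))
bigger = add_hidden_node(net)
```

Keeping the trained weights when a node is added is one possible design choice; it lets training resume from the current state rather than from scratch.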
Train the network in a partial way which means training the network
Check termination criterion 2 after every partial training series. Criterion 2 is tested according to the validation error, which has been obtained in Step
Determine whether further training is needed. If the difference between training errors of current training series and previous training series is smaller than a specified threshold
Evaluate the temporary subset after adding a new feature by analyzing the classification accuracy array. A significant ascending trend in this array means that the temporarily added feature is useful: it provides useful information and improves classification accuracy. Otherwise go to Step
Determine whether a node in hidden layer should be added. Firstly, we temporarily add a hidden node for the network and train with the temporary subset again. If there is a significant ascending trend in classification accuracy array this time, it means that adding a hidden node can effectively improve classification performance. Thus the feature picked out in Step
After the preliminary step, an initial subset is obtained. In fact, a group of subsets is obtained from it for analysis. During the preliminary selection, each time a new feature is successfully added to the subset, the classification accuracy on the validation set and the current number of hidden-layer nodes are saved. Thus each subset corresponds to a validation accuracy and a hidden node number. This validation accuracy array is analyzed to help determine the final subset.
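The bookkeeping of the preliminary step can be sketched with a simplified forward selection loop. This is not the full algorithm: the partial training, the trend test on the accuracy array, and the hidden-node growth are replaced here by a single hypothetical callback `evaluate(subset)` that returns a validation accuracy, and a feature is accepted when it raises that accuracy by more than an assumed margin `min_gain`.

```python
def preliminary_selection(candidates, evaluate, min_gain=0.0):
    """Simplified wrapper selection over an ordered candidate array.

    Returns the selected subset and a history of (subset, accuracy)
    pairs, one entry per successfully added feature.
    """
    subset = [candidates[0]]  # initial subset: first candidate feature
    history = [(list(subset), evaluate(subset))]
    for feature in candidates[1:]:
        trial = subset + [feature]  # temporary subset
        accuracy = evaluate(trial)
        if accuracy - history[-1][1] > min_gain:
            subset = trial
            history.append((list(subset), accuracy))
    return subset, history


# Toy evaluator: only features 0 and 2 carry information.
USEFUL = {0, 2}


def toy_evaluate(subset):
    return 0.5 + 0.1 * len(USEFUL & set(subset))


selected, history = preliminary_selection([0, 1, 2, 3], toy_evaluate)
```

The saved history plays the role of the validation accuracy array that the complete step analyzes afterwards.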
In the end, the final subset will be determined based on the results achieved in Step
After the preliminary selection, a subset containing some redundant features is obtained. Because features are added to the subset one by one during preliminary selection, the classification accuracy on the validation samples is saved after each new feature is added, together with the number of hidden nodes corresponding to each subset. Two steps are used in order to include enough useful features: a loose feature evaluation criterion is applied in the preliminary selection, which also admits some redundant features, so the complete step is needed to remove them. When the classification accuracy on the validation set is at a relatively high and steady level, even if it is not the largest value, the subset still includes enough effective features and useful information. The following part introduces the method for determining the final subset according to the validation classification performance of the different subsets. It is applied to the three subsets obtained from the three kinds of features in the preliminary selection described above.
Validation accuracy of OE features.
A subset of 151 features is obtained after the preliminary selection. When the subset contains 142 features, the classification accuracy on the validation set reaches its largest value, 92.8479%. However, when the subset contains 90 features, the validation classification accuracy is already relatively high. The 90-feature subset is nevertheless not the final subset. The final subset should have a steadily high classification accuracy without too many ups and downs; it should not miss too many useful features, so as to keep good classification performance, and it should include as few redundant features as possible. Thus the final subset is the 104-feature one. The numbers of features distributed at the 8 positions are 0, 16, 17, 19, 19, 17, 16, and 0. It can easily be seen that most features are distributed at
Validation accuracy of PCA features.
After the preliminary selection, a subset containing 135 features is obtained. The largest classification accuracy on the validation set is 94.4083%. When the feature subset contains 134 features, the classification accuracy on the validation set reaches a local maximum. The final subset is determined by the same method as for the OE features, and the final choice is the 102-feature subset. The numbers of features distributed at the 8 positions are 0, 8, 19, 19, 19, 18, 8, and 0. The neural network obtained after feature selection contains 12 nodes in the hidden layer.
Validation accuracy of NLF features.
The preliminary selection produces a 133-feature subset. When the subset contains 88 features, the classification accuracy on the validation set reaches its largest value, 93.4980%, which is also a local maximum. When more features are added, the classification accuracy stays relatively high. Thus the final subset after the complete selection is the 103-feature one. The numbers of features distributed at the 8 positions are 0, 8, 13, 17, 16, 15, 7, and 0. The neural network obtained after feature selection contains 13 nodes in the hidden layer.
Extensive experiments are conducted using 10-fold cross validation to compare the performance of the final subsets, the fused subsets, the original features, and the fused original features. Tenfold cross validation is a widely used method to examine classification performance. Four parameters, accuracy, sensitivity, specificity, and Matthews Correlation Coefficient (MCC) [
The information contained in a single kind of feature is limited, so prediction accuracy hits a bottleneck when only a single kind of feature is used. Although OE features provide a large amount of information, sequence information alone is not enough. Features based on physicochemical properties are therefore taken into consideration, and PCA- and NLF-based features are used in this paper. They each contain different information and differ considerably from OE features. Two feature fusion methods are used to improve classification accuracy: combination fusion and decision fusion. Combination fusion trains a single classifier on the combination of the three kinds of features; decision fusion trains three classifiers separately on the three kinds of features and produces an output label from the outputs of the three classifiers according to the majority rule. Experiments are conducted to examine the classification performance based on the three kinds of original features plus the fused features. The results are shown in Table
Classification results of original features.
Features | Accuracy | Sensitivity | Specificity | MCC | Feature number |
---|---|---|---|---|---|
OE | 0.9214 | 0.9011 | 0.9351 | 0.8198 | 160 |
PCA | 0.9162 | 0.8977 | 0.9253 | 0.8083 | 152 |
NLF | 0.9136 | 0.9077 | 0.9223 | 0.8029 | 144 |
Combination fusion | 0.9147 | 0.8943 | 0.9238 | 0.8048 | 456 |
Decision fusion | 0.9344 | 0.9245 | 0.9449 | 0.8501 | |
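Decision fusion by the majority rule and the four evaluation parameters can be sketched as follows. The function names and the toy label lists are hypothetical; the metric formulas are the standard confusion-matrix definitions, and returning an MCC of 0 when the denominator is zero is an assumed convention.

```python
import math


def majority_vote(*label_lists):
    """Fuse 0/1 predictions of several classifiers by majority rule."""
    return [1 if 2 * sum(votes) > len(votes) else 0
            for votes in zip(*label_lists)]


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy predictions from three classifiers (e.g., OE, PCA, NLF features).
oe_pred = [1, 0, 1, 0]
pca_pred = [1, 1, 0, 0]
nlf_pred = [0, 1, 1, 0]
fused = majority_vote(oe_pred, pca_pred, nlf_pred)
scores = binary_metrics([1, 1, 0, 0], fused)
```

With three classifiers and binary labels, the majority rule never produces a tie.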
Table
The result of combination fusion indicates that the trained network overfits. To find the specificity of HIV-1 protease and to address overfitting of the network, feature selection is conducted. Feature selection finds the most useful features, which indicate the positions and amino acid residues that play more important roles in demonstrating the specificity of HIV-1 protease, and simultaneously accomplishes network structure optimization. Feature selection is first conducted on the three kinds of original features in this paper. The importance of each position is judged by the number of features retained at that position after feature selection: the more features a position retains, the more important it is.
The distribution of features at different sites for OE features is shown in Figure
Feature number at different sites after feature selection for OE features.
The distribution of features at different sites for PCA features is shown in Figure
Feature number at different sites after feature selection for PCA features.
The distribution of features at different sites for NLF features is shown in Figure
Feature number at different sites after feature selection for NLF features.
The statistics of the feature distribution at the 8 positions after feature selection, together with the three histograms, support our supposition that positions nearer to the scissile bond are more important. In addition, the biological meaning of the chosen features can be estimated by analyzing the statistics of the samples in the dataset. Each feature in the subset of OE features represents one kind of amino acid residue; thus computing the entropy values of all chosen features from the statistics of the subset will demonstrate the effectiveness of the chosen features. Figure
Entropy values of each feature in the OE subset. Each asterisk represents the entropy value of one feature, and the values are shown in the order in which the features are added to the subset.
As shown in Figure
After feature selection on the three kinds of original features, three subsets are obtained. To improve prediction performance, we test the classification performance of the three subsets and apply the two fusion methods to them. Table
Classification results of reduced features.
Reduced features | Accuracy | Sensitivity | Specificity | MCC | Feature number | Feature reduction ratio |
---|---|---|---|---|---|---|
OE | 0.9214 | 0.8993 | 0.9329 | 0.8191 | 94 | 0.4125 |
PCA | 0.9110 | 0.9044 | 0.9178 | 0.7992 | 91 | 0.4013 |
NLF | 0.9110 | 0.8775 | 0.9306 | 0.7934 | 74 | 0.4861 |
Combination fusion | 0.9199 | 0.8977 | 0.9299 | 0.8161 | 259 | 0.4320 |
Decision fusion | 0.9355 | 0.9211 | 0.9442 | 0.8518 | | |
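The feature reduction ratios in the table are consistent with the fraction of original features removed, 1 − (reduced feature number)/(original feature number). This definition is inferred from the tabulated numbers rather than stated explicitly; the short check below uses the feature numbers of the original and reduced feature sets.

```python
def reduction_ratio(n_original, n_reduced):
    """Fraction of the original features removed by feature selection."""
    return 1.0 - n_reduced / n_original


# (original, reduced) feature numbers taken from the two result tables.
feature_counts = {
    "OE": (160, 94),
    "PCA": (152, 91),
    "NLF": (144, 74),
    "Combination fusion": (456, 259),
}
ratios = {name: round(reduction_ratio(orig, red), 4)
          for name, (orig, red) in feature_counts.items()}
```

The combination fusion counts are simply the sums of the three individual feature numbers (160 + 152 + 144 = 456 and 94 + 91 + 74 = 259).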
Table
Understanding the specificity of HIV-1 protease can help in designing effective protease inhibitors to treat AIDS. Judging whether a peptide can be cleaved by HIV-1 protease is the key point, and machine learning is an economical way to solve this problem. To obtain a comprehensive feature representation, three kinds of features are extracted from peptide sequences in this paper. However, a large feature space causes the neural network to overfit. To guarantee generalization capability, a two-step feature selection is conducted to eliminate redundant features and retain useful ones. Feature selection also helps us understand the specificity of HIV-1 protease. The positions nearer to the scissile bond are supposed to play more important roles, and the results of feature selection support this supposition. In fact, all the features at
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China (no. 61003175), “the Fundamental Research Funds for the Central Universities,” and Dalian Science and Technology Plan Program 2012E12SF071.