ANPrAod: Identify Antioxidant Proteins by Fusing Amino Acid Clustering Strategy and N-Peptide Combination

Antioxidant proteins protect organisms from free-radical damage and play significant roles in disease control and in delaying aging. Accurate identification of antioxidant proteins has important implications for the development of new drugs and the treatment of related diseases, as they play a critical role in the control or prevention of cancer and aging-related conditions. Since experimental identification techniques are time-consuming and expensive, many computational methods have been proposed to identify antioxidant proteins. Although the accuracy of these methods is acceptable, some challenges remain. In this study, we developed a computational model called ANPrAod to identify antioxidant proteins based on a support vector machine. To eliminate potentially redundant features and improve prediction accuracy, we evaluated 673 reduced amino acid alphabets to find the optimal feature representation scheme. The final model achieved an overall accuracy of 87.53% with an area under the ROC curve of 0.7266 in fivefold cross-validation, which was better than the existing methods. The results on the independent dataset also demonstrated the robustness and reliability of ANPrAod, which could be a promising tool for antioxidant protein identification and contribute to hypothesis-driven experimental design.


Introduction
High concentrations of reactive oxygen species result in oxidative damage to proteins, DNA/RNA, and polyunsaturated fatty acids, which in turn can lead to hypertension, cancer, coronary heart disease, and Alzheimer's disease [1][2][3][4]. Antioxidant proteins eliminate excess free radicals through interactions that protect cells and DNA from oxidative damage. Because this activity is closely related to disease control, antioxidant proteins have become a research hotspot in life science and pharmacology [5,6]. Identifying antioxidant proteins through biochemical experiments is time-consuming and expensive, so there is an urgent need to develop computational methods that complement the experiments.
In recent years, with the mass production of protein sequences, a series of methods have been developed to identify different types of proteins. Based on a support vector machine (SVM), Zuo et al. successfully predicted defensin proteins with an accuracy of 92.38% [7,8]. Feng et al. designed a predictor called AodPred to identify antioxidant proteins, with a cross-validation accuracy of 74.79% [9]. Fu et al. proposed a method called StackCPPred, which used a stack-based machine learning method to effectively predict cell-penetrating peptides [10]. Tan et al. applied the binomial distribution method to recode the sequence to predict hormone-binding protein [11]. These machine learning methods yielded promising results, but limitations remain in the accuracy and efficiency of antioxidant protein prediction.
In this study, a novel feature extraction method, the amino acid reduction alphabets combined with the N-peptide composition strategy, was used to identify antioxidant proteins. Reduced amino acid alphabets are often used for large-scale protein structure analysis and prediction [8,12,13]. They can tolerate many changes in sequences while still retaining the basic folding and function of the proteins. Figure 1 shows the ANPrAod framework flow. First, a strict benchmark dataset was constructed to ensure the validity of the comparison among models. Subsequently, amino acid reduction alphabets combined with the N-peptide composition (N = 1, 2, 3) strategy were used to extract feature vectors, which were compared to obtain the optimal scheme. Based on the support vector machine (SVM), ANPrAod yielded an accuracy of 87.53% in fivefold cross-validation, which was better than the existing methods in a series of comparisons. Finally, the prediction performance of ANPrAod was objectively evaluated with the independent dataset and principal component analysis (PCA), which supported the robustness and reliability of the model. In conclusion, ANPrAod is an effective tool for predicting antioxidant proteins, which could assist experimental studies on the treatment of related diseases.

Materials and Methods
2.1. Dataset. A reliable dataset is a prerequisite for building a high-quality model [14][15][16]. To facilitate the comparison of our model with previous work, we used the same benchmark dataset collected in the study of Feng et al. [9,17]. Finally, 1805 protein sequences were used as the training dataset, including 253 antioxidant proteins and 1552 nonantioxidant proteins. In addition, we constructed a strictly independent dataset from UniProt, containing 240 protein sequences (50 antioxidant proteins and 190 nonantioxidant proteins), to objectively evaluate the robustness of the model.

2.2. Support Vector Machine. The support vector machine supports four main kernel functions: linear kernel function, polynomial kernel function, radial basis function (RBF), and sigmoid kernel function [18]. The core of SVM is to transform the data into a high-dimensional Hilbert space and find the optimal separating hyperplane. For the convenience of scientific research, Chang and Lin developed the LIBSVM package, which can be downloaded for free from http://www.csie.ntu.edu.tw/~cjlin/libsvm/ [19]. It has been widely used in computational biology [20][21][22].
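As a small illustration of the RBF kernel named above, the following sketch evaluates K(x, z) = exp(-γ‖x − z‖²) for two feature vectors in plain Python. The function name `rbf_kernel` is hypothetical and is not part of LIBSVM; the sketch only shows what the kernel parameter γ controls and is not the authors' code.

```python
import math


def rbf_kernel(x, z, gamma):
    """Return exp(-gamma * ||x - z||^2) for two equal-length numeric vectors."""
    if len(x) != len(z):
        raise ValueError("vectors must have the same length")
    # Squared Euclidean distance between the two vectors.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

A larger γ makes the kernel value fall off faster with distance, which is why γ is tuned together with the regularization parameter C.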
In this study, the LIBSVM package with the RBF kernel was used to predict antioxidant proteins. We used grid search to optimize the regularization parameter C and the kernel parameter γ to improve the performance of the model. The selection ranges of C and γ are as follows:
2.3. Reduced Amino Acid Alphabets. Researchers have shown that amino acid sequences can be redefined according to the position, structure, function, and similarity of the amino acids in the protein sequences; the resulting alphabets are called reduced amino acid alphabets [23]. Compared with the original protein sequences, reduced amino acid alphabets show superior ability to reduce protein complexity and to extract conserved features hidden in noisy signals [24]. Based on RAACBook, we applied 673 amino acid reduction schemes to our model [25,26].
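To make the reduction step concrete, the sketch below rewrites a protein sequence in a reduced alphabet by replacing each residue with a representative letter of its cluster. The cluster list is an assumption for illustration: it uses the groups named in the Feature Analysis section (FWY, CILMV, DE, HR, AT, NQS) and treats the remaining residues G, K, and P as singletons, which may differ from the exact RAACBook scheme. The names `EXAMPLE_CLUSTERS`, `build_reduction_map`, and `reduce_sequence` are hypothetical.

```python
# Illustrative clustering only; G, K, and P as singletons is an assumption.
EXAMPLE_CLUSTERS = ["FWY", "CILMV", "DE", "HR", "AT", "NQS", "G", "K", "P"]


def build_reduction_map(clusters):
    """Map every residue to the first letter of the cluster that contains it."""
    mapping = {}
    for cluster in clusters:
        representative = cluster[0]
        for residue in cluster:
            if residue in mapping:
                raise ValueError(f"residue {residue!r} appears in two clusters")
            mapping[residue] = representative
    return mapping


def reduce_sequence(sequence, mapping):
    """Rewrite a sequence in the reduced alphabet.

    Residues missing from the mapping raise KeyError, so nonstandard
    symbols are reported instead of being silently dropped.
    """
    return "".join(mapping[residue] for residue in sequence.upper())
```

With this mapping, the 20-letter alphabet collapses to 9 letters, so the dipeptide feature space shrinks from 400 to 81 dimensions.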

2.4. N-Peptide Composition. Single amino acid interactions and more detailed sequence information can be effectively mined by N-peptide (N = 1, 2, 3) composition. We did not try longer N-peptides because of memory limitations [8,27]. For a natural protein sequence, the dipeptide composition can be described as follows: where R_1 represents the first amino acid in the protein sequence, L represents the total length of the protein sequence, d_i (i = 1, 2, ⋯, 400) is the ith dipeptide among the 400 dipeptide combinations, and T denotes the transposition operator.
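The N-peptide composition described above can be sketched as a sliding-window count over the sequence. This is a minimal illustration, not the authors' implementation: the function name `n_peptide_composition` is hypothetical, and normalizing each count by the number of windows (L − N + 1) is an assumption about how frequencies are defined.

```python
from itertools import product


def n_peptide_composition(sequence, alphabet, n):
    """Return the frequency vector over all len(alphabet)**n N-peptides.

    The vector is ordered like itertools.product(alphabet, repeat=n).
    """
    keys = ["".join(letters) for letters in product(alphabet, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    windows = len(sequence) - n + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than n")
    for start in range(windows):
        # A KeyError here means the sequence contains a letter outside the alphabet.
        counts[sequence[start:start + n]] += 1
    return [counts[key] / windows for key in keys]
```

For the natural 20-letter alphabet and N = 2 this yields the 400-dimensional vector mentioned in the text; a reduced alphabet of size 10 would yield 100 dimensions.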
2.5. Feature Selection. Feature selection is an important step in building a powerful model and is of great significance for improving the performance of the classifier [28][29][30]. Analysis of variance (ANOVA), which scores each feature by the ratio of its variance between groups to its variance within groups, helps us evaluate the weight of each feature and is widely used in bioinformatics [31,32]. An appropriate feature dimension can save computing resources, reduce the risk of overfitting, and improve prediction accuracy, so we used incremental feature selection (IFS) to filter the ANOVA-ranked features used to train the model [33].
[Heatmap axis labels: reduction types 2, 5, 6, 7, 19, 20, 29, 30, 31, 32, 33, 34, 35, 38, 49, 52, 53, 56, 57, 58, 59, and 63; cluster size.]

Computational and Mathematical Methods in Medicine
The ANOVA formula is defined as follows:

$$F = \frac{S_X^2}{S_y^2},$$

where F is the variance value of the feature, $S_X^2$ is the sample variance between groups, and $S_y^2$ denotes the sample variance within groups.
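The F value described above, the ratio of between-group to within-group sample variance, can be computed for a single feature as in the sketch below. It assumes the standard one-way ANOVA degrees of freedom (K − 1 between groups and N − K within groups), which the text does not state explicitly; the function name `anova_f_score` is hypothetical.

```python
def anova_f_score(groups):
    """One-way ANOVA F value for one feature.

    `groups` is a list of lists, one list of feature values per class.
    """
    k = len(groups)
    if k < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    n_total = sum(len(g) for g in groups)
    if n_total <= k:
        raise ValueError("need more samples than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares around each group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    s2_between = ss_between / (k - 1)
    s2_within = ss_within / (n_total - k)
    if s2_within == 0:
        return float("inf")  # groups are perfectly separated with no spread
    return s2_between / s2_within
```

Features with a larger F value separate the two classes more clearly and would be ranked earlier.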
2.6. Performance Evaluation. The traditional metrics, sensitivity (Sn), specificity (Sp), accuracy (Acc), and area under the receiver operating characteristic curve (AUC), were used to evaluate the performance of the models, which are defined as follows [20-22, 34-

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative samples, respectively. α_i and β_i (i ∈ N) are the false positive rate and false negative rate obtained at different thresholds. The receiver operating characteristic (ROC) curve was used to quantitatively evaluate the performance of the model [38]. The false positive rate and true positive rate are plotted on the x-axis and y-axis, respectively.
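The count-based metrics named above follow directly from the confusion matrix, and the area under the ROC curve can be approximated with the trapezoidal rule. The sketch below is illustrative only: the function names are hypothetical, and the trapezoidal rule is a common choice rather than the exact AUC formula of the paper, which is not recoverable from the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "Sn": tp / (tp + fn),                    # fraction of positives recovered
        "Sp": tn / (tn + fp),                    # fraction of negatives recovered
        "Acc": (tp + tn) / (tp + tn + fp + fn),  # fraction of all samples correct
    }


def trapezoid_auc(fpr, tpr):
    """Area under an ROC curve given points sorted by increasing false positive rate."""
    if len(fpr) != len(tpr):
        raise ValueError("fpr and tpr must have the same length")
    area = 0.0
    for i in range(1, len(fpr)):
        # Width along the x-axis (FPR) times the mean height (TPR) of the segment.
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area
```

On an imbalanced dataset such as the one used here, Acc can be high while Sn stays low, which is why Sn, Sp, and AUC are reported alongside it.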

Performance of Different Reduced Amino Acid Alphabets.
RAACBook summarizes the 673 reduced amino acid alphabets and classifies them into 74 types; each type contains 2-19 reduced sizes [25]. The protein sequences of the training dataset were reduced according to RAACBook, and the N-peptide (N = 1, 2, 3) composition was used to extract feature vectors, so that SVM models could be used to evaluate the influence of different feature extraction methods on predictive performance. Figures 2(a) and 2(b) show the accuracy density profiles of the 673 reduced amino acid cluster models for predicting antioxidant proteins with different N-peptide compositions (N = 1, 2, 3). Notably, the dipeptide composition achieved better accuracy than the single-peptide and tripeptide compositions, which suggests that it can significantly simplify complexity and reduce information redundancy. Therefore, we further analyzed the detailed accuracy of all dipeptide combinations and showed the 22 types with the best results in a heatmap. It can be seen from Figures 3(a) and 3(b) that with type 19 and size 10, the accuracy of fivefold cross-validation reached 87.31%, which was the best discriminative ability.

Determination of Optimal Features.
It is well known that the predictive power of a model does not improve linearly with the number of feature dimensions, so it is necessary to examine the predictive performance of different feature sets for the dipeptide composition (type 19, size 10). First, we used ANOVA to score each feature and sorted the features by score from largest to smallest. Then, IFS (with a step size of 1) was used to determine the optimal number of features. As shown in Figure 3(c), the model reached its highest fivefold cross-validation accuracy of 87.53% when the top 93 features were used. Finally, we used the optimal feature set to construct the SVM model for antioxidant protein prediction. The ROC curve drawn from the fivefold cross-validation result of the optimal feature set was used to further objectively evaluate the performance of ANPrAod (Figure 4(a)).
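The IFS procedure described above can be sketched as a loop over growing prefixes of the ranked feature list. In this minimal sketch, `evaluate` is a hypothetical callback standing in for whatever scoring is used (for example, fivefold cross-validation accuracy of an SVM trained on the given features); it is not an API from LIBSVM or any other library, and ties are resolved in favor of the smaller feature set by assumption.

```python
def incremental_feature_selection(ranked_features, evaluate, step=1):
    """Return (best_subset, best_score) over prefixes of a ranked feature list.

    `ranked_features` must already be sorted from most to least important.
    `evaluate` takes a list of features and returns a score to maximize.
    """
    best_subset, best_score = [], float("-inf")
    for size in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:size]
        score = evaluate(subset)
        # Strict comparison keeps the smallest subset among equal scores.
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

With a step size of 1, every prefix length is evaluated, so the cost grows linearly with the number of ranked features times the cost of one evaluation.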

Feature Analysis.
Solis used the information maximization method from information theory to cluster amino acids into 2-19 groups (Table 1) [39]. Mutual information was maximized based on the similarity of the pairwise contact interactions of the 20 amino acids and was then used as the objective function to mimic the pairwise contacts that occur in natural proteins [39]. Specifically, the amino acids are grouped as nonpolar aromatic (FWY), nonpolar aliphatic and sulfur-containing (CILMV), acidic (DE), basic (HR), small (AT), and other polar (NQS), which also demonstrates that these alphabets maintain the ability to identify remote interactions.

Comparison with Previous Methods.
To demonstrate the superiority of ANPrAod in the identification of antioxidant proteins, we compared it with published methods. As shown in Table 2, based on the same dataset, the fivefold cross-validation results showed that ANPrAod had the best performance with an accuracy of 87.53%, which was better than other methods. This may be attributed to the fact that SVM was originally designed for binary classification and has theoretical bounds on its generalization error [40]. The upper bound of the generalization error does not depend on the dimension of the space, and the maximum-margin principle minimizes this error bound by maximizing the distance between the separating hyperplane and the nearest data points of the two classes [41]. In addition, ANPrAod used only 93 features compared to the 158 features used by AodPred, which reduced computational complexity and the risk of overfitting. This comparison demonstrated the effectiveness of the amino acid reduction alphabets combined with the N-peptide composition strategy and the strong ability of ANPrAod to identify antioxidant proteins.

Performance Assessment of ANPrAod on Independent Dataset.
It is not rigorous to evaluate a model based only on the information in the training set, which may overestimate its performance. To avoid this problem, we tested ANPrAod on an independent dataset to evaluate its real performance. The confusion matrix showed that ANPrAod still achieved excellent prediction results, which supported the robustness and effectiveness of the model and suggested that it could be a powerful tool to assist the study of antioxidant proteins (Figure 4(b)). In addition, we used PCA to compare the natural protein sequences with the reduced amino acid protein sequences, which further confirmed the superiority of the amino acid reduction combined with the N-peptide composition strategy (Figures 4(c) and 4(d)).

Conclusion
Feature extraction is extremely important for generalization ability; it can promote the subsequent learning of the model and improve interpretability [10,42]. In this study, a new feature representation scheme that combines amino acid reduction alphabets with the N-peptide composition strategy was applied to redefine protein sequences. The new feature vectors were used to train SVM models to find the optimal scheme for predicting antioxidant proteins. The accuracy of fivefold cross-validation was 87.53%, and the area under the ROC curve was 0.7266, which was better than other models. PCA and independent dataset results also indicated that the amino acid reduction alphabets combined with the N-peptide composition strategy can effectively reduce data complexity and that ANPrAod is robust in predicting antioxidant proteins. We anticipate that ANPrAod can accurately and rapidly identify antioxidant proteins based on peptide sequence and promote the development of related drug research. In future work, we will establish an online web server and extend the research content to other fields.

Data Availability
To facilitate the comparison of our model with previous work, we used the same benchmark dataset collected in the study of Feng et al. (doi:10.1007/s12539-015-0124-9).

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.