Identification of Antioxidants from Sequence Information Using Naïve Bayes

Antioxidant proteins are substances that protect cells from the damage caused by free radicals. Accurate identification of new antioxidant proteins is important in understanding their roles in delaying aging. Therefore, it is highly desirable to develop computational methods to identify antioxidant proteins. In this study, a Naïve Bayes-based method was proposed to predict antioxidant proteins using amino acid compositions and dipeptide compositions. In order to remove redundant information, a novel feature selection technique was employed to single out optimized features. In the jackknife test, the proposed method achieved an accuracy of 66.88% for the discrimination between antioxidant and nonantioxidant proteins, which is superior to that of other state-of-the-art classifiers. These results suggest that the proposed method could be an effective and promising high-throughput method for antioxidant protein identification.


Introduction
Oxidation is a chemical reaction that transfers electrons or hydrogen from a substance to an oxidizing agent. Oxidation reactions can produce free radicals, which in turn can start chain reactions. When such a chain reaction occurs in a cell, it can damage or kill the cell. Moreover, oxidative stress is both a cause and a consequence of disease. Antioxidants are molecules that terminate these chain reactions by removing free radical intermediates and that inhibit other oxidation reactions. They do this by being oxidized themselves, so antioxidants are often reducing agents such as thiols, ascorbic acid, or polyphenols [1].
Antioxidants are widely used in dietary supplements and have been investigated for the prevention of diseases such as cancer, coronary heart disease, and even altitude sickness. Plants and animals maintain complex systems of multiple types of antioxidants, such as glutathione, vitamin A, vitamin C, and vitamin E, as well as enzymes such as catalase, superoxide dismutase, and various peroxidases. Insufficient levels of antioxidants or inhibition of the antioxidant enzymes can cause oxidative stress and may damage or kill the cells.
As oxidative stress appears to be an important part of many human diseases, the use of antioxidants in pharmacology is intensively studied, particularly as treatments for stroke and neurodegenerative diseases. Recently, Fernandez-Blanco et al. reported a computational model to identify antioxidant proteins based on star graph topological indices [2]. However, by analyzing their dataset, we found that many of its sequences are highly similar to one another; some even share 100% sequence identity. It has been demonstrated that predictive accuracy is closely related to sequence identity [3,4], and high sequence similarity within a dataset can lead to an overestimation of predictive performance. Their reported results are therefore likely to be overestimated, and there is an urgent need to develop efficient computational tools for antioxidant protein identification.
In the current study, we propose a Naïve Bayes-based computational model for predicting antioxidant proteins using amino acid compositions and dipeptide compositions. The correlation-based feature subset selection algorithm [5] was used to find the optimal feature set. Using the optimized features, the proposed model was evaluated on a benchmark dataset with the jackknife test.
According to some recent comprehensive reviews [6,7] and a series of recent publications [8-13], to establish a really useful statistical predictor, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the statistical samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; and (v) establish a user-friendly webserver for the predictor that is accessible to the public. Below, we describe these steps one by one.

Materials and Methods
Dataset.
Fernandez-Blanco et al. constructed a dataset containing 324 proteins with antioxidant activity and 1657 proteins without [2]. However, sequences in their dataset share high sequence identity. Because predictive accuracy is closely related to sequence identity [3,4], high sequence similarity can lead to an overestimation of predictive performance.
In order to prepare a reliable benchmark dataset, we first extracted proteins with antioxidant activity from UniProt [14] according to the following steps: (i) only proteins with experimentally confirmed antioxidant activity were included; (ii) proteins that are fragments of other proteins were excluded; and (iii) proteins containing nonstandard letters, that is, "B", "X", or "Z", were excluded because their meanings are ambiguous. After these strict screening procedures, we obtained 686 proteins with antioxidant activity and built a new raw dataset by merging these 686 proteins into the dataset of Fernandez-Blanco et al. [2].
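Screening step (iii) can be illustrated with a minimal sketch. The function names and the dictionary layout (identifier mapped to sequence) are assumptions made for illustration; the original work does not describe how the filter was implemented.

```python
# Hypothetical helper names; only the letters B, X, and Z come from the text.
AMBIGUOUS_LETTERS = set("BXZ")


def has_ambiguous_residue(sequence):
    """Return True if the sequence contains B, X, or Z (case-insensitive)."""
    return bool(set(sequence.upper()) & AMBIGUOUS_LETTERS)


def drop_ambiguous(records):
    """Keep only the entries whose sequence has no ambiguous letter.

    `records` maps an identifier to its amino acid sequence.
    """
    return {name: seq for name, seq in records.items()
            if not has_ambiguous_residue(seq)}


if __name__ == "__main__":
    toy = {"seq_a": "MKTAYIAK", "seq_b": "MKXAYIAK", "seq_c": "mzta"}
    print(sorted(drop_ambiguous(toy)))
```

Steps (i) and (ii) depend on database annotation rather than on the sequence string, so they are not covered by this sketch.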
To reduce redundancy and obtain statistically meaningful results, sequences sharing more than 60% sequence identity were removed from the new raw dataset using the CD-HIT program [15]. A more stringent identity cutoff of 25% would make the results more objective and reliable. However, we did not use such a stringent criterion in this study because the currently available data do not allow it: the number of antioxidant proteins would be too low to have statistical significance. Finally, a benchmark dataset containing 254 antioxidant and 1567 nonantioxidant proteins was constructed; it is available as Supporting Information S1 online at http://dx.doi.org/10.1155/2013/567529. To further estimate the performance of the method, we also collected 20 antioxidant proteins (Supporting Information S2) that are independent of the training set.

Feature Vector.
One of the most important steps in identifying protein attributes is to generate a set of informative parameters to encode protein sequences. The amino acid composition and dipeptide composition are among the most effective parameters and have been widely applied in the realm of protein prediction [10-13]. Hence, every protein sequence in the benchmark dataset was encoded as a discrete vector as follows:
$$\mathbf{F} = [f_1, f_2, \ldots, f_{420}]^{T} \qquad (1)$$
where $f_i$ are the normalized occurrence frequencies of the 20 amino acids ($i = 1, 2, \ldots, 20$) and the 400 dipeptides ($i = 21, 22, \ldots, 420$) in the protein sequence, respectively, and $T$ is the transposing operator.
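The encoding can be sketched as follows. This is a minimal illustration, not the authors' implementation: the alphabetical ordering of the residues, the normalization of amino acid counts by the sequence length and of dipeptide counts by the number of overlapping dipeptides, and the name `encode_sequence` are all assumptions.

```python
from itertools import product

# Assumed ordering: the 20 standard residues sorted by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def encode_sequence(sequence):
    """Return the 420-dimensional composition vector of a protein sequence.

    Positions 0-19 hold amino acid frequencies (counts / L) and positions
    20-419 hold dipeptide frequencies (counts / (L - 1)), where L is the
    sequence length. Nonstandard letters raise a KeyError.
    """
    seq = sequence.upper()
    length = len(seq)
    if length < 2:
        raise ValueError("at least two residues are needed to form a dipeptide")

    aa_counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in seq:
        aa_counts[residue] += 1

    dp_counts = {dp: 0 for dp in DIPEPTIDES}
    for i in range(length - 1):
        dp_counts[seq[i:i + 2]] += 1

    aa_part = [aa_counts[aa] / length for aa in AMINO_ACIDS]
    dp_part = [dp_counts[dp] / (length - 1) for dp in DIPEPTIDES]
    return aa_part + dp_part


if __name__ == "__main__":
    vector = encode_sequence("ACAC")
    print(len(vector), vector[0], vector[1])
```

With this normalization each of the two blocks sums to 1, so sequences of different lengths become directly comparable.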

Feature Selection.
Including redundant and noisy features in the model building process leads to poor predictive performance and increased computation time. Feature selection removes irrelevant features and is extremely useful for reducing the dimensionality of the data and improving predictive accuracy. For this purpose, the filter method Correlation-based Feature Selection [5], combined with a best-first search strategy, was used in the current work.
The process starts with an empty set of features and generates all possible single-feature expansions. The subset with the highest accuracy is chosen and expanded in the same way by adding single features. If expanding a subset does not improve the accuracy, the search drops back to the next best unexpanded subset and continues from there until all features have been added. The subset with the highest accuracy is selected as the final optimized feature set [16].
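The idea of scoring subsets and growing them one feature at a time can be sketched as below. This is a deliberately simplified illustration under several assumptions: it uses the correlation-based merit $k\,\bar{r}_{cf} / \sqrt{k + k(k-1)\,\bar{r}_{ff}}$ with absolute Pearson correlations as the subset score, and it uses plain greedy forward search, that is, best-first search without the backtracking step described above. It is not the WEKA implementation used in this work, and all function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def cfs_merit(subset, r_fc, r_ff):
    """Merit of a subset: high feature-class and low feature-feature correlation."""
    k = len(subset)
    mean_fc = sum(r_fc[j] for j in subset) / k
    if k == 1:
        return mean_fc
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    mean_ff = sum(r_ff[a][b] for a, b in pairs) / len(pairs)
    return k * mean_fc / math.sqrt(k + k * (k - 1) * mean_ff)


def greedy_forward_selection(columns, labels):
    """Add the best single feature until the merit stops improving.

    `columns` is a list of feature columns; returns (selected indices, merit).
    """
    m = len(columns)
    r_fc = [abs(pearson(col, labels)) for col in columns]
    r_ff = [[abs(pearson(columns[a], columns[b])) for b in range(m)]
            for a in range(m)]
    selected, best = [], 0.0
    while len(selected) < m:
        candidates = [j for j in range(m) if j not in selected]
        merit, j = max((cfs_merit(selected + [j], r_fc, r_ff), j)
                       for j in candidates)
        if merit <= best:  # no single-feature expansion improves the score
            break
        selected.append(j)
        best = merit
    return selected, best


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    informative = [0.1, 0.2, 0.1, 0.2, 0.9, 0.8, 0.9, 0.8]
    uninformative = [0.5, 0.4, 0.5, 0.4, 0.5, 0.4, 0.5, 0.4]
    print(greedy_forward_selection([informative, uninformative], labels))
```

A full best-first search would additionally keep a queue of unexpanded subsets and return to the next best one when an expansion fails to improve the score.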

Naïve Bayes.
Naïve Bayes is an effective statistical classification algorithm [17] and has been successfully used in the realm of bioinformatics [18-20]. Naïve Bayes assumes that the attribute variables are independent of each other given the outcome. This assumption greatly simplifies the calculation of conditional probabilities.
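In the Naïve Bayes framework, a classification problem can be seen as the problem of finding the outcome with maximum probability given a set of observed variables. Given a protein described by its feature vector $\mathbf{F} = (f_1, f_2, \ldots, f_n)$, we need to look for the class $C$ that maximizes the likelihood $P(\mathbf{F} \mid C) = P(f_1, f_2, \ldots, f_n \mid C)$, which under the independence assumption factorizes as $\prod_{i=1}^{n} P(f_i \mid C)$. Since the current work is intended to classify antioxidant and nonantioxidant proteins, a binary class $C \in \{0, 1\}$ was used, where 1 denotes a sample predicted as an antioxidant protein and 0 denotes a nonantioxidant protein.
For the binary classification, the class of a protein sample can be determined by comparing the two posteriors as follows:
$$\frac{P(C=1 \mid \mathbf{F})}{P(C=0 \mid \mathbf{F})} = \frac{P(C=1)\prod_{i=1}^{n} P(f_i \mid C=1)}{P(C=0)\prod_{i=1}^{n} P(f_i \mid C=0)} \qquad (2)$$
Taking the logarithm of (2), we obtain
$$\ln\frac{P(C=1 \mid \mathbf{F})}{P(C=0 \mid \mathbf{F})} = \ln\frac{P(C=1)}{P(C=0)} + \sum_{i=1}^{n} \ln\frac{P(f_i \mid C=1)}{P(f_i \mid C=0)} \qquad (3)$$
Hence, the sample will be predicted as 1 (antioxidant protein) if
$$\ln\frac{P(C=1)}{P(C=0)} + \sum_{i=1}^{n} \ln\frac{P(f_i \mid C=1)}{P(f_i \mid C=0)} > \theta \qquad (4)$$
and as 0 (nonantioxidant protein) otherwise. Here $\theta$ is the threshold that determines the tradeoff between sensitivity and specificity; it can be tuned on the training dataset to maximize the prediction performance.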
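The log-odds decision rule can be sketched in a few lines. The text does not specify how the per-feature likelihoods $P(f_i \mid C)$ are modelled, so this sketch assumes a Gaussian density per feature and class, with a small constant added to each variance for numerical stability; the function names and the default threshold of 0 are likewise illustrative assumptions.

```python
import math


def fit_gaussian_nb(rows, labels):
    """Estimate the prior and per-feature mean/variance for classes 0 and 1."""
    model = {}
    for cls in (0, 1):
        members = [row for row, y in zip(rows, labels) if y == cls]
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Assumed smoothing term keeps variances strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[cls] = (n / len(rows), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Natural log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def log_odds(model, row):
    """Log prior ratio plus the sum of per-feature log likelihood ratios."""
    prior1, means1, vars1 = model[1]
    prior0, means0, vars0 = model[0]
    score = math.log(prior1) - math.log(prior0)
    for x, m1, v1, m0, v0 in zip(row, means1, vars1, means0, vars0):
        score += log_gaussian(x, m1, v1) - log_gaussian(x, m0, v0)
    return score


def predict(model, row, threshold=0.0):
    """Predict 1 when the log-odds exceed the threshold, else 0."""
    return 1 if log_odds(model, row) > threshold else 0


if __name__ == "__main__":
    rows = [[0.1], [0.2], [0.8], [0.9]]
    labels = [0, 0, 1, 1]
    nb = fit_gaussian_nb(rows, labels)
    print(predict(nb, [0.85]), predict(nb, [0.1]))
```

Raising the threshold makes the classifier more reluctant to predict class 1, which trades sensitivity for specificity.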

Performance Evaluation.
The performance of the proposed model was evaluated using sensitivity (Sn), specificity (Sp), and accuracy (Acc), which are expressed as follows:
$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN represent the number of correctly recognized antioxidant proteins, the number of correctly recognized nonantioxidant proteins, the number of nonantioxidant proteins recognized as antioxidant proteins, and the number of antioxidant proteins recognized as nonantioxidant proteins, respectively. Because the performance of the current classifier depends on the threshold $\theta$ given in (4), the receiver operating characteristic (ROC) curve was also employed. The quality of a classifier can then be evaluated objectively, independently of any single threshold, by measuring the area under the ROC curve (auROC). The auROC score ranges from 0 to 1, with a score of 0.5 corresponding to a random guess and a score of 1.0 indicating perfect separation.
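These measures can be computed as in the following sketch. The counts and scores are toy values chosen for illustration, not results from this study, and the function names are hypothetical. The auROC is computed here through its rank interpretation, namely the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, and accuracy from confusion counts."""
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
    }


def au_roc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Equivalent to sweeping the decision threshold over all score values.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(confusion_metrics(tp=8, tn=6, fp=4, fn=2))
    print(au_roc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in the number of samples; it is adequate for a dataset of this size, although a sort-based computation would scale better.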

Results and Discussion
Three cross-validation methods, namely, the subsampling test, the independent dataset test, and the jackknife test, are often employed to evaluate the predictive capability of a predictor. Among the three, the jackknife test is deemed the most objective and rigorous because it always yields a unique outcome, as demonstrated by a penetrating analysis in a recent comprehensive review [21]; hence it has been widely and increasingly adopted by investigators to examine the quality of various predictors (see, e.g., [8, 22-28]). Accordingly, the jackknife test was used to examine the performance of the model proposed in the current study. In the jackknife test, each sequence in the training dataset is in turn singled out as an independent test sample, and all the rule parameters are calculated without including the one being identified.
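The jackknife procedure can be sketched generically. To keep the example self-contained it is paired with a toy nearest-centroid classifier; that classifier and all function names are assumptions for illustration only and stand in for whatever model is being evaluated.

```python
def jackknife_predictions(rows, labels, fit, predict):
    """Predict every sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit(train_rows, train_labels)  # sample i is never seen here
        predictions.append(predict(model, rows[i]))
    return predictions


def fit_centroids(rows, labels):
    """Toy model: the mean feature vector of each class."""
    centroids = {}
    for cls in set(labels):
        members = [row for row, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict_centroid(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    def squared_distance(center):
        return sum((x - c) ** 2 for x, c in zip(row, center))
    return min(centroids, key=lambda cls: squared_distance(centroids[cls]))


if __name__ == "__main__":
    rows = [[0.1], [0.2], [0.8], [0.9]]
    labels = [0, 0, 1, 1]
    print(jackknife_predictions(rows, labels, fit_centroids, predict_centroid))
```

Because every sample is left out exactly once, the outcome does not depend on any random partition, which is why the test yields a unique result for a given dataset.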

Prediction of Antioxidant Proteins.
We trained the Naïve Bayes classifier on the benchmark dataset using the Waikato Environment for Knowledge Analysis (WEKA) [29]. As shown in Table 1, in the jackknife test, an auROC score of 0.68 and an accuracy of 55.85%, with an average sensitivity of 75.59% and an average specificity of 52.65%, were obtained for the classification of antioxidant and nonantioxidant proteins using all 420 features, that is, 20 amino acid compositions and 400 dipeptide compositions. To save computing time, cross-validation methods (fivefold or tenfold) are widely used for feature selection in computational proteomics [16,30]. In order to identify prominent features that can distinguish between antioxidant and nonantioxidant proteins, feature selection was carried out in WEKA with tenfold cross-validation on the benchmark dataset to eliminate redundant features. In tenfold cross-validation, the benchmark dataset is split into ten pieces, and each piece is used in turn as the testing set. Thus, the training process is performed ten times, each time on the data obtained by deleting the testing set from the whole dataset. We found that the proposed method achieved a maximum accuracy of 66.89% and an auROC of 0.762 when the feature dimension was reduced to 44 (i.e., C, G, FP, FW, LK, LS, IE, VL, VH, VC, VW, MS, PD, AP, AY, YQ, YE, YR, HE, HG, QA, KA, KH, DF, DK, DR, EF, EM, EY, ER, CP, CN, CG, WC, RT, RD, RW, SV, SD, GV, GY, GK, GC).
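The tenfold partitioning described above can be sketched as follows. The shuffling, the fixed seed, and the name `k_fold_indices` are assumptions for illustration; WEKA's own fold construction may differ, for example in how it stratifies classes.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    Every sample index appears in exactly one test fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in indices if i not in held_out]
        yield train, test_fold


if __name__ == "__main__":
    for train, test in k_fold_indices(25, k=10):
        print(len(train), len(test))
```

Each of the ten training runs sees roughly nine tenths of the data, which is far cheaper than the jackknife test and is the reason cross-validation is preferred during feature selection.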
The jackknife test results of the Naïve Bayes classifier based on the 44 optimized features are listed in Table 1. As can be seen from Table 1, the current method yielded a better auROC score of 0.855 and a predictive accuracy of 66.88%, with an average sensitivity of 72.04% and an average specificity of 66.05%. Both the predictive accuracy and the auROC are higher than those of the model based on all 420 features.
Moreover, to further evaluate the proposed method, we tested it on the 20 experimentally confirmed antioxidant proteins in Supporting Information S2. As a result, 16 of the 20 antioxidant proteins (80%) were correctly predicted by the proposed method; see Table 2. Because this independent set contains only antioxidant proteins, the result reflects sensitivity alone; it nevertheless supports the good performance of our model on sequences outside the training set.

Comparison with Other Methods.
In order to further assess its performance, we compared the present model with models based on other algorithms, namely BayesNet, J48 tree, and Random forest; the results are listed in Table 3.
Although the accuracies of BayesNet, J48 tree, and Random forest are higher than that of Naïve Bayes, their auROC scores and sensitivities are all much lower than those of Naïve Bayes. These results indicate that the proposed Naïve Bayes model can be effectively used to classify antioxidant and nonantioxidant proteins.

Conclusions
In this study, a Naïve Bayes classifier combined with a feature selection method is presented to identify antioxidant proteins based on primary sequence information. By using the Correlation-based Feature Subset Selection algorithm, the feature dimension was reduced to 44 prominent features, which remarkably improved the predictive accuracy. However, detailed analyses of the selected features are still required to provide more information about their roles in biological activity. It is expected that the presented model will provide novel insights into research on antioxidants. Since user-friendly and publicly accessible webservers represent the future direction for developing practically more useful predictors [31], we shall make efforts in our future work to provide a webserver for the method presented in this paper.

Table 1: Predictive performance of Naïve Bayes based on different features.

Table 2: Predictive results based on the independent dataset.