As one of the most important epigenetic modifications, DNA N4-methylcytosine (4mC) plays a crucial role in controlling gene expression, DNA replication, the cell cycle, and cell differentiation. Accurate identification of 4mC sites is necessary for understanding these biological functions. In this paper, we use ensemble learning to develop a model named i4mC-EL to identify 4mC sites in the mouse genome. First, a multifeature encoding scheme consisting of Kmer and EIIP was adopted to describe the DNA sequences. Second, on the basis of this multifeature encoding scheme, we developed a stacked ensemble model in which four machine learning algorithms, namely BayesNet, NaiveBayes, LibSVM, and Voted Perceptron, were used as base classifiers whose intermediate results serve as input to the metaclassifier, Logistic. The experimental results on the independent test dataset demonstrate that the overall predictive accuracy of i4mC-EL is 82.19%, which is better than that of existing methods. A user-friendly website implementing i4mC-EL can be accessed freely at the following address.
As a chemical modification occurring on DNA sequences, DNA methylation can change genetic properties while the order of the DNA sequence itself remains unchanged. DNA methylation has many forms, such as 5-methylcytosine (5mC for short), N6-methyladenine (6mA for short), and N4-methylcytosine (4mC for short) [
To identify 4mC sites, many biology-based approaches have been explored. Single-molecule real-time sequencing technology (SMRT for short) [
In the present study, a novel model named i4mC-EL is proposed to identify 4mC sites in the mouse genome; its framework is shown in Figure
The framework of i4mC-EL.
This paper adopted the benchmark dataset constructed by Hasan's team [
Transforming DNA sequences into vectors that can effectively distinguish 4mC sites from non-4mC sites is the first step in building an ensemble learning-based predictor to identify 4mC sites [
This encoding scheme describes a sequence by the occurrence frequency of each k-mer (subsequence of length k) it contains.
In the present study, we choose the values of the parameter
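As a minimal Python sketch of the k-mer frequency idea (the function name is illustrative, and the exact k values chosen in the paper are not reproduced here):

```python
from itertools import product

def kmer_features(seq, k=2):
    """Frequency vector of all 4^k k-mers in a DNA sequence.

    Assumes len(seq) >= k; frequencies are normalized by the number
    of overlapping windows so the vector sums to 1.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    n = len(seq) - k + 1  # number of overlapping k-mer windows
    for i in range(n):
        km = seq[i:i + k].upper()
        if km in counts:  # skip windows containing ambiguous bases
            counts[km] += 1
    return [counts[km] / n for km in kmers]

vec = kmer_features("ACGTACGT", k=2)  # 16-dimensional vector for k=2
```

For k = 2 the vector has 4^2 = 16 dimensions, one per dinucleotide; larger k captures longer-range composition at the cost of exponentially more features.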
EIIP is short for electron-ion interaction pseudopotential. The EIIP-based encoding scheme was proposed by Nair and Sreenadhan in 2006. With it, each nucleotide in a sequence is replaced by its corresponding electron-ion interaction pseudopotential value (Table
The electron-ion interaction pseudopotential values for DNA nucleotides.
NT | A | C | G | T |
---|---|---|---|---|
EIIP | 0.1260 | 0.1340 | 0.0806 | 0.1335 |
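The table above translates directly into a lookup-based encoder; a minimal sketch (the function name is illustrative):

```python
# EIIP values for the four nucleotides, as listed in the table above
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def eiip_encode(seq):
    """Replace each nucleotide with its EIIP value, yielding a
    numeric vector the same length as the input sequence."""
    return [EIIP[nt] for nt in seq.upper()]

eiip_encode("ACGT")  # [0.1260, 0.1340, 0.0806, 0.1335]
```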
As an open data mining platform, Weka assembles a large number of machine learning algorithms for data mining tasks. In the present paper, the classifiers we used were all implemented in Weka, including BayesNet, NaiveBayes, SGD, SimpleLogistic, SMO, IBk, JRip, and J48, as well as ensemble learning. We ultimately chose ensemble learning; the results of the related experiments are presented in section
According to their combination strategies, bagging, boosting, and stacking are the three main types of ensemble learning. Ensemble learning is widely used in bioinformatics because it can improve the prediction performance of classifiers, for example in protein-protein interaction [
In the two-stage stacked ensemble learning, the base classifiers used in this paper include BayesNet [
Figure
Partition the dataset. Divide the training dataset into ten parts, marked train 1, train 2, …, train 10. The independent test dataset remains unchanged.
Train the base classifiers. In the present paper, we chose BayesNet, Voted Perceptron, NaiveBayes Multinomial, and LibSVM as base classifiers. For each base classifier, such as BayesNet, 10-fold cross-validation is performed. In detail, train 1, train 2, …, train 10 are used as the validation dataset in turn, the other nine parts are used as the training dataset, and a prediction is also made on the independent test dataset. This yields 10 predictions on the training dataset together with 10 predictions on the independent test dataset. The 10 predictions on the training dataset are combined vertically to get A1, and the 10 predictions on the independent test dataset are averaged to get B1. Similarly, we get A2 and B2 from NaiveBayes Multinomial, A3 and B3 from LibSVM, and A4 and B4 from Voted Perceptron.
Train the metaclassifier. Use the predictions of the four base classifiers on the training dataset, A1, A2, A3, and A4, as four features to train the Logistic classifier.
Predict new data. Apply the trained metaclassifier to the four features B1, B2, B3, and B4, constructed from the base classifiers' predictions on the independent test dataset, to obtain the final prediction results.
Working diagram of ensemble learning.
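The four steps above can be sketched in Python. The paper's classifiers are Weka implementations, so this sketch substitutes scikit-learn stand-ins (GaussianNB, SVC, Perceptron, LogisticRegression) purely to illustrate the two-stage stacking procedure; the function and variable names are illustrative, not the authors' code:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron, LogisticRegression

def stacked_predict(X_tr, y_tr, X_te, base_models, n_splits=10):
    """Two-stage stacking: out-of-fold base predictions (the A_j) train
    the metaclassifier; fold-averaged test predictions (the B_j) feed it."""
    A = np.zeros((len(X_tr), len(base_models)))  # meta-features, training set
    B = np.zeros((len(X_te), len(base_models)))  # averaged test predictions
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for j, model in enumerate(base_models):
        for tr_idx, va_idx in kf.split(X_tr):
            model.fit(X_tr[tr_idx], y_tr[tr_idx])
            A[va_idx, j] = model.predict(X_tr[va_idx])  # stacked vertically (A_j)
            B[:, j] += model.predict(X_te) / n_splits   # averaged over folds (B_j)
    meta = LogisticRegression().fit(A, y_tr)            # logistic metaclassifier
    return meta.predict(B)
```

Because each A_j entry is predicted by a model that never saw that sample, the metaclassifier is trained on honest out-of-fold outputs rather than overfit in-sample predictions.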
To validate the quality of our classification predictor, we used four indicators widely adopted in the field of bioinformatics for evaluation [
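The four indicators reported in the tables below (ACC, MCC, Sn, Sp) follow standard confusion-matrix definitions; a minimal sketch, with an illustrative function name:

```python
import math

def metrics(tp, fp, tn, fn):
    """ACC, MCC, Sn, and Sp computed from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)   # overall accuracy
    sn = tp / (tp + fn)                     # sensitivity: recall on positives
    sp = tn / (tn + fp)                     # specificity: recall on negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews corr. coef.
    return acc, mcc, sn, sp
```

For example, 8 true positives, 1 false positive, 9 true negatives, and 2 false negatives give ACC = 0.85, Sn = 0.80, and Sp = 0.90.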
To find features that adequately represent the structure and function of the DNA sequences, we compared numerous feature encoding schemes. To achieve the best accuracy, we also trained the model with several different classification algorithms. The results of the relevant comparative experiments are given below.
As shown in the "Feature Encoding" section, we encode the DNA sequences with a multifeature that combines Kmer and EIIP.
Table
Performance comparison of different feature encoding schemes under 10-fold cross-validation.
Schemes | ACC | MCC | Sn | Sp |
---|---|---|---|---|
BPF | 0.668 | 0.335 | 0.665 | 0.670 |
DPE | 0.614 | 0.228 | 0.619 | 0.609 |
RFHC | 0.658 | 0.316 | 0.669 | 0.647 |
RevKmer | 0.755 | 0.511 | 0.745 | 0.765 |
PseKNC | 0.794 | 0.589 | 0.786 | 0.803 |
0.724 | 0.448 | 0.729 | 0.718 | |
0.747 | 0.493 | 0.744 | 0.749 | |
RevKmer+DBE | 0.738 | 0.476 | 0.723 | 0.753 |
RevKmer+EIIP | 0.779 | 0.558 | 0.764 | 0.794 |
0.732 | 0.464 | 0.741 | 0.723 | |
Our method | 0.803 | 0.606 | 0.784 | 0.822 |
To further illustrate the prediction capability of our selected multifeature encoding scheme, the ROC curves for the different feature encoding schemes under 10-fold cross-validation are displayed in Figure
ROC curves for different feature encoding schemes under 10-fold cross-validation.
As shown in the "Classifier" section, we fed the multifeature composed of Kmer and EIIP into various classifiers for comparison.
The results of these comparative experiments are displayed in Table
Performance comparison of different classifiers under 10-fold cross-validation.
Classifiers | ACC | MCC | Sn | Sp |
---|---|---|---|---|
BayesNet | 0.727 | 0.453 | 0.739 | 0.714 |
NaiveBayes | 0.752 | 0.504 | 0.751 | 0.753 |
SGD | 0.712 | 0.424 | 0.710 | 0.713 |
SimpleLogistic | 0.761 | 0.522 | 0.753 | 0.768 |
SMO | 0.702 | 0.405 | 0.706 | 0.698 |
IBk | 0.637 | 0.276 | 0.584 | 0.690 |
JRip | 0.707 | 0.414 | 0.692 | 0.723 |
J48 | 0.665 | 0.330 | 0.674 | 0.655 |
RandomForest | 0.770 | 0.541 | 0.753 | 0.787 |
AdaBoostM1 | 0.713 | 0.427 | 0.739 | 0.688 |
Bagging | 0.729 | 0.459 | 0.744 | 0.714 |
Our method | 0.803 | 0.606 | 0.784 | 0.822 |
To further illustrate the classification capability of our selected stacking classifier, the ROC curves for the different classifiers under 10-fold cross-validation are displayed in Figure
ROC curves for different classifiers under 10-fold cross-validation.
In this section, a comparative experiment on the independent test dataset (TEST-320) is conducted to show the generalization capability of our selected multifeature and stacking classifier. The rationale is that the model is trained and tested on two different datasets, which is equivalent to performing a real prediction task with the trained model.
Using the stacking classifier, we evaluate the generalization capability of each of the feature encoding schemes described in Section
Performance comparison of different feature encoding schemes on TEST-320.
Schemes | ACC | MCC | Sn | Sp |
---|---|---|---|---|
BPF | 0.753 | 0.530 | 0.606 | 0.900 |
DPE | 0.697 | 0.401 | 0.600 | 0.794 |
RFHC | 0.716 | 0.438 | 0.631 | 0.800 |
RevKmer | 0.666 | 0.335 | 0.744 | 0.588 |
PseKNC | 0.781 | 0.563 | 0.788 | 0.775 |
0.772 | 0.553 | 0.681 | 0.863 | |
0.800 | 0.614 | 0.694 | 0.906 | |
RevKmer+DBE | 0.756 | 0.516 | 0.700 | 0.813 |
RevKmer+EIIP | 0.713 | 0.427 | 0.763 | 0.663 |
0.772 | 0.553 | 0.681 | 0.863 | |
Our method | 0.822 | 0.644 | 0.806 | 0.838 |
To further describe the generalization capability of our selected multifeature encoding scheme, Figure
ROC curves for different feature encoding schemes on TEST-320.
We compared the stacking classifier used in this paper with eleven other classifiers on TEST-320, using the multifeature combining Kmer and EIIP.
Performance comparison of different classifiers on TEST-320.
Classifiers | ACC | MCC | Sn | Sp |
---|---|---|---|---|
BayesNet | 0.769 | 0.547 | 0.675 | 0.863 |
NaiveBayes | 0.788 | 0.577 | 0.744 | 0.831 |
SGD | 0.688 | 0.379 | 0.756 | 0.619 |
Simple Logistic | 0.728 | 0.456 | 0.738 | 0.719 |
SMO | 0.675 | 0.353 | 0.744 | 0.606 |
IBk | 0.600 | 0.201 | 0.563 | 0.638 |
JRip | 0.769 | 0.541 | 0.713 | 0.825 |
J48 | 0.663 | 0.325 | 0.656 | 0.669 |
Random Forest | 0.778 | 0.558 | 0.738 | 0.819 |
AdaBoostM1 | 0.791 | 0.581 | 0.794 | 0.788 |
Bagging | 0.781 | 0.564 | 0.744 | 0.819 |
Our method | 0.822 | 0.644 | 0.806 | 0.838 |
To further describe the generalization capability of our selected stacking classifier, the ROC curves for the different classifiers on TEST-320 are displayed in Figure
ROC curves for different classifiers on TEST-320.
Here, we compared i4mC-EL with 4mCpred-EL and i4mC-Mouse on TEST-320 to further evaluate its performance. Table
Performance comparison of different models on TEST-320.
Models | ACC | MCC | Sn | Sp |
---|---|---|---|---|
4mCpred-EL | 0.791 | 0.584 | 0.757 | 0.825 |
i4mC-Mouse | 0.816 | 0.633 | 0.807 | 0.825 |
i4mC-EL | 0.822 | 0.644 | 0.806 | 0.838 |
In the present paper, an ensemble learning model called i4mC-EL was designed to identify 4mC sites in the mouse genome. In constructing i4mC-EL, we conducted extensive comparative experiments on different features and classifiers to determine the optimal combination of feature encoding schemes and classifiers. Finally, we encoded the DNA sequences with a multifeature combining Kmer and EIIP.
In addition, we compared i4mC-EL with existing models to demonstrate its effectiveness. The results show that i4mC-EL outperforms the existing models and has better generalization capability. In summary, i4mC-EL is effective in predicting 4mC sites in the mouse genome, which helps us to understand the biochemical properties of 4mC.
In future work, we will use adaptive feature vectors to represent DNA sequences in order to optimize the feature encoding scheme [
The datasets used during the present study are available from the corresponding author upon reasonable request, or can be downloaded from
The authors declare that they have no competing interests.
This work is supported by the Natural Science Foundation of Heilongjiang Province (LH2019F002), National Natural Science Foundation of China (61901103, 61671189), and the Fundamental Research Funds for the Central Universities (2572018BH05, 2572017CB33).