Feature Selection Has a Large Impact on One-Class Classification Accuracy for MicroRNAs in Plants

MicroRNAs (miRNAs) are short RNA sequences involved in posttranscriptional gene regulation. Their experimental analysis is complicated and, therefore, needs to be supplemented with computational miRNA detection. Currently computational miRNA detection is mainly performed using machine learning and in particular two-class classification. For machine learning, the miRNAs need to be parametrized and more than 700 features have been described. Positive training examples for machine learning are readily available, but negative data is hard to come by. Therefore, it seems prerogative to use one-class classification instead of two-class classification. Previously, we were able to almost reach two-class classification accuracy using one-class classifiers. In this work, we employ feature selection procedures in conjunction with one-class classification and show that there is up to 36% difference in accuracy among these feature selection methods. The best feature set allowed the training of a one-class classifier which achieved an average accuracy of ~95.6% thereby outperforming previous two-class-based plant miRNA detection approaches by about 0.5%. We believe that this can be improved upon in the future by rigorous filtering of the positive training examples and by improving current feature clustering algorithms to better target pre-miRNA feature selection.


Introduction
MicroRNAs (miRNAs) are short regulatory nucleotide sequences which were discovered about 2 decades ago [1]. Since then, they were shown to exist in organisms ranging from sponges [2] to human [3] and also in plants where they may have evolved independently [4]. Due to their involvement in posttranscriptional regulation, miRNAs have been implicated in human diseases [5,6] and, for example, in plant stress response [7]. Regulation of gene expression is of great interest and, therefore, miRNAs have received increasing interest.
Some miRNAs have been experimentally detected and they are stored in databases such as miRBase [8] and miR-TarBase [9]. Unfortunately, experimental detection of miR-NAs is quite involved and further suffers from the fact that some miRNA-target combinations may be expressed only in response to specific external or internal stresses so that it seems impossible to experimentally detect all miRNAs of any higher eukaryotic organism [10][11][12][13]. Additionally, current approaches may not use the full potential of available experimental techniques [14]. Even among the experimentally validated miRNAs in miRBase, there may be entries which are dubious and we have shown that some may not represent true miRNAs [15]. Thus, computational methods for miRNA detection are required to complement experimental approaches.
Computational approaches for miRNA detection and miRNA's target detection [16] are generally based on machine learning [17]. In order to use miRNAs in machine learning, they must be parameterized and many different features have been described [18]. For successful machine learning, known examples need to be provided to the algorithms and these are generally collected from miRBase. Except for a few examples [10], two-class classification is used for machine learning. In this case in addition to the positive learning examples, negative ones have to be provided.

Advances in Bioinformatics
Unfortunately, the negative class cannot be defined for any larger eukaryotic organism since any part of a genome must be coexpressed with any other part of the genome to prove that they cannot interact as miRNAs, a futile endeavor. Therefore, examples for the negative class have been arbitrarily created in studies performing machine learning for miRNA detection, but the most abundantly used negative dataset is from [19] although new datasets are being proposed [20]. While in our experience the dataset is a significant improvement over using randomly selected or generated examples, it is likely to also contain positive examples.
To overcome the problem of needing to define artificial negative examples, one-class classification can be used instead [21,22]. One-class classification (OCC) only needs the target class for learning and then classifies unknown examples into target or unknown (outlier) class. We have recently analyzed the use of one-class classification for miRNA detection in plants and found that it is competitive in comparison to two-class classification [23].
We also found that some features among the hundreds of features described for miRNA parameterization [24] are more discriminative than others. Unfortunately, feature selection is NP-hard [25] and selecting the best subset from more than 700 features on a per-dataset basis is not possible in an acceptable amount of time. Apart from the computational complexity of feature selection, methods have only been well established for two-class classification [26][27][28], while only few approaches deal with one-class classification [29][30][31].
In this study, we used different feature selection approaches and evaluated the effectiveness of each one against the others for the classification of miRNAs from different plant species using one-class classification. We found that clustering of features is important for classification performance and helps to improve classification accuracy by up to 36% (Oryza sativa) and on average by about 30% compared to selecting features by the lowest information gain. Clustering in its present form, however, is ill suited for selection from pre-miRNA features, since logically correlated features may be uncorrelated when examining the feature vectors in isolation and vice versa. Therefore, the clustering methods performed best, but not the expected clustering method which selected the feature with the highest information gain from each cluster. With a peak performance of 98.80% accuracy for Zea mays, feature selection using clustering methods was able to increase the accuracy achieved in our previous studies by 1-4% and the two-class classification benchmark PlantMiRNAPred by ∼0.5%. This study showed that feature selection is effective despite varying data quality. Concluding, it seems important to devise effective data selection methodologies to ensure that only true miRNAs are used as positive examples. Novel feature clustering approaches taking into account logical feature correlation need to be coupling this data filtering strategy. When both are achieved, oneclass classification may significantly outperform two-class classification for pre-miRNA detection in all cases (for plant) since the gap is already closed with this study.

One-Class Classification.
One-class classification was performed using randomly sampled 90% of the positive data for training and 10% for testing (Supplementary File 1 in Supplementary Material available online at http://dx.doi.org/ 10.1155/2016/5670851, Figure 2). In addition to the 10% positive data during testing, the pseudo negative sequences were used as unknown class. To determine the stability of the feature set, the process was repeated 100 times. We used the DDtools [35] implementation of a OCC.

Feature Selection Strategies.
Feature selection is not a straightforward process since two features which individually may not be very discriminative may have high selective power in combination. Additionally, we have more than 700 features so trying all combination of features from 1 to 700 for all datasets considered in this study is not possible. In order to test the impact of feature selection on the classification accuracy, four negative and four positive feature selection methods were designed. According to our previous research [24], 50 features seem sufficient for miRNA detection [24] and this limit was used in this study. Among the feature selection methods used in this study, only PCF is using only the positive class whereas all others are using both classes.

Selecting Features with Low Information Gain (LIG).
Information gain was calculated on a per-dataset level using KNIME [36] and for each dataset the process had to be iteratively repeated until the 50 features with lowest information gain. First information gain was calculated and since most features received a score of zero they could not be differentiated. After removing all features with information gain above zero and recalculating information gain for the remaining features, the pool of features that receive zero became successively smaller until only 50 features remained. These remaining features were selected and used for classification (Supplementary File 1, LIG).

Random Feature Selection (RFS)
. 50 features, from the overall list of features, were randomly selected for each dataset (Supplementary File 1, RFS).

Selecting Random Feature from Feature Clusters (RFC).
Since correlated features may not much contribute to the discriminative power of a feature set, features were clustered into 100 clusters and from each cluster a random feature was Advances in Bioinformatics 3 selected from which 50 random features were selected for classification (Supplementary File 1, RFC).

Selecting Features from Clusters (SFC).
Instead of selecting a representative feature per cluster (as in RFC and HIC), this approach selects up to three clusters and uses all features therein to see the impact of correlation among features on the classification (Supplementary File 1, SFC).

Selecting Features with High Information Gain (HIG).
Contrary to the selection of features with low information gain, we selected the features with the highest information gain on a per-dataset basis (Supplementary File 1, HIG).

Selecting Feature with the Highest Information Gain from
Feature Clusters (HIC). Among the 700 features describing miRNAs are likely some which describe very similar information and, to avoid overrepresentation of such information, it may be beneficial to cluster such features and use single features to represent each cluster. 100 clusters were produced per dataset and from each cluster the feature with the highest information gain was selected. The selected features were again ranked via information gain and the top 50 were selected (Supplementary File 1, HIC).

Pearson Correlation-Based Feature Selection (PCF).
The Pearson correlation-based feature selection method was introduced by Lorena et al. [29]. The Pearson correlation measure allows detection of the linearity relation among features. The pairwise distances among features were calculated using Pearson correlation. Features with lower correlation were preferred during the feature selection process (Supplementary File 1, PCF).

Feature Selection.
Feature selection is an important process in machine learning and not all of the more than 700 features which have been proposed to describe a pre-miRNA might be useful. We used eight feature selection methods and performed one-class classification using each of the selected feature sets. 50 features were used as a target and we did not try different sized feature sets. Additionally, for each feature selection method a combined feature set was created containing the most commonly selected features (denoted by suffix comb). To judge feature selection success, oneclass classification was performed using the selected features. For the combined feature selection method, SFC and HIC outperform all other feature selection methods consistently for all datasets (Figure 1). However, the performance varies strongly with the organism and the variation is similar for all feature selection methods. This synchronized variation indicates that the available training data is either not sufficient or not accurate enough or both. This is in line with a previous study where we showed that not all data on miRBase may represent true miRNAs [15]. Nonetheless, since the feature selection methods SFC and HIC consistently achieved the highest accuracy compared to all other feature selection methodologies, the assessment of feature selection methods is not hampered by the shortcomings of the datasets. The average accuracy over all feature selection methods can further help define dataset quality. ppt has the best result (∼88%) followed by zma (∼87%), while gma (∼80%) and osa (∼78%) are tailing the list (Supplementary Table 2).
Data in Figure 1 is in respect to combined feature sets and it may be informative to investigate accuracy in respect to feature selection on a per-organism basis (Figure 2) since the average accuracy of average accuracies for all feature selected methods is slightly higher for individual feature selection (84.69%) as compared to combined feature selection (84.58); see Supplementary Table 2 for more information. Figure 2 shows that the feature selection methods fall into three categories, improved accuracy (SFC, HIC, and RFS), no or little effect (RFC, HIG, ZNF, and PCF), and finally LIG, which was chosen as a negative control along with RFS and SFC. LIG which selects the features with the lowest information gain performs as expected and leads to the lowest effectiveness of the resulting classifiers. Interestingly, RFS and SFC which we expected to provide low performance lead to effective classifiers. Many feature clusters exist in the large feature space and RFS (random feature selection) likely picked from different clusters. This is supported by the observation that the spread of RFS is much larger than for SFC and HIC. SFC merely selects features from three clusters and was expected to have low performance. However, it turned out to be the best performing algorithm which we attribute to a critical problem in clustering for feature selection. We used -means clustering on the feature vector containing calculations for positive examples. Unfortunately, features like % A and % G are strongly correlated although their vectors are anticorrelated. This is true for many of the features among the more than 700 described features and the opposite is also true where two feature vectors appear highly correlated although they describe logically uncorrelated information. For these reasons, HIC which selects the feature with the highest information gain from each cluster performed not as well as would be possible if correlation would be perfectly following logical constraints. RFC (random feature selection from clusters) performed better than HIG, ZNF, and PCF, which supports the notion of why HIC did not consistently show the best performance among all algorithms. Nonetheless, HIC showed the overall best performance for zma with 98.80% accuracy. Among the feature selection methods, there is a large performance gap (∼38%). On a per-organism basis, the gap reaches ∼36% (O. sativa) but on average it is close to 30% (Supplementary Table 2), which highlights the importance of proper feature selection.

3.2.
Organism-Based Feature Selection. Feature selection as described above was performed on a per-organism basis and from the selected features the most common ones were used to define a combined feature selection (Supplementary Table  1). It was our belief that using feature selection on a perorganism basis would have an effect on the accuracy that can be achieved. On average this is true and a slight advantage of about 0.2% accuracy when comparing the averages of the averages of all feature selection methods can be observed (Supplementary Table 2). Unfortunately, it turns out that the number of positive results and their correctness are more important than the feature selection method and whether it is performed on a per-organism basis or not. For A. thaliana, G. max, and O. sativa the assumption holds but, for the other species used here, the results were not so clear. Perhaps the strategy of selecting the 50 most common features among the first 100 features into the combined feature set is not actually selecting a feature set that would describe the selected clade.
In order to investigate whether feature selection has to be made on a per-species basis, we used the interactive tree of life [37] to establish the phylogenetic relationship among the selected plant species (Figure 3). Three larger groups are visible one of which consists of a single species (ppt) while the other two contain three species each.
We checked the deviation of the accuracy achieved using feature selection on a per-organism basis to the average for the organisms and found that for SFC and HIC the distance of P. patens to the average was different from the distances of the other organisms which were alike in their groups (Supplementary Table 2) as expected from their placement in the phylogenetic tree ( Figure 3).

Performance Comparison.
In a previous study we introduced sequence motifs as features for pre-miRNA detection and compared one-class classification to two-class classification [23]. Compared to the results we achieved with MotifmiRNAPred, this study shows that feature selection is essential for classification performance (Table 1).
For some of the organisms (gma, zma, and sbi), Plant-MiRNAPred, which employs two-class classification, performs better than the one-class classification performed in this study. On the other hand, on average one-class classification is slightly better and performs better than Plant-MiRNAPred on some organisms (ath, ppt, ptc, and osa). The improvement to MotifmiRNAPred which also uses oneclass classification is about 3.5% on average and this again highlights the importance of feature selection. The improvement of classification accuracy in this study (using oneclass classification) over PlantMiRNAPred (using two-class classification) is about 0.5% accuracy, but it is an important step since for the first time OCC based pre-miRNA detection outperforms two-class classification.  [32], Triplet-SVM [38], microPred [39], and MotifmiRNAPred [33]

Conclusions
One-class classifiers were trained using different feature sets selected by various feature selection methods (see Supplementary Table 1 for selected features). Feature clustering seems essential since the best performing methods use clustering ( Figure 2). On average 30% accuracy can be gained using feature selection (Supplementary Table 2). Compared to previously published methods, the OCC using SFC filtered features performs better (∼3.5%) than our previously established OCC and slightly better (∼0.5%, ∼96%) than the best two-class classifier for plant pre-miRNA detection (Table 1). A peak accuracy of close to 99% may seem sufficient for miRNA detection, but this figure needs to be put into context with millions of hairpins in a genome which could constitute a pre-miRNA. Thus accuracies above 99.99% (about 1000 false positives in a genome of 3Gb) are needed if all prediction ought to be tested experimentally.
To achieve such high accuracies, the following points need to be tackled: (1) In the future, it will be important to devise a method to filter the positive training data since not all datasets seem to represent a pure class and this strongly influences classification accuracy ( Figure 1). (2) Clustering is essential for feature selection, but current clustering methods do not take into account logical correlation ( Figure 2). (3) A switch to OCC needs to be performed since negative data for pre-miRNAs does not exist [10,40]. (4) Feature selection should be performed on a perorganism basis and how classifiers perform for groups of organisms when features are selected on the group basis should be investigated.
Finally, all feature selection methods in this study, save PCF, are based on two classes and in the future additional feature selection methods that are independent of the negative class should be investigated. Since there is a large interest in the field of feature selection spawned by big data problems, many new methods have been devised and we plan to use some of them [41,42] in future research to select a stable and performant set of features.