Relevant-Based Feature Ranking (RBFR) Method for Text Classification Based on Machine Learning Algorithm

High dimensionality of the feature space is one of the problems in the field of text classification. Identifying an optimal subset of features can improve the text classification process in terms of processing time and performance. In this paper, we propose a novel Relevant-Based Feature Ranking (RBFR) algorithm which identifies and selects smaller subsets of more relevant features in the feature space. We compared the performance of RBFR against existing feature selection methods such as balanced accuracy measure, information gain, Gini index, and odds ratio on 3 datasets, namely, 20 newsgroup, Reuters, and WAP. We used 5 machine learning models (SVM, NB, kNN, RF, and LR) to test and evaluate the proposed feature selection method. We found that the proposed feature selection method is 25.4305% more effective than the existing feature selection methods in terms of accuracy.


Introduction
A massive amount of information is generated and pushed into the digital world every second through various sources such as web pages, blog contents, eBooks, social media contents, and review documents. As the content increases day by day, it becomes difficult to convert it into an organized form, which causes problems such as difficulty in searching and lack of summarization. Automatic text classification is one of the ways to efficiently organize the documents. Supervised machine learning models such as support vector machines (SVM) [1], Naïve Bayes (NB) [2], k nearest neighbor (kNN) [3], random forest (RF) [4], and logistic regression (LR) [5] are very efficient in organizing content into one or more topics (or classes). Machine learning has wide applications in the field of text classification, such as spam detection [6], sentiment analysis [7], and topic classification [8].
There are three stages in text classification: preprocessing, feature selection, and final classification. The preprocessing stage is responsible for formatting the text and removing useless words. Stop word removal, stemming, and text representation are a few of the tasks performed in the preprocessing stage. Stop word removal eliminates uninformative tokens such as "is," "was," "that," and punctuation marks. Stemming converts all derived words into their root form (e.g., "running" is converted to "run," and "walked" is converted to "walk"). Text representation formats the document into a usable form; features are identified in this stage. There are many text representations such as Bag-of-Words (BoW) [9] and n-gram [10].
A feature is the indivisible atomic unit in a text document. A text corpus may contain many documents $D = \{d_1, d_2, \ldots, d_n\}$. Each document contains $m$ unique features, and the entire text corpus contains $k$ unique features, $F = \{f_1, f_2, \ldots, f_k\}$. As the number of documents increases, the feature size also increases, which increases the classification complexity and time and decreases the accuracy. Hence, an optimal subset of $F$ should be found to represent the documents better and increase the classification performance. The total number of possible subsets is $2^k - 1$ (excluding the null set), so it is not practical to brute force all the combinations; thus, various feature selection algorithms aim to find the optimal combinations in a much easier way.
There are three types of feature selection methods: filter based, wrapper based, and embedded based [11]. Filter-based methods are model independent and pick the features based on statistical measures like correlation and chi-square. Filter-based methods are faster than the other two types, but they cannot identify the dependency between the features. Wrapper-based methods are model dependent, which means that a separate set of features is selected for each model. Wrapper-based methods use an evaluation strategy to pick the optimal subset. The embedded-based method combines both the filter-based and wrapper-based approaches. Embedded-based methods inherit both the positives and negatives of filter-based and wrapper-based methods.
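The preprocessing stage can be illustrated with a minimal sketch. The stop word list, the suffix-stripping rule in `naive_stem`, and the function names are all illustrative assumptions and not the authors' pipeline; a real system would use a full stop word list and an established stemmer.

```python
import re
from collections import Counter

# Toy stop word list (assumption); real lists are much longer.
STOP_WORDS = {"is", "was", "that", "the", "a", "and"}

def naive_stem(word: str) -> str:
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

def bag_of_words(text: str) -> Counter:
    """Lowercase, drop punctuation and stop words, stem, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(kept)

print(bag_of_words("She was running and walked."))
```

Each key of the resulting counter is one candidate feature; the feature selection stage then decides which of these keys to keep.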
In this paper, we propose a filter-based feature selection method called the Relevant-Based Feature Ranking (RBFR) algorithm, which identifies the most important features and removes irrelevant features from the feature space. The proposed method first ranks all the features according to two metrics known as true positive rate (TPR) and false positive rate (FPR). Then, the features with top TPR are picked; within the chosen list, the features with high FPR are removed. The list is extended with the common features selected by odds ratio (OR), information gain (IG), and chi-square feature selection methods. We have compared the proposed method with well-known standard feature selection methods such as balanced accuracy, OR, IG, and Pearson correlation. The main contributions are listed as follows:
(i) To develop a filter-based feature selection method which is able to pick the most important features that could describe the target class better
(ii) To identify and eliminate overlapping or weak features that poorly represent the target class
(iii) To utilize the merits of other filter-based methods to pick correct features
The above-mentioned contributions are aimed at picking the rich features that represent the target class better than the other features; additionally, errors in the selected features should be identified and removed to increase the performance. Moreover, the top features selected by other filter-based methods are also utilized in the feature selection process.
The rest of the paper is organized as follows. Section 2 reviews the literature related to feature selection. Section 3 describes the working of the proposed algorithm. Section 4 presents the experimental results and the comparison with existing machine learning models and with other existing works. Finally, the conclusion is presented in Section 5.

Related Works
In this section, we brief the recent works in the field of feature selection in text classification and list out the comparison, merits, and limitations.
The work in [12] proposes a feature selection method that uses the correlation between each feature and the class. The authors strengthened the positive features and weakened the negative features. A margin-based feature selection is implemented to increase the performance of the classification. They evaluated their proposed filter-based method on thirteen datasets and showed its superiority over existing feature selection methods.
Feature selection can also be done in many stages. The work in [13] proposes a three-stage feature selection. In the first stage, particle swarm optimization is used to search for optimal features in the feature space. In the second stage, the redundant features are found and removed from the selected features. The last stage measures the significance of each feature; if the measure is too low, the feature is deleted from the feature space. Thus, one stage is used for selecting the features and two stages are used for removing irrelevant features.
The feature selection proposed by [14] focuses on selecting features in two decision levels. In the first level, learners are used to find the relevant features, and the learners are filtered to find those with high confidence. The elected learners are allowed to vote in the second level to pick the most relevant features in the feature space.
Clustering is used for grouping features and picking the relevant features in a work proposed by [15]. The redundancy and relevancy problems are solved by the clustering algorithm. A sorting algorithm is used which arranges all the features in the clustering space. Correlation is the main metric used in the sorting algorithm to rank all the features.
An embedded-based feature selection was proposed by [16] for classification of Twitter reviews. As it combines both filter and wrapper methods, it eliminates the semantic problem. Transfer learning is used along with filter-based methods such as information gain and Pearson's correlation, and wrapper-based methods such as expectation maximization. A weight-based deep learning model is implemented to test the performance of the proposed method.
The irrelevant and redundant features present in the text corpus have a negative impact on text classification. A hybrid filter-based feature selection introduced by [17] combines principal component analysis and information gain. In their experiment, the authors found that their proposed feature selection method reduces the dimension of the data significantly by picking the correct feature subset, thus reducing the training time.
A comparison of feature selection methods was done by [18]; the authors used seven filter-based methods, two wrapper-based methods, and one embedded-based method to test the significance of the classification. Three models, namely, artificial neural network, support vector machine, and random forest, were used in their experiment. Several combinations of feature selection methods and classifiers are made, and the most appropriate subset is found based on the training performance.
Instance selection is the method of selecting or removing instances. Reducing the number of instances is also one of the ways to increase the performance of the classification. Ensemble methods are also popular in feature selection, such as in [19], where the authors used both feature selection and instance selection. Three feature selection algorithms along with instance selection are used in their experiment. Two ensemble-based techniques are used in the experiment.
Redundancy and dependency identification is generally good in filter-based methods [20]; a work [21] shows that mutual information feature selection is effective in finding correlation between the features and the target class. When it comes to the fuzzy-based environment, the mutual information like other filter-based methods is weak in calculating correlation and dependencies. They adopted a fuzzy independent classification on a fuzzy-based data space; then, based on the proportion of classification error, they adjust the fuzzy-based feature selection.
Feature selection is optimized by using genetic programming as mentioned in [22]. A hybrid feature selection is done by merging multiple filter-based feature selection methods. A feature construction algorithm is utilized to optimize the selected features. Nine datasets were used in their experiment, and the comparison shows that the feature construction algorithm is effective (Table 1).
From the above-mentioned literature, feature selection still needs considerable improvement, especially with respect to relevancy. Thus, we propose a feature selection method which is able to extract the relevant features and thereby improve the efficiency of the text classification.

Few Existing Feature Selection Methods
This section presents an overview of three popular filter-based feature selection methods.
2.1.1. Information Gain. Information gain [28] is a supervised feature selection method which ranks the features according to a word's contribution, based on its presence or absence in a particular set of text inputs [29]. IG is calculated as

$$IG(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t) + P(\bar{t}) \sum_{i=1}^{m} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t}),$$

where $m$ represents the total number of target classes. If binary classification is used, then $m$ is 2. $P(c_i)$ denotes the probability of class $i$. $P(t)$ is the probability that the word $t$ is present in a document, and similarly, $P(\bar{t})$ is the probability that $t$ is absent from a document. $P(c_i \mid t)$ and $P(c_i \mid \bar{t})$ are the conditional probabilities of class $i$ given the presence and the absence of $t$, respectively.
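As an illustration of the information gain score, the following sketch computes it for one term over a toy corpus. It assumes the standard entropy-based form of IG with base-2 logarithms (the base is not stated in the text), and the function name, the set-of-tokens document representation, and the toy data are hypothetical.

```python
import math

def information_gain(docs, labels, term):
    """IG of `term`; docs are sets of tokens, labels are class names."""
    n = len(docs)
    classes = sorted(set(labels))
    present = [lab for d, lab in zip(docs, labels) if term in d]
    absent = [lab for d, lab in zip(docs, labels) if term not in d]

    def plogp_sum(subset):
        # Sum over classes of P(c) * log2 P(c) within `subset`; 0 * log 0 is taken as 0.
        total = 0.0
        for c in classes:
            p = subset.count(c) / len(subset) if subset else 0.0
            if p > 0:
                total += p * math.log2(p)
        return total

    p_t = len(present) / n  # P(t): fraction of documents containing the term
    return (-plogp_sum(list(labels))
            + p_t * plogp_sum(present)
            + (1 - p_t) * plogp_sum(absent))

docs = [{"planet", "star"}, {"planet"}, {"marriage"}, {"people", "marriage"}]
labels = ["astro", "astro", "society", "society"]
print(information_gain(docs, labels, "planet"))  # term that separates the classes perfectly
print(information_gain(docs, labels, "people"))  # term that occurs in a single document
```

On this toy corpus the term "planet" receives the maximum possible score for two balanced classes, while "people" scores much lower.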
2.1.2. Chi-Square. Chi-square [30] is a test of the independence of a feature from the target class. It measures how much a term diverges from its dependent class [31]. CHI is calculated from the probabilities of the positive class and the negative class, denoted by the symbols +ve prob and −ve prob, respectively.
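For orientation, the sketch below computes the chi-square statistic in its textbook 2x2 contingency-table form. This form is an assumption and may differ from the probability-based expression used by the authors; the function and argument names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for one class from a 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # An empty row or column makes the statistic undefined; return 0 in that case.
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

print(chi_square(2, 0, 0, 2))  # term occurs only inside the class
print(chi_square(1, 1, 1, 1))  # term is independent of the class
```

A term that is independent of the class scores zero, and the score grows as the term's occurrence becomes more tied to class membership.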

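2.1.3. Pearson Correlation. Pearson correlation is a good statistical measure to test the dependence of a feature on the target class [32]. It is unaffected by overfitting [33]. In its standard form, it is calculated as

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},$$

where $x_i$ and $y_i$ are the feature value and the class label of the $i$-th document, and $\bar{x}$ and $\bar{y}$ are their means.
The existing feature selection methods have several problems, such as a lack of representation of class unique features, difficulty in removing useless and common features, and the inability to perform a negativity test.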

Overall Drawbacks in Existing Feature Selection Methods
Feature selection is done to reduce the dimensionality of features in the dataset. Good features need to be identified to separate the classes. As the number of features increases, the complexity of the classifier also increases; this creates a need for better feature selection methods [34].
Most existing feature selection methods use weighting schemes based on frequency and distribution; these methods fail to pick the class unique features. When a feature is very specific to one class or a few classes, that feature is very important, because it makes it easy for the classifier to identify the class.
Another problem is that many feature selection methods rely on a positive test; that is, if a feature is present, then an appropriate class can be identified. However, the negativity test is also a powerful way to eliminate weak candidates in the classification, and only a limited number of methods support it.
Combining two or more feature selection methods lets the classifier enjoy the advantages of multiple feature selection methods. The existing methods pay little attention to ensembling. Hence, the performance of feature selection can be improved by the use of an ensemble technique.

Relevant-Based Feature Ranking Algorithms
Feature selection is one of the important steps in text classification. The existing problem in ranking features is the lack of identification of dependence. A good feature is identified by the following characteristics:
(i) A feature present in only one class is unique, and it helps to identify the class correctly
(ii) A feature present in all the classes is not a good indicator for identifying a class
(iii) A feature that is absent in one or more classes is also unique, and it helps in the negativity test
Consider a sample dataset as described in Table 2. There are two classes; one class represents the topic astronomy, and the other class represents the topic society. Take the feature "planet," which is a unique feature in topic 1; similarly, the feature "marriage" is a unique feature for topic 2. The words "people" and "life" are present in both topics. The ACC2 ratings are displayed in the last column; note that the unique feature "planet" and the nonsignificant feature "life" have the same rating, which is not a good sign for the classification. Hence, the rating methodology should be optimized to select the rich features.
The proposed feature selection algorithm takes this ranking problem into consideration and aims to assign each feature a rank based on its relevance towards the target class. If the feature represents the class fully, then a high weight is given; similarly, when the feature is present in almost all the classes, it is less likely that the proposed algorithm will pick this particular feature. The RBFR algorithm works in the following steps:
(1) Rank the features based on TPR-FPR. The rich features for each class are determined by ACC2 (TPR-FPR) [35], but there is a high chance that negative features are also selected along with the rich features. Hence, a second-level filtration on the basis of FPR removes the weakly represented features.
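The two-level idea of ranking by ACC2 and then filtering on FPR can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: ACC2 is taken as the absolute difference |TPR − FPR|, and the parameters `top_k` and `max_fpr`, the function name, and the toy corpus are hypothetical.

```python
def rank_and_filter(docs, labels, positive, top_k=3, max_fpr=0.5):
    """Rank terms by ACC2 for the `positive` class, then drop high-FPR terms."""
    pos = [d for d, lab in zip(docs, labels) if lab == positive]
    neg = [d for d, lab in zip(docs, labels) if lab != positive]
    vocab = sorted(set().union(*docs))
    scored = []
    for term in vocab:
        tpr = sum(term in d for d in pos) / len(pos)  # share of positive docs with the term
        fpr = sum(term in d for d in neg) / len(neg)  # share of negative docs with the term
        scored.append((abs(tpr - fpr), tpr, fpr, term))
    # First level: rank by ACC2, highest first (ties broken by TPR, then alphabetically).
    scored.sort(key=lambda s: (-s[0], -s[1], s[3]))
    top = scored[:top_k]
    # Second level: remove features whose FPR exceeds the (assumed) cut-off.
    return [term for _, _, fpr, term in top if fpr <= max_fpr]

docs = [{"planet", "life"}, {"planet", "people"},
        {"marriage", "people", "life"}, {"marriage", "people", "life"}]
labels = ["astro", "astro", "society", "society"]
print(rank_and_filter(docs, labels, positive="astro"))
```

In this toy corpus "marriage" ties with "planet" on ACC2 for the astronomy class even though it never occurs there, which is the kind of negative feature the second-level FPR filter is meant to remove.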

Table 1
| Ref. | Method | Contribution | Limitations / possible improvements |
|---|---|---|---|
| | | | Correlation in time-based features can be improved. Embedded FS can be incorporated. |
| [24] | Orthogonal least squares | The authors have improved the speed of fetching the best features using orthogonal least squares. They have compared mutual information and other embedded methods. | Multiple correlation coefficient and the canonical correlation coefficient can be improved when feature generation and instance generation methods are used. |
| [25] | Centroid mutation-based search | A set of features which can represent a strong convergence to a set of classes is identified. This increases the position of classification margin and reduces the error. | The noisy features can be identified and removed before finding the strong convergence. |
| [26] | Balanced pointwise mutual information | A deep learning model is employed in Twitter text classification. Special characters like emoji are used as features to classify tweets. | Spam detection can be implemented to increase the accuracy. |
| [27] | Term weighting | Most of the feature selection methods just use frequency. The authors used category information as additional metric to select features for classification. | Semantics information can degrade the performance of the classification. |

Feature Selection Methods.
To increase the rate of representation, three popular feature selection methods, namely, information gain, chi-square, and Pearson correlation, are used to extract features. If a feature is selected by at least two of the feature selection methods, then that feature is also selected as per equation (4) for classification.
F1, F2, and F3 in equation (4) represent the features selected by information gain, chi-square, and Pearson correlation, respectively. The details of the feature selection algorithms are briefed in the following subsections.
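The agreement rule, under which a feature is kept when at least two of the three filters select it, amounts to the set expression $(F_1 \cap F_2) \cup (F_2 \cap F_3) \cup (F_1 \cap F_3)$. A minimal sketch follows; the function name, the `min_votes` parameter, and the example feature sets are hypothetical.

```python
from collections import Counter

def majority_features(*selections, min_votes=2):
    """Return the features chosen by at least `min_votes` of the given selections."""
    votes = Counter(f for sel in selections for f in set(sel))
    return {f for f, v in votes.items() if v >= min_votes}

f1 = {"planet", "orbit", "life"}        # e.g. chosen by information gain
f2 = {"planet", "marriage"}             # e.g. chosen by chi-square
f3 = {"orbit", "marriage", "people"}    # e.g. chosen by Pearson correlation
print(sorted(majority_features(f1, f2, f3)))
```

Counting votes rather than writing out the pairwise intersections keeps the sketch unchanged if more filter methods are added to the ensemble.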

Class Unique Features.
A feature is important based on how it represents the class. If a feature is present in only one class, then the feature is very important because it is unique to that class. Similarly, if a feature is present across many classes, then it is much less important. After the second level of filtration, a unique weight is calculated for each feature. This weight is based on the occurrence of a feature across various classes. Consider Table 3, which displays the feature-wise and class-wise frequency, where $F_{i,j}$ represents the frequency of feature $i$ in class $j$. The first step is to remove the terms that are infrequent within a class as per the condition in equation (5): the average of all frequency counts is calculated, and all the entries whose frequency is less than the average frequency are removed. Then, an inverse class frequency is calculated to find out whether a feature is common or rare. A term which is very important is then filtered using a threshold value as described in equation (6), where $|C|$ is the total number of classes in the classification and $F(c)$ represents the number of classes in which the feature $f$ occurs.
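The class unique weighting can be sketched as below. The exact forms of equations (5) and (6) are not reproduced in the text, so this sketch assumes that an entry is dropped when its count is below the global average and that the inverse class frequency is $\log(|C| / F(c))$; the function name, the `threshold` value, and the toy frequency table are hypothetical.

```python
import math

def class_unique_features(freq, threshold):
    """freq[feature][class] holds the frequency of the feature in that class."""
    counts = [n for row in freq.values() for n in row.values()]
    avg = sum(counts) / len(counts)                     # average of all frequency counts
    n_classes = len({c for row in freq.values() for c in row})
    selected = {}
    for feat, row in freq.items():
        # Keep only the classes where the feature is at least averagely frequent.
        kept = [c for c, n in row.items() if n >= avg]
        if not kept:
            continue
        icf = math.log(n_classes / len(kept))           # assumed inverse class frequency
        if icf >= threshold:
            selected[feat] = icf
    return selected

freq = {
    "planet": {"astro": 9, "society": 0, "sport": 1},
    "life":   {"astro": 6, "society": 7, "sport": 5},
    "goal":   {"astro": 0, "society": 1, "sport": 8},
}
print(class_unique_features(freq, threshold=1.0))
```

Here "life" survives the average-frequency filter in every class, so its inverse class frequency is zero and it is rejected, while "planet" and "goal" are each tied to a single class and are kept.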
3.3. Machine Learning Models. The proposed feature selection algorithm is tested using five machine learning models which are briefed in the following subsections.
3.3.1. k Nearest Neighbor. kNN is a machine learning model that computes the distances between instances. When a new sample or instance needs to be classified, kNN finds the k closest neighbors of the instance, and the target class is found by majority voting. Some statistical methods are used to fix the value of k before starting the classification. It is better to fix k as an odd number, which reduces the chance of tied votes. kNN is called a lazy classifier because it does nothing in the training phase; the distance calculation and the majority voting are done only in the classification phase.

Support Vector Machine. SVM can classify both linear as well as nonlinear data. A support vector is a point at the edge of its class, lying closest to the other class. The SVM model fixes a linearly separable margin between the classes; this margin is used to classify the instances.
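The kNN procedure of distance calculation followed by majority voting can be sketched in a few lines. Euclidean distance is an assumption here (the text does not name the distance measure), and the function name and toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Lazy learning: all distance computation happens at classification time.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
train_y = ["a", "a", "a", "b", "b"]
print(knn_predict(train_x, train_y, (0.2, 0.2)))
print(knn_predict(train_x, train_y, (5.5, 5.0)))
```

With k = 3 the second query is assigned to class "b" even though one of its three neighbors belongs to class "a", since two of the three votes agree.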

Random Forest. RF is an ensemble-based classifier. RF uses multiple decision trees (DTs). The number of DTs is fixed before the start of classification. Each decision tree receives a unique set of inputs and is trained separately. Then, the output of each DT is used in majority voting to determine the final class.

Logistic Regression. LR is a special type of classifier that is used to classify linear data. LR constructs a margin which separates the classes. A new instance is assigned a class based on where it lies with respect to the margin.

Results and Discussion
We have used three benchmark datasets for evaluating our proposed feature selection algorithm. Table 3 contains the descriptions of all datasets. We have taken 2500 random features from each dataset for our experiment.

Performance Evaluation.
In order to test the performance of our proposed feature selection algorithm, we have used four standard metrics: accuracy, precision, recall, and F1-score. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, the metrics take their standard form:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

All the documents are preprocessed; stemming and stop word removal are done before the classification.

Precision (P), recall (R), and F1-score (F) for each feature selection method and classifier.
| Feature selection | kNN P | kNN R | kNN F | NB P | NB R | NB F | SVM P | SVM R | SVM F | RF P | RF R | RF F | LR P | LR R | LR F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CHI | 52 | 55 | 56 | 83 | 60 | 70 | 11 | 67 | 19 | 81 | 96 | 88 | 89 | 78 | 83 |
| ACC2 | 56 | 79 | 66 | 77 | 89 | 86 | 60 | 90 | 72 | 83 | 96 | 89 | 73 | 93 | 82 |
| NDM | 71 | 80 | 75 | 86 | 90 | 88 | 78 | 96 | 86 | 83 | 77 | 80 | 75 | 86 | 80 |
| IF | 76 | 67 | 71 | 81 | 98 | 89 | 97 | 85 | 91 | 95 | 87 | 91 | 94 | 84 | 89 |
| GI | 89 | 88 | 88 | 85 | 74 | 79 | 93 | 81 | 87 | 62 | 94 | 75 | 52 | 71 | 60 |
| RBFR | 64 | 95 | 76 | 93 | 97 | 95 | 94 | 84 | 89 | 95 | 92 | 93 | 91 | 89 | 90 |

The characteristics of the features that are selected by a feature selection algorithm can be analyzed to test the effectiveness of the algorithm. If unique features are selected and given a high rank, then it is more likely that the performance of the classification will be good. Similarly, if irrelevant features are assigned higher ranks, the classification will perform very poorly. The proposed feature selection method removes the features with high false positive rates and thus provides a way to rank good features. This is one of the reasons for the good performance of each classifier. Along with the ranking, RBFR also considers the top selected features from three well-known filter-based methods, and the common features present in them are selected. The precision, recall, and F1 comparisons are shown in Tables 5-7 for the datasets Reuters, WAP, and 20 newsgroups, respectively.
From the performance comparison tables, it is clear that the RBFR method identifies the rich features present in the corpus and ranks them higher than the irrelevant features. Precision is one of the good measures to judge a classification. It indicates the quality of positive predictions. RBFR has higher precision in the majority of cases when compared with other feature selection methods.
The ensemble of three filter-based feature selection methods increases the chance of selecting rich features. As the selected features contain high-level features, the classification performance with the RBFR method is higher than that obtained with other feature selection algorithms. Figures 1-3 display the accuracy of kNN on the three datasets. We have compared our proposed feature selection algorithm with two other works, and Table 8 shows the comparison.
The number of features that participate in the classification is an important factor, as it affects both the efficiency of the classification and the amount of redundant information. To avoid the problems which affect the classification performance, the number of features should be selected optimally. If the feature size is very high, the training time increases rapidly; if the size is too small, the accuracy becomes very low. Hence, the optimal number of features is determined by linearly increasing the number of features and stopping when performance degradation is observed.
In our experiment, we noticed that the optimal feature size is 600; after that, the accuracy of the classifiers seems to reduce. Among the classifiers, random forest seems to have increased accuracy even after 600; this is because the random forest can reduce the dimensionality by branching over the data. Up to 1400, the random forest classifier produces acceptable accuracy.
From Figure 4, it can be seen that, compared with the existing feature selection methods, our proposed method gives better performance in terms of accuracy, and the SVM classifier produces the best accuracy when the number of features is 600. The analysis shows that as the number of features increases, the classification performance tends to improve. This is because more knowledge can be derived in the training stage to improve the accuracy of the classification. Information duplication may arise when the number of features is increased too much; hence, an optimal count is preferred.
The number of neighbors plays a critical role in classification. From Figure 5, it can be observed that as the number of neighbors increases, the performance also increases, but after 75, the classifier stabilizes. The proposed feature selection produces better results than the other feature selection methods because of the removal of noise and redundant features.
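The search for the feature count, which grows the count linearly and stops at the first drop in performance, can be sketched as follows. The function name, its parameters, and the `toy_accuracy` curve are hypothetical; the curve is synthetic and is not experimental data from this study.

```python
def pick_feature_count(evaluate, start, step, max_features):
    """Increase the feature count linearly; stop when the score first drops."""
    best_n, best_score = start, evaluate(start)
    n = start + step
    while n <= max_features:
        score = evaluate(n)
        if score < best_score:   # performance degradation observed
            break
        best_n, best_score = n, score
        n += step
    return best_n, best_score

def toy_accuracy(n):
    # Synthetic stand-in for "train a classifier on the top-n features and score it".
    return 1 - abs(n - 600) / 1000

print(pick_feature_count(toy_accuracy, start=200, step=200, max_features=2500))
```

In practice `evaluate` would train and validate a classifier on the top-ranked features, so the step size trades search cost against how precisely the optimal count is located.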

Conclusions
Feature selection is one of the important stages in improving the performance of text classification. The existing feature selection methods can identify rich features present in the text corpus, but many irrelevant features are still selected, which degrades the performance of the text classification. In this work, we propose a ranking-based feature selection model which can identify and eliminate the irrelevant features from the selection set. We have implemented the proposed feature selection model on three datasets and compared it with five existing filter-based feature selection methods, namely, ACC2, NDM, CHI, GI, and IG. The machine learning models used for classification were kNN, SVM, NB, LR, and RF. The experimental results show that NB performs best in the classification task, with 93.96% accuracy.
In future work, we aim to rank the features based on their semantics and implement deep learning-based classification.

Data Availability
The data used to support the findings of this study are included within the article.