Boosting Accuracy of Classical Machine Learning Antispam Classifiers in Real Scenarios by Applying Rough Set Theory

Nowadays, spam deliveries represent a major obstacle to benefiting from the wide range of Internet-based communication forms. Despite the existence of different well-known intelligent techniques for fighting spam, only some specific implementations of the Naïve Bayes algorithm are finally used in real environments for performance reasons. Since some of these algorithms suffer from a large number of false positive errors, in this work we propose a rough set postprocessing approach able to significantly improve their accuracy. In order to demonstrate the advantages of the proposed method, we carried out a straightforward study based on a publicly available standard corpus (SpamAssassin), which compares the performance of previously successful well-known antispam classifiers (i.e., Support Vector Machines, AdaBoost, Flexible Bayes, and Naïve Bayes) with and without the application of our developed technique. Results clearly show the suitability of our rough set postprocessing approach for increasing the accuracy of previously successful antispam classifiers when working in real scenarios.


Introduction and Motivation
Half a century ago, nobody could imagine the immense capabilities of current computing systems and network devices. Nowadays, they have drastically changed the way people share or exchange information and interact or communicate, thanks to permanent Internet access (24 hours a day) through latest-generation devices. In fact, most Internet consumers use a smartphone (67.5%) or tablet (42.3%) to access their e-mail accounts [1,2].
Since e-mail can be read anywhere at any time, spammers have found this service particularly appropriate for delivering spam content. On the one hand, the usage of the e-mail service has experienced an explosive growth, achieving an average of 538.1 million messages sent daily during 2015, which represents an interannual increase of 5% since 2010 [3]. On the other hand, the percentage of spam e-mails suffered a slight reduction, representing an interannual decrease of 3.4% since 2010 [4]. Taking this situation into account, it is easy to realize that spam deliveries remain a problem to be solved in modern society. To cope with this situation, the software industry (headed by Internet security enterprises) has been continuously improving existing antispam filtering techniques and systems in order to enhance both filtering throughput [5][6][7] and classification accuracy.
However, despite the large number of machine learning (ML) classifiers that have proven to be useful to fight against spam, only Naïve Bayes (NB) has been typically included by default in popular antispam filtering products such as SpamAssassin [24] and Wirebrush4SPAM [5], due essentially to its adequate balance between the accuracy obtained and the associated computational cost [21,22]. This is particularly true because in the antispam filtering domain the number of false positive (FP) errors made by the classifier while processing legitimate contents is of utmost importance [25]. This aspect still represents a major challenge for current techniques commonly applied in the area, especially when working in real and dynamic environments characterized by (i) the subjective nature of the spam concept, (ii) the adverse effects of concept drift, and (iii) the coexistence of multiple languages in individual mailboxes. To cope with this situation, Google (considered as one of the most valuable brands in the world [26]) decided to equip Gmail with a user-guided learning mechanism. As described in [27], this technology makes use of an artificial neural network (ANN) that takes the classification criteria of Gmail users as feedback information. In this context, the accuracy of this approach depends directly on the number of Gmail users. As a result, the large number of active Gmail accounts (more than 900 million in 2015 [28]) allows the Gmail antispam filtering system to achieve a classification accuracy of up to 99%. Consequently, because it depends on the number of users to achieve suitable classification results, this mechanism can only be applied to e-mail services with a large number of active users.
As a direct consequence of the underlying operation mode, this strategy cannot be extrapolated to those e-mail services belonging to SMEs (Small and Medium Enterprises), since the number of e-mail users tends to be insufficient to achieve accurate classification rates. This situation has motivated SMEs to continue using typical antispam filtering frameworks such as SpamAssassin or Wirebrush4SPAM.
In such a situation, the continuous development and deployment of both existing and novel antispam techniques over classical filtering frameworks continue to be a necessity for the SME environment. Specifically, we consider the reduction of type I (false positive) errors extremely important. To this end, in this work, we propose the use of rough sets (RS) theory due to its ability to deal with uncertainty and avoid type I errors [29].
In detail, RS theory was initially proposed by Pawlak in the 1980s [30,31], providing a formal methodology for the automatic transformation of data into knowledge [32]. The philosophy of this method is based on the assumption that any inexact concept (e.g., denoted by a class label) can be approximated from above and from below (upper and lower approximations) using an indiscernibility relationship. As detailed in [33], one of the most important characteristics of RS theory is the ability to discover redundancy and dependencies between features.
Additionally, RS could provide interesting benefits to the correct classification of e-mails as they offer (i) effectiveness in discovering hidden patterns from data, (ii) the possibility of using both quantitative and qualitative information, (iii) the capability to evaluate the significance of data, (iv) the identification of the minimal set of useful data that minimizes the overall classification complexity, (v) the automatic generation of a decision ruleset from scratch, and (vi) the identification of previously unknown relationships. All of these inherent features, together with some positive results achieved in previous works [29], suggested to us the possibility of creating an RS postprocessing algorithm applicable to any ML classifier working as a standalone antispam filter. In this line, the present work introduces the proposal of a postprocessing algorithm and shows the viability of the idea from an experimental point of view.
While this section has introduced and motivated our proposal, the rest of the paper is organized as follows: Section 2 summarizes previous related approaches that also make use of RS theory in the antispam filtering domain. Section 3 details the developed algorithm that applies RS theory to extract domain specific decision rules from data, which will later guide the final revision of the initial proposed classification. Section 4 provides a clear description of the experimental protocol and documents the benchmark results obtained from the executed experiments. Finally, Section 5 provides conclusions and identifies future research work.
Within the antispam filtering domain, Pérez-Díaz et al. [29] proposed three different execution schemes for using specific rules generated by applying RS theory. They compared these approaches against other well-known successful antispam techniques and reported a considerable reduction in the number of FP errors. Complementarily, Glymin and Ziarko [34] conducted a study to evaluate the use of variable precision RS (VPRS) [35] in the antispam filtering domain. In this work, a set of private Hotmail messages was collected over two years and VPRS was used to establish a decision table for classifying e-mails into two possible categories (i.e., spam or legitimate).
From a different perspective, some research studies focused their efforts on maintaining those rules generated through the use of RS [36][37][38]. These works proposed different frameworks to share generated rules from servers with the final goal of giving adequate support to a collaborative community interested in spam filtering. In the work of Chiu et al. [36], both the rule updating procedure and the policy for deleting obsolete rules are centralised in collaborative servers with the goal of immediately sharing available changes with the community. Additionally, the work of Lai et al. [37] introduces the generation of rules by means of RS, genetic algorithms, and reinforcement learning. Finally, the study carried out by Lai et al. [38] proposed novel methods to generate rules and validate their precision.
From another point of view, the work of Yang [39] proposed a framework (called RCFG) that combines RS and ant colony for applying an initial filtering to available data. Afterwards, the proposed approach uses a genetic algorithm to carry out feature selection. Finally, different classifiers (i.e., SVM, $k$-NN, ANN, and NB) are used to identify spam e-mails.
Furthermore, several works are also available that make use of RS to support three-way classification schemes. This type of alternative involves the definition of a third category (i.e., "suspicious") to include those messages that cannot be easily classified as spam or legitimate. Following this approach, Zhao and Zhu [40] made use of the forward selection method [41] to generate a training corpus formed by eleven attributes and demonstrated the superiority of their VPRS-based algorithm when compared with Naïve Bayes. In the same line, the authors of [42,43] initially reduced data attributes (also making use of the forward selection method), applying genetic algorithms for calculating RS reducts.
Complementarily, several researchers concentrated their efforts on applying the decision theoretic RS (DTRS) model to three-way classification [44,45]. In DTRS, the two thresholds that separate the three categories (spam, ham, and suspicious) are initially calculated by using Bayesian theory in an automated way. Afterwards, classification with DTRS is made by means of a set of loss functions, which yield the classification with the minimal risk. In [44], a three-way decision model based on DTRS was compared with Naïve Bayes to evidence a reduction in error rates. Zhao et al. [45] proposed a novel approach based on -positive-region of DTRS and compared achieved results with Naïve Bayes and other models based on RS.
Finally, Jia and colleagues [46,47] enumerated the many benefits of three-way decision approaches and introduced a further challenge of discovering what to do with suspicious e-mails and how they can be examined in detail.

Using RS to Extract and Apply Domain Specific Decision Rules for Improving Accuracy
As can be seen from the last section, in recent years a wide variety of contributions showing the applicability of RS [30][31][32][33] to the antispam filtering domain have been presented. However, to the best of our knowledge, no approach exists that combines the fast execution speed of some successful ML classifiers with the good accuracy achieved by RS alternatives. Therefore, in this work, we propose an innovative way to review the final output given by standard classifiers (in the form of a postprocessing algorithm) with the goal of reducing the number of type I (FP) errors. In this line, our complementary RS decision rules are generated from the same data (e-mail corpus) used to train the classifier (see Figure 1), but they are applied only when a new incoming e-mail is initially classified as spam. By following this straightforward approach, our method becomes potentially applicable to any classifier.
As shown in Figure 1, the whole filtering process involves an initial feature extraction phase used to gather the specific values needed for representing a new incoming e-mail as an adequate input for the selected classifier. After that, the classification model guesses the class of the message, generating an initial output. If the message is categorized as spam, it is further revised by our automatically generated RS decision rules before reaching a final classification. These revision rules are generated by our knowledge acquisition and representation module (shown in the right part of Figure 1), which is structured into two different stages: (i) feature selection and (ii) computation of RS rules.
In order to carry out the initial feature selection stage, a dense dataset should be generated from those messages that comprise the e-mail corpus. To do this, each column included in the dataset (condition attribute) $a_i$, $i \in \{1, \dots, n\}$, represents the presence or absence, in each message, of a given token of the e-mail corpus (i.e., the smallest portion of text enclosed by two characters included in the [[:blank:]] class). Therefore, the number of condition attributes of the newly generated dataset, $n$, is equal to the number of different tokens included in any message belonging to the e-mail corpus. Moreover, the real (known) class of each message (decision attribute) is also included as the last column of the dataset, being represented using a binary variable. In this context, the set of instances stored in the dataset is called the universe, $U$, and its cardinality is equal to the number of messages finally represented, $m$.
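The construction of this dense dataset can be illustrated with a minimal Python sketch. The function names (`tokenize`, `build_dataset`), the two-message corpus, and the encoding 1 = spam, 0 = legitimate are assumptions made only for illustration; they are not taken from the reference implementation.

```python
import re

def tokenize(text):
    """Split on runs of blanks (space or tab), mirroring the [[:blank:]] class."""
    return [tok for tok in re.split(r"[ \t]+", text.strip()) if tok]

def build_dataset(corpus):
    """corpus: list of (text, label) pairs, label 1 = spam, 0 = legitimate (assumed encoding).
    Returns (tokens, rows); each row holds n presence flags followed by the class."""
    tokens = sorted({tok for text, _ in corpus for tok in tokenize(text)})
    rows = []
    for text, label in corpus:
        present = set(tokenize(text))
        # One binary condition attribute per token, decision attribute last.
        rows.append([1 if tok in present else 0 for tok in tokens] + [label])
    return tokens, rows

# Hypothetical two-message corpus, used only to show the shape of the result.
tokens, rows = build_dataset([("cheap pills now", 1), ("meeting now", 0)])
```

Here the number of columns minus one corresponds to $n$ and the number of rows to $m$.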
During the feature selection stage, we perform a reduction of the dimensionality of the condition attributes that are part of the initial input dataset, represented by $A = \{a_1, \dots, a_n\}$. To this end, we apply two complementary procedures: (i) stop word removal and (ii) feature ranking. The first one comprises the elimination of those tokens having fewer than 3 characters and/or being included in the stop word list provided by Baeza-Yates and Ribeiro-Neto [48]. Then, we take advantage of Information Gain (IG) [49][50][51] to evaluate the suitability of each attribute included in the dataset. From all the available columns, we select the 100 best-ranked attributes included in the dataset and discard the rest of the information [29]. Table 1 introduces an example of the result achieved after the execution of the feature selection stage, showing only 8 token attributes ($n = 8$) and 8 e-mails ($m = 8$) due to the lack of space. Additionally, we maintain the decision attribute ($d$) corresponding to the real (known) classes in the dataset (represented in the 9th column).
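The two procedures of this stage can be sketched as follows. This is an illustrative sketch under stated assumptions: `select_features` and its `stop_words` parameter are hypothetical names, the stop word list of [48] is not reproduced here, and IG is computed with the usual entropy-based definition.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def information_gain(column, labels):
    """IG of one attribute: class entropy minus the entropy left after splitting on it."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(column):
        subset = [l for x, l in zip(column, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def select_features(tokens, rows, stop_words, k=100):
    """Drop short tokens and stop words, then keep the k best tokens by IG.
    rows follow the layout used in the text: condition attributes first, class last."""
    labels = [r[-1] for r in rows]
    candidates = [j for j, t in enumerate(tokens)
                  if len(t) >= 3 and t.lower() not in stop_words]
    ranked = sorted(candidates,
                    key=lambda j: information_gain([r[j] for r in rows], labels),
                    reverse=True)
    return [tokens[j] for j in ranked[:k]]
```

With `k=100` the function mirrors the cutoff mentioned above; the value is exposed as a parameter because the classifiers in the experiments use different numbers of features.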
From the information stored in the dense dataset represented in Table 1, and applying RS theory, we designed a deterministic approach to generating a set of accurate revision rules [52], which will be later applied to the standard workflow represented in Figure 1. In this context, a rule $r$ establishes a specific combination of values for some condition attributes $a_i$ (e.g., $r.\mathrm{conditions}[a_2] = v_{a_2} \wedge r.\mathrm{conditions}[a_5] = v_{a_5}$) that determine a solution for a certain decision attribute $d$ ($r.\mathrm{decision} = \mathrm{solution}$). The algorithm that carries out the rule extraction process is introduced in Algorithm 1. For representation purposes, a value of '?' in a condition attribute, $a_i$, means that this feature should not be taken into consideration.
As shown in Algorithm 1, for each e-mail stored in the dataset, $e_i$, a new rule is generated through the computation of the shortest reduct (computeShortestReduct function) for a given concept ($d_2$), which is defined as 1 for the e-mail $e_i$ itself, '?' for the other messages of the same class, and 0 for the remaining instances (lines (08)-(12) in Algorithm 1). In this context, a reduct is a minimal (irreducible) subset of features, $\mathrm{RED} \subseteq A$, having the same precision to guess a concept ($X$) as the whole set of condition attributes in $A$. In order to assess the potential for classification of a set of condition attributes, $B \subseteq A$, all the instances, $\{e_1, \dots, e_m\}$, should be grouped into different subsets, where each subset contains all the indiscernible (indistinguishable) instances. This grouping is known as the set of equivalence classes, $U/\mathrm{IND}(B)$.
Two instances $e_i, e_j \in U$ are indiscernible with respect to the condition attribute set $B$ if they share the same values for all the attributes in $B$. Taking this into consideration, the potential for classification of the condition attributes included in $B$ is measured by computing the lower approximation of the concept $X$, $\underline{B}X$. In this context, $\underline{B}X$ is the union of the equivalence classes $E$ of $U/\mathrm{IND}(B)$ having at least one positive instance and no negative ones. If we now consider the example shown in Table 1, since all the represented instances are discernible, $U/\mathrm{IND}(A) = \{\{e_1\}, \{e_2\}, \{e_3\}, \{e_4\}, \{e_5\}, \{e_6\}\}$, the lower approximation of concept $X$ with the attributes included in $A$ is $\underline{A}X = \{e_1, e_4, e_5\}$. Moreover, the subset of features $B = \{a_2\}$ is a reduct regarding concept $X$, because $U/\mathrm{IND}(B) = \{\{e_1, e_4, e_5\}, \{e_2, e_3, e_6\}\}$ and, hence, $\underline{B}X = \{e_1, e_4, e_5\} = \underline{A}X$.
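The indiscernibility grouping and the lower approximation can be sketched in a few lines of Python. The attribute values in `TABLE` are hypothetical (Table 1 is not reproduced here); they were chosen only so that the partitions match the example in the text. The sketch uses the standard definition for a fully defined concept, in which a class belongs to the lower approximation when it is entirely contained in the concept.

```python
from collections import defaultdict

def equivalence_classes(table, attrs):
    """Partition U/IND(B): group instances sharing the same values on attrs."""
    groups = defaultdict(set)
    for name, values in table.items():
        groups[tuple(values[a] for a in attrs)].add(name)
    return list(groups.values())

def lower_approximation(table, attrs, concept):
    """Union of the equivalence classes entirely contained in the concept."""
    result = set()
    for eq_class in equivalence_classes(table, attrs):
        if eq_class <= concept:
            result |= eq_class
    return result

# Hypothetical values: a2 alone separates {e1, e4, e5} from {e2, e3, e6}.
TABLE = {
    "e1": {"a1": 0, "a2": 1, "a3": 0},
    "e2": {"a1": 0, "a2": 0, "a3": 0},
    "e3": {"a1": 0, "a2": 0, "a3": 1},
    "e4": {"a1": 0, "a2": 1, "a3": 1},
    "e5": {"a1": 1, "a2": 1, "a3": 0},
    "e6": {"a1": 1, "a2": 0, "a3": 0},
}
X = {"e1", "e4", "e5"}

full = lower_approximation(TABLE, ["a1", "a2", "a3"], X)
only_a2 = lower_approximation(TABLE, ["a2"], X)
```

In this toy table both calls return the same set, which is the property that makes $\{a_2\}$ a reduct for $X$.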
Keeping in mind the existence of undefined values ('?') for concept $d_2$ (considered in Algorithm 1), two lower approximations are equivalent if they only differ in those instances ($e_i$) having an undefined value for the $d_2$ concept.
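One possible reading of the rule extraction step is sketched below. Since Algorithm 1 is not reproduced here, everything in this block is an assumption: the exhaustive search by increasing subset size, the names `lower_positive`, `compute_shortest_reduct`, and `extract_rule`, the rule layout, and the toy table. Comparing only the positive instances of two lower approximations is how the sketch ignores differences in '?' instances.

```python
from itertools import combinations

def lower_positive(table, attrs, concept):
    """Positive instances whose equivalence class under attrs has no negative instance.
    concept maps each instance to 1, 0, or '?'; '?' instances are ignored."""
    groups = {}
    for name, values in table.items():
        groups.setdefault(tuple(values[a] for a in attrs), []).append(name)
    result = set()
    for members in groups.values():
        marks = [concept[m] for m in members]
        if 1 in marks and 0 not in marks:
            result |= {m for m in members if concept[m] == 1}
    return result

def compute_shortest_reduct(table, attrs, concept):
    """Brute-force search (exponential in the worst case) for the smallest
    attribute subset that preserves the lower approximation."""
    target = lower_positive(table, attrs, concept)
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            if lower_positive(table, subset, concept) == target:
                return list(subset)
    return list(attrs)

def extract_rule(table, classes, attrs, target):
    """Build the concept for e-mail `target` and turn its shortest reduct into a rule."""
    concept = {}
    for name in table:
        if name == target:
            concept[name] = 1
        elif classes[name] == classes[target]:
            concept[name] = "?"
        else:
            concept[name] = 0
    reduct = compute_shortest_reduct(table, attrs, concept)
    conditions = {a: table[target][a] for a in reduct}
    # Coverage: number of training instances matching the rule conditions.
    coverage = sum(1 for row in table.values()
                   if all(row[a] == v for a, v in conditions.items()))
    return {"conditions": conditions, "decision": classes[target], "coverage": coverage}

# Hypothetical toy data: 1 = spam, 0 = legitimate.
TABLE = {
    "e1": {"a1": 0, "a2": 1, "a3": 0}, "e2": {"a1": 0, "a2": 0, "a3": 0},
    "e3": {"a1": 0, "a2": 0, "a3": 1}, "e4": {"a1": 0, "a2": 1, "a3": 1},
    "e5": {"a1": 1, "a2": 1, "a3": 0}, "e6": {"a1": 1, "a2": 0, "a3": 0},
}
CLASSES = {"e1": 1, "e2": 0, "e3": 0, "e4": 1, "e5": 1, "e6": 0}
ATTRS = ["a1", "a2", "a3"]
rule_e1 = extract_rule(TABLE, CLASSES, ATTRS, "e1")
```

Attributes absent from `conditions` play the role of the '?' entries described above, that is, features that are not taken into consideration.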
Therefore, using the reference implementation of the proposed technique (refer to Additional-File1.java from the Supplementary Material available online at http://dx.doi.org/10.1155/2016/5945192 for its Java implementation), we extracted the rules from the example data source included in Table 1. The extracted rules are shown as follows. The rules generated by our proposed algorithm are simple and easy to execute. Therefore, the postprocessing stage (labeled as RS-based decision in Figure 1) will not require a great amount of computational resources. In addition, each rule generated by our algorithm includes the number of samples from the training dataset that match it (also known as coverage set cardinality). This information is very useful when a target message matches two or more conflicting rules. In this case, we use a voting scheme in which the cardinality of the coverage set acts as the vote weight. If the resulting vote is tied between the spam and legitimate categories, the latter (legitimate) is selected for the target e-mail.
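The revision step and the weighted voting scheme described above can be sketched as follows. The rule layout (`conditions`, `decision`, `coverage`), the function names, and the example rules are hypothetical. The behaviour when no rule matches is not specified in the text, so keeping the initial classifier output in that case is an assumption of this sketch.

```python
def matches(rule, message):
    """A rule matches when every listed condition attribute has the required value."""
    return all(message.get(attr) == value
               for attr, value in rule["conditions"].items())

def revise(initial_label, message, rules):
    """Labels and rule decisions: 1 = spam, 0 = legitimate (assumed encoding)."""
    if initial_label != 1:
        return initial_label  # only messages initially classified as spam are revised
    votes = {0: 0, 1: 0}
    for rule in rules:
        if matches(rule, message):
            votes[rule["decision"]] += rule["coverage"]  # coverage acts as vote weight
    if votes[0] == 0 and votes[1] == 0:
        return initial_label  # assumption: no matching rule keeps the classifier output
    # A tie is resolved in favour of the legitimate category.
    return 1 if votes[1] > votes[0] else 0

# Hypothetical rules, for illustration only.
RULES = [
    {"conditions": {"viagra": 1}, "decision": 1, "coverage": 3},
    {"conditions": {"meeting": 1}, "decision": 0, "coverage": 5},
    {"conditions": {"free": 1}, "decision": 1, "coverage": 2},
]
final = revise(1, {"viagra": 1, "meeting": 1, "free": 0}, RULES)
```

Because matching a rule is a handful of equality checks, the cost of this step stays small compared with the classification itself.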

Model Benchmarking
In order to demonstrate the suitability of applying RS theory for improving the accuracy of previously successful ML classifiers in the antispam filtering domain, we designed an experimental protocol to execute our testbed. In Section 4.1, we include a description of this protocol, introducing the reasons supporting our specific corpus selection, detailing several preprocessing issues, and defining the $k$-fold cross validation scheme as well as different measures. Complementarily, in Section 4.2, we present and discuss the obtained results.

Experimental Protocol.
With the goal of determining whether the combination of ML techniques with RS is adequate to reduce type I (FP) errors, we analyzed several publicly available datasets in order to select one able to ensure the validity of our experimental results. In this regard, the most widespread are SpamAssassin [53], LingSpam [54], PU1 [54], PU2 [54], PU3 [54], PUA [54], TREC [55][56][57], and Spambase from the UCI repository [58]. Table 2 compiles relevant information about these corpora including the percentage of legitimate and spam e-mails and the total number of available messages. First of all, the LingSpam corpus contains legitimate messages collected from a linguistic list merged with some spam messages directly compiled by its authors. It only includes 481 spam messages (16.6% of the total) and 2412 legitimate instances. Because of the small number of spam messages, most ML classifiers are affected by imbalanced learning [59] and, therefore, this dataset is not adequate for general experiments.
Secondly, the PU1, PU2, PU3, and PUA corpora are distributed into 10 separate parts to facilitate the execution of 10-fold cross validation experiments [60]. As shown in Table 2, these corpora present different percentages of spam messages (43.8%, 20%, 49%, and 50%, resp.), making them appropriate to avoid the imbalanced data problem. However, due to the format used for their original representation, the usage of stop word lists, stemming, and other techniques based on gathering information from the e-mail header is not supported. Since our approach requires the application of preprocessing techniques (e.g., usage of a stop word list), we have ruled out their use.
In the case of the Spambase corpus, it contains 4601 messages (39.4% being spam) represented as feature vectors with information about 57 attributes. Due to the reduced dimensionality (number of attributes) of this corpus, we found it unsuitable for the study.
Next, as described in Table 2, the TREC conference presents three corpora grouped according to the mailing date (2005, 2006, and 2007, resp.) with different percentages of spam and ham messages (43%, 35%, and 33.5%, resp.). These corpora were built following the standard Internet message format (described in RFC-2822 [61]), keeping unaltered the original content of the messages. The preprocessing of these corpora did not include the detection and removal of duplicates.
Finally, SpamAssassin is one of the most widely used corpora in the antispam filtering community. It includes a total number of 9332 messages, of which 25.5% are spam e-mails. This standard corpus was built by the SpamAssassin developers without altering the original content of the messages. The preprocessing of this corpus (distributed in RFC-2822 format) included the removal of duplicates and the anonymization of specific data with the goal of guaranteeing receiver privacy. The ratio between the size of the corpus (medium-sized) and the proportion of spam and ham messages makes the SpamAssassin corpus the most suitable dataset for our experiments.
In order to demonstrate the benefits of our proposal in the antispam filtering domain, we selected four well-known and widely used ML classifiers: Naïve Bayes [62], Flexible Bayes [62], AdaBoost [63], and SVM [64][65][66]. Regarding their specific implementation, we chose the standard version of these classifiers included in the Weka Data Mining Software (available at http://www.cs.waikato.ac.nz/~ml/weka/). To successfully use the Naïve and Flexible Bayes Weka implementations, the dimensionality of the input feature vectors was limited to 1000 characteristics (using the IG feature ranker). Moreover, the Naïve Bayes classifier was executed using binary features (0|1) while Flexible Bayes was evaluated with continuous attributes (frequency). Additionally, AdaBoost was configured to use Decision Stumps as base classifiers and 150 boosting iterations. For this classifier, using the IG method, we reduced the dimensionality of input vectors down to 700 binary features. Finally, a 1-degree polynomial function was selected as kernel for the SMO algorithm (Weka SVM implementation), which was executed using binary feature vectors with a size of 2000 (reduced using the IG feature ranker).
All these parameters were established taking into consideration the integral evaluation methodology proposed by Pérez-Díaz et al. [25] for accurately ranking different content-based spam filtering models. Additionally, in the work of Méndez et al. [49], IG showed the best performance for all the compared models, while in [25] the authors experimentally computed the best number of features (using the IG feature ranker) for all the available classifiers. Finally, with the goal of ensuring the validity of our results, all the experiments were conducted under a stratified 10-fold cross validation scheme [60].
To correctly assess the performance achieved by applying our RS revision method when compared to the independent execution of ML classifiers, we have chosen four groups of well-known measures: (i) percentage of correctly classified messages, false positive and false negative (FN) errors, (ii) $F$-score (also known as $F_1$ score or $F$-measure) [67,68], (iii) balanced $F$-score [68], and (iv) Total Cost Ratio (TCR) [22].
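As a reminder of how two of these measures are commonly computed, the sketch below uses the usual textbook definitions: the weighted $F_\beta$ score built from precision and recall, and TCR as the number of spam messages divided by the weighted error cost $\lambda \cdot FP + FN$. The function names and the symbol `lam` are illustrative choices, the exact variants used in [22,67,68] may differ in detail, and the balanced variant of the $F$-score is not reproduced here.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall, with spam as the positive class.
    beta < 1 gives more weight to precision, so FP errors are penalized more."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def tcr(n_spam, fp, fn, lam=1.0):
    """Total Cost Ratio: cost of using no filter divided by the cost of the filter.
    lam is the relative cost of an FP error with respect to an FN error."""
    cost = lam * fp + fn
    return float("inf") if cost == 0 else n_spam / cost

# Hypothetical counts: 100 spam messages, 80 detected, 10 legitimate messages blocked.
f1 = f_beta(tp=80, fp=10, fn=20)
f_half = f_beta(tp=80, fp=10, fn=20, beta=0.5)
ratio = tcr(n_spam=100, fp=10, fn=20, lam=9)
```

A TCR value above 1 indicates that using the filter is preferable to using no filter at the chosen cost setting.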

Obtained Results and Discussion.
By applying the experimental protocol defined in the previous section, we straightforwardly evaluate the suitability of our proposed approach to improve the performance of different widely recognized ML classifiers. In this context, Table 3 shows the percentage analysis of the different types of errors (FP and FN) as well as the hits achieved by the analyzed ML techniques, giving specific information about the performance gain obtained by the use of the proposed RS-based approach. As described in Section 3, RS rules are automatically applied to revise the output of each ML classifier when it initially classifies a given message as spam.
As initially shown in Table 3, the percentage of correct classifications (% OK) using ML techniques was improved when RS revision rules were applied, with the sole exception of the Flexible Bayes algorithm. The particular behavior of the Flexible Bayes classifier can be explained by its very high number of FN errors, which cannot be addressed by our proposal because it is only applied in those cases in which an incoming e-mail is initially classified as spam. In the light of these results, the overall combination of ML techniques with the proposed revision approach was able to reduce the number of misclassifications of legitimate e-mails. This behavior avoids the incorrect filtering of relevant messages for the end user with a minimal footprint in FN errors (ability to detect spam).
With the goal of gaining a more insightful perspective on these initial results, we also computed $F$-score and balanced $F$-score values, merging recall and precision for different $\beta$ alternatives. Table 4 presents the obtained results.
As shown in Table 4, the combination of precision and recall measures with the same weight ($\beta = 1$) yields slightly worse results when applying RS in combination with Flexible Bayes and SVM. However, this assumption is unrealistic from a real user perspective, for which the two types of classification error carry very different importance. In this line, Table 4 reveals that when increasing the penalization of type I (FP) errors (using lower values of $\beta$), the RS-based revision approach obtains favorable evaluation results.
In this context, and with the goal of providing a further analysis about the real impact of type I errors from a cost-sensitive point of view, we carried out TCR evaluations for all the analyzed models. These results are shown in Figure 2.
As clearly shown in Figure 2(a), if the cost of an FP error is considered as important as an FN misclassification ($\lambda = 1$), SVM and Flexible Bayes classifiers do not achieve additional benefits. However, a significant improvement is obtained by the application of our automatic revision procedure when working in real scenarios (situation modeled by assigning different values to $\lambda$).

Conclusions and Future Work
In this work, we have presented an RS-based postprocessing technique able to reduce type I (FP) errors made by different well-known classifiers previously applied in the antispam filtering domain. To this end, we have designed a straightforward algorithm able to extract simple and complementary revision rules exploiting the same corpus used to train the original classifiers. Our approach is only applied to those messages initially classified as spam, reducing the use of valuable computational resources in real implementations.
Results achieved by the execution of the experimental protocol have demonstrated the effectiveness of our proposal for improving the performance of different ML classifiers. Particularly, different cost-sensitive measures (such as TCR or balanced $F$-score) obtained accurate rates for our RS-based revision approach when dealing with type I errors. The main advantage of its combined execution is an increase in classification hits, which is an important issue for improving the end-user experience of the final classifier.
Moreover, the impact of our proposed method on the time required for carrying out the final classification is negligible because (i) the postprocessing is not applied on each classification (only for messages initially classified as spam) and (ii) the time and computer resources needed to evaluate the matching of rules are very low. Additionally, the knowledge acquisition and representation process represented in Figure 1 (as well as the training of the standard ML classifiers) can be executed on a different machine with the goal of saving computational resources on the hardware used to deploy the antispam filter.
The main drawback of our approach is the deterministic nature of the generated revision rules. In this regard, Pawlak and colleagues [52] have shown the limitations of RS deterministic approaches when compared to probabilistic ones that work with the information uncertainty inherent in many classification problems (such as spam). Additionally, the main advantage of probabilistic models lies in providing a unified approach for both deterministic and nondeterministic knowledge representation systems. Taking this idea into account, our main line of future research work includes searching for complementary probabilistic approaches able to generate rules that outperform the capabilities of our current algorithm. Moreover, in order to complement our current work, we are also interested in identifying novel feature selection and extraction methods. To this end, we believe that regular expressions representing more than one token could be more effective than features made up of a single one. Finally, we also find interesting the idea of carrying out the dynamic validation of rules in order to detect when they become obsolete.

Figure 1: Standard and augmented filtering process workflow executed whenever a new incoming e-mail arrives at the user mailbox.

Figure 2: TCR evaluation varying the importance assigned to type I errors for the analyzed models.

Table 1: Example of the reduced dense dataset (generated from the initial e-mail corpus) required for the computation of RS rules.

Table 3: Performance gain obtained by the use of the proposed RS-based approach when compared to the initial output of standard ML classifiers.

Table 4: $F$-score and balanced $F$-score rates for different $\beta$ values.