Machine learning techniques are a standard approach in spam detection. Their quality depends on the quality of the learning set, and when the set is out of date, classification quality falls rapidly. The most popular public web spam dataset that can be used to train a spam detector—WEBSPAM-UK2007—is over ten years old. Therefore, there is a place for a lifelong machine learning system that can replace detectors based on a static learning set. In this paper, we propose a novel web spam recognition system. The system automatically rebuilds the learning set to avoid classification based on outdated data. Using a built-in automatic selection of the active classifier, the system attains productive accuracy very quickly despite a limited learning set. Moreover, the system automatically rebuilds the learning set using external data from spam traps and popular web services. A test on real data from Quora, Reddit, and Stack Overflow proved the high recognition quality: both the average accuracy and the F-measure reached 0.98 in semiautomatic mode and 0.96 in fully automatic mode.
Despite several existing algorithms for web spam detection, which were, for example, presented in works [
In practice, web spam recognition should include a lifelong learning mechanism. The system—dedicated to protecting a specific site—should evolve during its lifecycle according to the current phase identified by knowledge of spam and nonspam data.
Let us discuss the lifecycle of a web spam recognition system presented in Figure
Lifecycle of web spam recognition system according to known spam and nonspam data.
The next phase is the development phase. In this phase, a set of regular nonspam data already exists, but the number of entries is relatively small in comparison to known spam examples. An appropriate classifier for analysing the entries is a method developed for imbalanced sets; other methods tend to classify all the entries into the dominating class.
When a sufficient number of nonspam messages are collected, the system enters the mature phase. During this third phase, the number of nonspam messages is growing. Now, the system can use the messages to create a balanced dataset from both types of messages. The classification during this phase brings the best results.
Over time, the system enters the descending phase. New spam messages differ significantly from the ones used for the initialisation of the system. As a result, the classification accuracy steadily decreases.
Most works on spam detection propose a classification algorithm that is trained on a stable learning set and optimises the accuracy obtained on that static set. The evolution of the classification process is not discussed, or it is limited to a specific issue such as the reduction of recognition accuracy over time.
In this work, we have proposed a new system for spam classification that addresses all problems connected with the described phases. The novelty of our proposition lies in an automaton that selects a proper classification methodology according to the current phase of the system lifecycle. Additionally, the proposed system offers full automation in the creation of the learning sets using an external data source. Therefore, the system can be deployed by emerging blog platforms without substantial datasets of labelled comments.
We focused on web spam as it is still one of the most challenging issues. The most common type is web spam that exploits vulnerabilities and gaps in Web 2.0 to inject links to spam content into dynamic and shareable content such as blogs, comments, reviews, or wiki pages.
We tested the system on three popular web services: Quora, Reddit, and Stack Overflow. The test proved high recognition quality. Using the datasets, we discussed practical issues of lifelong machine learning, including the spam classification in the case of an insufficient number of nonspam learning examples and the descending accuracy of the system over time.
The remainder of this work is organised as follows. First, other works on web spam detection are described in Section
Several works have analysed the web spam recognition issue. In works [
Some original approaches were proposed in the following works. Yin et al. [
The problem of the decrease in classification accuracy over time was stressed in [
Our solution depends on features that discriminate web spam from legitimate content. Several works proposed their own sets of features. Alarifi et al. [
Our rejector is based on a deterministic finite automaton. Dolzhenko et al. [
We propose a complete system for web spam rejection from blogs. The following systems aimed at email spam were created before. Bruckner et al. [
The main feature of the proposed system is the dynamic selection of a classification method according to the current phase of the system lifecycle. The lifecycle is modelled by a finite automaton that switches spam rejectors according to data flow.
A second important aspect is the collection of learning data from third-party sources to keep the system up to date even if the flow of data is too small to fulfil machine learning requirements.
Let us assume that set
Let us define set
Let us define a membership function for elements of the known classes
For that, the rejection function is used on the elements from the set
where
Formula (
If Formula (
In practice, the rejection can be implemented as a binary classification of the native and foreign elements [
We can define three cases according to the learning sets. In the first case, the learning set is
In the second case,
The third case assumes that the sizes of the learning sets are similar. This case is the most desirable. However, if the number of learning cases is too high, the classifier may not be flexible enough to learn new forms of spam and its quality may decrease in time.
A good web spam detection system should work correctly in all three cases. Therefore, we propose, for web spam rejection, a finite automaton, which switches between various types of rejections according to the parameters of the learning set.
Let us define a finite automaton
The symbols describe the following rules based on the cardinalities of the sets of nonspam
The notation
The transition function
Figure
Automaton switches between spam rejectors according to the size of the known nonspam and spam sets.
In most cases, a more extensive dataset allows us to create a better rejector and the automaton goes from
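The switching logic can be sketched as a small function over the learning-set sizes. The state names and thresholds below are illustrative assumptions; the paper defines the actual transitions over the cardinalities of the spam and nonspam sets.

```python
# Illustrative sketch of the rejector-switching automaton.
# Thresholds (min_examples, imbalance_ratio) are assumptions for the
# demonstration, not the values used in the paper.

def next_state(n_nonspam, n_spam, imbalance_ratio=5, min_examples=50):
    """Select the active rejector from the sizes of the learning sets."""
    if n_nonspam < min_examples:
        return "one_class"      # initial phase: only spam is well represented
    if n_spam > imbalance_ratio * n_nonspam:
        return "imbalanced"     # development phase: strongly imbalanced sets
    return "balanced"           # mature phase: both classes well represented

print(next_state(10, 1000))   # one_class
print(next_state(100, 1000))  # imbalanced
print(next_state(800, 1000))  # balanced
```

As more nonspam data arrives, repeated calls with growing `n_nonspam` walk the system through the phases described above.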
There is some similarity between our model and the model proposed in [
Two aspects distinguish the proposed web spam detection solution from the others. The first aspect is the dynamic classification module that chooses a proper classification method according to the current phase of the system lifecycle. The automaton that models the lifecycle was described in Section
The creation of the learning set from the labelled documents is critical for the quality of the system. Figure
Schema of learning set creation system.
The learning set
The learning set can be created from several independent sources defined as subsets of spam data
The classification subsystem identifies the incoming comments on the protected web services. Usually, the classification is not perfect and must be supervised by the operator. As a result, two sets labelled internally as spam and nonspam are called
The system uses external datasets. The first set
Sets notation.
Type | Source | Notation |
---|---|---|
Spam | All | |
Spam | Internal | |
Spam | External | |
Non-spam | All | |
Non-spam | Internal | |
Non-spam | External | |
Our system communicates with the scoring system through an existing API or by web scraping to collect comments commonly acknowledged as nonspam. The technical details of the comment collection are presented in Section
The automaton controls the classification in the system. The automaton reacts to changes in the learning set. For a small set, specialised machine learning algorithms are used to create an appropriate classifier. A proposition of implementation is given in Section
When the set is too large, the learning set is relaxed. The relaxation process removes old data from the learning set and rebuilds the classifier using data from the last period. The importance of the relaxation and comparison of relaxation strategies is presented in Section
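The relaxation step amounts to a sliding window over the learning set. A minimal sketch, assuming examples are tagged with the period in which they were collected (the window length is an illustrative parameter):

```python
# Sketch of learning-set relaxation: keep only examples from the most
# recent periods; the classifier is then rebuilt on the reduced set.

def relax(learning_set, current_period, window=3):
    """Drop examples older than `window` periods from the learning set."""
    return [(period, x, y) for (period, x, y) in learning_set
            if period > current_period - window]

# Synthetic learning set: one example per period 1..10.
data = [(t, f"x{t}", t % 2) for t in range(1, 11)]
recent = relax(data, current_period=10, window=3)
print([p for p, _, _ in recent])  # [8, 9, 10]
```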
An implementation of the automaton described in Section
The first classifier works in the initial phase when only a few examples represent one of the classes. A good candidate is a One-Class Support Vector Machine (OSVM) [
Assume that
with constraints:
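In practice, such a one-class rejector can be sketched with an off-the-shelf OSVM implementation. The use of scikit-learn and the parameter values below are assumptions for illustration; the paper only specifies the OSVM formulation itself.

```python
# One-class rejector sketch: train on the known class (spam) only and
# reject everything the model considers foreign.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
spam = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # known class
nonspam = rng.normal(loc=6.0, scale=1.0, size=(20, 2))  # unseen class

# nu bounds the fraction of training points treated as outliers;
# the value here is a tuning assumption.
rejector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(spam)

# +1 = native (spam-like), -1 = foreign (rejected as nonspam)
print(rejector.predict(nonspam[:5]))  # expected: mostly -1
```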
The second classifier works with imbalanced sets. The RUSBoost algorithm [
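The random-undersampling step at the core of RUSBoost can be illustrated in a few lines: before each boosting round, the dominating class is sampled down to the size of the minority class. This is only the sampling component, not the full boosting algorithm.

```python
# Sketch of random undersampling (the "RUS" in RUSBoost): balance the
# classes by sampling the majority class down to the minority size.
import random

def undersample(majority, minority, seed=0):
    """Return a balanced learning set (majority sampled to minority size)."""
    rnd = random.Random(seed)
    return rnd.sample(majority, len(minority)) + list(minority)

spam = [("spam", i) for i in range(1000)]   # dominating class
nonspam = [("ham", i) for i in range(50)]   # minority class
balanced = undersample(spam, nonspam)
print(len(balanced))  # 100
```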
The last classifier works using full knowledge of data from the previous periods to classify data from the current period. The classifier should provide high accuracy and be a quick learner. An excellent candidate is Random Forest [
All three described classifiers were implemented for the tests.
The data was collected from June 2013 to February 2014 and divided into ten monthly periods labelled
During each period
The examples of comments.
Spam
Nonspam
The second set, labelled as nonspam, consists of nonspam comments. This set is heterogeneous and contains nonspam comments from the three web communities: Quora, Reddit, and Stack Overflow. Examples of comments are presented in Figure
The Quora dataset consists of the best answers to the most popular questions posted on Quora.com. Quora does not provide an API for its rating system; therefore, a web scraping method was chosen. A bot started crawling from pages with the most followed topics in 2014 (
On each topic page, the bot visited the overview, top answers, and FAQ pages to extract a total of 2804 links to individual question pages. On each question page, up to 5 top answers were extracted and saved, for a total of 9520 answers.
The Reddit JSON API was used to collect nonspam comments for the Reddit dataset. The bot started with the top topics page (
A comment was accepted only if it met all of the following conditions: it was parsed correctly according to the Reddit API docs; it was ranked positively, i.e., it had at least 5 more thumbs-up than thumbs-down; it was not too short, i.e., it had at least 100 characters; and it was not reported as offensive by any user.
When the feed was collected, a total of 6521 topic pages had been visited and 529158 comments parsed. Among them, 130604 were rejected because their ups/downs balance was lower than 5, and a further 176688 comments were considered too short (less than 100 characters). Finally, 221866 comments were included in the feed data.
The Stack Exchange API was used to download all answers to top-rated questions on Stack Overflow. When the feed was collected, a list of 68410 top-rated questions had been downloaded via the API and a total of 500000 answers extracted and saved. Answers with an upvote score greater than or equal to 30 were selected, for a total of 101161 highly rated answers.
The collected comments were limited to those that overlap the monthly periods when the spam comments were collected. The number of comments, broken down by data source and period, together with the average monthly volume, is given in Table
Distribution of data among testing periods.
| | | | | | | | | | Average | |
---|---|---|---|---|---|---|---|---|---|---|---|
Spam trap | 2588 | 7673 | 7371 | 4176 | 1783 | 7746 | 17419 | 15323 | 14112 | 2065 | 3648 |
Quora | 45 | 40 | 41 | 53 | 59 | 47 | 54 | 53 | 73 | 49 | 23 |
Reddit | 105 | 362 | 51 | 555 | 319 | 40 | 24 | 867 | 626 | 343 | 150 |
Stack Overflow | 1104 | 951 | 972 | 1028 | 1057 | 1007 | 849 | 717 | 884 | 689 | 421 |
In the preprocessing, before calculating actual features, each analysed comment was transformed into three separate forms to calculate the features proposed in our previous work [
The first form was the Visible Text. The HTML document was stripped of all markup using the BeautifulSoup4 library with the lxml backend. As a result, we obtained the pure text between tags. The second form was the Nonblank Visible Text; to obtain it, we removed all space characters from the Visible Text. The third form was the Distinct Domains: the set of unique domain names, including domains defined by the Internationalized Domain Names in Applications (IDNA) standards [
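The three forms can be sketched as follows. The paper uses BeautifulSoup4 with the lxml backend; this illustration relies only on the standard library (`html.parser` and `re`), so the extraction details may differ slightly.

```python
# Sketch of the three preprocessing forms: Visible Text, Nonblank
# Visible Text, and Distinct Domains.
import re
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect only the text between tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

comment = '<p>Buy <a href="http://spam.example.com/x">cheap pills</a> now</p>'

parser = VisibleText()
parser.feed(comment)
visible_text = "".join(parser.parts)                        # form 1
nonblank_text = re.sub(r"\s+", "", visible_text)            # form 2
domains = set(re.findall(r'https?://([^/"\s]+)', comment))  # form 3

print(visible_text)   # Buy cheap pills now
print(nonblank_text)  # Buycheappillsnow
print(domains)        # {'spam.example.com'}
```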
Table
Groups of extracted features.
Group | Count |
---|---|
HTML tags features | 10 |
Metadata section features | 6 |
Domains features | 7 |
Global text statistics | 12 |
Statistics for lexical items | 22 |
Alphanumeric and non-alphanumeric character statistics | 6 |
| |
Total | 63 |
The features in the groups are based mostly on the count and length of the described objects in all created forms. Simple statistics such as the average, maximum, and standard deviation were calculated. In summary, we created 63 features.
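As an example of one feature group, the statistics for lexical items can be computed over the tokens of the Visible Text. The feature names below are illustrative assumptions, not the exact feature set of the paper.

```python
# Illustrative features in the "statistics for lexical items" style:
# simple statistics over token lengths in a comment's visible text.
import statistics

def lexical_features(text):
    lengths = [len(tok) for tok in text.split()]
    return {
        "token_count": len(lengths),
        "avg_token_len": statistics.mean(lengths),
        "max_token_len": max(lengths),
        "std_token_len": statistics.pstdev(lengths),
    }

feats = lexical_features("Buy cheap pills now")
print(feats["token_count"])    # 4
print(feats["max_token_len"])  # 5
```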
The following measures were used during the tests described in Section
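For reference, the measures reported later (accuracy, sensitivity/TPR, specificity/SPC, and F-measure) follow the standard confusion-matrix definitions:

```python
# Standard classification measures from confusion-matrix counts.

def measures(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)   # accuracy (ACC)
    tpr = tp / (tp + fn)                    # sensitivity (TPR)
    spc = tn / (tn + fp)                    # specificity (SPC)
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)  # F-measure
    return acc, tpr, spc, f1

acc, tpr, spc, f1 = measures(tp=90, tn=80, fp=20, fn=10)
print(round(acc, 2), round(tpr, 2), round(spc, 2), round(f1, 2))
```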
The comparison of web spam recognition mechanisms [
For a preliminary evaluation of the proposed method, we compared our classifier with selected works mentioned in the newest list of web spam detectors presented in [
Table
The comparison of results obtained at the WEBSPAM-UK data set. The best result from the reference works is presented.
Data set | Measure | Work | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Our | [ | [ | [ | [ | [ | [ | [ | [ | [ | [ | ||
WEBSPAM-UK | ACC | 0.80 | 0.78 | 0.82 | - | - | - | - | - | 0.88 | - | - |
TPR | 0.79 | 0.86 | - | - | - | - | - | - | - | - | - | |
SPC | 0.81 | 0.69 | - | - | - | - | - | - | - | - | - | |
F1 | 0.80 | - | 0.80 | 0.81 | 0.86 | - | - | 0.95 | 0.76 | - | 0.75 | |
| ||||||||||||
WEBSPAM-UK | ACC | 0.94 | 0.92 | - | - | - | - | 0.92 | - | - | - | - |
TPR | 1.00 | 0.96 | - | - | - | - | - | - | - | - | - | |
SPC | 0.10 | 0.29 | - | - | - | - | 0.05 | - | - | - | - | |
F1 | 0.55 | - | - | 0.33 | 0.40 | 0.41 | - | 0.44 | - | 0.69 | - |
For our tests, we created a set of 600 thousand pages from 2006 and took all data from 2007. The data was evenly divided into the learning and testing sets. Luckner et al. [
The other compared works use the same repository, but the division into learning and testing sets was different from that in this work. Therefore, the results are hard to compare. Specifically, the better results obtained in [
Shengen et al. [
Several other works obtained a better F-measure than ours on the data from 2006. However, in works [
Our results on WEBSPAM-UK2007 are highly satisfactory. Except for work [
However, the quality of the web spam detector decreases over time. Therefore, the rest of the tests were performed on separate online data collected in the consecutive monthly periods.
We considered ten monthly periods from
In the test, the learning set was always a subset of data from period When the rejector was in state When some nonspam example comments had already been labelled in period During periods
Table
The accuracy obtained by the rejection system during 10 periods.
Testing | Learning | | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | 0.98 | | | | | | | |
| | | 0.99 | 0.95 | 0.98 | 0.96 | 0.89 | 0.96 | | | 0.99 | 0.99 |
| | | 0.98 | 0.95 | | 0.96 | 0.90 | 0.96 | 0.97 | | 0.99 | 0.99 |
| ||||||||||||
| | | 0.97 | | 0.98 | | | | | | | |
| | | | 0.96 | 0.98 | 0.96 | 0.92 | 0.97 | 0.97 | | 0.98 | 0.99 |
| | | | 0.96 | | 0.97 | 0.90 | 0.96 | 0.99 | | 0.99 | 0.99 |
| ||||||||||||
| | | | 0.94 | 0.98 | 0.95 | | | | 0.98 | 0.98 | |
| | | | | 0.98 | | 0.91 | 0.95 | | 0.98 | 0.98 | 0.94 |
| | | | 0.94 | | 0.96 | 0.91 | 0.96 | 0.97 | | | |
The results for the period
For the following periods, the rejector was in state
The sensitivity obtained by the rejection system during 10 periods.
Testing | Learning | | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | 0.80 | | | | 0.97 | 0.98 | 0.91 | 0.98 | 0.97 | 0.98 |
| | | | | | | | 0.98 | 0.97 | 0.97 | | |
| | | | | | | | | | | | |
| ||||||||||||
| | | 0.77 | | | | | 0.97 | | | 0.99 | 0.99 |
| | | 0.99 | 0.98 | 0.99 | 0.97 | 0.99 | 0.95 | | 0.98 | 0.99 | 0.98 |
| | | | | | | | | | | | |
| ||||||||||||
| | | 0.99 | | | | | | | | | |
| | | | 0.97 | 0.95 | 0.92 | 0.87 | 0.73 | 0.64 | 0.63 | 0.74 | 0.79 |
| | | | | | | | | 0.98 | | 0.99 | |
The specificity obtained by the rejection system during 10 periods.
Testing | Learning | | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | 0.98 | | | | | | | |
| | | | 0.95 | 0.98 | 0.96 | 0.89 | 0.96 | 0.99 | | 0.99 | 0.99 |
| | | | 0.95 | | 0.96 | 0.89 | 0.96 | 0.97 | | 0.99 | 0.99 |
| ||||||||||||
| | | | | 0.98 | | | | | | | |
| | | | 0.95 | 0.98 | 0.96 | 0.91 | 0.97 | 0.97 | | 0.99 | 0.99 |
| | | | 0.95 | | 0.96 | 0.88 | 0.96 | 0.99 | | 0.99 | 0.99 |
| ||||||||||||
| | | | 0.93 | 0.98 | 0.94 | 0.89 | 0.96 | 0.98 | 0.98 | 0.98 | 0.99 |
| | | | | 0.98 | | | | | | | |
| | | | 0.93 | | 0.95 | 0.86 | 0.96 | 0.97 | | 0.99 | 0.99 |
The F-measure obtained by the rejection system during 10 periods.
Testing | Learning | | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | 0.90 | | 0.99 | | | | 0.95 | | | |
| | | 0.91 | 0.98 | 0.99 | 0.98 | 0.95 | 0.97 | | 0.98 | | |
| | | | 0.98 | | 0.98 | 0.95 | 0.98 | 0.97 | | | |
| ||||||||||||
| | | 0.88 | | 0.99 | | | | | 0.98 | | |
| | | 0.99 | 0.97 | 0.99 | 0.97 | 0.95 | 0.96 | 0.98 | 0.98 | | 0.98 |
| | | | 0.98 | | 0.98 | 0.94 | 0.97 | 0.99 | | | |
| ||||||||||||
| | | 0.99 | 0.97 | | | | | | | | |
| | | | | 0.97 | 0.96 | 0.90 | 0.85 | 0.82 | 0.81 | 0.87 | 0.90 |
| | | | 0.97 | | | 0.93 | | 0.97 | | | |
Table
Table
Table
Finally, Table
This section presents a more detailed discussion of the following issues. Section
During the initial period
Because the historical web spam datasets exist—spam collected before the system start-up or spam from public repositories such as WEBSPAM-UK [
We have used the set
However, to optimise the classification process, the knowledge of web spam characteristics should be supplemented by partial knowledge of the nonspam comments. This knowledge is represented by a set
Figure
Results for RusBoost algorithm with limited learning set.
Accuracy
F-measure
The accuracy and F-measure obtained on the testing set stabilise fast for the datasets
We have compared the results obtained by One-Class SVM trained on
Using an external dataset
Figure
Results obtained using internal and external nonspam sources.
Accuracy
F-measure
The best results were obtained using
The situation looks different when we analyse F-measure. Figure
The results show that the whole classification process can be fully automated. The average accuracy for all
Therefore, using spam data from the spam traps
The proposed system uses relaxation of the learning set; we compared this solution with two alternative approaches. First, the learning set can be static [
Figure
Results for static, incremental, and dynamic learning sets.
Accuracy
F-measure
When we analyse the results for the last period, the average accuracy for the static learning set is 0.70, while the other approaches reach 0.99. The F-measure behaves similarly: 0.70 for the static learning set and nearly 1.00 for the other approaches. This shows clearly that a static learning set cannot be used to create a reliable lifelong solution.
However, comparison of the results for the dynamic and incremental approaches is not so simple. Neither accuracy in Figure
If the accuracy obtained by our strategy is significantly better, the test should show that the calculated accuracy is higher than for the incremental learning-set strategy in most of the tests and lower in only a few tests, by only a small amount. We compared the two strategies in all 27 combinations of datasets and periods. For 16 pairs, the accuracy calculated for our strategy was greater; the opposite situation occurred in 8 cases. In the remaining cases, the results were the same for both strategies.
Wilcoxon's signed-rank test rejected, at the 0.1 level, the null hypothesis (p = 0.084) that the results obtained by the two strategies were not significantly different. Moreover, the one-sided test accepted (p = 0.044), at the 0.1 level, the alternative hypothesis that the differences in accuracy between our strategy and the incremental strategy come from a distribution with median greater than 0.
A similar test on the 27 combinations of datasets and periods was performed for the F-measure. For 13 pairs, the F-measure calculated for our strategy was greater; the opposite situation occurred in 8 cases. In the remaining cases, the results were the same for both strategies.
Wilcoxon's signed-rank test rejected, at the 0.1 level, the null hypothesis (p = 0.099) that the results obtained by the two strategies were not significantly different. Moreover, the one-sided test accepted (p = 0.051), at the 0.1 level, the alternative hypothesis that the differences in F1 between our strategy and the incremental strategy come from a distribution with median greater than 0.
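The paired comparison can be reproduced with SciPy's implementation of the test; the accuracy values below are synthetic, used only to show the one-sided test setup.

```python
# Sketch of the paired Wilcoxon signed-rank comparison between two
# strategies. The per-run accuracies here are synthetic illustrations.
from scipy.stats import wilcoxon

ours        = [0.99, 0.98, 0.97, 0.99, 0.96, 0.98, 0.99, 0.97, 0.98, 0.99]
incremental = [0.98, 0.97, 0.97, 0.98, 0.95, 0.97, 0.98, 0.98, 0.97, 0.98]

# One-sided alternative: the paired differences (ours - incremental)
# come from a distribution with median greater than zero.
stat, p = wilcoxon(ours, incremental, alternative="greater")
print(p < 0.1)  # significant at the 0.1 level for this synthetic data
```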
Therefore, the results obtained by our strategy are significantly better than the results obtained by the other strategies when the strategies are evaluated using the accuracy and the F-measure.
Let us discuss the approach that uses sliding time windows for training data. Such an approach is proposed in several works, e.g., [
Finally, let us discuss why the idea of the static learning set fails. It is because the mathematical definition of spam and nonspam changes over time in an unforeseen way. Therefore, the predictions become less accurate as time passes. To prove this reasoning, we tested changes of features importance in the classification process.
An estimation of predictor importance for decision trees was calculated. Feature importance is computed for a split defined by the given feature. In the regression task, importance is the difference between the Mean Squared Error (MSE) for the parent node and the total MSE for the two children. In the classification task, the Gini coefficient is used instead to estimate how the data in the node are divided among classes. The Gini coefficient equals
For a random forest, the function computes predictor importance over all weak learners. For every decision tree, the sum of changes in the MSE due to splits on every feature used in the recognition process is calculated. Next, the sum is divided by the number of branch nodes.
Importance is normalised to the range
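The impurity computation behind this importance measure can be sketched directly: the Gini impurity of a node and the impurity decrease produced by one split. The node counts below are invented for illustration.

```python
# Gini impurity of a node and the impurity decrease for a single split,
# the quantity summed per feature when estimating importance.

def gini(counts):
    """Gini impurity for class counts in a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Parent node with 40 spam / 40 nonspam, split into two children.
parent = gini([40, 40])                      # 0.5 for a balanced node
left, right = gini([30, 5]), gini([10, 35])
n_l, n_r = 35, 45
decrease = parent - (n_l * left + n_r * right) / (n_l + n_r)
print(round(parent, 3))  # 0.5
print(decrease > 0)      # a useful split lowers impurity
```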
Figure
Variation of the importance of features among various periods.
The distribution shows that an essential feature in one period can be far less important in other periods. Therefore, it is not possible to create a classifier with a static learning set that maintains a stable accuracy of web spam detection.
In this paper, we have proposed an intelligent machine learning system for web spam detection. As an engine of the system, we proposed an automaton that creates a reliable lifelong machine learning solution by switching classification mechanisms according to the current learning set (see Section
The proposed solution can protect newly created web pages. In the initial phase—when nonspam messages typical for the web page are not well known—the system uses dedicated algorithms to protect the pages with accuracy that exceeds 0.9 (see Table
The system does not rely on a static learning set. The built-in mechanism collects data on new web spam received from the spam traps as well as on nonspam comments from third-party Web 2.0 platforms (see Section
This mechanism forms a fully automated high-level protection system. We have compared the results of spam classification obtained using learning sets consisting of data from the protected system and from the external sources. In the first case, when an operator must label the nonspam examples, the accuracy reached 0.98 and F-measure 0.98. The fully automated system based on external data achieved accuracy and F-measure of 0.96 and 0.96, respectively (see Figure
In comparison to the static learning set approach, the results obtained by the system were better by nearly 0.3 both for the accuracy and F-measure (see Figure
All elements of the classification process were tested on real data from spam traps and common known web services: Quora, Reddit, and Stack Overflow. The obtained average accuracy over time was 0.99, 0.98, and 0.96, respectively (see Table
Future works will focus on increasing the automation of the system by estimating the parameters of the transition function in the automaton. The changes should pay off in optimised classification results and higher stability of the system.
The data used to support the findings of this study are available from the corresponding author upon request.
The author declares no conflicts of interest.
The data was collected using the Antyscam system, which was developed in the EU POIG.01.04.00-14-031/11 Project. The research was supported by the National Science Center, Grant no. 2012/07/B/ST6/01501, decision no. UMO–2012/07/B/ST6/01501.