The aim of machine learning is to develop algorithms that can learn from data and solve specific problems in a given context as humans do. This paper presents several machine learning models applied to intrusion detection in WiFi networks. First, we present an incremental semisupervised clustering algorithm based on a graph. Incremental clustering, or one-pass clustering, is very useful when working with data streams or dynamic data. Many incremental versions of traditional clustering algorithms such as K-means, Fuzzy C-Means, and DBSCAN have been developed. However, to the best of our knowledge, there is no incremental semisupervised clustering algorithm in the literature. Second, by combining the K-means algorithm with a local density score, we propose a fast outlier detection algorithm named FLDS. The complexity of FLDS is
Machine learning is a central problem in artificial intelligence. It is concerned with the development of algorithms and techniques that allow computers to
Building an intrusion detection system (IDS) is one of the fastest-emerging tasks in network connectivity. Each year, a large number of network attacks occur worldwide; the cost of dealing with them is very high and was reported to be about 500 billion USD in 2017. This problem is a challenge not only for governments and organizations but also for individuals in their daily lives. To protect a computer network system, methods such as firewalls, data encryption, or user authentication can be used. The firmware is one technique to protect the system, but nowadays, external mechanisms have emerged and quickly become popular. One important approach proposed in the literature for data mining in the intrusion detection problem is to use machine learning techniques [
Generally, a data mining task in IDS must detect two kinds of attacks: known attacks and outlier (anomaly) attacks. For known attacks, we can use a (semi)supervised learning method such as a neural network, support vector machine, random forest, decision tree, or naïve Bayes, to mention a few, to construct a classifier from training data (connections labeled as normal or attack) [
A general model for misuse detection in IDS.
A general model for outlier detection in IDS.
The contributions of our paper are as follows:
We propose an incremental semisupervised graph-based clustering algorithm. To the best of our knowledge, this is the first incremental semisupervised clustering algorithm. The preliminary work is presented in [
We introduce a fast outlier detection method based on a local density score and the K-means clustering algorithm. The preliminary work is introduced in [
We propose a multistage system based on machine learning techniques that can boost the accuracy of the intrusion detection process for the 802.11 WiFi data set.
Experiments carefully conducted on data sets extracted from the Aegean WiFi Intrusion Dataset (AWID) show the effectiveness of our proposed algorithms [
This paper is organized as follows. Section
Clustering is the task of partitioning a data set into
In [
Insertion cases (a) and deletion cases (b) of IncrementalDBSCAN [
In [
Outlier (anomaly) detection is one of the important problems of machine learning and data mining. As mentioned in [
For classification-based outlier detection, there are two categories: multiclass and one-class anomaly detection methods. Multiclass classification techniques assume that the training data contain labeled points of all normal classes. A supervised learner trains a model on the labeled data, and the resulting classifier distinguishes each normal class from the rest. A test point is called an outlier if it does not belong to any normal class. One-class outlier detection methods assume that there is only one normal class. The classifier learns a model of the boundary of the normal class, and a test point that does not fall within the boundary is called an outlier. Although many such techniques have been proposed, their main disadvantage is that they depend on the availability of accurate labels for the normal classes, which is hard to guarantee in real applications.
Nearest neighbor-based outlier detection methods rely on the following assumption: normal points belong to dense regions, while outliers belong to sparse regions. The most famous method of this kind is the LOF algorithm, which is based on a local density score for each point. Each point is assigned a score which is the ratio of the average local density of the
Clustering-based outlier detection techniques use clustering methods to group data into clusters. Points that do not belong to any cluster are called outliers. Some clustering methods can detect outliers, such as DBSCAN [
Statistical outlier detection methods are based on the following assumption: normal data points occur in high-probability regions of a stochastic model, while anomalies occur in its low-probability regions. Several methods of this kind have been proposed. In general, statistical methods fit a statistical model (a Gaussian distribution, a mixture of parametric statistical distributions, etc.) to the given data and then apply a statistical inference test to determine whether an unseen instance belongs to this model. The key limitation of these methods is the assumption about the distribution of the data points. This assumption often does not hold, especially when the dimension of the data is high [
In distance-based outlier detection methods, a point is considered an outlier if it does not have enough
In recent years, semisupervised clustering has been an important research topic, as illustrated by a number of studies [
Our new incremental clustering algorithm, introduced in the next section, is based on semisupervised graph-based clustering using seeds (SSGC). We choose SSGC because it has several advantages: it uses only one parameter, and it can detect clusters in regions of varying density [
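The fit-then-test idea behind statistical outlier detection can be sketched in a few lines. This is only an illustration under stated assumptions: a one-dimensional Gaussian model, a z-score cutoff of 3 as the inference test, and made-up sample values. None of these choices, nor the function name `gaussian_outliers`, come from this paper.

```python
import statistics


def gaussian_outliers(train, test, z_threshold=3.0):
    """Flag test values that lie far from a Gaussian fitted to `train`.

    Assumption: the data are 1-D and the test is a simple z-score cutoff.
    """
    mu = statistics.fmean(train)      # estimated mean of the normal data
    sigma = statistics.stdev(train)   # estimated (sample) standard deviation
    return [abs(x - mu) / sigma > z_threshold for x in test]


# Illustrative values only (not taken from any data set in this paper).
normal_data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
flags = gaussian_outliers(normal_data, [10.1, 14.0])
print(flags)
```

The sketch also makes the stated limitation visible: the test is only meaningful if the normal data really follow the assumed distribution.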
Constructing the
Constructing the connected components using the threshold
Propagating the labels to form the
Constructing the final clusters
Given a
The remaining points (graph nodes) that do not belong to any main cluster are divided into two kinds: points that have edges connecting them to one or more clusters, and isolated points. Points of the first kind are assigned to the cluster with the largest related weight. Isolated points can either be removed as outliers or be labeled.
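The assignment of the remaining points can be sketched as follows. This is a minimal illustration, not the authors' implementation: the graph is a hypothetical dictionary of edge weights, and "largest related weight" is interpreted here as the largest sum of edge weights toward a cluster, which is an assumption.

```python
def assign_remaining(edges, labels):
    """Assign unlabeled nodes to the cluster they are most strongly linked to.

    edges:  dict mapping node -> {neighbor: weight}
    labels: dict mapping node -> cluster id, for nodes already in main clusters
    Returns (assigned, isolated): the extended labeling and the isolated nodes.
    Assumption: "largest related weight" means the largest summed edge weight.
    """
    assigned = dict(labels)
    isolated = []
    for node, neighbors in edges.items():
        if node in labels:
            continue
        totals = {}
        for neighbor, weight in neighbors.items():
            if neighbor in labels:  # only edges into main clusters count
                cluster = labels[neighbor]
                totals[cluster] = totals.get(cluster, 0) + weight
        if totals:
            assigned[node] = max(totals, key=totals.get)
        else:
            isolated.append(node)   # no edge to any cluster
    return assigned, isolated


# Hypothetical toy graph: "e" touches both clusters, "f" has no edges.
edges = {
    "a": {"b": 3}, "b": {"a": 3, "e": 2},
    "c": {"d": 4, "e": 1}, "d": {"c": 4},
    "e": {"b": 2, "c": 1}, "f": {},
}
labels = {"a": 0, "b": 0, "c": 1, "d": 1}
assigned, isolated = assign_remaining(edges, labels)
print(assigned["e"], isolated)
```

Whether isolated nodes are then discarded as outliers or labeled afterwards is left to the caller, matching the two options described above.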
We note that, in SSGC, the weight
SSGC is efficient when compared with semisupervised density-based clustering in detecting clusters for batch data; however, it is not suited to data stream or data warehousing environments where many updates (insertions/deletions) occur.
In this section, we propose IncrementalSSGC, based on the SSGC algorithm. In the IncrementalSSGC, the seeds will be used to train a
Algorithm
Create the
Delete all
Delete edges in
Get label for
Update list
Examine points in
Algorithm
Delete
Update all weights
Delete all updated (at Step 2)
Now, we will analyse the complexity of IncrementalSSGC. Given a data set with
For the insertion process which aims to identify the cluster label for a new data point
For the deletion process, the complexity of Step 1 is
In summary, the analysis of the insertion and deletion processes above shows that IncrementalSSGC is very useful for data sets that need frequent updates. In the next section, we also present the running times of both SSGC and IncrementalSSGC on some data sets extracted from the intrusion detection problem.
Given a
To reduce the running time of the method, we propose a fast outlier detection method based on the local density score, called FLDS. The basic idea of FLDS is to use a divide-and-conquer strategy. Given a data set
Using K-means to split
Using the LDS algorithm on each separate cluster to obtain local outliers (using the threshold
Recalculating the LDS values of the local outliers obtained in Step 2 across the data set
The FLDS algorithm is an outlier detection method based on K-means and a graph-based local density score. The complexity of FLDS is
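The three steps above can be sketched as a divide-and-conquer pipeline. This is a rough illustration only: the LDS definition is not reproduced here, so the sketch substitutes a stand-in score (mean distance to the k nearest neighbors, where a larger value means more outlying), uses a plain Lloyd-style K-means, and reuses one threshold for both passes. The names `knn_score`, `kmeans`, and `flds_sketch` and all parameter values are hypothetical.

```python
import math
import random


def knn_score(p, points, k):
    """Stand-in for LDS (assumption): mean distance from p to its k nearest points."""
    # Identity check skips p itself; assumes the data contain no duplicate objects.
    dists = sorted(math.dist(p, q) for q in points if q is not p)
    if not dists:            # p is alone in its cluster: treat as maximally outlying
        return math.inf
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def kmeans(points, n_clusters, iters=20, seed=0):
    """Plain Lloyd iterations; returns the non-empty clusters as lists of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]


def flds_sketch(points, n_clusters, k, threshold):
    # Step 1: split the data set with K-means.
    clusters = kmeans(points, n_clusters)
    # Step 2: score every point inside its own cluster to get local candidates.
    candidates = [p for cl in clusters for p in cl if knn_score(p, cl, k) > threshold]
    # Step 3: re-score only the candidates against the whole data set.
    return [p for p in candidates if knn_score(p, points, k) > threshold]


# Toy data (illustrative): two dense 3x3 grids and one far-away point.
grid_a = [(float(x), float(y)) for x in range(3) for y in range(3)]
grid_b = [(float(x), float(y)) for x in range(10, 13) for y in range(10, 13)]
points = grid_a + grid_b + [(6.0, 30.0)]
outliers = flds_sketch(points, n_clusters=2, k=2, threshold=3.0)
print(outliers)
```

The point of the design is that the expensive neighbor search in Step 2 runs only within each cluster, and the full-data pass in Step 3 is limited to the few candidates that survive.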
This section evaluates the effectiveness of our proposed algorithms. We show the results of IncrementalSSGC, the results of FLDS, and the results of using our methods in a hybrid framework for the intrusion detection problem. IncrementalSSGC is compared with IncrementalDBSCAN, while FLDS is compared with LOF.
The data sets used in the experiments are mostly extracted from the Aegean WiFi Intrusion Dataset (AWID) [
To show the effectiveness of IncrementalSSGC, two aspects are examined: running time and accuracy. Five UCI data sets and three data sets extracted from AWID are used for testing IncrementalSSGC and IncrementalDBSCAN. The details of these data sets are presented in Table
Main characteristics for clustering evaluation.
ID  Data  #Normal + impers.  #Attributes  #Clusters 

1  Iris  150  4  3 
3  Wine  178  13  3 
2    336  8  8 
4  Breast  569  30  2 
5  Yeast  1484  8  10 
6  AWID1  5000  35  2 
7  AWID2  8000  35  2 
8  AWID3  12000  35  2 
To evaluate clustering results, the Rand Index is used. Given a data set
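As a reminder of how the Rand Index is computed, the following sketch uses its standard pair-counting definition: the fraction of point pairs on which two labelings agree (both put the pair in the same cluster, or both separate it). The function name `rand_index` and the toy labels are illustrative and not taken from the experiments.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred   # both "same" or both "different"
        total += 1
    return agree / total


# Toy example: the prediction splits the second true cluster in two.
ri = rand_index([0, 0, 1, 1], [1, 1, 0, 2])
print(ri)
```

Cluster identifiers do not need to match between the two labelings, because only pairwise co-membership is compared.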
We used five data sets extracted from AWID and four 2D data sets: DS1 (10000 points), DS2 (8000 points), DS3 (8000 points), and DS4 (8000 points) [
Main characteristics for FLDS and LOF.
ID  Data  #Objects  Categories 

1  OAWID1  3030  Impers., flooding, injections 
3  OAWID2  5030  Impers., normal, flooding 
2  OAWID3  7040  Flooding, normal, impers. 
4  OAWID4  10040  Normal, impers., injection, flooding 
5  OAWID5  15050  Normal, flooding, injection, and impers. 
To compare LOF and FLDS on the AWID data sets, we use the ROC measure, which has two factors: the False Positive (False Alarm) Rate (FPR) and the False Negative (Miss Detection) Rate (FNR). The details of these factors are shown in the following equations:
To combine the FPR and FNR values, we calculate the Half Total Error Rate (HTER), similar to the evaluation method used in [
We note that there is no incremental semisupervised clustering algorithm in the literature, so we compare the performance of our algorithm with that of the IncrementalDBSCAN algorithm. IncrementalDBSCAN can be seen as the state of the art among the incremental clustering algorithms proposed so far. It can detect clusters of different sizes and shapes in the presence of noise. Because SSGC and IncrementalSSGC produce the same results, we show only the results for IncrementalSSGC and IncrementalDBSCAN. The results are shown in Figure
Clustering results obtained by IncrementalDBSCAN and IncrementalSSGC for the 8 data sets of Table
We can see from the figure that IncrementalSSGC obtains better results than IncrementalDBSCAN. This can be explained by the fact that IncrementalDBSCAN cannot detect clusters with different densities, as mentioned in the paper
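For concreteness, the sketch below computes the three quantities using their standard definitions, which are assumed here to be the intended ones: FPR = FP / (FP + TN), FNR = FN / (FN + TP), and HTER = (FPR + FNR) / 2. The function name `error_rates` and the counts in the example are made up for illustration and are not results from this paper.

```python
def error_rates(tp, fp, tn, fn):
    """Return (FPR, FNR, HTER) from the four confusion counts.

    Assumes the standard definitions; a smaller HTER is better.
    """
    fpr = fp / (fp + tn)   # false alarms among truly normal points
    fnr = fn / (fn + tp)   # missed detections among true outliers
    hter = (fpr + fnr) / 2
    return fpr, fnr, hter


# Illustrative counts only.
fpr, fnr, hter = error_rates(tp=90, fp=20, tn=180, fn=10)
print(fpr, fnr, hter)
```

Averaging the two rates keeps the measure from being dominated by the majority class, which matters when outliers are rare.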
Figure
Running time comparison between IncrementalSSGC and IncrementalDBSCAN.
Table
The HTER measure of LOF and FLDS (the smaller, the better) for some extracted AWID data sets.
Methods  OAWID1  OAWID2  OAWID3  OAWID4  OAWID5 

FLDS  0.13  0.12  0.10  0.11  0.06 
LOF  0.23  0.11  0.11  0.09  0.09 
The parameters used for each data set.
Methods  OAWID1  OAWID2  OAWID3  OAWID4  OAWID5 

FLDS (k, nc, 
(25, 30, 6)  (25, 30, 6)  (25, 30, 6)  (25, 45, 6)  (25, 45, 6) 
LOF (MinPts, 
(27, 1.2)  (27, 1.2)  (25, 1.2)  (25, 1.2)  (27, 1.2) 
Results of LOF (a) and FLDS (b) on some 2D data sets: the outliers are marked with red plus signs.
Figures
Running time comparison between FLDS and LOF for four 2D data sets.
Running time comparison between FLDS and LOF for five AWID data sets.
In this section, we propose a multistage system based on machine learning techniques applied to the AWID data set. The details of our system are presented in Figure
A new framework for intrusion detection in 802.11 networks.
In this experiment, we use J48 for the misuse detection process and IncrementalSSGC for detecting impersonation attacks. In the outlier detection step, we propose to use FLDS or LOF; the results have been presented in the subsection above. Because the outlier detection step can be performed offline for some periods of time, we show only the results obtained by combining J48 and IncrementalSSGC. The confusion matrix of these results is illustrated in Table
Confusion matrix for AWID data set using J48 and IncrementalSSGC in proposed framework.
Normal  Flooding  Injection  Impersonation  Classification 

530588  116  6  75  Normal 
2553  5544  0  0  Flooding 
2  0  16680  0  Injection 
3297  148  0  16364  Impersonation 
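Summary figures can be derived from the confusion matrix with a short computation. The sketch below copies the counts from the table and assumes that each row holds the instances of the class named in its last column, with the diagonal entry counting the correctly handled ones. That orientation is an assumption, and the helper names are hypothetical.

```python
# Rows follow the order of the "Classification" column in the table.
labels = ["Normal", "Flooding", "Injection", "Impersonation"]
matrix = [
    [530588, 116, 6, 75],
    [2553, 5544, 0, 0],
    [2, 0, 16680, 0],
    [3297, 148, 0, 16364],
]


def per_row_rate(m):
    """Diagonal entry divided by the row total, for each row."""
    return [row[i] / sum(row) for i, row in enumerate(m)]


def overall_accuracy(m):
    """Sum of the diagonal divided by the total number of instances."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total


rates = per_row_rate(matrix)
accuracy = overall_accuracy(matrix)
for name, rate in zip(labels, rates):
    print(f"{name}: {rate:.4f}")
print(f"overall: {accuracy:.4f}")
```

Under this reading, the flooding row has the lowest rate, since a large share of its instances falls in the first column.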
We also note that in real applications, the system needs to produce a warning immediately whenever an attack appears. The multistage system based on machine learning techniques provides users with a solution for constructing a real IDS/IPS system, which is one of the most important problems in network security.
This paper introduces an incremental semisupervised graph-based clustering algorithm and a fast outlier detection method. Both methods can be used in a hybrid framework for the intrusion detection problem on WiFi data sets (AWID). Our proposed multistage system based on machine learning techniques provides a guideline for constructing a real IDS/IPS system, which is one of the most important problems in network security. Experiments conducted on data sets extracted from AWID and UCI show the effectiveness of our proposed methods. In the near future, we will continue to develop other kinds of machine learning methods for the intrusion detection problem and test them under other experimental setups.
The data used to support the findings of this study can be downloaded from the AWID repository (
The authors declare that there are no conflicts of interest regarding the publication of this paper.