Mobile Anomaly Detection Based on Improved Self-Organizing Maps

1School of Computer and Software, Jiangsu Engineering Center of Network Monitoring, Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Jiangsu Key Laboratory of Meteorological Observation and Information Processing, Nanjing University of Information Science & Technology, Nanjing 210044, China 2Department of Computer Engineering, Chonnam National University, Gwangju, Republic of Korea


Introduction
The rapid development of the Internet brings users much convenience and has penetrated all aspects of our lives. However, in the depths of the Internet, threats are everywhere. Since the emergence of the Internet, network security has always been a focus of researchers. Intrusion detection refers to detecting intrusion behaviors and providing corresponding protection according to audit logs, network traffic, and so on.
Generally, intrusion detection is divided into two categories: misuse detection and anomaly detection. Misuse detection is a rule-based approach which records intrusion patterns and compares present network behaviors with these patterns; behaviors that are similar to the stored patterns are marked as intrusions. Anomaly detection, by contrast, refers to discovering behaviors that differ from stored normal behavior patterns. Therefore, misuse detection can detect known attacks accurately but cannot handle new attacks, while anomaly detection may have a higher false alarm rate.
Traditional intrusion detection needs detection rules that are constructed by expert systems. The experts analyze intrusion behaviors and delineate detection rules from the extracted intrusion features. Detection rules constructed manually are not only time-consuming and laborious but also reduce timeliness: once new intrusion types appear, experts need to analyze the new intrusion behaviors, extract new features, and improve the intrusion detection system (IDS).
To solve the aforementioned problem, researchers consider making the IDS recognize intrusion patterns automatically. Statistical learning provides theoretical support for this idea. In this context, data mining and other intelligent data analysis technologies have been applied to intrusion detection.
Mobile devices have been changing modern life. They are used not only for communication but also for shopping and working, and mobile devices can be seen as mobile PCs [1]. Therefore, anomaly behaviors also appear on mobile platforms. In [2], the authors introduce two types of botnet architectures constructed from mobile devices. Mobile devices have their own advantages for botnets. For example, mobile devices can stay connected to the Internet and are rarely turned off, even at night; thus, they can work as bots without the owners noticing. However, some issues can expose the existence of mobile bots. The battery consumption grows more quickly than under normal usage because of the bot agents, and the volume of data traffic created by the C&C channel exceeds normal usage. Both of these can alert the owners to turn off the devices and raise suspicion [3].
The MalGenome Project [4] mainly focuses on the Android platform and aims to systematize and characterize existing Android malware. Zhou and Jiang have collected more than 1200 malware samples in 49 families that cover the majority of existing Android malware families. Furthermore, they systematically characterize these malware samples from different aspects, such as their installation methods and activation mechanisms.
Data mining is the extraction of implicit, potential, but useful information and knowledge from massive, incomplete, and fuzzy real-world data. Data mining can be divided into several types according to diverse targets, such as classification, clustering analysis, outlier detection, and regression analysis. Feature selection is a preparation step for data analysis, and proper feature selection methods can reduce time consumption and memory usage. Intrusion detection also needs appropriate methods to find more effective features, and several algorithms have been applied to feature selection, such as the clonal immune algorithm [5] and the Hoeffding tree.
Classification decides the class of data points according to a priori knowledge and belongs to supervised learning. The support vector machine is one of the most popular classification algorithms; it maps the low-dimensional sample space into a higher-dimensional feature space, which translates the original nonlinear problem into a linear one [6,7]. In the field of intrusion detection, classification is also an effective method, but it faces new challenges. The high-speed data streams [8] in real network environments put forward different requirements on classification algorithms. In [9], the authors propose an improved classification algorithm for data streams. The results show that this improved algorithm achieves higher detection accuracy and a lower false positive rate, with memory usage that does not increase with the number of data samples.
Clustering analysis aims to divide data points into different clusters according to the similarity between data points [10]. The target of clustering analysis is that data points in the same cluster have high similarity while different clusters have obvious differences. Applying clustering analysis to anomaly detection rests on two premises: (1) there are obvious differences between normal records and anomaly records; (2) the number of normal records is greater than that of anomaly records. The existing algorithms can be classified into several categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
Classical clustering algorithms, such as k-means, DBSCAN, AGNES, and SOM, have diverse applications in several fields. k-means belongs to the partitioning methods and is known as a simple and efficient clustering algorithm, but it also has significant drawbacks: the clustering results are affected by the initial clustering centers, noise data, and the number of clusters. In particular, the initial clustering centers have a serious influence on the clustering results. Efficient initial clustering centers can speed up convergence and describe the distribution of the dataset much better. Therefore, how to choose the initial clustering centers is the key to improving the k-means algorithm [11].
Like the k-means algorithm, the self-organizing map (SOM) is an unsupervised clustering algorithm. Kohonen proposed self-organizing maps in 1981 [12]. He argued that a neural network divides into different corresponding regions when accepting external input modes; each region has different response features to the input modes, and this process is completely automatic. Self-organizing maps were proposed based on this view, which is similar to the characteristics of the human brain. Self-organizing maps are also affected by the initial weight vectors, which correspond to the input modes, and the function used to compare two vectors has some influence on the clustering results. Based on these aspects, we propose improved self-organizing maps for anomaly detection: we choose better initial weight vectors and use an appropriate comparison function to measure similarity. Comparing the improved algorithm with traditional SOM, we find that improved SOM has a higher accuracy rate and obtains better clustering centers.
The paper consists of five sections. Section 2 reviews necessary definitions and related works. The improved algorithm is presented in Section 3. In Section 4, the improved algorithm is evaluated and discussed according to the experimental results. Section 5 concludes the paper and proposes plans for future research.

Kohonen Self-Organizing Maps
In this section, we review some relevant definitions and related works. The dataset can be thought of as an n × m data matrix: each data record corresponds to a row of the matrix and can be regarded as a mathematical vector consisting of m features. Most algorithms need some means of measuring the similarity between two vectors; the Euclidean distance formula and the cosine formula are often used. In this paper, these two formulas are combined into the comparison function.
Self-organizing maps have two layers [13]: the input layer and the output layer. Neurons on the input layer transmit the external information to each neuron on the output layer through weight vectors. The input layer has the same structure as that of a BP neural network, and the number of its neurons equals the sample dimension. The output layer is also the competition layer, and the arrangement of its neurons has diverse forms, such as a one-dimensional linear array, a two-dimensional array, and a three-dimensional grid array. As shown in Figure 1(a), the one-dimensional SOM is the simplest structure; neurons on the output layer connect with each other. Figure 1(b) shows the structure of a two-dimensional SOM. This organization form resembles the cerebral cortex.
The learning algorithm of SOM is called the Kohonen algorithm, which is based on the Winner-Take-All algorithm. The main differences are the way of adjusting weight vectors and the lateral inhibition. The Winner-Take-All algorithm adjusts only the winning neuron; other neurons do not change during the update process. The Kohonen algorithm, however, adjusts not only the winning neuron but also the neurons near the winner. The impact of the winning neuron on other neurons changes from excitement to inhibition according to the distance between the winning neuron and the other neuron. Therefore, the learning algorithm adjusts the winning neuron, and the other neurons around it are also adjusted according to their distance from the winner.
The self-organizing map is divided into two stages: the training stage and the testing stage. In the training stage, a sample is chosen randomly from the training dataset and input into the neural network. For a specific input pattern, one neuron on the output layer produces the maximum responsivity and becomes the winning neuron. At the beginning of the training stage, the location of the winning neuron is uncertain; when the input pattern changes, the winning neuron changes. The neurons surrounding the winning neuron also produce larger responsivity because of lateral mutual excitatory interactions. Therefore, the weight vectors of the winning neuron and the neurons nearby are adjusted towards the input vector, and the degree of adjustment is based on the distance between each neuron and the winning neuron. The self-organizing map trains the weight vectors on large amounts of data, and finally the neurons on the output layer become sensitive to their corresponding input patterns. When two input patterns are similar, the locations of the neurons that represent these patterns are close.
After the training of SOM, the specific relation between each neuron on the output layer and its input pattern is fixed, and the trained SOM can be applied as a classifier. In the testing stage, when a testing vector is input into the network, the neuron which represents the corresponding pattern generates the maximum responsivity and classifies the input vector automatically. Note that if the pattern of a new testing vector does not appear in the training dataset, SOM will mark it with the closest pattern.
Researchers have proposed diverse improvements of SOM. In [14], a multiresolution clustering strategy in self-organizing maps is applied to astronomical observations. The authors propose a hierarchical structure of neural networks which consists of different tree-structured SOM networks.
In [15], the authors integrate a naïve Bayes model with a self-organizing map for multidimensional visualization. The proposed method is evaluated on two benchmark datasets and a real-world image processing application and compared with principal component analysis, self-organizing maps, and generative topographic mapping. The experimental results prove the effectiveness of the method.

Improved SOM
In this section, we describe the improved SOM in detail. As the previous analysis shows, SOM is affected by the initial weight vectors. Traditional SOM assigns the values of the initial weight vectors randomly, which can result in unstable clustering results, just as in k-means. Thus, we improve the way of choosing initial weight vectors by finding efficient vectors in the training dataset to serve as weight vectors. The comparison function is also improved to measure the similarity between two vectors more accurately.

Selecting Initial Weight Vectors.
Zhang and Cheng have proposed an optimized method for selecting the initial clustering centers of the k-means clustering algorithm [16]. It uses an adjacent similarity degree between data points to calculate a similarity density; the data points with the maximum density are then utilized as the initial clustering centers. Considering the similarity between k-means and SOM, we apply this optimized method to selecting the initial weight vectors. The details of this method are introduced in the following part.
The comparison function for measuring similarity has diverse forms, such as the Euclidean distance formula and the cosine distance formula. We use a combination of the two distance formulas as the metric of similarity, denoted by Sim(x_i, x_j) and defined in formula (1). SimNeighbor(x_i, ε) denotes the similar neighborhood of vector x_i within the dataset D: every other vector whose value of Sim with respect to x_i is less than the threshold ε belongs to the similar neighborhood of x_i. The value of Sim(x_i, x_j) is always greater than (1 − λ): when two vectors are the same, their Euclidean distance is zero and the cosine similarity reaches its maximum value. The greater the value of Sim(x_i, x_j), the lower the degree of similarity.
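A combined metric and neighborhood search of this kind might be sketched as follows. This is a minimal Python sketch; the exact weighting form (including the 2 − cos term) and all names are our assumptions, since the printed formula is partly illegible.

```python
import numpy as np

def sim(x, y, lam=0.5):
    """Combined dissimilarity: weighted Euclidean distance plus (2 - cosine).
    Identical (nonzero) vectors give the minimum value (1 - lam)."""
    d = np.linalg.norm(x - y)
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)
    return lam * d + (1 - lam) * (2 - cos)

def sim_neighbor(i, data, eps, lam=0.5):
    """Indices of vectors whose dissimilarity to data[i] is below eps."""
    return [j for j in range(len(data))
            if j != i and sim(data[i], data[j], lam) < eps]
```

With lam = 0.5, two identical unit vectors give sim = 0.5, and the value grows as the vectors diverge in either length or direction.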
The density of vector x_i can be calculated according to SimNeighbor(x_i, ε). The calculating formula is as follows: Density(x_i) = #neighbor(x_i), where the symbol #neighbor(x_i) is the number of vectors in SimNeighbor(x_i, ε). The vector with higher density is more suitable for representing the corresponding pattern of a clustering center.
The optimal method of selecting initial weight vectors is as follows.
Input. Dataset D and the similarity threshold ε.
Output. The initial weight vectors.
Step 1. Compute the similarity degree between each pair of vectors and record it in the matrix simMat of size n × n, which consists of n rows and n columns. The value of simMat[i][j] is the similarity degree of x_i and x_j.
Step 2. Figure out the similar neighborhood of each vector in dataset D according to simMat and the similarity threshold ε. The similar neighborhood of vector x_i is stored in Neigh[x_i].
Step 3. Calculate the density of each vector according to simMat, the similarity threshold ε, and Neigh[x_i], which stores the similar neighborhoods.
Step 4. Find the vector with the maximum density and delete it from dataset D. The similar neighborhood of this vector should also be removed from dataset D.
Step 5. Take the average of the vector obtained in Step 4 and its similar neighborhood as one of the initial weight vectors.
Step 6. If the number of weight vectors is less than the output layer size, go to Step 2 and loop until enough initial weight vectors are obtained.
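The six steps above might be sketched like this. This is an illustrative Python sketch; the function and parameter names are ours, and the form of the combined metric is our assumption.

```python
import numpy as np

def select_initial_weights(data, n_weights, eps, lam=0.5):
    """Pick initial weight vectors as averages of density peaks and
    their similar neighborhoods (Steps 1-6)."""
    data = [np.asarray(x, dtype=float) for x in data]

    def sim(x, y):
        cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)
        return lam * np.linalg.norm(x - y) + (1 - lam) * (2 - cos)

    weights = []
    remaining = list(range(len(data)))
    while len(weights) < n_weights and remaining:
        # Steps 2-3: similar neighborhoods and densities on the remaining data.
        neigh = {i: [j for j in remaining
                     if j != i and sim(data[i], data[j]) < eps]
                 for i in remaining}
        # Step 4: the vector with maximum density (largest neighborhood).
        peak = max(remaining, key=lambda i: len(neigh[i]))
        group = [peak] + neigh[peak]
        # Step 5: the average of the peak and its neighborhood becomes a weight.
        weights.append(np.mean([data[i] for i in group], axis=0))
        # Step 4 (cont.): remove the peak and its neighborhood from the dataset.
        remaining = [i for i in remaining if i not in group]
    return weights
```

On two well-separated clusters, the first two selected weights land near the two cluster means, which is the behavior the method is designed to produce.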

The Design of SOM.
The input layer of SOM is similar to that of a BP neural network, but the design of the output layer is more complicated. It should consider multiple aspects, and the preset values may result in different clustering results. The design of the output layer needs to consider two aspects: one is the number of neurons and the other is the arrangement of these neurons. If the number of neurons is less than the number of input patterns, SOM cannot recognize all the patterns. But if the number of neurons is much greater, some neurons will never be adjusted because they are too far from the winning neuron. Therefore, it is better to provide more neurons in advance, in order to map the input patterns onto appropriate neurons on the output layer.
The arrangement of neurons has many choices and is determined by the practical application; it should reflect the physical meaning of the actual problem. For example, in the traveling salesman problem, a two-dimensional output layer is more intuitive. For the problem of robot arm control, a three-dimensional output layer can reflect the spatial characteristics of the arm movement. In our experiments, we utilize a two-dimensional output layer.
In a traditional SOM network, the initial weight vectors are generated randomly, which can result in worse clustering results: if the initial weight vectors scatter randomly in the sample space, they cannot reflect the distribution of the samples. Therefore, we choose the average of each density peak and the vectors in its similar neighborhood as an initial weight vector. The selected initial weight vectors can embody the spatial distribution characteristics of the samples.
The radius of the winning region should also be taken into consideration. The radius determines whether a neuron should be adjusted, and it reduces gradually to zero as the number of iterations grows. In this way, the weight vectors of adjacent neurons are similar but have small differences; when the winning neuron generates the maximum responsivity, the adjacent neurons also generate a certain responsivity. The calculating formula of the radius is

r(t) = C · (1 − t / t_max), (5)

where t_max is the maximum number of iterations, t is the index of the current iteration, and C is a constant. The learning rate is also selected by experience. η(t) is the learning rate for the t-th iteration; the value of the learning rate can be higher at the beginning of updating the weight vectors, but it reduces with the increasing number of iterations:

η(t) = η_0 · (1 − t / t_max). (6)

The degree of adjustment of each neuron is determined by the neighborhood function

h_{c,i}(t) = η(t) · exp(−d(i, c)² / (2 r(t)²)),

where i is the location of a neuron on the output layer and c is the location of the winning neuron. The distance d(i, c) between two neurons can be calculated by the Euclidean distance formula; for example, if the location of i is (0, 0) and c is (1, 1), their distance is √2. If i = c, the value of h_{c,i}(t) is η(t), the learning rate of the winning neuron; for other neurons, h_{c,i}(t) is less than η(t). The weight vectors are then updated by

w_i(t + 1) = w_i(t) + h_{c,i}(t) · (x(t) − w_i(t)). (7)

To conclude, the steps of updating weight vectors are as follows.
Input. The output layer size (m, n); the number of neurons is m × n; the error threshold diff; the training dataset and the testing dataset.
Output. The analysis result of the testing dataset.
Step 1. Normalize the training dataset and the testing dataset.
Step 2. Calculate the initial weight vectors and place these neurons on the output layer to obtain the location of each neuron.
Step 3. Calculate the learning rate, radius, and other coefficients for updating the weight vectors by formula (5) and formula (6).
Step 4. Choose a vector from the training dataset in sequence to update the weight vectors by formula (7), whereas traditional SOM selects the training vector randomly.
Step 5. Repeat Step 3 and Step 4 until the change of the weight vectors is less than the error threshold diff or the maximum number of iterations is reached.
Step 6. Input the testing dataset into the SOM network and find the winning neuron for each input vector. The input vector is marked with the pattern of its winning neuron.
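Putting the steps together, a minimal end-to-end sketch might look like this. This is illustrative Python: the linear decay schedules and the Gaussian neighborhood follow the description above, the random initialization stands in for the optimized selection of Step 2, and all names are ours.

```python
import numpy as np

def train_som(train, rows, cols, t_max=400, eta0=0.5, radius0=None):
    """Train a rows x cols SOM grid, sweeping the training set in sequence."""
    train = np.asarray(train, dtype=float)
    rng = np.random.default_rng(0)
    weights = rng.random((rows * cols, train.shape[1]))  # stand-in for Step 2
    locs = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    radius0 = radius0 or max(rows, cols) / 2.0

    for t in range(t_max):
        frac = 1.0 - t / t_max
        eta, radius = eta0 * frac, radius0 * frac      # formulas (5) and (6)
        x = train[t % len(train)]                      # sequential, not random
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        d2 = np.sum((locs - locs[winner]) ** 2, axis=1)
        h = eta * np.exp(-d2 / (2 * radius ** 2 + 1e-12))
        weights += h[:, None] * (x - weights)          # formula (7)
    return weights

def classify(weights, x):
    """Step 6: the winning neuron's index labels the input vector."""
    return int(np.argmin(np.linalg.norm(weights - np.asarray(x, float), axis=1)))
```

Trained on two well-separated clusters, points from different clusters fall to different winning neurons, while each winner's weight vector settles near its cluster.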

Experiments and Analysis
In this section, we utilize diverse datasets to evaluate the performance of the improved SOM. To apply the improved SOM to anomaly detection, we use the KDD Cup99 dataset for analysis. In the experiments, Algorithm 1 denotes traditional SOM and Algorithm 2 represents our improved SOM.
There are three types of evaluation criteria in our experiments: accuracy rate (AR), precision rate (PR), and recall rate (RR). T, P, F, and N stand for true, positive, false, and negative, respectively. TP is the number of correctly detected anomaly behaviors and TN is the number of correctly detected normal behaviors; FP is the number of normal behaviors wrongly marked as anomalies, and FN is the number of anomaly behaviors wrongly marked as normal. The calculating formulas are as follows:

AR = (TP + TN) / (TP + TN + FP + FN),
PR = TP / (TP + FP),
RR = TP / (TP + FN).

The Experiments in Iris Dataset.
Firstly, we utilize the Iris dataset to evaluate the performance of the improved SOM algorithm. The Iris dataset contains 150 records, and every record consists of 4 features, not counting the class label. We apply the two clustering algorithms to the Iris dataset and compare their clustering results. The two algorithms use the same parameters: the size of the output layer varies, the maximum number of iterations is 120, diff is set to 0, and the similarity weighting factor λ is 0.5. We utilize 80 percent of the Iris dataset as the training dataset and the remaining 20 percent as the testing dataset. Traditional SOM obtains unstable clustering results, so we repeat it several times and report the average accuracy rate for comparison.
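The three criteria can be computed directly from the four confusion-matrix counts; a small Python helper (the function name is ours):

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, precision, and recall from the confusion-matrix counts."""
    ar = (tp + tn) / (tp + tn + fp + fn)
    pr = tp / (tp + fp) if tp + fp else 0.0
    rr = tp / (tp + fn) if tp + fn else 0.0
    return ar, pr, rr
```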
As Table 1 shows, we evaluate the improved SOM with different sizes of the output layer. It is obvious that improved SOM gets better clustering results than traditional SOM at the same size. When the number of neurons is greater than five, the accuracy rate can increase to 100% with an appropriate similarity threshold ε. Then, we analyze the impact of the similarity threshold on the accuracy rate: the size of the output layer is fixed while the similarity threshold changes. The results are shown in Figure 2.
From Table 1 and Figure 2, we find that the accuracy rate changes with the value of the similarity threshold, so it is necessary to find an appropriate similarity threshold for improved SOM. With a suitable similarity threshold, the accuracy rate of improved SOM is much higher than that of traditional SOM. When the number of neurons is more than five, the accuracy rate can reach 100%, and the number of neurons matches the number of clusters used by the other clustering algorithms.

The Experiments in Universal Datasets.
As the last experiment shows, the performance of improved SOM is excellent. In order to evaluate its performance on universal datasets, we have selected several datasets from the UCI repository. In addition to traditional SOM, we also compare the improved SOM with traditional k-means and the improved k-means of [17]. The value of k is the number of clusters. The characteristics of these datasets are shown in Table 2. To describe the distribution of these datasets, we draw the data points in a coordinate system, as Figure 3 shows. The distributions of the datasets are uneven and two-dimensional, and their shapes are diverse; we utilize these datasets to evaluate the performance of the two clustering algorithms on different shapes. For each dataset, we randomly extract 80 percent of the records as the training dataset and use the rest as the testing dataset. The results are shown in Table 3.
According to the results in Table 3, the performance of improved SOM is better than that of traditional SOM. When the number of neurons increases to a certain value, the accuracy rate becomes quite high, but we think the number of neurons should be no more than √n for a dataset composed of n records, just as for k-means. The clustering result of improved k-means is better than that of improved SOM; however, improved k-means uses more clusters than improved SOM, which incurs higher time and space overhead. For the spiral dataset, whose shape is complex, improved SOM cannot get superior results when the size of the output layer is (4, 4); however, the accuracy rate increases to 100% when the size is set to (7, 9).

The Experiments in KDD Cup99 Dataset.
In order to evaluate the performance of improved SOM applied to anomaly detection, we utilize the KDD Cup99 dataset to compare the two clustering algorithms. The KDD Cup99 dataset is extracted from a real network environment and contains about 5 million records. Each record consists of 42 features, including a label indicating normal behavior or the attack type. We extract 2100 records from the KDD Cup99 dataset, composed of 2000 normal records and 100 attack records. The attack records consist of four types of attack, labeled back, teardrop, smurf, and neptune. Before the experiments, we delete the symbolic features and keep the numerical features. The clustering results are shown in Table 4.
As Table 4 shows, improved SOM obtains better performance than traditional SOM with the same size of output layer. There are too few attack records for traditional SOM to detect them, and the number of neurons determines the precision rate and recall rate of traditional SOM. However, improved SOM gets a higher precision rate and recall rate with fewer neurons. The clustering results of improved SOM are affected by the shape of the output layer and the similarity threshold. Increasing the similarity threshold can improve the recall rate, but when the size of the output layer reaches (8, 8), the precision rate starts to decrease if the similarity threshold is too large. The similarity threshold has a particular influence on the clustering results because it is used for selecting the initial weight vectors.
To analyze the impact of the similarity threshold on the KDD Cup99 dataset, we change the value of the similarity threshold ε while the size of the output layer is constant. The results are shown in Table 5.
From Table 5 and Figure 2, we find that the recall rate and precision rate fluctuate with the increase of the similarity threshold. It is necessary for improved SOM to find an appropriate value of the similarity threshold; in this way, improved SOM can get better results with fewer neurons.

Conclusion
Traditional SOM is affected by the initial weight vectors and generates unstable clustering results. To overcome these shortcomings, this paper proposes an improved SOM clustering algorithm and compares it with the traditional one. We utilize an optimized method to select more suitable initial weight vectors; in this way, SOM obtains initial weight vectors that generate more stable and accurate results.
To analyze the improved clustering algorithm, we utilize diverse datasets to evaluate its performance. As the experimental results show, improved SOM achieves higher accuracy rates on each dataset, and its performance on KDD Cup99 is also better than that of traditional SOM.
However, some aspects of our algorithm can still be improved. Firstly, the process of finding initial weight vectors is time-consuming and makes improved SOM spend more time than the traditional one. Secondly, fuzzy theory [18] can be introduced to improve its applicability in practical environments. Lastly, the size of the output layer is determined by experience and could be modified to be decided automatically.

Figure 2: The comparison of different similarity thresholds.

Figure 3: The distribution of the datasets.
Sim(x_i, x_j) = λ · d(x_i, x_j) + (1 − λ) · (2 − cos(x_i, x_j)), (1)

where d(x_i, x_j) is the Euclidean distance formula and cos(x_i, x_j) is the cosine distance formula. The coefficient λ is the weighting factor and can be changed from 0 to 1. This coefficient is adjusted by experience.

Table 1: The comparison in the Iris dataset.

Table 2: The descriptions of the datasets.

Table 3: The results on the universal datasets.

Table 4: The results on the KDD Cup99 dataset.

Table 5: The comparison of different similarity thresholds on KDD Cup99.