Online Incremental Learning for High Bandwidth Network Traffic Classification

Introduction
Network traffic classification is a critical processing task for network management. Traffic measurement and classification enable network administrators to understand the current network state and to reconfigure the network so that the observed state improves over time. The complexity and dynamic characteristics of today's network traffic necessitate traffic classification techniques that are able to adapt to new concepts. This includes the ability to classify types of traffic almost instantaneously, so that the knowledge gained from learning new concepts does not become outdated.
Data stream mining algorithms [1][2][3][4][5][6] have been introduced to overcome the shortcomings of conventional data mining algorithms. They are designed to handle concept drift, to forget old irrelevant data, and to adapt to new knowledge. References [7][8][9] have proposed the use of data stream mining algorithms for traffic classification, such as the Very Fast Decision Tree [3] and the Concept-Adaptive Very Fast Decision Tree [4]. Reference [10] proposed a new algorithm named Concept-Adaptive Rough Set based Decision Tree (CRSDT) to classify network traffic. These algorithms have successfully demonstrated the ability of data stream mining to handle dynamic and fast changing network data streams with sustained accuracy. However, the decision tree based implementations require an intensive training process and cause high memory consumption for model building [11].
References [2, 12] proposed the use of incremental clustering for data stream classification. Although both works show high classification accuracy on evolving data streams, the processing rate of such algorithms is low. One reason is the use of Euclidean distance as the distance metric in both works. Euclidean distance computation requires multiple squaring and square root operations, which contributes to high overhead and limits speed. An alternative is the Manhattan distance, which does not require heavy multiplications [13] and can be efficiently implemented on reconfigurable hardware such as Field Programmable Gate Arrays (FPGAs). Unlike in batch data mining, the conversion of the distance metric from Euclidean to Manhattan distance in incremental k-means learning cannot be applied directly; certain modifications to the incremental k-means algorithm are needed.
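To illustrate the computational difference between the two measures, the following sketch (not the authors' code) shows that Euclidean distance needs one multiplication per feature plus a square root, while Manhattan distance needs only subtractions, absolute values, and additions:

```python
def euclidean(a, b):
    # One multiplication per feature plus a final square root.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # Only subtractions, absolute values, and additions --
    # cheap to implement and pipeline in hardware.
    return sum(abs(x - y) for x, y in zip(a, b))
```

In hardware, the absolute difference reduces to a subtract-and-compare, which is why Manhattan distance units are easily replicated for parallelism.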

Related Works
The simplicity of clustering algorithms such as k-means makes them well suited for network traffic classification. References [17][18][19] proposed k-means implementations for classifying network traffic with high accuracy. Reference [20] proposed the use of a feature selection method with the k-means algorithm to enhance classification accuracy. Reference [11] proposed a new initialization method for centroid selection in k-means to further improve the classification accuracy of network traffic, whereas [21] proposed an enhancement to the k-means algorithm to prevent the diverse impact of attributes on the clustering output. The aforementioned works successfully show the suitability of clustering based algorithms for network traffic classification, although they cannot adapt to concept changes in today's network traffic.
References [22][23][24][25][26][27] proposed incremental clustering algorithms that can adapt to new knowledge over time. Reference [22] proposed the microclustering concept, where only a summary of each cluster is kept throughout the learning process. The proposed algorithm can learn new concepts incrementally by updating the cluster summaries. Reference [23] adapted the microclustering concept originally proposed in [22] and added a macroclustering stage with a pyramidal time frame. Not all microclusters are saved, which reduces overall memory consumption. Reference [24] adapted the microclustering and macroclustering concepts from [22, 23] and customized them for trajectory data. Although the method in [24] demonstrates the ability to update the classification model incrementally, it does not support streaming data.
Reference [25] proposed graph based incremental clustering, although it is not suitable for online network classification due to its long processing time and large memory consumption. Reference [26] proposed incremental DBSCAN for data warehousing. The proposed algorithm is based on density-based clustering, where the cluster radius and the minimum number of points per cluster are assumed to be fixed. Reference [27] proposed incremental clustering based real-time anomaly detection, where incremental training is initiated by a false alarm threshold that requires continuous feedback from the network administrator. Incremental clustering can continuously learn new knowledge and reduce misclassification caused by outdated knowledge. For online real-time network traffic classification, however, the processing rate of software implementations of such algorithms is not sufficient to support current network speeds. Implementing such algorithms on reconfigurable hardware such as a Field Programmable Gate Array (FPGA) can accelerate the processing rate.
To the best of our knowledge, only [28] has proposed an FPGA implementation of an incremental clustering algorithm for multimedia traffic classification. The proposed method uses Hamming distance instead of Euclidean distance for the distance measurement. It appends an extra bit to the data to indicate training or testing instances and incrementally updates the model when the training bit is detected. However, the proposed method requires a large training set (10 instances) to achieve high accuracy. On the other hand, FPGA implementations of the nonincremental k-means algorithm are more common, for example, [29][30][31][32]. However, implementing the Euclidean distance based k-means algorithm in hardware consumes many hardware resources. References [30, 31] reported that 90% of the hardware resources were required to implement the k-means algorithm. A modification of the distance was proposed in [33] using the squared distance. References [13, [34][35][36]] proposed the use of Manhattan distance as the distance measure for k-means. The implementation not only reduces the hardware cost, it is also fully configurable and easily pipelined to support a high degree of parallelism. Hence, towards a hardware-accelerated incremental online network traffic classifier, the proposed incremental k-means classifier based on Manhattan distance is a better option than one using Euclidean distance.

Online Incremental 𝑘-Means Algorithm
We proposed online incremental k-means clustering in [16] for online network traffic classification. It consists of two main processes: classification and learning. Some of the terms used in this paper are as follows: (1) Flow: the network traffic that belongs to a process-to-process communication.
(3) Flow features: the attributes or statistical features that are extracted from a flow, for example, number of bytes in payload.
(4) Flow instance: the instance made up of flow features which represent a flow.
The classification process performs online classification on flow instances, while the learning process simultaneously performs incremental learning to update the classification model. Both processes are discussed in detail in Sections 3.2 and 3.4, respectively. Figure 1 shows the overview of the proposed method. The selection module selects which flow instances to learn in order to avoid mislearning (see Section 3.3). However, manual labeling is not covered in this paper. An example of such a technique is the groundtruth method, which is discussed in [15].

Classification Model Initialization.
Before online classification can be performed, the classification model needs to be initialized. This process is performed once during start-up to prepare the base classifier model. In this stage, the supervised k-means technique is used to cluster the batch of labeled flow instances into k initial clusters. In order to increase classification efficiency, the classification model is made up of M smaller micromodels located in different regions of the Euclidean space. To perform this, the precollected flow instances are distributed into M chunks according to their distance to the origin, d_o, such that

d_o = sqrt(x_1^2 + x_2^2 + ... + x_n^2), (1)

where x_i = (x_1, ..., x_n) is a flow instance with n flow features. Each micromodel is then built using the respective chunk of flow instances.
The initialization of centroids is based on the method suggested in [12], where the initial classes of the clusters are assumed to be proportional to the data distribution. The clusters are then compressed into sufficient statistics known as clustering features (CF). A CF is a 3-tuple that summarizes the information about a cluster, as proposed in [22]. Given N d-dimensional data points x_i in a cluster j, where i = 1, 2, ..., N, the CF of the cluster is defined as the 3-tuple CF = ⟨N, LS, SS⟩, where LS is the linear sum and SS the squared sum of the data points. Raw data are discarded in order to save memory space. The k clusters, represented by k clustering features, are used for classification, and the clusters may be modified based on newly received data. Algorithm 1 shows the overall steps for classification model initialization. During model initialization, the precollected flow instances are divided into M sets, and the supervised k-means method is used to create a micromodel for each set of flow instances. Created clusters are summarized as clustering features CF with timestamp 0.
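The initialization steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the function names and data layout are assumptions:

```python
def distance_to_origin(x):
    # Euclidean distance to the origin, as in (1), used to pick a chunk.
    return sum(v * v for v in x) ** 0.5

def cf_summary(points):
    """Compress a cluster into a clustering feature (N, LS, SS) as in
    BIRCH [22]: the count, the per-dimension linear sum, and the
    per-dimension squared sum.  Raw points can then be discarded."""
    dim = len(points[0])
    n = len(points)
    ls = [sum(p[f] for p in points) for f in range(dim)]
    ss = [sum(p[f] ** 2 for p in points) for f in range(dim)]
    return n, ls, ss

def centroid(cf):
    # The centroid is recoverable from the summary alone: c = LS / N.
    n, ls, _ = cf
    return [v / n for v in ls]
```

The key property is that the CF tuple is sufficient for all later operations (centroid, merging), so memory stays bounded regardless of how many instances a cluster absorbs.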

Online Classification.
Classification starts upon receiving an incoming flow instance x_i. The distance to the origin, d_o, is calculated to find the respective micromodel. Within that micromodel, the distance between each cluster's centroid and x_i is computed using (4). The nearest cluster (C1) and second nearest cluster (C2) are then determined. Assuming that the real class label of a flow instance is unknown (unlabeled flow instance), the predicted class of x_i is the class of C1, where c_j is the centroid of cluster j and c_j = LS_j / N_j.
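A minimal sketch of this classification step follows (the micromodel data structure is assumed for illustration; the paper stores CF summaries rather than explicit dictionaries):

```python
def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def classify(x, micromodel):
    """Rank clusters by distance to the instance.  The predicted label is
    the class of the nearest cluster C1; the second nearest cluster C2 is
    kept for the confidence check in the selection stage."""
    ranked = sorted(micromodel, key=lambda c: manhattan(x, c["centroid"]))
    c1, c2 = ranked[0], ranked[1]
    return c1["label"], c1, c2
```

A usage example: with clusters at the origin (class "web") and at (10, 10) (class "p2p"), the instance (1, 1) is assigned "web".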

Selection of Learning Instance.
As a self-training method is applied in the learning algorithm, all predicted labels are assumed to be accurate. While this is not entirely true in an incremental learning process, a certain rate of false positives must be expected. Since learning on falsely predicted flow instances can cause false learning, only flow instances with high prediction confidence are chosen to be learned.
The selection criteria are designed to reuse information from the classification process so that no extra computation is needed. Extra computation would lengthen model learning and make incremental learning inefficient.
Confidence is divided into three levels {l0, l1, l2} as shown in Table 1. The criteria determine the confidence level in terms of label conflict between the two nearest clusters (conflicting neighbors) and whether a flow instance is within the nearest cluster's boundary (in-boundary). Conflicting neighbors is set to true when the two nearest clusters belong to different classes and to false when they belong to the same class.
The boundary of the nearest cluster C1 is determined by the cluster's average radius r defined by (5) [22]. However, for clusters with N = 1, (5) is not valid, as the subtraction in the denominator results in division by zero. This case was not analyzed in [37]. In [2], the calculation of the maximum boundary is based on the nearest neighbor's maximum boundary. However, the boundaries of such clusters are not always similar, and determining the nearest neighbors increases the computational complexity by O(k).
We propose that the boundary of a cluster with only one flow instance can be determined by the similarity of attributes. Each attribute of the cluster centroid is compared with the corresponding attribute of the incoming flow instance. Two attributes are considered similar if their ratio is within 10% of each other (0.9 ⩽ a1/a2 ⩽ 1.1). A boundary threshold parameter defines the maximum number of nonsimilar attributes between the flow instance and the cluster centroid that can be tolerated for the incoming flow instance to be included in the respective cluster. If the number of nonsimilar attributes is lower than the boundary threshold, the instance is considered to be within the boundary of the particular cluster. The confidence level is determined as follows: let C1 be the nearest cluster and C2 the second nearest cluster, L_Cj the class of Cj, r_Cj the radius of Cj, and c_Cj the centroid of Cj. The confidence level is set to l0 by default. In the case of L_C1 = L_C2, the confidence level increases by one (l1). If x_i is within r_C1 for a C1 with more than one flow instance, or x_i is similar to c_C1 for a C1 with only one flow instance, the confidence level increases to two (l2), provided the condition L_C1 = L_C2 is satisfied.
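The selection rules can be sketched as below. This is a paraphrase of the Table 1 criteria as described in the text, with assumed data structures; the exact table entries are not reproduced here:

```python
def confidence_level(c1, c2, in_boundary):
    """l1 when the two nearest clusters agree on the class label;
    l2 when, additionally, the instance falls within the nearest
    cluster's boundary; l0 otherwise."""
    level = 0
    if c1["label"] == c2["label"]:
        level = 1
        if in_boundary:
            level = 2
    return level

def in_singleton_boundary(x, c, boundary_threshold):
    """Boundary test for a one-instance cluster: an attribute pair is
    'similar' when its ratio is within 10% (0.9 <= a1/a2 <= 1.1).  The
    instance is in-boundary when fewer than boundary_threshold
    attributes are nonsimilar.  (Zero-valued centroid attributes are
    treated as nonsimilar here -- an assumption.)"""
    nonsimilar = sum(
        1 for a, b in zip(x, c)
        if b == 0 or not (0.9 <= a / b <= 1.1)
    )
    return nonsimilar < boundary_threshold
```

Only instances reaching level l2 proceed to incremental learning, which keeps self-training from amplifying its own mistakes.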

Semisupervised Incremental Learning.
Only flow instances x_i with confidence level l2 are used in incremental learning. As shown in Algorithm 2, flow instances with confidence level l2 are merged with the nearest cluster C1 based on (3). The learning then updates the classification model. As the input flow instances arrive in streams, changes in distribution and concept are expected. Outdated knowledge needs to be deleted, and the micromodel needs to be reconstructed based on recent clusters. This process is discussed in Section 3.4.2.
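The merge step relies on the additivity of clustering features, which the following sketch illustrates for the Euclidean variant (the Manhattan variant additionally updates a stored radius, as discussed in Section 4):

```python
def merge_instance(cf, x):
    """Merge one high-confidence instance into the nearest cluster's CF:
    N += 1, LS += x (per dimension), SS += x^2 (per dimension).
    No raw data need to be stored for this update."""
    n, ls, ss = cf
    return (
        n + 1,
        [a + b for a, b in zip(ls, x)],
        [a + b * b for a, b in zip(ss, x)],
    )
```

Because the update is a constant number of additions per feature, learning an instance costs the same regardless of cluster size.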

Injecting Labeled Instances.
In our previous work [16], we assumed that labeling of flow instances can be done immediately after the learning process. Our extended experiments suggest that this is not possible, since labeling of flow instances involves manual steps. Thus, we can only assume that some flow instances will be labeled externally and fed into the model once they are ready. These flow instances are treated as new flow instances and reclassified in order to obtain the necessary information such as confidence level and prediction result. To minimize injection effort, different handling methods based on the confidence level and prediction result are proposed for the different possible scenarios. Table 2 shows the injection handling method for each scenario. A false prediction on a trained flow instance causes the trained cluster to be deleted immediately, since the cluster is no longer reliable. In addition, a new cluster with x_i is added if x_i is not within the boundary of C1. This holds except for scenarios 2 and 3, since there is a possibility that x_i is within the boundary of C2. Figure 2(a) illustrates the boundary condition when x_i is nearer to C1 but within the boundary of C2. Merging x_i into C2 would shift the boundary of C2 towards C1 and can result in an overlap (Figure 2(b)). Thus, when C1 and C2 are of different classes, one of these clusters needs to be deleted.
If they belong to the same class, x_i is ignored, since learning it brings no significant change to the model.

Micromodel Reconstruction.
In order to prevent the storing of all outdated clusters, which may result in an imbalanced micromodel, a micromodel reconstruction process is performed after a user-predefined number of flow instances (a chunk) has been received. Clusters that are not utilized or are underutilized are deleted, as they do not contribute to the classification decision. In addition, the micromodel reduction aims to reduce the memory footprint and classification time. The reduction process includes the following steps: (1) All clusters are structured as a time series based on the time they were created. (2) All clusters are given a timestamp of zero.
(3) When a cluster is nominated as the nearest cluster in the classification process, its timestamp is incremented by one. (4) When a chunk has been received, the timestamp of each cluster is checked from the beginning of the series. Clusters with timestamp zero are deleted until the number of clusters, k, is reduced to a user-predefined number, k_d. Then, the timestamps of the remaining clusters are decremented by one. If k > k_d even after deleting all zero-timestamp clusters, the deletion process is repeated for timestamp one and so on until the required number of clusters is reached.
After the deletion of unused clusters, the remaining clusters are repartitioned into M micromodels based on their centroid locations in the Euclidean space.
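The pruning steps above can be sketched as follows; the structure and names are one reading of the text, not the authors' code (in particular, this version deletes all clusters at each timestamp threshold before moving to the next):

```python
def reconstruct(clusters, k_d):
    """Prune under-utilized clusters until at most k_d remain.  Each
    cluster's timestamp counts how often it was nominated as the nearest
    cluster; the least-used clusters are deleted first (timestamp 0,
    then 1, ...), and the survivors' timestamps are decremented."""
    threshold = 0
    while len(clusters) > k_d:
        clusters = [c for c in clusters if c["ts"] > threshold]
        threshold += 1
    for c in clusters:
        c["ts"] = max(c["ts"] - 1, 0)
    return clusters
```

For example, four clusters with timestamps [0, 0, 2, 3] pruned to k_d = 2 keep the two most-used clusters, decremented to [1, 2].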

Analysis of Euclidean and Manhattan Distance Measures
The conversion from Euclidean distance to Manhattan distance for the incremental k-means is not direct. Modifications to the clustering features and equations are needed to suit the distance measure used.

Euclidean Distance versus Manhattan Distance.
4.2. Affected Elements. By changing the distance measure, the following elements in the proposed incremental k-means algorithm need to be changed as well: the calculation of the distance to origin d_o (see (1)), the distance d between an incoming instance and a centroid (see (4)), and the radius r of a cluster (see (5)). The changes in d indirectly change the elements in the CF. Using the L1 norm, the new equation for d_o is

d_o = |x_1| + |x_2| + ... + |x_n|, (7)

while the new equation for d is

d(x_i, c_j) = sum_f |x_{i,f} - c_{j,f}|. (8)

For the case of r, the original equation is

r = sqrt((1/N) sum_{i=1..N} (x_i - c_j)^2), (9)

where N is the number of instances in cluster j, c_j is the centroid of cluster j, and d is the dimension; (9) can be translated in terms of LS and SS. After substituting the L2 norm with the L1 norm, the new equation for r becomes

r = (1/N) sum_{i=1..N} |x_i - c_j|. (10)

Note that (10) cannot be represented in terms of LS and SS; hence the calculation of the radius is no longer possible by keeping only the cluster summary CF_j = ⟨N_j, LS_j, SS_j, L_j, T_j⟩. For example, let a cluster j contain 3 instances (x_1, x_2, x_3) in one dimension (d = 1), with centroid c = (x_1 + x_2 + x_3)/3. Expanding (10),

r = (|x_1 - c| + |x_2 - c| + |x_3 - c|)/3.

When a new instance x_4 is added to cluster j, the centroid changes to

c' = (x_1 + x_2 + x_3 + x_4)/4,

and the new value of r is

r' = (|x_1 - c'| + |x_2 - c'| + |x_3 - c'| + |x_4 - c'|)/4. (14)

Each absolute term in (14) requires recalculation due to the change of centroid, but the individual instances (x_1, x_2, x_3) are not accessible (raw data are not kept). In this case, r cannot be recalculated exactly. In this paper, we suggest calculating an approximation of r by storing the previous value of r in each dimension and the direction of the centroid change. The new r can then be calculated from these values. Since a centroid is located in the middle of the instances in the cluster, and a new instance is only added when it is within the radius, we can assume that the centroid shift is as small as possible, so that an instance's deviation changes in the same direction as the centroid shift. Taking the direction of change into consideration, we assume that another instance in the cluster changes in the opposite direction, so that the changes cancel out. With this assumption, we can calculate the approximate r as in (15).
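The following sketch implements one plausible reading of the approximation in (15); the exact form of the published equation is assumed here, not reproduced. Under the cancellation assumption, the stored per-dimension radius of the old members is kept unchanged and only the new instance's deviation from the new centroid is folded in:

```python
def update_radius(n_old, r_old, x, c_new):
    """Approximate the per-dimension Manhattan radius after adding one
    instance: old deviations are assumed unchanged (centroid shifts are
    assumed to cancel out), so
        r_new = (N_old * r_old + |x - c_new|) / (N_old + 1)
    per dimension.  This avoids storing any raw data."""
    n_new = n_old + 1
    return [
        (n_old * r + abs(xf - cf)) / n_new
        for r, xf, cf in zip(r_old, x, c_new)
    ]
```

For a one-dimensional cluster with N = 2 and r = 1.0, adding x = 4.0 with new centroid 2.0 gives r ≈ (2·1.0 + 2.0)/3 = 4/3.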

Complexity Comparison.
Table 3 shows the overall changes of our proposed method using Euclidean distance and Manhattan distance. For the calculation of the distance to origin, d_o, Euclidean distance requires squaring operations and a square root, whereas Manhattan distance requires only additions (see Table 3). d_o is used in the determination of the partition, which is required once for every instance and is affected the most in the reconstruction stage, where d_o for all clusters in the classification model needs to be recalculated; hence the simpler measure reduces the complexity of reconstruction accordingly. Similar to d_o, the distance to centroid d for Euclidean distance also requires squaring and a square root, compared to Manhattan distance, which requires only additions. The reduced complexity in the calculation of d directly reduces the classification time.
The radius r is used during the selection for learning and during the injection of labeled instances. For the calculation of r, the Manhattan variant is more costly than the Euclidean one (see Table 3); the cost depends on the dimension, and as long as the dimension of the dataset is greater than 2, the complexity of r with Manhattan distance is greater. This increase in complexity for the calculation of r directly affects the time for learning.
The updates of the CF with Euclidean distance are slightly simpler than with Manhattan distance, due to the recalculation of r for each dimension involved. This is the only trade-off of using Manhattan distance over Euclidean distance. However, since the CF updates only happen during learning and model reconstruction, the added cost is confined to those stages.
Overall, although the computations of r and the CF are more complex for Manhattan distance, they are not as frequent as the calculation of the distance d, which requires k distance calculations for every incoming flow instance.

Experimental Results
This section describes the simulation setup and the results of our proposed work. The experiment is conducted to analyze and compare the performance of our proposed method using the Euclidean and Manhattan distance measures. Two datasets, Cambridge [14] and UNIBS [15], are chosen for the experiment. The Cambridge dataset was captured from the University of Cambridge network. It contains 248 attributes and 12 classes. A total of 11 online features are selected from the attributes, as listed in Table 4. The data of the minority classes, games and interactive, are not used, as they are not sufficient for training and testing (fewer than 10 flow instances each). The UNIBS dataset was captured at the University of Brescia. It was collected over three consecutive days; the traces are in pcap format and come with a groundtruth. The traces were processed to extract online features from only the first 5 packets of each observed flow, as in [38]. Using the provided groundtruth labels, we labeled the flow instances into 5 classes (Web, Mail, P2P, SKYPE, and MSN). The details of the datasets are summarized in Table 5.

Experimental Setup.
The model parameters used in our experiment are as stated below, unless specified otherwise: (1) Percentage of labeling, P = 10.
In our experiment, the first chunk of data (the first 1000 flow instances) is treated as the precollected flow instances and is used for model initialization to generate the base model. The rest of the data are randomly labeled for different percentages P. The accuracy of the proposed model is verified using the interleaved test-then-train method, where the data are first tested before being trained incrementally [39]. Each experiment was repeated 100 times, and the average performance indicators are reported in this paper.
The performance indicators used in this paper are the accuracy, cumulative accuracy, running time, classification speed, and memory requirement. Accuracy refers to the accuracy on each chunk, while cumulative accuracy is the accumulated accuracy after the classification of each chunk. Running time is defined as the time to process one chunk, including the time for model reconstruction. In our experiment, the running time does not include the data labeling time, as in [12], since data labeling is usually done offline and is beyond the scope of this paper. Classification time is measured as the time needed to classify one flow instance, excluding the feature extraction time.

5.3. Performance.
This subsection discusses the overall performance of the proposed algorithm. The accuracy over time is shown in Figure 3. As we only use the first chunk of the dataset for model initialization, the classes not seen in the first chunk are treated as new concepts. A drift detection experiment was conducted on our datasets using the Drift Detection Method [40] provided in the MOA tools [41].
The series of detected drifts is plotted in Figure 4. We found that the Cambridge dataset has more detected drifts than the UNIBS dataset. In order to visualize the drifts clearly, we show the Cambridge dataset over different chunk ranges. In the figure, we observe that when concept drifts occur (drift detected = 1), our proposed algorithm (using either Euclidean or Manhattan distance) can learn the new knowledge and maintain classification accuracy, in contrast to the model without incremental learning. To clearly show the accuracy difference between the two distance measures, the difference in accuracy when using Manhattan distance over Euclidean distance is shown in Figure 5. The results show that the proposed method using Manhattan distance provides slightly higher accuracy than with Euclidean distance. This shows that a simpler distance measure can provide better performance.
Table 6 shows the overall performance of our proposed algorithm. In this paper, we assume that the classifier is the bottleneck of the network traffic classification system since, as reported in several works such as [38], flow feature extraction in FPGA can run at very high speed. We compute the classification performance in terms of milliseconds per flow instance in software, as that is the lower bound of the system performance. The classification time of our proposed algorithm is 4.45 ms per 1,000 flow instances with Euclidean distance and 1.49 ms per 1,000 flow instances with Manhattan distance. This shows that using Manhattan distance increases the classification speed by almost 3 times. Besides, we found that the time for injecting a labeled flow instance is similar to the time for classifying a new flow instance; thus, injecting labeled flow instances does not cause delays long enough to affect the overall online classification process. The time for reconstruction is less than 1% of the total running time and is almost negligible for both methods. Our proposed system does not require large memory, as only the summary of each cluster's information, the CF, is kept. It requires on average only 140 KiB of RAM to complete the processes. Since both variants store similar amounts of data, CF_j = ⟨N_j, LS_j, SS_j, L_j, T_j⟩ for Euclidean distance and CF_j = ⟨N_j, LS_j, r_j, L_j, T_j⟩ for Manhattan distance, respectively, the overall memory consumption of both methods is similar.

Impact of Different Parameter Settings on Classification Performance.
In this subsection, we analyze the effect of changing the algorithm parameters on the overall classification performance. The experiments are done by changing one parameter while fixing the others. Figures 6 and 7 show how the labeling percentage P affects the accuracy and running time, respectively. An increase in labeled flow instances provides more class information to the classification model and thus better accuracy. Our experimental results also show that the use of Manhattan distance provides better accuracy than Euclidean distance. The percentage of labeling does not affect the classification time, but it increases the running time, more significantly at high labeling percentages. This is due to the fact that the more labeled data are injected into the classification model, the more distance measures need to be calculated; hence the running time difference becomes more significant.
The number of desired clusters, k_d, is the parameter used in the model reconstruction process. It is used to maintain the number of clusters in the classification model. Figures 8 and 9 show the impact of different numbers of desired clusters on accuracy and reconstruction time, respectively. The increases in accuracy and reconstruction time are more consistent with increasing k_d for the proposed method using Manhattan distance. The reconstruction time for our proposed method with Euclidean distance is higher because the origin distance calculation required in the reconstruction stage is more complex for Euclidean than for Manhattan distance.

Conclusion
This paper proposed and analyzed an online incremental learning method for high bandwidth network traffic classification with two different distance measures (Euclidean and Manhattan distance). The use of Manhattan distance not only provides improvement in running and classification time, it also provides slightly higher classification accuracy.

Figure 1: Overview of the proposed method.

Figure 2: Example of boundary condition (a) before merging with C2 and (b) after merging with C2.

Figure 5: Accuracy difference between Manhattan distance and Euclidean distance for (a) Cambridge and (b) UNIBS datasets.

Table 4: List of online features selected for online classification.
Merging a flow instance into cluster j causes a recalculation of r_j based on (15), an additive update of LS_j based on (16), and an increment of N_j by 1, while L_j and T_j remain unchanged.