High-Performance Internet Traffic Classification Using a Markov Model and Kullback-Leibler Divergence

As internet traffic rapidly increases, fast and accurate network classification is becoming essential for high quality of service control and early detection of network traffic abnormalities. Machine learning techniques based on statistical features of packet flows have recently become popular for network classification, partly because of the limitations of traditional port- and payload-based methods. In this paper, we propose a Markov model-based network classification with a Kullback-Leibler divergence criterion. Our study is mainly focused on hard-to-classify (or overlapping) traffic patterns of network applications, which current techniques have difficulty dealing with. The results of simulations conducted using our proposed method indicate that the overall accuracy reaches around 90% with a reasonable group size of n = 100.


Introduction
Traditional methods of network classification are either port-based or payload-based. However, both types of approaches are currently facing numerous challenges from the advanced techniques used to circumvent the firewalls of organizations and the increasing number of packet encryption techniques. Machine learning-based network classification techniques that use statistical information from network traffic data are currently emerging.
Machine learning techniques presented so far to analyze network traffic data can be grouped into three categories. Supervised learning, a classification method, uses training traffic data with known application labels. Using this method, on the basis of training sample data, we can extract (or learn) patterns of applications and apply our knowledge to unlabeled test data. Unsupervised learning, also called clustering, on the other hand, is purely based on a predefined similarity measure of network traffic data. Clustering is one way of overcoming the lack of enough traffic data for training. Finally, there is so-called semisupervised learning [1], which combines both techniques.
Most learning approaches use packet duration, number of packets, average packet size, or interarrival time as statistical features. While several Markov models and hidden Markov models are used to extract information from packet sequence and directions, we utilize the Markov model proposed by Munz et al. [2], which uses a discrete time Markov model with a finite state space and gives relatively successful accuracy and complexity. In order to handle correlated data, which is a characteristic feature of network traffic data, Zhang et al. [3] proposed a "bag of flows" concept. Since the first few packets follow similar patterns per network application or protocol, it would be better to form a certain-sized group and assign group membership in order to prevent wrong individual assignment due to a fluctuation.
In this paper, we generally follow a supervised learning technique with a Markov model whose states are defined by the direction and size of the first four packets (in TCP connections, it is well known that the first four packets are most of the time enough to capture the characteristics of the application [4]). Previous Markov modeling techniques, however, fail to differentiate among applications with overlapping, or hard-to-classify, traffic patterns, as in IMAP and SMTP. Table 1 shows that both IMAP and SMTP include the same state sequences, 0-4-1-4 and 1-4-1-4 (states 0 to 3 are for client-to-server packets, and states 4 to 7 are for server-to-client packets; within each direction, a smaller state number means a smaller packet size), and these two sequences comprise more than 50% of packets in both applications (66.98% in IMAP and 52.98% in SMTP). Now consider a network flow with traffic pattern 0-4-1-4. With traditional Markov model classifiers, this flow will be classified as an IMAP application because IMAP has the higher probability for this pattern. However, this flow also has a probability of more than 0.46 to belong to SMTP. The same argument can be applied to the 1-4-1-4 pattern, and in fact we can say that more than 52% of SMTP packets will be classified as IMAP packets under traditional Markov model classifiers.
Our solution to this issue is to build another Markov model for the testing flows and measure the similarity between the trained Markov model and the testing Markov model. In the training stage, we build and train a Markov model for each application. In the testing stage, we first collect a group of network flows, called a "bag of flows," that are believed to belong to the same application using only the port number, and then build a Markov model for each such group. After that, we use Kullback-Leibler divergence to measure the divergence between the Markov models of the training set and those of the test set. Finally, we assign a test group to the application whose Markov model has the smallest divergence from that of the test group. We verify the performance of our proposed method by means of theoretical and real data simulations.
The remainder of this paper is organized as follows. In Section 2, we review relevant related work on network classification. The theoretical background of our Markov model with Kullback-Leibler divergence and some evaluation measures for the classification used in our work are briefly explained in Section 3. Section 4 shows our proposed classification method with Kullback-Leibler divergence, while Section 5 presents the results of a theoretical simulation and real data-based analysis. Section 6 concludes and discusses further prospective developments.

Related Works
Traditional packet classification techniques are either port-based or payload-based approaches. Port-based approaches classify packets based on the ports used by the packets. However, nowadays, many applications, especially P2P applications, use unpredictable or dynamic ports, which limit the efficacy of this approach [5]. Payload-based, or Deep Packet Inspection (DPI), approaches look deep inside the packet to capture its application-specific pattern. However, this technique also suffers from challenges such as frequently changing packet formats, encryption of the packet payload, and load increments [6].
Machine learning approaches are being developed by researchers as an alternative to traditional approaches. A machine learning approach is either unsupervised or supervised. The unsupervised, or clustering, techniques start with unlabeled packets and classify them into different clusters. The simple $k$-means algorithm [4], in which each flow is represented by a point in a $P$-dimensional space, is one such technique. The first $P$ packets of each flow are observed and the packet size of the $p$th packet becomes the coordinate at dimension $p$ for this flow. The flows clustered close enough together are considered to belong to the same application. A drawback of this technique is that it requires an application to dominate in at least one of the clusters, which is not always the case, especially for closely overlapping traffic such as SMTP and IMAP.
Supervised learning starts with known packet classes. Features such as packet size, packet direction, and packet interarrival time are extracted for each class. These feature vectors are used to build and train models that represent the classes. Various modeling techniques have been proposed. Roughan et al. [7] collected statistics on the feature vectors and classified unknown traffic via Nearest Neighbors (NN), Linear Discriminant Analysis (LDA), or Quadratic Discriminant Analysis (QDA) classification techniques. Moore and Zuev [8] applied naive Bayes techniques to map unknown traffic to preclassified traffic classes. These techniques are simple and effective, but they generally consume considerable computing time or require large memory space to construct and maintain complex data structures. They also tend to show very poor performance when the feature vectors are not clustered tightly enough. Crotti et al. [9] built protocol fingerprints, a PDF vector on packet size and packet interarrival time, for each traffic class, which can express the statistical characteristics of traffic classes in a compact and efficient way. Unknown traffic is then classified via normalized threshold techniques. However, they suggest using a simple histogram of feature vectors for the protocol fingerprints, which is too simplistic to capture the subtle difference between characteristically overlapping traffic classes.
Several means of improving supervised learning techniques have been attempted. Nguyen and Armitage [10] proposed taking packet samples from various locations during packet transmission to build multiple subflows. Most classification techniques capture either the entire flow or the first few packets of the flow as samples, but they claim that capturing the entire flow is time-consuming and detecting the beginning of packet transmission is not always possible. Instead, they capture packets intermittently during packet transmission and use these multiple subflows to train their modeling system. An interesting idea of considering the entire packet flow as a linear combination of a set of multiple component flows was introduced in [11][12][13]. The authors apply wavelet analysis to extract these components [11,12] or apply ICA (Independent Component Analysis) [13] to identify the fundamental independent components and use them to classify the target flow. Their target flow was abnormal flow generated by network attacking programs, but their techniques are general enough to be applied to any network flow. Reference [14] shows another approach to classifying abnormal packet flow. Their technique initially trains the classifier with normal traffic data only, and then the classifier evolves dynamically, learning about anomalous behaviors using the Discriminative Restricted Boltzmann Machine. To overcome the limitation of a single technique, hybrid approaches have been suggested in [15,16]. Chen et al. [15] combine a hardware classifier based on network processors with a software classifier based on the Flexible Neural Tree technique, while Ye and Cho [16] combine a signature-based classifier with a statistics-based classifier. By combining different classifying techniques, we could expect faster or more accurate classifying performance.
Some other efforts have taken into consideration the time series characteristics of the transmitted packets in addition to their statistical properties. Palmieri and Fiore [17] plot the recurring pattern of the training packets on an RP (Recurrence Plot) and overlap the embedding vectors of the testing packets on top of the RP, measuring the distance of each pair of plotted points. The packets are classified to the application from which the distance is the smallest. Dainotti et al. [18] and Mu and Wu [19] constructed Hidden Markov Models (HMMs) to represent the traffic classes, while Munz et al. [2] built Markov Models (MMs) with eight states and four stages, which is much simpler than building HMMs but still effectively expresses the time series and statistical characteristics of the transmitted packets. Zhang et al. [3] proposed constructing a bag of flows and classifying it as a whole instead of classifying individual flows. Classifying a bag of flows as a whole has the advantage that the portion of the misclassified flows can be ignored if it occupies a smaller percentage of the bag, because the larger portion of the correctly classified flows determines the membership of the entire bag.
The above techniques improve the performance of supervised learning methods in various aspects. However, they achieve that at the expense of other performance criteria. Capturing multiple subflows [10] and HMM [18,19] both increase overall classification time considerably, due to the extensive packet collection in the former case and the construction of the complex model in the latter. Both the Markov model of Munz et al. [2] and the Bag-of-Flow technique of [3] show degraded performance when the target traffic classes overlap in their traffic patterns. Hybrid approaches [15,16] improve the performance but at the cost of increased classification time.
Our technique combines the Bag-of-Flow (BOF) technique in [3] with the Markov model approach in [2]. We believe that the Markov model is simple and powerful enough to grasp the characteristics of the traffic and that the BOF concept is essential for improving the classification accuracy. However, applying the BOF technique directly to packet classification does not always produce the best results, especially when the target traffic classes show similar and overlapping characteristics, as in SMTP and IMAP. In such a case, direct application of BOF actually results in inferior performance to individual assignment (the details are given in Section 5.2). We propose constructing another Markov model for the flows contained in the bag and measuring its similarity with the target Markov models. The flows in this bag are all assigned to the traffic class whose Markov model is most similar to the test Markov model.

Theoretical Background
We build separate Markov models for each known network traffic application in the training phase. Let $P^{\mathrm{TR}}_k(x)$ denote a Markov model constructed from the training data of application $k$. Empirical estimation of transition probability and initial probability distribution can be done in a way similar to that of Munz et al. [2]. In the testing phase, we can proceed either by individual connection or by grouping correlated connections. For each group of connections, we build Markov models as was done in the training phase. Let $P^{\mathrm{TE}}_g(x)$ denote a Markov model constructed from the testing data of group $g$. In the following section, we provide a dissimilarity measure between two Markov models.

Kullback-Leibler Information.
Kullback-Leibler information is a well-known dissimilarity measure used between two probability distributions [20]. Let $\{x_1, \ldots, x_n\}$ be a set of $n$ observations drawn randomly from an unknown true probability distribution function $G(x)$, and let $F(x)$ be an arbitrary probability distribution function. We assume that the goodness of the model defined by $F(x)$ is assessed in terms of the closeness as a probability distribution to the true distribution $G(x)$. Akaike [21] proposed the use of the following Kullback-Leibler information (or divergence):

$$I(G; F) = E_G\left[\log \frac{G(X)}{F(X)}\right],$$

where $E_G$ represents the expectation with respect to the probability distribution $G$. Kullback-Leibler information or relative entropy of $G$ with respect to $F$ can be represented in discrete models as

$$I(g; f) = \sum_{i} g(x_i) \log \frac{g(x_i)}{f(x_i)},$$

where $g$ and $f$ are probability mass functions of $G$ and $F$, respectively.

Evaluation of Classification Methods.
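Classification performance can be calculated by measuring "error rates" or misclassification probabilities. In binary classification, it is well known that the total probability of misclassification (TPM) is given by

$$\mathrm{TPM} = \Pr(\text{classified as class 2} \mid C_1)\,\Pr(C_1) + \Pr(\text{classified as class 1} \mid C_2)\,\Pr(C_2),$$

where $\Pr(C_i)$ represents the prior probability of class $i$.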
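In our work, we use recall and precision, which are used in [2], and the $F$-measure to evaluate per-class performance, as in [3]. One has (i) recall = the proportion of the connections truly belonging to a class that are classified as that class; (ii) precision = the proportion of the connections classified as a class that truly belong to it; (iii) $F$-measure = 2 × precision × recall/(precision + recall).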
Let $n$ be the total number of observations that belong to either one of two classes. Then, we can represent our evaluation measures in a simple 2 × 2 frequency table (see Table 2). Hence, for example, recall$_1$ is the proportion correctly classified as application 1 out of all the true application 1 connections, and precision$_2$ is the proportion correctly classified as application 2 out of all those classified as application 2 connections. The $F$-measure is the harmonic mean of recall and precision, which is intended to represent the overall accuracy.
On the basis of the empirical measures, we can also estimate the TPM as follows:
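The measures of this section can be illustrated with a short sketch. It assumes a 2 × 2 frequency table whose cell `n_ij` counts connections of true class i that were classified as class j; this cell notation, the function name `evaluate_two_class`, and the choice of the off-diagonal fraction as the empirical TPM are assumptions made for illustration, not the notation of Table 2.

```python
def evaluate_two_class(n11, n12, n21, n22):
    """Per-class recall, precision, F-measure and an empirical TPM.

    n_ij is assumed to be the number of connections whose true class
    is i and whose predicted class is j. All row and column totals
    are assumed to be nonzero.
    """
    n = n11 + n12 + n21 + n22
    # Recall: correctly classified connections over all true members of the class.
    recall = {1: n11 / (n11 + n12), 2: n22 / (n21 + n22)}
    # Precision: correctly classified connections over all predicted members.
    precision = {1: n11 / (n11 + n21), 2: n22 / (n12 + n22)}
    # F-measure: harmonic mean of precision and recall.
    f_measure = {
        k: 2 * precision[k] * recall[k] / (precision[k] + recall[k])
        for k in (1, 2)
    }
    # Empirical TPM (assumed form): fraction of misclassified connections.
    tpm = (n12 + n21) / n
    return recall, precision, f_measure, tpm


recall, precision, f_measure, tpm = evaluate_two_class(60, 40, 50, 50)
print(recall, precision, f_measure, tpm)
```

With these hypothetical counts, class 1 has a recall of 0.6 and class 2 a recall of 0.5, and 90 of the 200 connections are misclassified.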

Classification Using Kullback-Leibler Information
In this paper, we focus first on two applications, SMTP (port 25) and IMAP (port 143), whose traffic patterns are difficult to distinguish using existing classification algorithms, and later extend our technique to other applications in Sections 5.3 and 5.4. Bernaille et al. [4] showed that the first four packets of a TCP connection are sufficient to classify known applications. We follow the same four payload length intervals as in [2], namely [0, 99], [100, 299], [300, MSS-1], and [MSS]. These intervals have been chosen because they emphasize well the difference in traffic feature vectors among the various applications [2].
The value of the Maximum Segment Size (MSS) is often exchanged in a TCP connection. Since the direction is either client-to-server or server-to-client, each stage can have 4 × 2 = 8 different states. Thus, our model becomes a four-stage left-right Markov model with state space $S = \{0, 1, \ldots, 7\}$. States 0-3 represent payload length intervals from client to server, while states 4-7 represent those from server to client. For example, state sequence 0-4-1-4 means the following: the client sends a 0-99 byte packet first (after the handshake), the server responds with a 0-99 byte packet, the client then sends a slightly larger 100-299 byte packet, and finally the server responds with a 0-99 byte packet. By investigating the state sequences of SMTP and IMAP in training data, we find that 0-4-1-4 is the dominant pattern in both applications. Other common patterns, such as 1-4-1-4 and 0-4-0-4, also exist in both applications. In this section, we explain our classification model, which has only two common patterns: 0-4-1-4 and 1-4-1-4. We then extend our model in the simulation section to incorporate an extra unique pattern per application.
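The state encoding described above can be sketched as follows. This is a minimal illustration, assuming the payload length intervals [0, 99], [100, 299], [300, MSS-1], and [MSS]; the MSS value of 1460 bytes and the names `payload_class`, `packet_state`, and `state_sequence` are hypothetical and chosen only for this example.

```python
MSS = 1460  # assumed value for illustration; normally learned from the TCP handshake


def payload_class(length, mss=MSS):
    """Map a payload length to one of the four size intervals (0..3)."""
    if length >= mss:   # [MSS]
        return 3
    if length >= 300:   # [300, MSS-1]
        return 2
    if length >= 100:   # [100, 299]
        return 1
    return 0            # [0, 99]


def packet_state(from_client, length, mss=MSS):
    """States 0-3 are client-to-server, states 4-7 are server-to-client."""
    return payload_class(length, mss) + (0 if from_client else 4)


def state_sequence(packets, mss=MSS):
    """Encode the first four data packets of a connection.

    `packets` is a list of (from_client, payload_length) pairs that
    follow the three-way handshake.
    """
    return tuple(packet_state(c, n, mss) for c, n in list(packets)[:4])


flow = [(True, 20), (False, 80), (True, 150), (False, 30), (True, 900)]
print(state_sequence(flow))  # (0, 4, 1, 4)
```

Only the first four packets contribute to the state sequence, so the fifth packet of the example flow is ignored.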
Since the decision rule is given, we can determine the evaluation measure recall as follows:

$$\mathrm{recall}_k = \sum_{i} P_k(x_i)\, I_{A_k}(x_i),$$

where $P_k(x_i)$ is the proportion of $x_i$ in model $M_k$ and $I_A$ is the set $A$ indicator function, with $A_k$ denoting the set of patterns that the decision rule assigns to App $k$. The above formula for computing recall can be used for more than two patterns. In order to determine precision, we need to estimate the ratio $r$ of App 1 to App 2 and apply Bayes rule. We may assume $r = 1$ under no information regarding the abundance of both applications. A summary of evaluation measures is given in Table 3.
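The individual-assignment measures can be made concrete with a small sketch. It assumes the decision rule assigns each pattern to the application in which that pattern has the larger proportion and leaves ties unassigned; the function name `individual_measures`, the ratio parameter `r`, and the tie handling are assumptions for illustration.

```python
def _ratio(num, den):
    """Safe division that returns 0.0 when nothing was assigned."""
    return num / den if den > 0 else 0.0


def individual_measures(model1, model2, r=1.0):
    """Recall and precision for pattern-by-pattern assignment.

    model1 and model2 map each pattern to its proportion in App 1 and
    App 2. r is the assumed ratio of App 1 to App 2 traffic.
    """
    patterns = set(model1) | set(model2)
    # Assumed decision rule: a pattern goes to the application where it is more likely.
    a1 = {x for x in patterns if model1.get(x, 0.0) > model2.get(x, 0.0)}
    a2 = {x for x in patterns if model2.get(x, 0.0) > model1.get(x, 0.0)}
    recall1 = sum(model1.get(x, 0.0) for x in a1)
    recall2 = sum(model2.get(x, 0.0) for x in a2)
    # Probability mass of the other application that lands in each decision set.
    app2_in_a1 = sum(model2.get(x, 0.0) for x in a1)
    app1_in_a2 = sum(model1.get(x, 0.0) for x in a2)
    # Bayes rule with prior odds r : 1 for App 1 versus App 2.
    precision1 = _ratio(r * recall1, r * recall1 + app2_in_a1)
    precision2 = _ratio(recall2, recall2 + r * app1_in_a2)
    return {"recall": (recall1, recall2), "precision": (precision1, precision2)}


app1 = {(0, 4, 1, 4): 0.6, (1, 4, 1, 4): 0.4}
app2 = {(0, 4, 1, 4): 0.5, (1, 4, 1, 4): 0.5}
print(individual_measures(app1, app2))
```

For these two-pattern models the recall of App 2 is 0.5, because the half of its traffic that follows 0-4-1-4 is assigned to App 1.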

Group Assignment.
We can form a correlated group of connections in several ways. Zhang et al. [3] used the concept of "flow," which consists of successive IP packets having the same five-tuple: [src ip, src port, dst ip, dst port, protocol]. They formed a BOF and assigned the BOF instead of individual connections.
In this paper, we use a simple concept of bag based on port number only. Constructing a group based on the same five-tuple is very time-consuming and may not be a good strategy in real-time classification problems. Our port-only group assignment, by contrast, is fast and convenient, even though its groups are slightly less correlated than five-tuple-based BOFs.
Let $X_k$ denote the number of $x_1$ in a group with size $n$ of model $M_k$. Then, $X_k$ is a random variable whose distribution follows a binomial with parameter $(n, p_k)$, that is, $X_k \sim \mathrm{Bin}(n, p_k)$, and $P(X_1 > n/2)$ represents the probability of assigning a group to $M_1$. Thus, recall and precision can be computed similarly to the individual assignment case, except for the tie probabilities $P(X_1 = n/2)$ and $P(X_2 = n/2)$. We call $P(X_k = n/2)$ undecided$_k$, which means that we cannot assign such a group under model $M_k$ (see Table 4).
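The binomial computation for Majority assignment can be sketched as below. The sketch assumes a two-pattern model in which the first pattern is individually assigned to App 1 and the second to App 2, so that a bag goes to App 1 when more than half of its flows show the first pattern; the function names are hypothetical.

```python
from math import comb


def binom_pmf(n, k, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def majority_outcomes(n, p):
    """Probabilities that a bag of size n is assigned to App 1, to App 2, or left undecided.

    p is the proportion of the first pattern in the model that
    generated the bag.
    """
    to_app1 = sum(binom_pmf(n, k, p) for k in range(n + 1) if 2 * k > n)
    # A tie is only possible for an even bag size.
    undecided = binom_pmf(n, n // 2, p) if n % 2 == 0 else 0.0
    to_app2 = 1.0 - to_app1 - undecided
    return to_app1, to_app2, undecided


# Bags drawn from an App 2 model with equal pattern proportions.
to_app1, to_app2, undecided = majority_outcomes(100, 0.5)
print(round(to_app2, 4), round(undecided, 4))
```

Under an App 2 model with equal proportions of the two patterns, a bag of 100 flows is assigned to App 2 with probability of about 0.46 and is undecided with probability of about 0.08.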
We propose three group assignment methods: Majority, Kullback-Leibler, and 4096d. Majority assignment is based on the voting of each individual assignment, Kullback-Leibler assignment is our main proposed method, and 4096d assignment can be another reasonable candidate method based on Euclidean distance in 4096 dimensions. We explain the Kullback-Leibler and 4096d methods further in the next subsection.
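Kullback-Leibler. For a group of testing connections, we measure the divergence of its Markov model from each trained model $M_k$ by

$$I(P^{\mathrm{TE}}, M_k) = \sum_{i=1}^{4096} P^{\mathrm{TE}}(x_i) \log \frac{P^{\mathrm{TE}}(x_i)}{P^{\mathrm{TR}}_k(x_i)},$$

where $P^{\mathrm{TE}}(x_i)$ and $P^{\mathrm{TR}}_k(x_i)$ are the likelihoods of $x_i$ under the model in test and training data, respectively. Since our Markov model consists of four stages and eight states, we have $8^4 = 4096$ possible observations in test data ($i = 1, \ldots, 4096$). To avoid division by zero, we use $P^{\mathrm{TR}}_k(x_i) = 10^{-5}$ for such $i$ with $P^{\mathrm{TR}}_k(x_i) = 0$ and $P^{\mathrm{TE}}(x_i) \neq 0$. If $I(P^{\mathrm{TE}}, M_1) < I(P^{\mathrm{TE}}, M_2)$, then we assign a group of testing connections to $M_1$. The evaluation measures recall and precision can be calculated in a similar fashion to the Majority case.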
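A minimal sketch of Kullback-Leibler group assignment is given below. As a simplification, it uses the empirical pattern frequencies of the bag in place of the likelihoods under a fitted test Markov model, and it stores only nonzero probabilities; the function names and the dictionary representation of the trained models are assumptions for illustration.

```python
from collections import Counter
from math import log

FLOOR = 1e-5  # replacement for zero training probabilities, as in the text


def bag_distribution(bag):
    """Empirical distribution of state sequences in a bag (a stand-in for the test model)."""
    counts = Counter(bag)
    n = len(bag)
    return {x: c / n for x, c in counts.items()}


def kl_divergence(p_test, p_train, floor=FLOOR):
    """Sum of p_test(x) * log(p_test(x) / p_train(x)) over patterns seen in the test bag."""
    total = 0.0
    for x, p in p_test.items():
        q = p_train.get(x, 0.0)
        if q == 0.0:
            q = floor  # avoid division by zero
        total += p * log(p / q)
    return total


def classify_bag(bag, trained_models):
    """Assign the whole bag to the trained model with the smallest divergence."""
    p_test = bag_distribution(bag)
    return min(trained_models, key=lambda name: kl_divergence(p_test, trained_models[name]))


models = {
    "App 1": {(0, 4, 1, 4): 0.6, (1, 4, 1, 4): 0.4},
    "App 2": {(0, 4, 1, 4): 0.5, (1, 4, 1, 4): 0.4, (0, 4, 0, 4): 0.1},
}
bag = [(0, 4, 1, 4)] * 5 + [(1, 4, 1, 4)] * 4 + [(0, 4, 0, 4)]
print(classify_bag(bag, models))  # App 2
```

Patterns with zero probability in the test bag contribute nothing to the sum, so iterating over the observed patterns only is enough. The single flow with the pattern unique to App 2 incurs a large penalty under the App 1 model, which is why the bag is assigned to App 2.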
4096d. We can map each BOF of size $n$ to a point in 4096-dimensional space, because we have $8^4 = 4096$ possible observations. For example, if a size-10 test BOF consists of 4 $x_1$, 2 $x_2$, 1 $x_4$, and 3 $x_5$, it can be represented as a point in 4096-dimensional space like (4, 2, 0, 1, 3, 0, . . ., 0). After suitable standardization, we use the Euclidean distance to determine the membership of a test BOF.
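The 4096d mapping can be sketched as follows. The ordering of the 4096 patterns (state sequences read as base-8 numbers) and the standardization (division by the bag size) are assumptions, since neither is specified above; the function names are hypothetical.

```python
from math import sqrt


def pattern_index(seq):
    """Map a four-state sequence to an index in 0..4095, reading the states as base-8 digits."""
    idx = 0
    for s in seq:
        idx = idx * 8 + s
    return idx


def bag_vector(bag):
    """Represent a bag as a point in 4096-dimensional space of relative frequencies."""
    vec = [0.0] * 4096
    for seq in bag:
        vec[pattern_index(seq)] += 1.0
    n = len(bag)
    return [v / n for v in vec]  # assumed standardization


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify_4096d(bag, trained_vectors):
    """Assign the bag to the application whose reference point is nearest."""
    point = bag_vector(bag)
    return min(trained_vectors, key=lambda name: euclidean(point, trained_vectors[name]))


refs = {
    "App 1": bag_vector([(0, 4, 1, 4)] * 6 + [(1, 4, 1, 4)] * 4),
    "App 2": bag_vector([(0, 4, 1, 4)] * 5 + [(1, 4, 1, 4)] * 5),
}
print(classify_4096d([(0, 4, 1, 4)] * 5 + [(1, 4, 1, 4)] * 5, refs))  # App 2
```

Every coordinate contributes to the distance with the same weight, regardless of how informative the corresponding pattern is.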
Theoretical considerations for handling more than two patterns can be easily made. Trinomial or multinomial distributions are needed instead of binomial distributions to compute evaluation measures. We include some of those simulation results in the next section.

Simulation Experiments
In this section, we evaluate the performance of our proposed method under various scenarios. First, several hypothetical model-based simulations are presented, followed by real network traffic-based results.

Model-Based Simulation.
Suppose the proportions of observations for each application are given in Table 5.
In the following 4 hypothetical model-based simulations we choose values of $p_1$, $p_2$, $q_1$, and $q_2$ representing our real traffic sequences and the more challenging scenarios, that is, $p_2 \approx q_2$. These parameters stand for the proportion of packet patterns for each application; that is, $p_1$ denotes the proportion of pattern 0-4-1-4 and $q_1$ denotes that of pattern 1-4-1-4 in application 1, while $p_2$ and $q_2$ stand for the proportions of the same patterns in application 2. For simplicity of notation, we use capital letter abbreviations to represent evaluation measures and subscript numbers to denote the application as usual: (i) $R_k$ = recall$_k$, $P_k$ = precision$_k$, $F_k$ = $F$-measure$_k$, and $U_k$ = undecided$_k$; (ii) Maj = Majority, K-L = Kullback-Leibler.
Case 1 ($p_1 = 0.6$, $q_1 = 0.4$, $p_2 = 0.5$, and $q_2 = 0.5$ (see Table 6)). Case 1 represents a situation in which there are only two patterns. As expected, group assignment gives a better performance than individual assignment, and both K-L and 4096d are better than Maj. An interesting observation in this case is that the low values for $R_2$ in individual (50%) and Maj (46%) improve up to 86% in K-L as the bag size becomes $n = 100$. Under $M_2$, which has 50% $x_1$ and 50% $x_2$, we wrongly assign up to half of the true App 2 to App 1 in individual assignment and in Maj group assignment. Nevertheless, even in that case, K-L performs well as group size increases. The next three case studies deal with more than two patterns. Case 2 deals with a situation that is similar to Case 1 but with an extra unique pattern in application 2.
Case 2 ($p_1 = 0.6$, $q_1 = 0.4$, $p_2 = 0.5$, and $q_2 = 0.4$ (see Table 7)). Table 7 shows that the individual assignment is even better than the group Maj method, but both methods give very poor performance for $R_2$ and $F_2$. The K-L method gives the best performance among the competitors (99%).
Case 3 ($p_1 = 0.6$, $q_1 = 0.3$, $p_2 = 0.5$, and $q_2 = 0.4$ (see Table 8)). In Case 3, each application has its own unique pattern. K-L performs well in this case also, but the performance gap between it and 4096d is smaller than that in Case 2. This tells us that K-L excels when a rare application-specific pattern exists.

Real Data-Based Simulation.
We retrieved traffic data from the packet traces in [22]. Our simulation system used the pcap library functions to extract valid TCP connections from the traces. We defined a valid TCP connection as a packet exchange between a client and a server that starts with a proper three-way TCP handshake and has at least four packets after it (currently our system eliminates flows with fewer than four packets. However, it is not difficult to extend our system to handle flows with fewer than four packets. We can simply add zero-length packets at the end of the flow when it does not contain all four packets. Applications that produce fewer than four packets can then be characterized as flows ending with a number of zero-length packets). Among the packet traces, we singled out SMTP (port 25) and IMAP (port 143) connections. The trace files were huge and produced over 160,000 connections for SMTP and over 30,000 connections for IMAP. We used tenfold cross-validation, so that a randomly selected nine-tenths of the connections from each application is used to train the target Markov model and the remaining one-tenth is used for testing. For each connection, we removed the three-way TCP handshake packets (SYN, SYN/ACK, and final ACK) and collected only the first four packets after the handshake. All acknowledgement packets are ignored.
Table 10 shows the state sequences (patterns) for each application in sorted order with the most frequent one at the top. Both applications had the 0-4-1-4 sequence as the most frequent sequence, with the other common sequences being 0-4-0-4, 0-4-0-0, and 1-4-1-4. Note that states 0 to 3 are for the client-to-server packets and states 4 to 7 are for the server-to-client packets, with each state representing one of the four payload length intervals: [0, 99], [100, 299], [300, MSS-1], and [MSS] (Section 4). For example, 0-4-1-4 means the first packet was a client-to-server packet with size in [0, 99], the second packet was a server-to-client packet with size in [0, 99], the third packet was a client-to-server one with size in [100, 299], and finally the fourth packet was a server-to-client one with size in [0, 99]. Let us summarize $n_1$ IMAP and $n_2$ SMTP application data collected in a given time interval as in Table 11.
Given a real network traffic observation $x_i$ with state sequence $s_1$-$s_2$-$s_3$-$s_4$, we calculate the empirical likelihood under each model $M_k$,

$$\hat{l}_k(x_i) = \hat{\pi}_k(s_1) \prod_{t=1}^{3} \hat{P}_k(s_t, s_{t+1}),$$

where $\hat{\pi}_k(s_1)$ is the proportion of state $s_1$ in the first stage and $\hat{P}_k(i, j)$ is an empirical transition matrix constructed from the total $n_k$ connections. The evaluation measure recall$_1$ can be computed as in (11) using the empirical likelihood $\hat{l}$ and $\hat{P}_1(x_i) = n_{i1}/n_1$ instead. The other evaluation measures can be computed in a similar fashion. Table 10 shows why SMTP and IMAP applications are difficult to classify using conventional network classification methods, such as the BOF technique in [3] and a plain Markov model approach in [2]. By inspection, $\hat{l}_{\mathrm{SMTP}}(0414) < \hat{l}_{\mathrm{IMAP}}(0414)$ and $\hat{l}_{\mathrm{SMTP}}(1414) < \hat{l}_{\mathrm{IMAP}}(1414)$, so state sequences 0-4-1-4 and 1-4-1-4 of SMTP are misclassified as IMAP. Therefore, the recall rate of SMTP is at most 1 - 0.5298 = 0.4702.
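The empirical likelihood computation can be sketched as below. The sketch assumes a single transition matrix per application shared across the stages and estimates all probabilities by relative frequencies; the function names and the toy training data are hypothetical.

```python
from collections import Counter, defaultdict


def train_markov(sequences):
    """Estimate the initial distribution and transition probabilities from state sequences."""
    first = Counter(seq[0] for seq in sequences)
    pi = {s: c / len(sequences) for s, c in first.items()}
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    # Normalize each row of transition counts into probabilities.
    trans = {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }
    return pi, trans


def likelihood(seq, model):
    """Empirical likelihood: initial probability times the transition probabilities."""
    pi, trans = model
    value = pi.get(seq[0], 0.0)
    for a, b in zip(seq, seq[1:]):
        value *= trans.get(a, {}).get(b, 0.0)
    return value


def classify_individual(seq, models):
    """Assign a single connection to the model under which it is most likely."""
    return max(models, key=lambda name: likelihood(seq, models[name]))


# Toy training data, not taken from the traces.
model = train_markov([(0, 4, 1, 4)] * 3 + [(1, 4, 1, 4)])
print(likelihood((0, 4, 1, 4), model))  # 0.75
```

In this toy model every transition is deterministic, so the likelihood of a sequence equals the proportion of its first state.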
Group assignment using Maj does worse than the individual assignment in the recall rate of SMTP. Because more than half of the SMTP state sequences are misclassified, majority counting aggravates the situation. Table 12 shows the performance of individual and group assignment.
The Kullback-Leibler approach gives the best performance, as in the model-based simulation. The recall rates of SMTP ($R_2$) with the Kullback-Leibler approach are well over 90% and close to 100% for a bag size of 100. The recall rates of IMAP ($R_1$) are also remarkably high. The values of $R_1$ for K-L (Kullback-Leibler) are somewhat lower than those for the Maj approach, but it should be understood that the high recall rate of IMAP in majority counting comes at a severe sacrifice of SMTP recall rate. The 4096d approach also shows strong recall rates, around 90%, for SMTP traffic. Its performance degrades, however, with IMAP traffic, showing around a 65% recall rate. It appears that simple Euclidean distance in 4096-dimensional space is not accurate enough to capture the characteristics of a traffic class whose traffic pattern heavily overlaps with another. Further, 4096d treats all dimensions equally in a sense, so it may not be sensitive enough to detect some unique rare patterns in certain applications.
Even though the Kullback-Leibler method performed well in the real data simulation, the computing time is still a concern in online network classification and decision-making. Table 13 presents a comparison of execution times for various algorithms. The execution times are normalized for 10,000 connections. The individual assignment and Maj show similar time requirements. The K-L approach is slightly faster than the 4096d method but is roughly ten times slower than the individual assignment or Maj for a bag size of 100 connections, which seems to be a reasonable bag size (a bag of 100 connections is large enough to present a representative state sequence distribution and small enough to declare the identity of some unknown traffic in due time). Most of the time is consumed in building the Markov model for the test data; computing the distance between two Markov models does not take significant execution time, contrary to our initial prediction. It turns out that the probability distribution over our total state-sequence space is very sparse, meaning that most of its entries are zero. Therefore, we can significantly reduce the K-L computing time.

Extending to Multiple Applications.
We have extended our proposed approach to handle network classification problems in the presence of more than two applications. From the network traffic data repository, we collected most of the application protocols having enough traffic to analyze. As a result, we ended up with 10 network protocols: FTP, SSH, SMTP, HTTP, POP, NNTP, IMAP, HTTPS, SPOP, and BitTorrent.
We have conducted an experiment to check whether our proposed method works well in the presence of other protocols. Tables 14 and 15 show the recall and precision rates of various models, respectively, for the chosen protocols. $R_p$ in Table 14 is the recall rate for a protocol with port number $p$, while $P_p$ in Table 15 represents the precision rate for the same protocol. The BitTorrent protocol uses random port numbers, but we have collected and tested packets destined to port 6881, since our pcap files contain BitTorrent traffic of an older version in which 6881 is known to be one of the most frequent port numbers (we discuss the problem of detecting BitTorrent packets in the presence of random port numbers in Section 5.4). Thus, $R_{6881}$ and $P_{6881}$ stand for the recall and precision rate of these representative BitTorrent packets. The recall rates of SMTP and IMAP are slightly different from those in Table 12, since each packet is now matched against 10 different models instead of two. It is clear that our techniques, KL-10 (Kullback-Leibler with bag size 10) and KL-100 (Kullback-Leibler with bag size 100), consistently show much better performance than other techniques. Other techniques show very poor recall rates for ports 21, 25, and 119. However, KL-10 and especially KL-100 produce close to 90% recall rate for all ports except for NNTP (port 119), which must be very tough to classify correctly, as can be seen in the table. Still, KL-100 matches the packets of NNTP much better than other techniques, almost reaching 80% recall rate with bag size 100, while others produce less than or around 20% recall rate.
Precision rates in Table 15 show the accuracy of the prediction of each model. Again, the proposed KL model displays considerably high precision rates compared to others. To compute recall and precision rates, we need the distribution of predicted ports for each classification technique. For example, Table 16 shows the distribution of predicted ports for each protocol with the KL-10 classification technique. The table shows the total number of connections belonging to each protocol in the first column. For example, 1792 connections were found to belong to port 21. The rest of the columns show the predicted port number for each protocol. For port 21, the KL-10 has predicted that 1480 connections belong to  14.

Detecting BitTorrent Packets in the Presence of Random Port Numbers.
Detecting P2P packets such as BitTorrent is a hard problem and has been investigated by numerous researchers. In this section, we describe how our technique can be applied to detect BitTorrent packets and provide some preliminary experimental results. Since our technique needs a set of flows believed to belong to the same protocol, in this case BitTorrent, we have collected packets coming out from the same host to build a bag of flows and applied our technique to it. We have identified hosts that exchange packets with more than 10 different peers within a relatively short time period, in our case 1000 seconds. There were 1049 such hosts in the pcap files used in the experimentation. Since the pcap files do not contain the payload portion, we cannot tell exactly which traffic is due to a true BitTorrent application (another approach of collecting traffic for BitTorrent detection would  17. About 30% of them were destined to one of the well-known ports that our classification system can recognize, while the rest (70% of the whole traffic), categorized as "Other" in the table, were using random port numbers. The table also shows the prediction result by the proposed KL-10 method. Since each flow will always be matched to one of the 10 models, there is zero flow in the "Other" category. Instead, each category has been predicted to contain more than the actual number of flows belonging to it. We believe the prediction system will become more precise when it is equipped with more Markov models for other missing protocols.
We are especially interested in the recall and precision performance for BitTorrent traffic. Out of the 1049 hosts, we further identified 39 hosts that produce traffic with ports in the 6881-6899 range. The total number of BitTorrent flows in this category was 943. We were interested in how many of them are classified as BitTorrent by our prediction system and how precise our prediction is. The result is shown in Table 18 and summarized at the bottom of the table. "num of BT" in the table stands for "the number of BitTorrent flows," and "num of Non-BT" stands for "the number of non-BitTorrent flows." For each host, the table shows the true numbers of BitTorrent and non-BitTorrent flows together with the predicted numbers of BitTorrent and non-BitTorrent flows. From the table we can see that all of the BitTorrent flows are classified as BitTorrent by our detection system. Therefore the recall rate is 100%. However, the precision rate is 54.70%, a much lower figure than those in Table 15. The main reason is that our system has only 10 models to classify the traffic: ports 21, 22, 25, 80, 110, 119, 143, 443, and 995, plus BitTorrent. All other traffic that does not belong to one of these, namely the "Other" category in Table 17, still has to be classified into one of these models, which lowers the precision rate. In particular, a significant portion of it is classified as BitTorrent traffic, as shown in Table 18. It might be that these flows are actually BitTorrent traffic, or
The low precision rate could be a problem in a real situation when we deploy our system in a gateway server. However, we believe that the precision rate will improve as the number of models increases. Also, since our technique flags unknown traffic as BitTorrent by looking only at the packet header, we can combine it with a Deep Packet Inspection (DPI) technique to classify the traffic further. That is, instead of applying DPI to all the traffic, we can first extract suspected BitTorrent traffic with our technique and then apply DPI only to the extracted flows.
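The two-stage idea can be sketched as a simple filter chain. Both callables below (`header_predict`, standing in for the header-only classifier, and `dpi_confirm`, standing in for a payload inspector) are hypothetical placeholders supplied by the caller, not components described in this paper.

```python
def two_stage_filter(flows, header_predict, dpi_confirm):
    """Run the cheap header-based predictor on every flow and the
    expensive DPI check only on flows it flags as BitTorrent."""
    suspects = [f for f in flows if header_predict(f) == "BitTorrent"]
    confirmed = [f for f in suspects if dpi_confirm(f)]
    return suspects, confirmed

# Hypothetical toy flows; 'guess' plays the header-based label, 'is_bt' the DPI verdict.
toy_flows = [
    {"id": 1, "guess": "BitTorrent", "is_bt": True},
    {"id": 2, "guess": "BitTorrent", "is_bt": False},
    {"id": 3, "guess": "80", "is_bt": False},
]
dpi_calls = []

def toy_dpi(flow):
    dpi_calls.append(flow["id"])  # record which flows actually reach DPI
    return flow["is_bt"]

toy_suspects, toy_confirmed = two_stage_filter(toy_flows, lambda f: f["guess"], toy_dpi)
```

With a recall rate close to 100%, the first stage loses few true BitTorrent flows, so the second stage mainly serves to remove false positives while inspecting only a fraction of the traffic.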
There are other concerns when the current system is deployed in a real situation. One is building and maintaining an LRU (Least Recently Used) list to keep track of the hosts for which to collect packets. Since we cannot keep track of all possible hosts, we use LRU to remove relatively inactive hosts from the monitoring list. How many hosts to keep in the LRU list, how many packets to collect per host, and so forth are questions to resolve in real-world deployment. Too short an LRU list will remove BitTorrent hosts prematurely when a large number of non-BitTorrent hosts exchange packets before the BitTorrent host has the chance to send or receive its second packet. Too long an LRU list obviously puts too much overhead on the system. Another concern is the relatively slow classification time of the Kullback-Leibler method. However, in a moderate-speed network such as the one in our test pcap file, the timestamps of the captured packets show that there is enough time to analyze traffic patterns and classify them with the Kullback-Leibler technique. Even in a high-speed network, classification is a highly parallel task, and with proper equipment such as parallel network processors our technique would still be a viable option.
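A minimal sketch of such a monitoring list, assuming a fixed host capacity and a per-host packet cap, is given below. The class name `HostLRU` and both parameters are illustrative choices; the paper leaves their values as open deployment questions.

```python
from collections import OrderedDict

class HostLRU:
    """Bounded monitoring list: the least recently active host is evicted first."""

    def __init__(self, max_hosts, max_packets_per_host):
        self.max_hosts = max_hosts
        self.max_packets = max_packets_per_host
        self._hosts = OrderedDict()  # host -> collected headers, least recent first

    def observe(self, host, header):
        """Record one packet header; return the evicted host, or None."""
        evicted = None
        if host in self._hosts:
            self._hosts.move_to_end(host)  # mark as most recently active
        else:
            if len(self._hosts) >= self.max_hosts:
                evicted, _ = self._hosts.popitem(last=False)  # drop least recent host
            self._hosts[host] = []
        bag = self._hosts[host]
        if len(bag) < self.max_packets:
            bag.append(header)
        return evicted

    def packets(self, host):
        """Return the headers collected so far for host (empty if not tracked)."""
        return list(self._hosts.get(host, []))

# A list that is too short evicts the BitTorrent host before its second packet arrives.
short_list = HostLRU(max_hosts=2, max_packets_per_host=3)
short_list.observe("bt-host", "pkt-1")
short_list.observe("web-1", "pkt-a")
evicted_host = short_list.observe("web-2", "pkt-b")
```

In this toy run the host `bt-host` is evicted after only two other hosts appear, which is the premature removal described above; raising `max_hosts` avoids it at the cost of more memory and bookkeeping.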

Discussion and Conclusion
In this paper, we developed a novel classification method based on a Markov model with Kullback-Leibler divergence. Our primary goal was to develop a method that performs well on hard-to-classify network applications. Even though most of the previous methods of network classification perform well in most cases by using either correlated information of connections or a combination of a machine learning technique and Markov or hidden Markov models, they fail to produce convincing results when the patterns of connections of applications are similar. We proposed a novel method that combines a flexible Markov model with Kullback-Leibler information and correlated traffic connection by grouping or bagging with the port number of applications. As our theoretical simulation and real data simulation show, our method outperformed the other methods in hard-to-classify situations, even though we did not cover all possible cases.
We recognize the slowness of the Kullback-Leibler approach (compared to the Individual or Majority approach, as shown in Table 13) as one drawback of our technique. However, its high prediction success rate, even among classes whose traffic patterns overlap, such as SMTP and IMAP, is promising. Further, our technique is scalable in that its execution time increases linearly with the number of target classes: once it builds the Markov model for the test data, measuring its distance from the Markov model of each target class can be done very quickly.
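The linear scaling can be seen in a small sketch: the Markov model of the test bag is estimated once, after which one divergence is computed per target class. The details below are assumptions made for illustration (first-order chains over a small symbol alphabet, additive smoothing with `alpha`, and an unweighted sum of row-wise Kullback-Leibler divergences); they are not necessarily the exact formulation used in our experiments.

```python
import math
from collections import Counter

def transition_model(sequences, states, alpha=1.0):
    """Estimate smoothed first-order transition probabilities from a bag of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))  # consecutive state pairs
    model = {}
    for a in states:
        row_total = sum(counts[(a, b)] for b in states) + alpha * len(states)
        model[a] = {b: (counts[(a, b)] + alpha) / row_total for b in states}
    return model

def kl_divergence(p, q, states):
    """Sum of KL(p[a] || q[a]) over all rows a; smoothing keeps every term finite."""
    return sum(p[a][b] * math.log(p[a][b] / q[a][b]) for a in states for b in states)

def classify(bag, class_models, states):
    """Build the test model once, then compare it with each class model."""
    test_model = transition_model(bag, states)
    return min(class_models, key=lambda c: kl_divergence(test_model, class_models[c], states))

# Hypothetical two-state alphabet and two toy target classes.
STATES = ["a", "b"]
toy_models = {
    "X": transition_model(["aaaaaaab", "aaaaaa"], STATES),
    "Y": transition_model(["abababab", "babababa"], STATES),
}
toy_label = classify(["aaaab", "aaaa"], toy_models, STATES)
```

Each call to `kl_divergence` costs time proportional to the square of the number of states and is independent of the amount of test traffic, so adding a target class adds one such call.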
Our approach can be extended to more general multiclass problems. A  ×  table can be set up to classify observations with  (> 2) patterns to one of the  (> 2) classes (see Table 19).
The recall rate for each application can be computed as follows: Precision rates can also be computed by considering the abundance proportions of each application.
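As a hedged illustration using the standard definitions, suppose $n_{ij}$ denotes the number of observations of true application $i$ that are assigned to class $j$, with $i, j = 1, \dots, k$, and $\pi_i$ denotes the abundance proportion of application $i$. This notation is our assumption for the sketch and may differ from that of Table 19.

```latex
% Recall of application i: correctly assigned observations over the row total.
R_i = \frac{n_{ii}}{\sum_{j=1}^{k} n_{ij}}, \qquad i = 1, \dots, k.

% Precision of class j from raw counts: correct assignments over the column total.
P_j = \frac{n_{jj}}{\sum_{i=1}^{k} n_{ij}}, \qquad j = 1, \dots, k.

% Precision under abundance proportions \pi_i, with row-normalized rates
% r_{ij} = n_{ij} / \sum_{l=1}^{k} n_{il} (so that r_{jj} = R_j):
P_j = \frac{\pi_j \, r_{jj}}{\sum_{i=1}^{k} \pi_i \, r_{ij}}.
```

The last expression reduces to the count-based precision when each $\pi_i$ equals the observed share of application $i$ in the sample, and it shows how a rare application can have low precision even when its recall is high.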

Disclosure
A preliminary shortened version of this paper has been published in [23] by the same authors. Extensive experimentation has been performed since the publication of the preliminary version, and the results are presented in the current paper.

Table 1: Overlapping state sequences of IMAP and SMTP.

Table 6
network traffic. The group assignment Maj performs worse than the individual in most of the measures, while K-L and 4096d perform as before.  2 and  1 of individual assignment are bad and worse in Maj, but K-L and 4096d do exceptionally well in this case as well.

Table 12: Evaluation results for real data.

Table 14: Recall rates for various protocols.

Table 15: Precision rates for various protocols.

Table 17: Port distribution of the flows from the 1049 hosts.
Instead, we have assumed that ports 6881 through 6899 are BitTorrent ports and collected packets with these ports as BitTorrent traffic. Packets with these ports have a very high chance of being BitTorrent packets (in older versions of BitTorrent, whose traffic our pcap files contain, 6881-6899 is known to be the port range that BitTorrent hosts use), and the purpose of our experimentation is to see how many of them are classified as BitTorrent packets by our KL model. The 1049 hosts produced 37123 flows, or connections, and their port distribution is shown in Table

Table 18: Prediction result for BitTorrent hosts.