Capturing Uncertainty Information and Categorical Characteristics for Network Payload Grouping in Protocol Reverse Engineering

As a promising tool for recovering the specifications of unknown protocols, protocol reverse engineering has drawn increasing research attention over the last decade. A critical task of protocol reverse engineering is to extract protocol keywords from network traces. Since messages of different types have different sets of protocol keywords, an effective way to improve the accuracy of protocol keyword extraction is to cluster the network payload of unknown traffic and analyze each cluster separately. Although classic algorithms such as K-means and EM can be used for network payload clustering, the quality of the resulting traffic clusters is far from satisfactory when these algorithms are applied to application layer traffic with categorical attributes. In this paper, we propose a novel method to improve the accuracy of protocol reverse engineering by applying a rough set-based technique to cluster application layer traffic. This technique analyzes multidimensional uncertain information in multiple categorical attributes based on rough sets theory to cluster network payload, and applies the Minimum Description Length criterion to determine the optimal number of clusters. The experiments show that our method outperforms the existing algorithms and improves the results of protocol keyword extraction.


Introduction
Network protocol reverse engineering [1][2][3][4] is a promising approach to recovering detailed specifications of unpublished or undocumented network protocols from network traces. Protocol specifications play an important role in network security and management tasks such as intrusion detection, fuzz testing [5], recovering and understanding command-and-control (C&C) protocols [6], and building intelligent honeypots [7].
The specifications of open protocols such as HTTP are available from published documents. However, the specifications of some protocols (called proprietary protocols) used by enterprises or hackers are not open to the public for commercial or security reasons. Researchers deem protocol reverse engineering the only option available for understanding a proprietary protocol from network traces.
The extraction of protocol keywords from network traces is a critical task of protocol reverse engineering. Protocol keywords [4] are the constant strings or command characters used by the protocol. For example, the HTTP protocol uses "GET" as a request method and the FTP protocol uses "QUIT" as the command to quit a session. If the messages are not grouped into clusters such that each cluster holds a single message type, some protocol keywords (associated with a particular message type) with low occurrence probability will be missed due to the interference of other message types. Thus, it is critical that the messages input to a protocol reverse engineering system belong to a single type. In practice, it is a challenge to cluster the messages of unknown traffic according to message types, since we have no prior knowledge about the unknown protocols. A suitable solution is to apply unsupervised clustering methods to group the messages of the unknown traffic.

Mathematical Problems in Engineering
Clustering unlabeled data is also important for many other applications, such as automatically generating signatures for unknown or new network applications in network management. Such issues have been studied in many works [8][9][10][11][12], most of which are applicable to clustering data whose attributes have numerical values. Most clustering algorithms focus on numerical attributes because it is easy to define similarities between numerical data using geometric concepts. However, much of the data contained in today's databases, and much of the information that is valuable for clustering, is categorical in nature. Specifically, in our problem domain, the protocol messages analyzed by reverse engineering systems expose mainly categorical characteristics, such as message direction, the transport layer port number used by the protocol, and the words or delimiters used by the messages. On the other hand, most clustering algorithms are crisp clustering algorithms, which classify an object into one and only one cluster [13]. However, it is hard to define crisp cluster boundaries in many practical cases, and crisp clustering algorithms lack the ability to handle the uncertain information hidden in the data set [14]. Messages that belong to different types exhibit different string patterns. For instance, some messages start with the string "M-SEARCH," while others start with "GET." Such distinctive characteristics are valuable for clustering application layer traffic. However, some attributes take the same value in different clusters. For example, the first 4 bytes of some messages of both HTTP and SSDP are "HTTP."
Thus, when we have a message whose first 4 bytes are "HTTP," it is difficult to determine whether the message belongs to HTTP or SSDP. In other words, each attribute of a message indicates the message's membership in different types with a different degree of confidence. We have to deal with this uncertain information of message attributes in the clustering process.
In this paper, the main innovative contribution is to improve the accuracy of protocol reverse engineering by applying a rough sets theory (RST) [14][15][16] based method to the problems of clustering application layer traffic and grouping messages according to message types. In the clustering process, we propose to select the cluster for further splitting according to the degree of uncertainty within the cluster, instead of selecting the subcluster with more objects [17]. The approach is implemented in a system called RSCluster and applied to cluster real-world application layer network traces according to protocols and to group protocol messages according to message types in order to improve the accuracy of protocol keyword extraction.
Besides improving the accuracy of protocol reverse engineering, RSCluster is intended to be an efficient and accurate traffic clustering tool for classifying and identifying newly emerging applications, which are most likely unknown to network administrators. Based on the clustering results, one can generate models or signatures that represent the profile of each unknown application so as to distinguish them from each other in the future.
The remainder of this paper is arranged as follows. Section 2 reviews the related work. Section 3 presents an overview of rough sets theory and outlines some basic definitions. Section 4 presents our methodology of traffic clustering. Section 5 discusses the Minimum Description Length criterion for model selection. The computational complexity is discussed in Section 6 and the approach is evaluated in Section 7. Finally, a conclusion is given in Section 8.

Related Work
In the last decade, a number of techniques have been applied to cluster Internet traffic. The simplest approach is to cluster flows according to the well-known ports assigned by IANA [18]. However, more and more applications use dynamic port numbers when communicating over the Internet. Furthermore, a considerable number of applications have been found to hide their traffic behind well-known traffic (e.g., HTTP) transmitted over well-known ports so as to bypass firewalls. Therefore, port-based approaches are insufficient in many cases.
Recently, a number of works have applied classic clustering methods such as K-means and EM to address these issues. These techniques assume that traffic has statistical attributes (e.g., packet lengths, average packet length, and packet interarrival time) which are unique for certain classes of applications. Hence, one could distinguish different kinds of applications from each other by leveraging the flow's statistical characteristics. Erman et al. [8] use two unsupervised clustering algorithms (i.e., K-means and DBSCAN) to classify traffic by exploiting the distinctive characteristics of applications when they communicate on a network. In [9], Erman et al. apply an unsupervised EM clustering technique to the Internet traffic classification problem. The approach uses an EM algorithm to classify unlabeled training data into groups based on similarity. Although these techniques have improved the accuracy to a certain extent, the reported results are still far from satisfactory.
The difference between our work and [8,9] is that we cluster traffic by considering categorical attributes and analyzing the multidimensional uncertainty between attributes, instead of quantifying flow attributes into numerical attributes and applying geometric similarity. Inaccurate or inappropriate quantification loses valuable information in the data or distorts it. As a result, it is important to keep the data as it is, as far as possible, in order to preserve more information.
For categorical data clustering, Mahmood et al. [19] develop a framework that handles mixed type attributes, including numerical, categorical, and hierarchical attributes, for a one-pass hierarchical clustering algorithm. However, they focus on analyzing network flow features such as protocols (UDP, TCP, and ICMP) to identify interesting traffic patterns in network traffic data, while we aim to analyze categorical features in the application layer to cluster network traffic according to protocols and group protocol messages according to message types. Other researchers propose to take advantage of RST to cluster qualitative-value data [20]. Mazlack et al. [21] propose an RST-based technique to choose effective attributes for data partitioning in the clustering procedure. Parmar et al. propose a novel algorithm, namely, Min-Min-Roughness (MMR) [17], for categorical data clustering. They show that MMR performs well in handling the uncertainty of multidimensional categorical attributes. In each iteration of the clustering process, the MMR algorithm selects the subcluster with more objects for further splitting. In contrast, our approach chooses the cluster for further splitting according to the degree of uncertainty instead of the number of objects within the cluster. The rationale for this alternative is explained in Section 4.
Wang et al. [22,23] demonstrate an approach to cluster unknown traffic and determine the number of clusters. Since the exact number of classes is unknown in advance, they estimate the best value of the cluster number $k$ by integrating the Bayesian information criterion with basic K-means. Georgieva et al. [24] propose an approach to determine the number of clusters based on the Minimum Description Length (MDL) criterion [25,26]. We also use the MDL criterion to capture the optimal number of clusters. We determine the cluster number by choosing the clustering model which minimizes the total description length of describing the model and encoding the data set with the help of the model.

Traffic Clustering Using Rough Sets Theory
The concept of rough sets theory (RST) [15] was developed by Pawlak in 1982. To date, RST has received considerable research attention in the computational intelligence literature for its excellent capability of handling imprecision, vagueness, and uncertainty in data analysis [14,16,17,21]. Parmar et al. [17] show that an RST-based algorithm is appropriate for clustering categorical data and handling uncertainty in the clustering process. In what follows, we introduce the basic concepts of rough sets theory.

Information System for Clustering.
In order to formulate the problem of traffic clustering or message grouping, an information system $S$ is defined as an ordered quadruple $S = (U, A, V, f)$. The universe $U = \{u_1, u_2, \ldots, u_n\}$ contains all objects (sessions or messages) in the network trace. The attribute set is denoted by $A = \{a_1, a_2, \ldots, a_m\}$, where every $a_i$ is an attribute of the objects in $U$. $V_{a_i}$ stands for the domain of $a_i$, whereas $V = \bigcup_{a_i \in A} V_{a_i}$. Finally, the description function $f : U \times A \to V$ maps an object $u \in U$ and an attribute $a_i \in A$ to a value $f(u, a_i)$ in the value domain $V$.

Indiscernibility Relation.
For any two objects $u_i, u_j \in U$, they are indiscernible with respect to attribute $a$ (denoted by $u_i \sim_a u_j$) in $S$ if and only if $f(u_i, a) = f(u_j, a)$. More generally, given a subset $B \subset A$, if $u_i$ and $u_j$ are indiscernible with respect to every $a_k \in B$, then $u_i$ and $u_j$ are defined to be indiscernible by the set $B$; that is, $\forall a_k \in B$, $f(u_i, a_k) = f(u_j, a_k)$. The indiscernibility relation on the set $B$ in $S$, in symbols $\sim_B$, is defined as follows: $u_i \sim_B u_j$ if and only if $u_i$ and $u_j$ are indiscernible by the set $B$.
Importantly, an elementary set, denoted by $E$, in $S$ with respect to $B$ is defined as an equivalence class of the relation $\sim_B$. The family of all elementary sets with respect to $B$ is denoted by $U/B = \{E_1, E_2, \ldots, E_{|U/B|}\}$. For any element $u_i$ of $U$, the equivalence class of $u_i$ under the relation $\sim_B$ is represented as $[u_i]_{\sim_B}$.
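The elementary sets can be computed by grouping objects on their attribute values. The following Python sketch illustrates this; the function name, the dictionary layout of the information system, and the toy messages are assumptions made for illustration and are not taken from RSCluster.

```python
from collections import defaultdict

def elementary_sets(table, attrs):
    """Return the equivalence classes of the indiscernibility relation
    induced by the attribute subset `attrs` (i.e., the family U/B)."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        # Objects with identical values on every attribute in `attrs`
        # are indiscernible and fall into the same elementary set.
        key = tuple(row[a] for a in attrs)
        classes[key].add(obj_id)
    return list(classes.values())

# Hypothetical toy universe: four messages with two categorical attributes.
messages = {
    "u1": {"direction": "c2s", "field1": "GET "},
    "u2": {"direction": "c2s", "field1": "GET "},
    "u3": {"direction": "s2c", "field1": "HTTP"},
    "u4": {"direction": "c2s", "field1": "M-SE"},
}

by_direction = elementary_sets(messages, ["direction"])
by_both = elementary_sets(messages, ["direction", "field1"])
```

Adding attributes to the subset can only refine the partition: here `direction` alone yields two elementary sets, while `direction` together with `field1` yields three.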

Lower and Upper Approximation.
In what follows, we introduce two important concepts in rough sets-based approach for data analysis, namely, the lower approximation and upper approximation.
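Let $B \subset A$ be a set of attributes, and let $X$ be a subset of objects in universe $U$ (i.e., $X \subset U$). The lower approximation of $X$ with respect to $B$ in $S$, denoted by $\Delta_B(X)$, is defined as the union of all those elementary sets which are contained in $X$. That is, given $U/B = \{E_1, E_2, \ldots, E_{|U/B|}\}$, we have
$$\Delta_B(X) = \bigcup \{E_i \in U/B : E_i \subseteq X\}.$$
By this notation, an object $u \in U$ belongs to $X$ without doubt if $u \in \Delta_B(X)$.
On the other hand, the upper approximation of $X$ with respect to $B$ in $S$, denoted by $\nabla_B(X)$, is defined as the union of all those elementary sets which have a nonempty intersection with $X$. More formally, given $U/B = \{E_1, E_2, \ldots, E_{|U/B|}\}$, we have
$$\nabla_B(X) = \bigcup \{E_i \in U/B : E_i \cap X \neq \emptyset\}.$$
We note that $u \in \nabla_B(X)$ implies that $u$ possibly belongs to $X$.
An accuracy measure of the set $X$ in $B \subseteq A$ is defined as
$$\alpha_B(X) = \frac{\operatorname{card}(\Delta_B(X))}{\operatorname{card}(\nabla_B(X))},$$
where $\operatorname{card}(\Delta_B(X))$ or $\operatorname{card}(\nabla_B(X))$ is the number of objects contained in the lower or upper approximation of the set $X$ with respect to $B$. Obviously, $0 \le \alpha_B(X) \le 1$.
If $\alpha_B(X) = 1$, then the set $X$ is said to be definable in $S$ with respect to $B$. Otherwise, $X$ is undefinable in $S$.
For ease of exposition, we define the notion of roughness as follows:
$$R_B(X) = 1 - \alpha_B(X) = 1 - \frac{\operatorname{card}(\Delta_B(X))}{\operatorname{card}(\nabla_B(X))}.$$
If $R_B(X) = 0$, $X$ is crisp with respect to $B$. If $0 < R_B(X) \le 1$, $X$ is rough with respect to $B$.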
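As an illustration of these definitions, the following Python sketch computes the lower and upper approximations and the roughness of a target set, using the standard rough-set convention that roughness is one minus the ratio of the two cardinalities. The function names and the toy data (which mirror the HTTP/SSDP ambiguity of the "HTTP" prefix discussed in the Introduction) are assumptions for illustration only.

```python
from collections import defaultdict

def elementary_sets(table, attrs):
    """Equivalence classes of objects that agree on every attribute in attrs."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj_id)
    return list(classes.values())

def approximations(table, attrs, target):
    """Return (lower, upper) approximations of `target` with respect to attrs."""
    lower, upper = set(), set()
    for e in elementary_sets(table, attrs):
        if e <= target:      # elementary set fully contained in the target
            lower |= e
        if e & target:       # elementary set overlaps the target
            upper |= e
    return lower, upper

def roughness(table, attrs, target):
    """Roughness = 1 - card(lower) / card(upper); 0.0 means crisp."""
    lower, upper = approximations(table, attrs, target)
    return 0.0 if not upper else 1.0 - len(lower) / len(upper)

# Hypothetical messages described only by their first 4 bytes.
table = {
    "u1": {"field1": "HTTP"},  # an HTTP response
    "u2": {"field1": "HTTP"},  # an SSDP response with the same prefix
    "u3": {"field1": "GET "},  # an HTTP request
    "u4": {"field1": "M-SE"},  # an SSDP M-SEARCH request
}
http_messages = {"u1", "u3"}

lower, upper = approximations(table, ["field1"], http_messages)
r = roughness(table, ["field1"], http_messages)
```

In this toy case only `u3` certainly belongs to the HTTP set, while `u1`, `u2`, and `u3` possibly belong to it, so the set is rough with respect to `field1`.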

Classification of Information System.
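Let $\chi = \{X_1, X_2, \ldots, X_c\}$, $X_i \subset U$, be a family of subsets of the universe $U$. If $\chi$ is a partition of $U$, that is,
$$X_i \cap X_j = \emptyset \ (i \neq j), \qquad \bigcup_{i=1}^{c} X_i = U,$$
then $\chi$ is a classification of $U$, whereas the $X_i$ are called classes of $\chi$.
Suppose that $B$ is a subset of $A$; the lower and upper approximations of $\chi$ with respect to $B$ in $S$ are defined as
$$\Delta_B(\chi) = \{\Delta_B(X_1), \Delta_B(X_2), \ldots, \Delta_B(X_c)\}, \qquad \nabla_B(\chi) = \{\nabla_B(X_1), \nabla_B(X_2), \ldots, \nabla_B(X_c)\},$$
respectively.
With these notations, the quality of the classification $\chi$ with respect to $B$ is given as
$$\gamma_B(\chi) = \frac{\sum_{i=1}^{c} \operatorname{card}(\Delta_B(X_i))}{\operatorname{card}(U)},$$
and the accuracy of the classification $\chi$ with respect to $B$ is given as
$$\beta_B(\chi) = \frac{\sum_{i=1}^{c} \operatorname{card}(\Delta_B(X_i))}{\sum_{i=1}^{c} \operatorname{card}(\nabla_B(X_i))}.$$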
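A small Python sketch can make the two classification measures concrete. It assumes the standard rough-set forms (quality as the summed lower-approximation size over the universe size, accuracy as the summed lower over summed upper sizes); the function names and the toy classification are hypothetical.

```python
from collections import defaultdict

def elementary_sets(table, attrs):
    """Equivalence classes of objects that agree on every attribute in attrs."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj_id)
    return list(classes.values())

def classification_measures(table, attrs, classes):
    """Return (quality, accuracy) of a classification with respect to attrs."""
    blocks = elementary_sets(table, attrs)
    lower_total = upper_total = 0
    for x in classes:
        lower_total += sum(len(e) for e in blocks if e <= x)
        upper_total += sum(len(e) for e in blocks if e & x)
    return lower_total / len(table), lower_total / upper_total

# Hypothetical universe and a two-class classification (HTTP vs. SSDP).
table = {
    "u1": {"field1": "HTTP"},
    "u2": {"field1": "HTTP"},
    "u3": {"field1": "GET "},
    "u4": {"field1": "M-SE"},
}
classes = [{"u1", "u3"}, {"u2", "u4"}]

quality, accuracy = classification_measures(table, ["field1"], classes)
```

Here two of the four objects lie in some lower approximation, so the quality is 0.5, while the ambiguous "HTTP" prefix inflates both upper approximations and lowers the accuracy to 1/3.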

Traffic Clustering Using Rough Sets Theory
As suggested by Mazlack et al. [21], data clustering is a series of procedures that discover the intraitem dissonance of data and eliminate it within the resulting subpartitions by progressively partitioning the data set. Inspired by this rationale, traffic clustering can be achieved by recursively partitioning the data set to reduce the dissonance (uncertainty) within the resulting partitions. Obviously, a crisp partitioning is the most desirable outcome because there is no dissonance inside the partitions. However, it cannot always be achieved in the real world. Thus, our proposed algorithm seeks the partitioning that leads to less uncertainty.
Following this heuristic, traffic clustering is performed in the following way: we first choose an effective partitioning attribute and then split the data set by searching for a partitioning point of the selected attribute so as to maximize the coherence of the resulting partitions. These procedures are repeated until either the coherence no longer changes or a predefined termination condition is satisfied.

Roughness and Average Roughness.
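In order to deal with the uncertain information in the data set and quantify the degree of uncertainty, we define some related concepts based on roughness [17,21] as follows.
Given an attribute $a_i \in A$ and the domain of its possible values $V_{a_i} = \{v_1, v_2, \ldots, v_{|V_{a_i}|}\}$, the $k$th partition induced by $a_i$ is the set of objects taking the value $v_k$, that is, $X_k = \{u \in U : f(u, a_i) = v_k\}$. The roughness of $a_i$ with respect to another attribute $a_j \in A$ on the $k$th partition is defined as
$$R_{a_j}(X_k) = 1 - \frac{\operatorname{card}(\Delta_{\{a_j\}}(X_k))}{\operatorname{card}(\nabla_{\{a_j\}}(X_k))}.$$
The average roughness of $a_i$ with respect to $a_j$ is defined as the mean of the roughnesses of $a_i$ with respect to $a_j$ on all partitions induced by $a_i$. That is,
$$\bar{R}_{a_j}(a_i) = \frac{1}{|V_{a_i}|} \sum_{k=1}^{|V_{a_i}|} R_{a_j}(X_k).$$
Obviously, the attribute roughness $R(a_i \mid A)$ measures the effect of the partitioning using attribute $a_i$ toward all other attributes. It ranges from 0 to 1. The smaller $R(a_i \mid A)$ is, the crisper the partition [21].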
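The average roughness of one attribute with respect to another can be sketched in Python as follows. This is a minimal illustration that assumes each partition is the set of objects sharing one value of the first attribute, as in MMR [17]; the function names and the toy table are hypothetical and not part of RSCluster.

```python
from collections import defaultdict

def roughness(table, attr, target):
    """Roughness of `target` with respect to the single attribute `attr`."""
    blocks = defaultdict(set)
    for obj_id, row in table.items():
        blocks[row[attr]].add(obj_id)
    lower = sum(len(e) for e in blocks.values() if e <= target)
    upper = sum(len(e) for e in blocks.values() if e & target)
    return 0.0 if upper == 0 else 1.0 - lower / upper

def average_roughness(table, a_i, a_j):
    """Mean roughness, with respect to a_j, over the partitions induced by a_i."""
    values = {row[a_i] for row in table.values()}
    total = 0.0
    for v in values:
        # Partition: all objects whose value of a_i equals v.
        partition = {u for u, row in table.items() if row[a_i] == v}
        total += roughness(table, a_j, partition)
    return total / len(values)

# Hypothetical toy universe of four messages.
table = {
    "u1": {"direction": "c2s", "field1": "GET "},
    "u2": {"direction": "c2s", "field1": "GET "},
    "u3": {"direction": "s2c", "field1": "HTTP"},
    "u4": {"direction": "c2s", "field1": "M-SE"},
}

r_dir = average_roughness(table, "direction", "field1")
r_field = average_roughness(table, "field1", "direction")
```

In this toy table, partitioning by `direction` is crisp with respect to `field1` (average roughness 0), whereas partitioning by `field1` is rough with respect to `direction` (average roughness 2/3), so `direction` would be the better partitioning attribute here.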

Splitting a Cluster.
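Recall that a crisp partitioning cannot always be achieved in real-world clustering cases, so we have to consider the partitioning scheme that leads to minimum uncertainty. Following this idea, the goal of our algorithm is to minimize the roughness of the data set.
Given a specific subset $X \subseteq U$, we first search for the attribute that should be selected to conduct the splitting of $X$. Recall that the attribute roughness $R(a_i \mid A)$ measures the effect of the partitioning using attribute $a_i$. So, the attribute whose attribute roughness is minimal should be selected for splitting. Hence, the objective attribute is
$$a^* = \arg\min_{a_i \in A} R(a_i \mid A).$$
In what follows, we illustrate how to split $X$ into two partitions $X_1$ and $X_2$ with the help of $a^*$, where $X_1 \cap X_2 = \emptyset$ and $X_1 \cup X_2 = X$.
First of all, let $X_{a^*}(v_k)$ denote the partition of $X$ whose objects take the value $v_k \in V_{a^*}$ on $a^*$. We define the summation of roughness of the attributes $a_j$ on partition $X_{a^*}(v_k)$ as
$$R_{a^*}(v_k) = \sum_{a_j \in A,\ a_j \neq a^*} R_{a_j}(X_{a^*}(v_k)). \qquad (13)$$
Secondly, we rank $R_{a^*}(v_k)$ in ascending order. Suppose that the rank result is given as
$$R_{a^*}(v_{(1)}) \le R_{a^*}(v_{(2)}) \le \cdots \le R_{a^*}(v_{(|V_{a^*}|)}).$$
Using these notations, a splitting of $X$ at the $k$th ranked value separates the partitions ranked $1$ to $k$, denoted by $X_{1:k}$, from the remaining partitions, denoted by $X_{k:|V_{a^*}|}$. The attribute roughness of $a^*$ in $X_{1:k}$ and $X_{k:|V_{a^*}|}$ is used to find the optimal splitting point (17).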

Selecting Subclusters for Further Splitting.
In previous work, the Min-Min-Roughness algorithm [17] selects the subcluster having more objects for further splitting at subsequent iterations. However, if the uncertainty of a cluster having more objects is much smaller than that of another cluster having fewer objects, it is rational to split the latter, since it contains much more intraitem dissonance than the former. Therefore, we choose the cluster for further splitting according to the degree of uncertainty within the cluster instead of the number of objects within the cluster.
The cluster roughness of a cluster $C$ is defined as the summation of all attribute roughness in cluster $C$. That is,
$$R(C) = \sum_{a_i \in A} R(a_i \mid A),$$
where each attribute roughness is computed over the objects in $C$. In this paper, we choose the cluster with the largest cluster roughness for further splitting. We also require that the size of the selected cluster be larger than a predefined threshold (the minimum cluster size) so as to avoid overclustering.
We first apply our method to the universe $U$ to split it into two subclusters. Then, we choose the subcluster whose cluster roughness is the largest and repeat the splitting procedure to obtain further partitions. We apply our algorithm recursively until the predefined termination condition is satisfied or no cluster is selected for further splitting. The predefined termination condition is that either the number of iterations or the number of clusters reaches its corresponding upper bound.
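The selection rule can be sketched in a few lines of Python. The sketch below assumes the cluster roughness is supplied by a caller-provided function; the names `select_cluster`, `cluster_roughness`, and `min_cluster_size`, as well as the toy roughness values, are hypothetical.

```python
def select_cluster(clusters, cluster_roughness, min_cluster_size):
    """Pick the cluster with the largest roughness among clusters that are
    larger than min_cluster_size; return None if no cluster qualifies."""
    best, best_roughness = None, 0.0
    for cluster in clusters:
        r = cluster_roughness(cluster)
        # Crisp clusters (roughness 0) and undersized clusters are never split.
        if len(cluster) > min_cluster_size and r > best_roughness:
            best, best_roughness = cluster, r
    return best

# Hypothetical clusters with made-up roughness values.
clusters = [frozenset("abcde"), frozenset("fgh"), frozenset("ij")]
toy_roughness = {clusters[0]: 0.4, clusters[1]: 1.7, clusters[2]: 3.0}

chosen = select_cluster(clusters, toy_roughness.get, min_cluster_size=2)
```

With these values the three-object cluster is chosen over the five-object one because it is rougher, which is where this rule departs from selecting by size; the two-object cluster is skipped because it does not exceed the minimum cluster size.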

Determining Number of Clusters Based on MDL Principle
The Minimum Description Length Principle.
The Minimum Description Length (MDL) principle, proposed by Rissanen [25,26], has been successfully applied to select the optimal model from a set of given stochastic models.
The MDL principle asserts that the best model inferred from a given set of data is the one which minimizes the total description length of both the model and the encoding of the data with the help of the model.
More specifically, when a set of models $\{\theta^{(k)} \mid k = 1, 2, \ldots, K\}$ is given, the description length of an observation $x = \{x_1, x_2, \ldots, x_n\}$ using the $k$th model is given as
$$L(x, \theta^{(k)}) = -\log p(x \mid \theta^{(k)}) + \frac{d_k}{2} \log n + \log K, \qquad (19)$$
where $d_k$ is the number of free parameters in the $k$th model. We note that $p(x \mid \theta^{(k)})$ denotes the likelihood of data $x$ with respect to model $\theta^{(k)}$. The first term can also be viewed as the description length of the encoding of data $x$ with the help of model $\theta^{(k)}$. The second term in (19) is related to the complexity of model $\theta^{(k)}$ and the size of the observation data $x$. The third term is the code length representing the number of models for selection.

Determine the Number of Clusters Based on MDL.
Given $S = (U, A, V, f)$, let $\chi^{(k)} = \{X^{(k)}_1, X^{(k)}_2, \ldots, X^{(k)}_{c_k}\}$ be the classification of $U$ in the $k$th iteration and let $x_{a_i}$ be a variable standing for the value of $a_i$. For each class $X^{(k)}_j \in \chi^{(k)}$, we calculate the entropy of $a_i$ with respect to $X^{(k)}_j$ as follows:
$$H(x_{a_i} \mid X^{(k)}_j) = -\sum_{v \in V_{a_i}} p^{(k)}_j(v) \log p^{(k)}_j(v),$$
where $p^{(k)}_j(v) = P(f(u \in X^{(k)}_j, a_i) = v)$. For the sake of simplicity, we also denote the description length of the universe $U$ under $\chi^{(k)}$ in terms of these entropies. The total description length of the information system $S$ in the $k$th iteration is given by
$$L_k = L^{(k,1)} + L^{(k,2)} + L^{(k,3)}, \qquad (22)$$
where the first term, $L^{(k,1)}$, is the description length of encoding the data set $U$ using the $k$th clustering scenario $\chi^{(k)}$, the second term, $L^{(k,2)}$, is the length describing the complexity of $\chi^{(k)}$, and the last term, $L^{(k,3)}$, is the code length representing the number of models for selection. Recall that, in each iteration, we select an optimal attribute for each candidate node to perform the splitting procedure so as to minimize the roughness in the information system $S$, so the uncertain information in $S$ decreases. Therefore, as the recursive clustering procedure proceeds, the term $L^{(k,1)}$ in (22) decreases steadily. On the other hand, as the number of clusters in $S$ increases, the classification $\chi$ becomes more and more complex and the computational cost increases. So, the second term $L^{(k,2)}$ in (22) increases. In particular, $L^{(k,3)}$ is a constant. In summary, the total description length of $S$ first decreases until it reaches its minimum and then increases. Thus, we can search for a $k_0$ that minimizes the value of $L_k$ in (22):
$$k_0 = \arg\min_{k} L_k.$$
As a result, the number of clusters in the $k_0$th iteration is optimal.
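The two computational ingredients of this selection step, the per-class entropy of an attribute and the search for the iteration with minimum total description length, can be sketched in Python. This is a hedged illustration only: it uses base-2 logarithms (an assumption), it takes the per-iteration description lengths as already computed, and the function names and numbers are hypothetical.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in bits) of one attribute's values inside one class."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_iteration(description_lengths):
    """Return the 1-based iteration whose total description length is minimal."""
    k0 = min(range(len(description_lengths)), key=description_lengths.__getitem__)
    return k0 + 1

# Hypothetical attribute values within two classes.
mixed = attribute_entropy(["GET ", "GET ", "POST", "POST"])
pure = attribute_entropy(["GET ", "GET ", "GET ", "GET "])

# Hypothetical total description lengths for four successive iterations.
k0 = best_iteration([10.0, 7.5, 6.0, 6.8])
```

A class in which an attribute takes a single value contributes zero entropy, which is why the data-encoding term shrinks as splitting proceeds; the selected iteration is simply the position of the minimum of the resulting curve.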

Computation Complexity
In this section, we discuss the complexity of our algorithm. Suppose that there are $n$ objects and $m$ attributes in total. The worst case is that each attribute has a distinct value for each object. That is, for each attribute $a_i$, the cardinality of $V_{a_i}$ is exactly equal to $n$ ($|V_{a_i}| = n$) and the cardinality of the family of elementary sets with respect to this attribute is also equal to $n$ ($|U/\{a_i\}| = n$). Therefore, we need at most $n$ comparisons for each object to judge whether it belongs to one elementary set. In the worst case, we have to perform this $n$-comparison judgement for a total of $n$ elementary sets. So, the complexity of calculating the average roughness of $a_i$ with respect to another attribute $a_j$ is $n^2$. Since there are $m$ attributes in total, the complexity of computing the attribute roughness of a specific attribute is $m \times n^2$. In the procedure of finding the partitioning attribute, we have to perform $m$ passes of attribute roughness calculation, so the complexity is $m^2 \times n^2$.
On the other hand, in the worst case, the complexity of computing $R_{a^*}(v_k)$ in (13) is $m \times n$, the complexity of sorting $R_{a^*}(v_k)$ is $n \log n$, and the complexity of finding the optimal splitting point in (17) is $m \times n^2$.
In summary, the total complexity of our algorithm is $O(m^2 \times n^2 + m \times n + n \log n + m \times n^2)$. For a large data set, the value of $n$ is very large, so the value of $m$ can be considered a constant by comparison. Thus, the complexity of our algorithm is $O(n^2)$.

Evaluation
The proposed algorithm is implemented in a system called RSCluster. In the first phase, RSCluster is applied to clustering application layer traffic. Three data sets (i.e., data sets I, II, and III, as shown in Table 1) were collected from the School of Information and Science Technology at Sun Yat-Sen University on August 8, 10, and 11, 2012. Table 2 shows the detailed information of our data sets.
The overall accuracy is used to evaluate the overall effectiveness of the proposed algorithm based on rough sets theory. The dominating application in a cluster is used to label the cluster. Thus, the overall accuracy of clustering is defined as the ratio of the number of flows labeled correctly in all clusters to the total number of flows in the data set. Suppose that the number of flows labeled correctly in a cluster $C_i$ is referred to as the True Positives (TP), denoted by $TP_i$. Thus, the overall accuracy is given as $\sum_i TP_i / N$, where $N$ is the total number of sessions in the data set.
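The overall accuracy defined above can be computed directly from the ground-truth labels of the sessions in each cluster. The Python sketch below illustrates this; the function name and the example labels are hypothetical.

```python
from collections import Counter

def overall_accuracy(clusters):
    """Label each cluster by its dominating application and return the
    fraction of sessions whose true label matches their cluster's label."""
    total = sum(len(labels) for labels in clusters)
    # True positives of a cluster = count of its most common true label.
    true_positives = sum(
        Counter(labels).most_common(1)[0][1] for labels in clusters if labels
    )
    return true_positives / total

# Hypothetical ground-truth labels of the sessions in three clusters.
clusters = [
    ["HTTP", "HTTP", "FTP"],
    ["SMTP", "SMTP"],
    ["FTP", "HTTP", "FTP", "FTP"],
]

accuracy = overall_accuracy(clusters)
```

In this example the clusters are labeled HTTP, SMTP, and FTP and contribute 2, 2, and 3 true positives, respectively, out of 9 sessions.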
Other metrics are listed as follows.
False negative (FN) is the number of sessions which are incorrectly classified as not belonging to a cluster.
False positive (FP) is the number of sessions that are incorrectly classified as belonging to a cluster.
True negative (TN) is the number of sessions that are correctly classified as not belonging to a cluster.
There are a total of 26 features used in our experiments, including the transport layer protocol type (TCP or UDP), transport layer port number, and the 4-byte fields occurring in the messages.
In the domain of traffic classification, Moore and Papagiannaki [27] and Ye et al. [28] show that effective application signatures with high accuracy can be generated using the first $n$ bytes of each flow. Haffner et al. [29] present three main motivations for limiting the data size to the first $n$ bytes. (1) It helps to identify traffic as early as possible. (2) Most application layer headers at the beginning of a data exchange are easy to identify. (3) It allows the algorithm to process less data.
Therefore, sufficient information about the class characteristics of a message can be captured by considering the first 4-byte field (field 1), the second 4-byte field (field 2), the last 4-byte field (field 3), and the second-to-last 4-byte field (field 4) in each message, as shown in Figure 1.
Previous research [27][28][29][30] also indicates that a concrete signature usually exists in the first few packets of a connection. So, for each session, it is sufficient to consider the first 6 messages, as shown in Figure 2. If the corresponding feature does not exist, the special string "NULL" is used in that position.
Since RSCluster has no prior information about the data set, the exact number of clusters is unknown to our system. In order to determine an appropriate number of clusters, the MDL criterion is applied to choose the clustering model whose description length is minimal. The candidate models are the intermediate results of each pass of the clustering process. Figure 3 illustrates the description length for the three data sets. For example, in data set I, the minimum total description length is reached in the 21st iteration.
Figure 4 shows the overall accuracy as the number of iterations increases. As we see, the overall accuracy of RSCluster increases at first until it reaches an upper bound. The reason for this upper bound is that no cluster is selected by RSCluster for further splitting any longer, so the number of clusters no longer changes. Recall that RSCluster chooses the cluster whose cluster roughness is the largest as the candidate for further splitting. Thus, if the roughness of every cluster is 0 or the size of each candidate is smaller than the minimum cluster size, no cluster will be chosen for splitting. We also implement the EM and K-means algorithms to cluster the same data sets. We repeat the two clustering algorithms for 200 iterations. In the $k$th iteration, we set the expected number of clusters to $k$. As shown in Figure 4, the proposed RST-based clustering algorithm clearly outperforms the EM and K-means algorithms.
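The feature layout described above (2 transport layer features plus 4 fields for each of the first 6 messages, 26 in total) can be sketched in Python. The handling of short messages is an assumption: here a field is treated as missing when the message is too short to contain it, and the function names and the sample session are hypothetical.

```python
NULL = "NULL"          # placeholder for a feature that does not exist
MAX_MESSAGES = 6       # only the first 6 messages of a session are used

def message_fields(payload):
    """Return [field1, field2, field3, field4] for one message payload:
    first 4 bytes, second 4 bytes, last 4 bytes, second-to-last 4 bytes."""
    n = len(payload)
    return [
        payload[0:4] if n >= 4 else NULL,
        payload[4:8] if n >= 8 else NULL,
        payload[-4:] if n >= 4 else NULL,
        payload[-8:-4] if n >= 8 else NULL,
    ]

def session_features(protocol, port, messages):
    """Build the 26 categorical features of one session."""
    features = [protocol, port]
    for i in range(MAX_MESSAGES):
        if i < len(messages):
            features.extend(message_fields(messages[i]))
        else:
            features.extend([NULL] * 4)  # session has fewer than 6 messages
    return features

# Hypothetical session with one full request and one very short message.
features = session_features(
    "TCP", 80, [b"GET /index.html HTTP/1.1\r\n", b"HI"]
)
```

Keeping the fields as raw byte strings, rather than mapping them to numbers, matches the categorical treatment of attributes used throughout the paper.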

Message Grouping for Improving Protocol Reverse Engineering.
In the second phase, RSCluster is used to cluster application layer messages to improve the accuracy of protocol reverse engineering. The traffic was captured at the School of Information and Science Technology at Sun Yat-Sen University in April 2011. The traffic was classified into 4 classes of protocols (i.e., HTTP, POP, SMTP, and FTP) using the well-known network traffic analysis tool Wireshark. In this experiment, RSCluster is used to cluster the messages of each protocol to improve the results of reverse engineering. The features used in this experiment include the message direction (i.e., from the initiator to the responder or from the responder to the initiator) and the 4-byte fields in the messages, as shown in Figure 1. The parameters and procedures are the same as in Section 7.1. The grouped messages are taken as input to AutoReEngine [4] to extract protocol keywords. The protocol keywords extracted by AutoReEngine are shown in Table 3. The keywords in italic (e.g., POST, Origin:, Cache-Control:, ITYPE:, and OTYPE:) are those with low occurrence probability.

Conclusion
RST is a powerful mathematical tool for dealing with categorical data and uncertain information. We propose to apply an RST-based approach to cluster application layer network traffic and group protocol messages according to message types. The key of our approach is to consider multidimensional categorical attributes based on rough sets theory and to diminish the dissonance hidden in the data set. With the concepts introduced from rough sets theory, the dissonance hidden in the data set can be quantified by the notion of roughness. The proposed approach aims to minimize the total roughness in the data set by selecting the cluster with the largest cluster roughness for further splitting in each iteration of clustering. The proposed approach is also unsupervised, and the optimal number of clusters is determined by the Minimum Description Length principle. The experimental results show that our method can cluster the application layer payload with high accuracy and group the protocol messages effectively to improve the accuracy of protocol keyword extraction. Some protocol keywords with low occurrence probability can be found with the help of message grouping by our method. In future work, we will exploit the hierarchical data structure and semantic information in the traffic to further improve the accuracy of traffic clustering.

Figure 4: The overall accuracy of clustering. "RST" stands for the proposed RST-based algorithm, "KMs" stands for the K-means algorithm, and "EM" stands for the Expectation-Maximization algorithm.

Table 1: Data sets for evaluation.

Table 2: Traffic class breakdown for data sets.

Figure 2: Messages exchanged between two hosts. "$m_1$," "$m_2$," ... stand for messages exchanged between host A and host B. Only the first 6 messages are considered in the experiments.