Analyzing Network Protocols of Application Layer Using Hidden Semi-Markov Model

With the rapid development of the Internet, especially the mobile Internet, new applications and network attacks have emerged at a high rate in recent years. More and more traffic is unknown because no protocol specification is available for the newly emerging applications. Automatic protocol reverse engineering is a promising solution for understanding this unknown traffic and recovering its protocol specification. One challenge of protocol reverse engineering is to determine the length of protocol keywords and message fields. Existing algorithms select the longest substrings as protocol keywords, which is an empirical way to decide the keyword length. In this paper, we propose a novel approach that determines the optimal length of protocol keywords and recovers the message formats of Internet protocols by maximizing the likelihood probability of message segmentation and keyword selection. A hidden semi-Markov model is presented to model the protocol message format. A clustering technique based on the affinity propagation mechanism is introduced to determine the message type. The proposed method is applied to identify network traffic, and the results are compared with those of an existing algorithm.


Introduction
Network protocol specifications, which describe the structure of protocol messages and regulate the behaviors of communication entities on the Internet, play an important role in addressing a number of security and management issues in several domains of computing and networking. For example, intrusion detection systems and firewalls require protocol specifications to perform deep packet inspection. Security experts study the specifications of command and control (C&C) protocols [1] to detect and defend against botnets. Network administrators build application signatures from protocol specifications to identify protocols and tunnels in monitored network traffic. Fuzz tests [2] use protocol specifications to reduce the number of fault-inserted files while still maintaining the maximum test case coverage. Protocol specifications are also powerful tools for enabling interoperation between multiple systems based on incompatible protocols [3][4][5].
A complete specification comprises both the protocol message format and the protocol state machine. The former reveals the protocol syntax, which governs how the different types of messages exchanged between communication entities are constructed, while the latter describes the behaviors of protocol entities during the whole process of communication, such as the order in which different types of messages should be sent or received. For open protocols, like HTTP and FTP, the specifications can be obtained from the published documents. However, for proprietary protocols used by enterprises or hackers, the specifications are not published for commercial or security reasons. More and more new protocols and mobile applications emerge every day due to the rapid development of the mobile Internet and the unprecedented popularity of smart phones [6]; network administrators need to know the specifications of these protocols or applications to monitor the network traffic. However, there is no public documentation of their specifications. Over the past few years, researchers have come to regard protocol reverse engineering as the only available option for recovering the specification of proprietary protocols or newly emerging mobile applications.
Traditionally, protocol reverse engineering is performed by manual analysis, which is time-consuming and error-prone. For example, the Samba project has taken over 12 years to manually recover the specification of SMB/CIFS [3]. In the Pidgin project [4], the Pidgin plug-ins have to be patched whenever the target protocol changes, and the delay between a protocol change and a working patch, caused by reverse engineering, can be months. To address these problems, automatic protocol reverse engineering has been proposed over the last decade and has become a hot topic in network traffic analysis research.
Automatic protocol reverse engineering is a process of recovering protocol message formats and inferring the protocol state machine without access to the specification of the target protocol. Generally, automatic protocol reverse engineering can be divided into the network trace based approach and the binary analysis based approach. The network trace based approach takes a captured network trace as input and reconstructs message formats by identifying basic components, such as message fields or protocol keywords, using techniques introduced from the fields of data mining, bioinformatics, natural language processing, and so on. The binary analysis based approach operates by observing how the executable binary implementing the target protocol uses memory and registers at runtime to process received messages or construct sent messages. The former approach is easy to deploy and relies only on the network trace generated by the target protocol, while the latter approach is useful for scenarios where the executable binary is available and can be run in a controlled environment.
In this paper, we focus on recovering the message formats from a network trace using the network trace based approach. Our goal is to identify the location of message fields and determine the length of protocol keywords. The message format is composed of message fields. Some fields (called keyword fields) contain the protocol keywords. The protocol keywords are constants or commands used by the network protocol. For example, "GET", "HTTP", and "POST" are protocol keywords used by HTTP.
The first challenge in our research is to determine the length of protocol keywords. Previous works [7][8][9][10][11][12], which are based on the longest common subsequence (LCS) criterion, select the longest frequent substrings as protocol keywords. For example, if "G", "E", "T", "GE", "ET", and "GET" are frequent substrings, "GET" will be chosen as the protocol keyword, since it is the longest substring. However, if the frequency threshold is low enough, "GET abc" (where "abc" is a string that follows "GET") will become a frequent string, so "GET abc" will be chosen as the protocol keyword, while the true keyword "GET" will be dropped. Therefore, it is not rational to simply choose the longest frequent substrings as protocol keywords.
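The dependence of the longest-substring rule on the frequency threshold can be reproduced with a small sketch. The messages, the function names, and the choice to count a substring once per message are illustrative assumptions and are not taken from the cited LCS-based systems.

```python
from collections import Counter


def frequent_substrings(messages, min_count):
    """Map each substring to the number of messages containing it, keeping the frequent ones."""
    counts = Counter()
    for msg in messages:
        seen = {msg[i:j] for i in range(len(msg)) for j in range(i + 1, len(msg) + 1)}
        counts.update(seen)  # each message contributes at most 1 per substring
    return {s: c for s, c in counts.items() if c >= min_count}


def longest_frequent_prefix(messages, min_count):
    """Longest-substring rule (assumed form), restricted to strings that start a message."""
    frequent = frequent_substrings(messages, min_count)
    prefixes = [s for s in frequent if any(m.startswith(s) for m in messages)]
    return max(prefixes, key=lambda s: (len(s), s))


messages = [
    "GET /abc HTTP/1.1",
    "GET /abc HTTP/1.1",
    "GET /index.html HTTP/1.1",
    "GET /img/logo.png HTTP/1.1",
]
strict = longest_frequent_prefix(messages, min_count=4)  # must occur in all four messages
loose = longest_frequent_prefix(messages, min_count=2)   # must occur in any two messages
print(repr(strict), repr(loose))
```

With the strict threshold the rule returns "GET /", whereas the loose threshold makes the whole request line "GET /abc HTTP/1.1" the selected string, so the keyword boundary is decided by the threshold rather than by the protocol.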
The second challenge is to deal with binary protocols. It is easy to define and understand the protocol keywords that bound the message fields in text protocols, which restrict their content to printable ASCII characters. However, in binary protocols, fields are predefined by the protocol specification to represent specific meanings instead of using protocol keywords as preambles. Messages containing only fixed-length fields are not difficult to recover. However, the complexity increases dramatically when the fields are variable in length.
The third challenge is to determine the location relationship of message fields. The relationship of fields varies from sequence to juxtaposition. For example, in the request message of HTTP, the request method field "GET" and the HTTP version field "HTTP/1.1" are of sequential relation, which means that "GET" must occur somewhere before "HTTP/1.1" and the locations of the two fields cannot be exchanged, while some other fields, such as the "Host" field and the "Server" field, are of juxtapositional relation, which means that their locations can be exchanged with each other.
In this paper, we apply a probabilistic model, the hidden semi-Markov model (HsMM) [13], to address these challenges. On the one hand, the HsMM yields the optimal length of a protocol keyword as the one with maximal likelihood probability. A keyword length based on maximal likelihood probability is more reasonable and rigorous than the empirical decision of choosing the longest frequent substring. On the other hand, the HsMM is a probabilistic directed graph (lattice). Each node in the lattice represents a state that can emit various observations. The states in the same longitude are of sequential relation, while states in the same latitude are of juxtapositional relation. Therefore, it is natural to use the HsMM to model the sequential and juxtapositional relations of fields.
The organization of this paper is as follows. In Section 2, related work on protocol reverse engineering is reviewed. In Section 3, the concept and definition of the HsMM are briefly reviewed. In Section 4, the proposed method of modeling message formats using the HsMM is presented in detail. In Section 5, the system architecture is presented and some implementation issues are discussed. In Section 6, the proposed method is evaluated and the experimental results are shown. Finally, conclusions are drawn in Section 7.

Related Work
Over the past few years, automatic protocol reverse engineering has attracted tremendous interest from both the research community and industry. A number of works have been published on this hot topic. Beddoe [7] proposes to use algorithms widely used in bioinformatics, that is, sequence alignment and phylogeny construction algorithms, to determine the location and size of fields in each individual packet. Beddoe presents this effort in the protocol informatics project and implements the approach in Python to extract the longest common subsequence (LCS) as message fields with constant value. Kreibich and Crowcroft [8] introduce a novel variant of the Jacobson-Vo algorithm [14] to compute the LCSs of input strings and employ a flexible gap-minimising algorithm to improve the efficiency and effectiveness of network traffic alignment. The authors show that their method outperforms the commonly used Smith-Waterman approach on a wide range of network protocols. Both Beddoe [7] and Kreibich and Crowcroft [8] mine the commonalities of messages as the basic components of message formats based on LCS, while our approach infers the location and length of message fields based on the maximal likelihood probability.
Cui et al. present Discoverer [15], which recursively clusters and aligns the token patterns of messages to infer protocol message format idioms. Although Discoverer is able to recover the protocol message formats of three selected protocols, that is, HTTP, RPC, and SMB/CIFS, about 10% of the message formats cannot be correctly inferred due to inaccurate parsing. Discoverer first tokenizes the protocol messages and initially clusters messages according to the token patterns. Thus, the lengths of fields are artificially forced to be consistent with the size of tokens, and the boundaries of message fields in text protocols are restricted to separators (such as spaces) specified by the authors. Moreover, the relationship of fields in message formats inferred by Discoverer is sequential. In our approach, we make no assumption about separators and infer the optimal length of fields by maximizing the likelihood probability of message segmentation. Meanwhile, we capture the location relationship of fields, such as sequential and juxtapositional relations, by learning a probabilistic directed lattice graph.
Wang et al. [16] present a framework that infers message formats by improving the Aho-Corasick (AC) algorithm [17] to identify frequent sequences and mining the association rules among the frequent sequences. They evaluate the framework in a wireless environment and show that it can identify ARP and ICMP packets with high accuracy. However, their framework only searches for association rules among some frequent fields in protocol messages, while the aim of our scheme is to infer the whole message format by inferring all of the message fields.
Wang et al. propose Biprominer [18] to extract binary protocol message formats based on the statistical nature of message formats. First, Biprominer recursively learns and labels frequent patterns in the messages based on the frequency of blocks (each composed of several bytes). Then, the messages with labeled blocks are converted into a transition probability model. Antunes and Neves [19] build an automaton based on a sequence alignment algorithm to recover message formats from a network trace. They first extend the partial order alignment algorithm to generate an initial automaton from messages, then apply sequence alignment techniques to find the optimal alignment between the automaton and newly arriving messages, and finally use the alignment results to further extend the automaton. These studies focus on modeling the transition probability of message blocks or finding the acceptable paths of bytes in the automaton, while our work aims to identify message fields of variable length as well as to model the location relationship of fields.
Some works leverage the semantic analysis of message fields to infer message formats. Semantic analysis here means identifying the keyword sequences, each of which indicates a specific intention of the protocol message. Krueger et al. [20] present a semantics-aware tool for network payload analysis that automatically extracts semantics-aware components from a captured network trace. They map protocol messages to a vector space based on tokens or words and identify communication templates corresponding to the base directions in the vector space. Wang et al. propose ProDecoder [21] to reconstruct message formats with a semantics-aware approach. ProDecoder first identifies keywords using Latent Dirichlet Allocation (LDA) models taken from natural language processing. Protocol messages are then clustered according to their semantics (different combinations of keywords) using the Information Bottleneck clustering algorithm. Finally, the messages in each cluster are aligned to find their common parts using well-known sequence alignment algorithms. These methods aim to reveal the semantic characteristics of protocol messages under specific communication motivations, so the message formats are expected to be affected by user intentions. In contrast, our method captures the general structure of the messages of the target protocol.
As an alternative approach to understanding unknown or proprietary protocols, binary analysis based techniques also draw much research attention in the field of network security. For example, Polyglot [22], Tupni [23], AutoFormat [24], Prospex [25], and Dispatcher [26] are all systems based on binary analysis techniques. They are applicable in scenarios where the binary is available and can be run in a controlled environment. In addition, binary analysis techniques cannot work when the binary clients apply interference techniques, such as obfuscation, to protect themselves from being detected and reverse-engineered. In this paper, we restrict our research to the scenario in which only the network trace of the target protocol is available. Hence, we do not discuss these binary analysis based techniques but focus on the approaches based on network traces.

Hidden Semi-Markov Models
A hidden semi-Markov model (HsMM) as shown in Figure 1 is an extension of hidden Markov model (HMM) by allowing the underlying process to be a semi-Markov chain with a variable duration time for each state [13,27].
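[Figure 1: A hidden semi-Markov model, showing the observable process and the underlying process along the time axis.]

The basic elements of an HsMM include the hidden state set $\mathcal{S} = \{1, 2, \dots, M\}$, the state duration set $\mathcal{D} = \{1, 2, \dots, D\}$, and the observation set $\mathcal{V}$. The hidden state of the underlying process at time $t$ is denoted as $S_t \in \mathcal{S}$. The symbols $i$ and $j$ are used to represent substantive values of the state variable $S_t$. For simplicity of notation, we denote the following:

(i) $S_{t_1:t_2} = i$: the process stays in state $i$ from $t_1$ to $t_2$; however, the previous state $S_{t_1-1}$ and the next state $S_{t_2+1}$ may or may not be $i$.
(ii) $S_{t_1:t_2]} = i$: state $i$ lasts from $t_1$ to $t_2$ and ends at $t_2$; however, the previous state $S_{t_1-1}$ may or may not be $i$.
(iii) $S_{[t_1:t_2} = i$: state $i$ starts at $t_1$ and lasts until $t_2$; however, the next state $S_{t_2+1}$ may or may not be $i$.

As shown in Figure 1, the observation sequence $o_{1:T}$ is the observable process, while the state sequence $S_{1:T}$ and the state transitions $(i_n, d_n) \to (i_{n+1}, d_{n+1})$, $n = 1, 2, \dots, N-1$, form the underlying process, which cannot be observed. For each pair $(i_n, d_n)$ in the underlying process, $d_n$ is the time duration of state $i_n$.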
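Formally, an HsMM can be represented by $\lambda = (A, B, P, \pi)$, where $A$ is the state transition probability matrix, $B$ is the emission probability matrix, $P$ is the distribution of state durations, and $\pi$ is the initial distribution of states. The state transition probability matrix is defined as $A = [a_{ij}]_{M \times M}$, where

$$a_{ij} = P[S_{[t+1} = j \mid S_{t]} = i]$$

and the self-transition probabilities are zero, $a_{ii} = 0$, for all $i, j \in \mathcal{S}$.

The emission probability matrix $B$ is defined as $B = [b_j(v)]$, where

$$b_j(v) = P[o_t = v \mid S_t = j], \quad v \in \mathcal{V}.$$

The distribution of the state duration is $P = [p_j(d)]$, where $p_j(d)$ is the probability that state $j$, once entered, lasts for exactly $d \in \mathcal{D}$ time units. The initial distribution of states, $\pi = [\pi_j]$, indicates the probability of the initial state before time $t = 1$.

[Figure 2: An example HTTP message split into keyword fields (such as "GET", "HTTP/1.1", "Host:", and "Server:") and data fields.]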
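The four components of the model can be held in a small container that checks the constraints stated above (zero self-transitions, normalized distributions). This is a minimal sketch: the class name `HsMM`, the field layout, and the toy values are assumptions made for illustration, not the data structures of our implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HsMM:
    A: List[List[float]]         # A[i][j]: probability of moving from state i to state j
    B: List[Dict[int, float]]    # B[j][v]: probability that state j emits byte value v
    P: List[List[float]]         # P[j][d - 1]: probability that state j lasts d observations
    pi: List[float]              # pi[j]: initial probability of state j

    def validate(self, tol: float = 1e-9) -> None:
        m = len(self.pi)
        if not (len(self.A) == len(self.B) == len(self.P) == m):
            raise ValueError("A, B, P, and pi must describe the same number of states")
        for i in range(m):
            if abs(self.A[i][i]) > tol:
                raise ValueError(f"state {i} has a nonzero self-transition")
            rows = {"A": self.A[i], "B": list(self.B[i].values()), "P": self.P[i]}
            for name, row in rows.items():
                if abs(sum(row) - 1.0) > tol:
                    raise ValueError(f"{name}[{i}] does not sum to 1")
        if abs(sum(self.pi) - 1.0) > tol:
            raise ValueError("pi does not sum to 1")


# Toy model: state 0 emits the bytes of "GET" (a keyword), state 1 emits digits (data).
toy = HsMM(
    A=[[0.0, 1.0], [1.0, 0.0]],
    B=[{b: 1 / 3 for b in b"GET"}, {b: 0.1 for b in b"0123456789"}],
    P=[[0.0, 0.0, 1.0], [0.5, 0.5]],
    pi=[1.0, 0.0],
)
toy.validate()
```

Storing each emission row as a dictionary keeps the toy model readable; a dense 256-column matrix would serve equally well for byte observations.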

Modeling Network Protocol Using HsMM.
A network protocol is a set of rules for regulating the exchange of messages in the Internet. The specification of a network protocol describes the strict syntactical format of valid messages and defines the strict procedure rules of data exchange. The alphabet of valid messages is the set of all possible values of a single byte; that is, $\Sigma = \{0, 1, \dots, 255\}$. A string $s$ over $\Sigma$ is defined as a finite sequence of letters in $\Sigma$; that is, $s = c_1, c_2, \dots, c_n$ ($c_1, c_2, \dots, c_n \in \Sigma$). The set of all finite strings over the alphabet $\Sigma$ is represented as $\Sigma^*$.
The protocol message, denoted as $m$, is defined as the basic data unit exchanged between different communicating entities of an application-layer protocol. A message consists of a set of message fields, including keyword fields and data fields, as shown in Figure 2. The message fields, denoted as $f$, are strings over $\Sigma$; that is, $f \in \Sigma^*$.
The valid messages exchanged by communicating entities are constructed according to the protocol message format. The relationship of field locations in the message format varies from sequential to juxtapositional. For example, according to the HTTP specification, message fields $f_1$, $f_2$, and $f_3$ in Figure 2 are of sequential relation; that is, $f_2$ must come after $f_1$ and before $f_3$. However, the relation of fields $f_5$ and $f_7$ is juxtapositional, which means that the locations of $f_5$ and $f_7$ can be exchanged with each other.
In order to model the message format using an HsMM, a protocol message is treated as an observation sequence representing the observable process. Each field is a block of observations associated with a specific hidden state, with the length of the field as the corresponding state duration. For example, in Figure 3, $f_1$ is the block of observations from $t = 1$ to $3$ associated with state $i_1$ and duration $d_1 = 3$. In this model, the emission probability matrix $B$ implies the relationship between observations and hidden states, while the state transition probability matrix $A$ implies the relationship of field locations. Let $o$ be an observation sequence and let $\Omega$ be the set of frequent strings that occur in $o$. Given $l', l \in \Omega$, we write $l' \subset l$ if $l'$ is a substring of $l$. The string $l$ is closed in $\Omega$ if there does not exist a string $l' \in \Omega$ such that $l \subset l'$. The set of closed frequent strings in $\Omega$ is denoted as $\mathcal{L}$.
Each closed string in $\mathcal{L}$ is associated with a different hidden state; thus, the number of hidden states for closed strings in $\mathcal{L}$ is $K = \|\mathcal{L}\|$. Suppose that $l_k \in \mathcal{L}$ is associated with state $k$; then all characters in $l_k$ are observations of state $k$.

Parameters Reestimation.
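In this section, we discuss an iterative procedure for reestimating the parameters of $\lambda = (A, B, P, \pi)$, based on the Baum-Welch method [28]. At the beginning, a random initialization of $A$ and $B$ is selected, while $P$ and $\pi$ are initialized separately for each $i \in \mathcal{S}$ and $d \in \mathcal{D}$.

In the forward-backward procedure, the forward variable is defined as

$$\alpha_t(j, d) = P[o_{1:t}, (S_t, \tau_t) = (j, d) \mid \lambda],$$

where $\tau_t$ is the remaining time of the current state $S_t$. Initially, $\alpha_1(j, d) = \pi_j\, b_j(o_1)\, p_j(d)$.

The inductive solution for $\alpha_t(j, d)$ when $1 < t \le T$ is given as follows:

$$\alpha_t(j, d) = \alpha_{t-1}(j, d+1)\, b_j(o_t) + \Big(\sum_{i \ne j} \alpha_{t-1}(i, 1)\, a_{ij}\Big) b_j(o_t)\, p_j(d). \tag{12}$$

We define the probability that the state $i$ ends at time $t$, while the state $j$ starts at time $t+1$, given the entire observation sequence $o_{1:T}$, as follows:

$$\xi_t(i, j) = P[S_{t]} = i, S_{[t+1} = j \mid o_{1:T}, \lambda].$$

The probability that the state $i$ ends at time $t$ with its duration being $d$, given the entire observation sequence $o_{1:T}$, is defined as

$$\eta_t(i, d) = P[S_{[t-d+1:t]} = i \mid o_{1:T}, \lambda],$$

where $S_{[t-d+1:t]} = i$ means that state $i$ starts at time $t-d+1$ and ends at time $t$. The probability that the state at time $t$ is $i$, given the entire observation sequence $o_{1:T}$, is defined as

$$\gamma_t(i) = P[S_t = i \mid o_{1:T}, \lambda].$$

In order to solve for $\gamma_t(i)$, we consider the following identities:

$$P[S_t = i, S_{t+1} = i \mid o_{1:T}, \lambda] = \gamma_t(i) - \sum_{j \ne i} \xi_t(i, j) = \gamma_{t+1}(i) - \sum_{j \ne i} \xi_t(j, i).$$

Thus, we have a recursive formula for $\gamma_t(i)$ as follows:

$$\gamma_t(i) = \gamma_{t+1}(i) + \sum_{j \ne i} \big[\xi_t(i, j) - \xi_t(j, i)\big].$$

In the phase of recursively computing $\gamma_t(i)$, the initial condition is given as follows:

$$\gamma_T(i) = \frac{\sum_{d \in \mathcal{D}} \alpha_T(i, d)}{\sum_{j \in \mathcal{S}} \sum_{d \in \mathcal{D}} \alpha_T(j, d)}.$$

With these notations, the parameters of $\lambda$ can be updated and improved from $\xi_t(i, j)$, $\eta_t(i, d)$, and $\gamma_t(i)$. Note that $I(\text{expression}) = 1$ if the expression is true and $I(\text{expression}) = 0$ otherwise.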
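For reference, the update step can be written in the usual Baum-Welch form of normalized expected counts. Here $\xi_t(i,j)$ denotes the posterior probability that state $i$ ends at time $t$ and state $j$ starts at time $t+1$, $\eta_t(i,d)$ the posterior probability that state $i$ ends at time $t$ after lasting $d$ time units, and $\gamma_t(i)$ the posterior probability of being in state $i$ at time $t$. The forms below are the standard ones for an explicit-duration HsMM and are given as an assumed sketch; in particular, the estimate $\hat{\pi}_i = \gamma_1(i)$ assumes that $\pi$ is used as the distribution of the first state.

```latex
\begin{aligned}
\hat{a}_{ij} &= \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{k \ne i} \sum_{t=1}^{T-1} \xi_t(i,k)}, &
\hat{b}_i(v) &= \frac{\sum_{t=1}^{T} \gamma_t(i)\, I(o_t = v)}{\sum_{t=1}^{T} \gamma_t(i)}, \\
\hat{p}_i(d) &= \frac{\sum_{t=d}^{T} \eta_t(i,d)}{\sum_{d' \in \mathcal{D}} \sum_{t=d'}^{T} \eta_t(i,d')}, &
\hat{\pi}_i &= \gamma_1(i).
\end{aligned}
```

Each estimate is a ratio of expected counts, so every row of $\hat{A}$, $\hat{B}$, and $\hat{P}$ sums to one by construction, and $\hat{a}_{ii}$ stays zero because $\xi_t(i,i)$ never enters the numerator.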

Inferring Protocol Keywords.
Given the reestimated HsMM $\hat{\lambda} = (\hat{A}, \hat{B}, \hat{P}, \hat{\pi})$ and an observation sequence $o_{1:T}$, the forward and backward variables can be computed with the forward-backward algorithm. Then, the variable $\eta_t(j, d)$ can be computed using (16). In what follows, we infer the state sequence with maximal likelihood probability based on the Viterbi algorithm [29].

The iteration proceeds until $d_1 + d_2 + \dots + d_n = T$. Thus, the observation sequence is divided into a sequence of fields, the $n$th field $f_n$ being the block of $d_n$ observations assigned to state $i_n$; $i_n$ is referred to as the state of $f_n$. Let $K$ be the number of states associated with closed frequent strings and $M$ the total number of states. If $1 \le i_n \le K$, $f_n$ is a protocol keyword and the corresponding field is a keyword field. If $K < i_n \le M$, then $f_n$ is a data string and the corresponding field is a data field.
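Segmenting a message into fields can be illustrated with a standard segmental Viterbi recursion in the logarithmic domain. This is a minimal sketch under stated assumptions: it is the textbook explicit-duration Viterbi algorithm rather than the exact inference procedure of this section, it requires at least two states, and the function names and the toy model are hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(x):
    return math.log(x) if x > 0 else NEG_INF


def segment(obs, A, B, P, pi):
    """Most likely split of obs into fields, returned as (state, start, end) with end exclusive."""
    T, M = len(obs), len(pi)
    # best[t][j]: best log-probability of obs[:t] given that a field of state j ends at t.
    best = [[NEG_INF] * M for _ in range(T + 1)]
    back = [[None] * M for _ in range(T + 1)]
    for t in range(1, T + 1):
        for j in range(M):
            emit = 0.0
            for d in range(1, min(len(P[j]), t) + 1):
                start = t - d
                emit += safe_log(B[j].get(obs[start], 0.0))  # field grows one symbol leftwards
                field = emit + safe_log(P[j][d - 1])
                if start == 0:
                    score, prev = safe_log(pi[j]) + field, None
                else:
                    prev = max((i for i in range(M) if i != j),
                               key=lambda i: best[start][i] + safe_log(A[i][j]))
                    score = best[start][prev] + safe_log(A[prev][j]) + field
                if score > best[t][j]:
                    best[t][j], back[t][j] = score, (start, prev)
    j = max(range(M), key=lambda s: best[T][s])
    if best[T][j] == NEG_INF:
        raise ValueError("no segmentation has nonzero probability")
    fields, t = [], T
    while t > 0:  # follow the back-pointers from the last field to the first
        start, prev = back[t][j]
        fields.append((j, start, t))
        t, j = start, prev
    return fields[::-1]


# Toy model: state 0 is a keyword state for "GET", state 1 is a data state.
A = [[0.0, 1.0], [1.0, 0.0]]
B = [{c: 1 / 3 for c in "GET"}, {c: 0.25 for c in "/abc"}]
P = [[0.0, 0.0, 1.0], [0.25, 0.25, 0.25, 0.25]]
pi = [0.5, 0.5]
obs = "GET/abGET/c"
fields = segment(obs, A, B, P, pi)
print([(state, obs[s:e]) for state, s, e in fields])
```

On the toy sequence the recursion returns the alternating fields "GET", "/ab", "GET", "/c", with the keyword state assigned to each "GET".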

Inferring Message Type.
In this section, we present an algorithm to determine the type of protocol messages.The messages which belong to the same type have similar formats with each other.Thus, the type of protocol messages can be determined using clustering method according to the similarities between their message formats.
In this paper, we apply an unsupervised clustering algorithm proposed by Frey and Dueck [30] to solve the problem. The algorithm, based on the affinity propagation mechanism, takes the similarity matrix of data points as input and iteratively selects a representative exemplar for each point. Each of the selected exemplars represents a data type, while the type of every other data point is determined by the exemplar it selects. The number of clusters need not be specified beforehand. The similarity metric need not be defined strictly in a continuous space and does not have to satisfy symmetry or the triangle inequality. Therefore, we can define the similarity in any reasonable way.
Before discussing the message clustering algorithm further, we define some basic notation. Suppose the universal set of protocol keywords is denoted as $\mathcal{W}$ and the set of protocol keywords that occur in message $m_i$ is denoted as $W_i$. Given a protocol keyword $w$ of message $m_i$, we consider the cost of encoding $w$ using the keyword set $W_j$ of message $m_j$ as the code book. The similarity $s(i, j)$ of $m_i$ to $m_j$ is defined as the negative sum of the costs of encoding all keywords in $W_i$ using $W_j$ as the code book.

The affinity propagation algorithm exchanges two kinds of information between data points during the clustering process: responsibility ($r(i, k)$) and availability ($a(i, k)$). The "responsibility" $r(i, k)$, sent from an ordinary data point $i$ to the candidate exemplar point $k$, reflects the accumulated evidence for how well-suited point $k$ is to serve as the exemplar for point $i$, taking into account other potential exemplars for point $i$. The "availability" $a(i, k)$, sent from candidate exemplar point $k$ to point $i$, reflects the accumulated evidence for how appropriate it would be for point $i$ to choose point $k$ as its exemplar, taking into account the support from other points that point $k$ should be an exemplar.

In this paper, we treat each message as a data point, and the responsibility and availability are updated according to the following equations:

$$r(i, k) \leftarrow s(i, k) - \max_{k' \ne k} \{a(i, k') + s(i, k')\},$$

$$a(i, k) \leftarrow \min\Big\{0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0, r(i', k)\}\Big\}.$$

Specially, $a(k, k)$ is updated by

$$a(k, k) \leftarrow \sum_{i' \ne k} \max\{0, r(i', k)\}.$$

The affinity propagation algorithm clusters messages into subclusters, each of which represents a type of message. The results of message type inference are important for constructing the protocol state machine, which will be discussed in our future work.
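The message-passing updates can be sketched in a few lines of plain Python. Several choices here are assumptions made only for illustration: the encoding cost (logarithm of the code book size on a match, logarithm of the dictionary size otherwise, modeled on the text-clustering example of Frey and Dueck), the damping factor, the fixed number of iterations, and the use of the median similarity as the preference $s(k, k)$.

```python
import math
from statistics import median


def similarity(w_i, w_j, dictionary_size):
    """Negative cost of encoding the keywords of message i with message j's keywords as code book."""
    cost = 0.0
    for w in w_i:
        # Assumed cost: index into W_j on a match, index into the whole dictionary otherwise.
        cost += math.log(len(w_j)) if w in w_j else math.log(dictionary_size)
    return -cost


def affinity_propagation(S, damping=0.5, iterations=200):
    """Return, for every point, the index of the exemplar it selects."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        for i in range(n):
            for k in range(n):
                rival = max(A[i][kk] + S[i][kk] for kk in range(n) if kk != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            support = sum(positive) - positive[k]  # sum over i' != k
            for i in range(n):
                new = support if i == k else min(0.0, R[k][k] + support - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]


keyword_sets = [
    {"GET", "HTTP/1.1", "Host:"},
    {"GET", "HTTP/1.1", "Host:", "User-Agent:"},
    {"HTTP/1.1", "200", "Server:"},
    {"HTTP/1.1", "200", "Server:", "Content-Length:"},
]
dictionary = set().union(*keyword_sets)
n = len(keyword_sets)
S = [[similarity(keyword_sets[i], keyword_sets[j], len(dictionary)) for j in range(n)]
     for i in range(n)]
preference = median(S[i][j] for i in range(n) for j in range(n) if i != j)
for k in range(n):
    S[k][k] = preference  # shared preference: no message is favored as exemplar a priori
exemplars = affinity_propagation(S)
print(exemplars)
```

Messages that select the same exemplar are assigned to the same message type. The similarity is asymmetric by construction, which affinity propagation permits.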

System Implementation
In this section, we will illustrate an overview of our system architecture and discuss some implementation issues which have to be addressed when one implements the proposed approach.

System Overview.
An overview of our system architecture is shown in Figure 4. The training data set is raw traffic captured from the real world using a well-known network traffic analysis tool called tshark [31].
Since well-known protocols, such as HTTP, are well studied and described in public documents, almost all popular network traffic analyzers recognize these protocols, so the ground truth for well-known protocols is easy to obtain. As a result, we use some well-known protocols to validate and evaluate our approach in this paper and assume that the training data set is generated by only one protocol.
In the session reconstruction phase, we reconstruct the sessions according to the 5-tuple, that is, transport protocol, source port number, destination port number, source IP address, and destination IP address. For a TCP-based protocol, a session starts at the packet with the SYN flag set in the TCP header and finishes when the FIN flag is acknowledged. For a UDP-based protocol, a session is defined as the packets that share the same 5-tuple.
In the message reassembling phase, messages of TCP-based protocols are reassembled from packets according to the TCP sequence number and acknowledgement number, while the messages of UDP-based protocols are reassembled according to the arrival time stamp and the transmission direction of the packets.
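The two phases above can be sketched as a grouping step followed by an ordering step. The `Packet` record and the function names are hypothetical, and the sketch makes simplifying assumptions: SYN/FIN tracking, sequence number wraparound, and overlapping segments are ignored, and for UDP the arrival index would be passed in place of the sequence number.

```python
from collections import defaultdict, namedtuple

Packet = namedtuple("Packet", "proto src_ip src_port dst_ip dst_port seq payload")


def session_key(pkt):
    """Direction-independent 5-tuple, so both directions of a conversation share one key."""
    ends = sorted([(pkt.src_ip, pkt.src_port), (pkt.dst_ip, pkt.dst_port)])
    return (pkt.proto, ends[0], ends[1])


def reassemble(packets):
    """Return {session key: {sender endpoint: byte stream}} ordered by sequence number."""
    segments = defaultdict(lambda: defaultdict(dict))
    for pkt in packets:
        sender = (pkt.src_ip, pkt.src_port)
        segments[session_key(pkt)][sender][pkt.seq] = pkt.payload  # retransmissions overwrite
    return {
        key: {sender: b"".join(segs[s] for s in sorted(segs)) for sender, segs in senders.items()}
        for key, senders in segments.items()
    }


packets = [
    Packet("TCP", "10.0.0.2", 40000, "10.0.0.1", 80, 9, b"TP/1.1\r\n"),   # arrives out of order
    Packet("TCP", "10.0.0.2", 40000, "10.0.0.1", 80, 1, b"GET / HT"),
    Packet("TCP", "10.0.0.2", 40000, "10.0.0.1", 80, 1, b"GET / HT"),     # retransmission
    Packet("TCP", "10.0.0.1", 80, "10.0.0.2", 40000, 1, b"HTTP/1.1 200 OK\r\n"),
]
streams = reassemble(packets)
```

Keeping the two directions of a session separate preserves the transmission direction, which is needed later to tell requests from responses.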
In the HsMM modeling step, an algorithm based on the Baum-Welch method is performed to reestimate the parameters of the HsMM-based protocol model.The reestimated HsMM model produced by this step implies the message format.
In the message segmentation phase, the reestimated HsMM model is applied to determine the optimal length of protocol keywords and divide message into field sequence.
In the step of message type inference, protocol messages are clustered using the affinity propagation mechanism and each cluster represents a type of messages.

Extracting Closed Frequent Strings.
Suppose that $\mathcal{L}$ is a frequent string set. If $l \in \mathcal{L}$ and there does not exist $l' \in \mathcal{L}$ such that $l$ is a substring of $l'$, then $l$ is a closed frequent string in $\mathcal{L}$. In this section, the Apriori algorithm [32], widely used in the data mining field, is introduced and modified to address the problem of mining closed frequent strings, as shown in Algorithm 1.
The frequent string candidate set $C_k$ is initialized as $C_1 = \Sigma = \{0, 1, \dots, 255\}$, each element of which represents a one-byte character (line (4)). Note that the length of each element in $C_k$ is $k$. The frequencies of the elements in $C_k$ are checked, and those whose frequencies are less than the frequency threshold $\Gamma$ are deleted from $C_k$ (lines (6)∼(12)). The candidates of frequent strings with length $k+1$ are generated in lines (14)∼(20), where the notation $l[1:k]$ represents the byte sequence from the first byte to the $k$th byte of $l$. If $l_1, l_2 \in C_k$ and the first $k-1$ characters of $l_1$ are equal to the last $k-1$ characters of $l_2$, then the two strings can be combined into a candidate $c \in C_{k+1}$ by merging their overlap; that is, $c[1:k] = l_2[1:k]$ and $c[k+1] = l_1[k]$. Lines (24)∼(38) find the closed frequent strings by deleting a string from $C_k$ if and only if it is a substring of some element of $C_{k+1}$.
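The level-wise procedure described above can be sketched as follows. The sketch is not a transcription of Algorithm 1: the function names are hypothetical, and the frequency of a string is assumed to be the number of messages that contain it.

```python
def count_occurrences(messages, s):
    """Number of messages that contain the byte string s (assumed frequency measure)."""
    return sum(1 for m in messages if s in m)


def closed_frequent_strings(messages, threshold):
    # Level 1: every single byte that is frequent enough.
    alphabet = {bytes([b]) for m in messages for b in m}
    level = {s for s in alphabet if count_occurrences(messages, s) >= threshold}
    closed = set()
    while level:
        k = len(next(iter(level)))
        # Join step: l2 + last byte of l1 whenever l1[:k-1] equals l2[1:].
        by_prefix = {}
        for l1 in level:
            by_prefix.setdefault(l1[:k - 1], []).append(l1)
        candidates = set()
        for l2 in level:
            for l1 in by_prefix.get(l2[1:], []):
                candidates.add(l2 + l1[-1:])
        next_level = {c for c in candidates if count_occurrences(messages, c) >= threshold}
        # Closure step: keep the strings of length k that no frequent string of length k+1 contains.
        closed |= {s for s in level if not any(s in t for t in next_level)}
        level = next_level
    return closed


messages = [b"GET /a", b"GET /b", b"GET /c", b"PUT /a"]
print(closed_frequent_strings(messages, threshold=3))
```

Checking only the next level is sufficient in the closure step, because every substring of a frequent string is itself frequent under this frequency measure. On the toy trace the only closed frequent string is `b"GET /"`; the shorter string `b"T /"`, although it occurs in all four messages, is absorbed because it is a substring of a longer frequent string.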

Underflow Problem.
The joint probabilities of the observation sequence often decay exponentially as the sequence length increases, which leads to a severe underflow problem when the forward-backward algorithm is implemented on a real computer. To the best of our knowledge, there are three approaches to solving this problem. Firstly, one can implement the forward-backward algorithm in the logarithmic domain to avoid the underflow problem [33].
Secondly, one can refine the forward-backward algorithm based on the notion of posterior probabilities to make the HsMM robust against the underflow problem. The refined forward-backward algorithm replaces the joint probabilities with conditional ones and automatically avoids the underflow problem without increasing the complexity. More information about the posterior probabilities and the refined HsMM based on conditional probabilities can be found in the work by Yu [13].
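Thirdly, the forward-backward probabilities are adjusted by multiplying by a scaling factor whenever an underflow is likely to occur [27,34,35]. In this paper, we tackle the underflow problem of the HsMM with this scaling method. At each time $t$, we first compute $\alpha_t(j, d)$ by the procedure of (12), using the scaled values of the previous step, and then compute the scaling factor at time $t$, denoted as $c_t$, as follows:

$$c_t = \frac{1}{\sum_{j=1}^{M} \sum_{d \in \mathcal{D}} \alpha_t(j, d)},$$

where $M$ is the number of states in the HsMM.

For the $\beta_t(j, d)$ term in the backward algorithm, we use the same scaling factor at each time $t$ as we used for $\alpha_t(j, d)$ in the forward algorithm; that is,

$$\hat{\beta}_t(j, d) = c_t\, \beta_t(j, d).$$

As stated by Rabiner [27], the scaling factors do not affect the transition matrix $A$, the initial state probability distribution $\pi$, or the observation matrix $B$. However, the procedure for computing $P(o_{1:T} \mid \hat{\lambda})$ is changed as follows:

$$P(o_{1:T} \mid \hat{\lambda}) = \frac{1}{\prod_{t=1}^{T} c_t}.$$

In order to avoid the underflow problem, we prefer to calculate the logarithmic form of $P(o_{1:T} \mid \hat{\lambda})$:

$$\log P(o_{1:T} \mid \hat{\lambda}) = -\sum_{t=1}^{T} \log c_t.$$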
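The effect of scaling on the forward pass can be checked numerically. The following is a minimal sketch, assuming the forward variable is indexed by the state and its remaining duration and that every state shares the same maximum duration; the function name and the toy parameters are hypothetical.

```python
import math


def forward_log_likelihood(obs, A, B, P, pi, scale=True):
    """Log-likelihood of obs under an explicit-duration HsMM, with optional per-step scaling."""
    M, D = len(pi), len(P[0])
    # alpha[j][d - 1]: forward variable for state j with remaining time d.
    alpha = [[pi[j] * B[j][obs[0]] * P[j][d] for d in range(D)] for j in range(M)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            prev, alpha = alpha, []
            for j in range(M):
                # Probability mass entering state j from the states that end at t - 1.
                enter = sum(prev[i][0] * A[i][j] for i in range(M) if i != j)
                row = []
                for d in range(D):
                    stay = prev[j][d + 1] if d + 1 < D else 0.0
                    row.append((stay + enter * P[j][d]) * B[j][obs[t]])
                alpha.append(row)
        if scale:
            norm = sum(map(sum, alpha))          # equals 1 / c_t
            alpha = [[a / norm for a in row] for row in alpha]
            log_lik += math.log(norm)            # accumulates -log c_t
    if scale:
        return log_lik
    total = sum(map(sum, alpha))
    return math.log(total) if total > 0 else float("-inf")


A = [[0.0, 1.0], [1.0, 0.0]]
B = [[0.9, 0.1], [0.2, 0.8]]      # B[j][v] for the two symbols 0 and 1
P = [[0.5, 0.5], [0.3, 0.7]]      # durations 1 and 2
pi = [0.6, 0.4]

short, long_seq = [0, 0, 1, 1, 0], [0, 1] * 5000
print(forward_log_likelihood(short, A, B, P, pi),
      forward_log_likelihood(short, A, B, P, pi, scale=False))
print(forward_log_likelihood(long_seq, A, B, P, pi),
      forward_log_likelihood(long_seq, A, B, P, pi, scale=False))
```

On the short sequence the scaled and unscaled computations agree, while on the sequence of 10,000 observations the unscaled forward variables underflow to zero and only the scaled version returns a finite log-likelihood.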

Evaluation
In this section, we evaluate the proposed approach on data sets captured at the Internet gateway of our department on 23 December 2013. The data set contains network traces generated by six protocols, including two text-based protocols (HTTP and SSDP) and four binary protocols (BitTorrent, QQ, DNS, and NetBIOS).
Existing algorithms, namely PI (protocol informatics) and Discoverer, are also applied to the same data set. The PI project has released open source Python code on the project home page [7], so we run this code on the data set. The Discoverer system is implemented according to the work presented by Cui et al., and the parameters are set as reported in their work [15].

Protocol Keyword Extraction.
Since there is no information about protocol keywords of binary protocols in published protocol specifications, we only evaluate protocol keyword extraction for text-based protocols (i.e., HTTP and SSDP) in this section. We use the metrics of recall and precision to evaluate the quality of keyword extraction. These metrics are defined as follows:
(i) Recall: the recall rate is defined as the ratio of the number of true positives among the inferred keywords to the total number of keywords in the data set.
(ii) Precision: the precision rate is defined as the ratio of the number of true positives among the inferred keywords to the total number of inferred keywords.
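The two metrics can be computed directly from the set of inferred keywords and the set of true keywords. The keyword sets below are made up for illustration and are not taken from the experiments.

```python
def precision_recall(inferred, truth):
    """Return (precision, recall) of the inferred keyword set against the true keyword set."""
    inferred, truth = set(inferred), set(truth)
    true_positives = len(inferred & truth)
    precision = true_positives / len(inferred) if inferred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


inferred = {"GET", "POST", "Host:", "abc"}                       # hypothetical output
truth = {"GET", "POST", "Host:", "HTTP/1.1", "Server:"}          # hypothetical ground truth
precision, recall = precision_recall(inferred, truth)
print(precision, recall)
```

Here three of the four inferred strings are true keywords (precision 0.75), and three of the five true keywords are found (recall 0.6).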
We randomly select 100 connections of each protocol and only consider the first 1460 bytes of each message, which is long enough to contain the headers of protocol messages. The results of protocol keyword extraction are shown in Table 1, where "Discv" represents the Discoverer system and "PI" represents the PI project. The column "true keyword" records the true number of protocol keywords that occur in the trace, while the column "inferred keyword" records the number of inferred keywords. Compared with Discoverer and the PI project, the HsMM-based method has more true positives and higher precision and recall rates. We found that Discoverer infers too many keywords, while the PI project infers too few.
Actually, our approach infers far more protocol keywords than the true keywords. Most of the additional strings are frequent and indispensable in the protocol messages, such as frequently used parameters. Therefore, all of these strings are also treated as protocol keywords, and they play an important role in inferring message formats and analyzing the protocol state machine.
We also found that the proposed HsMM-based approach can extract not only frequent keywords but also some keywords whose occurrence frequency is low.

Protocol Message Format Inference.
We illustrate the results produced by PI in Figure 5. The message formats are inferred as the longest common substrings of protocol messages. As shown in Figure 5, only a few protocol keywords (such as "GET") and fields are inferred by PI, so PI does not appear to be effective at generating message formats.
As shown in Tables 2-4, we present the results for the HTTP protocol from Discoverer, PI, and HsMM in a similar form to make the comparison clearer for readers. Discoverer infers the message format based on the token sequence and determines the attribute of each token, such as constant token or variable token. Far more protocol keywords (such as "HTTP/1.1" and "Host:") are inferred by Discoverer than by PI. However, some frequent strings (e.g., "ocspd" and "x86_64") which are not protocol keywords are also inferred as keywords. In this paper, we also implement the Markov model described in [36] and compare its results with ours, as shown in Figure 9. The results show that the proposed method outperforms the Markov based method in traffic identification.

Conclusion
The protocol keywords and message fields are inferred with a hidden semi-Markov model by maximizing the likelihood probability of message segmentation. The segmentation of messages reveals some semantic information about the fields, such as keyword, IP address, and K-V pair.
The proposed technique is applied to network traffic identification, where it outperforms the existing algorithm.
The proposed HsMM-based protocol message format can also be applied to intrusion detection or anomaly detection. One can use the HsMM-based message format of normal traffic to calculate the average likelihood probability of newly arriving traffic and check whether it deviates from the normal level. Our method is also applicable to traffic identification, fuzz testing, vulnerability discovery, and so on.
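The likelihood-based check mentioned above can be sketched as a simple threshold test. The per-byte normalization, the three-standard-deviation rule, and the numbers below are assumptions chosen for illustration; the log-likelihoods are presumed to come from a model trained on normal traffic.

```python
from statistics import mean, stdev


def per_byte_score(log_likelihood, length):
    """Average log-likelihood per byte, so that messages of different lengths are comparable."""
    return log_likelihood / length


def fit_baseline(normal_scores):
    """Summarize the scores of normal traffic by their mean and standard deviation."""
    return mean(normal_scores), stdev(normal_scores)


def is_anomalous(score, baseline, k=3.0):
    """Flag a message whose score deviates from the normal level by more than k deviations."""
    mu, sigma = baseline
    return abs(score - mu) > k * sigma


# Hypothetical (log-likelihood, length) pairs observed on normal traffic.
normal = [per_byte_score(ll, n) for ll, n in
          [(-200.0, 100), (-315.0, 150), (-152.0, 80), (-246.0, 120)]]
baseline = fit_baseline(normal)
print(is_anomalous(per_byte_score(-210.0, 100), baseline))
print(is_anomalous(per_byte_score(-600.0, 100), baseline))
```

A message that the model explains about as well as the training traffic passes the check, while a message with a much lower per-byte log-likelihood is flagged.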

Figure 3: Illustration of modeling HTTP based on hidden semi-Markov model.

Figure 4: Overview of system architecture.

Figure 5: The results output by PI.

Figure 8: Illustration of field segmentation for HTTP message.

Figure 9: The accuracy of traffic identification. "HsMM" represents the HsMM-based method proposed in this paper, while "Markov" represents the Markov model based method presented in previous work.

Table 1: Results of keyword extraction for text-based protocols.

Table 4: HTTP message format inferred by HsMM.

Table 5: The accuracy of message type inference.