Packet Payload Monitoring for Internet Worm Content Detection Using Deterministic Finite Automaton with Delayed Dictionary Compression

Packet content scanning is one of the crucial tasks in network security and network monitoring applications. In monitoring applications, the payload of packets in a network is matched against a set of patterns in order to detect attacks such as worms and viruses, as well as protocol definitions. During network transfer, incoming and outgoing packets are monitored in depth to inspect the packet payload. In this paper, regular expressions, which are essentially string patterns, are analyzed against packet payloads to detect worms. The grouping scheme for regular expression matching is then rewritten using a Deterministic Finite Automaton (DFA). A DFA achieves better processing speed during regular expression matching, but it requires more memory space for each state. In order to reduce memory utilization, a compression technique is used. Delayed Dictionary Compression (DDC) is applied to achieve better speeds in the communication links. DDC reduces decoding latency when payload packets are compressed in the network. Experimental results show that the proposed approach lowers time consumption and memory utilization during detection of Internet worm attacks.


Introduction
With its rapid development, the Internet has become more vulnerable to various threats and attacks such as intrusions, worms, viruses, spyware, and Trojans. An Internet worm is malicious code or a program that exploits security holes and enters the network without human intervention [1,2]. It is a self-propagating and fast-spreading attack that has affected the Internet dramatically in the last few years. Moreover, malware attackers aim to alter network traffic and craft payloads that cause infection at the host level [3]. By exploiting software vulnerabilities, worms propagate and affect network services [4]. In order to protect the network from these attacks, an effective defense mechanism is necessary.
The Morris worm of 1988 was the first network worm; it infected DEC hosts and Sun 3 systems on a large scale across the Internet [5]. In 2004, the Witty worm padded its 637-byte payload with data from system memory to fill a random size and sent each packet from source port 4000. After sending 20,000 packets, Witty seeks to a random point on the hard disk and writes 65 kbytes of data from the beginning of iss-pam1.dll to the disk. After closing the disk, the worm repeats this process until the machine is rebooted or until the worm permanently crashes the machine [6]. In October 2008, the Conficker worm generated a large amount of network traffic and also caused user account lockouts [5]. The Slammer and Code Red worms caused damage in the billions of dollars and infected thousands of computers within an hour [3].
Network Intrusion Detection Systems (NIDS) have been adopted to defend against attacks that exploit the vulnerabilities of a protocol and attacks that seek to survey a site by scanning and probing. Presently, NIDS depend on content-based detection to minimize the false alarm rate [7]. Scanning and probing attacks are detected by analyzing the network packet headers or by monitoring connection attempts and session behavior. Internet worm attacks, by contrast, deliver a payload to a vulnerable service or application. These can be detected by inspecting the packet [8].
For earlier NIDS, string matching is essential because attack signatures are represented as a collection of strings. Different software and hardware solutions have been proposed for the string matching problem [9][10][11][12]. Various signature-based approaches have been proposed for Internet worm detection [13][14][15][16]. Beyond signatures, worm payloads can be detected effectively with regular expressions by scanning every incoming packet. NIDS struggle to match signatures against incoming packets on faster network links. To overcome this challenge, regular expression representation supports better network traffic monitoring through deep packet inspection [17]. Pattern matching methods thus provide better detection of Internet worms.
Deep packet inspection allows a Network Intrusion Detection System (NIDS) to accurately identify malicious payloads occurring in the network during the transfer of packets [17]. Pattern matching processes a large volume of data, which costs computation time and memory. Regular expression matching based on either a DFA or an NFA can be used for pattern matching in networks, but an NFA has significant computation and storage complexity, because it keeps multiple concurrently active states and incurs bandwidth costs [18]. To overcome the limitations of existing approaches in scanning a large number of packets at high speed, a DFA is implemented.
The proposed approach uses a DFA for regular expression pattern matching to inspect the packet payload. Packets are split into subpackets according to the regular expression patterns so that they fit into the memory space. The DFA applied with DDC increases speed in communication links and reduces decoding latency. Consequently, the proposed DFA with DDC approach provides better detection of the payload during the transfer of packets in the network.
This paper is organized as follows. Section 2 describes the previous approaches applied for Internet worm detection. Section 3 illustrates the DFA used for regular expression pattern matching and the proposed approach in detail. In Section 4, the experimental results are shown, comparing the proposed method with the existing approach. Section 5 concludes the paper.

Literature Review
Internet worms cause millions of dollars in damage by infecting thousands of machines within a few minutes. Various defense mechanisms have been proposed by different authors to protect the network from Internet worm attacks. Some of the techniques proposed for detection of Internet worms are discussed below.
Wang et al. [19] implemented an approach for worm mitigation and validated the efficiency of the model through extensive simulations. Considering the network-delay factor, the initial infection rate of active worms is detected. This approach also analyzes the worm-free equilibrium point and derives the basic reproduction number to provide a quantitative guideline for effective worm defense. Amador and Artalejo [20] introduced an approach named block-structured state-dependent event (BSDE) to improve computer network security. BSDE is used to find computer network infections and to strengthen computer security.
Khule et al. [21] proposed a novel method, Netflow, to monitor traffic profiles in the network. The traffic statistics of packets received on an interface are counted as a "flow" and stored in a dynamic flow cache. Toutonji et al. [1] proposed a mathematical model that combines dynamic quarantine and passive benign worms for containment of worm propagation. As further research, the authors suggest quarantine measures to minimize the number of hosts infected by benign worms.
Saikia et al. [7] implemented an approach that detects worms using behavioral signatures, specifically to improve network security. Yu et al. [22] showed that C-worm propagation can be detected using a spectrum-based detection scheme that distinguishes normal from abnormal background traffic. In the frequency domain, the traffic pattern distinguishes C-worm traffic from normal traffic.
Yu et al. [23] introduced threshold-based, trace-back-based, and spectrum-based defense schemes. The threshold and trace-back schemes are integrated to defend against static worms, and the combination of all three schemes is used to defend against dynamic self-disciplinary worms. The propagation patterns are analyzed for detection of worms. Zaki and Hamouda [24] proposed an antiworm system to effectively reduce the spreading speed of infecting worms in network routers. WSRMAS (worm spreading reduction multiagent system) consists of a multiagent system that limits or even stops the worm spreading. Internet security is in real need of a realistic antiworm system.
Table 1 summarizes various detection techniques proposed and the parameters used by proposing authors for their evaluation.
In Table 1, the techniques used for Internet worm attack detection are listed together with their authors, the parameters used, and the observed results. The observations indicate that the proposed techniques are efficient in increasing network security, reducing worm propagation, and identifying attacks earlier.
Various signature-based approaches have been developed by different authors for early worm detection. Cai et al. [25] proposed WormShield, a worm signature generation system that monitors the traffic. Distributed fingerprint filtering reduces aggregation traffic, and distributed aggregation trees improve load balancing when calculating fingerprint statistics. Wang et al. [26] analyzed network traffic based on patterns or signatures. The patterns at the network level are analyzed to detect polymorphic worms that exploit buffer overflow vulnerabilities. Based on both exploit-specific and vulnerability-driven signatures, zero-day polymorphic worms are detected. Simkhada et al. [13] proposed a system that detects worms in a hierarchical manner by generating worm signatures automatically in large networks. Tang and Chen [14] proposed the position-aware distribution signature (PADS) with expectation-maximization and Gibbs sampling algorithms for effective detection of worms. Tang et al. [16] proposed the SRE signature based on the exploitation of operating systems and the vulnerability of network services. Kong et al. [15] proposed the semantics aware statistical (SAS) algorithm to detect packets from the suspicious flow pool and generate worm signatures automatically. Signatures generated in this way do not remain effective for long periods; regular expressions are needed instead. However, the above techniques have certain limitations in improving memory latency when detecting payloads with regular expressions. The proposed approach therefore adds a compression algorithm to overcome the limitations of the existing approaches.

Proposed Methodology
The proposed DFA with DDC approach scans every incoming packet in depth to detect payloads that affect the network. Regular expressions are analyzed, and DFA-based pattern matching is developed to match them and detect the payload. To achieve better matching speed, DDC is combined with the DFA. The DDC algorithm increases link speed and minimizes decoding latency.
The steps followed for scanning and detecting the payload patterns during network transfer are given in Figure 1. The techniques involved are regular expression matching, DFA, and DDC.

Matching Regular Expressions with DFA.
The natural formalism for regular expressions is the finite automaton, which is either a Deterministic Finite Automaton (DFA) or a Nondeterministic Finite Automaton (NFA). In a DFA, all transitions are deterministic; each transition leads to exactly one state. In an NFA, transitions are nondeterministic; each transition leads to a subset of states. This section discusses the analysis of regular expressions and the development of memory-efficient DFA-based solutions that provide high-speed processing.
A DFA is a finite automaton in which all transitions are deterministic. It consists of a finite set of input symbols, denoted as $\Sigma$, a finite set of states, and a transition function, denoted as $\delta$. $\Sigma$ consists of the $2^8$ symbols of the extended ASCII code. The transition function $\delta$ takes a state, initially the start state $q_0$, and an input symbol as arguments and returns the state that is entered. Each transition leads to exactly one active state. Regular expression matching compares the packet with the patterns in the list. When the packet matches the patterns, the matched content is divided into states.
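The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `DFA`, the integer state numbering, and the hand-built automaton for the pattern cb* are hypothetical and are not taken from the authors' implementation. The transition function is stored as a lookup table keyed by (state, byte), so each input byte selects exactly one next state.

```python
from typing import Dict, Set, Tuple


class DFA:
    """Deterministic finite automaton over the 256-symbol byte alphabet."""

    def __init__(self, transitions: Dict[Tuple[int, int], int],
                 start: int, accepting: Set[int]) -> None:
        self.transitions = transitions  # (state, byte) -> next state
        self.start = start              # start state
        self.accepting = accepting      # set of accepting states

    def accepts(self, data: bytes) -> bool:
        """Return True if the whole input drives the DFA into an accepting state."""
        state = self.start
        for byte in data:
            # At most one next state per (state, symbol); a missing entry
            # stands for an implicit dead state, so the input is rejected.
            next_state = self.transitions.get((state, byte))
            if next_state is None:
                return False
            state = next_state
        return state in self.accepting


# Hand-built DFA for the pattern cb*: state 0 is the start state,
# state 1 is accepting and loops on 'b'.
CB_STAR = DFA(
    transitions={(0, ord("c")): 1, (1, ord("b")): 1},
    start=0,
    accepting={1},
)

print(CB_STAR.accepts(b"cbbb"))  # True
print(CB_STAR.accepts(b"cbc"))   # False
```

A dense table with one row of 256 entries per state would give the same behavior with a plain array lookup; the sparse dictionary is used here only to keep the example short.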

Patterns to Split Regular Expressions.
Collections of strings that are not listed in one specific format are described by regular expressions. To scan and analyze the packet payload with limited memory latency, the features listed in Table 2 are used.
When the regular expressions meet the patterns listed in Pseudocode 1, they are stored as subpackets containing a limited number of strings. This overcomes the length restrictions in regular expression matching. For payload detection, the regular expressions use a DFA-based pattern matching approach. To reduce the buffer size, a compression technique is applied.

Proposed DFA with DDC.
This section describes how each individual regular expression, analyzed as a DFA state, is matched in the compressed stage. Compressing each state makes the DFA feasible and allows it to fit into memory. The compression technique, implemented with the DDC algorithm, is applied to overcome the memory limitation.

Delayed Dictionary Compression (DDC).
The Delayed Dictionary Compression algorithm extends a dictionary compression model with an additional parameter Δ, a nonnegative integer. The dictionary is updated with a delay of Δ units, where a unit is either a character or a packet. When the $i$th unit of the input is read, the dictionary $D$ is a function of only the first $i - \Delta - 1$ units, so it is nonempty only for $i > \Delta + 1$. Δ = 0 for every standard dictionary compression algorithm. The approach defined above is called Basic Delayed Dictionary Compression (BDDC).
The DDC algorithms are formed by combining BDDC with stateless compression. Using stateless compression gives these algorithms better decoding latency. The encoder encodes the current characters with a dictionary built from all characters up to, but not including, the last Δ characters. Each encoded packet points to phrases. The phrases created from the encoded packets are stored in the dictionary with a delay of Δ packets. The DDC algorithm compresses the packets with the delay Δ set proportional to the network propagation delay. Every packet is compressed against the history of all earlier packets except the final Δ packets that precede the currently encoded packet. The encoder transmits every encoded packet, with all its encoded phrases, to the receiver. A receiver that has received all the packets specified in the history decodes the packet.
DDC is a general framework that can be applied to any dictionary algorithm; it consists of two main processes, encoding and decoding, in which the dictionary parser and the output parser are completely separated. This separation allows the dictionary to be updated freely, independently of the output parsing process.
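The delayed dictionary idea can be illustrated with a short Python sketch. It is not the DDC implementation evaluated in this paper: it assumes packets as the unit, uses the preset-dictionary feature of the standard `zlib` module as a stand-in for the dictionary algorithm, and the function names and the 32 kbyte dictionary cap are illustrative choices. With 0-based packet index `i`, the dictionary is built only from packets `0 .. i - delay - 1`, so a receiver can decode packet `i` without having received the `delay` packets immediately before it.

```python
import zlib
from typing import List


def delayed_dictionary(history: List[bytes], i: int, delay: int,
                       max_size: int = 32768) -> bytes:
    """Dictionary for packet i: concatenation of packets older than the last `delay`."""
    usable = history[: max(0, i - delay)]
    return b"".join(usable)[-max_size:]


def encode_packet(packets: List[bytes], i: int, delay: int) -> bytes:
    """Compress packet i on its own, primed with the delayed dictionary."""
    zdict = delayed_dictionary(packets, i, delay)
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return comp.compress(packets[i]) + comp.flush()


def decode_packet(encoded: bytes, history: List[bytes], i: int, delay: int) -> bytes:
    """Decompress packet i; `history` only needs packets 0 .. i - delay - 1."""
    zdict = delayed_dictionary(history, i, delay)
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(encoded) + decomp.flush()
```

Setting `delay=0` makes the dictionary depend on all previous packets, which corresponds to a standard (non-delayed) dictionary scheme; a larger `delay` trades some compression ratio for tolerance to packets that are still in flight.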
In this section, the states obtained from the DFA process above are given as input. If a large amount of data is used in the DFA, the memory space allocated for it is large. To reduce the memory and decrease the computational time, DDC is used. The DDC algorithm performs encoding, stateless compression, and decoding to reduce decoding latency.
(1) Encoder. The states matching the patterns, together with their transitions, are given as the input, denoted by I. The compression algorithm then maintains a set of substrings called the dictionary. The parsing process that constructs the dictionary is called the dictionary parser; the parsing process that produces the output is called the output parser.
The additional parameter that delays the dictionary update is represented as Δ. It is a nonnegative integer that is either constant or adapted according to any rule the user chooses. For every standard dictionary compression algorithm it is taken as zero. Figure 2 shows the overall process of encoding. The input I is given to the dictionary parser. The dictionary, delayed by Δ, consists of a set of substrings and is bidirectional. The phrases accessed by the dictionary parser and the dictionary are given as secret codes to the output parser. These secret codes are taken as the compressed output.
(2) Stateless Compression Algorithm. The compression algorithm compresses each packet independently during encoding. In stateless compression, both compression and decompression are done independently for every packet. The receiver decompresses each packet it receives regardless of its arrival order. Stateless compression therefore minimizes the decoding latency.
The uncompressed traffic consists of packets, each having its own header and data. The compressed traffic in Figure 3(b) consumes less memory than the uncompressed traffic. Each packet is compressed to reduce buffer size consumption.
(3) Decoder. The secret codes obtained from the DDC encoder are given as input. Decoding performs the reverse operation of encoding. The packet taken as input consists of secret codes. These codes are replaced with their corresponding phrases, which also build the dictionary.
Figure 4 shows the overall process of decoding. The compressed text C(I) is given as input to the dictionary parser. As in the encoder, the dictionary, delayed by Δ, consists of a set of substrings and is bidirectional. The secret codes accessed by the dictionary parser and the dictionary are replaced with their phrases in the output parser. These phrases form the original text I.
This compression technique makes the DFA fit in a reduced amount of memory. It thus becomes possible to match a large number of individual patterns with less memory. The compressed states are monitored to detect the payload.
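The stateless compression step described above, in which every packet is compressed and decompressed independently, can be sketched as follows. The sketch assumes the standard `zlib` codec as a placeholder for the compression algorithm, and the function names are hypothetical. Because no state is shared between packets, the receiver can decompress them in any arrival order.

```python
import zlib
from typing import Dict, Iterable, List


def compress_stateless(packets: List[bytes]) -> List[bytes]:
    """Compress every packet on its own; no history is shared between packets."""
    return [zlib.compress(packet) for packet in packets]


def decompress_in_arrival_order(compressed: List[bytes],
                                arrival_order: Iterable[int]) -> Dict[int, bytes]:
    """Decompress packets in whatever order they arrive, keyed by packet index."""
    recovered: Dict[int, bytes] = {}
    for index in arrival_order:
        # Each packet is self-contained, so no earlier packet is required here.
        recovered[index] = zlib.decompress(compressed[index])
    return recovered
```

The cost of this independence is a lower compression ratio on small packets, which is the gap that the delayed dictionary is meant to narrow.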

DFA Matching to Detect Payload.
For payload scanning, regular expressions and automata theory are applied directly. In packet payload scanning, input packets, or substrings of the input entering the network, are matched against regular expression patterns. A DFA faces complexity in recognizing all substring matches without prior knowledge of the start and end positions of the substrings. To complete the matching process with a DFA for all substrings, exhaustive and nonoverlapping matching styles are used.
In exhaustive matching, the pattern is matched against all substrings of the input, and the complete set of results is reported for the given input stream and regular expression pattern. For example, for the pattern cb* and the input cbbb, the report contains every matching substring: c, cb, cbb, and cbbb (c is included because b* also admits zero repetitions).
For the matching process, let $M$ be a function from a pattern $P$ and a string $S$ to the power set of $S$ such that

$$M(P, S) = \{\text{substring } S' \text{ of } S \mid S' \text{ is accepted by the DFA of } P\}. \tag{1}$$

This style of matching is expensive, and reporting every matching substring is considered unnecessary. To avoid the cost of exhaustive matching, nonoverlapping matching is proposed.
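To make the cost of exhaustive matching concrete, the following Python sketch enumerates every substring of the input and keeps those the pattern accepts. It is an illustration only: it uses the backtracking engine of the standard `re` module instead of a DFA, and the function name is hypothetical. Note that under the usual semantics of `*`, the single character c is itself accepted by cb*, so the enumeration for the input cbbb yields c, cb, cbb, and cbbb.

```python
import re
from typing import List, Tuple


def exhaustive_matches(pattern: str, text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, substring) for every nonempty substring the pattern accepts."""
    compiled = re.compile(pattern)
    matches = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # fullmatch with pos/endpos tests exactly the substring text[start:end].
            if compiled.fullmatch(text, start, end):
                matches.append((start, end, text[start:end]))
    return matches


print(exhaustive_matches("cb*", "cbbb"))
# [(0, 1, 'c'), (0, 2, 'cb'), (0, 3, 'cbb'), (0, 4, 'cbbb')]
```

The nested loops examine a quadratic number of substrings, which is why this style is considered too expensive for per-packet scanning.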
In the nonoverlapping approach, let $M$ be a function from a pattern $P$ and a string $S$ to the power set of $S$ such that

$$M(P, S) = \{\text{substring } S_i \text{ of } S \mid \forall\, S_i, S_j \text{ accepted by the DFA of } P,\; S_i \cap S_j = \emptyset\}. \tag{2}$$

From the input string, this matching process reports all nonoverlapping substrings that match the pattern, which may appear at multiple locations in the input. For example, given the pattern cb* and the input cbbb, only one match is reported, even though several overlapping substrings that share the prefix cb are also accepted. Nonoverlapping matching for payload scanning provides better analysis of the attack patterns found in the packet. On its own, however, this matching style does not yield memory-efficient DFAs.
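A minimal Python sketch of nonoverlapping matching is given below. It relies on `re.finditer` from the standard library, which by definition returns nonoverlapping matches scanned from left to right; the greedy, leftmost choice of which disjoint substrings to report is a property of that engine and an assumption of this sketch, not something the definition above prescribes. The function name is hypothetical.

```python
import re
from typing import List, Tuple


def nonoverlapping_matches(pattern: str, text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, substring) for pairwise disjoint, nonempty matches."""
    return [
        (m.start(), m.end(), m.group())
        for m in re.finditer(pattern, text)
        if m.group()  # skip empty matches, which carry no payload content
    ]


print(nonoverlapping_matches("cb*", "cbbb"))    # [(0, 4, 'cbbb')]
print(nonoverlapping_matches("cb*", "cbbxcb"))  # [(0, 3, 'cbb'), (4, 6, 'cb')]
```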
To handle substring matching, a one-pass search execution model is built with the DFA in this paper. The DFA is created explicitly for an extended pattern that matches the pattern anywhere in the input. Rather than rescanning from beginning to end for each candidate position, the DFA is able to begin substring matching at different positions of the input within a single pass. This one-pass search approach achieves O(1) computation cost per character, which suits network applications.
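The one-pass idea can be sketched for the simplest case, a literal byte signature, by building the DFA for "the signature occurs anywhere in the input" with the classic Knuth-Morris-Pratt automaton construction. This is a deliberately reduced example: it handles a fixed string instead of a general regular expression, and the function names are hypothetical. It does show the property claimed above, namely one table lookup per input byte and no backtracking over the payload.

```python
from typing import List

ALPHABET_SIZE = 256  # one column per byte value


def build_search_dfa(signature: bytes) -> List[List[int]]:
    """Return dfa[state][byte] -> next state; state len(signature) means a match ended."""
    m = len(signature)
    if m == 0:
        raise ValueError("signature must be non-empty")
    dfa = [[0] * ALPHABET_SIZE for _ in range(m + 1)]
    dfa[0][signature[0]] = 1
    restart = 0  # state reached after reading signature[1:j]
    for j in range(1, m + 1):
        dfa[j] = list(dfa[restart])           # on mismatch, behave like the restart state
        if j < m:
            dfa[j][signature[j]] = j + 1      # on match, advance
            restart = dfa[restart][signature[j]]
    return dfa


def scan(payload: bytes, dfa: List[List[int]]) -> List[int]:
    """Return the start offsets of all occurrences found in a single pass."""
    accept = len(dfa) - 1
    state = 0
    hits = []
    for pos, byte in enumerate(payload):
        state = dfa[state][byte]              # exactly one lookup per input byte
        if state == accept:
            hits.append(pos + 1 - accept)
    return hits


print(scan(b"xxabababyy", build_search_dfa(b"abab")))  # [2, 4]
```

The table costs 256 entries per state, which illustrates on a small scale why the memory footprint of DFA states matters for payload scanning.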

Journal of Computer Networks and Communications
In this paper, for packet payload scanning applications, the DFA uses nonoverlapping matching and one-pass search. Figure 5 illustrates the DFA for the regular expression ^ab*cd?efi.gh. Pseudocode 1 provides the pseudocode for the proposed approach. The approach integrates the compression stage with the DFA technique for better memory latency.
In Pseudocode 1, the packets divided into states are compressed with stateless compression. The states are then decompressed with low decoding latency. Payloads are detected using the DFA.
Figure 6 gives the flow diagram of the proposed approach DFA with DDC algorithm for monitoring and detecting the payloads created by Internet worms.
The proposed approach monitors and detects the packet payload created by Internet worms at the network transfer level to prevent its spread. As Figure 6 shows, the DFA with DDC monitors and analyzes the packet payloads against the trained datasets. The DDC algorithm allows the Internet worms to be detected in the compressed state, which overcomes the memory space limitation.

Experimental Results
The proposed approach has been evaluated using memory utilization and computation time as parameters. The memory utilization metric measures the memory consumed by the CPU during the detection of Internet worms based on payload. The evaluation is done on the Java platform with a real data set. A total of 500 sample files were collected from the Internet for monitoring and detecting the worm attacks. The dataset contains 387 malware files and 113 normal files. The data are monitored during transfer, and the attacks are detected through packet payload occurrence with the proposed DFA with DDC.
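Both metrics are differences between a reading taken at the end of processing and a reading taken at the start. The following Python sketch shows one way such readings could be collected; it is not the Java harness used for the reported results. It assumes the standard `tracemalloc` and `time` modules, measures Python heap allocations only, and the function name `measure` is hypothetical.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(task: Callable[[], Any]) -> Tuple[Any, float, int]:
    """Run task() and return (result, elapsed seconds, change in traced bytes)."""
    tracemalloc.start()
    mem_start, _peak = tracemalloc.get_traced_memory()
    t_start = time.perf_counter()

    result = task()

    t_end = time.perf_counter()
    mem_end, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Time Consumption = finishing time - starting time
    # Memory Utilization = memory at end - memory at start
    return result, t_end - t_start, mem_end - mem_start


result, seconds, traced_bytes = measure(lambda: [0] * 100_000)
print(f"elapsed: {seconds:.6f} s, memory delta: {traced_bytes} bytes")
```

Readings of this kind vary from run to run, so averaging over repeated runs is advisable before comparing approaches.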
From Table 3, it is shown that the proposed DFA with DDC approach gives better results in terms of memory utilization and time consumption.
Figure 7 shows that the proposed pattern matching method identifies packet payload with minimum memory requirements, which provides resource scalability. DFA with DDC provides high-speed matching and efficient memory utilization compared to the existing GFGS algorithm.
Figure 8 compares the computation time of the existing GFGS with EHAMA and the proposed DFA with DDC. It clearly shows that the proposed DFA with DDC approach requires less computation time than the existing GFGS with EHAMA.

Conclusion
In networking applications, malicious packet payloads pose threats to Internet users. In this paper, regular expression pattern matching with a compression algorithm is implemented for monitoring the packet payload created by Internet worms. The DFA-based pattern matching implementation provides faster detection of payload occurrence. The DFA focuses on detecting repeated suspicious packets and speeds up the scanning process. Additionally, DFA with DDC overcomes the compression overheads and reduces memory usage. The DDC algorithm applied with the DFA reduces decoding latency and improves speed in communication links. The experimental results in Section 4 show that the proposed method gives better memory utilization and computation time for detection of Internet worms than the existing approach.

Figure 1: Proposed flow for pattern payload detection.
Detection time is calculated to find the time consumed for detecting the Internet worms in the network using the time utilization metric:

$$\text{Memory Utilization} = \text{Memory consumption of CPU at the end of the process} - \text{Memory consumption of CPU at the start of the process},$$

$$\text{Time Consumption} = \text{Finishing time of processing} - \text{Starting time of processing}.$$

Table 1: Review of literature for detection of Internet worms.

Table 2: Patterns for regular expression.

Table 3: Proposed approach metric comparison.