Multilevel Bloom Filters for P2P Flows Identification Based on Cluster Analysis in Wireless Mesh Network

With the development of wireless mesh networks and distributed computing, many new P2P services have been deployed, enriching Internet content and applications. The rapid growth of P2P flows puts great pressure on regular network operation, so effective identification and management of P2P flows have become increasingly urgent. In this paper, we study the locality characteristics of P2P flows and build a multilevel bloom filter data structure to identify them. Each level of the structure stores a different number of P2P flow rules, and the parameters of the bloom filters are adjusted according to the characteristic values of the P2P flows. A search traverses the structure from the first level to the last level. Our method addresses drawbacks of traditional algorithms. The simulation results demonstrate that our algorithm effectively enhances the performance of P2P flow identification. We then deploy the algorithm in the traffic monitoring sensors of a network traffic monitoring system at the export link of a campus network. In this real environment, the experimental results demonstrate that our algorithm identifies P2P flows with high speed and high accuracy; therefore, it is suitable for actual deployment.


Introduction
Wireless networks are becoming more and more popular. As users have become accustomed to wired infrastructure networks, it is essential for wireless networks to provide similar service features [1]. It is difficult to find relevant content and services because the users and the data associated with a variety of applications are distributed over various sites and devices [2]. Many resources in wireless mesh networks (WMNs) can be used efficiently with the aim of maximizing the total throughput of the whole network. In these networks, the flow identification scheme plays a key role in maximizing aggregate throughput [3].
As interest in wireless mesh networks grows, it is important to provide a quality of service similar to what users of wired infrastructure networks expect [4]. P2P (peer-to-peer) has grown into a highly efficient network transmission technology because of the widespread adoption of broadband residential access. Furthermore, it takes advantage of modern network technology as well as distributed computing technology. It is a kind of distributed network whose basic idea is to change the traditional Client/Server mode [5]. In recent years, P2P networks have attracted broad attention because peer nodes can communicate with each other without the help of intermediate servers, and P2P has become a technology with a bright future [6]. In the last decade, the number of applications based on P2P technology, such as BT, PPLive, eDonkey, and Thunder, has been increasing. P2P systems now account for more than 60% of Internet traffic in China [7].
A wireless mesh network (WMN) uses mesh routers to form a wireless backbone rather than a wired one. Figure 1 shows the wireless mesh backbone network architecture. Each mesh node in the figure acts as a server to the others and obtains the services provided by its peer nodes. A distributed network greatly changes the distribution of flows in the network and reduces the stress on the storage server. The emergence of P2P networks has improved the user experience and enriched the Internet. However, the excessive growth of P2P flows and their unlimited use of bandwidth cause network congestion and increase packet loss and delay. In other words, network performance and quality of service are greatly reduced. Moreover, malicious code, reactionary and obscene information, and pirated resources spread wantonly in P2P networks [8]. As a result, P2P applications can greatly reduce the performance of the network and sometimes have a rather adverse impact on regular network services. P2P flows therefore need to be monitored and controlled continuously, and the regular operation of network services needs to be guaranteed [9]. These are the motivations for P2P flow identification in wireless mesh networks. How can we manage the network bandwidth of P2P services while ensuring the quality of service? To maximize users' satisfaction with P2P streaming in WMNs, the key technology is the flow identification scheme. It enables network administrators to execute different control strategies according to different flow requirements in order to manage P2P services effectively [10,11]. Therefore, the accurate identification and classification of P2P flows have become a focus for network operators and service providers.
Over the past decades, flow identification of P2P services has been widely studied [12-16]; the main approaches are the port identification method [12,13], the host-behavior characteristics analysis method [14], and the identification method based on flow statistical properties [15,16]. However, these methods cannot identify P2P flows both accurately and quickly. In this paper, we therefore put forward a high-performance P2P flow identification algorithm based on multilevel bloom filters.
The main contributions of this paper are as follows.
(1) We find that P2P flows exhibit locality characteristics in the time intervals of packet arrival and in the packet lengths.
(2) An efficient flow identification scheme based on multilevel bloom filters is proposed for identifying P2P flows.
(3) A comprehensive set of experimental results demonstrates that our algorithm effectively enhances the performance of P2P flow identification.
The rest of the paper is organized as follows. Section 2 introduces the related work. In Section 3, we analyze the packet lengths and time property features of P2P flows. Section 4 presents an efficient P2P flow identification scheme. Section 5 presents the simulation evaluation, and Section 6 the performance evaluation in a real environment. Finally, Section 7 concludes the paper.

Related Works
In this section, we provide a brief discussion on the methods of P2P flow identification.
P2P flow identification methods fall mainly into three categories: the port identification method [12,13], the host-behavior characteristics analysis [14], and the identification method based on flow statistical properties [15,16].
The port identification method [12] is the most primitive and simple network flow identification method. Many traditional network applications use fixed ports; for example, HTTP uses port 80 and MSN uses ports 1863 and 80 [17]. The method can therefore identify the corresponding flow quickly and efficiently from the port numbers, with low complexity. However, as new services have developed, many of them use dynamic random ports in order to avoid filtering. For such services, the port identification method almost always fails and its classification accuracy is very low. Because it is simple and fast, the TCP/UDP port identification method is still used for flow identification in high-speed networks. New P2P applications, however, use diverse strategies to change their port numbers in order to avoid traffic identification. As a consequence, the port-based method yields incomplete and inaccurate identification results and is of little use for these applications [13].
The host-behavior characteristics analysis [14] is designed mainly for P2P flows. Its basic idea is to analyze the data packets, summarize the characteristics of P2P flows from this analysis, and then decide whether a flow belongs to a P2P application [18]. In recent years, researchers have proposed many network measurement methods based on behavioral characteristics, which have good scalability and high identification accuracy [19]. However, because network service models are partly similar, these methods can only identify coarse-grained network services, and their memory consumption is very large. Moreover, P2P applications and regular applications that behave similarly cannot be discriminated, so the traffic cannot be identified accurately.
The P2P identification method based on flow statistical properties overcomes the limits of port identification and of behavior characteristics analysis. It uses statistics such as packet arrival time interval, duration, and a series of other packet characteristics, together with supervised or unsupervised machine learning methods, to identify services. Supervised machine learning [15] trains a model on data and then classifies data with this model, while unsupervised machine learning [16] classifies data directly. The identification method based on machine learning has better scalability and can identify encrypted data flows, and its classifier is also scalable and flexible. However, these methods have low performance because of the heavy resource consumption caused by signature searching in the payload of every packet.

Research on the Locality Characteristics of P2P Flows
3.1. The Locality Characteristics. In this section, we find the locality characteristics of P2P flows by studying the packet lengths and time property features. Over the past decade, researchers have revealed some statistical characteristics of P2P flows through a large number of studies. It has also been found that a recently referenced file has a greater probability of being referenced again soon [20]. Researchers found that P2P applications have features such as synchronous upstream and downstream flow, fast transmission and high capacity, widely distributed service points, and a lack of security mechanisms [21]. These features mean that P2P networks are characterized by uncertainty, encryption, and large capacity.
For a more comprehensive understanding and analysis of the characteristics of P2P flows, we select the average packet length and the packet arrival time interval as the values to measure. Our purpose is to design an appropriate algorithm structure to identify P2P flows through the analysis of these values. We capture P2P packets from the Internet, read the five-tuple information of each packet, and assign each packet to its flow according to a classification algorithm. We then record the arrival time of each packet in its flow and finally calculate the time attribute and the packet length values.
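The measurement procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the packet record layout `(five_tuple, arrival_time_s, length_bytes)`, the function name `flow_statistics`, and the sample addresses are all assumptions made for the example.

```python
from collections import defaultdict


def flow_statistics(packets):
    """Group packets by five-tuple and return per-flow averages.

    Each packet is (five_tuple, arrival_time_s, length_bytes); this record
    layout is an assumption made for the illustration.
    """
    flows = defaultdict(list)
    for five_tuple, arrival, length in packets:
        flows[five_tuple].append((arrival, length))

    stats = {}
    for five_tuple, records in flows.items():
        records.sort()  # order the packets of a flow by arrival time
        times = [t for t, _ in records]
        lengths = [n for _, n in records]
        gaps = [b - a for a, b in zip(times, times[1:])]
        stats[five_tuple] = {
            # A flow with a single packet has no inter-arrival interval.
            "avg_interval": sum(gaps) / len(gaps) if gaps else None,
            "avg_length": sum(lengths) / len(lengths),
        }
    return stats


# Hypothetical five-tuples: (SA, DA, SP, DP, Pro).
flow_a = ("10.0.0.1", "10.0.0.2", 5000, 6881, "TCP")
flow_b = ("10.0.0.3", "10.0.0.4", 5001, 6882, "UDP")
packets = [
    (flow_a, 0.0, 100),
    (flow_b, 0.5, 80),
    (flow_a, 2.0, 120),
    (flow_a, 4.0, 110),
]
stats = flow_statistics(packets)
```

Averaging these per-flow values over a growing number of flows would give curves of the kind the text reports for the arrival interval and the packet length.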
We ran experiments on P2P flows to measure the time intervals of packet arrival and the average packet length. The results are shown in Figures 2 and 3. Figure 2 depicts the average time interval of arriving packets in the P2P flows. As the number of flows increases, the average time interval gradually decreases and finally stabilizes when the number of flows reaches 60; at this point the average time interval is about 1.7 seconds. Figure 3 depicts the average packet length in the P2P flows. When the number of data flows reaches 130, the average packet length becomes stable at around 100. We find that the time intervals of packet arrival and the packet lengths in P2P flows are smaller than those of other Internet flows, which is in line with our analysis of the locality characteristics of P2P flows [22,23].

The Mathematical Basis
Definition 1. The Minkowski distance between data $x$ and mean $\mu$ is

$$d(x, \mu) = \left( \sum_{k=1}^{n} |x_k - \mu_k|^{p} \right)^{1/p},$$

where $x = \{x_1, x_2, \ldots, x_n\}$ is the time series sample, $\mu = \{\mu_1, \mu_2, \ldots, \mu_n\}$ is the mean of the sample, $n$ denotes the number of dimensions of the sample, and $p \geq 1$ is the order of the distance [24].
In our study, we assume that the historical time series sample $X = \{x_1, x_2, \ldots, x_t\}$ of Internet packets follows a normal distribution. The packet lengths and time property features make up the samples. Let $\mu$ denote the mean and $\sigma$ the variance of the sample $X$ before time $t$. The distance from newly generated sample data $x$ to the mean $\mu$ determines the probability that $x$ and $\mu$ are assigned to the same cluster. This assigned cluster probability $P(x, \mu)$ is defined in terms of $d(x, \mu)$, the Minkowski distance between the sample data $x$ and the mean $\mu$.
The smaller the Minkowski distance between newly produced sample data $x$ and the mean $\mu$, the larger the value of $P(x, \mu)$. Within a given period of time, data closer to the mean $\mu$ are therefore more likely to appear than other data.
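As a small worked illustration of Definition 1, the Minkowski distance can be computed as below. The function name, the default order `p=2`, and the two-dimensional sample values are assumptions chosen for the example; the paper does not state which order it uses.

```python
def minkowski_distance(x, mu, p=2):
    """Minkowski distance of order p between a sample x and a mean mu."""
    if len(x) != len(mu):
        raise ValueError("x and mu must have the same number of dimensions")
    return sum(abs(a - b) ** p for a, b in zip(x, mu)) ** (1.0 / p)


# Hypothetical two-dimensional sample and mean.
sample = [3.0, 4.0]
mean = [0.0, 0.0]
d1 = minkowski_distance(sample, mean, p=1)  # order 1: sum of absolute differences
d2 = minkowski_distance(sample, mean, p=2)  # order 2: Euclidean distance
```

With order 1 the distance reduces to the sum of absolute differences, and with order 2 to the Euclidean distance, so the choice of order controls how strongly a large deviation in one feature dominates.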

The Mathematical Model of Traffics Cluster Characteristics.
In this part, we give a quantitative analysis of the cluster characteristics of real Internet traffic. The study pays close attention to the packet lengths and the time property features.
The historical cluster sample $C$ formed in the time period $[t - 2T, t]$, with cluster center $c$, cluster radius $r$, and sample variance $\sigma$, is assumed to provide similar prior information for the time series sample $\{x_t, \ldots, x_{t+T}\}$ in the following period of time $[t, t + T]$ [25]. This enables us to use a biased method to search the data of the time series sample $\{x_t, \ldots, x_{t+T}\}$ in the historical cluster sample $C$. According to the influence that the time series sample $\{x_t, \ldots, x_{t+T}\}$ in the period $[t, t + T]$ has on the historical cluster center $c$, a new historical cluster sample $C'$ can be generated. The newly produced cluster then provides similar prior information for subsequent time series samples [26].
The time series sample $\{x_t, \ldots, x_{t+T}\}$ produced after time $t$ contains two types of data: $X_{new} = \{x \mid x \in X, x \notin C\}$ and $X_{old} = \{x \mid x \in X, x \in C\}$. We give the following definitions to reflect the impact of the sample $\{x_{t+1}, \ldots, x_{t+2T}\}$ on the historical cluster center $c$.
Definition 2. A deviate sample $X_{new}$ is made up of the data $X_{new} = \{x \mid x \in X, x \notin C\}$ in the time series sample $\{x_t, \ldots, x_{t+T}\}$. One can judge whether the data $x_{t+i}$ in $\{x_{t+1}, \ldots, x_{t+2T}\}$ belongs to the deviate sample $X_{new}$ using the Pearson correlation function

$$\rho(x_{t+i}, c) = \frac{\sum_{k=1}^{n} (x_{t+i,k} - \bar{x}_{t+i})(c_k - \bar{c})}{\sqrt{\sum_{k=1}^{n} (x_{t+i,k} - \bar{x}_{t+i})^{2}} \, \sqrt{\sum_{k=1}^{n} (c_k - \bar{c})^{2}}},$$

where $x_{t+i} \in X$, $c$ is the clustering center of cluster $C$, $n$ is the number of dimensions of the sample, and a bar denotes the mean over the $n$ dimensions. We denote the value of the function by $\rho$.
The function $Hc$ is the total deviate cost of the data of the sample $\{x_t, \ldots, x_{t+T}\}$ from the historical cluster sample $C$. When $Hc < \varepsilon$ is satisfied, the mean $\mu$ of the new time series sample $X$ is taken as a new cluster center to form a new cluster sample $C'$.
Definition 5. One can evaluate the clustering quality of the new cluster sample $C'$ by calculating the Pearson correlation degree between the new cluster center $c'$ and the historical cluster center $c$. The objective function is the cluster quality function $Q(c', c)$. To better reflect the changes of new time series samples and to produce new clusters [27] faster and better, the parameter $\varepsilon$ can be adjusted appropriately with the help of the cluster quality function $Q(c', c)$.
A cluster algorithm for packet matching is given next. First, a historical time series cluster sample $C$ is assumed for the time period $[t - 2T, t]$.
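The Pearson correlation used in Definitions 2 and 5 can be sketched as follows. This is the standard textbook formula; the function name `pearson` and the three-dimensional feature vectors are hypothetical and serve only to show how a sample is compared with a cluster center.

```python
import math


def pearson(x, c):
    """Pearson correlation between a sample vector x and a cluster center c."""
    n = len(x)
    if n != len(c) or n < 2:
        raise ValueError("x and c need the same length, at least 2")
    mean_x = sum(x) / n
    mean_c = sum(c) / n
    cov = sum((a - mean_x) * (b - mean_c) for a, b in zip(x, c))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_c = sum((b - mean_c) ** 2 for b in c)
    if var_x == 0 or var_c == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_c)


# Hypothetical cluster center and two samples.
center = [1.0, 2.0, 3.0]
rho_same = pearson([2.0, 4.0, 6.0], center)      # varies with the center
rho_opposite = pearson([3.0, 2.0, 1.0], center)  # varies against the center
```

Comparing the returned value with a threshold then gives the membership test of Definition 2, and applying the same function to two cluster centers gives the quality measure of Definition 5.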

Our Algorithm for Identifying P2P Flows
4.1. The Architecture of Our Algorithm. In this part, we design an efficient multilevel bloom filter algorithm that identifies P2P flows with high performance by exploiting the locality characteristics of P2P flows.
From the above experiments we obtain a detailed picture of the locality characteristics of P2P flows; for example, the average time interval of packet arrival is stable at about 2 seconds, as shown in Figure 2. Because peer-to-peer flow nodes are updated quickly and transmit rapidly, the algorithm stores the 0-30-second data flows in the first-level bloom filter, the 30-90-second data flows in the second-level bloom filter, and the remaining flows in the last-level bloom filter. If too many flows were stored in the first-level bloom filter, identification would take more time, so we store more flows in the second-level bloom filter. The whole multilevel structure is shown in Figure 4.
The searching procedure is given as pseudocode in Algorithm 1 and works as follows. The five-tuple information (SA, DA, SP, DP, and Pro) is substituted into FL Hash1, FL Hash2, ..., FL Hashk, and the resulting values are compared with the first-level bloom filter. If the algorithm finds the matching flow node, the search stops. Otherwise the packet enters the second-level bloom filter, where the five-tuple information is substituted into SL Hash1, SL Hash2, ..., SL Hashk and the resulting values are compared with the second-level bloom filter. If the matching flow is found, the search stops; otherwise the packet enters the third-level bloom filter. The search continues to the last level until the corresponding flow is found. Owing to the locality characteristics of P2P flows, a newly arriving packet has a high probability of being found in the first level of the structure, in which case the corresponding counter of the flow is updated directly.
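The level-by-level search can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the class `BloomFilter`, the helpers `flow_key` and `find_level`, the SHA-256 based hash family, the filter sizes, and the sample five-tuples are all hypothetical choices made for the example.

```python
import hashlib


class BloomFilter:
    """Plain bloom filter over string keys; sizes and hashing are illustrative."""

    def __init__(self, m_bits, k_hashes, salt):
        self.m = m_bits
        self.k = k_hashes
        self.salt = salt  # separates the hash families of different levels
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions by hashing the key with k different prefixes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{self.salt}:{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def flow_key(five_tuple):
    """Serialize (SA, DA, SP, DP, Pro) into one string key."""
    return "|".join(str(field) for field in five_tuple)


def find_level(levels, five_tuple):
    """Search level by level; return the 1-based level of the first match or None."""
    key = flow_key(five_tuple)
    for index, bloom in enumerate(levels, start=1):
        if key in bloom:
            return index  # stop at the first matching level
    return None


# Three levels with assumed sizes; a real deployment would size them from the rule counts.
levels = [BloomFilter(1024, 4, "FL"), BloomFilter(4096, 4, "SL"), BloomFilter(8192, 4, "TL")]
recent = ("10.0.0.1", "10.0.0.2", 5000, 6881, "TCP")
older = ("10.0.0.3", "10.0.0.4", 5001, 6882, "UDP")
unknown = ("10.0.0.5", "10.0.0.6", 5002, 6883, "TCP")
levels[0].add(flow_key(recent))
levels[1].add(flow_key(older))
```

Because most packets of active flows hit the first level, the loop usually ends after a single membership test; a match in any level can still be a false positive, which is why the filter parameters matter.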
Our algorithm is built on bloom filters. Because a bloom filter can return false positives, we need to discuss the false positive probability of our approach. Assume that the length of each bloom filter is $m$ bits and the number of rules of each virtual router is $n$. Based on the existing research results in [28], we calculate the number of hash functions as $k = \lceil \ln 2 \cdot (m/n) \rceil$ to reduce the probability of false positives.
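To make the parameter choice concrete, the sketch below computes the number of hash functions from the formula above and estimates the resulting false positive probability with the standard approximation $(1 - e^{-kn/m})^k$. That approximation is textbook bloom filter analysis rather than something stated in this paper, and the filter size and rule count are hypothetical.

```python
import math


def optimal_hash_count(m_bits, n_rules):
    """Number of hash functions k = ceil(ln 2 * m / n)."""
    return math.ceil(math.log(2) * m_bits / n_rules)


def false_positive_rate(m_bits, n_rules, k):
    """Standard approximation of the false positive probability."""
    return (1.0 - math.exp(-k * n_rules / m_bits)) ** k


# Hypothetical sizing: 10 bits per stored rule.
k = optimal_hash_count(m_bits=10_000, n_rules=1_000)
fpr = false_positive_rate(10_000, 1_000, k)
```

With 10 bits per rule the formula yields 7 hash functions and a false positive probability a little below 1%.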

Dynamic Flow Aging and Update of the Multilevel Structure.
As time elapses, some flows fall out of use and the corresponding records should be eliminated from our bloom filter data structure, so that the released memory can be used for subsequent flows. Based on the locality characteristics of P2P flows, we use sampled data packets to update the timestamp of a flow instead of updating it with the timestamp of every packet, which saves many write operations on memory. By analyzing the timestamps of the flows we can find the inactive P2P flows. Based on the packet analysis in the previous section, we define a flow whose reaching time exceeds 10 seconds as an inactive flow and move such flow nodes from the first level to the second- or third-level bloom filter. Conversely, the algorithm moves the flows whose reaching time is within 5 seconds from the last two levels to the first level. Through this dynamic update of the data structure, the algorithm greatly improves the flow matching speed and the utilization of storage resources.
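The aging rule can be sketched as a pass over the flow timestamps. The following is only an illustration under several assumptions: level membership is tracked in plain dictionaries because a standard bloom filter does not support deletion, inactive first-level flows are always demoted to the second level (the text does not say how the second or third level is chosen), and the names `rebalance`, `INACTIVE_AFTER_S`, and `ACTIVE_WITHIN_S` are hypothetical.

```python
INACTIVE_AFTER_S = 10.0  # from the text: idle for more than 10 s means inactive
ACTIVE_WITHIN_S = 5.0    # from the text: seen within the last 5 s means active


def rebalance(levels, last_seen, now):
    """Return updated levels after applying the aging and promotion rules.

    levels maps flow id -> level (1, 2 or 3); last_seen maps flow id -> timestamp
    in seconds. Both structures are assumptions made for this sketch.
    """
    updated = {}
    for flow, level in levels.items():
        idle = now - last_seen[flow]
        if level == 1 and idle > INACTIVE_AFTER_S:
            updated[flow] = 2  # demote an inactive first-level flow (assumed target)
        elif level > 1 and idle <= ACTIVE_WITHIN_S:
            updated[flow] = 1  # promote a recently active flow to the first level
        else:
            updated[flow] = level
    return updated


# Hypothetical flows and timestamps.
levels = {"a": 1, "b": 1, "c": 2, "d": 3}
last_seen = {"a": 0.0, "b": 95.0, "c": 97.0, "d": 50.0}
new_levels = rebalance(levels, last_seen, now=100.0)
```

In an actual bloom filter deployment, moving a flow out of a level would require a variant that supports removal, such as a counting bloom filter, or a periodic rebuild of the affected level.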

Simulation Evaluation
In this section, we use simulation experiments to compare the performance of our algorithm with that of the flow statistical properties (FSP) algorithm [14] in P2P flow identification. The performance metric is memory access, which evaluates the searching performance. Figure 5 shows that the memory access performance of our algorithm improves by 33.54% on average over the FSP algorithm when the packets are relatively scattered. Figure 6 shows that it improves by 35.17% on average over the FSP algorithm when the packets are relatively concentrated. This is because the packets of P2P flows are smaller than the packets of other Internet flows but are transmitted faster, which makes P2P flow identification more difficult for the FSP algorithm, whereas it is much easier for our algorithm to identify P2P flows.

Performance Evaluation in Real Environment
In this section, we present experiments that compare the performance of our algorithm with that of the port-based and host-behavior-based algorithms in a real environment. The performance metrics are memory access, which evaluates the searching performance, and identification precision, which evaluates the accuracy of the algorithms.

Experimental Environment.
To fully verify the practical performance of the packet classification algorithm, the algorithm and the rule sets are loaded onto the network traffic monitoring system, so that we can test the algorithms on actual network traffic and then improve our algorithm.
Figure 7 shows the deployment of the network traffic monitoring system at the export link of the campus network. The system consists of the traffic monitoring sensors, the traffic data collector, the data storage center, the data analysis center, and the remote browser. The traffic monitoring probe is deployed near the routers, the network servers, and other network equipment. It mirrors the data packets and, according to the packet classification algorithms, identifies them as application-layer service traffic; the experimental data are the real network traffic of the campus network. We use the SmartBits 2000 network test platform to test the performance of the algorithms and to further improve our algorithm and its efficiency in practical application.
Below we use two groups of experiments to test and analyze the performance of the algorithms.

The Evaluation on Speed and Accuracy.
Firstly, the first group of experiments evaluates the speed of the three algorithms under the same experimental configuration. As shown in Figure 8, compared with the port-based algorithm and the host-behavior-based algorithm, the average memory access of our algorithm drops by 66% and 47%, respectively. This experiment demonstrates that our algorithm identifies P2P flows quickly.
Secondly, the second group of experiments evaluates the accuracy of the three algorithms under the same experimental configuration. As shown in Figure 9, the port-based algorithm achieves an accuracy of 26.92% and the host-behavior-based algorithm an accuracy of 53.25%, while our algorithm achieves a high accuracy of 87.25%. This experiment demonstrates that our algorithm is suitable for actual deployment.

Conclusions
As the Internet brings efficiency and convenience to people's life, study, and work, it is becoming more and more important and influential, and a large number of network applications have come into being. The network carries not only abundant traditional applications such as Web, FTP, Email, and Telnet but also a mass of new services, for example, P2P, streaming media, virtual reality, and interactive online applications. The wide variety of network applications and the large number of Internet users have made the composition of Internet flows increasingly complex. As a result, Internet flow identification technology has developed rapidly.
In this paper, an efficient P2P flow identification scheme based on multilevel bloom filters is proposed. Through

Figure 2: Statistics of the average arrival time interval.
Figure 4: The multilevel bloom filters' architecture of the flow identification scheme. (Panel labels: bloom filter with 0-30s flows; bloom filter with 30-90s flows; third-level bloom filter with all flows. Arrows: move aging flows to second level; move aging flows to third level; move active flows to first level; move active flows to second level.)
If the Pearson correlation value $\rho$ satisfies $\rho \geq \theta$, where $\theta$ is a threshold, then $x_{t+i} \in X_{new}$. If $\rho < \theta$, then $x_{t+i} \notin X_{new}$. The value of $\theta$ can be adjusted according to the real situation. First, the Minkowski distance $d(x_{t+i}, c)$ between the data $x_{t+i}$ in $\{x_t, \ldots, x_{t+T}\}$ and the historical cluster center $c$ is calculated. Then the probability $P(x_{t+i}, c)$ that $x_{t+i}$ and $c$ are assigned to the same cluster and the Pearson correlation function $\rho(x_{t+i}, c)$ are calculated. The product of the three is regarded as the deviate cost of $x_{t+i}$ with respect to the historical cluster sample $C$. The deviate cost function is as follows:

$$Dc(x_{t+i}, c) = d(x_{t+i}, c) \cdot P(x_{t+i}, c) \cdot \rho(x_{t+i}, c).$$

(1) A packet $x_{t+1}$ is produced by the Internet and added to the time series sample $X$ at time $t + 1$, and subsequent packets are handled in the same way. At time $t + 2T$, the historical cost function $Hc$ of the sample $\{x_{t+1}, \ldots, x_{t+2T}\}$ is calculated. If $Hc \leq \varepsilon$, where $\varepsilon$ is the threshold parameter, a new cluster $C'$ is formed, and a new cluster center $c'$ and a new variance $\sigma'$ are calculated.
(2) To better reflect the changes of new time series samples and to produce new clusters faster and better, the cluster quality function $Q(c', c)$ is calculated to update the parameter $\varepsilon$ appropriately.