Obfuscated Tor Traffic Identification Based on Sliding Window

Tor is an anonymous communication network used to hide the identities of both parties in communication. Apart from those who use Tor to browse the web anonymously for benign purposes, criminals can use Tor for criminal activities. Because Tor traffic is easily intercepted by censorship mechanisms, Tor uses a series of obfuscation mechanisms to avoid censorship, such as Meek, Format-Transforming Encryption (FTE), and Obfs4. In order to detect Tor traffic, we collect three kinds of obfuscated Tor traffic and then use a sliding window to extract 12 features from each stream identified by its five-tuple, including the packet length, the packet arrival time interval, and the proportion of the number of bytes sent and received. Finally, we use XGBoost, Random Forest, and other machine learning algorithms to identify obfuscated Tor traffic and its types. Our work provides a feasible method for countering the obfuscated Tor network: it identifies the three kinds of obfuscated Tor traffic and achieves about 99% precision and recall.


Introduction
With the rapid development of Internet technology and the explosive growth of information data, users pay more and more attention to their personal privacy. Although the commonly used HTTPS protocol can ensure that a visitor's communication data is not eavesdropped on by a third party, it cannot hide the visitor's identity. Therefore, anonymous communication technology was proposed to protect the communication data and help users conceal their IP addresses and other private information. Anonymous communication technology has developed from the Mix [1] technology to the commonly used Tor (The second-generation Onion Router) [2], I2P (Invisible Internet Project), Freenet [3], and so on. Besides, some blockchain-based anonymous communication techniques have been proposed in recent years.
Tor is the most popular of these anonymous communication technologies. It is a type of overlay network, that is, a network that uses software to create layers of network abstraction that can be used to run multiple separate, discrete virtualized network layers on top of the physical network. Volunteers from various countries and regions around the world provide the onion routers that make up the Tor network. When a user wants to connect to the Tor network, he randomly selects three onion routers via the onion directory servers, handshakes with them in turn to establish a communication circuit, and can then access the Internet anonymously through this circuit.
In addition to providing anonymous access to the Internet, the Tor network also provides hidden services, the so-called darknet. Much illegal content that is not allowed to appear on the surface web (i.e., the Internet) floods the darknet, including drug dealing, gun dealing, and gambling websites. In addition, the first ransomware Curve-Tor-Bitcoin Locker, which appeared in mid-2014, used the Tor network to hide its traffic [4]. In other words, some people use Tor to protect their privacy and access websites anonymously, whereas others use it to hide their illegal purposes and commit criminal activities. Therefore, it is necessary to identify and detect Tor traffic.
In the Tor network, information about the onion routers is stored in directory servers, including the list of active onion routers, their IP addresses, their locations, and their current public keys. This information is public and accessible to everyone, so Tor is very easy to block by closing connections to the onion routers' IP addresses or to port 9001, which Tor commonly uses. In order to improve the availability and anonymity of the network, Tor introduces many methods to bypass censorship, including Tor Bridge mechanisms, Meek-based obfuscation [5], Format-Transforming Encryption (FTE)-based obfuscation [6], and Obfs4-based obfuscation [7].
These obfuscation methods make it difficult to detect and identify Tor traffic.
Since the darknet can be used to commit criminal activities, we should be capable of blocking Tor connections. This capability requires approaches to identify Tor traffic, because detecting Tor traffic is a prerequisite for blocking it. What is more, the various obfuscation mechanisms introduced by Tor require us to detect obfuscated Tor traffic. Existing research only identifies the traffic obfuscated by one of the obfuscation mechanisms. Therefore, we propose an approach based on a sliding window to identify different kinds of obfuscated Tor traffic. In other words, we distinguish Meek-based, FTE-based, and Obfs4-based Tor traffic from normal traffic. Besides, prior works do not make any obfuscated Tor traffic dataset publicly available for the research community to use and build upon. In summary, this paper makes the following contributions:
(1) Utilizing a sliding window to split TCP flows instead of extracting features from a single overall flow, which reduces the number of data packets required for detection and improves real-time detection.
(2) Conducting detailed experiments and analysis to show the effectiveness of the features extracted from the sliding window.
(3) Establishing a general multiclassification model with universal features to detect three kinds of obfuscated Tor traffic, which achieves about 99% precision and recall.
(4) Collecting a large amount of the three kinds of obfuscated Tor traffic and publishing it on the web (https://github.com/QQQQing/Obfuscated-Tor-Traffic).
The rest of the paper is organized as follows. Section 2 discusses related work in Tor traffic identification. Section 3 introduces the features used in the model and shows their effectiveness. Section 4 elaborates on the experiment, including the models, the dataset, and the metrics used to evaluate the models' performance. Section 5 discusses the computational complexity and the limitations of our method. Finally, Section 6 concludes the paper.

Related Work
At present, traffic detection commonly uses three methods: the port-based method, the DPI (Deep Packet Inspection)-based method, and the flow feature-based method. However, the first two techniques are not suitable for detecting obfuscated Tor traffic for the following two reasons: (1) the commonly used port of onion routing nodes has changed from 9001 to 443, which is the same port used by SSL/TLS protocols; (2) the Tor traffic is encrypted, so there is no special keyword in the packets. Therefore, a flow feature-based method is appropriate for this problem. A flow feature-based method is usually combined with deep learning or machine learning algorithms, which are widely used in Intrusion Detection Systems (IDS). For example, Swarna Priya et al. [8] used deep neural networks to develop an effective and efficient IDS in the IoMT (Internet of Medical Things) environment. Khan et al. [9] presented a framework based on decision trees that effectively detects P2P botnets.
Since more and more methods have been developed to detect nonobfuscated Tor traffic, Tor utilizes obfuscation to improve its reliability and anonymity. Each of these obfuscation methods has its own mechanism to protect Tor traffic from detection. As mentioned above, Meek-based, FTE-based, and Obfs4-based obfuscation are three officially supported plugins. Next, we briefly introduce their working mechanisms and the corresponding detection methods proposed by existing research. In Table 2, we analyze different approaches for obfuscated Tor traffic identification.

Meek-Based Obfuscation.
The key to Meek-based obfuscation lies in the use of domain fronting, which uses different domain names at different communication layers [5] and tunnels traffic to avoid censorship. There are three entities involved in the communication: a Tor client with the Meek plugin, a fronted server with an allowed domain name provided by a cloud service provider, and a Tor server with the Meek plugin. When a client wants to connect to the Tor network, it encapsulates the Tor request in a TLS layer with the domain name of the fronted server in the header and then sends the request to the fronted server. After the fronted server receives the packet, it unpacks the internal request and sends it to the Tor server. Since the server is not allowed to push data to the client actively, the client needs to continuously poll the fronted server to check whether the Tor server has sent data back, and it finally obtains the response content. In short, Meek-based obfuscation avoids censorship by using a cloud server with an allowed domain name to forward requests to Tor; as a result, the traffic looks like ordinary cloud service traffic. The polling mechanism results in a large number of shorter packets appearing in the communication process, which is an obvious characteristic. Based on this feature, Yao et al. [20] proposed a method based on a Mixture of Gaussians-based Hidden Markov Model (MGHMM), which characterizes the interpacket time distribution and the packet size density distribution of flows. He et al. [21] summarized the connection characteristics based on the polling mechanism, including the static and dynamic features of the flows, and then applied SVM to identify Meek-based Tor traffic.

Obfs4-Based Obfuscation.
Obfs4 is the latest plugin of the Obfs (the first four letters of "obfuscation") proxy, which uses encryption algorithms to disguise Tor traffic as ordinary encrypted traffic such as SSL/TLS traffic. The main practice is to use ECC (Elliptic Curve Cryptography) to encrypt the data and randomly pad the payload, which changes the size of the packet and conceals the packet length-related features in the flows. After random packet length padding, only the recipient with the key can derive the correct packet length value and then reassemble the packet correctly. Obfs4 strengthens its random characteristics, so Gao [22] used a randomness test to identify Obfs4-based Tor traffic and achieved good results.

FTE-Based Obfuscation.
The core idea of FTE-based obfuscation is to use regular expressions to replace the bytes appearing in Tor traffic. For example, a regular expression of HTTP protocol keywords can be used to disguise Tor traffic as HTTP traffic to deceive the DPI system. However, the characteristics of the flows do not change significantly. Therefore, Zhai [23] used flow features to characterize the traffic and used a machine learning algorithm to identify FTE-based Tor traffic.

Table 2 (excerpt): [19]; interpacket times; hidden Markov models; self-collected dataset; precision and F-measure.

Analysis of Traffic Characteristics.
Traffic identification technology based on flow features usually uses inner-packet interval and packet size features, because different protocols are likely to have a unique distribution of inner-packet interval and packet size, which can characterize the protocol well. The differences between the three types of obfuscated Tor traffic and normal traffic on these two types of features are caused by the following:
(1) The Meek client polls the fronted server for the response, so there will be a large number of shorter packets in Meek-based Tor traffic.
(2) There are limited onion routers in the Tor network, so each relay handles a large number of requests, resulting in a relatively larger inner-packet interval.
(3) Each packet needs to be encrypted (n + 1) times before being sent out, then decrypted by n relays in the circuit selected by the client, and finally decrypted by the target server. The parameter n refers to the number of relays. This encryption and decryption process causes the inner-packet interval of Tor traffic to be generally larger than that of normal traffic.
In order to verify whether the obfuscated Tor traffic is different from normal traffic, we conduct the following experiment. Firstly, we extract flows identified by the five-tuple (source IP address, source port number, destination IP address, destination port number, and protocol) from each traffic file and keep only the TCP flows. Then, for each packet in a flow, we calculate its size and the time interval from the previous packet, which is called the inner-packet interval. At last, the inner-packet intervals and packet sizes of all flows in each kind of traffic are summarized, and a CDF (Cumulative Distribution Function) curve is drawn. The definition of the CDF is $F(x) = P(X \le x)$. It means that, for a random variable $X$ and a specific value $x_0$, the value of $F(x_0)$ is the total probability of all values less than or equal to $x_0$.
We take the inner-packet interval as an example to illustrate how to draw a CDF curve. The inner-packet interval can be considered continuous, so we map it to a set of consecutive intervals to discretize the values. We then put each inner-packet interval into the corresponding interval and record the number of inner-packet intervals that fall into every interval. In the actual experiment, we make the intervals shorter so that the CDF curve has higher precision.
The inner-packet interval CDF curve and packet size CDF curve of each traffic type are shown in Figures 1 and 2. From these eight CDF curves, we can observe the following characteristics:
(1) In terms of packet size, numerous small-size packets appear in Meek-based Tor traffic, accounting for about 90% of its packets. About 65% of the packets in normal traffic are small, while Obfs4-based Tor traffic and FTE-based Tor traffic have the lowest proportion of small-size packets, only 40%.
(2) In terms of the inner-packet interval, the inner-packet interval of normal traffic is generally shorter: about 95% of its inner-packet intervals are less than 0.2 seconds. The proportion of short intervals in Meek-based Tor traffic is slightly lower, and Obfs4-based Tor traffic and FTE-based Tor traffic have the fewest short-interval packets.
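The binning procedure described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `empirical_cdf`, the fixed bin width, and the sample intervals are hypothetical and are not taken from the paper's own code or data.

```python
from bisect import bisect_right


def empirical_cdf(values, bin_width, upper):
    """Return (edges, cdf), where cdf[i] is the fraction of values <= edges[i].

    Assumption: bins have equal width; the text only says that shorter
    bins give a more precise curve.
    """
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    steps = int(round(upper / bin_width))
    edges = [bin_width * (i + 1) for i in range(steps)]
    # bisect_right counts how many sorted values are <= the bin edge.
    cdf = [bisect_right(ordered, edge) / len(ordered) for edge in edges]
    return edges, cdf


# Hypothetical inner-packet intervals in seconds (illustrative only).
intervals = [0.01, 0.02, 0.05, 0.15, 0.30, 0.90]
edges, cdf = empirical_cdf(intervals, bin_width=0.1, upper=1.0)
```

Plotting `cdf` against `edges` gives a step approximation of $F(x)$; reducing `bin_width` refines the curve in the way the text describes.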
In other words, we can distinguish Meek-based Tor traffic easily from the other two kinds of obfuscated Tor traffic. Observing Obfs4-based Tor traffic and FTE-based Tor traffic, we find that their inner-packet interval distribution is similar, but there are certain differences in packet size distribution. We will discuss how our method can distinguish these two types of traffic in Section 3.2.

Analysis of Feature Extraction with Sliding Window.
This section analyzes whether the packet size feature extracted in the sliding window can better distinguish Obfs4-based Tor traffic from FTE-based Tor traffic.
At present, many traffic identification technologies treat a single flow as a whole and extract features from it. Here, a flow is defined as all packets identified by the same five-tuple. However, this approach causes an obvious problem: it is hard to achieve real-time detection. As long as we treat a flow as a whole, we face the following dilemma:
(1) If we extract features before a flow ends, the extracted features will be inaccurate.
(2) If we extract features after a flow ends, detection of the flow will have high latency.
Therefore, we split the flow with a sliding window and treat the packets in the sliding window as a whole to extract features. To show the effectiveness of extracting features with a sliding window, we split the packets of Obfs4-based Tor traffic and FTE-based Tor traffic with a sliding window. Here, we temporarily set the size of the sliding window to 100 packets and its sliding distance to ten packets. After splitting the flow, we obtain a large number of packet sets, each consisting of 100 packets. Next, we randomly select 1000 packet sets from each of Obfs4-based Tor traffic and FTE-based Tor traffic and extract the vector of packet sizes from each packet set. Finally, we handle these vectors as time series and use the DTW (Dynamic Time Warping) algorithm [24] to calculate the distance between these series. The advantage of using the DTW algorithm to calculate the distance between time series is that the elements of the two target series do not have to correspond one to one, so the distance between two series can be measured more accurately.
Specifically, we denote the set of packet size vectors of Obfs4-based Tor traffic as $X = \{x_1, x_2, \ldots, x_{1000}\}$ and that of FTE-based Tor traffic as $Y = \{y_1, y_2, \ldots, y_{1000}\}$, where $x_i$ and $y_j$, $1 \le i, j \le 1000$, respectively, represent the packet size vectors of the two types of Tor traffic, each of length 100. Next, we calculate the distances between every two vectors of $X$ and $X$, $Y$ and $Y$, and $X$ and $Y$ using the DTW algorithm, which gives three matrices of size 1000 × 1000. For instance, the matrix $M_{XY}$, which consists of the distances between every two vectors of $X$ and $Y$, can be calculated as
$$M_{XY} = \begin{bmatrix} \mathrm{DTW}(x_1, y_1) & \cdots & \mathrm{DTW}(x_1, y_{1000}) \\ \vdots & \ddots & \vdots \\ \mathrm{DTW}(x_{1000}, y_1) & \cdots & \mathrm{DTW}(x_{1000}, y_{1000}) \end{bmatrix}$$
Similarly, we can calculate the matrices $M_{XX}$ and $M_{YY}$. In order to show the difference in vector distances between different kinds of Tor traffic more intuitively, we use a heatmap to visualize the matrices. The value of each pixel of the heatmap ranges from 0 to 200,000, and each pixel denotes the DTW distance between two vectors. Therefore, the brighter the pixel is, the larger the distance it represents. Figures 3(a) and 3(b) have relatively darker colors than Figure 3(c). This means that the distance between vectors of the same type of Tor traffic is smaller than that between vectors of different types of Tor traffic. In short, after the flows are split into sets of packets, the packet size feature extracted from each set of packets can reflect the difference between different types of obfuscated Tor traffic, especially between Obfs4-based Tor traffic and FTE-based Tor traffic.
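As a rough illustration of the distance computation, the sketch below implements the classic dynamic-programming form of DTW and builds a pairwise distance matrix. The absolute difference used as the local cost and the helper names `dtw_distance` and `pairwise_dtw` are assumptions made for this example; the paper only cites [24] for the algorithm and does not state which variant it uses.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW between two numeric sequences.

    Assumption: the local cost is the absolute difference |x - y|.
    """
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # the empty prefixes align at zero cost
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def pairwise_dtw(xs, ys):
    """Matrix whose entry [i][j] is the DTW distance between xs[i] and ys[j]."""
    return [[dtw_distance(x, y) for y in ys] for x in xs]


# Hypothetical packet-size vectors (the paper uses vectors of length 100).
X = [[60, 60, 1500, 1500], [60, 1500, 1500, 60]]
Y = [[600, 600, 600, 600], [590, 610, 600, 600]]
m_xy = pairwise_dtw(X, Y)
m_xx = pairwise_dtw(X, X)
```

With 1000 vectors of length 100 on each side, this pure-Python version performs about 10^10 inner steps per matrix, so an optimized DTW library would be preferable in practice.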

Feature Extraction.
According to the analysis in Sections 3.1 and 3.2, it is effective to identify the obfuscated Tor traffic with the inner-packet interval-related and packet size-related features. Next, we will introduce how these features are extracted.
The process of feature extraction is shown in Figure 4, and its steps are as follows:
(1) Extract flows from the traffic file (e.g., pcap files) according to the five-tuple (source IP address, source port, destination IP address, destination port, and protocol), and record the direction, length, and occurrence time of each packet in the flows.
(2) Split the flows into sets of packets with a sliding window whose window size and sliding distance equal n packets and m packets, respectively, so that each set consists of n packets.
(3) Extract the 12 features listed in Table 3 from each packet set. In particular, the inner-packet interval is the difference between the occurrence times of two adjacent data packets.
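Steps (2) and (3) above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the packet tuple layout, the function names, the use of the population standard deviation, and the exact definition of the four proportion features (share of received and sent packets, share of received and sent bytes) are assumptions, since Table 3 is not reproduced in the text.

```python
from statistics import mean, pstdev


def sliding_windows(packets, n, m):
    """Yield windows of n consecutive packets, advancing m packets each time."""
    for start in range(0, len(packets) - n + 1, m):
        yield packets[start:start + n]


def window_features(window):
    """Compute 12 features for one packet set.

    Each packet is assumed to be (timestamp_seconds, size_bytes, direction),
    with direction "in" (client receives) or "out" (client sends).
    """
    if len(window) < 2:
        raise ValueError("a window needs at least two packets")
    times = [p[0] for p in window]
    sizes = [p[1] for p in window]
    # Inner-packet interval: time difference between adjacent packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    in_sizes = [p[1] for p in window if p[2] == "in"]
    total_bytes = sum(sizes)
    in_pkt_share = len(in_sizes) / len(window)
    in_byte_share = sum(in_sizes) / total_bytes if total_bytes else 0.0
    return [
        mean(gaps), max(gaps), min(gaps), pstdev(gaps),
        mean(sizes), max(sizes), min(sizes), pstdev(sizes),
        in_pkt_share, 1.0 - in_pkt_share,
        in_byte_share, 1.0 - in_byte_share,
    ]


# Hypothetical flow of 10 packets, alternating direction and size.
flow = [
    (0.1 * i, 100 if i % 2 == 0 else 200, "out" if i % 2 == 0 else "in")
    for i in range(10)
]
samples = [window_features(w) for w in sliding_windows(flow, n=4, m=2)]
```

Here a 10-packet flow with n = 4 and m = 2 yields four packet sets; with the parameters used later in the paper (n = 100, m = 10), a new sample becomes available every 10 packets once the first 100 have arrived.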

Dataset.
The existing public dataset of Tor traffic is the Tor-nonTor (ISCXTor2016) dataset [5], provided by the University of New Brunswick. Its authors defined a set of tasks to generate a representative dataset of actual traffic and gave detailed labels. However, this dataset only contains nonobfuscated Tor traffic, so it cannot be applied in our experiments. Thus, we collected normal traffic, Meek-based Tor traffic, Obfs4-based Tor traffic, and FTE-based Tor traffic as our experimental dataset. The collection architecture is shown in Figure 5. We deployed Meek-based, Obfs4-based, and FTE-based Tor clients on six cloud servers, two for each type. Then we wrote scripts that access the Internet to generate network traffic automatically. We used TShark (a command-line tool for the network analysis tool Wireshark) to capture traffic, store it as network traffic files in pcap format, and centrally upload it to the data center server every day. What is more, we recorded the IP addresses associated with the different kinds of obfuscated Tor traffic.
We collected traffic twice and used the two collections as the training set and the test set. The two collections differ in both the collection time and the websites visited.
Security and Communication Networks
Next, we process the data according to the feature extraction method described in Section 3. Here, one piece of data refers to the 12 features extracted from a packet set of 100 packets, together with the corresponding label (i.e., Meek-based Tor, Obfs4-based Tor, FTE-based Tor, or normal traffic). The summary of the different types of traffic in the two parts of the dataset is shown in Table 4. It should be noted that the experiments in this paper are all based on this dataset.

Evaluation Metrics.
In this experiment, we set the following metrics to indicate whether the model can accurately predict samples and to reflect whether the model can perform well on the training set, validation set, and test set.

Confusion Matrix.
Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class [25]. From the confusion matrix, we can get the TP (instances which are actually true and predicted to be true), FP (instances which are actually false but predicted to be true), TN (instances which are actually false and predicted to be false), and FN (instances which are actually true but predicted to be false) for each category. These values are used for calculating other metrics such as precision and recall. For a four-category problem, the confusion matrix and the TP, FP, TN, and FN values of category 2 are shown in Table 5. Briefly, let the confusion matrix be $M$, the total number of categories be $n$, and the target category be $i$; we can calculate TP, FP, TN, and FN as equations (4)-(7):
$$TP_i = M_{ii} \tag{4}$$
$$FP_i = \sum_{j=1,\, j \ne i}^{n} M_{ji} \tag{5}$$
$$TN_i = \sum_{j=1,\, j \ne i}^{n} \; \sum_{k=1,\, k \ne i}^{n} M_{jk} \tag{6}$$
$$FN_i = \sum_{j=1,\, j \ne i}^{n} M_{ij} \tag{7}$$

Precision.
The fraction of correctly predicted positive instances among all predicted positive instances is shown in equation (8):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$

Recall.
The fraction of correctly predicted positive instances among all actually positive instances is
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Among the four metrics, when the categories are unbalanced, the accuracy rate cannot reflect the performance of the model well. In contrast, the precision rate, recall rate, and F1-Score can reflect whether the model is effective even when the number of positive instances and the number of negative instances differ greatly. Therefore, we are more inclined to use the precision rate, recall rate, and F1-Score as evaluation indicators.
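The per-category counts and the precision and recall rates described above can be derived from a confusion matrix as in the following sketch. It follows the stated convention that rows are actual classes and columns are predicted classes; the function names and the 4 × 4 matrix values are hypothetical and are not results from the paper.

```python
def per_class_counts(matrix, i):
    """Return (TP, FP, TN, FN) for category i.

    Convention from the text: rows are actual classes, columns are
    predicted classes.
    """
    n = len(matrix)
    tp = matrix[i][i]
    fp = sum(matrix[r][i] for r in range(n) if r != i)  # column i, off-diagonal
    fn = sum(matrix[i][c] for c in range(n) if c != i)  # row i, off-diagonal
    total = sum(sum(row) for row in matrix)
    tn = total - tp - fp - fn
    return tp, fp, tn, fn


def precision_recall(matrix, i):
    """Precision and recall of category i (0.0 when the denominator is zero)."""
    tp, fp, _tn, fn = per_class_counts(matrix, i)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical four-category confusion matrix (illustrative values only).
confusion = [
    [50, 2, 1, 0],
    [3, 40, 5, 2],
    [0, 4, 45, 1],
    [1, 0, 2, 47],
]
counts_cat2 = per_class_counts(confusion, 1)  # category 2 has index 1
precision_cat2, recall_cat2 = precision_recall(confusion, 1)
```

Computing the counts one category at a time in this way is what allows a single four-category model to be reported with separate precision and recall rates for each traffic type.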

Algorithm Selection.
In recent years, machine learning and deep learning have played an extremely important role in many fields, such as face recognition [26] and intrusion detection [27]. They are effective for both classification problems and regression problems. Identifying three kinds of obfuscated Tor traffic is a multiclassification problem. Compared with deep learning, which requires more powerful hardware, machine learning is faster in the training phase and in actual detection. Since the labels in this experiment are relatively easy to obtain, we can use a supervised machine learning algorithm to solve this problem.

Figure 4: The process of feature extraction. First, we split the TCP flow into sets of packets. Then, for each packet in each packet set, we extract its inner-packet interval, packet size, and direction (in or out). At last, we calculate 12 features of each packet set: the mean, max, min, and standard deviation of the inner-packet interval; the mean, max, min, and standard deviation of the packet size; and the proportion of client receiving and sending packets and bytes.

Experiment for the Parameters of Sliding Window.
The window size and sliding distance of a sliding window affect the results of the models, including the model precision rate, the feature extraction time, and the model prediction speed. We take different values for the window size n and the sliding distance m, and we use the Random Forest algorithm to build a model for testing.
In this experiment, the sliding distance m ranges from 5 to 50 and the window size n ranges from 50 to 500. The experimental results are shown in Table 6. The higher the precision rate, the shorter the feature extraction time, and the shorter the mean model prediction time, the better the parameters. Therefore, the best parameters of the sliding window are a window size of 100 and a sliding distance of 10. It is worth noting that there are only minor differences between different parameters.
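To make the role of the two parameters concrete, the sketch below counts how many packet sets a flow produces for a few (n, m) pairs. The specific grid values and the 1000-packet flow length are assumptions chosen inside the ranges stated above; the paper's actual grid is given in Table 6, which is not reproduced here.

```python
def num_samples(flow_len, n, m):
    """Number of packet sets produced by a flow of flow_len packets
    with window size n and sliding distance m."""
    if flow_len < n:
        return 0
    return (flow_len - n) // m + 1


# Hypothetical parameter pairs within the stated ranges
# (n from 50 to 500, m from 5 to 50).
grid = [(50, 5), (100, 10), (500, 50)]
flow_len = 1000  # assumed flow length in packets

# Map each (n, m) pair to the number of samples the flow yields.
summary = {(n, m): num_samples(flow_len, n, m) for n, m in grid}
```

A smaller n lets the first decision be made after fewer packets, and a smaller m yields more samples per flow at the cost of more feature extraction work, which is the trade-off the experiment measures.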

Experiment Results and Analysis.
Figure 5: Architecture of traffic collection. We used six servers to collect obfuscated Tor traffic. We deployed Meek-based, Obfs4-based, and FTE-based Tor clients on these servers, two for each type. In addition, we uploaded data to a data center each day.

We use the seven machine learning algorithms selected in Section 4.3 to train the models with the training set and then use the validation set and the test set to evaluate them. The results are shown in Tables 7 and 8, respectively. From these two tables, we can conclude that the features extracted by the sliding window are so effective that models such as XGBoost and Random Forest achieve high precision and recall rates. What is more, the four tree-based models, namely XGBoost, GBDT, Random Forest, and the CART decision tree, all perform well. Even the worst-performing of them, the CART decision tree model, has a precision rate and recall rate of more than 90%. The XGBoost and Random Forest models have about 99% precision and recall rate in half of the categories. In addition, the KNN algorithm also performs well. In contrast, the two linear classifiers, logistic regression and support vector machine, perform poorly in this task, with a minimum precision rate of 61.72% and a minimum recall rate of 49.23%. All in all, nonlinear classifiers perform well in identifying obfuscated Tor traffic.
Besides, almost every model detects Meek-based obfuscated traffic more effectively than Obfs4-based and FTE-based Tor traffic. This can be attributed to the polling operation of the Meek plugin, which introduces many short packets into the communication and is a significant characteristic. The effectiveness on Obfs4-based and FTE-based Tor traffic is a bit lower because the packet size and interval-related characteristics of these two kinds of Tor traffic are similar. However, we distinguish them from each other by using the sliding window to split the packets, which leads to a high precision rate for detecting Obfs4-based and FTE-based Tor traffic. The results of our experiments are consistent with the analysis of the traffic characteristics in Sections 3.1 and 3.2.

Comparison.
In Section 2, we mentioned four works [20][21][22][23] that proposed detection methods for Meek-based Tor traffic, FTE-based Tor traffic, and Obfs4-based Tor traffic. In this section, we compare their methods with ours. The comparison result is shown in Table 9.
Our method has a high precision rate while being able to identify three types of obfuscated Tor traffic. Although its precision rate for Meek-based Tor traffic and Obfs4-based Tor traffic is slightly lower than that of the methods that can each detect only one kind of obfuscated Tor traffic, the gap is within 2%. The gap arises because our model is a four-category model. If the problem is reduced to a two-category problem, our model performs better. To show this, we conducted another experiment with only Meek-based Tor traffic and normal traffic; in other words, we created a two-category model using the Random Forest algorithm. We trained, evaluated, and tested the model with the same dataset as the four-category model. We achieved a higher precision rate of 99.95% on the validation set and 99.92% on the testing set, which is almost the same as the precision achieved in [20].

Computational Complexity.
Since the model using the Random Forest algorithm achieves the best performance on the testing set, we take it as an example to calculate the computational complexity. As mentioned in [32], the Random Forest algorithm has an average time complexity $T(k)$ of $O(Mdk \log k)$, where $M$ indicates the number of trees, $d$ indicates the number of features, and $k$ indicates the number of samples. The time complexity of our model also depends on the window size and sliding distance of the sliding window, which, respectively, equal $n$ and $m$. Thus, we can calculate the time complexity $T(k)$ as equation (12):
$$T(k) = O\left(Md \, \frac{kn}{m} \log \frac{kn}{m}\right) \tag{12}$$
Let $q = kn/m$; then we can rewrite equation (12) as equation (13):
$$T(q) = O(Mdq \log q) \tag{13}$$
In equation (13), $q$ indicates the actual number of samples in our experiments.
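As a small numeric illustration of this complexity expression, the sketch below substitutes q = kn/m into M·d·q·log q. The function names, the base-2 logarithm, and all numeric inputs (number of trees, k, n, m) are assumptions for illustration; the result is a growth estimate, not a measured running time.

```python
import math


def effective_samples(k, n, m):
    """q = k * n / m, the sample count used in the complexity expression."""
    return k * n / m


def rf_cost_estimate(num_trees, num_features, q):
    """Growth term M * d * q * log2(q); constant factors are ignored."""
    return num_trees * num_features * q * math.log2(q)


# Hypothetical values: 100 trees, the paper's 12 features, k = 1000,
# window size n = 100, sliding distance m = 10.
q = effective_samples(k=1000, n=100, m=10)
cost = rf_cost_estimate(num_trees=100, num_features=12, q=q)
cost_double_trees = rf_cost_estimate(num_trees=200, num_features=12, q=q)
```

The estimate grows linearly in the number of trees and features and slightly faster than linearly in q, so halving the sliding distance m roughly doubles the cost term.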

Limitation Analysis.
(1) Our method performs well on our own dataset, but we did not test it on other datasets because we could not find open datasets of obfuscated Tor traffic. We expect to test on such datasets when they become accessible. We believe that our algorithm is general and that the sliding window mechanism and feature extraction mechanism remain applicable to other datasets.
(2) More obfuscation mechanisms may be introduced in the future, or existing obfuscation mechanisms may be upgraded, which would be a new challenge for detection. We think that the sliding window mechanism helps to highlight traffic features, and we can adjust the feature extraction mechanism, for example by introducing SSL/TLS-related features, to alleviate the problems caused by new obfuscation mechanisms.

Conclusions
Tor has been used to hide the traffic of criminal activities on the darknet, so it is important to identify Tor traffic and block it when needed. Besides, Tor has introduced multiple kinds of obfuscation techniques to avoid traditional detection. However, existing detection methods only deal with one of the obfuscation techniques, and their detection granularity is the whole flow, which leads to relatively lower detection efficiency. Therefore, we proposed a sliding window-based method for detecting three kinds of obfuscated Tor traffic. We conducted a detailed analysis of the characteristics of different obfuscation mechanisms and analyzed the effectiveness of the extracted features, which was confirmed by the final experimental results and analysis.
We use the dataset collected by ourselves to conduct identification experiments. The high precision and recall rates in the test results show that the 12 features extracted by the sliding window can effectively identify three kinds of obfuscated Tor traffic. Compared with other methods, each of which can identify only one type of obfuscated Tor traffic, our models perform well even though they carry out a multiclassification task. In summary, the features extracted through the sliding window and the models we use can effectively identify the three kinds of obfuscated Tor traffic.

Data Availability
The dataset of multiple kinds of obfuscated Tor traffic, including FTE-based Tor traffic, Meek-based Tor traffic, and Obfs4-based Tor traffic, is publicly available at https://github.com/QQQQing/Obfuscated-Tor-Traffic.

Conflicts of Interest
The authors declare that they have no conflicts of interest.