An Encrypted Traffic Identification Scheme Based on the Multilevel Structure and Variational Automatic Encoder

School of Computer and Software, Nanjing University of Information Science & Technology, Nanjing, China School of Electronics & Information Engineering, Jiangsu University of Science & Technology, Zhenjiang, China School of Automation, Nanjing University of Science & Technology, Nanjing, China School of Information Engineering, Changzhou Vocational Institute of Mechatronic Technology, Changzhou 213164, China


Introduction
With the rapid development of network technology, more and more applications use encryption algorithms to ensure their information security. Because encrypted traffic cannot be recognized well, it is becoming a good carrier for network attacks. It is reported that attacks represented by botnets [1], APTs (Advanced Persistent Threats) [2], and worms are becoming increasingly fierce. Meanwhile, new means of communication are constantly emerging (covert communication, virtual protocol network, and tunnels), some of which may constitute misuse of the resources of an organization's network system. Such communication brings risk to the network, so identifying it is important for network management and security. Thus, the identification of encrypted traffic has attracted the attention of researchers.
The identification methods for encrypted traffic can be divided mainly into two types: feature-based and machine learning-based. The feature-based methods can be further divided into payload characteristics matching-based, host behavior-based, data packet distribution-based, and payload randomness-based methods [3]. Many results have been obtained so far [4,5]. A researcher from Cambridge University [6] proposed a feature matching model which identifies various applications by matching the features of the protocol. Its drawback is that the interaction phase of encrypted traffic and private protocols cannot be identified. Okada et al. [7] proposed a method which makes its decision by calculating the correlation between encrypted and unencrypted traffic. In this method, a 29-dimension feature vector is used in the machine learning algorithm. Although it performs well, the large number of features leads to a large amount of calculation. Shen et al. [8] proposed a model called SOB based on the second-order Markov chain. It uses the SSL/TLS protocol certificate length and the size of the first application data as features.
Its experiments verify the effectiveness of this method. An encrypted traffic identification method based on the weighted cumulation of time series has been proposed in [9]. Its experimental results show that it performs well in identifying encrypted traffic. However, because private protocols have no common criteria, the method becomes invalid for them.
Although the existing methods can achieve good performance on encrypted traffic, they may produce many false alarms when multimedia or compressed file transfer traffic is present. Meanwhile, because private encrypted protocols have no common criteria, the current models may have difficulty identifying them accurately.
Based on the abovementioned investigation, a new identification scheme with a two-level structure is proposed in this paper. The first level divides the traffic into encrypted and unencrypted traffic based on the entropy, the estimation of the Monte Carlo π value, and the C4.5 decision tree. It addresses the problem that the existing methods may produce false alarms when separating encrypted from unencrypted traffic. The second level uses a finer-grained approach to identify the traffic. In this level, first, the features selected automatically by the VAE (variational automatic encoder) and the common features used in existing methods are combined. Then, the mutual information algorithm is used to obtain the feature set that contributes most to classification. This avoids the efficiency and accuracy problems brought by the feature bias that feature redundancy causes. Finally, the experimental results show the effectiveness of the proposed scheme. The main contributions of this paper can be summarized as follows: (I) A two-level encrypted traffic identification scheme is proposed. The first level judges whether the traffic is encrypted or not, and the second identifies the specific application to which the encrypted traffic belongs. (II) In the first level, to overcome the shortcomings of the entropy method, the Monte Carlo π value is introduced. (III) The features used by the proposed scheme come from the VAE and from the existing common features. In addition, to reduce the calculation complexity, the mutual information algorithm is used to remove the features that contribute less.

The Analysis of the Network Packet Payload.
In network traffic, the contents transferred are all represented by characters composed of ASCII codes. The appearance of characters in network conversations generally follows the statistical rule that common characters appear more frequently than uncommon ones [10]. The characters in encrypted traffic appear with more randomness than those in unencrypted traffic. Thus, entropy-based methods are often used to identify encrypted traffic.

The Brief Introduction to Entropy.
Entropy was first proposed by C. E. Shannon. It is a measure of the number of possible arrangements [11]. The higher the entropy of an object, the more uncertain its state. Assuming that the number of possible events is $N$ and the probabilities of their occurrence are $p_1, p_2, \ldots, p_N$, the entropy is defined as follows:

$$H = -\sum_{i=1}^{N} p_i \log_2 p_i \quad (1)$$

From the abovementioned analysis of the payload randomness, the entropy may differ between encrypted and unencrypted traffic. In most situations, the entropy of encrypted traffic is larger than that of unencrypted traffic. Thus, researchers have proposed many entropy-based methods to identify whether traffic is encrypted or not. However, there are also situations in which unencrypted traffic has a large entropy value, which makes the entropy-based method invalid. Figure 1 gives an example in which the entropy-based method cannot tell whether the traffic is encrypted: the entropy of the compressed traffic is similar to that of the encrypted traffic.
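
The entropy of a payload can be computed directly from its byte frequencies. The following is a minimal sketch, assuming that each byte is treated as one character and that the logarithm is taken to base 2, so the result lies between 0 and 8 bits per byte; the function name `shannon_entropy` is illustrative and does not come from the original scheme.

```python
import math
from collections import Counter


def shannon_entropy(payload: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte."""
    if not payload:
        return 0.0
    total = len(payload)
    counts = Counter(payload)
    # H = sum(p_i * log2(1 / p_i)), which equals -sum(p_i * log2(p_i)),
    # taken over the byte values that actually occur in the payload.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    print(shannon_entropy(b"aaaaaaaa"))        # a single repeated byte value
    print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

A payload with one repeated byte value sits at the lower bound, and a payload in which all 256 byte values are equally frequent sits at the upper bound; both encrypted and compressed payloads tend toward the upper bound, which is the ambiguity discussed above.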

The Monte Carlo Simulation.
According to the principle of compression technology, short characters are used to replace long ones in order to maximize space savings. Compressed traffic therefore presents local randomness of characters [12]. The Monte Carlo simulation takes every n characters of the packet payload as one simulation point. The idea is that a circle is inscribed in a square; the first n/2 characters are taken as the x-coordinate of the point and the last n/2 characters as the y-coordinate. The π value is estimated from the number of points falling inside the circle, and the error between this estimate and the real π value is calculated. Figure 2 compares the π error of unencrypted compressed traffic with that of encrypted traffic. The π error of unencrypted compressed traffic is larger, while that of encrypted traffic is smaller. These results show that the π error can be used to distinguish whether the traffic is encrypted or not.
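
The π estimation can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: each character is one byte, the n/2 bytes of a coordinate are read as an unsigned big-endian integer, and the square has side length 256^(n/2) - 1. The function name `monte_carlo_pi_error` and the default n = 6 (the value reported as best in the experiments) are choices made for this sketch.

```python
import math


def monte_carlo_pi_error(payload: bytes, n: int = 6) -> float:
    """Estimate pi from consecutive n-byte groups and return |estimate - pi|.

    Each group gives one point: the first n/2 bytes form x and the last n/2
    bytes form y, both read as unsigned big-endian integers.
    """
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even number")
    half = n // 2
    side = 256 ** half - 1        # largest coordinate value, i.e. side of the square
    total = len(payload) // n     # number of complete points in the payload
    if total == 0:
        raise ValueError("payload is shorter than one point")
    inside = 0
    for i in range(total):
        chunk = payload[i * n:(i + 1) * n]
        x = int.from_bytes(chunk[:half], "big")
        y = int.from_bytes(chunk[half:], "big")
        # Inscribed circle: centre (side/2, side/2), radius side/2.
        # Both sides are scaled by 2 so the comparison uses exact integers.
        if (2 * x - side) ** 2 + (2 * y - side) ** 2 <= side ** 2:
            inside += 1
    estimate = 4.0 * inside / total   # area ratio circle/square is pi/4
    return abs(estimate - math.pi)
```

For uniformly random bytes the fraction of points inside the circle approaches π/4, so the returned error approaches zero as the payload grows, whereas structured payloads place points unevenly and yield a larger error.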

The Variational Autoencoder.
The variational autoencoder (VAE) is an unsupervised deep generative model proposed recently. Discriminative models are defined in contrast to generative models. Generally speaking, given the observed variable $X$ and the latent variable $z$, a discriminative model estimates $P(z|X)$, that is, the probability of the latent variable $z$ given the input observed variable $X$. A generative model, in contrast, is built on $P(X|z)$ and outputs the probability of the observed variable given the latent variable. The core idea of the VAE is to assume that the data are generated by some unobserved continuous random variables. For complicated models and large-scale data, the calculation cost of $P(z|X)$ is very high. Therefore, a distribution $Q(z|X)$, referred to as the encoder, is used to approximate the posterior $P(z|X)$. The Kullback-Leibler divergence is chosen as the similarity measure between $Q(z|X)$ and $P(z|X)$, which is given as follows:

$$D_{KL}[Q(z|X) \,\|\, P(z|X)] = E_{z \sim Q}[\log Q(z|X) - \log P(z|X)] \quad (2)$$

The following is obtained through some mathematical operations such as applying Bayes' rule and rearranging the terms:

$$\log P(X) - D_{KL}[Q(z|X) \,\|\, P(z|X)] = E_{z \sim Q}[\log P(X|z)] - D_{KL}[Q(z|X) \,\|\, P(z)] \quad (3)$$

When $X$ is given, $P(X)$ is a fixed value. Making $D_{KL}[Q(z|X) \,\|\, P(z|X)]$ as small as possible is therefore equivalent to making the right-hand side as large as possible. The first term on the right-hand side is the expected log-likelihood under $Q(z|X)$, and the second is a negative KL divergence. To obtain an optimal $Q(z|X)$ that is as close as possible to $P(z|X)$, the expected log-likelihood in the first term should be maximized, while the KL divergence in the second term should be minimized.
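
The KL term between the encoder distribution and the prior has a closed form in the common case where the encoder outputs a diagonal Gaussian and the prior P(z) is a standard normal distribution. The text does not state which distribution family is used, so this Gaussian form is an assumption of the sketch below, and the names `gaussian_kl` and `sample_latent` are hypothetical.

```python
import math
import random
from typing import List, Sequence


def gaussian_kl(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in nats (closed form)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def sample_latent(mu: Sequence[float], log_var: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -0.2], [0.0, -1.0]   # illustrative encoder outputs
    print(gaussian_kl(mu, log_var))
    print(sample_latent(mu, log_var, rng))
```

Under this assumption, training minimizes the reconstruction loss on samples drawn with `sample_latent` plus the value of `gaussian_kl`, which corresponds to maximizing the first term and minimizing the second KL divergence described above.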

Feature Analysis.
The features used in traffic identification mainly comprise traffic features, host features, session features, and behavior features. Among them, traffic features are used most often, and most of them are extracted from the transport or network layer. The traffic features are selected from the traffic within a certain period that shares the same five-tuple information. Traffic from different applications has different characteristics, such as time and upload/download amount [13]. The Moore data set is a publicly available data set for the study of network traffic classification. More than 100 of the 249 network flow attributes in the Moore data set are obtained through the Fourier transform. From the machine learning point of view, too few attributes cannot represent the characteristics of the samples. However, too many features bring about redundancy, which results in feature bias and reduces the classification efficiency [14]. Meanwhile, as network traffic grows, applying the Fourier transform to each network flow makes the computing load too heavy. Thus, in this paper, with easy access to attributes in mind, 23 network flow attributes commonly used for traffic identification are extracted; they are described in detail in Table 1.

The Mutual Information.
In probability and information theory, the mutual information of two random variables is a measure of their interdependence [15]. Mutual information is based on the concept of entropy. Entropy can be understood as the self-information of a variable, while mutual information expresses how much information one variable contains about the other. The larger $I(X;Y)$, the higher the correlation between the output category and the target category and the better the classification effect.
Formally, the mutual information of two discrete variables $X$ and $Y$ can be defined as follows:

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \quad (4)$$

In equation (4), $p(x, y)$ is the joint probability distribution function of $X$ and $Y$, and $p(x)$ and $p(y)$ are the marginal probability distribution functions of $X$ and $Y$, respectively.
In the machine learning field, the feature bias caused by feature redundancy not only degrades the classification effect but also increases the amount of calculation. Thus, mutual information is used to simplify the feature set. The greater the mutual information weight of a feature, the larger the contribution of that feature to classification.
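
For discrete variables, the mutual information can be estimated from co-occurrence counts. The sketch below is one possible estimator, not the authors' code: it assumes a base-2 logarithm and already discretized values (continuous flow features would have to be binned first), and the function name `mutual_information` is illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical I(X;Y) in bits for two equally long discrete sequences."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


if __name__ == "__main__":
    labels = ["a", "b", "a", "b"]
    print(mutual_information([0, 1, 0, 1], labels))  # feature determines the label
    print(mutual_information([0, 0, 1, 1], labels))  # feature independent of the label
```

Ranking the candidate features by this score against the class label and keeping the highest-scoring ones is the kind of selection the paragraph above describes.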

The Proposed Scheme
The whole scheme of the proposed method is shown in Figure 3. First, it divides the traffic into encrypted and unencrypted traffic. Then, the VAE is used to identify the specific application to which the encrypted traffic belongs.

Encrypted Traffic Filtered Based on Entropy and the Monte Carlo Method.
The current state-of-the-art algorithms cannot accurately and efficiently differentiate encrypted packets from compressed packets (such as .zip and .rar files), image packets, or video packets. They may identify unencrypted compressed or multimedia traffic as encrypted. They also perform poorly on private protocols. In this paper, an improved encrypted traffic identification model based on payload randomness is proposed. The proposed method uses the entropy and the Monte Carlo estimation value as the input feature vector of the classifier. The classifier used in this paper is the C4.5 decision tree. The flow chart of the abovementioned method is shown in Figure 3. The detailed steps are as follows:
Step I: the network traffic is captured according to the five-tuple. For TCP traffic, a link spans from the three-way handshake that starts with the SYN packet to the final FIN or RST packet. For UDP traffic, a link starts with the first packet received and ends when no packet has been received for 60 seconds.
Step II: the first packet of a link is extracted, and it is checked whether its length is larger than 1024 B. If it is, its payload is extracted; otherwise, the packet is discarded and the next packet is examined, until N packets have been extracted. Using the equal spacing algorithm, the Shannon entropy formula is applied to calculate the entropy value of the characters in the whole data packet.
Step III: every N characters of the extracted packet's payload are used as one Monte Carlo simulation point. The first N/2 characters are taken as the x-coordinate and the last N/2 characters as the y-coordinate. The π estimate is calculated from the number of points falling inside the circle, and its error with respect to the real π value is also calculated.
Step IV: the entropy value H and the error P of the Monte Carlo π estimate are standardized, and the two features are then input into the C4.5 decision tree classifier to obtain the classification results.
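
The packet selection of Step II and the standardization of Step IV can be sketched as follows. The text does not say which standardization is applied, so the z-score used here is an assumption; the helper names `select_payloads` and `standardize` are hypothetical, and the default window of 10 packets is simply the value at which the reported accuracy reaches 94.98%.

```python
import statistics
from typing import Iterable, List, Sequence

MIN_PAYLOAD_LEN = 1024  # bytes, the threshold given in Step II


def select_payloads(payloads: Iterable[bytes], n_packets: int = 10) -> List[bytes]:
    """Keep the first n_packets payloads that are longer than MIN_PAYLOAD_LEN."""
    selected: List[bytes] = []
    if n_packets <= 0:
        return selected
    for payload in payloads:
        if len(payload) > MIN_PAYLOAD_LEN:   # shorter packets are discarded
            selected.append(payload)
            if len(selected) == n_packets:
                break
    return selected


def standardize(values: Sequence[float]) -> List[float]:
    """Z-score standardisation: zero mean, unit population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:                           # constant feature: nothing to scale
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

In this reading, the entropy and π error are computed on the selected payloads of every link, each feature column is standardized over the training set, and the resulting two-dimensional vectors are passed to the decision tree.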

Encrypted Traffic Identification Method.
In order to further distinguish which application the filtered encrypted traffic belongs to, an encrypted traffic identification model based on the variational automatic encoder is proposed in this paper. The identified encrypted traffic data set is first preprocessed. Only the first n bytes of each data stream are kept, and a stream shorter than n bytes is padded with zeros. To prevent physical hardware from affecting the classification, the link layer data of the packets are dropped. Meanwhile, as the UDP header is 12 bytes shorter than the TCP header, 12 zero bytes are appended to the UDP header to eliminate the influence of this difference. To get the best classification effect, the extracted packet bytes are also normalized. The process is shown in Figure 3. Then, the model automatically extracts features through the VAE algorithm, and the mutual information algorithm is applied to these features together with the flow feature set to obtain the feature vector with the largest contribution to the classification. Finally, the feature set is input into the random forest classifier.
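
The preprocessing described above can be sketched as follows. Several details are assumptions made for illustration only: frames are untagged Ethernet II frames carrying IPv4, normalization divides each byte by 255, and n defaults to 1000 to match the 1000-dimension input reported for the VAE. The names `pad_udp_header` and `preprocess_stream` are hypothetical.

```python
from typing import Iterable, List

ETHERNET_HEADER_LEN = 14   # untagged Ethernet II header (assumption)
UDP_HEADER_LEN = 8         # bytes
UDP_PAD_LEN = 12           # 20-byte TCP header minus 8-byte UDP header
IPPROTO_UDP = 17           # protocol number of UDP in the IPv4 header


def pad_udp_header(ip_packet: bytes) -> bytes:
    """Insert 12 zero bytes after the UDP header of an IPv4 packet.

    Packets that are not IPv4/UDP are returned unchanged.
    """
    if len(ip_packet) < 20 or ip_packet[0] >> 4 != 4:
        return ip_packet
    ihl = (ip_packet[0] & 0x0F) * 4        # IPv4 header length in bytes
    if ip_packet[9] != IPPROTO_UDP:        # byte 9 is the protocol field
        return ip_packet
    split = ihl + UDP_HEADER_LEN
    return ip_packet[:split] + bytes(UDP_PAD_LEN) + ip_packet[split:]


def preprocess_stream(frames: Iterable[bytes], n: int = 1000) -> List[float]:
    """Drop link-layer headers, pad UDP headers, keep n bytes, scale to [0, 1]."""
    data = b"".join(pad_udp_header(f[ETHERNET_HEADER_LEN:]) for f in frames)
    data = data[:n].ljust(n, b"\x00")      # truncate, then zero-pad short streams
    return [b / 255.0 for b in data]
```

The returned list has exactly n entries for every stream, so all streams can be fed to a model with a fixed input dimension.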
Let $X = \{X_1, X_2, X_3, \ldots, X_n\}$ denote the network traffic set and $Y = \{Y_1, Y_2, \ldots, Y_m\}$ denote the set of all m network traffic types. The function of the identification model designed in this paper is to realize the mapping from set X to set Y, so as to identify encrypted traffic accurately. According to Figure 3, the detailed identification steps can be summarized as follows:
Step I: the preprocessed data are input into the VAE model. Then, the n-dimensional hidden layer variable Z of the VAE model is extracted.
Step II: the stream-level features related to time and packet length are extracted from the identified encrypted traffic data set to obtain the stream feature set.
Step III: the n-dimensional hidden layer variable Z obtained in Step I and the flow feature set obtained in Step II are input to the mutual information algorithm to obtain the feature vector with the largest contribution to classification. This step helps to reduce the feature dimension.
Step IV: the feature set obtained in Step III is input to the random forest classifier as the feature vector, the classifier parameters are tuned through cross validation, the optimal classifier model is obtained, and the decision is made.

Experiments and Analysis
In this section, the experimental environment, experimental data set, and the performance evaluation index are given.

Experimental Environment.
The configuration of the experimental computer used in this paper is as follows: Windows 7 Professional, Intel(R) Core(TM) i5-3230M CPU @ 2.60 GHz, 8 GB RAM. The third-party software and APIs used are as follows: VMware Workstation 12, Ubuntu 16.04, Wireshark 2.2.1, LibPcap, Scapy, Sklearn, and Tensorflow.

Experimental Data Set.
The experimental data used in this paper were all captured in our lab, which is located in Nanjing, Jiangsu Province, and whose ISP is the China Education and Research Network (CERNET). The detailed information of the data set is shown in Table 2. A total of 15,000 encrypted and 9,000 unencrypted traffic streams were collected. The encrypted traffic data set includes Skype, Gmail, SFTP down, Tor Twitter, YouTube, ICQ, and Facebook. The unencrypted traffic data set includes HTTP, FTP, and Socket file transfer (the file types include .txt, .zip, .doc, and .pdf).

Performance Evaluation Index.
In order to evaluate the performance of the algorithm objectively, the accuracy P (i.e., the precision), the recall R, and the F1-measure are selected as the three scoring references in this paper. The recall rate is the proportion of correct predictions among all actual positives. The F1-measure is a comprehensive evaluation index, which is defined as the harmonic mean of the accuracy rate and the recall rate. The three indices are calculated as follows:

$$P = \frac{T_P}{T_P + F_P}, \qquad R = \frac{T_P}{T_P + F_N}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

In the abovementioned formulas, $T_P$ represents the number of encrypted traffic samples that are correctly identified, $F_P$ indicates the number of unencrypted traffic samples that are wrongly identified as encrypted, and $F_N$ represents the number of encrypted traffic samples that are wrongly identified as unencrypted.
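
The three indices can be computed from the confusion counts with a few lines of code. The sketch assumes the usual definitions P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R), which match the verbal descriptions above; the function name `precision_recall_f1` and the convention of returning 0.0 for an empty denominator are choices made for this illustration.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (P, R, F1) from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0        # share of predicted positives that are correct
    r = tp / (tp + fn) if tp + fn else 0.0        # share of actual positives that are found
    f1 = 2 * p * r / (p + r) if p + r else 0.0    # harmonic mean of P and R
    return p, r, f1


if __name__ == "__main__":
    # Illustrative counts only, not taken from the experiments.
    print(precision_recall_f1(tp=90, fp=10, fn=30))
```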

Encrypted Traffic Identification Results Based on Load Randomness
(1) Experiments on the Relationship between the Detection Window Size and the Accuracy. The number of packets in the observation window has a great influence on the recognition rate of the model. If a packet is too short, it cannot reflect the randomness of the load, so only packets with a load greater than 1024 bytes are extracted. The experimental results are shown in Figure 4. At the beginning, the average accuracy of the recognition model increases with the number of data packets. When the number of data packets is small, the accuracy of the model is low: from the statistical point of view, the amount of data is not enough to fully reflect the characteristics of the network traffic. When the number of packets reaches 10, the average accuracy reaches 94.98%, after which it fluctuates up and down as the number of packets increases further.

(2) Experiment on the Relationship between the Number of Characters in the Coordinate Points and the Accuracy.
The number of characters in each coordinate point of the Monte Carlo simulation also affects the accuracy of the recognition model. The experimental results are shown in Figure 5. When the number of coordinate point characters is 2, the accuracy of the model is 89.02%, which is no different from that obtained using information entropy alone. As the number of characters in the coordinate point increases to 6, the accuracy of the model reaches its highest value and then decreases as the number of characters increases further. When the observation window of the recognition model is set to 6, the pseudorandom characteristics of unencrypted compressed traffic can be distinguished best.
(3) Comparison Experiments. The algorithms proposed in [10,16] are compared with the proposed one. The results are shown in Figures 6-8. The average accuracy, recall, and F1-measure of our method reach 94.98%, 90.05%, and 92.45%, respectively, which is the best performance among the compared methods. As encrypted traffic and unencrypted compressed traffic show similar characteristics in information entropy (especially for the .zip and .flv file types), using only the entropy value leads to misjudgments between them. The average accuracy, recall, and F1-measure of the model in [11] are 85.45%, 83.43%, and 84.42%. The proposed method is also better than the flow feature-based recognition model proposed in [16], whose average accuracy, recall, and F1-measure are only 92.34%, 88.50%, and 90.38%. This is because a recognition model based on flow characteristics cannot accurately handle byte filling of data packets or network flows that are too short, which leads to the difference between the models.

Encrypted Protocol Identification Based on VAE.
(1) The Experiment about the Traffic Length. The length of the data stream has a great influence on the recognition rate. The experimental results are shown in Figure 9. As the data length increases, the average accuracy also increases. When the length is larger than 1,000, the detector achieves good performance, with an average accuracy of about 97.86%. The dimension of the hidden layer Z also affects the average accuracy of the proposed method. The relationship between the dimension of the hidden layer and the average accuracy is shown in Figure 10.
From these results, when the dimension of the hidden layer Z is larger than 2, the average accuracy reaches 94.5%. When the dimension is 6, the average accuracy is the highest. The convergence rate of the model is also an important index. An experiment was therefore performed to test the trend of the accuracy and the loss rate during the training of the recognition model. The results are shown in Figure 11.
From these results, the loss rate of the proposed method decreases rapidly in the first 10 rounds of training. Then, it continues to decrease and finally becomes stable. This indicates that the model proposed in this paper converges quickly.

Security and Communication Networks
(2) Comparison Experiments. An encrypted traffic identification model based on the VAE is proposed in this paper. The VAE model is often used for malicious traffic monitoring [17]. The model parameters of the VAE are shown in Table 3. The input of the VAE model is a 1000-dimension vector of original bytes. The encoder has two fully connected layers. The first fully connected layer outputs a 256-dimension vector, and the second connects two output networks in a parallel structure. The final output of the encoder is a 46-dimension hidden layer variable Z. The decoder has two steps. The first step converts the abovementioned 46-dimension vector to a 256-dimension vector, and the second converts the 256-dimension vector to the 1000-dimension output vector. Then, the vector Z and the flow characteristics are used to calculate their mutual information. Finally, the 10 features with the largest weight are input to the random forest classifier.
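
To give a sense of the size of this network, the layer dimensions above can be turned into a parameter count. The sketch assumes that every layer is a plain fully connected layer with a bias term and that the two parallel output networks of the encoder each map 256 dimensions to 46 dimensions; neither detail is stated explicitly in the text, and the layer names are hypothetical.

```python
from typing import Dict, Tuple


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# (input dimension, output dimension) of each layer; names are illustrative.
LAYERS: Dict[str, Tuple[int, int]] = {
    "encoder_fc1": (1000, 256),
    "encoder_head_a": (256, 46),   # first of the two parallel output networks
    "encoder_head_b": (256, 46),   # second of the two parallel output networks
    "decoder_fc1": (46, 256),
    "decoder_fc2": (256, 1000),
}


def total_params(layers: Dict[str, Tuple[int, int]]) -> int:
    """Sum the parameter counts of all layers."""
    return sum(dense_params(n_in, n_out) for n_in, n_out in layers.values())


if __name__ == "__main__":
    for name, (n_in, n_out) in LAYERS.items():
        print(f"{name}: {dense_params(n_in, n_out)}")
    print("total:", total_params(LAYERS))
```

Under these assumptions almost all parameters sit in the first encoder layer and the last decoder layer, which is why the input length n has the largest effect on model size.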
In order to test and compare the performance of the proposed method, the most basic deep learning model MLP (a recognition model based on flow characteristics proposed in [18]), the recognition model based on CNN proposed in [19], and the multiple classifiers fusion-based method proposed in [20] are selected for comparison. The selected MLP model has one input layer with 784 neurons and two hidden layers with 256 and 64 neurons, respectively. The activation function is ReLU. The MLP model has one output layer with 16 neurons, whose activation function is softmax.
The experimental results are shown in Table 4. The average accuracy, recall rate, and F1-measure of the proposed model are 97.68%, 97.30%, and 97.49%, respectively. It has the best performance among all the compared models. MLP is the basic deep learning method, whose training process is relatively simple; its average accuracy, recall rate, and F1-measure are only 94.50%, 94.32%, and 94.41%. The average accuracy, recall rate, and F1-measure of the identification model based on the convolutional neural network and stacked autoencoder proposed in [19] are 95.60%, 95.34%, and 95.44%, respectively. Compared with the method proposed in [19], which relies on a deep learning algorithm to extract features automatically, our proposed method combines the automatic feature extraction of the VAE algorithm with domain knowledge of network traffic; thus, it can obtain the best sample features in the load sample feature vector. Comparison experiments between the proposed method and [18] have also been performed. The average accuracy, recall rate, and F1-measure of the model in [18] are 94.85%, 97.74%, and 94.30%. The proposed method achieves a higher accuracy and F1-measure than [18]. That is because the model in [18] only uses the lengths of the first n packets.
A similar work is proposed in [20]. Its method first drops the packets that have no payload and sets the burst threshold to 1 s. Then, it extracts several features and uses the fusion of multiple classifiers to give the results. Its average accuracy, recall rate, and F1-measure are 97.37%, 95.80%, and 96.58%, respectively, which is slightly lower than the proposed method. As it uses the time features of the traffic, it is vulnerable to network conditions.

Conclusions and Future Work
An encrypted traffic identification scheme based on a multilevel structure is proposed in this paper. In the first level, the traffic is divided into encrypted and unencrypted traffic by using the entropy and the Monte Carlo π value as classification features.
The experimental results show that the proposed method performs better than the existing methods. To identify the application within the encrypted traffic at a finer granularity, the automatic feature extraction of the VAE algorithm is combined with feature extraction based on network traffic domain knowledge. The feature set with the largest contribution to classification is then obtained through the mutual information algorithm, which avoids the feature bias problem. The comparison of the experimental results shows that the proposed method achieves better performance than the existing ones.
In the future, more network situations and applications should be considered in the second level of the proposed scheme.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.