PF: Website Fingerprinting Attack Using Probabilistic Topic Model

A website fingerprinting (WFP) attack enables identifying the websites a user is browsing even under the protection of privacy-enhancing technologies (PETs). Previous studies demonstrate that most machine-learning attacks need multiple types of features as input, thus inducing tremendous feature engineering work. However, we show an alternative. That is, we present Probabilistic Fingerprinting (PF), a new website fingerprinting attack that leverages only one type of features. They are produced by a mathematical model, PWFP, that combines a probabilistic topic model with WFP for the first time, motivated by the finding that a plain text and the sequence file generated from a traffic instance are essentially the same. Experimental results show that the proposed new features are more distinguishing than the existing features. In a closed-world setting, PF attains a better accuracy (99.79% at most) than prior attacks on various datasets gathered in the scenarios of Shadowsocks, SSH, and TLS, respectively. Besides, even when the number of training instances drops to as few as 4, PF still reaches an accuracy of above 90%. In the more realistic open-world setting, PF attains a high true positive rate (TPR) and Bayes detection rate (BDR), and a low false positive rate (FPR) in all evaluations, outperforming the other attacks. These results highlight that it is meaningful and possible to explore new features to improve the accuracy of WFP attacks.


Introduction
Nowadays, privacy is one of the most important concerns for online users. Hence, privacy-enhancing technologies (PETs) like Shadowsocks [1], SSH, etc., have been leveraged to guarantee people's privacy, including that of criminals who engage in illegal online activities. These unlawful activities severely impair society. For instance, in just two and a half years, the scale of illicit transactions on the black market site "Silk Road" reached about 1.5 billion U.S. dollars, gathering more than 4,000 illegal merchants and 150,000 anonymous users. A literature survey reveals that a website fingerprinting (WFP) attack can detect these activities by inferring the websites being visited. Hence, WFP plays an important role in fostering a peaceful society that is free of fear and violence. This goal is a part of the 17 sustainable development goals (SDGs) accepted by the United Nations General Assembly in 2015 [2]. The primary idea of WFP attacks can be summarized as follows. A local eavesdropper (i.e., an attacker) listens on the wire and intercepts the target user's network traffic. After that, he trains a classifier according to the statistical features of the traffic. Note that the features generally contain packet length, timing information, order information, and so on. Finally, the attacker leverages the classifier to identify the websites the user is surfing. The possible target user under WFP attacks might be anyone who is surfing the Internet, even under the protection of PETs.
As is well known, more types of features result in more tedious feature engineering work for WFP, which is undesirable. To avoid such work, researchers have turned to deep-learning techniques for help. Previous studies show that deep-learning attacks (e.g., Abe_SDAE [14], DF (deep fingerprinting) [19], and Tik-Tok [20]) usually utilize one type of feature, such as packet direction, and achieve a satisfying accuracy, which is better than that of traditional machine-learning attacks [19,20]. The difference between the two kinds of attacks is probably ascribed to their different ability of automatic feature learning. Thus, we conclude that introducing more types of features is not indispensable for reaching a good performance.
Unfortunately, although deep-learning attacks can avoid the tedious feature engineering work, they generally need a lot of computing resources, which requires an additional budget. Besides, previous research also indicates that a considerable number of training samples is necessary for deep-learning attacks to obtain the expected accuracy [16]. Inevitably, gathering enough training samples consumes a lot of time and is even more laborious. On top of that, since a WFP attack must frequently retrain its model to cope with the data staleness problem, the work of data gathering becomes heavier and tougher for a deep-learning attack.
In this case, traditional machine-learning attacks become meaningful and essential for WFP. Hence, the second alternative to reduce the work of feature engineering is to find one type of more effective features, which is the aim of representation learning. Thus, it is interesting to investigate whether a traditional machine-learning attack can reach a satisfying accuracy using only one type of features.
To the best of our knowledge, there already exist two traditional machine-learning attacks (i.e., CUMUL [5] and PHMM [13]) that only take one type of features as input. Unfortunately, they are inferior to deep-learning attacks in accuracy [19]. By careful dimensional analysis, we note that creating new features in PHMM and CUMUL does not involve dimensional change. In other words, the new features have the same physical significance (i.e., a length) as the existing features for the two attacks. This fact possibly explains why CUMUL and PHMM perform worse than deep-learning attacks. In this case, it is meaningful for us to devise a type of features with a different physical significance.
Thus, in this work, we propose a new type of feature, i.e., the topic probability vector, which is demonstrated to be highly effective. The proposed features have a different physical significance (i.e., a probability) from the existing features (i.e., a length). The new type of features is obtained by the PWFP model, which combines a typical probabilistic topic model, namely, Probabilistic Latent Semantic Indexing (PLSI), with WFP. Based on the new features, the Probabilistic Fingerprinting (PF) attack is proposed and evaluated. Evaluation results show that PF performs better than a deep-learning attack (i.e., DF) while using fewer features. This work is the first to indicate that a traditional machine-learning attack can beat a deep-learning attack. The major contributions and novelties of this paper are summarized as follows:
(1) For the first time, we reveal the essential similarity between a plain text and the sequence file of a traffic instance. Inspired by the finding, we create one type of features, i.e., the topic probability vector, each component of which has a special physical significance, namely, a probability. The new features are obtained by the PWFP model, which is based on PLSI. To the best of our knowledge, it is the first time that a probabilistic topic model is leveraged for WFP.
(2) We propose PF, which first introduces PLSI for WFP and creates a topic probability vector for each traffic instance. Based on the obtained vectors, a KNN classifier is applied to perform a website fingerprinting attack. To date, the topic probability vector has never been presented and used before. PF has a powerful ability to distinguish traffic instances gathered in various scenarios. It can dramatically reduce feature engineering work and the number of features needed while obtaining a better accuracy than a deep-learning attack, namely, DF. As far as we know, it is the first time that a traditional machine-learning attack beats a deep-learning attack.
(3) We show the superiority of the proposed type of features over the existing features by comparative evaluation. The effectiveness of different attacks is evaluated in the closed-world evaluations against various traffic, including Shadowsocks, SSH, and TLS. Among all attacks, PF performs the best. We also experiment on how the number of training instances affects the accuracy. Results show that PF only needs as few as four training instances to reach an accuracy of over 90%, which beats the others. This advantage is useful for addressing the data staleness issues in WFP.
(4) In the open-world evaluation, we use Precision-Recall curves to compare different attacks' performances to avoid the base-rate fallacy. PF achieves both a high recall and a high precision, substantially outperforming the other attacks. Besides, we investigate the impact of different ratios of the number of unmonitored training instances to the number of total unmonitored instances on TPR, FPR, and BDR. PF works best in all situations. Our experiments also indicate the excellent performance of PF against defended datasets.
Organization. The remainder of this paper is organized as follows. In Section 2, we survey prior research work. Subsequently, Section 3 describes the threat model of this work. The key techniques of the PF attack are explained in Section 4. Section 5 presents the experimental preparation. Then, in Section 6, we evaluate the PF attack in different scenarios and present the results, respectively. Finally, we discuss our findings in Section 7 and conclude the paper in Section 8.

Related Work
This section first surveys different kinds of significant WFP attacks, including resource length attacks, traditional machine-learning attacks, and deep-learning attacks. Then, we categorize and summarize prior work on the main WFP defense methods. Moreover, four representative attacks and two typical defenses selected in the following experimental evaluations are introduced in detail.

WFP Attacks.
WFP attacks originate from resource length attacks, which utilize the length of web page resources to identify a web page. In HTTP/1.0, web page resources (images, scripts, etc.) are each requested with a separate TCP connection. Thus, the total length of each resource can be identified by distinguishing different connections. The earliest prototype of the resource length attack was designed and implemented by Cheng and Avnur [21]. Similar research followed later [22][23][24]. With the emergence of HTTP/1.1 and various PETs, the performance of resource length attacks decreased sharply. Hence, researchers have extracted more and more new features from traffic to improve the accuracy.
With the help of traditional machine-learning and deep-learning techniques, the success rate of WFP attacks has risen greatly. If an attacker uses a traditional machine-learning method to make predictions on the websites, the attack is classified as a traditional machine-learning attack. Such typical attacks include Li-NB [4], Li-Jaccard [4], OSAD [7], DLSVM [8], Pa-SVM [6], He-SVM [25], CUMUL [5], KNN [18], WPF [11], KFP (K-fingerprinting) [26], and PHMM (profile hidden Markov model) [13]. These attacks leverage different traditional machine-learning techniques, such as the Bayes classifier, Jaccard coefficient, SVM, KNN, RF, and HMM. The major disadvantage of traditional machine-learning attacks lies in the heavy feature engineering work they require. To avoid this shortcoming, researchers need to manually create a type of features that is highly effective.
Considering the excellent performance of deep-learning methods, researchers have recently introduced them into the WFP area. In deep-learning attacks, the training dataset is used to learn the parameters of deep neural networks, which can then be used to classify the test dataset. Deep-learning attacks have thrived since Abe and Goto first studied the application of stacked denoising autoencoders (SDAE) in WFP attacks [14]. In recent years, different neural networks, such as SDAE, LSTM, and CNN, have been leveraged by various deep-learning attacks, including DF [19], var-CNN [15], Tik-Tok [20], and AWF [17]. Although deep-learning attacks reach a high accuracy, they have a high demand for training data scale. Also, the attacker needs a substantial budget, which significantly limits the application of deep-learning attacks.
To better evaluate our attack, we have selected four typical attacks for comparison, namely, KNN [18], KFP [26], DF [19], and PHMM [13]. They use different techniques and are commonly selected as benchmarks in the WFP literature. These attacks are briefly introduced as follows.

KNN.
The KNN classifier was presented by Wang et al. with weight adjustment based on a large set of features, as many as 3736 [18]. The weights are used to tune the contributions of different features to the KNN distance. As weight learning proceeds, the KNN distance comes to focus on the features that are useful for classification. Due to its good efficiency and accuracy, the attack is extensively used as a benchmark in the WFP area.

The PHMM attack [13] collects the packet length with direction from each traffic instance and transforms the feature value of each packet into the alphabet recognized by the model. The transformation is named symbolization. In the scenarios of SSH and Shadowsocks, PHMM achieves a good performance.

WFP Defenses.
To defend against WFP attacks, many countermeasures have been proposed to obfuscate the traffic features by modifying the traffic. WFP defenses can be classified into packet padding defenses, decoy page defenses, and so on [27,28]. We further classify the packet padding defenses into two types, namely, packet padding defenses with and without delay. The latency of packet padding defenses depends on their padding strategy. Some defense methods, including the maximum padding defense [29], AP (adaptive padding)-based defenses [30], probabilistic defenses [13], and so on, generally introduce very low latency, which can be neglected.
The major packet padding defenses with delay include BuFLO (buffered fixed-length obfuscator) [29], CS-BuFLO (congestion sensitive BuFLO) [31], and Tamaraw [32]. Decoy defenses are of two types. One is to mimic a decoy page [33]. The other is to add a decoy page as the background traffic [6].
Besides the aforementioned defenses, there also exist some other defenses that work at the application layer, such as randomized pipelining, which is embedded in browsers, and HTTPOS. The HTTPOS defense was first presented by Luo et al. [27]. It needs to modify the HTTP headers and change the HTTP requests to control the size of packets, which makes the implementation of HTTPOS somewhat complicated. In addition, Wang et al. presented another defense called Walkie-Talkie [28]. Walkie-Talkie works in half-duplex mode and needs to add dummy packets and delays to create collisions. It incurs both latency overhead and bandwidth overhead.
To evaluate our attack, this study selects two recent probabilistic defenses, namely, the probabilistic dummy packet defense and the probabilistic MTU (maximum transmission unit) padding defense, to produce the defended datasets. The former inserts a dummy packet ahead of each packet with a given probability, while the latter lets each packet decide whether or not to pad its length to the MTU with a predefined probability. In both defenses, every packet makes its decision with the same probability. The two probabilistic defenses were first simulated by Zhuo et al. [13]. They both use the probability to trade off between latency and efficacy.
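The two probabilistic defenses can be illustrated by a minimal simulation sketch. It assumes that a traffic instance is a list of signed packet sizes (the sign encodes direction) and that a dummy packet has MTU size and the same direction as the packet it precedes; the function names and these dummy-packet details are hypothetical and are not taken from [13].

```python
import random

MTU = 1500  # assumed maximum packet size in bytes


def dummy_packet_defense(trace, p, rng, dummy_size=MTU):
    """Insert a dummy packet ahead of each real packet with probability p.

    Assumption: the dummy packet has size `dummy_size` and the same
    direction (sign) as the real packet it precedes.
    """
    defended = []
    for size in trace:
        if rng.random() < p:
            sign = 1 if size >= 0 else -1
            defended.append(sign * dummy_size)
        defended.append(size)
    return defended


def mtu_padding_defense(trace, p, rng):
    """Pad each packet to the MTU with probability p, keeping its direction."""
    defended = []
    for size in trace:
        if rng.random() < p:
            sign = 1 if size >= 0 else -1
            defended.append(sign * MTU)
        else:
            defended.append(size)
    return defended


trace = [565, -1500, -1500, 120, -800]
rng = random.Random(0)  # fixed seed so that the simulation is repeatable
print(dummy_packet_defense(trace, 0.3, rng))
print(mtu_padding_defense(trace, 0.3, rng))
```

A larger probability `p` obfuscates more packets at the cost of more overhead, which is the trade-off mentioned above.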

Threat Model
Our work mainly focuses on website fingerprinting under the protection of Shadowsocks, SSH, and TLS. These PETs apply different techniques and are commonly used all over the world. To be specific, Shadowsocks is a free and open-source encryption protocol project that is widely used. It is not a proxy on its own but a protocol. Shadowsocks has become increasingly popular in recent years according to Google Trends. According to incomplete statistics, hundreds of thousands of people have downloaded the Shadowsocks client [9]. The SSH protocol is included and supported in all operating systems for the reason that telnet and rlogin are insecure. Thus, it is convenient for those people who seek to protect their privacy. Also, the TLS technique is becoming more and more universal. According to Google's statistical data from February 2021, about 95% of web traffic in Chrome for Mac is encrypted, while 90% of web traffic in Chrome for Windows is encrypted.

Figure 1 shows the typical attack scenario in the WFP area [13,19,26]. We also use this scenario in our work. A user browses the websites under the protection of Shadowsocks, SSH, or TLS. A passive local attacker intercepts the encrypted traffic between the user and the communication network entrance and tries to infer the user's browsing privacy. Specifically, the word "passive" means that the adversary can record network packets but not modify, delay, drop, or decrypt them. Besides, the word "local" means that the adversary has access only to the link between the user and the entry of the communication networks. Note that Shadowsocks, SSH, and TLS each encrypt traffic in a different way. Like previous literature [13,19,26], we assume that the adversary has some prior knowledge of the user and only aims at identifying the websites. He does not try to decrypt packets or modify transmissions. Hence, our attack is independent of the encryption methods of the traffic.
In this work, we study the fingerprinting of the home pages of websites. That is, all instances are obtained from the home pages of websites. This task is called website fingerprinting by most authors in this field. As previous literature has mentioned, the adversary is supposed to be able to isolate and parse the traffic generated by each web page visit. Such isolating and parsing would be done before performing website fingerprinting attacks.
As with prior work, we study two scenarios, namely, closed-world and open-world. To be specific, in a closed-world scenario, it is assumed that the user only visits a given set of websites, namely, monitored websites, whereas in an open-world scenario, the user is allowed to visit not just the monitored websites but also a large number of unmonitored websites. The open-world scenario is therefore of more practical interest.

The Proposed PF Attack
This section explains in detail the scheme of PF. For a better understanding, we first give an overview of PF, which presents the data processing flow and the module diagrams in the whole process of identification. In the following subsections, each module is elaborated by further decomposition if needed. To validate the PF scheme, the last subsection introduces the implementation of PF by pseudocode and an example.

PF Overview.
In the scenario of PETs like Shadowsocks, SSH, and TLS, we create a new type of features based on an existing type of features, i.e., packet length with direction. Based on the new features, we put forward a new attack, PF, whose basic principle is shown in Figure 2. At the very beginning of PF, the attacker needs to gather datasets in different scenarios of PETs. Then, the datasets are put into the framework of PF, which contains three basic modules: preprocessing datasets, proposing new features, and classifying.
Specifically, the first module performs the symbolization to produce the sequence files for all the instances and applies the TF-IDF transformation to produce the representative vector of each instance. The vectors are taken as the input of the PWFP model. The second module leverages two submodules, model training and the fold-in process, to obtain the proposed new features of training instances and test instances, respectively. Finally, we use the new features to perform classification in the third module. Note that KNN is used as the classifier in this work. The technical details of these modules are explained in the following subsections.

Preprocessing Datasets.
In this subsection, we preprocess the training and test instances to fit the PWFP model. Specifically, the TF-IDF (Term Frequency-Inverse Document Frequency) transformation, which has been extensively used in the field of text classification, is applied to obtain a good data representation for the traffic instances.
To apply the TF-IDF transformation, we need to construct a connection between a traffic instance and a plain text. We build the connection by the following steps. Firstly, we introduce the concept of symbolization, by which each traffic instance is converted into a traffic sequence file. Then, we reveal the similarity between a plain text and the sequence file of a traffic instance for the first time. Based on the finding, we can naturally view each traffic instance as a plain text.

Symbolization.
At the very beginning of preprocessing, each traffic instance, i.e., a series of consecutive packets generated from a complete web page visit, should be converted into a sequence file. The conversion is named symbolization, as shown in Figure 3.
At first, each packet size with direction should be converted into a feature value (e.g., −1500). To be specific, the packet size decides the magnitude of the feature value, while the packet direction determines its sign. Since the packet size is no more than 1500 bytes, the feature value of each packet is a number ranging from −1500 to 1500. The feature value of each packet in a traffic instance (e.g., −1500) is then turned into the corresponding symbol (e.g., AA), as illustrated in Figure 3. As we see from the figure, the feature values within a certain range are denoted by two given letters [13]. Note that the letters used for symbolization stem from the 20 built-in amino acids in the HMMER tool [34]. Thus, the number of available letters equals 20.

Figure 4 illustrates the similarity between a plain text and a sequence file. That is, we randomly select a plain CNN news text and a sequence file in our experiments as examples. To the best of our knowledge, it is the first time that the latent similarity is uncovered.
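The symbolization step can be sketched as follows. The sketch assumes a fixed bin width of 8 bytes and an alphabetical ordering of the 20 standard amino-acid letters; both are hypothetical choices made for illustration, since the exact ranges of Figure 3 are not reproduced in the text. With these choices the feature value −1500 maps to AA, in agreement with the example given above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino-acid letters
MAX_SIZE = 1500                       # largest packet size in bytes
BIN_WIDTH = 8                         # hypothetical width of one value range


def symbolize(value):
    """Map a signed packet size in [-1500, 1500] to a two-letter symbol."""
    if not -MAX_SIZE <= value <= MAX_SIZE:
        raise ValueError("feature value out of range: %d" % value)
    index = (value + MAX_SIZE) // BIN_WIDTH        # 0 .. 375
    first, second = divmod(index, len(AMINO_ACIDS))
    return AMINO_ACIDS[first] + AMINO_ACIDS[second]


def to_sequence(trace):
    """Convert a traffic instance (list of feature values) into 'words'."""
    return [symbolize(value) for value in trace]


print(to_sequence([-1500, 565, 1500]))
```

With 20 letters there are 400 possible two-letter symbols, so any bin width that yields at most 400 ranges over the 3001 possible feature values would fit this scheme.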

According to the symbolization method mentioned above, the feature value of each packet is turned into a symbol, which is called a "word" in this work. Hence, each sequence file comprises a lot of "words." On the other side, a plain text comprises many meaningful symbols, called words. The essential similarity between a sequence file and a plain text can be concluded from the following comparison.
First, we analyze the similarity between a "word" in a sequence file and a word in a text. On the one hand, both a word in a text and a "word" in a sequence file are essentially symbols. On the other hand, similar to a single word in a text, each "word" in a sequence file also has its meaning, which indicates the size and direction of the corresponding packet. Therefore, each "word" in a sequence file is analogous to a word in a text. One step further, a text is a combination of words. Similarly, a sequence file is a combination of "words." Given the above, each sequence file is analogous to a text. The above analysis is intuitively shown in Figure 4.
Given the similarity between a sequence file and a plain text, it is natural to leverage text classification methods for website fingerprinting. Thus, the PWFP model, which incorporates the extensively used text classification method PLSI with WFP, is proposed.

TF-IDF Transformation.
As mentioned above, we get the sequence file of each traffic instance after symbolization. However, the sequence files cannot be input into the PWFP model directly. To launch the PWFP model, the sequence file of each traffic instance, whether a training instance or a test instance, should be represented by a representative vector, namely, the input of the PWFP model. Hence, we also call a "representative vector" an "input vector." Since the TF-IDF transformation quantifies the importance of each symbol in the traffic sequence files well, we utilize the TF-IDF transformation to produce the input of the PWFP model.
In the process of model training, each training sequence file, e.g., $d_i$, needs to be converted into the corresponding input vector, namely $v_{d_i}$, such that it can be fed into the model. The $j$th component of the input vector is obtained by the TF-IDF transformation of $w_j$, which is the $j$th "word" in the sequence file. The TF-IDF transformation of $w_j$ is defined by equation (1), where $tf_{d_i}(w_j)$ and $idf(w_j)$ denote the TF transformation and IDF transformation of $w_j$, respectively. In detail, $tf_{d_i}(w_j)$ and $idf(w_j)$ are defined by equations (2) and (3), respectively, where $M$ denotes the total number of different "words" in the corpus, i.e., the set of all the training sequence files, $N$ denotes the total number of training sequence files, $n(d_i, w_j)$ denotes the number of occurrences of the "word" $w_j$ in the training sequence file $d_i$, and $m(w_j)$ denotes the number of training sequence files that contain the "word" $w_j$.
Similarly, the test instances need to undergo the TF-IDF transformation before the fold-in process. Each test sequence file, e.g., $q_i$, needs to be converted into a corresponding input vector, namely, $v_{q_i}$. The related equations are analogous, with $n(q_i, w_j)$ denoting the number of occurrences of the "word" $w_j$ in the test sequence file $q_i$.
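As an illustration of the transformation, the following sketch assumes the standard TF-IDF definitions, which are consistent with the symbols defined above: $tf_{d_i}(w_j) = n(d_i, w_j) / \sum_{j'} n(d_i, w_{j'})$ and $idf(w_j) = \log(N / m(w_j))$, with the input vector component being their product. The exact form of equations (1) to (3) may differ (e.g., by smoothing terms), and the function names are hypothetical. The sketch also assumes that test "words" absent from the training corpus are ignored.

```python
import math
from collections import Counter


def fit_idf(train_files):
    """Compute the vocabulary and idf(w) = log(N / m(w)) on training files.

    train_files: list of sequence files, each a list of 'words'.
    """
    n_files = len(train_files)
    doc_freq = Counter()
    for words in train_files:
        doc_freq.update(set(words))  # m(w): number of files containing w
    vocab = sorted(doc_freq)
    idf = {w: math.log(n_files / doc_freq[w]) for w in vocab}
    return vocab, idf


def tfidf_vector(words, vocab, idf):
    """Input vector of one sequence file (training or test)."""
    counts = Counter(words)
    total = sum(counts[w] for w in vocab)  # unseen 'words' are ignored
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[w] / total * idf[w] for w in vocab]


train = [["AA", "AC", "AA"], ["AA", "WS"], ["AC", "WS", "WS"]]
vocab, idf = fit_idf(train)
print(vocab)
print(tfidf_vector(train[0], vocab, idf))
```

Under these definitions, a "word" that occurs in every training sequence file receives an IDF of zero and therefore does not contribute to the input vector.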

Proposing the New Features.
After the TF-IDF transformation, each sequence file generates a representative vector. Since the vector is taken as the input of the following model, it is also called the input vector. The proposed PF attack leverages the PWFP model to process the input vectors. PLSI, which associates a latent semantic variable (i.e., topic) with each observation [35], is the mathematical basis of the PWFP model. In this section, the basic theory of PWFP, including the basis of the PWFP model, model training, and the fold-in process, is detailed first. Lastly, we propose a new type of features based on the obtained PWFP model.

The Basis of the PWFP Model.
In this part, we show the basis of PWFP. Similar to using PLSI for the task of text classification, the PWFP model also introduces a latent variable called "topic," which has the same function as the latent variable (i.e., topic) in PLSI. Rather than the intelligible meaning it has in the PLSI model, the variable "topic" in the PWFP model has an abstract meaning. In PWFP, "topic" is an intermediate concept that associates a sequence file (i.e., "text") with a symbol (i.e., "word") therein. The interrelations between a sequence file, a "topic," and a "word" are represented as conditional probabilities.
The crux of the PWFP model is to figure out these conditional probabilities, namely, the PWFP parameters. To estimate these conditional probabilities, the EM (Expectation-Maximization) algorithm is extensively used.
Once the conditional probabilities are obtained, each traffic instance can be represented by a "topic" probability vector. Each component of the vector indicates the probability of the sequence file (i.e., "text") belonging to the corresponding "topic." Hence, mathematically speaking, the PWFP model can be viewed as a multidimensional space transformation. To better understand the PWFP model, we define some notations as shown in Table 1. Note that we add quotation marks when describing the concepts (e.g., text, topic, word) in WFP to differentiate them from the same concepts in the field of text classification.
To begin with, we show the graph model of PWFP in Figure 5. In terms of a generative model, the PWFP model can be built up in the following way:
(1) Select a training sequence file d with probability p(d).
(2) Pick a "topic" z with probability p(z|d).
(3) Generate a "word" w with probability p(w|z).
After the above three steps, we obtain an observed pair (d, w), and the latent variable z is discarded. Hence, for given values of $d_i$ and $w_j$, the joint probability $p(d_i, w_j)$ can be deduced by equation (5). By repeating the above process, we get all the training sequence files, i.e., the corpus. Hence, the generative probability of the corpus D, namely $L(D, W)$, can be represented as the joint probability of all the observation pairs and is calculated by equation (6). After taking the logarithm of $L(D, W)$, $L'(D, W)$ is obtained by equation (7). By combining equations (5) and (7), $L'(D, W)$ is further deduced to equation (8).
The goal of model training is to estimate the probabilities $p(z_k)$, $p(d_i|z_k)$, and $p(w_j|z_k)$, namely, the PWFP parameters. To figure out the probabilities, it is necessary to do maximum likelihood estimation for the function $L'(D, W)$. The probabilities are optimal only when the function $L'(D, W)$ reaches its maximum. Intuitively, the probabilities could be figured out by setting the derivative of $L'(D, W)$ to zero. Because the latent variable makes this difficult to solve directly, the EM algorithm is introduced to solve the optimization problem.
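For reference, the chain of equations in this derivation can be written in the standard PLSI form. The following is a reconstruction under two assumptions rather than the exact original typesetting: that PWFP follows the standard PLSI likelihood with the parameters $p(z_k)$, $p(d_i|z_k)$, and $p(w_j|z_k)$ used in the E-step, and that $n(d_i, w_j)$ stands for the weight of "word" $w_j$ in sequence file $d_i$ (a raw count in standard PLSI; the TF-IDF weight of the input vector may take its place in PWFP). The tags follow the equation numbers referenced in the text.

```latex
\begin{align}
p(d_i, w_j) &= \sum_{k=1}^{K} p(z_k)\, p(d_i \mid z_k)\, p(w_j \mid z_k) \tag{5} \\
L(D, W) &= \prod_{i=1}^{N} \prod_{j=1}^{M} p(d_i, w_j)^{\,n(d_i, w_j)} \tag{6} \\
L'(D, W) &= \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j)\, \log p(d_i, w_j) \tag{7} \\
L'(D, W) &= \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j)\,
            \log \sum_{k=1}^{K} p(z_k)\, p(d_i \mid z_k)\, p(w_j \mid z_k) \tag{8}
\end{align}
```

The sum inside the logarithm in the last line is what prevents a closed-form maximizer and motivates the iterative EM procedure.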
The EM algorithm includes two steps, i.e., the E (expectation) step and the M (maximization) step. The two steps are executed alternately until the PWFP parameters converge.
In the E-step, the posterior probability $p(z_k|d_i, w_j)$ is introduced. It can be computed based on the current estimates of the PWFP parameters, i.e., $p(z_k)$, $p(d_i|z_k)$, and $p(w_j|z_k)$, according to equation (9), which is derived by the Bayes rule.
$$p(z_k \mid d_i, w_j) = \frac{p(z_k)\, p(d_i \mid z_k)\, p(w_j \mid z_k)}{\sum_{l=1}^{K} p(z_l)\, p(d_i \mid z_l)\, p(w_j \mid z_l)}. \tag{9}$$
In the M-step, the PWFP parameters are updated by the posterior probabilities that are computed in the previous E-step. The equations for updating the parameters can be derived by the Lagrange multiplier method.
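For reference, the M-step updates of standard PLSI with this parameterization are given below. They are stated as an assumption about the form used in PWFP, with $n(d_i, w_j)$ again denoting the weight of "word" $w_j$ in sequence file $d_i$ (a count in standard PLSI, possibly the TF-IDF weight here).

```latex
\begin{align}
p(w_j \mid z_k) &= \frac{\sum_{i=1}^{N} n(d_i, w_j)\, p(z_k \mid d_i, w_j)}
                        {\sum_{i=1}^{N} \sum_{j'=1}^{M} n(d_i, w_{j'})\, p(z_k \mid d_i, w_{j'})} \\
p(d_i \mid z_k) &= \frac{\sum_{j=1}^{M} n(d_i, w_j)\, p(z_k \mid d_i, w_j)}
                        {\sum_{i'=1}^{N} \sum_{j=1}^{M} n(d_{i'}, w_j)\, p(z_k \mid d_{i'}, w_j)} \\
p(z_k) &= \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j)\, p(z_k \mid d_i, w_j)}
               {\sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j)}
\end{align}
```

Each update is a posterior-weighted sum normalized so that the corresponding distribution sums to one, which is the constraint enforced by the Lagrange multipliers.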

Model Training.
The aim of model training is to estimate the PWFP parameters for the training instances. We use the EM algorithm to achieve this goal. The procedure of using the EM algorithm to estimate the PWFP parameters can be specified as follows:
(1) Initialization: Randomly initialize the posterior probabilities $p(z_k|d_i, w_j)$ and the PWFP parameters of the training instances, namely, $p(d_i|z_k)$, $p(w_j|z_k)$, and $p(z_k)$.
(2) E-step: Based on the current PWFP parameters, update the posterior probabilities $p(z_k|d_i, w_j)$ according to equation (9).
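The training loop can be sketched with NumPy as follows. This is a minimal sketch of standard PLSI EM under the parameterization used above, not the authors' implementation: the function name `train_pwfp`, the dense three-dimensional posterior array, and the small constant `eps` guarding against division by zero are assumptions made for illustration. The posteriors are not initialized separately because the first E-step overwrites them.

```python
import numpy as np


def train_pwfp(X, n_topics, n_iter=50, seed=0):
    """Estimate p(z), p(d|z), p(w|z) by EM.

    X: (N, M) non-negative matrix; row i is the input vector of file d_i.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    eps = 1e-12  # numerical guard (assumption, not part of the model)

    # Initialization: random, normalized parameters.
    p_z = rng.random(n_topics)
    p_z /= p_z.sum()
    p_d_z = rng.random((n_docs, n_topics))
    p_d_z /= p_d_z.sum(axis=0, keepdims=True)
    p_w_z = rng.random((n_words, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # E-step: post[i, j, k] = p(z_k | d_i, w_j), as in equation (9).
        joint = p_z[None, None, :] * p_d_z[:, None, :] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)

        # M-step: posterior-weighted sums, then normalization.
        weighted = X[:, :, None] * post            # shape (N, M, K)
        p_w_z = weighted.sum(axis=0)               # shape (M, K)
        p_d_z = weighted.sum(axis=1)               # shape (N, K)
        p_z = weighted.sum(axis=(0, 1))            # shape (K,)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + eps
        p_d_z /= p_d_z.sum(axis=0, keepdims=True) + eps
        p_z /= p_z.sum() + eps
    return p_z, p_d_z, p_w_z


X = np.array([[3.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 3.0]])
p_z, p_d_z, p_w_z = train_pwfp(X, n_topics=2)
print(p_z)
```

The dense posterior array needs memory proportional to N × M × K, so a practical implementation would iterate only over the nonzero entries of the input matrix.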

Fold-In Process.
Once the PWFP parameters of the training instances are estimated by the EM algorithm, the obtained parameters $p(z_k)$ and $p(w_j|z_k)$ can be further used to infer the parameters of the test instances, i.e., $p(q_i|z_k)$. The inference is called the "fold-in process" in previous work [36]. We also leverage the EM algorithm in the fold-in process. In the fold-in process, the parameters $p(w_j|z_k)$ and $p(z_k)$ remain fixed, whereas the remaining parameters $p(q_i|z_k)$ and the posterior probabilities $p(z_k|q_i, w_j)$ need to be updated by the EM algorithm. The fold-in process can be implemented by the following procedure:
(1) Initialization: Randomly initialize the new PWFP parameters, including the conditional probabilities of the test instances $p(q_i|z_k)$ and the posterior probabilities of the test instances $p(z_k|q_i, w_j)$, while keeping the PWFP parameters $p(z_k)$ and $p(w_j|z_k)$ fixed.
(2) E-step: Based on the known PWFP parameters, i.e., $p(z_k)$ and $p(w_j|z_k)$, and the current estimated PWFP parameters, i.e., $p(q_i|z_k)$, update the posterior probabilities $p(z_k|q_i, w_j)$ according to the following equation:
$$p(z_k \mid q_i, w_j) = \frac{p(z_k)\, p(q_i \mid z_k)\, p(w_j \mid z_k)}{\sum_{l=1}^{K} p(z_l)\, p(q_i \mid z_l)\, p(w_j \mid z_l)}.$$
(3) M-step: Based on the current posterior probabilities $p(z_k|q_i, w_j)$, update the parameters $p(q_i|z_k)$ by the corresponding M-step update.
(4) Repeat the E-step and M-step until the number of iterations reaches the preset value. The results of the last iteration are taken as the estimated PWFP parameters for the test traffic instances.
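The fold-in procedure above can be sketched as follows. The sketch assumes that the test instances are folded in as one batch, that the M-step for $p(q_i|z_k)$ mirrors the standard PLSI update for $p(d_i|z_k)$ (a posterior-weighted sum normalized over the test instances), and that a small `eps` guards against division by zero; the function name `fold_in` is hypothetical.

```python
import numpy as np


def fold_in(X_test, p_z, p_w_z, n_iter=50, seed=0):
    """Infer p(q|z) for test instances with p(z) and p(w|z) kept fixed.

    X_test: (Q, M) non-negative matrix of test input vectors.
    p_z:    (K,) trained topic probabilities.
    p_w_z:  (M, K) trained word-given-topic probabilities.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_topics = X_test.shape[0], p_z.shape[0]
    eps = 1e-12  # numerical guard (assumption, not part of the model)

    # Initialization: random p(q|z), normalized over the test instances.
    p_q_z = rng.random((n_docs, n_topics))
    p_q_z /= p_q_z.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # E-step: post[i, j, k] = p(z_k | q_i, w_j).
        joint = p_z[None, None, :] * p_q_z[:, None, :] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step: only p(q|z) is updated; p(z) and p(w|z) stay fixed.
        p_q_z = (X_test[:, :, None] * post).sum(axis=1)
        p_q_z /= p_q_z.sum(axis=0, keepdims=True) + eps
    return p_q_z


# Toy trained parameters: topic 0 emits words 0-1, topic 1 emits words 2-3.
p_z = np.array([0.5, 0.5])
p_w_z = np.array([[0.7, 0.0],
                  [0.3, 0.0],
                  [0.0, 0.6],
                  [0.0, 0.4]])
X_test = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 2.0]])
p_q_z = fold_in(X_test, p_z, p_w_z)
print(np.round(p_q_z, 3))
```

In this toy example the first test instance contains only "words" emitted by topic 0 and the second only "words" emitted by topic 1, so each instance ends up associated with a single topic.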

The Proposed New Features.
The model training and fold-in process aim to figure out the model parameters, namely, $p(d_i|z_k)$, $p(z_k)$, and $p(q_i|z_k)$. Based on $p(d_i|z_k)$ and $p(z_k)$, each training traffic instance can be represented by a probability vector, each component of which is a "topic" probability. Similarly, each test instance can be represented by a probability vector calculated from $p(q_i|z_k)$ and $p(z_k)$. Hence, in the PF attack, each instance is represented by a vector constituted by "topic" probabilities. Note that the "topic" probabilities are also called the proposed new features. The proposed new features of a training instance and a test instance can be calculated similarly with the respective parameters of PWFP. To be specific, the proposed new features of a training instance $d_i$, namely $d_{iv}$, can be computed by equation (13). Similarly, the proposed new features of a test instance $q_i$, i.e., $q_{iv}$, are calculated by equation (14).

Classifying.
The PF attack leverages a typical KNN to perform classification. To achieve this goal, a similarity strategy needs to be selected. The similarity strategy is a way to evaluate the difference between two feature vectors. The cosine similarity and Euclidean similarity are two typical strategies. According to the experimental results, the Euclidean similarity is used in our evaluations.
As is known from text classification, two texts that belong to the same class usually share more common words than two texts from different classes. Thus, we add a weight coefficient to tune the similarity between two traffic instances or between a traffic instance and a web page class. For two traffic instances, the weight coefficient is defined as the number of common "words" between their respective sequence files. For the similarity between a traffic instance and a web page class, the weight is defined as the average number of common "words" between the traffic instance and each instance belonging to the web page.
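The feature computation and the weighted KNN step can be sketched as follows. Several choices here are assumptions rather than details confirmed by the text: the "topic" probability is obtained from $p(x_i \mid z_k)$ and $p(z_k)$ by Bayes' rule, the Euclidean similarity is taken as $1/(1+d)$ for Euclidean distance $d$, and the weight coefficient multiplies the similarity. All function names are hypothetical.

```python
import math
from collections import Counter


def topic_vector(p_x_given_z, p_z, i):
    """Topic probabilities p(z_k | x_i) for instance i via Bayes' rule.

    Assumed form of the proposed features; p_x_given_z[k][i] is p(x_i | z_k).
    """
    joint = [p_z[k] * p_x_given_z[k][i] for k in range(len(p_z))]
    total = sum(joint)
    return [v / total for v in joint]


def euclid_similarity(u, v):
    """Similarity derived from Euclidean distance (the 1/(1+d) map is assumed)."""
    return 1.0 / (1.0 + math.dist(u, v))


def common_words(words_a, words_b):
    """Number of distinct "words" shared by two sequence files."""
    return len(set(words_a) & set(words_b))


def weighted_knn(test_vec, test_words, train, k=1):
    """Classify one test instance.

    train is a list of (topic_vector, words, label) tuples. The similarity is
    multiplied by the number of common "words" (an assumed way of applying the
    weight coefficient), and the k most similar instances vote.
    """
    scored = sorted(
        ((euclid_similarity(test_vec, vec) * common_words(test_words, words), label)
         for vec, words, label in train),
        key=lambda item: item[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

With `k=1`, the setting reported for the closed-world evaluations, the prediction is simply the label of the single most similar training instance.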

Implementation.
In the above subsections, the theoretical part of PF has been described in detail. To validate PF, we need to implement all the above techniques step by step, as shown in Algorithm 1. The PF algorithm takes the folder paths of the training data and test data as input, and outputs various metrics in different experiments. The metrics will be discussed in detail in the next section. For convenience, we perform the symbolization before the PF implementation in this work. Note that four parameters are needed for PF to run, namely t, k, m, and n. The parameter t is influenced by the number of web pages, while the parameter k can tune the performance of PF. The parameters m and n are set to the same value in our evaluations. According to our experimental results and previous literature [35], the number of iterations can be fixed at 50 in all the evaluations. We also validate the rationality of this setting by a specific experiment.
For a better understanding, we further give an empirical illustration of the PF attack in Figure 6. Each step is explained with an example from a real dataset in our evaluations.

Experimental Preparations
In this section, we make several necessary preparations for the following evaluations. Firstly, we specify the datasets used in our experiments, including the open datasets and our gathered datasets. Then, the metrics of different experimental settings are explained at length. Subsequently, the baseline attacks are briefly introduced.

Datasets.
We use seven datasets in our evaluations, including three open datasets released in Refs. [4,13], and four datasets collected by ourselves. These datasets are collected in different scenarios, such as Shadowsocks, SSH, and TLS.

The Open Shadowsocks Datasets.
We evaluate one open Shadowsocks dataset, i.e., Alexa74 [13], in this work. The Alexa74 dataset contains 74 webpages. Each web page contains 25 instances, 17 of which are for training. The webpages of Alexa74 are filtered from the Alexa top 100 sites.

The Open SSH Datasets.
We test two open SSH datasets, namely, SSH55 and SSH100. Both of them are filtered from Liberatore's dataset [4], which was collected over an encrypted SSH tunnel for about three months and contains traces of encrypted connections to 2000 sites. Since the dataset has a lot of empty pcap files caused by various failures during the collecting process, we first pick out two datasets with different average lengths of traffic instance file (15k and 20k) for each web page. Note that all the traffic instance files of each chosen web page are successive in time. Then, we extract the timestamp and length of each TCP packet in each file, and thus obtain the desired datasets.
Besides the open datasets, we gather another four datasets in different scenarios. Three of them are gathered in a Shadowsocks environment, while the remaining one is gathered in a TLS scenario. The gathering method is specified in the following.

Data Gathering.
We rent a cloud server to collect our datasets. To collect the Shadowsocks datasets, the Clash software is installed as the client to communicate with the remote Shadowsocks proxy server, through which we can directly connect to the websites. The datasets are collected automatically by a C crawling script. The script simulates the user's behavior of surfing websites by controlling the Firefox Browser 76.0.1, whose cache function is disabled to prevent loading from the cache, and leverages tcpdump to capture the traffic on the wire. Moreover, the script runs on an Ubuntu 16.04 virtual machine to avoid perturbations introduced by background network traffic. Besides, we disable all automatic or background network traffic, such as auto-updates. It is also important to make sure that the system-level network settings are correct. For example, it is critical to change the MTU to the standard Ethernet MTU (1500 bytes) and disable offload. We collect three Shadowsocks datasets, i.e., AleSS73, AleSS287, and OpwSS6879. Their total sizes are about 23.5 GB, 17.6 GB, and 23.2 GB, respectively.
To collect the HTTPS100 dataset, we follow the same routine mentioned above, but with the Clash software closed. The total size of HTTPS100 amounts to about 10.4 GB. Note that all the webpages are randomly selected from the Alexa top 10k webpages.
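The system-level preparation described above can be sketched as a small helper that only builds the command lines, so that it can be inspected without root privileges. The text does not list the exact commands, so the choice of `ip`, `ethtool`, and `tcpdump` invocations, the set of offload features switched off (TSO, GSO, GRO), and the interface name and output path in the usage note are all assumptions for illustration.

```python
def capture_commands(iface, pcap_path, mtu=1500):
    """Return argv lists for preparing an interface and starting a capture.

    A sketch under assumed tooling, not the authors' crawling script:
      1. set the MTU to the standard Ethernet value,
      2. disable common segmentation/receive offloads,
      3. capture all traffic on the interface into a pcap file.
    """
    return [
        ["ip", "link", "set", "dev", iface, "mtu", str(mtu)],
        ["ethtool", "-K", iface, "tso", "off", "gso", "off", "gro", "off"],
        ["tcpdump", "-i", iface, "-w", pcap_path],
    ]
```

For example, `capture_commands("eth0", "visit_0001.pcap")` yields three argument lists. The first two could be passed to `subprocess.run(..., check=True)`, and the capture could be started with `subprocess.Popen` before the browser loads the page and terminated afterwards.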

Our Collected Shadowsocks and HTTPS Datasets.
The collected datasets, AleSS73, AleSS287, and HTTPS100, contain 73, 287, and 100 webpages, respectively. Each web page has 100, 20, and 40 instances, respectively. They are evaluated in the closed-world setting. Besides, the OpwSS6879 dataset contains 6879 webpages. Each web page has one instance. It is evaluated in the open-world setting together with the AleSS287 dataset. It is worth noting that none of the webpages in AleSS287 is included in OpwSS6879.

Algorithm 1: The PF algorithm.
input: training data, test data
output: different combinations of performance indicators, including (TP, FN, FP, TN, TPR, FPR, BDR, ACC) and (TP, FN, TPR)
(1) function PF (training data path, test data path)
(2) set the number of iterations in the training process, i.e., m
(3) set the number of iterations in the fold-in process, i.e., n
(4) set the number of "topics", i.e., t
(5) set the number of nearest neighbors for KNN, i.e., k
(6) load the training samples, perform the TF-IDF transformation for training instances
(7) train the PWFP model
(8) compute the "topic" probability vectors of training samples, i.e., p(z|d_training)
(9) load the test samples, perform the TF-IDF transformation for test instances
(10) fold-in the test samples
(11)

For a better understanding, the datasets used in this work are listed in Table 2.

Metrics.
To evaluate the experimental results, we utilize the following metrics that are extensively used in the WFP area.
In the closed-world evaluation, we use the attacker's accuracy, defined as the ratio of the number of correctly classified traces to the total number of traces, to evaluate the performance of different attacks, as in previous research [13,19,26]. In the closed-world scenario, this ratio equals TPR, namely, the probability that a monitored web page is classified as the correct monitored web page. Besides, another indicator called "recall" has the same definition as TPR. In the open-world evaluation, we take into consideration three indicators, namely TPR, FPR, and BDR:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{BDR} = \frac{\mathrm{TPR} \cdot p(\mathrm{mon})}{\mathrm{TPR} \cdot p(\mathrm{mon}) + \mathrm{FPR} \cdot p(\mathrm{unmon})},$$

where $p(\mathrm{mon}) = |\text{monitored test instances}| / |\text{total test instances}|$ and $p(\mathrm{unmon}) = 1 - p(\mathrm{mon})$. Note that FPR is defined as the probability that an unmonitored web page is incorrectly classified as a monitored web page. Since BDR considers the differences in the sizes of the different classes, it is widely used to evaluate the feasibility and effectiveness of an attack [13,26]. Besides, a metric called "precision" is also used in previous literature [19]. In fact, it can be proved that BDR is equivalent to precision. Thus, we also refer to BDR as precision in this work.
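The equivalence between BDR and precision can be checked numerically with a short sketch. It assumes the usual confusion-matrix definitions, TPR = TP/(TP+FN) and FPR = FP/(FP+TN), and takes p(mon) from the test-set proportions as defined above. Under these assumptions, TPR·p(mon) = TP/N and FPR·p(unmon) = FP/N for N test instances, so BDR reduces to TP/(TP+FP). The function name and the counts in the usage note are hypothetical.

```python
def open_world_metrics(tp, fn, fp, tn):
    """Compute TPR, FPR, BDR, and precision from confusion-matrix counts.

    Assumes at least one monitored, one unmonitored, and one
    positively-classified test instance, so no denominator is zero.
    """
    total = tp + fn + fp + tn
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    p_mon = (tp + fn) / total      # share of monitored test instances
    p_unmon = 1.0 - p_mon
    bdr = tpr * p_mon / (tpr * p_mon + fpr * p_unmon)
    precision = tp / (tp + fp)
    return {"TPR": tpr, "FPR": fpr, "BDR": bdr, "precision": precision}
```

For instance, `open_world_metrics(90, 10, 5, 895)` gives a TPR of 0.9, and its BDR and precision agree up to floating-point rounding.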

Experimental Evaluation
In this section, we first perform two preliminary experiments to determine the optimal parameters of PF and to compare our proposed new type of features with the existing features. To validate the feasibility of PF, the subsequent experiments are conducted under three different scenarios that are extensively studied in previous literature. To be more persuasive, Shadowsocks, SSH, and TLS traffic are all tested.

Preliminary Experiments.
At the beginning of the experimental evaluation, we conduct two preliminary experiments, namely parameter tuning and feature evaluation. The former is used to pick out the optimal parameters for PF. Besides, we design a simple experiment, namely, feature evaluation, to compare the proposed features and their source features. The details are specified below.
6.1.1. Parameters Tuning.
As mentioned above, four parameters need to be determined in the PF implementation, namely m, n, t, and k. Specifically, m denotes the number of iterations in the training process, n denotes the number of iterations in the fold-in process, t denotes the number of "topics", and k denotes the number of nearest neighbors of KNN. To find the optimal parameters, we devise our own method, which we illustrate with three specific experiments based on SSH55. Each experiment is used to determine one parameter.
To determine m and n, we fix the values of t and k. Furthermore, we let m equal n according to previous PLSI applications. Then, m and n are set to several different values. Subsequently, we run the PF algorithm, draw the accuracy curve, and choose the best parameter value. Similarly, we determine the parameters t and k in the closed-world setting.
The experimental results are shown in Figure 7. As the left panel of Figure 7 shows, the accuracy of PF does not improve as the number of iterations rises from 50 to 150. For efficiency, we set the parameters m and n to 50. As the middle panel of Figure 7 shows, the accuracy reaches its maximum when the parameter k equals 1. Hence, the parameter k is set to 1 in this scenario on the SSH55 dataset. In the right panel of Figure 7, the maximal accuracy is obtained when the parameter t is set to 150. Naturally, we set the parameter t to 150. Similarly, we obtain the optimal parameter values used on the other datasets.
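The one-parameter-at-a-time selection can be sketched as a small helper: with the other parameters fixed, each candidate value is evaluated and the smallest value that reaches the best accuracy is kept, which matches the efficiency argument used for the number of iterations. The helper name `pick_parameter` and the callback `evaluate` (assumed to run PF with the candidate value and return its accuracy) are hypothetical.

```python
def pick_parameter(candidates, evaluate, tolerance=0.0):
    """Return the smallest candidate whose score is within `tolerance` of the best.

    candidates : iterable of parameter values to try
    evaluate   : callable mapping a value to an accuracy (hypothetical; in
                 practice it would run the attack with the other parameters fixed)
    """
    scores = {value: evaluate(value) for value in candidates}
    best = max(scores.values())
    return min(value for value, score in scores.items() if score >= best - tolerance)
```

With a toy accuracy curve that stops improving at 50, such as `lambda v: min(v, 50)`, the call `pick_parameter([10, 30, 50, 100, 150], lambda v: min(v, 50))` selects 50.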

Feature Evaluation.
We perform feature evaluation by comparing the effectiveness of the proposed new type of features, i.e., the "topic" probability vector, with the existing type of features, i.e., packet length with direction. For the sake of fairness, the KNN classifier is applied to both kinds of features. For convenience, the latter attack is named LF (Length Fingerprinting). We compare PF and LF in a closed-world scenario on the AleSS73 dataset. The accuracy of PF and LF is 99.79% and 51.46%, respectively. Thus, it is evident that our proposed new type of features is more effective than the existing features. That is to say, by leveraging the PWFP model, we obtain a type of features that is more informative and powerful than the existing features, which paves the way for devising a concise traditional machine-learning attack.

Closed-World Evaluation.
The closed-world evaluation includes two kinds of experiments, which differ in whether the dataset is defended or not. For the nondefended datasets, we test six datasets based on three different PETs, namely Shadowsocks, SSH, and TLS. For the defended datasets, we create twenty datasets to perform our experiments.

Attack on Nondefended Datasets.
The closed-world evaluation is fundamental and critical for validating the feasibility of a WFP attack and tuning its parameter values. We leverage six nondefended datasets of various types in this closed-world scenario. The PF attack needs four parameters to be set before it starts. Specifically, we set the number of iterations to 50 for both the model training and the fold-in process, and the number of nearest neighbors to 1. As for the number of "topics," we set 150 for SSH55, SSH100, HTTPS100, Alexa74, and AleSS73, and 400 for AleSS287. Note that we utilize the original code to run the state-of-the-art attacks, and each algorithm is run five times to obtain its mean performance. The performance of all the attacks is shown in Table 3.
According to Table 3, the PF attack attains stably better accuracy than the other attacks, reaching an accuracy as high as 99.79% on AleSS73, while KNN and PHMM show fluctuating performance on different datasets. Specifically, PF reaches an accuracy of above 93% on five datasets and above 95% on three datasets. It beats the other attacks, including DF, which is based on deep neural networks, on all the datasets. As for KNN and PHMM, their accuracy decreases to 70.91% and 43.64% on SSH55, 56.01% and 47.00% on SSH100, and 78.75% and 55.75% on HTTPS100, respectively. The fluctuating performance of KNN and PHMM is probably due to their sensitivity to factors such as the number of training instances and the data quality. Although KFP achieves performance comparable with PF on AleSS73, Alexa74, and SSH100, and DF attains performance comparable with PF on AleSS73, each has its disadvantages compared with PF. As for DF, when the number of training instances decreases, more iterations are needed to obtain a better accuracy, which requires more time. KFP needs more types of features than PF, thus introducing more feature engineering work. The results demonstrate that PF is highly effective in different scenarios, including Shadowsocks, SSH, and TLS.
Moreover, we investigate the impact of the ratio of the number of training instances to the number of total instances on classification accuracy. The results are shown in Figure 8. Since a steeper curve in Figure 8 means that an attack is more sensitive to the change in the number of training instances, it can be concluded that PF and PHMM are less sensitive to the change of training ratio than the other attacks. As the results show, PF needs only a few training samples to reach a high success rate. To be specific, the PF attack uses only 4 training samples to obtain a success rate of 91.46%. This is good news for WFP, which is troubled by the data staleness issue [16]. Conversely, the accuracy curve of DF is the steepest one. Specifically, the accuracy of DF improves from 39.9% to 73.95% when the ratio increases from 20% to 40%, and continues to rise as the ratio further increases. This is in line with the characteristics of deep-learning methods.

Attack on Defended Datasets.
In this evaluation, we consider two typical countermeasures in Shadowsocks, namely, the probabilistic MTU padding defense and the probabilistic dummy packet defense, which were put forward by Zhuo et al. [13]. The former pads specific packets to the MTU. Whether a packet is padded depends on a given probability. The latter inserts, with a given probability, a new packet with random size and direction after the original packet. Note that the decision, namely, whether to pad or insert, is made for every packet in a traffic instance with the same probability. The higher the probability is, the more overhead and disturbance are introduced. In total, we produce 20 defended datasets with different padding or inserting probabilities, ranging from 10% to 100%, based on AleSS287.
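The two defenses can be sketched on a trace represented as a list of signed packet lengths, where the sign encodes the direction. This is an illustration under stated assumptions, not the implementation of Zhuo et al. [13]: every packet is treated as a padding candidate, the MTU is taken as 1500 bytes, and a dummy packet's size is drawn uniformly from 1 to the MTU with a uniformly random direction. The function names are hypothetical.

```python
import random

MTU = 1500  # assumed standard Ethernet MTU in bytes


def mtu_padding(trace, prob, rng):
    """Pad each packet to the MTU with probability `prob`, keeping its direction."""
    padded = []
    for length in trace:
        if rng.random() < prob:
            length = MTU if length > 0 else -MTU
        padded.append(length)
    return padded


def dummy_insertion(trace, prob, rng):
    """After each packet, insert a dummy packet with probability `prob`.

    The dummy packet has a random size in [1, MTU] and a random direction
    (both distributions are assumptions).
    """
    defended = []
    for length in trace:
        defended.append(length)
        if rng.random() < prob:
            defended.append(rng.randint(1, MTU) * rng.choice((1, -1)))
    return defended
```

Running both functions with probabilities from 0.1 to 1.0 in steps of 0.1 on each trace would produce the kind of 20 defended variants described above.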
Note that we use the same parameters as in the closed-world evaluation on nondefended datasets in this part. Table 4 shows the results of each attack against the defended datasets. The results show that our attack remains effective against the two typical defenses and achieves the best performance among all the evaluated attacks. Specifically, in the evaluation of the attack on the probabilistic dummy packet defense, PF attains an accuracy of 55.66% when the inserting probability is 50%, and still reaches an accuracy of 23.17% when the inserting probability becomes as high as 100%, whereas DF gets an accuracy of 34.15% and 1.92%, and PHMM obtains an accuracy of 0.53% and 0.35% in the two situations. Similarly, in the evaluation of the attack on the probabilistic MTU padding defense, the accuracy of PF still reaches 32.93% when the padding probability becomes as high as 90%, while PHMM only attains an accuracy of 0.17% at this point. It can be concluded that the two probabilistic defenses need a relatively high overhead to resist PF compared to PHMM and DF, which indicates the strong distinguishing ability of our proposed features.

Open-World Evaluation.
In this part, we first investigate the impact of the parameter k on the performance indicators in the open-world setting, including TPR, FPR, and BDR. Then, two typical experiments are performed to evaluate the performance of PF.

Impact of the Parameter k.
The parameter k is vital to the KNN algorithm. In the KNN implementation, the algorithm picks out the top k closest training samples for each test sample. Then, it assigns the test sample to the category that most of the k samples belong to. From the basic principle of KNN, the test sample tends to be assigned to the categories with larger sample sizes as the parameter k increases. Hence, we design this experiment to investigate the impact of k on the performance of PF in an open-world setting. To be specific, we set all the parameters except for k as in the closed-world setting while varying the value of k.
We perform seven experiments, in which the parameter k is set to 1, 3, 5, 7, 9, 13, and 21, respectively. The performance indicators, i.e., TPR, FPR, and BDR, are recorded and plotted in Figure 9. From the curves in Figure 9, FPR shows a slight decrease while BDR shows a minor increase as k increases. In addition, the TPR curve shows a slight increase when k is less than 3, whereas it slowly declines as k increases from 3. The results are consistent with the expectations above. From the general tendency of the curves, k can be used to tune the performance of PF. For example, if the adversary puts more emphasis on TPR, k should be set to a small value. On the contrary, if FPR is more important, then k should be set to a large value.

Open-World Evaluation.
In the open-world evaluation, there exist two kinds of training strategies for the attacker. Under the first one, the attacker trains not only on the monitored dataset but also on the unmonitored dataset, whereas he trains only on the monitored dataset under the second strategy. We consider the first strategy, which is called the standard model in previous literature [19]. Since PHMM takes the second strategy in the original paper, all the attacks besides PHMM are evaluated on the AleSS287 and OpwSS6879 datasets.
We conduct two experiments here. We first investigate the impact of the ratio of the number of unmonitored training instances to the number of total unmonitored instances on the performance of all the attacks, except for PHMM. In this experiment, we conduct five tests and draw the curves of TPR, FPR, and BDR for all the attacks in each test. The ratio is set to 50%, 60%, 70%, 80%, and 90% in the respective tests. The results are shown in Figure 10. We can see that the PF attack obtains a TPR of over 89.48% in all cases, which is superior to all other attacks. Besides, PF attains a BDR of over 93.07% and an FPR of below 1% in all the tests. Such performance beats all other attacks in all situations. Although KFP reaches FPR and BDR performance comparable with PF in most tests, its TPR falls far behind that of PF.
Next, we draw the Precision-Recall curves to compare these attacks. Since the sizes of the monitored and unmonitored datasets are heavily unbalanced, Precision-Recall curves are extensively used to represent the performance of the classifiers and to avoid the base-rate fallacy [5]. At first, we take 90% of the monitored instances and 80% of the unmonitored instances for training, while the rest of the instances are used for testing. Then, we conduct a set of tests by configuring a series of different settings for each attack. The configuring methods differ between attacks. Specifically, for KNN, KFP, and PF, we configure multiple settings by varying the parameter k following previous literature [19]. However, since DF uses the prediction probability to classify the input traffic instances, we take multiple probability thresholds as different settings. Finally, the precision and recall of each attack in each test are calculated. According to the evaluation results of all the attacks, the Precision-Recall curves are plotted.
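The threshold-based setting used for a probability-output classifier can be sketched as follows. The sketch simplifies the task to a binary decision (monitored versus unmonitored), which is an assumption: each test instance carries a score, taken here as the predicted probability of being monitored, and a ground-truth flag. The function name and the inputs are hypothetical.

```python
def precision_recall_points(scores, labels, thresholds):
    """Compute (threshold, precision, recall) triples.

    scores     : predicted probability that each instance is monitored
    labels     : True for monitored instances, False for unmonitored ones
    thresholds : probability thresholds, one per setting
    An instance is flagged as monitored when its score reaches the threshold.
    Thresholds at which nothing is flagged are skipped (precision undefined).
    Assumes `labels` contains at least one monitored instance.
    """
    pairs = list(zip(scores, labels))
    points = []
    for th in thresholds:
        tp = sum(1 for s, y in pairs if s >= th and y)
        fp = sum(1 for s, y in pairs if s >= th and not y)
        fn = sum(1 for s, y in pairs if s < th and y)
        if tp + fp == 0:
            continue
        points.append((th, tp / (tp + fp), tp / (tp + fn)))
    return points
```

Raising the threshold typically trades recall for precision, which is what tracing the curve over many thresholds makes visible. For the KNN-based attacks, the same function can be reused by replacing the threshold sweep with one evaluation per value of k.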
As we see in Figure 11, our attack achieves more than 97% precision and 93% recall in all settings, which outperforms the other attacks remarkably. To be specific, when the precision is 97%, the recall of DF, KFP, and KNN is remarkably lower than that of PF. Similarly, when the recall is 93%, the precision of DF and KNN is also lower than that of PF. When KFP attains a precision as high as that of PF (e.g., 1), its recall is under 80%, whereas the recall of PF is over 93%. Hence, the Precision-Recall curves demonstrate the good performance of PF.

Discussion
In this study, our goal is to seek another traditional machine-learning attack that merely uses one type of features, which are more effective than existing features. This goal mitigates the burden of feature engineering work for WFP. By combining the probabilistic topic model with WFP, we devise the new type of features, i.e., the "topic" probability vector. Based on the new features, the PF attack is proposed and evaluated on Shadowsocks, SSH, and TLS traffic. To be more sound and persuasive, we use not only the open datasets but also datasets collected in a realistic scenario. The evaluation results show the good performance of PF, which proves that using a probabilistic topic model for WFP is workable. The reason that PF is effective can be explained by the angle from which we look at a traffic instance. Traditionally, a traffic instance is represented by a length vector, each component of which indicates the value of packet length with direction. Hence, each component of the vector has the same physical significance (i.e., a length), which means that the traffic instance is merely viewed and described in a one-dimensional space. However, in the PF scheme, a traffic instance is represented by the "topic" probability vector, whose components have different physical significances that indicate the probabilities of the traffic instance belonging to different "topics." Hence, each traffic instance is mapped to a multidimensional space in our work. This probably explains why PF is more effective than LF and performs best in both the closed-world and open-world evaluations.
Besides lessening the tedious work of feature engineering, PF also reduces the number of features fed into the classifier. For example, the KNN attack needs 3736 features as its input, and DF takes 5000 features as the input of its DNNs. However, the number of features needed in PF equals the number of "topics," which is one of the model parameters and usually has the same order of magnitude as the number of monitored webpages (|"topics"| / |monitored web pages| < 200%) according to the experimental results.
Regarding future work, since our key idea lies in the combination of a probabilistic topic model with WFP, other models, such as LDA (latent Dirichlet allocation) and HDP (hierarchical Dirichlet process), should be effective in WFP too. Note that our work gives an example of digging out a new type of distinguishing features from an existing (i.e., baseline) type of features, i.e., packet length with direction. However, the baseline type of traffic features is not limited to that used in this paper. Besides, the symbolization method can be designed in other ways too. Such work might offer new directions to improve and extend our attack.
Besides, our method might be used in other related fields, such as radio frequency fingerprint identification (RFFID) [37,38]. RFFID is a lightweight access authentication method in mobile edge computing. It uses the radio frequency signal fingerprint of wireless devices for identification. Generally speaking, the first process of RFFID establishes, offline, a fingerprint database for legitimate wireless devices. Then, the fingerprint is used in the subsequent online authentication process. As in a website fingerprinting attack, if the fingerprint of a wireless device can be converted into symbols by some rule, it is quite possible to apply our method in the second process of RFFID. This direction has great potential.

Conclusions
In this work, we are the first to investigate the performance of a WFP attack using the probabilistic topic model. Our work is inspired by a neglected fact, that is, the sequence file generated from a traffic instance and a plain text are essentially similar. Based on this, we propose the PWFP model and the PF attack. Furthermore, our attack is tested and compared with four state-of-the-art attacks. The results in three extensively applied scenarios, i.e., Shadowsocks, SSH, and TLS, show that PF is feasible and effective. In all, we find and leverage the new type of features, i.e., the "topic" probability vector, to identify the test webpages under the protection of different PETs, and obtain better performance than prior attacks, including a deep-learning attack. To the best of our knowledge, it is the first time that a traditional machine-learning attack beats a deep-learning attack.
The success of PF means that existing traffic features still hold a great deal of unexploited information. Besides, it indicates that the performance of current WFP attacks might further improve, while using fewer types of features and needing less feature engineering work, if this unexploited information in known traffic features is dug out and leveraged. This points to a research direction for the future.

Data Availability
To ensure the reproducibility of the evaluation results, the source code and datasets of this work will be provided upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.