Few-Shot Website Fingerprinting Attack with Data Augmentation

This work introduces a novel data augmentation method for few-shot website fingerprinting (WF) attacks, where only a handful of training samples per website are available for deep learning model optimization. Moving beyond earlier WF methods that rely on manually engineered feature representations, more advanced deep learning alternatives demonstrate that learning feature representations automatically from training data is superior. Nonetheless, this advantage rests on the unrealistic assumption that many training samples exist per website, and it disappears otherwise. To address this, we introduce a model-agnostic, efficient, and harmonious data augmentation (HDA) method that can improve deep WF attack methods significantly. HDA involves both intrasample and intersample data transformations that can be used in a harmonious manner to expand a tiny training dataset to an arbitrarily large collection, thereby effectively and explicitly addressing the intrinsic data scarcity problem. We conducted extensive experiments to validate HDA for boosting state-of-the-art deep learning WF attack models in both closed-world and open-world attack scenarios, in the absence and presence of strong defense. For instance, in the more challenging and realistic evaluation scenario with WTF-PAD-based defense, our HDA method surpasses the previous state-of-the-art results by nearly 3% in classification accuracy in the 20-shot learning case. An earlier version of this work, Chen et al. (2021), has been presented as a preprint on arXiv.


Introduction
For privacy protection in accessing the Internet, an increasing number of users have turned to anonymous networks. The Onion Router (Tor) [1,2] is one of the most popular choices [3].
As free and open-source software, Tor enables anonymous communication. It directs Internet traffic through a free, worldwide, volunteer overlay network with thousands of relays, concealing a user's location and usage from anyone conducting network surveillance or traffic analysis. Concretely, it encrypts the content of communication and sends the data through a route composed of successive, randomly selected Tor nodes. However, this is still not completely secure, because data transportation patterns are exposed before reaching Tor servers. For instance, a local attacker could eavesdrop on the connection between a user and the guard node of the Tor network, with the attacking positions including any device in the same LAN or wireless network, a switch, a router, or a compromised Tor guard node (see Figure 1). By just analyzing the patterns of data packet traffic, without observing the content inside, the attacker is likely to be able to reason about which website a target user is visiting. This is often known as a website fingerprinting (WF) attack [4].
To implement a WF attack, the attacker first needs to create a particular digital fingerprint for every individual website and then learn some intrinsic pattern characteristics of these fingerprints to accomplish the attack. Earlier attacking methods rely on manually designed features based on expert domain knowledge [4][5][6][7][8][9][10][11][12][13]. They are not only inflexible but also susceptible to environmental changes over time.
This limitation can now be addressed by using more advanced deep learning techniques [14].
This is because, rather than utilizing manually designed features, deep learning methods can automatically learn feature representations directly from training data and are more scalable, provided that up-to-date training data are accessible. Two recent state-of-the-art studies, deep fingerprinting (DF) [15] and Var-CNN [16], have demonstrated this potential in comparison to manual feature-based methods. However, these deep learning solutions are not perfect, as their success is built upon the unrealistic assumption that a sufficiently large number (e.g., hundreds) of training samples per website are available; that is, they are data hungry. When only a small training dataset is given, as is typical in practical use, their performance is not necessarily superior to that of traditional methods [11][12][13]. It is expensive, tedious, or even infeasible to collect a vast training set in reality due to highly frequent and continuous changes in Internet environments. Consequently, the WF attack is fundamentally a few-shot learning problem, which nevertheless is largely unrecognized in the literature.
The few-shot nature of the WF attack is also considered in the recent triplet fingerprinting method [17], under the condition that there is a large set of relevant auxiliary training samples for model pretraining. It is essentially a transfer learning setting.
This significantly limits its scalability in practice, as in-the-wild changes of Internet data traffic conditions would render such assumptions invalid with high probability. In contrast, we introduce a realistic, generic few-shot WF attack setting where only a handful of training samples are available for every target website, without making any domain-specific assumptions. Clearly, triplet fingerprinting is not applicable in our setting due to its need for auxiliary training data.
We summarize the contributions of this paper as follows: (I) We introduce a novel, practical few-shot website fingerprinting attack problem, in which only a few training samples are available without rich auxiliary data. This respects the intrinsic nature of highly dynamic Internet traffic conditions and the high cost of collecting extensive training data in practice.
Highlighting the importance of few-shot learning without any auxiliary data assumption for the first time, we hope more future efforts will be dedicated to solving this practically significant WF attack challenge.
Figure 1: Illustration of data traffic flow between a user and target websites with a Tor network in between. Despite the added security of anonymity, website fingerprinting attackers are still able to reason about which website a victim user is visiting by analyzing the data traffic characteristics at multiple locations, as marked by the red dashed lines.
different assumptions. The most common scenario is the closed-world attack, which assumes that the user can only visit a small set of websites and that the adversary collects samples to train on all of them. Given that the websites in a closed-world setting are far fewer than in the real world, this assumption is not realistic. In an open-world scenario, the victim user may visit any website, including the monitored ones, as typically experienced in real-world applications. As a result, the adversary cannot collect data and train for every website. The above two scenarios concern the range of websites involved in the WF attack, independent of WF defense.
WF defense means that the user takes some actions to defend against a potential attack.
In the literature, several common assumptions are made. We briefly discuss three main ones. Regarding user behavior, it is assumed that all Tor users browse websites sequentially, opening only a single tab at a time. Regarding background traffic, it is assumed that the attacker is able to collect all the clean traces generated by the victim's visits despite dynamic background traffic. This is increasingly feasible, as shown in [26], and the multiplexed TLS traffic can be split into individual encrypted connections to each website. Regarding network conditions, the attacker is assumed to have the same conditions as the victim, including traffic conditions and settings. To compare with the benchmark results, we follow these general assumptions for fair evaluation.
Instead, we focus on addressing the following assumption. Often, the attacker assumes that the training data follow a distribution similar to that of the deployment data. This is a particularly strong and artificial assumption, as network conditions change and evolve frequently. This property forces the attacker to update the training data in order to keep the attacking model robust over time. It implies that the attacker cannot collect a large training set each time due to high acquisition costs. However, existing WF attack methods often ignore this factor by assuming the availability of large training data. In contrast, we study the largely ignored few-shot learning setting in the WF attack. Specifically, we approach this problem by explicitly solving the small training data issue via synthesizing new labelled training data.

Website Fingerprinting Attack Methods.
The first pioneering attack against the Tor network was evaluated by Herrmann et al. [7] in 2009. It achieved an accuracy of 2.96% using around 20 training samples per website in the closed-world scenario. Later, Wang and Goldberg [10] proposed to represent the traffic data using more fundamental Tor cells (i.e., direction data) as a unit rather than TCP/IP packets.
This representation is meaningful and informative, as it encodes essential characteristics of Tor data. By training a kernel SVM classifier, a ground-breaking performance of 90.9% accuracy was achieved on 100 sites, each with 40 training samples. In 2016, Panchenko et al. [13] proposed sampling the features from a cumulative trace representation and achieved 91.38% accuracy with 90 training instances per website. Hayes and Danezis [12] exploited random decision forests to achieve similar results. A typical design of the above methods is a two-stage strategy consisting of feature design and classifier learning. This is not only constrained by the limitations of hand-crafted features but also lacks interaction between the two stages, making the model performance inferior. Motivated by the remarkable success of deep learning techniques in computer vision and natural language processing [27,28], several deep learning WF attack methods have been introduced that can address the weakness mentioned above. This is because deep learning methods carry out feature learning and classification optimization from the raw training data end-to-end. For example, Rimmer et al. [29] applied deep learning methods (e.g., stacked denoising autoencoders, recurrent neural networks, and convolutional neural networks) to WF attacks, assuming sufficient training data. Later, Oh et al. [30] utilized an autoencoder (AE) to generate low-dimensional features to improve the performance of WF attacks. Meanwhile, using a popular neural network architecture called the VGG network [31] as the backbone, Sirinam et al. [15] proposed a deep fingerprinting (DF) attack model that attains 90% accuracy on 95 websites. However, this method needs at least a low-data training set (e.g., 50 training samples per website); otherwise, it suffers from a significant performance drop. When using 20 training samples per website, DF can only reach around 80% accuracy.
To overcome this limitation, Bhat et al. [16] developed the Var-CNN model based on ResNet [18] and dilated causal convolution [32,33]. When small training sets (e.g., 100 samples per website) are available, it achieves superior performance over DF, but it depends on less-realistic time features and less-scalable hand-crafted statistical information. Meanwhile, Rahman et al. [34] focused on how to utilize timing-related features in WF attacks.
One solution to few-shot learning is the recently proposed triplet fingerprinting (TF) method [17]. The key idea of TF is to pretrain a metric model that can measure pairwise distances on new classes. When the pretraining dataset is similar to the target data in distribution, TF can reach an accuracy of 94.5% on 100 websites using only 20 training samples per website. This is a strong transfer learning scenario. However, considering that the dynamics of network conditions are largely unknown and uncontrollable, such a transfer learning assumption is hardly valid in practice. In light of this observation, in this work, we propose a more realistic few-shot learning setting without assuming any auxiliary data with similar data characteristics for model pretraining. Hence, it is more scalable and generic for real-world deployments. Under the proposed, more challenging few-shot setting, TF is unable to work properly due to insufficient network initialization.
These previous attempts have shown the significance of different augmentation methods for model performance on the respective tasks. Inspired by these findings, we extensively investigate the effectiveness of training data augmentation by adapting existing operations for deep learning WF attacks in few-shot learning settings. To the best of our knowledge, this is the first attempt of its kind. Crucially, we demonstrate that the existing state-of-the-art deep WF attack method [16] benefits significantly from the proposed data augmentation operations in varying evaluation scenarios. This result should be encouraging and influential for future investigation of deep learning WF attack methods in particular.

Problem Definition.
In a website fingerprinting (WF) attack, the objective is to detect which website a target user is visiting. The common observations are data traffic traces $x$ produced by one visit to a website $y$. Taking each website as a specific class, this is essentially a multiclass classification problem. For model training, a labelled training set $D = \{(x_i, y_i)\}_{i=1}^{N}$ is often provided, where $y_i \in \{1, 2, \ldots, K\}$ specifies one of $K$ target websites. Two different settings are often considered in model testing: (1) the closed-world attack, where any test sample is assumed to belong to the target websites/classes, and (2) the open-world attack, where the above assumption is dropped, i.e., a test trace may be produced by a nontarget (unmonitored) website. The latter is a more realistic setting, yet it presents a more challenging task, as identifying whether a test sample falls into the target classes is nontrivial.

Feature Representation.
For the Tor network, the raw representation of a specific traffic trace consists of a sequence of temporally successive Tor cells travelling between a target user and a visited website. It is derived from TCP/IP data. Specifically, after retransmitted TCP/IP packets are discarded, TLS records are first reconstructed, and their lengths are then rounded down to the nearest multiple of 512 to form the final sequence data $x$. In value, each $x$ is a variable-length sequence of 1 (outgoing cell) and −1 (incoming cell). This raw representation is hence known as the direction sample. Temporal information about interpacket time is another modality of data used, but it is limited by its high reliance on network conditions, i.e., it is less stable and much noisier. Consequently, we mainly consider the direction data samples in this study, which are more scalable and generic.
This strategy is not only unscalable but also unsatisfactory in performance due to limited and incomplete domain knowledge. Deep learning methods provide a viable solution by directly learning more effective and expressive representations from training data, as shown in a few recent studies [15,16]. In this work, we advance this new direction further.
1D convolutional neural networks (CNNs) [40] are usually explored for WF attacks, as the raw data are temporal sequences. Building on the success of deep learning in computer vision, we adopt the same high-level network designs as standard 2D CNN models [41], whilst translating them into 1D counterparts. This is similar to [15,16].
As shown in Figure 2, a CNN model consists of multiple convolutional layers with nonlinear activation functions such as ReLU [42] and fully-connected (FC) layers, characterized by end-to-end feature extraction and classification. With convolutional operations, the filters of each layer transform input sequences using learnable parameters and output new feature sequences. This feature transformation is conducted layer by layer in a hierarchical fashion. A receptive field (kernel) of size 3 is often used in each layer to capture local feature patterns. By stacking more layers and pooling operations, the model can perceive the information of larger regions and achieve translational invariance. Another effective method for enlarging the receptive field is dilated causal convolution [32,33], which has been exploited in [16].
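The direction representation described above can be illustrated with a small sketch. It assumes the cells have already been extracted as signed values (positive for outgoing, negative for incoming); the function name `cells_to_direction`, the fixed target `length`, and the zero padding are illustrative assumptions rather than details given in the text.

```python
def cells_to_direction(cells, length):
    """Map signed cell records to a fixed-length direction sequence.

    `cells` holds signed values: positive = outgoing, negative = incoming.
    The fixed `length` and zero padding are assumptions made so that
    variable-length traces can be fed to a model with a fixed input size.
    """
    seq = [1 if c > 0 else -1 for c in cells if c != 0]
    seq = seq[:length]                      # truncate long traces
    return seq + [0] * (length - len(seq))  # zero-pad short traces


# A toy trace of five cells, padded to length 8.
trace = cells_to_direction([512, -512, -512, 512, -512], length=8)
```

Padding with 0 keeps the padded positions distinguishable from real cells, which only take the values 1 and −1.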
The feature representations $f$ of WF samples are the output of the global average pooling layer on top of the last convolution layer. To obtain the classification probability vector $\tilde{y} = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_K] \in \mathbb{R}^K$ over $K$ target classes, $f$ is fed into an FC layer and normalized by a softmax function.
For model training, we compute a cross-entropy loss between the classification vector and the ground-truth class label over all $N$ training samples as
$$\mathcal{L}_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \delta(k = y_i) \log \tilde{y}_{i,k}, \quad (1)$$
where $\tilde{y}_{i,k}$ is the predicted probability of class $k$ for a training sample $x_i$, $y_i$ refers to the ground-truth class label of $x_i$, and $\delta$ is a Dirac function. The objective is to maximize the predicted probability of the ground-truth class. This loss function is differentiable, with its gradients backpropagated to update all the learnable model parameters.
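As a minimal sketch of the softmax normalization and cross-entropy objective just described, the following plain-Python code computes the mean negative log-probability of the ground-truth class. The function names and the 0-based class indices are assumptions for illustration; a real model would use a deep learning framework's built-in loss.

```python
import math


def softmax(logits):
    """Normalize raw FC-layer outputs into a probability vector."""
    m = max(logits)                         # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability of the ground-truth classes.

    `labels` are 0-based class indices (an assumption of this sketch).
    """
    losses = []
    for logits, y in zip(batch_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[y]))
    return sum(losses) / len(losses)


# Two toy samples over K = 3 classes.
loss = cross_entropy([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]], [0, 2])
```

Only the ground-truth class contributes to each sample's loss, which is what the indicator term in the objective selects.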
Once the deep model is trained, we forward a given test sample, obtain a classification probability vector, and take the most likely class as the prediction in both closed-world and open-world settings. For the open-world setting, all unmonitored websites are considered to belong to a background class.
Discussion. While deep learning techniques have advanced significantly in the last several years, they still assume that a large set of labelled training samples is available. This is not always true, for example, for WF attack problems. In real-world applications, an attacker is usually faced with highly dynamic network environments, meaning that the distribution of raw features evolves continuously. As such, the training data need to be updated frequently, which prevents the collection of large labelled training sets in practice due to prohibitively high labelling costs. Consequently, only a small training set is accessible in reality, making deep learning methods ineffective.

Harmonious Website Fingerprinting Data Augmentation.
To address the above small training data challenge, we propose an intuitive, novel harmonious data augmentation (HDA) method. We introduce both intrasample and intersample augmentation operations that can be applied in a joint and harmonious manner for more effective data expansion.

Intrasample Augmentation.
The key idea of intrasample augmentation is that, given an individual training sample, we introduce a certain degree of random data perturbation and/or variation whilst keeping the same class label. Doing so allows us to generate an unlimited number of labelled training samples due to the nature of randomness. We consider two perturbation operations: random rotation and random masking.
Random rotation-based data augmentation means rotating an original training sample forward or backward by a random number of steps to generate virtual samples (Figure 3(a)):
$$\tilde{x} = \mathrm{rotate}(x, n_{\mathrm{step}}, \mathrm{dir}),$$
where $n_{\mathrm{step}}$ and $\mathrm{dir} \in \{\text{forward}, \text{backward}\}$ specify the number of steps and the direction in which to rotate an input sample $x$. The underlying hypothesis is that class-sensitive information encoded in a sample is distributed across different subsequences and that data traffic order is less important than signal patterns. After a sample is rotated, the original class information is largely preserved, i.e., semantically invariant. Hence, the same class can be annotated for the rotated variants. However, this hypothesis is likely to hold only up to a certain (unknown) degree. We therefore introduce an upper bound parameter $R_{\max}$ so that the rotation range is limited to at most $R_{\max}$ steps in both directions, $n_{\mathrm{step}} \leq R_{\max}$.
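Random rotation can be sketched as a cyclic shift on a list. This is an illustrative sketch only: the function names are hypothetical, and the convention that "forward" moves the tail of the sequence to the front is an assumption, since the text does not fix which way each direction shifts.

```python
import random


def rotate(x, n_step, direction):
    """Cyclically shift list x by n_step positions."""
    k = n_step % len(x)
    if k == 0:
        return list(x)
    if direction == "forward":
        return x[-k:] + x[:-k]   # tail wraps around to the front
    return x[k:] + x[:k]         # head wraps around to the back


def random_rotate(x, r_max, rng=random):
    """Draw n_step uniformly in [1, r_max] and a random direction."""
    n_step = rng.randint(1, r_max)
    direction = rng.choice(["forward", "backward"])
    return rotate(x, n_step, direction)


x = [1, 1, -1, -1, -1, 1]
virtual = random_rotate(x, r_max=3, rng=random.Random(0))
```

A rotation only reorders cells, so the virtual sample keeps the same counts of incoming and outgoing cells and the same class label.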
In contrast, random masking introduces localized corruption to an original training sample by setting a random subsequence to zero (Figure 3(b)). This data augmentation is written as
$$\tilde{x} = \mathrm{mask}(x, n_{\mathrm{len}}, \mathrm{loc}),$$
where $n_{\mathrm{len}}$ and $\mathrm{loc}$ denote the length and location of the subsequence that is masked out from an original sample $x$.
Rather than masking a contiguous subsequence, another strategy is to randomly select individual positions to mask. We consider that this may introduce more significant corruption to the underlying semantic information. Conceptually, random masking simulates varying traffic measurement errors in data transportation. Meanwhile, under the same hypothesis as above, such masking would not dramatically change the semantic class information provided that the masking is subject to some limit, e.g., the maximum length $M_{\mathrm{len}}$ of the masked subsequence. It hence offers a data perturbation choice complementary to random rotation.
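The masking operation can be sketched in the same style. The names are hypothetical, and drawing the masked length uniformly from 1 up to the limit is an assumption of this sketch; the text only states that the masked length is bounded.

```python
import random


def mask(x, n_len, loc):
    """Return a copy of x with n_len entries zeroed, starting at loc."""
    out = list(x)
    for i in range(loc, min(loc + n_len, len(out))):
        out[i] = 0
    return out


def random_mask(x, m_len, rng=random):
    """Zero out one random contiguous subsequence of length <= m_len."""
    n_len = rng.randint(1, min(m_len, len(x)))  # assumed uniform draw
    loc = rng.randint(0, len(x) - n_len)        # keep the span inside x
    return mask(x, n_len, loc)


x = [1, -1, 1, -1, 1, 1, -1, -1]
virtual = random_mask(x, m_len=3, rng=random.Random(0))
```

Because direction samples contain only 1 and −1, the zeroed span is unambiguous and reads as missing measurements.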

Intersample Augmentation.
Apart from data augmentation on individual samples, we further introduce data perturbation across two different samples to enrich the limited training set.
We propose random mixing, which generates virtual samples and class labels by linear interpolation between two original samples $x_i$ and $x_j$ as
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$
where $(y_i, y_j)$ are the one-hot class labels of $x_i$ and $x_j$. The mixing parameter $\lambda \in [0, 1]$ follows a Beta distribution, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, with $\alpha > 0$ the parameter that controls the strength of interpolation. This is in a similar spirit to mixup in the image understanding domain [36]. Unlike the intrasample augmentation above, random mixing changes the semantic class information, since the original samples may be drawn from different classes. It simplifies the data distribution by imposing a linear relationship between classes for complexity minimization. As shown in Figure 3(c), only the common features remain in the mixed sample. If the two original samples are generated by visiting the same website, the mixed sample reflects the shared characteristics with respect to this website. Otherwise, it reflects the commonality of two different websites. While seemingly counterintuitive, we will show that such a method brings positive contributions on top of random masking and random rotation.
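A minimal sketch of random mixing follows, using `random.betavariate` from the Python standard library to draw the mixing weight. The function names are hypothetical, and equal-length samples are assumed.

```python
import random


def mix(x_i, y_i, x_j, y_j, lam):
    """Linearly interpolate two samples and their one-hot labels."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


def random_mix(x_i, y_i, x_j, y_j, alpha, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha), then mix."""
    lam = rng.betavariate(alpha, alpha)
    return mix(x_i, y_i, x_j, y_j, lam)


x_a, y_a = [1, -1, 1, 1], [1, 0]
x_b, y_b = [-1, -1, 1, -1], [0, 1]
x_mix, y_mix = random_mix(x_a, y_a, x_b, y_b, alpha=0.1,
                          rng=random.Random(0))
```

Positions where the two samples agree keep their value in the mix, while positions where they disagree are pulled toward zero, which matches the observation that only common features remain.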

Combination and Compatibility.
Different augmentation operations can be applied to the same samples in harmony, without conflicting with each other. There is also no particular constraint on the order of applying the three data augmentation operations in combination. Given a fixed set of parameters as discussed above, different augmentation orders will result in different virtual samples. This makes little conceptual difference, as the space of virtual samples is infinite.
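The following sketch illustrates how the three operations might be chained on a mini-batch in an arbitrary order. All names are hypothetical, each operation is a simplified stand-in written for this example, and the default parameter values are placeholders rather than a recommended configuration.

```python
import random


def rotate_batch(batch, rng, r_max=20):
    out = []
    for x, y in batch:
        k = rng.randint(1, r_max) % len(x)
        if rng.random() < 0.5:
            k = -k                          # flip the rotation direction
        out.append((x[k:] + x[:k], y))
    return out


def mask_batch(batch, rng, m_len=180):
    out = []
    for x, y in batch:
        n_len = rng.randint(1, min(m_len, len(x)))
        loc = rng.randint(0, len(x) - n_len)
        out.append((x[:loc] + [0] * n_len + x[loc + n_len:], y))
    return out


def mix_batch(batch, rng, alpha=0.1):
    partners = batch[:]
    rng.shuffle(partners)                   # random partner per sample
    out = []
    for (xi, yi), (xj, yj) in zip(batch, partners):
        lam = rng.betavariate(alpha, alpha)
        out.append(([lam * a + (1 - lam) * b for a, b in zip(xi, xj)],
                    [lam * a + (1 - lam) * b for a, b in zip(yi, yj)]))
    return out


def augment(batch, ops, rng):
    """Apply the operations in the given order to (x, one-hot y) pairs."""
    for op in ops:
        batch = op(batch, rng)
    return batch


batch = [([1, -1, 1, 1, -1, -1], [1, 0]), ([-1, -1, 1, -1, 1, 1], [0, 1])]
virtual = augment(batch, [rotate_batch, mask_batch, mix_batch],
                  random.Random(0))
```

Because every operation maps a batch to a batch of the same shape, any permutation of the list passed to `augment` is valid, and fresh virtual samples can be drawn for each mini-batch during training.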

Augmentation Optimization.
In our harmonious data augmentation (HDA), three hyperparameters $R_{\max}$, $M_{\mathrm{len}}$, and $\alpha$ are introduced. To generate meaningful virtual samples, obtaining their optimal values is necessary; otherwise, adverse effects may even be introduced.
Instead of manual tuning, we adopt an automatic Bayesian estimator called the Tree of Parzen Estimators (TPE) [43]. The conventional TPE can take only a single parameter at a time, so we would need to optimize each of the three hyperparameters independently. This differs from our data augmentation process, where the three augmentation operations are typically applied together, making the independently tuned parameters of TPE suboptimal.
This is because jointly applying the three augmentations makes them interdependent.
To solve this problem, we propose a sequential optimization process that gradually takes into account the interdependence of the different augmentation operations (see Algorithm 1). Specifically, we start with a random, fixed order of applying our random rotation, masking, and mixing operations. Then, we optimize the first one with TPE, move to the next one with all the previous ones optimized and fixed, and stop after finishing the last one. Each time, we still optimize a single hyperparameter whilst keeping all the previously optimized ones fixed. In this way, we expand the interdependence among different operations sequentially.
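The sequential procedure can be sketched as a loop that tunes one hyperparameter at a time while holding earlier choices fixed. To stay self-contained, this sketch swaps TPE for an exhaustive search over a small candidate list; the candidate grids, the names `sequential_search` and `toy_validation_score`, and the toy scoring function are all hypothetical stand-ins for training a model and measuring validation accuracy.

```python
import random


def sequential_search(spaces, evaluate, rng):
    """Tune one augmentation hyperparameter at a time, in a random order.

    Each step searches only the current parameter, with all previously
    tuned parameters fixed (exhaustive search stands in for TPE here).
    """
    order = list(spaces)
    rng.shuffle(order)                      # random, then fixed, order
    best = {}
    for name in order:
        scores = {}
        for value in spaces[name]:
            scores[value] = evaluate({**best, name: value})
        best[name] = max(scores, key=scores.get)
    return best


def toy_validation_score(params):
    # Hypothetical stand-in for "train with these augmentations and
    # measure validation accuracy"; it peaks at the target values.
    target = {"r_max": 20, "m_len": 180, "alpha": 0.1}
    return -sum(abs(params[k] - target[k]) / target[k] for k in params)


spaces = {
    "r_max": [1, 10, 20, 30, 40, 50],
    "m_len": [20, 60, 100, 140, 180],
    "alpha": [0.1, 0.3, 0.5, 0.7, 0.9],
}
best = sequential_search(spaces, toy_validation_score, random.Random(0))
```

In a real run, `evaluate` would train the model with the already-selected augmentations plus the current candidate, so later parameters are tuned in the presence of earlier ones.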

Theoretical Foundation and Formulation.
The objective of learning a WF attack model is equivalent to deriving a function $h \in H$ that fits the latent translation relationship between raw feature vectors $x \in X$ and the corresponding website class labels $y \in Y$, that is, fitting a joint distribution $P(X, Y)$. To this end, in deep learning, we often leverage a loss function $L$ defined to penalize the differences between predictions $h(x)$ and targets $y$. We minimize the average loss over the joint distribution:
$$R(h) = \int L(h(x), y)\, \mathrm{d}P(x, y),$$
which is known as expected risk minimization [44]. However, the joint distribution is often unknown, particularly for WF attacks with small training data. Given a limited training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, the joint distribution can only be approximated by an empirical distribution as
$$P_\delta(x, y) = \frac{1}{N} \sum_{i=1}^{N} \delta(x = x_i, y = y_i),$$
where $\delta(x = x_i, y = y_i)$ is a Dirac mass centered at a sample $(x_i, y_i)$. Accordingly, the expected risk can now be approximated by an empirical risk:
$$R_\delta(h) = \int L(h(x), y)\, \mathrm{d}P_\delta(x, y) = \frac{1}{N} \sum_{i=1}^{N} L(h(x_i), y_i).$$
The above approximation follows the empirical risk minimization (ERM) principle [44]. The cross-entropy loss (1) is a representative example, which essentially minimizes $R_\delta(h)$ for the classification task.
While ERM is a common strategy, it suffers from a high risk of poor generalization due to the tendency of memorization, mainly when a large model is used [45]. To mitigate this issue, we adopt the notion of vicinal distribution [46], which can better approximate the true joint distribution. In particular, the vicinal distribution $P_\upsilon$ in the data space is defined as
$$P_\upsilon(\tilde{x}, \tilde{y}) = \frac{1}{N} \sum_{i=1}^{N} \upsilon(\tilde{x}, \tilde{y} \mid x_i, y_i).$$
Intuitively, $P_\upsilon$ measures the probability of finding a virtual labelled sample $(\tilde{x}, \tilde{y})$ in the vicinity of an original training sample $(x_i, y_i)$.
Given such vicinal distributions, we first construct a virtual dataset $D_\upsilon := \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{m}$ by sampling $P_\upsilon$ randomly and then minimize an empirical vicinal risk to learn $h$ as
$$R_\upsilon(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(\tilde{x}_i), \tilde{y}_i).$$
Clearly, at the core of this strategy is performing data augmentation around the original training samples. Rather than computing a loss value only for every single training sample, it derives a local distribution centered at each individual sample and generates more virtual training samples to reduce the negative memorization effect of deep learning. This is the key rationale of our data augmentation method.

Augmentation Formulation.
We formulate the proposed harmonious data augmentation operations in terms of vicinal distributions. For intrasample augmentation (including random rotation and masking), the vicinal distribution is defined as
$$\upsilon(\tilde{x}, \tilde{y} \mid x_i, y_i) = \delta(\tilde{x} = T(x_i), \tilde{y} = y_i),$$
where $T$ is a transformation operator.
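For random rotation, given any length-$n$ sample $x = [x_0, \ldots, x_i, \ldots, x_{n-1}]$, we first define a circulant matrix $B$ for forward rotation as
$$B = \begin{bmatrix} x_0 & x_1 & \cdots & x_{n-1} \\ x_{n-1} & x_0 & \cdots & x_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ x_1 & x_2 & \cdots & x_0 \end{bmatrix},$$
whose $k$th row (counting from 0) is $x$ rotated forward by $k$ steps. Then, we sample the step size $n_{\mathrm{step}}$ uniformly from the range $\{1, \ldots, R_{\max}\}$. Using the one-hot representation $e_{n_{\mathrm{step}}}$ of $n_{\mathrm{step}}$, we obtain the rotation transformation as
$$T_{\mathrm{rot}}(x) = e_{n_{\mathrm{step}}}^{\top} B.$$
For the backward case, we perform the same process as above but with a backward rotation matrix instead.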
For random masking, we similarly sample the start position $s$ uniformly in the range $\{1, \ldots, n - n_{\mathrm{len}}\}$, where $n_{\mathrm{len}}$ is the length of the masked subsequence. The masking transformation can be represented by a matrix as
$$M = \mathrm{diag}\Big(\mathbf{1} - \sum_{i=s}^{s + n_{\mathrm{len}} - 1} \mathrm{Row}_i(I)\Big),$$
where $I$ is the identity matrix, $\mathbf{1}$ is the all-one vector, $\mathrm{Row}_i(\cdot)$ selects the $i$th row of a matrix, and $\mathrm{diag}(\cdot)$ transforms a vector into a diagonal matrix. The masking operation is finally conducted by matrix multiplication as
$$T_{\mathrm{mask}}(x) = x M.$$
For intersample augmentation, random mixing in our case, the vicinal distribution is defined as
$$\upsilon(\tilde{x}, \tilde{y} \mid x_i, y_i) = \frac{1}{N} \sum_{j=1}^{N} \mathbb{E}_{\lambda}\big[\delta(\tilde{x} = \lambda x_i + (1 - \lambda) x_j,\ \tilde{y} = \lambda y_i + (1 - \lambda) y_j)\big],$$
where $\lambda$ is a random variable drawn from a Beta distribution $\mathrm{Beta}(\alpha, \alpha)$ and $y$ is a one-hot class label vector. This local vicinity is assumed to respect a linear structure with respect to class labels.

ALGORITHM 1: Data augmentation optimization.
Input: a training set $(X_t, Y_t)$ and a validation set $(X_v, Y_v)$.
Output: data augmentation with optimal parameters $B_{\mathrm{aug}}$.
1: Set $B_{\mathrm{aug}} = \emptyset$ (empty set);
2: Order the data augmentation operations randomly;
3: while enumerating the augmentation operations do
4:   Get the search space $S_{\mathrm{aug}}$ of the current augmentation $A$;
5:   Use TPE on $S_{\mathrm{aug}}$ to obtain the optimal parameter $b_{\mathrm{aug}}$, with the model trained by $B_{\mathrm{aug}}$ and $A$;
6:   $B_{\mathrm{aug}} = B_{\mathrm{aug}} \cup b_{\mathrm{aug}}$;
7: end while
8: return $B_{\mathrm{aug}}$

Security and Communication Networks 7

Datasets.
We evaluated our data augmentation method HDA on four standard WF attack datasets as below.

Implementation Details.
We conducted our experiments in Keras [47]. We used the standard training, validation, and test splits for all competitors for fair comparison. HDA was applied only to the training set. We optimized HDA's hyperparameters using Var-CNN [16] as the deep learning model on CW$_{100}$ in the closed-world setting and applied the same parameter setting to all the other deep learning methods, datasets, and settings. This allows testing the generality and scalability of our HDA method. For augmentation optimization, we set the search space as 1 to 50 with step 5 for forward/backward $R_{\max}$ (random rotation), 1 to 200 with step 20 for $M_{\mathrm{len}}$ (random masking), and [0, 1] with step 0.1 for $\alpha$ (random mixing). We selected the best value for each of these parameters with respect to the validation performance. The optimal parameter values we obtained are $R_{\max} = 20$, $M_{\mathrm{len}} = 180$, and $\alpha = 0.1$. We applied the same parameter setting tuned on CW$_{100}$ to all other datasets, both for simplicity and as a generalization test. To save storage, we performed online data augmentation within each mini-batch without any data preprocessing. In each experiment, we trained every deep learning model for 150 epochs and used the checkpoint with the best performance on the validation set for the model test. We only used the direction feature data, without time sequences or hand-crafted features. We ran each experiment 10 times and reported the mean and standard deviation as the final performance.

Why Do We Not Apply HDA to DF?
On the one hand, we found that DF is unstable when optimized with HDA. In some experiments, DF + HDA achieves better results than the original DF, but not always. On the other hand, the feature extractor of TF is taken from DF. Hence, we just provide the best result of TF following its recommended setting as the baseline.

Setting.
We conducted the closed-world attack on AWF$_{100}$, Wang$_{100}$, and DF$_{95,\mathrm{Nodef}}$. We separated each dataset into training and test (70 samples per class) splits. We considered few-shot settings with $n \in \{5, 10, 15, 20\}$ training samples per class. The validation set was used to select the best-performing model for testing. We used classification accuracy as the performance metric. Besides deep network models, we also compared our method with two conventional hand-crafted feature-based methods: CUMUL [13] and k-FP [12].

Results.
The results of different methods are compared in Tables 1-3. We have the following observations: (1) TF remains the best few-shot WF attack algorithm, especially when pretrained with similar datasets (pretrained and tested with the AWF dataset, and tested with the Wang dataset). (2) However, deep learning methods (Var-CNN) become clearly stronger when the pretrained TF is faced with different distributions across training and testing datasets (pretraining on AWF and testing on Wang$_{100}$ and DF$_{95,\mathrm{Nodef}}$), suggesting great potential. In the 10/15/20-shot cases, Var-CNN + HDA achieves the best overall result on both Wang$_{100}$ and DF$_{95,\mathrm{Nodef}}$. In particular, on DF$_{95,\mathrm{Nodef}}$, the benefit from HDA is significant, and Var-CNN + HDA surpasses TF by a large margin of 13.2% in the 20-shot case. (3) With our HDA method for training data augmentation, every deep learning method improves in all few-shot cases. For example, the 20-shot accuracy of Var-CNN increases from 78.7% to 90.7% on AWF$_{100}$, from 88.4% to 90.6% on Wang$_{100}$, and from 68.1% to 91.3% on DF$_{95,\mathrm{Nodef}}$. Similarly, the 20-shot accuracy of ResNet-34 improves from 51.3% to 86.4% on AWF$_{100}$, from 85.9% to 87.4% on Wang$_{100}$, and from 61.4% to 85.8% on DF$_{95,\mathrm{Nodef}}$. (4) Our HDA consistently improves different methods on varying datasets, suggesting good generality. (5) The performance deviation of Var-CNN assisted by our HDA method is the smallest among all the competitors, implying strong stability.

Setting.
We conducted the open-world attack experiments on the combination of ROWWUM 400,000 and AWF 100. We treat the websites of AWF 100 as target (monitored) classes and those of ROWWUM 400,000 as nontarget (unmonitored) classes. In this test, we randomly selected 8,020 out of 400,000 unmonitored websites and separated them into three disjoint sets of 20/1,000/7,000 for training, validation, and test, respectively. In this scenario, precision and recall were used to evaluate model performance due to the need for detecting nontarget classes [48]. We considered the same two deep learning methods (ResNet-34 and Var-CNN [16]) for comparison.

Results.
The results of different methods are reported in Table 4. We considered two settings, one tuned for best precision and one for best recall. Overall, we observed similar trends as above: our HDA is highly effective for improving both deep learning methods. Note that, unlike in the closed-world scenario, Var-CNN + HDA achieves top results in most cases under both tuning settings, even if it is not always the best. Similarly, Var-CNN + HDA remains more stable and less sensitive to training sample size. Significantly, our HDA method further enhances these strengths through efficient data augmentation, leading to more robust WF attack solutions.
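The two tuning settings can be illustrated with a small sketch of open-world precision and recall. It rests on an assumption not stated in the text: a trace is flagged as monitored when the classifier's highest monitored-class probability reaches a confidence threshold, and the threshold is the quantity being tuned. The function name and the toy scores are hypothetical.

```python
def open_world_precision_recall(max_probs, is_monitored, threshold):
    """Precision and recall of the monitored-vs-unmonitored decision.

    max_probs: highest monitored-class probability per test trace (assumed input).
    is_monitored: ground truth, True if the trace belongs to a monitored site.
    """
    tp = fp = fn = 0
    for prob, monitored in zip(max_probs, is_monitored):
        flagged = prob >= threshold
        if flagged and monitored:
            tp += 1
        elif flagged and not monitored:
            fp += 1
        elif monitored:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy scores: three monitored traces followed by two unmonitored ones.
scores = [0.9, 0.8, 0.4, 0.7, 0.2]
truth = [True, True, True, False, False]
for threshold in (0.3, 0.5, 0.75):
    p, r = open_world_precision_recall(scores, truth, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data a high threshold removes the false positive at the cost of one missed monitored trace, while a low threshold recovers every monitored trace but admits the false positive, which is the trade-off behind the "tuned for precision" and "tuned for recall" settings.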

WF Attack against Defense
Setting. In contrast to the two experiments above, we further tested a more challenging WF attack scenario with defense involved. Defense makes the traffic patterns of different websites more similar to one another, thereby making the attack more difficult. We considered the most popular defense, WTF-PAD, widely deployed in the Tor network. We used the DF 95,wtf-pad dataset in this experiment. We used 100 random samples per website and divided them into three sets for training (20 samples), validation (10 samples), and test (70 samples), respectively. We reported classification accuracy as the performance metric in the closed-world scenario. We augmented the same two deep learning methods (ResNet-34 and Var-CNN [16]) with HDA and compared them with the pretrained few-shot method (TF [17]) and hand-crafted feature-based methods (k-NN [11], k-FP [12], and CUMUL [13]).

Results.
We reported the results of the closed-world WF attack under WTF-PAD-based defense in Table 5. We made the following observations. (1) Some hand-crafted feature-based methods (CUMUL) are superior to recent deep learning methods (ResNet-34 and Var-CNN) in the few-shot learning scenarios. This is mainly because the latter lack sufficient training samples, resulting in model overfitting.
(2) Using our HDA for training data augmentation, we can directly address the data scarcity problem and significantly boost the performance of previous deep learning methods. As a result, Var-CNN + HDA outperforms the other competitors by a moderate margin, e.g., a 2.9% gap over the best competitor, CUMUL. (3) ResNet-34 is consistently surpassed by Var-CNN. By benefiting more from our data augmentation, Var-CNN achieves the best results across all shot cases. This implies that Var-CNN has a greater demand for large training data but also a higher performance potential than ResNet-34. (4) If TF is not pretrained with a similar dataset, it loses its advantage when a few more samples (20-shot) are provided.

Ablation Studies.
We carried out a set of component analysis experiments to examine the exact effect of different designs of our method (HDA). We adopted the most common closed-world attack scenario without defense on the AWF 100 dataset, following the same setting as Section 4.2. Note that this AWF 100 dataset differs from the one used in Section 4.2 because the two are different subsets. In this section, we evaluated the 15-shot learning case in particular, using Var-CNN [16] as the deep learning model backbone.

Table fragment (rows as extracted; the DF row is truncated in the source):
CUMUL [13]   72.2 ± 1.7   79.7 ± 1.4   83.3 ± 2.0   85.9 ± 0.6
k-FP [12]    79.3 ± 1.0   83.9 ± 1.0   85.9 ± 0.6   87.5 ± 0.8
DF [15]      1.

Individual Augmentation Operations.
Recall that our data augmentation method (HDA) consists of three different operations (random rotation, masking, and mixing). We have demonstrated their performance advantages as a whole in the varying test settings above. For deeper insight, it is informative and necessary to examine their individual contributions as well as their different combinations. We conducted these experiments with an exhaustive set of operation combinations and reported the results in Table 6.
It is observed that (1) each of the three operations makes a significant difference in performance, with rotation and masking being the best individual operations, improving the classification accuracy by 17.4%. (2) When any two augmentation operations are used jointly, the performance can be further increased. The combination of masking and mixing gives the highest accuracy among them. (3) Combining all three operations (HDA) achieves the best result with a smaller deviation. This suggests that the different operations are complementary and compatible with each other.
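For readers unfamiliar with these operations, the following is a rough sketch of how rotation, masking, and mixing could act on a packet-direction sequence. The concrete definitions are assumptions made for illustration only, since this section does not specify them: rotation is taken to be a circular shift, masking zeroes a random contiguous window, and mixing copies a random window from another trace of the same website. All function and parameter names are hypothetical.

```python
import random


def rotate(trace, max_shift, rng):
    """Intrasample: circularly shift the trace by a random offset (assumed form)."""
    trace = list(trace)
    shift = rng.randint(-max_shift, max_shift) % len(trace)
    return trace[-shift:] + trace[:-shift]


def mask(trace, mask_len, rng):
    """Intrasample: zero out a random contiguous window (assumed form)."""
    out = list(trace)
    mask_len = min(mask_len, len(out))
    start = rng.randint(0, len(out) - mask_len)
    out[start:start + mask_len] = [0] * mask_len
    return out


def mix(trace_a, trace_b, mix_len, rng):
    """Intersample: overwrite a random window of trace_a with trace_b (assumed form)."""
    if len(trace_a) != len(trace_b):
        raise ValueError("traces must have equal length")
    out = list(trace_a)
    mix_len = min(mix_len, len(out))
    start = rng.randint(0, len(out) - mix_len)
    out[start:start + mix_len] = trace_b[start:start + mix_len]
    return out


def augment(trace, same_class_pool, rng, max_shift=3, mask_len=2, mix_len=2):
    """Apply all three operations in sequence to produce one new sample."""
    partner = rng.choice(same_class_pool)
    out = rotate(trace, max_shift, rng)
    out = mask(out, mask_len, rng)
    return mix(out, partner, mix_len, rng)


# Toy usage: +1 / -1 encode outgoing / incoming packets.
rng = random.Random(0)
trace = [1, -1, 1, 1, -1, -1, 1, -1]
pool = [[1, 1, -1, -1, 1, -1, -1, 1]]
new_sample = augment(trace, pool, rng)
```

Because every call draws fresh random offsets, repeated calls to `augment` on the same few traces yield an arbitrarily large set of distinct training samples, which is the property the ablation in Table 6 examines operation by operation.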

Augmentation Optimization.
For optimal data augmentation, we propose a sequential optimization strategy (see Algorithm 1) that captures the interdependence between the different augmentation operations applied. To evaluate its effect, we compared it with a baseline algorithm that optimizes each augmentation parameter independently.
As shown in Table 7, the proposed optimization algorithm (see Algorithm 1) is clearly superior, validating our consideration that there exists interdependence between different augmentation operations when applied jointly to the same samples. Note that we obtained this performance gain at the same cost as the baseline counterpart. Besides, it is worth noting that even with the simpler optimization, our data augmentation method (HDA) can still greatly improve the previous deep learning model Var-CNN and achieve new state-of-the-art results (Table 7 vs. Table 1). This further validates that the proposed augmentation operations are highly compatible with one another and can be applied together well.
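The difference between the two strategies can be illustrated with a toy sketch. It is not a reproduction of Algorithm 1, which is not shown in this section: here "sequential" is assumed to mean that each parameter is tuned with the previously tuned parameters held at their chosen values, while "independent" tunes each parameter with all others at their defaults. The `evaluate` function is a made-up stand-in for validation accuracy, and the parameter names `a` and `b` are hypothetical.

```python
def independent_search(evaluate, grid, defaults):
    """Baseline: tune each parameter with all others fixed at their defaults."""
    best = dict(defaults)
    for name, candidates in grid.items():
        best[name] = max(candidates, key=lambda v: evaluate({**defaults, name: v}))
    return best


def sequential_search(evaluate, grid, defaults):
    """Tune parameters one by one, keeping earlier choices fixed (assumed scheme)."""
    current = dict(defaults)
    for name, candidates in grid.items():
        current[name] = max(candidates, key=lambda v: evaluate({**current, name: v}))
    return current


def evaluate(params):
    """Toy score with an interaction term: the best b depends on the chosen a."""
    a, b = params["a"], params["b"]
    return -(a - 2) ** 2 - (b - a) ** 2


grid = {"a": [0, 1, 2, 3], "b": [0, 1, 2, 3]}
defaults = {"a": 0, "b": 0}

independent = independent_search(evaluate, grid, defaults)
sequential = sequential_search(evaluate, grid, defaults)
print("independent:", independent, evaluate(independent))
print("sequential: ", sequential, evaluate(sequential))
```

Both searches call `evaluate` once per candidate value (eight times here), so they have the same cost, but only the sequential variant lets the choice of `b` react to the value already selected for `a`.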

Conclusion
We presented a model-agnostic, simple yet surprisingly effective data augmentation method, called HDA, for the few-shot website fingerprinting attack.
This is an understudied and practically critical problem, as in practice only a handful of training samples per website can be feasibly collected due to the inherently high dynamics of Internet networks and the expensive cost of label collection. Importantly, we focus on deep learning-based methods, a line of new research efforts with vast potential for future investigations. In particular, our HDA method offers three different data augmentation operations, including random rotation, masking, and mixing in intrasample and intersample fashion. They can be applied to the same training samples harmoniously, with high complementarity and compatibility. Moreover, we introduce a sequential augmentation parameter optimization method that captures the interdependence between different operations when applied jointly. With recent state-of-the-art deep learning WF attack models, we conducted extensive experiments on four datasets. The results show that the proposed data augmentation method makes dramatic differences in performance and enables previous deep learning methods to outperform hand-crafted feature-based counterparts in the few-shot learning setting for the first time, often by a large margin. Moreover, when the pretraining-based few-shot WF attack (TF) is placed in a new environment, it cannot outperform our augmented method.
This is achieved without making any artificial assumptions of relevant, large auxiliary training data for model pretraining. With our HDA method, the need to frequently collect large training data is eliminated, whilst still achieving stronger and more robust WF attacks. Finally, we performed detailed component analysis to diagnose the effect of individual model components.

Table fragment (rows as extracted; the method name of the first row is missing in the source):
[18]              7.4 ± 0.5    9.4 ± 0.7    13.3 ± 1.3   12.3 ± 1.2
Var-CNN [16]      6.6 ± 0.3    9.2 ± 0.7    12.5 ± 0.8   19.2 ± 1.7
ResNet-34 + HDA   12.3 ± 1.9   28.1 ± 3.7   38.2 ± 6.2   47.7 ± 5.1
Var-CNN + HDA     25.3 ± 2.2   46.9 ± 1.9   48.7 ± 1.4   63.2 ± 1.8

Additional Discussion.
Besides data augmentation, an alternative approach to reducing the demand for data annotation in a few-shot learning context is semisupervised learning, which has been extensively studied in e-mail classification [49], intrusion detection [50], authorship attribution [51], computer vision [52,53], and so forth. The key idea is to exploit the structural knowledge (manifold and cluster structures) of unlabeled data to increase the volume of training data. Crucially, we believe that our proposed HDA can benefit existing semisupervised learning methods due to its algorithm-agnostic nature. One limitation of our HDA is that more training data leads to higher training cost. However, this is a general problem shared by all data augmentation methods. To further advance research on website fingerprinting, it is necessary to connect it with other fingerprinting fields, from traditional fingerprint-based biometric systems [54] to the newest collaborative intrusion detection networks under passive message fingerprint attack [55,56]. By introducing strategies that have proved effective in related fingerprinting fields, website fingerprinting, especially few-shot website fingerprinting, could advance further.

Conflicts of Interest
The authors declare that they have no conflicts of interest.