Network Intrusion Detection Model Based on Improved BYOL Self-Supervised Learning

The combination of deep learning and intrusion detection has become a hot topic in network security. Faced with massive, high-dimensional network traffic with an uneven sample distribution, accurately detecting anomalous traffic is the primary task of intrusion detection. Most research on intrusion detection systems based on anomalous network traffic detection has focused on supervised learning; however, obtaining labeled data often requires a lot of time and effort, as well as the support of network experts. Therefore, it is worthwhile to investigate label-free self-supervised learning approaches such as BYOL, a simple and elegant framework with sufficiently powerful feature extraction capabilities for intrusion detection systems. In this paper, we propose a new data augmentation strategy for intrusion detection data and an intrusion detection model based on label-free self-supervised learning. The augmentation strategy introduces perturbations so that the model learns invariant feature representations, and an improved BYOL self-supervised learning method is trained on the UNSW-NB15 intrusion detection dataset without labels to extract network traffic feature representations. Linear evaluation on UNSW-NB15 and transfer learning on NSL-KDD, KDD CUP99, CIC IDS2017, and CIDDS_001 achieve excellent performance in all metrics.


Introduction
With the advent of the information age and the popularity of the Internet, all aspects of our lives have changed greatly. While the Internet has given us significant convenience, it has also brought about a variety of network security issues. How to avoid these security problems has become the focus of the industry. Intrusion detection, as an important part of the network security system, was first proposed by Anderson [1], who defined an intrusion attempt or threat as a potential, premeditated, unauthorized attempt to access information and manipulate information, making the system unreliable or unusable.
The earliest intrusion detection model was proposed by Denning and Neumann [2]; it generates a number of profiles of the system from the audit log data of the host system and monitors the variance of those profiles to detect intrusions. According to their data sources, intrusion detection systems can be classified as host-based intrusion detection systems (HIDS) [3] and network-based intrusion detection systems (NIDS) [4]. NIDS observes and analyses real-time network traffic and monitors multiple hosts, aiming to detect intrusions in the network by collecting packet information and viewing its contents [5]. Previous researchers have mostly used pattern-matching algorithms to analyse their data. Feature selection usually follows one of three schemes: filter approaches (e.g., information gain and correlation coefficient algorithms), wrapper methods (e.g., genetic algorithms [6] and particle swarm algorithms [7]), and embedded methods (e.g., LASSO regression algorithms). Feature extraction relies on linear transformation methods, such as principal component analysis (PCA) and linear discriminant analysis, as well as nonlinear transformation methods, like kernel-based principal component analysis. However, all of the above methods have certain drawbacks. For example, genetic algorithms are prone to premature convergence. As for PCA, the meaning of each dimension of the principal components is somewhat ambiguous, so the transformed samples are not as interpretable as the original ones.
Traditional NIDS also suffers from a number of problems, such as a poor detection rate for unknown attacks, a high false alarm rate, and high resource consumption. Machine learning algorithms have many advantages, such as strong generalization ability, simple implementation, and ease of understanding and explanation. In recent years, traditional machine learning algorithms like support vector machine (SVM), decision tree (DT), and K nearest neighbor (KNN) have been introduced into the field of intrusion detection to improve its efficiency and reduce the false negative rate and false positive rate. Nonetheless, the complexity of traditional machine learning algorithms leaves their performance and accuracy on high-dimensional massive data far behind those of deep learning methods. Moreover, traditional machine learning algorithms rely on feature engineering: algorithms must be designed to extract effective features of network traffic, which greatly increases the computational cost. Deep learning methods do not require human experience to extract feature information; instead, the algorithms automatically learn feature information from the original data. This is known as representation learning, and it removes the need for laborious feature engineering. Furthermore, deep learning methods can extract better feature representations from massive amounts of data to create models with better generalization capabilities. In recent years, convolutional neural networks (CNN) and recurrent neural networks (RNN) have been widely used in the field of intrusion detection. For example, CNN methods convert one-dimensional network traffic into two-dimensional grayscale images and then use convolutional kernels to extract effective features of network traffic to improve the detection rate.
However, there are many weak points in intrusion detection models based on supervised learning, the main point being the cost of acquiring labeled data, requiring professional security experts to scrutinize traffic data and decide whether a particular pattern is a new attack, which undoubtedly increases the cost of intrusion detection. Based on the above drawbacks, unsupervised learning methods have recently gained attention in the domain of intrusion detection, where various types of autoencoders (e.g., variational autoencoder, sparse autoencoder, and denoising autoencoder) and generative adversarial neural networks have been applied to reconstruct network traffic samples and learn feature representations of them. Although unsupervised learning methods can learn feature representation without labeled data, the learned feature representations are only applicable to certain datasets and cannot be transferred to other datasets, which definitely limits the generalization ability of the model.
Given the shortcomings of traditional machine learning, supervised learning, and unsupervised learning, this paper proposes a new intrusion detection model based on improved BYOL self-supervised learning. Supervised learning methods require a large amount of manually labeled data, and unsupervised learning models generalize poorly. We therefore adopt self-supervised learning, which can be trained without labels: it derives its own supervisory information from large-scale unlabeled data and trains the network with this constructed supervisory information to learn highly generalizable and valuable feature representations. The essence of deep learning lies in its powerful representation learning capability.
The excellent results obtained in the transfer learning experiments on the NSL-KDD, KDD CUP99, CIC IDS2017, and CIDDS_001 datasets demonstrate the strong generalization ability of the self-supervised learning model and the generality of the extracted network traffic feature representations.
In general, this paper makes the following contributions to the intrusion detection domain: (1) Self-supervised learning is introduced to the field of intrusion detection, and its strong potential and scope in intrusion detection are validated. (2) A novel data augmentation strategy for intrusion detection datasets is proposed; it introduces different perturbations to generate samples from different perspectives, which enhances the ability of the model to learn feature representations of network traffic efficiently. (3) The BYOL self-supervised learning algorithm is improved by introducing BoTNet with a multihead attention mechanism, which suppresses the features that contribute less to classification and strengthens the features that contribute more, so as to improve the performance of the model. The BYOL loss function is optimized to make the training process smoother and the model converge faster, thereby enhancing the stability and robustness of the model. (4) To verify the effectiveness of the proposed methods and models, we apply linear evaluation on UNSW-NB15 and transfer learning on the NSL-KDD, KDD CUP99, CIC IDS2017, and CIDDS_001 datasets and compare the results with various machine learning methods and state-of-the-art deep learning models using several experimental evaluation metrics.
The rest of the paper is organized as follows: Related works are discussed in Section 2. The network intrusion detection model based on improved BYOL self-supervised learning is presented in Section 3. Section 4 describes the dataset used in this paper and the preprocessing of the dataset. In Section 5, we validate the effectiveness of the improved BYOL self-supervised learning intrusion detection model through relevant experiments. Finally, we draw conclusions and suggest some future work in Section 6.

Related Works
Nowadays, with the development of science and technology, new methods such as data mining, machine learning, and deep learning have been applied to the domain of intrusion detection [8][9][10]. While data mining algorithms usually require a large amount of data to extract feature information, the detection rate is low for those categories which consist of insufficient samples; furthermore, most data mining algorithms are sensitive to noise. If the dataset contains plenty of noisy data, there is no doubt that it will have a huge impact on the algorithm. Currently, researchers apply all kinds of machine learning methods to detect anomalous network traffic.
These methods include KNN, DT, SVM, LR, and ensemble learning. Kabir et al. [11] proposed an intrusion detection model based on least square support vector machine (LSSVM), which selects representative samples from randomly divided subgroups of the dataset so that they reflect the significant features of the whole dataset. Nancy et al. [12] proposed a dynamic recursive feature selection algorithm to select an optimal number of features from the dataset, and then an intelligent fuzzy temporal decision tree algorithm integrated with convolution neural networks was used for classification. The experimental results showed that new types of attacks could be detected well and that the approach also reduced the network delay and false negative rate. Hurley et al. [13] used PCA for feature extraction, combined it with fuzzy techniques to obtain the degree to which sample objects belong to each category, and then used KNN to classify the attack categories; however, the accuracy of this algorithm gradually decreases as the amount of data increases. Most traditional machine learning methods are shallow learning methods that emphasize feature engineering and feature selection. They cannot effectively solve the problem of classifying the large-scale intrusion data that appear in real web application environments, and their performance and accuracy on multiclass classification problems decrease sharply as the dataset grows. Moreover, shallow learning is not suitable for prediction on high-dimensional massive data.
In recent years, deep learning methods such as CNN and RNN have been widely used in intrusion detection. Riyaz and Ganapathy [14] proposed a new feature selection algorithm, called the conditional random field and linear correlation coefficient-based feature selection algorithm, to select the most contributing features and classify them using an existing convolutional neural network; this feature selection algorithm not only greatly reduces the training time but also increases the accuracy of the model by eliminating irrelevant features. Yang and Wang [15] proposed a convolutional neural network with cross-layer feature fusion, which combines the structural properties of convolutional neural networks with the cross-layer aggregation design concept; the experimental results showed that the model has a high accuracy, a high true positive rate, and a low false alarm rate in intrusion detection. RNN and LSTM treat the network traffic as sequential data. The authors of [16][17][18] proposed RNN-based, gated recurrent unit-based, and LSTM-based intrusion detection systems, respectively, that treat data as time series, and experimented on the KDD CUP99 and NSL-KDD datasets with good results. However, intrusion detection models based on supervised learning have many drawbacks, the most important being that labeled data are expensive to obtain: professional security experts must scrutinize traffic data and decide whether a particular pattern is a new attack, which undoubtedly increases the cost of intrusion detection. Unsupervised learning methods are also gaining importance in the field of intrusion detection. Choi et al. [19] used an autoencoder, denoising autoencoder, variational autoencoder, and stacked autoencoder to train on datasets and constructed thresholds to discriminate traffic types based on the mean and variance of the reconstruction errors and the upper alpha quantile obtained from each model.
Farahnakian and Heikkonen [20] proposed a deep autoencoder trained in a greedy layerwise fashion in order to avoid overfitting and local optima and achieved 94.53% multiclassification accuracy and only 0.42% false alarm rate in KDD CUP99. Sakurada and Yairi [21] proposed a method to perform anomaly detection based on reconstruction error thresholds using autoencoders. In the training phase, the autoencoder model learns to reconstruct its input data, which includes only normal data. In the testing phase, test data are fed into the learned model to output the reconstructed test data. When the reconstruction error is higher than some threshold arbitrarily chosen by the user, the data are determined to be anomalous and vice versa. But, the unsupervised learned features are only applicable to this dataset and cannot be transferred to other datasets, which definitely limits the generalization capability of the model.
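The reconstruction-error thresholding described above can be illustrated with a minimal sketch. It assumes that an autoencoder has already produced per-sample reconstruction errors; the function names, the multiplier `k`, and the numeric errors are hypothetical and serve only to show the decision rule, not the exact thresholds used in [19] or [21].

```python
from statistics import mean, pstdev


def fit_threshold(train_errors, k=3.0):
    """Threshold = mean + k * std of reconstruction errors on normal traffic.

    k is an illustrative choice; the cited works pick thresholds differently.
    """
    return mean(train_errors) + k * pstdev(train_errors)


def is_anomalous(error, threshold):
    """A sample is flagged when its reconstruction error exceeds the threshold."""
    return error > threshold


# Hypothetical reconstruction errors measured on normal training traffic.
normal_errors = [0.010, 0.012, 0.009, 0.011, 0.013]
threshold = fit_threshold(normal_errors)

# One test sample that reconstructs well and one that does not.
flags = [is_anomalous(e, threshold) for e in (0.011, 0.250)]
```

Because the threshold is fitted to one dataset's error distribution, it has to be re-estimated for every new dataset, which is one way to read the transferability limitation noted above.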
Security and Communication Networks
In summary, with the development of time and technology, machine learning, deep learning, and unsupervised learning have made good progress in the field of intrusion detection. The study in [22] used Markov models for feature extraction and solved the problem that Bayesian network classifiers were usually trained on data by selecting suboptimal model heuristics, but the evaluation indicators used in that paper were not exhaustive. The work in [23] used an information gain algorithm for feature extraction and solved the problem that the KDD CUP99 dataset did not reflect the current state of cyber attacks, while the F-measure for the unknown attack category was low. The study in [24] solved the problem that, as the number of features increased, the data became more complex and it was difficult to extract good low-dimensional features effectively; its binary classification accuracy on NSL-KDD reached 95.25%. The work in [25] addressed the intrusion detection problem that traditional machine learning techniques could not solve, and its binary classification results on UNSW-NB15 achieved excellent performance in all metrics. The study in [26] used the autoencoder for feature extraction and solved the problem that traditional dimension reduction methods have difficulty capturing nonlinear information in the data; however, the model was evaluated with a single index, accuracy. The study in [27] used the convolution autoencoder for feature extraction and solved the problem that traditional log anomaly detection ignores the temporal pattern of logs, as well as the problem of information loss caused by vector representation, while the F-measure was as low as 73.76%. Unlike existing intrusion detection models, this paper proposes an intrusion detection model based on a self-supervised learning approach.
We not only take into account the difficulty of acquiring labeled data but also attach significant importance to the generalization ability of the model. The model is applied to the intrusion detection benchmark datasets KDD CUP99, NSL-KDD, UNSW-NB15, CIC IDS2017, and CIDDS_001, which together form a relatively comprehensive set of intrusion detection datasets. We use multiple evaluation indicators, such as accuracy, precision, detection rate, F1 score, ROC curve, and AUC value, to evaluate the performance of the proposed model, which makes the evaluation of the proposed method more rigorous and comprehensive.

Bootstrap Your Own Latent.
Typical methods for self-supervised learning include CPC [28], MoCo [29], SimCLR [30], DINO [31], and BYOL [32]. CPC is mainly applied in the video and speech fields for processing serialized information; SimCLR and MoCo need many positive and negative sample pairs and large batch sizes to obtain excellent feature representations, while DINO uses ViT [33] as a feature extractor. In the field of intrusion detection, a larger batch size means that more memory is needed to process the data, and the large number of parameters in ViT makes real-time detection difficult, so this paper adopts BYOL as the intrusion detection model. BYOL is a simple and elegant self-supervised learning framework that requires neither negative sample pairs nor a large batch size to train a network with sufficiently powerful feature extraction capabilities. Furthermore, BYOL does not require human experience to extract feature information; the algorithm automatically learns feature information from the original data, so no feature engineering is needed, which saves much time and effort. BYOL can also extract better feature representations from massive amounts of data to create models with better generalization capabilities. That is to say, the feature representations extracted by BYOL can be applied to other tasks whose domain is the same as that of the original dataset. BYOL's goal is to learn a representation $y_\theta$ which can then be used for downstream tasks. It uses two interacting and mutually learning asymmetric neural networks, called the online network and the target network; the online network is trained to predict the target network's feature representation of the same image under a different augmented view. The symbols used in this paper are explained in Table 1.
The weight parameters of the online network are denoted by $\theta$; the online network includes an encoder $f_\theta$ for feature extraction, a projector $g_\theta$ for feature projection, and a predictor $q_\theta$ for feature prediction. The weight parameters of the target network are denoted by $\xi$; the target network includes an encoder $f_\xi$ for feature extraction and a projector $g_\xi$ for feature projection. The specific training process is shown in Figure 1.
Given a set of network traffic $X$, a sample $x \sim X$ is drawn uniformly from $X$ (we can regard $x$ as a greyscale image after preprocessing and reshaping the network traffic). We then apply two different sets of image augmentation operations $t$ and $t'$ to $x$, and the resulting augmented views are $v = t(x)$ and $v' = t'(x)$. From the first augmented view $v$, the online network outputs a representation $y_\theta \triangleq f_\theta(v)$, a projection $z_\theta \triangleq g_\theta(y_\theta)$, and a prediction $q_\theta(z_\theta)$. From the second augmented view $v'$, the target network outputs $y'_\xi \triangleq f_\xi(v')$ and the target projection $z'_\xi \triangleq g_\xi(y'_\xi)$. Then, we apply $l_2$-normalization to $q_\theta(z_\theta)$ and $z'_\xi$, so that both latent variables have unit length and only their direction is preserved, which prepares them for the loss function:

$$\bar{q}_\theta(z_\theta) \triangleq \frac{q_\theta(z_\theta)}{\lVert q_\theta(z_\theta) \rVert_2}, \qquad \bar{z}'_\xi \triangleq \frac{z'_\xi}{\lVert z'_\xi \rVert_2}. \tag{1}$$

Finally, BYOL trains the online network and the target network by constraining the similarity of the normalized online predictions and target projections:

$$L_{\theta,\xi} \triangleq \left\lVert \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \right\rVert_2^2 = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta), z'_\xi \rangle}{\lVert q_\theta(z_\theta) \rVert_2 \cdot \lVert z'_\xi \rVert_2}. \tag{2}$$
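As a numerical illustration of the normalization and similarity constraint described above, the following sketch uses plain Python lists as stand-ins for the prediction and target projection vectors. The helper names and the example vectors are hypothetical; the point is only that the squared distance between two unit-length vectors equals 2 minus twice their cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that only its direction remains."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_distance_loss(prediction, target):
    """Squared Euclidean distance between the l2-normalized vectors."""
    p, t = l2_normalize(prediction), l2_normalize(target)
    return sum((a - b) ** 2 for a, b in zip(p, t))


def cosine_loss(prediction, target):
    """Equivalent form: 2 - 2 * cosine similarity of the raw vectors."""
    dot = sum(a * b for a, b in zip(prediction, target))
    norms = math.sqrt(sum(a * a for a in prediction)) * math.sqrt(
        sum(b * b for b in target)
    )
    return 2.0 - 2.0 * dot / norms


# Hypothetical online prediction and target projection.
prediction = [0.3, -1.2, 0.8, 2.0]
target = [0.1, -0.9, 1.1, 1.5]
loss_a = squared_distance_loss(prediction, target)
loss_b = cosine_loss(prediction, target)  # agrees with loss_a up to rounding
```

The loss is 0 when the two vectors point in the same direction and reaches its maximum of 4 when they point in opposite directions. The sketch does not guard against zero vectors.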
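We symmetrize the loss $L_{\theta,\xi}$ in equation (2) by separately feeding $v'$ to the online network and $v$ to the target network to compute $\tilde{L}_{\theta,\xi}$; then, the loss function of BYOL can be written as

$$L^{\mathrm{BYOL}}_{\theta,\xi} = L_{\theta,\xi} + \tilde{L}_{\theta,\xi}. \tag{3}$$

At each training step, we perform a stochastic optimization step to minimize $L^{\mathrm{BYOL}}_{\theta,\xi}$ with respect to $\theta$ only, but not $\xi$, as depicted by the stop gradient in Figure 1. BYOL's dynamics are summarized as

$$\theta \leftarrow \mathrm{optimizer}\left(\theta, \nabla_\theta L^{\mathrm{BYOL}}_{\theta,\xi}, \eta\right), \qquad \xi \leftarrow \tau \xi + (1 - \tau)\theta, \tag{4}$$

where optimizer is an optimizer and $\eta$ is a learning rate.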
The update of the target weights $\xi$ is an exponential moving average (EMA) of the online weights $\theta$, where $\tau \in [0, 1]$ is a manually set hyperparameter. At the end of training, we only keep the encoder $f_\theta$, as in [29]. The full training steps of the BYOL algorithm are shown in Algorithm 1.
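The exponential moving average update of the target network can be sketched as follows. Parameters are represented as flat Python lists purely for illustration; the function name and the value of `tau` are assumptions, and a real implementation would apply the same rule to every tensor of the target network.

```python
def ema_update(target_params, online_params, tau):
    """Return tau * xi + (1 - tau) * theta for every parameter pair."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [
        tau * xi + (1.0 - tau) * theta
        for xi, theta in zip(target_params, online_params)
    ]


# Hypothetical weights: the target starts at zero and drifts toward the
# (here fixed) online weights over three updates.
online = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(3):
    target = ema_update(target, online, tau=0.9)
```

With `tau` close to 1 the target network changes slowly and lags behind the online network; with `tau = 0` it would simply copy the online weights at every step.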
From Section 3.1, we can conclude that the intrusion detection model based on improved BYOL self-supervised learning can be divided into four main steps: (1) data augmentation, (2) feature representation, (3) feature projection, and (4) contrastive learning. Next, we discuss the specific implementation of these four steps and how to integrate improved BYOL self-supervised learning into the intrusion detection domain.

Data Augmentation.
A set of data augmentation operations plays a crucial role in learning good data representations. Different data augmentation operations introduce different perturbations and generate samples under different augmented views. BYOL self-supervised learning learns invariant feature representations of network traffic precisely by pulling together the different augmented views of the same image, which shows the importance of data augmentation for self-supervised learning. The existing image data augmentation operations are as follows: colour jittering (brightness, saturation, and contrast transformations), Gaussian blur, colour dropping (conversion to grayscale), horizontal and vertical flipping, and random cropping. For intrusion detection data, the network traffic has already been converted to the grayscale format after preprocessing, so the two data augmentation operations of colour dropping and colour jittering are not needed. In addition, the network traffic data have been normalized to values between 0 and 1 in preprocessing, and adding Gaussian blur would introduce noise, which would greatly reduce the effect of feature extraction.
As a result, this paper proposes a new data augmentation operation named random_shuffle for intrusion detection data. Given an input set $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ representing $N$ network traffic samples, each sample $x^{(i)}$ is a $d$-dimensional feature vector and can be described as $x^{(i)} = [x_1, x_2, \ldots, x_d]$, which is usually a high-dimensional feature vector. We can use the random_shuffle function to randomly permute the positions of the features to obtain the augmented data $x'$. The random_shuffle function uses the modern version of the Fisher-Yates algorithm; viewing a network traffic sample as an array $x_1, x_2, \ldots, x_d$, the pseudo-code of the modern version of the Fisher-Yates algorithm is shown in Algorithm 2.

Table 1: Symbols used in this paper.
…
…: convolutional operations with 1 × 1 kernel size
$R_h$, $R_w$: the relative position encodings of the image's height and width
softmax(·): the SoftMax function
$W$, $b$: weights and bias of the fully connected layer
BN: batch normalization layer
$\sigma$: ReLU activation function

Figure 1: BYOL's architecture. BYOL minimizes a similarity loss between $q_\theta(z_\theta)$ and $\mathrm{sg}(z'_\xi)$, where $\theta$ and $\xi$ are the trained weights of the online and target networks and sg means stop gradient. At the end of training, everything but $f_\theta$ is discarded and $y_\theta$ is used as the image representation.
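The modern Fisher-Yates shuffle behind random_shuffle can be sketched in a few lines of Python. This is an illustrative version rather than the paper's implementation: the optional `rng` argument is an assumption added so that runs can be made reproducible, and the function returns a shuffled copy instead of permuting the input in place.

```python
import random


def random_shuffle(features, rng=None):
    """Return a uniformly random permutation of a feature vector.

    Walks from the last index down to 1 and swaps each element with a
    randomly chosen element at or before it (modern Fisher-Yates).
    """
    rng = rng or random.Random()
    a = list(features)  # copy, so the original sample stays untouched
    for i in range(len(a) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends: 0 <= j <= i
        a[i], a[j] = a[j], a[i]
    return a


original = [9, 6, 7, 2, 4, 5, 1, 3]
augmented = random_shuffle(original, rng=random.Random(0))
```

The shuffled view contains exactly the same feature values as the original sample, only at different positions, so the perturbation changes the layout of the traffic "image" without altering any individual feature.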
For example, suppose the original array is [9, 6, 7, 2, 4, 5, 1, 3]. Table 2 shows how the modern version of the Fisher-Yates algorithm performs the shuffle operation on this array and makes it easier to understand how the algorithm works. 2D convolutional neural networks are usually used to process images in the two-dimensional (2D) array format. To make the network traffic conform to the input format of convolutional neural networks, the data need to be reshaped. For instance, a preprocessed UNSW-NB15 network traffic sample has 196 dimensions, i.e., $x \in \mathbb{R}^{196}$. After reshaping, it is transformed into the grayscale map format, i.e., $x' \in \mathbb{R}^{14 \times 14}$. After that, multiple augmentation operations are selected from the four array augmentation operations (horizontal flip, vertical flip, random crop, and the random_shuffle proposed in this paper) to form a set of data augmentation operations. For example, we can define $t = \{\text{horizontal flip}, \text{vertical flip}, \text{random crop}\}$ and $t' = \{\text{vertical flip}, \text{random\_shuffle}, \text{random crop}\}$. After the two different sets of data augmentation are applied, the network traffic views $v$ and $v'$ are obtained and can be input to $f_\theta$ and $f_\xi$ for feature extraction. We selected two sets of network traffic samples from the UNSW-NB15 dataset and visualized them before and after data augmentation. As shown in Figures 2 and 3, the network traffic after data augmentation retains the original traffic characteristics while also introducing different perturbations. In this way, the feature representations learned by the model are more general, and the model can learn invariant feature representations of network traffic.
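The reshaping step and the composition of augmentation operations can be sketched as follows. The helper names are hypothetical, the sample values are synthetic, and random crop is left out because the text does not specify how the cropped view is resized; the sketch only shows how a 196-dimensional vector becomes a 14 × 14 array and how a set of operations is applied in sequence.

```python
def reshape(vec, rows, cols):
    """Turn a flat feature vector into a rows x cols array (list of rows)."""
    if len(vec) != rows * cols:
        raise ValueError("vector length must equal rows * cols")
    return [list(vec[r * cols:(r + 1) * cols]) for r in range(rows)]


def horizontal_flip(image):
    """Mirror every row left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Reverse the order of the rows."""
    return [list(row) for row in image[::-1]]


def compose(*ops):
    """Build one augmentation that applies the given operations in order."""
    def apply(image):
        for op in ops:
            image = op(image)
        return image
    return apply


# Synthetic stand-in for a preprocessed 196-dimensional sample in [0, 1].
sample = [i / 195 for i in range(196)]
image = reshape(sample, 14, 14)

t = compose(horizontal_flip, vertical_flip)
view = t(image)
```

Both flips only rearrange positions, so `view` holds the same 196 values as `image`; applying either flip twice returns the original array.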

Feature Representation.
After the two sets of data augmentation operations, we obtain two different augmented views $v$ and $v'$ of the original network traffic samples. According to the BYOL training framework, the two views are then input to the encoders $f_\theta$ and $f_\xi$ for feature extraction. In the BYOL paper, ResNet [34] was selected as the backbone of the feature encoder to obtain the image representation; however, for intrusion detection data, not every feature plays an effective role in the classification result. If too many nonessential features are incorporated, noise will be introduced, which will greatly affect the final classification result. In this paper, BoTNet [35], which uses the attention mechanism, is used as the backbone of the encoder, since it can automatically learn and calculate the contribution of the input data to the output classification; it suppresses the features that contribute less to the classification and strengthens the features that contribute more in the intrusion detection data. The only difference between ResNet and BoTNet is that BoTNet adds the global multihead self-attention (MHSA) mechanism in the c5 stage, as shown in Figure 4.
Suppose the input image is $x \in \mathbb{R}^{H \times W \times d}$, and $R_h \in \mathbb{R}^{H \times 1 \times d}$ and $R_w \in \mathbb{R}^{1 \times W \times d}$ are the relative position encodings of the height and width, which represent the relative information in the vertical and horizontal directions of image $x$. Let the query matrix of the image be $q$, the key matrix of the image be $k$, and the value matrix of the image be $v$. Then, we can get them by convolving the input image $x$ with three different 1 × 1 convolution kernels, respectively.

ALGORITHM 1 (inputs):
$X$, $T$, and $T'$: set of images and combination of transformations
$\theta$, $f_\theta$, $g_\theta$, and $q_\theta$: initial online parameters
$\xi$, $f_\xi$, and $g_\xi$: initial target parameters
optimizer: updates online parameters using the loss gradient
$S$ and $N$: total number of optimization steps and batch size
$\{\tau_s\}_{s=1}^{S}$ and $\{\eta_s\}_{s=1}^{S}$: target network update schedule and learning rate schedule
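Through $q$ and $k$, the content-content encoding of the image can be obtained from formula (5) as follows (with $q$, $k$, and the position encodings arranged as channels × positions, as in Algorithm 3):

$$\text{content-content} = q^{\top} k. \tag{5}$$

Given $q$, $R_h$, and $R_w$, the content-position encoding of the image can be calculated from formula (6) as follows:

$$\text{content-position} = (R_h + R_w)^{\top} q. \tag{6}$$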

ALGORITHM 2: Fisher-Yates algorithm.
Input: array a with n elements (indices 0 to n − 1)
for i from n − 1 down to 1 do
    j ← random integer such that 0 ≤ j ≤ i
    exchange a[j] and a[i]

After obtaining the two encodings, we can obtain the attention matrix of the original image by the following formula:

$$\text{attention} = \operatorname{softmax}(\text{content-content} + \text{content-position}), \tag{7}$$

where softmax(·) indicates the SoftMax function.
Finally, the output of the MHSA is generated based on $v$ and the attention matrix, which can be expressed as follows:

$$z = v \cdot \text{attention}^{\top}. \tag{8}$$

The entire MHSA process can also be represented by

$$z = v \cdot \operatorname{softmax}\left(q^{\top} k + (R_h + R_w)^{\top} q\right)^{\top}. \tag{9}$$

The pseudo-code for the MHSA algorithm steps is shown in Algorithm 3. The MHSA method is simple but powerful: the convolutional neural network can efficiently learn abstract and low-resolution feature maps of the images, and the global self-attention mechanism can process and summarize the information contained in the feature maps. It is this improvement that allows BoTNet to achieve a large improvement in accuracy on ImageNet [36] while having 1.2 times fewer model parameters than ResNet50.
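To make the attention computation concrete, the following is a minimal single-head sketch in plain Python that mirrors the order of operations in Algorithm 3: a content-content term, a content-position term, a softmax over positions, and a weighted sum of the values. It assumes that the feature map has already been flattened into a list of positions and that the position encodings have been summed into one vector `r[i]` per position; all names and numbers are illustrative, and a practical model would use batched tensor operations with several heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(q, k, v, r):
    """q, k, v, r: lists of n vectors (one per spatial position)."""
    n = len(q)
    attention, output = [], []
    for i in range(n):
        # content-content term q_i . k_j plus content-position term r_i . q_j
        logits = [dot(q[i], k[j]) + dot(r[i], q[j]) for j in range(n)]
        weights = softmax(logits)
        attention.append(weights)
        # output at position i is the attention-weighted sum of the values
        output.append(
            [sum(weights[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))]
        )
    return attention, output


# Tiny hypothetical example: 3 positions, 2 channels.
q = [[0.5, -0.1], [0.2, 0.3], [-0.4, 0.1]]
k = [[0.1, 0.2], [0.0, -0.3], [0.6, 0.4]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
r = [[0.05, 0.0], [0.0, 0.05], [0.1, -0.1]]
attention, output = self_attention(q, k, v, r)
```

Each row of `attention` sums to 1, so every output vector is a convex combination of the value vectors, which is how the mechanism can emphasize some positions and suppress others.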

Feature Projection.
After encoding by $f_\theta$ and $f_\xi$, the network traffic is converted from the grayscale image into the vectors $y_\theta$ and $y'_\xi$, which can be described as $y_\theta = f_\theta(v)$ and $y'_\xi = f_\xi(v')$, where both feature representations correspond to the output of the final average pooling layer of BoTNet, $v$ and $v'$ are the augmented views obtained after the two data augmentation operations, respectively, and $y \in \mathbb{R}^{d}$, with $d$ being a manually set hyperparameter. Subsequently, the feature representations $y_\theta$ and $y'_\xi$ of the network traffic are projected from the high-dimensional feature space to the low-dimensional hidden space by the multilayer perceptrons (MLPs) $g_\theta$ and $g_\xi$, each consisting of two hidden layers and a batch normalization layer, to obtain $z_\theta = g_\theta(y_\theta)$ and $z'_\xi = g_\xi(y'_\xi)$. In these MLPs, $W$ and $b$ are the weights and biases of the fully connected layer, BN is the batch normalization layer, and $\sigma$ is the ReLU activation function. The low-dimensional hidden space can be understood as a feature representation of the network traffic after removing nonessential feature information (e.g., location information of the image) while reducing the feature dimensionality to save computational resources. The feature projection identifies what stays invariant under data augmentation. At the same time, the projection may remove information that is useful for downstream tasks, such as the colour or orientation of objects in the image after data augmentation. By using the nonlinear transformations $g_\theta$ and $g_\xi$, more information can be formed and maintained in $y_\theta$ and $y'_\xi$.
The feature projection step is essential. Without this step, the intrusion detection model will likely experience model collapse; i.e., the online network and the target network can make the representations of all network traffic images identical by reducing the weights and biases to zero, so that the intrusion detection model does not learn any valid feature information. From the perspective of the information bottleneck, the neural network gradually discards information that is nonessential for the classification task (e.g., the colour or orientation of the objects in the images after data augmentation mentioned above, i.e., the perturbation caused by the data augmentation). With the feature projection added, the feature space before the projection retains more information that is useful for the classification task, so that the weights and biases in the online network and target network avoid converging to zero and thus learn more useful feature information.
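A projection head of the kind described above can be sketched in plain Python. The layer order used here (fully connected layer, batch normalization, ReLU, fully connected layer) is an assumption, since the text only lists the components; the batch normalization omits learnable scale and shift, and all weights and inputs are made-up numbers chosen for illustration.

```python
import math


def linear(batch, weights, bias):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [
        [sum(x * w for x, w in zip(row, w_row)) + b for w_row, b in zip(weights, bias)]
        for row in batch
    ]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the batch."""
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((x - m) ** 2 for x in c) / n for c, m in zip(cols, means)]
    return [
        [(x - m) / math.sqrt(var + eps) for x, m, var in zip(row, means, variances)]
        for row in batch
    ]


def relu(batch):
    return [[max(0.0, x) for x in row] for row in batch]


def projector(batch, w1, b1, w2, b2):
    """Assumed layer order: linear -> batch norm -> ReLU -> linear."""
    hidden = relu(batch_norm(linear(batch, w1, b1)))
    return linear(hidden, w2, b2)


# Hypothetical batch of three 4-dimensional representations y,
# projected to 2 dimensions through a hidden layer of width 3.
y = [[0.2, 0.4, 0.1, 0.9], [0.8, 0.1, 0.5, 0.3], [0.4, 0.7, 0.6, 0.2]]
w1 = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1], [0.3, 0.3, -0.5, 0.2]]
b1 = [0.0, 0.1, -0.1]
w2 = [[0.7, -0.3, 0.2], [-0.1, 0.4, 0.5]]
b2 = [0.05, -0.05]
z = projector(y, w1, b1, w2, b2)
```

The predictor of the online network is described as having a similar composition, so the same sketch could stand in for it under the same assumptions.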

Contrastive Learning.
After feature projection, the network traffic is projected to the low-dimensional vector space to obtain $z_\theta$ and $z'_\xi$. At this point, the network traffic that has passed through the online network also needs to go through the predictor $q_\theta$ to obtain the prediction vector $q_\theta(z_\theta)$; the network traffic that has passed through the target network does not need this step. The composition of $q_\theta$ is similar to that of $g_\theta$ and $g_\xi$: all of them are multilayer perceptrons consisting of two hidden layers and a batch normalization layer, in which $W$ and $b$ are the weights and biases of the fully connected layer, BN is the batch normalization layer, and $\sigma$ is the ReLU activation function. Therefore, for the prediction vector $q_\theta(z_\theta)$ obtained from the online network, we can interpret $z'_\xi$ as the ground truth about the network traffic generated by the target network. Given $q_\theta(z_\theta)$ and $z'_\xi$, we apply $l_2$-normalization to them to obtain $\bar{q}_\theta(z_\theta)$ and $\bar{z}'_\xi$.
In the original BYOL paper, the similarity between $\bar{q}_\theta(z_\theta)$ and $\bar{z}'_\xi$ is constrained as the loss function to train the online network and the target network, and the loss function $L_{\theta,\xi}$ is obtained from formula (13). After $l_2$-normalization, the value of each dimension of $\bar{q}_\theta(z_\theta)$ and $\bar{z}'_\xi$ is less than 1; however, since both are high-dimensional vectors, multiplying the vectors and then summing produces relatively large gradient values, which makes the training process more unstable, as shown in Figure 8(a). If the mean square error is used as the loss function instead, the vector multiplication is avoided by subtracting the vectors and then summing the squares. Because $\bar{q}_\theta(z_\theta)$ and $\bar{z}'_\xi$ are obtained from two different augmented views of the same network traffic after feature extraction and feature projection, the difference between them is small, and subtracting the vectors and then summing the squares results in a smaller loss value, which makes the model more stable during training. Therefore, we replace $L_{\theta,\xi}$ in BYOL with $iL_{\theta,\xi}$, so that $f_\theta$ can extract effective feature information and the training process is more stable. Then, the online network weights are updated by gradient descent on the loss obtained from equation (3), and the weights of the target network are updated by an exponential moving average (EMA) until the two networks converge (EMA is used to update the target network because it keeps the weights of the online network and the target network different, thus avoiding model collapse). At this point, we discard the data augmentation operation $t$, the feature projection $g_\theta$, and the feature prediction $q_\theta$ in the online network and keep $f_\theta$ to represent the network traffic features, using it as the basis for distinguishing the categories of network traffic.
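The contrast between the two loss forms can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the similarity-based loss is written in the form used by the original BYOL paper (2 minus twice the inner product of the normalized vectors), and the mean square error is assumed to average over the vector dimension; the exact form of $iL_{\theta,\xi}$ used in this paper may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_loss(q, z):
    """Similarity-based loss: 2 - 2 * <q_bar, z_bar> (multiply, then sum)."""
    q_bar, z_bar = l2_normalize(q), l2_normalize(z)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(q_bar, z_bar))


def mse_loss(q, z):
    """Assumed MSE form: subtract, square, then average over the dimensions."""
    q_bar, z_bar = l2_normalize(q), l2_normalize(z)
    return sum((a - b) ** 2 for a, b in zip(q_bar, z_bar)) / len(q_bar)


# Hypothetical prediction q_theta(z_theta) and target projection z'_xi.
q = [0.9, 0.1, 0.4, 0.2]
z = [1.0, 0.0, 0.5, 0.1]
print(similarity_loss(q, z), mse_loss(q, z))
```

For unit-norm inputs the summed squared difference equals 2 minus twice the inner product, so under the averaging assumption made here the two values differ by the factor 1/d, where d is the vector dimension.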
# f_q, f_k, f_v: 1 × 1 convolution for q, k and v
# rel_h, rel_w: relative position encodings for height and width
# heads: num of heads for MHSA
for x in loader:  # load a minibatch x with N samples
    b, C, width, height = x.size()  # get batch, channels, width and height of minibatch x
    q = f_q(x).view(b, heads, C // heads, -1)  # apply convolution operation
    k = f_k(x).view(b, heads, C // heads, -1)
    v = f_v(x).view(b, heads, C // heads, -1)
    content_content = bmm(q.permute(0, 1, 3, 2), k)  # get content_content encoding
    content_position = (rel_h + rel_w).view(1, heads, C // heads, -1).permute(0, 1, 3, 2)
    content_position = bmm(content_position, q)  # get content_position encoding
    attention = softmax(content_content + content_position)  # get attention matrix of x
    z = bmm(v, attention.permute(0, 1, 3, 2))
    z = z.view(b, C, width, height)  # get output z
# bmm: batch matrix multiplication
ALGORITHM 3: Pseudo-code of MHSA in a PyTorch-like style.

Detection Procedures for the Improved BYOL Model.
It can be seen from Section 3.1 that the specific steps of the improved BYOL intrusion detection model are as follows:
Step 1: Preprocess the UNSW-NB15 intrusion detection dataset, mainly by one-hot encoding the character-based data and normalizing the data.
Step 2: Construct the intrusion detection model based on improved BYOL training as follows:
(1) Initialize the model parameters and determine the structure of the network model.
(2) Apply two separate data augmentation processes to the UNSW-NB15 dataset.
(3) Feed the two sets of augmented data into the online network and the target network, respectively, and adjust the error of the training process according to the loss obtained from equation (3) until both the online network and the target network converge.
(4) Take out the feature extraction encoder $f_\theta$, obtain the feature representation of the network traffic, and save the $f_\theta$ weights.
Step 3: Test the improved BYOL intrusion detection model by inputting the preprocessed test dataset to $f_\theta$ to obtain the feature representation of each data item, and then inputting the feature representation to the classifier (one linear layer), which yields the classification result of each data item.
The overall flow chart of the improved BYOL intrusion detection model is shown in Figure 5.
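During training, the online network is updated by gradient descent while the target network follows it through an exponential moving average. A minimal sketch of that target update is given below, assuming the weights are flattened into plain lists; the function name `ema_update` and the decay value `tau` are hypothetical, since this section does not state the decay rate used.

```python
def ema_update(target, online, tau=0.99):
    """Return new target weights: tau * target + (1 - tau) * online, element-wise."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]


# Hypothetical flattened weights of the online and target networks.
online_weights = [0.50, -0.20, 0.10]
target_weights = [0.00, 0.00, 0.00]

for step in range(3):
    # In real training the online weights would first change here
    # through a gradient descent step on the loss.
    target_weights = ema_update(target_weights, online_weights, tau=0.9)

print(target_weights)
```

Because the target weights only move a fraction of the way toward the online weights at each step, the two networks stay different during training, which is the property the text relies on to avoid collapse.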

Datasets and Preprocessing
To verify the detection and generalization capabilities of the improved BYOL intrusion detection model, this paper conducts experiments not only on older intrusion detection datasets such as KDD CUP99 [37] and NSL-KDD [38] but also on newer intrusion detection datasets such as UNSW-NB15 [39], CIC IDS2017 [40], and CIDDS_001 [41]. Since UNSW-NB15 contains more comprehensive types of attacks and rich feature information, this paper obtains the feature representation of network traffic by applying the improved BYOL intrusion detection model to UNSW-NB15 and then performs transfer learning on the KDD CUP99, NSL-KDD, CIC IDS2017, and CIDDS_001 datasets to validate the generalization ability of the proposed model. The operating environment of the experimental part is shown in Table 3.

Datasets Description.
The KDD CUP99 dataset was derived from an intrusion detection evaluation project conducted by the U.S. Defense Advanced Research Projects Agency (DARPA) at MIT Lincoln Laboratory in 1998, in which many simulated attacks were added to the network. Network traffic was labeled as normal or anomalous, and the anomaly types were subdivided into 4 major categories (Probe, DoS, U2R, and R2L) for a total of 39 attack types, of which 22 attack types appeared in the training set and another 17 unknown attack types appeared in the test set. Every network traffic sample contains 41 attributes and a category label. Table 4 describes the KDD CUP99 dataset in detail.
The NSL-KDD dataset is an improvement of KDD CUP99 that solves the problems of data redundancy and duplicate data in KDD CUP99. The NSL-KDD dataset contains 4 anomaly types, namely, DoS, Probe, U2R, and R2L, and each intrusion record has 42 dimensions, composed of 9 basic TCP connection features, 13 content features of TCP connections, 9 time-based network traffic statistics features, 10 host-based network traffic statistics features, and a category label. Table 4 details the NSL-KDD dataset.
The UNSW-NB15 dataset was created by the Australian Centre for Cyber Security (ACCS) in 2015. It is a comprehensive network attack traffic dataset. The dataset contains data with two labels, 1 for the attack category and 0 for the normal category. Specifically, it has a total of 9 different categories of attack, and each data flow is described by a total of 49 features, 47 of which are attack-related features; the remaining two are a specific attack category label and an attack/normal category label. The detailed description of the UNSW-NB15 dataset is shown in Table 5.
The CIC IDS2017 dataset is a network traffic dataset collected and made public by the Canadian Institute for Cybersecurity in 2017. It contains five days of network traffic data collected from Monday to Friday, including normal traffic and anomalous traffic due to common attacks. This paper uses Wednesday-workingHours.csv as the intrusion detection dataset, and Table 6 describes the CIC IDS2017 dataset in detail.
The CIDDS_001 dataset is a labeled traffic-based dataset for evaluating anomaly-based intrusion detection systems. The dataset consists of three log files (attack logs, client configuration, and client logs) and traffic data from two servers, each consisting of four 4-week periods of captured traffic data. Table 7 details the CIDDS_001 dataset.

Data Preprocessing.
The original intrusion detection datasets cannot be fed directly into the network model, because the input must conform to the input format of the convolutional neural network. Therefore, the experimental datasets need to be preprocessed in advance by the following steps.

One-Hot Encoding.
Taking the NSL-KDD dataset as an example, the three features protocol, flag, and service are symbolic features and need to be converted into numerical representations; for example, assuming that protocol contains three categories, UDP, TCP, and ICMP, the protocol categories can be processed as [
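As an illustration of one-hot encoding for a symbolic feature, the following sketch encodes a protocol column in plain Python. The helper name `one_hot_encode` is hypothetical, and the category order (UDP, TCP, ICMP) simply follows the order in which the categories are listed in the text; the vector actually assigned to each category in the paper is an assumption.

```python
def one_hot_encode(values, categories):
    """Map each symbolic value to a 0/1 vector with a single 1 at its category index."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return encoded


# Assumed category order for the protocol feature.
protocol_categories = ["UDP", "TCP", "ICMP"]
records = ["TCP", "UDP", "ICMP", "TCP"]
print(one_hot_encode(records, protocol_categories))
```

The same helper would apply to the flag and service features, each with its own category list, which increases the number of input dimensions by the number of categories minus one per feature.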

Data Normalization.
To remove the effect of differing feature magnitudes, keep the gradient pointing toward the minimum, and accelerate convergence, the data must be normalized after feature mapping:

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}},$$

where $x$ is the original data, $x_{min}$ is the minimum value of the same feature, $x_{max}$ is the maximum value of the same feature, and $x_{norm}$ is the result of maximum-minimum normalization.
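Maximum-minimum normalization of a single feature column can be sketched as follows. The function name is hypothetical, and mapping a constant column to zeros is an assumption made here to avoid division by zero; the text does not say how such columns are handled.

```python
def min_max_normalize(column):
    """Rescale one feature column to [0, 1] using its own minimum and maximum."""
    x_min, x_max = min(column), max(column)
    if x_max == x_min:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in column]
    return [(x - x_min) / (x_max - x_min) for x in column]


# Hypothetical values of one numeric traffic feature.
durations = [2, 4, 6, 10]
print(min_max_normalize(durations))
```

In practice the minimum and maximum would be computed per feature on the training data and reused for the test data.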

Evaluation Metrics.
Since network intrusion detection data are complex, the model cannot be evaluated on accuracy alone, so this paper uses accuracy (ACC), precision, detection rate (DR), and F1 score as the evaluation indexes of intrusion detection and verifies the accuracy and stability of the model through a comprehensive comparison of these indexes. These metrics are defined as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{DR} = \frac{TP}{TP + FN},$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{DR}}{\mathrm{Precision} + \mathrm{DR}},$$

where true positive (TP) is the number of attack connection records correctly classified to the attack class, true negative (TN) is the number of normal connection records correctly classified to the normal class, false positive (FP) is the number of normal connection records wrongly classified as attack connection records, and false negative (FN) is the number of attack connection records wrongly classified as normal connection records.
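The four metrics can be computed directly from the confusion counts. The sketch below uses the standard definitions of accuracy, precision, detection rate (recall), and F1 score; the function name and the example counts are hypothetical, and returning 0.0 for an empty denominator is an assumption.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute ACC, precision, detection rate (DR), and F1 from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    dr = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * dr / (precision + dr) if (precision + dr) else 0.0
    return {"acc": acc, "precision": precision, "dr": dr, "f1": f1}


# Hypothetical confusion counts for a binary normal/attack classifier.
metrics = detection_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```

Reporting all four values matters on imbalanced traffic data, where a high accuracy can coexist with a low detection rate.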

Results and Discussion
There are four groups of experiments in this paper, and the experiments mainly verify four aspects: [...] conducted experiments on KDD CUP99, NSL-KDD, CIC IDS2017, and CIDDS_001, so that we can verify the feasibility of using the feature representations extracted by our model to discriminate network traffic. (4) Experiments are conducted on KDD CUP99, NSL-KDD, CIC IDS2017, and CIDDS_001 datasets for transfer learning and they are compared with other state-of-the-art models taken from the recent literature on network intrusion detection to verify that the extracted feature representation using our model has a strong generalization capability.

Class       Training  Testing
Normal      130000    4240
Attacker    10000     2260
Suspicious  430000    7911
Unknown     70000     7932
Victim      8000      907
Total       648000    23241

Effectiveness of the Improved BYOL Self-Supervised Learning.
Firstly, we verify the correctness of the improved encoder architecture in BYOL proposed in Section 3.1.2, the impact of the hyperparameter d on the accuracy of UNSW-NB15 anomaly detection in Section 3.1.3, and the stability of model training after optimizing the loss function of BYOL proposed in Section 3.1.4. Figure 6 shows the impact of different encoder architectures on UNSW-NB15 anomaly detection. When the feature extraction encoder architecture is BoTNet, the accuracy, precision, and other performance indicators of UNSW-NB15 anomaly detection are the highest and the training process is relatively more stable. This further verifies that the attention mechanism introduced in Section 3.1.2 can effectively suppress the features that contribute less to the classification in the intrusion detection data and strengthen the features that contribute more, thus increasing the recognition rate of anomalous network traffic. It also verifies the choice of BoTNet as the encoder architecture in the improved BYOL. The performance metrics of UNSW-NB15 anomaly detection for d ∈ {64, 128, 256, 512, 600} are shown in Figure 7. When d is 512, the accuracy, precision, and other performance indexes of UNSW-NB15 anomaly detection are the highest, so the BoTNet model with d = 512 is used as the feature extraction encoder architecture in the subsequent experiments. The loss curves of the different loss functions and their performance metrics are presented in Figure 8.
From Figure 8(a), we can see that with the optimized loss function proposed in this paper the training loss becomes smoother and the model converges faster than with the loss function proposed in the original BYOL paper. From Figure 8(b), we can see that the accuracy, precision, and other performance indexes obtained by the model for UNSW-NB15 anomaly detection are similar to those obtained with the loss function proposed in the original BYOL paper. This verifies that model training is more stable and converges faster with the optimized BYOL loss function proposed in this paper.

Linear Evaluation.
After training on the UNSW-NB15 dataset using the modified BYOL to obtain the feature representation of network traffic, we verify the effectiveness of this feature representation by linear evaluation on UNSW-NB15 (freezing the weights of the trained BoTNet and training only the last linear layer for network traffic classification). We also train BoTNet and some state-of-the-art models with supervised learning for comparison. The experimental results are shown in Table 8 and Figure 9, where "-" indicates that the result for this indicator is not given in the corresponding paper. As can be seen from Table 8, in most cases our model achieves better detection performance than other state-of-the-art models. The metrics obtained by supervised BoTNet and by linear evaluation are similar: the accuracy of UNSW-NB15 anomaly detection reaches 89.97% using only one linear layer, which is just 4.08% lower than the 94.05% of supervised BoTNet and 17.59% higher than SADE-ELM. In terms of precision, our model is 3.72% and 19.78% higher than VLSTM and SADE-ELM, respectively, and 4.16% and 5.44% lower than MFFSEM and TSIDS, respectively. In terms of detection rate, our model is only 2.54% lower than the highest, VLSTM, 0.11% higher than TSIDS, and 14.82% and 7.84% higher than MFFSEM and SADE-ELM, respectively. The F1 score is a comprehensive evaluation index of precision and detection rate that better reflects the classification ability of the model. In terms of F1 score, our model is 14.7% higher than SADE-ELM, 1.71% and 5.77% higher than VLSTM and MFFSEM, respectively, and only 2.91% lower than the highest, BoTNet. The ROC curve is not only easy to understand but also a more stable indicator of model quality when the numbers of positive and negative samples are imbalanced.
Therefore, the ROC curve can reduce the interference brought by different test sets and measure the performance of the model itself more objectively. Figure 9 depicts the ROC curves; the AUC of the supervised BoTNet is 0.94, which is only 0.6 times higher than that of our model, further verifying the effectiveness of the network traffic feature representation extracted by our model, which can fully and effectively distinguish the categories of network traffic. Combining Table 8 and Figure 9, we conclude that our model can effectively distinguish anomalous network traffic and that the data augmentation operation random_shuffle proposed in this paper enables the improved BYOL intrusion detection model to learn invariant feature representations of network traffic and then classify network traffic correctly.
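The AUC reported with a ROC curve can be computed without plotting, as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below is a plain-Python illustration of that rank-based definition; the function name, labels, and scores are hypothetical and are not results from this paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = attack, 0 = normal) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(roc_auc(labels, scores))
```

Because the value depends only on how positives rank against negatives, it does not change when the ratio of positive to negative samples changes, which is why it is a stable indicator on imbalanced test sets.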

Comparison Experiments of Traditional Deep Learning Algorithms.
To verify the feasibility of distinguishing network traffic with the feature representations extracted by training the improved BYOL on the UNSW-NB15 dataset, we conducted comparative experiments on the KDD CUP99, NSL-KDD, CIC IDS2017, and CIDDS_001 datasets using traditional deep learning models (DNN, CNN, and RNN) and transfer learning with our model. Here, DNN consists of two hidden layers with 128 and 64 neurons, respectively, CNN consists of three convolutional layers with 32, 64, and 128 3 × 3 convolutional kernels, respectively, and RNN consists of one LSTM layer with 70 neurons. The experimental results are given in Tables 9-12, which describe the accuracy, precision, detection rate, and F1 score in detail when DNN, CNN, RNN, and the transfer learning of our model are used for anomaly detection on KDD CUP99, NSL-KDD, CIC IDS2017, and CIDDS_001. Figure 10 shows more visually the differences in the performance metrics of DNN, CNN, RNN, and the transfer learning of our model for anomaly detection on each dataset. From Table 9 and Figure 10(a), it can be seen that all the deep learning models perform well, with all metrics above 99%, because the KDD CUP99 dataset is simpler and contains a large amount of redundant data. As can be seen from Table 10 and Figure 10(b), since the NSL-KDD dataset solves the data redundancy problem of the KDD CUP99 dataset, the performance indexes of each model are lower on the NSL-KDD dataset and the results obtained by our model are slightly worse than those of the other three models. This is mainly because our model classifies more normal traffic as abnormal traffic.
From Table 11 and Figure 10(c), it can be seen that DNN performs better on the CIC IDS2017 dataset, with all performance indicators above 99%, while CNN, RNN, and our model perform relatively poorly, although their performance metrics still exceed 95%. From Table 12 and Figure 10(d), it can be seen that the results obtained by DNN, RNN, and CNN are better, with all performance indexes above 99%, while the performance indexes obtained by our model exceed 98%, which is still sufficient to distinguish abnormal traffic in the CIDDS_001 dataset effectively. In summary, due to the simplicity of the datasets and some problems in the datasets themselves, traditional deep learning algorithms such as DNN, CNN, and RNN as well as our model achieve good anomaly detection results on the KDD CUP99, NSL-KDD, CIC IDS2017, and CIDDS_001 datasets, which also validates that the feature representations extracted by training the improved BYOL on the UNSW-NB15 dataset are fully feasible for distinguishing network traffic.

Transfer Learning.
To verify that the feature representations of network traffic obtained by training the improved BYOL on the UNSW-NB15 dataset have strong generalization capability, we perform transfer learning on the KDD CUP99, NSL-KDD, CIC IDS2017, and CIDDS_001 intrusion detection datasets and compare the results with state-of-the-art models on each dataset. We evaluate our network traffic feature representations on these datasets to determine whether the feature representations learned on UNSW-NB15 are generic and thus useful across intrusion detection domains, or whether they are UNSW-NB15-specific. The experimental results are shown in Tables 13-16 and Figure 11, where "-" indicates that the metric is not given in the corresponding paper.
As can be seen from Table 13 and Figure 11(a), the performance metrics obtained on the KDD CUP99 intrusion detection dataset by transfer learning from the network traffic feature representation learned on UNSW-NB15 are fully comparable to those of supervised BoTNet, and the difference between them is only a fraction of a percentage point, which is due to the powerful feature extraction capability of the improved BYOL intrusion detection model. Compared with other state-of-the-art models on the KDD CUP99 dataset, the results obtained from transfer learning are 1%-6% better than those of the supervised learning SADE-ELM model in terms of performance metrics, with only a 0.67% difference in accuracy compared to the DT-EnSVM model. From Table 14 and Figure 11(d), it can be seen that compared with other state-of-the-art models on the CIDDS_001 dataset, the results obtained from transfer learning differ from the MLIDS model, which has the highest accuracy, by only 2.37% and are 4.97% higher than those of the SADE-ELM model, which has the lowest accuracy. In terms of detection rate, the result of our model obtained from transfer learning is 97.82%, which is 2.04% lower than that of the supervised learning BoTNet and MLIDS, which have the highest detection rate, 0.99% and 0.51% lower than that of DBN and RF, respectively, and 6.45% higher than that of SADE-ELM, which indicates that our model can detect the intrusion data more comprehensively and with fewer missed attacks. As can be seen from Table 15 and Figure 11(b), the accuracy obtained from transfer learning on the NSL-KDD dataset is nearly 7% lower than that of the supervised learning BoTNet because of the increased complexity of the dataset.
However, compared with other state-of-the-art models on the NSL-KDD dataset, the transfer learning results are still better than those of other models in most cases, exceeding the SADE-ELM model in accuracy by nearly 16%, although 3.5% lower in precision. In terms of F1 score, our model achieves 0.9227, which is 7.07% lower than that of [...] Table 16 and Figure 11(c). The results obtained from transfer learning are slightly lower than those of other models in terms of accuracy, precision, detection rate, and F1 score. In terms of F1 score, our model is 4.29%, 3.26%, and 2.95% worse than IGAN-IDS, DBN, and LSTM-RNN, respectively. In terms of precision, our model is 4.72% and 4.6% worse than DBN and LSTM-RNN, respectively. In terms of detection rate, our model is 3.75%, 1.89%, and 0.53% worse than NB-SVM, DBN, and LSTM-RNN, respectively. In terms of accuracy, our model is 3.09%, 2.22%, and 1.17% worse than IGAN-IDS, NB-SVM, and DBN, respectively, which indicates that the model is slightly weak in feature generalization on the CIC IDS2017 dataset and that its generalization capability can be improved further. In general, each algorithm achieves a high score for each performance index in the intrusion detection of the KDD CUP99, NSL-KDD, CIC IDS2017, and CIDDS_001 datasets, which indicates that all types of algorithms can effectively detect network intrusion data, but the results obtained from transfer learning are significantly better than those of other models in most cases, which shows that the network traffic feature representation extracted by the improved BYOL has powerful network traffic discrimination ability.
To better visualize the sample distribution of the intrusion detection datasets after processing with the improved BYOL intrusion detection model, we randomly select 5,000 records from each of the KDD CUP99, NSL-KDD, CIC IDS2017, and CIDDS_001 datasets, both without processing and after processing. Then, we use the t-SNE algorithm [56] to reduce the dimension of these records and visualize them. Figures 12(a), 13(a), 14(a), and 15(a) show the visualized images of 5,000 unprocessed records of the KDD CUP99, NSL-KDD, CIC IDS2017, and CIDDS_001 datasets, respectively. As can be seen from the figures, the data in all datasets are not linearly separable, and the NSL-KDD and CIDDS_001 datasets are significantly more complex and difficult to distinguish than the KDD CUP99 and CIC IDS2017 datasets, as reflected by the experimental results obtained from transfer learning. Figures 12(b), 13(b), 14(b), and 15(b) are the visualized images of 5,000 records after processing with the improved BYOL intrusion detection model. Comparing the visualized images of the processed records with those of the unprocessed records shows that the samples of different categories show an aggregation trend in the feature space and can be almost separated linearly, which shows that the features obtained on the UNSW-NB15 dataset for network traffic representation have a strong generalization ability and can effectively distinguish various types of anomalous network traffic.

Conclusions and Future Work
In this paper, we propose a new data augmentation strategy for intrusion detection data and an intrusion detection model based on label-free self-supervised learning. The improved BYOL self-supervised learning model is used to extract network traffic feature representations; to avoid the poor generalization ability caused by fusing too many invalid features, we introduce a multihead attention mechanism to suppress the features in the intrusion detection data that contribute less to the classification and strengthen the features that contribute more. Training and testing on the intrusion detection benchmark datasets KDD CUP99, NSL-KDD, UNSW-NB15, CIC IDS2017, and CIDDS_001 show that the proposed model has a strong ability to identify network traffic and a better generalization ability. In addition, the proposed model achieved detection accuracies of 99.25%, 92.67%, 96.70%, and 97.55%, respectively, on the test datasets in the transfer learning experiments, which is comparable to the results obtained by supervised learning. Compared with the state-of-the-art models of recent years, the improved BYOL intrusion detection model achieves superior detection results on intrusion detection datasets. However, there are still some gaps between the improved BYOL intrusion detection model and supervised learning methods on datasets with complex data distribution, various attack types, and data imbalance. Since the Mahalanobis distance is not affected by feature magnitude and takes into account the connection between features while excluding the interference of correlation between features, we will consider using the Mahalanobis distance to calculate the similarity of the output features of the two networks to reduce the gap between our model and the supervised learning methods.
In addition, because the encoder network architecture for feature extraction is somewhat complex and has more parameters than traditional deep learning models such as DNN, CNN, and RNN, the training time of our model is a little longer and the model cannot perform detection in real time on large datasets. We will consider improving the model neurons and calculation methods to simplify the network structure and improve the efficiency of the model.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper. Security and Communication Networks 21