Deep Autoencoders and Feedforward Networks Based on a New Regularization for Anomaly Detection

Anomaly detection is a problem with roots dating back over 30 years. The NSL-KDD dataset has become the convention for testing and comparing new or improved models in this domain. In the field of network intrusion detection, the UNSW-NB15 dataset has recently gained significant attention over the NSL-KDD because it contains more modern attacks. In the present paper, we outline two cutting-edge architectures that push the boundaries of model accuracy for these datasets, both framed in the context of anomaly detection and intrusion classification. We summarize training methodologies, hyperparameters, regularization, and other aspects of model architecture. Moreover, we utilize the standard deviation of weight values to design a new regularization technique. Then, we embed it in both models and report the models' performance. Finally, we detail potential improvements aimed at increasing model accuracy.


Introduction
The provision of an effective and robust network intrusion detection system (NIDS) remains one of the key challenges of network security. Irrespective of technological advances in the field of NIDS, many potential solutions operate by utilizing signature-based and less-capable methods instead of anomaly detection techniques. Certain factors are linked to this hesitancy to switch, including the high cost associated with the high rate of false alarms, obstacles in obtaining valid training data, and training data longevity. However, the reliability of conventional techniques has proven to be limited, which subsequently leads to inaccurate and inefficient detection. In this regard, the challenge lies in creating a widely accepted anomaly detection technique capable of overcoming the limitations induced by ongoing changes within modern networks. Efficient, rapid, and effective techniques are required to deal with these issues. As such, it is important to improve effectiveness and accuracy in an in-depth manner. The analysis performed by an NIDS must be contextually aware, and it should be detailed enough to move toward high-level observation rather than abstract representation. Changes to behavioral attributes must be easily comprehensible for a network's specific elements, for example, protocols, operating system versions, individual users, and the diverse types of data and protocols available in modern advanced networks.
This introduces high levels of difficulty and complication, representing the most crucial challenge in tracing the deviation between abnormal and normal behaviors. Due to such difficulties, it remains difficult to establish an accurate baseline, which widens the scope for probable exploitation or zero-day attacks. Conventional machine learning (ML) methods also suffer from limitations, such as data preprocessing that requires expert knowledge (e.g., finding important and relevant features in the data) and the need for expert personnel to carry out the task. As such, this not only requires human expertise but is also an error-prone task [2]. Likewise, a large amount of training data is required to ensure reliable results, which is challenging in such a diverse and dynamic environment.
Due to these limitations, deep learning (DL) algorithms have the highest priority in modern research. DL is an advanced field of ML that can address these limitations and resolve problems related to shallow learning. Initially, researchers demonstrated that the layer-wise learning features of DL algorithms perform either better than or equivalently to shallow learning [3]. This process helps to analyze network data deeply, and it can efficiently identify anomalies in network traffic.
One of the main aspects of building a deep learning model is regularization. Regularization is an essential component of supervised learning; the most widely used regularization techniques are L1 and L2. The application of the penalty term is the key difference between these techniques. L1 penalizes the loss function by adding the absolute value of the coefficient magnitudes, making it suitable for feature selection or reduction, while L2 penalizes the loss function by adding the squared coefficient magnitudes, so that unimportant features receive smaller weights [3]. The main drawback of these regularizers is their dependence on individual model parameters: the relationships among weight matrix entries are ignored, and only single weight values are controlled.
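As a concrete illustration, the two penalty terms can be computed as follows (a minimal sketch; the weight matrix and λ value below are arbitrary examples, not values used in this paper):

```python
import numpy as np

def l1_penalty(w, lam):
    # L1: lambda * sum of absolute weight values (encourages sparsity)
    return lam * np.abs(w).sum()

def l2_penalty(w, lam):
    # L2: lambda * sum of squared weight values (shrinks all weights)
    return lam * (w ** 2).sum()

w = np.array([[0.5, -1.0], [0.0, 2.0]])
print(l1_penalty(w, 0.01))  # 0.01 * (0.5 + 1.0 + 0.0 + 2.0) = 0.035
print(l2_penalty(w, 0.01))  # 0.01 * (0.25 + 1.0 + 0.0 + 4.0) = 0.0525
```

Note that both penalties treat every entry of `w` independently; neither term changes if the entries are shuffled across rows, which is the drawback discussed above.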
To address this drawback, we design and implement a new regularization technique as a substitute for the L1 and L2 regularizers. The new regularizer considers the dispersion of the weight values, known as the standard deviation, unlike the L1 and L2 regularizers, which control only individual weight values without considering the relationships among weight matrix entries. The merit of the proposed method lies in the adoption of a new architecture for abnormal behavior detection systems.
In this paper, we present two efficient models. The first model is based on a feedforward neural network (FNN), and the second model is based on a deep variational autoencoder (VAE). To reduce the error on the given training set and avoid overfitting, we introduce a new regularization technique based on taking the standard deviation of the weight matrix to obtain the regularization term. The motivation behind this is to create an adaptive form of weight decay. We then embed it in both models to study their performance. We also trained our models in both semisupervised and supervised framings. Then, we conducted an in-depth analysis of the detection efficiency using different evaluation metrics. Finally, we compare our results with other well-known existing ML techniques.
Our major contributions to the existing literature are as follows: (1) We present the design and implementation of two models based on VAE and FNN using a new regularization algorithm. Furthermore, we present the performance of both models on different benchmark datasets. (2) We analyze and compare the performances of the proposed models with other ML methods using different evaluation measures such as accuracy, true positive rate (TPR), and F-measure. The experimental results show the effectiveness of the proposed models for anomaly detection. The rest of this paper is organized as follows. In Section 2, we briefly describe the concepts of the feedforward neural network and the variational autoencoder. Section 3 provides the related work. In Section 4, we present the datasets used in this work. Section 5 discusses the system design and methodology. In Section 6, we give the experimental results and compare our models with other existing methods. Finally, Section 7 concludes the paper.

Feedforward Neural Networks.
FNNs are composed of various functions in a graph-like data structure that describes the connectivity among the functions. The composition of functions can be denoted in the following manner.
Suppose that we have three different functions f1, f2, and f3, and we let f(x) be the composite of all these functions, denoted as f(x) = f3(f2(f1(x))). Generally, this composition describes the structure of neural networks. In neural network terminology, f1 is the first layer, f2 is the second layer, and so on. The number of functions in this composition is the depth of the neural network model. The final, outermost function is known as the output layer.
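This composition can be sketched in a few lines (the layer functions and weight matrices below are hypothetical toy choices, purely for illustration):

```python
import numpy as np

def f1(x):
    # first layer: a fixed linear map followed by a ReLU nonlinearity
    W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
    return np.maximum(W1 @ x, 0.0)

def f2(h):
    # second layer: another linear map with ReLU
    W2 = np.array([[2.0, 0.0], [0.0, 2.0]])
    return np.maximum(W2 @ h, 0.0)

def f3(h):
    # output layer: sum of the hidden units
    return h.sum()

def f(x):
    # a depth-3 network expressed as the composition f3(f2(f1(x)))
    return f3(f2(f1(x)))

x = np.array([1.0, 2.0])
print(f(x))  # 3.0
```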
During the training phase of a neural network model, we estimate a function f*(x) to match the original unknown function f(x). The training data consist of approximate examples with target outputs y ≈ f*(x). The training examples describe the nature of the function to be estimated and specify the behavior of the output layer for each data point x. In contrast, the training data do not describe the behavior of the hidden layers; this behavior is decided by the learning algorithm in order to produce the desired output. The learning algorithm estimates the behavior of the hidden layers on the training data to produce optimal results. This is because the training data have a hidden relationship with these layers that ultimately defines them, and the learning algorithm must locate this relationship, which explains why they are called hidden layers.

Variational Autoencoders.
The variational autoencoder (VAE) [4] is a generative model that provides a probabilistic manner of describing an observation in latent space. It is one of the most consistent methods for unsupervised learning, and many successful cases have been reported in image processing [4, 5], speech recognition [6], and text generation [7].
VAEs represent a very promising method, as they integrate variational inference with the use of neural networks as function approximators, searching for the approximate posterior distribution in a way that can be performed with stochastic gradient descent (SGD) [8]. Moreover, they differ from state-of-the-art autoencoders (AEs), denoising autoencoders (DAEs), and sparse autoencoders (SAEs) in that they impose a distribution over the data and hyperparameters.
As a result, VAEs can create new data once the model has been trained by sampling from this distribution. This is achieved by creating a parametric description of the data that can be chosen to have lower feature dimensionality than the data itself. Therefore, this description can be considered a compressed characterization of the dataset. In the domain of anomaly detection, the VAE is a natural fit due to its inherent probabilistic nature.
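The sampling step at the heart of a VAE is usually implemented with the reparameterization trick, which the following sketch illustrates (the latent dimensionality and parameter values are arbitrary examples, not those of the models in this paper):

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    # Writing the sample this way keeps it differentiable with respect
    # to mu and log_var, which is what allows training by SGD.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.zeros(4)        # mean of the approximate posterior
log_var = np.zeros(4)   # log-variance (sigma = 1 here)
z = sample_latent(mu, log_var, rng)
print(z.shape)  # (4,)
```

New data can then be generated by decoding samples of z drawn from the prior.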
A model proposed in [24] depends on the averaged one-dependence estimator (AODE) technique. This model is used for multiclass classification and achieved a high FPR of 6.57% and an accuracy of 83.47% on the UNSW-NB15 dataset. Using the same dataset, a random forest (RF) classifier was used by Janarthanan and Zargari in [25]. They used five selected features to classify and detect intrusion attacks, and their method achieved an accuracy of 81.6175% and an FAR of 4.4%.
An emerging branch of ML that has received significant attention is DL. Recently, several studies have extensively employed DL in the field of network intrusion detection, which has brought promising prospects to this realm. In unsupervised framings, the DL methods used for feature learning in network anomaly detection include restricted Boltzmann machines (RBMs), deep neural networks (DNNs), deep belief networks (DBNs), and autoencoders. Erfani et al. [26] used numerous benchmark datasets to test their model, which was based on the combination of DBNs with a linear one-class SVM. Likewise, to learn compressed features from specific features that are not in the packet payloads, Fiore et al. [27] used a discriminative RBM (DRBM) approach. The binary classification of traffic into normal and abnormal behaviors was carried out by feeding the compressed features into a softmax classifier. An anomaly detection model based on DNNs was proposed by Javaid et al. [28]. Based on the findings of their study, they reported that DL is more effective for flow-based anomaly detection in software-defined networks (SDNs). A model based on self-taught learning (STL) was proposed by Tang et al. [29] for network intrusion detection. In their experiments, they used the NSL-KDD dataset to demonstrate the superiority of DL over different approaches in terms of accuracy and performance. To recognize network traffic from raw data, Wang [30] proposed a DL approach based on a stacked autoencoder. Their results demonstrated that this method accomplished high performance. Furthermore, a DL approach based on recurrent neural networks (RNNs) was proposed by Yin et al. [31] for intrusion detection.
The authors applied their method to the NSL-KDD dataset to measure its effectiveness. Consequently, they demonstrated the effectiveness of this DL method over traditional ML approaches for intrusion detection.
A DL method based on a DBN of RBMs with four hidden layers to reduce the feature sizes was proposed by Alrwashdeh and Purdy [32]. During the fine-tuning phase, they updated the weights of the DBNs, and logistic regression (LR) was used to perform the classification task. They tested their proposed algorithm on the KDD Cup 1999 dataset and achieved an accuracy of 97.9% with an FPR of 0.5%. A DL-based nonsymmetric deep autoencoder (NDAE) approach was used by Shone et al. [33]. In that study, the KDD Cup 1999 and NSL-KDD datasets were used for testing in combination with an RF classifier, achieving accuracies of 97.85% and 85.42%, respectively. However, the FPR values were an alarming 2.15% and 14.58% for the KDD Cup and NSL-KDD datasets, respectively. Therefore, this method cannot be used in real-time scenarios for attack detection due to inherent deficiencies and general ineffectiveness.
A novel method based on the combination of hybrid feature selection and a two-stage metaclassifier for intrusion detection was proposed in [34]. The authors used NSL-KDD and UNSW-NB15 to evaluate their model's performance. They claimed that their proposed model achieved an accuracy of 91.27% and an FPR of 8.90%. The authors in [35] proposed an improved anomaly-based intrusion detection system using a gradient boosted machine (GBM). They used three datasets, NSL-KDD, UNSW-NB15, and GPRS, to evaluate their model using either the holdout method or tenfold cross-validation. In addition, they reported that their model yielded higher detection performance than other IDS models. An efficient DL approach based on an STL framework was proposed by Al-Qatf et al. [36]. Furthermore, a stacked autoencoder based on a two-phase DL model with a softmax classifier was proposed by Khan et al. [37]. Based on the Apache Spark framework, a DBN for feature selection and an SVM-based ensemble approach were used in [38]. This method is efficient enough to provide satisfactory detection results in large-scale networks.

NSL-KDD Dataset.
The NSL-KDD dataset essentially shares an identical structure with the older "KDD Cup '99" dataset and has five categories, comprising normal traffic and four types of attacks, as well as fields for 41 features (see Table 1).
This dataset was produced by Tavallaee et al. with the intention of addressing some of the inherent problems in KDD 99 [39]. However, the NSL-KDD dataset continues to suffer from issues and cannot be considered a perfect representation of real networks [40]. Nevertheless, many studies in the realm of anomaly detection still use this dataset. We believe that the NSL-KDD dataset remains a valid benchmark because it contains reasonable numbers of test and train records, which makes running experiments on the complete set affordable (even without the need to randomly select a small portion). Thus, the evaluation results of different research works will be consistent and comparable.

UNSW-NB15 Dataset. UNSW-NB15 is a recent and complex dataset collected by the Cyber Security Research
Group (CSRG) at the Australian Centre for Cyber Security (ACCS) [41].
Initially, the amount of data was large (approximately 100 GB); it was collected using the TCP dump and Ixia PerfectStorm tools and consists of normal traffic and various modern attack types. The data were gathered over two simulation periods of 15 and 16 hours, respectively. The total number of instances in this dataset is approximately 2.5 million, with 42 attributes extracted using Argus, Bro-IDS, and other advanced algorithms.
There are five feature categories in this dataset: basic features, time features, flow features, content features, and additional derived features. Moreover, there are two labels apart from the features: attack_cat, which gives the attack category (or normal), and a binary label of 0 or 1, representing normal or abnormal flow, respectively. The UNSW-NB15 dataset contains a total of nine types of cyber attacks: shellcode, analysis, backdoor, exploits, worms, reconnaissance, generic, DoS, and fuzzers [15].

Data Preprocessing.
As is the case with the majority of ML problems, a significant amount of data preprocessing was needed to successfully learn data representations for both the UNSW-NB15 and NSL-KDD datasets. Both datasets are quite large for the problem and are split into a training set and a test set. For columns containing string values, a label encoder was applied to transform the data into unique integer representations.
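The string-to-integer mapping can be sketched as follows (a minimal stand-in for a library label encoder; the protocol column values are hypothetical examples):

```python
def label_encode(column):
    # Map each distinct string to a unique integer (sorted for determinism),
    # mirroring what a label encoder does for a categorical column.
    mapping = {v: i for i, v in enumerate(sorted(set(column)))}
    return [mapping[v] for v in column], mapping

protocols = ["tcp", "udp", "tcp", "icmp"]
encoded, mapping = label_encode(protocols)
print(encoded)  # [1, 2, 1, 0]  ('icmp' < 'tcp' < 'udp' after sorting)
print(mapping)  # {'icmp': 0, 'tcp': 1, 'udp': 2}
```

In practice, the same mapping fitted on the training set must be reused on the test set so that identical strings receive identical integers.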

Model Architecture.
For the sake of exploration, we pursue both the semisupervised and supervised framings of these datasets. The two methods utilized are the FNN and the deep VAE.
The FNNs are applied in the supervised context and modelled as multiclass classification, with each type of attack being a different class. The hidden and input layers use the swish activation function (1) [42] in a 5-layer topology with a 512/1-256/4 unit distribution. The model gradients are updated via Nesterov-based Adam optimization (NADAM) [43] with a learning rate of 0.0091, β1 = 0.9, and β2 = 0.999. A large batch size is selected for training efficiency and to smooth out gradient updates, though recent publications have shown exceptional convergence (at least in certain problem domains) with online and local training scenarios [44]. Each layer is initialized randomly, meaning the weights are initialized by sampling from a Gaussian distribution. Furthermore, L2 weight regularization is used to further reduce inefficient learning (2), and each layer uses dropout with a rate of 0.5. A softmax activation is used on the output layer, with as many units as there are classes. The loss function is categorical cross entropy, and the model is trained for 75 epochs.
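The swish activation referenced in (1) can be sketched numerically as:

```python
import numpy as np

def swish(x):
    # swish(x) = x * sigmoid(x): smooth and non-monotonic,
    # unlike ReLU, which hard-clips negative inputs to zero.
    return x / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(swish(x))  # approximately [-0.238, 0.0, 1.762]
```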
On the other hand, when training in the semisupervised context, we use an autoencoder. The theory behind autoencoders is fairly straightforward given previous knowledge of DL algorithms. Autoencoders can also be applied to a variety of other interesting problems, among them denoising image data, dimensionality reduction, and even compression [45]. For our problem, the autoencoder learns a compressed representation of the data. Since we are operating in the domain of anomaly detection, we train the model on normal data only (therein lies the difference from binary classification). The model then predicts whether or not an arbitrary input fits the learned representation. The autoencoder is trained on the entire feature set. The encoder and decoder are composed of four layers with an encoding dimension of 256 units. The number of units is halved at each subsequent layer in the encoder, with the inverse being true for the decoder. A total of 41 input and output units are used to learn the data representation. Compared to the classification network, the autoencoder uses ReLU activation (3). Initialization occurs in the same manner as in the first network. We optimize via NADAM (Algorithm 1) and substitute mean squared error with mean squared logarithmic error. The autoencoder uses both L1 and L2 regularization methods (see (4)). The mathematical formulation of the new regularizer is given in equations (5) and (6).

New Regularization
The proposed regularizer penalizes the dispersion of the weight values in each row of a weight matrix:

Ω(W) = λ Σ_{i=1}^{n} σ(w_i), (5)

where n is the number of rows in the weight matrix and w_i is the i-th row of the weight matrix. σ denotes the standard deviation of the weight values, given by

σ(w_i) = sqrt((1/k) Σ_{j=1}^{k} (w_{ij} − w̄_i)²), (6)

where w̄_i is the mean of the i-th row. The parameter λ is used to control the values of the weight matrix, and k denotes the size of the weight vector; in particular, it is the number of columns in the particular weight matrix (k depends on the number of features in the dataset). w_{ij} are the values of the model's weights.
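Under this formulation, the penalty can be sketched as follows (a minimal NumPy version of equations (5) and (6); in a framework such as Keras it would instead be registered as a custom kernel regularizer, and the example matrix and λ are arbitrary):

```python
import numpy as np

def std_penalty(W, lam):
    # Sum the standard deviation of each row of the weight matrix,
    # penalizing dispersion among a row's entries (eqs. (5)-(6))
    # rather than the individual magnitudes that L1/L2 penalize.
    return lam * np.sum(np.std(W, axis=1))

W = np.array([[1.0, 1.0],    # zero dispersion -> contributes 0
              [0.0, 2.0]])   # row std = 1.0
print(std_penalty(W, 0.01))  # 0.01
```

Note that a row of large but identical weights incurs no penalty, whereas L2 would penalize it heavily; only the spread within each row is controlled.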

Evaluation Protocols.
To evaluate our models, the performance of all classifiers is measured in terms of accuracy, false positive rate (FPR), true positive rate (TPR), precision, and F-Score, calculated using equations (7)-(11), respectively.
Accuracy = (TP + TN) / (TP + TN + FP + FN), (7)

FPR = FP / (FP + TN), (8)

TPR = TP / (TP + FN), (9)

Precision = TP / (TP + FP), (10)

F-Score = 2 × (Precision × TPR) / (Precision + TPR), (11)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
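These metrics can be computed directly from the four confusion counts (the counts below are hypothetical, chosen only to exercise the formulas):

```python
def metrics(tp, tn, fp, fn):
    # Standard confusion-matrix metrics, matching equations (7)-(11)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn)               # false positive rate
    tpr = tp / (tp + fn)               # true positive rate (recall)
    precision = tp / (tp + fp)
    f_score = 2 * precision * tpr / (precision + tpr)
    return accuracy, fpr, tpr, precision, f_score

acc, fpr, tpr, pre, f1 = metrics(tp=90, tn=80, fp=20, fn=10)
print(acc)  # 0.85
print(fpr)  # 0.2
print(tpr)  # 0.9
```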

Results
To carry out the simulations, a machine with a Core i7 processor, 16 GB of RAM, and a 64-bit Linux operating system was used. The implementation was done in Python 2.7 with Keras, which uses the end-to-end machine learning platform TensorFlow as its backend. The training and testing times for both models on each dataset are shown in Table 2.
Due to various optimizations and cutting-edge methods, both models achieved results equal to, near, or better than those of previous state-of-the-art methods for this problem. Each model is trained on both the NSL-KDD and UNSW-NB15 datasets using the train-test split method (75% for training and 25% for validation) and explicitly tested on the test sets provided by the datasets, KDDTest+ and UNSW_NB15_testing-set. We report the results in terms of accuracy, FPR, TPR, precision, and F-Score, which give evidence that the intrusion detection performance of the proposed methods is more than satisfactory. The high F-Score values indicate that the precision of both models is sufficient to detect anomalies in network traffic accurately and efficiently. For our first model, the classifier converges to above 95% accuracy on the validation data after approximately 50 epochs and starts to diverge after approximately 75. Even given the extremely accurate model, there remains room for further improvement.

ALGORITHM 1: Nesterov-accelerated adaptive moment estimation.
The autoencoder converged to the same accuracy on the validation set (after approximately 70 epochs) and diverged shortly thereafter. The autoencoders with embedded regularizers oscillated slightly more chaotically during training, likely due to the difference in both the problem and the feature set compared to the classifier (along with different hyperparameters, regularizers, and activation functions).
Likewise, after embedding the new regularizer in both aforementioned models, we observed up to a 1.7% improvement in average validation accuracy. The feedforward model was trained for 100 epochs, and the average testing accuracies achieved with it are 96.7% and 94.7% for the NSL-KDD and UNSW-NB15 datasets, respectively. The progressions of training and validation accuracy for the FNN models on both datasets are shown in Figures 1 and 2, respectively. The accuracy with the new regularizer is significantly better than with other regularizers, as shown in Tables 3 and 4.
Similarly, for the VAE models with the embedded regularizer, we achieved average testing accuracies of 97.01% and 93.3% for the NSL-KDD and UNSW-NB15 datasets, respectively. The progressions of training and validation accuracy for the VAE models on both datasets are shown in Figures 3 and 4, respectively. The corresponding average accuracies, along with other performance measures such as FPR, TPR, precision, and F-Score, are also computed and shown in Tables 3 and 4.
From Figures 1-4, it is evident that the accuracy converges to some extent after the 80th epoch but still oscillates between certain values. Upon investigation, we found that the provided test data are well shuffled and balanced, so the instances of each class are distributed approximately equally. This gives us two advantages. The first is that the model does not overfit (in which case the validation accuracy would be lower than the training accuracy). The second is that the data generalize well. If we examine the FPR and TPR in Tables 3 and 4, the FPR is high, while the TPR is good. The reason is that, as equation (8) shows, the FPR incorporates the true negatives (TNs): the FPR is inversely proportional to the TNs, so when TN is high, the FPR is low, and vice versa. This means that the model counts some samples of other classes into the class it is training for. Similarly, the model falsely rejects some samples of the class it is training for; in other words, by equation (9), the TPR is inversely proportional to the FNs.

Conclusion
We introduced the design and implementation of two models employing a new regularization technique that meets or exceeds previous bests on the NSL-KDD and UNSW-NB15 datasets in both the classification and anomaly detection domains. The new models are tested on several datasets available in the network security domain (i.e., NSL-KDD and UNSW-NB15).
The simulation results show that the performance of the new models is better than that of other methods. However, there are many ways in which one could alter model optimization to further increase test accuracy. Firstly, more could be done with data preprocessing and feature selection. For example, one could implement PCA to extract principal components, use a low-variance filter, or find a different method to select important features. Additionally, with domain knowledge, one could engineer more features that may increase model effectiveness. Regarding the models, a key area could be hyperparameter tuning, either manually or via an algorithm. Furthermore, experimenting with other regularizers, optimizers, and activation functions could also be worth investigating. Overall, our proposed models performed well and have yielded satisfactory performance measures compared to existing state-of-the-art methods. For comparison purposes, Tables 3 and 4 present the performances of prior existing methods and the proposed models from this paper.

Data Availability
The datasets used to support the findings of this study are included within the article.

Ethical Approval
This article does not contain any studies with human participants performed by any of the authors.

Conflicts of Interest
The authors declare that there are no conflicts of interest.