A Mean Convolutional Layer for Intrusion Detection System

The significant development of Internet applications over the past 10 years has made it increasingly necessary to secure information networks. An intrusion detection system is a fundamental network infrastructure defense that must adapt to the ever-evolving threat landscape and identify new attacks with a low false-alarm rate. Researchers have developed several supervised and unsupervised methods from the data mining and machine learning disciplines so that anomalies can be detected reliably. As a branch of machine learning, deep learning uses a neuron-like structure to learn tasks. One successful deep learning technique is the convolutional neural network (CNN); however, it is presently not well suited to detecting anomalies. CNNs readily identify expected content within the input flow, whereas abnormalities differ only slightly from normal content. This suggests that a dedicated method is required for identifying such minor changes. A CNN is expected to learn the features that characterize the content of an image (flow) rather than variations that are unrelated to the content. Hence, this study proposes a new type of CNN architecture, the mean convolutional layer (CNN-MCL), developed to learn the content features of anomalies and then identify the particular abnormality. The proposed CNN-MCL helps in designing a robust network intrusion detection system that includes an innovative form of convolutional layer able to learn low-level abnormal characteristics. Evaluating the proposed model on the CICIDS2017 dataset gave favorable results for real-world application, detecting anomalies with high accuracy and a low false-alarm rate compared with other leading models.


Introduction
Worldwide economic and business advancement is closely tied with Internet and enterprise networks. Organizations have stronger ties with computer networks than before within everyday operations and the ways that customers share and store personal information [1].
This rate of progress is closely tied to the complicated management of these networks, where network administrators are responsible for issues such as flash crowds, network element failures, configuration mistakes, malicious attacks, and more. By preventing or quickly fixing problems, administrators protect the quality of connections for all involved and prevent end users from having their services disrupted [2,3].
Governments and private organizations require solutions that offer stable performance in protecting the information assets they hold from unlawful or unwanted access and in preventing and detecting intrusions [4]. A network intrusion detection system (NIDS) monitors and categorizes network flows based on whether they are normal behavior that occurs often in a network or activity that could endanger the safety of information systems.
Denning [5] suggests building an intrusion detection system (IDS) which would use a number of artificial intelligence (AI) approaches to detect abnormal movements and potential intrusions. This approach established a new branch of intrusion detection systems, developed using learning algorithms. In the last 30 years, machine learning (ML) approaches have been used as a traditional way of creating a network anomaly detection model.
A category of machine learning algorithms known as deep learning has become more widely used in classification and pattern recognition. Deep learning uses information processing layers within a hierarchical architecture to create the deep model. Deep learning differs from conventional machine learning in that it can find the ideal features within raw data via a series of nonlinear transformations, wherein every transformation achieves a greater complexity [6]. Using deep learning to counter information security problems has not been investigated for very long, and so there are few research studies on the topic, none of which use deep learning techniques to their full potential [7].
Two major gaps were seen during the literature review of anomaly detection problems. Firstly, the methods used in anomaly detection have a high false-alarm rate [8][9][10]. Secondly, the same datasets were used for both the training and the testing of models, employing cross-validation processes. Most modern studies adopt this methodology, and the reported detection rates are extremely high. This is seen in the work of Kim et al. [11], who used a four-layer DNN with 100 units for intrusion detection on the KDD-CUP99 dataset, showing 99% accuracy. However, these practices are not considered reliable for anomaly detection problems since models can be overfitted to perform at these extreme levels [4].
The key aim of this research study is to cover the gaps described above through the creation and implementation of anomaly detection models using cutting-edge deep learning models and to present an evaluation of these through standardized classification quality metrics. Of the varied practices existing in deep learning, CNN has had exceptional performance in computer vision, including face and object recognition. CNNs are a category of standard neural networks that employ convolution and pooling layers rather than the fully connected hidden layers seen in traditional neural networks [12]. This paper puts forward a newer type of CNN that learns the content features of anomalies, which can be beneficial when used in intrusion detection. Furthermore, deep learning anomaly detection models have been compared with popular classification systems such as support vector machine (SVM), K-nearest neighbor (KNN), decision tree, random forest, adaptive boosting classifier, and gradient boosting. In order to further cover the existing research gaps, some models were trained on the training dataset without any exposure to the test dataset throughout this process. Following this, the models were evaluated on the testing datasets, allowing for a more accurate and fair evaluation of the models' advantages through the use of data instances that were unseen before testing. This study puts forward an innovative CNN architecture type, created to learn the content features of anomalies and subsequently pinpoint the specific abnormality. The suggested CNN is used to design a robust NIDS, which involves a new type of convolutional layer able to learn low-level abnormal characteristics. The suggested CNN-MCL produces a lower false-alarm rate than the original CNN.
In this paper, Section 2 provides background information on the study topic, while Section 3 gives an in-depth description of the suggested CNN-MCL layer. Section 4 presents the network architecture for the CNN-MCL model. Section 5 reports the experimental performance of the proposed algorithm on CICIDS2017 [13], while Section 6 concludes the study.

Background
In this section, related works and motivation, dataset description, and conventional CNN are presented.

Related Works and Motivation.
Because of its efficiency in finding good solutions within a finite amount of data, deep learning has attracted significant research attention. Javaid et al. [14] use a deep neural network for flow-based anomaly detection, and their results show that deep learning can be used for anomaly detection in software-defined networks (SDNs). Tang et al. [15] suggest a deep learning-based approach involving the use of self-taught learning (STL) on the benchmark NSL-KDD [16] dataset, in the context of a network intrusion detection system. In [17], an RNN-based model is implemented for the purposes of classification instead of pretraining. In addition, the NSL-KDD dataset is used with independent training and testing sets in order to appraise performance in pinpointing network intrusions in both binary and multiclass classification. The results are then compared with J48, ANN, RF, SVM, and other machine learning methods suggested in earlier research. The study of Zhao et al. [18] offers a cutting-edge survey of deep learning applications in the context of machine health monitoring. Experiments were conducted to compare conventional machine learning methods with four widely employed deep learning methods (autoencoders, restricted Boltzmann machine (RBM), CNN, and RNN).
This study found that deep learning methods provide greater accuracy than their conventional counterparts. In the work of Alrawashdeh and Purdy [19], it is suggested that an RBM with a single hidden layer can undertake unsupervised feature reduction. The weights are then transferred to another RBM in order to create a deep belief network (DBN), and the pretrained weights are moved into a fine-tuning layer made up of a logistic regression classifier (trained with 10 epochs) with multiclass SoftMax. In the study of Kim et al. [11], a DNN using 100 hidden units is put forward, in conjunction with the rectified linear unit (ReLU) activation function and the Adam optimizer. The study by Cordero et al. [20] suggested another unsupervised method to train models on normal network flows. RNN, autoencoder, and dropout concepts of deep learning are employed to achieve this. The performance of these suggested methods is not fully reported. Along the same lines, Tang et al. [15] suggest a way of monitoring network flow data. In addition, Kang and Kang [21] put forward the notion of using an unsupervised DBN to train certain features to initialize the DNN, offering greater classification performance, even though specific details of the approach are not provided. Their evaluation shows superior outcomes when it comes to classification error detection. In the study by Bontemps et al. [22], a real-time collective anomaly detection model using neural network learning and feature operating was described. Here, an LSTM-RNN is trained using normal time series data prior to making a live prediction for every time step. Furthermore, Ma et al. [23] used spectral clustering (SC) to find the key properties of network traffic, and a multilayer DNN was used to pinpoint attack types. The findings indicate that the SC-DNN outperformed the SVM, backpropagation neural network (BPNN), random forest (RF), and Bayesian methods, with the highest level of accuracy.
However, the weight parameters and thresholds for every DNN layer must be established experimentally and not theoretically. Erfani et al. [24] put forward a mixed model, which used a DBN alongside a one-class SVM. An unsupervised DBN was trained to pinpoint common properties, and a one-class SVM was trained using features taken through the DBN.
A NIDS using a supervised CNN-IDS has been proposed, in which a data preprocessing step normalizes the dataset; the CNN is trained, optimal features are extracted, and, finally, a SoftMax classifier is used to classify attacks [8].
To decrease computational costs, the traffic input vector is reconfigured into an image format. This model is evaluated using the KDD-CUP99 dataset. Although the study sees a reduction in detection time, the detection rate should be increased and feature learning should be improved for the model to learn the features with a small number of attack categories.
In [25], a hybrid model leverages a grey wolf optimizer (GWO) to propose a CNN for network anomaly detection, and the GWO improves initial population generation, exploration, exploitation, and revamped dropout functionality. In the first step, the GWO selects desired features to establish optimal trade-off between the two main objectives of a minimized feature set and reduced false-alarm rate. In the second step, an improved CNN (ImCNN) is utilized for anomaly classification, and the proposed model is subsequently evaluated on the DARPA98, KDD-CUP99, and synthetic datasets.
To discriminate between normal and abnormal traffic, and to auto-profile traffic patterns, D-PACK has been proposed [9].
This approach integrates an unsupervised CNN model to investigate just the first few bytes of the first few packets in each flow, therefore detecting abnormal traffic early using raw packet-level data. D-PACK is assessed using the USTC-TFC2016 dataset [26].
A combination of bidirectional long short-term memory (BLSTM), attention mechanism, and multiple convolutional (MC) layers has been suggested as the BAT-MC model [27].
This approach uses the structured network traffic information to generate time series features. The MC layers extract the local features, the BLSTM generates the packet vectors, the attention mechanism screens the network flow composed of packet vectors, and a SoftMax classifier is used for final classification. This model is tested with the NSL-KDD and KDD-CUP99 datasets.
Zheng [28] proposes two convolution and pooling layers with batch normalization appended to each convolution layer to reduce computational costs and speed up detection.
To determine the optimal model, different numbers of convolution and pooling layers are examined and a SoftMax classifier assesses the CNN-extracted features. Evaluation is conducted using the KDD-CUP99 dataset.
An improved CNN for wireless network intrusion detection has been proposed using stochastic gradient descent (SGD) classification and KDD-CUP99-based evaluation although this method demonstrated problems with gradient dispersion and local optima [10]. An alternative CNN model that uses a SoftMax classifier on the KDD-CUP99 dataset is proposed [29] and shows that increasing the number of epochs improves the accuracy of the model. In addition, this approach demonstrates that a CNN model achieves better performance as compared to SVM and DBN.
CNNs offer potential benefits in learning image content to achieve object detection, but they are not yet ideal for anomaly discovery. Expected input flow content is easily found in a CNN, while abnormalities exhibit only small differences from these normal data, and this means that a specific method is needed to detect these slight changes. IDS researchers therefore work to explore whether or not CNNs, for example, can learn to detect these abnormal characteristics as well as normal content features.
Normal and abnormal data flows are not significantly different, and the CNN must distinguish the variations. A standard CNN learns the features that represent flow content; this is primarily normal data, so learning is based on content and not on variation. Although CNN-based models have been used to address key challenges in anomaly detection [8-10, 25, 27-29], this limitation of CNNs remains unresolved. Solutions proposed thus far have focused on feature selection, model structure, and fine-tuning, while the main objective of the CNN-MCL developed in the present study is the detection of the minor differences between the abnormalities and the normal content. Furthermore, most existing studies have evaluated models using the NSL-KDD or KDD-CUP99 datasets; the CICIDS2017 set will be used here to evaluate the model with novel attacks in the testing phase.

Convolutional Neural Networks.
A CNN is a type of neural network that aims to learn appropriate feature representations of the input data. In this type of architecture, the initial layers are generally a collection of convolutional feature extractors applied to an image through a number of learnable filters. The filters act as a sliding window that moves across all areas of an input image; the distance the window moves at each step is known as the stride, and the outputs created are known as feature maps. Every CNN layer is made up of numerous convolution kernels, each employed to produce a different feature map. Regions of neighboring neurons are linked to a neuron of a feature map of the next layer. To produce a feature map, the kernel is shared across all spatial locations of the input. Following the convolution and pooling layers, one or multiple fully connected layers complete the classification [30][31][32][33]. The convolutional operation between the input feature maps and a convolutional layer within the CNN architecture is given by the following equation:

$$h_j^{(n)} = \sum_{k} h_k^{(n-1)} * w_{kj}^{(n)} + b_{kj}^{(n)} \quad (1)$$

where $*$ is the 2D convolution, $h_j^{(n)}$ is the output of the j-th feature map in the n-th hidden layer, $h_k^{(n-1)}$ is the k-th channel in the (n − 1)-th hidden layer, $w_{kj}^{(n)}$ is the weights of the k-th channel in the j-th filter in the n-th layer, and $b_{kj}^{(n)}$ is its corresponding bias term.
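As a rough illustration of the sliding-window operation described above, the following dependency-free sketch computes one output feature map from several input channels. The function name `conv2d_valid`, the absence of padding, and the use of a single scalar bias per output map are assumptions made for this example; like most deep learning libraries, it computes cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(channels, kernels, bias=0.0, stride=1):
    """Compute one output feature map from K input channels.

    channels: list of K input maps, each a list of rows (H x W).
    kernels:  list of K filters, one per input channel (kh x kw).
    No padding is applied, so the output shrinks with the kernel size.
    """
    h, w = len(channels[0]), len(channels[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = [[bias] * out_w for _ in range(out_h)]
    for x, k in zip(channels, kernels):  # sum over the K channels
        for i in range(out_h):
            for j in range(out_w):
                acc = 0.0
                for di in range(kh):
                    for dj in range(kw):
                        acc += x[i * stride + di][j * stride + dj] * k[di][dj]
                out[i][j] += acc
    return out


# One 3x3 channel filtered with a 2x2 kernel gives a 2x2 feature map.
feature_map = conv2d_valid(
    [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]],
    [[[1, 0], [0, 1]]],
)
```

With a stride of 1 the window visits every position; a larger stride skips positions and yields a smaller feature map.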
For every layer, the filter coefficients are seeded with random values to start and then learned through the backpropagation algorithm [34]. In addition, convolutional layers also involve an activation function to establish nonlinearity. The collection of convolutional layers produces a substantial volume of feature maps. To help limit the dimensionality of these features, convolutional layers are followed by an additional layer, called pooling, in order to limit the computational expense of training the network and reduce the potential for overfitting. A number of pooling operations exist, including max, average, and stochastic pooling. The max-pooling layer acts as a sliding window that moves with a given stride and outputs the maximum value inside the window.
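The max-pooling step can be sketched in a few lines. The helper name `max_pool2d` and the default 2 × 2 window with stride 2 are assumptions chosen for illustration; no padding is applied here.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Slide a size x size window over fmap and keep the maximum in each window."""
    h, w = len(fmap), len(fmap[0])
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    return [
        [
            max(
                fmap[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


pooled = max_pool2d(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
)
```

Here a 4 × 4 map is reduced to 2 × 2, which is how pooling lowers the dimensionality passed to later layers.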
A CNN's training is completed with an iterative algorithm that alternates between feedforward and backpropagation passes. At every iteration of the backpropagation, the convolutional filters and fully connected layers are updated. A key aim is to minimize the average loss E between the true class labels and the network outputs, i.e.,

$$E = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{c} y_i^{*(k)} \log\left(y_i^{(k)}\right) \quad (2)$$

where $y_i^{*(k)}$ and $y_i^{(k)}$ are, respectively, the true label and the network output of the i-th input at the k-th class, with m training inputs and c neurons in the output layer. A number of solutions have been suggested to minimize the average loss [35][36][37], and this paper employs adaptive moment estimation (Adam) [37] to train the model.
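For concreteness, the sketch below averages a loss over m inputs and c output neurons, assuming the categorical cross-entropy that is commonly paired with a SoftMax output; the loss actually used by the authors may differ. The function name `mean_cross_entropy` and the clipping constant `eps` are illustrative choices.

```python
import math


def mean_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average cross-entropy over m inputs with c output neurons each.

    y_true: m rows of one-hot (or soft) true labels.
    y_pred: m rows of predicted class probabilities.
    eps guards against log(0); it is an implementation detail of this sketch.
    """
    m = len(y_true)
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += t * math.log(max(p, eps))
    return -total / m


loss = mean_cross_entropy([[1, 0], [0, 1]], [[0.5, 0.5], [0.25, 0.75]])
```

An optimizer such as Adam then adjusts the weights in the direction that lowers this value.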

Dataset Description.
This paper presents numerous experiments on the CICIDS2017 [13] dataset, which is an intrusion detection and prevention dataset. The Canadian Institute for Cyber Security (CIC) collected this dataset, and it is publicly available for researchers and students. Recently, prominent IDS research has used this dataset because of the criteria it meets and because it is the largest available dataset of real-world data. It is a well-known problem that the majority of other datasets contain outdated data, as intrusion attack types are constantly changing and becoming increasingly sophisticated. Some datasets also have other problems, such as missing metadata and features, while some do not include adequate diversity in known attacks. All criteria required for developing a precise dataset, according to Gharib et al. [38], are fulfilled by the CICIDS2017 dataset. Sharafaldin et al. [13] noted that this dataset is the most complete one thus far. Furthermore, this dataset includes benign traffic as well as recent common attacks that are similar to true real-world data (PCAPs). The results of the network traffic analysis performed with CICFlowMeter are also included, with flows labelled by time stamp, source and destination IPs, source and destination ports, protocols, and attack (CSV files). The CICIDS2017 benchmark dataset includes the abstract behavior of 25 users based on the HTTP, HTTPS, FTP, SSH, and e-mail protocols. The data gathering period began at 9 a.m. on Monday, 3 July 2017, and ended at 5 p.m. on Friday, 7 July 2017, thus lasting five days. Monday is the normal-activity day and contains only benign traffic. On the other days, besides benign traffic, various attacks are implemented, such as brute force SSH, brute force FTP, DoS, web attack, heartbleed, infiltration, DDoS, and Botnet.
Following data cleaning, which involves eliminating the records that have missing values, the collected data total 2827876 records, with 2271320 normal records and 556556 abnormal records. The labelled records of every specified attack are stored in a separate CSV file, and every CSV file consists of a particular number of labelled records described by 78 features and 1 label. All records include two types of features, nominal and numerical. Five of the features are nominal, while the remaining features are numerical.
A set of experiments was conducted on the CICIDS2017 dataset to verify the performance of CNN-MCL on the intrusion detection task. Unlike other IDS datasets, CICIDS2017 was not divided by its provider into training and test datasets, so in this paper it was divided into training and test records according to the nature of each experiment, as described for each experiment.

Mean Convolutional Layer (CNN-MCL)
This paper suggests a CNN-based layer to separate anomalies from normal data. The method learns the changes introduced by abnormal data directly from the data. The main issue is that normal and abnormal flows are not greatly different, and so the CNN must be forced to detect the variations caused by abnormalities. If standard CNNs are used for detection, the features learned represent an image's content (the flow's content), which is primarily normal flow, meaning that the classifier identifies the data content seen in training instead of learning data variations.
However, the approach used here was designed to suppress the content and adaptively learn abnormality traces. In order to achieve this, an innovative convolutional layer is proposed, known as the mean convolutional layer (CNN-MCL), established for use with intrusion detection system tasks. The prediction errors produced by this layer are employed as low-level abnormal/normal features, from which more advanced abnormality detection features are created thereafter. In order to echo these actions, the suggested layer aims to exclusively learn prediction error filters. The feature maps created are then linked with prediction error fields employed as low-level abnormal traces. The CNN-MCL is able to be positioned differently from the CNN aimed to undertake IDS tasks. This acts as a way of suppressing the content, as prediction errors largely do not include flow content, and it offers the CNN low-level IDS features. Deeper layers of the CNN are able to learn higher-level features as a result of the low-level abnormal characteristics.
With the equation below, one can define the CNN-MCL, where L denotes the L-th CNN-MCL, the subscript k denotes the k-th convolutional filter within a layer, and the central value of a convolutional filter is located at $(c_x, c_y)$.
The CNN is then forced to learn prediction error filters through actively implementing specific constraints: Predictions of CNN-MCL are established via a specific training process. Following this, the filter weights $w_k^{(L)}$ are updated at each iteration with the Adam algorithm in the backpropagation stage. Then, the updated filter weights are projected back into the feasible set of prediction error filters by enforcing the CNN-MCL constraints, and this projection is undertaken at every training iteration.
This is achieved by first setting the central filter weight to the negative mean of the central values of all k filters in the layer; the remaining filter weights are then normalized using equation (3). There are two steps to this normalization. Firstly, the remaining weights are multiplied by the mean value, and then the resulting weights are divided by the sum of all filter weights, excluding the central value. In layer L, the center points of all k filters are set to the negative mean value. The pseudocode of this process is given in Algorithm 1.
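The projection step just described can be sketched as follows. This is one literal reading of the prose, not a reproduction of Algorithm 1: the function name `project_mcl_filters`, the list-of-lists filter layout, and the assumption that the off-center weights of each filter do not sum to zero are all choices made for this example.

```python
def project_mcl_filters(filters):
    """Project filters onto the prediction-error set described in the text.

    filters: list of k square 2D filters (lists of rows) with odd side length.
    Returns new filters; the input is left unchanged.
    """
    c = len(filters[0]) // 2  # index of the central weight (c_x = c_y = c)
    # Mean of the central values, shared by all k filters in the layer.
    mean_center = sum(f[c][c] for f in filters) / len(filters)
    projected = []
    for f in filters:
        # Sum of this filter's weights, excluding the central value.
        rest_sum = sum(
            v for i, row in enumerate(f) for j, v in enumerate(row) if (i, j) != (c, c)
        )
        scale = mean_center / rest_sum  # assumes rest_sum != 0
        new_f = [[v * scale for v in row] for row in f]
        new_f[c][c] = -mean_center  # central weight becomes the negative mean
        projected.append(new_f)
    return projected


filters = [
    [[1.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 1.0]],
    [[2.0, 2.0, 2.0], [2.0, 4.0, 2.0], [2.0, 2.0, 2.0]],
]
projected = project_mcl_filters(filters)
```

In a training loop, a function like this would be called after each optimizer update so that the filters stay inside the constrained set.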
To provide intuition into this, suppose the prediction is formed by using some function f(X) to predict normality or abnormality of input data (flow). In particular, f(X) is a classifier which predicts based on the extracted features from Feature Extraction. Moreover, suppose g(I) is the extracted feature; therefore, the full operation is formed as f(g(I)). For simplicity, we assume a network with one CNN-MCL and one CNN in the feature extraction section. Regarding equation (1), using conventional CNN, the classifier generates the output (predicts) based on the following equation: where f is the classifier function, g is (or is part of) the feature extraction process, I is the input data, $*$ is the 2D convolution, k is the number of channels, $w_{kj}$ is the weights of the k-th channel in the j-th filter, and $b_{kj}$ is its corresponding bias term. The classifier detects the anomaly based on the following equation: where $w_{kj}$ is the weights generated by the CNN-MCL from the k-th channel in the j-th filter. Then, regarding equations (3) and (5), we have It can be seen from equation (7) that $\mathrm{Mean}(w^{(L)}(c_x, c_y))$ is calculated from all channels because it does not have the subscript k. Therefore, the mean value is shared among all channels. Then, for each individual channel, the weights are normalized using $w_k^{(L)} / \sum w_k^{(L)}$. These operations reduce the extent to which normal context is extracted as useful features and increase the extent to which abnormal variations are extracted as features.
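Read together with the projection steps, the shared mean and the per-channel normalization can be written compactly as below. This is a sketch derived from the prose, not the paper's own numbered equations; the symbols $\mu^{(L)}$, $K$ (the number of filters), and the tilde marking projected weights are notation introduced here.

```latex
\mu^{(L)} = \operatorname{Mean}\!\big(w^{(L)}(c_x, c_y)\big)
          = \frac{1}{K}\sum_{k=1}^{K} w_k^{(L)}(c_x, c_y),
\qquad
\tilde{w}_k^{(L)}(l, m) =
\begin{cases}
-\mu^{(L)}, & (l, m) = (c_x, c_y),\\[4pt]
\dfrac{\mu^{(L)}\, w_k^{(L)}(l, m)}{\sum_{(p, q) \neq (c_x, c_y)} w_k^{(L)}(p, q)}, & \text{otherwise,}
\end{cases}
\qquad\Longrightarrow\qquad
\sum_{l, m} \tilde{w}_k^{(L)}(l, m) = 0 .
```

Under this reading every projected filter sums to zero, so a locally constant input produces a zero response, while a value that deviates from its neighbors produces a nonzero one. This is one way to see why the layer suppresses content and emphasizes variations.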
To show the advantage of CNN-MCL in the learning process compared to a standard CNN, we evaluated a simple model using a CNN layer and a classifier. Our goal here is to visualize the difference between the features extracted by the CNN-MCL and by a conventional CNN. Therefore, we generate a simple dataset in which we control the location and value of the anomalies. However, in Main Experimental, we have evaluated our proposed CNN-MCL with a real-world dataset. The dataset is generated based on the specifications below. Moreover, we have tested with different specifications and observed similar behavior in all evaluations:
Input size: 11 × 11
Normal data: random uniform numbers in the range [0, 1]
Abnormal train data: normal data + random integer in [5, 10]
Abnormal test data: normal data − random integer in [5, 10]
Number of training records: 100000
Number of testing records: 10000
Ratio of abnormal records in the training data in evaluation 1: 10%
Ratio of abnormal records in the training data in evaluation 2: 30%
Ratio of abnormal records in the testing data in evaluation 1 and evaluation 2: 50%
Abnormal location: first row of the input, first to fifth values
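A dataset following these specifications could be generated along the lines below. The function names, the fixed seed, and the choice to draw an independent integer for each of the five abnormal positions are assumptions of this sketch; the text does not say whether one offset is shared across the positions of a record.

```python
import random


def make_record(rng, abnormal, sign, size=11):
    """Return one size x size record of uniform noise, optionally with an anomaly."""
    grid = [[rng.random() for _ in range(size)] for _ in range(size)]
    if abnormal:
        for col in range(5):  # first to fifth values of the first row
            grid[0][col] += sign * rng.randint(5, 10)  # randint includes both ends
    return grid


def make_dataset(n, abnormal_ratio, sign, seed=0):
    """Build n records; sign=+1 mimics the training anomalies, sign=-1 the test ones."""
    rng = random.Random(seed)
    n_abnormal = round(n * abnormal_ratio)
    labels = [1] * n_abnormal + [0] * (n - n_abnormal)
    rng.shuffle(labels)
    records = [make_record(rng, bool(label), sign) for label in labels]
    return records, labels


# Small sizes keep the example fast; the text uses 100000 and 10000 records.
train_x, train_y = make_dataset(1000, 0.10, sign=+1)
test_x, test_y = make_dataset(200, 0.50, sign=-1, seed=1)
```

Because the test anomalies are subtracted rather than added, their values never appear during training, which is what makes them unseen abnormal data.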

Security and Communication Networks
It can be seen from the above dataset that the values of abnormality of test data are different from the abnormality of train data.
We test a simple model with one CNN-MCL and one CNN model. Accordingly, in one model, the proposed CNN-MCL is applied, and in the other model, only CNN is used. We train the model for 20 epochs, and we plot three channels of the output (weights) of the CNN-MCL and CNN layers to compare the CNN-MCL and CNN visually. For visualization purposes, we have created pseudocolor plots of the 2D array applying quadrilaterals using Matplotlib [39], as shown in Figures 1-5. Figure 1 shows samples of (a) normal, (b) abnormal train, and (c) abnormal test records with data in the ranges [0, 1], [5, 11], and [−10, −4], respectively.
In the first evaluation, we have only 10% abnormal records in the training data. We choose the first three channels of the output of the layers (weights) and plot them. We then increase the abnormality rate of the training data from 10% to 30% and repeat the evaluation. The output weight plots of the CNN and CNN-MCL models for one of the abnormal records are presented in Figures 4 and 5, respectively. The figures show that the CNN-MCL and CNN layers distinguish the abnormality location effectively; the extracted features (values) in the neighborhood of the abnormality differ significantly from normal values in the CNN-MCL but less so in the CNN model.
Comparison between Figures 2 and 3 or between Figures 4 and 5 shows that the CNN-MCL presents abnormal features more distinctly than the CNN, and the abnormality has been identified very well. That is, from the visual perspective, CNN-MCL has extracted weights that distinguish between normal and abnormal parts more accurately than CNN. The abnormal data in the test data can be thought of as unseen abnormal data because the range of abnormal data in the training data is [5, 11] (random normal data + random integer in [5, 10]), but the range of abnormal test data is [−10, −4] (random normal data − random integer in [5, 10]).
For a more realistic experiment, we set the location of the abnormality randomly and evaluated the models. Based on this explanation, two datasets with 10% and 30% abnormality rates have been generated.
The results in Table 1 illustrate the accuracy of the CNN-MCL and CNN layers in detecting abnormal records in the test data, with the CNN-MCL model outperforming the CNN.

Network Architecture
The CNN-MCL layer is used to devise a CNN that can distinguish between normal and abnormal flows in NIDS. Figure 6 illustrates the overall design of the suggested architecture, including every layer in detail.
This architecture can learn new associations between the feature maps of deeper layers by extracting a higher level of representation of the previously learned normal/abnormal features. The output of the final convolution layer is flattened and then fed to the classification block, which includes a fully connected layer and a SoftMax layer. A detailed overview of the suggested architecture and the different layers used in it is presented below.

Reshaping Layer (Layer 0).
In NIDS, the shape of the input (one flow per row) is typically N × V or N × 1 × V, in which V is the number of features that describe the flow and N is the batch size. As implemented in a previous work [40], the input shape is changed from 1D to 2D. Hence, the first layer of this architecture reshapes the input vector into an 11 × 11 2D matrix; when the input vector has fewer than 121 features, the remaining values are set to zero. Thus, layer 1 receives an N × 11 × 11 patch. We choose 11 × 11 patches because the number of features in the tested dataset is more than 100 and less than 121.
This architecture can handle other sizes, such as n × m, but the feature arrangement in the test and train data must be the same.
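The reshaping step amounts to zero-padding the flow vector and cutting it into rows. The sketch below assumes row-major filling and uses a hypothetical helper name, `reshape_flow`; the feature count of 105 in the usage line is only an example value between 100 and 121.

```python
def reshape_flow(features, side=11):
    """Zero-pad a 1D feature vector and reshape it into a side x side grid (row-major)."""
    capacity = side * side
    if len(features) > capacity:
        raise ValueError("too many features for the target grid")
    padded = list(features) + [0.0] * (capacity - len(features))
    return [padded[r * side:(r + 1) * side] for r in range(side)]


# Example: a flow with 105 features becomes an 11 x 11 patch with 16 trailing zeros.
grid = reshape_flow([float(i) for i in range(1, 106)])
```

Whatever ordering is chosen, it has to be applied identically to training and test flows so that each feature always lands in the same cell.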

CNN-MCL Layer (Layer 1).
In their present form, CNNs often learn content-dependent features, and thus, the proposed architecture has the CNN-MCL layers forming the second layer (layer 1). This leads to this layer learning value-dependency features, which are fragile and can be eliminated by various nonlinear operations [41], including activation layers and pooling. The output of the CNN-MCL layers is therefore given directly to a regular convolutional layer.
In particular, the 11 × 11 input of layer 1 is first convolved with 32 different 5 × 5 filters with a stride of 1. Such filters are able to learn the prediction error features relating the estimated center value to its local neighbors. The CNN-MCL layer thus produces abnormality prediction feature maps with dimensions of 11 × 11 × 32. Generally, a larger filter (such as 5 × 5) captures generic features and essential components in the inputs, whereas a smaller filter (3 × 3) captures more sophisticated features and has better weight sharing. Therefore, we utilize a 5 × 5 filter for the CNN-MCL layer (first layer) and a 3 × 3 filter for the other layers. Furthermore, in this paper, we demonstrate the effectiveness of these filter sizes with some experiments in Experimental Results.

Convolutional Block.
To learn higher-level prediction error features, a set of convolutional layers is used, and each layer is followed by an activation function, batch normalization, and a pooling layer. In this paper, each such group is referred to as a convolutional block. Every convolutional layer learns a new representation of the feature maps learned by the previous convolutional layer, that is, of the lower-level features.
As seen in Figure 6, regular convolutional layers (convolutional blocks) are used in the third (layer 2) and fourth (layer 3) layers to learn higher-level representative features as well as new associations among the prediction feature maps. The output dimensions of the convolutional blocks are 6 × 6 × 32 and 3 × 3 × 8, respectively. These layers of the proposed structure are further explained below.

Activation Function.
Generally, a nonlinear mapping known as an activation function follows a convolutional layer. This function is applied to every value in the feature maps of each convolutional block. Activation functions are of various types. In computer vision applications, the ReLU activation function has been implemented successfully [42,43]. He et al. [44] proposed a different activation function called PReLU, which surpassed human-level performance on a visual recognition challenge [45]. Moreover, Clevert et al. [46] suggested the exponential linear unit (ELU) activation function, which speeds up learning significantly and obtains below 10% classification error, compared with a ReLU network having the same architecture.
Including nonlinearity across the network layers strengthens the capability of a CNN to separate the feature space. The proposed CNN restricts the range of the data values with the ReLU activation function at each stage of the network. In contrast, no activation function layer follows the feature maps learned by the CNN-MCL layer. This is primarily because the learned prediction error features can easily be eliminated by nonlinear operations such as activation functions.
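For reference, the three activation functions mentioned above can be written out with their standard definitions. This is a minimal scalar sketch; the default slope `a=0.25` and `alpha=1.0` are illustrative values, not parameters reported in this paper (in PReLU the slope is learned during training).

```python
import math


def relu(x):
    """ReLU: passes positive values and clips negative values to zero."""
    return max(0.0, x)


def prelu(x, a=0.25):
    """PReLU: like ReLU, but negative inputs are scaled by a slope `a`."""
    return x if x > 0 else a * x


def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, smooth exponential saturation below zero."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


# Apply each function to the same illustrative values.
samples = [-2.0, -1.0, 0.0, 3.0]
relu_out = [relu(v) for v in samples]
prelu_out = [prelu(v) for v in samples]
elu_out = [elu(v) for v in samples]
```

All three map a negative input to zero or to a compressed value, which illustrates why such a nonlinearity placed directly after the CNN-MCL layer could suppress small prediction error features.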

Batch Normalization.
Computer vision researchers have devised numerous methods for normalizing the data in CNN architectures. Early deep learning architectures used the local response normalization (LRN) layer, which normalizes the central coefficient in a sliding window of a feature map with respect to its neighbors. Ioffe and Szegedy [47] proposed the batch normalization layer, which drastically accelerates the training of deep networks. This mechanism reduces the internal covariate shift, that is, the change in the distribution of the inputs to a learning system.
To this end, a zero-mean and unit-variance transformation is applied to the data while the CNN model is being trained. The parameters of all previous layers affect each layer's input, so even small changes are amplified. Hence, this layer addresses a significant problem and improves the final accuracy of a CNN model. This is why the proposed architecture places a batch normalization layer after every regular convolutional layer.
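The zero-mean, unit-variance transformation can be sketched for a single feature over one mini-batch. This is a simplified training-time illustration of batch normalization as introduced in [47]; the names `gamma`, `beta`, and `eps` denote the usual learnable scale, learnable shift, and numerical-stability constant, and the example batch is hypothetical.

```python
import math


def batch_normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    mean = sum(values) / len(values)
    # Biased (population) variance of the mini-batch.
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Hypothetical activations of one feature for a batch of four flows.
batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_normalize(batch)
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
```

At inference time, frameworks typically replace the batch statistics with running averages collected during training; that detail is omitted here.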

Pooling.
This CNN uses max-pooling with a 3 × 3 window and a stride of 2. The max-pooling layer returns the maximum value in the local neighborhood of the sliding window. Such a layer aims to reduce the dimensionality of the feature maps, which, in turn, lowers the computational cost of training and reduces the possibility of overfitting. In particular, the set of parallel convolutional operations produces a high-dimensional volume of feature maps. Thus, pooling layers retain the most representative features, help in subsampling, and improve the accuracy. In this architecture, the two pooling layers decrease the feature map dimensions from 11 × 11 × 32 and 6 × 6 × 16 to 6 × 6 × 16 and 3 × 3 × 8, respectively.
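The spatial reduction from 11 × 11 to 6 × 6 and then to 3 × 3 can be reproduced with a small sketch of 3 × 3, stride-2 max-pooling on a single feature map. The padding scheme is an assumption: the paper does not state it, but TensorFlow-style "same" padding (output size ceil(n / stride)) is the one that yields these sizes. The function name and the example feature map are illustrative.

```python
import math


def max_pool_same(fmap, window=3, stride=2):
    """Max-pool a square 2D list using TensorFlow-style 'same' padding."""
    size = len(fmap)
    out = math.ceil(size / stride)
    # Total padding needed so that `out` windows cover the input.
    pad_total = max((out - 1) * stride + window - size, 0)
    pad_before = pad_total // 2
    pooled = []
    for i in range(out):
        row = []
        for j in range(out):
            r0 = i * stride - pad_before
            c0 = j * stride - pad_before
            # Positions outside the map are skipped, which is equivalent
            # to padding with minus infinity before taking the maximum.
            vals = [fmap[r][c]
                    for r in range(max(r0, 0), min(r0 + window, size))
                    for c in range(max(c0, 0), min(c0 + window, size))]
            row.append(max(vals))
        pooled.append(row)
    return pooled


# Illustrative 11 x 11 feature map whose values increase row by row.
fmap = [[r * 11 + c for c in range(11)] for r in range(11)]
first = max_pool_same(fmap)     # 11 x 11 -> 6 x 6
second = max_pool_same(first)   # 6 x 6 -> 3 x 3
```

Pooling acts on each channel independently, so this sketch covers only the spatial dimensions of the feature maps.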
The 3 × 3 × 8 output is then reshaped into 72 outputs for layer 4. This form of mapping learns the association across feature maps, which is a linear combination of features across channels at the same location. In contrast, the previously learned hierarchical features are developed by learning local spatial associations in a receptive field (the local region/patch convolved with a filter) within the same feature map. Finally, a new association between these feature maps is learned.
Figure 6 shows that a neural network classifier with a SoftMax activation function in the output layer is used to classify the output features learned by the set of convolutional layers. This activation function maps the features learned by the last fully connected layer to a set of probability values such that the outputs of all neurons in this layer sum to 1. Abnormal flows can be identified by selecting the class associated with the SoftMax neuron that has the highest activation level. In particular, the fully connected layer includes 256 neurons, and it learns new relations among the deepest convolutional features of the CNN.
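The SoftMax decision rule described above can be sketched as follows. The logit values and the label order `["normal", "attack"]` are hypothetical; the sketch only shows that the outputs sum to 1 and that the predicted class is the neuron with the highest activation.

```python
import math


def softmax(logits):
    """Map raw output-layer scores to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum improves numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores of the output layer for one flow.
labels = ["normal", "attack"]
logits = [1.2, 3.4]
probs = softmax(logits)
# The predicted class is the one with the highest probability.
predicted = labels[probs.index(max(probs))]
```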

Experimental Results
To examine the performance of the proposed NIDS approach, a set of experiments and analyses was carried out on the well-known CICIDS2017 dataset. This section presents the experimental work that validates CNN-MCL and compares its performance with that of two groups of approaches: state-of-the-art deep learning (DL) methods and non-DL methods (which this paper refers to as machine learning methods).
For the IDS task, the evaluation criteria were derived from the confusion matrix of a classification problem. The confusion matrix compares the actual labels with the predicted labels. An intrusion detection problem includes two classes, normal and attack, and is therefore evaluated with a 2-by-2 confusion matrix. As in any classification problem, the confusion matrix of the IDS task includes the terms TP (true positive), FP (false positive), TN (true negative), and FN (false negative). In the IDS task, TP, FP, TN, and FN denote attack data that are correctly classified as an attack, normal data that are incorrectly classified as an attack, normal data that are correctly classified as normal, and attack data that are incorrectly classified as normal, respectively. These four terms are used to compute the IDS evaluation measures, which are accuracy (ACC), precision (P), recall (R), false alarm (FA), and F-score (F1).
The mathematical definitions of the evaluation measures are presented in Table 2. The false alarm rate measures the proportion of benign events incorrectly classified as malicious. The F-score is the harmonic mean of precision and recall, which serves as a derived effectiveness measurement.
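The five measures can be computed directly from the four confusion-matrix terms. The sketch below uses the standard textbook definitions, which are assumed to match those in Table 2; the function name and the example counts are hypothetical and are not results from this paper.

```python
def ids_metrics(tp, fp, tn, fn):
    """Compute IDS evaluation measures from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes occur
    and at least one flow is predicted as an attack.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        # Share of benign flows that were flagged as attacks.
        "false_alarm": fp / (fp + tn),
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for 100 attack flows and 100 normal flows.
m = ids_metrics(tp=90, fp=5, tn=95, fn=10)
```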
All standard machine learning algorithms were implemented with Scikit-learn [48], an open source machine learning library for the Python programming language. All CNNs were implemented using TensorFlow [49], a Python deep learning framework. The experiments were all conducted on an Amazon AWS EC2 instance (p2.xlarge) with the following specifications: 4 Intel Xeon E5-2686, 61 GB RAM, and 1 NVIDIA K80 GPU with 2,496 parallel processing cores and 12 GiB of GPU memory. Furthermore, Table 3 lists the CNN-MCL parameters, which are shared between all the experiments.
This section includes the dataset description in Section 6.1. Next, the reliability of CNN-MCL is analyzed using various ML approaches in Section 6.2. Then, CNN-MCL for a single attack is explored in Section 6.3. The structural design is explored in Section 6.4. Finally, CNN-MCL for multiattack is presented in Section 6.5.

CNN-MCL versus ML.
In this section, a set of experiments is conducted to compare the performance of the proposed CNN-MCL with that of IDSs based on various machine learning (ML) methods. A binary classification was implemented with these ML approaches to predict the normal and abnormal flows in CICIDS2017.
To this end, the following ML classifiers were used: k-nearest neighbors (K-NN), two types of support vector machine classifiers (SVM and NuSVC), decision tree (DT), random forest (RF), adaptive boosting classifier (AdaBoost), and gradient boosting (GB). The default parameters of the ML methods (from Scikit-learn) were used for this experiment. The network configuration and hyperparameters were selected based on the lack of records in the dataset with CNN-MCL characteristics.
To address the problem of unbalanced datasets in CICIDS2017, the abnormal records from the dataset were first divided into training and test sets at a 70% to 30% ratio, respectively, and the maximum number of epochs was set to 50. Second, these training and test sets were extended by adding an equal number of normal records chosen randomly. However, regarding large-scale dataset classification by ML methods [50], experiments were conducted using six subsampled datasets of 10000, 20000, 40000, 60000, 80000, and 100000 records.
Table 4 presents the accuracies of the models as the number of records changes. The table shows that the number of records is clearly related to the accuracy of the models: the accuracy of all models increased with a higher number of records. The proposed CNN-MCL model benefits more than all other ML models from an increasing number of records in terms of both accuracy and F-score. Both its accuracy and F-score were 96.77% in the 10000-record case and increased to 99.87% when the number of records increased to 100000, which outperforms all other models at high record counts because of the capability of CNN-MCL to recognize normal and abnormal flows. Moreover, as the number of training records increased, CNN-MCL produced better results than the ML methods.
Figure 7 shows that the false alarm rate decreases when the number of training records increases. All models begin with a high false alarm rate when the number of records is small (10000), after which the false alarm rate decreases, and CNN-MCL produces superior results with a high number of records (80000-100000). This shows that CNN-MCL requires more training data than the conventional ML methods.
Hence, CNN-MCL can be considered a more suitable method for IDS because of its reliability and validity in detecting intrusion attacks in large-scale datasets.
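The balancing procedure used in this experiment (a 70/30 split of the abnormal records, followed by adding an equal number of randomly chosen normal records to each set) can be sketched as below. The function name, the record representation as `(id, label)` tuples, the fixed seed, and the choice to keep the normal records of the two sets disjoint are assumptions made for illustration.

```python
import random


def balanced_split(abnormal, normal, train_ratio=0.7, seed=0):
    """Split abnormal records 70/30, then add as many normal records to each set."""
    rng = random.Random(seed)
    shuffled = abnormal[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    train_abn, test_abn = shuffled[:cut], shuffled[cut:]
    # Draw one normal record per abnormal record, without replacement.
    sampled = rng.sample(normal, len(shuffled))
    train = train_abn + sampled[:cut]
    test = test_abn + sampled[cut:]
    return train, test


# Hypothetical records: label 1 marks abnormal flows, label 0 normal flows.
abnormal = [("a%d" % i, 1) for i in range(100)]
normal = [("n%d" % i, 0) for i in range(1000)]
train, test = balanced_split(abnormal, normal)
```

The sketch requires at least as many normal records as abnormal ones, which holds for a dataset in which benign traffic dominates.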

CNN-MCL for Unknown Attacks.
In this experiment, the capability of the proposed CNN-MCL to detect new types of attacks without pretrained knowledge was evaluated. The results of CNN-MCL are compared with those of the pure CNN model, for which a set of experiments was conducted in each scenario. In each scenario, one attack type was held out for testing, and the model was trained with the remaining attacks and the remaining dataset records. The dataset contains 14 types of attacks, but only 10 of them were used; four types were ignored because their small number of records prevents training and testing in such a scenario. Figure 8 illustrates the accuracy for each of the 10 attack types, with the x-axis representing the attack type with the number of attacks and the y-axis representing the accuracy percentage.
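The hold-one-attack-out protocol can be sketched as a simple data split. The paper does not specify how normal records are divided between the two sets, so the sketch assumes that the test set receives as many benign flows as held-out attack flows and that the rest go to training; the function name, the label strings, and the record counts are illustrative.

```python
import random


def unknown_attack_split(records, held_out, benign_label="BENIGN", seed=0):
    """Hold one attack type out of training so it is unseen at test time."""
    rng = random.Random(seed)
    unseen = [r for r in records if r[1] == held_out]
    known = [r for r in records if r[1] not in (held_out, benign_label)]
    benign = [r for r in records if r[1] == benign_label]
    rng.shuffle(benign)
    # Assumption: pair the unseen attack with an equal number of benign flows.
    n = min(len(unseen), len(benign))
    test = unseen + benign[:n]
    train = known + benign[n:]
    return train, test


# Hypothetical records as (flow_id, label) tuples.
records = ([(i, "BENIGN") for i in range(50)]
           + [(50 + i, "DoS Hulk") for i in range(10)]
           + [(60 + i, "PortScan") for i in range(10)])
train, test = unknown_attack_split(records, held_out="PortScan")
```

Repeating the split once per attack type gives one scenario for each of the 10 attacks considered.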
The developed model outperforms the CNN in detecting almost all types of new attacks, a result attributed to the capability of CNN-MCL to learn the abnormal features. Figure 9 shows the same observation: CNN-MCL outperforms CNN in terms of false alarm rate in almost all attack types. The results in this section were obtained from 20 independent experiments. Tables 5 and 6 present the confusion matrices of the classification results using the CNN-MCL and CNN models on DoS HULK and PortScan attacks, respectively. We chose these types of attack because they have the maximum number of records. Although the results in these tables show that the true-positive rate of CNN is higher than that of CNN-MCL, the false-positive rate of the CNN-MCL model is lower.
For further detail, the evaluation measures precision, recall, and F-score were computed for both CNN and CNN-MCL. As shown in Table 7, CNN-MCL outperforms CNN in terms of precision in all types of attacks, with the highest value for the DoS HULK attack at 58.64%, while that of CNN was 58.5%. Furthermore, CNN-MCL is superior to CNN regarding F-score for all types of attacks, with the highest value for the DoS HULK attack at 73.68%, while that of CNN was 73.58%. Meanwhile, CNN was slightly superior in terms of recall in certain attack types. However, this is acceptable considering the good accuracy and other measures.

CNN-MCL versus DL Methods.
This section assesses the performance of binary classification of normal and abnormal flows for three DL approaches: the standard CNN, C-CNN, and the proposed CNN-MCL. Figure 10 shows that the standard CNN obtained the poorest results in terms of precision, F-score, and accuracy. Meanwhile, the highest recall, 99.30, was obtained using the C-CNN, and the proposed CNN-MCL obtained a close recall of 99.15. However, the CNN-MCL model outperformed the CNN and C-CNN models in terms of precision, F-score, and accuracy at 99.76, 99.46, and 99.46, respectively. Although the improvement over the CNN variants was slight, CNN-MCL performed better than the other CNNs for IDS.
Moreover, Figure 11 displays the false alarm rate (FAR) for the standard CNN, C-CNN, and the developed CNN-MCL. The standard CNN obtained an FAR of 0.75, and the C-CNN obtained an FAR of approximately 0.71. The proposed CNN-MCL obtained a significantly better FAR of 0.23.

Structural Design.
The structural design of a deep learning model affects its final detection result. We present the results of experiments to fit the structural design of the proposed architecture, although the objective of this paper is not to introduce the best structure for the IDS model. Therefore, we ran several sets of examinations to find a suitable architecture for the proposed model. We randomly chose 200 k records, randomly split them 70:30 into training and test sets, and trained the models for only 20 iterations. We chose the training accuracy, testing accuracy, average training loss, and average testing loss for the comparison. We believe that these metrics can be used because the dataset for this part of the experiments is randomly chosen and balanced.
We present the experimental results for different values of the filter output size, batch size, and kernel size (filter size) in Tables 8-10, respectively.
Based on Tables 8-10, the best structure for the proposed model is selected, where the filter output size for layer 1, layer 2, and layer 3 is 32, 18, and 8, respectively; the best batch size is 128; and the best kernel sizes are 5 × 5, 3 × 3, and 3 × 3 for kernel 1 (in layer 1), kernel 2 (in layer 2), and kernel 3 (in layer 3), respectively.

Conclusion
A new deep learning-based approach was suggested in this paper for developing an intrusion detection system. Compared with a general CNN, which depends on content features, the proposed CNN-MCL can suppress flow content and adapt to learn variation detection features directly from data. For this, a new type of layer known as a mean convolutional layer was created, which helps the CNN learn prediction error filters that generate low-level general abnormal features. This layer was used to design a new CNN architecture that can accurately identify anomalies in the traffic flow. Numerous experiments were conducted to evaluate the ability of the proposed CNN-MCL model to perform intrusion detection. The findings of the experiments showed that it is possible to train the CNN-MCL so that it can accurately identify normal and abnormal flows as well as unknown attacks. To further examine the performance of CNN-MCL, it was compared with well-known machine learning methods that are the best detectors at present. According to the comparison, the proposed CNN-MCL architecture is able to detect anomalies accurately, especially when large-scale training data are used. Hence, the experimental results suggest that the CNN-MCL is able to accurately identify anomalies even with no manual feature extraction and with unbalanced training data.

Data Availability
The dataset (CICIDS2017) can be downloaded from https://www.unb.ca/cic/datasets/ids-2017.html. The code will be published on GitHub after the paper is published.