Deep Neural Network-Based Intrusion Detection System through PCA

Intrusion detection still faces a high false-positive rate, a low detection rate, slow processing, and high feature dimensionality. To address these issues, this work combines decision trees (DTs), principal component analysis (PCA), and deep neural networks (DNNs) into the intrusion detection model DT-PCA-DNN, which speeds up intrusion detection systems (IDSs) while raising the detection rate and lowering the false-positive rate. To reduce the overall data volume and accelerate processing, DT first screens the data. Data that DT judges to be intrusions are labeled and saved in a temporary training sample set, which is later used to retrain and optimize the DT and DNN. Data that DT judges to be normal have the added label removed, are reduced in dimension with PCA, and are then passed to the DNN for secondary discrimination. DT uses a shallow structure to keep an excessive amount of normal data from being misjudged as intrusion data, because such misjudgments cannot be corrected by the DNN's secondary processing. The DNN accelerates data processing by using the ReLU activation function, which simplifies the neural network computation, and the fast-converging Adam optimization algorithm. Two-class and five-class experiments on the NSL-KDD dataset show that the proposed model achieves high detection accuracy compared with other deep learning-based intrusion detection approaches. It also detects faster, which helps address the real-time intrusion detection problem.


Introduction
Communication systems and network entry points constantly face attacks from outside and even from within, and these attacks no longer resemble the single-type attacks of the network's early period. Today, most intrusions are of mixed types, and they are becoming harder to handle. According to the relevant literature, the Yahoo data breach caused a loss of 350 million US dollars and the "Bitcoin" breach caused a loss of about 70 million US dollars [1]. Depending on where intrusion behavior is observed, intrusion detection can be divided into network-based intrusion detection systems (NIDSs) and host-based intrusion detection systems (HIDSs) [2].
HIDS uses various log files, disk resource information, and system information to detect intrusion behavior, while NIDS judges whether there is intrusion behavior by inspecting the data packets flowing in and out of the local network. Machine learning, a very popular tool in recent years, merits application to intrusion detection [3]. In recent years in particular, machine learning methods from the support vector machine (SVM) to the neural network (NN) to the random forest (RF) have all found applications in intrusion detection.
Machine learning has a long history in intrusion detection. In 2003, the literature [4] demonstrated that decision trees (DTs) could detect intrusions faster than the Snort detection engine [5] of the time. A method that jointly optimizes feature selection and the SVM training model was demonstrated on an intrusion detection dataset; the results indicate that the joint optimization method outperforms the plain SVM in terms of performance and convergence speed. The literature [6] used support vector machines to identify intrusions and explored the real-time problem; however, the accuracy rate was low. The literature [7] suggested reducing dimensionality by principal component analysis (PCA) and then detecting intrusions with SVMs. The self-optimization technology increases the classifier's accuracy and decreases training and testing time [8]. In 2019, the literature [9] developed an intrusion detection model based on an improved convolutional neural network that has a high intrusion detection accuracy and true positive rate while exhibiting a low false-positive rate. In the same year, Fernandez et al. proposed training an intrusion detection system (IDS) using a feedforward fully connected deep neural network (DNN). Because the DNN demonstrated robustness in the scenario of dynamic IP address allocation, the model they developed has a broader range of real-world applications [10]. Also in 2019, the literature [11] introduced the ICA-DNN intrusion detection model, which is based on independent component analysis (ICA) and DNN and has a higher feature learning capability than some shallow machine learning models. This increased classification accuracy, but the algorithm's prediction time was not specifically tested and the model performs poorly in real time.
A review of the intrusion detection models proposed by the aforementioned scholars shows that most studies pay insufficient attention to real-time intrusion detection, while the few models that investigate real-time performance in more depth suffer from low detection accuracy. To thoroughly investigate the real-time problem that is critical for intrusion detection while ensuring detection accuracy, this research proposes the DT-PCA-DNN model. The trained DT is essentially a series of if-else statements that handle huge batches of data quickly but with insufficient precision; the DNN has poor real-time performance but high precision when processing vast volumes of high-dimensional data. By combining the two, data are prefiltered using DT and then fed to the DNN following PCA. The experimental results demonstrate that the model significantly accelerates training and detection while maintaining a high detection rate. Researchers are also proposing different security protocols [12][13][14][15] for providing confidentiality and privacy against security attacks.

Principal Component Analysis.
PCA is a very commonly used linear dimensionality reduction algorithm, as shown in Figure 1. It is generally used to extract the principal components of high-dimensional data and simplify them into low-dimensional data, and the amount of information retained can be adjusted according to the requirements. Specifically, PCA maps the original feature space into another orthogonal space and seeks a hyperplane that satisfies nearest reconstruction and maximum separability to properly describe all the data in the dataset. The projections of the different data points onto this hyperplane should be as far apart as possible.
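As an illustration of the maximum-separability idea, the following minimal sketch finds the first principal component of a tiny dataset by power iteration on the covariance matrix and projects the data onto it. The helper names and the five sample points are hypothetical and are not taken from the paper, whose experiments rely on the sklearn implementation.

```python
import math

def center(data):
    """Subtract the per-feature mean so the data are centered at the origin."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iters=200):
    """Power iteration: approaches the unit eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, v):
    """1-D coordinate of every sample along direction v."""
    return [sum(r[j] * v[j] for j in range(len(v))) for r in centered]

# Hypothetical 2-D points lying close to the line y = x
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
centered = center(points)
pc1 = first_component(covariance(centered))
reduced = project(centered, pc1)  # 2-D data reduced to 1-D
```

Because the points lie near the diagonal, the direction found is close to (0.7, 0.7), which is the axis along which the projected points are spread furthest apart.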

Decision Tree.
The decision tree model is a tree structure that describes the classification of instances. It consists of nodes and directed edges. There are two types of nodes: internal nodes, which represent a feature or attribute, and leaf nodes, which represent a class [12]. The decision tree starts from the root node and splits repeatedly according to the characteristics of the data until all the data reach the leaf nodes. The attributes used as the basis for splitting must be discrete; continuous attributes can be discretized according to experimental requirements. For example, a sales office wants to judge whether a surveyed person has a house purchase demand based on the person's identity information, age, and marital status. The survey results are shown in Table 1.
From this, the corresponding decision tree can be generated as shown in Figure 2. When a person "ω" is a company employee, is married, and is 29 years old, the decision tree indicates that this person has a housing demand.
In the above example, choosing the features of the surveyed persons in a different order generates different decision trees. There are three criteria for selecting the splitting feature: information gain, gain ratio, and the Gini index. The core of the ID3 (iterative dichotomiser 3) algorithm is to apply the information gain criterion at each node of the decision tree to select features, the C4.5 algorithm uses the information gain ratio, and CART (classification and regression tree) uses the Gini index. Decision tree pruning and the details of feature selection are omitted for reasons of space; see reference [16].
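The three splitting criteria can be made concrete with a short sketch. The survey rows below are hypothetical stand-ins (Table 1 is not reproduced here), and the function names are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_by(rows, labels, feature):
    """Group the labels by the value each row takes on the given feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    return groups

def information_gain(rows, labels, feature):
    """ID3 criterion: entropy before the split minus weighted entropy after it."""
    n = len(labels)
    groups = split_by(rows, labels, feature)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def gain_ratio(rows, labels, feature):
    """C4.5 criterion: information gain divided by the split information."""
    n = len(labels)
    groups = split_by(rows, labels, feature)
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return information_gain(rows, labels, feature) / split_info if split_info else 0.0

def gini_index(rows, labels, feature):
    """CART criterion: weighted Gini impurity after the split (lower is better)."""
    n = len(labels)
    groups = split_by(rows, labels, feature)
    return sum(len(g) / n * gini(g) for g in groups.values())

# Hypothetical survey data, NOT the paper's Table 1
rows = [
    {"married": "yes", "job": "employee"},
    {"married": "yes", "job": "student"},
    {"married": "no", "job": "employee"},
    {"married": "no", "job": "student"},
]
demand = ["yes", "yes", "no", "no"]

gain_married = information_gain(rows, demand, "married")
gain_job = information_gain(rows, demand, "job")
```

In this toy data "married" separates the classes perfectly, so it has the higher information gain and the lower Gini index, and any of the three criteria would select it as the first split.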

Deep Neural Networks.
A neural network is a large parallel interconnected network composed of simple adaptive units that can simulate the interactive response of the biological nervous system to real-world objects [17]; in the machine learning context, the term refers more broadly to the area where machine learning and neural networks intersect.
In a neural network, the most fundamental structure is the neuron model, which is the simple unit in the definition above. Each neuron in a biological neural network is connected to other neurons, and when it is "stimulated," it releases neurotransmitters to the related neurons, altering their potential. When a neuron's potential crosses a "threshold," the neuron becomes activated, or "excited," and begins delivering neurotransmitters to its associated neurons [18].
In 1943, McCulloch and Pitts abstracted the above situation into the simple model depicted in Figure 3, known as the "M-P neuron model" [19].
In this model, the neuron receives input signals from n other neurons; a_i denotes the input from the ith neuron, w_i is the connection weight of the ith neuron, and θ is the threshold.
The signals are transmitted via weighted connections. The resulting total input is compared with the neuron's threshold, and the neuron's output y is obtained by processing the result with the activation function f:

$$y = f\left(\sum_{i=1}^{n} w_i a_i - \theta\right)$$

The sigmoid function, the tanh function, and the ReLU function are all frequently used activation functions [20]. A neural network is formed by connecting many neurons [21][22][23]. The term "deep neural network" refers to a neural network with more than two layers and more than two hidden layers. A neural network with only input and output layers is only capable of solving linearly separable problems. The hidden layer is introduced to address nonlinearly separable problems. Figure 4 illustrates a fully connected neural network, which means that every neuron in the preceding layer is connected to every neuron in the next layer. The neurons in the input layer only receive the input and perform no function processing. The neural network's learning process is fundamentally one of continuously adjusting the connection weights and thresholds of the neurons in order to approach the outputs of the training samples. The most notable method is the error backpropagation (BP) algorithm, which is used for the majority of neural network training today.
Mathematical Problems in Engineering
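The M-P neuron described in this section (a weighted sum of the inputs, minus the threshold, passed through an activation function) can be sketched in a few lines. The weights, inputs, and thresholds below are arbitrary illustrative values, and the function names are not from the paper.

```python
import math

def relu(a):
    """ReLU activation: passes positive values, clips negatives to zero."""
    return max(0.0, a)

def sigmoid(a):
    """Sigmoid activation: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron_output(inputs, weights, theta, activation):
    """y = f(sum_i w_i * a_i - theta) for a single M-P neuron."""
    total = sum(w * a for w, a in zip(weights, inputs)) - theta
    return activation(total)

def dense_layer(inputs, weight_rows, thetas, activation):
    """Fully connected layer: every input feeds every neuron in the layer."""
    return [neuron_output(inputs, w, t, activation)
            for w, t in zip(weight_rows, thetas)]

# Arbitrary illustrative values
inputs = [1.0, 2.0]
y_relu = neuron_output(inputs, [0.5, 0.5], 0.5, relu)   # 0.5 + 1.0 - 0.5
y_off = neuron_output(inputs, [0.5, -0.5], 0.0, relu)   # negative total is clipped
hidden = dense_layer(inputs, [[0.5, 0.5], [0.5, -0.5]], [0.5, 0.0], relu)
```

ReLU needs only a comparison, while sigmoid and tanh need an exponential, which is one reason ReLU is cheaper to compute.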

System Design
The overall design of the system is shown in Figure 5. The first step is to preprocess the overall dataset. Data preprocessing first normalizes continuous data and then performs one-hot encoding on discrete-valued data. The preprocessed dataset is divided into a training dataset and a test dataset. The processed training dataset is used to build the DT and train the DNN at the same time. PCA is an unsupervised method and requires no supervised training. After the DT and DNN training is completed, the DT-PCA-DNN model is established. The established intrusion detection model is then tested with the preprocessed test dataset, and the relevant parameters are adjusted and refined. The trained DT is essentially a series of if-else statements: it processes large batches of data at high speed but with insufficient precision. The DNN has poor real-time performance when processing large amounts of high-dimensional data but high precision, and PCA addresses the problem caused by the high data dimension. Combining the three, DT first prefilters the data, filtering out the intrusion data that are easy to judge and lightening the workload of the DNN; PCA then reduces the dimensionality, and the data are sent to the DNN for secondary classification. PCA solves the problem of slow DNN training on high-dimensional data. The three methods compensate for each other's shortcomings and ensure detection accuracy while providing good real-time performance.

Data Processing.
The data processing is divided into two parts. First, the continuous data are normalized, and then the discrete-valued data are encoded.

Data Normalization.
In this study, the experiment adopts min-max normalization. This method performs a linear transformation on the original data so that the transformed data fall within the [0, 1] interval. The transformation function used is

$$x^{*}_{ij} = \frac{x_{ij} - \min_j}{\max_j - \min_j}$$

Assume the dataset contains m items and each item has n-dimensional features. Here x_{ij} is the jth feature value of the ith item before normalization, min_j is the minimum value of the jth feature over the m items before normalization, max_j is the maximum value of the jth feature before normalization, and x*_{ij} is the jth feature value of the ith item after normalization.
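A minimal sketch of column-wise min-max normalization follows. The function name and the handling of constant columns (mapped to 0.0) are assumptions made for illustration; the paper does not say how such columns are treated.

```python
def min_max_normalize(data):
    """Scale every column of a list-of-rows dataset into [0, 1]."""
    n_features = len(data[0])
    mins = [min(row[j] for row in data) for j in range(n_features)]
    maxs = [max(row[j] for row in data) for j in range(n_features)]
    result = []
    for row in data:
        scaled = []
        for j, x in enumerate(row):
            span = maxs[j] - mins[j]
            # Assumption: a constant column (max == min) is mapped to 0.0
            # to avoid dividing by zero.
            scaled.append((x - mins[j]) / span if span else 0.0)
        result.append(scaled)
    return result

# Illustrative values only
raw = [[0, 10], [5, 20], [10, 30]]
normalized = min_max_normalize(raw)
```

Each column is scaled independently, so features with very different ranges end up on the same [0, 1] scale.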

One-Hot Encoding.
For discrete data encoding, one-hot encoding, also known as one-bit effective encoding, is used. N states are encoded using an N-bit status register; each state has its own register bit, and only one bit is valid at any time. Table 2 summarises the present sample set. For the sample features in Table 2, feature 1 has two values, so the encoding rule is as follows:
(i) 0 ⟶ 10
(ii) 1 ⟶ 01
The corresponding feature 2 has four values, so the encoding rule should be as follows:
(i) 0 ⟶ 1000
(ii) 1 ⟶ 0100
(iii) 2 ⟶ 0010
(iv) 3 ⟶ 0001
The coding rule of feature 3 follows the same pattern and will not be repeated.
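The encoding rules listed above can be reproduced with a small sketch. The function names are illustrative, and only features 1 and 2 are used because the number of values of feature 3 is not stated in the text.

```python
def one_hot(value, n_states):
    """Encode an integer state as an N-bit string with exactly one bit set."""
    bits = ["0"] * n_states
    bits[value] = "1"
    return "".join(bits)

def encode_sample(sample, states_per_feature):
    """Concatenate the one-hot codes of all discrete features of one sample."""
    return "".join(one_hot(v, n) for v, n in zip(sample, states_per_feature))

# Feature 1 has two values and feature 2 has four values, as in the rules above
code_f1 = [one_hot(v, 2) for v in range(2)]
code_f2 = [one_hot(v, 4) for v in range(4)]
sample_code = encode_sample([1, 2], [2, 4])  # hypothetical sample
```

A feature with N values therefore grows from one column to N columns, which is why the NSL-KDD feature dimension increases after encoding.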
The results of samples X, Y, and Z after one-hot encoding are shown in Table 3.
Next comes the depth of the DT. Since the function of the DT is not to identify as much intrusion data as possible but to misjudge normal data as intrusion data as little as possible, the DT should not be too deep. If the DT is too deep, the accuracy of the first classification improves, but normal data that have already been judged as intrusions will lower the final accuracy.

Training DNN.
Because the DNN requires a relatively large number of hidden layers to analyze high-dimensional data, underfitting will be severe if the number of hidden layers is too low. As the number of hidden layers increases, however, the time spent training the DNN grows rapidly, which is incompatible with the real-time needs of this work. Once PCA has reduced the dimensionality of the data, the correlation between feature dimensions and the data redundancy are reduced, DNN training is faster, and DNN accuracy is ensured. The DNN is trained with the BP algorithm, uses ReLU as the activation function to simplify the neural network computation, and uses the Adam optimization algorithm, which occupies fewer resources and converges faster, to shorten the training time.
As shown in Figure 6, the preprocessed test dataset is first classified with the trained DT. Data classified as intrusion are judged to be intrusions and stored in the temporary training sample set. Data classified as normal have the label given by this DT classification removed, in preparation for the second judgment. The DT layer acts as a filter screen that filters out the intrusion data that are easy to identify. Since the trained DT is a series of if-else statements, it processes large batches of data extremely quickly, which significantly reduces the workload of the DNN and improves the running speed of the algorithm. The second step is to perform PCA dimensionality reduction on the data that DT judged to be normal and whose labels have been removed. The trained DNN then classifies the low-dimensional data output by PCA a second time. If the classification result is intrusion, the intrusion label is added and the sample is stored in the temporary training sample set. If the classification result is normal, the normal label is added and the sample is likewise stored in the temporary training sample set.
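The two-stage detection flow can be sketched as follows. This is a structural illustration only: the three callables are hypothetical stand-ins for a trained DT, a fitted PCA transform, and a trained DNN, and the thresholds inside them are made up.

```python
def cascade_detect(samples, dt_predict, reduce_dim, dnn_predict):
    """Two-stage detection: DT prefilter, then PCA + DNN on what DT calls normal."""
    labels = [None] * len(samples)
    pending = []  # indices DT judged normal; these go on to the DNN
    for i, x in enumerate(samples):
        if dt_predict(x) == "intrusion":
            labels[i] = "intrusion"  # the stage-1 verdict is final
        else:
            pending.append(i)        # DT's "normal" label is discarded
    for i in pending:
        labels[i] = dnn_predict(reduce_dim(samples[i]))
    # Every sample ends up in the temporary training set with its assigned label
    temp_training_set = list(zip(samples, labels))
    return labels, temp_training_set

# Hypothetical stand-ins for the trained components
def dt_stub(x):
    return "intrusion" if x[0] > 0.9 else "normal"

def pca_stub(x):
    return x[:2]  # pretend dimensionality reduction: keep two features

def dnn_stub(z):
    return "intrusion" if sum(z) > 1.0 else "normal"

samples = [[0.95, 0.1, 0.0], [0.2, 0.1, 0.9], [0.6, 0.7, 0.0]]
labels, temp_set = cascade_detect(samples, dt_stub, pca_stub, dnn_stub)
```

Only the second and third samples reach the DNN stand-in, which is how the DT prefilter lightens the DNN's workload. The sketch also shows why the DT must be conservative: a sample the DT flags as an intrusion is never re-examined.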
Since DT and DNN are supervised learning methods, the labels assigned to the data are needed when the temporary training sample set is used for retraining. Since intrusion detection is carried out sample by sample, the method can quantify the result of comparing the original data type in the test dataset with the label added to the corresponding data during detection. After the quantified value accumulates to the set threshold, the collected data can be used to retrain and fine-tune the DT and DNN. After several rounds of fine-tuning, the designed model reaches its optimum.
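One possible reading of this retraining trigger is sketched below. It assumes that the "quantified value" is a running count of disagreements between the true type and the assigned label; that interpretation, the class name, and the threshold value are all assumptions rather than details given in the paper.

```python
class RetrainTrigger:
    """Accumulate labeled samples and signal when retraining should run."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed: number of mismatches tolerated
        self.mismatches = 0
        self.buffer = []            # temporary training sample set

    def record(self, sample, true_label, assigned_label):
        """Store one detection result; return True once the threshold is reached."""
        self.buffer.append((sample, assigned_label))
        if true_label != assigned_label:
            self.mismatches += 1
        return self.mismatches >= self.threshold

    def drain(self):
        """Hand over the collected samples for fine-tuning and reset the counter."""
        batch, self.buffer = self.buffer, []
        self.mismatches = 0
        return batch

trigger = RetrainTrigger(threshold=2)
first = trigger.record([0.1], "normal", "normal")         # agreement
second = trigger.record([0.8], "intrusion", "normal")     # first mismatch
third = trigger.record([0.3], "normal", "intrusion")      # second mismatch
batch = trigger.drain() if third else []
```

In a full system the drained batch would be passed to whatever routine fine-tunes the DT and DNN; that routine is outside the scope of this sketch.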

Dataset.
The NSL-KDD [24] dataset, an improved version of the KDD 99 dataset, was employed in this experiment. In comparison to the KDD 99 dataset, this dataset is devoid of duplicate records. The number of selected records is inversely related to the percentage of records in the original dataset, allowing for more efficient evaluation of the developed model [25]. The training set has 125,973 items while the test set contains 22,544 items. Table 4 details the various data types. The NSL-KDD dataset contains 41 features, which are classified into four major feature categories: basic TCP connection features, operating features on hosts, time-based network traffic statistical features, and host-based network traffic statistical features. Three of the 41 features are discrete: protocol type (three protocols: TCP, UDP, and ICMP), service (70 values in the training set and 64 in the test set; the larger value is used), and flag (normal or error status of the connection, 11 values in total). After normalizing the data, one-hot encoding is conducted. The data dimension increases to 43 after encoding the protocol type (41 - 1 + 3), then to 112 after encoding the service (43 - 1 + 70), and finally to 122 after encoding the flag (112 - 1 + 11). After one-hot encoding, the final NSL-KDD dataset has a data dimension of 122.
Figure 6: DT-PCA-DNN model optimization.
The parameters used for evaluation are listed in Table 5. TP represents the number of data pieces whose real data type is normal and whose model prediction result is also normal. TN represents the number of data pieces whose real data type is intrusion and whose model prediction result is also intrusion. FP represents the number of data pieces whose real data type is intrusion but whose model prediction result is normal. FN represents the number of data pieces whose real data type is normal but whose model prediction result is intrusion.
Of course, the raw counts alone are not sufficient as a standard for evaluating the experimental results. Therefore, a more reasonable set of evaluation standards is established based on the above parameters: the accuracy rate (AC), detection rate (DR), precision rate (PR), and false alarm rate (FAR).
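The defining equations for these four rates are not reproduced in this text. As a sketch, the code below assumes the conventional confusion-matrix definitions, AC = (TP + TN) / (TP + TN + FP + FN), DR = TP / (TP + FN), PR = TP / (TP + FP), and FAR = FP / (FP + TN). These formulas are an assumption and may differ from the paper's own equations; the counts in the example are made up.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Conventional rates from the four confusion-matrix counts (assumed definitions)."""
    return {
        "AC": (tp + tn) / (tp + tn + fp + fn),  # share of all samples classified correctly
        "DR": tp / (tp + fn),                   # share of TP-class samples recovered
        "PR": tp / (tp + fp),                   # share of positive predictions that are right
        "FAR": fp / (fp + tn),                  # share of the other class wrongly flagged
    }

# Made-up counts for illustration only
metrics = evaluation_metrics(tp=80, tn=90, fp=10, fn=20)
```

With these counts the accuracy is 0.85, the detection rate 0.8, and the false alarm rate 0.1.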

Parameter Setting.
After preprocessing the data, including normalization and one-hot encoding, all data values lie in the interval [0, 1]. After discretizing each dimension of the data with 0.5 as the threshold, we use DT to perform the first screening on all training data; PCA dimensionality reduction is then carried out before the secondary screening. The main parameters used are shown in Tables 6-8. Because the designed system is linear, the optimal value of each parameter can be obtained one at a time by fixing the other parameters.
For criterion (the attribute splitting criterion), the value is a string; there are two criteria to choose from, "gini" and "entropy." For splitter (the split-point strategy), the value is a string; there are two options, "best" and "random," where "best" means finding the optimal split point among all features and "random" means finding the optimal split point among a randomly selected subset of the features. max_depth (the maximum depth of the constructed decision tree) can be an integer or None. max_features is the number of features to consider when looking for the best split. random_state (the state used to generate random numbers) can be an integer, a RandomState instance, or None. Experiments have shown that the best effect is achieved when the value is 392.
n_components (the feature dimension after dimensionality reduction) can be the number of dimensions to keep or the fraction of information to retain; whiten (whether to whiten) reduces the correlation between features and gives all features the same variance; svd_solver (the singular value decomposition solver) is a string, and when its value is "auto" and certain conditions are met, the full singular value decomposition is called.
hidden_layer_sizes (the hidden layer sizes) is a tuple. The number of hidden layers and the number of neurons in each hidden layer are determined by adjusting this value. Two hidden layers are used here, with 140 neurons in the first layer and 70 neurons in the second layer. activation is the activation function, and solver (the weight optimization function) is chosen by passing the string that names the corresponding weight optimization function.
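The parameter values that are stated in the text can be collected as keyword arguments for the corresponding sklearn estimators. This is a sketch under assumptions: the value 392 is read here as the decision tree's random_state, and the values listed only in Tables 6-8 (such as criterion, max_depth, and n_components) are left at library defaults because they are not given in the text.

```python
# Values stated in the text; everything else is left at sklearn defaults.
DT_PARAMS = {
    "random_state": 392,  # assumption: the "best effect" value refers to random_state
}
PCA_PARAMS = {
    # n_components, whiten and svd_solver appear only in the paper's tables,
    # so no values are filled in here.
}
DNN_PARAMS = {
    "hidden_layer_sizes": (140, 70),  # two hidden layers: 140 and 70 neurons
    "activation": "relu",
    "solver": "adam",
}

def build_models():
    """Instantiate the three estimators; requires scikit-learn to be installed."""
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.decomposition import PCA
    from sklearn.neural_network import MLPClassifier
    return (
        DecisionTreeClassifier(**DT_PARAMS),
        PCA(**PCA_PARAMS),
        MLPClassifier(**DNN_PARAMS),
    )
```

Keeping the settings in plain dictionaries makes it straightforward to vary one parameter at a time while fixing the others, as the text describes.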

Experimental Results.
The experiment runs on a 64-bit Windows 10 operating system with an Intel Core i7-9750H CPU @ 2.60 GHz and 16.0 GB of physical memory; the development language is Python 3.5, and the software package used is sklearn.

Experiment 1.
This experiment mainly studies the detection time of binary classification, that is, the real-time problem of detection under binary classification.
This experiment primarily compares the two-class prediction accuracy and training time of the FC [13], DT, PCA-DNN, EDF [20], CNN [26], and DT-PCA-DNN models. To reflect the characteristics of DT-PCA-DNN, the test data used are all the data in the NSL-KDD test dataset. For the convenience of observation, Figure 7 is obtained from Table 9. Because the FC training time is so long that it would distort the vertical axis range, it is not shown in the figure.
Figure 7 shows that the accuracy AC of PCA-DNN and FC is the same, but the training time of FC is much higher than that of PCA-DNN, its prediction time is slightly longer, and its real-time detection is poor. The training time of the EDF algorithm is marginally longer than that of PCA-DNN, and its accuracy AC is the same as that of PCA-DNN. The training time of the CNN algorithm is as long as 90 s, and its accuracy is slightly lower than that of EDF and PCA-DNN, so it is inferior to both. Although DT trains extremely fast, its accuracy is four percentage points lower than that of EDF and PCA-DNN.
Compared with PCA-DNN without DT, DT-PCA-DNN takes 1.32 s longer to train on 125,973 records and about 10 ms longer to predict 22,544 records, but its accuracy AC improves by nearly ten percentage points. The introduction of DT has minimal impact on training and prediction time but significantly improves prediction accuracy.

Experiment 2.
This experiment mainly studies the five-category detection time of the DT-PCA-DNN model, that is, the real-time detection problem under five categories. Normal samples are marked as 0, DoS samples as 1, probe samples as 2, U2R samples as 3, and R2L samples as 4. The analysis in Table 10 shows that the speed advantage of DT-PCA-DNN in five-category classification is clear, although its overall accuracy is slightly inferior to that of EDF and CNN. DT and PCA-DNN have relatively short training times, but their overall accuracy is low and their performance is poor. Comparing PCA-DNN and DT-PCA-DNN in the five-category experiment, the training time of DT-PCA-DNN is 3 s longer while the accuracy rate improves by six percentage points, which shows that introducing DT costs little time while safeguarding accuracy. The analysis of Table 11 and Figures 8-10 shows that DT-PCA-DNN may have processed part of the data during DT prescreening, resulting in no U2R results (the U2R sample size is small). The advantages of DT-PCA-DNN are mainly reflected in its ability to recognize R2L, but because R2L makes up a small proportion of the dataset, the overall accuracy rate is lower than that of EDF and CNN. At the same time, the model has a relatively high false alarm rate for normal data when the detection rate is relatively high. This is a problem that needs to be pointed out.
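The five-category labeling and a per-class detection rate can be sketched as follows. The label sequences are made up for illustration and do not correspond to the results in Tables 10 and 11; the function name is hypothetical.

```python
# Label ids as assigned in Experiment 2
CLASS_IDS = {"normal": 0, "DoS": 1, "probe": 2, "U2R": 3, "R2L": 4}

def per_class_detection_rate(y_true, y_pred):
    """Fraction of each class's samples that were predicted as that class."""
    rates = {}
    for name, cid in CLASS_IDS.items():
        total = sum(1 for t in y_true if t == cid)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cid and p == cid)
        rates[name] = hits / total if total else None  # None: class absent
    return rates

# Made-up labels: the single U2R sample is missed entirely
y_true = [0, 0, 1, 1, 2, 3, 4, 4]
y_pred = [0, 1, 1, 1, 2, 0, 4, 0]
rates = per_class_detection_rate(y_true, y_pred)
```

A rare class such as U2R can drop to a rate of zero after only a handful of misses, while barely affecting the overall accuracy, which is consistent with the pattern discussed above.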

Conclusion
The DT-PCA-DNN intrusion detection model described in this study greatly improves training and detection speed while preserving accuracy. The model employs DT to perform a preliminary screening of the preprocessed data to be detected, then applies PCA and feeds the result to the DNN for a secondary judgment. The addition of DT causes a small increase in training time but a large boost in accuracy. At the same time, DT prescreening reduces the subsequent DNN workload, which has a considerable effect on overall training speed. Future work will focus mainly on overcoming the DT-PCA-DNN model's high false alarm rate for normal data in the five-classification experiment while also improving the model's five-classification capability [27].
Data Availability
The data used to support the findings of this study are available from the author Kusum Yadav upon request (kusumasyadav0@gmail.com).