Improving the Accuracy of Network Intrusion Detection with Causal Machine Learning

Abstract. In recent years, machine learning (ML) algorithms have proven effective in intrusion detection. However, because ML algorithms are mainly applied to evaluate network anomalies, the detection accuracy for cyberattacks of multiple types cannot be fully guaranteed. Existing algorithms for network intrusion detection based on ML or feature selection rely on spurious correlations between features and cyberattacks, causing several misclassifications. To tackle these problems, this research aimed to establish a novel network intrusion detection system (NIDS) based on causal ML. The proposed system starts by identifying noisy features through causal intervention, preserving only the features that have a causal relationship with cyberattacks. Then, an ML algorithm makes a preliminary classification to select the most relevant types of cyberattacks. Finally, the unique labeled cyberattack is detected by the counterfactual detection algorithm. In addition to a relatively stable accuracy, the complexity of cyberattack detection is also effectively reduced, with a reduction of up to 94% in the size of the training features. Moreover, when several types of cyberattacks are present, the detection accuracy is significantly improved compared with previous ML algorithms.


Introduction
Cyberattacks [1] refer to offensive actions to alter, disrupt, deceive, degrade, or destroy computer systems, networks, information, or programs in these systems. In recent years, the high frequency of cyberattacks has posed severe threats to network security and even national security, leading to a significant decline in network performance and service interruptions. Hence, a great number of protection mechanisms [2,3] have been proposed and deployed, such as firewalls, antiviruses, and malware detection software. However, these countermeasures have proven insufficient to provide complete protection against cyberattacks in modern network environments.
Although firewalls can provide rule-based network protection, more intelligent mechanisms are required to detect advanced network intrusions in high volumes of traffic data. To this end, several network intrusion detection systems (NIDSs) [4][5][6] have been designed using ML methods.
A NIDS can provide real-time data on network traffic and send out an instant alarm or block suspicious activities if a network attack is detected. ML methods are widely utilized in NIDSs to detect network anomalies, mainly by extracting features from traffic data.
Although ML-based NIDSs have shown to be robust in real-time traffic monitoring, their accuracy and efficacy are still compromised by imprecise features, which depend greatly on human experience. Meanwhile, a fixed feature set may not be appropriate for detecting different types of network intrusions, as some features may be redundant or unrelated, which may slow down the ML process. Therefore, it is essential to explore the best features [7] to increase the accuracy of a detection system.
To overcome the abovementioned barriers, the application of causal ML methods in NIDSs is proposed in this paper. Traffic features can be classified into two classes: causal features and noisy features. Causal features are those features that have causal relationships with a network intrusion. That is, these features are caused by cyberattacks: when cyberattacks are launched, these features become abnormal, and when the cyberattacks stop, these features return to normal. Traditional distributed denial-of-service (DDoS) attacks exhaust the bandwidth, central processing unit (CPU) power, or memory of the victim host by flooding an overwhelming number of packets from thousands of compromised computers (zombies) to deny legitimate flows. The most frequent DDoS attacks mainly consist of flooding with a huge volume of traffic data and consuming network resources, such as bandwidth, buffer space at the routers, CPU power, and recovery cycles of the target server. Noisy features have no causal relationship with a cyberattack, although they may have a statistics-based correlation [8]. Noisy features can degrade detection performance because they may disrupt a detection system in real deployment.
To distinguish noisy features from causal features in NIDSs, we present two causal ML methods for NIDSs, including causal intervention and counterfactual reasoning.
The main contributions of this paper include:
(i) We propose a novel causal ML-based NIDS. By establishing a causal link between cyberattacks and traffic features through causal intervention, noisy features can be identified and removed.
(ii) A counterfactual detection algorithm based on the Bayesian Network (BN) is developed to classify cyberattacks based on causal features.
(iii) The performance of the causal ML-based NIDS is evaluated using the CICIDS19, UNSW-NB15, and NSL-KDD datasets. The experimental results confirm the effectiveness of the proposed approach.
This paper is organized as follows. Section 2 provides a brief discussion of the existing relevant studies on NIDSs and their limitations, as well as a summary of the contributions of this study. Section 3 presents a detailed discussion of the theories and governing equations of the different deployment techniques. Section 4 presents the novel causal ML-based NIDS. Section 5 discusses the experimental results. Section 6 summarizes the main achievements of this research.

Literature Review
As one of the important areas in computer science and network security, intrusion detection based on ML [9][10][11] is a research hotspot, and numerous scholars [12][13][14][15] have already carried out a variety of explorations on this topic. Tang et al. [16] established a deep neural network model for NIDSs, trained on the NSL-KDD dataset. Their model showed robustness in detecting flow-based anomalies in software-defined networking (SDN). Daya et al. [17] proposed BotChase, a two-phased graph-based bot detection system leveraging both unsupervised and supervised ML. The first phase pruned presumably benign hosts, while the second phase achieved bot detection with high precision. The literature [18] on the NSL-KDD dataset proposed an adaptive ensemble learning model to develop a multitree algorithm with an accuracy of 84.2%.
As reported previously, optimization of the size of training features is worthy of investigation. Importantly, irrelevant features in a dataset can undermine the accuracy of a model and increase the training time required to establish it. Thus, to determine the optimum training size, numerous explorations have been conducted. Feature selection [11,19-22], the process of selecting the most relevant features manually or by algorithms, has been used to reduce the time and space complexity of model construction. Hadeel et al. [23] proposed a wrapper feature selection algorithm for intrusion detection. This method uses a dove-inspired optimizer to implement the feature selection, and the binarizing algorithm of the proposed cosine similarity method showed a faster convergence speed and a higher accuracy than the sigmoid method. Another study [24] developed a feature selection model combining the ID3 classifier algorithm and the BEES algorithm, in which the BEES algorithm was used to generate the desired feature subset. Chung and Wahid [25] introduced a new simplified version of particle swarm optimization for feature selection, constituting a local search strategy to speed up the feature selection process by finding the optimal neighborhood solution.
The algorithm could reduce the features used to represent network traffic behavior in the KDDCUP99 dataset from 41 to only 6, and the accuracy reached 93.3%. However, the methods mentioned above can only select features based on relevance, and some noisy features may affect the detection accuracy.
In addition to the size of training features, the correct classification of cyberattacks is also of great importance in the existing studies. The existing algorithms for NIDSs based on ML or feature selection all rely on the correlation between features and cyberattacks to realize the classification.
The POM provides the causal effects [36] through mathematical definitions. However, conducting randomized trials [37] with both SCM and POM is expensive, time-consuming, and sometimes unethical. Additionally, its accuracy is low, owing to insufficient consideration of the influences of exogenous variables (variables outside the cyberattack model, which affect the model but are not affected by it) [26] and noisy factors on the causal features.
Based on the deficiencies of the abovementioned algorithms, this paper starts from the decoupling of the correlation of features and the classification of types of cyberattacks under counterfactual scenarios to achieve a high accuracy in the detection of cyberattacks. The counterfactual model is based on the BN, which can model relationships among hundreds of cyberattacks and features. Firstly, the correlation of features is decoupled through causal intervention, and noisy features that do not affect the detection outcome are deleted. Secondly, based on the retained causal features, the most relevant types of labels are selected, and then the counterfactual detection algorithm is implemented to find the unique label. For instance, given evidence ε = e and some hypothetical interventions, the likelihood that a different outcome ε = e' would have been observed is calculated through the counterfactual detection algorithm. Then, the expected number of anomalous features is calculated to identify the cyberattack with the highest likelihood in the counterfactual scenario [26].

Preliminaries
In this section, we present a brief introduction to causal reasoning.

Strong Spurious Correlations.
Traditional ML is driven by association, which makes it difficult to achieve consistent predictions on unknown test datasets. In association mining, traditional ML will find noncausal (noisy) features, such as the relationship between risk factors and abnormal features, and such strong spurious correlations will be used for prediction.
For example, in Figure 1, risk factor R causes DDoS attacks, for instance, X1, X2, and X3, and X1 and X2 cause abnormalities in traffic features Y1 and Y2. If X1 and X2 have not been observed or counted in the prior data, risk factor R will inevitably lead to the joint appearance of X3, Y1, and Y2. If the calculation is based only on a correlation algorithm, the conclusion that X3 is the cause of Y1 and Y2 may be completely wrong.
A classic New England Journal of Medicine paper on chocolate and the Nobel Prize [38] explains such strong spurious correlations. According to the paper, the more chocolate a country consumes, the more Nobel Prizes it will win. This conclusion seems absurd at first glance, but what is wrong with a conclusion based on relevant facts? Statistical analysis of the data shows that there is indeed a linear relationship between a country's chocolate sales and the number of Nobel Prizes it has won. However, causal analysis indicates that there is only a strong spurious correlation between chocolate sales and the number of Nobel Prizes.

Definitions.
It is supposed that Y = {C, V} is the traffic feature set, where C is the causal feature set and V indicates the noisy feature set (V = Y\C). X ∈ {0, ..., N} represents a network attack.
As noisy features have no causal relationships with network intrusions, the conditional probability P(X|Y) satisfies the following condition [8]: P(X|Y) = P(X|C, V) = P(X|C). Although there is no causality between X and V, they may show a strong correlation in the statistical data (Figure 2(b)). If this spurious relationship is not distinguished from causation, it may lead to errors on real-world data distributions, even if the ML model is trained well.
To define causality: if, with all other conditions unchanged, changing X causes a change in Y, then there is a causality between X and Y. If X and Y can be measured, then the causal relationship between X and Y can be calculated by changing the value of X and observing Y. If the magnitude of the causal relationship between X1 and Y is stronger than that between X2 and Y, it is considered that X1 causes Y.
In general, cyberattacks cause the anomaly of data traffic features, as shown in Figure 3. For the sake of simpler analysis, exogenous variables are ignored. As mentioned earlier, if other conditions remain unchanged, a change of X leads to a change of {Y1, Y2, ..., Yn}, which indicates that there is a causal relationship between X and {Y1, Y2, ..., Yn}; that is, X is the cause, and {Y1, Y2, ..., Yn} is the effect.

SCM.
The detection models that will be used in our experiments are BN models, which show the relationships between cyberattacks, risk factors, and traffic features. BNs are an increasingly popular modelling technique in cybersecurity [39], especially due to their capability to overcome data constraints (when it is impossible to learn causality between variables from data alone). In BNs, probability is interpreted as a degree of confidence. As shown in Figure 4, in the 3-layer BN model, the traffic features are influenced by corresponding cyberattacks, where Z is the risk factor of the network being attacked, X denotes the type of cyberattack, and Y represents the traffic features. In the noisy-OR model, Y = X1 ∨ X2 ∨ ... ∨ Xn, and as long as there is an attack type Xi = 1, then Y = 1. This pattern (Figure 4) can be extended to a more complex network model with more layers.
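As an illustration, the noisy-OR combination above can be sketched in a few lines of Python; the link strengths `lam` are hypothetical values, not taken from the paper:

```python
# Hypothetical sketch of the noisy-OR link used in the 3-layer BN:
# each active attack type X_i (x_i = 1) independently fails to trigger
# feature Y with probability (1 - lam_i).
def noisy_or(x, lam):
    """P(Y = 1 | X = x) under the noisy-OR model."""
    p_all_fail = 1.0
    for xi, li in zip(x, lam):
        if xi:                      # only active causes contribute
            p_all_fail *= (1.0 - li)
    return 1.0 - p_all_fail

# With attacks 1 and 3 active: 1 - (1 - 0.9) * (1 - 0.5) = 0.95
p = noisy_or([1, 0, 1], [0.9, 0.8, 0.5])
```

If any active attack has link strength 1, the feature is anomalous with certainty; if no attack is active, the feature stays normal, matching Y = X1 ∨ X2 ∨ ... ∨ Xn in the deterministic limit.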
In causal inference, the BN is replaced by the more fundamental SCM. Existing BNs can be expressed as an SCM [40,41]. This SCM consists of three components [42]: a graphical model, a structural equation, and a counterfactual and intervention logic.
The key characteristic of SCMs is that they represent each variable as a deterministic function of its direct causes together with an unobserved exogenous "noise" term, which itself represents all causes outside of our model. For example, in a network without cyberattacks, some traffic features may be abnormal, which is due to unobserved exogenous variables. If an unobserved exogenous variable u = {u1, u2, ..., un} is specified, the causal Markov condition [26,42,43] will be satisfied (for a complete set of random variables UR and a given variable X ∈ UR, the minimal variable set MB that renders X independent of all other variables given MB is a Markov blanket of X).
Figure 1: Spurious correlation features.

Security and Communication Networks
Assumption 1. It is assumed that the observed variables are Y = {Y1, Y2, ..., Yn} in the SCM of the directed acyclic graph [42]; the parents of Y are denoted pa(Y); thus, Y = f(pa(Y), u) can be achieved. For each variable Y, the parent variable X (i.e., X = pa(Y)) in the model has a noise term uy with an unknown distribution P(uy), such that Y = f(X, uy), with uy ∼ P(uy).
Assumption 2. In the noisy-OR model [39], it is assumed that the probability that any variable Y behaves as normal (Y = 0) despite a network attack, owing to noisy variables, is P(Y = 0 | X1, ..., Xn) = ∏ over i with Xi = 1 of (1 − λi), where λi is the probability that attack Xi alone makes Y anomalous. It is assumed that the variables Y are independent of each other, and then P(Y1 = 0, ..., Ym = 0 | X) = ∏j P(Yj = 0 | X). For instance, if the network devices are installed with antivirus software or firewalls, some traffic features may not produce abnormalities.

Causal Intervention.
The causal detection problem (magnitude of the causality, feature selection, unobserved exogenous variables, and noisy variables) can be addressed by a causal intervention called the "do-operation." The postintervention distribution resulting from the action do(Y = y) is given by equation (4) [40]:

P(X = x | do(Y = y)) = Pm(X = x | Y = y).   (4)

The do-operator of causal intervention signifies that we are dealing with an intervention rather than a passive observation. The subscript m is used to represent the modified probability distribution. From the perspective of probability distributions, P(X = x | Y = y) represents the probability of X = x among all the cases in which Y = y, whereas P(X = x | do(Y = y)) represents the probability that X = x when all Y are fixed to y. Intervention changes the distribution of the original data, while conditioning does not change the distribution of the original data [26]. Counterfactuals enable us to quantify how well a cyberattack (i.e., X = 1) explains anomalous features by determining the likelihood that the features would not be present under an intervention switching off the cyberattack, do(X = 0), as given by the counterfactual probability P(Y = 0 | Y = 1, do(X = 0)). If this probability is high, X = 1 is a good causal explanation of the anomalous features. It should be noted that this probability refers to two contradictory states of Y, and thus, it cannot be represented as a standard posterior probability.
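The difference between conditioning and the do-operation can be made concrete with a toy sketch. Here we assume, purely for illustration, a model in which the risk factor Z is a common cause of both the feature Y and the attack indicator X, so that P(X | do(Y = y)) can be computed by the standard adjustment formula; all joint-probability values are invented:

```python
# Illustrative joint distribution P(z, x, y) for a toy model where the
# risk factor Z confounds feature Y and attack X (made-up values).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.05, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def p(pred):
    """Probability of the event described by pred(z, x, y)."""
    return sum(v for k, v in joint.items() if pred(*k))

def cond(x, y):
    """Passive observation: P(X = x | Y = y)."""
    return p(lambda z, xx, yy: xx == x and yy == y) / p(lambda z, xx, yy: yy == y)

def do(x, y):
    """Intervention via the adjustment formula:
    P(X = x | do(Y = y)) = sum_z P(X = x | Y = y, Z = z) * P(Z = z)."""
    total = 0.0
    for z in (0, 1):
        pz = p(lambda zz, xx, yy: zz == z)
        pxyz = p(lambda zz, xx, yy: zz == z and xx == x and yy == y)
        pyz = p(lambda zz, xx, yy: zz == z and yy == y)
        total += (pxyz / pyz) * pz
    return total
```

Here `cond(1, 1)` and `do(1, 1)` differ because conditioning lets evidence flow back through Z, while the do-operation cuts the influence of Z on Y, exactly the distinction drawn above.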

Counterfactual Detection
The principles for counterfactual detection of cyberattacks are as follows [26,37]:
(1) The likelihood that a cyberattack causes an anomalous feature should be proportional to the posterior likelihood of that attack.
(2) A cyberattack X that cannot cause an anomalous feature cannot constitute a causality between features and attacks.
(3) A type of cyberattack that causes a greater number of anomalous features should be more likely to have a causality with these features.

A Novel Causal ML-Based NIDS
In this section, the causal ML-based NIDS (CMLN) framework and its time complexity will be introduced.

Framework.
This study aims to develop a novel causal ML-based NIDS. As illustrated in Figure 5, the proposed framework is divided into four main stages. The first stage is data preprocessing, consisting of Z-score, Min-Max, and deletion of the incorrect and fuzzy row datasets. The purpose of this step is to improve the performance of the training model and reduce the class imbalance problem [26] that often appears in network traffic data. Hence, data should be initially encoded with the Z-score to transform any categorical features into numerical ones. Then, because the value of a normal feature is equal to 0 and that of an anomalous feature is a positive integer [37,40] in causal reasoning, each feature needs to be normalized to a natural number. Finally, incorrect and fuzzy row datasets should be removed to reduce the size of the training dataset and improve the accuracy on the validation dataset.
The second stage of the framework is feature selection, which reduces the number of features required by the ML models and the counterfactual detection algorithm. Firstly, although the noisy features may have a correlation with the causal features, they have no causal effect on the classified outcomes. The causal relationship between the features and cyberattacks can be identified through causal intervention. Then, the noisy features are deleted, and only a few features are retained. This not only reduces the time required for the model classification but also reduces the time required for training without sacrificing other functions.
Two uncorrelated variables have no causal relationship, whereas two correlated variables do not necessarily have one. ML algorithms are involved in the third stage of the framework to select several classes of labels. The labels with the largest correlation are selected as the reference labels of the fourth stage, which also reduces the complexity of the counterfactual detection algorithm. Therefore, the counterfactual detection algorithm only needs to calculate the expected anomalous features of the K cyberattacks, rather than those of all M cyberattacks (K comprises the reference labels selected by the ML algorithm, and M covers all labeled cyberattacks).
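Stage 3 can be sketched as follows; the classifier scores and attack names are hypothetical stand-ins for the output of any probabilistic ML classifier:

```python
# Sketch of stage 3: an ML classifier scores every labeled attack type,
# and only the K highest-scoring labels are handed to the counterfactual
# detector (class names and scores are illustrative).
def top_k_labels(scores, k):
    """scores: dict label -> classifier probability; returns the K best labels."""
    return [lbl for lbl, _ in sorted(scores.items(),
                                     key=lambda kv: kv[1], reverse=True)[:k]]

scores = {"DDoS": 0.61, "Exploits": 0.22, "Worms": 0.02, "Fuzzers": 0.11}
candidates = top_k_labels(scores, k=2)   # ["DDoS", "Exploits"]
```

With K ≪ M, the counterfactual stage evaluates expected anomalous features for only these candidates, which is the complexity reduction described above.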
In the fourth stage, according to the causality, it can be determined whether the results of the counterfactual detection algorithm will change when certain preconditions change, which then provides the basis for the counterfactual judgment according to the magnitude of the causal effect. Given the evidence ε = e, an intervention switches off all cyberattacks except Xa in the counterfactual. Next, the number of expected anomalous features E(Xk, ε) is calculated (Xa belongs to Xk, and Xk comprises the reference labels selected by the ML algorithm). Finally, the Xk with the largest value of E(Xk, ε) is the most likely cyberattack.
Through the joint action of these four stages, the causal ML-based NIDS can ensure a high accuracy in the detection of anomalous features as the number of types of cyberattacks increases.

Data Preprocessing.
The data preprocessing stage covers data normalization using the Z-score, positive integerization using Min-Max normalization, and deletion of the incorrect and fuzzy row datasets. Standardization [44,45] of the data is initially carried out. The most common standardization method is Z-score standardization, also known as standard deviation standardization. The main purpose of the Z-score is to transform features of different magnitudes onto the same scale and to measure the features with the calculated Z-score value to ensure their comparability. This method uses the mean and standard deviation of the original data to conduct data standardization. The processed data conform to the standard normal distribution, that is, the mean value is 0 and the standard deviation is 1, and the transformation function is

Z = (Yinst − U) / δ,

where Yinst is the initialized feature value, U denotes the mean feature vector, and δ is the standard deviation.
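A minimal sketch of the Z-score transformation described above (the population standard deviation is assumed):

```python
import math

# Z-score standardization: transform a feature column to zero mean and
# unit standard deviation, as described in the text.
def z_score(values):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# Example column with mean 5 and standard deviation 2.
z = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After the transform, the column has mean 0 and standard deviation 1, so features of different magnitudes become comparable.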

Min-Max Normalization.
Min-Max normalization [46], also known as deviation normalization, is a linear transformation of the original data, with max being the maximum and min the minimum of the sample data. In the counterfactual detection algorithm, the value of a normal feature is 0 and that of an anomalous feature is a positive integer; thus, each feature needs to be normalized to a natural number. Data normalization is a necessary step, in which each value is scaled to an appropriate range. This process helps eliminate large deviations in features:

ψij = round( N · (Yij − min(Yj)) / (max(Yj) − min(Yj)) ),

where ψij indicates the normalized value of Yij in integer form with the range 0 to N, min(Yj) represents the minimum value of the jth feature, and max(Yj) is the maximum value of the jth feature.
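A minimal sketch of this integer Min-Max normalization; the choice N = 10 is an arbitrary assumption for illustration:

```python
# Min-Max normalization to the integer range 0..N, matching the
# requirement that normal features map to 0 and anomalous ones to
# positive integers (N = 10 is an arbitrary illustrative choice).
def min_max_int(column, n=10):
    lo, hi = min(column), max(column)
    return [round(n * (v - lo) / (hi - lo)) for v in column]

psi = min_max_int([0.0, 2.5, 5.0, 10.0])
```

The minimum of the column always maps to 0 and the maximum to N, so every value lands on a natural number in [0, N] as required by the counterfactual detection algorithm.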

Removal of Incorrect and Fuzzy Row Sets.
There are rows in the intrusion detection dataset whose features contain empty values or whose label does not correspond to a normal attack category; such a row is an invalid or incorrect row set. Alternatively, a row of features may correspond to multiple types of cyberattacks (such as the features [0, 1, 1, 1] corresponding to two types of cyberattacks, DDoS and Exploits); such a row is a fuzzy row set [47]. The incorrect and fuzzy sets cannot be labeled by ML algorithms. Therefore, they need to be deleted in the data preprocessing stage, leaving only the subset in which the row features and the label have a one-to-one correspondence (e.g., the row of features [0, 1, 1, 1] uniquely corresponds to a DDoS), so as to improve the robustness of the causal ML-based NIDS.
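The removal of incorrect and fuzzy row sets can be sketched as follows; the data layout (feature tuple plus label) is an assumption for illustration:

```python
# Sketch of the cleaning step: drop rows with missing values or labels
# ("incorrect" rows) and rows whose feature vector is associated with
# more than one attack label ("fuzzy" rows). Data layout is illustrative.
def clean(rows):
    """rows: list of (features_tuple, label); label None marks an unlabeled row."""
    # Drop incorrect rows (missing label or missing feature value).
    rows = [(f, l) for f, l in rows
            if l is not None and all(v is not None for v in f)]
    # A feature vector mapped to several labels is fuzzy: drop all its rows.
    labels_per_feat = {}
    for f, l in rows:
        labels_per_feat.setdefault(f, set()).add(l)
    return [(f, l) for f, l in rows if len(labels_per_feat[f]) == 1]

data = [((0, 1, 1, 1), "DDoS"), ((0, 1, 1, 1), "Exploits"),  # fuzzy pair
        ((1, 0, 0, 1), "DDoS"), ((1, 1, None, 0), "Worms"),  # row with a gap
        ((0, 0, 1, 0), None)]                                # unlabeled row
cleaned = clean(data)   # only the one-to-one row survives
```

Only rows whose feature vector maps to exactly one label remain, which is the one-to-one correspondence required before training.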

Feature Selection.
If some features are irrelevant to the cyberattacks and have no causal effect on the classified outcomes [26], these features are noisy features. Normally, manual matching of features can be used directly to eliminate the impact of noisy features on the classified outcomes. However, when training with ML algorithms, a classifier will constantly fit these features, leading to a spurious correlation between noisy features and cyberattacks and ultimately impairing the performance of the classifier. This stage mainly involves the causal effect on each feature, and a calculation is carried out to assess these effects. Consequently, the noisy features are distinguished and deleted based on the causal effects, so that the best combination of causality-based features can be obtained.

Identification of Noisy Features.
As shown in Figure 6, there are various relationships between cyberattack X and feature Y under the general fact. If the causal relationship and direction between these two parameters are not clarified, the judgment of the type of cyberattack may be influenced. As displayed in Figure 6(b), it is assumed that Yi and Yj have a mutually causal relationship, and the anomaly of one feature will lead to the anomaly of the other. Therefore, a wrong conclusion may be drawn if the anomalous feature Yj is considered to be caused by the cyberattack X.
According to this hypothesis, the reversal of the factual causal direction between cyberattack X and feature Y is illustrated in Figure 6(c). Therefore, feature Y can be intervened on, and the causal relationship between Y and X can be worked out according to the changes in the expected value of X, which is formulated as equation (7) [48]. If the conditions between Y and X satisfy the following rules, respectively, equation (7) can be written as equations (8)-(15) [43].
Proof. In the statistical model, the joint distribution is calculated by the chain rule:

P(x1, x2, ..., xn) = ∏i P(xi | x1, ..., xi−1).

According to the Markov blanket [26,43], in a directed acyclic graph, given its parent nodes, a variable is independent of its nondescendants. Hence, the abovementioned formula can be abbreviated as

P(x1, x2, ..., xn) = ∏i P(xi | Pa(xi)),

where Pa(xi) represents the parent nodes of xi. This formula also represents a BN. As depicted in Figure 6(c), it can be simplified as follows:

P(x, yi, yj) = P(x | yi, yj) P(yi | yj) P(yj | yi).
According to the truncated factorization,

P(x, yi | do(yj)) = P(yi) P(x | yi, yj).
Marginalizing over yi:

P(x | do(yj)) = Σyi P(x | yi, yj) P(yi).
The causal effect [49] can be calculated by the measure E of X and Y.
Definition 2 (noisy features). As for noncausal features, if E/N (N is the size of the training dataset) is less than the threshold δ (δ ≤ 0.01), there is no causal relationship [50] between X and Y. Thus, these features can be considered noisy features, and they should be deleted from the dataset.
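Definition 2 can be sketched as follows; estimating E as the change in the interventional mean of X, scaled by the sample count, is a simplifying assumption for illustration rather than the paper's exact formula:

```python
# Sketch of Definition 2: estimate the causal-effect measure E for a
# feature as the (absolute) change in the mean attack indicator under
# intervention on the feature, scaled by the number of samples, and flag
# the feature as noisy when E / N falls below the threshold delta.
def causal_effect(samples):
    """samples: list of (y, x) pairs gathered after intervening on feature y."""
    x_when_on  = [x for y, x in samples if y == 1]
    x_when_off = [x for y, x in samples if y == 0]
    e1 = sum(x_when_on) / len(x_when_on)
    e0 = sum(x_when_off) / len(x_when_off)
    return abs(e1 - e0) * len(samples)     # scale so E / N compares to delta

def is_noisy(samples, delta=0.01):
    return causal_effect(samples) / len(samples) < delta

# A feature whose interventions never move X is noisy:
flat = [(0, 1), (1, 1), (0, 0), (1, 0)] * 5
```

Here `flat` leaves the mean of X at 0.5 whether the feature is switched on or off, so E = 0 and the feature is flagged as noisy.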

Removal of Noisy Features.
The causal interventions are performed for all features, as shown in Figure 7. In the process of feature selection, only those features that have a causal relationship with the labeled attacks are selected. As illustrated in Figure 7, the correlation between features is hidden.
If there is no causal relationship between {Y1, Y3, ..., Yn−1} and X or the other features, equation (15) can be transformed into equation (17) as follows. If equation (17) holds, then the causal relationship can be recovered based on the factual causal direction between cyberattacks and anomalous features, as shown in Figure 8.
As displayed in Figure 9, features y1, y3, and yn−1 can be deleted during data preprocessing according to the abovementioned method, and the causality is simplified to the retained feature matrix

[ y12, y13, ..., y1n
  ⋮
  yk1, yk2, yk3, ..., ykn ].

The Process of Feature Selection.
Based on the above method, all noisy features satisfying Definition 2 are deleted. Only the causal features are retained, and the selection process is shown in Algorithm 1.
Figure 6: The simplified illustration of the influences of features on cyberattacks.

Classification of Cyberattacks.
Although the causality is simplified after feature selection, as shown in Figure 9, there is still a many-to-many relationship between cyberattacks and traffic features. The key of the counterfactual detection algorithm is how to choose the most appropriate labeled attacks to explain the causality of the features. According to causal inference, it can be assumed that the possibility of changes in the results of the counterfactual detection is associated with certain changes in preconditions; thus, the basis for the causality judgment can be provided according to the magnitude of the causality. For instance, in order to quantify the causality of anomalous features caused by a cyberattack in a NIDS, counterfactual detection can be used for inference. As illustrated in Figure 10, the left is the fact graph, and the right is the counterfactual graph. All variables with apostrophes in the counterfactual conditions are equal to the variables without apostrophes in the fact conditions. It is assumed that, under the condition of a given evidence ε = e and an intervention that sets X to the value of 0, the counterfactual likelihood can be calculated as P(ε = e' | ε = e, do(X = 0)). Therefore, through a counterfactual query, a formal language can be provided to quantify the probability of a counterfactual anomalous feature e' = 1 when it is assumed that the attack X = 0.
ALGORITHM 1: Causal reasoning-based feature selection (CRFS), fragment:
... for j from i to N + i − 1: delete the (j mod N)th feature
(14) if len(Count) < len(cun[i])
(15) then Count = cun[i]
(16) end if
(17) end for
(18) for i from 0 to len(Count)
(19) delete all noisy features in the cun[i] collection
(20) end for
(21) output the causal feature set C
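The intent of Algorithm 1 can be sketched as follows, assuming the causal-effect scores E have already been computed by the intervention procedure above; the feature names and scores are invented:

```python
# A cleaned-up sketch of the CRFS idea: score every feature by its
# causal effect on the label and keep only those above the threshold.
# The effect() scores are supplied by the caller, standing in for the
# causal-intervention estimate described in the text.
def crfs(features, effect, n_samples, delta=0.01):
    """features: list of names; effect: name -> causal-effect measure E."""
    causal = [f for f in features if effect(f) / n_samples >= delta]
    noisy = [f for f in features if f not in causal]
    return causal, noisy

effects = {"pkt_rate": 120.0, "src_port": 0.3, "flow_dur": 45.0}
causal, noisy = crfs(list(effects), effects.get, n_samples=1000)
# causal -> ["pkt_rate", "flow_dur"], noisy -> ["src_port"]
```

Only the features whose scaled causal effect clears δ survive, matching Definition 2's deletion rule for noisy features.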

Definition 3 (expected sufficiency [26]). The expected sufficiency of cyberattack Xa is the expected number of anomalous features that would persist if an intervention switched off all other possible causes of the anomalous features:

E(Xa, ε) = E[ ΣY∈Y+ Y′ | ε, do(Pa(Y+)\Xa = 0) ],   (19)

where Xa denotes the type of cyberattack a, Y+ indicates the anomalous features under the fact conditions, Pa(Y+) denotes the parent nodes of Y+, representing all cyberattacks that may result in the anomalous feature Y, Pa(Y+)\Xa is the set of parent nodes of Y+ except Xa, Y′+ represents the anomalous features in the counterfactual situation, and ε denotes the set of all factual evidence features. If E(Xa, ε) is the maximum among all E(X, ε), the cyberattack type Xa is the causal explanation for the given evidence ε.
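Under a noisy-OR model, a simplified sketch of ranking attacks by expected sufficiency might look as follows; approximating the counterfactual persistence probability of each anomalous feature by the link strength `lam[attack][feature]` is a simplification of equation (19), and all numbers are invented:

```python
# Illustrative sketch of expected sufficiency: for each anomalous
# feature, the probability that it would remain anomalous after
# switching off every cause except X_a is approximated here by the
# noisy-OR link strength lam[attack][feature] (a simplification).
def expected_sufficiency(attack, anomalous, lam):
    """attack: candidate label; anomalous: names of features observed = 1;
    lam: lam[attack][feature] = P(feature anomalous | only attack active)."""
    return sum(lam[attack].get(y, 0.0) for y in anomalous)

lam = {"DDoS":     {"pkt_rate": 0.9, "syn_cnt": 0.8, "dns_len": 0.0},
       "Exploits": {"pkt_rate": 0.2, "syn_cnt": 0.1, "dns_len": 0.7}}
ev = ["pkt_rate", "syn_cnt"]
best = max(lam, key=lambda a: expected_sufficiency(a, ev, lam))  # "DDoS"
```

The attack whose links best explain the surviving anomalous evidence scores highest, which mirrors picking the maximal E(Xa, ε).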
Inference 1. According to equation (19) and the SCM [26,51], the expected sufficiency of cyberattack Xa is given by equation (20), where Y− denotes the normal features in the set of all factual evidence features. Solving for the noisy and exogenous variables is complicated and cumbersome, yet it is unnecessary in equation (20). At the same time, the value of L can be calculated from the prior data. Therefore, equation (20), obtained through counterfactual reasoning, greatly simplifies the causal relationship between cyberattacks and traffic features.

Time Complexity.
To determine the time complexity of the proposed causal ML-based NIDS, it is required to determine the complexity of each algorithm used in each stage. As the performance of different algorithms at different stages is compared, the overall time complexity is determined by the algorithm producing the highest complexity. It is assumed that the dataset is composed of M samples and N features. In general, M ≫ N.
Starting with the data preprocessing stage, the complexity of the Z-score and Min-Max normalization is O(N). Based on the aforementioned discussion, the overall complexity of the proposed framework is O(Ml * K * D). The time complexity of data preprocessing and feature selection is O(M + N²). As M ≫ N, this is approximately equal to O(M), which is far less than the O(M * N²) time complexity of feature-selection methods such as MOMBNF [9]. Finding the overall time complexity is highly critical because the model will often be retrained to learn new patterns of cyberattacks.

Experimental Setting.
The CICIDS19 dataset was launched in 2019 by the Canadian Institute for Cybersecurity, and it contains benign traffic and the most up-to-date common cyberattacks, resembling real-world data, with a total of 87 features [47]. This dataset contains 11 types of attacks: DRDOS_MSSQL, DRDOS_SNMP, SYN, DRDOS_NTP, TFTP, UDP-LAG, DRDOS_NETBIOS, DRDOS_DNS, DRDOS_UDP, DRDOS_LDAP, and DRDOS_SSDP. As shown in Table 1, it also includes network traffic features based on timestamps, source and target IPs, source and target ports, protocols, and attack token flows. The raw network packets of UNSW_NB15 [52] were created by the Australian Cyber Security Centre, and it is a comprehensive set of cyberattack traffic data. Compared with other datasets, these two datasets are more appropriate for research on NIDSs. The UNSW_NB15 dataset has nine types of cyberattacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. As presented in Table 2, tools such as Argus are used by UNSW-NB15 to generate a total of 49 features with similar labels.
NSL-KDD [53,54] contains 7 major categories of attacks, such as ipsweep, Neptune, nmap, portsweep, Satan, smurf, and teardrop. NSL-KDD's elimination of redundant records in the training set helps classifiers remain unbiased toward more frequent records. The training and test sets contain a reasonable number of instances, so NSL-KDD can serve as a valid benchmark dataset to help researchers compare different intrusion detection methods. As shown in Table 3, there are 41 dimensional features in NSL-KDD. The fuzzy logic system (FLS) [47] is used to evaluate the quality of realism of the CICIDS19, UNSW-NB15, and NSL-KDD datasets. The FLS is based on the Sugeno fuzzy model [55], which investigates the quality of realism of an IDS dataset. The CICIDS19, UNSW-NB15, and NSL-KDD datasets contain sets of network intrusion attacks that reflect real-world standards.
The generation process fully considers the characteristics of network intrusion attacks and the dynamics of the network.

Security and Communication Networks
In order to use a variety of algorithms more effectively, Python was used to implement our model. The hardware and software specifications are summarized in Table 4.

The Results of Experiments.
This section presents three sets of experiments to verify the effectiveness of the proposed causal ML-based NIDS.

Influences of Data Preprocessing on the Training Samples.
Concerning the effects of data preprocessing on the size of training samples, the learning curves of training accuracy and cross-validation accuracy with the change of the size of training samples can be obtained. Because the amount of data in the datasets is large enough, about 10% of the data works well as the test set, so a 90:10 split is used in this paper. After normalization, the two datasets are randomly divided into training and test datasets using this 90%/10% splitting criterion.
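The random 90:10 split described above can be sketched as follows (a minimal NumPy sketch; the function name and seed are illustrative assumptions, not part of the original work):

```python
import numpy as np

def split_90_10(X, y, seed=0):
    # Shuffle the sample indices, then take the first 90% for training
    # and the remaining 10% for testing.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.9 * len(X))
    train, test = idx[:cut], idx[cut:]
    return X[train], y[train], X[test], y[test]
```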
(1) Influences of Data Preprocessing on the Size of Training Samples. In this study, the Z-score, SMOTE [56-58], CFS [9, 59-61], and CRFS (causal reasoning-based feature selection) were used for comparison. The SMOTE algorithm is used to oversample the minority classes after data processing by the Z-score, and the CFS selects features after data processing by the SMOTE. For the CRFS method proposed in this paper, the causal reasoning-based feature selection presented in Section 4.3 is applied after data processing by the Z-score. The cross-validation curves of the different datasets under different types of cyberattacks after data processing by the four methods mentioned above are shown in Figures 11-12. Figure 11 compares the accuracy with the number of training samples required for the four methods (here, only one type of cyberattack is considered: all cyberattacks are treated as a single type named "abnormal"). As depicted in Figure 11, to converge the training accuracy and cross-validation accuracy, the number of training samples required for the Z-score and SMOTE was more than 16,000, and within 10,000 for the CFS; however, the number of training samples required for the CRFS was only about 5,000, significantly lower than that of the Z-score, SMOTE, and CFS, while ensuring the same training accuracy. The accuracy and the number of training samples required for the four methods when there were multiple types of cyberattacks are compared in Figure 12. As shown in Figure 12, in order to converge the training accuracy and cross-validation accuracy, the number of training samples required for the Z-score and SMOTE was close to 10,000.
The number of training samples required for the CFS was within 5,000, and the number required for the CRFS was close to 4,000, decreases of 60%, 60%, and 20% compared with the Z-score, SMOTE, and CFS, respectively. Meanwhile, the training accuracy was the highest, improving by about 10% over the highest training accuracy achieved by the SMOTE. As illustrated in Figures 11 and 12, with the increase of types of cyberattacks, the number of training samples required for the Z-score, SMOTE, and CFS significantly increased, while the training accuracy noticeably decreased. The number of training samples required for the CRFS basically remained below 5,000, and its training accuracy decreased only slightly.
This highlights the positive influence of utilizing the CRFS technique, as it could significantly reduce the size of the required training samples without sacrificing the detection performance.
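The learning-curve procedure behind Figures 11 and 12, fitting on increasingly large training subsets and recording the training and validation accuracy at each size, can be sketched as follows (a minimal NumPy sketch using a simple nearest-centroid classifier as a stand-in for the actual models; all names are illustrative):

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # Compute one centroid per class.
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(model, X):
    # Assign each sample to the class of its closest centroid.
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def learning_curve(X_tr, y_tr, X_val, y_val, sizes):
    # For each training-set size, fit on the first n samples and record
    # (n, training accuracy, validation accuracy).
    points = []
    for n in sizes:
        model = nearest_centroid_fit(X_tr[:n], y_tr[:n])
        acc_tr = (nearest_centroid_predict(model, X_tr[:n]) == y_tr[:n]).mean()
        acc_val = (nearest_centroid_predict(model, X_val) == y_val).mean()
        points.append((n, acc_tr, acc_val))
    return points
```

Convergence of the two accuracy curves as n grows is what determines the "number of training samples required" reported for each preprocessing method.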
(2) Influences of Data Preprocessing on the Time Required for Training. To further highlight the influences of the data preprocessing stage, Table 5 summarizes the time required for the different methods to construct the learning curve under different types of cyberattacks. For instance, when there were two types of cyberattacks, the Z-score needed nearly 483 s to establish the learning curve, which was reduced to 370 s after processing by the SMOTE and to 154 s after processing by the CFS. However, the time required to construct the learning curve after processing by the CRFS was only 90 s, which was 81.4%, 75.7%, and 41.6% lower than that of the Z-score, SMOTE, and CFS, respectively.
This indicates that the CRFS can not only guarantee the accuracy of detection but also effectively reduce the time required for training. The proof in Section 4.5 verifies that the feature selection algorithm proposed in this article has a lower time complexity than the other algorithms. As the noisy features are deleted by the CRFS, the ML algorithms only need to fit the causal features; thus, the accuracy of the subsequent steps can be guaranteed and the time complexity required for training can be reduced.
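The SMOTE step in the preprocessing pipeline oversamples minority classes by interpolating between a minority sample and one of its nearest minority neighbors. Its core idea can be sketched as follows (a minimal NumPy sketch of the interpolation step only, not the full SMOTE algorithm; names and defaults are illustrative):

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    # Generate n_new synthetic minority samples: pick a minority point,
    # pick one of its k nearest minority neighbours, and interpolate
    # between them at a random fraction lam in [0, 1).
    rng = np.random.default_rng(seed)
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)          # exclude each point from its own neighbours
    nn = np.argsort(d, axis=1)[:, :k]    # k nearest minority neighbours per point
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(k)]
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region occupied by the original minority data.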

Influences of Feature Selection Methods on the Number of Features Required.
In this experiment, three groups of control experiments were set up, and the number of features and the training accuracy after data processing by the SMOTE, CFS, and Min-Max were compared. The CRFS algorithm was then used to further select features. "SMOTE add (do)", "CFS add (do)", and "Min-Max add (do)" in Tables 6-17 indicate that the CRFS method was applied to further process and select the data after data processing by these methods. The number of features left after processing by the different algorithms in the CICIDS19 dataset under different types of cyberattacks is shown in Table 6. After processing by the CRFS algorithm, the number of features required for training decreased by more than 50% at the minimum and 94% at the maximum compared with that before processing. Moreover, the number of features after processing by the CRFS algorithm was significantly lower than that produced by the CFS algorithm. This may be related to the fact that the CRFS, based on causal reasoning, only selects network features that have a causal relationship with the cyberattacks and eliminates the features with a spurious correlation. The CFS is a feature selection method based on high correlation, which can greatly reduce the number of features; however, this method also selects some noncausal features with a spurious correlation, resulting in a higher number of features than the CRFS. The detection accuracy of SMOTE versus CRFS, CFS versus CRFS, and Min-Max versus CRFS in the CICIDS19 dataset is shown in Tables 7-9, respectively. As presented in these tables, although the number of features required for training was noticeably reduced after processing by the CRFS algorithm, the detection accuracy basically remained unchanged. The number of features left after processing by the different algorithms in the UNSW-NB15 dataset is shown in Table 10. After further processing of the features by the CRFS algorithm, the number of features required for training was reduced by more than 50% at the minimum and more than 82.5% at the maximum compared with that before processing. When there were few types of cyberattacks, the effect of applying causality to the data processed by the CFS to find compressed features was significantly reduced.
Owing to the strong correlation and strong causality, the UNSW-NB15 feature set was consistent after data processing by the CFS. However, when there were several types of cyberattacks, the reduction after further processing by the CRFS algorithm was also significant, up to 54.5%. The detection accuracy of SMOTE versus CRFS, CFS versus CRFS, and Min-Max versus CRFS in the UNSW-NB15 dataset is shown in Tables 11-13, respectively. As presented in these tables, when there were few types of cyberattacks, although the number of features required for training was noticeably reduced after processing by the CRFS algorithm, the training accuracy basically remained unchanged and the effect was obvious.
In the NSL-KDD dataset, after further processing of the features by the CRFS algorithm, the maximum reduction of the number of features required for training was more than 82.5%. As presented in the corresponding tables, the number of features required for training was noticeably reduced after processing by the CRFS algorithm in the NSL-KDD dataset.
To sum up, the CRFS algorithm could effectively reduce the number of required training samples in the CICIDS19, UNSW-NB15, and NSL-KDD datasets while maintaining the training accuracy with relatively acceptable stability. Especially under the circumstance of a smaller number of cyberattack types, the training accuracy was basically unchanged while the complexity in time and calculation was greatly reduced. It was proved that causal features could not only complete the NIDS detection task but also ensure the stability of the accuracy rate. The selected causal features might also provide targeted help for subsequent preventive treatment.
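For contrast with the causal selection above, a purely correlation-based filter (the idea underlying CFS-style selection) can be sketched as follows. Such a filter keeps any feature strongly correlated with the label, including spuriously correlated ones, which is exactly what the CRFS is designed to prune (a minimal NumPy sketch; the threshold and names are illustrative assumptions):

```python
import numpy as np

def correlation_filter(X, y, threshold=0.1):
    # Keep the indices of features whose absolute Pearson correlation
    # with the label exceeds the threshold; drop the rest as candidate
    # noise. Assumes every feature has nonzero variance.
    keep = []
    for j in range(X.shape[1]):
        c = np.corrcoef(X[:, j], y)[0, 1]
        if abs(c) >= threshold:
            keep.append(j)
    return keep
```

Note that a feature can pass this filter through a spurious correlation (e.g., a confounder driving both the feature and the label) even though intervening on it would not change the attack outcome, which is why a correlation filter typically retains more features than a causal one.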

Influences of Different Types of Cyberattacks on the Detection Performance.
To evaluate the performance of the different classifiers and study the effects of the different optimization methods, the accuracy on the test data (ACC) is used as the evaluation index. Random search (RS) and the tree-structured Parzen estimator (TPE) are the two optimal parameter adjustment methods with the highest accuracy for the KNN and random forest in MOMBNF [9]. CMLN denotes the proposed causal ML-based NIDS.
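The random search (RS) strategy referenced above can be sketched for a single KNN hyperparameter as follows (a minimal NumPy sketch with an illustrative brute-force KNN; this is not the MOMBNF implementation, and the search range is an assumption):

```python
import numpy as np

def knn_accuracy(k, X_tr, y_tr, X_te, y_te):
    # Brute-force k-nearest-neighbour classification: majority vote
    # among the k closest training samples, scored by test accuracy (ACC).
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    votes = y_tr[nn]
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return (pred == y_te).mean()

def random_search_k(X_tr, y_tr, X_te, y_te, trials=10, seed=0):
    # Randomly sample candidate k values and keep the one with the best ACC.
    rng = np.random.default_rng(seed)
    best_k, best_acc = None, -1.0
    for _ in range(trials):
        k = int(rng.integers(1, 16))  # illustrative search range
        acc = knn_accuracy(k, X_tr, y_tr, X_te, y_te)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

TPE differs from this sketch in that it models the distribution of good versus bad trials to propose the next candidate rather than sampling uniformly.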
The performance of the different classifiers in the CICIDS19, UNSW-NB15, and NSL-KDD datasets under different types of cyberattacks is compared in Tables 18-20. As shown in Table 18, in the CICIDS19 dataset, with an increase in the types of cyberattacks, the detection accuracy in MOMBNF significantly decreased. When there were 11 types of cyberattacks, the detection accuracy of all the parameter optimization methods in MOMBNF was lower than 90%; in particular, the accuracy on the test set was lower than 30% after IGBS data processing. However, after CMLN training, the accuracy on the test set was stable at more than 98.5%, about 9% higher than the optimal RS-KNN-CFS method. It can be seen from Tables 18-20 that, regardless of the composition of the datasets, the test-set accuracy of CMLN was higher than that of MOMBNF and BRS [47], especially when there were several types of cyberattacks. The detection rate of CMLN was higher than that of MOMBNF.

Conclusions
Although ML aims to facilitate the detection of anomalies, it is important to first understand how detection is performed and to clearly define the desired output of our algorithms. When traditional ML algorithms cannot decouple correlation and causality, it is difficult to achieve a stable prediction [8]. Therefore, this paper proposed a novel causal ML-based NIDS. Firstly, by establishing a causal link between cyberattacks and features through causal intervention, the noisy features could be deleted and the minimum size of training features could be determined. Then, the ML and counterfactual detection algorithms were used to find the unique label. Finally, the CICIDS19, UNSW-NB15, and NSL-KDD datasets were utilized to evaluate the performance of the proposed detection method. The results of the experiments showed that the CRFS method proposed in this paper could reduce the size of training samples and the training time by at least 40%. Meanwhile, the number of features required for training was greatly reduced after data processing by the CRFS algorithm, while the training accuracy was maintained with relatively acceptable stability. It was proved that the deletion of noisy features did not affect the accuracy of detection.
The results showed that, compared with other optimization techniques, CMLN had the highest detection accuracy (when there were 11 types of cyberattacks, the accuracy was improved by nearly 9% compared with the optimal RS-KNN-CFS method). It was confirmed that the counterfactual detection algorithm could effectively identify the causal relationship between features and the type of cyberattacks.
At present, new cybersecurity threats are becoming ever more severe and cannot be classified according to the existing classification methods. Hence, how to effectively combine unsupervised learning and causal ML to construct new NIDSs to detect new cybersecurity threats may be a new direction for investigation.

Conflicts of Interest
The authors declare that they have no conflicts of interest.