Intrusion Detection Systems in Cloud Computing Paradigm: Analysis and Overview

The cloud computing paradigm is growing rapidly. It allows users to obtain services via the Internet on a pay-per-use basis, and it is convenient for developing, deploying


Introduction
Cloud computing is defined as an Internet-based computing platform in which virtually shared servers provide software, platform, infrastructure, policies, and other functions [1]. It is driven by users' demand to reduce overall cost and complexity. It is gaining popularity because of on-demand service provision, flexible resource allocation, higher fault tolerance, and higher scalability. Various cloud service providers (CSPs), including Google, Amazon, and Microsoft, use virtualization technologies with self-service capabilities. Virtualization is the primary requirement of cloud computing [2]. The rapid growth of IT technologies increases the volume of data every day [3]. Attackers have taken advantage of cloud computing because it produces copious amounts of data, at rates greater than 665 Gb/s [4]. The huge volume of data generated by the cloud has become its biggest problem because it makes the cloud a target for attackers [5]. Hackers are attracted to the cloud by its open and distributed nature and the amount of traffic it produces [6]. Attackers can interrupt users' services, misuse sensitive information, and misuse the services and resources provided by the CSP. An intrusion is an attack that misuses the private or sensitive information of users or consumes resources such as CPU, bandwidth, and storage. Traditional security methods such as firewalls are not sufficient, so a dedicated system is needed to protect users. An intrusion detection system (IDS) detects attacks by analyzing network data. There are mainly two categories of IDS based on deployment strategy: host-based IDS and network-based IDS [7,8]. A host-based IDS detects attacks by monitoring the host system only, whereas a network-based IDS analyzes the whole network. In the host-based case, every node in the cloud has its own IDS and storage [9].
A host-based IDS based on statistics and probability theory is proposed in Ref. [10]. SNORT-based detection is performed in the Eucalyptus cloud in Ref. [11]. The network-based IDS proposed in Ref. [12] has an intrusion detection system management unit and an intrusion detection system sensor.
The distributed intrusion detection system is also gaining ground because it merges the characteristics of both of the abovementioned IDSs [13]. Two more types of IDS are distinguished by detection mechanism: signature-based IDS and anomaly-based IDS. A signature-based IDS detects attacks by comparing network traffic against the attack signatures stored in a database. An anomaly-based IDS detects attacks by analyzing the dynamic activities in the network: a profile is created by observing the activities of users and applications during a particular period, and deviations from that profile are flagged [14,15].
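The contrast between the two detection mechanisms can be illustrated with a minimal sketch. The signature strings, the single-feature profile (requests per minute), and the three-standard-deviation threshold are all illustrative assumptions and are not taken from any of the cited systems.

```python
from statistics import mean, stdev

# Hypothetical signature database: substrings that identify known attacks.
SIGNATURES = {"' OR 1=1": "sql-injection", "../../": "path-traversal"}


def signature_match(payload):
    """Signature-based check: return the name of the first known pattern found, else None."""
    for pattern, name in SIGNATURES.items():
        if pattern in payload:
            return name
    return None


def build_profile(observations):
    """Anomaly-based profile: summarize normal behaviour as (mean, standard deviation)."""
    return mean(observations), stdev(observations)


def is_anomalous(value, profile, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the profile mean."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Profile built from requests per minute observed during a normal period (made-up numbers).
profile = build_profile([48, 52, 50, 47, 53, 51, 49])
print(signature_match("GET /index.php?id=1' OR 1=1"))
print(is_anomalous(50, profile), is_anomalous(400, profile))
```

The signature check can only report attacks that are already in its database, whereas the profile check can flag previously unseen behaviour at the cost of possible false alarms.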
Numerous researchers have used data mining and machine learning approaches [16]. Zero-day attacks are the biggest concern for the cloud [17]. Classifiers based on machine learning are usually used to separate attack packets from normal packets [18]. Another emerging technique is association rule mining [19]. Artificial neural networks are widely used because of their ability to work on incomplete datasets [20]. Some researchers have highlighted the importance of machine learning algorithms for intrusion detection in the cloud because of the scalability and elasticity of the cloud computing paradigm [21][22][23][24]. Different optimization algorithms such as the genetic algorithm [25], particle swarm optimization [26], harmony search [27], and artificial bee colony [28] are also used with various classifiers for categorizing attack packets and normal packets of the network.
The main contributions of the article are as follows:
(i) We identify the methodologies followed by different intrusion detection systems for the cloud computing environment, and the attacks they consider.
(ii) We compare four existing intrusion detection systems for the detection of attacks.
(iii) We compare various attacks of two standard benchmark datasets: the NSL-KDD dataset and the UNSW-NB15 dataset.
(iv) We summarize the study of various existing intrusion detection systems for the cloud computing environment, present our experimental results, and identify which methodology performs best in the comparative analysis.
(v) We outline the remaining challenges in cloud security and suggest possible recommendations for addressing them.
The structure of the remaining article is as follows: Section 2 presents the literature review. Section 3 describes the proposed methodology. Section 4 presents the experiments and comparative analysis. Section 5 presents the future scopes and recommendations for the cloud computing environment. Conclusions are presented in Section 6.

Literature Review
The literature review section reviews journal papers related to intrusion detection in the cloud computing environment. The review is presented in tabular form: Table 1 summarizes the reviewed papers and suggests possible future scopes for each of them.
Additionally, we have compared our survey article with other recent survey papers. Table 2 shows how our survey differs from other surveys and describes its novelty.

Methodology
Our methodology is described in this section. It is implemented in three modules: preprocessing, classification, and evaluation. We have used four existing methodologies for the detection of attacks. Three of them are applied to the cloud computing environment, and the fourth is applied to a general network, which strengthens the comparison. We chose these four methodologies because they include the popular classifiers for intrusion detection, and one of them also uses an optimization technique. Comparing them therefore gives a representative outcome.

Dataset.
We have used two standard benchmark datasets for the comparative analysis. We have used the NSL-KDD dataset [52] and the UNSW-NB15 dataset [53].

UNSW-NB15 Dataset.
It was created to overcome the drawbacks of the NSL-KDD dataset. This dataset contains low-footprint attack characteristics and some traffic schemes, and there is no discrepancy between the distributions of the training and testing datasets.
This dataset contains 49 features. The last two features represent the category and the label (0 for normal and 1 for attack records). Figure 1 shows a pie chart of the class distribution of the UNSW-NB15 dataset.

NSL-KDD Dataset.
It is a publicly available dataset that refines the KDD-CUP 1999 dataset. This dataset does not contain redundant records in the training and testing sets, and there is no need to create subsets of the dataset for experimentation. Figure 2 shows a pie chart of the NSL-KDD dataset distribution. Features are selected based on the scoring algorithm and ranking algorithm. Classification of attacks is made by using the rule-based algorithm.
Temporal constraints can be used for collecting dynamic information related to attacks. The fuzzy rule concept can also be used for increasing accuracy.

Preprocessing.
Raw datasets can lead to high false-alarm rates [54]. The datasets used for classification include various attributes, which can be numeric or non-numeric.
Symbolic or non-numeric attributes should be converted to a numeric form that classifiers can easily interpret. We have preprocessed the raw datasets and converted every attribute to numeric form. For example, in the NSL-KDD dataset, attribute 41 has no use for classifying the dataset. Hence, we
The authors can produce a hybrid classification technique by using multiple classifiers.
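The conversion of symbolic attributes to numeric form can be sketched as follows. This is only a minimal illustration: the integer-code scheme, the min-max scaling step, and the sample rows are assumptions, since the text does not state which encoding was used. The symbolic columns in the example mimic the protocol and service attributes of NSL-KDD.

```python
def encode_symbolic(rows, symbolic_columns):
    """Replace each symbolic value by an integer code, assigned per column in order of first appearance."""
    mappings = {col: {} for col in symbolic_columns}
    encoded = []
    for row in rows:
        new_row = list(row)
        for col in symbolic_columns:
            codes = mappings[col]
            # setdefault keeps an existing code or assigns the next free one.
            new_row[col] = codes.setdefault(row[col], len(codes))
        encoded.append(new_row)
    return encoded, mappings


def min_max_scale(rows):
    """Scale every column to the range [0, 1]; constant columns are mapped to 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Made-up records: duration, protocol, service, source bytes.
raw = [
    [0, "tcp", "http", 181],
    [2, "udp", "domain_u", 105],
    [0, "tcp", "smtp", 1684],
]
encoded, mappings = encode_symbolic(raw, symbolic_columns=[1, 2])
scaled = min_max_scale(encoded)
print(encoded)
print(mappings)
```

Integer codes impose an arbitrary order on the categories. One-hot encoding avoids that at the cost of more columns, so the choice depends on the classifier.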

Classification.
Classification of the dataset into normal and attack packets plays an important role in providing security to the cloud computing environment. Classification can be binary or multiclass: binary classification results in two classes, and multiclass classification results in more than two classes. We have performed multiclass classification. For the classification, we have implemented four existing intrusion detection methodologies, which are described next.

FCM-ANN.
This methodology is implemented in four modules [33]. The flowchart of the methodology is shown in Figure 3.
(1) Preprocessing Module. The raw dataset is preprocessed, and the dataset is converted into a form that is easily analyzed by the classifier.
(2) FCM Module. This module is used for making clusters of the dataset. The membership-based objective function used for creating the clusters is represented [33] by the following equation:
$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{K} U_{ij}^{m} \, \lVert x_i - c_j \rVert^{2},$$
where N is the number of elements, K is the number of clusters, m is a real number with 1 ≤ m < ∞, U_ij is the degree of membership of data point x_i in the j-th cluster, and c_j is the center of the j-th cluster. The output of this module is a set of clusters that are homogeneous internally and heterogeneous with respect to one another.
(3) ANN Module. This module is used for classifying the clusters generated by the fuzzy c-means algorithm. The backpropagation algorithm is commonly used for training neural networks [55]. In this module, the cluster pattern is learned, and the backpropagation algorithm is used to train the feed-forward neural network. A feed-forward neural network has an input layer, an output layer, and numerous hidden layers. The input given to node k of the hidden layer is ln(k) (the symbol denotes the net input of the node, not the natural logarithm), and it is given [33] by
$$\mathrm{ln}(k) = \sum_{i} w_{ik} \, x_i + \theta_k,$$
where ln(k) is the input given to node k of the hidden layer, θ_k is the bias of the hidden layer, x_i is the input given to node i of the input layer, and w_ik is the weight between the input layer and the hidden layer. The activation function is the sigmoid function, and it is used for processing ln(k). It is given [33] by the following equation:
$$f(\mathrm{ln}(k)) = \frac{1}{1 + e^{-\mathrm{ln}(k)}}.$$
The result of the activation function, f(ln(k)), is sent to all the neurons of the output layer. The output is given [33] by the following equation:
$$y_j = \sum_{k} w_{kj} \, f(\mathrm{ln}(k)) + \theta_j,$$
where y_j is the output of node j of the output layer, θ_j is the bias of the output layer, w_kj is the weight between the hidden layer and the output layer, and f(ln(k)) is the activation function.
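The clustering step of the FCM module can be sketched in plain Python. This is a minimal sketch of the standard fuzzy c-means iteration (alternating membership and center updates), not the implementation of Ref. [33]; the fuzzifier m = 2, the fixed iteration count, the hand-picked initial centers, and the sample points are all assumptions.

```python
import math


def update_memberships(points, centers, m=2.0):
    """Standard FCM update: U_ij = 1 / sum_c (d_ij / d_ic) ** (2 / (m - 1))."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [math.dist(x, c) for c in centers]
        if 0.0 in dists:
            # The point coincides with a center: share full membership among those centers.
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(row)
            row = [u / total for u in row]
        else:
            row = [1.0 / sum((d_j / d_c) ** power for d_c in dists) for d_j in dists]
        memberships.append(row)
    return memberships


def update_centers(points, memberships, m=2.0):
    """Each center is the mean of all points weighted by U_ij ** m."""
    dim = len(points[0])
    centers = []
    for j in range(len(memberships[0])):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centers.append(tuple(
            sum(w * x[d] for w, x in zip(weights, points)) / total for d in range(dim)
        ))
    return centers


def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    """Alternate the two updates for a fixed number of iterations."""
    centers = list(initial_centers)
    for _ in range(iterations):
        memberships = update_memberships(points, centers, m)
        centers = update_centers(points, memberships, m)
    return centers, update_memberships(points, centers, m)


# Two well-separated groups of made-up 2-D records.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, memberships = fuzzy_c_means(points, initial_centers=[(0.0, 0.0), (5.0, 5.0)])
for point, row in zip(points, memberships):
    print(point, [round(u, 3) for u in row])
```

Each row of memberships sums to 1, so a record near a cluster boundary contributes to several clusters instead of being forced into one.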

SVM-ANN Methodology.
In this methodology [34], the SVM classifier uses the anomaly detection technique, and the ANN classifier uses the misuse detection technique. The whole methodology is implemented in three modules: the preprocessing module, the SVM module, and the ANN module. The flowchart of the SVM-ANN methodology is shown in Figure 4.
(1) Preprocessing Module. The preprocessing module is a very important part of the classification methodology because it makes the dataset ready for classification. The raw dataset has redundant and useless data, and preprocessing removes them.
(2) SVM Module. The preprocessed dataset is given as input to the support vector machine (SVM) classifier, which performs binary classification and produces two classes: normal and attack. A normal packet is labelled as normal, whereas an attack packet is labelled as attack. The SVM classifier usually maps the data into a higher-dimensional space, which makes it easier to separate the data into different classes. A hyperplane H in R^n can be expressed as [56]
$$H = \{\, x \in \mathbb{R}^{n} : \langle w, x \rangle + b = 0 \,\},$$
where x is an element of R^n, w is the normal vector of the hyperplane in R^n, and b is an element of R. Some studies state that SVM is implemented successfully in regression and classification [52, 53, 57-59].
(3) ANN Module. The attack packets are the input for the artificial neural network classifier. The backpropagation algorithm with a feed-forward neural network is implemented. It is a commonly used training algorithm for neural networks [55]. This classifier performs multiclass classification. It outputs the attack packets with their types.
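The two-stage flow of the SVM-ANN methodology can be sketched as a hyperplane test followed by one feed-forward pass. This is a minimal sketch under stated assumptions: the weights are hand-picked rather than trained, the convention that a non-negative hyperplane score means "attack" is an assumption, and the attack names are placeholders. The forward pass follows the weighted-sum and sigmoid description given for the ANN module of FCM-ANN.

```python
import math


def svm_decision(w, b, x):
    """Signed score of x relative to the hyperplane <w, x> + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, theta_hidden, w_out, theta_out):
    """Hidden input: sum_i w_ik * x_i + theta_k; output: sum_k w_kj * f(hidden_k) + theta_j."""
    hidden = [
        sigmoid(sum(w_hidden[i][k] * x[i] for i in range(len(x))) + theta_hidden[k])
        for k in range(len(theta_hidden))
    ]
    return [
        sum(w_out[k][j] * hidden[k] for k in range(len(hidden))) + theta_out[j]
        for j in range(len(theta_out))
    ]


def classify(x, svm, ann, attack_types):
    """Stage 1: the hyperplane separates normal from attack. Stage 2: the ANN names the attack type."""
    w, b = svm
    if svm_decision(w, b, x) < 0:  # assumed convention: negative side is normal
        return "normal"
    outputs = forward(x, *ann)
    return attack_types[outputs.index(max(outputs))]


# Hand-picked (untrained) parameters for a two-feature toy example.
svm = ((1.0, 1.0), -1.0)
ann = (
    [[4.0, -4.0], [-4.0, 4.0]],  # w_hidden[i][k]
    [0.0, 0.0],                  # theta_hidden
    [[1.0, 0.0], [0.0, 1.0]],    # w_out[k][j]
    [0.0, 0.0],                  # theta_out
)
attack_types = ["DoS", "Probe"]
for record in [(0.1, 0.2), (0.9, 0.2), (0.2, 0.9)]:
    print(record, classify(record, svm, ann, attack_types))
```

Only the records that the first stage labels as attacks reach the neural network, which is why the ANN never has to model the normal class.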

FCM-SVM Methodology.
In this methodology [44], the hybrid approach combines FCM with the SVM classifier. The methodology comprises four modules. Figure 5 shows the flowchart of the FCM-SVM methodology.
(1) Preprocessing Module. The first module converts the dataset into a form easily understood by the classifier. The preprocessed dataset saves time and resources because unwanted data are removed in this module.
(2) FCM Module. This module divides the dataset into groups based on membership functions. The equations related to the FCM algorithm are discussed earlier in this study.
(3) SVM Module. This module classifies the clusters using support vector machine classifiers. The SVM classifiers perform multiclass classification.
(4) Aggregation Module. The outputs of all the SVM classifiers are combined, and the aggregation module generates the final output.

SMO-ANN Methodology.
This methodology is based on a fuzzy c-means clustering algorithm optimized with the spider monkey optimization (SMO) algorithm [45]. Figure 6 shows the flowchart of the SMO-ANN methodology. The methodology is divided into three modules, which are described next.
(1) Preprocessing Module. Preprocessing is carried out to obtain the preprocessed dataset from the raw dataset. The preprocessed dataset does not contain useless data.
(2) FCM-SMO Module. The whole dataset is divided into various clusters in this module. SMO is applied to the clusters to reduce the dataset further and obtain an optimized dataset.
(3) ANN Module. In this module, an artificial neural network (ANN) is applied to classify the dataset into attack packets and normal packets. Attack packets are further classified into their types.

Evaluation.
Performance metrics are vital for comparing different intrusion detection systems, and they show which intrusion detection system performs better than the others. In the equations below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
(1) Accuracy: Accuracy describes the percentage of correct predictions made by the intrusion detection system. It is represented by
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
(2) Precision: Precision describes the ratio of the attack packets correctly identified as intrusions to the total number of packets that the intrusion detection system flags as attacks. It is represented by
$$\text{Precision} = \frac{TP}{TP + FP}.$$
(3) Detection Rate: The detection rate (recall) describes the fraction of attack packets that are identified correctly. It is represented by
$$\text{Detection Rate} = \frac{TP}{TP + FN}.$$
(4) F-measure: The F-measure is defined as the harmonic mean of recall and precision. It is represented by
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
These performance metrics are used for comparing the methodologies on two standard benchmark datasets.
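The metrics listed above can be computed directly from the four confusion counts. The sketch below is illustrative only: the counts are made up, and the false-positive rate, which the experiments also report, is computed with the standard FP / (FP + TN), a formula that is assumed here because the list above does not state it.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a class has no relevant samples."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from true/false positive/negative counts."""
    precision = safe_div(tp, tp + fp)
    detection_rate = safe_div(tp, tp + fn)  # also called recall
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "detection_rate": detection_rate,
        "f_measure": safe_div(2 * precision * detection_rate, precision + detection_rate),
        # Assumed standard definition; not given in the list above.
        "false_positive_rate": safe_div(fp, fp + tn),
    }


# Made-up counts: 100 attack packets (90 caught) and 905 normal packets (5 false alarms).
result = metrics(tp=90, tn=900, fp=5, fn=10)
for name, value in result.items():
    print(f"{name}: {value:.5f}")
```

A precision of 1 and a false-positive rate of 0 always occur together when at least one attack is detected, because both require FP = 0.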
We use multiclass datasets for performance assessment and calculate the performance metrics for every class of both datasets, UNSW-NB15 and NSL-KDD. For example, we calculate the accuracy of every class of the NSL-KDD dataset; the overall accuracy for the whole dataset is then the average of the accuracies of all the classes. The other performance metrics are calculated in the same way for both datasets. We have compared every attack of both datasets by calculating the performance metrics for every attack, and we have also compared the overall performance metrics of both datasets across the four existing intrusion detection systems.
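The per-class averaging described above can be sketched as a one-vs-rest computation: each class in turn is treated as the positive class, its accuracy is computed, and the overall value is the unweighted mean. The labels and predictions below are made up, and treating every class (including normal) as one of the averaged classes is an assumption.

```python
def one_vs_rest_counts(y_true, y_pred, label):
    """Count TP, TN, FP, FN when `label` is the positive class and all others are negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, tn, fp, fn


def per_class_accuracy(y_true, y_pred):
    """Return the accuracy of every class and their unweighted (macro) average."""
    accuracies = {}
    for label in sorted(set(y_true)):
        tp, tn, fp, fn = one_vs_rest_counts(y_true, y_pred, label)
        accuracies[label] = (tp + tn) / (tp + tn + fp + fn)
    overall = sum(accuracies.values()) / len(accuracies)
    return accuracies, overall


y_true = ["normal", "normal", "dos", "dos", "probe", "probe"]
y_pred = ["normal", "dos", "dos", "dos", "probe", "normal"]
accuracies, overall = per_class_accuracy(y_true, y_pred)
print(accuracies)
print(round(overall, 4))
```

Because true negatives dominate for rare classes, per-class accuracy can stay high even when the detection rate of that class is low, which is one reason to report the other metrics alongside it.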

Experiments and Comparative Analysis
To evaluate the performance of the existing IDSs, we conducted experiments on four existing IDSs using two standard benchmark datasets: the NSL-KDD dataset and the UNSW-NB15 dataset. We analyze the results by comparing five performance metrics: accuracy, detection rate, precision, F-measure, and false-positive rate. Table 3 shows the hardware and software used in the experiments.
In Table 4, the SVM-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0. The FCM-SVM methodology has the highest accuracy of 0.99855, the highest detection rate of 0.98475, and the highest F-measure of 0.98431.
In Table 5, the SVM-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0. The FCM-SVM methodology has the highest accuracy of 0.99925, the highest detection rate of 0.99254, and the highest F-measure of 0.99482.
In Table 6, the SVM-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0. The FCM-SVM methodology has the highest accuracy of 0.99954 and the highest F-measure of 0.99482. The SMO-ANN methodology has the highest detection rate of 1.
In Table 7, the SVM-ANN methodology has the highest detection rate of 0.98624. The FCM-SVM methodology has the highest accuracy of 0.99793 and the highest F-measure of 0.98068. The SMO-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0.
In Table 8, the SVM-ANN methodology has the highest precision of 0.99926 and the lowest false-positive rate of 0.00074. The FCM-SVM methodology has the highest accuracy of 0.99838, the highest detection rate of 0.99047, and the highest F-measure of 0.98969.
In Table 9, the FCM-SVM methodology has the highest accuracy of 0.99983, the highest detection rate of 0.99984, and the highest F-measure of 0.99934. The SMO-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0.
In Table 10, the SVM-ANN methodology has the highest precision of 1 and the lowest false-positive rate of 0. The FCM-SVM methodology has the highest accuracy of 0.99788 and the highest F-measure of 0.97563. The SMO-ANN methodology has the highest detection rate of 1.
In Table 11, the SMO-ANN methodology has the highest accuracy of 1, the highest detection rate of 1, a precision of 1, an F-measure of 1, and the lowest false-positive rate of 0.
In Table 12, the SVM-ANN and SMO-ANN methodologies have a precision of 1 and the lowest false-positive rate of 0. The SMO-ANN methodology has the highest accuracy of 1, the highest detection rate of 1, and the highest F-measure of 1.
In Table 13, the FCM-ANN methodology has the highest accuracy of 0.99862, the highest detection rate of 0.98710, the highest precision of 0.98710, the highest F-measure of 0.98710, and the lowest false-positive rate of 0.000658.
In Table 14, the SVM-ANN methodology has the highest accuracy of 0.99151, the highest detection rate of 0.98408, and the highest F-measure of 0.98836. The FCM-ANN and FCM-SVM methodologies have a precision of 1 and the lowest false-positive rate of 0.
In Table 15, the SVM-ANN methodology has the highest accuracy of 0.99365 and the highest F-measure of 0.96540. The FCM-SVM methodology has the highest detection rate of 1. The FCM-ANN and SMO-ANN methodologies have the highest precision of 1 and the lowest false-positive rate of 0.
In Table 16, the SVM-ANN methodology has the highest accuracy of 0.99805, the highest detection rate of 0.76555, and the highest F-measure of 0.86721. All methodologies have a precision of 1 and a false-positive rate of 0.
In Table 17, the SVM-ANN methodology has the highest accuracy of 0.99996, the highest detection rate of 1, and the highest F-measure of 0.95652. All methodologies have a precision of 1 and a false-positive rate of 0.
In Table 18, the SVM-ANN methodology has the highest accuracy of 0.99362, the highest detection rate of 0.94270, the highest precision of 0.96460, the highest F-measure of 0.95270, and the lowest false-positive rate of 0.00484.
The different attacks of the UNSW-NB15 and NSL-KDD datasets are analyzed to evaluate various intrusion detection systems of cloud computing environments. The tables above present the results of our experiments. Tables 4 to 18 show the different performance metrics values of different attacks of the UNSW-NB15 dataset. The FCM-SVM methodology performs better than the other methodologies in detecting every attack of the UNSW-NB15 dataset. Table 12 shows the performance metrics values of the complete UNSW-NB15 dataset. The overall performance of the FCM-SVM methodology for detecting attacks of the UNSW-NB15 dataset is better than that of the other methodologies. Tables 13 to 16 show the different performance metrics values of different attacks of the NSL-KDD dataset. The FCM-SVM and SMO-ANN methodologies perform better than the other methodologies in detecting every attack of the NSL-KDD dataset. Table 17 shows the performance metrics values of the complete NSL-KDD dataset. The overall performance of the SMO-ANN methodology for detecting attacks of the NSL-KDD dataset is better than that of the other methodologies.
The main advantage of the SVM classifier is that it depends only on the support vectors. The complete dataset does not influence the SVM decision function, unlike in many artificial neural networks (ANNs). Also, SVM deals efficiently with many features because it exploits kernel functions. The rate of convergence of the SMO algorithm is low. The premature

Future Scopes and Recommendations
Intrusion detection systems detect known and unknown attacks, but the copious amounts of data generated and stored on the cloud make the intrusion detection problem more complex. We summarize the future scopes below:
(i) Handling the rapidly growing zero-day attacks and their vulnerabilities is a demanding future scope in developing intrusion detection systems for cloud computing.
(ii) Another future scope is developing an adaptive architecture of intrusion detection systems to handle dynamic computations.
(iii) Researchers can also focus on integrating intrusion detection systems with blockchain technologies.
The possible recommendations for the above future scopes are as follows:
(i) An adaptive intrusion detection system must be developed that can adapt to changing requirements such as environment configurations, computation resources, and the various locations where intrusion detection systems are deployed.
(ii) It should expand dynamically by adding virtual machines when the cloud network extends.

Conclusion
This article reviews various intrusion detection systems related to cloud computing. It implements several IDSs and compares them. Two standard benchmark datasets were employed, and we observed that the FCM-SVM methodology outperforms the other techniques on the UNSW-NB15 dataset and the SVM-ANN methodology outperforms the others on the NSL-KDD dataset. Hence, SVM is identified as a better classifier than the other classifiers. In future work, we will work on zero-day attacks to develop an adaptive intrusion detection system that adapts to changing cloud architecture.

Data Availability
The datasets used in the article are publicly available standard benchmark datasets referred to in Refs. [54,56,60].

Consent
Not applicable.

Conflicts of Interest
The authors declare no conflict of interest related to this work.