A Data Mining Method Using Deep Learning for Anomaly Detection in Cloud Computing Environment



Introduction
With the popularization of cloud computing, "never-down" machine reliability in industrial applications has gradually shifted from an expectation to a practical requirement. Improving the accuracy, sensitivity, and execution efficiency of anomaly detection algorithms in data mining has therefore become increasingly important [1][2][3].
Given the large volume and varied types of data in the cloud computing environment, existing research approaches the problem from many directions. The literature [4] proposed a cloud computing network traffic matrix estimation and anomaly detection model based on the Bayesian network. Because the ideal naive Bayesian model assumes that attributes are independent, an assumption that rarely holds in practice, its performance is often insufficient in multiattribute situations. The literature [5] proposed statistical learning of anomaly detection in cloud server systems based on the Markov chain. The Markov model is not suited to long-term prediction of the system, so it can only judge short-term changes; its judgment of long-term operation is not accurate enough. Supervised anomaly detection algorithms need a large number of samples for model training before they can monitor anomalous data. For example, the wavelet soft-threshold method proposed in the literature [6] eliminates noise and errors in data streams to support an anomaly detection framework for uncertain data streams. This scheme relies on effective period pattern recognition and feature extraction under large-sample detection, so it carries some uncertainty in engineering practice. For anomaly detection over big data, machine-learning models have been proposed that classify data with different attributes through linear or nonlinear methods, for example, the one-class support vector machine [7] (one-class SVM, OCSVM), whose training set is simpler than that of the standard support vector machine. At the same time, classification algorithms based on neural networks [8] have high research value at this stage. Neural networks are generally divided into convolutional neural networks [9] and deep neural networks [10]. He et al. [11] proposed a convolutional neural network for video classification, but its limited generalization ability and its huge number of parameters are serious problems. Su et al. [12] exposed a shortcoming of deep neural networks: they are vulnerable to the one-pixel attack, where a single pixel may change the output of the entire network.
Unsupervised anomaly detection avoids the shortcomings of the above schemes. It does not need labeled samples, so it has higher practical value. For example, the local outlier factor (LOF) algorithm proposed in the literature [13] determines the abnormal degree of a data object by calculating the local outlier factor (LOF value) of each point. Compared with other algorithms, it is theoretically simple and highly adaptable, and it can detect global and local anomalies effectively. However, the LOF algorithm is designed around local density, which makes it computationally complex, and it assumes that there are fewer than k repeating points. Therefore, a new density-based outlier detection (DBOD) algorithm was proposed in the literature [14], which defines the density of a point as k divided by the distance to its k-th nearest neighbor. Although this algorithm reduces computational complexity and improves efficiency, the scale of data it can process is limited by memory capacity and data complexity. It is therefore important to design an anomaly detection algorithm that preserves the advantages of the LOF algorithm while handling large amounts of data efficiently [15,16]. The innovations of the proposed DMAD-DVDMR in the cloud computing environment are as follows: (1) In the data preprocessing method based on deep variational dimensionality reduction, training on labeled samples constructs a latent representation layer with high predictive ability, preserving as much information as possible while reducing the dimensionality of the data features and providing a more complete preprocessing result for the subsequent anomaly detection step.
(2) The local anomaly factor detection method based on MapReduce avoids the excessive data density caused by over-concentration of data and greatly improves data processing capacity and work efficiency.
(3) The hybrid model combining deep variational dimensionality reduction with local anomaly factor detection improves generalization ability, solves the computational problem caused by excessively high data dimensionality, and improves the retention of label information.
The rest of this article is organized as follows: Section 2 introduces the data preprocessing method based on the deep variational dimensionality reduction model; Section 3 introduces the local anomaly factor detection algorithm based on MapReduce and stochastic gradient descent; Section 4 presents the discussion of experiments and numerical examples; Section 5 summarizes the work and outlook.

Data Preprocessing Based on the Deep Variational Dimensionality Reduction Model

Deep Variational Learning.
Sufficient dimensionality reduction (SDR) is a dimensionality reduction idea that aims to find a low-dimensional representation of the data while retaining predictive information about the label variables. The original work on SDR proposed a method to quantify information using concepts from information theory and introduced an iterative algorithm to extract features that maximize information. SDR methods are usually applied to continuous target variables, but for discrete target variables, methods based on distance covariance can estimate the central subspace. The following regression task is often encountered in machine learning: predict the label y ∈ R^D of an object given the observation x ∈ R^P. In high-dimensional settings, traditional regression methods may require a large amount of training data to avoid overfitting. It is therefore desirable to use dimensionality reduction to replace the original covariate x with another variable z ∈ R^d that retains most or all of the information and variation of x.
When z retains all relevant information about y, the dimensionality reduction is considered sufficient. The SDR problem can be described by either model in Figure 1. The model in Figure 1(a) can use unlabeled samples to construct the latent space, so it can be used for semisupervised learning; the model in Figure 1(b) can only use labeled samples [17,18].
The variational autoencoder is a deep learning model that can efficiently maximize a variational lower bound on the log-likelihood of the joint distribution at scale. In this model, the conditional distributions are parameterized by neural networks (the reparameterization trick). Compared with the standard variational autoencoder, the proposed model pays more attention to the encoding process and carries as much discriminative information as possible during encoding, so we aim to maximize the joint distribution probability p(X, Y). A deep variational autoencoder is used to maximize this lower bound, so the part of the model used for data dimensionality reduction preprocessing is called the deep variational dimensionality reduction (DVDR) model. Through training on labeled samples, a latent representation layer with high predictive ability is constructed, and the dimensionality is reduced as much as possible while preserving more of the predictive label information.
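As an illustration of the reparameterization trick mentioned above, the following minimal sketch (pure Python with toy linear "encoder" weights invented for illustration, not the DVDR network itself, which the paper implements in Java) draws a latent sample z = mu + sigma * eps with eps ~ N(0, 1), isolating the randomness in eps so that gradients can flow through mu and log sigma^2:

```python
import math
import random

def encode(x, w_mu, w_logvar):
    """Toy linear 'encoder': maps a scalar input to the mean and
    log-variance of a 1-D latent Gaussian. A real DVDR encoder is a
    neural network; the linear maps here are illustrative placeholders."""
    mu = w_mu * x
    log_var = w_logvar * x
    return mu, log_var

def reparameterize(mu, log_var, eps=None):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1),
    so the sampling noise is separated from the learnable parameters."""
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps

mu, log_var = encode(2.0, w_mu=0.5, w_logvar=-1.0)
z = reparameterize(mu, log_var)  # a latent sample for the decoder
```

In training, the decoder reconstructs the input from z while a KL term keeps the latent distribution close to the prior; here only the sampling step is shown.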

Hybrid Model Combining Deep Variational Dimensionality Reduction.
As generated data grow ever higher-dimensional and more complex, the curse of dimensionality has become a common problem [19][20][21]. Models that reduce the dimensionality of the data while minimizing the loss of original features are essential for data mining tasks, and hybrid models provide a good way to address both the curse of dimensionality and the data mining problem [22,23]. The deep variational dimensionality reduction algorithm is derived from the variational autoencoder. The autoencoder can serve as a feature dimensionality reduction algorithm because the dimensionality of its intermediate layer is smaller than that of the input data: after learning to characterize the input, the encoder yields a low-dimensional feature vector for the high-dimensional data, while the reconstruction part tests how well those features capture the input. The smaller the reconstruction error of the new data compared with the original input, the better the features learned by the middle layer, so generative performance is also a comparative indicator for this type of model. Compared with mainstream linear dimensionality reduction models, this type of generative model can better learn the nonlinear features in high-dimensional data and better express the original high-dimensional structure, so its feature reduction effect is superior. Although an undercomplete autoencoder, whose middle layer is smaller than the input layer, can reduce the dimensionality of the data, the plain autoencoder attends only to the reconstruction error and adapts poorly to noisy vectors. At the same time, its middle layer takes discrete values and generalizes poorly, so it cannot represent the original high-dimensional data well.
Therefore, a deep variational dimensionality reduction model is proposed, which improves generalization ability, strengthens the encoder part, and improves the completeness of the predicted label information. After training of the deep variational dimensionality reduction model is complete, the encoder can be used to preprocess the original data and reduce its feature dimensions, so that the anomaly detection model can be applied more effectively. The structure of the hybrid model is as follows.
As shown in Figure 2, after the original data are reduced by the deep variational dimensionality reduction model, a latent vector with a dimension smaller than the original data is obtained and used as the input of the improved anomaly detection model. Because the improved deep variational dimensionality reduction model preserves the features used for classification well, the curse of dimensionality is largely resolved after dimensionality reduction.

Suggested Anomaly Detection Algorithm
Based on the hybrid model shown in Figure 2, the local anomaly factor of each point in the dimensionality-reduced data is calculated to achieve effective detection of global and local anomalies.

Anomaly Detection Framework Based on MapReduce.
The MapReduce parallel programming model proposed by Google has become the dominant model for large-scale data processing because of its simplicity, scalability, and fault tolerance [24,25]. The core idea of the model is "divide and conquer": dense, large data sets without internal dependencies are divided into several fragments, which are computed in parallel by multiple subtasks, and the results are aggregated by the controlling task for output [26]. The MapReduce programming model realizes this idea by abstracting the distributed computation into two phases, Map and Reduce; the Map and Reduce processing functions the developer implements are shown in Figure 3 [27]. According to a specific slicing strategy, the input data are divided into multiple slices, and each slice is transformed and processed by a Map task, which applies a user-defined Map function to each row of data in the fragment. To reduce network transmission when aggregating intermediate results, users can specify a combiner function to merge and simplify the output of Map tasks. These intermediate results are hashed to different partitions and sorted according to a specific partition strategy. Then, the intermediate output with the same partition number is shuffled and copied to the corresponding node. Before running the Reduce task, the node merges the intermediate results into a complete input. The merged data are processed by the Reduce function, and the final results are written to the file system. The MapReduce programming model hides specific processes such as job control and process scheduling in cluster management, so developers can concentrate on program development with little or no consideration of fragmentation, partitioning, network transmission, and I/O details. In this way, the reliability, ease of use, and fault tolerance of parallel computing are ensured.
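The Map, shuffle, and Reduce phases described above can be imitated in a few lines of single-process code (a didactic sketch, not Hadoop; the `run_mapreduce` name and the word-count example are invented for illustration):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, n_splits=3):
    """Minimal single-process imitation of the Map -> shuffle -> Reduce
    flow: split the input, apply map_fn to each record, group the
    intermediate (key, value) pairs by key, then reduce each group."""
    # 1. Split the input into independent slices (the "fragments").
    splits = [records[i::n_splits] for i in range(n_splits)]
    # 2. Map phase: each slice is processed independently; the shuffle
    #    step groups all values emitted for the same key.
    intermediate = defaultdict(list)
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                intermediate[key].append(value)
    # 3. Reduce phase: aggregate each key's list of values.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Word count, the canonical MapReduce example.
lines = ["map reduce map", "reduce reduce"]
counts = run_mapreduce(
    lines,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
# counts == {"map": 2, "reduce": 3}
```

Because each slice is mapped independently, the result does not depend on how the input is split, which is exactly the property that lets Hadoop distribute Map tasks across nodes.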
A MapReduce job uses a master-slave architecture as its operation mechanism. It generally consists of one master node and several slave nodes, with client nodes for submitting and monitoring MapReduce jobs. The master node starts a JobTracker process, which is responsible for tracking job progress, including receiving jobs submitted by clients, distributing jobs as subtasks to slave nodes, and monitoring the execution status returned by the nodes [28]. Each slave node starts one or more TaskTracker processes to track the progress of the Map or Reduce tasks assigned to it. The entire flow of MapReduce is shown in Figure 4.

Mathematical Problems in Engineering
The real-time online monitoring part collects the current data for anomaly detection.
Then, the data are compared with the threshold to determine whether they are abnormal. Abnormal data are reported, and the remaining normal data are added to the R-tree while the oldest data are deleted from it. In this way, the model can adjust to changes in the normal state of the system and thereby achieve self-adaptation.
LOF is a general and portable algorithm. We collect system information directly from the cloud computing platform for anomaly detection, such as CPU usage, memory usage, and other basic system metrics. LOF computes an anomaly score for each detected data point, so users can choose a threshold according to the situation and find a suitable compromise between detection rate and false alarm rate. In addition, LOF only needs to learn the current normal behavior and does not need to be trained on every kind of anomaly, so it adapts well to, and recognizes, new anomalies.
LOF describes the abnormal degree of an object by comparing the density of the object with that of its neighbors.

Anomaly Detection Algorithm Based on Stochastic Gradient Descent.
To obtain the optimal weights in most supervised learning models, we need to create a cost (loss) function for the model and then choose an appropriate optimization algorithm to minimize it. The gradient descent algorithm is the most widely used optimization algorithm at present [29]. Its core idea is to find the minimum of the loss function: first calculate the gradient of the loss function, then reduce the loss step by step along the negative gradient direction, and finally, by continually updating the weights, drive the loss to its minimum and obtain the optimal solution. The stochastic gradient descent (SGD) algorithm is an improvement on gradient descent [30]. SGD randomly selects one sample at a time for each iterative update, rather than using all samples, so it significantly reduces computational complexity. SGD trains quickly and converges easily, and it is the most popular optimization algorithm among researchers at home and abroad. The SGD update at each iteration is

c ← c − η ∇_c f(c; y_i),

repeated for m iterations, where the objective function g(c) is the average of the per-sample losses f(c; y_i). In the formula, the network weight parameters are represented by c; the gradient by ∇_c; the loss function by f(c); the objective function by g(c); the value of the i-th sample by y_i; the total number of iterations by m; the step size in gradient descent by η; and the total number of parameters in KNN by j. As described above, the learning rate is crucial for the gradient descent algorithm. If η is set too small, many iterations are needed to find the optimal solution, which slows the convergence of the network and may even cause stagnation in a local optimum. If the learning rate is increased, training of KNN speeds up, but the probability of skipping over the optimal solution increases.
KNN may then fail to find the optimal solution. It can be seen that η is the key factor deciding whether the gradient descent algorithm is effective. The proposed algorithm achieves effective detection of anomalous data based on the stochastic gradient descent algorithm and MapReduce, which ensures its efficiency: when abnormal records appear in the data set, the target model obtained by stochastic gradient descent enables fast detection. The basic idea is to distribute the data set across the distributed computing nodes, run the stochastic gradient descent algorithm on each node via Map subtasks, and use Reduce subtasks to merge the model updates. The principle of subspace clustering is to reduce high-dimensional data to low-dimensional data, which makes subsequent analysis feasible. Because outliers disturb subspace clustering, an l1 regularization term is introduced. Note that the initial outlier detection threshold should be set to a large value to ensure that outliers are determined accurately. For each data point to be discriminated, it is first determined whether it is an outlier; if so, the corresponding outlier residual vector is marked; otherwise, the point is assigned to a subspace and the subspace is updated. The discrimination threshold is updated after each model iteration and decreases exponentially as the model iterates, so eventually all outliers are detected accurately and the subspaces are clustered appropriately. During iteration, the k subspaces are updated step by step. Since the algorithm processes the data point by point, updating the subspaces is equivalent to SGD, and the memory requirement is quite low, which solves the problem of large memory occupation.
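As a concrete illustration of the per-sample update rule above, the following sketch (pure Python rather than the paper's Java implementation; the squared-error loss and sample values are invented for illustration) fits the mean of a handful of samples with one-sample updates:

```python
import random

def sgd(samples, grad_fn, c0=0.0, eta=0.1, epochs=50, seed=0):
    """Stochastic gradient descent: at each step pick ONE random sample
    y and move the parameter c against the per-sample loss gradient,
    c <- c - eta * grad f(c; y)."""
    rng = random.Random(seed)
    c = c0
    for _ in range(epochs):
        for _ in range(len(samples)):
            y = rng.choice(samples)       # one sample, not the full batch
            c -= eta * grad_fn(c, y)      # the SGD update rule
    return c

# Toy objective: f(c; y) = (c - y)^2 / 2, so grad f = c - y.
# The minimizer of the averaged objective g(c) is the sample mean.
samples = [1.0, 2.0, 3.0, 6.0]
c_hat = sgd(samples, grad_fn=lambda c, y: c - y)
# c_hat fluctuates around the sample mean 3.0
```

Because only one gradient is evaluated per step, each update is cheap but noisy; the estimate hovers near the optimum rather than settling exactly on it, which is why the learning rate η matters so much.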
Many existing algorithms are too absolute in judging anomalies: a point is either normal or abnormal. In practice, however, much test data is difficult to judge absolutely, which results in a high false alarm rate or a high missed detection rate, and the severity of anomaly detection is hard to adjust. We therefore need a degree value for the judgment: different thresholds can be chosen for different environments, and the points whose scores exceed the threshold are finally output as anomalies [31].
All kinds of clustering algorithms have some ability to detect anomalies. Their common problem is that most clustering algorithms use a global distance criterion as the basis of detection, whereas an anomaly is inherently local: it is related to the distribution of its neighbors within a certain range. The mechanism by which clustering algorithms find anomalies is therefore limited. LOF determines anomalies from the local density around them. To illustrate the local character of LOF, the simple two-dimensional data set in Figure 5 is taken as an example. As the figure shows, cluster C1 contains more data than C2, but the data density in cluster C2 is higher than in C1.
Because of the low density of cluster C1, the distance between each point in C1 and its nearest neighbor is larger than the distance between p2 and its nearest neighbor in C2. In this case, a global-view clustering algorithm would not consider p2 an anomaly and would misjudge it, but LOF detects it successfully.

Algorithmic Design.
The LOF algorithm is described as follows:
(1) Calculating the k-distance of p. For any natural number k, the k-distance of the test object p, denoted k-distance(p), is defined as the distance d(p, o) from p to its k-nearest neighbor o (o ∈ D). That is, o needs to satisfy two conditions at the same time: (i) at least k objects o′ ∈ D\{p} satisfy d(p, o′) ≤ d(p, o); and (ii) at most k − 1 objects o′ ∈ D\{p} satisfy d(p, o′) < d(p, o).
(2) Calculating the k-distance neighbor set of p. The k-distance neighbor set of object p is the set of all objects whose distance from p does not exceed k-distance(p), defined as

N_k(p) = {q ∈ D\{p} | d(p, q) ≤ k-distance(p)}. (2)

The set of objects q satisfying the above formula is the k-distance neighbor set of p.
(3) Calculating the reachable distance of object p relative to object o. For each object p, compute its reachable distance relative to object o:

reach_dist_k(p, o) = max{k-distance(o), d(p, o)}, (3)

where d(p, o) represents the distance from object p to object o. Figure 6 shows a diagram of the reachable distance when k = 4. If an object is far from o (e.g., p2 in Figure 6), the reachable distance between them is the real distance; if an object is close enough to o (e.g., p1 in Figure 6), the reachable distance is the k-distance of o. In this way, the fluctuation of d(p, o) for objects near o is reduced, and the smoothing intensity can be adjusted through k. So far, we have calculated the k-distance, the k-distance neighbor set, and the reachable distance. In actual anomaly detection, MinPts is used as the k parameter, and reach_dist_MinPts(p, o) is calculated for all o ∈ N_MinPts(p) to determine the density around the object p.
(4) Computing the local reachable density of object p. The local reachable density of object p, denoted lrd_MinPts(p), is the reciprocal of the average reachable distance between object p and its MinPts-neighbors:

lrd_MinPts(p) = |N_MinPts(p)| / Σ_{o ∈ N_MinPts(p)} reach_dist_MinPts(p, o).

If there are at least MinPts objects with the same coordinates as object p, the local reachable density may be infinite, because the sum of the reachable distances is then 0. We therefore assume that there are not that many identical objects in the database, or we take the MinPts objects closest to p whose coordinates differ from those of p.
(5) Computing the LOF of object p. The LOF of object p can be calculated according to the following formula:

LOF_MinPts(p) = (1 / |N_MinPts(p)|) Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p).

The LOF of object p represents the abnormal degree of p: it is equal to the average ratio of the local reachable density of the MinPts-neighbors of p to the local reachable density of p itself. If the local reachable density of p is very low while that of its MinPts-neighbor set is very high, then p is very likely to be abnormal.
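Steps (1)-(5) can be condensed into a short, self-contained sketch (pure Python rather than the paper's Java implementation; the five-point data set is invented for illustration):

```python
import math

def knn(p, data, k):
    """k-distance neighborhood of data[p]: the indices of all points
    within its k-distance (ties included), plus the k-distance itself."""
    dists = sorted(
        (math.dist(data[p], data[q]), q) for q in range(len(data)) if q != p
    )
    k_dist = dists[k - 1][0]  # distance to the k-th nearest neighbor
    neigh = [q for d, q in dists if d <= k_dist]
    return neigh, k_dist

def lof(data, min_pts):
    """Local outlier factor of every point, following steps (1)-(5)."""
    n = len(data)
    neighbours, k_dist = {}, {}
    for p in range(n):
        neighbours[p], k_dist[p] = knn(p, data, min_pts)

    def reach_dist(p, o):
        # reach_dist(p, o) = max(k-distance(o), d(p, o))
        return max(k_dist[o], math.dist(data[p], data[o]))

    def lrd(p):
        # local reachable density: inverse of the mean reachable distance
        s = sum(reach_dist(p, o) for o in neighbours[p])
        return len(neighbours[p]) / s if s > 0 else float("inf")

    dens = [lrd(p) for p in range(n)]
    # LOF(p): average ratio of the neighbors' densities to p's density
    return [
        sum(dens[o] for o in neighbours[p]) / (len(neighbours[p]) * dens[p])
        for p in range(n)
    ]

# A tight cluster plus one isolated point: cluster LOFs sit near 1,
# while the isolated point's LOF is much larger.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = lof(points, min_pts=2)
```

Points inside a uniform region score close to 1 because their density matches their neighbors'; only points whose density is low relative to their neighborhood score well above 1, which is the locality property Figure 5 illustrates.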
The LOF algorithm is a density-based anomaly detection algorithm with a large amount of computation. The definition of local reachable density in the LOF algorithm carries an assumption: there are no k or more repeating points. When such repeated points exist, the average reachable distance of these points is zero and the local reachable density becomes infinite, which obviously affects the effectiveness of Algorithm 1.
Because of this shortcoming of the LOF algorithm, the concept of the k-neighborhood distance is redefined, and an anomaly detection algorithm combining deep learning with the MapReduce framework is proposed. The redefined concept, the "k-distinct-distance," is as follows: for any positive integer k, the k-distinct-distance of point p is defined as k-distance(p) = d(p, o), where o is the k-th nearest point to p among points whose coordinates are distinct, and d(p, o) is the distance between point p and point o. This improved neighborhood distance effectively realizes data classification in big data scenarios. Through a more precise definition of the k value, it achieves fast and effective data classification while ensuring calculation accuracy.
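A minimal sketch of the redefined distance (an illustrative reading of the definition above, not the paper's code; the function name and data are invented) shows how counting only distinct coordinates keeps the k-distance away from zero when points repeat:

```python
import math

def k_distinct_distance(p, data, k):
    """Redefined k-distance: the k-th smallest distance from data[p]
    over DISTINCT coordinates only, so k or more duplicated points can
    no longer force the local reachable density to infinity."""
    distinct = {tuple(q) for q in data if tuple(q) != tuple(data[p])}
    dists = sorted(math.dist(data[p], q) for q in distinct)
    return dists[k - 1]

# Five copies of the same point plus two others: the plain k-distance
# with k = 2 would be 0, but the distinct version skips the duplicates.
pts = [(0, 0)] * 5 + [(3, 4), (6, 8)]
d = k_distinct_distance(0, pts, k=2)
# d == 10.0 (distance from (0, 0) to (6, 8))
```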

Experiments and Results Analysis
The experimental platform is configured as follows: 3 PCs connected via LAN, each node running CentOS 7 under Windows VMware Workstation Pro 12.0.0, with JDK 1.8 and Hadoop 2.7.4. All the algorithms in this paper are implemented in the Java language in the Eclipse environment. The experimental environment is a Hadoop cluster based on the cloud platform. Using the KDD99 data set, the proposed DMAD-DVDMR algorithm is compared with the convolutional neural network algorithm (CNN) in the literature [9], the deep learning algorithm (DeepAnT) in the literature [10], and the intrusion detection method based on the support vector machine (SVM-IDS) in the literature [7] from five perspectives: model robustness, algorithm efficiency, algorithm response time, algorithm accuracy, and scalability. The KDD99 data set was collected from network connections; it contains 41 attributes and about 5 million packet records, divided into a labeled training data set and an unlabeled training data set. The data set contains 39 attack labels in total. The training set contains 22 of these 39 label categories, which are used to train the dimensionality reduction model based on deep variational learning; the test set contains 17 attack methods not present in the training set, which exercises the generalization ability of the detection model, that is, its ability to handle and prevent unknown attacks, and thereby allows a better evaluation of its detection performance. A subset of about 500,000 records is chosen as the experimental data set.

Basic Performance Verification of the Algorithm.
The basic performance of the proposed DMAD-DVDMR method is verified from three aspects, algorithm robustness, algorithm accuracy, and algorithm response time, and compared with the performance of the CNN, DeepAnT, and SVM-IDS methods.
The AUC indicator is used as the robustness evaluation standard for the data mining anomaly detection methods. AUC (area under curve) is defined as the area under the ROC (receiver operating characteristic) curve; classifiers with larger AUC deliver more robust anomaly detection performance. As the label position of the abnormal data changes, the comparison of the AUC indicators on the KDD99 data set is shown in Figure 7.
Next, the accuracy and response time of the algorithms are analyzed.
The detection results for normal and abnormal data obtained with the anomaly detection algorithms are shown in Table 1. Normal data are labeled "positive" and abnormal data "negative"; data detected as normal are denoted "1" and data detected as abnormal "0." Accuracy (Ac) is defined as

Ac = (TP + TN) / (TP + TN + FP + FN),

that is, the ratio of the normal data detected as normal plus the abnormal data detected as abnormal to the total number of records, i.e., the probability of correct detection. Based on the test data, the calculated accuracy of each algorithm is shown in Figure 8.
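The accuracy definition can be checked with a few lines (the confusion counts below are invented purely for illustration):

```python
def accuracy(tp, tn, fp, fn):
    """Ac = (normal detected as normal + abnormal detected as abnormal)
    divided by the total number of records."""
    return (tp + tn) / (tp + tn + fp + fn)

# e.g. 940 normal records kept, 45 anomalies caught,
# 10 false alarms, 5 missed anomalies
ac = accuracy(tp=940, tn=45, fp=10, fn=5)
# ac == 0.985
```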
As shown in Figure 8, as the data set grows, the impact of accidental errors gradually decreases, and the accuracy of each algorithm keeps improving until it stabilizes. Compared with CNN, DeepAnT, and SVM-IDS, the proposed DMAD-DVDMR algorithm reaches an anomaly detection accuracy above 94%, an improvement of 10.3%, 18.0%, and 17.2%, respectively.
As shown in Figure 9, which plots the anomaly detection response time of the different algorithms, the response time increases significantly once the data set grows to 15,000 records. Compared with the CNN, DeepAnT, and SVM-IDS methods, the response time of the proposed DMAD-DVDMR algorithm is reduced by 23.3%, 28.1%, and 36.1%, respectively.

Algorithm 1. Input: data set X = {x1, x2, ..., xn}; k: number of nearest neighbors; N: number of data blocks; θ: LOF threshold. Output: abnormal data and LOF values. Collect the points with lof_i > θ produced by Algorithm 2 into XX, compute lof_j for each x_j ∈ XX, and return the abnormal data and their LOF values.

As shown in Figure 10, when the amount of data is large, the execution efficiency of DMAD-DVDMR is clearly better than that of the other three algorithms, because Hadoop schedules multiple MapReduce tasks in parallel.

Algorithm 2. Input: data set X = {x1, x2, ..., xn}; k: number of nearest neighbors; θ: LOF threshold; N: number of data blocks. Output: the data with lof_i > θ. Initialize a Hadoop job, set the TaskMapReduce class, logically divide X into data blocks D1, D2, ..., Dn, and run FirstMapper in the j-th TaskMapReduce.

Analysis of Algorithmic Execution Efficiency.
To verify the scalability of the DMAD-DVDMR algorithm, this paper compares the execution efficiency under different numbers of computing nodes while expanding the data scale. As shown in Figure 11, for the same data set size, the execution efficiency of the algorithm improves as the number of cluster computing nodes increases. Therefore, when the data set grows, the DMAD-DVDMR algorithm can scale: its execution efficiency improves by adding computing nodes to the Hadoop cluster.

Conclusion
Based on a deep analysis of the characteristics of data in a cloud computing environment, this paper proposes DMAD-DVDMR. Through deep variational dimensionality reduction preprocessing and parallel anomaly detection, the method meets the computing efficiency requirements of large data; it also alleviates the computational pressure, improves execution efficiency across nodes, and ensures the availability of data. The next steps are as follows: (1) based on the proposed algorithm, further optimize its parameter settings, analyzing the effect of each parameter on the efficiency of the algorithm and their internal relationships, so as to improve efficiency further; (2) study the factors that cause accuracy to fluctuate, and take them into account during system modeling to reduce the negative impact of irrelevant factors on the efficiency and availability of the algorithm.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.