Aiming at problems of traditional anomaly detection algorithms such as slow training, poor prediction performance, and unstable detection results, a data mining method for anomaly detection based on a deep variational dimensionality reduction model and MapReduce (DMAD-DVDMR) in the cloud computing environment is proposed. First, the data are preprocessed by a dimensionality reduction model based on deep variational learning: while preserving as much of the data's information as possible, the dimensionality of the data is reduced and the computational burden lowered. Second, the data set stored on the Hadoop Distributed File System (HDFS) is logically divided into several data blocks, and the data blocks are processed in parallel following the MapReduce principle.
With the popularization of cloud computing, the "never-down machine" reliability expected in industrial applications has gradually changed from an aspiration into a practical need. To address this, improving the accuracy, sensitivity, and execution efficiency of anomaly detection algorithms in data mining has become increasingly important [
Aiming at the large volume and varied types of data in the cloud computing environment, existing research approaches the problem from many angles. The literature [
For anomaly detection over big data, machine-learning-based models have been proposed that classify data with different character attributes through linear or nonlinear methods, for example, the one-class support vector machine [
Unsupervised anomaly detection avoids the shortcomings of the above schemes: it does not need labeled samples, so it has higher practical value. For example, the local outlier factor (LOF) algorithm proposed in the literature [
The innovative points of the proposed DMAD-DVDMR method in the cloud computing environment are as follows. First, the data preprocessing method based on deep variational dimensionality reduction constructs, through training on labeled samples, a latent representation layer with high predictive ability; it preserves maximal information while reducing the dimension of the data features, providing a more complete preprocessing result for the subsequent anomaly detection step. Second, the local outlier factor detection method based on MapReduce avoids the excessive density caused by over-concentration of the data and greatly improves data processing capacity and efficiency. Third, the hybrid model combining deep variational dimensionality reduction with local outlier factor detection improves generalization ability, solves the computational problem caused by excessively high data dimensionality, and improves the retention of label information.
The rest of this article is organized as follows: the second section introduces the data preprocessing method based on the deep variational dimensionality reduction model; the third section introduces the local outlier factor detection algorithm based on MapReduce and stochastic gradient descent. Section
Sufficient dimensionality reduction (SDR) aims to find a low-dimensional representation of the data while retaining the information predictive of the label variables. The original work on SDR quantified this information using concepts from information theory and introduced an iterative algorithm to extract features that maximize it. SDR methods are usually applied to continuous target variables, but for discrete target variables, methods based on distance covariance can estimate the central subspace. In machine learning one often encounters the following regression target, namely the predicted value of the object's predictor label
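The SDR idea described above can be summarized compactly. A common formulation (a sketch of the standard notation, not taken verbatim from this paper) seeks a projection matrix B such that the label carries no further information about the data beyond the projection:

```latex
% SDR condition: the label y is conditionally independent of x given B^T x
y \;\perp\!\!\!\perp\; x \;\big|\; B^{\top} x
% information-theoretic variant: choose B to preserve predictive information
\max_{B} \; I\!\left(B^{\top} x ;\, y\right)
```

The second line corresponds to the information-maximizing feature-extraction view mentioned above.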
Deep variational dimensionality reduction model.
The variational autoencoder is a deep learning model that can efficiently maximize, at scale, the variational lower bound on the log-likelihood of the joint distribution. In this model, the conditional distribution is reparameterized by a neural network (the reparameterization trick). Compared with the standard variational autoencoder, the proposed model pays more attention to the encoding process and carries as much information as possible to distinguish the data during encoding, so we hope to maximize the joint distribution probability
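The reparameterization trick mentioned above can be sketched in a few lines. The following NumPy snippet is illustrative only (function names are ours, not the paper's): it draws a latent sample z = mu + sigma * eps so that gradients can flow through the encoder outputs, and evaluates the KL term of the variational lower bound.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) differentiably: z = mu + sigma * eps,
    with eps ~ N(0, I), so gradients flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) term of the variational lower bound."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.zeros(4)
log_var = np.zeros(4)                      # sigma = 1
z = reparameterize(mu, log_var, rng)
print(z.shape)                             # (4,)
print(kl_to_standard_normal(mu, log_var))  # zero KL for a standard normal
```

In a full model, `mu` and `log_var` would be outputs of the encoder network rather than constants.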
As the dimensionality of the generated data continues to increase and the data become increasingly complex, the curse of dimensionality has become a common problem [
As shown in Figure
Overall hybrid architecture model.
Based on the hybrid model shown in Figure
The MapReduce parallel programming model proposed by Google has become the main model for large-scale data processing because of its simplicity, scalability, and fault tolerance [
The MapReduce programming model realizes the above idea. The distributed computing task is abstracted into two phases, Map and Reduce; the Map and Reduce processing functions that developers implement for each phase are shown in Figure
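The two phases can be simulated in a few lines of Python. This is a toy single-process sketch of the programming model (a word-count-style example, not the paper's anomaly detection job): the map phase emits key-value pairs, a shuffle groups them by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, value) pairs; here, count occurrences per key."""
    for word in record.split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values that share a key."""
    return key, sum(values)

records = ["a b a", "b c"]
intermediate = [pair for rec in records for pair in map_phase(rec)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```

In a real Hadoop job, the shuffle step is performed by the framework between the Map and Reduce tasks.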
Working principle of MapReduce.
The MapReduce programming model hides specifics such as job control and process scheduling inside cluster management. Developers can therefore concentrate on program development with little or no concern for fragmentation, partitioning, network transmission, and I/O details. In this way, the reliability, ease of use, and fault tolerance of parallel computing are ensured. A MapReduce job uses a master-slave architecture as its operation mechanism: it generally comprises one master node and several slave nodes, plus client nodes for submitting and monitoring MapReduce jobs. The master node starts a JobTracker process, which is responsible for tracking job progress, including receiving jobs submitted by clients, distributing jobs as subtasks to slave nodes, and monitoring the execution results returned by those nodes [
The entire flow of MapReduce.
The real-time online monitoring component collects the current data for anomaly detection. The data are then compared with the threshold to determine whether they are abnormal: abnormal data are reported, while the remaining normal data are added to the R-tree and the oldest data are deleted from it. In this way, the model adjusts itself as the normal behavior of the system changes, achieving self-adaptation.
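The sliding-window self-adaptation described above can be sketched as follows. This is an illustrative stand-in only: a bounded deque plays the role of the R-tree, and a trivial distance-to-mean score replaces the real anomaly score.

```python
from collections import deque

def monitor(stream, threshold, window_size=5):
    """Keep a sliding window of recent normal data (a deque stands in
    for the R-tree); report values whose score exceeds the threshold
    and let the oldest normal point expire when the window is full."""
    window = deque(maxlen=window_size)   # oldest entry drops automatically
    anomalies = []
    for value in stream:
        baseline = sum(window) / len(window) if window else value
        score = abs(value - baseline)    # toy anomaly score
        if score > threshold:
            anomalies.append(value)      # report; keep out of the model
        else:
            window.append(value)         # normal data update the model
    return anomalies

print(monitor([1.0, 1.1, 0.9, 5.0, 1.0], threshold=2.0))  # [5.0]
```

Because anomalous points never enter the window, the baseline tracks only the evolving normal behavior, which is the self-adaptation property described above.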
LOF is a general and portable algorithm. We collect system information directly from the cloud computing platform for anomaly detection, such as CPU usage, memory usage, and other basic system metrics. LOF calculates an anomaly score for each detected data point, so users can choose a threshold according to the situation and find a suitable compromise between detection rate and false alarm rate. Moreover, LOF only needs to learn the current normal behavior and does not need to be trained on every kind of anomaly, giving it good adaptability to and recognition of new anomalies.
LOF describes the degree of abnormality of an object: it computes the degree of anomaly by comparing the density of the object with that of its neighbors.
To obtain the optimal weights in most supervised learning models, we create a cost (loss) function for the model and then choose an appropriate optimization algorithm to minimize it. Gradient descent is currently the most widely used optimization algorithm [
In the formula, the weights of network parameters are represented by
The proposed algorithm achieves efficient anomaly detection by combining the stochastic gradient descent algorithm with MapReduce, which ensures high efficiency: when abnormal records appear in the data set, the target model obtained by stochastic gradient descent enables fast detection. The basic idea is to distribute the data set across the distributed computing nodes, run stochastic gradient descent on each node in a Map subtask, and use a Reduce subtask to merge the per-node models into an updated model.
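The Map/Reduce split described above (per-node SGD, then model merging) can be sketched on a toy least-squares problem. This is a minimal single-process simulation under our own assumptions, not the paper's implementation: each "node" runs SGD on its data block and the Reduce step averages the resulting weight vectors.

```python
import numpy as np

def sgd_on_block(X, y, w0, lr=0.05, epochs=30, seed=0):
    """Map subtask: run stochastic gradient descent on one data block
    (least-squares loss), starting from the shared weights w0."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            grad = (X[i] @ w - y[i]) * X[i]   # per-sample gradient
            w -= lr * grad
    return w

def reduce_average(models):
    """Reduce subtask: merge per-node models by averaging the weights."""
    return np.mean(models, axis=0)

rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0])
X = rng.standard_normal((200, 2))
y = X @ w_true                                 # noiseless synthetic labels
blocks = np.array_split(np.arange(200), 4)     # 4 logical data blocks
w0 = np.zeros(2)
models = [sgd_on_block(X[b], y[b], w0, seed=s) for s, b in enumerate(blocks)]
w = reduce_average(models)
print(np.round(w, 2))  # close to [ 2. -1.]
```

Simple averaging is one of several possible merge strategies; the key point is that each block is processed independently, which is what makes the scheme parallelizable under MapReduce.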
The principle of subspace clustering is to reduce high-dimensional data to low-dimensional data, which makes subsequent data analysis tractable. Because outliers exist, subspace clustering is disturbed. The solution is to introduce
Many existing algorithms are too absolute in judging anomalies: a point is either normal or abnormal. In practice, however, many test data are hard to judge absolutely, which results in a high false alarm rate or a high missed detection rate and makes it difficult to tune the severity of anomaly detection. We therefore need a degree value for the judgment: different thresholds can be chosen for different usage environments, and at output time only the points whose degree exceeds the threshold are reported as anomalies [
All kinds of clustering algorithms have some ability to detect anomalies. Their common problem is that most clustering algorithms use a global distance criterion as the basis of detection, whereas an anomaly is inherently local: it relates to the distribution of neighbors within a certain range. The mechanism by which clustering algorithms find anomalies is therefore limited. LOF instead judges an anomaly by its local density. To describe the local characteristics captured by LOF, consider the simple two-dimensional data set in Figure
Advantage of the LOF approach.
Because of the low density of cluster
The LOF algorithm is described as follows:

Step 1: calculating the k-distance of object p. For any natural number k, the k-distance of p, denoted k-distance(p), is the distance d(p, o) between p and some object o such that:
① at least k objects o′ satisfy d(p, o′) ≤ d(p, o);
② at most k − 1 objects o′ satisfy d(p, o′) < d(p, o).

Step 2: calculating the k-distance neighborhood of object p. The set of objects whose distance from p does not exceed k-distance(p) is the k-distance neighborhood of p: N_k(p) = {q | d(p, q) ≤ k-distance(p)}.

Step 3: calculating the reachable distance of object p with respect to object o. For each object o, the reachable distance is reach-dist_k(p, o) = max{k-distance(o), d(p, o)}. The reachable distance is illustrated in Figure

Reach-dist(p, o) for different objects o.

Step 4: computing the local reachable density of object p. So far, we have calculated reach-dist_k(p, o) for every o in N_k(p). The local reachable density of object p is the reciprocal of the average reachable distance from p to the objects in its k-distance neighborhood: lrd_k(p) = |N_k(p)| / Σ_{o ∈ N_k(p)} reach-dist_k(p, o).

Step 5: computing the LOF of object p. The LOF of object p is the average ratio of the local reachable densities of p's neighbors to that of p itself: LOF_k(p) = (1 / |N_k(p)|) Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p). The LOF of object p thus measures how much p's density deviates from that of its neighborhood: a value close to 1 indicates a density similar to the neighbors (a normal point), while a value significantly greater than 1 indicates an outlier.
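As a check on these definitions, the standard LOF computation can be written compactly in NumPy. This is an illustrative sketch, not the paper's MapReduce implementation; for simplicity it takes exactly the k nearest neighbors, ignoring the tie handling of the exact k-distance neighborhood.

```python
import numpy as np

def lof_scores(X, k):
    """Local outlier factor for each row of X, following the standard
    definitions: k-distance, k-neighborhood, reachable distance,
    local reachable density (lrd), and LOF."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    order = np.argsort(D, axis=1)
    neighbors = order[:, :k]             # indices of the k nearest neighbors
    n = len(X)
    k_dist = D[np.arange(n), order[:, k - 1]]

    # reach-dist_k(p, o) = max{k-distance(o), d(p, o)}
    reach = np.maximum(k_dist[neighbors], D[np.arange(n)[:, None], neighbors])
    lrd = k / reach.sum(axis=1)          # local reachable density
    # LOF_k(p): mean ratio of the neighbors' lrd to p's own lrd
    return lrd[neighbors].mean(axis=1) / lrd

# four points in a tight cluster plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
scores = lof_scores(X, k=2)
print(scores.round(2))  # cluster points score about 1; the isolated point far above 1
```

The cluster points obtain LOF values near 1 because their densities match their neighbors', while the isolated point has a much lower local reachable density than its neighbors and therefore a large LOF.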
The LOF algorithm is a density-based anomaly detection algorithm and is computationally expensive. Its definition of local reachable density rests on a hypothesis: there are no more than or equal to
Input: N — the number of data blocks; the data set.
Output: the abnormal data and their LOF values.

Initialize a Hadoop Job
Set the MapReduce task classes
Logically divide the data set into N blocks
FirstMapper:
    for each data point p do
        Calculate the distances dis(p, o) to the other points o
        Sort dis
        for each dis do
            if it lies within the k-distance neighborhood then add o to N_k(p)
        end
        Calculate k-distance(p)
    end
FirstReducer: record the neighborhood results
SecondMapper:
    for each point p do emit the reachable distances end
SecondReducer:
    for each value do compute the local reachable density lrd(p) end
ThirdMapper:
    for each point p do compute lof(p) end
    if lof(p) exceeds the threshold then mark p as abnormal
ThirdReducer:
    for each value do
        Sort the points by LOF and output the abnormal data
    end
End
To address this shortcoming of the LOF algorithm, the concept of an improved proximity distance is introduced, defined by two conditions: in the sample space, at least, and in the sample space, at most
This improved proximity distance effectively realizes data classification in big data scenarios. Through a more accurate definition of
The experimental platform is configured as follows: 3 PCs connected via LAN; each node runs CentOS 7 under VMware Workstation Pro 12.0.0 on Windows, with JDK 1.8 and Hadoop 2.7.4. All algorithms in this paper are implemented in the Java language in the Eclipse environment. The experimental environment is a Hadoop cluster based on the cloud platform. Using the KDD99 data set, the proposed DMAD-DVDMR algorithm is compared with the convolutional neural network (CNN) algorithm in the literature [
The KDD99 data set was collected from network connections. It contains 41 attributes and about 5 million packet records, divided into an annotated training set and an unlabeled set. The data set covers 39 attack labels in total: the training set contains 22 of these attack categories, used to train the dimensionality reduction model based on deep variational learning, while the test set contains 17 attack types that do not appear in the training set. This allows evaluating the generalization ability of the detection model, i.e., how well it handles and prevents unknown attacks. A subset of about 500,000 records is chosen as the experimental data set.
The basic performance of the proposed DMAD-DVDMR method was verified in three aspects, algorithm robustness, accuracy, and response time, against the CNN, DeepAnt, and SVM-IDS methods. The AUC indicator is used as the robustness criterion for the data mining anomaly detection methods. AUC (area under curve) is defined as the area under the ROC (receiver operating characteristic) curve; classifiers with a larger AUC exhibit more robust anomaly detection performance. As the label position of the abnormal data changes, the comparison of the AUC indicators on the KDD99 data set is shown in Figure
AUC indicators of different algorithms on the KDD99 dataset.
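The AUC criterion above can be computed directly as a rank statistic: it equals the probability that a randomly chosen abnormal point receives a higher anomaly score than a randomly chosen normal point. The following sketch uses made-up scores, not the experimental values.

```python
def auc(scores_normal, scores_abnormal):
    """AUC as the Mann-Whitney statistic: the fraction of
    (abnormal, normal) pairs in which the abnormal score is higher
    (ties count half)."""
    wins = 0.0
    for a in scores_abnormal:
        for n in scores_normal:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(scores_abnormal) * len(scores_normal))

# toy anomaly scores (illustrative values only)
normal = [0.1, 0.2, 0.3, 0.4]
abnormal = [0.35, 0.8, 0.9]
print(auc(normal, abnormal))  # 11/12 ≈ 0.917
```

A perfect detector would score every abnormal point above every normal one (AUC = 1), while a random detector gives AUC = 0.5, which is why a larger AUC indicates more robust detection.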
Next, we analyze the accuracy and response time of the algorithms. The detection results on normal and abnormal data using the anomaly detection algorithm are shown in Table
Data test result.
Result | Detected normal | Detected abnormal
---|---|---
Normal data | TN (true negative) | FP (false positive)
Abnormal data | FN (false negative) | TP (true positive)
Accuracy (Ac) is defined as Ac = (TP + TN) / (TP + TN + FP + FN). It is the ratio of normal data detected as normal plus abnormal data detected as abnormal to the total data, i.e., the probability of correct detection. According to the test data, the accuracy calculated for each algorithm is shown in Figure
Accuracy comparison of the algorithm.
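The accuracy measure defined above is a one-line computation over the confusion-matrix counts. The counts below are illustrative placeholders, not the paper's measurements.

```python
def accuracy(tp, tn, fp, fn):
    """Ac: normal detected as normal plus abnormal detected as abnormal,
    divided by all data."""
    return (tp + tn) / (tp + tn + fp + fn)

# illustrative counts only
print(accuracy(tp=90, tn=880, fp=20, fn=10))  # 0.97
```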
As shown in Figure
As shown in Figure
Response time of the algorithm.
By comparing the execution time of the DMAD-DVDMR algorithm with CNN, DeepAnt, and SVM-IDS on data sets of the same size, the efficiency of the DMAD-DVDMR algorithm is verified.
As shown in Figure
Efficiency comparison of the algorithm.
To verify the scalability of the DMAD-DVDMR algorithm, this paper compares execution efficiency under different numbers of computing nodes as the data scale grows.
As shown in Figure
Execution efficiency comparison of the algorithm.
Based on a deep analysis of data characteristics in the cloud computing environment, this paper proposes DMAD-DVDMR. Through deep variational dimensionality reduction preprocessing and parallel anomaly detection, the method meets the computing-efficiency requirements of large data sets. It also alleviates computational pressure, improves execution efficiency across various numbers of nodes, and ensures the availability of the data.
The next steps are as follows: (1) based on the proposed algorithm, further optimize its parameter settings, analyze how each parameter affects the efficiency of the algorithm, and improve its efficiency further; (2) study the factors that cause accuracy fluctuations and account for them during system modeling, reducing the negative impact of irrelevant factors on the efficiency and availability of the algorithm.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare no conflicts of interest.
This study was financially supported by the National Social Science Foundation of China. The project name is “Online Estimation and Whole Process Management of Short-Circuit Current Level in Active Distribution Network” (no. 15BGL040 (51577018)).