A Novel System Anomaly Prediction System Based on Belief Markov Model and Ensemble Classification

Computer systems are becoming extremely complex, while system anomalies dramatically influence the availability and usability of systems. Online anomaly prediction is an important approach to managing imminent anomalies, and its accuracy relies on precise system monitoring data. However, precise monitoring data is not easily achievable because of widespread noise. In this paper, we present a method which integrates an improved Evidential Markov model and ensemble classification to predict anomalies for systems with noise. Traditional Markov models use explicit state boundaries to build the Markov chain and then make predictions of different measurement metrics. A problem arises when data comes with noise because even a slight oscillation around the true value will lead to very different predictions. The Evidential Markov chain method is able to deal with noisy data but is not suitable for complex data stream scenarios. The Belief Markov chain that we propose extends the Evidential Markov chain and can cope with noisy data streams. This study further applies ensemble classification to identify system anomalies based on the predicted metrics. Extensive experiments on anomaly data collected from 66 metrics in PlanetLab have confirmed that our approach can achieve high prediction accuracy and time efficiency.


Introduction
As computer systems grow increasingly complicated, they become more vulnerable to various anomalies such as performance bottlenecks and service level objective (SLO) violations [1]. Computer systems therefore need to be more capable of managing anomalies under time pressure and of avoiding or minimizing system unavailability through continuous monitoring. Anomaly management methods can be classified into two categories: passive methods and proactive methods. Passive methods notify the system administrator only when errors or faults are detected. These approaches are appropriate for managing anomalies that can be easily measured and fixed in a simple system. However, in today's dynamic and complex computer systems, detecting some anomalies may have a high cost, which is unacceptable for continuously running applications. Proactive methods take preventive actions when anomalies are imminent; thus, they are more appropriate for systems that need to avert the impact of anomalies and achieve continuous operation. Nowadays, proactive methods are preferred in both academic research and real-world applications.
Previous work has addressed the problem of system anomaly prediction, which can be categorized as data-driven methods, event-driven methods, and symptom-driven methods [2].
Event-driven methods directly analyze the errors or failures that events report and use error reports as input data to predict future system anomalies. Salfner and Malek use error reports as input and then perform a trend analysis to predict the occurrence of failures in a telecommunication system by determining the frequency of error occurrences [3]. Kiciman and Fox use a decision tree to identify faulty components in a J2EE application server by classifying whether requests are successful or not. These approaches rest on the basic assumption that anomaly-prone system behavior can be identified by the characteristics of anomalies [4]. This is why only recurring anomalies that appear in the error reports can be predicted by event-driven methods.
Data-driven methods learn from the temporal and spatial correlation of anomaly occurrences. They aim at recognizing the relationship between upcoming failures and the occurrence of previous failures. Zhang and Ma use a modified KPCA method to diagnose anomalies in nonlinear processes [5]. In the nonlinear fault detection scenario, they utilize statistical analysis to improve the learning techniques [6], which is also applicable to large-scale fault diagnosis processes [7]. Liang et al. exploit these correlation characteristics of anomalies on IBM's BlueGene/L [8]. They find that the occurrence of a failure is strongly correlated with the time stamps and locations of other failures in a cluster environment. Zhang et al. propose a hybrid prediction technique which uses model checking techniques; an operational model is explored to check whether a desirable temporal property is satisfied or violated by the model itself [9]. To conclude, the basic idea of data-driven methods is that upcoming anomalies are related to the occurrence of previous ones.
Symptom-driven methods analyze workload-related data such as input workload and memory workload in order to predict future system resource utilization. Tan and Gu [10] monitor a series of run-time metrics (CPU, memory, I/O usage, and network), use a discrete-time Markov chain to forecast the system metrics in the future, and finally predict the system state based on Naïve Bayesian classification. Luo et al. [11] build an autoregressive model using various parameters from an Apache web server to predict future system resource utilization; failures are predicted by detecting resource exhaustion.
Efficient proactive anomaly management relies on system monitoring data. The system metrics generated by monitoring infrastructures arrive continuously and are invariably noisy, so one big challenge is to provide highly accurate and efficient system anomaly prediction for noisy monitoring data streams. Recently, some approaches have been proposed for system anomaly prediction using the discrete-time Markov chain (DTMC) [10,12]. However, this work does not consider the issue that monitoring data may oscillate around the real value, as mentioned previously. A DTMC, which uses explicit state boundaries, will produce significantly different values even when the metric oscillates only slightly around a boundary. Soubaras [13] proposed the Evidential Markov chain model, which extends DTMC to overcome the problem of noisy values around explicit state boundaries caused by inaccurate monitoring metrics. The problem with the Evidential Markov chain is that, although it works excellently in a static data scenario, it cannot be applied directly to stream data. Its fixed transition matrix is too restrictive for continuously changing stream data and incurs an enormous amount of calculation.
In this paper, we present the design and implementation of an approach to solve the system anomaly prediction problem on noisy data streams. We first present an improved belief Markov chain (BMC) that fits a data stream scenario. We use a stream-based $k$-means clustering algorithm [14] to dynamically maintain and generate the Markov transition matrix. Only the information of microclusters is stored after clustering, and each newcomer either falls into one of the $k$ groups or establishes a new one. Compared to the Evidential Markov chain method, in which all the data has to be stored and the Markov states recalculated every time a new data point arrives, our approach is time efficient and more feasible in a highly dynamic and complex system. We then employ the aggregate ensemble classification method [15] to determine whether the system will turn anomalous in the future. Aggregate ensemble classification can address the problem of incorrect anomaly labels in a continuously running system.
Extensive experiments on the PlanetLab dataset [16] with different parameter settings show that, on average, BMC achieves a 14.8% smaller mean prediction error than the DTMC method used in various previous works [10,12,17,18]. Our system anomaly prediction method (SAPredictor), which combines BMC and aggregate ensemble classification, is shown to achieve better prediction performance than other prediction models, for example, DTMC+Naïve Bayes, DTMC+KNN, and DTMC+C4.5. SAPredictor demonstrates the best performance in the three key criteria, namely, 71.6% for precision, 84.6% for recall, and 77.5% for F-measure.
The main contributions of this paper are summarized as follows.
(1) We propose the belief Markov chain by improving the Evidential Markov model using a stream-based $k$-means clustering algorithm and make it more suitable for system metrics prediction on noisy data streams.
(2) We integrate the belief Markov chain and aggregate ensemble classification as SAPredictor to predict system anomalies.
(3) We validate the effectiveness of SAPredictor by extensive experiments on real system data.
The rest of this paper is organized as follows. Section 2 introduces our SAPredictor method. Section 3 presents the experiments and analyzes the results. Finally, we conclude and give some future research directions in Section 4.

Approach Overview
In this section, we present the detailed design of SAPredictor. We first describe the problem of system anomaly prediction and then propose our SAPredictor method, which is composed of two components: the belief Markov chain model and the aggregate ensemble classification model. The belief Markov chain model is used to predict the changing pattern of the measurement metrics; aggregate ensemble classification is a supervised learning method which employs multiple classifiers and combines their predictions. In this work, we use a sliding window to partition the system metrics stream into chunks and then train the belief Markov chain and the aggregate ensemble learning model on the history. The future system status is predicted by feeding the predicted future metrics into the classification model.

Problem Statement.
For a system, we have a vector of observations at time $t$ for the system metrics, $X_t = [x_{1,t}, x_{2,t}, \ldots, x_{n,t}]$. $X_t$ is a vector that contains $n$ system metric time series at time $t$, where $x_{i,t}$ ($i = 1, 2, \ldots, n$) is the $i$th metric. We label $X_t$ at time $t$ as normal (state 0) or anomaly (state 1) by monitoring the system state at time $t$. The system anomaly prediction problem we focus on in this paper is whether $X_t$ will fall into anomaly status in the next $s$ steps, where $s > 0$ and $s \in \mathbb{N}$. To solve this problem, we first need to forecast the future value $x_{i,t+s}$ of each metric. Then, we train an ensemble classifier EC based on a sliding window $[x_{i,t-w+1}, \ldots, x_{i,t}]$ of $X_t$, where $w$ is the size of the sliding window. Finally, we use EC to test on $x_{i,t+s}$ ($i = 1, 2, \ldots, n$) and predict the state label of $X_{t+s}$.

SAPredictor Approach.
Figure 1 describes the SAPredictor system anomaly prediction approach. $n$ measurement metrics (e.g., CPU, memory, I/O usage, and network) are collected from the system continuously. Then, the collected system metric streams are partitioned into chunks by a sliding window. The current and history chunks are used to train the belief Markov chain model and the aggregate ensemble learning model. Then, the future system metrics are predicted by the belief Markov chain model, and, with these metrics as input to the aggregate ensemble classification model, we can ascertain whether the system will fall into anomaly in the future. The belief Markov chain and aggregate ensemble classification are presented in the following subsections.

System Metrics Value Prediction.
In this section, we first explain why the Evidential Markov chain, which is based on the Dempster-Shafer theory [19], is preferred over the discrete-time Markov chain in dealing with system anomaly prediction for noisy data, and then we explain the advantages of our belief Markov chain method compared to the Evidential Markov method in a data stream environment.
When we build a discrete-time Markov chain model, it is necessary to divide all the data into discrete states. Traditional discretization techniques used in the discrete-time Markov chain include equal-width and equal-depth. Both techniques generate states with explicit boundaries using all the data. However, the system metrics being monitored are usually imprecise due to system noise and measurement error. Thus, a discrete-time Markov chain, which uses explicit boundaries to divide the states, will generate highly different prediction results even if the initial values are almost the same. The Evidential Markov model [13] is a big improvement in that it is capable of coping with noisy data. The following is an example of the explicit boundary problem in the discrete-time Markov chain.
In one possible situation, we have a metric ranging in [0, 150], and we use the equal-width approach to discretize the range into three bins, namely, [0, 50), [50, 100), and [100, 150], which correspond to the states $S_1$, $S_2$, and $S_3$. The transition matrix of the resulting Markov chain has the form
$$P = \begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{pmatrix}. \quad (1)$$
Here, each element $p_{ij}$ in matrix $P$ denotes the probability of transition from state $S_i$ to state $S_j$. When we use the discrete-time Markov chain to predict a future value, a vector $\pi_t = [\pi_{t,1}, \pi_{t,2}, \pi_{t,3}]$ is needed to denote the probability of the metric being in each state at time $t$. If we have an initial value of 99, which is in state $S_2$, then the corresponding probability vector is $\pi_0 = [0, 1, 0]$. We can calculate the probability vector $\pi_1$ after one time unit as
$$\pi_1 = \pi_0 P. \quad (2)$$
Here, with the transition probabilities of this example, the probability vector $\pi_1$ indicates that the initial value will most likely transfer into $S_3$, and the predicted value after one step will be 125 = (100 + 150)/2, the mean of state $S_3$. However, if the initial value is 101, then the vector $\pi'_0$ will be [0, 0, 1]. By applying (2) again, it turns out that the prediction falls into state $S_2$, with a predicted value of 75 = (50 + 100)/2 in the next step:
$$\pi'_1 = \pi'_0 P. \quad (3)$$
Note that there is only a slight difference between the initial values 99 and 101, yet the forecasted values after one step differ widely, 125 versus 75.
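The boundary effect in this example can be reproduced with a few lines of Python. This is a minimal sketch: the bin boundaries and the initial values 99 and 101 come from the example, but the transition probabilities in `P` are illustrative assumptions chosen so that $S_2$ most likely moves to $S_3$ and $S_3$ most likely moves to $S_2$; they are not the values of the example's matrix. The names `state_of` and `one_step` are hypothetical helpers.

```python
# Bin boundaries of the equal-width discretization: [0,50), [50,100), [100,150].
BOUNDS = [0, 50, 100, 150]

# ASSUMED transition matrix (rows: from S1, S2, S3); illustrative values only.
P = [
    [0.6, 0.3, 0.1],
    [0.1, 0.3, 0.6],
    [0.1, 0.6, 0.3],
]

def state_of(value):
    """Index of the bin containing value (assumes 0 <= value <= 150)."""
    for i in range(len(BOUNDS) - 1):
        if BOUNDS[i] <= value < BOUNDS[i + 1]:
            return i
    return len(BOUNDS) - 2  # the upper bound 150 belongs to the last bin

def one_step(value):
    """Return (pi_1, predicted value) after one step of the chain."""
    k = len(P)
    pi0 = [0.0] * k
    pi0[state_of(value)] = 1.0  # crisp state assignment
    # Row vector times matrix: pi_1 = pi_0 P.
    pi1 = [sum(pi0[i] * P[i][j] for i in range(k)) for j in range(k)]
    nxt = pi1.index(max(pi1))   # most likely next state
    return pi1, (BOUNDS[nxt] + BOUNDS[nxt + 1]) / 2

_, from_99 = one_step(99)
_, from_101 = one_step(101)
print(from_99, from_101)
```

Under these assumed probabilities, two initial values that differ by only 2 end up with predictions that differ by 50, because the crisp assignment places them in different rows of `P`.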
As the example shows, the discrete-time Markov chain uses explicit state boundaries, and it produces very different predicted values if the original metric lies near a state boundary. To solve this problem, we propose the belief Markov chain based on the Dempster-Shafer theory. The Dempster-Shafer theory is a theory of inexact inference. It can handle the uncertainty caused by unknown prior knowledge and extends the basic event space to its power set. The detailed definitions for the Dempster-Shafer theory [19] are as follows.
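Definition 1 (frame of discernment). Suppose that $\Omega$ is the exhaustive set of values of a random variable $X$, so $\Omega = \{\omega_1, \omega_2, \ldots, \omega_N\}$, and the elements in $\Omega$ are mutually exclusive. Then, the set of all possible subsets of $\Omega$ is called a frame of discernment $\Theta$:
$$\Theta = 2^{\Omega} = \{\emptyset, \{\omega_1\}, \ldots, \{\omega_N\}, \{\omega_1, \omega_2\}, \ldots, \Omega\}. \quad (4)$$
We use $A_i$ ($A_i \in \Theta$, $i \in [0, N]$) to represent a subset in the power set of $\Omega$ which contains $i$ elements.
Definition 2 (mass function). Given $\Omega$ and $\Theta$, for every subset $A$ of $\Omega$, if the following statements are satisfied, then the function $m$ is called the mass function on $\Theta$:
$$m: \Theta \to [0, 1], \quad m(\emptyset) = 0, \quad \sum_{A \in \Theta} m(A) = 1. \quad (5)$$
Definition 3 (transferable belief model). Suppose that we have a discernment frame $\Theta$ and a mass function $m$ on $\Theta$. Then, the probability of each value $\omega$ in $\Omega$ can be calculated by the transferable belief model:
$$P(\omega) = \sum_{A \in \Theta,\ \omega \in A} \frac{m(A)}{|A|}. \quad (6)$$
The subsets of $\Omega$ include both single-event sets $\{\omega_i\}$ and multiple-event combinations $\{\omega_i, \omega_j, \ldots, \omega_k\}$. This is why we need the transferable belief model to calculate the probability of one single random variable.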
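Definitions 2 and 3 can be made concrete with a small Python sketch. It assumes the usual reading of the transferable belief model, in which the mass of each subset is shared equally among its elements; the function name `pignistic` and the mass values are illustrative assumptions, not taken from the paper.

```python
def pignistic(mass):
    """Turn a mass function into per-outcome probabilities.

    mass maps frozenset-of-outcomes -> mass; the masses must sum to 1
    and the empty set must carry no mass (Definition 2).
    """
    if abs(sum(mass.values()) - 1.0) > 1e-9:
        raise ValueError("masses must sum to 1")
    if mass.get(frozenset(), 0.0) != 0.0:
        raise ValueError("the empty set must have zero mass")
    prob = {}
    for subset, m in mass.items():
        for outcome in subset:
            # Each outcome receives an equal share of the subset's mass.
            prob[outcome] = prob.get(outcome, 0.0) + m / len(subset)
    return prob

# Hypothetical masses: half the belief on S1, a quarter on S2,
# and a quarter on "either S1 or S2" (the uncertain part).
mass = {
    frozenset({"S1"}): 0.5,
    frozenset({"S2"}): 0.25,
    frozenset({"S1", "S2"}): 0.25,
}
probs = pignistic(mass)
print(probs)
```

The mass placed on the two-element subset is what lets a value near a boundary contribute to both neighboring states instead of being forced into one of them.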
Figure 2 illustrates a metric divided into $N$ states, $S = \{S_1, S_2, S_3, \ldots, S_N\}$, where each pair of adjacent states has a state $S_{i,i+1}$, which means that the value is in the cross-region between state $S_i$ and state $S_{i+1}$. When using the BMC model to predict, the initial metric may belong entirely to a single state or belong to the cross-region of two adjacent states. So, the discernment frame of this problem can be simplified to
$$\Theta_{BMC} = \{S_1, S_{1,2}, S_2, S_{2,3}, \ldots, S_{N-1,N}, S_N\}. \quad (7)$$
Then, we declare the mass function to assign a probability to each subset in $\Theta_{BMC}$. Any function that satisfies (5) can be used as the mass function. The probability of each event in $S$ can be calculated as
$$P(S_i) = m(S_i) + \frac{1}{2} m(S_{i-1,i}) + \frac{1}{2} m(S_{i,i+1}), \quad (8)$$
where the mass of a cross-region that does not exist is taken as zero. At last, we need to infer the transition matrix, which describes the probabilities of moving from one state to the others, as we did for the discrete-time Markov chain. Each element $t_{ij}$ of the transition matrix $T_{BMC}$ denotes the probability that the metric, currently in state $S_i$, moves to state $S_j$.
However, the Evidential Markov chain needs to store all the data and recalculate the Markov states when new data arrives. This is neither time efficient nor feasible for systems that need real-time response, especially for data stream applications. Thus, we improve the Evidential Markov chain using a stream-based $k$-means clustering method. The arriving data points can be mapped onto $k$ states using a data stream clustering algorithm, where each cluster represents a Markov state. For each cluster $i$ representing state $S_i$, we need to store a transition count vector $c_i$. All transition counts can be seen as a $k \times k$ transition count matrix $C$, where $k$ is the number of clusters. As we use stream clustering, there is a list of operations on clusters: adding a new data point to an existing cluster, creating a new cluster, deleting clusters, merging clusters, and splitting clusters. We use the Jaccard measure [20] as the dissimilarity threshold to detect clusters. Thus, the $k$ states change adaptively to fit the arriving data, which is also an advantage compared to the Evidential Markov chain method.
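The idea of keeping only cluster summaries and a transition count matrix can be sketched as follows. This is a deliberately simplified stand-in, not the stream clustering algorithm of [14]: it assumes a single one-dimensional metric, a fixed number of clusters, and a sequential centroid update, and it omits the create, delete, merge, and split operations as well as the cross-region masses. The class name `StreamingMarkovStates` and its methods are hypothetical.

```python
class StreamingMarkovStates:
    """Toy online state model: centroids plus a transition count matrix."""

    def __init__(self, initial_centroids):
        self.centroids = [float(c) for c in initial_centroids]
        k = len(self.centroids)
        self.sizes = [1] * k                       # points absorbed per cluster
        self.counts = [[0] * k for _ in range(k)]  # counts[i][j]: moves i -> j
        self.prev = None                           # state of the previous point

    def update(self, x):
        """Assign x to the nearest centroid and record the transition."""
        i = min(range(len(self.centroids)),
                key=lambda j: abs(self.centroids[j] - x))
        # Sequential mean update: only the summary is kept, not the data.
        self.sizes[i] += 1
        self.centroids[i] += (x - self.centroids[i]) / self.sizes[i]
        if self.prev is not None:
            self.counts[self.prev][i] += 1
        self.prev = i
        return i

    def transition_matrix(self):
        """Row-normalize the counts; rows without observations stay zero."""
        matrix = []
        for row in self.counts:
            total = sum(row)
            matrix.append([c / total if total else 0.0 for c in row])
        return matrix

model = StreamingMarkovStates([10, 50, 90])
states = [model.update(x) for x in [12, 48, 52, 88, 11]]
T = model.transition_matrix()
print(states, T)
```

Each arriving value costs one nearest-centroid search and one counter increment, which is the property that makes this kind of bookkeeping attractive for streams compared with recomputing states from all stored data.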

System Status Classification.
In this section, we first explain why we choose ensemble classification to forecast the system status and then how the aggregate ensemble method can address the concept drift and noisy data problems in data streams. Tan and Gu [10] apply a single statistical classifier to a static dataset to perform classification. Though this approach works well on a static dataset, it is not applicable in a dynamic environment where system logs are generated continuously and even the underlying data generating mechanism and the causes of anomalies are constantly changing. To capture the time-evolving anomaly pattern, many solutions have been proposed to build classification models from data streams.
One simple model uses online incremental learning [11,21]. Incremental learning methods deliver a single learning model to represent an entire data stream and update the model continuously as new data arrives. Ensemble classification regards the data stream as several separate data chunks and trains classifiers based on these chunks using different learning algorithms; the ensemble classifier is then built through voting among these base classifiers. Although these models have been shown to be efficient and accurate, they depend on the assumption that the data stream being learned is of high quality, and they do not consider data errors. However, real-world data streams, such as system monitoring data streams and sensor network data streams, always contain erroneous data values. As a result, the traditional online incremental model is likely to lose accuracy on data streams that contain erroneous values.
Ensemble learning is a supervised method which employs multiple learners and combines their predictions. Unlike incremental learning, ensemble learning trains a number of models and gives the final prediction based on classifier voting. Because the final prediction is based on a number of base classifiers, ensemble learning can adaptively and rapidly address the concept drift and erroneous data problems in data streams. For this reason, we choose to use ensemble classification.
In summary, ensemble classification can be divided into two categories: horizontal ensemble and vertical ensemble classification [15]. The horizontal ones build classifiers using several buffered chunks, while the vertical ones build classifiers using different learning algorithms on the current chunk.
The vertical ensemble is shown in Figure 3. It uses $m$ different classification algorithms (e.g., we simply set $m = 3$) to build classifiers on the current chunk and then uses the results of these classifiers to form an ensemble classification model. The vertical ensemble only uses the current chunk to build classifiers, and its advantage is that it uses different algorithms to build the classifier model, which can decrease the bias error between the classifiers. However, the vertical ensemble assumes that the data stream is errorless. As we discussed before, real-world data streams always contain errors. So, if the current chunk contains mostly noisy data, then the resulting ensemble is likely to be inaccurate.
The horizontal ensemble is shown in Figure 4. The data stream is separated into $n$ consecutive chunks (e.g., $D_1$ and $D_2$ are history chunks, and $D_3$ is the current chunk), and the aim of ensemble learning is to build classifiers on these $n$ chunks and predict the data in the yet-to-arrive chunk ($D_4$ in this figure). The advantage of the horizontal structure is that it can handle noisy data in the stream because the prediction for the newly arriving data chunk depends on the average over different chunks. Even if noisy data deteriorates some chunks, the ensemble can still generate relatively accurate prediction results.
The disadvantage of the horizontal ensemble is that the data stream is continuously changing, and the information contained in the previous chunks may be invalid, so that using these old-concept classifiers will not improve the overall prediction result.
Because of the limitations of both horizontal and vertical ensembles, in this paper, we use a novel ensemble classification method which uses $m$ different learning algorithms to build classifiers on $n$ buffered chunks and thus trains $m$-by-$n$ classifiers, as Figure 5 shows. By building an aggregate ensemble, it is capable of handling a real-world data stream containing both concept drift and data errors.
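The aggregate structure, several learning algorithms applied to several buffered chunks with a vote over all resulting classifiers, can be sketched in plain Python. Everything specific here is an assumption made for illustration: the three toy learners (`train_majority`, `train_nearest_mean`, `train_stump`) stand in for real algorithms, the data is a single one-dimensional metric with labels 0 (normal) and 1 (anomaly), and an unweighted majority vote is used as the combination rule.

```python
from collections import Counter

def train_majority(chunk):
    """Always predict the most frequent label of the chunk."""
    label = Counter(y for _, y in chunk).most_common(1)[0][0]
    return lambda x: label

def train_nearest_mean(chunk):
    """Predict the label whose mean feature value is closest to x."""
    sums = {}
    for x, y in chunk:
        total, n = sums.get(y, (0.0, 0))
        sums[y] = (total + x, n + 1)
    means = {y: total / n for y, (total, n) in sums.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def train_stump(chunk):
    """Pick the threshold minimizing errors of the rule 'x > thr -> 1'."""
    best_thr, best_err = None, None
    for thr, _ in chunk:
        err = sum((x > thr) != bool(y) for x, y in chunk)
        if best_err is None or err < best_err:
            best_thr, best_err = thr, err
    return lambda x: int(x > best_thr)

def aggregate_ensemble(chunks, learners):
    """Train every learner on every chunk and combine by majority vote."""
    models = [learn(chunk) for chunk in chunks for learn in learners]
    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict

# Two clean chunks and one chunk whose labels were flipped (noise).
clean_1 = [(10, 0), (20, 0), (30, 0), (80, 1), (90, 1)]
clean_2 = [(15, 0), (25, 0), (35, 0), (85, 1), (95, 1)]
noisy = [(12, 1), (22, 1), (32, 1), (82, 0), (92, 0)]

predict = aggregate_ensemble(
    [clean_1, clean_2, noisy],
    [train_majority, train_nearest_mean, train_stump],
)
print(predict(88), predict(18))
```

In this toy setting the nine classifiers outvote the ones trained on the corrupted chunk, which illustrates why spreading the vote over both algorithms and chunks can tolerate a noisy chunk.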

Experiment and Result
Experiment Setup.
We evaluate our SAPredictor method on anomaly data collected from a realistic system: PlanetLab. PlanetLab [22] is a global research network that supports the development of new network services. The PlanetLab dataset [16] which we use in this paper contains 66 system-level metrics such as CPU load, free memory, and disk usage, as shown in Table 1. The sampling interval is 10 seconds. There are 50162 instances, among which 8700 are labeled as anomalies.
Our experiments were conducted on a 2.6-GHz Intel Dual-Core E5300 with 4 GB of memory running Ubuntu 10.04. We use sliding window (window size = 1000 instances) based validation because, in a real system, the labeled instances are sorted in chronological order of collection time. The reason that we do not use cross-validation is that it randomly divides the dataset into pieces without considering the chronological order. Under such circumstances, it is possible that current data is used to predict past data, which does not make sense. Thus, sliding window validation is more appropriate for our experiments.
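The difference from cross-validation can be shown with a short sketch of chronological, chunk-based validation. It assumes non-overlapping windows and a scheme in which each chunk trains a model that is tested on the chunk immediately after it; the function name `sliding_window_splits` is hypothetical.

```python
def sliding_window_splits(instances, window_size):
    """Yield (train, test) pairs of consecutive chunks in time order."""
    chunks = [instances[i:i + window_size]
              for i in range(0, len(instances), window_size)]
    # Pair every chunk with its successor, so training data always
    # precedes test data chronologically.
    for train, test in zip(chunks, chunks[1:]):
        yield train, test

# Ten time-ordered instances (here just their time stamps), window of 4.
splits = list(sliding_window_splits(list(range(10)), window_size=4))
for train, test in splits:
    print(train, "->", test)
```

Unlike a random k-fold split, no test instance here is older than any instance used to train the model that predicts it.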

The Metrics Prediction Accuracy.
Short-term predictions are helpful to prevent potential disasters and limit the damage caused by system anomalies. Usually, predicting the near-term future is more reliable and successful than making long-term predictions [5]. So, in our experiment, we assess system state prediction in the short term.
In this experiment, we choose the $k$-means discretization technique to create the state boundaries. The reason is that, when we divide the data into $k$ clusters, each state produced by $k$-means groups data that lie closer together than a state produced by equal-width or equal-depth discretization, because the center of each cluster is used as a state. We set the number of bins to 5, 10, 15, ..., 30 and evaluate the quality of metric prediction by the mean prediction error (MPE), as in the study by Tan and Gu [10]. In the definition of MPE, $D$ is the test dataset, and $|D|$ is the number of instances in $D$. $|A|$ is the number of system metrics, and $x_i$ is the actual value of metric $i$. $\hat{x}_i$ is the predicted value of metric $i$, which is represented by the mean value of the samples in that bin. The smaller the MPE, the more accurate the predictor.
We assess the MPE in the near-term future (1 to 5 time units ahead) for different bin sizes (5, 10, 15, 20, 25, and 30) on the PlanetLab dataset. Figure 6 shows the MPE of PlanetLab for time units 1 to 5 with a bin size of 20. From these two figures, we have the following observations: (1) BMC achieves a smaller prediction error than DTMC from time units 1 to 5. One-step prediction has the most notable advantage, and the advantage decreases slightly as the horizon grows, which means that our algorithm fits better when the forecast period is shorter; (2) BMC and DTMC both lose prediction accuracy as the horizon grows, which indicates that predicting anomalies over a longer term is more challenging.
Figure 7 shows the MPE of PlanetLab with different bin sizes (5, 10, 15, 20, 25, and 30) when the time unit is one. From these figures, we can see that both methods have a higher MPE with fewer bins. The reason is that a smaller number of states tends to group a larger range of data into a bin. Since the mean of the bin is used as the predicted value, the gap between the predicted value and the real value is enlarged.
In Tables 2, 3, 4, 5, and 6, we compare the mean prediction error of DTMC and BMC under different noise percentages. The noise percentage $n$ means that the monitored value at state $i$ oscillates around the true value $v_i$ in the range $[v_{i-1} + (v_i - v_{i-1}) \cdot n\%,\ v_{i+1} - (v_{i+1} - v_i) \cdot n\%]$, as illustrated in Figure 2, where $v_{i-1}$ is the value of the previous state and $v_{i+1}$ is the value of the next state. We choose $n$ from 10 to 50 in our experiment because a value of state $i$ will be falsely recognized as state $i - 1$ or state $i + 1$ if $n$ is larger than 50%. Thus, in this paper, we set the noise percentage from 10% to 50%. The mean prediction error results in Tables 2-6 show that our proposed method BMC has better prediction quality than DTMC. Both BMC and DTMC have the smallest prediction error in one-step prediction, and the error grows as the number of prediction steps becomes larger. BMC has the most notable advantage over DTMC in one-step prediction, and the advantage decreases as the number of steps grows. Based on the above observations, we conclude that our algorithm performs better than DTMC in all noise ranges and fits better when we forecast imminent anomalies.

Ensemble Classification with Data Stream.
In this experiment, we compare three ensemble classification methods with other classification algorithms, such as decision tree and logistic regression. For ease of comparison, we first summarize the assessment criteria for the different classification methods. Suppose that a data stream has $n$ data chunks. We aim to build a classifier to predict the labels of all instances in the yet-to-come chunk. To simulate different types of data streams, we use the following approach from [21]: noise selection. We randomly select 20% of the chunks from each dataset as noise chunks, arbitrarily assign each instance a class label which does not equal its original class label, and finally put these noisy data chunks back into the data stream. The performance of system anomaly prediction is evaluated by three criteria according to [20]: precision, recall, and F-measure. We use Table 7 to help explain the definitions of these criteria, where state 0 denotes normal and state 1 denotes anomaly.
These three criteria are defined as follows. We define precision as the proportion of successful predictions for each predicted state in chunk $D_{n+1}$, recall as the probability of each real state being successfully predicted in chunk $D_{n+1}$, and F-measure as the harmonic mean of precision and recall.
Following the above process $n - 1$ times, we obtain the average precision, recall, and F-measure. Ideally, a good classifier for noisy data streams should have high average precision, high average recall, and high average F-measure.
Table 8 shows the quality of classification of the different classifiers. In this experiment, we choose three basic classifiers, C4.5, Logistic, and Naïve Bayes, as our base classifiers, and we set the sliding window size to 1000 instances. Columns 2 to 4 in Table 8 are the classification results that employ a single classifier. Here, we choose $D_1$ to train the model and test the model on $D_2$, then repeat the process by training the model on $D_2$ and testing on $D_3$, and so on. HTree, HNB, and HLogist are three horizontal ensemble classification methods which use both history and current chunks to train the classifier model. Here, we first use $D_1$ to train the model and test on $D_2$ and then use both $D_1$ and $D_2$ to train the model and test on $D_3$. We repeat this process until the end of the data stream. VerEn is the vertical ensemble
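The three criteria and their averaging over chunks can be sketched in Python. The sketch assumes the common confusion-matrix reading with the anomaly state (state 1) as the positive class; the function names and the example labels are illustrative, not taken from the experiments.

```python
def precision_recall_f(predicted, actual, positive=1):
    """Precision, recall, and F-measure for one chunk."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure: harmonic mean of precision and recall.
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

def average_over_chunks(per_chunk):
    """Average each criterion over the evaluated chunks."""
    count = len(per_chunk)
    return tuple(sum(r[i] for r in per_chunk) / count for i in range(3))

# Hypothetical predictions for two test chunks.
chunk_a = precision_recall_f([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
chunk_b = precision_recall_f([1, 0, 0, 1], [1, 0, 0, 1])
averages = average_over_chunks([chunk_a, chunk_b])
print(chunk_a, chunk_b, averages)
```

Note that averaging the per-chunk F-measures is not the same as computing the F-measure of the averaged precision and recall, so the two can differ slightly.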

Table 1: Monitoring metrics used for anomaly prediction.

Table 7: The four cases that prediction belongs to.