Study on Support Vector Machine-Based Fault Detection in Tennessee Eastman Process



Introduction
Fault detection in manufacturing processes aims to identify abnormal process behavior in a timely manner. Early fault detection makes a process safer, more efficient, and more economical, and helps guarantee product quality. Accordingly, several methods have been developed for fault diagnosis and detection, such as artificial neural networks (ANN), fuzzy logic systems, genetic algorithms, principal component analysis (PCA), and, more recently, the support vector machine (SVM) [1]. SVM has been extensively studied and used for data-driven modeling, classification, and fault detection because of its good generalization ability and its capacity for nonlinear classification [2][3][4]. At the same time, because the massive amounts of data produced in industrial manufacturing processes can be recorded and collected, data-driven techniques have become an active research field that benefits diverse communities, including process engineering [5,6]. Over the past few years, linear supervised classification techniques, for example, K-nearest neighbor [7], PCA [8,9], Fisher discriminant analysis [8], and discriminant partial least squares (DPLS) [8][9][10], and nonlinear classification techniques, for example, ANN [11] and SVM [12], have been proposed and greatly improved [13,14]. Moreover, several machine learning algorithms have been applied in real and simulated processes [15][16][17]; for example, the PCA algorithm is used to analyse product quality for a pilot plant [18], PLS and PCA are applied in the chemical industry for process fault monitoring [19], and a key performance indicator prediction scheme is applied to an industrial hot strip mill in [20]. Because the SVM algorithm can improve detection accuracy, it has come into use for fault detection, and it is now used extensively to solve classification problems in many domains, for example, face, object, and text detection and categorization, and information and image retrieval.
The support vector machine (SVM) algorithm is studied in detail in this paper. SVM is a representative nonlinear technique and is potentially effective for classifying many kinds of datasets [21]. Fault detection can be regarded as a special classification problem, addressed by model-based methods [22] and data-based methods, whose purpose is to recognise faulty conditions in a timely manner. With the cross-validation algorithm used to optimise the parameters, classification performance is greatly enhanced [23][24][25]. To test the classification performance of the SVM algorithm, a simulation model, the Tennessee Eastman process, which has 52 variables representing the dynamics of the process [26], is used for fault detection. In this simulation, the original dataset is handled directly by the SVM algorithm, which obtains satisfactory fault detection results. No other technique is added during model building or test-data classification, which reduces the calculation time and the computational burden. Compared with the PLS algorithm, the SVM-based classifier achieves higher accuracy. Finally, the SVM-based classifier with optimal parameters is used to detect the faulty condition of the process.
The paper is organized as follows. The next section introduces the SVM classification algorithm, the PLS algorithm, and the cross-validation algorithm. Sections 3 and 4 present an application to the Tennessee Eastman process simulator using the SVM and PLS algorithms, respectively, in which SVM-based fault detection outperforms PLS-based detection. Section 5 concludes the paper.

Support Vector Machines
Theory. The support vector machine (SVM) is a relatively new multivariate statistical approach that has become popular because of its good performance in classification and regression. An SVM-based classifier generalizes well because it is based on the structural risk minimization principle [27]. The SVM algorithm is also nonlinear, and it can deal with large feature spaces [28]. For these two reasons, the SVM algorithm has come into use in machine fault detection. The fundamental principle of SVM is to separate a dataset into two classes by a hyperplane (a decision boundary) that has the maximum distance to the support vectors of each class. Support vectors are representative data points, and an increase in their number may increase the complexity of the problem [28,29].
This paper uses a binary classifier with a dataset and the corresponding labels. A training dataset containing two classes is given as a matrix $X$ of size $N \times m$ [6], in which $N$ is the number of observed samples and $m$ is the number of observed variables. The column vector $x_i$ denotes the $i$th row of $X$. Each sample is assumed to belong either to a positive class or to a negative class. In addition, a column vector $y$ serves as the class label, with entries $-1$ and $1$: $y_i = 1$ is associated with one class and $y_i = -1$ with the other. If the training dataset is linearly separable, the SVM tries to separate it by a linear hyperplane:

$$\langle w, x \rangle + b = 0, \tag{1}$$

where $w$ is an $m$-dimensional vector and $b$ is a scalar. The parameters $w$ and $b$ determine the position and orientation of the separating hyperplane. A separating hyperplane is considered optimal if it creates the maximum distance between the closest vectors and the hyperplane. The closest points in each class are called support vectors. If the other points in the training set are removed, the calculated decision boundary remains the same; that is, the support vectors contain all the information in the dataset needed to define the hyperplane. The distance from a data point $x_i$ to the separating hyperplane is

$$d(w, b; x_i) = \frac{|\langle w, x_i \rangle + b|}{\|w\|}. \tag{2}$$

Vapnik in 1995 put forward a canonical hyperplane [30], for which $w$ and $b$ should satisfy

$$\min_i |\langle w, x_i \rangle + b| = 1. \tag{3}$$

That is, when the nearest point is substituted into the hyperplane function, the magnitude of the result is constrained to be 1. This restriction on the parameters simplifies the formulation of the problem. For a training pair $(x_i, y_i)$, a separating hyperplane in canonical form satisfies

$$y_i \left[ \langle w, x_i \rangle + b \right] \ge 1, \quad i = 1, \dots, N. \tag{4}$$

The best separating hyperplane is the one that maximizes the distance from the support vectors to the decision boundary. This maximum distance is denoted as $\rho$:

$$\rho(w, b) = \min_{x_i : y_i = -1} d(w, b; x_i) + \min_{x_i : y_i = 1} d(w, b; x_i) = \frac{2}{\|w\|}. \tag{5}$$

Hence, for linearly separable data, the optimal separating hyperplane is the one that minimizes

$$\Phi(w) = \frac{1}{2} \|w\|^2. \tag{6}$$

To solve the optimization problem (6) under the constraint (4), define the Lagrangian

$$\ell(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i \left[ \langle w, x_i \rangle + b \right] - 1 \right), \tag{7}$$

where the $\alpha_i \ge 0$ are the Lagrange multipliers. The Lagrangian should be maximised with respect to $\alpha$ and minimised with respect to $w$ and $b$.
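As a minimal numerical sketch of the distance and margin formulas, the following Python snippet evaluates the point-to-hyperplane distance and checks that the margin equals 2/||w||. The two-dimensional toy samples and the hand-picked canonical hyperplane are assumptions made for illustration; none of the values come from the TE data.

```python
import math


def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane <w, x> + b = 0."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))


# Hypothetical toy data: two samples per class.
samples = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
labels = [1, 1, -1, -1]

# Hand-picked canonical hyperplane for this toy set (an assumption, not fitted).
w, b = (0.5, 0.5), 0.0

# Canonical form: every sample satisfies y * (<w, x> + b) >= 1,
# with equality for the closest samples (the support vectors).
functional_margins = [y * (dot(w, x) + b) for x, y in zip(samples, labels)]

# Margin = nearest distance in the positive class + nearest in the negative class.
nearest_pos = min(distance_to_hyperplane(w, b, x)
                  for x, y in zip(samples, labels) if y == 1)
nearest_neg = min(distance_to_hyperplane(w, b, x)
                  for x, y in zip(samples, labels) if y == -1)
margin = nearest_pos + nearest_neg

print(min(functional_margins), margin, 2.0 / math.sqrt(dot(w, w)))
```

Shrinking `w` while keeping every functional margin at or above 1 is what widens the margin, which is why the optimization problem minimizes the squared norm of `w`.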
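Taking noise in the data and misclassification by the hyperplane into consideration, the constraint (4) on the separating hyperplane is too strict. To generalise the optimal separating boundary, we reformulate the constraint as

$$y_i \left[ \langle w, x_i \rangle + b \right] \ge 1 - \xi_i, \quad i = 1, \dots, N, \tag{8}$$

where the variable $\xi_i \ge 0$ is a measure of the distance from the hyperplane to a misclassified point. To find the optimal generalised separating hyperplane, the following optimization problem should be solved:

$$\min_{w, b, \xi} \; \Phi(w, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to (8) and } \xi_i \ge 0, \tag{9}$$

where the parameter $C$, a given value, is called the error penalty. For this inseparable case, in order to simplify the optimization problem, define the Lagrangian

$$\ell(w, b, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left( y_i \left[ \langle w, x_i \rangle + b \right] - 1 + \xi_i \right) - \sum_{i=1}^{N} \beta_i \xi_i, \tag{10}$$

where $\alpha_i \ge 0$ and $\beta_i \ge 0$ are the Lagrange multipliers. We consider the minimization problem as the original, primal problem:

$$\min_{w, b, \xi} \; \max_{\alpha, \beta} \; \ell(w, b, \xi, \alpha, \beta). \tag{11}$$

When the Kuhn-Tucker conditions are satisfied, the primal problem can be transformed into its dual problem, which is

$$\max_{\alpha, \beta} \; \min_{w, b, \xi} \; \ell(w, b, \xi, \alpha, \beta). \tag{12}$$

The task is then to minimize $\ell$ in (10) by adjusting the values of $w$, $b$, and $\xi$. At the optimal point, the derivatives of $\ell$ should be zero. The saddle-point equations are as follows:

$$\frac{\partial \ell}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0, \tag{13}$$

$$\frac{\partial \ell}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i, \tag{14}$$

$$\frac{\partial \ell}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i + \beta_i = C. \tag{15}$$

Substituting (13), (14), and (15) back into (10) gives the dual quadratic optimization problem [30,31]:

$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle, \tag{16}$$

subject to the constraints

$$0 \le \alpha_i \le C, \quad i = 1, \dots, N, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0. \tag{17}$$

Solving the dual quadratic optimization problem (16) yields the $\alpha_i$. Using (14) to express the optimal $w$ in terms of $\alpha$, the hyperplane can then be written as

$$\langle w, x \rangle + b = \sum_{i=1}^{N} \alpha_i y_i \langle x_i, x \rangle + b. \tag{18}$$

The classifier implementing the optimal separating hyperplane takes the following form:

$$f(x) = \operatorname{sgn} \left( \sum_{i=1}^{N} \alpha_i y_i \langle x_i, x \rangle + b \right). \tag{19}$$

However, in some cases a linear classifier is not suitable; for example, the data may overlap or may not be linearly separable. In that case, the input vectors should be projected into a higher dimensional feature space, where the data may be classified linearly and more efficiently by the SVM algorithm. However, this may cause computational problems because of the large vectors and the high dimensionality.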
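The step from the Lagrange multipliers back to the classifier can be sketched numerically. The Python snippet below assumes a hypothetical two-sample problem whose dual solution can be worked out by hand (both multipliers equal 1/4, valid for any error penalty C of at least 1/4). It rebuilds w from the multipliers as in the saddle-point condition, recovers b from a support vector, and evaluates the sign classifier. The data and function names are illustrative only.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy problem: one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
# Hand-derived dual solution: W(a) = 2a - 4a^2 is maximal at a = 1/4.
alphas = [0.25, 0.25]

# Saddle-point condition: sum_i alpha_i * y_i = 0.
balance = sum(a * y for a, y in zip(alphas, labels))

# Saddle-point condition: w = sum_i alpha_i * y_i * x_i.
w = tuple(
    sum(a * y * x[k] for a, y, x in zip(alphas, labels, support_vectors))
    for k in range(2)
)

# A support vector satisfies y_s * (<w, x_s> + b) = 1, so b = y_s - <w, x_s>.
b = labels[0] - dot(w, support_vectors[0])


def decision_value(x):
    """sum_i alpha_i * y_i * <x_i, x> + b, without forming w explicitly."""
    return sum(a * y * dot(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b


def classify(x):
    """Sign classifier: +1 on or above the hyperplane, -1 below it."""
    return 1 if decision_value(x) >= 0 else -1


print(w, b, classify((2.0, 0.5)), classify((-3.0, 1.0)))
```

Note that `decision_value` only ever uses inner products between a support vector and the query point, which is the property the Kernel substitution relies on.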
The idea of using a Kernel function enables the calculation to be performed in the original space instead of in the projected high-dimensional feature space, avoiding the curse of dimensionality [27,31]. Given a feature mapping $\phi$, we define the corresponding Kernel function to be

$$K(x, z) = \phi(x)^T \phi(z). \tag{20}$$

Thus, the linear decision hyperplane in the high-dimensional feature space is

$$\langle w, \phi(x) \rangle + b = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b. \tag{21}$$

$K(x, z)$ is inexpensive to calculate. The Kernel function returns an inner product of vectors in the high-dimensional feature space, but it is evaluated in the original space rather than in the feature space. The four commonly used Kernel functions are shown in Table 1.
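A short Python sketch can make the Kernel idea concrete. Table 1 is not reproduced in this text, so the four Kernels below (linear, polynomial, Gaussian RBF, and sigmoid) are written in their usual textbook forms, and the parameter names `gamma`, `coef0`, and `degree` are assumptions rather than the paper's notation. The last lines check, for a degree-2 polynomial Kernel on two-dimensional inputs, that evaluating the Kernel in the original space gives the same number as an explicit inner product in the mapped feature space.

```python
import math


def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, gamma=1.0, coef0=0.0, degree=2):
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)


def phi_quadratic(x):
    """Explicit feature map with <phi(x), phi(z)> = <x, z>^2 for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, -1.0)

# Evaluated in the original 2-D space: (<x, z>)^2.
in_original_space = polynomial_kernel(x, z)
# Evaluated in the mapped 3-D feature space: <phi(x), phi(z)>.
in_feature_space = dot(phi_quadratic(x), phi_quadratic(z))

print(in_original_space, in_feature_space)
```

For the RBF Kernel the corresponding feature space is infinite-dimensional, so only the original-space evaluation is available in practice.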

SVM Model Selection.
The SVM algorithm is a very effective data classification technique, and building an SVM-based fault detection model is not complex. A classification task usually involves a training dataset and a testing dataset. Every instance in the training set contains two parts: a target value (i.e., the class label) and several attributes (i.e., the observed variables). The basic process of classifying data with the SVM algorithm is as follows: first build a classifier from the training set, and then use it to predict the target values of the data in the testing set, where only the attributes are known.
To construct a classifier model and identify faulty data, the following procedure is used.
(i) Transform the data collected from the real process into a format that the SVM classifier can use.
(ii) Try a few kinds of Kernels and find the best one; then search for its optimal parameters. This paper uses the Gaussian RBF Kernel function, and the optimal parameters are obtained using the cross-validation algorithm.
(iii) Use the optimal parameters and the appropriate Kernel function to build a classifier.
(iv) Feed the testing data into the constructed classifier and run a test. As a result, the faulty data are identified, and in this way the faulty condition can be detected.
A Kernel function can nonlinearly project the original input data into a higher dimensional space; thus, an SVM equipped with a Kernel function is able to deal with the case where the input data cannot be linearly separated. Generally speaking, the RBF Kernel is the first choice for the following two reasons. Firstly, the linear Kernel is a special case of the RBF Kernel: with suitable parameter values, the RBF Kernel behaves similarly to the linear Kernel, and the same holds for the sigmoid Kernel at some parameters. Secondly, the RBF Kernel brings fewer computational costs and fewer hyperparameters [23].
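Step (i) of the procedure can be illustrated with a small Python sketch. The source does not say which SVM software or file format was used, so the snippet assumes, purely for illustration, the plain-text LIBSVM layout in which each sample is written as a label followed by `index:value` pairs with 1-based indices. The function names are hypothetical, and the parser only handles dense lines in which every attribute is present.

```python
def to_libsvm_line(label, attributes):
    """Format one sample as '<label> 1:<v1> 2:<v2> ...' (indices start at 1)."""
    fields = [f"{index}:{value}" for index, value in enumerate(attributes, start=1)]
    return " ".join([f"{label:+d}"] + fields)


def from_libsvm_line(line):
    """Parse a dense line back into (label, attributes)."""
    label_text, *fields = line.split()
    attributes = [float(field.split(":", 1)[1]) for field in fields]
    return int(label_text), attributes


# Hypothetical sample with three attributes; a TE sample would carry 52.
sample = [0.25, 3600.0, 4509.3]
line = to_libsvm_line(-1, sample)
print(line)  # -1 1:0.25 2:3600.0 3:4509.3

label, restored = from_libsvm_line(line)
```

Here the label -1 marks a normal sample and +1 a faulty one, following the convention used for the test results later in the paper.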

Cross-Validation.
An RBF Kernel function needs appropriate parameters to ensure that the classifier accurately predicts unknown data. The best parameter values are not known beforehand; however, they can be searched for using the cross-validation algorithm [23].
The cross-validation algorithm is a model validation method used to evaluate the accuracy of a predictive model. Its goal is to give insight into how the model generalizes to an independent (i.e., unknown) dataset by setting aside part of the data in the training phase to test the model.
In $k$-fold cross-validation, the original training set is randomly divided into $k$ parts of equal size. In turn, one subset is used as the testing dataset to test the predictive model, and the remaining $k - 1$ subsets are combined as the training dataset. This validation process is repeated $k$ times in all, with every subset serving as testing data exactly once. The results are then combined into a single estimate for the model, namely, the overall misclassification rate across all testing sets.
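The k-fold procedure described above can be sketched in a few lines of Python. To keep the example self-contained, the classifier is a simple stand-in that compares the mean RBF similarity to each class; it is not an SVM, and the toy Gaussian data, the candidate `gamma` values, and all function names are assumptions made for illustration. The sketch shows one way the overall misclassification rate can be used to compare candidate parameter values.

```python
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly split range(n_samples) into k folds of (nearly) equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def rbf(x, z, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def fit_predict(train_x, train_y, test_x, gamma):
    """Stand-in classifier (NOT an SVM): pick the class with higher mean similarity."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == -1]
    predictions = []
    for x in test_x:
        score_pos = sum(rbf(x, p, gamma) for p in pos) / len(pos)
        score_neg = sum(rbf(x, n, gamma) for n in neg) / len(neg)
        predictions.append(1 if score_pos >= score_neg else -1)
    return predictions


def cv_misclassification_rate(xs, ys, k, gamma):
    """Overall misclassification rate across all k held-out folds."""
    errors = 0
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predicted = fit_predict(train_x, train_y, [xs[i] for i in fold], gamma)
        errors += sum(p != ys[i] for p, i in zip(predicted, fold))
    return errors / len(xs)


# Hypothetical toy data: two overlapping Gaussian clouds, 50 samples each.
rng = random.Random(1)
xs = [(rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)) for _ in range(50)]
xs += [(rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)) for _ in range(50)]
ys = [1] * 50 + [-1] * 50

# Repeat 5-fold cross-validation once per candidate value and keep the best.
candidate_gammas = [0.01, 0.1, 1.0, 10.0]
rates = {g: cv_misclassification_rate(xs, ys, k=5, gamma=g) for g in candidate_gammas}
best_gamma = min(rates, key=rates.get)
print(rates, best_gamma)
```

With an actual SVM, the same loop would typically run over pairs of the error penalty and the RBF width rather than over a single parameter.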
This paper uses 5-fold cross-validation to find the overall misclassification rate across all testing sets. The cross-validation process is performed many times, once for each candidate set of parameter values, to pick out the parameters that minimise the overall misclassification rate. In this way, the optimal parameters are found and the classifier can obtain its best accuracy rate.

PLS for Fault Detection. The PLS algorithm introduced by Dayal and MacGregor [32] is used because of its simplicity and its lower computational effort when dealing with process monitoring with large data. In this technique, the training data are again denoted as an $N \times m$ matrix $X$, in which $N$ is the number of observed samples and $m$ is the number of attributes, and $Y \in \mathbb{R}^{N \times 1}$. The PLS algorithm projects the matrix $X$ into a low-dimensional space with $l$ latent variables, and $Y$ can be constructed from these latent variables. The constructed model is as follows:

$$X = T P^T + E, \qquad Y = T Q^T + F,$$

where $T = [t_1, \dots, t_l]$ is the score matrix, $P$ and $Q$ are the loading matrices of $X$ and $Y$, and $E$ and $F$ are the residuals. The score matrix can be computed directly from $X$ as $T = XR$, with $R = [r_1, \dots, r_l]$ and

$$r_1 = w_1, \qquad r_i = \prod_{j=1}^{i-1} \left( I_m - w_j p_j^T \right) w_i, \quad i > 1,$$

where $w_i$ ($i = 1, \dots, l$) is the weight vector of the $i$th deflated $X$. PLS decomposes a sample $x$ into two parts:

$$x = \hat{x} + \tilde{x}, \qquad \hat{x} = P R^T x, \qquad \tilde{x} = \left( I_m - P R^T \right) x.$$

Usually, the $T^2$ statistic is used to detect abnormalities, and it is calculated as

$$T^2 = x^T R \left( \frac{T^T T}{N - 1} \right)^{-1} R^T x.$$

With a given confidence level $\alpha$, the threshold for $T^2$ is calculated by the following:

$$J_{th, T^2} = \frac{l (N^2 - 1)}{N (N - l)} F_{l, N - l, \alpha},$$

where $F_{l, N - l, \alpha}$ represents the $F$-distribution with $l$ and $N - l$ degrees of freedom at confidence level $\alpha$. If the $T^2$ values are all less than their corresponding thresholds, the process is considered fault-free [32].

The TE process is also used for other kinds of control issues. Figure 1 shows the flowsheet of the TE process. The TE process consists of five major parts: the reactor, the product condenser, a vapor-liquid separator, a recycle compressor, and a product stripper, which together accomplish the reactor, separator, and recycle arrangement. More details can be found in [33,34].

Simulated Data Intro.
The TE process is a plant-wide closed-loop control structure. The simulated process can produce the normal operating condition as well as 21 faulty conditions, and it generates simulated data at a sampling interval of 3 min. For each case (normal or faulty), two sets of data are produced: a training dataset and a testing dataset. The training sets are used to construct the statistical predictive model, and the testing sets are used to estimate the accuracy of the resulting classifier. In the training sets, the normal dataset contains 500 observation samples, while each faulty dataset contains 480 observation samples. In the testing data, each dataset (for both normal and faulty conditions) consists of 960 observations. In a faulty condition, the fault is introduced 8 h after the TE process is started. That is, in each faulty testing dataset, the first 160 samples are normal, while the remaining 800 samples are faulty and should be detected. Every observation sample contains 52 variables, which consist of 22 process measurement variables, 19 component measurement variables, and 11 control variables. All the datasets used in this paper can be found in [1]. In the following section, the fault detection result of the SVM-based classifier is compared with that of the PLS algorithm using the dataset generated by the TE process simulator. For the SVM classifier, the RBF Kernel is used, and the related parameters are set in advance according to the result of cross-validation to ensure high accuracy. In the comparison, the performance is measured by fault detection rates.
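The sample counts quoted above follow from simple arithmetic, which the Python sketch below reproduces: an 8 h delay at a 3 min sampling interval gives 160 normal samples, leaving 800 faulty samples in a 960-sample test set. The function and constant names are hypothetical, and the labels use -1 for normal and +1 for faulty, the convention applied to the test results in the next section.

```python
SAMPLING_INTERVAL_MIN = 3   # one sample every 3 minutes
FAULT_ONSET_HOURS = 8       # the fault is introduced 8 hours into the run
N_TEST_SAMPLES = 960        # observations per testing dataset


def n_normal_samples(onset_hours, interval_min):
    """Number of samples recorded before the fault is introduced."""
    return onset_hours * 60 // interval_min


def expected_test_labels(n_samples, n_normal):
    """Ground-truth labels: -1 (normal) before the fault, +1 (faulty) after."""
    return [-1] * n_normal + [1] * (n_samples - n_normal)


n_normal = n_normal_samples(FAULT_ONSET_HOURS, SAMPLING_INTERVAL_MIN)
y_test = expected_test_labels(N_TEST_SAMPLES, n_normal)

print(n_normal, y_test.count(1))  # 160 800
```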

Result and Discussion
The TE process simulator can generate 22 types of conditions: the normal condition and 21 kinds of programmed faults caused by various known disturbances in the process. Once a fault is introduced, all variables are affected and some changes emerge. According to Chiang et al. [8] and Zhang [18], faults 3, 9, 15, and 21 are very difficult to detect because there are no observable changes in the mean, the variance, or the peak time. These four faults therefore generally cannot be detected by any statistical technique, and they are not analysed in this paper. The information on all the faults is presented in Table 2.
To test the detection performance of the SVM-based classifier thoroughly, the SVM algorithm and the PLS algorithm are each applied to the TE process. For the SVM classifier, the data for each fault are combined in turn with the normal-condition data to form the dataset for a binary classifier. After the models are built from the training data, the testing data are used to evaluate the prediction results through two common indices: accuracy (Acc) and fault detection rate (FDR) [35]. The detection result for each fault is shown in Table 3, and Table 4 presents the detection results of the PLS technique. Figure 2 shows the labels predicted by three classifiers whose training and testing data come from the normal condition and fault 1, the normal condition and fault 2, and the normal condition and fault 4, respectively. For the first 160 test samples, the process is normal and the label $y$ should be $-1$. The fault appears at the 161st sample, and the label $y$ should be 1 from then on. From Figure 2, we can see that the SVM classifiers' predictions are correct most of the time. In addition, Table 3 presents detailed detection indices of these 21 SVM classifiers. It is worth mentioning that the hyperparameters of the classifiers are all optimized beforehand; without this optimization, the prediction is much worse. For example, the accuracy is only 16.77% when a classifier with default parameters is used to detect fault 1. Table 4 shows the detailed indices obtained with the PLS technique. It can be seen that the SVM classifiers with optimal hyperparameters are able to detect most of the faulty data, and their accuracy is higher than that of the PLS algorithm. Therefore, the SVM algorithm has better fault detection performance.
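The two indices can be written down in a few lines of Python. Their formulas are not spelled out in this section, so the sketch assumes the usual definitions: accuracy is the fraction of all test samples labelled correctly, and the fault detection rate is the fraction of truly faulty samples that are labelled faulty. The function names and the two hypothetical predictors are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fault_detection_rate(y_true, y_pred):
    """Fraction of truly faulty samples (label +1) that are predicted as faulty."""
    predictions_on_faulty = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p == 1 for p in predictions_on_faulty) / len(predictions_on_faulty)


# Ground truth for one faulty test set: 160 normal samples, then 800 faulty ones.
y_true = [-1] * 160 + [1] * 800

# Two hypothetical predictors for comparison.
always_normal = [-1] * 960   # never raises an alarm
perfect = list(y_true)       # every sample labelled correctly

print(accuracy(y_true, always_normal), fault_detection_rate(y_true, always_normal))
print(accuracy(y_true, perfect), fault_detection_rate(y_true, perfect))
```

For scale, a classifier that labels every test sample as normal scores 160/960, about 16.7% accuracy, with a fault detection rate of zero, which shows how little a low accuracy figure on these test sets implies about detected faults.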
To further test the predictive ability of the SVM classifier, that is, its ability to detect faults in the TE process, we use the normal-condition data combined with three faulty-condition datasets (fault 1, fault 2, and fault 4) as training data to construct a classification model, and then use the corresponding test data to observe the classification result. As shown in Table 5, the detection indices are also good, although the computing time is a little longer on the same computing facility. Thus, the SVM algorithm performs satisfactorily on the original dataset containing 52 attributes without any transformation. In a real process with an advanced computing facility, we could therefore train on the normal data and all fault data to build a classifier. Once a fault label is predicted, the fault in the process is detected.

Figure 2: Classification results for the testing dataset by the SVM-based classifier with optimal parameters. The value $y = -1$ indicates that the process is in the normal condition at the sampling moment, while $y = 1$ stands for the faulty condition. The top plot is the classification result between the normal condition and fault 1, the middle one is the result between the normal condition and fault 2, and the last plot is the result between the normal condition and fault 4. In the simulation of the faulty condition, the fault appears from the 161st sample, which is marked by a red line.

From the fault detection results on the TE process shown above, we can conclude that the SVM-based classifier has good predictive ability. In addition, two facts should be mentioned. First, before detecting normal or faulty conditions, we used cross-validation to optimize the classifier's hyperparameters, so that the classifier performs at its best.
Second, the SVM-based classification on the TE process performs satisfactorily without any additional processing, for example, data preprocessing or attribute selection. This feature makes the SVM classifier easy to build. Moreover, the amount of calculation and the calculation time are relatively small because fewer algorithms are used.

Conclusion
The TE process, a benchmark chemical engineering model, is used in this paper for fault detection. The fault detection ability of the SVM-based classifier on the TE process's original data is satisfactory, which indicates the advantage of nonlinear classification when the number of samples or attributes is very large. Compared with the classifier based on the PLS algorithm, the SVM-based classifier with a Kernel function shows a superior accuracy rate. In addition, optimizing the parameters beforehand plays a great role in improving the effectiveness of classification. Using no other technique also simplifies the problem and reduces the computational load needed to reach a satisfactory classification result.