Novel Model Based on Stacked Autoencoders with Sample-Wise Strategy for Fault Diagnosis



Introduction
As a result of the advancement of process control systems, modern industries have switched to automation. With advantages such as low cost, high efficiency, and improved safety, automated control systems have become increasingly popular worldwide, especially in power plants, the aerospace industry, and chemical plants [1]. However, in abnormal situations, these systems are unable to cope on their own and thus require the intervention of human operators. Fault detection and diagnosis (FDD) has been developed to determine where problems lie and prevent machine damage or accidents.
In practical chemical processes, FDD comprises two vital procedures that influence the life of devices and operators. In fault diagnosis, a model is built to distinguish fault types accurately. Corresponding measures are taken to overcome these faults when necessary. In the early application of FDD, mathematical model-based methods elicited the most attention and thus developed rapidly. However, model-based FDD methods require a priori knowledge about chemical processes before application. Therefore, the development of this kind of method was limited. With the support of related technology, voluminous data can be collected easily. Although the data collected are high-dimensional, they remain useful for some FDD methods. This approach is called the data-driven method. Basic data-driven methods include principal component analysis (PCA) [2,3], canonical correlation analysis [4], independent component analysis, and Fisher discriminant analysis [5,6]. These data-driven methods greatly contribute to FDD, but they are not suitable for certain situations. For example, all the aforementioned methods fail to detect faults accurately in nonlinear process monitoring and fault diagnosis.
In the early 1990s, neural networks (NNs) were merged into FDD as a feature extraction tool [7]. The NNs at that time were shallow and narrow. These NNs were also prone to overfitting once they became deep and wide. Hence, NNs were never widely used for FDD during this period. Since Hinton proposed pretraining and fine tuning [8], deep learning (DL) has received much attention. As a result, DL has undergone unprecedented application and development. DL has been applied in computer vision, speech recognition, text processing, medicine, finance, advertising, and other fields. Stacks of restricted Boltzmann machines (RBMs), referred to as a deep belief network (DBN), have been found to behave better than other traditional data-driven methods (e.g., PCA) in data dimension reduction [9]. As pretraining helps deep neural networks (DNNs) jump out of local minima, the fact that DBN outperforms PCA may be easily disregarded. As for FDD, RBMs and DNNs have been applied to some specific processes to initialize offline fault detection models and detect online time-related samples automatically. Ren et al. used multiple DBN models to correspond to each working condition and set an adaptive threshold to detect faulty conditions [10]. Zhang et al. considered extracting features in spatial and temporal domains [11]. They built two subnetworks by mutual information and trained them with a backpropagation algorithm. In their model, each class has its own network. The result is a considerable improvement in model performance [12]. In addition, the RMSProp algorithm is applied to achieve faster convergence. CNNs can also be employed in FDD. Lee et al. applied a CNN to multivariate time-series data and built a connection between the output of the CNN and the structural meaning of data for fault classification [13]. Wen et al. proposed a new method on the basis of CNN [14]. Instead of using time-domain data, they converted raw data into image data. After such transformation, the CNN can easily extract the features of faults. An autoencoder (AE) and a stacked autoencoder (SAE) are trained to extract valuable features to determine whether a machine is faulty. Lee et al. employed a stacked denoising AE for fault classification [15]. Compared with models with SAEs, their model adds noise to the samples before training to prevent overfitting. In addition, their model realizes improved generalization capability. Zhang et al. considered self-correlation in data. They first extended the data matrix by correlation analysis [16]. Then, they applied a deep SAE to extract the necessary features. Different from these existing studies, the current work highlights the results obtained from a trained DL model. This work takes advantage of learned knowledge and reuses it to train an identical model.
Much research has explored knowledge distillation (KD). Yim et al. calculated the inner products between two identical DNN models and created a new loss function using the L2 norm to improve classification performance in image processing [17]. Similarly, Furlanello et al. used two identical models. In addition, they explored two distillation objectives: confidence weighted by Teacher Max and dark knowledge with permuted predictions [18]. They found that KD is suitable for almost all machine learning (ML) models. Xu et al. focused on one ML model and trained it several times [19]. After the first training, they softened the correct targets and reset the mistaken targets to 1. Bae et al. used residual networks as basic models to transfer knowledge in a teacher-student framework [20]. They proposed a layer-wise hint training method and improved the recognition results on several typical datasets.
The rest of the paper is organized as follows. Section 2 provides an overview of SAE and KD. Section 3 introduces the novel model based on SAE with the sample-wise strategy (SAE-SWS). Section 4 discusses the application of the method to several industrial benchmark problems. Finally, Section 5 summarizes the methodology.

Preliminaries
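2.1. Stacked Autoencoders. An SAE is a stack of autoencoders. Nevertheless, an SAE cannot be formed by connecting AEs directly. An SAE's training process includes two procedures: unsupervised pretraining and fine tuning. In unsupervised pretraining, the features of a dataset are extracted. As shown in Figure 1(a), the input dataset is assumed to be $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times m}$, where $n$ is the number of samples, $m$ is the number of observed variables, and $x_i \in \mathbb{R}^{m \times 1}$ ($i = 1, 2, \ldots, n$). Each $x_i$ undergoes encoding and decoding and can be presented as

$$h_i = f_e(W_e x_i + b_e), \qquad \hat{x}_i = f_d(W_d h_i + b_d), \tag{1}$$

where $h_i$ is the hidden-layer feature of $x_i$, $\hat{x}_i$ is its reconstruction, and $(W_e, b_e)$ and $(W_d, b_d)$ are the parameters of the encoding and decoding stages, respectively. The feature $h_i$ is then used to dig deeper for the next-level feature, as shown in Figure 1(b). Figures 1(a) and 1(b) belong to the feature extraction stage.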
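The encode/decode pass of a single autoencoder can be illustrated with a minimal NumPy sketch. The sigmoid activation, the names `W_e`, `b_e`, `W_d`, `b_d`, and the toy sizes are assumptions made for illustration only; they are not taken from the original implementation.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function (assumed activation)."""
    return 1.0 / (1.0 + np.exp(-z))

def ae_forward(X, W_e, b_e, W_d, b_d):
    """Encode each row of X into a hidden feature, then decode it back."""
    H = sigmoid(X @ W_e.T + b_e)       # hidden features, shape (n, v)
    X_hat = sigmoid(H @ W_d.T + b_d)   # reconstructions, shape (n, m)
    return H, X_hat

# Toy sizes: n samples, m observed variables, v hidden nodes (illustrative).
rng = np.random.default_rng(0)
n, m, v = 6, 4, 3
X = rng.random((n, m))
W_e, b_e = rng.normal(scale=0.1, size=(v, m)), np.zeros(v)
W_d, b_d = rng.normal(scale=0.1, size=(m, v)), np.zeros(m)

H, X_hat = ae_forward(X, W_e, b_e, W_d, b_d)
recon_error = float(np.mean((X - X_hat) ** 2))  # quantity minimized in pretraining
```

In a stacked setting, `H` would serve as the input of the next autoencoder, which is how deeper features are obtained layer by layer.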
In the fine-tuning stage, the whole network is trained for classification or regression. The structure is shown in Figure 1(c). Regarding classification, the SAE assumes the label dataset $Y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^{n \times c}$ ($c$ is the number of classes) corresponding to $X$, and it uses one-hot encoding to build $y_i$ ($i = 1, 2, \ldots, n$). The parameters of the SAE are transferred from the previous pretraining, as shown in the figure. Then a softmax layer with $c$ nodes is added after the network. $\hat{Y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n]^T \in \mathbb{R}^{n \times c}$ represents the output. $\hat{y}_i$ ($i = 1, 2, \ldots, n$) denotes the SAE classification probabilities of input $x_i$, which satisfy $\hat{y}_{ij} \in [0, 1]$ ($i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, c$). The SAE is then trained as

$$\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(\hat{y}_i, y_i), \tag{2}$$

where

$$L(\hat{y}_i, y_i) = -\sum_{j=1}^{c} y_{ij} \log \hat{y}_{ij}. \tag{3}$$

2.2. Knowledge Distillation. KD means transferring "knowledge" between ML models. It was first proposed by Hinton to transfer knowledge learned by a cumbersome model to a much simpler one [21]. Before interpreting KD, soft targets and hard targets must be understood. For example, for binary classification, the label [1, 0] is a hard target, whereas a label in which no dimension equals 1, such as [0.8, 0.2], is a soft target. The latter clearly contains more information. KD tries to soften hard targets to gain information for classification. KD is closely related to the "softmax" function. Softmax converts multicategory output values into relative probabilities, making them easy to understand and compare. It is usually connected after the ML model. For example, assume that the softmax layer input is $h \in \mathbb{R}^{c \times 1}$, where $c$ denotes the number of classes. The probability that $h$ belongs to the $i$th class can be calculated by

$$p_i = \frac{\exp(h_i)}{\sum_{j=1}^{c} \exp(h_j)}, \tag{4}$$

where $i$ and $j$ denote the category indexes. By contrast, if softmax in KD (softmax-KD) is applied, then the probability is calculated by

$$p_i = \frac{\exp(h_i / t)}{\sum_{j=1}^{c} \exp(h_j / t)}, \tag{5}$$

where $t$ denotes the temperature of softmax. A high $t$ yields soft targets. When $t$ equals 1, softmax-KD becomes the original softmax. In this paper, $t$ is set within the range of 2.5-4 during model training.

$\hat{Y}^{(T)} = [\hat{y}_1^{(T)}, \hat{y}_2^{(T)}, \ldots, \hat{y}_n^{(T)}]^T \in \mathbb{R}^{n \times c}$ represents the set of $\hat{y}_i^{(T)}$. Given the label dataset $Y^{(T)} = [y_1^{(T)}, y_2^{(T)}, \ldots, y_n^{(T)}]^T \in \mathbb{R}^{n \times c}$, where $y_i^{(T)} = y_i$ ($i = 1, 2, \ldots, n$), the teacher model $T$ is trained by minimizing the cross-entropy cost function, which can be presented by

$$\min_{\theta^{(T)}} \frac{1}{n} \sum_{i=1}^{n} L(\hat{y}_i^{(T)}, y_i^{(T)}), \tag{6}$$

FDD Model Based on SAE with Sample-Wise Strategy
where $\theta^{(T)}$ denotes the weights and biases in the teacher model $T$. $L(\hat{y}_i^{(T)}, y_i^{(T)})$ is calculated by

$$L(\hat{y}_i^{(T)}, y_i^{(T)}) = -\sum_{j=1}^{c} y_{ij}^{(T)} \log \hat{y}_{ij}^{(T)}. \tag{7}$$

The SWS is conducted after training the teacher model $T$. Considering the $i$th sample $x_i^{(T)}$, its predicted category and true category are obtained according to $\hat{y}_i^{(T)}$ and $y_i^{(T)}$, respectively. If the two categories are equal, $x_i^{(T)}$ is considered to have been sorted into the correct category. As a result, all the $n$ samples are divided into two parts: the correct samples (CSs) and the mistaken samples (MSs). Assume that there are $r$ MSs. For the MSs, the teacher model does not extract the needed features accurately. Hence, the SWS increases the weight of the MSs by adding the $r$ MSs to the student model's dataset; that is, the input set of student model $S_1$, $X^{(S_1)} = [x_1, \ldots, x_n, x_{n+1}, \ldots, x_{n+r}]^T \in \mathbb{R}^{(n+r) \times m}$, is obtained. In other words, $\hat{Y}^{(T-S_1)}$, which represents the distribution of the output of teacher model $T$, is obtained from teacher model $T$ by SWS and then used for training $S_1$. Thus, the student model $S_1$ can be trained with the modified loss function of equation (9), in which the cross-entropy between the student model's output and the knowledge, $L(\hat{y}_i^{(S_1)}, \hat{y}_i^{(T-S_1)})$, can be regarded as a regularizing part for boosting the student model's performance. The ground truth label information with MSs provides strong capability for further improving discriminative capability. Together, the two parts make full use of both the teacher's knowledge and the ground truth labels.

The above procedures are the contents of step 1 in Figure 3. Besides, the SWS can be applied continuously, as shown in Figure 3. $S_k$ ($k = 1, 2, \ldots, K$) denotes the $k$th student model, and $K$ denotes the number of steps. Each model used in this paper is an SAE with a softmax-KD layer. For step 1, model $T$ is trained. Then SAE-SWS employs the SWS to mine further knowledge. Finally, the further knowledge is applied to train model $S_1$. After $K$ steps, the optimal student model is selected according to the classification accuracy on several datasets.
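The building blocks described so far (temperature softmax, the split into correct and mistaken samples, and a two-part student loss) can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the knowledge attached to the appended MSs is assumed to be the teacher's own softmax-KD output, and the loss form `lam * CE(student, knowledge) + CE(student, labels)` is an assumption about how the two parts are combined.

```python
import numpy as np

def softmax_kd(logits, t=1.0):
    """Temperature softmax; t = 1 reduces to the ordinary softmax."""
    z = np.asarray(logits, dtype=float) / t
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(pred, target, eps=1e-12):
    """Mean cross-entropy of predictions against target distributions."""
    return float(-np.mean(np.sum(target * np.log(pred + eps), axis=1)))

def sws_augment(X, Y, teacher_probs):
    """Append the teacher's mistaken samples (MSs) to the student's dataset."""
    wrong = teacher_probs.argmax(axis=1) != Y.argmax(axis=1)
    X_s = np.vstack([X, X[wrong]])
    Y_s = np.vstack([Y, Y[wrong]])
    # Assumption: the "knowledge" for appended MSs is the teacher's output.
    knowledge = np.vstack([teacher_probs, teacher_probs[wrong]])
    return X_s, Y_s, knowledge, int(wrong.sum())

def student_loss(student_probs, knowledge, labels, lam=0.4):
    """Assumed two-part loss: weighted KD term plus ground-truth term."""
    return lam * cross_entropy(student_probs, knowledge) + cross_entropy(student_probs, labels)

# Toy data: 4 samples, 2 variables, 3 classes (illustrative values only).
X = np.arange(8.0).reshape(4, 2)
Y = np.eye(3)[[0, 1, 0, 2]]
teacher_logits = np.array([[4.0, 1.0, 0.0],
                           [0.0, 3.0, 1.0],
                           [2.0, 2.5, 0.0],   # teacher picks class 1, truth is 0
                           [0.0, 1.0, 5.0]])
teacher_probs = softmax_kd(teacher_logits, t=2.5)
X_s, Y_s, knowledge, r = sws_augment(X, Y, teacher_probs)

# An untrained student with uniform outputs stands in for the real model.
student_probs = softmax_kd(np.zeros((X_s.shape[0], 3)), t=2.5)
loss = student_loss(student_probs, knowledge, Y_s)
```

Duplicating the mistaken rows is one simple way to raise their weight in the averaged loss; a per-sample weight vector would be an equivalent alternative design.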

Flow Chart of the Model for FDD.
On the basis of the SWS above, the ML model for FDD is introduced. FDD models include two parts: offline modeling and online monitoring. The procedures of offline modeling and online monitoring are shown in Figure 4.
Offline modeling:
(7) Use $X_{test}$ and $Y_{test}$ to calculate the classification accuracy of all trained models. Select the optimal model with the highest accuracy.

Online monitoring:
(1) The new sample $x_{new} \in \mathbb{R}^{m \times 1}$ is normalized by using the mean value and variance of the training data $X_{train}$.
(2) Feed $x_{new}$ into the optimal model and obtain its predicted class.

The CSTR process is an anaerobic treatment that is widely used in industrial effluent disposal [22]. As a result of the high research value of the CSTR, many modeling methods have been proposed to fill the gaps. In this experiment, the model proposed by Belevi is adopted [23]. The flow chart of the CSTR is shown in Figure 5.

SAE-KD goes through successive teacher-student training, and SAE-KD-s is its optimal student model. In this experiment, SVM uses the radial basis function (RBF) as its kernel function, and the penalty factor is set to 1. For kNN, the number of neighbors is 5, and the Euclidean distance is the metric. For the SAE-based algorithms, the structure is [100-50-25]. The Adam optimizer is used, and the learning rate is set to 0.001. The batch size is 10, and the number of epochs is 1000. In SAE-SWS, the temperature $t$ is set to 2.5 during training according to the reference [22]. In equation (9), the weight $\lambda$ is varied over [0.1, 0.2, . . ., 1.0] to obtain the optimal parameter. It is set to 0.4 based on the results shown in Figure 6. How to set the other SWS parameter remains a problem. Hence, an experiment is conducted to decide it, and its result is shown in Figure 7. According to Figure 7, this parameter is set to 2. Theoretically, the number of steps $K$ can be very large. However, limited by equipment, $K$ is set to 4 in this paper. The classification accuracy of each model in SAE-SWS is shown in Figure 8. According to Figure 8, $S_1$ is chosen as the optimal student model.

Application to Benchmark Problems
The comparison results presented in Table 2 show that the SAE-SWS algorithm has the best performance among all algorithms. By comparing SAE-SWS-t with SAE-SWS-s, we find that the latter outperforms the former by 1.69%. This value implies that the proposed SWS improves feature extraction capability. SAE-SWS-s outperforms SAE-KD-s by 1.03%, which shows the effectiveness of SWS for knowledge transfer. Table 3 shows the detailed classification accuracy for each fault of SAE-SWS-s.

TE Process.
The TE process is an authoritative benchmark problem in FDD. Since it was proposed by Downs and Vogel, it has emerged as a popular tool to test the performance of FDD models [25]. The entire process consists of five operating units: reactor, condenser, recycle compressor, separator, and stripper. It has 53 variables, including 12 manipulated variables.

The comparison results presented in Table 6 indicate that the SAE-SWS algorithm has the best performance among all algorithms. Besides, the results of SAE-SWS-t and SAE-SWS-s show that the student learns a lot from the teacher. In addition, SAE-SWS-s outperforms SAE-KD-s by 1.81%, which implies the effectiveness of SWS in improving classification performance. Detailed classification accuracy for each fault of SAE-SWS-s can be seen in Table 7.

Conclusions
In this work, a novel strategy called the SWS is proposed for SAEs to deal with fault classification problems.

Figure 2: SAE used with the sample-wise strategy in the teacher model.

$\hat{y}_i^{(S_1)}$ denotes the softmax-KD output of $S_1$ using the temperature $t$. $\hat{y}_i^{(T-S_1)}$ is called the knowledge between models $T$ and $S_1$ using the temperature $t$. $\lambda$ denotes the weight of the cross-entropy between $\hat{y}_i^{(S_1)}$ and $\hat{y}_i^{(T-S_1)}$. The novel loss function is composed of two parts: the cross-entropy between the student model's output $\hat{y}_i^{(S_1)}$ and the knowledge $\hat{y}_i^{(T-S_1)}$, and the cross-entropy between the student model's output $\hat{y}_i^{(S_1)}$ and the ground truth label $y_i^{(S_1)}$.

Figure 3: Entire training procedure with KD and SWS.

Figure 4: Flow chart of offline modeling and online monitoring.

(3) Letting $X^{(T)} = X_{train}$ and $Y^{(T)} = Y_{train}$, the teacher model $T$, which consists of an SAE and a softmax-KD layer, is trained by matching the distribution of ground truth labels. Let $k = 1$, where $k$ denotes the index of steps.
(4) Transfer knowledge and obtain the input set and label set for the next student model by SWS.
(5) Train the student model by matching the distribution of the teacher's output and the ground truth labels.
(6) If $k > K$, jump to (7); otherwise, let the current student become the teacher of the next student, set $k = k + 1$, and go back to (4).
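The control flow of the successive teacher-student steps can be sketched as a short Python skeleton. The four callables (`train_teacher`, `make_student_data`, `train_student`, `accuracy`) are hypothetical placeholders for the actual SAE training, SWS, and evaluation routines, and the loop is read here as training `K` students in total, which is an interpretation of the steps rather than a confirmed detail.

```python
def run_sws_chain(train_teacher, make_student_data, train_student, accuracy, K):
    """Train a teacher, then K successive students, and pick the best model."""
    models = [train_teacher()]                    # teacher model T
    for _ in range(K):
        data = make_student_data(models[-1])      # SWS on the current teacher
        models.append(train_student(data))        # student becomes next teacher
    scores = [accuracy(m) for m in models]        # evaluate all trained models
    best = max(range(len(models)), key=lambda i: scores[i])
    return models[best], best, scores

# Dummy stand-ins so the skeleton runs; the accuracies are placeholders,
# not results from the paper.
fake_accuracy = {"T": 0.90, "S1": 0.95, "S2": 0.93}
student_names = iter(["S1", "S2"])

best_model, best_index, scores = run_sws_chain(
    train_teacher=lambda: "T",
    make_student_data=lambda teacher: teacher,
    train_student=lambda data: next(student_names),
    accuracy=fake_accuracy.get,
    K=2,
)
```

Passing the training routines as arguments keeps the loop independent of any particular network library.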

Figure 6: Relationship between parameter $\lambda$ and classification accuracy for the CSTR.

Figure 7: Relationship between parameter  and classification accuracy for the CSTR.

Figure 8: Classification accuracy of the series of student models for the CSTR.

Figure 9: Relationship between parameter  and classification accuracy in the TE process.

Figure 10: Relationship between parameter  and classification accuracy in the TE process.

Figure 12: Relationship between parameter  and classification accuracy in the SDD dataset.

Figure 14: Classification accuracy of the series of student models in the SDD dataset.
Sun et al. trained a DBN with normal samples in offline modeling. Online, this DBN can discriminate faults according to reconstruction errors [9]. The convolutional neural network (CNN) is widely applied in image processing based on convolution calculation. Roy et al. used a deep CNN with layer-wise training to recognize handwritten Bangla isolated compound characters.

$h_i \in \mathbb{R}^{v \times 1}$ is the hidden layer feature of $x_i$. $W_e \in \mathbb{R}^{v \times m}$ and $b_e \in \mathbb{R}^{v \times 1}$ are the parameters of the encoding stage, where $v$ is the number of hidden nodes. $W_d \in \mathbb{R}^{m \times v}$ and $b_d \in \mathbb{R}^{m \times 1}$ are the parameters of the decoding stage. They can be expressed together by $\theta$. $f_e(\cdot)$ and $f_d(\cdot)$ denote the functions of the encoding and decoding stages.

(1) Dataset $X \in \mathbb{R}^{n \times m}$ and label set $Y \in \mathbb{R}^{n \times c}$ are collected, where $n$ refers to the number of samples, $m$ denotes the number of variables, and $c$ denotes the number of classes. Then $X$ is divided into $X_{train} \in \mathbb{R}^{n_1 \times m}$ for training and $X_{test} \in \mathbb{R}^{n_2 \times m}$ for testing, where $n_1 + n_2 = n$. Correspondingly, $Y$ is divided into $Y_{train} \in \mathbb{R}^{n_1 \times c}$ for training and $Y_{test} \in \mathbb{R}^{n_2 \times c}$ for testing.
(2) Every component of $X_{train}$ is normalized to zero mean and unit variance. Then every component of $X_{test}$ is normalized by using the mean value and variance of $X_{train}$.
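The normalization step can be illustrated with a minimal NumPy sketch in which the statistics are fitted on the training data only and then reused for test data or a new online sample. The helper names `fit_scaler` and `apply_scaler` and the toy values are assumptions for illustration.

```python
import numpy as np

def fit_scaler(X_train):
    """Compute per-variable mean and standard deviation on training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # guard against constant variables
    return mean, std

def apply_scaler(X, mean, std):
    """Normalize data with statistics that were fitted on the training set."""
    return (X - mean) / std

# Toy training set with two variables on different scales.
X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
mean, std = fit_scaler(X_train)
X_train_norm = apply_scaler(X_train, mean, std)

# A new online sample reuses the training statistics unchanged.
x_new = np.array([3.0, 30.0])
x_new_norm = apply_scaler(x_new, mean, std)
```

Reusing the training statistics keeps offline and online inputs on the same scale, so the selected model sees data distributed as it did during training.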

Table 1: Lists of faults and description.

$C_{AF}$ denotes feed concentration. $Q_F$ denotes feed flow. $T_F$ denotes feed temperature. $Q_C$ denotes coolant flow. $T_{CF}$ denotes the coolant inlet temperature. $T$ denotes the temperature in the reactor. $h$ denotes the reactor level. $T_C$ denotes the coolant outlet temperature. $C_A$ denotes the concentration of component A in the reactor. $Q$ denotes the flow of component A. The variables listed above are the 10 operational variables of the CSTR. They are selected as the monitored variables and are recorded in detail. Table 1 lists and describes the 11 faults set for the experiment. Except for the listed variables, other operational variables are within the allowable range.

Several algorithms, including SAE, are chosen as competitive algorithms. For the proposed algorithm, SAE-SWS, the performance of the teacher model (SAE-SWS-t) and the optimal student model (SAE-SWS-s) is listed. Besides, SAE-KD, which removes the SWS from SAE-SWS, is applied for comparison. It shares the same teacher with SAE-SWS.

Table 2: Classification accuracy of some algorithms for the CSTR.

Table 3: Detailed results of SAE-SWS-s for the CSTR.

$\lambda$ is set to 0.3, and the other SWS parameter is set to 2. The optimal student model is $S_1$. SAE-SWS is compared with other algorithms. Table 4 shows the performance of the algorithms for the TE dataset. The SAE series outperforms the conventional methods.

Table 4: Classification accuracy of some algorithms in the TE process.

4.3. Sensorless Drive Diagnosis. In sensorless drive diagnosis (SDD), variables are extracted from electric current drive signals. The drive has intact and defective components. There are 11 classes with different conditions in this dataset. The current signals are measured with a current probe and an oscilloscope on two phases.

Table 5: Detailed results of SAE-SWS-s in the TE process.

The number of neighbors in kNN is 6, and the Euclidean distance is the metric. For the SAE-based methods, the structure is set to [100-50-25]. The Adam optimizer is applied, and its learning rate is set to 0.001. The batch size and the number of epochs are 10 and 1000, respectively. Similarly, a number of experiments are conducted to confirm the related parameters in SAE-SWS. The results are shown in Figures 12, 13, and 14. Accordingly, $\lambda$ is set to 0.3, and the other SWS parameter is set to 2. The optimal student model is $S_3$.

Table 6: Classification accuracy of some algorithms in the SDD dataset.

Table 7: Detailed results of SAE-SWS-s in the SDD dataset.

After training a teacher model, the SWS is applied to obtain "knowledge", which denotes the output distribution of the teacher model, for student model training. Then the student model becomes the next student's teacher. After step-by-step teacher-student training, the optimal model, which is selected among all the trained models, is applied for classifying real-time data. The experiments on several datasets demonstrate the effectiveness of the SWS as well as the proposed SAE-SWS algorithm. Besides, the improvements of the proposed method are summarized as follows. (1) A powerful DL model, SAE, is used for feature extraction in FDD. (2) By successive teacher-student training, SWS boosts the probability of achieving superior performance by increasing the weights of the mistaken samples. (3) KD helps to obtain more useful information about the correlation among classes. However, in practical applications, there are the following limitations. (1) Processes with insufficient training sample size are not conducive to feature extraction in the proposed SAE-SWS, resulting in the extracted features not being the intrinsic representation of data. (2) As the number of steps increases, the number of parameters in SAE-SWS increases dramatically. The application of SAE-SWS therefore requires good hardware equipment and a much longer offline modeling time than traditional monitoring methods. Further study will focus on reducing the number of parameters as well as model complexity.