Fuzzy Wavelet Neural Network Using a Correntropy Criterion for Nonlinear System Identification

Recent research has demonstrated that Fuzzy Wavelet Neural Networks (FWNNs) are an efficient tool for identifying nonlinear systems. These structures combine features of fuzzy logic, wavelet functions, and neural networks in an architecture similar to the Adaptive Neurofuzzy Inference System (ANFIS). In practical applications, the experimental data set used in the identification task often contains unknown noise and outliers, which decrease the reliability of the FWNN model. In order to reduce the negative effects of these erroneous measurements, this work proposes the direct use of a similarity measure based on information theory in the FWNN learning procedure. The Mean Squared Error (MSE) cost function is replaced by the Maximum Correntropy Criterion (MCC) in the traditional error backpropagation (BP) algorithm. The input-output maps of a real nonlinear system studied in this work are identified from an experimental data set corrupted by different outlier rates and additive white Gaussian noise. The results demonstrate the advantages of the proposed MCC cost function over the MSE. This work also investigates the influence of the kernel size on the performance of the MCC in the BP algorithm, since it is the only free parameter of correntropy.


Introduction
System identification is a modeling procedure in which the mathematical representation of the input-output maps of dynamical systems is obtained with the aid of experimental data. This procedure is a prominent alternative for the efficient modeling of complex systems without the need for complex mathematical concepts. For this reason, system identification plays an important role in control engineering tasks such as classification and decision making, monitoring, control, and prediction [1][2][3][4][5][6][7][8].
Artificial Neural Networks (ANNs) represent one of the most successful identification techniques used to model nonlinear dynamical systems [9]. This is due to their ability to learn from examples, together with their intrinsic robustness and nonlinear characteristics [10][11][12][13]. Recently, a wide variety of network structures have been used to model the input-output maps of nonlinear systems [5,14,15]. The Multilayer Perceptron (MLP), the Radial Basis Function (RBF) network, neurofuzzy hybrid structures such as the Adaptive Neurofuzzy Inference System (ANFIS), and Wavelet Neural Networks (WNNs) are examples of ANNs commonly used in applications involving nonlinear systems [9,13,16,17].
WNNs combine the flexibility of ANNs and the curve fitting ability of wavelet functions [18][19][20]. Moreover, their domain of validity can be extended by adding an extra layer of fuzzy structures that provides a coarse delimitation of the universe of discourse, resulting in Fuzzy Wavelet Neural Networks (FWNNs) [5]. The architecture of the FWNN is very close to the traditional ANFIS [21], although wavelets are used as membership functions (MFs) [22,23] or in the consequent part of the fuzzy rules, through the use of WNNs as local models. The literature contains several works that apply FWNNs to modeling, control, function approximation, and nonlinear system identification, among others [6][24][25][26][27][28].
Mathematical Problems in Engineering
In [29], Linhares et al. evaluate an alternative FWNN structure to identify the nonlinear dynamics of a multisection liquid tank. The proposed structure is similar to the ones presented by Yilmaz and Oysal [5], Abiyev and Kaynak [6], and Lu [24]. However, the FWNN presented in [29] uses only wavelets in the consequent part of the fuzzy rules. The wavelets in each node of the FWNN consequent layer are weighted by the activation signals of the fuzzy rules. Therefore, the local models of this FWNN are represented solely by a set of wavelet functions, which differs from [5,6,24]. The results presented in [29] demonstrate that the modified FWNN structure maintains the generalization capability and other important features of traditional FWNNs, despite the reduction in the complexity of these structures.
In practical applications, the experimental data set used in the identification procedure is often corrupted by unknown noise and outliers. Outliers are incorrect measurements that markedly deviate from the typical ranges of other observations [30]. The main source of outliers is the sporadic malfunctioning of sensors and equipment [31]. The presence of noise and outliers in experimental data negatively affects the performance and reliability of the dynamical model under identification, because the model tries to fit such undesired measurements [30,32,33]. Although many outlier detection methods are presented in the literature, many of them are not able to detect all the outliers. Therefore, the data obtained after the application of such methods may still be contaminated with outliers [30,31].
Generally, the learning process of neural networks is based on a gradient method, for example, the classical error backpropagation (BP) algorithm, which uses the Mean Squared Error (MSE) as its cost function. However, using the MSE to obtain a model that represents an input-output relationship is optimal only if the probability density function (pdf) of the errors is Gaussian [34], whereas the error distribution in most cases is non-Gaussian and nonlinear [8]. Several works in the literature demonstrate that replacing the traditional MSE with the Maximum Correntropy Criterion (MCC) is an effective approach to handle prediction and identification problems when the data from the dynamical system contain unknown noise and outliers [7,8,30,35]. Evaluating correntropy allows the extraction of additional information from the available data, because this similarity measure takes into account all the moments of a probability distribution that are typically not observed by the MSE [7].
In this work, the reliability of the FWNN recently proposed in [29] is evaluated when different percentages of outliers and noise contaminate the experimental data used to identify a nonlinear system. The network is used to identify the dynamic relationship between the input and output of a multisection liquid tank. The FWNN is trained with the BP algorithm, in which the traditional MSE cost function is replaced by the Maximum Correntropy Criterion with an adaptive adjustment of its kernel size, which is the free parameter of the MCC. The models obtained using each of the two quality measures are evaluated and compared. Despite the advantages of correntropy over MSE, little effort has been reported towards the application of correntropy to identify nonlinear systems using neural networks [7,8]. The results presented in this work demonstrate that the FWNN architecture proposed in [29] is less sensitive to the presence of outliers and noise when it is trained using the MCC. In addition, this work investigates the influence of the kernel size on the performance of the MCC in the BP algorithm.
This paper is organized as follows. Section 2 presents the definition and the basic mathematical theory of the correntropy similarity measure. Section 3 describes the FWNN proposed in [29], which is applied in this work to identify an experimental nonlinear dynamical system in the presence of outliers and noise. Section 4 presents the updating equations of the BP algorithm, which are modified according to the MCC. Section 5 describes the proposed identification architecture in detail. Section 6 presents the multisection liquid tank under study and evaluates the performance of the FWNN models obtained using the MSE and MCC cost functions, considering the presence of both outliers and noise in the experimental data. Finally, concluding remarks are given in Section 7.

Correntropy
Correntropy is a generalized similarity measure between two arbitrary scalar random variables $X$ and $Y$ defined by [36]

$$V_\sigma(X, Y) = E[\kappa_\sigma(X, Y)] = \iint \kappa_\sigma(x, y)\, p_{X,Y}(x, y)\, dx\, dy, \tag{1}$$

where $p_{X,Y}$ is the joint probability distribution, $E[\cdot]$ is the expectation operator, and $\kappa_\sigma(\cdot, \cdot)$ is a symmetric positive definite kernel. In this work, $\kappa_\sigma(\cdot, \cdot)$ is a Gaussian kernel given as

$$\kappa_\sigma(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - y)^2}{2\sigma^2}\right), \tag{2}$$

where $\sigma$ is the kernel size ($\sigma^2$ being the variance of the Gaussian). The kernel size may be interpreted as the resolution at which correntropy measures similarity in a high-dimensional space [36]. By applying a Taylor series expansion to the Gaussian function in (1) and assuming that all the moments of the joint pdf are finite, the equation becomes

$$V_\sigma(X, Y) = \frac{1}{\sqrt{2\pi}\,\sigma} \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n \sigma^{2n}\, n!}\, E\left[(X - Y)^{2n}\right]. \tag{3}$$

In practice, the joint pdf in (1) is unknown and only a finite amount of data $\{(x_i, y_i)\}_{i=1}^{N}$ is available, leading to the sample correntropy estimator defined by

$$\hat{V}_{\sigma,N}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(x_i, y_i). \tag{4}$$

Correntropy involves all the even moments of the difference between $X$ and $Y$. Compared with the MSE, $E[(X - Y)^2]$, which is a quadratic function in the joint input space, correntropy includes second-order and higher-order statistical information [37]. However, for sufficiently large values of $\sigma$, the second-order moment is predominant and the measure approaches correlation [38].
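As an illustration of the sample correntropy estimator with a Gaussian kernel, the following minimal Python sketch averages the kernel over paired samples. The function names and the toy data are hypothetical and chosen only for this example; it is a sketch of the definition, not the implementation used in this work.

```python
import math

def gaussian_kernel(x, y, sigma):
    """Gaussian kernel kappa_sigma(x, y) with kernel size sigma."""
    norm = math.sqrt(2.0 * math.pi) * sigma
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2)) / norm

def sample_correntropy(xs, ys, sigma):
    """Average of the kernel over the paired samples (x_i, y_i)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum(gaussian_kernel(x, y, sigma) for x, y in zip(xs, ys)) / len(xs)

# Toy data (illustrative only): the last pair of y_outlier is an outlier.
x = [0.0, 1.0, 2.0, 3.0]
y_clean = [0.1, 0.9, 2.1, 2.9]
y_outlier = [0.1, 0.9, 2.1, 50.0]

v_clean = sample_correntropy(x, y_clean, sigma=1.0)
v_outlier = sample_correntropy(x, y_outlier, sigma=1.0)
print(v_clean, v_outlier)
```

In this toy case the outlier pair contributes almost nothing to the average, so the estimate drops by roughly one quarter instead of being dominated by the single bad sample.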
Nowadays, correntropy has been successfully used in a wide variety of applications where the signals are non-Gaussian or nonlinear, for example, automatic modulation classification [39], classification systems of pathological voices [40], and principal component analysis (PCA) [41].
Therefore, it is possible to determine the optimal solution for the MCC from (4) as [43]

$$\arg\max_{w}\; \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma\left(f(x_i, w),\, y_i\right), \tag{5}$$

where $f(x_i, w)$ is the model output for the input $x_i$ and parameters $w$, and $e_i = f(x_i, w) - y_i$, $i = 1, \ldots, N$, are the errors generated by the model during the supervised learning for each of the $N$ training samples. This criterion is used as the cost function of the BP algorithm to adjust the parameters of the FWNN. One of the advantages of using correntropy in system identification lies in its robustness against impulsive noise, which is due to the Gaussian kernel in (5) being close to zero, that is, $\kappa_\sigma(f(x, w), y) \approx 0$, when $f(x, w)$ or $y$ is an outlier. Correntropy is positive and bounded; for the Gaussian kernel, $0 < V_\sigma(X, Y) \le 1/(\sqrt{2\pi}\,\sigma)$.
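The robustness argument can be illustrated by fitting the simplest possible model, a constant, to data containing one outlier. The sketch below compares the MSE solution (the sample mean) with an MCC solution obtained by a fixed-point re-weighting iteration. This iteration is a stand-in chosen for brevity, not the BP procedure used in this work, and the data and function names are hypothetical.

```python
import math

def mse_estimate(samples):
    """Constant that minimizes the mean squared error: the sample mean."""
    return sum(samples) / len(samples)

def mcc_estimate(samples, sigma, iterations=50):
    """Constant that locally maximizes the sample correntropy.

    Fixed-point iteration: every sample is weighted by the Gaussian
    kernel of its current error, then a weighted mean is taken.
    """
    ordered = sorted(samples)
    estimate = ordered[len(ordered) // 2]  # start from a middle sample
    for _ in range(iterations):
        weights = [math.exp(-((y - estimate) ** 2) / (2.0 * sigma ** 2))
                   for y in samples]
        estimate = sum(w * y for w, y in zip(weights, samples)) / sum(weights)
    return estimate

# Five measurements near 1.0 and one outlier (illustrative values).
data = [1.0, 1.1, 0.9, 1.05, 0.95, 20.0]
mse_fit = mse_estimate(data)
mcc_fit = mcc_estimate(data, sigma=0.5)
print(mse_fit, mcc_fit)
```

The mean is pulled far towards the outlier, while the kernel weight of the outlier is practically zero, so the MCC estimate stays with the bulk of the data.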
The kernel size of the Gaussian is a free parameter that must be selected by the user [38]. Therefore, when correntropy is estimated, the resulting values depend on the selected kernel size. In addition, the kernel size influences the nature of the performance surface, the presence of local optima, the rate of convergence, and the robustness to impulsive noise during adaptation [37,43]. If the training data set is not large enough, the kernel size must be chosen considering the tradeoff between outlier rejection and estimation efficiency [44]. Several approaches can be employed to determine the kernel size, for example, the statistical method [45], Silverman's rule [46], cross validation techniques [47,48], and the shape of the prediction error distribution [44]. This work uses an adaptive kernel size algorithm [42], given by (6). In order to assess the improved performance of an adaptive kernel size over fixed ones, Section 6 shows how the error evolves during the FWNN training for different values of the kernel size.
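Of the selection approaches listed above, Silverman's rule is the easiest to sketch. The example below uses its Gaussian reference form, $\sigma = 1.06\, s\, N^{-1/5}$, where $s$ is the sample standard deviation of the errors. It is shown only to make the idea of a data-driven kernel size concrete: it is not the adaptive algorithm of [42] adopted in this work, and the error values are made up.

```python
import statistics

def silverman_kernel_size(errors):
    """Kernel size from Silverman's rule (Gaussian reference form)."""
    n = len(errors)
    if n < 2:
        raise ValueError("at least two error samples are required")
    spread = statistics.stdev(errors)  # sample standard deviation
    return 1.06 * spread * n ** (-1.0 / 5.0)

# Hypothetical prediction errors collected over one training epoch.
errors = [0.02, -0.01, 0.03, -0.04, 0.00, 0.01, -0.02, 0.05]
sigma = silverman_kernel_size(errors)
print(sigma)
```

Recomputing such a value from the current errors at every epoch is one simple way to let the kernel size follow the error distribution as training progresses.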

Fuzzy Wavelet Neural Networks
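3.1. Brief Review. Wavelets are obtained by scaling and translating a special function $\psi(x)$, localized in both time/space and frequency, called the mother wavelet, which can be defined in such a way as to serve as a basis to describe other functions. Wavelets are extensively used in the fields of signal analysis, identification and control of dynamical systems, computer vision, and computer graphics, among other applications [49][50][51][52]. Given $\psi(x)$, the corresponding family of wavelets is obtained by

$$\psi_j(\mathbf{x}) = |\mathbf{d}_j|^{-1/2}\, \psi\left(\frac{\mathbf{x} - \mathbf{t}_j}{\mathbf{d}_j}\right), \quad \mathbf{d}_j \neq 0, \quad j = 1, \ldots, k, \tag{7}$$

where $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ is the input vector, $\mathbf{d}_j$ and $\mathbf{t}_j$ are the dilation and translation parameters, respectively, and $k$ is the number of wavelets.

A WNN is a nonlinear regression structure that can represent input-output maps by combining wavelets with appropriate scalings and translations [53]. The output of a WNN is determined as follows:

$$y = \sum_{j=1}^{k} w_j\, \psi_j(\mathbf{x}) = \sum_{j=1}^{k} w_j\, |\mathbf{d}_j|^{-1/2}\, \psi\left(\frac{\mathbf{x} - \mathbf{t}_j}{\mathbf{d}_j}\right), \tag{8}$$

where $w_j$ are the synaptic weights, $\mathbf{x}$ is the input vector, and $\mathbf{d}_j$ and $\mathbf{t}_j$ are parameters characterizing the wavelets. In a concise manner, the purpose of FWNNs is to incorporate WNNs into the ANFIS structure in order to obtain faster convergence and better approximation capabilities, possibly at the cost of a greater number of parameters to be adjusted. The fuzzy rules allow tackling the uncertainties, while wavelets contribute to improving the accuracy in the process of approximating input-output maps [6].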
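A WNN output of this kind can be sketched for a scalar input as a weighted sum of dilated and translated copies of a mother wavelet. The sketch below assumes the usual $|d|^{-1/2}$ normalization and uses the Mexican Hat wavelet, which is the family adopted later in this work. The function names and parameter values are hypothetical.

```python
import math

def mexican_hat(z):
    """Mexican Hat mother wavelet (unnormalized form)."""
    return (1.0 - z * z) * math.exp(-0.5 * z * z)

def wnn_output(x, weights, dilations, translations, mother=mexican_hat):
    """Weighted sum of dilated and translated wavelets for a scalar input x."""
    total = 0.0
    for w, d, t in zip(weights, dilations, translations):
        if d == 0:
            raise ValueError("dilation parameters must be nonzero")
        # |d|^(-1/2) is the assumed normalization of the wavelet family.
        total += w * abs(d) ** -0.5 * mother((x - t) / d)
    return total

# Three wavelets with illustrative parameters.
y = wnn_output(0.5, weights=[1.0, -0.5, 0.2],
               dilations=[1.0, 2.0, 0.5], translations=[0.0, 1.0, -1.0])
print(y)
```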

FWNN Architecture.
A particular instance of the FWNN proposed in [29] is applied in this work to identify a real nonlinear system, investigating its performance and reliability when the experimental data set is corrupted by unknown noise and outliers. In this FWNN architecture, the consequent part of the fuzzy rules is described only by wavelet functions, which differs from other structures such as those proposed in [5,6,24]. The basic architecture of the FWNN can be seen in Figure 1 and its layers are described as follows.
Layer 1. The input layer just transfers the input signal vector $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ to the next layer.
Layer 2. In the fuzzification layer, the membership functions are parameterized to match the specific requirements of a variety of applications. For instance, a Gaussian membership function $A_{ij}(x_i)$ can be described by (9), where $A_{ij}$ is associated with the $j$th membership function appearing in a given rule and evaluated for the $i$th component of the input vector, $i = 1, 2, \ldots, n$. The adjustable parameters are $c_{ij}$ and $\sigma_{ij}$, representing the center and width of the membership function, respectively.
Layer 3. This is the inference layer. Assuming that there are $M$ rules, where $R_j$ is a given rule and $j = 1, 2, \ldots, M$, each rule produces an output $\mu_j$ by aggregating the membership degrees $A_{ij}$ using a T-norm, as given by (10). All the rule outputs of this layer are added up in the summation node located between Layers 3 and 4. The output of this node, $\sum_{j=1}^{M} \mu_j$, is later used in the normalization stage.
Layer 4. In the normalization layer, the normalization factor for the output of the $j$th rule, denoted by $\bar{\mu}_j$, is given by

$$\bar{\mu}_j = \frac{\mu_j}{\sum_{l=1}^{M} \mu_l}. \tag{11}$$

Layer 5. This is the consequent layer of the FWNN. In this work, the Mexican Hat family of wavelets is adopted as in [5,6,54]. Its mathematical representation is given by

$$\psi(z) = (1 - z^2) \exp\left(-\frac{z^2}{2}\right). \tag{12}$$

The inputs of the wavelet layer are the normalized weights $\bar{\mu}_j$ and the input vector $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$, while the outputs of this layer, represented by $\Psi_j$, are given by (13), where the term $z_{ij} = (x_i - t_{ij})/d_{ij}$, with $d_{ij} > 0$, is adopted to simplify the mathematical notation. Equation (13) also involves the number of wavelet functions in a node of Layer 5.
Layer 6. In the output layer, all signals from the wavelet neurons are summed up as follows:

$$\hat{y} = \sum_{j} \Psi_j, \tag{14}$$

where $\Psi_j$ is the output of the $j$th wavelet neuron of Layer 5. By observing Figure 1 and considering (9) to (14), it is possible to notice that the adjustable parameters of the FWNN are located in the second and fifth layers. The membership functions and wavelet functions are adjusted according to the application using any learning algorithm, such as the BP algorithm.
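The six layers can be summarized as a single forward pass. The following sketch is written under explicitly assumed forms that may differ from those of [29]: a Gaussian membership function $\exp(-((x_i - c_{ij})/\sigma_{ij})^2)$, the product as T-norm, and a Layer 5 node output equal to the normalized firing strength times the sum of one Mexican Hat wavelet per input component. All names are hypothetical.

```python
import math

def mexican_hat(z):
    return (1.0 - z * z) * math.exp(-0.5 * z * z)

def fwnn_forward(x, centers, widths, translations, dilations):
    """Forward pass of a small FWNN under the assumptions stated above.

    x is a list of n inputs; every parameter is a list with one row per
    rule, each row holding n values (one per input component).
    """
    # Layers 2 and 3: membership degrees aggregated by product (assumed T-norm).
    firing = []
    for c_row, s_row in zip(centers, widths):
        mu = 1.0
        for x_i, c, s in zip(x, c_row, s_row):
            mu *= math.exp(-((x_i - c) / s) ** 2)
        firing.append(mu)
    total = sum(firing)  # summation node between Layers 3 and 4
    if total == 0.0:
        raise ValueError("no rule is active for this input")
    # Layer 4: normalized firing strengths.
    normalized = [mu / total for mu in firing]
    # Layer 5: wavelets weighted by the normalized firing strength.
    node_outputs = []
    for mu_bar, t_row, d_row in zip(normalized, translations, dilations):
        wavelets = sum(mexican_hat((x_i - t) / d)
                       for x_i, t, d in zip(x, t_row, d_row))
        node_outputs.append(mu_bar * wavelets)
    # Layer 6: sum of all wavelet neuron outputs.
    return sum(node_outputs)

# Two inputs, two rules, illustrative parameter values.
y_hat = fwnn_forward([0.0, 0.0],
                     centers=[[0.0, 0.0], [0.0, 0.0]],
                     widths=[[1.0, 1.0], [1.0, 1.0]],
                     translations=[[0.0, 0.0], [0.0, 0.0]],
                     dilations=[[1.0, 1.0], [1.0, 1.0]])
print(y_hat)
```

With these symmetric toy parameters both rules fire equally, so each contributes half of its wavelet sum to the output.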

Error Backpropagation Algorithm with MCC
The classical BP algorithm is the learning algorithm used in this work to adjust the free parameters of the FWNN models. According to [54], this algorithm is probably the most frequently used technique to train an FWNN. Despite its functionality, it presents some shortcomings: it may get stuck in a local minimum of the error surface, and its training convergence rate is generally slow [55][56][57]. However, it is well known that the use of wavelet functions in neural network structures reduces such inconveniences [6,58].
A neural system should be designed to present a desired behavior; hence, it is necessary to define a cost function for this task. The cost function provides an evaluation of the quality of the solution obtained by the neural model [59]. Gradient based learning algorithms, such as the BP algorithm, require the differentiation of the chosen cost function with respect to the adjustable parameters of the FWNN model. Therefore, it is necessary to obtain the partial derivatives of the chosen cost function with respect to the parameters $d_{ij}$ and $t_{ij}$ of the wavelets and the parameters $c_{ij}$ and $\sigma_{ij}$ of the membership functions $A_{ij}$.
Typically, MSE is the cost function used with the BP algorithm [10]. In this work, this classical cost criterion is replaced by the MCC in order to increase the reliability of the FWNN model when the data from the identified dynamical system contain outliers and noise. When using the MCC, the main goal is to maximize the correntropy similarity measure between two random process variables. In the FWNN learning procedure, such variables are the desired output $y_d$ and the estimated output $\hat{y}$ provided by the FWNN model. Considering the estimation error of the FWNN model given by $e = y_d - \hat{y}$, maximizing the MCC is equivalent to minimizing the cost function $E$ in (15), where $N$ is the number of samples in the experimental data. Equation (15) corresponds to the cost function used during the minimization process of the BP algorithm applied to adjust the parameters of the FWNN models. As such parameters are adjusted sequentially, (16) defines the instantaneous correntropy used to update the wavelet function and membership function parameters of the FWNN after each training pair is presented to the network. Differentiating $E$ with respect to $d_{ij}$ and $t_{ij}$, and likewise with respect to $c_{ij}$ and $\sigma_{ij}$, yields the gradients required by the BP algorithm. Following the delta rule mentioned in [10], each adjustable parameter $\theta \in \{d_{ij}, t_{ij}, c_{ij}, \sigma_{ij}\}$ of the proposed FWNN is updated as

$$\theta(k+1) = \theta(k) - \eta\, \frac{\partial E}{\partial \theta},$$

where $\eta$ is the learning rate. To initialize the training algorithm, the wavelet and membership function parameters are set to random numbers drawn from a uniform distribution. Replacing the traditional MSE by the MCC adds another learning parameter to the BP algorithm. As already explained, the success of correntropy depends on the appropriate adjustment of the kernel size of its Gaussian functions. This new parameter influences the nature of the performance surface, the presence of local optima, the rate of convergence, and the robustness. Therefore, if an unsuitable kernel size is chosen, the expected improvement in performance of the MCC will not be obtained [60]. For this reason, an adaptive kernel method is applied in this work (see (6)) to adjust the kernel size over the learning epochs.
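The effect of the MCC on a gradient update can be sketched with a toy one-parameter model. The example assumes the instantaneous cost $E = -\kappa_\sigma(e)$ with a normalized Gaussian kernel, which gives $\partial E / \partial e = (e/\sigma^2)\,\kappa_\sigma(e)$; the exact scaling used in this work may differ. The model $\hat{y} = \theta x$ and all names are hypothetical stand-ins for the FWNN.

```python
import math

def gaussian_kernel(e, sigma):
    norm = math.sqrt(2.0 * math.pi) * sigma
    return math.exp(-(e ** 2) / (2.0 * sigma ** 2)) / norm

def mcc_loss(e, sigma):
    """Instantaneous cost: negative kernel of the error (assumed form)."""
    return -gaussian_kernel(e, sigma)

def mcc_loss_gradient(e, sigma):
    """Derivative dE/de of the assumed cost E = -kappa_sigma(e)."""
    return (e / sigma ** 2) * gaussian_kernel(e, sigma)

def delta_rule_step(theta, x, y_d, sigma, eta):
    """One delta-rule update of the toy model y_hat = theta * x."""
    e = y_d - theta * x
    # Chain rule: dE/dtheta = dE/de * de/dy_hat * dy_hat/dtheta, de/dy_hat = -1.
    grad_theta = mcc_loss_gradient(e, sigma) * (-1.0) * x
    return theta - eta * grad_theta

theta_new = delta_rule_step(theta=0.0, x=1.0, y_d=1.0, sigma=1.0, eta=0.1)
outlier_gradient = mcc_loss_gradient(50.0, 1.0)
print(theta_new, outlier_gradient)
```

The update moves the parameter towards the target for an ordinary error, whereas the gradient produced by a very large error is practically zero, so an outlier barely changes the parameters. Under the MSE the same outlier would produce a gradient proportional to the error itself.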

Proposed Identification Architecture
The architecture adopted in this work identifies the dynamic relationship between the input and output of a multisection tank for water storage. The system is evaluated when the experimental data used during the identification task is corrupted with noise and outliers. The proposed architecture is based on the series-parallel identification scheme described in [13], with small modifications due to the characteristics of the experimental data set and the learning procedure used to adjust the parameters of the FWNN model. Figure 2 presents a schematic diagram of the proposed identification architecture.
The inputs of the FWNN model are past values of the input signal $u(k)$ and of the system output corrupted with noise and outliers, $y(k)$, which is the sum of the actual system output and a disturbance term, while the estimated output is given by $\hat{y}(k)$. The work developed in [9] shows that well-known linear modeling structures, such as FIR (Finite Impulse Response), ARX (AutoRegressive, eXogenous input), ARMAX (AutoRegressive, Moving Average, eXogenous input), OE (Output Error), and SSIF (State Space Innovations Form), may be extended by using nonlinear functions or representations, thus leading to the nonlinear modeling structures NFIR, NARX, NARMAX, NOE, and NSSIF. This concept is used to define the inputs of the FWNN models obtained in this study.
According to [9], the advantage of a NARX model is that none of its regressors depends on past outputs of the model, which ensures that the predictor remains stable. This is particularly important in the nonlinear case, since the stability issue is much more complex than in linear systems. Because the inputs of the FWNN models in this work are described exactly as the regression vector of the NARX modeling structure, they inherit important characteristics from this structure. Figure 3 shows more details on the FWNN inputs in accordance with the NARX structure, whose constants define the order and the delay of the model.
Figure 2 illustrates that the FWNN model parameters are updated according to the error signal $e(k) = y(k) - \hat{y}(k)$ by using a learning algorithm, for example, the BP algorithm, with the MCC adopted as its cost criterion. As previously explained in Section 2, the success of the MCC also depends on the correct choice of the kernel size. Therefore, the adaptive method described by (6) is used in this work to adjust the kernel size during the learning epochs.

Experiments and Results
In order to evaluate the performance of the FWNN when the traditional MSE cost criterion of the error backpropagation algorithm is replaced by MCC, the aforementioned neural network is used to identify a real dynamical system, considering that its experimental data is corrupted by noise and outliers.

Multisection Liquid Tank.
The multisection liquid tank consists of an acrylic tank for containment of liquids with three abrupt changes in its cross-sectional area, as can be seen in Figure 4. The liquid tank was originally designed for educational purposes, to be used in studies of identification and control of dynamical systems [61]. It was also used in [29] to evaluate the performance of the alternative FWNN structure employed in this work. In addition to the acrylic tank structure, the system is composed of a water reservoir, a water pump, a pressure sensor, an electronic power driver, and an electronic interface with A/D (analog-to-digital) and D/A (digital-to-analog) converters.
The nonlinearity of the liquid output flow, which is due to the different pressure levels at the tank base according to the height of the liquid column, can be clearly noticed in this dynamical system. Besides, the distinct cross-sectional areas make such nonlinearity even more evident. It is worth mentioning that the abrupt transitions between the tank sections are also responsible for discontinuities. The whole system can be seen as a set of three coupled nonlinear systems, since each tank section has its own dynamic behavior.

System Identification.
Initially, in order to collect the experimental data set used during the learning and testing phases of the FWNN models identified in this work, the water pump is excited with an APRBS (Amplitude Modulated Pseudorandom Binary Sequence) and the water level inside the tank is registered at a sample rate of $f_s = 2$ Hz. For the generation of the persistent excitation signal, the following parameters are considered: minimum hold time $T_h = 10$ s, minimum amplitude $A_{\min} = 0$ V, and maximum amplitude $A_{\max} = 15$ V. Since only positive voltages are considered in this case study, the pump only operates to move the liquid from the reservoir to the multisection tank.
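An excitation signal of this kind can be sketched as a piecewise-constant sequence with random levels and random hold times. The sketch below uses the stated sample rate, minimum hold time, and amplitude range. The maximum hold time, the number of samples, and the random seed are assumptions introduced only for the example, since they are not given in the text.

```python
import random

def aprbs(n_samples, min_hold, max_hold, a_min, a_max, seed=0):
    """Piecewise-constant signal with random levels and random hold lengths."""
    rng = random.Random(seed)
    signal = []
    while len(signal) < n_samples:
        hold = rng.randint(min_hold, max_hold)   # hold length in samples
        level = rng.uniform(a_min, a_max)        # amplitude in volts
        signal.extend([level] * hold)
    return signal[:n_samples]

fs = 2.0        # sample rate in Hz (from the text)
t_hold = 10.0   # minimum hold time in seconds (from the text)
min_hold = int(t_hold * fs)  # 20 samples
# max_hold and n_samples are hypothetical choices for this sketch.
u = aprbs(n_samples=1000, min_hold=min_hold, max_hold=3 * min_hold,
          a_min=0.0, a_max=15.0)
print(len(u), min(u), max(u))
```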
After the system excitation, the collected data is corrupted with additive white Gaussian noise and two different percentages of outliers (1% and 3%).The resulting data are divided into two sets comprising approximately 80% and 20% of the total amount.The first set is used to train the FWNN model and the second one is used during the testing phase.
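The corruption and splitting steps can be sketched as follows. The outlier rate and the 80%/20% split follow the text, whereas the noise standard deviation, the outlier magnitude, and the way outliers are injected (a fixed offset with random sign) are assumptions made only for this example.

```python
import random

def corrupt(y, outlier_rate, noise_std, outlier_scale, seed=0):
    """Add white Gaussian noise to every sample and outliers to a random subset."""
    rng = random.Random(seed)
    noisy = [v + rng.gauss(0.0, noise_std) for v in y]
    n_outliers = round(outlier_rate * len(y))
    indices = rng.sample(range(len(y)), n_outliers)
    for i in indices:
        # Hypothetical outlier model: large offset with random sign.
        noisy[i] += rng.choice((-1.0, 1.0)) * outlier_scale
    return noisy, sorted(indices)

def split(data, train_fraction=0.8):
    """First part for training, remainder for testing."""
    cut = int(train_fraction * len(data))
    return data[:cut], data[cut:]

level = [10.0] * 200  # placeholder for the measured water level
noisy_level, outlier_idx = corrupt(level, outlier_rate=0.03,
                                   noise_std=0.1, outlier_scale=5.0)
train, test = split(noisy_level)
print(len(outlier_idx), len(train), len(test))
```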
The whole data set is normalized to fit within the range [0, 1] in order to avoid numerical problems during the FWNN learning procedure. Since the multisection tank is a first-order nonlinear system and also considering Figure 3, the inputs of the FWNN models are defined with the two order constants equal to 1 and the delay constant equal to 0. Thus, $u(k)$, $u(k-1)$, and $y(k)$ are defined as inputs to the FWNN models to predict $y(k+1)$.
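The normalization and the construction of the input vectors can be sketched as below. The function names and the toy sequences are hypothetical; the regressor layout follows the inputs $u(k)$, $u(k-1)$, $y(k)$ and the target $y(k+1)$ stated above.

```python
def min_max_normalize(values):
    """Rescale a sequence linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant sequence")
    return [(v - lo) / (hi - lo) for v in values]

def build_regressors(u, y):
    """Pairs ([u(k), u(k-1), y(k)], y(k+1)) for every valid time step k."""
    inputs, targets = [], []
    for k in range(1, len(y) - 1):
        inputs.append([u[k], u[k - 1], y[k]])
        targets.append(y[k + 1])
    return inputs, targets

# Toy sequences standing in for the pump voltage and the water level.
u = [0.0, 1.0, 2.0, 3.0]
y = [10.0, 11.0, 12.0, 13.0]
inputs, targets = build_regressors(u, y)
print(inputs, targets)
print(min_max_normalize([2.0, 4.0, 6.0]))
```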
The BP algorithm presented in Section 4 is used to adjust the parameters $c_{ij}$, $\sigma_{ij}$, $t_{ij}$, and $d_{ij}$ of the FWNN. After a trial-and-error procedure, the learning rate $\eta = 0.0001$ was found to be a good choice to identify the multisection tank. It is worth mentioning that the results presented in this work were obtained after 350 learning epochs.
Figure 5 presents the model validation when 1% of the original experimental data set is corrupted with outliers and additive white Gaussian noise is inserted. In this figure, the tank water level in cm is plotted as a function of the sample time step, where each time step is equivalent to 0.5 seconds, as defined by the sample rate $f_s = 2$ Hz. The terms FWNN-MCC and FWNN-MSE identify the FWNN models obtained using the MCC and the MSE as the cost criterion of the BP algorithm, respectively. FWNN-MCC clearly has the best performance, which is attributed to its use of higher-order statistical information. On the other hand, the FWNN-MSE model, based on second-order moments, has difficulty identifying the input-output dynamic relationship of the multisection tank at some points of the validation curve. The presence of outliers in the experimental data has a significant negative impact on the FWNN model when the MSE criterion is used in the learning procedure, since the error due to the outliers grows quadratically. The same behavior is not observed in FWNN-MCC when 1% of the experimental data are corrupted by outliers, because the contribution of the outliers is weighted by the Gaussian kernel.
Figure 6 shows the model validation when 3% of the original experimental data is corrupted with outliers and additive white Gaussian noise is inserted. In Figure 6, only the validation points are plotted to allow a better visualization of the outliers and their effects on the FWNN-MCC and FWNN-MSE models. Two regions are highlighted in Figure 6 to demonstrate the improvement of the FWNN-MCC model. Both models present problems at some points, although FWNN-MCC performs better in identifying the multisection tank dynamics and seems to be less sensitive to outliers and noise than the FWNN-MSE model. It is noteworthy that the MCC has intrinsic robustness due to the local estimation produced by the kernel size.
It is also important to mention that the correntropy criterion has a free parameter, that is, the kernel size, which is at the core of the learning process [38]. An adaptive kernel is applied in this work to improve the performance of the FWNN learning procedure. Figure 7 shows the MSE obtained over the 350 epochs for three different fixed kernel sizes, that is, 0.01, 0.1, and 10, and also using the adaptive kernel. The adaptive kernel size method described by (6) has the highest convergence rate and the best performance in the attenuation of outliers and noise. Figure 8 presents the behavior of the adaptive kernel size during the learning stage of the FWNN-MCC model when the experimental data contain 1% and 3% of outliers. During the initial epochs of the BP algorithm, the kernel size is quite oscillatory. However, the kernel size becomes more stable as training approaches the hundredth epoch.
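The sensitivity to the kernel size can be made concrete by looking at the factor $\exp(-e^2/(2\sigma^2))$, which scales the contribution of an error $e$ to the gradient of a Gaussian-kernel correntropy cost. The sketch below evaluates this factor for the three fixed kernel sizes mentioned above and for a few error magnitudes on data normalized to [0, 1]. The error values are chosen only for illustration.

```python
import math

def error_weight(e, sigma):
    """Factor exp(-e^2 / (2 sigma^2)) applied to an error of size e."""
    return math.exp(-(e ** 2) / (2.0 * sigma ** 2))

for sigma in (0.01, 0.1, 10.0):
    row = [error_weight(e, sigma) for e in (0.01, 0.1, 1.0)]
    print(sigma, ["%.3g" % w for w in row])
```

A very large kernel size weights all errors almost equally, so the criterion behaves much like the MSE and loses its robustness. A very small kernel size suppresses even moderate errors, which leaves little gradient to learn from. This tradeoff is consistent with the motivation for adapting the kernel size during training.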

Conclusions
This work has analyzed the performance of an FWNN when applied to identify a real nonlinear dynamical system in the presence of unknown noise and outliers. Such erroneous measurements in experimental data reduce the reliability of the identified model, since it tries to fit behaviors that are not part of the dynamical system. The most common learning techniques applied to adjust the FWNN parameters in identification applications are gradient based methods that use the MSE as their cost function. This paper has proposed the replacement of this traditional evaluation measure by a similarity measure based on information theory called correntropy. The MCC was therefore used as the cost function of the error backpropagation algorithm in order to reduce the negative effects of unknown noise and outliers. The results have demonstrated that the FWNN-MCC models represent the input-output dynamics of the multisection liquid tank more accurately and are less sensitive to outliers and noise than the FWNN-MSE models. This work has also investigated the influence of the kernel size on the performance of the MCC in the BP algorithm, since it is a free parameter of correntropy. The addition of this new parameter to the learning procedure of the FWNN can be considered a disadvantage of the proposed architecture, mainly because the MCC depends strongly on its proper adjustment. Within this context, the adopted adaptive kernel has been shown to be more efficient than keeping this parameter fixed during the whole FWNN learning process. The adaptive kernel size method has improved the convergence rate of the backpropagation algorithm and contributed to attenuating the effects of outliers and noise. Due to the use of the BP algorithm, the proposed architecture is susceptible to falling into local minima, which limits the ability of correntropy to reject outliers. Further research will focus on the following items: (1) analyzing the application of the MCC with different algorithms for training the FWNN architecture in order to avoid the harmful effects of outliers; metaheuristic algorithms such as the Genetic Algorithm, Particle Swarm Optimization, and the Bat Algorithm are good options, since they are less sensitive to local minima than the BP algorithm; (2) including and comparing different adaptive kernel methods to improve the functionality of the MCC; (3) applying the proposed architecture to identify reliable dynamical models to be used in advanced control strategies, such as predictive controllers; (4) evaluating the feasibility of applying the FWNN-MCC as an inferential system to estimate chemical compositions, calibrate sensors [62], and diagnose faults, among others.
Figure 4: Three-section liquid tank with distinct cross-sectional areas. (a) Schematic diagram of the multisection liquid tank. (b) Multisection liquid tank.

Figure 5: Validation of the FWNN-MCC with 1% outlier rate in experimental data.

Figure 6: Validation of the FWNN-MCC with 3% outlier rate in experimental data.

Figure 7: Evolution of MSE for different kernel sizes.
