Levenberg-Marquardt Algorithm for Mackey-Glass Chaotic Time Series Prediction

For decades, Mackey-Glass chaotic time series prediction has attracted more and more attention. When the multilayer perceptron is used to predict the Mackey-Glass chaotic time series, what we should do is to minimize the loss function. As is well known, the convergence speed of the loss function is rapid in the beginning of the learning process, while the convergence speed is very slow when the parameter is near to the minimum point. In order to overcome these problems, we introduce the Levenberg-Marquardt algorithm (LMA). Firstly, a rough introduction is given to the multilayer perceptron, including the structure and the model approximation method. Secondly, we introduce the LMA and discuss how to implement the LMA. Lastly, an illustrative example is carried out to show the prediction efficiency of the LMA. Simulations show that the LMA can give more accurate prediction than the gradient descent method.


Introduction
The Mackey-Glass chaotic time series is generated by the following nonlinear time delay differential equation: where , , , and  are real numbers.Depending on the values of the parameters, this equation displays a range of periodic and chaotic dynamics.Such a series has some short-range time coherence, but long-term prediction is very difficult.
Originally, Mackey and Glass proposed the following equation to illustrate the appearance of complex dynamics in physiological control systems by way of bifurcations in the dynamics: They suggested that many physiological disorders, called dynamical diseases, were characterized by changes in qualitative features of dynamics.The qualitative changes of physiological dynamics corresponded mathematically to bifurcations in the dynamics of the system.The bifurcations in the equation dynamics could be induced by changes in the parameters of the system, as might arise from disease or environmental factors, such as drugs or changes in the structure of the system [1,2].The Mackey-Glass equation has also had an impact on more rigorous mathematical studies of delay-differential equations.Methods for analysis of some of the properties of delay differential equations, such as the existence of solutions and stability of equilibria and periodic solutions, had already been developed [3].However, the existence of chaotic dynamics in delay-differential equations was unknown.Subsequent studies of delay differential equations with monotonic feedback have provided significant insight into the conditions needed for oscillation and properties of oscillations [4][5][6].For delay differential equations with nonmonotonic feedback, mathematical analysis has proven much more difficult.However, rigorous proofs for chaotic dynamics have been obtained for the differential delay equation / = (( − 1)) for special classes of the feedback function  [7].Further, although a proof of chaotic dynamics in the Mackey-Glass equation has still not been found, advances in understanding the properties of delay differential equations is going on, such as (2), that contain both exponential decay and nonmonotonic delayed feedback [8].The study of this equation remains a topic of vigorous research.
The Mackey-Glass chaotic time series prediction is a very difficult task.The aim is to predict the future state ( + Δ) using the current and the past time series (), ( − 1), . . ., ( − ) (Figure 2).Until now, there are many literatures about the Mackey-Glass chaotic time series prediction [9][10][11][12][13][14].However, as far as the prediction accuracy is concerned, most of the results in the literature are not ideal.
In this paper, we will predict the Mackey-Glass chaotic time series by the MLP.While minimizing the loss function, we introduce the LMA, which can adjust the convergence speed and obtain good convergence efficiency.
The rest of the paper is organized as follows.In Section 2, we describe the multilayer perceptron.Section 3 introduces the LMA and discusses how to implement the LMA.In Section 4, we give a numerical example to demonstrate the prediction efficiency.Section 5 is the conclusions and discussions of the paper.

Multilayer Perceptrons.
A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs.A MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one.Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function.The multilayer perceptron with only one hidden layer is depicted as in Figure 1 [15].
The output of the multilayer perceptron described in Figure 1 is and the outputs of the hidden units are respectively, where (⋅) is the activation function.We will adopt the sigmoid function () as the activation function; for example, and the derivative of the activation function with respect to  is or MLP provides a universal method for function approximation and classification [16,17].In the case of the function approximation, we have a number of observed data (x 1 ,  1 ), (x 2 ,  2 ), . ..,(x  ,   ), which are supposed to be generated by where  is noise, usually subject to Gaussian distribution with zero mean, and  0 (x) is the unknown true generating function.
Given a set of observed data, sometimes called training examples, we search for the parameters  = ( 1 , . . .,   ,  11 , . . .,  1 ,  12 , . . .,  2 , . . .,  1 , . . ., to approximate the teacher function  0 (x) best, where  denotes the matrix transposition.A satisfactory model is often obtained by minimizing the mean square error.One of the serious problems in minimizing the mean square error is that the convergence speed of the loss function is rapid in the beginning of the learning process, while the convergence speed is very slow in the region of the minimum [18].In order to overcome these problems, we will introduce the Levenberg-Marquardt algorithm (LMA) in the next section.

The Levenberg-Marquardt Algorithm.
In mathematics and computing, the Levenberg-Marquardt algorithm (LMA) [18][19][20], also known as the damped least-squares (DLS) method, is used to solve nonlinear least squares problems.These minimization problems arise especially in least squares curve fitting.
The LMA is interpolates between the Gauss-Newton algorithm (GNA) and the gradient descent algorithm (GDA).As far as the robustness is concerned, the LMA performs better than the GNA, which means that in many cases it finds a solution even if it starts very far away from the minimum.However, for well-behaved functions and reasonable starting parameters, the LMA tends to be a bit slower than the GNA.
In many real applications for solving model fitting problems, we often adopt the LMA.However, like many other fitting algorithms, the LMA finds only a local minimum, which is always not the global minimum.
The least squares curve fitting problem is described as follows.Instead of the unknown true model, a set of  pairs of independent variables (x 1 ,  1 ), (x 2 ,  2 ), . . ., (x  ,   ) are given.Suppose that (, ) is the approximation model and () is a loss function, which is the sum of the squares of the deviations: The task of curve fitting problem is minimizing the above loss function () [21].
The LMA is an iterative algorithm and the parameter  is adjusted in each iteration step.Generally speaking, we choose an initial parameter randomly, for example,   ∼ (−1, 1),  = 1, 2, . . ., , where  is the dimension of parameter .
The Taylor expansion of the function (x  ,  + Δ) is As we know, at the minimum  * of loss function (), the gradient of () with respect to  will be zero.Substituting (11) into (10), we can obtain where   = (x  , )/, or Taking the derivative with respect to Δ and setting the result to zero give where J is the Jacobian matrix whose th row equals   and also y and f are vectors with th component (  , ) and   , respectively.This is a set of linear equations which can be solved for Δ.
Levenberg's contribution is to replace this equation by a "damped version, " where  is the identity matrix, giving the increment Δ to the estimated parameter vector .
The damping factor  is adjusted at each iteration step.If the loss function () reduces rapidly,  will adopt a small value, and then the LMA is similar to the Gauss-Newton algorithm.While the loss function () reduces very slowly,  can be increased, giving a step closer to the gradient descent direction, and Therefore, for large values of , the step will be taken approximately in the direction of the gradient.
In the process of iteration, if either the length of the calculated step Δ or the reduction of () from the latest parameter vector  + Δ falls below the predefined limits, iteration process stops and then we take the last parameter vector  as the final solution.
Levenberg's algorithm has the disadvantage that if the value of damping factor  is large, the inverse of J  J +  does not work at all.Marquardt provided the insight that we can scale each component of the gradient according to the curvature so that there is larger movement along the directions where the gradient is smaller.This avoids slow convergence in the direction of small gradient.Therefore, Marquardt replaced the identity matrix  with the diagonal matrix consisting of the diagonal elements of J  J, resulting in the Levenberg-Marquardt algorithm [22]: and the LMA is as follows: where

Numerical Simulations
Example 1.We will conduct an experiment to show the efficiency of the Levenberg-Marquardt algorithm.We choose a chaotic time series created by the Mackey-Glass delaydifference equation: for  = 17.Such a series has some short-range time coherence, but long-term prediction is very difficult.The need to predict such a time series arises in detecting arrhythmias in heartbeats.
The network is given no information about the generator of the time series and is asked to predict the future of the time series from a few samples of the history of the time series.In our example, we trained the network to predict the value at time  + Δ, from inputs at time ,  − 6,  − 12, and  − 18, and we will adopt Δ = 50 here.
In the simulation, 3000 training examples and 500 test examples are generated by (22).We use the following multilayer perceptron for fitting the generated training examples: The initial values of the parameters are selected randomly: The learning curves of the error function and the fitting result of LMA and GDA are shown in Figures 3, 4, 5, and 6, respectively.
The learning curves of LMA and GNA are shown in Figures 3 and 4, respectively.The training error ∑ 3000 =1 (  − (x  , )) 2 of LMA can reach 0.1, while the final training error of GDA is more than 90.Furthermore, the final mean test error (1/500) ∑ 5500 =5001 (  −(x  , )) 2 = 0.0118 of LMA is much smaller than 0.2296, which is the final test error of GDA.As far as the fitting effect is concerned, the performance of LMA is much better than that of the GDA.This is very obvious from Figures 5 and 6.
All of these suggest that when we predict the Mackey-Glass chaotic time series, the performance of LMA is very good.It can effectively overcome the difficulties which may arise in the GDA.

Conclusions and Discussions
In this paper, we discussed the application of the Levenberg-Marquardt algorithm for the Mackey-Glass chaotic time series prediction.We used the multilayer perceptron with 20 hidden units to approximate and predict the Mackey-Glass chaotic time series.In the process of minimizing the error function, we adopted the Levenberg-Marquardt algorithm.If reduction of () is rapid, a smaller value damping factor  can be used, bringing the algorithm closer to the Gauss-Newton algorithm, whereas if an iteration gives insufficient   reduction in the residual,  can be increased, giving a step closer to the gradient descent direction.In this paper, the learning mode is batch.At last, we demonstrate the performance of the LMA.Simulations show that the LMA can achieve much better prediction efficiency than the gradient descent method.

Figure 3 :
Figure 3: The GDA learning curve of the error function.

Figure 4 :
Figure 4: The LMA learning curve of the error function.