A Joint Optimization of Momentum Item and Levenberg-Marquardt Algorithm to Level Up the BPNN’s Generalization Ability

Back propagation neural network (BPNN), a kind of artificial neural network, is widely used in pattern recognition and trend prediction. The standard BPNN has several drawbacks, such as becoming trapped in local optima, oscillation, and long training time, because its training is based on the gradient descent method with a fixed learning rate. The momentum item and the Levenberg-Marquardt (LM) algorithm are two ways to adjust the weights among the neurons and improve the BPNN’s performance. However, there is still much room to improve both algorithms. This paper proposes a hybrid optimization of the damping factor of LM and a dynamic momentum item. The improved BPNN is validated on the Fisher Iris data and the wine data. It is then used to predict the visit spend in the database provided by Dunnhumby’s Shopper Challenge. Compared with the other two improved BPNNs, the proposed method achieves better performance. Therefore, the proposed method can be used for pattern recognition and time series prediction more effectively.


Introduction
An artificial neural network is a computing model similar to a biological neural network. It is widely used in simulation, trend prediction, pattern recognition, and control systems. It can learn by itself without knowing the function and relationship among the training datasets. It is widely applied in practice and studied in academia. For example, Jing and Cheng [1] developed a new optimal PID learning algorithm for training feedforward neural networks (FNN) for any purpose (system identification, function approximation, pattern recognition, control, etc.), and compared its effect with several types of neural networks, such as BP, scaled conjugate BP, and LM-BP. Although the standard BPNN is one of the typical neural networks, it still has many unavoidable shortcomings. Because the learning rate is fixed and the weights in the standard BPNN are adjusted by the gradient descent method according to error back propagation, the BPNN easily suffers from trapping in local optima, oscillation, and long training time. These issues make the standard BPNN unable to meet the demands of pattern recognition or data regression tasks that need fast processing (e.g., the real-time condition monitoring of complex mechanical equipment and some control systems). In recent decades, many improvements to the standard BPNN have been developed to speed up the training process and obtain more accurate results. Some are variants of neural networks. Among the published works, the improvements address the three main drawbacks: (1) slow learning speed, (2) oscillation, and (3) convergence to local optima.
(1) Alternative Learning Rate. In the standard BPNN, the learning rate is fixed throughout the whole iteration process. The magnitude of the learning rate decides the step size of the weight update, which is strongly related to the learning time. Roy [2] introduced the near optimal learning rate into the adjustment of the learning rate. Song et al. [3] adjusted the learning rate value according to the change of system error between two consecutive steps. A self-adaptive learning rate based on the changes of the weight and bias adjustments was reported in [4].

Mathematical Problems in Engineering
Reference [5] proposed a new dynamic optimal learning rate which is related to the previous approach after the first iteration. Hasan et al. [6] developed the Hanning window neural network and used the Hanning window function to make the learning rate dynamic. Linear matrix inequality techniques were used to find appropriate learning rates that guarantee fast and robust convergence [7]. These methods perform well in terms of learning speed.
(2) Momentum Item Methods. The momentum method is another way to modify the adjustment of the weights. It reflects the previous information of the weight adjustment and can accelerate the training speed while weakening the oscillation. A new way to accelerate the convergence by adjusting the learning rate and momentum factor in the same iteration was reported by Yu and Liu [8]. Wang et al. [9] proposed a restart strategy for the momentum in order to make cyclic and almost-cyclic learning converge in a neural network with a single hidden layer. Another way to adjust the momentum coefficient according to the error function and the weights in the network was given by Wu et al. [10].
(3) Second Order Methods. The gradient descent method only uses the first-order derivative of the error function; many works have shown that second-order derivative methods are more efficient than gradient descent in increasing the convergence speed and obtaining more accurate results. These second order methods include quasi-Newton [11], conjugate gradient [12], and the LM algorithm [13,14]. In particular, the LM algorithm balances the Gauss-Newton technique and the steepest-descent algorithm while avoiding many of their limitations [13]. The balance is based on the concept of a damping factor. The parameter in the LM algorithm is adjusted according to the iteration effectiveness [15-17], while most works just regard the LM algorithm as the training method without any improvement [18].
Although many improvements to the BPNN have been proposed, it is still deficient. Especially for some engineering demands, real-time condition monitoring and trend prediction require less training time and more accurate results from the BPNN so that timely operations can be implemented. Therefore, in this paper, a hybrid optimization of the dynamic momentum item and the damping LM algorithm is proposed to accelerate the convergence speed and obtain more accurate results. The momentum item is added to weaken the oscillation, and the LM algorithm is involved to speed up the iteration. Unlike the standard BPNN, the adjustment of the weights in the proposed method utilizes the second order derivative information and the previous iteration. The weights in the next iteration are decided by three parts: the current weight, the weight adjustment, and the hybrid optimized previous weight adjustment. The influence of the previous weights is determined by the momentum coefficient and the damping factor in the LM algorithm. The momentum coefficient and damping factor vary in each iteration. The adjustment of the momentum coefficient is decided by the momentum parameter in the last iteration. It increases if the error of the BPNN declines. When it reaches the maximum, it returns to a certain value. Conversely, the damping factor in LM declines if the iteration performs better than before. Overall, the parameters vary according to their former values and the iteration efficiency.
The rest of the paper is structured as follows. Section 2 presents the novel momentum and LM algorithm to train the BPNN with one hidden layer. Section 3 validates the proposed training method on the Fisher Iris data and wine data, comparing it in pattern recognition with the BPNN trained by the LM algorithm proposed by El-Alfy [17] and the BPNN trained by the improved LM algorithm provided by Nørgaard [19]. In Section 4, Dunnhumby's Shopper Challenge dataset is used to demonstrate the improvements in prediction. Section 5 outlines the conclusions and presents the future work.

Modified Training Algorithm for BPNN
2.1. Basic BPNN. BPNN is a feedforward neural network whose main principle is error back propagation. The adjustment of the weights is based on gradient descent, which requires that the activation function has a first order derivative. The iteration terminates when the predetermined error goal is met or the maximum number of iteration steps is reached. Because training relies on desired outputs, it is a supervised network. Its learning rule with one hidden layer is shown in Figure 1.
For convenience of description, suppose that there is one input layer, one hidden layer, and one output layer, and that $x_i$ and $d_i$ are the input vectors and the desired output vectors, respectively, where $i = 1, 2, \ldots, N$. In the standard BPNN, the adjustments of the weights are based on the derivative of the error function, so the error function is vital. Fontenla-Romero et al. [20] considered the influence of the slope of the nonlinear activation functions and proposed a new way to measure the error. Nguyen et al. [18] proposed a new cost function considering the system error and the cluster weight which represents an approximation to the probability mass. In this paper, the error function is defined as follows:

$$E(w) = \frac{1}{2} \sum_{i=1}^{N} \lVert d_i - y_i \rVert^{2}, \quad (1)$$

where $y_i$ is the real network output and $w$ is the weight vector.
For the standard BPNN, gradient descent is implemented to train the network, and the weights are updated from the last iteration and the changes:

$$w(k+1) = w(k) + \Delta w(k). \quad (2)$$
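The training loop described above can be illustrated with a minimal sketch. It assumes a one-weight linear model `y = w * x`, a sum-of-squares error, and hypothetical names and default values (`train_gradient_descent`, `eta`, `error_goal`, `max_steps`) that are not taken from the paper. It shows a fixed learning rate and the two terminal conditions: meeting the error goal or reaching the step limit.

```python
def train_gradient_descent(samples, w=0.0, eta=0.05,
                           error_goal=1e-10, max_steps=10_000):
    """Fit d = w * x by plain gradient descent with a fixed learning rate.

    Returns (w, error, steps_used). All names and defaults are illustrative.
    """
    def sum_squared_error(weight):
        # E(w) = 1/2 * sum_i (d_i - y_i)^2 with y_i = weight * x_i
        return 0.5 * sum((d - weight * x) ** 2 for x, d in samples)

    for step in range(max_steps):
        error = sum_squared_error(w)
        if error <= error_goal:
            return w, error, step          # stop: error goal met
        grad = -sum((d - w * x) * x for x, d in samples)  # dE/dw
        delta_w = -eta * grad              # step against the gradient
        w = w + delta_w                    # w(k+1) = w(k) + delta_w(k)
    return w, sum_squared_error(w), max_steps  # stop: step limit reached


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # generated by d = 2 * x
w, error, steps = train_gradient_descent(samples)
print(w, error, steps)
```

Because `eta` never changes, the step size is governed entirely by the gradient, which is the limitation the momentum item and the LM algorithm are meant to address.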

The Classical Momentum Item and LM Algorithm.
Among the published improvements to the standard BPNN, the momentum item and the LM algorithm are two common and effective ways to improve the BPNN's performance. They are used to weaken the oscillation and speed up convergence. The main characteristics of the two algorithms are as follows. More details can be found in [21,22], respectively.
(1) Gradient descent with momentum item: the weight's change is related to the previous weight update:

$$\Delta w(k) = \eta \delta V + \alpha \Delta w(k-1), \quad (3)$$

where $\eta$ is the learning rate and $\delta$ is the gradient derived from the standard BPNN derivation process. When the weight update of the input-hidden layer is calculated, $V$ is the input of the samples, while, when the weight update of the hidden-output layer is calculated, $V$ is the output of the hidden layer. $\alpha$ is the momentum coefficient, $0 < \alpha < 1$.
(2) Levenberg-Marquardt: the weight update rule is as follows:

$$\Delta w = -\left(J^{T} J + \mu I\right)^{-1} J^{T} e(w), \quad (4)$$

where the term $e(w)$ denotes the error vector of the neural network, $I$ is the identity matrix, and $\mu$ is a damping factor which affects the convergence performance. It also balances the steepest-descent method and the Gauss-Newton method: if $\mu$ is large, expression (4) approximates the steepest-descent method; otherwise, when $\mu$ is small, it approximates the Gauss-Newton method. $J$ is the Jacobian matrix of the error vector with respect to the weights, which is defined as follows:

$$J = \left[ \frac{\partial e_i(w)}{\partial w_j} \right]. \quad (5)$$
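The two update rules can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: the model is a two-parameter line `y = a*x + b` so that the 2x2 system can be solved in closed form, the momentum rule is written with the gradient of the error (hence the minus sign on the descent term), and the function names `momentum_delta` and `lm_delta` are hypothetical.

```python
def momentum_delta(grad, prev_delta, eta, alpha):
    """Weight change: -eta * gradient + alpha * previous weight change."""
    return [-eta * g + alpha * p for g, p in zip(grad, prev_delta)]


def lm_delta(w, samples, mu):
    """LM weight change -(J^T J + mu*I)^-1 J^T e for the model y = a*x + b."""
    a, b = w
    e = [d - (a * x + b) for x, d in samples]   # error vector e(w)
    J = [(-x, -1.0) for x, _ in samples]        # rows: (de_i/da, de_i/db)
    # Entries of the symmetric 2x2 matrix A = J^T J + mu * I
    a00 = sum(r[0] * r[0] for r in J) + mu
    a01 = sum(r[0] * r[1] for r in J)
    a11 = sum(r[1] * r[1] for r in J) + mu
    # g = J^T e
    g0 = sum(r[0] * ei for r, ei in zip(J, e))
    g1 = sum(r[1] * ei for r, ei in zip(J, e))
    det = a00 * a11 - a01 * a01
    # Solve A * delta = -g with the closed-form 2x2 inverse
    return [-(a11 * g0 - a01 * g1) / det, -(a00 * g1 - a01 * g0) / det]


samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]      # generated by d = 2*x + 1
gauss_newton = lm_delta([0.0, 0.0], samples, mu=0.0)  # mu = 0: Gauss-Newton step
damped = lm_delta([0.0, 0.0], samples, mu=1e6)        # large mu: tiny step along -J^T e
print(gauss_newton, damped)
```

For this linear model, the step with `mu = 0` lands on the least-squares solution in one iteration, while the heavily damped step is very short and points along the steepest-descent direction, which matches the two limiting cases described for the damping factor.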

The Hybrid Optimization in BPNN.
In the standard algorithm, the momentum coefficient $\alpha$ is static. It does not change during the whole iteration process, which limits the impact of the momentum.
Traditionally, the damping parameter $\mu$ becomes larger or smaller according to the performance. For example, it is decreased or increased by a factor of 10 depending on whether the performance has improved or not, respectively [17]. However, the merits of the momentum item and the LM algorithm have not been sufficiently exploited. So, in this paper, the two algorithms are optimized simultaneously and the weight equation for the proposed BPNN is as follows:
(1) The Adjustment of $\alpha$. The proposed method treats the momentum coefficient as dynamic. $\alpha$ is updated according to the change in error and the former iteration. If the error decreases in this iteration, the previous weight update is beneficial to convergence and the search direction is correct. Therefore, $\alpha$ should be bigger to encourage searching in this direction next time. Otherwise, the momentum coefficient should decline. The weight update can be formulated as follows:
However, the momentum coefficient should not increase indefinitely. When it is too big, it can disturb the network; therefore, it should be reset to a decimal in the interval (0, 1). The restriction rule of $\alpha$ is described as follows:
(2) The Adjustment of $\mu$. $\mu$ could be regarded as the learning rate in the standard momentum method, while it is the damping parameter of the LM algorithm. The principle of its adjustment follows the iteration efficiency: the damping factor declines if the iteration performs better than before.
Through (4)-(9), the weight update is performed by the hybrid optimization of the LM algorithm and the momentum method.
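The qualitative adjustment rules can be sketched as follows. The sketch only encodes what the text states: the momentum coefficient grows when the error falls, shrinks otherwise, and is reset once it becomes too large, while the damping factor falls when the iteration improves. The growth and shrink factors, the upper limit, and the reset value are hypothetical placeholders, and the factor of 10 for the damping factor is the traditional rule cited from [17], used here only as a stand-in for the proposed rule.

```python
def adjust_alpha(alpha, error_decreased,
                 grow=1.1, shrink=0.5, alpha_max=0.9, alpha_reset=0.01):
    """Dynamic momentum coefficient (all numeric constants are assumptions)."""
    if error_decreased:
        alpha *= grow        # search direction looks right: trust it more
    else:
        alpha *= shrink      # error went up: rely less on the previous update
    if alpha >= alpha_max:
        alpha = alpha_reset  # restriction: reset to a small value in (0, 1)
    return alpha


def adjust_mu(mu, error_decreased, factor=10.0):
    """Damping factor: decrease on improvement, increase otherwise."""
    return mu / factor if error_decreased else mu * factor


alpha, mu = 0.01, 1.0        # initial values reported for the experiments
for improved in [True, True, False, True]:
    alpha = adjust_alpha(alpha, improved)
    mu = adjust_mu(mu, improved)
    print(improved, round(alpha, 6), mu)
```

Keeping the two rules in separate functions makes it easy to swap in a different schedule for either parameter without touching the weight update itself.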

Validation on Pattern Recognition
In this paper, we suppose there is just one hidden layer. The hidden neurons use the hyperbolic tangent activation function, and the output neurons use a linear activation function.
The number of neurons in the hidden layer is important for convergence speed. Too many or too few neurons make the network need a long time for training. Traditionally, it is set by personal experience. In this paper, the number of neurons in the hidden layer depends on the following empirical formula:

$$\mathrm{hn} = \sqrt{\mathrm{in} + \mathrm{on}} + a, \quad (10)$$

where $\mathrm{in}$ and $\mathrm{on}$ are the numbers of neurons in the input layer and output layer, respectively, and $a$ is a constant in the interval (1, 10). $\mathrm{in}$ and $\mathrm{on}$ are determined by the dimensionality of the input vector and the output vector. For example, if the input and output datasets have 4 attributes and 1 attribute, respectively, the numbers of neurons in the input layer and output layer are 4 and 1.
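As a worked example of the empirical formula, the short sketch below evaluates it for the two benchmark datasets. Rounding the result to the nearest integer is an assumption, since the text does not say how a fractional value is handled, and the function name `hidden_neurons` and the chosen constant are illustrative.

```python
import math


def hidden_neurons(n_in, n_out, a):
    """hn = sqrt(in + on) + a, rounded to a whole neuron count (assumption)."""
    return round(math.sqrt(n_in + n_out) + a)


# Fisher Iris: 4 input attributes, 1 output; wine: 13 input attributes, 1 output
print(hidden_neurons(4, 1, a=3))    # sqrt(5)  + 3 = 5.236...
print(hidden_neurons(13, 1, a=3))   # sqrt(14) + 3 = 6.741...
```

The constant dominates the square-root term for small networks, so the choice of the constant within its interval largely determines the hidden layer size.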
To validate the proposed method on pattern recognition, the Fisher Iris data and wine data provided by UCI are used. Both datasets are normalized with the "mapminmax" function provided by the Matlab toolbox. All the parameters in the three compared methods are the same. The maximum number of iteration steps is 50000000, the training error goal is 1e-10, and the initial $\mu$ is 1. In our improved program, the initial momentum coefficient $\alpha$ is 0.01. $\alpha$ and $\mu$ are adjusted according to (7)-(9).
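For readers without Matlab, the normalization step can be approximated by a linear min-max mapping. The sketch below assumes the target range [-1, 1], which is the documented default of "mapminmax", and applies it to one attribute at a time; the function name `map_min_max` and the sample values are illustrative, and leaving a constant attribute unchanged is a choice made for this sketch.

```python
def map_min_max(values, y_min=-1.0, y_max=1.0):
    """Linearly map a list of numbers onto [y_min, y_max]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        return list(values)  # constant attribute: nothing to rescale
    scale = (y_max - y_min) / (x_max - x_min)
    return [scale * (x - x_min) + y_min for x in values]


sepal_length = [4.3, 5.8, 7.9]          # illustrative values in centimeters
print(map_min_max(sepal_length))
```

Each attribute is scaled independently, so attributes with large raw ranges do not dominate the weight updates.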

The Fisher Iris Data Example.
The Fisher Iris data is one of the most famous databases in pattern recognition. The dataset includes 3 classes of 50 instances each. These classes refer to "setosa," "versicolor," and "virginica," which are labeled as "1," "2," and "3," respectively. The attributes are sepal length, sepal width, petal length, and petal width, all measured in centimeters. The whole dataset is divided into 2 parts: one for training and the other for testing. The data classification and the corresponding classification map in the first three attributes are shown in Figures 2 and 3.
The output for the test data should be processed so that it can be compared with the real value. Because the network output is computed numerically, it may be nonintegral. So the "round" function provided by the Matlab toolbox is used to process the network output. For brevity, ELM-BPNN denotes the BPNN trained by El-Alfy's LM algorithm, and NLM-BPNN denotes the LM-BPNN improved by Nørgaard. The compared results are shown in Table 1.
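The post-processing step can be sketched as follows. Matlab's "round" rounds halves away from zero, whereas Python's built-in `round` rounds halves to the nearest even integer, so the sketch uses an explicit helper. The helper names and the sample outputs are illustrative, not values from the experiments.

```python
import math


def round_half_away(x):
    """Round to the nearest integer, with halves going away from zero."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def accuracy(outputs, labels):
    """Fraction of rounded network outputs that match the class labels."""
    predicted = [round_half_away(o) for o in outputs]
    correct = sum(p == t for p, t in zip(predicted, labels))
    return correct / len(labels)


outputs = [1.02, 1.9, 2.5, 3.2, 2.4]   # raw, nonintegral network outputs
labels = [1, 2, 3, 3, 3]               # real class labels
print(accuracy(outputs, labels))        # 0.8
```

In this example the last output, 2.4, rounds to class 2 instead of class 3, so four of the five samples are counted as correct.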
The classification results compared with the real labels in 2D and mapping in the first three attributes are shown in Figures 4 and 5.
In these 3D figures, the pink points mark the misclassified samples. The above figures and table show that the proposed method has a higher correct rate. The errors during the iteration process are shown in Figure 6. The statistics of iteration steps, iteration time, and accuracy are listed in Table 2.

Wine Data Example.
The wine data is provided by UCI. The data are the results of a chemical analysis of wines grown in the same region in Italy but derived from 3 different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. These attributes are (1) alcohol, (2) malic acid, (3) ash, (4) alkalinity of ash, (5) magnesium, (6) total phenols, (7) flavanoids, (8) nonflavonoid phenols, (9) proanthocyanins, (10) color intensity, (11) hue, (12) OD280/OD315 of diluted wines, and (13) proline. Class 1 has 59 instances, Class 2 has 71, and Class 3 has 48. Among these instances, 19 are used as the test set, and the rest form the training set. The classification and the corresponding map in the first three attributes are shown in Figures 7 and 8.
Similarly, the outputs of the compared networks are processed by the "round" function, and the results are listed in Table 3.
The classification errors in 2D and mapping in the first three attributes are shown in Figures 9 and 10.
The errors of iteration process are shown in Figure 11.
The statistics of iteration steps, iteration time, and accuracy are listed in Table 4.
By comparison, the proposed LM-BPNN gets more accurate results than ELM-BPNN and NLM-BPNN. In terms of training speed, the proposed LM-BPNN is faster than ELM-BPNN.

Validation on Prediction
For prediction, Dunnhumby's Shopper Challenge database is used to compare the performances of the three methods. The dataset consists of details of every visit made by 100,000 customers over a year from April 2010 to March 31, 2011. Each visit is stamped with the date and the customer's spend in that visit. The challenge is to predict the visit date and visit spend of the next visit for each customer ID. In this section, however, we just predict the visit spend from April to December 2010.
In order to be used for training the network, the data should be limited to the interval (0, 1). In consideration of the external character of prediction, all the data are divided by a sufficiently large constant, 1e+6. Besides, two performance indexes are given to assess the trained networks: $\mathrm{error}_i = y_i - d_i$, where $y_i$ is the predicted value and $d_i$ is the real value.
Expressions (11) and (12) evaluate the deviation from the real value, where abs denotes the absolute value. The basic information of the test dataset and the performances of the three networks are given in Table 5.
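A minimal sketch of the scaling and of two deviation measures is given below. It rests on assumptions: the scaling constant is read as 1e6, the per-visit error is taken as predicted minus real spend, and the second index is taken to be a mean absolute error because the text mentions abs; the exact form of expressions (11) and (12) may differ. All function names and sample values are illustrative.

```python
SCALE = 1e6  # assumed value of the large constant that brings spends into (0, 1)


def scale_spend(spend):
    return spend / SCALE


def unscale_spend(value):
    return value * SCALE


def prediction_errors(predicted, real):
    """Per-visit deviation of the prediction from the real spend."""
    return [p - r for p, r in zip(predicted, real)]


def mean_abs_error(predicted, real):
    """Average absolute deviation over all visits."""
    errors = prediction_errors(predicted, real)
    return sum(abs(e) for e in errors) / len(errors)


real = [38.0, 30.0, 52.5]          # illustrative visit spends
predicted = [40.0, 25.0, 52.5]     # illustrative network predictions
print(prediction_errors(predicted, real))   # [2.0, -5.0, 0.0]
print(mean_abs_error(predicted, real))
```

The signed errors show whether a network tends to overestimate or underestimate, while the absolute measure keeps positive and negative deviations from cancelling out.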
Comparing the results in Tables 5 and 6 shows that the proposed LM-BPNN has a better prediction performance than ELM-BPNN. The two performance indexes illustrate that the proposed method is more stable.
The results of pattern recognition (Section 3) and prediction (Section 4) show that the proposed BPNN performs better than the other two methods. It finishes the calculation in less time and fewer iteration steps with higher accuracy, because it makes full use of the error information and the efficiency of the previous iteration. The momentum item overcomes the oscillation and steers the iteration in a direction of smaller error; simultaneously, the LM algorithm is a good balance of the steepest-descent method and the Gauss-Newton method. Therefore, the proposed method not only weakens the oscillation but also converges to the optimum in fewer iteration steps.

Conclusions
In this paper, a joint optimization of the momentum item and Levenberg-Marquardt is proposed to train the BPNN. Its performance is compared with ELM-BPNN and NLM-BPNN in pattern recognition and prediction. The validation data are provided by UCI and a public challenge. The results show that the proposed LM-BPNN has a better performance. Although the proposed method performs better, there is still much work to do. For example, the parameter adjustment strategy should be more adaptive. We also found that the initial parameter value could influence the network's convergence speed, so how to select an appropriate initial value is worth researching.

Figure 1: BPNN construction and its supervised learning rule with one hidden layer.

Figure 2: The classification of Fisher Iris data.

Figure 3: Classification map in the first three attributes of Fisher Iris data.

Figure 4: Classification results compared with the real labels in 2D of Fisher Iris data.

Figure 5: Classification errors mapping in the first three attributes of Fisher Iris data.

Figure 6: The errors in the iteration process of Fisher Iris data.

Figure 7: The total figure classification of wine data.

Figure 8: Classification map in the first three attributes of wine data.

Figure 9: Classification errors in 2D of wine data.

Figure 10: Classification errors mapping in the first three attributes of wine data.

Figure 11: The errors of iteration process of wine data.

Table 1: The recognition results of different training methods for Fisher Iris data.

Table 2: Comparison of the three methods on benchmark datasets for Fisher Iris data.

Table 3: The results of different training methods for wine data.

Table 4: Comparison of the three methods on benchmark datasets for wine data.

Table 5: The basic information of test dataset and performances of the three networks.

Table 6: The mean statistic results of the performance for the three compared networks.