Weight Optimization in Recurrent Neural Networks with Hybrid Metaheuristic Cuckoo Search Techniques for Data Classification

Recurrent neural networks (RNNs) have been widely used as a tool for data classification. Such a network can be trained with gradient descent back propagation. However, traditional training algorithms have drawbacks such as slow convergence and no guarantee of finding the global minimum of the error function, since gradient descent may get stuck in local minima. As a solution, nature-inspired metaheuristic algorithms provide derivative-free solutions to complex optimization problems. This paper proposes the use of a metaheuristic search algorithm called Cuckoo Search (CS), based on the behavior of cuckoo birds, to train the Elman recurrent network (ERN) and the back propagation Elman recurrent network (BPERN) in order to achieve a fast convergence rate and to avoid the local minima problem. The proposed CSERN and CSBPERN algorithms are compared with artificial bee colony using the BP algorithm and with other hybrid variants on selected benchmark classification problems. The simulation results show that the computational efficiency of the ERN and BPERN training process is highly enhanced when coupled with the proposed hybrid method.


Introduction
The artificial neural network (ANN) is a well-known technique that can classify nonlinear problems and is experimental in nature [1]. It can give an almost accurate solution for clearly or inaccurately formulated problems and for phenomena that are only understood through experiment.
An alternative neural network approach is to use recurrent neural networks (RNNs), which have internal feedback loops that allow them to store a memory of past inputs and to learn from past history [2][3][4][5]. The RNN model has been implemented in various applications, such as forecasting of financial data [6], electric power demand [7], tracking water quality and minimizing the additives needed for filtering water [7], and data classification [8]. In order to exploit the dynamical processing of recurrent neural networks, researchers have developed a number of schemes by which gradient methods and, in particular, back propagation learning can be extended to recurrent neural networks [8].
Werbos [9] introduced the back propagation through time approach, which approximates the time evolution of a recurrent neural network as a sequence of static networks trained with gradient methods. Simple recurrent networks were first trained by Elman with the standard back propagation (BP) learning algorithm, in which errors are calculated and weights are updated at each time step. BP is not as effective as the back propagation through time (BPTT) learning algorithm, in which the error signal is propagated back through time [10]. However, certain properties of the RNN make many of these algorithms less effective, and it often takes a huge amount of time to train a network of even moderate size. In addition, the complex error surface of the RNN makes many training algorithms more prone to being trapped in local minima [5]. The back propagation (BP) algorithm is the best-known method for training such networks. However, it suffers from two main drawbacks, namely, a low convergence rate and instability, which are caused by the possibility of being trapped in a local minimum and of overshooting the minimum of the error surface [11][12][13][14][15][16][17].
To overcome the weaknesses of the above algorithms, there has been much research on dynamic system modeling with recurrent neural networks. This ability to model dynamic systems makes such networks superior to conventional feed forward neural networks, because the system outputs are a function of both the current inputs and the inner states [18,19]. Ahmad et al. [20] investigated a method using a fully connected recurrent neural network (FCRNN) and the back propagation through time (BPTT) algorithm to distinguish the letters of the Arabic alphabet from "alif" to "ya." The algorithm is also used to improve people's knowledge and understanding of Arabic words. In 2010, Xiao, Venayagamoorthy, and Corzine trained a recurrent neural network with a combination of particle swarm optimization (PSO) and the BP algorithm (PSOBP) to provide optimal weights that avoid the local minima problem and to identify the frequency-dependent impedance of power electronic systems such as rectifiers, inverters, and AC-DC converters [5]. The experimental results showed that the proposed method successfully identified the impedance characteristics of a three-phase inverter system; it not only helps the training process avoid being trapped in local minima but also performs better than both the simple BP and PSO algorithms. Similarly, Zhang and Wu [21] used an adaptive chaotic particle swarm optimization (ACPSO) algorithm for the classification of crops from synthetic aperture radar (SAR) images. During simulations, ACPSO was found to be superior to the back propagation (BP), adaptive BP (ABP), momentum back propagation (MBP), particle swarm optimization (PSO), and resilient back propagation (RPROP) methods [21]. Aziz et al. [22] studied the performance of the particle swarm optimization algorithm in training the Elman RNN, comparing its classification accuracy and convergence rate with those of the Elman recurrent network trained with the BP algorithm. The simulation results show that the proposed Elman recurrent network particle swarm optimization (ERNPSO) algorithm is better than the back propagation Elman recurrent network (BPERN) in terms of classification accuracy. However, in terms of convergence time, the BPERN is much better than the proposed ERNPSO algorithm.
Cheng and Shen [23] proposed an improved Elman RNN to calculate radio propagation loss with the three-dimensional parabolic equation method, in order to decrease calculation time and to improve the approximation performance of the network. Based on the results, the improved Elman network predicts propagation loss efficiently and feasibly compared with the simple Elman RNN. However, the Elman RNN loses significant data that is necessary to train the network for predicting propagation. Wang et al. [24] used the Elman RNN to compute the total nitrogen (TN), total phosphorus (TP), and dissolved oxygen (DO) at three different sites of Taihu during the period of water diversion. The conceptual form of the Elman RNN for the different parameters was obtained by means of principal component analysis (PCA) and validated on a water quality diversion dataset. The values of TN, TP, and DO calculated by the model were closely related to their respective measured values. The simulation results show that PCA can efficiently reduce the input parameters for the Elman RNN and that the model can precisely compute and forecast the water quality parameters during the period of water diversion, but it is not free of the local minima problem.
In [25], an LM algorithm based on Elman and Jordan recurrent neural networks was used to forecast the annual peak load of the Java, Madura, and Bali interconnection for 2009-2011. The study was carried out to check the forecasting accuracy of the proposed LM based recurrent networks over the given period. The simulation results show that the proposed LM based recurrent neural network performs better than the LM based feed forward neural network. After reviewing the above algorithms, it is found that traditional ANN training has the drawback of slow convergence and is not guaranteed to find the global minimum of the error function, since gradient descent may get stuck in local minima.
To overcome these weaknesses and to improve the convergence rate, this work proposes new hybrid metaheuristic search algorithms that make use of Cuckoo Search via Levy flight (CS) by Yang and Deb [26] to train the Elman recurrent network (ERN) and the back propagation Elman recurrent network (BPERN). The main goals of this study are to improve convergence to global minima, decrease the error, and accelerate the learning process using a hybridization method. The proposed algorithms, called CSERN and CSBPERN, imitate animal behaviour and are valuable for global optimization [27,28]. The performance of the proposed CSERN and CSBPERN is verified on selected benchmark classification problems and compared with artificial bee colony using the BPNN algorithm and other similar hybrid variants.
The remainder of the paper is organized as follows. Section 2 describes the proposed method. Results and discussion are presented in Section 3. Finally, the paper is concluded in Section 4.

Proposed Algorithms
In this section, we describe how the proposed algorithms use Cuckoo Search (CS) to train the Elman recurrent network (ERN) and the back propagation Elman recurrent network (BPERN).

CSERN Algorithm.
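In the proposed CSERN algorithm, each best nest represents a possible solution, that is, the weight space and the corresponding biases for ERN optimization. The weight optimization problem and the size of the population determine the quality of the solution. In the first epoch, the weights and biases are initialized with CS and then passed on to the ERN, where the weights are calculated. In the next cycle, CS updates the weights with the best possible solution and continues searching for the best weights until either the last cycle/epoch of the network is reached or the target MSE is achieved.
CS is a population based optimization algorithm; it starts with a random initial population. In the proposed CSERN algorithm, the weight space and the corresponding biases for ERN optimization are calculated by the weight matrices given in (1) and (2) as follows:
$$W_n = U_n = a \cdot \left(\mathrm{rand} - \tfrac{1}{2}\right), \tag{1}$$
$$B_n = a \cdot \left(\mathrm{rand} - \tfrac{1}{2}\right), \tag{2}$$
where $W_n$ is the $n$th weight value in a weight matrix. The rand in (1) is a random number in the range [0, 1], $a$ is a constant parameter of the proposed method that is less than 1, and $B_n$ is a bias value. Hence, the list of weight matrices is given as follows:
$$W^c = \left[W_n^1, W_n^2, W_n^3, \ldots, W_n^{n-1}\right]. \tag{3}$$
Now, from the neural network process, the sum of square errors is easily computed for every weight matrix in $W^c$. For the ERN structure, a three-layer network with one input layer, one hidden or "state" layer, and one "output" layer is used. Each layer has its own index variable, that is, $k$ for output nodes, $j$ and $h$ for hidden nodes, and $i$ for input nodes. In a feed forward network, the input vector $x$ is propagated through a weight layer $V$ and
$$\mathrm{net}_j(t) = \sum_{i}^{n} x_i(t)\, v_{ji} + \theta_j, \tag{4}$$
where $n$ is the number of inputs and $\theta_j$ is a bias. In a simple recurrent network, the input vector is similarly propagated through a weight layer but also combined with the previous state activation through an additional recurrent weight layer, $U$, and
$$y_j(t) = f\left(\mathrm{net}_j(t)\right), \qquad \mathrm{net}_j(t) = \sum_{i}^{n} x_i(t)\, v_{ji} + \sum_{h}^{m} y_h(t-1)\, u_{jh} + \theta_j, \tag{5}$$
where $m$ is the number of "state" nodes and $f$ is the output function of the hidden layer. The output of the network is in both cases determined by the state and a set of output weights $W$ and
$$Y_k(t) = g\left(\mathrm{net}_k(t)\right), \qquad \mathrm{net}_k(t) = \sum_{j}^{m} y_j(t)\, w_{kj} + \theta_k, \tag{6}$$
where $g$ is an output function. Hence, the error can be calculated as follows:
$$E = \left(T_k - Y_k\right). \tag{7}$$
The performance index for the network is given as follows:
$$V_F(x) = \frac{1}{2} \sum_{k=1}^{K} \left(T_k - Y_k\right)^{T} \left(T_k - Y_k\right) = \frac{1}{2} \sum_{k=1}^{K} E^{T} E. \tag{8}$$
In the proposed method the average sum of squares is the performance index, and it is calculated as follows:
$$V_\mu(x) = \frac{\sum_{j=1}^{N} V_F(x)}{P_i}, \tag{9}$$
where $Y_k$ is the output of the network when the $k$th input $\mathrm{net}_i$ is presented and $T_k$ is the corresponding target. The quantity $E = (T_k - Y_k)$ is the error for the output layer, $V_\mu(x)$ is the average performance, $V_F(x)$ is the performance index, and $P_i$ is the number of Cuckoo populations in the $i$th iteration. At the end of each epoch the list of average sums of square errors of the $i$th iteration, SSE, can be calculated as follows:
$$\mathrm{SSE}_i = \left\{V_\mu^1(x), V_\mu^2(x), V_\mu^3(x), \ldots, V_\mu^n(x)\right\}. \tag{10}$$
The Cuckoo Search seeks the minimum sum of square errors, which is found when all the inputs have been processed for each population of the Cuckoo nest. Hence, the Cuckoo Search nest $x_j$ is calculated as follows:
$$x_j = \min\left\{V_\mu^1(x), V_\mu^2(x), V_\mu^3(x), \ldots, V_\mu^n(x)\right\}. \tag{11}$$
The rest of the average sums of squares are considered as the other Cuckoo nests. A new solution $x_i^{t+1}$ for Cuckoo $i$ is generated using a Levy flight according to the following equation:
$$x_i^{t+1} = x_i^{t} + \alpha \oplus \mathrm{levy}(\lambda). \tag{12}$$
Hence, the movement of the other Cuckoo $x_i$ toward $x_j$ can be drawn from (13) as follows:
$$X = \begin{cases} x_i + \mathrm{rand} \cdot \nabla X_i, & \mathrm{rand}_i > p_\alpha, \\ x_i, & \text{otherwise}. \end{cases} \tag{13}$$
The Cuckoo Search can move from $x_i$ toward $x_j$ through Levy flight; it can be written as
$$\nabla X_i = \begin{cases} x_i + \alpha \oplus \mathrm{levy}(\lambda) \sim x_j, & \mathrm{rand}_i > p_\alpha, \\ x_i, & \text{otherwise}, \end{cases} \tag{14}$$
where $\nabla X_i$ is a small movement of $x_i$ toward $x_j$. The weights and bias for each layer are then adjusted as follows:
$$W_n^{s+1} = U_n^{s+1} = W_n^{s} - \nabla X_i, \qquad B_n^{s+1} = B_n^{s} - \nabla X_i. \tag{15}$$
The pseudocode for the CSERN algorithm is given in Pseudocode 1.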
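As an illustration of how a single nest could be scored, the following Python sketch runs a three-layer Elman network forward over a sequence of patterns and returns its mean squared error. It is a minimal sketch, not the authors' Matlab implementation: the flat weight layout in `unpack`, the zero initial state, the log-sigmoid output layer, and all function names are assumptions made for this example.

```python
import numpy as np

def logsig(z):
    """Log-sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def n_weights(n_in, n_hid, n_out):
    """Number of weights and biases in a three-layer Elman network."""
    return n_hid * n_in + n_hid * n_hid + n_hid + n_out * n_hid + n_out

def unpack(nest, n_in, n_hid, n_out):
    """Split a flat nest vector into weights and biases (hypothetical layout)."""
    sizes = [n_hid * n_in, n_hid * n_hid, n_hid, n_out * n_hid, n_out]
    parts = np.split(nest, np.cumsum(sizes)[:-1])
    V = parts[0].reshape(n_hid, n_in)    # input -> hidden weights
    U = parts[1].reshape(n_hid, n_hid)   # previous state -> hidden weights
    b_h = parts[2]                       # hidden biases
    W = parts[3].reshape(n_out, n_hid)   # hidden -> output weights
    b_o = parts[4]                       # output biases
    return V, U, b_h, W, b_o

def elman_mse(nest, X, T, n_hid):
    """Mean over patterns of the summed squared output error for one nest."""
    n_in, n_out = X.shape[1], T.shape[1]
    V, U, b_h, W, b_o = unpack(nest, n_in, n_hid, n_out)
    state = np.zeros(n_hid)              # assumed initial state activation
    total = 0.0
    for x, t in zip(X, T):
        # hidden state combines the current input and the previous state
        state = logsig(V @ x + U @ state + b_h)
        y = logsig(W @ state + b_o)
        total += np.sum((t - y) ** 2)
    return total / len(X)
```

A search procedure can then treat `elman_mse` as the fitness of a nest, for example with a nest of length `n_weights(9, 5, 2)` for a 9-5-2 network.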

CSBPERN Algorithm.
In the proposed CSBPERN algorithm, each best nest represents a possible solution, that is, the weight space and the corresponding biases for BPERN optimization. The weight optimization problem and the size of the population determine the quality of the solution. In the first epoch, the best weights and biases are initialized with CS and then passed on to the BPERN, where the weights are calculated. In the next cycle, CS updates the weights with the best possible solution and continues searching for the best weights until either the last cycle/epoch of the network is reached or the target MSE is achieved. CS is a population based optimization algorithm, and like other metaheuristic algorithms it starts with a random initial population.
In CSBPERN, the weight values of a matrix are calculated with (1) and (2) as given in Section 2.1, and the list of weight matrices is formed with (3). From the neural network process, the sum of square errors (SSE) is then easily computed for every weight matrix in the list. For the BPERN structure, a three-layer network with one input layer, one hidden or "state" layer, and one "output" layer is used. In the CSBPERN network, the input vector is propagated through a weight layer using (4). In a simple recurrent network, the input vector is not only propagated through a weight layer in the same way but also combined with the previous state activation through an additional recurrent weight layer, as given in (5). The output of the network is in both cases determined by the state and a set of output weights, as given in (6).
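According to gradient descent, each weight change in the network should be proportional to the negative gradient of the cost $C$ with respect to the specific weight, as given in
$$\Delta w = -\eta\, \frac{\partial C}{\partial w}, \tag{16}$$
where $\eta$ is the learning rate. Thus, with $p$ indexing the training patterns, $T_{pk}$ the target, and $Y_{pk}$ the network output, the error for output nodes is calculated as follows:
$$\delta_{pk} = \left(T_{pk} - Y_{pk}\right) g'\left(\mathrm{net}_{pk}\right), \tag{17}$$
and for the hidden nodes the error is given as follows:
$$\delta_{pj} = \left(\sum_{k} \delta_{pk}\, w_{kj}\right) f'\left(\mathrm{net}_{pj}\right). \tag{18}$$
Thus the weights and bias are simply changed for the output layer as
$$\Delta w_{kj} = \eta \sum_{p} \delta_{pk}\, y_{pj}, \tag{19}$$
and for the input layer the weight change is given as
$$\Delta v_{ji} = \eta \sum_{p} \delta_{pj}\, x_{pi}. \tag{20}$$
Adding a time subscript, the recurrent weights can be modified according to (21) as follows:
$$\Delta u_{jh} = \eta \sum_{p} \delta_{pj}(t)\, y_{ph}(t-1). \tag{21}$$
The network error is calculated for CSBPERN using (7) from Section 2.1. The performance indices for the network are measured with (8) and (9). At the end of each epoch the list of average sums of square errors of the $i$th iteration, SSE, can be calculated with (10). The Cuckoo Search seeks the minimum SSE, which is found when all the inputs have been processed for each population of the Cuckoo nest. Hence, the Cuckoo Search nest $x_j$ is calculated using (11). A new solution $x_i^{t+1}$ for Cuckoo $i$ is generated using a Levy flight according to (12). The movement of the other Cuckoo $x_i$ toward $x_j$ is controlled through (13). The Cuckoo Search can move from $x_i$ toward $x_j$ through Levy flight as written in (14). The weights and bias for each layer are then adjusted with (15).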
The pseudocode for CSBPERN algorithm is given in Pseudocode 2.
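The gradient-descent part of the hybrid can be sketched as a single per-pattern update for an Elman network. This is a hedged illustration rather than the authors' code: it treats the previous state as a fixed extra input (one-step truncation), assumes log-sigmoid units in both layers, and the function name `bp_step` and its argument layout are hypothetical.

```python
import numpy as np

def logsig(z):
    """Log-sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def bp_step(V, U, W, b_h, b_o, x, prev_state, target, lr=0.4):
    """One gradient-descent update for a single pattern.

    Updates the weight arrays in place and returns the new hidden state
    together with the squared error measured before the update.
    """
    # forward pass: hidden state from the input and the previous state
    state = logsig(V @ x + U @ prev_state + b_h)
    y = logsig(W @ state + b_o)

    # output and hidden error terms; s * (1 - s) is the log-sigmoid derivative
    delta_o = (target - y) * y * (1.0 - y)
    delta_h = (W.T @ delta_o) * state * (1.0 - state)

    # weight and bias changes proportional to the negative gradient
    W += lr * np.outer(delta_o, state)
    b_o += lr * delta_o
    V += lr * np.outer(delta_h, x)
    U += lr * np.outer(delta_h, prev_state)
    b_h += lr * delta_h
    return state, float(np.sum((target - y) ** 2))
```

The default learning rate of 0.4 mirrors the value reported for the experiments; in a hybrid scheme the arrays passed in would first be filled from the best nest found by the search.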

Result and Discussion
3.1. Datasets. This study focuses on two criteria for the performance analysis: (a) obtaining a low mean square error (MSE) and (b) achieving high average classification accuracy on testing data from the benchmark problems. The benchmark datasets used to validate the accuracy of the proposed algorithms were taken from the UCI Machine Learning Repository. For experimentation, the data is arranged into training and testing datasets; the algorithms are trained on the training set, and their accuracy is calculated on the corresponding test set. The workstation used for the experiments is equipped with a 2 GHz processor and 2 GB of RAM, and the operating system is Microsoft XP (Service Pack 3). Matlab version R2010a was used to carry out the simulation of the proposed algorithms. Seven classification problems are selected for simulation, that is, the Thyroid Disease [29], Breast Cancer [30], IRIS [31], Glass [32], Australia Credit Card Approval [33], Pima Indian Diabetes [34], and 7-Bit Parity [35,36] datasets.
To compare the performance of the proposed CSERN and CSBPERN algorithms with conventional BPNN, ABC-BP, and ABC-LM, the network parameters, such as the number of hidden layers, the number of nodes in the hidden layer, the values for weight initialization, and the learning rate, are kept the same for all algorithms. A three-layer NN is used for training and testing of the model. For all problems the NN structure has a single hidden layer consisting of five nodes, while the number of input and output nodes varies according to the data. The log-sigmoid activation function is used as the transfer function from the input layer to the hidden layer and from the hidden layer to the output layer, whereas the simple Elman neural network (SENN) uses the pure linear function as the activation function for the output layer. A learning rate of 0.4 is selected for all tests. All algorithms are tested with initial weights and biases randomly initialized in the range [0, 1]; for each problem, one trial is limited to 1000 epochs. A total of 20 trials are run for each dataset to validate these algorithms. For each trial the network results are stored in a result file. The mean square error (MSE), the standard deviation of the mean square error (SD), the number of epochs, and the average accuracy are recorded in a separate file for each trial of the selected classification problem.
Pseudocode 2, steps (1) to (9), of the CSBPERN algorithm:
(1) Initialize the CS population size, dimension, and BPERN structure
(2) Load the training data
(3) While MSE > stopping criteria
(4) Pass the Cuckoo nests as weights to the network
(5) Run the feed forward network using the weights initialized with CS
(6) Calculate the sensitivity of each layer from that of the layer after it, starting from the last layer of the network and moving backward, using (17) and (18)
(7) Update the weights and bias using (19) to (20)
(8) Calculate the error using (7)
(9) Minimize the error by adjusting the network parameters using CS
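The reporting protocol described above (MSE, SD, and average accuracy over repeated trials) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, and the sample standard deviation is assumed because the text does not say whether the sample or population form was used.

```python
import statistics

def accuracy_percent(predicted, actual):
    """Percentage of test patterns whose predicted class matches the actual class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

def summarise_trials(trial_mse, trial_accuracy):
    """Aggregate per-trial results into the statistics reported per dataset."""
    return {
        "mean_mse": statistics.mean(trial_mse),
        # sample standard deviation of the per-trial MSE (an assumption)
        "sd_mse": statistics.stdev(trial_mse),
        "mean_accuracy": statistics.mean(trial_accuracy),
    }
```

In the setting described here, `trial_mse` and `trial_accuracy` would each hold 20 values, one per trial.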

Wisconsin Breast Cancer Classification Problem.
The Breast Cancer dataset was created by William H. Wolberg. This dataset deals with breast tumor tissue samples collected from different patients. The cancer analysis is performed to classify each tumor as benign or malignant. The dataset consists of 9 inputs and 2 outputs with 699 instances. The input attributes are the clump thickness, the uniformity of cell size, the uniformity of cell shape, the amount of marginal adhesion, the single epithelial cell size, the frequency of bare nuclei, bland chromatin, normal nucleoli, and mitoses. The network architecture selected for the Breast Cancer Classification Problem consists of 9 input nodes, 5 hidden nodes, and 2 output nodes.
Table 1 shows that the proposed CSERN and CSBPERN algorithms perform better than the BPNN, ABCNN, ABC-BP, and ABC-LM algorithms. The proposed algorithms achieve small MSE (3.23E-05, 0.00072) and SD (2.9E-05, 0.0004) with 99.95 and 97.37 percent accuracy, respectively. Meanwhile, the other algorithms, BPNN, ABCNN, ABC-BP, and ABC-LM, fall behind the proposed algorithms with large MSE (0.271, 0.014, 0.184, and 0.013) and SD (0.017, 0.0002, 0.459, and 0.001) and lower accuracy. Similarly, Figure 1 shows the MSE convergence of the algorithms. The simulation results make it clear that the proposed algorithms perform better than the compared algorithms in terms of MSE, SD, and accuracy.

IRIS Classification Problem.
The Iris flower multivariate dataset was introduced by Fisher to demonstrate discriminant analysis in pattern recognition and machine learning, that is, to find a linear set of features that separates two or more classes in the classification process. This is perhaps the best-known database in the pattern recognition literature.
Table 2 compares the performance of the proposed CSERN and CSBPERN algorithms with that of the BPNN, ABCNN, ABC-BP, and ABC-LM algorithms in terms of MSE, SD, and accuracy. From Table 2 it is clear that the proposed algorithms perform better, achieving lower MSE and SD and higher accuracy than the BPNN, ABCNN, ABC-BP, and ABC-LM algorithms. Figure 2 illustrates the MSE convergence performance of the algorithms, which confirms that the proposed algorithms outperform the other algorithms.

Thyroid Classification Problem.
This dataset is taken from the UCI Machine Learning Repository and was created based on the "Thyroid Disease" problem. It consists of 21 inputs, 3 outputs, and 7200 patterns. Each case contains 21 attributes and can be assigned to any of three classes, namely, hyper-, hypo-, and normal function of the thyroid gland, based on the patient query data and patient examination data. The network architecture selected for the Thyroid classification dataset is 21-5-3, which consists of 21 input nodes, 5 hidden nodes, and 3 output nodes.
Table 3 summarizes the performance of all the algorithms in terms of MSE, SD, and accuracy. From the table, it can be seen that the proposed CSERN and CSBPERN algorithms have small MSE and SD.
Figure 5 shows the MSE convergence performance of the algorithms over epochs for the Glass Classification Problem. From the overall results, it is clear that the proposed algorithms perform better than the other compared algorithms in terms of MSE, SD, and accuracy.

Australian Credit Card Approval Classification Problem.
This dataset is taken from the UCI Machine Learning Repository and contains details of credit card applications. The Australian Credit Card dataset consists of 690 instances, 51 inputs, and 2 outputs. Each example in this dataset represents a real credit card application and whether or not the bank or a similar institution issued the credit card. All attribute names and values have been changed to meaningless symbols to protect the privacy of the data. The selected architecture of the NN is 51-5-2. Table 6 gives the detailed results of the proposed algorithms and the compared algorithms.
Meanwhile, the other BPNN, ABCNN, ABC-BP, and ABC-LM algorithms converge with MSE of 0.26, 0.10, 0.12, and 0.08, SD of 0.014, 0.015, 0.008, and 0.012, and accuracy of 85.12, 67.85, 82.12, and 69.13 percent, which is considerably lower than that of the proposed algorithms. Finally, Figure 7 presents the MSE convergence performance of the algorithms for the 7-Bit Parity Classification Problem.

Conclusion
This paper has studied the data classification problem using the dynamic behavior of RNNs trained by the nature-inspired metaheuristic Cuckoo Search algorithm, which provides a derivative-free solution to complex optimization problems. It has proposed new Cuckoo Search based algorithms for the ERN and BPERN in order to achieve a fast convergence rate and to avoid the local minima problem of conventional RNN training. Unlike the existing algorithms, the proposed CSERN and CSBPERN algorithms imitate animal behaviour and are valuable for global convergence. The convergence behaviour and performance of the proposed CSERN and CSBPERN were simulated on selected benchmark classification problems. Specifically, 7-Bit Parity and selected UCI benchmark classification datasets were used for training and testing the network. The performance of the proposed models was compared with artificial bee colony using the BPNN algorithm and other hybrid variants. The simulation results show that the proposed CSERN and CSBPERN algorithms are far better than the baseline algorithms in terms of convergence rate. Furthermore, CSERN and CSBPERN achieved higher accuracy and lower MSE on all the designated datasets.

(1) Initialize the CS population size, dimension, and ERN structure
(2) Load the training data
(3) While MSE > stopping criteria
(4) Pass the Cuckoo nests as weights to the network
(5) Run the feed forward network using the weights initialized with CS
(6) Calculate the error using (7)
(7) Minimize the error by adjusting the network parameters using CS
(8) Generate a Cuckoo egg by taking a Levy flight from a random nest
(9) Abandon a fraction $p_\alpha \in [0, 1]$ of the worst nests. Build new nests at new locations via Levy flight to replace the old ones
(10) Evaluate the fitness of the nest and choose a random nest $i$. If the fitness of the new egg is better than that of nest $i$, replace nest $i$ and its fitness with the new egg and its fitness
(11) CS keeps on calculating the best possible weights at each epoch until the network has converged
End While
Pseudocode 1: Pseudocode of CSERN algorithm.
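The search loop in Pseudocode 1 can be sketched in Python for any fitness function that maps a weight vector to an error. This is a minimal sketch under several assumptions that are not taken from the paper: Levy steps are drawn with Mantegna's algorithm, abandoned nests are rebuilt around the current best nest, and the defaults for `n_nests`, `pa`, `alpha`, and `lam` are illustrative only.

```python
import math
import numpy as np

def levy_step(rng, size, lam=1.5):
    """Draw Levy-distributed steps with Mantegna's algorithm."""
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0.0, sigma, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / lam)

def cuckoo_search(fitness, dim, n_nests=15, pa=0.25, alpha=0.01,
                  max_epochs=200, target=1e-5, seed=0):
    """Minimise `fitness` over vectors of length `dim`; return (best nest, score)."""
    rng = np.random.default_rng(seed)
    # random initial population of small weights in [-0.25, 0.25)
    nests = 0.5 * (rng.random((n_nests, dim)) - 0.5)
    scores = np.array([fitness(n) for n in nests])
    for _ in range(max_epochs):
        if scores.min() <= target:          # stopping criterion reached
            break
        best = nests[np.argmin(scores)].copy()
        # generate a cuckoo egg by a Levy flight from a random nest
        j = rng.integers(n_nests)
        egg = nests[j] + alpha * levy_step(rng, dim)
        egg_score = fitness(egg)
        # compare with another random nest and keep the better solution
        i = rng.integers(n_nests)
        if egg_score < scores[i]:
            nests[i], scores[i] = egg, egg_score
        # abandon a fraction pa of the worst nests and rebuild them
        n_drop = int(pa * n_nests)
        if n_drop:
            for w in np.argsort(scores)[-n_drop:]:
                nests[w] = best + alpha * levy_step(rng, dim)
                scores[w] = fitness(nests[w])
    k = int(np.argmin(scores))
    return nests[k].copy(), float(scores[k])
```

For network training, `fitness` would evaluate the network error for the weights encoded in a nest, and `dim` would be the total number of weights and biases.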

Figure 1: MSE via epochs convergence for Wisconsin Breast Cancer Classification Problem.

Figure 5: MSE via epochs convergence for Glass Benchmark Classification Problem.

Figure 6: MSE via epochs convergence for Credit Card Benchmark Classification Problem.
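Remaining steps of Pseudocode 2 (CSBPERN algorithm):
(10) Generate a Cuckoo egg by taking a Levy flight from a random nest
(11) Abandon a fraction $p_\alpha \in [0, 1]$ of the worst nests. Build new nests at new locations via Levy flight to replace the old ones
(12) Evaluate the fitness of the nest and choose a random nest $i$. If the fitness of the new egg is better than that of nest $i$, replace nest $i$ and its fitness with the new egg and its fitness
(13) CS keeps on calculating the best possible weights at each epoch until the network has converged
End While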

Table 1: Summary of algorithm performance for Wisconsin Breast Cancer Classification Problem.

Table 2: Summary of algorithm performance for Iris Classification Problem.
The Iris dataset contains 150 instances, 4 inputs, and 3 outputs. The classification task assigns each sample, described by its petal width, petal length, sepal length, and sepal width, to one of three species: Iris setosa, Iris versicolor, and Iris virginica. The network structure selected for the Iris classification dataset is 4-5-3, which consists of 4 input nodes, 5 hidden nodes, and 3 output nodes. In total, 75 instances are used for training and the rest for testing.

Table 3: Summary of algorithm performance for Thyroid Classification Problem.
3.5. Diabetes Classification Problem. This dataset consists of 768 examples, 8 inputs, and 2 outputs and describes chemical changes in the female body whose imbalance can cause diabetes. The feed forward network topology for this problem is set to 8-5-2. The target error for the Diabetes Classification Problem is set to 0.00001 and the maximum number of epochs is 1000. It is evident from Table 4 that the proposed CSERN and CSBPERN algorithms perform better than the BPNN, ABCNN, ABC-BP, and ABC-LM algorithms in terms of MSE, SD, and accuracy. The proposed algorithms have MSE of 1.7E-05 and 0.039 and SD of 2.05E-05 and 0.003 and achieve 99.96 and 89.53 percent accuracy. Meanwhile, the other algorithms, BPNN, ABCNN, ABC-BP, and ABC-LM, have MSE of 0.26, 0.131, 0.2, and 0.14, SD of 0.026, 0.021, 0.002, and 0.033, and accuracy of 86.96, 68.09, 88.16, and 56.09 percent, which is considerably lower than that of the proposed algorithms. Figure 4 shows the MSE convergence performance of the algorithms for the Diabetes Classification Problem.

Table 4: Summary of algorithm performance for Diabetes Classification Problem.

Table 5 summarises the comparative performance of the algorithms. From the table it is clear that the proposed algorithms outperform the other algorithms. The proposed CSERN and CSBPERN algorithms achieve small MSE of 2.20E-05 and 0.0005, SD of 2.50E-05 and 0.0002, and high accuracy of 99.96 and 97.81 percent. Meanwhile, the BPNN, ABCNN, ABC-BP, and ABC-LM algorithms have large MSE of 0.36, 1.80E-03, 0.025, and 0.005, SD of 0.048, 0.003, 0.002, and 0.009, and accuracy of 94.04, 91.93, 94.09, and 93.96 percent, which is considerably lower than that of the proposed algorithms.

Table 5: Summary of algorithm performance for Glass Classification Problem.

Table 6: Summary of algorithm performance for Australian Credit Card Approval Classification Problem.
From the table, it is clear that the proposed CSERN and CSBPERN algorithms perform better than the BPNN, ABCNN, ABC-BP, and ABC-LM algorithms in terms of MSE, SD, and accuracy. The proposed algorithms have MSE of 2.3E-06 and 0.052 and SD of 2.6E-06 and 0.005 and achieve 99.98 and 89.28 percent accuracy.

Table 7: Summary of algorithm performance for 7-Bit Parity Classification Problem.

Acronyms, Mathematical Symbols, and Their Meanings Used
$W_n$: Weight value at each layer in the feed forward network
$U_n$: Weight value at each additional layer in the recurrent feedback
$B_n$: Bias values for the network
$W^c$: Total weight matrix for the network
$\mathrm{net}_j$: Output function for the hidden layer
$\mathrm{net}_k$: Output function for the output layer
$Y_k$: Predicted output
$V_\mu(x)$: Average performance
$V_F(x)$: Performance index
rand: Random function for generating random variables
$E$: Error at the output layer