Novel Back Propagation Optimization by Cuckoo Search Algorithm

Traditional Back Propagation (BP) has some significant disadvantages, such as slow training, a tendency to fall into local minima, and sensitivity to the initial weights and bias. In order to overcome these shortcomings, an improved BP network that is optimized by Cuckoo Search (CS), called CSBP, is proposed in this paper. In CSBP, CS is used to simultaneously optimize the initial weights and bias of the BP network. Wine data is adopted to study the prediction performance of CSBP, and the proposed method is compared with the basic BP and the General Regression Neural Network (GRNN). Moreover, a parameter study of CSBP is conducted in order to find the settings under which CSBP performs best.


Introduction
Though the traditional neural networks (such as BP) have been widely used in many areas, they have some inherent shortcomings. These disadvantages have become a major bottleneck that restricts their further development. In most cases, the gradient descent method is used in feedforward neural networks (FNN), which has the following main disadvantages.
(1) Slow training: the gradient descent method requires many iterations to adjust the weights and bias, so the training process takes a long time.
(2) It easily falls into a local minimum and therefore may fail to reach the global minimum.
(3) It is very sensitive to the choice of the initial weights and bias. Because these strongly influence the performance of neural networks (NN), proper weights and bias must be carefully selected in order to obtain a good network. If they are selected improperly, the algorithm converges very slowly and the training process takes a long time.
Therefore, in order to enhance the performance of BP, many scholars have sought a training algorithm that trains fast, reaches a globally optimal solution, and generalizes well. Finding such a training algorithm has been a main research objective in recent years.
In this paper, the CS algorithm [35][36][37], a newly developed metaheuristic method, is used to optimize the weights and bias of BP. That is to say, CS selects the best initial weights and bias for constructing the BP network, in place of the randomly generated weights and bias used in the basic BP. In order to demonstrate the superiority of CSBP, it is used to solve the Italian wine classification problem. Compared with the traditional BP and GRNN, this method has higher prediction accuracy and better generalization performance.

The Scientific World Journal
The remainder of this paper is organized as follows. The preliminaries, including CS and BP, are provided in Section 2. Section 3 presents the details of BP optimized by CS. Then, in Section 4, a series of comparison experiments on the Italian wine classification problem are conducted. The final section provides our concluding remarks and points out directions for future work.

Preliminary
2.1. CS Algorithm. The CS method [35,36] is a novel metaheuristic swarm intelligence [38] optimization method. It is based on the behavior of some cuckoo species in combination with Lévy flights. In CS, the Lévy flights determine how far a cuckoo can move forward in a single step.
In order to describe the CS algorithm more easily, Yang and Deb [35] idealized the behavior of the cuckoo species into the following three rules: (1) every cuckoo in the population lays only one egg at a time and randomly selects a nest in which to place this egg; (2) the eggs with the best fitness are kept unchanged, so that the whole population keeps evolving forward; (3) the host bird discovers the cuckoo eggs with a probability $p_a \in [0, 1]$. In this case, the cuckoo has no other choice and it has to build a fully new nest.
Based on the above hypotheses, CS can be summarized as shown in Algorithm 1. We must point out that, for single objective problems, the cuckoos, eggs, and nests are equivalent to each other, so we do not differentiate them in this work.
In order to balance exploitation and exploration, CS uses a balanced combination of a local random walk and a global explorative random walk, controlled by a switching parameter $p_a$. The exploitation step can be represented as
$$x_i^{t+1} = x_i^t + \alpha s \otimes H(p_a - \varepsilon) \otimes (x_j^t - x_k^t), \tag{1}$$
where $x_j^t$ and $x_k^t$ are two different randomly selected cuckoos, $H(u)$ is a Heaviside function, $\varepsilon$ is a random number, and $s$ is the step size. On the other hand, the exploration step is implemented by using Lévy flights as follows:
$$x_i^{t+1} = x_i^t + \alpha L(s, \lambda), \tag{2}$$
where $L(s, \lambda) = \dfrac{\lambda \Gamma(\lambda) \sin(\pi \lambda / 2)}{\pi} \dfrac{1}{s^{1+\lambda}}$, $(s, s_0 > 0)$, $\alpha > 0$ is the scaling factor, and its value can be determined by the problem of interest. More information on CS can be found in [39][40][41].
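The two random walks described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this paper: the function names (`levy_step`, `local_walk`, `global_walk`), the default values `alpha=0.01` and `lam=1.5`, and the per-dimension application of the Heaviside switch are all assumptions. In addition, the Lévy step is drawn with Mantegna's algorithm, a commonly used sampler that is swapped in here because the text does not specify how the steps are generated.

```python
import math
import random


def levy_step(lam=1.5):
    """Draw one heavy-tailed step with Mantegna's algorithm (assumed sampler)."""
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = random.gauss(0, sigma)
    v = random.gauss(0, 1)
    while v == 0:  # avoid division by zero in the (very unlikely) degenerate draw
        v = random.gauss(0, 1)
    return u / abs(v) ** (1 / lam)


def local_walk(x_i, x_j, x_k, p_a=0.1, alpha=0.01):
    """Exploitation: x_i + alpha * s * H(p_a - eps) * (x_j - x_k), per dimension."""
    new_position = []
    for xi, xj, xk in zip(x_i, x_j, x_k):
        s = random.random()    # step size (assumed uniform in [0, 1))
        eps = random.random()  # random number compared against p_a
        h = 1.0 if p_a - eps > 0 else 0.0  # Heaviside switch
        new_position.append(xi + alpha * s * h * (xj - xk))
    return new_position


def global_walk(x_i, alpha=0.01, lam=1.5):
    """Exploration: x_i + alpha * (Levy-distributed step), per dimension."""
    return [xi + alpha * levy_step(lam) for xi in x_i]
```

With `p_a = 0` the Heaviside switch is never active, so `local_walk` returns the position unchanged; larger values of `p_a` make the local move more frequent.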
2.2. BP Network. The BP network, proposed in 1986 by a team of scientists led by Rumelhart and McClelland, is a multilayer feedforward network trained by the error back propagation algorithm. It is one of the most widely used neural network models. A BP network can learn and store a large number of input-output mappings without prior mathematical equations that describe these mappings. The steepest descent method is used as the learning rule to adjust the weights and bias so as to minimize the network error. In general, the topology of the BP network model includes an input layer, a hidden layer, and an output layer. The number of layers and the number of neurons in each hidden layer can be determined by the dimensions of the input vector and the output vector. In most cases, a single hidden layer is used in a BP network.
The BP network is a kind of supervised learning algorithm. Its main idea can be described as follows. Firstly, training samples are input into the BP network, and then the weights and bias are adjusted by using the error back propagation algorithm. This training process minimizes the error between the desired vector and the output vector. When the error is satisfactory, the weights and bias are stored and can then be used to predict test samples. More information about BP can be found in [42].
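As a concrete illustration of the training idea described above, the following sketch performs steepest-descent updates on a network with a single hidden layer. It is a minimal example under assumptions that the text does not fix: sigmoid hidden units, linear output units, a squared-error loss, and per-sample updates. The names `forward` and `bp_step` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Forward pass: sigmoid hidden layer followed by a linear output layer."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    o = [sum(w * hj for w, hj in zip(row, h)) + b
         for row, b in zip(W2, b2)]
    return h, o


def bp_step(x, y, W1, b1, W2, b2, lr=0.1):
    """One steepest-descent update (in place) on the error 0.5 * sum((o - y)^2).

    Returns the error measured before the update.
    """
    h, o = forward(x, W1, b1, W2, b2)
    # Error signal at the (linear) output layer.
    delta_o = [ok - yk for ok, yk in zip(o, y)]
    # Error propagated back through the sigmoid hidden layer.
    delta_h = [hj * (1 - hj) * sum(delta_o[k] * W2[k][j] for k in range(len(W2)))
               for j, hj in enumerate(h)]
    for k, dk in enumerate(delta_o):
        for j, hj in enumerate(h):
            W2[k][j] -= lr * dk * hj
        b2[k] -= lr * dk
    for j, dj in enumerate(delta_h):
        for i, xi in enumerate(x):
            W1[j][i] -= lr * dj * xi
        b1[j] -= lr * dj
    return 0.5 * sum(d * d for d in delta_o)
```

Calling `bp_step` repeatedly on the training samples drives the error down from whatever starting point the initial weights and bias define, which is why the choice of that starting point matters.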

CSBP
In the present work, the CS algorithm is used to optimize the BP network. More specifically, the BP network serves as the objective function (fitness function), and the CS method searches for the optimal weights and bias. The best weights and bias are well suited to constructing a BP network that is significantly superior to the basic BP network.
The process of optimizing the BP network by CS is divided into three parts: determining the BP network structure, obtaining the best weights and bias through CS, and predicting through the neural network. In the first part, the structure of the BP network is determined based on the number of input and output parameters, and the length of each cuckoo individual in CS is determined accordingly. In the second part, the CS method is applied to optimize the weights and bias of the BP network. Each individual in the cuckoo population includes all the weights and bias in BP, and it is evaluated by the fitness function. The CS method implements initializing CS, determining fitness function, updating position operator, selecting operator, replacing operator, and eliminating operator in order to find the cuckoo individual with the best fitness. This optimization process is repeated until satisfactory weights and bias are found. In the last part, the BP network with the optimal weights and bias is constructed and is trained to predict the output. Based on the above analyses, the flowchart of the CSBP algorithm is shown in Figure 1.
In CSBP, CS is applied to optimize the initial weights and bias of the BP network, so that the optimized BP network produces better predicted output. The elements in CSBP include initializing CS, determining fitness function, updating position operator, selecting operator, replacing operator, and eliminating operator in order to find the cuckoo individual with the best fitness. The detailed steps of the CS algorithm (see Figure 1) are as follows.

(1) Initializing CS. Each cuckoo individual is encoded in real-coded form, as a real-number string that consists of the following four parts: the connection weights between the hidden layer and the output layer, the connection weights between the hidden layer and the input layer, the bias of the output layer, and the bias of the hidden layer. Each cuckoo individual thus contains all the weights and bias of the BP network, so a specific BP network can be constructed from it.
(2) Determining Fitness Function. The initial weights and bias of the BP network can be determined according to the best individual. After training, the BP network is used to predict the output. The fitness value of a cuckoo individual is the sum of the absolute errors between the desired output and the predicted output:
$$F = k \left( \sum_{i=1}^{n} \mathrm{abs}(y_i - o_i) \right), \tag{3}$$
where $n$ is the number of nodes in the output layer of the BP network and $k$ is a coefficient; $y_i$ and $o_i$ are the desired output and the predicted output of the $i$th node in the BP network.
(3) Updating Position Operator. A cuckoo (say $i$) is randomly chosen in the cuckoo population and its position is updated according to (1). The fitness $F_i$ of the $i$th cuckoo at generation $t$ and position $x_i^t$ is evaluated by (3).
(4) Selecting Operator. Another cuckoo (say $j$) is randomly chosen in the cuckoo population for comparison.
(5) Replacing Operator. If the fitness value of cuckoo $i$ is bigger than that of cuckoo $j$, that is, $F_i > F_j$, $x_j$ is replaced by the new solution.
(6) Eliminating Operator. In order to keep the population in an optimum state all the time, the $\mathrm{ceil}(p_a \cdot NP)$ worst cuckoos are removed in each generation. At the same time, in order to keep the population size unchanged, $\mathrm{ceil}(p_a \cdot NP)$ cuckoos are randomly generated. The cuckoos with the best fitness are passed directly to the next generation. Here, $\mathrm{ceil}(x)$ rounds the elements of $x$ to the nearest integers towards infinity.
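The fitness function of step (2) and the eliminating operator of step (6) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names, the default coefficient `k=1.0`, the initialization range `[low, high]`, and the convention that a lower score (a smaller error) is better are assumptions made for the example. The discovery rate and population size are written `p_a` and `NP`, following the usual CS notation.

```python
import math
import random


def fitness(desired, predicted, k=1.0):
    """Fitness of one cuckoo: k times the sum of absolute output errors."""
    return k * sum(abs(y - o) for y, o in zip(desired, predicted))


def eliminate_worst(population, scores, p_a, dim, low=-1.0, high=1.0):
    """Replace the ceil(p_a * NP) worst cuckoos with randomly generated ones.

    Assumes a lower score is better; the best cuckoos are kept as they are,
    and the population size stays unchanged.
    """
    n_drop = math.ceil(p_a * len(population))
    best_first = sorted(range(len(population)), key=lambda i: scores[i])
    survivors = [population[i] for i in best_first[:len(population) - n_drop]]
    fresh = [[random.uniform(low, high) for _ in range(dim)]
             for _ in range(n_drop)]
    return survivors + fresh
```

For example, with `NP = 50` and `p_a = 0.1`, five cuckoos are regenerated in each generation while the 45 best are carried over.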
BP network in CSBP (see Figure 1) is similar to an ordinary BP network, and the detailed steps can be represented as follows.
(1) Determining BP Network Structure. The weights and bias are randomly initialized, and then they are encoded according to the CS algorithm. The encoded weights and bias are input into CS, which optimizes the BP network by following the steps of the CS algorithm (see Figure 1).
(2) Constructing CSBP Network. The optimal weights and bias obtained from the CS algorithm are used to construct the CSBP network. The training set is used to train the network and the training error is calculated. When the training error meets the requirements, training of the CSBP network stops.
(3) Predicting Output. The test set is input into the trained CSBP network to predict the output.

Simulation
A classical wine classification problem (http://archive.ics.uci.edu/ml/datasets/wine) is used to test the prediction effectiveness of the CSBP network. The wine data, which originate from the UCI wine database, record the chemical composition analysis of three different varieties of wine grown in the same region in Italy. The three kinds of wine are labeled 1, 2, and 3. Each sample contains 14 dimensions. The first dimension is the class identifier, and the others represent the characteristics of the wine. Among the 178 samples, samples 1-59, 60-130, and 131-178 belong to the first, second, and third category, respectively. Each category is divided into two parts: a training set and a test set.

Comparisons of CSBP with BP and GRNN.
In this section, CSBP is applied to solve the wine classification problem, and the results are compared with the traditional feedforward neural networks (BP and GRNN). For CSBP and BP, the numbers of neurons in the input layer, hidden layer, and output layer are 13, 11, and 1, respectively. The length of the encoded string for each cuckoo individual is 166, which is computed as 13 * 11 + 11 + 11 * 1 + 1 = 166. That is, CS has to find the minimum of a 166-dimensional function.
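The length computation above, and the way a flat cuckoo vector maps back onto the network parameters, can be illustrated with a short sketch. The helper names `n_parameters` and `decode` are hypothetical, and the ordering of the four parts inside the vector (input-to-hidden weights, hidden bias, hidden-to-output weights, output bias) is an assumption that simply follows the order of the terms in the sum.

```python
def n_parameters(n_in, n_hidden, n_out):
    """Number of weights and bias values in a single-hidden-layer network."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out


def decode(vector, n_in, n_hidden, n_out):
    """Split a flat cuckoo vector into (W1, b1, W2, b2); ordering is assumed."""
    if len(vector) != n_parameters(n_in, n_hidden, n_out):
        raise ValueError("vector length does not match the network structure")
    pos = 0
    # Input-to-hidden weights: one row of n_in values per hidden neuron.
    W1 = [vector[pos + j * n_in: pos + (j + 1) * n_in] for j in range(n_hidden)]
    pos += n_in * n_hidden
    b1 = vector[pos: pos + n_hidden]  # hidden-layer bias
    pos += n_hidden
    # Hidden-to-output weights: one row of n_hidden values per output neuron.
    W2 = [vector[pos + k * n_hidden: pos + (k + 1) * n_hidden] for k in range(n_out)]
    pos += n_hidden * n_out
    b2 = vector[pos: pos + n_out]  # output-layer bias
    return W1, b1, W2, b2
```

For the 13-11-1 structure used here, `n_parameters(13, 11, 1)` gives 166, and `decode` turns any 166-element cuckoo individual into the parameters of one specific BP network.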
Firstly, the performance of CS when optimizing the weights and bias is tested with discovery rate $p_a$ = 0.1, a small population size (10), and a small number of maximum generations (10). The fitness curve is shown in Figure 2. From Figure 2, it can be seen that the fitness value decreases sharply from 0.095 to 0.045 within two generations. This means that CS can significantly reduce the training error, and it does succeed in optimizing the basic BP network.
In the following experiments, all the parameters are set as follows. For the BP network, epochs = 50, learning rate = 0.1, and objective = 0.00004. For GRNN, a cyclic training method is used to select the best SPREAD value, so that GRNN achieves its best prediction. For CSBP, the BP network part has the same parameters as the basic BP; for the CS algorithm part, we set discovery rate $p_a$ = 0.1, population size NP = 50, and maximum generation Max gen = 50.
As intelligent algorithms always involve some randomness, each run generates different results. In order to obtain representative statistical performance, 600 runs are conducted for each method. The results are recorded in Figures 3 and 4 and Table 1.
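The statistics referred to in this comparison (best, worst, and average performance and the standard deviation over repeated runs) can be computed as in the sketch below. The function name `summarize` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption, since the text does not state which one is reported.

```python
import statistics


def summarize(results):
    """Summarize per-run results, e.g. correctly classified samples per run."""
    return {
        "best": max(results),
        "worst": min(results),
        "mean": statistics.mean(results),
        "std": statistics.pstdev(results),  # population Std (assumed)
    }
```

A smaller `std` over the runs indicates a more stable method, which is the sense in which the stability of the three networks is compared.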
From Table 1, for the training set, the best performance and the average performance of BP, CSBP, and GRNN differ little, though CSBP performs slightly better than BP and GRNN. For the worst performance, CSBP is better than GRNN and is significantly superior to the BP network. For the test set, the overall prediction accuracy of CSBP is much better than that of BP and GRNN. In addition, the Std (standard deviation) of CSBP is clearly smaller than those of BP and GRNN. That is to say, CSBP generates a more stable prediction output with little fluctuation. Moreover, Figures 3 and 4 show that CSBP has a strong ability to solve the wine classification problem.

Parameter Study.
As we are aware, parameter settings are of paramount importance to the performance of metaheuristics. Here, the effects of the maximum generation, population size, and discovery rate on the CS algorithm are analyzed and studied.

Influence of the Maximum Generation for CSBP.
Firstly, the number of maximum generations (Max gen) is studied, and the results are shown in Table 2. Table 2 shows that, when Max gen is equal to 40, 50, or 100, CSBP can fit all training samples without error. However, the prediction accuracy does not always improve as the maximum generation increases. For the test set, when the number of maximum generations increases from 10 to 100, the prediction accuracy first increases, then decreases, and finally increases again. In particular, when Max gen = 100, the prediction accuracy reaches its maximum (89/89 = 100%). A careful look at Table 2 shows that the prediction accuracy changes within a very small range. That is, CSBP is insensitive to the parameter Max gen. Meanwhile, though more generations (such as 100) yield perfect prediction accuracy, optimizing the weights and bias then takes longer. Taking all of these factors into consideration, the maximum generation is set to 50 in the present work.

Influence of the Population Size for CSBP.
Subsequently, the influence of the population size (NP) is studied (see Table 3). From Table 3, when NP is in the range [10, 100], and especially when it is equal to 100, CSBP can fit all training samples with little error. The prediction accuracy on the test set reaches its maximum (89/89 = 100%) when NP = 100. Similar to the trend for Max gen, when NP increases from 10 to 100, the prediction accuracy first increases, then decreases, and finally increases again, but it fluctuates only slightly. This means that the population size has little effect on the prediction accuracy of CSBP. The prediction accuracy [...] Table 4). From [...]

Conclusion
If a BP network has bad initial weights and bias, it may fail to find the best solution. In order to overcome the disadvantages of BP, this paper uses the CS algorithm to optimize the weights and bias of the basic BP to solve the prediction problem. This method trains fast, can obtain the global optimal solution, and has good generalization performance. Most importantly, CSBP is insensitive to the initial weights and bias and to the parameter settings of the CS algorithm. We only need to input the training samples into the CSBP network, and then CSBP can obtain a unique optimal solution. Compared with other traditional methods (such as BP and GRNN), this method is faster and generalizes better.
In future, our research will focus on the following points. On the one hand, CSBP will be used to solve other regression and classification problems, and its results can be further compared with those of other methods, such as feedforward neural network [43], Wavelet Neural Network (WNN) [44,45], and Extreme Learning Machine (ELM) [46,47]. On the other hand, we will hybridize BP with some other metaheuristic algorithms, such as artificial plant optimization algorithm (APOA) [48], artificial physics optimization [49], flower pollination algorithm (FPA) [50], grey wolf optimizer (GWO) [51], and animal migration optimization (AMO) [52], so as to further improve the performance of BP.