Metaheuristic Algorithms for Convolution Neural Network

A typical modern optimization technique is usually either heuristic or metaheuristic. This technique has managed to solve some optimization problems in the research area of science, engineering, and industry. However, implementation strategy of metaheuristic for accuracy improvement on convolution neural networks (CNN), a famous deep learning method, is still rarely investigated. Deep learning relates to a type of machine learning technique, where its aim is to move closer to the goal of artificial intelligence of creating a machine that could successfully perform any intellectual tasks that can be carried out by a human. In this paper, we propose the implementation strategy of three popular metaheuristic approaches, that is, simulated annealing, differential evolution, and harmony search, to optimize CNN. The performances of these metaheuristic methods in optimizing CNN on classifying MNIST and CIFAR dataset were evaluated and compared. Furthermore, the proposed methods are also compared with the original CNN. Although the proposed methods show an increase in the computation time, their accuracy has also been improved (up to 7.14 percent).


Introduction
Deep learning (DL) is mainly motivated by the research of artificial intelligent, in which the general goal is to imitate the ability of human brain to observe, analyze, learn, and make a decision, especially for complex problem [1]. This technique is in the intersection amongst the research area of signal processing, neural network, graphical modeling, optimization, and pattern recognition. The current reputation of DL is implicitly due to drastically improve the abilities of chip processing, significantly decrease the cost of computing hardware, and advanced research in machine learning and signal processing [2]. 2 Computational Intelligence and Neuroscience proposed by Hinton and Salakhutdinov [8]. On the other side, Krylov subspace descent is more robust and simpler than Hessian-free optimization as well as looks like it is better for the classification performance and optimization speed. However, Krylov subspace descent needs more memory than Hessian-free optimization [9].
In fact, techniques of modern optimization are heuristic or metaheuristic. These optimization techniques have been applied to solve any optimization problems in the research area of science, engineering, and even industry [10]. However, research on metaheuristic to optimize DL method is rarely conducted. One work is the combination of genetic algorithm (GA) and CNN, proposed by You and Pu [11]. Their model selects the CNN characteristic by the process of recombination and mutation on GA, in which the model of CNN exists as individual in the algorithm of GA. Besides, in recombination process, only the layers weights and threshold value of C1 (convolution in the first layer) and C3 (convolution in the third layer) are changed in CNN model. Another work is finetuning CNN using harmony search (HS) by Rosa et al. [12].
In this paper, we compared the performance of three metaheuristic algorithms, that is, simulated annealing (SA), differential evolution (DE), and HS, for optimizing CNN. The strategy employed is looking for the best value of the fitness function on the last layer using metaheuristic algorithm; then the results will be used again to calculate the weights and biases in the previous layer. In the case of testing the performance of the proposed methods, we use MNIST (Mixed National Institute of Standards and Technology) dataset. This dataset comprises images of digital handwritten digits, containing 60,000 training data items and 10,000 testing data items. All of the images have been centered and standardized with the size of 28 × 28 pixels. Each pixel of the image is represented by 0 for black and 255 for white, and in between are different shades of gray [13]. This paper is organized as follows: Section 1 is an introduction, Section 2 explains the used metaheuristic algorithms, Section 3 describes the convolution neural networks, Section 4 gives a description of the proposed methods, Section 5 presents the result of simulation, and Section 6 is the conclusion.

Metaheuristic Algorithms
Metaheuristic is well known as an efficient method for hard optimization problems, that is, the problems that cannot be solved optimally using deterministic approach within a reasonable time limit. Metaheuristic methods work for three main purposes: for fast solving of problem, for solving large problems, and for making a more robust algorithm. These methods are also simple to design as well as flexible and easy to implement [14].
In general, metaheuristic algorithms use the combination of rules and randomization to duplicate the phenomena of nature. The biological system imitations of metaheuristic algorithm, for instance, are evolution strategy, GA, and DE. Phenomena of ethology for examples are particle swarm optimization (PSO), bee colony optimization (BCO), bacterial foraging optimization algorithms (BFOA), and ant colony optimization (ACO). Phenomena of physics are SA, microcanonical annealing, and threshold accepting method [15]. Another form of metaheuristic is inspired by music phenomena, such as HS algorithm [16].
Classification of metaheuristic algorithm can also be divided into single-solution-based and population-based metaheuristic algorithm. Some of the examples for singlesolution-based metaheuristic are the noising method, tabu search, SA, TA, and guided local search. In the case of metaheuristic based on population, it can be classified into swarm intelligent and evolutionary computation. The general term of swarm intelligent is inspired by the collective behavior of social insect colonies or animal societies. Examples of these algorithms are GP, GA, ES, and DE. On the other side, the algorithm for evolutionary computation takes inspiration from the principles of Darwinian for developing adaptation into their environment. Some examples of these algorithms are PSO, BCO, ACO, and BFOA [15]. Among all these metaheuristic algorithms, SA, DE, and HS are used in this paper.

Simulated Annealing Algorithm.
SA is a technique of random search for the problem of global optimization. It mimics the process of annealing in material processing [10]. This technique was firstly proposed in 1983 by Kirkpatrick et al. [17].
The principle idea of SA is using random search, which not only allows changes that improve the fitness function but also maintains some changes that are not ideal. As an example, in minimum optimization problem, any better changes that decrease the fitness function value ( ) will be accepted, but some changes that increase ( ) will also be accepted with a transition probability ( ) as follows: where Δ is the energy level changes, is Boltzmann's constant, and is temperature for controlling the process of annealing. This equation is based on the Boltzmann distribution in physics [10]. The following is standard procedure of SA for optimization problems: (1) Generate the solution vector: the initial solution vector is randomly selected, and then the fitness function is calculated.
(2) Initialize the temperature: if the temperature value is too high, it will take a long time to reach convergence, whereas a too small value can cause the system to miss the global optimum.
(3) Select a new solution: a new solution is randomly selected from the neighborhood of the current solution.
(4) Evaluate a new solution: a new solution is accepted as a new current solution depending on its fitness function.
(5) Decrease the temperature: during the search process, the temperature is periodically decreased.
Computational Intelligence and Neuroscience 3 (6) Stop or repeat: the computation is stopped when the termination criterion is satisfied. Otherwise, steps (2) and (6) are repeated.

Differential Evolution
Algorithm. Differential evolution was firstly proposed by Price and Storn in 1995 to solve the Chebyshev polynomial problem [15]. This algorithm is created based on individuals difference, exploiting random search in the space of solution, and finally operates the procedure of mutation, crossover, and selection to obtain the suitable individual in system [18]. There are some types in DE, including the classical form DE/rand/1/bin; it indicates that in the process of mutation the target vector is randomly selected, and only a single different vector is applied. The acronym of bin shows that crossover process is organized by a rule of binomial decision. The procedure of DE algorithm is shown by the following steps: (1) Determining parameter setting: population size is the number of individuals. Mutation factor (F) controls the magnification of the two individual differences to avoid search stagnation. Crossover rate (CR) decides how many consecutive genes of the mutated vector are copied to the offspring.
(2) Initialization of population: the population is produced by randomly generating the vectors in the suitable search range.
(3) Evaluation of individual: each individual is evaluated by calculating their objective function.
(6) Selection operation: this operation determines whether the offspring in the next generation should become a member of the population or not.
(7) Stopping criterion: the current generation is substituted by the new generation until the criterion of termination is satisfied.

Harmony Search Algorithm.
Harmony search algorithm is proposed by Geem et al. in 2001 [19]. This algorithm is inspired by the musical process of searching for a perfect state of harmony. Like harmony in music, solution vector of optimization and improvisation from the musician are analogous to structures of local and global search in optimization techniques.
In improvisation of the music, the players sound any pitch in the possible range together that can create one vector of harmony. In the case of pitches creating a real harmony, this experience is stored in the memory of each player and they have the opportunity to create better harmony next time [16]. There are three possible alternatives when one pitch is improvised by a musician: any one pitch is played from her/his memory, a nearby pitch is played from her/his memory, and an entirely random pitch is played with the range of possible sound. If these options are used for optimization, they have three equivalent components: the use of harmony memory, pitch adjusting, and randomization. In HS algorithm, these rules are correlated with two relevant parameters, that is, harmony consideration rate (HMCR) and pitch adjusting rate (PAR). The procedure of HS algorithm can be summarized into five steps as follows [16]: (1) Initialize the problem and parameters: in this algorithm, the problem can be maximum or minimum optimization, and the relevant parameters are HMCR, PAR, size of harmony memory, and termination criterion.
(2) Initialize harmony memory: the harmony memory (HM) is usually initialized as a matrix that is created randomly as a vector of solution and arranged based on the objective function.
(3) Improve a new harmony: a vector of new harmony is produced from HM based on HMCR, PAR, and randomization. Selection of new value is based on HMCR parameter by range 0 through 1. The vector of new harmony is observed to decide whether it should be pitch-adjusted using PAR parameter. The process of pitch adjusting is executed only after a value is selected from HM.
(4) Update harmony memory: the new harmony substitutes the worst harmony in terms of the value of the fitness function, in which the fitness function of new harmony is better than worst harmony. (3) and (4) until satisfying the termination criterion: in the case of meeting the termination criterion, the computation is ended. Alternatively, process (3) and (4) are reiterated. In the end, the vector of the best HM is nominated and is reflected as the best solution for the problem.

Convolution Neural Network
CNN is a variant of the standard multilayer perceptron (MLP). A substantial advantage of this method, especially for pattern recognition compared with conventional approaches, is due to its capability in reducing the dimension of data, extracting the feature sequentially, and classifying one structure of network [20].  performance on a number of tasks for pattern recognition using error gradient method [21].
The classical CNN by LeCun et al. is an extension of traditional MLP based on three ideas: local receptive fields, weights sharing, and spatial/temporal subsampling. These ideas can be organized into two types of layers, which are convolution layers and subsampling layers. As is showed in Figure 1, the processing layers contain three convolution layers C1, C3, and C5, combined in between with two subsampling layers S2 and S4 and output layer F6. These convolution and subsampling layers are structured into planes called features maps.
In convolution layer, each neuron is linked locally to a small input region (local receptive field) in the preceding layer. All neurons with similar feature maps obtain data from different input regions until the whole of plane input is skimmed, but the same of weights is shared (weights sharing).
In subsampling layer, the feature maps are spatially downsampled, in which the size of the map is reduced by a factor 2. As an example, the feature map in layer C3 of size 10 × 10 is subsampled to a conforming feature map of size 5 × 5 in the subsequent layer S4. The last layer is F6 that is the process of classification [21].
Principally, a convolution layer is correlated with some feature maps, the size of the kernel, and connections to the previous layer. Each feature map is the results of a sum of convolution from the maps of the previous layer, by their corresponding kernel and a linear filter. Adding a bias term and applying it to a nonlinear function, the th feature map with the weights and bias is obtained using the tanh function as follows: The purpose of a subsampling layer is to reach spatial invariance by reducing the resolution of feature maps, in which each pooled feature map relates to one feature map of the preceding layer. The subsampling function, where × is the inputs, is a trainable scalar, and is trainable bias, is given by the following equation: After several convolutions and subsampling, the last structure is classification layer. This layer works as an input for a series of fully connected layers that will execute the classification task. It has one output neuron every class label, and in the case of MNIST dataset, this layer contains ten neurons corresponding to their classes.

Design of Proposed Methods
The architecture of this proposed method refers to a simple CNN structure (LeNet-5), not a complex structure like AlexNet [22]. We use two variations of design structure. First is i-6c-2s-12c-2s, where the number of C1 is 6 and that of C3 is 12. Second is i-8c-2s-16c-2s, where the number of C1 is 8 and that of C3 is 18. The kernel size of all convolution layers is 5 × 5, and the scale of subsampling is 2. This architecture is designed for recognizing handwritten digits from MNIST dataset.
In this proposed method, SA, DE, and HS algorithm are used to train CNN (CNNSA, CNNDE, and CNNHS) to find the condition of best accuracy and also to minimize estimated error and indicator of network complexity. This objective can be realized by computing the lost function of vector solution or the standard error on the training set. The following is the lost function used in this paper: where is the expected output, is the real output, and is some training samples. In the case of termination criterion, two situations are used in this method. The first is when the maximum iteration has been reached and the second is when the loss function is less than a certain constant. Both conditions mean that the most optimal state has been achieved. Δ is the essential aspect of this proposed method. Selection in the proper of this value will significantly increase the accuracy. For example, in CNNSA to one epoch, if Δ = 0.0008 × rand, then the accuracy is 88.12, in which this value is 5.73 greater than the original CNN (82.39). However, if Δ = 0.0001 × rand, its accuracy is 85.79 and its value is only 3.40 greater than the original CNN.
Furthermore, this solution vector is updated based on SA algorithm. When the termination criterion is satisfied, all of weights and biases are updated for all layers in the system. Algorithm 1 is the CNNSA algorithm of the proposed method. Similar to CNNSA method, selection in the proper of Δ will significantly increase the value of accuracy. In the case of one epoch in CNNDE as an example, if Δ = 0.0008 × rand, then the accuracy is 86.30, in which this value is 3.91 greater than the original CNN (82.39). However, if Δ = 0.00001 × rand, its accuracy is 85.51.

Design of CNNDE
Furthermore, these individuals in the population are updated based on the DE algorithm. When the termination criterion is satisfied, all of weights and biases are updated for all layers in the system. Algorithm 2 is the CNNDE algorithm of the proposed method. In this method, Δ is also an important aspect, while selection of the proper of Δ will significantly increase the value of accuracy. For example of one epoch in CNNHS (i-8c-2s-16c-2s), if Δ = 0.0008 × rand, then the accuracy is 87.23, in which this value is 7.14 greater than the original CNN  (80.09). However, if Δ = 0.00001 × rand, its accuracy is 80.23; the value is only 0.14 greater than CNN. Furthermore, this harmony memory is updated based on the HS algorithm. When the termination criterion is satisfied, all of weights and biases are updated for all layers in the system. Algorithm 3 is the CNNHS algorithm of the proposed method.

Simulation and Results
In this paper, the primary goal is to improve the accuracy of original CNN by using SA, DE, and HS algorithm. This can be performed by minimizing the classification task error tested on the MNIST dataset. Some of the example images for MNIST dataset are shown in Figure 2.
In CNNSA experiment, the size of neighborhood was set = 10 and maximum of iteration (maxit) = 10. In CNNDE, the population size = 10 and maxit = 10. In CNNHS, the harmony memory size = 10 and maxit = 10. Since it is difficult to make sure of the control of parameter, in all of the experiment the values of = 0.5 for SA, F = 0.8 and CR = 0.3 for DE, and HMCR = 0.8 and PAR = 0.3 for HS. We also set the parameter of CNN, that is, the learning rate ( = 1) and the batch size (100).
As for the epoch parameter, the number of epochs is 1 to 10 for every experiment. All of the experiment was implemented in MATLAB-R2011a, on a personal computer with processor Intel Core i7-4500u, 8 GB RAM running memory, Windows 10, with five separate runtimes. The original program of this simulation is DeepLearn Toolbox from Palm [23].
All of the experiment results of the proposed methods are compared with the experiment result from the original CNN. These results for the design of i-6c-2s-12c-2s are summarized in Table 1 for accuracy, Table 2 for the computational time, and Figure 3 for error and its standard deviation as well as Figure 4 for computational time and its standard deviation. The results for the design of i-8c-2s-16c-2s are summarized in Table 3 for accuracy, Table 4 for the computational time, and Figure 5 for error and its standard deviation as well Computational Intelligence and Neuroscience 7     In addition, we also test our proposed method with CIFAR10 (Canadian Institute for Advanced Research) dataset. This dataset consists of 60,000 color images, in which the size of every image is 32 × 32. There are five batches for training, composed of 50,000 images, and one batch of test images consists of 10,000 images. The CIFAR10 dataset is divided into ten classes, where each class has 6,000 images. Some example images of this dataset are showed in Figure 8.
The experiment of CIFAR10 dataset was conducted in MATLAB-R2014a. We use the number of epochs 1 to 15 for this experiment. The original program is MatConvNet from [24]. In this paper, the program was modified with SA algorithm. The results can be seen in Figure 9 for objective, Figure 10 for top-1 error, and Figure 11 for a top-5 error, Table 5 for Comparison of CNN and CNNSA for train as well as Table 6 for Comparison of CNN and CNNSA for validation. In general, these results show that CNNSA works better than original CNN for CIFAR10 dataset.

Conclusion
This paper shows that SA, DE, and HS algorithms improve the accuracy of the CNN. Although there is an increase in computation time, error of the proposed method is smaller than the original CNN for all variations of the epoch.
It is possible to validate the performance of this proposed method on other benchmark datasets such as ORL, INRIA, Hollywood II, and ImageNet. This strategy can also be developed for other metaheuristic algorithms such as ACO, PSO, and BCO to optimize CNN.
For the future study, metaheuristic algorithms applied to the other DL methods need to be explored, such as the