Genetic Pattern Search and Its Application to Brain Image Classification

A novel global optimization method, based on the combination of genetic algorithm (GA) and generalized pattern search (PS) algorithm, is proposed to find global minimal points more effectively and rapidly.The idea lies in the facts that GA tends to be quite good at finding generally good global solutions, but quite inefficient in finding the last few mutations for the absolute optimum, and that PS is quite efficient in finding absolute optimum in a limited region.The novel algorithm, named as genetic pattern search (GPS), employs the GA as the search method at every step of PS. Experiments on five different classical benchmark functions (consisting of Hump, Powell, Rosenbrock, Schaffer, and Woods) demonstrate that the proposed GPS is superior to improved GA and improved PS with respect to success rate. We applied the GPS to the classification of normal and abnormal structural brain MRI images. The results indicate that GPS exceeds BP, MBP, IGA, and IPS in terms of classification accuracy. This suggests that GPS is an effective and viable global optimization method and can be applied to brain MRI classification.


Introduction
The evolutionary computation community has shown for many years significant interest in optimization problems, in particular in the global optimization of numerical, realvalued "black-box" problems, for which exact and analytical methods are not productive.Genetic algorithm (GA) [1], generalized pattern search (PS) [2], particle swarm optimization (PSO) [3], firefly algorithm [4], and differential evolution (DE) [5] are among the most recent developments.These techniques have shown great promise in several real-world applications.
In most cases, an optimization problem is divided into two phases: the coarse search and the fine search [6].GA is well suited for a swift and global exploration of a large search space to optimize an objective function and to target the region near the optimum point quickly [7].However, GA may run with difficulty in the immediate vicinity of the optimum point [8].Conversely, PS is a nonrandom method to search for minima of a function that is not differentiable, stochastic, or even continuous without requiring the gradient information [9].It performs especially well in local search, but it is sensitive for the randomly or manually input initial values, and requires a high degree of expertise by the user [10].
The complementary advantages of GA and PS motivated our strategy in this paper that combines both GA and PS to produce a new algorithm referred to as genetic pattern search (GPS).We evaluated the proposed method by five different classical benchmark functions (consisting of Hump, Powell, Rosenbrock, Schaffer, and Woods) and applied it to the classification of normal and abnormal brain MRI images.
The structure of this paper was organized as follows.Section 2 gave a detailed introduction to GA and PS, respectively.Section 3 outlined the structure and flow of the proposed GPS algorithm.Experiments in Section 4 compared GPS with improved GA and improved PS on five test functions.Section 5 applied the GPS to structural brain image classification.Finally Section 6 concluded this paper.

Background
2.1.Principles of GA.GAs are powerful stochastic search techniques based on the processes of natural selection [11].These techniques perform heuristic search that mimics the process of natural evolution, such as inheritance, mutation, selection, and crossover.This heuristic is routinely used to generate useful solutions to optimization and search problems [12].The principles of the GAs can be described as follows.
The trial solutions of GAs are encoded in the form of strings.Each string is associated with an objective function that represents the degree of the fitness of the string.A collection of such strings is called a population.A random population with a few strings with higher fitness is initially created to represent different points in the search space.Each of these strings is assigned a number of copies that go into the mating pool, based on the principle of survival of the fittest [13].Crossover and mutation operators are applied on these strings.The processes of selection, crossover, and mutation continue until either a fixed number of generations or a termination condition is reached [14].The above procedures of GA can be realized by the following pseudocode.
(1) Choose the initial population of individuals.
(2) Evaluate the fitness of each individual in that population.
(3) Repeat on this generation until termination criteria are met.GAs differ from classical optimization techniques such as the gradient-based algorithm in the following three ways: (1) GAs work on a population of points instead of a single point; (2) GAs use only the values of the objective function not their derivatives or other auxiliary knowledge; (3) GAs use probabilistic transition functions and not deterministic ones [15].

Introduction of PS.
PS is a method that updates current iterate by sampling the objective function to find a decrease at a finite number of points along a suitable set of search directions [16].Suppose that;  denotes the objective function, starting from an initial guess  0 and initial step length Δ 0 , the PS generates a sequence of iterates such that ( +1 ) ≤ (  ).Each iteration consists of a "search" step and a "poll" step.The search step generates a finite number of trial points on a mesh   , which is centered at   and defined by a finite set of patterns Γ, which positively span the solution space.The mesh is given by ( The poll step polls the points in the current mesh to find a better one.The polling can be either complete, meaning that all points are polled, or incomplete, namely, the algorithm stops polling as soon as it finds a point whose objective function value is less than the current point.The complete poll performs better but consumes more time.The incomplete poll finds the local optima [17]. If the poll step finds an improved point, then  +1 equals to the new point, and update the step length by multiplying Δ  by expansion factor   (>1) such that Δ +1 > Δ  because current pattern is a suitable set of poll directions.Otherwise, the poll step cannot find an improved point; then  +1 =   , and the step length is reduced by multiplying Δ  by contraction factor   (<1) such that Δ +1 < Δ  .The aforementioned description can be summarized as follows.
(1) Initialization: generate patterns Γ and initialize the step length Δ 0 .(2) Repeat on this generation until termination criteria are met: (A) generate new mesh points {  }; (B) poll the mesh points.If successful, expand the mesh; otherwise contract the mesh.

Genetic Pattern Search
The proposed genetic pattern search (GPS) combines both GA and PS algorithms.In 2011, Zhang et al. have proposed a combination based on GA and PS [18]; however, in their method the combination algorithm was done by a first stage using a GA followed by a second stage that uses PS taking as an input the output of the GA.The method has a serious drawback: the PS could start when the GA has not reached the neighborhood of the global optimum yet.In that way, we would not take full advantage of the GA.This drawback will become more critical when the problem to be solved becomes more complex.Therefore, in this paper, we take the GA as the search method at every step of PS.In this way, the coarse and fine search would be made in each epoch, and it could speed the convergence up as shown in Figure 1.The pseudocodes of GPS are described as follows.
Step 3. GA Search with initial population designed as {  ()}.The mesh points are updated as the final GA results {   ()}.
Step 5 (  =  + 1).Check whether the termination conditions are satisfied.If yes, output x k , otherwise jump to Step 2.
Complete poll

Experiments
The test and evaluation were performed with an IBM P4 with 2 GHz processor and 1 GB memory.The program is in-house developed by MATLAB 2010b.The readers can reproduce the work at any laptop installing MATLAB.

Test Functions.
We used five classical benchmark functions to evaluate the performance of the GPS algorithm.Among the five benchmark functions, Hump, Rosenbrock, and Schaffer functions are two-dimensional and Powell and Woods functions are four-dimensional.Formulations, global optimal points, and fitness values of those functions are listed in Table 1.
It should be noted that the constant terms are added (2 in Hump function and 1.5 in Schaffer function) to make sure that the range of fitness value is always above zero so that the -axis can be plotted in logarithmic scale.Furthermore, the initial points of each test function were selected far from the optimal point deliberately to test the robustness of the algorithms.Figure 2 shows the surface plot and contour lines of Hump, Rosenbrock, and Schaffer functions, respectively.The Powell and Woods functions are not displayed due to their high-dimensional property.

Parameter
Setting.An improved GA (IGA) [19] and an improved PS (IPS) [20] were chosen for comparison.The parameters of IGA and IPS were obtained by numerous experiments, and the parameters corresponding to the best results are selected and listed in the first and second rows in Table 2. Besides, IGA and IPS are both evolutionary computations, which features in robustness of results with variation of parameters.Coelho et al. [21] and Ghanbari and Mahdavi-Amiri [22] have already proved that the results of evolutionary algorithms are sensitive to neither initial values nor parameter values.Therefore, the results of both IGA and IPS will exhibit little variation even if their corresponding parameters are changed far from the values determined in this experiment.The parameters of GPS were set by combining the parameters of IGA and IPS.
The initial population of IGA was set randomly distributed in the whole search space.For the deterministic algorithm IPS, we determine the initial point changed randomly in each run.In this experiment, we set the termination criteria as in the last row of Table 2 by trial-and-error method.

Success Rate Comparison.
The success rate is the rate at which the algorithm converges to the nearly absolute global optimal point.Suppose that  * denotes the global optimal point, and  denotes the point found in a run.A more accurate definition of success rate is shown as follows: The above formula means that the found solution  is treated identical to the global optima  * if the norm of their  differences is less than 0.1% of the maximum between 1 and the norm of  * .The reason why we do not use the rigid equation  =  * is that finding the accurate global optimal is impossible.The reason lies in the following 3 points: (I) there are always round-up errors during the computation; (II) the word length of the computer is limited; (III) 0.1% relative error is enough in most of industrial applications.
After 100 runs of each algorithm, the success rate was calculated and listed in Table 3.The data here is not the same from [18] because the parameters are different: (I) here we set the maximum fitness evaluation as infinity while in [18] it is only 2 × 10 4 ; (II) here the GPS is done by adding GA at each step of PS, while the algorithm is done by first GA followed by PS in [18]; (III) here the initialization of IPS is randomly   created at each run, while the initial values of IPS are fixed in [18].For all functions, the GPS performs best with high enough success rates.For Hump, Powell, and Rosenbrock functions, the IPS is better than IGA.For Schaffer and Woods functions, the IGA is better than IPS.In total, the GPS is the most robust algorithm among the three.

Application
As an application, we applied the GPS algorithm to the weights optimization of forward neural network (FNN), which is used as a classifier of structural MRI images between normal and abnormal brains.

Method.
The strategy is shown in Figure 3. First, we extract features via discrete wavelet transform (DWT), which has already proved to be an effective strategy for clinical diagnosis [23][24][25].Second, the wavelet domain features are reduced via principle component analysis (PCA).Third, we use stratified -fold cross-validation to prevent overfitting of the following classifier.Fourth, the reduced features are sent to the FNN.Fifth, the GPS and other training algorithms are employed to train the FNN.Finally, the classification accuracies of FNNs trained by different algorithms are compared.The following paragraphs will discuss the procedures in detail.
The DWT is a powerful implementation of the wavelet transform using the dyadic scales and positions [26].In this study, since the brain images are 2D data, the DWT is applied to horizontal and vertical dimensions separately.As a result, there are 4 subband images at each scale.The subband  +1 is used for next 2D DWT.As the level of decomposition increased, a more compact but coarser approximation component was obtained.Thus, wavelets provide a simple hierarchical framework for interpreting the image information.
PCA is an efficient tool to reduce the dimension of a data set consisting of a large number of interrelated variables while retaining most of the variations [27].The PCA describes the space of the original data projecting onto the space in a base of eigenvectors.The corresponding eigenvalues account for the energy of the process in the eigenvector directions.It is assumed that most of the information in the observation vectors is contained in the subspace spanned by the first several PCs.
Cross-validation methods consist of three types: random subsampling, -fold cross-validation, and leave-one-out validation.The -fold cross-validation is applied due to its properties as being simple to learn, easy to realize, and using all data for training and validation.The mechanism is to create a -fold partition of the whole dataset, repeat  times to use  − 1 folds for training and a left fold for validation, and finally average the error rates of  experiments.The  folds can be randomly partitioned.However, some folds may have quite different distributions from other folds.Therefore, stratified -fold cross-validation was employed, with which every fold has nearly the same class distributions [28].In this study, we empirically determined  as 5 through the trialand-error method.
FNNs are widely used in pattern classification since they do not need any information about the probability distribution and the a priori probabilities of different classes [29].A single-hidden-layer backpropagation (BP) neural network is adopted with sigmoid neurons in the hidden layer and linear neuron in the output layer.The training vectors were presented to the FNN, which is trained in batch mode.The network configuration is supposed as   ×   ×   , that is, a two-layer network with   input neurons,   neurons in the hidden layer, and   output indicating the brain is normal or abnormal.Assume that  1 and  2 represent the connection weight matrix between the input layer and hidden layer, between the hidden layer and the output layer, respectively, the outputs of all neurons in the hidden layer are calculated by Here,   denotes the th input value,   denotes the th output of the hidden layer, and   refers to the activation function of hidden layer, usually a sigmoid function.The outputs of all neurons in the output layer are given as follows: Here,   denotes the activation function of output layer, usually a line function.All weights are assigned to random values initially and are modified by the delta rule according to the learning samples.The error is expressed as the mean squared error (MSE) of the difference between output and target values: where   represents the th value of the authentic values, which are already known to users, and   represents the number of samples.Suppose that there are   samples, the fitness value is written as where  represents the vectorization of the ( 1 ,  2 ).Our goal is to minimize this fitness function (), namely, to force the output values of each sample approximate to corresponding target values.At this point, we use our proposed method GPS to optimize formula (6), compared with other algorithms including BP [30], momentum BP (MBP) [31], IGA, and IPS.

Simulations and Results
. The datasets consist of T2weighted MR brain images in axial plane and 256 × 256 in-plane resolution, which were downloaded from the Harvard Medical School website (http://www.med.harvard.edu/AANLIB/).We randomly selected 80 images consisting of 40 normal and 40 abnormal.The abnormal brain MR images consist of the following diseases: glioma, meningioma, Alzheimer's disease, Alzheimer's disease plus visual agnosia, Pick's disease, sarcoma, and Huntington's disease.A sample of each is shown in Figure 4. Figure 5 give the three-level decomposition of 2D DWT decomposition result on a normal brain image.The upperleft corner in Figure 5(b) shows the approximate coefficients serving as the reduced features.The dimension of original image is 256 × 256 = 65536, while the dimension of approximation image is only 32 × 32 = 1024.Although the dimensions of extracted features are reduced from 65536 to 1024, this is still a high computation cost.Thus, PCA is used to further reduce the dimensions of features on another level.The curve of cumulative sum of variance with number of principle components is shown in Figure 6.It shows that 19 principle components, which are only 1.855% out of the 1024 features, preserve 95.4% of the total variance.
Those 19 principal components are submitted to the FNN.The proposed GPS method is used to optimize the weights/biases of FNN.Meanwhile, other five methods are also utilized to make a comparison.In order to reduce the randomness, each algorithm was performed 20 times.The averaged classification accuracy of each algorithm is shown in Table 4.
Table 4 indicates that the strategy based on the proposed GPS algorithm obtained higher classification accuracy (95.188%).The IPS and IGA perform nearly the same with classification accuracies about 90%.The MBP and BP algorithms performed the worst with accuracies around 60%.Here, we do not divide the dataset into training and test subsets because we have already employed the -fold crossvalidation method to avoid overfitting.

Conclusions and Discussions
Our contributions can be summarized in the following three aspects.First, we proposed a novel algorithm-GPS, for efficient and rapid global search of minimum points.It improved the robustness of pattern search and improved the convergence speed of genetic algorithm.Second, the test and evaluation on five benchmark functions further demonstrate that the proposed GPS is the most robust among the three algorithms.Third, as an example of application, we employed the GPS to the classification of normal and abnormal brain  The proposed GPS and the one in [18] are based on the combination of GA and PS.They both make use of the coarse-searching ability of GA and fine-searching ability of PS.Nevertheless, the combining principles are distinct.A question is raised as "which one is better." Here, we give an assumption that GPS in [18] has a drawback that the PS could start when the GA has not reached the neighborhood of the global optimum yet.GPS in this paper takes the GA as the search method at every step of PS, so it will be more stable but cost more time.
In the future, we will develop simulation experiments to address the question, comparing them in convergence rate, success rate, and computation cost.We will also cooperate with mathematicians to find a theoretical solution.The future work also includes applying GPS to other academic and industrial fields.

(
A) Select the best-fit individuals for reproduction.(B) Breed new individuals through crossover and mutation operations to give birth to offspring.(C) Evaluate the individual fitness of new individuals.(D) Replace the least-fit population with new individuals.

Figure 6 :
Figure 6: Variances against number of principle components (axis is log scale).

Table 1 :
Information of five benchmark functions.