An Efficient Hybrid Optimization Approach Using Adaptive Elitist Differential Evolution and Spherical Quadratic Steepest Descent and Its Application for Clustering

Division of Computational Mathematics and Engineering, Institute for Computational Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam; Faculty of Mathematics and Statistics, Ton Duc Thang University, Ho Chi Minh City, Vietnam; Faculty of Civil Engineering, Ton Duc Thang University, Ho Chi Minh City, Vietnam; Faculty of Information Technology, Industrial University of Ho Chi Minh City, Ho Chi Minh City, Vietnam; Faculty of Mechanical Engineering, Industrial University of Ho Chi Minh City, Ho Chi Minh City, Vietnam


Introduction
Optimization has been widely applied in different fields such as economics, finance, and engineering. Although many optimization algorithms have been developed in various ways, they can be divided into two major categories: population-based algorithms and gradient-based searching algorithms.
Population-based algorithms, including evolutionary algorithms and swarm-based algorithms, are global searching techniques. Evolutionary algorithms [1][2][3][4][5][6][7][8] are inspired by the biological processes that allow populations to adapt to their surroundings: genetic inheritance and survival of the best chromosomes. Swarm-based algorithms [9][10][11][12][13][14][15][16], which mimic the social behaviors of insects and animals, can solve optimization problems as well. Among popular evolutionary algorithms, the differential evolution (DE) algorithm, first introduced by Storn and Price [8], has been used in many practical problems and has demonstrated good convergence properties. In DE, individual solutions are selected from a population of solutions according to their fitness value to generate new offspring using operators such as crossover and mutation. Nevertheless, similar to many other population-based optimization algorithms, DE is still costly when approximating the global optimal solution. To overcome this drawback, the adaptive elitist differential evolution (aeDE) [17], in which two modifications are implemented, was proposed. Firstly, an adaptive technique based on the variations between the best objective function value and the other objective function values in the current generation is used to choose a suitable mutation operator. The purpose of this modification is to preserve the balance between the global and local searching abilities of the DE. Secondly, an elitist selection technique is utilized to speed up the convergence. The aeDE with those modifications is more efficient than the DE and is currently considered one of the state-of-the-art population-based algorithms. However, the aeDE still inherits the limitations of population-based techniques. In the later steps, when the current best solution is nearly obtained, the aeDE still utilizes the crossover and mutation operators, which leads to an unexpected additional number of function evaluations (FES) but cannot ensure the improvement of the objective function through iterations. Hence, it not only lacks locally effective exploitative behavior but also increases the computational cost.
On the other hand, gradient-based techniques compute only a single solution at any time and move that solution to a better one through iterations based on gradient information. In comparison with population-based techniques, gradient-based techniques have advantages in terms of locally exploitative behavior and computational cost, but they usually return solutions that get stuck at local extrema.
To avoid the disadvantages of both population-based and gradient-based algorithms, this paper proposes a hybrid approach that combines them. Specifically, the state-of-the-art population-based algorithm, aeDE, is performed in the initial steps to explore the global search space. In the later steps, when the aeDE has nearly converged to a critical value, spherical quadratic steepest descent (SQSD) [18], a gradient-based method, is utilized to obtain a fast and accurate optimal result. The reason for choosing SQSD instead of other gradient-based methods is its reliability and stability when solving extremely ill-conditioned problems [19]. The proposed algorithm is illustrated in detail through a bivariate function and compared with existing optimization algorithms on 32 benchmark numerical optimization functions. Finally, it is applied to clustering, which is an interesting problem in statistics, machine learning, and data mining. The remainder of this paper is organized as follows. The following section presents the aeDE, the SQSD, and the proposed algorithm. Section 3 evaluates the performance of the proposed method via 32 benchmark functions. In Section 4, the clustering method based on HaeDE and its performance are presented and evaluated. After discussing the advantages and disadvantages as well as future research directions in Section 5, the conclusion is given in Section 6.

Materials and Methods
This section reviews the aeDE and SQSD algorithms and then presents the proposed hybrid HaeDE algorithm. The details are presented as follows.

aeDE Algorithm.
To clarify the notation used throughout this article, we refer to the minimization of an objective function f(x), where x is a vector of N variables in the decision space D = [x^l, x^u]. The aeDE seeks the optimal solution through generations (iterations). In each generation, the aeDE evolves a population, which is a set of NP feasible solutions, P = {x_1, x_2, ..., x_NP}. Each element of this set, i.e., each feasible solution x_i, i = 1, ..., NP, is called a chromosome, which is a vector of N variables, so-called genes.
The four major phases of the aeDE algorithm, which include initialization, mutation, crossover, and selection, are briefly summarized below.

Initialization.
The initialization phase of the aeDE is similar to that of the original DE, in which an initial population of NP individuals is generated through a random sampling technique. Specifically, each individual is represented as a chromosome containing N genes and is created by

x_ij = x_j^l + rand[0, 1] × (x_j^u − x_j^l),  i = 1, ..., NP, j = 1, ..., N,

where x_j^l and x_j^u are, respectively, the lower and upper bounds of x_j, rand[0, 1] is a real number drawn from the uniform distribution on [0, 1], and NP is the population size.
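The random-sampling initialization above can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' code; the function name and the use of a seeded generator are assumptions):

```python
import numpy as np

def initialize_population(NP, N, x_lower, x_upper, rng=None):
    """Random-sampling initialization as described for the DE/aeDE:
    x_ij = x_j^l + rand[0,1] * (x_j^u - x_j^l)."""
    rng = np.random.default_rng(rng)
    x_lower = np.asarray(x_lower, dtype=float)
    x_upper = np.asarray(x_upper, dtype=float)
    # One uniform draw per gene; rows are chromosomes, columns are genes.
    return x_lower + rng.random((NP, N)) * (x_upper - x_lower)
```

Each row of the returned array is one chromosome of N genes, uniformly distributed within the per-dimension bounds.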

Mutation.
In the original DE, a mutant vector v_i is generated from individuals x_i in the population through mutation operations. Some mutation operations regularly used in the DE are listed as follows.
(i) rand/1: v_i = x_r1 + F × (x_r2 − x_r3)
(ii) rand/2: v_i = x_r1 + F × (x_r2 − x_r3) + F × (x_r4 − x_r5)
(iii) current to best/1: v_i = x_i + F × (x_best − x_i) + F × (x_r1 − x_r2)
where r_1, r_2, r_3, r_4, r_5 are mutually exclusive integers randomly selected from {1, 2, ..., NP}, F is the scale factor randomly chosen within [0, 2], and x_best is the best individual in the current population.
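The two operators that the aeDE actually alternates between, "rand/1" and "current to best/1", can be sketched as follows (a NumPy sketch under the standard DE formulas; function names are illustrative):

```python
import numpy as np

def mutate_rand_1(pop, i, F, rng):
    """'rand/1': v_i = x_r1 + F * (x_r2 - x_r3), with r1, r2, r3
    mutually distinct and different from i."""
    NP = len(pop)
    r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], size=3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])

def mutate_current_to_best_1(pop, i, best_idx, F, rng):
    """'current to best/1': v_i = x_i + F*(x_best - x_i) + F*(x_r1 - x_r2)."""
    NP = len(pop)
    r1, r2 = rng.choice([j for j in range(NP) if j != i], size=2, replace=False)
    return pop[i] + F * (pop[best_idx] - pop[i]) + F * (pop[r1] - pop[r2])
```

With F = 0, "rand/1" degenerates to copying a random existing individual, which is a convenient sanity check.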
In the aeDE, a new adaptive mutation scheme for the mutation phase of the DE is proposed. In this scheme, two mutation operators, "rand/1" and "current to best/1", are utilized. The "rand/1" operator aims to ensure the diversity of the population and prevent it from getting stuck in a local optimum, while the "current to best/1" operator aims to accelerate the convergence speed of the population by guiding it toward the best individual. These two mutation operators are adaptively chosen based on delta, the deviation modulus between the objective function value of the best individual and the objective function values of the entire population in the previous generation. The new mutation scheme is described as follows.

IF delta > threshold THEN
    Generate the mutant vector v_i using "rand/1"
ELSE
    Generate the mutant vector v_i using "current to best/1"
ENDIF

In the above pseudocode, delta is defined by delta = |f_mean − f_best|, where f_best is the objective function value of the best individual and f_mean is the mean objective function value of the whole population; the choice of the threshold is presented in Section 3.
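The adaptive choice can be condensed into a small helper (a hedged sketch of the scheme described above: a large delta means the population is still spread out, so exploration continues; a small delta means it has nearly converged, so exploitation toward the best individual takes over):

```python
import numpy as np

def choose_mutation(fitness, threshold):
    """Adaptive operator choice of the aeDE based on
    delta = |f_mean - f_best| (minimization assumed)."""
    delta = abs(np.mean(fitness) - np.min(fitness))
    # Large delta: population still diverse -> keep exploring with "rand/1".
    # Small delta: population nearly converged -> exploit with "current to best/1".
    return "rand/1" if delta > threshold else "current to best/1"
```

With the paper's threshold ε_m = 10^−2, a population whose mean fitness is far from the best fitness keeps using "rand/1".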

Crossover.
After completing mutation, each target vector x_i produces a trial vector u_i by substituting some components of x_i with components of the mutant vector v_i through the following binomial crossover operation:

u_ij = v_ij if rand[0, 1] ≤ CR or j = j_rand; otherwise u_ij = x_ij,

where CR ∈ [0, 1] is the crossover control parameter and j_rand is an index randomly chosen from {1, ..., N} to guarantee that the trial vector differs from the target vector in at least one component.
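A sketch of the binomial crossover (illustrative NumPy code under the standard DE formulation; the function name is an assumption):

```python
import numpy as np

def binomial_crossover(x_i, v_i, CR, rng):
    """u_ij = v_ij if rand[0,1] <= CR or j == j_rand, else x_ij.
    The forced index j_rand guarantees that the trial vector differs
    from the target vector in at least one gene."""
    N = len(x_i)
    j_rand = rng.integers(N)
    mask = rng.random(N) <= CR
    mask[j_rand] = True
    return np.where(mask, v_i, x_i)
```

As noted in the parameter discussion later, CR → 1 makes the trial vector tend toward the mutant vector and CR → 0 toward the target vector.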

Selection.
In the selection process of the classical DE, each trial vector u_i created after the crossover phase is evaluated and compared with the target vector x_i to choose the better individual for the next generation. In the selection process of the aeDE algorithm, the elitist selection technique introduced by Padhye et al. [20] is utilized instead of the basic selection of the classical DE.
In this new mechanism, the NP best individuals are chosen from the combined set of NP trial vectors u_i and NP parent vectors x_i. In this way, the current best individual of the whole population is always retained for the next generation, yielding a better convergence rate in comparison with the classical DE.
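The elitist selection above, pooling parents and trials and keeping the NP best, can be sketched as (a minimal sketch for a minimization objective; not the implementation of [20]):

```python
import numpy as np

def elitist_selection(parents, trials, f):
    """Elitist selection: pool the NP parents with the NP trial vectors
    and keep the NP best by objective value (minimization)."""
    pool = np.vstack([parents, trials])
    fitness = np.array([f(x) for x in pool])
    best = np.argsort(fitness)[: len(parents)]
    return pool[best], fitness[best]
```

Because selection ranks the union of both generations, the best individual found so far can never be lost, unlike the pairwise comparison of the classical DE.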

SQSD Algorithm.
The SQSD algorithm is briefly summarized in the following pseudocode (Algorithm 1).
In the above algorithm, the step limit d and the test of whether ‖x_k − x_k−1‖ > d are used to control the step size between x_k−1 and x_k in each iteration. Generally, a small step size can avoid oscillations and guarantee convergence but leads to slow convergence.
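A sketch of SQSD, based on the description in [18]: each iteration fits a spherical quadratic approximation with curvature c_k chosen so that the approximation interpolates the objective at the previous iterate, and moves to the approximation's minimizer x_{k+1} = x_k − ∇f(x_k)/c_k, clipped to the step limit d. The curvature safeguard and the first steepest-descent step of length d are assumptions of this sketch, not the authors' exact implementation:

```python
import numpy as np

def sqsd(f, grad, x0, d, eps_g=1e-6, eps_x=1e-8, max_iter=10_000):
    """Sketch of spherical quadratic steepest descent (SQSD)."""
    x_prev = None
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps_g:          # gradient convergence test
            break
        if x_prev is None:
            step = -d * g / np.linalg.norm(g)  # first move: steepest descent of length d
        else:
            s = x_prev - x
            # Curvature so the spherical quadratic interpolates f at x_{k-1}.
            c = 2.0 * (f(x_prev) - f(x) - g @ s) / (s @ s)
            c = max(c, 1e-12)                  # safeguard against non-positive curvature
            step = -g / c
            if np.linalg.norm(step) > d:       # enforce the step limit
                step *= d / np.linalg.norm(step)
        x_prev, x = x, x + step
        if np.linalg.norm(x - x_prev) < eps_x:  # step-size convergence test
            break
    return x
```

On a quadratic objective the interpolated curvature recovers the true curvature, consistent with the convergence property of SQSD on quadratic functions mentioned later in this section.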

The Proposed Algorithm.
As mentioned earlier, both aeDE and SQSD have their own advantages and disadvantages; therefore, we propose a hybrid approach, called HaeDE, that combines their advantages while avoiding their disadvantages.
The proposed algorithm is presented in the following pseudocode (Algorithm 2) and Figure 1.
In the above algorithm, NP is the population size. In the case of low dimensions, according to [17], when NP > 15 the deviation of the optimal value is not significant, but the run time is significantly proportional to NP. In the case of very high dimensions, a popular method is to adapt NP to the number of dimensions N. There already exist proposals that scale NP with N, such as NP = 10N [1, 21-23], but most of them suffer from premature convergence and high computational cost. Some studies deal with these problems by separating the NP individuals into islands to reduce the computational cost [24, 25]. Another approach is to use a low NP with center-based initialization, which is mathematically proved to increase the probability of finding the global optimum, especially when N > 30 [21, 26]. In this paper, because most functions have N < 30, the approach of [17] is applied. Therefore, we choose NP = 20 to balance the computational cost and the quality of the solution. For the scale factor F, in general, a small value of F cannot explore the search space effectively and, as a result, cannot reach the optimal solution by the completion of the algorithm. In contrast, a high value of F results in occasional overshooting; hence, the algorithm has weak exploitation behavior for reaching the global optimum in the later steps. For CR, the trial vector tends toward the mutant vector when CR → 1 and toward the target vector when CR → 0.
Although the original version of DE uses fixed values of CR and F, it can be claimed that proper choices of F and CR depend not only on the problem but also on the stage of the optimization process; hence, they should vary over time. Varied control parameters F and CR have demonstrated successful performance in solving a variety of large-scale multivariate problems [17, 27-30]. Based on the above discussions, and for the sake of comparison with the aeDE, in this paper the values of the scale factor F and the crossover control parameter CR are set at each iteration as specified in the aeDE [17]. In particular, we generate the F and CR values under uniform distributions on [0.4, 1.0] and [0.7, 1.0], respectively. The threshold, ε_m, is chosen based on the tolerance ε_a and must be greater than the tolerance. It affects the aeDE solutions as well as the initial solution of SQSD. The larger the threshold is, the faster the convergence and the smaller the number of function evaluations of the aeDE, and vice versa. Normally, ε_a = 10^−6 is chosen; however, in the proposed method, the aeDE needs to stop when ε_a > 10^−6 so that SQSD can be utilized in the later steps. Therefore, we set ε_a = 10^−5 and ε_m = 10^−2 in the numerical examples. For the SQSD, three parameters need to be set up: the convergence tolerances ε_g and ε_x, and the step limit d. As mentioned earlier, the step limit d is used to keep the step size smaller than d. If d is too small, the algorithm converges slowly, particularly in high dimensions. In contrast, a too large value of d may result in excessive oscillations before convergence, particularly in low dimensions. Therefore, in the case of two dimensions, a relatively small value, for instance, d = 0.3, is required. In the case of high dimensions, the step limit d hinges on the number of dimensions. Specifically, in the case of quadratic functions, no step limit d is required because it is proved that the SQSD algorithm (without step size control) always converges when applied to a general quadratic function. This characteristic is very useful for optimizing quadratic function problems with very high dimensions, for instance, N = 5000, as presented in [18]. In this paper, because the number of dimensions varies from 2 to 30, the step limit d should hinge on the number of dimensions. Furthermore, in the case of HaeDE, to ensure exploitation in the later steps, the variation between x_k−1 and x_k through iterations needs to be smaller than in the aeDE variants. Hence, we choose the step limit

d = ‖x_best − x_worst‖ / (100 √n), where x_best and x_worst are the best and worst individuals in the last step of the aeDE and n is the number of dimensions of the function, respectively. For the convergence tolerances, ε_g = 10^−6 and ε_x = 10^−8 are applied.

Experiments on Benchmark Functions
This section presents two examples to illustrate and test the performance of the proposed method. In particular, the first example describes the details of the proposed method when dealing with a well-known function, Bohachevsky1.
The purpose of this experiment with a simple bivariate function is to analyze the behavior of the proposed method. As a result, we can illustrate how the new method works and why it is better than the aeDE. In the second example, the performance of the proposed HaeDE algorithm is evaluated on 32 benchmark functions. Those functions are among the 50 functions first used in [31] and often utilized later to compare the performance of optimization algorithms [1, 32]. Because gradient-based methods can deal with unimodal functions that have only one optimum, it is unnecessary to utilize a population-based algorithm to solve those functions. Therefore, in the current research, the HaeDE performance is compared with those of the aeDE [17], particle swarm optimization (PSO) [11], Selfish Herd Optimizer (SHO) [9], Salp Swarm Algorithm (SSA) [14], and Dragonfly algorithm (DA) [13] using the 32 of the 50 functions that are multimodal. For each benchmark function, all methods are run 30 independent times with the same initial population each time. The obtained results are then compared using Wilcoxon's paired tests. Finally, the computational cost, especially the number of FES of all methods, is examined.

Experiment on Function Bohachevsky1
In this subsection, we perform an experiment on the function Bohachevsky1, a simple bivariate function, to analyze the proposed method in detail. In the first phase, we run the aeDE algorithm with delta = |f_mean − f_best| < 10^−5 to determine the best individual in the current population, x_0, which will be the initial point for the next phase. In the second phase, two cases are examined: in the first case the aeDE is kept, and in the second case the SQSD is utilized to seek the final optimal solution. The first case is hence exactly the original aeDE, while the second case, named HaeDE, is the algorithm proposed in this study. The performance of the aeDE and HaeDE is then measured using the corresponding best fitness value (FBEST) and the number of function evaluations (FES). In addition, the effect of the step limit d on the HaeDE performance is also examined. We need to find the minimum of the function Bohachevsky1, whose formulation, surface, and contour are presented as follows.
It can be seen from Figure 2 that Bohachevsky1 is a multimodal function with a large number of local solutions; as a result, a gradient-based method, which may easily be trapped in a local minimum, is unsuitable in this case. A more feasible strategy is to use an evolution-based algorithm, such as the aeDE. As evidenced by Figure 3, the aeDE (the blue line) has a good explorative manner: the FBEST rapidly decreases in the initial steps. However, in the later steps, when the number of iterations is about 30 or the number of function evaluations (FES) is about 600, the FBEST decreases slowly if the aeDE continues to be used (the red line). The primary reason is that the aeDE still utilizes the crossover and mutation operators, which leads to unexpected additional FES but cannot ensure the improvement of the fitness value through iterations. In contrast, if the SQSD algorithm is utilized in the later steps, the FBEST value quickly decreases at each step thanks to the gradient information (the green line). As a result, it not only produces a better result but also saves computational cost, taking only one FES per iteration.
In addition to the comparison with the aeDE, the convergence behavior of HaeDE itself with different step limit parameters is examined. It can be seen from Figure 4 that when d is not small enough, for instance d = 10^−4, the algorithm converges after one or two iterations but the FBEST is still far from the true optimal value of 0. On the contrary, the convergence speed of the HaeDE with a too small d, for instance d = 10^−7, falls behind the others. The step limit d = ‖x_best − x_worst‖ / (100 √n) proposed in this paper takes more than ten iterations to converge, but it helps the algorithm reach a feasible FBEST that approximates the true optimal value of 0. Moreover, the proposed step limit can adapt to problems with different numbers of dimensions and to the quality of the last population of the aeDE.

Experiment on 32 Benchmark
Functions. In this subsection, we compare the performance of the proposed method with those of five well-known algorithms, namely the aeDE, PSO, SHO, SSA, and DA, over 32 benchmark functions. The control parameter settings for the HaeDE are chosen as mentioned in Section 2. For the other methods, we set the parameters as follows. For each benchmark function, we run HaeDE, aeDE, PSO, SHO, SSA, and DA 30 independent times. To ensure fairness, the same initial population is used for all comparative methods. To determine whether HaeDE reaches a statistically better solution than the other methods, Wilcoxon's paired tests are examined. During the test, if the optimal value is below 10^−9, a very small positive number, it is considered to be zero. The benchmark functions and their characteristics are summarized in Table 1. The descriptive statistics of the comparative methods are presented in Table 2, where the first number is the mean of FBEST and the number in parentheses is the rank of the method in ascending order of FBEST. Over the 32 benchmark functions, it can be seen that HaeDE is the best, with the smallest total of ranks.
Although Table 2 provides a first insight into the performance of the algorithms, it is more reliable to compare the performance using a statistical test. For this purpose, with the null hypothesis "there is no difference between the two methods", the obtained results are tested using the Wilcoxon signed-rank test with a statistical significance value α = 0.1.
In Table 3, "+" indicates the case in which the null hypothesis is rejected and the HaeDE is better than the comparative method, "−" indicates the case in which the null hypothesis is rejected and the HaeDE is worse than the comparative method, and "≈" indicates the case in which we cannot reject the null hypothesis. According to the total count of (+/≈/−) presented in the last row of Table 3, it can be seen that the HaeDE outperforms aeDE, PSO, SHO, and DA, and it is competitive with SSA in terms of approximating the optimal value. Finally, we examine the convergence behaviors of the comparative methods. Figure 5 illustrates some results for large problems with 30 dimensions. It can be seen that the PSO, SHO, and DA have slow convergence in general. The SSA is competitive with HaeDE, as mentioned earlier, but easily gets stuck at a specific point and takes a large number of FES to move to a better individual. According to Figure 5, while the HaeDE quickly converges and stops, the convergence curve of SSA is nearly a straight line, i.e., the FBEST is nearly constant for a long time. As a result, it becomes the worst method in terms of convergence speed (see more in Figure 6). The aeDE is an improved version of DE, which utilizes the mutation operator "current to best/1" and the elitist selection technique to speed up convergence. Therefore, it is not surprising that the aeDE outperforms PSO, SHO, SSA, and DA and ranks second (the red line). As explained in the first example, the HaeDE (the green line) is again computationally inexpensive compared with the aeDE. Figures 6 and 7 illustrate an overview of the number of FES and the total CPU time consumed by all methods for all benchmark functions over 30 independent runs. Clearly, in most cases, the HaeDE is the best in terms of computational cost. The results obtained from the above experiments demonstrate the promising performance of the proposed method when solving optimization problems, especially continuously differentiable functions. In addition, we can estimate the gradient numerically, for instance by Euler approximation, if the function itself is nondifferentiable. Hence, the HaeDE can be utilized for further practical problems with discontinuous, nondifferentiable, and implicit objective functions.
The above discussion and the research results have shown that the HaeDE is a competitive optimization algorithm and can be utilized in further applications such as the clustering problem.

An Application of HaeDE for Clustering Analysis
Clustering is a data mining technique that partitions large unlabeled data into groups so that the elements in each group have similar properties. It is an important first step in many data analysis tasks.
Mathematically, let X = {x_1, x_2, ..., x_N} be a set of N elements given in R^n; we need to find an optimal way to partition them into k clusters U = {C_1, C_2, ..., C_k}, C_i ∩ C_j = ∅ for i ≠ j, so that the elements belonging to the same cluster are as similar as possible in terms of a given internal validity measure m(U). This implies that the clustering problem can be transformed into an optimization problem in which m(U) needs to be maximized or minimized. The formulation of the optimization problem for clustering in this paper is presented in Section 4.1.

The Formulation of the Optimization Problem for Clustering.
In this formulation, each individual in the population is a chromosome that represents the positions of the cluster centers.
Because each center is an n-dimensional vector and we need to identify k clusters, i.e., k cluster centers, each chromosome is a string of total length kn. The general form of each chromosome is represented in Figure 8.
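The kn-length chromosome encoding can be sketched as follows (an illustrative NumPy sketch of the representation described above; function names are assumptions):

```python
import numpy as np

def decode_chromosome(chrom, k, n):
    """A chromosome of length k*n is reshaped into k cluster centers in R^n."""
    return np.asarray(chrom, dtype=float).reshape(k, n)

def assign_clusters(X, centers):
    """Assign each element to its nearest center (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)
```

For the Iris experiment later in this section (k = 3 clusters, n = 2 predictors), each chromosome therefore has kn = 6 genes.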

Objective Function.
As mentioned before, the clustering quality is usually evaluated via an internal validity measure m(U) such as the Intra index [42], the Xie-Beni index [43], the Dunn index [44], the Davies-Bouldin index [45], the Silhouette coefficient [46], etc. Here, we choose the Intra index, the internal validity measure used in the well-known k-means, as the objective function.
The Intra index m(U) is defined as follows:

m(U) = (1/N) Σ_{i=1}^{k} Σ_{x ∈ C_i} d(x, v_i)^2,

where C_i stands for cluster i of the partition to be evaluated, v_i is the center of cluster i, and d is the Euclidean distance. From the above expression, it can be seen that the value of m(U) becomes smaller when the elements in a cluster are more similar to the cluster center. Hence, minimizing the Intra index optimizes the compactness of the established clusters, which leads to a suitable partition. After identifying the objective function and the chromosome representation, we can utilize the HaeDE to solve the clustering problem. The performance of the HaeDE is tested using the Iris flower dataset, a well-known benchmark dataset [47].
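The Intra index objective can be sketched as follows (a hedged NumPy reading of the definition above: each element is scored against its nearest center, since elements are assigned to the cluster of their nearest center):

```python
import numpy as np

def intra_index(X, centers):
    """Intra index used as the clustering objective: the mean squared
    Euclidean distance of each element to its nearest cluster center."""
    # Squared distances from every element (rows) to every center (columns).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()
```

This function, evaluated on the candidate centers decoded from a chromosome, is what the HaeDE minimizes in the clustering experiment.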
The dataset consists of 150 samples in 3 classes (clusters), named Iris setosa, Iris versicolor, and Iris virginica, with 4 independent variables representing the length and width of the sepals and petals. To visualize the clustering results, only the petal length and width are used as predictors; therefore, the clustering problem is now a minimization problem with six variables in the range [0, 1], and the objective function is the Intra index mentioned earlier. In this case, the parameter settings are similar to Section 3 for all methods. Because the actual classes of the data are available as a reference, the final clustering results of the HaeDE are compared with those of aeDE, PSO, SHO, SSA, and DA using the total accuracy and the Adjusted Rand Index (ARI) [48], where "0" indicates that the clustering results do not fit the actual data classes and "1" indicates that the clustering results and the actual classes are exactly the same. The distribution of the Iris dataset, the actual classes, and the clustering results for the comparative methods are presented in Figure 9 and Table 4.
As can be seen from Figure 9 and Table 4, the actual Class 1 is well separated from the other two clusters, which strongly overlap. In this case, the SHO can separate Class 1 from the others quite well but cannot separate the remaining two clusters. As a result, it ranks second in terms of ARI, with ARI = 0.5149. The aeDE can recognize Class 3 well but incorrectly assigns most of the Class 1 elements to Class 2. This method ranks second in terms of accuracy, with 66.67%. The best clustering result is given by the HaeDE, which groups the actual Class 1 properly. Although there are some misclustered elements in the actual Class 2 and the actual Class 3 due to their high degree of overlap, the HaeDE is still the best, with ARI = 0.8857 and an accuracy of about 96%. The other methods perform poorly in this case, with ARI < 0.5. In terms of computational cost, it can be observed that the HaeDE is also the best, taking only 641 FES for convergence. The aeDE ranks second, taking 880 FES, and the others are worse than both aeDE and HaeDE, taking from 30,000 to 100,000 FES for convergence. All of the above experiments and analyses demonstrate the superiority of the HaeDE over the comparative methods in solving clustering problems.

The Drawbacks and Future Research Direction
Although the proposed method possesses some advantages in terms of finding the global optimum and reducing computational cost, some disadvantages can be indicated as follows:
(i) Using the proposed method to solve unimodal functions is not efficient and causes large computational costs in comparison with gradient-based methods.
(ii) A parallel version of the proposed method is out of the scope of this article. Parallelism is a feasible way to reduce the high computational cost by dividing the computational work among multiple processors. Theoretically, in the case of a homogeneous network, the speedup can be predicted by the number of processors (efficiency = 1). However, the actual speedup obtained may be less than the number of available processors because an actual computer network is often a heterogeneous environment. In addition, for the proposed method, only the first stage, the aeDE, can be parallelized at all. After the first stage, the SQSD must be run sequentially. Certainly, enhancing the proposed method with parallel programming is an interesting future research direction for a wide range of researchers.
(iii) Another drawback is that the application of the proposed method to clustering still requires a given number of clusters. In the future, another encoding method can be proposed to apply the HaeDE to clustering problems with an unknown number of clusters.

Conclusion
In this paper, an efficient hybrid optimization approach based on the adaptive elitist Differential Evolution and the Spherical Quadratic Steepest Descent was proposed and then applied to the clustering problem. The new method benefits from the aeDE's global explorative manner in the initial steps and from the SQSD's locally effective exploitative manner in the later steps to improve the aeDE performance and reduce the computational cost significantly. The behavior of HaeDE is examined on a simple function, and its performance is evaluated on a set of 32 benchmark functions as well as an application to the clustering problem. In summary, the HaeDE can be considered a competitive optimization algorithm and can be utilized effectively in clustering and other applications in the future. In addition to the mentioned advantages, the proposed method has a few disadvantages: it wastes computational resources in the case of unimodal functions, a parallel processing strategy for the proposed method is not considered, and the application of the proposed method to clustering still requires a given number of clusters.
These are also interesting future research directions for a wide range of researchers.

(i) aeDE: the population size NP = 20, the stop criterion Δ = 10^−6, the maximum number of iterations maxiter = 5000, the mutant factor F ∈ [0.4, 1], the crossover control parameter CR ∈ [0.7, 1], and the threshold ε = 10^−2.
(ii) PSO: the population size NP = 20, the stop criterion Δ = 10^−6, the maximum number of iterations maxiter = 5000; the initial velocity of particles is v ∈ [0, (Ub − Lb)/7], where Ub and Lb are the upper and lower bounds of solutions; the acceleration factors c_1 and c_2 are 0.5 and 1, respectively; the inertia weight is w = w_max − ((w_max − w_min)/maxiter) × iter, where w_max and w_min denote the maximum and minimum values of the inertia weight, maxiter denotes the maximum number of iterations, and iter is the current iteration number.
(iii) SHO, SSA, DA: the population size NP = 20, the stop criterion Δ = 10^−6, the maximum number of iterations maxiter = 5000; for a detailed description of the other parameter settings, please refer to [9, 13, 14].

Figure 4: Convergence behavior of HaeDE with different step limit values.

Figure 6: The number of FES of 30 runs for the benchmark functions.
Figure 7: The total CPU time of 30 runs for the benchmark functions.

Table 1: The benchmark functions and their characteristics.

Table 4: Summary of clustering results.