Sufficient Conditions for Global Convergence of Differential Evolution Algorithm

The differential evolution algorithm (DE) is one of the most powerful stochastic real-parameter optimization algorithms. Theoretical studies of DE have gradually attracted the attention of more and more researchers. However, little theoretical research has addressed the convergence conditions for DE. In this paper, a sufficient condition and a corollary for the convergence of DE to the global optima are derived by using the infinite product. A DE algorithm framework satisfying the convergence conditions is then established. It is also proved that two common mutation operators satisfy the algorithm framework. The numerical experiments consist of two parts. The first visualizes how five convergent DE variants based on the classical DE algorithms escape from a local optimal set on two low-dimensional functions. The second tests the performance of a modified DE algorithm, inspired by the convergent algorithm framework, on the CEC2005 benchmarks.


Introduction
The differential evolution algorithm (DE) is a population-based stochastic parallel evolutionary algorithm. DE has emerged as a very competitive form of evolutionary computing since it was proposed by Storn and Price in 1995 [1]. DE and its variants have achieved competitive rankings in various competitions held at the IEEE Congress on Evolutionary Computation (CEC) conference series [2,3]. According to frequently reported comprehensive studies [4][5][6], DE outperforms many other optimization methods in terms of convergence speed and robustness over common benchmark functions. Compared with most other evolutionary algorithms, DE is much simpler and more straightforward to implement and has very few control parameters. Perhaps because of these advantages, it has found many practical applications, such as function optimization [7][8][9][10][11], multiobjective optimization [12], classification [13], and scheduling [14].
Theoretical studies of algorithms are very important to understand their search behaviors and to develop more efficient algorithms. With the popularity of DE in applications, more and more researchers pay attention to the theoretical studies on DE. According to the research contents, the main results of theoretical studies on DE can be divided into three classes as follows.

Research on the Runtime Complexity of DE.
DE is a population-based stochastic search algorithm, so the analysis of its runtime complexity is a critical issue. Zielinski et al. [15] investigated the runtime complexity of DE for various stopping criteria, including a fixed number of generations ($G_{\max}$) and the maximum distance criterion (MaxDist). Under MaxDist, the algorithm stops when the maximum distance from every vector to the best population member falls below a given threshold.
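The MaxDist criterion described above can be sketched in a few lines. This is only an illustration: the function name `max_dist_reached`, the use of Euclidean distance, and the convention that a lower fitness value is better are assumptions, not details taken from [15].

```python
import math


def max_dist_reached(population, fitness, threshold):
    """Return True when every vector lies within `threshold` of the best member.

    Assumptions: Euclidean distance, and "best" means lowest fitness value.
    """
    best = min(range(len(population)), key=lambda i: fitness[i])
    max_dist = max(math.dist(population[best], x) for x in population)
    return max_dist < threshold
```

A DE loop would call such a check once per generation and stop as soon as it returns True.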

Research on the Dynamical Behavior of DE's Population.
This class focuses on the evolving process of DE's population. For instance, the development of the expected population variance and of the population distribution over time is an important issue. Zaharie [16][17][18][19][20] theoretically analyzed the influence of the variation operators (mutation and crossover) and their parameters on the expected population variance. In 2009, Zaharie [21] theoretically investigated the influence of the crossover operators (including the classical binomial and exponential strategies) and of the crossover probability on the expected population variance. Dasgupta et al. [22,23] proposed a mathematical model of the underlying evolutionary dynamics of a one-dimensional DE population; the model showed that the fundamental dynamics of each parameter vector in DE follow a gradient-descent type search strategy. Wang and Huang [24] developed a stochastic model of a one-dimensional DE population to analyze how the population distribution evolves over time.

Research on the Convergence Property of DE.
This class investigates the limit behavior of DE's population. The main question is under which assumptions DE or its variants can be guaranteed to reach an optimal solution [25]. Technically speaking, commonly used concepts include convergence in probability, almost sure convergence, and convergence in distribution.
Xue et al. [26] performed a mathematical modeling and convergence analysis of continuous multi-objective differential evolution (MODE) under certain simplifying assumptions, and this work was extended in [27]. Zhao et al. [28] proposed a hybrid differential evolution with transform function (HtDE) and proved its convergence. Sun [29] developed a Markov chain model and proved that the classical DE does not converge in probability. He et al. [30] defined the differential operator (DO) as a random mapping from the solution space to the Cartesian product of the solution space and analyzed the asymptotic convergence of DE by using the random contraction mapping theorem. Ghosh et al. [31] established the asymptotic convergence behavior of a classical DE (DE/rand/1/bin) algorithm by applying Lyapunov stability theorems. Their analysis is based on the assumption that the objective function has the following two properties: (1) it has a continuous second-order derivative in the search space, and (2) it possesses a unique global optimum in the search range.
The studies of this paper are confined to the third class, convergence property of DE.
We note that the conclusions of [30,31] contradict that of [29]. According to the inference process, the asymptotic convergence in [30] refers to almost sure convergence. In fact, if DE does not converge in probability, then it does not converge almost surely either. We also note that the value of the random mapping DO defined in [30] may be greater than 1, which is debatable. In [31], the asymptotic convergence of DE/rand/1/bin, which was proved by applying Lyapunov stability theorems, should be regarded as a local convergence property. The reason is that, according to Lyapunov stability theorems, the distribution of the initial population depends on the maximum region of asymptotic stability. So for some functions, DE/rand/1/bin possesses the asymptotic stability property if and only if the initial individuals are close enough to the global optimum. In addition, it can be derived from the mutation operators of the classical DE that DE cannot escape once its population is trapped in a local optimum. This property was employed in [29] to prove that the classical DE does not possess global convergence in probability.
Considering that a convergent algorithm may be more robust than a divergent one, Zhao et al. [28] developed a convergent algorithm, HtDE, and proved its convergence. Zhan and Zhang [32] proposed a DE with random walk. Xue et al. [26,27] analyzed MODE's convergence. However, the conditions for global convergence of DE have not been explored. In this paper, the following problems will be addressed.
(i) What are sufficient conditions for the global convergence of DE?
(ii) What is the algorithm framework of the convergent DE?
(iii) Which operators can help the classical DE achieve a certain asymptotic convergence?
The discussion in this paper will be undertaken in a general measurable space, and the infinite product will be used as an analysis tool.
This paper is organized as follows. Section 2 introduces the classical DE. Section 3 proves a sufficient condition and a corollary for the convergence of DE to the global optima. Section 4 presents a DE algorithm framework satisfying the convergence conditions. Section 5 proves several operators satisfying the convergent algorithm framework. Section 6 gives numerical experiments to verify the robustness of the convergent DE. Section 7 analyzes and discusses the theoretical conclusions and the experimental results in detail. Section 8 summarizes this paper and indicates several directions for future research.

Classical Differential Evolution
DE is a competitive algorithm for solving continuous optimization problems. Consider the optimization problem

$$\max f(x), \quad x \in S, \tag{1}$$

where $S$ is a measurable space and $f(x)$ is the objective function (or the fitness of $x$), which is bounded for any bounded $x \in S$. The optimal solution set is denoted by $S^* = \{x^* \mid f(x^*) = \max\{f(x)\},\ x \in S\}$, where $x^*$ is an optimal solution. Let $\mu(\cdot)$ be a measure on the space $S$. It may happen that $\mu(S^*) = 0$, which means that $S^*$ is a set of measure 0. This is inconvenient for the analysis. In view of the accuracy required by practical problems, we can, without loss of generality, consider the expanded set

$$S^*_\varepsilon = \{x \in S \mid |f(x) - f(x^*)| < \varepsilon\},$$

where $\varepsilon$ is a small positive value. We can choose an appropriate $\varepsilon$ that meets the required accuracy and gives $\mu(S^*_\varepsilon) > 0$. We use $S^*_\varepsilon$ (with $\mu(S^*_\varepsilon) > 0$) to replace the set $S^*$ in this paper. Meanwhile, in order to simplify the calculation, let us suppose that the search space is $S = L^n$, where $L = [0, 1]$ and $n$ is the dimension of $S$.
The classical DE [2,33,34] works through a simple cycle of reproduction and selection operators after initialization. The reproduction operator includes the mutation and crossover operators. The classical DE for solving the above problem (1) can be described in detail as follows.
Mutation: generate a new population from $X(t)$ by a mutation operator, denoted by $V(t)$.
Crossover: generate a new population from $V(t)$ and $X(t)$ by a crossover operator, denoted by $U(t)$, and let $Y(t) \leftarrow U(t)$.
(3) Selection: generate a new population from $Y(t)$ and $X(t)$ by a selection operator, denoted by $X(t+1)$.
The initial population is generated by assigning random values in the search space to the variables of every solution.

Crossover Operator.
Following mutation, the crossover operator is applied to further increase the diversity of the population. In crossover, the target vector $x_i^t$ is combined with elements from the donor vector $v_i^t$ to produce the trial vector $u_i^t$, using the binomial crossover

$$u_{i,j}^t = \begin{cases} v_{i,j}^t, & \text{if } \mathrm{rand}(0,1) \le Cr \text{ or } j = j_{\mathrm{rand}}, \\ x_{i,j}^t, & \text{otherwise}, \end{cases}$$

where $Cr \in (0, 1)$ is the probability of crossover and $j_{\mathrm{rand}}$ is a random integer in $[1, n]$. Unless otherwise mentioned, rand(0, 1) is a uniformly distributed random number confined to the range [0, 1].
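As a minimal sketch of the binomial crossover just described, the following Python function builds a trial vector from a target and a donor vector. The function name `binomial_crossover` and the 0-based index `j_rand` (the counterpart of the random integer in [1, n]) are illustrative choices, not notation from the paper.

```python
import random


def binomial_crossover(target, donor, cr, rng=random):
    """Build the trial vector component by component.

    A component is copied from the donor when a uniform draw is <= cr,
    or when its index equals j_rand, which guarantees that at least one
    donor component always enters the trial vector.
    """
    n = len(target)
    j_rand = rng.randrange(n)  # 0-based counterpart of the integer in [1, n]
    return [
        donor[j] if (rng.random() <= cr or j == j_rand) else target[j]
        for j in range(n)
    ]
```

The forced index is what keeps the trial vector from being an exact copy of the target, even for very small crossover probabilities.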

Selection Operator.
Finally, the selection operator is employed to keep the most promising trial individuals for the next generation. The classical DE adopts a simple selection scheme: it compares the objective values of the target vector $x_i^t$ and the trial vector $u_i^t$. If the trial individual reduces the value of the objective function, then it is accepted for the next generation; otherwise the target individual is retained in the population. The selection operator is defined as

$$x_i^{t+1} = \begin{cases} u_i^t, & \text{if } f(u_i^t) < f(x_i^t), \\ x_i^t, & \text{otherwise}. \end{cases}$$
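The cycle of mutation, crossover, and selection can be put together in a short sketch of one generation. Several choices here are assumptions made only for illustration: the standard DE/rand/1 mutation with scale factor `F = 0.5`, `cr = 0.9`, clipping of donor components to [0, 1] as the bound handling, and the function name `de_rand_1_bin_step`. The selection follows the rule stated above, in which a trial vector is accepted when it reduces the objective value.

```python
import random


def de_rand_1_bin_step(pop, f, F=0.5, cr=0.9, rng=random):
    """One generation of DE/rand/1/bin on [0, 1]^n (lower f is better)."""
    np_, n = len(pop), len(pop[0])
    new_pop = []
    for i, x in enumerate(pop):
        # Mutation: three mutually distinct indices, all different from i.
        r1, r2, r3 = rng.sample([k for k in range(np_) if k != i], 3)
        donor = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(n)]
        donor = [min(1.0, max(0.0, v)) for v in donor]  # assumed bound handling
        # Binomial crossover with one forced donor component.
        j_rand = rng.randrange(n)
        trial = [
            donor[j] if (rng.random() <= cr or j == j_rand) else x[j]
            for j in range(n)
        ]
        # One-to-one greedy selection.
        new_pop.append(trial if f(trial) < f(x) else x)
    return new_pop
```

Because each individual is replaced only by a strictly better trial vector, the best objective value in the population never gets worse from one generation to the next.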

Convergence Condition
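There are different definitions of convergence for analyzing the asymptotic behavior of algorithms. The following definition, namely convergence in probability, is used in this paper.

Definition 1. Let $\{X(t), t = 0, 1, 2, \ldots\}$ be the population sequence generated by DE for the optimization problem (1). DE is said to converge to the optimal solution set $S^*_\varepsilon$ in probability if

$$\lim_{t \to \infty} P\{X(t) \cap S^*_\varepsilon \neq \emptyset\} = 1.$$

Let us give a sufficient condition for the convergence of DE.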

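Theorem 2. Consider using DE to solve the optimization problem (1). Suppose that, in the $t_k$th target population $X(t_k)$, there exists at least one individual $x$, corresponding to the trial individual $u$ produced by a reproduction operator, such that

$$P\{u \in S^*_\varepsilon\} \ge \zeta(t_k) > 0,$$

and the series $\sum_{k=1}^{\infty} \zeta(t_k)$ diverges; then DE converges to the optimal solution set $S^*_\varepsilon$.
Here $\{t_k, k = 1, 2, \ldots\}$ denotes any subsequence of the natural numbers, $P\{u \in S^*_\varepsilon\}$ denotes the probability that $u$ belongs to the optimal solution set $S^*_\varepsilon$, and $\zeta(t_k)$ is a small positive value which may change with $t_k$.
Proof. In DE, each target individual corresponds to a trial individual through the reproduction operator. According to the condition of Theorem 2, the probability that no individual of the $t_k$th trial population $U(t_k)$ belongs to the optimal solution set $S^*_\varepsilon$ satisfies

$$P\{U(t_k) \cap S^*_\varepsilon = \emptyset\} \le 1 - \zeta(t_k).$$

So the probability that no individual of any trial population in the previous $(t - 1)$ iterations belongs to the optimal solution set $S^*_\varepsilon$ satisfies

$$P\{U(t_k) \cap S^*_\varepsilon = \emptyset \text{ for all } t_k \le t - 1\} \le \prod_{k:\, t_k \le t-1} (1 - \zeta(t_k)).$$

Because of the elitist selection operation in DE, the best individual of the trial populations is retained in the next generation. So the probability that the $t$th population $X(t)$ does not contain an optimum satisfies

$$P\{X(t) \cap S^*_\varepsilon = \emptyset\} \le \prod_{k:\, t_k \le t-1} (1 - \zeta(t_k)).$$

So for DE with elitist selection, we have

$$\lim_{t \to \infty} P\{X(t) \cap S^*_\varepsilon = \emptyset\} \le \prod_{k=1}^{\infty} (1 - \zeta(t_k)).$$

By the property of the infinite product [35], if the series $\sum_{k=1}^{\infty} \zeta(t_k)$ diverges, then

$$\prod_{k=1}^{\infty} (1 - \zeta(t_k)) = 0. \tag{10}$$

So for the divergent series $\sum_{k=1}^{\infty} \zeta(t_k)$, we get

$$\lim_{t \to \infty} P\{X(t) \cap S^*_\varepsilon \neq \emptyset\} = 1 - \lim_{t \to \infty} P\{X(t) \cap S^*_\varepsilon = \emptyset\} = 1.$$

According to Definition 1, the theorem holds.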
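The role of the divergence condition in the proof can be checked numerically. The sketch below is only an illustration with two hypothetical sequences: one whose sum diverges (harmonic type) and one whose sum converges. The helper name `no_hit_bound` and both sequences are assumptions chosen because their partial products have simple closed forms.

```python
def no_hit_bound(zetas):
    """Product of (1 - zeta_k): an upper bound on the probability that
    none of the listed generations produced a trial individual in the
    optimal set."""
    bound = 1.0
    for z in zetas:
        bound *= 1.0 - z
    return bound


# Divergent sum: zeta_k = 1/(k+1); the partial product equals 1/(N+1).
divergent = [1.0 / (k + 1) for k in range(1, 10001)]
# Convergent sum: zeta_k = 1/(k+1)^2; the partial product equals (N+2)/(2(N+1)).
convergent = [1.0 / (k + 1) ** 2 for k in range(1, 10001)]

print(no_hit_bound(divergent))   # shrinks toward 0 as more terms are added
print(no_hit_bound(convergent))  # stays near 1/2, so the bound is not informative
```

With the divergent sequence the bound on the failure probability vanishes, which is the situation covered by the theorem; with the convergent sequence the same argument gives no guarantee.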

Corollary 3. In Theorem 2, if $\zeta(t_k)$ is always equal to a positive constant $\zeta > 0$, then DE converges to the optimal solution set $S^*_\varepsilon$.
Proof. Obviously, the series $\sum_{k=1}^{\infty} \zeta(t_k)$ diverges when $\zeta(t_k)$ is always equal to a positive constant $\zeta > 0$. From Theorem 2, DE converges to the optimal solution set $S^*_\varepsilon$.
We now give several observations on the above conditions.
(i) Theorem 2 means that if the probability of entering the optimal set in a certain subsequence of populations is large enough, then the modified DE converges to the global optimal set in probability. The population states need not be ergodic.
(ii) Corollary 3 is just a special case of Theorem 2 and is very easy to check. Some improved DE algorithms, such as HtDE proposed by Zhao et al. [28], DE-RW proposed by Zhan and Zhang [32], and DE-MC proposed by Braak [36], satisfy the convergence condition of Theorem 2 (or Corollary 3).
(iii) He and Yu [37] and Rudolph [38] presented several important conclusions on convergence conditions for evolutionary algorithms. These conclusions do apply to the DE algorithm. However, compared with these conclusions, the condition of Theorem 2 is more relaxed and easier to check.
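For the constant case of Corollary 3, the bound from the proof of Theorem 2 reduces to a geometric sequence, so one can estimate how many qualifying generations are needed before the bound on the failure probability drops below a chosen level. The function name `generations_needed` and the tolerance `delta` are hypothetical and serve only this illustration.

```python
import math


def generations_needed(zeta, delta):
    """Smallest k such that (1 - zeta)**k <= delta, for zeta, delta in (0, 1)."""
    if not (0.0 < zeta < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("zeta and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(delta) / math.log(1.0 - zeta))


k = generations_needed(0.01, 0.05)
print(k, (1 - 0.01) ** k)
```

This is a bound on the failure probability only; it says nothing about how quickly a particular run actually reaches the optimal set.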

Algorithm Framework Possessing Convergence
As analyzed in the introduction, it cannot be guaranteed that the classical DE converges globally. However, DE can converge to the global optimal solution if its reproduction operation satisfies the sufficient conditions given in Theorem 2 or Corollary 3. A DE algorithm framework integrating an extra mutation component is given in this section. Since the purpose of the extra mutation is to assist the classical DE to converge, this paper refers to the operator as the AsCo-mutation operator.
According to the sufficient conditions proved above, we can define the AsCo-mutation operator as follows (Definition 4). (2) Let $W(t_k)$ denote the population generated by using AsCo-mutation; there exists at least one individual $w$ in $W(t_k)$ such that

$$P\{w \in S^*_\varepsilon\} \ge \zeta(t_k) > 0,$$

and the series $\sum_{k=1}^{+\infty} \zeta(t_k)$ diverges. Taking into account the fact that the algorithm framework using AsCo-mutation contains some convergent algorithms of the DE family, this paper refers to the algorithm framework as CDE. The algorithm framework CDE can be described as follows.
Mutation: generate a new population from $X(t)$ by a mutation operator, denoted by $V(t)$.
Crossover: generate a new population from $V(t)$ and $X(t)$ by a crossover operator, denoted by $U(t)$.
AsCo-mutation: if the condition that generates the subsequence of populations is satisfied, then generate a new population $W(t)$ from $U(t)$ by AsCo-mutation and let $Y(t) \leftarrow W(t)$; otherwise, let $Y(t) \leftarrow U(t)$.
(3) Selection: generate a new population from $Y(t)$ and $X(t)$ by a selection operator, denoted by $X(t+1)$.
Compared with DE, the reproduction operator of CDE adds one step, AsCo-mutation. Obviously, the algorithm framework CDE satisfies Theorem 2 when the AsCo-mutation satisfies Definition 4. That is to say, CDE, which employs the AsCo-mutation given by Definition 4, converges to the global optimum.
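The CDE framework can be sketched as a classical DE generation with one extra hook between crossover and selection. Everything beyond that structure is an assumption made for illustration: DE/rand/1/bin as the base strategy, the parameter values, clipping to [0, 1], the rule `t % period == 0` for choosing the subsequence of generations, and the helper `uniform_asco`, which resamples one randomly chosen trial vector uniformly in the search space.

```python
import random


def uniform_asco(trials, rng):
    """Hypothetical AsCo-mutation: resample one trial vector uniformly in [0, 1]^n."""
    out = [list(u) for u in trials]
    i = rng.randrange(len(out))
    out[i] = [rng.random() for _ in out[i]]
    return out


def cde_step(pop, f, t, asco_mutation, period=1, F=0.5, cr=0.9, rng=random):
    """One CDE generation on [0, 1]^n (lower f is better)."""
    np_, n = len(pop), len(pop[0])
    trials = []
    for i, x in enumerate(pop):
        r1, r2, r3 = rng.sample([k for k in range(np_) if k != i], 3)
        donor = [
            min(1.0, max(0.0, pop[r1][j] + F * (pop[r2][j] - pop[r3][j])))
            for j in range(n)
        ]
        j_rand = rng.randrange(n)
        trials.append(
            [donor[j] if (rng.random() <= cr or j == j_rand) else x[j]
             for j in range(n)]
        )
    if t % period == 0:  # assumed rule selecting the generations t_k
        trials = asco_mutation(trials, rng)
    return [u if f(u) < f(x) else x for x, u in zip(pop, trials)]
```

If the whole population has collapsed onto a single point, all difference vectors are zero and the classical steps reproduce that point; only the extra mutation can then introduce a new candidate, which is the situation the framework is meant to address.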

Several Mutation Operators Satisfying Convergence Condition
Like DE algorithm, most evolutionary algorithms for numerical optimization problems use vectors of floating point numbers for their chromosomal representations. For such representations, many mutation operators [39] have been proposed. The most common mutation operators include Uniform mutation [40] and Gaussian mutation [41,42]. We introduce these operators and prove that they meet the definition of AsCo-mutation for CDE in turn.

Uniform Mutation.
Uniform mutation replaces the solution vector with a uniformly distributed random vector $\tilde{u}$ confined to the domain $S$ ($S = L^n$). Each component of $\tilde{u}$ is a uniformly distributed (independent, identically distributed) random number from $[0, 1]$. So the density function of $\tilde{u}$ can be expressed as

$$g(y) = \begin{cases} 1, & y \in [0, 1]^n, \\ 0, & \text{otherwise}. \end{cases}$$

As shown in the CDE algorithm framework, suppose that the AsCo-mutation operator employed by CDE is uniform mutation. Let $\tilde{u}$ denote the new individual generated by uniform mutation; then the probability that $\tilde{u}$ belongs to the optimal solution set can be calculated as follows:

$$P\{\tilde{u} \in S^*_\varepsilon\} = \int_{S^*_\varepsilon} g(y)\,dy = \mu(S^*_\varepsilon) > 0.$$

CDE can use uniform mutation in a flexible way, for example by mutating an arbitrary individual selected from the set $U(t)$ with a given probability $P_{ac}$, or by mutating more than one individual. Let $m$ ($m < NP$) denote the number of mutated individuals. Since each of these individuals is mutated independently with probability $P_{ac}$, the probability $P_{one/m}$ that at least one of the resulting individuals belongs to the optimal solution set can be calculated as follows:

$$P_{one/m} = 1 - \left(1 - P_{ac}\,\mu(S^*_\varepsilon)\right)^m > 0,$$

where $P_{ac}$ is an empirical probability, $P_{ac} \in (0, 1]$, and the diversity of the population gradually increases as $P_{ac}$ increases.
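A small simulation can make the uniform-mutation probability concrete. The sketch assumes that each of the m candidate individuals is mutated independently with probability `p_ac`, so that the chance of at least one hit is 1 - (1 - p_ac * mu)^m, where mu is the measure of the optimal set; this closed form is an assumption of the sketch. The box [0, a]^n used as the optimal set and both function names are hypothetical.

```python
import random


def p_at_least_one(mu_opt, p_ac, m):
    """1 - (1 - p_ac * mu_opt)**m, assuming the m mutations are independent."""
    return 1.0 - (1.0 - p_ac * mu_opt) ** m


def simulate(a, n, p_ac, m, runs, rng):
    """Monte Carlo estimate with the hypothetical optimal set [0, a]^n."""
    hits = 0
    for _ in range(runs):
        hit = False
        for _ in range(m):
            if rng.random() < p_ac:                       # individual is mutated
                point = [rng.random() for _ in range(n)]  # uniform mutation
                if all(v <= a for v in point):
                    hit = True
        hits += hit
    return hits / runs


rng = random.Random(0)
print(p_at_least_one(0.2 ** 2, 0.5, 5), simulate(0.2, 2, 0.5, 5, 20000, rng))
```

The two printed values should be close, which illustrates that the per-generation hit probability stays bounded away from zero as long as the optimal set has positive measure.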
In addition, the implementation of the uniform mutation operator can also be flexible. For example, in order to keep the tradeoff between exploration and exploitation, this paper presents the DE/um-best/1 operator, in which rand(0, 1) denotes a uniform random number in [0, 1]. The individuals $x_{r_1}$ and $x_{r_2}$ are boundary individuals with a given probability $p_b$; each element of a boundary individual equals either the upper boundary or the lower boundary value. The indices $r_1$ and $r_2$ are uniform random integers in $[1, \lfloor NP(1 + p_b) \rfloor]$. That is, when the index $r_1$ ($r_2$) is no less than $NP$, $x_{r_1}$ ($x_{r_2}$) takes a boundary individual. Obviously, if $x_{r_1,j}$ takes the upper boundary value of the $j$th dimension while $x_{r_2,j}$ takes the lower boundary value (or vice versa), then the element $v_{i,j}$ is ergodic in the $j$th dimension. Therefore the individual $v_i$ can be ergodic in the search space, as with the uniform mutation operator.

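Gaussian Mutation.
Gaussian mutation modifies all components of the solution vector by adding random noise:

$$\tilde{x} = x + N(0, \sigma),$$

where $N(0, \sigma)$ is a vector of independent Gaussian random numbers with a mean of zero and standard deviations $\sigma = (\sigma_1, \ldots, \sigma_n)$. The density function of $N(0, \sigma)$ can be expressed as

$$h(y) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{y_j^2}{2\sigma_j^2}\right).$$

Now, let us suppose that the AsCo-mutation operator employed by CDE is Gaussian mutation. Then the probability that $\tilde{x}$ generated by Gaussian mutation belongs to the optimal solution set can be calculated as follows:

$$P\{\tilde{x} \in S^*_\varepsilon\} = \int_{\tilde{S}^*} h(y)\,dy, \qquad \tilde{S}^* = \{y \mid x + y \in S^*_\varepsilon\}.$$

On the other hand, for any individual $x \in S = L^n$, we have $\tilde{S}^* \subseteq [-1, 1]^n$. So, for every $y \in \tilde{S}^*$,

$$h(y) \ge \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{1}{2\sigma_j^2}\right) = c_\sigma > 0,$$

implying that

$$P\{\tilde{x} \in S^*_\varepsilon\} \ge c_\sigma\,\mu(S^*_\varepsilon) > 0.$$

Like uniform mutation, Gaussian mutation can be used in a flexible way. As before, $m$ ($m < NP$) denotes the number of individuals mutated by the Gaussian mutation operator, $P_{ac}$ denotes the probability that each individual is mutated, and $P_{one/m}$ denotes the probability that at least one of the resulting individuals belongs to the optimal solution set. Then $P_{one/m}$ satisfies

$$P_{one/m} \ge 1 - \left(1 - P_{ac}\,c_\sigma\,\mu(S^*_\varepsilon)\right)^m > 0.$$

Obviously, letting $\zeta(t_k) = P_{one/m}$ (or its lower bound in the Gaussian case), the uniform mutation and Gaussian mutation operators satisfy Definition 4. Thus we obtain the following Theorem 5.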
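The argument for Gaussian mutation rests on the fact that the Gaussian density is bounded below on [-1, 1]^n, so the hit probability is at least that density floor times the measure of the optimal set. The sketch below computes such a floor for given per-dimension standard deviations and compares it with an exact one-dimensional probability. The function names, the interval [0.9, 1.0] used as the optimal set, and the parent located at 0 are hypothetical.

```python
import math


def gaussian_hit_lower_bound(mu_opt, sigmas):
    """mu_opt times the minimum of the Gaussian density over [-1, 1]^n.

    The minimum over the cube is attained where every |y_j| equals 1.
    """
    floor = 1.0
    for s in sigmas:
        floor *= math.exp(-1.0 / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * s)
    return mu_opt * floor


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# One-dimensional check: parent at 0, optimal set [0.9, 1.0], sigma = 1.
bound = gaussian_hit_lower_bound(0.1, [1.0])
exact = normal_cdf(1.0) - normal_cdf(0.9)
print(bound, exact)
```

The exact probability is never smaller than the bound, and the bound is strictly positive for any positive standard deviations, which is all the convergence condition needs.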

Theorem 5. The DE algorithm employing the uniform mutation or Gaussian mutation operator converges in probability to the global optimum of the optimization problem (1).

Experimental Verification
The previous sections proved that CDE algorithms converge in probability. This only guarantees that CDE algorithms reach an optimal solution as the number of iterations approaches infinity; it does not mean that CDE can find the optimal solution within a finite number of iterations. However, a convergent algorithm should generally be more robust. This section therefore presents experiments in two parts to verify the robustness of CDE. The first visualizes how CDE escapes from a local optimal set on two low-dimensional functions. The second tests a modified DE algorithm, inspired by the above convergence theory, on the CEC2005 benchmark functions.

Experiments on Low Dimensional Functions.
To achieve the aims mentioned above, experiments are conducted on two numerical functions chosen according to the experimental results of [43][44][45][46]. One is the DE deceptive function [45], which can trap the classical DE in a local optimum. The other is the Rastrigin function. In [45,46], nineteen benchmark functions, including the Rastrigin function, were tested using the classical DE. Those results indicated that the optimization performance on the Rastrigin function was among the worst.

Deceptive Function. The DE deceptive function [45] is defined in terms of the function sinc($x$). The landscape of the DE deceptive function is shown in Figure 1. The global optimum of the function is $x = -5.0$ with the function value $f(x) = -3$. There is a deceptive local minimum at $x = 8.5060$ with the function value $f(x) = -2.9160$ in this test function.

Rastrigin Function (2 Dimensions). Consider

$$f(x) = \sum_{i=1}^{2} \left(x_i^2 - 10\cos(2\pi x_i) + 10\right).$$

The global optimum of the function is $x = (0, 0)$ with the function value $f(x) = 0$. There are many local optima in this test function.
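For reference, the Rastrigin function in its standard form can be written in a few lines. The form used here, a sum of x_i^2 - 10 cos(2 pi x_i) + 10, is the commonly used definition and is consistent with the optimum stated above; whether the experiments applied any shift or scaling is not known from the text, so this is only a sketch.

```python
import math


def rastrigin(x):
    """Standard Rastrigin function; its global minimum is 0 at the origin."""
    return sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) + 10.0 for xi in x)


print(rastrigin([0.0, 0.0]))  # value at the global optimum
print(rastrigin([1.0, 0.0]))  # a point close to one of the many local optima
```

The cosine term creates a regular grid of local optima around integer coordinates, which is why the function is a common test of an algorithm's ability to avoid premature convergence.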
Let CDE-um denote the CDE algorithm using the uniform mutation operator. Suppose that CDE-um mutates the worst individual of $U(t)$ with probability 1, and that the new individual is retained directly in the next generation. Experiments were conducted to compare five typical versions of the classical DE with the CDE-um algorithm. All experiments were repeated for 50 independent runs. The convergence times and the convergence ratio over the 50 runs were reported.
In order to show the robustness of CDE-um, we report the number of function evaluations (FES) needed to achieve the termination error (Ter Err) within the maximum number of function evaluations (Max FES). Table 1 gives the FES of 50 independent runs of the five typical versions on the DE deceptive function, while Table 2 reports the FES on the Rastrigin function. The typical versions are DE/best/1 versus CDE-um/best/1, DE/rand/1 versus CDE-um/rand/1, DE/cur-to-best/1 versus CDE-um/cur-to-best/1, DE/best/2 versus CDE-um/best/2, and DE/rand/2 versus CDE-um/rand/2. Table 3 analyzes the results of Tables 1 and 2. From the statistics of Table 3, we can see that the ratio of runs converging to the optimum (ConRa) is much higher for CDE-um than for the corresponding DE. We can also see that all the convergence curves share two common characteristics.
(i) When the number of iterations is small, the convergence times of the five typical versions of the classical DE are slightly larger than those of the corresponding CDE-um. However, as the number of iterations increases, the convergence times of CDE-um become far larger than those of the corresponding DE. This shows that a small increase in computational cost can greatly improve the robustness of the CDE-um algorithm.
(ii) When the number of iterations is large, all the convergence graphs of the five typical versions of the classical DE become straight lines, whereas all the graphs of CDE-um keep rising in a staircase pattern. This indirectly shows that the classical DE cannot escape from a local optimal set or a premature solution set once trapped in it, whereas CDE-um has an enhanced ability to escape from such sets.
The convergence graphs on the Rastrigin function have characteristics similar to those on the DE deceptive function, so they are omitted here.
The population size is set to $8 \times n$, where $n$ is the dimension. The maximum number of function evaluations (Max FES) is set to 5,000,000.

Experiments on Functions of CEC2005.
Wang et al. [48] presented a composite differential evolution algorithm (CoDE), which employs three trial vector generation strategies, namely rand/1/bin, rand/2/bin, and current-to-rand/1. The experimental studies on the 25 benchmark functions of CEC2005 indicated that CoDE's overall performance was better than that of seven other outstanding competitors (please refer to [48] for details). We now give a convergent CoDE algorithm (CCoDE-umbest) based on the above convergent algorithm framework. The CCoDE-umbest algorithm uses the DE/um-best/1 operator, which was presented in Section 5.1, in place of the current-to-rand/1 strategy of CoDE.
This paper compared CCoDE-umbest with CoDE on the 25 benchmark functions of CEC2005. Table 4 reports the average and standard deviation of the function error values obtained in 25 runs at FES = 1.5E + 5 and FES = 3.0E + 5, respectively. The two bottom lines in Table 4 give the test statistics of the sign test [49] on the mean errors. From Table 4, the probability values supporting the null hypothesis (0.012 for FES = 1.5E + 5, 0.041 for FES = 3.0E + 5) are less than the significance level of 0.05. So we can reject the null hypothesis; that is to say, the overall performance of CCoDE-umbest is better than that of CoDE on the benchmarks. This implies that the use of the convergent algorithm framework can improve the performance of CoDE. The population size was set to 60, and the dimension was set to 10. The settings of the other parameters are the same as in [48].
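A sign test on paired mean errors can be computed directly from the numbers of negative and positive differences. The sketch below uses the exact two-sided binomial form with ties discarded beforehand; this is an assumption, since the text does not say which variant produced the reported P values (a normal approximation is also common). The counts in the usage lines are hypothetical and are not taken from Table 4.

```python
from math import comb


def sign_test_p_value(neg, pos):
    """Exact two-sided sign test p-value for `neg` negative and `pos`
    positive differences (ties are assumed to be discarded beforehand)."""
    n = neg + pos
    k = min(neg, pos)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


print(sign_test_p_value(0, 10))  # all differences share one sign
print(sign_test_p_value(5, 5))   # perfectly balanced differences
```

A small p-value indicates that one algorithm wins on clearly more functions than chance alone would explain, without assuming anything about the distribution of the errors.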
All the above algorithms were coded in Visual C++, and the experiments were executed on an Acer 4750G laptop with a 2.30 GHz Intel(R) Core(TM) i5-2410M CPU and 2 GB RAM.

Analysis and Discussion
In this paper, two sufficient conditions for the convergence of DE have been presented in the form of a theorem and a corollary. These conditions describe the limiting behavior of DE. Given a subsequence of populations, the sufficient conditions require that the probability of generating an optimum (or optima) by the reproduction operations be greater than a small positive number. Taking into account that the selection operator of DE retains the elitist individual(s) of the current population in the next generation, the sufficient conditions were easily proved by using the infinite product. In essence, the sufficient conditions for the convergence of the classical evolutionary algorithms [38] and the elitist genetic algorithm [50][51][52], which are proved by using Markov chains, generally comprise two requirements. One is the ergodicity of the population states; the other is the retention of the current best solution. In contrast, the sufficient conditions for the convergence of DE presented in this paper do not require the population states to be ergodic.

Notes to Table 4: "FES" denotes the number of function evaluations. "Std." denotes the standard deviation of 25 mean errors. The two bottom lines record the test statistics of the sign test on the mean errors. "Neg Dif" and "Pos Dif" denote the number of negative and positive differences, respectively. "P value" denotes the probability value supporting the null hypothesis. Both P values are less than the significance level of 0.05.
Theoretical studies of the convergence of algorithms help in understanding their search behavior and in developing more robust algorithms. Based on the presented sufficient conditions, a modified algorithm framework, CDE, is proposed. By employing an extra mutation operator, the CDE algorithm framework converges in probability. It is not difficult to infer that many mutation operators meet the convergence condition, such as uniform mutation, Gaussian mutation, and other mutation operators.
This raises a new problem: which mutation operator is the most suitable one? Inspired by the step from the classical genetic algorithm to the elitist genetic algorithm, preference should be given to operators with the following characteristics. Firstly, the auxiliary operator is simple and straightforward to implement. Secondly, the operator makes the DE algorithm convergent, thereby improving the robustness of the algorithm. Finally, the computational cost added by the auxiliary operator is reasonable. Based on these factors, this paper presents the CDE-um and CCoDE-umbest algorithms and gives numerical experiments to verify the robustness and competitiveness of those convergent DE algorithms.
From Table 3, we can see that the convergence ratios of all five versions of CDE-um on the test functions reach 100%. This shows that the convergent algorithm CDE-um improves the robustness of the classical DE. In addition, Figures 2-4 show that, compared with the corresponding DE, the computational cost of CDE-um is not large and remains acceptable. As shown in Table 4, the results for the CCoDE-umbest algorithm indicate that a reasonable use of the convergent algorithm framework can improve the performance of CoDE.
Moreover, the robustness of CDE-um can be further analyzed by numerical experiments. Figure 5 gives the convergence graphs for the DE deceptive function in a single run. From the graphs in Figure 5, we can see that CDE-um can escape from the local optimum of the test function, while the classical DE cannot escape once it is trapped in the local optimal solution set. In fact, the mutation perturbations approach 0 when the classical DE is trapped in a local optimal solution set, so the population of the classical DE cannot be improved any further.
Of course, we have to note that convergence in probability is a property that holds as the number of iterations approaches infinity. The previous theorems and experimental results do not imply that CDE can solve all function optimization problems within a finite number of iterations.

Conclusion and Future Work
Studies of convergence properties, as part of the basic research on algorithms, help in designing more robust algorithms. Little research has dealt with conditions for the convergence of DE. This paper presented and proved two sufficient conditions for the convergence of DE. These sufficient conditions state that DE variants are guaranteed to converge to a global optimum solution if the probability of generating an optimum (or optima) by the reproduction operations of each generation in a certain subsequence of populations is greater than a small positive number. According to the sufficient conditions, a convergent algorithm framework, CDE, was presented. The algorithm framework demonstrates that employing auxiliary operators that satisfy certain conditions can make the classical DE converge in probability. It was then proved that the uniform mutation and Gaussian mutation operators meet the convergence conditions of the auxiliary operator.
Convergent algorithms may not always be competitive, but they should generally be more robust. So, in order to further verify the conclusions drawn from the theoretical research, this paper gave numerical experiments comparing the performance of the convergent algorithm CDE-um with that of the classical DE (including five typical versions). The CDE-um algorithm was designed by incorporating uniform mutation into the classical DE. The experimental results on the test functions show that a small increase in computational cost can greatly improve the robustness of all five typical versions of CDE-um. In addition, this paper improved the Composite Differential Evolution (CoDE) algorithm, inspired by the convergence theory, and tested its competitiveness on the benchmark functions of CEC2005.
In summary, the sufficient conditions guaranteeing global convergence of DE variants, which were proved in this paper, are easy to check and are general enough to be useful for the family of DE algorithms. In future work, it appears promising to develop more competitive and convergent algorithms by incorporating a convergence-assisting operator into some outstanding variants of modified DE algorithms.