Dynamics and Stability of the Sampling Distribution of Particle Swarm Optimisers via Moment Analysis

For stochastic optimisation algorithms, knowing the probability distribution with which an algorithm allocates new samples in the search space is very important, since this explains how the algorithm really works and is a prerequisite to being able to match algorithms to problems. This is the only way to beat the limitations highlighted by the no-free lunch theory. Yet, the sampling distribution for velocity-based particle swarm optimisers has remained a mystery for the whole of the first decade of PSO research. In this paper, a method is presented that allows one to exactly determine all the characteristics of a PSO's sampling distribution and explain how it changes over time during stagnation (i.e., while particles are in search for a better personal best) for a large class of PSO's.


INTRODUCTION
Let us consider the basic form of PSO with inertia weight, which is controlled by the following two equations: where φ 1 and φ 2 represent vectors of random numbers uniformly distributed in [0, c 1 ] and [0, c 2 ], respectively, (c 1 and c 2 are nonnegative constants), ⊗ is component wise multiplication, y i is the position of the ith particle's personal best, y is the position of the particle's neighbourhood best, and w is a constant in [0, 1].Following Kennedy's graphical examinations of the trajectories of individual particles and their responses to variations in key parameters [1], traditionally, this model (and the related PSO with constriction [2], which is, provably, mathematically equivalent) has been studied theoretically under strong simplifying assumptions such as isolated single individuals, search stagnation (i.e., no improved solutions are found) and absence of randomness [2][3][4][5][6][7][8][9][10][11][12].While these assumptions make it possible to make theoretical progress, they also make the results of the analyses difficult to carry over to the real PSO (where stochasticity is present).
Only very few attempts to understand the behaviour of the PSO in the presence of stochasticity have been made.For example, Clerc [13] analysed the distribution of velocities of one particle controlled by the standard PSO update rule with inertia and stochastic forces and was able to show that a particle's new velocity is the sum of three components: a forward force, a backward force, and noise.Kadirkamanathan et al. [14] were able to study the stability of particles in the presence of stochasticity by using Lyapunov stability analysis.
These are good and important first steps in the direction of modelling the real PSO.However, they only marginally scratch the surface in relation to one of the most important open problems in PSO research: the characterisation of the PSO's sampling distribution and its changes over time.Why is this question so fundamental?For two reasons.Firstly, this scientific knowledge gives us much greater understanding of an algorithm's behaviour.Secondly, on a more practical level, the only way to beat the limitations implied by NFL [15] is to match algorithms to problems.Naturally, at least in part, this can be achieved by a trial-and-error approach involving testing different algorithms and parameter settings on a particular problem or class of problems until a suitable match is found.A more analytic approach to this problem, however, requires two things: (a) knowing the search distribution of an algorithm and (b) checking whether this is well matched to the features of a problem's fitness landscape.Unfortunately, the sampling distribution of the PSO has been unavailable during the whole first decade (and beyond) of PSO research.
In recent work [16], we introduced a novel method, which allowed us to exactly determine the mean and standard deviation of the sampling distribution of the canonical PSO as well as their changes over any number of generations.The only assumption we made was stagnation, that is, we studied the sampling distribution produced by particles in search for a better personal best.In this paper, we generalise the technique and show how it can be used to determine any number of moments of the sampling distribution of PSO's.
The paper is organised as follows.In Section 2, we provide a summary of the results presented in [16].In Section 3, we generalised the method to moments of any order.In Section 4, we apply the technique to compute the skewness and kurtosis of the canonical PSO's sampling distribution.In Section 5, we further extend the generality of our technique so as to cover a variety of PSO's, not just the one in (1).This allows us to compare, on a theoretical basis, different forms of PSO.In Section 6, we then use the moments of the PSO's sampling distribution to approximately reconstruct the distribution itself, making it possible, for the first time, to understand the search strategy implemented by the canonical PSO, while searching for new improvements.We discuss the results and conclude in Section 7.

Derivation of dynamic equations for the moments
During stagnation, each particle behaves independently.Also, each dimension is treated independently.So we can analyse each particle's behaviour in isolation.Therefore, we can drop the superscript i in (1), and rewrite them as where we used the relation v t = x t − x t−1 .Note that on the r.h.s. of this equation, while w, y, and y are constants, x t , φ 1 , φ 2 , and x t−1 are stochastic variables.As a consequence, x t+1 on the l.h.s. of the equation is a stochastic variable too (in fact, it is a stochastic function of the stochastic variables appearing in the r.h.s.).
In [16], we applied the expectation operator to both sides of the equation obtaining where we performed the substitution because of the statistical independence between φ i and x t .
If we square both sides of (2), we obtain (5) Expanding the r.h.s., we obtain an equation expressing the stochastic variable x 2 t+1 as a function of other stochastic variables, that is, x 2 t , x t−1 x t , φ 2 1 , and so forth.By applying the expectation operator to both sides of this equation, one obtains where we set If, instead, we multiply both sides of (2) by x t and apply the expectation operator, we obtain Equations ( 4), (6), and (7) form a system of coupled second-order difference equations using which, given appropriate initial conditions, one can compute E[x t ], E[x 2  t ] and E[x t x t−1 ] for any t.

Analysis of the dynamics
Equations ( 4), (6), and (7) can also be written in matrix notation as the extended first order system where where It is then trivial to verify under what conditions E[x t ], E[x 2  t ], and E[x t x t−1 ] will converge to stable fixed points.We need to have that all eigenvalues of M must be within the unit circle, that is, Λ m = max i |λ i | < 1.When this happens, we will say that the PSO is order-2 stable.
Naturally, when Λ m < 1, in principle, we could derive (either symbolically or numerically) the fixed point for the system, which we will denote as z * .This would be simply given by When the system is order-2 stable, by the simple change of variables z(t) = z(t) − z * , we can represent the dynamics of the system via the following linear homogeneous equation: which can trivially be integrated to obtain the following explicit solution for the dynamics of the first two moments:

HIGHER-ORDER MOMENTS
In [16], we obtained recursions which describe the dynamics of first-and second-order moments of the sampling distribution of a standard PSO during stagnation, as summarised in the previous section.One may then wonder whether it would be possible to follow a similar approach to study the dynamics of higher-order moments.

Derivation of dynamic equations for higher-order moments
The fundamental question is "what quantities would we have to deal with if we took higher powers of both sides of (2) as we did to derive (5)?" Generally, the r.h.s.would be a sum of terms of the form where a k are suitable constants.Naturally, taking powers of (2) and then multiplying both sides by some power of x t would also lead to equations involving terms such as those in (13).That is, for any choice of b 1 ∈ N and b 2 ∈ N, where a ki are suitable constants.If we then take expectations for both sides, we obtain where we used the independence of φ 1 , φ 2 , x t x t−1 , and, of course, their powers.If φ 1 and φ 2 are uniformly distributed in the same range, namely, [0, c], it is easy to verify that where It is important to note here that because (2) is linear in x t and x t−1 , all the terms on the r.h.s. of ( 14) and ( 17) respect the relation a 1i + a 2i ≤ b 1 + b 2 .This implies that it is possible to construct recursions for moments of arbitrary order.

Complexity of the calculations
If one wanted to push the analysis developed in Section 2 up to order 3, one would need to instantiate (17) for and add the resulting equations to ( 4), (6), and (7).If one wanted to go to order 4, an additional set of four equations (for , and E[x t+1 x 3  t ]) would be needed, bringing the total to 10.More generally, in order to compute statistics of order n, one needs to construct and iterate second-order difference equations.Since, after expansion, the r.h.s. of (2) contains 7 atomic terms of the form in ( 13), the r.h.s. of (17) contains 7 b1 such terms (Note that the exponent b 2 does not influence the number of terms because the recursion for E[x b1 t+1 x b2 t ] is obtained as follows: (a) we compute x b1 t+1 , which is given by an expression containing 7 b1 terms; (b) we multiply each term by x b2 t , which changes the exponents a 1i but does not alter the number of terms; (c) we apply the expectation operator, which again does not modify the number of terms.)So the total number of terms one needs to compute to construct the equations for order-n statistics is 7 (for the order 1 equations) plus (7 2 + 7) (for the order 2 equations) plus (7 3 + 7 2 + 7) (for the order 3 equations), and so forth.This gives us a total of terms.Note that T(n) grows exponentially approximately as 1.36 × 7 n .So although the number of equations one needs to deal with grows quadratically, the computational effort required to instantiate them is exponential.For instance, The growth in number of terms can be reduced if one makes explicit use of ρ (i.e., by adding the factor ρ a8 in ( 13)).Then T(n) = O(6 n ).Either way, manually deriving equations for moments of order 3 is already very laborious.The process, however, is clearly mechanisable.This can be done using computer algebra systems, or by explicitly representing and manipulating the ω i 's in each equation (this is what we did).As a result of mechanisation, computing the equations for up to order 6 or 7 is feasible with an ordinary personal computer.Some of the ω i 's in (17) present the same pattern of exponents for w, c, y, and y, so terms can be collected leading to more compact equations (e.g., compare ( 5) and ( 6)).Also, given their size, one will normally want to study (e.g., integrate) (17) numerically.In this case, w, c, y, and y are all numerical parameters.So the ω i 's become constants and, after collecting terms, each equation contains at most Q(n) terms, which, as we know, is quadratic in the order n.As a result, although the complexity of the construction of the motion equations for the moments is exponential in the order of the moments, their numerical integration is only of order O(n 4 ).Naturally, the system of Q(n) second-order difference equations necessary to predict the dynamics of moments of order 1 to n can be turned into a system of order 1 of the form in (8), via the choice This effectively means adding artificial update equations of the form E The transition matrix for the system is, therefore, of size Q (n) × Q (n).We will denote this with M n .For example, for n = 4, which would allow one to study the mean, variance, skewness, and kurtosis of the sampling distribution as a function of t, M 4 is merely a 14 × 14 matrix.
Interestingly, Q (n) grows so slowly that one can perform an eigenvalue analysis for any M n that one is able to compute.
That is, the expensive part of the process is the construction of M n .Once this is done, iterating the system, establishing its stability, or finding its fixed points is a trivial matter.
In the next section, we provide results for statistics of order 3 and 4, that is, n = 4, a value of n for which computing M n takes only a few seconds.However, before we do this, we need to consider the initial conditions for the system.

Initial conditions
In order to provide initial conditions for the moments' dynamics, we need to compute E[x k 0 ] and E[x k 1 x l 0 ] for generic k > 0 and l ≥ 0.
If we are at the very first iteration of the PSO algorithm, the initial conditions for the dynamics of the moments are related to the ranges used to initialise positions and velocities in the PSO.Under the assumption that a particle's initial position, x 0 , is chosen uniformly at random in a symmetric range [−Ω, Ω], we have In order to compute E[x k 1 x l 0 ], we need to consider the equation where a particle's initial velocity, v 0 , is a stochastic variable uniformly distributed the range By taking the kth power of both sides of the equation, multiplying by x l 0 , and taking expectations, as we did to construct (17), one obtains the desired expressions for E[x k 1 x l 0 ].Like for (17), these expressions contain a number of terms that grows exponentially with n.However, this process, too, can be trivially mechanised.
If the PSO is not at its first iteration, and a new improvement has just been found to a particle's personal best or neighbourhood best, then x 0 is not a stochastic variable, but a constant determined by the position where the particle found the last improvement.

SKEWNESS AND KURTOSIS OF THE PSO's SAMPLING DISTRIBUTION
We constructed the recursions for moments of up to order 4 as described in the previous section for the canonical PSO.Naturally, for higher-order moments, we can do exactly the same type of eigenvalue analysis we did in [16] for the mean and standard deviation of the sampling distribution.For example, Figure 1 shows the magnitude of the largest eigenvalue, Λ m , of M 4 as a function of the parameters w and c.The system is order-4 stable, that is, mean, variance, skewness, and kurtosis have a stable fixed point, whenever Λ m < 1, that is, within the curved region enclosed by the contour shown in the figure.
For easier comparison, we show the lines where Λ m = 1 for M 1 , M 2 , M 3 , and M 4 in Figure 2 (ordered from top to bottom, resp.).The regions of order-1, -2, -3, and -4 stability are nested.Note how the Λ m = 1 lines for M 2 and M 3 coincide for many values of w.Note also that the standard setting, c = 1.49618 and w = 0.7298, lays within the narrow region of order-3 stability.This implies that while mean, variance, and skewness of the standard PSO tend to a fixed point, kurtosis is unstable and will tend to grow indefinitely.Interestingly, a growth in the kurtosis of samples was observed by Kennedy [17], although this was effectively computed under the assumption that the sampling distribution is time independent.So the values of x t recorded in a run at t grows were treated as different samples from the same distribution, while we now know this is incorrect.That the predictions of the model are exact is also confirmed by the comparison of the dynamics of predicted and recorded higher-order moments.Figure 3(a) shows a comparison between the skewness E[(x t − μ t ) 3 ]/σ 3 t computed using our model and the average positions of the particle recorded in one billion (1 000 000 000) real runs in the first 30 iterations for the case c = 1.49618, w = 0.7298, y = 0, y = 1, and Ω = 5.As one can see, there is a very good match between the model's predictions and the stagnation behaviour of particles in real runs.Only after about 20-25 generations, the sampling errors start accumulating enough to show significant differences.We know that these differences are due to sampling errors because before performing 1 000 000 000 runs, we attempted to use much smaller sets of runs.With fewer runs, we observed that errors were visible well before generation 25.For example, with 1 000 000 runs, there is a good match up to around iteration 15.
As shown in Figure 3(b), the model also predicts very well the behaviour of the (excess) kurtosis E[(x t −μ t ) 4 ]/σ 2 t −3 of the sampling distribution.(Following standard practice, in this paper whenever we use the term "kurtosis," we will refer to the excess kurtosis E[(x t − μ t ) 4 ]/σ 2 t − 3. The excess kurtosis of the normal distribution to 0.) Note that for c = 1.49618 and w = 0.7298, the system is order-3 stable, and so, although the oscillations of the skewness shown in Figure 3  is actually only a transient effect, as shown in Figure 4 where we integrate the equations over 200 generations instead of 30.

COMPARISON BETWEEN PSO's
In the previous sections, we studied the canonical PSO with the restriction that the acceleration coefficients, c 1 and c 2 , were identical: c 1 = c 2 = c.One may wonder, however, whether allowing such coefficients to differ would produce qualitatively very different dynamics.For example, what would happen if we set one of the c i 's to zero as in a purely cognitive or purely social PSO model?This effectively would reduce to one the sources of random influences on a parti- cle's dynamics.Conversely, one might wonder what would happen if we increased the number of such sources of influence, as is done, for example, in the fully informed particle swarm (FIPS) [18,19].

Moments of generalised PSO's
To answer these (and other) important questions on the sampling distribution of different PSO models, we adopt a FIPSlike general class of PSO's described by the following difference equation: where the φ i 's are stochastic variables uniformly distributed in the range [0, c i ], c i are constants, and the y i 's are the personal best positions of neighbours of the particle (the particle itself may be included in its own neighbourhood).Naturally, this equation can be converted into the following: which is a generalisation of (2).All of the steps we performed in Section 3 can be repeated for (25).These lead to recursions of the form in (17) with the only difference that the coefficients ω i take the more general form where a 0i , a wi , a c1i , . . ., a cmi , a y1i , . . ., a ymi are appropriate constants.
Because (25) contains 3 + 2 × m terms, the complexity of the expansion now grows as O((3 the order of the moments we are interested in.So the larger m, the heavier the computation load required to compute M n .Once the transition matrices M n are computed, however, they are exactly of the same size for all PSO models within the class defined by (24).Initial conditions can be found following the approach described in Section 3. Calculations are expensive but can be mechanised.We did this for the examples described below.

Three PSO's we want to compare
An extensive comparison of different PSO's is beyond the scope of this paper.However, as an example the kind of comparisons one can make using our approach, we considered the PSO's in (24) with N = 3.Within this class of PSO's, we considered three variants as follows: (a) a purely social variant of PSO, which we will call social PSO for brevity, where c 1 > 0 and c 2 = c 3 = 0 (due to symmetries, the behaviour of a purely cognitive PSO where c 2 > 0 and c 1 = c 3 = 0 is effectively identical to that of this social PSO); (b) the canonical PSO we have studied so far in the paper, which is obtained by setting c 1 = c 2 > 0 and c 3 = 0; (c) the simplest version of FIPS with a neighbourhood of three individuals, for example, obtained using an lbest topology and an interaction radius of 1, where c 1 = c 2 = c 3 > 0. We will call this version FIPS3.
In order to perform a fair comparison of the stability properties of these PSO variants, we study them in conditions where the sum of the amplitudes of the random components, φ i , is identical across models.That is, we set c = i c i , and compare models with the same c value.Again, we analyse eigenvalues.

Results of the comparison
Figures 5-8 show the lines in the (w, c) plane where the magnitude of the largest eigenvalue of M n , Λ m is 1 for n = 1, 2, 3, 4 and for the three PSO variants mentioned above.Each figure represents a different moment.In the plot for the nth moment, the region below each line represents the set of all (w, c) pairs for which the nth moment of the distribution of the PSO corresponding to that line is stable.Let us analyse these figures in detail.
Firstly, we should note that the regions of order-1 stability for the three models are identical.This is because the dy- namics of the mean of the three models is governed by equations of the same form, namely, where the constant term may differ in different PSO variants.(This is irrelevant for the stability of the system, since stability is determined by the homogeneous part of the equation.)Note also that the rightmost point in each plot is an artifact due to the fact that, at w = 1, Λ m = 1 for M 1 irrespective of the value of c.
The regions where the variance is stable for the three models, instead, are different, with FIPS3 having the largest region of order-2 stability, followed by the canonical PSO, and finally, by the social PSO.Exactly the same happens with skewness (Figure 7) and kurtosis (Figure 8), with the order-3 stability region largely coinciding with the order-2 ones also for FIPS3 and the social PSO.
These results are counter intuitive.One would expect that the more sources of randomness, the φ i 's, there are, the more a PSO should be unstable.However, the exact opposite happens.The social PSO, where the only influence is φ 1 , is the least stable of all models, while FIPS3, which has three sources of randomness, is the most stable.What are the reasons for this behaviour?
We can understand this by rewriting (25) as follows: where Φ m = m i=1 φ i and Ψ m = i φ i y i .Both Φ m and Ψ m are the sum of independent and uniformly distributed variables: the variables φ i in the case of Φ m and the variables y i φ i in the case of Ψ m .We know that i c i = c.To simplify our treatment, let us further assume that the φ i 's are i.i.d., that is, that all c i are identical, and so c i = c/m.We can then apply the central limit theorem to Φ m .This predicts that for sufficiently large m, the distribution of Φ m is approximately Gaussian with mean So the larger m, the smaller the variance of Φ m and the stochasticity of (28).
In the case of the stochastic variable Ψ m , the quantities φ i y i are not identically distributed even if all c i are identical.This is because, in principle, each y i may be different.This prevents the use of the standard central limit theorem.We can, however, apply Lyapunov's central limit theorem to Ψ m .The conditions for its application are as follows: (1) the variables φ i y i must have finite mean, which is the case since μ i = E[φ i y i ] = c i y i /2 = c y i /(2m), (2) the variables φ i y i must have finite variance, which, again, is the case since σ ) the variables φ i y i must have finite third central moment, which is satisfied since r 3 i = E[(φ i y i − μ i )) 3 ] = 0, and finally, (4) the Lyapunov condition, lim m→∞ (( 2 ) = 0, must be satisfied, which, again, is the case since all r 3 i = 0. Then for sufficiently large m, also the distribution of Ψ m is approximately Gaussian with mean i c y i /(2m) = (c/2) × ( i y i /m) and variance (c 2 /12m) × ( i y 2 i /m).Note that i y i /m and i y 2 i /m are the mean y i and the mean y 2 i , respectively.So these are finite quantities if, as is normally the case, all y i are finite.(PSO search is normally confined to a prefixed, finite region of R N , and so all y i must be finite.)So like for Φ m , the larger m, the smaller the variance of Ψ m , and consequently, the less the stochasticity of (28).
Effectively, the larger m, the more Φ m and Ψ m become deterministic and approach constant values.This explains why adding more and more sources of randomness-while keeping c constant-produces progressively more and more stable PSO's.histograms over 1 000 000 real runs.Snapshots at times t = 0, 2, 4, 12, and 24 are shown.For each theoretical sampling distribution, we report the parameters of the corresponding GLD.

THE DENSITY FUNCTION OF THE CANONICAL PSO's SAMPLING DISTRIBUTION
The technique described in this paper, in principle, would allow one to determine all the moments of the sampling distribution of the PSO at all times.The question then is "could we derive the PSO sampling distribution itself ?"The answer is of course in the positive since knowing all the moments of a distribution implies knowing its moment generating function.This, in turn, allows one to obtain the density function of the distribution via inverse Laplace transform.In practice, however, it is impossible to compute all the moments of the PSO sampling distribution.This is for two reasons.Firstly, there are infinitely many such moments.Secondly, as we have seen in the previous sections, the cost of computing moments is exponential in the order of the moments.The next question is then "to what extent can we still reconstruct the PSO's density function from a finite number of moments?"This is an instance of the well-known truncated moment problem, a difficult, inverse ill-posed problem for which many approaches have been proposed.Here, we consider only one such approach.
A particularly simple idea is to consider a family of density functions f (x; λ 1 , λ 2 , . ..) with parameters λ 1 , λ 2 , and so forth, with sufficient expressive power to represent distributions with widely different shapes, with more or less asymmetries, with tails of different characteristics, and so on.Then one can use optimisation techniques to identify the parameters of the distribution f which minimise the difference between the moments of f and the moments of the PSO's in a typical EDA the whole new population is generated from a single distribution, in a bare-bone PSO, each individual has a personal sampling distribution from which only one sample is drawn at each iteration.(The sampling distributions of different particles may share some characteristics, since the sampling distribution of a particle is determined by both personal best and neighbourhood best, and the neighbourhood best can easily be the same for different particles.)Nonetheless, one could still interpret a bare-bone PSO as an unusual type of EDE in which multiple distributions are used and updated based on fitness during the search.

1 Figure 1 :
Figure 1: Magnitude of the largest eigenvalue of M 4 as a function of the parameters w and c.The curved line on the surface encloses the order-4 stable region.
(a)   appear to grow bigger and bigger, suggesting instability,

Figure 3 :
Figure 3: Comparison between predicted and experimental skewness and (excess) kurtosis of the PSO sampling distribution for c = 1.49618, w = 0.7298, y = 0, y = 1, and Ω = 5.Kurtosis grows exponentially, and so it is plotted on a logarithmic scale.The first point is not plotted because the excess kurtosis was negative (−1.2).

Figure 5 :Figure 6 :
Figure 5: Lines below which the mean of the sampling distributions for a social PSO, a canonical PSO, and FIPS with a neighbourhood of three individuals have a fixed point (order-1 stability).The three lines coincide.

Figure 7 :
Figure 7: Lines below which a social PSO, a canonical PSO, and FIPS3 are order-3 stable (from bottom to top, resp.).

Figure 8 :
Figure 8: Lines below which the kurtosis of the sampling distributions for a social PSO (bottom), a canonical PSO (middle), and FIPS3 (top) have a fixed point (order-4 stability).

Figure 10 :
Figure10: Estimates of the sampling distribution of a canonical PSO with parameters c = 1.49618, w = 0.7298, y = 0, y = 10, and Ω = 5 during stagnation, reconstructed via GLD best fitting vers.histograms over 1 000 000 real runs.Snapshots at times t = 0, 2, 4, 12, and 24 are shown.For each theoretical sampling distribution, we report the parameters of the corresponding GLD.