Linkage-Based Distance Metric in the Search Space of Genetic Algorithms

We propose a new distance metric, based on the linkage of genes, in the search space of genetic algorithms. This second-order distance measure is derived from the gene interaction graph and first-order distance, which is a natural distance in chromosomal spaces. We show that the proposed measure forms a metric space and can be computed efficiently. As an example application, we demonstrate how this measure can be used to estimate the extent to which gene rearrangement improves the performance of genetic algorithms.


Introduction
Distance metrics are fundamental tools for organizing search spaces, because the introduction of a metric is the simplest way to induce a topology [1]. Different metrics produce different topologies and thus change the shape of the search space. When a space is to be searched by a genetic algorithm (GA), a good distance metric facilitates navigation of the space [2][3][4][5] and can also improve the effectiveness of search [6][7][8][9][10][11][12]. Hamming distance is a popular metric in a discrete space that is to be searched by a GA. Hamming distance has also been widely used in analyses of solution spaces [13][14][15].
Fitness distance correlation (FDC), proposed by Jones and Forrest [14], is a measure of the effectiveness of a distance metric in a space to be searched by a GA. An FDC is obtained by measuring the correlation between fitness and the distance to the nearest global optimum for a number of sample solutions. FDC coefficients range from −1 to 1, where higher values suggest increased difficulty in maximizing fitness and decreased difficulty in minimizing fitness. When a GA is hybridized with a local optimization, the population consists entirely of local optima, and it is then more useful to determine FDCs of local-optimum spaces.
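The FDC computation described above can be sketched as follows. This is a minimal illustration, assuming the correlation is the ordinary Pearson coefficient and that the distance function is supplied by the caller; the names `fdc`, `hamming`, and `nearest_optimum_distance` are hypothetical and do not come from [14].

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(p != q for p, q in zip(a, b))


def nearest_optimum_distance(solution, optima, dist=hamming):
    """Distance from a sample solution to the closest global optimum."""
    return min(dist(solution, opt) for opt in optima)


def fdc(fitnesses, distances):
    """Pearson correlation between fitness values and distances.

    Assumes both lists have the same length and nonzero variance.
    """
    n = len(fitnesses)
    mean_f = sum(fitnesses) / n
    mean_d = sum(distances) / n
    cov = sum((f - mean_f) * (d - mean_d) for f, d in zip(fitnesses, distances))
    var_f = sum((f - mean_f) ** 2 for f in fitnesses)
    var_d = sum((d - mean_d) ** 2 for d in distances)
    return cov / math.sqrt(var_f * var_d)


# Toy maximization problem (illustrative only): fitness is the number of ones,
# so the single global optimum is the all-ones string.
samples = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]
optima = [[1, 1, 1, 1]]
fitnesses = [sum(s) for s in samples]
distances = [nearest_optimum_distance(s, optima) for s in samples]
toy_fdc = fdc(fitnesses, distances)
```

In this toy case fitness rises exactly as the distance to the optimum falls, so the coefficient sits at the easy end of the range for a maximization problem.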
In this paper, we propose a new distance measure which takes account of gene interaction and show that it forms a metric space. We use this metric to compute FDCs of search spaces and show that FDCs obtained in this way correlate better with the improvement in GA performance that can be obtained by gene rearrangement. The remainder of this paper is organized as follows. In Section 2, we review gene rearrangement in GAs. In Section 3, we propose a new distance measure for GAs, show that it forms a metric space, and demonstrate an application. Finally, we draw conclusions in Section 4.

Gene Rearrangement
Holland's schema theorem [16] shows that schemata (i.e., groups of genes) with high fitness, short defining length, and low order have high probabilities of survival in a standard GA.
These durable schemata are called building blocks. They make a major contribution to fitness and have a high degree of mutual interaction. The performance of a GA is strongly dependent on the survival and reproduction of these building blocks.
The survival probability of a gene group through a crossover is strongly affected by the positions of genes in the chromosome. Schemata consisting of genes in scattered positions tend to be too long to survive. Thus, the strategy used for placing genes significantly affects the performance of a GA. Inversion is an operator which changes the location of genes while a GA is running [17], and the process of rearranging genes dynamically to improve performance is called linkage learning [18]. Messy GA [19] is an example of a technique that implicitly uses dynamic gene rearrangement.
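The dependence of schema survival on gene positions can be made concrete with a small sketch. It assumes one-point crossover with the cut point chosen uniformly among the gaps between adjacent genes, in which case the chance that the cut falls inside a schema equals its defining length divided by the number of gaps. The function names are hypothetical.

```python
def defining_length(positions):
    """Distance between the first and last fixed positions of a schema."""
    return max(positions) - min(positions)


def disruption_bound(positions, chromosome_length):
    """Upper bound on the probability that one-point crossover cuts the schema.

    Assumes the cut point is uniform over the chromosome_length - 1 gaps.
    """
    return defining_length(positions) / (chromosome_length - 1)


# Three interacting genes placed side by side versus scattered over 10 loci.
compact = disruption_bound([2, 3, 4], 10)    # cut must fall in one of 2 gaps
scattered = disruption_bound([0, 5, 9], 10)  # every cut point splits the schema
```

The same three genes are far more exposed to disruption when scattered, which is the effect that gene placement strategies try to avoid.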
It has been observed that the performance of GAs on problems with a locus-based encoding can be improved by rearranging the indices of the genes before running the GA. Static gene rearrangement was first suggested by Bui and Moon [20,21], who rearrange genes within a chromosomal representation to improve the quality of schemata and to help the GA to preserve the better schemata. Many studies on the static rearrangement of gene positions [20][21][22][23][24] have shown performance improvements. However, the improvement in performance achieved in this way has been shown to vary greatly between problem instances. This motivated us to develop a distance metric to improve our ability to estimate how much improvement in the performance of a GA on a particular problem instance can be expected through gene rearrangement.
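As an illustration of what a static rearrangement might look like, the sketch below relabels genes in breadth-first order over a gene interaction graph, so that interacting genes receive nearby indices. Breadth-first ordering is used here only as a simple stand-in heuristic; it is an assumption of this example and is not claimed to be the procedure of [20,21]. The function names and the example graph are hypothetical.

```python
from collections import deque


def bfs_gene_order(num_genes, edges):
    """Return gene indices in breadth-first order over the interaction graph."""
    neighbors = [[] for _ in range(num_genes)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    order = []
    seen = [False] * num_genes
    for start in range(num_genes):  # restart in every connected component
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        while queue:
            gene = queue.popleft()
            order.append(gene)
            for other in sorted(neighbors[gene]):
                if not seen[other]:
                    seen[other] = True
                    queue.append(other)
    return order


def rearrange(chromosome, order):
    """Place the gene originally at index order[k] at new locus k."""
    return [chromosome[gene] for gene in order]


# Hypothetical interaction graph: genes 0-3-5 interact, as do genes 1-2-4.
edges = [(0, 3), (3, 5), (1, 2), (2, 4)]
order = bfs_gene_order(6, edges)
relabelled = rearrange(list("abcdef"), order)
```

After relabelling, each group of interacting genes occupies three consecutive loci, so the corresponding schemata have a shorter defining length than in the original encoding.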

A Linkage-Based Distance Measure
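3.1. Second-Order Distance Measure. The most common first-order distance measure in discrete space is the Hamming distance, which is also a natural distance in chromosomal space, although there are other first-order distance measures, such as the quotient metric in redundant encoding [11]. We now define a second-order distance measure derived from first-order distance. Given a problem instance $P$, consider the unweighted undirected graph $G_P$ representing first-order gene interaction [23], which is the pairwise interaction of genes. For convenience, we will assume that each gene has an interaction with itself, so that $\{v, v\} \in E(G_P)$ for each gene $v \in V(G_P)$. Let $M_P$ be the adjacency matrix of $G_P$ and consider $M_P$ as a binary matrix over $\mathbb{Z}_2$ [25][26][27]. When $M_P$ is invertible, the second-order distance between two chromosomes $x$ and $y$ is $d_P(x, y) := \|M_P^{-1}(x \oplus y)\|$, where $\oplus$ is the bitwise XOR and $\|\cdot\|$ is the number of nonzero entries of a vector. The distance $d_P$ is a metric.
Proof. It is enough to show the following four conditions [1].
(i) Non-negativity: $d_P(x, y) \geq 0$, since $\|\cdot\|$ counts entries.
(ii) Identity of indiscernibles: $d_P(x, y) = 0$ if and only if $M_P^{-1}(x \oplus y) = 0$, that is, $x \oplus y = 0$, that is, $x = y$.
(iii) Symmetry: $x \oplus y = y \oplus x$, so $d_P(x, y) = d_P(y, x)$.
(iv) Triangle inequality: since $x \oplus z = (x \oplus y) \oplus (y \oplus z)$ and $M_P^{-1}$ is linear over $\mathbb{Z}_2$, $d_P(x, z) = \|M_P^{-1}(x \oplus y) \oplus M_P^{-1}(y \oplus z)\| \leq d_P(x, y) + d_P(y, z)$.
If the inverse of $M_P$ does not exist, we can extend the scope of the distance metric using a well-defined formulation in terms of $\arg\min_z \|(x \oplus y) \oplus M_P z\|$. We note that if the inverse of $M_P$ exists, then $z := M_P^{-1}(x \oplus y)$ gives $(x \oplus y) \oplus M_P z = 0$, and hence $\arg\min_z \|(x \oplus y) \oplus M_P z\| = M_P^{-1}(x \oplus y)$. Our second-order distance and its extension can be computed in $O(n^3)$ by a variant of Gauss-Jordan elimination [28], where $n$ is the number of genes.

Figure 1: (a) An example of a first-order gene interaction graph $G_P$ and (b) distances between two example chromosomes $x$ and $y$ (second-order distance = 2).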
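The computation by elimination can be sketched as below. The sketch assumes that the second-order distance between two chromosomes is the number of ones in the solution z of M z = x ⊕ y over Z_2, where M is the adjacency matrix of the gene interaction graph with ones on the diagonal. It handles only the invertible case and raises an error otherwise; the extension to singular matrices is not implemented. The function name and the example matrix are hypothetical.

```python
def second_order_distance(matrix, x, y):
    """Count the ones in z, where matrix * z = x XOR y over GF(2).

    matrix: list of n rows, each a list of n bits (self-loops on the diagonal).
    x, y:   chromosomes given as lists of n bits.
    Raises ValueError if the matrix is singular over GF(2).
    """
    n = len(matrix)
    # Augmented system [M | x XOR y]; rows are copied so inputs stay untouched.
    rows = [list(matrix[i]) + [x[i] ^ y[i]] for i in range(n)]

    for col in range(n):
        # Find a row at or below the diagonal with a 1 in this column.
        pivot = next((r for r in range(col, n) if rows[r][col]), None)
        if pivot is None:
            raise ValueError("matrix is singular over GF(2)")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Gauss-Jordan step: clear this column in every other row (XOR = add).
        for r in range(n):
            if r != col and rows[r][col]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[col])]

    z = [rows[i][n] for i in range(n)]
    return sum(z)


# Illustrative 6-gene graph: two paths 1-2-3 and 4-5-6, with self-loops.
M = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
x = [0, 0, 0, 0, 0, 0]
y = [1, 1, 1, 0, 1, 1]
d2 = second_order_distance(M, x, y)
```

Each of the n pivot columns triggers at most n row updates of length n + 1, which gives the cubic running time. Packing rows into machine words would speed up the XOR step but is omitted for clarity.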

An Application.
Intuitively, our measure of the distance between two chromosomes can be understood as the minimum number of bits that must be changed to transform one chromosome into the other in the genetic process using optimal gene rearrangement. Given an undirected graph $G = (V, E)$ with edge weights $(w_{ij})_{(i,j) \in E}$, the max-cut problem is that of finding a subset $S \subset V$ which maximizes the sum of the weights of the edges that cross the cut $(S, V \setminus S)$ [29][30][31]. Consider a 6-node max-cut problem instance $P$ whose objective is to be maximized, where a vertex $v_i$ belongs to the position $x_i \in \{0, 1\}$ and $\oplus$ is the Boolean XOR operator. In this problem instance, edges $\{v_1, v_2\}$ and $\{v_2, v_3\}$ increase the fitness and edges $\{v_4, v_5\}$ and $\{v_5, v_6\}$ reduce the fitness. In the max-cut problem, the given graph with its edge weights removed can be regarded as the first-order gene interaction graph (see, e.g., Figure 1(a)). Figure 1(b) shows an example in which the Hamming and second-order distances between two chromosomes $x$ and $y$ are obtained by optimal gene arrangement of the gene interaction graph $G_P$. In this example, $x \oplus y = (1\ 1\ 1\ 0\ 1\ 1)^T$, $M_P^{-1}(x \oplus y) = (0\ 1\ 0\ 0\ 0\ 1)^T$, and hence $\|M_P^{-1}(x \oplus y)\| = 2$, whereas $\|x \oplus y\| = 5$. If we use the normalized Hamming distance (developed for the 2-grouping problem) [32,33] as the first-order distance measure, the FDC of this problem is −0.50. But when our second-order distance is used, the FDC becomes −0.95.
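The arithmetic of this example can be checked by hand. The block below assumes that the interaction graph of Figure 1(a) contains exactly the four edges named in the text plus a self-loop on every gene; the symbols $M_P$ and $z$ are notation chosen for this check.

```latex
M_P =
\begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1
\end{pmatrix},
\qquad
z = (0\ 1\ 0\ 0\ 0\ 1)^T .

% M_P z is the XOR of columns 2 and 6 of M_P:
M_P z = (1\ 1\ 1\ 0\ 0\ 0)^T \oplus (0\ 0\ 0\ 0\ 1\ 1)^T
      = (1\ 1\ 1\ 0\ 1\ 1)^T = x \oplus y .

% Each 3x3 diagonal block has determinant 1 over Z_2, so M_P is invertible and
z = M_P^{-1}(x \oplus y), \qquad \|z\| = 2, \qquad \|x \oplus y\| = 5 .
```

Under this assumption, flipping genes 2 and 6 together with their interaction neighbours accounts for all five differing bits, which is why the second-order distance is 2 while the Hamming distance is 5.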
Given a graph $G = (V, E)$ and its adjacency matrix $A = (a_{ij})$, the graph bipartitioning problem is that of minimizing an expression in which $a_{ij} \in \{0, 1\}$, a vertex $v_i$ belongs to the position $x_i \in \{0, 1\}$, and the second term carries a positive constant that penalizes unbalanced partitions. If we ignore the second balancing term altogether, we can regard the given graph as the first-order gene interaction graph of the given problem instance. Bui and Moon [21] tried gene rearrangement in a GA for graph bipartitioning and obtained dramatic improvements in performance for some graphs. We hypothesized that FDCs calculated using our second-order distance would help identify graphs that could benefit most from gene rearrangement, in terms of GA performance. Figure 2 shows the relationship between FDC and the performance improvement of a GA on 16 benchmark graphs (8 random graphs and 8 random geometric graphs) that were used in [34][35][36][37][38][39][40].
Here, the performance improvement means the difference in percentage between the average performances of a GA with and without gene rearrangement (data from [21]).The FDC values were approximated from 10,000 randomly generated local optima.When the first-order (normalized Hamming) distance was used, there was little correlation with the change in performance, but our second-order distance provided a clear correlation (see Figure 2(b) and Table 1).

Concluding Remarks
In most previous work, distances among chromosomes in GAs have usually been first-order distances, and in particular Hamming distance. We have proposed a second-order distance measure for GAs, which we consider to be more meaningful. We have shown that this distance measure forms a metric space and that it can be computed efficiently.
Using second-order distance allows us to see problem spaces from a different viewpoint.We have demonstrated its value in predicting the effectiveness of gene rearrangement, and we envisage it providing further understanding of the working mechanism of GAs.

Disclosure
A preliminary version of this paper appeared in the Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1393-1399, 2005.

Figure 2: Correlation of gene rearrangement with FDC values computed using first- and second-order distance.

Table 1: Effect of gene rearrangement on FDCs computed using first- and second-order distance.