Multiple Sequence Alignment Using a Genetic Algorithm and GLOCSA

Algorithms that minimize putative synapomorphy in an alignment cannot be directly implemented, since trivial cases with concatenated sequences would be selected because they would imply a minimum number of events to be explained (e.g., a single insertion/deletion would be required to explain divergence between two sequences). Therefore, indirect measures to approach parsimony need to be implemented. In this paper, we thoroughly present a Global Criterion for Sequence Alignment (GLOCSA) that uses a scoring function to globally rate multiple alignments, aiming to produce matrices that minimize the number of putative synapomorphies. We also present a Genetic Algorithm that uses GLOCSA as the objective function to produce sequence alignments by refining alignments previously generated by other existing alignment tools (we recommend MUSCLE). We show that in the example cases our GLOCSA-guided Genetic Algorithm (GGGA) does improve the GLOCSA values, resulting in alignments that imply fewer putative synapomorphies.


Introduction
The use of DNA (deoxyribonucleic acid) or protein sequences for different purposes has greatly increased as the technology for DNA and protein sequencing has improved, with the consequent cost reduction. Evidence of this is the enormous amount of information available in the Protein Data Bank [1] or in GenBank [2]. The exponential growth in size of these data repositories goes in parallel with the increasing need for tools to manage and analyze the valuable information they contain. The first step to make this information manageable is to devise tools to identify comparable proteins or DNA fragments, as well as comparable protein or DNA sequence units (amino acids and nucleotides, resp.). This process is referred to as sequence alignment. By aligning sequences, phylogenetic analyses can be carried out, PCR (polymerase chain reaction) primers constructed, and secondary or tertiary structures predicted, among other applications. Because this is such a central topic, algorithms to tackle sequence alignment have already been developed. Nevertheless, as we explain more thoroughly in Section 2.2.1, sequence alignment is not a trivial problem. To reduce this complex issue to tractable problems, most available software considers pairs of sequences at a time. Measures of alignment quality that globally use the entire data set (matrices consisting of more than two sequences) are currently unavailable.
In this paper, we thoroughly present a Global Criterion for Sequence Alignment (GLOCSA) that uses a scoring function to globally rate multiple alignments, aiming to indirectly use the parsimony criterion. We also propose an evolutionary computation technique suitable to optimize it. This novel objective function is coupled with a Genetic Algorithm (GA), the GLOCSA-Guided Genetic Algorithm (GGGA), which uses a compact representation of the alignments and five different mutation operators to explore the solution landscape. Although GGGA can be used for completely unaligned data sets, it is more efficient for refining alignments previously generated by other existing tools. Using GLOCSA as the scoring parameter,

Sequences and Alignments
2.1. Sequences. DNA consists of a unique sequence built from four repeating nucleotides. Each nucleotide is characterized by its nitrogenous base, and the sequence of bases represents the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information. Such sequences analogously exist for RNA (ribonucleic acid) and proteins [4,5].
In biochemistry, the primary structure of a biological molecule is the exact specification of its atomic composition and the chemical bonds connecting those atoms. For molecules of DNA, RNA, or proteins, specifying the primary structure is equivalent to specifying the sequence of its monomeric subunits, that is, the nucleotide or amino acid sequence [4,5].

Sequence Alignment. DNA sequences, RNA sequences, and the protein sequences they encode change through time, evolving mainly under the action of mutation. The simplest types of mutation are point mutations, which are substitutions of nucleotides or amino acids, and insertions/deletions, also known as indels. When one or both of two comparable sequences have undergone insertion and/or deletion mutations, they will differ in length (i.e., they will have a different number of nucleotides or amino acids). Because these mutations are normally not observable, it is necessary to deduce where they occurred in order to identify which nucleotides or amino acids originally occupied the same position (which ones are homologous). This is the alignment process. Although this could appear trivial, it is a complex task because a limited and a priori known number of minimum observable units exists for each aligned position (e.g., in the case of DNA only four nucleotides) and all positions have the same potential alternative conditions (e.g., in DNA each unaligned position needs to have one of the four nucleotides). As a result, a gap (an inferred indel) in a sequence can be placed in many positions without making a big difference with respect to the comparable sequence. To align two or more sequences, they are put together in an S × C matrix, where S is the number of sequences and C is the maximum number of residues in a sequence (positions in the alignment); the shorter sequences are filled at the end with gap codifications ("−") to fit the matrix perfectly. With one sequence in each row of the matrix, the process of alignment consists of inserting − codifications into the sequences (see Table 11). In order to choose the best alignment, it is considered that, in biological terms, the process of alignment has the objective of aligning homologous residues (those having the same evolutionary origin). Assuming that evolution is parsimonious, the aim when performing an alignment is to minimize the number of evolutionary changes (events of substitutions or indels) that the alignment implies [6].
Alignments can be either pairwise (two sequences only) or multiple (more than two sequences, up to an arbitrary number). For pairwise alignments, dynamic programming algorithms have been developed, such as the Needleman-Wunsch [7] and Smith-Waterman [8] algorithms. Pairwise alignments might be regarded as special cases of multiple alignment. In practice, however, the computational complexity of aligning multiple sequences is such that the corresponding algorithms are usually not straightforward extensions of the pairwise approaches. Instead, multiple alignments are often constructed by repeatedly merging pairwise alignments (progressive alignment) [6].

The Number of Possible Alignments of Two Sequences.
In order to define the complexity of finding an optimal alignment given an objective function, the number of possible alignments can be computed.
Given two sequences $S_1$ and $S_2$ of sizes $m$ and $n$, respectively, $f(m, n)$ can be defined as the number of alignments that can be formed between them.
Any possible alignment of $S_1$ and $S_2$ ends in one of these specific ways [6]:
$$\begin{pmatrix} S_1[m] \\ - \end{pmatrix}, \qquad \begin{pmatrix} S_1[m] \\ S_2[n] \end{pmatrix}, \qquad \begin{pmatrix} - \\ S_2[n] \end{pmatrix}. \tag{1}$$
That is, the last residue of $S_1$ aligned with a gap codification −, the last residues of $S_1$ and $S_2$ aligned with each other, or the last residue of $S_2$ aligned with a gap codification −.
Consider the effect of these three possible ends on the number of alignments that can be formed out of the remaining residues: the ending $(S_1[m]/-)$ removes one residue from $S_1$, $(S_1[m]/S_2[n])$ removes one residue from both sequences, and $(-/S_2[n])$ removes one residue from $S_2$. Following this, the next recursion can be written [6]:
$$f(m, n) = f(m-1, n) + f(m-1, n-1) + f(m, n-1). \tag{2}$$
Each of the terms on the right-hand side of (2) represents a possible end. In addition to this recursion, a stop criterion or boundary condition is needed:
$$f(m, 0) = f(0, n) = f(0, 0) = 1. \tag{3}$$
Using the recursion in (2) and the stop criterion in (3), the number of possible alignments of two sequences of equal length from 1 to 10, $m = n = 1, 2, \ldots, 10$, can be computed (Table 1), where it is evident that the number of alignments grows exponentially as the length of the sequences increases. It is then reasonable to assume that, with more sequences involved in an alignment, the number of possibilities grows even faster.
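As an illustration, the recursion over the three possible alignment ends can be evaluated directly with memoization. The sketch below is not part of the original work; the function name `count_alignments` is hypothetical, and the boundary condition is assumed to be that a sequence paired with an empty sequence admits exactly one alignment (all residues against gaps).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of alignments of two sequences of sizes m and n.

    Assumed boundary condition: if either sequence is empty, the only
    alignment pairs every remaining residue with a gap codification.
    """
    if m == 0 or n == 0:
        return 1
    # One term per possible ending of the alignment:
    # (S1[m] / -), (S1[m] / S2[n]), (- / S2[n]).
    return (
        count_alignments(m - 1, n)
        + count_alignments(m - 1, n - 1)
        + count_alignments(m, n - 1)
    )


if __name__ == "__main__":
    # Equal-length case, m = n = 1, ..., 10.
    for k in range(1, 11):
        print(k, count_alignments(k, k))
```

Under this boundary condition the values for m = n are 3, 13, 63, 321, and so on, which makes the rapid growth with sequence length easy to see.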

Sequence Alignment and Evolutionary Computation.
Evolutionary Computation (EC) has been previously used in the problem of multiple sequence alignment (MSA) [5], from Genetic Algorithms [9][10][11] to Evolutionary Programming [12,13]. Of these applications, SAGA (sequence alignment by genetic algorithm) [10] is considered the most relevant to the topic of this paper. One of the main advantages of EC is that it allows a good separation between the optimization process and the evaluation criterion (objective function). It is the objective function that defines the aim of any optimization procedure, and in the case of sequence alignment, it is also the objective function that summarizes the biological knowledge that is intended to be projected into the alignment.

Objective Functions. An alignment is considered to be correct if it reflects, at least in the case of DNA, the evolutionary history of the species of the sequences being aligned. But at the time of assessing the quality of an alignment, such evolutionary information is frequently not available, or not even known. It may also be the case that aligning a set of sequences is an intermediate step to produce an evolutionary hypothesis. Hence, alternatives must be sought, and measures of sequence similarity are a useful option. It is assumed that similar sequences share the same evolutionary origin [14], as long as the level of identity is outside the twilight zone (more than 30% identity over 100 positions). Nevertheless, assessing sequence homology by similarity has also been questioned [15,16].
Existing measures of similarity are obtained using substitution matrices ([17] for proteins). A substitution matrix assigns a cost to each possible substitution or conservation according to its probability of occurrence, computed from data analysis. In this approach insertions and deletions are penalized (gap penalty). The most common scheme for that purpose is to assign a cost to gap opening and to gap extension (affine gap penalties model), in order to favor alignments with smaller numbers of indels (each gap can be regarded as an indel event). The main disadvantage of these substitution matrices is that they are intended to rate the similarity between only two sequences at a time; to extend them to multiple sequences, it is common to add up each pairwise similarity to obtain the score for the multiple sequence alignment [5].
Every objective function defines a mathematical optimum (or a set of them), which is not necessarily the same as the biological optimum that is sought when aligning genetic sequences. This biological optimum can be said to arise as a consequence of the evolutionary history of the sequences in the alignment. An objective function is only as good as the resemblance of its mathematical optimum to the biological one. In order to make these two optima converge, biological knowledge must be integrated into the objective function [5].
SAGA [10] was used to optimize two different objective functions. A brief description of them is given as follows.
Weighted Sums of Pairs. Weighted Sums of Pairs is the objective function used by MSA [18]. The sums-of-pairs principle associates a cost with each pair of aligned codifications in each column of an alignment (substitution cost) and another, similar cost with the gaps (gap cost). The sum of these costs yields the global cost of the alignment. Major variations involve using (1) different sets of costs for substitutions (PAM matrices [17], BLOSUM tables [19]), (2) different schemes for the cost of gaps (quasinatural and natural [20]), and (3) different sets of weights associated with each pair of sequences due to evolutionary distance [21].
SAGA was first used to optimize the sums of pairs with quasinatural gap penalties.
COFFEE Score. COFFEE stands for Consistency-Based Objective Function For alignment Evaluation and is a measure of the level of consistency between a multiple alignment of a set of sequences and a library of all possible pairwise alignments of the same set of sequences. Evaluation is made by comparing each pair of aligned residues observed in the multiple alignment with the list of residue pairs that constitute the library. The consistency score is equal to the number of pairs of residues that are found simultaneously in the multiple alignment and in the library, divided by the total number of pairs observed in the multiple sequence alignment [5].
The main difference between the COFFEE function and the Weighted Sum of Pairs is the use of the library instead of the substitution matrix.
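The consistency ratio just described can be sketched in a few lines. This is only an unweighted illustration of the ratio as stated in the text, not the COFFEE implementation; the function names and the encoding of a residue as a (sequence index, residue index) tuple are assumptions made for the example, and gaps are written with the ASCII hyphen.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Set of residue pairs that share a column in an alignment.

    Each residue is identified as (sequence index, residue index), so a
    pair is ((i, a), (j, b)) with i < j. Gap codifications are skipped.
    """
    next_index = [0] * len(rows)  # next residue index per sequence
    pairs = set()
    for col in range(len(rows[0])):
        present = []
        for i, row in enumerate(rows):
            if row[col] != "-":
                present.append((i, next_index[i]))
                next_index[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def consistency_score(msa_rows, library_pairs):
    """Fraction of residue pairs of the alignment also found in the library."""
    msa_pairs = aligned_pairs(msa_rows)
    if not msa_pairs:
        return 0.0
    return len(msa_pairs & library_pairs) / len(msa_pairs)


if __name__ == "__main__":
    msa = ["AC-T", "A-GT", "ACGT"]
    # Hypothetical library holding two residue pairs.
    library = {((0, 0), (1, 0)), ((0, 1), (2, 1))}
    print(consistency_score(msa, library))
```

In practice the library would be filled from pairwise alignments of every pair of sequences; here it is supplied by hand to keep the example short.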

GLOCSA-A New Objective Function
The Global Criterion for Sequence Alignment (GLOCSA) is a newly proposed function to assess the quality of multiple sequence alignments of DNA. It has been built from the ground up with simplicity and a global approach in mind. By global it is understood that it rates the alignment as a whole, that is, with all sequences considered simultaneously, rather than taking pairs of sequences to score their corresponding alignment. It also takes the gaps into account, seeking to favor parsimony.

GLOCSA is composed of three individual criteria: Mean Column Homogeneity (MCH), Reciprocal of Gap Blocks (RGB), and Columns Increment (CI). These are combined in a polynomial with a set of corresponding weights ($w_{mch}$, $w_{rgb}$, and $w_{ci}$). These weights are set by default to the values shown in Table 2. These default values were determined empirically, adjusting them to assign better scores to better alignments using a set of artificial examples and some real-world alignments. The main problem faced when scoring alignments is that the exact evolutionary history of the involved sequences is never known. Theories can be stated about which alignment reflects the most plausible or probable evolutionary history (which is what produces the differences in the sequences), but certainty cannot be guaranteed.
Compared to the other schemes of sequence alignment evaluation rating them on a pair basis, such as weighted sum of pairs [18], GLOCSA has the advantage of rating the whole alignment at a time (with the Mean Column Homogeneity criterion).It also has the advantage of considering parsimony, favoring more concentrated gaps (with Reciprocal of Gap Blocks) and smaller alignment matrices (with Columns Increment).
At the moment it is intended to rate only multiple sequences of DNA composed of the standard IUB/IUPAC codifications for nucleic acids, shown in Table 3.
To score an alignment of multiple sequences, a matrix with C columns and S rows is considered, where C is the maximum number of positions in a sequence and S is the number of sequences in the alignment. Initially, to fit all the sequences in the matrix, gap positions ("−") are appended at the end of the shorter sequences.

Mean Column Homogeneity.
In the alignment matrix each position is represented in a column, and the column homogeneity has the purpose of rating the grade of diversity in the elements of a given position, scoring higher the more homogeneous columns.
The basic idea is that the occurrences of each of the four bases in a column are counted. A, C, G, and T are counted with a weight of 1.0, while polymorphisms are counted as an equal fraction of a unit for each base they represent (e.g., A counts 1.0 for A, while R is either G or A, so it counts 0.50 for G and 0.50 for A). Gaps are also counted, with a unit for each. Using these counts the column homogeneity for each column is computed.
The counts of bases and gaps are stored in $wc_{jt}$, $0 \le t \le 4$, where t is the index of the base or gap being counted and j is the column. These weighted counts are the result of adding up in $wc_{jt}$ the corresponding weight (shown in Table 4) for the codification of each sequence in the column. This can be expressed as
$$wc_{jt} = \sum_{i=0}^{S-1} T_w\bigl(t, am(i, j)\bigr),$$
where $am(i, j)$ is a function that retrieves the codification in sequence i at column j of the alignment, and the function $T_w(t, P_c)$ looks up in Table 4 the weight associated with the base t and the codification $P_c$ (in this case $P_c$ is given by $am(i, j)$).
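The weighted counting step can be sketched as follows. Table 4 is not reproduced here, so the weights are derived from the rule stated in the text (an ambiguity code contributes an equal fraction to each base it represents) together with the standard IUPAC nucleotide codes. Treating N as "any of the four bases" and skipping ? entries as unobserved are assumptions of this sketch; the names `IUPAC_BASES` and `weighted_counts` are hypothetical, and gaps are written with the ASCII hyphen.

```python
# Bases represented by each IUPAC nucleotide code (N is assumed here to
# mean any of the four bases).
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}
# Index t used in wc[t]: the four bases first, then the gap.
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3, "-": 4}


def weighted_counts(rows, column):
    """Weighted counts wc[0..4] (A, C, G, T, gap) for one alignment column."""
    wc = [0.0] * 5
    for row in rows:
        code = row[column].upper()
        if code == "?":
            continue  # unobserved position: contributes nothing
        if code == "-":
            wc[INDEX["-"]] += 1.0
            continue
        bases = IUPAC_BASES[code]
        for base in bases:
            # Each represented base receives an equal share of one unit.
            wc[INDEX[base]] += 1.0 / len(bases)
    return wc


if __name__ == "__main__":
    # Column with A, R (A or G), a gap, and an unobserved position.
    print(weighted_counts(["A", "R", "-", "?"], 0))
```

These counts are the only input the column homogeneity needs, so any scoring function built on top of them can reuse this step unchanged.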
After counting, the column homogeneity of a given column is computed using the following formula: It is to be noted that in the numerator of the fraction only the four bases are considered (A, C, G, and T indexed by 0, 1, 2, and 3), and in the denominator the gap (−, indexed by 4) is considered along with the bases.This is considered in order to penalize the insertion of gaps, assuming that as the gaps are not counted in the numerator but they are counted in the denominator, the column homogeneity value decreases when there are more gaps.
In the case that a position in a sequence has a ? codification, that position for that sequence is discarded (as it was not observed) when computing the column homogeneity value. This is because a ? implies that the sequence has no information in that position.
A special consideration is taken when all the elements in a column are gap codifications (−): in that case the column homogeneity is given a value of zero, to penalize the existence of such columns.
When the column homogeneity value for all the columns has been computed, the mean value is obtained and that is the Mean Column Homogeneity.
This criterion gives higher scores to more homogeneous columns, penalizing diversity of bases in a column (as shown in the examples of Table 5).

Reciprocal of Gap Blocks.
The gap codifications ("−") which are contiguous are grouped into blocks, and the Reciprocal of Gap Blocks criterion is computed as
$$RGB = \frac{1}{GB},$$
where GB is the number of gap blocks in the alignment.
If there are no gap blocks, the Reciprocal of Gap Blocks criterion is given a value of 1.0. This criterion serves the purpose of rewarding the alignments where the gap codifications are located in a more concentrated manner, that is, where there are fewer, larger blocks of gap codifications rather than more blocks of smaller length. Fewer blocks imply fewer evolutionary events to be explained and a more parsimonious alignment.
In Tables 6, 7, and 8 three alignments of a hypothetical set of sequences are shown. The three alignments have the same number of "−", but the example in Table 6 has them in 3 blocks, the example in Table 7 in 2 blocks, and the example in Table 8 in just 1 block. This difference is noticeable in the Reciprocal of Gap Blocks criterion, which thus favors the alignment that implies fewer evolutionary events (parsimony).
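The criterion can be illustrated with a short sketch that is consistent with the captions of Tables 6, 7, and 8 (3, 2, and 1 blocks giving 0.33, 0.50, and 1.0). It assumes that a gap block is a maximal run of contiguous gap codifications within a single sequence and that trailing runs are counted like any other; both points, as well as the function names, are assumptions of the example. Gaps are written with the ASCII hyphen.

```python
import re


def count_gap_blocks(rows):
    """Number of maximal runs of '-' over all rows of the alignment."""
    return sum(len(re.findall(r"-+", row)) for row in rows)


def reciprocal_of_gap_blocks(rows):
    """1 / GB, or 1.0 when the alignment has no gap blocks."""
    gap_blocks = count_gap_blocks(rows)
    return 1.0 if gap_blocks == 0 else 1.0 / gap_blocks


if __name__ == "__main__":
    scattered = ["AC-G-T", "A-CGTT"]     # three blocks of one gap each
    concentrated = ["ACG---", "ACGTTT"]  # the same three gaps in one block
    print(reciprocal_of_gap_blocks(scattered))
    print(reciprocal_of_gap_blocks(concentrated))
```

Both toy alignments contain three gap codifications, yet the concentrated one scores higher, which is the behavior the criterion is meant to reward.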

Columns Increment. Inserting gaps to align a set of sequences is common, and the number of columns increases. Columns Increment is the ratio of this augmentation, defined by CI
where C is the number of columns after aligning and $C_0$ is the number of columns before aligning, which is equivalent to the number of nucleotides of the longest sequence. Two different alignments of a hypothetical set of sequences are shown in Tables 9 and 10. Each alignment has a different value for the Columns Increment criterion. A smaller alignment is preferred because a smaller matrix probably implies fewer evolutionary events (parsimony).

GGGA-a GA Using GLOCSA
Having a new objective function to evaluate the quality of multiple sequence alignments, and considering the complexity of the problem (as explained in Section 2.2.1), using a genetic algorithm to optimize alignments and their GLOCSA scores was considered a viable option.
GGGA, the GLOCSA-Guided Genetic Algorithm, is the Genetic Algorithm implemented to optimize the GLOCSA value. GGGA is a variant of the Simple Genetic Algorithm in which a custom representation is proposed, along with a specific mutation operator. There is no crossover operator, selection is performed by tournament, and elitism is used.
The initialization of the population is done using the mutation operator and a seed alignment which is an input to the algorithm.To produce each individual of the next generation, an individual is selected from the previous generation, using the tournament selection operator and then submitted to the mutation operator to generate the new individual (under a mutation probability).
5.1. Representation of Individuals. Each individual in the population represents a possible alignment. The alignment matrix (described in Section 4, e.g., in Table 11) used to rate an alignment with GLOCSA is the base for the representation of individuals. But not everything in the matrix is necessary to reconstruct any given alignment. Therefore it is processed to obtain a much more manageable representation.
Since the solution space explored by the algorithm consists of the possible alignments of a given set of sequences which do not change, the only piece of information necessary to represent any alignment is the location of every gap codification. Furthermore, if gap codifications are grouped into gap blocks (groups of contiguous gap codifications), the position and size of these blocks are the only information needed to reconstruct an alignment.
If the bases in every sequence of the alignment are indexed with consecutive numbers, starting from 0 for the first base to ease its implementation, the position of the gap blocks can be determined by the base index it precedes.
Thus, the alignment can be represented by having for each sequence a list of the positions and sizes of every gap block in them, that is, each gap block represented as two nonnegative integers (position and size).As a simple illustrative example the alignment matrix of Table 11 is transformed to its corresponding representation in Table 12.In this example, the sequence 0 has only one gap block of size 2, before the A with index 2 (the third one), hence the list of gap blocks for this sequence only has one element which is [2,2]; sequence 1 has two gap blocks [2,2], [3,1]; sequence 2 has only one [0, 2], the two gap codifications at the end of the sequence were appended to fit it in the alignment matrix; so there is no need to include them in the representation (trailing gaps are a consequence of the different lengths of the sequences); sequence 3 has just one gap block [3,3].
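The conversion between an alignment row and its list of gap blocks can be sketched as below. The function names and the use of ASCII hyphens for gaps are assumptions of the example; the convention itself (position given by the index of the base that the block precedes, trailing gaps not stored) follows the description above.

```python
def to_gap_blocks(row):
    """Compact representation of one aligned row as [position, size] pairs.

    position is the 0-based index of the base that the block precedes.
    A trailing run of gaps is padding and is therefore not recorded.
    """
    blocks = []
    base_index = 0
    run = 0
    for symbol in row:
        if symbol == "-":
            run += 1
        else:
            if run:
                blocks.append([base_index, run])
                run = 0
            base_index += 1
    return blocks


def from_gap_blocks(bases, blocks, width):
    """Rebuild an aligned row from the ungapped sequence and its gap blocks."""
    gaps_before = {position: size for position, size in blocks}
    parts = []
    for index, base in enumerate(bases):
        parts.append("-" * gaps_before.get(index, 0))
        parts.append(base)
    # Pad with trailing gaps so that every row has the matrix width.
    return "".join(parts).ljust(width, "-")


if __name__ == "__main__":
    row = "AC--G-T--"
    blocks = to_gap_blocks(row)
    print(blocks)
    print(from_gap_blocks("ACGT", blocks, len(row)))
```

Because the bases themselves never change, an individual only needs one such list per sequence, which keeps the population small in memory and makes the mutation suboperators simple list edits.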

Mutation Operator and Suboperators.
The mutation operator is in charge of changing where gap codifications appear in the alignment represented by an individual, in order to explore the solution space. It works with a mutation probability, which determines the expected number of mutations in an individual when the operator is applied to it. Because the expected number of mutations per individual is more manageable and more informative in the context of the problem than the raw probability, it is the quantity reported in the results.
For each mutation operation five types of changes to the gap codification appearances are proposed: insertion of new gap blocks, increment of the size of a gap block, decrease of the size of a gap block, shift of positions of gap blocks, and deletion of a gap block. These five types of changes are called suboperators, and which one is applied is determined by its individual probability, dynamically adapted throughout the generations. These suboperators were selected because, in the opinion of the authors, they make the algorithm capable of searching the solution space in a relatively efficient way. A crossover operator (interchanging entire sequences between alignments) was also considered but was discarded in early stages because it gave no apparent advantage to the algorithm.
It is noteworthy that while performing these changes to the alignments no penalization is done other than the modification in the GLOCSA score these changes imply.

Insertion Suboperator.
This suboperator chooses randomly a sequence and inserts a gap block in it.The size of the new gap block is also random, but with an exponential distribution with mean fitted from the gap block sizes in the seed alignment.
The size of the new gap blocks to insert is biased toward small sizes; this is because large gap blocks are not very common, but still exist.

Increment Suboperator.
The Increment Suboperator chooses a sequence at random and an existing gap block from it, increasing its size by one unit. If the selected sequence does not have any gap block at all, this operator leaves the sequence unchanged.

Decrease Suboperator.
Like the previous operator, it randomly chooses a sequence and a gap block from it, whose size is decreased by one; if the size is 1 gap codification, this operator deletes the gap block entirely. Again, if the selected sequence does not have any gap block at all, it remains unchanged.

Shift Suboperator.
In a sequence chosen at random, this operator first selects a gap block in it; then a position is selected randomly in that sequence. If a gap block exists in that position, the sizes of the two blocks are interchanged. If there is not a gap block in the selected position, the position of the first selected gap block is set to the other position. If the selected sequence does not have any gap block at all, no modification is done. This operator mixes information within a given sequence in a single alignment (individual); it does not recombine information from two individuals as a crossover operator does.
Table 9: Alignment to exemplify the Columns Increment criterion. In this case, the number of columns remains the same after aligning. CI = 0.

Deletion Suboperator.
This operator selects randomly a sequence and then a gap block.This gap block is completely deleted from the list of gap blocks.If no gap block exists in the selected sequence, it remains without any change.
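The five suboperators can be sketched as small edits on the gap block list of one sequence, where each block is a [position, size] pair. This is a sketch under stated assumptions, not the original implementation: the function names are hypothetical, the random choice of the sequence is left to the caller, new positions are drawn uniformly over the bases of the sequence, and an insertion at a position that already holds a block enlarges that block.

```python
import random


def _pick_block(blocks, rng):
    """Index of a randomly chosen gap block, or None if there are none."""
    return rng.randrange(len(blocks)) if blocks else None


def insert_block(blocks, seq_len, mean_size, rng):
    """Insert a block whose size follows an exponential distribution."""
    position = rng.randrange(seq_len)
    size = max(1, round(rng.expovariate(1.0 / mean_size)))
    for block in blocks:
        if block[0] == position:  # assumption: merge with existing block
            block[1] += size
            return
    blocks.append([position, size])
    blocks.sort()


def increment_block(blocks, rng):
    k = _pick_block(blocks, rng)
    if k is not None:
        blocks[k][1] += 1


def decrease_block(blocks, rng):
    k = _pick_block(blocks, rng)
    if k is None:
        return
    if blocks[k][1] == 1:
        del blocks[k]  # a block of size 1 disappears entirely
    else:
        blocks[k][1] -= 1


def shift_block(blocks, seq_len, rng):
    k = _pick_block(blocks, rng)
    if k is None:
        return
    position = rng.randrange(seq_len)
    for other in blocks:
        if other[0] == position:
            # A block already sits there: interchange the two sizes.
            blocks[k][1], other[1] = other[1], blocks[k][1]
            return
    blocks[k][0] = position  # otherwise move the selected block
    blocks.sort()


def delete_block(blocks, rng):
    k = _pick_block(blocks, rng)
    if k is not None:
        del blocks[k]


if __name__ == "__main__":
    rng = random.Random(0)
    blocks = [[2, 2], [5, 1]]
    shift_block(blocks, 8, rng)
    print(blocks)
```

All five functions modify the list in place and leave a sequence without gap blocks untouched (except the insertion), matching the behavior described for each suboperator.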

Adaptation of Mutation Suboperators Probability.
The probability of applying each of the suboperators is dynamically adapted as the generations pass; it is changed according to their effect on the GLOCSA score of the alignments represented by the individuals. This adaptation is done once at the end of every generation, and the procedure is as follows.
For the first generation of the genetic algorithm the five suboperators have the same probability, each with a probability of 0.20 of being used. Every time the mutation operator is applied, a record stores the GLOCSA scores before and after the mutation and a vector which represents the use count of each suboperator (e.g., in Table 13).
After generating the entire new population, the attributed difference by suboperator (dSO) is computed by dividing the suboperator's use count by the total number of mutations performed and then multiplying it by the difference between the after and before GLOCSA scores. This is shown in (9), where $dSO_s$ is the attributed difference for a given suboperator s, $sOUC_s$ is the use count for suboperator s, $tM$ is the total number of mutations performed ($tM = \sum_s sOUC_s$), and $aS$ and $bS$ are the GLOCSA scores after and before the action of the mutation suboperators:
$$dSO_s = \frac{sOUC_s}{tM}\,(aS - bS). \tag{9}$$
Then, the attributed difference by suboperator is summed over all the records R into the total attributed difference by suboperator (tDSO, see (10)):
$$tDSO_s = \sum_{R} dSO_s \quad \forall s \in \text{mutation suboperators}. \tag{10}$$
These $tDSO_s$ values are then normalized by dividing them by the largest absolute value among them:
$$tDSO_s \leftarrow \frac{tDSO_s}{\max_k |tDSO_k|}.$$
Afterward, $p_s \cdot SCh \cdot tDSO_s$ is added to the probability $p_s$ of each suboperator:
$$p_s \leftarrow p_s + p_s \cdot SCh \cdot tDSO_s,$$
where SCh is a constant which sets how big the steps of the adaptation are. It was set to 0.10 for the experiments. Finally the values of $p_s$ are scaled to make the sum of all the probabilities equal to 1:
$$p_s \leftarrow \frac{p_s}{\sum_k p_k}.$$
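The adaptation procedure for one generation can be sketched as a single function. The record layout (score before, score after, use counts per suboperator) follows the description above; the function name, the tuple layout, and the decision to leave the probabilities unchanged when no suboperator produced any difference (which would otherwise mean dividing by zero in the normalization) are assumptions of this sketch.

```python
def adapt_probabilities(probs, records, step=0.10):
    """Return the adapted suboperator probabilities after one generation.

    probs   -- current probability of each suboperator
    records -- iterable of (score_before, score_after, use_counts)
    step    -- the adaptation step constant (SCh in the text)
    """
    n = len(probs)
    totals = [0.0] * n  # total attributed difference per suboperator
    for before, after, use_counts in records:
        total_mutations = sum(use_counts)
        if total_mutations == 0:
            continue
        for s in range(n):
            totals[s] += use_counts[s] / total_mutations * (after - before)

    largest = max(abs(t) for t in totals)
    if largest == 0.0:
        return list(probs)  # assumption: nothing to adapt this generation

    # Normalize the totals, take one adaptation step, then rescale to sum 1.
    updated = [p + p * step * (t / largest) for p, t in zip(probs, totals)]
    scale = sum(updated)
    return [p / scale for p in updated]


if __name__ == "__main__":
    start = [0.20] * 5
    sample_records = [
        (10.0, 12.0, [1, 0, 0, 0, 0]),  # first suboperator improved the score
        (12.0, 11.0, [0, 1, 0, 0, 0]),  # second suboperator worsened it
    ]
    print(adapt_probabilities(start, sample_records))
```

Since the normalized totals lie in [-1, 1] and the step is small, every probability stays positive, so no suboperator is ever switched off completely.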

Population Initialization.
To initialize the population a given alignment is used as a starting point.The individuals of the initial generation are mutations of it, obtained by applying the mutation operator.The mutation operator is applied discarding the adaptation stage; therefore the five suboperators have the same probability while initializing the population.
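The overall loop (initialization from a seed by mutation, tournament selection, mutation without crossover, and elitism) can be sketched generically. The objective function and the mutation operator are passed in as callables, so the sketch does not depend on GLOCSA itself. The function names, the tournament size of 2, and the requirement that `mutate` returns a new individual instead of modifying its argument are assumptions of the example.

```python
import random


def tournament(population, scores, size, rng):
    """Return the best of `size` individuals drawn at random."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: scores[i])]


def evolve(seed, score, mutate, pop_size=100, generations=1000,
           tournament_size=2, rng=None):
    """Mutation-only GA with tournament selection and elitism.

    mutate(individual, rng) must return a NEW individual.
    Returns the best individual and the best score of each generation.
    """
    rng = rng or random.Random()
    # Initial population: mutated copies of the seed alignment.
    population = [mutate(seed, rng) for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        scores = [score(individual) for individual in population]
        best_index = max(range(pop_size), key=lambda i: scores[i])
        best_history.append(scores[best_index])
        # Elitism: the best individual passes unchanged.
        next_population = [population[best_index]]
        while len(next_population) < pop_size:
            parent = tournament(population, scores, tournament_size, rng)
            next_population.append(mutate(parent, rng))
        population = next_population
    scores = [score(individual) for individual in population]
    best_index = max(range(pop_size), key=lambda i: scores[i])
    best_history.append(scores[best_index])
    return population[best_index], best_history


if __name__ == "__main__":
    # Toy problem standing in for alignment scoring: reach the integer 7.
    best, history = evolve(
        seed=0,
        score=lambda x: -abs(x - 7),
        mutate=lambda x, rng: x + rng.choice([-1, 0, 1]),
        pop_size=10, generations=30, rng=random.Random(1),
    )
    print(best, history[-1])
```

With elitism the best score recorded per generation can never decrease, which is a convenient sanity check when the toy objective is replaced by a real scoring function.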

Tests with Real Data
6.1. Test Bench. To test the ability of GGGA to optimize the GLOCSA scoring function, three multiple sequence alignment problems were proposed, which are shown in Table 16 along with relevant information. The set of sequences exmpl17 is a subset of exmpl19; the two shortest sequences were eliminated, thus presumably reducing the complexity of the alignment.

GA Test Parameters.
Each set of sequences was first aligned with MUSCLE [3], a popular progressive alignment tool.The resulting alignment was seeded as a starting point for the initialization of the population; thus the aim of the test is to see if further improvements to the alignment of MUSCLE can be performed, guided by the GLOCSA scoring function.
The genetic algorithm for all the experiments was run over 1000 generations with a population of 100, with  The GLOCSA Scoring function used as the objective function has the default weights defined in Table 2.
The mutation rate was varied in the range [0.1, 3] expected mutations per individual, in increments of 0.1. For each of these combinations of values (the previously mentioned parameters and the number of expected mutations), 30 experiments were performed.
The only parameter tested within a range was the number of expected mutations, because it was considered the most important and a sweep across all parameters would have been too computationally expensive.

Experiments Results.
Results of these experiments are shown in Figures 1, 2, and 3, using box and whiskers plots; the box has lines at the lower quartile, median, and upper quartile values; whiskers extend from each end of the box to the minimum and maximum scores obtained.
It was observed that the GLOCSA-Guided Genetic Algorithm always improved (at least slightly) the solution previously found by MUSCLE (the score of the initial alignment is the lower end of the range of GLOCSA scores in each chart). As expected, the amount of improvement is strongly related to the number of expected mutations: lower (near zero) and higher (close to and beyond three mutations per individual) numbers of expected mutations produce smaller improvements, while values in or near the range [0.5, 1.0] produce the largest improvements. This trend is attributed to the exploration/exploitation balance: with too few mutations there is not enough exploration, and with too many mutations there is excess exploration to the detriment of exploitation.
It is important to notice that the range of the GLOCSA values is different in each of Figures 1, 2, and 3. This is because GLOCSA values are relative to the alignment they are scoring, most prominently through the column homogeneity. In particular, this criterion would have a value of 1000 (multiplied by its default weight) when aligning a set of copies of a single sequence (every column would score 1.0).
While the improvements for the alignments exmpl17 and exmpl19 are about the same, for exmpl29 they are bigger. This is explained by the fact that exmpl29 is less complex than exmpl17 and exmpl19, which are of about the same difficulty (exmpl17 is easier than exmpl19, as the sequences in the first are a subset of those in the second, but not enough to make a noticeable difference). In Table 17, the mean elapsed times for the 30 experiments of each alignment are shown. All the experiments were performed on a personal computer with an Intel Pentium D 2.80 GHz processor (though not using its two cores for a single experiment) and 2 GB of RAM.

Conclusions
For the assessment of the quality of multiple sequence alignments, scoring functions have been previously defined, but in the opinion of the authors, the results obtained so far are not satisfactory enough, and therefore the GLOCSA measure was devised. It aims to be considered an alternative scoring function for multiple sequence alignments, one with the advantages of being simple, of rating the whole alignment at once, and of being parsimonious. Given the complexity of the problem of multiple sequence alignment, the techniques of Evolutionary Computation (Genetic Algorithms in particular) seem useful for optimizing this newly proposed scoring function. Although it is not efficient compared with the fast progressive alignment heuristics (e.g., a run of MUSCLE over the largest alignment tested in this work takes less than 4.5 seconds on the same machine), GGGA has the ability to optimize GLOCSA as the objective function. Using it as a refinement step over data previously aligned with more efficient methods (as in the test experiments, where MUSCLE alignments were used as starting points) is a promising application.
Even though a set of sequences can be aligned from scratch by optimizing its GLOCSA score with the GA, this would be too time consuming. An initial starting point given by another tool therefore seems like a good idea, given that progressive alignment delivers good results that can be further refined. The seed alignment can be the product of any alignment tool, which gives this approach additional flexibility.

Future Work
Currently GLOCSA only rates DNA sequence alignments; the next step would be to extend its application scope to protein sequences.
GLOCSA as a quality measure has been validated empirically, but tests to assess its reliability are still pending.This will be done with the aid of defined sets of reference alignments such as BALiBASE (protein sequence alignments) [22,23] and the GLOCSA-Guided Genetic Algorithm, thus resulting in the assessment of both, the scoring function and the genetic algorithm implementation.
A new crossover operator (across columns) will also be implemented in the Genetic Algorithm, and its adaptation mechanism will be explored further.The performance of the Genetic Algorithm will be compared to a Random Search, to see how much the evolutionary nature of the algorithm is contributing to the results.

Figure 1: Box and whisker plot of the results of the experiments of exmpl17.

Figure 2: Box and whisker plot of the results of the experiments of exmpl19.

Figure 3: Box and whisker plot of the results of the experiments of exmpl29.

Table 1: Number of possible alignments for two sequences; m and n are the respective sizes of two given sequences.

Table 4: Base count weights matrix.

Table 6: Alignment to exemplify the Reciprocal of Gap Blocks criterion. RGB = 0.33.

Table 7: Alignment to exemplify the Reciprocal of Gap Blocks criterion. RGB = 0.50.

Table 8: Alignment to exemplify the Reciprocal of Gap Blocks criterion. RGB = 1.0.

Table 10: Alignment to exemplify the Columns Increment criterion. Here, the number of columns increased to 6 after aligning. CI = 0.66.

Table 13: Sample records used for adaptation.

Table 14: Default Test Parameters. For each experiment these default parameters were used.

Table 15: Test Experiments. Using the default test parameters listed in Table 14, the number of expected mutations was tested in the range of [0.1, 3] with increments of 0.1, performing 30 experiments for each configuration, for each alignment in the test bench.