Simulated Annealing with Previous Solutions Applied to DNA Sequence Alignment

A new algorithm for solving sequence alignment problem is proposed, which is named SAPS (Simulated Annealing with Previous Solutions). This algorithm is based on the classical Simulated Annealing (SA). SAPS is implemented in order to obtain results of pair and multiple sequence alignment. SA is a simulation of heating and cooling of a metal to solve an optimization problem. In order to select randomly a current solution, SAPS algorithm chooses a solution from solutions that have been previously generated within the Metropolis Cycle. This simple change has led to increase the quality of the solution to the problem of aligning genomic sequences with respect to the classical Simulated Annealing algorithm. The parameters of SAPS, for certain instances, are tuned by an analytical method, and some parameters have experimentally been tuned. SAPS has generated high-quality results in comparison with the classical SA. The instances used are specific genes of the AIDS virus.


Introduction
Sequence alignment is one of the most important and challenging problems in computational biology and bioinformatics [1,2].Finding the optimal alignment of a set of sequences is known as a NP-complete problem [3].Alignment of sequences can be an important tool to measure the similarity of two or more sequences.Sequence Alignment is classified as a combinatorial optimization problem [4], which is solved by using computer algorithms.These algorithms lead to represent, to process, and to compare genetic information to determine evolutionary relationships among living beings [3].The sequence alignment highlights areas of similarity among sequences.The similarities among sequences may indicate functional or evolutionary relationships among genes or proteins [5].
The problem of sequence alignment is to obtain the maximum alignment of a set of n genomic sequences, which is denoted as S = {s 1 , s 2 , . . ., s n }; each sequence of this set is formed by the alphabet = {A, C, G, T}.The solution to this problem is represented by S = Max(S), which denotes a set S = {s 1 , s 2 , . . ., s n } with the alphabet = ∪{−}.S represents the optimal alignment of S.
Exact algorithms have been applied to solve the sequence alignment problem.For example, dynamic programming has been one of the most used to solve the sequence alignment problem [6,7].The disadvantage of using exact algorithms is that these generate optimal solutions for small problems, but for large problems, exact algorithms become inefficient.For this reason, several metaheuristic methods have been designed to obtain suboptimal alignments.Metaheuristics have also been applied to solve this problem [8], for example, Ant Colony Algorithm [9], Simulated Annealing [10,11], Genetic Algorithms [12], among others.The disadvantage is that metaheuristics do not guarantee optimal solutions, but solutions generated can be very close to optimal solution in a reasonable processing time.
The proposed algorithm is a modified version of classical Simulated Annealing.SAPS includes a new way to select a current solution after the Metropolis Cycle is finished.In general, SAPS generates better solutions to sequence alignment problem than the classical Simulated Annealing.SAPS was tested with different genes of AIDS virus.
This paper is organized as follows: in Section 2, classical simulated annealing algorithm is described.In Section 3, the SASP algorithm is explained in detail.In Section 4, Create S new using a perturbation to S current 7: Obtain difference similarity between S new and S current 8: If (difference ⇐ 0) then 9: Accept S new 10: else 11: Boltzmann probability = exp (−difference/T) 12: If (Boltzmann probability) > random (0,1) then the analytical tuning method is described.In Section 5, the implementation of the SASP is detailed.In Section 6, the experimentation and results are described.Finally, Section 7 discusses the conclusions.

Classical Simulated Annealing
The classical Simulated Annealing is an algorithmic process that simulates the gradual metal cooling for crystallization.This algorithm usually starts at high value of temperature, and then this parameter is decreased until a final temperature is reached.The final temperature typically is very close to zero (T final ≈ 0) [13,14].Through a cooling function, the temperature value is decreased from the initial temperature to the final temperature.There are cooling functions that have been used in the simulated annealing algorithm [15][16][17][18]; the most common cooling function is defined by T k+1 = αT k .This function decreases the temperature value by a α factor, which does a range of 0.70 ≤ α < 1.0.A gradual cooling is applied when α is very close to 1, and a fast cooling is applied when α is very close to 0.70.
The classical Simulated Annealing has two cycles; the first cycle is named Cycle of Temperature.Into this cycle, value temperature is decreased by a cooling function.The second cycle is named Metropolis Cycle, and it is applied to generate, to accept, or to reject solutions for the problem to be optimized.Algorithm 1 shows the pseudo code of the classical Simulated Annealing.The initial and final temperature values are set (see line 1).These values are obtained by an analytical (see Section 4) or experimental way.It is recommended that the initial temperature is as high as possible, and the final temperature is as close to zero.The initial solution S initial of the problem to be optimized is created (see line 2).The current solution S current is set to S initial .Set T to initial temperature (see line 3).The temperature cycle is executed from the initial temperature to the final temperature (see lines [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18].The Metropolis Cycle gets started (see lines [5][6][7][8][9][10][11][12][13][14][15][16].This cycle takes a number of times specified in the stop criterion.A new solution S new is created within the Metropolis Cycle by creating a small perturbation to the current solution S current (see line 6).The difference between these two solutions (S new and S current ) is obtained.If the difference is less or equal than zero (see line 8), the new solution is accepted (see line 9).If the difference is greater than zero, the Boltzmann probability is calculated (see line 11).If the Boltzmann probability is higher than a random value between 0 and 1 (see line 12) then the new solution is accepted (see line 13).After the Metropolis Cycle is completed, the temperature value is decreased (see line 17).
Algorithm 2 shows the pseudo code of the SA, which is applied to obtain solutions to the problem of aligning two or more genomic sequences.The Simulated Annealing algorithm is modified then it can be implemented to solve the problem of alignment sequence.The values of initial and final temperatures are tuned by using an analytical method (see lines 1-2).The cooling factor value α is set to a value very close to 1 (α ≈ 1) (see line 3).The current solution S current is set to the original solution S initial (see line 4).The similarity  12), and the difference of similarities between S new and S current is calculated (see line 13).This difference is denoted by ΔS = S current − S new .The new solutions are accepted when these are better than current solutions, so current solutions are replaced by new solutions (see line 15).When new solutions are of low quality (worse solutions) than current solutions, then new solutions are accepted using the Boltzmann probability (see line 22).This probability is directly related to the current value of the temperature and the quality difference between S new and S current .The Boltzmann probability is calculated by the following equation P(S new ) = e (−ΔS/T) .As the temperature value is decreased, the probability of P(S new ) is decreased, which is of range 0 < P(S new ) < 1.

Simulated Annealing with Previous Solutions
In order to generate high-quality solutions to sequence alignment, the classical SA was modified, so the SAPS algorithm is a modified version of the classical SA.After the Metropolis Cycle execution is done, the selection of a current solution S current is done.During the execution of Metropolis Cycle, the best solutions are stored in a set named SS betters .
The best of all solutions created in this cycle is stored in S better .The original sequence is stored in S original .After the Metropolis cycle is finished, a current solution S current is randomly selected from S original , S better , or SS betters .So S current ∈ {S original , S better , SS betters }.
The Metropolis Cycle length of SAPS is growing, which ranges from an initial L initial value to a final L final value.At high temperature, L initial is set to a small value and as the temperature value is increased, the value of the Metropolis Cycle length is increased until L final .when T final is reached, L final is reached too.Thus, an increasing number of solutions are created as the temperature is decreased.At high temperatures, a small number of solutions are created and as the temperature is decreased, the number of solutions is increased with a factor γ, where γ > 1.
Algorithm 3 shows the pseudo code of SAPS, some lines of code were added to SA, for example, at line 5, S better and SS betters are set with S original .At line 19, S better is added to SS betters .At line 31, S current is chosen from S original , S better , or SS betters .

Analytical Tuning Method
Some parameters of SAPS are tuned by the analytical method [19][20][21][22].For example, in order to calculate the initial temperature, the maximum deterioration (defined by ΔZ max ) of the instance is applied.The probability of accepting a solution S is applied at high temperature.On other hand, the final temperature is calculated by applying the minimum deterioration (defined by ΔZ min ) of the instance and the probability of accepting a Solution S at low temperature.
The analytical tuning based on Boltzmann distribution can be helpful for setting up the initial temperature [21].The probability of accepting any new solution S new is very close to 1 (P(S new ) ≈ 1) at high temperatures, so the deterioration of cost function is maximal.The initial temperature (T initial ) is associated with the maximum deterioration admitted and the defined acceptance probability P(S new ).Let S current be the current solution and S new a new proposed one, and Z(S current ) and Z(S new ) are the costs associated to S current and S new , respectively; the maximum and minimum deteriorations are expressed as ΔZ max and ΔZ min , respectively.Then, the P(ΔZ max ) probability of accepting a new solution S new with the maximum deterioration is defined by (P(ΔZ) = exp(−ΔZ/T)).This equation basically is the Boltzmann Distribution, which is applied for calculating the T initial .This temperature value is defined by T initial = −ΔZ max / ln(P(ΔZ max )).Similarly, the final temperature (T final ) is established according to the probability of accepting a new solution S new with the minimum deterioration.The equation to calculate the final temperature is defined by T final = −ΔZ min / ln(P(ΔZ min )).
There are other parameters of SAPS that are calculated by applying a particular cooling function; for example, the Metropolis Cycle length is calculated by applying T k+1 = αT k .The incremental factor of this cycle is also calculated and defined by γ.
The analytical method determines the Metropolis Cycle lenght L k with a simple Markov model [22]; at high temperatures, only a few iterations are required because the stochastic equilibrium is quickly reached; nevertheless, at low temperatures a more exhaustive exploration is needed, so a larger L k is used.Let L 1 be L k at T initial and let L max be the maximum Metropolis Cycle length.
Let T k be decreased by the cooling function (T k+1 = αT k ), and the L k+1 be calculated by the follow equation L k+1 = γL k , where γ is the rate of increment of Metropolis Cycle (>1); so L k+1 > L k and L 1 have an initial value, and the last L f Metropolis Cycle is equal to L max .The functions T k+1 = αT k and L k+1 = γL k are applied successively in Simulated Annealing from T initial to T final ; consequently, T n and L max are obtained by T n = α n T initial and L max = γ n L 1 , respectively.n is the step number from T initial to T final ; so we can get the n and γ as follows: n = (ln(T final ) − ln(T initial ))/ ln(α) and γ = exp((ln(L max ) − ln(L 1 ))/n).

Implementation
SASP was tested with all of the most HIV virus genes of human and simian.The nine genes of the human virus were compared with the nine genes of simian virus; for example, the gen named "env" of HIV human was aligned with the gen "env" of HIV simian, the gen named "gag" of HIV human was aligned with the gen "gag" of HIV simian, and so successively.The information of the virus genes is shown in Table 1.The parameters T initial , T final , ΔZ max , and γ are tuned by analytical method.The factor γ is higher than 1, and it is very close to 1.The values of these parameters are shown in Table 2.In this table, the values of initial temperatures are high; these values are related to the maximum deterioration and the probability of accepting solutions at high temperatures.It is observed that the final temperature has a value very close to zero (0.43); this is because the minimum deterioration is equal to 1.0.The parameters L 1 and L max have the values 2, and 300, respectively.

Experimentation and Results
In Table 3, the results of the experiments are shown.The information shown is the average similarity and the standard deviation of the genes of both viruses (HIV Human and HIV Simian).The results show that the average obtained by SASP is of better quality than the average obtained by the classical SA.Table 4 shows that the SAPS processing time generally is better than the processing time of SA.

Conclusions
In this paper, a new approach is to make efficient the classical Simulated Annealing algorithm proposed to solve the problem of aligning genomic sequences.This approach is called SAPS.After completing the Metropolis Cycle, a current solution is selected randomly from the best solutions' set, the best solution and the initial solution.This change in the classical simulated annealing resulted in an improved efficiency to solve the problem of aligning sequences.The parameters of the algorithms SA and SAPS were tuned using a tuning method, specifically the initial temperature, final temperature, and Metropolis Cycle length.
This approach to tune the parameters depends directly on the instance to test.With a preprocessing of the instance, the minimum and maximum deteriorations are calculated.With these values and the probability of acceptance, the initial and final temperatures are calculated.

1 :
Setting initial and final temperatures 2: Create S current from Initial solution S initial 3: T = T initial 4: While (T > T final ) do // Temperature Cycle 5: While (stop criteria) // Metropolis Cycle 6:

Table 1 :
HIV genes of human and simian.

Table 2 :
Values of parameters., and the Metropolis Cycle length is increased (see line 28).Within the Metropolis Cycle, new solutions S new are generated by modifying the current solution S current .This is done by adding or removing gaps into DNA sequences (see line 11).The similarity of new solutions is calculated (see line

Table 3 :
Results of quality solutions.

Table 4 :
Results of processing time.