A new algorithm for solving the sequence alignment problem is proposed, named SAPS (Simulated Annealing with Previous Solutions). The algorithm is based on classical Simulated Annealing (SA), which simulates the heating and cooling of a metal in order to solve an optimization problem. SAPS is implemented to obtain pairwise and multiple sequence alignments. To select the current solution, SAPS chooses at random among solutions previously generated within the Metropolis Cycle. This simple change improves the quality of the solutions to the problem of aligning genomic sequences with respect to the classical Simulated Annealing algorithm. Some parameters of SAPS are tuned by an analytical method for each instance, and others are tuned experimentally. SAPS generates higher-quality results than classical SA. The instances used are specific genes of the AIDS virus.
1. Introduction
Sequence alignment is one of the most important and challenging problems in computational biology and bioinformatics [1, 2]. Finding the optimal alignment of a set of sequences is known to be an NP-complete problem [3]. Sequence alignment is an important tool for measuring the similarity of two or more sequences. It is classified as a combinatorial optimization problem [4] and is solved with computer algorithms. These algorithms represent, process, and compare genetic information in order to determine evolutionary relationships among living beings [3]. An alignment highlights areas of similarity among sequences, which may indicate functional or evolutionary relationships among genes or proteins [5].
The problem of sequence alignment is to obtain the maximum alignment of a set of n genomic sequences, denoted S={s1,s2,…,sn}; each sequence of this set is formed over the alphabet Σ={A,C,G,T}. The solution to this problem is represented by S'=Max(S), which denotes a set S'={s1',s2',…,sn'} over the alphabet Σ'=Σ∪{-}, where the symbol - denotes a gap. S' represents the optimal alignment of S.
Exact algorithms have been applied to the sequence alignment problem; dynamic programming is one of the most widely used [6, 7]. Exact algorithms generate optimal solutions for small problems but become inefficient for large ones. For this reason, several metaheuristics have been applied to obtain suboptimal alignments [8], for example, Ant Colony algorithms [9], Simulated Annealing [10, 11], and Genetic Algorithms [12]. Metaheuristics do not guarantee optimal solutions, but the solutions they generate can be very close to the optimum within a reasonable processing time.
The proposed algorithm, SAPS, is a modified version of classical Simulated Annealing. It introduces a new way to select the current solution after the Metropolis Cycle is finished. In general, SAPS generates better solutions to the sequence alignment problem than classical Simulated Annealing. SAPS was tested with different genes of the AIDS virus.
This paper is organized as follows. Section 2 describes the classical Simulated Annealing algorithm. Section 3 explains the SAPS algorithm in detail. Section 4 describes the analytical tuning method. Section 5 details the implementation of SAPS. Section 6 describes the experimentation and results. Finally, Section 7 presents the conclusions.
2. Classical Simulated Annealing
Classical Simulated Annealing is an algorithmic process that simulates the gradual cooling of a metal until it crystallizes. The algorithm usually starts at a high temperature, which is then decreased until a final temperature is reached. The final temperature is typically very close to zero ($T_{\mathrm{final}} \approx 0$) [13, 14]. A cooling function decreases the temperature from its initial to its final value. Several cooling functions have been used in simulated annealing [15–18]; the most common is the geometric function $T_{k+1} = \alpha T_k$, which decreases the temperature by a factor $\alpha$ in the range $0.70 \le \alpha < 1.0$. Cooling is gradual when $\alpha$ is very close to 1 and fast when $\alpha$ is very close to 0.70.
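The geometric cooling function can be illustrated with a minimal Python sketch. The function name `geometric_cooling` and the example temperatures are illustrative assumptions, not values taken from the paper.

```python
def geometric_cooling(t_initial, t_final, alpha):
    """Yield T_0, T_1, ... with T_{k+1} = alpha * T_k while T_k > t_final."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if t_final <= 0.0:
        raise ValueError("t_final must be positive, otherwise the loop never ends")
    t = t_initial
    while t > t_final:
        yield t
        t *= alpha


# Gradual cooling (alpha close to 1) visits many more temperatures
# than fast cooling (alpha close to 0.70).
slow = list(geometric_cooling(100.0, 1.0, 0.95))
fast = list(geometric_cooling(100.0, 1.0, 0.70))
print(len(slow), len(fast))
```

A larger number of temperature steps means more Metropolis Cycles are run, which is why a value of alpha close to 1 trades processing time for a more thorough search.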
Classical Simulated Annealing has two cycles. The first, named the Temperature Cycle, decreases the temperature by means of a cooling function. The second, named the Metropolis Cycle, generates, accepts, or rejects solutions to the problem being optimized. Algorithm 1 shows the pseudocode of classical Simulated Annealing. The initial and final temperatures are set (see line 1); these values are obtained analytically (see Section 4) or experimentally. It is recommended that the initial temperature be as high as possible and the final temperature as close to zero as possible. The initial solution Sinitial of the problem is created, and the current solution Scurrent is set to Sinitial (see line 2). T is set to the initial temperature (see line 3). The Temperature Cycle runs from the initial temperature to the final temperature (see lines 4–18). Within it, the Metropolis Cycle is executed (see lines 5–16) and repeats the number of times specified by the stop criterion. Inside the Metropolis Cycle, a new solution Snew is created by applying a small perturbation to the current solution Scurrent (see line 6), and the difference between the two solutions is obtained (see line 7). If the difference is less than or equal to zero (see line 8), the new solution is accepted (see line 9). If the difference is greater than zero, the Boltzmann probability is calculated (see line 11); if it is higher than a random value between 0 and 1 (see line 12), the new solution is accepted (see line 13). After the Metropolis Cycle is completed, the temperature is decreased (see line 17).
Algorithm 1: Pseudocode of classical Simulated Annealing.
1: Set initial and final temperatures
2: Create Scurrent from initial solution Sinitial
3: T = Tinitial
4: While (T > Tfinal) do // Temperature Cycle
5:     While (stop criteria) // Metropolis Cycle
6:         Create Snew using a perturbation to Scurrent
7:         Obtain difference of similarity between Snew and Scurrent
8:         If (difference ≤ 0) then
9:             Accept Snew
10:         else
11:             Boltzmann probability = exp(−difference/T)
12:             If (Boltzmann probability) > random(0,1) then
13:                 Accept Snew
14:             end if
15:         end if
16:     end while
17:     Decrease T
18: end while
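Algorithm 1 can be sketched in Python as follows. This is a minimal illustration for a generic minimization problem, not the authors' implementation: the function and parameter names, the fixed Metropolis Cycle length used as the stop criterion, the tracking of the best solution seen, and the toy cost function are all assumptions introduced here.

```python
import math
import random


def simulated_annealing(initial, cost, neighbor, t_initial, t_final,
                        alpha=0.95, metropolis_length=50, rng=None):
    """Minimize `cost` following the two-cycle structure of Algorithm 1."""
    rng = rng or random.Random()
    current = initial
    best = current                              # assumption: remember the best solution seen
    t = t_initial
    while t > t_final:                          # Temperature Cycle
        for _ in range(metropolis_length):      # Metropolis Cycle (fixed length assumed)
            candidate = neighbor(current, rng)  # small perturbation of the current solution
            difference = cost(candidate) - cost(current)
            if difference <= 0:
                current = candidate             # not worse: always accepted
            elif math.exp(-difference / t) > rng.random():
                current = candidate             # worse: accepted with Boltzmann probability
            if cost(current) < cost(best):
                best = current
        t *= alpha                              # decrease T with the geometric cooling function
    return best


# Toy usage (illustrative only): minimize (x - 3)^2 over the integers.
rng = random.Random(0)
result = simulated_annealing(
    initial=50,
    cost=lambda x: (x - 3) ** 2,
    neighbor=lambda x, r: x + r.choice((-1, 1)),
    t_initial=100.0, t_final=0.01, rng=rng,
)
print(result)
```

At high temperatures the Boltzmann probability is close to 1 for small deteriorations, so the search wanders freely; as T approaches zero, worse candidates are almost never accepted and the search behaves like a greedy descent.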
Algorithm 2 shows the pseudocode of SA adapted to the problem of aligning two or more genomic sequences. The initial and final temperatures are tuned by an analytical method (see lines 1-2). The cooling factor α is set to a value very close to 1 (α≈1) (see line 3). The current solution Scurrent is set to the original solution Sinitial (see line 4), and its similarity is calculated by comparing the sequences base by base (see line 5). The variable T is set to the initial temperature Tinitial (see line 6). The Metropolis Cycle length is set to an initial value Lcm (see line 7). This length is not constant: it is small at high temperatures and grows as the temperature decreases, being multiplied by a factor γ, where γ must be greater than 1. The Temperature Cycle is executed (see lines 8–29) while T is greater than Tfinal. At the start of each iteration of this cycle, the counter n is reset to 1 (see line 9); within the Metropolis Cycle, it is incremented (n=n+1) (see line 25).
Algorithm 2: Pseudocode of SA applied to sequence alignment.
1: Tune initial temperature (Tinitial)
2: Tune final temperature (Tfinal)
3: Set cooling factor
4: Create Scurrent from initial solution Sinitial
5: Calculate similarity of Scurrent
6: T = Tinitial
7: Set Lcm
8: While (T > Tfinal) do
9:     n = 1
10:     While (Lcm > n)
11:         Create Snew adding or removing gaps to Scurrent
12:         Calculate similarity of Snew
13:         Obtain difference of similarity between Snew and Scurrent
14:         If (difference ≤ 0) then
15:             Scurrent = Snew
16:             If similarity (Snew) > similarity (Sbetter) then
17:                 Sbetter = Scurrent
18:             end if
19:         else
20:             Boltzmann probability = exp(−difference/T)
21:             If (Boltzmann probability) > random(0,1) then
22:                 Scurrent = Snew
23:             end if
24:         end if
25:         n = n + 1
26:     end while
27:     Decrease T
28:     Increase Lcm
29: end while
The Metropolis Cycle is executed (see lines 10–26). At the end of the Metropolis Cycle, the temperature is decreased (see line 27), and the Metropolis Cycle length is increased (see line 28). Within the Metropolis Cycle, new solutions Snew are generated by modifying the current solution Scurrent, which is done by adding gaps to or removing gaps from the DNA sequences (see line 11). The similarity of each new solution is calculated (see line 12), and the difference of similarities between Snew and Scurrent is calculated (see line 13). This difference is denoted by ΔS = similarity(Scurrent) − similarity(Snew), so ΔS ≤ 0 means that the new solution is at least as good as the current one. A new solution that is not worse than the current solution is accepted and replaces it (see line 15). A new solution of lower quality than the current solution is accepted according to the Boltzmann probability (see line 22). This probability depends on the current temperature and on the quality difference between Snew and Scurrent, and it is calculated as P(Snew) = exp(−ΔS/T). As the temperature decreases, P(Snew) decreases; for worse solutions it lies in the range 0 < P(Snew) < 1.
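The three building blocks used inside the Metropolis Cycle of Algorithm 2 can be sketched as below. The paper does not give an exact similarity formula or perturbation rule, so the following are assumptions: similarity is taken as the percentage of aligned columns holding the same base, shorter sequences are padded with trailing gaps, and a perturbation inserts or removes a single gap with equal probability. The names `similarity`, `perturb`, and `accept` are hypothetical.

```python
import math
import random

GAP = "-"


def similarity(a, b):
    """Percentage of aligned columns in which both sequences hold the same base."""
    length = max(len(a), len(b))
    a, b = a.ljust(length, GAP), b.ljust(length, GAP)  # pad the shorter sequence
    matches = sum(1 for x, y in zip(a, b) if x == y and x != GAP)
    return 100.0 * matches / length if length else 0.0


def perturb(seq, rng):
    """Return a copy of `seq` with one gap inserted or one existing gap removed."""
    gap_positions = [i for i, ch in enumerate(seq) if ch == GAP]
    if gap_positions and rng.random() < 0.5:
        i = rng.choice(gap_positions)
        return seq[:i] + seq[i + 1:]
    i = rng.randrange(len(seq) + 1)
    return seq[:i] + GAP + seq[i:]


def accept(sim_current, sim_new, temperature, rng):
    """Metropolis rule for maximizing similarity, with delta = current - new."""
    delta = sim_current - sim_new
    if delta <= 0:
        return True                                    # not worse: always accepted
    return math.exp(-delta / temperature) > rng.random()


rng = random.Random(1)
print(similarity("ACGT", "A-CGT"))    # unaligned after the first column
print(similarity("A-CGT", "ATCGT"))   # one gap column, four matching columns
print(perturb("ACGT", rng))
```

Because a perturbation only adds or removes gaps, the underlying bases of each sequence are never altered; only their column positions change.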
3. Simulated Annealing with Previous Solutions
To generate high-quality solutions to sequence alignment, the classical SA was modified; the result is the SAPS algorithm. The modification concerns how the current solution Scurrent is selected once the Metropolis Cycle has finished. During the execution of the Metropolis Cycle, the best solutions found are stored in a set named SSbetters.
The best of all solutions created in this cycle is stored in Sbetter, and the original sequence is stored in Soriginal. After the Metropolis Cycle is finished, the current solution Scurrent is randomly selected from Soriginal, Sbetter, or SSbetters, so Scurrent∈{Soriginal,Sbetter}∪SSbetters.
The Metropolis Cycle length of SAPS grows from an initial value Linitial to a final value Lfinal. At high temperatures, the length is set to the small value Linitial, and as the temperature is decreased, the length is increased until it reaches Lfinal; when Tfinal is reached, Lfinal is reached too. Thus, a small number of solutions are created at high temperatures, and the number of solutions created per cycle increases by a factor γ, where γ>1, each time the temperature is decreased.
Algorithm 3 shows the pseudocode of SAPS. Some lines were added to the SA pseudocode: at line 5, Sbetter and SSbetters are initialized with Soriginal; at line 19, Sbetter is added to SSbetters; and at line 31, Scurrent is chosen from Soriginal, Sbetter, or SSbetters.
Algorithm 3: Pseudo code of SAPS.
1: Tune initial temperature (Tinitial)
2: Tune final temperature (Tfinal)
3: Set the cooling factor
4: Set Scurrent with Soriginal
5: Add Soriginal to SSbetters and to Sbetter
6: Calculate similarity of Scurrent
7: T = Tinitial
8: Set Lcm
9: While (T > Tfinal) do
10:     n = 1
11:     While (Lcm > n)
12:         Create Snew adding or removing a gap to Scurrent
13:         Calculate similarity of Snew
14:         Obtain difference of similarity between Snew and Scurrent
15:         If (difference ≤ 0) then
16:             Scurrent = Snew
17:             If similarity (Snew) > similarity (Sbetter) then
18:                 Sbetter = Scurrent
19:                 Add Sbetter to SSbetters
20:             end if
21:         else
22:             Boltzmann probability = exp(−difference/T)
23:             If (Boltzmann probability) > random(0,1) then
24:                 Scurrent = Snew
25:             end if
26:         end if
27:         n = n + 1
28:     end while
29:     Decrease T
30:     Increase Lcm
31:     Randomly choose Scurrent from SSbetters, Sbetter, or Soriginal
32: end while
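A compact Python sketch of Algorithm 3 for a pair of sequences follows. It is an illustration under stated assumptions rather than the authors' code: the similarity score (percentage of columns with identical bases), the single-gap perturbation, the uniform choice among the three restart sources, and all parameter values in the usage example are assumptions, and the Metropolis Cycle here runs `int(length)` iterations.

```python
import math
import random

GAP = "-"


def similarity(pair):
    """Percentage of columns where both sequences hold the same base (assumed score)."""
    a, b = pair
    length = max(len(a), len(b))
    a, b = a.ljust(length, GAP), b.ljust(length, GAP)
    return 100.0 * sum(x == y != GAP for x, y in zip(a, b)) / length


def perturb(pair, rng):
    """Insert one gap into, or remove one gap from, a randomly chosen sequence."""
    seqs = list(pair)
    k = rng.randrange(2)
    s = seqs[k]
    gaps = [i for i, ch in enumerate(s) if ch == GAP]
    if gaps and rng.random() < 0.5:
        i = rng.choice(gaps)
        seqs[k] = s[:i] + s[i + 1:]
    else:
        i = rng.randrange(len(s) + 1)
        seqs[k] = s[:i] + GAP + s[i:]
    return tuple(seqs)


def saps(original, t_initial, t_final, alpha, l_initial, gamma, rng):
    """Simulated Annealing with Previous Solutions, following Algorithm 3."""
    current = original
    best = original                       # Sbetter
    betters = [original]                  # SSbetters: every improvement found so far
    t, length = t_initial, float(l_initial)
    while t > t_final:                    # Temperature Cycle
        for _ in range(int(length)):      # Metropolis Cycle
            new = perturb(current, rng)
            difference = similarity(current) - similarity(new)
            if difference <= 0:
                current = new
                if similarity(new) > similarity(best):
                    best = current
                    betters.append(best)
            elif math.exp(-difference / t) > rng.random():
                current = new
        t *= alpha                        # decrease T
        length *= gamma                   # increase the Metropolis Cycle length
        # SAPS step: restart from the original, the best, or a stored solution
        # (each source chosen with equal probability, an assumption).
        current = rng.choice([original, best, rng.choice(betters)])
    return best


rng = random.Random(42)
pair = ("ACGTACGT", "CGTACGTA")
best = saps(pair, t_initial=100.0, t_final=0.5, alpha=0.9,
            l_initial=2, gamma=1.07, rng=rng)
print(similarity(pair), similarity(best))
print(best)
```

The only structural difference from the SA loop is the final statement of the Temperature Cycle: instead of carrying the last accepted solution into the next cycle, the search restarts from a previously stored solution.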
4. Analytical Tuning Method
Some parameters of SAPS are tuned by an analytical method [19–22]. The initial temperature is calculated from the maximum deterioration of the instance, $\Delta Z_{\max}$, and the probability of accepting a solution at high temperature. The final temperature is calculated from the minimum deterioration of the instance, $\Delta Z_{\min}$, and the probability of accepting a solution at low temperature.
Analytical tuning based on the Boltzmann distribution can be helpful for setting the initial temperature [21]. At high temperatures, the probability of accepting any new solution $S_{\mathrm{new}}$ is very close to 1 ($P(S_{\mathrm{new}}) \approx 1$), so even the maximum deterioration of the cost function is admitted. The initial temperature $T_{\mathrm{initial}}$ is therefore associated with the maximum deterioration admitted and the chosen acceptance probability.
Let $S_{\mathrm{current}}$ be the current solution and $S_{\mathrm{new}}$ a newly proposed one, and let $Z(S_{\mathrm{current}})$ and $Z(S_{\mathrm{new}})$ be their respective costs; the maximum and minimum deteriorations are denoted $\Delta Z_{\max}$ and $\Delta Z_{\min}$. The probability of accepting a new solution with deterioration $\Delta Z$ at temperature $T$ is $P(\Delta Z) = \exp(-\Delta Z / T)$, which is the Boltzmann distribution. Setting $\Delta Z = \Delta Z_{\max}$ and solving for the temperature gives the initial temperature, $T_{\mathrm{initial}} = -\Delta Z_{\max} / \ln(P(\Delta Z_{\max}))$. Similarly, the final temperature is established from the probability of accepting a new solution with the minimum deterioration: $T_{\mathrm{final}} = -\Delta Z_{\min} / \ln(P(\Delta Z_{\min}))$.
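The two temperature formulas can be evaluated with a short Python sketch. The deterioration values are those reported in Section 5 (the maximum deterioration for SA on the env gene in Table 2, and a minimum deterioration of 1.0); the acceptance probabilities 0.99 and 0.10 are assumptions, since the paper does not state which probabilities were used. The function name `tuned_temperature` is hypothetical.

```python
import math


def tuned_temperature(deterioration, acceptance_probability):
    """Solve P = exp(-dZ / T) for T, giving T = -dZ / ln(P)."""
    if not 0.0 < acceptance_probability < 1.0:
        raise ValueError("acceptance probability must lie strictly between 0 and 1")
    return -deterioration / math.log(acceptance_probability)


t_initial = tuned_temperature(26.53, 0.99)   # assumed P at high temperature
t_final = tuned_temperature(1.0, 0.10)       # assumed P at low temperature
print(round(t_initial, 2), round(t_final, 2))
```

With these assumed probabilities, the results come out close to the initial and final temperatures listed for that row of Table 2, which suggests, but does not confirm, that similar probabilities were used.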
Other parameters of SAPS depend on the cooling function $T_{k+1} = \alpha T_k$: the Metropolis Cycle length and its growth factor $\gamma$ are both derived from it.
The analytical method determines the Metropolis Cycle length $L_k$ with a simple Markov model [22]. At high temperatures, only a few iterations are required because the stochastic equilibrium is reached quickly; at low temperatures, a more exhaustive exploration is needed, so a larger $L_k$ is used. Let $L_1$ be the value of $L_k$ at $T_{\mathrm{initial}}$ and let $L_{\max}$ be the maximum Metropolis Cycle length.
Let $T_k$ be decreased by the cooling function $T_{k+1} = \alpha T_k$, and let $L_{k+1}$ be calculated by $L_{k+1} = \gamma L_k$, where $\gamma > 1$ is the growth rate of the Metropolis Cycle; thus $L_{k+1} > L_k$, $L_1$ is the initial length, and the last Metropolis Cycle has length $L_{\max}$. Both functions are applied successively from $T_{\mathrm{initial}}$ to $T_{\mathrm{final}}$; consequently, $T_n = \alpha^n T_{\mathrm{initial}}$ and $L_{\max} = \gamma^n L_1$, where $n$ is the number of steps from $T_{\mathrm{initial}}$ to $T_{\mathrm{final}}$. Setting $T_n = T_{\mathrm{final}}$ and solving gives $n = (\ln T_{\mathrm{final}} - \ln T_{\mathrm{initial}}) / \ln \alpha$ and $\gamma = \exp((\ln L_{\max} - \ln L_1) / n)$.
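The expressions for the number of steps and the growth factor translate directly into Python. In this sketch the function name `metropolis_growth` is hypothetical and the cooling factor 0.90 is an assumption; the temperatures and cycle lengths are example values of the kind reported in Section 5.

```python
import math


def metropolis_growth(t_initial, t_final, alpha, l_first, l_max):
    """Return (n, gamma) such that alpha**n * t_initial == t_final
    and gamma**n * l_first == l_max."""
    n = (math.log(t_final) - math.log(t_initial)) / math.log(alpha)
    gamma = math.exp((math.log(l_max) - math.log(l_first)) / n)
    return n, gamma


n, gamma = metropolis_growth(2639.71, 0.43, alpha=0.90, l_first=2, l_max=300)
print(round(n, 1), round(gamma, 3))

# Consistency check: applying both recurrences n times reaches the targets.
print(2639.71 * 0.90 ** n, 2 * gamma ** n)
```

Note that n is generally not an integer; an implementation that loops while T > Tfinal performs the next whole number of steps, so the last cycle length may slightly exceed the nominal maximum.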
5. Implementation
SAPS was tested with genes of the human and simian HIV viruses. The nine genes of the human virus were compared with the nine genes of the simian virus; for example, the gene named “env” of human HIV was aligned with the gene “env” of simian HIV, the gene named “gag” of human HIV was aligned with the gene “gag” of simian HIV, and so on. The information on the virus genes is shown in Table 1. The parameters Tinitial, Tfinal, ΔZmax, and γ are obtained by the analytical method. The factor γ is greater than 1 and very close to 1. The values of these parameters are shown in Table 2. In this table, the initial temperatures are high; these values are related to the maximum deterioration and the probability of accepting solutions at high temperatures. The final temperature has a value close to zero (0.43) for every instance; this is because the minimum deterioration is equal to 1.0. The parameters L1 and Lmax have the values 2 and 300, respectively.
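Table 1: HIV genes of human and simian.
| Gene | Number of bases (human) | Number of bases (simian) |
| --- | --- | --- |
| pol | 3011 | 3179 |
| env | 2570 | 2564 |
| gag | 1502 | 1532 |
| vif | 578 | 644 |
| nef | 371 | 791 |
| rev | 351 | 351 |
| tat | 259 | 304 |
| vpu | 248 | 245 |
| vpr | 237 | 306 |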
Table 2: Values of parameters.
| Algorithm | Gene | Tinitial | Tfinal | ΔZmax | γ |
| --- | --- | --- | --- | --- | --- |
| SA | env | 2,639.71 | 0.43 | 26.53 | 1.06 |
| SAPS | env | 1,269.61 | 0.43 | 12.76 | 1.07 |
| SA | gag | 1,211.90 | 0.43 | 12.18 | 1.07 |
| SAPS | gag | 365.16 | 0.43 | 3.67 | 1.08 |
| SA | nef | 883.55 | 0.43 | 8.88 | 1.07 |
| SAPS | nef | 787.04 | 0.43 | 7.61 | 1.07 |
| SA | pol | 369.14 | 0.43 | 3.71 | 1.08 |
| SAPS | pol | 436.80 | 0.43 | 4.39 | 1.08 |
| SA | rev | 5,726.18 | 0.43 | 57.55 | 1.06 |
| SAPS | rev | 6,651.52 | 0.43 | 66.85 | 1.06 |
| SA | tat | 6,151.04 | 0.43 | 61.82 | 1.06 |
| SAPS | tat | 6,594.80 | 0.43 | 66.28 | 1.06 |
| SA | vif | 666.64 | 0.43 | 6.70 | 1.07 |
| SAPS | vif | 1,349.21 | 0.43 | 13.56 | 1.07 |
| SA | vpr | 1,248.71 | 0.43 | 12.55 | 1.07 |
| SAPS | vpr | 1,784.02 | 0.43 | 17.93 | 1.07 |
| SA | vpu | 4,556.07 | 0.43 | 45.79 | 1.06 |
| SAPS | vpu | 1,609.90 | 0.43 | 16.16 | 1.07 |
6. Experimentation and Results
Table 3 shows the results of the experiments: the average similarity and the standard deviation obtained for the genes of both viruses (human HIV and simian HIV). The average obtained by SAPS is at least as good as the average obtained by classical SA for every gene and strictly better for seven of the nine; for rev and tat, both algorithms reach 98.00%. Table 4 shows that the processing time of SAPS is shorter than that of SA for seven of the nine genes, the exceptions being pol and vif.
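Table 3: Results of quality solutions.
| Gene | Classical SA: average (%) | Classical SA: standard deviation | SAPS: average (%) | SAPS: standard deviation |
| --- | --- | --- | --- | --- |
| pol | 29.70 | 0.31 | 30.35 | 0.24 |
| env | 47.41 | 5.59 | 61.60 | 4.31 |
| gag | 40.33 | 3.30 | 47.20 | 1.01 |
| vif | 33.71 | 2.72 | 39.40 | 0.66 |
| nef | 34.84 | 1.07 | 36.56 | 0.44 |
| rev | 98.00 | 0.00 | 98.00 | 0.00 |
| tat | 98.00 | 0.00 | 98.00 | 0.00 |
| vpu | 52.40 | 10.60 | 77.65 | 8.63 |
| vpr | 37.37 | 3.38 | 50.24 | 4.26 |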
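Table 4: Results of processing time.
| Gene | Processing time SA (s) | Processing time SAPS (s) |
| --- | --- | --- |
| pol | 171 | 224 |
| env | 149 | 142 |
| gag | 90 | 84 |
| vif | 19 | 20 |
| nef | 20 | 18 |
| rev | 34 | 19 |
| tat | 22 | 15 |
| vpu | 14 | 11 |
| vpr | 13 | 11 |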
7. Conclusions
In this paper, a new approach is proposed to make the classical Simulated Annealing algorithm more efficient at solving the problem of aligning genomic sequences. This approach is called SAPS. After completing the Metropolis Cycle, the current solution is selected randomly from the set of best solutions, the best solution, and the initial solution. This change to classical Simulated Annealing improved its efficiency in solving the sequence alignment problem. The parameters of the SA and SAPS algorithms, specifically the initial temperature, the final temperature, and the Metropolis Cycle length, were tuned using an analytical method.
This tuning approach depends directly on the instance being tested. The minimum and maximum deteriorations are calculated by preprocessing the instance. With these values and the probability of acceptance, the initial and final temperatures are calculated.
References
[1] R. M. Karp, "Mapping the genome: some combinatorial problems arising in molecular biology," in Proceedings of the 25th Annual ACM Symposium on the Theory of Computing, pp. 278–285, May 1993. doi:10.1145/167088.167170
[2] E. S. Lander, R. Langridge, and D. M. Saccocio, "Mapping and interpreting biological information," vol. 34, no. 11, pp. 33–39, 1991.
[3] L. Wang and T. Jiang, "On the complexity of multiple sequence alignment," vol. 1, no. 4, pp. 337–348, 1994.
[4] C. H. Papadimitriou and K. Steiglitz, Dover Publications, Mineola, NY, USA, 1998.
[5] J. Setubal and J. Meidanis, PWS Publishing, 1997.
[6] O. Gotoh, "An improved algorithm for matching biological sequences," vol. 162, no. 3, pp. 705–708, 1982.
[7] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," vol. 48, no. 3, pp. 443–453, 1970.
[8] I. Ö. Bucak and V. Uslan, "An analysis of sequence alignment: Heuristic algorithms," in Proceedings of the 32nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'10), pp. 1824–1827, September 2010. doi:10.1109/IEMBS.2010.5626428
[9] L. Chen, L. Zou, and J. Chen, "An efficient ant colony algorithm for multiple sequences alignment," in Proceedings of the 3rd International Conference on Natural Computation (ICNC '07), pp. 208–212, August 2007. doi:10.1109/ICNC.2007.189
[10] J. Kim, S. Pramanik, and M. J. Chung, "Multiple sequence alignment using simulated annealing," vol. 10, no. 4, pp. 419–426, 1994.
[11] S.-M. Chen and C.-H. Lin, "Multiple DNA sequence alignment based on genetic simulated annealing techniques," vol. 18, no. 2, pp. 97–111, 2007.
[12] C. Notredame, D. G. Higgins, and J. Heringa, "T-coffee: a novel method for fast and accurate multiple sequence alignment," vol. 302, no. 1, pp. 205–217, 2000. doi:10.1006/jmbi.2000.4042
[13] V. Cerny, "Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm," vol. 45, no. 1, pp. 41–51, 1985. doi:10.1007/BF00940812
[14] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," vol. 220, no. 4598, pp. 671–680, 1983.
[15] E. Aarts and J. Korst, Wiley-Interscience Series in Discrete Mathematics and Optimization, John Wiley & Sons, Chichester, UK, 1989.
[16] L. Ingber, "Simulated annealing: practice versus theory," vol. 18, no. 11, pp. 29–57, 1993.
[17] U. Kjærulff, "Optimal decomposition of probabilistic networks by simulated annealing," vol. 2, no. 1, pp. 7–17, 1992. doi:10.1007/BF01890544
[18] P. J. Van Laarhoven and E. H. L. Aarts, Kluwer Academic Publishers, 1987.
[19] J. Frausto-Solis, E. F. Román, D. Romero, X. Soberon, and E. Liñán-García, "Analytically tuned simulated annealing applied to the protein folding problem," vol. 4488, no. 2, pp. 370–377, 2007. doi:10.1007/978-3-540-72586-2_53
[20] J. Frausto-Solis, X. Soberon-Mainero, and E. Liñán-García, "MultiQuenching annealing algorithm for protein folding problem," vol. 5845, pp. 578–589, 2009. doi:10.1007/978-3-642-05258-3_51
[21] J. Frausto-Solís, H. Sanvicente-Sánchez, and F. Imperial-Valenzuela, "Andymark: an analytical method to establish dynamically the length of the Markov chain in simulated annealing for the satisfiability problem," vol. 4247, pp. 269–276, 2006.
[22] H. Sanvicente-Sánchez and J. Frausto-Solís, "A method to establish the cooling scheme in simulated annealing like algorithms," vol. 3045, pp. 755–763, 2004.