Contextual Multiple Sequence Alignment

In a recently proposed contextual alignment model, efficient algorithms exist for global and local pairwise alignment of protein sequences. Preliminary results obtained for biological data are very promising. Our main motivation was to adopt the idea of context dependency to the multiple alignment setting. To this aim the relaxation of the model was developed (we call this new model averaged contextual alignment) and a new family of amino acids substitution matrices are constructed. In this paper we present a contextual multiple alignment algorithm and report the outcomes of experiments performed for the BAliBASE test set. The contextual approach turned out to give much better results for the set of sequences containing orphan genes.


INTRODUCTION
The multiple alignment of biological sequences has become an essential tool in molecular biology. It is used to find conserved regions and motifs in protein families, to detect the homology between new sequences and groups of sequences having an already known function and in a preliminary phase of protein structure prediction. Multiple alignment is also extensively used in molecular evolutionary analysis.
The various genome projects have provided the biologist with a great number of new protein sequences, and the rate of appearance of these data is steadily increasing. The development of an accurate and reliable multiple alignment program which is capable of handling many (often very divergent) sequences simultaneously is still of major importance.
The complexity of the problem does not allow to find the exact solution in a reasonable computational time [1]. Traditionally, the most popular heuristic approach has been the progressive alignment method [2].
In this paper we propose to explore new model for sequence alignment, in which the score for the substitution also depends on its neighborhood in the sequence. Such contextual alignment model has been proposed re-Correspondence and reprint requests to Rafał Otto, Institute of Informatics, Warsaw University, Banacha 2, 02-097 Warsaw, Email: Poland; rotto@mimuw.edu.pl This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. cently in [3] for the pairwise alignment problem. Preliminary results obtained for biological data by Gambin and Slonimski in [4] are very promising. To apply the contextual approach in the multiple alignment setting we have decided to relax slightly the model from [3]. However, we still need the family of contextual amino acid substitution matrices, for which a novel construction procedure is described. We present preliminary experimental results that illustrate the advantage of using a contextual approach in progressive alignment algorithm. It turned to be particularly useful in aligning the family of sequences containing several orphans (these are distantly related sequences, sometimes sharing the common fold).
It should be clear that the existence of orphan genes is unavoidable. Despite the accumulation of genetic information, newly sequenced genomes continue to reveal a high proportion (even to 50%) of uncharacterized genes. Among them there is a significant number of strictly orphan genes without any resemblance to previously determined protein sequences. Moreover, most genes found in databases have only been predicted by computer methods and have never been experimentally validated. Hence, for the alignment method it is important to tolerate orphans (some existing programs exclude the divergent orphans as unrelated or unalignable sequences) and to keep the stability of the family alignment when orphans are introduced into the sequence set.
The paper is organized as follows. We start with the description of an averaged contextual model. Then we present a construction method for contextual substitution tables. The next section proposes the progressive multiple alignment algorithm that takes context into account. The results of experimental analysis are presented in "results," which is followed by conclusions and discussion of further works.

AVERAGED CONTEXTUAL MODEL
The contextual alignment model considered in [3] cannot be directly applied to the problem of multiple alignment. In this model the score of an alignment depends on the order of operations (substitutions and indels) performed, as a substitution at one position can change the context for neighboring sites. The optimal alignment for the pair of sequences was defined as the alignment having the maximal score, when we maximize over all possible chronologies of evolutionary changes. More detailed study of the structure of optimal alignments and the description of efficient algorithms constructing them are included in [3].
To deal with several sequences simultaneously and also to keep the context dependency, we propose a relaxed contextual model. In this model we also penalize substitution considering two surrounding letters but we do not take care of the relative order of operations. In our algorithm the context independent and affine gap penalties are assumed.
Consider the following example of a short fragment of pairwise alignment: In the contextual model the score for substitution C → D would depend on the order of operations. For instance, if the substitution H → A has been performed after the substitution C → D and the substitution A → G has been performed before C → D, then the substitution C → D would have the left context H and the right context G. In our simplified model we consider all 4 possible contexts for the middle substitution and take an average of 4 contextual scores. Notice that standard noncontextual, for example, Blosum matrix entry can be viewed as an average over all 400 possible pairs of contexts.
As a second example, consider the substitution surrounded by a deletion on the left and an insertion on the right: If none of them has happened before the substitution C → D, then the left context is C. If AHC has been deleted before, then the left context is A or H depending on the relative order of these two substitutions. Analogous cases can be considered for the right context of the substitution C → D. As before we count the score as the average over 9 contextual scores.

Methods
The contextual alignment algorithm, as an important part of its input data, takes a contextual scoring table, which provides the score for every possible substitution in every possible context.
The family of matrices proposed in [5] suffers from the fundamental difficulty that the amount of data necessary to construct a complete contextual substitution table exceeds the data presently available by an order of magnitude. To cope with this problem we present here a new approach to the construction of contextual substitution tables. The algorithm is in fact a contextual extension of the one that has been used to create Blosum tables [7].
As the input data we take the database of blocks (ungapped fragments of multiple alignments) and start by computing the observed frequency of substitutions. The extension from the existing method is the fact that we distinguish substitutions having different contexts. Let f k,l i, j denote the number of observed substitutions i → j in the context of k and l.
Each of the considered substitutions can have 4 different contexts so instead of increasing by 1 the entry f i, j we increase entry f k,l i, j by 1/4 for all four possible pairs of (k, l).
Having computed frequency table we define the observed frequency for each substitution i → j in the context (k, l) as (1) Now, we can compute the observed frequency of the residue i in the context (k, l) as and the observed frequency of the context (k, l) as u k,l = i p k,l i . The expected frequency of the substitution i → j in the context (k, l) is given by Finally the score for i → j in the context (k, l) is To avoid the influence of highly similar sequences we adopt the idea of clustering inside blocks as it was done in the Blosum table.

Results
As an input we have taken the BLOCKS+ database available at http://blocks.fhcrc.org (see [6]), which consists of 11 858 blocks representing 2608 groups. We have derived two kinds of tables: noncontextual (using the method in [7]) and contextual using the method described above. We have created tables with 7 different clustering percentages: 100%, 90%, 80%, 70%, 60%, 50%, and 40%. In Table 1 several characteristics of computed tables are summarized. The interesting observation is that the contextual tables have higher average score and higher entropy. Entropy also increases with clustering percentage as a normal consequence of reducing multiple contributions to amino acid pair frequencies from the most closely related sequences in the block. For the discussion of the notion of entropy in the context of substitution tables see [8].
The size of contextual tables (84 000 entries instead of 210 in case of noncontextual tables) implies a small amount of data that supports each table entry. If these statistics were too low, this could have direct impact on the quality of the score value. Tables 2 and 3 give a good view of these issues. Therefore we should discuss the robustness of proposed methods. The substitution table which was finally used in our experiments (CTX70) has an acceptable number of pairs impacting the average score; moreover there was no hole (blank entry) in this table.
For interested readers the matrices parameterized by different clustering constants can be found at http://www.mimuw.edu.pl/∼aniag/TABLES.

MULTIPLE ALIGNMENT ALGORITHM
The averaged contextual model is proposed to enable computing multiple alignment in the contextual manner.
Ignoring the relative order of operations in the alignment simplifies the task of computing multiple alignment; however it is still not easy to keep the complexity on the reasonable level. Multiple alignment dynamic programming algorithm is extremely time consuming even in the case of the noncontextual model. A lot of heuristic approaches, which have been already developed, try to reach the optimal solution with the highest possible probability. The progressive alignment [1,9] is one of the most popular alignment approaches.
Our contextual multiple alignment algorithm can be viewed as a contextual extension of popular ClustalW algorithm [9] or Feng-Doolittle algorithm [2], which belong to the family of progressive alignment algorithms. The main idea is to align pairs of sequences progressively and to deduce the multiple alignment from the set of pairwise alignments.
To this aim we have developed efficient averaged contextual pairwise alignment algorithms. These are appropriately modified standard dynamic programming procedures. We omit the details here; for interested readers all algorithms implemented in C++ can be found at http://www.cern.ch/rotto/Biology/Sources/ACM.
The important remark here is that we do not (not yet) intend to concur with the existing algorithms. Our goal is to demonstrate the usefulness of the contextual approach. We want to design algorithms that can be applied to the contextual model as well as to the noncontextual one. Then, we are able to compute alignments in both models and finally compare the results. In fact the ClustalW algorithm is equipped with the huge number of additional nontrivial heuristics (such as sequences weighting, substitution matrices varied at different alignment stages, residue-specific gap penalties, etc) which are not applied in our algorithm.
An overview of our algorithm is as follows: (1) Calculate a distance matrix from pairwise scores for a given group of sequences.
(2) Construct a guide tree from the distance matrix using the neighbor-joining clustering algorithm [10].
(3) Progressively align the sequences in order of decreasing similarity. Three kinds of alignments are considered here: (i) pairwise alignment of two sequences, (ii) alignment of a sequence with an alignment, (iii) alignment of two alignments.

Calculating distance matrix
Several methods to derive the pairwise evolutionary distance (sometimes called difference score) from alignment scores are proposed (see, eg, [2]). Being aware of the drawbacks of all these approaches (see Gonnet and Korostensky, Optimal scoring matrices for estimating distances between aligned sequences available at http://www.inf.ethz.ch/personal/gonnet/papers/Distance/ Distance.html for a detailed discussion) we decided to use the method proposed by Feng and Doolittle [2]. It works for global and local alignments. Assuming that S(V, W) is the local similarity score between the sequences V and W, then their distance is defined via where S iden is the average of the two scores for the two sequences compared with themselves and S rand is the expected score of two random sequences with the same amino acids composition as V and W.

Constructing guide tree
The next step in progressive multiple alignment is building the guide tree. Here we use the neighbor-joining method [10] and two alternative methods to find the root of the tree. The first one is by adding an outgroup sequence to the given sequence group [1] and the second is by finding the middle point of the tree.

Aligning
The last step is to progressively align sequences according to the order given by the guide tree. It means that for each internal node of the tree we align sequences already aligned from the left child of the node with sequences already aligned from the right child of the node. In the simplest case this alignment is pairwise, but closer to the root of the tree we have to align two alignments. We decided to solve that problem using the method of Feng and Doolittle [2]. First, in two given alignments we replace gap letter with neutral letter X having at the end two groups of sequences over the extended alphabet Σ {X} (where Σ is an alphabet of amino acids). Also, we extend substitution matrix by adding scores for a → X equal to 0 for all a. Then for each pair V , W of sequences where V is a sequence from the first group and W is a sequence from the second group, we compute pairwise alignment. The alignment with the maximal score is chosen and according to it we align two groups of sequences. Finally, in all sequences we replace neutral letter X back with a gap letter. In that way we obtain multiple alignment while reaching the root of the tree.
It is specifically designed for the evaluation and comparison of multiple sequence alignment programs. The sequences included in the database are selected from alignments in structural databases (such as FSSP and HOMSTRAD) or from manually constructed structural alignments taken from the literature. In our experimental analysis we have used the test sequences from 4 (out of 8) parts of the database, the so-called reference sets.
(i) Reference 1. It consists of families of equidistant protein sequences of similar length. Sequences are divided into 6 groups depending on theirs lengths and percent residue identity (% ID).
(ii) Reference 2. Each set here contains the group of closely related sequences (more than 25% ID) and up to three orphan genes (ie, genes sharing the common fold with the family, but having weak sequence similarity).
(iii) Reference 3. It includes groups consisting of several divergent protein families of equidistant sequences. The reference alignments consist of up to 4 families, with less than 25% ID between any two sequences from different families.
(iv) Reference 6. This includes the protein families containing repeats of different residue similarity.

Methodology
In the first stage of our experiment the multiple alignments were calculated for all reference sets in two settings: contextual and noncontextual. Then, the results obtained were compared with the reference alignments from the database. For this comparison the following measure (sum-of-pairs score [11]) was used. Let A 1 and A 2 be two multiple alignments of N sequences. Denote by M 1 and M 2 the lengths of these multiple alignments. Let A(i, j) stand for the ith residue in the jth sequence of A. Define for two residues a and b δ(a, b) = 1 if and only if a = b and δ(a, b) = 0 if and only if a = b. Now, for one column from the multiple alignment A we define And, finally SPS is the frequency of properly aligned pairs of residues with respect to the reference alignment.
In the second phase of the analysis we examine the robustness of the alignment to the introduction of orphans. To this aim we use the alignments from Reference 2, which contains related families with divergent, orphan sequences. Denote by G the set of all sequences from the considered group. Let φ(G) be the subset of G consisting only of the family of highly related sequences. Firstly, the multiple alignments A G were calculated for all groups G; then the multiple alignments for reduced groups (without orphans) A φ(G) . Let Φ(A G ) be the operation of cutting out from A G the rows which correspond to the orphans. Define the following measure: It tests the ability of a model to align divergent sequences and also the degree to which the alignment of the family is disrupted by the introduction of the orphans. We have performed the experiments with various substitution tables and gap penalties. The best scores are obtained for NONCTX70 and CTX70 with gap open penalty 5 and gap extension penalty 1.
Results of the first experiment are presented in Table 4. The entries are the SPS measures averaged over the groups of sequences. Clearly, the contextual approach yields much better in case of sequence families from Reference 2 set, which contains families with orphan genes (especially in case of families of short sequences). Table 5 summarizes the outcomes of the second experiment. The entries here are SPS values for investigated groups of sequences. Results of this experiment confirm the observation taken in the previous one.
The advantage of the contextual approach in aligning families containing orphan genes shown above is quite clear. Here we present some statistics taken on the whole set of experimental data to show that our method is performing a little better than other existing methods also in general case. Figure 1a proves that the results of the contextual model fit those given by the noncontextual one, but Figure 1b shows that contextual scores are more uniform. The fact that the contextual approach improves alignment more significantly in case of small values of noncontextual score is presented in Figure 1c. Figure 1, Tables 4 and 5 then show the contextual model performs slightly better than the noncontextual one. However, there are some examples when the contextual approach yields much better results. Among them we have the families listed in Table 6.
The challenging task here is to explain the biological phenomena that stand behind such an excellent behavior of the contextual approach in all of these examples. Answering this question could help to discover a better alignment algorithm that profits from contextual information. Probably such an algorithm could be very efficient for the sequences belonging to some special class of sequences. The characterization of this class remains an interesting open problem.

CONCLUSIONS AND FURTHER WORKS
It is clear that the experimental analysis described in this work is just a beginning and cannot be treated as a definitive proof. Various improvements and other experiments can be envisaged.

Wider and more distant contexts
In this paper we consider the simplest contextual model. The main idea of the context is to reflect the neighborhood of a given residue in the 3D protein structure. It suggests several possible extensions. One possibility is to consider a wider context, for example, two amino acids on each side. This approach is however limited by the huge size of substitution table (20 6 = 64 000 000 entries). The solution here is to consider a reduced context (ie, 20 amino acids can be divided into a small number of groups having similar biochemical properties (cf [5])).
Another approach is to consider a more distant context. As an example look at the alignment . . . CHCAD . . . . . . HADGC . . .
In the model we have presented, as a context we have taken two amino acids surrounding given substitution, that is, the left context of the substitution C → D consists of amino acid H or A, and the right consists of A or G. It is however biologically motivated (by a secondary structure) to consider the contexts which are separated by one position from the given residue, that is, the left context for the substitution C → D is amino acid C or H, and the right context consists of D or C. Preliminary results (see Figure 2) for the entropy of such defined substitution tables are very promising and encourage further research in this direction (more on entropy can be found in [8]).

Improvements for contextual multiple alignment algorithm
The algorithm presented in this paper follows the progressive alignment approach. The most popular algorithm and one of the most effective algorithms of this kind in the standard noncontextual model is ClustalW [9]. It contains a lot of additional heuristics. The challenging task is to design analogous improvements for the contextual model.