Algorithms of Ancestral Gene Length Reconstruction

Ancestral sequence reconstruction is a well-known problem in molecular evolution. The problem presented in this study is inspired by sequence reconstruction, but instead of leaf-associated sequences we consider only their lengths. We call this problem ancestral gene length reconstruction. It is a problem of finding an optimal labeling which minimizes the total length's sum of the edges, where both a tree and nonnegative integers associated with corresponding leaves of the tree are the input. In this paper we give a linear algorithm to solve the problem on binary trees for the Manhattan cost function s(v, w) = |π(v) − π(w)|.


Introduction
Ancestral sequence reconstruction (ASR) is a well-recognized problem in molecular evolution [1]. Let G be a (phylogenetic) tree with n leaf nodes, and strings over one alphabet (gene sequences) assigned to leaves ( ≤ ). ASR may be defined in the following way: assignment of strings to inner nodes "in the best possible way. " There are two main paradigms for ASR: maximum parsimony (MP) and probabilistic-based reconstruction. The latter includes maximum likelihood (ML) and Bayesian reconstructions. MP reconstruction has a time complexity linear in the number of sequences analyzed. The problem of the parsimonious reconstruction of ancestral states for the given tree with the given states of its leaves (the most parsimonious assignment of the labels of internal nodes for a fixed tree topology) is a well-studied problem [2][3][4]. Efficient algorithms have also been developed for different types of ML-based reconstructions (reviewed in [5]). ASR methods require as input both a phylogenetic tree and a set of gene sequences associated with corresponding leaves of the tree [6].
ASR is related to gene sequence evolution while the problem presented in this paper, being inspired by ASR, deals with gene length variation. Instead of considering leaf-associated sequences we take into account only their lengths. Instead of the reconstruction of ancestral sequences, we search for the optimal reconstruction of ancestral gene lengths. The problem may be called ancestral gene length reconstruction (AGLR). AGLR is actually a problem of finding an optimal labeling which minimizes the total "length" sum of the edges, the minimum sum problem where both a tree and nonnegative integers associated with corresponding leaves of the tree are the input.
In the graph theory vertex labeling related problems were intensively studied [8]. Typically, the problems can be described as follows: for a given graph, find the optimal way of labeling the vertices with distinct integers. The problems and their solutions were described in [9][10][11][12]. In [13] we presented the algorithms to solve the minimum sum problem where both a tree and positive integers associated with all leaves of the tree are the input (finding the optimal way of labeling the vertices with positive integers). Here we would like to formulate the minimum sum problem where both a tree and positive integers associated with some of the leaves of the tree are the input (finding the optimal way of labeling the vertices with nonnegative integers). This problem reflects a situation in which the genome tree is constructed by one or another method for a set of genomes, the leaves of the tree are linked with the corresponding genomes of the set, and the leaves are labeled by integers designating lengths of genes of a chosen gene family. Some leaves would be labeled zero because corresponding genomes have no genes of the chosen gene family. Alternatively, it may be a case of a missing value but in this study we do not consider this case: in the problem definition that we bring here zero means "no value. " In this paper we provide a linear algorithm to solve max sum problem on binary trees for the Manhattan cost function (V, ) = | (V) − ( )|. The algorithm uses dynamic programming technique and the properties of the Manhattan distance.

Preliminaries
Let G be a tree with n leaf nodes, vertex set V(G), and edge set E(G).
= | ( )|. Let us number the leaf nodes of :1, 2, . . . , . Let us number the root of : . An integer labeling of is a mapping from to a set of nonnegative integers, where label 0 is an out-of-the ordinary label meaning "absent value. " Let us denote integer labeling of the leaf nodes of ( (1) = 1 , . . . , ( ) = ). Let us denote by min and max minimum and maximum positive integers labeling leaf nodes: min = min : Let us introduce a cost function of the edge V ∈ ( ): where the nonnegative cost function ( , ) has the following distance properties: 1 > 2 > = max − min + 1. 1 is a gain penalty, 2 is a loss penalty, and is a length change penalty function.
Since the likelihoods of loss and gain events are likely to differ, we may need to weight them differently. This is achieved by introducing different penalties 1 > 2 ; the loss penalty is normally assigned a value close to max − min , whereas the gain penalty should be larger due to biological considerations. They suggest that, on average, gene loss might be a more likely event than gene gain. Therefore, different gain penalties were used in our study similarly to as it was done in [14].
An example of a function ( , ) is | (V) − ( )| . In case of = 1 we obtain an absolute value of the difference between labelings V and : | (V) − ( )|. In case of = 2 we obtain a square of the difference between labelings V and : ( (V) − ( )) 2 .

An Arbitrary Tree and an Arbitrary Cost Function.
Given a tree , an integer labeling of the leaves of ( 1 , . . . , ) = 1, the gain penalty 1 , the loss penalty 2 , and a cost function ((1)-(5)), the minimum sum problem is to find a labeling which minimizes the total cost:

A Binary Tree
Problem . Given a binary tree , an integer labeling of the leaves of ( 1 , . . . , ), the "gain" penalty 1 , and the "loss" penalty 2 , the Manhattan minimum sum problem is to find the labelings which minimize the sum over all where 1 is a number of edges of type ( (V) = 0 & ( ) > 0), and 2 is a number of edges of type ( (V) > 0 & ( ) = 0). (1)). Due to the properties ((2)-(5)) of the cost function ( , ) all labels of the optimal labeling must be either equal to 0 or in the interval [ min , max ]. As a consequence of this, the dynamic programming (DP) method is applicable for the problem. It will be easier to explain the DP method on a binary tree using ( ) notation. The quantity ( ) will be interpreted as the minimal cost, given that node is assigned integer , to the subtree with the node as a root of the subtree.

DP Algorithm for a Binary Tree
Up Phase. A procedure called DP up calculates the costs ( ) of all nodes ( ) of the tree , given a cost function .
When we compute ( ) for the root node (the index of the root is ), then we simply choose the minimum of these values: Initiation. Given labeling of the leaf nodes of ( 1 , . . . , } at the tips of the tree the ( ) are easy to compute. The cost is 0 if the observed integer is integer , and infinite otherwise.
Iteration. For the immediate common ancestor of the nodes and , node , we have The interpretation of this equation is immediate. The smallest possible cost given that node is assigned zero is either the cost (0) or the "gain" penalty 1 plus the minimum of ( ), the least of the two plus the minima of corresponding values associated with the right descendant tree. The smallest possible cost given that node is assigned is a sum of two values: the first one is either the cost ( , ) of the edge from node to node , plus the cost ( ) of the left descendant subtree given that node is in state , or the "loss" penalty 2 plus (0); the second one is the cost ( , ) of the edge from the node to the node , plus the cost ( ) of the right descendant subtree given that node is in state . We select those values of and which minimize that sum. Equation (10) is applied successively to each inner node in the tree, doing a postorder tree traversal. Finally it computes all the ( ), and then (8) is used to find the minimum cost for the whole tree. The complexity of the Up phase of the algorithm is ( * * ).
Traceback. The procedure calculates the labels ( ) of all nodes of the tree .
Choose any integer which provides the minimum of the ( )-it is the root label. It may be either zero or a positive . Doing a preorder tree traversal, successively label each inner node in the tree: for any inner node , and given that a parent label was reconstructed, the label ( ) = is easily reconstructed as well.

DP Algorithm for an Arbitrary Tree
Up-Phase. A procedure DP up calculates the costs ( ) of all nodes ( ) of the tree.
Suppose that the descendant nodes of the node are called . The following equation will therefore be similar to (10) replacing the sum of and by the total sum of 1 , while 1 traverses all values of : This equation is applied successively to each node in the tree, doing a postorder tree traversal. Finally it computes all the ( ), and then (8) is used to find the minimum cost for the whole tree.
Down Phase. As Traceback above: Consider the following. (2)). Manhattan distance ( (V), ( )) is an absolute value of the difference between labelings V and : | (V)− ( )|. This distance measure has the following property: if siblings have positive labels, then all integers that lie between these values may equally serve as optimal labels of a parent.   So, as it would be proven below, at the bottom-up stage of the DP algorithm it would be sufficient to assign to each node in the tree four values: left( ), right( ), ( ), and ( ). The meanings of the values are as follows: left and right are bounds of an interval associated with the node , is a cost value (0), and is a cost ( ) for any integer from the interval: left ≤ ≤ right.

DP Algorithm for a Manhattan Sum for a Binary Tree (Problem
Initiation. Given labeling of the leaf nodes of ( 1 , . . . , ) = 1 these four values are easy to compute for the leaf nodes:

Examples.
Let us consider the simplest trees with two, three, and four labeled leaves. The simplest tree configuration is presented in Figure 1. There is only one node to label-the root node. The next simplest tree topology-three-leaf trees-is presented in Figure 2. There are two nodes to label, the inner node and the root. Determination of the optimal labeling of the four-leaf trees is very similar to the examples described above. Figure 3 illustrates labeling of the tree where all four leaves have nonzero labels: ((125, 141), (136, 150)). Labeling of the inner nodes is as above (Figure 2 Examples of the trees with very distinct subtrees are presented in Figures 4 and 5. In Figure 4 we present a tree obtained by merging two very different subtrees. The left 4leaf subtree has very obvious intuitive labeling of internal nodes: all nodes should be labeled by zero. The right subtree is identical to the tree presented in Figure 2  · · · · · · · · · Figure 4: Labeling of a "peculiar" tree. The left subtree has three zero and one nonzero leaf, while the right subtree has three nonzero leaves. The "gain" penalty 1 = 50; the "loss" penalty 2 = 30. Optimal labels are in red.

Bottom-Up Stage
Initiation. Given labeling of the leaf nodes of ( 1 , . . . , ) at the tips of the tree the ( ) are easy to compute. The cost is 0 if the observed integer is integer , and 3.2.3. Pseudocode. For more details see Pseudocode 1.

Traceback Stage
Interval Correction Rule.
Otherwise, the bounds of the corrected interval would not be changed from the original ones: Initiation. Labeling of the root: if ( ) ≤ ( ), then correct the root interval according to (15)- (17), and then choose an integer from the corrected interval assigned to the root node otherwise choose 0-it is the root label ( ).
Iteration. Doing a preorder tree traversal, successively label each node in the tree either by an integer from the corrected interval assigned to this node which is the nearest to its parent is a cost of (0), is a cost of ( ) for all i:  label (it may be either the value equal to the parent label or the boundary value of the interval assigned with the node) or by 0.
The proof of the correctness of a simpler algorithm (without zero-labeled leaves) is published in [15]. In the Appendix there are several lemmas, from which the correctness of the algorithm presented here results. Figure 5 of [7] the consensus trees obtained from 100 genome trees were presented. The trees were produced on the basis of 80% randomly chosen COGs, and the right tree was produced on the basis of 15%jackknifing (the explanations in the text of [7]). This tree possesses phylogenetic reasonableness. A part of this tree was selected to illustrate the algorithm. We took the upper part of the tree related exclusively to Archaea (see A/B marked arrow in Figure 4(b) from [7]) and placed the root at the point dividing all Archaeal genomes into Euryarchaeota and Crenarchaeota (see E/C marked arrow in Figure 4(b) from [7]). Thus, Figure 5 is a part of Figure 4(b) from [7] labeled according to COG0835. This COG was randomly selected as suitable for purposes of illustration. Table 1 presents a list of Archaeal genomes from the whole set of genomes that were used for a genome tree construction (Figure 4(b) from [7]). Table 2 presents the lengths of the Archaeal proteins of this COG.
hungatei protein are obvious outliers; (2) taking the median value of paralog's lengths of the genomes 30, 31, 46, 48, and 50. Figure 5 presents results of application of the bottom-up and traceback stages of the algorithm to this tree: a quartet that was assigned to a node at the bottom upstage is shown under the edge linking the node and its parent node, a label that was assigned to the node at the traceback stage, is shown over the same edge. As we can see the root is labeled by zero. There are two gene-birth events and one gene-death event. One gene was born with the length of 155 and another gene birth is labeled by 146. Genome number 32 (Haloquadratum walsbyi) has no protein from COG0835, while other Haloarchaea (genomes 29-31) do have. Thus, the edge connecting with leaf labeled by 32 is marked with a gene-loss symbol.

Discussion
In [15] the algorithms to find the optimal labeling of the vertices of the tree under Wagner parsimony were presented. A simple extension of the problem could be finding the optimal labeling of the vertices of the tree with nonnegative integers. This more realistic approach requests special consideration of zero labeling. Wedges of type ( , 0), > 0, should be scored differently from wedges of type (0, ), > 0, because the ( , 0) notes gene loss, while (0, ) notes gene gain. These events should be scored differently. Interestingly, this differentiated scoring in addition to tree labeling resulted in reconstruction of "parsimonious" evolutionary scenario. Reconstruction of a gene evolution along a species tree is an interesting and principal problem. Lyubetsky and his coworkers contributed a lot to formulation and solving this problem. In their studies [16][17][18][19][20][21][22] the authors tackled mainly two important and sophisticated phylogenetic problems. The obtained results are partially reviewed in the first section of [22] which also provides an extended biological background and relevant references. Reconstruction of a gene evolution along a species tree (to build the evolutionary scenario), following the approach of Lyubetsky et al., is to find an optimal mapping of a gene tree into a species tree. (An example of a different approach was presented in [14].) The second problem is to construct a supertree from the given set of gene trees.
As it was mentioned in [22], the first problem, stated as a tree-into-tree mapping, is solved in polynomial (often linear, and at maximum cubic) time even for the case of time slices and horizontal gene transfers. The algorithms presented in our study are polynomial as well.
Choosing 1 (a gain penalty), 2 (a loss penalty), and (a label change penalty) is crucial for reconstruction of trustworthy evolutionary scenario. However, it is very difficult task and we cannot claim categorically that choosing "correct" parameters of the model will result in truly reliable reconstruction. We do plan to make a comparison between results obtained by abovementioned methods of Lyubetsky and ours (work in progress).
To prepare input for the algorithm, as it was done above for 3.2.5, the original data is to be transformed to the following format: to each (genome, COG) pair one standardized protein length should be assigned (as we described in [7]). For a given COG, each organism is represented by a calculated length-a median length of all paralogous proteins. A natural extension would be to formulate the labeling problem taking into account existence of paralogs.
We may define a -tuple integer labeling Π of as a mapping Π from to a set of -tuples composed of integers The simplest extension would be to introduce the case with identical sizes of -tuples composed of nonnegative integers. A uniform -tuple integer labeling Π of is characterized by a constant (V) for all V. The stretch of the edge V in a Π ( ) is a simple sum V = ∑ =1 ( (V), ( )) ⋅ ( , ) is defined as in (1). Given a uniform -tuple integer labeling of the leaves of G the minimum sum problem is to find a labeling which minimizes the total sum of the stretches of the edges. Some i (k) = 0. The minimum sum problem is that of minimizing s(G) = ∑ ∀{V }∈ ( ) V over all Π for given . By some modifications of the algorithms presented in this paper the minimizing -tuple labeling can be found. This model again is a gain-loss model. More sophisticated extension must provide more realistic definition of distance between two -tuples composed of positive integers by introducing duplication events.

Appendix
Lemma A.1 (root optimal label). Suppose that the root node is called and suppose that its children are called and . The claim is Proof. If we consider a subtree with the node as a root then ( ) designates the minimal cost, given that node has a label : ( Lemma A.2 (leaf parent optimal interval). Every node of the optimal integer labeling that all its children are leaf nodes has either a zero label or a label between labels of its children. Suppose that the root node is called a and suppose that its children are called and . The claim is  Proof. Figure 1 illustrates this lemma. Proof of the subclaims (1)-(3) is trivial. In case of condition (4) let us prove that for optimal integer labeling ( ) ≥ ( ). Suppose ( ) < ( ). Likewise, we prove that for optimal integer labeling ( ) ≤ +1 . Q.e.d. ≤ ( ) ≤ +1 . Lemma A.3 (parent optimal interval-(I)). An optimal label of a parent either is equal to zero or lies between extreme values of optimal labels of its children. If an optimal integer labeling provides the labels of two siblings 1 and 2 satisfying the conditions 1 ≤ ( 1 ) ≤ 1 & 2 ≤ ( 2 ) ≤ 2 , then the label of their parent satisfies min( 1 , 1 , 2 , 2 ) ≤ ( ) ≤ max( 1 , 1 , 2 , 2 ). Proof is as for Lemma A.1.
Lemma A.4 (parent optimal interval-(II)). An optimal label of a parent in case of the empty intersection of the optimal intervals of its children lies between these intervals.
If an optimal integer labeling provides the labels of two siblings 1 and 2 satisfying the conditions From assumption ( ) < 1 follows that is not an optimal labeling. Similarly, we can prove that from assumption ( ) > 2 follows that is not an optimal labeling.
(2) 2 ≤ 1 . Similarly to (1) let us assume ( ) < 2 ; ( ) = 2 − . Then we introduce a new labeling by changing labels for three nodes: ( ) = 2 ; ( 1 ) = 1 ; ( 2 ) = 2 . ( ) − ( ) > 0, so we have a contradiction with the statement that ( ) is an optimal labeling. Lemma A.5 (parent optimal interval-(III)). An optimal interval of a parent is either an intersection of the optimal intervals of its children or an interval that lies between these intervals in case that their intersection is empty.
If an optimal integer labeling provides the labels of two siblings 1  For example, if 1 ≤ 2 ≤ 1 ≤ 2 , then Lemma A.5 states that ( ) satisfies the following condition 2 ≤ ( ) ≤ 1 . Proof is similar to proof in Lemma A.4.