The reconstruction of phylogenies is becoming an increasingly simple activity. This is mainly due to two reasons: the democratization of computing power and the increased availability of sophisticated yet user-friendly software. This review describes some of the latest additions to the phylogenetic toolbox, along with some of their theoretical and practical limitations. It is shown that Bayesian methods are under heavy development as they offer the possibility to solve a number of long-standing issues and to integrate several steps of the phylogenetic analyses into a single framework. Specific topics include not only phylogenetic reconstruction, but also the comparison of phylogenies, the detection of adaptive evolution, and the estimation of divergence times between species.

Human cultures have always been
fascinated by their origins as a means to define their position in the world,
and to justify their hegemony over the rest of the living world. However,
scientific (testable) predictions about our origins had to wait for Darwin [

Today, molecular
phylogenies are routinely used to infer gene or genome duplication events [

Most of these
applications are beyond the scope of plant genomics, but they all suggest that
sophisticated phylogenetic methods are required to address most of today’s
biological questions. While parsimony-based methods are both intuitive and
extremely informative, for instance to disentangle genome rearrangements [

Because it is
not possible or even appropriate to discuss all the latest developments in a
given field of study, this review will focus on a very limited number of key
phylogenetic topics. Of notable exceptions, recent developments in phylogenetic
hidden Markov models [

The first step in reconstructing a
phylogenetic tree from molecular data is to obtain a multiple sequence
alignment (MSA) where sequence data are arranged in a matrix that specifies
which residues are homologous [

The easiest
sequences to align are probably those of protein-coding genes: proteins diverge
more slowly than DNA sequences and, as a result, are easier to align.
The rule of thumb is therefore to translate DNA to amino acid sequences first,
then perform the alignment at the protein level, and finally back-translate to
the DNA alignment. This procedure avoids inserting gaps in the
final DNA alignment that are not multiples of three and that would disrupt the
reading frame. Translation to amino acid sequences can be done directly when
downloading sequences, for instance from the National Center for Biotechnology
Information (NCBI:
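The back-translation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular program: the function name `back_translate`, the sequence names, and the toy sequences are all hypothetical, and the sketch assumes that each coding sequence is in frame and matches its protein residue for residue.

```python
def back_translate(aligned_protein: str, cds: str) -> str:
    """Thread an unaligned coding sequence onto its aligned protein.

    Each residue consumes one codon; each protein gap ('-') becomes '---',
    so every gap in the DNA alignment is a multiple of three.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than 3 x residue count")
    # Generator over the codons of the unaligned coding sequence.
    codons = (cds[i:i + 3] for i in range(0, 3 * n_residues, 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


# Hypothetical toy data: aligned proteins and their unaligned coding sequences.
protein_alignment = {"seqA": "MK-L", "seqB": "MKTL"}
coding_sequences = {"seqA": "ATGAAACTG", "seqB": "ATGAAGACCCTG"}

codon_alignment = {
    name: back_translate(aligned, coding_sequences[name])
    for name, aligned in protein_alignment.items()
}
for name, seq in codon_alignment.items():
    print(name, seq)
```

Because gaps are only ever inserted as whole codons, the reading frame of every sequence is preserved in the resulting DNA alignment.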

Name | Method | Platform | GUI | Inference | Reference |
---|---|---|---|---|---|
BAMBE | Bayes | DOS, MacOS, Unix | No | Tree | [ |
BayesPhylogenies | Bayes | DOS, MacOS, Unix | No | Tree | [ |
BAli-Phy | Bayes | DOS, MacOS, Unix | No | Simultaneous alignment and tree | [ |
BEAST | Bayes | Windows, MacOS, Unix | Yes | Tree, times | [ |
CONSEL | ML | DOS, MacOS, Unix | No | Tree comparison | [ |
DAMBE | Distances, parsimony, ML | Windows | Yes | Tree | [ |
GARLI | ML (Genetic Algorithm) | Windows, MacOS, Unix | Yes | Tree | [ |
HyPhy | ML | Windows, MacOS, Unix | Yes | Tree, selection, recombination, tree comparison | [ |
MEGA | Distances, parsimony | Windows | Yes | Tree, times | [ |
MrBayes | Bayes | DOS, MacOS, Unix | No | Tree, selection | [ |
Multidivtime | Bayes | DOS, MacOS, Unix | No | Times | [ |
OmegaMap | Bayes | DOS, MacOS, Unix | No | Simultaneous selection and recombination | [ |
PAML | ML | DOS, MacOS, Unix | No | Tree, tree comparison, times, selection | [ |
PAUP* | Distances, parsimony, ML | DOS, MacOS, Unix | No | Tree | [ |
PhyloBayes | Bayes | DOS, MacOS, Unix | No | Tree, tree comparison | [ |
PHYML | ML | DOS, MacOS, Unix | No | Tree | [ |
RAxML | ML | DOS, MacOS, Unix | No | Tree | [ |
r8s | PL | DOS, MacOS, Unix | No | Times | [ |

The alignment of
rRNA genes under the constraint of secondary structure is now frequently
used in practical research in molecular evolution and phylogenetics [

What to do with
other noncoding genes is still an open question, especially when it comes to
aligning a large number (>100) of long (>20,000 residues) and
divergent sequences (<25% identity). Some authors have attempted to
provide rough guidelines to choose the most accurate program depending on these
parameters [

Whichever method
is used to obtain an MSA, a final visual inspection is required, and manual
editing is often needed. To this end, a number of editors can be used such as JalView
[

Because an MSA
represents a hypothesis about sitewise homology at all positions, obtaining
an accurate MSA involves some circularity: an accurate MSA often requires
an accurate guide tree, which in turn demands an accurate alignment. The early
realization of this “chicken-and-egg” conundrum led to the idea that both the MSA
and the phylogeny should be estimated simultaneously [

Once a reliable MSA is obtained,
the next step in comparing molecular sequences is to choose a metric to
quantify divergence. The most intuitive measure of divergence is simply to
count the proportion of differences between two aligned sequences (e.g., [
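As an illustration of this simplest measure, the sketch below computes the proportion of differing sites (often called the p-distance) between two aligned sequences. The function names are hypothetical, and skipping gapped sites is an assumption of this sketch rather than a universal convention. The Jukes-Cantor correction is included only as one standard example of a model-based correction for multiple substitutions; it is not taken from the text that survives here.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') are ignored (an assumption
    of this sketch; other gap treatments exist).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (equal length)")
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor distance: expected substitutions per site given p."""
    if p >= 0.75:
        raise ValueError("p-distance is saturated (p >= 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGT", "ACGTACGA")  # one difference out of eight sites
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as the raw proportion, because the raw count cannot see multiple substitutions at the same site.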

Given the
variety of substitution models, the first step of any model-based phylogenetic
analysis is to select the most appropriate model [ χ^{2} distribution whose degrees of
freedom equal the number of additional parameters entering the more complex model
(see [
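A likelihood ratio test between two nested substitution models can be sketched as follows. The test statistic is twice the difference in log-likelihood, compared against a chi-square distribution whose degrees of freedom equal the number of additional parameters. The function names and the log-likelihood values are hypothetical, and the chi-square tail probability is computed with the standard library only, for integer degrees of freedom.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable, integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{i=0}^{df/2 - 1} y^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{i=1}^{(df-1)/2} y^(i-1/2) / Gamma(i+1/2)
    total = math.erfc(math.sqrt(y))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * y ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def likelihood_ratio_test(lnl_simple: float, lnl_complex: float, extra_params: int):
    """Return (test statistic, p-value) for two nested models."""
    statistic = 2.0 * (lnl_complex - lnl_simple)
    return statistic, chi2_sf(statistic, extra_params)


# Hypothetical log-likelihoods for a simpler model and a model with one more parameter.
stat, p_value = likelihood_ratio_test(-2345.6, -2340.1, extra_params=1)
print(f"2*delta(lnL) = {stat:.2f}, p = {p_value:.4g}")
```

The chi-square approximation assumes that the models are nested and that the extra parameters are not fixed at a boundary of their range under the simpler model; when that assumption fails, the p-value above is only a rough guide.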

However,
performing systematic hLRTs is not the optimal strategy for model selection in
phylogenetics [ AIC_{c}, [ AIC_{c},
and BIC) are available in ModelTest and ProtTest. Other procedures exist, such
as the Decision-Theoretic or DT approach [ AIC_{c} was derived under Gaussian assumptions for linear fixed-effect models [
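The information criteria mentioned here are simple functions of the maximized log-likelihood, the number of free parameters, and the sample size. The sketch below uses their standard definitions. The function name, the candidate models, and their log-likelihoods are hypothetical, and taking the sample size to be the alignment length is a common convention that is assumed here rather than prescribed by the text.

```python
import math


def information_criteria(lnl: float, k: int, n: int) -> dict:
    """AIC, AICc and BIC from a log-likelihood, k free parameters, n sites."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    aic = -2.0 * lnl + 2.0 * k
    aicc = aic + 2.0 * k * (k + 1) / (n - k - 1)  # small-sample correction
    bic = -2.0 * lnl + k * math.log(n)
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


# Hypothetical fits: model name -> (log-likelihood, substitution-model parameters).
candidates = {"JC69": (-2450.0, 0), "HKY85": (-2400.0, 4), "GTR+G": (-2393.0, 9)}
n_sites = 1000  # assumed alignment length used as the sample size

scores = {name: information_criteria(lnl, k, n_sites)
          for name, (lnl, k) in candidates.items()}
for criterion in ("AIC", "AICc", "BIC"):
    best = min(scores, key=lambda name: scores[name][criterion])
    print(criterion, "selects", best)
```

With these made-up numbers the criteria do not agree on a single model, which reflects the heavier penalty that BIC places on additional parameters.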

All the above
test procedures compare ratios of likelihood values penalized for an increase
in the dimension of one of the models, without directly accounting for
uncertainty in the estimates of model parameters. This may be problematic,
particularly for small data sets. The Bayesian approach to model selection, called
the Bayes factor, directly incorporates this uncertainty. It is also more
intuitive, as it directly assesses whether the data are more probable under a given
model than under a different one (e.g., [

There is an
element of circularity in model selection, just as in sequence alignment. In
theory, when the hLRT is used for model selection, the topology used for all
the computations should be that of the maximum likelihood tree. In practice,
model selection is based on an initial topology obtained by a fast algorithm
such as neighbor-joining [

Once the substitution model is
selected, the classical approach proceeds to reconstruct the phylogeny [

The combinatorial complexity relates to the extremely large number of
tree topologies that are possible with a large number of sequences [
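To give a sense of scale, the number of strictly bifurcating topologies can be computed from the well-known double-factorial formulas: (2n-5)!! unrooted trees and (2n-3)!! rooted trees for n sequences. The function names below are illustrative only.

```python
def double_factorial_odd(m: int) -> int:
    """Product m * (m - 2) * ... * 3 * 1 for odd m; returns 1 when m <= 1."""
    result = 1
    for k in range(m, 1, -2):
        result *= k
    return result


def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted binary topologies: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    return double_factorial_odd(2 * n_taxa - 5)


def count_rooted_trees(n_taxa: int) -> int:
    """Number of rooted binary topologies: (2n - 3)!! for n >= 2."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return double_factorial_odd(2 * n_taxa - 3)


for n in (4, 10, 20, 50):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

The count grows faster than exponentially, which is why an exhaustive search over topologies is feasible only for a small number of sequences.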

A first aspect
of the computational complexity relates to estimating the support of a
reconstructed phylogeny. This is more complicated than estimating a confidence
interval for a real-valued parameter such as a branch length, because a tree
topology is a graph and not a number. The classical approach therefore relies
on a nonstandard use of the bootstrap [
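The resampling step of the phylogenetic bootstrap can be sketched as below: alignment columns are drawn with replacement to build pseudo-replicate alignments of the same length. This is a minimal sketch under the usual column-resampling scheme. The function names and the toy alignment are hypothetical, and the tree-reconstruction step that would be run on each replicate is deliberately left out.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement (one pseudo-replicate)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # The same sampled columns are applied to every sequence, so sites stay homologous.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment: dict, n_replicates: int, seed: int = 0) -> list:
    """Generate n_replicates pseudo-replicate alignments reproducibly."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


toy = {"sp1": "ACGTAC", "sp2": "ACGTTC", "sp3": "TCGAAC"}  # hypothetical data
replicates = bootstrap_replicates(toy, n_replicates=100, seed=42)
print(len(replicates), replicates[0])
```

In practice a tree would be estimated from each replicate, and the support for a clade would be reported as the proportion of replicate trees that contain it.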

The most
fundamental aspect of the computational complexity in phylogenetics is due to
the structure of the phylogenies: these are trees or binary graphs on which
computations are nested and interdependent, which makes these computations
intractable or NP-hard [

Science proceeds by testing
hypotheses, and it is often necessary to compare phylogenies, for instance to
test whether a given data set supports the early divergence of gymnosperms with
respect to Gnetales and angiosperms (the anthophyte hypothesis), or whether the
Gnetales diverged first (the Gnetales hypothesis) [

Indeed, the Bayesian approach to hypothesis testing relies on computing the probability of
the data under a particular model. This quantity is usually not available as a
closed-form equation, and it must be approximated numerically. The most
straightforward approximation is based on the harmonic mean of the likelihood
sampled from the posterior distribution [
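The harmonic mean approximation can be written compactly when the sampled likelihoods are handled on the log scale. The sketch below assumes that the log-likelihood values were recorded from an MCMC run after burn-in. The function names and the sample values are hypothetical, and the log-sum-exp trick is used only to avoid numerical underflow.

```python
import math


def log_sum_exp(values: list) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_marginal_likelihood_hme(log_likelihoods: list) -> float:
    """Harmonic mean estimate of the log marginal likelihood.

    log p(D|M) ~ log N - log(sum_i 1 / L_i), with L_i sampled from the posterior.
    """
    n = len(log_likelihoods)
    if n == 0:
        raise ValueError("need at least one posterior sample")
    return math.log(n) - log_sum_exp([-ll for ll in log_likelihoods])


def log_bayes_factor(samples_m1: list, samples_m0: list) -> float:
    """Log Bayes factor of model 1 over model 0 from posterior log-likelihoods."""
    return log_marginal_likelihood_hme(samples_m1) - log_marginal_likelihood_hme(samples_m0)


# Hypothetical post-burn-in log-likelihood samples under two models.
model1 = [-1201.3, -1200.8, -1202.1, -1201.0]
model0 = [-1207.9, -1208.4, -1207.2, -1208.0]
print(log_bayes_factor(model1, model0))
```

This estimator is simple to compute but is widely regarded as unstable, because the sum is dominated by the few samples with the lowest likelihood; it should be read as a rough approximation.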

While model selection is fully
justified on the grounds of the bias-variance tradeoff, it should not be forgotten
that all these models are simplified representations of the actual substitution
process and are therefore all wrong. Stated differently, if AIC selects the
GTR+Γ+I model to
analyze a data set, it should be clear that this conclusion does not imply that
the data evolved under this model. All model selection procedures measure a
relative model fit. One way to estimate adequacy or absolute model fit is to
perform a parametric bootstrap test [

More realistic models are obtained by accommodating heterogeneities in the evolutionary
process at the level of both sites (space) and lineages (time). The simplest
site-heterogeneous model is one where the aligned data are partitioned,
usually based on some prior information. For instance, first and second codon
positions are known to evolve more slowly than third codon positions in
protein-coding genes, and exposed residues might evolve faster than buried amino
acids in globular proteins. A number of models were suggested to analyze such
partitioned data sets (e.g., [
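As a concrete example of partitioning on prior information, the sketch below splits an in-frame codon alignment into its three codon positions. The function name and the toy alignment are hypothetical, and the sketch assumes that the alignment starts at the first position of a codon and has a length that is a multiple of three.

```python
def split_by_codon_position(alignment: dict) -> dict:
    """Split an in-frame codon alignment into three codon-position partitions."""
    partitions = {1: {}, 2: {}, 3: {}}
    for name, seq in alignment.items():
        if len(seq) % 3 != 0:
            raise ValueError(f"{name}: length is not a multiple of three")
        for pos in (1, 2, 3):
            # Every third site, starting at the given codon position.
            partitions[pos][name] = seq[pos - 1::3]
    return partitions


toy = {"sp1": "ATGAAACTG", "sp2": "ATGAAGCTA"}  # hypothetical codon alignment
parts = split_by_codon_position(toy)
for pos, sub_alignment in parts.items():
    print(pos, sub_alignment)
```

Partitions 1 and 2 could then be merged and analyzed under one set of model parameters, with partition 3 under another, which is one common way of setting up such an analysis.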

Lineage-heterogeneous
models or heterotachous models [

Between these two extremes are mixture models, which extend the covarion model by allowing
more categories of sites. A number of formulations exist, where each site is
assumed to have been generated by either several sets of branch lengths [

Mixture models can work within the
framework of maximum likelihood, but the treatment of the weight factors is
complicated. A sound alternative is to resort to a fully Bayesian approach. A
prior distribution is set on the weight factors, and a special form of MCMC
sampler whose Markov chain moves across models with different numbers of
parameters, a reversible-jump MCMC sampler (RJ-MCMC), is constructed. The
advantage of RJ-MCMC samplers is that they allow estimating the phylogeny while
integrating over the uncertainty pertaining to the parameters of the
substitution model and even integrating over the model itself [

All these models
assume that each site evolves independently. The independence assumption greatly
simplifies the computations, but is also highly unrealistic. Models that
describe the evolution of doublets in RNA genes [

Fitness functions are, however,
difficult to determine at the molecular level. In addition, while examples of
adaptive evolution at the morphological level abound, from Darwin’s finches in
the Galapagos [

When effective population sizes are no longer a concern, for instance in studies at or above
the species level, the detection of positive selection in protein-coding genes
usually relies on codon models [

These rate
ratios can be estimated by a number of methods implemented in MEGA, DAMBE,
HyPhy [

However, the
first studies using these methods found little evidence for adaptive evolution
essentially because they were averaging

The estimation of the dates when
species diverged is often perceived to be as important as estimating the
phylogeny itself. This explains why so-called “dating methods” were sought
as soon as molecular phylogenies were first reconstructed [

The most recent work has then focused on relaxing this assumption, and three different
directions have emerged [

Bayesian methods appear today as the emerging approach to estimate divergence times. Taking
inspiration from Sanderson’s pioneering work [

With the advent of high-throughput
sequencing technologies such as the whole-genome shotgun approach by pyrosequencing [

Jeff Thorne provided insightful comments and suggestions, and two anonymous reviewers helped in improving the original manuscript. Support was provided by the Natural Sciences Research Council of Canada (DG-311625 to SAB and DG-261252 to XX).