The reconstruction of phylogenies is becoming an increasingly simple activity. This is mainly due to two reasons: the democratization of computing power and the increased availability of sophisticated yet user-friendly software. This review describes some of the latest additions to the phylogenetic toolbox, along with some of their theoretical and practical limitations. It is shown that Bayesian methods are under heavy development as they offer the possibility to solve a number of long-standing issues and to integrate several steps of the phylogenetic analyses into a single framework. Specific topics include not only phylogenetic reconstruction, but also the comparison of phylogenies, the detection of adaptive evolution, and the estimation of divergence times between species.
1. Introduction
Human cultures have always been
fascinated by their origins as a means to define their position in the world,
and to justify their hegemony over the rest of the living world. However,
scientific (testable) predictions about our origins had to wait for Darwin [1] and his intellectual
descendants first to classify [2] and then to reconstruct the
natural history of replicating entities, and thereby to kick-start the field of phylogenetics
[3, 4]. Rooted in the comparison of
morphological characters, phylogenies have for the past four decades focused on
the relationships between molecular sequences (e.g., [4]), potentially helped by
incorporating morphological information [5], in order to infer
ancestor-to-descendant relationships between sequences, populations, or species.
Today, molecular
phylogenies are routinely used to infer gene or genome duplication events [6], recombination [7], horizontal gene transfer [8], variation of selective
pressures and adaptive evolution [9], divergence times between
species [10], the origin of the genetic code [11], the origin of
epidemics [12], and host-parasite cospeciation
events [13, 14]. As complementary tools for
taxonomy (DNA barcoding: [15]), they have also contributed
to the formulation of strategies in conservation biology [16]. In addition to untangling
the ancestral relationships relating a group of taxa or a set of molecular
sequences, phylogenies have also been used for some time outside of the realm
of biological sciences as for instance in linguistics [17, 18] or in forensics [19, 20].
Most of these
applications are beyond the scope of plant genomics, but they all suggest that
sophisticated phylogenetic methods are required to address most of today’s
biological questions. While parsimony-based methods are both intuitive and
extremely informative, for instance to disentangle genome rearrangements [21], they also have their
limitations due to their minimizing the amount of change [22]. These limitations become
particularly apparent when analyzing distantly related taxa. A means to
overcome, at least partly, some of these difficulties is to adopt a
model-based approach, be it in a maximum likelihood or in a Bayesian framework.
These two frameworks are extremely similar in that they both rely on
probabilistic models. Bayesian approaches offer a variety of benefits when
compared to traditional maximum likelihood, such as computing speed (although
this is not necessarily true, especially under complex models), sophistication
of the model, and an appropriate treatment of uncertainty, in particular the
one about nuisance parameters. As a result, Bayesian approaches often make it
possible to address more sophisticated biological questions [23], which usually comes at the
expense of longer computing times and higher memory requirements than when
using simpler models.
Because it is
not possible or even appropriate to discuss all the latest developments in a
given field of study, this review will focus on a very limited number of key
phylogenetic topics. As notable exceptions, recent developments in phylogenetic
hidden Markov models [24] or applications that map
ancestral states on phylogenies [25] are not treated. We focus
instead on the very first steps involved in most phylogenetic analyses, ranging from reconstructing a tree to estimating
selective pressures or species divergence times. For each of these steps, some
of the most recent theoretical developments are discussed, and pointers to
relevant software are provided.
The first step in reconstructing a
phylogenetic tree from molecular data is to obtain a multiple sequence
alignment (MSA) where sequence data are arranged in a matrix that specifies
which residues are homologous [26]. A large number of methods
and programs exist [27] and most have been evaluated
against alignment databases [28], so that it is possible to
provide some general guidelines.
The easiest
sequences to align are probably those of protein-coding genes: proteins diverge
more slowly than DNA sequences and, as a result, proteins are easier to align.
The rule-of-thumb is therefore first to translate DNA to amino acid sequences,
then perform the alignment at the protein level, before back-translating to the
DNA alignment in a final step. This procedure avoids inserting gaps in the
final DNA alignment that are not multiples of three and that would disrupt the
reading frame. Translation to amino acid sequences can be done directly when
downloading sequences, for instance from the National Center for Biotechnology
Information (NCBI: www.ncbi.nlm.nih.gov). A number of programs also allow users
to perform this translation locally on their computers from an appropriate
translation table (e.g., DAMBE [29], MEGA [30, 31]; see Table 1). The second
step is to perform the alignment at the protein level. Again, a number of
programs exist, but ProbCons [32] appears to be the most
accurate single method [33]. An alternative to relying on a
single alignment method is to use consensus or meta-methods, that is, to
combine several methods [27]. Meta-methods such as M-Coffee
can return better MSAs almost twice as often as ProbCons [34]. Finally, when the alignment
is obtained at the protein level, back-translation to the DNA sequences can be
performed either by using a program such as DAMBE, CodonAlign [35], or by using a dedicated
server such as protal2dna (http://bioweb.pasteur.fr/seqanal/interfaces/protal2dna.html) or Pal2Nal (coot.embl.de/pal2nal).
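The back-translation step described above can be sketched in a few lines. This is a minimal illustration, not the algorithm used by DAMBE, CodonAlign, protal2dna, or Pal2Nal: the function name `back_translate` and the toy sequences are hypothetical, and the sketch assumes that the unaligned coding sequence is in frame and contains exactly one codon per non-gap residue of the aligned protein (any trailing stop codon is ignored).

```python
def back_translate(protein_aln: str, cds: str, gap: str = "-") -> str:
    """Thread an unaligned coding sequence onto its aligned protein sequence.

    Each residue of the aligned protein consumes one codon of the coding
    sequence; each gap in the protein becomes a three-nucleotide gap, so
    the reading frame of the DNA alignment is never disrupted.
    """
    # Split the coding sequence into complete codons (drop a partial tail).
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]

    n_residues = sum(1 for aa in protein_aln if aa != gap)
    if n_residues > len(codons):
        raise ValueError("coding sequence is too short for the aligned protein")

    out = []
    next_codon = 0
    for aa in protein_aln:
        if aa == gap:
            out.append(gap * 3)          # one protein gap = three DNA gaps
        else:
            out.append(codons[next_codon])
            next_codon += 1
    return "".join(out)


# Toy example (hypothetical sequences): protein "MKL" aligned as "M-KL".
print(back_translate("M-KL", "ATGAAACTG"))
```

Because gaps are only ever inserted as whole codons, the length of every back-translated row remains a multiple of three, which is the property that motivates aligning at the protein level first.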
Table 1: List of programs cited in this review. GUI: graphic user interface; ML: maximum likelihood; PL: penalized likelihood.

| Name | Method | Platform | GUI | Inference | Reference |
|---|---|---|---|---|---|
| BAMBE | Bayes | DOS, MacOS, Unix | No | Tree | [36] |
| BayesPhylogenies | Bayes | DOS, MacOS, Unix | No | Tree | [37] |
| BAli-Phy | Bayes | DOS, MacOS, Unix | No | Simultaneous alignment and tree | [38] |
| BEAST | Bayes | Windows, MacOS, Unix | Yes | Tree, times | [39] |
| CONSEL | ML | DOS, MacOS, Unix | No | Tree comparison | [40] |
| DAMBE | Distances, parsimony, ML | Windows | Yes | Tree | [29] |
| GARLI | ML (Genetic Algorithm) | Windows, MacOS, Unix | Yes | Tree | [41] |
| HyPhy | ML | Windows, MacOS, Unix | Yes | Tree, selection, recombination, tree comparison | [42] |
| MEGA | Distances, parsimony | Windows | Yes | Tree, times | [30, 31] |
| MrBayes | Bayes | DOS, MacOS, Unix | No | Tree, selection | [43, 44] |
| Multidivtime | Bayes | DOS, MacOS, Unix | No | Times | [45–47] |
| OmegaMap | Bayes | DOS, MacOS, Unix | No | Simultaneous selection and recombination | [48] |
| PAML | ML | DOS, MacOS, Unix | No | Tree, tree comparison, times, selection | [49, 50] |
| PAUP* | Distances, parsimony, ML | DOS, MacOS, Unix | No | Tree | [51] |
| PhyloBayes | Bayes | DOS, MacOS, Unix | No | Tree, tree comparison | [52] |
| PHYML | ML | DOS, MacOS, Unix | No | Tree | [53] |
| RAxML | ML | DOS, MacOS, Unix | No | Tree | [54] |
| r8s | PL | DOS, MacOS, Unix | No | Times | [55] |
The alignment of
rRNA genes under the constraint of secondary structure is now frequently
used in practical research in molecular evolution and phylogenetics [56–60]. The procedure is first to obtain a reliable
secondary structure and then use the secondary structure to guide the sequence
alignment. This has not been automated so far, although both Clustal [61, 62] and DAMBE have some functions to alleviate the
difficulties.
What to do with
other noncoding genes is still an open question, especially when it comes to
aligning a large number (>100) of long (>20,000 residues) and
divergent sequences (<25% identity). Some authors have attempted to
provide rough guidelines to choose the most accurate program depending on these
parameters [28]. However, accuracy figures
are typically estimated over a large number of test alignments and may not
reflect the accuracy that is expected for any particular alignment [28]. More crucially, most of the
alignment programs were developed and benchmarked on protein data, so that
their accuracy is generally unknown for noncoding sequences [28]. A very general
recommendation is then to use different methods [63] and meta-methods. Those parts
of the alignment that are similar across the different methods are probably
reliable. The parts that differ extensively are often simply eliminated from
the alignment when no external information can be used to decide which
positions are homologous. Poorly aligned regions can cause serious problems as,
for instance, when analyzing rRNA sequences in which conserved and
variable domains have different nucleotide frequencies [60]. A simple test of the
reliability of an alignment consists in reversing the orientation of the
original sequences, and performing the alignment again; because of the symmetry
of the problem, reliable MSAs are expected to be identical whichever direction
is used to align the sequences [64]. These authors further show
that reliability of MSAs decreases with sequence divergence, and that the
chance of reconstructing different phylogenies increases with sequence
divergence. More sophisticated methods also permit the direct measure of the
accuracy of an alignment or the estimation of a distance between two
alignments [65]. Applications of Bayesian
inference strictly to pairwise [66] and multiple [67, 68] sequence alignment are still
in their infancy.
Whichever method
is used to obtain an MSA, a final visual inspection is required, and manual
editing is often needed. To this end, a number of editors can be used such as JalView
[69].
Because an MSA
represents a hypothesis about sitewise homology at all the positions, obtaining
an accurate MSA presents some circularity; an accurate MSA often necessitates
an accurate guide tree, which in turn demands an accurate alignment. The early
realization of this “chicken-egg” conundrum led to the idea that both the MSA
and the phylogeny should be estimated simultaneously [70]. Although this initial
algorithm was parsimony-based, it was already too complex to analyze more than
a half-dozen sequences of 100 sites or more. Subsequent parsimony-based
algorithms allowed the analysis of larger data sets [71] but still showed some
limitations when sequence divergence increases. More recently, a Bayesian
procedure was described and implemented in a program, BAli-Phy, where
uncertainties with respect to the alignment, the tree, and the parameters of the
substitution model are all taken into account [38] (see also [72]). Uncertain alignments are a
potential problem in large-scale genomic studies [73] or in whole-genome alignments
[74]. In these contexts,
disregarding alignment uncertainty can lead to systematic biases when
estimating gene trees or inferring adaptive evolution [73, 74]. However, these complex
Bayesian models [38, 72, 73] still require some nonnegligible
computing time and resources, and to date, their performance in terms of
accuracy is still unclear.
2.2. Selection of the Substitution Model
Once a reliable MSA is obtained,
the next step in comparing molecular sequences is to choose a metric to
quantify divergence. The most intuitive measure of divergence is simply to
count the proportion of differences between two aligned sequences (e.g., [75]). This simple measure is
known as the p distance. However, because the size of the state space is
finite (four letters for DNA, 20 for amino acids, and 61 for sense codons),
multiple changes at a position in the alignment will not be observable, and the
p distance will underestimate evolutionary distances even for moderately
divergent sequences. This phenomenon is generally referred to as saturation.
Corrections were devised early to help compensate for saturation. Some of the
most famous named nucleotide substitution models are the Jukes-Cantor model or
JC [76], the Kimura two-parameter
model or K80 [77], the Hasegawa-Kishino-Yano
model or HKY85 [78], the Tamura-Nei model or TN93 [79], and the general
time-reversible model or GTR [80] (also called REV). Because
substitution rates vary along sequences, two components can be added to these
substitution models: a “+I” component that models invariable sites [78] and a “+Γ” component that models among-site rate
variation either as a continuous [81] or as a discrete [82] mean-one Γ distribution, the latter being more
computationally efficient. Amino acid models can also incorporate a “+F”
component so that replacement rates are proportional to the frequencies of both
the replaced and resulting residues [83].
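The effect of saturation, and of the simplest correction for it, can be illustrated numerically. The sketch below computes the p distance between two aligned sequences and applies the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3) for nucleotides. The function names and the toy sequences are hypothetical, and the sketch assumes that sites with a gap in either sequence are simply skipped.

```python
import math


def p_distance(seq1: str, seq2: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (equal length)")
    # Ignore columns where either sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p: float, n_states: int = 4) -> float:
    """Jukes-Cantor corrected distance (expected substitutions per site).

    With n_states = 4 this is -(3/4) * ln(1 - 4p/3). The distance is
    undefined (returned as infinity) once p reaches the saturation
    level (n_states - 1) / n_states, i.e. 0.75 for DNA.
    """
    b = (n_states - 1) / n_states
    if p >= b:
        return math.inf
    return -b * math.log(1.0 - p / b)


p = p_distance("ACGTACGT", "ACGTACGA")   # one difference out of eight sites
d = jc69_distance(p)
print(f"p = {p:.3f}, JC69 distance = {d:.4f}")
```

The corrected distance is always at least as large as p, and the gap between the two widens as p approaches 0.75, which is the numerical signature of saturation under this model.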
Given the
variety of substitution models, the first step of any model-based phylogenetic
analysis is to select the most appropriate model [84, 85]. The rationale for doing so is
to balance bias and variance: a highly-parameterized model will describe or fit
the data much better than a model that contains a smaller number of parameters;
in turn however, each parameter of the highly-parameterized model will be
estimated with lower accuracy for a given amount of data (e.g., [86]). Besides, both empirical and
simulation studies show that the choice of a wrong substitution model can lead
not only to less accurate phylogenetic estimation, but also to inconsistent
results [87]. The objective of model
selection is therefore not to select the “best-fitting” model, as this one will
always be the model with the largest number of parameters, but rather to select
the most appropriate model that will achieve the optimal tradeoff between bias and
variance. The approach followed by all model selection procedures is therefore
to penalize the likelihood of the parameter-rich model for the additional
parameters. Because most of the nucleotide substitution models are nested (all
can be seen as a special case of GTR +Γ+I),
the standard approach to model selection is to perform hierarchical likelihood
ratio tests or hLRTs [88]. Note that, strictly speaking, likelihood
ratio tests can also be performed on nonnested models; however, the asymptotic
distribution of the test statistic (twice the difference in log-likelihoods)
under the null hypothesis (the two models perform equally well) is complicated [89] and quite often impractical.
When models are nested, the asymptotic distribution of the test statistic under
the null hypothesis is simply a χ² distribution whose number of degrees of
freedom is the number of additional parameters entering the more complex model
(see [90] or [91] for applicability conditions).
With the hLRT, all models are then compared in a pairwise manner, by traversing
a choice-tree of possible nested models. A number of popular programs allow
users to compare pairs of models manually (e.g., PAUP [51], PAML [49, 50]). Readily written scripts
that select the most appropriate model among a list of named models also exist,
such as ModelTest [92] (which requires PAUP), the R
package APE [93], or DAMBE. Free web servers
are also available; they are either directly based on ModelTest [94] or implement similar ideas
(e.g., FindModel, available at http://hcv.lanl.gov/content/hcv-db/findmodel/findmodel.html). A similar implementation, ProtTest, exists for protein data [95].
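A single nested comparison of the kind performed at each node of the hLRT choice-tree can be sketched as follows. This is an illustration only, not the code of ModelTest or of any program cited here: the function names are hypothetical, the log-likelihood values are invented, and the sketch assumes that the χ² approximation applies (see the applicability conditions cited above). The upper-tail probability is computed with the standard recurrence Q(x | ν+2) = Q(x | ν) + (x/2)^(ν/2) e^(-x/2) / Γ(ν/2 + 1), so that only the Python standard library is needed.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 1 (odd case) or df = 2 (even case).
    if df % 2 == 0:
        q, nu = math.exp(-half), 2
    else:
        q, nu = math.erfc(math.sqrt(half)), 1
    # Step up two degrees of freedom at a time until df is reached.
    while nu < df:
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q


def likelihood_ratio_test(lnl_simple: float, lnl_complex: float,
                          extra_params: int) -> tuple[float, float]:
    """Return (test statistic, P-value) for two nested models."""
    stat = 2.0 * (lnl_complex - lnl_simple)
    return stat, chi2_sf(stat, extra_params)


# Hypothetical log-likelihoods: JC versus HKY85, which adds four free
# parameters (one transition/transversion ratio, three base frequencies).
stat, p_value = likelihood_ratio_test(-1000.0, -990.0, extra_params=4)
print(f"2*delta(lnL) = {stat:.1f}, P = {p_value:.2e}")
```

In an hLRT, the outcome of each such test decides which branch of the choice-tree is followed next, which is why the final model can depend on the order of the comparisons.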
However,
performing systematic hLRTs is not the optimal strategy for model selection in
phylogenetics [96]. This is because the model
that is finally selected can depend on the order in which the pairwise
comparisons are performed [97]. The Akaike information
criterion (AIC) or its variant developed in the context of regression and
time-series analysis in small data sets (AICc, [98]) is commonly used in phylogenetics (e.g., [96]). One advantage of AIC is
that it allows nonnested models to be compared, and it is easily implemented.
However, in large data sets, both the hLRT and the AIC tend to favor
parameter-rich models [99]. A slightly different
approach was proposed to overcome this selection bias, the Bayesian information
criterion (BIC: [99]), which penalizes more
strongly parameter-rich models. All these model selection approaches (AIC, AICc,
and BIC) are available in ModelTest and ProtTest. Other procedures exist such
as the Decision-Theoretic or DT approach [100]. Although AIC, BIC, and DT
are generally based on sound principles, they can in practice select different
substitution models [101]. The reason for this disagreement is
not entirely clear, but it is likely due to the data having low-information
content. One prediction is that, when these model selection procedures end up
with different conclusions, all the selected models will return phylogenies
that are not significantly different. It is also possible that applying these
different criteria outside of the theoretical context in which they were
developed might lead to unexpected behaviors [102]. For instance, AICc was derived under Gaussian assumptions for linear fixed-effect models [98], and other bias correction
terms exist under different assumptions [86].
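The way in which these criteria can disagree is easy to reproduce with their textbook definitions: AIC = 2k - 2 lnL, AICc = AIC + 2k(k+1)/(n - k - 1), and BIC = k ln(n) - 2 lnL, where k is the number of free parameters and n the sample size. The sketch below is an illustration under stated assumptions: the log-likelihoods are invented, the parameter counts exclude branch lengths (which are common to both candidate models here), and n is taken to be the alignment length, which is itself a debatable choice in phylogenetics.

```python
import math


def aic(lnl: float, k: int) -> float:
    """Akaike information criterion."""
    return 2.0 * k - 2.0 * lnl


def aicc(lnl: float, k: int, n: int) -> float:
    """Small-sample corrected AIC."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(lnl, k) + 2.0 * k * (k + 1) / (n - k - 1)


def bic(lnl: float, k: int, n: int) -> float:
    """Bayesian information criterion."""
    return k * math.log(n) - 2.0 * lnl


# Hypothetical fits: model name -> (log-likelihood, number of free parameters).
candidates = {"HKY+G": (-5000.0, 5), "GTR+G": (-4994.0, 9)}
n_sites = 1000  # assumption: sample size = number of alignment columns

scores = {
    "AIC": {m: aic(lnl, k) for m, (lnl, k) in candidates.items()},
    "AICc": {m: aicc(lnl, k, n_sites) for m, (lnl, k) in candidates.items()},
    "BIC": {m: bic(lnl, k, n_sites) for m, (lnl, k) in candidates.items()},
}
for criterion, table in scores.items():
    best = min(table, key=table.get)   # the lowest score is preferred
    print(f"{criterion}: preferred model = {best}")
```

With these invented numbers, AIC and AICc prefer the parameter-rich model whereas BIC prefers the simpler one, because the BIC penalty per parameter, ln(n), exceeds the AIC penalty of 2 as soon as n is larger than about 7.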
All the above
test procedures compare ratios of likelihood values penalized for an increase
in the dimension of one of the models, without directly accounting for
uncertainty in the estimates of model parameters. This may be problematic, in
particular for small data sets. The Bayesian approach to model selection, called
the Bayes factor, directly incorporates this uncertainty. It is also more
intuitive as it directly assesses if the data are more probable under a given
model than under a different one (e.g., [103]). An extension of this
approach makes it possible to select the model not only among the set of named
models (JC to GTR) but among all 203 nucleotide substitution models that are
possible [104]. An alternative use or
interpretation of this approach is to integrate directly over the uncertainty
about the substitution model, so that the estimated phylogeny fully accounts
for several kinds of uncertainty: about the substitution models, and the
parameters entering each of these models. MrBayes (version 3.1.2) [43] implements this feature for
amino acid models.
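The figure of 203 can be recovered by a simple enumeration, assuming that a time-reversible nucleotide model is defined by which of the six exchangeability rates (AC, AG, AT, CG, CT, GT) are constrained to be equal. Under that assumption, each model corresponds to one partition of six items into groups, and the number of such partitions is the Bell number B(6). The function name below is hypothetical, and the encoding of each model as a tuple of group labels is a common convention used here for illustration only.

```python
def restricted_growth_strings(n: int):
    """Yield every partition of n items as a restricted growth string.

    Item i receives a group label; a new label may exceed the largest
    label used so far by at most one, so each partition appears once.
    """
    def extend(prefix, n_groups):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        # Reuse an existing group (0..n_groups-1) or open a new one.
        for g in range(n_groups + 1):
            yield from extend(prefix + [g], max(n_groups, g + 1))

    yield from extend([], 0)


# Rates in the order AC, AG, AT, CG, CT, GT.
models = list(restricted_growth_strings(6))
print(len(models))    # number of distinct rate groupings
print(models[0])      # all six rates equal (a single rate class)
print(models[-1])     # all six rates distinct (GTR)
```

The first grouping, with a single rate class, corresponds to the JC or F81 rate structure depending on how base frequencies are treated, and the last one to GTR; the named models are therefore only a small subset of the space explored in [104].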
There is an
element of circularity in model selection, just as in sequence alignment. In
theory, when the hLRT is used for model selection, the topology used for all
the computations should be that of the maximum likelihood tree. In practice,
model selection is based on an initial topology obtained by a fast algorithm
such as neighbor-joining [105, 106] (default setting in
ModelTest) or by Weighbor [107] (default setting in
FindModel) on JC distances without any correction for among-site rate variation.
As mentioned above, it is known that the choice of a wrong model can affect the
tree that is estimated, but it is not always clear how the choice of a nonoptimal
topology to select the substitution model affects the tree that is finally
estimated. Again, this issue with model choice disappears with Bayesian
approaches that integrate over all possible time-reversible models as in [104].
2.3. Finding the “best” Tree and Assessing Its Support
Once the substitution model is
selected, the classical approach proceeds to reconstruct the phylogeny [108]. This is probably one area
where phylogenetics has seen mixed progress over the last five years, due to
both the combinatorial and the computational complexities of phylogenetic
reconstruction.
The combinatorial complexity relates to the extremely large number of
tree topologies that are possible with a large number of sequences [109]. For instance, with five
sequences, there are 105 rooted topologies, but with ten sequences, this number
soars to over 34 million. An exhaustive search for the phylogeny that has the
highest probability is therefore not practical even with a moderate number of
sequences. Besides, while heuristics exist (e.g., stepwise addition [109]; see [4] for a review), almost none of
these is guaranteed to converge on the optimum phylogenetic tree. The common
practice is then to use one of these heuristics to find a good starting tree,
and then modify repeatedly its topology more or less dramatically to explore
its neighborhood for better trees until a stopping rule is satisfied [110]. The art here is in designing
efficient tree perturbation methods that adaptively strike a balance between
large topological modifications (that almost always lead to a very different
tree with a poor score) and small modifications (that almost always lead to an
extremely similar tree with lower score). Some of today’s challenges are about
choosing between methods that successfully explore large numbers of trees but
that can be costly in terms of computing time [110], and methods that are faster
but may miss some interesting trees [53]. Several programs such as Leaphy, PHYML, and GARLI [41] are among the best-performing
software in a maximum likelihood setting. In a Bayesian framework, the basic
perturbation schemes were described early [36] and recently updated [111]. Three popular programs are
MrBayes, BAMBE [36], and BEAST [39]. Among all these programs and
approaches, PHYML, GARLI, and BEAST are probably among the most efficient
programs in terms of computational speed, handling of large data sets and
thoroughness of the tree search.
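The numbers quoted above for the size of tree space follow from the classical counting result that there are (2n - 3)!! = 3 × 5 × ... × (2n - 3) rooted bifurcating topologies for n labelled sequences, and (2n - 5)!! unrooted ones. The short sketch below, with hypothetical function names, reproduces these counts and shows how quickly an exhaustive search becomes impractical.

```python
def n_rooted_topologies(n_taxa: int) -> int:
    """Number of rooted bifurcating labelled topologies: (2n - 3)!!."""
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


def n_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted bifurcating labelled topologies: (2n - 5)!!.

    Rooting an unrooted tree of n taxa is equivalent to adding one taxon,
    so the count equals the rooted count for n - 1 taxa.
    """
    if n_taxa < 3:
        return 1
    return n_rooted_topologies(n_taxa - 1)


for n in (4, 5, 10, 20):
    print(n, n_rooted_topologies(n), n_unrooted_topologies(n))
```

With four sequences there are only three unrooted topologies, whereas twenty sequences already give more than 10^21 rooted topologies, which explains why all practical tree searches rely on heuristics.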
A first aspect
of the computational complexity relates to estimating the support of a
reconstructed phylogeny. This is more complicated than estimating a confidence
interval for a real-valued parameter such as a branch length, because a tree
topology is a graph and not a number. The classical approach therefore relies
on a nonstandard use of the bootstrap [112]. However, the interpretation
of the bootstrap is contentious. Bootstrap proportions P can be perceived as testing the correctness of internal nodes,
and failing to do so [113], or 1–P can be interpreted as a conservative
probability of falsely supporting monophyly [114]. Since bootstrap proportions
are either too liberal or too conservative depending on the exact
interpretation given to these values [115], it is difficult to adjust
the threshold below which monophyly can be confidently ruled out [116]. Alternatively, an intuitive
geometric argument was proposed to explain the conservativeness of bootstrap
probabilities [117], but the workaround was never
actually used in the community or implemented in any popular software. The
introduction of Bayesian approaches in the late 1990s [36, 118] suggested a novel approach to
estimate phylogenetic support with posterior probabilities. Clade or
bipartition posterior probabilities can be relatively fast to compute, even for
large data sets analyzed under complicated substitution models [119]. As in model selection, they
have a clear interpretation as they measure the probability that a clade is
correct, given the data and the model. But as with bootstrap probabilities,
some controversies exist. Early empirical studies found that posterior
probabilities of highly supported nodes were much larger than bootstrap
probabilities [120], and subsequent simulation
studies supported this observation (e.g., [121–124]). Some of these differences can be attributed to an artifact of the
simulation scheme that was employed [125], but more specific empirical
and simulation studies show that prior specifications can dramatically impact
posterior probabilities for trees and clades [115, 126, 127]. In the simplest case, the
analysis of simulated star trees with four sequences fails to give the expected
three unrooted topologies with equal probability (1/3, 1/3, 1/3) but returns
large posterior probabilities for an arbitrary topology [115, 126], even when infinitely long
sequences are used [128, 129] ([130]). This phenomenon, called the
star-tree paradox [126], seems to disappear when
polytomies are assigned nonzero prior probabilities and when nonuniform priors
force internal branch lengths towards zero [129]. The second issue surrounding
Bayesian phylogenetic methods is about their convergence rate. A theoretical
study shows that extremely simple Markov chain Monte Carlo (MCMC) samplers, the
technique used to estimate posterior probabilities, could take an extremely
long time to converge [131]. In practice, however, MCMC
samplers such as those implemented in MrBayes are much more sophisticated. In
particular, they include different types of moves [111] and use tempering, where some
of the chains of a single run are heated, to improve mixing [43]. As a result, it is unclear
whether they suffer from extremely long convergence times. It is also expected
that current convergence diagnostic tools such as those implemented in MrBayes
would reveal convergence problems [132]. Finally, it is also argued
that these controversies such as exaggerated clade support, inconsistently
biased priors, and the impossibility of hypothesis testing disappear altogether
when posterior probabilities at internal nodes are abandoned in favor of
posterior probabilities for topologies [133] (see Section 2.4 below).
The most
fundamental aspect of the computational complexity in phylogenetics is due to
the structure of the phylogenies: these are trees or binary graphs on which
computations are nested and interdependent, which makes these computations
intractable or NP-hard [134]. As a result, it is difficult
to adopt an efficient “divide and conquer” approach, where a large complicated
problem would be split into small simpler tasks, and to take advantage of
today’s commodity computing by distributing the computation over multicore
architectures or heterogeneous computer clusters. Current strategies are
limited to distributing the computation of the discrete rate categories (when
using a “+Γ” substitution model) and part of the search
algorithm [54], or simply to distribute different
maximum likelihood bootstrap replicates [53, 54] or different MCMC samplers to
available processors [44].
2.4. Comparisons of Tree Topologies
Science proceeds by testing
hypotheses, and it is often necessary to compare phylogenies, for instance to
test whether a given data set supports the early divergence of gymnosperms with
respect to Gnetales and angiosperms (the anthophyte hypothesis), or whether the
Gnetales diverged first (the Gnetales hypothesis) [135, 136]. Because of the importance of
comparing phylogenies, a number of tests of molecular phylogenies were
developed early. The KH test was first developed to compare two random trees [137]. However, this test is
invalid if one of the trees is the maximum likelihood tree [138]. In this case, the SH test
should be used [139]. Because the SH test can be
very conservative, an approximately unbiased version was developed: the AU test [140]. PAUP and PAML only implement
the KH and SH tests; CONSEL [40] also implements the AU test.
A Bayesian version of these tests also exists [141], but the computations are
more demanding.
Indeed, the Bayesian approach to hypothesis testing relies on computing the probability of
the data under a particular model. This quantity is usually not available as a
closed-form equation, and it must be approximated numerically. The most
straightforward approximation is based on the harmonic mean of the likelihood
sampled from the posterior distribution [142]. This approximation was
described several times in the context of phylogenies [141, 143] and is available from most
Bayesian programs such as MrBayes or BEAST. However, the approximation is
extremely sensitive to the behavior of the MCMC sampler [52, 142]: if extremely low-likelihood
values happen to be sampled from the posterior distribution, the harmonic mean
will be dramatically affected. To date, a couple of more robust approximations
have been described and were shown to be preferable to the harmonic mean
estimator [52]. The first is based on
thermodynamic integration [52] and is available in
PhyloBayes (see Table 1). The second approximation [144] is based on a more direct
computation [145], but its availability is
currently limited to one specific model of evolution.
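The sensitivity of the harmonic mean estimator is easy to demonstrate. The sketch below computes the logarithm of the harmonic mean of the likelihoods from a list of sampled log-likelihoods, working in log space to avoid numerical underflow. The function name and the sampled values are hypothetical; in a real analysis, the log-likelihoods would be read from the MCMC output of the chosen program, and this is not a reproduction of how MrBayes or BEAST implement the estimator.

```python
import math


def log_marginal_harmonic_mean(log_likelihoods: list[float]) -> float:
    """Log of the harmonic mean of likelihoods sampled from the posterior.

    The harmonic mean is n / sum(1 / L_i); in log space this becomes
    log(n) - logsumexp(-lnL_i), evaluated with the usual max-shift trick.
    """
    n = len(log_likelihoods)
    if n == 0:
        raise ValueError("need at least one sample")
    neg = [-x for x in log_likelihoods]
    shift = max(neg)                      # dominated by the LOWEST likelihood
    log_sum = shift + math.log(sum(math.exp(v - shift) for v in neg))
    return math.log(n) - log_sum


# Hypothetical MCMC output: 1000 samples with lnL = -100 ...
stable = [-100.0] * 1000
# ... and the same run with a single low-likelihood sample.
with_outlier = [-100.0] * 999 + [-130.0]

print(log_marginal_harmonic_mean(stable))
print(log_marginal_harmonic_mean(with_outlier))
```

In this toy example, a single sample out of 1000 with a log-likelihood of -130 shifts the estimate from -100 to about -123, because the sum of inverse likelihoods is dominated by the smallest likelihood values that happen to be sampled.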
2.5. More Realistic Models
While model selection is fully
justified on the ground of the bias-variance tradeoff, it should not be forgotten
that all these models are simplified representations of the actual substitution
process and are all therefore wrong. Stated differently, if AIC selects the GTR
+Γ+I to
analyze a data set, it should be clear that this conclusion does not imply that
the data evolved under this model. All model selection procedures measure a
relative model fit. One way to estimate adequacy or absolute model fit is to
perform a parametric bootstrap test [146]: first, the selected model is compared with a multinomial model by
means of an LRT whose test statistic is s (twice the log-likelihood difference); the following steps determine the
distribution of s under the null
hypothesis that the selected model was the generating model; second, the selected model is used to
simulate a large number of data sets; third,
the model selection procedure (LRT) is repeated on each simulated data set, and
the corresponding test statistics s*
are recorded; fourth, the P-value
is estimated as the proportion of simulated s* test statistics that are more
extreme (>, for a one-sided test) than the original value of s. The results of such tests suggest
that the selected substitution model is generally not an adequate
representation of the actual substitution process [85]. Of course, we do not need a
model that incorporates all the minute biological features of evolutionary
processes. As argued repeatedly (e.g., [147]), we need useful models that capture enough of
reality of substitution processes to make accurate predictions and avoid
systematic biases such as long-branch attraction [148].
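The four steps of the parametric bootstrap test of adequacy described above can be sketched on a deliberately simplified stand-in problem. In the sketch below, the "selected model" is a fixed set of state probabilities rather than a substitution model on a tree, and the data are a list of observed states rather than alignment site patterns; these simplifications, the function names, and the toy data are all assumptions made for illustration. Because the toy model has no free parameters, nothing is re-estimated on the simulated data sets, whereas a real analysis would refit the models to every simulated alignment.

```python
import math
import random
from collections import Counter


def lrt_vs_multinomial(sample, model_probs):
    """s = 2 * (lnL_multinomial - lnL_model) for a list of observed states."""
    n = len(sample)
    counts = Counter(sample)
    # Unconstrained multinomial: each state's probability is its frequency.
    lnl_multi = sum(c * math.log(c / n) for c in counts.values())
    lnl_model = sum(c * math.log(model_probs[x]) for x, c in counts.items())
    return 2.0 * (lnl_multi - lnl_model)


def parametric_bootstrap_pvalue(data, model_probs, n_rep=1000, seed=1):
    """Return (s, P) where P is the proportion of simulated s* > s."""
    rng = random.Random(seed)
    states, weights = zip(*model_probs.items())
    s_obs = lrt_vs_multinomial(data, model_probs)          # step 1
    exceed = 0
    for _ in range(n_rep):
        sim = rng.choices(states, weights=weights, k=len(data))  # step 2
        s_star = lrt_vs_multinomial(sim, model_probs)             # step 3
        if s_star > s_obs:
            exceed += 1
    return s_obs, exceed / n_rep                           # step 4


model = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # toy "selected model"
skewed = list("A" * 70 + "C" * 10 + "G" * 10 + "T" * 10)
balanced = list("ACGT" * 25)

print(parametric_bootstrap_pvalue(skewed, model))
print(parametric_bootstrap_pvalue(balanced, model))
```

For the skewed data, almost no simulated statistic exceeds the observed one, so the toy model is rejected as inadequate; for the balanced data, the observed statistic is zero and the model is not rejected.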
More realistic models are obtained by accommodating heterogeneities in the evolutionary
process at the level of both sites (space) and lineages (time). The simplest
site-heterogeneous model is one where the aligned data are partitioned,
usually based on some prior information. For instance, first and second codon
positions are known to evolve more slowly than third codon positions in
protein-coding genes, or exposed residues might evolve faster than buried amino
acids in globular proteins. A number of models were suggested to analyze such
partitioned data sets (e.g., [149]); these models are
implemented in most general-purpose software (e.g., PAML, PAUP, MrBayes) and
can be combined with a “+Γ+I” component. A different approach consists
in considering that sites can be binned in a number of rate categories; the use
of a Dirichlet process prior then makes it possible both to determine the
appropriate number of categories and to assign sites to these categories; the
application of this method to protein-coding genes was able to recover the
underlying codon structure of these genes [150]. However, several studies
suggest that evolutionary patterns can be as heterogeneous within a priori partitions as among
partitions [37, 151].
Lineage-heterogeneous
models or heterotachous models [152] have attracted more
attention. In one such approach, different models of evolution are assigned to
the different branches of the tree [153], which can make these models
extremely parameter-rich. Such a large number of parameters can potentially
affect the accuracy of the phylogenetic inference (see the “bias-variance
tradeoff” above) and present computational issues (long running times, large
memory requirements, and convergence issues). Several simplifications can be
made. One is to assume that whole sets of branches, rather than individual branches, evolve under a particular process [153]. These branches must then
be assigned a priori, however, and both
the determination of the number of sets and their placement on the tree can be
difficult (but see Section 4 below for a solution to a similar question). At
the other end of the spectrum of heterotachous models lies the simplest model,
known as the covarion model [154], in which sites are either
variable along a branch or not, and can switch between these two categories
over time (e.g., [155], also described in a Bayesian
framework [156]).
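The on/off switching described above can be sketched as an enlarged rate matrix in which each nucleotide exists in a variable ("on") and an invariable ("off") state. This is a minimal sketch of one common formulation, not necessarily the one used in [154–156]; the switching rates `s_off` and `s_on` and the Jukes-Cantor-like substitution rates are hypothetical values chosen for illustration.

```python
def covarion_generator(q_on, s_off, s_on):
    """Build the 2k x 2k rate matrix of a simple covarion process.

    States 0..k-1 are 'on' (free to substitute); states k..2k-1 are the
    matching 'off' states, which keep their nucleotide until switched on.
    """
    k = len(q_on)
    size = 2 * k
    gen = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            if i != j:
                gen[i][j] = q_on[i][j]   # substitutions occur only while 'on'
        gen[i][k + i] = s_off            # on -> off, same nucleotide
        gen[k + i][i] = s_on             # off -> on, same nucleotide
    for r in range(size):
        # Diagonal entries make every row sum to zero.
        gen[r][r] = -sum(gen[r][c] for c in range(size) if c != r)
    return gen

# Hypothetical Jukes-Cantor-like rates among the four nucleotides.
jc = [[0.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
G = covarion_generator(jc, s_off=0.5, s_on=0.2)
for row in G:
    print(["%.2f" % x for x in row])
```

Because the "off" rows contain no substitution rates, a site in that state cannot change until it switches back on, which is what lets the proportion of variable sites differ among lineages.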
Between these two extremes are mixture models, which extend the covarion model by allowing
more categories of sites. A number of formulations exist, where each site is
assumed to have been generated by either several sets of branch lengths [157, 158] or by several rate matrices [37, 96, 151]. One particularity of these
models is that they give a semiparametric perspective to the phylogenetic
estimation: if a single simple model cannot approximate a complex substitution
process, the hope is that mixing several simple substitution models makes our
models more realistic. In some applications, mixture models can also be used to
avoid the underestimation of uncertainty that occurs when a single model of
evolution is first chosen and this choice is then treated as certain when estimating the phylogeny. The
mixing therefore involves fitting at each site several sets of branch lengths,
or several substitution models to the data, and combining these models using a
certain weighting scheme. The difference between the numerous mixture models
that have been described lies in the choice of the weight factors, and how
these are obtained. In one approach, known as model averaging, the weights are
determined a priori. A first
possibility is to assume that all the models are equally probable, which does
not work with an infinite number of models (individual weights are zero in this
case). More critically in phylogenetics, this assumption is not coherent for
nested models, since larger models should be more likely than each of their submodels. A
second possibility is to weight the models with respect to their probability of
being the generating model given the data. For practical purposes, this
posterior probability can be approximated by Akaike weights [96]. The difficulty here is that model
averaging requires analyzing the data even for models that, a posteriori, turn out to have
extremely small probabilities or weights. This may be seen as a waste of
resources (computing time and storage space).
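The weighting by Akaike weights can be sketched as follows. Each weight is proportional to exp(−Δ/2), where Δ is the difference between the AIC of a model and the smallest AIC among the candidates. The log-likelihoods and the per-model estimates below are hypothetical values invented for illustration; the parameter counts only include substitution-model parameters, under the assumption that branch lengths are shared by all models and therefore cancel in the AIC differences.

```python
import math

def akaike_weights(log_likelihoods, n_params):
    """Return Akaike weights for a set of candidate models."""
    aic = [2 * k - 2 * lnl for lnl, k in zip(log_likelihoods, n_params)]
    best = min(aic)
    # Relative likelihood of each model given the data.
    rel = [math.exp(-0.5 * (a - best)) for a in aic]
    total = sum(rel)
    return [r / total for r in rel]

def model_average(estimates, weights):
    """Weighted average of a parameter estimated under each model."""
    return sum(w * e for w, e in zip(weights, estimates))

# Hypothetical maximized log-likelihoods for JC69, HKY85, and GTR.
lnl = [-5230.4, -5195.1, -5193.8]
n_free = [0, 4, 8]  # substitution parameters only (assumption)
weights = akaike_weights(lnl, n_free)

# Hypothetical estimates of one branch length under each model.
branch_length = [0.081, 0.094, 0.096]
print("weights:", [round(w, 3) for w in weights])
print("averaged branch length:", model_average(branch_length, weights))
```

Note that every candidate model has to be fitted before its weight is known, which is the cost discussed above.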
2.6. Integrated Bayesian Approaches
Mixture models can work within the
framework of maximum likelihood, but the treatment of the weight factors is
complicated. A sound alternative is to resort to a fully Bayesian approach. A
prior distribution is set on the weight factors, and a reversible-jump MCMC
(RJ-MCMC) sampler is constructed, that is, a special form of MCMC sampler whose
Markov chain moves across models with different numbers of parameters. The
advantage of RJ-MCMC samplers is that they allow estimating the phylogeny while
integrating over the uncertainty pertaining to the parameters of the
substitution model and even integrating over the model itself [104]. Mixture models are available
in BayesPhylogenies [37] for nucleotide models.
Another Bayesian mixture model, named CAT for CATegories, was developed to
analyze amino acid alignments. The CAT model recently proved successful in a
number of empirical [159, 160] and simulation [161] studies in avoiding the
artifact known as long-branch attraction [148]. This model is freely
available in the PhyloBayes software (see Table 1).
All these models
assume that each site evolves independently. The independence assumption greatly
simplifies the computations, but is also highly unrealistic. Models that
describe the evolution of doublets in RNA genes [162], triplets in codon models [163, 164], or other models with local or
context dependencies [165–167] exist, but complete dependence
models are still in their infancy and, so far, have only been implemented in a
Bayesian framework [168, 169]. One particularly interesting
feature of this approach is that complete dependence models incorporate
information about the three-dimensional (3D) structure of proteins and
therefore permit the explicit modeling of structural constraints or of any
other site-interdependence pattern [170]. The incorporation of 3D
structures also allows the establishment of a direct relationship between
evolution at the DNA level and at the phenotypic level. This link between
genotype and phenotype is established via a proxy that plays the role of a
fitness function which, in retrospect, can be used to predict amino-acid
sequences compatible with a given target structure, that is, to help in protein
design [171].
3. Detecting Positive Selection
Fitness functions are, however,
difficult to determine at the molecular level. In addition, while examples of
adaptive evolution at the morphological level abound, from Darwin’s finches in
the Galapagos [172] to cichlid fishes in the East
African lakes [173], the role of natural
selection in shaping the evolution of genomes is much more controversial [147, 174]. First, the neutral theory of
molecular evolution asserts that much of the variation at the DNA level is due
to the random fixation of mutations with no selective advantage [175]. Second, a compelling body of
evidence suggests that most of the genomic complexities have emerged by
nonadaptive processes [176]. A number of statistical
approaches exist either to test neutrality at the population level or to detect
positive Darwinian evolution at the species level [147]. A shortcoming of neutrality
tests is their dependence on a demographic model [177] and their sensitivity to
processes of molecular evolution such as among-site rate variation [178]. They also do not model
alternative hypotheses that would permit distinguishing negative selection from
adaptive evolution. The development of demographic models based on Poisson
random fields [179] and composite likelihoods [180] makes it possible both to
estimate the strength of selection and to assess the impact of a variety of
scenarios on allele frequency spectra [9]. But demographic
singularities such as bottlenecks can still generate spurious signatures of
positive selection [180, 181].
When effective population sizes are no longer a concern, for instance in studies at or above
the species level, the detection of positive selection in protein-coding genes
usually relies on codon models [163, 164] (see [182] for a review including
methods based on amino-acid models). Codon models permit distinguishing between
synonymous substitutions, which are likely to be neutral, and nonsynonymous
substitutions, which are directly exposed to the action of selection. If
synonymous and nonsynonymous substitutions accumulate at the same rate, then
the protein-coding gene is likely to evolve neutrally. Alternatively, if
nonsynonymous substitutions accumulate more slowly than synonymous substitutions, it
is most likely because nonsynonymous substitutions are deleterious, which suggests
the action of purifying selection. Conversely, the accumulation of
nonsynonymous substitutions faster than synonymous substitutions suggests the
action of positive selection. The nonsynonymous to synonymous rate ratio,
denoted ω = dN/dS,
is therefore interpreted as a measure of selection at the protein level, with ω = 1, ω < 1, and
ω > 1 indicating neutral evolution, negative selection, and positive selection,
respectively. This ratio is also denoted Ka/Ks, in particular in studies
that rely on counts of nonsynonymous and synonymous sites (e.g., [183]).
An extension exists to detect selection in noncoding regions [184], and a promising phylogenetic
hidden Markov or phylo-HMM model permits detection of selection in overlapping
genes [185].
These rate
ratios can be estimated by a number of methods implemented in MEGA, DAMBE,
HyPhy [42], and PAML. The most intuitive
methods, called counting methods, work in three steps: (i) count synonymous and
nonsynonymous sites, (ii) count the observed differences at these sites, and
(iii) apply corrections for multiple substitutions [186]. Counting methods are, however,
not optimal in the sense that most of them work on pairs of sequences and therefore,
just like neighbor-joining, fail to account for all the information contained
in an alignment. In addition, simulations suggest that counting methods can be
sensitive to a variety of biases such as unequal transition and transversion
rates or uneven base or codon frequencies [187]. Counting methods that
incorporate these biases generally perform better than those that do not, but
the maximum likelihood method still appears more robust to severe biases [187]. In addition, the maximum
likelihood method, which accounts for all the information in a data set, has good
power and good accuracy to detect positive selection [188, 189].
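The last step of a counting method can be sketched as follows, assuming that steps (i) and (ii) have already produced the numbers of synonymous and nonsynonymous sites and differences for a pair of sequences. The Jukes-Cantor correction is used here as one simple choice for step (iii); it is an assumption of this sketch and not necessarily the correction of [186], and the counts are hypothetical.

```python
import math

def jukes_cantor(p):
    """Correct a proportion of differences for multiple substitutions."""
    if p >= 0.75:
        raise ValueError("proportion too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def omega_by_counting(syn_sites, nonsyn_sites, syn_diffs, nonsyn_diffs):
    """Return (dN, dS, omega) from site and difference counts."""
    p_s = syn_diffs / syn_sites        # proportion of synonymous differences
    p_n = nonsyn_diffs / nonsyn_sites  # proportion of nonsynonymous differences
    d_s = jukes_cantor(p_s)
    d_n = jukes_cantor(p_n)
    if d_s == 0.0:
        raise ValueError("dS is zero; omega is undefined for this pair")
    return d_n, d_s, d_n / d_s

# Hypothetical counts for one pair of aligned coding sequences.
d_n, d_s, omega = omega_by_counting(
    syn_sites=150.0, nonsyn_sites=450.0, syn_diffs=30.0, nonsyn_diffs=18.0
)
print("dN = %.4f, dS = %.4f, omega = %.3f" % (d_n, d_s, omega))
```

With these made-up counts ω falls below one, which would be read as purifying selection averaged over the whole gene and over the two lineages compared.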
However, the
first studies using these methods found little evidence for adaptive evolution
essentially because they were averaging ω rate ratios over both lineages and sites [147]. Branch models were then developed [190, 191] quickly followed by site
models [192–196] and by branch-site models [189, 197]. All these approaches, as
implemented in PAML, rely on likelihood ratio tests to detect adaptive
evolution: a model where adaptive evolution is permitted is compared with a
null model where ω cannot be greater
than one. Simulations show that some of these tests are conservative [189], so that detection of
adaptive evolution should be safe as long as convergence of the analyses is
carefully checked [198], including in large-scale
analyses [199]. If the model allowing adaptive
evolution explains the data significantly better than the null model, then an
empirical Bayes approach can be used to identify which sites are likely to
evolve adaptively [192]. The empirical Bayes approach
relies on estimates of the model parameters, which can have large sampling
errors in small data sets. Because these sampling errors can cause the
empirical Bayes site identification to be unreliable [200], a Bayes empirical Bayes
approach was proposed and was shown to have good power and low false-positive
rates [201]. Full Bayesian approaches
that allow for uncertain parameter estimates were also proposed [202]. Yet, simulations showed that
they did not improve further on Bayes empirical Bayes estimates [203], so that the computational
overhead incurred by full Bayes methods may not be necessary in this case. One
particular case where a Bayesian approach is nonetheless required is when the
signature of adaptive evolution must be told apart from that of recombination, as these two
processes can leave similar signals in DNA sequences. Indeed, simulations show
that recombination can lead to false positive rates as large as 90% when trying
to detect adaptive evolution [204]. The codon model with
recombination implemented in OmegaMap [48] can then be used to tease
apart these two processes (e.g., see [205]).
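The empirical Bayes identification of sites discussed above amounts to applying Bayes' rule at each site, with the estimated class proportions playing the role of a prior. The sketch below assumes three site classes (ω < 1, ω = 1, ω > 1); the proportions and per-class site likelihoods are hypothetical numbers that would, in practice, come from a fitted site model. The estimates are plugged in as if they were known without error, which is precisely the weakness that the Bayes empirical Bayes approach was designed to address.

```python
def site_class_posteriors(proportions, site_likelihoods):
    """Posterior probability of each site class at one site.

    proportions: estimated proportion of sites in each class.
    site_likelihoods: likelihood of the site's data under each class.
    """
    joint = [p * l for p, l in zip(proportions, site_likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical estimates for classes with omega < 1, omega = 1, omega > 1.
proportions = [0.70, 0.25, 0.05]
# Hypothetical likelihoods of one alignment column under each class.
site_likelihoods = [1.0e-6, 4.0e-6, 9.0e-5]

posteriors = site_class_posteriors(proportions, site_likelihoods)
print("P(omega > 1 | site data) = %.3f" % posteriors[2])
```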
4. Estimating Divergence Times between Species
The estimation of the dates when
species diverged is often perceived to be as important as estimating the
phylogeny itself. This explains why so-called “dating methods” were
called for as soon as molecular phylogenies were first reconstructed [206]. In spite of over four
decades of history, molecular dating has only recently seen new developments.
One of the reasons for this slow progress is that, unlike the other parts of
phylogenetic analysis, divergence times are parameters that cannot be estimated
directly. Only sitewise likelihood values and distances between pairs of
sequences are identifiable, that is, directly estimable. Distances are
expressed as a number of substitutions per site (sub/site) and
can be decomposed as the product of two quantities: a rate of evolution
(sub/site/unit of time) and a time duration (unit of time). As a result, time
durations and, likewise, divergence times cannot be estimated without making an
additional assumption on the rates of evolution. The simplest assumption is to
posit that rates are constant in time, which is known as the molecular clock
hypothesis [207]. This hypothesis can be
tested, for instance, with PAUP or PAML, by means of a likelihood ratio test that
compares a constrained model (clock) with an unconstrained model (no clock).
These two models are nested, so that twice the log-likelihood difference
asymptotically follows a χ² distribution. If n sequences are analyzed, the
constrained model estimates n − 1 divergence
times, while the unconstrained model estimates 2n − 3 branch lengths.
The number of degrees of freedom of this test is then (2n − 3) − (n − 1) = n − 2 [4]. The systematic test of the
molecular clock assumption on recent data shows that this hypothesis is too
often untenable [208].
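The likelihood ratio test of the molecular clock can be sketched as follows. The log-likelihoods are hypothetical values that would normally be taken from the output of a program such as PAUP or PAML; the helper `chi2_sf` is written here with the standard library only and handles integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series over odd double factorials.
    term, total = 1.0, 0.0
    for k in range((df - 1) // 2):
        if k > 0:
            term *= x / (2 * k + 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total)

def clock_lrt(lnl_clock, lnl_noclock, n_sequences):
    """Return (statistic, degrees of freedom, P value) for the clock test."""
    if n_sequences < 3:
        raise ValueError("the test needs at least three sequences")
    stat = 2.0 * (lnl_noclock - lnl_clock)
    df = n_sequences - 2  # (2n - 3) - (n - 1)
    return stat, df, chi2_sf(stat, df)

# Hypothetical log-likelihoods for an alignment of 10 sequences.
stat, df, p_value = clock_lrt(lnl_clock=-4321.7, lnl_noclock=-4310.2,
                              n_sequences=10)
print("2*dlnL = %.2f, df = %d, P = %.4f" % (stat, df, p_value))
```

A small P value leads to rejecting the clock, in which case one of the relaxed-clock approaches would be preferred.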
The most recent work has then focused on relaxing this assumption, and three different
directions have emerged [209]. A first possibility is to
relax the clock globally on the
phylogeny, but to assume that the hypothesis still holds locally for closely related species [210–212]. Recent developments of these
local clock models now allow the use of multiple calibration points and of
multiple genes [213], the automatic placement of
the clocks on the tree [214] and the estimation of the
number of local clocks [209]. PAML can be used for most of
these computations. However, local clock models still tend to underestimate
rapid rate change [209]. The second possibility to
relax the global clock assumption is to assume that rates of evolution evolve
in an autocorrelated manner along lineages and to minimize the amount of rate
change over the entire phylogeny. The most popular approach in the plant
community is Sanderson’s penalized likelihood [215], implemented in r8s [55]. This approach performs well
on data sets for which the actual fossil dates are known [216] but still tends to
underestimate the actual amount of rate change [209].
Bayesian methods appear today as the emerging approach to estimate divergence times. Taking
inspiration from Sanderson’s pioneering work [217], Thorne et al. developed a
Bayesian framework where rates of evolution change in an autocorrelated manner
across lineages [45–47]: the rate of evolution of a
branch depends on the rate of evolution of its parental branch; the branches
emanating from the root require a special treatment. These Bayesian models work
by modeling how rates of evolution change in time (rate prior), and how the
speciation/population process shapes the distribution of divergence times
(speciation prior). These prior distributions can actually be interpreted as
penalty functions [45, 209], and they can have simple or
more complicated forms [218]. The Multidivtime program [45–47] is extremely quick to analyze
data thanks to the use of a multivariate normal approximation of the likelihood
surface. It assumes that rates of evolution change following a stationary
lognormal prior distribution. Further work suggested that this lognormal prior might not always be
the best-performing rate prior [218–220], but these latter studies had
two potential shortcomings: (i) they were based on a speciation prior that was
so strong that it biased divergence times towards the age of the
fossil root [219, 221], and (ii) they used a
statistical procedure, the posterior Bayes factor [222], that is potentially
inconsistent. One potential limitation of the Bayesian approach described so
far is its dependence on one single tree topology, which must be either known
ahead of time or estimated by other means. Recently, Drummond et al. found a
way to relax this requirement by positing that rates of evolution are
uncorrelated across lineages, while all the branches of the tree are
constrained to follow exactly the same rate prior [223]. As a result, their approach
is able to estimate the most probable tree (given the data and the substitution
model), the divergence times and the position of the root even without any
outgroup or without resorting to a nonreversible model of substitution [224]. Drummond et al. further
argue that the use of explicit models of rate variation over time might
contribute to improved phylogenetic inference [223]. In addition, when the focus
is on estimating divergence times, a recent analysis suggests that this
uncorrelated model of rate change could outperform the methods described above
to accommodate rapid rate change among lineages [209]. Implemented in BEAST, this
approach offers a variety of substitution models and prior distributions and
presents a graphical user interface that will appeal to numerous researchers [39].
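The contrast between autocorrelated and uncorrelated rate priors can be illustrated by simulating rates along a single lineage. This is a simplified sketch under stated assumptions: both priors are taken to be lognormal, the autocorrelated version centres each branch rate on the rate of its parental branch with a variance proportional to elapsed time, and the hyperparameters `nu` and `sigma` are hypothetical. It is not the implementation of Multidivtime or BEAST.

```python
import math
import random

def autocorrelated_rates(root_rate, branch_times, nu, rng):
    """Rates along one root-to-tip lineage; each depends on its parent."""
    rates = []
    parent = root_rate
    for t in branch_times:
        var = nu * t
        # The mean is shifted so that the expected child rate equals
        # the parental rate.
        log_rate = rng.gauss(math.log(parent) - var / 2.0, math.sqrt(var))
        parent = math.exp(log_rate)
        rates.append(parent)
    return rates

def uncorrelated_rates(mean_rate, sigma, n_branches, rng):
    """Independent draws from one lognormal distribution shared by all branches."""
    mu = math.log(mean_rate) - sigma ** 2 / 2.0
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]

rng = random.Random(42)
times = [10.0, 5.0, 5.0, 2.0, 1.0]  # hypothetical branch durations
auto = autocorrelated_rates(root_rate=0.01, branch_times=times, nu=0.02, rng=rng)
uncor = uncorrelated_rates(mean_rate=0.01, sigma=0.5, n_branches=len(times), rng=rng)

# A branch length in sub/site is the product of a rate and a time duration.
print("autocorrelated branch lengths:", [r * t for r, t in zip(auto, times)])
print("uncorrelated branch lengths:  ", [r * t for r, t in zip(uncor, times)])
```

Only the products of rates and times are informed by the sequence data, which is why the choice of rate prior matters for the estimated divergence times.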
5. Challenges and Perspectives
With the advent of high-throughput
sequencing technologies such as the whole-genome shotgun approach by pyrosequencing [225], fast, cheap, and accurate
genomic information is becoming available for a growing number of species [226]. Although low coverage limits the
complete assembly in many genome projects, it still allows quick access to
draft genomes for a growing number of species [227]. As a result, phylogenetic
inference can now incorporate large numbers of expressed sequence tags (ESTs),
genes [228], and occasionally complete
genomes [229]. The motivation for
developing these so-called phylogenomic approaches is their presumed ability to
return fully resolved and well-supported trees by decreasing both sampling
errors [230] and misleading signals due
for instance to horizontal gene transfer [231] or to hidden paralogy [232]. In practice, these
large-scale studies can give the impression that incongruence is resolved [228], but they also can fail to
address systematic errors due to the use of overly simple models [233]. Although the genes incorporated in
phylogenomic studies are often concatenated to limit the number of parameters
entering the model, it remains important to allow sitewise heterogeneities [234]. While partition models can
reduce systematic biases [234], Bayesian mixture models such
as CAT [151] appear to be robust to
long-branch attraction [159], a rampant issue in phylogenomics [235]. Altogether, the
accumulation of genomic data and these latest methodological developments seem
to make the reconstruction of the tree of life finally within reach. In
comparison, dating the tree of life is still in its infancy, even if a number
of initiatives such as the TimeTree server are being developed [236]. These resources are limited
to some vertebrates but will hopefully soon be extended to include other large
taxonomic groups such as plants. To achieve this goal, however, phylogenetic
studies should systematically incorporate divergence times, as is now routine
in some research communities (e.g., [237]). This joint estimation of
time and trees is today facilitated by the availability of user-friendly
programs such as BEAST. The near future will probably see the development of
mixture models for molecular dating and more sophisticated models that
integrate most of the topics discussed here, from sequence alignment to the
detection of sites under selection, into a single yet user-friendly toolbox [238].
Acknowledgments
Jeff Thorne provided insightful comments and suggestions, and two anonymous reviewers helped in improving
the original manuscript. Support was provided by the Natural Sciences and Engineering Research
Council of Canada (DG-311625 to SAB and DG-261252 to XX).
DarwinC.1859London, UKJ. MurraySokalR. R.SneathP. H. A.1963San Francisco, Calif, USAW. H. FreemanCavalli-SforzaL. L.BarraiI.EdwardsA. W.Analysis of human evolution under random genetic drift196429920FelsensteinJ.2004Sunderland, Mass, USASinauer AssociatesGlennerH.HansenA. J.SørensenM. V.RonquistF.HuelsenbeckJ. P.WillerslevE.eske.willerslev@zoology.oxford.ac.ukBayesian inference of the metazoan phylogeny: a combined molecular and morphological approach200414181644164910.1016/j.cub.2004.09.027PfeilB. E.bep27@cornell.eduSchlueterJ. A.ShoemakerR. C.DoyleJ. J.Placing paleopolyploidy in relation to taxon divergence: a phylogenetic analysis in legumes using 39 gene families200554344145410.1080/10635150590945359ChareE. R.HolmesE. C.ech15@psu.eduA phylogenetic survey of recombination frequency in plant RNA viruses2006151593394610.1007/s00705-005-0675-xPhilippeH.herve.philippe@umontreal.caDouadyC. J.Horizontal gene transfer and phylogenetics20036549850510.1016/j.mib.2003.09.008NielsenR.BustamanteC.ClarkA. G.A scan for positively selected genes in the genomes of humans and chimpanzees200536e17010.1371/journal.pbio.0030170RamírezS. R.sramirez@oeb.harvard.eduGravendeelB.SingerR. B.MarshallC. R.PierceN. E.Dating the origin of the Orchidaceae from a fossil orchid with its pollinator200744871571042104510.1038/nature06039KnightR. D.rdknight@princeton.eduFreelandS. J.sfreelan@princeton.eduLandweberL. F.lfl@princeton.eduRewiring the keyboard: evolvability of the genetic code200121495810.1038/35047500AntonovicsJ.ja8n@virginia.eduHoodM. E.ja8n@virginia.eduBakerC. H.ja8n@virginia.eduMolecular virology: was the 1918 flu avian in origin?20064407088E9discussion E9-1010.1038/nature04824JacksonA. P.andrew.jackson@zoo.ox.ac.ukCharlestonM. A.A cophylogenetic perspective of RNA-virus evolution2004211455710.1093/molbev/msg232HuelsenbeckJ. P.johnh@brahms.biology.rochester.eduRannalaB.LargetB.A Bayesian framework for the analysis of cospeciation2000542352364HajibabaeiM.mhajibab@uoguelph.caSingerG. 
A. C.HebertP. D. N.HickeyD. A.DNA barcoding: how it complements taxonomy, molecular phylogenetics and population genetics200723416717210.1016/j.tig.2007.02.001LuoS.-J.KimJ.-H.JohnsonW. E.Phylogeography and genetic ancestry of tigers (Panthera tigris)2004212e44210.1371/journal.pbio.0020442HoweC. J.c.j.howe@bioc.cam.ac.ukBarbrookA. C.SpencerM.RobinsonP.BordalejoB.MooneyL. R.Manuscript evolution200125312112610.1016/S0160-9327(00)01367-3GrayR. D.rd.gray@auckland.ac.nzAtkinsonQ. D.Language-tree divergence times support the Anatolian theory of Indo-European origin2003426696543543910.1038/nature02029HillisD. M.HuelsenbeckJ. P.Support for dental HIV transmission19943696475242510.1038/369024a0SalasA.apimlase@usc.esBandeltH.-J.MacaulayV.RichardsM. B.Phylogeographic investigations: the role of trees in forensic genetics2007168111310.1016/j.forsciint.2006.05.037SankoffD.NadeauJ. H.jhn4@cwru.eduChromosome rearrangements in evolution: from gene order to genome sequence and back200310020111881118910.1073/pnas.2035002100SwoffordD. L.swofford@lms.si.eduWaddellP. J.waddell@onyx.si.eduHuelsenbeckJ. P.johnh@brahms.biology.rochester.eduFosterP. G.LewisP. O.plewis@uconnvm.uconn.eduRogersJ. S.jsrogers@uno.eduBias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods200150452553910.1080/106351501750435086HolderM.mholder@uconn.eduLewisP. 
O.Phylogeny estimation: traditional and Bayesian approaches20034427528410.1038/nrg1044SiepelA.HausslerD.NielsenR.Phylogenetic hidden Markov models2005New York, NY, USASpringer32535110.1007/0-387-27733-1_12PagelM.m.pagel@rdg.ac.ukMeadeA.a.meade@rdg.ac.ukBayesian analysis of correlated evolution of discrete characters by reversible-jump Markov chain Monte Carlo2006167680882510.1086/503444KumarS.s.kumar@asu.eduFilipskiA.Multiple sequence alignment: in pursuit of homologous DNA positions200717212713510.1101/gr.5232407NotredameC.Recent evolutions of multiple sequence alignment algorithms200738e12310.1371/journal.pcbi.0030123EdgarR. C.bob@drive5.comBatzoglouS.Multiple sequence alignment200616336837310.1016/j.sbi.2006.04.004XiaX.XieZ.DAMBE: software package for data analysis in molecular biology and evolution200192437137310.1093/jhered/92.4.371KumarS.TamuraK.NeiM.MEGA: molecular evolutionary genetics analysis software for microcomputers1994102189191TamuraK.DudleyJ.NeiM.KumarS.s.kumar@asu.eduMEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.020072481596159910.1093/molbev/msm092DoC. B.MahabhashyamM. S. P.BrudnoM.BatzoglouS.serafim@cs.stanford.eduProbCons: probabilistic consistency-based multiple sequence alignment200515233034010.1101/gr.2821705WallaceI. M.BlackshieldsG.HigginsD. G.des.higgins@ucd.ieMultiple sequence alignments200515326126610.1016/j.sbi.2005.04.002WallaceI. M.O'SullivanO.HigginsD. G.NotredameC.cedric.notredame@europe.comM-Coffee: combining multiple sequence alignment methods with T-Coffee20063461692169910.1093/nar/gkl091HallB. G.2008Sunderland, Mass, USASinauer AssociatesLargetB.larget@mathcs.duq.eduSimonD. L.Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees1999166750759PagelM.m.pagel@rdg.ac.ukMeadeA.A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data200453457158110.1080/10635150490468675RedelingsB. D.SuchardM. 
A.msuchard@ucla.eduJoint Bayesian estimation of alignment and phylogeny200554340141810.1080/10635150590947041DrummondA. J.RambautA.BEAST: Bayesian evolutionary analysis by sampling trees20077, article 2141810.1186/1471-2148-7-214ShimodairaH.HasegawaM.CONSEL: for assessing the confidence of phylogenetic tree selection200117121246124710.1093/bioinformatics/17.12.1246ZwicklD.2006Austin, Tex, USAUniversity of Texas at AustinKosakovsky PondS. L.FrostS. D. W.MuseS. V.muse@stat.ncsu.eduHyPhy: hypothesis testing using phylogenies200521567667910.1093/bioinformatics/bti079RonquistF.fredrick.ronquist@ebc.uu.seHuelsenbeckJ. P.MrBayes 3: Bayesian phylogenetic inference under mixed models200319121572157410.1093/bioinformatics/btg180AltekarG.galtekar@cs.rochester.eduDwarkadasS.HuelsenbeckJ. P.RonquistF.Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference200420340741510.1093/bioinformatics/btg427ThorneJ. L.KishinoH.PainterI. S.Estimating the rate of evolution of the rate of molecular evolution1998151216471657KishinoH.kishino@wheat.ab.a.u-tokyo.ac.jpThorneJ. L.BrunoW. J.Performance of a divergence time estimation method under a probabilistic model of rate evolution2001183352361ThorneJ. L.KishinoH.Divergence time and evolutionary rate estimation with multilocus data200251568970210.1080/10635150290102456WilsonD. J.daniel.wilson@sjc.ox.ac.ukMcVeanG.Estimating diversifying selection and functional constraint in the presence of recombination200617231411142510.1534/genetics.105.044917YangZ.PAML: a program package for phylogenetic analysis by maximum likelihood1997135555556YangZ.z.yang@ucl.ac.ukPAML 4: phylogenetic analysis by maximum likelihood20072481586159110.1093/molbev/msm088SwoffordD. 
L.PAUP∗: Phylogenetic Analysis Using Parsimony (and other Methods) 4.0 Beta200210thSunderland, Mass, USASinauer AssociatesLartillotN.nicolas.lartillot@lirmm.frPhilippeH.Computing Bayes factors using thermodynamic integration200655219520710.1080/10635150500433722GuindonS.GascuelO.gascuel@lirmm.frA simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood200352569670410.1080/10635150390235520StamatakisA.Alexandros.Stamatakis@epfl.chRAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models200622212688269010.1093/bioinformatics/btl446SandersonM. J.mjsanderson@ucdavis.edur8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock200319230130210.1093/bioinformatics/19.2.301KjerK. M.Use of rRNA secondary structure in phylogenetic studies to identify homologous positions: an example of alignment and data presentation from the frogs19954331433010.1006/mpev.1995.1028NotredameC.cedric.notredame@ebi.ac.ukO'BrienE. A.HigginsD. G.RAGA: RNA sequence alignment by genetic algorithm199725224570458010.1093/nar/25.22.4570HicksonR. E.robert.hickson@ermanz.govt.nzSimonC.PerreyS. W.The performance of several multiple-sequence alignment programs in relation to secondary-structure features for an rRNA sequence200017453053910.1080/10635150050207401XiaX.xxia@hkusua.hku.hkPhylogenetic relationship among horseshoe crab species: effect of substitution models on phylogenetic analyses200049187100XiaX.xxia@uottawa.caXieZ.KjerK. M.kjer@aesop.rutgers.edu18S ribosomal RNA and tetrapod phylogeny200352328329510.1080/1063515030933110.1080/10635150390196948ThompsonJ. D.HigginsD. G.GibsonT. J.CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice199422224673468010.1093/nar/22.22.4673LarkinM. A.BlackshieldsG.BrownN. 
P.Clustal W and clustal X version 2.0200723212947294810.1093/bioinformatics/btm404GolubchikT.WiseM. J.EastealS.JermiinL. S.lars.jermiin@usyd.edu.auMind the gaps: evidence of bias in estimates of multiple sequence alignments200724112433244210.1093/molbev/msm176LandanG.giddy.landan@gmail.comGraurD.Heads or tails: a simple reliability check for multiple sequence alignments20072461380138310.1093/molbev/msm060SchwartzA. S.MyersE. W.PachterL.Alignment metric accuracyhttp://arxiv.org/abs/q-bio.QM/0510052, 2005ZhuJ.LiuJ. S.jliu@stat.stanford.eduLawrenceC. E.lawrence@wadsworth.orgBayesian adaptive sequence alignment algorithms1998141253910.1093/bioinformatics/14.1.25HolmesI.BrunoW. J.Evolutionary HMMs: a Bayesian approach to multiple alignment200117980382010.1093/bioinformatics/17.9.803JensenJ. L.jlj@imf.au.dkHeinJ.hein@stats.ox.ac.ukGibbs sampler for statistical multiple alignment2005154889907ClampM.michele@sanger.ac.ukCuffJ.SearleS. M.BartonG. J.The Jalview Java alignment editor200420342642710.1093/bioinformatics/btg430SankoffD.CedergrenR.SankoffD.CedergrenR.Simultaneous comparison of three or more sequences related by a tree1983Reading, Mass, USAAddison-Wesley253264HeinJ.A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given198966649668LunterG.lunter@stats.ox.ac.ukMiklósI.miklosi@ramet.elte.huDrummondA.alexei.drummond@zoology.oxford.ac.ukJensenJ. L.jlj@imf.au.dkHeinJ.hein@stats.ox.ac.ukBayesian coestimation of phylogeny and sequence alignment20056, article 8311010.1186/1471-2105-6-83WongK. M.SuchardM. A.HuelsenbeckJ. P.johnh@berkeley.eduAlignment uncertainty and genomic analysis2008319586247347610.1126/science.1151532LunterG.RoccoA.MimouniN.HegerA.CaldeiraA.HeinJ.Uncertainty in homology inferences: assessing and improving genomic sequence alignment200818229830910.1101/gr.6725608NeiM.KumarS.2000New York, NY, USAOxford University PressJukesT. H.CantorC. R.MunroH. 