DNA sequence data are now being used to study the ancestral history of human populations. The existing methods for such coalescence inference use a recursion formula to compute the data probabilities. These methods are useful in practical applications but are computationally complicated. Here we first investigate the asymptotic behavior of such inference; the results indicate that, broadly speaking, the estimated coalescent time converges to a finite limit. We then study a relatively simple computational method for this analysis and illustrate how to use it.

In recent decades, considerable progress has been made in the field of population genetics. One of the main goals is to infer the coalescence time of the population under study, that is, to infer the time since its most recent common ancestor (MRCA), and the distribution of that time, based on the observed data.

In genetics, coalescent theory is a retrospective model of population genetics that traces all genes in a sample from a population back to a single ancestral copy shared by all members of the population. The coalescent time of a population is the time of its most recent common ancestor. The inheritance relationship among the genes is typically represented as a gene genealogy, similar to a phylogenetic tree. The goal of coalescent analysis is to infer the coalescent time of a sample of

In coalescence inference, mitochondrial DNA (mtDNA) data play an important role. Mitochondrial DNA is one of the few pieces of genetic material located outside the cell nucleus, and in mammals it is inherited only maternally. Human mtDNA is a double-stranded molecule about 16,500 base pairs in length. The mutation rate in mtDNA is known to be about 10 times that of nuclear genes, and in one section of the molecule, the control region, the mutation rate is an order of magnitude higher still. The simple inheritance pattern and high variability make mtDNA an important source of information in the study of human evolutionary history. Each site on the DNA strand carries one of the four bases A, C, G, or T. As the molecule evolves, mutations occur in the form of base substitutions. A change within the purines (A, G) or within the pyrimidines (C, T) is called a transition; a change between a purine and a pyrimidine is called a transversion. Transitions are much more common than transversions.
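The transition/transversion distinction described above can be made concrete with a small helper. This is only an illustrative sketch; the function name `classify_substitution` and its return labels are hypothetical and do not come from the paper.

```python
# Illustrative sketch: classify a single-base substitution.
# The names below are hypothetical, chosen for this example only.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(old: str, new: str) -> str:
    """Return 'transition', 'transversion', or 'none' for a base change."""
    old, new = old.upper(), new.upper()
    valid = PURINES | PYRIMIDINES
    if old not in valid or new not in valid:
        raise ValueError(f"unknown base in substitution {old!r} -> {new!r}")
    if old == new:
        return "none"
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for pair in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
        print(pair, classify_substitution(*pair))
```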

We focus on the control region of the mitochondrial data in Griffiths and Tavaré [

Nucleotide position in control region. Sites 1 to 5 are purines and sites 6 to 18 are pyrimidines; the last column gives the lineage frequency.

Lineage | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | Freq. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a | A | G | G | A | A | T | C | C | T | C | T | T | C | T | C | T | T | C | 2 |
b | A | G | G | A | A | T | C | C | T | T | T | T | C | T | C | T | T | C | 2 |
c | G | A | G | G | A | C | C | C | T | C | T | T | C | C | C | T | T | T | 1 |
d | G | G | A | G | A | C | C | C | C | C | T | T | C | C | C | T | T | C | 3 |
e | G | G | G | A | A | T | C | C | T | C | T | T | C | T | C | T | T | C | 19 |
f | G | G | G | A | G | T | C | C | T | C | T | T | C | T | C | T | T | C | 1 |
g | G | G | G | G | A | C | C | C | T | C | C | C | C | C | C | T | T | T | 1 |
h | G | G | G | G | A | C | C | C | T | C | C | C | T | C | C | T | T | T | 1 |
i | G | G | G | G | A | C | C | C | T | C | T | T | C | C | C | C | C | T | 4 |
j | G | G | G | G | A | C | C | C | T | C | T | T | C | C | C | C | T | T | 8 |
k | G | G | G | G | A | C | C | C | T | C | T | T | C | C | C | T | T | C | 5 |
l | G | G | G | G | A | C | C | C | T | C | T | T | C | C | C | T | T | T | 4 |
m | G | G | G | G | A | C | C | T | T | C | T | T | C | C | C | T | T | C | 3 |
n | G | G | G | G | A | C | T | C | T | C | T | T | C | C | T | T | T | C | 1 |

Each row of the table represents a DNA sequence lineage. In these data, only transitions are observed; there are no transversions.
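The statement that the table contains only transitions can be checked mechanically. The sketch below transcribes the 14 lineages and their frequencies from the table and verifies that every site shows bases from a single chemical class. The variable and function names are hypothetical, introduced only for this illustration.

```python
# Illustrative check on the tabulated control-region data.
# Sequences and frequencies are transcribed from the table above;
# all identifiers are hypothetical names for this sketch.
LINEAGES = {
    "a": ("AGGAATCCTCTTCTCTTC", 2),
    "b": ("AGGAATCCTTTTCTCTTC", 2),
    "c": ("GAGGACCCTCTTCCCTTT", 1),
    "d": ("GGAGACCCCCTTCCCTTC", 3),
    "e": ("GGGAATCCTCTTCTCTTC", 19),
    "f": ("GGGAGTCCTCTTCTCTTC", 1),
    "g": ("GGGGACCCTCCCCCCTTT", 1),
    "h": ("GGGGACCCTCCCTCCTTT", 1),
    "i": ("GGGGACCCTCTTCCCCCT", 4),
    "j": ("GGGGACCCTCTTCCCCTT", 8),
    "k": ("GGGGACCCTCTTCCCTTC", 5),
    "l": ("GGGGACCCTCTTCCCTTT", 4),
    "m": ("GGGGACCTTCTTCCCTTC", 3),
    "n": ("GGGGACTCTCTTCCTTTC", 1),
}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def site_base_sets(lineages):
    """Return, for each site, the set of bases observed across lineages."""
    seqs = [seq for seq, _ in lineages.values()]
    return [set(column) for column in zip(*seqs)]


def only_transitions(lineages):
    """True if every site mixes bases from one class only (no transversion)."""
    return all(
        bases <= PURINES or bases <= PYRIMIDINES
        for bases in site_base_sets(lineages)
    )


if __name__ == "__main__":
    sample_size = sum(freq for _, freq in LINEAGES.values())
    segregating = sum(len(b) > 1 for b in site_base_sets(LINEAGES))
    print("sample size:", sample_size)
    print("segregating sites:", segregating)
    print("transitions only:", only_transitions(LINEAGES))
```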

The coalescent is a model for the genealogical tree of a random sample of

Coalescent tree for a sample of seven individuals.
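To give a feel for the kind of tree shown in the figure, the following is a minimal sketch of simulating the time to the MRCA under the standard Kingman coalescent. It assumes the usual time scaling in which, while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2. This scaling and the function names are assumptions of the example and may differ from the conventions used in the paper.

```python
# Illustrative sketch: time to the MRCA under the standard Kingman coalescent.
# Assumption: with k lineages, the next coalescence waits Exp(k(k-1)/2).
import random


def simulate_tmrca(n: int, rng: random.Random) -> float:
    """Simulate one coalescent time for a sample of n lineages."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    total = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0
        total += rng.expovariate(rate)  # waiting time while k lineages remain
    return total


def expected_tmrca(n: int) -> float:
    """Closed-form mean under the same scaling: 2 * (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [simulate_tmrca(7, rng) for _ in range(20000)]
    print("simulated mean:", sum(draws) / len(draws))
    print("theoretical mean:", expected_tmrca(7))
```

The closed-form mean follows from summing the expected waiting times 2/(k(k-1)) over k = 2, ..., n, which telescopes to 2(1 - 1/n).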

For more detailed reviews of this topic, see Hudson [

In coalescence inference one has the following.

The inference of coalescence time

For mutation, the common assumption is that the times at which mutations occur follow a Poisson process with constant rate

Thus, given the mutation rate

The key step in coalescence inference is to evaluate the post-data distribution of

To evaluate the post-data coalescent distribution, GT used the probability recursion formula derived in Ethier and Griffiths [

Here we study a relatively simple approximate method that uses the full data information; in this method, instead of computing the tree probabilities as in GT, we simply set the post-data tree probabilities to be uniform for the

The rooted tree plays an important role in the analysis, but it is not uniquely determined by the data. The data are equivalent to an unrooted tree, which in turn is equivalent to a set of rooted trees. Each rooted tree has a 0-1 valued matrix representation, which is convenient for some computations, but not every 0-1 valued matrix corresponds to a rooted tree. In the following, we give more details about these objects and their relationships.
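The remark that not every 0-1 matrix corresponds to a rooted tree can be illustrated with a small check. The sketch assumes the common incidence-matrix convention in which rows are lineages, columns are segregating sites, and an entry of 1 marks a mutant base relative to the root; the paper's own representation may differ. Under that assumption, a standard criterion is that the sets of lineages carrying any two mutations must be either nested or disjoint. The function name is hypothetical.

```python
# Illustrative sketch, assuming rows = lineages, columns = sites,
# and 1 = mutant base relative to the root (an assumption of this example).
from itertools import combinations


def is_rooted_tree_matrix(matrix) -> bool:
    """True if every pair of columns has nested or disjoint carrier sets."""
    if not matrix:
        return True
    n_cols = len(matrix[0])
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("all rows must have the same length")
    # For each site, the set of lineages that carry the mutation.
    carriers = [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(n_cols)
    ]
    for a, b in combinations(carriers, 2):
        overlapping = bool(a & b)
        nested = a <= b or b <= a
        if overlapping and not nested:
            return False
    return True


if __name__ == "__main__":
    compatible = [[1, 0], [1, 1], [0, 0]]
    incompatible = [[1, 0], [0, 1], [1, 1]]
    print(is_rooted_tree_matrix(compatible))
    print(is_rooted_tree_matrix(incompatible))
```

In the second matrix, the two mutations are shared by one lineage while each also appears alone in another lineage, so no single rooted tree with one mutation event per site can produce it.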

The presentation of a rooted tree is unique up to the relative positions of its branches, subbranches, and so forth. A rooted tree has several levels of randomness. If we only know the sample size

Unlike a coalescent tree, which has a complete time ordering of the splitting points of its branches, a rooted tree has only partial time orderings of these splits and mutations. We know only that the split of a branch occurred before the splits of its subbranches, but not the ordering of splits on different branches. Likewise, we know that mutations on a branch occurred before those on its subbranches, but not the order of mutations on the same branch, on the same subbranch, or on different subbranches. A given set of sequence data may correspond to more than one rooted tree. For the observed data in Table

Then, based on this unrooted tree, one can get all the other rooted trees as in Griffiths and Tavaré [

For parameter inference with independent and identically distributed data and sample size

The commonly available data is in the form of Table

The last method is to estimate

We have the following result (proof in Appendix).

(i) One has

(ii) One has

(iii) One has

The above result tells us that

The method is to construct the mutation vector

The above expectation is not easy to compute directly since we do not know the joint distribution of

Now we consider generating

Specifically, the simulation method is as follows. For

Sample

For each fixed

Allocate the

After all the

Now we assign

We now allocate

Continue this way, until

The assumption that the population size

(i) Recall that the

The authors declare that there is no conflict of interests regarding the publication of this paper.