We consider an infinite-allele Markov branching process (IAMBP). Our main focus is the frequency spectrum of this process, that is, the proportion of alleles having a given number of copies at a specified time point. We derive the variance of the frequency spectrum, which is useful for interval estimation and hypothesis testing for process parameters. In addition, for a class of special IAMBP with birth and death offspring distribution, we show that the mean of its limiting frequency spectrum has an explicit form in terms of the hypergeometric function. We also derive an asymptotic expression for convergence rate to the limit. Simulations are used to illustrate the results for the birth and death process.
1. Introduction
The infinite-allele branching process was first introduced by Griffiths and Pakes [1]. As a special type of branching process, this process allows individuals to mutate into infinitely many allelic variants, each of which is “new” in the sense of being different from all previously existing variants. This idealization is approximately correct for rare point mutations in long DNA sequences. Fundamental results for the discrete-time case (simple branching process) and for the continuous-time case (Markov branching process) have been obtained in [1, 2]. These include the number of alleles at a given generation or time, the generation number or time of the last mutation, and the limiting frequency spectrum. There exists an analogy between the results for the discrete-time and the continuous-time cases; however, the characteristics in the continuous case are relatively easier to derive [2]. Many evolutionary processes may be considered time continuous, and frequently we assume Markov property in modeling. A classical example is the discrete-time Wright-Fisher model, which is typically either approximated by a continuous-time diffusion or replaced by a continuous-time Markov chain, the so-called continuous-time Moran process [3]. Therefore, the time-continuous infinite-allele Markov branching process (TCIAMBP, or simply IAMBP) seems to be appropriate for modeling evolution in population genetics.
Consider a Markov branching process with neutral mutations. Suppose that the process starts from a group of individuals carrying the same allele, and individuals can mutate into new allelic variants. We assume that the mutation is independent of the previous history of the process, and the offspring distribution is independent of the allelic type, that is, the selection is neutral for all alleles. The process can be described as an “infinitely-many-alleles” model (IAM). Whenever a mutation happens, it yields a new allele, which differs from all the previously existing ones. In this paper, we are interested in the frequency spectrum of the IAMBP, which may be defined as the number or proportion of alleles present in a given number of individuals at a specified time point. Frequency spectrum in this paper refers to random allele frequencies, not their expected value as in Griffiths and Pakes [1] and in Pakes [2], since we also consider the variance of the allele frequency later on. Unless specified otherwise, we will use terms “mean frequency spectrum” and “variance frequency spectrum” in the remainder of this paper to denote the expected value and variance of the allele frequency. The frequency spectrum plays an important role in many genetic processes, such as DNA sequence evolution. As an example, Kimmel and Mathaes [4] modeled the Alu sequence data using an infinite-allele simple branching process with linear-fractional offspring distribution, and the goodness of fit testing suggested that Alu sequences do not evolve neutrally and might be under selection. It has to be noted that the concept of the frequency spectrum is in some sense similar to the Ewens’ sampling formula [5] in population genetics. We will return to this subject in the discussion, although analysis of the analogies and differences transcends the scope of the present paper.
The paper is organized as follows. In Section 2, we rigorously define the IAMBP and the mean frequency spectrum of the IAMBP. Then, we provide explicit expressions for the special case of the birth and death process. In Section 3, we derive the variance frequency spectrum and discuss its use in interval estimation for process parameters. We perform simulations to illustrate the results using the birth and death process example in Section 4. Section 5 is a summary.
2. IAMBP and Its Limiting Mean Frequency Spectrum
2.1. Definition and Basic Properties in the Supercritical Case
Let us consider a continuous-time Markov branching process consisting of individuals with exponential life spans with mean $a^{-1}$. Upon death, each individual produces a random number of offspring. As usual, the offspring counts are identically distributed according to the probability generating function (pgf) $f(s)$ and are independent conditional on the past of the process. The mean $f'(1-)$ of the offspring distribution is $m$, regardless of the allelic type. We further assume that a newborn individual mutates into a new allelic type with probability $\mu$, independently of the previous history of the process. Let $h(s)=f(\mu+(1-\mu)s)$ denote the offspring pgf in a clone containing only like-type individuals, started either by the overall ancestor or by any mutant. The entire process is the union of such clones over all allelic types. The theory of the IAMBP was developed by Griffiths and Pakes [1] in the discrete-time case and then by Pakes [2] in the continuous-time case. We will assume $m>1$ and $M=h'(1-)>1$, although some results can be proved without the latter assumption.
Let $\alpha_t(j)$ be the number of alleles present in $j$ individuals at time $t$ and $\phi_{i,t}(j)=E_i[\alpha_t(j)]$, where the subscript $i$ indicates that the process begins with $i$ individuals carrying the same allele. It has been shown [2] that
$$\phi_{i,t}(j)=q_{ij}(t)+iam\mu e^{\lambda t}\int_0^t e^{-\lambda x}q_{1j}(x)\,dx,\quad j\ge 0, \tag{1}$$
where $\lambda=a(m-1)$ is the Malthusian parameter of the overall process and $q_{ij}(t)$ is the probability of observing $j$ individuals ($j\ge 1$) carrying the parental allele at time $t$ when starting from $i$ individuals with the parental allele at time $t=0$. Consequently, for the number $K_t$ of alleles at time $t$, we have
$$E_i[K_t]=\sum_{j=1}^{\infty}\phi_{i,t}(j)=1-q_{i0}(t)+iam\mu e^{\lambda t}\int_0^t e^{-\lambda x}[1-q_{10}(x)]\,dx. \tag{2}$$
Let $G_j=\int_0^\infty e^{-\lambda t}q_{1j}(t)\,dt$, $j\ge 0$. If we define $\psi_{ij}(t)=\phi_{i,t}(j)/E_i[K_t]$ and
$$\psi_j=\lim_{t\to\infty}\psi_{ij}(t)=\lim_{t\to\infty}\frac{q_{ij}(t)+iam\mu e^{\lambda t}\int_0^t e^{-\lambda x}q_{1j}(x)\,dx}{1-q_{i0}(t)+iam\mu e^{\lambda t}\int_0^t e^{-\lambda x}[1-q_{10}(x)]\,dx} \tag{3}$$
as the limiting mean frequency spectrum, that is, the expected proportion of alleles present in $j$ individuals as $t\to\infty$, then we see that for the supercritical process, for which $\lambda>0$,
$$\psi_j=\frac{\lambda G_j}{1-\lambda G_0},\quad j\ge 1. \tag{4}$$
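If $M>1$, then the process of like-type clones is supercritical, and, as is known [6], $q_{10}(t)\uparrow q_{10}(\infty)<1$ and $q_{1j}(t)\to 0$, $j\ge 1$, as $t\to\infty$. Therefore, $\left|e^{\lambda t}\int_t^\infty e^{-\lambda x}q_{10}(x)\,dx-q_{10}(\infty)/\lambda\right|\to 0$ and $e^{\lambda t}\int_t^\infty e^{-\lambda x}q_{1j}(x)\,dx\to 0$ as $t\to\infty$, for $j\ge 1$. This yields the following asymptotic equivalence: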
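$$\psi_{ij}(t)-\psi_j\underset{t\to\infty}{\sim}\frac{\lambda G_j\left[1-q_{10}(\infty)-\dfrac{\lambda(1-q_{i0}(\infty))}{iam\mu}\right]}{(1-\lambda G_0)^2}\,e^{-\lambda t}. \tag{5}$$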
Details of the proof are omitted, since they appear elementary.
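The omitted argument can be outlined as follows. This is only a sketch under the supercriticality assumptions stated above, not the authors' proof; the shorthand $G_j(t)=\int_0^t e^{-\lambda x}q_{1j}(x)\,dx$ is used, and the first line is obtained by dividing (1) and (2) by $iam\mu e^{\lambda t}$.

```latex
\begin{align*}
\psi_{ij}(t) &= \frac{G_j(t) + \dfrac{q_{ij}(t)}{iam\mu}\,e^{-\lambda t}}
                     {\dfrac{1-e^{-\lambda t}}{\lambda} - G_0(t)
                      + \dfrac{1-q_{i0}(t)}{iam\mu}\,e^{-\lambda t}}, \\
G_j(t) &= G_j + o(e^{-\lambda t}) \quad (j\ge 1), \qquad
G_0(t) = G_0 - \frac{q_{10}(\infty)}{\lambda}\,e^{-\lambda t} + o(e^{-\lambda t}), \\
\text{denominator} &= \frac{1-\lambda G_0}{\lambda}
   - \frac{e^{-\lambda t}}{\lambda}
     \Bigl[1 - q_{10}(\infty) - \frac{\lambda(1-q_{i0}(\infty))}{iam\mu}\Bigr]
   + o(e^{-\lambda t}), \\
\psi_{ij}(t) &= \frac{\lambda G_j}{1-\lambda G_0}
   \Bigl\{1 + \frac{e^{-\lambda t}}{1-\lambda G_0}
     \Bigl[1 - q_{10}(\infty) - \frac{\lambda(1-q_{i0}(\infty))}{iam\mu}\Bigr]\Bigr\}
   + o(e^{-\lambda t}).
\end{align*}
```

Subtracting $\psi_j=\lambda G_j/(1-\lambda G_0)$ from the last line gives the leading term in (5).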
2.2. IAMBP with Birth and Death Offspring Distribution
For the IAMBP with birth and death offspring distribution $f(s)=\alpha+\beta s^2$, $\alpha+\beta=1$, we are able to obtain an explicit form for $G_j$, $j\ge 0$; therefore, the limiting mean frequency spectrum $\psi_j$, $j\ge 1$, can be derived. The offspring pgf of a clone of like-type individuals in the birth and death IAMBP is
$$h(s)=f(\mu+(1-\mu)s)=\alpha+\beta[\mu+(1-\mu)s]^2, \tag{6}$$
where $\alpha$, $\beta$, and $\mu$ stand for the death, birth, and mutation probabilities of every individual and $\alpha+\beta=1$. Note that under another parameterization, in which the two newborn individuals die, live, and mutate independently, this pgf may be formulated differently as $h(s)=[\alpha+\beta\mu+\beta(1-\mu)s]^2$. Under either parameterization, $\lambda=a(2\beta-1)$. If, as assumed, $M=m(1-\mu)>1$, then the parameters $\alpha$ and $\mu$ are subject to the constraint
$$(1-\alpha)(1-\mu)>\frac{1}{2}. \tag{7}$$
Let us write $A^2=\alpha+\beta\mu^2$ and $B^2=\beta(1-\mu)^2$ (for the other formulation, $A^2=(\alpha+\beta\mu)^2$ and $B^2=\beta^2(1-\mu)^2$). The explicit form of $G_j$ can be written as
$$\begin{aligned}
G_0&=\frac{1}{c}\,\frac{A^2}{B^2}\,\frac{\Gamma(\lambda/c)\,\Gamma(2)}{\Gamma(2+\lambda/c)}\,F\!\left(1,\frac{\lambda}{c};2+\frac{\lambda}{c};\frac{A^2}{B^2}\right),\\
G_j&=\frac{1}{c}\left(1-\frac{A^2}{B^2}\right)^2\frac{\Gamma(1+\lambda/c)\,\Gamma(j)}{\Gamma(j+1+\lambda/c)}\,F\!\left(j+1,1+\frac{\lambda}{c};j+1+\frac{\lambda}{c};\frac{A^2}{B^2}\right),\quad j\ge 1,
\end{aligned} \tag{8}$$
where $c=a(B^2-A^2)=a[2\beta(1-\mu)-1]$ is the Malthusian parameter of the like-type clone and $F(\cdot,\cdot;\cdot;\cdot)$ is the Gauss hypergeometric function [7], defined as
$$F(a,b;c;z)=\frac{\Gamma(c)}{\Gamma(b)\,\Gamma(c-b)}\int_0^1 t^{b-1}(1-t)^{c-b-1}(1-tz)^{-a}\,dt,\quad c>b>0. \tag{9}$$
For a detailed derivation, see Appendix A. Note that the supercritical condition $M>1$ gives $A^2<B^2$, so the argument $A^2/B^2$ of the hypergeometric function stays in $[0,1)$, where the function is well defined.
It follows that
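$$\psi_j=\frac{\lambda G_j}{1-\lambda G_0}=\frac{\dfrac{\lambda}{c}\left(1-\dfrac{A^2}{B^2}\right)^2\dfrac{\Gamma(1+\lambda/c)\,\Gamma(j)}{\Gamma(j+1+\lambda/c)}\,F\!\left(j+1,1+\dfrac{\lambda}{c};j+1+\dfrac{\lambda}{c};\dfrac{A^2}{B^2}\right)}{1-\dfrac{\lambda}{c}\,\dfrac{A^2}{B^2}\,\dfrac{\Gamma(\lambda/c)\,\Gamma(2)}{\Gamma(2+\lambda/c)}\,F\!\left(1,\dfrac{\lambda}{c};2+\dfrac{\lambda}{c};\dfrac{A^2}{B^2}\right)},\quad j\ge 1. \tag{10}$$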
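As an illustration of formulas (8) and (10), the following Python sketch evaluates the limiting mean frequency spectrum for the parameterization (6). It is not the authors' code: the function names `hyp2f1`, `bd_constants`, `G`, and `limiting_spectrum` are hypothetical, and the hypergeometric function is summed from its power series with the standard library only, which assumes that its argument is below 1 and that $\lambda>0$.

```python
import math


def hyp2f1(a, b, c, z, tol=1e-15, max_terms=100_000):
    """Gauss hypergeometric function via its power series (requires |z| < 1)."""
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def bd_constants(a, alpha, mu):
    """Return (lambda, c, A^2/B^2) for h(s) = alpha + beta*(mu + (1-mu)*s)**2."""
    beta = 1.0 - alpha
    A2 = alpha + beta * mu**2            # A^2
    B2 = beta * (1.0 - mu)**2            # B^2
    lam = a * (2.0 * beta - 1.0)         # Malthusian parameter of the whole process
    c = a * (B2 - A2)                    # Malthusian parameter of a like-type clone
    if c <= 0.0:
        raise ValueError("constraint (7) is violated: the clone is not supercritical")
    return lam, c, A2 / B2


def G(j, a, alpha, mu):
    """G_j from formula (8); log-gamma is used to avoid overflow for large j."""
    lam, c, z = bd_constants(a, alpha, mu)
    r = lam / c
    if j == 0:
        coef = math.exp(math.lgamma(r) + math.lgamma(2.0) - math.lgamma(2.0 + r))
        return z / c * coef * hyp2f1(1.0, r, 2.0 + r, z)
    coef = math.exp(math.lgamma(1.0 + r) + math.lgamma(j) - math.lgamma(j + 1.0 + r))
    return (1.0 - z)**2 / c * coef * hyp2f1(j + 1.0, 1.0 + r, j + 1.0 + r, z)


def limiting_spectrum(j_max, a, alpha, mu):
    """psi_1, ..., psi_{j_max} from formula (10)."""
    lam, _, _ = bd_constants(a, alpha, mu)
    denom = 1.0 - lam * G(0, a, alpha, mu)
    return [lam * G(j, a, alpha, mu) / denom for j in range(1, j_max + 1)]


# Parameters quoted for Figure 1: a = 1, alpha = 0.25, mu = 1e-4.
psi = limiting_spectrum(15, a=1.0, alpha=0.25, mu=1e-4)
for j, value in enumerate(psi, start=1):
    print(j, value)
```

Since $\sum_{j\ge 1}\lambda G_j=1-\lambda G_0$, the values $\psi_j$ sum to 1 over all $j$, which gives a simple consistency check on any implementation.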
Figure 1 shows an example of the limiting mean frequency spectrum for the birth and death process with parameters $a=1$, $\alpha=0.25$, and $\mu=10^{-4}$, based on formula (10). To see how the spectrum varies with the parameter settings, we plot in Figure 2(a) the 3-D surface of a major component of the spectrum, $\psi_1$, for different values of $\alpha$ and $\mu$. Figures 2(b) and 2(c) illustrate the effect of one parameter on $\psi_1$ given a fixed value of the other parameter.
Figure 1: Limiting mean frequency spectrum of the infinite-allele birth and death process with $a=1$, $\alpha=0.25$, and $\mu=10^{-4}$.
Figure 2: (a) Surface of $\psi_1$ at different $\alpha$ and $\mu$, for the infinite-allele birth and death process. (b) Relation between $\psi_1$ and $\mu$ when fixing $\alpha$. (c) Relation between $\psi_1$ and $\alpha$ when fixing $\mu$.
We see that for fixed $\alpha$, increasing $\mu$ causes an increase of $\psi_1$. This can be explained intuitively by the offspring pgf $h(s)$ of the like-type clone. From the expression $h(s)=\alpha+\beta[\mu+(1-\mu)s]^2$, the probability of obtaining exactly one like-type individual among the offspring is $2(1-\alpha)\mu(1-\mu)$, which is an increasing function of $\mu$ for a given $\alpha$ under the constraint $(1-\alpha)(1-\mu)>1/2$. Therefore, increasing $\mu$ eventually leads to an increase of $\psi_1$. The effect of $\alpha$ on $\psi_1$ for fixed $\mu$ is less obvious, but when $\mu$ is fixed very close to 0 and $\alpha$ approaches $1/2$, the process is approximately a critical binary fission process; therefore, $\psi_1$ drops because of almost sure extinction of the process, as seen from the tail behavior of the solid thick line in Figure 2(c).
Arguably, the frequency spectrum can only be observed in finite time. The finite-time mean frequency spectrum can be obtained by computing $G_j(t)=\int_0^t e^{-\lambda x}q_{1j}(x)\,dx$, $j\ge 0$, numerically. For the birth and death process, this involves the computation of the incomplete hypergeometric function. A natural question in this context is how long the process history should be for the limiting mean frequency spectrum to be used safely. Figure 3(a) compares the limiting mean frequency spectrum with some long-term mean frequency spectra for the birth and death process with parameters $a=1$, $\alpha=0.25$, and $\mu=10^{-4}$. Under this setting, the long-term mean frequency spectrum is almost identical to the limiting mean frequency spectrum when $t\ge 28$. In general, this threshold depends strongly on the parameters $a$, $\alpha$, and $\mu$; for example, a smaller $\mu$ requires a longer $t$. This gives some intuition about how large $t$ must be for the limiting mean frequency spectrum to be a good approximation. Figure 3(b) illustrates the difference between the finite-time and the limiting mean frequency spectrum as a function of $t$, for large $t\in[15,35]$ and for $j=1,2$, where lines represent the true difference and markers represent the asymptotic approximation by formula (5). To emphasize the agreement for large $t$, this figure is plotted in semilogarithmic scale. The true difference drops exponentially fast, and the asymptotic approximation is good for large $t$.
Figure 3: Comparison between the finite-time mean frequency spectrum and the limiting mean frequency spectrum of the infinite-allele birth and death process. (a) $\psi_{1j}(t)$ and $\psi_j$ for $1\le j\le 15$, $t=15,20,28$ and $\infty$. (b) Difference between $\psi_{1j}(t)$ and $\psi_j$ as a function of $t$, $t\in[15,35]$, for $j=1,2$. Lines represent the true difference and markers represent asymptotic approximations.
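The comparison described above can be reproduced in outline with the sketch below, which evaluates $\psi_{1j}(t)$ from (1) and (2) by numerical quadrature and sets it against the asymptotic expression (5). This is an assumption-laden illustration rather than the computation behind Figure 3: it uses the closed form of $q_{1j}(t)$ derived in Appendix A, a plain composite Simpson rule instead of the incomplete hypergeometric function, a finite cutoff `T_INF` in place of the infinite upper limit, and hypothetical function names throughout.

```python
import math

# Parameters quoted for Figure 3 (a = 1, alpha = 0.25, mu = 1e-4); start from i = 1.
A_RATE, ALPHA, MU = 1.0, 0.25, 1e-4
BETA = 1.0 - ALPHA
M = 2.0 * BETA                      # offspring mean m
LAM = A_RATE * (M - 1.0)            # Malthusian parameter lambda
A2 = ALPHA + BETA * MU**2           # A^2 for h(s) in (6)
B2 = BETA * (1.0 - MU)**2           # B^2 for h(s) in (6)
C = A_RATE * (B2 - A2)              # Malthusian parameter c of a clone
T_INF = 200.0                       # assumed cutoff standing in for infinity


def q1(j, t):
    """q_{1j}(t) from (A.5) and (A.6)."""
    e = math.exp(-C * t)
    d = B2 - A2 * e
    if j == 0:
        return A2 * (1.0 - e) / d
    return (B2 - A2)**2 * (B2 * (1.0 - e))**(j - 1) * e / d**(j + 1)


def simpson(f, lo, hi, n=20000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    odd = sum(f(lo + k * h) for k in range(1, n, 2))
    even = sum(f(lo + k * h) for k in range(2, n, 2))
    return (f(lo) + 4.0 * odd + 2.0 * even + f(hi)) * h / 3.0


def G(j, t=T_INF):
    """G_j(t) = int_0^t exp(-lambda x) q_{1j}(x) dx; t = T_INF approximates G_j."""
    return simpson(lambda x: math.exp(-LAM * x) * q1(j, x), 0.0, t)


def psi_finite(j, t):
    """psi_{1j}(t) = phi_{1,t}(j) / E_1[K_t], from (1) and (2) with i = 1."""
    growth = A_RATE * M * MU * math.exp(LAM * t)
    num = q1(j, t) + growth * G(j, t)
    den = 1.0 - q1(0, t) + growth * ((1.0 - math.exp(-LAM * t)) / LAM - G(0, t))
    return num / den


def psi_limit(j):
    """Limiting spectrum from (4)."""
    return LAM * G(j) / (1.0 - LAM * G(0))


def asymptotic_gap(j, t):
    """Right-hand side of (5) with i = 1."""
    q_inf = A2 / B2                 # q_10(infinity), the limit of (A.5)
    bracket = 1.0 - q_inf - LAM * (1.0 - q_inf) / (A_RATE * M * MU)
    return LAM * G(j) * bracket / (1.0 - LAM * G(0))**2 * math.exp(-LAM * t)


for t in (20.0, 28.0, 35.0):
    print(t, psi_finite(1, t) - psi_limit(1), asymptotic_gap(1, t))
```

The two printed columns correspond to the lines and markers of Figure 3(b) for $j=1$; other values of $j$ or of the starting size would require the corresponding $q_{ij}(t)$.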
Given the observed long-term mean frequency spectrum, the parameters θ of the IAMBP, such as α,μ in the birth and death process, can be estimated by equating the observed long-term mean frequency spectrum ψobs from the sample to the expected limiting mean frequency spectrum ψexp from formula (3) and solving for the process parameters. In the case of the birth and death process, we may estimate α and μ for example by solving
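$$\begin{aligned}
\frac{\Gamma(j_1)\,\Gamma(k_1+1+\lambda/c)\,F\!\left(j_1+1,1+\lambda/c;\,j_1+1+\lambda/c;\,A^2/B^2\right)}{\Gamma(k_1)\,\Gamma(j_1+1+\lambda/c)\,F\!\left(k_1+1,1+\lambda/c;\,k_1+1+\lambda/c;\,A^2/B^2\right)}&=\frac{\psi_{\mathrm{obs}}(j_1)}{\psi_{\mathrm{obs}}(k_1)},\\
\frac{\Gamma(j_2)\,\Gamma(k_2+1+\lambda/c)\,F\!\left(j_2+1,1+\lambda/c;\,j_2+1+\lambda/c;\,A^2/B^2\right)}{\Gamma(k_2)\,\Gamma(j_2+1+\lambda/c)\,F\!\left(k_2+1,1+\lambda/c;\,k_2+1+\lambda/c;\,A^2/B^2\right)}&=\frac{\psi_{\mathrm{obs}}(j_2)}{\psi_{\mathrm{obs}}(k_2)}
\end{aligned} \tag{11}$$
for positive integers $j_1\ne k_1$, $j_2\ne k_2$, where $\lambda/c$ and $A^2/B^2$ are both functions of $\alpha$ and $\mu$.
There is no explicit solution for such an estimator, but a numerical search according to some criterion is feasible. Another possibility is to minimize the distance (such as the $l_2$ norm) between the observed long-term mean frequency spectrum and the expected limiting mean frequency spectrum, that is, $\hat\theta=\arg\min_\theta\|\psi_{\mathrm{obs}}-\psi_{\mathrm{exp}}(\theta)\|_2$.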
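A minimal sketch of the distance-minimizing estimator is given below. It assumes the parameterization (6), replaces a proper optimizer by an exhaustive search over a small hypothetical grid, and uses a noise-free synthetic "observed" spectrum generated from formula (10); the names `limiting_spectrum` and `fit_by_grid` are illustrative. Because $\psi_j$ depends on the parameters only through $\lambda/c$ and $A^2/B^2$, the rate $a$ cancels and is not estimated.

```python
import math


def hyp2f1(a, b, c, z, tol=1e-15, max_terms=100_000):
    """Gauss hypergeometric function via its power series (requires |z| < 1)."""
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def limiting_spectrum(alpha, mu, j_max):
    """psi_1..psi_{j_max} from (10) for h(s) = alpha + beta*(mu + (1-mu)*s)**2."""
    beta = 1.0 - alpha
    A2 = alpha + beta * mu**2
    B2 = beta * (1.0 - mu)**2
    r = (2.0 * beta - 1.0) / (B2 - A2)       # lambda / c (the rate a cancels)
    z = A2 / B2
    # lambda * G_0; Gamma(2) = 1 is omitted.
    lam_G0 = r * z * math.exp(math.lgamma(r) - math.lgamma(2.0 + r)) \
        * hyp2f1(1.0, r, 2.0 + r, z)
    psi = []
    for j in range(1, j_max + 1):
        coef = math.exp(math.lgamma(1.0 + r) + math.lgamma(j)
                        - math.lgamma(j + 1.0 + r))
        lam_Gj = r * (1.0 - z)**2 * coef * hyp2f1(j + 1.0, 1.0 + r, j + 1.0 + r, z)
        psi.append(lam_Gj / (1.0 - lam_G0))
    return psi


def fit_by_grid(psi_obs, alphas, mus):
    """Return the grid point (alpha, mu) with the smallest squared l2 distance."""
    best, best_dist = None, math.inf
    for alpha in alphas:
        for mu in mus:
            if (1.0 - alpha) * (1.0 - mu) <= 0.5:    # constraint (7)
                continue
            psi_exp = limiting_spectrum(alpha, mu, len(psi_obs))
            dist = sum((o - e)**2 for o, e in zip(psi_obs, psi_exp))
            if dist < best_dist:
                best, best_dist = (alpha, mu), dist
    return best, best_dist


# Synthetic, noise-free "observation" generated at alpha = 0.25, mu = 0.01.
psi_obs = limiting_spectrum(0.25, 0.01, 10)
alphas = [k / 20 for k in range(1, 10)]          # 0.05, 0.10, ..., 0.45
mus = [0.005, 0.01, 0.02, 0.05, 0.1]
estimate, distance = fit_by_grid(psi_obs, alphas, mus)
print(estimate, distance)
```

With real data the observed spectrum is noisy and of finite length, so a finer grid or a continuous optimizer would be needed, and the minimum distance would not be zero.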
The estimated parameters can be used to check the goodness of fit of the IAMBP model. Another interesting problem is to test whether two sets of parameters are identical, given two observed mean frequency spectra. A simple approach is to use Pearson’s $\chi^2$ test, as in Kimmel and Mathaes [4]. However, there may be restrictions on applying the $\chi^2$ test, such as small cell counts and inappropriateness due to the finite length of the observed spectrum. This motivates us to develop an interval estimator for the IAMBP parameters.
3. Variance of the Frequency Spectrum
Moment estimators based on the mean frequency spectrum only give point estimates of the process parameters. In order to quantify the uncertainty of point estimates, an interval estimator is needed, which requires more information about the distribution of the statistic αt(j). First, it can be seen that [2]
$$\alpha_t(j)=I_{0,j}(t)+\sum_{n=1}^{N_t}\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n), \tag{12}$$
where T1,T2,… are the successive split times of the process, I0,j(t), In,k,j(t) are two indicators, and I0,j(t)=1 if there are j individuals alive at time t carrying the parental allele, and In,k,j(t)=1, for n,k≥1 if the kth individual born at time Tn (Tn<t) mutates to a novel allelic type and further produces j individuals carrying this allele t time units later. Nt is the number of split times in (0,t], and Un is the number of offspring produced at time Tn. Obtaining the distribution of αt(j) is not elementary. However, it may still be possible to define a confidence interval (CI) based on the first and second moments of αt(j).
Let ηi,t(j)=Vari(αt(j)) be the variance frequency spectrum; by the law of total variance and independence between the indicators in the expression of αt(j) (details in Appendix B), we have
$$\eta_{i,t}(j)=q_{ij}(t)[1-q_{ij}(t)]+im^2\mu^2\left[C(t)+(\lambda+a)e^{\lambda t}\int_0^t e^{-\lambda x}C(x)\,dx\right]+iam\mu e^{\lambda t}\int_0^t e^{-\lambda x}q_{1j}(x)\,dx+ia(\sigma^2-m)\mu^2e^{\lambda t}\int_0^t e^{-\lambda x}q_{1j}^2(x)\,dx, \tag{13}$$
where
$$C(t)=a\int_0^t\left[q_{1j}^2(x)+(\sigma^2+m^2)\beta_1^2(x)\right]e^{-a(t-x)}\,dx-\left[a\int_0^t q_{1j}(x)e^{-a(t-x)}\,dx\right]^2-\left[am\int_0^t\beta_1(x)e^{-a(t-x)}\,dx\right]^2. \tag{14}$$
In Expression (14), $\beta_1(x)=ae^{\lambda x}\int_0^x e^{-\lambda u}q_{1j}(u)\,du$, and $\sigma^2$ is the variance of the offspring distribution, regardless of the allelic types.
Similarly to Expression (3), we may define a limiting variance frequency spectrum $\xi_j=\lim_{t\to\infty}\eta_{i,t}(j)/(E_i[K_t])^2$. Expression (13) is complicated and usually does not assume an explicit form, even for the special case of the birth and death process. Therefore, we will only give numerical solutions for the finite-time variance frequency spectrum. Figure 4 shows an example of the “2σ”-bands of the finite-time frequency spectrum for the infinite-allele birth and death process with $i=100$, $\alpha=0.25$, $\mu=10^{-4}$, $a=1$, and $t=28$. To emphasize the tail probabilities, we draw this plot in semilogarithmic scale.
Figure 4: “2σ”-bands in semilogarithmic scale of the finite-time ($t=28$) frequency spectrum for the infinite-allele birth and death process with $i=100$, $a=1$, $\alpha=0.25$, and $\mu=10^{-4}$.
From the finite-time variance frequency spectrum, it is possible to define a CI [θl,θu] where the upper and lower bounds can be written as
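$$\arg\min_{\theta}\left\|\psi_{\mathrm{obs}}(t)-\left[\psi_{\mathrm{exp}}(\theta,t)\pm 2\xi_{\mathrm{exp}}(\theta,t)\right]\right\|_2. \tag{15}$$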
This CI is useful for checking model validity and for testing whether two observed mean frequency spectra are from the same IAMBP model.
4. Simulation Study
We perform a simulation study of the birth and death process to illustrate the finite-time mean and variance frequency spectra. First we generate samples (genealogical trees) from an IAMBP with birth and death offspring distribution starting from 100 individuals carrying the same parental allele. Due to memory restrictions caused by forward simulation, we limit our simulations to 12 generations and a relatively large mutation probability μ=0.01. The other parameters of the process are set to be a=1 and α=0.25. At time t=2, we record the number of alleles αt(j) represented by j copies, for j=1,2,…. Repeating the simulation 1000 times, we then obtain the simulated finite-time mean and variance frequency spectra from the replicates.
Figure 5 shows side-by-side bar plots of the simulated and expected finite-time mean and variance frequency spectra, for j=1,…,10. We plot the mean frequency spectrum and the variance frequency spectrum in semi-logarithmic scale to emphasize the tail probabilities. In each bar plot, the first black bar represents the expected finite-time mean frequency spectrum ψj or variance frequency spectrum ξj. The remaining ten white bars represent ten replicates of the simulated finite-time mean or variance frequency spectrum as described above. We see that for some classes, the expected mean or variance frequency spectrum is slightly different from the simulated spectrum. Beside sampling bias, this may be caused by the small scale of the simulations. For small mutation probability μ, we have to set large initial population size i and a long time t to obtain acceptable values of ψi,t(j) and ξi,t(j) from the simulated genealogical trees. We note that if one tries to use a naive method to calculate the variance frequency spectrum, that is, assume the proportions ψj of alleles having j representatives to be the mean of some independent Bernoulli random variables (they are not independent) and employ ξj=ψj(1-ψj), such method performs much worse than ξj based on the derivation of the variance frequency spectrum.
Figure 5: Comparison between the simulated and expected finite-time ($t=2$) frequency spectra of the infinite-allele birth and death process with $i=100$, $a=1$, $\alpha=0.25$, and $\mu=0.01$. In each class $j$, $1\le j\le 10$, the first black bar represents the expected frequency spectrum and the remaining 10 white bars represent 10 replicates of the simulated frequency spectrum. (a) Mean frequency spectrum $\psi_j$ in semilogarithmic scale; (b) variance frequency spectrum $\xi_j$ in semilogarithmic scale.
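The forward simulation described in this section can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' program: it uses a Gillespie-type event loop for the parameterization (6), in which a dying individual leaves no offspring with probability $\alpha$ and two offspring otherwise, each mutating independently with probability $\mu$. The function names and the random seed are hypothetical.

```python
import random
from collections import Counter


def simulate_spectrum(i, a, alpha, mu, t_end, rng):
    """One forward run; returns Counter {j: number of alleles with j copies at t_end}."""
    pop = [0] * i                  # allele label carried by each living individual
    next_allele = 1                # label to assign to the next new mutant allele
    t = 0.0
    while pop:
        # Each individual dies at rate a, so the next death occurs at rate a * len(pop).
        t += rng.expovariate(a * len(pop))
        if t > t_end:
            break
        k = rng.randrange(len(pop))
        parent = pop[k]
        pop[k] = pop[-1]           # remove the dying individual in O(1)
        pop.pop()
        if rng.random() >= alpha:  # with probability beta it splits into two offspring
            for _ in range(2):
                if rng.random() < mu:
                    pop.append(next_allele)
                    next_allele += 1
                else:
                    pop.append(parent)
    copies_per_allele = Counter(pop)
    return Counter(copies_per_allele.values())


def mean_var_spectrum(n_rep, j_max, seed=0, **params):
    """Sample mean and unbiased sample variance of alpha_t(j), j = 1..j_max."""
    rng = random.Random(seed)
    runs = [simulate_spectrum(rng=rng, **params) for _ in range(n_rep)]
    means, variances = [], []
    for j in range(1, j_max + 1):
        xs = [run[j] for run in runs]          # a Counter returns 0 for absent classes
        m = sum(xs) / n_rep
        means.append(m)
        variances.append(sum((x - m)**2 for x in xs) / (n_rep - 1))
    return means, variances


means, variances = mean_var_spectrum(
    1000, 10, seed=0, i=100, a=1.0, alpha=0.25, mu=0.01, t_end=2.0)
for j, (m, v) in enumerate(zip(means, variances), start=1):
    print(j, m, v)
```

The sketch stores only the current population rather than whole genealogical trees, so its memory use grows with the population size at time $t$ and not with the number of past events.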
5. Summary
In this paper, we consider the frequency spectrum of the IAMBP of Pakes [2]. We develop an explicit expression for the limiting mean frequency spectrum for the special case of the birth and death process, which can be stated in terms of the hypergeometric function. We also derive an asymptotic expression for the rate of convergence of the finite-time mean frequency spectrum to the limiting mean frequency spectrum and illustrate the convergence using the birth and death process. We further state and prove a theorem concerning the variance frequency spectrum of the IAMBP, which helps to quantify uncertainty in parameter estimation and hypothesis testing. We illustrate the results using simulations of the birth and death process case.
As noted in the introduction, the frequency spectrum is similar to the Ewens’ sampling formula [5] in population genetics, since they both concern the count or frequency spectrum based on an infinite-allele model under neutral selection. However, they differ in several aspects. (1) Our frequency spectrum describes population property under a branching process, whereas the Ewens’ sampling formula describes allelic class count probabilities caused by a sampling procedure and further requires the sample size n to be small compared to the size of the whole population which is assumed constant. (2) Our results concerning the frequency spectrum only provide the first and second moments and not the distribution function of the proportion of alleles having a given number of copies at a specified time point, whereas the Ewens’ sampling formula gives the joint probability of all allelic classes. We also note that the Poisson-Dirichlet process [8] is usually used to describe the equilibrium behavior of the neutral infinite-allele model. Study of the relation between variations of the frequency spectrum under different models is of our future interest.
The question of validity of the Wright-Fisher and Moran models of population genetics [3], as compared to stochastic population processes such as the IAMBP or O’Connell process [9], has importance for estimation of parameters based on genetic data. As an example, Cyran and Kimmel [10] compared estimates of the age of the Mitochondrial Eve based on the various versions of the Wright-Fisher model with those based on various branching process models. The outcomes showed differences of about 10–15%.
Appendices
A. Derivation of $G_j$, $j\ge 0$, for the Birth and Death Process
For the birth and death process, the offspring pgf of a clone of like-type individuals assumes either the form $h_1(s)=\alpha+\beta[\mu+(1-\mu)s]^2$ or the form $h_2(s)=[\alpha+\beta\mu+\beta(1-\mu)s]^2$. In both cases, the backward Kolmogorov equation gives a unified expression for the process pgf $F(s,t)$:
$$\frac{\partial F}{\partial t}=a\left[B^2F^2-(A^2+B^2)F+A^2\right]=a(A^2+B^2)\left[\frac{1}{A^2+B^2}\left(A^2+B^2F^2\right)-F\right], \tag{A.1}$$
under the different parameterizations, where for the offspring pgf $h_1$, $A^2=\alpha+\beta\mu^2$ and $B^2=\beta(1-\mu)^2$, whereas for the offspring pgf $h_2$, $A^2=(\alpha+\beta\mu)^2$ and $B^2=\beta^2(1-\mu)^2$. We see that this process is equivalent to a birth and death process with $\tilde a=a(A^2+B^2)$ and offspring pgf $\tilde h(s)=(A^2+B^2s^2)/(A^2+B^2)$. Using the known result for the birth and death process pgf [6], we obtain
$$F(s,t)=\frac{A^2(1-s)-(A^2-B^2s)e^{-ct}}{B^2(1-s)-(A^2-B^2s)e^{-ct}}, \tag{A.2}$$
where $c=a(B^2-A^2)>0$.
To obtain an explicit form for Gj, we may use two approaches. The first approach is to start from finding the pgf of Gj, which then leads to Gj. The second approach is to find q1j(t) directly and then obtain Gj. Both approaches lead to the same result. Here, we give derivation for the second approach only.
From the explicit form of F(s,t), we can directly read q10(t) and q1j(t), j≥1. Consider
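$$\begin{aligned}
F(s,t)&=\frac{A^2(1-e^{-ct})}{B^2-A^2e^{-ct}+B^2(e^{-ct}-1)s}+\frac{(B^2e^{-ct}-A^2)s}{B^2-A^2e^{-ct}+B^2(e^{-ct}-1)s}\\
&=\frac{A^2(1-e^{-ct})}{B^2-A^2e^{-ct}}\cdot\frac{1}{1-\dfrac{B^2(1-e^{-ct})}{B^2-A^2e^{-ct}}\,s}+\frac{B^2e^{-ct}-A^2}{B^2-A^2e^{-ct}}\cdot\frac{s}{1-\dfrac{B^2(1-e^{-ct})}{B^2-A^2e^{-ct}}\,s}.
\end{aligned} \tag{A.3}$$
Letting $w_1=A^2(1-e^{-ct})/[(B^2-A^2)e^{-ct}]$, $w_2=(B^2e^{-ct}-A^2)/[(B^2-A^2)e^{-ct}]$, and $\tilde p=(B^2-A^2)e^{-ct}/(B^2-A^2e^{-ct})$, the above expression becomes
$$F(s,t)=w_1\,\frac{\tilde p}{1-s(1-\tilde p)}+w_2\,\frac{s\tilde p}{1-s(1-\tilde p)}, \tag{A.4}$$
and $w_1+w_2=1$.
This is a mixture of two geometric pgfs with the same parameter $\tilde p$ but different supports: one is supported on the set $\{0,1,\ldots\}$, the other on the set $\{1,2,\ldots\}$. Therefore,
$$\begin{aligned}
q_{10}(t)&=w_1\tilde p=\frac{A^2(1-e^{-ct})}{B^2-A^2e^{-ct}}, &\text{(A.5)}\\
q_{1j}(t)&=w_1\tilde p(1-\tilde p)^j+w_2\tilde p(1-\tilde p)^{j-1}=\frac{(B^2-A^2)^2\left[B^2(1-e^{-ct})\right]^{j-1}e^{-ct}}{(B^2-A^2e^{-ct})^{j+1}},\quad j\ge 1. &\text{(A.6)}
\end{aligned}$$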
Hence,
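$$\begin{aligned}
G_0&=\int_0^\infty e^{-\lambda t}q_{10}(t)\,dt=\int_0^\infty\frac{A^2(1-e^{-ct})}{B^2-A^2e^{-ct}}\,e^{-\lambda t}\,dt=\frac{1}{c}\,\frac{A^2}{B^2}\int_0^1\frac{v^{(\lambda/c)-1}(1-v)}{1-(A^2/B^2)v}\,dv\\
&=\frac{1}{c}\,\frac{A^2}{B^2}\,\frac{\Gamma(\lambda/c)\,\Gamma(2)}{\Gamma(2+\lambda/c)}\,F\!\left(1,\frac{\lambda}{c};2+\frac{\lambda}{c};\frac{A^2}{B^2}\right), &\text{(A.7)}\\
G_j&=\int_0^\infty e^{-\lambda t}q_{1j}(t)\,dt=\int_0^\infty\frac{(B^2-A^2)^2\left[B^2(1-e^{-ct})\right]^{j-1}}{(B^2-A^2e^{-ct})^{j+1}}\,e^{-(\lambda+c)t}\,dt=\frac{(B^2-A^2)^2}{cB^4}\int_0^1\frac{v^{\lambda/c}(1-v)^{j-1}}{\left(1-(A^2/B^2)v\right)^{j+1}}\,dv\\
&=\frac{1}{c}\left(1-\frac{A^2}{B^2}\right)^2\frac{\Gamma(1+\lambda/c)\,\Gamma(j)}{\Gamma(j+1+\lambda/c)}\,F\!\left(j+1,1+\frac{\lambda}{c};j+1+\frac{\lambda}{c};\frac{A^2}{B^2}\right),\quad j\ge 1. &\text{(A.8)}
\end{aligned}$$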
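The mixture representation (A.4) and the closed forms (A.5) and (A.6) can be checked numerically with a short sketch such as the one below. The function names are hypothetical, the parameterization $h_1$ is assumed, and $t>0$ is required so that the weights $w_1$ and $w_2$ are finite.

```python
import math


def clone_constants(a, alpha, mu):
    """Return (A^2, B^2, c) for the offspring pgf h_1."""
    beta = 1.0 - alpha
    A2 = alpha + beta * mu**2
    B2 = beta * (1.0 - mu)**2
    return A2, B2, a * (B2 - A2)


def pmf_mixture(j, t, a, alpha, mu):
    """q_{1j}(t) written as the two-component geometric mixture of (A.4)."""
    A2, B2, c = clone_constants(a, alpha, mu)
    e = math.exp(-c * t)
    p = (B2 - A2) * e / (B2 - A2 * e)          # p-tilde
    w1 = A2 * (1.0 - e) / ((B2 - A2) * e)
    w2 = (B2 * e - A2) / ((B2 - A2) * e)
    if j == 0:
        return w1 * p
    return w1 * p * (1.0 - p)**j + w2 * p * (1.0 - p)**(j - 1)


def pmf_closed(j, t, a, alpha, mu):
    """q_{1j}(t) from the closed forms (A.5) and (A.6)."""
    A2, B2, c = clone_constants(a, alpha, mu)
    e = math.exp(-c * t)
    d = B2 - A2 * e
    if j == 0:
        return A2 * (1.0 - e) / d
    return (B2 - A2)**2 * (B2 * (1.0 - e))**(j - 1) * e / d**(j + 1)


params = dict(t=1.5, a=1.0, alpha=0.25, mu=0.01)
total = sum(pmf_mixture(j, **params) for j in range(400))
print(total)                                    # total probability mass over j = 0..399
print(pmf_mixture(3, **params), pmf_closed(3, **params))
```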
B. Derivation of the Variance Frequency Spectrum for IAMBP
Let T1,T2,… be the successive split times, let Nt be the number of split times till time t, and let Un be the number of offspring produced at split time Tn. Consider that at time t, the alleles which are represented by j individuals are from two sources: the initial allele or the mutant alleles. Correspondingly, we define two indicator functions. I0,j(t)=1 if there are j individuals carrying the initial allele alive at time t, and In,k,j(t)=1, for n,k≥1 if the kth individual born at time Tn (Tn<t) mutates to a new allelic type and further produces j individuals carrying this allele t units later. Then
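$$\alpha_t(j)=I_{0,j}(t)+\sum_{n=1}^{N_t}\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n). \tag{B.1}$$
For each $n$, $I_{n,k,j}(t)$ is independent of $U_n$ and $T_n$, as well as of $I_{0,j}(t)$, and it can be seen that $E_i[I_{0,j}(t)]=q_{ij}(t)$, $\mathrm{Var}_i(I_{0,j}(t))=q_{ij}(t)[1-q_{ij}(t)]$, $E[I_{n,k,j}(t)\mid U_n,T_n]=\mu q_{1j}(t)$, and $\mathrm{Var}(I_{n,k,j}(t)\mid U_n,T_n)=\mu q_{1j}(t)[1-\mu q_{1j}(t)]$. By the law of total variance, the variance frequency spectrum takes the form
$$\begin{aligned}
\eta_{i,t}(j)&=\mathrm{Var}_i(\alpha_t(j))=\mathrm{Var}_i(I_{0,j}(t))+\mathrm{Var}_i\!\left(\sum_{n=1}^{N_t}\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n)\right)\\
&=q_{ij}(t)[1-q_{ij}(t)]+\mathrm{Var}_i\!\left(E\!\left[\sum_{n=1}^{N_t}\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n)\,\Big|\,N_t\right]\right)+E_i\!\left[\mathrm{Var}\!\left(\sum_{n=1}^{N_t}\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n)\,\Big|\,N_t\right)\right].
\end{aligned} \tag{B.2}$$
For the second term, we know from independence among the indicator functions conditional on $N_t$ that
$$\mathrm{Var}_i\!\left(E\!\left[\sum_{n=1}^{N_t}\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n)\,\Big|\,N_t\right]\right)=m^2\mu^2\,\mathrm{Var}_i\!\left(\sum_{n=1}^{N_t}q_{1j}(t-T_n)\right). \tag{B.3}$$
The variance on the right-hand side can be obtained from the following theorem, which is an analogue of Lemma 3.1.1 in [2], namely $\beta_i(t)=E_i\!\left[\sum_{n=1}^{N_t}\alpha(t-T_n)\right]=iae^{\lambda t}\int_0^t e^{-\lambda x}\alpha(x)\,dx$.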
Theorem 1.
Let α(t) be a bounded continuous function. Then
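$$\gamma_i(t)=\mathrm{Var}_i\!\left(\sum_{n=1}^{N_t}\alpha(t-T_n)\right)=iC(t)+i(\lambda+a)e^{\lambda t}\int_0^t e^{-\lambda x}C(x)\,dx, \tag{B.4}$$
where
$$C(t)=a\int_0^t\left[\alpha^2(x)+(\sigma^2+m^2)\beta_1^2(x)\right]e^{-a(t-x)}\,dx-\left[a\int_0^t\alpha(x)e^{-a(t-x)}\,dx\right]^2-\left[am\int_0^t\beta_1(x)e^{-a(t-x)}\,dx\right]^2. \tag{B.5}$$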
Proof.
By independence of family lines, γi(t)=iγ1(t). By the law of total variance,
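$$\begin{aligned}
\gamma_1(t)&=\mathrm{Var}_1\!\left(\sum_{n=1}^{N_t}\alpha(t-T_n)\right)\\
&=\mathrm{Var}\!\left(E\!\left[\sum_{n=1}^{N_t}\alpha(t-T_n)\,\Big|\,T_1,U_1\right]\right)+E\!\left[\mathrm{Var}\!\left(\sum_{n=1}^{N_t}\alpha(t-T_n)\,\Big|\,T_1,U_1\right)\right]\\
&=\mathrm{Var}\!\left(\alpha(t-T_1)+U_1E\!\left[\sum_{n=1}^{N_t'}\alpha(t-T_1-T_n')\,\Big|\,T_1\right]\right)+E\!\left[U_1\mathrm{Var}\!\left(\sum_{n=1}^{N_t'}\alpha(t-T_1-T_n')\,\Big|\,T_1\right)\right],
\end{aligned} \tag{B.6}$$
where $T_n'=T_n-T_1$ and $N_t'$ is the number of split times in $(T_1,t]$. The right-hand side can be further written as
$$\begin{aligned}
&E[\alpha^2(t-T_1)]-E^2[\alpha(t-T_1)]+(\sigma^2+m^2)E[\beta_1^2(t-T_1)]-m^2E^2[\beta_1(t-T_1)]+mE[\gamma_1(t-T_1)]\\
&\quad=a\int_0^t\left[\alpha^2(t-x)+(\sigma^2+m^2)\beta_1^2(t-x)+m\gamma_1(t-x)\right]e^{-ax}\,dx-\left[a\int_0^t\alpha(t-x)e^{-ax}\,dx\right]^2-\left[am\int_0^t\beta_1(t-x)e^{-ax}\,dx\right]^2\\
&\quad=am\int_0^t\gamma_1(t-x)e^{-ax}\,dx+C(t).
\end{aligned} \tag{B.7}$$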
Differentiating both sides and solving the resulting differential equation, we obtain
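$$\gamma_1(t)=e^{\lambda t}\int_0^t e^{-\lambda x}\left[C'(x)+aC(x)\right]dx=C(t)+(\lambda+a)e^{\lambda t}\int_0^t e^{-\lambda x}C(x)\,dx. \tag{B.8}$$
Replacing $\alpha(x)$ by $q_{1j}(x)$ and replacing $\beta_1(x)$ by $ae^{\lambda x}\int_0^x e^{-\lambda u}q_{1j}(u)\,du$, we see that (B.3) becomes
$$\mathrm{Var}_i\!\left(E\!\left[\sum_{n=1}^{N_t}\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n)\,\Big|\,N_t\right]\right)=im^2\mu^2\left[C(t)+(\lambda+a)e^{\lambda t}\int_0^t e^{-\lambda x}C(x)\,dx\right]. \tag{B.9}$$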
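The step from (B.7) to (B.8) can be spelled out as follows. This is a sketch that assumes $C$ is differentiable; it uses the substitution $u=t-x$ in (B.7), $\lambda=a(m-1)$, and $C(0)=0$, which follows from (B.5).

```latex
\begin{align*}
\gamma_1(t) &= am\,e^{-at}\int_0^t \gamma_1(u)\,e^{au}\,du + C(t), \\
\gamma_1'(t) &= -a\bigl[\gamma_1(t) - C(t)\bigr] + am\,\gamma_1(t) + C'(t)
              = \lambda\,\gamma_1(t) + C'(t) + a\,C(t), \qquad \gamma_1(0) = 0, \\
\gamma_1(t) &= e^{\lambda t}\int_0^t e^{-\lambda x}\bigl[C'(x) + a\,C(x)\bigr]\,dx, \\
\int_0^t e^{-\lambda x}\,C'(x)\,dx &= e^{-\lambda t}\,C(t)
   + \lambda\int_0^t e^{-\lambda x}\,C(x)\,dx .
\end{align*}
```

Substituting the last identity, obtained by integration by parts, into the third line gives the second form of $\gamma_1(t)$.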
For the third term, we have the expression inside the expectation as
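$$\begin{aligned}
\sum_{n=1}^{N_t}\mathrm{Var}\!\left(\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n)\right)&=\sum_{n=1}^{N_t}\left\{\mathrm{Var}\!\left(E\!\left[\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n)\,\Big|\,U_n,T_n\right]\right)+E\!\left[\mathrm{Var}\!\left(\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n)\,\Big|\,U_n,T_n\right)\right]\right\}\\
&=\sum_{n=1}^{N_t}\left[m\mu q_{1j}(t-T_n)+(\sigma^2-m)\mu^2q_{1j}^2(t-T_n)\right],
\end{aligned} \tag{B.10}$$
where $\sigma^2$ is the variance of the offspring distribution, regardless of the allelic types. Therefore,
$$\begin{aligned}
E_i\!\left[\sum_{n=1}^{N_t}\mathrm{Var}\!\left(\sum_{k=1}^{U_n}I_{n,k,j}(t-T_n)\right)\right]&=m\mu E_i\!\left[\sum_{n=1}^{N_t}q_{1j}(t-T_n)\right]+(\sigma^2-m)\mu^2E_i\!\left[\sum_{n=1}^{N_t}q_{1j}^2(t-T_n)\right]\\
&=iam\mu e^{\lambda t}\int_0^t e^{-\lambda x}q_{1j}(x)\,dx+ia(\sigma^2-m)\mu^2e^{\lambda t}\int_0^t e^{-\lambda x}q_{1j}^2(x)\,dx.
\end{aligned} \tag{B.11}$$
By (B.2), (B.9), and (B.11), the final expression of $\eta_{i,t}(j)$ is then
$$\eta_{i,t}(j)=q_{ij}(t)[1-q_{ij}(t)]+im^2\mu^2\left[C(t)+(\lambda+a)e^{\lambda t}\int_0^t e^{-\lambda x}C(x)\,dx\right]+iam\mu e^{\lambda t}\int_0^t e^{-\lambda x}q_{1j}(x)\,dx+ia(\sigma^2-m)\mu^2e^{\lambda t}\int_0^t e^{-\lambda x}q_{1j}^2(x)\,dx. \tag{B.12}$$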
Acknowledgments
The authors thank an anonymous referee for suggesting an asymptotic expression, which, following some revisions, became (5). Part of M. Kimmel’s work was carried out in Fall 2011 while he was visiting the Institute of Advanced Study at Warwick University, supported by a grant from IAS and EPSCR. His work was also supported by the Polish NCN Grant NN519579938.
References
[1] R. C. Griffiths and A. G. Pakes, “An infinite-alleles version of the simple branching process,” 1988, vol. 20, no. 3, pp. 489–524. DOI 10.2307/1427033. MR955502. ZBL0653.92009.
[2] A. G. Pakes, “An infinite alleles version of the Markov branching process,” 1989, vol. 46, no. 1, pp. 146–169. DOI 10.1017/S1446788700030445. MR966291. ZBL0664.92012.
[3] W. J. Ewens, 1979, vol. 9, Springer, New York, NY, USA. MR554616.
[4] M. Kimmel and M. Mathaes, “Modeling neutral evolution of Alu elements using a branching process,” 2010, vol. 11, supplement 1, article S11. DOI 10.1186/1471-2164-11-S1-S11. Scopus 2-s2.0-76749131135.
[5] W. J. Ewens, “The sampling theory of selectively neutral alleles,” 1972, vol. 3, pp. 87–112. MR0325177. ZBL0245.92009.
[6] K. B. Athreya and P. E. Ney, 1972, Springer, Berlin, Germany. MR0373040.
[7] M. Abramowitz and I. A. Stegun, 1972, Dover, New York, NY, USA. MR0208797.
[8] J. F. C. Kingman, S. J. Taylor, A. G. Hawkes, A. M. Walker, D. R. Cox, A. F. M. Smith, B. M. Hill, P. J. Burville, and T. Leonard, “Random discrete distribution,” 1975, vol. 37, pp. 1–22. MR0368264.
[9] N. O'Connell, “The genealogy of branching processes and the age of our most recent common ancestor,” 1995, vol. 27, no. 2, pp. 418–442. DOI 10.2307/1427834. MR1334822. ZBL0837.60080.
[10] K. A. Cyran and M. Kimmel, “Alternatives to the Wright-Fisher model: the robustness of mitochondrial Eve dating,” 2010, vol. 78, no. 3, pp. 165–172. DOI 10.1016/j.tpb.2010.06.001. Scopus 2-s2.0-77956876510.