A Tutorial of the Poisson Random Field Model in Population Genetics

Population genetics is the study of allele frequency changes driven by various evolutionary forces such as mutation, natural selection, and random genetic drift. Although natural selection is widely recognized as a bona-fide phenomenon, the extent to which it drives evolution continues to remain unclear and controversial. Various qualitative techniques, or so-called “tests of neutrality”, have been introduced to detect signatures of natural selection. A decade and a half ago, Stanley Sawyer and Daniel Hartl provided a mathematical framework, referred to as the Poisson random field (PRF), with which to determine quantitatively the intensity of selection on a particular gene or genomic region. The recent availability of large-scale genetic polymorphism data has sparked widespread interest in genome-wide investigations of natural selection. To that end, the original PRF model is of particular interest for geneticists and evolutionary genomicists. In this article, we will provide a tutorial of the mathematical derivation of the original Sawyer and Hartl PRF model.


Introduction
Selectionists and neutralists have fiercely debated, for the past five decades, the extent to which Darwinian selection has shaped molecular evolution. However, both camps do agree that Darwinian selection is a bona fide natural phenomenon. Therefore, various so-called "tests of neutrality" have been developed to detect natural selection on a particular gene or genomic location (for a review on this topic, see [1]). However, these tests are often qualitative and only provide the directionality of selection. A decade and a half ago, S. Sawyer and D. Hartl provided a mathematical framework with which to determine quantitatively the intensity of selection on a particular gene, which they applied to the Adh locus in the Drosophila genome [2]. This framework is referred to as the Poisson random field (PRF) model. They then further used this framework to analyze codon bias in enteric bacteria [3]. Owing to the recent availability of whole genome sequences and genome-wide human polymorphism data, it has become increasingly tractable to perform genome-wide scans for signatures of selection.
The PRF model has been applied to estimate the intensity of selection on synonymous and nonsynonymous sites throughout mitochondrial and nuclear genomes of a variety of species, including human [4][5][6][7][8][9][10][11][12]. Very recently, due to the advent of high-throughput experimental and computational identification of genomic regulatory elements, there has been an interest to estimate the intensity of natural selection on regulatory mutations. Chen and Rajewsky [13] use the PRF, among other techniques, to provide evidence for purifying selection (even stronger than on nonsynonymous coding sites) on a class of regulatory sites known as microRNA target sites. Due to the potentially wide range of applications of, and opportunities for theoretical extensions to, the PRF model, it is an increasingly important mathematical framework for quantitative geneticists. In this article, we will provide a tutorial of the mathematical derivation of the basic PRF model that was originally developed in [2]. The tutorial will follow the outline provided below: (i) Wright-Fisher model, (ii) diffusion approximation to the Wright-Fisher model, 2 Advances in Bioinformatics (iii) derivation, via diffusion theory, of formulas describing evolutionary processes of interest, (iv) derivation of the PRF using the above-mentioned formulas.
The first three items are discussed in [14], and the last point was originally presented in [2]. In this tutorial, we aim to provide an integrated and comprehensive presentation that is accessible to nonprofessionals or beginners in the field of population genetics. Since the primary purpose is to review mathematical derivations, familiarity with calculus and at least a cursory knowledge of genetics will be helpful for the reader.

The Wright-Fisher Model
The Wright-Fisher (WF) model describes the change in frequency of a single mutation (derived allele) in a population over time. The simplest version of the model makes the following assumptions: (1) nonoverlapping generations, (2) constant population size in each generation, and (3) random mating, and is described as follows. Consider a population of N diploid individuals that has a single polymorphic site with two alleles, one ancestral and one derived. Under this model, the frequency of the derived allele in the current generation is a function of the selection pressure on this allele and the binomial sampling effect with probabilities proportional to the frequency of this allele in the previous generation. The probability, p i j , that there are j genes of the derived allele present at generation G + 1 given i genes of the derived allele present at generation G is given by the following binomial calculation: where Ψ i depends on the relative fitness of the derived allele. Assuming no dominance and no recurrent mutation, where 1 + s is the fitness of the derived allele relative to 1 for the ancestral allele, and x (which is simply i/2N) is the derived allele frequency (daf) in generation G. In the simplest model (no selection and no recurrent mutation), Ψ i is simply x or i/2N. The intuition behind Ψ i is the following. Consider the scenario where both the ancestral and the derived alleles are neutrally evolving (no or negligible selection pressure). In this case, the probability of sampling a gene of the derived allele from the population in generation G is simply the frequency of the derived allele in generation G, i/2N or x. This can be rewritten as x/[x + (1 − x)]. Now, suppose that the derived allele is under some selection, s, meaning that the fitness of the derived allele is 1 + s relative to 1 for the ancestral allele. In this case, genes are sampled according to their relative fitnesses (as in the equation for Ψ i above). Figure 1(a) provides a pictorial representation of the basic Wright-Fisher model.

Diffusion Theory
We define p (t) ki as the probability that a polymorphic site has i genes of the derived allele at time t, given that it had k genes of the derived allele at time 0. p (t) ki satisfies the following: where p i j is given in (1). It is convenient to change notation and write p (t) ki as f (x; p, t), so that the above becomes In this framework, it has been shown to be extremely difficult to explicitly derive formulas for several quantities of evolutionary interest. However, as the size of the population approaches infinity (i.e., N→∞), and assuming that the scaled selection pressure (Ns) and scaled mutation rate (Nμ) remain constant, the discrete Markov process given above can be closely approximated by a continuous-time, continuous-space diffusion process (Figure 1(b)): x is the daf at time t, p is the daf at time 0, and δx is the daf change in time δt. We can perform a Taylor series expansion on both sides in δt and δx to derive the forward Kolmogorov equation: where and a(x) and b(x) depend on the genetic model (e.g., see eq (24). Equation (5) can be represented diagrammatically as in Figure 2. The probability of derived allele frequency x + δx at time t + δt is the product of the probability of moving from p to x in time t and the probability of moving from x to x + δx in time δt, summed over all possible values of x.
The frequency trajectory of a derived allele can also be depicted as in Figure 3, which illustrates that the probability of frequency x at time t + δt is the product of the probability of moving from p to p + δ p in time δt and the probability of moving from p + δ p to x in time t, summed over all possible values of δp. This is formalized as follows: What is the probability of this ?
What is the probability of this ?   We can again perform a Taylor series expansion on both sides to derive the backward Kolmogorov equation: Time Derived allele frequency Figure 3: A diagrammatic intuition for (5) illustrates that the probability of frequency x at time t + δt is the product of the probability of moving from p to p + δ p in time δt and the probability of moving from p + δ p to x in time t, summed over all possible values of δ p.
The forward and backward Kolmogorov equations have played a central role in theoretical population genetics since 1922. For details regarding their derivation, we refer the reader [15,Chapter 4]. Next, we will discuss how they are utilized to derive formulas for various quantities of evolutionary interest (yellow boxes in Figure 4).  In a model where there is two-way recurrent mutation (i.e., there are no absorbing states, either extinction or fixation), stationarity is achieved when the probability of change in the derived allele frequency is no longer dependent on time t. We solve for the stationary distribution, f(x), in the following manner. First, we integrate through the forward Kolmogorov equation with respect to x : where F(x; p, t) is the probability of the derived allele assuming a frequency between 0 and x at time t. Therefore, the derivative of F(x; p, t) with respect to t can be interpreted as the probability flux (change in probability over time) of the diffusion process. The stationary distribution, f(x), can be solved by setting the probability flux equal to zero.

Derivation of Formulas Describing Evolutionary Processes of Interest
Let us now focus on a genetic model that assumes no recurrent mutation (i.e., two absorbing states, one at x = 0 and another at x = 1). As depicted by Figure 4, in such a model, it is possible to determine the probability of extinction (x = 0), the probability of fixation (x = 1), and the mean time until absorption (either at x = 0 or x = 1) by using the Kolmogorov backward equation (Figure 4). It is also possible to derive the mean time until absorption conditioned on always eventually reaching only one of the two states. Since this quantity is not directly applicable to the PRF, we do not review its derivation here, but instead refer the reader to [14].

Probability of Extinction
Using (11), we arrive at an equation parallel to (9): The probability that the derived allele frequency, x, reaches 0 at or before time t follows from (11) and is given by where p is the initial frequency of the derived allele and 0 + indicates 0 + ε, where ε is very small. Replacing F(0 + ; p, t) with P 0 (p, t), (12) can be written as As t→∞, P 0 (p, t) can be interpreted as the probability that extinction ever occurs (independent of time) and can be rewritten in the form P 0 (p). From (14), it is evident that P 0 (p) satisfies the following equation: Solving (15), we arrive at the following: where and where a(z) and b(z) are defined as in (6).
Advances in Bioinformatics 5

Probability of Fixation
The probability that the derived allele frequency, x, reaches 1 at time t follows from (11) and is given by where p is the initial frequency of the derived allele and 1 − indicates 1 − ε, where ε is very small. In (12), F(x; p, t) can be replaced by 1−F(x; p, t) without any loss of generality. Also, by replacing 1 − F(1 − ; p, t) with P 1 (p, t), (12) can be rewritten as By letting t→∞ and solving for P 1 (p), we arrive at the following: where ψ(y) has been defined in (17) and a(z) and b(z) have been defined in (6). The probability of fixation and the probability of extinction must sum to 1. Using (16) and (20), we can verify that this is indeed the case. Consider a genetic model that assumes the presence of selection, but no recurrent mutation, where a(x) = sx(1 − x) and b(x) = x(1 − x)/2N. Starting from (20), we can express the probability of fixation under this genetic model in the following manner:

Mean Time Until Either Extinction or Fixation
We define φ(p, t) to be the density function of the time t at which absorption occurs. The probability that absorption occurs, at either boundary x = 0 or x = 1, by time t, is Furthermore, since absorption must happen by t = ∞, we know that Performing integration by parts, we get the following: Equations (14), (19), and (22) show that φ(p, t) satisfies the following equation: Using (25) and the fact that φ(p, t) approaches 0 faster than t approaches ∞, we can rewrite (24) as After interchanging the order of integration and differentiation we get where t(p) = mean time until absorption and t(p, x)dx is the mean time that the daf spends in the interval (x, x + δx) before absorption occurs. We are interested in the case, where p = 1/2N, since this is the initial frequency of the derived allele. In this case, we are interested only in values of x greater than 1/2N, and for these values we can write and ψ(x) is defined in (17).
Under the simplest genetic model that assumes no selection and no recurrent mutation, we can set s = 0 in (17) and (21) and show that P 1 (p) reduces to p and ψ(y) reduces to 1. It follows from this that (29) can be reduced to Under a genetic model where s / = 0, using γ = 2Ns, (29) can be rewritten as After integrating and simplifying the terms, we obtain 6

Advances in Bioinformatics
Finally, substituting γ = 2Ns and p = 1/2N, and invoking the approximation e −a = (1 − a) for small values of a, t(p, x) reduces approximately to where f (x)dx is a notation common in the literature to represent the expected time for which the population frequency of a derived allele is in the range (x, x + dx) before eventual absorption.

Poisson Random Field Theory
is the expected number of sites in the population with derived allele frequency between x 1 and x 2 (where θ equals 2Nμ, the per-locus mutation rate). The function g(x), for which the full expression is given below, is also referred to in the literature as the limiting, equilibrium, or expected density function for derived allele frequencies.
In a sample of size n, the expected number of sites with i (which ranges from 1 to n − 1) copies of the derived allele is defined as a function of g(x): The intuition behind F(i) is the following. The expected number of polymorphic sites with population daf x that have i copies of the derived allele out of n samples is given by the product of the expected number of sites with population daf x, g(x), and the probability that each of those sites has i copies in the sample, which is given by the binomial calculation in the right-hand side of (36). To determine the expected number of sites with any population daf that have i copies of the derived allele, this product must be integrated over all possible values of x (resulting in F(i) above). Consider the sample data X = (X 1 , X 2 , X 3 , . . . , X n−1 ), where X i is the observed number of sites with i copies of the derived allele out of n. Sawyer and Hartl showed that the number of derived alleles in the entire population at a particular frequency is a PRF with mean density given by (35) [2]. It follows, from the marking theorem on Poisson processes [16], that each random variable X i is an independent Poisson distribution with mean equal to F(i) [2]. This framework allows us to define the probability of observing x i sites that have i copies of the derived allele (and n − i copies of the ancestral allele) as the following: Since the X i 's are independent, the probability of observing X = (X 1 , X 2 , X 3 , . . . , X n−1 ) is given as The likelihood equation above provides a convenient means of estimating the values of the parameters θ and γ. The use of the PRF theory leads directly to a likelihood-ratio test of neutrality. Λ is defined as the ratio of the likelihood value under the maximum likelihood estimate of γ to the likelihood value under the neutral value of γ. It is a standard result that 2 ln Λ is asymptotically chi-square distributed with one degree of freedom [17]. Sawyer and Hartl further extended the PRF model in order to calculate the ratio of expected number of polymorphisms within species to expected number of fixed differences between species.In 1991, McDonald and Kreitman devised a 2-by-2 contingency table test of neutrality that was later named the MK test [18]. In the traditional MK test, a 2-by-2 contingency table is formed in order to compare the number of nonsynonymous and synonymous sites that are polymorphic within a species (RP and SP) and diverged between species (RF and SF) ( Table 1). The central assumption of the MK test is that only nonsynonymous sites may be under selective pressure (i.e., synonymous sites are assumed to be neutrally evolving). If nonsynonymous sites are evolving according to a neutral model, then the expectation is that P n /P s = D n /D s . However, if nonsynonymous sites are under negative selection, then the expectation is that P n /P s > D n /D s , and if under positive selection, then P n /P s < D n /D s . Sawyer and Hartl derived the formulas for the expected values of SP, SF, RP, and RF using their PRF theory [2]. Below are the derivations of each of these formulas. For all of the derivations, assume that the data consists of samples of size m and n from two different species.

Expected Number of Synonymous Polymorphic Sites
Under neutral evolution (s = 0), the expected number of polymorphic sites with population daf x can be computed by taking the product of the per-locus mutation rate (θ = 2Nμ) and the probability under a neutral model of a single mutation having a frequency of x (from (30)): Advances in Bioinformatics 7 Now, consider species 1 with sample size m. The probability that a polymorphic site, with population daf equal to x, is detected as polymorphic in a sample of size m is given as The expected number of synonymous polymorphic sites, with population daf x, in the species 1 sample is the product of the expected number of synonymous polymorphic sites with daf x in the population (g neutral (x)) and the fraction of those that are expected to be detected in a sample of size m(P m (x)). It follows then that the total expected number of synonymous polymorphic sites, with any population daf, in the species 1 sample is computed by integrating the product of g neutral (x) and P m (x) over the range of possible values for x: Finally, the total number of expected synonymous polymorphic sites in both species' sample data is given as

Expected Number of Replacement Polymorphic Sites
The derivation of the expected value of RP follows the same logic. As described in (35), the expected number of polymorphic sites with population daf x given some average selection pressure γ is given by g(x). Similar to (41), the total expected number of replacement polymorphic sites in the species 1 sample is computed by integrating the product of g(x) and P m (x) from 0 to 1: Finally, the total expected number of replacement polymorphic sites in both species' sample data is given as

Expected Number of Synonymous Fixed Substitutions
When s = 0, the expected number of fixed substitutions in one species relative to another that diverged t div 2N generations ago is given as the product of the number of total mutations and the probability of fixation of each mutation. The number of total mutations is the product of the mutation rate per generation and the number of generations since divergence is The probability of fixation is given in (21). As s approaches 0 (i.e., neutral evolution), the probability of fixationcan be reduced to p using the approximation e −a = (1 − a) for small values of a. Thus, for a newly derived neutral allele that has an initial frequency of 1/2N, the probability of fixation is also 1/2N. Therefore, the total expected number of fixed substitutions in species 1 is However, given that the data are samples of the populations from both species, not all sites identified as fixed substitutions in the sample are truly fixed substitutions in the entire population. The expected number of sites in the species 1 sample that fall into this category is given by where T m (x) = Pr(a derived allele daf x < 1 is observed with x = 1 in a size m sample) and g neutral (x) is given in (39). Therefore, the total expected number of synonymous fixed substitutions in both species' sample data is given as

Expected Number of Replacement Fixed Substitutions
Similar to the calculation of (46), given some selection pressure, γ, the expected number of fixed substitutions in one species relative to another that diverged t div 2N generations ago is given as the product of (45) and (21):  a) for small values of a, we arrive at the following: However, again, given that the data are samples of the populations from both species, not all sites identified as fixed substitutions in the sample are truly fixed substitutions in the entire population. The expected number of sites in the species 1 sample that fall into this category is given by Therefore, the total expected number of replacement fixed substitutions in both species' sample data is given as

Estimating Parameters
It is possible to obtain estimates of θ and γ by setting each of the observed values SP, RP, SF, and RF (Table 1) to their PRF expectations given by (42), (44), (48), and (52), respectively, and solving for the parameters. It has been shown that these estimates are equivalent to maximumlikelihood estimates [2,19]. Bustamante et al. also eloquently describe and implement a hierarchical Bayesian model for parameter estimation [9].

Concluding Remarks
Sawyer and Hartl's seminal presentation of the PRF in 1992 provided an innovative mathematical framework for estimating selection pressures and mutation rates, which are critical parameters that influence molecular evolution. However, it is worth noting that the model does harbor certain limitations. Foremost among these is the assumption of site independence, which is equivalent to the assumption of free recombination among mutations (i.e., no linkage). Thus, the model may not be appropriate for many data wherein strong linkage is present. Another limitation is the assumption of infinite sites (i.e., each mutation is at a new site). Although this assumption allows for a simpler model, it is not always biologically appropriate, especially for organisms that experience a higher mutation rate. Indeed, recent work has shown that the assumption of infinite sites can underestimate selection pressures and mutation rates and even infer positive selection, when in fact there is weak negative selection [20]. Recent theoretical work has focused on relaxing these and other assumptions of the original PRF model, so as to make it more appropriate for diverse biological contexts. For a brief list of such studies, we refer the reader to [20]. Ongoing theoretical and empirical work in this area will undoubtedly continue to extend the power of a PRF-based approach for population genetic inference.