Mapping Quantitative Trait Loci Using Distorted Markers

Quantitative trait locus (QTL) mapping is usually performed using markers that follow a Mendelian segregation ratio. We developed a new method of QTL mapping that can use markers with segregation distortion (non-Mendelian markers). An EM (expectation-maximization) algorithm is used to estimate QTL and SDL (segregation distortion loci) parameters. The joint analysis of QTL and SDL is particularly useful for selective genotyping. Application of the joint analysis is demonstrated using a real life data from a wheat QTL mapping experiment.


Introduction
Segregation distortion is a phenomenon that the genotypic frequency array of a locus does not follow a typical Mendelian ratio. Depending on the population under investigation, Mendelian ratio of a locus varies from 1 : 1 for a backcross to 1 : 2 : 1 for an F 2 and to 1 : 1 : 1 : 1 for a four-way cross. These ratios hold for codominant markers. For some reasons, a marker may not follow a typical Mendelian ratio. Such a marker is called a distorted marker. For a long period of time, the effects of distorted markers on the result of QTL mapping were not known. For the reason of precaution, people simply discarded all the distorted markers in QTL mapping. Recently, we found that distorted markers can be safely used for QTL mapping with no detrimental effect on the result of QTL mapping [1]. This finding can help QTL mappers save tremendous resources by using all available markers, regardless whether they are Mendelian or not. We also found that if distorted markers are handled properly, they can be beneficial to QTL mapping.
Marker segregation distortion is only a phenomenon. The reason behind the distortion is due to one or more segregation distortion loci (SDL). These loci are subject to gametic selection [2], zygotic selection [3], or both and their (unobservable) distorted segregation causes the observed markers to deviate from the Mendelian ratio. Several investigators [4][5][6][7][8][9][10][11] have attempted to map these segregation distortion loci using molecular markers. It is natural to consider mapping QTL and SDL jointly in the same population.
Agricultural scientists are interested in mapping QTL for economically important traits while evolutionary biologists are interested in mapping SDL that respond to natural selection. Combining the two mapping strategies into one is beneficial to both communities. Performing such a joint mapping strategy is the main objective of this study. Since the theory of segregation distortion has been introduced and discussed in previous studies [7,8] and our own research [1], this study only presents the EM (expectation-maximization) implementation of the statistical method. The variancecovariance matrix of estimated parameters under the EM algorithm is also derived and presented in Appendix A for interested readers.

Methods
We only investigate interval mapping where a model contains a single QTL at a time and the entire genome is scanned through repeated calling of the same program for different locations of the genome. The technical difference between the joint mapping and QTL mapping occurs only in one place. In the traditional interval mapping of QTL, the conditional probabilities of genotypes for a QTL are calculated using flanking marker genotypes with the prior probabilities of QTL genotypes being substituted by the Mendelian ratio. For the joint mapping, the genotypic frequencies (segregation ratios) are treated as unknown parameters that are subject to estimation. We use an F 2 population as an 2 International Journal of Plant Genomics example to demonstrate the method. Extension to other population is discussed subsequently.

The Likelihood of Markers.
Let M and N be the left and right flanking markers bracketing the QTL (denoted by G for short). The interval of the genome carrying the three loci is labeled by a segment MGN. The marker linkage phases are known for line crosses derived from inbred lines. The three genotypes of the QTL are denoted by G 1 G 1 , G 1 G 2 , and G 2 G 2 , respectively. Similar notation also applies to the genotypes of the flanking markers. The interval defined by markers M and N is divided into two segments. Let r 1 and r 2 be the recombination fractions for segments MG and GN, respectively. The joint distribution of the marker genotypes conditional on the QTL genotype can be derived using the Markov chain property under the assumption of no interference between consecutive loci in segregation. Let us denote the three ordered genotypes, G 1 G 1 , G 1 G 2 , and G 2 G 2 , by genotypes 1, 2, and 3, respectively. If individual j takes the κth genotype for the QTL, we denote the event by G j = κ, for all κ = 1, 2, 3. The joint probability of the two markers conditional on the genotype of the QTL is and Pr(N j = ζ | G j = κ) = Γ 2 (κ, ζ). We use Γ i (κ, ξ) to denote the κth row and the ξth column of the following transition matrix: For example, Let ω κ = Pr(G = κ), for all κ = 1, 2, 3, be the probability that a randomly sampled individual from the F 2 family has a genotype κ. We use a generic notation p for probability so that p(G j = κ) represents Pr(G j = κ) and p(M j , N j | G j = κ) stands for Pr(M j , N j | G j = κ). The log likelihood function of the flanking marker genotypes in the F 2 population is where ω T = [ω 1 ω 2 ω 3 ] is a vector of parameters with constraint 3 κ=1 ω κ = 1, where m in L(ω | m) stands for marker information. Note that without any prior information, p(G j = κ) = ω κ , for all j = 1, . . . , n. Under the assumption of Mendelian segregation, However, we treat ω as unknown parameters. Because we are dealing with the genotypic frequencies, the segregation distortion is called zygotic distortion. Segregation distortion due to gametic selection will be discussed later. We postulate that deviation of ω from φ causes a marker linked to locus G to show distorted segregation. This likelihood function is the one used in mapping viability loci [10].

The Likelihood of Phenotypes.
Let y j be the phenotypic value of a quantitative trait measured from individual j. The probability density of y j conditional on the genotype of individual j is normal with mean μ κ = X j β + H κ γ (5) and variance σ 2 , that is, where H κ is the κth row of matrix H and This H matrix can be defined in a different scale, for example, which does not affect the significance test. The advantage of choosing the scale in (7) is that the expectation of the dominance indicator is zero. Vector γ = [a d] T contains the additive and dominance effects. The design matrix X j and the regression coefficients β capture non-QTL effects, for example, field location effects, year effects, and so on. The likelihood function of the phenotypic values in the F 2 population is where letter y in L(β, γ, σ 2 , ω | y) stands for the phenotype. This likelihood function is the one used in segregation analysis of quantitative traits [12] because no marker information is required.
International Journal of Plant Genomics 3

Joint Likelihood of Markers and Phenotypes
T be a vector of all parameters in the joint analysis. The likelihood function is obtained by combining (4) and (9): For QTL mapping under segregation distortion, this log likelihood function is the one to be maximized. The previous two likelihood functions (for markers and for phenotypes) were presented as background information to introduce this joint log likelihood function.

EM Algorithm for the Joint Analysis.
The MLE (maximum likelihood estimate) of the parameters is solved via an EM algorithm [13]. We need to rewrite the likelihood function in a form of complete data. Let us define a delta function as If the genotypes of all individuals are known, that is, given δ(G j , κ) for all j = 1, . . . , n and κ = 1, 2, 3, the complete-data log likelihood is where , The delta variables are missing values. Therefore, we need to take expectation of the log likelihood with respect to δ. The expected complete-data log likelihood function is Note that E δ [L(θ, δ) | θ (t) ] stands for the expectation of L(θ, δ) with respect to δ conditional on parameters at the current state θ (t) and the data (the symbol for data is suppressed for simplicity). The three components of (14) are The first component ψ 0 is a function of θ (t) but not a function of θ. Therefore, it is considered as a constant.

Expectation (E-
Step). The expectation step of the EM algorithm requires computing the expectation of δ conditional on the data and θ (t) . Because δ is a Bernoulli variable, the expectation is simply the probability of δ = 1, that is,

Maximization (M-
Step). The maximization step of the EM algorithm requires taking the partial derivatives of L(θ | θ (t) ) with respect to θ, setting the partial derivatives equal to zero, and solving for the parameters: 4 International Journal of Plant Genomics The solutions of the parameters are  (10) is required. Let us reintroduce this joint log likelihood function using a different notation so that different likelihood ratio tests are easily interpreted. The joint likelihood is rewritten as where γ represents QTL effects and ω stands for non-Mendelian segregation. The null hypothesis for QTL detection is H QTL : γ = 0 while the null hypothesis for detecting segregation distortion is H SDL : ω = φ.

Testing the Presence of QTL. The null hypothesis is
where L S (0, ω) is the log likelihood value under the null model γ = 0, which is where The estimated parameters in (22) are obtained by maximizing the corresponding likelihood functions (see Appendix B for the estimation).

Testing Non-Mendelian Segregation.
The null hypothesis is H SDL : ω = φ. The likelihood ratio test statistic is where Again, the MLEs of the parameters in (24) are obtained by maximizing this likelihood function (see Appendix B for the estimation of parameters under the null model).

Testing Both QTL and SDL.
The null hypothesis is H 0 : γ = 0 and ω = φ. The likelihood ratio test statistic is where The two components of (26) are The parameters involved in these log likelihood functions are estimated using formulas given in Appendix B. This hypothesis is rejected if either γ / = 0 or ω / = φ or both inequalities hold. The QTL effects and the segregation distortion are confounded. This hypothesis test is useful in the following situation. Suppose that, for some reason, we know for sure that the population from which the sample is drawn is a Mendelian population. The sample drawn from this population is selected based on extreme International Journal of Plant Genomics 5 phenotypes (selective genotyping). The sample is then non-Mendelian regarding the QTL that control the trait subject to phenotypic selection. Rejecting this hypothesis is equivalent to rejecting the null hypothesis of QTL. The reason is that segregation distortion in the sample is solely caused by selective genotyping. Therefore, this joint test can be used to detect QTL under selective genotyping.

Applications
This example demonstrates the application of the method to joint mapping of QTL and SDL in a wheat QTL mapping experiment. The experiment was conducted by Dou et al. [14] who made the data available to us for this analysis. A female sterile line XND126 and an elite cultivar Gaocheng 8901 with normal fertility were crossed for genetic analysis of female sterility measured as a ratio of the number of seeded spikelets to the total number of spikelets per plant. The parents and their F 1 and F 2 progeny were planted in the Huaian experimental station in China for the 2006-2007 growing season under the normal autumn sowing condition. The mapping population was the F 2 family consisting of 234 individual plants. The sterility trait was transformed using the angular sine transformation, y = arcsin(x), where x is the phenotypic value expressed as ratio. A total of 28 SSR markers were used in this experiment. These markers covered five chromosomes of the wheat genome with an average genome marker density of 15.5 cM per marker interval. The five chromosomes are only part of the wheat genome. The model for the female sterility is where X j = 1 for all j, β is the intercept, Z j1 = {+1, 0, −1} is the genotype indicator variable for the additive effect γ 1 = a, and Z j2 = {−1, 1, −1} is the genotype indicator variable for the dominance effect γ 2 = d. We used an interval mapping approach to scanning the entire genome. Therefore, the model contains one QTL at a time. With the interval mapping, Z j = [Z j1 Z j2 ] is missing and can take one of three values denoted by H κ for κ = 1, 2, 3 (see definition of H κ ) in Section 2.
The likelihood ratio test statistics were divided by 4.61 to obtain their corresponding LOD scores. The LOD score profiles across the genome are shown in Figure 1. The top panel (a) shows the LOD profile for QTL detection regardless whether there is segregation distortion or not. One major QTL was detected in the second chromosome. This chromosome segregates normally without distortion. Figure 3(b) shows the LOD profile for testing segregation distortion, regardless whether the QTL is present or absent. One major SDL was found on chromosome five. Figure 3(c) is the joint LOD score for both QTL and SDL. We can see that both the major QTL and the SDL were detected. These two major loci have very high LOD scores. We used the quick method of Piepho [15] to calculate the genome wide critical value of the LOD for significance test. We found that the genome wide critical value was slightly less than LOD = 3 criterion (data not shown). Therefore, we set LOD = 3 as  the criterion. In addition to the two major loci, several other regions of the genome also showed significant peaks of the LOD score profile. The estimated parameters are listed in Table 1. Overall we detected eight loci, four QTLs and four SDLs. Among the detected QTL, each explains from 10% to 47% of the phenotypic variance (listed as heritability denoted by h 2 in Table 1). Among the four SDL detected, all showed bias in favor of the XND126 parent, that is, the homozygote of XND126 allele was over represented at the cost of low representation of the other parent. The largest SDL locus is located in chromosome five at position 32.11 cM. The frequency of the heterozygote was close to the Mendelian frequency of 0.5, but the homozygote of Gaocheng 8901 allele was almost wiped out. The estimated genotypic frequencies are plotted against the genome location as shown in Figure 2. The deviation from Mendelian segregation is quite obvious for chromosome five.
One of the major theoretical contributions of this study is the development of the variance-covariance matrix of the 6 International Journal of Plant Genomics  estimated QTL-SDL parameters. The covariances between pairs of estimated parameters are not of interest, but the variances of the estimated parameters are important. We reported the standard errors for two selected loci, locus 4 (QTL) and locus 7 (SDL). These standard errors are listed in Table 2. The standard errors are the square roots of the variances obtained from the EM algorithm. The variance-covariance matrix of the estimated parameters takes the inverse of the information matrix. As a result, they are approximate and biased downwards (Louis 1982). The approximation is close to the true variance only in large samples. We also performed a bootstrap analysis (1000 bootstrap samples) to provide more accurate estimation of the variance. The results of the bootstrap estimates of the standard errors for the two loci are also listed in Table 2 for comparison. The approximate standard errors from the EM algorithm are indeed biased downward, especially for locus 4 (QTL). The approximation is much better for locus 7 (SDL). In practice, the bootstraps method is recommended for obtaining more accurate estimates of the standard errors if the sample size is small.

Discussion
Statistical methods for mapping quantitative trait loci are well developed for Mendelian populations. Methods also are available for mapping viability loci or segregation distortion loci when markers do not segregate in a typical Mendelian ratio [4][5][6][9][10][11]. However, QTL mapping and SDL mapping have never been combined in a single analysis. This study is the first attempt to combine the two seemingly different analyses into a joint one. When QTL and SDL are loosely linked or not linked, the joint analysis does not offer much advantage over the separate analyses. When they do overlap, a phenomenon called pleiotropy, joint mapping does offer some advantage. Unfortunately, the wheat experiment introduced here is not a good example to demonstrate the advantage of joint analysis because the QTL and SDL detected do not overlap.
An obvious situation where the joint analysis can be more powerful is QTL mapping with selective genotyping. In most designed selective genotyping experiments, two groups of extreme phenotypes are selected for genotyping. The power increase under selective genotyping has been demonstrated [16]. Directional (one-tailed) selection is rarely used for selective genotyping because it artificially reduces the variation of the trait and thus reduces the statistical power of QTL detection. However, with the joint analysis, the power can increase under directional selection. Such a directional selection is common in breeding populations. We now use a simulated example to demonstrate this power increase. We simulated 500 F 2 individuals for a single chromosome of 300 cM long. We placed 16 markers evenly over the chromosome with 20 cM per marker interval. A QTL was placed at position 150 cM of the chromosome. The QTL explains 5% of the phenotypic variance with additive effect only. This QTL is considered small or modest. We selected 300 smallest individuals out of the 500 (one-tailed selection) Table 2: Standard errors of the estimated parameters for loci 4 (QTL) and 7 (SDL) of the wheat F 2 mapping population (see Table 1 for detailed information about loci 4 and 7). The StdErr (EM) and StdErr (Boots) represent the standard errors obtained from the EM algorithm and the bootstrap method, respectively.   for mapping. The LOD test statistic profiles are depicted in Figure 3. Figure 3(a) shows the test statistic for QTL regardless whether segregation is distorted or not. The panel in the middle (b) shows the LOD test statistic for segregation distortion, regardless whether a QTL is present or not. The panel at the bottom (c) shows the LOD score profile of the joint analysis where the null model is no QTL and no SDL. If LOD = 3 is the threshold value, neither of the separate analyses is significant. However, the joint analysis has a LOD score as high as 4.5, indicating a significant QTL in the middle of the chromosome. This example clearly demonstrated the advantage of joint analysis over separate analyses.
Segregation distortion may be caused by gametic selection, zygotic selection, or both. Our model was developed under zygotic selection because we are dealing with the genotypic frequencies. However, if the true cause of segregation is gametic selection, we can still detect segregation distortion as long as the gametic selection leads to the genotypic frequencies deviating from the expected Mendelian ratio. A model particularly handling gametic selection has not been developed yet, but it is not difficult. Similar to the zygotic selection model, gametic selection requires known marker linkage phases. Let us take the F 2 population as an example to show the gametic selection model. Denote the frequencies of G 1 and G 2 alleles from the female parent by ν QTL parameters are estimated using the same algorithm as described in the zygotic selection model because they depend only on the genotypic frequencies.
Estimating parameters ν f 1 and ν m 1 requires a modified algorithm. Under the gametic selection model, we can test whether the segregation distortion is caused by the distortion of female gametes, male gametes, or both. This is an interesting topic that deserves further investigation.
The joint analysis developed in this study only applies to line crossing data where the marker linkage phases are known. It cannot be applied to pedigree data analysis. Application of the method to pedigrees warrants further investigation and it is not obvious to us at this moment. However, extension to other line crossing families is possible. We have already extended the method to BC (backcrosses), RIL (recombinant inbred lines), DH (double haploids), and 8 International Journal of Plant Genomics FW (four way crosses) and incorporated them into our QTL mapping program that is described in the next paragraph. The extension also includes dominance markers and missing marker genotypes.
The proposed joint mapping applies to interval mapping only. Extension to multiple QTL/SDL mapping is difficult. However, interval mapping is still the quickest method of QTL mapping, even though multiple QTL mapping programs are available. Compared with traditional QTL interval mapping, this joint analysis involves one additional step of updating the genotypic frequencies. This additional step presents a complication where the conditional genotypic frequencies given flanking marker genotypes cannot be calculated prior to QTL mapping. They must be calculated with the phenotypic values along with the flanking marker genotypes. This complication makes modification of existing QTL mapping programs difficult. Fortunately, we have incorporated the joint QTL/SDL mapping into our QTL mapping program. This program is a SAS procedure called PROC QTL [17]. The METHOD = "ML" option in the PROC QTL statement must be turned on with an additional suboption /DISTORTION to invoke the joint mapping procedure. Without the /DISTORTION option, the ML analysis simply assumes Mendelian segregation. PROC QTL is available on our website (http://www.statgen.ucr.edu/software.html) and users can download the program with no charge. The program is also accompanied with a detailed user manual.

A. Standard Errors of the Estimated Parameters
Let us define the individual-wise complete-data log likelihood for plant j as where ω 3 = 1 − ω 1 − ω 2 so that ω 3 is excluded from the parameter vector. To derive the variance covariance matrix of the estimated parameters, we need to define the score vector S j (θ, δ) and the Hessian matrix H j (θ, δ) for the individual-wise complete-data log likelihood function. The Louis [18] information matrix of the parameters under the EM algorithm is where the expectation and variance are taken with respect to the missing values δ. Once the information matrix is define, the variance matrix of the estimated parameters simply takes The standard error of each parameter takes the square root of each diagonal element of the above matrix. The approximation performs better for large sample sizes. We now present the score vector and the Hessian matrix. The score vector is denoted by S j (θ, δ) = ∂L j (θ, δ)/∂θ, which consists of five blocks, as follows: Concatenating the above five scores vertically, we can get the score vector: The Hessian matrix is denoted by H j (θ, δ) = ∂ 2 L j (θ, δ)/∂θ∂θ T , which is block diagonal with non-zero blocks given as follows: The expectation of the Hessian matrix E[H j (θ, δ)]and the variance matrix of the score vector var[S j (θ, δ)] can be expressed explicitly because both the Hessian matrix and the score vector are linear functions of the missing value Therefore, E[H j (θ, δ)] and var[S j (θ, δ)] can eventually be expressed as functions of the expectation and variance of δ j , which have simple expressions because δ j is a multinomial variable. The Hessian matrix H j (θ, δ) is already expressed in linear function of δ j and thus the expectation can be obtained straightforwardly by replacing δ j by E(δ j ). An explicit linearity for the score function is not obvious. The following gives the linear relationship using matrix notation. Let us define the following matrices: The score vector in matrix notation is is the variance-covariance matrix of the multinomial variable δ j .

B. Estimation of Parameters under the Null Models
This appendix provides methods for estimating parameters under various null models that are required for constructing likelihood ratio test statistics.