EM Algorithm for Mapping Quantitative Trait Loci in Multivalent Tetraploids

Multivalent tetraploids, which include many plant species such as potato, sugarcane, and rose, are of paramount importance to agricultural production and biological research. Quantitative trait locus (QTL) mapping in multivalent tetraploids is challenged by their unique cytogenetic properties, such as double reduction. We develop a statistical method for mapping multivalent tetraploid QTLs by considering these cytogenetic properties. This method is built in the mixture model-based framework and implemented with the EM algorithm. The method allows the simultaneous estimation of QTL positions, QTL effects, the chromosomal pairing factor, and the degree of double reduction, as well as the assessment of the estimation precision of these parameters. We used simulated data to examine the statistical properties of the method and validate its usefulness. The new method and its software will provide a useful tool for QTL mapping in multivalent tetraploids that undergo double reduction.

There is also a group of polyploids, called multivalent polyploids, in which chromosomes pair among more than two homologous copies at meiosis, rather than only two copies as in bivalent polyploids. Multivalent polyploids mostly originate from the duplication of similar genomes and, for this reason, they are called autopolyploids [20,21]. The consequence of multivalent pairing in autopolyploids is the occurrence of double reduction, that is, two sister chromatids of a chromosome sort into the same gamete [22]. Fisher [23] proposed a conceptual model for characterizing the individual probabilities of 11 different modes of gamete formation for a quadrivalent polyploid in terms of the recombination fraction between two different loci and their double reductions. Wu et al. [24] used Fisher's model to derive the EM algorithm for the estimation of the linkage between fully informative markers. Wu and Ma [25] extended this model to analyze any type of markers, regardless of their informativeness and dominant or codominant nature. The significant advantage of the models by Wu and colleagues lies in their generality, flexibility, and robustness.
In this paper, we develop a statistical method for QTL mapping in multivalent tetraploids by considering Fisher's [23] 11 classifications of gamete formation. The method allows the estimation and test of not only the QTL-marker ...

International Journal of Plant Genomics

where $g_1$ and $g_2$ are associated with double reductions at both the marker and QTL, $g_3$ and $g_4$ with double reductions only at QTL Q, $g_5$ and $g_6$ with double reductions only at marker M, and $g_7$-$g_9$ with nondouble reductions. From matrix (1), we see that there are no, one, and two recombinant events in the cells ($g_1$), ($g_3$, $g_5$), and ($g_2$, $g_4$, $g_6$, $g_9$), respectively. The cells ($g_7$) and ($g_8$) are each a mixture of two different gamete formation mechanisms or configurations (A and B), that is, $g_7 = g_{7A} + g_{7B}$ and $g_8 = g_{8A} + g_{8B}$, with relative proportions determined by $r$. Because different configurations contain different numbers of recombination events, the expected number of recombination events in each cell, that is, an observable gamete genotype, should be the weighted average of the number of recombination events for each configuration. Wu et al. [24] used a matrix form (e) to count the expected number of recombination events for each observable gamete genotype expressed as ... where

\[ \phi = \frac{r^2}{10r^2 - 18r + 9}, \quad \ldots \]

Based on matrices (1) and (2), the expressions for the frequencies of double reduction ($\alpha$ and $\beta$) and the recombination fraction $r$ can be expressed in terms of $g_i$ as

\[ \alpha = g_1 + g_2 + g_3 + g_4, \qquad \beta = g_1 + g_2 + g_5 + g_6, \]

\[ r = \frac{1}{2}\left[ g_3 + g_5 + 2\,(g_2 + g_4 + g_6 + g_9) + 2\phi\, g_7 + (1 + \psi)\, g_8 \right]. \]
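The expressions for $\alpha$, $\beta$, and $r$ above are simple linear combinations of the nine gamete-mode frequencies, and a short numerical sketch may help to check them. The function name and the dictionary layout below are illustrative assumptions, not part of the published software. Because $\phi$ and $\psi$ themselves depend on $r$ (and the definition of $\psi$ is not reproduced in this section), they are passed in as given values rather than computed.

```python
def double_reduction_and_recombination(g, phi, psi, tol=1e-9):
    """Return (alpha, beta, r) from the nine gamete-mode frequencies.

    g   : dict mapping the mode index 1..9 to its frequency g_i
    phi : weight applied to cell g_7 (assumed to be supplied by the caller)
    psi : weight applied to cell g_8 (assumed to be supplied by the caller)
    """
    if abs(sum(g[k] for k in range(1, 10)) - 1.0) > tol:
        raise ValueError("the nine gamete-mode frequencies must sum to 1")

    # Double-reduction frequencies as given in the text.
    alpha = g[1] + g[2] + g[3] + g[4]
    beta = g[1] + g[2] + g[5] + g[6]

    # Expected number of recombination events per gamete, halved.
    r = 0.5 * (
        g[3] + g[5]
        + 2.0 * (g[2] + g[4] + g[6] + g[9])
        + 2.0 * phi * g[7]
        + (1.0 + psi) * g[8]
    )
    return alpha, beta, r


if __name__ == "__main__":
    # Hypothetical frequencies chosen only to exercise the formulas.
    freqs = {k: 0.0 for k in range(1, 10)}
    freqs[1], freqs[3], freqs[7] = 0.1, 0.2, 0.7
    print(double_reduction_and_recombination(freqs, phi=0.0, psi=0.0))
```

In an actual analysis $r$ appears on both sides through $\phi$ and $\psi$, so an evaluation of this kind would sit inside an iterative update rather than being called once.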

Quantitative Genetic Model.
For a given QTL, there are 10 different QTL gamete genotypes in the multivalent tetraploid, whose values can be partitioned into additive and dominance genetic effects of different types, expressed as

EM Algorithm.
Ignoring the effects of other covariates, the phenotypic value, $y_i$, for individual $i$ in the pseudotest backcross can be expressed in terms of the QTL effect and residual error as

\[ y_i = \sum_{j_1 \le j_2} \xi_{j_1 j_2 | i}\, \mu_{j_1 j_2} + e_i, \]

where $\xi_{j_1 j_2 | i}$ is the indicator variable that is defined as 1 if individual $i$ has a QTL genotype $j_1 j_2$ ($j_1 \le j_2 = Q_1, Q_2, Q_3, Q_4$), and 0 otherwise, $\mu_{j_1 j_2}$ is the genotypic value of QTL genotype $j_1 j_2$ as defined in (5), and $e_i$ is the residual error assumed to be normally distributed with mean zero and variance $\sigma^2$. We use $\Omega$ to denote the unknown vector ($\mu_{11}$, $\mu_{22}$, $\mu_{33}$, $\mu_{44}$, $\mu_{12}$, $\mu_{13}$, $\mu_{14}$, $\mu_{23}$, $\mu_{24}$, $\mu_{34}$, $\sigma^2$).

For a QTL mapping experiment, marker genotypes are observable. Let $n_{l_1 l_2}$ be the number of observations of marker genotype $l_1 l_2$ ($l_1 \le l_2 = M_1, M_2, M_3, M_4$). The likelihood of the phenotypic ($y$) and marker data ($M$) is constructed, within the mixture model framework, as

\[ L(\Omega \mid y, M) = \prod_{l_1 \le l_2} \prod_{i=1}^{n_{l_1 l_2}} \left[ \sum_{j_1 \le j_2} \pi_{j_1 j_2 | l_1 l_2}\, f_{j_1 j_2}(y_i) \right], \tag{17} \]

where $\pi_{j_1 j_2 | l_1 l_2}$ is the conditional probability of QTL genotype $j_1 j_2$ given marker genotype $l_1 l_2$, and $f_{j_1 j_2}(y_i)$ is assumed to follow a normal distribution with mean $\mu_{j_1 j_2}$ and variance $\sigma^2$. The prior conditional probability $\pi_{j_1 j_2 | l_1 l_2}$ is calculated as the frequency of joint marker-QTL genotype $l_1 l_2 j_1 j_2$, expressed in terms of the nine $g$ probabilities in matrix (1), divided by the frequency of marker genotype $l_1 l_2$. Marker genotype frequencies are $\alpha/4$ for each of the four double reduction gametes and $(1 - \alpha)/6$ for each of the six nondouble reduction gametes.

The estimates of unknown parameters that maximize the likelihood (17) can be obtained by implementing the EM algorithm.
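To make the structure of the mixture likelihood concrete, the following is a minimal sketch of its logarithm, assuming the conditional probabilities $\pi_{j_1 j_2 | l_1 l_2}$ have already been computed and stored in a nested dictionary. The function names and the data layout are hypothetical and are not taken from the software accompanying the method.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_log_likelihood(phenotypes, marker_genotypes, cond_prob, means, var):
    """Log-likelihood of phenotypes under a normal mixture over QTL genotypes.

    phenotypes       : list of phenotypic values y_i
    marker_genotypes : list of marker genotype labels, one per individual
    cond_prob        : cond_prob[marker][qtl] = pi_{qtl | marker} (assumed known)
    means            : means[qtl] = genotypic value mu_{qtl}
    var              : common residual variance sigma^2
    """
    total = 0.0
    for y, marker in zip(phenotypes, marker_genotypes):
        # Mixture density: weight each QTL genotype by its prior probability.
        density = sum(
            prob * normal_pdf(y, means[qtl], var)
            for qtl, prob in cond_prob[marker].items()
        )
        total += math.log(density)
    return total
```

Working on the log scale avoids numerical underflow when the product in the likelihood runs over several hundred individuals.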
In step E, we calculate the posterior probability of a QTL genotype given a specific marker genotype of individual $i$ by

\[ \Pr(j_1 j_2 \mid y_i, l_1 l_2) = \frac{\pi_{j_1 j_2 | l_1 l_2}\, f_{j_1 j_2}(y_i)}{\sum_{j_1' \le j_2'} \pi_{j_1' j_2' | l_1 l_2}\, f_{j_1' j_2'}(y_i)}. \]

In step M, we calculate the frequencies of nine observable gamete modes based on the calculated posterior probabilities using the following:

\[ \ldots, \qquad g_9 = \frac{1}{2(n_{12} + n_{13} + n_{14} + n_{23} + n_{24} + n_{34})} \cdots, \]

which lead to the estimates of the frequencies of double reduction and of the recombination fraction as

\[ \hat{\alpha} = \frac{1}{n}(n_{11} + n_{22} + n_{33} + n_{44}), \qquad \hat{\beta} = g_1 + g_2 + g_5 + g_6, \]

\[ \hat{r} = \frac{1}{2}\left[ g_3 + g_5 + 2\,(g_2 + g_4 + g_6 + g_9) + 2\phi\, g_7 + (1 + \psi)\, g_8 \right]. \]

The genotypic value of QTL genotype $j_1 j_2$ and residual variance are estimated by

\[ \hat{\mu}_{j_1 j_2} = \frac{\sum_{i=1}^{n} \Pr(j_1 j_2 \mid y_i, l_1 l_2)\, y_i}{\sum_{i=1}^{n} \Pr(j_1 j_2 \mid y_i, l_1 l_2)}, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{j_1 \le j_2} \Pr(j_1 j_2 \mid y_i, l_1 l_2)\, (y_i - \hat{\mu}_{j_1 j_2})^2. \]

The iteration is repeated between the E step and the M step until the estimates converge.

The difference between the log-likelihood functions under the null and alternative hypotheses is calculated. However, the distribution of this log-likelihood ratio (LR) is not known because of the violation of regularity conditions for the mixture model (1). For this reason, a commonly used empirical approach based on permutation tests, which reshuffle the relationships between the marker genotypes and phenotypes [26], is used to determine the critical threshold for judging whether there is a QTL for the trait.

After a significant QTL is detected, the next hypothesis is about the additive genetic effect of the QTL. This can be tested by formulating the null hypothesis, under which the estimates of genotypic values of QTL genotypes can be obtained with the EM algorithm as described above, but posing three constraints derived from ... with estimates of genotypic values under the constraints derived from (10)-(15). All these genetic effects can be tested individually.
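The E and M steps for the genotypic values and the residual variance can be sketched as below. This is a simplified illustration rather than the full procedure: it assumes the conditional probabilities $\pi_{j_1 j_2 | l_1 l_2}$ are held fixed, whereas the method described here also updates the gamete-mode frequencies $g_1, \ldots, g_9$ (and hence $\pi$) in each M step. All function names and the data layout are hypothetical.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def e_step(phenotypes, markers, cond_prob, means, var):
    """Posterior probability of each QTL genotype for every individual."""
    posteriors = []
    for y, marker in zip(phenotypes, markers):
        weights = {
            qtl: prob * normal_pdf(y, means[qtl], var)
            for qtl, prob in cond_prob[marker].items()
        }
        total = sum(weights.values())
        posteriors.append({qtl: w / total for qtl, w in weights.items()})
    return posteriors


def m_step(phenotypes, posteriors, means):
    """Update genotypic values and the common residual variance.

    Simplifying assumption: the conditional probabilities are NOT updated here.
    """
    new_means = dict(means)
    for qtl in means:
        weight = sum(post.get(qtl, 0.0) for post in posteriors)
        if weight > 0.0:
            weighted_sum = sum(
                post.get(qtl, 0.0) * y for post, y in zip(posteriors, phenotypes)
            )
            new_means[qtl] = weighted_sum / weight
    new_var = sum(
        prob * (y - new_means[qtl]) ** 2
        for post, y in zip(posteriors, phenotypes)
        for qtl, prob in post.items()
    ) / len(phenotypes)
    return new_means, new_var


def run_em(phenotypes, markers, cond_prob, means, var, n_iter=100):
    """Alternate E and M steps for a fixed number of iterations."""
    for _ in range(n_iter):
        posteriors = e_step(phenotypes, markers, cond_prob, means, var)
        means, var = m_step(phenotypes, posteriors, means)
    return means, var
```

A fixed iteration count keeps the sketch short; in practice one would more likely stop when the change in the log-likelihood between successive iterations falls below a chosen tolerance.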

Application to Simulated Data.
A pseudotest backcross for a multivalent tetraploid was hypothesized, in which a marker is assumed to be linked with a QTL that affects a quantitative trait. Marker and QTL genotypes were ...

In this simulation study, fully informative markers and QTL are assumed and, thus, the double reduction at the marker can be estimated analytically. The estimates of the parameters converge to stable values at a rapid rate given that there are closed forms for parameter estimators in the EM framework. We evaluate the estimation of the other parameters related to QTL segregation, effects, and position. The means of the MLEs of the QTL-related parameters and their standard errors based on 1000 simulation replicates are given in Tables 1, 2, and 3. With a small sample size (100), the double reduction of the QTL was accurately estimated, with the precision of estimation relatively independent of the magnitude of heritability and the degree of QTL-marker linkage (Table 1). The most significant factor that affected the estimate of QTL position (in terms of its recombination with the marker) was the heritability, followed by sample size and the degree of QTL-marker linkage. In general, a sample size of at least 200 was required to reliably estimate the QTL position for a major gene that explains about 20%-30% of the phenotypic variance.
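The analytical estimation of marker double reduction mentioned above can be illustrated with a small simulation. The sketch assumes a fully informative marker with four distinct alleles, a frequency of $\alpha/4$ for each of the four double reduction (homozygous) gametes, and, as a labeled assumption, an equal share $(1 - \alpha)/6$ for each of the six remaining gametes. The function names and the seed are illustrative only.

```python
import random


def marker_genotype_frequencies(alpha):
    """Frequencies of the 10 marker gamete genotypes (allele pairs a <= b)."""
    genotypes = [(a, b) for a in range(1, 5) for b in range(a, 5)]
    return {
        g: (alpha / 4.0 if g[0] == g[1] else (1.0 - alpha) / 6.0)
        for g in genotypes
    }


def simulate_marker_genotypes(n, alpha, seed=0):
    """Draw n marker gamete genotypes under double-reduction frequency alpha."""
    rng = random.Random(seed)
    freqs = marker_genotype_frequencies(alpha)
    genotypes = list(freqs)
    weights = [freqs[g] for g in genotypes]
    return rng.choices(genotypes, weights=weights, k=n)


def estimate_double_reduction(sample):
    """Closed-form estimate: the proportion of homozygous gametes."""
    homozygous = sum(1 for a, b in sample if a == b)
    return homozygous / len(sample)


if __name__ == "__main__":
    sample = simulate_marker_genotypes(n=400, alpha=0.3, seed=1)
    print(estimate_double_reduction(sample))
```

Because the estimator is a simple proportion, its sampling variance is $\alpha(1 - \alpha)/n$, which gives a quick sense of the precision available at a given sample size.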
The estimation precision of the QTL effects depended on the heritability, sample size, and degree of QTL-marker linkage. As heritability, sample size, and linkage degree increased, the estimates of various QTL effects were more precise. As compared with the dominance genetic effects, the estimates of the additive genetic effects required a larger sample size, more precise phenotypic measurements (leading to a higher heritability), and a denser linkage map (with a stronger degree of QTL-marker linkage). We found that the estimates of QTL effects were influenced by the frequency of QTL double reduction. At low frequencies of double reduction, the effects of QTL were more accurately estimated than at higher frequencies. For a QTL undergoing a strong double reduction (say α = 0.3), a sample size of at least 400 was required even if the QTL explained a large proportion of the phenotypic variance (0.4). For a modest-sized QTL, a much larger sample size was required.
We performed a simulation study to test how the misspecification of double reduction affects the estimate of QTL-related parameters. This was done by using traditional mapping models (without considering double reduction) to analyze the simulated data of QTL genotypes with different degrees of double reduction. When a QTL undergoes double reduction, traditional models that do not consider double reduction provided misleading results about the estimates of QTL effects and position (data not shown). Furthermore, increasing heritabilities and sample sizes did not improve the estimates. In this case, the power of QTL detection was reduced.

Discussion
A statistical method for genetic mapping of quantitative trait loci (QTLs) in a multivalent tetraploid undergoing a double reduction process is described. As an important cytological characteristic of polyploids, double reduction may play a significant role in plant evolution and maintenance of genetic polymorphism in natural populations. Also, because double reduction affects the result of linkage analysis through the crossing-over events between different chromosomes [24,25], it is important to incorporate double reduction into a QTL mapping framework. This method provides a powerful tool for QTL mapping and understanding the genetic control of a quantitative trait in a multivalent tetraploid.
The method capitalizes on 11 different classifications of two-locus gamete formations, derived by Fisher [23], during multivalent tetraploid meiosis and has proven to be powerful for simultaneous estimation of the frequencies of double reduction and the recombination fraction between different loci. Although a couple of statistical approaches have been proposed to map multivalent tetraploid QTLs [26,27], this method has for the first time incorporated Fisher's tetrasomic inheritance into the mapping framework, thus enhancing the cytological relevance of QTL detection. Results from simulation studies showed that the method can be used to map QTLs in a controlled cross of multivalent tetraploids when the mapping population is adequately large (say 400). When a QTL undergoes double reduction, traditional mapping approaches will incorrectly estimate the position and effects of the QTL, proportional to the degree of double reduction. The new method can estimate the double reduction of a QTL, an important parameter related to the genetic diversity and evolution of polyploids [28,29].
Because of the high complexity of the mixture model implemented with tetrasomic inheritance, we only considered a one-marker model for QTL mapping. Interval mapping, which localizes a QTL with two flanking markers, has proven to be more advantageous in parameter estimation than the one-marker model [30]. It will be worthwhile to integrate components of our model into the interval mapping framework to fully explore the statistical merits of interval mapping for QTL mapping in multivalent tetraploids. Furthermore, the model proposed in this article assumes the segregation of fully informative codominant loci, each with 10 distinct genotypes, in a controlled cross of multivalent tetraploids. For partially informative codominant markers, a two-stage hierarchical mixture model will be needed to model the different allelic configurations for a phenotypically identical genotype. Although molecular marker technologies have improved in recent years, dominant markers may still be used in genetic mapping projects of some underrepresented species, including polyploids. Thus, it is also important to extend our model to map QTLs with dominant markers. For partially informative loci, the number of QTL genotypes may be unknown and, thus, a model selection procedure should be incorporated to determine the optimal number of genotypes at a QTL.
The genetic mapping of polyploids is complex because of their complex inheritance modes. Sophisticated statistical models are required to tackle genetic problems hidden in the polysomic inheritance of polyploids. Currently, there are some debates on the optimal modeling of tetrasomic inheritance in linkage analysis [13,25] and QTL mapping [18,31] partly because of our limited knowledge about these fascinating species. Before a detailed understanding of the cytological mechanisms for meioses in multivalent polyploids is obtained, this type of debate will continue. In any case, the development of powerful statistical models for polyploid mapping continues to be a pressing need. The application of these models to real-world data will not only test their usefulness, but also provide an unprecedented opportunity to understand the genetic differentiation among polyploid genomes and characterize the genetic architecture of quantitatively inherited traits for this unique group of species. Software for the method described is available at http://statgen.psu.edu/.