
A powerful and comprehensive approach in modern biology is to study the whole process of development together with the important events that occur during it. As a consequence, joint modeling of developmental processes and events has become one of the most demanding tasks in statistical research. Here, we propose a joint modeling framework for functional mapping of specific quantitative trait loci (QTLs) that control developmental processes, the timing of developmental events, and their causal correlation over time. The joint model contains two submodels, one for a developmental process, known as a longitudinal trait, and the other for a developmental event, known as the time to event, which are connected through a QTL mapping framework. A nonparametric approach is used to model the mean and covariance functions of the longitudinal trait, while the traditional Cox proportional hazards (PH) model is used to model the event time. The joint model is applied to map QTLs that control whole-plant vegetative biomass growth and time to first flower in soybeans. The results suggest that this model should be broadly useful for detecting genes controlling physiological and pathological processes and other events of interest in biomedicine.

A classic approach to studying biology is dimension reduction, in which a biological phenomenon or process is dissected into several discrete features over time and space. Most efforts in the past decades have been made to understand the biological details of individual features and then to use knowledge from each feature to draw inferences about biology as a whole. The limitation of this approach is increasingly recognized: it fails to detect the rules that govern the transition from one feature to the next, and thus loses a significant amount of information about the development of a biological trait. More recently, tremendous developments in statistics and computer science have enabled scientists to model and compute the dynamic behavior of a biological phenomenon and to construct a comprehensive view of how a cell, tissue, or organ grows and develops across time and space.

A statistical dynamic model, called functional mapping, is one of the products of such developments [

The statistical foundation of functional mapping is longitudinal data analysis or functional data analysis. There has been a considerable body of literature on statistical modeling of time-varying mean and covariance structure using various parametric, nonparametric, and semiparametric methods [

The complexity of biology lies in the fact that no biological trait is isolated; rather, every trait is affected by other traits through genes and environmental factors. For example, when a plant reaches a particular stage of growth, reproductive behavior, such as flowering, starts to emerge as one of the important events in plant development. The time to first flower is highly associated with the amount of vegetative growth, depending on the environment in which the plant is grown. Likewise, the time to recurrence of prostate cancer in humans is related to dynamic changes in prostate-specific antigen level. How to jointly model longitudinal and time-to-event data within functional mapping has become an important issue for studying the common genetic basis of these processes and for predicting events based on longitudinal traits.

Simultaneous modeling of longitudinal traits and time to events has been an active area in biostatistics during the past twenty years. A linear random effects model and EM estimation approach are proposed by Henderson et al. [

By simply estimating the correlation between longitudinal traits and event time, Lin and Wu [

Genetic mapping should be based on a segregating population, such as the backcross, F_{2}, or recombinant inbred lines (RILs), initiated with two inbred lines each carrying an alternative allele. An RIL population is generated by continuously self-crossing the hybrids of the two inbred lines for 7–8 successive generations, which leads to two homozygous genotypes for the alternative alleles at each locus. Methods for other designs can be derived similarly. Suppose a backcross has

The central theme of functional mapping is to model the mean and covariance structures for the longitudinal trait efficiently. Here, we model the mean vector by a polynomial function and the covariance matrix by an approach that guarantees the positive definiteness of the estimated covariance matrix. Without loss of generality, assume the response vector for progeny

Equation (

The optimal

Denote

Assume the vectors of subject specific random effects

We use

The longitudinal model described above is linked to the hazard model by

Since the QTL genotype of a progeny is unknown, we use a mixture model to describe the likelihood of the progeny in terms of its possible underlying QTL genotypes [

The QTL genotype is inferred from marker genotypes of the linkage map. Let _{2}, and RIL populations, respectively.

Unknown parameters

We derive a Bayesian approach for estimating the unknown parameters. This first requires specifying the prior distributions for the parameters and then, given the data and the priors, deriving the posterior distribution over all the unknown parameters. For

With the above priors and likelihood function, we have the joint posterior distribution for the parameters. In this case, it is quite straightforward to get the full conditional posterior distributions. Assume that the priors are independent for different parameters. Thus, we get the posterior density of

Assuming that priors for different genotypes are independent, we can express the above posterior distribution as

The full conditional distributions for the model parameters, as derived in the Appendix, are used to estimate the parameters using the MCMC algorithm. Note that the full conditional distribution for

The parameter

Because of the independence among

The new model was applied to analyze a real data set for QTL mapping in soybeans. The mapping population contains 184 RILs derived from two cultivars, Nannong 1138-2 and Kefeng no. 1. A genetic linkage map of this population was first established by Zhang et al. [

The plants and their parents were grown in a sample lattice design with two replicates at Jiangpu Soybean Experiment Station, Nanjing Agricultural University, China. Starting 20 days after seedling emergence, plant biomass (in grams) was measured once every 5–10 days until most plants stopped height growth. A total of 8 biomass measurements were taken, and the time to first flower in that growing season was also recorded for each plant. Figure

Whole-plant biomass growth trajectories for 184 soybean RILs. The time to first flower is indicated by a vertical line on each biomass growth curve. The black curve is the mean growth trajectory.

Prior distributions for the model parameters were taken as follows. For genotype specific fixed effect

We fitted our joint model as described in Section

Since our model is complex, we perform several standard diagnostic tests to assess the convergence of the Markov chains. First, we use the method proposed by Brooks and Gelman [

Second, we perform the Geweke test, which compares the earlier part of the Markov chain with the later part to assess convergence. After deleting the burn-in iterations, we split the remaining 100,000 iterations into two subsequences: the first 50,000 and the last 50,000 iterations. Consistent spectral density estimates at zero frequency are also calculated to compute the
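The comparison of the two halves of a chain can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the function names are hypothetical, the spectral density at zero frequency is approximated by a simple batch-means estimator (an assumption; other consistent estimators are possible), and the chain is a toy AR(1) series rather than output from the joint model.

```python
import math
import random


def long_run_variance(x, n_batches=20):
    """Batch-means estimate of the long-run variance of a chain segment.

    This stands in for the spectral density estimate at zero frequency
    (assumption: a batch-means estimator is adequate for illustration).
    Draws beyond n_batches * batch_length are ignored.
    """
    batch_length = len(x) // n_batches
    means = [
        sum(x[i * batch_length:(i + 1) * batch_length]) / batch_length
        for i in range(n_batches)
    ]
    grand_mean = sum(means) / n_batches
    return batch_length * sum((m - grand_mean) ** 2 for m in means) / (n_batches - 1)


def geweke_z(chain, first=0.5, last=0.5):
    """Z-score comparing the mean of the early part with the mean of the late part."""
    n = len(chain)
    early = chain[: int(first * n)]
    late = chain[n - int(last * n):]
    mean_early = sum(early) / len(early)
    mean_late = sum(late) / len(late)
    var_early = long_run_variance(early) / len(early)
    var_late = long_run_variance(late) / len(late)
    return (mean_early - mean_late) / math.sqrt(var_early + var_late)


# Toy chain: a stationary AR(1) series standing in for post-burn-in MCMC draws.
random.seed(1)
chain = []
value = 0.0
for _ in range(4000):
    value = 0.5 * value + random.gauss(0.0, 1.0)
    chain.append(value)

z_stationary = geweke_z(chain)

# A chain whose second half is shifted has clearly not converged.
shifted = chain[:2000] + [v + 5.0 for v in chain[2000:]]
z_shifted = geweke_z(shifted)
```

A z-score far from zero, as for the shifted chain, indicates that the early and late parts of the chain disagree, which is the pattern the test is designed to flag.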

Finally, we perform the Heidelberger and Welch test as proposed by Heidelberger and Welch [

Figure

BIC values for selecting the optimum (

BIC | |
---|---|
7.18 | |
4.15 | |
4.26 | |
3.91 | |
2.87 | |
3.01 | |
4.26 | |
4.66 | |
3.17 | |
3.93 | |
4.05 | |
3.34 | |
5.16 | |
4.88 | |
3.16 | |

Marginal posterior plot for QTL locations over 24 linkage groups. Marker locations are indicated by ticks on the

We observed posterior peaks in linkage groups 1, 4, 15, 19, 20, 21, and 23. To draw inference about the existence of a putative QTL in each of these groups, we computed the Bayes Factor (BF), defined as

Estimates of the parameters that describe genotype-specific biomass growth trajectories and QTL locations on linkage groups 1, 20, and 23, with 95% credible intervals.

| Parameter | Group 1 estimate | Group 1 C.I. | Group 20 estimate | Group 20 C.I. | Group 23 estimate | Group 23 C.I. |
|---|---|---|---|---|---|---|
| | −0.4762 | (−0.5081, −0.4442) | −1.1524 | (−1.1641, −1.1405) | 0.4190 | (0.3897, 0.4484) |
| | 3.3214 | (3.3023, 3.3404) | 2.6829 | (2.6463, 2.7194) | 0.5971 | (0.5566, 0.6377) |
| | 0.1548 | (0.1363, 0.1731) | 0.2295 | (0.2020, 0.2570) | 0.5438 | (0.5176, 0.5701) |
| | −5.4762 | (−5.4788, −5.4734) | −3.1004 | (−3.0231, −2.9768) | −2.3571 | (−2.3701, −2.3442) |
| | 8.4464 | (8.4381, 8.4546) | 5.3250 | (5.3159, 5.3340) | 4.4660 | (4.4294, 4.5028) |
| | −0.4702 | (−0.5011, −0.4392) | −0.0250 | (−0.0618, 0.0118) | 0.0410 | (0.0390, 0.0432) |
| Marker interval | Sat-356–B30T | | GNE097b–A199H | | LC4-4T–Sat-280 | |
| QTL location | 30.810 | (29.1934, 31.7150) | 49.600 | (48.7515, 50.0726) | 39.472 | (38.8143, 40.1863) |

Since our model is complex and our estimation is based on MCMC, we perform a posterior predictive check for the aforementioned 7 linkage groups. We simulate observations to obtain the posterior predictive distribution. Let

One can simulate from the posterior predictive distribution using the following two steps. First, from the posterior distributions of the model parameters, simulate

We simulate 100 draws (

Posterior predictive check for 7 linkage groups.

Linkage group | BF (from actual data) | BF with SE (from posterior predictive distribution) |
---|---|---|
1 | 0.2781 | 0.2659 (0.15) |
4 | 11.274 | 12.086 (1.76) |
15 | 13.493 | 12.962 (2.17) |
19 | 10.610 | 11.138 (1.56) |
20 | 0.5913 | 0.6281 (0.73) |
21 | 11.475 | 12.183 (1.18) |
23 | 0.7953 | 0.7682 (0.89) |
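One way to read this table is to express each gap between the observed BF and the posterior predictive BF in units of the reported SE. The sketch below does only that arithmetic on the tabulated values; the function name is hypothetical, and treating the SE as the yardstick for agreement is an assumption made here for illustration, not a criterion stated in this study.

```python
# (linkage group, BF from actual data, posterior predictive BF, SE), copied from the table.
rows = [
    (1, 0.2781, 0.2659, 0.15),
    (4, 11.274, 12.086, 1.76),
    (15, 13.493, 12.962, 2.17),
    (19, 10.610, 11.138, 1.56),
    (20, 0.5913, 0.6281, 0.73),
    (21, 11.475, 12.183, 1.18),
    (23, 0.7953, 0.7682, 0.89),
]


def standardized_gap(observed, predictive, se):
    """Difference between observed and predictive BF, in units of the reported SE."""
    return (observed - predictive) / se


gaps = {group: standardized_gap(obs, pred, se) for group, obs, pred, se in rows}
largest_gap = max(abs(g) for g in gaps.values())

for group, gap in gaps.items():
    print(f"linkage group {group}: standardized gap = {gap:+.3f}")
```

For every linkage group the observed BF lies within one reported SE of the posterior predictive value, with the largest gap on linkage group 21.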

Broad-sense heritability for the traits is estimated from the data as the proportion of phenotypic variance attributable to genetic variance. The estimated heritability in our case is 32.6%. We also compute the percentage of variance explained by the three identified QTLs: those in linkage groups 1, 20, and 23 explain 6.8%, 14.3%, and 11.4% of the total variance, respectively.
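The arithmetic behind these figures can be sketched as follows. The variance components passed to the function are hypothetical values chosen only to reproduce the reported 32.6%, and summing the three QTL percentages assumes that their contributions are additive and non-overlapping, which is not stated in the text.

```python
def broad_sense_heritability(genetic_variance, phenotypic_variance):
    """Proportion of phenotypic variance attributable to genetic variance."""
    if phenotypic_variance <= 0:
        raise ValueError("phenotypic variance must be positive")
    return genetic_variance / phenotypic_variance


# Hypothetical variance components that reproduce the reported heritability.
h2 = broad_sense_heritability(genetic_variance=3.26, phenotypic_variance=10.0)

# Percentage of total variance explained by each detected QTL (from the text).
qtl_percent = {1: 6.8, 20: 14.3, 23: 11.4}

# Assumption: the QTL contributions add up without overlap.
total_qtl_percent = sum(qtl_percent.values())
share_of_genetic_variance = total_qtl_percent / (100.0 * h2)

print(f"heritability: {100.0 * h2:.1f}%")
print(f"variance explained by the three QTLs: {total_qtl_percent:.1f}%")
```

Under the additivity assumption, the three QTLs together account for about 32.5% of the total variance, which is nearly all of the estimated genetic variance.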

We show the marginal posterior plots with 95% credible intervals for the parameter

Marginal posterior plot for

Figure

Whole-plant biomass growth trajectories and times to first flower (T_{1} and T_{2}) for two different genotypes at each of the QTLs detected on linkage groups 1 (a), 20 (b), and 23 (c). Genotypes

We performed simulation studies to examine the statistical properties of the joint model. We assumed an RIL design of 200 progeny and simulated 11 evenly spaced markers on a linkage group of length 100 cM. A QTL was located 43 cM from the first marker of the linkage group. To reflect a practical problem, we used parameter estimates of the soybean QTL detected in linkage group 20 as true values to simulate the data, allowing the covariance structure. Time-dependent phenotypic values were assumed to follow a multivariate normal distribution, and the event times were taken to be the same as those in the soybean data. For comparison, we analyzed the simulated data using our nonparametric GLM-based covariance structure and the traditional AR
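The marker and QTL layout of this design can be sketched as below. Several choices are assumptions made for illustration and are not taken from the text: the Haldane map function, the standard conversion R = 2r/(1 + 2r) for RILs produced by selfing, and a first-order Markov chain along the linkage group as an approximation to RIL genotypes. All function names are hypothetical, and phenotypes and event times are not simulated here.

```python
import math
import random


def haldane_recombination(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def ril_selfing_fraction(r):
    """Proportion of recombinant RILs after repeated selfing, R = 2r / (1 + 2r)."""
    return 2.0 * r / (1.0 + 2.0 * r)


def simulate_ril_genotypes(n_progeny, positions_cm, rng):
    """Simulate homozygous genotypes (0 or 1) at ordered loci for each RIL.

    Assumption: genotypes along the linkage group follow a first-order
    Markov chain with switch probability R between adjacent loci.
    """
    switch_probs = [
        ril_selfing_fraction(haldane_recombination(right - left))
        for left, right in zip(positions_cm, positions_cm[1:])
    ]
    population = []
    for _ in range(n_progeny):
        genotype = [rng.randint(0, 1)]
        for p in switch_probs:
            previous = genotype[-1]
            genotype.append(1 - previous if rng.random() < p else previous)
        population.append(genotype)
    return population


# Design from the text: 200 RILs, 11 evenly spaced markers on 100 cM, QTL at 43 cM.
marker_positions = [10.0 * k for k in range(11)]
qtl_position = 43.0
positions = sorted(marker_positions + [qtl_position])
qtl_index = positions.index(qtl_position)

rng = random.Random(2012)
genotypes = simulate_ril_genotypes(200, positions, rng)

# Separate the (unobserved) QTL genotype from the observed marker genotypes.
qtl_genotypes = [g[qtl_index] for g in genotypes]
marker_genotypes = [g[:qtl_index] + g[qtl_index + 1:] for g in genotypes]
```

In an analysis, only `marker_genotypes` would be treated as observed, while `qtl_genotypes` would serve as the hidden truth against which the estimated QTL location is compared.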

The prior distributions for the model parameters were taken in the same way as discussed in Section

Simulation results for genotypic-mean parameters and QTL locations under different covariance structures, AR

| Parameter | Actual value | AR estimate | AR MCSE | CS estimate | CS MCSE | GLM-based estimate | GLM-based MCSE |
|---|---|---|---|---|---|---|---|
| | −1.1524 | −1.1872 | 0.0871 | −1.0982 | 0.1094 | −1.1667 | 0.0361 |
| | 2.6829 | 2.5391 | 0.0502 | 2.6103 | 0.0495 | 2.7001 | 0.0103 |
| | 0.2295 | 0.2288 | 0.0805 | 0.2301 | 0.0302 | 0.2291 | 0.0113 |
| | −3.1004 | −3.1204 | 0.0307 | −3.093 | 0.1025 | −3.1255 | 0.1011 |
| | 5.3250 | 5.3140 | 0.0291 | 5.2998 | 0.1130 | 5.3433 | 0.0405 |
| | −0.0250 | −0.0241 | 0.1302 | −0.0257 | 0.1035 | −0.0244 | 0.0603 |
| QTL location (cM) | 43.00 | 39.32 | 2.3561 | 40.61 | 1.5694 | 42.48 | 1.1572 |

Figure

The Bayesian estimate of QTL location (indicated by dashed vertical lines) from simulation studies under different covariance structures, AR

We perform further simulation studies to assess the reliability of BF in our data application. For each of the linkage groups 1, 20, and 23, we simulate data under the “null” model. As mentioned earlier, under the “null” model,

Tools to reveal the secret of life should reflect the dynamic nature of life. More recently, a series of statistical models have been developed to map quantitative trait loci (QTLs) that control the dynamic process of a complex trait [

In this paper, we develop a new version of functional mapping that can map QTLs for developmental events affected by organismic growth trajectories in time. This version benefits from recent statistical developments for joint modeling of longitudinal traits and event time [

Our joint model, embedded within functional mapping, makes it possible to test how QTLs pleiotropically affect different biological processes and how one trait is predicted by other traits through genetic information. The application of the new model to soybean mapping data not only validates its usefulness but also provides new insight into the genetic and developmental regulation of trait correlations in plants. The new model can also be modified to study the genetic associations between HIV dynamics and the time to death, as well as between prostate-specific antigen change and the time to recurrence of prostate cancer. However, there is much room for extending this model. First, to describe our idea clearly, we assume one QTL at a time for trait control, although epistatic interactions between multiple QTLs may play an important role in trait development as well as in correlations between longitudinal traits and events. Second, from a dynamic systems perspective, we need to model dynamic correlations among multiple longitudinal traits and multiple events. Third, with the availability of efficient genotyping techniques, our model should accommodate a high-dimensional model selection scheme to identify significant genetic variants from a flood of marker data.

Denote

Hence, we have

Similarly, the full conditional distributions for the other parameters can be derived as follows:

This work is partially supported by NSF/IOS-0923975, Changjiang Scholars Award, “Thousand-Person Plan” Award, the China National Key Basic Research Program (2006CB1017, 2009CB1184, 2010CB125906), the China National High-tech R&D Program (2006AA100104), the Natural Science Foundation of China (30671266), and the China MOE 111 Project (B08025). R. Li was supported by an NIDA Grant P50-DA10075 and an NNSF of China Grant 11028103. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH.