Research Article Predicting Disease Onset from Mutation Status Using Proband and Relative Data with Applications to Huntington’s Disease

Huntington's disease (HD) is a progressive neurodegenerative disorder caused by an expansion of CAG repeats in the IT15 gene. The age-at-onset (AAO) of HD is inversely related to the CAG repeat length and the minimum length thought to cause HD is 36. Accurate estimation of the AAO distribution based on CAG repeat length is important for genetic counseling and the design of clinical trials. In the Cooperative Huntington's Observational Research Trial (COHORT) study, the CAG repeat length is known for the proband participants. However, whether a family member shares the huntingtin gene status (CAG expanded or not) with the proband is unknown. In this work, we use the expectation-maximization (EM) algorithm to handle the missing huntingtin gene information in first-degree family members in COHORT, assuming that a family member has the same CAG length as the proband if the family member carries a huntingtin gene mutation. We perform simulation studies to examine performance of the proposed method and apply the methods to analyze COHORT proband and family combined data. Our analyses reveal that the estimated cumulative risk of HD symptom onset obtained from the combined data is slightly lower than the risk estimated from the proband data alone.


Introduction
Huntington's disease HD is a severe, autosomal dominantly inherited neurodegenerative disorder that affects motor, cognitive, and psychiatric function and is uniformly fatal.HD is caused by the expansion of CAG trinucleotide repeats at the huntingtin gene IT15 use the EM algorithm to carry out the maximum likelihood estimation of the proband and family data jointly.Conditionally on the transmission status in family members, we use the logistic-exponential model in Langbehn et al. 14 to model the AAO as a function of CAG repeat length.We perform simulation studies to examine finite sample performances of the proposed methods.Finally, we apply these methods to analyze the COHORT proband and family combined data.Our results show a slightly lower estimated cumulative risk of HD symptom onset using the combined data compared to using proband data alone.

Methods
We start by introducing some notations.For the ith subject, let T i denote the age-at-onset of HD, let δ i be the event indicator, let C i denote the censoring time, and let X i min T i , C i .Let A i denote the CAG repeat length.Langbehn et al. 10 model distribution of T i given A i by a logistic function.The cumulative distribution function CDF given A i is and the density function is

2.2
Here μ A i is a location parameter depending on the covariate A i and s A i is a scale parameter depending on A i .Let S t | A i 1 − F t | A i denote the survival function of HD onset.The location and scale parameters have the following relationship with the mean and variance of T i given A i : Various parametric functions for the location and scale parameters were compared in Langbehn et al. 10, 14 , and the exponential function provides the best fit.Therefore, we use the same model where

Proband-Only Analysis
First, consider probands' data where all A i 's are observed.Since a subject's AAO of HD is subject to the right censoring, the likelihood function is and the log-likelihood is The maximum likelihood estimate MLE of the parameters, β, can be obtained via a generalpurpose optimization algorithm such as Newton- .

2.7
The standard error of the estimated survival function, S t | A i , is then estimated by the Delta method, that is, where the gradient vector Since the parameters are estimated by maximum likelihood, it is straightforward to carry out likelihood ratio tests LRTs to compare the model fit from the COHORT data with the one obtained by applying parameters from other studies such as Langbehn et al. 10 to the COHORT data.Here, twice the difference in the log-likelihood follows an asymptotic chisquare distribution with 6 degrees of freedom.

Incorporating Family Members
Next, we consider incorporating family members' AAO data.We do not directly observe whether a family member shares the huntingtin mutation with the proband, but we do have data regarding family members' age-at-onset of the first symptoms, as well as the family members' current ages.When we incorporate the additional family data, the likelihood for the survival takes a mixture form.Let p i denote the probability of the ith subject sharing a deleterious allele with a proband and therefore becoming a carrier.Such probabilities are calculated based on Mendelian transmission and a family member's relationship to the proband 8 .For example, offspring and siblings of a carrier proband have a probability of 50% of receiving the huntingtin allele that contains the CAG expansion Homozygotes for HD are extremely rare since prevalence of HD in general population is rare .We assume that, conditioning on a family member receiving the expanded huntingtin allele, the CAG repeat length is the same as observed in the proband, although this is a simplification 7 .
For subjects who receive a wild-type allele CAG < 36 , their probability of developing HD is zero, thus f t | A i < 36 0, and S tA i < 36 1, for all t.For the family members, the likelihood is where the above second term follows from the assumption that noncarriers do not develop HD.Note that for all carrier probands we observe p i 1, thus the likelihood reduces to 2.5 .
The above likelihood can be maximized by a combination of EM and Newton-Raphson algorithms.Let G i denote the unobserved carrier status indicator for the ith family member i.e., G i 1 indicates a family member receives a mutation and G i 0 indicates otherwise .Then the complete data log-likelihood is

2.11
At the k 1 th iteration of the E-step, we compute the conditional expectation of the complete data log-likelihood, given the observed data.Essentially, we compute

2.12
In the M-step, we update β k 1 by maximizing the weighted log-likelihood using the Newton-Raphson algorithm developed for the proband data.Since for the combined analysis, the parameters are estimated by maximizing the likelihood through an EM algorithm, the standard asymptotic theory applies and the standard errors of parameters can be estimated by inverting the expected or observed information matrix based on the log-likelihood of the observed data.When there is missing data and an EM algorithm is used to obtain the MLE, the information matrix based on the observed data likelihood can be difficult to compute analytically or computationally.In such situations, Louis 15 proposed to compute the observed information matrix in terms of the conditional moments of the first and second derivatives of the complete data log likelihood which can be obtained easily under the EM algorithm framework.In some cases, these moments are easier to compute than the corresponding derivatives of the incomplete, observed data loglikelihood.
However, in our application, the derivatives of the observed data log likelihood are easy to compute.Thus, we computed the gradient and Hessian matrix of the observed data log-likelihood directly and estimated the standard errors of β by the inverse of the Hessian matrix and estimated the standard errors of F t by the Delta method similar to the probandonly analysis.Simulation studies in the next section show satisfactory performance of this direct and relatively simpler approach.

Simulation Studies
We conducted two simulation studies closely related to the observed COHORT data to illustrate the performance of the Newton-Raphson optimization and the EM algorithm 16 .In all our optimization procedures, we centered both A i and X i .Since the direct optimization and EM algorithm need reasonable initial values, we fitted two nonlinear least square NLS to the observed sample mean and variance of the AAO on subjects with δ i 1.To be specific, we fit where m 1 a i and s 2 1 a i are the sample mean and variance for all subjects with A i a i , respectively.The six NLS estimators were used as the initial values for further optimization.We denoted the estimated β from the centered data as β c .For each simulation, the uncentered β were then calculated based on β c and the sample mean of A i and X i .
We restricted simulations to CAG repeat lengths between 41 and 56 to guard against sensitivity to the extremely high or low CAG repeats to be consistent with Langbehn et al. 10 .For the analysis of proband data, we generated a sample of 2000 subjects, each with a CAG length ranging from 41 to 56 that follows a multinomial distribution in which the probability pr A i a equals to the observed proportion of A i a in the COHORT proband data set.The failure times T i were simulated from the distribution 2.1 , where the parameters β were fixed at the values fitted from the COHORT proband data see next section for their values .The censoring times, C i , were generated from a rescaled Beta distribution with a scale and shape parameter of four.The parameters for the Beta distribution were chosen so that the proportion of censored subjects is the same in the simulated data and the observed COHORT proband data.
For the analysis of the combined proband and family data, we generated a sample of 4000 subjects.We assume the same proportion of the probands and relatives as observed in the combined COHORT data.For the family members, the probabilities p i were generated by resampling the observed p i 's in the COHORT data.With a given p i for each subject, we simulated his or her huntingtin carrier status from a Bernoulli distribution with success probability p i .For family members simulated to receive an expanded CAG repeat carriers , their CAG repeats A i were set to be the same as the probands and their failure times were simulated from 2.5 with β fixed at estimates from the COHORT combined data.For noncarrier family members, their failure times were set to be infinity and their X i C i .We used the same censoring distribution for generating C i as in the first simulation study.We provide simulation results of the proband only and combined analyses in Tables 1  and 2. We present mean F t | A i , empirical standard deviation of F t | A i , and the mean estimated standard error of F t | A i at various ages in.We see from these tables that mean F t | A i is very close to true F t | A i in both studies.The mean estimated standard errors of F t | A i are close to the empirical standard deviations, indicating that the estimation of variability is appropriate.Figures 1 and 2 present three curves of F t | A i at A i 41, 46, 50 and their 95% empirical confidence intervals for the proband data and combined data, respectively.We see that F t | A i coincide with the circles representing true F t | A i at various ages.

COHORT Data Analysis Results
COHORT is a multicenter observational study of individuals in the HD community.
COHORT recruitment is open to subjects who have HD symptoms and signs manifest HD , subjects who have an expanded CAG repeat but have not yet developed symptoms of HD presymptomatic , subjects who have an HD affected parent but have not been tested and do not have symptoms at risk , subjects who have an affected grandparent secondary risk , and control subjects who are not at risk for HD.Information available on participating probands include genetic status whether or not they carry HD mutation, and the number of CAG repeats , clinical diagnosis of HD, and the timing of symptom onset and timing of diagnosis.In our analyses, only probands with expanded CAG CAG ≥ 36 and their family members were included.Details of the cohort are cited in a publication in press 6 .
We first describe the proband and family data in the COHORT study.Information on CAG repeat length and age was available for 1357 probands with CAG repeats varying from 36 to 100 Table 3 .There were 3409 first-degree relatives available from 675 probands.We do not have information on whether some of the probands are from the same family.We show the descriptive statistics for the relatives stratified by relationship type in Table 4.Each proband potentially has three versions of age-at-the-first-symptom rater's report, subject's self-report, and a family member's report .We gave the rater reported AAO of symptom the highest priority.If the rater reported version is not available, we then used subject report.If neither rater nor subject's self-report is available, we then used the family member's report.Twenty-one subjects whose self-reported and rater-reported AAO of symptom differed by greater than 15 years were removed.Our proband data set has 1151 subjects with CAG length between 41 and 56 and was used for the proband-only analysis.Similar to Langbehn et al. 10 , we restricted the analysis to CAG repeat lengths between 41 and 56 to guard against sensitivity to the extremely high or low CAG repeats and against bias due to likely under ascertainment relative to the population of subjects with CAG length between 36 and 40.
Information on CAG repeat length, age at time of evaluation and the probability of being a carrier receiving huntingtin mutation from the proband was available for 2851 family members of 1151 probands.In the proband data set, both individuals with manifest HD and presymptomatic carriers 24% are included.Their age-at-diagnosis and age-at-firstmotor sign were recorded.Among 1151 probands, 876 76% subjects had experienced HD onset and the average AAO of the HD diagnosis was 44 years of age standard deviation: 10.7 .There were 54% females and 94% Caucasians.Our combined proband and family data set has 4002 subjects.In this combined data set, 51% were females and 35% subjects had experienced HD onset.Among the 4002 subjects, 467 are singletons probands with no family member included .The other 3535 subjects belong to 623 pedigrees with an average size of 5.674 sd 2.609 members.In the combined data, there are two different probabilities of being a carrier: p i 1 1199 subjects with known CAG expansions or known HD onset or p i 0.5 2803 subjects .Among the 2851 family members, 966 are parents of the probands, 1095 are siblings of the probands, and 790 are children of the probands.
When using the age-at-diagnosis in our proband data as T i , the estimated cumulative risk of HD is .

4.2
We present F t | A i curves for age-at-diagnosis and age-at-symptom at various CAG lengths and their 95% confidence intervals for the proband data in Figure 3.It can be seen that with a given A i , the estimated probability of having the first symptoms of HD is higher than  the probability of a diagnosis of HD at the same age.This is consistent with the intuition that symptoms of HD will be observed before a diagnosis.The mean AAO of first symptom is estimated to be about 2 years earlier than AAO of diagnosis As a sensitivity analysis, we compared the estimated CDF based on the parametric model with a nonparametric Kaplan-Meier estimator for subjects with a given A i .Figure 4 presents this comparison using probands' age-at-diagnosis data.We show in the figure that the parametric model fit is consistent with the Kaplan-Meier fit.However, as expected, the confidence interval for the parametric model estimate at a given age is narrower than  the Kaplan-Meier estimate results not shown .The figure comparing age-at-symptom models is similar and therefore omitted.We reanalyzed only the AAO of the first symptom using the combined proband and family data, since the age-at-diagnosis was not available for family members who are not seen in person.The estimated cumulative risk of HD at age t is .

4.3
The corresponding F t | A i curves at various CAG lengths and their 95% confidence intervals are shown in Figure 5.In Table 5, we compare the estimated mean and SD of the AAO from the proband and combined data.We can see that the estimated mean AAOs for several CAGs are similar regardless of whether family members are included.The SD estimated from the model is larger for the combined data.This is a reflection of the observed data in that there is a wider range of AAO in the combined data than in the proband data.For example, the SD for CAG 41 of the former is 11 years, whereas it is 10 years in the probands, and the SD for CAG 42 is 10 in the combined and 8 in the probands.One of the utilities of the estimated curves is to estimate the conditional probability of having an HD onset or staying HD free in the next five or ten years, given a subject has not had an onset by a given age.Similar to Langbehn et al. 10 , in Table 6, we present such conditional probabilities in five-year intervals for a subject without HD at age 40 and with given CAG repeats.For example, a 40-year presymptomatic subject with a CAG of 42 has a probability of 34% CI: 32%, 36% for developing HD in the next 10 years by age 50 , while for a subject with a CAG of 50 this probability increases to 0.93 CI: 0.91, 0.95 .

Discussion
We propose methods to predict disease risk from a known mutation or to estimate the penetrance function .For most complex diseases, predicting the AAO of a disease from genetic markers such as single-nucleotide polymorphisms SNPs continue to be a challenging issue 18 .Even with diseases like HD where the gene is identified, the predictive model can be complicated: a special feature of HD is that the mutation severity is quantifiable and varies significantly among the affected population.This contrasts with the typical categorical approach needed, for example, in genome-wide association studies.The proposed methods are also applicable to other expanded trinucleotide repeat diseases similar to HD.
One of the contributions of this work is to use the family data as well as the proband data to maximize available information in building a model.Our results reveal that the estimated risk obtained from the combined proband and family data is slightly lower than the risk estimated from the proband data alone.It is possible that the proband data consists of a biased clinical sample of gene positive or HD-affected subjects e.g., subjects with more severe disease or with earlier onset may be more likely to participate; presymptomatic subjects might be undersampled and is therefore not a fair representative sample of the entire HD population, especially underrepresenting subjects at risk.The plausibility of such underascertainment is so strong for CAG lengths of 40 or less 7 that we did exclude observations within that range from analysis.The family data may be a better representative of the population since the family members are included in the analysis only through the inclusion of the probands.Although proband may participate the study because they had HD or they had more severe symptoms of HD, the relatives were not included based on their CAG repeat lengths or affection status.Of course, some of the family members will not share an expanded CAG repeat huntingtin with the probands and therefore are noncarriers who will never develop HD.
Note that our estimated cumulative risk of onset of a positive HD diagnosis in the proband data is also slightly lower than Langbehn et al. 10 which also examined age-at-HD diagnosis.We estimated later mean AAO for each CAG repeat length shorter than 54 than did Langbehn et al. 10 .For example, the mean AAO of HD diagnosis for probands with a CAG of 42 in the former data was 3 years later and, for a CAG of 43, it was 4 years later Table 3 .On average, for all subjects with a CAG between 41 and 50, the mean AAO in Langbehn data was 2 years earlier than in the COHORT data.More detailed comparisons are presented in Table 5.There are several possible reasons for these differences.The model end point, AAO, should probably be considered to be slightly different in the two models.The outcome in Langbehn et al. 10 was earliest age at which a clinician documented an irreversible objective sign of the illness.This may occur earlier than the point at which an actual diagnosis of manifest HD is given.Many clinicians wait until there are several such signs.This may also occur, however, at a point that is later than the proband's or family's first report of subjective symptoms or their first perception of disease signs.In the CAG range of 41-49, the Langbehn et al. means are very close to the symptom onset means in the current data.For longer CAG lengths, the Langbehn et al. estimates more closely resemble the current models for disease diagnosis.Possible systematic variability between the clinicians in the two studies may also account for the differences in the estimates.
Other potential differences between the data sources include potential research-centerspecific heterogeneity in diagnostic and rating conventions and slight variations in the methods used to determine CAG repeat length.In the Langbehn study, these were measured by a variety of laboratories while in the COHORT they were all measured in the same laboratory.
We do note that the differences between the fitted models here and those in Langbehn et al. are substantially smaller than differences among other formulae in the literature 14 .AAO probabilities, conditioned on current age, are especially similar.In HD research and genetic counseling, these conditional probabilities are perhaps the most commonly used statistic deriving from these formulae.Finally, the logistic-exponential form of the parametric model proposed in Langbehn et al. 10 does indeed fit the empirical AAO distributions quite well in the COHORT data.This validates use of this relatively complicated survival model for HD AAO research and may encourage considerations of quantitative biological mechanisms that would generate exponential relationships between CAG and both AAO and its variance.
There has often been ambiguity in the modeling literature concerning the exact meaning of HD "onset."The first onset of observable signs or reportable symptoms of HD generally occurs before the actual diagnosis of clinically manifest HD is given.Much of the earlier modeling literature, reviewed in Langbehn et al. 14 , does not clearly address this distinction, although the resultant formulas have often been used for subsequent prediction of HD diagnosis 14 .The event modeled in Langbehn et al. 10 was "the first time that neurological signs representing a permanent change from the normal state was identified in a patient."This might be considered to the concept of "subject's first noted symptom" rather than age of diagnosis.Nonetheless, this model has been used frequently as a predictor of future diagnosis in HD 14 .In the current study, we do distinguish between first symptom onset and diagnosis.
Here, we assumed Mendelian transmission of huntingtin without interference so that the CAG length does not change from parents to offspring.There are several possible violations of these assumptions.CAG lengths do, in reality, vary somewhat among family members, and those inheriting the gene from their father have, on average, a slightly longer CAG repeat length than their father.The probability of this occurring is much lower if inheritance is from the mother 19 .An explanation is that there are many more biological opportunities for the CAG length to change in the father's process of sperm formation than in the mother's process of egg formation.These processes and their dynamics have been studied extensively in vitro 7, 20 , but we know of no well-verified in vivo dynamic population genetics models.Assuming the CAG length does not change from father to offspring may lead to a slightly lower estimated risk for affected fathers of probands.
Consistent with Langbehn et al. 10 and other studies 20, 21 , we estimated reduced penetrance for lower CAG repeat lengths ≤40 .We point out that the parameter estimates from the current model do not include subjects with CAG less than 41; therefore, the risk estimates for these subjects are extrapolations.However, it is conceivable that as long as the inverse relationship between AAO and CAG still holds for the lower CAGs, the life time disease risk for these subjects will be less than 100%, since the life time risk for a CAG of 41 is about 100%.
In the literature, no proportional odds model has been fitted to model the age-at-onset of HD.Proportional odds model, or along a similar line, transformation model, belongs to the semiparametric model framework and is beyond the scope of this paper.We are currently investigating semiparametric models other than the Cox proportional hazards model.
Finally, we stress that our current model does not include other observed covariates, such as additional genetic polymorphisms.In addition, we assumed conditional independence of family members' age-at-onset AAO of HD given their CAG repeats.This assumption implies that we do not account for residual correlation among family members' AAO caused by factors other than the CAG repeats, such as life style factors.When there exists such residual correlation, point estimates from our current approach are still consistent hence still valid, although the standard error estimates are no longer correct.A practical limitation of using family members' AAO data is that they may be less reliable than the data directly collected from the probands.This limitation applies to all other diseases, especially those with late onset.This limitation can be more pronounce when there is incomplete penetrance and variability of phenotype.Future work would consider incorporating such measurement error in the analysis.Lastly, the proposed methods do not include possible unobserved effects that may be site or clinician-specific and perhaps related to the interpretation of the point of "onset."Future research will focus on incorporating observed covariates and adding familyspecific random effects to account for residual familial aggregation.

Figure 4 :
Figure 4: Kaplan-Meier curve and estimated CDF of age-at-diagnosis of HD for A i 41, 43, 46, and 50 with COHORT proband data.
CDF of age at symptom

Figure 5 :
Figure 5: Estimated CDF of age-at-first-symptom of HD for A i 41, 43, 46, and 50 with COHORT combined proband and relative data.

Table 1 :
Simulation 1 proband data .Estimated CDF and standard errors from the direct optimization of proband-only analysis, n i Mean F t | A i Empi: sd Mean sd F t | A i Mean F t | A i Empi: sd Mean sd F t | A i Mean F t | A i

Table 2 :
Simulation 2 combined proband and relative data .Estimated CDF and standard errors from the EM algorithm with combined proband and family

Table 3 :
Descriptive statistics of the COHORT proband data.

Table 4 :
Descriptive statistics of the first-degree relatives of COHORT proband subjects stratified by relationship.

Table 5 :
Mean and standard deviation of the AAO estimated from the model 2.1 for four analyses.The estimated parameters for the CDF from the proband-only analysis are slightly different from the ones obtained from Langbehn et al. 10 .Our estimated mean and standard deviation of the AAO of HD is about 1 to 3 years later than the ones obtained in Langbehn et al. 10 , and the standard deviation SD is slightly smaller Table5.In addition, the estimated CDF is smaller for most A i values using COHORT data.We ran a joint likelihood ratio test on the goodness-of-fit of parameters obtained in Langbehn et al. 10 and the P value was less than 0.001 test statistic 66.0 .When analyzing the age-at-first-symptom in our proband data, the estimated cumulative risk of HD is Estimated CDFs of age-at-diagnosis and age-at-first-symptom of HD for A i 41, 43, 46, and 50 with COHORT proband data.
Table 5 and the standard deviation of the former is slightly larger, indicating that reported age-at-first-symptom is more variable.It is unclear to what extent this difference represents true physical variability in illness development versus possibly lower reliability in the retrospective reporting of symptom onset 17 .