
Under cohort sampling designs, additional covariate data are collected on cases of a specific type and on a randomly selected subset of noncases, primarily for the purpose of studying associations with a time-to-event response of interest. Once such data are available, it may be of interest to reuse them to study associations between the additional covariates and a secondary non-time-to-event response variable, usually measured on the whole study cohort at the outset of the study. Following earlier literature, we refer to this situation as secondary analysis. We outline a general conditional likelihood approach for secondary analysis under cohort sampling designs and discuss the specific situations of case-cohort and nested case-control designs. We also review alternative methods based on full likelihood and inverse probability weighting. We compare the alternative methods for secondary analysis in two simulated settings and apply them in a real-data example.

Cohort sampling designs are two-phase epidemiological study designs where information on time-to-event outcomes of interest over a followup period and some basic covariate data are collected on the whole first-phase study group, referred to as a cohort, and in the second phase, more expensive or difficult-to-obtain additional covariate data are collected only on a subset of the study cohort. This usually comprises the cases, that is, individuals with a disease event of interest during the followup, and a randomly selected subset of noncases. Examples are the case-cohort [

As our motivating example, we consider here a single cohort which was used in a larger meta-analysis of association between the European lactase persistence genotype and body mass index (BMI) [

Secondary analysis of case-control data has been studied previously, using profile likelihood [

To cover our motivating example and also the general case, we introduce first some notation. Let the set

We assume that the first-phase sampling mechanism has been unconfounded in the sense of Rubin [

Valid estimates for the parameters of the secondary outcome model could alternatively be obtained by using inverses of the first-order inclusion probabilities

A very general definition for conditional likelihood is given by Cox and Hinkley [

Condition on the sampling mechanism, that is, the set of inclusion indicators

Other observed variables may be placed into

We must have

Applying these conditioning rules will reproduce the conditional likelihood expressions obtained previously in special cases of the current framework by Langholz and Goldstein [

Following the stated rules, and making the same general assumptions as in Section

The ratio of the numerator and the second term in the denominator can be further written as

Here, we are mainly interested in a variation of the “efficient case-cohort design” suggested by Kim and De Gruttola [

Consider now a nested case-control sampling mechanism in which all cases are selected to the case-control set with probability one, and

A further issue to be considered in practical applications is possible missing covariate data within the set

If the data can be assumed to be missing at random, that is,

If the missingness mechanism were known (and the data were missing at random), a weighted pseudolikelihood expression for the set

By partitioning the observed data into

The previous discussion was specific to a primary mortality outcome using time on study as the main time scale. In this section, we discuss separately how the different methods can accommodate cohort sampling for incident nonfatal primary outcomes. In the analysis of secondary, non-time-to-event outcomes, left truncation due to the exclusion of prevalent disease cases presents an additional complication. If the parameters of the secondary outcome model are meant to describe the background population alive at the cohort baseline (rather than the disease-free population), this additional selection factor requires further adjustment. If the primary outcome is a mortality endpoint, this is not an issue, since then there is no further selection due to prevalent conditions. In likelihood-based adjustment for left truncation, age rather than time on study has to be chosen as the main time scale of the analysis. In survival modeling, it is well known that conditioning on event-free survival until the age at study baseline corresponds to excluding the followup time before that age (e.g., [

It should also be noted that under case-cohort designs it is common to collect second-phase covariate data for more than a single outcome, since the case-cohort design naturally enables the analysis of multiple outcomes using a single subcohort selection. This is also the case in our example cohort discussed in Sections

The full likelihood expression (

The weighted pseudolikelihood approach does not readily take into account additional selection which occurs in studies of incident outcomes. This was also pointed out by Reilly et al. [

As in Section

In order to compare the efficiency of the alternative estimation methods discussed above, namely, full likelihood, conditional likelihood, and weighted pseudolikelihood, we supplemented the cohort data described in Section

Normal model

Maximum likelihood estimates of the parameters

Lik. | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0.0 | 0.0 | Full | 23.72 | 0.09 | 0.10 | 0.10 | 1.33 | 0.00 | 0.01 | 0.22 | 0.22 | 4.10 | 0.25 | |||

Cond. | 23.74 | 0.09 | 0.00 | 0.10 | 0.10 | 1.33 | 0.00 | 0.00 | 0.23 | 0.23 | 4.17 | — | ||||

Weighted | — | — | — | — | — | — | — | 0.03 | 0.30 | 0.30 | 4.10 | — | ||||

0.5 | 0.0 | Full | 26.30 | 0.09 | 0.49 | 0.09 | 0.09 | 1.33 | 0.00 | 0.01 | 0.22 | 0.21 | 4.10 | 0.25 | ||

Cond. | 26.33 | 0.09 | 0.50 | 0.09 | 0.09 | 1.33 | 0.00 | 0.00 | 0.22 | 0.22 | 4.17 | — | ||||

Weighted | — | — | — | — | — | — | — | 0.03 | 0.30 | 0.30 | 4.10 | — | ||||

0.0 | 0.5 | Full | 23.73 | 0.09 | 0.10 | 0.10 | 1.33 | 0.53 | 0.25 | 0.23 | 4.09 | 0.25 | ||||

Cond. | 23.75 | 0.09 | 0.00 | 0.10 | 0.10 | 1.33 | 0.52 | 0.24 | 0.23 | 4.17 | — | |||||

Weighted | — | — | — | — | — | — | — | 0.53 | 0.31 | 0.31 | 4.09 | — | ||||

0.5 | 0.5 | Full | 26.32 | 0.09 | 0.49 | 0.09 | 0.09 | 1.33 | 0.55 | 0.23 | 0.22 | 4.09 | 0.25 | |||

Cond. | 26.34 | 0.09 | 0.50 | 0.09 | 0.09 | 1.33 | 0.54 | 0.22 | 0.22 | 4.17 | — | |||||

Weighted | — | — | — | — | — | — | — | 0.54 | 0.31 | 0.31 | 4.09 | — | ||||

1.0 | 0.0 | Full | 29.62 | 0.09 | 0.99 | 0.09 | 0.09 | 1.35 | 0.00 | 0.02 | 0.21 | 0.21 | 4.10 | 0.25 | ||

Cond. | 29.64 | 0.09 | 1.00 | 0.09 | 0.09 | 1.35 | 0.00 | 0.01 | 0.22 | 0.21 | 4.17 | — | ||||

Weighted | — | — | — | — | — | — | — | 0.03 | 0.29 | 0.29 | 4.10 | — | ||||

0.0 | 1.0 | Full | 23.77 | 0.09 | 0.00 | 0.10 | 0.10 | 1.33 | 1.11 | 0.26 | 0.23 | 4.07 | 0.25 | |||

Cond. | 23.77 | 0.09 | 0.00 | 0.10 | 0.10 | 1.33 | 1.06 | 0.24 | 0.23 | 4.15 | — | |||||

Weighted | — | — | — | — | — | — | — | 1.05 | 0.32 | 0.32 | 4.07 | — | ||||

1.0 | 1.0 | Full | 29.67 | 0.09 | 1.00 | 0.09 | 0.09 | 1.35 | 1.14 | 0.24 | 0.22 | 4.07 | 0.25 | |||

Cond. | 29.65 | 0.09 | 1.00 | 0.09 | 0.09 | 1.35 | 1.08 | 0.22 | 0.21 | 4.14 | — | |||||

Weighted | — | — | — | — | — | — | — | 1.07 | 0.31 | 0.31 | 4.07 | — |

In fitting models to the datasets so obtained, we used the same model specifications as above, with the exception of the proportional hazards model, where we fitted a “misspecified” Weibull model

Let now the observed data be only

With

Table

Maximum likelihood estimates of parameters

Normal data | Normal model | NPQM model | ||||||||||||

Full lik. | 1000 | 4.00 | 0.00 | 0.16 | 2.00 | 0.20 | 4.00 | 0.00 | 0.16 | 2.00 | 0.00 | 0.01 | 0.14 | 0.20 |

500 | 4.00 | 0.01 | 0.22 | 1.99 | 0.20 | 4.00 | 0.01 | 0.22 | 1.99 | 0.00 | 0.01 | 0.14 | 0.20 | |

200 | 4.00 | 0.00 | 0.35 | 1.99 | 0.20 | 4.00 | 0.00 | 0.35 | 1.99 | 0.00 | 0.01 | 0.14 | 0.20 | |

100 | 4.00 | 0.01 | 0.50 | 1.99 | 0.20 | 4.00 | 0.01 | 0.50 | 1.99 | 0.00 | 0.02 | 0.14 | 0.20 | |

Cond. lik. | 500 | 4.00 | 0.01 | 0.22 | 1.99 | 0.20 | 4.00 | 0.01 | 0.22 | 1.99 | 0.00 | 0.02 | 0.14 | 0.20 |

200 | 4.01 | 0.00 | 0.35 | 1.99 | 0.20 | 4.01 | 0.00 | 0.35 | 1.99 | 0.00 | 0.03 | 0.13 | 0.20 | |

100 | 4.00 | 0.01 | 0.50 | 1.98 | 0.20 | 4.00 | 0.01 | 0.50 | 1.98 | 0.00 | 0.05 | 0.13 | 0.20 | |

Gamma data | Normal model | NPQM model | ||||||||||||

Full lik. | 1000 | 4.00 | 0.01 | 0.16 | 1.99 | 0.20 | 3.99 | 0.00 | 0.13 | 2.01 | 0.15 | 0.01 | 0.14 | 0.20 |

500 | 4.00 | 0.01 | 0.22 | 1.99 | 0.20 | 3.99 | 0.01 | 0.20 | 2.01 | 0.15 | 0.01 | 0.14 | 0.20 | |

200 | 3.97 | 0.15 | 0.38 | 1.98 | 0.20 | 3.98 | 0.07 | 0.35 | 2.00 | 0.15 | 0.02 | 0.14 | 0.20 | |

100 | 3.86 | 0.91 | 0.38 | 1.88 | 0.19 | 3.95 | 0.18 | 0.32 | 1.96 | 0.16 | 0.02 | 0.14 | 0.22 | |

Cond. lik. | 500 | 4.00 | 0.00 | 0.22 | 1.99 | 0.20 | 4.00 | 0.00 | 0.19 | 2.01 | 0.15 | 0.02 | 0.14 | 0.20 |

200 | 4.00 | 0.35 | 1.98 | 0.20 | 4.01 | 0.00 | 0.29 | 2.00 | 0.15 | 0.03 | 0.14 | 0.20 | ||

100 | 4.00 | 0.50 | 1.97 | 0.20 | 4.01 | 0.01 | 0.41 | 2.00 | 0.15 | 0.05 | 0.13 | 0.20 |

Sampling distributions for maximum likelihood estimates of regression coefficient


Using the NPQM model, which allows a skewed residual distribution, does not correct the situation either when combined with full likelihood estimation (Figure

Maximum likelihood estimates of parameters

Normal data | Normal model | NPQM model | ||||||||||||

Full lik. | 1000 | 4.00 | 1.00 | 0.16 | 2.00 | 0.20 | 4.00 | 1.00 | 0.16 | 2.00 | 0.00 | 0.01 | 0.14 | 0.20 |

500 | 4.00 | 1.01 | 0.22 | 1.99 | 0.20 | 4.00 | 1.01 | 0.22 | 1.99 | 0.00 | 0.02 | 0.14 | 0.20 | |

200 | 4.00 | 1.00 | 0.34 | 1.99 | 0.20 | 4.00 | 1.00 | 0.34 | 1.99 | 0.00 | 0.02 | 0.14 | 0.20 | |

100 | 4.00 | 1.00 | 0.47 | 1.99 | 0.20 | 4.00 | 1.01 | 0.48 | 1.99 | 0.00 | 0.02 | 0.14 | 0.20 | |

Cond. lik. | 500 | 4.00 | 1.01 | 0.22 | 1.99 | 0.20 | 4.00 | 1.01 | 0.22 | 1.99 | 0.00 | 0.02 | 0.14 | 0.20 |

200 | 4.01 | 1.00 | 0.35 | 1.99 | 0.20 | 4.01 | 1.00 | 0.35 | 1.99 | 0.00 | 0.03 | 0.13 | 0.20 | |

100 | 4.00 | 1.01 | 0.50 | 1.98 | 0.20 | 4.00 | 1.01 | 0.51 | 1.98 | 0.00 | 0.05 | 0.13 | 0.20 | |

Gamma data | Normal model | NPQM model | ||||||||||||

Full lik. | 1000 | 4.00 | 1.01 | 0.16 | 1.99 | 0.20 | 3.99 | 1.00 | 0.13 | 2.01 | 0.15 | 0.01 | 0.14 | 0.20 |

500 | 3.97 | 1.15 | 0.24 | 1.98 | 0.20 | 3.97 | 1.08 | 0.18 | 2.00 | 0.15 | 0.02 | 0.14 | 0.20 | |

200 | 3.83 | 2.01 | 0.36 | 1.86 | 0.19 | 3.92 | 1.28 | 0.23 | 1.97 | 0.16 | 0.02 | 0.14 | 0.21 | |

100 | 3.72 | 2.95 | 0.26 | 1.70 | 0.16 | 3.87 | 1.44 | 0.26 | 1.94 | 0.16 | 0.02 | 0.14 | 0.23 | |

Cond. lik. | 500 | 4.00 | 1.00 | 0.22 | 1.99 | 0.20 | 4.00 | 1.00 | 0.19 | 2.01 | 0.15 | 0.02 | 0.14 | 0.20 |

200 | 4.00 | 0.99 | 0.35 | 1.98 | 0.20 | 4.01 | 1.00 | 0.29 | 2.00 | 0.15 | 0.03 | 0.14 | 0.20 | |

100 | 4.00 | 0.99 | 0.50 | 1.97 | 0.20 | 4.01 | 1.01 | 0.41 | 2.00 | 0.15 | 0.05 | 0.13 | 0.20 |

Sampling distributions for maximum likelihood estimates of regression coefficient


The case-cohort set for all-cause mortality in the example cohort (

A remaining issue to be considered is the numerical evaluation of the integrals over

Maximum likelihood estimates obtained by numerical maximization of the expressions (

Maximum likelihood estimates (standard errors) of the parameters

Likelihood | |||||||||
---|---|---|---|---|---|---|---|---|---|

Conditional | 100 | 23.88 | 3.17 (0.21) | 0.12 (0.01) | 0.00 (0.01) | 1.34 | 0.29 (0.03) | — | |

( | 1000 | 23.52 | 3.16 (0.21) | 0.12 (0.01) | 0.00 (0.01) | 1.34 | 0.29 (0.03) | — | |

5000 | 23.45 | 3.15 (0.21) | 0.12 (0.01) | 0.00 (0.01) | 1.34 | 0.29 (0.03) | — | ||

Conditional | 100 | 20.43 | 3.02 (0.20) | 0.12 (0.01) | 1.33 | 0.29 (0.03) | 0.59 (0.01) | ||

( | 1000 | 20.34 | 3.01 (0.20) | 0.12 (0.01) | 1.33 | 0.29 (0.03) | 0.59 (0.01) | ||

5000 | 20.44 | 3.02 (0.20) | 0.12 (0.01) | 1.33 | 0.29 (0.03) | 0.59 (0.01) | |||

Full | — | 18.40 | 2.91 (0.17) | 0.09 (0.01) | 1.33 | 0.28 (0.03) | 0.59 (0.01) |

Maximum likelihood estimates (standard errors) of the parameters

Likelihood | |||||||||
---|---|---|---|---|---|---|---|---|---|

Conditional | 100 | 26.91 (0.14) | 4.27 | 0.88 (0.02) | 0.14 (0.03) | 0.37 | |||

( | 1000 | 26.92 (0.14) | 4.27 | 0.88 (0.02) | 0.14 (0.03) | 0.37 | |||

5000 | 26.92 (0.14) | 4.27 | 0.88 (0.02) | 0.14 (0.03) | 0.37 | ||||

Conditional | 100 | 26.93 (0.13) | 4.25 | 0.87 (0.02) | 0.14 (0.03) | 0.37 | |||

( | 1000 | 26.93 (0.13) | 4.25 | 0.87 (0.02) | 0.14 (0.03) | 0.37 | |||

5000 | 26.93 (0.13) | 4.25 | 0.87 (0.02) | 0.14 (0.03) | 0.37 | ||||

Full | — | 26.94 (0.07) | 4.06 | 0.83 (0.01) | 0.13 (0.01) | 0.35 | |||

Weighted | — | 26.94 (0.15) | 4.30 | 0.88 (0.04) | 0.17 (0.05) | 0.41 |

Full likelihood and weighted pseudolikelihood estimates agreed well with the conditional likelihood ones, although the weighted pseudolikelihood estimates had higher standard errors, as was also the case in the simulations. As noted by Kettunen et al. [

Although conditional logistic likelihood is well known in the context of risk set sampling designs (e.g., [

Compared to the full likelihood approach, which is applicable under any cohort sampling design, the advantage of the conditional likelihood is that modeling of the population distribution of the covariates collected in the second phase can be avoided if this distribution is not of primary interest. In addition, as we demonstrated in a simulation example, full likelihood expressions with most of the covariate data unobserved may no longer be well behaved, a problem which the conditional likelihood approach does not have. The disadvantage of the conditional likelihood approach is that the second-phase sampling mechanism, that is, the joint distribution of the inclusion indicators, needs to be specified. In the case-cohort/Bernoulli sampling situation, this is straightforward, but in mechanisms such as risk set sampling, where the sampling probabilities are specified only implicitly, resolving the joint sampling probability can be computationally nontrivial, and approximations may be needed in practice. This is avoided in the full likelihood approach, since it does not require specification of the sampling mechanism. The inverse-probability-weighted estimating function only requires specification of the first-order selection probabilities, but the possible dependencies induced by the sampling mechanism need to be addressed in the variance estimation step. The full and conditional likelihood approaches gave equivalent efficiencies in our simulated setting, although the result might be different if the first-phase data involve covariates which are highly predictive of the second-phase covariate of interest. In any case, both likelihood-based methods gave a clear improvement in efficiency in the secondary analysis setting compared to the inverse-probability-weighted method.
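As an illustration of the inverse-probability-weighted approach under Bernoulli-type case-cohort sampling, the following minimal Python sketch simulates a cohort, selects all cases and a random fraction of noncases, and fits a linear secondary outcome model by weighted least squares using the inverses of the first-order inclusion probabilities. It is not the code used for the analyses in this paper: the function name `simulate_and_fit`, the data-generating model, and all parameter values are hypothetical choices made only for illustration, and variance estimation is not covered.

```python
import numpy as np


def simulate_and_fit(n=50_000, beta=0.5, p_noncase=0.1, seed=1):
    """Simulate a cohort, draw a Bernoulli case-cohort-type sample, fit y ~ x.

    All numerical settings are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Phase 1: hypothetical cohort. x is the expensive second-phase covariate,
    # y the secondary outcome, and the event time depends on y.
    x = rng.binomial(1, 0.3, size=n).astype(float)
    y = beta * x + rng.normal(size=n)
    rate = 0.05 * np.exp(0.5 * y)
    event_time = rng.exponential(scale=1.0 / rate)
    case = event_time < 5.0  # administrative censoring at 5 time units

    # Phase 2: all cases, plus each noncase independently with prob. p_noncase.
    pi = np.where(case, 1.0, p_noncase)  # first-order inclusion probabilities
    selected = rng.random(n) < pi

    X = np.column_stack([np.ones(selected.sum()), x[selected]])
    ys = y[selected]
    w = 1.0 / pi[selected]

    # Weighted least squares: solve (X' W X) b = X' W y.
    Xw = X * w[:, None]
    ipw = np.linalg.solve(Xw.T @ X, Xw.T @ ys)

    # Unweighted fit that ignores the outcome-dependent selection.
    naive = np.linalg.lstsq(X, ys, rcond=None)[0]

    return {
        "ipw": ipw,
        "naive": naive,
        "n_cohort": n,
        "n_selected": int(selected.sum()),
    }


if __name__ == "__main__":
    result = simulate_and_fit()
    print("IPW (intercept, slope):  ", result["ipw"])
    print("Naive (intercept, slope):", result["naive"])
```

In this sketch the weights take only two values, 1 for cases and `1 / p_noncase` for sampled noncases. The point estimates need nothing beyond these first-order probabilities, whereas a valid standard error would additionally have to account for the sampling design, as discussed above.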

The use of the expression (

Denoting

Alternatively, bootstrap variance estimation might be utilized, but it should be noted that the standard bootstrap would be valid only under Bernoulli-type sampling designs, which can be interpreted as sampling from an infinite population, whereas dependencies induced by sampling without replacement would require application of finite population bootstrap methods (e.g., [

Consider a special case where the observed data are only

In the following, we suppress the covariates

The majority of this work was carried out when the first author was working at the Department of Chronic Disease Prevention of the National Institute for Health and Welfare, Helsinki, Finland. The work was partly supported by the European Commission through the Seventh Framework Programme CHANCES Project [HEALTH-F3-2010-242244]. The authors would like to thank Professor Jarmo Virtamo and Professor Markus Perola of the National Institute for Health and Welfare for permission to use the ATBC data in our illustration.