Additive Hazard Regression for the Analysis of Clustered Survival Data from Case-Cohort Studies

The case-cohort design is an effective and economical method for large cohort studies, especially when the disease rate is low. Most of the existing literature applies the case-cohort design to univariate failure time data. In practice, however, multivariate failure time data are commonly encountered in biomedical research. In this paper, we propose estimating-equation methods for clustered survival data from case-cohort designs. By introducing the event failure rate, three different weight functions are constructed. Then, three estimating equations and the corresponding parameter estimators are presented. Furthermore, consistency and asymptotic normality of the proposed estimators are established. Finally, simulation results show that the proposed estimation procedure has reasonable finite-sample behavior.


Introduction
In many failure time studies, there is a natural clustering of study subjects such that failure times within the same cluster may be correlated. For example, in a clinical trial, the failure times of patients within a family may be correlated because they share genetic characteristics and environmental factors. Another example is the time to onset of blindness in the left and right eyes of patients with diabetic retinopathy. In such cases, the focus is on the potential clustering in the data: observations from different clusters are independent, while observations within a cluster may be correlated. Interested readers can refer to [1][2][3][4] and the references therein.
In epidemiological research, the collection of detailed follow-up information is costly and time consuming, especially when the incidence of disease is low. One may consider the case-cohort design, proposed by Prentice [5], which is widely used for large cohort studies. It entails collecting covariate data for all subjects who experienced the event of interest in the full cohort and for a random sample from the entire cohort. It serves the same goal of studying risk factors as collecting the full data. Therefore, many studies of case-cohort data have been carried out. For example, the design has been applied to the Cox proportional hazards model [6], additive hazards regression model [7], semiparametric transformation model [8], accelerated failure time model [9], and additive multiple risk regression model [10,11].
However, there is little research on clustered failure time data under the case-cohort design. For example, Lu and Shih [12] applied the proportional hazards model to clustered failure time data under the case-cohort design and proposed three case-cohort sampling designs for selecting the subcohort. Under the different sampling mechanisms, estimating equations are established, and the large-sample properties of the resulting estimators are proved. Unfortunately, the risk set considered by Lu and Shih [12] only contains the individuals in the subcohort. Furthermore, Zhang et al. [13] incorporated the failure information from outside the subcohort into the proportional hazards model and proposed three different estimating equations and parameter estimators. The numerical simulation results show that the proposed model is effective. Note that the above literature studied clustered case-cohort data based on marginal proportional hazards models. In some practical cases, the effect of covariates may be additive. Then, the proportional hazards model is no longer applicable. Therefore, in this paper, an additive hazards model is applied to clustered data in the case-cohort design. Moreover, we construct the weighted estimating equation, propose estimators of the regression parameters, and prove the asymptotic properties of the estimators.
The remainder of the paper is organized as follows. In Section 2, we describe the proposed estimation procedures. Then, we establish the consistency and asymptotic normality of the resulting estimator in Section 3. Numerical evidence in support of the theory is presented in Section 4. Finally, in Section 5, some conclusions are drawn.

Model and Estimation Procedures
Suppose the full cohort consists of n independent clusters, and the i-th cluster (i = 1, ..., n) has m_i correlated subjects. We assume that subjects within the same cluster are exchangeable. In advance of follow-up, a random sample of the entire cohort, called the subcohort, is selected. Covariate data are then collected from individuals in the subcohort as well as from those observed to fail in the entire cohort. We consider the following three designs to obtain the subcohort [13]:
Design A: randomly sample individuals from each cluster with Bernoulli sampling. In other words, each individual in each cluster has an independent fixed probability of being selected into the subcohort.
Design B: randomly sample clusters from the full cohort with Bernoulli sampling.
Design C: randomly sample clusters from the full cohort with Bernoulli sampling and then randomly sample subjects with Bernoulli sampling from the selected clusters.
Note that Design A and Design B are special cases of Design C.
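The three sampling designs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_subcohort`, its parameters `cluster_prob` and `subject_prob`, and the fixed cluster size of 40 in the usage lines are hypothetical and do not come from the paper. The sketch implements the two-stage Design C and recovers Design A and Design B by setting one of the two probabilities to 1.

```python
import random


def sample_subcohort(cluster_sizes, cluster_prob, subject_prob, rng):
    """Return the set of (cluster, subject) index pairs in the subcohort.

    Two-stage Bernoulli sampling (Design C). Special cases:
      Design A: cluster_prob = 1.0 (every cluster kept, subjects sampled)
      Design B: subject_prob = 1.0 (clusters sampled, all their subjects kept)
    """
    selected = set()
    for i, m_i in enumerate(cluster_sizes):
        # Stage 1: one Bernoulli draw decides whether cluster i is kept.
        if rng.random() >= cluster_prob:
            continue
        for j in range(m_i):
            # Stage 2: an independent Bernoulli draw for each subject.
            if rng.random() < subject_prob:
                selected.add((i, j))
    return selected


# Hypothetical usage: 100 clusters of 40 subjects each.
rng = random.Random(0)
sizes = [40] * 100
design_a = sample_subcohort(sizes, 1.0, 0.2, rng)
design_b = sample_subcohort(sizes, 0.2, 1.0, rng)
design_c = sample_subcohort(sizes, 0.4, 0.5, rng)
```

Writing Design A and Design B as calls to the same function makes the special-case relationship explicit: only the probability that is set to 1 changes.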
Let T_ij and C_ij denote the potential failure time and the potential censoring time for the j-th subject in the i-th cluster, respectively. Let Z_ij(t) represent a possibly time-dependent p × 1 vector of covariates, restricted to be external. We assume that T_ij and C_ij are independent conditional on Z_ij(·). The observed time is min(T_ij, C_ij). Let H_i denote the indicator for the i-th cluster being selected into the subcohort and H_ij denote the indicator for the j-th subject in the i-th cluster being selected into the sample. Both H_i and H_ij are independent Bernoulli variables. Suppose that the marginal hazard function λ_ij(t) is associated with Z_ij(t) through the additive model
λ_ij(t) = λ_0(t) + β_0^T Z_ij(t), (1)
where λ_0(·) is an unspecified baseline hazard function and β_0 is a p × 1 vector of regression parameters.
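To make the interpretation of the regression parameter concrete, the following short derivation may help. It assumes the additive form of model (1) described in the text, namely a baseline hazard plus a linear covariate term, and it further assumes a time-independent covariate Z, which is a simplification introduced here for illustration.

```latex
% Assumed setting: additive model (1) with a time-independent covariate Z.
\[
\lambda(t \mid Z) = \lambda_0(t) + \beta_0^{\top} Z
\quad\Longrightarrow\quad
S(t \mid Z) = \exp\Bigl\{-\Lambda_0(t) - \beta_0^{\top} Z\, t\Bigr\},
\qquad
\Lambda_0(t) = \int_0^t \lambda_0(u)\,du .
\]
% For a scalar binary covariate, the coefficient is the hazard difference.
\[
\lambda(t \mid Z = 1) - \lambda(t \mid Z = 0) = \beta_0 .
\]
```

Under these assumptions the coefficient is an absolute difference in hazards, in contrast to the hazard ratio obtained from a proportional hazards model.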
If the full cohort data were available, the true regression parameter β_0 in (1) could be estimated from the estimating function U_H(β) of [14]. The resulting estimator, β̂_H, is the solution to the estimating equation U_H(β) = 0 and can be written explicitly, where τ < ∞ and a^{⊗2} = aa^T. For clustered failure time data from case-cohort studies, we consider the observed-event probability, Pr(δ_ij = 1), which we denote by p_0 [15]. We propose three procedures to estimate β_0, which use the same designs as Zhang et al. [13] except that we consider the additive hazard regression model. We then develop a weighted estimating function U(β, p_0); solving the estimating equation U(β, p_0) = 0 yields an estimator β̂ of β_0. However, p_0 is usually unknown. When the study cohort is well defined, we can use p_w, the full cohort case proportion, to estimate p_0. When the study cohort is less well defined, p_s, the subcohort case proportion, is a suitable alternative. The corresponding estimator of β_0 is written as β̂_w or β̂_s. The cumulative baseline hazard function, Λ_0(t) = ∫_0^t λ_0(u) du, can also be consistently estimated. In the weighted estimating equation (4) and the baseline hazard estimator (6), either p_s, p_w, or p_0 could be used in place of the parameter p.
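As a point of reference for the full-cohort case, the sketch below computes the closed-form estimator that is standard for the additive hazards model: the sum over observed failures of Z minus the risk-set average of Z, divided by the integral over time of the within-risk-set sum of squared deviations of Z. Several things here are assumptions made for illustration and are not taken from the paper: a single time-independent scalar covariate, the function name `additive_hazard_estimate`, the record layout `(x, delta, z)`, and the toy data. The sketch pools all subjects, so it does not include the case-cohort weights or any of the paper's three weighting schemes.

```python
import random


def additive_hazard_estimate(records):
    """Closed-form full-cohort estimate for a scalar, time-fixed covariate.

    records: iterable of (x, delta, z) with observed time x, failure
    indicator delta (1 = failure observed), and covariate z.
    Raises ZeroDivisionError if z never varies within any risk set.
    """
    data = sorted(records, key=lambda r: r[0], reverse=True)
    n = len(data)
    s0 = s1 = s2 = 0.0      # risk-set sums of 1, z and z**2
    numer = denom = 0.0
    idx = 0
    while idx < n:
        t = data[idx][0]
        end = idx
        # Everyone with observed time >= t is at risk at time t.
        while end < n and data[end][0] == t:
            z = data[end][2]
            s0 += 1.0
            s1 += z
            s2 += z * z
            end += 1
        zbar = s1 / s0
        # Failures at time t contribute z minus the risk-set average.
        for _, delta, z in data[idx:end]:
            if delta:
                numer += z - zbar
        # The risk set is constant on (t_next, t]; integrate over it.
        t_next = data[end][0] if end < n else 0.0
        denom += (s2 - s1 * s1 / s0) * (t - t_next)
        idx = end
    return numer / denom


# Toy data (assumed): hazard 1 + 0.5 * z, censoring fixed at time 1.
rng = random.Random(7)
toy = []
for _ in range(5000):
    z = 1.0 if rng.random() < 0.5 else 0.0
    t = rng.expovariate(1.0 + 0.5 * z)
    toy.append((min(t, 1.0), 1 if t <= 1.0 else 0, z))
beta_hat = additive_hazard_estimate(toy)
```

Because the risk set only changes at observed times, the time integral reduces to a finite sum over the intervals between consecutive observed times, which is why no numerical integration is needed.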

Asymptotic Properties
To derive the asymptotic properties of the proposed estimates, we impose the following assumptions:

Numerical Studies
Simulation studies are conducted to examine the finite-sample properties of the proposed estimators. We generate clustered failure time data from n = 100 clusters. The cluster sizes m_i are simulated from a binomial (50, 0...) distribution. The total number of subjects in the full cohort is about 4000. The covariate Z_ij takes the values 1 and 0, each with probability 0.5. The failure times T_i1, T_i2, ..., T_im_i in the i-th cluster are simulated from a joint survival function (for details, see [16]) in which κ > 0 is a parameter representing the degree of dependence among the failure times. A smaller value of κ represents stronger correlation.
In our experiments, we choose κ = 0.5 or 2 and β_0 = log(0.5) or 0. The censoring times C_ij are constant and equal to 1. The observed-event probabilities p are equal to 0.45 or 0.63.
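A data-generating sketch along these lines is shown below. It rests on several assumptions that are not confirmed by the text: a Clayton-type joint survival function generated through a shared gamma frailty (the exact form used in [16] may differ), a constant baseline hazard of 1, and a binomial success probability of 0.8 for the cluster sizes, chosen so that the expected cohort size is 100 × 50 × 0.8 = 4000. With a baseline hazard of 1 and censoring at 1, the marginal event probabilities work out to about 0.63 for β_0 = 0 and about 0.45 for β_0 = log(0.5), which is in line with the values reported above. The name `simulate_cohort` and its parameters are hypothetical.

```python
import math
import random


def simulate_cohort(n_clusters, kappa, beta0, rng, baseline=1.0, size_prob=0.8):
    """Return a list of (cluster, x, delta, z) records.

    Assumptions (not from the paper): constant baseline hazard `baseline`,
    cluster sizes Binomial(50, size_prob), Clayton-type dependence.
    """
    records = []
    for i in range(n_clusters):
        # Cluster size as a sum of 50 Bernoulli draws.
        m_i = sum(rng.random() < size_prob for _ in range(50))
        # Shared Gamma(kappa, 1) frailty; smaller kappa, stronger dependence.
        w = max(rng.gammavariate(kappa, 1.0), 1e-300)
        for _ in range(m_i):
            z = 1 if rng.random() < 0.5 else 0
            hazard = baseline + beta0 * z      # additive hazard, must be > 0
            # U = (1 + E / W) ** (-kappa) is marginally Uniform(0, 1);
            # neg_log_u is -log(U), computed in a numerically stable way.
            neg_log_u = kappa * math.log1p(rng.expovariate(1.0) / w)
            t = neg_log_u / hazard             # inverts S(t) = exp(-hazard * t)
            x = min(t, 1.0)                    # censoring time fixed at 1
            delta = 1 if t <= 1.0 else 0
            records.append((i, x, delta, z))
    return records


rng = random.Random(2024)
cohort = simulate_cohort(100, kappa=0.5, beta0=math.log(0.5), rng=rng)
event_rate = sum(r[2] for r in cohort) / len(cohort)
```

The frailty is drawn once per cluster and shared by all its subjects, which is what induces the within-cluster correlation while leaving each marginal hazard additive.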
For each generated data set, under Design A, individuals within each cluster are selected into the subcohort by Bernoulli sampling with equal probability 0.2 or 0.15. Under Design B, we select clusters by Bernoulli sampling with probability 0.2 or 0.15. Under Design C, we first sample clusters by Bernoulli sampling with probability 0.4 or 0.3 and then sample individuals from the selected clusters by Bernoulli sampling with probability 0.5. Therefore, for each design, we would expect approximately 800 or 600 individuals in the subcohort.
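The expected subcohort sizes follow from multiplying the cohort size by the selection probabilities at each stage. In the worked arithmetic below, the symbols N, q_c, and q_s (approximate full cohort size, cluster-level probability, and subject-level probability) are introduced here for illustration only.

```latex
% N = 4000 (approximate cohort size); q_c, q_s are assumed symbols for the
% cluster-level and subject-level Bernoulli selection probabilities.
\[
E(n_s) \approx N \, q_c \, q_s
\]
\[
\begin{aligned}
\text{Design A } (q_c = 1):\quad & 4000 \times 0.2 = 800, & 4000 \times 0.15 &= 600,\\
\text{Design B } (q_s = 1):\quad & 4000 \times 0.2 = 800, & 4000 \times 0.15 &= 600,\\
\text{Design C}:\quad & 4000 \times 0.4 \times 0.5 = 800, & 4000 \times 0.3 \times 0.5 &= 600.
\end{aligned}
\]
```

The three designs therefore have matched expected subcohort sizes, so differences in efficiency reflect how the sample is spread across clusters rather than how large it is.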
To assess the performance of the proposed estimator, we calculate the estimated standard error based on the bootstrap method. Because the failure clusters inside and outside the subcohort have different structures, we use a bootstrap sampling scheme tailored to this structure. Tables 1 and 2 report the simulation results of our proposed estimators. For simplicity, Bias denotes the difference between the average of the estimates and the true value.
It can be seen from Tables 1 and 2 that the estimates for both β_0 = log(0.5) and β_0 = 0 are approximately unbiased, and the estimated and empirical standard deviations are close to each other. The coverage probabilities of the nominal 95% confidence intervals are reasonable. When n_s = 600 for Design B in Tables 1 and 2, the SEE and CP values are slightly too low, which is due to the small number of clusters in the subcohort. When the subcohort size n_s is increased to 800, the estimation improves noticeably. Among the three designs, Design A and Design C are more efficient than Design B, which is consistent with the conclusion in [13]. Comparing SE, SEE, and BSE shows that SE and SEE agree with BSE; that is, the bootstrap standard errors and the simulated standard errors are similar. The proposed semiparametric estimator works well. Therefore, one may use the bootstrap variance estimate for statistical inference in practice.
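The summary measures reported in simulation tables of this kind can be computed as in the following sketch. The definitions used here follow common usage and are assumptions, since the paper's own definitions are not fully legible: SE is taken as the empirical standard deviation of the estimates across replicates, SEE as the average of the estimated standard errors, and CP as the proportion of nominal 95% Wald intervals that cover the true value. The function name `summarize_replicates` is hypothetical.

```python
import statistics


def summarize_replicates(estimates, std_errors, true_value, z=1.96):
    """Summarize simulation replicates for one scalar parameter.

    estimates:  point estimates, one per replicate
    std_errors: estimated standard errors, one per replicate
    """
    bias = statistics.fmean(estimates) - true_value
    se = statistics.stdev(estimates)       # empirical SD of the estimates
    see = statistics.fmean(std_errors)     # average estimated standard error
    # A replicate "covers" if the true value lies within estimate +/- z * se.
    covered = sum(
        1 for b, s in zip(estimates, std_errors) if abs(b - true_value) <= z * s
    )
    cp = covered / len(estimates)
    return {"Bias": bias, "SE": se, "SEE": see, "CP": cp}


# Hypothetical usage with three made-up replicates.
summary = summarize_replicates([0.9, 1.1, 1.0], [0.01, 0.1, 0.1], true_value=1.0)
```

Passing bootstrap standard errors as `std_errors` gives the analogous bootstrap-based average, which can then be compared with the empirical SE.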

Conclusion
This paper proposed methods for fitting additive hazard regression models to clustered survival data from case-cohort studies. Risk differences are informative for medical research. Specifically, they describe the reduction in the number of cases developing a certain disease due to a decrease in a particular exposure. An advantage of additive hazard models is that the risk difference between exposure groups can be readily derived from the model coefficients. With respect to the different observed-event probabilities, we propose three procedures for parameter estimation. Consistency and asymptotic normality of the proposed estimators are established. The simulation results show that Design A yields more efficient estimators than Design C, and Design C is more efficient than Design B, when the subcohort sizes are approximately equal. This can be attributed to differences in the number of sampled clusters in the subcohort.
Journal of Mathematics
Then, we have
The second term on the right-hand side of (A.2) can be shown to converge to zero. Specifically, for fixed t, E{dM_ij(t)} = 0. The third term on the right-hand side of (A.2) can be written as
By functional Taylor expansion, we can obtain that the first term of (A.3) is written as
Similarly, the second term of (A.3) is written as
Under the general regularity conditions, W_i(β, p_0), i = 1, ..., n, are independent and identically distributed random vectors with mean zero and covariance E{W_1(β_0, p_0)^{⊗2}}. It follows from the multidimensional central limit theorem that
One can write
... dt. (A.9)
By the continuity of each component and (A.9), (i) is clearly satisfied. Now, (A.9) can be decomposed as follows:
... dt. (A.10)
Note that Z̄(t) converges uniformly in t to e(t) = E{Y_11(t)Z_11(t)}/E{Y_11(t)}.
Then, it follows that the first term on the right-hand side of (A.10) converges to A(β_0, p_0) in probability as n → ∞. Each of the remaining terms on the right-hand side of (A.10) can be shown to converge to zero as n → ∞. Hence, we have