Regression Analysis for Outcome-Dependent Sampling Design under the Covariate-Adjusted Additive Hazards Model

This paper studies an economical and effective sampling scheme, the outcome-dependent sampling (ODS) design, for large-scale cohort research. First, we explore how to fit the covariate-adjusted additive hazards model under the ODS design. Second, we estimate the distortion function by nonparametric regression, which requires observing the confounding factor that distorts the covariate. Third, we calibrate the contaminated covariates and propose estimators of the parameters based on the calibrated covariates. Finally, we establish the large-sample properties, including asymptotic normality, of the proposed estimators and conduct simulations to evaluate the finite-sample performance of the proposed method. Results from both simulated and real data indicate that the proposed method performs well and is practical.


Introduction
In large cohort studies, the major cost is usually tied up in collecting expensive exposure variables, which burdens researchers working with limited budgets. Simple random sampling, being costly and time consuming, is therefore often impractical, and many cost-effective sampling strategies have been proposed. In the 1980s, Prentice [1] proposed the case-cohort design, in which the exposure variables are measured on a simple random sample, called the subcohort, as well as on all cases that experience the event of interest. Since then, applications of case-cohort sampling in survival analysis have been reported by Self and Prentice [2], Tsai [3], and Kim et al. [4]. The case-cohort design is an economical and effective sampling technique for rare events. When the censoring rate is low or moderate, the generalized case-cohort design has been developed to lower the research cost further: in addition to randomly selecting a subcohort from the entire cohort, covariate information is collected for only a subset of the failures (e.g., Chen [5], Cai and Zeng [6], and Kang and Cai [7]).
The failure-time outcome-dependent sampling (ODS) design is another economical and effective alternative to simple random sampling. Under this design, exposure variables are measured on samples from two components: the subcohort and additional supplementary samples (see Chatterjee et al. [8], Zhou et al. [9], and Weaver and Zhou [10]). A substantial literature on ODS data has developed. For completely observed data, Zhou et al. [11,12] and Qin and Zhou [13] provide comprehensive analyses of inference methods based on partial linear regression models for data from the ODS design.
Yan et al. [14] studied systematically how to fit the generalized linear model to data obtained from a two-stage ODS design. For censored data, Ding et al. [15] used the ODS design to estimate the impact of environmental pollutants on women's subfertility, and Yu et al. [16] discussed an estimating equation method for an ODS sampling scheme under Cox's proportional hazards model.
In clinical trials and biomedical studies, covariates are often not observed directly; instead, what is observed is the covariate multiplied by an unknown function of an observable confounder. Regression with such contaminated covariates was introduced by Şentürk and Müller [17,18]. In this setting, the contamination of the model covariates cannot be ignored; otherwise, the estimator is biased and the statistical inference may be misleading. Since then, numerous extensions have been developed by Şentürk and Müller [19], Cui et al. [20], and Li et al. [21]. For completely observed data, Cui [22] proposed calibrating the contaminated variables by nonparametric kernel estimation and then estimating the parameters under the covariate-adjusted linear model, and Zhang et al. [23] extended the method to nonlinear models with contaminated variables. Delaigle et al. [24] derived several nonparametric covariate-adjusted estimators of the conditional mean function. Zhao and Xie [25] proposed a nonparametric test for covariate-adjusted models and found that the test statistic has the same limiting distribution as when the response and predictors are observed directly. For survival data with censoring, very few studies have investigated survival models with contaminated covariates, and even fewer have done so under the ODS design considered here.
In this paper, we study the following covariate-adjusted additive hazards model under the ODS design:

$$\lambda(t \mid Z) = \lambda_0(t) + \beta^T Z = \lambda_0(t) + \theta^T H + cX, \qquad W = \phi(U)X, \tag{1}$$

where $\lambda_0(t)$ is the unknown baseline hazard function, $\beta = (\theta^T, c)^T$ is the unknown $p$-dimensional parameter, $\theta$ and $c$ are $(p-1)$-dimensional and 1-dimensional parameters, respectively, $Z = (H^T, X)^T$ is the $p$-dimensional covariate, $H$ is the observed $(p-1)$-dimensional covariate, $X$ is the unobservable 1-dimensional covariate, $W$ is the actually observed 1-dimensional covariate, and $\phi(\cdot)$ is the unknown distorting function of the observable confounding variable $U$.

We use nonparametric kernel estimation to estimate the distortion function and to calibrate the covariate $X$. We then weight the contributions of the subcohort and the supplemental sample differently, which leads to a weighted estimating equation based on the calibrated covariates. Owing to the ODS design and the covariate-adjustment step, the theoretical development is challenging. To overcome this, we approximate the weighted estimating equation, and this approximation is the main basis for deriving the theoretical properties of the proposed estimator.

The rest of the paper is organized as follows. Section 2 describes how data from the ODS design are fitted to the additive hazards model with covariate adjustment. Section 3 presents the large-sample properties of the proposed estimator, and Section 4 examines its finite-sample performance through simulation. Section 5 applies the method to a dataset from a study of pulmonary exacerbations. Section 6 concludes and discusses future work.

Estimation Setup
Suppose that a cohort contains $N$ independent subjects. For the $i$th subject ($i = 1, \ldots, N$), let $T_i$ be the failure time and $C_i$ the censoring time. Define $\tilde T_i = \min(T_i, C_i)$ and $\Delta_i = I(T_i \le C_i)$, and let $Y_i(t) = I(\tilde T_i \ge t)$, $N_i(t) = \Delta_i I(\tilde T_i \le t)$, and $Z_i$ be the at-risk process, the counting process, and the time-independent $p$-dimensional exposure variable, respectively. Denote by $\tau$ the study end time. The additive hazards model proposed by Lin and Ying [26] is

$$\lambda(t \mid Z_i) = \lambda_0(t) + \beta^T Z_i, \tag{2}$$

where $\lambda_0(t)$ is the unknown baseline hazard function and $\beta$ is the $p$-dimensional parameter. If the exposure of every subject were available, the following estimating function would commonly be used for inference on $\beta$:

$$U(\beta) = \sum_{i=1}^{N} \int_0^{\tau} \{Z_i - \bar Z(t)\}\{dN_i(t) - Y_i(t)\beta^T Z_i\,dt\},$$

where $\bar Z(t) = \sum_{j=1}^{N} Y_j(t) Z_j / \sum_{j=1}^{N} Y_j(t)$.

Under the ODS design, the range of the failure times of the cases is divided into $K$ disjoint strata $A_l = (a_{l-1}, a_l]$, $l = 1, \ldots, K$, by positive constants $a_l$ satisfying $0 = a_0 < \cdots < a_{l-1} < a_l < \cdots < a_K = \tau$. We first draw a simple random sample (SRS) of $n_0$ individuals from the cohort and let $\xi_i$ be the indicator taking value 1 if the $i$th subject is selected into the SRS and 0 otherwise. Denote $\alpha = P(\xi_i = 1) = n_0/N$. We then select a subset of the strata, denoted $A_k$, and draw $n_k$ ($k = 1, \ldots, K$) additional samples from the members who experience failure in stratum $A_k$ and are not in the SRS. Let $\eta_{ik}$ indicate whether the $i$th individual from $A_k$ is selected into the additional samples, and let $c_k$ denote the corresponding selection probability, which depends on the number of cohort failures and the number of SRS failures, $n_{0,k}$, that fall into $A_k$. The SRS of size $n_0$ together with the additional samples of sizes $n_k$ ($k = 1, \ldots, K$) make up the ODS sample.
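The sampling scheme described above can be sketched in code. The following is a minimal illustration only, assuming strata defined by interior cutpoints and supplemental draws made without replacement; the function name `ods_sample` and its arguments are hypothetical and do not come from the paper.

```python
import numpy as np

def ods_sample(time, delta, cutpoints, n0, n_extra, rng):
    """Draw SRS and supplemental indicators for an ODS design (illustrative).

    time, delta : observed times and failure indicators (length N)
    cutpoints   : increasing interior cutpoints a_1 < ... < a_{K-1};
                  stratum l (0-based) is the interval (a_l, a_{l+1}]
    n0          : size of the simple random sample (subcohort)
    n_extra     : dict {0-based stratum index: supplemental sample size n_k}
    rng         : numpy.random.Generator
    """
    time = np.asarray(time, dtype=float)
    delta = np.asarray(delta)
    N = len(time)

    # xi_i = 1 if subject i is drawn into the SRS (without replacement).
    xi = np.zeros(N, dtype=int)
    xi[rng.choice(N, size=n0, replace=False)] = 1

    # side="left" maps a value v with a_{l-1} < v <= a_l to index l-1.
    stratum = np.searchsorted(cutpoints, time, side="left")

    # eta_i = 1 if subject i is drawn as a supplemental case: it must be a
    # failure, outside the SRS, and inside a sampled stratum.
    eta = np.zeros(N, dtype=int)
    for k, n_k in n_extra.items():
        pool = np.flatnonzero((xi == 0) & (delta == 1) & (stratum == k))
        take = min(n_k, len(pool))
        if take > 0:
            eta[rng.choice(pool, size=take, replace=False)] = 1
    return xi, eta, stratum
```

With three strata and supplemental samples from the first and third, a call would look like `ods_sample(time, delta, cutpoints, 400, {0: 25, 2: 25}, rng)`.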
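Denote by $\Lambda_0$ the set of SRS individuals and by $\Lambda_k$ ($k = 1, \ldots, K$) the supplemental sample from $A_k$. Denote by $\bar\Lambda$ the set of individuals outside the ODS sample. The observed data under the ODS design then consist of the SRS sample ($i \in \Lambda_0$), the supplemental samples ($i \in \Lambda_k$, $k = 1, \ldots, K$), and the nonvalidation sample ($i \in \bar\Lambda$). Under the ODS design, the variable $Z$ is observed only for the selected subjects. The regression parameter $\beta$ can be estimated by solving $U_\omega(\beta) = 0$, where

$$U_\omega(\beta) = \sum_{i=1}^{N} w_i \int_0^{\tau} \{Z_i - \bar Z_w(t)\}\{dN_i(t) - Y_i(t)\beta^T Z_i\,dt\}, \tag{4}$$

$\bar Z_w(t) = S^{(1)}(t)/S^{(0)}(t)$, and $S^{(d)}(t) = N^{-1}\sum_{i=1}^{N} w_i Y_i(t) Z_i^{\otimes d}$ for $d = 0, 1, 2$, with $a^{\otimes 0} = 1$, $a^{\otimes 1} = a$, and $a^{\otimes 2} = aa^T$ for a vector $a$. The weight $w_i$ is defined by

$$w_i = (1 - \Delta_i)\frac{\xi_i}{\alpha} + \Delta_i(1 - \zeta_i)\frac{\xi_i}{\alpha} + \Delta_i \sum_{k=1}^{K} \zeta_{ik}\Big\{\xi_i + (1 - \xi_i)\frac{\eta_{ik}}{c_k}\Big\},$$

where $\Delta_i = I(T_i \le C_i)$ is the failure indicator, $\zeta_i = \sum_{k=1}^{K} \zeta_{ik}$, and $\zeta_{ik} = I(T_i \in A_k)$. Note that the weight of the nonvalidation samples is 0, whereas the censored subcohort individuals receive weight $\alpha^{-1}$. The subcohort cases receive weight 1 if their failure time belongs to $A_k$ ($k = 1, \ldots, K$) and $\alpha^{-1}$ otherwise. The selected cases that are not in the subcohort are weighted by $c_k^{-1}$ when their failure time belongs to $A_k$ ($k = 1, \ldots, K$). Because $U_\omega(\beta)$ is linear in $\beta$, the estimator $\hat\beta$ defined by (4) takes the explicit form

$$\hat\beta = \Big[\sum_{i=1}^{N} w_i \int_0^{\tau} Y_i(t)\{Z_i - \bar Z_w(t)\}^{\otimes 2}\,dt\Big]^{-1} \Big[\sum_{i=1}^{N} w_i \int_0^{\tau} \{Z_i - \bar Z_w(t)\}\,dN_i(t)\Big].$$

In practice, some covariates may be contaminated by distorting factors. In this paper, we assume $Z_i = (H_i^T, X_i)^T$, where $H_i$ is the observed $(p-1)$-dimensional covariate and $X_i$ is the unobservable 1-dimensional covariate, which satisfies

$$W_i = \phi(U_i) X_i, \tag{8}$$

where $W_i$ is the actually observed 1-dimensional variable, $U_i$ is the observed confounding covariate, and $\phi(\cdot)$ is the unknown distorting function of $U_i$. At this point, the available data under the ODS design consist of (i) the ODS sample, comprising the SRS sample and the supplemental samples, and (ii) the nonvalidation sample.

Combining model (2) and equation (8), we assume that $T_i$ ($i = 1, 2, \ldots, N$) is generated from the covariate-adjusted additive hazards model

$$\lambda(t \mid H_i, X_i) = \lambda_0(t) + \theta^T H_i + cX_i, \qquad W_i = \phi(U_i) X_i, \tag{9}$$

where $\theta$ and $c$ are the $(p-1)$-dimensional and 1-dimensional regression parameters of primary interest, respectively. Following Şentürk and Müller [18], two conditions are imposed on model (9):

(C1) $E\{\phi(U_i)\} = 1$;
(C2) $U_i$ is independent of $X_i$.

Condition (C1) ensures that the mean distorting effect vanishes.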
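Based on conditions (C1) and (C2), $E(W_i) = E\{\phi(U_i)\}E(X_i) = E(X_i)$. Owing to the presence of distortion, the covariate $X_i$ is unobservable, and the estimating function (4) can no longer be used for inference on $\beta$. Using $W_i$ directly in place of $X_i$ may lead to inaccurate statistical inference. Therefore, we calibrate the covariate $X_i$ based on the observed covariate $W_i$ and the confounder $U_i$. From (8) and condition (C2), we obtain

$$\Phi(u) := E(W_i \mid U_i = u) = \phi(u)E(X_i), \tag{10}$$

and we adopt the kernel method to estimate $\Phi(u)$:

$$\hat\Phi(u) = \frac{\sum_{i=1}^{n} K_h(u - U_i) W_i}{\sum_{i=1}^{n} K_h(u - U_i)}, \qquad K_h(\cdot) = K(\cdot/h)/h, \tag{11}$$

where $K(\cdot)$ is a kernel function and $h$ is a bandwidth. It is easy to show that $\bar W = n^{-1}\sum_{i=1}^{n} W_i$ converges almost surely to $E(X_i)$. By (10) and (11), the distorting function $\phi(u)$ can be estimated by

$$\hat\phi(u) = \hat\Phi(u)/\bar W,$$

and the covariate $X_i$ can be calibrated by

$$\hat X_i = W_i/\hat\phi(U_i).$$

Denote $\hat Z_i = (H_i^T, \hat X_i)^T$. The proposed estimator $\hat\beta_p$ for model (9) is defined as the solution of the weighted estimating equation (4) with $Z_i$ replaced by $\hat Z_i$. By a simple calculation, its explicit form is

$$\hat\beta_p = \Big[\sum_{i=1}^{N} w_i \int_0^{\tau} Y_i(t)\{\hat Z_i - \bar{\hat Z}(t)\}^{\otimes 2}\,dt\Big]^{-1} \Big[\sum_{i=1}^{N} w_i \int_0^{\tau} \{\hat Z_i - \bar{\hat Z}(t)\}\,dN_i(t)\Big], \tag{15}$$

where $\bar{\hat Z}(t) = \sum_{i=1}^{N} w_i Y_i(t)\hat Z_i / \sum_{i=1}^{N} w_i Y_i(t)$.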
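The steps of this section (weighting, calibration, and solving the weighted estimating equation) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the weighted estimating function has the usual Lin and Ying form with each subject's contribution multiplied by $w_i$, uses an Epanechnikov kernel for simplicity, assumes there are no tied observation times, and computes the kernel estimate on whatever sample is passed in. All function names (`ods_weights`, `calibrate_covariate`, `weighted_additive_hazards`) are hypothetical.

```python
import numpy as np

def ods_weights(xi, eta, delta, zeta, alpha, c):
    """Weights w_i following the verbal description in the text.

    xi: 1 if in the SRS subcohort; eta: 1 if drawn as a supplemental case;
    delta: 1 if the failure is observed; zeta: 1 if the failure time lies in
    a sampled stratum A_k; alpha: n_0 / N; c: supplemental sampling
    probability c_k of each subject's stratum (scalar or array).
    """
    xi, eta, delta, zeta = (np.asarray(a) for a in (xi, eta, delta, zeta))
    case_in_stratum = (delta == 1) & (zeta == 1)
    w_srs = np.where(case_in_stratum, 1.0, 1.0 / alpha)  # subcohort members
    w_sup = eta / np.asarray(c, dtype=float)             # supplemental cases
    return xi * w_srs + (1 - xi) * w_sup                 # nonvalidation -> 0

def calibrate_covariate(W, U, h):
    """Return X_hat = W / phi_hat(U) with phi_hat = Phi_hat / mean(W)."""
    W, U = np.asarray(W, dtype=float), np.asarray(U, dtype=float)
    t = (U[:, None] - U[None, :]) / h
    K = 0.75 * np.clip(1.0 - t**2, 0.0, None)   # Epanechnikov (assumption)
    Phi_hat = (K @ W) / K.sum(axis=1)           # Nadaraya-Watson at each U_i
    phi_hat = Phi_hat / W.mean()
    return W / phi_hat

def weighted_additive_hazards(time, delta, Z, w):
    """Closed-form solution of the weighted Lin-Ying estimating equation."""
    time, delta, w = (np.asarray(a, dtype=float) for a in (time, delta, w))
    Z = np.asarray(Z, dtype=float)               # shape (n, p)
    keep = w > 0                                 # zero-weight subjects drop out
    order = np.argsort(time[keep])
    time, delta, Z, w = (a[keep][order] for a in (time, delta, Z, w))

    # On the interval ending at the j-th ordered time, the risk set is {j, ...}.
    dt = np.diff(time, prepend=0.0)
    rev = lambda a: np.cumsum(a[::-1], axis=0)[::-1]   # sums over the risk set
    S0 = rev(w)
    S1 = rev(w[:, None] * Z)
    S2 = rev(w[:, None, None] * Z[:, :, None] * Z[:, None, :])
    Zbar = S1 / S0[:, None]

    # A = sum_i w_i int Y_i (Z_i - Zbar)^{(x)2} dt,  B = sum_i w_i int (Z_i - Zbar) dN_i
    A = np.einsum("j,jab->ab", dt, S2 - S1[:, :, None] * Zbar[:, None, :])
    B = ((w * delta)[:, None] * (Z - Zbar)).sum(axis=0)
    return np.linalg.solve(A, B)
```

A typical use would calibrate first, `X_hat = calibrate_covariate(W, U, h)`, stack `Z_hat = np.column_stack([H, X_hat])`, and then call `weighted_additive_hazards(time, delta, Z_hat, w)` with the weights from `ods_weights`.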

Main Results
In this section, we establish the asymptotic properties of $\hat\beta_p$ in (15). First, we give the following definition.

Definition 1.
The quantities defined here involve $M_i(t)$, a locally square integrable martingale (Lin and Ying [26]), and $\beta_0 = (\theta_0^T, c_0)^T$, the true parameter value.
Second, the following additional regularity conditions, (C3)-(C12), are imposed. One of them concerns the third-order derivatives of $f(u)$ and $g(u)$: there exist $a > 0$ and a neighborhood of the origin such that a bound on $|f^{(3)}|$ holds whenever $\delta$ falls in this neighborhood. These conditions are mild and hold in many circumstances. Conditions (C3)-(C8) are regularity conditions on the regression parameters, similar to those of Yu et al. [27], while conditions (C9)-(C12) can be traced to Cui et al. [20].

To prove Theorem 1, the following definitions and lemmas are needed.
For convenience in the proofs, we define partitioned matrices.

Proof. The first convergence result follows from the Glivenko-Cantelli theorem. By Corollary III.2 of Andersen and Gill [28], the uniform convergence of $S^{(1)}(t)$ and $S^{(2)}(t)$ can be shown similarly.
where $\psi(h, x, t)$ is a function of $h$, $x$, and $t$ satisfying $E[\psi(H, X, t)]^2 < \infty$, and

□
To prove Theorem 2, the following lemma is needed.

Proof of Theorem 2. By Lemma 4, we have
Through the calculation of iterated expectations, we obtain a first expression. By Yu et al. [27], we obtain a second, in which the matrices evaluated at $\beta_0$ are those appearing in (20) and (21), respectively. A simple calculation gives a further identity, and conditions (C1) and (C2) give another. Combining equations (61) and (62), and then using (58)-(60) and (63), we obtain the expansion of the estimating function, where $\beta_n$ is a point on the line segment between $\hat\beta_p$ and $\beta_0$, and

where $R$ is a $(p + 1)$-dimensional square matrix whose elements are all zero except for the $(p + 1, p + 1)$ element, which is 1. Therefore, we conclude the stated result, where $\sigma$ is the $(p + 1, p + 1)$ element of the corresponding matrix. This completes the proof of Theorem 2, which establishes the asymptotic normality of the proposed estimator. The argument takes a different viewpoint from that of previous research.

Numerical Approach
In this section, we carry out simulation studies. The underlying additive hazards model is

$$\lambda(t \mid H, X) = \lambda_0(t) + \theta^T H + cX,$$

where the baseline hazard function $\lambda_0(t)$ is set to $3t^2 + 1$ or $e^{-t} + t$. The true parameters θ = 0. The observed covariate is $W = \phi(U)X$. We choose the higher-order kernel function $K(t) = (15/32)(3 - 7t^2)(1 - t^2) I(|t| \le 1)$ and use leave-one-out cross-validation to select the bandwidth.
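The kernel and the bandwidth selection mentioned above can be illustrated with a short sketch. It assumes that the cross-validation criterion is the leave-one-out squared prediction error of the Nadaraya-Watson estimate of $E(W \mid U)$ and that the bandwidth is chosen from a user-supplied grid; the paper does not spell out these details, and the function names are hypothetical.

```python
import numpy as np

def fourth_order_kernel(t):
    """K(t) = (15/32)(3 - 7t^2)(1 - t^2) on |t| <= 1, zero elsewhere."""
    t = np.asarray(t, dtype=float)
    return (15.0 / 32.0) * (3.0 - 7.0 * t**2) * (1.0 - t**2) * (np.abs(t) <= 1)

def loo_cv_bandwidth(W, U, grid, kernel=fourth_order_kernel):
    """Pick h from `grid` minimising the leave-one-out prediction error."""
    W, U = np.asarray(W, dtype=float), np.asarray(U, dtype=float)
    D = U[:, None] - U[None, :]
    best_h, best_score = None, np.inf
    for h in grid:
        K = kernel(D / h)
        np.fill_diagonal(K, 0.0)        # leave observation i out of its own fit
        den = K.sum(axis=1)
        if np.any(np.abs(den) < 1e-10): # no usable neighbours at this bandwidth
            continue
        score = np.mean((W - (K @ W) / den) ** 2)
        if score < best_score:
            best_h, best_score = h, score
    return best_h
```

This kernel takes negative values for $t^2 > 3/7$, so the leave-one-out denominators are checked before dividing.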
Under the ODS design, we sample $n_0 = 400$ subcohort individuals without replacement from a cohort of $N = 4000$. We then partition the observed failure times into three strata by their quantiles. To study the influence of different cutpoints, we consider the (0.2, 0.8) quantiles and the (0.3, 0.7) quantiles. We sample additional individuals of sizes $n_1 = 25$ and $n_3 = 25$ from the first and the third stratum. In addition, we compare our proposed covariate-adjusted estimator (Proposed) with two other estimators, namely, the oracle estimator (Oracle), which is calculated from the true covariate $X$, and the naive estimator (Naive), which is computed from the contaminated covariate $W$. Note that the oracle estimator requires observations of $X$, which are not available in real data, whereas the naive method directly treats the contaminated covariate $W$ as the true covariate $X$. Under each configuration, the results presented in Tables 1 and 2 are obtained from 1000 independently generated datasets and include the bias of the estimates (Bias), the sample standard deviation (SD), the estimated standard error (SE), and the coverage probability of the nominal 95% normal confidence interval (CP).
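The four summary measures reported in the tables can be computed from the replicated estimates as in the sketch below. It assumes that SE is the average of the estimated standard errors over replications and that CP is the proportion of Wald-type intervals covering the true value; the function name `summarize` and its arguments are illustrative.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.959964):
    """Bias, SD, SE and CP for one parameter over simulation replications.

    estimates  : estimates of the parameter, one per replication
    std_errors : estimated standard errors, one per replication
    truth      : true parameter value
    z          : normal quantile for the nominal 95% interval
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    return {
        "Bias": estimates.mean() - truth,          # mean estimate minus truth
        "SD": estimates.std(ddof=1),               # sample standard deviation
        "SE": std_errors.mean(),                   # average estimated SE
        "CP": np.mean((lower <= truth) & (truth <= upper)),
    }
```

When the variance estimator is adequate, SE should be close to SD and CP close to 0.95.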
As expected, the oracle estimator performs best among the three estimators. The proposed estimators for both $\theta$ and $c$ are approximately unbiased, and their performance is comparable to that of the oracle estimator. Moreover, the coverage of the 95% normal confidence intervals is reasonable. The naive estimator of $c$, in contrast, is biased. The covariate-adjustment process removes this bias, in agreement with the main results in Section 3. In addition, the efficiency gains are higher with cutpoints (0.2, 0.8) than with cutpoints (0.3, 0.7).
Additionally, we conduct simulation studies to evaluate the behavior of the proposed method when the censoring time $C$ depends on the covariate. The setup is the same as in Tables 1 and 2, except for the distribution of the censoring time $C$. The results are reported in Table 3 for $\lambda_0(t) = 3t^2 + 1$ and in Table 4 for $\lambda_0(t) = e^{-t} + t$, and they show that the proposed method performs satisfactorily in these cases.

Empirical Analysis
We now apply the proposed method to a real dataset containing 641 patients. The accumulation of extracellular DNA in the lung during bacterial infection can cause progressive deterioration of lung function and aggravation of respiratory symptoms in patients with cystic fibrosis. The outcome of interest is the time to relapse, and its censoring rate is approximately 62.4%. Under the ODS design, we sample 200 individuals as the subcohort. We partition the failure times of the cases that are not in the subcohort into three strata, using the two kinds of cutpoints considered in the simulations. Supplemental samples of sizes $n_1 = 10$ and $n_3 = 10$ are selected from the first and the third stratum.
Two covariates are considered: vital capacity and the patient's type of treatment (Type), which is either placebo or rhDNase. Forced expiratory volume was measured twice, denoted by FEV$_1$ and FEV$_2$, and we take FEV = (FEV$_1$ + FEV$_2$)/2 as a distorted index of vital capacity. Based on this average, the confounding factor $U$ appears to follow a uniform distribution over [0, 1]. We fit the following additive hazards model to study the effect of Type and FEV on the failure time:

$$\lambda(t \mid \text{Type}, \text{FEV}) = \lambda_0(t) + \theta\,\text{Type} + c\,\text{FEV}. \tag{69}$$

The Kaplan-Meier survival curves are drawn by treatment type and by the level of FEV (adjusted FEV) of the patients. For this purpose, we code FEV (adjusted FEV) as 1 when it is at least the median of FEV (the median of adjusted FEV) and as 0 otherwise. As shown in Figure 1, the distortion did affect the relation between FEV and survival probability, and patients with placebo or with lower FEV (lower adjusted FEV) tend to have lower survival probabilities.
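The construction of the curves described above, dichotomizing at the median and then applying the Kaplan-Meier estimator within each group, can be sketched as follows. This is a generic product-limit implementation for illustration, not the code used for Figure 1; the function names are hypothetical.

```python
import numpy as np

def dichotomize_at_median(x):
    """Code a covariate as 1 when it is at least its sample median, else 0."""
    x = np.asarray(x, dtype=float)
    return (x >= np.median(x)).astype(int)

def kaplan_meier(time, event):
    """Product-limit estimate of the survival function.

    Returns the distinct event times and the estimated survival probability
    just after each of them.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    event_times = np.unique(time[event == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)                  # still under observation
        failures = np.sum((time == t) & (event == 1))
        s *= 1.0 - failures / at_risk
        surv.append(s)
    return event_times, np.array(surv)
```

One curve per group is then obtained by calling `kaplan_meier` on the subset of patients with `dichotomize_at_median(fev) == 1` and on the subset with value 0.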
We estimate the coefficients in model (69) with the proposed covariate-adjusted approach and report them in the column Est. Repeating the estimation process 1000 times, we report the SE and the Bias, where the Bias is the average of the parameter estimates minus the corresponding Est. We also compute the estimator based on the contaminated covariate. The results based on our method are listed in the columns under Proposed, and the results based on the contaminated covariate are listed in the columns under Naive in Table 5. These results show that, as FEV increases, the risk of relapse with pulmonary exacerbations decreases. The treatment type rhDNase can decrease the risk of death, which is consistent with Figure 1. Moreover, owing to the covariate-adjustment process, the bias of the proposed method is smaller than that of the Naive method. The sample standard error with cutpoints (0.2, 0.8) is smaller than that with cutpoints (0.3, 0.7), which accords with the simulation results.

Conclusion
The ODS design lowers the cost of measuring expensive exposure variables and improves efficiency in large-scale cohort studies, and adjusting for contaminated covariates avoids biased estimates and improves interpretation. First, this paper has shown how to fit the covariate-adjusted additive hazards model to data from the ODS design. Second, to solve the problems caused by contaminated covariates and biased sampling schemes, this paper uses nonparametric kernel estimation to calibrate the contaminated covariates and uses inverse probability weighting based on the calibrated covariates to construct the weighted estimating function. Third, this paper establishes the theoretical properties of the estimator obtained from the proposed weighted estimating function. The simulation studies show that the proposed estimator performs well in finite samples, and the real data analysis shows that the method can be implemented in practice.
Several directions remain for future research. First, building on the methods and conclusions of this paper for covariate adjustment in the additive hazards model, it would be worthwhile to study covariate adjustment in ODS studies under other models, for example, the accelerated failure time model discussed by Lin et al. [29], the accelerated hazard model studied by Chen and Wang [30], the nonautonomous SIRS model discussed by Lv and Meng [31], and the dynamic model of the constantor studied in [32]. Second, to make the design more practical, the determination of sample sizes and the optimal sample allocation, as suggested by Yu et al. [16], may also be an interesting topic. Furthermore, we hope that future research will contribute to the development of survival models with multiple disease outcomes, as mentioned by Kang and Cai [7].

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

"Naive" denotes the naive estimator calculated based on the unadjusted covariate. "Proposed" denotes the proposed estimator calculated based on the adjusted covariate.