The Adjustment of Covariates in Cox’s Model under Case-Cohort Design

Case-cohort design is a biased sampling method. Due to its cost-effectiveness and theoretical significance, this design has extensive application value in many large cohort studies. The case-cohort data include a subcohort sampled randomly from the entire cohort and all the failed subjects outside the subcohort. In this paper, the adjustment for distorted covariates in Cox's model is considered for case-cohort data. Following existing adjustment methods for distorted covariates in linear and nonlinear models, we propose estimating the distorting functions by nonparametrically regressing the distorted covariates on the distorting factors; the estimators for the parameters are then obtained using the estimated covariates. Consistency and asymptotic normality of the proposed estimators are proved. To calculate the maximum likelihood estimates of the regression coefficients in Cox's model, a minorization-maximization (MM) algorithm is developed. Simulation studies are performed to compare the estimations with the covariates undistorted, distorted, and adjusted to illustrate the proposed methods.


Introduction
Many survival datasets share the characteristics of large sample size and high censoring rate. The cost is very high when all variables of each individual in the data are measured. To reduce the cost and improve the efficiency of cohort studies, Prentice [1] first proposed the case-cohort design and gave a pseudolikelihood method to estimate the regression parameters. Since the publication of the landmark article [1], the case-cohort design has been applied more and more widely, especially with the development of big data in recent years. For example, some scholars have applied this design to the life sciences [2][3][4][5] and natural disasters [6], and others have further improved the design [7,8].
In practice, some subjects may be interfered with by other factors due to their characteristics, so their true covariates cannot be observed; only the distorted covariates and the distorting factors are observable. For example, in the Modification of Diet in Renal Disease (MDRD) study [9,10], glomerular filtration rate (GFR) and serum creatinine (SCr) data are distorted by body surface area (BSA). To reveal the relationship between the two covariates, we need to adjust them by correcting for the distorting effect of BSA. Covariate-adjusted regression was introduced for situations where both the predictors and the response in a regression model are not directly observable but are contaminated with a multiplicative factor determined by the value of an observable variable [11]. For the nonlinear regression model, it was proposed to estimate the distorting functions by nonparametrically regressing the predictors and response on the distorting covariates [12]. The covariate-adjusted method was then extended to other models [13][14][15]. The data studied in these works are all complete data. In this paper, the data of interest are right-censored survival data in which only distorted covariates and distorting factors are observed, and we apply the nonparametric method of [12], a kernel smoothing method, to estimate the distorting functions and thereby obtain adjusted values of the distorted covariates.
In this article, inspired by the ideas of Ding et al. [16] and Deng et al. [17], we construct a surrogate function with a diagonal Hessian matrix by using the convexity of the exponential function and the negative logarithm function and then maximize this surrogate function. This algorithm is a minorization-maximization (MM) algorithm; that is, the first M is to construct a minorizing surrogate function, and the second M is to maximize that function.
The rest of the article is organized as follows. In Section 2, we first fit data from the case-cohort design to Cox's model with distorting factors and then adjust the distorted covariates by using a kernel function. In Section 3, the convergence of the adjusted covariates and the asymptotic properties of the proposed estimators are established. In Section 4, we propose an MM algorithm for implementation of the estimation and present a cross-validation (CV) procedure to obtain the optimal bandwidth and a nonparametric bootstrap approach to obtain the standard error estimation. Several simulation studies are conducted in Section 5 to compare the estimations without distorting factors, with distorting factors, and adjusted for distorting factors. Real data analysis of heart failure data is in Section 6. Discussion is stated in Section 7. All proofs are given in the Appendix.

Model and Design.
Suppose that there are n independent subjects in the study cohort, where T̃_i is the failure time, C_i is the censoring time, T_i = min(T̃_i, C_i) is the observed time for the ith subject, Δ_i = I(T̃_i ≤ C_i) is the right-censoring indicator for failure, Y_i(t) = I(T_i ≥ t) is the at-risk process, and N_i(t) = Δ_i I(T_i ≤ t) is the counting process, where I(·) is an indicator function, τ is the end-of-study time, and Z_i is a p-dimensional covariate vector for subject i; we focus on time-independent covariates.
Let β = (β_1, β_2, . . . , β_p)^⊤ be the unknown p-dimensional vector of regression coefficients. T̃_i arises from Cox's model of the following form:

λ(t | Z_i) = λ_0(t) exp(β^⊤ Z_i), (1)

where λ_0(t) is an unspecified baseline hazard function. The corresponding partial likelihood function, widely used for the inference of β [18], takes the following form:

L(β) = ∏_{i=1}^n [ exp(β^⊤ Z_i) / Σ_{j=1}^n Y_j(T_i) exp(β^⊤ Z_j) ]^{Δ_i}. (2)

Under the case-cohort design, let A denote the subcohort, which is selected from the full cohort by simple random sampling, and let ξ_i be an indicator equaling 1 if the ith subject is selected into the subcohort and 0 otherwise. Let Ã denote the case-cohort sample, which contains the subjects in the subcohort A and the cases outside the subcohort. Therefore, the observed data structure can be summarized as {(T_i, Δ_i, Z_i): i ∈ Ã}. The pseudolikelihood function [1] was proposed for this design and takes the following form:

L_p(β) = ∏_{i=1}^n [ exp(β^⊤ Z_i) / Σ_{j ∈ R̃(T_i)} exp(β^⊤ Z_j) ]^{Δ_i}, (3)

where R̃(t) = {j: T_j ≥ t, j ∈ A} ∪ B(t) and B(t) is the set of cases failing at time t.
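The pseudolikelihood above replaces the full-cohort risk set with the subcohort members still at risk plus the case itself. A minimal sketch of evaluating this log pseudolikelihood (hypothetical toy data; no tied failure times assumed):

```python
import numpy as np

def log_pseudolik(beta, time, delta, Z, in_subcohort):
    """Prentice-style log pseudolikelihood for case-cohort data.

    For each failure i, the risk set is the subcohort members still
    at risk at time[i], plus the case i itself (even if i is outside
    the subcohort). Assumes no tied failure times.
    """
    beta = np.asarray(beta, dtype=float)
    ll = 0.0
    for i in np.where(delta == 1)[0]:
        at_risk = (time >= time[i]) & in_subcohort
        at_risk[i] = True  # the case itself always enters its risk set
        ll += Z[i] @ beta - np.log(np.exp(Z[at_risk] @ beta).sum())
    return ll

# toy data: 4 subjects; subjects 0 and 2 fail; subject 2 is outside the subcohort
time = np.array([1.0, 2.0, 3.0, 4.0])
delta = np.array([1, 0, 1, 0])
Z = np.array([[0.5], [-0.2], [0.3], [0.1]])
sub = np.array([True, True, False, True])
print(log_pseudolik([0.0], time, delta, Z, sub))  # → -1.791759... (= -(log 3 + log 2))
```

At β = 0, each failure contributes minus the log of its risk-set size, which makes the toy result easy to verify by hand.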

The Adjustment of Covariates.
We now study the case where the covariates are distorted by some distorting factors. The distorted observed data structure is

{(T_i, Δ_i, Z̃_i, U_i): i ∈ Ã}, (4)

where U_i is the ith distorting factor and Z̃_i is the ith distorted covariate vector. Here, the true covariate Z_i is unobservable, and ϕ(U) denotes the vector of unknown distorting functions of the observable factor U. Firstly, we adopt the basic assumptions of [12]: the distortion is multiplicative, Z̃_ri = ϕ_r(U_i) Z_ri for 1 ≤ r ≤ p; E[ϕ_r(U)] = 1 for identifiability; and U is independent of (Z, T̃, C). Under these assumptions, the working-likelihood function (3) can be rewritten with Z̃_i in place of Z_i; denote the resulting function by (5). Now, our object is to estimate the unknown coefficients β based on the observations (4) and function (5). From condition (A3), we have

ϕ_r(U) = E(Z̃_r | U) / E(Z_r). (6)

Many scholars have proposed methods to estimate ϕ_r(U) [19]. We construct the estimator ϕ̂_r(U) by the classic kernel method, for 1 ≤ r ≤ p:

ϕ̂_r(u) = [ Σ_{i=1}^n K((U_i − u)/h) Z̃_ri / Σ_{i=1}^n K((U_i − u)/h) ] / Z̄_r, (7)

where Z̄_r = (1/n) Σ_{i=1}^n Z̃_ri, K(·) is a kernel function, and h is the bandwidth.
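As a sketch, the kernel estimator above can be coded directly: a Nadaraya-Watson estimate of E(Z̃_r | U = u), divided by the sample mean Z̄_r so that the estimated distorting function averages to one. The toy values below are illustrative and chosen so the recovery is exact:

```python
import numpy as np

def epanechnikov(x):
    return 0.75 * (1.0 - x**2) * (np.abs(x) <= 1.0)

def phi_hat(u, U, Z_tilde, h):
    """Kernel estimate of the distorting function at u:
    Nadaraya-Watson regression of the distorted covariate on U,
    normalized by the sample mean of the distorted covariate."""
    w = epanechnikov((U - u) / h)
    return (w @ Z_tilde / w.sum()) / Z_tilde.mean()

# toy example: true covariate Z = 2 for everyone, multiplicative
# distortion psi(U_i) with sample mean 1, so Z_tilde = psi * Z
U = np.array([1.0, 2.0, 3.0, 4.0])
psi = np.array([0.8, 1.2, 0.9, 1.1])
Z_tilde = psi * 2.0
# with a bandwidth smaller than the gap between the U values, each
# point is smoothed only by itself, and the adjusted covariates
# recover the true value exactly
Z_hat = np.array([Z_tilde[i] / phi_hat(U[i], U, Z_tilde, h=0.5)
                  for i in range(4)])
print(Z_hat)  # → [2. 2. 2. 2.]
```

In practice the bandwidth is chosen by cross-validation rather than fixed by hand.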

Complexity
Let Ẑ_ri = Z̃_ri / ϕ̂_r(U_i) denote the adjusted covariate. That is, we obtain the adjusted covariates by removing the distorting factors from the distorted covariates. Then, the data structure (4) can be abbreviated as

{(T_i, Δ_i, Ẑ_i): i ∈ Ã}. (8)

Under the case-cohort design, the estimator β̂ based on the adjusted covariates is defined as the maximizer of the log pseudolikelihood with Ẑ_i in place of Z_i:

β̂ = arg max_β l_p(β; Ẑ). (9)

Consistency and Asymptotic Normality
To present the asymptotic results, we introduce some notation. Let β_0 denote the true value of β. For d = 0, 1, 2, define

S^(d)(β, t) = (1/n) Σ_{i=1}^n Y_i(t) exp(β^⊤ Z_i) Z_i^{⊗d},

where a^{⊗0} = 1, a^{⊗1} = a, and a^{⊗2} = aa^⊤ for a vector a, and let s^(d)(β, t) denote the limit of S^(d)(β, t). We impose the following conditions throughout the paper. Note that the convergence properties involve n ⟶ ∞, and ‖ · ‖ refers to the Euclidean norm in these conditions:

(C) All g_r(U) = ϕ_r(U)p(U), 1 ≤ r ≤ p, as well as ϕ_r(U) and p(U), are greater than a positive constant. They are differentiable, and their derivatives satisfy a local Lipschitz condition: there exist a neighborhood Δ of the origin and a constant c > 0 such that, for any ξ ∈ Δ, |g_r(u + ξ) − g_r(u)| ≤ c|ξ|.

(D1) The continuous kernel function K(·) is a symmetric probability density with bounded support and bounded variation.

These conditions are mild and can be satisfied in most circumstances.
Conditions (A1)-(A3) are the regularity conditions for the asymptotic results of Cox's model [20]. Conditions (B1)-(B4) are the regularity conditions under the case-cohort design [21]. Condition (C) relates to the smoothness of the function g_r(·) and the density function p(·) of U. Conditions (D1)-(D2) are commonly assumed for the root-n consistency of kernel-based estimation [22,23]. Condition (E) is special for this problem [12,22].
Under these conditions, we have the following result with detailed proofs given in Appendix.

The Implementation of Algorithm
In Section 3, we established the asymptotic properties of the adjusted estimator β̂. To obtain β̂ by solving (9), we can use the Newton-Raphson algorithm. However, in the process of calculation, we find that its Hessian matrix is complicated and can easily become non-invertible. So, we propose the MM algorithm [16] to obtain β̂ for the adjustment of distorted covariates under the case-cohort design.

Construct Surrogate.
The key of the MM algorithm is to construct a surrogate function. Combining the ideas of [16], we construct the surrogate function Q(β|β^(m)) for l_p(β) by using the convexity of the exponential function e^x and the negative logarithm function −log(x), where β^(m) is the mth approximation of the adjusted maximum likelihood estimator β̂ defined in (9), and β_r, β_r^(m), and Ẑ_lr are the rth components of β, β^(m), and Ẑ_l, respectively.
With the surrogate function (16), we can transfer the optimization problem (9) into the following problem: β^(m+1) = arg max_β Q(β|β^(m)). Following the construction process of the surrogate function (16), we can immediately conclude that

l_p(β^(m+1)) ≥ Q(β^(m+1)|β^(m)) ≥ Q(β^(m)|β^(m)) = l_p(β^(m)),

so l_p(β) is strictly ascending when β^(m+1) ≠ β^(m). With the surrogate function (16), we obtain the derivatives with respect to β at β = β^(m): the score vector and the negative Hessian matrix. Here, the Hessian matrix is diagonal, so the calculation becomes simple and the matrix is easily invertible in the Newton-Raphson update.
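To illustrate the flavor of such an MM step, here is a sketch on the full-cohort log partial likelihood (for brevity). It is an illustrative reconstruction, not the paper's exact surrogate: the −log term is minorized by its tangent line, exp(β^⊤Z) is majorized through Jensen's inequality, and the resulting surrogate has a diagonal Hessian with entries p Σ_i Δ_i E_i[Z_r²], so each coordinate takes an independent scalar Newton step; a backtracking check keeps the ascent property numerically safe.

```python
import numpy as np

def log_plik(beta, time, delta, Z):
    # Cox log partial likelihood (full cohort, no tied failure times)
    ll = 0.0
    for i in np.where(delta == 1)[0]:
        risk = time >= time[i]
        ll += Z[i] @ beta - np.log(np.exp(Z[risk] @ beta).sum())
    return ll

def mm_step(beta, time, delta, Z):
    """One MM update: the -log term is minorized by its tangent line
    and exp(beta'Z) is majorized via Jensen's inequality, giving a
    separable surrogate whose Hessian is diagonal with entries
    p * sum_i Delta_i E_i[Z_r^2]."""
    n, p = Z.shape
    w = np.exp(Z @ beta)
    grad = np.zeros(p)
    diag_hess = np.zeros(p)
    for i in np.where(delta == 1)[0]:
        risk = time >= time[i]
        S = w[risk].sum()
        zbar = (w[risk, None] * Z[risk]).sum(axis=0) / S
        z2bar = (w[risk, None] * Z[risk] ** 2).sum(axis=0) / S
        grad += Z[i] - zbar
        diag_hess += p * z2bar
    step = grad / np.maximum(diag_hess, 1e-12)
    # halve the step until the true objective does not decrease
    ll0, t = log_plik(beta, time, delta, Z), 1.0
    for _ in range(30):
        cand = beta + t * step
        if log_plik(cand, time, delta, Z) >= ll0:
            return cand
        t /= 2.0
    return beta

time = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
delta = np.array([1, 1, 0, 1, 0, 1])
Z = np.array([[0.5, -0.3], [-0.2, 0.4], [0.3, 0.1],
              [-0.4, 0.2], [0.1, -0.1], [0.2, -0.2]])
beta = np.zeros(2)
lls = [log_plik(beta, time, delta, Z)]
for _ in range(200):
    beta = mm_step(beta, time, delta, Z)
    lls.append(log_plik(beta, time, delta, Z))
print(lls[0], lls[-1])  # the objective increases monotonically
```

Each iteration costs only a diagonal solve, which is the point of the surrogate: no p × p matrix has to be inverted.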

Cross-Validation.
In the process of adjusting the distorted covariates, we use the cross-validation (CV) method to obtain the optimal bandwidth h for the kernel smoothing. The idea of the CV method is to compute the error between the observed responses and the responses predicted from the model fitted to the remaining observations. The specific expression of the criterion in this article is given as follows, where u_{−i} is the set with the ith element removed and Δu = u_j − u_{j−1}. Then, the optimal bandwidth is defined as ĥ = arg min_h CV(h).
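A minimal leave-one-out sketch of such a bandwidth search (a Gaussian kernel is used here for simplicity, so the leave-one-out denominator is never zero; the grid values are illustrative):

```python
import numpy as np

def loo_cv(h, U, Z_tilde):
    """Leave-one-out cross-validation score for a Nadaraya-Watson
    smoother of the distorted covariate on the distorting factor."""
    n = len(U)
    err = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        w = np.exp(-0.5 * ((U[keep] - U[i]) / h) ** 2)  # Gaussian kernel
        pred = w @ Z_tilde[keep] / w.sum()
        err += (Z_tilde[i] - pred) ** 2
    return err / n

def select_bandwidth(U, Z_tilde, grid):
    scores = [loo_cv(h, U, Z_tilde) for h in grid]
    return grid[int(np.argmin(scores))]

U = np.linspace(0.0, 3.0, 30)
Z_tilde = np.sin(U) + 1.5           # smooth, noiseless toy curve
grid = [0.05, 0.1, 0.2, 0.5, 1.0]
print(select_bandwidth(U, Z_tilde, grid))
```

For noiseless smooth data, a small bandwidth predicts each left-out point from its near neighbors and wins; with noisy data the CV curve trades off bias against variance.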

Bootstrap.
Bootstrap has been widely used since it was first proposed [24][25][26]. We adopt this method to calculate the standard error of β̂; the basic idea of the nonparametric approach is to construct an empirical distribution function by repeatedly sampling from the observed data. The steps are summarized as follows:

Step 1: let X_1, X_2, . . . , X_n be the adjusted observations under the case-cohort design, where X_i = (T_i, Δ_i, Ẑ_i). We resample the observations with Δ_i = 1 as the failure cases and the observations with Δ_i = 0 as the censored objects to constitute the bootstrap case-cohort sample X*_1, X*_2, . . . , X*_n.

Step 2: compute the adjusted estimate β̂* from the bootstrap sample.
Step 3: repeating Step 1 and Step 2 B times, we obtain B bootstrap estimates β̂*_1, . . . , β̂*_B. Therefore, the standard error of the rth component of β̂ can be estimated by

SE_r = [ (1/(B − 1)) Σ_{b=1}^B ( β̂*_{b,r} − β̄*_r )² ]^{1/2},

where β̄*_r = (1/B) Σ_{b=1}^B β̂*_{b,r}.
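The three steps can be sketched as follows; for illustration, a toy statistic (the sample mean of one adjusted covariate) stands in for the fitted β̂, since refitting the model inside the loop would only change the `estimator` argument:

```python
import numpy as np

def stratified_bootstrap_se(X, delta, estimator, B=500, seed=1):
    """Nonparametric bootstrap SE under the case-cohort structure:
    resample the failures (delta == 1) and the censored subjects
    (delta == 0) separately, keeping the group sizes fixed, then
    recompute the estimator B times and take the sample SD."""
    rng = np.random.default_rng(seed)
    fail, cens = np.where(delta == 1)[0], np.where(delta == 0)[0]
    stats = np.empty(B)
    for b in range(B):
        idx = np.concatenate([rng.choice(fail, size=len(fail), replace=True),
                              rng.choice(cens, size=len(cens), replace=True)])
        stats[b] = estimator(X[idx], delta[idx])
    return stats.std(ddof=1)   # sqrt of (1/(B-1)) * sum (beta*_b - mean)^2

delta = np.array([1, 1, 0, 0, 0, 1])
X = np.array([0.9, 1.1, 0.4, 0.6, 0.5, 1.0])
se = stratified_bootstrap_se(X, delta, lambda x, d: x.mean(), B=200)
print(se)
```

Stratifying by the censoring indicator preserves the failure/censoring composition of the original case-cohort sample in every bootstrap replicate.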

Simulation
As we mentioned before, if the distorting factors on the covariates are ignored, the inference may be misleading [11,12]. In this article, we construct several simulation studies to compare three estimators of the regression coefficient β in Cox's model: for the covariates without distorting factors, with distorting factors, and adjusted for distorting factors. Under the case-cohort design, β̂_W denotes the estimator calculated based on the covariates without distorting factors, β̂_C denotes the estimator calculated based on the covariates with distorting factors, and β̂_A denotes the estimator calculated based on the adjusted covariates with distorting factors. Given (Z_1, Z_2), we consider the hazard function of the failure time T̃_i as follows:

λ(t | Z_1, Z_2) = λ_0(t) exp(β_1 Z_1 + β_2 Z_2). (27)

We set the true values of the parameters β to (−0.25, 0.5), (−0.25, 0.693), and (0, 0.5). Z_1 is generated from a normal distribution with variance 1.44. Z_2 is generated from a uniform distribution. The baseline hazard function λ_0(t) is set to be 1 or 2t. Thus, when λ_0(t) = 1, the failure time T̃ follows an exponential distribution with rate exp(β_1 Z_1 + β_2 Z_2). The censoring time C is generated from a uniform distribution U(0, c), with c chosen to achieve the desired censoring rate ρ = 80%, 85%, or 90%.
Under the case-cohort design, 1000 full-cohort samples are generated from (27), and the corresponding distorting factors are generated. The two parts are merged to constitute the observed sample. Then, we randomly select subcohorts of size n = 150, 200, and 300 from the full cohort.
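A sketch of generating one such dataset (λ_0(t) = 1, so T follows an exponential distribution with rate exp(β^⊤Z)). The normal mean, the uniform range for Z_2, and the distorting function below are illustrative placeholders, since the text specifies those choices only in part:

```python
import numpy as np

def simulate_case_cohort(N=1000, n_sub=200, beta=(-0.25, 0.5),
                         c=0.3, seed=7):
    rng = np.random.default_rng(seed)
    # covariates: Var(Z1) = 1.44 follows the text; mean 0 and U(0, 1)
    # for Z2 are placeholder assumptions
    Z = np.column_stack([rng.normal(0.0, 1.2, N), rng.uniform(0.0, 1.0, N)])
    rate = np.exp(Z @ np.asarray(beta))
    T_fail = rng.exponential(1.0 / rate)        # lambda_0(t) = 1
    C = rng.uniform(0.0, c, N)                  # tune c for the target rho
    time = np.minimum(T_fail, C)
    delta = (T_fail <= C).astype(int)
    # multiplicative distortion of Z2 (placeholder distorting function
    # with mean 1 over U ~ U(-1, 1))
    U = rng.uniform(-1.0, 1.0, N)
    Z_tilde2 = (1.0 + 0.5 * U) * Z[:, 1]
    # subcohort indicator: simple random sample without replacement
    sub = np.zeros(N, dtype=bool)
    sub[rng.choice(N, size=n_sub, replace=False)] = True
    return time, delta, Z, Z_tilde2, U, sub

time, delta, Z, Z2t, U, sub = simulate_case_cohort()
print(delta.mean())   # 1 - censoring rate; tune c to hit 80-90% censoring
```

Repeating this generator 1000 times with fresh seeds reproduces the replication scheme described above.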
We compare the sample bias (Bias), the sample MSE (SMSE), the sample standard deviations of estimates (SD), the means of estimated standard errors (SE), and the coverage probabilities of 95% nominal confidence intervals (CP) of the three estimators β̂_C, β̂_W, and β̂_A based on 1000 independent simulated datasets, and we apply the nonparametric bootstrap approach of Section 4.2.2 with B = 500. We set only the covariate Z_2 to be distorted and adjusted. The distribution function of the distorting factor U is set to be Φ(u) = (u + 10)²/194.9160. The kernel function is set to be the Epanechnikov kernel [27], K(x) = 0.75(1 − x²)I(|x| ≤ 1). The optimal bandwidth h is selected by the CV method in Section 4.2.1. The iterations are stopped with ε = 0.0001. The simulation results are summarized in Table 1 with β = (−0.25, 0.5), Table 2 with β = (−0.25, 0.693), and Table 3 with β = (0, 0.5), respectively.
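For reference, the Epanechnikov kernel used here is the standard K(x) = 0.75(1 − x²) on |x| ≤ 1; a quick numerical check that it is a symmetric density integrating to one:

```python
import numpy as np

def epanechnikov(x):
    x = np.asarray(x, dtype=float)
    return 0.75 * (1.0 - x**2) * (np.abs(x) <= 1.0)

# midpoint-rule integral over the support [-1, 1]
x = np.linspace(-1.0, 1.0, 20001)
dx = x[1] - x[0]
mids = (x[:-1] + x[1:]) / 2.0
mass = epanechnikov(mids).sum() * dx
print(round(mass, 6))  # → 1.0
```

The compact support means points farther than h from u get exactly zero weight in the smoother, unlike a Gaussian kernel.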
According to the simulation results, the estimators β̂_W and β̂_A for β_2 are nearly unbiased, the three values of SMSE, SD, and SE are all close, and the CP values are reasonable. However, the estimator β̂_C for β_2 is biased, the difference between the SMSE and SD values is large, and the CP value is low. These facts confirm that ignoring the distorting factors may lead to wrong results and conclusions and also show that the proposed adjustment of covariates is effective. For example, for the estimators β̂_C and β̂_A with β = (−0.25, 0.5), ρ = 80%, and n = 150, for β_2 the bias is −0.1178, the SMSEs are 0.1521 and 0.1095, and the CPs are 0.735 and 0.947, respectively. Furthermore, we can conclude that the estimators of β_1 are all reasonable.

Real Data Analysis
The dataset concerns the medical records of heart failure patients, from the "Machine Learning Repository" (http://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records). This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period, where each patient profile has 13 clinical features: age, anaemia, creatinine phosphokinase, diabetes, ejection fraction, high blood pressure, platelets, serum creatinine, serum sodium, sex, smoking, time, and death event. The goal is to assess the association between the clinical features and heart failure and its occurrence time. These are survival data, which meet the requirements of this article. Death event and time are regarded as heart failure and its occurrence time, respectively. For the remaining 11 clinical features, through pairwise correlation analysis, we conclude that ejection fraction and serum sodium, and anaemia and creatinine phosphokinase, are significantly related. Referring to relevant medical knowledge, we know that ejection fraction may distort serum sodium and, in the same way, anaemia may distort creatinine phosphokinase. The main purpose of our research is to evaluate the association between these clinical features with serum sodium and creatinine phosphokinase smoothed by ejection fraction and anaemia, respectively. There are 96 deaths in these data, and the censoring rate is about 67.9%. Due to the different scales of the covariates, we first standardize each covariate.
Then, we set the standardized age as Z_1, the standardized creatinine phosphokinase as Z_2, the standardized diabetes as Z_3, the standardized high blood pressure as Z_4, the standardized platelets as Z_5, the standardized serum creatinine as Z_6, the standardized serum sodium as Z_7, the standardized sex as Z_8, the standardized smoking as Z_9, the standardized ejection fraction as the distorting factor U_1, and the standardized anaemia as the distorting factor U_2, where Z_2 and Z_7 are the distorted covariates.
The case-cohort sample is composed of a subcohort of 100 subjects randomly selected from the whole dataset together with the deceased subjects outside the subcohort. We perform regression analysis on the sample using the proportional hazards model. Because the size of these data is small, to improve the reliability of the sampling method, the sampling process is repeated 100 times. The average value of the 100 distorted estimates is (2.7224, 14.0226, 0.1603, 0.4295, −0.2667, 3.0520, 11.0269, −0.0742, 0.1933)^⊤. Then, for the same 100 case-cohort samples, we obtain the adjusted covariates Ẑ_2 and Ẑ_7 by applying the proposed kernel smoothing method to Z_2 and Z_7 and replace Z_2 and Z_7 with Ẑ_2 and Ẑ_7, respectively. On this basis, the average of the adjusted estimates is (2.5077, 1.0344, 0.1136, 0.5474, −0.3552, 2.8577, −2.4303, 0.2115)^⊤. The kernel and bandwidth selector chosen are the same as those in Section 5. After the covariates were adjusted, the estimate for creatinine phosphokinase changed from 14.0226 to 1.0344 and the estimate for serum sodium changed from 11.0269 to −2.4303; the estimates for the other covariates changed little after adjustment. The result is reasonable and feasible in medicine and also shows that the proposed method is effective for real data.
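The repeated-sampling scheme can be sketched as follows, with a toy statistic standing in for the fitted coefficient vector (`n_sub = 100` subjects and 100 repetitions follow the text; the data here are simulated stand-ins for the 299-patient records):

```python
import numpy as np

def case_cohort_sample(delta, n_sub, rng):
    """Indices of one case-cohort sample: a simple random subcohort
    of size n_sub plus all deaths outside the subcohort."""
    n = len(delta)
    sub = rng.choice(n, size=n_sub, replace=False)
    deaths = np.where(delta == 1)[0]
    return np.union1d(sub, deaths)

def averaged_estimate(X, delta, estimator, n_sub=100, reps=100, seed=3):
    rng = np.random.default_rng(seed)
    ests = [estimator(X[idx], delta[idx])
            for idx in (case_cohort_sample(delta, n_sub, rng)
                        for _ in range(reps))]
    return np.mean(ests, axis=0)

# toy stand-in for the 299-patient data with 96 "deaths"
rng = np.random.default_rng(0)
delta = np.zeros(299, dtype=int)
delta[:96] = 1
X = rng.normal(size=299)
print(averaged_estimate(X, delta, lambda x, d: x.mean()))
```

Every sample contains all 96 deaths, so its size lies between 100 and 196 depending on how many deaths the subcohort already captured.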
In practice, the covariates are measured after the case-cohort sampling to reduce costs and improve efficiency. In this real dataset, the sample size is small, and each covariate has already been measured. To reduce the error, we repeat the sampling 100 times and take the average value as the estimator. For example, based on the 10th case-cohort sample from the original data, the distorted estimate is (3.5393, −0.5883, 0.1934, 0.3712, 0.4114, 6.5332, −3.9763, 0.1236, 0.3585)^⊤; then, after the adjustment of covariates in the same sample, the adjusted estimate is (4.5125, 79.7599, 0.1690, 0.2902, 1.0486, 5.9035, 12.7945, 0.1399, 0.1720)^⊤. Therefore, the estimates based on a single case-cohort sample can be quite different from the average of the estimates, but the trends are the same. So, for real data with a large sample size, we only need to draw one case-cohort sample and measure its covariates.

Conclusion
We study estimation in Cox's model with adjusted covariates by using a nonparametric kernel method under the case-cohort design. Consistency and asymptotic normality of the proposed estimators are derived. We use the MM algorithm to calculate the regression coefficients, where a surrogate function with a diagonal Hessian matrix is constructed to overcome the computational difficulty of the Newton-Raphson algorithm. Simulation and real data studies suggest that the case-cohort design can reduce the cost of cohort studies and that the proposed adjustment for distorted covariates performs well.
Here, the adjustment for survival data is discussed in Cox's model. The method can be extended to other models, such as the accelerated failure time model [28], the additive-multiplicative model [29], and the accelerated hazards model [30]. The cost-effective case-cohort design is mainly applied to data with a high censoring rate. However, in the sampling process, some censored subjects are not measured. To improve efficiency, our future work includes developing inference for estimation with adjusted covariates under a generalized case-cohort design [31] and an outcome-dependent sampling design.
We start with a lemma, which is frequently used in the proofs. Then, we proceed to the proof of Theorem 1, concerning consistency and asymptotic normality. (A.1)

Proof of Lemma A.1. Recall that Ẑ_ir = Z̃_ir/ϕ̂_r(U_i), which is given in Section 2. Simple calculations give the corresponding expression. Thus, it suffices to deal with the sum Σ_{i=1}^n Ẑ_ir η(T_ir). From the decomposition, it holds that L_1 and L_2 can be treated separately. First, we analyze L_1. The proof is divided into three steps. Denote by L_11, L_12, and L_13 the quantities:

Step 1: show (A.5). Applying the identity ϕ_r(·) = g_r(·)/(p(·)E(Z_r)), we have that (A.5) can be concluded by proving the displayed convergence. According to Theorem 1 of [32], Lemma 3 of [23], condition (D2), and the Law of Large Numbers (LLN) applied to (1/n) Σ_{i=1}^n Z_ir η(T_ir), we then follow the arguments used in [23] to obtain the higher-order term.

Step 2: show (A.8). To analyze L_12, for ease of understanding, we use p(U) to denote the density function of U and define the quantities below. We need to work on Σ_{i=1}^n Z_ir η(T_ir) ĝ_r(U_i)/g_r(U_i). (A.12) We can handle the above sum as a U-statistic with a varying kernel of bandwidth h by applying the arguments used by Davis and Fang [23] again. Then, we have the asymptotic representation [4], with P_{Z_r,U}(z, u) being the density function of (Z_r, U), J_r = Z_r η(T_r), and P_{J_r,U}(j, u) the density of (J_r, U). Note that Z_r and U are independent and J_r and U are independent. According to conditions (C), (D1)-(D2), and (E), we obtain the stated bound. L_13 can be treated similarly. Combining (A.5) and (A.8), we have (7).

Step 3: show the remaining convergence for L_2, where the last equality is obtained by applying the LLN.

Finally, together with the asymptotic representations of L_1 and L_2 in (A.15) and (A.17), the desired result follows.

Proof of Theorem 1.
For Σ_{i∈Ã} (Ẑ_i − Z_i)^⊤ β, according to Lemma A.1, we obtain (A.20). According to the Law of Large Numbers, (A.20) converges to 0 in probability.
Next, we prove that

Because exp(·) is a differentiable function, by the first derivative of exp(·) it is obtained from (A.17) that A(β, t) admits the corresponding expansion.

By (16) and (A.21), A(β, t) asymptotically converges to
Under the case-cohort sampling design, let A denote the subcohort sample set, n_A denote the number of subjects in A, Ã denote the case-cohort sample, and n_Ã denote the number of subjects in Ã. Under the case-cohort sampling design, the covariates are not completely observed, so the likelihood function cannot be expressed as (2). According to the pseudolikelihood given in the article of Prentice [1], the pseudolikelihood takes the stated form, where the risk set is R̃(t) = {i: T_i ≥ t, i ∈ A} ∪ B(t) and B(t) = {i: N_i(t) ≠ N_i(t−), i = 1, 2, . . . , n}. Note that the set B(t) is nonempty only when an individual fails at time t. The corresponding log-likelihood function follows, and from it we obtain the score. Under conditions (A1)-(A3), (B1)-(B4), (C), (D1)-(D2), and (E), and by arguments similar to those of Self and Prentice [21], we obtain the following results. Because β̂ converges to β_0 in probability, by Slutsky's theorem and (A.12), we obtain the asymptotic distribution, where the asymptotic variance matrix is given in (A.40). □

Data Availability

The simulation data used in this paper are randomly generated by R language software to support the proposed method of this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.