Survival Data Analysis with Time-Dependent Covariates Using Generalized Additive Models

We discuss a flexible method for modeling survival data using penalized smoothing splines when the values of covariates change over the course of the study. The Cox proportional hazards model has been widely used to analyze treatment and prognostic effects in censored survival data. However, a number of theoretical problems with respect to the baseline survival function remain unsolved. We use generalized additive models (GAMs) with B-splines to estimate the survival function and select the optimum smoothing parameters using a variant of the multifold cross-validation (CV) method. The method is compared with the generalized cross-validation (GCV) method using data from a long-term study of patients with primary biliary cirrhosis (PBC).


Introduction
Several prognostic models for PBC data have been developed using the Cox proportional hazards model, with the values of all covariates determined at the time when the patient entered the study [1]. However, situations may exist in which the values of covariates change over the course of the study. The time-dependent model uses follow-up data to estimate the effect of the evolution of the covariates during the course of the disease; see, for example, Cox [2], Altman and Stavola [3], and Collett [4].
Let t be a continuous lifetime variable and $x = (x_1, \ldots, x_I)$ a vector of time-fixed covariates. Cox's proportional hazards model postulates that the hazard at time t is the product of two components [5,6]:

$$h(t; x) = h_0(t) \exp\left(\sum_{i=1}^{I} b_i x_i\right), \tag{1}$$

where $b = (b_1, \ldots, b_I)$ is a vector of coefficients. The proportional hazards assumption is that the baseline hazard $h_0(t)$ is a function of t but does not involve the values of the covariates x, which are measured at the beginning of an interval to predict short-term survival. We investigate PBC data for 312 patients who were seen at the Mayo Clinic and were monitored for the duration of the study, as described in Murtaugh et al. [7] and Therneau and Grambsch [8]. The Cox proportional hazards model was developed based on the relationship between survival and the patient characteristics observed when the patient entered the study. The precision of time-fixed models used in PBC is rather low, partly because these models are based on data for which the covariates were measured only at the time when the patient entered the study.
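As a small numerical illustration of the proportional hazards form, the following Python sketch evaluates the relative hazard exp(b'x) for two covariate vectors. The coefficient and covariate values are hypothetical and are not estimates from the PBC data.

```python
import math

def relative_hazard(b, x):
    """Return exp(sum_i b_i * x_i), the hazard relative to the baseline h0(t)."""
    return math.exp(sum(b_i * x_i for b_i, x_i in zip(b, x)))

def hazard_ratio(b, x_a, x_b):
    """Ratio of the hazards of two subjects; h0(t) cancels, so it is free of t."""
    return relative_hazard(b, x_a) / relative_hazard(b, x_b)

# Hypothetical coefficients for two covariates; illustration only.
b = [0.04, 0.8]
ratio = hazard_ratio(b, [50.0, 1.0], [50.0, 0.0])
print(round(ratio, 3))
```

Because the baseline hazard cancels in the ratio, the comparison between two subjects with time-fixed covariates is the same at every time t, which is exactly the assumption that fails once covariates are allowed to change during follow-up.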
For the analysis of data with time-dependent covariates, however, the survivor function for any individual depends on time t and the baseline hazard function. This means that the survivor function cannot be expressed as a power of the baseline survivor function and is generally difficult to obtain for any individual; see, for example, Kalbfleisch and Prentice [9] and Marubini and Valsecchi [10]. The Mayo updated model (e.g., [7]) and the European new version model (e.g., [3,11,12]) have been commonly used to improve the accuracy of survival predictions as a function of covariates measured at any time during the course of the disease. In the present article, we propose the variant multifold CV method for choosing the optimum smoothing parameters of a GAM, in order to estimate the survival function and predict short-term survival (say, for the following six months) at any time during the course of the disease. Another useful idea in our analysis is the concept of competing risks. In the PBC data, liver transplantation is a competing risk, and competing risks have conventionally been treated as censored data. By adding liver transplantation as a time-dependent covariate, one can test its significance.

Model Building
By extending the Cox proportional hazards model (1), a flexible survival model has been examined:

$$h(t; x) = h_0(t) \exp\left(\sum_{i=1}^{I} s_i(x_i)\right), \tag{2}$$

where $s_i(x_i)$ is a spline function for the covariate $x_i$ [13-15]. The proportional hazards model (2) can be extended to the case in which time-dependent covariates are observed. The relative hazard $h(t)/h_0(t)$ then depends on time t, and thus the proportional hazards assumption is no longer satisfied, as described in Altman and Stavola [3] and Arjas [19].
For example, Table 1 shows the values of age, prothrombin time, and bilirubin as time-dependent covariates for patient #9, who died; for details, see Table 4 in Murtaugh et al. [7]. Patients were scheduled to return for further observations at six months, 12 months, and yearly thereafter. Thus, n = 312 patients generate 1945 observations in total. The covariate values for each patient were allowed to vary with the time interval for the lth clinic visit.
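The data layout described above, with one observation per patient per clinic-visit interval, can be sketched in Python as follows. The function name, the field names, and all numerical values are hypothetical; in particular, the values are not those of patient #9 in Table 1, and coding the death indicator as 1 only in the final interval is an assumption of this sketch.

```python
def expand_visits(patient_id, visit_days, covariates, end_day, died):
    """Build one row per interval between consecutive clinic visits.

    visit_days : day of each clinic visit (first entry is study entry)
    covariates : one dict of covariate values per visit
    end_day    : day of death or censoring
    died       : True if follow-up ended in death
    """
    rows = []
    last = len(visit_days) - 1
    for l, start in enumerate(visit_days):
        # An interval runs to the next visit, or to the end of follow-up.
        stop = end_day if l == last else visit_days[l + 1]
        row = {"patient": patient_id, "interval": l + 1,
               "start": start, "stop": stop}
        row.update(covariates[l])
        # Death is recorded only in the interval in which it occurs.
        row["delta"] = 1 if (died and l == last) else 0
        rows.append(row)
    return rows

# Hypothetical patient with three visits who died on day 500.
rows = expand_visits(
    patient_id=1,
    visit_days=[0, 180, 365],
    covariates=[{"bili": 1.0}, {"bili": 1.4}, {"bili": 2.1}],
    end_day=500,
    died=True,
)
for row in rows:
    print(row)
```

Stacking such rows over all patients gives a table in which the number of rows exceeds the number of patients, as with the 1945 observations generated by the 312 patients.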
A grouped version of Cox's proportional hazards model with time-fixed covariates has been considered in the framework of discrete grouped data for feed-forward neural networks. Given a continuous survival time, piecewise models arise from partitioning the time axis into disjoint intervals. Biganzoli et al. [20,21] showed that, by treating the time interval as an input variable in a feed-forward neural network, it is possible to estimate smoothed discrete hazards as conditional probabilities of failure. Biganzoli et al. [20] also pointed out that an advantage of this kind of data structure is the straightforward use of time-dependent covariates, since each subject is represented, for each observation interval, by one input vector that can change across intervals. In order to apply this neural network approach, which is called the partial logistic regression model [20], the continuous survival time with time-fixed covariates must be discretized, for example, into one-month or one-week intervals. There is no principled way to decide between one-month and one-week intervals; that is, the discretization is not unique. For the data in Table 1, however, the choice of discretization of the time axis for the partial logistic regression model is generally determined by clinical relevance, possibly according to the scheduled time intervals between follow-up visits.
The primary goal of the present study is to predict short-term survival in patients on the basis of measurements of several characteristics given as time-dependent covariates $X^d_l$, for the purpose of facilitating the decision as to when to undertake liver transplantation. Based on the partial logistic model due to Cox [22] and Efron [23] for grouped data, Tsujitani and Sakon [24] proposed a partial logistic model with a discrete hazard rate $h^{<d>}_l$ for ungrouped data having time-dependent covariates:

$$\ln\frac{h^{<d>}_l}{1 - h^{<d>}_l} = \sum_{i=1}^{I} \beta_i x^d_{li}, \tag{4}$$

where $\beta = (\beta_1, \ldots, \beta_I)$ is a vector of coefficients. The modeled response is the logit of the hazard rate, and the logit is linear in the covariates. However, this assumption is violated when covariate effects are best represented by smooth, nonlinear functions. In recent years, a variety of powerful techniques have been developed for exploring the functional form of effects. We examine here a flexible survival model, the GAM, which extends the generalized linear model (GLM) and does not require linearity of the covariate functions; see, for example, Hastie and Tibshirani [13] and McCullagh and Nelder [25]. By identifying nonlinear covariate effects, we can estimate a patient's prognosis more accurately and thus decide on a liver transplant based on the prediction of short-term survival.

The linear predictor in (4) is specified as a sum of smooth functions s(x), with continuous second derivatives, of some or all of the covariates for the discrete hazard rate $h^{<d>}_l$ of patient #d at the time interval l:

$$\ln\frac{h^{<d>}_l}{1 - h^{<d>}_l} = \sum_{i=1}^{I} s_i(x^d_{li}). \tag{5}$$

The smooth functions in (5) can be represented as linear combinations of B-spline basis functions, where $q_1, q_2, \ldots, q_I$ are the numbers of knots and the functions $b_{ij}(x_{ij})$ in the matrix Z are the B-spline basis functions. Thus, (5) can be rewritten as a model (9) that is linear in the spline coefficients, where B-splines with 10 interior knots are used for each continuous covariate.
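To make the B-spline representation concrete, the following Python sketch evaluates a cubic B-spline basis with 10 interior knots using the Cox-de Boor recursion. The equally spaced knot placement, the clamped boundary knots, and the function names are assumptions of this sketch; the text does not state how the knots were placed.

```python
def clamped_knots(lo, hi, n_interior, degree):
    """Knot vector with equally spaced interior knots and repeated end knots."""
    step = (hi - lo) / (n_interior + 1)
    interior = [lo + step * (i + 1) for i in range(n_interior)]
    return [lo] * (degree + 1) + interior + [hi] * (degree + 1)

def bspline_basis(x, knots, degree):
    """Values at x of all B-spline basis functions (Cox-de Boor recursion)."""
    n0 = len(knots) - 1
    # Degree 0: indicator of the knot interval that contains x.
    basis = [1.0 if knots[j] <= x < knots[j + 1] else 0.0 for j in range(n0)]
    if x == knots[-1]:
        # Assign the right end point to the last non-empty interval.
        for j in range(n0 - 1, -1, -1):
            if knots[j] < knots[j + 1]:
                basis[j] = 1.0
                break
    for k in range(1, degree + 1):
        raised = []
        for j in range(n0 - k):
            left_den = knots[j + k] - knots[j]
            right_den = knots[j + k + 1] - knots[j + 1]
            # Terms with a zero denominator (repeated knots) are taken as 0.
            left = (x - knots[j]) / left_den * basis[j] if left_den > 0 else 0.0
            right = ((knots[j + k + 1] - x) / right_den * basis[j + 1]
                     if right_den > 0 else 0.0)
            raised.append(left + right)
        basis = raised
    return basis

knots = clamped_knots(0.0, 1.0, n_interior=10, degree=3)
row = bspline_basis(0.37, knots, degree=3)
print(len(row), sum(row))
```

Evaluating `bspline_basis` at every observed covariate value and stacking the results row by row gives one block of columns of a design matrix in the spirit of the matrix Z; a smooth term is then a linear combination of these columns.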
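The number of knots is arbitrary but appears to have little effect on the results, provided that the number is not too small, as described in Gray [15]. At the time interval for the lth clinic visit of patient #d, we define the binary response $\delta^{<d>}_l$, which equals 1 if patient #d dies in that interval and 0 otherwise, together with the histories in (10), where $v^d_l$ is the history of deaths and censorings over the time intervals up to the lth clinic visit of patient #d, and $v^d$ is the same history extended to include $\delta^{<d>}_l$. Tsujitani and Sakon [24] derived the full log likelihood for all patients (11), with partial log likelihood

$$\ln L(\beta) = \sum_{d=1}^{n} \sum_{l} \left\{ \delta^{<d>}_l \ln h^{<d>}_l + \left(1 - \delta^{<d>}_l\right) \ln\left(1 - h^{<d>}_l\right) \right\}, \tag{12}$$

where the inner sum runs over the clinic visits of patient #d.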

Computational and Mathematical Methods in Medicine
The unknown parameters β in (9) can thus be estimated by maximizing the partial log likelihood (12), which is the log likelihood for independent Bernoulli trials. Although ln L(β) is not a log likelihood in the usual sense, it possesses the usual asymptotic properties under fairly broad conditions; see, for example, Andersen and Gill [26].
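Since the partial log likelihood has the form of a log likelihood for independent Bernoulli trials with a logit link, it can be evaluated as in the following Python sketch. The function names and the toy inputs are hypothetical; each entry corresponds to one patient-interval observation, with a death indicator and a value of the linear predictor.

```python
import math

def inv_logit(eta):
    """Map a linear predictor to a discrete hazard in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def partial_log_likelihood(deltas, etas):
    """Sum of Bernoulli log-likelihood terms over patient-interval rows.

    deltas : death indicators (1 = died in the interval, 0 = otherwise)
    etas   : linear predictors, i.e. logits of the discrete hazards
    """
    total = 0.0
    for delta, eta in zip(deltas, etas):
        h = inv_logit(eta)
        total += delta * math.log(h) + (1 - delta) * math.log(1.0 - h)
    return total

# Toy example: three intervals, death in the last one, hazard 0.5 throughout.
value = partial_log_likelihood([0, 0, 1], [0.0, 0.0, 0.0])
print(value)
```

In practice this quantity is what standard logistic-regression or GAM software maximizes when the stacked patient-interval rows are supplied as binary responses.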
To avoid overfitting, such models are estimated by penalized maximum likelihood, that is, by maximizing

$$\ln L(\beta) - \frac{1}{2} \sum_{i=1}^{I} \lambda_i \int \left\{ s_i''(t) \right\}^2 dt, \tag{13}$$

where $\lambda_i$ are smoothing parameters that control the tradeoff between the fit and the smoothness, and $s_i''$ is the second derivative of $s_i$ with respect to t. The advantages of penalized estimation are explained in Wood ([27], Section 4.1). Two model-fitting issues remain. The first concerns the selection of the smoothing parameters $\lambda_i$ in (13). The cost of choosing the smoothing parameters carefully is outweighed by the benefits: easy identification of a covariate's functional form and the applicability of established inferential methods to short-term survival prediction. In order to select the smoothing parameters, the algorithm due to Wood [27-29] can be used, which minimizes GCV as an approximation to leave-one-out CV. However, GCV is defined at the level of individual observations, whereas each patient contributes several observations. Patients were scheduled to return for further observations at six months, 12 months, and yearly thereafter, so that n = 312 patients generate 1945 observations in total; for example, patient #9, who died, generated seven observations, as shown in Table 1.
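Because one patient contributes several observations, a leave-one-out scheme can be organized by patient rather than by observation. The following Python sketch shows one way to form such splits. The function names are hypothetical, and `fit` and `loss` are placeholder callables supplied by the user (for example, a GAM fit for a given set of smoothing parameters and a held-out negative log likelihood); they are assumptions of this sketch, not part of any particular library.

```python
from collections import defaultdict

def leave_one_patient_out(patient_ids):
    """Yield (patient, train_rows, test_rows), holding out all rows of a patient."""
    rows_by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        rows_by_patient[pid].append(idx)
    for pid, test_rows in rows_by_patient.items():
        held_out = set(test_rows)
        train_rows = [i for i in range(len(patient_ids)) if i not in held_out]
        yield pid, train_rows, test_rows

def patientwise_cv(patient_ids, fit, loss):
    """Sum the held-out loss over patients; fit and loss are user-supplied."""
    total = 0.0
    for _, train_rows, test_rows in leave_one_patient_out(patient_ids):
        model = fit(train_rows)            # fit without the held-out patient
        total += loss(model, test_rows)    # evaluate on that patient's rows
    return total

# Toy layout: three patients with 2, 1, and 3 rows.
ids = [1, 1, 2, 3, 3, 3]
splits = list(leave_one_patient_out(ids))
print([(pid, test) for pid, _, test in splits])
```

Repeating `patientwise_cv` over a grid of candidate smoothing parameters and keeping the minimizer is one simple way to use such a criterion; the search strategy is left open here.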
We propose a natural extension of the v-fold CV algorithm in the form of "leave-one-out" CV based on each of the n = 312 patients. Ordinary v-fold CV divides the data randomly into v groups so that their sizes are as nearly equal as possible. The partition should be made so as to avoid possible biases, as described in Zhang [30]. For time-dependent covariates, ordinary v-fold CV is thus unsatisfactory in several respects, because a random partition can place observations from the same patient in different groups. The data structure itself suggests how the partition should be done: a natural extension of the v-fold CV algorithm, setting v = n, is to delete one patient at a time together with all of that patient's observations. The variant v-fold CV is given as follows:
Step 1. Split the original sample into a training sample, from which all observations of patient #d are deleted, and a test sample consisting of the observations of patient #d.
Step 2. Fit the model to the training sample and compute the CV estimate for patient #d from the test sample.
Step 3. Do the above for d = 1, 2, . . . , n and combine the CV estimates.

A second issue is the goodness-of-fit test of the model. After choosing the optimum smoothing parameters via the variant v-fold CV algorithm, the deviance allows us to test the goodness of fit:

$$\mathrm{Dev} = 2\left(\ln L_{\max} - \ln L_c\right) = -2 \ln L_c, \tag{15}$$

where $\ln L_c$ denotes the maximized partial log likelihood under some current GAM, and the log likelihood for the maximum (full) model, $\ln L_{\max}$, is zero. The deviance given by (15), however, does not follow even approximately a $\chi^2$ distribution when ungrouped binary responses are available; see, for example, Landwehr et al. [31], Tsujitani and Koshimizu [32], and Collett [33]. The number of degrees of freedom required for a significance test using the assumed $\chi^2$ distribution for the deviance is a contentious issue, and no adequate distribution theory exists for the deviance. The reason for this is somewhat technical; for details, see Section 3.8 in Collett [33]. Consequently, the deviance from fitting a model to binary response data cannot be used directly as a summary measure of goodness of fit. Based on the above discussion, we apply the bootstrap of Efron and Tibshirani [34] to the deviance (15) in order to obtain a goodness-of-fit test, as follows.
Step 1. Generate B bootstrap samples $X^{*} = \{X^{1*}_l, \ldots, X^{n*}_l\}$ from the original sample X. Let $X^{*}(b)$ denote the bth bootstrap sample.
Step 2. For the bootstrap sample $X^{*}(b)$, the deviance of (15) is computed and denoted by $\mathrm{Dev}^{*}(b)$. This process is repeated independently B times, and the computed values are arranged in ascending order.
Step 3. The value of the jth order statistic $\mathrm{Dev}^{*}$ of the B replications can be taken as an estimate of the quantile of order j/(B + 1).
Step 4. The estimate of the 100(1 − α)th percentile (i.e., the 100α% critical point) of $\mathrm{Dev}^{*}$ is used to test the goodness of fit of a model at a specified significance level α = 1 − j/(B + 1). A value of the deviance (15) greater than the estimated percentile indicates that the model fits poorly.
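Steps 3 and 4 can be sketched in Python as follows, taking the bootstrapped deviances as given. The function names are hypothetical, and rounding j up to the next integer when (1 − α)(B + 1) is not an integer is an assumption of this sketch; the text only specifies the relation α = 1 − j/(B + 1).

```python
import math

def bootstrap_critical_point(boot_devs, alpha):
    """Return the jth order statistic, with j chosen so that alpha = 1 - j/(B+1)."""
    ordered = sorted(boot_devs)          # ascending order, as in Step 2
    B = len(ordered)
    # round() guards against floating-point noise before taking the ceiling.
    j = math.ceil(round((1.0 - alpha) * (B + 1), 9))
    j = min(max(j, 1), B)                # keep j inside 1..B
    return ordered[j - 1]

def fits_poorly(observed_dev, boot_devs, alpha=0.05):
    """True if the observed deviance exceeds the bootstrap critical point."""
    return observed_dev > bootstrap_critical_point(boot_devs, alpha)

# Toy example: B = 399 replicate deviances equal to 1, 2, ..., 399.
toy = list(range(399, 0, -1))
print(bootstrap_critical_point(toy, 0.05))
```

With B = 399 the 5% critical point is an exact order statistic (j = 380); for other values of B the rounding convention above determines which order statistic is used.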

Results
The survival function for our discretized situation is

$$\Pr(l) = \prod_{k=1}^{l} \left(1 - h^{<d>}_k\right).$$

The conditional probability Pr(t, t + Δt) of survival over a short time interval Δt (say, six months) after time t can be estimated as

$$\Pr(t, t + \Delta t) = \frac{\Pr(t + \Delta t)}{\Pr(t)}.$$

By using the variant v-fold CV and GCV, the optimum smoothing parameters for the GAM are determined as shown in Table 2. Table 3 summarizes the P values for testing the nonparametric effects of the covariates in the model s(time) + s(age) + s(pro) + s(bili), with the smoothing parameters chosen by the variant v-fold CV and by GCV. From Table 3, all covariates are highly significant under GCV; however, time is not significant under the variant v-fold CV. GCV is only an approximation to leave-one-out CV over observations, whereas the variant v-fold CV is leave-one-out CV over the n = 312 patients. We therefore prefer the variant v-fold CV to GCV. For the purpose of comparison, the results obtained using GCV are included in Tables 2 and 3. Table 4 shows the results for a number of models that were fit to the data.
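For discrete hazards, the survival function is commonly computed as a running product of one minus the hazard, and the conditional survival probability as a ratio of two survival values. The following Python sketch assumes that form; the function names and the hazard values are hypothetical.

```python
def survival_curve(hazards):
    """Return [S(0), S(1), ..., S(L)] with S(l) = prod_{k<=l} (1 - h_k)."""
    surv = [1.0]
    for h in hazards:
        surv.append(surv[-1] * (1.0 - h))
    return surv

def conditional_survival(hazards, l, steps):
    """Probability of surviving `steps` further intervals, given survival to l."""
    surv = survival_curve(hazards)
    return surv[l + steps] / surv[l]

# Hypothetical discrete hazards for three consecutive intervals.
hazards = [0.1, 0.2, 0.5]
curve = survival_curve(hazards)
print(curve)
print(conditional_survival(hazards, l=1, steps=1))
```

When a single interval is looked ahead, the conditional survival probability reduces to one minus the hazard of that interval, which gives a quick check on the computation.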
The likelihood ratio (LR) test based on the deviance can be used to test whether a spline effect provides a significantly better fit than a linear effect. Table 5 shows the tests of significance for the spline effects based on the models in Table 4. It is clear from Table 5 that the spline effects of prothrombin time and bilirubin are strongly significant. For age, the spline does not provide a significantly better fit than a linear term. Thus, we accept the final model age + s(pro) + s(bili), with effective degrees of freedom (edf) of 3.954 and 4.477 for s(pro) and s(bili), respectively. We used the likelihood ratio test instead of information criteria as a valid alternative approach to model selection. From Table 4, however, the same final model is selected by using AIC. Figure 1 shows the histogram of the bootstrapped Dev*(b) for the optimum model with B = 400. The bootstrap estimate of the 95th percentile (i.e., the 5% critical point) is Dev* = 734.59. Since the deviance Dev = 663.65 of (15) is smaller than this critical point, the model fits the data adequately. Figure 2 shows the prediction of the probability of surviving beyond the next six months for patient #9, who died. For the purpose of comparison, the results obtained using partial logistic regression, the Mayo updated model, and the European new version model are also provided. Figure 2 also indicates that the six-month survival probabilities predicted by GAM are lower than those predicted by the other models. Because patient #9 died, lower predicted probabilities indicate better prediction. The conditional probability Pr(t, t + Δt) of survival over a short time interval Δt (say, six months) after time t during the course of the disease can be predicted for both the censored and the dead patients.
For the graphical representation, the individual probabilities of predicted survival are averaged in order to compare the Mayo updated model, the European new version model, the partial logistic regression model, and GAM. We can predict the probability of survival over the following six months using the four models with respect to dead and censored data out of 312 patients. For the dead group (g = 1) and the censored group (g = 2), the probability of surviving over the next Δt months is denoted by $\Pr^{[g]}_d(l, l + \Delta t)$ for the lth clinic visit of patient #d. The average probability of survival over the next Δt months for the lth clinic visit in group g can be estimated as

$$\overline{\Pr}^{[g]}(l, l + \Delta t) = \frac{1}{n^{[g]}_l} \sum_{d=1}^{n^{[g]}_l} \Pr^{[g]}_d(l, l + \Delta t), \quad g = 1, 2,$$

where $n^{[g]}_l$ is the total number of patients for the lth clinic visit in group g, and $\Pr^{[g]}_d(l)$ is the survival function Pr(l) of patient #d at the time interval l in group g; see, for example, Markus et al. [35], Marubini and Valsecchi [10], and Thomsen et al. [36]. Figure 3 shows a comparison of the probability of survival over the next six months using the four models with respect to dead and censored data among all 312 patients. The figure shows that (i) for the dead data, the six-month survival probabilities predicted by GAM are lower than those predicted by the other models, and (ii) for the censored data, the difference among the four models is very small. Figure 4 shows box-and-whisker plots of the probability of survival over the next six months using GAM with respect to dead data among all 312 patients. It should be noted that the variance of the probabilities of survival over the next six months is much higher at the fourth clinic visit than at the other clinic visits. Another useful idea in our analysis is the concept of competing risks. In the PBC data, liver transplantation is a competing risk, and competing risks have conventionally been treated as censored data.
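The averaging of individual predicted probabilities within each group and clinic visit can be sketched in Python as follows. The record layout (group, visit, probability), the function name, and the numerical values are hypothetical.

```python
from collections import defaultdict

def average_survival(records):
    """Average predicted survival probabilities by (group, visit).

    records : iterable of (group, visit, probability) tuples, where group is
              1 for dead and 2 for censored patients.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, visit, prob in records:
        sums[(group, visit)] += prob
        counts[(group, visit)] += 1
    # Divide each sum by the number of patients seen at that visit in that group.
    return {key: sums[key] / counts[key] for key in sums}

# Hypothetical six-month survival predictions at the first clinic visit.
records = [(1, 1, 0.9), (1, 1, 0.7), (2, 1, 1.0)]
averages = average_survival(records)
print(averages)
```

Because the number of patients still under observation changes from visit to visit, the divisor is counted separately for every (group, visit) pair rather than fixed at the group size.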
By adding $x^d_{l4}$ as a time-dependent covariate for liver transplantation, one can test the significance of liver transplantation. The covariate for liver transplantation is taken as a binary variable (coded 0 before liver transplantation and 1 at liver transplantation), as shown in Giorgi and Gouvernet [16] and Crowley [37]. Table 6 shows the three types of combination of "censored" and "liver transplantation." Table 7 shows the values of the covariates for liver-transplanted patient #5. In order to test the significance of "liver transplantation," we consider two models:
Model IV: age + s(pro) + s(bili).
Model IV′: age + s(pro) + s(bili) + liver transplantation.
The values of deviance and d.f. are given in Table 8. The reduction in deviance is 10.657 (663.65 − 653.00, up to rounding) on 0.973 d.f., which is significant at the 1% level.
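As a rough check of the significance statement, the following Python sketch computes the upper tail probability of a chi-square distribution for the reported reduction in deviance. Rounding the reported 0.973 d.f. to 1 d.f. is an assumption made here so that the tail probability has a closed form through the complementary error function; the exact fractional-d.f. reference distribution used for Table 8 is not reproduced.

```python
import math

def chi2_upper_tail_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Deviances of the models without and with the transplantation covariate.
reduction = 663.65 - 653.00
p_value = chi2_upper_tail_1df(reduction)
print(round(reduction, 2), p_value)
```

Under this approximation the tail probability is well below 0.01, in line with significance at the 1% level.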
For the purpose of comparison, the hazard of the cumulative incidence function (CIF) may be modeled in the presence of competing risks. The model (19) is specified through the hazard of the subdistribution, where t is the time of the last observation (not the midpoint of the time interval l), γ is the hazard of the subdistribution, and $\gamma_0$ is the baseline hazard of the subdistribution ([38], Section 6.2). The P values for testing the significance of the covariates under model (19) are summarized in Table 8. From Tables 3 and 8, there is little difference between our method and the CIF approach.

Discussion
In this paper, we introduced a probabilistic interpretation of GAM and applied the maximum likelihood principle to GAM for the analysis of survival data having time-dependent covariates. We proposed a criterion based on the variant v-fold CV for choosing the optimal smoothing parameters in the application of GAM. With the maximum likelihood principle introduced into GAM, the deviance allows us to test the goodness of fit of GAM. The proposed methods were illustrated by comparing the probability of survival over the next six months using the Mayo updated model, the European new version model, the partial logistic regression model, and GAM with respect to dead and censored data among the PBC data. We expect that flexible methods for modeling survival data with time-dependent covariates using machine learning methods, such as support vector machines, will be very useful in such real-world contexts; see, for example, Hastie et al. [39]. Furthermore, smoothing spline ANOVA models by Gu [40] will enable us to include interactions between the covariates. We assumed that there is only one cause of failure; that is, the event is allowed to occur only once for each patient. However, there is increasing interest in analyzing survival data sets with multiple events per patient [8,41]. Wei et al. [42] analyzed bladder cancer data by modeling the marginal distributions of multivariate failure times with proportional hazards models. The model may violate the proportional hazards assumption, even when the overall data set does not (Table 9). With a suitable modification of the model, the ideas presented in this paper can be extended to the identification of prognostic factors for survival time in the case in which the same event may recur during follow-up and covariate values change with time.