Gaussian Copula Regression Modeling for Marker Classification Metrics with Competing Risk Outcomes

,


Introduction
In clinical medical practice, decisions about personalized treatments are generally guided by markers that can discriminate between diferent levels of risk of a specifc type of death.In time-to-event data, an individual can be exposed to two types of failure, the so-called competing risks; here, the phenomenon of interest is to assess the classifcation capacity of the marker to predict a cause-specifc outcome within a certain period, leading to a more objective choice of treatment of a condition.In this context, Saha and Heagerty [1] defned cumulative/dynamic discriminatory measures to evaluate the prediction accuracy of a marker M to distinguish between the subjects who experience the cause-specifc event D within survival time T and those who do not.Specifcally, they defne the cumulative true positive rate TP C j as the probability of a high marker value among those subjects who experience the event D � j (j � 1, 2) within time t and the dynamic false positive rate FP D as the probability of a high marker value among subjects who are event-free through time t.Te cumulative/dynamic ROC C/D j curve for each event type can be obtained by plotting TP C j versus FP D at time t, for all possible values of the marker.Te corresponding area under the curve AUC C/D j is then used as a global summary of the classifcation power of the marker.
Inference methods have mainly used nonparametric approaches for the estimation of ROC C/D j [1][2][3].Te primary limitation of such formulations is the difculty that occurs when it comes to incorporating covariates into the analysis.Zheng et al. [4] obtained expressions of TP C j and FP D in terms of the cumulative incidence functions (CIF's), which are represented using cause-specifc regression hazard functions models for the diferent failure types; however, it has been documented that standard specifcations for those models, including the proportional hazards assumption they employ, lead to model selection issues since testing for covariate efects on the CIF's is not possible [5], thus hampering the comparison between diferent ROC curves.
In this article, we employ the joint regression analysis using Gaussian copulas introduced in [6] to model the joint distribution of (M, T, D).Tis approach represents the joint model in terms of a trivariate Gaussian copula distribution, subsequently utilized to characterize the ROC C/D j curve.Te regression model is constructed by specifying the marginal distribution of the marker with a skewed normal regression model, the marginal distribution of the survival time with a parametric proportional hazards regression model, and the marginal distribution of the cause of failure with a logistic regression model.Maximum likelihood is employed with a constraint that ensures that the dependence parameters in the Gaussian copula yield a positive-defnitive correlation matrix, which must be satisfed.Also, information-criterion metrics are employed for variable selection, leading to the formulation of a parsimonious model.Te joint regression modeling yields a straightforward and fexible model for the time-dependent ROC curve in the presence of competing risks.Tis direct approach enhances comprehension regarding covariates' impact on a marker's discriminatory ability, representing the paper's principal contribution.
Te rest of the paper is organized as follows.Section 2 defnes the parametric Gaussian copula-based model, the copula relation to time-dependent discrimination metrics, and the marginal specifcations.In Section 3, the maximum likelihood estimation procedure is described.Section 4 provides some simulation results to explore the properties of the proposed estimators under diferent sample sizes and censoring scenarios.Section 5 illustrates the method with data from a well-known prostate cancer dataset [7].Te paper culminates with a conclusion.

Gaussian Copula-Based Models and the Relation with ROC and AUC.
In applied research, the copula model represents a convenient and fexible way to construct multivariate distributions that embody wide ranges of dependence structures while allowing for the specifcation of the underlying marginal distributions [8].Sklar's theorem provides the decomposition of any continuous m-variate distribution function F in terms of the marginal cumulative distribution functions (CDF's) F 1 , . . ., F m and the associated copula C: [0, 1] m ⟶ [0, 1], which is a m-variate distribution function with uniform margins U(0, 1) [9], and thus the representation of the multivariate distribution function (MDF) is given by F(y 1 , . . ., y m ) � C(F 1 (y 1 ), . . ., F m (y m )); also, the theorem states that given F, the copula is unique on Range(F 1 ) × • • • × Range(F m ), implying that the copula is unique if the marginals F 1 , . . ., F m are continuous.
Given its fexibility, analytical tractability, and capacity to characterize rich dependence structures, similar to the multivariate Gaussian distribution, this study adopted the trivariate Gaussian copula, which is defned by where Φ 3 is the standard normal trivariate distribution function (zero means and unit variances), Φ − 1 is the quantile function corresponding to the univariate standard normal distribution, and Γ is a nonnegative symmetric positive defnite correlation matrix given by where each dependence parameter . Following the Radon-Nikodym derivative methodology introduced in [6], the joint density function of (M, T, D) corresponding to the copula model in equation ( 1) is represented by and Given the cumulative true positive rate for cause j, defned by TP C j � (m, t) � Pr(M > m | T ≤ t, D � j), and the dynamic false positive rate, defned by FP D (m, t) � Pr (M > m | T > t), the time-dependent cumulative/dynamic ROC curve at time t for cause j is expressed as follows: where Te corresponding area under the curve at time t for cause j, AUC C/D j (t) �  1 0 ROC C/D j (x, t)dx, represents the probability that an individual with a positive diagnosis has a higher value of the marker than an individual with a negative diagnosis [10].
Te discrimination metrics can be expressed in terms of the joint model equation ( 4): Here, C[F T (t), F D (j)] and C[F M (t), F T (t)] represent the marginal bivariate Gaussian copula functions associated with the trivariate Gaussian copula, following the marginalization property of the multivariate Gaussian copula [11].Te derivation of equations ( 6) and ( 7) is detailed in Appendix A.

Marginal Specifcations.
In practice, markers tend to exhibit skewed distributions [12].Tus, in this study, the margin of M follows the skewed-normal distribution, denoted here by M ∼ SN(ξ M , ω 2 M , α M ), whose density function is given by [13] where φ and Φ are, respectively, the density and CDF of a standard normal random variable, is the scale parameter, and α M ∈ (− ∞, ∞) is the slant parameter.Explanatory variables are incorporated by linking location parameter to a linear component, namely, ξ M � α T w, where α and w are vectors of parameters and covariates, respectively.
For modeling the margin of T, this study employed the proportional hazards model, which is given by , where F 0 (t; Λ)) is a baseline distribution with vector of parameters Λ, and β and x are vectors of parameters and covariates, respectively; here, x does not contain the intercept and does not necessarily contain the same variables in w.Since it has shown good fts to survival data in various applied analyses [14], this study used the Weibull distribution as the baseline; that is, ), where t ∈ (0, ∞), and ] ∈ (0, ∞) and κ ∈ (0, ∞) are, respectively, the shape and scale parameters.Te incorporation of covariates x to the margin was specifed with the following which will be referred to as the survival component: International Journal of Mathematics and Mathematical Sciences where S 0 (t; ], κ) is the baseline survival function (which follows a Weibull distribution) and β is the parameter vector (which does not include intercept).Similar to various competing risk regression formulations for the cause of the event [15,16], the marginal model for D � 1 was specifed with a logistic model, which is given by where z is the vector of covariates, not necessarily containing the same variables in w and x, and δ 1 is a vector representing the model parameters when D � 1.
In the proposed fully parametric Gaussian copula regression model, the cumulative true positive rate for cause j and the dynamic false positive rate are now expressed conditionally on the covariates: TP C j (m, t; w, x, z) and FP D (m, t; w, x, z).Consequently, the time-dependent cumulative/ dynamic ROC curve for event type j and the corresponding AUC are also conditional expressions, ROC C/D j (q, t; w, x, z) and AUC C/D j (t; w, x, z), respectively.

Estimation
Consider the vector of marker, survival time, and cause of death , where m i is the observed value of M i , t i is the time to either failure or censoring, d i identifes the type of failure, taking values 1 and 2 for the observed survival times and 0 for the censored times, and y i is a vector of covariates.Te full likelihood function is given by where ω is a vector containing all parameters, ; here f M,T (m, t) can be obtained by marginalizing the trivariate joint model [11].
Due to the complexity of the likelihood function formulation, we propose a numerical procedure for the corresponding estimation process, which involves employing an optimization algorithm called PRAXIS (Principal Axis), developed by Richard Brent [17].Te optimization algorithm is implemented in the R package nloptr using the function nloptr( ).
Maximizing the log-likelihood function in equation ( 11) requires setting lower and upper bound constraints for each dependence parameter, c MT , c TD , and c MD , to ensure the positive defniteness of the correlation matrix Γ.To achieve this, we developed a heuristic algorithm to create box constraints for each dependence parameter, ensuring lb MT < ub MT , lb MD < ub MD , and lb TD < ub TD .A detailed explanation of the heuristic algorithm can be found in Appendix B. Log links are used for the strictly positive parameters.

Simulation Experiments
To evaluate the methods' performance and robustness, we generate 200 samples of n � 200 and n � 400 from the trivariate Gaussian and Student t copulas with margins U[0,1].Te dependence parameters are set to c MT � − 0.2, c MD � − 0.7, and c TD � 0. Te copulas are implemented using the functions normalCopula( ) and tCopula( ) from the copula package.
Te censoring time is simulated from a uniform distribution U(0, c), where c takes values of 200 and 85.Te observation is considered censored if a simulated failure time exceeds the corresponding simulated censoring time.Te average percent censored for c � 200 and c � 85 was around 11% and 26%, respectively.
Tables 1 and 2 present empirical bias, standard error (SE), and mean square error (MSE) of the coefcients obtained from the Gaussian copula regression model for simulations from both copulas.Generally, empirical biases and standard errors are relatively small compared to the actual parameters, resulting in a low mean square error, which suggests that the estimation process is appropriate for both sample sizes.
Overall, all three performance measures tend to approach zero with an increased sample size.Notably, an increase in censoring has minimal impact on the estimation process, as evidenced by the only marginal increase in mean square error.Additionally, although there is a slight increase in the mean square error for estimators derived from simulations of a trivariate Student t copula, this increase is not drastic.Consequently, the Gaussian copula regression model exhibits favorable estimation properties in the 4 International Journal of Mathematics and Mathematical Sciences considered moderate sample sizes and appears robust concerning the Student t copula.

Prostate Cancer Study Data
Tis study utilizes the dataset described in [7] to exemplify the proposed methods' use.Te dataset comprises an approximate fve-year follow-up of a cohort consisting of 506 patients diagnosed with prostate cancer of stages III and IV.Tese patients were recruited in a clinical trial conducted between 1967 and 1969.A total of 4.54% of patients were excluded from the analysis due to incomplete covariate information.Among the remaining cohort, 125 patients (25.88%) succumbed to prostate cancer (D � 1), 219 patients (45.34%) died from other causes (D � 2), and the remaining 139 patients (28.78%) had censored survival times.Te underlying research problem consists of identifying a marker to then measure its discriminatory power.French et al. [19] suggested the utilization of a Cox regression model to formulate a composite marker, which is obtained as a weighted combination of both biomarkers and clinical variables, wherein the estimated regression coefcients serve as the weighting factors.In this study, the marker is derived as the linear predictor of a cause-specifc parametric Cox regression model [20], with the event of interest defned as death from prostate cancer.Specifcally, the linear component of the causespecifc hazard for D � 1 is estimated by treating the event times of all individuals who failed due to cause D � 2 as censored.Te ftting process is carried out utilizing the survreg( ) function from the survival package in R, considering a Weibull distribution.A backward elimination approach that is guided by the Bayesian information criterion (BIC) was employed to systematically eliminate the least signifcant covariate in each iteration, aiming to derive the most parsimonious model.Te fnal regression model incorporates the following covariates: PF (performance rating: 0, standard; 1, limitation of activity), HG (serum hemoglobin in g/100 ml), SZ (size of the primary lesion in cm 2 ), and SG (Gleason stage-grade category).Te resulting composite marker is defned as M C � 0.5243 × PF − 0.0144 × HG + 0.2948 × SG + 0.0381 × SZ. Figure 1 shows the histogram of the marker M C , which exhibits a skewed shape.Superimposed are the Gaussian kernel-based nonparametric density and the ftted skewednormal density.

Discriminatory Performance of the Composite Marker M C .
Te discrimination power of the composite marker M C is initially examined without considering covariates.Following the model ftting, it is observed that the sole parameter lacking individual signifcance is c TD , with an estimated value of − 0.089.A likelihood ratio test assesses the null hypothesis of c TD � 0 employing the chi-squared distribution approximation with one degree of freedom.With a p value of 0.2441, the null hypothesis cannot be rejected.Consequently, the simpler model with c TD � 0 is chosen on the grounds of parsimony.Te parameter estimates of the joint simpler model and their standard errors are detailed in Table 3.
Analyzing the dependence parameters, the negative correlation c MT � − 0.207 indicates that high values of M are associated with the early occurrence of any event (T represents overall survival time).Furthermore, because c MD � − 0.641, high values of M are more strongly associated with cause 1 than cause 2. Tus, the estimated correlations suggest a marker with desirable classifcation properties.
To assess the appropriateness of the parametric model's ft, we perform a comparative analysis between the ROC curves obtained from the Gaussian copula regression model and the IPCW nonparametric ROC curves introduced by Blanche et al. [2].Te IPCW nonparametric approach possesses the advantage of not requiring any tuning parameters.Additionally, it can be easily implemented using the timeROC( ) function within the timeROC package in R. Te plots of the parametric and nonparametric estimators of ROC C/D 1 (q, t) at time horizons 12, 36, and 60 months are depicted in Figure 2. It appears that the two curves are relatively close, suggesting that the ft of the parametric model is adequate.

Inclusion of Covariates.
A Gaussian copula regression model is ftted to evaluate how the classifcation power of M C changes on including covariates.We consider potential predictors as the remaining variables not included in the composite marker: RX (drug treatment), HX (history of cardiovascular disease: 0, no; 1, yes), AGE (age in years), and WT (weight index: weight in kg-height in cm + 200).A backward elimination process based on the BIC is employed to remove the least signifcant covariate sequentially, resulting in the most parsimonious regression model.Table 4 displays BIC values from various regression models, indicating the predictors included in each marginal regression model during the elimination process.Te chosen model is Model 9, as it exhibits the lowest BIC.
Table 5 presents the best-ftting parametric Gaussian copula regression model coefcients.Similar to the case without covariates, only one parameter is not statistically signifcant: c TD .A likelihood ratio test is conducted, considering a nested model with c TD � 0 as the null hypothesis.Based on a chi-squared distribution with one degree of freedom, the null hypothesis of a model with c TD � 0 cannot be rejected.
To assess the validity of the proportionality assumption for the factor AGE in the survival component, Figure 3 illustrates the plot of the empirical estimator of log − log[  S(t)]   as a function of log(t).As AGE is a continuous variable, two balanced groups are created: Group 1 comprises individuals with ages less than 74, and Group 2 comprises individuals with ages greater than or equal to 74.Te log-minus-log plot shows approximate parallelism for most of the time, with some sparsity observed toward the end due to a limited number of observations in both groups at extended time points.Tis visual examination provides evidence in favor of the validity of the proportionality assumption in the Cox regression model.
Figure 4 illustrates ROC C/D 1 (q, t; HX, AGE) based on the best-ftting parametric Gaussian copula regression model for the two levels of the variable HX (history of cardiovascular disease) and ages 60, 73, and 85 years old at a time horizon of     Te model includes all the parameters on the left, whereas on the right, the parameter c TD � 0 is set to zero. 1 (q, t; HX, AGE) curves based on the best-ftting parametric Gaussian copula regression model at a time horizon of 60 months at three diferent ages.On the left, the plot shows the curves for patients with no cardiovascular disease (HX � 0), and on the right, the plot shows the curves for patients with cardiovascular disease (HX � 1).
International Journal of Mathematics and Mathematical Sciences 60 months.In this instance, the IPCW estimators of the timedependent ROC curves are not considered due to the impracticality of incorporating covariates in this nonparametric approach.Te corresponding AUC C/D 1 (t; HX, AGE) for the six combinations of predictors at the 60-month horizon are shown in Table 6.95% bootstrap confdence intervals (with biased correction [21]) are estimated using 500 numbers of runs.Notably, insights reveal that older age and a history of cardiovascular disease enhance the classifcation performance of the composite marker M C .

Conclusion
Tis study introduces a parametric Gaussian copula regression model for estimating time-dependent predictive accuracy metrics subject to right-censoring in the presence of competing risks.Te marginal specifcations encompass a normal asymmetric distribution for the marker, wherein a linear component is linked to the location parameter.Additionally, the model includes a parametric Cox regression model with a Weibull distribution for survival time and a logistic regression model for event type.Te dependencies among these variables are captured using a trivariate Gaussian copula.
Simulation studies demonstrate that the Gaussian copula regression model exhibits favorable properties in moderate sample sizes and appears robust to slight deviations from the data-generating process.To validate the utility of our approach in a real-world scenario, we apply our proposed methodology to a well-known prostate cancer dataset.Based on the linear predictor of a cause-specifc parametric Cox regression model, the composite biomarker M C demonstrates moderate predictive performance in identifying patients at risk of dying from prostate cancer.Interestingly, older individuals with cardiovascular disease contribute to an increased discriminatory power of M C .
Te primary contribution of this paper is the straightforward incorporation of covariates into the estimation of ROC curves and AUCs through joint regression modeling.Considering predictors in the initial Gaussian copula regression model eliminates the need for stratifed analyses or adjustments through indirect methodologies, enabling a better understanding of how covariates may afect the classifcation power of a marker.Te computation of the models relies predominantly on vector and matrix operations, along with readily available functions in the R language.Te R code that implements the proposed methodology is available upon request.Furthermore, the introduced modeling strategy possesses additional favorable properties, demonstrating profciency under moderate sample sizes and an ability to identify parsimonious models through information-criterion metrics.
Despite the compelling properties of the proposed methodology, some open questions remain.Te linear dependence structure enforced by the Gaussian copula may be broadened by utilizing alternative copula models.Vine copula models are grounded in decomposing dependence into a series of interdependencies among (conditional) pairs of variables, afording the capability to manage heavy tails or asymmetric dependence [22].We intend to investigate this approach in forthcoming research work.

Figure 1 :
Figure 1: Distribution of the composite marker M C .Nonparametric density with Gaussian kernel (line) is superimposed, as well as the ftted skewed-normal density (black dots).

Figure 2 :
Figure 2: Gaussian copula regression model and nonparametric IPCW plots of ROC C/D1 (q, t) at time horizon 12 months, 36 months, and 60 months for the composite marker M C in the absence of covariates.

Figure 3 :Figure 4 :
Figure 3: Log-minus-log plot for AGE in the prostate cancer study data.Group 1 (AGE <74 years old) and Group 2 (AGE ≥74 years old).

Table 1 :
Empirical bias, standard error (SE), and mean square error (MSE) of the estimators of the parametric Gaussian copula regression model for simulations generated from a Gaussian copula, with sample sizes 200 and 400 and censoring distributions U(0, 200) and U(0, 85).

Table 2 :
Empirical bias, standard error (SE), and mean square error (MSE) of the estimators of the parametric Gaussian copula regression model for simulations generated from Student t copula, with sample sizes 200 and 400 and censoring distributions U(0, 200) and U(0, 85).

Table 3 :
Coefcients (Coef.) and standard error (SE) of the best-ftting parametric copula model considering the composite marker M C for the prostate cancer study data.

Table 4 :
BICs obtained for the various models in each component of the joint model when using the backward elimination process.

Table 5 :
Coefcients (Coef.) and standard error (SE) of the best-ftting parametric Gaussian copula regression model considering the composite marker M C for the study of prostate cancer study data.