Survival systems are difficult to analyze in the presence of extreme observations and multicollinearity. Finding models that describe such systems robustly and capture smooth hazards in the presence of covariates is challenging given the sheer number of candidate models. Survival time algorithms that evaluate model efficiency in the presence of extreme observations across different datasets provide an effective tool for identifying robust systems. However, existing algorithms for the analysis of survival systems are limited in long-term evaluations. Therefore, an algorithm is developed that analyzes survival time response in high-dimensional, complex survival systems with extreme observations and explores large margins dynamically. The algorithm combines flexible parametric models with partial least squares to estimate smooth, flexible, and robust functions that extrapolate the survival model in long-term evaluations in the presence of extreme observations. It is tested and validated with four distributions on a simulated dataset generated from the Weibull distribution and compared with partial least squares-Cox regression. The comparison shows its flexibility and efficiency in handling different survival systems in the presence of extreme values. The algorithm is also used to analyze four real datasets of breast cancer survival time, each containing seven gene signatures. The coefficients of significant genes are estimated for each dataset. The flexibility in handling various distributions as parametric survival models supports the application of the algorithm to a large variety of survival problems and provides a robust statistical framework for survival analysis in the presence of extreme observations.

“Time-to-event” responses have become progressively relevant in the context of research studies, where the response of interest could be based on time to remission, overall survival [

Despite the popularity and advantages of nonparametric and semiparametric methods, parametric modeling procedures offer flexible, efficient, and robust alternatives. When the proportional hazards assumption is violated and the distributional assumption on the survival times is valid, flexible parametric models (FPM) are a suitable alternative to nonparametric and semiparametric methods, yielding more efficient estimates with smaller standard errors even in the presence of extreme observations. Statistical inference is also more precise because the full likelihood is used, and results are easy to interpret. Numerous theoretical distributions have been employed in FPM to approximate survival data. The exponential distribution serves as the basic model for survival time. Other distributions commonly used to model large survival datasets with extreme observations are the Weibull, gamma, Gompertz, log-normal, log-logistic, generalized gamma, and generalized F distributions. FPM can also efficiently incorporate covariates to investigate the dependence of the survival response through the estimated parameters, survival function, and cumulative hazard function [

The motivation of this research was to develop a robust model that is specifically designed for the analysis of survival systems in the presence of extreme observations and that can handle various probability distributions. This robust algorithm, Partial Least Squares-Flexible Parametric Models (PLS-FPM), lets the user define the probability distribution used to estimate the survival, hazard, and cumulative hazard functions and returns a calculated model selection criterion.

After validating and testing the performance and efficiency of PLS-FPM using simulated data, as well as showing its flexibility to handle different distributions, the algorithm is applied to the analysis of four breast cancer survival datasets, and significant genes are estimated. The analyses based on different distributions using several datasets revealed the robustness of these models to estimate smooth survival and cumulative hazard functions in the presence of extreme values.

The Cox proportional hazards (PH) model is the most frequently used regression technique to address survival data. In the presence of multicollinearity, the Cox algorithm is integrated with PLS resulting in the PLS-Cox model.

The PLS-Cox regression model is used as a reference model in this study. Let

Let

Any distribution defined for

The incomplete gamma integral in the survival and hazard functions of the gamma distribution has limited its use in survival analysis. A survival time random variable

Estimating the parameters requires complicated numerical calculations, because maximum likelihood estimation is difficult to carry out in the presence of incomplete gamma integrals. The gamma hazard function may be constant, monotonically increasing, or monotonically decreasing.

Let the survival time

Transforming survival function to the log cumulative hazard scale, FPM is formulated as the linear function of log time:

Adding covariates, the log cumulative hazard model becomes

Thus,

For

and the cumulative hazard function is

The hazard function of the log-logistic distribution is monotonically decreasing when the shape parameter is at most 1 and hump-shaped (rising to a peak and then declining) when it exceeds 1.

The natural logarithm of the lifetime

The cumulative hazard function of the lognormal distribution is

Various other standard and user-defined distributions, including extreme value distributions, can be used in FPM. In these models, proportional hazards imply proportional cumulative hazards; hence, covariate effects can be interpreted as hazard ratios, as in semiparametric models under the PH assumption. The cumulative hazard in FPM is a more stable function than in nonparametric models because it is modeled as a function of log time. For instance, in Weibull models the log cumulative hazard is a straight line in log time. Thus, stable-shaped functions are captured accurately. This study introduces the PLSR model integrated with FPM using the gamma, Weibull, log-logistic, and log-normal distributions for robust estimation with high-dimensional survival data in the presence of extreme values.
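The two properties used in this paragraph can be checked numerically. The following is a minimal sketch, assuming the common Weibull parameterisation H(t) = λ t^γ with a proportional hazards covariate effect exp(βx); the symbols `lam`, `gamma`, and `beta` and their values are illustrative assumptions, not the notation or estimates of this study.

```python
import math

def log_cum_hazard_weibull_ph(t, x, lam, gamma, beta):
    """ln H(t | x) = ln(lam) + gamma * ln(t) + beta * x (assumed parameterisation)."""
    return math.log(lam) + gamma * math.log(t) + beta * x

def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Hypothetical parameter values chosen only for illustration.
lam, gamma, beta = 0.1, 1.5, 0.7
times = [0.5, 1.0, 2.0, 4.0, 8.0]

# 1) On the log-log scale the baseline Weibull model is a straight line
#    whose slope is the shape parameter gamma.
log_t = [math.log(t) for t in times]
log_H0 = [log_cum_hazard_weibull_ph(t, 0.0, lam, gamma, beta) for t in times]
weibull_slope = least_squares_slope(log_t, log_H0)

# 2) Under proportional hazards the cumulative hazard ratio between x = 1 and
#    x = 0 is the same at every time point and equals exp(beta).
cum_hazard_ratios = [
    math.exp(log_cum_hazard_weibull_ph(t, 1.0, lam, gamma, beta)
             - log_cum_hazard_weibull_ph(t, 0.0, lam, gamma, beta))
    for t in times
]
```

The slope recovers the shape parameter, and the ratio of cumulative hazards stays constant over time, which is why coefficients on this scale can be read as log hazard ratios.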

Suppose

Flow diagram of survival analysis approaches.

The PLS-FP model works in two steps. In the first step it computes the PLS components, and in the second step it fits the flexible parametric distribution to those components. The proposed model improves prediction and accuracy in the presence of extreme observations.
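The two-step idea can be sketched in a few lines. This is a deliberately simplified illustration and not the authors' implementation: it extracts a single PLS component, uses log observed time as the working response for the PLS step, and fits an exponential model (a Weibull with shape 1) in the second step. All function names, the toy data, and the parameter values are hypothetical.

```python
import math
import random

def first_pls_component(X, y):
    """Step 1: unit weight vector w (proportional to X'y) and scores z = Xw, after centring."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    y_bar = sum(y) / n
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    w = [sum(Xc[i][j] * (y[i] - y_bar) for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    z = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    return w, z

def fit_exponential_ph(time, event, z, lo=-10.0, hi=10.0, iters=100):
    """Step 2: maximum likelihood for h(t | z) = exp(a + b*z) with right censoring."""
    d = sum(event)                                   # number of observed events
    sum_dz = sum(e * v for e, v in zip(event, z))

    def score(b):
        # Derivative of the log-likelihood in b after profiling out the intercept a.
        num = sum(t * v * math.exp(b * v) for t, v in zip(time, z))
        den = sum(t * math.exp(b * v) for t, v in zip(time, z))
        return sum_dz - d * num / den

    for _ in range(iters):                           # bisection: score is decreasing in b
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    a = math.log(d / sum(t * math.exp(b * v) for t, v in zip(time, z)))
    loglik = sum(e * (a + b * v) - t * math.exp(a + b * v)
                 for t, e, v in zip(time, event, z))
    return a, b, loglik

# Toy data: three strongly collinear covariates driven by one latent variable u.
rng = random.Random(0)
X, time, event = [], [], []
for _ in range(200):
    u = rng.gauss(0.0, 1.0)
    X.append([u + 0.1 * rng.gauss(0, 1), u + 0.1 * rng.gauss(0, 1), -u + 0.1 * rng.gauss(0, 1)])
    t = rng.expovariate(0.1 * math.exp(0.8 * u))     # higher u means higher hazard
    time.append(min(t, 10.0))                        # administrative censoring at 10
    event.append(1 if t <= 10.0 else 0)

w, z = first_pls_component(X, [math.log(t) for t in time])
a_hat, b_hat, loglik = fit_exponential_ph(time, event, z)
```

Because the three covariates are collapsed into one score before the likelihood is maximised, the parametric fit involves only two parameters regardless of how collinear the original covariates are. A fuller version would extract several components and replace the exponential model with the chosen flexible distribution.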

Generation of simulated data for survival response from standard parametric distributions.
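One standard way to draw survival times from a parametric proportional hazards model is inverse-transform sampling of the survival function. The sketch below assumes a Weibull model with S(t | x) = exp(-λ t^γ exp(βx)), a single standard normal covariate, and administrative censoring at a fixed follow-up time; the function name and all parameter values are hypothetical and are not the settings used in this study.

```python
import math
import random

def simulate_weibull_ph(n, lam, gamma, beta, max_time, seed=0):
    """Draw (time, event, x) triples from an assumed Weibull PH model.

    Solving S(t | x) = u for t gives
        t = (-ln(u) / (lam * exp(beta * x))) ** (1 / gamma).
    Times beyond max_time are censored at max_time (event = 0).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        u = 1.0 - rng.random()            # uniform on (0, 1], so log(u) is finite
        t = (-math.log(u) / (lam * math.exp(beta * x))) ** (1.0 / gamma)
        event = 1 if t <= max_time else 0
        rows.append((min(t, max_time), event, x))
    return rows

# Hypothetical settings: 500 subjects followed for at most 10 time units.
rows = simulate_weibull_ph(n=500, lam=0.1, gamma=1.5, beta=0.5, max_time=10.0, seed=1)
n_events = sum(event for _, event, _ in rows)
```

Fixing the seed makes the simulated dataset reproducible, which matters when several competing models are compared on the same data.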

The simsurv R package is used to generate simulated data to compare the performance of existing and proposed PLS-based models. The simulated dataset is generated from the Weibull distribution for scale parameter

The 10-year censored survival time datasets of breast cancer patients used in this study contain the seven-gene signature innovated by [

The PLS-FP models parameterised with the gamma, Weibull, log-logistic, and log-normal distributions are fitted to a simulated dataset generated from the Weibull distribution with 100 correlated covariates and compared with PLS-Cox, giving five models in total. The results support the use of the proposed models over the traditional PLS-Cox model for survival time response in the presence of collinear covariates and extreme observations. The model performances based on AIC and BIC for simulated data generated from the Weibull distribution are presented in Figure

The model comparison of standard PLS-Cox regression model with PLS-FPM parameterised over gamma, Weibull, log-logistic, and log-normal distribution by using simulated survival times data generated from Weibull distribution is presented.
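The information criteria used for this comparison follow the usual definitions, AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of estimated parameters and n the sample size. The sketch below shows the calculation; the model labels and log-likelihood values are made-up placeholders for illustration only and are not results from this study, and taking n as the number of subjects is an assumption.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: penalises parameters by ln(n_obs)."""
    return n_params * math.log(n_obs) - 2 * loglik

# Placeholder (log-likelihood, number of parameters) pairs, purely hypothetical.
candidates = {"model_a": (-250.0, 3), "model_b": (-249.5, 5)}
n_obs = 200

scores = {
    name: {"AIC": aic(ll, k), "BIC": bic(ll, k, n_obs)}
    for name, (ll, k) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name]["AIC"])
best_by_bic = min(scores, key=lambda name: scores[name]["BIC"])
```

In this toy case the slightly higher log-likelihood of the larger model does not offset its two extra parameters, so both criteria prefer the smaller model.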

Before analyzing the real datasets, multicollinearity among covariates and the presence of extreme observations are verified. For this purpose, correlations among covariates are plotted for all breast cancer datasets. The correlation maps for breast cancer datasets presented in Figures

The circles of correlations for the breast cancer datasets mainz7g and transbig7g are presented.

The circles of correlations for the breast cancer datasets vdx7g and upp7g are presented.
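A numerical counterpart to the correlation plots is to compute pairwise Pearson correlations and flag pairs above a chosen threshold. The sketch below is only an illustration: the column names `g1`, `g2`, `g3`, the toy values, and the 0.8 threshold are hypothetical and do not come from the breast cancer datasets.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def collinear_pairs(columns, threshold=0.8):
    """Return (name_1, name_2, r) for every pair with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            r = pearson(columns[p], columns[q])
            if abs(r) >= threshold:
                flagged.append((p, q, r))
    return flagged

# Toy covariates: g2 is almost a rescaled copy of g1, g3 is only weakly related.
columns = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = collinear_pairs(columns, threshold=0.8)
```

Pairs that are flagged this way are the ones for which coefficient estimates in an ordinary regression would be unstable, which is the situation the PLS step is meant to handle.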

The correlation maps provide evidence of multicollinearity. Moreover, to identify extreme observations, starburst graphs (also called bagplots) are plotted and presented in Figures

The starburst plots for the breast cancer datasets mainz7g and transbig7g are presented.

The starburst plots for the breast cancer datasets vdx7g and upp7g are presented.

The model comparison of standard PLS-Cox regression model with PLS-FPM parameterised over gamma, Weibull, log-logistic, and log-normal distribution using breast cancer datasets, namely, mainz7g and transbig7g, is presented.

The model comparison of standard PLS-Cox regression model with PLS-FPM parameterised over gamma, Weibull, log-logistic, and log-normal distribution using breast cancer datasets, namely, vdx7g and upp7g, is presented.

Figure

Figure

The estimates of the baseline cumulative hazard from the standard PLS-Cox regression model and PLS-FPM parameterised over the gamma, Weibull, log-logistic, and log-normal distributions for the breast cancer datasets mainz7g and transbig7g.

The estimates of the baseline cumulative hazard from the standard PLS-Cox regression model and PLS-FPM parameterised over the gamma, Weibull, log-logistic, and log-normal distributions for the breast cancer datasets vdx7g and upp7g.

All seven genes are found to be significantly associated with breast cancer in six datasets. The parameters of all genes are estimated for each survival dataset and presented in Table

PLS-FPM regression coefficients are presented, where influential factors are extracted by the generalized F distribution.

Coefficient estimates by dataset:

Genes | mainz7g | transbig7g | upp7g | unt7g | vdx7g | nki7g |
---|---|---|---|---|---|---|
PLAU | 0.65 | 0.57 | 0.37 | 0.81 | −0.23 | −0.27 |
CASP3 | −0.54 | 0.22 | −0.48 | −0.05 | −0.59 | 0.69 |
VEGFA | −0.39 | −0.18 | −0.18 | −0.28 | −0.15 | 0.34 |
STAT1 | 0.35 | 0.57 | −0.25 | 0.12 | 0.10 | −0.39 |
ESR1 | 0.06 | 0.37 | 0.18 | −0.31 | 0.137 | 0.11 |
AURKA | −0.01 | −0.35 | 0.71 | −0.24 | 0.07 | 0.09 |
ERBB2 | −0.01 | −0.17 | −0.04 | −0.31 | −0.74 | 0.40 |

Efficient model selection with robust estimates remains a challenging and computationally intensive task, especially for survival systems in the presence of extreme observations. The number of candidate models evaluated and compared in a study is therefore usually limited. Moreover, nonparametric and semiparametric survival methods can misspecify the model structure because they do not assume a specific probability distribution. The PLS-Cox regression model is used to deal with multicollinearity among covariates in survival time analysis. The new approach is proposed mainly to address two disadvantages of Cox regression. Firstly, Cox regression is a semiparametric approach; hence, it produces unsmooth estimates, which limits long-term evaluations. Secondly, the standard Cox regression model is not robust to extreme values. In this article, PLS-FPM is proposed as a fully parametric survival technique to examine the hazard function and estimate the parameters efficiently. PLS-FPM was specifically designed to address survival systems in the presence of multicollinearity by using various distributions to produce smooth estimates that extrapolate the survival model. Since this approach is flexible enough to accommodate different distributions, it can produce a robust model in the presence of extreme observations when a suitable probability distribution is used.

Overall, PLS-FPM compares favorably with the benchmark method on both simulated and real datasets in the presence of multicollinearity and extreme observations. On simulated data generated from the Weibull distribution, PLS-FPM with the Weibull distribution is the best model for estimating cumulative hazards according to AIC and BIC. More generally, on simulated survival data the fully parametric PLS-FPM performed better than the semiparametric PLS-Cox regression model. The optimal model for each real dataset shows that all seven genes are significantly associated with survival in each breast cancer dataset. Tumor cell proliferation is found to be one of the most significant predictors of breast cancer survival. Various previous studies investigated proliferation in tumor cells and found it to be a significant factor in breast cancer [

The overall accuracy of these algorithms improves model performance to a greater extent when collinear covariates and extreme observations are taken into account. This efficiency suggests that the survival function, hazard function, cumulative hazard function, and distribution parameters for survival time data with an unknown distribution can be estimated more efficiently as smooth curves.

PLS-FPM not only extrapolates survival outcomes beyond the available follow-up data but also supports a wide range of hazard shapes, including monotonically increasing, monotonically decreasing, arc-shaped, and bathtub-shaped hazards. In short, PLS-FPM is a useful fully parametric addition to the toolbox for robust estimation and prediction of survival time, complementing the widely used PLS-Cox model in survival settings.

The breast cancer datasets are freely available in an

The authors declare that they have no conflicts of interest.

Maryam Sadiq and Tahir Mehmood contributed equally to this work.