Traditional variable selection methods for survival data rely on iterative procedures, and controlling these procedures requires tuning parameters whose choice is problematic and time-consuming, especially when the models are complex and involve a large number of risk factors. In this paper, we propose a new method based on global sensitivity analysis (GSA) to select the most influential risk factors. The method simplifies the logistic regression model by excluding irrelevant risk factors, which eliminates the need to fit and evaluate a large number of models. We use data from medical trials to test the efficiency and capability of the method and to show how it simplifies the model and leads to an appropriate final model. The proposed method also ranks the risk factors according to their importance.

Sensitivity analysis
(SA) plays a central role in a variety of statistical methodologies, including
classification and discrimination, calibration, comparison, and model selection
[

A considerable
number of methods of variable selection have been proposed in the literature.
The fundamental developments are squarely in the context of normal regression
models and particularly in the context of multivariate linear regression models
[

New methods of variable selection, such as

This study aims to use SA to develop an effective, efficient, and
time-saving variable selection method for survival regression models, in
which the best subsets are identified according to specified criteria without
fitting all possible subset regression models. The
remainder of this study is organized as follows: Section

Often the
response variable in clinical data is not a numerical value but a binary one (e.g.,
alive or dead, diseased or not diseased). In that case, a binary
logistic regression model is an appropriate way to describe the relationship
between the disease measurement and its risk factors. It is a form of
regression used when the response variable (the disease measurement) is
dichotomous and the risk factors of the disease may be of any type [

The first step
in modeling binomial data is a transformation of the probability scale from
range

Thus, the
appropriate link is the log odds transformation (the logit). Then if there are

The mechanics of
maximum likelihood (ML) estimation and model fitting for the logistic regression
model are a special case of GLM fitting; fitting the model therefore requires
estimating the unknown parameters
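To make the logit link and the ML fit concrete, the following is a minimal sketch rather than the fitting routine used in this study. It assumes a single risk factor and a small made-up dataset (`x`, `y`), and the function names `logit`, `inv_logit`, and `fit_logistic` are hypothetical. The estimates are obtained with Newton-Raphson steps, which is the usual way GLM likelihoods are maximized.

```python
import math

def logit(p):
    """Log odds: maps a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of the logit: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson ML fit of P(y = 1) = inv_logit(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [inv_logit(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix entries, with weights w = p * (1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step with the 2x2 inverse written out explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up illustrative data (assumption): one risk factor, binary outcome.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_logistic(x, y)
fitted = [inv_logit(b0 + b1 * xi) for xi in x]
print(b0, b1)
```

At the ML solution of a model with an intercept, the fitted probabilities sum to the observed number of events, which gives a quick convergence check.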

A simple model
that fits adequately has the advantage of model parsimony. If a model has
relatively little bias and describes reality well, it tends to provide more
accurate estimates of the quantities of interest. Agresti [

There are two key problems in the variable selection procedure: (i) how to select an appropriate number of risk factors from the full set of risk factors, and (ii) how to improve the performance of the final model based on the given data. The objective of our proposed method is to address both questions by applying GSA to select the influential risk factors in the logistic regression model.

GSA was defined in [

In this study, partitioning the total
variance of the objective function

(1) The first step is identification of the probability
distribution

Common shapes of three types of probability distribution.

A visual
approach is not always easy, accurate, or valid, especially if the sample size
is small. It is therefore better to have a more formal procedure for deciding
which distribution is “best.” A number of significance tests are available for this purpose, such as the
Kolmogorov-Smirnov and chi-square tests. For
more details, see [
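As an illustration of such a formal comparison, the sketch below computes the Kolmogorov-Smirnov statistic of one sample against two candidate distributions and keeps the candidate with the smaller statistic. Everything in it is an assumption made for illustration: the sample is simulated, the candidates (normal and uniform) are arbitrary, and the helper names are hypothetical. Because the candidate parameters are estimated from the same sample, the standard Kolmogorov-Smirnov critical values would not apply directly, so the statistic is used here only to rank the candidates.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def uniform_cdf(x, lo, hi):
    """CDF of a uniform distribution on [lo, hi]."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

def ks_statistic(sample, cdf):
    """Largest vertical gap between the empirical CDF and a candidate CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        c = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - c, c - i / n)
    return d

# Hypothetical risk factor: 500 simulated values (not data from this study).
rng = random.Random(0)
sample = [rng.gauss(50.0, 10.0) for _ in range(500)]

# Candidate parameters estimated from the sample itself.
mu = sum(sample) / len(sample)
sigma = math.sqrt(sum((v - mu) ** 2 for v in sample) / (len(sample) - 1))
lo, hi = min(sample), max(sample)

ks = {
    "normal": ks_statistic(sample, lambda v: normal_cdf(v, mu, sigma)),
    "uniform": ks_statistic(sample, lambda v: uniform_cdf(v, lo, hi)),
}
best_fit = min(ks, key=ks.get)
print(ks, best_fit)
```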

(2) In the second step, the logistic regression model
as in (

(3) These results from step two will be used in
performing GSA in the binary logistic regression model using (
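The three steps can be sketched in code under explicit assumptions. The estimators below are the widely used Monte Carlo ("pick-freeze") estimators of the first-order and total Sobol indices, swapped in for illustration because the equations of this section did not survive in the text. The coefficients in `model`, the uniform distribution of the factors, and the function names are all hypothetical.

```python
import math
import random

def model(x, beta=(-1.5, 2.0, 1.0, 0.0)):
    """Hypothetical logistic model: event probability for three factors."""
    eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-eta))

def sobol_indices(f, k, n=20000, seed=1):
    """Monte Carlo first-order and total indices for k independent factors."""
    rng = random.Random(seed)
    # Step 1 (assumed here): every factor is uniform on [0, 1].
    A = [[rng.random() for _ in range(k)] for _ in range(n)]
    B = [[rng.random() for _ in range(k)] for _ in range(n)]
    # Step 2: evaluate the fitted model on both sample matrices.
    fA = [f(row) for row in A]
    fB = [f(row) for row in B]
    mean = sum(fA + fB) / (2 * n)
    var = sum((v - mean) ** 2 for v in fA + fB) / (2 * n - 1)
    first, total = [], []
    for i in range(k):
        # Rows of A with column i taken from B.
        fAB = [f(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # Step 3: partial variances for the main and total effects.
        v_i = sum(fb * (fab - fa) for fa, fb, fab in zip(fA, fB, fAB)) / n
        vt_i = sum((fa - fab) ** 2 for fa, fab in zip(fA, fAB)) / (2 * n)
        first.append(v_i / var)
        total.append(vt_i / var)
    return first, total

first, total = sobol_indices(model, k=3)
print(first, total)
```

In this toy model the third factor has a zero coefficient, so both of its indices are exactly zero and it would be the first candidate for exclusion.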

The purpose of this section is to compare the
performance of the proposed method with existing ones. We also use a real data
example to illustrate our SA approach as a variable selection method. In the
first example in this section, we used the dataset and the results of the
penalized likelihood estimates of best subset (AIC), best subset (BIC), SCAD,
and LASSO that were computed by [

In this example, Fan and Li [

Estimated coefficients and standard errors for different variable selection methods.

Methods | MLE | Best subset AIC | Best subset BIC | SCAD | LASSO | SA |
---|---|---|---|---|---|---|

Factors | ||||||

Intercept | 5.51 (0.75) | 4.81 (0.45) | 6.12 (0.57) | 6.09 (0.29) | 3.70 (0.25) | Constant |

0 (—) | ||||||

2.30 (2.00) | 0 (—) | 0 (—) | 0 (—) | 0 (—) | 0.014 (0.125) | |

0 (—) | 0 (—) | |||||

0.30 (0.11) | 0 (—) | 0.003 (0.034) | ||||

0 (—) | 0 (—) | 0.013 (0.057) | ||||

0 (—) | 0 (—) | 0.032 (0.091) | ||||

0.03 (0.34) | 0 (—) | 0 (—) | 0 (—) | 0 (—) | 0.014 (0.237) | |

7.46 (2.34) | 5.69 (1.29) | 9.83 (1.63) | 9.84 (0.14) | 0.36 (0.22) | ||

0.24 (0.32) | 0 (—) | 0 (—) | 0 (—) | 0 (—) | 0.001 (0.042) | |

0 (—) | 0 (—) | 0 (—) | 0.016 (0.075) | |||

0 (—) | 0 (—) | 0 (—) | 0 (—) | 0.003 (0.047) | ||

1.23 (1.21) | 0 (—) | 0 (—) | 0 (—) | 0 (—) | 0.019 (0.307) |

In addition to
GSA indices, Table

A new dataset was derived from the original
dataset prepared in [

CHD (

Diabetes (Diab,

Total
cholesterol (Chol,

High density lipoprotein
(HDL,

Age

Gender (Gen,

Body mass index (BMI, ^{2}, and the participant gets 1 if BMI is

Blood pressure
(hypertension, Hyp,

Waist/hip ratio

Implementation
of the GSA method for this dataset gave the results in Table

Sensitivity indices and risk factors ranking.

Factors | Ranks | ||||
---|---|---|---|---|---|

Diab | 0.2018 | 0.22657 | 0.008 | 12 | |

Chol | 0.2434 | 0.258 | 0.015 | 14 | |

HDL | 0.2243 | 0.25424 | 0.03 | 13 | |

Age | 0.2636 | 0.28507 | 0.022 | 15 | |

Gen | 0.0256 | 0.03844 | 0.013 | 1 | |

BMI | 0.5161 | 0.56173 | 0.046 | 30 | |

Hypt | 0.003 | 0.04207 | 0.039 | 0 | |

W/H | 0.2706 | 0.29714 | 0.027 | 15 | |

Sum. | 1.7484 | 1.96326 | 0.2 |
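The last column of the table can be reproduced from the first numeric column, assuming that the first numeric column holds the first-order indices (its header did not survive in the text). The sketch below normalizes those values by their sum and orders the factors from most to least influential. The dictionary name `first_order` is hypothetical; the numbers are copied from the table.

```python
# First numeric column of the table, read as first-order indices (assumption).
first_order = {
    "Diab": 0.2018, "Chol": 0.2434, "HDL": 0.2243, "Age": 0.2636,
    "Gen": 0.0256, "BMI": 0.5161, "Hypt": 0.003, "W/H": 0.2706,
}

index_sum = sum(first_order.values())

# Share of each factor in the summed indices, as a rounded percentage.
share = {name: round(100 * s / index_sum) for name, s in first_order.items()}

# Factors ordered from most to least influential.
ranking = sorted(first_order, key=first_order.get, reverse=True)

print(round(index_sum, 4))
print(share)
print(ranking)
```

Under this reading, BMI accounts for about 30% of the summed indices, while Gen and Hypt contribute about 1% and 0%.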

According to the
first-order sensitivity indices

Sensitivity indices: the main effect

Does the
proposed method yield a reliable model? To investigate the reliability of the
proposed method, we compared the results of the fitted models. Basically, when
the full logistic regression model is fitted, the results are

The efficiency
of the proposed method of variable selection (GSA) can be measured by comparing
its results as in (

The overall fitting criteria for the BEM for a logistic regression model.

Step | | | Df | Sig. | | df | Sig. | Nag. |
---|---|---|---|---|---|---|---|---|

6 | 357.813 | 7.268 | 3 | 0.064 | 8.465 | 8 | 0.389 | 0.30 |

7 | 359.021 | 6.061 | 2 | 0.048 | 0.055 | 2 | 0.973 | 0.25 |

8 | 360.189 | 4.892 | 1 | 0.027 | — | — | — | 0.20 |
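The significance values in this table can be checked from the statistics and their degrees of freedom. The sketch below assumes that the third and sixth columns are chi-square statistics (their headers did not survive in the text) and that the second column is a -2 log-likelihood. The function `chi2_sf` is a hypothetical helper that evaluates the chi-square upper-tail probability in closed form for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        term, acc = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            acc += term
        return math.exp(-half) * acc
    # Odd df: complementary error function plus a finite series.
    acc = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        acc += term
        term *= x / (2 * j + 1)
    return acc

# (statistic, df, reported Sig.) triples copied from the table.
model_tests = [(7.268, 3, 0.064), (6.061, 2, 0.048), (4.892, 1, 0.027)]
fit_tests = [(8.465, 8, 0.389), (0.055, 2, 0.973)]

for stat, df, reported in model_tests + fit_tests:
    print(stat, df, reported, round(chi2_sf(stat, df), 3))

# Assumed reading: second column plus third column gives the same
# baseline -2 log-likelihood at every step.
baseline = [357.813 + 7.268, 359.021 + 6.061, 360.189 + 4.892]
print(baseline)
```

Under these assumptions the computed tail probabilities agree with the reported values to three decimals, and the three baseline sums agree to within rounding (about 365.08).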

The estimated parameters and their significance for a logistic regression model using BEM.

Steps | Risk factors | Sig. ( | |
---|---|---|---|

Step 6 | CHOL | 0.538 | 0.061 |

SAGE | 0.151 | 0.271 | |

BMI | 0.241 | ||

Constant | 0.000 | ||

Step 7 | CHOL | 0.610 | 0.029 |

BMI | 0.276 | ||

Constant | 0.000 | ||

Step 8 | CHOL | 0.605 | 0.030 |

Constant | 0.000 |

Table

Also Table

The results in
Tables

This work was supported by a USM fellowship.