Sensitivity Analysis to Select theMost Influential Risk Factors in a Logistic RegressionModel

The traditional variable selection methods for survival data depend on iteration procedures, and control of this process assumes tuning parameters that are problematic and time consuming, especially if the models are complex and have a large number of risk factors. In this paper, we propose a new method based on the global sensitivity analysis (GSA) to select the most influential risk factors. This contributes to simplification of the logistic regression model by excluding the irrelevant risk factors, thus eliminating the need to fit and evaluate a large number of models. Data from medical trials are suggested as a way to test the efficiency and capability of this method and as a way to simplify the model. This leads to construction of an appropriate model. The proposed method ranks the risk factors according to their importance.


Introduction
Sensitivity analysis (SA) plays a central role in a variety of statistical methodologies, including classification and discrimination, calibration, comparison, and model selection [1].SA also can be used to determine which subset of input factors (if any) accounts for most of the output variance (and in what percentage); those factors with a small percentage can be fixed to any value within their range [2].In such usage, the focus is on determination of the important variables to simplification of the model; the original motivation for our research lay in a search for how to best arrive at such a determination.Although SA has been widely used in normal regression models to extract important input variables from a complex model so as to arrive at a reduced model with equivalent predictive power, it has limited use for selection of risk factors despite the presence of a large number of risk factors in survival regression models.The limited use of these methods to select appropriate subsets in survival regression models illustrates the desirability of development of a new method of SA-based variable selection to avoid the drawbacks of traditional methods and also simplify survival regression models by choosing the appropriate subsets of risk factors.
A considerable number of methods of variable selection have been proposed in the literature.The fundamental developments are squarely in the context of normal regression models and particularly in the context of multivariate linear regression models [3].A comprehensive review of many variable selection methods is represented in [4].Methods such as forward, backward, and stepwise selection and subset selection (Akaike information criterion (AIC)) and Bayesian information criterion (BIC)) are available; however none of these methods can be recommended for use in either a logistic regression model or in other survival regression models.They give incorrect estimates of the standard errors and P-values.They also can delete variables whose inclusion is critical [3].In addition, these methods regard all the risk factors of a situation as equal, and they seek to identify the candidate subset of variables sequentially; furthermore, most of these methods focus on the main effects and ignore higherorder effects (interactions of variables).
New methods of variable selection, such as least absolute shrinkage and selection operator (LASSO) in [5], and the smoothly clipped absolute deviation (SCAD) method in [6], are at the center of attention recently in the field of survival regression models.These methods use the penalized likelihood estimation and the shrinkage regression approaches.
International Journal of Quality, Statistics, and Reliability These two approaches differ from traditional methods in their deletion of the nonsignificant covariates in the model by estimating their effects as 0. A nice feature of these methods is that they perform estimation and variable selection simultaneously, but, nevertheless, these methods suffer from some calculation and characteristics problems that are dealt with in more detail in [7,8].
This study aims to use SA to extend and develop an effective, efficient, and time-saving variable selection method in which the best subsets are identified according to specified criteria without resorting to fitting all the possible subset regression models in the field of survival regression models.The remainder of this study is organized as follows: Section 2 gives the background of building a logistic regression model, and Section 3 deals with the proposed method.The results of implementing this method and logistic regression model are the subject of Section 4, and Section 5 consists of the discussion and conclusions.

Background of Constructing a Logistic Regression Model
Often the response variable in clinical data is not a numerical value but a binary one (e.g., alive or dead, diseased or not diseased).When the latter occurs, a binary logistic regression model is an appropriate method to present the relationship between the disease's measurements and its risk factors.It is a form of regression used when the response variable (the disease measurement) is a dichotomy and the risk factors of the disease are of any type [9].A logistic regression model neither assumes the linearity in the relationship between the risk factors and the response variable, nor does it require normally distributed variables.It also does not assume homoscedasticity, and in general has less stringent requirements than linear regression models.However, it does require that observations are independent and that the independent risk factors are linearly related to the logit of the response variable [10].However, models involving the association between risk factors and binary response variables are found in various disciplines such as medicine, engineering, and the natural sciences.How do we model the relationship between risk factors and binary response variable?The answer to this question is the subject of the next subsections.

Constructing a Logistic Regression Model
The first step in modeling binomial data is a transformation of the probability scale from range (0, 1) to (−∞, ∞) instead of using the linear model for the response variable of the probability of success on risk factors.The logistic transformation or logit of the probability of success (π) is log {π/(1 − π)}, which is written as logit (π) and defined as the log odds of success.It is easily seen that any value of (π) in the range (0, 1) corresponds to the value of logit (π) in (−∞, +∞).Usually, binary data results from a nonlinear relationship between {π(x)} and (x), where a fixed change in as a function of the value of probability (π), the logit ranges from (−∞) to (+∞) as probability ranges from (0) to (1).The logit = 0 when probability = 0.5.
(x) has less impact when {π(x)} is near (0 or 1) than when {π(x)} is near (0.5).This is illustrated in Figure 1 (see [11]).Thus, the appropriate link is the log odds transformation (the logit).Then if there are n binomial observations of the form π i = y i /n i for i = 1, 2, . . ., n, where the expected value of the random variable associated with ith observation, y i , is E(Y i ) = n i π i .The logistic regression model for association of π i on the values x 1i , x 2i , . . ., x ki of k risk factors X 1 , X 2 , . . ., X k is such that [10] Logit and the equation of success probability is The linear logistic model is a member of a family of generalized linear models (GLM).The next subsection explains this model fitting process.

Fitting Logistic Regression Models
The mechanics of maximum likelihood (ML) estimation and model fitting for logistic regression model are a special case of GLM fitting, and then fitting the model requires estimation of the unknown parameters (β j ) of the ML function of this model using the Bernoulli ML as in the following [12]: The problem now is to obtain those values ( β 0 , β 1 , . . ., β k ) that maximize {L(β)} or its equivalent {Log L(β)} where it is as follows: Log L(β) where {η i = k i=0 β j x ji } and (x 0i = 1) represent all values of (i).The derivative of this log-likelihood function with respect to the (k + 1) unknown β-parameters is given by n i x ji e ηi 1 + e ηi −1 , j = 0, 1, 2, . . ., k. ( Then the likelihood equations are where π i = e ηi (1 + e ηi ) −1 is the ML estimate of {π i }.There are two methods to solve (6) and obtain the maximum likelihood estimation of ( β ).The one most often used is known as the Newton-Raphson method.This method begins with determination of the score matrix {U(β)} and the information matrix {I(β)} as in the following [13]: Here (π (t) ) is obtained from {β (t) } through (2), then we use {U (t) } and {I (t) } with the following formula {β (t+1) = β (t) − (I (t) ) −1 U (t) } to obtain the next value (β (t+1) ) as where {μ (t) i = n i π (t)  i }, this is to obtain (π (t+1) ), and so on.

Evaluating the Fitted Model
A simple model that fits adequately has the advantage of model parsimony.If a model has relatively little bias and describes reality well, it tends to provide more accurate estimates of the quantities of interest.Agresti [9] stated that "we are mistaken if we think that we have found the true model, because any model is a simplification of reality."In light of this assertion, what then is the logic of testing the fit of a model when we know that it does not truly hold?The answer lies in the evaluation of the specific properties of this model by using criteria such as deviance, the R 2 , the Wald Score test, the Person chi-square, and the Hosmer-Lemshow chi-square tests; for more details, see [9,10].Usually the first stage of construction of any model presents a large number of risk factors.Inclusion of all of them may lead to an unattractive model from a statistical viewpoint.Thus, as an important step towards an acceptable model, a decision should be made early about the proper methodology to use to select the appropriate and important risk factors.Because traditional methods of selecting variables have many limitations in their applicability to survival regression models, a new method of variables selection will be developed by using GSA to select the most influential factors in the model.This is the subject of the following section.

Sensitivity Analysis to Select the Most Influencing Risk Factors
There are two key problems in variable selection procedure: (i) how to select an appropriate number of risk factors from the set of risk factors, and (ii) how to improve final model performance based on the given data.So answering these questions is the objective of our proposed method by applying GSA to select the influential risk factors in the logistic regression model.

General Concept of GSA
GSA was defined in [14] as "the study of how the uncertainty in the output of a model (numerical or otherwise) can be apportioned to different sources of uncertainty in the model input."Hence one possible way to apportion the importance of the input factors with respect to the model response is to apply GSA.In general the importance of a given risk factor X i can be measured via the so-called sensitivity index, which is defined as the fractional contribution to the model output variance because of the uncertainty in X i .For k risk factors, the sensitivity indices can be computed using the following decomposition formula for the total output variance V (Y ) of the output Y [15]: where where V (Y ) is the unconditional variance of output of the model (incidence of CHD), V (X i ) is the conditional variance of risk factor X i , and V (X i , X j ) is the variance of interaction between X i and X j , and so on.A first measurement of the fraction of the unconditional output variance V (Y ) that is accounted for by the uncertainty in X i , which is the firstorder sensitivity index (S i ) for the factor X i is given as The second terms in (9) are known as the effect of interactions.It is a fact that the number and importance of the interaction terms usually grow (i) with the number of risk factors k, and (ii) with the range of variation of the risk factors [16].This means that if all of the V (X i ) terms are computed, then most likely k i=1 V (X i ) would still be lower than the total ) is a measure of the impact of the interactions.Consequently, when k i=1 S i = 1, then the model is additive (i.e., without interactions among its input factors), and thus the first order of conditional variances of ( 10) are all we need to decompose the model output variance.For a nonadditive model, higher-order sensitivity indices account for interaction effects among sets of input factors.However, higher-order sensitivity indices are usually not estimated directly because if the model consists of k risk factors, then the total number of indices (including the S i 's) that should be estimated is as high as 2 k − 1.For this reason, a more compact sensitivity measurement is used; this measurement is the total effect sensitivity (S T ) index, which concentrates in one single term on all the interactions involving a given factor X i .For example, for a model of k = 3 risk factors, the three total sensitivity indices would be [2] and analogously where the conditional variance in (12) expresses the total contribution to the variance of Y because of non-X i , (i.e., to the k − 1 remaining factors), so that includes all terms (i.e., a first order as well as interactions in ( 9)) that involve risk factor X i .For a given risk factor X i , the coefficient of importance (IC i ) is the difference between S Ti and S i that reflects an important role of interactions for that risk factor in Y, Explaining the interactions among risk factors helps us to improve our understanding about the model structure.Estimators for both (S i , S Ti ) are provided by a variety of methods such as Sobol, the Fourier amplitude sensitivity test (FAST), and others; for more details, see [17].

GSA in a Logistic Regression Model
In this study, partitioning the total variance of the objective function V (Y ) is the way to estimate S i and S Ti so as to perform a GSA.How can this model be extended to deal with a binary response variable?Although partitioning of variances is uncomplicated in models with a continuous response variable and a normal error distribution, the extension of this partitioning to models with binary responses is not simple [18].Consequently, to extend the variance partitioning method to our binary response variable (incident of coronary heart disease (CHD)), suppose that the data is consisting of y i , the number of people who have CHD.The actual response probability of the incidence of CHD for the ith observation π i will have a Bernoulli distribution with a mean of p i , where p i is the proportion of the patients who have a disease.This response probability is therefore a random variable where E(π i ) = p i .The variance of π i must be equal to zero when p i is zero or unity, and then a relationship between the unknown probability and our risk factors can be fitted.Typically a logistic regression model represents this relationship between y i for a sample with n people who have a binomial distribution (i.e., {Y i ∼ B(n, π i )}, i = 1, 2, . . ., n), and the corresponding response probability of the incidence of the disease is π i = y i /n for ith observation and the k risk factors X 1 , X 2 , . . ., X k as in ( 4) and ( 5).This model assumes independence between the n observations, and then all the variations conditional on the estimates of the probabilities will be binomial with equal variance: The binomial is not the only possible distribution for fitting proportion data.Other distributions exist that have greater variation (known as overdispersion) or less variation (known as underdispersion) than the binomial distribution conditional on the values of π i 's.The simplest function for the true probability of the ith observation uses a multiplicative scale factor to determine the variance of the response as where r is a scale factor that is equal to 1.If we have a binomial variation, it will be greater than 1 if there is overdispersion and less than 1 if there is underdispersion, and π i is an unobservable random variable [19].The advantages of the multiplicative approach are that it will allow both over-and underdispersions.The random variable Y i is associated with the observed number of incidences of the disease for the ith unit, y i .It will have a binomial distribution, and then the mean of Y i , conditionalon π i , is and the conditional variance of Y i is Since π i cannot be calculated, then the observed proportion of the disease incidence p i has to be an estimate of π i as According to a standard result from the conditional probability theory, the unconditional expected value of a random variable Y can be obtained from the conditional expectation of Y given X using the equation and the unconditional variance of Y is given by [20] V Applying these two results on our response variable gives now also and so in the absence of random variation in the response probability, Y i would have a binomial distribution, B(n, p i ), and in this case when r = 1 as required, then If, on the other hand, r is greater than 1, then a variation in the response probability occurs and the variance of Y i will exceed np i (1 − p i ), the variance under binomial sampling that leads to overdispersion.But if r is less than 1, then the variation in the response probability and the variance of Y i will be less than np i (1 − p i ), the variance under binomial sampling that leads to underdispersion.To use GSA to select the important covariates from the available set of covariates and construct an appropriate logistic regression model, it involves three steps.
(1) The first step is identification of the probability distribution f (x) of each covariate in the model.Usually sensitivity analysis starts from probability distribution functions (pdfs) given by the experts.This selection makes the use of the best information available of the statistical properties of the input factors.One of the methods used to obtain the pdfs starts with visualizing the observed data by examining its histogram to see if it is compatible with the shape of any distribution, as illustrated in Figure 2.
A visual approach is not always easy, accurate, or valid, especially if the sample size is small.Thus it would be better to have a more formal procedure for deciding which distribution is "best."A number of significance tests are available for this such as the Kolmogorov-Smirnoff and chisquare tests.For more details, see [21].
(2) In the second step, the logistic regression model as in (1) and the information about the covariates obtained in step one are used to create a Monte Carlo simulation to generate the sample that will be used in the decomposition and to estimate the unconditional variance of response probability and the conditional variation for covariates as in (23) to (26).
(3) These results from step two will be used in performing GSA in the binary logistic regression model using (11), and in the result of decomposing as in ( 24) and (26), where the main effect indices are and the total effect indices are where X −i are all X's but X i , and the coefficients of importance are These results and the two datasets are used to test and compare the performance of the proposed GSA method as

Numerical Comparisons
The purpose of this section is to compare the performance of the proposed method with existing ones.We also use a real data example to illustrate our SA approach as a variable selection method.In the first examples in this section, we used the dataset and the results of the penalized likelihood estimate of best subset (AIC), bust subset (BIC), SCAD, and LASSO that were computed by [7] as a way to compare the performance of the proposed method with these methods.

The First Example
In this example, Fan and Li [7] applied the proposed penalized likelihood methodology to burn data collected by the General Hospital Burn Center at the University of Southern California.The dataset consists of 981 observations.The binary response variable Y is 1 for those victims who survived their burns and 0 otherwise.Risk factors are X 1 = age, X 2 = sex, X 3 = log (burn area + 1), and binary variable X 4 = oxygen (0 normal, 1 abnormal) was considered.Quadratic terms of X 1 and X 3 , and all interaction terms were included.The intercept term was added, and the logistic regression model was fitted.The best subset variable selection with the AIC and the BIC was applied to this dataset.The unknown parameter λ was chosen by generalized cross-validation: it is 0.6932 and 0.0015, respectively, for the penalized likelihood estimates with the SCAD and LASSO.The constant a in the SCAD was taken as 3.7.With the selected λ, the penalized likelihood estimator was obtained at the sixth, 28th, and fifth step iterations for the penalized likelihood with the SCAD and LASSO.Table 1 contains the estimated coefficients and standard errors for the transformed data, based on the penalized likelihood estimators, and the calculation of the sensitivity indices obtained by using SimLab software to compare the performance of GSA as a variable selection method with other methods.The first five columns were calculated by [7].
In addition to GSA indices, Table 1 consists of the results of two traditional methods of variable selection (AIC and BIC) and two new methods (LASSO and SCAD).The traditional method, best subset procedure via minimizing the BIC scores, chooses five of 13 risk factors, whereas the SCAD chooses four risk factors.The difference between them is that the best subset keeps X 4 .Neither SCAD nor the best subset variable selection (BIC) includes X 2  1 and X 2 3 in the selected subset, but both LASSO and the best subset variable selection (AIC) included them.LASSO chooses the quadratic terms of X 1 and X 3 rather than their linear terms.It also selects an interaction term X 2 X 3 , which may not be statistically significant.LASSO shrinks noticeably large coefficients.The last column in Table 1 shows that GSA selected the variables X 1 , X 3 , and X 1 X 3 , in addition to the intercept, which resembles the SCAD method, and differs from the other methods.According to the results in the last column of Table 1, the risk factors can be ranked according to sensitivity indices S i and S Ti .Age (X 1 ) is the first and the most influential risk factor, with a percent of contribution of 0.487, and the second most important risk factor is the interaction between X 1 and X 3 , with a percentage of contribution of 0.362.The third influential risk factor is the log (area of burn + 1) (X 3 ) with a percentage of contribution of 0.143 as shown in Table 1.Consequently, we find that the proposed GSA variable selection method resembles SCAD in choosing the same risk factors.

The Second Example
A new dataset emerges from the original dataset prepared in [22] as a way to compare SA and the traditional method (backward elimination) as variable selection methods.Originally this study was undertaken to determine the prevalence of risk factors among a population-based sample of 403 rural African-Americans in Virginia.Community-based screening evaluations included the determination of exercise and smoking habits, blood pressure, height, weight, total and high-density lipoprotein (HDL) cholesterol, and glycosylated hemoglobin, and other factors.The results of this study were presented as percentages of prevalence for most factors such as diabetes (13.6% of men, 15.6% of women), hypertension (30.9% of men, 43.1% of women), and obesity (38.7% of men, 64.7% of women), without building any models to study the relationship between CHD and its risk factors.For more details, see [8].A new dataset was generated based on the first one as a way to calculate SA indices to extract the important risk factors for CHD from among these new factors, and then implement the logistic regression model to test the performance of the proposed method as follows.
(1) CHD (Y) 10-year percentage risk is generated according to Framingham Point Scores.This risk is classified as 1 if the percentage of the risk is ≥20% and 0 otherwise [23].
(2) Diabetes (debt, X 1 ): According to the criteria published by American College of Endocrinology (ACE) & American Association of Clinical Endocrinologists (AACE) [24] the participant has diabetes 1 if the Stabilized Glucose >140 mg/dL or Glycosylated Hemoglobin >7% or both of them more than these limits, and he has no diabetes 0 otherwise.
(3) Total cholesterol (Chol, X 2 ): if a participant has total cholesterol of >200 mg/dL, he will be given a 1 and a 0 otherwise [25].
(6) Gender (Gan, X 5 ): 1 is for a male and 2 for a female.
(7) Body mass index (BMI, X 6 ): values for this standard are calculated from the following equation: BMI = height/(weight) 2 , and the participant gets 1 if BMI is >30 and a 0 otherwise [25].
(8) Blood pressure (hypertension, Hyp, X 7 ): a participant has Hyp (1) if systolic blood pressure is >140 or if diastolic blood pressure is >90 or if both of them exceed these limits and 0 otherwise [25].
(9) Waist/hip ratio (X 8 ), in addition to BMI, is a second factor in the determination of obesity.
This dataset was used to perform SA through the use of SimLab software and the partitioning variance methodology discussed in Section 3.An evaluation of the efficiency of the proposed method was performed by fitting all factors into logistic regression models so as to obtain comparisons of factors chosen by the proposed method with those selected by traditional variable selection method (backward elimination).SPSS software was used to get the results that follow from fitting logistic regression models.

The Important Risk Factors
Implementation of the GSA method for this dataset gave the results in Table 2, which shows the ranking of the risk factors in order of importance and the contribution of each one to explaining the total variance of the CHD response variable.
According to the first order of sensitivity indices S i , the BMI is the first and the most influential factor, and the waisthip ratio ranks second.Both are components of the obesity factor.Age is the third influential factor and so on through the other factors as listed in Table 2.The total sensitivity index for a given risk factor provides a measure of the overall contribution of that risk factor to the output variance, taking into account all possible interactions with the other risk factors.The difference between the total sensitivity index and the first-order sensitivity index for a given risk factor is a measure of the contribution to the output variance of the interactions involving that factor; see (12) and (13).The second column in Table 2 shows the values of S Ti , which gives the same rank as S i for the risk factors.These indices point to the simple interaction between these risk factors as illustrated in the third column in the same table.Figure 3 shows the compression between the first order S i , the total S Ti sensitivity indices, and the interactions between risk factors.

Implementing the Logistic Regression Model
These results showed the significance of the overall fit of the model according to the values of χ 2 P and χ 2 HL in spite of the low value of Nag.R 2 ; also showed that the individual effect for all risk factors is not significant, which means that H 0 cannot be rejected from the following null hypothesis: Second, application of the logistic regression model by using those risk factors that appear in Table 2 as highly ranked by the proposed method also shows that this method ranks each risk factor according to its contribution to the incidence of the CHD response variable.The question also becomes how many variables must be selected in order to apply the logistic regression model.The possibility exists that the selection procedure may tend to underfit or overfit the model by selecting too few or too many variables.In the face of such a possibility, our objective becomes to find the model that uses the least number of variables while simultaneously explaining a reasonable percentage of variance in the dependent variable relative to the percentage explained by all the variables in the full model.Thus two models may be fitted from These results showed that adding the HDL risk factor does not improve the results of the first logistic regression model, but the parameter of this risk factor is not significant when we test the following hypothesis: Note that the difference between the deviances of the two models is minor.Furthermore, the value of R 2 does not improve.Thus, according to the principle of parsimony, the first model should be considered the best model and the risk factors used to construct this model are those that are the most influential in causing CHD.Moreover, showing the different results obtained from these two models demonstrates the differences between fitting the full model with all risk factors and fitting it with only selected risk factors.
The efficiency of the proposed method of variable selection (GSA) can be measured by comparing its results as in (34) with the results gained from fitting the logistic regression model by using the backward elimination method (BEM).These results are shown in Tables 3 and 4.
Table 3 shows the overall fitting criteria required for the last three steps of a logistic regression model fitted by the use of the BEM.
Also Table 4 shows the last three steps of iteration to choose the important risk factors.These results represent the sequential elimination of the factors, which requires eight steps to rank these risk factors according to their importance; however, the proposed method does not need these iterations.

Conclusions
The results in Tables 1 to 4 and (31) to (36) for the two examples confirm that the proposed method is capable of distinguishing between important and unimportant risk factors.The proposed method ranked the risk factors according to their decreasing importance as shown in Tables 1 and 2. In the example in which we compared the proposed method with those methods that are typically used, we found that its performance very much resembled the SCAD method in which the same risk factors are selected.From the first example, we found that the important risk factors are age, the area of the burns, and the interaction between them.In the second example we found that the obesity factors (BMI and W/H) are the most influential risk factor on the incidence of CHD, the second risk factor is age, and the third risk factor is the total cholesterol.These play the major roles, representing approximately 74% of the incidence of CHD.Thus, they are considered the most important risk factors according to their individual percentages of contribution in the incidence of CHD as shown in Table 1.Compression between the results of the fitting of the full logistic regression model as in (31) and the chosen models as in (34) and (36) confirm the efficiency of the proposed method in its selection of the most important risk factors.Equation (34) represents the best model, according to the model evaluation criteria, because it consists of the most influential risk factors.Therefore, a medical care plan and medical interventions should comply with this ordering of these factors.Also, to further confirm these results, one of the traditional variable selection methods was used (backward elimination method), which yields different results after eight steps, but the proposed method orders the risk factors without iteration and without the need to fit multiple regression models.Finally, these results together confirm and emphasize the importance of GSA as a variable selection method.

Figure 2 :Figure 3 :
Figure 2: Common shapes of three types of probability distribution.

Table 1 :
Estimated coefficients and standard errors for different variable selection methods.

Table 2 :
Sensitivity indices and risk factors ranking.

Table 3 :
The overall fitting criteria for the BEM for a logistic regression model.

Table 4 :
The estimated parameters and their significance for a logistic regression model using BEM.
Table 2 to compare the results.The first logistic regression model consisted of the obesity factors (BMI, and W/H ratio), age, and total cholesterol factors that explained 74% of the total variance of the CHD response variable according to the individual effect (S i ) as in Table 2.The results of fitting this model in this manner and applying SPSS software were