Logistic regression and discriminant analysis are both applied to predict the probability of a specific categorical outcome based upon several explanatory variables (predictors). The aim of this work is to evaluate the convergence of these two methods when they are applied to data from the health sciences. For this purpose, we modeled the association of several factors with the prevalence of asthma symptoms using both methods and compared the results. In conclusion, logistic regression and discriminant analysis resulted in similar models.
Logistic regression and linear discriminant analysis are multivariate statistical methods that can be used to evaluate the associations between various covariates and a categorical outcome. Both methodologies have been extensively applied in research, especially in the medical and sociological sciences. Logistic regression is a form of regression used when the dependent variable is dichotomous, discrete, or categorical, and the explanatory variables are of any kind. In the medical sciences, the outcome is usually the presence or absence of a stated condition or disease. Using the logit transformation, logistic regression predicts the probability of group membership in relation to several variables, regardless of their distribution. The analysis is based on the odds of the outcome, that is, the ratio of the probability of having the outcome to the probability of not having it. Discriminant analysis is a similar classification method that is used to determine which set of variables discriminates between two or more naturally occurring groups and to classify an observation into these known groups. To achieve this, discriminant analysis estimates orthogonal discriminant functions, that is, the linear combinations of the standardized independent predictor variables that give the greatest differences between the means of the existing groups. Thus, both discriminant analysis and logistic regression can be used to predict the probability of a specified outcome using all or a subset of the available variables.
Although the theoretical properties of both methods have been studied extensively in the literature, the choice of the proper method for a given data analysis remains an open question for the researcher. The aim of this work, after summarizing the properties of the two discriminating methods, is to explore their convergence when they are used to evaluate categorical health outcomes in pediatric epidemiological research. In particular, we tested the associations of anthropometric and lifestyle patterns with asthma prevalence among 10–12-year-old children, using both statistical methods. This should help the reader understand the differences and similarities of the two methods and choose appropriately between them.
Discriminant analysis focuses on
the association between multiple independent variables and a categorical
dependent variable by forming a composite of the independent variables. This type of multivariate analysis
can determine the extent
to which any of the composite variables discriminates between two or more pre-existing groups of
subjects and also can derive a classification model for predicting the group
membership of new observations [
The
linear discriminant function (LDF) is represented by
The principle by which
the discriminant coefficients (or weights) are selected is that they maximize the
distance between the two group means (centroids)
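In standard textbook notation (the symbols below are assumptions and need not match the notation of the original equation), a linear discriminant function for $k$ predictors can be written as a weighted sum of the predictors. For two groups, the weights that maximize the separation of the centroids relative to the within-group variability are given, up to a scaling constant, by Fisher's solution:

```latex
\mathrm{LDF} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k
             = b_0 + \mathbf{b}^{\top}\mathbf{x},
\qquad
\mathbf{b} \propto \mathbf{S}_W^{-1}\,\bigl(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2\bigr)
```

Here $\mathbf{S}_W$ denotes the pooled within-group variance-covariance matrix and $\bar{\mathbf{x}}_1$, $\bar{\mathbf{x}}_2$ the mean vectors (centroids) of the two groups.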
Logistic regression is a form of
regression which is used when we want to predict probabilities of the presence
or absence of a particular disease, characteristic, or an outcome in general
based on a set of independent or explanatory variables of any kind (continuous,
discrete, or categorical) [
If we use a probability cutoff of
Both logistic regression and linear discriminant analysis share the same functional form: a composite of the independent variables and a rule for classification. However, they differ considerably in the assumptions that must hold in order to apply them to a dataset.
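The shared structure (a linear composite followed by a classification rule) can be sketched for the logistic case as follows. This is a minimal illustration, not the model fitted in this study: the intercept, coefficients, predictor values, and function names are all hypothetical.

```python
import math

def composite_score(intercept, coefficients, values):
    """Linear composite of the predictors: b0 + b1*x1 + ... + bk*xk."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

def logistic_probability(score):
    """Inverse logit: maps a linear score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-score))

def classify(probability, cutoff=0.5):
    """Assign the positive class when the probability reaches the cutoff."""
    return 1 if probability >= cutoff else 0

# Hypothetical coefficients and predictor values, for illustration only.
b0 = -1.0
b = [0.5, -0.6]
x = [1.2, 0.3]

score = composite_score(b0, b, x)   # log-odds of the outcome
p = logistic_probability(score)     # predicted probability
label = classify(p, cutoff=0.5)     # predicted category
```

Lowering the cutoff makes the rule more lenient: the same case can be assigned to the positive class at a cutoff of 0.25 but not at 0.5.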
Regarding discriminant
analysis, the assumptions have great similarity with the assumptions made for ordinary
regression and are (i) independent variables must have a
multivariate normal distribution, thus allowing only continuous or ratio
variables to enter the analysis and excluding all the forms of categorical
variables, (ii) the variance-covariance matrix of all the independent variables
must be homogeneous among the population groups defined by the dependent variable
(an assumption that is checked with several statistics, such as Box's
Accurate estimation of
the discriminant function parameters requires a sample size of at least 20 cases
for each predictor variable and at least 20 cases for each of the dependent
variable groups; otherwise, the estimation of the coefficients is unstable and
might lead to misleading results. The dependent variable in a discriminant
analysis should be categorical, either dichotomous or polytomous. The population
groups of the dependent
variable should be mutually exclusive and exhaustive. Discriminant independent
variables are assumed to be continuous. When categorical variables are included
in the analysis, the reliability of discrimination of the analysis decreases [
Logistic regression also has many
limitations. First, logistic regression assumes that there is an
To evaluate the two methods, sensitivity, specificity, and accuracy were also measured in the same dataset. The sensitivity of a binary classification test with respect to some class measures how well the test identifies a condition: it is the proportion of true positives among all positive cases in the population. Specificity, on the other hand, is the proportion of true negatives among all negative cases in the population. Finally, accuracy is a measure of the degree of conformity of a measured or calculated quantity to the actual value. It is calculated as the proportion of true results of a binary classification test (true positives and true negatives) among all results.
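These three definitions translate directly into arithmetic on the four cells of a 2×2 classification table. The sketch below assumes hypothetical counts chosen only for illustration; they are not taken from the study data, and the function name is likewise an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from 2x2 table counts.

    tp: true positives, fn: false negatives,
    tn: true negatives, fp: false positives.
    """
    sensitivity = tp / (tp + fn)                 # true positives / all positives
    specificity = tn / (tn + fp)                 # true negatives / all negatives
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # all correct / all cases
    return sensitivity, specificity, accuracy

# Hypothetical counts, for illustration only.
sens, spec, acc = classification_metrics(tp=30, fn=10, tn=90, fp=30)
```

Because accuracy pools both groups, it is dominated by the larger group when the outcome is uncommon, which is why sensitivity and specificity are reported alongside it.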
Thus, linear discriminant analysis
and logistic regression can be used to assess the same research problems. Their
functional form is the same but they differ in the method of estimation of
their coefficients. Discriminant analysis produces a score, similar to the
production of logit of the logistic regression. Both methods with the
appropriate mathematical calculations produce the predicted probability of the
classification of a case into a group of the dependent variable, and with the
use of the appropriate cutoff value, we can also produce the predicted category
of each observation. When categorical variables measured on a discrete scale are entered in the analysis,
only those with a large number of categories (more than 5)
have a mean and variance that approximate those of continuous variables, so that they can be assumed to be normally distributed.
In that case, the assumption of normality is fulfilled, and discriminant analysis makes
robust estimations. On the contrary, logistic regression always produces robust
estimations as it makes no assumption about the distribution of the explanatory
variables, their linear relationship with the dependent variable, or the equality of the
variance within each group. So, when the
assumptions of the discriminant analysis are violated, we should always avoid
the discriminant analysis and analyze our data with logistic regression, which
gives robust results since it can handle both continuous and categorical
variables [
In the following study,
we compared the results of discriminant and logistic regression analyses in
predicting the presence of any asthma symptoms among Greek children aged 10–12 years
living in an urban environment. During 2005, 700 students (323 males and 377 females),
aged 10–12 years (4th–6th grade), were
selected from 18 schools located in several areas of Athens, randomly selected from a list of
schools provided by the regional education offices. The participation rate of
the study was 95%. In order to evaluate
asthma symptoms in the study sample, the parents completed seven questions according
to the ISAAC protocol [
In order to examine the
relationship between childhood asthma and the patterns extracted by principal component analysis (PCA), the 8 retained components
(patterns) were entered as predictor variables
in both the discriminant and logistic regression models. The assumptions for the two
models were all fulfilled
(the components, owing to their extraction method, follow the multivariate normal
distribution and are mutually independent), and the variance-covariance matrices of the groups
were equivalent—Box's
For each model, we plotted the corresponding receiver operating characteristic (ROC) curve. An ROC curve graphically displays sensitivity against 100% minus specificity (the false positive rate) at several cutoff points. By plotting the ROC curves for two models on the same axes, one can determine which model classifies better, namely, the one whose curve encloses the larger area beneath it. All analyses were performed using the SPSS version 13.0 software (SPSS, Inc., Chicago, Ill, USA).
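The comparison of two models by the area under their ROC curves can be sketched in a few lines. The analyses in this study were run in SPSS; the Python below is only an illustrative equivalent, and the outcome labels, predicted scores, and function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, sensitivity) pairs, one per cutoff."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Each distinct score is used as a cutoff, from strictest to most lenient.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical outcomes (1 = symptoms present) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
model_b = [0.9, 0.4, 0.7, 0.3, 0.6, 0.2]

auc_a = auc(roc_points(y_true, model_a))
auc_b = auc(roc_points(y_true, model_b))
```

In this toy example the first model encloses the larger area and would therefore be preferred; an area of 0.5 corresponds to classification no better than chance.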
Using PCA and applying Kaiser's
criterion, 8 patterns of our original data were extracted, expressing the
anthropometric indexes of the children, breakfast consumption, frequency of
consuming athletic refreshments, parental anthropometric indexes, shortness of
breath during recreational activities, birth weight and breastfeeding, eating
cheese pies, and frequency of listening to music. These variables were used in both discriminant
and logistic regression analyses, and both techniques revealed that
anthropometric characteristics, athletic refreshment consumption frequency, and
eating cheese pies were the most important contributors (Table
Predictors, standardized, and unstandardized coefficients for the discriminant analysis model and logistic regression model.
Predictors | Logistic regression | Logistic regression | Discriminant analysis: unstandardized coefficients | Discriminant analysis: standardized coefficients |
---|---|---|---|---|
Anthropometric characteristics | 0.529 | 2.676 | 0.325 | 0.319 |
Breakfast eating frequency | 0.005 | 0.01 | −0.011 | −0.011 |
Athletic refreshments frequency consumption | −0.615 | 2.784 | −0.459 | −0.449 |
Parental BMI | 0.268 | 1.397 | 0.103 | 0.103 |
Shortness of breath during activities | 0.237 | 1.162 | 0.148 | 0.148 |
Birth weight and breastfeeding | −0.289 | 1.37 | −0.182 | −0.182 |
Cheese pies eating | 0.355 | 1.695 | 0.226 | 0.225 |
Listening to music frequency | −0.294 | 1.393 | −0.126 | −0.126 |
Sensitivity and specificity of logistic regression and discriminant analysis models, at various cutoff points for the probability of having any asthma symptoms.
Cutoff value* | Logistic regression: sensitivity (%) | Logistic regression: specificity (%) | Logistic regression: accuracy (%) | Discriminant analysis: sensitivity (%) | Discriminant analysis: specificity (%) | Discriminant analysis: accuracy (%) |
---|---|---|---|---|---|---|
.05 | 94.9 | 8.3 | 29.6 | 100 | 0.8 | 25.1 |
.10 | 92.3 | 23.3 | 40.2 | 100 | 1.7 | 25.8 |
.25 | 69 | 69.2 | 69.2 | 92.3 | 19.2 | 37.1 |
.50 | 28.2 | 95.8 | 79.2 | 71.8 | 70 | 70.4 |
.75 | 5.1 | 100 | 76.8 | 25.6 | 95 | 78 |
.90 | 0 | 100 | 75.5 | 5.1 | 100 | 76.8 |
Receiver operating characteristics (ROC) curves for the discriminant analysis and logistic regression models.
In general, logistic regression and discriminant analysis converged to similar results. Both methods identified the same statistically significant coefficients, with similar effect size and direction, although logistic regression estimated larger coefficients overall. The overall classification rate for both was good, and either can be helpful in predicting the probability of a child having asthma symptoms in the general population. Logistic regression slightly exceeded the discriminant function in the correct classification rate, but the differences in the area under the ROC curve (AUC) were negligible, indicating no difference in discriminating ability between the models.
Discriminant analysis can take as its dependent variable a categorical variable with more than two groups, usually three or four. The number of discriminant functions derived equals the number of the variable’s categories minus 1 (or the number of predictors, if that is smaller). Each function has its own set of coefficients and produces a discriminant score for each case, but the functions differ in their classification ability. So, for a four-level categorical dependent variable, three discriminant functions are derived with their corresponding scores, and only one or two have the necessary power to achieve the optimum classification rates. The question that arises in this case is how many of the available functions should be retained.
In their
paper, Brenn and Arnesen [
In the study of Pohar et al. [
In order to compare the two methods, we applied them to a real dataset rather than using simulation, as the number of observations in the dataset, although not very large, was sufficient to provide reliable results. Also, although the linear discriminant function is a better method than logistic regression when the normality assumptions are met, the differences between them become negligible when the sample size is large enough (50 observations or more).
In conclusion, logistic regression and discriminant analysis produced similar models. In deciding which method to use, we must consider the assumptions required for the application of each one.