Binary Response Analysis Using Logistic Regression in Dentistry

Multivariate analysis with binary response is extensively utilized in dental research due to variations in dichotomous outcomes. One of the analyses for binary response variable is binary logistic regression, which explores the associated factors and predicts the response probability of the binary variable. This article aims to explain the statistical concepts of binary logistic regression analysis applicable to the field of dental research, including model fitting, goodness of fit test, and model validation. Moreover, interpretation of the model and logistic regression are also discussed with relevant examples. Practical guidance is also provided for dentists and dental researchers to enhance their basic understanding of binary logistic regression analysis.


Introduction
Multivariate analysis is extensively used in multidisciplinary research fields, given its ability to explore multiple independent variables [1]. Research in dental science usually investigates the effect of multiple factors associated with various events, such as factors related to a disease or to the success or failure of an intervention. In dentistry, a binary response variable is often recorded as the dependent variable, e.g., success-failure of a treatment, presence-absence of a disease, sound-decayed tooth, positive-negative staining, or other yes-no outcomes. Numerous studies also aim to inspect the relationship between a binary response variable and several independent variables.
Binary logistic regression is an established method for analyzing how such a binary response variable relates to its associated factors, for example the presence or absence of disease in epidemiological studies, a positive or negative result in laboratory research, or sex prediction in the forensic identification of unidentified bodies. It is commonly used to investigate an existing problem by exploring associated factors and predicting the response probability for a new case [2]. Many aspects of logistic regression differ from those of linear regression, with which most researchers are more familiar. Hence, understanding the application and interpretation of binary response analysis is of utmost importance to researchers. The aim of this paper is to explain important concepts of logistic regression with relevant examples, including model fitting, goodness of fit test, validation of the fitted model, and interpretation of the fitted model.

Linear Regression and Logistic Regression
Regression analysis, a common statistical method employed in dental research, is used to investigate the relationship between one response variable and one or more independent variables. Linear regression and logistic regression differ in the type of response variable: linear regression is used for a continuous response variable and logistic regression for a binary response variable, as shown in Table 1. Linear regression models a continuous response variable (Y) by a linear combination of independent variables (Xs), as in equation (1) [3]:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \tag{1}$$
where $\beta_i$ is the regression coefficient for each $X_i$, which can be a continuous, discrete, or categorical variable. An example is determining the factors (independent variables, Xs) associated with salivary glucose levels (a continuous response variable, Y) [4]. The response variable for logistic regression is binary (Y = 1 or 0), where Y = 1 denotes the event of interest, e.g., success, presence, disease, or positive [5].
The logistic regression model relates p, the probability of an event of interest, P(Y = 1), to a linear combination of independent variables (Xs) through the logit link function. The most commonly used link functions are logit, probit, and complementary log-log [6]. The logit link function is the natural log of the odds, i.e., the ratio between the probability that the event of interest occurs (p) and the probability that it does not occur (1 − p), as shown in equation (2) [7]:
$$\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \tag{2}$$
The probit link function is the inverse of the standard normal cumulative distribution function. The complementary log-log link function is log(−log(1 − p)) [8]. However, the logit link function is the most commonly utilized because it is less complicated and easy to interpret [6]. Therefore, this paper focuses only on the logit link function of logistic regression.
Figure 1 shows the regression plots from the continuous response linear regression analysis and the binary response logistic regression analysis with one continuous independent variable. The plot from the linear regression analysis is a straight line, whereas the plot from logistic regression is an S-curve. The difference is due to the logit link function in logistic regression.
Peduzzi et al. (1996), in a simulation study of sample size in logistic regression analysis, suggested that the number of events of interest in the response variable should be at least 10 per independent variable [9]. For example, a study of the presence of an oral microbe and 5 related factors needs at least 50 cases in which the microbe is present. However, if the sample size is limited, Vittinghoff et al. (2007) stated that 5-9 events per variable with bootstrap resampling validation was acceptable [10]. Accordingly, a study of an oral lesion and 5 related factors needs 25-45 cases of the oral lesion for satisfactory analysis with bootstrap validation.
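The three link functions and the events-per-variable rule described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names (`logit`, `probit`, `cloglog`, `inverse_logit`, `required_events`) are chosen here for the example and do not come from the cited studies.

```python
import math
from statistics import NormalDist


def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


def probit(p):
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)


def cloglog(p):
    """Complementary log-log link: log(-log(1 - p))."""
    return math.log(-math.log(1 - p))


def inverse_logit(eta):
    """Map a linear predictor back to a probability (the S-curve)."""
    return 1.0 / (1.0 + math.exp(-eta))


def required_events(n_variables, events_per_variable=10):
    """Minimum number of events of interest for a given events-per-variable rule."""
    return n_variables * events_per_variable


# The S-curve: probabilities stay between 0 and 1 for any linear predictor.
for eta in (-4, -2, 0, 2, 4):
    print(eta, round(inverse_logit(eta), 3))

print(required_events(5))                         # 10 events per variable
print(required_events(5, 5), required_events(5, 9))  # 5-9 events per variable
```

With 5 independent variables, the sketch reproduces the 50 cases and the 25-45 cases mentioned in the text.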

Fitting Logistic Regression Model
Unlike discriminant analysis, logistic regression does not require the assumption of a multivariate normal distribution or any other distributional assumption on the Xs [11]. The significance of each independent variable in the model is determined by the Wald test, which is the ratio of the estimated parameter $\beta_i$ to its standard error and is assumed to follow a standard normal distribution [12]. Several statistical software programs (SPSS, R, STATA, SAS, etc.) present the square of this ratio, which follows a chi-square distribution with 1 degree of freedom [13]. The null hypothesis for both statistical tests is $\beta_i = 0$. Independent variables with a p value greater than the significance level are removed from the model. An example of data from a study of risk factors associated with hyperglycemia using binary logistic regression analysis, after removal of a few nonsignificant variables, is presented in Table 2 and Figure 2 [14]. The final fitted model can be written in the form of equation (3), with the coefficient estimates given in Table 2:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \tag{3}$$
where $X_1, \ldots, X_k$ are the retained variables (age, BMI, family history of DM, and periodontal status) and p is the probability that the patient would have hyperglycemia.
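As a rough illustration of the Wald test described above, the following sketch computes the ratio of a coefficient to its standard error, its square, and the corresponding two-sided p value. The coefficient and standard error passed in are hypothetical values chosen for the example; they are not taken from Table 2.

```python
import math


def wald_test(beta, se):
    """Return (z, chi_square, p_value) for the null hypothesis beta = 0."""
    z = beta / se            # assumed to follow a standard normal distribution
    chi_square = z ** 2      # follows a chi-square distribution with 1 df
    # Two-sided p value from the standard normal distribution; this equals
    # the upper-tail probability of chi_square with 1 degree of freedom.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, chi_square, p_value


# Hypothetical estimate and standard error, for illustration only.
z, chi_square, p_value = wald_test(beta=0.50, se=0.20)
print(f"z = {z:.2f}, chi-square = {chi_square:.2f}, p = {p_value:.4f}")

# A variable would be kept in the model when p_value is below the chosen
# significance level (for example 0.05).
```

Both forms of the test give the same p value, which is why software may report either the z statistic or its square.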

Interpretation of Coefficients
Interpretation of the logistic regression model is based on the exponential function. The exponential of a logistic regression coefficient is the odds ratio. If the independent variable $X_i$ is increased by 1 unit, the odds of the response are multiplied by $e^{\beta_i}$ when the other variables are fixed [15]. Applying the exponential function to both sides of equation (3) gives the odds, as in equation (4):
$$\frac{p}{1-p} = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k) \tag{4}$$
The clinical implication of this mathematical equation can be illustrated with the adjusted odds ratios (OR) in Table 2. If the age of the patient is increased by 10 years (1 unit), the odds of hyperglycemia are multiplied by $e^{0.371}$, or 1.449, when the other X variables are fixed. The coefficient of a categorical variable can be interpreted in a similar way. If the patient has a family history of diabetes mellitus (HDM), the odds of developing hyperglycemia are $e^{0.558}$, or 1.747, times those of a patient without HDM when the other variables are fixed. These odds are adjusted for the other variables in the equation and are reported as the adjusted odds ratios [16][17][18]. The confidence intervals of the regression coefficients are also reported. Some statistical packages (SPSS, R, STATA, SAS, etc.) exponentiate those intervals and report the confidence interval of the odds ratio instead of the confidence interval of the coefficient, as in Table 2 [19].
However, some studies aim to estimate the probability of an event of interest. The odds in equation (4) can be rearranged into the probability equation, as in equation (5):
$$p = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)} \tag{5}$$
where exp(·) is the exponential function and p is the estimated probability of the event of interest, here hyperglycemia, which was the purpose of that study. The second example, in Table 3, aimed to use logistic regression only for prediction. The study of sex determination using tooth widths established a logistic regression model for predicting the probability that an unidentified body is male, using lower-left canine (LLC) and upper intercanine width (UIW) as independent variables [20]. Table 3 presents the result from the logistic regression analysis, which can be written in the form of equation (6), with the coefficient estimates given in Table 3:
$$p(\text{male}) = \frac{\exp(\beta_0 + \beta_1\,\text{LLC} + \beta_2\,\text{UIW})}{1 + \exp(\beta_0 + \beta_1\,\text{LLC} + \beta_2\,\text{UIW})} \tag{6}$$
Equation (6) is for sex determination of an unidentified body. Substituting the LLC and UIW measurements into the equation yields the probability of being male, p(male), which can be used for sex identification.
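The conversion from coefficient to odds ratio can be checked numerically. The sketch below exponentiates the two coefficients quoted in the text (0.371 for age per 10 years and 0.558 for HDM) and shows how a confidence interval for a coefficient is exponentiated into one for the odds ratio. The standard error used for the interval is a hypothetical value, since the text does not report it.

```python
import math


def odds_ratio(beta):
    """Odds ratio for a 1-unit increase in the variable."""
    return math.exp(beta)


def odds_ratio_ci(beta, se, z_crit=1.96):
    """Exponentiate the Wald confidence limits of the coefficient (95% by default)."""
    return math.exp(beta - z_crit * se), math.exp(beta + z_crit * se)


print(round(odds_ratio(0.371), 3))  # age, per 10 years -> 1.449
print(round(odds_ratio(0.558), 3))  # family history of DM -> 1.747

# Hypothetical standard error, for illustration only.
lower, upper = odds_ratio_ci(0.371, se=0.10)
print(f"95% CI of the odds ratio: {lower:.3f} to {upper:.3f}")
```

Because the exponential function is monotonic, an interval for the coefficient that excludes 0 always maps to an interval for the odds ratio that excludes 1.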
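A prediction of this kind can be sketched as follows. The function applies the probability equation to an intercept, a list of coefficients, and a list of measurements. The intercept, coefficients, and tooth widths shown are hypothetical placeholders (the fitted values belong to Table 3 and are not reproduced here), and the 0.5 classification cutoff is likewise an assumption made for the example.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """Estimated probability of the event from the linear predictor."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, values))
    # 1 / (1 + exp(-eta)) is algebraically equal to exp(eta) / (1 + exp(eta)).
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients for LLC and UIW (mm); NOT the estimates in Table 3.
intercept = -40.0
coefficients = [2.0, 0.8]          # LLC, UIW
measurements = [7.0, 34.0]         # LLC, UIW of a new case

p_male = predicted_probability(intercept, coefficients, measurements)
predicted_sex = "male" if p_male >= 0.5 else "female"  # assumed cutoff of 0.5
print(f"p(male) = {p_male:.3f}, predicted sex: {predicted_sex}")
```

In practice the cutoff would be chosen by the investigators, for example from the sensitivity and specificity observed in their own data.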
Figure 2: Demographic data of significant variables (age, BMI, family history of DM, and periodontal status) from a study of risk factors associated with hyperglycemia using binary logistic regression analysis [14].
Another example, in Table 4, aimed to use logistic regression only for investigating risk factors. The study of the prevalence and risk factors of high-level oral microbe used logistic regression to investigate the risk factors [21]. Table 4 presents the result from the analysis, which comprised the statistically significant risk factors associated with the high-level oral microbe. This example presents the odds ratios of a three-level categorical variable, i.e., education level. If a variable has more than two levels, the table should present all levels so that the reference level is shown, as with the non/primary education level in Table 4. The odds ratios of the secondary and higher education levels, 5.26 and 1.97, respectively, can then be compared with the reference level, non/primary education in this study. For example, if the participant's education is secondary level, the odds of having high-level oral microbes are 5.26 times those of a participant whose education is non/primary level (95% confidence interval 1.41 to 19.67). However, this study only aimed to investigate the risk factors. Since the adjusted odds ratio was adequate for interpretation, the prediction equation was not necessary for this study.
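The role of the reference level can be made concrete with indicator (dummy) coding. In the sketch below, the three education levels are coded against non/primary education as the reference, and the coefficients are recovered as the natural log of the odds ratios quoted for Table 4 (5.26 and 1.97). The function and variable names are chosen for this illustration only.

```python
import math

# The first level is treated as the reference level.
LEVELS = ("non/primary", "secondary", "higher")

# Coefficients recovered as ln(OR) from the odds ratios quoted in the text.
COEFFICIENTS = {"secondary": math.log(5.26), "higher": math.log(1.97)}


def dummy_code(level, levels=LEVELS):
    """Indicator coding: one 0/1 column per non-reference level."""
    if level not in levels:
        raise ValueError(f"unknown level: {level}")
    return [1 if level == other else 0 for other in levels[1:]]


def odds_ratio_vs_reference(level):
    """Odds ratio of a level compared with the reference level."""
    codes = dummy_code(level)
    log_odds_ratio = sum(c * COEFFICIENTS[name] for c, name in zip(codes, LEVELS[1:]))
    return math.exp(log_odds_ratio)


for level in LEVELS:
    print(level, dummy_code(level), round(odds_ratio_vs_reference(level), 2))
```

The reference level is coded as all zeros, so its odds ratio is 1 by construction; this is why it carries no estimate of its own in the table.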

Goodness of Fit Test
When the final model is constructed, it should be examined in terms of goodness of fit to describe how well the model fits the data. The $R^2$ value of logistic regression is usually low, unlike the $R^2$ of linear regression. Hosmer et al. (2013) recommended performing a goodness of fit test instead of reporting the $R^2$ [12]. The Hosmer-Lemeshow goodness of fit statistic is calculated by grouping the cases on percentiles of the estimated probability, and it follows a chi-square distribution. The null hypothesis of this test is that the model fits the data. If the calculated p value from the test is less than the level of significance, the model can be assumed to be a poor fit [12,22]. For example, the fitted model in Table 2 was tested by the Hosmer-Lemeshow test, with a p value = 0.210. This indicated that the model fitted the data well. However, Hosmer et al. (1997) and a second study found that none of the goodness of fit tests has high accuracy when the sample size is small (n = 100). Therefore, they recommended a sample size of 500 for the goodness of fit test [23,24].
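The grouping calculation behind the Hosmer-Lemeshow statistic can be sketched as follows. This is a simplified illustration, not the routine used by any particular statistical package: cases are sorted by estimated probability, split into equal-sized groups, and the observed and expected numbers of events are compared in each group. The statistic is referred to a chi-square distribution with (number of groups − 2) degrees of freedom; the upper-tail helper below uses a closed form that is valid only for an even number of degrees of freedom (for example 8, with 10 groups).

```python
import math


def hosmer_lemeshow_statistic(observed, probabilities, n_groups=10):
    """Chi-square statistic from groups formed on the estimated probabilities."""
    pairs = sorted(zip(probabilities, observed))  # order cases by estimated probability
    n = len(pairs)
    statistic = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        size = len(group)
        expected = sum(p for p, _ in group)  # expected number of events
        events = sum(y for _, y in group)    # observed number of events
        if size == 0 or expected <= 0 or expected >= size:
            raise ValueError("empty group or degenerate expected count")
        statistic += (events - expected) ** 2 / (expected * (1 - expected / size))
    return statistic


def chi_square_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Usage with 10 groups: p_value = chi_square_upper_tail_even_df(statistic, 10 - 2)
# A p value above the significance level gives no evidence of poor fit.
```

A large statistic (small p value) means the observed event counts depart from what the fitted probabilities predict, which is the situation the text describes as a poor fit.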

Model Validation
Validation of the fitted model confirms the accuracy of its inference [25]. Before the model fitting process, all data are split into two sets. The first is the validation set (or testing set), which is taken randomly and comprises about 15 to 40% of all data [26,27]. The rest of the data is called the training set (or modelling set). It is used to fit the model by logistic regression analysis, which, in turn, establishes the prediction equation, e.g., as in equation (3). Then, the data from the validation set are applied to the model previously fitted on the training data [3,26]. Model validation is performed by comparing the results from the fitted model with the observed responses [15,28]. The prediction error can be calculated as the percentage of incorrect results in the validation set. For example, if the results from a validation set of 50 samples contain 10 incorrect predictions, the prediction error will be 20% (10/50).
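The split and the prediction error can be sketched as below. The 30% validation fraction is one arbitrary choice within the 15 to 40% range mentioned above, and the function names and the fixed random seed are assumptions made for this illustration.

```python
import random


def split_data(records, validation_fraction=0.3, seed=0):
    """Randomly split records into (training_set, validation_set)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def prediction_error(observed, predicted):
    """Proportion of validation cases whose prediction differs from the observed response."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    incorrect = sum(1 for o, p in zip(observed, predicted) if o != p)
    return incorrect / len(observed)


training, validation = split_data(range(100))
print(len(training), len(validation))

# The example from the text: 10 incorrect predictions among 50 validation samples.
observed = [1] * 50
predicted = [1] * 40 + [0] * 10
print(f"{prediction_error(observed, predicted):.0%}")
```

The model is fitted on the training set only; the validation set is used once, afterwards, to obtain the prediction error.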

Various Applications in Dentistry
Superior to univariable analysis, multiple logistic regression presents the effect of confounding factors and/or other variables with adjusted odds ratios to confirm the effect of the variables of interest when other factors are involved. Binary logistic regression can be used not only in the investigation of associated factors, as previously described in the example for evaluating the risk factors associated with hyperglycemia, but also in many other aspects, as in Table 5. Logistic regression can be applied to investigate related factors, e.g., studies on factors related to tooth loss, tooth wear, implant failure, or temporomandibular joint clicking [29][30][31][32]. It can also be used to establish associations between different variables, e.g., between malocclusion and quality of life, between dentist characteristics and treatment decision, between demographic data and awareness of dental waste management, or between consanguineous marriage and dental caries [33][34][35][36]. Logistic regression can be used for developing a predictive model, such as sex identification using oral measurements in forensic science or the prediction of esthetic preference using demographic data [20,37,38]. The model can also be applied as a screening test for hyperglycemic patients [39] or for identifying the stage of carcinoma using specific genes [40].

Conclusion
Binary logistic regression is utilized in dental research to understand the relationship between multiple independent variables and a binary response variable. The regression coefficients of a final model describe the significance of each independent variable with regard to the response variable in terms of the odds ratio and the Wald test. Moreover, the established model can predict the probability of the event for a new case with the help of the probability equation. The goodness of fit of the final model can be examined by the Hosmer-Lemeshow test. Validation of the model can be carried out by dividing the data into validation and training sets. However, clinical factors must be considered for model plausibility.
This review provides practical guidance to dentists and dental researchers alike to enhance their understanding of the analysis, which is greatly beneficial when reading articles or performing clinical research that involves a binary response.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.