Evaluation of Four Multiple Imputation Methods for Handling Missing Binary Outcome Data in the Presence of an Interaction between a Dummy and a Continuous Variable

Department of Biostatistics and Epidemiology, School of Public Health, Kerman University of Medical Sciences, Kerman, Iran Modeling in Health Research Center, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman, Iran Department of Statistics, Higher Education Center of Eghlid, Eghlid, Iran Kerman Neuroscience Research Center, Institute of Neuropharmacology, Kerman University of Medical Sciences, Kerman, Iran


Introduction
e need to adequately deal with missing data is a practical challenge for researchers when analyzing epidemiological data. A well-known approach (and the default in most statistical packages) to deal with the missing data problems is complete case analysis (CCA), which omits subjects with missing values from the analysis. In some cases, such analyses are inefficient, since they sacrifice information from partially observed responses, and even in the worst situations, they may result in biased inferences about the parameters of interest [1]. An alternative to CCA is MI, which creates copies of the dataset, replacing the missing values in each dataset with independent random draws from the predictive distribution of the missing values under a specific model (the imputation model). Each dataset is analyzed separately and the results of M datasets are combined with a set of simple rules. e corresponding estimates (M point and M variance estimates) are combined according to Rubin's combination rule [2].
Using Rubin's rule [2], the reasons for missing data are classified as missing completely at random (MCAR) when the probability of missingness is independent of the observed and unobserved data, missing at random (MAR) if the probability of missingness is independent of the unobserved data after conditioning on observed data, and missing not at random (MNAR), where the probability of missingness is dependent on unobserved data even after conditioning on observed data [3][4][5][6].
MICE, also called fully conditional specification (FCS), is commonly used for imputing missing data. e MICE method specifies the univariate distribution of each incomplete variable conditional on all other variables and creates imputations per variable. e MICE algorithm is a Gibbs sampler, a Bayesian simulation approach that generates random draws from the posterior distribution and conducts univariate imputations sequentially until convergence [7][8][9][10].
In the MICE algorithm, the set of conditional distributions may not correspond to any joint distribution since the users specify univariate conditional distributions, and so the joint distribution may not actually exist. Despite this theoretical drawback of the MICE method, the simulation study suggests this imputation method performs well in practice [11,12]. ere are a number of software packages available to impute missing data using MICE methods. ese include "IVEware" in SAS [11], "mice" [12,13] and "mi" in R [14], and "mi" and "ice" in STATA [15].
In the MICE strategy, when the data include an interaction effect, the interaction can be modeled by appropriate models manually and by imputing the missing values in separate subgroups of the data. e default setting in implementation of MICE is for imputation models to include variables as linear terms only with no interactions, but omission of important nonlinear terms may lead to biased results [16].
Motivated by these challenges, several authors have developed more flexible techniques that can handle missing values in the presence of interactions easily. Automatic Interaction Detection is one of the first implementations of recursive partitioning [17]. Besides the fact that the recursive partitioning technique finds the split that is most predictive of the response variable by searching through all predictor variables, they model the interaction structure in the dataset by splitting a dataset into increasingly homogeneous subgroups sequentially. In other words, since splits are conditional on previous splits, possible interactions are automatically detected. Others have used an approach combining of recursive partitioning with imputation methods [3,18]. ey used CART as an imputation engine in MICE. CART in MICE is available as an option in "mice" in R.
CART methods have properties that make them attractive for imputation: they are robust against outliers, can deal with multicollinearity and skewed distributions, and are flexible enough to fit interactions and nonlinear relations. Indeed, many aspects of model fitting have been automated, so there is little tuning needed by the imputer [18].
However, to our knowledge, few evaluations have been done in simulation contexts; even fewer evaluations have been done on data, in the presence of interaction with mixtures of binary and continuous predictors. is lack of comparisons makes it difficult, if not impossible, to assess the relative merits of each procedure.
us, to find out when the data include a mixture of continuous and binary predictors in the presence of an interaction, which method is best to impute the missing values in the binary response under MAR and MCAR missing mechanisms, we need to evaluate the performance of the four methods for handling missing data in a real and simulated dataset. is evaluation can be further justified through the following reasons: first, most of the studies that evaluated and compared the performance of these four methods have not included the mixture of continuous and binary predictors with binary outcome. Second, no study has yet been made due to the challenge that MICE-Interaction and MICE-Stratified will result in proper imputation in datasets with interaction between a dummy and continuous predictor. ird, it is not clear what proportion of missing data via MAR and MCAR correspond to a certain amount of bias in the results, so even if we accept that one method has a kind of superiority over others, it is desirable to know how much of the bias can be compensated by that method.
Regarding the above explanations, this study was carried out to evaluate the performance of four MI methods in real and simulated data. At first, we performed MICE-Stratified, filling in values separately for the two subgroups defined by the value of the binary variable in each dataset. en, we used three MI methods. First, we used MICE-Interaction method. For MICE-interaction, we included the interaction term as just another variable in the predictor matrix and used the default predictor matrix in the mice function. en, we used CART for specifying the imputation model in MICE (MICE-CART), and the implementation of random forest (RF) in MICE (MICE-RF). We decline to provide formulaic or mathematical details in this article because we want to focus more on simulation. e paper is organized as follows. In Section 2, MICE-Interaction and MICE-Stratified are first elaborated. In continuation of this section, two main recursive partitioning techniques are considered, namely, CART and RF. Subsequently, incorporation of the CART method and RF method in the MICE framework is presented. In Section 3, we implement a secondary empirical study to investigate the performance of the discussed methods. In Section 4, a simulation study is carried out to investigate which of the discussed methods are convenient to preserve the interaction effects. e results from both studies are discussed in Section 5; at the end of the study, some final conclusions are given.

MICE Approach.
Suppose that data are presented as X � (X miss , X obs ), where X miss is included of the columns with at least one missing value, X obs is included the columns of X that are completely observed, X j is the jth column of X, X miss j is the missing values in the jth column of theX miss , Z is the currently imputed data matrix X, and k is the number of partially observed variables. Suppose that X miss is ordered in non-decreasing numbers of missing values in each column. Define X miss −j equal to the matrix X with its jth column removed. us, we will have the following algorithm to show the implementation of standard MICE: (1) To fill in the initial values for the missing values, define a matrix Z equal to X obs ; for each X miss j , all the X miss j values are initially filled in by random draws from the predictive distribution conditional on Z, and attach the imputed version of X miss j to Z prior to incrementing j.
(2) Forj � 1, . . . , k, replace the missing values of X j with random draws from the predictive distribution conditional on X miss −j . (3) Repeat steps 1 and 2 for a number of iterations. is procedure is performed for each variable with at least one missing value, yielding one complete dataset.
(4) Repeat steps 1-3 a number of times (M), resulting in M imputed datasets that are available for analysis. Each of the M imputed datasets is analyzed separately. In the next step, the results are combined using Rubin's rules. It is standard to use generalized linear models as the basis of the posterior predictive distribution draws in steps 1 and 2.
For MICE-Interaction method, the interaction is included in the predictor matrix. In the other words, the interaction term is included as just another variable in the imputation model and then the mice function is used since this method performs better than or equivalent to all other methods considered (the "mice" package in R does this) [3,7,8,19].
For MICE-Stratified method, the datasets are stratified by dummy variable and then MICE method is used to impute missing values, in the presence of interaction effect between a dummy and a continuous variable.
An alternative approach is described in [20]. ey defined a new class of nonparametric multiple imputation methods based on the CART or RF algorithm. ese two methods fall into the umbrella concept of "recursive partitioning," which allows for the modelling of internal interactions in the data by sequentially partitioning the dataset into homogeneous subsets. Some researchers used the tree package and showed that the CART results for recovering interactions were uniformly better than standard techniques [18]. Shah and coworkers applied random forest techniques to both continuous and categorical outcomes, which produced more efficient estimates than standard procedures [21]. A similar set of routines building on the rpart [22] and randomForest [23] packages were developed by Doove and coworkers [22]. Methods CART and RF are part of mice package.

Imputation by MICE-CART.
CART is a nonparametric recursive partitioning imputation method that provides the results as a tree structure. e root node is at the top of the tree, which includes all members. It follows by exploring the data to find the best variable and a cut-off that best separates the subjects into two child nodes. Subgroups are made by the optimal split according to a measure of homogeneities such as the Gini index [24]. Partitioning of each child node continues until some certain stopping criterion has been reached, e.g., a predetermined number of observations in the final subsets [25]. erefore, tree models easily reveal the interaction between independent variables [26].
Suppose that the data are presented as a matrix X � (X obs , X miss ), where X obs contains the columns of X with observed data for all variables and X miss is the part of X that has at least one missing datum. Similar to MICE, the variable with missing data is considered as a dependent variable. e following steps indicate an implementation of CART in MICE: (1) For each variable with missing data (j � 1, . . ., k), fill in the initial values _ X j by random draws from X obs j , and update the X obs matrix (shown by Z).
(2) Fit the CART using each X miss j as outcome, and Z as predictor variables; only subjects with observed values on X miss j are used in this processX obs . (3) For subjects in X miss j , find the terminal node; they end up according to the fitted tree in step 2; and one observed value on X miss j is randomly selected from the subset in this node and used for imputation. (4) Repeat steps 2 and 3 for a number of iterations. is procedure is performed for each variable with at least one missing value, yielding one complete dataset. To construct the tree, a minimum leaf size of 5 considered, with the deviance of less than 0.0001. Additionally, the response variable (Y) was included as a predictor in the imputation models to impute each incomplete variable [3,18,20].

Imputation by MICE-RF.
e algorithm needed for RF imputation is a modification of the discussed CART algorithm. e first two steps are replaced by a construction of k bootstrapped datasets, k being the number of trees in the forest, and the fitting of k tree models. Optionally, each tree can be fitted using the full bootstrapped dataset or randomly selecting the input variables. To avoid reduced variability by imputing based on an averaged tree, possibly due to the higher stability of the individual trees, the imputed value is randomly selected from the union of the k donor pools. For more details on the algorithm, see [22].

Empirical Study
We used empirical data from a population-based study collected in southeastern Iran [27]. e data was initially collected to address the main variables that encourage participants to change their body using a variety of methods such as diet, exercise, or drugs. A multistage sampling was adopted where the area was stratified into 10 blocks. From each block, approximately 120 households were selected. In each block, the first household was selected at random. e remaining blocks were selected following a systematic sampling approach. In each household, only one person, aged 14-55, was interviewed.

Dependent and Independent Variables.
e outcome variable of interest was whether participants had tried to change their body using a variety of methods such as diet, surgery, or drugs. Independent variables include BMI (body mass index), body-esteem scale (BE), perceived sociocultural pressure scale (PSPS), physical appearance comparison scale (PACS), and Gender. Our data did not involve any missing values. ese variables were measured with standard and validated questionnaires. Logistic regression analysis showed that BMI, BE, PACS4, PSPS, and Gender were independent variables that could influence eating disorders. More details are provided in [27].
In this study, we proposed the following complete-data model (1), which contains all five main variables and an interaction term between a dummy and a continuous variable to guide the analysis of the datasets.
where Y is the binary outcome variable, X 1 is BMI, X 2 is PSPS, X 3 is BE, X 4 is PACS4, X 5 is Gender, and X 1 * X 5 is the interaction between BMI and Gender.

Simulation Study for Real Data (Generation of Missing Data).
In this section, the simulation was restricted to generating missing data in variable Y. In fact, for the experimental data, the following steps were repeated 1000 times: Step 1. First, We randomly drew 700 observations 300 times with replacement from each of the datasets. 10% to 50% univariate missing data were created in the outcome variable via MCAR and MAR mechanisms. MAR data was generated in variable Y in a way that probability of becoming missing depended on X 3 andX 4 .
Step 2. Using MI method: the four considered MI methods were MICE-Interaction, MICE-CART, MICE-RF, and MICE-Stratified. e independent variables used in all the methods were Gender, BMI, BE, PACS4, PSPS, and the interaction between Gender and BMI. Finally, the number of imputations was set to 5 [3,28].
Step 3. Recalculation of the coefficients and other indexes [5] from complete datasets, which is intended to represent how MI was able to reduce the effect of missing data on the estimations.
Step 4. ere are several measures that may inform us about the statistical validity of a particular procedure.
is step includes calculations of raw bias (RB), percent bias (PB), coverage rate (CR), model-based and empirical SE, and estimated proportion of the variance attributable to the missing data (λ). e raw bias, which is the average difference between the true value of the parameter being estimated from the real data, and the value of the estimation after imputation, should be close to 0. Bias can also be expressed as percent bias (PB). For acceptable performance, we use an upper limit for PB of 5%. In more detail, the grades of PB are defined as negligible (0%-5%), minimal (5%-10%), moderate (10%-20%), heavy (20%-30%), and severe (>30%) [29,30]. e coverage is the percentage of cases where the estimand, the true value of the parameter being estimated [20], is located within the 95% confidence interval around the estimate, which should be 95% or higher. e width of the 95% confidence interval should be as small as possible (as long as the coverage does not fall below 95%) and is an indicator of statistical efficiency. However, it is important to evaluate the coverage with other measures, because high variances can lead to higher coverage rates. We regard the performance of the interval procedure to be poor if its coverage drops below 90% [31]. Model-based SE is the mean of the SE estimated across simulations and empirical SE is the SD of the estimates across simulations. e two should be similar and should match if the Rubin's variance estimate with a specific MI model is unbiased. Lastly, λ is the estimated proportion of the variation attributable to the missing data, and λ is an indicator of the severity of the missing data problem (the computation of RB, PB, and λ is based on equations (2)-(4). e mice package of R was used to calculate these measurements.
A scientific estimand Q is a quantity of scientific interest that we can calculate if we would observe the entire population. T is the total variance of Q, and hence of (Q − Q) if Q is unbiased. B is the extra variance caused by the fact that there are missing values in the sample, and B/M is the extra simulation variance caused by the fact that the imputation reputation M is finite [3].

Simulation Study
is simulation included the generation of artificial datasets and then creating missing data in the binary outcome variable. e following steps are performed to examine the imputation methods.
Step 1. In this simulation study, we generated a finite population of size N � 700 from the binary variable outcome Y.
e artificial data include one binary predictor variable and four continuous variables that the binary variable was uncorrelated from the other variables.
e number of continuous variables is kept higher than the number of categorical variables due to the fact that the simulation is aimed to be similar to our real data. One binary variable was randomly drawn from a binomial distribution. e marginal distribution of X 5 is Bernoulli (0.6). Moreover, four continuous predictors (X 1 , X 2 , X 3 and , X 4 ) were randomly drawn from a multivariate normal distribution, where the first two predictor variables (X 1 andX 2 ) had a pairwise correlation of r � 0.5 and the last two predictor variables (X 3 andX 4 ) had pairwise correlations of r � 0.3. e binary response is modeled using GLM depending on various binary and continuous predictors. e dichotomization of X 5 is based on the following criteria: where 0.6 is the mean value of X 5 . By defining and stan- We have used the simulation design that combined sampling and missing data mechanisms. To achieve this simulation design, we first generated 1000 sets of data based on the described characteristics and then created missing values one time for each complete dataset. Steps 2 and 3 are the same as those described in the real data simulation section. Various scenarios have been considered, including two missing mechanisms (MAR and MCAR), five missing proportions (10%, 20%, 30%, 40%, and 50%), and four different imputation methods (MICE-Interaction, MICE-CART, MICE-RF, and MICE-Stratified), which lead to a combination of 2 × 5 × 4 � 40 scenarios. According to the above, a total of 1000 repetitions were performed. To compare different scenarios, the average measures for the scenarios were calculated.

Results for Real Data.
For the real dataset, we present only the results with respect to the interaction effects in the form of plots. e results of the imputation methods under the missing mechanism of MCAR and MAR are shown in Figures 1-6. ey are separated by different measurements.

Raw Bias (RB).
All the imputation methods used to estimate the interaction were almost biased towards the null except the MICE-Stratified method. However, the bias values of MICE-Stratified method were lower than 0.1 at all the missing percentages. e MICE-Interaction and MICE-CART methods were less biased than the other two methods. We found that the bias values become higher with increasing missing percentages (Figure 1).

Percent Bias (PB).
e MICE-Interaction and MICE-CART methods were less biased than the other two methods at almost all scenarios. e MICE-Stratified method produced the highest PB at all missing percentages (Figure 2).

Coverage Rate (CR).
Among imputation methods, the MICE-RF method had the highest coverage rates at all missing rates and both mechanisms and the coverage rate of the MICE-Stratified method was very low (Figure 3).

Model and Empirical SE.
At all the MI approaches, the empirical and model-based standard errors are almost identical for the interaction effect (Figures 4 and 5).

e Proportion of Variation Attributable to the Missing
Data. In general, λ values for all methods were less than 0.5. e MICE-Stratified and MICE-CART methods had lower λ value for interaction than the others (for instance, the MICE-Stratified led to the lowest bias in the MCAR missing mechanism, at all the missing percentages), which showed that the MICE-RF and MICE-Interaction methods indicated greater uncertainty than the other methods in estimating the interaction effect ( Figure 6). Tables 1-6 show RB, PB, CR, model-based SE, empirical SE, and λ under the missing mechanism of MCAR and MAR for imputation methods; the tables are separated by different measurements.

Results for Artificial Datasets.
By comparing the values of RB for each coefficient, interesting results were obtained: for all scenarios, the MICE-Interaction method performed very well in terms of RB since all RB values for this method were almost zero. For instance, at 40% missing percentage under both mechanisms, the amount of RBs for the interaction produced by the MICE-Interaction method was equal to zero (Table 1).
Considering the PB values, the MICE-Interaction method had acceptable performance since the PB values were less than 5%. For all the coefficients which were in the interaction effect, at all missing percentages in both missing mechanisms, the MICE-Interaction method led to a negligible PB (PB < 2.2%)and had relatively good performance (Table 2).
MICE-Interaction, which correctly includes the interaction term in the imputation model, led to slightly higher coverage for the interaction effect than the other imputation methods. e CR values of the MICE-Interaction method were more than 95%. In almost all scenarios, the CR values of MICE-Stratified method were less than 95% ( Table 3).

Journal of Probability and Statistics
By examining the model and empirical SEs, it is determined that by simulations the difference between the empirical and Rubin's MI variance was small or negligible. e model SEs was larger than the empirical SEs, and this allowed the coverage to be relatively good despite the slight bias (Tables 4-5). method resulted in the lowest λ which showed that the influence of the imputation model on the final result was not larger than that of the complete-data model. e estimated values were less than 0.2, indicating that the methods deal with a relatively large fraction of missing information ( Table 6).

Discussion
In this study, four MI methods were evaluated, including MICE-Stratified, MICE-Interaction, and two recursive partitioning techniques that incorporate interaction effects in the data, as imputation methods in MICE: CART and RF. In addition, the percentage of missing data and two missing mechanisms on the performance of the MI methods were examined. We studied the raw bias, percent bias, coverage rates, model-based SE, empirical SE, and the proportion of variation attributable to the missing data, after imputation by these methods. Overall, at lower percentages of missing values (at most 30%), all three methods except MICE-Stratified appeared to produce reasonably high quality imputations.
Consistent with similar studies, MI methods were evaluated based on RB, PB, CR, model SE, empirical SE, and λ [3,22]. RB, PB, and CR are several measures that may inform us about the statistical validity or precision of a particular procedure. RB is a measure of invalidity and should be close zero. For acceptable performance, we used an upper limit for PB of 5% [3,32]. CR is also a measure of validity but at the same time is a function of the average width of the confidence intervals, which is a measure of         precision. If all is well, then RB should be close to zero, and the coverage should be near 0.95. Methods having no bias and proper coverage are called randomization-valid [33]. If two methods are both randomization-valid, the method with the shorter confidence intervals is more efficient. e λ value is a function of the number of imputations and interpreted as the proportion of the variation attributable to the missing data. If λ is high, say λ > 0.5, the influence of the imputation model on the final result is larger than that of the completedata model. Better imputation results in lower values of λ [2,34,35]. As we saw earlier in the current study, in almost all scenarios, the λ values were less than 0.5. Although the lowest amount of λ was attributed to the MICE-Stratified and MICE-CART method, according to the RB and CR values of these two methods, the MICE-CART method was superior to the MICE-Stratified method.
Our main result is that, relative to one another, the MICE-Interaction method appeared to have the best performance at all missing percentages. Also, the recursive partitioning methods had better performance than MICE-Stratified method. Furthermore, the simulation results of comparing the methods showed that the quality of parameter estimates was almost identical in the two missing mechanisms.
An extension of this study would consider a dataset that has more complex structures between the variables than our data. In addition, looking at each model closely, one would observe that the design of this study directly aligns with the strength of each recursive partitioning model. e recursive splits in the tree structure allow CART to effectively capture higher order interaction more accurately, as long as the number of individuals in each leaf is large enough. e imperfect imputation models that resulted in the bias of recursive partitioning techniques may have emerged from the presence of main effects in the data. at is, recursive partitioning techniques have difficulty in modelling linear main effects. It is hard to capture the main effects because, due to the binary tree model, it would take many fortuitous splits to recreate the structure [36]. We expect that this problem will also happen implementing other recursive partitioning techniques. e solution to this problem is provided by STIMA [37], which combines a linear main effects model with recursive partitioning. e difficulty with modelling linear main effects also explains why in our study the recursive partitioning imputation methods led to biases that are somewhat higher for the main effects compared to MICE-Interaction method.
Meng stated that the analysis procedure should be congenial to the imputation model. Bartlett et al. connected congeniality to compatibility by extending the joint distribution of the imputation model to include the substantive model [38,39]. Uncongeniality can occur if the imputation model is specified as more restrictive than the complete-data model, or if it fails to account for important factors in the missing data mechanism. Both types of missions introduce biased and possibly inefficient estimates [3]. In the MICE-Interaction method, we used the imputation model, which included the interaction effect, and this made compatibility between imputation and analysis model [40]. erefore, the method led to less bias in estimating the coefficients. e higher biases for the interaction effects by RF compared to CART may be explained by interactions that are missed in the tree building process due to drawing bootstrap samples and the (low) number of randomly preselected variables [41]. erefore, it is concluded that MICE-Interaction preserved the interaction effect best and the MICE-Interaction method is recommended if a user has presumptions of interaction effects. e quality of imputations for any of the methods was lower for datasets with higher percentage of missing data.
Some features of the present study could be considered as strengths. For example, to our knowledge, this is the first time that the performance of four imputation methods was evaluated in observations involving a combination of binary and continuous predictors, in the presence of an interaction effect.
To improve on the performance of the MICE-RF, an extension of this study will look into the possibility of changing the number of trees in RF algorithm. Future work will consider the missing not at random mechanism instead of missing completely at random and missing at random considered here. Finally, as mentioned earlier, datasets with more complex structures between the variables will be considered.

Conclusion
In summary, this paper offered a fair comparison between MICE-stratified, MICE-Interaction, and tree-based imputation methods in the MICE algorithm. To our knowledge, this is the first paper to compare tree-based imputation in MICE to a parametric model that includes a true interaction effect between a dummy and a continuous variable.
MICE-Interaction had the highest coverage of the interaction effect. Importantly, parametric imputation should only be utilized if there is enough information to ensure that all necessary interaction terms are included in the imputation model. If one can accept the reduction of coverage for the interaction effect, recursive partitioning methods are recommended as they do not require specification of the imputation model.
Beyond the fact that there is still room for improvement, it can be concluded that although recursive partitioning methods were valuable for imputing datasets containing interaction effect between a binary and a continuous variable, properly incorporating interactions in the parametric imputation model (MICE-Interaction method) led to much better performance.

Conflicts of Interest
e authors declare that they have no conflicts of interest.