First Use of Multiple Imputation with the National Tuberculosis Surveillance System

Aims . e purpose of this study was to compare methods for handling missing data in analysis of the National Tuberculosis Surveillance System of the Centers for Disease Control and Prevention. Because of the high rate of missing human immunode�ciency virus (HIV) infection status in this dataset, we used multiple imputation methods to minimize the bias that may result from less sophisticated methods. Methods . We compared analysis based on multiple imputation methods with analysis based on deleting subjects with missing covariate data from regression analysis (case exclusion), and determined whether the use of increasing numbers of imputed datasets would lead to changes in the estimated association between isoniazid resistance and death. Results . Following multiple imputation, the odds ratio for initial isoniazid resistance and death was 2.07 (95 % CI 1.30, 3.29); with case exclusion, this odds ratio decreased to 1.53 (95 % CI 0.83, 2.83). e use of more than 5 imputed datasets did not substantively change the results. Conclusions . Our experience with the National Tuberculosis Surveillance System dataset supports the use of multiple imputation methods in epidemiologic analysis, but also demonstrates that close attention should be paid to the potential impact of missing covariates at each step of the analysis.


Background
Missing data is a common problem in epidemiologic research. Analytic techniques used in multivariable analysis, such as regression models, rely on methods that exclude cases with missing covariate data from analysis. is missing data approach has important limitations. First, case exclusion will always lead to loss of statistical power. Second, case exclusion will introduce bias into the analysis if excluded subjects differ from included subjects in ways that are relevant for the parameter of interest [1]. e potential for bias using case exclusion depends on the mechanism for missingness. For missing-at-random (MAR) data, the missingness of a particular observation depends only on observed covariates, and for missing-not-at-random (MNAR) data, missingness may depend on both observed and unobserved covariates. For either MAR or MNAR data, case exclusion will introduce bias, as subjects excluded from analysis will differ from subjects included in analysis according to either the measured or unmeasured covariates. In contrast, when data is missingcompletely-at-random (MCAR), missingness can be considered a random deletion of observations without respect to measured or unmeasured covariates, and case exclusion does not lead to the introduction of bias (only the loss of statistical power) [2].
Multiple imputation methods were developed by Rubin to account for nonresponse bias in surveys [3], and later modi�ed by Schafer [4]. e goal of multiple imputation is to T 1: Predictor variables included in the imputation model, with percent of nonmissing observations and for association with the outcome.

Variable
No. evaluated (%) create several plausible values for the missing covariates, and consequently several complete datasets, with the imputed values generated from observed relationships between variables. e investigator determines which variables will be used to create the imputed datasets, speci�es the mathematical relationships between these variables (the imputation model), and chooses the number of imputed datasets that will be created. All predictors of missingness should be included in the imputation model in order to satisfy the MAR assumption [4]. Once these complete datasets are generated, analysis is performed on each dataset according to the hypothesis being tested. e parameter estimates that are obtained from each imputed dataset are combined into a single-point estimate, and its associated error re�ects uncertainty not only within each imputed dataset, but also between the imputed datasets [5].
Recently, we used multiple imputation methods in our analysis of the association of initial isoniazid resistance with death during therapy among cases of tuberculous meningitis (TBM) in the USA between 1993 and 2005, analyzing data collected by the National Tuberculosis Surveillance System (NTSS) [6]. Among 1614 patients with positive cerebrospinal �uid cultures for M. tuberculosis, we observed a signi�cant association between initial isoniazid resistance and death during therapy (OR 2.07, 95% CI 1.30, 3.29). We limited our analysis to patients with a known result from initial isoniazid susceptibility testing, and therefore isoniazid resistance or susceptibility was completely known for all patients in the study. However, other clinical and demographic factors that were evaluated as potential confounders of the relationship between isoniazid resistance and death during therapy had varying degrees of missingness, as shown in Table 1.
e human immunode�ciency virus (HIV) status of the case patient was unknown in 47% of observations. HIV status is included in the national Report of a Veri�ed Case of Tuberculosis (RVCT), with options including "positive, " "negative, " "indeterminate, " "test done, results unknown, " "not offered, " "refused, " and "unknown"; cases may also be submitted with missing data reported (i.e., no HIV variable response option selected). For cases reported from California during the time period of the study, HIV status was reported as missing for all cases. Matching was then performed through 2004 between the state tuberculosis surveillance dataset and the state acquired immunode�ciency syndrome (AIDS) registry, which only includes HIV-positive patients with a clinical diagnosis of AIDS. Consequently, this matching procedure did not identify patients who tested negative for HIV or HIVpositive patients without a clinical diagnosis of AIDS.
We sought to further explore the methodology of the multiple imputation in our analysis of the NTSS data. Our goal was to compare multiple imputation with case exclusion and to determine whether the use of increasing numbers of imputed datasets would have changed the inference regarding the association between initial isoniazid resistance and death during antituberculosis therapy.

2.1.
Setting. e NTSS has collected aggregate tuberculosis incidence data in the USA since 1953 and individual-level data (including antituberculosis drug susceptibilities) since 1993 [7]. In order to be included in the national count, a case of tuberculosis must satisfy a standardized case de�nition. We examined all tuberculosis cases reported from January 1, 1993, through December 31, 2005, during which time drugsusceptibility and risk factor data were available for reported cases. Institutional review board approval was obtained from the University of Pennsylvania prior to the beginning of the study.

Subjects.
Cases of tuberculosis are reported to the Centers for Disease Control and Prevention by state and local health agencies, using the RVCT. is report includes clinical and demographic information about the patient, as well as the anatomic sites of any positive mycobacterial cultures or smears. We selected subjects for inclusion in the cohort if there was a report of a meningeal site of involvement, as either a primary or secondary site of infection, with a positive culture for M. tuberculosis from cerebrospinal �uid. We limited enrollment to patients who were alive at diagnosis and who initiated antituberculosis therapy.

2.3.
Analysis. e goal of our analysis was to measure the association of initial isoniazid resistance with death during therapy in patients with tuberculous meningitis. We �rst calculated the unadjusted association of initial isoniazid resistance with the outcome of death during therapy. We used a multivariable logistic regression model to simultaneously adjust for multiple confounders.
ere are two general mathematical approaches for model-based multiple imputation: either based on the multivariate normal distribution [4] or fully conditional speci�cation [8]. In a simulation study, both approaches produced similar results and were less biased than complete exclusion [9]. We employed the multiple imputation procedure for Stata written by Royston (ice) [10], which uses the approach of fully conditional speci�cation (also known as the chained equation or regression switching approach). e imputation model included the primary exposure (isoniazid resistance), the outcome variable (death during antituberculosis therapy), and all potential confounding variables identi�ed based on univariable associations with the outcome (based on , see Table 1). Exclusion of the outcome variable in imputation models may lead to biased estimates of regression coefficients [11]. Because age was not normally distributed, a categorical age variable was used in analysis.
We chose to generate 5 imputed datasets based on simulation studies demonstrating little gain in statistical power for higher numbers of imputations [3]. e 5 imputed datasets were combined for analysis according to Rubin's method [3], using the mim procedure written by Royston et al. [12], generating single estimates for the parameters of interest. ese parameter estimates had associated error that re�ected the degree of missingness in the original dataset.
We evaluated variables in the logistic regression model based on their ability to change the observed association between initial isoniazid resistance and death during therapy. A confounder was de�ned as a variable that changed the association of interest by greater than 15% when included in the model. e �nal multivariable logistic regression model included the following terms: isoniazid resistance, age, race/ethnicity, gender, and HIV status. Age and race/ethnicity were confounders of the association between isoniazid resistance and death. HIV status and gender were not confounders but remained in the model based on our a priori clinical reasoning. Of the four covariates that remained in the �nal logistic regression model, only HIV status had missing observations that had been imputed in the datasets. In the analyses presented here, we compare the inference obtained for the association between isoniazid resistance and death using multiple imputation with inferences obtained based on alternate approaches: (1) case exclusion and (2) varying the number of imputed datasets between 2 and 10.

Multiple Imputation Compared with Case Exclusion.
Aer excluding 47% of patients with missing HIV status, we repeated the regression model (including isoniazid resistance, age, race/ethnicity, gender, and HIV status) with the subset of subjects with HIV status reported as either "positive" or "negative" (case exclusion). We compared estimates of association for initial isoniazid resistance and other variables obtained aer case exclusion with the estimates obtained aer multiple imputation. We obtained notably different estimates for the association of the older age categories with death during therapy, as well as the association between initial isoniazid resistance and death during therapy ( Table 2). Using case-exclusion approach, the association between isoniazid resistance and death was 1.53 (0.83, 2.83), while the association yielded by the multiple imputation approach was 2.07 (1.30, 3.29).
To explore why the case exclusion analysis moved the estimate of the OR for isoniazid resistance and death during therapy towards the null hypothesis, we examined the age distributions of HIV positive subjects, HIV negative subjects, and subjects with missing HIV status (Figure 1). Missing HIV status was much more common in patients at the extremes of age. Exclusion of patients with missing HIV status preferentially excluded patients in the youngest and oldest age categories. Previously, we found that advancing age was a signi�cant confounder of the association between initial isoniazid resistance and death during therapy. Older patients were more likely to die during therapy but less likely to be infected with an isoniazid resistant strain. As a result, the OR for initial isoniazid resistance and death was 1.61 (95% CI 1.08, 2.41) without adjusting for age and was 1.81 (95% CI 1.19, 2.75) aer adjusting for age. erefore, differential exclusion of patients at the extremes of age likely decreased the ability of the multivariable model to adjust for the confounding effect of age, masking the association between isoniazid resistance and death during therapy.

Varying the Number of Imputations between 2 and 10.
Next, we examined whether the number of imputations in�uenced the observed association between initial isoniazid resistance and death during therapy. In the original analysis, we chose to use 5 imputed datasets based on simulation studies showing little gain of efficiency with additional rounds of imputation [3]. However, the optimal number of imputations may depend on properties of the particular dataset, as well as speci�c properties of the association of interest.
To further evaluate the optimal number of imputations to use in the study of isoniazid resistance and death, we repeated the analysis aer varying the number of imputations from 2 to 10. We examined the error associated with the coefficient for initial isoniazid resistance in the logistic regression model, as well as the error associated with the coefficient for HIV infection in the same adjusted model, separating the total error into its within-and between-imputation components. e results of this analysis are shown in Table 3. e error associated with the isoniazid-resistant coefficient did not require more than 5 imputed datasets to reach a plateau. As a result, the use of more than 5 imputations did not lead to an increase in precision for the estimate of the association of initial isoniazid resistance with death during therapy. In contrast, the error associated with the HIV coefficient continued to decrease with more than 5 imputed datasets, with more narrow con�dence intervals as the number of imputed datasets approached 10 ( Figure 2).

Discussion
Multiple imputation offers important advantages over other methods for handling missing data in epidemiologic studies, in particular with regards to its �exibility and wide applicability [1]. However, widespread use is limited by its theoretical complexity and lack of familiarity among audiences outside of statistical disciplines [13]. We have presented in detail our experiences in order to facilitate continued use of these techniques in the future.
We designed the imputation model in the context of a speci�c research question, with particular concern for the potential impact of missing HIV status. e matching mechanism to obtain HIV status of tuberculosis cases reported from California violates the "missing-at-random" status, since missingness will partly depend on unobserved covariates (the value of the HIV test itself). However, we previously demonstrated that imputing HIV status for all California cases, rather than relying on the HIV status from matching with the AI�S registry, did not in�uence the estimated association between isoniazid resistance and death [6]. Within-imputation error 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 0.14 Because of the high proportion of tuberculosis cases with missing HIV status, we were concerned about the potential for bias with less sophisticated methods, such as the missing indicator method. Although frequently employed as a missing data approach, missing data categories are less optimal than imputation since they may be difficult to interpret as meaningful parameters, may produce large and unstable estimates, and in some situations, may introduce bias [14]. In our analysis, the use of a missing data category for HIV status did not signi�cantly change the relationship between isoniazid resistance and death during therapy that was obtained by the multiple imputation approach (data not shown).
In contrast, we observed signi�cant bias when we compared the results obtained from multiple imputation with the case-exclusion method that dropped subjects without known positive or negative HIV status from the analysis. Although HIV itself was not a confounder of the relationship between initial isoniazid resistance and death during therapy (unlike age and race/ethnicity), missing HIV status was predominantly seen in subjects at the extremes of age. Because age was a signi�cant confounder of this relationship, exclusion of patients with missing HIV status led to an estimate of the OR for initial isoniazid resistance and death that was biased towards the null hypothesis.
e missing data problems that we addressed were primarily a result of our a priori decision to allow HIV status to remain in the model, regardless of its confounding effect on the relationship between isoniazid resistance and death. ese challenges illustrate the importance of considering the effects of missing data when making a priori decisions for multivariable model selection. While we allowed HIV status to remain in the �nal multivariable model for clinical reasons, HIV status did not signi�cantly in�uence the observed relationship between initial isoniazid resistance and death during therapy. e other variables in the �nal model were completely known and did not differ from one imputed data set to the next. For this reason, the use of increasing numbers of imputations did not lead to a decrease in the total error associated with the regression coefficient for isoniazid resistance, since most of the total error for this coefficient was within-imputation error, rather than between-imputation error. In contrast, the error associated with the coefficient for HIV status continued to decrease beyond 5 imputed datasets. If HIV status had been a signi�cant confounder of the association between initial isoniazid resistance and death during therapy, we would have expected to see both errors continue to fall as we used increasing numbers of imputed datasets.
In summary, we found the approach of multiple imputation to be a useful method for dealing with the missing data problem in the NTSS of the USA, overcoming the bias inherent to case exclusion. e size of the dataset and completeness of reporting for variables such as age, gender, and race likely enhanced the robustness of our �ndings in these sensitivity analyses. While the use of more than 5 rounds of imputation did not lead to a more precise estimate for the association of initial isoniazid resistance and death during antituberculosis therapy, our �ndings would likely have been different if HIV status had exerted a stronger confounding effect on this relationship. While our analysis supports the use of multiple imputation methods in epidemiologic analysis, it also demonstrates that close attention should be paid to the potential impact of missing covariates at each step of the analysis.