Identification and Estimation of Graphical Models with Nonignorable Nonresponse

We study the identification and estimation of graphical models with nonignorable nonresponse. An observable variable correlated to nonresponse is added to identify themean of response for the unidentifiable model. An approach to estimating themarginal mean of response is proposed, based on simulation imputation methods which are introduced for a variety of models including linear, generalized linear, and monotone nonlinear models. )e proposed mean estimators are �� N √ -consistent, where N is the sample size. Finite sample simulations confirm the effectiveness of the proposedmethod. Sensitivity analysis for the untestable assumption on our augmented model is also conducted. A real data example is employed to illustrate the use of the proposed methodology.


Introduction
e problem of missing data in practice has attracted much attention for decades. Rubin [1] gave the weakest general conditions under which the nonresponse is ignorable; that is, ignoring the process that causes missing can still result in correct inference when both missing and observing are at random. Alternatively, the nonresponse is said to be nonignorable (Little and Rubin [2]) if it depends on the value of the possibly unobserved outcome. Groves, Presser, and Dipko [3] illustrated that if the missing mechanism is not ignorable, then a complete-case analysis which excludes missing data could result in highly biased estimates. For more discussions about nonresponse bias, one can refer to the work of Little and Rubin [2], Ibrahim and Lipsitz [4], and Goves [5], among others.
To address the problems of nonignorable nonresponse, weighting adjustments, which adjust the estimates by rescaling each unit's sample weight proportionally to the inverse of its response probability, are frequently used. e methods used for adjusting include poststratification by Holt and Smith [6], Calibration by Kott [7], and raking-ratio estimation by Deville, Särndal, and Sautory [8]. Weighting adjustments are modelbased approaches, that is, the population values are treated as realizations of random variables that are distributed according to a superpopulation [9], and auxiliary information is incorporated into various models to describe nonrespondent behavior with respect to the variables of target. For instance, Greenlees, Reece, and Zieschang [10] conducted linear regression to analyze the unobserved income data assuming that nonresponse income depends on the unobserved value. Fay [11] and Baker and Laird [12] provided a family of estimable hierarchical log-linear models for the joint distribution of the data and the response indicator.
As pointed out by Little [13], the fully parametric approach is sensitive to failure of the assumed parametric model. Based on the exponential tilting model, Kim and Yu [14] proposed a semiparametric estimation method of mean functionals with nonignorable missing data. Riddles, Kim and Im [15] presented an approach of maximum likelihood estimation that uses parametric model assumptions about the variable of interest among the respondents only. Zhao, Tang, Qu, and Jiang [16] considered the parametric propensity model and studied semiparametric estimating equations inference by the nonparametric imputation method. Guo, Ma, and Wang [17] generalized the propensity model to semiparametric form and investigated the estimation of the parametric copula model. ere are also some works using graphic models to depict nonresponse mechanisms in the literature on the modeling approach. For examples, Ma, Geng, and Hu [18] used graphic models with temporal structure to describe nonresponse mechanisms for binary income in a longitudinal study. Wang, Chen, Geng, and Zhou [19] extended the work of Ma, Geng, and Hu [18] to derive the maximum likelihood estimator for the parameter of the binomial proportion and its associated variance. More existing works along this line can be found in the work of Little [13], Fay [11], and Forster and Smith [20].
In this paper, we present a new method to tackle the problem of nonignore nonresponse. A completely observable variable correlated to Y is added to identify the mean of the response Y. Adjustments are applied by a fixing response approach which divides the population into two strata: one consists of respondents and the other of nonrespondents [21]. en, an approach to estimating the marginal mean of response Y is proposed based on simulation imputation methods under linear, generalized linear, and monotone nonlinear models, respectively. It is shown that the proposed estimators are �� N √ -consistent, where N is the sample size. e effectiveness of the method is demonstrated via simulations. Application to real data of the method is also considered. Simulations show that our new method is successful in modeling the mean of response Y. Since the assumptions on the models are untestable, we propose to assess sensitivity to the assumption of our methods. e outline of this paper is as follows. In Section 2, we introduce the graphical models. Section 3 develops the simulation imputation methods for linear, generalized linear, and monotone nonlinear models. In Section 4, the generalized linear model is extended to cope with other covariates. Section 5 introduces empirical standard errors to monitor the accuracy of bootstrap imputations. A simulation study is conducted in Section 6. Sensitivity analysis is given in Section 7. Section 8 illustrates an application to the mental health dataset. Proofs are given in Appendix.

Graphical Models
e response variable is denoted by Y. We assume that Y i N i�1 are independent and identically distributed response variables for N subjects in the study and R i N i�1 are the respective status indexes for the responses, where R i � 1 or 0, which depends on Y i response or nonresponse, respectively. Identifying the marginal mean E(Y) of Y is an important problem in itself, and the treatment effect can be obtained once the marginal mean is available. However, if one uses only observed Y i 's with R i � 1 to estimate E(Y), it may result in a large biased estimator.

Example 1.
We consider the mixture model with 1), and R i is a binary variable independent of Y i1 and Y i2 and satisfies that P(R i � 1) � 1 − P(R i � 0) � 0.5, for i � 1, . . ., n.
en, the marginal mean of Y is E(Y) � − 2, and the mean of observed Y is 0. us, if one ignores those nonresponses with R i � 0, then the bias is 2.
To deal with this problem, we introduce the following three graphical models: e graphical models have the following statistical meanings: holds. e assumption is proposed based on the information from specific experts. We will perform sensitivity analysis on the assumption for our results): e marginal mean of the response Y is identifiable for model (a), but not for model (b), since one observes only those values of response variable with R i � 1 (the response group). To solve this problem, we now introduce a surrogate Z of Y, which is completely observable in model (c), to identify the marginal mean of Y. We call model (c) the augmented model of model (b), which is identifiable for the marginal mean under certain conditions. e motivation for adding such a completely observed variable Z is from a study conducted by us for the income of the inhabitants in an area of Beijing. Since the income variable Y is relevant to private matters, it is subject to informative missing. Neglecting of the missing values may result in very biased estimation for the income levels in the area. is motivates us to add a completely observed variable Z, the size of houses that every inhabitant possesses. Note that, conditional on the income variable Y, the missing mechanism can be regarded as uncorrelated to the housing variable Z so that the model (c) holds in this example.
In the following, we consider the identification of model (c) and propose a method to estimate the marginal mean of Y i .

Simulation Imputation
Since the nonresponse subjects are nonignorable, biased estimation may be resulted if only the subjects in the response group with R i � 1 are considered. It is necessary to use the information from the nonresponse so that a consistent estimator of the marginal mean of Y can be obtained. To this end, we introduce an approach to achieving this objective. Our idea is to impute the nonresponse by simulation.
If β � 0, then model (c) is equivalent to the following model (d).
Since model (c) includes model (d), using a standard argument in hypothesis testing for general linear models, we can test if H 0 : β � 0 holds via constructing testing statistic, T (Z, Y), say, based on the estimator of β from model (c). If (d) holds, then E(Y i ) is unidentifiable, and one must find a variable Z correlated with Y to solve this problem.
If β � 0, then it follows from (1) that -consistent estimators of α and β (for example, the simple least squares estimators). We denote them byα andβ, respectively. Let the estimator of E(Z i ) be E(Z). en, the marginal mean of Y can be estimated as e abovementioned estimation method is useful for the linear model, but it is inflexible for other models, for example, the generalized linear model and nonlinear model considered later. We here introduce another flexible estimating method, which is based on bootstrap. It uses the fact To estimate E(Y), we need to estimate E(Y |R � 1) and E(Y |R � 0), where the former can be estimated by the average of Y i in the response group and the latter can be estimated by the average of imputations of Y i for the nonresponse group. is method is referred to as "simulation imputation," which is detailed in the following: (i) Fitting model (1) with a data subset: the subjects are partitioned into two groups, one is of R i � 1 (the response group), the other R i � 0 (the nonresponse group). Without loss of generality, assume R i � 1 for i � 1, . . ., N 1 , and 0 others. Based on those subjects with responses observed (i.e., i � 1, . . ., N 1 ), we regress{Z i } on {Y i } using model (1) and obtain the estimators of α and β, denoted by α and β, respectively.
(ii) Bootstrap residuals ( e bootstrap based on residuals was proposed by Efron [22]. Because ε i 's are of mean zero, the empirical distribution function of the residuals is centered at 0): the residuals are denoted by j . (iv) Estimator of the marginal mean of Y: eoretical justification of the abovementioned "simulation imputation" method can be built by using standard bootstrap theory. Especially, we have the following consistency result for the estimator E(Y), which is proved in Appendix.

Logistic Regression Models.
As an alternative to model (1), the following generalized linear model is used to model the relation between binary Z i and Y i : where ϕ > 0 is an unknown dispersion parameter, In particular, when Z i is binary with values 1 and 0, E(Z i | Y i ) � P (Z i � 1|Y i ), and one reasonable choice for g(·) is the logit link with g(p) � log{p/(1 − p)}. From model (7), we obtain that

Journal of Mathematics 3
if β is not zero. Note that by the conditional uncorrelation in . erefore, model (7) can be estimated by those data in the response group. Let (α, β) be such estimators of (α, β) for model (7). en, µ(η(y)) ≡ E(Z|Y � y) is nonparametrically estimable via regressing Z i on Y i for those observed Ys with R � 1. We denote the resulting nonparametric estimator by μ(η(y)) � E(Z i |Y i � y), for example, using the local linear smoothing in the work of Fan and Gijbels [25] and Jiang [26]. en, by (6), we estimate E(Y) by However, the estimator E(Y) may suffer from substantial loss of efficiency since it uses only a portion of sample. Furthermore, in finite sample settings, the bias of nonparametric estimator E(Z i |Y i ) may yield a highly biased estimate of the marginal mean of Y, since it can be magnified by the nonlinear link function. However, this approach can be used to compare with the following estimation method and justify the appropriateness of the parametric form of µ(·) in (7) for modeling real data.
Note that E(R|Y, Z) � E(R|Y); it follows that we can model the missing mechanism through We here extend the previous "simulation imputation" approach to model (7), from which one can develop imputation of nonresponse Y i from equation (9). Let (i) Fitting model (7) with a data subset: based on those observations from the response group, we regress {Z i } on {Y i } using model (7) and obtain the estimators (α, β) of (α, β), the fitted values μ(η i ) � g − 1 (α + βY i ) with η i � α + βY i , and Pearson's residuals.
(ii) Bootstrap residuals: we resample M samples, each with size N2 � N -N1, from the empirical distribution function of centered Pearson's residuals (where ε � is the average of ε i 's), and denote the M samples byε * (m) )/β, for j � N 1 + 1, . . . , N 1 + N 2 , and we find η j such that ε * (m) j as prediction of Y j . (iv) Estimator of the marginal mean of Y: e abovementioned bootstrap Pearson's residuals procedure in (i) and (ii) is standard in the generalized linear models (see, for example, page 341 of the work of Shao and Tu [27]). e "simulation imputation" method uses bootstrap data from Pearson's residuals to yield predicted values for the nonresponses. (7) is correct and β ≠ 0, then the result in eorem 1 continues to hold.

Monotone Nonlinear Models.
We consider the following monotone nonlinear relation between Z i and Y i : where f (·) is a known monotone nonlinear function and E(ε i |Y i ) � 0. By reversing the relation in (16) . Using a simulation imputation method similar to that in Section 3.1, we can obtain the estimator of E(Y i ). is model requires one to find a surrogate Z for Y such that E(Z| Y) is monotone.
Up to now, one may wonder if the models in (1), (7), and (16) are appropriate in practice. is involves in model selection and diagnostic analysis. Traditional model diagnostic tools are useful for this problem. e problem can also be addressed by nonparametrically modeling those responses with R i � 1 in the exploration analysis stage if a moderate sample is available, so that one can test if some parametric model holds, based on a nonparametric testing statistics such as the generalized likelihood ratio test in the work of Fan, Zhang, and Zhang [28], Fan and Jiang [29], and Fan and Jiang [30].

Extension
Previous results cannot cope well with the cases in the presence of other covariates. We here extend them in the framework of the generalized linear models. e extension to other models can be similarly made. In parallel with model (7), we consider the following model with completely observed covariates X of dimension d: Journal of Mathematics where η i � α+βY i + c T X i is a linear predictor, g is a canonical link function, and µ(·) is a known differentiable function with derivative μ ′ (·) > 0. As in Section 3.2, the same "simulation imputation" approach can be used to estimate the mean of Y, but with α in Section 3.2 replaced by α + c T X i in the present setting.
is approach will be readdressed in our real example.

Empirical Standard Errors
e empirical standard error (ESE) is used to monitor the accuracy of convergence. For the simulation imputation methods mentioned above, the sampling number M is generally required to be large enough to ensure the prediction for the nonresponses with accuracy in a reasonable range. For a specific application, what should M be? Naturally, a good choice of M should yield an accurate predicted value of the nonresponse. is motivates us to use the ESEs of the simulation imputations to assess the accuracy of the predicted value of the nonresponse. In our real data analysis, M is taken as 1000, and the ESEs of the imputations Y * j are reported.

Simulations
To investigate the performance of our procedure, we compare our estimators with the observed average of the responses in the following three models. e number of simulations is 600 and the number of bootstrapping samples for each simulation is taken as M � 1000. e deviation, E(Y) − E(Y) , of each estimator from the true mean in the 600 simulations is computed and displayed via box plots, which depict the distributions of the deviations. e more the deviation concentrates at 0, the better the corresponding estimator is. N(0, 0.5 2 ). Sample size is N � 100, where α � 1 and β � 0.5 us, the true mean E(Y) equals 0.5.

Example 2. We consider a linear model with
100, which is mixed normal. en, E(Y) � − 2. We generate binary Z i with outcomes 0 and 1 from the logistic model.

Example 4.
Consider the monotone nonlinear model with f (y) � a + y/(b + y), where a � 0.5 and b � 1. For i � 1, 100, let e boxplots of the deviations among 600 simulations for Examples 1-3 are reported in Figures 1(a)-1(c), respectively. e abovementioned three examples show that the proposed estimator is consistent, but the observed average of the response (with R i � 1) is very biased because it ignores the information from the nonresponse subjects.

Sensitivity Analysis
is assumption is untestable and made with knowledge of the specific experts. Naturally, one may ask if our method exhibits robustness to some extent against the assumption. is motivates us to assess the sensitivity of our estimators to the assumption.

Example 5.
To assess the sensitivity of our estimators to the assumption on uncorrelation of Z i with R i conditional on Y i , we set samples size N � 100, P(  Figure 1(d). When S increases, the conditional correlation gets stronger. Our estimator seems to perform reasonably well against the appropriate departure from the conditional uncorrelation, but the sample average of observed response does not.

Real Data Analysis
We now use the mental health dataset to demonstrate how the proposed procedure works in a typical application.
is dataset is from a study of mental health of children in Connecticut. It was previously analyzed by Ibrahim, Lipsitz, and Horton (2001). ere are totally 2486 subjects in study and six related variables: Father: parental status of the household (father figure present � 0; no father figure present � 1).
Of the six variables, we choose the first five variables for analysis, since the 6th variable pctage is uniquely determinated by the 5th variable counts. e outcome of main interest, "t rept ," is missing for 1061(42.7%) subjects, but the variable "p rept ," which relates to "t rept " and can serve as a surrogate for it, is observed for all subjects. Note that, as discussed in the abovementioned paper, once conditional on the surrogate "p rept ," the missing mechanism can be regarded as independent of the outcome "t rept ," although unconditionally the missing mechanism "R" (equals 0 for the missing and 1 for the observed) seems to depend on the Journal of Mathematics outcome. en, the model (c) seems reasonable to depict the relation among t rept , p rept , and R.
Our interest is to estimate the marginal mean of the response variable "t rept ." We use model (5) and the estimation approach in Section 3.2 and set Z � p rept , Y � t rept , and the number of bootstrap sampling M � 1000. e estimated mean, variance of the response t rept , and the average of the ESEs of the imputations Y j * for the nonresponses are reported in Table 1. By incorporating the information on the first two covariates, father and health, we get the estimated mean and variance of the response from model (10), which can be found in third column of Table 1.
However, if only using the observations of t rept , its marginal mean is computed to be 0.1832 with variance 0.1497. Obviously, our estimator not only rectifies the value of the marginal mean, but is more efficient than the simple average of the observed values of Y. e small values of the ESEs reflect high accuracy of the simulation imputations.

Proofs of Theorems
Proof. of eorem 1. e empirical distribution function for i � 1, · · ·, N 1 is denoted by F N 1 (ε). Since ε j * (m) 's are drawn from F N 1 (ε), conditional on the original sample points with R i � 1, ε j * (m) ∼ F N1 (ε) for j � N 1 + 1, . . ., N 1 + N 2 . Hence, Note that Using the standard argument for bootstrap consistency, it can be shown that By calculating the mean and variance, it is easy to see that It follows that is, combined with (4), yields �� N √ − consistency of E(Y).
Proof. of eorem 2. e result follows from a similar argument as eorem 1.
Data Availability e data are public and available in the reference. ey are also available upon request via emailing to the first author.

Conflicts of Interest
e authors declare no conflicts of interest.