Bayesian Analysis for Dynamic Generalized Linear Latent Model with Application to Tree Survival Rate

Logistic regression model is the most popular regression technique, available for modeling categorical data especially for dichotomous variables. Classic logistic regression model is typically used to interpret relationship between response variables and explanatory variables. However, in real applications, most data sets are collected in follow-up, which leads to the temporal correlation among the data. In order to characterize the different variables correlations, a new method about the latent variables is introduced in this study. At the same time, the latent variables about AR (1) model are used to depict time dependence. In the framework of Bayesian analysis, parameters estimates and statistical inferences are carried out via Gibbs sampler with MetropolisHastings (MH) algorithm. Model comparison, based on the Bayes factor, and forecasting/smoothing of the survival rate of the tree are established. A simulation study is conducted to assess the performance of the proposed method and a pika data set is analyzed to illustrate the real application. Since Bayes factor approaches vary significantly, efficiency tests have been performed in order to decide which solution provides a better tool for the analysis of real relational data sets.


Introduction
Logistic regression model, a widely appreciated model for analyzing categorical response data especially for binary/ dichotomous data in social science research, marketing, biomedical studies, and ecology, has been received a substantive attention [1,2].The primary concern of logistic regression analysis is to build the interrelationships between explanatory variables and responses and explain the variability of the response probability in terms of the variability in the observed covariates.The common analysis for the logistic regression model is based on the independent subjects, and statistical inference is carried out via the maximum likelihood approach (see [3][4][5]).
However, in practice, many studies need to account for correlations among the items within time as well as correlations among the same items between times.Independent assumptions cannot capture such correlations.In order to interpret the correlations among the responses, one popular choice is to introduce the latent variables.Even more, taking into account the temporal correlations, a dynamic system is established which leads to the latent variable models [6][7][8].However, this dynamic system is associated with dynamic factor.So, a Bayesian approach is proposed in this paper.With an alternative approach about the ML, the development is based on the generalized logistic model in the well-known linear dynamic factor analysis model.To cope with this complicated model, Gibbs sampler [9] with MH algorithm [10,11] is implemented to generate a sequence of observations from the appropriate joint posterior distribution.The joint Bayes estimates of the parameters and latent variables scores associated with standard error estimates can be obtained directly based on the simulated observations.Beyond the estimation problem, another main objective of this paper is to introduce the Bayes factor (see [12]) to test various hypotheses about the posited model.In general, the computation of the Bayes factor involving multiple integrals is rather intractable and, more often, are evaluated via various computational methods (see [13][14][15][16][17][18][19], and among others).Among easyto-construct, we adopted the well-known path sampling technique [20] for computing Bayes factor.The nice feature of path sampling lies in its simplicity in implementation and effectiveness over such an important sampling (e.g., [21]) and bridge sampling [22].
A basic intent of modeling dynamic model is to represent series of observations generated in time and to predict the future evolution of the series.The well-known Kalman recursion [23] is commonly employed for forecasting and provides the optimal forecasts in the case where the state transitions are linear and dynamic system is Gaussian.However, if Gaussian assumptions, in particular, of the distributions of the measurement errors are violated, Kalman recursion only yields the best linear predictor for linear dynamic system which is substantially different from the optimal forecasts.Many authors have suggested various alternatives to the Kalman filter; see, among others, Kitagawa [24], Meinhold and Singpurwalla [25], Carlin et al. [26], and Smith and West [27].Motivated by the key idea in Carlin et al. [26], in this paper, the missing data is treated as the future observations and augmented them with parameters of latent variables as observed data.Therefore, it can use Bayesian analysis approach to deal them.Simulated observations generated by Gibbs sampler from the joint posterior distribution can be directly applied to forecasts and/or smoothing.
This paper is organized as follows.In Section 2, we describe the problem under consideration.Section 3 gives the estimation procedure and explores the Bayesian approach in detail.For model selection, Bayes factor and Bayesian forecasting are introduced in Section 4. Simulation study and a real example are given in Section 5. Some concluding remarks are presented in Section 6.

Materials and Methods about Model
In this section, the problem and model structure are presented as follows, which will be considered throughout the paper.For  = 1, . . ., ,  = 1, . . ., , let   denote the number of living tree within the th plot in the th month.We assume that   follows the binomial distribution with the unknown survival rate where   is the total number of tree being investigated at time  in theth plot.Further, the probability   is in relation to the covariates via the following link function: in which "log " denotes the log  function, x  = (1,  1 , . . .,   )  is a ( + 1) vector of fixed covariates,  is a ( + 1) regression parametric vector,   is the common factor, and   = ( 1 , . . .,   )  is the unique error with distribution   (0, Ψ  ) where Ψ  = diag{ 1 , . . .,   } is the diagonal matrix.The factor loading vector is Λ = ( 1 , . . .,   )  , which measures the effect of   on the functions of the response probability.Moreover, it assumes that   satisfies the following one-order autoregressive model where || < 1 and   is the iid random errors distributed according to the normal distribution (0,  2 ) ( 2 > 0).For the initial variable  0 , we assume that  0 ∼ ( 0 ,  2 0 ) with mean  0 and variance  2 0 .  and   are mutually independent and both are independent of   .In this paper, we treat  0 and  2 0 to be known and fix them at some preassigned values.
Let   = x    +     +   , (  ) = log(1 + exp(  )).Then, (1), (2), and (3) can be reformulated as follows: The main aim of introducing the latent variables is to characterize the temporal correlation between the consecutive observations and explain the interrelationships among the response probability.To give more details, note that for  >   , and ,  = 1, . . ., , the correlation between     and   is given by in which A path diagram of latent variables and manifest variables at time −1 and  is given in Figure 1.Following the conventions of path diagrams, the ellipses represent the latent variables, the rectangles denote the observed measurements, and the arrows represent the direct effect.For ease of exposition, let y  = ( 1 , . . .,   )  ,   = ( 1 , . . .,   )  , and X  = (x 1 , . . ., x  )  , which denote the vector of random observations, the vector of latent variables, and the vector of covariates cross  items at time , respectively.Let Y = {y  :  = 1, . . ., } be the collection of the observed data, let Ω = {  :  = 0, . . ., } be the collection of the factor variables, let Ξ = {  :  = 1, . . ., } be the collection of the latent variables, and let  be the vector of unknown parameters in , Λ, Ψ  , , and  2 involved in the model and then the joint distribution of (Y, Ξ, Ω) conditioning on  is given by In the framework of Bayesian analysis, the following priors for unknown parameter vector  are used to complete Bayesian specifications: () = ()(Λ, Ψ  )()( 2 ) and in which  0 , Σ 0 , Λ 0 , H 0 ,  0 ,  0 ,  0 , and  0 are the known superparameters.

Model Selection and Bayesian Forecasting
Model selection is an important issue for Bayesian generalized logistic analysis since it is of interest to determine whether the latent variables are involved or whether the significant effects exist among the competing covariates.
Consider the problem of comparing competing models  1 and  0 which are nested or nonnested.Let (Y |  1 ) and (Y |  0 ) denote the marginal density of the data Y under  1 and  0 , respectively.A popular choice for selecting models (e.g., [16,18]) is achieved via the following Bayes factor: given the values of  under   , respectively.Kass and Rafter [18] provide the following categories to furnish appropriate guidelines in Table 1.
Computing BF 10 or log BF 10 involves the highdimensional integrations of likelihood with respect to (Ξ, Ω, ) which is rather difficult.Various techniques are explored to address this problem (see [7]).Among easyto-construct, we consider the path sampling method [20].The core of path sampling is to construct a series of linked models which link up the competing models.

Simulation Study.
A simulation study is presented to assess the performance of our proposed procedure.The data set is simulated from the model defined by ( 1   number of the Gibbs sampler iterations in getting convergence.For MH algorithm, the acceptance rate is also adjusted in posterior sampling to give about 40% [31].Hence, in the following analysis, experiment always takes such values for the tuned parameters.Figure 2 gives the estimated potential scaled reduction (EPSR) values [30] of the parameters against the number of the iterations in Gibbs sampler under prior (I).
In Figure 2, it can be seen clearly that all these EPSR values are less than 1.2 in about 2000 iterations.To be conservative, we collect 5000 observations after 5000 "burnin" in computing the absolute bias (BIAS) and the root mean squares (RMS) of the estimates and the true values in 100 replications.Results obtained from the Gibbs sampler are summarized in Table 2, where SD denotes the standard deviation of the estimates.Based on these results, it can be seen that our proposal is rather accurate and effective.It also shows that there are no significant differences of estimates between prior (I) and prior (II).Therefore, the proposal method is more robust against the choices of values of hyperparameters.An interesting phenomenon is that the estimates of variance parameters in unique errors under prior (I) are more accurate than those under prior (II).This reflects that it provides more information in estimation.
For model comparison, we consider the following competing models to illustrate the performance of the Bayes factor.
1 is the aforementioned model given by ( 4).Consider in which  = 1 and  = 0 correspond to  1 and  0 , respectively.For computation, we choose  = 20 in [0, 1] and draw 5000 observations after 5000 burn-ins deleted from the posterior distribution for each   to calculate the log BF 10 .Figure 3 gives the histogram of the values of log BF 10 across 100 replications.The estimate of log BF 10 is 11.2222 with standard deviation 6.6242.Based on the guidelines given by Kass and Raftery [18],  1 is selected with positive evidence.
To investigate the smoothing and forecasting of our proposal, we regenerate a data set with sample size 100 and treat the last 5 × 4 observations as unobserved.The Gibbs sampler with MH algorithm is used to compute the first ninety-five smoothing values and the last five forecasting values of the latent variables .Figure 4 gives the true values and smoothing/forecasting values based on 5000 simulated observations after deleting first 5000 observations.It can be seen that it fits well between the estimated values and true values.

Pika's
Data.An application is considered in relation to a study of Guo et al., 2009 [32], to illustrate the developed methodology.The data set is collected from Qinshui Forest farm of Jincheng Coal Industry Group sited on the northwest of Qinshui County of Jincheng City in Shanxi Province, 112 ∘ 19  40  -112 ∘ 19  52 east longitude and 35 ∘ 48  43  -35 ∘ 48  57 north latitude.Details about experiment design can be found in Guo et al., 2009 [32].The original data set is constituted of the number of the living tree and pika within four plots in five months.The primary concern is to assess the relationship between the survival rate of young plantation and the pika population and model the optimal pika population per hectare.The data set is reanalyzed by Xia et al. [33] to investigate the correlations among the observed responses.In these studies the latent variables model is established through AR(1) model (3) coupled with model ( 1) and (2), which interprets the correlations among the outcomes resulting from the time dependence.
Firstly, the following hypothesis is considered:  0 :  = 0.If  0 is true, the model becomes the common generalized logistic model considered by Xia et al. [33].Path sampling is used to compute log BF 10 .The linked model is defined similarly as (20). = 20 grids were chosen in [0, 1], and, for each   ( = 1, . . ., 20), 10000 random observations were drawn from the each posterior distributions and the first 5000 observations were deleted in view of the burn-in phase.The logarithm of Bayes factor is 3.0143.It shows that there exists significant evidence against  0 ; hence, our proposal is appropriate.
To assess the effect of forecasts of the proposal methodology, the original data set is divided into two parts: one is formed by the previous sixteen observations in four months within four plots and the other is made up of four observations in the following fifth month.The proposal algorithm computes the predictive values of the proportion of the living tree in the fifth month based on the previous 20 observations.Figure 5 shows the forecasts and actual values of the proportions of the living tree within four plots.It can be seen that fitted values match the actual values closely.

Concluding Remarks
Logistic regression model is the most popular model used to interpret the relationship between the responses and covariates and to assess the effect of covariates on the responsible probability.Bayesian analysis of logistic model has received a lot of attention recently; see Johnson and Albert [34] and Congdon [35].For example, Choi et al. [36] discussed a Bayesian statistical inference about missing information on the basis of the logistic regression model.They also used Gibbs sampler to get the estimates and the posterior analysis of the posited model.Since their model is not an AR(1) model, the underlying development is less complicated.Recently, McCormic et al. [37] proposed an online binary classification procedure based on the dynamic logistic regression and dynamic model averaging.However, their developments are restricted within a single response   , which do not need to explore the relationships among the observed variables.Though Xia et al. [33] establish the Bayesian analysis for generalized logistic model, the essential difference lies in that their latent variables are identically and independently distributed according to normal distribution and, hence, can not capture the temporal correlation.
In the present paper, dynamic factor model is established to characterize the temporal correlation among responses.One contribution in this paper is the development of a feasible estimation procedure for obtaining the Bayesian estimates of the parameters and latent factor scores.The hybrid Gibbs sampler with MH algorithm was implemented to provide a convenient mechanism for implementing our method.Another contribution is the development of test statistics on the Bayes factor for testing hypothesis of the model.Computation of the Bayes factor in the current complex model is on the basis of path sampling.Further, Bayesian forecasting procedure which takes advantages of the full conditional distribution is presented.Results from empirical studies indicate that these procedures can be usefully applied to real studies.

Figure 1 :
Figure 1: Path diagram about interaction with the latent variables and manifest variables.

Figure 2 :
Figure 2: EPSR values change with numbers of iterations under prior (I).

( 19 )
Note that  0 is nested to  1 in the sense that all the parameters in model  0 are included in model  1 . 0 indicates that no temporal correlation exists among the random effects.The linked model between  1 and  0 is taken as  :  (  |   ) =      exp {    −    (  )} ,   = x    +     ,   =  −1 +   ,

10 Figure 3 :
Figure 3: Histogram of log BF 10 of posterior distribution for each   .

Figure 4 :
Figure 4: Comparison with true values (dashed line) and forecasting values (pointed line).

Figure 5 :
Figure 5: Comparison with the predictions for survival rate (solid line) and the actual values (dashed line).

Table 2 :
Statistical results of Gibbs under prior (I) and prior (II) with relevant parameters.