Alternatives to Mixture Model Analysis of Correlated Binomial Data

While univariate instances of binomial data are readily handled with generalized linear models, cases of multivariate or repeated measure binomial data are complicated by the possibility of correlated responses. Likelihood-based estimation can be applied by using mixture distribution models, though this approach can present computational challenges. The logistic transformation can be used to bypass these concerns and allow for alternative estimating procedures. One popular alternative is the generalized estimating equation (cid:2) GEE (cid:3) method, though systematic errors can lead to infeasible correlation estimates or nonconvergence problems. Our approach is the coupling of quasileast squares (cid:2) QLSs (cid:3) method with a rarely used matrix factorization, which achieves a simpliﬁed estimation platform—as compared to the mixture model approach—and does not su ﬀ er from the convergence problems in GEE method. A noncontrived example is provided that shows the mechanical breakdown of GEE using several statistical software packages and highlights the usefulness of the QLS approach.


Introduction
Binomial data occur when observations on a given subject consist of a fixed series of Bernoulli trials, resulting in a proportional outcome.Maximum likelihood estimation is readily available in a generalized linear modeling framework when subjects consist of univariate measures i.e., one Bernoulli or binomial trial per subject .However, estimation becomes more complicated when several Bernoulli or binomial trials are observed for each subject.In this case subject responses could be multivariate consisting of several series of separately defined trials or repeated measure where the set of trials are defined similarly , and in both instances there is the possibility that the intrasubject responses are correlated.Here we use the term "subject" for convenience but it could be an item, store, location, plot in agriculture experiments, and so on.For example, real-life data situations where we encounter correlated proportions include 1 bankers interested in the proportion of customers making the ith type of transaction at the jth bank branch; 2 the proportion of CD deposits at the ith branch in the jth month of a year; 3 retail managers interested in the proportion of customers purchasing the ith item at the jth store; 4 marketers interested in the proportion of subjects viewing the ith advertisement type on the jth website;  1.In all of these examples the proportions within each row are correlated but the rows can be assumed to be independent.The within-row correlation, while complicating matters, must be accounted for in order to obtain proper variance estimates and inference for any regression parameters representing the associations between the vector of proportions and covariates.Thus, the problem is to estimate the parameters of interest within the ensemble of all parameters.In this context one could use a likelihood-based approach utilizing mixture-distribution models.
In the case of binomial data the mixture model would consist of both binomial and logit-normal components.However, parameter estimation in the mixture model could experience convergence problems due to the multitude of marginal means, regression, and correlation parameters.A simplified alternative approach would be to transform the variablespecific proportions for each subject in a way that would simplify the assumed probability distribution.The logit of the proportions would transform the outcome scale from 0, 1 to −∞, ∞ , which could make appropriate a multivariate normal-based methodology.One such procedure could be the generalized estimating equations GEEs proposed by Liang and Zeger 2 .Though a popular methodology for estimating regression parameters in cases of longitudinal or repeated measure data, this procedure suffers problems estimating correlations.As will be seen in subsequent sections, the GEE method can fail to converge even for cases of continuous data, which is the case if the logit transformation is used on binomial data.
Coull and Agresti 3 discussed random effects models for logit-transformed correlated binomial data.Here we suggest the method of quasileast squares QLSs , developed by Chaganty 4 and Chaganty and Shults 5 .While generally seen as an alternative method to solving the maximum likelihood score equation for correlation parameters in the case of Gaussian data Sabo and Chaganty 6 , the estimation of correlation in the QLS procedure can also be supplemented with a little-known matrix factorization that makes it distinct from the maximum likelihood method.In this sense the QLS procedure is applicable for estimating correlated continuous data, which is appropriate for logit-transformed binomial proportions.
The rest of this paper is outlined as follows.The likelihood-based mixed-model approach is discussed in Section 2, while the logit transformation of binomial data and the GEE methodology are discussed in Section 3. We briefly outline the QLS estimating procedure in Section 4, while also highlighting the matrix factorization for use in estimating correlation.A noncontrived example is given in Section 5 that shows the usefulness of the QLS approach, as well as the convergence problems experienced by several statistical software packages in implementing the GEE method.A brief conclusion follows in Section 6.

Maximum Likelihood Estimation Using Mixture Distribution Models
For i 1, . . ., m subjects, let y i y i1 , . . ., y it be a vector of t possibly dependent binomial random variables, where y ij is the number of successes out of n ij trials with success probability p ij for the jth variable of the ith subject.Also assume that x ij x i1 , . . ., x ik is the vector of k covariates corresponding to the jth variable in the ith subject, such that X i x i1 , . . ., x it is the t × k matrix of all covariates for the ith subject.
The general mixture distribution model for binomial data is given by where G is a multivariate cumulative distribution function with support in 0, 1 t and p i p i1 , . . ., p it .Basically, we assume that p i is distributed as G • , and, given p i , the vector y i consists of t independent binomial variables.Then the marginal distribution of y i is given by 2.1 .A popular choice for G is the multivariate logit-normal distribution; that is, the distribution obtained under the assumption logit p i log p i1 / 1 − p i1 , . . ., log p it / 1 − p it is multivariate normal with mean μ i μ i1 , . . ., μ it and covariance matrix Σ.Here μ ij x ij β represents the mean as a function of the covariates and a k-dimensional regression parameter vector β.To make model 2.1 identifiable we make the common assumption that Σ φR, where R is a correlation matrix and φ is a scale parameter Joe 7 , page 219 .This condition is necessary for model identification as the following simple example shows.Suppose t 2 and that the vector p is multivariate logit-normal distributed with mean μ 0 and covariance matrix Σ . It is easy to verify that two sets of choices for the variances σ 2 1 and σ 2 2 can result in identical binary distribution for y as shown in Table 2.If β, R, φ are the only parameters of interest, we can obtain maximum likelihood estimates by maximizing the likelihood L β, R, φ m i 1 f y i .However, if the E y ij /n ij p ij 's are also of interest, we can obtain estimates of these parameters using either the empirical Bayes EB method or the EM algorithm considering the full likelihood 2 has parameters that increase with m, and solutions to the maximization of 2.2 will require roots of complex nonlinear equations.These considerations may make the full likelihood approach subject to computational difficulties and convergence problems.Further, such specific definitions for the components in the mixture model may affect estimator robustness.

Alternatives to Likelihood-Based Estimation
For reasons outlined earlier it makes sense to consider the vector of logit-transformed proportions u i u i1 , . . ., u it , where u ij logit p ij log p ij / 1 − p ij and p ij y ij /n ij .Note that u i is distributed as multivariate normal with parameters E u i μ i X i β and Cov u i φR.The focus on these normally distributed random variables, rather than the mixturedistribution-based binomial random variables, can allow us to relax distributional assumptions and utilize distribution-free methodologies for parameter estimation such as the generalized estimating equations GEEs .This methodology is a two-stage process, in which the estimate of the regression parameter β is updated by a residual-based moment estimate of R. Specifically, estimation is iterated between the two equations The problem with this methodology is that the diagonal elements of Z are not necessarily equal to φ, implying that the diagonal elements of R Z/ φ are not necessarily unity.However, the GEE methodology, as implemented in software packages, forces the diagonal elements of R to unity i.e., it changes the values from whatever they are to 1 , and thus matrix R is not guaranteed to be positive definite.This can lead to most harmlessly convergence problems, but it can also lead to artificially deflated estimator variances for the regression parameters and is thus subject to improper or incorrect inference.

Quasileast Squares
The quasileast squares QLSs approach, on the other hand, provides an alternative estimate of R and does not experience the convergence problems exhibited by GEE.A further benefit of this method is that it does not require the assumption of normality for the joint distribution of each response or among their marginal distributions.The initial step for estimation of R is to minimize tr R −1 Z over the set of correlation matrices.Since the diagonal elements of R are restricted to be one, introducing a diagonal matrix of lagrange multipliers Λ, we can verify that the point of minimum R factors the matrix Z as Z R Λ R.
4.1 Whittle 8 has shown that for a positive definite matrix Z the factorization 4.1 is unique.
, and the diagonal matrix Λ satisfies the fixed-point equation Λ diag Λ 1/2 ZΛ 1/2 1/2 , which can be solved using a simple fixed-point iterative scheme Olkin and Pratt 9 , Chaganty 4 .Next, using the first step correlation estimate R, we can then obtain a consistent correlation estimate as e, e is a vector of ones, and • denotes the Hadamard product.It is possible that the correlation matrix 4.2 may not be positive definite in which case Chaganty and Shults 5 have recommended the estimate

4.3
See equation 3.2 in Chaganty and Shults 5 .The quasi-least squares method uses the estimate 4.2 of R if it is positive definite and otherwise uses 4.3 , which is clearly a positive definite correlation matrix, to update the estimate of β until convergence.Code for fitting this model using the R statistical software is provided in the Appendix.

Example
We now provide an example from Alderdice and Forrester 10 , who modeled the effects of salinity and temperature on the proportion of hatched English sole eggs.In this study, the number of hatched eggs was recorded at seven salinity and five temperature levels.Measurements were taken in four separate tanks for each combination of salinity and temperature, and for each tank we have recordings of the number of fish eggs and the number hatched.Thus, the tanks represent the repeated measure component for this binomial data set.The data, as given on page 6 of Lindsey 11 , is reproduced in Table 3.
The goal of the analysis is to study the dependence of the proportion of eggs hatched on the temperature and salinity.After calculating u ij logit y ij /n ij , where y ij is the number of eggs hatched out of the total n ij in the jth tank at the ith combination of temperature and salinity, the Shapiro-Wilk test for normality was performed on the transformed responses for each replicate.The results P -values < 0.05 indicate a departure from normality, so that maximum likelihood methods for continuous data are not applicable.The data was analyzed using GEE in several statistical software packages using an unstructured working correlation matrix to account for the correlation between the four replications of the solution in the four tanks.The results using PROC GENMOD in SAS version 9.2, gee.fit module in TIBCO Spotfire S version 8.2, and xtgee procedure in STATA version 11, are shown in parts i , ii , and iii of Table 4.The warning message from PROC GENMOD read "WARNING: Iteration limit exceeded."Here we see that in each case the estimates failed to converge.The 0.999 correlation estimates in part i represent model breakdown in that programmers often use this value to indicate nonconvergence.The warning message from TIBCO Spotfire S software read sic "Warning messages: 1: at convergence at the correlation estimate 1 is outside of the range −1, 1 in cgeefit gee.model 2: correlation matrix is not full rank, 2 < 4 in: cgeefit gee.model ."Note that the correlation between the first and second tanks in part ii is greater than one, clearly violating the most liberal of correlation boundaries.The warning message from xtgee in STATA version 11 read "convergence not achieved."Also, the fourth eigenvalue in part iii is negative, indicating that the estimated correlation matrix is not positive definite.The results of the QLS analysis are given in part iv of Table 4, which show that the estimated correlation matrix is positive definite.

Discussion
The logit transformation was originally applied on mortality rates in univariate bioassays Berkson, 12 , though the idea also generalizes nicely into the cases of correlated repeatedmeasure, longitudinal, or multivariate binomial data discussed here.Doing so allows the data analyst to bypass complicated, parametrically saturated mixture distributions and utilize methods for correlated continuous data.Interestingly, even after the logit transformation is applied, the GEE method still experiences convergence difficulties and problems with correlation parameter estimation.Potential causes for these problems are explained in Section 3. The QLS method, on the other hand, does not experience these difficulties and handles the simultaneous estimation of both regression and correlation parameters with relative ease.This was made possible by incorporating the little-known and rarely used matrix factorization given in 4.1 .

Appendix
For more details see Algorithm 1.

Table 1 :
The proportion p ij of successful outcomes for the ith subject during the jth repetition.

Table 2 :
Identical distribution for y with two different choices for Σ.

Table 3 :
Number of hatched and total eggs of English sole at different salinity and temperature levels in sea water.

Table 4 :
Analysis of English sole eggs data: i GEE parameter estimates and working correlation matrix using the SAS system GENMOD procedure.ii GEE parameter estimates and working correlation matrix using the TIBCO Spotfire S .iii GEE parameter estimates and eigenvalues of the working correlation matrix using STATA version 11. iv QLS parameter estimates and correlation estimates using an R program given in the Appendix.