Sufficient Sample Size and Power in Multilevel Ordinal Logistic Regression Models

For most of the time, biomedical researchers have been dealing with ordinal outcome variable in multilevel models where patients are nested in doctors. We can justifiably apply multilevel cumulative logit model, where the outcome variable represents the mild, severe, and extremely severe intensity of diseases like malaria and typhoid in the form of ordered categories. Based on our simulation conditions, Maximum Likelihood (ML) method is better than Penalized Quasilikelihood (PQL) method in three-category ordinal outcome variable. PQL method, however, performs equally well as ML method where five-category ordinal outcome variable is used. Further, to achieve power more than 0.80, at least 50 groups are required for both ML and PQL methods of estimation. It may be pointed out that, for five-category ordinal response variable model, the power of PQL method is slightly higher than the power of ML method.


Introduction
Data collected from hospitals and educational institutions are mostly multilevel or hierarchical data. This type of data is frequently used by researchers to construct statistical models such as multilevel models, hierarchical models, or mixed effects models [1,2]. As the observations in these nested data structures become dependent, the classical methods and models like analysis of variance (ANOVA) and linear regression cannot be applied because these models assume independence. Hence, the use of alternative multilevel models is warranted to analyze the nested data structure.
It is really challenging to decide about an appropriate sample size for multilevel ordinal logistic models. In the contemporary literature, only [3] discusses the issue of sample size in multilevel ordinal logistic model by using PQL method of estimation. The researcher uses three-category multilevel ordinal logistic models. Apart from this, there is no existing research on sample size and power issues in multilevel ordinal logistic models. Unlike [3], the study of [4] compares both PQL and ML methods in small group sizes. However, the study of [4] does not provide any results about power analysis. In the present study, the focus of researchers is not only to compare ML and PQL estimation methods of estimation in larger group sizes but also to provide guidelines about optimum sample size needed for multilevel ordinal logistic models.
2 Computational and Mathematical Methods in Medicine * underlies the observed variable . A simple two level ordinal logistic model can be written as * = 0 + 1 + level 1 model 0 = 00 + 01 + level 2 model 1 = 10 + 11 + 1 * = ( 00 + 10 + 01 + 11 ) + ( + 1 + ) where corresponds to level 1 explanatory variable, represents level 2 explanatory variable, and level 1 coefficients denoted by and 's are the fixed effects. If it is assumed that follows a normal distribution, that is, ∼ (0, 1), then the resulting multilevel model is termed as multilevel probit model. Similarly, if ∼ logistic(0, 2 /3), then the model is said to be multilevel logistic model [6]. Level 2 random effects are more often assumed to have a normal distribution as According to [7], ICC is the proportion of group level variance compared to the total variance, represented by 2 is the group level or level 2 variance and 2 /3 is the individual level or level 1 variance so the total variance = (level 2 variance) + (level 1 variance) = 2 + 2 /3.
That is why the squared sigma ( 2 ) appeared in both numerator and denominator. Now * can be linked with the observed variable through a threshold model. The threshold model for categories, = 1, 2, 3, . . . , , can be written as where ) are the threshold parameters.
For identification purpose, it is common to set the first threshold to zero and to allow an intercept in the model [8]. We will assume a proportional odds model which means that the effect of explanatory variables will remain the same across categories [9][10][11][12]. The equivalent to the cumulative logit model above in generalized linear models context is

Simulation
Design. The fixed effect parameters ( 00 , 10 , 01 , 11 ) were set from the previous study by [4]. 10 = 1, 01 = 1, 11 = 1, and 00 = 0. The reason behind 00 = 0 is that multilevel ordinal logistic model cannot identify the overall model intercept and all threshold parameters jointly. There are two ways: a researcher can (1) estimate the intercept by setting the first threshold parameter to zero or (2) equate the intercept to zero and estimate the threshold parameters. That is why we put 00 = 0. There are three groups conditions (30, 50, 100), three group sizes (5,30,50), and three values of ICC (0.1, 0.3, 0.5). These values of ICC correspond to an intercept variance 2 = 0.39, 1.45, and 3.4. Similarly, random slope values were taken as 2 1 = 0.12, 0.32, and 0.62, while 2 1 = 0.01, 0.15, and 0.30. For each scenario, the number of simulation R was set to be 1000. We used GLIMMIX (ML) with adaptive quadrature procedure and GLLIMIX (PQL) procedures in SAS for data generation and analyses.

Procedure for the Parameter Estimations.
The accuracy of different fixed effect and random effect parameters estimates was calculated through the relative parameter bias, that is, while estimate is the value produced by ML or PQL method of estimations and parameter values are those taken in the simulation design. Average relative biases for Tables 1 and 2 are obtained by using (6).
Similarly, power was computed as Estimate Standard error .
Empirical powers were computed for , , and by using (7). The power was calculated as the number of replications in which 0 : Parameter value = 0, was correctly rejected at 5 percent level of significance divided by 1000 as we used 1000 replications for each condition. Power values between 0.80 and 1.0 were considered excellent and below 0.80 as inadequate [13]. For brevity, the threshold estimates are excluded as researchers are not interested in their values. Empirical coverage rates of 95% confidence intervals were used to judge the accuracy of the standard errors of estimated parameters [14]. Tables 4-11 were obtained by using The 95% confidence intervals coverage rates were computed in each condition as the proportion of replications in which the true parameter is captured by the 95% confidence interval. Reference [15] recommended acceptable coverage rates as 92.5% to 97.5%. A separate logistic regression was used to assess the impact of simulation conditions on empirical coverage rates of estimates. values in Tables 4 to 11 are obtained by logit ( ) = 0 + 1 groups + 2 group size + 3 ICC. (9)

Results and Discussion
Tables 1 and 2 are estimated for the purpose of checking the accuracy of both fixed and random effects estimates through relative bias. Absolute relative bias <5% is negligible under both ML and PQL method. Estimates that have absolute relative bias close to zero are considered as unbiased estimates; that is, estimates and parameters become identical. Tables 4 to 11 are computed to check the accuracy of estimates standard errors through 95% confidence interval. Those estimates which achieve acceptable coverage rates (92.5% to 97.5%) are considered best. Similarly, Table 3 is computed to get power rates under both ML and PQL methods. Table 1 represents the impact of various simulation conditions on average relative bias of the fixed effect estimates under two estimation methods ML and PQL. Estimates were substantially biased when the number of groups was 30, group size was 5, and ICC = 0.1 under ML method. The estimates relative bias was less than 5% when the number of groups was 50 under ML method. With 100 groups, estimates Table 2: Average relative bias of random effect estimates obtained as a function of groups, group size, and ICC, collapsed over the category distribution and number of categories.

Groups
Group size ICC ML method PQL method were unbiased under ML method. Estimates relative bias was negligible when the number of groups was 100 under ML method. Relative bias of estimates was highly influenced by the number of groups under ML method. On the contrary, estimates were underestimated under PQL method. Substantial bias of estimates was noted when group size was 5 and ICC = 0.5 under PQL method. Estimates relative bias was less than 5% when group size was 50 under PQL method. On average, ML method fixed effects estimates average relative bias was smaller than that of PQL method in absolute terms. Moreover, group size impact was larger on estimates relative bias under PQL method. Similarly, ML method random effects estimates were substantially biased when the number of groups was 30 as shown in Table 3. With 100 groups, random effects estimates bias was less than 5% consistently. ML method performed well when the number of groups was large. On the other hand, random effects estimates were substantially biased under PQL method when group size was 5 and ICC = 0.5. Like fixed effects estimates, ML method random effects bias was smaller than that of PQL method random effects estimates bias in absolute terms. Table 3 reflects the power pattern of estimates under both ML and PQL methods in five-category ordinal outcome variable model. Further, to achieve more than 0.80 power rates for estimates, at least 50 groups are mandatory for both estimation methods. However, the power of PQL method estimates was slightly higher than that of ML method estimates in five-category ordinal outcome variable model. Table 4 lists empirical coverage rates for estimates of the multilevel ordinal logistic model under ML method when response variable had three categories and distribution type was symmetrical. Under the three group conditions, the fixed effect estimates achieve the acceptable coverage rates (92.5 to 97.5) defined by [13]. However, random effect estimates coverage rates were smaller than the nominal coverage rates. The effect of the number of groups was significant on estimates coverage rates. However, group size and ICC effect was minimal on estimates coverage rates. Table 4 results suggest that a large number of groups should be used rather than larger group size under ML method.          was skewed. Again, group size and ICC effect was minimal on estimates empirical coverage rates. The impact of the number of groups was again significant and dominant on estimates empirical coverage rates. Moreover, to achieve unbiased standard errors of estimates, large number of groups will be better than the larger group size under ML method. Table 6 lists empirical coverage rates for estimates of the multilevel ordinal logistic model under ML method when response variable had five categories and distribution type was symmetrical. Like three-category response model, the influence of the number of groups was significant on estimates coverage rates under ML method. More groups should be used to achieve the desired results under ML method. Table 7 lists empirical coverage rates for estimates of the multilevel ordinal logistic model under ML method when response variable had five categories and distribution type was skewed. Like five-category response model, the influence of the number of groups was again significant on estimates coverage rates under ML method. On the other hand, group size and ICC effect was insignificant on estimates empirical coverage rates under ML method. Table 8 lists empirical coverage rates for estimates of the multilevel ordinal logistic model under PQL method when response variable had three categories and distribution type was symmetrical. With group size 5, fixed effects estimates coverage rates were unacceptable. However, with group size of 30 and 50, respectively, fixed estimates coverage rates were acceptable. Furthermore, random effects estimates coverage rates were unacceptable across all conditions. Additionally, ICC had also a significant effect on estimates coverage rates; that is, coverage rates decreased with the larger ICC values under PQL method. On the other hand, groups effect was little under PQL method on estimates coverage rates. Table 9 lists empirical coverage rates for estimates of the multilevel ordinal logistic model under PQL method when response variable had three categories and distribution type was skewed. Group size factor had a significant effect on estimates coverage rates. However, coverage rates consistently decreased with the increase in ICC values under PQL method. Fixed effects achieved acceptable coverage rates across the last two levels of group size factor, that is, 30 and 50, respectively. Groups had a significant effect on coverage rates. Larger group sizes seem to be good under PQL method.  Table 11 lists empirical coverage rates for estimates of the multilevel ordinal logistic model under PQL method when response variable had five categories and distribution type was skewed. Fixed effect coverage rates were all acceptable across all group sizes. However, random effects coverage was unacceptable and smaller than that of ML coverage rates. Group size again showed a significant effects on all estimates coverage rates while groups effect was insignificant.

Conclusion
In both symmetrical and skewed distribution shapes of category responses, 10 , 01 , and 11 were overestimated in majority of the conditions under ML method of estimation when three-category and five-category multilevel ordinal logistic models were used. Moreover, random effects estimates, that is, random intercept and random slope estimates, were more biased than fixed effects estimates on average. The random effects estimates were not substantially biased under ML method of estimation, especially when the number of groups was large and random effects were not small. On the contrary, 10 , 01 , and 11 were underestimated almost across all conditions under PQL method of estimation when threecategory and five-category multilevel ordinal logistic models were used. Like ML method of estimation, random effects estimates relative biases were higher on average than those of the fixed effects estimates under PQL method. Both fixed effects and random effects estimates were not substantially biased under PQL method of estimation, particularly when population random effects were small ( 2 = 0.39, 2 1 = 0.12, and 1 = 0.01) and group sizes were large. In general, the absolute relative bias of ML method estimates was consistently smaller than that of PQL method estimates. The accuracy of standard errors of the estimates (excluding threshold estimates) was judged through empirical coverage rates. The influence of the number of groups was significant on the accuracy of both three-category and five-category multilevel ordinal logistic models estimates standard errors under ML method of estimation. On the contrary, the group size factor and ICC had an insignificant effect on estimates standard errors under ML method of estimation. Estimates standard errors were least biased when number of groups was 100. On the other hand, the influence of group size factor was highly significant on the accuracy of estimates standard errors under PQL method of estimation. Furthermore, ICC also influenced estimates standard errors; that is, standard errors were substantially biased when population random effects were medium ( 2 = 1.45, 2 1 = 0.32, and 1 = 0.15) and large ( 2 = 3.4, 2 1 = 0.62, and 1 = 0.30). The impact of the number of groups on estimates standard errors in majority of the conditions was minimal under PQL method. However, five-category multilevel ordinal logistic model estimates standard errors were as good as that of ML method.
The power rates of PQL estimates were slightly higher than that of ML estimates when ordinal response variable had five categories, which indicate that PQL standard errors were least biased due to increases in the number of categories of ordinal response variable.
In general, ML method performed well in terms of estimates small bias, high coverage rates, and high power rates when ordinal response variable had three categories. However, in five-category ordinal response variable model, PQL method performances were comparable to those of ML method. PQL estimates were poor when population random effects (ICC) were medium and large while ML estimates were poor in small population random effects. In addition, ML estimates and estimates standard errors benefitted from larger number of groups while PQL estimates and estimates standard errors benefitted from larger group sizes. We recommend at least 100 groups and 30 units per group to achieve accurate multilevel ordinal logistic model estimates and estimates standard errors when method of estimation is ML. Furthermore, 100 groups and at least 50 units per group are essential for accurate multilevel ordinal logistic model estimates and estimates standard errors when method of estimation is PQL. Similarly, at least 50 groups are essential to achieve 0.80 or more power for both ML and PQL methods of estimation. On the basis of this study, it is recommended that PQL method may be avoided when group sizes are small, number of groups are large, random effects are medium and large, and outcome variable has three categories. In such conditions, ML method is the best option. However, when the outcome variable has five or more categories, random effects are small, group sizes are large, and number of groups is small, PQL method may be better option.