Assessment of Differential Item Functioning in Health-Related Outcomes: A Simulation and Empirical Analysis with Hierarchical Polytomous Data

Background. The purpose of this study was to evaluate the effectiveness of two methods of detecting differential item functioning (DIF) in the presence of multilevel data and polytomously scored items. The assessment of DIF with multilevel data (e.g., patients nested within hospitals, hospitals nested within districts) from large-scale assessment programs has received considerable attention, but few studies have evaluated the effect of the hierarchical structure of the data on DIF detection for polytomously scored items. Methods. Ordinal logistic regression (OLR) and hierarchical ordinal logistic regression (HOLR) were used to assess DIF in simulated and real multilevel polytomous data. Six factors (DIF magnitude, grouping variable, intraclass correlation coefficient, number of clusters, number of participants per cluster, and item discrimination parameter) were considered in a fully crossed simulation design. Furthermore, data from the Pediatric Quality of Life Inventory™ (PedsQL™) 4.0 collected from 576 healthy school children were analyzed. Results. Overall, both methods performed equivalently in terms of controlling Type I error and detection power rates. Conclusions. The current study showed a negligible difference between OLR and HOLR in detecting DIF with polytomously scored items in a hierarchical structure. Implications and considerations for analyzing real data are also discussed.


Introduction
In psychological, educational, and medical quality of life studies, measurement equivalence is an essential assumption for meaningful comparison of scores across different groups. Measurement nonequivalence can occur at different levels, such as the instrument or item level, resulting in noncomparable data across groups [1][2][3][4]. The latter, known as differential item functioning (DIF), is an important part of a validation study. DIF analysis, which originated in educational testing, has been used in psychometric studies to assess whether the probability of responding to a specific item exhibits different statistical properties for different identifiable groups after controlling for the construct being measured [1,4]. There are two forms of DIF, known as uniform and nonuniform. Uniform DIF is defined as a constant difference in the probability of a correct answer between manifest groups at all ability levels. In nonuniform DIF, or crossing DIF (CDIF), this difference changes direction at some ability levels, leading to different directions of DIF along the ability scale [5][6][7][8][9][10]. These forms become more complex for polytomous items than for dichotomous ones [11]. Methodology reviews show that there are several parametric and nonparametric statistical methods for evaluating DIF with dichotomous and polytomous items [9,11,12]. Among all these available methods, item response theory (IRT) [13] and ordinal logistic regression (OLR) [7] approaches have received notable attention in research and applied settings for polytomous items [2,12]. IRT is the most powerful DIF detection approach, but it requires a large sample size, especially for models with more than one parameter such as two- and three-parameter models. Previous studies showed that, for sufficient power when assessing DIF across two groups, a total sample size of 1,000 across the two groups is necessary [14].
Furthermore, the common unidimensional IRT models have strict assumptions (e.g., unidimensionality and local independence) compared to OLR. The OLR DIF detection approach is an effective, easy-to-implement, model-based procedure that does not assume normality for the ability distribution and can control for additional covariates, both categorical and continuous, which may confound the results of a DIF analysis. Moreover, this method provides a number of criteria to quantify the magnitude of DIF [5,6,12,15,16]. Multilevel or hierarchical data structures often arise from the most common sampling designs used in educational and medical research. For example, the total sample in almost any public health-related survey consists of patients nested within healthcare service providers, which are in turn nested within districts. Districts can also be further nested within higher-level units such as provinces/states [9,17]. People nested in the same cluster share the same cluster-specific influences. Ignoring such influences can lead to cluster-level unobserved heterogeneity and dependence between the responses of participants in the same cluster [18]. Multilevel models simultaneously handle respondent-level relationships and account for this unobserved heterogeneity and dependence [6,17,19]. In the case of DIF analysis, the analytic strategy should likewise match the data structure [6]. Previous research has demonstrated that ignoring this structure may lead to inaccurate estimation of parameters and their standard errors, biased statistical tests, incorrect DIF detection, and inflated Type I error rates [6,20].
Recent advances in computing power and the availability of new software for fitting multilevel models have drawn much attention to developing more complicated models, especially in DIF analysis [9]. Several multilevel DIF detection simulation studies in the literature have focused on dichotomously scored items [9,20]. For example, Kamata et al. (2005) used a generalized Rasch model with group membership to investigate DIF detection [21]. In a series of studies, French and Finch extended logistic regression, multiple indicator multiple cause (MIMIC), and Mantel-Haenszel models for detecting DIF in multilevel data [6,22,23]. Recently, Jin et al. (2013) studied the effect of the item-level and total-score intraclass correlation coefficients (ICC) on DIF, and Wen (2014) investigated DIF detection at both levels simultaneously. The results and implications of these studies, however, were limited to dichotomous multilevel data.
To date, to the authors' knowledge, no simulation study has investigated the performance of the hierarchical ordinal model in DIF detection, while in practice a wide variety of psychological, educational, and medical outcome variables (e.g., health-related quality of life (HRQoL) questionnaires) are polytomously scored and measured using Likert-type rating scales in which respondents choose their level of agreement on a symmetric agree-disagree scale [2,24].
As mentioned, OLR is one of the most popular polytomous models and is able to utilize all the information from each item response [7], but it can only accommodate respondent-level (i.e., level 1) information. As discussed above, if standard analyses are used while ignoring the multilevel structure, variation due to the levels of the data structure is conflated, leading to incorrect parameter estimates and inflated standard errors; it may also increase the likelihood of finding DIF when DIF is not present [2]. Therefore, the main objective of this study was to investigate the Type I error rate and detection power of the hierarchical ordinal logistic regression model (HOLR) and OLR in detecting DIF by means of Monte Carlo simulation. Additionally, a real data example was analyzed.

DIF Analysis Based on Hierarchical Ordinal Logistic Regression Model.
Logistic regression analysis has been suggested for DIF assessment by several studies (e.g., Swaminathan and Rogers, 1990; French and Miller, 1996; French and Finch, 2010). Detecting DIF through OLR is based on comparing three nested models. The full model, as given by French and Miller (1996), has the following form:

\[ \mathrm{logit}[P(Y_i \le k)] = \alpha_k + \beta_1 \theta_i + \beta_2 G_i + \beta_3 (\theta_i \times G_i), \]

where \(P(Y_i \le k)\) is the probability of responding at or below category k to an item for the ith person, \(\theta_i\) represents ability and is measured by the total test score, \(G_i\) is the grouping variable, and \(\theta_i \times G_i\) represents the interaction between the grouping variable and ability. The baseline model (M1) includes only the ability term. The next nested model (M2) includes the ability term plus the grouping variable as predictors. The difference in −2 log-likelihood between the full model and M1 can be used to detect uniform and nonuniform DIF simultaneously; this value is compared to a Chi-square distribution with two degrees of freedom. If the comparison yields a significant result, the item is flagged for DIF and further investigation is needed to determine whether the DIF is uniform or nonuniform. The comparison between M1 and M2 can be used to assess uniform DIF [2].
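The model-comparison arithmetic can be sketched numerically. The deviance values below are hypothetical and purely illustrative; only the likelihood-ratio logic is the method itself (for 2 degrees of freedom the Chi-square survival function has the closed form exp(−x/2), and for 1 degree of freedom it is erfc(√(x/2)), so no statistics library is needed):

```python
import math

def lr_test(neg2ll_reduced, neg2ll_full, df):
    """Likelihood-ratio test for nested models: the difference in
    -2 log-likelihood is referred to a Chi-square distribution."""
    stat = neg2ll_reduced - neg2ll_full
    if df == 2:
        p = math.exp(-stat / 2.0)             # closed-form Chi-square sf, df = 2
    elif df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))  # closed-form Chi-square sf, df = 1
    else:
        raise ValueError("only df = 1 or 2 are needed here")
    return stat, p

# Hypothetical -2 log-likelihoods of the three nested OLR models:
m1 = 2304.6  # baseline: ability only
m2 = 2296.1  # ability + grouping variable
m3 = 2295.8  # full model: ability + group + interaction

stat_any, p_any = lr_test(m1, m3, df=2)  # uniform + nonuniform DIF jointly
stat_uni, p_uni = lr_test(m1, m2, df=1)  # uniform DIF only
print(f"overall DIF: chi2(2) = {stat_any:.1f}, p = {p_any:.4f}")
print(f"uniform DIF: chi2(1) = {stat_uni:.1f}, p = {p_uni:.4f}")
```

With these illustrative deviances, both comparisons are significant at the 0.05 level, so the item would be flagged and the M1-versus-M2 comparison would attribute the DIF to the uniform component.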
As pointed out by French and Finch (2010), the primary problem with employing standard statistical models such as ordinal logistic regression with nested data is the violation of the assumption of independently and identically distributed scores [6]. Although point estimates of the parameters will not be seriously affected by this violation, estimates of the standard errors can be affected by the dependency in the data [6,20], because nestedness of the data, or cluster sampling, directly influences the sampling variance [2]. Standard errors are inaccurately estimated when the multilevel structure is not taken into account, so capturing the actual sampling variance requires taking cluster sampling into account.
The above-mentioned DIF detection approach can easily be extended to accommodate multilevel data. The general model for the logit of responding at or below category k to an item for the ith person (e.g., student) in the jth cluster (e.g., school) can be written for two levels as

\[ \mathrm{logit}[P(Y_{ij} \le k)] = \alpha_k + \sum_p \beta_p X_{pij} + \sum_q \gamma_q W_{qj} + u_j, \]

where \(Y_{ij}\) is the polytomous item response for person i in cluster j; the \(X_{pij}\) are person-level predictors; the \(W_{qj}\) are cluster-level predictors; \(\beta_p\) and \(\gamma_q\) are the associated regression coefficients for \(X_{pij}\) and \(W_{qj}\), respectively; and \(u_j\) is the random effect at the cluster level. For a within-cluster variable, this general model reduces to the following uniform DIF model:

\[ \mathrm{logit}[P(Y_{ij} \le k)] = \alpha_k + \beta_1 \theta_{ij} + \beta_2 G_{ij} + u_j. \]

As mentioned before, \(Y_{ij}\) and \(\theta_{ij}\) are the polytomous item response and ability for person i in cluster j, and \(G_{ij}\) represents the group identifier for which DIF will be tested for person i in cluster j. When the regression coefficient \(\beta_2\) in the above equation is significant in comparison with the baseline model, the studied item is flagged as showing uniform DIF. By the same logic, for a between-cluster variable we have

\[ \mathrm{logit}[P(Y_{ij} \le k)] = \alpha_k + \beta_1 \theta_{ij} + \gamma_{01} G_{j} + u_j, \]

where \(G_j\) represents the group identifier for which DIF will be tested in cluster j. If the regression coefficient \(\gamma_{01}\) in the above equation is significant compared to the baseline model, the studied item shows uniform DIF at the cluster level.

A Simulation Study.
A simulation study was conducted to evaluate the Type I error rate and detection power of the hierarchical ordinal logistic regression model and OLR in detecting items exhibiting uniform DIF. Six manipulated factors were included in the current Monte Carlo study to investigate the comparative performance of HOLR and OLR in terms of the rates of correct (i.e., detection power) and incorrect (i.e., Type I error) uniform DIF identification. Following French and Finch (2011), a test containing 16 simulated items (15 core items and 1 studied item) for two groups (focal and reference) was generated using the R package for statistical computing [25]. Each item had five possible response categories. Item scores were generated using the multilevel graded response model as given by Fox (2007) [19]. In each test, item difficulty parameters were sampled from the uniform distribution over the interval [−2.5, 2.5]. In each simulation, ability values for the two groups were drawn simultaneously from a 2-level normal linear mixed model as given by Wen (2014) and Jin et al. (2013) [9,20]. The other factors considered in this study are as follows. Following French and Finch (2010, 2011) and Jin et al. (2013), two types of dichotomous grouping variable were simulated: within cluster (e.g., gender or age group) and between cluster (e.g., type of school or teaching method). DIF at the between-cluster level was generated based on the idea that school effectiveness might impact performance on particular items, which may increase or decrease the number of DIF items [9]. For simplicity, only one dichotomous grouping variable was included at each level and DIF was introduced at the person level, although more than two groups and more characteristics and levels can be included to identify potential sources of DIF [9,20]. Moreover, previous DIF simulation work has shown that unequal comparison group sample sizes can threaten the Type I error rate and power of LR DIF analyses [26]. Thus, in this study, a balanced design was used at both levels.
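One replication of the design above can be generated along the following lines. This is a minimal sketch in Python rather than the authors' R code; the seed, discrimination value, and threshold values are illustrative assumptions, and the studied item's thresholds are shifted by the DIF constant for the focal group only:

```python
import math
import random

random.seed(7)

def graded_response(theta, a, thresholds):
    """Draw a 5-category (0-4) response from a graded response model:
    P(X >= k | theta) is a 2PL curve at each of the four thresholds."""
    p_ge = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    u = random.random()
    return sum(u < p for p in p_ge)  # count of cumulative curves above u

icc = 0.25                        # intraclass correlation of ability
n_clusters, n_per_cluster = 50, 10
tau = math.sqrt(icc / (1 - icc))  # between-cluster SD given unit within-cluster SD
dif = 0.4                         # uniform DIF added to focal-group thresholds

data = []
for j in range(n_clusters):
    u_j = random.gauss(0.0, tau)              # cluster-level random effect
    for i in range(n_per_cluster):
        theta = u_j + random.gauss(0.0, 1.0)  # person-level ability
        group = i % 2                         # balanced within-cluster grouping
        b = [-1.5, -0.5, 0.5, 1.5]            # illustrative thresholds (studied item)
        if group == 1:
            b = [t + dif for t in b]          # focal group: shifted thresholds
        data.append((j, group, graded_response(theta, 1.7, b)))

print(len(data))
```

With tau chosen as above, the total ability variance is tau² + 1, so the share of variance between clusters equals the target ICC.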

Intraclass Correlation Coefficient (ICC).
Three magnitudes of ICC (0.05, 0.25, and 0.45) were selected to represent small, medium, and large correlations between individuals within each cluster. This range is in accord with the ICC values seen in applied educational and recent simulation studies [6,20,22,23].
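For a continuous ability variable generated with unit within-cluster variance, these ICC targets pin down the between-cluster variance through the standard variance-partition definition (a sketch of the relationship, not necessarily the authors' exact parameterization):

```latex
\mathrm{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}
```

With \(\sigma^2_{\text{within}} = 1\), between-cluster variances of roughly 0.053, 0.333, and 0.818 reproduce the simulated ICC values of 0.05, 0.25, and 0.45, respectively.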

Number of Clusters.
The number of clusters is a critical factor that can affect both Type I error rate and power. Previous studies have shown that a greater number of clusters results in larger power and smaller Type I error rates [9]. Fifty groups is a frequently occurring number of clusters in school research practice [27]. Therefore, in this study, the numbers of clusters simulated were 50, 100, and 200.

Number of Participants per Cluster.
The numbers of participants per cluster were 5, 10, and 20. These values are in accordance with previous research on hierarchical data.

DIF Magnitude.
Mild and severe uniform DIF were simulated by adding 0.4 and 0.8, respectively, to the difficulty parameters of the studied item for the focal group, while the reference group's item difficulty parameters remained unchanged [9]. These levels were based on prior DIF simulation work with LR and hierarchical logistic regression [6,12].

Item Discrimination Parameters.
Previous simulation studies have shown that a high item discrimination parameter can result in highly inflated Type I error and high detection power [28]. Previous research also showed that the inflation of the Type I error rate was controlled fairly well when the discrimination parameters had a small range in the LR DIF detection method [29]. Thus, for the purpose of this study, item discrimination parameters were sampled from the uniform distribution over two intervals: low (0.5-0.99) and high (1.5-2). The OLR and HOLR analyses were also conducted using R, and Type I error rate and detection power were computed under the simulated conditions. Each condition was replicated 1000 times. To determine which manipulated factors influenced these rates, four repeated measures ANOVAs were conducted in which Type I error and detection power rates, averaged across replications for a given condition, were the dependent variables; the repeated measures factor was the type of test (i.e., OLR or HOLR); and the manipulated factors were between-subjects factors. In addition to the tests of significance, the effect size (partial η²) was calculated for each factor.
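Tallying the two rates across replications is straightforward: Type I error is the proportion of no-DIF replications in which the studied item is (falsely) flagged, and power is the proportion of true-DIF replications in which it is (correctly) flagged. A minimal sketch with hypothetical p-values (the function, not the numbers, is the point):

```python
def error_and_power(pvals_no_dif, pvals_dif, alpha=0.05):
    """Type I error: share of no-DIF replications falsely flagged.
    Power: share of true-DIF replications correctly flagged."""
    type1 = sum(p < alpha for p in pvals_no_dif) / len(pvals_no_dif)
    power = sum(p < alpha for p in pvals_dif) / len(pvals_dif)
    return type1, power

# Hypothetical p-values from ten replications of each scenario:
no_dif = [0.30, 0.02, 0.77, 0.51, 0.08, 0.64, 0.41, 0.93, 0.12, 0.26]
with_dif = [0.001, 0.03, 0.004, 0.21, 0.0003, 0.01, 0.047, 0.002, 0.09, 0.018]

t1, pw = error_and_power(no_dif, with_dif)
print(t1, pw)  # -> 0.1 0.8
```

In the study itself these proportions are taken over 1000 replications per condition, which keeps the Monte Carlo error of the estimated rates small.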

Within-Cluster Variable.
A repeated measures ANOVA was conducted to identify the factors significantly associated with the rate of false DIF identification (i.e., Type I error rate). This analysis showed that the strongest significant effects with reasonable partial effect sizes were the interaction between the number of clusters and method (p < 0.001 and partial η² = 0.359) and the interaction between the sample size per cluster and method (p < 0.001 and partial η² = 0.609). The standard way to interpret an interaction is to assess the effect of one factor at each level of the other factors. Simple main effects of method were significant for sample sizes per cluster of 5 and 10 and for 50 clusters. In all three of these cases, the Type I error rate of HOLR was higher than that of OLR. Figure 1 illustrates Type I error rates across the manipulated factors for both OLR and HOLR when DIF occurs for the within-cluster variable. Across almost all conditions, the lines nearly overlap. The only exceptions were the condition with 50 clusters and 5 cases per cluster and the condition with low item discrimination, 100 clusters, and 5 cases per cluster, in which OLR outperformed HOLR.
In these conditions, OLR showed estimates closer to the nominal Type I error rate of 0.05 than HOLR. Nevertheless, both OLR and HOLR controlled the nominal Type I error rate of 0.05 reasonably well, and the maximum difference between the two methods was as small as 0.002.
The detection power of OLR and HOLR by manipulated factors, when DIF occurs for the within-cluster variable, is shown in Table 1. The detection power of both approaches was one under high item discrimination and a high level of DIF, and in most conditions it was above the acceptable rate of 0.8 [20]. The exceptions were the condition with a low level of DIF (i.e., 0.4), 50 clusters, and 5 cases per cluster and the condition with low item discrimination, a low level of DIF, and a total sample size of 500 or less, in which detection power ranged from 0.303 to 0.784. In addition, OLR was less powerful than HOLR across almost all conditions, with the maximum difference between the two methods being as small as 0.006. Table 1 clearly shows that detection power for both approaches increased with larger cluster sizes and more clusters. Detection power also increased with higher levels of DIF and discrimination. Additionally, this increase in power across sample size per cluster and number of clusters was more pronounced for low levels of DIF and discrimination than for higher levels. For detection power, the repeated measures ANOVA revealed that the interaction between method, item discrimination parameter, DIF magnitude, number of clusters, and cluster sample size was significant (p < 0.001 and partial η² = 0.382), a 5-way interaction that is difficult to interpret.

Between-Cluster Variable.
Similar to the within-cluster variable, a repeated measures ANOVA was conducted to analyze the Type I error rate for the between-cluster condition. The interaction between the number of clusters and method was significant (p < 0.001 with partial η² = 0.763). The simple main effect of method was significant with a large partial effect size at each level of the number of clusters. Furthermore, the main effect of sample size per cluster (p < 0.001, partial η² = 0.355) was also statistically significant. Figure 2 illustrates Type I error rates across the manipulated factors for both OLR and HOLR when DIF occurs for the between-cluster variable. Across almost all conditions, OLR showed estimates closer to the nominal Type I error rate of 0.05 than HOLR. The only exceptions were the condition with 50 clusters and 5 cases per cluster and the condition with high item discrimination, 200 clusters, and 5 cases per cluster, in which HOLR outperformed OLR. Another condition in which HOLR performed slightly better than OLR was the one with low item discrimination, 200 clusters, and 10 cases per cluster. It seems that the value of the ICC affected the Type I error rate only for larger sample sizes, and mostly for HOLR. As for the within-cluster variable, both OLR and HOLR controlled the nominal Type I error rate of 0.05 reasonably well, and the differences in Type I error rate between the two methods were again negligible (i.e., 0.003 to 0.006). Table 2 shows the detection power of the two approaches for the between-cluster variable. In contrast to the within-cluster case, across almost all conditions the power of HOLR was lower than that of OLR. As for the within-cluster variable, the power of both approaches was above the acceptable level of 0.8 in almost all conditions. As the number of clusters and the sample size within each cluster increased, power increased. The power of both methods was lower for lower levels of DIF magnitude and discrimination.
For detection power, a repeated measures ANOVA was again used for further investigation; the highest-order significant interaction was method × number of clusters × sample size per cluster × level of DIF × item discrimination parameter (p < 0.001, partial η² = 0.457).

Discussion
DIF and test validity, especially at the item level, are important issues in the assessment of educational and psychological questionnaires. It is important to ensure that the latent traits of all examinees are accurately measured by the items and test scores. Although many studies have extended DIF detection methods to polytomously scored items [23] and great attention has also been given to multilevel ordinal logistic regression, relatively few have focused on hierarchical polytomous DIF, which is often present in a variety of psychological, medical, and educational research settings.
As emphasized by French and Finch (2010, 2011, and 2013), ignoring the multilevel structure of the data will result in differences in DIF detection power rates when the analytic strategy does not match the data structure. The current study extended French and Finch's studies by considering polytomously scored items instead of dichotomous ones. Based on our findings, HOLR and OLR performed almost equivalently in terms of controlling the Type I error rate at the nominal alpha level of 0.05. The inflation of the Type I error rate of OLR compared to HOLR, although negligible, was not as small for the between-cluster variable as for the within-cluster variable.
This finding is in line with the results of Jin et al. (2013). Compared to French and Finch (2010, 2011), the results of the current study were similar to theirs for the within-cluster variable, where the methods performed similarly in terms of maintaining the Type I error rate at the nominal level. However, the magnitude of the Type I error rate inflation of OLR compared to HOLR was not as large as that found in other studies for the between-cluster variable. When the cluster size and number of clusters were large enough, there were no significant differences between the Type I error rates of the two methods. This trivial difference across all levels of the manipulated factors is consistent with prior research on the topic with dichotomous scores using different DIF detection methods [6,20,22,23]. The two methods maintained power above the acceptable level (0.8) across most conditions. Power fell well below 0.8 for total sample sizes less than 500 with low levels of DIF magnitude and item discrimination across all levels of ICC. For larger sample sizes, power was higher for both approaches. As in previous studies analyzing dichotomous items, power for both methods increased for large magnitudes of DIF and item discrimination [6,30,31]. The results indicate that when item discrimination and DIF magnitude are high, both methods can reliably identify items exhibiting DIF. Although the standard OLR DIF detection approach is conceptually inconsistent with the complex structure of the data, the results of this study demonstrate that using OLR may not lead to incorrect DIF detection, which is in line with Jin et al. (2013).

An Empirical Example. The Pediatric Quality of Life Inventory™ (PedsQL™) 4.0 [16,32,33] consists of 23 items in four domains: physical health subscale (eight items), emotional functioning subscale (five items), social functioning subscale (five items), and school functioning subscale (five items).
This measure uses a five-point Likert scale ("0 = never a problem, 1 = almost never a problem, 2 = sometimes a problem, 3 = often a problem, and 4 = almost always a problem"). The PedsQL 4.0 scoring protocol is such that higher scores imply better HRQoL.
The participants were selected by a two-stage cluster random sampling technique. In each of the four educational districts, four schools were selected at random (first stage). Once the schools were chosen, a random number table was used to select two classes from each school at random, and all the students in the chosen classes were taken as the study sample (second stage).
Even though both OLR and HOLR showed very similar DIF detection results in our simulation study, in real-world situations the data might be influenced by factors that were not considered in the simulation. It is therefore imperative to check the statistical assumptions before proceeding with the data analysis. The intraclass correlation coefficient and the design effect were calculated for each question to determine whether a hierarchical analysis of the data is appropriate. Both the ICC and the design effect indicated the need for multilevel modeling for these data, so the hierarchical ordinal logistic regression DIF detection approach was used. Although ordinary logistic regression is frequently used to detect DIF in health and quality of life research (e.g., [34]), to date there has not been a study utilizing multilevel modeling for detecting DIF in this area. Two different HOLR models were used to detect DIF across gender (male = 0; female = 1) as the within-cluster variable and type of school (middle school = 0; high school = 1) as the between-cluster variable.
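The design-effect screening step can be made concrete. The Kish design effect for cluster sampling is 1 + (m − 1)·ICC, where m is the (average) cluster size; a common rule of thumb treats values above 2 as a signal that multilevel modeling is needed. The cluster size and ICC below are hypothetical, not the values observed for the PedsQL data:

```python
def design_effect(icc, cluster_size):
    """Kish design effect: variance inflation due to cluster sampling."""
    return 1.0 + (cluster_size - 1.0) * icc

# Hypothetical class size of 24 pupils and item ICC of 0.10:
deff = design_effect(0.10, 24)
print(round(deff, 2))  # -> 3.3
```

A design effect of 3.3 would mean the effective sample size is roughly 576 / 3.3 ≈ 175, so ignoring the clustering would substantially understate the standard errors.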

Results.
The results of the HOLR DIF analysis are summarized in Table 3. The estimated regression coefficients (β), the results of the Chi-square difference tests, and the corresponding p values are shown for all questions. While 11 of the 23 items (47.8%) showed uniform DIF across gender, only five items (21.7%) showed uniform DIF across the two types of school.
Items (2), (3), and (4) in physical health, items (1) and (2) in emotional functioning, all items except the first in social functioning, and items (1) and (3) in school functioning were flagged with DIF across gender. According to the β values, the flagged items of the social functioning and school functioning subscales favored girls, whereas the other flagged items favored boys. The item characteristic function (ICF) can be used as a summary statistic for a polytomous item, especially to illustrate DIF. The ICF is defined as the sum of the expected scores over the response categories of an item (Nering and Ostini, 2011). For an item X with m + 1 response categories scored k = 0, 1, ..., m, the ICF can be written as

\[ \mathrm{ICF}(\theta) = E[X \mid \theta] = \sum_{k=0}^{m} k\, P_k(\theta), \]

where \(P_k(\theta)\) is the probability of a score of k in the kth response category of item X. Figure 3 depicts the ICF of item (4) in physical health, in which girls (solid black line) have higher expected scores than boys (dashed line) even for lower theta (i.e., HRQoL) values.
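ICF curves of the kind shown in Figures 3 and 4 can be computed directly from graded-response parameters. A sketch with hypothetical parameters for the two gender groups of one item; uniform DIF is represented by shifting all of the girls' thresholds down, which raises their expected score at every theta:

```python
import math

def category_probs(theta, a, thresholds):
    """Category probabilities P_k(theta) under the graded response model."""
    p_ge = ([1.0]
            + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
            + [0.0])
    return [p_ge[k] - p_ge[k + 1] for k in range(len(p_ge) - 1)]

def icf(theta, a, thresholds):
    """Item characteristic function: expected item score at theta."""
    return sum(k * p for k, p in enumerate(category_probs(theta, a, thresholds)))

# Hypothetical parameters: girls' thresholds shifted down by 0.4 (uniform DIF)
boys_b = [-1.2, -0.2, 0.8, 1.8]
girls_b = [b - 0.4 for b in boys_b]

for theta in (-1.0, 0.0, 1.0):
    print(round(icf(theta, 1.5, girls_b), 2), ">", round(icf(theta, 1.5, boys_b), 2))
```

Plotting icf over a grid of theta values for each group reproduces the two curves whose vertical gap visualizes the uniform DIF.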
In terms of type of school, item (5) in physical health, item (4) in emotional functioning, item (5) in social functioning, and items (2) and (4) in school functioning showed DIF across types of school. As shown in Table 3, item (4) in emotional functioning and item (2) in school functioning favored middle school students, whereas item (5) in physical health, item (5) in social functioning, and item (4) in school functioning favored high school students. Notably, item (5) in social functioning showed DIF across both the within- and between-cluster variables, implying that female high school students are more likely to choose a higher response category on this item. Figure 4 shows the ICF of item (4) in school functioning, in which high school students (dashed line) have higher expected scores than middle school students (solid black line) even for lower theta values.

Conclusion
As emphasized by French and Finch (2013), choosing a proper model for analyzing hierarchical data is crucial because it allows for a potentially greater understanding of the phenomena under study, as well as avoiding statistical misspecification. The current study extended previous studies by evaluating the comparative performance of HOLR and OLR in detecting differential item functioning for polytomous items. Results of the simulation study showed that when the grouping variable was at the within-cluster level, OLR and HOLR performed equivalently in terms of controlling the Type I error rate at the nominal alpha level across almost all conditions. Interestingly, when the grouping variable was at the between-cluster level, OLR showed Type I error rates closer to the nominal alpha level of 0.05 than HOLR, except for a few conditions in which HOLR outperformed OLR. With regard to detection power, the rates of both approaches were above the acceptable level of 0.8, with trivial differences across all simulation conditions. Although this study found a negligible difference between OLR and HOLR in detecting DIF with polytomously scored items, there are many unknowns when working with real data. Thus, as discussed earlier, it is necessary to check the tenability of the statistical assumptions before choosing between OLR and HOLR.
As with other Monte Carlo simulation studies, the results are bounded by the factors under investigation, which limits their generalization. Additionally, certain design choices may affect the results: simulating a test with 16 items and only one DIF item, using equal sample sizes for the reference and focal groups at each level, considering DIF at each level separately, and generating DIF only by adding a constant value to all threshold parameters of the studied item. It has been shown that adding different values to the threshold parameters can result in greater group differences across the ability continuum (French and Miller, 1996). Additionally, the simulation assumed that all response categories of the items performed reasonably and contributed effectively to the results. This may not be true with real data, and the PedsQL has been found to have disordered categorical functioning [34]. It is also worth noting that we used only polytomous items, but such a simulation study can easily be extended to mixed-item-format tests as a more realistic representation of practice.
In this study, we approached DIF solely from a hypothesis testing perspective. As noted by other researchers, effect size measures are valuable tools in DIF analysis, and further work examining DIF in polytomous items using effect size measures is needed. Furthermore, we did not vary the ability distribution throughout the simulation study, whereas different ability distributions between reference and focal groups, as well as between level-2 clusters, are very likely in practice. Assessing the effect of the ability distribution on DIF is another area for investigation. Finally, comparing the relative performance of other methods of detecting DIF in polytomous data with a hierarchical structure, such as MIMIC modeling and PolySIB [35], would shed more light on this area of research.