The Effect of Small Sample Size on Measurement Equivalence of Psychometric Questionnaires in MIMIC Model: A Simulation Study

Evaluating measurement equivalence (also known as differential item functioning (DIF)) is an important part of the process of validating psychometric questionnaires. This study aimed at evaluating the multiple indicators multiple causes (MIMIC) model for DIF detection when the latent construct distribution is nonnormal and the focal group sample size is small. In this simulation-based study, Type I error rates and power of the MIMIC model for detecting uniform-DIF were investigated under different combinations of reference to focal group sample size ratio, magnitude of the uniform-DIF effect, scale length, number of response categories, and latent trait distribution. Moderate and high skewness in the latent trait distribution led to decreases of 0.33% and 0.47%, respectively, in the power of the MIMIC model for detecting uniform-DIF. Increasing the scale length, the number of response categories, and the magnitude of DIF improved the power of the MIMIC model by 3.47%, 4.83%, and 20.35%, respectively, and decreased its Type I error by 2.81%, 5.66%, and 0.04%, respectively. This study revealed that the power of the MIMIC model was at an acceptable level when latent trait distributions were skewed. However, the empirical Type I error rate was slightly greater than the nominal significance level. Consequently, the MIMIC model was recommended for detection of uniform-DIF when the latent construct distribution is nonnormal and the focal group sample size is small.


Introduction
In recent years, the assessment of differential item functioning (DIF), also referred to as measurement equivalence, has been widely used to validate psychological assessment instruments such as quality of life questionnaires [1,2]. People with the same quality of life level should answer the items in a quality of life questionnaire in the same way, regardless of their education, gender, or other group memberships [3]. The mean quality of life score may differ among groups; DIF occurs when an item in the questionnaire has different measurement properties for one group of individuals versus another, irrespective of the mean differences on the construct [4].
Several methods have been developed for identifying DIF in test items. DIF detection methods can be classified as parametric or nonparametric. Mantel-Haenszel, standardization, and the simultaneous item bias test are important nonparametric methods, while item response theory, logistic and ordinal logistic regression, multiple-group analysis, and multiple indicators multiple causes (MIMIC) are important parametric methods for DIF testing [5].
Multiple-group analysis and MIMIC are two structural equation modeling approaches that have been widely used to assess DIF by many applied researchers [6][7][8]. Previous studies have shown that, under specific conditions, multiple-group analysis is preferable to the MIMIC approach [4], but multiple-group analysis requires a large sample size because the model is fitted to the data for each group separately [4]. In this study, we have focused solely on the MIMIC model as a well-known method to detect DIF [4]. The MIMIC model has several advantages when compared with other methods of DIF detection: it requires a smaller sample size, latent variables can be predicted by at least one single-item indicator, it can be applied to dichotomous and polytomous items, it does not require all items to have the same number of response categories, and it provides information on the structural and measurement models [1,4,9,10].
Typically, in DIF testing in medical studies, two groups are labeled as the reference and focal groups, and patients are often placed in the latter group. A common problem in medical and psychological studies is small sample size, particularly in the focal group, where access to patients, especially those with rare diseases, is difficult. Furthermore, a small sample size saves time and money [18]. Consequently, evaluating the statistical properties of the MIMIC model for detecting DIF can be quite valuable when the focal group is small.
Skewness of the distribution of the latent trait, also referred to as the latent construct, is an important point that needs to be considered in DIF detection [19,20]. In psychological investigations, nonnormal distributions are commonly encountered. Several researchers have discussed the statistical properties of the MIMIC model in DIF testing with actual data [21]. They concluded that the use of different methods for evaluating DIF may lead to different results. To the best of our knowledge, the effect of skewness of latent trait distributions on DIF detection by the MIMIC model has not been investigated.
A Monte Carlo simulation study is an essential tool for assessing the behavior of MIMIC model under various conditions. This study is the first simulation-based investigation to assess MIMIC model for DIF detection, when latent construct distribution is nonnormal and focal sample size is small. We have discussed the advantages and disadvantages through a series of simulations.

MIMIC DIF Detection.
Two types of DIF can be identified, denoted as uniform and nonuniform [22]. Uniform-DIF is the simplest type of DIF, where the probability of selecting a specific category of an item is uniformly greater (or lesser) for one group than the other at all levels of the latent construct. Uniform-DIF occurs when item difficulty parameters are different in the two groups [22]. Nonuniform-DIF occurs when the difference between groups in the probability of answering a specific category of an item varies across latent construct levels [22].
Uniform-DIF detection with the MIMIC model is performed by regressing potential DIF items and the latent variable onto a covariate concurrently [14]. This covariate can be either continuous or categorical, but in medical and psychological research it is usually assumed to be a dichotomous variable. The mechanism of MIMIC for detection of uniform-DIF is shown in Figure 1. Nonuniform-DIF can be assessed by regressing potential DIF items on the interaction between the latent factor and the group membership indicator (Xi) [15]. Although the MIMIC model is a useful approach for identifying uniform-DIF, the accuracy of this model in detecting nonuniform-DIF appeared to be questionable [23]. In this study, we have only focused on uniform-DIF, which is important among applied researchers.

Figure 1: Mechanism of the MIMIC model for detecting uniform-DIF. Parameters shown: the regression coefficient displaying the mean difference on the latent trait; the threshold parameter; the regression coefficient displaying the group difference in the threshold for an item; the measurement error for an item; and a residual for the latent trait. Note. Items 2 to 5 constitute the DIF-free items when item 1 is tested for uniform-DIF.
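As a hedged illustration of the uniform-DIF specification described above, one common way to write a MIMIC model is sketched below. The symbols are assumptions chosen for this sketch (they are not taken from the source's figure): $y_i^{*}$ is the latent response underlying item $i$, $\lambda_i$ its factor loading, $\eta$ the latent trait, $z$ the dichotomous grouping covariate, $\gamma$ the group difference in the latent mean, $\beta_i$ the direct effect of group on item $i$, $\tau_{i,k}$ the thresholds, and $\varepsilon_i$ and $\zeta$ the residuals.

```latex
\begin{align}
  y_i^{*} &= \lambda_i \eta + \beta_i z + \varepsilon_i, \\
  \eta    &= \gamma z + \zeta, \\
  y_i     &= k \quad \text{if } \tau_{i,k} < y_i^{*} \le \tau_{i,k+1}.
\end{align}
```

Under this sketch, item $i$ would be flagged for uniform-DIF when the hypothesis $\beta_i = 0$ is rejected, with $\beta_j$ fixed to zero for the items treated as DIF free.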

Data Generation.
Ordinal responses were generated from the graded response model (GRM) [24]:

$$P^{*}_{ik}(\theta) = \frac{\exp[a_i(\theta - b_{ik})]}{1 + \exp[a_i(\theta - b_{ik})]},$$

where $P^{*}_{ik}(\theta)$ is the probability of a respondent selecting a particular response category ($k$) or a higher category for one item ($i$), $a_i$ and $b_{ik}$ are the discrimination and threshold parameters, and $\theta$ denotes the latent trait. The distribution of the discrimination parameters was determined based on empirical research and preliminary simulation. In all conditions, $a_i$ and $b_{ik}$ were drawn from the uniform distribution over the interval (1.5, 2) and the standard normal distribution, respectively.
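The data-generation step can be sketched in Python as follows. This is a minimal illustration, not the authors' code (the study used the R package catIrt); the function names, the seed, the sample sizes, and the sorting of thresholds are assumptions made for this sketch. It uses the standard cumulative-logistic form of the GRM and shifts the thresholds of the focal group to induce uniform-DIF.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under the graded response model.

    theta: (n,) latent traits; a: scalar discrimination;
    b: (K-1,) increasing thresholds. Returns an (n, K) array.
    """
    theta = np.asarray(theta, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    # Cumulative probabilities of responding in category k or above
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    ones = np.ones((theta.shape[0], 1))
    zeros = np.zeros((theta.shape[0], 1))
    cum = np.hstack([ones, p_star, zeros])
    # Adjacent differences of cumulative curves give category probabilities
    return cum[:, :-1] - cum[:, 1:]

def simulate_item(theta, a, b, rng, dif_shift=0.0):
    """Draw ordinal responses 0..K-1; dif_shift is added to all thresholds."""
    probs = grm_category_probs(theta, a, np.asarray(b) + dif_shift)
    u = rng.random(len(theta))
    cdf = np.cumsum(probs, axis=1)
    # Inverse-CDF sampling: count how many cumulative bounds u exceeds
    return (u[:, None] > cdf[:, :-1]).sum(axis=1)

rng = np.random.default_rng(2024)         # hypothetical seed
a = rng.uniform(1.5, 2.0)                 # discrimination ~ U(1.5, 2)
b = np.sort(rng.standard_normal(4))       # 4 thresholds -> 5 categories (sorting is an assumption)
theta_ref = rng.standard_normal(300)      # reference group, normal latent trait
theta_foc = rng.standard_normal(100)      # focal group
y_ref = simulate_item(theta_ref, a, b, rng)
y_foc = simulate_item(theta_foc, a, b, rng, dif_shift=0.5)  # medium uniform-DIF
```

Sorting the drawn thresholds keeps the cumulative curves ordered, so every category probability is nonnegative.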

Simulation Scenarios.
In this study, we assumed two groups, labeled as the reference and focal groups. Five factors were investigated in this simulation study: reference to focal group sample size ratio, magnitude of the uniform-DIF effect, scale length, the number of response categories, and latent trait distribution.
The sample size ratio between the reference and focal groups was set at R100/F100, R200/F100, R300/F100, R400/F100, and R500/F100. Medium and severe uniform-DIF were simulated by adding 0.5 and 1, respectively, to the threshold parameters of the focal group [3,25]. The length of the scale was set at 5 and 10 items. It is worth mentioning that Likert-type scales and odd numbers of response categories are frequently used in psychological and medical research. In this simulation study, 3-, 5-, and 7-point ordinal responses were used [4,26,27].
In this study, we used the Beta distribution to generate skewed latent construct distributions. Beta (1, 4) and Beta (0.5, 4) distributions were used for situations in which the participants responded moderately and mostly negatively, and Beta (4, 1) and Beta (4, 0.5) were used when they responded moderately and highly positively [3]. Since data generated from the Beta distribution with these parameters lie in the interval (0, 1), they were standardized to make them comparable with the standard normal distribution. In total, we generated 780 (5 * 2 * 2 * 3 * 13) independent simulation conditions; each condition was replicated 1000 times.
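The standardization of Beta-distributed latent traits can be sketched as follows. This is an illustration under stated assumptions: the source does not say whether theoretical or sample moments were used, so the sketch assumes the theoretical mean and variance of the Beta distribution; the function names and the seed are hypothetical.

```python
import numpy as np

def standardized_beta(alpha, beta, size, rng):
    """Draw Beta(alpha, beta) variates and rescale to mean 0, variance 1.

    Assumption: standardization uses the theoretical Beta moments.
    """
    x = rng.beta(alpha, beta, size=size)
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return (x - mean) / np.sqrt(var)

def beta_skewness(alpha, beta):
    """Theoretical skewness of Beta(alpha, beta)."""
    return (2 * (beta - alpha) * np.sqrt(alpha + beta + 1)
            / ((alpha + beta + 2) * np.sqrt(alpha * beta)))

rng = np.random.default_rng(7)  # hypothetical seed
theta_moderate = standardized_beta(1, 4, 100, rng)    # Beta(1, 4)
theta_high = standardized_beta(0.5, 4, 100, rng)      # Beta(0.5, 4)
```

Standardizing leaves the skewness unchanged, so the shape of each Beta distribution is preserved while its location and scale match those of the standard normal condition.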
Nonconvergence is one of the most common problems during estimation in the MIMIC model. Small sample size, non-positive definite matrices, and out-of-bounds estimates are three important causes of nonconvergence in the MIMIC model [28,29]. Out-of-bounds estimates are sometimes referred to as "Heywood cases", which occur when there are improper solutions for the standard error/variance (less than 0) or for a correlation (greater than 1 or less than −1) [28]. In this study, the number of converged replications was recorded. A seed was used to control the random number generation [30]. Harwell et al. emphasized that using a seed in a simulation study can minimize the effect of random error on parameter estimates [31]. Another advantage of fixing the seed is that the same data set can easily be reproduced later if it needs to be reviewed [31]. To achieve reliable results, if the number of converged replications was low, the seed given to the R program was changed and the analysis was repeated.
Statistical power was defined as the proportion of replications in which DIF was correctly identified by the MIMIC method. For calculating the power, we assumed that item 1 had uniform-DIF. The Type I error rate, also referred to as the false positive rate, was assessed by the proportion of times that DIF was incorrectly identified in the 1000 replications [32].
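The two summary measures can be computed from per-replication test results along the following lines. This is a sketch only: the helper name is hypothetical, the example p-values are made up for illustration, and dropping nonconverged replications (coded as NaN) is an assumption, since the source does not state how such replications were handled in the proportions.

```python
import numpy as np

def rejection_rate(p_values, alpha=0.05):
    """Proportion of converged replications with p < alpha.

    Applied to the DIF item this estimates power; applied to a
    DIF-free item it estimates the empirical Type I error rate.
    """
    p = np.asarray(p_values, dtype=float)
    p = p[~np.isnan(p)]  # assumption: NaN marks a nonconverged replication
    return float(np.mean(p < alpha))

# Illustrative (made-up) p-values for the DIF direct effect across replications
p_dif_item = [0.001, 0.030, 0.200, 0.004]    # item simulated with uniform-DIF
p_clean_item = [0.400, 0.020, 0.700, 0.900]  # item simulated without DIF

power = rejection_rate(p_dif_item)
type_i_error = rejection_rate(p_clean_item)
```

With the nominal level of 0.05 used in the study, an empirical Type I error rate near 0.05 on DIF-free items indicates that the test is well calibrated.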
The catIrt and lavaan packages in R version 3.21 software were used to generate data from the GRM and to fit the MIMIC model for DIF testing, respectively [33,34]. The nominal Type I error rate for this study was 0.05.

Effect of Reference to Focal Group Sample Size Ratio on Detecting Uniform-DIF.
As the sample size increased, the power of the MIMIC model increased systematically; however, there was no pattern in Type I error. The results of this study showed that when the latent trait distribution in the reference group was standard normal, or the latent trait distribution was the same in the reference and focal groups, a total sample size of 500 for graded items with 3 ordered response categories (R400/F100) and 300 for items with 5 and 7 response categories (R200/F100) sufficed. Refer to Tables 3 and 5 for more information on changes in power and Type I error rate.

Effect of Magnitude of DIF on Detecting Uniform-DIF.
When other circumstances stayed fixed, an increase in the magnitude of DIF from medium to severe improved the power of the MIMIC model in detecting uniform-DIF: by 20.35% in total, and by 24.28% and 16.42% in 5-item and 10-item scales, respectively. In this situation, Type I error did not change significantly.

Effect of Scale Length on Detecting Uniform-DIF.
Increasing the scale length from 5 to 10 items caused an increase of approximately 3.47% in the power of the MIMIC model for detecting uniform-DIF. According to our results, increasing the number of items from 5 to 10 improved the power of the MIMIC model for detecting medium uniform-DIF: by 6.79% in total, 8.78% in the 3-point response scale, 5.90% in the 5-point response scale, and 5.71% in the 7-point response scale. In this situation, the Type I error rate changed slightly (2.76%).
Increasing the number of items from 5 to 10 decreased the Type I error rate of the MIMIC model for detecting severe uniform-DIF: by 2.87% in total, 1.56% in the 3-point response scale, 1.01% in the 5-point response scale, and 6.03% in the 7-point response scale. In this circumstance, the power changed by about 0.15%.

Effect of Number of Response Categories on Detecting Uniform-DIF.
When other conditions remained constant, an increase in the number of response categories improved the power of the MIMIC model in detecting uniform-DIF: by 4.83% in total, and by 5.66%, 1.52%, and 7.33% when increasing the number of response categories from 3 to 5, from 5 to 7, and from 3 to 7, respectively.

BioMed Research International
Similarly, when other conditions were fixed, increasing the number of response categories decreased the Type I error of the MIMIC model in detecting uniform-DIF: by 5.66% in total, and by 2.73%, 5.47%, and 8.80% when increasing the number of response categories from 3 to 5, from 5 to 7, and from 3 to 7, respectively.

Effect of Latent Trait Distribution on Detecting Uniform-DIF.
Skewness in the latent trait distribution led to a slight change in the Type I error and power of the MIMIC model for detecting uniform-DIF. When latent trait distributions were normal (condition 13), moderately skewed (conditions 1, 3, 5, 7, 9, and 11), and highly skewed (conditions 2, 4, 6, 8, 10, and 12), the mean powers of the MIMIC model to detect uniform-DIF were 0.920, 0.917, and 0.915, and the corresponding Type I error rates were 0.054, 0.059, and 0.069, respectively. For medium uniform-DIF, the mean powers under normal, moderately skewed, and highly skewed latent trait distributions were 0.842, 0.837, and 0.835, with Type I error rates of 0.054, 0.059, and 0.069, respectively. For severe uniform-DIF, the mean powers were 0.998, 0.997, and 0.995, with Type I error rates of 0.054, 0.060, and 0.069, respectively.
In most scenarios, when the latent trait in the reference group was normally distributed or the latent trait distribution was the same in the reference and focal groups (all conditions except 10 and 12), Type I error was less than 0.06 and the power of the MIMIC model was at an acceptable level (greater than 80%). Therefore, we can conclude that the MIMIC model was robust to skewness in the latent trait. In conditions 10 and 12, when the latent trait distribution in one group was highly positively skewed and in the other group was highly negatively skewed or vice versa, the MIMIC model had its lowest power and its greatest Type I error in detecting uniform-DIF.
We also simulated all 390 scenarios for a small magnitude of DIF (magnitude of DIF was 0.25). Under the best circumstances, with the larger sample size (R500/F100), the 10-item scale, severe uniform-DIF, 7-point ordinal responses, and a normal latent trait distribution in both groups, power and Type I error were 0.489 and 0.055, respectively. Given that the MIMIC model was not appropriate for detecting small uniform-DIF, we refrained from describing these results.
All 1000 replications met the convergence criteria, whether the latent trait distribution was normal or skewed. In all scenarios, goodness-of-fit indices such as Root Mean Square Error of Approximation (RMSEA), Root Mean Square Residual (RMR), Tucker-Lewis Index (TLI), Comparative Fit Index (CFI), and Goodness-of-Fit Index (GFI) were at an acceptable level. Space constraints prevented us from presenting the goodness-of-fit results for all the simulations in detail. Tables 2 and 3 show the Type I error rates and power of the MIMIC model for detecting uniform-DIF in the 5-item scale. Tables 4 and 5 show the statistical properties of the MIMIC model for detecting uniform-DIF in the 10-item scale.

Real Data Example.
In this section, we present an example using a real questionnaire to illustrate the effect of small sample size on measurement equivalence of psychometric questionnaires in the MIMIC model.
The 12-item General Health Questionnaire (GHQ-12) is an appropriate instrument to assess Minor Psychiatric Disorders (MPD) during the previous month [35]. A cross-sectional study was conducted to identify MPD with the GHQ-12 among 771 nurses employed in hospitals of the Fars and Bushehr provinces, Southern Iran, between October and December 2014. Only a brief description of the data used in this study is given here because they have been fully described elsewhere [35].
Of the 269 men participating in the study, 100 men were randomly selected. Among 502 women, samples with the size of 100, 200, 300, 400, and 500 were randomly chosen.
The results of fitting the MIMIC model to detect uniform-DIF are shown in Table 6. At all sample sizes, item 12 of the GHQ-12 was detected with uniform-DIF. The intensity of uniform-DIF was severe for item 12 and medium for item 1. For this reason, item 1 of the GHQ-12 was detected with uniform-DIF by the MIMIC model in the larger samples (M100/F400 and M100/F500).

Discussion
The present study provided a simulation-based framework to determine the statistical properties of MIMIC model when latent trait distribution was nonnormal and sample size was small.
Up to now, in most simulation studies, item responses have been produced using the GRM with a normally distributed latent trait. However, in many psychological studies, the assumption of normality of the latent construct is frequently violated in practice [36,37]. What distinguishes this study from previous ones is the effort to assess the performance of the MIMIC model in uniform-DIF detection when the latent trait distribution is nonnormal. Our results showed that skewness in the latent trait distribution had little effect on MIMIC model performance in uniform-DIF detection. However, Type I error was inflated when the latent trait distribution in one group was highly positively skewed and in the other group was highly negatively skewed, or vice versa. Until now, there has been no documented evidence on the effect of skewness of the latent construct distribution on the performance of the MIMIC model. However, Monaco found that high skewness in the latent trait distribution resulted in a 5% to 10% decrease in the power for detecting DIF in dichotomous items with the differential functioning of items and tests, Mantel-Haenszel, and Lord's chi-square methods [38]. Kaya et al. concluded that moderate skewness in the latent trait leads to an approximately 10% decrease in the power for detecting uniform-DIF by logistic regression in polytomous items [20]. Another Monte Carlo simulation study showed that high skewness in the latent trait distribution could reduce the power of the ordinal logistic regression model by up to 57.7% [3].
Under various combinations of latent trait distributions, the power of the MIMIC model increased as the reference group sample size increased, but Type I error did not follow a specific pattern. This finding is consistent with previous studies demonstrating that the power for detecting DIF increased with sample size [4,23].
Unbalanced sample sizes between the focal and the reference groups are common in real-life circumstances. In previous simulation studies, the sample size ratio between the focal and reference groups varied between 1 and 5 [4,9,27]. A previous study indicated that the MIMIC model DIF detection test was not powerful enough when the sample size ratio between the focal and reference groups was smaller than 5 (R500/F100) and the latent trait was normally distributed [4]. However, we found that, in these situations, the MIMIC model was powerful for detecting uniform-DIF when the sample size ratio was more than 3 (R300/F100) in the 3-point response scale and more than 2 (R200/F100) in the 5- or 7-point response scale.
The results of this study indicated that increasing the number of items could improve the power and decrease the Type I error rate of the MIMIC model for detecting uniform-DIF. In this respect, our results were in line with those of several studies [4,9,25,39]. However, a few researchers have argued that the number of indicators does not appear to affect the power [14].
When the magnitude of uniform-DIF was increased, the performance of MIMIC model improved; that is, the power increased and Type I error was reduced. This was an expected result, and similar results were reported in other studies [14,40].
Another important feature considered in this study was the number of response categories, which could affect the power of the MIMIC model for detecting DIF. Our study shows that an increased number of response categories resulted in a systematic increase in the power of the MIMIC model for detecting uniform-DIF. Increasing the number of response categories from 5 to 7 improved the MIMIC model power by only 1.52% for detection of uniform-DIF. Because a larger number of response categories creates problems for participants with low education, we suggest the 5-point response scale, which is easier to interpret and more suitable for people with lower levels of education. Allahyari et al. recommended that the minimum number of response categories for DIF analysis be five [3]. Willse and Goodman showed in a simulation study that the MIMIC model performed better for continuous variables than for categorical variables in DIF testing [39].
Our study showed that the number of converged MIMIC models did not depend on the degree of skewness in the latent construct distribution. In numerical analysis, the number of convergences could be affected by the method used for parameter estimation [39,41]. There are several methods for parameter estimation in the MIMIC model, including maximum likelihood (ML), generalized least squares (GLS), weighted least squares (WLS), and weighted least squares mean and variance adjusted (WLSMV). In this study, ML was used for parameter estimation. Previous studies have demonstrated that the ML method was preferable to the GLS and WLS procedures when data were nonnormal in the MIMIC model [42]. Another previous study showed that the ML method has less Type I error than WLSMV [43]. Also, GLS and WLS require a larger sample size than ML estimation for fitting the model [39].
The MIMIC model uses a single latent covariance matrix for parameter estimation. Hence, in this model, it is assumed that the variance of the latent factor is equal across the groups. Carroll concluded that violating the homogeneity of variance assumption could lead to inflated Type I error in DIF detection and increased bias in estimating the factor loadings and the latent group mean difference [14]. Our study showed that heterogeneity of variance (conditions 1, 2, 3, 4, 9, 10, 11, and 12) led to an increase in the Type I error of the MIMIC model in detection of uniform-DIF.
There are many different methods for generating DIF items. The most common technique, which was used in this study, is adding a certain amount to all thresholds for the focal group. Although this issue is controversial, some authors point out that adding or subtracting a value asymmetrically to the threshold parameters could affect model performance for DIF detection [3]. Scott et al. indicated that subtracting or adding a specified amount to the threshold does not affect the results significantly [25].
Finally, this study had some limitations which need to be taken into account. Previous simulation studies have shown that the power of the MIMIC model could be affected by the number of DIF items [11]. In this study, however, we assumed that only one item had uniform-DIF. Taking this factor into consideration would have forced us to simulate a larger number of scenarios, which was time-consuming. The MIMIC model can be used for both uniform and nonuniform-DIF detection. However, most researchers believe that the MIMIC model does not perform appropriately in detecting nonuniform-DIF, because the parameterization of the MIMIC model is only suitable for identifying uniform-DIF [23,44]. Also, detecting nonuniform-DIF requires additional computational effort to fit the MIMIC model, because the latent trait cannot simply be multiplied by the group variable, which is an observed variable [15]. In this study, we limited our DIF detection to uniform-DIF and two groups at a time, a reference group and a focal group. Nonetheless, the MIMIC model can handle both types of DIF and more than two groups [15,21].

Conclusion
Our findings showed that increasing the number of response categories, the number of items, the magnitude of DIF, and the sample size could increase the power of the MIMIC model for uniform-DIF detection. This study revealed that the MIMIC model in detection of uniform-DIF was fairly robust to departure from the normal latent trait distribution assumption. When latent trait distributions were skewed, the power of the MIMIC model in detection of uniform-DIF was at an acceptable level. However, the empirical Type I error rate was slightly greater than the nominal significance level of 0.05. Consequently, this technique is appropriate for uniform-DIF detection when the latent trait distribution is nonnormal and the focal group sample size is small. Because increasing the number of response categories from 5 to 7 had a negligible effect on power, we recommend the 5-point response scale for uniform-DIF detection using the MIMIC model, especially for participants with low levels of education. The results obtained from this study provide an appropriate guideline for further research. We recommend further studies to investigate the effect of the number of items with DIF and the type of DIF on MIMIC model power when the latent trait is skewed.