A Machine Learning Approach to Assess Differential Item Functioning in Psychometric Questionnaires Using the Elastic Net Regularized Ordinal Logistic Regression in Small Sample Size Groups

Assessing differential item functioning (DIF) using the ordinal logistic regression (OLR) model depends heavily on the asymptotic sampling distribution of the maximum likelihood (ML) estimators. The ML estimation method, which is often used to estimate the parameters of the OLR model for DIF detection, may be substantially biased with small samples. This study proposes a new application of the elastic net regularized OLR model, a special type of machine learning method, for assessing DIF between two groups with small samples. Accordingly, a simulation study was conducted to compare the power and type I error rates of the regularized and nonregularized OLR models in detecting DIF under various conditions, including moderate and severe magnitudes of DIF (DIF = 0.4 and 0.8), sample size (N), sample size ratio (R), scale length (I), and weighting parameter (w). The simulation results revealed that for I = 5 and regardless of R, the elastic net regularized OLR model with w = 0.1, as compared with the nonregularized OLR model, increased the power of detecting moderate uniform DIF (DIF = 0.4) by approximately 35% and 21% for N = 100 and 150, respectively. Moreover, for I = 10 and severe uniform DIF (DIF = 0.8), the average power of the elastic net regularized OLR model with 0.03 ≤ w ≤ 0.06, as compared with the nonregularized OLR model, increased by approximately 29.3% and 11.2% for N = 100 and 150, respectively. In these cases, the type I error rates of the regularized and nonregularized OLR models were below or close to the nominal level of 0.05. Overall, this simulation study showed that the elastic net regularized OLR model outperformed the nonregularized OLR model, especially in groups with extremely small sample sizes. Furthermore, the present research provides a guideline and some recommendations for researchers who conduct DIF studies with small sample sizes.


Introduction
In psychometric research such as health-related quality of life (HRQoL) studies, measurement invariance, i.e., the absence of differential item functioning (DIF), is a prerequisite assumption for the valid comparison of HRQoL scores across people from different subgroups (e.g., groups distinguished by gender, age, race, or health conditions). In general, DIF occurs when individuals from different groups respond differently to specific items in a questionnaire after controlling for the construct being measured [1,2]. The ordinal logistic regression (OLR) model is one of the best-known methods for the identification of DIF in psychometric research. The OLR model can evaluate both uniform and nonuniform DIF and can also control for other categorical and continuous variables which may affect the results of DIF analysis [3,4]. Uniform DIF occurs when the difference in item response probabilities remains constant across all levels of the construct, whereas nonuniform DIF is evident when the direction of DIF differs across various parts of the construct scale [5,6].
It is well documented that statistical inference based on the OLR model typically depends on the asymptotic properties of the maximum likelihood (ML) estimator [7]. When the sample size is relatively large, the sampling distribution of the ML estimators of the OLR coefficients is asymptotically normal and unbiased. However, when the sample size is very small, the asymptotic assumptions underlying the ML procedure are violated. This in turn limits the use of the OLR model for DIF detection in small samples [4,7]. Hence, it is essential to determine the extent to which the conventional OLR model is an efficient method for assessing DIF in the context of small sample sizes. A simulation study has shown that the OLR model needs sample sizes of about 200 per group to obtain adequate power and type I error rates in assessing DIF [8]. However, in HRQoL studies, researchers frequently encounter small sample sizes due to practical limitations such as prohibitive costs or the low prevalence of a disease [9,10].
Nowadays, a wide range of regularized or bias-correction methods, often described as machine learning approaches, have been proposed to overcome the problem of small sample size in binary and ordinal logistic regression models [7, 11–14]. In the binary logistic regression model, Firth's penalized maximum likelihood (PML) estimation approach was originally introduced to correct or reduce the small-sample bias of the ML estimators [11]. The ML estimation method tends to produce overfitted models with poor predictive performance in the presence of small or sparse samples, whereas the PML procedure offers some improvement by shrinking the regression coefficients. Indeed, the PML method produces unbiased and finite estimates of regression coefficients in small-sample situations in which ordinary ML estimation fails [11,13,15]. Recently, for ordinal response data, Wurm et al. proposed the elastic net regularized OLR model to improve the estimation of the regression coefficients as well as to perform variable selection [14]. The elastic net is another regularized regression technique that linearly combines the penalties used in the ridge and least absolute shrinkage and selection operator (LASSO) regression models [16]. The ridge and LASSO regressions are two well-known members of the regularization family of methods [12,17]. The major difference between the two algorithms is that while the ridge method shrinks all of the regression coefficients toward, but not exactly to, zero, the LASSO method shrinks some of the coefficients exactly to zero. This difference stems from the penalty terms used by each method: while the LASSO regression adds the sum of the absolute values of the regression coefficients to the ordinary likelihood function, the ridge regression adds the sum of their squared magnitudes [12,17].
A recent psychometric study has shown that regularization techniques, as a special type of machine learning method, outperform conventional DIF detection procedures when the sample size is extremely small [18]. Although Lee has evaluated the effect of regularization methods on the performance of binary logistic regression models in detecting DIF in small sample size groups [5], such an evaluation has never been provided for the OLR model. For the binary logistic regression model, Lee showed that when the sample size is relatively small, there is no difference between the conventional ML and PML estimation methods in terms of power and type I error rate for detecting DIF [5]. Hence, in the present study, we propose a new application of the elastic net regularized OLR model, as a special type of machine learning approach, for assessing DIF when the sample sizes of one or both groups of interest are relatively small. Accordingly, through a comprehensive simulation study, we investigated whether the statistical properties (power and type I error) of the OLR model for evaluating DIF, with and without regularization, are affected by the sample size, sample size ratio, DIF magnitude, scale length, and the value of the weighting parameter across the focal and reference groups. Furthermore, a real data set was used to validate the simulation results.

Methods
The cumulative ordinal logistic regression (OLR) model, also known as the proportional odds model, was used to identify DIF across the two groups [4]. Moreover, the Monte Carlo simulation method was utilized to compare the statistical properties (power and type I error) of the regularized and nonregularized OLR models for detecting DIF.

DIF Detection Method Based on the Nonregularized OLR Model
Uniform and nonuniform DIF can be assessed by comparing three nested OLR models:

Model 1: logit P(Y ≤ j) = γ_1j + β_11 θ
Model 2: logit P(Y ≤ j) = γ_2j + β_21 θ + β_22 G
Model 3: logit P(Y ≤ j) = γ_3j + β_31 θ + β_32 G + β_33 (θ × G)

Here, P(Y ≤ j) denotes the probability of choosing category j or below, G is a covariate indicating group membership with two levels (i.e., the reference and focal groups), and θ is the observed ability of an examinee, usually represented by the total test score [3,4].
The regression coefficients (γs and βs) are unknown constants estimated by maximizing the log-likelihood function via the maximum likelihood (ML) method [7]. According to the above models, the presence of uniform DIF can be checked by testing whether the coefficient of group membership differs significantly from zero (H_0: β_22 = 0). This is done by comparing the -2 log-likelihood values of models 1 and 2 against a chi-square distribution with one degree of freedom. Similarly, nonuniform DIF is evaluated by testing the interaction coefficient between ability and group membership (H_0: β_33 = 0), comparing the -2 log-likelihood values of models 2 and 3 against a chi-square distribution with one degree of freedom [4]. It should be noted that since the DIF direction changes along the latent ability scale in nonuniform DIF, nonuniform DIF items cancel each other out at the test level [19]. Therefore, this simulation study focuses on uniform DIF detection.
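The study fits these models with standard OLR software in R; as an illustrative sketch (not the authors' code), the likelihood-ratio test of H_0: β_22 = 0 can be re-implemented from scratch in Python with scipy. All parameter values below (slopes, thresholds, sample size, DIF shift) are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def neg_loglik(params, X, y, J):
    """Negative log-likelihood of a proportional-odds (cumulative logit) model.
    Thresholds are parametrized as gamma_1 plus positive increments so they
    stay ordered during optimization."""
    n_thr = J - 1
    gammas = np.concatenate(([params[0]],
                             params[0] + np.cumsum(np.exp(params[1:n_thr]))))
    beta = params[n_thr:]
    eta = X @ beta
    # P(Y <= j) = logistic(gamma_j - eta); pad with P(Y<=0)=0 and P(Y<=J)=1
    cum = 1.0 / (1.0 + np.exp(-(gammas[None, :] - eta[:, None])))
    cum = np.column_stack([np.zeros_like(eta), cum, np.ones_like(eta)])
    probs = cum[np.arange(len(y)), y + 1] - cum[np.arange(len(y)), y]
    return -np.sum(np.log(np.clip(probs, 1e-12, None)))

def fit_olr(X, y, J):
    """Fit by direct ML; return the minimized negative log-likelihood."""
    k = (J - 1) + X.shape[1]
    return minimize(neg_loglik, np.zeros(k), args=(X, y, J), method="BFGS").fun

# --- toy data for a single 5-category item with uniform DIF (illustrative) ---
rng = np.random.default_rng(1)
N, J = 300, 5
theta = rng.normal(size=N)              # observed ability (total-score proxy)
G = rng.integers(0, 2, size=N)          # 0 = reference, 1 = focal
eta = 1.2 * theta - 0.8 * G             # -0.8: uniform DIF shift for focal group
thr = np.array([-1.5, -0.5, 0.5, 1.5])
cum = 1.0 / (1.0 + np.exp(-(thr[None, :] - eta[:, None])))
y = (cum < rng.random(N)[:, None]).sum(axis=1)        # categories 0..4

nll1 = fit_olr(theta[:, None], y, J)                  # model 1: theta only
nll2 = fit_olr(np.column_stack([theta, G]), y, J)     # model 2: theta + G
lr_stat = 2.0 * (nll1 - nll2)     # difference of -2 log-likelihoods
p_value = chi2.sf(lr_stat, df=1)  # test of H0: beta_22 = 0 (no uniform DIF)
```

A small p-value flags the item for uniform DIF; the test for nonuniform DIF works the same way with model 3 versus model 2 and the θ × G column added.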

DIF Detection Method Based on the Elastic Net Regularized OLR Model
In this section, a new machine learning approach is proposed for DIF detection based on the elastic net regularized OLR model [14]. Detecting DIF with the regularized (elastic net) OLR model proceeds exactly as with the nonregularized model, except that uniform and nonuniform DIF are detected by comparing the penalized log-likelihood values of models 1 and 2 and of models 2 and 3, respectively [5].
The elastic net regularized OLR model linearly combines the LASSO and ridge penalties, and its penalized objective function (to be minimized) can be formulated as

-(1/N) log L + λ [ w ∑_{j=1}^{p} |β_j| + ((1 − w)/2) ∑_{j=1}^{p} β_j² ],

in which N is the total sample size, log L denotes the conventional log-likelihood function, and λ ≥ 0 and 0 ≤ w ≤ 1 are the regularization and weighting parameters, respectively [14,16]. In this expression, β_j indicates the regression coefficients, and ∑_{j=1}^{p} β_j² (squared L2-norm of β) and ∑_{j=1}^{p} |β_j| (L1-norm of β) represent the ridge and LASSO penalty terms, respectively [12,17]. The elastic net OLR model reduces to the LASSO OLR model when w = 1 and to the ridge OLR model when w = 0. When λ = 0, the penalty term is eliminated and the usual log-likelihood function of the ML method is recovered. The "ordinalNet" package in the R software was used to fit the elastic net regularized OLR model [14,16].
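As a small illustration of the penalty term itself, the weighted mix of the LASSO and ridge components can be computed directly. This is a sketch, not ordinalNet's internals; the 1/2 scaling on the ridge term follows the glmnet-style convention, and the coefficient values are hypothetical.

```python
import numpy as np

def elastic_net_penalty(beta, lam, w):
    """Elastic net penalty on the slope coefficients: a weighted mix of the
    LASSO (L1) and ridge (squared-L2) terms; w = 1 -> LASSO, w = 0 -> ridge."""
    beta = np.asarray(beta, dtype=float)
    l1 = np.abs(beta).sum()          # LASSO term: sum of |beta_j|
    l2 = 0.5 * np.sum(beta ** 2)     # ridge term (with the 1/2 factor)
    return lam * (w * l1 + (1.0 - w) * l2)

beta = np.array([0.5, -1.0])
elastic_net_penalty(beta, lam=0.1, w=1.0)  # pure LASSO: 0.1 * 1.5 = 0.15
elastic_net_penalty(beta, lam=0.1, w=0.0)  # pure ridge: 0.1 * 0.625 = 0.0625
elastic_net_penalty(beta, lam=0.0, w=0.5)  # lambda = 0: penalty vanishes
```

Intermediate values of w, such as the 0.03 ≤ w ≤ 0.06 range studied here, trade off the two shrinkage behaviors.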
In this simulation study, for any fixed value of w in the interval 0 ≤ w ≤ 1, the optimal value of λ was determined by the Bayesian information criterion (BIC). For the elastic net regularized OLR model, the BIC was calculated as

BIC = -2 log L_λ + N_nonzero × log(N),

where log L_λ represents the log-likelihood value for the regression parameters estimated with regularization parameter λ, N is the total sample size, and N_nonzero indicates the estimated number of nonzero parameters including the intercepts [14]. For a regular sequence of λ values, automatically generated by the "ordinalNet" package, the corresponding BIC values are calculated, and the package chooses the optimal λ as the value that minimizes the BIC [14].
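The BIC-based selection over a λ path can be sketched as follows. The (λ, log-likelihood, nonzero-count) triples are invented purely to illustrate the computation and do not come from the paper.

```python
import numpy as np

def bic(loglik, n_nonzero, n_total):
    """BIC = -2 log L_lambda + N_nonzero * log(N), where N_nonzero counts
    the nonzero estimated parameters including the intercepts."""
    return -2.0 * loglik + n_nonzero * np.log(n_total)

# hypothetical fits along a lambda path (larger lambda -> fewer nonzero terms)
path = [
    {"lam": 0.00, "loglik": -120.0, "nonzero": 6},
    {"lam": 0.05, "loglik": -121.0, "nonzero": 5},
    {"lam": 0.10, "loglik": -128.0, "nonzero": 4},
]
best = min(path, key=lambda fit: bic(fit["loglik"], fit["nonzero"], n_total=150))
best["lam"]  # the lambda minimizing BIC over this path
```

Here the middle fit wins: it sacrifices one log-likelihood unit but drops a parameter, and the log(N) complexity charge makes that trade worthwhile.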
Generation of Simulated Data
In this study, Samejima's graded response model (GRM) [20], which is suitable for simulating items with ordered response categories, was used to produce five-category response data for measures with five or ten items. The mathematical equation of the GRM is

P*_ij(θ) = 1 / (1 + exp(-a_i(θ − b_ij))),

where P*_ij(θ) is the probability of responding in or above category j of item i, a_i is the item discrimination parameter, θ is the latent trait (ability), and b_ij denotes the item difficulty (threshold) parameter for category j of item i. The item discrimination parameters were randomly generated from the uniform distribution over the interval (1, 2), and the difficulty and ability parameters were sampled from the standard normal distribution [1].
In the present simulation study, four factors that could influence the performance of the regularized and nonregularized OLR models were manipulated: total sample size (N), sample size ratio (R), scale length (I), and magnitude of uniform DIF. Out of the five or ten items in the scale, one item (item 1) was flagged with uniform DIF across the reference and focal groups. In order to simulate moderate and severe uniform DIF, 0.4 and 0.8 were, respectively, added to the difficulty parameters (b_1j) of item 1 in the focal group. Five sample sizes (N = 100, 150, 200, 300, and 400) and three levels of sample size ratio (R = 1, 2, and 3) were considered. When N = 100, for example, the conditions n_r/n_f = 50/50, 67/33, and 75/25 were generated. In addition, two tests with 5 and 10 items (I = 5 and 10) were generated. Each item had five response categories (J = 5). The data were simulated using the "catIrt" package [21] as well as the "runif" and "rnorm" functions in the R statistical software (version 4.0.2).
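The study generates data with R's "catIrt" package; a rough Python sketch of the same generating scheme for a single DIF item (GRM responses with a 0.4 threshold shift for the focal group) might look like the following. Parameter draws mirror the distributions described above; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_grm_item(theta, a, b, rng):
    """Draw graded responses under Samejima's GRM:
    P*(Y >= j | theta) = 1 / (1 + exp(-a * (theta - b_j)))
    for the J-1 ordered thresholds b. Returns categories 0..J-1."""
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    u = rng.random(len(theta))
    # category = number of boundary curves the uniform draw falls below
    return (u[:, None] < p_star).sum(axis=1)

N = 200
theta = rng.normal(size=N)            # abilities ~ N(0, 1)
a = rng.uniform(1.0, 2.0)             # discrimination ~ U(1, 2)
b = np.sort(rng.normal(size=4))       # 4 ordered thresholds -> J = 5 categories
group = rng.integers(0, 2, size=N)    # 0 = reference, 1 = focal

# uniform DIF: shift the focal group's thresholds by 0.4 (moderate DIF)
y_ref = simulate_grm_item(theta, a, b, rng)
y_foc = simulate_grm_item(theta, a, b + 0.4, rng)
y = np.where(group == 1, y_foc, y_ref)
```

The shifted thresholds make the item uniformly harder for the focal group at every ability level, which is exactly the uniform DIF pattern targeted by the β_22 test.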
The main objective of this simulation study was to determine the optimal value of the weighting parameter w for obtaining adequate power and type I error rates in DIF detection. In order to find the optimal value of w, the following sequence of weighting parameters was considered: w = 0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.1, 0.5, and 1 (w = 0 and w = 1 corresponded to the ridge and LASSO regressions, respectively). For each w, the BIC was used to select the appropriate regularization parameter λ [14].
The power rate (true-positive rate) is defined as the proportion of replications in which the uniform DIF item is correctly flagged by each method. The type I error rate (false-positive rate) is the proportion of non-DIF items wrongly identified as having DIF across the 1000 replications, averaged over all items without DIF [22].

Results
Tables 1 and 2, respectively, show the power and type I error rates of the regularized (elastic net) and nonregularized OLR models in assessing the presence of moderate uniform DIF (DIF = 0.4) under different combinations of scale length (I), total sample size (N), sample size ratio (R), and five response categories (J = 5).
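The two rejection rates defined above reduce to the same one-line computation; the p-values below are invented purely to illustrate it.

```python
import numpy as np

def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications in which H0 (no DIF) is rejected at level alpha."""
    return float(np.mean(np.asarray(p_values) < alpha))

# hypothetical p-values from a handful of replications (illustrative only)
p_dif_item  = [0.001, 0.030, 0.200, 0.004, 0.010]  # item simulated WITH DIF
p_null_item = [0.400, 0.700, 0.020, 0.550, 0.910]  # item simulated WITHOUT DIF

power        = rejection_rate(p_dif_item)   # true-positive rate: 4/5 = 0.8
type_i_error = rejection_rate(p_null_item)  # false-positive rate: 1/5 = 0.2
```

Applied to DIF-bearing items this gives the power; applied to clean items and averaged, it gives the type I error rate reported in the tables.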

According to Table 1, for I = 5 and regardless of R, the elastic net regularized OLR model with w = 0.1, as compared with the nonregularized OLR model, increased the power of detecting moderate uniform DIF (DIF = 0.4) by approximately 35%, 21%, and 21.7% for N = 100, 150, and 200, respectively. In this case, the type I error rates of the regularized and nonregularized OLR models were close to the nominal level of 0.05 (Table 2). However, this increase in power was approximately 15.5% and 12.6% for N = 300 and 400, respectively. In this situation, the average type I error rate of the nonregularized OLR model was close to the nominal level of 0.05, while it was approximately 0.07 for the elastic net regularized OLR model.
As demonstrated in Table 1, for N ≥ 300, I = 5, and R = 1, increasing the weighting parameter (w) from 0.1 to 0.5 increased the power of the elastic net regularized OLR model for detecting moderate uniform DIF (DIF = 0.4) by approximately 11.7% to 17.7% relative to the nonregularized OLR model. In this case, the elastic net regularized OLR model tended to reach the adequate power of 80%.

Tables 3 and 4, respectively, present the power and type I error rates of the regularized (elastic net) and nonregularized OLR models in detecting the presence of severe uniform DIF (DIF = 0.8) under different combinations of I, N, R, and J = 5. As shown in Table 3, when I = 5 and R = 1, the elastic net regularized OLR model with w = 0.05, as compared with the nonregularized OLR model, increased the power of detecting severe uniform DIF (DIF = 0.8) by approximately 9% and 4.5% for N = 100 and 150, respectively. In this case, the type I error rates of the elastic net regularized and nonregularized OLR models were 0.058 and 0.078 for N = 100 and 150, respectively. Similar findings were obtained for the elastic net regularized OLR model with w = 0.04 and 0.06. Hence, for a relatively small sample size (N ≤ 150), I = 5, and R = 1, the elastic net regularized OLR model with 0.04 ≤ w ≤ 0.06 achieved the adequate power of 80% and type I error rate of 0.05 for detecting severe uniform DIF (DIF = 0.8). However, for N ≥ 200, I = 5, and R = 1, the type I error rates were considerably higher than the nominal level of 0.05 for both the elastic net regularized (w ≥ 0.03) and nonregularized OLR models. Under the same conditions, the power of the elastic net regularized OLR model with 0 ≤ w ≤ 0.02 was equal to or greater than 80% in detecting severe uniform DIF (DIF = 0.8), while its type I error rate was below or close to the nominal level of 0.05.
In addition, for N ≥ 200 and I = 5, similar findings were obtained when the sample size was unequal (R = 2 and 3) between the focal and reference groups. Moreover, regardless of R, for the severe magnitude of DIF (DIF = 0.8), I = 10, and N ≥ 200, the powers of the regularized and nonregularized OLR models were close to or higher than 80%, and their type I error rates were below or close to the nominal level of 0.05.
On the other hand, for a severe magnitude of DIF (DIF = 0.8), I = 10, and a relatively small sample size (N ≤ 150), the power of detecting DIF increased as the weighting parameter (w) increased.

Figures 1 and 2 demonstrate the average powers and type I error rates of the regularized and nonregularized OLR models on measures with five and ten items for moderate and severe DIF, respectively. According to these figures, regardless of DIF magnitude, sample size ratio, and the number of items, the elastic net regularized OLR model with w = 0 (ridge) and w = 1 (LASSO) had the lowest and highest power for detecting uniform DIF, respectively. Moreover, the nonregularized OLR model and the elastic net regularized OLR model with w = 0.03 had approximately equal power. Additionally, as shown in Figures 1 and 2, the average type I error rate was generally below or close to the nominal level of 0.05 for the elastic net regularized (w = 0, 0.03, and 0.05) and nonregularized OLR models.

Real Data Analysis
In this section, a real data set was employed to validate the simulation results. The data set comprised 72 children and adolescents with attention-deficit/hyperactivity disorder (ADHD). The results of the DIF analysis of the PedsQL™ 4.0 instrument across children with ADHD and their parents based on the regularized and nonregularized OLR models are summarized in Table 5. Our results showed that while the elastic net regularized OLR model with w ≥ 0.03 and the nonregularized OLR model flagged seven of the 23 items with uniform DIF, the elastic net regularized OLR model with w = 0, 0.01, and 0.02 identified four, four, and six items with uniform DIF, respectively. In addition, according to Table 5, increasing the value of the weighting parameter (w) from 0 to 1 increased the power of the elastic net regularized OLR model for detecting uniform DIF. These results were consistent with the simulation results in Tables 1 and 3, especially when the sample size was extremely small (N ≤ 150).

Discussion
A small sample size is a challenging issue in assessing DIF for ordinal response data. To the best of our knowledge, this is the first study that introduces a new application of the elastic net regularized OLR model, as a special type of machine learning method, for detecting DIF in small sample size groups. One of the main advantages of the elastic net regularized OLR method over other regularization methods for DIF detection is that it provides the researcher with a wide range of models by varying the weighting parameter (w) over the interval from 0 to 1. However, choosing an optimal value for w is the key issue in using elastic net regularized models [14,16]. The present simulation study demonstrated that the elastic net regularized OLR model with 0.03 ≤ w ≤ 0.06 generally outperformed the nonregularized OLR model in terms of statistical power to identify DIF, especially when the sample size was extremely small (N ≤ 150), the number of items was 10 (I = 10), and DIF was severe (DIF = 0.8). These findings differ from those of Lee, who showed that the nonregularized binary logistic regression model performed slightly better than the regularized logistic regression model (based on the PML estimation method) in detecting DIF for small samples [5].
The current study also showed that when w was greater than 0.1 (w = 0.5 and 1), the value of the regularization parameter (λ) converged to zero, indicating that the likelihood function of the elastic net regularized OLR model reduced to the likelihood function of the nonregularized OLR model. This can lead to nearly identical results for the elastic net regularized (w > 0.1) and nonregularized OLR models in identifying DIF.
Although previous studies have used the LASSO regularized method for DIF assessment with binary-scored items [18, 24–26], the findings of the current study suggest caution in using the LASSO regularized OLR model (w = 1) for DIF analysis. This is because when the number of items is five (I = 5), the LASSO regularized OLR model inflates the type I error rate to 0.2. However, for longer scales (I = 10) and a relatively small sample size (N ≤ 150), using the LASSO regularized OLR model for the identification of severe uniform DIF (DIF = 0.8) is strongly recommended. This finding also confirms the results of a previous simulation study which showed that the regularized logistic regression model with the LASSO penalty outperformed the conventional logistic regression model in DIF detection when the sample size was small and the number of items was 20 (I = 20) [24].
A further novel finding of the present study is that the elastic net regularized OLR model with 0.03 ≤ w ≤ 0.07 becomes necessary for DIF analysis when the scale length and sample size are relatively small (I = 5 and N ≤ 150). These findings differ from those of the previous study by Scott et al., where the nonregularized OLR model was used to detect DIF when the sample size was greater than or equal to 200 (N ≥ 200) [8]. They simulated subscales with 2, 3, 4, 5, 10, and 20 items and showed that the number of items in the scale had a relatively small effect on the results of DIF detection based on the nonregularized OLR model [8]. Accordingly, further simulation studies are needed to explore the effect of varying the scale length on the performance of the LASSO OLR model for DIF analysis.
Furthermore, the findings of the current study revealed that when N ≤ 150 and DIF was moderate (DIF = 0.4), the ridge regularized OLR model (w = 0) had the lowest power in detecting DIF as compared with the other regularized OLR models and the nonregularized OLR model. Under these conditions, we should be cautious about using the ridge regularized OLR model (w = 0) for detecting uniform DIF because it can lead to false-negative results.
On the other hand, comparing the regularized and nonregularized OLR models on the real ADHD data set confirmed the results of the present simulation for detecting DIF. The current simulation study demonstrated that increasing the value of the weighting parameter (w) increases the power of the elastic net regularized OLR model. Accordingly, in the real data set, the elastic net regularized OLR model with w equal to or greater than 0.03 was more sensitive in detecting uniform DIF across children with ADHD and their parents than the elastic net regularized OLR model with 0 ≤ w ≤ 0.02.
One of the interesting findings of the current study was that when N ≥ 200 and DIF = 0.8 (severe DIF), the regularized and nonregularized OLR models had adequate power for detecting DIF. These findings were similar to those of Magis et al., who reported that when the sample size was large, the regularized LASSO logistic regression and the nonregularized logistic regression models yielded similar results for identifying DIF in binary response data [24]. However, the current simulation study revealed that when the magnitude of DIF was severe (DIF = 0.8), N ≥ 200, and the number of items was five (I = 5), the elastic net regularized OLR model with w ≥ 0.03 and the nonregularized OLR model inflated the type I error rate up to the unacceptable level of 26.6%.
In the present study, we simulated data under moderate (DIF = 0.4) and severe (DIF = 0.8) DIF conditions. Similar to our study, Hidalgo et al. manipulated two levels of differences in threshold parameters (0.4 and 0.8) to simulate moderate and large magnitudes of DIF [27]. Moreover, according to Li and Zumbo, DIF magnitudes of 0.4, 0.6, and 0.8 can be considered small, moderate, and large DIF, respectively [28]. However, Scott et al. considered three levels of DIF magnitude, namely 0.2, 0.5, and 1, to generate items with small, moderate, and large DIF, respectively [8].
Although there is no consensus on the definition of DIF magnitude, it seems that the magnitude of DIF from 0.4 to 1 is a feasible option to simulate items with moderate and severe DIF.
Although various criteria, including the Bayesian information criterion (BIC) and the cross-validation method, can be used to obtain the optimal tuning parameter λ [14,16,18], in the present research only the BIC was used to assess DIF. In DIF analysis, cross-validation and the BIC have technical and theoretical differences. The BIC has been designed for variable selection, whereas cross-validation is generally used to select the optimal model for prediction [26]. Moreover, previous studies have demonstrated that the cross-validation method tends to have higher false-positive rates in detecting DIF than the BIC [18,25]. Because variable selection is much more important than prediction in DIF assessment, the findings of the present study were interpreted based on the BIC [26].
This research has some limitations. Although our simulation study was restricted to identifying only uniform DIF, the elastic net regularized OLR model could also be extended to cover both uniform and nonuniform DIF. In addition, the regularized model used in the present research contained at most two predictors, which did not appear to be collinear. Hence, in future studies, the performance of the elastic net regularized OLR model for DIF analysis should be assessed when several highly correlated continuous and categorical covariates are included in the model [14]. Finally, it should be noted that traditional DIF detection approaches are very sensitive to missing item responses, and questionnaires with missing data are usually excluded from DIF analysis [29]. Hence, in the present study, we did not simulate items with missing data. Nevertheless, as a special type of machine learning method, the elastic net regularized OLR model could be a viable choice for identifying DIF with missing data and without needing imputation [24].

Conclusion
Technically, the findings of the present study support the idea proposed by Belzak and Bauer that regularization methods, as a special type of machine learning technique, can compensate for the limitations of conventional DIF detection methods when the sample size is relatively small [18]. This study provides a guideline for researchers who conduct DIF studies with extremely small sample sizes. In general, for extremely small sample sizes (N ≤ 150), the elastic net regularized OLR model with 0.03 ≤ w ≤ 0.1 outperformed the nonregularized OLR model in terms of power and type I error rate. Moreover, future studies should assess the advantages of the elastic net regularized OLR model in dealing with missing data and the collinearity problem in the context of DIF analysis.

Data Availability
The real data utilized to support the results of this study are available from the corresponding author upon request.