Measurement Invariance and Psychometric Analysis of Oxford Happiness Inventory Scale across Gender and Marital Status

Background The Oxford Happiness Inventory (OHI) is a self-report tool to measure happiness. A brief review of previous studies on OHI showed the lack of evaluation of OHI fairness/equivalence in measuring happiness among identified groups. Methods To examine the psychometric properties and measurement invariance of the OHI, responses of 500 university students were analyzed using item response theory and ordinal logistic regression (OLR). Relevant measures of effect size were utilized to interpret the results. Differential test functioning was also evaluated to determine whether there is an overall bias at the test level. Results OLR analysis detected four items across gender and two items across marital status to function differentially. An assessment of effect sizes implied negligible differences for practical considerations. Conclusions This study was a significant step towards providing theoretical and practical information regarding the assessment of happiness by presenting adequate evidence regarding the psychometric properties of OHI.


Introduction
Happiness has been regarded throughout history as the ultimate human goal, superior to all others. Previous research indicates that happiness is rated higher than all other personal values and that it is a highly valued component of quality of life. Although psychological research initially tended to focus on mental illness and social or occupational disorders, interest in the positive dimensions of human life (e.g., well-being and happiness) increased in the late 20th century, and different measures have since been developed to assess happiness [1]. The most widely used and respected questionnaires for measuring happiness are the Subjective Happiness Scale [2], the Satisfaction with Life Scale [3], and the PANAS scale [4]. These questionnaires reflect different definitions and perceptions of happiness. The Oxford Happiness Inventory (OHI) [5] is another happiness instrument and one of the most appropriate, since it possesses several characteristics that are vital for assessing happiness: it is easy to administer, allows endorsements over an extended range, has an adequate number of items, shows internal reliability and validity, and is developmentally appropriate.
The OHI was devised as a broad measure of personal happiness in the Department of Experimental Psychology of the University of Oxford in the late 1980s. The development of the scale and some of its statistical properties were reviewed by Argyle, Martin, and Lu (1995). The scale has been found to behave consistently and has been used cross-culturally to compare students in Australia, Canada, the UK, and the USA [6]. The OHI has also been studied in other countries such as China, Iran, and Italy [7][8][9].
In the cross-cultural study, OHI questionnaires were completed by four samples of undergraduate students: 378 in the UK, 212 in the USA, 255 in Australia, and 231 in Canada. The findings supported the internal consistency of the scale among students in those countries. Furthermore, there were no significant sex differences in scores on the Inventory in any of the English-speaking samples. Given those findings, the OHI can be recommended for use as a trait measure in studies among undergraduates in each of those cultures [6].
An Italian adaptation of the OHI was administered to 782 adolescents. Using exploratory structural modeling, the authors found the total scale and the subscales of the Italian adaptation to be coherent with regard to both psychometric criteria and psychological meaning. Their results also supported the validity of the Italian version of the OHI as an instrument for measuring positive psychological functioning in adolescence. The scale showed adequate internal consistency and strong measurement invariance across gender [8].
Using Chinese samples, Lu and Shih (1997) examined the psychometric properties of the Chinese Happiness Inventory (CHI), which was based on the OHI. The measure was completed by 200 adults aged between 18 and 65 years living in Taiwan. Their results showed a negative direct relation between neuroticism and happiness and a positive direct relation between social desirability and happiness [9].
Bayani [7] examined the reliability of, and preliminary evidence for the validity of, a Persian version of the OHI in 309 undergraduate students (161 women and 148 men). In that study, the OHI, the Satisfaction with Life Scale, the Beck Depression Inventory, and the Depression-Happiness Scale were completed by the participants. Analyses indicated that the Persian version of the OHI is reliable as a measure of well-being and provided some preliminary evidence of construct validity [7].
A brief review of these previous studies on the OHI shows a lack of evaluation of its fairness/equivalence in measuring happiness among identified groups. Assessing measurement equivalence, commonly done through differential item functioning (DIF) analysis, is an important part of validating questionnaires: it tests whether the probability of responding to a specific item differs between identifiable groups after controlling for the construct being measured [10,11]. Therefore, the goal of this study was to assess the measurement equivalence of the OHI across gender and marital status. To achieve this goal, we followed the analytical framework employed by Mousavi et al. (2019) [12].

Method
2.1. Sample. This study involved 500 university students (62.4% male, 37.6% female) in 2018. The participants were selected by a two-stage random sampling technique from Shiraz University of Medical Sciences, Iran. At the first stage, five of the eleven faculties were selected randomly; at the second stage, 100 students were selected from each faculty through random sampling. After the aim of the study was explained, informed consent forms were signed by the students who expressed their willingness to participate. The mean (± standard deviation) age of participants was 21.3 ± 3.7 years.
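The two-stage design described above can be sketched as follows. This is only an illustration under stated assumptions: the roster (eleven faculties with 300 hypothetical student IDs each) and the function name `two_stage_sample` are invented for the example and do not come from the study.

```python
import random

def two_stage_sample(roster, n_faculties=5, n_per_faculty=100, seed=None):
    """Stage 1: draw faculties at random; stage 2: draw students within each.

    `roster` maps a faculty name to a list of student IDs (hypothetical data).
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(roster), n_faculties)   # stage 1
    return {f: rng.sample(roster[f], n_per_faculty)    # stage 2
            for f in chosen}

# Hypothetical roster: 11 faculties, 300 students each.
roster = {f"faculty_{i}": [f"f{i}_s{j}" for j in range(300)]
          for i in range(1, 12)}
sample = two_stage_sample(roster, seed=2018)
all_students = [s for students in sample.values() for s in students]
print(len(sample), len(all_students))
```

Because students are nested within faculties under this design, responses within a faculty may be correlated, which is the motivation for the hierarchical models mentioned in the Discussion.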
2.2. Instrument. The Oxford Happiness Inventory (OHI) [5] is a 29-item, self-report instrument, which was devised as a broad measure of personal happiness, mainly for in-house use in the Department of Experimental Psychology of the University of Oxford in the late 1980s [13]. The inventory was developed as a multidimensional scale to measure happiness, following the design and format of the Beck Depression Inventory (BDI). Each item is scored on an ordinal, polytomous scale from 0 to 3, so that total scores range from 0 to 87, with higher scores showing greater happiness [8,13]. The validity and reliability of the Persian version of the OHI have also been investigated in different studies and have been found to be acceptable [7].

Item Response Theory (IRT).
IRT was utilized to assess the dimensionality and psychometric properties of the OHI. Goodness-of-fit statistics were used to identify the best-fitting polytomous IRT model among the graded response model (GRM), the generalized partial credit model (GPCM), and the rating scale model (RSM). The indices were based on the M2 statistic [14]. Additionally, a likelihood ratio test was used to statistically compare the fitted models. Finally, the OHI was analyzed based on the best-fitting IRT model.

DIF analysis assesses whether the probability of responding to a specific item differs between groups after controlling for ability [10,11]. There are two forms of DIF, uniform and nonuniform. Uniform DIF occurs when the difference between groups in the probability of a given response keeps the same direction at all ability levels; nonuniform DIF occurs when the direction of this difference changes at some ability levels [11,15]. Methodological reviews have shown that there are several parametric and nonparametric statistical methods for investigating bias at the item as well as the test level [11,16,17]. Among these methods, ordinal logistic regression (OLR) [18] approaches have received notable attention in applied research [15,19]. This model-based procedure is effective and easy to implement, and it can control for additional categorical and continuous covariates that may confound the results of a DIF analysis [19][20][21][22][23]. Detecting DIF with OLR is based on comparing three nested models. The models, as given by French and Miller (1996), have the following forms:

$$
\begin{aligned}
\text{Model 1:}\quad & \operatorname{logit}\left[p(Y_i \le k)\right] = \alpha_k + \beta_1 \theta + \beta_2 g + \beta_3 (g \times \theta),\\
\text{Model 2:}\quad & \operatorname{logit}\left[p(Y_i \le k)\right] = \alpha_k + \beta_1 \theta + \beta_2 g,\\
\text{Model 3:}\quad & \operatorname{logit}\left[p(Y_i \le k)\right] = \alpha_k + \beta_1 \theta,
\end{aligned}
$$

where p(Y_i ≤ k) is the probability of responding at or below category k to an item for the ith person, α_k is the intercept for category k, β_1, β_2, and β_3 are regression coefficients, θ represents ability and is measured by the total test score, g is a grouping variable, and g × θ represents the interaction between the grouping variable and ability.
The difference in -2 log-likelihood between model 1 and model 3 can be used to detect uniform and nonuniform DIF simultaneously. This value is compared to a chi-square distribution with two degrees of freedom. If the comparison yields a significant result, the item is flagged for DIF, and further investigation is needed to determine whether the DIF is uniform or nonuniform. Comparison of models 1 and 2 is used to assess nonuniform DIF. Uniform DIF exists when models 2 and 3 differ significantly [11,15,24,25].
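The decision sequence above can be sketched in a few lines. The example assumes that the -2 log-likelihoods of the three fitted models are already available; the deviance values, the function names, and the rule of checking the interaction before the group main effect are illustrative assumptions, not the study's code. Chi-square tail probabilities are computed with closed forms that hold for one and two degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability (closed forms for df = 1 and df = 2)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or df = 2 are supported in this sketch")

def classify_dif(dev1, dev2, dev3, alpha=0.05):
    """dev1..dev3 are -2 log-likelihoods, numbered as in the text:
    model 1 = theta + g + g*theta, model 2 = theta + g, model 3 = theta only."""
    p13 = chi2_sf(dev3 - dev1, 2)  # overall DIF test, 2 df
    p12 = chi2_sf(dev2 - dev1, 1)  # interaction term: nonuniform DIF
    p23 = chi2_sf(dev3 - dev2, 1)  # group main effect: uniform DIF
    if p13 >= alpha:
        label = "no DIF"
    elif p12 < alpha:
        label = "nonuniform DIF"
    elif p23 < alpha:
        label = "uniform DIF"
    else:
        label = "DIF flagged, type unclear"
    return {"p13": p13, "p12": p12, "p23": p23, "label": label}

# Hypothetical deviances for a single item.
result = classify_dif(dev1=1000.0, dev2=1000.5, dev3=1008.0)
print(result["label"])
```

A statistically significant flag from this procedure says nothing about magnitude, so it is normally read together with an effect size.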
The effect of sample size on significance testing and the necessity of reporting effect sizes have been well documented [26]. Several studies have shown that test score-based methods such as logistic regression (LR) are prone to Type I error rate inflation (Gómez-Benito, Hidalgo, & Padilla, 2009). Therefore, when conducting studies to detect and interpret DIF, it is particularly useful to include measures of effect size, as they are not sensitive to the sample size. The use of effect size measures improves the decision to retain or exclude an item with DIF and reduces the incidence of false positive outcomes. This matters because excluding items that have been falsely identified as showing DIF can have serious effects on the reliability and validity of measurement instruments [27,28]. The measures of effect size suggested by Jodoin and Gierl (2001) were computed for all DIF items. The measure is the difference between the pseudo-R² values [29] of model 2 and model 1 for nonuniform DIF and the difference between the pseudo-R² values of model 3 and model 2 for uniform DIF [30].
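A minimal sketch of how such an effect size might be classified is given below. The 0.035 cutoff for negligible DIF is the one used in the Results; the 0.070 upper cutoff and the exact handling of boundary values follow the commonly cited Jodoin and Gierl guidelines and should be treated as assumptions here, as should the function names and the pseudo-R² values in the example.

```python
def delta_r2(r2_without_term, r2_with_term):
    """Gain in pseudo-R-squared when the extra term is added to the model."""
    return r2_with_term - r2_without_term

def jodoin_gierl_label(delta):
    """Classify a pseudo-R-squared difference (cutoffs assumed: 0.035, 0.070)."""
    if delta < 0.035:
        return "negligible"
    if delta <= 0.070:
        return "moderate"
    return "large"

# Hypothetical pseudo-R-squared values for one flagged item.
gain = delta_r2(r2_without_term=0.412, r2_with_term=0.421)
print(round(gain, 3), jodoin_gierl_label(gain))
```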
For the assessment of differential test functioning (DTF) of polytomous items, ν² was calculated based on Penfield and Algina [31]. The magnitude of DTF can be considered small if ν² is less than 0.07, medium if it is between 0.07 and 0.14, and high if it is more than 0.14 [32,33].
As in the case of dichotomous items, the item characteristic curves (ICC) of the item under investigation for the reference and focal groups can be used to depict DIF. Similarly, the item characteristic function (ICF) is a useful summary for polytomous items, especially for illustrating DIF. The ICF is defined as the sum of the expected scores over the response categories of an item (Nering and Ostini, 2011). For item j with m_j response categories scored x = 0, 1, ..., m_j − 1, the ICF can be written as

$$
\mathrm{ICF}_j(\theta) = \sum_{x=0}^{m_j - 1} x \, p_{jx}(\theta),
$$

where p_{jx}(θ) is the probability of obtaining a score of x on item j at ability level θ.
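To make the ICF concrete, the sketch below computes category probabilities under a graded response model and sums them into an expected item score. The item parameters are the average discrimination and thresholds reported in the Results, used here as a stand-in and not as an actual OHI item; the 0.3 threshold shift for the focal group is purely hypothetical and serves only to mimic uniform DIF.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    Cumulative probabilities P(X >= k) are logistic in a * (theta - b_k);
    each category probability is the difference of adjacent cumulatives.
    """
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b)))
                   for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def icf(theta, a, thresholds):
    """Expected item score: sum over x of x * p_x(theta), with x = 0, 1, ..."""
    probs = grm_category_probs(theta, a, thresholds)
    return sum(x * p for x, p in enumerate(probs))

a = 1.556                                  # average discrimination (Results)
b_ref = [-1.988, 0.213, 2.297]             # average thresholds (Results)
b_focal = [b + 0.3 for b in b_ref]         # hypothetical shift for focal group

grid = [-3, -2, -1, 0, 1, 2, 3]
for theta in grid:
    print(theta, round(icf(theta, a, b_ref), 3), round(icf(theta, a, b_focal), 3))
```

With all thresholds shifted in the same direction, the focal-group curve lies below the reference-group curve at every ability level, which is the graphical signature of uniform DIF.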
The goodness of fit indices of GRM, RSM, and GPCM are summarized in Table 1. Both the M2 statistic and other criteria showed fairly acceptable goodness of fit, but the GRM was found to be the best-fitting model.

Item Response Theory Analysis.
The goodness of fit between the data and the three selected IRT models was assessed using fit indices and a likelihood ratio test. Table 1 shows the goodness of fit indices for the GRM, RSM, and GPCM. The M2 statistic and the other fit indices indicate a better fit between the data and the GRM (RMSEA = 0.072, TLI = 0.927, and CFI = 0.933), but the other models also seem appropriate. Thus, a likelihood ratio test was performed to search for any statistical difference among the three models. Table 2 shows a statistically significant difference between the three models despite their very close fit indices, and both competing models showed lower log-likelihood values, with only a trivial difference between them. Therefore, the OHI items were analyzed based on the GRM, as shown in Table 3.
Regarding item discrimination (i.e., a values in Table 3), all items showed an adequate level of discriminating power, ranging from 0.953 (for item 12) to 2.436 (for item 5), with an average of 1.556. Regarding item difficulties (i.e., b values in Table 3), there are three thresholds (i.e., b1, b2, and b3) for each item, since the item response is recorded on a four-point Likert scale. The first threshold reflects the least amount of the underlying attribute needed to endorse the first option, and the last threshold indicates the maximum level of the underlying attribute needed to endorse the last category. The threshold values showed an incremental trend, with average values of -1.988, 0.213, and 2.297 for b1, b2, and b3, respectively.
Goodness of fit with the GRM at the item level was examined by the polytomous extension of S-X² [36], and the results are shown in Table 3. Only item 21 was identified as misfitting at p < 0.05. All other items showed acceptable fit to the GRM. The test information function and standard error of measurement of the OHI are shown in Figure 1.
This graph shows that the OHI is more informative and precise in the middle range of the underlying attribute (i.e., values approximately between -2 and 2). This is congruent with the aim of the tool, which is to measure happiness in a broad sense. The IRT analysis of the OHI thus supports its psychometric quality for measuring happiness.

Differential Item Functioning (DIF) Analysis.
Results indicate that four OHI items show uniform DIF across gender and two items show uniform DIF across marital status. Table 4 summarizes the results of assessing DIF across gender. Note that, for example, p₁₂ refers to the observed significance level for comparing models 1 and 2. In the same way, ΔR²₁₂ refers to the observed R² difference between models 1 and 2. A review of the first three columns of Table 4 shows that items 17, 25, 26, and 28 have p values smaller than the nominal alpha level of 0.05 (i.e., numbers in boldface). A significant overall difference between models 1 and 3, together with a nonsignificant group-by-ability interaction, indicates uniform DIF for items 17, 25, 26, and 28. Figure 2 presents the ICF curves for the items flagged with DIF. The ICF curves for items 25 and 28 indicate that female respondents are more likely to endorse response categories corresponding to a higher level of happiness compared to male respondents. On the other hand, the ICF curves for items 17 and 26 indicate that males had higher expected scores of happiness compared to females.
Table 5 shows the results of assessing DIF across marital status, using the same notation as Table 4. Based on the figures in Table 5, items 8 and 27 showed uniform DIF across marital status. As shown in Figure 3, item 8 was in favor of the married participants, whereas item 27 was in favor of single individuals. In other words, single individuals have higher expected scores of happiness compared to married participants in item 8 and vice versa in item 27.
The measures of effect size show whether a statistically significant outcome (p < 0.05) is also practically significant. According to the framework for DIF effect sizes proposed by Jodoin and Gierl (2001), all DIF items for both grouping variables (Tables 4 and 5) show negligible DIF (all effect sizes ≤ 0.035). The values of ν² were 0.03 and -0.004 for gender and marital status, respectively. These values indicate a small effect size according to Penfield and Algina (2006). Therefore, there is no overall bias at the test level.

Discussion
Previous studies found the OHI to be a reliable and psychologically valid tool for assessing levels of happiness among adolescents. To date, no study has examined the validity of the OHI in terms of measurement invariance and potential bias with respect to previously identified groups such as gender and marital status. Because of the polytomous response format of the OHI, this study utilized OLR to assess DIF of the OHI items and DTF across gender and marital status. The psychometric properties of the OHI were also examined as a prerequisite for DIF analysis. The results showed the appropriateness of using the GRM for analyzing the OHI. The measurement invariance analysis revealed that six of the 29 OHI items were flagged as exhibiting uniform DIF (four items across gender and two across marital status). Examination of effect sizes suggested that the observed uniform DIF is practically negligible. Very low values of ν² also suggested negligible differential test functioning across gender and marital status. These findings support the validity and fairness of the OHI for assessing happiness regardless of respondents' gender or marital status. It turned out that, although in previous studies the OHI was found not to be strictly unidimensional [8,13,37], this had very little impact on the DIF analysis.
Like other studies, this study had some limitations, which should be taken into consideration before drawing conclusions from its results. The major limitation was that DIF was assessed across only two variables, so further research is needed to fully evaluate the generalizability of the results by looking at other grouping variables such as culture, age group, job, and education. Another potential limitation was that students from different academic programs/colleges were treated the same. Different simulation studies have shown that ignoring the hierarchical structure of data (e.g., students nested in programs/colleges) might affect the estimated parameters of the model. Choosing a proper model for hierarchical data is crucial, as it allows for a potentially greater understanding of the issue under study and avoids statistical misspecification [11,20,38]. Therefore, the hierarchical OLR (HOLR) model should also be used in future studies with nested data.
In conclusion, this study was a significant step towards providing theoretical and practical information regarding the assessment of happiness by presenting adequate evidence regarding the psychometric properties of the OHI. Future studies may look at different methods for assessing DIF and at different groups to strengthen conclusions with respect to the OHI.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.