Validity, Reliability, and Responsiveness of the Brief Pain Inventory in Inflammatory Bowel Disease

Background and Aims. No patient-reported outcome measures targeting pain have yet been validated for use in IBD patients. Consequently, the aim of this study was to test the psychometrical properties of the Brief Pain Inventory (BPI) in an outpatient population with IBD. Methods. Participants were recruited from nine hospitals in the southeastern and western parts of Norway. Clinical and sociodemographic data were collected, and participants completed the BPI, as well as the Short-Form 36 (SF-36). Results. In total, 410 patients were included. The BPI displayed high correlations with the bodily pain dimension of the SF-36, as well as moderate correlations with disease activity indices. The BPI also displayed excellent internal consistency (Cronbach's alpha value of 0.91, regardless of diagnosis) and good to excellent test-retest values (intraclass correlation coefficient (ICC) 0.84–0.90 and Kappa values ≥ .70). In UC, calculation of responsiveness revealed that only BPI interference in patients reporting improvement reached the threshold of 0.2. In CD, Cohen's d ranged from 0.26 to 0.68. Conclusions. The BPI may serve as an important supplement in patient-reported outcome measurement in IBD. There is a need to confirm responsiveness in future studies. Moreover, responsiveness should ideally be investigated using changes in objective markers of inflammation.


Introduction
Inflammatory bowel diseases (IBDs), including ulcerative colitis (UC) and Crohn's disease (CD), are characterised by chronic, recurrent inflammation of the gastrointestinal (GI) tract [1,2]. In UC, the inflammation is located in the colonic and rectal mucosa, whereas in CD, any part of the GI tract may be affected. Common IBD symptoms include diarrhoea,

Materials and Methods
Participants were recruited from nine hospitals in the southeastern and western parts of Norway as part of a cross-sectional and longitudinal observational, multicentre study. Inclusion criteria were >17 years of age, a verified diagnosis of IBD (based on endoscopic, laboratory, and histological findings; Lennard-Jones criteria) [22], and the ability to read and write in Norwegian and to give written informed consent. Patients were excluded if the investigators found them to be unable to comply with the study procedures. The inclusion period was from March 2013 to April 2014. At each of the inclusion centres, a senior gastroenterologist was in charge of the study. The psychometrical testing of the BPI was performed using the quality recommendations of the COSMIN (Consensus-based Standards for the Selection of Health Measurement Instruments) checklist [23].

Sociodemographic and Clinical Data.
Sociodemographic variables, which included age, gender, smoking habits, and self-perceived IBD symptoms, were self-reported by patients. Self-perceived IBD symptoms were obtained through patient classification of IBD symptoms during the last 14 days. Four possible scores were used: no symptoms, mild symptoms (do not interfere with everyday activities), moderate symptoms (do interfere with everyday activities and may result in sick leave), and severe symptoms (unable to carry out everyday activities, on sick leave or hospitalised).
Disease activity was assessed through laboratory tests, faecal calprotectin (FeCal test-Calpro) and the activity indices Simple Clinical Colitis Activity Index (SCCAI) and Simplified Crohn's Disease Activity Index (SCDAI) [24,25]. Phenotype was classified according to the Montreal classification. In addition, current use of medication was recorded from medical records.

Questionnaires
The Brief Pain Inventory.
The pain intensity section of the BPI consists of four items that are scored from 0 (no pain) to 10 (worst possible pain), whereas the functional interference section consists of seven items that are scored from 0 (no interference) to 10 (complete interference). A pain severity score is calculated from the mean of the four pain intensity items, and a pain interference score is calculated from the mean of the seven pain interference items [10]. In addition to the BPI intensity and interference items, the questionnaire also has four optional items that are not included in psychometrical testing according to the BPI user manual. The Norwegian translation of the BPI has been tested previously, and Cronbach's alphas were 0.87 for the pain severity scale and 0.92 for the interference scale. Moreover, the correlation between BPI pain severity and the European Organisation for Research and Treatment of Cancer (EORTC) QLQ-C30 questionnaire item on pain intensity was 0.70 (p < 0.001). The correlation between the BPI interference index and the EORTC QLQ-C30 item on pain influence on daily living was 0.62 (p < 0.001) [9].
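The scoring rule described above (two dimension scores, each the mean of its items) can be illustrated with a minimal sketch. The function name `bpi_scores` and the example responses are hypothetical and are not taken from the BPI user manual or from the study data; the sketch assumes complete responses on the 0 to 10 scale.

```python
def bpi_scores(intensity_items, interference_items):
    """Return (pain severity, pain interference) as item means.

    intensity_items: the four pain intensity ratings (0-10).
    interference_items: the seven interference ratings (0-10).
    Hypothetical helper for illustration; assumes no missing items.
    """
    if len(intensity_items) != 4:
        raise ValueError("expected four pain intensity items")
    if len(interference_items) != 7:
        raise ValueError("expected seven interference items")
    for value in list(intensity_items) + list(interference_items):
        if not 0 <= value <= 10:
            raise ValueError("BPI items are scored from 0 to 10")
    severity = sum(intensity_items) / len(intensity_items)
    interference = sum(interference_items) / len(interference_items)
    return severity, interference


# Made-up responses for one respondent (worst, least, average, now;
# then the seven interference items).
severity, interference = bpi_scores([6, 2, 4, 4], [5, 3, 0, 4, 2, 6, 1])
```

For these invented responses, the severity score is 16/4 and the interference score is 21/7.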

SF-36.
The SF-36 is a well-validated, generic HRQoL questionnaire comprising 36 items [20,26]. The 36 questions are divided into 8 multi-item scales, consisting of physical functioning (PF), role limitations because of physical problems (RP), bodily pain (BP), general health, vitality (VT), social functioning (SF), role limitations because of emotional problems (RE), and mental health (MH). For each question, the raw score was coded and transformed into a scale from 0 to 100, with 0 indicating the lowest level of function and 100 the highest level of function.

Statistical Analysis.
To assess the characteristics of the sample, we used descriptive analysis, frequencies, and the χ² test. Face validity was tested by distributing the questionnaire to 15 patients before field testing, to receive their input on item content, scoring, and structure. The construct of the BPI was tested using principal axis factoring, Oblimin rotation with Kaiser normalisation, and eigenvalues >1. To find the optimal cut-points for mild, moderate, and severe pain, we deployed previously described methods, using the average pain item of the BPI [27,28]. Average pain was divided into 8 different schemes (using different options for upper values of mild and moderate pain). Multivariate analysis of variance was used, and the highest F-value (Wilks's lambda) was considered indicative of the scheme that was most useful for distinguishing mild, moderate, and severe pain.
Concurrent validity was tested through linear regression analysis, entering the well-established BP dimension of SF-36 as the dependent variable and average pain groups as the independent variable. We hypothesised that increased pain severity would be a negative predictor of BP. Construct validity was tested according to recommendations in the literature [29] using three approaches: (a) convergent validity, (b) discriminant validity, and (c) known-group validity. Convergent validity was calculated using bivariate correlation analysis (Spearman's rho) of the BPI, SF-36, and well-established disease activity indices. It was hypothesised that elevated pain (increased BPI scores) would correlate negatively with all SF-36 dimensions, but that the strongest negative correlation would be between the BPI and the bodily pain dimension. Increased disease activity indices were hypothesised to be positively associated with BPI scores. In UC, none of the items in the SCCAI specifically measure pain. However, we chose to correlate the BPI against the SCCAI items (a) general condition and (b) complications (including, e.g., joint pain). In CD, the BPI was correlated with the SCDAI items (a) abdominal pain and (b) complications. Discriminant validity was calculated by comparing the correlation between each BPI item and its hypothesised dimension with its correlation to other dimensions. Known-group validity was tested through one-way analysis of variance, by comparing mean BPI scores in patients reporting no, mild, moderate, or severe IBD symptoms. Moreover, we hypothesised that the SF-36 bodily pain score would be negatively associated with increased pain level (BPI average pain). A post hoc Scheffé test was used to control for multiple comparisons. Floor and ceiling effects were investigated by calculating the percentage of patients scoring either the lowest or highest possible score in individual items, as well as in dimensional scores.
If the number of lowest or highest possible scores on the BPI exceeded 15%, this was, according to recommendations [30], regarded as indicative of floor or ceiling effects. Internal consistency reliability was tested with Cronbach's alpha. Test-retest reliability was measured using the intraclass correlation coefficient (ICC, two-way mixed, single measure). At the second BPI assessment (4-6 weeks after baseline), patients self-reported their perceived disease state using a question with three potential answers: "Compared to last time you completed the questionnaire, how do you evaluate your IBD condition today? (A) Unchanged (B) Improved, or (C) Deteriorated." Based on this item, ICC values were calculated among those patients reporting a stable condition. Responsiveness was calculated by comparing the BPI scores at baseline with those after 4-6 weeks in patients who reported either worsening or improvement in IBD symptoms. Both Guyatt's statistic and Cohen's d were used to calculate responsiveness. Guyatt's statistic was calculated by dividing the mean change in individuals reporting either improvement or deterioration of symptoms by the standard deviation of the change score in those unchanged. Cohen's effect size was calculated as the mean difference between groups divided by the pooled standard deviation. Values of 0.2, 0.5, and 0.8 were categorised as small, medium, and large, respectively. Missing data were treated as recommended in the literature; if data in half or less than half of the items within a scale were missing, they were replaced by the mean value of the respondent's completed items in the same scale [31]. All tests were two-sided, with a 5% significance level, and were performed using Predictive Analytics Software (PASW), version 23.0 (SPSS Inc., 233 S. Wacker Drive, Chicago, Illinois, United States).
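The two responsiveness measures described above can be written out as a short sketch. This is an illustration under stated assumptions rather than the analysis actually run in the statistical package: the function names are hypothetical, the pooled standard deviation uses the common sample-size-weighted form, and the example numbers are invented.

```python
from math import sqrt
from statistics import mean, stdev


def guyatt_statistic(change_in_changed, change_in_stable):
    """Mean change among patients reporting improvement or deterioration,
    divided by the SD of the change scores among unchanged patients."""
    return mean(change_in_changed) / stdev(change_in_stable)


def cohens_d(group_a, group_b):
    """Mean difference between two groups divided by the pooled SD.
    Assumption: pooled SD weighted by degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def label_effect(value):
    """Map an effect size to the 0.2 / 0.5 / 0.8 categories."""
    size = abs(value)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


# Invented change scores (follow-up minus baseline), for illustration only.
improved_change = [-2, -1, -3]
stable_change = [-1, 0, 1]
g = guyatt_statistic(improved_change, stable_change)
d = cohens_d([2, 4, 6], [1, 3, 5])
```

The sign of either statistic only reflects the direction of change; `label_effect` therefore works on the absolute value.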

Ethical Considerations.
Participation in the study was based on written informed consent and performed in accordance with the principles of the revised Helsinki Declaration. Approval was obtained from the Regional Ethics Committee (reference number: 2012/845/REK Sør-Øst A).

Results
In total, 452 patients were eligible and were invited to participate in the study. Written informed consent was given by 414 patients (91.6%); 4 of these were excluded because more than 50% of their data were missing, leaving 410 patients for analysis. Of these, 230 were diagnosed with CD and 180 with UC. Baseline characteristics of the included patients are presented in Table 1. No significant differences in gender or age were found between patients who declined participation, those excluded because of missing values, and those included in the analyses. Data on the diagnosis of those declining participation, however, were not available. Calprotectin values were available for 325 of the 410 patients. A significant increase in calprotectin levels (p = 0.001) according to patient-perceived IBD symptoms was observed (no symptoms mean = 184; mild symptoms mean = 203; moderate symptoms mean = 222; and severe symptoms mean = 440).
After inviting all 410 patients from baseline to complete the BPI a second time, 243 responded, corresponding to 59% of the original sample (CD 130/230; UC 113/180). None of those 242 patients at the retest had missing values on the BPI. In CD, 110 patients reported that their condition was unchanged compared with baseline, whereas 14 reported symptom improvements and 5 deterioration. The comparable numbers in UC were 86 unchanged, 20 improved, and 8 worsened.
Overall BPI scores according to diagnosis are presented in Table 2. In UC, floor effects in individual items varied from 25% (pain average) to 75.6% (pain interference, walking). In the BPI intensity dimension, the floor effect was 22.8%, whereas the comparable number in the BPI interference dimension was 33.9%. In CD, floor effects in individual items varied from 23% (pain average) to 75.7% (pain interference, walking). In the BPI dimensional scores, the floor effect was 21.3% for intensity and 35.2% for interference. Ceiling effects did not exceed 15% in either UC or CD.
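The floor and ceiling percentages reported above follow from a simple count against the 15% threshold. A minimal sketch, with a hypothetical function name and made-up scores rather than study data:

```python
def floor_ceiling(scores, lowest=0, highest=10, threshold=15.0):
    """Percentage of respondents at the lowest and highest possible score,
    flagged as a floor or ceiling effect when the share exceeds `threshold`."""
    n = len(scores)
    if n == 0:
        raise ValueError("no scores supplied")
    floor_pct = 100 * sum(s == lowest for s in scores) / n
    ceiling_pct = 100 * sum(s == highest for s in scores) / n
    return {
        "floor_pct": floor_pct,
        "ceiling_pct": ceiling_pct,
        "floor_effect": floor_pct > threshold,
        "ceiling_effect": ceiling_pct > threshold,
    }


# Ten invented item scores: three at the minimum, one at the maximum.
result = floor_ceiling([0, 0, 0, 1, 2, 3, 4, 5, 6, 10])
```

With these invented scores, 30% of respondents sit at the minimum (a floor effect under the 15% rule) and 10% at the maximum (no ceiling effect).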
Evaluating cut-off values for mild, moderate, and severe pain resulted in the optimal cut-off being (a) a score of . Analysis of construct, omitting values below 0.40, revealed a two-factor solution explaining 68% of the variance. The factor solution was equal when analysed separately for UC and CD. Items with high loadings in factor one (0.73-0.89) included all four original BPI intensity items (pain worst, least, average, and now), whereas items with high loadings in factor two (0.58-0.93) included the seven original BPI interference items (general activity, mood, walking, work, relationship, sleep, and enjoyment).

Validity.
Testing of face validity revealed no problematic issues regarding either item content or scoring. However, two patients called to attention the first and optional BPI screening item: "Throughout our lives, most of us have had pain from time to time (such as minor headaches, sprains, and toothaches). Have you had pain other than these everyday kinds of pain today?" This might potentially, if answered no, cause respondents to leave the rest of the items unanswered.
Calculation of concurrent validity revealed that increased pain levels were predictive of lower BP scores on the SF-36, that is, more bodily pain (CD: β = −.63, p < 0.001; UC: β = −.65, p < 0.001). Construct validity tests revealed moderate to high negative correlations between the BPI intensity and interference dimensions and the bodily pain dimension of the SF-36. In addition, the BPI dimensions were positively correlated with the SCCAI and SCDAI (Table 3). Estimation of discriminant validity showed the highest correlation between items and their hypothesised dimension (Table 4), as well as lower correlation coefficients on SF-36 dimensions measuring aspects other than pain (Table 3).

Reliability and Responsiveness.
Cronbach's alpha, calculated for UC and CD separately, revealed excellent values regardless of diagnosis (alpha 0.91). Test-retest reliability of the BPI is presented in Table 2. In individual and dimensional BPI scores, the strength of agreement between the two occasions (ICC) was high (see Table 2 for details). Spearman's correlation of BPI intensity scores from baseline to follow-up was .86 and .85 in CD and UC, respectively. The corresponding figure for BPI interference was .90, regardless of diagnosis. When comparing the pain categories no, mild, moderate, and severe, Kappa values were .70 in CD and .72 in UC.
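For readers unfamiliar with the internal consistency statistic reported above, Cronbach's alpha can be computed from item variances and the variance of the summed score. The sketch below uses the standard textbook formula; the function name and the toy responses are hypothetical and do not reproduce the study data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in responses):
        raise ValueError("all respondents must answer the same items")
    item_columns = list(zip(*responses))  # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Toy data: three respondents, two items that rise together.
alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
```

When the items move in perfect step, as in the toy data, alpha reaches its maximum of 1; less consistent items give lower values.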
In patients reporting that their symptoms had either improved or deteriorated since baseline assessment, BPI values tended to decrease or increase, respectively (Table 5). A Guyatt's statistic greater than 1.00 was regarded as indicative of high responsiveness, whereas scores greater than 0.20 were considered adequate. In UC, all scores, except BPI interference in patients reporting improvement, were lower than 0.20. In CD, all scores exceeded 0.20, and, in patients reporting improvement, the BPI interference scores were greater than 1.00, whereas, in patients reporting deterioration, the BPI intensity scores were greater than 1.00. Cohen's effect sizes were generally lower in UC than in CD. In CD, all scores, except BPI interference in patients who reported a worsening, were moderate (Table 6). In UC, only BPI interference in patients who reported improvement reached the cut-off for a small effect size.

Discussion
This study revealed that the BPI is a valid and reliable tool for assessment of pain intensity and interference in both UC and CD. These findings are consistent with studies in other autoimmune diseases such as multiple sclerosis, rheumatoid arthritis, and diabetes [16,18,19]. Based on findings from the current study, the BPI consequently adds to the existing body of PROMs in IBD and may become an important supplement when pain measurement is needed.
Validation is a process involving several stages, all aiming to determine whether an instrument measures what it is intended to measure or is useful for its intended purpose [29]. Face validity testing in a subgroup of patients revealed no problematic issues regarding item content or scoring. However, some remarks were raised concerning the optional first item of the BPI, which could lead to a misunderstanding of whether or not patients should fill out the remaining questionnaire. The completeness of the BPI was, however, high, with merely four patients being excluded from analyses because of missing data. Consequently, we conclude that the level of misunderstanding was low in the current study. Nevertheless, based on patient input and to avoid misunderstanding of scoring and interpretation, we suggest that it might be better to place the mandatory items of the BPI first, followed by the optional items.
Other than a validation study in cancer patients [9], factor analyses of the BPI in several other languages have identified a two-factor model with pain intensity items and interference items loading on the two factors. These studies and ours have used similar statistical methods, including principal axis factoring and Oblimin rotation. The rationale for using Oblimin rotation was the assumption that the BPI factors were correlated rather than orthogonal [29]. Our analysis yielded a two-factor solution that was consistent with the original structure and validation of the BPI [10].
We observed a marked floor effect in BPI scores regardless of diagnosis, meaning that the proportion of patients scoring the lowest possible score exceeded 15% [30]. A potential explanation for this finding is that pain measurement captures the presence or absence of a specific problem (pain versus no pain), unlike, for example, generic HRQoL instruments [29]. Consequently, a high number of participants reporting no problems in the BPI items does not necessarily indicate a limitation of the BPI. Indeed, for descriptive and evaluative purposes, the assessment of pain must be able to detect the number of patients with a low versus high symptom burden [29]. A manifest floor effect has also been observed elsewhere [9]. However, no ceiling effects were observed in our study, and this may indicate that, compared with cancer patients, for example, few IBD patients experience the highest possible pain burden [9]. Obviously, this cannot be generalised beyond our outpatient population.
Out of the eight SF-36 dimensions, the BPI displayed the highest correlations with the bodily pain dimension. Moreover, weak to modest correlations were found with aspects of the disease activity indices. Because joint pain and rheumatic disorders are well-known extraintestinal manifestations of IBD [8,32,33], we chose to correlate the BPI against the complication item of the SCCAI and SCDAI, as these items capture some of these symptoms. Stomach pain, however, is only captured by the SCDAI and not the SCCAI, which may limit the exact interpretability in UC.
Optimally, a PROM should be able to discriminate between groups of patients anticipated to have differences in health states. Our results indicate that the BPI can capture these differences. Moreover, when categorising patients into no, mild, moderate, and severe pain, the SF-36 BP dimension dropped accordingly.
The sample size needed in test-retest analysis has been the subject of some debate. Some have advocated that a sample size of 50 could be sufficient or a starting point [34]. Others have highlighted the need for larger sample sizes and more robust test-retest data [35]. In the current study, a convenience sample of 242 patients was included in the test-retest analysis, of whom 196 patients reported an unchanged condition. Therefore, the test-retest analysis in the current study is based on a robust sample of patients. Our results showed that, within a timeframe of 4 to 6 weeks, the BPI displayed good ICC values in individual items and dimensional scores. Moreover, following recommendations [35], the test-retest analysis was performed by the same assessor as at baseline. Both assessor and subjects were blinded to performance and scores from baseline during the retest of the BPI.
A central aspect of a PROM is the ability to respond to relevant changes in a particular condition, also known as responsiveness. The number of patients reporting either deterioration or improvement in this study was generally low. Although scores tended to correspond with change in IBD condition, these findings should be interpreted with caution. Both Guyatt's statistic and Cohen's effect sizes revealed that responsiveness was higher in CD than in UC. In UC, the responsiveness may be questioned because only one of the BPI dimensions reached a Guyatt's statistic and Cohen's d above the threshold of 0.2. There may of course be several explanations for this finding, including inherent differences between UC and CD, as well as lack of precision in the question used to capture change from baseline to retest.
The study has some limitations. Because we recruited only hospital outpatients, our sample may not be representative of a community sample of IBD patients. Consequently, we cannot conclude about the BPI's psychometrical properties in IBD at large. Further, we evaluated change using merely the patient's subjective experience of having an unchanged, improved, or worse IBD condition at time of retest. Ideally, because subjective evaluation of health status may increase the risk of bias, we should have evaluated change using an objective marker of inflammation, such as calprotectin. In addition, we were not able to calculate minimal important difference (MID), which could have been useful in determining whether the observed change is meaningful to patients or not. MID is limited to the subgroup of people who are deemed to have had minimal change, but we did not include an adequate anchor to calculate these values.
In conclusion, this study is the first to demonstrate that the BPI is an easily scored, valid, reliable measure of pain in IBD patients. Consequently, the BPI may serve as an important supplement in patient-reported outcome measurement in IBD. Although indications of responsiveness were found, there is still a need to confirm these in future studies. Moreover, responsiveness should ideally be investigated using changes in objective markers of inflammation, such as calprotectin.

Competing Interests
Professor Lars-Petter Jelsness-Jørgensen has received an unrestricted research grant from Tillotts Pharma.