A Hierarchy of Patient-Reported Outcomes for Meta-Analysis of Knee Osteoarthritis Trials: Empirical Evidence from a Survey of High Impact Journals

Objectives. To develop a prioritised list based on responsiveness for extracting patient-reported outcomes (PROs) measuring pain and disability for performing meta-analyses in knee osteoarthritis (OA). Methods. A systematic search was conducted in 20 highest impact factor general and rheumatology journals chosen a priori. Eligible studies were randomised controlled trials, using two or more PROs measuring pain and/or disability. Results. A literature search identified 402 publications and 38 trials were included, resulting in 54 randomised comparisons. Thirty-five trials had sufficient data on pain and 15 trials on disability. The WOMAC “pain” and “function” subscales were the most responsive composite scores. The following list was developed. Pain: (1) WOMAC “pain” subscale, (2) pain during activity (VAS), (3) pain during walking (VAS), (4) general knee pain (VAS), (5) pain at rest (VAS), (6) other composite pain scales, and (7) other single item measures. Disability: (1) WOMAC “function” subscale, (2) SF-36 “physical function” subscale, (3) SF-36 (Physical composite score), and (4) Other composite disability scores. Conclusions. As choosing the PRO most favourable for the intervention from individual trials can lead to biased estimates, using a prioritised list as developed in this study is recommended to reduce risk of biased selection of PROs in meta-analyses.


Introduction
Measures for patient-reported outcomes (PROs) are used in most osteoarthritis (OA) trials. Extracting and combining these data is an essential part of any meta-analysis of such trials. At the 3rd Outcome Measures in Rheumatology (OMERACT) Conference in 1996, consensus was reached on three domains (pain, disability, and patient global assessment) as measures that should be reported in future OA trials [1]. This recommendation only specified which constructs to measure, but not which specific instruments to include [2]. In knee OA, pain is typically measured by either the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC, "pain" subscale) or by different pain scores using a visual analogue scale (VAS). Patient-reported disability is often measured by the WOMAC "function" subscale or the Short Form 36 (SF-36 "physical functioning" (PF) subscale), but a variety of different PRO measures have been used [3]. Often several PROs are used in trials for estimating change in pain and disability. Simply extracting data on the primary outcome may not be an option in a meta-analysis, as the primary outcome is often not reported and may not be available for both pain and disability. If the choice of PROs is based on which outcome measure reaches statistical significance, the corresponding meta-analytic estimates are likely to be biased [4,5].
Ideally, the most responsive outcome measure is the best choice for extraction and inclusion in a meta-analysis, provided that it is a valid outcome measure and included in the majority of the trials. Whether the outcome measure is truly valid (i.e., it measures the intended change) is a separate consideration based on face, content, construct, and criterion validity [6]. Good validity is a prerequisite for high responsiveness [7]. Responsiveness is quantified as the standardised mean difference (SMD), which is calculated as the difference in the mean change between the intervention and control groups divided by the pooled standard deviation (combining the different groups in any particular trial). The PRO with the largest SMD is considered to be the most responsive [8].
The choice of the most responsive among the potential outcome measures is consistent with the OMERACT filter of truth, discrimination, and feasibility [9]. Combining instruments with different responsiveness can increase heterogeneity in a meta-analysis [10]. Alternatively, various outcomes can be transformed to the most frequently used outcome, using the transformation coefficients from regression analyses [7,10].
In meta-analyses of OA, it is recommended that a hierarchy of PROs should be determined prior to data extraction [11]. In 2006, Jüni et al. suggested a prioritised list for extracting data on pain and disability in patients with knee OA, but the methodology behind the list was not reported [11]. Likewise in a meta-analysis of aquatic exercise for OA, a nonsystematic approach was used to develop an operational hierarchy for OA meta-analyses [12].
Objectives.
The aim of this study was to develop a prioritised list for extracting PROs on pain and disability for meta-analyses, based on an investigation of the responsiveness of PRO measures in knee OA, and restricted to trials with an anticipated low risk of selective outcome reporting.

Eligibility Criteria.
Trials were considered eligible in the current study if they were designed as randomised controlled trials (RCTs) or quasi-randomised trials investigating any type of intervention for patients with knee OA [13]. Trials had to include at least one group with a specified intervention and a control group. In accordance with international consensus regarding the core set of outcome measures for phase III clinical trials in OA [9], the eligible RCTs had to include assessment of at least self-reported pain and/or self-reported disability. Only trials measuring one or both of these constructs with at least two different outcome measures were eligible. Targeting journals with a high impact factor has previously been suggested as a good strategy for identifying journal articles with high methodological quality [15].

Study Selection.
Two members of the study team (C. Juhl, H. Lund) independently scrutinised titles and abstracts of all identified publications. The full text of any article was obtained if it was judged eligible by at least one of the reviewers. The two reviewers then evaluated eligibility based on the full text of all the retrieved papers, and consensus on inclusion was reached by discussion.

Data Collection Process and Data Items.
Study identification (author, year) and outcome measures were extracted using a customised data extraction form. For each outcome measure, the number of participants in the intervention and the control groups and the mean change and standard deviation (SD) were extracted in order to calculate the SMD. When SD was not available in an explicit format, it was estimated from the standard error (SE), confidence interval, the P value, the interquartile range, or other methods as recommended by the Cochrane Collaboration [16].
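The two simplest of these conversions can be sketched as follows. This is a minimal illustration only, assuming that the SE or confidence interval refers to a single group mean and that a normal approximation (large sample) is acceptable; the function names are hypothetical and are not taken from the study.

```python
import math


def sd_from_se(se: float, n: int) -> float:
    """Recover the SD of a group from the standard error of its mean.

    Uses SD = SE * sqrt(n), where n is the number of participants in the group.
    """
    return se * math.sqrt(n)


def sd_from_ci(lower: float, upper: float, n: int, z: float = 1.96) -> float:
    """Recover the SD from a confidence interval around a single group mean.

    Assumption: normal approximation, so the interval width is 2 * z * SE
    (z = 1.96 for a 95% interval). For small samples a t quantile would be
    more appropriate than the fixed z used here.
    """
    se = (upper - lower) / (2 * z)
    return sd_from_se(se, n)
```

For small groups, or when only a P value or interquartile range is reported, other conversions are needed and are not covered by this sketch.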

Risk of Bias in Individual Studies.
Selective outcome reporting has been defined as choosing a subset of the original outcomes on the basis of the results [1,2]. Two members of the study team (C. Juhl, H. Lund) assessed the risk of selective outcome reporting, indexed according to whether the trials had been classified as "adequate," "unclear," or "inadequate" in accordance with the Cochrane Handbook for Systematic Reviews of Interventions 5.1.0 [16]. Selective outcome reporting was deemed as follows.
(i) Adequate, if a protocol (published or from ClinicalTrials.gov or other databases) was available and all PROs were sufficiently reported for extracting data for estimating SMD.
(ii) Unclear, if a protocol was not available (published or from ClinicalTrials.gov or other databases).
(iii) Inadequate, if some or all PROs were insufficiently reported for extracting data for estimating SMD (evaluated by checking the protocol from earlier publications, trial registers, or described in the published trial).
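The three categories can be read as a simple decision rule. The sketch below is one possible reading, not the authors' procedure: it assumes that a missing protocol always leads to "unclear", and that reporting completeness only separates "adequate" from "inadequate" once a protocol is available. The function name and arguments are hypothetical.

```python
def selective_reporting_risk(protocol_available: bool, all_pros_reported: bool) -> str:
    """Classify the risk of selective outcome reporting for one trial.

    Assumption: an unavailable protocol takes precedence and yields "unclear",
    because completeness of reporting cannot be checked against anything.
    """
    if not protocol_available:
        return "unclear"
    # With a protocol available, the classification depends on whether all
    # PROs were reported well enough to estimate the SMD.
    return "adequate" if all_pros_reported else "inadequate"
```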

Summary Measures.
The effect size was calculated as the SMD to allow comparison of the various PROs. The SMD was estimated as the difference in mean change between the intervention and control groups divided by the pooled SD. The pooled SD was estimated from
$$SD_{\text{pooled}} = \sqrt{\frac{(N_I - 1)\,SD_I^2 + (N_C - 1)\,SD_C^2}{N_I + N_C - 2}},$$
where $N_I$ and $SD_I$ represent the number of patients and the SD in the intervention group, respectively, and $N_C$ and $SD_C$ the corresponding values in the control group. The SMD was used to represent the responsiveness; the higher the SMD, the more responsive the measure. This approach is only valid to rank PROs within a particular study, as different studies obviously measure different therapeutic interventions. Data from an intention-to-treat (ITT) analysis were preferred for calculating the SMD. When several intervention groups were compared with a control group, the number of control patients was divided equally into the appropriate number of groups when estimating the SMD.
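As a rough illustration of this calculation, the sketch below computes an SMD from group-level change scores. It assumes the usual pooled SD weighted by degrees of freedom and assumes that change scores are coded so that a larger value means more improvement; the function names are hypothetical.

```python
import math


def pooled_sd(n_i: float, sd_i: float, n_c: float, sd_c: float) -> float:
    """Pooled SD of the intervention (i) and control (c) groups."""
    numerator = (n_i - 1) * sd_i ** 2 + (n_c - 1) * sd_c ** 2
    return math.sqrt(numerator / (n_i + n_c - 2))


def smd(mean_i: float, sd_i: float, n_i: float,
        mean_c: float, sd_c: float, n_c: float) -> float:
    """Difference in mean change divided by the pooled SD."""
    return (mean_i - mean_c) / pooled_sd(n_i, sd_i, n_c, sd_c)


def split_control(n_control: int, n_arms: int) -> float:
    """Share of the control group assigned to each intervention arm.

    The result may be fractional; it is only used as a group size
    when estimating the SMD for each comparison.
    """
    return n_control / n_arms
```

For a trial with two intervention arms and 60 control patients, each comparison would use `split_control(60, 2)`, that is 30 control patients, as `n_c`.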

Synthesis of Results.
The responsiveness estimated as SMD of the PROs in each of the included trials (or subgroups when more interventions were compared with the control) was ranked according to responsiveness for pain and disability separately. The PRO used to measure the effect of an intervention in any individual trial with the highest responsiveness was ranked 1, the second most responsive was ranked 2, and so on. The mean rank was then used to estimate the responsiveness across the trials, and a low mean rank (close to 1) indicated that this PRO was often the most responsive PRO used. The PROs used in at least 5 trials were then listed according to the lowest mean rank of the SMD. However, composite item scales with established validity would rate higher than single item scales.
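The ranking procedure described above can be sketched as follows. This is an illustration under stated assumptions: each comparison is represented as a dictionary mapping a PRO name to its SMD, a larger SMD means a more responsive PRO, ties are not handled (the text does not say how they were treated), and the threshold is applied to the number of comparisons supplied rather than to distinct trials. The downgrading of single item scales is a separate judgement step and is not modelled.

```python
from collections import defaultdict


def rank_within_trial(smds: dict[str, float]) -> dict[str, int]:
    """Rank the PROs of one comparison; rank 1 is the largest SMD."""
    ordered = sorted(smds, key=lambda pro: smds[pro], reverse=True)
    return {pro: position + 1 for position, pro in enumerate(ordered)}


def mean_ranks(comparisons: list[dict[str, float]],
               min_uses: int = 5) -> list[tuple[str, float]]:
    """Mean rank per PRO across comparisons, lowest (most responsive) first.

    PROs used fewer than min_uses times are left out of the list.
    """
    ranks = defaultdict(list)
    for smds in comparisons:
        for pro, rank in rank_within_trial(smds).items():
            ranks[pro].append(rank)
    eligible = {pro: sum(r) / len(r) for pro, r in ranks.items() if len(r) >= min_uses}
    return sorted(eligible.items(), key=lambda item: item[1])
```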

Risk of Bias across Studies.
A sensitivity analysis was performed evaluating the impact of different systematic approaches to data extraction of PROs in a meta-analysis. The pooled mean across trials using the developed list from the current study was compared with lists based on (1) the most favourable outcome from each of the individual trials, (2) the most frequently used PROs, and (3) the most responsive of the PROs. Inconsistency in means between trials was evaluated using the $I^2$ index [17]. A random-effects model was used for pooling the trials. All analyses were performed at the study level using the meta-analysis software "Comprehensive Meta Analysis" Version 2, Biostat, Inc., and were based on published data only.
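For readers without access to the commercial software, the pooling step can be approximated with a few lines of code. The sketch below assumes the DerSimonian-Laird estimator for the between-trial variance and the common large-sample approximation for the variance of an SMD; the text does not state which estimator the software applied, so both choices are assumptions, and the function names are hypothetical.

```python
import math


def smd_variance(d: float, n_i: float, n_c: float) -> float:
    """Large-sample approximation of the variance of an SMD."""
    return (n_i + n_c) / (n_i * n_c) + d ** 2 / (2 * (n_i + n_c))


def random_effects(effects: list[float], variances: list[float]):
    """DerSimonian-Laird pooling; returns (pooled, se, tau2, i2_percent)."""
    w = [1 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q measures the weighted spread around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add the between-trial variance to each trial.
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, se, tau2, i2
```

Running the same function on the SMDs selected by each extraction strategy gives pooled estimates that can be compared directly.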

Additional Analysis.
In order to assess the robustness of the developed list, subgroup analyses stratifying the available trials according to risk of selective outcome bias were applied, and a list based on trials with no risk of selective outcome bias was compared with the list based on all included trials. Stratified analyses were performed based on whether the included trials were published in the general/internal medicine journals or in the rheumatology journals. Secondly, sensitivity analyses were performed stratifying trials according to the intervention: injection in the knee joint, oral medication, and other interventions (tai chi, lateral wedge shoes, etc.). Outcomes used at least 5 times were ranked based on the mean rank, and the lists from these subgroups were compared to the list based on all included trials.

Study Selection.
Through the search strategy, 402 publications were identified, as presented in the flowchart (Figure 1). Titles and abstracts of the publications were checked independently by two reviewers (C. Juhl, H. Lund). One hundred and eighty-three trials were identified as potentially eligible by at least one of the reviewers and subsequently examined independently in full text by two reviewers (C. Juhl, H. Lund). Consensus was reached by discussion and resulted in 38 trials that fulfilled the eligibility criteria and were included in the analysis (Table 1).

Study Characteristics.
Pain was evaluated with more than one PRO in 35 trials [18-21, 23-42, 45-55], and disability was evaluated in 15 trials [18,19,21,22,25,26,29,31,38,40,43,44,46,53,54]. More than one intervention group was compared in 14 trials measuring pain [19, 24, 25, 27-29, 34-36, 41, 42, 48-50, 55] and in 4 trials measuring disability [22,25,29,40]. Different specific questions were asked when VAS scores were used for measuring pain. The VAS scores were classified as pain during activity (e.g., stepping, daily activities, etc.), pain during walking, pain at rest, pain at night, and general knee pain. WOMAC was used both in a VAS version using a 100 mm scale and in a Likert version using a 1-5 scale, and these were treated as different outcomes in the initial analysis. Pain during activity and pain during walking were measured by either a VAS score or a numeric rating scale (NRS) score, and these PROs were analysed separately. In total, eighteen different PROs were used for measuring pain and seven PROs for measuring disability in the included trials.

Risk of Bias within Studies.
Only 12 out of the 38 included trials were registered in ClinicalTrials.gov. Six of these trials were classified as "adequate" and six as "inadequate". Of the latter, two trials reported PROs not declared in the protocol, one did not report all PROs from the published protocol, one did not report all time points, and two did not specify any PROs in the published protocol (Table 1).

Results of Individual Studies.
The most frequently used PRO for measuring pain was the WOMAC "pain" subscale (in either the Likert scale or VAS format), used in 27 of the included trials, and for disability the most frequently used PRO was the WOMAC "function" subscale (in either the Likert scale or VAS format), used in all 15 trials (Table 2).
The most responsive PRO measure for pain was "pain during activity" using a VAS with a mean rank of 1.4. It was used when comparing the effect of an intervention in 12 trials (comparing 18 interventions with controls), and it was the most responsive outcome in 14 of these 18 substudies. The most responsive composite score for pain was the WOMAC "pain" subscale using a Likert scale with a mean rank of 1.8. It was the most responsive outcome in 7 out of 20 substudies and the most responsive composite score in 19 out of 20 substudies. The most responsive PRO on disability was WOMAC "function" subscale using a 100 mm scale with a mean rank of 1.4. It was the most responsive outcome in 7 out of 13 substudies.

Synthesis of Results.
According to the Food and Drug Administration (FDA), a single item may be reasonable for concepts such as pain severity, if it is a reliable and valid measure, but not for general concepts such as disability [56]. Different single item VAS scores were actually measuring different items, as they were based on different questions. A list based on the responsiveness of the PROs was constructed, placing the most responsive of the PROs as first choice for extracting data for meta-analyses, and PROs based on a single item being downgraded as they were derived from different instruments, whose validity and reliability had not been established. The final prioritised list for preferred PROs to extract when performing meta-analyses is presented in Table 3.

Risk of Bias across Studies.
The likelihood of overestimating the pooled effect in meta-analyses by choosing the most favourable PROs for the intervention in each trial is illustrated in Figure 2. Compared with the developed list, consistently choosing the most favourable PROs overestimated the effect size for pain with SMD = 0.14 (95% CI 0.02, 0.26) and for disability with SMD = 0.09 (95% CI -0.03, 0.21). The differences between (1) the developed list, (2) the most frequently used PROs, and (3) the most responsive of the PROs were smaller. The list developed in this study seems to be a conservative but robust approach for extracting PROs when planning a meta-analysis.
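The contrast between the two extraction strategies can be made concrete with a small sketch. The hierarchy below uses the first five pain entries of the prioritised list as given in the abstract; the two residual categories ("other composite pain scales" and "other single item measures") are omitted because they are categories rather than single instruments. The PRO labels, the function names, and the example SMDs are hypothetical, and a larger SMD is assumed to favour the intervention.

```python
PAIN_HIERARCHY = [
    "WOMAC pain",
    "pain during activity (VAS)",
    "pain during walking (VAS)",
    "general knee pain (VAS)",
    "pain at rest (VAS)",
]


def pick_by_hierarchy(smds: dict[str, float], hierarchy: list[str]):
    """Return the first PRO in the hierarchy that the trial reported, else None."""
    for pro in hierarchy:
        if pro in smds:
            return pro
    return None


def pick_most_favourable(smds: dict[str, float]) -> str:
    """Return the PRO with the largest SMD (the data-driven, biased choice)."""
    return max(smds, key=smds.get)


# Hypothetical trial reporting two pain PROs.
trial = {"pain at rest (VAS)": 0.6, "WOMAC pain": 0.4}
chosen_by_list = pick_by_hierarchy(trial, PAIN_HIERARCHY)
chosen_by_result = pick_most_favourable(trial)
```

In this made-up trial the list selects the WOMAC "pain" subscale (SMD 0.4), whereas the data-driven choice selects pain at rest (SMD 0.6). The list is fixed before extraction, so the choice cannot depend on the results.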

Additional Analysis.
The subgroup analysis stratifying the available trials according to risk of selective outcome bias was applied, and a list based on trials with a low risk of selective outcome reporting was developed (i.e., trials with a published protocol and evaluated as "adequate" in the risk of bias assessment). All 6 trials with a low risk of selective outcome reporting had measured pain, but only 3 had measured disability. Even though some outcomes were missing in the list based on these 6 trials with low risk of selective outcome reporting bias compared with the list based on all the included 38 trials, the differences between the two lists were very small (data not shown).
When the included trials were stratified according to intervention (injection in the knee joint, oral medication, and other interventions such as tai chi or lateral wedge shoes), no differences in the responsiveness of the most frequently used outcomes (used at least 5 times) were found between the "injection subgroup", the "other interventions subgroup", and all the included trials. In the "oral medication subgroup", the five most responsive outcomes (used at least 5 times) were "Pain walking", "Pain global", WOMAC (100 mm scale), "Pain at rest", and WOMAC (Likert scale). Compared with the list of outcomes from all the included trials, "Pain activity" and SF-36 were missing, as they were not used as frequently in the "oral medication subgroup". In the list from all the included trials, the last three of these outcomes were ranked in the order WOMAC (Likert scale), WOMAC (100 mm scale), and "Pain at rest", but the responsiveness of these three outcomes was more or less the same. No differences in the sensitivity analysis based on intervention were seen in the disability list. Other sensitivity analyses, based on whether the included trials were published in the general/internal medicine journals or in the rheumatology journals, showed no differences between the lists based on mean rank in these subgroups and the list of all trials, either for disability or for pain.

Summary of Evidence.
Biased selection of PROs in meta-analyses (e.g., choosing the most favourable PROs for the intervention from individual trials) can overestimate the effect compared with a systematic approach. As anticipated, choosing the "most favourable" outcome from each individual trial yielded a more positive pooled estimate than systematic approaches based on either (1) how frequently the PROs were used, (2) the average responsiveness of the PROs, or (3) the developed list. Using a prioritised list as developed in this study is recommended to reduce the risk of biased selection of PROs in meta-analyses. When comparing the list developed in this study with the hierarchy for extracting pain measurement scales published by Jüni et al. [11], the main difference is that Jüni et al. included global health outcomes (such as the total WOMAC score, patient's global health, and physician's global health), whereas the list developed in this study only included specific pain measurement scales. The list used by Bartels et al. [12] was based on consensus among the authors and reports the WOMAC subscales as first choice in both pain and disability, similar to the list developed in this study.

Limitations.
This study has some limitations. First of all, even though 38 trials were included, and 35 trials had sufficient data on pain, only 15 trials had sufficient data on disability. The literature search was performed in the ten highest impact factor general and internal medicine journals as well as the ten highest impact factor rheumatology journals in order to identify trials with a low risk of selective reporting bias. Some of the highly ranked journals were primarily review journals. Even though these journals only published a few randomised controlled trials, some of the eligible trials were published in one of these journals. Only six trials were classified as "adequate" for having a low risk of selective reporting bias, and some frequently used PROs were missing in the list based on these trials. However, as the differences between the list based on these trials with a low risk of selective outcome reporting bias and the list based on all the included 38 trials were very small, the developed list seems to be trustworthy. Furthermore, stratified analyses comparing lists based on the subgroups of trials and the list from all included trials showed only small discrepancies for the subgroup of "oral medication" using pain as outcome. No differences were seen in other subgroup analyses.
As the subgroup analyses based on whether the trials were published in the general/internal medicine journals or in the rheumatology journals showed no differences between the list from the subgroup and the list from all included trials, it seems unlikely that including more trials would change the ranking of the most frequently used outcomes. So, even though including more journals in the literature search could be preferable, these subgroup analyses suggest that the list developed in this study would remain unchanged.
Secondly, combining the SMDs across different populations and interventions can cause heterogeneity, as characteristics of the patient populations (age, sex, BMI, etc.) and interventions have an impact on the effect size. A large SMD can be due either to a large difference in mean change between the intervention group and control group or to a small standard deviation resulting from small variability in the included patient group. Combining SMDs across trials could then cause bias if trials with a homogeneous patient group (small SD) were combined with trials with a heterogeneous patient group (large SD). These differences are reduced when the rank of the SMD is used instead for comparisons across trials. In developing the prioritised list for extracting PROs in this study, the estimated SMDs were therefore only used for ranking the PROs in each trial. When the variability in a trial was small due to a homogeneous patient group, the variability in all PROs in the trial was expected to be equally small, so that the rank of the SMD remained an acceptable measure of the relative responsiveness of the PROs in individual trials. Thirdly, as a large number of different PROs were used and most of the included trials only compared two PROs on either pain or disability, it was not possible to make direct comparisons of the responsiveness of PROs. Therefore, the indirect method of ranking the PROs in the individual trials according to responsiveness was used. This method of ranking the PROs reduces the impact of the differences in populations and interventions between the included trials, as it only used the SMD for ranking the responsiveness of the PROs. As the mean rank of the PROs could change if more trials were included, especially for the PROs only used in a few trials, the systematically developed list in this study was based on the most frequently used PROs (used in at least 5 trials).

Strengths.
A strength of this study was that a comprehensive systematic literature search in high-quality journals was performed, and a systematic approach was used for developing the prioritised list. Furthermore, the impact of using the developed list for data extraction in meta-analyses was analysed. Finally, the prioritised list developed in this study was compared with a list based on the six trials with a low risk of selective outcome reporting bias, and the differences were small.
To summarise, the WOMAC subscales for both pain and disability should be the first choice for extracting PROs in meta-analyses for patients with knee OA. Different single item VAS scores were actually measuring different items (e.g., VAS during activity covers "pain during daily activity," "pain following stepping activities," or "pain during worst activity"), so, even though different single item VAS scores for measuring pain were more responsive than the WOMAC, WOMAC is preferred for meta-analysis.
The Knee Injury and Osteoarthritis Outcome Score (KOOS) "ADL" subscale is equivalent to the WOMAC "function" subscale, and the WOMAC "pain" score is contained in the KOOS "pain" score. As the corresponding KOOS subscales and WOMAC subscales have shown equal responsiveness [57], the KOOS subscales could be included in the developed list. Based on these preliminary results and considerations, we recommend the list presented in Table 3, when extracting PROs on pain and disability for meta-analysis.

Conclusions
As choosing the most favourable patient-reported outcomes (PROs) from individual trials can overestimate the effect compared with a systematic approach, using a prioritised list as presented in this study is recommended to reduce reviewers' likelihood of biased selection of PROs in meta-analyses.
The impact of the prioritised list should be tested in published meta-analyses investigating the effect of interventions on pain and disability for patients with knee osteoarthritis. When a larger number of trials are registered and classified as "adequate" for having a low risk of selective reporting outcome bias, this study should be repeated in order to check the prioritised list based on the responsiveness of PROs.