This study illustrates the application of generalizability theory (G-theory) to identify measurement protocols that optimize the reliability of two clinical methods for assessing spine curvatures in women with osteoporosis. Triplicate measures of spine curvatures were acquired for 9 postmenopausal women with spine osteoporosis by two raters during a single visit using a digital inclinometer and a flexicurve ruler. G-coefficients were estimated using a G-study, and a measurement protocol that optimized inter-rater and inter-trial reliability was identified using follow-up decision studies. G-theory provides reliability estimates for measurement devices that can be generalized to different clinical contexts and/or measurement designs.

Measuring devices are used routinely in rheumatology clinical examinations and research. Reliability analysis quantifies the consistency of examinee performance [

It is recommended that each clinical setting establish reliability for measurements obtained by their specific assessors (raters) on their particular patient population. This position is held, in part, because studies of measurement error in the clinical arena have predominantly adopted a classical test theory (CTT) framework [

Despite the common use of CTT for characterizing reliability, it has several limitations. First, the term “true” score can be confusing on several counts. When applied in a reliability context, the true score does not comment on the extent to which a measure assesses what it is intended to measure (i.e., its meaning when applied in a validity context). Also, an examinee may have different true scores depending on the study design. For example, the apparent true score for an examinee may be different for an inter-rater study design compared to an inter-trial study design. A second limitation concerns the interpretation of the error term. Although in theory it represents random measurement error, there is no way of verifying whether this assumption holds. Furthermore, like the true scores, the magnitude of measurement error is likely to differ between study designs. Finally, CTT does not provide a coherent method for optimizing a measurement process. For example, an investigator might be interested in determining whether a greater gain in reliability could be achieved by increasing the number of raters or by increasing the number of assessments by a single rater. Applying CTT, the investigator would conduct two studies. For the results of each study, the investigator could apply the Spearman-Brown prophecy formula to estimate the impact of altering the number of raters or the number of trials. However, there is no elegant method for combining the results from these studies to determine whether it is better to increase the number of raters or to increase the number of trials. Collectively, these shortcomings led to the development of generalizability theory (G-theory) [
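For reference, the Spearman-Brown prophecy formula mentioned above predicts the reliability of the mean of k parallel measurements from the reliability of a single measurement. The following is a minimal sketch; the function name and the illustrative input values are assumptions made for this example and are not taken from the study's analysis.

```python
def spearman_brown(single_reliability: float, k: float) -> float:
    """Predicted reliability of the mean of k parallel measurements.

    single_reliability: reliability of one measurement (e.g., one rater).
    k: factor by which the number of measurements is multiplied.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return k * single_reliability / (1 + (k - 1) * single_reliability)


# Illustrative values only: a single-rater reliability of 0.60,
# projected to the mean of two raters.
projected = spearman_brown(0.60, 2)
```

Because the formula treats one source of error at a time, it has to be applied separately to a rater study and to a trial study, which is the limitation described above.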

G-theory differs from CTT as summarized in Table

Comparison of differences between classical test theory and generalizability theory.

Classical test theory | Generalizability theory |
---|---|
True score | Universe of admissible observations’ score |
One identifiable source of “error” variance | Multiple sources of identifiable “error” variances |
One-way ANOVA | Factorial ANOVA |
“What if” optimizing assessment method: Spearman-Brown | “What if” optimizing assessment method: design study |
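As a sketch of what "multiple sources of identifiable error variance" means for a fully crossed patient (p) by rater (r) by trial (t) design, the variance of a single observed score can be partitioned as shown below. The notation is the conventional textbook one and is assumed here, since the text does not define symbols.

$$
\sigma^2(X_{prt}) = \sigma^2_{p} + \sigma^2_{r} + \sigma^2_{t} + \sigma^2_{pr} + \sigma^2_{pt} + \sigma^2_{rt} + \sigma^2_{prt,e}
$$

By contrast, CTT writes the observed score as $X = T + E$, with a single undifferentiated error variance $\sigma^2_{E}$.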

In our clinical research setting, we were interested in designing a study involving measurement of spine curvatures in postmenopausal women with osteoporosis. Women with osteoporosis are susceptible to deformities in the axial skeleton including hyperkyphosis and flattened or accentuated lumbar lordosis [

Therefore, our purpose was to illustrate the application of G-theory tools to investigate the inter-trial and inter-rater reliability of spine curvature measures in postmenopausal women with osteoporosis of the spine, using two common methods (the digital inclinometer and the flexicurve ruler), in order to establish an optimal measurement protocol. For comparison, the inter-trial and inter-rater reliability of these measures was also determined using CTT.

Nine women were recruited through a local osteoporosis clinic. Women were eligible for inclusion in the study if they were 60 years of age or older, were postmenopausal (self-reported absence of menses for more than 1 year), were clinically diagnosed with osteoporosis by a physician, and had a history of one or more vertebral fractures. Participants were excluded from the study if they were not community ambulators, had cognitive difficulties, were unable to understand written or spoken English, or had a vertebral fracture within three months prior to commencement of the study. The study protocol was approved by our institutional Research Ethics Review Board, and all participants provided written informed consent prior to the start of the study.

During a single visit, spine curvatures were measured by two raters using two different measurement devices. Clothing covering the back and footwear were removed to ensure accurate identification of bony landmarks and consistent standing posture. Participants were instructed to stand erect and maintain their best posture throughout the procedure. Each rater followed a standardized protocol to acquire triplicate measurements using the digital inclinometer and the flexicurve ruler.

A digital inclinometer (Saunder’s digital inclinometer, Empi Therapy Solutions) was used according to the manufacturer’s recommended procedure [

A 61-cm long flexicurve ruler (Arts Supply Store, Hamilton, ON) was used according to the instructional CD distributed by the American Physical Therapy Association Geriatrics Division [

The traced curves were landmarked such that a vertical line was drawn to connect the C7 mark (most superior point) and the LS interspace mark (most inferior point), and a perpendicular line was drawn at the TL level. For each trial, KI was calculated according to the following formula:

For each trial, LI was calculated according to the following formula:
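Consistent with the definition given in the footnote to the patient characteristics table (segment width × 100/segment length), both indices can be written as below. The symbols $W$ and $L$, denoting the width and length of the thoracic or lumbar segment of the tracing, are notation assumed here for illustration.

$$
\mathrm{KI} = \frac{W_{\text{thoracic}}}{L_{\text{thoracic}}} \times 100, \qquad \mathrm{LI} = \frac{W_{\text{lumbar}}}{L_{\text{lumbar}}} \times 100
$$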

The raters, an undergraduate student with no prior experience using either method of measurement and a physiotherapist with minimal prior experience using a digital inclinometer and no prior experience with the flexicurve ruler, received brief training. The user’s manual for the digital inclinometer [

Descriptive statistics were calculated using SPSS v18 (www.spss.com). G-theory was applied using G_String_III version 5.4.2 for Windows [
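To make the G-study step concrete, the sketch below estimates the seven variance components of a fully crossed patient by rater by trial design from ANOVA mean squares, using the standard expected-mean-square equations for a random-effects model with one score per cell. It is an illustration under those assumptions and is not a description of the internals of the software used in the study; the function name `g_study` and the nested-list data layout `x[p][r][t]` are hypothetical. Negative estimates are set to zero, following the convention noted with the variance components table.

```python
from itertools import product
from statistics import fmean


def g_study(x):
    """Estimate variance components for scores x[p][r][t] (one score per cell)."""
    n_p, n_r, n_t = len(x), len(x[0]), len(x[0][0])
    P, R, T = range(n_p), range(n_r), range(n_t)

    # Grand mean, main-effect means, and two-way cell means.
    grand = fmean(x[p][r][t] for p, r, t in product(P, R, T))
    m_p = [fmean(x[p][r][t] for r, t in product(R, T)) for p in P]
    m_r = [fmean(x[p][r][t] for p, t in product(P, T)) for r in R]
    m_t = [fmean(x[p][r][t] for p, r in product(P, R)) for t in T]
    m_pr = [[fmean(x[p][r]) for r in R] for p in P]
    m_pt = [[fmean(x[p][r][t] for r in R) for t in T] for p in P]
    m_rt = [[fmean(x[p][r][t] for p in P) for t in T] for r in R]

    # Mean squares: sum of squares divided by degrees of freedom.
    ms_p = n_r * n_t * sum((m_p[p] - grand) ** 2 for p in P) / (n_p - 1)
    ms_r = n_p * n_t * sum((m_r[r] - grand) ** 2 for r in R) / (n_r - 1)
    ms_t = n_p * n_r * sum((m_t[t] - grand) ** 2 for t in T) / (n_t - 1)
    ms_pr = n_t * sum(
        (m_pr[p][r] - m_p[p] - m_r[r] + grand) ** 2 for p, r in product(P, R)
    ) / ((n_p - 1) * (n_r - 1))
    ms_pt = n_r * sum(
        (m_pt[p][t] - m_p[p] - m_t[t] + grand) ** 2 for p, t in product(P, T)
    ) / ((n_p - 1) * (n_t - 1))
    ms_rt = n_p * sum(
        (m_rt[r][t] - m_r[r] - m_t[t] + grand) ** 2 for r, t in product(R, T)
    ) / ((n_r - 1) * (n_t - 1))
    ms_e = sum(
        (x[p][r][t] - m_pr[p][r] - m_pt[p][t] - m_rt[r][t]
         + m_p[p] + m_r[r] + m_t[t] - grand) ** 2
        for p, r, t in product(P, R, T)
    ) / ((n_p - 1) * (n_r - 1) * (n_t - 1))

    # Solve the expected-mean-square equations for each component.
    raw = {
        "p": (ms_p - ms_pr - ms_pt + ms_e) / (n_r * n_t),
        "r": (ms_r - ms_pr - ms_rt + ms_e) / (n_p * n_t),
        "t": (ms_t - ms_pt - ms_rt + ms_e) / (n_p * n_r),
        "pr": (ms_pr - ms_e) / n_t,
        "pt": (ms_pt - ms_e) / n_r,
        "rt": (ms_rt - ms_e) / n_p,
        "e": ms_e,  # three-way interaction confounded with residual error
    }
    # Negative estimates are set to zero.
    return {name: max(0.0, value) for name, value in raw.items()}
```

For the design described here, `x` would hold 9 patients, 2 raters, and 3 trials, with one such array per spine curvature measure.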

The characteristics of the patients are summarized in Table

Characteristics of 9 postmenopausal women with osteoporosis of the spine.

Variable | Mean (SD) | Minimum, maximum |
---|---|---|
Age (years) | 71.6 (8.9) | 63, 76 |
Height (cm) | 156.1 (8.7) | 147.2, 162 |
Weight (kg) | 71.2 (24.2) | 59.4, 94 |
Cervicothoracic angle (degrees)^{a} | 36.1 (9.99) | 17.5, 49.2 |
Thoracolumbar angle (degrees)^{a} | 51.4 (13.72) | 27.2, 72.0 |
Lumbosacral angle (degrees)^{a} | 31.9 (9.17) | 15.0, 50.2 |
Kyphotic Index^{b} | 13.2 (5.07) | 5.8, 19.5 |
Lordotic Index^{b} | 13.9 (3.22) | 9.0, 18.2 |

^{a}Calculated as the mean of the average values acquired by each of the two raters for each subject using the digital inclinometer.

^{b}Segment width × 100/segment length; calculated as the mean of the average values acquired by each of the two raters for each subject using the flexicurve ruler.

Mean (SD) spine curvature values over 3 trials acquired by 2 raters in 9 women with spine osteoporosis.

Patient | Cervicothoracic angle^{a}, Rater 1 | Cervicothoracic angle^{a}, Rater 2 | Thoracolumbar angle^{a}, Rater 1 | Thoracolumbar angle^{a}, Rater 2 | Lumbosacral angle^{a}, Rater 1 | Lumbosacral angle^{a}, Rater 2 | Kyphotic index^{b}, Rater 1 | Kyphotic index^{b}, Rater 2 | Lordotic index^{b}, Rater 1 | Lordotic index^{b}, Rater 2 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 51.0 (2.6) | 41.7 (0.6) | 77.3 (1.5) | 66.7 (4.0) | 34.0 (2.6) | 28.3 (4.9) | 16.5 (0.5) | 16.8 (1.6) | 13.9 (0.2) | 11.4 (3.3) |
2 | 48.7 (12.7) | 16.8 (1.0) | 36.0 (11.3) | 27.0 (2.0) | 44.3 (4.0) | 39.7 (2.1) | 6.6 (0.6) | 47.0 (1.7) | 19.6 (1.0) | 7.4 (0.7) |
3 | 41.0 (0.0) | 22.7 (0.6) | 47.7 (1.2) | 47.3 (0.6) | 20.7 (1.5) | 38.7 (1.2) | 12.3 (1.1) | 12.9 (0.4) | 9.4 (1.2) | 10.2 (0.5) |
4 | 18.7 (2.1) | 16.3 (1.2) | 28.7 (3.2) | 25.7 (2.1) | 30.0 (0.0) | 31.7 (1.2) | 5.7 (0.7) | 5.9 (1.1) | 11.7 (0.4) | 12.3 (1.2) |
5 | 42.0 (3.5) | 42.3 (1.5) | 51.0 (5.0) | 65.7 (1.5) | 22.0 (3.6) | 36.0 (2.0) | 15.6 (1.7) | 18.4 (1.6) | 17.1 (2.3) | 17.0 (1.4) |
6 | 42.7 (3.8) | 55.7 (1.5) | 55.7 (3.5) | 77.0 (2.6) | 31.0 (2.0) | 33.7 (1.5) | 18.6 (0.5) | 20.4 (1.1) | 16.3 (0.7) | 14.4 (0.7) |
7 | 28.7 (1.5) | 34.7 (2.1) | 39.0 (1.0) | 44.7 (2.1) | 16.0 (1.7) | 14.0 (1.0) | 8.8 (1.4) | 7.6 (0.8) | 8.0 (2.3) | 9.9 (1.4) |
8 | 28.3 (0.6) | 40.0 (1.0) | 45.3 (0.6) | 57.0 (1.7) | 33.0 (1.0) | 29.0 (2.0) | 13.4 (0.8) | 15.0 (0.6) | 13.5 (1.0) | 16.3 (0.7) |
9 | 39.0 (3.6) | 47.0 (2.0) | 54.0 (3.6) | 59.3 (3.1) | 38.0 (2.6) | 38.0 (1.0) | 16.2 (0.6) | 19.3 (1.6) | 15.8 (0.9) | 16.5 (1.2) |

^{a}Measured using the digital inclinometer, in degrees.

^{b}Measured using the flexicurve ruler.

Table

Estimates of variance components^{a} for Kyphotic index using G-theory and classical test theory.

Variance component | G-theory σ^{2} | CTT σ^{2}, Rater 1 | CTT σ^{2}, Rater 2 | CTT σ^{2}, Trial 1 | CTT σ^{2}, Trial 2 | CTT σ^{2}, Trial 3 |
---|---|---|---|---|---|---|
Patient | 25.263 | 21.227 | 30.303 | 23.593 | 25.733 | 25.233 |
Rater | 0.488 | — | — | — | — | — |
Trial | 0.083 | — | — | — | — | — |
Patient * rater | 0.563 | — | — | — | — | — |
Patient * trial | 0 | — | — | — | — | — |
Rater * trial | 0.098 | — | — | — | — | — |
Error | 1.023 | 0.919 | 1.256 | 1.901 | 2.974 | 1.641 |

^{a}Estimates having negative values are set to zero.

Table

Reliability of spine curvature measures acquired in triplicate by 2 raters in 9 postmenopausal women with osteoporosis of the spine estimated using generalizability theory (G-Theory) and classical test theory (CTT).

Measure of spine curvature | Statistic | Inter-trial reliability, G-theory | Inter-trial reliability, CTT | Inter-rater reliability, G-theory | Inter-rater reliability, CTT |
---|---|---|---|---|---|
Cervicothoracic angle | Reliability coefficient | 0.960 | 0.960 | 0.566 | 0.601 |
Cervicothoracic angle | SEM (degrees) | 2.281 | 2.040 | 7.505 | 7.091 |
Thoracolumbar angle | Reliability coefficient | 0.958 | 0.964 | 0.726 | 0.722 |
Thoracolumbar angle | SEM (degrees) | 3.090 | 2.703 | 7.868 | 7.786 |
Lumbosacral angle | Reliability coefficient | 0.942 | 0.946 | 0.637 | 0.630 |
Lumbosacral angle | SEM (degrees) | 2.498 | 2.367 | 6.223 | 6.213 |
Kyphotic index | Reliability coefficient | 0.956 | 0.959 | 0.921 | 0.920 |
Kyphotic index | SEM | 1.097 | 1.040 | 1.474 | 1.461 |
Lordotic index | Reliability coefficient | 0.840 | 0.837 | 0.746 | 0.768 |
Lordotic index | SEM | 1.427 | 1.390 | 1.794 | 1.701 |

SEM: standard error of the measurement, which provides an estimate of absolute reliability and is expressed in the same units as the measure.
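The link between the variance components and the reliability table can be sketched for the kyphotic index. The exact computation performed by the software is not described in the text, so the formulation below is an assumption: every component involving the facet of interest, plus residual error, is treated as error; the SEM is the square root of that error variance; and the coefficient is one minus the ratio of error variance to total variance. This choice is used because it reproduces the tabulated G-theory values for the kyphotic index (0.921 and 1.474 for inter-rater, 0.956 and 1.097 for inter-trial). The function name `reliability` is hypothetical.

```python
from math import sqrt

# Kyphotic index variance components, copied from the G-theory column of the
# variance components table (p = patient, r = rater, t = trial, e = error).
KI_COMPONENTS = {
    "p": 25.263, "r": 0.488, "t": 0.083,
    "pr": 0.563, "pt": 0.0, "rt": 0.098, "e": 1.023,
}


def reliability(components, facet):
    """Return (coefficient, SEM) for one facet, 'r' or 't', single observation.

    Assumption: all components whose label contains the facet, plus the
    residual, count as error; coefficient = 1 - error / total variance.
    """
    error = sum(
        value for name, value in components.items()
        if facet in name or name == "e"
    )
    total = sum(components.values())
    return 1 - error / total, sqrt(error)


inter_rater_coef, inter_rater_sem = reliability(KI_COMPONENTS, "r")
inter_trial_coef, inter_trial_sem = reliability(KI_COMPONENTS, "t")
```

Variance components for the other four measures are not reported, so the same check cannot be repeated for them here.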

Data from the G-study were used to establish a reliable measurement protocol through D-studies. Figures

The results of the design study for optimizing inter-trial reliability are illustrated in which the influence of having different numbers of raters is shown as a function of the number of trials for (a) spine curvature angles (degrees) measured using the digital inclinometer and (b) kyphotic index and lordotic index measured using the flexicurve ruler. The results of the design study for optimizing inter-rater reliability are illustrated in which the influence of performing different numbers of trials is shown as a function of raters for (c) spine curvature angles (degrees) measured using the digital inclinometer, and (d) kyphotic index and lordotic index measured using the flexicurve ruler.
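The "what if" logic behind such curves can be sketched with the textbook D-study formula for a fully random patient by rater by trial design, in which each error component is divided by the number of conditions averaged over. This is an assumption for illustration: the analyses reported here examine inter-rater and inter-trial reliability separately and may treat one facet as fixed, so the numbers below are not expected to match the figures. The kyphotic index components are used only because they are the ones reported; the function name `phi_coefficient` is hypothetical.

```python
# Kyphotic index variance components from the variance components table.
KI = {"p": 25.263, "r": 0.488, "t": 0.083,
      "pr": 0.563, "pt": 0.0, "rt": 0.098, "e": 1.023}


def phi_coefficient(c, n_raters, n_trials):
    """Dependability of a score averaged over n_raters and n_trials.

    Assumes a fully crossed design with raters and trials both random.
    """
    abs_error = (
        (c["r"] + c["pr"]) / n_raters           # rater-related error
        + (c["t"] + c["pt"]) / n_trials         # trial-related error
        + (c["rt"] + c["e"]) / (n_raters * n_trials)
    )
    return c["p"] / (c["p"] + abs_error)


# Projected dependability for a small grid of candidate protocols.
grid = {
    (n_r, n_t): round(phi_coefficient(KI, n_r, n_t), 3)
    for n_r in (1, 2, 3)
    for n_t in (1, 2, 3)
}
```

With these components, adding a second rater reduces the error variance more than adding a second trial, which is the kind of comparison a single G-study makes possible.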

This study aimed to illustrate the application of the tools of G-theory to establish a measurement protocol with optimal inter-trial and inter-rater reliability for assessing spine curvatures in postmenopausal women with osteoporosis of the spine. Estimates of inter-trial and inter-rater reliability of spine curvature measures acquired using the digital inclinometer and flexicurve ruler were similar whether using G-theory or CTT approaches. G-theory has the advantage of using even small datasets to explore the effect of changing aspects of the study design (e.g., number of raters and number of trials) in order to identify the optimal measurement protocol for a particular clinical or research setting.

Reliability of outcome measures needs to be established for each specific clinical environment or research laboratory. In our example, all measures of spine curvature had acceptable reliability (high reliability coefficients and low SEM) when performed by the same rater in triplicate (Table

Inter-rater reliability for LI measures was adequate given the G-coefficient of 0.75 in combination with a low SEM (1.72). However, inter-rater reliability of spine curvature measures acquired using the inclinometer was not adequate with G-coefficients varying from 0.57 to 0.73 and SEM varying from 6.22 to 7.87 degrees. The use of D-studies provided an efficient way to optimize the measurement process. We determined that inter-rater reliability could be improved satisfactorily for the TL angle and LS angle by having 5 raters acquire the measures 4 times. Scenarios for optimizing inter-rater reliability of CT angle fell outside the realm of clinical feasibility. We did not have to conduct different studies to determine whether greater gain in reliability would be achieved by increasing number of raters or increasing the number of assessments. We were able to acquire this information based on measures obtained in only 9 women representative of our target study population.

A limitation of this study may be the inclusion of assessors with varying levels of clinical experience. Neither assessor had used the flexicurve ruler before; however, the physiotherapist had over 20 years of experience performing physical assessments in general clinical practice. By building the different experience levels into the study design, we could illustrate nonzero sources of variance. However, the mean spine curvature measures acquired by each rater varied considerably, particularly when using the digital inclinometer, and this study was not designed to determine the accuracy of the measures. It would be interesting to determine the results following more extensive training of novice raters, inclusion of an expert rater, and verification of landmarks identified by each rater. Nonetheless, these results provide estimates of reliability that can be generalized to assessors with minimal levels of experience assessing posture and demonstrate that when the same rater measures spine curvatures, the measures are consistent.

We intend the results of this study to be used at the discretion of clinicians and investigators who are using measures of spine curvatures obtained using the flexicurve ruler or digital inclinometer in the clinical assessment of individuals with osteoporosis. Furthermore, this approach may be replicated to identify other measurement protocols that optimize reliability. Ultimately a suitable compromise between a feasible measurement protocol and acceptable reliability for each particular clinical or research setting must be identified. G-theory provides an alternative to CTT that enables efficient identification of an optimal measurement protocol based on data collected in a reliability study having a single study design.

The authors thank the participants who volunteered for this study and Leslie Beaumont for her assistance in recording the spine curvature measures for each rater. This study was funded in part by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant (NJM).