Uncovering Indicators of the International Classification of Functioning, Disability, and Health from the 39-Item Parkinson's Disease Questionnaire

The 39-item Parkinson's disease questionnaire (PDQ-39) is the most widely used patient-reported rating scale in Parkinson's disease (PD). However, recent studies have questioned its validity, and it is unclear what its scores represent. This study explored the possibility of regrouping PDQ-39 items into scales representing the International Classification of Functioning, Disability, and Health (ICF) components of Body Functions and Structures (BF), Activities and Participation (AP), and Environmental (E) factors. An iterative process using Rasch analysis produced five new item sets, two each for the BF and AP components and one representing E. Four of these were found to represent clinically meaningful variables: Emotional Impairment (BF), Gross Motor Disability (AP), Fine Motor Disability (AP), and Socioattitudinal Environment (E), with acceptable reliability (0.73–0.96) and fit to the Rasch model (total item-trait chi-square, 8.28–33.2; P > .05). These new ICF-based scales offer a means to reanalyze PDQ-39 data from an ICF perspective and to study its health components using a widely available health status questionnaire for people with PD.


Introduction
The International Classification of Functioning, Disability, and Health (ICF) provides a conceptualisation and classification of different components of health including biological, individual, and social perspectives [1]. The ICF contains two parts. The first part defines functioning and disability, which in turn consists of two components, Body Functions and Structures, and Activities and Participation. Body functions include physiological and psychological functions, and body structures refer to the anatomical integrity of the body. Activities are the execution of tasks or actions, whereas participation refers to the involvement in life situations. The second part conceptualizes contextual factors, which include environmental and personal factors. The environmental component is the facilitating or hindering impact of the physical, social, and attitudinal environment. Similarly, personal factors are recognized as having a facilitating or hindering impact but they are not further specified because of their vast social and cultural variation [1].
In addition to its use for clinical, educational, and research purposes, the ICF can be used to understand the content of different health outcome measures [2]. Linking such scales to the ICF can be valuable to allow clinical studies to relate to the ICF and for gaining a conceptual understanding of scale contents [3,4], thereby serving as a base for their further development and refinement. Rules for linking health status measures to the ICF have been proposed and used for linking items of various generic and disease specific scales to the ICF [2,5,6]. The findings from such studies can guide researchers and clinicians in their selection of instruments for specific purposes.
The 39-item Parkinson's disease questionnaire (PDQ-39) [7] is the most widely used patient-reported rating scale in Parkinson's disease (PD). However, recent studies have identified problems with its measurement properties. For instance, the instrument as a whole, as well as its 8-item short form (PDQ-8), appears multidimensional, and the validity of the suggested grouping of its items into eight scales also appears questionable [8–10]. Consequently, its validity is at stake, as it is unclear what its scores represent. Other problems include suboptimal targeting (items represent more severe health problems than those experienced by people with PD), compromised measurement precision, and problems associated with the use of its five response categories [8–10]. Although these experiences point to the need for new developments in patient-reported health outcome measurement for PD, the wide use and spread of the PDQ-39 argue for considering alternative and more valid means of using the questionnaire. One way to tackle the validity problems of the PDQ-39 could be to regroup items into theoretically more interpretable domains based on their linkage to the ICF, since this is a universal and standardized nomenclature of functioning and health [11]. Such linking and regrouping of items need to be followed by psychometric analyses and refinement, preferably by means of the Rasch measurement model [12].
Here we explore the possibility of regrouping PDQ-39 items according to the ICF framework and test these new scales psychometrically using Rasch analysis.

Sample.
Details have been reported elsewhere [14]. Briefly, self-reported postal survey PDQ-39 data from 202 people (79% response rate) with neurologist-diagnosed PD were analyzed (Table 1). The sample consisted of people with PD seen at a South Swedish university hospital during one year, excluding those in terminal care and participants in other recent or ongoing questionnaire-based studies. The survey included a question about whether the participant had responded to the questions him-/herself; only respondents who reported that they had answered the survey themselves were included. The study was conducted in accordance with the Declaration of Helsinki, all subjects consented to participation, and the study was approved by the local research ethics committee.

The PDQ-39.
The PDQ-39 [7] is a PD-specific health status questionnaire comprising 39 items that are grouped into eight scales (Mobility, Activities of Daily Living, Emotional Well-Being, Stigma, Social Support, Cognitions, Communication, and Bodily Discomfort). In addition, an overall PDQ-39 summary index (PDQ-39SI) that summarizes the eight scale scores has been proposed [15]. Respondents are asked to affirm one of five ordered response categories according to how often (from never to always), due to their PD, they have experienced the problem defined by each item during the past month. Higher scores indicate more frequent problems.

Procedure.
The 39 items were linked to the ICF by three health science researchers experienced with the ICF and representing different disciplines (nursing, physical therapy, and occupational therapy) [2]. First, each researcher individually linked the 39 items to the most appropriate ICF components (Body Functions and Structures, Activities and Participation, Environment). Each item could be linked to more than one component and category [2]. If the information provided by an item was insufficient to allow linkage to an ICF category, the item was considered "not definable". The researchers discussed their results in two consensus meetings. At the first meeting, preliminary consensus was reached. At the second meeting, remaining classification difficulties were resolved and items were rearranged into three groups representing the ICF components of Body Functions and Structures (BF), Activities and Participation (AP), and Environment (E).

Data Analysis.
Each of the three item groups was individually analyzed psychometrically according to the Rasch measurement model [16] for ordered response categories (the partial credit model) [17,18].

The Rasch Model.
The Rasch model [16,17,19] defines, mathematically, what is required from data (item responses) for total scores to express valid measurement. The model is based on the notion that the probability of a certain item response is a logistic function of the difference between the person's location on the measured variable and the level of the variable represented by the item. The model separately locates persons and items on a common logit (log-odds units) metric, with the mean item location set at zero logits. The Rasch model requires unidimensionality (items represent a common underlying latent variable) and local independence (each item response provides unique information). Both these aspects are reflected in the fit of data to the model, and violation of either distorts measurement [20,21].
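The logistic relationship described above can be written out explicitly. The following is a sketch in common textbook notation; the symbols (person location \beta_n, item location \delta_i, thresholds \tau_{ik}, and maximum category m_i) are introduced here for illustration and are not taken from the original text. An empty sum (the case x = 0 or j = 0) is taken to be zero.

```latex
% Dichotomous Rasch model: probability that person n affirms item i
P(X_{ni} = 1) = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}

% Partial credit model for ordered categories x = 0, 1, ..., m_i
P(X_{ni} = x) =
  \frac{\exp\left(\sum_{k=1}^{x} (\beta_n - \tau_{ik})\right)}
       {\sum_{j=0}^{m_i} \exp\left(\sum_{k=1}^{j} (\beta_n - \tau_{ik})\right)},
  \qquad x = 0, 1, \ldots, m_i
```

With m_i = 1 and \tau_{i1} = \delta_i, the second expression reduces to the first.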
Model fit is assessed by examining the accordance between expected and observed responses across person locations (class intervals) on the measured construct [17,19]. Overall fit is supported by a nonsignificant item-trait interaction chi-square statistic, and individual item fit is supported by non-significant standardized residuals that range between −2.5 and +2.5 [17,19]. Residuals represent the discrepancy between observed and expected item responses. Large negative residuals signal local dependency, whereas large positive residuals primarily suggest violation of unidimensionality.
However, fit statistics can be somewhat insensitive in detecting multidimensionality [22,23]. Smith [22] therefore proposed conducting a principal component analysis (PCA) of the residuals to identify potential subdimensions in the scale, followed by a series of independent t-tests to assess whether subsets of items yield different person measures. If violation of unidimensionality is trivial, the number of person locations that differ between the two item sets is small. This approach attempts to assess whether scales are sufficiently unidimensional to be treated as such in practice [22,24].
The Rasch model also provides a means to assess whether response categories work as assumed [17]. Ordered response categories (e.g., 0-1-2-3-4) are expected to reflect an increasing amount of the variable under investigation. The threshold between two adjacent categories is the point where there is a 50/50 probability of scoring, for example, 2 or 3. Disordered thresholds (e.g., if the 50/50 probability point between scoring 3 or 4 occurs at a lower level of the measured construct than that between scoring 2 or 3) indicate that the response categories do not work as intended. Disordered thresholds may be due to multidimensionality, too many response options, or ambiguous wording. Collapsing categories with disordered thresholds may improve model fit and provide clues regarding how the scale may be improved [25,26].
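To make the notion of thresholds concrete, the following minimal Python sketch computes partial credit model category probabilities for a single item and flags disordered thresholds. The function names and the threshold values are hypothetical and chosen only for illustration; this is not the procedure implemented in RUMM2020.

```python
import math

def category_probabilities(theta, thresholds):
    """Partial credit model probabilities for categories 0..m at person location theta."""
    # Cumulative sums of (theta - tau_k); the empty sum for category 0 is 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def disordered_pairs(thresholds):
    """Return index pairs (k, k + 1) where the later threshold lies below the earlier one."""
    return [(k, k + 1) for k in range(len(thresholds) - 1)
            if thresholds[k + 1] < thresholds[k]]

# Hypothetical threshold locations in logits for a five-category item (0-1-2-3-4).
ordered = [-1.5, -0.5, 0.5, 1.5]
disordered = [-1.0, 0.8, 0.2, 1.5]  # the third threshold lies below the second

probs_at_threshold = category_probabilities(0.5, ordered)
flagged = disordered_pairs(disordered)
```

At a person location equal to a threshold (here 0.5 logits, the threshold between categories 2 and 3), the two adjacent categories are equally probable, which is the 50/50 point described above.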
Differential item functioning (DIF) is an additional aspect of fit to the Rasch model that may result from multidimensionality and can give biased scale scores [19]. DIF analyses assess whether subgroups of people with similar levels on the measured construct respond systematically differently to items [27]. When DIF is uniform (i.e., item responses differ uniformly between groups across the measured construct), this can be adjusted for by splitting the item into two new items, one for each subgroup [17]. If this does not improve model fit, the item may be involved in multidimensionality and can be considered for removal.
Targeting assesses how well a scale corresponds to the levels of, for example, health impairments experienced by respondents, by comparing the locations of persons and items. If a scale is well targeted to the sample, the mean sample location should approximate the mean item location (i.e., zero). Examination of the relationship between the locations of people and item response category thresholds also reveals how successful a set of items is in mapping out a continuum of relevant levels of the measured variable [28,29]. Targeting also has implications for model fit; when targeting is poor, the ability to assess fit is compromised. Similarly, compromised reliability with poorly separated persons also reduces the ability to detect misfit [17,19,28].

Analysis Plan.
The overall aim of the analyses was to achieve well-fitting and clinically interpretable scales without disordered response thresholds or DIF by gender or age. When analyzing the new item groups, the following general approach was therefore taken. First, we deleted items considered not definable and items that were classified into more than one ICF component. However, since only one item was considered a "pure" environmental item (see Section 3), we combined this with items classified as environmental in addition to tapping either the BF or AP components.
The resulting item groups were then checked for signs of multidimensionality by means of PCA of the residuals followed by independent t-test comparisons of two estimated locations for each person, one based on the items with positive and one from the items with negative residual loadings on the first principal component [24]. Unidimensionality was considered statistically supported if the proportion of significant individual t-tests, or the lower bound of the associated 95% binomial confidence interval (CI), did not exceed 0.05 [24]. In case of multidimensionality, items were regrouped according to results of the PCA and theoretical considerations and then analyzed further as separate scales.
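The decision rule above can be sketched as follows. This is a minimal illustration under stated assumptions: each person has two location estimates with standard errors (one per item subset), the individual t-tests use a normal critical value of 1.96, and the binomial CI uses the normal approximation. The source does not specify which CI method was used, and the function name and input values are hypothetical.

```python
import math

def unidimensionality_check(est_a, se_a, est_b, se_b, critical=1.96):
    """Proportion of persons whose two subset-based estimates differ significantly.

    Returns (proportion, lower 95% CI bound, supported), where supported is True
    when the lower bound does not exceed 0.05.
    """
    n = len(est_a)
    significant = 0
    for a, sa, b, sb in zip(est_a, se_a, est_b, se_b):
        # t statistic for the difference between the two estimates of one person
        t = (a - b) / math.sqrt(sa ** 2 + sb ** 2)
        if abs(t) > critical:
            significant += 1
    proportion = significant / n
    # Normal-approximation binomial CI (an assumption; other CI methods exist)
    half_width = 1.96 * math.sqrt(proportion * (1 - proportion) / n)
    lower = max(0.0, proportion - half_width)
    # The lower bound never exceeds the proportion, so checking it covers both criteria
    return proportion, lower, lower <= 0.05

# Hypothetical estimates (logits) for four persons from two item subsets
subset_pos = [0.0, 1.0, -0.5, 2.0]
subset_neg = [0.1, 0.9, -0.4, -1.0]
errors = [0.5, 0.5, 0.5, 0.5]

proportion, lower, supported = unidimensionality_check(
    subset_pos, errors, subset_neg, errors)
```

Because the lower CI bound can never exceed the observed proportion, checking the lower bound alone is sufficient to apply the "proportion or lower bound" criterion.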
Functioning of response categories was then examined, and if disordered thresholds were found, categories were collapsed. If fit did not improve, items were deleted one at a time, starting with the most misfitting item, while monitoring the resulting overall and item level fit at each stage. The presence of DIF was assessed by comparing item response functions between genders and age groups (as defined by the median, <72 versus ≥72 years old). In case of DIF, these items were split into two new items (one for each subgroup). If this did not improve the scale, the item was deleted.
The resulting scales were examined regarding reliability and targeting. Reliability was assessed by the person separation index (PSI) [30], which is analogous to coefficient alpha and should exceed 0.7. We also assessed targeting (i.e., how well item locations accorded with the location of the sample) and the extent to which the points of measurement (i.e., the locations of response category thresholds) mapped out an evenly spaced quantitative continuum without significant gaps (indicating compromised measurement ability and larger measurement error) or clustering (indicating item measurement redundancy) [28]. Finally, the logic of the hierarchical ordering of item locations within each scale was considered in order to assess their internal content and construct validity. That is, do item contents appear to represent clinically interpretable variables and is their hierarchical ordering reasonably congruent with increasing and decreasing levels on that variable?
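As an illustration of the reliability index, the PSI is commonly described as the proportion of observed variance in person location estimates that is not attributable to measurement error. The sketch below assumes that definition (sample variance of the estimates minus the mean squared standard error, divided by the sample variance); the function name and data are hypothetical, and the exact computation in RUMM2020 may differ in detail.

```python
import statistics

def person_separation_index(locations, standard_errors):
    """PSI = (observed variance - mean error variance) / observed variance."""
    observed_var = statistics.variance(locations)  # sample variance of person estimates
    error_var = statistics.fmean([se ** 2 for se in standard_errors])
    return (observed_var - error_var) / observed_var

# Hypothetical person locations (logits) and their standard errors
locations = [-2.0, -1.0, 0.0, 1.0, 2.0]
standard_errors = [0.5, 0.5, 0.5, 0.5, 0.5]

psi = person_separation_index(locations, standard_errors)
acceptable = psi > 0.7  # the criterion stated in the text
```

Larger standard errors relative to the spread of person locations lower the index, which is why poorly targeted scales tend to show reduced reliability.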
All analyses were conducted using the RUMM2020 software (Rumm Laboratory Pty Ltd., Perth). Due to the large number of statistical tests, P-values were adjusted according to Bonferroni [31].
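For readers unfamiliar with the Bonferroni adjustment, the following minimal sketch shows the two equivalent ways it is usually applied: dividing the significance level by the number of tests, or multiplying each P-value by the number of tests (capped at 1). The number of tests and the P-values below are hypothetical; the source does not report them.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the familywise error rate at alpha."""
    return alpha / n_tests

def bonferroni_adjust(p_values):
    """Adjusted P-values: each P-value times the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

# Hypothetical example: ten item fit tests evaluated at a familywise alpha of 0.05
threshold = bonferroni_threshold(0.05, 10)
adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
```

Comparing a raw P-value with the lowered threshold gives the same decision as comparing the adjusted P-value with the original alpha.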

Results
Of the 39 items, 30 were judged to belong to only one ICF component (BF, 13 items; AP, 16 items; E, 1 item), eight were judged to belong to two ICF components, and one item was considered not definable (Table 2).
Subsequent stepwise item reduction guided by fit and DIF statistics yielded new item sets comprising five (BFa) to eight (APa) items each (Table 4). Total item-trait interaction (P > .056) and item level fit statistics (Table 4) suggested reasonable fit in all five instances, and reliabilities ranged between 0.73 (BFa) and 0.96 (APa). Figure 2 illustrates the locations of item response category thresholds relative to the locations of the sample for each item set. Inspection of these graphs shows a general tendency for the items (bars below the x-axes) to represent worse health than that experienced by the persons (bars above the x-axes). Furthermore, while the thresholds are able to map out a continuum for each scale, there are also several gaps as well as clusters along those continua (Figure 2).

Discussion
This study aimed to improve the validity of the PDQ-39 by linking its items to the ICF and using this linkage as a basis for defining new scales that are more interpretable than the originally proposed eight PDQ-39 scales and the summary index. Results provide support for the notion that this type of exercise is useful in improving the conceptual understanding of health status questionnaires such as the PDQ-39, whose development was not conceptually driven but primarily data-driven, through correlational techniques such as factor analysis.
Our observations also illustrate that the PDQ-39 can be used to assess the health impact of PD according to the ICF framework by regrouping items and treating them as new scales. Although the linking procedure employed here means that each of the item sets relates to the respective components of the ICF, these components are in themselves (i.e., without any further specification) relatively unspecific and broad in nature. As such, they only provide a basic framework as to what variables the new PDQ-39-based item sets represent. For the responses to a set of items to be meaningfully summarized into a total score and interpreted as a measure of a common underlying variable, the contents of the items need to express various aspects and degrees of that variable. That is, they should represent clinically reasonable manifestations of the variable and its expressions on a continuum ranging from less to more [28]. Scale construction should therefore preferably begin by defining the variable and its manifestations from less to more; representative items are then generated and selected to cover a relevant range of that variable [32]. However, such a bottom-up approach was not used in developing the PDQ-39 [7] and could therefore not be adopted here either. Instead, it is necessary to consider, based on their contents, what aspects within the respective ICF components the resulting item sets may represent, and whether they map out clinically meaningful variables. Examination of the item sets and the relative locations of items within each set (Table 4) suggests that four psychometrically valid and clinically interpretable ICF indicators can be inferred from the results of this study: Emotional Impairment (BFa), Gross Motor Disability (APa), Fine Motor Disability (APb), and Socioattitudinal Environment (E). Although the BFb item set exhibited good psychometric properties, it appears unclear what common variable manifestations (items) such as pain, poor memory, feeling unpleasantly hot or cold, and hallucinations would represent. As this item set originates from two of the original PDQ-39 scales (Cognitions, items 30, 32, 33; Bodily Discomfort, items 37-39), it could be suggested to split BFb according to these scales. However, this resulted in considerably reduced measurement precision and reliability (data not shown), and previous studies have shown that these scales are of dubious value according to classical as well as modern test theory analyses [10,33,34].

Figure 1: Category probability curves depicting the probability (y-axis) of observing responses in each category (0 = never; 1 = seldom; 2 = sometimes; 3 = often; 4 = always) relative to the location on the measured construct (x-axis; positive values = more problems) for item 28 before (a) and after (b) rescoring. This item was associated with multiple disordering (thresholds 0-to-1/1-to-2 and thresholds 2-to-3/3-to-4) and needed reduction from five to three response categories (combination of responses to categories 1, 2, and 3 into a single category).
Whereas the exact labels of the four suggested ICF indicators may be open for debate, they appear to map out clinically meaningful variables. Table 4 lists the items of each set according to their locations on the logit metric, where lower values represent fewer problems relative to items with higher values. This ordering signifies the hierarchical structure of the contents of the variable as manifested by each item set. The hierarchy can therefore be seen as representing the most likely experiences as people progressively move from better to worse health and, similarly, the most probable experiences among people with varying levels of health. As such, it provides a means of judging their clinical feasibility and validity [28,29,32]. For example, inspection of the Fine Motor Disability (APb) items suggests that, among the included activities, handwriting is the one that is affected earliest, followed by the ability to do buttons and shoe laces, cut food, and hold a drink without spilling. Finally, at relatively high levels of disability, people avoid eating or drinking in public. This hierarchical pattern seems clinically reasonable and suggests that the items map out various levels of the variable.
However, it is also evident that the item sets fail to cover the levels of disability experienced by the sample and instead tend to represent poorer health. This is reflected by the mean person logit locations (which are all negative) and by the relative distributions of item response category thresholds and persons along the common quantitative continua (Figure 2). This could be due to a sample bias towards people with uncharacteristically mild PD. However, the demographic characteristics of the sample suggest that this is not a major explanation. That is, the respondents appear to cover fairly representative and wide ranges of PD severities (according to Hoehn and Yahr [13] stages), durations, and ages. In addition, a majority experienced motor fluctuations, which also speaks against a sample bias towards mild PD.
In addition to a general bias towards poorer health states, there are also several gaps and clusters of item response category thresholds (see, e.g., the BFa item set; Figure 2(b)). This means that people located around areas associated with gaps are measured with less precision and that differences and changes at these levels will be more difficult to detect [28,29]. These problems are well known also for the original PDQ-39 [9,10] and would not be expected to resolve without the addition of new items representing areas not covered by available items [28,29]. When selecting the initial item pool for each of the ICF components, it was decided to use only items that were linked to no more than one ICF component. This decision was made in order to enhance conceptual clarity and interpretability of item sets. However, this strategy could not be pursued for the environmental items since only one item was considered a "pure" environmental item. This means that the resulting Socioattitudinal Environment item set (E) is less specific than the other identified ICF indicators. However, we still believe that these items can be useful as an indicator of socioattitudinal environment in studies wishing to address this component of the ICF, particularly since we are unaware of any other available tool for this purpose in people with PD.
It may also be argued that Activity and Participation should be separated. However, the ICF provides no clear guidance in this respect. Instead, because of difficulties distinguishing between the two, the ICF offers alternative options for structuring the relationship between them [1,35], and practices among authors vary [36]. Since the PDQ-39 was not developed according to the conceptual framework of the ICF, it was decided not to distinguish between activities and participation in this study. Arguably, however, and depending on exact definitions, the vast majority of items linked to this combined AP component (and the resulting scales) appear to represent activity limitations.
As with the original PDQ-39 and PDQ-8 [8–10], the five-category response scale did not work as intended in the new item sets. Although this was compensated for by reducing the number of response categories in the analyses conducted here, it must be emphasized that this exercise was exploratory and post hoc. Further developmental work, and empirical confirmation that reducing and/or rephrasing response categories improves this aspect of the questionnaire, is therefore needed.
The PDQ-39-derived ICF indicators identified here do not represent the full ICF spectrum but only limited aspects of its components. For example, they do not offer the possibility to study impairments of body function in terms of the motor symptoms of PD. Furthermore, as with the original PDQ-39 scales, targeting problems prohibit detailed documentation of differences and changes within the respective ICF components, which renders the new item sets relatively coarse. However, this is not likely to be a major problem for their use as survey tools and in other situations where measurement precision is not of primary importance. These limitations of the PDQ-39-derived ICF indicators also point to the need for developing new ICF-related tools for use as clinical PD trial outcome measures. Such scales need to be developed from explicit operational definitions of various aspects of the ICF components and should comply with requirements for rating scales to be used as clinical trial outcome measures [37]. Finally, there is a need to reassess the psychometric properties of the PDQ-39-derived ICF indicators in additional samples and cultures in order to establish whether they provide stable and invariant measurement across subgroups of people beyond those studied here.

Conclusions
This study illustrates that the PDQ-39 can be used to derive psychometrically and clinically acceptable indicators of the main components of the ICF. This provides investigators with a means to reanalyze PDQ-39 data from an ICF perspective and to study its health components using a widely available health status questionnaire for people with PD.