Can Orthopedic Oncologists Predict Functional Outcome in Patients with Sarcoma after Limb Salvage Surgery in the Lower Limb? A Nationwide Study

Accurate predictions of functional outcome after limb salvage surgery (LSS) in the lower limb are important for several reasons, including informing the patient preoperatively and, in some cases, deciding between amputation and LSS. This study aimed to elucidate the correlation between surgeon-predicted and patient-reported functional outcome of LSS in the Netherlands. Twenty-three patients (between six months and ten years after surgery) and five independent orthopedic oncologists completed the Toronto Extremity Salvage Score (TESS) and the RAND-36 physical functioning subscale (RAND-36 PFS). The orthopedic oncologists made their predictions based on case descriptions (including MRI scans) that reflected the preoperative status. The correlation between patient-reported and surgeon-predicted functional outcome was “very poor” to “poor” on both scores ($r^2$ values ranged from 0.014 to 0.354). Patient-reported functional outcome was generally underestimated, by 8.7% on the TESS and 8.3% on the RAND-36 PFS. The most difficult and least difficult tasks on the RAND-36 PFS were also the most difficult and least difficult to predict, respectively. Most questions had a “poor” intersurgeon agreement. It was difficult to accurately predict the patient-reported functional outcome of LSS. Surgeons' ability to predict functional scores can be improved the most by focusing on accurately predicting more demanding tasks.


Introduction
Limb salvage surgery (LSS) rather than amputation is the operation of choice in 70-85% of all malignant bone and soft tissue lower limb sarcomas [1,2]. Since the oncological results for amputation and LSS in the surgical treatment of sarcomas are comparable [3,4], the decision to perform an amputation or LSS is based on the tumor size, the tumor location, patient preferences, the expected risk of complications and multiple reoperations, and the expected functional outcome [3]. If it is surgically possible, LSS is generally the preferred treatment, unless a poor functional outcome is expected. It has been shown that the functional outcome of LSS is superior to that of amputation, with the exception of below-knee amputation, which yields function similar to that of limb salvage [5]. The expected functional outcome of patients after LSS is thus an important part of the preoperative decision-making process for the surgical treatment.
Several known predictors of functional outcome include tumor size, location, grade, bone resection, muscle involvement, use of radiotherapy, and motor nerve sacrifice [6]. The functional outcome is predicted by the surgeon based on these parameters combined with his/her clinical experience. However, to the best of our knowledge, there are no reports about how well surgeons are able to actually predict functional outcome after LSS. Insight into the level of accuracy of these predictions is important for several reasons. First, accurate predictions of functional outcome are highly relevant in informing the patient preoperatively about the expected final functional outcome. Second, in some cases, the predictions are helpful in deciding between amputation or LSS. Third, information about the correlation between predicted functional outcome and patient-reported functional outcome provides valuable information for surgeons in training.
In this study, (1) we aimed to establish whether orthopedic oncologists can accurately predict patient-reported functional outcome of LSS in the treatment of sarcoma in the lower limb in a selected group of patients. (2) We also examined whether there was a tendency to over- or underestimate patient-reported functional outcome. Additionally, (3) we sought to identify which items on the functional outcome scores were least difficult and which were most difficult to predict, and whether the surgeons agreed amongst themselves (interrater reliability) in their predictions.

Patients.
We selected patients who had undergone LSS for a sarcoma in the lower limb from a database of orthopedic oncologic patients at the Department of Orthopedic Surgery of the Radboud University Medical Center (RUMC), Nijmegen, The Netherlands. The database contained 216 patients who had undergone LSS or amputation for any type of tumor in the hip or knee region. We selected patients using the following inclusion criteria: a follow-up of at least six months after the surgery (before July 1, 2012) for patients without adjuvant treatment and at least twelve months for patients with adjuvant treatment; a maximum follow-up of ten years (after February 1, 2003); age between 18 and 70 years; and available preoperative MRI scans. The follow-up of at least six months was chosen because functional scores tend to plateau within that time frame [6,7]. We excluded patients who had a bone tumor with an intact cortical bone, as almost no functional deficits were expected to occur in those patients. Patients who had suffered local recurrence or complications that required reoperation in the last six months before the study were excluded. A flow chart of the patient selection is shown in Figure 1. Twenty-four patients were eligible for inclusion in the study, of whom 23 were successfully contacted. All 23 patients were included in the study. The study procedures were approved by the Local Ethical Committee of the RUMC. Written informed consent was obtained from all participants.

Figure 1: Flow chart of patient selection. Some patients fitted into multiple exclusion criteria (e.g., "no preoperative MRI scan available" and "too old"); in such cases, the patient was counted as belonging to the first of those exclusion criteria.

To evaluate the functional outcome, we used the Toronto Extremity Salvage Score (TESS) for the lower extremity and the RAND-36 physical functioning subscale (RAND-36 PFS). The TESS is a patient-reported questionnaire that has been specifically designed to measure the physical functional status of patients after limb salvage surgery [7]. It contains 30 questions, and the final score ranges from 0% to 100%, 100% being the highest achievable score. The RAND-36 PFS is intended to measure physical functioning in any patient cohort [8,9], which makes it more general than the TESS. Like the TESS, the RAND-36 PFS is a patient-reported questionnaire. It consists of ten questions, and the final score ranges from 0% to 100%, 100% being the highest achievable score. The RAND-36 PFS is identical to the SF-36 PFS. In addition to the TESS and RAND-36 PFS, we also used the RAND-36 pain subscale to examine postoperative pain levels. The RAND-36 pain subscale contains two questions: one regarding the amount of pain and one regarding the hindrance experienced due to pain when performing everyday activities in the previous four weeks [8,9]. The final score ranges from 0% to 100%, where 100% represents no pain. We did not employ the Musculoskeletal Tumor Society score [10], as that score is not patient-reported and includes the domains of pain and emotional acceptance, which would have been impossible to predict solely on the basis of case descriptions.
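As an illustration of how such 0-100 scores arise, the sketch below assumes the commonly described scoring rules: RAND-36 physical functioning items are recoded to 0, 50, or 100 and averaged, and TESS items are rated from 1 to 5 with not-applicable items left out before rescaling. The function names are hypothetical, and the text above does not confirm that the study scored its questionnaires exactly this way.

```python
from typing import Optional, Sequence

# Assumed recoding for the three response options of a physical functioning item:
# 1 = limited a lot, 2 = limited a little, 3 = not limited at all.
RAND36_PF_RECODE = {1: 0.0, 2: 50.0, 3: 100.0}


def rand36_pfs_score(responses: Sequence[int]) -> float:
    """Average the recoded values of the answered items (hypothetical helper)."""
    if not responses:
        raise ValueError("at least one answered item is required")
    return sum(RAND36_PF_RECODE[r] for r in responses) / len(responses)


def tess_score(responses: Sequence[Optional[int]]) -> float:
    """Rescale 1-5 item ratings to 0-100, skipping items given as None.

    Assumption: None marks an item the patient rated as not applicable.
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("at least one answered item is required")
    if any(r < 1 or r > 5 for r in answered):
        raise ValueError("TESS items are rated from 1 to 5")
    # Subtracting the item count shifts the minimum to 0; 4 points span each item.
    return (sum(answered) - len(answered)) / (4 * len(answered)) * 100.0
```

Under these assumptions, a patient who reports no limitation on any item scores 100 on both scales, and a single predicted percentage for the TESS (as the surgeons gave) is directly comparable to the rescaled patient score.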
A case description of each patient was made, which reflected the preoperative status of the patient. It contained the patient's age, sex, body mass index (BMI), tumor diagnosis, diagnostic MRI scans, a description of the performed surgical procedure for tumor resection and reconstruction, whether the patient had received adjuvant pre- or postoperative chemo- or radiotherapy, and whether there were any complications from the surgery (a case example is shown in Figure 2). The information did not include follow-up time. If a reresection had been performed, the preoperative MRI scans from before the primary resection surgery were provided, rather than those made after the local recurrence. The case descriptions were distributed through a central electronic platform. Whenever bone was removed, it was replaced by a tumor prosthesis and/or an allograft.

Study Procedures.
The patients were interviewed about their current functional status in a structured telephone call (conducted by KC, an independent researcher who was not a medical doctor), consisting of the TESS, the RAND-36 PFS, and the RAND-36 pain subscale. Five independent orthopedic oncologists (JB, PD, PJ, JP, and MvdS), working in one of the other three Dutch orthopedic oncologic referral centers (other than the RUMC), participated in the study. They were asked to give a prediction of the total TESS score (one percentage for the total functional status of the patient without addressing all separate items) and a prediction of the ten individual items of the RAND-36 PFS, based on the case descriptions. They had never been involved in the treatment of the patients and were unaware of their patient-reported functional outcome. All orthopedic oncologists were experienced and specialized in orthopedic oncology. They were familiar with the employed functional scales and were provided with a copy of the TESS questionnaire for reference.

Outcome Measures and Statistical Analyses.
Descriptive statistics were calculated and stated as mean ± standard deviation. We compared the patient-reported and surgeon-predicted TESS and RAND-36 PFS scores in three ways.
First, Pearson correlations were calculated between the patient-reported scores and the individual surgeon-predicted scores, as well as for the average scores of all the surgeons combined. The squared correlation coefficient, $r^2$ (coefficient of determination), represents the proportion of the variation in the patient-reported outcome that can be explained by variation in the surgeon-predicted outcome [11]. An $r^2$ value of 0.75-1.00 was interpreted as a "very good" prediction, 0.50-0.74 as "good," 0.25-0.49 as "poor," and 0-0.24 as "very poor." The $r^2$ values were considered the primary outcome measure.
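The first comparison can be sketched as follows. This is a minimal illustration, not the authors' Matlab or R code: the function names are hypothetical, and the band boundaries are taken at the lower limits stated above (values between, for example, 0.24 and 0.25 are assigned to the lower band, which is an assumption).

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


def interpret_r2(r2: float) -> str:
    """Map a coefficient of determination onto the bands used in this study."""
    if r2 >= 0.75:
        return "very good"
    if r2 >= 0.50:
        return "good"
    if r2 >= 0.25:
        return "poor"
    return "very poor"
```

For one surgeon, `interpret_r2(pearson_r(patient_scores, predicted_scores) ** 2)` would give the label for that surgeon's predictions, where both lists are placeholders for the 23 paired scores.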
Second, the mean differences and 95% confidence intervals (95% CI) between the patient-reported scores on the TESS and RAND-36 PFS and the surgeon-predicted scores were calculated to reveal whether the predictions had a bias towards being too optimistic or pessimistic.
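A minimal sketch of this second comparison is given below. It assumes a paired design (one patient-reported and one predicted score per case) and a t-based interval; the text does not state which interval method was used. The constant `T_CRIT_DF22` and the function name are hypothetical; 2.074 is the two-sided 95% critical value of the t distribution with 22 degrees of freedom, matching 23 cases.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple

# Two-sided 95% t critical value for 22 degrees of freedom (n = 23 cases).
T_CRIT_DF22 = 2.074


def mean_difference_ci(
    patient: Sequence[float],
    surgeon: Sequence[float],
    t_crit: float = T_CRIT_DF22,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) of patient minus surgeon.

    A positive mean difference means the surgeon underestimated the patient score.
    """
    if len(patient) != len(surgeon) or len(patient) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    diffs = [p - s for p, s in zip(patient, surgeon)]
    centre = mean(diffs)
    half_width = t_crit * stdev(diffs) / sqrt(len(diffs))
    return centre, centre - half_width, centre + half_width
```

An interval that lies entirely above zero indicates a consistent underestimation; for a different number of cases, `t_crit` has to be replaced by the critical value for the corresponding degrees of freedom.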
Third, the agreement between the patient-reported and the median surgeon-predicted answers to the separate questions of the RAND-36 PFS was examined using percent agreement and Gwet's agreement coefficient (AC1). Compared with Cohen's Kappa [12,13], Gwet's AC1 has a more stable interrater reliability and is less affected by prevalence and marginal probability [14]. This allowed us to identify which questions were the least difficult and most difficult to predict. The intersurgeon agreement on each separate question was also calculated, using percent agreement and Gwet's AC1. To calculate the intersurgeon agreement on the TESS, we used the intraclass correlation coefficient (ICC; absolute single measure/absolute agreement). Agreement coefficients below 0.40 were considered to represent a "poor" agreement; between 0.40 and 0.59 "fair"; between 0.60 and 0.74 "good"; and between 0.75 and 1.00 "excellent," analogous to commonly used guidelines for interexaminer agreement [15].
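For the comparison of two sets of answers (the patient's answer against the median surgeon prediction), percent agreement and Gwet's AC1 can be sketched as below. This is an illustration with hypothetical function names, covering only the two-rater form of AC1; the intersurgeon agreement among five surgeons requires the multi-rater form, which is not shown. The band boundaries are taken at the lower limits stated above.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of cases on which both raters gave the same answer."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty answer lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def gwet_ac1(
    a: Sequence[Hashable], b: Sequence[Hashable], categories: Sequence[Hashable]
) -> float:
    """Gwet's AC1 for two raters and a fixed list of at least two answer categories."""
    n = len(a)
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement uses the average share of each category across both raters.
    chance = 0.0
    for category in categories:
        share = (counts_a[category] + counts_b[category]) / (2 * n)
        chance += share * (1 - share)
    chance /= q - 1
    return (observed - chance) / (1 - chance)


def interpret_agreement(coefficient: float) -> str:
    """Map an agreement coefficient onto the bands used in this study."""
    if coefficient >= 0.75:
        return "excellent"
    if coefficient >= 0.60:
        return "good"
    if coefficient >= 0.40:
        return "fair"
    return "poor"
```

Because the chance term depends on how evenly the answers are spread over the categories rather than on the product of the raters' marginals, AC1 stays informative when nearly all patients give the same answer, which is the situation for the easiest items.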
The associations between each separate variable (age, sex, BMI, pain, and time since surgery) and the patient-reported TESS and RAND-36 PFS scores were examined using univariate regression analyses.
Matlab R2011a (The Mathworks, Natick, MA, USA) and R version 3.0.2 [16] were used for the statistical analyses.

Patients.
The characteristics of all 23 patients are listed in Table 1. The age at the time of surgery was 39.9 ± 18.8 years and the time after surgery was 47 ± 27 months. All patients were ambulatory and able to at least walk short distances without a walking aid. Two patients (cases 10 and 21) had undergone a reresection; this was mentioned in the case file. All other patients had not suffered from local recurrence or complications that required follow-up surgery. The mean patient-reported scores were TESS 87.0 ± 12.1, RAND-36 PFS 73.3 ± 18.7, and RAND-36 pain subscale 85.5 ± 24.7.

Surgeon Predictions-TESS.
The surgeon-predicted scores and their correlations with the patient-reported scores of all five surgeons and the average predictions of all surgeons on the TESS are shown in Figure 3 and in Table 2. The correlations with the patient-reported scores were "very poor" for all surgeons, with the best correlation for surgeon 2 ($r^2$ = 0.185). The TESS was underestimated for most patient cases (Figure 3); the mean underestimation ranged from 1.5 to 22.6 percentage points (Table 2). The correlation between the average of all five surgeons' predictions and the patient-reported TESS was "very poor" ($r^2$ = 0.159), and the averaged predictions underestimated patient-reported functional outcome by 8.7 (95% CI: 3.62-13.7) percentage points. The intersurgeon agreement on the TESS was "poor" with an ICC of 0.29 (95% CI: 0.10-0.53).
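To make the ICC figure concrete, the sketch below computes a single-measure, absolute-agreement ICC from a cases-by-surgeons table. It assumes the two-way model, often written ICC(A,1); the text specifies only "single measure/absolute agreement", so the choice of model and the function name are assumptions.

```python
from typing import Sequence


def icc_absolute_single(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(A,1) for a table with one row per case and one column per rater."""
    n = len(ratings)       # number of cases
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with at least 2 cases and 2 raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: cases (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Unlike a consistency-type ICC, this form penalizes a systematic offset between raters: two surgeons whose predictions differ by a constant 10 points for every case rank the cases identically but do not reach an ICC of 1.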

Surgeon Predictions-RAND-36 PFS.
The surgeon-predicted RAND-36 PFS scores and their correlations with the patient-reported scores are shown in Figure 4 and in Table 2. The correlations with the patient-reported scores were either "very poor" (surgeons 1, 4, and 5) or "poor" (surgeons 2 and 3). Surgeon 3's predictions had the highest correlation with the patient-reported scores ($r^2$ = 0.354). The patient-reported RAND-36 PFS score was underestimated by all surgeons, except for surgeon 2 (5.4 percentage points overestimation) (Table 2). The correlation between the averaged surgeon predictions and the patient-reported scores was "poor" ($r^2$ = 0.255), and the averaged predictions underestimated patient-reported functional outcome by 8.3 (95% CI: 0.64-16.0) percentage points.
In the analysis of the individual questions that make up the RAND-36 PFS, "Climbing several flights of stairs" and "Walking more than a mile" were the most difficult items to predict, with "poor" agreement coefficients (AC1) of 0.15 and 0.19, respectively, between surgeon-predicted and patient-reported scores (Table 3). "Walking one block" and "Bathing or dressing yourself" were the least difficult items to predict, with "excellent" agreement coefficients of 0.81 and 0.76, respectively, between surgeon-predicted and patient-reported scores. Similar to the overall RAND-36 PFS scores, most of its separate questions were underestimated; only two questions were overestimated ("Bending, kneeling, or stooping" and "Lifting or carrying groceries").
On most questions of the RAND-36 PFS, the intersurgeon agreement coefficient was "poor," but there was a "fair" agreement on "Bathing or dressing yourself" and "Moderate activities" and a "good" agreement on "Vigorous activities" and "Walking one block" (Table 3).

Other Potential Predictors.
No correlations were found between the TESS or RAND-36 PFS and any of the potential predicting factors (Table 4).

Discussion
This national survey aimed to investigate how well orthopedic oncologists are able to predict the patient-reported functional outcome of patients who had undergone LSS in the lower limb. We found "very poor" to "poor" correlations between patient-reported outcomes and surgeon-predicted outcomes on both the TESS and the RAND-36 PFS. The orthopedic oncologists tended to underestimate patient-reported functional outcome on both scales. The most difficult tasks on the RAND-36 PFS were also the most difficult to predict, whereas, for the least difficult tasks, it was easy to predict that these could be performed without substantial limitations by nearly all patients. The intersurgeon agreement on the RAND-36 PFS questions was mostly "poor" but was "good" for some of the most and least demanding tasks. None of the potentially predicting factors were related to the primary outcome measures.
Our results indicate that it was difficult for the participating orthopedic oncologists to accurately predict the patient-reported functional outcome of limb salvage surgery. On the TESS, for instance, the coefficients of determination ($r^2$) between patient-reported and surgeon-predicted outcomes were lower than 0.20, indicating that less than 20% of the variance in TESS could be explained by the predictions made by the orthopedic oncologists. We did not expect such a poor predictive ability, considering the experience level of the orthopedic oncologists with limb salvage surgery. Several aspects may underlie this seemingly rather poor predictive ability.
First, each limb salvage patient presents a unique case in terms of anatomical involvement. Even in patients with the same type of tumor at a similar location, for instance, the distal femur, final functional results can differ to a large extent. In part, this depends on the amount and precise location of soft tissue involvement, which may have been difficult to see from the limited set of MRI images in the case files. Moreover, patients are unique in terms of adaptive capacity. The adaptation of the patient to the new anatomical and sensorimotor situation plays a large role in the recovery of function [17]. The amount of adaptive capacity may have been hard or impossible to estimate by the orthopedic oncologists from the case files. Second, we measured functional outcome with questionnaires, which are inherently subjective. Thus, the patients' own perception of functioning may have played a large role in the functional outcome score. It might be that functional outcome measured by objective means, such as gait analysis, more closely reflects the orthopedic oncologists' predictions. Third, in the case files, we mimicked as well as possible the information typically available preoperatively to the surgeon in a clinical setting, but the study design did not permit the independent surgeons to review the medical history of the patients or to perform a physical examination before the surgery. As such, predictions of patient-reported functional outcome in a "real" clinical setting (e.g., including a physical examination) might be more accurate than those made in this study. Fourth, patients who had a bone tumor with an intact cortical bone were not included; the patient-reported functional outcome in those patients would potentially have been less difficult to predict than that in the patients with larger tumors. The poor predictive ability raises the question of which other factors determine functional outcome in limb salvage surgery and to what degree.
Davis et al. showed that large tumor size, deep lesions, high grade tumor, use of radiotherapy, bone resection, and motor nerve sacrifice are significantly related to increased disability on the TESS [6]. In their study, those combined parameters were able to predict 20% of the variance in TESS score. This is in the same order of magnitude as the presently reported results, indicating that the surgeons were unable to "add" predictive value on top of the given parameters in the case files. The rehabilitation protocol may also have an effect on functional outcome; Shehadeh et al. showed that adherence to a strict rehabilitation protocol after limb salvage surgery led to a relatively high level of functional outcome compared with other studies [18]. If we interpret our findings together with those of Davis et al. and Shehadeh et al., it appears that a large percentage of functional outcome still cannot be predicted by the surgeon, by anatomical, surgical, or adjuvant therapy-related factors, or by rehabilitation protocols. Other factors that may play a significant role in the patient-reported functional outcome include the preoperative physical and mental state of the patient. For example, a patient who is highly motivated and athletic may recover to a far higher level of functioning than one who is less motivated and leads a sedentary lifestyle. From this perspective, one may intuitively expect a correlation between patient-reported functional outcome and age or BMI, but we did not find this (Table 4). Further studies are required to clarify the role each factor plays in patient-reported functional outcome after limb salvage surgery.

The orthopedic oncologists tended to underestimate patient-reported functional outcome on both the TESS and the RAND-36 PFS. Thus, it appears that the patients adapted to the new anatomical and functional situation better than the surgeons predicted. It is possible that this is due to some surgeons being used to presenting a somewhat more pessimistic scenario to their patients so that the actual achieved functional result exceeds the patients' expectations. However, we specifically instructed the surgeons to provide their most accurate predictions of patient-reported functional outcome, rather than to provide predictions that they would share with patients. As for clinical relevance, we did not set a specific threshold, but the underestimation of patient-reported functional outcome on both the TESS and the RAND-36 PFS was rather consistent, as demonstrated by the 95% confidence intervals that did not include zero.
Interestingly, we found that the "Walking one block" question was the least difficult to predict, whereas the "Walking more than a mile" question was one of the most difficult questions to predict (only "Climbing several flights of stairs" was more difficult to predict). It appears that the ultimate level of function that is reached in patients is hard to predict, whereas it is easier to predict lower levels of function. Thus, surgeons' ability to predict functional scores can be improved the most by focusing on accurately predicting more demanding tasks. Additional improvement might be gained by analyzing the prediction for the "Bending, kneeling, or stooping" question. If the prediction for this question did not match with the patient-reported outcome, it was mostly overestimated (43.5% of cases). This overestimation breaks with the general trend to underestimate patient-reported functional outcome and indicates that bending the knees is more difficult to do for patients than the median surgeon predicted.
The intersurgeon agreement on most RAND-36 PFS questions was "poor," indicating that there was a high intersurgeon variability in the predictions for the questions. Notable exceptions were "Walking one block" and "Vigorous activities," with "good" intersurgeon agreement. The former arguably is the least difficult activity on the scale, whereas the latter represents the most demanding activities on the scale (including running, heavy lifting, and strenuous sports). However, this does not imply that there was also a high agreement with the patient-reported outcome; "Vigorous activities" had only a "fair" agreement with the patient-reported score. "Walking one block," on the other hand, was the only question that had both an "excellent" agreement with the patient-reported score and a "good" intersurgeon agreement. This might be due to the surgeons' familiarity with predicting this basic level of functional outcome or because being able to walk at least short distances is considered one of the criteria for attempting limb salvage surgery, and most patients indeed achieved that goal.
This study has some limitations. First, the surgeons only predicted the total TESS score, instead of predicting each of the 30 questions that comprise the score. This was done because some questions were already present in the much shorter RAND-36 PFS, and to reduce the time it would take the surgeons to predict the 23 cases. Second, we used a translated version of the TESS which has not been validated in Dutch. However, as the TESS is the gold standard assessment tool after limb salvage surgery, we decided to use it [19]. The RAND-36 PFS has been validated in Dutch [8,9], and its results showed the same trend in the comparisons as the translated TESS. Third, we found a wide range of patient-reported functional outcome scores, including in patients who had undergone similar surgery. Of course, each case is unique, but the perception of effort required to perform the activities in the questionnaires and the interpretation of the questions can vary between patients. Measuring actual functional outcome (e.g., in a movement laboratory or by observing patients in their home setting) could yield more knowledge of actual functioning, eliminate the subjectivity inherent in questionnaires, and establish the construct validity of the employed functional scoring systems. Fourth, the surgeons predicted the functional outcome based on a case description without being allowed to review the medical history of the patients or perform a physical examination. The time since surgery was also not provided, which might have negatively affected the predictions. This, however, does not explain the large differences found between predicted and patient-reported functional outcome, nor does it explain the differences in predictions between surgeons. Furthermore, there was no correlation between the patient-reported functional scores and the time since surgery (Table 4).

Conclusions
It was difficult for the participating orthopedic oncologists to accurately predict the patient-reported functional outcome of limb salvage surgery. Patient-reported functional outcome tended to recover to a higher level than the surgeons predicted. The ultimate level of function that the patients reached was hard to predict, whereas it was easier to predict lower levels of function. Thus, surgeons' ability to predict functional scores can be improved the most by focusing on accurately predicting more demanding tasks. Intersurgeon agreement on most questions was "poor," indicating high variability in the surgeons' predictions and, possibly, in treatment decisions. The poor predictive ability warrants research into objective tools to assist orthopedic oncologists in the decision-making process. Such tools could include, for instance, computational musculoskeletal models that prospectively calculate whether enough muscle strength remains to perform activities of daily living.