Predictive Models for Knee Pain in Middle-Aged and Elderly Individuals Based on Machine Learning Methods

Aim This study used machine learning methods to develop a prediction model for knee pain in middle-aged and elderly individuals.
Methods A total of 5386 individuals aged over 45 years were obtained from the National Health and Nutrition Examination Survey. Participants were randomly divided into a training set and a test set at a 7:3 ratio. The training set was used to create a prediction model, whereas the test set was used to validate the proposed model. We constructed multiple predictive models based on three machine learning methods: logistic regression, random forest, and Extreme Gradient Boosting. Model performance was evaluated by the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value, and negative predictive value. Additionally, we created a simplified nomogram based on logistic regression for better clinical application.
Results About 31.4% (1690) of the individuals had self-reported knee pain. The logistic regression showed that female gender (odds ratio [OR] = 1.28), pain elsewhere (OR = 4.64), and body mass index (OR = 1.05) were significantly associated with an increased risk of knee pain. In the test set, the logistic regression (AUC = 0.71) showed similar but slightly higher accuracy than the random forest (AUC = 0.70), while the performance of the Extreme Gradient Boosting model was less reliable (AUC = 0.59). Based on mean decrease accuracy, the five most important predictors were pain elsewhere, waist circumference, body mass index, age, and gender. Based on the mean decrease Gini index, the five most important predictors were pain elsewhere, body mass index, waist circumference, triglycerides, and age. The nomogram model showed good discrimination ability with an AUC of 0.75 (0.73-0.77), a sensitivity of 0.72, a specificity of 0.71, a positive predictive value of 0.45, and a negative predictive value of 0.88.
Conclusion This study proposed a convenient nomogram tool to evaluate the risk of knee pain for the middle-aged and elderly US population in primary care. All the input variables can be easily obtained in a clinical setting, and no additional radiologic assessments were required.


Introduction
Knee pain is estimated to affect about 35% of men and 62% of women over 40 years old, constituting a significant health threat worldwide [1]. Patients with knee pain usually experience reduced physical ability and poor quality of life [2][3][4]. The disease burden of knee pain is increasing due to the aging population and limited preventive strategies. In middle-aged and elderly individuals, knee osteoarthritis is the primary cause of knee pain [5]. Importantly, most knee joint diseases progress slowly but eventually result in joint failure with pain and disability.
However, radiological alterations are not closely associated with the occurrence of pain, and it remains unclear at which stage the disease causes knee pain [6]. Considering the significant individual and socioeconomic burden, attention has been paid to the early detection and prevention of knee pain [7]. Many studies have revealed risk factors for knee pain, such as older age, female gender, and obesity. A better understanding of these risk factors could provide an insightful and cost-effective tool for identifying those with an increased risk of knee pain [8]. Once high-risk individuals are identified, clinicians can offer them preventive strategies and lifestyle advice [9].
Although previous studies have proposed prediction models for joint pain or osteoarthritis [10][11][12], their small sample sizes and limited clinical applicability make them difficult to use in clinical practice. Therefore, this study sought to develop a risk prediction model for knee pain based on easily obtained demographics and laboratory biomarkers. Multiple machine learning methods (logistic regression, random forest, and Extreme Gradient Boosting) were applied, and we visualized the logistic regression model using a nomogram. Additionally, pain in other areas (shoulder, elbow, hip, wrist, ankle, toes, or fingers) was defined as pain elsewhere.

Predictive Biomarkers.
This study selected multiple predictive biomarkers associated with knee pain based on literature review and expert recommendations [13,14]. All selected biomarkers can be easily obtained by inquiry, body measurement, and blood test. Age (years), gender (male or female), race (Non-Hispanic White, Non-Hispanic Black, Mexican American, and others), education (below high school, high school, and above high school), hypertension (yes or no), diabetes (yes or no), pain elsewhere (shoulder, elbow, hip, wrist, ankle, toes, or fingers/thumb), moderate activity (yes or no), vigorous activity (yes or no), smoking (yes or no), and drinking (yes or no) were collected by questionnaires. Body mass index (BMI) and waist circumference were obtained by body measurement. Plasma levels of albumin (g/L), phosphorus (mg/dL), total calcium (mg/dL), triglycerides (mg/dL), cholesterol (mg/dL), and vitamin D (nmol/L) were also examined. Additionally, we calculated the estimated glomerular filtration rate (eGFR) by the Chronic Kidney Disease Epidemiology Collaboration equation [15].
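As an illustration of the eGFR calculation, the sketch below implements the 2009 CKD-EPI creatinine equation in Python. The text does not state which version of the CKD-EPI equation was used, nor does it list serum creatinine among the collected variables, so both the choice of the 2009 form and the function and argument names are assumptions made for this example.

```python
def egfr_ckd_epi_2009(scr_mg_dl, age_years, female, black=False):
    """Return eGFR (mL/min/1.73 m^2) from the 2009 CKD-EPI creatinine equation.

    Assumption: the study used this version; the text only cites reference [15].
    """
    kappa = 0.7 if female else 0.9          # sex-specific creatinine threshold
    alpha = -0.329 if female else -0.411    # sex-specific exponent below the threshold
    ratio = scr_mg_dl / kappa
    egfr = 141.0 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** -1.209 * 0.993 ** age_years
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


# Hypothetical participant: 50-year-old man with serum creatinine 0.9 mg/dL.
example_egfr = egfr_ckd_epi_2009(0.9, 50, female=False)
```

The returned value is in mL/min/1.73 m^2 and would enter the models as one continuous predictor alongside the other biomarkers.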

Development and Validation of Prediction Model.
Participants were randomly divided into a training set and a test set at a 7:3 ratio. The training set was used to create a prediction model, whereas the test set was used to validate the proposed model. Patients' characteristics in the training and test sets are shown in Table S1. The aforementioned variables were used as inputs, and knee pain (present or absent) was set as the outcome. We constructed multiple predictive models based on several machine learning methods, including logistic regression, random forest, and Extreme Gradient Boosting (XGBoost). Random forest is an ensemble learning method for regression and classification based on a multitude of decision trees [16], whereas XGBoost is a scalable end-to-end tree boosting system. We showed the model performance by the receiver operating characteristic curve and calculated the area under the curve (AUC) of the three models. Sensitivity, specificity, positive predictive value, and negative predictive value were also provided. Moreover, we used gender, age, hypertension, diabetes, vitamin D, pain elsewhere, total calcium, waist circumference, and BMI to create a logistic regression-based prediction model. The prediction model was then visualized by a nomogram, which is more practical for clinical application. Each variable of the nomogram was assigned a preliminary score, from which the total score was calculated. Finally, the total score was converted to the probability of knee pain (0-100%).
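The random 7:3 split and the AUC can be sketched in a few lines of Python. This is only an illustration: the original analysis was run in R, and the function names, the random seed, and the toy scores below are assumptions introduced for the example. The AUC is computed as the proportion of positive-negative pairs in which the positive case receives the higher score, with ties counted as one half.

```python
import random


def split_70_30(ids, seed=2022):
    """Shuffle participant identifiers and split them 7:3 (seed is arbitrary)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 5386 participants, as in the study; identifiers here are just 0..5385.
train_ids, test_ids = split_70_30(range(5386))

# Toy predicted risks and outcomes (not study data).
toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

With 5386 participants, this split gives 3770 training and 1616 test records. The pairwise AUC is quadratic in sample size, which is acceptable for a test set of this size but would normally be replaced by a sort-based routine.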

Statistical Analysis.
Missing values were filled in using multivariate multiple imputation. Continuous variables were presented as median (Q1, Q3) and compared between groups by the Kruskal-Wallis test, whereas categorical variables were presented as percentages and compared by the chi-square test. We performed multivariate regression to investigate the association between the biomarkers and knee pain. Analyses were performed with R software (version 3.6.1). P < 0.05 was considered statistically significant.

Results
Table 1 describes the characteristics of the study population. Among the 5386 individuals, 1690 (about 31.4%) had self-reported knee pain. Compared with the normal group, knee pain patients were more likely to be female and to report lower education, hypertension, diabetes, pain elsewhere, moderate activity, and vigorous activity. Knee pain patients also showed higher BMI, waist circumference, and triglycerides but lower vitamin D levels.
Performance of the Prediction Models.
We used age, gender, race, education, hypertension, diabetes, pain elsewhere, moderate activity, vigorous activity, smoking, drinking, BMI, waist circumference, albumin, phosphorus, total calcium, triglycerides, cholesterol, vitamin D, and eGFR as input variables. Three models based on logistic regression, random forest, and XGBoost were created using the training set. Among the three models, logistic regression (AUC = 0.71, 95% CI = 0.68-0.74) showed similar but slightly higher accuracy than random forest (AUC = 0.70, 95% CI = 0.67-0.72), while the performance of the XGBoost model was less reliable (Figure 1). The random forest model showed a sensitivity of 0.72, a specificity of 0.61, a positive predictive value of 0.46, and a negative predictive value of 0.83. The variable importance of the random forest model is illustrated in Figure 2. A higher mean decrease accuracy or mean decrease Gini index indicates that a variable plays a more important role in predicting knee pain. Based on mean decrease accuracy, the five most important predictors were pain elsewhere, waist circumference, body mass index, age, and gender. Based on the mean decrease Gini index, the five most important predictors were pain elsewhere, body mass index, waist circumference, triglycerides, and age. The logistic regression model showed a sensitivity of 0.71, a specificity of 0.64, a positive predictive value of 0.47, and a negative predictive value of 0.83.
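The predictive values follow from sensitivity, specificity, and prevalence by Bayes' rule. The Python sketch below works through this for the logistic regression model, assuming that the test-set prevalence equals the overall prevalence of 1690/5386 (the test-set prevalence itself is not reported in the text); the function name is illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence                # true positives per person screened
    fn = (1 - sensitivity) * prevalence          # false negatives
    tn = specificity * (1 - prevalence)          # true negatives
    fp = (1 - specificity) * (1 - prevalence)    # false positives
    return tp / (tp + fp), tn / (tn + fn)


# Logistic regression model: sensitivity 0.71, specificity 0.64 (from the text).
# Assumption: prevalence in the test set equals the overall 1690/5386.
ppv, npv = predictive_values(0.71, 0.64, 1690 / 5386)
```

Under this assumption the calculation gives a positive predictive value of about 0.47 and a negative predictive value of about 0.83, in line with the values reported for the logistic regression model.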
Moreover, the prediction model based on logistic regression was visualized by a nomogram (Figure 3). Gender, age, hypertension, diabetes, vitamin D, pain elsewhere, total calcium, waist circumference, and BMI were input into the simplified nomogram. In the testing set, the nomogram model showed good discrimination ability with an AUC of 0.75 (0.73-0.77), a sensitivity of 0.72, specificity of 0.71, a positive predictive value of 0.45, and a negative predictive value of 0.88.
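The way a nomogram turns a logistic regression into points can be sketched as follows. This is a simplified illustration, not the published nomogram: it uses only the three odds ratios quoted in the abstract (which come from the full multivariable model rather than the nine-variable nomogram), and the intercept and the BMI range are hypothetical values chosen for the example. Each coefficient is the natural logarithm of its odds ratio, the variable with the largest possible effect over its range is scaled to 100 points, and the linear predictor is converted to a probability with the logistic function.

```python
import math

# Odds ratios quoted in the text; everything else below is hypothetical.
ODDS_RATIOS = {"female": 1.28, "pain_elsewhere": 4.64, "bmi": 1.05}
RANGES = {"female": (0, 1), "pain_elsewhere": (0, 1), "bmi": (15, 50)}  # BMI range assumed
INTERCEPT = -3.0  # hypothetical intercept, not reported in the paper

BETAS = {name: math.log(odds) for name, odds in ODDS_RATIOS.items()}
# Largest effect any single variable can have across its range (scaled to 100 points).
MAX_EFFECT = max(BETAS[n] * (hi - lo) for n, (lo, hi) in RANGES.items())


def points(name, value):
    """Nomogram points for one variable, relative to its lowest value."""
    lo, _ = RANGES[name]
    return 100.0 * BETAS[name] * (value - lo) / MAX_EFFECT


def probability(person):
    """Predicted probability of knee pain from the logistic model."""
    lp = INTERCEPT + sum(BETAS[n] * person[n] for n in BETAS)
    return 1.0 / (1.0 + math.exp(-lp))


person = {"female": 1, "pain_elsewhere": 1, "bmi": 30}  # hypothetical individual
total_points = sum(points(n, v) for n, v in person.items())
risk = probability(person)
```

Because the total score is a linear rescaling of the linear predictor, reading a probability off the bottom axis of a nomogram is equivalent to evaluating the logistic function directly.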

Discussion
Knee pain is common in the middle-aged and elderly population and has become a major reason for early retirement. The NHANES I survey indicated that about 14.6% of the population reported knee pain [17]. It was reported that knee pain made about 20% of individuals with knee osteoarthritis retire earlier by eight years [18]. This study developed machine learning models to evaluate the risk of knee pain for the general US population. In the test set, logistic regression showed similar but slightly higher accuracy than random forest, while the performance of the XGBoost model was less reliable.
Many risk factors have been proposed for knee pain, such as age, female gender, obesity, and pain elsewhere [8,14,19]. However, individuals with one or several risk factors might not experience knee pain, and a single risk factor alone is insufficient to evaluate the disease risk comprehensively. Therefore, several conventional risk factors were used as the inputs of the prediction models in this study. All the input variables can be measured easily in clinical practice, and no radiologic assessment was required in these models. Using easily available biomarkers without additional laboratory or radiologic examinations makes the prediction model simple to use and cost-effective. The nomogram model has a high negative predictive value (0.84) but a lower positive predictive value (0.47). Therefore, this prediction model is more suitable for identifying individuals with low knee pain risk.
Aging-induced joint pain is a multifactorial process involving numerous factors, such as cartilage thinning, muscle weakening, and reduced proprioception. Aging also reduces the capacity to maintain tissue homeostasis, thus causing an inadequate response to joint injury. We also observed that BMI and waist circumference were positively associated with the risk of knee pain [20]. Due to population aging and the rising prevalence of obesity, knee pain is expected to be a growing health problem.
Besides aging and obesity, other biomarkers were also involved in the prediction models [8]. Pain elsewhere (shoulder, elbow, hip, wrist, ankle, toes, or fingers/thumb) was a significant biomarker for knee pain with an OR of 4.64 (95% CI = 3.98-5.43), which was consistent with previous studies [14,18]. The association between joint pain and pain elsewhere might be attributed to shared pathology or the progression of chronic pain syndrome [21]. Fernandes et al. [18] analyzed 1822 participants at risk for knee pain from the Nottingham community and followed the participants for 12 years. The results showed that pain elsewhere was associated with a 2.49-fold risk of knee pain [18]. In another prospective cohort study of 2982 people, baseline pain at sites other than the knee increased the risk of new-onset knee pain but not the risk of progression from mild to severe pain [14].
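As a worked check on the reported interval for pain elsewhere, the sketch below recovers the standard error of the log odds ratio from the 95% confidence limits. It assumes a Wald-type interval that is symmetric on the log scale, which the text does not state explicitly; the function name is illustrative.

```python
import math


def log_or_summary(lower, upper, z=1.959964):
    """Return (point estimate, SE of log OR) implied by a symmetric log-scale CI."""
    log_lo, log_hi = math.log(lower), math.log(upper)
    se = (log_hi - log_lo) / (2 * z)            # half-width divided by the normal quantile
    point = math.exp((log_lo + log_hi) / 2)     # geometric midpoint of the limits
    return point, se


# 95% CI for pain elsewhere as reported in the text: 3.98 to 5.43.
point, se = log_or_summary(3.98, 5.43)
```

The geometric midpoint of the two limits lies within rounding distance of the reported OR of 4.64, which is what a symmetric log-scale interval would produce.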

Computational and Mathematical Methods in Medicine
Previous studies also proposed prediction models for joint pain or osteoarthritis [10][11][12]. Zhang et al. created a prediction model for radiographic knee osteoarthritis based on a 12-year retrospective community cohort (UK Nottingham cohort). Age, gender, BMI, family history, and joint injury were included in the prediction model (AUC = 0.70) [11]. Similarly, Kerkhof et al. [10] used age, gender, BMI, questionnaire variables, genetic scores, and radiographic signs to develop a prediction model for radiographic knee osteoarthritis based on individuals in the Netherlands aged 55 years and over (AUC = 0.79). Compared with previous prediction models, our model showed similar accuracy but was based on easily available biomarkers without additional laboratory or radiologic assessments. These advantages make it a simple-to-use and cost-effective tool suitable for primary care.
Still, some limitations should be mentioned. First, the definition of knee pain was based on self-report. A proportion of self-reported knee pain might be pain referred from the hip or spine rather than pain originating in the knee. Second, we tested the model performance only in an internal test set, so we are unsure whether the proposed knee pain prediction tool can be applied to other populations, such as Chinese or European populations. Third, NHANES has a cross-sectional design, which carries inherent bias; further validation and improvement are required in future research. Fourth, although we input multiple variables in the model, other risk factors potentially remain unaccounted for. The investigation of additional biomarkers might improve the model performance.

Conclusion
This study proposed a convenient tool to evaluate the risk of knee pain for the middle-aged and elderly US population in primary care. All the input variables can be easily obtained in a clinical setting, and no additional radiologic assessments were required. In the internal validation, the nomogram model showed reliable performance with an AUC of 0.72.

Data Availability
The data in this study can be obtained from https://www.cdc.gov/nchs/nhanes/index.htm