Identification of the Optimal Model for the Prediction of Diabetic Retinopathy in Chinese Rural Population: Handan Eye Study

Background. To identify an optimal model for diabetic retinopathy (DR) prediction in the Chinese rural population by establishing and comparing different algorithms based on data from the Handan Eye Study (HES). Methods. Five algorithms, namely multivariable logistic regression (MLR), classification and regression trees (C&RT), support vector machine (SVM), random forests (RF), and gradient boosting machine (GBM), were used to establish DR prediction models with HES data. The performance of the models was assessed based on the adjusted area under the ROC curve (AUROC), sensitivity, specificity, and accuracy. Results. Data on 4752 subjects were used to build the DR prediction model, and among them, 198 patients were diagnosed with DR. The age of the included subjects ranged from 30 to 85 years, with an average age of 50.9 years (SD = 3.04). The kappa coefficient of the diagnosis between the two ophthalmologists was 0.857. The MLR model revealed that blood glucose, systolic blood pressure, and body mass index were independently associated with the development of DR. The AUROCs obtained by GBM (0.952), RF (0.949), and MLR (0.936) were similar and significantly larger than those of C&RT (0.682) and SVM (0.765). Conclusions. The MLR model combined excellent prediction performance with an explicit, visible equation and thus was the optimal model for DR prediction. Therefore, the MLR model may have the potential to serve as a complementary screening tool for the early detection of DR, especially in remote and underserved areas.


Introduction
Diabetic retinopathy (DR) is a common and serious microvascular complication of diabetes. It is the leading cause of visual impairment and blindness among working-age adults [1]. According to the World Report on Vision published by the World Health Organization (WHO) in 2020, at least 2.2 billion people worldwide have a vision impairment. In at least 1 billion, or almost half, of these cases, vision impairment could have been prevented or has yet to be addressed. Meanwhile, moderate or severe distance vision impairment or blindness in 3 million people was caused by diabetic retinopathy [2]. There were 92.4 million diabetic patients in China [3], and among them, 15.2 million developed DR [4]. The number of patients suffering from diabetes and related complications is continuously rising in China, and diabetes mellitus and DR have become major public health issues there. Moreover, early signs such as microaneurysms, retinal hard exudates, and cotton-wool spots are often insidious and hard to identify macroscopically. Early detection and management of DR can effectively prevent visual deterioration. Further, it has been verified that DR is also a risk factor for the onset of other diabetic complications, such as cardiovascular ones [5]. Thus, it is critical to determine the risk factors for DR and their combined effects.
To date, many factors such as age, body mass index (BMI), smoking status, duration of diabetes, glycemic control, systolic blood pressure (SBP), serum lipids, urinary albumin, and C peptide have been identified as risk factors for the development of DR in both cross-sectional and prospective cohort studies [6][7][8][9]. Advances in statistical methodology have provided the tools to model linear or nonlinear relationships among risk factors for specific diseases. Logistic regression is a well-established classification technique that is widely used in clinical studies [10]. However, its capacity to capture nonlinear relationships among risk factors is limited. In recent years, many researchers have shifted their focus to machine learning. Classic machine learning models such as decision trees, neural networks, and random forests are widely used for the classification of diseases, and several studies have applied machine learning models to the early detection of DR [11][12][13]. Given the differences in the algorithms and application conditions of these machine learning models, the optimal model is likely to differ from one disease to another. Here, we aim to apply and compare the prediction performance of multivariable logistic regression and machine learning algorithms based on the data of the Handan Eye Study (HES) and to develop a practical classifier for DR identification among the Chinese rural population.

Subjects and Examination
The investigational site was in Yongnian County of Handan City. A statistical comparison revealed that the gender and age distributions of Yongnian County were similar to those of rural areas nationwide. To investigate the 6-year cumulative incidence of DR in HES [14], the follow-up study was conducted from 2012 to 2013 [15]. At baseline, there were 6830 participants [16], and among them, 5394 participated in the follow-up investigation. The follow-up rate was 85.3% (5394/6323) [15]. After excluding participants who had been previously diagnosed or who lacked qualified fundus photography or data on fasting glucose value, data from 4752 participants were collected to build the DR prediction model (Figure 1).
2.1. Diagnosis of DR. The diabetic patients who were diagnosed with DR at the baseline investigation were excluded. Therefore, the cases diagnosed during this follow-up period were new cases that developed between 2007 and 2012. Two qualified ophthalmologists (Ailian Hu and Xu Zhang) reviewed the fundus photographs independently, and the diagnosis was made according to the Early Treatment Diabetic Retinopathy Study (ETDRS) criteria. The kappa coefficient was calculated to evaluate the consistency of the diagnosis between the two ophthalmologists.
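As an illustration of how a kappa coefficient of this kind is obtained, the following minimal sketch computes Cohen's kappa from a two-grader agreement table. The function name `cohen_kappa` and the counts in `example_table` are hypothetical and are not taken from the study's data; the sketch only assumes the standard definition, kappa = (observed agreement - chance agreement) / (1 - chance agreement).

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table.

    table[i][j] = number of subjects graded category i by grader A
    and category j by grader B.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: share of subjects on the diagonal.
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the two graders' marginal proportions.
    row_marginals = [sum(table[i]) / n for i in range(k)]
    col_marginals = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_expected = sum(r * c for r, c in zip(row_marginals, col_marginals))
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts (NOT the study's data): rows = grader A,
# columns = grader B, categories = [DR, no DR].
example_table = [[40, 5],
                 [5, 950]]
kappa = cohen_kappa(example_table)
print(f"kappa = {kappa:.3f}")
```

Because kappa discounts agreement expected by chance, it is far more informative than raw percent agreement when one category (here, no DR) dominates.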

Model Introduction.
Multivariable logistic regression (MLR) is a well-established classification technique widely used in epidemiological studies, which is constructed based on maximum likelihood estimation. Apart from the logistic regression, four machine learning algorithms were used to establish the prediction model on DR in HES.

Classification and Regression Trees (C&RT).
C&RT is a type of nonparametric decision tree methodology that was first developed by Breiman et al. [19] and has been widely applied in clinical and public health research [20][21][22][23]. It produces a visual output with a multilevel structure that resembles the branches of a tree. Its set of if-then conditions permits the classification of cases and efficiently segments populations into meaningful subgroups. C&RT has the advantage of ignoring irrelevant descriptors and handling multiple mechanisms of action.
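The if-then conditions of a classification tree are found by searching for the split that makes the resulting subgroups as pure as possible. The sketch below shows one such search for a single continuous variable using the Gini impurity, which is the usual criterion for classification trees; the study does not state which criterion or settings were used, and the function names and toy values here are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p ** 2 - (1 - p) ** 2


def best_split(xs, ys):
    """Return (threshold, weighted impurity) of the best rule 'x <= threshold'."""
    best_t, best_impurity = None, float("inf")
    # The largest value is skipped because it would leave the right branch empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t, best_impurity


# Hypothetical toy data: one predictor value per subject and a 0/1 outcome.
values = [5.0, 5.4, 5.8, 6.1, 7.2, 8.0, 9.5, 11.0]
outcome = [0, 0, 0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(values, outcome)
print(f"if value <= {threshold} then class 0 else class 1 (impurity {impurity:.3f})")
```

A full tree applies the same search recursively to each subgroup and over all candidate variables.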

Support Vector Machine (SVM).
SVMs are linear classifiers based on the margin maximization principle. An SVM accomplishes the classification task in two steps: it maps the data into a higher-dimensional feature space and then constructs an optimal separating hyperplane in this space. The hyperplane optimally separates the data into two regions, each of which corresponds to one class [24,25]. The SVM is capable of dealing with high-dimensional data and has excellent generalization performance. However, it is not robust to the presence of a large number of irrelevant descriptors and therefore requires descriptor preselection.

Random Forests (RF).
RF belongs to "ensemble learning" proposed by Breiman and Quinlan in 2001 [26]. Ensemble learning generates many classifiers and aggregates their results. Boosting [27] and bagging [26] are well-known methods of ensemble learning, and RF is a typical bagging model. It draws multiple bootstrap samples from the original sample, constructs a decision tree for each bootstrap sample, and then accomplishes the classification task by voting on or averaging the predictions of these trees. RF has high prediction accuracy, good tolerance to outliers and noise, and a lower risk of overfitting [28][29][30]. Therefore, it has been widely applied in the fields of medicine, bioinformatics, and economics [31][32][33][34].
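The bagging step described above can be sketched in a few lines. This is a deliberately simplified illustration, not the study's implementation: each "tree" is a one-split stump on a single variable, the random feature selection that a full random forest adds at each split is omitted, and all names and data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Pick the threshold t whose rule 'predict 1 when x > t' is most accurate."""
    best_t, best_acc = None, -1.0
    # min(xs) - 1 is included so that "always predict 1" is also a candidate.
    for t in [min(xs) - 1] + sorted(set(xs)):
        acc = sum((x > t) == (y == 1) for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def predict_majority(thresholds, x):
    """Aggregate the stumps by majority vote."""
    votes = sum(x > t for t in thresholds)
    return int(2 * votes > len(thresholds))


# Hypothetical toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
thresholds = bagged_stumps(xs, ys)
print(predict_majority(thresholds, 1), predict_majority(thresholds, 10))
```

Averaging over many bootstrap-trained learners is what reduces the variance of any single tree.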

Gradient Boosting Machine (GBM).
GBM is a typical boosting model and was proposed by Friedman in 2001 [35]. GBM learns by fitting new models consecutively, that is, by iteration, so that each round provides a more accurate estimate of the response variable. The purpose of each iteration is to reduce the residual left by the previous iteration. According to the Newton-Raphson method, each new model is constructed along the direction of gradient descent on the previous residual [36]. To avoid overfitting, GBM introduces a shrinkage parameter, which is related to the learning ability of the model. Therefore, GBM has excellent generalization performance [37].
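The iterative residual fitting and the role of the shrinkage parameter can be illustrated with a minimal sketch. It assumes squared-error loss (for which the negative gradient is simply the residual) and one-split regression stumps as base learners; the study's actual loss function, base learners, and tuning values are not stated, and all names and data below are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Find the split 'x <= t' minimizing squared error; return (t, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def gradient_boost(xs, ys, n_rounds=50, shrinkage=0.1):
    """Boost stumps on the residuals, scaling each update by the shrinkage."""
    f0 = sum(ys) / len(ys)  # initial model: the mean response
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient equals the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_regression_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + shrinkage * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return f0, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 2, 2, 5, 5, 6, 6]
f0, stumps, preds = gradient_boost(xs, ys)
initial_mse = mse(ys, [f0] * len(ys))
final_mse = mse(ys, preds)
print(f"MSE before boosting: {initial_mse:.3f}, after: {final_mse:.3f}")
```

A smaller shrinkage value makes each round contribute less, so more rounds are needed, but the fit is usually less prone to overfitting.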

Statistical Analysis.
All statistical analyses were performed with the open-source R software (version 3.6.2). The mean and standard deviation were used to describe continuous variables, and frequency and percentage were used to describe categorical variables. Continuous variables were analyzed by Student's t-test or the Wilcoxon rank sum test, and categorical variables were analyzed by the chi-squared test or the rank sum test. The odds ratio (OR) of each variable with the corresponding 95% confidence interval (95% CI) was calculated.
In the present study, 70% of the samples (the training set) were used to construct the prediction models, while the remaining samples (the validation set) were used to estimate the prediction performance of these models. The receiver operating characteristic (ROC) curve was adopted to evaluate the prediction performance of each model and was plotted from the prediction probabilities produced by the model. The areas under the ROC curves were compared by the Z test. A P value of less than 0.05 was considered significant. The prediction probability threshold was set at the point on the ROC curve corresponding to Youden's index (YI), and a subject whose prediction probability exceeded the threshold was classified as developing DR.
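The split and the threshold selection can be sketched as follows. This is a minimal illustration under stated assumptions: the split is a simple random one (the study does not say whether it was stratified), Youden's index is taken in its usual form J = sensitivity + specificity - 1 and maximized over candidate thresholds, and a subject is counted as positive when its probability is at or above the candidate threshold. The function names and the toy probabilities are hypothetical.

```python
import random


def split_indices(n, train_fraction=0.7, seed=0):
    """Shuffle indices 0..n-1 and split them into training and validation parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n)
    return indices[:n_train], indices[n_train:]


def youden_threshold(y_true, y_prob):
    """Return (threshold, sensitivity, specificity, J) with the largest J."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = None
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[3]:
            best = (t, sens, spec, j)
    return best


# 70/30 split of 4752 subjects, the sample size reported in the text.
train_idx, val_idx = split_indices(4752)

# Hypothetical predicted probabilities for ten validation subjects.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.20, 0.60, 0.90]
threshold, sens, spec, j = youden_threshold(y_true, y_prob)
print(f"threshold={threshold}, sensitivity={sens:.3f}, specificity={spec:.3f}, J={j:.3f}")
```

When the outcome is rare, the probability threshold chosen this way is typically far below 0.5, which is in line with the small threshold values reported for most of the models here.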

Demographic Characteristics and Univariate Analysis.
Among the 4752 subjects included in the current study, 46.6% were male and 53.3% were female. The age of the included subjects ranged from 30 to 85 years, with an average age of 50.9 years (SD = 3.04). A total of 198 patients were diagnosed with DR. After the univariate analysis, 9 variables were extracted as factors associated with DR. The results of the univariate analysis of the demographic characteristics are shown in Table 1.

The Kappa Coefficient between Two Ophthalmologists.
As shown in Table 2, the kappa coefficient was 0.857, indicating good consistency between the two ophthalmologists.
3.3. Performance of MLR. The MLR model showed that blood glucose, SBP, and BMI were independently associated with the development of DR. The results are shown in Table 3. The ROC curve obtained by logistic regression is presented in Figure 2. The adjusted area under ROC curve (AUROC) was 0.936. As shown in Table 4, the sensitivity and specificity were 0.914 and 0.898, respectively. The accuracy was 0.898 (95% CI: 0.881, 0.913), and the corresponding value of YI was 0.036.
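For reference, the general form of the equation behind such a model is sketched below. It assumes that the final model contains only the three predictors named above; the symbols $\beta_0, \dots, \beta_3$ are placeholders for the fitted coefficients, which are reported in Table 3 and are not reproduced here.

$$
\ln\frac{p}{1-p} = \beta_0 + \beta_1\,\mathrm{Glucose} + \beta_2\,\mathrm{SBP} + \beta_3\,\mathrm{BMI},
\qquad
p = \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \beta_1\,\mathrm{Glucose} + \beta_2\,\mathrm{SBP} + \beta_3\,\mathrm{BMI}\right)\right]}
$$

Here $p$ is the predicted probability of developing DR, and $e^{\beta_k}$ is the odds ratio associated with a one-unit increase in the corresponding predictor.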

3.4. Performance of C&RT. The ROC curve obtained by the C&RT model is shown in Figure 2. The AUROC was 0.682, and the accuracy was 0.978 (95% CI: 0.968, 0.985). The sensitivity and specificity were 0.371 and 0.992, respectively (Table 4). The corresponding value of YI was 0.360.
3.5. Performance of SVM. The ROC curve obtained by the SVM model is shown in Figure 2. The AUROC was 0.765, and the accuracy was 0.919 (95% CI: 0.966, 0.983). The sensitivity and specificity were 0.571 and 0.928, respectively (Table 4). The corresponding value of YI was 0.039.
3.6. Performance of RF. The ROC curve obtained by RF is shown in Figure 2. The AUROC was 0.949, and the accuracy was 0.843 (95% CI: 0.823, 0.862). The sensitivity and specificity were 0.971 and 0.840, respectively (Table 4). The corresponding value of YI was 0.035.

3.7. Performance of GBM. The ROC curve obtained by the GBM model is shown in Figure 2. The AUROC was 0.952, and the accuracy was 0.883 (95% CI: 0.866, 0.900). The sensitivity and specificity were 0.943 and 0.881, respectively (Table 4). The corresponding value of YI was 0.034.

3.8. Comparison of the AUROC of Prediction Models. As shown in Figure 2, although the AUROC of the GBM was the largest, there was no statistically significant difference between the GBM, RF, and MLR models. Moreover, no statistically significant difference was found between the C&RT and SVM models. Therefore, the AUROCs of these models ranked as follows: GBM = RF = MLR > C&RT = SVM. The P values of the pairwise comparisons are detailed in Table 5.
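To make the idea of a Z test on two AUROCs concrete, here is a rough sketch. The text does not specify which variant of the test was used, so this example swaps in the Hanley and McNeil standard-error approximation and, by default, treats the two AUROCs as uncorrelated (`r=0.0`), which is a simplification when both models are scored on the same validation set. The class counts `n_pos` and `n_neg` are hypothetical, so the resulting P values are illustrative only and are not expected to reproduce Table 5.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUROC (Hanley and McNeil approximation)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def auc_z_test(auc_a, auc_b, n_pos, n_neg, r=0.0):
    """Two-sided Z test for the difference of two AUROCs; r is their assumed correlation."""
    se_a = hanley_mcneil_se(auc_a, n_pos, n_neg)
    se_b = hanley_mcneil_se(auc_b, n_pos, n_neg)
    z = (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2 - 2 * r * se_a * se_b)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# AUROCs are those reported in the text; the class counts are HYPOTHETICAL.
n_pos, n_neg = 40, 1400
z_gbm_rf, p_gbm_rf = auc_z_test(0.952, 0.949, n_pos, n_neg)
z_gbm_crt, p_gbm_crt = auc_z_test(0.952, 0.682, n_pos, n_neg)
print(f"GBM vs RF:   z={z_gbm_rf:.2f}, p={p_gbm_rf:.3f}")
print(f"GBM vs C&RT: z={z_gbm_crt:.2f}, p={p_gbm_crt:.3g}")
```

A paired method that accounts for the correlation between curves, such as DeLong's test, would be the more rigorous choice when the same subjects are scored by every model.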

Discussion
To find the optimal DR prediction model, the current study established five prediction models in the Chinese rural population, covering traditional statistical algorithms, ensemble learning algorithms, and basic machine learning algorithms, and compared their performance. In the early stage of DR, there may be no visual symptoms. If DR patients are identified and diagnosed early, irreversible visual impairment could be prevented. Conversely, if not, the damage will be irreversible as the DR continues to advance [38,39]. In China, rural residents account for more than half (64%) of the total population, and there are significant regional differences in the allocation of health resources between urban and rural areas, with underdeveloped areas having access to fewer health resources [40,41]. Early detection, diagnosis, and treatment are vital in preventing such irreversible damage, especially in less-developed rural areas.
With the development of artificial intelligence (AI) for DR diagnosis on eye fundus photographs, an initial diagnosis of DR can be made easily. Moreover, more and more level A tertiary hospitals have set up telemedicine centers to collaborate with primary hospitals on the diagnosis of DR, which means that rural DR patients can get a final diagnosis in the primary hospitals. A DR prediction model based on the data collected by primary hospitals in rural areas allows patients at high risk of DR to be identified; these patients can then be diagnosed rapidly through AI review of fundus photographs. Subsequently, the final diagnosis could be made via telemedicine center services (Figure 3). This procedure can be used to screen the majority of rural DR patients in the primary hospitals, which can not only reduce financial and staffing costs but also improve the detection rate of DR in China. Therefore, developing the prediction model has public health significance and practical value. In the current study, the MLR model and machine learning algorithms were used to establish the prediction model based on the data of HES. The GBM, RF, and MLR models showed excellent prediction performance on DR, with AUROCs of 0.952, 0.949, and 0.936, respectively, which were significantly larger than those of SVM and C&RT. However, the accuracy of C&RT and SVM was higher than that of GBM, RF, and MLR. Theoretically, the AUROC is more robust than accuracy when evaluating and comparing classifiers, especially on samples with a skewed class distribution [42]. In this study, the proportion of DR patients was 2.5%. Therefore, the prediction performance of these models was judged primarily by the AUROC.
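A short worked example shows why accuracy can mislead when the outcome is rare. The confusion-matrix counts below are hypothetical, chosen only so that the prevalence is 2.5% as stated above; the function name `summarize` is likewise an assumption.

```python
def summarize(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical cohort: 1000 subjects, 25 of whom (2.5%) develop DR.
# Classifier A never predicts DR.
sens_a, spec_a, acc_a = summarize(tp=0, fn=25, tn=975, fp=0)
# Classifier B finds 23 of the 25 cases at the cost of 97 false alarms.
sens_b, spec_b, acc_b = summarize(tp=23, fn=2, tn=878, fp=97)

print(f"A: sensitivity={sens_a:.3f}, specificity={spec_a:.3f}, accuracy={acc_a:.3f}")
print(f"B: sensitivity={sens_b:.3f}, specificity={spec_b:.3f}, accuracy={acc_b:.3f}")
```

Classifier A has the higher accuracy even though it detects no cases at all, which mirrors the pattern reported here for C&RT (accuracy 0.978 with sensitivity 0.371).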
The good prediction performance of the GBM and RF might be due to the fact that both models use "ensemble learning" algorithms. In essence, ensemble learning trains multiple "weak learners" as its members and combines their predictions into a single output, thereby making accurate predictions. The C&RT and SVM models are "weak learners", also referred to as "base learners". The study showed that the AUROCs of both the C&RT and SVM models were below 0.8, which is consistent with previous reports that ensemble learning models generally predict better than weak learning models [43,44].
Moreover, the prediction performance of the MLR model was similar to that of the GBM and RF and better than that of the SVM and C&RT algorithms. Stylianou et al. indicated in their study that the logistic regression model has better prediction performance on the mortality risk of burn injury patients than machine learning algorithms [45], and a similar conclusion can also be found in other studies [46,47]. In fact, no substantial association between algorithm complexity and prediction performance for DR was found in this study. Furthermore, the MLR makes the modeling process visible and identifies the critical predictive factors together with their values. To sum up, the MLR model was the optimal model for the early diagnosis of DR owing to its excellent prediction performance and visible equation.
According to the fitting result of the MLR model, blood glucose was independently associated with DR in this study. The correlation coefficient between blood glucose and the probability of DR in this study was over 0.9. It has been verified in many studies that poor glycemic control has a high positive correlation with DR occurrence among diabetes patients. Therefore, it is critical for diabetic patients to continuously control their blood glucose concentrations so as to avoid the occurrence of DR. Additionally, SBP was correlated with the development of DR. Many studies have confirmed the association between blood pressure (BP) and DR [48,49]. One evidence-supported mechanism is that higher BP induces increased expression of vascular endothelial growth factor (VEGF), thereby leading to the development of DR [50,51]. The other theory is that higher BP leads to hemodynamic alterations, including microvascular damage, abnormal lipid metabolism, and hemorheological changes in the retinal microvasculature, thereby aggravating retinal microangiopathy [52]. Therefore, rational control of hypertension is important to slow down microangiopathy in diabetic patients.
In addition, BMI exhibited a positive correlation with DR occurrence. The relationship between BMI and DR has been similarly confirmed by many epidemiologic studies [53][54][55]. The mechanisms proposed for this association include metabolic syndrome [56] and increased oxidative stress [57]. Nonetheless, there are also contradictory findings. Klein et al. proposed in their study that a higher BMI confers a protective effect on DR in Asian type 2 diabetic patients, while a higher waist-to-hip ratio (WHR) is associated with DR in women. The Wisconsin Epidemiologic Study of Diabetic Retinopathy found that the association between obesity and DR is limited to individuals with older-onset insulin-independent diabetes. Other studies found that decreased BMI is associated with a higher prevalence of DR in white populations [58,59]. The differences in the findings of the above studies may be attributable not only to genetic differences, such as racial and ethnic differences, but also to demographic differences.
This study also has some limitations. Although the MLR model showed good prediction performance, there were only three significant predictive factors. Moreover, there might be an inevitable bias caused by patients' loss to follow-up. In addition, since the correlation coefficient between blood glucose and DR in this study was over 0.9, this may have affected the other potential factors included in the MLR model, such as total triglycerides, high-sensitivity C-reactive protein, and urinary albumin. Furthermore, this study lacks a test set, which should be a similar, independent population, for evaluating the generalization capability of the final model, and this limits the extrapolation of the DR prediction model.
In conclusion, this study used the data of HES, which focuses on the rural adult population in China. Given the limited ophthalmic resources in rural areas, the prediction model has the potential to serve as a complementary practical screening tool for the early detection of DR, especially for diabetic patients in remote areas. Moreover, the MLR model was the optimal model for the early diagnosis of DR owing to its excellent prediction performance and visible equation.

Data Availability
The data set analyzed in the current study is not publicly available as it contains private patient data from HES. The information excluding patient identification and demography is available upon request for research purposes.

Additional Points
Highlights.
(i) The current study is the first to identify the optimal model for the prediction of diabetic retinopathy in the Chinese rural population by comparing the prediction performance of traditional statistical algorithms and machine learning algorithms.
(ii) Our findings are based on the Handan Eye Study, which has a longitudinal study design with a large sample size. Therefore, the study has the potential to serve as a practical screening tool for the early detection of DR, especially in the rural areas of China.
(iii) This study updates the evidence in the field, which provides a scientific basis for future research directed toward DR prediction, especially through machine learning algorithms.

Conflicts of Interest
There is no conflict of interest.