Explainable Machine Learning-Based Prediction Model for Diabetic Nephropathy

The aim of this study is to analyze the effect of serum metabolites on diabetic nephropathy (DN) and to predict the prevalence of DN with a machine learning approach. The dataset consists of 548 patients seen from April 2018 to April 2019 at the Second Affiliated Hospital of Dalian Medical University (SAHDMU). We select the optimal 38 features using a least absolute shrinkage and selection operator (LASSO) regression model with 10-fold cross-validation. We compare four machine learning algorithms, namely extreme gradient boosting (XGB), random forest, decision tree, and logistic regression, using ROC curves with AUC values, decision curves, and calibration curves. We quantify feature importance and interaction effects in the optimal predictive model with the Shapley additive explanation (SHAP) method. The XGB model performs best in screening for DN, with the highest AUC value of 0.966. It also yields a greater clinical net benefit than the other models and is better calibrated. In addition, there are significant interactions between serum metabolites and duration of diabetes. We develop a predictive model based on the XGB algorithm to screen for DN. C2, C5DC, Tyr, Ser, Met, C24, C4DC, and Cys contribute strongly to the model and may be biomarkers for DN.


Introduction
Diabetes mellitus is an extremely common chronic disease. By 2045, the prevalence of diabetes is projected to rise to 10.9% [1]. Of greater concern is that the Western Pacific will have the highest number of adult diabetics in the world [2]. In China, about 20-40% of diabetic patients have renal complications, and diabetic nephropathy (DN) has become the leading cause of end-stage chronic kidney disease [3]. Meanwhile, the all-cause mortality rate in patients with DN is nearly 20-40 times higher than that in nondiabetic nephropathy [4]. New screening and treatment methods therefore have important implications for the prevention of diabetic nephropathy in the country.
In recent years, there has been a growing interest in metabolomic measurements to identify pathophysiological mechanisms and new diagnostic and prognostic biomarkers associated with disease development [5]. Among the various serum metabolites that have been extensively studied, amino acids and acylcarnitine have received much attention. Amino acids are involved in different physiological roles of the body, such as cell signaling, gene expression, nutrient metabolism, and endocrine hormone production [6]. There is evidence that dysregulation of acylcarnitine homeostasis plays a role in the development and progression of various diseases, such as insulin resistance and metabolic syndrome [7,8].
Traditional clinical indicators and serum metabolites together form a high-dimensional dataset with a large number of features, containing both correlated and uncorrelated data, so traditional statistical methods are not sufficient to analyze such data [9]. In recent years, machine learning methods, such as least absolute shrinkage and selection operator (LASSO) regression, support vector machine (SVM), decision tree (DT), random forest (RF), and artificial neural networks (NNs), have been widely used in healthcare [10], in areas such as cancer, medicinal chemistry, and medical imaging [11]. Investigations have shown that machine learning can help improve the reliability, performance, predictability, and accuracy of diagnostic systems and can be used to examine important clinical parameters, biological indicators, and serum metabolites [12,13].
The purpose of this paper is to develop and test a prediction model for DN using machine learning methods and the dataset of Dalian Second People's Hospital, and to explain the prediction model in order to quantify the influence of serum metabolites on DN. The overall statistical analysis process of this paper is shown in Figure 1. The workflow comprises preprocessing (the elimination of missing values and feature selection), the optimization of hyperparameters using grid search, and the evaluation and analysis of the classifiers. In addition, 10-fold cross-validation is used to reduce the effect of any particular division into training and test sets.

Statistical Analysis
2.2.1. Data Preprocessing. The dataset used in this paper is a balanced dataset. In the prediction model, the occurrence of DN is defined as a binary variable: illness is denoted as 1 and absence of illness as 0. Features with more than 50% missing values were excluded, and samples that still had missing values were then removed from the analysis (see Figure 2). In addition, the features are divided into continuous and categorical variables for data preprocessing. Continuous features are normalized. Categorical features whose values have no inherent order are expanded into Euclidean space using one-hot encoding.
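The preprocessing steps described in Section 2.2.1 can be sketched in plain Python as follows. This is a minimal illustration and not the authors' code: the helper names and the toy records are hypothetical, and min-max scaling is an assumption, since the paper states only that continuous features are normalized.

```python
# Illustrative sketch only; helper names and toy records are hypothetical.

def drop_sparse_features(rows, max_missing=0.5):
    """Remove features whose share of missing (None) values exceeds max_missing."""
    n = len(rows)
    keep = [k for k in rows[0]
            if sum(r[k] is None for r in rows) / n <= max_missing]
    return [{k: r[k] for k in keep} for r in rows]


def drop_incomplete_rows(rows):
    """Remove samples that still contain a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def min_max_normalize(rows, feature):
    """Rescale a continuous feature to [0, 1] (assumed normalization scheme)."""
    values = [r[feature] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant features
    for r in rows:
        r[feature] = (r[feature] - lo) / span


def one_hot(rows, feature):
    """Replace an unordered categorical feature with 0/1 indicator columns."""
    levels = sorted({r[feature] for r in rows})
    for r in rows:
        value = r.pop(feature)
        for level in levels:
            r[f"{feature}={level}"] = int(value == level)


# Toy records: "rare_lab" is missing in 3 of 4 samples, so it is dropped first.
records = [
    {"hba1c": 7.0, "sex": "F", "rare_lab": None, "dn": 1},
    {"hba1c": 9.0, "sex": "M", "rare_lab": None, "dn": 0},
    {"hba1c": None, "sex": "F", "rare_lab": 1.2, "dn": 1},
    {"hba1c": 8.0, "sex": "M", "rare_lab": None, "dn": 0},
]

clean = drop_incomplete_rows(drop_sparse_features(records))
min_max_normalize(clean, "hba1c")
one_hot(clean, "sex")
print(clean)
```

The order matters: sparse features are removed before incomplete samples, so a mostly empty column does not cause nearly every sample to be discarded.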
2.2.2. Feature Selection. Feature selection was performed using least absolute shrinkage and selection operator (LASSO) regression. By adjusting the hyperparameter λ, the LASSO regression model shrinks the regression coefficients toward zero, setting some of them exactly to zero, and thereby selects the feature set that performs best in DN prediction. The best λ value was chosen as the one with the minimum mean error under 10-fold cross-validation.
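How the penalty λ drives coefficients to exactly zero can be illustrated with a small coordinate-descent sketch. It is not the authors' implementation (the workflow figure states that feature filtering was done in R); it assumes a squared-error loss, centered data with no intercept, and a toy dataset invented for the example.

```python
# Illustrative sketch only: LASSO by coordinate descent on the objective
#   (1 / 2n) * sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j|
# Assumes centered features and response (no intercept term).

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: the fit without feature j's contribution.
            partial = [
                y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[i][j] * partial[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta


# Toy data: y depends only on the first feature; the second is irrelevant.
X = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
y = [-2.0, 2.0, -2.0, 2.0]

beta_small = lasso_coordinate_descent(X, y, lam=0.5)  # -> [1.5, 0.0]
beta_large = lasso_coordinate_descent(X, y, lam=2.5)  # -> [0.0, 0.0]
selected = [j for j, b in enumerate(beta_small) if b != 0.0]
print(beta_small, beta_large, selected)
```

With the smaller λ only the informative feature keeps a nonzero (shrunken) coefficient, and with the larger λ every coefficient is removed. In practice λ would be chosen by cross-validation, as described above, rather than fixed by hand.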

Model Training and Validation.
In this paper, 10-fold cross-validation is used to divide the data into training and testing sets; i.e., in each cycle, 9 subsets are used as the training set and 1 subset is used as the testing set. The models are optimized using grid search. Four classification algorithms, namely extreme gradient boosting (XGB), random forest (RF), decision tree (DT), and logistic regression, were used to build DN prediction models for predicting the risk of diabetic nephropathy in individuals, with 10-fold cross-validation as the model evaluation strategy.
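The 9-to-1 rotation of subsets can be sketched as a small index generator. The function name and the fixed shuffle seed are hypothetical, and the sketch does not stratify by outcome because the paper does not say whether the folds were stratified. The sample count of 562 is the dataset size reported in the preprocessing results.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(562, k=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each sample appears in exactly one testing fold, so every sample is used for evaluation once and for training nine times.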
The above models are evaluated on their generalization ability and practicality. Generalization ability is examined using the receiver operating characteristic (ROC) curve and the area under the curve (AUC), and clinical utility is examined using the decision curve and the calibration curve.
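As a reminder of what the AUC measures, the following sketch computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the four toy scores are hypothetical; a library routine would normally be used instead of this quadratic-time loop.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversal counts 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # -> 0.75
```

Three of the four positive-negative pairs are ordered correctly in the toy data, which gives 0.75.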

Preprocessing Results.
Through the above missing value processing (see Section 2.2.1), the final size of the dataset was obtained as 562 × 119 (number of samples × number of features), which is a sufficient sample size to meet the statistical requirements and ensure the reliability of the study results [17,18].
The clinical characteristics of the participants, stratified by DN status, are shown in Table 1. The presence or absence of DN is significantly associated with HDL, Apo AI, C4DC, C5DC, HbA1c, and hypertension (p < 0.05). Compared with patients with nondiabetic renal disease (NDRD), patients with DN tend to be without hypertension and with hyperglycemia, and to have higher levels of HDL, Apo AI, and C5DC and lower levels of C4DC.
Journal of Diabetes Research

Hyperparameter Optimization Results.
In this study, hyperparameters are tuned with GridSearchCV in sklearn. For each of the four machine learning models, every combination in the hyperparameter grid is instantiated and evaluated by 10-fold cross-validation, and the parameter combination with the highest average score is returned, using "roc_auc" as the scoring criterion, as shown in Table 2.
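A minimal sketch of this tuning procedure is given below for one of the four model types. `GridSearchCV`, the `"roc_auc"` scorer, and `cv=10` are taken from the text; the synthetic data and the parameter grid are assumptions made for illustration and do not reproduce the study's data or search ranges.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the study's patient data is not reproduced here.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Hypothetical grid, for illustration only.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # mean AUC over the folds decides the winner
    cv=10,              # 10-fold cross-validation for every combination
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same call can be repeated with the other estimators and their own grids, and the best mean AUC per model then compared.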

Classifier Results.
Based on the preprocessed Dalian dataset, four classifiers (XGB, RF, DT, and logistic regression) were used to classify diabetic nephropathy. The XGB model (accuracy = 0.875, recall = 0.875) performed significantly better than the RF, logistic regression, and DT models. The AUC value of the DT model was greater than 0.8, but its false-positive rate was higher than that of the other three models, so it is not recommended (as shown in Figure 4).
The decision curve represents the clinical utility of a model: if, at a given threshold probability, the net benefit of the model is higher than that of both extreme strategies (no intervention for anyone and intervention for everyone), the model has practical value. As shown in Figure 5, all models were valid between the thresholds of 28% and 81%, and between the thresholds of 11% and 86%, the net benefit of the XGB model exceeded that of the other three models.
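The comparison underlying a decision curve can be sketched with the standard net benefit formula, TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The function names and the eight toy predictions below are hypothetical; the paper does not state which implementation produced its curves.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on everyone with predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of intervening on everyone, regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


NET_BENEFIT_TREAT_NONE = 0.0  # no intervention: no true or false positives

# Toy labels and predicted risks, invented for the example.
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]
toy_p = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.2, 0.1]

nb_model = net_benefit(toy_y, toy_p, threshold=0.5)
nb_all = net_benefit_treat_all(toy_y, threshold=0.5)
print(nb_model, nb_all, NET_BENEFIT_TREAT_NONE)
```

At a threshold of 0.5 the toy model has a net benefit of 0.25, above both reference strategies (0.0 each), which is the condition the paragraph above calls practical value. A full decision curve repeats this calculation over a range of thresholds.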
A new sample dataset was obtained by the bootstrap method in Python 3.10, sampling 10,000 times independently, to plot the calibration curve of the XGB model. As shown in Figure 6, after the XGB model was calibrated, the curve approached the diagonal line, indicating that the screening results are close to the real situation and have practical value.
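One way to read this procedure is sketched below: resample the cases with replacement, and for each resample compare the mean predicted probability with the observed event rate in each probability bin. The binning scheme, function names, and toy data are assumptions, and the sketch uses 1,000 resamples for brevity where the paper mentions 10,000; the authors' exact procedure is not specified.

```python
import random


def calibration_bins(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed event rate) per non-empty bin."""
    stats = [[0.0, 0, 0] for _ in range(n_bins)]  # prob sum, events, count
    for t, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)
        stats[b][0] += p
        stats[b][1] += t
        stats[b][2] += 1
    return [(s / c, e / c) for s, e, c in stats if c > 0]


def bootstrap_calibration(y_true, y_prob, n_boot=1000, n_bins=5, seed=0):
    """Calibration points for n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    curves = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        curves.append(calibration_bins([y_true[i] for i in idx],
                                       [y_prob[i] for i in idx], n_bins))
    return curves


# Toy data: 3 of 10 events at predicted risk 0.3, 7 of 10 at predicted risk 0.7.
toy_prob = [0.3] * 10 + [0.7] * 10
toy_true = [1] * 3 + [0] * 7 + [1] * 7 + [0] * 3

apparent = calibration_bins(toy_true, toy_prob)
curves = bootstrap_calibration(toy_true, toy_prob)
print(apparent)
```

Points lying on the diagonal (predicted probability equal to observed rate) indicate good calibration; the spread of the bootstrap curves shows how stable that agreement is.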

Model Interpretation.
The effect of features on screening scores is measured by SHAP, which evaluates the importance of each feature using a game-theoretic approach based on the test set [19]. A positive Shapley value for a feature indicates an increased risk of DN; conversely, a negative value indicates a decreased risk. As shown in Figure 7, MAU, diabetes duration, PVC, FPG, and eGFR contributed more to the model; in the metabolite group, C2, C5DC, Tyr, Ser, and Met contributed more to the model.
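The game-theoretic idea behind Shapley values can be shown with a brute-force sketch that averages each feature's marginal contribution over all subsets of the other features. This is not how the SHAP library computes values for tree models, and the value function below is a hypothetical toy whose numbers are invented for illustration and are not results from this study.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of f when added to coalition s.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_risk(present):
    """Hypothetical score: two additive terms plus one interaction term."""
    score = 0.0
    if "MAU" in present:
        score += 0.3
    if "duration" in present:
        score += 0.1
    if "duration" in present and "Tyr" in present:
        score += 0.2  # interaction effect, shared between the two features
    return score


phi = shapley_values(toy_risk, ["MAU", "duration", "Tyr"])
print(phi)
```

The interaction term is split equally between the two interacting features, and the values sum to the difference between the full score and the empty score, which is why a per-sample plot of such values can be read as a decomposition of the prediction.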
When the duration of diabetes is greater than or equal to 15, the threshold value of Tyr that best describes the difference in outcomes is 45, at which point the higher the Tyr value, the lower the risk of DN (as shown in Figure 8(c)). In addition, among patients with longer diabetes duration, those with lower C5DC values had a lower risk of disease than those with higher C5DC values, whereas those with lower Tyr values had a higher risk of disease than those with higher Tyr values; the same reasoning applies to C24 as to C5DC (as shown in Figures 8(a) and 8(b)).
When most features are normal and the patient is a teenager with new-onset diabetes, the risk of developing DN is low (Figure 9(a)). When the duration of T2D is shorter but most features (PCV, ALP, UA, FT3, and HDL) are abnormal, the risk of DN increases (Figure 9(b)).

Discussion
This study focuses on the metabolites, among which C2, C5DC, Tyr, Ser, Met, C24, C4DC, and Cys have a strong effect on DN and may serve as new biomarkers for DN. Aromatic amino acids are a group of α-amino acids that contain an aromatic ring, including phenylalanine, tyrosine, and tryptophan. Phenylalanine is oxidized to tyrosine by phenylalanine hydroxylase and is then involved in glucose metabolism [20]. In a prospective study, lower plasma tyrosine levels in diabetic patients were associated with an increased risk of microvascular disease [21]. A previous study confirmed the association between low tyrosine concentrations and diabetic nephropathy [22].
Methionine is an essential sulfur-containing amino acid that is required for normal growth and development of the body and is also associated with %FM. It is a precursor of succinyl CoA, homocysteine, creatine, and carnitine, which the organism generally obtains from food or gastrointestinal microorganisms. Methionine plays a crucial role in the immune system because its catabolism leads to increased production of glutathione, taurine, and other serum metabolites [23]. Methionine and other methyl donors improve glucose tolerance and insulin sensitivity in the offspring of high-fat diet mice [24]. Experiments in T2D rats have demonstrated that methionine ameliorates alterations in key one-carbon serum metabolites and T2D-induced disturbances in glucose and lipid metabolism [25]. There is also growing evidence that methionine activates AMPK and SIRT1 by a mechanism similar to that of metformin [26]. Given that diabetic nephropathy is one of the microvascular complications of type 2 diabetes, it is reasonable to speculate that methionine disorders are negatively associated with type 2 diabetes complicated by diabetic nephropathy.
Diabetes mellitus, as a disease of metabolic dysfunction, damages several organs and systems, including the liver, kidneys, and peripheral nerves. Although essential amino acids are important for maintaining normal physiological activities of the body, abnormal metabolism of nonessential amino acids is also associated with the pathogenesis of diabetes [27,28]. Levels of serine, a nonessential amino acid, have been found to be consistently reduced in patients with metabolic syndrome [29]. In a prospective study, elevated serum glycine levels were found to be associated with a reduced risk of developing type 2 diabetes [30]. Because glycine is a precursor of serine [31], there is even more reason to speculate about the importance of serine in the microvascular complications of type 2 diabetes. Numerous studies have found that homocysteine, a precursor of cysteine, is considered a biomarker for microvascular diseases including diabetic neuropathy, retinopathy, and nephropathy-like diseases [32]. Epidemiological studies have shown a U-shaped relationship between cardiovascular disease and cysteine after adjusting for other risk factors and homocysteine [33]. In this study, screening metabolic indicators associated with diabetic nephropathy by the LASSO model revealed a positive association between cysteine and diabetic nephropathy. No risk trend was observed in the first half of the U-shaped curve, possibly because this study was conducted in type 2 diabetic patients, who have much higher levels of oxidative stress and reactive oxygen species than normal subjects.
Acylcarnitine is known to play a key role in the transport of long-chain fatty acids across the inner mitochondrial membrane for β-oxidation. Comparing cases of obesity, insulin resistance, metabolic syndrome, and diabetes with relevant controls revealed that acylcarnitine profiles differed between groups. A 6-year prospective study of 2103 community-dwelling individuals aged 50-70 years in Beijing and Shanghai, China, with type 2 diabetes as the observed outcome found higher plasma concentrations of short-, medium-, and long-chain acylcarnitines at baseline, but only long-chain acylcarnitines were significantly associated with the risk of type 2 diabetes [34]. A previous study found that elevated levels of short- and medium-chain acylcarnitines in blood were associated with the risk of developing cardiovascular disease in T2DM [35]. A study on diabetic peripheral neuropathy (DPN) claimed that C4DC and C24 concentrations in non-DPN plasma were significantly higher than in DPN patients and that factors containing C2, C3, C4, and C5 short-chain acylcarnitines were positively associated with the risk of DPN in T2DM [36]. C2 is derived from carbohydrate catabolism and acetyl-CoA, the end product of β-oxidation [37]. It was also found that C2 may be a biomarker of combined sugar and lipid toxicity. Animal experiments also showed that plasma C2 levels were elevated in T2DM rats [38].
Proteinuria and eGFR loss are both nonspecific markers of DN and have limitations as prognostic tools [39]. This is because a high percentage of T2DM patients in renal biopsy studies do not have DN and suffer from other renal diseases [40]. Therefore, it is important to identify new prognostic markers for DN based on serum metabolites, as this paper does. However, owing to the limitations of the data, this paper is restricted to the binary classification problem; a multiclass model for DN grade can be investigated in the future.

Conclusion
This paper introduces serum metabolites as new DN markers, constructs several machine learning models to screen for DN, compares their screening abilities, and analyzes the impact of each important feature on DN. The results show that the XGB model has the best screening performance, and its predictive performance is better than that reported in previous studies [37,41,42] (0.93, 0.79, and 0.90). LASSO plays a key role in ensuring the accuracy and stability of the screening model by improving the quality of the dataset. C2, C5DC, Tyr, Ser, Met, C24, C4DC, and Cys are shown to be highly correlated with DN risk.

Figure 1: DN statistical analysis workflow diagram. The diagram covers the preprocessing steps, the four machine learning classifiers, the optimization of classifier hyperparameters by grid search, and the model evaluation methods. Feature filtering was performed using R V4.2.2. Data preprocessing and the modeling, evaluation, and interpretation of the machine learning models were performed with Python V3.10.

Figure 3: (a) LASSO coefficient profiles of 119 features; (b) the value of λ with the smallest mean error is selected by 10-fold cross-validation. (a) Each line represents a feature, and each estimated coefficient shrinks as λ increases until it reaches 0. (b) The relationship between the mean square error and log λ is plotted. Vertical dashed lines are plotted at the best values using the minimum criterion and the 1SE principle. Based on 10-fold cross-validation, a λ value of 0.017 was selected and the optimal number of features was 38.

Figure 4: ROC curves of the four models based on the test set data. XGB: extreme gradient boosting; RF: random forest; DT: decision tree. The red line indicates the XGB model based on the LASSO-selected features (AUC = 0.966), the orange line indicates the RF model based on the LASSO-selected features (AUC = 0.937), the yellow line indicates the logistic model based on the LASSO-selected features (AUC = 0.845), and the black line indicates the DT model based on the LASSO-selected features (AUC = 0.812).

Figure 7: SHAP summary plot for the XGB model based on the LASSO-selected features. XGB: extreme gradient boosting; RF: random forest; DT: decision tree; SHAP: Shapley additive explanations. Each point on the summary plot is the Shapley value of a feature for one instance. The position on the y-axis is determined by the feature, and the position on the x-axis is determined by the Shapley value. Colors indicate feature values from low to high. The features are arranged according to their importance.

Figure 8: SHAP plots showing the nonlinear interaction between diabetes duration and serum metabolites, including C5DC (a), C24 (b), and Tyr (c). The x-axis indicates the value of the feature, and the y-axis indicates the Shapley value of the feature. Red indicates a larger value of the right-hand feature, and blue indicates a smaller value of the right-hand feature. The Shapley value indicates the effect of the feature on the model.

Figure 9: Three examples of the local explanation of the predictions using the Shapley additive explanation (SHAP) values: (a) subjects aged 28 years and (b) subjects aged 43 years. Factors that push the predicted score higher than the base value (mean prediction) are coloured red, and those that push the prediction lower are shown in blue.

Table 1: Baseline characteristics of the study population.

Table 2: Highest AUC scores achieved by hyperparameter tuning of four machine learning models.