Tree-Based Risk Factor Identification and Stroke Level Prediction in Stroke Cohort Study

Objective. This study focuses on the identification of risk factors, classification of stroke level, and evaluation of the importance and interactions of various patient characteristics using cohort data from the Second Hospital of Lanzhou University. Methodology. Risk factors are identified by evaluation of the relationships between factors and response, as well as by ranking the importance of characteristics. Then, after discarding negligible factors, some well-known multicategorical classification algorithms are used to predict the level of stroke. In addition, using the Shapley additive explanation method (SHAP), factors with positive and negative effects are identified, and some important interactions for classifying the level of stroke are proposed. A waterfall plot for a specific patient is presented and used to determine the risk degree of that patient. Results and Conclusion. The results show that (1) the most important risk factors for stroke are hypertension, history of transient ischemia, and history of stroke; age and gender have a negligible impact. (2) The XGBoost model shows the best performance in predicting stroke risk; it also gives a ranking of risk factors based on their impact. (3) A combination of SHAP and XGBoost can be used to identify positive and negative factors and their interactions in stroke prediction, thereby providing helpful guidance for diagnosis.


Introduction
Stroke is an acute cerebrovascular disease mainly caused by sudden rupture or blockage of cerebral blood vessels (termed hemorrhagic and ischemic stroke, respectively), leading to brain tissue damage. Stroke has high morbidity, mortality, and disability rates. Ischemic stroke accounts for 60-70% of total stroke incidence; however, hemorrhagic stroke has a higher mortality rate.
Extensive research has focused on determining the premonitory signs of stroke. The Framingham study [1] reported a series of risk factors for stroke, including age, systolic blood pressure, antihypertensive therapy, diabetes, smoking, previous cardiovascular disease, atrial fibrillation, and left ventricular hypertrophy on electrocardiogram. More recently, other studies have identified additional risk factors, including creatinine level and time taken to walk 15 feet [2, 3]. Medical data sets tend to contain large numbers of features, so manually identifying and verifying risk factors from the available data is time-consuming. Machine learning methods, by contrast, can effectively identify features that are strongly related to the incidence of stroke from large feature sets [4]. Machine learning can therefore be used to improve the accuracy of stroke risk prediction and to discover new risk factors.
Prediction models for stroke have also been extensively studied. [2] developed a 5-year stroke prediction model based on a cardiovascular health research data set. Machine learning algorithms have also been widely explored in this field, for instance, to predict outcomes of patients with ischemic stroke after intra-arterial therapy using clinical variables [5] and outcomes of patients with brain arteriovenous malformations after endovascular treatment [6]. Among other methods, logistic regression and random forest have shown good performance in predicting the daily activities of discharged patients [7]. Deep learning algorithms that combine computed tomography and magnetic resonance imaging features with clinical variables have been developed to predict hemorrhagic transformation after intravascular therapy [8], visual field defect improvement [9], and speech and motor outcomes [10, 11].
The interpretation of the results of machine/deep learning models is of crucial importance in medical applications. In the past few years, machine learning has been used to improve cancer diagnosis, detection, prediction, and prognosis; however, such studies usually treat machine learning as a "black box" [12], which limits the confidence of patients and clinicians in model predictions. [13] proposed Shapley additive explanation (SHAP), based on game theory, to elucidate machine learning predictions. They introduced several versions of SHAP (e.g., DeepSHAP, KernelSHAP, LinearSHAP, and TreeSHAP) for specific categories of machine learning model. In this study, we interpret the machine learning model with TreeSHAP [14-16] to judge the impact of a single feature on different stroke levels and on the outcomes of individual cases, and to explain the predictions of the machine learning method. Numerous machine-learning-based models have been applied to categorical data and have shown great promise. However, because the response variable in records of stroke level is ordered, a traditional classification model must be adapted to ordinal variables. The most common models are so-called cumulative logit or probit models, which can be specified as logit or probit models for the probabilities of exceeding each of the ordered categories (except the last) [17]. Alternatively, some researchers have integrated modeling results by treating ordered variables as continuous or as "special" variables in an attempt to provide guidance to researchers [18, 19]. Numerous methods have been proposed to improve stroke prediction; however, most relevant studies have focused on the probability of death, dementia, or institutionalization over a fixed number of years. For instance, [20] weighted the modified Rankin scale (mRS) in ordinal analyses for stroke and other neurological disorders, as state transitions differ in clinical prognosis, and [21] assessed the distribution of mRS scores across different strata in acute ischemic stroke (AIS) according to usual eligibility criteria.
This study focuses on the application of machine learning methods to survey data in which stroke level is recorded as an ordinal variable from 0 to 4. The main contribution of this study is to extend traditional binary/multiclass classification to cumulative binary classifiers of Y ≥ k vs. Y < k (for all possible k) in order to construct a multiclassifier for ordinal responses. We focus on the identification of the main risk factors for stroke and the prediction of stroke level based on these risk factors. We also consider the effects of risk factors in individual patients, including interaction effects. Risk factors are identified from the cohort data primarily through correlation and a mutual information measure; stroke level is then predicted using well-known multicategorical classification models. A SHAP-based interpretation provides a detailed explanation of each factor in an individual diagnosis.
The remainder of the paper is organized as follows. Section 2 describes the exploration of the stroke data and risk factor identification based on correlation and the mutual information criterion. Section 3 presents the prediction of stroke level using multicategorical classifiers. The model's interpretation with respect to feature importance, positive and negative effects, and interactions, as well as individual prediction and treatment, is presented in Section 4. Section 5 gives our conclusions and some discussion. Details of the data are provided in Table 1. Note that for categorical features with two options, 0-1 encoding was adopted, and the level of stroke (Y) was represented as an ordinal variable: 0 (TIA), 1 (low risk), 2 (medium risk), 3 (high risk), or 4 (stroke).

Exploration of the Stroke Data
Table 1 also shows the results of testing the differences among the five groups using analysis of variance. P values less than 0.01 were observed for all characteristics, indicating that all factors differed significantly across stroke levels.
The monotonic and nonlinear dependence relationships between the various factors (Xi) and stroke level (Y) were studied using Spearman correlation and normalized mutual information (NMI). The results of these analyses for the data from 2016 to 2018 are shown in Table 2. Age (X1) and gender (X2) had small NMI and Spearman correlation values, indicating that these factors can be discarded because of their weak relationships with stroke level. The most important factors associated with stroke level were hypertension, diabetes, family history of stroke, history of transient ischemia, and lack of exercise.
Hereafter in this paper, the factors age and gender are therefore excluded from the prediction and interpretation procedures.
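As an illustration of this screening step, the following sketch computes Spearman correlations and NMI between candidate factors and an ordinal stroke level on synthetic data; the variable names, effect sizes, and thresholds are hypothetical, not the cohort values.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
n = 2000
# Hypothetical binary risk factors (0/1 encoded, as in Table 1).
hypertension = rng.integers(0, 2, n)
diabetes = rng.integers(0, 2, n)
weak_factor = rng.integers(0, 2, n)  # stands in for a weakly related factor

# Simulated ordinal stroke level (0-4) driven by the strong factors only.
score = 2 * hypertension + diabetes + rng.normal(0, 0.5, n)
y = np.clip(np.round(score), 0, 4).astype(int)

for name, x in [("hypertension", hypertension),
                ("diabetes", diabetes),
                ("weak_factor", weak_factor)]:
    rho, _ = spearmanr(x, y)
    nmi = normalized_mutual_info_score(y, x)
    print(f"{name}: Spearman={rho:.2f}, NMI={nmi:.2f}")
```

Factors whose Spearman correlation and NMI are both near zero (here, `weak_factor`) would be discarded, mirroring the treatment of age and gender above.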

Prediction of Stroke Level Based on Multicategorical Classifiers
Risk factors for stroke were primarily identified based on machine learning; then, stroke level was predicted using classifiers.
3.1.1. Multiple Logistic Regression. Multiple logistic regression is an extension of the binomial logistic regression model to multiple classes and is used to predict the probabilities of the possible outcomes of a categorically distributed dependent variable. Specifically, a probability model computes the probability of each outcome of the dependent variable from a linear combination of the independent variables and the corresponding parameters.
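A minimal sketch of such a multinomial model, using scikit-learn's LogisticRegression on synthetic data (the features and the three classes here are hypothetical, not the cohort variables):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
# Three synthetic classes derived from linear combinations of features.
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 0] > 1).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:5])
# Each row is a probability distribution over the three classes.
print(proba.round(3))
```

Each row of `proba` sums to one, giving the per-class probabilities that the multiclass prediction is based on.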

3.1.2. Multiple Classification Support Vector Machine. The multiple classification support vector machine (MCSVM) is mainly used to construct multiclassifiers by combining many binary classifiers. The one-versus-one and one-versus-rest methods are commonly used. In this study, the small-against-large (Y ≤ k vs. Y > k) method is used to predict levels of stroke.
3.1.3. XGBoost. XGBoost, or "extreme gradient boosting," is a boosting ensemble algorithm that improves on the gradient boosting decision tree (GBDT) algorithm. XGBoost adds regularization to the objective function. When the base learner is a CART, the regularization term depends on the number of leaf nodes of the tree and the values (weights) of the leaves.
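For reference, the regularized objective commonly cited for XGBoost can be written as follows, where T is the number of leaves of a tree f, w_j are the leaf weights, l is the loss, and γ, λ are the regularization parameters:

```latex
\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}
```

The γT term penalizes the number of leaf nodes and the λ term penalizes large leaf values, matching the description above.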
3.1.4. Light Gradient Boosting Machine. The light gradient boosting machine (LightGBM) is a boosting ensemble algorithm and an efficient implementation of GBDT. It first uses a histogram algorithm to transform sample traversal into histogram traversal, reducing time complexity. A gradient-based one-side sampling algorithm then filters out samples with small gradients during training to reduce computation time. Moreover, a leaf-wise growth strategy is used to construct trees, avoiding unnecessary overhead.
Concerning the ordinal response, all the classification algorithms were modified to handle ordinal variables. Specifically, the ordinal responses were partitioned into two categories (Y ≤ k vs. Y > k for each possible k); all classifiers were then applied to these binary categories.
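The cumulative binary construction described above can be sketched as follows. This is a generic illustration on synthetic ordinal data; the base classifier, thresholds, and data are placeholders, not the paper's fitted models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
latent = X @ np.array([1.0, 0.8, 0.5, 0.0, 0.0]) + rng.normal(0, 0.5, 500)
y = np.digitize(latent, bins=[-1.5, -0.5, 0.5, 1.5])  # ordinal levels 0..4

levels = np.arange(1, 5)  # thresholds k = 1..4
# One binary classifier per threshold, estimating P(Y >= k | x).
models = {k: LogisticRegression(max_iter=1000).fit(X, (y >= k).astype(int))
          for k in levels}

def predict_level(Xnew):
    # Stack P(Y >= k) for each k; class probabilities follow by differencing.
    p_ge = np.column_stack([models[k].predict_proba(Xnew)[:, 1] for k in levels])
    p_ge = np.minimum.accumulate(p_ge, axis=1)  # enforce monotonicity in k
    p = np.column_stack([1 - p_ge[:, 0],              # P(Y = 0)
                         p_ge[:, :-1] - p_ge[:, 1:],  # P(Y = 1..3)
                         p_ge[:, -1]])                # P(Y = 4)
    return p.argmax(axis=1), p

pred, proba = predict_level(X[:10])
print(pred)
```

The differencing step recovers a proper distribution over the five ordinal levels from the cumulative probabilities, so any binary classifier can be plugged into this scheme.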

Performance of the Multicategorical Classifiers.
The data were pooled and divided into five mutually exclusive sets, and classification performance was evaluated by fivefold cross-validation with stratified sampling, with respect to area under the curve (AUC), accuracy, F1, recall, and precision.
The results of the evaluation of model performance are shown in Table 3. All four models achieved acceptable classification results, with AUC > 0.98, whereas LightGBM and XGBoost showed better accuracy (above 0.9) than the others. The evaluation indicators of XGBoost were almost uniformly the best.
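A sketch of this evaluation protocol, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost/LightGBM on synthetic binary data; the metric values produced here are illustrative only, not the cohort results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 400) > 0).astype(int)

# Stratified fivefold cross-validation over the five evaluation metrics.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    GradientBoostingClassifier(random_state=0), X, y, cv=cv,
    scoring=["roc_auc", "accuracy", "f1", "recall", "precision"])

for m in ["roc_auc", "accuracy", "f1", "recall", "precision"]:
    vals = scores[f"test_{m}"]
    print(f"{m}: {vals.mean():.3f} ({vals.std():.3f})")
```

Reporting mean (standard deviation) over the five folds is exactly the format used in Table 3.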

Model Interpretation Based on SHAP for XGBoost Algorithm
The interpretation of the results of machine-learning-based models plays a crucial role in medical research and clinical applications. In this work, SHAP [13] measurements based on the best machine learning model (XGBoost) are used for explanatory data analysis. This further illustrates the effectiveness of the algorithm proposed in this paper and provides guidance for the practical use of the model in diagnosis and survival analysis.
SHAP is an interpretation framework that can be used to explain any machine learning model. It originates in cooperative game theory, where each feature can be seen as a contributor. For a given sample, the model's prediction is decomposed into additive contributions of the features; the contribution assigned to a feature for that sample is its SHAP value.
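The Shapley attribution underlying SHAP can be illustrated with a brute-force computation on a toy risk score. This enumerates all feature coalitions, which is feasible only for a handful of features (TreeSHAP computes the same quantities efficiently for tree models); the scoring function, feature values, and baseline below are hypothetical.

```python
import math
from itertools import combinations

def shapley_values(f, x, baseline):
    """Exact Shapley values for model f at point x, relative to a baseline.

    Features absent from a coalition are set to their baseline values;
    the cost is exponential in the number of features.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                # Shapley kernel weight |S|! (n - |S| - 1)! / n!
                weight = (math.factorial(size) * math.factorial(n - size - 1)
                          / math.factorial(n))
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

# Toy risk score with an interaction between the first two binary factors.
f = lambda v: 2 * v[0] + v[1] + 0.5 * v[0] * v[1]
x, base = [1, 1, 0], [0, 0, 0]
phi = shapley_values(f, x, base)
print(phi)  # contributions sum to f(x) - f(base)
```

By the efficiency property, the contributions sum exactly to the difference between the model output at `x` and at the baseline, which is what makes the additive decomposition in SHAP plots possible.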
4.1. Feature Importance Evaluation. Figure 1 gives the feature importance rankings of this model as evaluated by XGBoost and SHAP. As shown in Figure 1(a), hypertension was the most important factor in the evaluation of stroke, followed by history of transient ischemia, diabetes, atrial fibrillation or valvular heart disease, and history of stroke. The SHAP-based description shown in Figure 1(b) gives a more accurate view of each factor's effect; hypertension, history of transient ischemia, history of stroke, and diabetes remain the most important features, consistent with the results obtained with XGBoost.
From the results shown in Figure 1(b), we can conclude the following.
(1) Hypertension is the most important factor at all stages of stroke, although it has less effect in the case of TIA (class 0).
(2) History of transient ischemia is nearly definitive for the TIA class, and history of stroke is the conclusive factor for recognizing stroke (class 4).
(3) The other factors have significant impact at all stages of stroke.

4.2. Evaluation of Individual Features in Stroke Level Prediction. To better understand the specific impact of individual features on different degrees of stroke, overall SHAP feature plots are constructed and are shown in Figure 2 (here, only the cases Y ≤ 1 and Y ≥ 2 are presented). All factors are listed on the vertical axis, ranked by importance. For a specified factor, each point indicates a patient to whom that factor applies (red) or does not apply (blue). A red point lying to the right indicates a positive contribution toward membership in the corresponding level.
A SHAP description for patients in the high-risk category is shown in Figure 2. It shows that patients with a history of transient ischemia or a history of stroke are unlikely to be classified in the higher-risk stroke subgroup (Y > 1); in fact, history of stroke is the most important factor identified for the occurrence of stroke, and a patient who has previously experienced TIA is more likely to be categorized into class 0 (TIA). The other factors have a strong positive impact, meaning that a patient with the corresponding phenotypes is more likely to be classified as at higher risk of stroke. Similarly, a patient with TIA cannot be classified in the higher-risk category (Y ≥ 2).
The same conclusion can be obtained for dyslipidemia, diabetes, lack of exercise, and atrial fibrillation or valvular heart disease. In addition, a SHAP value near 0 means that the corresponding factor makes only a small contribution to the development of stroke. Similarly, the negative SHAP values for history of stroke (in red), obesity, family history of stroke, and history mean that the stroke level cannot be low risk or TIA. (1) Hypertension and diabetes (the interaction value is recorded as X13), hypertension and AF/VHD (X14), hypertension and history of TI (X15), and hypertension and history of stroke (X16). Similar interactions can be found for other categorical factors in stroke risk level. Figure 3(b) again gives the importance of the factors; compared with the effect of a single factor, most of the interactions are negligible, except that of hypertension and diabetes.
In addition, we added the interaction values to the machine learning model using the forward stepwise method; the resulting AUCs are shown in Table 4. After adding the X13 variable, the accuracy of the model showed a marked improvement. When the variables up to X17 had been added, the model accuracy reached almost 1, so the procedure was stopped after adding X17. The interaction values X13, X14, X20, and X17 play a greater part in promoting the occurrence of different degrees of stroke than the other interaction values. This knowledge is crucial for medical research and clinical applications, and it provides a better theoretical basis for the treatment of patients.
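The effect of adding an interaction column such as X13 (hypertension × diabetes) can be sketched as follows. The data are synthetic, with a built-in non-additive interaction whose coefficients are hypothetical; a main-effects-only logistic model is compared with one that includes the product term.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n = 2000
hyp = rng.integers(0, 2, n)   # hypertension (hypothetical encoding)
dia = rng.integers(0, 2, n)   # diabetes
# Synthetic outcome whose log-odds include a non-additive interaction,
# so the four factor combinations are not ordered additively.
logit = -2.2 + 2.2 * hyp + 2.2 * dia - 3.05 * hyp * dia
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_main = np.column_stack([hyp, dia])              # main effects only
X_int = np.column_stack([hyp, dia, hyp * dia])    # plus X13 = hyp * dia

auc_main = roc_auc_score(
    y, LogisticRegression(max_iter=1000).fit(X_main, y).predict_proba(X_main)[:, 1])
auc_int = roc_auc_score(
    y, LogisticRegression(max_iter=1000).fit(X_int, y).predict_proba(X_int)[:, 1])
print(f"AUC main effects only: {auc_main:.3f}; with interaction term: {auc_int:.3f}")
```

Because the main-effects model cannot represent the non-additive pattern, adding the product column improves the AUC, mirroring the stepwise gains reported in Table 4.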

Individual Precision Prediction and Treatment.
Here, we give an application of SHAP interpretable values to individual precision prediction and treatment guidance. Figure 4 shows a waterfall diagram for a single patient with factor vector (0, 0, 1, 0, 1, 1, 1, 0, 0, 1). At the bottom, E[f(x)] = 0.724 indicates the base value over the overall sample. The bottom row represents five unimportant features, which together have a positive impact of 0.1; X20 produces a 0.29 positive effect. Smoking history has a negative impact of 0.79, whereas X13 has a positive impact of 1.05, and family history of stroke has a positive impact of 2.76. Finally, the SHAP value for this patient is 10.251 (shown in the upper right corner). Compared with E[f(x)], the value for this patient is very large; therefore, this individual meets the definition of a high-risk patient.
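The waterfall mechanics can be sketched directly from the per-factor contributions quoted above: starting from the base value, factors are applied in order of absolute impact and the running total is tracked. The contributions below are the figures cited in the text; the printed total is simply the sum of the listed contributions, not the patient's full SHAP value, since not all factors are itemized here.

```python
# Base value and per-factor SHAP contributions quoted in the text.
base_value = 0.724
contributions = {
    "family history of stroke": 2.76,
    "X13 (hypertension x diabetes)": 1.05,
    "smoking history": -0.79,
    "X20 (family history x hypertension)": 0.29,
    "five unimportant features (combined)": 0.10,
}

# Waterfall ordering: largest absolute contribution first, cumulative totals.
running = base_value
print(f"base value: {running:.3f}")
for name, c in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    running += c
    print(f"{name:40s} {c:+.2f} -> {running:.3f}")
final = running
```

The final running total is, by construction, the base value plus the sum of the listed contributions, which is exactly the additive decomposition a waterfall plot visualizes.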
For this patient, family stroke history is the most important factor contributing to risk of stroke, followed by lack of exercise and dyslipidemia.If this individual develops hypertension and diabetes, the interaction of these factors with the

Conclusion and Discussion
In this study, risk factors were extracted and risk levels were predicted using stroke data from the Stroke Center of Lanzhou University Second Hospital from 2016 to 2018. First, risk factors were identified by ranking the importance of features. The results showed that the most important factors were hypertension, history of transient ischemia, history of stroke, and diabetes; family history of stroke, lack of exercise, dyslipidemia, smoking history, and apparent overweight or obesity also had notable effects, whereas age and gender had negligible impact. Our results suggested that the XGBoost model was better at predicting stroke risk than the other models according to almost all evaluation indices. Using Lundberg and Lee's SHAP with the best-performing machine learning model, we could determine the impact of factors at each stroke level. Finally, we constructed a waterfall plot for a single patient to show precisely their level of stroke and the impact of different characteristics, illustrating how the method could be used to guide accurate and personalized treatment.
This study demonstrates precise prediction and identification of stroke level and the corresponding distinguishing features of a stroke patient. The proposed procedure combines feature selection, XGBoost classification, and SHAP interpretable analysis, which balances model accuracy and interpretability for medical applications in particular. The superiority of this approach has been demonstrated for the personalized treatment of stroke patients. The XGBoost classifier can precisely determine the factors that distinguish each level of stroke in a patient group. Moreover, interpretation based on SHAP gives more precise information about the individual patient, which can help guide individual diagnosis and stroke prevention strategies.

Figure 1: Feature importance of factors based on (a) XGBoost and (b) SHAP.

4.3. Interaction Effects for Stroke Level Prediction. The interaction values shown in Figure 3(a) for the low-risk case indicate that although the individual factors have negative influences, the following interactions have a strong positive influence:

Figure 4: Importance ranking of characteristics of the first patient.

Table 1: Description of the data.

Table 1 (continued). VHD: valvular heart disease; AO: apparently overweight. Significance analyses were performed by analysis of variance. All tests were two-sided.

Table 2: Correlations and NMI values for the data.

Table 3: Performance evaluation using fivefold cross-validation for different models: mean (standard deviation).
(2) Diabetes and AF/VHD (X17), diabetes and history of TI (X18), and diabetes and history of stroke (X19). (3) Family history of stroke and hypertension (X20), family history of stroke and diabetes (X21), family history of stroke and AF/VHD (X22), family history of stroke and history of TI (X23), and family history of stroke and stroke (X24).

Table 4: Comparisons of AUC values for the forward stepwise method with interactive effects for different models (M0: original; M1: M0 + X13; M2: M1 + X14; M3: M2 + X20; M4: M3 + X17).

others will aggravate the severity of the disease. The interaction between family history of stroke and hypertension also plays an important part in the development of high stroke risk.