Influences of Daily Life Habits on Risk Factors of Stroke Based on Decision Tree and Correlation Matrix

Purpose To explore the influences of smoking, alcohol consumption, drinking tea, diet, sleep, and exercise on the risk of stroke and the relationships among these factors, to present corresponding knowledge-based rules, and to provide a scientific basis for the assessment of and intervention in risk factors of stroke. Methods The decision tree C4.5 algorithm was optimized and used to establish a model for stroke risk assessment; then, the main risk factors of stroke (including hypertension, dyslipidemia, diabetes, atrial fibrillation, body mass index (BMI), history of stroke, family history of stroke, and transient ischemic attack (TIA)) and daily habits (e.g., smoking, alcohol consumption, drinking tea, diet, sleep, and exercise) were analyzed, and the corresponding knowledge-based rules were presented. A correlation matrix of stroke risk factors was also established to analyze the relationships among them. Results The accuracy of the established model for stroke risk assessment was 87.53%, and the kappa coefficient was 0.8344; both were superior to those of the random forest and Logistic algorithms. Additionally, 37 knowledge-based rules that can be used for prevention of risk factors of stroke were derived and verified. According to in-depth analysis of risk factors of stroke, the values of smoking, exercise, sleep, drinking tea, alcohol consumption, and diet were 6.00, 7.00, 8.67, 9.33, 10.00, 10.60, and 10.75, respectively, indicating that their influence on risk factors of stroke decreased in that order. Smoking and exercise were strongly associated with other risk factors of stroke, whereas sleep, drinking tea, alcohol consumption, and diet were not strongly associated with other risk factors of stroke but were relatively closely associated with smoking and exercise.
Conclusions Establishment of a model for stroke risk assessment, analysis of factors influencing risk factors of stroke, analysis of relationships among those factors, and derivation of knowledge-based rules are helpful for prevention and treatment of stroke.


Introduction
Stroke is an acute cerebrovascular disease characterized by high morbidity, high disability, and high mortality. It is a refractory disease that poses a major threat to human health and life [1]. At present, there are no effective treatments for stroke. Prevention is still the most feasible strategy to reduce the harm and social burden of stroke, especially given its high global incidence and its potential risk factors [2]. The risk factors of stroke are divided into intervention factors (e.g., smoking, alcohol consumption, and body mass index (BMI)) and nonintervention factors (e.g., age, gender, ethnicity, and genetic attributes) according to whether the risk can be changed through intervention [3]. Hence, studying the intervention factors is of great significance for the prevention of stroke. In addition, we previously found that the intervention factors for stroke arise mostly in people's daily lives and behavioral habits [4,5]. Unhealthy lifestyles can trigger or increase the risk of stroke, whereas moderate lifestyle changes may reduce it [6]. Therefore, numerous scholars have suggested that further studies should be carried out to provide effective interventions that guide and improve people's lifestyles, so as to reduce the risk and incidence of stroke [7-9]. However, in 2019, Altobelli et al. analyzed the relevant literature and found that research in this area had been conducted in only a limited number of developed countries and that there were very few reports on the impact of lifestyle and dietary habits on risk factors of stroke [10]. In China, Huang et al. conducted relevant research and demonstrated that a healthy lifestyle (high fruit intake, quitting smoking, doing housework, and good sleep quality) may reduce the chance of recurrence of first-onset ischemic stroke [11].
Although the risk factors of stroke in daily life habits are not the main risk factors of stroke, they are closely associated with the main risk factors [12].
The present study focused on the Chinese population, and large-scale, multidimensional stroke data were collected through modern information technology. The optimized decision tree algorithm was used to analyze the risk factors of stroke related to daily life habits, derive knowledge-based rules, and establish a model for stroke risk assessment, which was then used to analyze the relationships among risk factors of stroke.

Data Collection and Pretreatment.
We established a whole-course stroke management network system by collecting large-scale data from the Shanghai suburban population. Nearly 10,000 people were involved, from which 5599 valid records were finally obtained. The data included subjects' demographic characteristics, physical examination, family medical history, treatment history, personal diet and lifestyle habits, sleep and breathing, psychological status, quality of life, and stroke knowledge. To facilitate the classification of stroke, we also designed a rapid stroke screening form and performed statistical analysis. We preliminarily extracted and integrated the data and determined 16 risk factors of stroke for further analysis. As shown in Table 1, the 5599 records comprised 2491 males and 3108 females, and the subjects' minimum and maximum ages were 18 and 89 years, respectively. The age- and gender-based data are shown in Figure 1.
As illustrated in Figure 1, [18,30) indicates that age is 18 years old or older and less than 30 years old; F and M denote female and male, respectively; and PN is the number of individuals.
The present research analyzed the risk factors of smoking, alcohol consumption, drinking tea, diet, sleep, sport, and BMI. In the coding, "y" means "yes" and "n" indicates "no"; definitions of the types of BMI, diet, sleep, and exercise are presented in Figure 1, in which field names are sometimes used to represent their corresponding stroke risk factors. The above-mentioned factors were defined as follows:
(i) Smoking: those who have smoked for 6 months or more in their lifetime were marked as "y"; otherwise, they were denoted as "n"
(ii) Alcohol consumption: those who have drunk no less than twice/week and no less than 80 ml each time were marked as "y"; otherwise, they were denoted as "n"
(iii) Drinking tea: those who have drunk tea at least 3 days/week were marked as "y"; otherwise, they were denoted as "n"
(iv) Diet: the daily food ingredients are mainly sugars, fats, or proteins, which were marked as "C1," "C2," and "C3," respectively

Computational and Mathematical Methods in Medicine
(v) Sport: those who have exercised more than 3 times/week and more than 30 min each time, demonstrating a regular level of sport, were marked as "C1"; those who have exercised 2-3 times/week and 10-30 min each time, reflecting a medium level of sport, were marked as "C2"; those who have exercised less than or equal to 1 time/week and less than 10 min each time, indicating a lower level of sport, were marked as "C3"
(vi) BMI: since the WHO standards are not highly appropriate for Chinese people, the Chinese Reference Standards were formulated with reference to the WHO standards and are divided into five types: B1, B2, B3, B4, and B5 (Table 2)
(vii) Sleep: duration of sleep at different ages can be divided into three types: very short-term, medium-term, and very long-term, which were labelled as TS, TB, and TL, respectively, as shown in Figure 2
According to the rapid screening of risk factors of stroke (including hypertension, dyslipidemia, diabetes, atrial fibrillation, smoking history, BMI, sport, stroke history, family history of stroke, and transient ischemic attack (TIA)), and with reference to the Guidelines for Screening, Prevention and Control of Ischemic Stroke presented by the Ministry of Health of China (hereinafter referred to as the guidelines), this study classified stroke risk into H, M, L, N, T, and Y levels, as summarized in Table 3.
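The habit definitions (i), (ii), (iii), and (v) can be written as small encoding functions. The following is a minimal sketch rather than the authors' preprocessing code: the function and parameter names are hypothetical, and, because the stated sport categories do not cover every combination of frequency and duration, any case that is neither "C1" nor "C3" is assumed here to fall into "C2".

```python
def encode_smoking(months_smoked: float) -> str:
    """'y' if the person has smoked for 6 months or more in their lifetime."""
    return "y" if months_smoked >= 6 else "n"


def encode_alcohol(times_per_week: float, ml_per_time: float) -> str:
    """'y' if drinking at least twice/week and at least 80 ml each time."""
    return "y" if times_per_week >= 2 and ml_per_time >= 80 else "n"


def encode_tea(days_per_week: float) -> str:
    """'y' if tea is drunk on at least 3 days/week."""
    return "y" if days_per_week >= 3 else "n"


def encode_sport(times_per_week: float, minutes_per_time: float) -> str:
    """Map exercise frequency and duration to C1 (regular), C2 (medium), C3 (lower)."""
    if times_per_week > 3 and minutes_per_time > 30:
        return "C1"
    if times_per_week <= 1 and minutes_per_time < 10:
        return "C3"
    # Assumption: combinations not covered by the C1 or C3 definitions
    # are treated as the medium level.
    return "C2"
```

Keeping each rule in its own function makes the thresholds easy to audit against the written definitions before the records are passed to a classifier.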

Decision Trees.
The decision tree is a popular, logic-based, easily interpretable, straightforward, and widely applicable method [13]. The classic decision tree algorithms include ID3, C4.5, and CART. In contrast to ID3, which can only handle discrete variables, C4.5 and CART can handle continuous variables, and they are not sensitive to incomplete data. In addition, CART generates binary trees, whereas the C4.5 algorithm generates multiway branches. Decision trees can generate interpretable knowledge rules, which can express the relationships between factors. This is in line with our goal of exploring relationships among the risk factors of stroke. Therefore, the C4.5 algorithm was selected in the current research. Details of the C4.5 algorithm are described below.
2.2.1. C4.5 Algorithm. In 1992, Ross Quinlan developed the C4.5 decision tree algorithm [14]. C4.5 constructs a decision tree as a learning model from the data samples. A divide-and-conquer approach is adopted to construct the decision tree, using a measure based on information gain to select the attribute from the dataset at each node of the tree.
(1) Information Gain. Suppose that there are C categories of data in the sample dataset D. The information entropy is

$$\mathrm{Info}(D) = -\sum_{i=1}^{C} p_i \log_2 p_i, \tag{1}$$

where D represents the training dataset, C denotes the number of classes, and $p_i$ represents the ratio of the number of samples in class i to the total number of samples. When attribute A is chosen as a node of the decision tree, the information entropy after splitting on A is

$$\mathrm{Info}_A(D) = \sum_{j=1}^{k} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j), \tag{2}$$

where k is the number of parts $D_1, \ldots, D_k$ into which A divides the data samples D.

(2) Gain Ratio. The information gain is the decrease in the information entropy of dataset D after splitting on attribute A:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D). \tag{3}$$

The information gain ratio is given by

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}, \qquad \mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{k} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}. \tag{4}$$

2.2.2. Improvement and Implementation of C4.5 Algorithm.
We used a decision tree algorithm to analyze the above-mentioned 16 risk factors of stroke (see Table 1). The decision tree was generated using J48, the implementation of the C4.5 algorithm in Weka. The confidence factor for pruning was set to 0.25, and the minimum number of instances per leaf (minNumObj) was set to 1. In addition, 10-fold cross-validation was used to select and evaluate the model.
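To make the splitting criterion of Section 2.2.1 concrete, the sketch below computes the information entropy and the gain ratio of one discrete attribute in plain Python. It illustrates the standard definitions and is not the J48 source code; the function names and the toy data in the usage note are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Information entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on a discrete attribute `values`."""
    n = len(labels)
    # Partition the class labels by attribute value.
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Expected entropy after the split, weighted by partition size.
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy(labels) - info_a
    # Split information penalizes attributes with many small partitions.
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in partitions.values())
    return gain / split_info if split_info > 0 else 0.0
```

For example, `gain_ratio(["y", "y", "n", "n"], ["H", "H", "N", "N"])` evaluates an attribute that separates the two classes perfectly, while an attribute that is unrelated to the class yields a gain ratio of zero.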
To address the imbalanced data problem and improve the robustness of the system, we applied the SMOTE algorithm to improve the model. SMOTE is an oversampling technique for imbalanced datasets proposed by Chawla et al. in 2002. It can effectively mitigate the overfitting caused by traditional oversampling techniques and reduce the bias in classification results. As illustrated in Figure 3, the preprocessed dataset first undergoes a balance check: the number of records in each class is counted to find the maximum (max) and minimum (min) class sizes, and the ratio max/min is computed. If max/min < 3, the dataset is judged to be balanced and is entered directly into the C4.5 classifier for classification. Otherwise, the dataset is judged to be unbalanced and is entered into the SMOTE processor: first, the entire dataset is sampled without replacement, with the sample size equal to the number of records in the dataset, so that the records are randomly reordered; then, SMOTE is used to generate new minority-class data. This eliminates the effects of preprocessing operations such as filtering and sorting on the SMOTE algorithm and ensures that the data obtained by SMOTE come from a random combination of the majority and minority data, which avoids the overfitting caused by generating data from the minority data only. The data are then entered into the classification module.
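The balance check and oversampling steps can be sketched as follows. This is an illustrative, standard-library version under several assumptions: the records are numeric feature tuples (the classic SMOTE setting of Chawla et al., whereas nominal attributes would need a nominal variant), the synthetic points are interpolated between minority records only, and all function names, the neighbourhood size `k`, and the seed are hypothetical. It does not reproduce the Weka filter used in the study.

```python
import math
import random
from collections import Counter


def is_balanced(labels, threshold=3.0):
    """Balance check: the dataset is balanced if max/min class size < threshold."""
    counts = Counter(labels).values()
    return max(counts) / min(counts) < threshold


def shuffle_records(records, seed=0):
    """Sample without replacement, sample size equal to the dataset size (a shuffle)."""
    return random.Random(seed).sample(records, len(records))


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority records."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # requires at least two minority records
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen record (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbour)))
    return synthetic
```

A typical flow would call `is_balanced` on the class labels and, only when it returns `False`, shuffle the records and append the output of `smote` for each minority class before training the classifier.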

Results
The number of leaves of the tree was 98, while the size of the tree was 171 (Figures 4-8). The performance indexes of the tree are as follows: classification accuracy: 87.5281%; kappa statistic: 0.8344; mean absolute error: 0.0567; and root-mean-square error: 0.175. To assess the performance of the proposed system for stroke risk classification, precision, recall, accuracy, and kappa were calculated, and 10-fold cross-validation was used. Equations (5)-(8) were used to calculate precision, recall, accuracy, and kappa, respectively.
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{5}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{6}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{7}$$

$$\mathrm{Kappa} = \frac{p_o - p_e}{1 - p_e}. \tag{8}$$

Precision is the ratio of correct positive predictions to all positive predictions. Recall is the ratio of correct positive predictions to all positive samples. Accuracy is the ratio of correct predictions to all predictions. True positives (TPs) are positive cases that are correctly predicted as positive. False negatives (FNs) are positive cases that are incorrectly predicted as negative. True negatives (TNs) are negative cases that are correctly predicted as negative. False positives (FPs) are negative cases that are incorrectly predicted as positive. Meanwhile, kappa offers a more robust estimate of the performance of the proposed system than simple agreement and gives an overall evaluation of all the cases. $p_o$ is the relative observed agreement between the proposed system and the physician analysis, and $p_e$ is the hypothetical probability of chance agreement.

Table 4 presents the confusion matrix of the classification result using the optimized C4.5 algorithm. To evaluate the performance of the optimized C4.5 algorithm, the random forest and Logistic algorithms were implemented for comparison. Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees [15]. Logistic regression is a generalized linear regression analysis model, commonly used in data mining, automatic disease diagnosis, economic prediction, and other fields. Logistic regression is good at analyzing linear relationships but is worse than decision trees at analyzing nonlinear relationships. In addition, it is sensitive to extreme values, and the decision tree performs better in this respect [16].
In the current study, the number of trees in the random forest was set to 100, and for each tree, the minimum number of instances for each leaf was set to 1. The ridge value in the Logistic algorithm was set to 1.0E-8, and the maximum number of iterations was set to -1 (the Weka default, which iterates until convergence). Like the decision tree, both algorithms were evaluated with 10-fold cross-validation. Tables 5 and 6 summarize the confusion matrices of the classification results using the random forest and Logistic algorithms, respectively.
In terms of both accuracy and kappa, the optimized C4.5 performed best among the three algorithms. The recall of the risk type "T" reached only 0.208 with the random forest algorithm, which was noticeably lower than the 0.962 obtained with the C4.5 algorithm. Figures 9-11 demonstrate that the misclassification rate of risk type "T" is the lowest for the optimized C4.5 algorithm among the three algorithms.
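The per-class precision and recall, the overall accuracy, and the kappa statistic compared above can all be derived from a multiclass confusion matrix. The sketch below shows one way to do so; the function names are hypothetical, the matrix is assumed to have true classes in rows and predicted classes in columns, and the two-class matrix in the usage note is made-up toy data, not a result of this study.

```python
def per_class_metrics(cm, labels):
    """Return {label: (precision, recall)} from cm[true][predicted]."""
    n = len(labels)
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                         # true class i, predicted otherwise
        fp = sum(cm[r][i] for r in range(n)) - tp    # predicted i, true class otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[label] = (precision, recall)
    return metrics


def accuracy_and_kappa(cm):
    """Return (accuracy, kappa), where accuracy equals the observed agreement p_o."""
    n = len(cm)
    total = sum(map(sum, cm))
    p_o = sum(cm[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    p_e = sum(
        sum(cm[i]) * sum(cm[r][i] for r in range(n)) for i in range(n)
    ) / total ** 2
    return p_o, (p_o - p_e) / (1 - p_e)
```

With the toy matrix `[[40, 10], [5, 45]]`, `accuracy_and_kappa` gives an accuracy of 0.85 and a kappa of 0.7, which illustrates how kappa discounts the agreement expected by chance.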

Table 3: Stroke risk levels.
H: two or more of the major risk factors defined in the guidelines, or one major risk factor and two or more secondary risk factors.
M: one of the major risk factors defined in the guidelines and fewer than two secondary risk factors.
L: none of the major risk factors defined in the guidelines and two or more secondary risk factors.
N: none of the major risk factors defined in the guidelines and fewer than two secondary risk factors.
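The criteria for the H, M, L, and N levels amount to a short decision rule on two counts. The following sketch encodes that rule; the function name and its arguments are hypothetical, the counts of major and secondary risk factors are assumed to have been determined beforehand according to the guidelines, and the T and Y levels are not covered because their criteria are not given here.

```python
def risk_level(n_major: int, n_secondary: int) -> str:
    """Classify stroke risk as H, M, L, or N from counts of guideline risk factors."""
    if n_major >= 2 or (n_major == 1 and n_secondary >= 2):
        return "H"
    if n_major == 1:
        return "M"  # one major factor and fewer than two secondary factors
    if n_secondary >= 2:
        return "L"  # no major factor, two or more secondary factors
    return "N"      # no major factor, fewer than two secondary factors
```

For instance, `risk_level(1, 2)` returns `"H"`, whereas `risk_level(1, 1)` returns `"M"`.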

Corresponding knowledge-based rules can be deduced from the decision tree. In the present case, 98 knowledge-based rules were deduced, of which 37 are related to the 6 daily living habits (smoking, alcohol consumption, drinking tea, diet, sleep, and sport); these 37 rules are listed in the Supplementary Information.

Discussion
Based on the decision tree described above, the average depth and frequency of each risk factor in the tree were calculated, as shown in Table 7. The values increased in the following order: stroke history, hypertension, dyslipidemia, diabetes, family history of stroke, TIA, smoking, atrial fibrillation, exercise, sleep, gender, BMI, drinking tea, age, and alcohol consumption, indicating that the influence of these factors on the risk of stroke decreased in that order. The impact of daily living habits on risk factors of stroke was thus relatively small, suggesting that the influence of lifestyle habits and diet on risk factors of stroke is indirect.
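The average depth and frequency of each factor can be obtained by a simple traversal of the tree. The sketch below assumes a hypothetical nested-dictionary representation (an internal node is `{"attr": name, "children": {...}}` and a leaf is a class label string), counts the root as depth 1, and counts every internal node that tests a factor as one occurrence; these conventions are assumptions and may differ from those behind Table 7.

```python
from collections import defaultdict


def factor_depths(tree, depth=1, depths=None):
    """Collect the depth of every internal node, grouped by the factor it tests."""
    if depths is None:
        depths = defaultdict(list)
    if isinstance(tree, dict):  # internal node; leaves are plain strings
        depths[tree["attr"]].append(depth)
        for child in tree["children"].values():
            factor_depths(child, depth + 1, depths)
    return depths


def summarize(depths):
    """Return {factor: (average depth, frequency)}."""
    return {f: (sum(d) / len(d), len(d)) for f, d in depths.items()}


# Toy tree for illustration only (not the tree learned in this study).
toy_tree = {
    "attr": "SH",
    "children": {
        "y": "Y",
        "n": {
            "attr": "Hyte",
            "children": {
                "y": {"attr": "Smok", "children": {"y": "H", "n": "M"}},
                "n": {"attr": "Smok", "children": {"y": "L", "n": "N"}},
            },
        },
    },
}
summary = summarize(factor_depths(toy_tree))
```

Under these conventions a smaller average depth means that a factor is tested closer to the root, which is why lower values are read as stronger influence.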
We further analyzed the above-mentioned 98 knowledge-based rules for risk factors of stroke by extracting the risk factors that appear in each rule to form a factor set. Within each set, the sum of the reciprocals of factors was used to represent the weight of each factor. All factor sets and their weights are described in the Supplementary Information. Within each set, every two factors formed a factor pair; the weights of identical factor pairs were summed across sets to form a factor-based relationship matrix, as shown in Table 8.
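The pairing and summation step can be illustrated with a short sketch. The exact weighting scheme is not fully specified in the text, so the version below makes a loud assumption: a factor set containing n distinct factors contributes a weight of 1/n to each of its factor pairs. The function name and the toy factor sets are hypothetical, and the resulting numbers are not expected to reproduce Table 8.

```python
from collections import defaultdict
from itertools import combinations


def relationship_matrix(factor_sets):
    """Sum pair weights over all factor sets; keys are sorted (factor, factor) pairs."""
    matrix = defaultdict(float)
    for factors in factor_sets:
        unique = sorted(set(factors))
        if len(unique) < 2:
            continue  # a single factor forms no pair
        weight = 1.0 / len(unique)  # assumption: reciprocal of the set size
        for a, b in combinations(unique, 2):
            matrix[(a, b)] += weight
    return dict(matrix)


# Toy factor sets extracted from two hypothetical rules.
toy_sets = [["Smok", "Sport"], ["Smok", "Sport", "Sleep", "Hyte"]]
toy_matrix = relationship_matrix(toy_sets)
```

Because the keys are sorted pairs, the result corresponds to the upper triangle of a symmetric matrix; a different weight function can be substituted on the marked line without changing the rest of the procedure.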
As illustrated in Table 8, stroke history (SH), hypertension (Hyte), dyslipidemia (Dysl), diabetes (Diab), and age (Age) have the highest correlations among the risk factors of stroke. Of the 6 daily habit factors we examined (smoking, alcohol consumption, drinking tea, diet, sleep, and exercise), only smoking (Smok) and sport (Sport) had correlations higher than the average (1.95). This indicates that alcohol consumption, drinking tea, diet, and sleep are not strongly correlated with other factors. In addition, despite this weak correlation, the correlation values of alcohol consumption, drinking tea, diet, sleep, smoking, and sport were close to those of the strongly correlated factors (SH, Hyte, Dysl, Diab, and Age), as shown in Table 9.

Smoking and Sport.
Of the 37 knowledge-based rules mentioned above, 30 included the "smoking" factor, suggesting that smoking substantially increases the risk of stroke. Yamagishi et al. demonstrated that smoking increases the risk of stroke in patients with hypertension [17], which is in line with our findings. The radar chart of the risk ratio of smoking to nonsmoking is shown in Figure 12(a).
Of the 37 knowledge-based rules mentioned above, 35 contained "sport." As displayed in Figure 12(b), there is no significant difference between the impacts of high-intensity and medium-intensity exercise on risk factors of stroke. Exercise is the most common factor affecting the risk of stroke, and moderate exercise helps prevent stroke, which is consistent with the results of McDonnell et al.'s study [18].
Additionally, 28 knowledge-based rules contained both the "smoking" and "sport" factors, indicating that smoking and sport are closely associated and, further, that exercise may help smokers reduce the risk of stroke.

Alcohol Consumption and Drinking Tea.
Individuals who drink alcohol have a significantly higher risk of stroke than those who do not (Figure 12(c)). This is in line with Hu et al.'s finding that heavy drinking can increase the risk of stroke, while moderate drinking has no significant influence on the risk of stroke [19]. However, alcohol consumption is not an independent factor and is typically associated with hypertension, diabetes, and hypercholesterolemia.
Knowledge-based rules showed that drinking tea has no direct effect on the risk of stroke (Figure 12(d)) and that, similar to alcohol consumption, it can be related to BMI. Sosa et al. demonstrated that tea is highly beneficial in reducing the risk of stroke in obese people [20]. Zhang et al. conducted experiments on mice and concluded that drinking tea has a neuroprotective effect on hemorrhagic stroke [21]. In addition, we found that "tea = y" and "alco = y" do not appear together in the same rule in the present study, and the correlation value between them (0.14; Table 8) is also very low, indicating that drinking tea and alcohol consumption do not jointly affect the risk of stroke.

Diet.
As shown in Figure 12(e), the effects of the three types of diet (mainly sugar, fat, and protein) on the risk of stroke are not significantly different. According to the rules, these diet types are concentrated in the "H" and "M" risk levels, suggesting that dietary structure has a certain influence on individuals at high risk of stroke. In addition, in terms of correlation value (Table 8), diet has a relatively higher correlation with other factors than alcohol consumption, drinking tea, and sleep.

Sleep.
As shown in Figure 12(f), the risk of stroke is lower when the duration of sleep is appropriate. Very long or very short sleep duration does not help reduce the risk of stroke, which is consistent with Huang et al.'s finding that good sleep quality helps reduce the risk of stroke [11,22]. In terms of the rules, sleep is associated with smoking, alcohol consumption, and sport, and in terms of correlation, sleep, smoking, and exercise are relatively closely correlated. People who exercise less and are obese have an increased risk of stroke if their sleep duration is extremely long. People who exercise less, smoke, and drink alcohol have a higher risk of stroke if their sleep duration is shorter than normal.

Conclusions
In the present study, we optimized the decision tree C4.5 algorithm to assess and analyze risk factors of stroke (stroke history, hypertension, dyslipidemia, diabetes, family history of stroke, TIA, smoking, atrial fibrillation, sport, sleep, gender, BMI, drinking tea, age, alcohol consumption, and diet) using 5599 valid records. The classification achieved an accuracy of 87.5281% and a kappa coefficient of 0.8344, and its performance was higher than that of the random forest and Logistic algorithms. We then focused on 6 daily life factors, namely smoking, alcohol consumption, drinking tea, diet, sleep, and sport, and presented a series of knowledge-based rules that can guide patients in adjusting their living habits. Through further analysis of the decision tree and the knowledge-based rules, the independent influence of each factor and the relationships between the factors were analyzed. Unlike other studies, we analyzed the relationships between smoking and exercise; among alcohol consumption, drinking tea, and BMI; among diet, sport, and BMI; and among sleep, sport, smoking, and alcohol consumption. We found that although these daily living habits cannot directly determine the risk of stroke (they have low independent influence), they can be used to intervene in the risk factors of stroke. Smoking and exercise were strongly associated with other risk factors of stroke, whereas sleep, drinking tea, alcohol consumption, and diet were not strongly associated with other risk factors of stroke but were relatively closely associated with smoking and exercise. However, further research is needed to determine whether, among daily habits, smoking and exercise play a significant role in the risk of stroke.