Stroke Disease Detection and Prediction Using Robust Learning Approaches

Stroke is a medical disorder in which the blood vessels in the brain are ruptured, causing damage to the brain. Symptoms may develop when the supply of blood and other nutrients to the brain is interrupted. According to the World Health Organization (WHO), stroke is a leading cause of death and disability globally. Early recognition of the various warning signs of a stroke can help reduce its severity. Different machine learning (ML) models have been developed to predict the likelihood of a stroke occurring in the brain. This research uses a range of physiological parameters and machine learning algorithms, namely Logistic Regression (LR), Decision Tree (DT) Classification, Random Forest (RF) Classification, and Voting Classifier, to train four different models for reliable prediction. Random Forest was the best-performing algorithm for this task, with an accuracy of approximately 96 percent. The dataset used in the development of the method was the open-access Stroke Prediction dataset. The accuracy of the models used in this investigation is significantly higher than that reported in previous studies, indicating that these models are more reliable. Numerous model comparisons have established their robustness, and the scheme can be derived from the study analysis.


Introduction
Stroke occurs when the blood flow to areas of the brain is disrupted or diminished, so that the cells in those areas do not receive the nutrients and oxygen they require and die. A stroke is a medical emergency that requires urgent medical attention. Early detection and appropriate management are required to prevent further damage to the affected area of the brain and complications in other parts of the body. The World Health Organization (WHO) estimates that fifteen million people worldwide suffer from strokes each year, with one person dying every four to five minutes in the affected population. Stroke is the sixth leading cause of mortality in the United States according to the Centers for Disease Control and Prevention (CDC) [1]. Stroke is a noncommunicable disease that is responsible for approximately 11% of total deaths. In the United States, approximately 795,000 people suffer from the disabling effects of strokes each year [2]. It is India's fourth leading cause of death. Strokes are classified as ischemic or hemorrhagic. In an ischemic stroke, clots obstruct the blood flow; in a hemorrhagic stroke, a weak blood vessel bursts and bleeds into the brain. Stroke may be avoided by leading a healthy and balanced lifestyle that includes abstaining from unhealthy behaviors, such as smoking and drinking, keeping a healthy body mass index (BMI) and average glucose level, and maintaining good heart and kidney function. Stroke prediction is essential, and a stroke must be treated promptly to avoid irreversible damage or death. With the development of technology in the medical sector, it is now possible to anticipate the onset of a stroke by utilizing ML techniques. ML algorithms are beneficial because they allow for accurate prediction and proper analysis. The majority of previous research has focused on, among other things, the prediction of heart attacks. Brain stroke has been the subject of very few studies.
The main motivation of this paper is to demonstrate how ML may be used to forecast the onset of a brain stroke. The most important finding is that, among the four classification algorithms tested, Random Forest performed best, achieving higher accuracy than the others. One limitation of the model is that it is trained on textual data rather than real-time brain images. The implementation of four ML classification methods is shown in this paper.
Numerous researchers have previously used machine learning to forecast strokes. Govindarajan et al. [3] used text mining and a machine learning classifier to classify stroke disorders in 507 individuals. They tested a variety of machine learning methods for training purposes, including Artificial Neural Network (ANN), and they found that the SGD algorithm provided the greatest value, 95 percent. Amini et al. [4,5] performed research to predict a stroke occurrence. They classified 50 risk variables for stroke, diabetes, cardiovascular disease, smoking, hyperlipidemia, and alcohol consumption in 807 healthy and unhealthy individuals. They used two of the most accurate methods: the c4.5 decision tree algorithm (95 percent accuracy) and the K-nearest neighbor algorithm (94 percent accuracy). Cheng et al. [6] presented a study on estimating the prognosis of an ischemic stroke. In their study, they used data from 82 ischemic stroke patients and two ANN models, which achieved accuracy values of 79 and 95 percent. Cheon et al. [7][8][9] conducted research to determine the predictability of stroke patient mortality. They identified the stroke incidence using 15,099 individuals in their research. They detected strokes using a deep neural network method. The authors utilized PCA to extract information from the medical records and predict strokes. They achieved an area under the curve (AUC) of 83 percent. Singh et al. [10] conducted research using artificial intelligence to predict strokes. They employed a new technique for predicting stroke using the cardiovascular health study (CHS) dataset. Additionally, they used the decision tree method to perform feature extraction, followed by principal component analysis. In this case, the model was built using a neural network classification method, and it achieved 97 percent accuracy.
Chin et al. [11] conducted research to determine the accuracy of automated early ischemic stroke detection. The major objective of their research was to create a method for automating primary ischemic stroke detection using a Convolutional Neural Network (CNN). They collected 256 images for the purpose of training and testing the CNN model. They used a data augmentation technique to enlarge the collected image set during image preparation. Their CNN technique achieved a 90 percent accuracy rate. Sung et al. [12] conducted research to establish a stroke severity index. They gathered data on 3577 patients who had an acute ischemic stroke. They utilized a variety of data mining methods, including linear regression, to create their predictive models. Their ability to predict outperformed the k-nearest neighbor method (95% confidence interval). Monteiro et al. [13] used machine learning to predict the functional prognosis of an ischemic stroke. They tested this method on a patient who died three months after admission. They obtained an AUC value of greater than 90. Kansadub et al. [14] conducted research to determine the risk of stroke. The authors analyzed the data to predict strokes using Naive Bayes, decision trees, and neural networks. They assessed accuracy and AUC as their indicators. They categorized all of these algorithms as decision trees, with naive Bayes providing the most accurate results. Adam et al. [15] conducted research on the classification of ischemic stroke. They categorized ischemic strokes using two models: the k-nearest neighbor method and the decision tree technique. In their study, medical experts found the decision tree method to be more useful for categorizing strokes. The majority of studies had an accuracy rate of around 90%, which was considered to be quite good. However, the novelty of our research is that we used several well-known machine learning methods to get the best result.
Random forest (RF), decision tree (DT), voting classifier (VC), and logistic regression (LR) achieved F1-scores of 96, 94, 91, and 87 percent, respectively. The accuracy of the models used in this research is much greater than that of the models used in previous investigations, suggesting that the models used in this investigation are more trustworthy.
They have been shown to be robust in many model comparisons, and the scheme may be derived from the results of the study's analysis.
As mentioned earlier, the major contribution of this research is that we have used different machine learning models on a publicly available dataset. In previous work, most researchers used a single model to predict stroke. In contrast, we used four different models and compared the results with previous work. All the results and comparisons are discussed in the following sections. The rest of this article is set out as follows: the experimental methodology and procedures are described in Section 2; the result analysis is provided in Section 3; and conclusions are discussed in Section 4.

Procedure and Experimental Methodology
This section includes a description of the dataset, a block diagram, a flow diagram, and evaluation metrics, as well as the process and methodology used in the study.

Proposed System.
Once the data has been processed, it is ready for model construction. A preprocessed dataset and machine learning techniques are needed for the model construction. LR, DT classification, RF classification, and voting classifier are the methods used. After creating four alternative models, the evaluation metrics, namely accuracy score, precision score, recall score, and F1 score, are used to compare them. The designed system's block diagram is shown in Figure 1.
All the components of the block diagram have been discussed in the following subsections.

Journal of Healthcare Engineering

Dataset.
The stroke prediction dataset [16] was used to perform the study. There are 5110 rows and 12 columns in this dataset. The value of the output column stroke is either 1 or 0. The value 0 indicates that no stroke risk was identified, while the value 1 indicates that a stroke risk was detected. In this dataset, the value 0 occurs far more often in the output column (stroke) than the value 1. Only 249 rows in the stroke column have the value 1, whereas 4861 rows have the value 0. To improve accuracy, data preprocessing is used to balance the data. Figure 2 shows the total number of stroke and nonstroke records in the output column before preprocessing.
From Figure 2, it is clear that this dataset is an imbalanced dataset.
The SMOTE technique has been used to balance this dataset.
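The core idea behind SMOTE, creating synthetic minority-class samples by interpolating between a minority point and one of its nearest minority neighbors, can be sketched as follows. This is only an illustrative sketch: the function name `smote_sketch`, the parameter `k`, the Euclidean distance, and the toy data points are assumptions, and the paper does not state which SMOTE implementation or settings were used.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples by interpolating between a random
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base, excluding base itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Toy minority class (hypothetical two-feature points, not taken from the dataset)
minority = [(67.0, 228.7), (80.0, 105.9), (49.0, 171.2), (79.0, 174.1)]
new_points = smote_sketch(minority, n_new=6)

# Class counts reported in the paper: 4861 nonstroke rows and 249 stroke rows
n_needed = 4861 - 249  # synthetic stroke rows needed for a 1:1 balance
```

Each synthetic point lies on the line segment between two existing minority points, so the oversampled class stays within the region already occupied by real stroke records.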

Preprocessing.
Before building a model, data preprocessing is required to remove unwanted noise and outliers from the dataset that could cause the model to deviate from its intended training. This stage addresses everything that prevents the model from functioning efficiently. Following the collection of the relevant dataset, the data must be cleaned and prepared for model development. As stated before, the dataset used has twelve attributes. To begin with, the column id is omitted since its presence has no bearing on model construction. The dataset is then inspected for null values, which are filled if any are detected. In this case, the null values in the column BMI are filled using the column's mean.
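The two cleaning steps described above (dropping the id column and filling missing BMI values with the column mean) can be sketched in plain Python. The function name `drop_id_and_impute_bmi`, the dictionary-based row layout, and the three toy rows are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean

def drop_id_and_impute_bmi(rows):
    """Remove the 'id' key and replace missing 'bmi' values (None)
    with the mean of the known BMI values."""
    known = [r["bmi"] for r in rows if r["bmi"] is not None]
    bmi_mean = mean(known)
    cleaned = []
    for r in rows:
        row = {k: v for k, v in r.items() if k != "id"}
        if row["bmi"] is None:
            row["bmi"] = bmi_mean
        cleaned.append(row)
    return cleaned

# Toy rows (hypothetical values)
rows = [
    {"id": 1, "age": 67, "bmi": 36.0, "stroke": 1},
    {"id": 2, "age": 61, "bmi": None, "stroke": 1},
    {"id": 3, "age": 80, "bmi": 32.0, "stroke": 0},
]
cleaned = drop_id_and_impute_bmi(rows)
```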
Label encoding converts the dataset's string literals to integer values that the computer can process. As models are generally trained on numbers, the strings must be converted to integers. The dataset has five columns of the data type string. All strings are encoded during label encoding, and the whole dataset is transformed into a collection of numbers.
The dataset used for stroke prediction is very imbalanced. The dataset has a total of 5110 rows, with 249 rows indicating the possibility of a stroke and 4861 rows confirming the lack of a stroke. While using such data to train a machine learning model may result in high accuracy, other measures such as precision and recall will be inadequate. If such unbalanced data is not dealt with properly, the findings will be inaccurate, and the predictions will be ineffective. As a result, to obtain an efficient model, this unbalanced data must be dealt with first. The SMOTE technique was employed for this purpose. Figure 3 depicts the dataset's balanced output column. After the data preparation and the handling of the imbalanced dataset are complete, the next stage is to construct the model. The data is divided into training and testing data with a ratio of 80 percent training data and 20 percent testing data. After splitting, the model is trained using a variety of classification methods. Random forest, decision tree classification, voting classifier, and logistic regression are the classification algorithms utilized in this study.
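Label encoding and the 80/20 split can be illustrated with a short sketch. The helper names `label_encode` and `train_test_split_sketch`, the alphabetical code assignment, the random seed, and the toy values are assumptions; the paper does not specify how codes were assigned or how the split was randomized.

```python
import random

def label_encode(values):
    """Map each distinct string to an integer code, assigned in sorted order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def train_test_split_sketch(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and hold out test_ratio of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Toy string column: "No" -> 0 and "Yes" -> 1 under sorted order
ever_married = ["Yes", "No", "No", "Yes", "Yes"]
encoded, mapping = label_encode(ever_married)

# 100 placeholder rows split into 80 training and 20 testing rows
rows = list(range(100))
train, test = train_test_split_sketch(rows)
```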

Proposed Algorithms.
Stroke is among the most common diseases identified in the medical field, and its incidence is rising year after year. Using the publicly accessible stroke prediction dataset, the study evaluated four commonly used machine learning methods for predicting the occurrence of brain stroke, which are as follows: (i) random forest, (ii) decision tree, (iii) voting classifier, and (iv) logistic regression.

Random Forest.
The first classification algorithm chosen was RF classification [17]. RFs are composed of numerous independent decision trees, each trained individually on a random sample of the data. These trees are created during training, and the decision trees' outputs are collected. A process termed voting is used to determine the final prediction made by this algorithm. Each DT in this method must vote for one of the two output classes (in this case, stroke or no stroke). The final prediction is determined by the RF method, which chooses the class with the most votes. A block diagram of random forest classification is shown in Figure 4. The flexibility of the random forest is one of its most attractive features. It may be utilized for both regression and classification tasks, and the relative importance it assigns to the input features is easy to inspect. Additionally, it is a convenient approach since the default hyperparameters it employs often give good predictions. Understanding the hyperparameters is straightforward since there are relatively few of them. Overfitting is a well-known problem in machine learning, although it occurs seldom with the random forest classifier. If there are sufficient trees in the forest, the classifier will not overfit the model.
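The two mechanisms described above, training each tree on a random sample and combining the trees by voting, can be sketched as follows. The sampling-with-replacement scheme, the helper names `bootstrap_sample` and `majority_vote`, and the five example votes are assumptions for illustration; the sketch omits the tree training itself.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one such sample per tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)

# Hypothetical votes from five trees for one patient: 1 = stroke, 0 = no stroke
votes = [1, 0, 1, 1, 0]
prediction = majority_vote(votes)  # three of five trees vote for class 1
```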

Decision Tree.
Decision trees (DT) can address both regression and classification problems [18]. Furthermore, as the input variables already have a related output variable, this methodology is a supervised learning model. It resembles a tree. In this method, the data is repeatedly split according to a specific parameter. The decision node and the leaf node are the two parts of a decision tree. At the former node, the data is divided, and the latter is the node that produces the result. The DT classifier's basic structure is depicted in Figure 5. The DT is easy to comprehend since it replicates the steps that a person goes through while making a real-world decision. It may be very beneficial in resolving decision-making problems, because it helps in considering all potential solutions to a problem. It also requires less data cleaning than other methods.
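How a decision node might choose where to divide the data can be sketched with a single-feature threshold search. The Gini impurity criterion, the helper names `gini` and `best_threshold`, and the toy ages and labels are assumptions; the paper does not state which splitting criterion was used.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Try each midpoint between consecutive distinct values as a split and
    return (threshold, weighted impurity) for the best one found."""
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (hypothetical): ages and stroke labels that separate cleanly
ages = [25, 32, 47, 58, 66, 79]
stroke = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(ages, stroke)
```

On this toy data the search selects the midpoint between 47 and 58, which leaves two pure leaf nodes.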

Voting Classifier.
A voting classifier is a type of classification model that trains an ensemble of multiple models and predicts an output (class) based on the class that has the greatest chance of being selected as the output [19]. The final prediction is the outcome of a vote among the constituent models. The flowchart for the voting classifier model is shown in Figure 6.
Voting summarizes the methodology we will use to combine various trained models. There are two methods of voting, which are as follows: (i) Soft voting: In this method, the predicted probability vectors of the individual models are added and averaged. The class with the highest averaged value is deemed the winner and becomes the output. While this seems to be a fair and rational strategy, it is only recommended if the individual classifiers are calibrated correctly.
This is similar to computing the weighted average of a set of numbers, in that each of the various models contributes proportionally to the final output vector. (ii) Hard voting: This method combines the class predictions of all the various models and takes the mode of those predictions as the final output value. Because the particular probability values associated with each model are disregarded, this approach is analogous to taking the most frequent value of a collection of numbers rather than their mean.
Only the output class of each model is considered.
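The difference between the two voting schemes can be made concrete with a small sketch in which three hypothetical models disagree. The function names `hard_vote` and `soft_vote` and the probability values are assumptions for illustration, and the sketch uses equal weights for all models.

```python
from collections import Counter

def hard_vote(labels):
    """Return the most frequent predicted class (the mode)."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average the per-class probabilities over all models and return
    (winning class index, averaged probability vector)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    avg = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

# Hypothetical [P(no stroke), P(stroke)] outputs from three models
probs = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]

# Each model's own class prediction is its most probable class
labels = [max(range(2), key=p.__getitem__) for p in probs]

hard = hard_vote(labels)      # two of three models predict class 0
soft, avg = soft_vote(probs)  # the averaged probabilities favor class 1
```

In this example the two schemes disagree: hard voting follows the two weakly confident models, while soft voting is swayed by the one model that is highly confident.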

Logistic Regression.
The flowchart for the logistic regression model is shown in Figure 7. LR is one of the most commonly used supervised ML algorithms [20]. It is a predictive method that uses a collection of independent variables to predict a categorical dependent variable.
Logistic regression predicts the output of a categorical dependent variable. As a result, the output must be discrete or categorical in nature. It may be yes or no, 0 or 1, true or false, etc., but instead of returning exactly 0 or 1, the model gives probability values between 0 and 1. Logistic regression and linear regression are used in very similar ways. Classification problems are addressed with LR, and regression problems are addressed using linear regression. Instead of a regression line, LR fits an S-shaped logistic function whose output is bounded by the two limiting values 0 and 1.
Figure 8 depicts the confusion matrix or evaluation matrix. The confusion matrix is a tool for evaluating the performance of machine learning classification algorithms. The confusion matrix has been used to test the efficiency of all models created. It illustrates how often our models predict correctly and how often they predict incorrectly. Incorrectly predicted values are counted as false positives and false negatives, whereas correctly predicted values are counted as true positives and true negatives. After grouping all predicted values in the matrix, the model's accuracy, precision-recall trade-off, and AUC were utilized to assess its performance.
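The S-shaped logistic function and the metrics derived from the confusion matrix can be sketched together. The weights, bias, 0.5 decision threshold, and the confusion-matrix counts below are hypothetical values chosen for illustration; they are not results from the paper.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted class) for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical model: the linear score is 0.5 * 1.0 - 0.25 * 2.0 = 0
p, label = predict_label([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)

# Hypothetical confusion-matrix counts
accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```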

Result Analysis
The models' performance, predictions, analysis, and final results are examined in this section.

Data Visualization.
A histogram depicts a frequency distribution with continuous classes. It is an area diagram made of rectangles with bases on the class-boundary intervals and areas proportional to the corresponding class frequencies. Because the bases cover the intervals between class boundaries, the rectangles are adjacent, and their heights are proportional to the corresponding class frequencies or, for classes of unequal width, to the frequency densities. Figure 9 illustrates some important features of the histograms. A histogram depicts the dataset's proportions. Figure 9 depicts the dataset's gender, age, hypertension, heart disease, ever married, average glucose level, and body mass index distributions. For the gender attribute, 0 means male and 1 means female. There are more female samples than male samples in this collection. Based on the age distribution, the sample's average age is in the 40s, and the upper limit is approximately 60. For hypertension, 0 means the individual does not have it, while 1 means the person has it. Most individuals in this dataset are healthy and have no history of heart disease. With regard to BMI and average glucose levels, Figure 10 shows the relationship between one feature and the target feature. Figure 10 shows the relationship between gender and stroke, age and stroke, hypertension and stroke, heart disease and stroke, ever_married and stroke, avg_glucose_level and stroke, and BMI and stroke.

Visualization of Feature Selection.
The process of feature selection is shown in Figure 11. Feature selection aids in comprehending how features are linked to one another. Figure 11 shows that age, hypertension, avg_glucose_level, heart_disease, ever_married, and BMI are positively correlated with the target feature. However, gender is negatively correlated with stroke. Figure 12 depicts the classification report for the RF model.
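A positive or negative correlation between a feature and the target can be computed as sketched below. The Pearson coefficient is assumed here, since the paper does not name the correlation measure behind Figure 11, and the function name `pearson` and the toy values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy data (hypothetical): older individuals are labeled 1 more often
age = [25, 32, 47, 58, 66, 79]
stroke = [0, 0, 0, 1, 1, 1]
r = pearson(age, stroke)  # positive: the label tends to rise with age
```

A coefficient above 0 corresponds to the positively correlated features listed above, and a coefficient below 0 to a negatively correlated one.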

Random Forest (RF).
In this case, the overall F1-score obtained is 96 percent. The individual F1-scores are 96 percent for healthy people and 96 percent for those who have had a brain stroke. This model achieved the highest accuracy after fine-tuning. Prior to fine-tuning, the model had an accuracy of 92 percent. Figure 13 depicts the random forest model's prediction. The predicted outcome and the model's calculated performance are shown in the confusion matrix. There are 2707 correct predictions and 113 incorrect predictions.

Decision Tree.
The classification report for the decision tree classification is shown in Figure 14.
The overall F1-score in this case is 94 percent. The individual F1-scores are 94 percent for healthy individuals and 95 percent for those who have had a brain stroke. The precision and recall are also shown in Figure 14. A fine-tuned decision tree model has also been implemented. However, after fine-tuning, the accuracy did not improve. Figure 15 depicts the DT model's prediction. There were 2664 correct predictions and 156 incorrect predictions.

Voting Classifier.
The classification report for the voting classifier is shown in Figure 16. The overall F1-score obtained in this case is 91 percent. The individual F1-scores are 91 percent for healthy people and 91 percent for those who have had a stroke. The precision and recall are also shown in Figure 16. Without any fine-tuning, this model achieved 91 percent accuracy. The prediction made by the voting classifier is shown in Figure 17. The overall number of correct predictions is 2565, while the total number of incorrect predictions is 255.
Table 1 shows a comparison of the models with those found in prior studies. The table clearly demonstrates that, of the various models included in the framework, the RF model is the most effective. In addition to having a higher F1-score, it has higher precision, recall, and accuracy.

Model Comparison.
From Table 1, it is clear that all algorithms have an acceptable level of accuracy, but the random forest algorithm is the preferable option because of its higher level of accuracy.
This paper achieved 96 percent accuracy using the RF algorithm, whereas the authors in [21] achieved only 73 percent accuracy. Similarly, using the decision tree algorithm, this paper achieved 94 percent accuracy, while the authors in [21] achieved 77.6 percent accuracy. Although the KNN algorithm has not been implemented in this research, ref [12] achieved 95 percent accuracy with it, which is higher than the voting classifier's accuracy (91 percent). However, in this paper, logistic regression performs poorly.
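The accuracy figures for the three models whose confusion-matrix counts are reported in Section 3 can be checked with a few lines of arithmetic. The counts of correct and incorrect predictions are those stated in the text; the dictionary layout and the function name `accuracy` are illustrative choices, and logistic regression is omitted because its counts are not given.

```python
# (correct predictions, incorrect predictions) as reported in the text
reported = {
    "Random forest": (2707, 113),
    "Decision tree": (2664, 156),
    "Voting classifier": (2565, 255),
}

def accuracy(correct, wrong):
    """Fraction of test samples that were predicted correctly."""
    return correct / (correct + wrong)

# Accuracy in percent, rounded to one decimal place
accuracies = {
    name: round(100 * accuracy(correct, wrong), 1)
    for name, (correct, wrong) in reported.items()
}
```

Each model was evaluated on 2820 test samples, and the computed accuracies of about 96.0, 94.5, and 91.0 percent agree with the rounded values quoted above.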

Conclusion
Stroke is a life-threatening medical condition that should be treated as soon as possible to avoid further complications. The development of an ML model could aid in the early detection of stroke and the subsequent mitigation of its severe consequences. The effectiveness of several ML algorithms in properly predicting stroke based on a number of physiological variables is investigated in this study. Random forest classification outperforms the other methods tested, with a classification accuracy of 96 percent. According to the research, the random forest method outperforms the other methods when cross-validation metrics are used in brain stroke forecasting. In future work, the framework models may be enhanced by using a larger dataset and additional machine learning models, such as AdaBoost, SVM, and Bagging. This will enhance the dependability of the framework and its performance. In exchange for just some basic information, the machine learning framework may help the general public determine the likelihood of a stroke occurring in an adult patient. Ideally, it would help patients obtain early treatment for strokes and rebuild their lives after the event.
Data Availability
The data used to support the findings of this research are accessible online at https://www.kaggle.com/fedesoriano/stroke-prediction-dataset.

Conflicts of Interest
The authors declare that they have no conflicts of interest to report regarding the present study.