A Novel Study on Machine Learning Algorithm-Based Cardiovascular Disease Prediction



Introduction
The heart is a major organ of the human or animal body and plays an essential role in the life of mammals. The heart pumps blood throughout the body, thereby supplying oxygen to all parts of the body and controlling blood pressure. The heart performs its function together with the nervous system and the endocrine system. The nervous system helps to control the heart rate, while the endocrine system releases hormones that regulate blood pressure by causing the blood vessels to either constrict or relax. However, when a person is under stress, the brain transmits signals telling the heart to beat more quickly. In stressful situations, the heart beats faster than usual, which can lead to serious heart problems. Aside from stress, heart problems escalate with excessive drinking of liquor, smoking, and heavy fat intake [1,2]. The rate of health hazards in humans rises as a function of unhealthy dietary habits, excessive stress, lack of good sleep, and lifestyle changes [2].
Cardiovascular disease (CVD) is one of the most prominent heart diseases and affects people of all ages. Risk factors for CVD include excessive intake of alcohol, smoking, high blood pressure, high cholesterol level, poor diet, and family history [3]. Del Paoli et al. [4] showed that high blood pressure, unhealthy arguments, and alcohol are highly correlated with CVD. It has been reported that men are at a higher risk of CVD than women [5]. Age is one of the most significant factors for heart disease [6].
In addition to CVD, coronary disease, myocarditis, congenital heart disease, arrhythmias, cardiomyopathy, congestive heart failure, angina pectoris, and myocardial infarction have been classified as acute heart diseases. Each type of heart disease has its own symptoms. However, it is very difficult to distinguish between these heart diseases because they share common high-risk factors such as cholesterol level, blood pressure, diabetes, abnormal pulse rate (PR), and many more [7].
The lack of physical fitness due to lifestyle changes may also lead to heart disease in all age groups. A survey reported that seventeen million people in recent years lost their lives due to heart failure [8]. The early detection of heart disease may save many lives, provided the patients take their treatment and medication seriously and on time [8]. The estimated global number of deaths from CVD in 2015 was 17.7 million, of which 7.4 million were a result of coronary heart disease and 6.7 million of stroke. According to the World Health Organization (WHO), approximately 54% of deaths from non-communicable diseases in Pakistan are due to cardiovascular problems [9]. Although 17.3 million deaths were caused by heart disease in 2008, studies by the WHO in 2018 estimated deaths due to heart disease to be around 56.9 million globally [10].
Deep learning models such as the backpropagation neural network (BNN) are highly effective for predicting diseases [11]. Likewise, machine learning approaches such as decision tree (DT), logistic regression (LR), random forest (RF), Naïve Bayes (NB), and support vector machine (SVM) have been observed to be equally effective in disease prediction [12,13]. Soni et al. [14] used predictive data mining techniques for the prediction of cardiovascular disease and found the highest accuracy for the DT among a class of predictive machine learning models that included K-nearest neighbour algorithms, neural network classification, and Bayesian classification algorithms [15,16].
Data mining techniques are essential to effective healthcare delivery because they can assist in determining whether a patient has a disease or not in healthcare centres (hospitals or clinics). Additionally, they can be employed to diagnose people with diseases rapidly and automatically with great satisfaction [17]. The predictions of these techniques may help all participants to make rational decisions, especially professionals who must decide how to treat patients [18].
Hybrid machine learning models have been applied to predict heart diseases as well as to select optimum classification methods for prediction. Hybrid models give a better output depending on the machine learning method implemented for the execution [8]. Similarly, random forest, decision trees, and hybrid algorithms have been used to predict diseases with high accuracy. The hybrid algorithms were found to have a high accuracy, in the neighbourhood of 88.7%, for the prediction of disease compared to other models [8].
Nyaga et al. [19] created CVD models by summarizing available information on aetiology, rates, treatment, covariates, and mortality prevalence arising from heart failure in sub-Saharan Africa. Prasad et al. [20] implemented procedures geared towards predicting heart problems by recapitulating recent studies that utilized artificial intelligence procedures. Wu et al. [21] initiated new CVD forecasting structures by incorporating several procedures in a single hybridized phoned protocol. Their result validated accuracy in diagnosis by implementing a mixture of styles emanating from all methods.
In recent medical fields, a lot of information on diseases is generated from numerous sources. These data need to be cleaned as fast as possible with different preprocessing techniques so that the required information can fast-track the diagnosis of diseases. This study seeks to develop and propose new methodologies that use machine learning algorithms to increase the accuracy of CVD detection. We investigated and predicted CVD using hybrid machine learning models and compared classification methods to find the optimum one for the predictions. Our models and approach can be applied in hospital settings across the world for effective prediction and diagnosis of CVD and other heart diseases. We are hopeful that our suggested technique will be utilized for the detection and prediction of other diseases in general.
We discuss the materials and methods in the following section, followed by the results and discussion. The paper ends with the conclusions of the study.

Materials and Methods
2.1. Data. The data were collected from the two largest teaching hospitals, the Lady Reading Hospital (LRH) and the Khyber Teaching Hospital (KTH), in Khyber Pakhtunkhwa (KPK), one of the four provinces of Pakistan. Ethical approval for the inclusion of heart disease patients was sought from the Human Ethical Committees of the two teaching hospitals. The ethics approval certificate number for the Lady Reading Hospital is B371/12/07/2022, while that of the Khyber Teaching Hospital is A418/12/07/2022. A simple random sampling technique was employed in the collection of sample units included in the survey. The sample data consisted of a total of 518 randomly selected heart disease patients.

Variables in the Study.
The CVD data included the individual output with corresponding factors. The all-inclusive dataset contained the following attributes: age, gender, height, weight, systolic, diastolic, cholesterol, glucose, smoke, alcohol intake, physical activity, cardiovascular disease, and body mass index (BMI). The response variable, CVD, was classified into two categories, "presence" and "absence." Furthermore, the data were cleaned of noise, inconsistencies, and missing observations. We found a few missing observations in the data because some of the patients were discharged from the ward without a proper residential address or mobile/telephone number with which to trace them. As a result, it was very difficult to contact them. Since our analysis is based on complete data, we replaced the missing values using the usual statistical method, such as the median/mode for the categorical data. Thus, the data cleaning was completed using the corresponding statistical tools for the preprocessing stage.
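The imputation step described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the helper `impute`, the toy columns, and the choice of the mode for categorical columns and the median for numeric columns are assumptions made for the example.

```python
from statistics import median, mode

def impute(values, categorical):
    """Replace None entries with the mode (categorical) or median (numeric)."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if categorical else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns, not the study data.
cholesterol = [1, 2, None, 1, 1, 3]     # categorical level, filled with the mode
systolic = [120, None, 140, 130, 150]   # numeric reading, filled with the median

cholesterol_filled = impute(cholesterol, categorical=True)
systolic_filled = impute(systolic, categorical=False)
```

The fill value is computed from the observed entries only, so the imputed column keeps the same length and order as the original.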
Different data mining techniques were utilized in association, classification, clustering, pattern evaluation, and prediction. We discuss these techniques extensively in the methods section below.

Classification.
Classification is the process of categorizing a given set of data into classes. Classification can be performed for both structured and unstructured data. Predicting the class of the provided data points is the first step in the procedure [22]. Common names for the classes include target, label, and categories. Different statistical and mathematical procedures such as linear programming, decision trees, and neural networks involve classification [23]. CVD detection can be treated as a classification problem because it has two categories, that is, a patient either has CVD or does not [24].

Decision Tree (DT) Algorithm.
The decision tree (DT) is one of the most important predictive modelling and classification methods among supervised learning techniques and is widely used in practice [25,26]. It utilizes algorithms that can detect different ways of splitting datasets based on numerous conditions. In the classification tree, the response variable takes a discrete set of values [26]. DT is a useful contemporary approach to solving decision-making challenges by building models that can be used for prediction through systematic analysis. Internal nodes of a DT indicate a test of the features, branches represent the result of the test, and leaves reflect the decisions that are produced after further computation [27,28]. We performed our DT as follows. In the DT, the prediction of a record's class label begins at the root. The value of the root feature is compared with the corresponding characteristic of the record. Based on this comparison, the branch matching that value is followed to the next node [29][30][31].
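The root-to-leaf prediction described above can be illustrated with a short Python sketch. The tree below is hand-written for the example; its features, thresholds, and the helper `predict` are hypothetical and are not the tree fitted in the study.

```python
# A tiny hand-written tree; the features and thresholds are hypothetical.
tree = {
    "feature": "age", "threshold": 55,
    "left": {"label": "absence"},
    "right": {
        "feature": "systolic", "threshold": 140,
        "left": {"label": "absence"},
        "right": {"label": "presence"},
    },
}

def predict(node, record):
    """Walk from the root to a leaf, comparing one feature per internal node."""
    while "label" not in node:
        go_left = record[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

patient = {"age": 61, "systolic": 150}
result = predict(tree, patient)
```

Each internal node tests a single feature, each branch encodes the outcome of that test, and the leaf reached at the end carries the predicted class.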

Random Forest (RF) Algorithm.
A random forest (RF) is a classifier consisting of a collection of tree-structured classifiers $\{h(x, \Theta_k),\ k = 1, 2, \ldots\}$, where the $\{\Theta_k\}$ are independent and identically distributed random vectors and each tree casts a unit vote for the most popular class at the input of the predictor, $x$ [32][33][34][35].
The RF is an ensemble learning approach for regression or classification that builds a large number of decision trees at training time. For regression, the average prediction of the individual trees is returned, while for classification, the RF output is the class predicted by the largest number of trees. The RF algorithm developed by Ho [36] used a stochastic subspace approach and was reintroduced as a technique for the implementation of a collection of tree predictors by Breiman [37]. RF uses bootstrapping to randomly select training and testing datasets from the original data. After the training dataset is selected, the remaining data, called out of bag (OOB), are used to estimate the goodness of fit [37].
In the growing phase of the RF, classification and regression tree techniques are used for tree growth by splitting the local training set at each node according to the value of one variable drawn from a randomly selected subset of the variables. The tree is grown to the largest extent possible, since pruning is not considered. The bootstrapping and tree-growing phases require independent random input quantities. We assumed that these inputs are independent and identically distributed among trees. In that manner, each tree can be viewed as independently sampled for a given training dataset [37,38].
For prediction purposes, each tree as well as its terminal nodes is assigned to a class in the forest. Predictions by the trees are combined through a voting process in such a way that the forest returns the class with the maximum number of votes by random selection [39].
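Two ingredients of the RF described above, bootstrap sampling with out-of-bag rows and majority voting, can be sketched in a few lines of Python. The function names, the seed, and the toy votes are assumptions for illustration; the sketch does not grow any trees and is not the implementation used in the study.

```python
import random
from collections import Counter

def bootstrap_split(n_rows, rng):
    """Draw n_rows indices with replacement; the unused indices are out of bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_split(10, rng)

# Hypothetical votes from three trees for one patient.
forest_class = majority_vote(["presence", "absence", "presence"])
```

Each tree would be trained on its own `in_bag` rows and evaluated on its `out_of_bag` rows, which is what allows the goodness of fit to be estimated without a separate test set.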

Logistic Regression (LR) Algorithm.
The logistic regression (LR) model is among the most appropriate models when the response variable is a dichotomous categorical variable [40]. In machine learning (ML), the LR model can be used for classification purposes [40,41]. We used the LR model for the classification problem of identifying the cardiovascular-affected respondents. It is based on the idea of likelihood and assigns observations to a discrete class [42]. The logistic function is used to transform the output, so the LR hypothesis restricts the output to a range between 0 and 1. Consequently, a linear function cannot be used directly here because it can take values greater than 1 or less than 0. We classified and predicted the CVD patients with the machine learning LR [43] using the function

$$P(\mathrm{CVD} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}},$$

where $x_1, \ldots, x_p$ are the predictors and $\beta_0, \ldots, \beta_p$ are the regression coefficients.

2.2.5. Naïve Bayes (NB) Algorithm.
The Naïve Bayes (NB) method is a supervised learning approach that is based on the Bayes theorem. The NB machine learning method applies probabilistic techniques to solve classification problems [44]. The main assumption of the NB is the independence (freedom from multicollinearity) of the predictors fitted in the probabilistic models [45]. Classification algorithms predicated on the Bayes theorem are referred to as Naïve Bayes classifiers. They form a collection of algorithms that all follow the same guiding principle, namely that every feature being classified is independent of every other feature [46]. In our case, we used the NB classifier to partition the response variable into patients who have CVD and those who do not, for all patients with heart disease [44,47].
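The independence assumption behind the NB classifier can be made concrete with a small Python sketch. The binary features, the toy records, and the add-one smoothing are assumptions for illustration; they are not taken from the study.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def posterior(row, class_counts, value_counts):
    """Normalised P(class | row), treating features as conditionally independent."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for j, value in enumerate(row):
            # Add-one smoothing over two possible values (assumed binary features).
            score *= (value_counts[(label, j)][value] + 1) / (count + 2)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical binary features: (smoker, physically active).
rows = [(1, 0), (1, 0), (0, 1), (0, 1), (1, 1), (0, 0)]
labels = ["presence", "presence", "absence", "absence", "presence", "absence"]
model = train_nb(rows, labels)
probs = posterior((1, 0), *model)
```

Because the per-feature likelihoods are simply multiplied, the classifier needs only one table of counts per feature and class, which is the practical benefit of the independence assumption.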
2.2.6. Support Vector Machine (SVM) Algorithm. Among the different classification techniques, the support vector machine (SVM) is well known for its discriminative power. The SVM has been widely considered in recent times due to its efficiency in many different pattern classification tasks [48]. It has numerous applications, ranging from bioinformatics to involuntary language recognition and handwritten typescript recognition, with considerable success. Kim et al. [49] showed that the SVM displays exceptional performance in classification for the prognostic prediction of class III malocclusion. Based on [50], we give a brief mathematical outline of the SVM below. Assuming a binary classification of our response variable, CVD, and linear separability of the training samples, we have

$$\{(x_i, y_i)\}, \quad i = 1, 2, \ldots, n,$$

where $x_i \in \mathbb{R}^d$, such that the design matrix $X$ belongs to the $d$-dimensional input space, and the response variable, CVD, is represented by $y_i$, which has a binary class in the vector $Y$ with $y_i \in \{0, 1\}$ in our study. The appropriate discriminating equation is given by

$$f(X) = Z \cdot X + \beta.$$

Here, $Z$ represents the vector that determines the orientation of the hyperplane (discriminating plane), and $\beta$ is the offset [48,51,52]. Infinitely many hyperplanes can correctly classify the training data and can be applied to the validation dataset. The optimal classifier identifies the hyperplane that generalizes best, that is, the one that lies farthest from each cluster of objects [53]. The input set of coordinates is considered optimally separated by the hyperplane if the separation is accurate and the distance between the hyperplane and the nearest components, the support vectors, is maximal, which leads to the identification of a specific hyperplane [53,54].
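The role of the hyperplane can be illustrated with a short Python sketch that evaluates a linear discriminant of the form Z · X + β and the distance of a point from the hyperplane. The weights, offset, and test points are hypothetical and are not fitted to the study data; no optimisation of the margin is performed here.

```python
def decision_value(z, x, beta):
    """Signed score Z . X + beta of a point relative to the hyperplane."""
    return sum(zi * xi for zi, xi in zip(z, x)) + beta

def classify(z, x, beta):
    """Map the sign of the score to the 0/1 coding used for the response."""
    return 1 if decision_value(z, x, beta) >= 0 else 0

def margin_distance(z, x, beta):
    """Geometric distance from x to the hyperplane Z . X + beta = 0."""
    norm = sum(zi * zi for zi in z) ** 0.5
    return abs(decision_value(z, x, beta)) / norm

# Hypothetical weights and offset, chosen only for illustration.
z, beta = [3.0, 4.0], -5.0
label = classify(z, [2.0, 1.0], beta)
dist = margin_distance(z, [2.0, 1.0], beta)
```

Training an SVM amounts to choosing `z` and `beta` so that the smallest `margin_distance` over the training points is as large as possible; the points attaining that minimum are the support vectors.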
We used R version 4.1.2 for all our analyses.

Results and Discussion
The descriptive analysis of the attributes at the aggregate and age levels for all randomly selected patients with heart disease in the study is presented in Table 1. The table gives the numerical output of the risk factors associated with cardiovascular disease. Table 1 indicates the variability in the age proportion of the CVD-affected patients. The exploratory analysis revealed that almost 52.1% of the respondents had CVD at the aggregate level. Furthermore, there was a noticeable variation in the proportion of heart disease with respect to different factors such as gender, physical activity, smoking, and so on that correlated with CVD. For instance, a maximum of 4.25% of 60-year-old patients were estimated to have CVD, whereas a maximum of 0.19% of 45-year-old patients had it.
Figure 1 shows the gender, cholesterol level, and glucose levels for all randomly selected CVD patients in the study. The figure shows that a greater proportion of the patients had CVD. Figure 2 presents a line graph of the proportion of each gender with respect to the age of patients. The figure shows that CVD is predominant in males compared to females, since a greater proportion of the males had the disease. Moreover, the proportion of CVD patients increases from forty years to sixty-one years, which confirms the result of Gulfam Ahmad and Jasim Shah [6].
To achieve our goal, we employed a binary classifier based on supervised machine learning algorithms to predict the appropriate class of patients [55][56][57], as proposed by Ramesh et al. [58] and Boukhatem [42]. Table 2 gives the output of the predictive models that were used for the prediction of CVD.
All five ML algorithms (i.e., DT, SVM, NB, LR, and RF) were used to build the CVD prediction model in two different stages. In the initial stage, the data were split into two separate groups of 70% and 30% for training and validation, respectively. In the second stage, the data were split into 75% and 25% for training and validation, respectively. The RF model had the highest accuracy of 85.01% with a 95% confidence interval of (0.6608, 0.8043), followed by DT with 83.72% accuracy and a 95% confidence interval of (0.654, 0.7986). The SVM and LR algorithms had the same accuracy of 83.08% with a 95% confidence interval of (0.654, 0.7986). The NB had the lowest accuracy of 74.74% with a 95% confidence interval of (0.567, 0.7221). This shows that the RF algorithm is the best predictor of CVD patients. Our outcome confirms the results obtained by the authors in [6,[55][56][57][58].
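The two splitting stages can be sketched in Python as follows. The helper name, the seed, and the use of rounding to set the cut point are assumptions for illustration; only the sample size of 518 and the split fractions come from the text.

```python
import random

def train_validation_split(n_rows, train_fraction, seed=0):
    """Shuffle row indices and cut them into training and validation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

n_patients = 518  # sample size reported in the study

# Stage 1: 70% training, 30% validation.
train_70, valid_30 = train_validation_split(n_patients, 0.70)
# Stage 2: 75% training, 25% validation.
train_75, valid_25 = train_validation_split(n_patients, 0.75)
```

Shuffling before cutting keeps the split random, and fixing the seed makes it reproducible so that all five algorithms can be compared on the same validation rows.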
Health & Social Care in the Community
Sensitivity, mathematically defined as the ratio of the number of true-positive patients to the sum of the number of true-positive and false-negative patients, was used to find the proportion of patients suffering from CVD who were correctly identified [59,60]. Similarly, the specificity is described with respect to respondents who are not affected by cardiovascular disease. Specificity, mathematically defined as the ratio of the number of true negatives to the sum of the number of true negatives and false positives [61], was used to determine the proportion of patients not suffering from CVD who were correctly identified [62]. The RF algorithm estimated sensitivity and specificity as 86.11% and 65.48%, respectively. That is, our algorithm correctly classified 86.11% of the patients with CVD as having CVD but failed to identify the remaining 13.89%. Similarly, the test correctly classified 65.48% of patients as not having CVD, while 34.52% of them were misclassified. Although the DT was not the best in terms of accuracy of prediction, it had the highest sensitivity (90.28%). Our results confirm those of Boukhatem et al. [63].
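The two definitions above translate directly into code. In the Python sketch below, the confusion-matrix counts are illustrative values chosen only so that the ratios reproduce the percentages reported for the RF; they are an assumption and are not taken from the study's Table 3.

```python
def sensitivity(tp, fn):
    """True-positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Illustrative counts only (assumed, not the study's confusion matrix).
tp, fn, tn, fp = 62, 10, 55, 29

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
miss_rate = 1 - sens         # share of CVD patients the model fails to identify
false_alarm_rate = 1 - spec  # share of non-CVD patients flagged as CVD
```

The complements `miss_rate` and `false_alarm_rate` correspond to the 13.89% and 34.52% figures quoted in the text.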
Figure 3 shows the visualization of all ML algorithm outputs, thereby confirming the superiority of the RF.
Table 3 presents the confusion matrix of the predictive models for the 25% validation data. The confusion matrix is used to evaluate the performance of a classification algorithm by comparing the actual target values of the response variable, CVD, with the output predicted by the machine learning model. Just as expected, the RF had the best performance for all evaluation metrics of the confusion matrix. The confusion matrix essentially provides the misclassification error rates for all our ML algorithms. The misclassification error rates for the respondents who are affected were 0.087, 0.1228, 0.1719, 0.1778, and 0.1818 for the RF, DT, SVM, NB, and LR, respectively, in decreasing order of performance. Thus, the RF performed the best among all competing algorithms, while the LR had the poorest performance among them. Our results are similar to those obtained by O'Kelly et al. [64]. Furthermore, the receiver operating characteristic (ROC) curve was used to visualize the accuracy. The ROC curve assesses the performance of classification algorithms by plotting the true-positive rate against the corresponding false-positive rate, thereby measuring and highlighting the specificity and sensitivity of the classifiers. Figure 4 shows the ROC for the different classifiers.
The ROC also indicates that the RF algorithm's performance is the best among all classes of ML algorithms. The area under the ROC curve ranges from 0 to 1, where a value near 0 indicates a poor classifier, whereas a value near 1 signifies a more capable one. The ROC value is 0.8737 for the RF algorithm, which signifies good prediction and classification. The highest ROC value for the RF algorithm implies a better ability to discriminate between the classes, while the highest accuracy signifies that the algorithm predicts well, just as in [15,42,56].
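A single-number summary of the ROC curve can be computed as the probability that a randomly chosen affected patient receives a higher score than a randomly chosen unaffected one. The Python sketch below uses this pairwise formulation; the scores and labels are hypothetical, and the study's own value was obtained in R rather than with this code.

```python
def auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities and true classes (1 = CVD present).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
area = auc(scores, labels)
```

A value of 1 means every affected patient is ranked above every unaffected one, 0.5 corresponds to random ranking, and values below 0.5 indicate a ranking that is systematically reversed.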

Conclusion
Heart diseases are considered a significant concern in medical data analysis. The potential of predictive machine learning algorithms to support the doctor's judgement is essential to all stakeholders in the health sector, since it can augment the efforts of doctors to provide a better environment for patient diagnosis and treatment. This study investigated the performance of predictive ML algorithms for CVD. CVD is one of the leading causes of mortality worldwide. We used data from the Lady Reading Hospital and the Khyber Teaching Hospital in Khyber Pakhtunkhwa Province, Pakistan. Ethical approval for the inclusion of heart disease patients was sought from the Human Ethical Committees of the two teaching hospitals. Five machine learning algorithms (i.e., DT, RF, LR, NB, and SVM) were implemented for the classification and prediction of CVD. We performed exploratory analysis and experimental output analysis for all algorithms. We also estimated the confusion matrix and the receiver operating characteristic curve for all algorithms. The performance of the proposed ML algorithms was assessed under numerous conditions to identify the most suitable machine learning algorithm in the class of models. The RF algorithm had the highest accuracy of prediction, sensitivity, and receiver operating characteristic value of 85.01%, 92.11%, and 87.73%, respectively, for CVD. It also had the lowest specificity and misclassification error of 43.48% and 8.70%, respectively, for CVD. These results indicated that the RF algorithm is the most appropriate for CVD classification and prediction. Our proposed model can be implemented in health-sector settings worldwide for disease classification and prediction. It can also be implemented in other sectors with a similar function. The main limitation of the study is that detailed patient data and clinical datasets from across the globe may be required to build more powerful prediction models. To improve the accuracy of the ML models and algorithms, high-dimensional data would be more suitable. The ML algorithms used are limited to heart disease prediction studies. Future studies should explore other ML techniques for selecting significant characteristics.

(I) Divide the dataset into two subsets, that is, training and testing datasets.
(II) In the initial stage, the entire training data are considered the root.
(III) Continuous values are discretized before the model building, whereas categorical values are preferable for feature values.
(IV) Establish subsets such that each subset includes data with the aforementioned feature attributes.
(V) Finally, steps I-IV are repeated for each subset until we get the tree leaves.
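Steps II to V above can be sketched as a short recursive procedure in Python. The Gini impurity used to choose each split, the function names, and the toy records are assumptions for illustration; the text does not state which splitting criterion was used, and the study itself was carried out in R.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Find the (cost, feature, threshold) triple with the lowest weighted impurity."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            cost = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    return best

def build_tree(rows, labels):
    """Start with all rows at the root and split subsets until leaves are reached."""
    split = best_split(rows, labels)
    if len(set(labels)) == 1 or split is None:
        return {"label": Counter(labels).most_common(1)[0][0]}
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {"feature": j, "threshold": t,
            "left": build_tree(*zip(*left)), "right": build_tree(*zip(*right))}

# Hypothetical training rows: (age, systolic pressure).
rows = [(45, 120), (50, 130), (60, 150), (65, 160)]
labels = ["absence", "absence", "presence", "presence"]
tree = build_tree(rows, labels)
```

The recursion stops when a subset contains a single class or cannot be split further, which corresponds to reaching the tree leaves in step V.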

Figure 1: Bar graph with error bars for patient CVD status with gender, cholesterol level, and glucose level.

Table 1: Descriptive analysis of both response and predictive variables at aggregate and age levels of CVD patients. CVD patient: proportion of affected CVD patients. Gender: proportion of male patients. Height: mean of height predictor. Weight: mean of weight predictor. Systolic: mean of systolic predictor. Diastolic: mean of diastolic predictor. Cholesterol level: median value of cholesterol. Smoke: proportion of smoker patients. Alcohol: proportion of alcohol patients. Physical activity: proportion of physical activity.

Table 2: An experimental output of the predictive models for CVD patients.
Figure 3: Visualization of the ML algorithm output.

Table 3: Confusion matrix for predictive models.