Supervised Machine Learning-Based Cardiovascular Disease Analysis and Prediction

Cardiovascular illness, commonly known as heart disease, encompasses a variety of conditions that affect the heart and has been the leading cause of mortality globally in recent decades. It is associated with numerous risk factors, and there is a pressing need for accurate, trustworthy, and reasonable methods of early diagnosis so that treatment can begin early. In the healthcare sector, data analysis is widely used for processing massive amounts of data. Researchers use a variety of statistical and machine learning methods to evaluate large volumes of complex medical data, assisting healthcare practitioners in predicting cardiac disease. This study covers many aspects of cardiac illness and presents a model based on supervised learning techniques: Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). It uses an existing dataset from the UCI Cleveland database of heart disease patients. The collection contains 303 instances and 76 attributes. Only 14 of these 76 attributes are considered for testing, which is necessary to validate the performance of the various methods. The purpose of this study is to forecast the likelihood of individuals developing heart disease. The findings show that logistic regression achieves the best accuracy score (92.10%).


Introduction
Cardiac disease is difficult to diagnose because many health problems contribute to it, such as diabetes, high blood pressure, high cholesterol, and an irregular pulse rate. Numerous data analysis and neural network methods have been used to determine the severity of cardiac disease in people. The severity of illness is categorized using a variety of techniques, including the K-Nearest Neighbor (KNN) algorithm, DT, Genetic Algorithm (GA), and the Naive Bayes (NB) algorithm [1,2]. Due to its complexity, cardiac disease must be treated with caution. Failure to do so may have a detrimental effect on the heart or result in early death. Medical science and statistical perspectives are used to identify different types of metabolic disorders. Data analysis with classification is essential for heart disease prediction and data research. Additionally, decision trees have been used to predict heart disease-related events [3]. Numerous approaches to knowledge abstraction have been employed in conjunction with well-established data mining techniques for heart disease diagnosis. In this study, numerous analyses have been conducted to develop a prediction model, not only using different methods but also combining two or more techniques. Data mining is the process of extracting needed information from massive databases in a variety of areas, including medicine, business, and education. Machine learning (ML) is one of the fastest-advancing fields of artificial intelligence (AI).
These algorithms are capable of analyzing massive amounts of data from a variety of areas, one of which is the medical field. ML is an alternative to the conventional prediction modeling method: it uses a computer to learn complex, nonlinear interactions between many variables by minimizing the difference between predicted and actual results [4]. Data mining is the process of sifting through massive datasets in order to extract critical decision-making information from a collection of historical records for future study. The medical profession is replete with patient data. This data must be analyzed using a variety of machine learning techniques. Healthcare experts analyze these data in order to make appropriate diagnostic decisions. Through analysis, medical data mining using classification algorithms offers therapeutic assistance. It evaluates methods for classifying patients' risk of developing heart disease [5].
Several studies have been performed, and numerous machine learning models have been deployed, all with the goal of classifying and forecasting heart disease diagnoses. Artificial neural networks (ANNs) were developed to achieve the greatest prediction accuracy possible in the medical sector [6]. ANNs are used to forecast cardiac disease via a back propagation multilayer perceptron (MLP). The resulting findings are compared to those of previously published models in the same area and found to be significantly improved [7]. Heart disease patient data from the UCI repository are used to identify patterns using neural networks (NN), DT, Support Vector Machines (SVMs), and Naive Bayes. The performance and accuracy of the various algorithms are compared. The proposed hybrid approach achieves an F-measure of 86.8%, which is comparable to other available methods [8]. A classification approach using Convolutional Neural Networks (CNNs) without segmentation has also been presented. During the training phase, this technique considers cardiac cycles with a variety of start locations derived from Electrocardiogram (ECG) data. During the testing phase, the CNN is capable of generating features with varying locations for the patient [9,10]. Previously, a significant quantity of data produced by the medical sector was not used properly. The novel methods described here reduce the cost and enhance the accuracy of heart disease prediction in a simple and efficient manner. The research approaches examined in this study for the prediction and classification of heart disease using ML and deep learning (DL) techniques achieve high accuracy, which demonstrates the effectiveness of these methods [11,12].
Golande et al. investigated a variety of machine learning methods that may be used to classify cardiac disease. They evaluated the accuracy of the DT, KNN, and K-Means algorithms for classification [13]. Their study indicates that DT achieves the greatest accuracy and that it may be made more efficient by combining various methods and tuning parameters. Nagamani et al. [14] developed a system that combined data mining methods with the MapReduce algorithm. For the 45 instances in the testing set, the accuracy achieved in that study was higher than the accuracy obtained using a typical fuzzy artificial neural network. The use of dynamic schema and linear scaling increased the accuracy of the method in this case. Alotaibi developed a machine learning model that compares five distinct methods [15]. RapidMiner was employed, which provided a better level of accuracy than MATLAB and Weka. In that study, the classification algorithms DT, LR, NB, and SVM were compared for accuracy. The decision tree algorithm was the most accurate. Repaka et al. developed a system [16] for illness prediction that combines NB (Naive Bayes) methods for dataset classification and AES (Advanced Encryption Standard) for secure data transmission. Thomas and Princy conducted a study comparing several classification algorithms used for heart disease prediction. The classification methods used were Naive Bayes, KNN, DT, and Neural Network, and the accuracy of the classifiers was evaluated over a range of attribute counts [17]. Lutimath et al. used Naive Bayes classification and SVM to predict cardiac disease. The performance metrics used in that study are the Mean Absolute Error, the Sum of Squared Error, and the Root Mean Square Error. SVM was shown to outperform Naive Bayes in terms of accuracy [18]. Other authors proposed an RNN-based prediction of the risk of depression based on ECG [19].
Cardiac disease may be cured if diagnosed early, but early diagnosis is not always achieved. We must learn more about a few heart disease markers if we want to avert major harm. The primary goals of this project are to analyze data on these markers and to use three machine learning classification algorithms to predict cardiac disease. The method with the greatest accuracy will be chosen.
After analyzing the aforementioned studies, the main objective of the proposed system was to develop a computer-aided diagnostic system using the inputs listed in Table 1. We compared the accuracy, precision, recall, and F1-scores of three classification algorithms: DT, RF, and LR, which were found to be the best classification methods for heart disease prediction. The novelty of this research lies in employing multiple methods and reaching 92% accuracy, which is higher than that reported in the prior publications. The remainder of the paper is divided into three parts. The methodology and methods are presented in Section 2, the findings and analysis are presented in Section 3, and the conclusion and future scope are presented in Section 4.

Methodology and Methods
This section includes details on the methods and materials used, as well as a dataset description, a schematic diagram, the machine learning algorithms, and the evaluation metrics.

Dataset.
The Heart Disease dataset is a compilation of four distinct databases, of which only the UCI Cleveland dataset has been used here [20]. This database has 76 attributes in total, but all published studies use a subset of just 14 of them [21]. For our study, we therefore used the previously processed UCI Cleveland dataset available on the "Kaggle" website. Table 1 gives a detailed explanation of the 14 attributes used in the proposed study.
The target column contains a total of 165 records with cardiac disease and 138 records without cardiac disease. Figure 1 shows the visualization of the target column.
If duplicate and null values are not checked and handled, the model's ability to generalize suffers. If duplicates are not handled effectively, the same record may appear in both the test and training datasets. As a result, during the preprocessing phase, all duplicate data were eliminated from the dataset.
This dataset has no missing data: as shown in Figure 2, the count of missing values is 0 for every attribute in the dataset.
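The duplicate removal and missing-value check described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' preprocessing code: the rows, the column subset, and the helper names `drop_duplicates` and `missing_counts` are hypothetical, and the toy data deliberately contain one duplicate and one missing value so that both checks have something to find.

```python
# Toy records standing in for rows of the heart disease table (hypothetical values).
rows = [
    {"age": 63, "chol": 233, "target": 1},
    {"age": 37, "chol": 250, "target": 1},
    {"age": 63, "chol": 233, "target": 1},   # exact duplicate of the first row
    {"age": 57, "chol": None, "target": 0},  # one missing value
]

def drop_duplicates(records):
    """Keep the first occurrence of each identical record, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def missing_counts(records):
    """Count missing (None) entries per column."""
    counts = {}
    for rec in records:
        for col, value in rec.items():
            counts[col] = counts.get(col, 0) + (value is None)
    return counts

deduplicated = drop_duplicates(rows)
print(len(rows), "->", len(deduplicated))
print(missing_counts(deduplicated))
```

Removing duplicates before the train/test split matters because a record that appears on both sides of the split lets the model be scored on data it has already seen.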

Schematic Diagram of the System.
The proposed study predicts heart disease by examining the three classification methods listed above and analyzing their performance. The goal of this research is to accurately predict whether or not a patient has heart disease. The health professional enters the input values from the patient's health report. The data are fed into a model that forecasts the chance of developing heart disease. Figure 3 depicts the system's schematic diagram. The attributes listed in Table 1 are used as inputs for the classification methods: Random Forest, Decision Tree, and Logistic Regression. The input dataset is divided into a training dataset (80%) and a test dataset (20%). The training dataset is the collection of data used to train a model. The test dataset is used to evaluate the trained model's performance. The performance of each method is measured and analyzed using a variety of metrics, including accuracy, precision, recall, and F1-score, as discussed below.
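The 80/20 division described above can be illustrated with a short sketch. The shuffling, the fixed seed, and the helper name `train_test_split` are assumptions made for this example; the text does not say how the records were assigned to the two sets.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split off the test fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(303))  # stand-in for 303 patient rows
train, test = train_test_split(records)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so that different classifiers can be compared on the same held-out records.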

Machine Learning Algorithms.
The target attribute is the dataset's last column and holds the class label. The dataset has a binary classification, meaning that there are two classes (0, 1): "0" indicates a low risk of heart disease, whereas "1" indicates a high risk of heart disease. The value "0" or "1" is determined by the other 13 attributes.
Random Forest is a technique used for both classification and regression. It constructs trees from the data and then makes predictions using those trees. The RF technique is capable of processing enormous datasets and can produce the same result even when substantial portions of the record values are missing. The samples produced by the decision trees may be stored and used on additional data. The algorithm has two stages: first, generating the random forest and, second, making a prediction using the Random Forest classifier built in the first stage. Figure 4 shows the schematic diagram of the Random Forest algorithm. RF is a decision tree-based method. Because it combines numerous separate decision trees, it is generally more accurate and reliable than a single tree. The random selection of samples and features, as well as the integration procedures, gives a Random Forest an edge over a DT. A Random Forest resists overfitting better than a single tree, although a single tree can fit the training data more closely. Random Forest uses the DT as the base model for bagging.
The DT method is represented visually as a flowchart, with the internal nodes representing the dataset's attributes and the outer branches representing the outcomes. Decision trees are selected because they are quick, dependable, and simple to read and require little data preparation. In a DT, the prediction of the class label begins at the tree's root. The root attribute's value is compared to the record's attribute. The branch matching that value is followed, and a move is made to another node based on the outcome of the comparison.
Figure 5 shows the schematic diagram of the decision tree algorithm.
Strategic splits have a big influence on a decision tree's accuracy. Decision trees use different criteria to decide where to split. Creating subnodes increases the homogeneity of the resulting subnodes; in other words, the purity of a node increases with respect to the target variable. A DT is simple to grasp and can handle both numerical and categorical data.
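One common way to quantify node purity is the Gini impurity. The text does not state which split criterion was used, so the sketch below is only an illustrative assumption, and the helper names `gini` and `split_impurity` are hypothetical. It shows that a split which separates the classes more cleanly yields a lower weighted impurity.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 for a balanced binary node."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    total = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / total

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                          # balanced parent node
print(split_impurity([0, 0, 0], [1, 1, 1]))  # perfectly separating split
print(split_impurity([0, 0, 1], [0, 1, 1]))  # weaker split
```

A tree-growing procedure built on this criterion would evaluate candidate splits and keep the one with the lowest weighted impurity.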
LR is a statistical approach that is often used to solve binary classification problems. Rather than fitting a straight line or hyperplane, logistic regression employs the logistic function to constrain the output of a linear equation to the range of 0 to 1. With 13 independent variables and a binary target, logistic regression is well suited to this classification task. Figure 6 shows the schematic diagram of the logistic regression algorithm.
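The way the logistic function squeezes a linear combination of inputs into the range 0 to 1 can be sketched as follows. The weights, bias, feature values, and the 0.5 decision threshold are hypothetical choices for illustration and are not fitted to the heart disease data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1 for one record under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for two scaled features, not fitted to any data.
weights, bias = [0.8, -1.2], 0.1
p = predict_proba([1.5, 0.5], weights, bias)
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

In the setting of this study, the feature vector would hold the 13 input attributes, and the predicted label would correspond to the target classes 0 and 1.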

Block Diagram of the Confusion Matrix.
A confusion matrix is a technique for describing the performance of a classification system. The numbers of correct and incorrect predictions are summarized as count values and broken down by class. This is the key to the confusion matrix. The block diagram of the confusion matrix is shown in Figure 7.
It reveals not only the errors made by the classifier but also the kinds of errors committed. The cell at the expected row and the predicted column of the same class holds the total number of correct predictions for that class. Similarly, the cells where the expected row and the predicted column belong to different classes hold the total number of incorrect predictions.
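As a small worked example of how the four cells of a binary confusion matrix lead to the metrics used in this study, the sketch below counts true positives, false positives, false negatives, and true negatives and derives accuracy, precision, recall, and F1-score from them. The ten labels are hypothetical and the function names are chosen only for this illustration.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

def scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical labels for ten patients (1 = heart disease, 0 = healthy).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)
print(scores(tp, fp, fn, tn))
```

The false negatives (patients with heart disease predicted as healthy) are usually the costlier error in a diagnostic setting, which is why recall is reported alongside accuracy.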

Result and Data Analysis
This section discusses the capabilities of the models, the model predictions, the analysis, and the final outcomes. A histogram consists of rectangles whose areas are proportional to the frequencies of the corresponding classes. The rectangles are all adjacent because their bases cover the intervals between class boundaries. The heights of the rectangles are proportional to the corresponding class frequencies or, for classes of different widths, to the frequency densities. Figure 8 depicts the distribution of age, blood pressure, cholesterol, heart rate, and old peak. Figure 9 depicts the cardiac state of people of various ages. The likelihood of developing cardiovascular disease rises with age. Target 0 indicates that the individual is healthy, whereas target 1 indicates that the individual has cardiac disease. Figure 10 depicts the illness status by gender. The graph illustrates that men are more likely than women to get cardiovascular disease. The probability distribution of four distinct attributes is shown in Figure 11. Figure 11 shows that cholesterol level, blood pressure level, age, and maximal heart rate are not uniformly distributed. These distributions will need to be addressed in order to prevent overfitting or underfitting of the data. In addition, cholesterol is an essential factor in the study of heart disease. Table 2 shows the classification results of the three models.

Model Accuracy.
According to Table 2, LR outperformed the other algorithms in terms of accuracy. The RF also performed well in terms of accuracy. The performance of the DT, on the other hand, is considerably lower. The precision, recall, F1-score, and accuracy of the Random Forest algorithm are 77%, 87%, 82%, and 80%, respectively. The precision, recall, F1-score, and accuracy of the logistic regression algorithm are 92%, 92%, 92%, and 92%, respectively. Figure 12 depicts the confusion matrix of the RF classifier, which attained an accuracy of 80%.
Mathematical Problems in Engineering
Figure 12 illustrates that the RF classifier correctly predicts 37 data points and wrongly predicts 9 data points. Figure 13 depicts the ROC (receiver operating characteristic) curve of the prediction. The Random Forest classifier has an AUC (area under the curve) of 88%. The confusion matrix of the decision tree algorithm is shown in Figure 14. Figure 14 shows the DT classifier correctly predicting 33 data points and wrongly predicting 13 data points. Figure 15 depicts the ROC curve of the prediction. The DT classifier has an area under the curve of 72%. Figure 16 depicts the LR algorithm's confusion matrix. Figure 16 demonstrates that the logistic regression classifier correctly predicts 70 data points and wrongly predicts 6 data points. Figure 17 depicts the ROC curve of the prediction. For the LR classifier, the area under the curve is 95%. Table 3 compares the models to those in previous research articles. It clearly shows that logistic regression, with the highest accuracy, is the best of the models considered in this framework.
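The accuracy figures can be cross-checked against the confusion-matrix counts quoted above, since accuracy is the number of correct predictions divided by the total number of predictions. The sketch below does only this arithmetic; the totals it prints are the sums of the reported correct and wrong counts, not figures stated separately in the text.

```python
def accuracy(correct, wrong):
    """Share of correctly classified test points."""
    return correct / (correct + wrong)

# Counts as reported for Figures 12, 14 and 16.
reported = {"RF": (37, 9), "DT": (33, 13), "LR": (70, 6)}
for name, (correct, wrong) in reported.items():
    total = correct + wrong
    print(f"{name}: {total} test points, accuracy {accuracy(correct, wrong):.1%}")
```

Rounded to whole percentages, the three ratios agree with the accuracies of 80%, 72%, and 92% quoted for RF, DT, and LR.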

Conclusions
Three machine learning techniques are presented in this work, and their comparative assessment is described. The goal of the article was to determine which machine learning classifier would be the most effective in predicting heart disease based on the dataset used. Three classifiers were built, and their results were compared. The comparison approaches used include the confusion matrix, accuracy, specificity, and sensitivity. For the 14 variables in the sample, the LR classifier performed best. The logistic regression technique outperformed the other two classifiers employed, with an accuracy of 92%. The RF classifier had an accuracy of 80%, whereas the DT classifier had an accuracy of 72%. This idea has the potential to be a game changer in the medical field. Patients at risk of heart disease might be recognized quickly with this method, which could help to lower the rising death rate. The attributes in the dataset that the prediction model is built on are not prohibitively costly to record. As a result, this kind of diagnostics may be made accessible to patients at a reasonable cost, allowing it to reach a considerably larger number of people. This kind of diagnosis will become more common in the future as machine learning algorithms improve as a result of continuous research. If additional patient information is used, the model may be refined and adjusted. A bigger dataset is likely to yield more precise and accurate findings. This is critical since medical diagnosis is a very delicate problem that requires high degrees of accuracy and precision. A web application that integrates these methods and uses a larger dataset than the one used in this study might be developed in the future. As a result, healthcare providers will be better able to predict and treat cardiac abnormalities with more precision and efficiency. This will improve the framework's reliability as well as its presentation.

Data Availability
The data used to support the findings of this research are accessible online at https://www.kaggle.com/ronitf/heartdisease-uci.

Conflicts of Interest
The authors declare no conflicts of interest regarding the present study.