Computer-Aided Diagnostics of Heart Disease Risk Prediction Using Boosting Support Vector Machine

Heart disease is a leading cause of death worldwide and has attracted considerable interest in the scientific community. Because of the high number of sudden deaths associated with it, early detection is critical. This study proposes a boosting Support Vector Machine (SVM) technique as the backbone of computer-aided diagnostic tools for forecasting heart disease risk levels more accurately. The dataset, which contains 13 attributes such as gender, age, blood pressure, and chest pain, is taken from the Cleveland Clinic. In total, there were 303 records, 6 of which had missing values. To clean the data, we deleted these 6 records using the listwise technique. Given the size of the data and the fact that the deleted records form a purely random subset, this approach introduced no bias and had no significant effect on the experiment. Salient features are selected using the boosting technique to speed up training and improve accuracy. The data are then partitioned into training and testing sets using the train/test split approach, and an SVM is trained and tested on them. The C parameter is set to 0.05 and the linear kernel function is used. Logistic regression, Naive Bayes, decision trees, Multilayer Perceptron, and random forest were used for comparison. The proposed boosting SVM performed exceptionally well, making it a better tool than the existing techniques.


Introduction
Heart disease refers to a variety of conditions that affect the heart, from infections to genetic defects and blood vessel diseases. These conditions are among the top causes of death globally for all races. In 2016, about 28.2 million adults in the United States were diagnosed with heart disease [1], and in 2015 nearly 634,000 people died of it [2], making it the foremost cause of death. According to the American Heart Association, a nonprofit organization that funds cardiovascular medical research, one American has a heart attack every 40 seconds [3]. According to these data, there are 720,000 new heart attacks and 335,000 recurrent attacks in the United States each year. The pattern of heart or cardiovascular disease (CVD)-related morbidity and mortality has been particularly striking in Sub-Saharan Africa, an area thought to have the world's youngest population. Sub-Saharan Africa was the only region in the world where heart disease-related fatalities increased between 1990 and 2013 [4]. The World Health Organization (WHO), for example, has listed heart disease as one of the top two causes of death in Ghana, after diarrheal infections [5]. In 2008, heart disease was the leading cause of death in Ghana among all noncommunicable diseases (NCDs) and the major cause of institutional deaths, accounting for 14.5 percent of all deaths reported [6].
Traditionally, a patient's knowledge of his or her heart condition depended on the doctor's judgment. Before ordering any test, the doctor will typically perform a few physical checks and question the patient about his or her medical history, regardless of the severity of the cardiac problem. In addition to blood tests and chest X-rays, a heart disease diagnosis may involve an electrocardiogram (ECG), which records electrical signals that help reveal anomalies in the heart's rhythm and structure. Holter monitoring, echocardiogram, stress test, cardiac catheterization, cardiac computerized tomography (CT) scan, and cardiac magnetic resonance imaging (MRI) are some of the other tests. A Holter monitor is a small, wearable device that records an ECG over a 24- to 72-hour period. Holter monitoring detects heart rhythm abnormalities that are not noticeable on a standard ECG. The echocardiogram uses ultrasound to produce detailed images of the heart's structure and function. A stress test, often known as a treadmill test or an exercise test, is used by doctors to determine how well the patient's heart can endure workload. For this test, the patient engages in physical activity or takes drugs to raise the heart rate. After that, the examination is carried out and images of the heart are taken to assess its actual condition. When a patient asks a doctor whether he or she has heart disease, the standard procedure is for the doctor to assess the likelihood based on risk factors. Age, diabetes, smoking, high blood pressure, being male, and cholesterol are all significant risk factors. According to previous studies, nearly half of those who had coronary attacks had two risk factors: being male and being over 60 [7]. It is therefore encouraging that technology has made early diagnosis and risk assessment straightforward before people develop the disease.
Owing to the increased risk of heart disease and the fact that current research points toward computer-assisted approaches, this study makes two contributions. First, we offer a better algorithm that enhances diagnosis; second, we show that the proposed method outperforms earlier techniques by demonstrating its real implementation. Tables 1, 2, 3, and 4 and Figure 1 show that the proposed method is superior to earlier methods. The remaining part of the study is structured as follows: previous related studies and their challenges are presented in Section 2. The proposed technique, the data preprocessing, and the previous algorithms employed to solve the problem are discussed in Section 3. The results of the study are discussed in Section 4. Finally, conclusions are drawn in Section 5.

Related Studies
Several methods have been used to predict the risk of developing heart disease. Genetic algorithms, for example, have been used in a variety of applications. According to [8], the neurofuzzy system combines neuroadaptive capability and fuzzy logic reasoning to predict the heart disease risk level. Genetic algorithms are generally used for weight optimization when training the model, but they have a serious drawback: they do not guarantee an optimal solution; hence, the weight optimization may not be completely accurate. Compared with SVM, Naive Bayes, decision tree, and random forest, genetic algorithms are more complicated to implement and require a large number of parameters to be set in order to achieve a result that is close to optimal. As a result, for small datasets like the Cleveland dataset used in this investigation, the genetic algorithm is not appropriate. The Iterative Dichotomiser 3 (ID3) algorithm, a type of decision tree building algorithm [9], is a relatively simple algorithm that has proven effective in other areas but has the drawback of handling only categorical data, so it cannot be applied to the Cleveland dataset, which also contains missing values. If the sample is small, this approach is prone to overfitting. As a result, it cannot be used for this research.
Deep neural networks [10], which have shown strong prediction performance, were also excluded from this study because what is learned by deep neural nets is difficult to interpret. Furthermore, because learning is progressive, deep neural nets require a large amount of data for training [11]. The proposed boosting SVM algorithm performed well when compared with random forest, logistic regression, Naive Bayes, neural networks, and decision trees. On small datasets, these approaches are among the best-performing algorithms, and they are also much easier to understand.
Miranda et al. [12] used the Naive Bayes algorithm to predict heart disease and examined the related risk levels for adults. In their study, blood and urine test results from the clinical laboratory were used as training datasets. The difficulty with that study is that the authors did not explore ECG and echocardiography analysis, both of which are crucial in detecting cardiovascular diseases, and the accuracy of 80% obtained is comparatively poor. Moreover, since all attributes in Naive Bayes are assumed to be mutually independent, using this classifier to predict heart disease is challenging, because finding a set of predictors that are totally independent of one another is extremely difficult in practice.
In addition, neural networks are widely employed [13,16]. To predict cardiovascular heart disease, Nandy et al. [14] employed a swarm-artificial neural network. The goal of the research was to increase accuracy. While the study's findings were promising, the accuracy of 95.78% leaves room for improvement, especially when compared with the method proposed in this study. Sayad and Halkarnikar [17] proposed a data mining and artificial neural network-based detection approach for cardiac disease. A multilayer perceptron neural network (MLPNN) and a backpropagation algorithm were used in this investigation. The remaining dataset was separated into two parts after preprocessing. The MLPNN with backpropagation achieved 92% accuracy, which is below average. Kim and Kang [18] developed a neural network-based technique for predicting the risk of heart disease using the Korea National Health and Nutritional Examination Survey (KNHANES-VI) dataset [19]. This method consists of two steps: a feature sensitivity-based feature selection, followed by a neural network-based prediction model. Out of 4146 people, 3031 were judged to be at low risk, whereas 1115 were found to be at high risk. Dutta et al. [20] suggested a convolutional neural network for predicting heart disease by classifying clinical data that were highly class-imbalanced. The study's findings, however, were not encouraging. While neural networks are gaining popularity and appear to be realistic, they suffer from overfitting and high time complexity. When dimensionality is low, neural networks also fail to converge.
Table 4: Performances of different methods on the Cleveland dataset.
Author | Method | Accuracy (%)
[34] | Novel KNN | 93
Gupta et al. [35] | Naive Bayes | 88.16
Saini et al. [36] | Hybrid classifier with weighted voting (HCWV) | 82.54
Abdeldjouad et al. [37] | GFS-logicboost-C | 94.17
Motarwar et al. [38] | AdaBoost | 80.32
Alotaibi [39] | Decision tree | 93.19
Gupta et al. [40] | Ensemble of Naive Bayes, AdaBoost, and boosted tree | 87.97
Proposed method | Boosting SVM | 99.92
Computational Intelligence and Neuroscience
Random forest has likewise been employed in various investigations [21]. Javeed et al. [22] used the Cleveland dataset to construct a random search algorithm (RSA) for feature selection and a random forest model for heart failure prediction. The grid search method was applied to optimize the proposed diagnostic system. Two types of experiments were conducted to determine the accuracy of the approach. The first builds only a random forest model, whereas the second builds the RSA-based random forest model. The method achieved a classification accuracy of 93.33%, which is not particularly impressive. Jabbar et al. [23] also proposed a random forest-based classification, with feature selection by chi-square and genetic algorithm, to predict the risk of heart disease on the Cleveland dataset. The technique outperformed other methods such as Naive Bayes, decision tree, and neural networks. However, the study's accuracy was only 84%, which limits its value for actual deployment. Decision tree prediction for heart disease has also been proposed [24,25]. Decision trees, however, do not work well with the missing attributes in the Cleveland dataset unless these are treated with considerable care, which makes the outcome inaccurate. Logistic regression techniques are very common in the prediction of cardiac disorders. For example, Soleimani and Neshati [26] used three logistic regression models with 28 features to predict heart disease risk using 711 patient records with factors such as severe chest pain, back pain, cold chills, shortness of breath, nausea, and vomiting. However, the study's accuracy of 94.9% was not particularly noteworthy.
Support Vector Machines (SVMs) have also become highly popular. An SVM with sequential minimal optimization was investigated in 2015, with prediction accuracies ranging from 82% to 90%, which was not promising. However, new research into SVM algorithms is yielding better results. Harimoorthy and Thangavelu [27], for example, recently used the SVM radial basis kernel approach in RStudio to predict heart disease with 98.7% accuracy.
Encouraged by these favorable results with SVM, we investigated how to improve the technique further in the proposed study.

Datasets Description.
The Cleveland dataset was used in this study. It is a Cleveland Clinic Foundation dataset containing 14 variables related to patients' vital signs in relation to heart disease. Thirteen of the fourteen attributes are used as predictor variables, and the remaining attribute is used as the target or predicted class. The 13 predictor variables are sex, age, type of chest pain, serum cholesterol, resting blood pressure, fasting blood sugar, resting electrocardiography, maximum heart rate, exercise-induced angina, ST depression, slope of the ST segment, thallium test result, and number of vessels colored by fluoroscopy; the target is the diagnosis. There were 303 records in total, 6 of which had missing values. The 303 records were reduced to 297 by deleting these 6 tuples using the listwise method. Given the size of the data and the fact that the deleted records are a purely random subset, this method introduced no bias and had no significant effect on the rest of the data used for the experiment. Table 5 contains descriptions of the datasets.
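As a minimal sketch of the listwise deletion step, the Python snippet below drops every record that has at least one missing field. The record layout, the toy values, and the use of `None` as the missing-value marker are illustrative assumptions, not the study's actual preprocessing code.

```python
# Sketch of listwise deletion: a record is kept only if none of its fields is missing.
# Assumption: missing entries are encoded as None (the real encoding may differ).

def listwise_delete(records):
    """Return only the records in which no field is None."""
    return [row for row in records if all(value is not None for value in row)]

# Made-up toy records: (age, sex, number of vessels, thallium result)
toy_records = [
    (63, 1, 0, 6),
    (67, 1, None, 3),   # missing number of vessels, so the whole record is dropped
    (41, 0, 0, None),   # missing thallium result, so the whole record is dropped
    (56, 1, 0, 3),
]

complete = listwise_delete(toy_records)
print(len(toy_records), "records reduced to", len(complete))
```

Applied to the full dataset, the same filter would reduce the 303 records to the 297 complete ones.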

The Proposed Framework.
The proposed framework for the study is shown in Figure 2.
The framework illustrates the whole methodology of the proposed technique. The explanations are as follows.

Feature Importance Estimation.
The feature importance score assigns a numerical value to each data feature; the higher the score, the more significant the feature is to the output variable. We extracted the top features of the dataset using the Extra Tree Classifier. Within a single decision tree, importance is measured by the amount that each attribute's split point improves the performance measure, weighted by the number of observations the node is responsible for. The purity (Gini index) was used to choose the split points. The importance of each attribute is then summed across all decision trees in the model. The Gini index is used as presented in Algorithm 1. The entire method is designed to maximize purity in each split. Purity, the degree to which the groupings are homogeneous, is defined in (1):
$$\mathrm{Gini} = 1 - \sum_{j} p_j^{2} \qquad (1)$$
where $p_j$ is the probability of an object being classified into the class with label $j$. Figure 3 shows the degree of importance of each feature.
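To make the importance computation concrete, the following Python sketch evaluates the Gini index of a node and the impurity decrease of one split, weighted by the share of observations at that node. It assumes the usual form Gini = 1 - sum of squared class probabilities; the function names are hypothetical and this is not the Extra Tree Classifier's internal code.

```python
from collections import Counter

def gini(labels):
    """Gini index 1 - sum_j p_j^2 of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the fraction of all samples at the node."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)

# A split that separates the two classes perfectly removes all impurity at the node.
print(weighted_impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1], n_total=4))
```

Summing such weighted decreases over every node that splits on a given feature, and then over all trees, gives that feature's importance score under these assumptions.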

Feature Correlation Matrix.
Correlation describes how features are related to one another. A heatmap makes it simple to see which features are most closely associated with the target variable. Using the seaborn library, we created a heatmap of correlated features. Pearson's correlation coefficient, which measures how strongly two numerical sequences are linearly related, was used in this study. We plotted Pearson's heatmap to see the correlation of the independent variables. With AdaBoost as the feature selection algorithm, only features with an absolute correlation above 0.5 were selected. The seaborn functions automatically perform the statistical estimation required to complete the operation. The cells in deep blue in Figure 4 show the highest correlations, namely, max. heart rate with age and ST depression with max. heart rate, indicating that both "age" and "max. heart rate" will play a significant role in predicting heart disease.
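The thresholding rule can be sketched in plain Python as follows. The snippet computes Pearson's coefficient directly and keeps the features whose absolute correlation exceeds 0.5. It assumes the correlation is taken between each feature and the target; the feature names and values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence has no linear relationship to anything
    return cov / math.sqrt(var_x * var_y)

def select_by_correlation(features, target, threshold=0.5):
    """Keep the names of features whose |r| with the target exceeds the threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) > threshold]

# Toy data (hypothetical values, not taken from the Cleveland dataset)
toy_features = {
    "max_heart_rate": [170, 165, 120, 130, 125, 160],
    "unrelated": [1, 2, 1, 2, 1, 2],
}
toy_target = [0, 0, 1, 1, 1, 0]
selected = select_by_correlation(toy_features, toy_target)
print(selected)
```

Taking the absolute value matters here: a strongly negative correlation is as informative as a strongly positive one.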

Boosting SVM Classification.
Boosting is an ensemble meta-algorithm that, in essence, reduces bias in machine learning algorithms and upgrades weak learners to strong learners. The goal of the boosting strategy is to enhance prediction accuracy. The adaptive boosting algorithm used is described as follows. Let $p$ and $g$ denote the numbers of positive and negative samples, respectively, and let each sample be $(S_i, y_i)$, where $y_i \in \{\pm 1\}$ is the corresponding class label. The feature selection algorithm is formulated as follows:
Step 1: initialize the sample distribution by weighting every training sample equally, such that the initial weights are $w_{1,i} = 1/(2p)$ for $y_i = 1$ and $w_{1,i} = 1/(2g)$ for $y_i = -1$. For each iteration $t = 1, 2, \ldots, T$, where $T$ is the final iteration, execute the following.
Step 2: normalize $w_{t,i} \leftarrow w_{t,i} / \sum_{i=1}^{N} w_{t,i}$, so that $w_t$ is a probability distribution, where $N$ is the total number of samples.
Step 3: train a weak classifier $h_t$ for each feature $j$, where each classifier uses a single feature. The training error $\xi_t$ is estimated with respect to $w_t$ as stated in the following equation:
$$\xi_t = \sum_{i=1}^{N} w_{t,i}\, \mathbf{1}\left[h_t(S_i) \neq y_i\right].$$
Step 4: select the hypothesis $h_t^1$ with the most discriminating information, that is, the hypothesis with the least classification error $\xi_t^1$ on the weighted samples.
Step 5: compute the weight $\omega_t$ that weights $h_t^1$ by its classification performance as in the following equation:
$$\omega_t = \frac{1}{2} \ln \frac{1 - \xi_t^1}{\xi_t^1}.$$
Step 6: the weight distribution is then updated and normalized with the following equation:
$$w_{t+1,i} = \frac{w_{t,i} \exp\left(-\omega_t y_i h_t^1(S_i)\right)}{\sum_{k=1}^{N} w_{t,k} \exp\left(-\omega_t y_k h_t^1(S_k)\right)}.$$
Step 7: the final feature selection hypothesis $H(S)$, which is a function of the selected features, is denoted by the following equation:
$$H(S) = \mathrm{sign}\left(\sum_{t=1}^{T} \omega_t h_t^1(S)\right).$$
The input to the SVM is the Cleveland training set, represented by $(y_1, x_1), \ldots, (y_N, x_N)$, with $N = a + b$, where $a$ samples have $y_i = +1$ and $b$ samples have $y_i = -1$. The $b$ samples correspond to the records whose target attribute is 0. The $x_i$ are the feature vectors selected by the AdaBoost algorithm, with labels $y_i$. Finding the maximal-margin separating hyperplane becomes the optimization problem
$$\min_{w, k} \; \frac{1}{2} \lVert w \rVert^2$$
subject to the constraints
$$y_i \left(w^T x_i + k\right) \geq 1, \quad i = 1, \ldots, N.$$
Since $w^T x + k = 0$ and $\lambda (w^T x + k) = 0$ define the same plane for any $\lambda \neq 0$, the scale of $w$ can be fixed so that $w^T x_{+} + k = +1$ and $w^T x_{-} + k = -1$, where $x_{+}$ and $x_{-}$ are the respective positive and negative support vectors. The margin is then denoted by the following equation:
$$\text{margin} = \frac{2}{\lVert w \rVert}.$$
The optimal plane is found by solving the convex quadratic programming problem
$$\min_{w, k, \zeta} \; \frac{1}{2} \lVert w \rVert^2 + c \sum_{i=1}^{N} \zeta_i \quad \text{subject to} \quad y_i \left(w^T x_i + k\right) \geq 1 - \zeta_i, \;\; \zeta_i \geq 0,$$
for $i = 1, \ldots, N$, where $c = 0.05$ is the regularization parameter and the $\zeta_i$ are slack variables. The decision boundary of the classifier is expressed as the sum over the support vectors in the following equation:
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{N} \alpha_i y_i Q(x_i, x) + k\right),$$
where $x_i$ is the support vector data, $\alpha_i$ is the Lagrange multiplier, and $y_i$ is the label of the membership class $(+1, -1)$, with $i = 1, 2, 3, \ldots, N$. The product $Q(x_i, x)$ represents a linear kernel function, given by the following equation:
$$Q(x_i, x) = x_i^T x.$$
In general, a kernel function $Q(x_i, x) = \varphi(x_i) \cdot \varphi(x)$ computes a dot product in a transformed space, possibly of higher dimension, in which the data are more easily separable. For the linear kernel, the transformation $\varphi$ is the identity, so the data are separated in the original feature space.
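Steps 1 to 7 of the boosting-based feature selection can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the study's implementation: the weak classifiers are taken to be single-feature threshold stumps, and the weight ω is taken to have the standard discrete AdaBoost form 0.5 ln((1 - ξ)/ξ). All function names and the toy data are hypothetical.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """Single-feature weak classifier returning +1 or -1."""
    return polarity if row[feature] > threshold else -polarity

def best_stump(X, y, w, feature):
    """Step 3: lowest weighted error reachable with one threshold on one feature."""
    best_error, best_threshold, best_polarity = float("inf"), None, None
    for threshold in sorted({row[feature] for row in X}):
        for polarity in (1, -1):
            error = sum(wi for row, yi, wi in zip(X, y, w)
                        if stump_predict(row, feature, threshold, polarity) != yi)
            if error < best_error:
                best_error, best_threshold, best_polarity = error, threshold, polarity
    return best_error, best_threshold, best_polarity

def adaboost_select(X, y, rounds):
    p = sum(1 for yi in y if yi == 1)
    g = len(y) - p
    w = [1 / (2 * p) if yi == 1 else 1 / (2 * g) for yi in y]    # Step 1
    chosen = []
    for _ in range(rounds):
        total = sum(w)
        w = [wi / total for wi in w]                              # Step 2
        best = None
        for feature in range(len(X[0])):                          # Steps 3 and 4
            error, threshold, polarity = best_stump(X, y, w, feature)
            if best is None or error < best[0]:
                best = (error, feature, threshold, polarity)
        error, feature, threshold, polarity = best
        error = min(max(error, 1e-10), 1 - 1e-10)                 # keep the logarithm finite
        omega = 0.5 * math.log((1 - error) / error)               # Step 5 (assumed form)
        w = [wi * math.exp(-omega * yi * stump_predict(row, feature, threshold, polarity))
             for row, yi, wi in zip(X, y, w)]                     # Step 6 (normalized next round)
        chosen.append((feature, threshold, polarity, omega))
    return chosen

def strong_classify(chosen, row):
    """Step 7: sign of the omega-weighted vote of the selected stumps."""
    score = sum(omega * stump_predict(row, f, t, pol) for f, t, pol, omega in chosen)
    return 1 if score >= 0 else -1

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1, 5], [2, 1], [3, 4], [6, 2], [7, 5], [8, 1]]
y = [-1, -1, -1, 1, 1, 1]
chosen = adaboost_select(X, y, rounds=2)
selected_features = sorted({feature for feature, _, _, _ in chosen})
print(selected_features)
```

The indices in `selected_features` identify the columns that would then be passed to the linear-kernel SVM.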

Model Evaluation Metrics.
An important component of the study is assessing the performance of the proposed method. This is accomplished by comparing the performance of the proposed technique with that of some standard techniques using accepted measures. The confusion matrix, classification report, Receiver Operating Characteristic (ROC) curve, and Area under the Curve (AUC) were used to evaluate the model's performance. The model's training and test accuracies were also assessed.

Receiver Operating Characteristic Curve.
A Receiver Operating Characteristic curve is a graph that depicts a classification model's performance over all classification thresholds. The curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), which are defined in the following equations:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN},$$
where TP, FP, FN, and TN represent true positives, false positives, false negatives, and true negatives, respectively.
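As an illustration, the sketch below traces the points of a ROC curve from classifier scores, taking TPR = TP/(TP + FN) and FPR = FP/(FP + TN) at each decision threshold. The scores and labels are made-up toy values, and the function assumes that both classes are present.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per decision threshold, starting at (0, 0)."""
    positives = sum(1 for label in labels if label == 1)   # TP + FN
    negatives = len(labels) - positives                    # FP + TN
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

toy_scores = [0.9, 0.8, 0.35, 0.1]
toy_labels = [1, 0, 1, 0]
curve = roc_points(toy_scores, toy_labels)
print(curve)
```

Lowering the threshold can only add predicted positives, so both rates are non-decreasing along the curve.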

Area under the Curve.
The Area under the Curve (AUC) is the most widely used quantitative index of a classifier's accuracy.
The AUC is computed as follows:
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \; d(\mathrm{FPR}).$$
Generally, an area of 1 means a perfect test and an area of 0.5 represents a worthless test. The generally accepted interpretation of AUC values is displayed in Table 6.
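In practice the area is usually approximated numerically. The sketch below applies the trapezoidal rule to a list of (FPR, TPR) points. This is one common way to evaluate the AUC and is assumed here for illustration; the study may have computed it differently.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given (FPR, TPR) points, by the trapezoidal rule."""
    pts = sorted(points)  # order by FPR, then by TPR
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# The diagonal is a worthless test; the upper-left corner is a perfect one.
print(auc_trapezoid([(0.0, 0.0), (1.0, 1.0)]))
print(auc_trapezoid([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))
```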

Comparative Algorithms
Comparing SVM with Boosted SVM. A preliminary experiment was conducted using the Support Vector Machine (SVM) and the boosted SVM with the same linear kernel function to determine whether the proposed boosted SVM has significant advantages over the traditional SVM. The training and testing accuracies were 86.83% and 83.41% for SVM, against 99.92% and 99.75% for the boosting SVM. This result is statistically significant (p < 0.05). We therefore went on to compare the results of the proposed method against logistic regression, Naive Bayes, decision tree, Multilayer Perceptron, and random forest, which are extensively used in this domain.

Logistic Regression.
Logistic regression is the appropriate regression analysis when the dependent (response) variable is binary [28]. It works by combining the input variables (X) linearly, using coefficients, to predict an output variable (Y) that takes a binary value of 0 or 1. The logistic regression technique models the probability of an outcome based on the individual characteristics or input variables (X). It is represented mathematically as follows:
$$\log \frac{\pi}{1 - \pi} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m,$$
where $\pi$ indicates the probability of an event, the $\beta_i$ are the regression coefficients associated with the variables, estimated via maximum likelihood, and the $x_i$ are the input variables.
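A short Python sketch of the prediction step follows. It assumes the usual logistic form, in which the probability is the sigmoid of a linear combination of the inputs; the coefficient values are hypothetical rather than fitted to the Cleveland data.

```python
import math

def predict_probability(coefficients, intercept, x):
    """Probability of the positive class: sigmoid of the linear combination of inputs."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(coefficients, intercept, x, cutoff=0.5):
    """Binary output (0 or 1) obtained by thresholding the probability."""
    return 1 if predict_probability(coefficients, intercept, x) >= cutoff else 0

# Hypothetical coefficients for two inputs
print(predict_probability([0.8, -0.5], 0.1, [1.0, 2.0]))
print(predict_class([0.8, -0.5], 0.1, [1.0, 2.0]))
```

Fitting the coefficients by maximum likelihood is left out of the sketch; only the mapping from inputs to a probability and a class is shown.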

Naive Bayes.
A Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem with strong (naive) independence assumptions [29]. A Naive Bayes classifier can be trained very efficiently in a supervised learning setting. Bayes' rule is given in the following equation:
$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}.$$
Here, $P(H \mid X)$ is a conditional probability, that is, the likelihood of event H occurring given that X is true; $P(X \mid H)$ is the likelihood of X given that H is true; and $P(X)$ and $P(H)$ are the probabilities of observing X and H independently of each other.
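As a worked sketch of this rule for a binary hypothesis, the Python function below multiplies per-feature likelihoods (the naive independence assumption) and normalizes by the evidence P(X), expanded over H and its complement. The probabilities passed in are made-up numbers used only for illustration.

```python
import math

def posterior(prior_h, likelihoods_given_h, likelihoods_given_not_h):
    """P(H | X) by Bayes' rule, with per-feature likelihoods multiplied together."""
    joint_h = prior_h * math.prod(likelihoods_given_h)
    joint_not_h = (1 - prior_h) * math.prod(likelihoods_given_not_h)
    # The denominator is P(X), expanded over the two hypotheses H and not-H.
    return joint_h / (joint_h + joint_not_h)

# One observed feature that is four times as likely under H as under not-H
print(posterior(0.5, [0.8], [0.2]))
```

A feature that is equally likely under both hypotheses leaves the posterior equal to the prior.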

Decision Tree.
The Gini impurity approach, which measures the chance of a randomly chosen item being incorrectly classified, was used for the decision tree compared with the proposed method. The term "information gain" refers to the process of determining which attribute provides the most information about a class. The Gini impurity is calculated by summing, over the classes, the probability $p_i$ of an item having label $i$ times the probability $\sum_{k \neq i} p_k = 1 - p_i$ of a mistake in categorizing that item. The computation is given in the following equation:
$$G = \sum_{i} p_i (1 - p_i) = 1 - \sum_{i} p_i^{2},$$
where $p_i$ is the probability of an object being classified into class $i$.

Multilayer Perceptron.
The Multilayer Perceptron (MLP) network is trained using backpropagation [30], which uses data to adjust the network's weights and thresholds so as to minimize the error in its predictions on the training set. First, each unit computes its total weighted input $x_j$ using the following equation:
$$x_j = \sum_{i} y_i w_{ij},$$
where $y_i$ is the activity level of the i-th unit in the previous layer and $w_{ij}$ is the weight of the connection between the i-th and the j-th unit. Next, the unit calculates the activity $y_j$ using the sigmoid function.
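For a single unit, the forward pass and one backpropagation update can be sketched as follows. The sketch assumes a squared-error loss and plain gradient descent, and it omits the threshold (bias) term for brevity; the inputs, weights, and learning rate are hypothetical.

```python
import math

def forward(inputs, weights):
    """Total weighted input x_j followed by the sigmoid activity y_j."""
    x_j = sum(y_i * w_ij for y_i, w_ij in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-x_j))

def backprop_step(inputs, weights, target, learning_rate):
    """One gradient-descent update of the weights for E = (y_j - target)^2 / 2."""
    y_j = forward(inputs, weights)
    delta = (y_j - target) * y_j * (1.0 - y_j)   # dE/dx_j via the sigmoid derivative
    return [w_ij - learning_rate * delta * y_i for y_i, w_ij in zip(inputs, weights)]

inputs, weights, target = [1.0, 0.5], [0.2, -0.4], 1.0
before = forward(inputs, weights)
weights = backprop_step(inputs, weights, target, learning_rate=0.5)
after = forward(inputs, weights)
print(before, after)
```

After the update the unit's activity moves toward the target, which is the effect that backpropagation repeats across all units and training samples.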

Random Forest.
The training algorithm used is bagging, or bootstrap aggregating, of trees. This creates an ensemble of trees in which multiple training sets are generated by sampling with replacement, meaning that a data instance can be repeated. The algorithm is as follows.
Given a training set $X = x_1, \ldots, x_n$ with responses $Y = y_1, \ldots, y_n$, bagging repeatedly (B times) selects a random sample of the training set with replacement and fits trees to these samples: for $b = 1, \ldots, B$, sample $n$ training examples $X_b, Y_b$ and train a tree $f_b$ on them. When training is done, predictions for an unseen sample $x'$ are made by averaging the predictions of all the individual regression trees on $x'$, as stated in the following equation:
$$\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x').$$
The process above is the original tree bagging algorithm. Random forest differs in only one way: it chooses a random subset of features at each candidate split in the learning process, a technique also known as feature bagging, which reduces the correlation between estimators in an ensemble by training them on random samples of features rather than the entire feature set. The Gini impurity was employed as the split criterion because the random forest is based on decision trees and the study is a classification task.
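The bagging procedure can be illustrated with the following Python sketch. To keep it short, each tree is replaced by a one-split regression stump on a single feature, which is a simplifying assumption and not the tree learner used in the study. The bootstrap sampling and the averaging of predictions follow the description above, and the toy data are made up.

```python
import random

def fit_stump(xs, ys):
    """One-split regression stump: predicts the mean response on each side of the split."""
    split = sorted(xs)[len(xs) // 2]
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    overall = sum(ys) / len(ys)
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return lambda x: left_mean if x <= split else right_mean

def bootstrap(xs, ys, rng):
    """Draw n examples with replacement, so an instance can appear more than once."""
    indices = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in indices], [ys[i] for i in indices]

def bagging_fit(xs, ys, n_trees, seed=0):
    rng = random.Random(seed)
    return [fit_stump(*bootstrap(xs, ys, rng)) for _ in range(n_trees)]

def bagging_predict(trees, x):
    """Average of the individual predictions for an unseen sample."""
    return sum(tree(x) for tree in trees) / len(trees)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
trees = bagging_fit(xs, ys, n_trees=25)
print(bagging_predict(trees, 1), bagging_predict(trees, 8))
```

A random forest would additionally restrict each split to a random subset of the features, which this single-feature sketch cannot show.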

Results and Discussion
The results of the study are presented as follows. Table 1 shows the different models' training and testing accuracies and their processing times when run on a ∼2.2 GHz processor (4 CPUs) with 8192 MB of RAM. Table 2 shows the confusion matrices and Table 3 shows the classification report.
For each method, the value in the upper left corner is the number of true positives and the one in the upper right corner is the number of false positives. The lower right corner holds the true negatives and the lower left corner the false negatives.
Precision is the fraction of positive predictions that are correct. The upper row values represent the likelihood of heart disease, whereas the lower row values indicate the likelihood of a decision. Recall is the fraction of actual positive samples that the model correctly identifies. The F1 score, a performance-based statistical measure, is the harmonic mean of precision and recall. Figure 1 compares the [...]. Table 4 shows the performances of different methods on the Cleveland dataset. We conducted a one-way ANOVA to determine whether there is a statistically significant difference between the results of the proposed technique and those of the others, comparing boosting SVM with random forest, Multilayer Perceptron, decision tree, Naive Bayes, and logistic regression. The analysis of variance, followed by a Tukey simultaneous plot at 95% CI, shows that the corresponding means are significantly different (p < 0.05), which demonstrates that boosting SVM performs best. Tests of training speed were also conducted, and the results again show a statistically significant difference between groups (p = 0.029). A further Tukey post hoc analysis shows that the processing time of the boosting SVM was significantly smaller than that of every other technique: random forest (p = 0.041), Multilayer Perceptron (p = 0.027), decision tree (p = 0.038), Naive Bayes (p = 0.04), and logistic regression (p = 0.035). All comparisons show that the boosting SVM methodology is extremely promising. Figures 5 and 6 demonstrate the test application as a proof of concept using the boosting SVM algorithm.
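The classification-report quantities can be computed from the four confusion-matrix counts as in the sketch below. The counts in the example are hypothetical and are not taken from Table 2.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts: 6 true positives, 2 false positives, 4 false negatives, 8 true negatives
report = classification_metrics(tp=6, fp=2, fn=4, tn=8)
print(report)
```

Because the harmonic mean is pulled toward the smaller of its two arguments, a model cannot reach a high F1 score by maximizing precision or recall alone.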

Conclusion
The study emphasizes the seriousness of cardiac disease and the need to detect early warning signs. Many machine learning algorithms based on random forest, logistic regression, Multilayer Perceptron, Naive Bayes, and decision trees are being investigated in light of recent studies that call for the automatic detection of risks. This study proposed a boosting SVM technique to further investigate how to improve prediction accuracy. The technique is based on the Cleveland dataset, which has been used successfully and extensively in earlier studies. To reduce misclassification, we preprocessed the data by normalizing it and removing redundant records. The feature importance is also computed, which assigns a score to each feature in the data; the greater the score, the more relevant the feature is to the output variable. A heatmap of correlated features is also produced.
The heatmap demonstrates that the most important factors in predicting heart disease are age and maximum heart rate. Finally, classification is performed using the proposed boosting SVM. For the analysis, confusion matrices, classification reports, ROC, and AUC are all used, and the findings reveal that the proposed method performed best. The proposed method has a recognition accuracy of 99.75%, which is much higher than in previous studies. The algorithm has now been implemented and has proven to be useful. In the future, we plan to develop a new ensemble model that combines SVM and AdaBoost to improve accuracy and speed, and to release the app on both Android and iOS.

Data Availability
The data for this study are publicly available at https://archive.ics.uci.edu/ml/datasets/heart+disease.