Computational Learning Model for Prediction of Heart Disease Using Machine Learning Based on a New Regularizer

Heart diseases are characterized as heterogeneous diseases comprising multiple subtypes. Early diagnosis and prognosis of heart disease are essential to facilitate the clinical management of patients. In this research, a new computational model for predicting early heart disease is proposed. The predictive model embeds a new regularizer that decays the weights according to the standard deviation of the weight matrices, and its results are compared against those of its parents (RSD-ANN). In our experiments, RSD-ANN performed considerably better than the existing methods. The average validation accuracy computed was 96.30% using either the tenfold cross-validation or holdout method.


Introduction
Cardiovascular disease (CVD) refers to various heart disorders, including structural heart abnormalities and blood vessel blockages. Over 17 million individuals have died from cardiovascular disease, according to the World Heart Federation (WHF). Additionally, the World Health Organization (WHO) reports that cardiac disease has a greater yearly fatality rate than any other disease. Machine learning techniques have been successfully utilized in various domains of the biomedical sector [1][2][3]. Machine learning has transformed the biomedical industry, paving the way for several new methods that simplify the identification of various disease classes. Applied mathematicians have long recognized the "curse of dimensionality" as a significant barrier to using complex biomedical data. Such data is characterized by a high-dimensional feature space with relatively few data points. Reducing the feature space helps classification models be more effective [4][5]. Once the irrelevant features are eliminated, efficient classification and prediction of various diseases are possible [6][7][8][9]. Choosing an appropriate data preprocessing strategy is essential, but selecting the classification and prediction model poses a further challenge. Cardiovascular disease is highly difficult to diagnose, and the diagnosis must be made with great accuracy and proficiency. Usually, the diagnosis is based only on assumptions and the doctor's experience. However, doctors, like all humans, are not infallible, and their decisions could result in either a life-saving or a tragic outcome. Machine learning techniques have emerged to support these critical situations. Rather than relying on human knowledge alone, these methods can predict a disease accurately. Selecting the correct classification and prediction model is still considered a difficult task by many researchers. Without an appropriate model, performance in classifying and predicting cardiovascular disease declines significantly.
After developing such a model, its evaluation must be validated through a strong channel, such as a medical and research institute for heart disease detection and prediction. This study proposes a new predictive model for early heart disease detection, in which weight decay is used to shrink the influence of each data point in the weight matrix. Once weight decay has been applied, the influence of each data point is multiplied by λ to obtain the regularization term. To summarize, this paper makes the following contributions to the literature.
(1) We compare the performance of RSD-ANN with other models. Our experimental results indicate that our model achieved relatively higher results than those of previously published methods.
(2) We present the design and implementation of a regularizer based on the Relative Standard Deviation for the Artificial Neural Network (RSD-ANN) system.
(3) We study the performance of RSD-ANN in combination with the PCA transform and different existing regularizers with varying regularization parameters, and their impact on the accuracy.
This paper is organized as follows: Section 2 discusses the methods employed in the detection of heart diseases. Section 3 explains the use of the dataset for heart disease detection. In Section 4, the benefits of the regularization methods are discussed. In Section 5, the proposed architecture is discussed. Section 6 explains the training and validation process. Section 7 discusses the efficiency of the architecture through evaluation metrics and provides a comparison with the state-of-the-art methods. Finally, conclusions are drawn, and future directions are discussed in Section 8.

Related Work
For complex parameter interpretation across multiple categories of data, machine learning methods outperform statistical approaches in both accuracy and precision. Machine learning techniques produce accurate and robust predictions based on a small set of assumptions [10], and they are becoming increasingly popular. Machine learning techniques are classified as either unsupervised or supervised learning methods. A labeled training dataset is not required for the former, but it is required for the latter. Gavhane et al. [11] devised a model for analyzing and monitoring the human cardiovascular system. This model was created to detect coronary artery disease in patients. They used the Cleveland dataset. The dataset has 76 attributes and 303 instances. The authors used 13 of the 76 available attributes in the analysis. The model detected coronary artery disease using Bayes networks, Functional Trees (FT), and Support Vector Machines (SVM). SVM-based holdout tests were 83.8% accurate, whereas FT-based holdout tests were only 81.5%. The relevant attributes were selected using the Best First method. The accuracy of the FT was 84.5%, the Bayes net was 84.5%, and the SVM was 85.1%. Repaka et al. [12] used Naive Bayes to design a method for detecting cardiac disease. Due to its application of Bayes' theorem, this strategy became one of the most influential classification models. WEKA was used to implement the solution. The model had an accuracy of 86.419%. However, for larger datasets, the model's reliance on Bayes' theorem cannot be validated. Babu et al. [13] proposed several methods for classifying cardiovascular diseases. This solution was implemented using the WEKA tool. Machine learning techniques such as Bagging, Naive Bayes, and J48 were employed. The J48 technique had an accuracy of 84.35%. The Naive Bayes algorithm reached 82.31%, while the Bagging strategy reached 85.03%, making it the most accurate method of the three.
Parthiban and Srivatsa created a predictive model for diabetic patients' CVD using Naive Bayes and SVM [14]. Only 142 people out of 500 were impacted, while the remaining patients were unaffected by the condition. The Naive Bayes algorithm was accurate to 74%. Tan et al. [15] proposed a hybrid approach using wrapper-based feature selection. It included Support Vector Machines (SVM) and Genetic Algorithm (GA) techniques. To analyze the proposed approach, WEKA and LIBSVM were employed. The authors used five datasets from the UC Irvine machine learning repository. Applied to the heart disease dataset, the hybrid technique achieved its highest accuracy of 84.07%. Ordonez [16] proposed a rule-based association method for heart disease prediction. The rule set was optimized using an algorithm. This algorithm discovered association rules in the training data and validated them on a separate test dataset. The significance of the association rules in the medical field was validated using support and confidence values. This strategy resulted in the development of rules with a high degree of predictive accuracy for heart disease. Rairikar et al. [17] proposed a method for cardiac disease prediction. This system was developed using thirteen clinical characteristics extracted from a UCI heart disease dataset. Algorithms based on the Learning Vector Quantization (LVQ) technique and Artificial Neural Networks (ANNs) were used to diagnose heart disease. The proposed system achieved an accuracy of approximately 80%. Nahiduzzaman et al. [18] developed a framework for diagnosing heart disease using SVM and Multilayer Perceptron (MLP). This method achieved 80.41%. Finally, Nahar et al. [19] proposed a novel hybrid CVD prediction technique that combines Computerized Feature Selection (CFS) and Medical Feature Selection (MFS). However, in the absence of a more precise classification methodology, this strategy resulted in an inefficient model.
A multistage Convolutional Neural Network (CNN) model for diagnosing coronary heart disease was developed by Dutta et al. [20]. This model proved to be resilient to data imbalance. However, lasso regression was used in this case, which assumes a linear relationship between the input factors and the labels. It is worth noting that unbalanced data can lead to incorrect classification. Tougui et al. proposed a model for classifying and predicting heart disease using ANN and SVM in [21]. The authors achieved an accuracy of 84.7%. In addition, they utilized 12 risk factors, such as cholesterol, food, age, sex, and blood pressure, to develop a genetic neural network algorithm for predicting heart diseases. Fifty individuals from the American Heart Association volunteered to take part in the study. Though this approach has some merit, it also has a couple of drawbacks. The allocation of neural networks is a random process, and the neural network's performance is adversely affected when searching for the optimal global value. The other disadvantage is that adopting the backpropagation technique may result in the neural network failing to converge. Pathak and Arul Valan [22] proposed predictive models for cardiac disease using trees, SVM, Naive Bayes, Neural Networks, and fuzzy approaches. The F-measure, accuracy, recall, and precision of the system were all tested and found satisfactory. However, no suitable feature selection process was used in this research. The J48 classifier was shown to be the most successful among the numerous models tested in this study.
Baitharu and Pani [23] presented a study primarily concerned with healthcare decision-making that employed various techniques, including J48, IBK, VFI, Naive Bayes, Multilayer Perceptron, and ZeroR. The accuracy attained was extremely low, and the approach is, therefore, unsuitable for use in healthcare decision-making situations. Bouaziz et al. [24] presented a K-NN technique for predicting cardiac illness based on wavelet analysis. The primary goal was to detect cardiac disease with the smallest number of features possible. After the samples have been generated for each output class, the suggested Average KNN computes the Euclidean distance between them to quickly identify and pick the nearest neighbors. The method has lower accuracy and is less efficient than other methods. Sharma and Saxena [25] developed a genetic algorithm-based technique for predicting cardiac disease. The accuracy of this technique was 73.46%, based on the utilization of 14 different attributes. This approach is inefficient and unsuitable for decision-making. Using a hybrid ensemble feature selection approach, Tripathi et al. [26] proposed a method for credit scoring. The efficacy of this methodology depended purely on the selection of the most appropriate classifiers for the ensemble. Using a rough set and multilayered ensemble for classification, Tripathi et al. [27] created another hybrid credit scoring model that was implemented in the real world. Both models are well suited for use in credit scoring situations. Balogun et al. [28] suggested a technique that used four filter feature ranking (FFR) methods and fourteen filter feature subset selection (FSS) methods to determine the best filter features to use. This technique improved the predictability of the inducers and is particularly well suited for diagnosing software defects. Akintola et al. [29] investigated the impact of filter-based attribute selection approaches on predicting software defects.
The ten datasets from the NASA and Metric Data Program software repository were evaluated using the Principal Component Analysis (PCA), CFS, and Filter Subset Evaluation methods developed by the researchers. Balogun et al. [30] investigated the effects of 46 feature selection strategies using Naive Bayes and decision tree classifiers to determine their effectiveness. The use of probability-based, statistical-based, and classifier-based FFR approaches was advocated in this review. This approach is particularly well suited for discovering software flaws. Kolukisa et al. [31] established a novel adaptive and optimized ensemble technique to analyze coronary artery disease (CAD) that is both fast and accurate. Even though this strategy employed an ensemble approach for classification, it did not use any preprocessing techniques before classification.
On the other hand, Latha and Jeeva [32] looked at numerous ways of enhancing the efficiency of weak prediction approaches. For classification, their strategy used both bagging and boosting ensembles. As this survey of CVD prediction techniques shows, the dimensionality of a medical dataset is extremely high, necessitating an efficient feature selection mechanism to keep the number of attributes to a minimum, together with an efficient machine learning model to improve prediction accuracy. Additionally, there is a lack of suitable classification and prediction methodologies for accurately classifying and forecasting data on heart disease. With recent advancements in classification techniques such as ensemble learning, there may be a way to improve prediction accuracy and efficiency in the future. This study established a new predictive model for diagnosing heart disease in patients using a new regularizer. RSD-ANN has several advantages over other models, including requiring less computing time and providing higher generalization capabilities.

Dataset
The heart disease dataset used in the experiments is available on the University of California's open repository. There are 303 instances, 76 features, and 2 classes in total (absence and presence of heart disease). Each instance contains information about a patient's heart disease diagnosis and physical and biochemical constants commonly used in medical diagnosis. Notably, this is one of the most commonly used open datasets in medical machine learning papers [33]. We used only 14 features for this study, including the class attribute, which has the following distribution: the absent class has 150 instances, corresponding to 55.5% of the dataset, and the present class has 120 instances, corresponding to 44.5% of the dataset. As shown in Table 1, the class values are binary: they answer "yes" or "no" to the question of existing heart disease.
Statlog's categorical features are depicted in Figure 1 as a mosaic plot. The horizontal axis displays the values of the categorical or discrete attributes, with the number representing the number of people who have that characteristic value.
The proportion of people with and without heart disease for a particular value of a characteristic is shown as the height proportion.
There appeared to be a specific link between some characteristics and heart disease. For instance, when the slope was type 2, the risk was greater for women than for men, and there was a significant association with heart disease. Plotting the density functions of the numerical attributes illustrates their association with heart disease (Figure 2). The plots indicate that age, the maximum heart rate achieved during an exercise test (thalach), and the ST depression induced by exercise relative to rest (oldpeak) may be correlated with the presence or absence of cardiac disease.

Regularization: Control the Model Complexity
As a predictive model is trained, it may begin to memorize the data, causing the generalization error to increase. As a result, the model performs admirably on training data but dramatically worse on unknown or test data. The process of avoiding memorization is known as "regularization." The goal of regularization is to penalize the learning model so that it generalizes, performing well on previously unseen data. Various types of regularizers impose penalty terms on the prediction model. The most commonly used is the L1 (lasso) penalty, $\lambda \sum_{j=1}^{P} |\beta_j|$, where P is the total number of features in the data, β is the weight value associated with each feature, and j is the j-th row of the weight matrix. λ is the model's regularization parameter. Overfitting is reduced by increasing λ. The regularization term is multiplied by the scalar λ, which scales the overall effect of regularization. Consequently, increasing the value of λ strengthens the regularization.

Ridge Regularizer.
The term "ridge regression" refers to L2 regularization. The coefficients are regularized by shrinking them: they decrease in size when a penalty is applied. The additional penalty associated with ridge regularization is equal to the sum of the squared values of the coefficients, added to the loss function [34]: $\text{Loss} + \lambda \sum_{j=1}^{P} \beta_j^2$, where P is the number of rows of the weight matrix (one per feature), $\beta_j$ is the feature's coefficient value, and j is the j-th row of the weight matrix. The value of λ must be chosen carefully. A λ that is too large over-penalizes the coefficients, leading to underfitting. In general, L2 regularization does an excellent job of reducing overfitting.
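As a small numerical illustration of the lasso and ridge penalties described above, the following sketch evaluates both terms for a coefficient vector at several values of λ. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
import numpy as np

def l1_penalty(beta, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * np.sum(np.abs(beta))

def l2_penalty(beta, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * np.sum(np.square(beta))

# Hypothetical coefficients, used only to show how the penalties scale.
beta = np.array([0.5, -2.0, 0.0, 1.5])

for lam in (0.0, 0.1, 1.0):
    print(f"lambda={lam}: L1={l1_penalty(beta, lam):.3f}, L2={l2_penalty(beta, lam):.3f}")
```

With λ = 0 both penalties vanish, and both grow linearly in λ, which is why an overly large λ lets the penalty dominate the data-fitting term.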

Elastic Net Regularizer.
In elastic net linear regression, regression models are regularized using both the lasso and ridge techniques. By combining the two and learning from their shortcomings, the technique improves the regularization of statistical models. The elastic net addresses lasso's limitations on high-dimensional data with only a few samples. With lasso, at most "n" variables can be incorporated before saturation is reached; the elastic net removes this limit. If the variables are highly correlated, lasso will usually pick one from each group and ignore the rest. To increase the versatility of the elastic net, a quadratic expression ($\|\beta\|^2$) is added to the penalty (a type of ridge regression when applied in isolation); this quadratic expression makes the loss function strictly convex. Elastic net is a hybrid of ridge regression and lasso regularization that excels at modeling data with many strongly correlated predictors. Consider an $n \times p$ data matrix $X$, where p denotes the number of predictor variables and n denotes the number of observations, and a response vector $y$ of length n. The goal of an elastic net is to minimize the following loss function, written here in its standard form: $$\min_{\beta} \; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\left(\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right),$$ where λ is the regularization parameter and α is the mixing parameter. The λ parameter is nonnegative, that is, $\lambda \in [0, \infty)$. When the value of λ is zero, the regularization has no effect. In other words, the only goal is to reduce the loss function to its smallest possible value. As λ approaches infinity, the regularization effect becomes more pronounced: instead of minimizing the loss function, the only goal is to keep the coefficients β as small as possible. When $\alpha = 0$, the elastic net is the same as ridge regression (i.e., the coefficients of a set of correlated predictors are similarly reduced toward zero).
In contrast, when $\alpha = 1$, the elastic net is the same as the lasso regression (one of the correlated predictors has a larger coefficient, while the rest are shrunk to zero).
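The role of the mixing parameter α can be checked numerically. The sketch below assumes the common parametrization in which the penalty is λ(α‖β‖₁ + ((1 − α)/2)‖β‖₂²); the factor 1/2 on the quadratic term is an assumption of this sketch, and the function name and coefficients are hypothetical.

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty under the assumed parametrization:
    lam * (alpha * ||beta||_1 + 0.5 * (1 - alpha) * ||beta||_2^2)."""
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(np.square(beta))
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)

# Hypothetical coefficients, used only for illustration.
beta = np.array([0.5, -2.0, 0.0, 1.5])

pure_lasso = elastic_net_penalty(beta, lam=2.0, alpha=1.0)  # only the L1 term remains
pure_ridge = elastic_net_penalty(beta, lam=2.0, alpha=0.0)  # only the L2 term remains
mixed = elastic_net_penalty(beta, lam=2.0, alpha=0.5)       # equal mix of both terms
print(pure_lasso, pure_ridge, mixed)
```

Setting α to 1 or 0 recovers the lasso and ridge penalties respectively, and intermediate values interpolate linearly between them.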

New Regularizer.
In general, the L1 regularizer selects or reduces features, whereas the L2 regularizer reduces the weights of unimportant features. The lasso fails to provide a grouped selection, which is the primary shortcoming of this type of regularizer. It tends to select one variable from a group while ignoring the rest of the group. The elastic net, in turn, reduces the impact of specific features without totally eliminating them. Furthermore, these regularizers manage individual weights without taking into account the relationship between the entries of the weight matrix. To address this constraint, we designed a new regularizer that considers the dispersion of the weight values. This regularizer is referred to as a standard deviation-based regularizer (RSD). The new regularizer computes the regularization term by multiplying the weight matrix's standard deviation by λ. The goal is to build a more adjustable weight decay method. As a result, the regularizer restricts the learning model from utilizing global values from the weight space as input (see Figure 1).
The outlines of the new regularizer are presented in detail in Figure 2. Specifically, the penalty for all regularizers (L1, L2, elastic net, and the new one) is equal to one in this case. We have also omitted the sum factor from the suggested regularizer to preserve its dimensions. As a result, the regularizer's spread depends on the penalty term λ. The spread expands as the penalty term λ is reduced, and the spread shrinks as the penalty term λ increases. In the mathematical formulation of the new regularizer, k denotes the row numbers in the weight matrix, j is the weight matrix's row, and σ denotes the standard deviation of the weight values. The parameter λ controls the weight matrix values, and P is the number of columns in each j-th row of the weight matrix (P depends on the number of features in the dataset). So, P is the size of the weight vector. Hence, we minimize the loss function with respect to w through the standard deviation of w, so that the weights stay within a specific range of values. The Nesterov ADAM optimizer was used to train the model, which included tanh activation functions. The model was trained over 100 epochs. A feed-forward network was used to classify the labeled data.
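A minimal sketch of the penalty described above, assuming it is simply λ multiplied by the standard deviation taken over all entries of a layer's weight matrix. The exact form used in the experiments may differ, for example in how rows are handled; the function name and the synthetic weight matrices are hypothetical.

```python
import numpy as np

def rsd_penalty(weights, lam):
    """Assumed RSD penalty: lam times the standard deviation of all weight entries."""
    return lam * np.std(weights)

rng = np.random.default_rng(0)
# Synthetic 8x5 weight matrices with small and large dispersion.
narrow = rng.normal(0.0, 0.1, size=(8, 5))
wide = rng.normal(0.0, 1.0, size=(8, 5))

print(rsd_penalty(narrow, lam=1e-6))
print(rsd_penalty(wide, lam=1e-6))
# A matrix with identical entries has zero dispersion, hence zero penalty.
print(rsd_penalty(np.ones((3, 3)), lam=0.5))
```

Under this assumed form, the penalty depends on how widely the weights are spread rather than on their absolute size, which is the main difference from the L1 and L2 terms. In a framework that accepts a callable as a kernel regularizer, an analogous function written with the framework's tensor operations could be supplied in place of a built-in regularizer.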

Model Architecture
To analyze the efficiency of RSD-ANN, we performed the steps given below. The functionality and dependency of each step are depicted in Figure 3.
(1) Data preprocessing
(2) Data scaling
(3) Dimension reduction/feature selection
(4) Classifier selection

Data Preprocessing.
During data preprocessing, nominal and textual attributes are converted to numerical values through the label encoding algorithm. After that, duplicate records are removed so that the classifier does not become biased toward the majority class in the data, which would affect its performance. Data preprocessing is a crucial task in both supervised and unsupervised learning.
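The two preprocessing operations described above, label encoding and duplicate removal, can be sketched with the standard library alone. The helper names and the sample records are hypothetical; assigning codes in sorted order is an assumption of this sketch.

```python
def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping

def drop_duplicates(rows):
    """Remove repeated rows while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical records: (sex, age, chest pain type).
rows = [
    ("male", 63, "typical"),
    ("female", 41, "atypical"),
    ("male", 63, "typical"),  # exact duplicate of the first record
]

unique_rows = drop_duplicates(rows)
sex_codes, sex_mapping = label_encode([r[0] for r in unique_rows])
print(unique_rows)
print(sex_codes, sex_mapping)
```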

Data Scaling.
In the dataset, each attribute column has a different range of data. Some are continuous values with a low standard deviation, and others have a large range of values dispersed in feature space. To bring the data to equal mean and standard deviation, data scaling is performed. Through scaling, the information in the data is retained. In this step, both datasets were scaled according to the following equation: $X_i' = (X_i - \mu)/\sigma$, for $i = 1, \ldots, k$, where k represents the total number of rows, $X_i$ shows the i-th feature in the data, and μ and σ are the mean and standard deviation of the corresponding attribute.
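A minimal sketch of the scaling step, assuming it is the usual standardization in which each column is shifted to zero mean and divided by its standard deviation. The function name and the sample values are hypothetical.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Leave constant columns unscaled to avoid dividing by zero.
    std = np.where(std == 0.0, 1.0, std)
    return (X - mean) / std

# Hypothetical values for two attributes on very different scales.
X = np.array([[63.0, 233.0],
              [41.0, 204.0],
              [57.0, 354.0],
              [50.0, 250.0]])

Z = standardize(X)
print(Z.mean(axis=0))
print(Z.std(axis=0))
```

After this transformation every attribute contributes on a comparable scale, which matters both for PCA and for gradient-based training.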

Feature Selection.
Feature selection, performed here as dimensionality reduction, removes unnecessary information from the data. For this purpose, we transformed the data into eight orthogonal principal components using the statistical method known as Principal Component Analysis (PCA) [35]. PCA was applied to the heart dataset, and all 13 features were reduced to eight uncorrelated components.
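The reduction from 13 features to eight components can be sketched with a singular value decomposition. This is an illustrative NumPy implementation on synthetic data, not the code used in the experiments; the function name and the random input are assumptions.

```python
import numpy as np

def pca_transform(X, n_components):
    """Project centered data onto its leading principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
# Synthetic stand-in for a table with 270 patients and 13 features.
X = rng.normal(size=(270, 13))

components = pca_transform(X, n_components=8)
cov = np.cov(components, rowvar=False)
print(components.shape)
print(np.round(np.diag(cov), 3))
```

The covariance matrix of the projected data is diagonal, so the retained components are mutually uncorrelated and ordered by the variance they capture.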

Loss Function.
To evaluate our classifier, we used the cross-entropy loss function, which is defined in equation (6). In neural networks, the cross-entropy is usually preferred over other loss functions for specific reasons. One reason is that alternatives such as the squared loss function are better suited to regression. The other reason is that the activated outputs used in the cross-entropy lie between 0 and 1. Therefore, it is simple to convert these probabilistic values to either one class or another using different thresholds. Moreover, the cross-entropy loss function converges more easily to the corresponding class values than other loss functions: $$\text{Cross entropy} = -\sum_{i}^{C} t_i \log\left(f(s_i)\right), \quad (6)$$ where C is the number of classes, $t_i$ is the target value for the i-th class, and $f(s_i)$ is the i-th output after the activation function f.
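The loss in equation (6) can be evaluated directly once the activation f is fixed. The sketch below assumes f is the softmax function and uses hypothetical scores and one-hot targets; the small constant eps is added only for numerical safety.

```python
import numpy as np

def softmax(s):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    shifted = s - np.max(s, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(t, s, eps=1e-12):
    """Per-sample cross-entropy between one-hot targets t and raw scores s."""
    p = softmax(s)
    return -np.sum(t * np.log(p + eps), axis=-1)

# Hypothetical raw scores for two samples and two classes.
scores = np.array([[2.0, 0.0],
                   [0.0, 0.0]])
targets = np.array([[1.0, 0.0],
                    [0.0, 1.0]])

losses = cross_entropy(targets, scores)
print(losses)
```

The first sample, whose score favors the correct class, receives a smaller loss than the second, where the network is undecided and the loss equals log 2.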

Training and Validation Process
We chose Python as the implementation language for the simulations, and the Keras framework was used for the ANN. The ANN model is embedded with the new regularizer to test our method. Using the mathematical description given in equation (3), the proposed regularizer is implemented as a function. This new function was assigned to the kernel regularizer in the ANN model instead of the built-in regularizers. Two hidden layers of five and three units, respectively, make up the ANN predictive model. In line with the class values, the last layer has two units. Except for the final layer, where the softmax activation function is used, each layer uses the tanh activation function. In the first two layers, the weight matrix was initialized with a Gaussian random distribution. For 100 epochs, the ANN model is trained on 80% of the data and validated on 20% of the data, with a batch size of 4 for each iteration. The results of multiple simulations were recorded and plotted. The following section contains the results and discussion.
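The layer structure described above can be sketched as a plain NumPy forward pass. The input width of eight (the PCA components), the initialization scale of 0.05, the zero biases, and the Gaussian initialization of the output layer are assumptions of this sketch; training, the optimizer, and the regularizer are omitted.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    """Gaussian-initialized weights and zero biases (the scale is an assumption)."""
    return rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    """tanh on the hidden layers, softmax on the output layer."""
    h = x
    for W, b in layers[:-1]:
        h = np.tanh(h @ W + b)
    W, b = layers[-1]
    logits = h @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
# 8 inputs -> 5 hidden units -> 3 hidden units -> 2 output classes.
layers = [init_layer(rng, 8, 5), init_layer(rng, 5, 3), init_layer(rng, 3, 2)]

batch = rng.normal(size=(4, 8))  # one synthetic batch of 4 samples
probs = forward(batch, layers)
print(probs)
```

Each row of the output holds the two class probabilities for one sample, and the predicted class is the one with the larger probability.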

Embedding L1 Regularizer with ANN.
The default L1 regularizer was embedded in the ANN learner and trained for different values of λ. This regularizer's best results were obtained with $\lambda = 10^{-8}$, and the accuracy was 85.14% (see Figure 4). Applying the L1 regularizer in combination with the PCA algorithm increased the accuracy to 87.7%, which was the best accuracy obtained (Figure 5). This increase in accuracy can be ascribed to the removal of less important information through the PCA transformation. PCA keeps only the components that carry most of the variance in the data.

Embedding L2 Regularizer with ANN.
Similarly, in this step, we replaced the L1 regularizer with the L2 regularizer in the ANN model and searched for the parameters that yield the best accuracy. As a result, 84.8% accuracy was recorded with $\lambda = 10^{-9}$ (see Figure 6).
We observed a decrease in the average accuracy when combining the PCA transform with the L2 regularizer. The average accuracy for this stage was 83.04% (see Figure 7).
This is because the L2 regularizer does not assign zero values to attribute coefficients. Hence, the less important features are also incorporated, which decreased the accuracy.

Embedding New Regularizer with ANN.
Based on the analysis of our results, the new regularizer showed a significant improvement in terms of accuracy and outperformed the L1 and L2 regularizers. The average accuracy obtained was 88.59% (see Figure 8).
We then combined the PCA algorithm with RSD-ANN to measure accuracy. As a result of this combination, the accuracy jumped to 96.30% (see Figure 9). During the PCA transformation, the number of features was reduced to eight orthogonal components. The accuracy improved because PCA projects the data onto the directions of maximum variance in feature space, while the new regularizer controls the spread of the weight values in weight space (see Figure 10). Hence, the data becomes more separable for the ANN classifier, providing greater accuracy as a result. The same architecture was used for this simulation, and the λ value was slightly increased to $10^{-6}$.
To validate our proposed model's effectiveness, we compared the RSD-ANN results with the results obtained from combining the elastic net regularizer and the PCA transform with eight components on the heart dataset. An accuracy of 81.08% was observed, which showed that the elastic net regularizer performed worse than the L1 and L2 regularizers separately. The accuracy plot for the elastic net regularizer is shown in Figure 11. Table 2 shows each embedded regularizer's experimental results with ANN, demonstrating that the new regularizer performed significantly better than the default regularizers. Therefore, the proposed regularizer is highly effective.

Computational Intelligence and Neuroscience
We performed our simulations using RSD-ANN with the PCA algorithm. Out of 13 attributes, eight orthogonal components were produced via PCA. The experimental outcomes of the proposed approach are compared to the results of other currently available techniques. Table 3 shows the classification metrics (accuracy, sensitivity, specificity, and F-score) of RSD-ANN compared to other models; the best results are highlighted in bold. Based on our experimental results, RSD-ANN outperformed all other models considered in the comparison in terms of accuracy and specificity. The GA-LDA + hybrid ensemble model achieved an accuracy of 93.65%, whereas our model has the highest accuracy. Our model achieved 93.75% in specificity, whereas the other approaches have rates below 90%. The sensitivity rate for the GA-LDA + hybrid ensemble is 96%, which is higher than the sensitivity rate of RSD-ANN, due to identifying and selecting the best features for better prediction. It is worth noting that this measurement may cause issues if the data analyzed is affected by uncertainty or inaccuracies, which limits its usability. However, by utilizing a fuzzy classifier, we can avoid these issues [36]. The experimental results demonstrate that our proposed method outperforms the other models in terms of accuracy in classifying cardiovascular disease. RSD-ANN was trained on an 80-20% train-test split of the data. According to this division, the model was trained on 237 patients and tested on 60 patients.
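The four metrics used in the comparison follow directly from the entries of a binary confusion matrix. The sketch below uses hypothetical counts chosen only to illustrate the formulas; they are not the counts from the experiments reported here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
    }

# Hypothetical counts for a test set of 100 patients.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity and specificity can move in opposite directions, which is why a model can lead on accuracy and specificity while another model has the higher sensitivity.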

Conclusion
In this paper, we present an efficient computational model for heart abnormality detection. We focused on a learning model that uses a new regularizer, which is based purely on the weight matrix's standard deviation. The new regularizer prevents the coefficients of attributes from taking high values in the weight matrix space. The proposed model obtained excellent results, and it can be used to assist medical practitioners when searching for abnormalities in heart function. During the process, holdout and 10-fold validation were used, and the accuracy obtained for heart disease detection was 96.30%. Consequently, the incorporation of the proposed regularizer with ANN surpassed other methods in terms of accuracy.

Data Availability
The datasets used to support the findings of this study are included within the article.

Ethical Approval
This article does not contain any studies with human participants performed by any of the authors.

Conflicts of Interest
The authors declare that they have no conflicts of interest.