Novel Based Ensemble Machine Learning Classifiers for Detecting Breast Cancer

Nowadays, for many industries, innovation revolves around two technological advances: Artificial Intelligence (AI) and Machine Learning (ML). ML, a subset of AI, is the science of designing and applying algorithms that can learn a task from past experience. Of all the innovations driven by ML models, the most significant have been in medicine and healthcare, where they assist doctors in the treatment of different types of diseases. Among these, early detection of breast cancer using ML algorithms has piqued the interest of researchers. Hence, in this work, 20 ML classifiers are discussed and applied to Wisconsin's Breast Cancer dataset to classify breast cancer as malignant or benign. Of the 20, 9 algorithms are coded in Python in Colab notebooks and the remaining are executed using the Waikato Environment for Knowledge Analysis (WEKA) software. Among all, the stochastic gradient descent algorithm was found to yield the highest accuracy of 98%. The algorithms that gave the best results were then combined into a novel ensemble model, which was implemented in both WEKA and Python. The performance of the ensemble model on both platforms is compared and investigated in detail based on metrics such as accuracy, precision, recall, and sensitivity. From this experimental comparative study, it was found that the ensemble model developed in Python yielded an accuracy of 98.5% and the one developed in WEKA yielded 97% accuracy.


Introduction
Any environment can be made smart and intelligent with the help of technologies like AI and ML [1]. These technologies not only reduce manpower and labor time but also improve the quality of life by automating society and creating more job opportunities [2,3]. A system that can apply some logic, evaluate the different options available, and decide on its own can be considered an application of Artificial Intelligence. To provide this intelligence, the system has to be trained; i.e., the system has to learn. This is done through machine learning, a branch of AI, where the system makes decisions based on data. In ML, the performance of a task improves based on previous data or experience. Thus, it can be understood that performance, task, and experience are the three main characteristics of ML. Figure 1 represents the basic flow of machine learning pictorially.
The applications of ML can be found in many areas such as object recognition, spam detection, speech recognition, and the healthcare industry, where it is used for survival prediction, tumor detection, and cancer detection. The use of AI and ML models assists doctors in the early detection and curing of cancer, tumors, and other deadly diseases. One such application in medicine is the detection of breast cancer using ML models. Breast cancer is the growth of unwanted cells in the breast, and early detection of this cancer can lead to a better cure [4]. Many classification algorithms in machine learning can be used to detect breast cancer with high accuracy, but selecting a particular algorithm for breast cancer detection is a challenging task. Many works have been carried out by researchers in these areas; such works give rise to the establishment of smart hospitals, and the constructive exploitation of these technologies leads to an improvement in quality of life [5].
Chandrasekaran et al. [6] developed a fully integrated analog-based machine learning classifier that uses Artificial Neural Network (ANN) algorithms for breast cancer prediction. The developed ANN model mainly focused on nine attributes of breast cancer: (1) bland chromatin, (2) mitosis, (3) uniformity of cell shape, (4) marginal adhesion, (5) single epithelial cell size, (6) uniformity of cell size, (7) clump thickness, (8) normal nucleoli, and (9) bare nuclei. The authors achieved a mean accuracy of 96.9% with optimal energy consumption. Later, a Deep Neural Network with Support Value (DNNS) was developed by Vaka et al. [7] for predicting breast cancer. The model was developed in three stages. In the first stage, preprocessing was carried out by removing noise from the images in the dataset. In the second stage, geometrical and textural features were detected from the preprocessed images. In the third stage, each image was segmented using a histosigmoid-based fuzzy classifier. The authors achieved an accuracy of 97.21%. Mojrian et al. [8] developed a multilayer fuzzy expert system by integrating an Extreme Learning Machine (ELM) classification model with a Radial Basis Function (RBF). The authors compared the proposed methodology with the linear Support Vector Machine (SVM) model. Accuracy, precision, True Positive (TP) rate, False Positive (FP) rate, True Negative (TN) rate, and False Negative (FN) rate were considered as the performance metrics to evaluate the proposed ML model. The authors achieved an accuracy of 98%, and the proposed methodology was effective in detecting breast cancer when compared with the linear SVM model. Chaurasia and Pal [9] predicted and compared six machine learning algorithms. Classification and Regression Trees (CART), Linear Regression, Multilayer Perceptron (MLP), Naive Bayes (NB), K Nearest Neighbors (KNN), and SVM were the base learners considered for prediction.
Along with the base learners, a few ensemble algorithms like AdaBoost M1, Random Forest (RF), and Support Vector Clustering (SVC) were integrated to yield high accuracies. The authors divided the training and test sets in the ratio of 80:20, respectively. Accuracy was the metric considered, and they achieved a good accuracy of 95.17%. Similar to [9], a comparative analysis using data visualization and ML applications for breast cancer detection and diagnosis was carried out by [10], achieving an accuracy of 98% using the logistic regression model.
Having studied these recent works, it is understood that individual algorithms, such as deep neural networks or ensemble algorithms, have been used one at a time. To further improve the accuracy of existing methodologies, a novel combination of ensemble and optimization-based approaches is required to detect breast cancer with high accuracy. In this paper, 20 different ML classification algorithms have been selected for predicting breast cancer as benign or malignant using Wisconsin's Breast Cancer dataset. Of these 20, 9 were implemented in Google Colab notebooks using the Python programming language and the remaining 11 were executed in the WEKA data mining software. Along with the 20 algorithms, we propose a novel machine learning ensemble-based algorithm. The algorithms for the ensemble were chosen based on the performance of the 20 algorithms. The ensemble model was run on both platforms and the results obtained were compared and investigated in detail. Thus, the work in this paper can assist doctors in accurately detecting and diagnosing breast cancer in its early stages. This is an attempt to improve and automate society, establish the concept of smart hospitals, and contribute to the development of these technologies. The rest of this paper is organized as follows. Section 2 introduces the dataset and discusses the EDA performed to understand the importance and contribution of individual features. Section 3 describes the working of the selected 20 algorithms. Section 4 explains the evaluation parameters considered to investigate the performance of the algorithms, and Section 5 discusses the results obtained and explains in detail the comparative study carried out on both platforms before concluding in Section 6.

Wisconsin's Breast Cancer Dataset
The Breast Cancer dataset from the University of Wisconsin Hospital was considered for carrying out the performance analysis of 20 different classification algorithms. This is a small dataset consisting of data from 569 patients with 31 different input features and a single output feature. The 31 different features are independent of each other, and the output feature, called "diagnosis," is dependent on the input features. The output feature "diagnosis" has two labels: malignant and benign.

Exploratory Data Analysis (EDA).
The Breast Cancer dataset consists of data from 569 patients, out of which 357 patients were diagnosed as benign and 212 patients were diagnosed as malignant. In the corresponding pie chart, the blue region represents the percentage share of patients who were diagnosed as malignant (37.3%) and the red region represents the percentage share of patients who were diagnosed as benign (62.7%).
As the percentage shares are not equal, there is a class imbalance and the predictions may not generalize. To overcome this class imbalance problem and obtain generalized results, the number of data samples should ideally be equal for both labels. Inferring from Figure 3, if the mean radius is higher, the patient is more likely to be diagnosed as malignant, and if it is lower, the patient is more likely to be diagnosed as benign. Thus, the mean radius is directly related to the likelihood of a patient having breast cancer.
Similarly, from Figure 4 it is understood that patients diagnosed as malignant have a mean compactness in the range of 0.05 to 0.27, while those diagnosed as benign have a mean compactness in the range of 0 to 0.18. As the mean compactness increases, the chances of a person having cancer cells increase. When the cancer is detected as malignant, the mean area of spread is between 500 and 1800 mm², and when it is detected as benign, the mean area of spread is between 0 and 700 mm², as depicted in Figure 5. Early detection helps to stop the spread of cancer; curing malignant cases is very hard, as the area of spread will be high.

Brief Description of the Classification Algorithms

The chart given in Figure 6 explains the types of ML. The algorithms listed in the chart are the most common and popularly used classifiers. Instance-based algorithms such as KNN and statistical-based algorithms such as Naive Bayes, Bayes Net, Neural Networks, Decision Trees, and more are experimentally verified in the next section.

Logistic Regression.
It is a widely used machine learning approach for classification. The logistic regression algorithm draws a decision boundary between the different input features. The model then maps the inputs to the outputs by computing a cost function. The sigmoid function, also called a logistic function, is used for predicting the target labels. Equations (1)-(3) mathematically describe the sigmoid function [11]:

h(x) = g(θᵀx),  (1)
g(z) = 1 / (1 + e^(−z)),  (2)
h(x) = 1 / (1 + e^(−θᵀx)),  (3)

where h(x) is defined as the model and g(θᵀx) is the sigmoid function; θ is the vector of model parameters and x represents the input features. After the sigmoid function is computed, the cost function is calculated, and the results from the cost function are later optimized using a gradient descent algorithm.
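The sigmoid and the resulting class decision can be sketched in a few lines of plain Python. The weights and feature values below are hypothetical, chosen only to illustrate the mapping from θᵀx to a probability:

```python
import math

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # h(x) = g(theta^T x): probability that the sample is malignant
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical two-feature example; the bias is folded in as the first weight
theta = [-0.5, 0.8, 0.3]
x = [1.0, 0.6, 0.4]   # [bias term, scaled mean radius, scaled mean texture]
p = predict(theta, x)
label = "malignant" if p >= 0.5 else "benign"
```

In practice, θ would be obtained by minimizing the cost function with gradient descent rather than being set by hand.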

K Nearest Neighbors (KNN).
KNN is a widely used method for various real-time applications dealing with classification. The classification can be either binary, in which only two output classes are available, or multiclass, in which more than two output classes are available. The KNN algorithm is a lazy-learning and computationally heavy algorithm [12]. This algorithm uses the principle of matching the class label of an unknown data sample with the labels of neighboring data samples based on the Euclidean distance or, in some cases, the Manhattan distance [13]. The Euclidean distance is the square root of the squared distance between two points in the sample space. Equation (4) gives the Euclidean distance formula [14]:

d(x_i, x_j) = √( Σ_k (x_ik − x_jk)² ),  (4)

where d(x_i, x_j) is the distance between a known data sample and an unknown data sample, and x_i, x_j are the locations of the data samples. The "K" in KNN is the total number of neighbors that are selected for making predictions. It should either be user-defined for fine prediction or chosen as an odd number close to the square root of the total number of samples in the dataset [15]. When a new, unknown data sample is introduced into the search space, the distances between the unknown data sample and its k nearest neighbors are determined. The smallest proximity values from the unknown data sample to its nearest neighbors determine the class of the unknown data sample.
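The distance-and-vote procedure above can be sketched in plain Python. The 2-D points stand in for two tumour features and are purely illustrative:

```python
import math
from collections import Counter

def euclidean(a, b):
    # d(x_i, x_j) = sqrt(sum_k (a_k - b_k)^2)
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, labels, query, k=3):
    # Rank training samples by distance and take a majority vote among the k nearest
    ranked = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters
train = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.2)]
labels = ["benign", "benign", "malignant", "malignant"]
```

For example, `knn_predict(train, labels, (1.1, 0.9))` falls inside the benign cluster and is classified accordingly.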

Naive Bayes Classifier.
It is a Bayes theorem-based conditional probabilistic machine learning model.

Bayes Theorem.
The Bayes theorem in probability and statistics defines the likelihood of an event based on prior knowledge of circumstances that may be relevant to it. The following equation expresses the Bayes theorem mathematically [16]:

P(A|B) = P(B|A) P(A) / P(B),

where A and B are events and P(B) ≠ 0. Hence, utilizing this theorem, it is feasible to find the likelihood of an event, say A, given that B has happened. In this case, B represents the evidence and A represents the hypothesis.
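A quick numerical illustration of the theorem, using hypothetical screening rates (the three probabilities below are invented for the example, not taken from the dataset):

```python
# Hypothetical screening numbers, for illustration only:
p_cancer = 0.01           # P(A): prior probability of malignancy
p_pos_given_cancer = 0.9  # P(B|A): probability of a positive test given malignancy
p_pos = 0.05              # P(B): overall rate of positive tests

# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
```

Even with a sensitive test, the posterior probability of malignancy after one positive result is only 0.18 here, because the prior is low; this is why the evidence term P(B) matters.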
The assumption here is that the existence of one trait has no effect on the others. As a result, the model is said to be naive. Saritas and Yasar used the Naive Bayes model and an ANN on the Breast Cancer Coimbra Dataset and found that the classes were classified with an accuracy of 86.95% with the ANN and 83.54% with Naive Bayes, thus proving that Naive Bayes algorithms can be used for early breast cancer detection [17].

Types of Naive Bayes Classifier
(1) Gaussian Naive Bayes Classifier (GNB). When the predictors take on continuous values in GNB, we assume that these values have a Gaussian or normal distribution. Consider training data with a continuous attribute "X." After dividing the data by class, the mean and variance of "X" in each class are computed. The probability distribution of some observation value "X_i" given a class "Y" can be computed using the following equation [16]:

P(X_i | Y = y) = (1 / √(2πσ_y²)) exp(−(X_i − μ_y)² / (2σ_y²)),

where σ_y² is the variance of the values associated with class y and μ_y is the mean of the values associated with class y.
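The per-class Gaussian likelihood can be sketched directly from this equation. The mean-radius values below are hypothetical stand-ins for the two classes:

```python
import math
import statistics

def gaussian_likelihood(x, mu, var):
    # P(x | y): Gaussian density with class mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical mean-radius samples for each class
benign = [10.0, 12.0, 11.0]
malignant = [18.0, 20.0, 22.0]

mu_b, var_b = statistics.mean(benign), statistics.pvariance(benign)
mu_m, var_m = statistics.mean(malignant), statistics.pvariance(malignant)

x = 19.0  # query value to classify
lik_b = gaussian_likelihood(x, mu_b, var_b)
lik_m = gaussian_likelihood(x, mu_m, var_m)
```

Since 19.0 sits close to the malignant class mean, `lik_m` dominates `lik_b`; multiplying such likelihoods by class priors gives the full GNB decision rule.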
Piryonesi and El-Diraby explored the performance of different classification algorithms and used it for the examination of asphalt pavement degradation data. According to their findings, GNB can significantly improve the classifier's accuracy [18].
(2) Bernoulli Naive Bayes Classifier (BNB). Features in the multivariate Bernoulli event model are independent Booleans (binary variables) that describe the inputs. This model, like the multinomial model, is popular for document classification problems where binary term-occurrence characteristics are utilized rather than term frequencies.

(3) Complement Naive Bayes Classifier (CNB). The complement Naive Bayes classifier, also called CNB, works well with imbalanced datasets. The main principle behind CNB is that the probability of a feature or data sample is calculated from the complement of each class, i.e., using the data of all classes other than the one under consideration, rather than from a single particular class. The main disadvantage of CNB is that the model tends to overfit easily when there is a large number of data samples in the dataset.

Decision Tree.
The decision tree algorithm is used for many classification problems. The main advantage of the decision tree is that it finds hidden patterns in the data by pruning and delivers high accuracy [19]. Pruning is the process of removing unwanted nodes from the decision tree that do not provide meaningful insights into the classification process. The decision tree consists of three components: the root, the internal nodes, and the leaves. The root and internal nodes classify unknown data based on if-then criteria and deliver the output to the next layer of nodes. Leaves are the ends of a decision tree that provide the filtered result after pruning.
There are different algorithms to implement a decision tree, and one simple way is the Iterative Dichotomiser 3 (ID3) algorithm. The ID3 algorithm uses information gain and entropy to determine the output labels. Entropy measures the impurity of the data, and information gain determines the amount of information that a split contributes to the final outcome [20]. The entropy and the information gain are measured using equations (7) and (8) [21]:

H(S) = − Σ_{i=1}^{c} p_i log₂ p_i,  (7)

where H(S) is the entropy, S is the input data, c is the number of output classes, and p_i is the proportion of samples belonging to class i;

IG(A, S) = H(S) − Σ_t P(t) H(t),  (8)

where IG(A, S) is the information gain from the data S depending on output A, t is a subset after the data is split, P(t) is the proportion of elements in t, and H(t) is the entropy of the subset.
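Equations (7) and (8) translate directly into code. The toy labels below ("M"/"B" for malignant/benign) are illustrative; a perfect split yields an information gain equal to the parent's entropy:

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i) over the class proportions p_i
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent, subsets):
    # IG(A, S) = H(S) - sum_t (|t|/|S|) * H(t)
    n = len(parent)
    return entropy(parent) - sum(len(t) / n * entropy(t) for t in subsets)

labels = ["M", "M", "B", "B"]
split = [["M", "M"], ["B", "B"]]   # a perfect split on some feature
```

Here `entropy(labels)` is 1.0 bit (two balanced classes) and the perfect split recovers all of it, so `info_gain(labels, split)` is also 1.0; ID3 greedily picks the split maximizing this quantity.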

Decision Stump.
The decision stump algorithm is similar to decision trees. In this algorithm, there is only one tree, which consists of one root and a maximum of two leaf nodes for output predictions [22]. For this reason, decision stumps are weak learners and do not subclassify like decision trees. The algorithm terminates as soon as the first classification is executed. The overall prediction can be determined using the following equation [23]:

f(x) = g_i(x_i),

where x is the vector of input features, x_i is the i-th feature (the one the stump splits on), and g_i is the predicted value based on the i-th feature.

Random Trees.
Random tree classifiers are a combination of decision trees and random forests; the model is also called a Random Forest Tree (RFT). It belongs to the category of ensemble classifiers. Random trees improve the performance of weak learners as well as the accuracy of the machine learning model. A random tree is defined in three steps: (i) defining the method of splitting the leaves in the tree; (ii) selecting the predictor type for every leaf in the tree; (iii) injecting randomness into the leaves. One major disadvantage of random trees is that the classifier often tends to overfit [24].

Reptree.
REPTree, short for Reduced-Error Pruning tree, belongs to the family of decision trees and is one of the fast learner algorithms [25]. In this algorithm, multiple decision trees are created and the one with the highest probability is chosen as the representative, based on which the other trees are classified [26]. The REPTree algorithm is thus a combination, or ensemble, of various individual trees or individual learners.

Random Forests.
Random forest is a combination of many decision trees built by randomly selecting features and bootstrapping [27]. The random forest first selects a limited number of features from the entire set of input features and splits those random features into nodes. The nodes, in turn, are split into subnodes or daughter nodes, and the process is carried out until the required number of nodes is reached. The same process is repeated "n" times, creating "D" decision trees. The pseudocode for random forest is given in Table 1.
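Since the Wisconsin dataset also ships with scikit-learn, the setup described here is easy to reproduce. This is a minimal sketch, assuming scikit-learn is installed; the split ratio and hyperparameters are illustrative, not the paper's exact configuration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the Wisconsin Breast Cancer data bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# 100 bootstrapped trees, each trained on a random feature subset per split
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

On a random hold-out split of this dataset, such a forest typically scores in the mid-90% accuracy range, consistent with the results reported later.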

Support Vector Machines (SVM).
SVMs provide highly accurate results with linear data as well as nonlinear data. In SVM, the input features are defined as input vectors, and multiple hyperplanes are drawn between these input features. e hyperplane acts as a decision boundary in SVM.
The main objective of a hyperplane is to maximize the distance between two or more classes of input features [28]. Because of this characteristic of the hyperplane, SVMs work well with nonlinear data. A hyperplane can be expressed using the following equation [29]:

Wᵀx + b = 0,

where x is the input vector containing the input features, b is the bias, and W is the weight vector that defines the separating hyperplane. The best hyperplane is decided by finding the different hyperplanes that classify the labels well and then choosing the one that is farthest from the data points, i.e., the one with the maximum margin, as depicted in Figure 7.
3.6. Stochastic Gradient Descent (SGD). SGD is a refined version of gradient descent (GD). The difference between GD and SGD is that in SGD the gradient is calculated over only a limited number of data samples. SGD selects data samples in a randomized manner and computes the gradients. Once the gradient is computed, the weights are updated and the corresponding output predictions are recomputed for the data samples considered. This is an iterative process that continues until a minimum cost is reached [30]. The following equation gives the objective and its gradient [31]:

f(x) = (1/n) Σ_{i=1}^{n} f_i(x),  ∇f(x) = (1/n) Σ_{i=1}^{n} ∇f_i(x),

where f(x) is the objective function, n is the total number of data samples, ∇f(x) is the gradient of the objective function, and ∇f_i(x) is the gradient computed for a randomly selected data sample. Once the gradients are computed for the selected data samples, the weights are updated using the following equation [32]:

x ← x − α ∇f_i(x),

where x is the parameter being updated and α is the learning rate.
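The update rule above can be demonstrated on a toy objective. Here f(x) is the mean squared distance to a few hypothetical scalar samples, whose minimizer is their mean (5.0); each step uses the gradient of a single randomly chosen sample:

```python
import random
random.seed(0)

# Minimise f(x) = (1/n) * sum_i (x - d_i)^2; the d_i are hypothetical samples
data = [2.0, 4.0, 6.0, 8.0]
alpha = 0.1        # learning rate
x = 0.0            # initial parameter value

for _ in range(500):
    d = random.choice(data)     # pick one sample at random
    grad = 2 * (x - d)          # gradient of f_i(x) = (x - d)^2
    x = x - alpha * grad        # SGD update: x <- x - alpha * grad_i
```

Because only one sample's gradient is used per step, `x` does not settle exactly at 5.0 but oscillates around it; this noise is the trade-off SGD makes for its per-step cheapness.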

Sequential Minimal Optimization (SMO).

SMO is used for solving quadratic programming problems. It is used to train the SVM classifier with a Gaussian or polynomial kernel [32]. It converts the attributes or features into binary values and also replaces missing values in the input features. SMO breaks a quadratic problem down into subproblems of linear complexity. These are then solved using the Lagrange multipliers α₁ and α₂.
The constraints for the selection of the Lagrange multipliers are given in equations (13) and (14) [33], where M is an SVM hyperparameter and K is the kernel function, both supplied by the user, and α₁, α₂ are the Lagrange multipliers. SMO can be used when there is a class imbalance between the input features and works well with binary as well as multiclass classification [34].
3.8. K Star. K star, or K*, is one of the lazy algorithms used for many complex classification problems. It is also called an instance-based classification algorithm [35]. Every data sample is an instance for this algorithm.
The class of an unknown data sample depends on the entropic distance between that unknown data sample and the known data samples in the search space [36].

Bayes Net.
A Bayesian Network, also called a Bayes Net, is one of the classification algorithms used for solving machine learning problems. A Bayes Net is constructed using Directed Acyclic Graphs (DAGs). These DAGs consist of nodes and edges. The nodes represent random variables, and the edges establish the connections between two or more different nodes and carry the information regarding the probabilities between nodes. The only prerequisite of a DAG is that all the nodes must be connected without forming a loop; there must be no closed path between two nodes. The information sharing between the nodes depends on the Parental Markov Condition (PMC). The PMC states that a variable represented by a node is independent of all variables in nondescendant nodes, conditional on its parent nodes. The probabilities are calculated using the conditional probability formula mentioned in the following equation [37]:

P(A|B) = P(A ∩ B) / P(B).

3.10. Classification via Regression. In the classification via regression algorithm, regression is carried out first. Classification is performed on the output features obtained from the regression model and, finally, the output class is predicted using logistic regression. The main disadvantage of this algorithm is that it provides fluctuating results; it behaves as both a weak learner and a strong learner. The algorithm performs well when the number of input features is limited; if there are more input features, the accuracies may vary.
3.11. Gradient Boosting Classifier. The gradient boosting classifier is an ensemble of various weak learners that is used for both regression and classification problems. A gradient boosting classifier performs gradient descent on the loss of a weak learner such as a decision tree. The loss function can be defined as in the following equation [38], where f(x) is the loss function, e is the error, x is the vector containing the input features, and p, q are the weights of the loss function. The gradient boosting classifier can be applied to more complex datasets and provides highly accurate results for nonlinear data.
3.12. AdaBoost M1. AdaBoost, also called Adaptive Boosting, is an ensemble-based algorithm that can be used for both binary and multiclass classification. In this boosting algorithm, classification is carried out by defining a threshold or weightage value for every classifier. The final output class is determined based on the weighted votes of the individual classifiers. If there are any weak classifiers, extra weights are given to the samples those classifiers misclassify. This addition of extra weights converts the weak learner into a strong learner [39,40]. The weight for each individual classifier is determined using the following equation [41]:

α_i = (1/2) ln((1 − ε_i) / ε_i),

where α_i is the weight for classifier i and ε_i is the weighted error of classifier i. The final output prediction is determined using the following equation [42]:

H(s) = sign( Σ_i α_i h_i(s) ),

where H(s) is the final output prediction and h_i(s) is the output prediction of base classifier i for sample s.

Table 1: Pseudocode for random forest.
(1) Start
(2) Define the input features
(3) Training the data: (i) select "i" random features from the entire set of input features; (ii) determine the nodes from the "i" features using the best split point
(4) Repeat the above steps until a specific number of nodes is reached
(5) Repeat the above steps n times to create m decision trees
(6) Testing the data: (i) feed in a new data sample and predict the output label; (ii) the node which gives the highest prediction among the output nodes is taken as the final prediction
(7) Stop

LogitBoost.
LogitBoost is also called the additive logistic regression algorithm [43]. LogitBoost takes different data samples while training and initially produces a weak output prediction. Boosting is applied to this weak model to improve performance and produce strong output predictions. LogitBoost works well with both binary and multiclass classification. LogitBoost is similar to the AdaBoost algorithm; the only difference between the two is that LogitBoost produces a weak classifier during the initial period of training the model. The output predictions are determined using the following equation [44]:

P(x) = e^{f(x)} / (e^{f(x)} + e^{−f(x)}),

where P(x) is the probability of the output class and f(x) is the fitness function.

3.14. Bagging. Bagging, also called Bootstrap Aggregation, is a powerful ensemble algorithm that works well with small datasets. Bagging is used to reduce the variance of a trained model and prevents overfitting while training [45]. In this algorithm, the dataset is split into bootstrap samples (clusters). Classification algorithms like decision trees and random forests are applied to these clusters, and predictions are computed. The predictions from the individual clusters are averaged and, on a voting basis, the final output prediction is determined [46,47]. The pseudocode for bagging is given in Table 2.
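The bootstrap-then-vote mechanics of bagging can be sketched in plain Python. Everything below is a toy: the 1-D "mean radius" data, the threshold stump used as the weak learner, and its fitting rule (midpoint between class means) are all illustrative assumptions:

```python
import random
from collections import Counter
random.seed(1)

def bootstrap(data):
    # Sample len(data) points with replacement from the training data
    return [random.choice(data) for _ in data]

def fit_stump(sample):
    # Crude one-feature weak learner: threshold at the midpoint of class means
    b = [x for x, y in sample if y == "benign"]
    m = [x for x, y in sample if y == "malignant"]
    if not b or not m:
        return 15.0  # fallback threshold when a bootstrap replica misses a class
    return (sum(b) / len(b) + sum(m) / len(m)) / 2

def stump_predict(threshold, x):
    return "malignant" if x > threshold else "benign"

# Toy 1-D training data: (mean radius, label)
data = [(10, "benign"), (11, "benign"), (12, "benign"),
        (18, "malignant"), (20, "malignant"), (22, "malignant")]

# Bagging: train one stump per bootstrap replica, then majority-vote
stumps = [fit_stump(bootstrap(data)) for _ in range(25)]

def bagged_predict(x):
    votes = Counter(stump_predict(t, x) for t in stumps)
    return votes.most_common(1)[0][0]
```

Each replica sees a slightly different resample, so the 25 thresholds differ; averaging their votes smooths out the variance of any single stump, which is exactly the effect described above.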

Evaluation Metrics
The performance of the above-discussed 20 classification algorithms is evaluated using the parameters discussed below. A confusion matrix with TP, TN, FP, and FN for the actual and predicted data samples is formed, as seen in Table 3. A confusion matrix is a table that allows one to examine how an algorithm performs. Each column of the matrix contains examples from an actual class, whereas each row represents instances of a predicted class [11]. TP denotes that the actual value was positive and the model predicted a positive result. TN denotes that the actual value was negative and the model predicted a negative result. In both of these cases, the predicted value corresponds to the actual value. FP denotes a negative actual value but a positive prediction from the model, while FN denotes a positive actual value but a negative prediction. In both of these cases, the value was falsely predicted. Based on these four values, accuracy, precision, recall, F1 measure, sensitivity, and specificity can be calculated.
Accuracy gives the proportion of properly identified samples to total samples. It is obtained from the formula [48,49]

Accuracy = (TP + TN) / (TP + TN + FP + FN).  (20)

Precision tells how many of the samples predicted as positive were actually positive. It is calculated by dividing the number of true positive cases by the number of all examples predicted to be positive [48]:

Precision = TP / (TP + FP).  (21)
Similarly, recall indicates how many of the actual positive samples were properly predicted by the model. In medical circumstances, recall is critical since positive cases should not go undiscovered. In the case of this dataset, recall would be a preferable statistic because we do not want to mistake a malignant individual for a benign person. Precision and recall would decide whether or not the model is dependable [48].
Recall = TP / (TP + FN).  (22)

The F1 metric is the harmonic mean of recall and precision; hence it provides a comprehensive overview of these two measures. When we strive to improve a model's precision, the recall decreases, and vice versa. The F1-score incorporates both trends into a single number [48]:

F1 = 2 × (Precision × Recall) / (Precision + Recall).  (23)
Sensitivity is the ability of a test to correctly classify a sample as malignant/positive. It is defined as the ratio of correctly predicted positive cases to the total positive cases [48]:

Sensitivity = TP / (TP + FN).  (24)
Specificity, on the other hand, is the ability of a test to correctly classify a sample as negative/benign. It gives the relationship of observed negative examples to all negative examples [48]:

Specificity = TN / (TN + FP).  (25)
In addition to the parameters mentioned above, three model evaluation metrics in R are utilized to examine the performance of the implemented algorithms: the Kappa value, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). The Kappa value, often known as Cohen's Kappa, is a measure of classification accuracy normalized to the dataset's random-chance baseline; it is a statistic that compares the observed accuracy with the expected accuracy. The Kappa statistic is used to evaluate not just a single classifier but also subclassifiers. The MAE is the average of all absolute errors, i.e., of the differences between the expected and actual values. References [50,51] give the formula

MAE = (1/n) Σ_{i=1}^{n} |y_i − x_i|,  (26)

where n is the total number of samples, y_i is the predicted value, and x_i is the actual value. The Root Mean Squared Error (RMSE) is the square root of the mean of the squares of all errors:

RMSE = √( (1/n) Σ_{i=1}^{n} (y_i − x_i)² ).  (27)

RMSE is widely used and regarded as an excellent error metric for various sorts of predictions [51,52].
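The metric definitions above can be collected into one small helper. The confusion-matrix counts passed in below are hypothetical, chosen only so the arithmetic is easy to follow (they are not results from the study):

```python
import math

def metrics(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)   # eq. (20)
    precision   = tp / (tp + fp)                    # eq. (21)
    recall      = tp / (tp + fn)                    # eq. (22), equals sensitivity
    f1          = 2 * precision * recall / (precision + recall)  # eq. (23)
    specificity = tn / (tn + fp)                    # eq. (25)
    return accuracy, precision, recall, f1, specificity

# Hypothetical counts for a 143-sample test split
acc, prec, rec, f1, spec = metrics(tp=50, tn=88, fp=2, fn=3)

def mae(pred, actual):
    # Mean absolute error, eq. (26)
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

def rmse(pred, actual):
    # Root mean squared error, eq. (27)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))
```

With these counts, accuracy is 138/143 and recall is 50/53; note that recall deliberately ignores TN, which is why it is the metric to watch when missing a malignant case is the costly error.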

Results and Discussions
In the previous section, 20 various classification algorithms were discussed. To analyze their performance, all the algorithms were applied to Wisconsin's Breast Cancer dataset to predict whether the cancer is benign or malignant. The analysis was based on the various parameters discussed in Section 4, such as precision, F1 measure, confusion matrix, recall, accuracy, and run time. As discussed in Section 2, the dataset contains 569 samples, each with 33 attributes. The dataset was preprocessed by removing unnecessary attributes that do not contribute to the output and checking for incomplete data. After that, the dataset was divided into 426 training samples and 143 testing samples. All algorithms were trained on the training samples before being verified on the test data. The evaluation parameters given in this section are obtained from the validation on the test samples. The current method employs a data-driven approach: based on historical data, the model is able to detect breast cancer. This method is also faster in breast cancer detection than traditional diagnostic methods, in terms of both performance and time. The novelty of this method is attributed to the fact that the ensemble model developed for the current study consists of a combination of boosting and optimization algorithms, which dynamically reduce the errors within a few iterations. The confusion-matrix values obtained for each algorithm are tabulated in Table 4. Based on these four values, accuracy, precision, recall, F1 measure, sensitivity, and specificity are calculated using equations (20)-(25) and tabulated in Table 5. Apart from the above-discussed parameters, the three model evaluation metrics in "R" are also calculated using equations (26) and (27) and tabulated in Table 6. The accuracy and run time of the algorithms are graphically depicted in Figures 8 and 9.

Discussions.
The hyperparameters used to implement the algorithms are summarized in Table 7. From Table 5 it is clear that SGD yields the highest accuracy of 98%, followed by random forest, SMO, and gradient boosting with 97%. BNB yields the lowest accuracy of 63%. This is because BNB is designed for discrete data and operates on the Bernoulli distribution: its key characteristic is that it only takes binary values such as true or false, whereas the Breast Cancer dataset has a continuous range of values for all the features. This is why BNB is not able to produce higher accuracy. All other algorithms yield accuracies ranging from 90% to 96%. KNN provided an accuracy of 95% with its K value set to 4. The K value was decided using a search in which, for each K value from 1 to 10, the resulting error was plotted and visualized in Figure 10.
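The K search described above can be sketched as an error curve over K = 1..10, with the best K taken at the minimum error (the paper visualized this curve and selected K = 4; the split seed here is an assumption):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=143, random_state=42)

# Misclassification rate on the test split for each candidate K
errors = []
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors.append(1 - knn.score(X_te, y_te))

best_k = int(np.argmin(errors)) + 1  # K with the lowest error
print(best_k, errors)
```

Plotting `errors` against K reproduces the kind of curve shown in Figure 10.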
Smaller K values can be noisy and have a greater effect on the outcome, whereas bigger K values give smoother decision boundaries, meaning reduced variance but increased bias, and are more computationally intensive. Hence an optimal K value of 4 was selected. The type of solver and the learning rate used for logistic regression were selected based on the grid search method: three solvers, "newton-cg," "liblinear," and "saga," were tried, with the learning rate set in a range from 0.001 to 10. Logistic regression yielded an accuracy of 95% with the "liblinear" solver and a learning rate of 10. Figure 9 shows the run time of all implemented algorithms.
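The grid search over solvers and regularization strengths can be sketched with `GridSearchCV`; scikit-learn's `C` parameter stands in for the "learning rate" range 0.001 to 10 named in the text, and the scaling step and exact grid values are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale features so all three solvers converge reliably
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# The three solvers from the text, swept over a 0.001-10 range
param_grid = {
    "logisticregression__solver": ["newton-cg", "liblinear", "saga"],
    "logisticregression__C": [0.001, 0.01, 0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```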
The gradient boosting algorithm was found to take the most time, 7.74 seconds, followed by logistic regression with 4.3 seconds. This is because the gradient booster tends to overfit the model; hence a large number of boosting stages was set to increase the performance of the model. In this case, the boosting stages were set to 2000, which caused the algorithm to take more time. The same applies to random forest. As discussed before, the grid search used in logistic regression to find the optimal solver and learning rate resulted in more time. The decision tree provided an accuracy of only 92%; the tree diagram for the dataset is given in Figure 11.
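The trade-off described above can be reproduced in a small sketch: `n_estimators=2000` matches the boosting-stage count mentioned in the text, while the other settings and the split seed are assumptions. Fitting 2000 stages is noticeably slower than the defaults, which is the run-time cost being discussed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=143, random_state=42)

# Many boosting stages: better fit, but longer training time
gb = GradientBoostingClassifier(n_estimators=2000, random_state=42)
gb.fit(X_tr, y_tr)
acc = gb.score(X_te, y_te)
print(acc)
```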
To increase the accuracy, pruning was executed. Pruning reduces the size of a decision tree by removing nodes that do not contribute to model training, and it can also improve classification accuracy. The pruned decision tree provided an accuracy of 95% and is shown in Figure 12.

The ultimate objective of every ML challenge is to find a model that can predict the desired outcome with high accuracy. Rather than creating a single model and hoping it is the best, an ensemble takes a variety of diverse models into consideration and averages them into a single final model. In other words, the ensemble approach is a machine learning strategy that integrates numerous basic models to build a single highly accurate model. Common ensemble algorithms such as bagging, AdaBoost, LogitBoost, and gradient boosting were discussed above and implemented. In addition, a novel ensemble model has been developed combining logistic regression, stochastic gradient descent, and random forest. The hyperparameters of the developed ensemble model are given in Table 8. The performance of this ensemble model is investigated by implementing it on both platforms (i.e., Colab notebook and WEKA software). The results obtained are compared and discussed in the next section.

Comparative Study between WEKA and Python Coded Results.
A comparative study has been performed between the WEKA results and the Python-coded results, using the ensemble model developed above. Table 9 summarizes the results obtained through WEKA and Python for the ensemble of the three algorithms. From Table 9, it is evident that the model implemented in Python yielded an accuracy of 98.5%, while the model implemented in WEKA provided an accuracy of 97%. However, the run time is longer in Python. This is because logistic regression uses a grid search to find the optimal solver and learning rate, while the other two algorithms use a high number of boosting stages to yield better results, which together account for the greater computational time compared to WEKA. From Tables 5 and 9, it is also evident that the developed ensemble model yields an accuracy of 98.5%, while the highest accuracy yielded by an individual classifier is 98% (SGD). Conventional ensembles, namely gradient boosting, AdaBoost, LogitBoost, and bagging, achieved accuracies of 97%, 95%, 96%, and 95%, respectively. Hence, it is concluded that the performance of the ensemble model developed in Python is better than that of the traditional ML algorithms.

Conclusion
Breast cancer is a very common type of cancer, found predominantly in women. In this paper, 20 different ML classification algorithms were trained on the Breast Cancer dataset from the University of Wisconsin to detect it. From the results obtained, SGD yielded the highest accuracy of 98%, while BNB provided the lowest accuracy of 63%. Apart from accuracy, various other evaluation parameters, namely the confusion matrix, precision, recall, F1 measure, sensitivity, specificity, Kappa value, MAE, RMSE, and run time, were calculated and tabulated. Further, a comparative study was carried out to analyze the performance of the algorithms when coded in Python and when run in WEKA. An ensemble model combining logistic regression, random forest, and SGD was developed and deployed in the comparative study, and its performance on both platforms was compared and investigated in detail. The ensemble model developed in Python achieved an accuracy of 98.5%, while the model developed in the WEKA data mining tool yielded an accuracy of 97%. We believe that the development and implementation of ensemble models can improve the quality of diagnosis and bring a great advantage compared to recent works. Future directions of this work include the development of a real-time AI-based hardware prototype that can be deployed in hospitals around the world to assist doctors in detecting breast cancer. The use of AI and ML technology is not limited to the detection of breast cancer; it extends to the detection of other deadly diseases as well. The same technology can automate society, and one such example is described in this paper. Thus, the work in this paper is an attempt to improve and automate society, establish the concept of smart hospitals, and contribute to the development of these technologies.

Data Availability
Data will be available upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.