Machine Learning Hybrid Model for the Prediction of Chronic Kidney Disease

To diagnose an illness, doctors typically conduct physical exams and review the patient's medical history, followed by diagnostic tests and procedures to determine the underlying cause of symptoms. Chronic kidney disease (CKD) is currently a leading cause of death, with a rapidly increasing number of patients, resulting in 1.7 million deaths annually. While various diagnostic methods are available, this study uses machine learning because of its high accuracy. We build the proposed model with a hybrid technique and use the Pearson correlation for feature selection. In the first step, the best models are selected on the basis of a critical literature analysis. In the second step, these models are combined in the proposed hybrid model: Gaussian Naïve Bayes, gradient boosting, and decision tree classifiers serve as base classifiers, and the random forest classifier serves as the meta-classifier. The objective of this study is to evaluate machine learning classification techniques and identify the best-performing classifier in terms of accuracy. The proposed model addresses overfitting while achieving the highest accuracy. The study also highlights some of the challenges that affect performance. We critically review the existing machine learning classification techniques, evaluate them in terms of accuracy, and present a comprehensive analytical evaluation of the related work in tabular form. In the implementation, we use the top four models and build a hybrid model for prediction on the UCI chronic kidney disease dataset. Gradient boosting achieves around 99% accuracy, random forest achieves 98%, the decision tree classifier achieves 96%, and the proposed hybrid model performs best, reaching 100% accuracy on the same dataset.
Some of the main machine learning algorithms used to predict the occurrence of CKD are Naïve Bayes, decision tree, K-nearest neighbor, random forest, support vector machine, LDA, gradient boosting (GB), and neural networks. In this study, we apply GB, Gaussian Naïve Bayes, and decision tree along with random forest to the same set of features and compare their accuracy scores.


Introduction
Nowadays, chronic kidney disease (CKD) is a rapidly growing disease, and millions of people die due to a lack of timely, affordable treatment. Chronic kidney disease patients predominantly live in low- and middle-income countries [1,2].
In 2013, about one million people died due to chronic kidney disease [3]. The developing world suffers more from chronic kidney disease: low- and middle-income countries contain a total of 387.5 million CKD patients, of whom 177.4 million are male and 210.1 million are female [4]. These figures show that a large number of people in developing countries suffer from chronic kidney disease, and this number is increasing day by day. A lot of work has been done on the early diagnosis of chronic kidney disease so that the disease can be treated at an early stage. In this article, we focus on machine learning prediction models for chronic kidney disease, with an emphasis on accuracy.
Chronic kidney disease is a common type of kidney disease that occurs when both kidneys are damaged, and CKD patients suffer from this condition over the long term. Here, the term kidney damage means any kidney condition that can cause improper functioning of the kidney. This could be caused by any disorder or be indicated by a reduction in the glomerular filtration rate (GFR) [5]. Our proposed prediction model takes the clinical symptoms as input and predicts the results using a stacking classifier with the random forest algorithm as the meta-classifier.
Machine learning is gaining significance in healthcare diagnosis as it enables intricate analysis, thereby minimizing human errors and enhancing the precision of predictions. Machine learning algorithms and classifiers are now considered among the most reliable techniques for the diagnosis of different diseases, such as heart disease, diabetes, tumors, and liver disease [6].
Naïve Bayes, SVM, and decision tree algorithms have been used for classification, while random forest, logistic regression, and linear regression have been used for regression in medical prediction tasks. With the efficient use of these algorithms, the death rate can be reduced through early-stage diagnosis, and patients can be treated in time. Along with managing the clinical symptoms, chronic kidney disease patients should include physical activity in daily life. They should exercise, drink water, and avoid junk food. The common symptoms of chronic kidney disease are shown in Figure 1.
This article delivers an overview and analysis, followed by an implementation and evaluation of the machine learning classifiers used in CKD diagnosis. Further, this article discusses the importance of machine learning classifiers in healthcare and explains how they can make more accurate predictions. Figure 2 represents the block diagram of the chronic kidney disease prediction model.
The core objective of this article is to propose and implement a hybrid machine learning prediction model for chronic kidney disease in which due importance is given to accuracy. We analyze the accuracy of different machine learning algorithms on the same dataset and compare their accuracy scores so as to obtain a better model. Our focus remains on solving the overfitting problem using cross-validation while achieving the highest accuracy, building the best hybrid model from a combination of popular machine learning classifiers: decision tree, gradient boosting, Gaussian Naïve Bayes, and random forest. The ultimate goal is to deliver accurate and effective treatment to CKD patients at a reduced cost. Before we proceed further, we need to know a little more about common diseases of the kidney. Table 1 lists some of the most common kidney diseases (Table 2).
The remaining portion of the article is organized as follows. Section 2 contains the literature survey along with the tabular comparison of the different machine learning algorithms used and an analysis of the results. Section 3 contains the proposed methodology. Section 4 contains the dataset details. Section 5 contains results and discussion. Section 6 contains conclusion and future work.

Literature Review
This section covers research work related to the algorithms and assesses some algorithms based on their accuracy. In [7], data mining applied to the analysis of clinical records is shown to be a good method. The accuracy of the decision tree method was 91%, compared with the Naïve Bayesian method. The classification algorithm for the diabetes dataset had 94% specificity and 95% sensitivity. The authors also found that mining helps retrieve correlations among attributes that are not direct indicators of the class they are trying to predict. Similar work still needs to be done to improve the overall performance of prediction accuracy in the statistical analysis of neural networks and clustering algorithms.
In [8], the authors described prediction models using machine learning techniques including K-nearest neighbor (KNN), support vector machine (SVM), logistic regression (LR), and decision tree classifiers for CKD prediction. From the experiment, it was concluded that the SVM classifier provides the highest accuracy, 98.3%. SVM also has the best sensitivity after training and testing with the proposed method. Therefore, according to this comparison, it could be concluded that an SVM classifier is suitable for predicting chronic kidney disease.
In [9], the authors chose four different algorithms and compared them to obtain an accurate prediction rate on the dataset. Of all the approaches presented, the gradient boosting classifier gave the best results. The model achieves an accuracy of 99.80%, whereas AdaBoost and LDA achieve 97.91% at the low end. The gradient boosting classifier takes more time to make predictions than the others, but it has a higher predictive value on both measures (ROC and AUC). Hence, an accurate prediction depends on the preprocessing strategy, and the preprocessing methods must be chosen carefully to achieve reliable results.
In [7], the authors investigated the ability of machine learning, supported by predictive analysis, to predict CKD early. An experimental procedure was performed on a dataset of 400 cases collected by Apollo Hospitals, India. In that article, two labels were used as output targets (patients having CKD and healthy individuals), and four different machine learning classifiers were implemented. On comparison of these classifiers, the classification and regression tree (RPART) model showed remarkably better results in terms of accuracy. The authors used the information gain ratio as the splitting criterion, and the optimum splitting reduces the noise of the resulting feature subsets. In that study, the limiting value of the RPART splitting criterion was five, meaning that splits occur repeatedly while five instances are present in the leaf node. In addition, they assigned equal prior probabilities to the class attributes. The RPART prediction model used seven terminal nodes for the early prediction of CKD. The experimental results showed that the highest AUC and TPR were obtained with the machine learning prediction model, whereas the highest TNR (1.00) was achieved with the RPART model. The RPART model can be described as a set of rules for making the decision. However, the major drawback of RPART is the consideration of a single factor as a parameter in every division procedure, while considering different parameter combinations could result in better CKD predictions. However, the machine learning prediction model gives the lowest error rate. The major reason is that the MLP can adapt to and handle complex predictions. Hidden nodes are useful because they allow neural networks to model complex relationships between parameters and to deal with nonlinearity in the data. The overall results indicate that machine learning algorithms offer a promising and feasible methodology for early CKD prediction.

Glomerulonephritis
Glomerulonephritis causes infection and damage to the filtering part of the kidneys (the glomerulus). It can occur quickly or develop over a longer period. Toxins, metabolic wastes, and excess fluid are not properly filtered into the urine. Instead, they build up in the body, producing inflammation and fatigue.

Polycystic kidney disease
Polycystic kidney disease (PKD) is a genetic disorder in which many fluid-filled cysts grow inside the kidneys. Usually, they are harmless. The cysts can change the shape of the kidneys while making them much bigger.

No.  Author                    Evaluation measure
2    Charleonnan et al. [9]    ACC = (TP + TN)/(P + N)
3    Ghosh et al. [7]          The results of the performance indices depend on TP, TN, FP, and FN
4    Fu et al. [10]            Extreme values = points > Q3 + 1.5 (IQR) or points < Q1 − 1.5 (IQR)
5    Devika et al. [11]        Accuracy = number of correctly classified samples/total number of samples
6    Revathy et al. [12]       Accuracy = (TP + TN)/(TP + TN + FP + FN)
7    Nishat et al. [14]        Accuracy = (TP + TN)/(TP + TN + FP + FN)
8    Rabby et al. [13]         Descriptive analysis of the data as well as the experimental results
9    Pouriyeh et al. [15]      Finding the most significant feature using the chi-square test
10   Jabbar et al. [16]        Experimental results only

Computational Intelligence and Neuroscience
As we have already seen, there are different machine learning prediction models and learning programs available to assist practitioners. In [5], the authors used a new selection approach for predicting CKD. In that work, CKD is predicted using specific classifiers, together with a study of their overall performance. The authors evaluated the Naïve Bayes, random forest, and artificial neural network classifiers and concluded that the random forest classifier performs better than the other classifiers. The quality of CKD forecasting has improved progressively. Several evolutionary strategies can be used to improve the outcomes of the suggested classifiers. Here, Naïve Bayes, random forest, and KNN were applied to predict CKD. Early diagnosis of CKD helps to treat those affected in time and to prevent the disease from progressing to a worse stage. Early detection and timely treatment of this type of disease is one of the main objectives of the medical field.
In [10], a machine learning prediction model was developed for the early prediction of CKD. The input features were gathered from the CKD dataset, and the models were tested and validated on these features. Decision tree, random forest, and support vector classifiers were constructed for the diagnosis of CKD. The performance of the models was assessed on the basis of the accuracy score. On comparison, the results showed that the random forest classifier performs much better at predicting CKD than the decision tree and support vector classifiers.
The kidneys play a vital role in maintaining the body's blood pressure, acid-base balance, and electrolyte balance, in addition to filtering toxins from the body. Kidney malfunction is responsible for illnesses ranging from minor to fatal, in addition to dysfunction in other organs. Therefore, researchers all over the world have dedicated themselves to finding techniques to accurately diagnose and effectively treat chronic kidney disease. As machine learning classifiers are increasingly used in the medical field for diagnosis, CKD is now also among the diseases that can be predicted using machine learning classifiers. Research on detecting CKD with ML algorithms has progressively improved both the procedure and the accuracy of the results. The authors proposed the random forest classifier (99.75% accuracy) as the most efficient among all the classifiers considered. The study demonstrates the effective handling of missing values in data through four techniques, namely, the mode, mean, median, and zero-point methods. It also evaluates the performance of machine learning models under two scenarios, with and without hyperparameter tuning, and observes a significant improvement in the classifiers' performance, which is presented visually through graphs [11].
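The four missing-value strategies named above (mode, mean, median, and zero) can be illustrated with a minimal sketch. This is not the implementation used in [11]; the helper name `impute` and the blood-pressure readings are hypothetical and serve only to show how each strategy fills the gaps in a single numeric column.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Return a copy of `values` with None entries replaced.

    `strategy` is one of "mean", "median", "mode", or "zero".
    The fill value is computed from the observed (non-missing) entries only.
    """
    observed = [v for v in values if v is not None]
    fill_functions = {
        "mean": mean,
        "median": median,
        "mode": mode,
        "zero": lambda _observed: 0,
    }
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical blood-pressure readings with two missing entries.
bp = [80, 70, None, 80, 90, None, 100]
imputed = {name: impute(bp, name) for name in ("mean", "median", "mode", "zero")}

for name, column in imputed.items():
    print(name, column)
```

Which strategy works best depends on the attribute: a skewed laboratory value may be better served by the median, while a categorical attribute only admits the mode.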
Overall, the aim of the study is to examine the applicability of specific supervised machine learning classifiers in the field of bioinformatics and to assess their suitability for detecting serious diseases, such as diagnosing CKD at an early stage [12].
The authors built an updated and proficient machine learning (ML) application that can detect and predict the state of chronic kidney disease. In that work, the ten most important machine learning methods for predicting chronic kidney disease were considered. The authors reported that the accuracy of the classification algorithms used in their project was as good as they wanted.
For the prediction of disease, the first and most essential step is detection, which is costly in developing countries such as Pakistan and Bangladesh. The people of these countries suffer from CKD the most, and the proportion of CKD patients is currently increasing rapidly in Pakistan and Bangladesh. In that article, the authors tried to develop a system that helps predict the risk of CKD. In the proposed model, they processed UCI datasets and real-time datasets, dealt with missing data, and trained the model using random forest and ANN classifiers. They implemented these two algorithms in Python. The accuracy they obtained with the random forest algorithm is 97.12%, and that with ANN is 94.5%, which is relatively good. With this method, risk prediction of CKD at an early stage is possible.
In [13], the authors predicted CKD based on sugar levels, albumin levels, and red blood cell percentage. In that study, five classifiers were applied, namely, Naïve Bayes, logistic regression, decision table, random tree, and random forest, and for each classifier, the results were recorded under three conditions: (i) without preprocessing, (ii) SMOTE with resampling, and (iii) class equalizer. The random forest classifier was observed to give the highest accuracy, 98.93%, with SMOTE and resampling.

Comparison of Machine Learning Classifiers for CKD.
In this section, a comprehensive comparison of the state of the art is presented in the form of a table. The evaluation is made in terms of accuracy, as shown in Table 3. The table has eight features that are described below:
Author: this contains the names of the authors of each article along with the reference.
Year: this column provides the year of the paper's publication.
Input data: this column shows the type of dataset that was used as input for the machine learning classifiers.
Disease type: this column shows the type of disease that was predicted using the different classifiers. The table also shows the best classifier found in each research paper, which is the classifier with the maximum accuracy.
Classifiers: this column lists the different machine learning classifiers that were used in the research and compared with each other.
Tool: this column represents the programming language or framework that was used in building the model. The researchers used these tools to preprocess the input data, then create a prediction model, and finally carry out testing.
Cross-validation: this column gives information about the validation of the classifiers and compares the research papers with regard to the number of cross-validation folds used.
Accuracy: the accuracy of the recommended model is given in this column. If the article presents a comparison, the accuracy column contains only the accuracy percentage of the best classifier confirmed by the author.
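The accuracy measure used by most of the reviewed studies, (TP + TN)/(TP + TN + FP + FN), can be computed directly from the confusion counts. The following is a minimal sketch; the function names and the example labels are illustrative assumptions, not data from any of the cited papers.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for zero samples")
    return (tp + tn) / total


# Hypothetical labels: 1 = CKD, 0 = non-CKD.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, acc)  # 4 2 1 1 0.75
```

Accuracy alone can be misleading on imbalanced data, which is why several of the reviewed studies also report sensitivity and specificity.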

ML Classifier with Highest Accuracy.
The machine learning algorithms that we analyzed from the above literature are listed in Table 4 and Figure 3.

Proposed Methodology
The proposed hybrid model is implemented in Python with pandas, sklearn, Matplotlib, Plotly, and other essential libraries.
We downloaded the CKD dataset from the UCI repository. The dataset contains two groups: CKD, represented by 1, and non-CKD, represented by 0. The machine learning algorithms with the best accuracy are selected for analysis and implementation so that the results are reproducible. We have also developed a hybrid model based on the knowledge gained during the analysis and implementation. The hybrid model consists of Gaussian Naïve Bayes, gradient boosting, and decision tree as base classifiers and random forest as the meta-classifier. We selected tree-based machine learning algorithms to achieve the highest accuracy while handling the overfitting problem. In this paper, we detect outliers with violin plots, as shown in Figure 4. To address overfitting, we implement the k-fold technique and design our model in such a way that it reduces overfitting while achieving the highest accuracy. The classifiers are discussed below.
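As an illustration of the k-fold technique mentioned above, the sketch below generates train/test index splits in plain Python. It assumes 400 samples and 10 folds (the figures given elsewhere in this article); the function name `k_fold_indices` and the fixed shuffle seed are hypothetical choices, and the split is not stratified by class.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold is used
    exactly once as the held-out test set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=400, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} train, {len(test)} test")
```

Because every sample is held out exactly once, a large gap between training accuracy and the average held-out accuracy is a practical signal of overfitting.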

Naïve Bayes (NB).
The NB classifier belongs to the group of probabilistic classifiers and is based on Bayes' theorem. It assumes strong independence between the features, and this assumption is central to how the classifier makes predictions. It can be built easily and is well suited to the medical field for the prediction of different diseases [15].
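To make the independence assumption concrete, the following is a minimal Gaussian Naïve Bayes sketch in plain Python. It is not the implementation used in this study: the two features (serum creatinine and hemoglobin), their values, and the small variance floor are hypothetical. Each feature contributes its own Gaussian log-likelihood, and the contributions are simply added, which is exactly what the independence assumption permits.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor avoids division by zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(y)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""

    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood

    return max(model, key=log_posterior)


# Hypothetical samples: [serum creatinine, hemoglobin]; 1 = CKD, 0 = non-CKD.
X = [[1.0, 15.0], [0.9, 14.5], [1.1, 15.5], [4.0, 9.0], [5.2, 8.5], [3.8, 10.0]]
y = [0, 0, 0, 1, 1, 1]

nb_model = fit_gaussian_nb(X, y)
pred_ckd = predict_gaussian_nb(nb_model, [4.5, 9.2])
pred_healthy = predict_gaussian_nb(nb_model, [1.0, 15.2])
print(pred_ckd, pred_healthy)
```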

Decision Tree (DT).
The decision tree classifier has a tree-like or flowchart-like structure. It consists of branches, leaf (child) nodes, and a root (parent) node. The inner nodes correspond to tests on the features, whereas the branches represent the outcome of each test at each node. The decision tree is one of the most commonly used classifiers because it does not need extensive domain knowledge or parameter constraints to work [15].
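How an inner node chooses its test can be sketched with entropy-based information gain. This is an assumption for illustration: the article does not state which splitting criterion its decision tree uses, and the hemoglobin values and labels below are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; keep the best one.

    Assumes the feature takes at least two distinct values.
    """
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(midpoints, key=lambda t: information_gain(values, labels, t))


# Hypothetical hemoglobin readings; 1 = CKD, 0 = non-CKD.
hemo = [15.4, 11.3, 9.6, 14.2, 8.1, 13.9]
labels = [0, 1, 1, 0, 1, 0]

threshold = best_threshold(hemo, labels)
gain = information_gain(hemo, labels, threshold)
print(threshold, gain)
```

A full tree applies this search recursively to each resulting subset until the leaves are pure or a stopping rule is reached.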

Random Forest (RF).
In the ensemble and stacking classification approach, random forest (RF) is the most effective algorithm among the machine learning algorithms considered. The RF algorithm is used for prediction and probability estimation. An RF classifier consists of many decision trees, each of which casts a vote to determine the object's class. Tin Kam Ho of Bell Labs introduced the concept of random forest in 1995. The RF method combines bagging with random selection of attributes. The random forest classifier has three hyperparameters to tune [16]:
(i) Number of decision trees (n tree) used by the random forest classifier
(ii) Size of the minimum node in the trees
(iii) Number of attributes employed in splitting every node for every tree (m try), where m is the number of attributes

... performance when testing data patterns have those attribute values and sometimes gives wrong output labels [22].

Hybrid Model.
We use the concept of stacking for our hybrid model. Stacking is a type of ensemble technique in which multiple classification models are combined through a main (meta) classifier. The models are arranged in layers: the models in the lower layer receive the attributes of the original data as input and pass their predictions upward, and the model in the topmost layer makes the final decision on the basis of the combined outputs of the base models. In other words, the stacking technique uses multiple independent machine learning models to process the original data. After that, the meta-classifier makes its prediction from the input along with the output of each base model, and the weights of the individual algorithms are estimated. The best-performing algorithms are selected, and those with low performance are removed. In this technique, multiple classifiers built with different machine learning algorithms are trained on the same dataset as base models and combined through a meta-classifier [23]. Figure 5 shows the flow diagram for the proposed hybrid model.
The execution of the model proceeds in the following sequence of steps:
(i) Collect the CKD data from the UCI repository
(ii) Perform exploratory data analysis (EDA) on the dataset
(iii) Split the dataset into two parts: test data and train data
(iv) Apply 10-fold cross-validation
(v) Train the base models Gaussian Naïve Bayes, gradient boosting, and decision tree with the train set, giving the predictions M1, M2, and M3, respectively
(vi) The outputs of the base models M1, M2, and M3, together with the test set data, serve as input for training the random forest
(vii) Once the random forest is trained, it gives its prediction on the basis of the training dataset and the output predictions of the base models
In this study, we considered the UCI CKD dataset and split it into two parts. 80% of the data is used for training as input to the machine learning algorithms. We used Gaussian Naïve Bayes, gradient boosting, decision tree, and the stacking classifier with the random forest algorithm to predict chronic kidney disease for the remaining 20% test data, plotted the predicted values, and compared them. Our proposed methodology has the following advantages.
(i) We implemented four machine learning algorithms: decision tree, gradient boosting, Gaussian Naïve Bayes, and random forest. We applied a stacking classifier to build the hybrid model that combines these four algorithms.
(ii) We analyzed the accuracy of different machine learning algorithms on the same dataset and compared their accuracy scores to get the best model
(iii) We implemented a stacking classifier technique to build a new model with improved accuracy
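The steps above can be sketched with scikit-learn's `StackingClassifier`. This is a hedged illustration, not the authors' code: it uses a synthetic dataset from `make_classification` (400 samples, 14 features) as a stand-in for the UCI CKD data, all hyperparameters are library defaults or arbitrary choices, and `passthrough=True` is an assumption made so that the meta-classifier sees the original features alongside the base-model predictions. The accuracy it prints therefore says nothing about the results reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the CKD data (assumption): 400 samples, 14 features.
X, y = make_classification(
    n_samples=400, n_features=14, n_informative=6, random_state=0
)

# 80% train / 20% test, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Base classifiers: Gaussian Naive Bayes, gradient boosting, decision tree.
base_models = [
    ("gnb", GaussianNB()),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

# Random forest as the meta-classifier. Internally, StackingClassifier trains
# the meta-classifier on out-of-fold predictions of the base models.
hybrid = StackingClassifier(
    estimators=base_models,
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    cv=5,
    passthrough=True,
)

# 10-fold cross-validation on the training portion.
cv_scores = cross_val_score(hybrid, X_train, y_train, cv=10, scoring="accuracy")

# Final fit on the training data and evaluation on the held-out 20%.
hybrid.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, hybrid.predict(X_test))

print(f"mean 10-fold CV accuracy: {cv_scores.mean():.3f}")
print(f"held-out test accuracy:   {test_accuracy:.3f}")
```

Comparing the cross-validated score with the held-out score is a simple check on overfitting: the two should be close if the hybrid model generalizes.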

Dataset Details
We selected 14 attributes from the UCI chronic kidney disease dataset as input features, as shown in Table 5: age is the patient's age, bp indicates the blood pressure, sg indicates the specific gravity of the urine, al indicates the level of albumin in the patient's urine, bgr (blood glucose random) indicates the random blood glucose level, su represents the sugar level, bu indicates the blood urea, sod indicates the amount of sodium, sc indicates the serum creatinine, pot indicates the amount of potassium, hemo indicates the hemoglobin, and pcv indicates the packed cell volume. Further, wc indicates the white blood cell count, and rc indicates the red blood cell count.
To identify the number of chronic kidney disease patients and the number of healthy individuals, we visualized the CKD dataset as the histogram shown in Figure 6. Here, 0.0 represents the healthy cases, while 1.0 represents the chronic kidney disease patients. In this dataset, there are 250 chronic kidney disease patients and 150 healthy people.
The Pearson correlation feature selection method is used to get the best combination of features for the prediction of chronic kidney disease. The correlation of the 14 attributes and 1 output label is presented in Figure 7.
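A minimal sketch of Pearson-correlation feature selection follows. It assumes that features are kept when the absolute value of their correlation with the output label reaches a threshold; the threshold of 0.5, the function names, and the sample values are all hypothetical, since the article does not state the selection rule it applied.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.5):
    """Keep features whose |r| with the target is at least `threshold`."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    selected = [name for name, r in scores.items() if abs(r) >= threshold]
    return selected, scores


# Hypothetical values for three attributes; target: 1 = CKD, 0 = non-CKD.
columns = {
    "hemo": [15.4, 11.3, 9.6, 14.2, 8.1, 13.9],
    "sc": [0.9, 3.2, 4.1, 1.1, 5.6, 1.0],
    "age": [48, 50, 49, 51, 50, 49],
}
target = [0, 1, 1, 0, 1, 0]

selected, scores = select_features(columns, target)
for name, r in scores.items():
    print(f"{name}: r = {r:+.2f}")
print("selected:", selected)
```

With pandas, the same coefficients are available through `DataFrame.corr()`, whose default method is Pearson.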
Moving from the exploratory data analysis stage to pair plot visualization is very helpful, as the pair plot can be used to find the relationships between attributes for both categorical and continuous variables. We import the Seaborn library to obtain the pair plot. It presents the information about all the attributes clearly in one picture. The statistical information represented by the pair plot is shown in Figure 8.
Violin plots are used in the exploratory data analysis for all the attributes that are used in the hybrid model. They give additional useful information, such as the density trace and the distribution of the dataset. Violin plots show the whole range of the dataset, which cannot be shown by a box plot. The violin plots of all 14 attributes are given in Figure 4. Figure 9 shows the comparison of the different models' accuracy scores in the form of a chart.

Results and Discussion
Machine learning algorithms such as gradient boosting, Gaussian Naïve Bayes, decision tree, and random forest classifier were used in the proposed hybrid model. These classifiers were used in combination for chronic kidney disease prediction. This combination also overcomes the overfitting problem and results in higher accuracy. To improve accuracy and to offer a novel approach compared with the existing work, we implemented the proposed hybrid model with the best combination of GB, GNB, and decision tree, along with the random forest classifier [24][25][26][27]. The results described in Table 6 show that the diagnosis of chronic kidney disease is effective when random forest is combined with the other classifiers through stacking in the hybrid model. Gradient boosting achieves 99% accuracy, random forest achieves 98% accuracy, and our hybrid model achieves 100% accuracy while reducing the chances of overfitting. To identify contributions to the development of prediction models for chronic kidney disease, a region-wise analysis was performed. As discussed in the Introduction, the population of developing countries suffers more from chronic kidney disease, and it was observed that most of the research work is performed in developing countries. A summary of this region-wise contribution is presented in Figure 10.

Conclusion
Chronic kidney disease is considered one of the prominent life-threatening diseases in the developing world. The most obvious cause seems to be a lack of physical exercise. Medical practitioners use a number of diagnostic processes and procedures, of which machine learning is a recent development. In this paper, we selected machine learning because, in terms of accuracy, it performs better than other available approaches. We used the Pearson correlation feature selection method and applied the selected features to the machine learning classifiers. GB, GNB, and decision tree are the base classifiers for the stacking algorithm, with random forest as the meta-classifier, and they are evaluated with cross-validation on the basis of accuracy score. We evaluated these algorithms on the same dataset, the CKD dataset from the UCI repository, which contains 14 attributes and 400 instances. On the basis of these attributes, our proposed stacking model is able to predict whether a person is a CKD patient or not with 100% accuracy. The best features are selected using the Pearson correlation method, and the stacking algorithm is implemented with the best machine learning classifiers. Cross-validation enhances the performance of the stacking model. As we worked on binary-class chronic kidney disease data, the stacking algorithm performs better with this combination of algorithms. The stacking technique can also be applied to the prediction of other diseases to obtain better accuracy scores.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.