A Novel Architecture for Diabetes Patients’ Prediction Using K-Means Clustering and SVM

Diabetes is one of the most alarming health issues today. It is a chronic disease that can cause many health-related problems. It is a group of syndromes that result in too much sugar in the blood. The chronic hyperglycemia of diabetes has been linked to long-term damage, organ breakdown, and organ failure, notably in the eyes, kidneys, nerves, heart, and veins. Machine learning has advanced quickly and is now used in many facets of medical health. The goal of this research is to create a model that predicts a patient’s chance of developing diabetes with the highest possible accuracy. This paper proposes a novel architecture for predicting diabetes patients using the K-means clustering technique and a support vector machine (SVM). The features extracted by K-means are classified using an SVM classifier. The approach is tested on a publicly available dataset, namely, the Pima Indians Diabetes Database. An accuracy of 98.7% is achieved on this dataset, where the combined method performs better than conventional SVM-based classification. This paper also compares the accuracy, precision, recall, and F1-score of different machine learning techniques for classifying diabetes patients.


Introduction
Various forms of diabetes exist. In type 1, the pancreas stops producing insulin. This hormone helps digest carbohydrates, fats, and proteins. In type 2 diabetes, the body's cells cannot process insulin. Over time, the production of insulin in the body stops. Because of this, various internal organs begin to fail, which can cause death. A third type, gestational diabetes, is associated with pregnancy, in which a woman's blood sugar level increases [1][2][3][4]. Nothing specific can be done in advance to prevent type 1. For type 2, nutritious food, regular exercise, and weight control are the best preventive measures, and they can prevent 90% of diabetes [5,6]. In 2014, a clinical examination of 4.21 crore citizens found diabetes in 31 lakh, i.e., about 7.75% of people. Awareness about diabetes is still low. 40% of affected citizens are unable to take proper care of themselves. They are not able to control their sugar level either. Half of them have never had an eye test, even though they are at high risk of retinopathy.

Motivation.
According to the WHO, 7.30 million adults in India are in the grip of diabetes. In cities, 10.7 to 14.2% of the population is diabetic, compared with 3 to 6.8% in villages. According to the central government, a quarter of India's 4.25 crore diabetes patients are in India. As per the National Family Health Survey-4, the number of diabetes patients in the country doubled in the year 2007 compared to 2007. 3.5% of women and 3.5% of men aged 35 to 49 have diabetes. Diabetes is present in 9.7% of Indians over 80, compared with 13.1% of people aged 60 to 69 and 13.2% of those aged 70 to 79. The prevalence falls at the oldest ages because many patients do not survive. Among those who live to 60 years or beyond, diabetes is about 36%, or roughly one-third, compared to people in the age group of 60 to 79. Heart disease, stroke, kidney disease, and blindness are also becoming a significant threat to health due to the increasing incidence of diabetes in the new generation [7][8][9].

Main Contributions.
The contributions of this manuscript are as follows: (i) This paper proposes a novel architecture for the early diagnosis of diabetes based on a set of parameters. (ii) This paper uses K-means clustering combined with an SVM algorithm to classify the data. (iii) For the experiments, the Pima Indians Diabetes Database is used. The Pima Indians Diabetes Database contains data on 668 female patients; 80% of these data are used to train the proposed architecture and 20% to test it.
The remainder of the paper is organized as follows: Section 2 describes the related work and compares the results obtained by other researchers on the Pima Indians Diabetes Database using different techniques. Section 3 outlines the proposed approach; it briefly describes our architecture and the algorithms used to improve the accuracy of predicting diabetes. Section 4 contains the experimental evaluation and results; it also describes the dataset used and all the measured parameters. Section 5 concludes the paper.

Related Work
Diabetes prediction using the Pima Indians Diabetes Database has been a topic of interest among researchers over the last few decades. This section highlights some of the methods researchers have used to predict diabetes on the Pima Indians Diabetes Database and the accuracy achieved.
AlJarullah [10] used the decision tree algorithm to predict type II diabetes. Data preprocessing is done in the first phase, and diabetes prediction is completed in the second phase using the decision tree algorithm. The maximum accuracy achieved in this paper is 78.17%. Anand et al. [11] used a higher-order neural network (HONN) combined with principal component analysis (PCA) to predict type II diabetes. In this paper, the authors used PCA to handle the missing data and also to scale the data to the same range of values. The maximum accuracy achieved in this paper is 89.47%. Banerjee et al. [12] used a neural network and an evolutionary algorithm-based approach for predicting diabetes. This paper also compares the neural network model with other models. The maximum accuracy achieved in this paper is 93.5%. Barale and Shirke [13] used the K-means clustering algorithm combined with an artificial neural network (ANN) and K-means combined with logistic regression classifiers to predict diabetes. The K-means clustering technique is applied to uncover hidden patterns in the dataset. The maximum accuracy achieved in this paper is 98%.
Chikh et al. [14] used a modified artificial immune recognition system (AIRS) that employs the fuzzy K-nearest neighbor algorithm to diagnose diabetes. The maximum accuracy achieved in this paper is 89.1%. Choubey and Paul [15] used a GA combined with a multilayer perceptron neural network for diagnosing diabetes. In the first phase, the genetic algorithm (GA) is used for feature selection, and in the second, diabetes classification is completed using the multilayer perceptron neural network. The maximum accuracy achieved in this paper is 79.13%. Christobel and Sivaprakasam [16] used class-wise K-nearest neighbor (CkNN) to classify the diabetes dataset. In the first phase, data preprocessing is done, and missing values are replaced with the mean value. In the second phase, diabetes classification is completed using the modified KNN. The maximum accuracy achieved in this paper is 78.16%. Deperlioglu and Utku [17] used a multilayer feedforward NN structure trained by the Bayesian regularization algorithm and the mean square error function. In this study, the ANN was trained ten times. The maximum accuracy achieved in this paper is 95.5%. Gandhi and Prajapathi [18] used F-score, K-means clustering, Z-score normalization, and SVM. In the first phase, data preprocessing is done using F-score and K-means. In the second phase, diabetes classification is completed using SVM. The maximum accuracy achieved in this paper is 98%. Ganji and Abadeh [19] used ant colony optimization (ACO) to predict diabetes. Using an ant colony-based classification method, a set of fuzzy rules for diagnosing diabetes can be extracted. The maximum accuracy achieved in this paper is 84.24%. Hayashi and Yukita [20] used Re-RX with J48 graft, combined with sampling selection techniques, for predicting diabetes. As a "white-box" model, the recursive-rule extraction (Re-RX) method delivers highly accurate classification. The maximum accuracy achieved in this paper is 83.83%.
Huang and Lu [21] used information gain (IG) along with a DNN for the prediction of diabetes. The maximum accuracy achieved in this paper is 90.16%. Iyer et al. [22] used the J48 decision tree algorithm and the naïve Bayes algorithm to classify the dataset and achieved accuracies of 76.9% and 79.5%, respectively. Kahramanli and Allahverdhi [23] used an ANN and a fuzzy NN to classify diabetes datasets. The maximum accuracy achieved in this paper is 86.8%.
Karatsiolis and Schizas [24] used an SVM with an RBF kernel and an SVM with a polynomial kernel to classify the diabetes dataset. First, the dataset was divided into two subsets. Then, the SVM with an RBF kernel is applied to one subset, and the SVM with a polynomial kernel is applied to the other. The maximum accuracies achieved in this paper are 82.2% and 81%, respectively. Karegowda et al. [25] used K-means clustering along with GA and CFS for the classification of the diabetes dataset. A classification accuracy of 96.68% is achieved in three phases. The K-means clustering algorithm is applied in the first phase to identify and eliminate incorrectly classified instances. In the second phase, a GA and correlation-based feature selection (CFS) are applied to extract relevant features. Finally, in the third phase, classification is done using the K-nearest neighbor (KNN) algorithm. Karegowda et al. [26] used K-means clustering combined with decision tree C4.5 to classify the diabetes dataset. In the first phase, K-means clustering is used to eliminate incorrect instances. In the second phase, the decision tree algorithm C4.5 is used to classify the data. The maximum accuracy achieved in this paper is 93.33%. Karegowda et al. [27] used a GA and a back propagation network (BPN) to classify the data. The maximum accuracy achieved in this paper is 77.7%. Kayaer and Yildirim [28] used the GRNN to classify the data and achieved an accuracy of 80.21%.
Kumar Das et al. [29] used random forest and gradient boosting classifiers to classify diabetes datasets and achieved an accuracy of 90%. Initially, data preprocessing is done, and then the classifier is applied to classify the data. Senthil Kumar et al. [30] used covering-based rough set classification for the dataset classification. This is a pattern-based approach. A maximum accuracy of 79.34% is achieved using this procedure. Kumari and Chitra [31] used SVM with an RBF kernel to classify the data and achieved an accuracy of 75.5%. Nirmala Devi et al. [32] used amalgam KNN to classify the data and achieved an accuracy of 97.4%. This amalgam KNN consists of K-means with KNN. The K-means algorithm is used to identify missing values, which are then replaced by the mean and median. Patil et al. [33] used K-means clustering combined with decision tree C4.5 to classify the dataset and achieved an accuracy of 92.38%. Polat [34] used fuzzy C-means combined with SVM and KNN and weighting methods (FCMAW) and achieved accuracies of 91.41% and 84.38%, respectively. Polat et al. [35] used GDA and a least square support vector machine to classify the data and achieved an accuracy of 82.05%. Rado et al. [36] used random forest combined with recursive feature elimination, and the accuracy achieved was 73%. Raghavendra et al. [37] used a neural network model with a backward elimination feature selection method to classify the dataset and achieved an accuracy of 84.52%. Rajni and Amandeep [38] achieved a classification accuracy of 72.9% by using the RB-Bayes algorithm. In this work, the mean is used to handle the missing values. Ramana and Boddu [39] used the naïve Bayes classification algorithm, and the accuracy achieved was 76.34%. Balaji et al. [40] used a deep NN restricted Boltzmann machine, and 80.9% accuracy was achieved.
Vaishali et al. [41] used Goldberg's GA combined with a multi-objective evolutionary fuzzy classifier to classify the type 2 diabetes dataset. In the first stage, essential features are extracted using Goldberg's GA. In the second stage, the multi-objective evolutionary fuzzy classifier is applied, and an accuracy of 83.04% is achieved using this method. Vosoulipour et al. [42] used NN and ANFIS structures and achieved an accuracy of 81.3%. Wong and Lease [43] used Cartesian genetic programming and achieved an accuracy of 80.5%. Wu et al. [44] used an improved K-means algorithm and the logistic regression algorithm for the dataset classification and achieved an accuracy of 95.42%. Zolfaghari [45] used SVM combined with NN and achieved an accuracy of 88.04%, and Bano and Khan [46] used K-NN and achieved an accuracy of 82.29% to classify the dataset.
An automated model for diagnosing diabetes was reported by Lakhvani et al. [47] using a three-layered artificial neural network (ANN). For neuron activation, the authors employed a logistic activation function, and they trained the model using the quasi-Newton approach. Using the Pima Indian Diabetes Dataset, Patil and Ingle [48] presented a comparative analysis of different ML classification algorithms for diabetes prediction. For statistical modeling and accuracy verification, the authors employed KNN, LR, which is based on the regression problem, the naive Bayes probabilistic classifier, SVM with both linear and nonlinear kernels, and decision tree with an RF classifier. The highest accuracy achieved is 80.20 percent. LDA and GA were employed for feature selection by Alharan et al. [49] to increase the classification accuracy for diabetes. The approach has a maximum accuracy of 90.89 percent. Sivaranjani et al. [50] presented a model for diabetes classification using SVM and RF techniques. PCA is also used to reduce the number of dimensions, with maximum accuracy rates of 83 and 81.4 percent, respectively.
In all the techniques used by the researchers, the main challenge is to improve the accuracy of the system for early diagnosis of diabetes. To overcome this problem, this paper suggested a fusion technique in two phases. In the first phase, data preprocessing is done using K-means, and in the second phase, diabetes classification is completed using SVM to achieve the maximum accuracy. Techniques used by different researchers and achieved accuracy are summarized in Table 1.

Proposed Methodology
This section describes the proposed Pima diabetes patient classification model using K-means clustering and SVM. Figure 1 presents an overview of the proposed model. The model first creates clusters using K-means clustering and then uses the SVM for classification.
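The text does not spell out how the K-means output is passed to the SVM. One plausible reading, shown here only as a sketch, is that each record's cluster index is appended as an extra feature before classification. The function names, the toy rows, and the centroids below are hypothetical and are not taken from the paper.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def add_cluster_feature(rows, centroids):
    """Append each row's cluster index as one extra feature column (an assumption)."""
    return [list(row) + [nearest_centroid(row, centroids)] for row in rows]

# Toy data: two features per record (hypothetical values, not from the dataset).
rows = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]
# Centroids assumed to come from a previously fitted K-means model.
centroids = [(1.25, 1.9), (8.5, 8.75)]

augmented = add_cluster_feature(rows, centroids)
print(augmented)
```

The augmented rows could then be given to any SVM implementation; other encodings, such as distances to every centroid, would be equally consistent with the description.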

K-Means Clustering Algorithm.
The K-means algorithm is used to cluster the dataset into different groups. K-means works for multidimensional data; an example for two-dimensional data is shown in Figure 2.
The following steps are used in the K-means clustering algorithm [51]:
(1) Choose the number of clusters, K.
(2) Choose K points at random. These K points will be the initial centroids of the K clusters. They need not belong to the dataset; any K points can be selected.
(3) Assign each data point to the nearest centroid, which forms K clusters. The Euclidean distance is used to measure distance.
(4) Compute the new centroid of each cluster.
(5) Reassign each data point to its nearest new centroid. If any point was reassigned, return to step 4; otherwise, stop.
The number of clusters (in step 1) is determined using the elbow method. For the dataset used, the number of clusters is 5.
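The K-means steps and the elbow criterion can be sketched in plain Python as follows. This is an illustrative implementation on toy points, not the authors' code: the function names are hypothetical, the initial centroids are drawn from the data (one of the options step 2 allows), and the within-cluster sum of squares (WCSS) is assumed to be the quantity inspected in the elbow method.

```python
import math
import random

def assign(points, centroids):
    """Steps 3 and 5: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    """Step 4: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:  # an empty cluster keeps its previous centroid
            new_centroids.append(old)
            continue
        dim = len(members[0])
        new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(dim)))
    return new_centroids

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 2, drawn from the data
    labels = assign(points, centroids)           # step 3
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)   # step 4
        new_labels = assign(points, centroids)          # step 5
        if new_labels == labels:                 # no reassignment: stop
            break
        labels = new_labels
    return centroids, labels

def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances, plotted against K in the elbow method."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

# Toy 2-D points forming two well-separated groups (hypothetical values).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
for k in (1, 2, 3):
    centroids, labels = k_means(points, k)
    print(k, round(wcss(points, centroids, labels), 3))
```

In the elbow method, K is chosen where the WCSS curve stops dropping sharply; on these toy points the large drop occurs between K = 1 and K = 2.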

SVM Classification Algorithm.
SVM was initially developed in the 1960s and later refined in the 1990s. SVM is very popular in machine learning because it is a robust algorithm, and it differs considerably from other machine learning algorithms. SVM is about finding the best decision boundary that separates the dataset into different classes. SVM separates the classes through the maximum-margin boundary between support vectors. For the best boundary, the sum of the distances of the boundary line from the support vectors should be maximal. This boundary line is known as the maximum-margin hyperplane.
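The geometry behind the margin can be illustrated with a short sketch. It assumes a linear boundary written as w·x + b = 0, scaled so that the support vectors satisfy w·x + b = ±1; this notation, the function names, and the toy points are illustrative and do not appear in the paper.

```python
import math

def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin_width(w):
    """Distance between the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical separating hyperplane x1 + x2 - 3 = 0 for a toy 2-D problem.
w, b = (1.0, 1.0), -3.0
positives = [(2.0, 2.0), (3.0, 3.0)]   # class +1; (2, 2) lies on w.x + b = +1
negatives = [(1.0, 1.0), (0.0, 0.5)]   # class -1; (1, 1) lies on w.x + b = -1

for x in positives + negatives:
    print(x, round(signed_distance(w, b, x), 3))
print("margin width:", round(margin_width(w), 3))
```

Training an SVM amounts to choosing w and b so that this margin width is as large as possible while the classes stay on opposite sides of the boundary.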

Experimental Evaluation and Results
The specifics of the dataset used in this study are presented in this section. Results are computed using various classification algorithms and the proposed method. The details are as follows.

Dataset Description.
This paper used a publicly available dataset, namely, the Pima Indians Diabetes Dataset [53]. This dataset contains the data of a total of 668 female patients with eight independent parameters, namely, pregnancies, glucose, blood pressure (BP), skin thickness (ST), insulin, BMI, diabetes pedigree function, and age, and one dependent parameter, outcome [53]. The first five records of the dataset are presented in Table 2.
All the parameters of the dataset are as follows:
(i) Pregnancies: This parameter represents the number of times pregnant.
(ii) Glucose: This parameter represents the plasma glucose concentration at 2 hours in an oral glucose tolerance test.
(iii) Blood Pressure: This parameter represents the diastolic blood pressure (mm Hg).
(iv) Skin Thickness: This parameter represents the triceps skinfold thickness (mm).
(v) Insulin: This parameter represents the 2-hour serum insulin (mu U/ml).
(vi) BMI: This parameter represents the body mass index (weight in kg/(height in m)^2).
(vii) Diabetes Pedigree Function: This parameter represents the diabetes pedigree function.
(viii) Age: This parameter represents the age (years).
(ix) Outcome: This parameter represents the class variable. 0 means nondiabetic, and 1 means diabetic.
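The 80%/20% train/test division of the 668 records reported in this paper can be sketched as below. The paper does not state whether the split was shuffled or stratified, so the random shuffle, the seed, and the function name are assumptions made for illustration only.

```python
import random

def train_test_split(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of `records` and split it into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # assumed: random, non-stratified split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 668 patient records (row indices only, no real data).
records = list(range(668))
train, test = train_test_split(records)
print(len(train), len(test))  # 534 134
```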

Performance Measure.
All the approaches are compared using accuracy, precision, recall, and F1-score as performance measures, computed using (1)-(4) [54]. The confusion matrix is used to calculate all the performance measures. The confusion matrix generated for the proposed method is shown in Figure 3. The performance measures are defined for a confusion matrix (Table 3) with two classes (binary classification).
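(i) Accuracy: The proportion of correct classifications (true positives and true negatives) among the overall number of cases: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (1)
(ii) Precision: The proportion of instances classified as positive that are actually positive: $\mathrm{Precision} = \frac{TP}{TP + FP}$ (2)
(iii) Recall: The proportion of actual positive instances that are correctly classified as positive: $\mathrm{Recall} = \frac{TP}{TP + FN}$ (3)
(iv) F1-score: The balance between precision and recall, given by their harmonic mean: $\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (4)
Here TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix.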
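As a small illustration, the four measures can be computed from the entries of a binary confusion matrix using the standard definitions. The counts below are hypothetical and are not the paper's results; the function name is likewise illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # assumes at least one predicted positive
    recall = tp / (tp + fn)             # assumes at least one actual positive
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 test cases.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```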

Experimental Results.
Table 4 shows the results of the proposed approach on the Pima Indians Diabetes Dataset. An accuracy of 98.7% is recorded using the proposed method, whereas an accuracy of 82.46% is recorded using only the SVM classification algorithm. This is a relative improvement of 19.69% (16.24 percentage points) on the Pima Indians Diabetes Dataset. The comparison of the various classification methods, namely, decision tree, random forest, kernel SVM, naive Bayes, KNN, logistic regression, SVM, and the proposed approach, with respect to accuracy, precision, recall, and F1-score is shown in Figures 4-7.
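The reported 19.69% improvement is consistent with a gain measured relative to the SVM-only baseline rather than a difference in percentage points. The short check below uses only the two accuracies quoted in this section; the variable names are illustrative.

```python
baseline = 82.46   # accuracy (%) of SVM alone, as reported
proposed = 98.7    # accuracy (%) of K-means combined with SVM, as reported

absolute_gain = proposed - baseline              # difference in percentage points
relative_gain = absolute_gain / baseline * 100   # gain as a percent of the baseline

print(round(absolute_gain, 2), round(relative_gain, 2))  # 16.24 19.69
```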
The accuracy, precision, recall, and F1-score of the proposed method are 98.7%, 98.6%, 96.8%, and 97.5%, respectively.

Figure 2: Data space before and after applying the K-means clustering algorithm (panels: before K-means, after K-means).

Conclusion and Future Scope
This study proposed a new architecture for diabetes patient classification using K-means clustering and SVM. The clusters of the database are formed using the K-means clustering method. The predictions are then computed with SVM, using the created clusters as features for classification. The Pima Indians Diabetes Database, a publicly accessible dataset, is used to verify the robustness of the approach. The Pima Indians Diabetes Database contains data on 668 female patients; 80% of these data are used to train the proposed architecture and 20% to test it, yielding a maximum accuracy of 98.7%. In the future, classification rates may rise by extracting more reliable features from the database. Additionally, combining techniques, such as decision fusion of several classifiers, might help the classification process.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Mathematical Problems in Engineering