Predictive Analysis of Diabetes-Risk with Class Imbalance

Type-2 diabetes mellitus (T2DM) is a common chronic disease that increasingly leads to complications affecting vital organs. Its main characteristic, hyperglycemia, is caused by insufficient insulin secretion and poses a serious risk to human health. The objective of this work is to construct a type-2 diabetes prediction model with high classification accuracy. Advanced machine learning and predictive modeling techniques are applied for the early diagnosis of diabetes. This paper proposes an efficient model to predict and classify the minority class of type-2 diabetes. The impact of oversampling and undersampling approaches on reducing the effect of class imbalance is compared across classification algorithms. The Synthetic Minority Oversampling Technique (SMOTE) and Tomek-links are applied and examined, and the outcomes are compared to the original imbalanced dataset using an artificial neural network (ANN) predictive model. The model is also compared with state-of-the-art classifiers such as the support vector machine (SVM), random forest (RF), and decision tree (DT). The tuned model achieved a best accuracy of 92.2%. The experimental findings clearly show improved accuracy, AUC, and F1-measure with the SMOTE oversampling strategy relative to the baseline and undersampling schemes. The study recommends adopting dynamic hyperparameter optimization to further improve accuracy.


Introduction
Diabetes mellitus is a well-recognized lifelong illness that can be fatal. It causes the body to generate less insulin and raises blood sugar levels, leading to disruptions in the regular functioning of organs including the eyes, nerves, kidneys, and heart. The global prevalence of diabetes among individuals over 18 years of age increased from 4.7 percent in 1980 to 8.5 percent in 2014 [1], with the number of persons with diabetes growing from 108 million in 1980 to 422 million in 2014. Diabetes is predicted to affect 642 million people (1 in 10) by 2040, with 46.5% of those undiagnosed [2]. According to scientists, diabetes is influenced by both hereditary and environmental factors. Early identification and treatment can help to reduce disease-related complications and risk factors.
As the healthcare business produces and creates a large amount of useable data, such as patient data, electronic medical records, and diagnostic and treatment data, this may serve as a valuable resource for knowledge discovery [3,4] to aid decision-making and reduce costs.
Several studies have attempted to determine whether a given individual is at risk of developing diabetes from the independent variables, but proved inaccurate. For instance, insulin and BMI have not been linked to a history of type-2 diabetes, and obesity is not necessarily related to a higher BMI [5]. It is necessary to integrate different observations for early diagnosis. However, when several factors were combined for diabetes prediction, the techniques used could not produce good results.
Medical data, on the contrary, are recorded over a long period of time and therefore frequently form unbalanced datasets. Unbalanced data refers to an unequal distribution of samples among the classes. The unbalanced nature of medical data makes it difficult to mine such resources. Although significant progress has been made in machine learning, creating efficient algorithms for unbalanced data remains a difficult challenge [6]. Therefore, the main purpose of this study was to find and include the best data preprocessing approaches before using the processed data for the machine learning model's training.
The proposed solutions to the class imbalance problem fall into two categories: [a] data-level solutions, which modify the data distribution and yield an improved set with a balanced class distribution; and [b] algorithmic-level solutions, which modify the classifier and optimize its accuracy [7,8]. Sampling is the basic data-level solution and can be either undersampling (removing majority-class instances) or oversampling (increasing the number of minority-class instances). Oversampling can result in overfitting and increases complexity and execution time. Undersampling may discard many potentially relevant instances at random, which raises the risk of losing critical data [8,9]; it may, however, be helpful in massive data applications to reduce computational time. Resampling methods have also been used to address the imbalance problem in conjunction with class overlap, noise occurrence, and/or borderline examples in the dataset.
Most resampling methods rely on the k-nearest-neighbor (KNN) rule [7,10], either by eliminating instances of the two classes that are far from the decision boundary to reduce duplication, as in condensing, or by removing those close to the boundary for generalization, as in filtering [11]. Similarly, Tomek-links are used to eliminate instances from the majority class since, if two examples form a Tomek link, either one of them is noise or both are borderline.
To address these problems, the current study handles class imbalance and overlap by employing SMOTE oversampling and Tomek-links undersampling to obtain a subset from the majority of instances and avoid eliminating instances that may help to develop knowledge.
Considering the importance of early detection of T2DM, machine learning and statistical principles are used to generate predictive power, allowing knowledge to be extracted automatically from massive databases and valuable patterns and interrelations to be identified [6]. The presented paper introduces ANN, SVM, RF, and decision tree classifiers for diabetes onset prediction. MLP is a form of ANN that can learn from experience and extract key features from inputs that contain extra, unnecessary data. The performance of a neural network is degraded by too many hidden layers, resulting in an overfitting issue [12]. SVM is a statistically supervised ML classifier for binary classification problems that uses a set of mathematical functions called kernels to transform the input into the proper format. With huge datasets, it can be difficult to choose the proper kernel function, and training takes a long time. A decision tree is a supervised machine learning approach that does not need extensive data preprocessing, but it has certain constraints: it is not stable, since a tiny change in the data can have a big impact on the final estimates, and complexity grows with huge datasets. As a result, preparation and analysis take a long time [12]. Considering these factors, the proposed framework is suggested to produce more accurate results in comparison to other literature. The contributions of the paper are as follows: (1) Preprocessing techniques were applied that included filling in missing values, outlier treatment, feature selection, data transformation, and handling imbalanced data for homogeneity.
(2) This study demonstrated a combined application of machine learning, SMOTE oversampling, and Tomek-links undersampling techniques for the treatment of class imbalance, followed by normalization of the data. (3) Implementations and trials were performed on the ANN model using a grid search technique to ensure optimal selection of hyperparameters with minimal execution time, and on the SVM model to select the best kernel with optimal parameters. (4) The performance of ANN classification on the different resampled datasets was compared to identify the balancing technique that yields the most accurate results. (5) Three classifiers, namely, SVM, RF, and decision tree (CTree), were introduced and compared to ensure the high quality of the model's performance. Furthermore, a comparative analysis was conducted with other approaches. The rest of the paper is organized as follows: Section 2 includes a review of related work. The method is proposed in Section 3 with data collection and parameter setup. Section 4 describes the experimental results. The study's findings are discussed and concluded in Section 5.

Related Work
Early detection of onset is a critical step in the prevention and control of diabetes. Using the Pima Indian Diabetes dataset from the UCI repository, advanced machine learning prediction algorithms have been suggested in the literature.
Gupta et al. [13] utilized a feature selection strategy and k-fold cross-validation to increase the prediction performance for diabetes. Their SVM classifier achieves higher accuracy than the naive Bayes model. A comparative study of diabetes classification was conducted by Choubey et al. [14] on PIMA India and a local diabetes dataset. PCA and LDA were used for feature selection. They applied AdaBoost, KNN regression, and the radial basis function and revealed that, when combined with classification methods, both may assist in increasing accuracy and eliminating undesired variables. On the PIMA dataset, Ahuja et al. [15] did a comparative evaluation of multiple techniques. They found that MLP outperformed NB and DT in terms of accuracy. Mohapatra et al. [16] employed MLP to identify diabetes and reached 77.5 percent accuracy without presenting comparisons.
A stacking-ensemble technique was suggested by Singh and Singh [17]. They trained four base modules using the bootstrap technique and cross-validation, including SVM, decision tree, RBF, and poly SVM, but with no feature selection or comparison. On the PIMA and breast cancer datasets, Kumari et al. [18] constructed a diabetes prediction system that uses a stack of random forest, logistic regression, and naive Bayes to compare their outcomes; their system yields 79 percent. Khandegar and Pawar [19] employed PCA to choose attributed features, followed by a neural network (NN) classifier, with 92.2% accuracy. Zhu et al. [2] used K-means to cluster the results after applying PCA, and LR was used to classify them, yielding an accuracy of 89.0%. Moreover, SVM, J48, KNN, and random forest (RF) classifiers were compared by Kandhasamy and Balamurali [20]. The accuracy rate was 73.82% for J48 and reached 100% for KNN and RF. Mercaldo et al. [21] used two algorithms, Greedy Stepwise and BestFirst, to find the discriminating features that improve classification performance; six algorithms were evaluated.
The Hoeffding Tree approach yielded the greatest accuracy of 75.5%, with a recall of 76.2%. Mohebbi et al. [22] employed an MLP neural network and a CNN with an LR activation function. The diabetic dataset comprises continuous glucose monitoring signals and yields 77.5% accuracy with the CNN classifier. Ramesh et al. [23] used a recurrent neural network (RNN) to predict type 1 and type 2 diabetes. The dataset utilized was the Pima Indian dataset, and the predicted accuracy was 78% for type 1 and 81% for type 2. Lekha [24] used a modified CNN on individuals' breath signals from five type-1 diabetic patients, nine type-2 diabetic patients, and 11 healthy subjects. The performance, evaluated using the area under the curve, was 0.96.
While the class imbalance solution poses a significant limitation (undersampling removes important data and oversampling causes overfitting), several undersampling strategies have been proposed. Mustafa et al. [9] proposed a hybrid method of a MultiBoost ensemble and random undersampling to solve the class imbalance problem. Kubat and Matwin [25] proposed one-sided selection (OSS), which undersamples the majority-class instances that are redundant (border instances). Barrela et al. [26] defined a new cluster-based OSS technique (ClusterOSS) to overcome the limitations of OSS: the majority-class instances are clustered by k-means, and then OSS is applied to the instances closest to the center of every cluster. Borderline and noisy cases are removed using Tomek-links. Mani and Zhang [27] proposed a scheme that uses the KNN classification method to select the instances to be eliminated during undersampling. Undersampling has also been combined with clustering in a cluster-based undersampling technique [28]. The idea of Tomek-links is to uncover border cases, whereas Hart [29] defined the condensing CNN undersampling technique to detect redundant cases. In the Fast_CBUS technique developed by Ofek et al. [30], the minority class is grouped into K clusters by the K-means algorithm, and for each cluster a comparable number of majority-class examples near the minority-class instances is selected. Raghuwanshi and Shukla [31] used an extreme learning machine (ELM) undersampling classifier to create ensemble subsets of the majority class, yielding 80.5% accuracy. Roy et al. [32] combined SMOTE-Tomek to balance the Pima diabetes dataset using an ANN and achieved an accuracy of 98%. Guzmán-Ponce et al. [11] proposed two undersampling strategies that combine DBSCAN clustering, to eliminate noisy samples and refine the decision boundary, with a minimum spanning tree (MST) algorithm to deal with the class imbalance.
Moving on to oversampling strategies, Han et al. [33] proposed a borderline SMOTE strategy for producing synthetic examples from borderline cases with significant misclassification costs. Barua et al. [7] clustered the synthetic data generated after applying MWMOTE-SMOTE. To overcome the problem of class imbalance, Gustavo et al. [34] developed a mix of undersampling and oversampling. Ensemble learning gives a more precise solution to the problem of class imbalance: AdaBoost [35] used ensemble methods to apply different weights to both successfully classified and misclassified minority samples. To properly classify minority-class instances, Chawla et al. [36] presented a mix of sampling and ensemble learning. Wu and Chang [37] used the SVM to create a class-boundary alignment approach. Stefanowski and Wilk [38] reported that the identification of minority classes is only influenced by class imbalance when associated with additional data challenges such as outliers and redundant data. Therefore, outliers must be taken into consideration when handling unbalanced data (Table 1).

Methodology
In this section, we introduce the diabetes dataset as a binary classification problem: differentiating whether a patient is suffering from the disease or not. This approach includes multiple preprocessing steps for cleaning data, feature extraction, and algorithms to predict the onset of diabetes.

Datasets.
The Pima Indians diabetes dataset was obtained from the public UCI data repository [45]. All subjects were female. Of the 768 total records, 268 (35%) were diabetic instances and 500 (65%) were nondiabetic instances. It includes eight independent variables: the first attribute is the number of pregnancies. The second is the plasma glucose concentration in a 2 h oral glucose tolerance test (mean value of 141 among those suffering from the disease), followed by the diastolic blood pressure (mm Hg); fourth is the triceps skin fold thickness (mm); then the 2 h serum insulin (μU/ml); followed by the body mass index (weight in kg/(height in m)^2), with a mean value of 35.14 for those suffering from the disease and 30 for those not suffering; seventh is the diabetes pedigree function; and finally age (years). The dependent variable (class) is defined as (1, 0) for the presence or absence of diabetes. To analyze the impact of the attributes on the occurrence of diabetes, Table 2 shows the positive association between the attributes and the class. Using the t-test, glucose, BMI, pregnancy, and age had a significant effect (p < 0.05). While the etiologic reasons for NIDDM in Pima Indians are likely to be comparable to those in other ethnic groups, the genes that predict predisposition to the illness are more frequent in Pimas, according to a study of a dataset of 200 normal, nondiabetic Pima Indians [46].
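As a quick arithmetic check, the class distribution quoted above implies the following imbalance figures (a Python sketch for illustration; the paper's own analysis was done in R):

```python
# Class distribution of the Pima dataset as quoted above:
# 268 diabetic (positive) and 500 non-diabetic (negative) of 768 records.
diabetic, nondiabetic = 268, 500
total = diabetic + nondiabetic

minority_share = diabetic / total         # fraction of the minority class
imbalance_ratio = nondiabetic / diabetic  # majority samples per minority sample

print(f"total={total}, minority share={minority_share:.1%}, "
      f"imbalance ratio={imbalance_ratio:.2f}")
```

The roughly 1.9:1 ratio is what the resampling stages below are designed to correct.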

Feature Selection Using Relative Odds.
A logistic regression model is used to characterize the risk factors for developing T2DM.
The odds ratios generated provide a ranking of the explanatory variables to help determine the output [47]. Diabetes is higher in the younger age group (<25 years) in comparison to the older group (>40 years), with an odds ratio = 6.5, as shown in Table 3. Women with one to three pregnancies are at high risk of developing diabetes (odds ratio = 1.6). Normal-weight women had a nearly 8-fold increased risk of diabetes, while women with low blood pressure were three times more likely to become diabetic. The data showed that abnormal insulin secretion is a major factor, and women with normal 2-hour glucose concentration have a 7-fold elevated risk of developing diabetes. These findings are in accordance with [46], who reported that insulin resistance is a main risk factor for the development of noninsulin-dependent diabetes mellitus. The incidence of diabetes was higher in normal-BMI women than in overweight subjects, which may be due to a genetic predisposition factor.
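The mechanics of an odds ratio can be sketched as follows. The 2x2 counts below are purely illustrative, not the values behind the paper's Table 3; the equivalence with the exponentiated logistic-regression coefficient is the standard relationship:

```python
import math

# Illustrative 2x2 table (counts are made up, not the paper's data):
#                diabetic   non-diabetic
# exposed            a=40         b=60
# unexposed          c=10         d=90
a, b, c, d = 40, 60, 10, 90

odds_exposed = a / b                       # odds of diabetes given exposure
odds_unexposed = c / d
odds_ratio = odds_exposed / odds_unexposed # 6.0: exposed group has 6x the odds

# Equivalently, for a fitted logistic regression, OR = exp(beta) for a
# one-unit increase in the corresponding explanatory variable.
beta = math.log(odds_ratio)
```

This is why exponentiating logistic-regression coefficients, as done for Table 3, yields a direct risk ranking of the explanatory variables.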

Proposed Framework.
We used R (version 3.4) for data analysis and machine learning. Initially, the median value is used to handle missing values in the dataset, followed by handling each attribute's outliers. Then, the top risk variables for developing diabetes were ranked using the random forest model (accuracy of 94%) and the Boruta package (accuracy of 78.6%). The essential features (glucose, BMI, insulin, age, and skin fold thickness) are in line with existing standards and have a significant influence on developing diabetes [19,39]. The data was then split 80% for training and 20% for testing. The training set was balanced by SMOTE oversampling and Tomek-links undersampling. Then, each training set was normalized by scaling and fed into an ANN model, where the tuning process and hyperparameters were chosen. The ANN was compared to other classifiers, SVM and tree models, as shown in Figure 1. Performance is measured by accuracy, sensitivity, specificity, ROC curves, F1-score, and Kappa.

Preprocessing of Data.
Noise in the dataset causes inconsistency, which leads to inaccurate outputs. Data cleaning, handling of outliers, and data balancing are preprocessing stages that are discussed here.

Data Cleaning.
Some attributes, including glucose, BMI, insulin, skin thickness, and blood pressure, have zero values. The missing values were replaced by the median value conditioned on the outcome parameter, a process called imputation with outlier correction. For example, missing glucose values were assigned the median value of 110 for outcome "0" and the median value of 140 for outcome "1".
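The class-conditional median imputation described above can be sketched as follows (a Python illustration with toy values, chosen so the medians of 110 and 140 match the example in the text):

```python
# Sketch of class-conditional median imputation: zeros (coded missing) in
# 'glucose' are replaced by the median of the non-missing glucose values
# within the same outcome class. Rows are illustrative, not the real dataset.
def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

rows = [  # (glucose, outcome); 0 marks a missing glucose reading
    (0, 0), (100, 0), (110, 0), (120, 0),
    (0, 1), (130, 1), (140, 1), (150, 1),
]

# Median of the observed (non-zero) glucose values per outcome class.
medians = {
    cls: median([g for g, o in rows if o == cls and g != 0])
    for cls in (0, 1)
}

# Replace each missing reading with the median of its own class.
imputed = [(g if g != 0 else medians[o], o) for g, o in rows]
```

Conditioning on the outcome keeps the imputed values consistent with each class's distribution instead of pulling both classes toward one global median.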

Dimensionality Reduction.
Feature selection is used to reduce the number of attributes while preserving the information they carry. The quality of the data is measured by good correlation of the features with the target class and low correlation of the features with each other. Random forest and the Boruta library in R were used as wrapper algorithms for variable selection. Glucose, BMI, and insulin are the most important features.

Handling Outliers.
A data preparation method known as the interquartile range (IQR) is used to find outliers and extreme values. By splitting a rank-ordered dataset into four equal portions, or "quartiles" (Q1, Q2, Q3, Q4), it measures dispersion. Q2 is the median, and the IQR is the middle half of the data, lying between the upper quartile Q3 and the lower quartile Q1.
We replace the extreme values with median values, since the median is more robust than the mean and is the middle rank irrespective of the extreme values. Moreover, we consider the upper and lower boundaries for the outliers' replacement.
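The IQR fence and median replacement described above can be sketched as follows (a Python illustration on toy data; the 1.5 multiplier is the conventional Tukey fence, which the paper does not state explicitly):

```python
# IQR-based outlier handling: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# are flagged and replaced by the median, as described in the text.
def quartiles(xs):
    s = sorted(xs)
    n = len(s)
    def q(p):  # simple linear-interpolation quantile
        i = p * (n - 1)
        lo, hi = int(i), min(int(i) + 1, n - 1)
        return s[lo] + (i - lo) * (s[hi] - s[lo])
    return q(0.25), q(0.5), q(0.75)

data = [10, 12, 11, 13, 12, 95, 11, 10]   # 95 is an injected outlier
q1, med, q3 = quartiles(data)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [x if lower <= x <= upper else med for x in data]
```

Only the extreme value is touched; in-range observations pass through unchanged.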

Stabilization of Data.
The imbalanced distribution of classes was biased toward the negative class (majority class), leading to misclassification of the positive class (minority class), as shown in Figure 2. The following techniques are used to handle this problem: (1) Tomek-Link Technique. The Tomek-link undersampling method is a refinement of the Condensed Nearest Neighbor (CNN) rule [48] aimed at reducing boundary occurrences that tend to be misclassified. Two samples xi and xj with class (xi) ≠ class (xj) form a Tomek-link pair if there is no sample xk such that d (xi, xk) < d (xi, xj) or d (xj, xk) < d (xi, xj). In other words, instances that form a Tomek-link pair generate noise in the data distribution. Outlier and duplicate instances, in addition to boundary instances, all contribute to the problem of class imbalance. An outlier is a case that goes beyond the decision boundary, possibly increasing the misclassification error. Redundant instances are those that carry the same information as one another. The Tomek-links technique collectively eliminates outlier, boundary, and redundant instances from the majority class to ensure informed deletion while also reducing the loss of information. The goal is to clarify the border between the minority and majority classes so the minority regions become more distinct.
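Tomek-link detection as defined above can be sketched on toy two-dimensional data (a Python illustration: two points of opposite classes form a link exactly when each is the other's nearest neighbour):

```python
# Minimal Tomek-link detection on toy data. A pair (i, j) is a Tomek link
# when the classes differ and i and j are mutual nearest neighbours.
def dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def nearest(i, X):
    return min((j for j in range(len(X)) if j != i), key=lambda j: dist(X[i], X[j]))

X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (2.5, 2.5), (2.6, 2.5)]
y = [0, 0, 0, 0, 0, 1]   # index 5 is the lone minority point, next to index 4

tomek_pairs = [
    (i, j) for i in range(len(X))
    for j in [nearest(i, X)]
    if i < j and y[i] != y[j] and nearest(j, X) == i
]
# Undersampling then removes the majority member of each pair (index 4 here).
```

Real implementations (e.g. in imbalanced-learn) work the same way, just vectorized over a fitted nearest-neighbour index.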
(2) The Synthetic Minority Oversampling Technique (SMOTE). In SMOTE, the number of minority-class instances is increased, so there is no loss of information from the original dataset. The distance is determined by the Euclidean distance. A new sample Ynew is created by multiplying the distance to a selected neighbour by a random number σ between 0 and 1: Ynew = Yi + σ × (Yknn − Yi) [49].
SMOTE finds the k nearest neighbours of a given minority data instance by utilizing the k-NN method; the length of the line segment connecting two locations xi and xj equals the Euclidean distance between them. Each new instance is created by multiplying the differences (diff) between the relevant characteristics of the chosen neighbour instance and the original instance by a random value (gap) between 0 and 1 and adding them to the features (Di) of the original minority instance (Algorithm 1). This determines the derived instance's end position [49], which might be the same as the original minority instance, a randomly picked neighbour, or anywhere in between.
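The interpolation step just described can be sketched in a few lines (a Python illustration of one synthetic sample; neighbour search via k-NN is assumed to have already picked x_nn):

```python
import random

# SMOTE interpolation: a synthetic sample is placed on the line segment
# between a minority instance x_i and a chosen minority neighbour x_nn,
# at a random fraction 'gap' in [0, 1), as described in the text.
def smote_sample(x_i, x_nn, rng):
    gap = rng.random()   # random value in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x_i, x_nn)], gap

rng = random.Random(42)
x_i, x_nn = [1.0, 2.0], [3.0, 6.0]
x_new, gap = smote_sample(x_i, x_nn, rng)
# x_new = x_i + gap * (x_nn - x_i) componentwise, so it lies on the segment.
```

Because the sample is interpolated rather than duplicated, SMOTE densifies the minority region without the exact-copy overfitting of naive oversampling.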
(3) Combination of SMOTE and Undersampling. By randomly removing samples from the majority class, the majority class is undersampled until it reaches the proportion of the minority class. According to [32], a combination of SMOTE and undersampling or oversampling yields better results than SMOTE alone. If the majority class is undersampled by 200 percent, the adjusted dataset will contain twice as many entries from the minority class. Therefore, by combining both oversampling and undersampling, the training dataset has the minority class "smoted" and the majority class "undersampled", as shown in Table 4.

Artificial Neural Network (ANN).
Interconnected neurons with numeric weights that carry messages between each other are referred to as ANNs. The learning method updates the weights, and an activation function converts the weighted inputs into each neuron's output. In the R neuralnet package, a standard "backpropagation" network with three layers (input, hidden, and output) was utilized [50], with 5 repetitions. The resilient backpropagation algorithm of type rprop+ was used for training [51]. Rprop is utilized by multilayer perceptrons (MLP) to minimize error by applying a learning rate to the weights in the direction opposite to the gradient. The network is then assessed on the test set after training. Equation (4) uses the backpropagation MLP to adjust the weight update via gradient descent [52] with learning rate α.
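The gradient-descent weight update at the heart of backpropagation can be illustrated with a single linear neuron (a toy Python sketch; the learning rate, input, and target are illustrative, not the paper's settings):

```python
# Gradient-descent weight update w <- w - alpha * dE/dw for a single linear
# neuron with squared error E = (w*x - t)^2 / 2. Values are illustrative.
alpha = 0.1          # learning rate
w, x, t = 0.5, 2.0, 3.0

for _ in range(100):
    y = w * x                 # forward pass (no bias, identity activation)
    grad = (y - t) * x        # dE/dw by the chain rule
    w -= alpha * grad         # step against the gradient

# w converges toward the exact solution t / x = 1.5
```

Backpropagation generalizes this same update to every weight of a multilayer network by propagating the error gradient backwards layer by layer; rprop+ additionally adapts a per-weight step size from the sign of the gradient.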

Support Vector Machines (SVMs).
The SVM algorithm splits the data into two groups using an n-dimensional hyperplane. With a sigmoid kernel function, the SVM model is quite similar to a two-layer perceptron neural network. The kernels provide a set of training approaches for polynomial, RBF, and MLP classifiers in which the network's weights are computed by solving a quadratic programming problem with linear constraints [53], rather than the nonconvex problem of traditional "neural network" training. The goal of SVM is to partition the dataset into classes such that the largest-margin hyperplane is found [54], because the biggest margin generalizes best to test cases. In this paper, we construct our model using the nonlinear radial basis function (RBF) kernel, K(x, x′) = exp(−‖x − x′‖² / (2σ²)), where x and x′ are feature-space vectors and σ is a free parameter; the choice of these parameters is critical. Various kernel types were used in SVM network training. Using the "e1071" package in R (Figures 3 and 4), the best performance was attained with the RBF kernel, which is utilized to create fully nonlinear hyperplanes.
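The RBF kernel above is simple to state in code (a Python sketch; note that some libraries parameterize it as exp(−γ‖x − x′‖²) with γ = 1/(2σ²), which is the form e1071's `gamma` argument corresponds to):

```python
import math

# RBF kernel: K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)).
def rbf_kernel(x, x_prime, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2 * sigma ** 2))

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points -> exactly 1.0
k_far = rbf_kernel([0.0, 0.0], [10.0, 10.0])  # distant points -> near 0
```

The kernel value decays smoothly from 1 (identical inputs) toward 0 with distance, and σ controls how quickly: a small σ yields a very wiggly decision boundary, a large σ approaches a nearly linear one, which is why σ (with the cost C) is tuned in the experiments below.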

Random Forest (RF).
Random forest constructs numerous independent decision trees and aggregates them (Algorithm 2), typically yielding a more accurate and precise outcome. The final prediction output by RF is the category that receives the most votes across the forest. Its hyperparameters are similar to those of a decision tree or bagging classifier. The simplicity underlying RF rests on the aggregation of random trees. RF produces more accurate results on big datasets, and more random trees may be produced by establishing a random threshold for each feature rather than locating the most precise threshold. This approach also mitigates the overfitting problem [56].
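The bootstrap-plus-majority-vote scheme just described can be sketched as follows (a hedged Python illustration, not the paper's R implementation: each "tree" is reduced to a one-feature mean-threshold stump so the sketch stays short, and the data are toy):

```python
import random

# Toy random forest: T bootstrap samples, one random feature per tree
# (subspace dimension d = 1), and majority voting over stump predictions.
def train_stump(sample, feature):
    thr = sum(x[feature] for x, _ in sample) / len(sample)  # mean as threshold
    labels = [y for _, y in sample]
    left = [y for x, y in sample if x[feature] <= thr] or labels
    right = [y for x, y in sample if x[feature] > thr] or labels
    majority = lambda ys: max(set(ys), key=ys.count)
    return feature, thr, majority(left), majority(right)

def stump_predict(stump, x):
    feature, thr, left_label, right_label = stump
    return left_label if x[feature] <= thr else right_label

def random_forest(data, T, n_features, rng):
    forest = []
    for _ in range(T):
        boot = [rng.choice(data) for _ in data]   # bootstrap sample Dt
        feature = rng.randrange(n_features)       # random feature subspace
        forest.append(train_stump(boot, feature))
    return forest

def forest_predict(forest, x):
    votes = [stump_predict(s, x) for s in forest]
    return max(set(votes), key=votes.count)       # majority vote

rng = random.Random(0)
data = [([0.0, 0.0], 0), ([0.2, 0.1], 0), ([1.0, 1.0], 1), ([0.9, 1.1], 1)]
forest = random_forest(data, T=25, n_features=2, rng=rng)
pred = forest_predict(forest, [0.95, 1.05])       # query deep in class-1 region
```

Individually weak, randomized trees become a strong classifier once their votes are aggregated, which is the point of Algorithm 2.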

Decision Tree.
For regression analysis, recursive binary splitting is a prominent method. The exhaustive search algorithms frequently employed to generate such models have two major drawbacks: overfitting and selection bias towards covariates with many possible splits or missing data. Although overfitting can be mitigated by pruning, the feature-selection bias still has a significant impact on the use of structured tree regression models. Conditional inference trees (CTree) are nonparametric regression trees available as an R package. The association between outcomes and covariates is investigated by CTree to make unbiased covariate selections at the various levels. CTree differs from the CART and C4.5 algorithms [58], which perform an exhaustive search over all possible splits before picking the covariate with the best split. The tree model shown in Figure 7 shows the following: (1) Root: glucose (the most significant feature). (3) Glucose > 0.638 and glucose > 0.774 (n = 192, err = 87.3%).

Assessment Measures for Class Imbalance.
Accuracy as an evaluation metric can be misleading on imbalanced data. The G-mean is an average over both minority and majority classes; the higher its value, the better, as shown in Table 6. Other metrics include the F-measure, which reflects classifier performance on the minority class. An area under the receiver operating characteristic curve (AUC-ROC) of one, balancing sensitivity and specificity, indicates a perfect model [59].
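The imbalance-aware metrics above follow directly from the confusion matrix (a Python sketch; the counts are illustrative, not the paper's results):

```python
import math

# Imbalance-aware metrics from a confusion matrix. Counts are illustrative:
# the positive (minority) class is "diabetic".
tp, fn, tn, fp = 80, 20, 150, 50

sensitivity = tp / (tp + fn)   # recall on the minority class
specificity = tn / (tn + fp)   # recall on the majority class
precision = tp / (tp + fp)

g_mean = math.sqrt(sensitivity * specificity)                 # geometric mean
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # F-measure
```

Because the G-mean multiplies the per-class recalls, a classifier that ignores the minority class scores zero no matter how many majority samples it gets right, which is exactly the failure mode plain accuracy hides.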

Experimental Results and Discussion
The experimental findings were evaluated and analyzed using the metrics in Table 6.

Analysis and Evaluation of ANN.
To avoid overfitting or underfitting, we first perform a grid search to select the best parameters for training the model.

ANN Hyperparameter Optimization.
Grid search was applied to the datasets (Table 7) before and after resampling to choose the optimal parameters and lower the training error. The tuning grid was set with 10-fold cross-validation. The tuning parameters are the weights, the number of hidden layers and hidden units, and the weight decay, with three decay values (0, 0.01, and 0.1); the learning rate is set to 0.01. Size is the number of hidden units per layer. The number of layers maintains a balance between high bias and high variance and was selected as two. The batch size is 32, and the number of iterations is 250 (Table 8). The optimizer is gradient descent, which finds local minima, manages the variance, and adjusts the model's parameters [52].
Initially, the diabetes dataset without oversampling is classified using the ANN. Then the resampled data are experimented with using the ANN. The best performance was achieved with four layers (8, c(5, 2), 1) and a learning rate of 10^−2. The most significant features used in the generation of the model were ranked using the varImp function of the NeuralSens package in R (Figure 8). The run-time execution is shown in Tables 7 and 9. The proposed models were developed and tested on a PC with the following specifications: Microsoft Windows 10 operating system, i5-core processor @ 2.40 GHz, and 6 GB of RAM.

Computational Complexity.
The time complexity is determined using Big O notation, for an ANN trained with gradient descent (backpropagation) running for n iterations and for the SVM-RBF kernel with hyperparameters (cost = "c", sigma = "s").

Algorithm 2: Random Forest pseudocode [57].
Input: dataset D, ensemble size T, subspace dimension d.
Output: average of the predictions from the tree models.
for t = 1 to T do
    build a bootstrap sample Dt from D
    select d features randomly and reduce the dimensionality of Dt accordingly
    train a tree model Mt on Dt, splitting on the best feature in d
    let Mt grow without pruning
end

In our model, the backpropagation network contains 4 layers (1 input, 2 hidden, and 1 output), denoted i, j, k, and l; with delta weight updates over t training passes and n epochs, the complexity is O(nt × (ij + jk + kl)).
This is the same as for the feed-forward pass, so the result is O(nt × (ij + jk + kl)). The performance of the ANN classifier after hyperparameter optimization is shown in Table 9 for the different training sets. The accuracy after training the ANN with SMOTE (100%) was 90.2%, sensitivity was 84.5%, and specificity was 93.1% (Figure 9), exceeding the other datasets. The training time was 290 s and the AUC was 0.89, as shown in Figure 10.
Finding the optimal algorithm with ideal hyperparameters for an ANN is a challenge, as it requires too much computing time. The authors of [60] reported that the Antlion optimizer outperforms grid search in choosing the optimal hyperparameters on a stroke dataset using a DNN within a limited amount of time.
The datasets with and without resampling were then experimented with using other machine learning algorithms: SVM, RF, and DT. The percentage improvement in performance after resampling is reported in Table 10 and Figure 11. The ANN model shows an improvement in accuracy after applying SMOTE 100% oversampling. On the other hand, the Tomek-link technique yields 72.2%, lower than the 80.1% of the original training set, as shown in Figure 9.
Hyperparameter optimization improves the SVM: after SMOTE 100% oversampling, the accuracy was 72.9% with an AUC of 0.73. By applying Tomek-link undersampling, the accuracy improved to 71.6%, except for the F1-measure, which shows no significant change (Figure 12).
Using the training set without resampling, RF shows 52.2% sensitivity, 75.4% specificity, and an accuracy of 62%. After resampling, the accuracy improved, reaching 75% with SMOTE (100%) and only 65% with Tomek-links, as shown in Figure 13.
In this study, the CTree model is influenced by blood glucose levels, BMI, and pregnancies. Insulin and BMI, on the other hand, appear to have a greater impact on diabetics than the other factors, as shown in Figure 7. The accuracy after training with SMOTE 100% oversampling was 87.8%, with SMOTE/undersample it was 83.4%, and after applying Tomek-links it was 75.5%; the test results are shown in Figure 14.
The overall summary of the best training set is shown in Table 11. The findings revealed that the proposed ANN outperforms traditional models with respect to precision, recall, F1-score, and accuracy.

Area under the Receiver Operating Characteristic (AUC-ROC) Curve.
ROC is a probability curve, and the AUC represents the degree or measure of separability. We considered the AUC as part of the performance evaluation. The greater the area, the better the model; a perfect model reaches a TP rate of 1 at an FP rate of 0. In contrast, the G-mean represents the performance in absolute values. Figure 10 shows the AUC of the optimized oversampled ANN at 0.89, which means that the model can discriminate between positive and negative classes 89% of the time. The ANN was the best classifier, exceeding the other classifiers in AUC. The optimization of the hyperparameters (cost and sigma) of the SVM classifier with the RBF kernel was also evaluated by AUC-ROC. Figure 5 shows an AUC of 0.905 for the SVM after training on SMOTE (100% oversampled) data, while on the test set it was 0.73, as in Table 10. Figure 6 shows an AUC of 0.88 for the SVM after undersampling with Tomek-links, while on the test set it was 0.7 (Table 10). Figure 15 shows the AUC of CTree; with SMOTE oversampling it also exceeds, with a value of 0.78.

Conclusions
The current approaches involve inaccurate classification techniques, as they do not consider several crucial data preparation steps that can significantly improve performance. Several risk assessments for early detection of diabetes are reported in the current study. The relationships between the attributes were analyzed using conventional methods, and the inferences were predicted using analytical approaches. Using machine learning involves the preprocessing steps of filling in missing data and handling class imbalances that might lead to misclassification. The training data were balanced using SMOTE oversampling, Tomek-links undersampling, and a combination of SMOTE/undersampling, modeled with an ANN. To characterize the risk of developing type-2 diabetes, the odds ratio, the Boruta filter, and the varImp function were applied to rank the important variables. Our results were consistent with the guidelines and previous studies [32], where insulin resistance and elevated BMI were shown to be the major risk factors. Oversampling was performed on the training set using a SMOTE ratio of 100%. The prediction performance was higher in all models, with AUC and F1-measures ranging between 0.69 and 0.89. Furthermore, the results of Tomek-links were not considerably better than the original training set for any classifier. A grid search method is used to track the maximum values of the parameters in order to optimize the hyperparameters of the ANN, and a cost-sensitive method was applied to the SVM for optimization. The sensitivity and accuracy of the ANN and SVM were greatly enhanced by the oversampling stage, whereas the predictive performance of the CTree classifier is unaffected by rebalancing. The AUC of 0.89 and accuracy of 90.2% indicate that the ANN is the best model on both the oversampled and test datasets.
SMOTE oversampling increases the learning capability and improves performance more than the Tomek-links technique does.
To conclude, model-based oversampling can be utilized to identify individuals at high risk of developing diabetes and provide a timely treatment response for women in our community aged 21 and older. The limitation of this study is the small number of samples. To lower the risk and effects of type 2 diabetes, the research proposes that more regulated attributes and frequent follow-up be offered, particularly during pregnancy. Future studies can use sensitivity analysis and regularization to select the most significant features based on deep learning for the early prediction of diabetes. In addition, hyperparameters can be optimized dynamically.

Data Availability
Datasets analyzed during the current study are available from the UCI repository and were derived from the following public-domain resource: [https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database].