A Proposed Technique Using Machine Learning for the Prediction of Diabetes Disease through a Mobile App



Introduction
Advancements in biotechnology and public healthcare infrastructure have contributed to significant developments in the processing of sensitive and essential healthcare data. When intelligent methods of data analysis are applied, many significant characteristics may be used to detect and prevent various chronic illnesses in their early stages. Diabetes is among the fastest-growing diseases in individuals of all ages, including children and the elderly.
Diabetes is a chronic condition that, if it becomes common, might cause a crisis in healthcare on a global scale. Diabetes is characterized by many symptoms, including increased urinary frequency, increased thirst, increased tiredness and drowsiness, decreased appetite, decreased body weight, impaired vision, mood changes, disorientation, difficulty focusing, and recurrent infections [1]. Diabetes is very dangerous to human life, mainly because it raises the risk of other serious conditions, such as stroke, blindness, miscarriage, amputation, and kidney failure.
The International Diabetes Federation (IDF) reports that the number of people being diagnosed with diabetes is continuously increasing. The number of diabetics worldwide is expected to reach 642.8 million by the year 2030 [1]. In Saudi Arabia, the number of diabetes patients will reach approximately 5.61 million by 2030, according to 2023 statistics [2].
Machine learning (ML) is an emerging area of artificial intelligence that examines how machines might acquire knowledge via their interactions with the world. In this work, an ML-based method for classifying, identifying, and predicting diabetes in its early stages is presented [3].
Before we can comprehend diabetes and how it occurs, we need a good understanding of what goes on in the body when it is not affected by diabetes. As mentioned in [3], the meals that we consume, particularly those that are high in carbohydrates, are the sources of the sugar (glucose) that our bodies need. All humans, even diabetics, require carbohydrate-containing meals as their primary source of dietary energy. Foods such as rice, bread, pasta, cereal, dairy, fruit, and vegetables all contain carbohydrates. The digestive process converts these foods into glucose. Glucose is transported all over the body via the circulatory system [4]. The brain receives a portion of the glucose to support thinking and other tasks. The rest of the glucose is either used immediately by our cells for energy or stored in the liver for later use. Insulin, which is produced by pancreatic beta cells, is needed for glucose to be used as an energy source in the body. Insulin functions like a door key: it binds to the gates of cells and opens them, allowing glucose to enter the cell from the bloodstream. Diabetes arises when insulin production is inadequate due to pancreatic dysfunction, or when insulin is produced but cannot be used by the body (a condition known as insulin resistance). Glucose then builds up in the bloodstream, leading to hyperglycemia and the development of diabetes. The chronic illness known as diabetes mellitus is distinguished by excessively high blood sugar and elevated urine sugar levels [5].
According to [1], there are three primary forms of diabetes. In type 1 diabetes, cells are unable to produce sufficient insulin and the immune system is compromised. There is no definitive research on type 1 diabetes causation or prevention. Type 2 diabetes occurs when the body's cells do not produce an appropriate amount of insulin or when the insulin that is produced is not used effectively. This is the most common form of diabetes, accounting for 90% of people who have been diagnosed with the disease. Both lifestyle and genes play a role in causing this condition. Gestational diabetes occurs when a pregnant woman suddenly develops high blood sugar, which can harm both her health and the health of her baby. In two-thirds of cases, it recurs in a later pregnancy. There is a high possibility that a woman with gestational diabetes will develop type 1 or type 2 diabetes after childbirth. All types of diabetes should be treated promptly because they are dangerous. If the early stages of these diseases are detected, the complications associated with them can be prevented [6].
This article proposes a machine learning-based technique for diabetes prediction through a mobile app. The important contributions of this work include the following: (1) The main contribution of this work is to utilize a private dataset of diabetes mellitus.

Literature Review
Diabetes affects a sizable percentage of the adult population. Many research works have been proposed for the prediction of diabetes symptoms. A wide variety of approaches, including ML, neural networks (NNs), data mining, and genetic algorithms, are discussed in these studies. In recent years, ML has gained popularity as a model-building technique and received a lot of attention from the medical community. ML has proven to have strong prediction powers as well as the capacity to analyze many variables in parallel. Moreover, ML has developed methods for variable screening that can recognize and comprehend intricate correlations between variables. Previous research has shown that ML may be a useful technique for predicting diabetes. Some closely related works using ML algorithms are discussed in this section.
In [7], the authors developed five different models using different ML algorithms: linear support vector machine (SVM), multifactor dimensionality reduction, radial basis function kernel SVM, k-nearest neighbors (KNN), and artificial neural network. This study used a Boruta wrapper, which selects the important features from a dataset. According to the experimental findings, all the models appeared to achieve good results. However, KNN and linear SVM were the two models that performed best at identifying whether a patient has diabetes or not.
In [8], a hybrid model was used to diagnose type 2 diabetes. The researchers proposed the T2ML model, which consists of a suggested list of steps. Cleaning the data to ensure homogeneity was the first step, followed by the selection of a subset of features based on Extreme Gradient Boosting (XGB) classifiers and random forest (RF). After that, K-means clustering was used to exclude data that had been erroneously classified, and ultimately, a logistic regression classifier and clustering were coupled to categorize the remaining data.
In addition, Maniruzzaman et al. [9] selected four classifiers for the prediction of diabetes patients: random forest (RF), naive Bayes (NB), AdaBoost (AB), and decision tree (DT). Risk factors for diabetes were calculated with logistic regression (LR) using the p value and odds ratio. These strategies relied on three distinct partitioning techniques, referred to as K2, K5, and K10. The classifiers were evaluated using accuracy and AUC as performance metrics. The researchers found that age, blood pressure, diastolic blood pressure, total cholesterol, and body mass index are all significant risk factors for diabetes. Additionally, combining LR with RF-based classifiers improved performance, which can make it much easier to identify individuals with diabetes.
Moreover, to make an accurate prediction about type 2 diabetes mellitus, the authors of [10] compared several commonly used regression models, including Glmnet, RF, XGB, and LightGBM. Initially, the dataset consisted of 111 variables, which were reduced to 58 variables after a data preprocessing step. This study compared the performance, calibration, and interpretability of multivariable regression models and ML-based prediction models. Their findings demonstrate that updating prediction models with new data enhances prediction performance and maintains the variable importance ranking, although not uniformly across all ML models.
The researchers in [11] employed DT, RF, KNN, AB, XGB, and NB, among other ML algorithms. They applied the suggested ensemble model to the PID dataset, where preprocessing is essential for a reliable and accurate prediction. The results showed that the optimal setup for predicting diabetes is the combination of two boosting-type classifiers (XGB and AB). When the suggested preprocessing is used, the optimal combination (AB + XGB) predicts diabetes with a better degree of accuracy.
In [12], the authors attempted to use ML techniques to find effective models to predict diabetes. They trained many ML algorithms on the datasets, such as LR, DT, NB, gradient boosting (GB), RF, KNN, and SVM. To improve the accuracy and predictive power of the models, preprocessing procedures including label encoding and normalization were used. SVM outperformed the alternative methods, according to the authors. The authors further discovered and ranked several risk indicators using a variety of feature selection approaches.
In [13], the authors used ML techniques to look for diabetic patients. They first applied a bootstrapping resampling strategy to the PIMA dataset to increase accuracy before using KNN, NB, and DT. Results showed that applying the preprocessing step to the data increased the accuracy of almost all classifiers, with decision trees outperforming the others. For an accurate diagnosis of five main illnesses, including diabetes, ML approaches were proposed in [14]. In this study, the authors trained LR and RF models using the BRFSS dataset. A chatbot was used to gather user input, anticipate the prevalence of chronic diseases, and model the data using interactive data visualization techniques to offer risk-reduction recommendations. They tried several factors and concluded that RF could identify diabetes with good accuracy.
International Journal of Intelligent Systems
Similarly, Khanam and Foo [15] trained seven distinct ML models on the PIDD dataset with its different attributes. In this method, two features were discarded as part of a feature selection process. They found that the models with SVM and LR performed well in predicting diabetes. An NN model was also trained on the same dataset with varying numbers of hidden layers and epochs. In comparison to previous methods, the authors indicate that an NN with two hidden layers achieves better performance.
Additionally, a meta-analysis of ML's ability to diagnose diabetes was carried out [16]. It was discovered that the ML algorithms in use today are strong enough to assist physicians in predicting whether a patient would eventually acquire type 2 diabetes.
In [17], ML was used to conduct a meta-analysis of diabetes prediction methods. With the use of the innovative PROBAST (Prediction Model Risk of Bias Assessment Tool), the potential for bias in the ML models was examined. To conduct the meta-analysis and assess heterogeneity, the Meta-DiSc software package was used. It was discovered that ML models outperformed traditional screening techniques in terms of predicting diabetes.
The basic algorithm in [18] was LR, although a few additional ML approaches, including DT, NB, SVM, and KNN, were utilized in ensemble techniques to evaluate performance improvement. Two datasets were primarily used in the experiment, and two alternative strategies for feature selection were used. The Pima Indians dataset, which includes nine different features, was chosen as the initial dataset. The Vanderbilt dataset, which has 16 features, was the second dataset utilized. This study's findings indicated that the LR algorithm is one of the most successful ones that may be used to construct prediction models. In addition to the choice of method, the researchers demonstrated that a number of additional parameters also affect the model's accuracy.
Furthermore, the authors in [19] offered an efficient model to predict and classify diabetes more accurately. The researchers used a variety of ML algorithms, including LR, RF, SVM, DT, KNN, AB, Gaussian Naive Bayes (GNB), and Gaussian process classifier (GPC). These models' performances were evaluated according to their respective precision, accuracy, F-measure, recall, and error metrics.
The development and evaluation of semisupervised learning models for insulin prediction is the main aim of the research study [20]. The researchers employ a private dataset that includes patient information such as demographics, clinical features, and historical insulin records. Traditional supervised models often face limitations due to the scarcity of labeled data, significantly impacting their performance. Semisupervised learning algorithms use unlabeled data to address this issue. The authors begin by preprocessing the dataset, addressing missing values, scaling features, and splitting it into training and testing sets. They then propose and compare various semisupervised models, including self-training, cotraining, and label propagation algorithms. Through extensive experimentation and evaluation, the authors evaluate each model using accuracy, precision, recall, and F1 score. The findings of the article demonstrate the potential benefits of semisupervised learning models for insulin prediction. The experiments reveal that the inclusion of unlabeled data during model training enhances prediction performance, particularly when labeled data are limited. Notably, the self-training model exhibits the highest accuracy and F1 score, suggesting its efficacy in leveraging unlabeled data. However, the authors recognize that their study's restricted dataset and lack of comparison with advanced supervised models limit the generalizability of their findings.
One of the strengths of the article lies in its exploration of semisupervised learning techniques for insulin prediction, an area that has received limited attention in existing literature. The authors provide comprehensive details on dataset preprocessing, model architecture, and evaluation metrics. Furthermore, they emphasize the importance of incorporating unlabeled data and demonstrate the potential of such models using a private dataset. To conclude, the article successfully investigates the application of semisupervised learning models for predicting insulin levels using a restricted dataset. The findings suggest that incorporating unlabeled data holds promise for enhancing prediction performance, especially when labeled data are scarce.
The existing research on diabetes prediction using ML techniques has shown progress in identifying diabetes-related features and developing accurate prediction models. However, there are still some gaps in the literature that need to be addressed. The following are some gaps, along with an explanation of how the proposed article contributes to filling them: To address this gap, the proposed article aims to develop a practical mobile app that allows users to enter their diabetes-related features and obtain instant predictions. The app will be designed to be user-friendly, easily accessible, and adaptable to different datasets and populations. The proposed system's adaptability will be evaluated using a domain adaptation method to ensure its effectiveness in different real-world scenarios.
By filling these gaps, the article hopes to advance the field of diabetes research and enhance early detection and prevention of diabetes, particularly in countries like Saudi Arabia with a high prevalence of the disease.

The Proposed Diabetes Prediction System
In this part, we describe the methodology and the ML algorithms that were applied throughout the process of developing the proposed ML system for diabetes prediction. The sequence of steps of the proposed system for predicting diabetes is illustrated in Figure 1. First, the dataset needs to be gathered and preprocessed to eliminate inconsistencies within it. For example, null occurrences were replaced with average values, and problems with unbalanced class sizes were addressed, among other things. The dataset was partitioned into two separate groups using the holdout process: the test set and the training set. After that, a number of classification techniques were implemented to determine the one that performed best in terms of accuracy on this dataset. Finally, the prediction model with the highest performance is integrated into the proposed mobile application.

Dataset Components.
In this study, both a private dataset and the Pima Indians dataset were used for ML classification. The Pima Indians dataset is an open-source diabetes dataset that was initially gathered by the National Institute of Diabetes and Digestive and Kidney Diseases [21]. The private dataset consists of 300 observations, while the Pima Indians dataset has 768 observations. Both the private dataset and the Pima Indians (NI_DDKD) dataset have eight features, as follows:
(1) Age: in years
(2) Glucose: the amount of glucose that is still present in the blood after two hours, typically referred to as the "2-hour postprandial blood sugar level"
(3) Insulin: the insulin test result that measures the level of insulin in a person's blood (μU/ml)
(4) Blood pressure: the pressure of the blood against the artery walls as it passes through the body (mm Hg)
(5) Pregnancies: the total number of times that the woman has been pregnant
(6) SkinThickness: the thickness of the triceps skin fold (mm)
(7) BMI: (weight in kg)/(height in m)^2
(8) Diabetes pedigree function: a function that assigns a score to the chance of developing diabetes depending on the individual's family medical history
The target variable, "outcome", is a class variable that takes the value 0 or 1 to indicate whether or not diabetes is present in the patient.
Figure 2 shows the percentage of diabetes among Pima Indians participants. There are 768 records, and 268 of those individuals have been diagnosed with diabetes. The private dataset includes 300 participants (195 female and 105 male) aged 15-77. Table 1 shows the eight features of the Pima Indians and private datasets.

Dataset Preparation and Processing.
The dataset used for this research combines the widely available Pima Indians dataset and the selected private dataset. We detected a few unexpected zero values in the combined dataset. For instance, neither BMI nor skin thickness can be equal to 0. Each such zero value was replaced with the mean value of the corresponding feature. Using the holdout validation method, the two datasets were partitioned so that the training set consists of 75% of the data and the test set consists of 25% of the data.
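The two preprocessing steps described above, replacing implausible zeros with a mean and a 75/25 holdout split, can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names, the toy records, the fixed seed, and the choice to compute each mean over the non-zero entries only are all assumptions.

```python
import random
from statistics import mean

def impute_zeros_with_mean(rows, columns):
    """Return a copy of rows where zeros in `columns` are replaced by the
    mean of that column's non-zero values (assumed imputation rule)."""
    cleaned = [dict(row) for row in rows]
    for col in columns:
        col_mean = mean(row[col] for row in rows if row[col] != 0)
        for row in cleaned:
            if row[col] == 0:
                row[col] = col_mean
    return cleaned

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle once and split into (train, test) by the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy records, not taken from either dataset.
records = [
    {"BMI": 31.0, "SkinThickness": 20.0, "Outcome": 1},
    {"BMI": 0.0, "SkinThickness": 30.0, "Outcome": 0},
    {"BMI": 25.0, "SkinThickness": 0.0, "Outcome": 0},
    {"BMI": 28.0, "SkinThickness": 25.0, "Outcome": 1},
]
cleaned = impute_zeros_with_mean(records, ["BMI", "SkinThickness"])
train, test = holdout_split(cleaned, test_fraction=0.25, seed=0)
```

Copying the rows before imputing keeps the raw records available for inspection, which helps when checking how many values were altered.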
Figure 1: Sequences of the proposed approach for predicting diabetes.
The study uses the concept of mutual information, which quantifies the interdependence of different variables. It is closely related to information gain, and larger values indicate a stronger dependence. Figure 3 shows the importance of each feature of the utilized dataset, expressed as the mutual information of the attributes that make up the dataset. For instance, based on the information shown in this figure, SkinThickness is less important for diabetes than previously thought when using this mutual information approach. The comparison between the Pima Indians and private datasets, including maximum, minimum, and average values, is shown in Table 1.
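As an illustration of the mutual information score discussed above, the sketch below computes it for two discrete variables from their empirical joint distribution. It is not the authors' implementation: continuous features such as glucose would first need to be binned (or handled by a library estimator such as scikit-learn's `mutual_info_classif`), and the toy vectors are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits for two equally long sequences
    of discrete values, estimated from empirical frequencies."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        total += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return total

# Hypothetical binary feature that determines the outcome exactly.
outcome = [1, 1, 0, 0]
informative = [1, 1, 0, 0]
# Hypothetical feature that is statistically independent of the outcome.
uninformative = [0, 1, 0, 1]

mi_informative = mutual_information(informative, outcome)
mi_uninformative = mutual_information(uninformative, outcome)
```

A feature that fully determines a balanced binary outcome carries one bit of information, while an independent feature carries none, which is the sense in which larger values suggest stronger dependence.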
The proposed research uses the Extreme Gradient Boosting (XGB) approach. XGB is a scalable, distributed gradient-boosted decision tree ML library used for classification, regression, and ranking problems. Before the acquired dataset was combined with the Pima Indians dataset, the XGB regressor model was developed. In several publications, the prediction of missing values has been achieved using a variety of regression and ensemble learning methods [22].
Comprehensive research was conducted to identify the optimal method for predicting the insulin feature of the mentioned dataset. Three supervised ML-based methods, namely support vector regression (SVR), Gaussian process regression (GPR), and XGB, were implemented on the mentioned datasets and used to predict the results of interest (insulin levels in the tested samples). After that, we used the formula in (1) to calculate the root mean square error (RMSE) of the selected regression models; the RMSE is the standard deviation of the residuals (i.e., errors of prediction):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(P_i - A_i\right)^2}, \tag{1}$$
where $n$ is the dataset validation sample count, $P_i$ represents the predicted values, and $A_i$ represents the actual values. As shown in Table 2, the GPR method has the smallest RMSE value of insulin on the dataset. As a result, predictions have been made on the insulin level of the mentioned dataset using the proposed model. Due to specific characteristics and constraints of the data, a semisupervised learning model was used to predict insulin in the private dataset. Some possible reasons include the following:
(1) Limited labeled data: Semisupervised learning works well with few labeled samples. Labeled data are used to train the model with known outcome variables, such as insulin levels.
(2) Cost-effectiveness: Labeling huge volumes of data, especially medical datasets, is expensive. Semisupervised learning makes use of an abundance of unlabeled data, which are cheaper to obtain than labeled data. Because of its low cost, it is attractive for private datasets.
(3) Use of unlabeled data: Unlabeled data often contain valuable information that can be used to improve model performance. By incorporating both labeled and unlabeled data, semisupervised models can leverage the underlying patterns and structures present in the unlabeled data to enhance their predictive capabilities. This is particularly relevant when working with complex medical datasets, where the relationship between features and the target variable may not be well understood.
(4) Privacy concerns: Private datasets may have severe privacy or confidentiality restrictions that limit labeled data access. Semisupervised learning allows models to generate accurate predictions while protecting sensitive data. Semisupervised models employ unlabeled data, lowering the danger of disclosing sensitive information.
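The RMSE-based comparison of candidate regressors described earlier in this subsection can be sketched as below. The model names and numbers are hypothetical placeholders and do not reproduce Table 2; only the RMSE definition itself follows the text.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and actual values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))

# Hypothetical validation targets (insulin, in μU/ml) and predictions
# from two hypothetical regressors.
actual = [100.0, 150.0, 200.0]
candidates = {
    "model_a": [110.0, 140.0, 210.0],
    "model_b": [100.0, 150.0, 230.0],
}

scores = {name: rmse(pred, actual) for name, pred in candidates.items()}
best = min(scores, key=scores.get)  # regressor with the smallest RMSE
```

Because the errors are squared before averaging, a single large miss (as in `model_b`) is penalized more heavily than several small ones.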
Article [20] also describes some potential benefits of employing semisupervised learning in the insulin prediction domain. After using a semisupervised method to predict the insulin feature, we combined the two datasets mentioned to create a merged dataset. The combined dataset had 1068 records with all characteristics except SkinThickness, which was determined to be less important by mutual information. The combined dataset utilized in this research has 498 (268 + 230) diabetes samples and 570 (500 + 70) nondiabetic samples, which creates a class imbalance problem. The Synthetic Minority Oversampling Technique (SMOTE) was applied to the training set to solve the problem of imbalanced classes, but the testing set was left unchanged. The min/max normalization method was also utilized in this study. Using the following equation, the data were transformed so that they fall within the same range:
$$X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}, \tag{2}$$
where $X_{\max}$ and $X_{\min}$ represent the highest and lowest values in the feature column.
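The oversampling and scaling steps above can be sketched in plain Python. This is a simplified illustration, not the authors' pipeline: `smote_like` only mimics the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), a library implementation such as imbalanced-learn's `SMOTE` would normally be used, and the order of the two steps, the function names, and the toy data are assumptions.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def min_max_fit(rows):
    """Column-wise minimum and maximum, learned from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def min_max_transform(rows, mins, maxs):
    """Apply (x - min) / (max - min) per column; constant columns map to 0."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, mins, maxs)] for row in rows]

# Hypothetical training data: [glucose, BMI] with labels (1 = diabetic).
train_X = [[85.0, 22.0], [120.0, 30.0], [90.0, 24.0], [160.0, 35.0], [180.0, 40.0]]
train_y = [0, 0, 0, 1, 1]
test_X = [[140.0, 33.0]]

minority = [x for x, y in zip(train_X, train_y) if y == 1]
synthetic = smote_like(minority, n_new=1)          # balances 3 vs 2
balanced_X = train_X + synthetic
balanced_y = train_y + [1] * len(synthetic)

mins, maxs = min_max_fit(balanced_X)               # fitted on training data only
scaled_train = min_max_transform(balanced_X, mins, maxs)
scaled_test = min_max_transform(test_X, mins, maxs)  # test set is not oversampled
```

Learning the minimum and maximum from the training rows and reusing them for the test rows keeps test information out of the preprocessing, matching the statement that the testing set is left unchanged by SMOTE.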

Machine Learning Algorithms.
To implement the mobile diabetes prediction app, this study used 10 different ML and ensemble techniques, which are mentioned in Section 3.3. To prevent overfitting, the GridSearchCV framework has been utilized to determine the optimal values of various hyperparameters for all the ML techniques.
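The idea behind a cross-validated grid search can be sketched without any library, as below. This is not scikit-learn's `GridSearchCV` and not the authors' configuration: the tiny KNN classifier, the interleaved fold assignment, the parameter grid, and the toy data are all assumptions chosen to keep the example self-contained.

```python
from collections import Counter
from itertools import product

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cv_accuracy(X, y, k, folds=3):
    """Accuracy averaged over interleaved folds (sample i goes to fold i % folds)."""
    correct = 0
    for fold in range(folds):
        val_idx = [i for i in range(len(X)) if i % folds == fold]
        tr_idx = [i for i in range(len(X)) if i % folds != fold]
        tr_X = [X[i] for i in tr_idx]
        tr_y = [y[i] for i in tr_idx]
        correct += sum(knn_predict(tr_X, tr_y, X[i], k) == y[i] for i in val_idx)
    return correct / len(X)

def grid_search(X, y, grid, folds=3):
    """Try every parameter combination; keep the first one with the best score."""
    best_params, best_score = None, -1.0
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(X, y, folds=folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical, already normalized single-feature data with two clear groups.
X = [[0.10], [0.15], [0.20], [0.25], [0.30], [0.35],
     [0.70], [0.75], [0.80], [0.85], [0.90], [0.95]]
y = [0] * 6 + [1] * 6

best_params, best_score = grid_search(X, y, {"k": [1, 3, 5]})
```

Scoring each candidate on held-out folds rather than on the data it was fitted to is what lets the search guard against overfitting when choosing hyperparameters.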
(i) A DT is a diagrammatic representation of a rule-based learning function. The DT learning method is an approach to the approximation of target functions with discrete values. Each node is selected using coefficients based on the Gini or entropy measures of information gain, which are written as follows:
$$\mathrm{Gini} = 1 - \sum_{i=1}^{n} p_i^2, \tag{3}$$
$$\mathrm{Entropy} = -\sum_{i=1}^{n} p_i \log_2 p_i. \tag{4}$$
In both (3) and (4), $n$ refers to the total number of unique class values and $p_i$ is the proportion of samples that belong to class $i$. Using GridSearchCV hyperparameter tuning, we determined that a maximum depth of 2, a minimum of 50 samples per leaf, and the "Gini" impurity metric perform effectively with the dataset used in this study [23].
(ii) KNN is a supervised, nonparametric classifier in which an approximation function with discrete values is obtained from the K closest training points. To categorize the data, it first places all of the training points in a feature space, then measures the distance between the query and those points, and finally produces a classification. This method identifies a certain number, K, of neighbors (determined by the dataset) and assigns the class chosen by the majority vote. During our investigation, we used K = 5 for the binary classification [24].
(iii) RF is an ML technique that takes the predictions of many decision trees and creates an average using that information. As a consequence, RF has characteristics that make it suitable for use as an ensemble learning model. In this study, we utilized RF with 400 estimators, a minimum of five samples per leaf, and the "Gini" impurity measure, with hyperparameter adjustment [23].
(iv) SVM is a statistical method that performs supervised classification by selecting the optimal hyperplane. In this investigation, we tried a few different SVM kernels on the training dataset and compared their performance. The SVM algorithm with parameters gamma = 1, C = 10, test_size = 0.1, and random_state = 39 produced the best results [25].
(v) LR is a statistical technique that may be utilized to make predictions regarding binary classes. It fits an S-shaped function, which may be used to forecast the result. The hyperparameter optimization approach determined that 150 iterations were sufficient for the logistic regression model to converge [24].
(vi) AB is an example of an ensemble method. This algorithm first fits a classifier on the main dataset, and then it fits subsequent copies of that classifier on the same dataset to achieve optimal performance. This framework modifies the weights of cases that were incorrectly categorized to direct the attention of succeeding classifiers more toward challenging cases. In this study, the AB algorithm was used with the number of estimators set to 50 and the learning rate set to 0.10.
(vii) XGB is a gradient boosting-based ensemble ML algorithm that uses decision trees. The following are the settings that were utilized for the proposed XGB classifier: estimators' maximum depth = 4, the objective function was "binary logistic", test_size = 0.1, and random_state = 52 when applying the SMOTE to the training set [26].
(viii) A voting classifier is an ensemble approach that was developed to improve classification via the use of voting. This article presents the implementation of a voting classifier with the voting hyperparameter set to "soft", so that the final category is chosen from the averaged class probabilities predicted by the individual algorithms [23].
(ix) A bagging classifier is a type of ensemble classifier that trains base classifiers on samples of the initial dataset and then combines their individual predictions to get a final classification based on the results of the voting process. As examples of different hyperparameters, the implemented bagging classifier uses 500 base estimators, a maximum number of samples equal to 120, and an out-of-bag value equal to "True" [27].
(x) NB is an ML algorithm used for classification. It uses the Bayes theorem as its foundation, with the assumption that features are conditionally independent once the class label is known. Due to the assumption's simplicity, the method is quick and can be extended to high-dimensional data. When it comes to classification tasks, particularly the classification of texts and spam filtering, NB is a straightforward and reliable method. It is robust to irrelevant features and can handle missing data. The following are the settings that were utilized for the NB classifier:

Deployment of the Proposed System.
ML algorithms (classifiers) served as the foundation for the proposed system, which has been implemented in a mobile application framework so that it can function instantly on actual data for predicting diabetes. We developed the interface of the requested application using Android Studio, J2ME, PHP, MySQL, HTML, XML, and CSS. This study selected the XGB ML model with the SMOTE as the final choice because it offered the highest level of performance and accuracy (see Table 3). The model has been deployed using several integrated development environments (IDEs), such as the Python environment platform and Spyder. We also developed a mobile app to test the functionality of the prediction system, which allowed us to show the system's capabilities in real time. The user interface of this application is built with Android Studio. We used Java as the primary programming language. A subsequent step involved integrating the pickle package into Android Studio to load the model. We used Heroku as the hosting server for the proposed system when creating the application programming interface (API).
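The serialization step mentioned above can be illustrated with a small round trip: a fitted model is pickled once, and the server-side handler unpickles it and answers a JSON request from the app. Everything specific here is hypothetical: the feature names follow the usual Pima column names, `toy_model` is a stand-in dictionary rather than the trained XGB classifier, and `predict_from_json` is an invented helper, not part of the described API.

```python
import json
import math
import pickle

# Assumed feature order for the merged dataset (SkinThickness excluded).
FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "Insulin",
            "BMI", "DiabetesPedigreeFunction", "Age"]

# Hypothetical stand-in for a trained model: a one-feature logistic scorer.
toy_model = {"bias": -7.0, "weights": {"Glucose": 0.05}}
model_bytes = pickle.dumps(toy_model)  # what would be stored on the server

def predict_from_json(model_bytes, body):
    """Load the pickled model, validate the JSON payload, return a JSON reply."""
    model = pickle.loads(model_bytes)  # only unpickle data from a trusted source
    payload = json.loads(body)
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return json.dumps({"error": "missing features", "fields": missing})
    score = model["bias"] + sum(
        weight * float(payload[name]) for name, weight in model["weights"].items())
    probability = 1.0 / (1.0 + math.exp(-score))
    return json.dumps({"probability": probability,
                       "outcome": int(probability >= 0.5)})

request = json.dumps({"Pregnancies": 2, "Glucose": 180, "BloodPressure": 80,
                      "Insulin": 130, "BMI": 33.5,
                      "DiabetesPedigreeFunction": 0.6, "Age": 45})
response = json.loads(predict_from_json(model_bytes, request))
incomplete = json.loads(predict_from_json(model_bytes, json.dumps({"Glucose": 100})))
```

Validating the payload before scoring gives the mobile client a clear error when a field is missing, instead of a failure inside the model call.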
Figure 4 shows the flowchart outlining the process for designing the proposed ML-based diabetes prediction app. The proposed app has been deployed into two subsystems, namely, the website app (on the left) and the mobile app (on the right).

Results and Discussion
In this section, the findings of the proposed diabetes prediction app are presented, along with an explanation of the system. Firstly, the effectiveness of a wide variety of ML techniques was evaluated. After that, a demonstration of the website framework that has been created, as well as an Android mobile application, follows. When evaluating the various ML models, we looked at their classification accuracy and related measures. These measures can be described by the following formulas:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where TP means that the system predicts a positive value and the actual value is likewise positive. FP means that the model made a positive prediction, but the actual outcome was negative. TN means that the system predicts a negative value and the actual outcome confirms this prediction. FN means that the model predicts a negative value, whereas the actual value is positive. All ML models were validated using the holdout validation method with a categorized 7:3 train-test partition.
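The standard definitions of these measures in terms of TP, FP, TN, and FN can be checked with a short sketch. The counts below are hypothetical and are not the values behind Figure 5 or Table 3.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical confusion-matrix counts for a 100-sample test set.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Reporting the F1 score alongside accuracy is useful on imbalanced test sets, where a model can reach high accuracy while missing many positive cases.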
Figure 5 shows the confusion matrix for XGBoost with the SMOTE technique. According to this figure, the XGBoost algorithm correctly classified 300 instances with TP = 39 and TN = 261.
Finally, to better understand the model's decision-making process, an explainable AI approach using the SHAP library is implemented. Figure 6 illustrates the feature importance for XGBoost with SMOTE, computed using the SHAP library.
Table 3 presents a comparison of the various performance measures for ten ML classifiers when applied to the combined datasets using the SMOTE. Table 3 shows that the XGB classifier had the greatest overall performance, as seen by its accuracy of 83.1% as well as its F1 score = 0.76 and AUC = 85.31% (see Figure 7). The KNN classifier, on the other hand, obtained the lowest accuracy and F1 score.
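Since AUC is reported next to accuracy and the F1 score, a brief sketch of one standard way to compute it may help: the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The labels and scores below are hypothetical and unrelated to Table 3.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and predicted probabilities for four samples.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
auc = roc_auc(labels, scores)
```

Unlike accuracy, this measure does not depend on a single decision threshold, which is why the two can rank classifiers differently.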
Next, the domain adaptation technique was utilized for the testing and training of the ML model on source and target datasets.In this study, the proposed system for the prediction of diabetes is initially trained using the opensource dataset containing a higher number of Pima Indians.After that, the system is tested on a private dataset that has a much reduced dimension.Te key performance metrics for the private dataset are displayed in Table 4.In this instance, it is worth noting that XGB with the SMOTE was used for the training dataset.Te proposed approach has a 97.4% accuracy rate, an F1 coefcient of 0.95, and an AUC of 0.87.
Finally, the proposed system was implemented as a mobile app utilizing XGB and SMOTE. Figure 8 shows an immediate diabetes prediction produced by the mobile app, which was developed using real data and the most effective classifier (i.e., XGB).
This research aims to use ML classifiers for predicting diabetes disease. Based on the experimental results, XGB regression yielded the most accurate predictions of insulin with the smallest RMSE. Based on the mutual information-based feature selection method, the most important characteristics for diabetes prediction are glucose level, BMI, age, and insulin. Methods for optimizing hyperparameters and oversampling with synthetic data, such as the SMOTE, have

Limitations
Despite the promising results and potential of the proposed technique for predicting diabetes disease through a mobile app using machine learning, several limitations need to be acknowledged:

International Journal of Intelligent Systems

(2) Imbalanced classes: The presence of imbalanced classes, where one class is significantly more prevalent than the other, can affect the performance of ML algorithms. In this study, the researchers addressed this issue by using the SMOTE technique. While SMOTE helps in balancing the classes, it is not a perfect solution and may introduce synthetic samples that do not precisely represent the minority class. This limitation could lead to biased predictions and lower accuracy in real-world scenarios.

(3) Algorithm selection: The researchers applied several ML classification techniques to determine the best algorithm for diabetes prediction. However, the choice of algorithms is subjective and could impact the results. There might be other algorithms not considered in this study that could potentially achieve better accuracy or different trade-offs. Therefore, the selection of ML algorithms should be carefully considered and evaluated in future research.

(4) Domain adaptation: The proposed system's adaptability was demonstrated through the application of domain adaptation methods. However, the generalization of the proposed technique to different populations or settings may still be limited. The effectiveness of the technique in diverse populations with varying demographics, lifestyles, and healthcare systems needs to be further investigated. Additionally, the potential challenges and limitations associated with domain adaptation should be thoroughly addressed.

(5) Mobile app usability and acceptance: The development of a mobile app for users to enter features and predict diabetes instantly is a significant contribution of this study. The success of the proposed technique ultimately relies on user engagement and adoption of the mobile app. It is therefore essential in future work to evaluate key factors such as user experience, privacy concerns, and accessibility to ensure the app's effectiveness in a real-world setting.
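To make the limitation on imbalanced classes concrete, the following is a minimal sketch of the interpolation step at the core of SMOTE: each synthetic sample is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified illustration under stated assumptions, not the implementation used in this study. The function name `smote_sketch`, its parameters, and the sample points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority samples.

    Each new point is interpolated between a randomly chosen minority
    sample and one of its k nearest minority neighbours (Euclidean).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base point.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(base, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            [b + gap * (n - b) for b, n in zip(base, neighbour)]
        )
    return synthetic


# Hypothetical two-feature minority samples, for illustration only.
minority = [[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
new_points = smote_sketch(minority, n_new=4, k=2)
print(new_points)
```

Because every synthetic point lies between two existing minority samples, the method cannot create information that the minority class does not already contain, and it may place points in regions that overlap the majority class. This is the behaviour behind the caution expressed in limitation (2).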

Conclusions
Diabetes can reduce both life expectancy and quality of life. Early detection of this chronic ailment has the potential to lessen the severity of numerous diseases and their repercussions. In this research, we have introduced the implementation of a mobile-based model that can automatically predict a person's risk of acquiring diabetes using a range of ML techniques. This app was developed as part of this research and applied to the available Pima Indians dataset and a selected private dataset. In this study, multiple ML and ensemble algorithms were evaluated based on their accuracy. The XGB algorithm had the highest level of performance using the SMOTE, with an accuracy of 97.4%, an F1 coefficient of 0.95, and an AUC of 0.87 for the private dataset and an accuracy of 83.1%, an F1 coefficient of 0.76, and an AUC of 0.85 for the combined datasets. The domain adaptation method was applied to illustrate the adaptability and flexibility of the proposed prediction app. Finally, the best-performing XGB method has been implemented in a mobile app that allows users to enter features and predict diabetes instantly.
There are several potential extensions to the scope of our work. In particular, we suggest that more confidential data be collected from a greater number of patients to obtain more accurate results. While the proposed technique using ML for the prediction of diabetes disease through a mobile app shows promise, it is important to acknowledge the limitations outlined in the previous section. Future research should aim to address these limitations to improve the accuracy, generalizability, and usability of the proposed technique in real-world healthcare settings.

Figure 2: Percentage of diabetes among Pima Indians participants.

Figure 3: The importance of the features of the diabetes dataset.

Figure 7: ROC curve and AUC for XGB using the SMOTE.

Figure 8: Immediate prediction of diabetes by the proposed mobile app.
The development and evaluation of practical applications for self-diagnosis and monitoring of diabetes is another gap in the literature. While some studies have proposed mobile apps or other tools, their effectiveness, usability, and adaptability to different datasets and populations require further investigation.
approach. The proposed article directly addresses this gap. It conducts a thorough comparison and evaluation of multiple ML techniques, including logistic regression, random forest, KNN, decision tree, bagging, AdaBoost, XGBoost, voting, SVM, and Naive Bayes. By comparing their performance metrics like accuracy, F1 coefficient, and AUC on

Table 1: The features of both Pima Indians and private datasets.

Table 2: RMSE for different regression models applied to the diabetes dataset.

Table 3: Metrics for the performance of 10 ML algorithms using the SMOTE.

Table 4: The best performance metrics of the private dataset using the SMOTE.