Use of Machine Learning and Routine Laboratory Tests for Diabetes Mellitus Screening

Most patients with diabetes mellitus are asymptomatic, which leads to delayed and more complex treatment. At the same time, most individuals are routinely subjected to standard clinical laboratory examinations, which create large health datasets over a lifetime. Computer processing has been used to search for health anomalies and predict diseases using clinical examinations. This work studied machine learning models to support the screening of diabetes through routine laboratory tests using data from laboratory tests of 62,496 patients. The classification and regression models used were the K-nearest neighbor, support vector machine, naïve Bayes, random forest, and artificial neural network models. Glycated hemoglobin, a test used for diabetes diagnosis, was used as the target. Regression models predicted glycated hemoglobin directly, and the predicted values were then classified. The performance of the classification models was studied under various subdataset partitions and combinations (e.g., healthy, prediabetic, and diabetes, as well as no healthy and no diabetes). The best single performance was achieved with the artificial neural network model when detecting prediabetes or diabetes. The artificial neural network classification model scored 78.1%, 78.7%, and 78.4% for sensitivity, precision, and F1 score, respectively, when identifying the no healthy group. Other models also performed well, depending on the screening objective. Machine learning-based models can predict glycated hemoglobin values from routine laboratory tests and can be used as a screening tool to refer a patient for further testing.


Introduction
Diabetes mellitus (DM) is a chronic metabolic disorder caused by a deficiency in insulin production or a lack of capability of the cells to use it properly. Over time, DM causes an increase in blood glucose levels, which is known as hyperglycemia. DM also increases the risk of premature death and possible diabetes-associated complications, such as heart attack, stroke, kidney failure, and vision loss [1]. Most DM patients are asymptomatic and do not undergo a DM test, leading to a delayed diagnosis. Late DM identification leads to complex treatment and poor outcomes. DM is estimated to cost $760 billion and to account for 11.3% of deaths among people aged 20-79 worldwide.
Early diagnosis is imperative to mitigate diabetes complications and deaths and reduce treatment costs [2].
Currently, DM is diagnosed through laboratory tests that measure glucose (i.e., fasting plasma glucose, FPG) and glycated hemoglobin (HbA1c). HbA1c is considered the gold standard for screening and diagnosing diabetes because of its international standardization, lower susceptibility to biological variability, insensitivity to acute stress, and no need for fasting [3,4]. However, FPG exams are still widely used and are often the main diagnostic method, even though glucose values may fluctuate and lead to erroneous interpretations [3,5]. For the HbA1c test, the individual is considered healthy if the value is equal to or less than 5.6%, prediabetic if the value is between 5.6% and 6.5%, and diabetic if the value is equal to or greater than 6.5%. For the FPG test, the individual is considered healthy if the value is equal to or less than 99 mg/dl, prediabetic if the value is between 99 mg/dl and 126 mg/dl, and diabetic if the value is equal to or greater than 126 mg/dl [1].
Computer processing has been used to identify diseases based on clinical data processing [6][7][8][9][10]. Extracting knowledge from data to support experts in decision-making is a trend in the new generation of smart health systems [11,12]. Computer methods such as data mining and machine learning can improve diagnosis alongside patient data. Several studies have been using laboratory tests and machine learning techniques to search for new results in recent years. In the case of diabetes mellitus, the search for a diagnosis has been the target of predictive medicine. Many studies have used artificial intelligence to predict a diagnosis or a future propensity to develop the disease. In general, in addition to laboratory tests, these studies make use of clinical data, patient history, imaging tests, and medical diagnoses [13][14][15][16][17][18][19][20][21], none of which used only laboratory tests. Oleg [14], for example, in addition to laboratory tests, also used data on retinopathy or nephropathy.
Similarly, Hang [16], Wu [19], and Hische [22] also made use of other clinical data in the search for a diagnosis of diabetes. Some studies, such as Ravaut [17], Bernardini [21], and Le [23], aim to determine whether a patient is likely to develop the disease in the future, which is relevant as part of a process in predictive medicine. Other authors [24][25][26][27][28] have used data from noninvasive tests, such as photoplethysmography (PPG) and electrocardiogram (ECG), with the main motivation being the screening and monitoring of blood glucose for patients already diagnosed.
The use of laboratory tests and machine learning to search for new results has been extensively explored in recent years [8,14,20,29,30,31,32,33]. In particular, we draw attention to the work of Park [34], who predicted several diseases from laboratory tests, although not DM.
This work has its focus on the use of routine laboratory tests. Once the blood sample has already been collected and the patient's tests performed, the possibility of predicting new information is of great relevance for the diagnostic process of medical laboratories. We do not use any other type of data, enabling the automation of analysis and medical laboratories' diagnostic processes. Discovering information can generate alerts for things not observed, thus proposing complementary exams for an early diagnosis of still unknown pathologies.
For example, in the diagnosis of diabetes mellitus, although the HbA1c test is recommended, the FPG test is the most used. However, this test may present variations and inconsistencies [3,35], generating false-negative results. Discrepancies between a DM diagnosis based on the FPG test and one based on the HbA1c test are not uncommon. It is therefore crucial to predict possible DM diagnoses and recommend complementary exams to prevent an asymptomatic patient from being left without proper and timely treatment. In this case, the predicted HbA1c can be used to confirm the diagnosis given by the FPG test, and in discrepant cases, the system may propose performing the HbA1c test with the blood sample already available. This approach would avoid false-negative results, saving time and the costs of further exams and treatments.
The possibility of automatically using data from laboratory tests to search for new patient information is of great relevance. This methodology can directly impact the analysis processes of laboratory tests outcome, suggesting complementary and more complex tests in the screening for new pathologies and counter-proof for false-negative cases. In most cases, the blood sample already collected can be used, saving time and costs. Thus, this methodology presents itself as an innovation to performing tests and diagnoses in medical laboratories.
We propose a machine learning-based approach that uses existing laboratory data to screen for DM by predicting HbA1c and classifying subjects based on the most frequently performed laboratory examinations: hemograms, creatinine, and fasting plasma glucose. Using these data may enable earlier prediction of HbA1c levels in DM while relying only on routine and straightforward laboratory testing. In this way, the proposed approach can help detect DM by directing the patient to complementary exams. Thus, this work sought to explore and evaluate different machine learning models and dataset configurations to identify the best ways to support the DM diagnosis based on routine laboratory testing.

Materials and Methods
We used a four-step framework to study HbA1c classification models and predictions. The four steps are (1) data collection, (2) data preprocessing, (3) model training, and (4) performance evaluation. The framework is shown in Figure 1.

Data Preprocessing.
Preprocessing is one of the most important steps in using machine learning techniques. We used a factor analysis technique to select the examinations most relevant to HbA1c levels. Records with missing data (i.e., any single missing exam) were excluded. No data were imputed. Table 1 shows the selected features and the target variable HbA1c. The selected features were normalized to a mean of zero and a standard deviation of 1. The classification of HbA1c provides DM diagnoses. There are three HbA1c categories: healthy if HbA1c is <5.7% (39 mmol/mol), prediabetes if HbA1c is between 5.7% and 6.4% (46 mmol/mol), and diabetes if this value is ≥6.5% (48 mmol/mol) [36].
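The preprocessing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`fpg`, `creatinine`, `hba1c`), the sample values, and the function names are hypothetical, the population standard deviation is assumed for normalization, and predicted values between 6.4% and 6.5% are assumed to fall in the prediabetes class.

```python
from statistics import mean, pstdev


def drop_incomplete(records):
    """Keep only records in which every exam has a value (no imputation)."""
    return [r for r in records if all(v is not None for v in r.values())]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]


def hba1c_category(hba1c):
    """Map an HbA1c percentage to healthy (H), prediabetes (P), or diabetes (D)."""
    if hba1c < 5.7:
        return "H"
    if hba1c < 6.5:  # assumption: values in the 6.4-6.5 gap count as prediabetes
        return "P"
    return "D"


# Hypothetical records; the second one has a missing exam and is excluded.
records = [
    {"fpg": 90.0, "creatinine": 0.9, "hba1c": 5.2},
    {"fpg": 110.0, "creatinine": None, "hba1c": 6.0},
    {"fpg": 130.0, "creatinine": 1.1, "hba1c": 6.8},
    {"fpg": 100.0, "creatinine": 1.0, "hba1c": 5.9},
]

complete = drop_incomplete(records)
fpg_z = standardize([r["fpg"] for r in complete])
labels = [hba1c_category(r["hba1c"]) for r in complete]
print(len(complete), labels)
```

Excluding incomplete records rather than imputing them keeps every feature value a measured one, at the cost of a smaller dataset.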
The dataset was arranged into three distinct subdatasets, "HPD," "HN," and "ND," for use with the classification models (see Figure 2(a)). The first subdataset, "HPD," describes individuals based on HbA1c; thus, there are three categories: healthy ("H"), prediabetes ("P"), and diabetes ("D"). The second subdataset, "HN," describes individuals as healthy ("H") and no healthy ("N"), where "N" comprises prediabetes and diabetes (N = P + D). The third subdataset, "ND," describes individuals as no diabetes ("N") and diabetes ("D"), where "N" comprises healthy and prediabetes (N = H + P). The dataset was also arranged into three subdatasets for classifying the output of the regression models (see Figure 2(b)); their acronyms follow the pattern of the subdatasets used with the classification models, with the addition of an "r" suffix, i.e., "HPDr," "HNr," and "NDr."

Model Training.
We trained five classification models and five regression models. The target of the classification models was HbA1c as an ordinal variable, and the target of the regression models was HbA1c as a continuous variable. The HbA1c output of the regression models was then classified according to the DM categories. We used several models with different approaches and complexities, ranging from simple K-nearest neighbors to complex ANNs. The Python package Scikit-learn [37] was used to implement the models. For validation, 30% of the training part of the dataset was used. The training and validation parts were used for hyperparameter tuning. The hyperparameters of the models were adjusted using Bayesian optimization (BO) with a Gaussian process (GP) [38].
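The subdataset arrangements described above amount to relabeling the three HbA1c categories. The sketch below illustrates this in plain Python; the function names and the sample values are hypothetical, and the "r" variants are assumed to be obtained by first categorizing the regression output and then applying the same relabeling.

```python
def hba1c_category(hba1c):
    """Map an HbA1c percentage to H, P, or D (assumed cut-offs 5.7 and 6.5)."""
    if hba1c < 5.7:
        return "H"
    if hba1c < 6.5:
        return "P"
    return "D"


def to_hn(label):
    """HN arrangement: healthy (H) versus no healthy (N = P + D)."""
    return "H" if label == "H" else "N"


def to_nd(label):
    """ND arrangement: no diabetes (N = H + P) versus diabetes (D)."""
    return "D" if label == "D" else "N"


# Classification targets (HPD) and their two binary arrangements.
hpd = ["H", "P", "D", "P", "H"]
hn = [to_hn(c) for c in hpd]
nd = [to_nd(c) for c in hpd]

# Regression outputs (hypothetical predicted HbA1c values) classified afterwards.
predicted_hba1c = [5.4, 6.1, 7.0]
hpdr = [hba1c_category(v) for v in predicted_hba1c]
hnr = [to_hn(c) for c in hpdr]
ndr = [to_nd(c) for c in hpdr]
print(hn, nd, hpdr, hnr, ndr)
```

Note that "N" has a different meaning in the HN and ND arrangements, so the two label sets should not be mixed.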

BioMed Research International
The classification was studied using K-nearest neighbor (KNN), support vector machine (SVM), naïve Bayes (NB), random forest (RF), and artificial neural network (ANN) models. The regression was studied using these methods as regressors, i.e., K-nearest neighbor regressor (KNNr), support vector machine regressor (SVMr), naïve Bayes regressor (NBr), random forest regressor (RFr), and artificial neural network regressor (ANNr). The following configuration was used:
(i) KNN and KNNr model hyperparameters were set to "8 neighbors," "uniform weights," and "ball tree algorithm"
(ii) SVM model hyperparameters were set to "0.8 C," "RBF kernel," "3 degree," "true shrinking," "true probability," "decision function shape ovr," and "1000 cache"
(iii) SVMr model hyperparameters were set to "1 C," "epsilon insensitive loss," "0.1 epsilon," a tolerance of 1e5, "decision function shape ovr," and "1000 cache"
(iv) NB and NBr were set to default
(v) RF and RFr were set to "gini criterion," "5 max depth," and "50 estimators"
(vi) ANN and ANNr were set to "2 layers with 50 neurons," "adam solver," "adaptive learning rate," and "relu activation"

Performance Metrics.
The test part of the dataset was used to evaluate the results. We used the mean squared error (MSE) to assess regression performance using equation (1), where $n$ represents the number of samples, $y_i$ represents the original value of sample $i$, and $\hat{y}_i$ represents the predicted value of sample $i$ [39].

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (1)$$
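As a small worked example of the mean squared error described above, the following sketch averages the squared differences between original and predicted values. The function name and the three HbA1c values are hypothetical and serve only to make the arithmetic concrete.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between original and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical measured and predicted HbA1c values (in %).
y_true = [5.0, 6.0, 7.0]
y_pred = [5.5, 6.0, 6.0]

# Squared errors are 0.25, 0.0, and 1.0, so the MSE is 1.25 / 3.
mse = mean_squared_error(y_true, y_pred)
print(mse)
```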
HN and ND were compared to HNr and NDr to evaluate whether the machine learning models. Prediabetes is the midstage between healthy (no diabetes) and diabetes and has a narrow HbA1c value range; this relationship might negatively influence model performance. Five metrics were used to study the models: sensitivity (SN), specificity (SP), precision (PR), negative precision (NPR), and F1 score (F1), as in equations (2)-(6), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

$$\mathrm{SN} = \frac{TP}{TP + FN} \qquad (2)$$
$$\mathrm{SP} = \frac{TN}{TN + FP} \qquad (3)$$
$$\mathrm{PR} = \frac{TP}{TP + FP} \qquad (4)$$
$$\mathrm{NPR} = \frac{TN}{TN + FN} \qquad (5)$$
$$\mathrm{F1} = \frac{2 \cdot \mathrm{PR} \cdot \mathrm{SN}}{\mathrm{PR} + \mathrm{SN}} \qquad (6)$$

Sensitivity is the true positive rate, and specificity is the true negative rate. The F1 score is the harmonic mean of precision and sensitivity and is recommended for use with unbalanced databases, such as the database used in this work. The confusion matrix was used to visualize the performance of the algorithms; the rows represent the predicted class, the columns represent the actual class, and a good model must have main-diagonal values near 1 [39].
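The five metrics named above can be computed directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute SN, SP, PR, NPR, and F1 from binary confusion-matrix counts."""
    sn = tp / (tp + fn)           # sensitivity: true positive rate
    sp = tn / (tn + fp)           # specificity: true negative rate
    pr = tp / (tp + fp)           # precision: share of positive calls that are correct
    npr = tn / (tn + fn)          # negative precision: share of negative calls that are correct
    f1 = 2 * pr * sn / (pr + sn)  # harmonic mean of precision and sensitivity
    return {"SN": sn, "SP": sp, "PR": pr, "NPR": npr, "F1": f1}


# Hypothetical counts for a "no healthy" versus "healthy" screen.
m = binary_metrics(tp=80, fp=20, tn=70, fn=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because F1 ignores true negatives, it is less affected than accuracy when one class dominates the dataset.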
Results
Table 2 lists the performance of the classification models for the HPD dataset. The models had different score characteristics. The ANN model has greater sensitivity in identifying people with diabetes, although its precision (84.9%) is not the highest. On the other hand, the KNN model has a lower sensitivity in identifying DM but greater precision within the identified DM. Figure 3 shows the confusion matrix of the classification models using the HPD dataset. We found that KNN, SVM, and NB behaved approximately equally, while the ANN performed better than the others. All the models have approximately a 30% prediction error for prediabetes, which indicates that this category is primarily fuzzy. In addition, there is a tendency to misclassify prediabetic individuals as healthy rather than as diabetic, which may be due to the dataset characteristics.

The performance of the classification models for the HN and ND datasets is presented in Table 3. The sensitivity, precision, and F1 score are shown as bar plots in Figure 4. The HN models are more regular regarding scores; however, they have approximately 70%-80% precision; thus, some false positives are expected. We observe that the KNN has a sensitivity of only 42.7% using ND; however, it leads to high precision, which is useful in screening false negatives.

Figure 5 shows the regression model errors (MSE results). The regression line is shown. Data points are clustered around the regression line, which is indicative of the excellent performance of the models. The average MSE of the five models is 0.32, with the best performance achieved by the ANN model (0.29) and the worst, 0.38, by KNN. Table 4 shows the performance of the predicted values of the regression, arranged as HPDr. Figure 6 shows the relative confusion matrix. The regression models also show the tendency to misclassify prediabetic individuals as healthy rather than as diabetic. This tendency was also observed when using the classification models (see Figure 3). Thus, this tendency may be a characteristic of the data and not an imbalance in the database.

Figure 4: Sensitivity, precision, and F1 score of the classification models for the HN (healthy versus no healthy) and ND (no diabetes versus diabetes) datasets.

The performance of the regression models using HNr and NDr is presented in Table 4. The sensitivity, precision, and F1 score are also shown as bar plots in Figure 7. The results presented by the classification and regression models were similar when analyzing the same "type" of machine learning model. This characteristic can be observed in the three tested datasets (HPD, HN, and ND). Some of the tested machine learning models showed a slight improvement in performance with classification after regression.

Discussion
We studied a machine learning approach to detect DM using data from the most frequently performed clinical laboratory tests. Clinical laboratory data are often available because they are typically generated from routine blood tests. We demonstrated that machine learning could assist in the detection of DM. This system can be implemented at a minimal cost, as data are already available in computer databases from routine examinations. The proposed approach alone is not recommended for diagnostic purposes. We recommend using the system to generate an alert and recommend a specific DM examination. Thus, the models would improve DM investigation processes, as patients flagged with prediabetes or diabetes could be referred for further analysis, which is compatible with intelligent health systems [40]. If a patient's laboratory record shows a probability of diabetes, i.e., a "diabetes-like" pattern, the system can recommend further diabetes examination. Patients with "diabetes patterns" should be guided to traditional examinations and procedures. HbA1c strongly correlates with the average glucose [41] and is more stable and recommended for diabetes diagnoses [42]. Thus, during the FPG analysis process, the system will be able to predict HbA1c values over different arrangements of datasets, looking for discrepancies in relation to the exam performed. If there is a difference between the results, a new FPG test or a supplementary HbA1c test may be recommended in order to obtain a counter-proof of the result. As it is a computational method, there is no interruption in traditional procedures. The system can work synergistically with the current procedure and may help detect DM earlier.
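The discrepancy check described above can be illustrated with a short sketch. It assumes the FPG cut-offs of 99 and 126 mg/dl and the HbA1c cut-offs of 5.7% and 6.5% given earlier in the text; the function names and the sample values are hypothetical, and the rule shown (flag any disagreement between the two categories) is one simple reading of the text, not the authors' implementation.

```python
def fpg_category(fpg_mg_dl):
    """Categorize a measured fasting plasma glucose value (mg/dl)."""
    if fpg_mg_dl <= 99:
        return "H"
    if fpg_mg_dl < 126:
        return "P"
    return "D"


def hba1c_category(hba1c):
    """Categorize a (predicted) HbA1c percentage."""
    if hba1c < 5.7:
        return "H"
    if hba1c < 6.5:
        return "P"
    return "D"


def recommend_follow_up(fpg_mg_dl, predicted_hba1c):
    """Return True when the measured FPG and predicted HbA1c categories disagree."""
    return fpg_category(fpg_mg_dl) != hba1c_category(predicted_hba1c)


# Hypothetical cases: (measured FPG, HbA1c predicted by a regression model).
cases = [(95, 6.7), (95, 5.3), (130, 6.0)]
alerts = [recommend_follow_up(fpg, hba1c) for fpg, hba1c in cases]
print(alerts)
```

An alert of this kind only suggests a repeat FPG test or a supplementary HbA1c test; it does not replace the diagnostic exam.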
Currently, some studies have used laboratory tests to predict new results and support the diagnosis of diseases that are not the target of the test, as in Park's study [34], where several diseases are predicted. The most recent reports were [43][44][45][46], which studied the prediction of the RT-PCR test. However, several studies have used other types of data and machine learning techniques to assist DM prediction. In a study by Zheng et al. [47], the authors obtained 100% sensitivity and precision above 90% in several models while using a dataset of 300 samples and used several categories of features, including self-reporting notes and medication. Oliveira et al. [48] obtained 68% sensitivity and 68% specificity after using a smaller dataset and categorical features obtained through interviews. Lai et al. [16] obtained 71.6% sensitivity and 73.4% specificity with a dataset of 13,309 samples using laboratory and clinical features. The results obtained in this study cannot be directly compared to those of the studies mentioned above because the methodologies and features (i.e., input parameters and hyperparameters) differ. This study used only quantitative data from routine laboratory tests to train different classification and regression models, as well as different dataset arrangements, having an exploratory character.
A confusion matrix was chosen for the evaluation of the overall model. The confusion matrix is particularly useful when working with unbalanced data. The values of the main diagonal of the confusion matrix make up the accuracy, which is not a good evaluation metric for classification models with unbalanced datasets. In these cases, the F1 score is the most recommended evaluation metric. The F1 score is the harmonic mean of sensitivity and precision and is a simple way to evaluate models with unbalanced databases. However, the most appropriate way to evaluate a classification model in searching for a target is the joint analysis of sensitivity and precision. For instance, high-sensitivity models are better for target identification (e.g., RF model for HN dataset, Table 3). Conversely, prioritizing models with high precision (e.g., KNN model for ND dataset, Table 3) provides greater certainty in the results.
When using the classification models (see Table 2), we found that the ANN model had the highest sensitivity in identifying DM (66.2%), with a precision of 84.9%. The same occurred in identifying patients with prediabetes, where the ANN model had the best sensitivity (67.9%). The KNN model, on the other hand, obtained the highest precision in the identification of DM (89.1%), despite the low sensitivity (47.0%). For the identification of healthy individuals, the NB model had the highest sensitivity (79.4%), followed by the SVM and RF models (79.2% for both). The highest precision was for the ANN model (76.9%), followed by the SVM and RF models (74.2% for both). Regarding the F1 score, we found that the ANN model had the highest results for all classes (i.e., 74.4% for diabetes, 64.9% for prediabetes, and 76.7% for healthy).
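The reported F1 scores can be checked against the reported sensitivity and precision, since F1 is their harmonic mean. The sketch below does this for two ANN results stated in the text (diabetes class in HPD, and the no healthy class in HN); the function name is hypothetical.

```python
def f1_from(sensitivity, precision):
    """Harmonic mean of sensitivity and precision (same units in and out)."""
    return 2 * sensitivity * precision / (sensitivity + precision)


# ANN, diabetes class (HPD): sensitivity 66.2%, precision 84.9%.
f1_diabetes = round(f1_from(66.2, 84.9), 1)

# ANN, no healthy class (HN): sensitivity 78.1%, precision 78.7%.
f1_no_healthy = round(f1_from(78.1, 78.7), 1)

print(f1_diabetes, f1_no_healthy)
```

Both values agree with the F1 scores reported for the ANN model (74.4% and 78.4%).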
When using the regression models, we verified the capacity of the models in predicting HbA1c, as shown in the scatter plot in Figure 5. ANNr yielded the best result, with an MSE of 0.29. However, the graph shows that all models were able to predict HbA1c. Subsequently, the predicted value of HbA1c was classified as DM status according to [36], which may lead to some classification errors when HbA1c values are close to the transition limits between the different classes. This fuzzy range in the classification of regression values is proportional to the mean absolute error (MAE) of approximately 0.33 for all tested models.
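One way to picture the fuzzy range described above is to flag predicted HbA1c values that lie within one typical error of a class limit. The sketch below is only an illustration: the function name is hypothetical, the cut-offs of 5.7% and 6.5% are assumed from the HbA1c categories given earlier, and using the reported error of about 0.33 as the margin is an assumption rather than a rule from the study.

```python
# Assumed class limits between healthy/prediabetes and prediabetes/diabetes.
THRESHOLDS = (5.7, 6.5)


def near_boundary(predicted_hba1c, margin=0.33):
    """Return True if the prediction is within `margin` of any class limit."""
    return any(abs(predicted_hba1c - t) <= margin for t in THRESHOLDS)


# Hypothetical predicted HbA1c values (in %).
predictions = [5.0, 5.6, 6.3, 7.2]
uncertain = [near_boundary(p) for p in predictions]
print(uncertain)
```

Predictions flagged this way are the ones most likely to change class under a typical prediction error, which matches the difficulty observed for the narrow prediabetes range.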
Comparing the values in Tables 4 and 5, we observe a certain similarity of the classification results after regression with the results of the classification models. This similarity was also observed by examining the confusion matrix (Figures 3 and 6). Among the different models and datasets, a slight variation in the results was observed; in some cases, the classification of regression values was better than that of the classification models. Figures 3 and 6 show that all models misclassified (by more than 30%) prediabetes cases as healthy cases. Thus, according to the classification of the models, patients with prediabetes are more "similar" to healthy individuals than patients with DM. We can further analyze the prediabetic classification characteristics using the HN and ND datasets.
In Table 3, we observe that the ANN model performed better using the HN dataset (where prediabetes and DM are in the same class), with a sensitivity of 78.1% and precision of 78.7%. This arrangement is interesting in the search for unhealthy individuals and can be used in general to screen patients who already have or are on the way to developing the disease. However, regarding the sensitivity and precision using the ND dataset (where healthy and prediabetic individuals are in the same class), we observed variations in the performance of the models. All models have lower sensitivity but higher precision values than with the HN dataset. According to the F1 score, the regression model that best classified patients with diabetes and prediabetes was SVMr, reaching 74.9% and 61.1%, respectively. The model with the best performance in classifying healthy patients was ANNr (77.2%). The greater precision of the models in the ND dataset reinforces the idea that prediabetes patients are more similar to healthy individuals than they are to patients with diabetes.

Figure 7: Sensitivity, precision, and F1 score of the regression models for the HN (healthy versus no healthy) and ND (no diabetes versus diabetes) datasets.
Depending on the objective, a dataset arrangement of HN or ND can be used. For instance, precision is more important than sensitivity when screening for false negatives, which would lead us to use the ND arrangement. Only the results with a negative diagnosis would be analyzed in this case. Even if the system is not very sensitive, it must have high precision. Thus, even if a few cases of false negatives are identified, we will be more confident that these cases are real false negatives. In this sense, we draw attention to the KNN classification model. Using this model in the search for false negatives, even if it only identifies half of the occurrences, we will be 94% sure that these tests are false negatives.
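The trade-off described above can be made concrete with a small calculation. The sketch assumes a hypothetical pool of 200 true cases, a sensitivity of 50% ("half of the occurrences"), and the 94% precision mentioned in the text; the function name and the pool size are illustrative only.

```python
def screening_yield(true_cases, sensitivity, precision):
    """Expected numbers of flagged, correctly flagged, and wrongly flagged cases."""
    true_positives = true_cases * sensitivity      # real cases that are caught
    flagged = true_positives / precision           # all cases the model flags
    false_positives = flagged - true_positives     # flagged cases that are not real
    return flagged, true_positives, false_positives


flagged, tp, fp = screening_yield(true_cases=200, sensitivity=0.5, precision=0.94)
print(f"flagged: {flagged:.1f}, correct: {tp:.1f}, wrong: {fp:.1f}")
```

Under these assumptions about 106 cases are flagged, 100 of them correctly, so few follow-up exams are spent on false alarms even though half of the true cases are missed.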
In the search for the correct classification of diabetes or no healthy patients, we understand that the idea is to have models with both high sensitivity and high precision, which means fewer false positives. We demonstrate that machine learning can detect DM using data from the most frequently performed laboratory examinations. The model achieved better results as the sensitivity increased; however, sensitivity was less important than precision.
The artificial neural network classification model scored 78.1%, 78.7%, and 78.4% for sensitivity, precision, and F1 scores, respectively, when identifying no healthy individuals (i.e., individuals with prediabetes or diabetes). Thus, we believe that this approach exhibits the best overall performance. We observed that all tested models had difficulty classifying the prediabetes group; thus, the dataset configuration improved detection. This model may use existing laboratory examinations of patients to recommend further and specific DM follow-up. Thus, these results could support the screening of DM using machine learning algorithms and available clinical information.

Conclusions
Patients with DM may be asymptomatic and go unnoticed in diagnoses based only on FPG exams. These exams can vary and be susceptible to nonstandard methodologies, patient adherence and preparation prior to the exam, and medications in use.
The possibility for a computer system to automatically find hidden information in laboratory test data is highly advantageous to the diagnostic process in medical laboratories. These systems could perform patient screening to discover early diseases, generate alerts, and recommend complementary exams to counter-proof possible problems with false negatives. These tests could be performed with the patient's blood sample, usually already available.
This work demonstrates that machine learning models can aid in DM screening using data from routinely performed laboratory tests, including blood counts, providing evidence to refer a patient for further testing (e.g., HbA1c). The proposed system can operate in conjunction with traditional methods and not interrupt the normal flow process of exams.
Different dataset arrangements and prediction models can be used depending on the purpose or application of this approach. For example, to screen for individuals with DM, one option would be to use the ANN model with the HN dataset because it presents greater sensitivity and maintains good precision.
If the objective is to find false negatives in an FPG exam, we could use the KNN classification model with the ND dataset. Despite having low sensitivity, this arrangement presented the highest precision, thus reinforcing the certainty in the results found.
The next step in this study is the improvement of methods that help discover false negatives with the FPG exam. Because it is the most performed test in the search for a diagnosis of DM and presents possible variations, this process may inhibit the early treatment of asymptomatic patients.
Early detection of DM is advantageous for the health system and patients as it reuses existing laboratory information. Detecting DM earlier can improve the quality of life and reduce treatment complexity, costs, and late complications.

Data Availability
The data used to support the findings of this study were supplied by Santa Luzia Medical Laboratory under license and cannot be made freely available. Reasonable requests for accessing these data should be made to the corresponding author.

Ethical Approval
The ethics committee of the Federal University of Santa Catarina approved the study under registration CAE 02203918.0.0000.012.