Leukemia can be Effectively Early Predicted in Routine Physical Examination with the Assistance of Machine Learning Models

Objectives The diagnosis of leukemia relies heavily on the results of bone marrow examinations, which are not generally performed in routine physical examinations. In many rural areas, and even in community hospitals and primary care clinics, the lack of hematology specialists and facilities does not allow a definite diagnosis of leukemia. Thus, there would be a significant benefit if machine learning (ML) models could help predict leukemia early using preliminary blood test data from a routine physical examination in community hospitals, saving time before a definite diagnosis.
Methods We collected the routine physical examination data of 1230 newly diagnosed leukemia patients and 1300 healthy people. We trained and tested 3 ML models: linear support vector machine (LSVM), random forest (RF), and XGboost. We examined the agreement between the model results and a statistical analysis of the input data. We also examined the consistency of the model accuracy scores and of the relative importance order of model factors across different input data sets and different model arguments, to check the applicability of both the models and the input data.
Results Generally, the RF and XGboost models give a more similar, consistent, and robust relative importance order of factors, which also agrees with the statistical analysis, while the LSVM gives widely varying and implausible orders for different inputs. Results of the RF and XGboost models show that (1) the models generally achieve accuracy scores above 0.9, indicating effective identification of leukemia, and (2) the top three factors that contribute most to the identification of leukemia are red blood cell (RBC), hematocrit (HCT), and white blood cell (WBC), while the other factors contribute relatively less.
Conclusions This study provides a feasible case example of early identification of leukemia using routine physical examination data with the assistance of ML models, which can be conveniently, cheaply, and widely applied in community hospitals or primary care clinics to save time before a definite diagnosis; however, more studies are still needed to validate the applicability of more ML models to a larger variety of input data sets.


Introduction
Leukemias are a group of life-threatening malignant disorders of the blood and bone marrow [1]. Usually, leukemia is of either the myeloid or the lymphoid lineage and is classified as acute or chronic in nature. Chronic leukemias (CL) tend to have more mature cells and are rare in pediatric patients, whereas acute leukemias (AL) typically have less mature cells, commonly occur in patients of all ages, and are potentially rapidly fatal if not promptly treated [2].
The prognosis of AL is poor, and the death rate of AL is dramatically high. Its complications are usually life-threatening, and its treatment is generally complex [3].
Furthermore, although AL is usually highly responsive to chemotherapy initially, it is rapidly fatal if not treated [4]. Physician-related delays in the diagnosis of leukemia have been shown to contribute to poor outcomes and higher mortality associated with the disease in low-income nations [5]. There is a high medical need to improve the outcome of leukemia patients.
Clinical diagnosis of leukemia is generally based on the cytomorphology, immunophenotyping, cytogenetics, and molecular genetics of bone marrow and blood samples [6], which require specific test equipment and experienced experts. However, in many rural areas, community hospitals, or primary care clinics, qualified specialists and test facilities are usually unavailable, and even in qualified hospitals, such tests are usually more expensive and are ordered specifically for suspected patients already showing symptoms. Thus, leukemia often goes undiagnosed or is diagnosed late, which consequently delays treatment and worsens patient outcomes.
In this case, early screening and timely treatment of leukemia patients can be very important. A good solution for this task is to evaluate the health condition of individuals according to their regular physical examination data, but this is complicated and difficult because it requires the experience of physicians and careful subjective judgement of the complex relationships among various test parameters. In contrast, machine learning (ML) models are designed precisely for this task of determining complex relationships. They can handle even thousands of parameters, and they are able to detect and utilize their interactions [7][8][9][10][11], which is highly attractive to clinicians for disease diagnosis [12].
ML models are a practical and versatile choice for the early screening of diseases. ML has developed significantly and has been successfully applied to a wide range of data-related problems [7]. For example, in some studies, an unsupervised ML approach was used to predict the defluorination of per- and polyfluoroalkyl substances [13], and a variety of ML models were used to make predictions, extract feature importance, detect anomalies, and discover new materials or chemicals [14]. In medicine, ML models are used to help understand and overcome diseases [8][9][10][11]. Traditionally, diagnostic test data of patients are interpreted manually by experienced clinicians according to their expertise, whereas ML models try to learn this expertise automatically, for initial diagnosis [15], prognosis estimation of treatment complications [16], and even relapse monitoring [17]. ML models have been shown to be on par with experts in a variety of tasks in hematologic malignancies [10], including the diagnostic and therapeutic evaluation of leukemia [18,19], such as the image recognition of blood smears for diagnosis and classification of leukemia [20,21], or the automatic detection of acute leukemia using blood images [22]. Currently, most ML applications for the diagnosis of leukemia use images of bone marrow or peripheral blood cells [20][21][22][23][24][25][26][27]. Practical attempts to apply ML to the early screening of leukemia using preliminary health records, such as routine laboratory blood test results, are much fewer [18]. Leukemia screening using routine physical examination data is of significant benefit because no data are required other than those acquired in a regular physical examination, so large-scale general screening of leukemia becomes possible.
In this study, we aimed to use ML models for the early diagnosis of leukemia using only individual routine medical examination results. The advantage is that ML models can be conveniently, cheaply, and widely applied in community hospitals or primary care clinics and can save as much time as possible before disease progression. We hope this work can provide a feasible case example of using ML models for the early screening of leukemia patients.

Data Collection.
The ML models require both training and test data. In this study, we used the routine laboratory test results of blood samples of both leukemia patients and healthy people to train and test the ML models. The routine laboratory records of blood samples were collected from the database of the First Affiliated Hospital of Chongqing Medical University, including those of leukemia patients admitted to the department of hematology from April 2014 to June 2020 and those of healthy people from the physical examination center from January 2020 to June 2020. The data collection was performed with the approval of the hospital's Medical Ethics Committee (Number: 2021-152), according to the principles of the Declaration of Helsinki. Besides the blood test records of the leukemia patients, we collected their personal information and medical histories as well. As it is very common for leukemia patients to receive treatment repeatedly, we kept only the blood records from their first admission to our hospital and discarded those from later admissions. Moreover, we double-checked their medical histories to exclude patients who had already been diagnosed and treated in other hospitals.
After those efforts, we identified a total of 1230 leukemia patients, and accordingly 1230 blood records, with a total of 284 parameters tested at least once. The largest number of tested parameters appearing in a single record is 134, while most records contain around 50 tested parameters. These blood test records of leukemia patients were used as the target group in the ML models.
For the control group of the ML models, we randomly selected 1300 blood records of healthy people from the physical examination center of our hospital. Their medical histories were also checked to ensure they were not leukemia patients, but even so, strictly speaking, we could not exclude people who had leukemia but had not yet been diagnosed. We expect this probability to be very low. The blood test records of healthy people contain fewer tested parameters than those of the leukemia patients. Therefore, we kept only the parameters common to both groups (30 parameters) in the following steps (column 2 of Table 1).

Handling the Missing Values.
In the healthy group, all records contain all 30 tested parameters, but this is not true in the patient group, where about 2/3 of the patients' records are incomplete (Figure 1).
As some ML models can only handle data in the form of a complete matrix, the blanks in the record-parameter matrix had to be dealt with: we could either fill them with estimates or drop the incomplete records. To evaluate the practicability of the ML models, we designed 5 scenarios, A-E, according to the number of records kept in modeling (Figure 1). Scenario E, which keeps the most records, also contains the most uncertainty, while scenario A, with the fewest records, is the most accurate. The blanks in incomplete records were estimated as the mean of the values in all other records containing the corresponding parameter.
During modeling, the total input data set of each scenario contained the data of the leukemia patients of that scenario and the data of the same number of healthy people, randomly selected from the total of 1300 healthy people.
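The scenario construction described above can be sketched as follows. This is only an illustrative sketch, not the code used in the study: the function name `build_scenario`, the `max_missing` threshold, and the toy values are hypothetical, the actual cut-offs of scenarios A-E are defined in Figure 1, and the fill value is assumed here to be the mean over patient records that contain the parameter.

```python
import random
from statistics import mean


def build_scenario(patients, healthy, params, max_missing, seed=0):
    """Keep patient records with at most `max_missing` blanks, mean-fill the
    remaining blanks, and draw an equal number of healthy records."""
    kept = [r for r in patients
            if sum(r.get(p) is None for p in params) <= max_missing]

    # Assumption: the fill value is the mean over all patient records
    # in which the parameter was actually measured.
    fill = {}
    for p in params:
        observed = [r[p] for r in patients if r.get(p) is not None]
        fill[p] = mean(observed) if observed else None

    filled = [{p: (r[p] if r.get(p) is not None else fill[p]) for p in params}
              for r in kept]

    # Balanced design: as many randomly chosen controls as patients.
    rng = random.Random(seed)
    controls = rng.sample(healthy, k=min(len(filled), len(healthy)))

    X = filled + [{p: r[p] for p in params} for r in controls]
    y = [1] * len(filled) + [0] * len(controls)
    return X, y


# Toy values for illustration only (not study data); None marks a blank.
params = ["RBC", "HCT", "WBC"]
patients = [
    {"RBC": 2.5, "HCT": 25.0, "WBC": 40.0},
    {"RBC": 3.0, "HCT": None, "WBC": 55.0},
    {"RBC": None, "HCT": None, "WBC": 12.0},
]
healthy = [
    {"RBC": 4.8, "HCT": 44.0, "WBC": 6.0},
    {"RBC": 4.5, "HCT": 41.0, "WBC": 7.0},
    {"RBC": 5.0, "HCT": 46.0, "WBC": 5.5},
]

X_a, y_a = build_scenario(patients, healthy, params, max_missing=0)  # complete records only
X_e, y_e = build_scenario(patients, healthy, params, max_missing=2)  # keep all, fill blanks
```

A strict threshold corresponds to the spirit of scenario A (fewer but fully measured records), while a loose threshold corresponds to scenario E (more records, more estimated values).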

Statistical Analysis.
From the 1230 records of patients and the 1300 records of healthy people, the means and standard deviations (SD) of the 30 tested parameters were calculated (Table 1), using the mean(), sd(), and t.test() functions in R version 4.0.3 for Mac. Each parameter was compared between the two groups, and the p values indicate the significance of the differences.
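For readers working in Python, the group comparison can be approximated as below. This is a sketch under an assumption: R's t.test() defaults to Welch's unequal-variance test, and the text does not state whether that default was kept. The function name `welch_t` and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    # Squared standard errors of the two group means.
    va = stdev(a) ** 2 / len(a)
    vb = stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy values for illustration only (not study data).
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_a, group_b)
```

The p value follows from the t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the statistic and the p value directly.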

Machine Learning Model Selection and Construction.
ML models have significant benefits for the preliminary screening of diseases. However, it is generally hard to say which model is the best, because model applicability depends on the specific data set. In practice, various models are usually tested, and their results are examined to determine which model performs best.
In this study, we chose 3 ML models to be tested and examined: linear support vector machine (LSVM), random forest (RF), and XGboost. We chose them mainly because they are popular and have shown good performance in various applications, and also because they can report the relative importance of the model input factors.
We used the popular scikit-learn (sklearn) package (version 0.24.2 in Python 3.9 for Mac) [28] for the LSVM and RF models and the official XGboost code (version 1.5 for Mac in Python) for the XGboost model [29], which was called in the sklearn style via the XGBClassifier() API of the xgboost Python package.
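A minimal sketch of how the three models might be set up and scored is given below. It runs on synthetic data from `make_classification`, not on the study data. `n_estimators=200` and `learning_rate=0.05` are the settings reported in the Discussions; the values of `C` and `reg_lambda` (the sklearn-style name of XGboost's lambda) are placeholders, since the tuned values are not stated. Using `LinearSVC` for the LSVM is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for a 30-parameter blood test matrix (NOT study data).
X, y = make_classification(n_samples=600, n_features=30, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)

models = {
    "LSVM": LinearSVC(C=1.0, max_iter=10000),  # C is a placeholder value
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}
try:
    from xgboost import XGBClassifier
    # reg_lambda is a placeholder value for the lambda argument.
    models["XGboost"] = XGBClassifier(learning_rate=0.05, reg_lambda=1.0)
except ImportError:
    pass  # xgboost is not installed; this sketch then covers two models only

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # LinearSVC has no predict_proba, so fall back to decision_function.
    if hasattr(model, "predict_proba"):
        ranking = model.predict_proba(X_test)[:, 1]
    else:
        ranking = model.decision_function(X_test)
    results[name] = {
        "S_train": model.score(X_train, y_train),  # accuracy on train subset
        "S_test": model.score(X_test, y_test),     # accuracy on test subset
        "S_auc": roc_auc_score(y_test, ranking),   # area under the ROC curve
    }
```

Changing `train_size` to 0.25 or 0.75 reproduces the other split ratios considered in the study.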

Model Results Examination.
As the applicability of each ML model depends on the specific data set, the model results have to be examined to check their applicability. In practice, a reliable model should first show consistency and robustness with regard to its input data, and second, its result should agree with the results of other analysis methods, such as statistical analysis.
Since the model input data of this study depend mainly on the scenario selection (see Section 2.2) and on the split ratio used to split the total data into train and test subsets, we prepared various input data sets for different scenarios and different split ratios (R_train/total = 0.25, 0.5, 0.75). Their results, including the accuracy returned by the score() function of sklearn [28], the area under the curve score (S_auc), and the relative importance order of model factors (the order of the contribution weight of each factor to the model), were all examined to check their consistency across the input data sets.

Statistical Results.
As shown in Table 1, target records have significantly lower mean hematocrit (HCT), hemoglobin (HGB), red blood cell (RBC) count, etc., but significantly higher mean white blood cell (WBC) count, percentage of neutrophils (NEUT%), etc. (p < 0.001 for all unadjusted comparisons). According to the p values, most of the parameters show a significant difference between the two groups, and generally, HCT, HGB, and RBC show the most significant differences among those parameters.
Although the p values of many parameters show significant statistical differences between patients and healthy people, for a given individual it is hard to decide whether the person has leukemia from only one or two parameters, because many parameter values of the patients still lie within their reference ranges. Thus, for diagnosis, more parameters need to be taken into consideration, which requires determining the more complicated interrelationships among them. This is exactly the kind of task at which ML models are adept.

Model Consistency Examination.
During modeling, the argument lambda of the XGboost model and the argument C of the LSVM model were adjusted to suppress overfitting, and the model performance indicators, including the accuracy scores on both the train and test subsets (S_train, S_test) and S_auc, were collected in Tables 2-4. The accuracy scores and S_auc suggest that all models achieved good results, because the accuracy scores are mostly above 0.9 even when R_train/total is 0.25. Among scenarios, the accuracy scores are higher under scenario A, indicating that filling the missing values introduces more uncertainty than discarding incomplete records.
To check the consistency of the model results, although the accuracy scores in Tables 2-4 look very consistent, we still need to look into the relative importance order of the model factors, a typical result of which is shown in Figure 2. Results show that the importance order of the RF and XGboost models is generally in agreement and insensitive to the input data and to the overfitting suppression factor lambda. However, the importance order of the LSVM model is very different and is much more sensitive to the input data and to the overfitting suppression factor C.
Moreover, the top 3 important model factors of the RF and XGboost models are RBC count, HCT, and WBC count, while the top 3 of the LSVM model are absolute lymphocyte count (LYM#), percentage of monocytes (MONO%), and NEUT% (Figure 2). Clearly, the results of the RF and XGboost models agree better with the statistical analysis (see Section 3.1).
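One way to quantify the consistency of the importance order across split ratios is sketched below for the RF model. This is an illustrative sketch on synthetic data, not the procedure used in the study: the helper names `importance_order` and `top_k_overlap` and the top-3 overlap measure are assumptions introduced here.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def importance_order(X, y, train_ratio, seed=0):
    """Fit an RF on a train subset and return feature indices sorted from
    most to least important."""
    X_tr, _, y_tr, _ = train_test_split(
        X, y, train_size=train_ratio, stratify=y, random_state=seed)
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(X_tr, y_tr)
    return np.argsort(rf.feature_importances_)[::-1]


def top_k_overlap(order_a, order_b, k=3):
    """Fraction of the top-k factors shared by two importance orders."""
    return len(set(order_a[:k].tolist()) & set(order_b[:k].tolist())) / k


# Synthetic data (NOT study data): with shuffle=False the three informative
# features are columns 0, 1, and 2; the other 27 columns are noise.
X, y = make_classification(n_samples=600, n_features=30, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, shuffle=False, random_state=0)

ratios = (0.25, 0.5, 0.75)
orders = {r: importance_order(X, y, r) for r in ratios}
overlaps = {(a, b): top_k_overlap(orders[a], orders[b])
            for a, b in combinations(ratios, 2)}
```

An overlap near 1 for every pair of split ratios indicates a stable top-3 set; the same comparison could be repeated across scenarios or across values of the overfitting suppression factor.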

Top Model Factors Contributing to the Classification of Leukemia.
By accepting the results of the RF and XGboost models, the top 3 model factors that contribute most to the identification of leukemia patients are found to be RBC count, HCT, and WBC count. The other factors contribute relatively less to the models.

Discussions
The clinical diagnosis of leukemia is primarily based on laboratory blood and bone marrow tests, but even the most skilled hematologist may overlook patterns, deviations, and relations between the increasing numbers of blood and bone marrow parameters that modern laboratories measure. In contrast, ML algorithms can easily handle hundreds of attributes (parameters), and they are capable of detecting and utilizing the interactions among these numerous attributes, which makes this field of medicine particularly interesting for ML applications [12].
ML has already been shown to be a versatile, precise, and robust tool in the diagnostic evaluation of leukemia [18]. Rehman et al. [30] proposed a robust [...]. [...] [31,32] presented an automated detection system for the diagnosis of acute leukemia. The method uses basic enhancement, morphology, filtering, and segmentation techniques to extract the region of interest using a k-means clustering algorithm. The proposed algorithm achieved an accuracy of 92.8% and was tested with nearest neighbor and Naïve Bayes classifiers on a data set of 60 samples. Dese et al. [20] used 250 clinical images of blood smears acquired from Jimma University Specialized Hospital and a standard online database to develop an image query system for diagnosing leukemia and its type, with an accuracy of 97.69%. Loey et al. [32,33] proposed an AML classification system that enhanced image contrast and extracted five features. An SVM classifier performed the classification. Experiments on a data set of 50 images produced 93.5% classification accuracy. As most ML applications for leukemia diagnosis deal with microscopic images and flow cytometry of bone marrow or peripheral blood cells, there is a lack of early prediction ML models for leukemia based on routine laboratory results.

Table 2: Accuracy scores of model performance for different scenarios, train-set/test-set ratios, and regularizations, for the random forest model. (S_train and S_test are the accuracy scores of the model on the train and test data subsets, respectively, according to the score() function of sklearn; S_auc is the area under the curve score of the model according to the roc_auc_score() function of sklearn; R_train/total is the ratio of train data to total data.)
In this study, the data required by the ML models can be acquired from the basic routine physical examination available in rural areas, community hospitals, or primary care clinics, which could help the early recognition of leukemia.
As the applicability of a given ML model depends upon the specific input data set, three models, LSVM, RF, and XGboost, were selected in this study, and their results were examined to check their applicability. We chose them because they are popular and, more importantly, because the sklearn toolkit we employed can report the relative importance (or contribution weight) of each model factor, from which we could both examine the model in detail and find the top factors that play key roles in the recognition of leukemia. Another consideration is that these three models require relatively few input arguments during model construction, since more arguments usually make a model more sensitive to them.
Specifically, for the LSVM model, only the overfitting suppression factor C (adjusted during modeling) is specified; for the RF model, only the number of trees (we set n_estimators = 200) is set; and for the XGboost model, only the learning rate (we set learning_rate = 0.05) and the overfitting suppression factor lambda (adjusted during modeling) are required. Results show that the RF and XGboost models achieved very good consistency and robustness, because their results are consistent and agree with the statistical analysis. As for the poor result of the LSVM model, we suspect it is related to the limitation of its linear kernel in this case, but we did not investigate further.
To deal with the missing values of the incomplete records, we compared discarding the incomplete records with filling the missing values with the average of the existing values. Results show that filling the missing values with estimates tends to introduce more uncertainty than directly discarding the incomplete records. This is interesting because, in other literature, many authors follow the filling method without any discussion or examination. We believe that the difference between filling and discarding is case-dependent and that more attention should be paid to the handling of missing values.
The results of the RF and XGboost models also show that, in this study, the accuracy scores are generally above 0.9 on both the train and test subsets even when the train data are a quarter of the total input. This might be partly due to the capability of the RF and XGboost models and partly due to the accuracy and specificity of the input data, because the data we collected are from either very healthy people or relatively severe patients. Therefore, regarding the methodology of this study, further work is still needed to check the applicability of more ML models, including SVM and others, to a larger variety of data sets. The top 3 model factors that contribute most to the recognition of leukemia are RBC count, HCT, and WBC count. The other factors contribute relatively less to the models.
The result for the WBC count is reasonable: leukemia is a blood cancer that usually begins in the bone marrow and leads to the overproduction of abnormal WBC [34], and the inspection of blood cells under a microscope allows for the evaluation and diagnosis of diseases like leukemia [35]. WBC, as one of the main cell types in peripheral blood, play an important role in the immune system and are a main defense of the body against infections and diseases [27]. Normally, WBC grow in accordance with the body's need, but in the case of leukemia, they are generated abnormally and inefficiently [27]. As early as the early 1800s, an excess WBC count had been observed in the presence of leukemia [36]. However, leukocytosis is neither sufficient nor necessary for the diagnosis of leukemia, because on the one hand, leukocytosis is very common in infections, and on the other hand, leukemia patients sometimes have normal or even lower total WBC counts [3]; thus, leukemia cannot be judged by the WBC count alone.
The results for RBC count and HCT also make sense. RBC, which transport oxygen and carbon dioxide [27,37], may also modulate the activity of immune cells within their microenvironment [38,39], and the RBC count is known to be highly correlated with HCT [40]. Because leukemia involves the excessive proliferation of abnormal cells in the bone marrow, which then inhibits normal hematopoietic cells, it can be inferred that RBC and HCT might be normal in the early stage of leukemia and then decrease with the progression of RBC breakdown. Therefore, WBC count, RBC count, and HCT might be potential markers associated with the development of leukemia and are probably also associated with other parameters such as thrombocytocrit (PCT). However, this does not mean that only these three factors indicate leukemia and that the other factors are negligible; the other factors also contribute and should be taken into consideration in ML modeling.
The ML models achieved good results. It should be emphasized that the results of ML models can only serve as an auxiliary reference and cannot replace a definite diagnosis by physicians. The real advantage of ML models is that they can be conveniently and widely applied in routine physical examinations in community hospitals or primary care clinics without much extra expense and can save much of the time before disease progression: if the routine physical examination result of a person were classified by ML models as potential leukemia in time, the person would be advised to visit specialized hematology physicians as soon as possible for a specialized examination.

Conclusions and Limitations
In this study, we conducted a retrospective case study of using ML models to help the early diagnosis of leukemia using only preliminary blood test data from routine physical examinations at community hospitals or primary care clinics. We collected preliminary blood test data of both newly diagnosed leukemia patients and healthy people to construct the train and test data sets for the ML models. We selected three models, LSVM, RF, and XGboost, according to their popularity, convenience of application, and ability to report the relative importance or contribution weight of each factor to the model. We examined the sensitivity of the model results, including the accuracy score, the area under the curve score, and the importance order, to the model input data and model arguments, including the scenario selection, the split ratio, and the overfitting suppression coefficient.
Results show that although the LSVM showed very poor applicability to the input data of this study, the RF and XGboost models showed good consistency and robustness with regard to the input data and model arguments, and their results also agree with the statistical analysis of the collected data. Generally, the RF and XGboost models achieved an overall accuracy score above 0.9 for all the input data used in this study. The top three model factors that contribute most to the recognition of leukemia are RBC count, HCT, and WBC count, and the other factors contribute relatively less. This study is a feasible case example showing that leukemia can be predicted early using preliminary blood test data from routine physical examinations with the assistance of ML models. The advantage is that ML models can be conveniently, cheaply, and widely applied in community hospitals or primary care clinics and can save as much time as possible before disease progression. Nevertheless, the results of ML models cannot replace the definite diagnosis of hematology physicians, which is still required.
Journal of Healthcare Engineering
Technically, this study still has a few limitations that affect the confidence in our models: (1) the details of the applicability of the ML models to our input data set are still not fully understood; (2) all records were retrospectively collected from the First Affiliated Hospital of Chongqing Medical University, which may cause selection bias; (3) there are potential uncertainties, including uncertainties in laboratory measurements and the possibility of undetected leukemia patients in the healthy group; (4) only 30 parameters were kept in the modeling procedure, while some of the dropped parameters might also be potential indicators of leukemia. Therefore, although this study presents a good case for the early prediction of leukemia using preliminary blood test data from routine physical examinations with the assistance of ML models, further investigation and prospective studies are still needed to validate the applicability of more ML models to a larger variety of input data sets.