Optimization of Tree-Based Machine Learning Models to Predict the Length of Hospital Stay Using Genetic Algorithm

The length of hospital stay (LOS) is a significant indicator of the quality of patient care, hospital efficiency, and operational resilience. Considering the importance of LOS in hospital resource management, this research aims to improve the accuracy of LOS prediction using hyperparameter optimization (HPO). Expert physicians were consulted and related studies were reviewed to determine the variables affecting LOS. The electronic medical records of 200 patients in the department of internal medicine of a hospital in Iran were randomly sampled. Because the performance of machine learning (ML) models can vary with the characteristics of the features, several models were applied and evaluated in this study: k-nearest neighbors (KNN), multivariate regression, decision tree (DT), random forest (RF), artificial neural network (ANN), and XGBoost. The genetic algorithm (GA) was applied to optimize the tree-based models. In addition, dummy coding, sometimes called one-hot encoding, was used to encode categorical features to increase prediction accuracy. Compared with the other algorithms, the XGBoost model optimized by GA (XGB_GA) achieved higher accuracy and better prediction performance. The mean and median of absolute errors in the test dataset for this model were 1.54 and 1.14 days, respectively. In other words, the XGB_GA model reduced the mean absolute error by 37%, which is beneficial for the reliable design of a clinical decision support system.


Introduction
Since the 1970s, the length of hospital stay (LOS) has been studied to achieve better quality and performance in hospitals. Hospitals try to achieve better outcomes with the least possible resources. Developed countries evaluate the LOS as a key performance indicator to reduce healthcare costs without compromising patients' outcomes [1]. The growth in the number of patients admitted and the increase in inpatient unit costs have resulted in issues in hospital bed management. The length of stay and the lack of knowledge about the discharge time are among the complications that affect hospital bed management [2]. While the length of stay is affected by various factors that may make it difficult to predict, knowing its accurate value can significantly help manage beds and staff schedules [3]. LOS is one of the indicators of hospital quality, productivity, and performance. As a result, in dealing with issues such as planning resources, managing capacity, and staff level, LOS prediction could be an effective solution. It increases the number of patients receiving services, increases their safety, reduces healthcare costs, and helps optimize resource consumption [4]. Incorrect prediction of LOS can lead to wasted and blocked bed days. It can also lead to disruption in the provision of medical services and dissatisfaction among patients and health workers.
On the other hand, accurate prediction leads to better allocation of resources and better organization of services from the time of admission to the discharge of the patient [5,6]. LOS prediction, traditionally performed by experts, is unreliable because the patient's background information is not considered [7]. Healthcare professionals assign different LOS to a patient; therefore, the assigned LOS depends on the predictor, not the patient. Hence, the automatic prediction of LOS is valuable and significant [8]. Apart from the future planning for the use of beds, LOS estimation is also helpful for the scheduling of specialists and human resources, determining the appropriate insurance plan and reimbursement system in the private sector, and preparing the patient's relatives to plan for the return of their patients [9].
A recent review revealed that worldwide scientific attempts for accurate LOS prediction over the past fifty years have led to a rise in the number of related publications, suggesting the importance of this topic. Many publications provide a model for LOS and focus on the best statistical technique to provide the most accurate results. The latest studies, however, move toward more sophisticated methods, such as machine learning (ML), rather than regression [8]. The focus of the literature has been chiefly on proposing innovative prediction methods. Alahmar et al. proposed a stacked-ensemble method combining the results of different models to improve performance [10]. In a similar study, Muhlestein et al. came up with another ensemble approach for ranking the results of different models to select the most accurate one [11]. Danilov et al. have applied a deep learning algorithm, the RNN-GRU, for text-mining operative reports [12]. Although the regression technique is still used for prediction purposes [13][14][15], many studies applied various ML techniques to compare their results and find the most accurate one for their prediction purpose [4,16,17].
Related works can also be classified in terms of the type of studied sample. Various types of hospital units, such as intensive care units (ICUs) [14,18] and newborn units [4,19], as well as different medical diagnoses, were studied, including COVID-19 patients [20] and lung cancer [21]. Although the methods used in the literature have yielded good results, the hyperparameter tuning problem still exists in the modeling procedure [4,16,18,20,21]. There are two types of parameters in ML models: those learned during the training process, and hyperparameters, such as the maximum depth of a decision tree, which are set before training. Finding the optimal setting of hyperparameters is called hyperparameter optimization (HPO) [22]. Some algorithms have many hyperparameters, making the HPO more complicated. Studies have shown the impact of HPO on performance improvement. By HPO, we mean searching for the hyperparameter values that give the model its best performance. The least complicated method is manual tuning. Manual search is a demanding task in terms of time and effort, and since there are many possible settings, this approach does not scale. Some well-known alternative methods are grid search, random search, Bayesian optimization, and heuristic approaches such as GA [23].
Another issue worth noting in the literature is that the XGBoost algorithm has been used only occasionally. When applied to structured data, XGBoost is popular and capable of solving large-scale ML problems, and it outperforms many other complex ML algorithms. Its high accuracy and its many tunable hyperparameters are good reasons for choosing XGBoost over other alternatives [23]. It is suitable for large-scale datasets due to its parallel integration mechanism, and it includes regularization. It is also highly accurate and interpretable [24]. A well-tuned XGBoost can provide better prediction results than a poorly configured XGBoost. Therefore, it is beneficial to improve XGBoost in a time-efficient way rather than tuning it manually [23]. The XGBoost model, which has many applications in the field of data science and has achieved many successes in other areas, has rarely been used in LOS prediction. Chen has proposed a "nonlinear weighted XGBoost" model to predict LOS as a classification problem, with grid search for HPO. The model presented by Chen has the highest accuracy compared to other models, such as the support vector machine (SVM) [24]. Budholiya et al. utilized an optimized XGBoost classifier to predict heart disease. To optimize the hyperparameters of XGBoost, they used Bayesian optimization and achieved a prediction accuracy of 91.8% [25]. Other examples of XGBoost applications in other areas include early detection of sepsis in ICU [26], early diagnosis of heart disease [27], diagnosis of chronic kidney disease [28], prediction of the groundwater level [29], and breast cancer prediction [30].
The XGBoost model has 25 hyperparameters, each with its own function, which makes the optimization process an extremely complicated problem [31]. Proper hyperparameter tuning is essential for the successful application of any predictor [25]. The HPO process is computationally challenging. It involves multiple training cycles of the ML model, and the dimension of the problem grows with the number of hyperparameters. Bayesian optimization is explicitly designed to reduce the number of training cycles that the grid search method requires; however, it cannot deal with high-dimensional searches when many hyperparameters are involved. Larger datasets also add to the training time and the complexity of the problem. Many tuning approaches require a user-defined search space for the hyperparameters, which is often not feasible in practice because the user lacks the necessary knowledge. As such, a primary barrier to the broader use of HPO techniques is setting the search bounds of hyperparameters [22].
Although other studies have achieved good results in LOS prediction by using a wide range of ML methods, only some have explored XGBoost or HPO. The primary issues in hyperparameter tuning of ML models are time efficiency and the search space. Bayesian optimization improves the time efficiency of the grid search method, yet, like grid search, it works with a limited search space. On the other hand, GA has overcome both issues, i.e., it searches over a broader range of spaces in a more time-efficient manner. Due to its high robustness, the GA helps the XGBoost model to become more stable and fit better [31]. It is also a more efficient solution to the problem of defining the search space and to the computational cost of HPO. Using the GA helps to rapidly evaluate a broader range of solutions in order to find the best options. This issue is crucial for designing a clinical decision support system and big data analysis [31]. However, this technique is yet to be used to improve the XGBoost model in the field of LOS prediction. This study proposes integrating the GA and XGBoost (XGB_GA) to predict LOS with higher accuracy. The proposed algorithm treats HPO as an optimization problem; i.e., the algorithm looks for the values of the hyperparameters that minimize the mean squared error (MSE) of XGBoost. As a result, the accuracy of the prediction improves while the computational cost is reduced.
It is clear from the reviewed studies that regularization and underfitting/overfitting problems need to be addressed [25]. A limited number of previous techniques provided considerable improvement in the results; however, some techniques remain unexplored, particularly in LOS prediction: (i) previous approaches have rarely explored tree-based ML algorithms such as XGBoost, which have several parameters for handling underfitting/overfitting and regularization; (ii) the general approaches have not used categorical feature encoding methods to encode categorical features in the LOS dataset; (iii) the previous methods have not used GA as an HPO technique for optimizing ML models for better prediction of LOS; (iv) the previous studies have rarely researched a hospital's department of internal medicine to develop its own LOS predictive model; and (v) limited research has predicted the absolute value of LOS, and those studies have used only one model. Hence, this research uses k-nearest neighbors (KNN), multivariate regression, tree-based ML algorithms, artificial neural networks (ANNs), and the genetic algorithm (GA) to design an accurate model to predict the absolute value of LOS. The significant contributions of the study include the following:
(i) Exploring the application of tree-based ML algorithms, including XGBoost, in the LOS prediction
(ii) Using the one-hot encoding method to encode categorical features in the LOS dataset
(iii) Applying GA for hyperparameter optimization of XGBoost, decision tree, and random forest to increase the accuracy of prediction
(iv) Investigating the hospital's department of internal medicine to develop its own LOS predictive model
(v) Predicting the absolute value of LOS using several data mining algorithms
The organization of this study is as follows. In Section 2, LOS prediction literature and standard HPO techniques are discussed. In Section 3, data collection, data preprocessing, model training, and HPO are presented. Section 4 describes the results, and the conclusion is presented in Section 5.

Related Works
In this section, LOS prediction in the literature is discussed. First, studies are evaluated from different points of view, including the studied sample, prediction method, results, and approaches used for hyperparameter tuning. The second part discusses standard HPO methods and their advantages and disadvantages.

LOS Prediction.
The three primary categories of LOS prediction methods are (1) regression models, (2) ML, and (3) deep learning, which is a subcategory of ML [8]. For example, Baek et al. fitted a multivariate regression model on all hospital inpatient information, and the R² value for their model was 0.267 [13]. Like Baek et al., Ray-Zack et al. predicted the LOS of radical cystectomy for muscle-invasive bladder cancer patients with a multivariate regression model. The R² value reported for the regression model was 0.048 [15]. Meadows et al. built a logistic regression model to predict short-term (less than 48 hours) and long-term (more than 48 hours) hospitalization of ICU patients following cardiac surgery with an accuracy of 79% [14].
Alahmar et al. applied the stacked-ensemble method to predict the LOS of diabetic patients [10]. Their proposed method showed the best performance (accuracy 0.81) compared to nonensemble models, including regression-based, tree-based, and ANN models. However, the results showed that the improvement achieved by the ensemble method compared to the random forest model (accuracy 0.80) and the gradient boosting method (accuracy 0.80) was insignificant. They tuned the selected hyperparameters manually.
Thompson et al. explored a newborn unit dataset to predict LOS using methods such as Naïve Bayes, logistic classifier, multilayer perceptron, SVM, decision tree (J48), and random forest [17]. They used 10-fold cross-validation and achieved the highest accuracy of 0.87 using random forest but did not mention the hyperparameter tuning process. Daghistani et al. applied random forest, SVM, Bayesian network, and ANN to predict the LOS of cardiac patients and reported the highest accuracy of 80% from random forest [4]. Hyperparameter tuning was likewise not addressed in this study.
Using an innovative solution, Danilov et al. applied a deep learning algorithm, the RNN-GRU, for text-mining operative reports of neurosurgery patients to predict their LOS as a continuous variable. The mean absolute error (MAE) resulting from the proposed method was 2.8 days [12]. Muhlestein et al. used brain surgery data and developed a new approach that systematically ranks different ML models [11]. The new technique selects the best models automatically and achieves the optimal answer by combining the results. Notably, the RMSLE increased on the test dataset (0.63) compared to the training dataset (0.55), even though model hyperparameters were optimized using the grid search method.
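For reference, the error measures quoted in these studies are commonly defined as below. The notation is an assumption introduced here rather than taken from the cited papers: y_i is the observed LOS of patient i, ŷ_i is the predicted LOS, and n is the number of patients. Individual studies may use slightly different variants, for example a different offset inside the logarithm of the RMSLE.

```latex
\begin{aligned}
\mathrm{MAE}   &= \frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert \\
\mathrm{RMSE}  &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}} \\
\mathrm{RMSLE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\ln(1+\hat{y}_i)-\ln(1+y_i)\bigr)^{2}}
\end{aligned}
```

MAE and RMSE are expressed in days, whereas RMSLE is a relative measure on the logarithmic scale, so values of the two kinds are not directly comparable.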
Steele and Thompson have addressed LOS prediction for better planning of the hospitalization of elective patients. They constructed the prediction model using Naïve Bayes, Bayesian network, KNN, kstar, locally weighted learning, C4.5 decision tree, SVM, and decision table. The Bayesian network had the best accuracy (0.9) among the models, and their research did not discuss hyperparameter tuning [16]. In another study, Abd-Elrazek et al. used ML models such as fuzzy logic, KNN, Naïve Bayes, random forest, SVM, and ANN to predict the LOS of ICU patients. Fuzzy logic had the best prediction results, followed by random forest, with accuracy values of 0.92 and 0.9, respectively. Parameter tuning was not mentioned in the modeling process [18]. Mahboub et al. used the decision tree model to predict the LOS of COVID-19 patients. The MAE reported for this model was 2.8 days, and no other method was applied for comparison [20].
Journal of Healthcare Engineering
In a study, Chen investigated the performance of the nonlinear weighted XGBoost model compared to other ML models in predicting LOS. To optimize the hyperparameters of the XGBoost, the K-fold cross-validation (K-CV) method was used with a value of K = 3. In all the models considered in this work, only four values were investigated for each hyperparameter. The results showed that the nonlinear weighted XGBoost model was the most accurate among all models, and its RMSE value was 1.52 days [24]. Similarly, Alsinglawi et al. developed logistic regression, random forest, and XGBoost models to predict the LOS of lung cancer patients hospitalized in the ICU. The random forest showed the best performance among the models. Hyperparameter tuning and model evaluation on a separate test set were not performed in this study; consequently, the reported results were based on the training dataset [21]. Table 1 shows a summary of the studied literature chronologically. The table gives a better view of different aspects of LOS prediction in similar studies. In terms of the studied sample, it can be observed that previous studies rarely researched a hospital's department of internal medicine to develop LOS predictive models. Limited research predicted the absolute value of LOS, and the studies that did so used only one method. Previous approaches have hardly explored tree-based ML algorithms such as XGBoost, which have several parameters for handling underfitting/overfitting and regularization. In addition, there is no report on using GA as an HPO technique for optimizing ML models for faster prediction of LOS.

Hyperparameter Optimization Methods.
Grid search is a prevalent method in which the user manually defines a subset of hyperparameters for a target ML algorithm, and the method searches through that subset. Despite straightforward implementation and parallelism capabilities that make grid search a reliable method in low-dimensional spaces (i.e., 1D or 2D), the computational cost increases dramatically as the number of hyperparameters increases [23].
In random search, a generative process defines the configuration space, random samples are drawn from it, and each sample is used as a hyperparameter setting and evaluated. Random search and grid search have common advantages; however, random search is more efficient in high-dimensional spaces, and random search generally performs better than grid search [23].
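The difference between the two strategies can be sketched on a toy problem. In the example below, the `objective` function is a hypothetical stand-in for a validation error (it is not a model from this study), and the candidate values are chosen only for illustration. Grid search evaluates every combination of the listed values, so its cost is the product of the list lengths, while random search spends the same budget on points drawn from continuous ranges.

```python
import itertools
import random


def objective(learning_rate, max_depth):
    """Hypothetical validation error with its minimum at (0.1, 4)."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 4) ** 2


# Grid search: evaluate every combination of the user-defined values.
grid = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "max_depth": [1, 3, 5, 7]}
grid_points = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]
best_grid = min(grid_points, key=lambda p: objective(**p))

# Random search: spend the same budget on randomly drawn configurations.
rng = random.Random(0)
random_points = [
    {"learning_rate": rng.uniform(0.001, 1.0), "max_depth": rng.randint(1, 7)}
    for _ in range(len(grid_points))
]
best_random = min(random_points, key=lambda p: objective(**p))

print("grid evaluations:", len(grid_points))
print("best grid point:", best_grid)
print("best random point:", best_random)
```

With two hyperparameters and four values each, the grid needs 16 evaluations; adding a third hyperparameter with four values would raise this to 64, which illustrates why the cost grows quickly with dimensionality.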
For objective functions that are slow and costly to evaluate, Bayesian optimization is a powerful strategy that tries to predict the performance of untested combinations [23,33]. Compared to grid search, Bayesian optimization is more dynamic and requires two key components to function: the probabilistic surrogate model and the acquisition function. The surrogate model is fitted to all the observations of the target function made so far. Then, the acquisition function proposes the next hyperparameters to evaluate, steering the search toward the optimum.
The GA is a population-based metaheuristic optimization algorithm inspired by the theory of natural selection. In this algorithm, a new population is generated by repeatedly applying genetic operators to the individuals in the population. The critical elements of this algorithm are chromosomes, selection, crossover, mutation, and the fitness function. The algorithm generally works as follows. Initially, a population of Y chromosomes (Y is the number of answers or solutions), each encoding the n parameters of the problem, is randomly generated. Two chromosomes (two answers or two solutions), namely, C1 and C2, are selected from the population based on their fitness. C1 and C2 produce a new offspring O through the crossover operator, which is applied with probability CP, the crossover probability parameter. The mutation operator is then applied to O with probability MP to generate a new member O'. Member O' is added to the population being formed. The selection, crossover, and mutation process continues until an entire new population is generated. The randomness introduced by crossover and mutation is what allows the GA to search dynamically for the optimal solution [34].
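The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the study: binary tournament selection, uniform crossover, random-reset mutation, and keeping the best individual (elitism) are assumptions made here, and `run_ga`, `toy_fitness`, and the bounds are hypothetical names and values.

```python
import random


def run_ga(fitness, bounds, pop_size=20, generations=50, cp=0.6, mp=0.01, seed=0):
    """Maximize `fitness` over real-valued chromosomes within `bounds`."""
    rng = random.Random(seed)

    def random_chromosome():
        return [rng.uniform(low, high) for low, high in bounds]

    def select(population):
        # Binary tournament (assumption): the fitter of two random members wins.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    population = [random_chromosome() for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        new_population = [best]  # elitism (assumption): carry the best over
        while len(new_population) < pop_size:
            c1, c2 = select(population), select(population)
            if rng.random() < cp:
                # Uniform crossover: each gene comes from either parent.
                child = [g1 if rng.random() < 0.5 else g2 for g1, g2 in zip(c1, c2)]
            else:
                child = list(c1)
            # Mutation: each gene is re-drawn within its bounds with probability mp.
            child = [
                rng.uniform(low, high) if rng.random() < mp else gene
                for gene, (low, high) in zip(child, bounds)
            ]
            new_population.append(child)
        population = new_population
        best = max(population, key=fitness)
    return best


def toy_fitness(chromosome):
    """Hypothetical fitness with its maximum (zero) at x = 2, y = -1."""
    x, y = chromosome
    return -((x - 2) ** 2 + (y + 1) ** 2)


toy_bounds = [(-5.0, 5.0), (-5.0, 5.0)]
best = run_ga(toy_fitness, toy_bounds)
print(best, toy_fitness(best))
```

Because the best individual is always carried into the next generation in this sketch, the best fitness can never get worse from one generation to the next.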

Data Source.
The studied hospital in this research has 300 beds and 1055 physicians and staff. The hospital provides clinical and paraclinical services and has 19 inpatient departments. It has a health information system to collect and store patients' data. The information studied in this research was extracted from the department of internal medicine.
In order to determine the variables that may affect LOS and collect the necessary data, similar studies were reviewed. The electronic records of 200 patients (100 men and 100 women) were randomly extracted from the information system. Table 2 shows the variables used in this study, including age, sex, type of insurance, marital status, medical advice number, and physician's expertise level.

Data Preprocessing.
The data were checked, and there were no missing values. The mean age of the patients was 63 years, with a standard deviation of 19 years. 50% of the data were related to women, and 50% were related to men. 90% of patients were married, and the others were single. The mean medical advice number was two, with a standard deviation of 3. The LOS had a mean value of 5.6 days and a standard deviation of 3.4 days. The primary insurance type and physician's expertise level variables had relatively unbalanced distributions. 90% of the patients had ordinary social security insurance, and the rest were in other insurance groups. 45% of the patients were treated by general practitioners, 54% by specialists, and the remaining 2% by subspecialists. Table 3 shows the statistical characteristics of each variable. Dummy coding, sometimes called "one-hot encoding," was used to turn the categorical variables into numerical variables. In order to remove outliers, records whose LOS was more than 1.5 times the IQR (interquartile range) below the first quartile or above the third quartile were removed from the data. The lower limit for outliers was calculated as −0.5, and the upper limit was 11.5. Therefore, patients with LOS of more than 11.5 days (eight records) were excluded from the data.
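The outlier limits follow from the usual 1.5 × IQR rule, as the short sketch below shows. The quartiles Q1 = 4 and Q3 = 7 days are not stated in the text; they are the values implied by the reported limits of −0.5 and 11.5 under this rule. The `los_days` list is hypothetical and only demonstrates the filtering step.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier limits for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles implied by the reported limits (assumption, see lead-in).
lower, upper = iqr_fences(q1=4.0, q3=7.0)
print(lower, upper)  # -0.5 11.5

# Hypothetical LOS values in days; only values inside the limits are kept.
los_days = [2, 5, 7, 11, 12, 15]
kept = [value for value in los_days if lower <= value <= upper]
print(kept)  # [2, 5, 7, 11]
```

Since LOS cannot be negative, only the upper limit removes records in practice, which matches the exclusion of stays longer than 11.5 days.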
Since many ML algorithms cannot process categorical data directly, the one-hot encoding technique creates binary variables that represent the original categorical variable. The ML algorithm can then process these new binary variables [35]. In one-hot encoding, a new binary feature is created for each level of the category [25]. One-hot encoding of four categorical variables is shown in Figure 1. For each category of a categorical variable, one variable (one dimension) is added to the variables, and the value of this new variable in each row is set to 0 or 1. The value of the dummy variable is 1 when the original categorical variable takes the category that the dummy variable represents, and it is 0 otherwise. Finally, the original categorical variable and its records are removed from the data.
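As a minimal sketch of this procedure, the function below encodes one categorical column of a list of records. The function name `one_hot` and the sample records are hypothetical; the column naming pattern `variable_level` mirrors names such as "primary insurance type_without insurance" used later in the text.

```python
def one_hot(records, column):
    """Replace `column` with one 0/1 column per category level."""
    levels = sorted({row[column] for row in records})
    encoded = []
    for row in records:
        # Copy every field except the original categorical column.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Hypothetical records for illustration only.
patients = [
    {"age": 63, "marital_status": "married"},
    {"age": 41, "marital_status": "single"},
]
encoded = one_hot(patients, "marital_status")
print(encoded[0])
# {'age': 63, 'marital_status_married': 1, 'marital_status_single': 0}
```

Each row has exactly one dummy column set to 1 per encoded variable, and the original column no longer appears in the output.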
Pearson correlation analysis was performed, and the coefficients are reported in Table 4. According to Table 4, LOS has its highest positive correlations (p value less than 0.05) with the medical advice number (0.46), primary insurance type_employee health insurance (0.2), and physician expertise level_subspecialty physician (0.17). The highest negative correlation (p value less than 0.05) is with the primary insurance type_without insurance variable (−0.16). Other correlation values were insignificant, with p values greater than 0.05; nevertheless, these variables were not removed from the dataset so that their impact on the output of the models could be checked. In the last step, the dataset was divided into training and test sets. 85% of the data was assigned to the training dataset and 15% to the test dataset. The data distribution was checked in each dataset, and both had relatively the same distribution. This check mattered since the data were unbalanced.
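The sizes of the two sets can be worked out from the figures above: 200 records minus the 8 excluded outliers leaves 192, and 15% of 192 is 28.8. The sketch below rounds the test size up, which is an assumption chosen because it gives the 29 test cases mentioned in the results; the split procedure actually used is not described, and `train_test_split_indices` is a hypothetical helper.

```python
import math
import random


def train_test_split_indices(n_samples, test_fraction=0.15, seed=0):
    """Shuffle record indices and split them into (train, test) lists."""
    n_test = math.ceil(test_fraction * n_samples)  # rounding up is an assumption
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


n_records = 200 - 8  # records remaining after outlier removal
train_idx, test_idx = train_test_split_indices(n_records)
print(len(train_idx), len(test_idx))  # 163 29
```

A plain random split like this one does not guarantee similar distributions in both sets, which is why checking the distributions afterwards, as described above, is a sensible step for unbalanced variables.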

Model Training.
The models used in this work include KNN [18], multivariate regression [36], decision tree [37], random forest [4], ANN [38], and XGBoost [39] so that the results of the improved model can be compared. All models were built in Python version 3.8.5. The hyperparameter of the KNN model (the number of neighbors) was set to 12. This value was estimated with the help of the K-CV method with a value of k = 10. The regression model was built in two forms. First, a model was fitted to LOS using all variables. After checking the regression assumptions, the natural logarithm of LOS was calculated and added to the data. Another multivariate regression model was built on the transformed LOS, which hereafter will be known as the transformed regression model. The regression and transformed regression models were rebuilt based on t-test results, retaining variables with a p value of less than 0.05, and evaluated on the test dataset. These two models will be referred to as Lm and Lm_transformed, respectively. Since changing LOS to the natural logarithm of LOS improved the regression assumptions and brought the data closer to the normal distribution, the other models were also built using the natural logarithm of LOS. Decision tree (DT_default), random forest (RF_default), and XGBoost (XGB_default) were built on the training dataset using default hyperparameter values. The ANN model was built with a 2-layer structure. Twelve neurons were placed in the first layer and six in the second layer. Finally, the evaluation of the models was performed on the test dataset. The details of the default hyperparameters of the tree-based models are presented in Table 5.

Optimization with the Genetic Algorithm.
The hyperparameter values of the baseline tree-based models are the default values of the libraries developed for Python (see Table 5). Different combinations of the mentioned hyperparameters can be used in the models. In this research study, the PyGAD module and its pygad.GA class developed for Python were used to apply the GA for HPO [40]. Implementing the GA for each model has three basic steps: determining the fitness function, determining the range of hyperparameters of each model to be evaluated in the GA, and specifying the parameters of the GA. The fitness function for each model is the mean squared error (MSE) calculated with the K-CV method and k = 5 to reduce the overfitting of the model on the training data [31].
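A cross-validated MSE of this kind can be sketched as follows. The helper names (`k_fold_indices`, `cv_mse`, `fitness`) and the mean-predicting toy model are hypothetical. One point is an assumption rather than something the text states: PyGAD maximizes the fitness value, so the sketch returns the negative MSE to make a lower error correspond to a higher fitness.

```python
import statistics


def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def cv_mse(fit, predict, X, y, k=5):
    """Average validation MSE over k folds."""
    fold_errors = []
    for train, validation in k_fold_indices(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        squared = [(predict(model, X[i]) - y[i]) ** 2 for i in validation]
        fold_errors.append(statistics.fmean(squared))
    return statistics.fmean(fold_errors)


def fitness(fit, predict, X, y):
    # Negated so that a maximizing GA prefers lower error (assumption).
    return -cv_mse(fit, predict, X, y, k=5)


# Toy model for illustration: always predicts the training mean.
def fit_mean(X_train, y_train):
    return statistics.fmean(y_train)


def predict_mean(model, x):
    return model


X = list(range(10))
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
score = fitness(fit_mean, predict_mean, X, y)
print(score)
```

In an actual run, `fit` would train a tree-based model with the hyperparameters proposed by the GA, so each fitness evaluation costs five model fits.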
The hyperparameter space of each model that needs to be checked by the GA is as follows. For the decision tree model, the maximum depth of the tree is set between 1 and 1000. A higher value of max_depth leads to more tree expansion and overfitting on the data. The minimum number of samples per node is between 1 and 50. The ccp_alpha value is between 0 and 1. When ccp_alpha equals zero, no pruning occurs, and higher values lead to more heavily pruned trees.
For the random forest model, the maximum tree depth is between 1 and 7, and the minimum number of samples per node is between 1 and 50. These two hyperparameters have the same function as in the decision tree. The number of variables used in constructing each tree ranges from 1 to the total number of features, which in our problem is 12. The number of trees is considered to be between 50 and 1000, as fewer trees will provide inaccurate results. On the other hand, too many trees will add to the training time without improving the results. The maximum number of samples, defined as the number of samples to draw from X to train each base estimator, is set as a fraction between 0.1 and 1; that is, we are looking for its optimum value in the range from 10% to 100% of the samples.
In the XGBoost model, the learning rate is between 0.001 and 1. The learning rate is the step size shrinkage used in the update to prevent overfitting. The number of trees is set between 50 and 1000. The maximum depth of the tree is between 1 and 7. The fraction of samples (subsample) and variables (colsample_bytree) used in constructing each tree is between 0.1 and 1. The subsample is the ratio of the training instances used, and it helps prevent overfitting. Subsampling occurs once in every boosting iteration. Setting it to, for example, 0.1 means that XGBoost would randomly sample 10% of the training data before growing trees. Colsample_bytree is the subsample ratio of columns when constructing each tree. This subsampling occurs once for every tree constructed. The regularization term is considered between 1 and 3. Increasing this value will make the model more conservative [31]. The hyperparameter space of each model that the GA should check is also presented in Table 5. The decision tree, random forest, and XGBoost models improved by the GA will be referred to as DT_GA, RF_GA, and XGB_GA, respectively.
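The XGBoost search space above can be written down as a small table, together with a function that turns one GA chromosome into a hyperparameter setting. This is a sketch under stated assumptions: the parameter names follow the XGBoost scikit-learn interface, mapping the "regularization term" to `reg_lambda` is a guess (the text does not name the parameter), and `XGB_SPACE` and `decode` are hypothetical names.

```python
# (lower bound, upper bound, type) per hyperparameter, taken from the text.
XGB_SPACE = {
    "learning_rate": (0.001, 1.0, float),
    "n_estimators": (50, 1000, int),
    "max_depth": (1, 7, int),
    "subsample": (0.1, 1.0, float),
    "colsample_bytree": (0.1, 1.0, float),
    "reg_lambda": (1.0, 3.0, float),  # assumption: the "regularization term"
}


def decode(chromosome):
    """Clip each gene to its bounds and cast it to the expected type."""
    params = {}
    for gene, (name, (low, high, kind)) in zip(chromosome, XGB_SPACE.items()):
        value = min(max(gene, low), high)
        params[name] = int(round(value)) if kind is int else float(value)
    return params


# Hypothetical chromosome with two out-of-range genes (9 and 0.05).
example = decode([0.05, 250.4, 9, 0.8, 0.05, 2])
print(example)
```

The resulting dictionary could then be passed to a model constructor as keyword arguments, for example `XGBRegressor(**example)`, before the cross-validated error is computed.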
GA parameters include the number of generations or the ending condition of the algorithm, the number of parents that the crossover operator must use, the number of solutions or individuals in each population, the type of selection operator, the type and probability of the crossover operator, and the type and probability of the mutation operator. GA parameters must be determined before running the algorithm. In this study, the number of generations is 50, the number of parents that can participate in the crossover operation is 2, and the number of solutions (individuals) in the population is 20. The selection operator is steady-state, the crossover operator is uniform with a probability of 60%, and the mutation is random with a probability of 1% [31]. The implementation of the GA to optimize the hyperparameters of a tree-based model is shown in Table 6.
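These settings map naturally onto keyword arguments of the `pygad.GA` constructor. The dictionary below is a sketch of that mapping, not the authors' code (which is summarized in Table 6); in particular, reading "steady state" as PyGAD's `"sss"` selection type is an assumption.

```python
# GA settings from the text, expressed with pygad.GA keyword names.
GA_SETTINGS = {
    "num_generations": 50,
    "num_parents_mating": 2,
    "sol_per_pop": 20,
    "parent_selection_type": "sss",  # steady-state selection (assumed mapping)
    "crossover_type": "uniform",
    "crossover_probability": 0.6,
    "mutation_type": "random",
    "mutation_probability": 0.01,
}

# With PyGAD installed, the settings would be used roughly like this, where
# fitness_func, num_genes, and gene_space depend on the model being tuned:
#
#   import pygad
#   ga = pygad.GA(fitness_func=..., num_genes=..., gene_space=..., **GA_SETTINGS)
#   ga.run()

for name, value in GA_SETTINGS.items():
    print(f"{name} = {value}")
```

Keeping the settings in one dictionary makes it easy to reuse the same GA configuration for the decision tree, random forest, and XGBoost runs while changing only the search space.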

Prediction Accuracy Analysis.
The models detailed in the previous section were evaluated on the test dataset, and absolute errors were calculated for each model. The statistical indices of absolute errors, including mean, median, standard deviation (SD), interquartile range (IQR), minimum (min), and maximum (max), are reported in Table 7. The lowest mean absolute error (MAE) is 1.52 days and belongs to the transformed regression (Lm_transformed). With a slight difference from that, the improved XGBoost (XGB_GA) model has the second-lowest MAE value, equal to 1.54 days. After that, the lowest MAE belongs to the re- A lower MAE generally means better model accuracy, but MAE alone does not answer the question of which model is the best. To address this, it is better to also check the error dispersion indices. These indices are the median, SD, and IQR of the absolute errors, considered in addition to the MAE. Dispersion indices give a better view of the model performance on each data record in the test dataset. The lowest median belongs to RF_default (1.02 days), followed by XGB_GA (1.14 days). The lowest standard deviations belong to Lm and ANN, with values of 1.14 and 1.20, respectively. The smallest IQRs belong to DT_default (1 day) and XGB_GA (1.26 days). Since there are three dispersion indices for ranking the models, the average of these three indices was calculated for each model in order to determine which model has the smallest error dispersion in the test dataset. This ranking puts the XGB_GA model in first place and Lm_transformed in second place. After that, Lm, ANN, RF_GA, RF_default, DT_default, KNN, DT_GA, and XGB_default follow in order of increasing dispersion. This ranking means that we are looking not only for a model with lower prediction errors on average but also for one that predicts as many records as accurately as possible.
In other words, if we draw the range of errors of each model as a boxplot, we want to see a more compressed box. Figure 2 compares the errors of the models in a boxplot diagram. As shown in the figure, the XGB_GA boxplot is the most compressed among the models, followed by that of the Lm_transformed model. The MAE of these two models is the lowest among all models. Another indicator that should be assessed in the analysis of each model is the maximum prediction error. In this case, the Lm_transformed model and the ANN have the lowest values. However, the graphs in Figure 2 show that in highly accurate models such as XGB_GA and Lm_transformed, the number of cases predicted with a high error is small. For example, for XGB_GA, this number is 2 out of 29 cases, which is about 7% of the data.
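The indices reported in Table 7 and the dispersion-based ranking described above can be computed from the absolute errors of each model. The sketch below uses only the Python standard library. The function names are illustrative assumptions, and the quartile convention (`statistics.quantiles` with its default exclusive method) may differ from the one the authors used, so IQR values can differ slightly from those in the paper.

```python
import statistics


def error_summary(y_true, y_pred):
    """Summary statistics of the absolute prediction errors."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    q1, _, q3 = statistics.quantiles(errors, n=4)
    return {
        "mean": statistics.fmean(errors),   # MAE
        "median": statistics.median(errors),
        "sd": statistics.stdev(errors),     # sample standard deviation
        "iqr": q3 - q1,
        "min": min(errors),
        "max": max(errors),
    }


def dispersion_score(summary):
    """Average of the three dispersion indices named in the text."""
    return (summary["median"] + summary["sd"] + summary["iqr"]) / 3


def rank_by_dispersion(summaries):
    """Model names ordered from the smallest to the largest dispersion score."""
    return sorted(summaries, key=lambda name: dispersion_score(summaries[name]))
```

Given a dictionary that maps each model name to its `error_summary`, `rank_by_dispersion` returns the order in which the models would be listed under this averaging scheme.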
In addition to comparing models and checking their prediction accuracy, it is necessary to address the effect of the GA on performance. Table 7 shows that the GA has reduced all the error indicators in the XGBoost model by at least 25%. In the decision tree and random forest, the changes have been mixed. In the decision tree, all error indicators have improved except the IQR, which increased by 100%. The mean, median, standard deviation, and maximum error have been reduced by 10%, 17%, 10%, and 9%, respectively. In the random forest, the mean and median errors increased by 7% and 31%, respectively, while the standard deviation, IQR, and maximum error decreased by 4%, 20%, and 2%, respectively. In other words, the random forest boxplot of errors in Figure 2 is more compressed (see Table 8).
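The percentage changes quoted above follow from comparing each error indicator before and after optimization. A minimal sketch, with a hypothetical helper name and illustrative numbers rather than the values in Table 7:

```python
def percent_change(before, after):
    """Relative change in percent; negative values mean the error decreased."""
    if before == 0:
        raise ValueError("relative change is undefined when the baseline is zero")
    return (after - before) / before * 100.0


# Illustrative values only (not taken from Table 7).
default_errors = {"mean": 2.0, "iqr": 1.0}
tuned_errors = {"mean": 1.8, "iqr": 2.0}
changes = {name: percent_change(default_errors[name], tuned_errors[name])
           for name in default_errors}
```

Under this convention, an indicator that doubles is reported as an increase of 100%, and a drop from 2.0 to 1.8 days is reported as a reduction of 10%.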
Another tool that helps to compare the performance of the models and the effect of the GA is the graph of predicted values (Y-axis) versus actual values (X-axis). Ideally, the data in this graph should fit on a 45-degree line, meaning that the predicted value is precisely the same as the actual value; however, this is impossible in practice. Models whose values have less dispersion around the diagonal line are considered better. The reason behind the lower MAE and error dispersion of XGB_GA and Lm_transformed can be seen in Figure 3. Although it is difficult to compare the models in this type of diagram, the way the tree-based models change after using the GA can be seen. The MAE value of the decision tree model decreases while the error dispersion values for the random forest model increase. The improvement in the XGBoost model is notable as the values approach the diagonal line.

Discussion.
Overall, if the order of accuracy of the models is considered (see Figure 4), the Lm_transformed and XGB_GA models have an excellent ability to predict LOS. After those two models, Lm, ANN, the RF models, KNN, the DT models, and XGB_default follow in order of decreasing prediction accuracy. The XGB_default and both DT models are among the weakest models. Their results are even weaker than those of the base model, KNN. A weaker result was expected from the DT models than from the models employing ensemble learning. Figure 4 shows the error values of the models, arranged in order of MAE for comparison. The diagram helps to compare the models and check the trend of the error indicators. It shows that the models become weaker in the mentioned order; in other words, their error values increase.
The noteworthy point about the first two models is the competition between the complex tree-based model optimized with a metaheuristic method (XGB_GA) and the simple transformed regression model (Lm_transformed). The XGB_GA model has a higher MAE but lower error dispersion than the Lm_transformed model. Since the difference in the MAE of these two models is insignificant, the XGB_GA model can be chosen as the most accurate one.
There are two important points regarding the two top models in this research. One is their interpretability and the other is their computational process. The regression model has better interpretability than the XGB_GA model because it is possible to check which variables affect the output and to what extent. This possibility is not available for the XGB_GA model. The regression model was based on t-test results and variables selected by the user's decision. The XGB_GA model is free from user intervention in creating the model, and no variable or data is removed during the process.
To conclude, XGB_GA has three critical advantages over other models. The first is one of the lowest MAE values (1.54 days, close to the 1.52 days of Lm_transformed), the second is the lowest prediction error dispersion, and the third is the absence of analyst involvement in decision-making and in creating the final output. The median absolute error of XGB_GA is 1.14 days, and the third quartile of the error is less than two days, meaning that the model predicts LOS with less than two days of error in 75 percent of cases. For a future decision support system, a model that is less dependent on the intervention of the researcher or analyst can be a better choice. In most cases, the model must also predict LOS with a minor error. Therefore, XGB_GA could be selected as the best model.
As a criterion for measuring the accuracy of the models, we can rely on the reported results of other researchers. For this purpose, studies that predicted the absolute value of LOS and reported RMSE or MAE indices can be included for comparison. Danilov
The main issue this research tried to study was the effect of the GA in improving ML model results. The GA was used to calculate the optimal hyperparameters of the decision tree, random forest, and XGBoost models. The positive effect of the GA on the XGBoost model is undeniable. It has reduced all the error indicators. Considering the value of the error indicators, the error distribution, and the related graphs, the decision tree model has become weaker, and the random forest model has generally improved. Graph (e) in Figure 3 shows that the optimized decision tree only fitted a constant value to the data to reduce the MAE. A constant value as the final model is considered a poor result since every case receives the same prediction regardless of the input variables. In Figure 3(j), for the RF_GA model, the general form of data dispersion is similar to that of RF_default in Figure 3(i). The only difference is that the data are closer to the diagonal line, which means that the random forest model has improved after optimization.
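The degenerate behaviour described for the optimized decision tree can be checked directly. A model that outputs a single value for every record does not depend on the inputs, and the constant that minimizes the MAE is the median of the target values. The sketch below is illustrative only; the helper names and the sample LOS values are assumptions, not data from the study.

```python
import statistics


def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted values."""
    return statistics.fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def is_constant_prediction(y_pred, tol=1e-9):
    """True when every prediction is (numerically) the same value."""
    return max(y_pred) - min(y_pred) <= tol


def best_constant_mae(y_true):
    """MAE obtained by predicting the median for every record."""
    center = statistics.median(y_true)
    return mae(y_true, [center] * len(y_true))


# Hypothetical LOS values in days (not from the study's dataset).
los_days = [2, 3, 3, 4, 10]
baseline = best_constant_mae(los_days)
```

A tuned model whose MAE is no better than `best_constant_mae` on the same data, or for which `is_constant_prediction` returns `True`, adds little beyond this baseline even if its MAE looks competitive.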

Conclusions
This study aimed to improve the LOS prediction accuracy by focusing on the HPO process. The literature shows that this procedure has been neglected in most similar studies. Due to its superiority over other standard methods, GA was selected for this purpose. In this work, the impact of GA on performance improvement was tested experimentally by integrating it with one of the most accurate ML models, XGBoost. The newly proposed method outperformed other modeling techniques. However, only one set of GA parameters was used for the optimization, which is the main limitation of this study. For future studies, it is suggested to apply other combinations of the GA parameters and compare their performance to find the optimal setting. GA, along with other metaheuristic algorithms such as PSO, could be applied to a more extensive dataset with the ICD diagnosis code added to the input variables. Previous studies that used diagnostic ICD codes have reported models with a prediction accuracy of over 80% [10,16,17]. Improving the fitness function used to optimize XGBoost by simultaneously including dispersion indices and the mean of errors is another idea that could improve the results for practical use. GA could also be used to optimize deep learning models such as ANN, which can be investigated more deeply in future studies.

Data Availability
The dataset used to support the findings of this study is available from the corresponding author upon request.

Disclosure
This research is extracted from the master's thesis of Atefeh Mansoori at the Islamic Azad University, Science and Research Branch, under the title "Development of an Improved Model by Integrating Data Mining and Genetic Algorithms to Predict the Length of Hospital Stay" in the Persian language.

Conflicts of Interest
The authors declare that they have no conflicts of interest.