Length of Stay Prediction Model of Indoor Patients Based on Light Gradient Boosting Machine

The influx of hospital patients has become common in recent years. Hospital management departments need to redeploy healthcare resources to meet the massive medical needs of patients. In this process, the hospital length of stay (LOS) of different patients is a crucial reference for the management department. Therefore, building a model to predict LOS is of great significance. Five machine learning (ML) algorithms, namely Lasso regression (LR), ridge regression (RR), random forest regression (RFR), light gradient boosting machine (LightGBM), and extreme gradient boosting regression (XGBR), and six feature encoding methods, namely label encoding, count encoding, one-hot encoding, target encoding, leave-one-out encoding, and the proposed encoding method, are used to construct the regression prediction model. The prediction model is built with the Scikit-Learn toolbox on the Python platform. The input is the Hospital Inpatient Discharges (SPARCS De-Identified) 2017 dataset with 2343569 instances provided by the New York State Department of Health; after removing the 2.2% of instances with missing data, the model is verified using mean squared error (MSE) and the coefficient of determination (R2) as performance measurements. The results show that the model with the LightGBM algorithm and the proposed encoding method achieves the best R2 (0.960) and MSE (2.231) scores.


Introduction
Globally, due to the pandemic and population changes, hospital inpatient departments are increasingly likely to face an influx and congestion of patients [1,2], and hospitals anticipate the need to redeploy healthcare resources to meet the massive medical requirements of patients [3]. The LOS indicates the number of days between admission and discharge, and it can often affect the admission plan of emergency patients [4] or whether a transfer is possible [5]. Moreover, when technical means can reduce long LOS, the consumption of healthcare resources would also be reduced to some extent [6]. However, in most cases the inpatient department does not know when existing patients will leave the hospital. If hospitals could accurately predict LOS, they could implement and improve healthcare resource management correctly [7,8]. Therefore, this study tries to establish an ML model using information about the diagnosis, treatment, service, and cost of individual patients to predict LOS.
In the study, five ML algorithms (LR, RR, RFR, XGBR, and LightGBM) and six feature encoding methods (label encoding, count encoding, one-hot encoding, target encoding, leave-one-out encoding, and the proposed encoding method) were used and compared during model building. The rest of the study is organized as follows: Section 2 reviews related studies on LOS. Section 3 introduces the dataset used in this study and each step of the proposed framework in detail. Section 4 presents the experimental results and then discusses them. Section 5 draws the conclusions and directions for future work.
Related Work

For example, Bacchi et al. [9] proposed an artificial neural network (ANN)-based prediction model for predicting the LOS of stroke patients. The objective was to predict whether the LOS was less than 8 days, and they finally achieved area under curve (AUC) values of 0.62 and 0.66 on the inner and outer validation sets. Similarly, Daghistani et al. [10] converted the LOS values into three classes (<3 days, 3-5 days, and >5 days) and then used information gain (IG) to select features. They compared Random Forest (RF), Bayesian Network (BN), Support Vector Machine (SVM), and ANN technology for LOS prediction. The final RF model outperformed all other models (sensitivity (0.80), accuracy (0.80), and AUROC (0.94)). Furthermore, Zheng et al. [11] compared two discretization schemes with two (1-3 days and ≥4 days) and three (1-3 days, 4-8 days, and ≥9 days) classes. Six ML algorithms were applied for comparative prediction, and the best accuracy scores (ACC) were 0.7689 and 0.6594 on the training and test sets, respectively. Furthermore, Ling et al. [12] used the RF algorithm and general medical characteristics to predict LOS for patients in the intensive care unit (ICU), and the AUC value of the optimal model was 0.86. The limitation of classification-type studies is their generally poor performance and difficulty in guiding long-term LOS (e.g., LOS ≥ 10 days) prediction due to the small number of classes. Models of this discrete type are unrealistic to deploy and not recommended when hospitals hope to predict the LOS precisely (e.g., ±1 day).
Data balancing techniques can improve model performance in predicting LOS. For example, Naemi et al. [13] proposed a multistage data processing method. The method first used k-nearest neighbors (KNN), decision tree (DT), gradient boosting (GB), Bayesian ridge (BR), Gaussian process (GP), and RF for missing value imputation and then used SMOTE to overcome data skewness. After these steps, the model used DT to predict the hours of stay and ended up with an R2 score of 0.729. Alsinglawi et al. [14] constructed a LOS prediction framework for lung cancer patients using RF and oversampling techniques (SMOTE and ADASYN). The framework achieves an AUC score of 100% on the MIMIC-III dataset. The datasets used in the above two studies have been artificially altered. Even though model performance is good on synthetically balanced data, it often does not hold on unbalanced data. As a result, models using data balancing techniques are difficult to deploy because data tend to be biased in real life.
According to historical data, regression is the method that occupies the majority of LOS prediction studies [15]. For example, Siddiqa et al. [16] built regression models to predict LOS, with their best-performing RFR model reaching an R2 of 0.92. Previous regression models have two limitations. First, some models are built on specific or post-hospital physical examination data, so they lack generality. Second, models built on datasets with high versatility are insufficient in performance (R2 < 0.95). Based on the deficiencies of the three model types, this study attempts to propose a model that does not use artificially synthesized data and excels in both generality (e.g., using prehospital diagnosis results) and performance.

3.1. Data Description.
The study used the Hospital Inpatient Discharges (SPARCS De-Identified) 2017 dataset provided by the New York State Department of Health [20]. This dataset uses the Open Database License (ODbL 1.0), which grants anyone the right to use the dataset for the duration of any applicable copyright and Database Rights. These rights explicitly include commercial use and do not exclude any field of endeavor [21]. The dataset contains 2343569 instances with 34 features that de-identify detailed information on patient characteristics, diagnosis, treatment, services, and costs. The "Length of Stay" in the dataset is the target feature, and the purpose of the proposed model is to predict it from the others. Table 1 shows the description of the features of the dataset.

3.2. The Proposed Framework.
This study uses several steps to build a complete application model. First, the raw data are visualized to analyze internal relationships, and then the data are preprocessed for duplicates, missingness, and meaningless information. The third step determines whether each feature of the dataset positively affects the target; the model keeps only the positive part. Then the six encodings convert the unusable information in the dataset into a usable form. The above steps modify the raw data. The model then divides the data into a training set and a test set in a 99:1 ratio, where the training set uses a 10-fold cross-validation technique to improve model reliability. After the five ML algorithms have trained the model, MSE and R2 judge the model performance to support the analysis. Figure 1 presents the framework proposed in this study, which fully expresses the methodology used to construct the model. The following sections explain each step of the framework in detail.
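The split and cross-validation steps of the framework can be sketched with Scikit-Learn; the arrays below are random stand-ins for the encoded dataset, not the SPARCS data.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Synthetic stand-in for the encoded feature matrix and target (shapes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# 99:1 train/test split, as in the framework.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.01, random_state=0)

# 10-fold cross-validation over the training set only; the test set is untouched.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in kf.split(X_train)]

print(len(X_train), len(X_test), len(fold_sizes))
```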

3.3. Data Visualization Analysis.
The study leverages visualization techniques to analyze the dataset and find relationships between independent and dependent features. The results produced by visualization methods are usually easily understandable by people who are not necessarily knowledgeable about ML [22]. Table 1 shows that the dataset has two- or three-class categorical, ordered categorical, random categorical, and continuous features, in which the target "Length of Stay" belongs to the continuous type. Figure 2 shows the density distribution of the target feature. It follows a long-tail distribution with an average value of 5.38. Therefore, all analysis methods that assume a normal distribution are less suitable for this study. Figure 3 shows the proportion of LOS in different categories of patients for the three features "Gender," "APR Medical Surgical Description," and "Emergency Department Indicator." The results show that the LOS of female patients was longer than that of male patients but evenly distributed. In the middle of the figure, the LOS of medical inpatients accounts for about three-quarters of the total, and this type is much longer than that of surgical inpatients. The LOS of emergency patients is about twice that of nonemergency patients, showing that the condition of emergency patients is more severe and needs a longer recovery time. Figure 4 shows the density distribution of two continuous features, and the trend is similar to Figure 2. This figure demonstrates that the two features correlate with the target. Finally, Figure 5 shows the LOS of two ordered categorical features, which shows that the younger the age and the higher the disease mortality rate, the shorter the LOS.

3.4. Data Preprocessing.
Outliers and missing values during model building would affect model performance [23], so data preprocessing is crucial. Among the 34 features, the missing values of "Payment Typology 2" are missing completely at random (MCAR), while those in "Payment Typology 3" and "Birth Weight" are missing at random (MAR) [24]. The proportions of their missing values are about 37.5%, 74.1%, and 90.3% [25]. Hence the process removed the three features directly. The remaining dataset also needs to remove about 2% of instances that still contain MCAR or MAR values, as well as 20 samples with the value "Unknown" in the "Gender" feature. It is worth mentioning that all values of "120+" in the "Length of Stay" feature were uniformly changed to "120" for the convenience of calculation. Finally, 2304296 instances with 31 features are left in the dataset after the preprocessing deleted 2.2% of the instances.
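A minimal pandas sketch of these preprocessing steps, on an invented toy frame (column names follow the dataset, values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the preprocessing steps; values are invented.
df = pd.DataFrame({
    "Length of Stay": ["3", "120+", "5", "2"],
    "Gender": ["F", "M", "Unknown", "F"],
    "Birth Weight": [np.nan, np.nan, np.nan, 3200],  # mostly missing
    "Total Charges": [1200.0, 800.0, np.nan, 950.0],
})

# 1) Drop features whose missing-value ratio is too high.
missing_ratio = df.isna().mean()
df = df.drop(columns=missing_ratio[missing_ratio > 0.3].index)

# 2) Remove instances that still contain missing values or "Unknown" gender.
df = df.dropna()
df = df[df["Gender"] != "Unknown"]

# 3) Map "120+" to "120" and cast the target to a number.
df["Length of Stay"] = df["Length of Stay"].str.replace("+", "", regex=False).astype(int)
print(df.shape)
```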

3.5. Feature Selection.
Among the 31 features remaining after preprocessing, the five features "CCS Diagnosis Description," "CCS Procedure Description," "APR DRG Description," "APR MDC Description," and "APR Severity of Illness Description" are different representations of the same information as the five features "CCS Diagnosis Code," "CCS Procedure Code," "APR DRG Code," "APR MDC Code," and "APR Severity of Illness Code," respectively; they are meaningless to the model and therefore deleted. Among the 24 remaining features of the dataset, "Length of Stay" is the continuous target feature, and the others are divided into four types (binary, ordered categorical, random categorical, and continuous). Their correlation with the target feature must be investigated with different techniques.

3.5.1. Binary Features.
The point-biserial correlation is the value of Pearson's product-moment correlation when one of the variables is dichotomous and the other variable is metric [26]. The point-biserial correlation coefficient is calculated as

$$r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}}, \tag{1}$$

where $n_1$ and $\bar{Y}_1$ represent the frequency of the binary feature $X = 1$ and the mean of the corresponding target feature, respectively; $n_0$ and $\bar{Y}_0$ represent the frequency of the binary feature $X = 0$ and the mean of the corresponding target feature, respectively; and $s_n$ in the denominator represents the standard deviation of the target feature [26]. The closer the absolute value of $r_{pb}$ is to 1, the higher the correlation between the features.
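With SciPy, the point-biserial coefficient can be computed directly and checked against the formula above; the binary feature (e.g., an emergency indicator) and LOS values below are invented for illustration.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Hypothetical binary feature against invented LOS values; not real SPARCS data.
x = np.array([1, 1, 1, 0, 0, 0, 1, 0])
los = np.array([9.0, 8.0, 10.0, 4.0, 3.0, 5.0, 9.0, 4.0])

r, p = pointbiserialr(x, los)

# Check against formula (1): (Y1_bar - Y0_bar)/s_n * sqrt(n1*n0/n^2),
# with s_n the population standard deviation of the target.
n1, n0, n = (x == 1).sum(), (x == 0).sum(), len(x)
r_manual = (los[x == 1].mean() - los[x == 0].mean()) / los.std() * np.sqrt(n1 * n0 / n**2)
print(round(r, 3), round(r_manual, 3))
```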

3.5.2. Ordered Categorical Features.
Measuring the correlation of ordered categorical features with a continuous feature requires first converting the latter to the former type. The two most popular measures of association for this feature type are Kendall's tau and Spearman's rho [27]. This study uses the Spearman coefficient for correlation analysis, and the general idea is as follows: the Spearman method first converts the string data of the feature into numerical ranks $x = [x_1, x_2, \ldots, x_n]$, and the data $y = [y_1 = 1, y_2 = 2, \ldots, y_n = 120]$ in the target feature are used directly without modification. The method then uses formula (2) [28,29] to calculate the correlation between x and y:

$$\rho_{X,Y} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}, \tag{2}$$

where $d_i$ is the difference between the ranks of $x_i$ and $y_i$. The absolute value of the result $\rho_{X,Y}$ is between 0 and 1, and the closer to 1, the more correlated the features are.
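A quick SciPy illustration with invented ordinal codes and target values:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical ordered categorical feature (e.g., age groups mapped to ranks)
# against a skewed continuous target; numbers are invented.
x = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y = np.array([2.0, 3.0, 3.0, 5.0, 6.0, 8.0, 9.0, 15.0])

rho, p = spearmanr(x, y)
print(round(rho, 3))
```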

3.5.3. Random Categorical Features.
Since the target feature does not satisfy the normal distribution (Figure 2), it is suitable to use the Kruskal-Wallis test to calculate the correlation with the target feature. The Kruskal-Wallis test is a nonparametric statistical test that assesses the differences among three or more independently sampled groups on a single, nonnormally distributed continuous feature [30]. The basic idea is as follows: the Kruskal-Wallis test first arranges the values in ascending order, then finds their ranks $R_i$, and examines whether there is a significant difference in the mean $\mu_i$ of the ranks of each group. The null hypothesis is $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$, and the alternative hypothesis $H_1$ is that at least two $\mu_i$ are not equal. The test statistic [31] is

$$H = \frac{12}{N(N+1)}\sum_{i=1}^{k}\frac{R_i^2}{n_i} - 3(N+1), \tag{3}$$

where $N$ is the total number of instances, $n_i$ the size of group $i$, and $R_i$ the rank sum of group $i$. Through $H$, the Kruskal-Wallis test can query the critical value table to get the corresponding P value. If the P value is below the significance level, there is a correlation between the features. This study sets the threshold at 0.01.
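A small SciPy sketch with three invented groups (e.g., LOS samples for three categories of a random categorical feature):

```python
from scipy.stats import kruskal

# Hypothetical LOS samples for three categories; numbers are invented.
group_a = [2, 3, 3, 4, 2]
group_b = [5, 6, 7, 6, 8]
group_c = [10, 12, 11, 9, 13]

h, p = kruskal(group_a, group_b, group_c)
print(p < 0.01)  # significance threshold used in the study
```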

3.5.4. Continuous Features.
Since the target feature is not normally distributed (Figure 2), its correlation with continuous features needs to be judged by the Spearman correlation coefficient [32].
Finally, Table 3 summarizes the correlation between each feature and the target feature. The results show that the model can keep all features.
3.6. Feature Encoding. All categorical attributes of the dataset are represented by strings, while machine learning algorithms can only operate on numerical values. Hence these features need to be re-encoded as numbers.
3.6.1. Label Encoding. In the label encoding method, the values of each categorical feature are first sorted by frequency from small to large and then assigned a value from 0 to N − 1 in order (N indicates how many distinct values the feature has). Even if there is no relationship between the values before encoding, the algorithm would interpret them according to the magnitude of the assigned numbers. Table 4 shows a sample of this method on one particular feature.
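A frequency-ordered label encoding can be sketched in a few lines of pandas (the feature values are hypothetical):

```python
import pandas as pd

# Frequency-ordered label encoding: values sorted by frequency (ascending)
# receive codes 0..N-1; the column below is invented.
s = pd.Series(["Medicare", "Medicaid", "Medicare", "Private", "Medicare", "Medicaid"])

order = s.value_counts(ascending=True).index          # rarest value first
mapping = {value: code for code, value in enumerate(order)}
encoded = s.map(mapping)
print(mapping)
```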

3.6.2. Count Encoding.
Count encoding is a method that uses the frequency of each feature value as its label: the frequency of a value replaces the value itself, so different values may be encoded into the same number. When the frequency of a categorical feature correlates with the target feature, this method has positive significance for model training.

3.6.3. One-Hot Encoding.
One-hot encoding converts each distinct value of a categorical feature into a new binary feature. However, there are too many distinct values in the discrete features of the dataset. If all features used one-hot encoding, more than 1500 new features would be generated, and the data would be too sparse. Hence, only features with a small number of unique values use this method.
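Both count encoding and one-hot encoding can be illustrated with pandas on a hypothetical admission-type column:

```python
import pandas as pd

# Count encoding replaces each value by its frequency; one-hot encoding
# expands each value into its own binary column. Data are invented.
s = pd.Series(["ER", "Clinic", "ER", "ER", "Transfer", "Clinic"])

count_encoded = s.map(s.value_counts())
one_hot = pd.get_dummies(s, prefix="adm")

print(count_encoded.tolist())   # [3, 2, 3, 3, 1, 2]
print(list(one_hot.columns))    # ['adm_Clinic', 'adm_ER', 'adm_Transfer']
```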

3.6.4. Target Encoding.
Target encoding is a preprocessing scheme for high-cardinality categorical features based on a well-established statistical approach (empirical Bayes). It is a method based not only on the independent feature values but also on the corresponding dependent feature [33]. This method depends on the distribution of the dependent feature, but the feature dimension remains unchanged after encoding. Its calculation formulas (4) and (5) [33] are as follows:

$$S_i = \lambda(n_i)\frac{\sum_{k \in L_i} Y_k}{n_i} + \big(1 - \lambda(n_i)\big)\frac{\sum_{k=1}^{N_{TR}} Y_k}{N_{TR}}, \tag{4}$$

$$\lambda(n_i) = \frac{1}{1 + e^{-(n_i - k)/f}}. \tag{5}$$

In formula (4), $\sum_{k \in L_i} Y_k$ represents the sum of the target feature's values for the instances whose categorical value is $i$, and its denominator $n_i$ represents the frequency of categorical value $i$. The term $\sum_{k=1}^{N_{TR}} Y_k$ on the right side represents the sum of the target feature's values over the training set. In formula (5), $k$ represents the minimum number of times a value must appear to be trusted, and $f$ is the smoothing coefficient: the higher its value, the stronger the regularization of the formula.
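A minimal sketch of formulas (4) and (5); the parameter names `k` and `f` follow the notation above, and the data are invented.

```python
import numpy as np
import pandas as pd

def target_encode(feature, target, k=2, f=1.0):
    """Smoothed target encoding: blend each category's target mean with the
    global mean, weighted by a sigmoid of the category frequency.
    k = minimum frequency, f = smoothing coefficient."""
    global_mean = target.mean()
    stats = target.groupby(feature).agg(["mean", "count"])
    lam = 1.0 / (1.0 + np.exp(-(stats["count"] - k) / f))   # formula (5)
    smooth = lam * stats["mean"] + (1.0 - lam) * global_mean  # formula (4)
    return feature.map(smooth)

feature = pd.Series(["A", "A", "A", "B", "B", "C"])
target = pd.Series([10.0, 12.0, 11.0, 4.0, 6.0, 8.0])
encoded = target_encode(feature, target)
print(encoded.round(3).tolist())
```

Frequent categories stay close to their own mean, while rare ones (like "C" here) are pulled toward the global mean.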

3.6.5. Leave-One-Out Encoding.
The leave-one-out encoding method uses the same principle and formulas as target encoding. However, to reduce the influence of outliers, when calculating the encoding value of an instance, the program ignores the current instance and uses only the remaining instances for target encoding.
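A sketch of the leave-one-out idea without smoothing (data are invented; categories appearing only once would need a fallback such as the global mean):

```python
import pandas as pd

# Leave-one-out target encoding: for each row, the mean of the target over
# all OTHER rows sharing the same value.
feature = pd.Series(["A", "A", "A", "B", "B"])
target = pd.Series([10.0, 12.0, 14.0, 4.0, 6.0])

grp = target.groupby(feature)
sums = feature.map(grp.sum())
counts = feature.map(grp.count())
loo = (sums - target) / (counts - 1)
print(loo.tolist())  # first "A" row -> (12 + 14) / 2 = 13.0
```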
3.6.6. Proposed Encoding Method. The one-hot encoding method captures the information of categorical features well, but it leads to sparse data. The other methods do not have the sparsity problem but lose a lot of information. This study attempts to balance model performance and data dimensionality, thus combining two encodings to form a new method. Table 6 shows the encoding adopted for each feature.

3.7. Comparative Algorithms

3.7.1. Lasso Regression. LR fits the dataset $D = \{(x_1^1, \ldots, x_1^m, y_1), \ldots, (x_n^1, \ldots, x_n^m, y_n)\}$ (m represents the number of features and n indicates the number of instances) with a linear function (6) and minimizes the cost function (7) [34]:

$$f(x) = w^{T}x + b, \tag{6}$$

$$J(w) = \frac{1}{2n}\sum_{i=1}^{n}\big(f(x_i) - y_i\big)^2 + \lambda\sum_{j=1}^{m}|w_j|, \tag{7}$$

where $f(x)$ represents the predicted values and $y_i$ the true values. The purpose of the operation is to find a solution $(w, b)$ that minimizes $J(w)$. LR imposes constraints on the model parameters (i.e., adds the penalty $\lambda\sum_j |w_j|$ to the loss function) that shrink the regression coefficients toward zero [35]. For example, if a feature highly correlates with the target, LR will select it, shrink the coefficients of uncorrelated features to zero, and exclude them from the model. This approach trades a small increase in bias for reduced variance and improves the accuracy of linear regression models.
Calculating the partial derivative with respect to $w_k$ of the residual term on the left side and the penalty term on the right side of formula (7) gives formulas (8) and (9). Defining $\rho_k = \sum_{i=1}^{n} x_k^{(i)}\big(y_i - \sum_{j \neq k} w_j x_j^{(i)}\big)$ and $z_k = \sum_{i=1}^{n} \big(x_k^{(i)}\big)^2$, combining (8) and (9) gives the partial derivative of (7), whose solution is the soft-thresholding rule

$$w_k = \begin{cases} (\rho_k + \lambda)/z_k, & \rho_k < -\lambda, \\ 0, & -\lambda \le \rho_k \le \lambda, \\ (\rho_k - \lambda)/z_k, & \rho_k > \lambda. \end{cases} \tag{10}$$
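Scikit-Learn's Lasso illustrates this selection behavior on synthetic data where only two of five features carry signal (data and coefficients are invented for the sketch):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Lasso shrinks coefficients of uninformative features to exactly zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.5).fit(X, y)
print(np.round(model.coef_, 2))  # only the first coefficient stays clearly nonzero
```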

3.7.2. Ridge Regression.

RR is similar to LR and uses the linear function (6). It obtains regression coefficients at the cost of losing some information and reducing accuracy by giving up unbiasedness. RR adds a penalty term to the loss function of standard linear regression to alleviate multicollinearity and overfitting problems [36]. Under multicollinearity, unpenalized estimates of the regression coefficients tend to become too large in absolute value, and some may even have the wrong sign [37]. Formula (11) [38] is the loss function of RR, which adds the penalty term $\lambda\sum_j w_j^2$:

$$J(w) = \sum_{i=1}^{n}\big(f(x_i) - y_i\big)^2 + \lambda\sum_{j=1}^{m}w_j^2, \tag{11}$$

where $\lambda$ is a hyper-parameter used to control the strength of the penalty. The larger the $\lambda$, the simpler the generated model.

3.7.3. Random Forest Regression.

RFR uses the bootstrap [39] technique to randomly divide the dataset D into n subsample sets $D_1, D_2, \ldots, D_n$. A CART regression tree is built on each subset and outputs its result, and the final RFR output is the average of all predictions. Since there is no relationship between the regression trees, an increase in the number of trees does not cause the RFR to overfit the data [40]. Furthermore, RFR is insensitive to multicollinearity, and the results are robust to missing and unbalanced data [41]. The 31 features of each subsample set $D_i$ are denoted $A = A_1, A_2, \ldots, A_{31}$. The CART algorithm first sorts the values of feature $A_i$ and then tries each interval between adjacent feature values as a segmentation point S. The set of values on the left side of S is $R_1(A_i, S)$ and that on the right side is $R_2(A_i, S)$ (12); $c_1$ and $c_2$ are the means of the target feature corresponding to $R_1(A_i, S)$ and $R_2(A_i, S)$, respectively (13). The algorithm then finds the S that minimizes the MSE of the split (14) and uses the segmentation point S together with the feature as a node of the tree:

$$R_1(A_i, S) = \{x \mid x_{A_i} \le S\}, \quad R_2(A_i, S) = \{x \mid x_{A_i} > S\}, \tag{12}$$

$$c_m = \operatorname{ave}\big(y_j \mid x_j \in R_m(A_i, S)\big), \quad m = 1, 2, \tag{13}$$

$$\min_{A_i, S}\Big[\sum_{x_j \in R_1}\big(y_j - c_1\big)^2 + \sum_{x_j \in R_2}\big(y_j - c_2\big)^2\Big]. \tag{14}$$

After the algorithm has divided all features, the CART regression tree uses the average of the samples in each leaf node as its output (15) [42]:

$$f(x) = \sum_{m} c_m \, I(x \in R_m). \tag{15}$$
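Scikit-Learn's RandomForestRegressor illustrates the averaging of CART trees on a synthetic step function (all data below are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Bagging of CART regression trees; the forest averages the trees' outputs.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.where(X[:, 0] < 5, 2.0, 8.0) + rng.normal(scale=0.2, size=300)  # step function

rfr = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = rfr.predict([[2.0], [8.0]])
print(np.round(pred, 1))  # close to the step levels 2.0 and 8.0
```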

3.7.4. Extreme Gradient Boosting Regression.
Unlike RFR, which uses the bagging form, XGBR is a boosting ensemble ML algorithm based on the CART regression tree and is the regression implementation of extreme gradient boosting (XGBoost). It uses the second-order Taylor expansion and adds regularization to the objective function, and the algorithm adopts an exact greedy strategy in tree generation [43]. Finally, XGBR uses the sum of the predictions of all regression trees as the output for a sample, as defined in (16) [43]:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(X_i), \tag{16}$$

where $X_i$ is the sample's features and $f_k(X_i)$ is the prediction of the kth tree. The sum over all trees is the predicted value $\hat{y}_i$ of the entire model. Since the algorithm is an additive model, the predicted value after the Kth tree, $\hat{y}_i^{(K)}$, can be expressed by formula (17):

$$\hat{y}_i^{(K)} = \hat{y}_i^{(K-1)} + f_K(X_i). \tag{17}$$

Formula (18) [43] summarizes the objective function:

$$Obj = \sum_{i} l\big(y_i, \hat{y}_i\big) + \sum_{k} \Omega(f_k), \tag{18}$$
where $\sum_k \Omega(f_k) = \sum_{j=1}^{K-1}\Omega(f_j) + \Omega(f_K)$, and $\sum_i l(y_i, \hat{y}_i)$ is the loss function between the predicted and true values, which is the MSE in XGBR. Since the results of the first K − 1 trees are fixed when training the Kth tree, $\sum_k \Omega(f_k)$ reduces to $\Omega(f_K)$ plus a constant.
The Taylor expansion then transforms the objective function on the right side of formula (18) into (19):

$$Obj^{(K)} \approx \sum_{i}\Big[l\big(y_i, \hat{y}_i^{(K-1)}\big) + g_i f_K(X_i) + \frac{1}{2}h_i f_K^2(X_i)\Big] + \Omega(f_K), \tag{19}$$

where $g_i$ and $h_i$ are the first- and second-order derivatives of the loss with respect to $\hat{y}_i^{(K-1)}$.
The term $\sum_i l(y_i, \hat{y}_i^{(K-1)})$ is the sum of the prediction losses of the first K − 1 trees; it does not change when computing the Kth tree and can therefore be ignored.
Likewise, $g_i$ and $h_i$ can be treated as constants. $f_K(X_i)$ represents the prediction of the Kth tree, which also indicates the leaf node of the Kth tree into which sample $X_i$ falls. The function $q(X_i)$ can be defined to map a sample to its leaf node, so $w_{q(X_i)} = f_K(X_i)$ expresses the sample's leaf value. XGBoost defines $\Omega(f_K) = \gamma T + \frac{1}{2}\lambda\sum_{t=1}^{T}\omega_t^2$ as the penalty function (where $\lambda$ represents the penalty intensity and T is the number of leaf nodes) [43]. Formula (19) converts to formula (20) by removing the constant terms and substituting the penalty:

$$Obj^{(K)} = \sum_{j=1}^{T}\Big[\Big(\sum_{i \in I_j} g_i\Big)w_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big)w_j^2\Big] + \gamma T, \tag{20}$$

where $I_j$ is the set of samples assigned to leaf j.
In (20) only $w_j$ is unknown, so the objective function becomes a typical quadratic form. XGBR adopts the CART regression tree, which fixes the tree structure $q(X_i)$. The minimum of the function is then attained at $w_j^* = -\big(\sum_{i \in I_j} g_i\big)/\big(\sum_{i \in I_j} h_i + \lambda\big)$; substituting into formula (20) gives the objective value $-\frac{1}{2}\sum_{j=1}^{T}\big(\sum_{i \in I_j} g_i\big)^2/\big(\sum_{i \in I_j} h_i + \lambda\big) + \gamma T$.
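As a library-agnostic sketch of the boosting-of-regression-trees idea, Scikit-Learn's GradientBoostingRegressor (an analogue of XGBR, not the XGBoost library itself) can be used; the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Boosting: each tree fits the gradient of the loss left by the previous
# trees, and the prediction is the sum over trees.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=400)

gbr = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X, y)
print(round(gbr.score(X, y), 3))  # R^2 on the training data
```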

3.7.5. Light Gradient Boosting Machine.
Microsoft released an upgraded version of XGBoost named LightGBM in 2017. The LightGBM used in this article relies on the histogram algorithm to reduce the number of candidate split points and the Exclusive Feature Bundling (EFB) algorithm to reduce the number of features [44]. The histogram algorithm discretizes continuous floating-point feature values into k integers and constructs a histogram of width k. The algorithm counts the floating-point values falling into each bin, using the k discretized values as indices, and then traverses the discretized values to find the optimal segmentation point. XGBoost traverses all floating-point values, while LightGBM only traverses k values thanks to the histograms. EFB analyzes the overlap between sparse features: when two features rarely take nonzero values at the same time (i.e., they rarely conflict), EFB bundles them into a single feature. EFB reduces the feature dimension in this way to speed up training. Hence LightGBM runs more efficiently on large-scale data. With the same performance as XGBR, LightGBM trains up to 10x faster and consumes less memory [44].

3.8. Model Validation.
Although the dataset has more than 2 million instances, the model is still at risk of overfitting. Additionally, the training process must avoid the information leakage caused by using the test set multiple times. Based on these factors, the validation process divides the dataset into a training set and a test set in a 99:1 ratio. The training set is then used for 10-fold cross-validation, and the test set checks the model performance. The entire validation process uses the training set ten times but the test set only once. The 10-fold cross-validation method alleviates overfitting and information leakage [45]. The reason for choosing 10 folds is that the estimate of prediction error is almost unbiased [46]. The 10-fold method trains the model ten times on different 90% portions of the training set, and the remaining 10% measures the model performance.

3.9. Performance Measurement.
The model in this study attempts to solve a regression problem, for which model performance is usually measured by MSE and R2. The closer the MSE value is to 0, the smaller the gap between the predicted and actual values. Formula (21) [47] calculates the MSE by subtracting each prediction from the truth, summing the squared differences, and dividing by their total number:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2, \tag{21}$$
where $y_i$ represents the actual value, $\hat{y}_i$ represents the predicted value, and n represents the total number of instances.
When the dimensions differ, MSE does not say much about regression performance with respect to the distribution of the ground-truth values. The R2 score, however, does not have the interpretability limitations of MSE and is more informative and truthful [48]. The value of the R2 score is between −∞ and 1; R2 = 1 indicates the predicted values are identical to the actual values. Hence, the closer the score is to 1, the better the model performance. Formula (22) [48] defines the calculation of R2:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2}{\sum_{i=1}^{n}\big(y_i - \bar{y}\big)^2}, \tag{22}$$
where the numerator of the rightmost fraction corresponds to the MSE and the denominator to the variance of the actual values.
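Formulas (21) and (22) can be checked against the Scikit-Learn metric implementations on a few invented values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.0, 8.0])
y_pred = np.array([2.5, 5.0, 3.0, 7.0])

# Manual formulas (21) and (22) against the library implementations.
mse = np.mean((y_true - y_pred) ** 2)
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(round(mse, 3), round(r2, 3))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```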

Model Processing.

The dataset retains 2304296 instances with 53 features after preprocessing, feature selection, and feature encoding. This study builds the model using the Scikit-Learn ML toolkit on the Python platform with 8 cores and 16 GB RAM. To ensure reproducible results, all steps involving random processes set the random seed to 0. The hyper-parameter λ in the LR and RR models has the highest impact on performance; this study uses the toolkit's default penalty coefficient λ = 1. RFR, XGBR, and LightGBM are all tree-type models, and the hyper-parameter that most affects their performance is the number of CART regression trees (n_estimators). The more trees, the better the model may perform, but the computing cost rises accordingly. The default is n_estimators = 100 for RFR; to facilitate horizontal comparison of the three models, XGBR and LightGBM were set to n_estimators of 500 and 25000, respectively, targeting the same order of magnitude of fitting time (Table 7). In particular, the LightGBM algorithm can set the ratio of features dropped at each iteration to prevent overfitting, which is 0.6 in this study. Table 7 shows the performance of models built with the LR, RR, RFR, XGBR, and LightGBM algorithms, with the results of Siddiqa et al. [16] listed side-by-side as a control. In the models of this study, the MSE (5.882) and R2 (0.675) metrics of LR on the test set are the worst, and its training time (3.654s) is also longer than that of the other linear algorithm, RR (1.653s). The RR algorithm (MSE = 5.680 and R2 = 0.702) outperforms LR by a small margin, but the performance of both linear algorithms is far from satisfactory. The RFR- and XGBR-based models achieved MSE scores of 2.295 and 2.287, and their R2 scores are both 0.958 on the test set, which is close to ideal.
Their single-fold fitting times are 946.465s and 900.799s, respectively. However, the LightGBM algorithm surpasses them in fitting time (874.331s), MSE (2.231), and R2 (0.960), performing best among the tree-type models.

Discussion.
The LR (R2 = 0.675) and RR (R2 = 0.702) models based on linear algorithms are far from ideal, which means that the dataset used in this study tends to be nonlinear, and linear algorithms are difficult to apply in practice to LOS prediction. However, the three tree-type models (RFR, XGBR, and LightGBM) performed very well, especially the LightGBM model. Its R2 score of 0.960 is a 4.4% improvement over the best-performing RFR model (MSE of 5 and R2 of 0.92) in the past study [16] used as the control group, while its MSE score of 2.231 is a relative decrease of 55.4%. The XGBR and RFR models in this study ranked second and third in performance, with MSE scores 2.5% and 2.8% higher, respectively, and R2 scores 0.2% lower than those of the best model.
Compared with the previous study [16], the encoding method in this study differs substantially. The models composed of the LR, RR, RFR, and XGBR algorithms have significantly lower MSE scores (decreased by 86.3%, 85.2%, 54.1%, and 59.3%, respectively) after using the proposed encoding method, and their R2 scores are improved (by 117.7%, 89.2%, 4.1%, and 5.5%).
The LightGBM model using the proposed encoding also reduces the MSE score by at least 0.76% compared to label encoding, count encoding, target encoding, and leave-one-out encoding, with R2 scores improved by at least 0.1%. The model in this study can help hospitals estimate the LOS of a patient, and the data needed to construct the model are only some prehospital diagnostic characteristics of the patient, reducing the threshold for actual deployment and increasing practicality. In addition, the modeling process balances the conflict between the curse of dimensionality and information retention. Even with millions of instances, the model can be trained and deployed quickly on a personal computer. However, model performance is highly correlated with the "Total Charges" and "Total Costs" features. While "Total Charges" can be obtained when the patient is admitted to the hospital, "Total Costs" must be estimated from the doctor's experience and other information about the patient. Uncertainty in this estimate may affect model performance in reality.

Conclusions
The objective of this study was to construct a model to predict hospital LOS by exploring the prehospital diagnostic information of potential inpatients. Many ML algorithms such as RFR, LR, RR, XGBR, MLP, and DTR have been investigated in recent studies for regression prediction of LOS. Ultimately, the performance of the models constructed with these algorithms can hardly meet the requirements of actual deployment: the linear models are not suitable for predicting LOS, and the tree models overfit noticeably. This study proposed a model using one-hot encoding plus label encoding combined with the LightGBM algorithm to investigate how to improve the accuracy of LOS prediction. The model is based on the 2017 dataset provided by the New York State Department of Health. The average LOS for patients in this dataset is 5.38 days; most patients stay in the hospital for 1-5 days for minor illnesses, and more than 70% of illness types are medical. There was no significant difference in LOS between men and women, but patients over 50 spent more time in the hospital. This study used a hybrid feature encoding approach to improve LOS prediction performance, and feature selection was also computed, comparing correlation scores to remove features that did not contribute positively to prediction. The results of the correlation analysis showed that "Total Charges" and "Total Costs" were the features most associated with LOS. The proposed model ultimately extends the results of related studies, with MSE and R2 achieving the best scores of 2.231 and 0.960, respectively, substantially better than the previous study. In the future, the problem of model overfitting still deserves more research to obtain higher accuracy in predicting LOS.
Data Availability

The data that support the findings of this study are publicly available at https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/gaf8-ac33.