A Long Short-Term Memory Ensemble Approach for Improving the Outcome Prediction in Intensive Care Unit

In the intensive care unit (ICU), predicting patient mortality is essential, and mathematical models help improve prognostic accuracy. Recently, recurrent neural networks (RNNs), especially long short-term memory (LSTM) networks, have shown advantages in sequential modeling and are promising for clinical prediction. However, ICU data are highly complex owing to the diverse patterns of diseases; therefore, instead of a single LSTM model, an ensemble algorithm of LSTM (eLSTM) is proposed, which uses the ensemble framework to handle the diversity of clinical data. The eLSTM algorithm was evaluated on a widely used database of ICU admissions, the Medical Information Mart for Intensive Care III (MIMIC-III). An investigation of 18415 cases in total shows that, compared with the clinical scoring systems SAPS II, SOFA, and APACHE II, the random forests classification algorithm, and the single LSTM classifier, the eLSTM model achieved the best performance, with the largest area under the receiver operating characteristic curve (AUROC) of 0.8451 and the largest area under the precision-recall curve (AUPRC) of 0.4862. Furthermore, it offered an early prognosis of ICU patients. The results demonstrate that the eLSTM is capable of dynamically predicting the mortality of patients in complex clinical situations.


Introduction
Mortality prediction is essential for clinical administration and treatment, especially in the intensive care unit (ICU) [1,2]. Various scoring systems have been developed and widely used for assessing the clinical outcome, and the most common ones are the simplified acute physiology score (SAPS) II [3], the sequential organ failure assessment (SOFA) [4], and the acute physiology and chronic health evaluation (APACHE) II [5]. Scoring systems assess the patients' mortality with a logistic regression model that assumes a linear and additive relationship between the severity of the disease and the collected physiological parameters, an assumption that is practicable but unrealistic [6]. In recent years, machine learning has been introduced in medical applications and has shown remarkable efficiency in clinical diagnosis and decision support. For admitted ICU patients, many physiological measurements are collected, including symptoms, laboratory tests, and vital signs (such as heart rate, blood pressure, and respiratory rate) [7,8]. The clinical measurements are continuously monitored in the ICU, their values fluctuate as time progresses, and the temporal trends are predictive of mortality [9]. Hence, the sequence of clinical records offers rich information on the patients' physical condition [10,11] and enables the use of machine learning to develop prognosis models from these multivariate time series data. As a decision task, mortality prediction can be solved by classification algorithms such as logistic regression, support vector machine, and random forests (RF) [12]. However, most of the methods currently used are not sensitive to the temporal links among sequential data and thus cannot take full advantage of the ICU data, which limits their performance in mortality prediction [10,13].
Recurrent neural networks (RNNs) have been widely employed to solve time series prediction problems and have achieved prominent results in many fields [14-19]. Several variants of RNN have been developed, and among them, the long short-term memory (LSTM) network is one of the most popular [20]. LSTM learns long-term dependencies by incorporating a memory cell that is able to preserve state over time. Three gates are equipped in LSTM for deciding which information to summarize or forget before moving on to the next subsequence [21-23]. LSTM is well suited to capturing sequential information from temporal data and has shown advantages in machine translation [24,25], speech recognition [19], and image captioning [26], among others. In the medical domain, many efforts have been made to apply LSTM to clinical prediction based on electronic health records [6,17,27-30]. Lipton et al. employed LSTM on a collection of 10,401 episodes to establish a model for phenotype classification [28]. Given 13 frequently sampled clinical measurements (diastolic and systolic blood pressure, peripheral capillary refill rate, end-tidal CO2, fraction of inspired O2, Glasgow coma scale, blood glucose, heart rate, pH, respiratory rate, blood oxygen saturation, body temperature, and urine output), the LSTM model was able to predict whether the patient suffered from any of the 128 most common conditions, such as acute respiratory distress, congestive heart failure, and renal failure. Jo et al. used LSTM and a latent topic model to extract information from textual clinical notes for assessing the severity of diseases [29]. Pham et al. conducted experiments on a diabetes cohort of 7191 patients with 53208 admissions collected in 2002-2013 from a large regional Australian hospital, and the results showed improved performance when using LSTM for disease progression modeling and readmission prediction [31].
For ICU mortality prediction, current prognosis models mostly employ a single LSTM classifier [6,29,30]. However, in most cases, a single model is not sufficient to handle the complex situation in the ICU. Patients in the ICU are heterogeneous, suffering from different diseases with multiple concurrent problems, and the clinical data in the ICU are highly complex [9,32,33]. For patients with various diseases, the underlying pathophysiologic evolutions (e.g., kidney failure) are usually manifested through different sets of physiologic variables (e.g., abnormalities in glomerular filtration rate and creatinine) [9]. Even patients with the same disease might have different comorbidities and experience heterogeneous health conditions [33]. Thereby, hybrid learners are required for the prediction model in the ICU.
An ensemble learner in principle has a stronger generalization ability than a single learner [34-37]. Ensemble learning is a procedure that integrates a set of models for a given problem to obtain one composite prediction [38-43]. Diverse classifiers are constructed to learn multiple hypotheses, and the resulting predictions are aggregated to solve the same problem. In contrast to a stand-alone model, which builds one hypothesis space, a combination of several models can expand the space and may provide a more exact approximation to the true hypothesis [34]. It has been shown that ensemble systems outperform single-classifier systems in solving complex problems [34,38,39]. Therefore, we propose an ensemble algorithm of multiple long short-term memory networks (eLSTM) to deal with the complex situation in the ICU. In eLSTM, the diversity of the LSTM models stems from the variety of the subsets used to build them. Two strategies are employed to produce different subsets from the entire training data, namely, bootstrapped samples and random feature subspace. The bootstrapped samples strategy generates various subsets of subjects, while the random feature subspace provides different combinations of clinical indicators. That is, the subsets differ from each other at both the instance and the feature level. A variety of LSTM classifiers are trained accordingly, and the final score is computed as the average of the predicted values from all base learners. In short, the eLSTM algorithm selects a number of training subsets using bootstrapped instances with randomly chosen feature sets, constructs multiple LSTM learners on these subsets, and averages the predicted scores of all individual learners as the final output.
The main contributions of this work are as follows: (1) proposing an LSTM ensemble framework to develop a hybrid sequential classification model that is able to handle complex clinical situations such as the ICU and (2) applying bootstrapped samples and random feature subspace to individual LSTM classifiers to create diversity in the ensemble. The present model will promote the application of machine learning in complex clinical situations. The rest of this paper is organized as follows. Section 2 describes the ICU dataset, the implementation of the proposed eLSTM algorithm, and the experimental design. The empirical results yielded by the various systems for mortality prediction are presented in Section 3. The advantages of eLSTM are discussed in Section 4. Finally, Section 5 concludes this paper and indicates future work.

Dataset.
The ICU data for this work were extracted from the Medical Information Mart for Intensive Care III (MIMIC-III) database [44]. MIMIC-III is a large, publicly available database of ICU admissions at the Beth Israel Deaconess Medical Center, USA, from 2001 to 2012. It comprises rich clinical data of patients, including laboratory tests and vital signs. A total of 18415 patients with age >15 years and length of stay ≥10 days were extracted from the MIMIC-III database. The prediction task is 28-day postadmission mortality. The study population consists of 2162 subjects in the positive group, who died within 28 days after ICU admission, and 16253 subjects in the negative group, who survived 28 days after ICU admission. From the tables LABEVENTS.csv and CHARTEVENTS.csv, 50 variables over 10 consecutive days (denoted as D1, D2, ..., D10) are recorded for mortality prediction. The variables are sampled every 24 hours. These variables are commonly used clinical measurements, and the details are listed in Table 1.

LSTM Ensemble Algorithm.
Ensemble methods generate multiple learners and aggregate them to provide a composite prediction. Among them, the Bagging and Boosting methods are the most popular. The diversity of the individual learners is an important issue for an ensemble model; it can be achieved by selecting and combining the training examples or the input features, or by injecting randomness into the learning algorithm [34,36]. The proposed eLSTM algorithm is an ensemble method that uses LSTM as the base learner. Two random strategies are employed to produce different training subsets, from which a number of base LSTM classifiers are constructed. All predictions are integrated to give a comprehensive estimate of the outcome.
Given a training set with $N$ training instances, each instance can be represented as $(V, Y)$. $V$ is a matrix containing the values of $D$ variables over $T$ time steps, as expressed in equation (1), and each $X_t$ is a vector given in equation (2):

$$V = [X_1, X_2, X_3, \ldots, X_t, \ldots, X_T], \tag{1}$$

$$X_t = [x_t^1, x_t^2, \ldots, x_t^d, \ldots, x_t^D], \tag{2}$$

where $x_t^d$ represents the value of the $d$-th variable at the $t$-th time step. $Y$ is the target label for the instance, taking 0 (negative) for survival and 1 (positive) for death. The ratio of the negative group size to the positive group size is denoted as $c$.

LSTM has the advantage of capturing temporal information and is widely adopted in time series modeling. The detailed structure of the LSTM block is illustrated in Figure 1.
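The input of the LSTM block is $X_t$. Then, the output of the hidden layer, namely, the current hidden state $h_t$, is computed in the standard LSTM form as follows:

$$\begin{aligned}
f_t &= \sigma(w_f \cdot [h_{t-1}, X_t] + b_f), \\
i_t &= \sigma(w_i \cdot [h_{t-1}, X_t] + b_i), \\
o_t &= \sigma(w_o \cdot [h_{t-1}, X_t] + b_o), \\
C_t &= f_t * C_{t-1} + i_t * \tanh(w_c \cdot [h_{t-1}, X_t] + b_c), \\
h_t &= o_t * \tanh(C_t),
\end{aligned} \tag{3}$$

where $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, respectively; $h_{t-1}$ is the previous hidden state; and $C_{t-1}$ and $C_t$ are the previous and current cell memories. The weight matrices $w_f$, $w_i$, $w_o$, and $w_c$ and the bias vectors $b_f$, $b_i$, $b_o$, and $b_c$ are model parameters. The symbol $\sigma$ is the sigmoid function and $\tanh$ the hyperbolic tangent function. The symbol $\cdot$ denotes matrix multiplication and $*$ the elementwise product.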
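The gate computations described above can be sketched in a few lines of NumPy. This is a minimal illustration of one standard LSTM time step, not the authors' implementation: the function names `lstm_step` and `init_params`, the concatenated-input layout, and the small Gaussian initialization are assumptions made for the example (the paper itself uses Glorot uniform initialization).

```python
import numpy as np


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One standard LSTM step: returns the new hidden state and cell memory."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, X_t]
    f_t = sigmoid(params["w_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["w_i"] @ z + params["b_i"])       # input gate
    o_t = sigmoid(params["w_o"] @ z + params["b_o"])       # output gate
    c_tilde = np.tanh(params["w_c"] @ z + params["b_c"])   # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                     # elementwise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


def init_params(n_inputs, n_hidden, seed=0):
    """Hypothetical initializer (small Gaussian weights, zero biases)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fioc":
        params["w_" + gate] = rng.normal(0.0, 0.1, (n_hidden, n_hidden + n_inputs))
        params["b_" + gate] = np.zeros(n_hidden)
    return params


# Run a synthetic 10-day sequence of 25 variables through a 64-unit block.
params = init_params(n_inputs=25, n_hidden=64)
h, c = np.zeros(64), np.zeros(64)
sequence = np.random.default_rng(1).normal(size=(10, 25))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
```

After the loop, `h` holds the hidden state at the final time step, which is the quantity the classifier's sigmoid layer operates on.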
A sigmoid layer is applied to the output of the LSTM block at the final time step for binary classification. The predicted score y is computed as in equation (4). The loss function is the weighted cross entropy between the true label and the predicted score y, with positive instances weighted by c and negative ones weighted by 1. The parameters of the network are updated over several iterations to minimize the loss value. The eLSTM model is composed of multiple LSTM classifiers, and its architecture is illustrated in Figure 2.
The procedure of eLSTM consists of two stages: base learner generation and integration.
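To make the class weighting concrete, the following sketch computes a weighted cross entropy in which positive instances carry weight c and negative instances weight 1, as described above. The function name `weighted_cross_entropy`, the averaging over instances, and the clipping constant `eps` are assumptions for illustration; the value of c is computed here from the cohort counts given in the Dataset section, whereas in practice it would be computed on the training split.

```python
import math


def weighted_cross_entropy(y_true, y_score, c, eps=1e-12):
    """Mean cross entropy with positives weighted by c and negatives by 1."""
    total = 0.0
    for y, s in zip(y_true, y_score):
        s = min(max(s, eps), 1.0 - eps)  # keep log() finite
        total += -(c * y * math.log(s) + (1 - y) * math.log(1.0 - s))
    return total / len(y_true)


# Ratio of negative to positive group size in the study population.
c = 16253 / 2162

# A missed death (label 1, low score) costs about c times more than an
# equally confident false alarm (label 0, high score).
loss_missed_death = weighted_cross_entropy([1], [0.1], c)
loss_false_alarm = weighted_cross_entropy([0], [0.9], c)
```

With roughly 7.5 survivors per death in this cohort, the weighting keeps the minority class from being ignored during training.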
In the stage of base learner generation, the bootstrap sampling strategy [37] and the random subspace method (RSM) [35] are both employed to generate different training subsets for constructing diverse base learners. As a training set sampling method, bootstrap sampling randomly draws instances with replacement from the whole training set, while RSM randomly chooses a subset of variables. The subsets resulting from different bootstrapped instances with randomly selected variables are denoted as $\{\mathrm{Subset}_1, \mathrm{Subset}_2, \ldots, \mathrm{Subset}_p, \ldots, \mathrm{Subset}_P\}$. In an ensemble model, bias control rather than error control is generally adopted when training the multiple base classifiers, which benefits the diversity of the model. Thus, an appropriate number of training epochs for the classifiers is selected by experiments under a satisfactory level of bias. The variance of the model due to the diversity of the individual classifiers is controlled by the subsequent ensemble operation [45,46]. For eLSTM, the number of training epochs was set to 100, which was validated by pre-experiments. Then, multiple LSTM classifiers learn from the subsets. Let $F_1, F_2, \ldots, F_P$ denote the set of $P$ trained base classifiers. For the input $V$, the $p$-th LSTM classifier gives an individual predicted score $y^{(p)}$, as expressed in equation (5):

$$y^{(p)} = F_p(V). \tag{5}$$
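Finally, in the integration stage, the scores of all $P$ LSTM classifiers are averaged to give the overall output, calculated as follows:

$$y = \frac{1}{P} \sum_{p=1}^{P} y^{(p)}, \tag{6}$$

where $y^{(p)}$ is the predicted score of the $p$-th classifier. The procedure of the eLSTM algorithm is provided in Figure 3.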
Once the eLSTM model is trained, it is applied as follows: for a given instance, each LSTM classifier uses only the values of its own variable subset and makes a prediction; different LSTM classifiers use different sets of variables, producing multiple prediction scores; and the final prediction is obtained by averaging all scores.
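The two stages described above (subset generation by bootstrap sampling plus random subspace, then score averaging) can be sketched without any deep learning library. The names `make_subsets` and `ensemble_score` are hypothetical, the bootstrap sample size is assumed to equal the training set size, and `stub_predict` is only a stand-in for a trained LSTM classifier.

```python
import math
import random


def make_subsets(n_instances, n_variables, n_learners, subset_size, seed=0):
    """Draw, for each base learner, bootstrapped row indices and a random variable subset."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_learners):
        rows = rng.choices(range(n_instances), k=n_instances)       # with replacement
        cols = sorted(rng.sample(range(n_variables), subset_size))  # without replacement
        subsets.append((rows, cols))
    return subsets


def ensemble_score(instance, learners):
    """Average the scores of all base learners.

    instance: list of T time steps, each a list of D variable values.
    learners: list of (predict_fn, cols) pairs; each learner sees only its columns.
    """
    scores = []
    for predict, cols in learners:
        view = [[step[d] for d in cols] for step in instance]
        scores.append(predict(view))
    return sum(scores) / len(scores)


def stub_predict(view):
    """Stand-in for a trained LSTM: squashes the mean input into (0, 1)."""
    flat = [v for step in view for v in step]
    return 1.0 / (1.0 + math.exp(-sum(flat) / len(flat)))


# Settings from the experiment design: 200 learners, 25 of 50 variables each.
subsets = make_subsets(n_instances=100, n_variables=50, n_learners=200, subset_size=25)
learners = [(stub_predict, cols) for _, cols in subsets]
patient = [[0.1 * d for d in range(50)] for _ in range(10)]  # synthetic 10 x 50 record
score = ensemble_score(patient, learners)
```

In a real pipeline, each `(rows, cols)` pair would select the training data for one LSTM, and the stored `cols` would be reused at prediction time so that every learner receives the variables it was trained on.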

Dynamic Prediction.
For the LSTM and eLSTM models, the full sequence of data is needed to predict the outcome. However, in practice, the patients' physiological parameters are collected day by day. To develop a dynamic procedure that provides a daily prediction, the values for the coming days are padded with the latest available data to obtain complete sequences. Then, the LSTM algorithm and the eLSTM algorithm are applied to the completed dataset to predict the outcome.
Thus, the mortality assessment is updated daily, and it becomes more reliable as more observed data replace the padded values. The process is illustrated in Figure 4.
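The padding step can be illustrated with a short helper. This is a sketch under the assumption that "latest available data" means repeating the most recent observed day for every remaining day; the name `pad_with_latest` is hypothetical.

```python
def pad_with_latest(observed_days, total_days=10):
    """Extend a partial record to total_days by repeating the last observed day.

    observed_days: list of daily measurement vectors collected so far.
    """
    if not observed_days:
        raise ValueError("at least one observed day is required")
    padded = [list(day) for day in observed_days[:total_days]]
    while len(padded) < total_days:
        padded.append(list(padded[-1]))  # copy, so rows stay independent
    return padded


# On day 3, only D1..D3 are known; D4..D10 are filled with the D3 values.
record_day3 = [[80.0, 120.0], [85.0, 118.0], [90.0, 110.0]]
full_sequence = pad_with_latest(record_day3, total_days=10)
```

Each day the helper is called again with one more observed row, so the padded portion shrinks until the full 10-day sequence consists of real measurements.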

Experiment Design.
The proposed eLSTM algorithm is compared with three scoring systems (SAPS II, SOFA, and APACHE II), the RF algorithm, and the LSTM classifier. In the LSTM classifier, a sigmoid layer is applied on top of the LSTM block for binary classification. The LSTM block has one hidden layer with 64 hidden units, and dropout with a rate of 0.5 is applied to the input layer. The weight parameters are initialized randomly using Glorot uniform initialization [47]. The LSTM model is trained with the Adam optimizer with a learning rate of 0.01 for a maximum of 100 epochs. 10% of the training data are used as a validation set to find the best epoch. In the eLSTM algorithm, there are two important hyperparameters: the number of base LSTM classifiers and the size of the variable subset. Considering the running time, the number of base LSTM classifiers in the current work is set to 200. Half of the variables are randomly chosen to construct each individual classifier, as recommended in the literature [35]. Thus, 200 individual LSTM-based classifiers are trained on resampled instances with 25 randomly selected variables each. In addition, dynamic prediction by the RF algorithm is realized by training 10 models on the data of the first 1, 2, ..., 10 days, respectively. The experiment is repeated 50 times. For each experiment, 90% of the dataset is chosen as training data and the remaining 10% as test data. Before the training procedure, the data are preprocessed by imputation and normalization. The missing values are filled by linear interpolation, assuming that the variable with missing data develops linearly in time [48]. Then, all the variables are normalized by subtracting the means and dividing by the standard deviations computed across the training data.
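The two preprocessing steps described above can be sketched in plain Python. The helper names are hypothetical, and two details are assumptions not stated in the text: gaps at the start or end of a series copy the nearest observed value, and the population standard deviation is used for normalization.

```python
import statistics


def interpolate_linear(series):
    """Fill None entries by linear interpolation between observed neighbours."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:          # leading gap: copy first observation (assumption)
            filled[i] = series[right]
        elif right is None:       # trailing gap: copy last observation (assumption)
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled


def fit_standardizer(train_values):
    """Return the mean and standard deviation computed on training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid division by zero for constant variables
    return mean, std


def standardize(values, mean, std):
    """Apply training-set statistics to any split (training, validation, or test)."""
    return [(v - mean) / std for v in values]


heart_rate = interpolate_linear([80.0, None, 90.0, None])
mean, std = fit_standardizer([1.0, 2.0, 3.0])
scaled = standardize([2.0, 3.0], mean, std)
```

Fitting the statistics on the training split and reusing them on the test split, as the text specifies, avoids leaking test-set information into the model.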
To compare the performance of these models, several metrics are computed from the predicted scores and the true labels. The receiver operating characteristic (ROC) curve and the precision-recall curve are plotted to evaluate the performance of the classifiers. The ROC curve uses 1 − specificity as the x-axis and sensitivity as the y-axis for all potential thresholds, while the precision-recall plot uses recall as the x-axis and precision as the y-axis.
The area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPRC) are calculated for comparison. Moreover, the discrepancy between the predicted class labels and the true labels is comprehensively measured by sensitivity/recall, specificity, accuracy, precision, and F1 score. Sensitivity/recall is the proportion of actual positive cases that are correctly classified as positive, while precision is the proportion of true-positive cases among the cases classified as positive. The F1 score is the harmonic mean of recall and precision.
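The threshold-based metrics defined above, together with AUROC, can be computed directly from their definitions. The following is a small self-contained sketch with hypothetical function names; the AUROC is computed as the fraction of positive-negative pairs ranked correctly (ties count one half), which is equivalent to the area under the ROC curve, and the threshold of 0.35 is arbitrary.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity/recall, specificity, accuracy, precision, and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(y_true)
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }


def auroc(y_true, y_score):
    """Probability that a random positive case scores higher than a random negative case."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0, 0]
y_score = [0.9, 0.4, 0.5, 0.1, 0.3]
y_pred = [1 if s >= 0.35 else 0 for s in y_score]  # arbitrary threshold
metrics = classification_metrics(y_true, y_pred)
area = auroc(y_true, y_score)
```

Because precision and F1 depend on the number of false positives relative to true positives, they drop quickly when negatives far outnumber positives, even for a classifier with a high AUROC.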

Mortality Prediction Performance.
The ROC curves and precision-recall curves of all models are shown in Figures 5 and 6. The eLSTM model attains the largest AUROC of 0.8505 and the largest AUPRC of 0.45.
Detailed statistical results of the repeated experiments are given in Table 2. The ANOVA test shows significant differences in AUROC, AUPRC, sensitivity/recall, specificity, accuracy, precision, and F1 among the methods (p < 0.001). It can be seen that the RF, LSTM, and eLSTM models have much better performance than the three scoring systems. Post hoc analysis by the Dunnett test shows that the differences in AUROC, AUPRC, and sensitivity between eLSTM and the other methods are significant (p < 0.05). Overall, the eLSTM model obtains significantly the largest values of AUROC, AUPRC, and sensitivity. All methods have low precision and F1 score. This is mainly due to the imbalanced distribution of the class label; that is, the number of negative instances is much larger than that of positive ones.

Figure 7 shows the time course of mortality prediction from one to ten days after admission. As the available data are updated daily, the AUROC values of all systems keep rising, and throughout the whole procedure, the AUROC values of eLSTM, LSTM, and RF are higher than those of the three scoring systems. From the third day, the eLSTM holds the highest value until the end of the records. ANOVA followed by the Dunnett test shows that the AUROC value of the eLSTM model is significantly higher than that of the LSTM and RF models (eLSTM vs. LSTM: p = 0.011; eLSTM vs. RF: p < 0.001). The charts also clearly reveal that while RF, LSTM, and the three scoring systems reach their highest performance on the last day, eLSTM achieves the corresponding levels at least 6 days earlier than the scoring systems and 2 and 1 days earlier than RF and LSTM, respectively. These facts demonstrate that eLSTM has a stronger ability for dynamic prediction as well as early prognosis than the others.

Figure 8 shows that AUPRC follows a trend similar to that of AUROC as the data are updated. The eLSTM model attains the largest AUPRC of 0.5 among all methods. ANOVA followed by the Dunnett test shows that the AUROC value of the eLSTM model is significantly higher than that of LSTM and RF (eLSTM vs. LSTM: p = 0.043; eLSTM vs. RF: p < 0.001).

Influence of the Number of LSTM Classifiers in eLSTM.
The AUROC value of eLSTM goes up as the number of base LSTM classifiers increases (Figure 9). It rises steeply when fewer than 40 LSTM classifiers are integrated, then keeps rising moderately, and finally reaches a plateau once more than 100 classifiers are involved. A similar pattern is observed for the AUPRC (Figure 10).

Influence of the Size of Variable Subset in eLSTM.
The ANOVA test indicates that the size of the variable subset in eLSTM models leads to significant differences in AUROC as well as in AUPRC (AUROC: F = 45.932, p < 0.001; AUPRC: F = 7.079, p = 0.002). The AUROC values are similarly high for eLSTM with sets of 16, 25, or 32 variables (Figure 11), and eLSTM achieves the largest AUPRC when the size of the variable subset is 16, 25, or 32 (Figure 12). Pairwise comparison by the Tukey test shows that the AUROC and AUPRC values of eLSTM models trained on sets of 16, 25, and 32 variables are significantly higher than those of models trained on 8 and 50 variables (p < 0.05), while there are no significant differences among the models with sets of 16, 25, and 32 variables. In this work, the size of the variable subset was set to the median value of 25, which agrees with the recommendation in the literature [35].

Discussion
It is worth noting that the RF, LSTM, and eLSTM algorithms exhibit much better performance than the SAPS II, SOFA, and APACHE II scoring systems (Table 2). This indicates that data-driven mathematical models may help improve mortality prediction in the ICU and, further, other clinical tasks. Different models serve different purposes and situations. The present work demonstrates that, in dynamic prediction, LSTM and eLSTM are superior to the RF algorithm. RF is commonly considered an easy-to-use algorithm for decision making. However, it is not sensitive to the time course, which makes it weak at exploiting the temporal information in series data. In the LSTM block, by contrast, the values at previous time steps influence the following time steps; hence, the LSTM block is capable of capturing the temporal trends of the data and is suitable for time series modeling. Moreover, as the input data are updated, the predictive ability of LSTM improves continuously. In other words, LSTM has an advantage in dynamic prediction.
The results demonstrate that, in general, the eLSTM algorithm outperforms a single LSTM classifier. Also, Figures 7 and 8 show that the eLSTM model performs much better in early prediction than LSTM. This can be explained as follows: instead of the single hypothesis space of one LSTM classifier, the eLSTM algorithm generates multiple base learners that expand the hypothesis space, which leads to a better approximation to the true hypothesis. The proposed eLSTM algorithm successfully handles clinical time series data in the ICU and provides a unified model for predicting the mortality of ICU patients. In the ICU, patients suffer from various diseases. Johnson et al. summarized the distribution of primary International Classification of Diseases (ICD) codes in the entire MIMIC-III database [44]: the most common ones in the ICU are infectious and parasitic diseases (ICD-9: 001-139); neoplasms of digestive organs, intrathoracic organs, etc. (ICD-9: 140-239); endocrine, nutritional, metabolic, and immunity diseases (ICD-9: 240-279); diseases of the circulatory system (ICD-9: 390-459); pulmonary diseases (ICD-9: 460-519); diseases of the digestive system (ICD-9: 520-579); diseases of the genitourinary system (ICD-9: 580-629); trauma (ICD-9: 800-959); and poisoning by drugs and biological substances (ICD-9: 960-979). Patients admitted to the ICU are usually diagnosed with more than one kind of disease, i.e., a syndrome. The physiological statuses of the patients are complex, and thus, it is difficult for a single learner to discover the patterns of the patients represented by the recorded parameters. Consequently, in previous relevant studies, the mathematical models in the ICU were usually designed for a single specific disease, such as heart failure or sepsis [49-53], and at present, a universal quantitative mortality prediction approach covering all ICU patients is lacking. The diversity of the eLSTM is achieved by employing the bagging and RSM algorithms.
In the construction of the base learners, bootstrap sampling and RSM ensure that the learners are devoted to various patients and diseases. For model training, bootstrap sampling of the ICU data produces divergent datasets of patients with different disease distributions. Meanwhile, RSM assembles different sets of physiological variables for representing the patients' status. These procedures broaden the view of the ICU data at both the instance and the feature level and therefore yield dissimilar base LSTM classifiers. In this work, the setting of 25 variables in the model gives the best performance (Figures 11 and 12). While too few variables would greatly decrease the base learners' classifying capacity, redundant variables would damage the learners' diversity. The result is consistent with the previous finding [35]. Moreover, as part of the bagging strategy at the output end of the model, the individual base learners are integrated to give a comprehensive and clear picture of the ICU patients' general condition. Owing to the individual learners' classifying capacity and the ensemble learning ability of the model, the proposed eLSTM algorithm is capable of capturing the complex relationships among the diseases and parameters in ICU data, thus enhancing the outcome prediction.

Conclusion
In this paper, we propose a new approach named eLSTM, which can deal with complex and heterogeneous ICU data for mortality prediction. The proposed eLSTM model obtains its prediction by merging the results of multiple parallel LSTM classifiers. The base LSTM learners are trained on different subsets, which are generated using bootstrapped samples and random feature subspace. Experimental results show that the proposed eLSTM algorithm effectively exploits the ensemble framework with LSTM classifiers and achieves excellent performance on the extracted MIMIC-III dataset. Also, it provides an early prognosis of ICU patients. The eLSTM model is a promising candidate for a universal quantitative tool for assessing the risks of all patients in the ICU and even in other complex clinical situations. In future work, other approaches to aggregating the component classifiers are worth investigating to optimize the structure as well as the algorithm.

Data Availability
The data used to support the findings of this study are available at the MIMIC-III website (https://physionet.org/physiobank/database/mimic3cdb/).

Conflicts of Interest
The authors declare no conflicts of interest.