Internal Validation of the Predictive Performance of Models Based on Three ED and ICU Scoring Systems to Predict Inhospital Mortality for Intensive Care Patients Referred from the Emergency Department

Background. A variety of scoring systems have been introduced for risk stratification and mortality prediction in both the emergency department (ED), such as the WPS, REMS, and MEWS, and the intensive care unit (ICU), such as the APACHE II, SAPS II, and SOFA. However, the performance of these models in the ICU remains unclear, and we aimed to evaluate and compare their performance in the ICU.
Methods. This multicenter retrospective cohort study was conducted on severely ill patients admitted to the ICU directly from the ED in seven tertiary hospitals in Iran from August 2018 to August 2020. We evaluated all models in terms of discrimination (AUROC), the balance between positive predictive value and sensitivity (AUPRC), calibration (Hosmer-Lemeshow test and calibration plots), and overall performance using the Brier score (BS). The endpoint was inhospital mortality.
Results. Among the 3,455 patients included in the study, 54.4% were male (N = 1,879) and 26.5% died (N = 916). The BS for the WPS, REMS, MEWS, APACHE II, SAPS II, and SOFA were 0.178, 0.165, 0.183, 0.157, 0.170, and 0.182, respectively. The AUROCs of these models were 0.728 (0.71-0.75), 0.761 (0.74-0.78), 0.682 (0.66-0.70), 0.810 (0.79-0.83), 0.767 (0.75-0.79), and 0.785 (0.77-0.80), respectively. The AUPRC was 0.517 (0.50-0.53) for WPS, 0.547 (0.53-0.56) for REMS, 0.445 (0.42-0.46) for MEWS, 0.630 (0.61-0.65) for APACHE II, 0.559 (0.54-0.58) for SAPS II, and 0.564 (0.54-0.57) for SOFA. All models except the MEWS and SOFA had good calibration. The most accurate model was the APACHE II, with the lowest BS.
Conclusion. The APACHE II outperformed all the ED and ICU models and was found to be the most appropriate model in predicting inhospital mortality of patients in the ICU in terms of discrimination, calibration, and accuracy of predicted probability. Except for the MEWS, the rest of the models had fair discrimination and partially good calibration. Interestingly, although the REMS is less complicated than the SAPS II, both models exhibited similar performance. Clinicians can utilize the REMS as part of a larger clinical assessment to manage patients more effectively.


Introduction
An important responsibility of clinicians in acute medical units is making tough decisions about the provision of life support [1,2]. Because of the shortage of resources, the number of patients who can be followed and treated is limited and physicians should assign patients to critical care services in an appropriate and optimal way to increase the benefits of patient care, as well as improve patient safety [3][4][5]. On the other hand, patients have perplexing clinical manifestations which hinder reasonable assurance regarding treatment approaches and prognosis [6,7]. Besides, delay or suboptimal care of severely ill patients may lead to increased mortality [8,9].
Early identification of critically ill patients significantly affects patient outcomes [5,7,10]. Scoring systems are based on physiological parameters [11][12][13]. Altered physiology, as reflected in aberrant vital signs and other findings, often precedes patient deterioration and death [14]. The objective information provided by these severity-of-illness scoring systems is considered a beneficial instrument for supporting healthcare professionals in the timely recognition and management of critically ill patients who are at high risk of undesirable outcomes [15,16].
A variety of scoring systems have been designed and are commonly used in emergency departments (EDs), such as the WPS [17], REMS [18], and MEWS [19], and in the intensive care unit (ICU), such as the APACHE II [20], SAPS II [21], and SOFA [22]. These are mostly based on vital signs and some laboratory results obtained within the first 24 h postadmission. The variables included in each scoring system plus their point assignment are presented in Table 1.
Beyond simplicity, practicality, and good prognostic ability for the outcomes of interest, the emergency models share another notable feature: they rely on a small number of variables that are easily available for all patients [14]. In contrast, ICU-based scoring systems include a greater number of factors that are frequently accessible only in severely ill patients [23,24].
Although several studies have evaluated scoring systems in EDs and ICUs, it is unknown which model is most suitable. Furthermore, no study has compared ED models such as the WPS, REMS, and MEWS with ICU models such as the APACHE II, SAPS II, and SOFA in the ICU setting. Therefore, the purpose of this study is to evaluate the performance of the WPS, REMS, and MEWS scoring systems in predicting the mortality of critically ill patients admitted to the ICU.

Study Design and Setting.
An observational retrospective study was conducted to collect a prespecified set of variables in three referral centers in Tehran, the capital of Iran (three hospitals with 100 ICU beds), Mashhad in northeast Iran (two hospitals with 36 ICU beds), and Neyshabur in northeast Iran (two hospitals with 19 ICU beds). Because all of these seven centers are tertiary referral hospitals that serve a large portion of the population, they may be considered a sample of the entire population, with the results attributable to the community. More information about the participating hospitals and the distribution of the patients is presented in Figure 1. Because of the noninterventional design of the study, no informed consent was required.

Inclusion and Exclusion Criteria.
We enrolled all critically ill adult patients (age ≥ 18 years) admitted to the ICU directly from the ED between August 2018 and August 2020. Patients who were admitted due to traumatic surgery, burns, cardiac surgery, or psychological disorders were excluded because of the nature of the diagnoses [25]. In addition, patients with any psychotropic agents in their medication profiles, or with symptoms of dysarthria or paramnesia (due to a type of brain disorder), were excluded, as in other studies in the field [20,25]. Figure 1 illustrates the whole inclusion/exclusion process.

Data Collection.
Structured forms covering the ICU models, together with the additional variables used by the ED models, were designed and filled in for all included patients (N = 3,346). The highest physiological score for each patient during the first 24-hour period postadmission was considered the final score. The endpoint was defined as inhospital mortality regardless of the duration of the hospital stay (i.e., death during the ICU stay or in another ward after the ICU).

Statistical Analysis.
Statistical analysis was performed using the R Statistical Software version 4.1.0. The packages pROC, Hmisc, rms, and ResourceSelection were employed.
Continuous variables were expressed as mean and standard deviation (SD). Categorical variables were expressed as number and percentage. Between-group differences for quantitative and qualitative variables were assessed using the Student t-test and the Chi-squared test or Fisher's exact test, respectively. All tests were two-tailed. We also applied logistic regression to develop a model for each scoring system. The following formula was used to compute the expected probability for each individual patient:
$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1)}},$$
where $\beta_0$ is the intercept, $\beta_1$ is the coefficient of the score, and $X_1$ is the score.
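As an illustration of how a single score is turned into an expected probability, the following minimal Python sketch assumes the standard logistic form with an intercept and one slope. The function name and the coefficient values are hypothetical and chosen for illustration only; they are not the fitted values of this study.

```python
import math


def predicted_probability(score, intercept, slope):
    """Expected probability of death from a logistic model with one predictor.

    Assumes the standard logistic form p = 1 / (1 + exp(-(b0 + b1 * score))).
    """
    linear_predictor = intercept + slope * score
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients, NOT the estimates reported in this study.
b0, b1 = -3.0, 0.12

for score in (5, 15, 25):
    print(score, round(predicted_probability(score, b0, b1), 3))
```

With a positive slope, a higher score always maps to a higher expected probability, which is why discrimination of the fitted model equals that of the raw score while calibration depends on the fitted intercept and slope.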
Validation of the ICU models and the ED models was assessed by discrimination, calibration, and accuracy of predicted probabilities.
Discrimination was measured using the area under the receiver operating characteristic curve (AUROC), which is a measure of how well the model can distinguish between patients who have and do not have the outcome of interest (in our study, inhospital mortality). The exact binomial 95% confidence intervals (CI) for the AUROCs were also calculated. The differences between AUROCs were tested using the method proposed by DeLong et al. Diagnostic accuracy was defined as fail if an AUROC was 0.50-0.60, poor if an AUROC was 0.60-0.70, fair if an AUROC was 0.70-0.80, and good if an AUROC was 0.80-0.90 [15]. The area under the precision-recall curve (AUPRC) was also used to inspect the trade-off between precision and recall as a measure of balance between the positive predictive value and sensitivity.
BioMed Research International
Calibration was assessed using calibration plots and the Hosmer-Lemeshow (HL) goodness-of-fit test. To generate a smooth calibration plot, 1,000 bootstrap replicates were applied. The calibration plot is drawn by plotting the predicted probabilities on the x-axis and the actual probability of mortality on the y-axis, and it represents the degree of concordance between the actual and predicted probabilities. To determine an optimal threshold value on the predicted probabilities, the Youden index was used, and based on this threshold, we calculated the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for all models.
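The discrimination and threshold-based measures described here can be sketched in a few lines of Python. This is a simplified illustration on made-up toy data, not the analysis code of the study (which used R): the AUROC is computed with the rank-based (Mann-Whitney) definition, the Youden index is maximized over the observed predicted probabilities, and confidence intervals and the DeLong comparison are not implemented. All function and variable names are hypothetical.

```python
def auroc(y_true, y_prob):
    """Rank-based AUROC: chance that a random death is scored above a random survivor."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # A tie between a death and a survivor counts as half a concordant pair.
    concordant = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return concordant / (len(pos) * len(neg))


def threshold_metrics(y_true, y_prob, threshold):
    """Sensitivity, specificity, PPV, and NPV when death is predicted if p >= threshold."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    tn = sum(1 for y, p in pairs if y == 0 and p < threshold)

    def ratio(a, b):
        # Undefined ratios (empty denominator) are reported as NaN.
        return a / b if b else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def youden_threshold(y_true, y_prob):
    """Observed probability that maximizes J = sensitivity + specificity - 1."""
    def youden_j(t):
        m = threshold_metrics(y_true, y_prob, t)
        return m["sensitivity"] + m["specificity"] - 1.0

    return max(sorted(set(y_prob)), key=youden_j)


# Toy data for illustration only (1 = died in hospital, 0 = survived).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.15, 0.40, 0.60, 0.70, 0.20, 0.90]

best = youden_threshold(y_true, y_prob)
print("AUROC:", auroc(y_true, y_prob))
print("Youden threshold:", best)
print(threshold_metrics(y_true, y_prob, best))
```

The pairwise AUROC is quadratic in the sample size, which is acceptable for a sketch; dedicated libraries use sorting-based algorithms for cohorts of several thousand patients.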
Accuracy of predicted probabilities was measured by the Brier score (BS), which is a quadratic scoring rule in which the squared differences between actual binary outcomes and predictions are averaged according to the following formula: $\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}(\text{predicted probability}_i - \text{actual outcome}_i)^2$ [26]. Missing values were handled as follows: patients with multiple missing laboratory and physiological values were excluded. For patients with just one missing value, the value was imputed from the next day's record in their charts, and if the variable was not recorded on the next day, it was considered normal.
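To make the Brier score concrete, the following Python sketch computes it directly from its definition. As a point of reference that is derived here and not reported by the authors, it also evaluates a noninformative model that assigns every patient the observed mortality proportion (916 of 3,455, taken from the abstract); for such a model the score reduces to p(1 - p), about 0.195, against which the reported scores of 0.157 to 0.183 can be read. Function and variable names are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and binary outcomes."""
    n = len(y_true)
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / n


# Reference case: every patient receives the overall observed mortality proportion.
deaths, total = 916, 3455          # counts as stated in the abstract
prevalence = deaths / total
outcomes = [1] * deaths + [0] * (total - deaths)
constant_predictions = [prevalence] * total

reference_bs = brier_score(outcomes, constant_predictions)
print(round(reference_bs, 3))
```

Lower values indicate better accuracy: a score of 0 corresponds to perfect predictions, and a constant prediction of 0.5 yields 0.25 regardless of the outcomes.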
We followed the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) statement to improve the transparency of reporting.

Results
The mean age of the 3,455 included patients was 56.65 ± 21.52 years, and 1,879 (54.4%) were male. Readmissions (n = 200) were excluded from the analysis. Only 60 eligible patients were missing several laboratory or physiological parameters, so they were excluded from the study. About 6 percent of patients' data (N = 204) were imputed using the approach described in the Methods.
Among the six investigated models, only the APACHE II predicted inhospital mortality with good discriminatory ability, while the WPS, REMS, SAPS II, and SOFA had fair discriminative ability and the MEWS had poor discriminative ability. The maximum AUPRC was also achieved by the APACHE II (0.63, see Figure 2). As shown in Figure 3, the APACHE II and its abbreviated version (the REMS) had no evidence of miscalibration (p = 0.9 for the Hosmer-Lemeshow goodness-of-fit test), whereas for the WPS and the SOFA, there was statistically significant evidence of miscalibration (p < 0.05; see also Table 3). As presented in Table 3, the best overall performance belonged to the APACHE II, with the lowest Brier score (0.157), while the worst belonged to the MEWS, with the highest Brier score (0.183). The pairwise comparison of AUROCs is presented in Table 4.
Figure caption (ROC curves): The receiver operating characteristic (ROC) curves graphically represent sensitivity on the y-axis and 1-specificity on the x-axis. The area under the curve (AUC) gauges the discriminatory ability of a model.

Discussion
The application of scoring systems in the ICU has expanded dramatically for benchmarking and assessing quality of care [23]. In this study, we assessed several scoring systems on discrimination, balance between sensitivity and the positive predictive value, calibration, and overall accuracy of the predicted probability.

Main Findings.
We found that among all models examined, the APACHE II not only had the highest discrimination ability but also had the best accuracy of the predicted probabilities, which was statistically significantly different from the other models in our setting. The mean mortality predicted by the APACHE II (31.7%) was higher than the observed mortality (26.5%), which is probably due to the care provided during the ICU stay and the quality of the follow-up care. The impressive APACHE performance in our cohort could be explained by the exclusion of trauma patients and patients with isolated neurological problems.
The APACHE II, REMS, and SAPS II showed good agreement between the actual and predicted probability of inhospital mortality throughout the whole range of predicted probabilities. In contrast, the SOFA and MEWS showed a propensity to overestimate the inhospital mortality rate for probabilities larger than 0.50, while the WPS underestimated it. In this study, with the exception of the APACHE II and the MEWS, which are at the good and poor ends of the spectrum, respectively, the other ICU and ED models were comparable.
Figure 3: Calibration plots of the six models. A calibration plot is a measure of goodness-of-fit as a graphical presentation of the actual mortality probability versus the predicted mortality probability. The calibration plots of APACHE II, REMS, and SAPS II do not deviate much from the diagonal line, which represents perfect calibration.

Comparison to Other Similar Studies.
Our findings are in line with a previous study [23] that indicated fair discrimination power for the APACHE II and SAPS II (AUROC: 0.779 and 0.793, respectively). However, the discriminative ability of the REMS and MEWS was evaluated as virtually equal (AUROC: 0.738 and 0.729) in that study. Furthermore, although the discriminatory ability of the REMS and SAPS II was in the fair range, there was a significant difference between their AUCs in that study [23], whereas the REMS and SAPS II had similar AUROCs in our study (0.761 vs. 0.767). Our findings are also consistent with another study showing higher discrimination of the APACHE II in prognostication than the SAPS II (AUROC: 0.828 vs. 0.782) [27]. Similar results were obtained in a study by Khwannimit and Geater, who compared the APACHE II and SAPS II [28]. Another investigation demonstrated that the APACHE II, SAPS II, and SOFA had comparably high discriminatory ability [29].
Badrinath et al. reported that among various scoring systems applied to sepsis patients admitted to the ICU, the APACHE II was more sensitive and specific in predicting mortality than the SOFA and REMS, which is in line with our findings. However, the discrimination power of the APACHE II and REMS was evaluated as good and nearly equal (AUROC: 0.81 vs. 0.80). This disparity was most likely caused by the patient population examined.
An advantage of the APACHE and SOFA is that they can be used to track a patient's response to therapy throughout the hospital stay. The APACHE II upon admission is around 75% accurate as an early prognostic indicator of illness severity [30]. The better prognostic results obtained using the APACHE II score may be attributed to the additional physiological variables involved in its calculation, which may reflect the degree of organ dysfunction more fully than other prognostic scores. Besides, the impressive APACHE performance in our cohort could be explained by the exclusion of trauma patients and patients with isolated neurological problems.
Interestingly, despite the fact that the SOFA is primarily designed for prognosis in sepsis patients, compared to the APACHE II and REMS, it performed poorly (AUROC: 0.74 (95% CI: 0.67-0.80)) [31].
In contrast with our findings, some studies showed no superiority of the APACHE II over the SAPS II; both had fair discrimination and performed similarly [32].
In the retrospective study by Gök et al., which included critically ill patients admitted to the ICU, the effectiveness and reliability of the WPS, REMS, and MEWS in predicting mortality were assessed, and the results indicated the AUROC of
Table 3: Intercept and slope of the linear predictor of the logistic regression for all models to predict inhospital mortality in ED, the optimism-corrected performance measures, and various threshold-based metrics (the threshold is itself based on the Youden index).

Limitations and Strengths.
To our knowledge, this is the first multicenter cohort study in Iran aimed at investigating and comparing three ICU and three ED models in mortality prediction. In addition, we had a large multicenter sample of patients with a very low number and percentage of missing values. Collecting patient data from seven tertiary referral hospitals with similar population distributions may increase the generalizability of the results to a large subset of the Iranian population.
Although the two-year sampling duration adjusts for the effect of time-related confounding factors and may assure the inclusion of probable seasonal disorders, time- and sample-related limitations remain inevitable. Removing patients with missing data was another limitation of this study. However, this affected only 1.6% of the data and is unlikely to have meaningfully affected the findings.

Implications.
Our findings have important implications. The REMS and the SAPS II have fair discrimination without a significant difference between their AUCs. However, the REMS is less complex (it has a smaller number of variables) than the SAPS II, and its discriminative power was comparable to that of the SAPS II (AUROC: 0.761 vs. 0.767). Both models showed partially good calibration, although they overestimated mortality. Generally, it can be inferred that the REMS is more cost-effective and can be easily applied as a good alternative to the SAPS II in the detection of patients who are at high risk of deterioration. The REMS is also superior to the WPS in terms of discriminative power.

Future Studies.
We suggest further evaluation of recalibrated versions of these prediction models on large samples of the target population. Also, nonparametric models from statistical machine learning may help improve model performance. We propose that the APACHE II be integrated into the electronic medical record system to enable real-time predictions. Prospective studies could investigate the effect of incorporating these models in real-time decision support on mortality and other outcomes.

Conclusion
The APACHE II was found to be the most appropriate model for predicting inhospital mortality of ICU patients in Iran across all three performance dimensions (discrimination, calibration, and accuracy of predicted probabilities). Except for the MEWS, the rest of the models had fair discrimination and partially good calibration. Interestingly, although the REMS is less complicated than the SAPS II, both models performed similarly. The findings suggest that clinicians can utilize the REMS as part of a larger clinical assessment to manage patients more effectively.

Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author (Dr. Saeid Eslami) upon reasonable request.