Pediatric Index of Mortality and PIM2 Scores Have Good Calibration in a Large Cohort of Children from a Developing Country

Objective. Our objective was to validate the Pediatric Index of Mortality (PIM) and PIM2 scores in a large cohort of children from a developing country. Design. Prospective observational study. Setting. Pediatric intensive care unit of a tertiary care teaching hospital. Patients. All children aged <18 years admitted between June 2011 and July 2013. Measurements and Main Results. We evaluated the discriminative ability and calibration as measured by the area under the receiver operating characteristic (ROC) curves, the Hosmer-Lemeshow goodness-of-fit (GOF), and standardized mortality ratio (SMR), respectively. Of the 819 children enrolled, 232 (28%) died. The median (IQR) age of the study subjects was 4 years (0.8, 10). The major reasons for ICU admission as well as mortality were sepsis/severe sepsis. The area under ROC curves for PIM and PIM2 was 0.72 (95% CI: 0.67–0.75) and 0.74 (95% CI: 0.70–0.78), respectively. The goodness-of-fit test showed a good calibration across deciles of risk for the two scores with P values being >0.05. The SMR (95% CI) was 0.99 (0.85–1.15) and 1 (0.85–1.16) for PIM and PIM2, respectively. The calibration across different age and diagnostic subgroups was also good. Conclusion. PIM and PIM2 scores had good calibration in our setup.


Introduction
Scoring systems are used to evaluate the risk of mortality in intensive care units and form an essential part of providing intensive care. They allow for interunit and intraunit comparisons with time and also provide useful information for comparing the severity of illness of patients enrolled into clinical trials [1]. The two commonly used mortality risk scoring systems in children include the Pediatric Risk of Mortality (PRISM) and the Pediatric Index of Mortality (PIM) scores [2,3]. While both of these scores have been shown to perform well across pediatric intensive care units (PICU), the simplicity of the PIM makes it easier to collect data routinely from large numbers of sick children [4][5][6][7]. A number of studies, predominantly from developed countries as well as from a few resource-restricted settings, have validated PIM and its updated version PIM2 scores.
Almost all studies that evaluated the performance of PIM/PIM2 in units from low- and middle-income countries have reported excellent "discrimination" but poor "calibration" of the scores [6][7][8]. In contrast, in our earlier study on the performance of PIM and PIM2 scores at two different time points involving 282 sick children, we found not only acceptable discrimination but also excellent calibration across the deciles of risk at both time points [9]. We were perplexed by this unexpected finding. Discrimination is the ability of a model to distinguish accurately between survivors and nonsurvivors in a given unit; calibration, in contrast, measures the performance of the scoring system in that unit compared with that of the unit(s) where the original score was developed. Consequently, it may be presumed that the current performance of our unit is quite similar to that of the index units during the period in which the scores were developed. Given the stark differences in allocation of resources and possibly in the case mix between the index units and our unit, this presumption defied logic. One major factor against this presumption was the possibility of a "Type II" error; it is said that the Hosmer-Lemeshow GOF test is unreliable with sample sizes of less than 400 [10]. The excellent calibration found in our earlier study could simply be due to the small sample size rather than to "good" performance of our unit. We therefore undertook this study to evaluate the performance of both PIM and PIM2 scores in a much larger sample of sick children.

Design and Setting.
We conducted this prospective observational study in our 18-bed tertiary care PICU from June 2011 to July 2013. Of the 18 beds in the ICU, 10 are in the "intensive area" while the remaining 8 are in the "step down area." Children aged 1 month to 18 years requiring ICU care from the wards as well as those referred from other hospitals are admitted to the ICU. Our unit caters to both medical and postsurgical patients. Children with traumatic injuries are not admitted to the ICU. The unit is staffed by 2 full-time pediatric intensivists, 4 fellows, and 4 residents. A total of 14 nurses are posted in the ICU with 4-5 nurses per 8-hour shift. The nurse to patient ratio is 1 : 3 in the intensive area and 1 : 5-6 in the step down area. The ICU is well equipped with facilities for continuous monitoring, mechanical ventilation, blood gas analysis, ultrasonography, and X-ray.

Objectives and Outcome Variables.
Our primary objective was to evaluate the discriminative ability and calibration of PIM and PIM2 scores. The secondary objective was to assess the calibration across different age and diagnostic subgroups. The discriminative ability was assessed by the area under the receiver operating characteristic (ROC) curve [11] while calibration was assessed using the Hosmer-Lemeshow GOF test and SMR [12].
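As a minimal sketch of how the area under the ROC curve and the Hosmer-Lemeshow statistic can be computed from predicted risks and observed outcomes, the following Python may help. It is not the analysis code used in this study: the function names and the toy data are hypothetical, the AUC uses the rank-based (Mann-Whitney) formulation, and the GOF statistic uses equal-sized risk groups. The P value would then be read from a chi-square distribution; the degrees-of-freedom convention is left to the analyst and is not fixed here.

```python
from typing import List, Sequence, Tuple


def roc_auc(risks: Sequence[float], died: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen nonsurvivor has a higher
    predicted risk than a randomly chosen survivor (ties count as 1/2).
    """
    dead = [r for r, d in zip(risks, died) if d == 1]
    alive = [r for r, d in zip(risks, died) if d == 0]
    if not dead or not alive:
        raise ValueError("need at least one survivor and one nonsurvivor")
    wins = 0.0
    for rd in dead:
        for ra in alive:
            if rd > ra:
                wins += 1.0
            elif rd == ra:
                wins += 0.5
    return wins / (len(dead) * len(alive))


def hosmer_lemeshow(
    risks: Sequence[float], died: Sequence[int], groups: int = 10
) -> Tuple[float, List[Tuple[int, float, int]]]:
    """Hosmer-Lemeshow chi-square over `groups` equal-sized risk strata.

    Returns the statistic and, per stratum, (n, expected deaths, observed deaths).
    """
    pairs = sorted(zip(risks, died))  # order patients by predicted risk
    n = len(pairs)
    stat = 0.0
    table: List[Tuple[int, float, int]] = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(r for r, _ in chunk)  # sum of predicted risks
        observed = sum(d for _, d in chunk)  # count of deaths
        mean_risk = expected / size
        denom = size * mean_risk * (1.0 - mean_risk)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
        table.append((size, expected, observed))
    return stat, table


# Toy data (hypothetical, for illustration only).
toy_risks = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
toy_died = [0, 0, 1, 0, 1, 1]
toy_auc = roc_auc(toy_risks, toy_died)
toy_stat, toy_table = hosmer_lemeshow(toy_risks, toy_died, groups=2)
```

A small statistic (large P value) indicates that observed and expected deaths agree across the risk strata, which is what "good calibration" refers to in this paper.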

Subjects and Data Collection.
We included the data from all children admitted to the ICU for more than 1 hour during the study period. Two investigators (AS and SJ) collected the data during the study period. Both investigators were trained in the methods of data collection by the principal investigator (JS) at the beginning of the study. The data recorded by them over the next 2 weeks were cross-checked by two investigators (JS and MJS) to ensure correctness; discrepancies, if any, were discussed and resolved. The data collected included all variables of PIM and PIM2, demographic characteristics, clinical course, and outcomes of the study population. Further details on methodology are provided in our previously published article on PIM and PIM2 scores at different time points [9]. Data for 50 children each were recorded in duplicate during the first and the second years of study to ensure accuracy. The interobserver reliability was found to be excellent (κ score = 0.93).

Results
There were a total of 855 admissions during the study period, of which 23 children died within 1 hour and were excluded. Parents of 13 children refused to give consent. The final dataset comprised 819 children, of whom 232 (28%) died (Figure 1). Data collection of all variables of PIM and PIM2 was possible throughout the study period.
The demographic features, clinical course, and laboratory features of the survivors and the nonsurvivors are provided in Table 1. The median age of the enrolled children was 4 years with the majority being boys (57%). About 28% of the children were less than 1 year of age, 25% were between 1 year and 4 years of age, 22% were between 5 and 10 years of age, and 25% were >10 years of age. About one-fifth (n = 164, 20%) of the children were severely malnourished.
The major reasons for ICU admission as well as mortality were sepsis/severe sepsis and cardiac and neurological illnesses. Most of the patients (573, 70%) were admitted directly from the emergency department while the remaining patients were either elective (post-op) or referred from other pediatric wards. The common underlying illnesses in the study population were congenital/structural heart diseases, neurometabolic disorders, and tubercular meningitis (Table 1).

Secondary Outcomes.
Calibration across different age and diagnostic subgroups was also good, with GOF P values being >0.05 across most of the subgroups for both PIM and PIM2 scores (Table 3). Individually, the PIM2 score had good calibration across all age categories, whereas the PIM score had poor calibration among children in the 2-5-year age group. Among the diagnostic subgroups, calibration was poor only in postoperative patients for the two scores. Discrimination was best for respiratory illnesses, poisoning, liver failure, and tubercular meningitis for the two scores (Table 3).

Discussion
The results of the present study confirm our earlier findings of excellent calibration but only acceptable discriminatory performance of both PIM and PIM2 scores in our setup. The numbers of predicted and observed deaths were almost equal for both scores across deciles of risk, age, and diagnostic subgroups. The results of our previous study were therefore not merely due to chance. As previously mentioned, calibration is an important measure of validation of a scoring system in a unit in which it was not developed. The measure of calibration, the SMR, is basically a comparison of the number of observed deaths with the number of deaths predicted by the scoring system. According to the investigators of the original PIM and PIM2 scores, an SMR that is significantly different from 1 in a given unit means that the standard of care in that unit is worse (or better, depending on the direction) than in the units that derived the score [10]. It is natural to expect that the observed deaths in a given unit would be similar to the number of expected deaths so that the SMR equals 1. However, this is often not true; depending on the case mix and disease patterns, the SMR might vary and may be significantly different from 1 (i.e., the 95% CI of the SMR would not include 1). In these cases, the Hosmer-Lemeshow P values would be less than 0.05. The results of our study differ from those of most other studies from developing countries, which reported that the models underpredicted deaths in their setup, with the SMR and its 95% CI being more than 1 [6][7][8]. For example, two studies from India and Pakistan have reported SMRs as high as 1.57 to 3.3 and 1.4 to 1.57 for PIM and PIM2 scores, respectively [7,8]. The study authors attributed this to differences in the patient profile, the need to manage large numbers of severely ill children with limited manpower and resources, and possible differences in quality of care between their units and the units where the models were developed.
In contrast, we found the SMR to be equal to 1 not only across deciles of risk but also across all age and diagnostic subgroups. We presume that factors like the threshold for initiating and discontinuing support, timing of intensive care admissions, and quality of care as well as the accuracy of data collection might have contributed to the near-perfect SMR in our unit. It appears that resource limitation may not be a major deterrent to imparting quality care in the ICU. It may be more important to review the systems in place and take steps to improve them in order that the performance of the models improves in units where the models underpredict deaths.
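The SMR comparison discussed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used in this study: the function names are hypothetical, the confidence interval uses a normal approximation on the log scale that treats the observed deaths as Poisson and the expected deaths as fixed, and the expected count of 234.3 is a made-up value chosen only so that the ratio is close to the reported 0.99. The resulting interval need not match the one reported in this paper, which may have been obtained by a different method.

```python
import math
from typing import Tuple


def smr_with_ci(observed: int, expected: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Standardized mortality ratio with an approximate 95% CI.

    SMR = observed deaths / expected deaths, where expected deaths is the
    sum of the predicted mortality risks of all patients.
    Assumption: log-scale normal approximation, observed ~ Poisson.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected deaths must be positive")
    smr = observed / expected
    half_width = z / math.sqrt(observed)  # approx. SE of log(SMR), times z
    return smr, smr * math.exp(-half_width), smr * math.exp(half_width)


def differs_from_one(lower: float, upper: float) -> bool:
    """True when the confidence interval excludes 1."""
    return lower > 1.0 or upper < 1.0


# 232 observed deaths is from this study; 234.3 expected deaths is hypothetical.
smr, lower, upper = smr_with_ci(observed=232, expected=234.3)
print(f"SMR = {smr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
print("differs from 1:", differs_from_one(lower, upper))
```

An SMR above 1 with an interval excluding 1 corresponds to the underprediction of deaths reported by other units; an interval that includes 1 corresponds to the calibration observed here.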
Unlike the units in which the scores were developed [3,13] and most other units from developed countries [5,14–17], we found only acceptable discrimination of the scores in our setup. The possible reasons for this difference are the high mortality rate in our study (28%) as compared to the units in which the scores were developed or validated (5-6%) [3,13] and the difference in disease patterns between these units and our unit. For example, we had more children with sepsis, cardiac and neurological illnesses, and raised intracranial pressure. These factors could not be accounted for by the variables used to calculate the scores. Moreover, the case mix and the severity of illness at admission resulted in regression coefficients that are quite different between the development set and our study for some of the items of the scores (Table 4). For example, the coefficient of the item "elective admission" in the PIM score was less than half of that in the original dataset. This is possibly because almost all admissions in our setup are emergency admissions, whereas almost 50% of the admissions in the development sets were elective. Similarly, the variable "cardiac bypass" was omitted from the PIM2 model as there were no patients admitted after such a procedure in our unit. There were also major differences between the development set and our setup with regard to the high risk and low risk diagnosis variables. For example, poisoning has a low risk of mortality in our setup, and this could not be accounted for in the PIM2 "low risk diagnosis" variable as poisoning is not among its low risk diagnoses. Among the cardiac illnesses, only dilated cardiomyopathy or myocarditis is included in the "high risk diagnoses" category of both PIM models. In our unit, only 2 of the 10 children with an admission diagnosis of acute myocarditis died.
Despite these differences, we did not try to improve on the fit of the model by changing the coefficients as this defeats the main purpose of these models which is to allow for interunit comparisons [3,10,13]. A few studies from the developed countries did report only acceptable discrimination for PIM and PRISM [4,18]. One of these studies attributed the poor discrimination to differences in patient demographics and physiologic response to different diseases [18]. The dichotomy between discrimination and calibration that we observed in our study has been previously reported in only a few studies. A study from Trinidad reported an AUC for PIM2 of only 0.62 while the SMR was 0.86 with the 95% CI including 1 [19]. The authors attributed this to overprediction of mortality in their study, but it could mean that their unit performed better than the development sets. Similarly, a study of 303 patients from the Netherlands reported an AUC of 0.74 for PIM score and an SMR of 0.88 with the 95% CI including 1 [4]. It is often said that a perfectly calibrated model may not always be perfectly discriminatory as the area under the ROC curve would be 0.83 and not 1 in such cases [20].
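One way to arrive at a ceiling of about 0.83 is sketched below. The assumptions are ours and are not taken from the cited source: the predicted risks are uniformly distributed between 0 and 1, and the model is perfectly calibrated, so that a patient with predicted risk p dies with probability p. Nonsurvivors then have risk density 2x and survivors 2(1 - y), and the area under the ROC curve is the probability that a random nonsurvivor has a higher predicted risk than a random survivor:

```latex
\begin{aligned}
f_{D}(x) &= 2x, \qquad f_{S}(y) = 2(1-y), \qquad 0 \le x,\, y \le 1, \\
\mathrm{AUC} &= \Pr(X > Y)
  = \int_{0}^{1} 2x \left( \int_{0}^{x} 2(1-y)\,dy \right) dx \\
 &= \int_{0}^{1} 2x \left( 2x - x^{2} \right) dx
  = \int_{0}^{1} \left( 4x^{2} - 2x^{3} \right) dx
  = \frac{4}{3} - \frac{1}{2} = \frac{5}{6} \approx 0.83 .
\end{aligned}
```

Under a different distribution of predicted risks the attainable area would differ, so this figure should be read as an illustration of why good calibration does not imply an area close to 1.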
Strengths and Limitations. The strengths of our study are as follows: (a) it is the largest study to date from developing countries to validate the PIM and PIM2 scores, (b) data were collected accurately, and (c) the scores were well calibrated in our setup with an adequate sample size, meaning that the scores could be used without modification in units with resource limitation. The only limitation is that it is a single-unit study. However, this is unlikely to affect the generalization of our results as our unit is fairly representative of most units from developing countries, with a high incidence of sepsis, tuberculosis, and meningoencephalitis.

Conclusion
Contrary to most previously published studies from developing country settings, PIM and PIM2 scores had good calibration in our setup. The good calibration was despite the differences in case mix and resource allocation between the units where the scores were developed and ours.