Risk Prediction Models for Mortality in Community-Acquired Pneumonia: A Systematic Review

Background. Several models have been developed to predict the risk of mortality in community-acquired pneumonia (CAP). This study aims to systematically identify and evaluate the performance of published risk prediction models for CAP. Methods. We searched MEDLINE, EMBASE, and the Cochrane Library in November 2011 for initial derivation and validation studies for models which predict pneumonia mortality. We aimed to present the comparative usefulness of their mortality prediction. Results. We identified 20 different published risk prediction models for mortality in CAP. Four models relied on clinical variables that could be assessed in community settings, with the two validated models BTS1 and CRB-65 showing fairly similar balanced accuracy levels (0.77 and 0.72, resp.), while CRB-65 had an AUROC of 0.78. Nine models required laboratory tests in addition to clinical variables, and the best performance levels amongst the validated models were those of CURB and CURB-65 (balanced accuracy 0.73 and 0.71, resp.), with CURB-65 having an AUROC of 0.79. The PSI (AUROC 0.82) was the only validated model with good discriminative ability among the four that relied on clinical, laboratory, and radiological variables. Conclusions. There is no convincing evidence that other risk prediction models improve upon the well-established CURB-65 and PSI models.


Introduction
Community-acquired pneumonia (CAP) is common and associated with significant mortality [1-3]. Severity assessment is an important step in the management of CAP [4-6] because the early identification of individuals at high risk of death may help in deciding the site of care and the intensity of management [7]. Furthermore, subjective clinical judgment can underestimate pneumonia severity [8], and this may result in under-treatment and poor outcomes [9,10]. Therefore, CAP risk prediction models have been developed to help clinicians predict pneumonia outcome and determine appropriate management more accurately.
The most widely known, well-validated, and commonly used risk prediction models are CURB-65 [3] and the Pneumonia Severity Index (PSI) [11]. Recent systematic reviews have focused on assessing the comparative performance of these models [12,13]. However, many other models have been developed, some of which are designed to predict mortality [14,15], while others also include the need for ventilatory and vasopressor support [16-18]. The diverse and ever-increasing range of models may pose difficulties for clinicians who are attempting to choose a tool for use in their daily practice. To date, there has yet to be a clear consensus on the model that should be used [19], and no systematic attempt to compare the key characteristics and usefulness of the existing pneumonia scores has been made.
In this systematic review, we provide a comprehensive and up-to-date overview of the existing published risk prediction models for mortality in community-acquired pneumonia. We did not include scores which were designed to predict ventilatory and vasopressor support because of the inconsistency in decisions to provide these therapies depending on treatment site. We also aim to summarize the key features of each model such as variables used, risk stratification, and the comparative performance in terms of sensitivity, specificity, balanced accuracy, and area under the curve (AUC) values so that practitioners can make an informed choice.

Eligibility Criteria.
We selected studies that were the first to report the derivation or validation of each risk prediction model for predicting mortality in CAP. There was no restriction on the type of study (prospective or retrospective) or country of origin. For pragmatic reasons, we excluded studies that aimed to carry out further testing of risk prediction models that had already been validated once and reported, as there are several validation studies for commonly used scores such as PSI and CURB-65. In such instances, we have used pooled data from published meta-analyses where available [12,13]. Derivation studies were defined as studies which first reported the prognostic score. Validation studies were defined as studies which first tested the performance of a derived score in a separate cohort.

Search Strategy.
We searched MEDLINE, EMBASE, and Cochrane Central Register of Controlled Trials with no date limitations in November 2011 using the search terms listed in Supplementary Material 1 available online at http://dx.doi.org/10.1155/2013/504136, without any language restriction. We also checked the bibliographies of included studies and recent review articles for relevant studies.

Study Selection and Data Extraction.
Two reviewers (Chun Shing Kwok, Kenneth Woo) scanned all titles and abstracts to select studies that met the inclusion criteria. Full reports (where available) of potentially relevant studies were retrieved and independently checked by the other two reviewers (Yoon K. Loke, Phyo Kyaw Myint). Where there was any uncertainty or discrepancy, the article was discussed among the reviewers to determine whether the study should be included. We also contacted authors if there were any areas that required clarification. Data were collected using a standardized form by two authors independently (Chun Shing Kwok, Kenneth Woo), and this was checked by Yoon K. Loke. Data were collected on score name, setting for score application, year of study, country of origin, participant selection criteria, methodology for diagnosis of pneumonia, outcomes assessed, definition of severe pneumonia, participant characteristics, loss to follow-up in the study, and the results. Data relating to study methodology, such as risk of confounding and statistical methods, were also collected for the quality assessment. The primary measure of interest was the area under the receiver operating characteristic curve (AUROC), as this reflects the overall discriminant ability of the risk prediction model; where this was not reported, we calculated balanced accuracy as (sensitivity + specificity) / 2.
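The balanced accuracy calculation described above can be sketched in Python as follows. This is a minimal illustration rather than the reviewers' own code; the function names and the 2x2 counts in the usage lines are hypothetical and are not data from any included study.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Derive sensitivity and specificity from 2x2 counts.

    tp/fn: patients who died and were classed as high/low risk.
    tn/fp: patients who survived and were classed as low/high risk.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one death and one survivor")
    return tp / (tp + fn), tn / (tn + fp)


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Return (sensitivity + specificity) / 2, as defined in the text."""
    for name, value in (("sensitivity", sensitivity), ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1, got {value}")
    return (sensitivity + specificity) / 2


# Hypothetical counts, for illustration only.
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=330, fp=220)
print(round(sens, 2), round(spec, 2), round(balanced_accuracy(sens, spec), 2))
```

Because the two components are weighted equally, a model with high sensitivity and low specificity can have the same balanced accuracy as one with the opposite profile, which is why the review reports sensitivity and specificity alongside it.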
We also extracted results of existing meta-analyses on pneumonia risk prediction models [12,13] to address the fact that both PSI and CURB-65 have been validated several times over, and we intended to present only the pooled data.

Assessment of Study Validity.
Quality assessment was performed by Chun Shing Kwok using a methodological checklist for prognostic studies from the National Institute for Health and Clinical Excellence [20]. Briefly, the checklist contains six components: study sample representative of population of interest, loss to follow-up unrelated to key characteristics, prognostic factor of interest, outcome of interest, potential confounders accounted for, and the appropriateness of statistical analysis.

Data Analysis.
Due to the nature of this systematic review, we did not intend to conduct meta-analysis but planned to summarize the main findings descriptively in tables and figures. In particular, we evaluated key performance parameters (AUROC, balanced accuracy, sensitivity, and specificity) for each scoring system and depicted these graphically according to the number of variables required for the calculation of the score. For these plots, we used validation study or meta-analysis results where available. We conducted an additional subgroup analysis restricted to studies that used prospectively collected datasets, which may potentially be of greater validity than retrospective evaluations.

Results
From the 1,947 titles and abstracts, 93 articles were selected for detailed review (Figure 1). Of these, 20 different risk prediction models for mortality in pneumonia were described in 18 documents (including abstract-only publications) between 1987 and 2011 [6-8, 11, 14, 15, 21-32]. The list of excluded studies is shown in Supplementary Material 2. The detailed characteristics of studies and the description of individual models are shown in Table 1 and Supplementary Material 3, respectively. Aside from two [24,28], all studies were conducted in emergency department settings. Diverse combinations of variables including patient characteristics, clinical features, laboratory results, radiological findings, and physician judgments were considered across these models. Two studies used ICD-9 codes [11,25] and one used ICD-10 codes to confirm pneumonia diagnosis [31]. One study [29] did not provide a formal definition as to how pneumonia was diagnosed.

Quality Assessment of Models.
Study validity is summarized in Supplementary Material 4. One major limitation is that only 14 of the risk prediction models had validation data, whereas 6 reported findings from derivation studies (SOAR, AFSS, PARB, PIRO, CARSI, and CARASI) without further validation [24,25,28,29,32]. All studies had a study sample that appeared representative of the population of interest, with adequately defined outcomes. Mortality was the main outcome of interest in all but one study, where 30-day mortality and the need for oxygen therapy were combined [29]. The extent of loss to follow-up or missing data was unclear in the analysis for nine models (BTS 1, 2, 3, CURB, IDSA/ATS 2007, mATS, SOAR, A-DROP, and PARB) [6, 15, 22-24, 26, 29]. The impact of potential confounding factors was unclear in many studies, whereas eleven models (BTS 1, 2, 3, CURB, CURB-65, CRB-65, MRI, PSI, SOAR, AFSS, and PARB) [11, 14, 21-25, 29] used appropriate statistical methods (i.e., use of logistic regression models or statistical methods to choose factors that were most predictive of mortality) for the derivation of the prognostic score. Where statistical methods were not used to identify variables in the derivation of the models, some models were derived based on the hypothesis that certain variables may be correlated with death (e.g., shock index), while other models tested scores proposed from guidelines (e.g., ATS scores). One study was only available in abstract form [29].

Variables Used in Risk Prediction Models.
The frequency of variables which were used more than once in the models and their occurrence in individual scores is shown in Table 2.
Variables were categorized into five groups: patient characteristics (age, gender, immunosuppression, and renal disease), clinical variables (pulse rate, blood pressure, respiratory rate, temperature, presence of shock, and confusion), laboratory measures (urea/blood urea nitrogen (BUN), white cell count, PaO2/SaO2, hematocrit, glucose, sodium, and pH), radiological findings (pleural effusion and multilobar pneumonia on chest X-ray), and physician judgment (need for mechanical ventilation). The four most commonly used variables (found in >10 scores) were confusion or altered mental status, respiratory rate, systolic blood pressure, and urea. Some of the risk prediction models also required more complex concepts involving clinical interpretation and decision-making or even the results of other severity prediction tools. The MRI score included the Glasgow coma score, judgment on underlying ultimately or rapidly fatal illness, simplified acute physiology score, acute organ system failure, and ineffective initial antimicrobial treatment. The modified ATS score had major criteria of requirement for mechanical ventilation or septic shock, and the IDSA/ATS 2007 score included receipt of invasive mechanical ventilation and septic shock and the need for vasopressors. These models were therefore considered separately.

Risk Prediction Model Evaluation and Derivation and Validation Results.
The results from the included derivation and validation studies are shown in Table 3. Supplementary Material 2 describes the individual severity scores according to the year of publication in chronological order.

Risk Prediction Models Using Only Clinical Variables.
Four scores (BTS 1, CRB-65, CARSI, and CARASI) [21,22,32] were based on simple clinical measures that could be measured on first presentation in the community, with no requirement for laboratory or radiological testing. All were derived in the UK between 1987 and 2011. The number of variables ranged from three to six, and respiratory rate was included in all scores. The two validated models, BTS1 and CRB-65, had fairly similar balanced accuracies (0.77 and 0.72, resp.), while CRB-65 was shown in the meta-analysis to have an AUROC of 0.78. Neither CARSI nor CARASI had been validated, and the derivation studies reported relatively low balanced accuracy (0.64) or AUROC (0.64) for both models.

Risk Prediction Models Using Both Clinical Variables and Laboratory Testing.
Nine prognostic models (BTS2, BTS3, CURB, CURB-65, A-DROP, CURB-age, SOAR, CURSI, and CURASI) [21-24, 26, 31] were constructed using both clinical and laboratory parameters. They were developed in the UK between 1987 and 2010, except for A-DROP, which was proposed by the Japanese Respiratory Society. All models were externally validated except for SOAR [24]. The number of variables ranged from three to six, and respiratory rate was included in all models. Other commonly included variables were confusion and urea/blood urea nitrogen. CURB and CURB-65 had the best balanced accuracy (0.73 and 0.71, resp.). Here, AUROC was seldom reported amongst the models, but both CURB-65 (AUROC 0.79 from meta-analysis) and A-DROP (AUROC 0.85) showed reasonable discriminative ability. While A-DROP appears to have a superior AUROC, we noted important quality issues regarding the absence of follow-up for vital status within the study (Supplementary Material 3) and a lack of generalizability due to it being a retrospective, single-centre study of hospitalized patients.
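To make the difference between a clinical-only score and a clinical-plus-laboratory score concrete, the sketch below computes CRB-65 and CURB-65 for a single patient. The thresholds are those commonly quoted from the original CURB-65 derivation study [3] and are not restated in this review, so they should be treated as an assumption and checked against that source; the class and field names are illustrative, and the code is not intended for clinical use.

```python
from dataclasses import dataclass


@dataclass
class Presentation:
    """Findings at presentation (field names are illustrative)."""
    confusion: bool
    urea_mmol_per_l: float
    respiratory_rate: int  # breaths per minute
    systolic_bp: int       # mmHg
    diastolic_bp: int      # mmHg
    age_years: int


def crb65(p: Presentation) -> int:
    """CRB-65: one point per criterion met; no laboratory test required.

    Thresholds are assumed from the commonly cited definition, not from this review.
    """
    return sum([
        p.confusion,
        p.respiratory_rate >= 30,
        p.systolic_bp < 90 or p.diastolic_bp <= 60,
        p.age_years >= 65,
    ])


def curb65(p: Presentation) -> int:
    """CURB-65: the CRB-65 criteria plus one point for urea > 7 mmol/L."""
    return crb65(p) + int(p.urea_mmol_per_l > 7.0)


# Hypothetical patient, for illustration only.
patient = Presentation(confusion=False, urea_mmol_per_l=8.2, respiratory_rate=32,
                       systolic_bp=100, diastolic_bp=70, age_years=72)
print(crb65(patient), curb65(patient))  # prints: 2 3
```

The only additional input needed for CURB-65 is the urea measurement, which illustrates why the review groups models by the kind of data they require before comparing their performance.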

Risk Prediction Models Using Clinical, Laboratory, and Radiological Findings.
Four models (PSI, AFSS, PIRO, and PARB) [11,25,28,29] required radiological findings in their scoring system. These models were developed in the US, France, Spain, and Japan between 1996 and 2010; the number of variables ranged from four to twenty in these models [11]. The PSI is the only validated model here, with an AUROC of 0.82 in the meta-analysis. The performance of these models from derivation studies ranged from an AUROC of 0.75 for AFSS to 0.88 for the PIRO score.

Risk Prediction Models That Require Additional Clinical Decisions.
Three models (MRI, mATS, and IDSA/ATS 2007) [6,14,15] gave weighting to clinical judgment, for example, that initial antimicrobial therapy was ineffective or that vasopressor therapy was needed for septic shock. These validated models originated in the US and France and were principally designed for prognostic use in intensive care settings or for pneumonia cases that may need to be triaged to intensive care. The best performance here was achieved by the modified ATS score with a balanced accuracy of 0.94.

Summary of the Performance of Risk Prediction Models according to Number of Variables.
The comparative performance of the risk prediction models according to number of prognostic variables is summarized graphically in Figure 2 (balanced accuracy and AUC) and Figure 3 (sensitivity and specificity). Of the validated measures that are suitable for general clinical use, the CURB derivatives and PSI had the best balanced accuracies, and this is similarly reflected in the AUROC. Figure 3 shows that PSI had amongst the highest sensitivity, but the trade-off is apparent in the lower specificity of PSI as compared to other validated models such as CURB-65. We also conducted a subgroup analysis restricted to prospective studies as these may be of potentially higher validity than retrospective datasets (Supplementary Material 5).
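The trade-off between sensitivity and specificity noted above arises whenever an ordinal score is dichotomized at a cut-off. The following sketch, using entirely hypothetical scores and outcomes, tabulates both measures at every possible cut-off; the function name and the data are assumptions made for illustration.

```python
def sens_spec_by_cutoff(scores, died):
    """Return (cutoff, sensitivity, specificity) for each observed score value.

    A patient is classed as high risk when score >= cutoff.
    """
    n_died = sum(1 for d in died if d)
    n_survived = len(died) - n_died
    if n_died == 0 or n_survived == 0:
        raise ValueError("need at least one death and one survivor")
    rows = []
    for cutoff in sorted(set(scores)):
        # True positives: deaths flagged as high risk at this cut-off.
        tp = sum(1 for s, d in zip(scores, died) if d and s >= cutoff)
        # True negatives: survivors classed as low risk at this cut-off.
        tn = sum(1 for s, d in zip(scores, died) if not d and s < cutoff)
        rows.append((cutoff, tp / n_died, tn / n_survived))
    return rows


# Hypothetical cohort of eight patients (not data from any included study).
scores = [0, 1, 1, 2, 2, 3, 3, 4]
died = [False, False, False, False, True, False, True, True]

for cutoff, sens, spec in sens_spec_by_cutoff(scores, died):
    print(f"cutoff >= {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Raising the cut-off in this toy cohort lowers sensitivity and raises specificity, which is the same pattern that separates a highly sensitive tool such as PSI from a more specific one such as CURB-65.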

Discussion
Our review systematically evaluates and summarizes 20 risk prediction models for mortality in patients with pneumonia, including the variables required for score calculation, so that clinicians and policy makers (such as guideline committees and health services researchers) can make informed choices about the ease of use and comparative predictive ability. In these times of uncertainty in the health economy, the number and type of variables required for calculation need to be weighed up against the outright performance. Here, the ease of implementation, efficient resource utilization, and availability/simplicity of testing within the healthcare setting (e.g., community centre, emergency department, or intensive care unit) may represent influential factors in determining the suitability of a particular model.
We found that most of the published models (irrespective of complexity) yielded fairly similar performance with regard to balanced accuracy and AUC. While there may be some statistical differences in AUC, these may have only limited consequence when clinicians are making treatment decisions in individual patients. For instance, in Chalmers' meta-analysis, the respective AUCs indicate that the probability of PSI correctly discriminating between patients of differing severity was 0.82, whilst the corresponding figure for CURB-65 was 0.79. We have deliberately chosen to emphasize overall performance here with balanced accuracy or AUROC because while certain models may have demonstrably superior sensitivity, others had better specificity, thus illustrating the inevitable trade-off between sensitivity and specificity. The choice of appropriate model will therefore depend on whether healthcare teams place greater weight on sensitivity or specificity. Given the small differences between certain scoring systems, clinicians may equally prefer either to pragmatically adopt the simplest model (appropriate to their healthcare setting) or to opt for the best established and most widely validated systems.
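The probabilistic reading of the AUC used in this paragraph can be made explicit: the AUROC equals the probability that a randomly chosen patient who died received a higher score than a randomly chosen survivor, with ties counted as one half. The sketch below computes this directly by comparing every such pair; the function name and the eight-patient cohort are hypothetical, and in practice a statistical package would normally be used instead.

```python
def auroc(scores, died):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    Counts, over all (death, survivor) pairs, how often the patient who died
    has the higher score; tied pairs contribute 0.5.
    """
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Hypothetical cohort (not data from any included study).
scores = [0, 1, 1, 2, 2, 3, 3, 4]
died = [False, False, False, False, True, False, True, True]
print(round(auroc(scores, died), 2))
```

On this reading, the difference between an AUROC of 0.82 and one of 0.79 amounts to about three more correctly ordered pairs per hundred, which helps explain why such a gap may matter little for decisions about an individual patient.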
We presented results for both balanced accuracy and AUROC in order to allow comparison of the performance of each score. Balanced accuracy takes account of both sensitivity and specificity. While the AUROC is a better measure of discriminative ability than balanced accuracy, several studies reported sensitivity and specificity rather than AUROC.
The majority of the studies were conducted in hospital settings, but one study included both inpatients and outpatients and two studies were conducted in intensive care settings. The PSI was studied in both inpatient and outpatient settings, which is an advantage because its findings can be generalized to both of these settings [11]. Two models, the mortality risk index [14] and the PIRO score [28], were studied in intensive care settings. Community-based studies should be conducted in the future to include patients with less severe pneumonia.
Our systematic review also identified some key gaps in the existing research. One particular issue is the lack of validation data for several models. Given the diversity of patient populations and the heterogeneity seen in the meta-analyses of PSI and CURB-65, there is no guarantee that a model that performs well in one setting will do equally well in a different setting. It would be very helpful if the profusion of recently proposed models (often based only on data from a single centre) could be compared directly against older versions in a large multicentre international cohort.
The existing studies do not report on acceptability, uptake, and clinical impact of risk prediction tools in the routine clinical management of patients with pneumonia. Perry et al. conducted a survey of emergency physicians' requirements for clinical decision rules for acute respiratory illnesses [33], and they found that physicians wanted a highly sensitive rule, with a median required sensitivity of 97.0% for respiratory conditions. The most sensitive tool here is PSI, which offers up to 90% sensitivity to help identify those at higher risk of death, but physicians in busy emergency departments may find it too time-consuming and difficult to collect all of the variables (including detailed past medical history) for calculating the PSI. Hence, it appears from Perry's survey that there is a need for a score with higher sensitivity than is currently available from any of the existing scoring systems. If the uptake and implementation of risk prediction tools in clinical decision-making are highly variable [34-37], then patients are unlikely to reap benefits from the current profusion of risk prediction tools. There is evidence to suggest that both the uptake and the scoring accuracy of the pneumonia severity index were low [38,39]. Equally, it could be argued that the benefits of risk prediction models in reducing pneumonia morbidity and mortality need to be demonstrated in randomized controlled trials.
While the performance of a prediction rule is a major criterion for comparative superiority, simplicity is a very important determinant of potential clinical application. A survey conducted in Australia found that only 12% of respiratory physicians and 35% of emergency physicians reported using the PSI always or frequently even though it is recommended by the Australasian Therapeutic Guidelines [40]. Moreover, this study found that the majority of physicians were unable to accurately approximate the PSI scores and that calculations of the simpler CURB-65 were more accurate [40]. The study recommended that a single, simple pneumonia severity score should be used in the assessment of CAP [40]. With computer-assisted programmes, PSI can be calculated easily and accurately. The pragmatic approach would be to use more complex scoring with high accuracy in resource-rich settings and to use an alternative, simpler scoring system in community or resource-poor settings. Our systematic review provides a comprehensive comparison to help clinicians choose a score, or a combination of scores, for use in various health care settings.
Our review has a number of strengths. We conducted a systematic search to cover all scores including those that are established as well as those that have yet to be validated. Also, there was no restriction on the country of score origin, and we were able to capture scores from around the world. Our review also has a number of limitations, including difficulty in finding exact search terms to pick up this type of study. We only included initial derivation and first validation studies for the scores identified. Some of the scoring systems do not appear to have been validated yet. Here, there is a definite possibility of publication bias where studies showing the most favorable predictive ability were likely to be accepted for publication sooner than equivocal or less impressive data. In order to reduce the possibility of such bias, we included two systematic reviews [12,13] that examined the PSI and CURB scores (CRB-65, CURB, and CURB-65).
Since there already exist established models (CRB-65, CURB-65, and PSI) with reasonable to good discriminative ability across a wide range of settings and only small incremental differences between these and newer scores, further research should mainly focus on why patients get misclassified and whether we can identify important variables within them to improve the sensitivity of current models. Equally, the uptake of risk prediction models in routine clinical practice and any relationship with improved patient outcomes need to be rigorously assessed, perhaps through cluster-randomized controlled trials of different care pathways. These future trials should test whether clinical decisions based on pneumonia scores are associated with better patient outcomes compared with clinical decisions based on clinical judgment. Scores should also be tested in developing countries, as pneumonia mortality is high in these regions. Eventually, the goal should be to clarify the entire pathway for community-acquired pneumonia management and the role of risk prediction models for each stage in the community, at the emergency department, on hospital wards, and in intensive care.

Conclusions
Although there are a multitude of proposed risk prediction models, few have undergone proper validation, and no convincing evidence exists that their overall discriminative ability improves upon that of the well-established CURB-65 and PSI models. Future research should thus focus on randomized trials to test whether clinical decision rules using existing risk prediction models and guided treatment pathways can significantly improve pneumonia outcomes.