Deep Learning Models to Predict Fatal Pneumonia Using Chest X-Ray Images

Background and Aims Chest X-ray (CXR) is indispensable to the assessment of severity, diagnosis, and management of pneumonia. Deep learning is an artificial intelligence (AI) technology that has been applied to the interpretation of medical images. This study investigated the feasibility of classifying fatal pneumonia from CXR images using deep learning models built on publicly available platforms. Methods CXR images of patients with pneumonia at diagnosis were labeled as fatal or nonfatal based on medical records. CXR images from 1031 patients with nonfatal pneumonia and 243 patients with fatal pneumonia were used for training and self-evaluation of the deep learning models. All labeled CXR images were randomly allocated to the training, validation, and test datasets of the deep learning models. Data augmentation techniques were not used in this study. We created two deep learning models using two publicly available platforms and evaluated their performance using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1 score. Results The first model showed an area under the precision-recall curve of 0.929, with a sensitivity of 50.0% and a specificity of 92.4% for classifying fatal pneumonia. On an external validation test dataset of 100 CXR images, the sensitivity, specificity, accuracy, and F1 score were 68.0%, 86.0%, 77.0%, and 74.7%, respectively. In the original dataset, the second model showed a sensitivity, specificity, and accuracy of 39.6%, 92.8%, and 82.7%, respectively, while external validation showed values of 38.0%, 92.0%, and 65.0%, respectively, with an F1 score of 52.1%. These results were comparable to those obtained by respiratory physicians and residents. Conclusions The deep learning models yielded good accuracy in classifying fatal pneumonia. By further improving their performance, AI could assist physicians in the severity assessment of patients with pneumonia.


Introduction
Pneumonia is a leading cause of morbidity and mortality globally. In 2019, it caused 1.23 million deaths in adults older than 70 years and 2.49 million deaths in persons of all ages globally [1]. In Japan, pneumonia is classified mainly into community-acquired pneumonia (CAP), nursing and healthcare-associated pneumonia (NHCAP), and hospital-acquired pneumonia (HAP). We have previously reported the relationship between spleen volume and severity and mortality in patients with pneumococcal pneumonia [2]. Chest X-ray (CXR) is indispensable to the assessment of the severity and diagnosis of pneumonia [3]. The radiographic features of bilateral shadows, involvement of more than one lobe, bilateral pleural effusions, or the presence of a cavity predict a worse prognosis in pneumonia [4,5]. Therefore, the diagnosis and assessment of pneumonia severity from CXR images is important, but it is not performed accurately by nonrespiratory specialist physicians [6]. Deep learning is a machine learning technique in artificial intelligence (AI) that uses artificial neural networks as computational models to discover intricate structures and patterns in large, high-dimensional datasets [7]. ImageNet, a large dataset of more than 14 million human-annotated images, has been instrumental in the development of deep learning in image recognition. Classification errors in the annual ImageNet Large Scale Visual Recognition Challenge have decreased more than eightfold over the past 6 years, to less than 3% in 2017, surpassing human performance [8]. Advances in deep learning and the availability of digitized healthcare data have contributed to a growing number of studies describing deep learning applications in the field of medical imaging, such as chest radiographs [9].
Specifically, deep learning algorithms can differentiate normal CXR images from those showing pneumonia and diagnose pneumonia accurately, with a sensitivity of 81-100% and a specificity of 56.6-100% [10-15]. In addition, since the global pandemic of coronavirus disease 2019 (COVID-19), several deep learning models have been developed to diagnose COVID-19 pneumonia using CXR images, with a sensitivity of 71-98.8% and a specificity of 90-92.9% [16-20]. Furthermore, studies of deep learning models using CXR images to assess the prognosis and severity of COVID-19 pneumonia have been reported. Cohen et al. developed a deep learning algorithm to predict the severity of COVID-19 pneumonia using CXR images [21]. Zhu et al. developed a deep learning model to assess the severity of COVID-19 infection [22]. Recently, Li et al. developed a deep learning Siamese network to predict the radiographic assessment of lung edema (RALE) scores used to assess the severity of acute respiratory distress syndrome in patients with COVID-19 [23]. However, to the best of our knowledge, the prediction of the prognosis of non-COVID-19 pneumonia by deep learning using CXR images has not been sufficiently studied. In the era of the COVID-19 pandemic, the number of deaths due to pneumonia remains high. Hence, the development of prognostic tools for pneumonia patients is vital, and computer-aided diagnosis techniques based on deep learning can serve as a supplement in the clinical decision-making process. We performed this study to establish an AI diagnostic tool for assessing the fatality of pneumonia from CXR images using deep learning models.

Patients and Dataset.
We retrospectively investigated patients with pneumonia who underwent CXR examination at diagnosis in the Department of Respiratory Medicine at Harasanshin Hospital between January 2007 and October 2019. We then created an original dataset of CXR images of patients with pneumonia at diagnosis for deep learning modeling (Figure 1(a)). No patient with COVID-19 pneumonia was included in this cohort. The diagnostic criteria for pneumonia are listed in Table S1. Microbiological diagnosis was performed using cultures (sputum, blood, bronchial wash, and pleural effusion). Fatal cases were defined as patients who died from pneumonia at Harasanshin Hospital, while nonfatal cases were defined as patients who recovered from pneumonia following outpatient or inpatient treatment and were discharged from Harasanshin Hospital. Complications of congestive heart failure (CHF) have been reported to affect the diagnosis and prognosis of pneumonia [24]. Therefore, we evaluated the complications associated with CHF. Patients with pneumonia and CHF complications were defined as those diagnosed with chronic CHF or new heart failure at the time of pneumonia diagnosis. The diagnostic criteria for new heart failure are listed in Table S2. Furthermore, to externally validate the performance of the deep learning models, we prepared an external validation test dataset of 100 CXR images: 50 from patients with fatal pneumonia and 50 from patients with nonfatal pneumonia, mainly treated in the Department of General Internal Medicine at Harasanshin Hospital and not used in the training of the deep learning models (Figure 1(b) and Table S3). The requirement for written informed consent was waived because of the retrospective observational design, and the study was carried out using the opt-out method based on our hospital website.
The study was performed in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Harasanshin Hospital (No. 2020-09, May 5, 2020). The datasets are not publicly available for legal and ethical reasons. We retrospectively collected the following data from the patients' medical records: background characteristics, laboratory test findings at the onset of pneumonia, physical examination findings, CXR findings, and clinical courses.

Image Preparation and Model Training. CXR images of pneumonia patients at diagnosis were evaluated for the cardiothoracic ratio (CTR) [25], the number of lobes involved with infiltrate (1 or ≥2), the location of the infiltrate (unilateral or bilateral), the location of pleural effusions (none, unilateral, or bilateral), and the presence of cavities by a single reader (respiratory physician 1). Cardiomegaly was defined as a CTR of >50% in a posteroanterior (PA) view and >55% in an anteroposterior view [25]. To evaluate the interobserver reliability of the CXR image findings, the external validation test dataset of 100 CXR images was independently read by respiratory physicians 1 and 2, both of whom are board-certified with more than 10 years of experience. Interobserver reliability for the interpretation of radiographic findings was assessed by calculating agreement rates and the kappa statistic (κ) [26]. The CXR images were de-identified and saved as Joint Photographic Experts Group files with a resolution of 720 × 960 pixels. Data augmentation techniques were not used in this study.

Google Cloud AutoML Vision. Google Cloud AutoML Vision is a publicly available platform that provides automated deep learning models through training, evaluation, and prediction based on images [10]. Models built with Google Cloud AutoML Vision have shown discriminative performance and diagnostic properties comparable to those of state-of-the-art deep learning algorithms [10]. Google Cloud AutoML Vision has been used in diagnostic research using pathological and ultrasound images of breast cancer, diagnostic research using otoscopic images, research on retinal diseases, and the evaluation of spermatogenesis using histological images of the testis (Table S4). In this study, the original CXR image dataset was uploaded to Google Cloud Storage and randomly allocated to the training, validation, and test datasets (80%, 10%, and 10%, respectively) in Google Cloud AutoML Vision. The model learning framework incorporates the training data at each iteration of the training process and then uses the model's performance on the validation set to adjust the model's hyperparameters (variables that specify the model's structure). In the current study, we used Google Cloud AutoML Vision to create a deep learning model for classifying CXR images as fatal or nonfatal pneumonia.
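The 80%/10%/10% random allocation described above can be sketched as follows. This is a minimal illustration of a random split, not the platform's internal logic; the function name and seed are ours.

```python
import random

def split_dataset(items, seed=42, train_frac=0.8, val_frac=0.1):
    """Randomly allocate items to training, validation, and test sets
    (80%/10%/10% by default), mirroring the allocation used on the platform."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Example with 1274 placeholder image IDs (the size of the original dataset)
train, val, test = split_dataset(range(1274))
print(len(train), len(val), len(test))  # 1019 127 128
```

Note that rounding makes the exact subset sizes depend on the implementation, which is consistent with the platform reporting 1016/125/131 rather than exact tenths.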

Performance of the Deep Learning Model in External Validation and Comparison with Physicians. After training, the deep learning model built with Google Cloud AutoML Vision was deployed for online predictions. The model provided a score for each prediction of pneumonia prognosis based on CXR images. The score was a confidence estimate between 0.0 and 1.0; a higher value indicated greater confidence that the annotation was accurate. We assessed the performance of the deep learning model using an external validation test dataset of 100 CXR images to verify the generalizability of the model. The external validation test dataset was not used in the training, validation, or testing of the deep learning models. In addition, respiratory physicians 2 and 3 and residents 1 and 2, who were not informed of the prognosis of the pneumonia patients, were asked to infer the prognosis from the 100 CXR images of the external validation test dataset. Respiratory physician 3 is a board-certified physician with more than 10 years of experience. Residents 1 and 2 are physicians within 2 years of graduation.
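Turning the model's confidence score into a predicted label is a simple thresholding step. The sketch below assumes the 0.5 operating threshold used in the performance calculations; the function name is ours.

```python
def classify(fatal_score, threshold=0.5):
    """Map the model's confidence score (0.0-1.0) for the 'fatal' class
    to a predicted label at the given operating threshold."""
    return "fatal" if fatal_score >= threshold else "nonfatal"

print(classify(0.93))  # fatal
print(classify(0.12))  # nonfatal
```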

Sony Neural Network Console. Sony Neural Network Console (NNC) is a graphical user interface-based deep learning development tool [27]. NNC has been used in studies of retinal diseases and the classification of neutrophil fractions (Table S4). We evaluated whether NNC could also be used to create a deep learning model based on ResNet to predict fatal pneumonia from CXR images. We used the same dataset for training in NNC as in Google Cloud AutoML Vision (Figure 1, Table 1). The ResNet model, shown in Figure 2, is a neural network architecture proposed by Microsoft Research in 2015 and is believed to exhibit high image discrimination performance [28]. In addition, a ResNet-based deep learning model applied to the image diagnosis of pneumonia using CXR images showed high performance, with a sensitivity of 96.5%, specificity of 92.7%, and accuracy of 94.6% [15]; we therefore expected that a high-performance model would be developed in this study.
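The key idea of ResNet [28] is the residual (skip) connection: each block learns a residual F(x) that is added back to its input, y = F(x) + x, which eases the optimization of very deep networks. The following is a minimal numerical sketch of that idea only, not the actual ResNet architecture trained in NNC; the names are ours.

```python
def residual_block(x, f):
    """Apply a residual block: y = F(x) + x, element-wise.
    `f` stands in for the block's learned convolution stack."""
    return [fi + xi for fi, xi in zip(f(x), x)]

# If the learned residual is zero, the block reduces to the identity,
# which is what makes very deep stacks easy to optimize.
identity_out = residual_block([1.0, 2.0, 3.0], lambda x: [0.0] * len(x))
print(identity_out)  # [1.0, 2.0, 3.0]
```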

Statistical Analysis. Google Cloud AutoML Vision provides an area under the precision-recall curve (AUPRC), sensitivity (recall), and positive predictive value (PPV) (precision). Sensitivity, specificity, PPV, negative predictive value (NPV), accuracy, and F1 score were calculated to evaluate the performance of the model at a threshold of 0.5. The same metrics were calculated for the deep learning model built with NNC and for the prognostic performance of the physicians on the external validation test dataset of 100 CXR images. Categorical variables were compared using Fisher's exact test. Survival was evaluated using the Kaplan-Meier method, and differences in survival were analyzed using the log-rank test. The observed proportional interobserver agreement rate for the presence or absence of radiographic findings was calculated by summing the proportions of equal interpretations of the two board-certified respiratory physicians (respiratory physicians 1 and 2). The kappa statistic is a measure of interobserver reliability that adjusts for agreement by chance. A κ < 0.20 indicates poor agreement; a κ of 0.21-0.40, fair agreement; a κ of 0.41-0.60, moderate agreement; a κ of 0.61-0.80, good agreement; and a κ of 0.81-1.00, very good agreement between two observers [26]. Logistic regression analyses were used to examine the associations among radiographic characteristics, complications of congestive heart failure, and mortality. In the first step, each risk factor was tested individually in a univariate analysis by Fisher's exact test. In the second step, all risk factors that showed an association in the univariate analysis (P < 0.15) were added to the multivariable model. Finally, backward stepwise selection was used to determine the factors associated with mortality.
All statistical analyses were performed using EZR, a graphical user interface for R [29].
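The evaluation metrics above follow directly from the 2 × 2 confusion matrix, with fatal pneumonia as the positive class. The sketch below uses the counts implied by the first model's external validation results (sensitivity 68%, specificity 86% on 50 fatal and 50 nonfatal images); the function name is ours.

```python
def metrics(tp, fp, fn, tn):
    """Compute the evaluation metrics used in this study from
    confusion-matrix counts (fatal pneumonia = positive class)."""
    sensitivity = tp / (tp + fn)          # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                  # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return sensitivity, specificity, ppv, npv, accuracy, f1

# Counts implied by the external validation of the first model
# (50 fatal and 50 nonfatal CXR images)
sens, spec, ppv, npv, acc, f1 = metrics(tp=34, fp=7, fn=16, tn=43)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f} f1={f1:.3f}")
```

With these counts the F1 score works out to 68/91 ≈ 74.7%, matching the value reported in the abstract.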

Interobserver Variation in the Interpretation of CXR Image Findings for Pneumonia. Table S3 shows the characteristics of the 100 patients with pneumonia prepared for the external validation of the deep learning models. Two respiratory physicians (respiratory physicians 1 and 2) evaluated the findings of these CXR images at pneumonia diagnosis. Table S5 shows the agreement rates on the specific patterns of radiographic infiltrates in the external validation test dataset in which both respiratory physicians agreed on the presence of a pulmonary infiltrate. In the external validation test dataset, the agreement rates and κ were as follows: number of lobes involved (overall agreement, 86%; κ = 0.62); location of the infiltrate (overall agreement, 77%; κ = 0.529); pleural effusion (location) (overall agreement, 73%; κ = 0.687); and cavitation (overall agreement, 97%; κ = 0.556) (Table S5).
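Cohen's kappa adjusts the observed agreement rate for the agreement expected by chance, given each reader's marginal rating rates. For a binary finding rated by two readers it can be computed as follows; the counts shown are illustrative, not the study's raw tallies.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for two readers rating a binary finding.
    a: both 'present'; b: reader1 'present', reader2 'absent';
    c: reader1 'absent', reader2 'present'; d: both 'absent'."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement from the readers' marginal rates
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Perfect agreement gives kappa = 1.0 regardless of the marginals
print(cohens_kappa(a=40, b=0, c=0, d=60))  # 1.0
# 90% raw agreement with balanced marginals gives kappa = 0.8
print(cohens_kappa(a=45, b=5, c=5, d=45))
```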

Patient Demographic Characteristics in the CXR Image Original Dataset for Training of Deep Learning Models. Of 1356 patients with pneumonia, 1274 (94.0%) were included in the present study (Figure 1(a)). The demographic and clinical characteristics of the study participants are presented in Table 1, and the results of the risk factor analyses for mortality are presented in Table 4.

Performance of the Deep Learning Model by Google Cloud AutoML Vision. A total of 1016 CXR images randomly selected by the platform were used for training, 125 for validation, and 131 for testing in Google Cloud AutoML Vision (Figure 1(b)). Based on the self-evaluation of the platform, the deep learning model built with Google Cloud AutoML Vision showed an AUPRC of 0.929, with a sensitivity of 50.0%, specificity of 92.4%, and accuracy of 84.0% (Figure 3(a) and Table 5). The confusion matrix of the validation results for the test data is shown in Figure 3(b). Figure 3(c) shows a CXR image at pneumonia diagnosis that was correctly assessed as fatal pneumonia by the deep learning model. Figure 3(d) shows a CXR image at pneumonia diagnosis that was correctly assessed as nonfatal pneumonia by the deep learning model. (Figure 4(e)). The performance of the deep learning model using Google Cloud AutoML Vision for classifying fatal and nonfatal pneumonia on the external validation test dataset is shown in Figure 5 (confusion matrix) and Figure 6(a), and the numerical values are presented in Table 6. In the group predicted to have fatal pneumonia by the Google Cloud AutoML Vision model, the rates of poor prognostic findings on pneumonia CXR images and of complications, such as multilobar involvement, bilateral infiltrate, bilateral pleural effusion, cardiomegaly, and complication of CHF, were significantly higher than those in the group predicted to have nonfatal pneumonia (Figure 6(b)).

Performance of the Deep Learning Model by NNC. The CXR images at pneumonia diagnosis were randomly allocated to the training and validation datasets (80% and 20%, respectively) (Figure 1(b)). The CXR images used in NNC could not be trained at 720 × 960 pixels because of technical problems; therefore, the images were processed to 240 × 320 pixels for training (Figure 2). An evaluation using the validation data of the deep learning model built with NNC showed a sensitivity of 39.6%, specificity of 92.8%, and accuracy of 82.7% (Table 5). The confusion matrix of the validation results for the test data is shown in Figure 3(e). The performance of the deep learning model by NNC for classifying fatal and nonfatal pneumonia on the external validation test dataset is shown in Figure 5 (confusion matrix) and Figure 6(a), and the numerical values are presented in Table 6.
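The reduction from 720 × 960 to 240 × 320 pixels is a factor-of-3 downsampling in each dimension. In practice this would be done with an imaging library; the effect on resolution can be sketched in pure Python with nearest-neighbor sampling (an illustration only, not NNC's actual resampling method).

```python
def downsample(image, factor=3):
    """Nearest-neighbor downsampling: keep every `factor`-th pixel
    in each dimension. `image` is a list of pixel rows."""
    return [row[::factor] for row in image[::factor]]

# A placeholder 720 x 960 grayscale image (all zeros)
full = [[0] * 960 for _ in range(720)]
small = downsample(full)
print(len(small), len(small[0]))  # 240 320
```

Each downsampled image thus carries one-ninth of the original pixel count, which is consistent with the image degradation discussed later as a possible cause of the NNC model's lower sensitivity.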

Comparison of the Performance between the Deep Learning Models and Physicians. The respiratory physicians had better specificity and PPV than the deep learning models (Figure 6(a) and Table 6). On the other hand, the residents had lower specificity and PPV than the deep learning models (Figure 6(a) and Table 6).

Discussion
We developed deep learning models to predict fatal pneumonia using CXR images. The deep learning prediction models showed a performance comparable to that of physicians in predicting the prognosis of pneumonia based on CXR images (Figure 6(a) and Table 6). These results suggest that the deep learning models are useful for prognostic evaluation using CXR images in patients with pneumonia at diagnosis. Feng et al. developed a deep learning prognostic model for CAP using nonimaging data (such as comorbidities, vital signs, and blood biomarkers), with a sensitivity of 74.4% to 98.2%, specificity of 83.1% to 100%, and accuracy of 79.3% to 99% [30]. Furthermore, deep learning models have been reported to predict the severity of COVID-19 pneumonia using CXR images [21-23]. However, the prediction of the prognosis of non-COVID-19 pneumonia by deep learning using CXR images has not been sufficiently studied. Our report suggests that AI with deep learning can also be useful in predicting the prognosis of pneumonia from CXR images, with a level of performance similar to that of the studies above, which is a notable innovation. Deep learning models for the automated assessment of COVID-19 pneumonia severity on CXR have been trained using radiologists' CXR severity scores as labels [21,22]. Such severity-score labels are subject to interpretation, and variability exists [23]. In contrast, the image labeling in this study is highly objective, being based on clinical outcome data (fatal or nonfatal), which constitute a ground truth definition [31]. Multilobar pneumonia, bilateral pneumonia, and bilateral pleural effusions have been reported as poor prognostic factors for pneumonia [4,5]. These findings were also poor prognostic factors in our study (Table 2). In addition, external validation showed that these findings were significantly more frequent in the group predicted as fatal pneumonia than in the group predicted as nonfatal pneumonia by Google Cloud AutoML Vision (Figure 6(b)).
These results suggest that the deep learning model may have learned these findings as features of fatal pneumonia. In this study, 9.8% of patients with pneumonia also had CHF (Table 1). It has been reported that the prognosis of pneumonia is poor in patients with CHF [24]. In this study, the multivariate logistic regression model showed that the complication of heart failure in patients with pneumonia was an independent risk factor: the risk of death in pneumonia patients with CHF was 3.3 times higher than that in pneumonia patients without CHF (Table 4). Furthermore, external validation of the Google Cloud AutoML Vision model showed that the group predicted to have fatal pneumonia contained significantly more patients with CHF than the group predicted to have nonfatal pneumonia (Figure 6(b)). This result suggests that the deep learning model can accurately differentiate between fatal and nonfatal pneumonia, even in pneumonia patients with CHF. The performance evaluation of the Google Cloud AutoML Vision model in differentiating fatal pneumonia on the external validation test dataset showed a sensitivity of 68%, specificity of 86%, and accuracy of 77% (Figure 6(a) and Table 6). The sensitivity and accuracy of NNC were lower than those of Google Cloud AutoML Vision, but its specificity was as high as 92.0%. This may have been due to the effect of image degradation during training and the small number of fatal cases; the CXR images used for NNC could not be trained at 720 × 960 pixels due to technical problems, so images processed to 240 × 320 pixels were used for training. In addition, in the case of Google Cloud AutoML, the details of the model architecture are not disclosed, making detailed study difficult; this is an issue that needs to be considered in the future. There is a good possibility that the performance of the deep learning models can be improved by increasing the amount of training data.
Further study incorporating additional metadata, such as age, gender, and the presence or absence of heart failure complications, is expected to further improve learning performance and is a topic for future research. Furthermore, in terms of specificity and PPV, the performance of the deep learning models on both platforms was comparable to that of the physicians. These results indicate that deep learning models for pneumonia prognosis built from CXR images perform reproducibly well across platforms. Additionally, the accuracy and F1 score of the deep learning model built with Google Cloud AutoML Vision were higher than those of the board-certified respiratory physicians. These results suggest that, by further improving its performance, the clinical implementation of this deep learning model for the severity assessment of pneumonia patients may assist physicians in general practice, especially physicians in clinics or on remote islands and in suburbs, where it is difficult to consult respiratory specialists. It has been reported that the image learning performance of Google Cloud AutoML is comparable to that of conventional deep learning models built with classical frameworks [10]. What is important in the future is how to implement deep learning models in clinical practice. This study was conducted solely by clinicians, and we believe that this research is very important for the future application of deep learning models by clinicians in clinical practice.
Regarding the difference in sensitivity between the respiratory physicians and residents, participants in this study were not allowed to review patient history or previous examinations, both of which have been shown to improve a physician's diagnostic ability in interpreting CXR images [32]. In particular, respiratory physicians are more likely to refer to patient history and previous examinations in practice, which may have influenced the difference in sensitivity relative to the residents.
In our study, 67.3% of cases were pneumonia other than CAP (NHCAP, HAP, and ventilator-associated pneumonia [VAP]) (Table 1), and methicillin-resistant Staphylococcus aureus (MRSA) and Pseudomonas aeruginosa were frequently reported as causative organisms (Table S6). This was because most of our patients were elderly people in nursing homes, and the absolute number of NHCAP and HAP cases was particularly high compared to that of CAP.
This study had several limitations. First, this was a single-center study with small datasets, and these deep learning models cannot be directly applied clinically in medical institutions nationwide. Furthermore, deep learning models with higher accuracy are required for clinical applications. To create deep learning models with higher accuracy and robustness that can be used at multiple institutions, it is necessary to develop models using a larger sample size with multi-institutional data. Second, the radiographic findings of the original 1274 CXR image dataset (Table 2) were assessed by a single physician (respiratory physician 1); therefore, the radiographic findings may not be sufficiently accurate [33,34]. However, the validation using the external validation data showed moderate to good agreement, with κ values ranging from 0.529 to 0.687 between respiratory physicians 1 and 2 (Table S5). Furthermore, the performance evaluation of the deep learning model in the external validation showed a similar trend in the radiographic findings assessed by respiratory physicians 1 and 2 (Figure 6(b)). Based on these results, the radiologic findings in the original 1274 CXR image dataset at pneumonia diagnosis were also considered to have a certain degree of accuracy. Third, the model cannot retain its ability to accurately diagnose fatal pneumonia without updating. Medical care is advancing daily, and the survival rate of pneumonia is also expected to change over time. Therefore, deep learning models must be retrained using additional data to dynamically update their performance [35].

In the 100 CXR images of the external validation test dataset, respiratory physicians 1 and 2 evaluated the cavities, the number of involved lung lobes, the location of infiltrates, and pleural effusions. The percentages of these findings were compared between the groups that the Google Cloud AutoML Vision model predicted as nonfatal or fatal pneumonia. We also compared the rates of cardiomegaly and CHF between the groups that the Google Cloud AutoML Vision model predicted as nonfatal or fatal pneumonia. Error bars represent 95% confidence intervals. P values were determined by Fisher's exact test. CXR, chest X-ray; CHF, congestive heart failure.

Conclusions
The diagnostic tool based on deep learning models yielded good accuracy in classifying fatal pneumonia. By further improving the performance of these deep learning models, AI could assist physicians in the severity assessment of pneumonia patients in general practice.