Chest Radiographs for Pediatric TB Diagnosis: Interrater Agreement and Utility

The chest radiograph (CXR) is considered a key diagnostic tool for pediatric tuberculosis (TB) in clinical management and endpoint determination in TB vaccine trials. We set out to compare interrater agreement for TB diagnosis in western Kenya. A pediatric pulmonologist and a radiologist (experts), a medical officer (M.O), and four clinical officers (C.Os) with basic training in pediatric CXR reading assessed CXRs of infants who were TB suspects in a cohort study. The experts and the M.O read the images blinded to clinical history; the C.Os had access to clinical findings for patient management. Weighted kappa scores summarized interrater agreement on lymphadenopathy and abnormalities consistent with TB. Sensitivity and specificity of raters were determined using microbiologically confirmed TB as the gold standard (n = 8). A total of 691 radiographs were reviewed. Agreement on abnormalities consistent with TB was poor (κ = 0.14; 95% CI: 0.10–0.18) and agreement on lymphadenopathy was moderate (κ = 0.26; 95% CI: 0.18–0.36). The M.O [75% (95% CI: 34.9%–96.8%)] and C.Os [63% (95% CI: 24.5%–91.5%)] had high sensitivity for culture-confirmed TB. TB vaccine trials utilizing expert agreement on CXR as a nonmicrobiologically confirmed endpoint will have reduced specificity and will underestimate vaccine efficacy. C.Os detected many of the bacteriologically confirmed cases; however, this must be interpreted cautiously as they were unblinded to clinical features.


Introduction
Pediatric TB remains a challenging disease to diagnose despite advances in molecular techniques in mycobacterial identification and antigen based tests for latent TB infection [1]. Classical TB symptoms are nonspecific [2] and more so in settings with high HIV prevalence and malnutrition. Atypical presentation with acute severe pneumonia in young children has been observed [3]. Childhood TB is characterised by paucibacillary disease and microbiological confirmation is only possible in <50% of pediatric cases [1]. Chest imaging is therefore of great importance in identifying smear negative, culture negative TB. Among adults with suspected TB, several clearly defined chest radiograph features have been identified as having high interrater reliability and correlation with culture positive TB [4]. Unfortunately, similar data among infants has been limited. Lymphadenopathy is the hallmark of primary TB [5]. However, it is frequently missed due to inadequate sensitivity of the chest radiograph [6]. Chest CT scan (CT) has been considered the gold standard for detecting mediastinal lymphadenopathy, detecting up to 60% more lymphadenopathy in children with normal chest radiographs [7]. Despite this, use of CT has been limited in infant TB vaccine trials, which are set up to detect every TB endpoint so as to demonstrate efficacy. Reasons include the modest agreement on lymphadenopathy on CT [7], cost limitations, and the reluctance to use high dose ionizing radiation in young children. Thus, the chest radiograph is the mainstay of radiological diagnosis and is frequently the only tool available.
There are a limited number of studies that have described interrater agreement on chest radiograph for TB diagnosis [8,9]. Existing studies had small sample sizes, were drawn from hospitalized children, and compared agreement only among experienced and highly trained raters; and most importantly, they focused entirely on presence of lymphadenopathy as a marker of TB. While lymphadenopathy is a key feature for diagnosing childhood TB, other radiological features also contribute to the diagnosis [5,6].
We conducted the study in Siaya County, western Kenya, which has a high burden of both tuberculosis and HIV [10,11]. The objective was to determine interrater agreement on any abnormality on chest radiograph and agreement on abnormalities consistent with TB among experienced and inexperienced raters. We also aimed to compare the raters' sensitivity and specificity against microbiologically confirmed TB in young children.

Study Setting.
The study was conducted in Siaya County, western Kenya, from June 2009 to December 2011. TB diagnosis is by the Keith Edward Score Chart [12] which assigns a score to suggestive pulmonary and extrapulmonary signs/symptoms of TB. Children who score ≥7 as well as those who score <7 but have an abnormal chest radiograph are treated for TB.
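The treatment rule described above can be written as a one-line decision function. This is a minimal sketch only: the function name, parameter names, and the constant are hypothetical, and the scoring of individual signs and symptoms on the Keith Edward Score Chart is not reproduced here.

```python
# Threshold taken from the text: children scoring >= 7 are treated.
TREATMENT_SCORE_THRESHOLD = 7


def treat_for_tb(score: int, cxr_abnormal: bool) -> bool:
    """Return True if a child would be treated under the rule in the text.

    A child is treated when the score is at or above the threshold, or
    when the score is below it but the chest radiograph is abnormal.
    """
    return score >= TREATMENT_SCORE_THRESHOLD or cxr_abnormal


if __name__ == "__main__":
    print(treat_for_tb(8, False))  # high score alone triggers treatment
    print(treat_for_tb(4, True))   # low score, abnormal radiograph
    print(treat_for_tb(4, False))  # neither criterion met
```

The sketch makes explicit that an abnormal radiograph alone is sufficient to start treatment, which is why the reliability of radiograph reading matters for clinical management.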
A total of 2900 BCG vaccinated infants, aged zero to six weeks and weighing at least 1700 g, were enrolled and followed up for 12-24 months to determine TB incidence. TB suspects were identified through four monthly scheduled visits and sick visits as well as by review of TB case records for contact tracing. Suspect criteria included a history of contact and/or suggestive signs and symptoms of TB and/or protocol defined hospitalization history, for example, for severe pneumonia.

Clinical and Laboratory Investigations.
TB suspects were admitted into a case verification ward (CVW) for collection of two serial induced sputum specimens, two serial early morning gastric aspirates, and HIV testing by DNA PCR (HIV Qual test 48, Roche Molecular Systems Inc, Switzerland) or rapid HIV test for those aged less than and greater than 18 months, respectively. Tuberculin skin testing was also done. Digital anteroposterior (AP) and lateral chest radiographs were taken at admission and images were written onto CD-ROMs.
The CD-ROMs included Digital Imaging and Communications in Medicine (DICOM) software (Philips) that was used to view images. Readers could change the luminance of the grayscale display as well as the magnification.

Definitions and Classifications.
Microbiologically confirmed TB (definite TB) was M. tuberculosis identified by Xpert MTB/RIF or speciated with either Capilia (FIND and Tauns co. Ltd) or GenoType assay (Hain Diagnostika, Nehren, Germany) after positive sputum culture. Probable TB was a case started on anti-TB treatment based on Keith Edward Score Chart and/or a CXR consistent with TB.

Raters and Training.
There were four sets of raters: a radiologist and a pediatric pulmonologist (expert readers), a medical officer (M.O), and four clinical officers (C.Os). The C.Os reviewed the images while the suspects were admitted to the CVW; all other raters read the images after close of the study and were blinded to the clinical history. All raters were trained on using the electronic interface to enter readings. The clinical officers and the medical officer were trained on reading and identifying TB on pediatric radiographs prior to the start of the study.
A chest radiograph reading form developed during a consensus meeting of TB vaccine sites (Cape Town, December 2008) was used. An electronic version was developed to standardize reporting and minimize ambivalence in the diagnosis.
The technical quality of images was assessed prior to reading. Indicators included adequacy of collimation (visibility of the lung apices, costophrenic angles with nothing obscuring the lung fields), adequate exposure, the number of visible intervertebral spaces, adequate inspiration, by counting six anterior ribs, and absence of rotation by comparing the length of the same rib on the left and right hemidiaphragms. Thereafter, the reader classified the chest radiograph quality as optimal or suboptimal or unreadable. "Unreadable" radiographs were counted as suboptimal but were still read.
Raters then systematically reviewed the images for airway narrowing, left tracheal deviation, lymphadenopathy, airway opacities, calcification, and pleural effusion. The pathology items were scored individually as present, absent, or equivocal. The final assessment of the image was one of three categories: normal radiograph; abnormal, TB unlikely; or abnormal, TB likely. Radiological signs suggestive of TB were a miliary picture, airway narrowing or tracheal deviation to the left, presence of hilar, paratracheal, subcarinal, or other lymphadenopathy, and evidence of calcification, cavitation, pleural effusion, or pleural thickening.

Data Collection and Analysis.
Data quality was assured by edit, logic, and validation checks built into the data entry interface. Data cleaning was also conducted, and an audit trail of changes was maintained. Data were saved in an SQL database and analysed using SAS 9.0 (SAS Institute Inc, Cary, NC, USA). To quantify the degree of agreement in TB diagnosis among the raters, we estimated the individual kappas for each rating category as well as a generalized (multiple rater) chance-corrected kappa statistic, which is an extension of Cohen's kappa for assessing reliability or proportion of agreement for multiple raters. The overall/generalized kappa measures agreement across all categories.
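A multirater chance-corrected kappa of the kind described above can be illustrated with a short sketch. The text does not name the exact variant or weighting scheme computed in SAS, so the example below assumes an unweighted Fleiss-type kappa; the function name, category labels, and toy readings are hypothetical and are not study data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Unweighted Fleiss-type kappa for a fixed number of raters per subject.

    ratings: list of rows, one row per radiograph, one label per rater.
    categories: list of all possible labels.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every radiograph needs the same number of ratings")

    # counts[i][j] = number of raters assigning radiograph i to category j
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])

    # Observed agreement: proportion of agreeing rater pairs per radiograph,
    # averaged over radiographs.
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall share of ratings in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)

    # Undefined (division by zero) if all ratings fall in one category.
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    labels = ["normal", "tb_unlikely", "tb_likely"]
    toy = [
        ["normal", "normal", "tb_unlikely"],
        ["tb_likely", "tb_likely", "tb_likely"],
        ["normal", "tb_unlikely", "tb_likely"],
        ["normal", "normal", "normal"],
    ]
    print(round(fleiss_kappa(toy, labels), 3))
```

Kappa equals 1 when every rater assigns every radiograph to the same category and falls to 0 or below when observed agreement is no better than expected by chance.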
Kappa scores were interpreted as follows: poor 0.01-0.20, moderate 0.21-0.40, fair 0.41-0.60, good 0.61-0.80, or excellent 0.81-1.0 [13]. We compared agreement on any abnormality on chest radiograph as well as agreement on abnormalities consistent with TB. For comparability with previously conducted pediatric chest radiograph studies, we examined the agreement on lymphadenopathy on AP and lateral images. Sensitivity and specificity of readers' diagnoses against definite TB were also calculated. To examine each rater's propensity to place a radiograph in a certain category and thus elucidate patterns of agreement, McNemar's test was applied for 2 × 2 tables and Bowker's test of symmetry for tables with more than two categories. These methods compare the marginal frequencies of each rater and test for statistically significant differences (tests of marginal homogeneity); where P < 0.05, marginal heterogeneity exists, that is, the raters differ in their propensity to rate a category.
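The interpretation bands and the 2 × 2 test of marginal homogeneity described above can be sketched as follows. The bands are those given in the text; the label for values at or below zero is an assumption, since the text does not assign one. The McNemar test shown is the exact binomial form, which may differ from the variant the authors ran in SAS, and the function names are hypothetical.

```python
from math import comb


def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to the bands used in the text."""
    if kappa <= 0:
        return "none"  # assumption: the text defines no band at or below 0
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "moderate"
    if kappa <= 0.60:
        return "fair"
    if kappa <= 0.80:
        return "good"
    return "excellent"


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the discordant cell counts.

    b: radiographs rated abnormal by rater A but normal by rater B.
    c: radiographs rated normal by rater A but abnormal by rater B.
    Under marginal homogeneity, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(interpret_kappa(0.14))   # band for the TB-consistent kappa
    print(interpret_kappa(0.26))   # band for the lymphadenopathy kappa
    print(mcnemar_exact_p(1, 9))   # hypothetical discordant counts
```

A small P value indicates that one rater calls a category systematically more often than the other, which is a different question from how often the two raters agree.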
The study was reviewed and approved by the Kenya Medical Research Institute Ethics Review Committee (KEMRI-ERC) SSC 1465. Written informed consent was obtained from parents and guardians before study entry and for CVW investigations.
Agreement among all raters on the diagnosis of lymphadenopathy was moderate, with a multirater weighted κ = 0.26 (Table 2). On any abnormality on the radiograph, agreement was moderate between expert readers (κ = 0.28 (95% CI: 0.19-0.37)) and between expert readers and the M.O.
There was similar propensity to rate categories between the expert pairs (P = 0.14) and between the radiologist and the clinical officer (P = 0.24).
Regarding quality of radiographs, little or no agreement was registered across all rater pairs. The best kappa score was 0.07 (0.03-0.10), observed between the expert pairs (Table 5).
There were eight definite and 40 probable TB cases in the study. One radiograph of a definite TB case was not reviewed by all four raters. The sensitivity and specificity of raters (Table 3) were determined using definite TB cases as the gold standard (n = 8). The clinical and medical officers detected the largest proportion of definite TB, while specificity was highest for expert raters.
Due to the small number of definite cases, sensitivity was imprecisely measured. The same table shows low positive predictive values (PPV) (<4.0% for all raters) and high negative predictive values (>99.0% for all raters) for the chest radiograph.
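The imprecision noted above can be made concrete with an exact (Clopper-Pearson) binomial confidence interval. The sketch below is an illustration, not the authors' SAS code. It assumes that the reported sensitivities of 75% and 63% correspond to 6 of 8 and 5 of 8 definite cases, which is an inference from the percentages rather than a count stated in the text; the function names are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target: float) -> float:
    """Bisection on [0, 1] for a function f that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x: int, n: int, alpha: float = 0.05):
    """Exact two-sided confidence interval for a proportion x / n."""
    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2
    )
    # Upper limit: the p at which P(X <= x) equals alpha / 2,
    # i.e. P(X > x) equals 1 - alpha / 2.
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    for detected in (6, 5):  # assumed counts out of 8 definite cases
        low, high = clopper_pearson(detected, 8)
        print(f"{detected}/8: {low:.3f} to {high:.3f}")
```

With only eight definite cases the interval spans roughly 60 percentage points, so differences in sensitivity between rater groups cannot be distinguished reliably.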
Of 28/40 (70%) probable TB cases whose chest radiographs were read by all raters, experts agreed on only two as being consistent with TB. Such would be the stringent case definition applied in infant TB vaccine trials [14], where only radiographs in which experts agreed (Figure 2) would count as a non-microbiologically confirmed (probable) TB endpoint.
Interdisciplinary Perspectives on Infectious Diseases 5

Discussion
Overall, we observed poor to moderate agreement between experts and between expert and nonexpert pairs. This was consistent across the raters' summary opinion of the radiograph as well as on specific pathology such as lymphadenopathy. Our findings are consistent with other studies comparing interrater agreement on lymphadenopathy for TB diagnosis [9]. Additionally, we demonstrate that agreement is also poor for the composite assessment of the radiograph, beyond individual radiographic abnormalities. This undermines the reliability of the chest radiograph for TB diagnosis in infant vaccine trials, as well as possibly for clinical diagnosis and patient care.
We suggest several contributing factors. In an infant cohort study with active case finding, suspects are likely to be picked up and investigated before advanced disease with well defined radiographic features is evident. Early tuberculosis is likely associated with greater diagnostic uncertainty and risk of misclassification [15]. Interrater agreement studies in a clinical setting where TB suspects may present with more advanced disease are needed to confirm this.
A study of two large infant cohorts in South Africa showed frequent discordance between radiological and microbiological features of TB [2,16]. The absence of clear, defined radiological abnormalities that correlate well with microbiologically confirmed disease contributes to the lack of reproducible, standardized criteria that raters can use with certainty to evaluate radiographs.
In radiographs of young children, mediastinal abnormalities are difficult to assess and interpret [17], particularly for inexperienced readers, in this case the C.Os and the M.O. However, even experienced readers have been found to have poor agreement on lymphadenopathy on chest radiographs [9].
Previous studies had smaller sample sizes for assessment of interrater variability [8,9]. One of the strengths of this study is the inclusion of a large sample of radiographs for evaluation, from young children with a broad range of respiratory illness and comorbidities. The varying levels of raters' expertise represent the general clinical care structure from primary level to specialized referral hospitals. The findings are therefore applicable to a broad range of settings.
The study had limitations. It was not possible to obtain consensus from expert readers on radiographs on which they differed. This would have increased agreement scores and validity of the chest radiograph as a diagnostic tool. It would also have elucidated patterns of disagreement in order to refine criteria for identifying pathologies consistent with TB on chest radiographs as has been previously recommended [9].
Obtaining high interrater agreement for pediatric chest radiographs in acute pediatric respiratory illness is difficult. Pneumococcal vaccine trials have succeeded in this for opacities consistent with pneumonia on pediatric chest radiographs [18]. Unfortunately, it has been difficult to replicate this success in TB vaccine trials. Conventionally, TB vaccine trial efficacy sample sizes are calculated based on composite endpoints. These include bacteriological and nonbacteriological criteria, because microbiologically confirmed cases are expected to contribute only a limited number of endpoints. To increase the number of endpoints and thus reduce sample size requirements, nonbacteriologically confirmed endpoints that rely on chest radiograph findings are included. The latter are defined as radiographic findings compatible with tuberculosis identified independently by two experts [15,19]. This approach has some limitations. We found that experts agreed on only five of 35 radiographs as being consistent with TB. Of these, three were bacteriologically confirmed; therefore, the chest radiograph contributed only two additional endpoints. Poor agreement and high variability among experts in interpreting pediatric CXRs for TB diagnosis increase the probability of misclassifying true disease status and thus of underestimating vaccine efficacy [20].
Table 5: Interrater agreement on quality of chest radiographs.
C.Os and the M.O picked up the majority of the cases that were later bacteriologically confirmed. This would seem like a positive outcome, given that they work as primary health care providers and would therefore accelerate diagnosis and treatment of infants with TB. However, the limited number of definite cases results in imprecise estimates of sensitivity, which should be cautiously interpreted. The high sensitivity also comes at the cost of specificity and could result in unnecessary TB treatment.
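The link between endpoint misclassification and underestimated vaccine efficacy discussed above can be illustrated with a standard nondifferential misclassification calculation. This is a sketch under stated assumptions: the attack rate, true efficacy, sensitivity, and specificity values are hypothetical and are not estimates from this study, and the endpoint is assumed to be misclassified equally in both trial arms.

```python
def observed_attack_rate(true_rate: float, sensitivity: float,
                         specificity: float) -> float:
    """Proportion classified as cases by an imperfect endpoint definition.

    True cases are counted with probability `sensitivity`; non-cases are
    falsely counted with probability 1 - `specificity`.
    """
    return sensitivity * true_rate + (1 - specificity) * (1 - true_rate)


def observed_vaccine_efficacy(rate_unvaccinated: float, true_ve: float,
                              sensitivity: float, specificity: float) -> float:
    """Vaccine efficacy as it would be estimated from misclassified endpoints."""
    rate_vaccinated = rate_unvaccinated * (1 - true_ve)
    obs_vaccinated = observed_attack_rate(rate_vaccinated, sensitivity, specificity)
    obs_unvaccinated = observed_attack_rate(rate_unvaccinated, sensitivity, specificity)
    return 1 - obs_vaccinated / obs_unvaccinated


if __name__ == "__main__":
    # Hypothetical inputs: 2% attack rate in controls, true efficacy 50%.
    for spec in (1.0, 0.99, 0.95):
        ve = observed_vaccine_efficacy(0.02, 0.5, sensitivity=0.7, specificity=spec)
        print(f"specificity {spec:.2f}: observed efficacy {ve:.3f}")
```

In this model, imperfect sensitivity alone leaves the efficacy estimate unchanged, because it scales both arms equally, whereas false positive endpoints add the same background of non-cases to both arms and pull the estimate toward zero.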
Prevailing disease rates influence predictive values. Among infants, a TB incidence of 1.12% (0.54%-2.36%) would be considered high; however, the low PPV relative to sensitivity is attributable to the low absolute disease rate. While the NPV was high, where prevalence is not much above 1%, even a noninformative test may have an NPV close to 100%.
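The dependence of predictive values on prevalence follows directly from Bayes' theorem and can be checked with a short calculation. The sketch below uses the 1.12% incidence quoted above as the prevalence; the sensitivity and specificity values are hypothetical, and the function name is an assumption for illustration.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    prevalence = 0.0112  # incidence quoted in the text, used as prevalence

    # A noninformative test: positive at the same rate in cases and non-cases.
    ppv, npv = predictive_values(0.5, 0.5, prevalence)
    print(f"noninformative test: PPV {ppv:.4f}, NPV {npv:.4f}")

    # A hypothetical test with 75% sensitivity and 80% specificity.
    ppv, npv = predictive_values(0.75, 0.80, prevalence)
    print(f"hypothetical test:   PPV {ppv:.4f}, NPV {npv:.4f}")
```

For the noninformative test the PPV equals the prevalence and the NPV equals one minus the prevalence, so an NPV near 99% at this disease rate says little about the diagnostic value of the radiograph.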

Conclusion
Poor agreement and high variability in classifying pediatric radiographs underscore the need for caution in diagnosing TB in clinical settings where bacteriological confirmation is unavailable, as in most resource-limited settings. They further demonstrate that the addition of radiographic, nonbacteriologically confirmed endpoints will be of low benefit in decreasing sample sizes for TB vaccine trials.