Robust Medical Test Evaluation Using Flexible Bayesian Semiparametric Regression Models

The application of Bayesian methods is increasing in modern epidemiology. Although parametric Bayesian analysis has penetrated the population health sciences, flexible nonparametric Bayesian methods have received less attention. A goal in nonparametric Bayesian analysis is to estimate unknown functions (e.g., density or distribution functions) rather than scalar parameters (e.g., means or proportions). For instance, ROC curves are obtained from the distribution functions corresponding to continuous biomarker data taken from healthy and diseased populations. Standard parametric approaches to Bayesian analysis involve distributions with a small number of parameters, where the prior specification is relatively straight forward. In the nonparametric Bayesian case, the prior is placed on an infinite dimensional space of all distributions, which requires special methods. A popular approach to nonparametric Bayesian analysis that involves Polya tree prior distributions is described. We provide example code to illustrate how models that contain Polya tree priors can be fit using SAS software. The methods are used to evaluate the covariatespecific accuracy of the biomarker, soluble epidermal growth factor receptor, for discerning lung cancer cases from controls using a flexible ROC regression modeling framework. The application highlights the usefulness of flexible models over a standard parametric method for estimating ROC curves.


Introduction
Bayesian analysis is often used in the support of epidemiologic research [1][2][3][4][5][6][7].A contemporary area of research in the population health sciences involves the development and application of statistical methods for evaluating the accuracy of medical tests.With binary outcome data, statistical methods focus on estimating sensitivity and specificity, while with quantitative data, standard objects of interest are the receiver operating characteristic (ROC) curve and area under the curve (AUC).
The ROC curve can be regarded as a graphical portrayal of the degree of separation between the distributions of test outcomes for "diseased" and nondiseased populations.The formula for an ROC curve depends on Se() and Sp(), the sensitivity and specificity of the test at classification threshold .We proceed under the innocuous assumption that test Epidemiology Research International imposes the strong condition that  0 and  1 be members of the normal family (i.e., the normal-normal model), while a flexible nonparametric approach treats them as arbitrary continuous distribution functions.A flexible method is desirable because  0 and/or  1 could exhibit unanticipated right or left skewness, multimodality, or they could be symmetric but not bell-shaped.As a result, the methodology we present has the flexibility to handle data from many different settings, and notably it will be designed to include the normal-normal model as a special case.
The development of modern nonparametric and semiparametric Bayesian procedures for ROC analysis is an active area of research [8][9][10][11][12][13][14][15][16][17].The main goals of this paper are (i) to use flexible Bayesian models and ROC curves to evaluate the accuracy of a biomarker that may depend on patient characteristics and (ii) to provide a nontechnical description of Polya tree priors and show how they can be implemented in epidemiological research using standard statistical software.

Materials and Methods
Robust inference for ROC curves stems from allowing  0 and  1 to be members of broad classes of distributions.Our approach is to model them using Mixtures of Finite Polya Trees (MFPT) priors, which have been carefully discussed in the statistics literature [18,19].We avoid a full technical description because it requires considerable mathematical detail.Instead, a conceptual introduction appears in Appendix A.
Briefly, a major advantage beyond flexibility is that MFPT priors for  0 and  1 can be "centered" on separate parametric families.In part, this means that the flexible model generalizes the standard parametric model since the normalnormal model is allowed as a special case.Additionally, it means that the expected value of  ℓ under the prior will be the distribution function that corresponds to the centering family.In our illustrations, the centering families will be The MFPT priors will also have positive weight parameters,  0 and  1 , and positive integers  0 and  1 that define the (finite) length of the trees.Large values for the weights indicate higher prior confidence in the centering distributions, while values near zero allow for considerable flexibility around the centering family.To aid the selection of a weight parameter, consider that the parametric model  ℓ is the special case obtained when  ℓ → ∞.Hence, using a large weight value is similar to fitting the parametric centering model to the data (possible oversmoothing).The opposite extreme of setting  ℓ ≐ 0 produces empirical estimates (possible undersmoothing).In our experience, a weight parameter between 0.5 and 1 is often a positive compromise between these extremes.Tree lengths generally range between 4 and 10, depending on the sample size.Small values for  ℓ tend to oversmooth the data.A fully nonparametric prior is obtained as  ℓ → ∞, but large values of  ℓ can substantially increase computing time.Finally, there will be priors on ( ℓ ,  ℓ ); write them as  ℓ ( ℓ ,  ℓ ), for ℓ = 0, 1.We assume  0 and  1 are independent, with priors denoted by  ℓ ∼ MFPT ( ℓ ,  ℓ ,  ℓ ), for ℓ = 0, 1.

Models.
In the absence of covariates, consider two independent samples of continuous biomarker measurements, where  1 ,  2 , . . .,   0 constitute a random sample from an unknown distribution  0 , and  1 ,  2 , . . .,   1 are realizations from  1 .We regard the   's as biomarker outcomes from  0 nondiseased individuals and the   's as outcomes from  1 diseased individuals.Our model is represented as   |  0 indep ∼  0 with  0 ∼ MFPT( 0 ,  0 ,  0 ) and similarly for the   's.The prior  ℓ ( ℓ ,  ℓ ) in our illustrations involves a normal distribution on  ℓ and an independent uniform distribution on  ℓ .
There are many possible extensions of this two-group model depending on the complexity of the data.We describe a semiparametric regression model that can be easily adapted to handle a variety of scenarios.The model specifies separate linear regressions with arbitrary residual distributions for the data from the nondiseased and diseased populations: There are no intercepts in the regression part of the models since each residual distribution has an arbitrary unknown mean.The covariate vectors   and x can be distinct or have some overlap (e.g., both might include age) or complete overlap (  = x ).In the absence of available prior information, independent normal prior distributions with mean 0 and large variance can be used for the regression coefficients.Alternatively, when prior information is available, we recommend conditional means priors [20].Special cases include a model with no covariates for the nondiseased population (no   variables), a model with no covariates for the diseased population (no x variables) and a model without covariates (the two-group model).
Our model can be used to estimate covariate-specific ROC curves and AUC.Let  * = (, x) define two particular subpopulations of nondiseased and diseased people.Then, ROC( * ) and its corresponding AUC( * ) are measures of the tests' accuracy when applied to nondiseased people with covariate vector  and diseased people with covariate vector x (often, the researcher will set  equal to x).Approximate posterior inference for ROC curves can be made by calculating posterior summaries of (1 −  0 ( − ), 1 −  1 ( − x)) across a fine grid of values for  that spans the sample space.

Fitting MFPT Models in SAS.
The SAS 9.3 (SAS Institute, Inc., Cary, North Carolina) code that was used to fit a onesample MFPT model to the simulated data described in Appendix A is in Appendix B.

ROC Analysis of sEGFR Data.
We investigated a soluble isoform of the epidermal growth factor receptor (sEGFR) as a biomarker for lung cancer in premenopausal and postmenopausal women.sEGFR also has been associated with ovarian cancer [21,22].The data were collected from a casecontrol study conducted at the Mayo Clinic in Rochester, Minnesota between 1998 and 2003.There were 140 controls and 101 lung cancer cases, and 92 premenopausal and 149 postmenopausal women enrolled in the study.
In a preliminary analysis, we found no clear evidence of a difference in the distributions of  = √ sEGFR for premenopausal versus postmenopausal lung cancer cases (Wilcoxon  = 0.15).However, there was evidence for a statistical difference in the distribution of  for controls based on menopausal status (Wilcoxon  < 0.01).We thus included menopausal status as a covariate in models for the control data.A flexible approach is supported over the standard parametric approach because normal quantile plots (not shown) indicated a right skew in the distributions of  for cases and postmenopausal controls.
We compared two analyses.The first analysis had menopausal status as a covariate for the control group in an ROC regression model.The second analysis modeled outcomes among pre-and postmenopausal controls with completely separate (flexible) distributions, without regression structure.We compared these models using the log pseudomarginal likelihood (LPML) [23], [24,Section 4.9].
For all models that used Polya trees, we set weight parameters (the  ℓ 's) equal to 0.5 and tree lengths (the  ℓ 's) equal to 5. In this setting, values of  tended to be higher among controls instead of cases, so we reversed the roles for cases and controls.Let x = 1 for premenopausal women and x = 0 for postmenopausal women.For both models under consideration, the data from cases were modeled as independently distributed according to an unknown distribution  0 .The prior was  0 ∼ MFPT(( 0 ,  2 0 ),  0 , ), with  0 ( 0 ,  0 ) selected to be  0 ∼ (50, 400) independent of  0 ∼ Uniform(0, 100).The prior mean of 50 for  0 was chosen because that was approximately the mean of  for male lung cancer cases in a similar study, but we allowed for a high degree of uncertainty about the value of  0 through a large prior variance.
Estimated ROC curves were obtained from model 2, together with estimates from the parametric normal model 3 and from nonparametric empirical distribution functions.The estimated ROC curve from the parametric analysis differs drastically from estimates obtained from the MFPT and empirical distribution functions for postmenopausal women (Figure 1).Not surprisingly, the parametric model fails to track either of the data-driven estimates.Note that although the parametric model is its prior expectation, the MFPT model has the flexibility to allow the data to reallocate probability mass away from a bell-curve to produce an appropriately smooth and flexible estimate of the ROC curve.For premenopausal women, the parametric and Polya tree models also give different inferences, including at clinically important cutoffs corresponding to false positive probabilities up to 10% (Figure 2).The empirical partial AUC over that region was 0.026, while the estimate from the parametric analysis was increased by 37% to 0.041.A 90% posterior interval for the partial AUC from the parametric model was (0.027, 0.055), which does not contain the empirical estimate.The Polya tree analysis is a middle ground between these extremes, with an estimated ROC curve that is essentially a smoothed version of the empirical curve and an estimated partial AUC of 0.019 (0.009, 0.036).Similar results were obtained from a sensitivity analysis that placed diffuse Gamma priors with mean 1 and variance 10000 on the  ℓ 's and  ℓ 's.

Conclusions
We have described the use of MFPT priors in Bayesian analysis.Even when a parametric model is thought to hold, a Polya tree analysis offers a way to assess parametric assumptions and to perform a sensitivity analysis to address deviations from them.For example, a simple approach to sensitivity analysis involves comparing estimated ROC curves and AUC from the normal-normal and semiparametric models.
A finite Polya tree prior is not technically nonparametric.Bayesian nonparametric models are characterized by having an infinite number of parameters.Such models often use finite approximations, as in our case.It is thus more accurate to view models that contain finite Polya tree priors as parametric, with a large number of parameters.In fact, Bayesian statistical analysis with finite Polya tree priors has been called "parametric nonparametric statistics, " because it uses parametric models (with many parameters) that maintain flexibility [25].In our experience, finite approximations of Polya trees are sufficiently flexible and scalable to perform well for a wide variety of complex settings.

A. Polya Tree Priors
A.1.Finite Polya Tree Priors.A random sample  1 , . . .,   is obtained from an unknown distribution .We describe informally what it means to place a finite Polya tree prior on .Polya tree priors place a distribution on a collection of cumulative distribution functions.The first step in constructing a finite Polya tree prior is to create nested partitions of the sample space.The first partition splits the sample space into two nonoverlapping intervals.In the second partition, those  two intervals are each split, yielding a finer partition that contains four intervals.Then, these four intervals are each split to give an 8 interval third-level partition of the sample space.This sequential process of splitting the sample space into finer and finer partitions continues up to a certain (finite) level .
An analogy is to a convention center with  floors.The rooftop encompasses the entire sample space.Down one level is a floor with two large meeting rooms.Two levels down is a floor with four meeting rooms, then a floor with 8 rooms, and so on until we get to the ground floor  levels down, which contains 2  meeting rooms (Figure 3 contains an example with  = 3).Suppose the unknown distribution  assigns multiple people (the   's in this analogy) to rooms on the ground floor, and our task is to use the observed distribution of people to rooms to estimate .Although all levels (floors) of the tree (convention center) are important for the purpose of estimating , of primary importance is level  (the ground floor).
The goal is to produce a data-driven estimate of  that assigns high probability to intervals (rooms) that contain lots of data (people), assigns low probability to empty intervals, and assigns midrange probability to intervals that contain some (but not a lot) of the data.Consider first the simplest case where the convention center has only one floor ( = 1).Then, people are assigned to one of two meeting rooms on floor 1.Call them rooms  1,1 and  1,2 .The probability of being assigned to the first room on floor 1 is ( 1,1 ) = Pr(  ∈  1,1 ), which, since  is a cumulative distribution function and  1,1 is an interval of the form (, ), is to be interpreted as ( 1,1 ) = () − ().Denote this unknown probability by the parameter  1,1 .Then, by the complement rule,  1,2 = 1 −  1,1 is the probability of being assigned to room 2 on floor 1.In notation, Pr(  ∈  1,1 ) = ( 1,1 ) =  1,1 and Pr(  ∈  1,2 ) = ( 1,2 ) =  1,2 .Since  is unknown,  1,1 and  1,2 are unknown.To help interpret these parameters, suppose the data arise from a right-skewed distribution (lots more people in room 1 than room 2), then  1,1 would be large and hence  1,2 would be small, and vice-versa for left-skewed data.In practice,  = 1 is never used because it would lead to a crude estimate of the density function , much like estimating a density using a relative frequency histogram that contains only two bins.In our experience, setting  equal to 4, 5, or 6 often works well in practice.Another option is to select  so that roughly 2  ≐  [18].Now suppose we return a year later to find that the convention center has expanded and contains a second floor ( = 2).Suppose the new ground floor has four rooms,  2,1 ,  2,2 ,  2,3 , and  2,4 (third row of Figure 3).Moreover, suppose that room assignments are made based on whether the individual was in room  1,1 or  1,2 during the previous visit.If the previous assignment was to room  1,1 , then the individual is assigned again to the left side of the convention center, namely, to either room  2,1 with unknown probability  2,1 or to room  2,2 with probability  2,2 = 1 −  2,1 .Similarly, define  2,3 and  2,4 (= 1 −  2,3 ) to be the probability of being assigned to room  2,3 or  2,4 , respectively, given that the previous assignment was to room  1,2 .
Notice that level 1 has only one unique parameter ( 1,1 ) associated with it because  1,2 is completely determined by  1,1 ; similarly, level 2 has two unique parameters ( 2,1 and  2,3 ) associated with it.We can continue the convention center storyline to any level .For  = 3 (Figure 3), we add conditional probability parameters  3,1 ,  3,2 ,  3,3 ,  3,4 ,  3,5 ,  3,6 ,  3,7 , and  3,8 , but only 4 of these are unique.For instance,  3,1 is the probability that   is in interval  3,1 given that it is in interval  2,1 , and ( 3,1 ) = Pr(  ∈  3,1 ) = Pr( The key point is that if we can estimate all of the  , 's, then we can estimate the probability that is allocated by  to each interval at level .Estimating continuous  requires continuing ad infinitum as  grows.However, in practice we truncate to a fixed , hence the term, finite Polya tree.But this is incomplete for the purpose of estimating a continuous , because we have not modeled how probability mass is distributed within each interval at level .In terms of the convention center analogy, we have modeled the probability of assignment to each of the ground-floor rooms, but we have not modeled how people are distributed within those rooms.For instance, they could all be clustered in the middle of the room (all the   's clumped together in the center of the interval).Alternatively, the data could be uniformly distributed, clumped to the right or left side of the interval, or have any other dispersion pattern within each interval.To address this issue, we model the data according to how a userspecified distribution  * allocates probability mass within the intervals at level .
The distribution  * is important in two other ways.First, it is used to determine the lower and upper endpoints of all intervals in the tree.The median of  * is used to split the sample space into two intervals at level 1 of the tree.The quartiles of  * define cutpoints for intervals at level 2. Writing the 25th percentile as  * −1 (1/4), the median as  * −1 (2/4), and the 75th percentile as  * −1 (3/4), we have for a sample space that covers the real line )) , ) , +∞) .
Second,  * is the prior expectation of the unknown distribution function .This provides guidance for selecting  * ; we select it based on our best prior assessment of the datagenerating distribution .If our prior assessment is that the data will be governed by a certain normal distribution, we use that normal for  * .If our prior assessment is accurate, the posterior estimate of  from the finite Polya tree model will take the shape of that normal distribution.But the real advantage of Polya tree priors is their flexibility to allow for data-driven prior-to-posterior reallocation of probability mass away from the shape of  * to any shape supported by the data.This happens by estimating the  , 's that were equated to .
The collection { , :  = 1, . . ., ;  = 1, . . ., 2  } constitutes the unknown parameters corresponding to .We need to specify a prior distribution over this collection.Recall that when  is an even number between 2 and 2  ,  , = 1 −  ,−1 .Therefore, priors are needed only on  , when  is odd.It is standard to use independent beta priors, specifically  , ∼ Beta( 2 ,  2 ) for  odd and all .The prior mean of 0.5 gives equal weight to the left and right intervals formed by splitting a parent interval.The positive constant  is often set equal to 0.5 or 1.The value of  reflects our confidence in the prior estimate  * for the unknown .Large values of  (e.g.,  ≥ 10) translate into higher prior confidence in the particular  * , so a relatively larger amount of data are needed in order for the posterior estimate to stray from  * if the data conflict with it.A low value (e.g.,  = 0.1) will often lead to an estimate of  that is similar to the empirical distribution function.We have found that  = 0.5 or 1 performs well in practice because it often is a middle ground between a purely data-based and model-based estimate.

Figure 1 :
Figure 1: Empirical ROC curve for postmenopausal women and estimates of the ROC curve from a parametric normal model (dashed line) and a mixture of finite Polya trees analysis (smooth solid line).

Figure 2 :
Figure 2: Empirical ROC curve for premenopausal women and estimates of the ROC curve from a parametric normal model (dashed line) and a mixture of finite Polya trees analysis (smooth solid line).

Figure 3 :
Figure 3: Partition structure of a Polya tree with three levels.