The application of Bayesian methods is increasing in modern epidemiology. Although parametric Bayesian analysis has penetrated the population health sciences, flexible nonparametric Bayesian methods have received less attention. A goal in nonparametric Bayesian analysis is to estimate unknown functions (e.g., density or distribution functions) rather than scalar parameters (e.g., means or proportions). For instance, ROC curves are obtained from the distribution functions corresponding to continuous biomarker data taken from healthy and diseased populations. Standard parametric approaches to Bayesian analysis involve distributions with a small number of parameters, where the prior specification is relatively straightforward. In the nonparametric Bayesian case, the prior is placed on an infinite-dimensional space of all distributions, which requires special methods. A popular approach to nonparametric Bayesian analysis that involves Polya tree prior distributions is described. We provide example code to illustrate how models that contain Polya tree priors can be fit using SAS software. The methods are used to evaluate the covariate-specific accuracy of the biomarker, soluble epidermal growth factor receptor, for discerning lung cancer cases from controls using a flexible ROC regression modeling framework. The application highlights the usefulness of flexible models over a standard parametric method for estimating ROC curves.
Bayesian analysis is often used in support of epidemiologic research [
The ROC curve can be regarded as a graphical portrayal of the degree of separation between the distributions of test outcomes for “diseased” and nondiseased populations. The formula for an ROC curve depends on
This motivates the use of flexible data-driven methods for estimating functions, which is a central goal of nonparametric Bayesian analysis. The standard parametric approach imposes the strong condition that
The development of modern nonparametric and semiparametric Bayesian procedures for ROC analysis is an active area of research [
Robust inference for ROC curves stems from allowing
Briefly, a major advantage beyond flexibility is that MFPT priors for
In the absence of covariates, consider two independent samples of continuous biomarker measurements, where
There are many possible extensions of this two-group model depending on the complexity of the data. We describe a semiparametric regression model that can be easily adapted to handle a variety of scenarios. The model specifies separate linear regressions with arbitrary residual distributions for the data from the nondiseased and diseased populations:
There are no intercepts in the regression part of the models since each residual distribution has an arbitrary unknown mean. The covariate vectors
Our model can be used to estimate covariate-specific ROC curves and AUC. Let
The SAS 9.3 (SAS Institute, Inc., Cary, North Carolina) code that was used to fit a one-sample MFPT model to the simulated data described in Appendix
We investigated a soluble isoform of the epidermal growth factor receptor (sEGFR) as a biomarker for lung cancer in premenopausal and postmenopausal women. sEGFR also has been associated with ovarian cancer [
In a preliminary analysis, we found no clear evidence of a difference in the distributions of
We compared two analyses. The first analysis had menopausal status as a covariate for the control group in an ROC regression model. The second analysis modeled outcomes among pre- and postmenopausal controls with completely separate (flexible) distributions, without regression structure. We compared these models using the log pseudomarginal likelihood (LPML) [
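As a rough illustration of how an LPML statistic can be computed from posterior draws, the following Python sketch uses the common harmonic-mean estimate of each conditional predictive ordinate (CPO) and sums the log CPOs. It is not the SAS code used for the analyses reported here. The function names `log_sum_exp` and `lpml` and the input layout (`loglik[s][i]` holding the log-likelihood of observation `i` at posterior draw `s`) are assumptions made for illustration only.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def lpml(loglik):
    """Estimate LPML from a draws-by-observations table of log-likelihoods.

    loglik[s][i] is log f(y_i | theta_s) for posterior draw s (assumed layout).
    CPO_i is estimated by the harmonic mean of f(y_i | theta_s) over draws,
    and LPML is the sum of log(CPO_i) over observations.
    """
    n_draws = len(loglik)
    n_obs = len(loglik[0])
    total = 0.0
    for i in range(n_obs):
        neg = [-loglik[s][i] for s in range(n_draws)]
        # log CPO_i = log(S) - log(sum_s 1 / f(y_i | theta_s))
        total += math.log(n_draws) - log_sum_exp(neg)
    return total
```

Larger LPML values indicate better predictive performance, so two candidate models can be compared by evaluating `lpml` on the log-likelihood table saved from each fit.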
For all models that used Polya trees, we set weight parameters (the
For controls, model 1 for the data was the regression
Model 2 specified separate distributions for data from premenopausal and postmenopausal controls. Denote the distributions by
The LPML statistics were similar for models 1 and 2 (
Estimated ROC curves were obtained from model 2, together with estimates from the parametric normal model 3 and from nonparametric empirical distribution functions. The estimated ROC curve from the parametric analysis differs drastically from estimates obtained from the MFPT and empirical distribution functions for postmenopausal women (Figure
Empirical ROC curve for postmenopausal women and estimates of the ROC curve from a parametric normal model (dashed line) and a mixture of finite Polya trees analysis (smooth solid line).
Empirical ROC curve for premenopausal women and estimates of the ROC curve from a parametric normal model (dashed line) and a mixture of finite Polya trees analysis (smooth solid line).
We have described the use of MFPT priors in Bayesian analysis. Even when a parametric model is thought to hold, a Polya tree analysis offers a way to assess parametric assumptions and to perform a sensitivity analysis to address deviations from them. For example, a simple approach to sensitivity analysis involves comparing estimated ROC curves and AUC from the normal-normal and semiparametric models.
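The spirit of such a sensitivity check can be sketched in a few lines of Python. The sketch below compares a plug-in AUC under the normal-normal (binormal) model with the empirical (Mann-Whitney) AUC. The empirical AUC is used only as a simple distribution-free stand-in; it is not the MFPT estimate, and neither quantity here is a posterior summary. The function names are hypothetical, and the code assumes that larger marker values indicate disease (negate the data if the opposite holds).

```python
import math
from statistics import NormalDist, mean, stdev


def binormal_auc(controls, cases):
    """Plug-in AUC when both groups are assumed to be normally distributed."""
    mu0, mu1 = mean(controls), mean(cases)
    s0, s1 = stdev(controls), stdev(cases)
    # AUC = Phi((mu1 - mu0) / sqrt(s0^2 + s1^2)) under the binormal model
    return NormalDist().cdf((mu1 - mu0) / math.sqrt(s0 ** 2 + s1 ** 2))


def empirical_auc(controls, cases):
    """Proportion of (case, control) pairs with case > control; ties count 1/2."""
    wins = 0.0
    for y1 in cases:
        for y0 in controls:
            if y1 > y0:
                wins += 1.0
            elif y1 == y0:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

A large gap between the two values suggests that the normality assumption is driving the parametric estimate, which is the kind of discrepancy a flexible model is meant to expose.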
A finite Polya tree prior is not technically nonparametric. Bayesian nonparametric models are characterized by having an infinite number of parameters. Such models often use finite approximations, as in our case. It is thus more accurate to view models that contain finite Polya tree priors as parametric, with a large number of parameters. In fact, Bayesian statistical analysis with finite Polya tree priors has been called “parametric nonparametric statistics,” because it uses parametric models (with many parameters) that maintain flexibility [
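To make the "many parameters" view concrete, the Python sketch below draws one random distribution from a finite Polya tree and counts its free branch probabilities. Several choices are assumptions for illustration and are not taken from the analyses in this paper: the tree is centered on a standard normal distribution, the partition sets are defined by quantiles of that centering distribution, and each left-branch probability at level j is drawn from a Beta(c j^2, c j^2) distribution, a common choice in the Polya tree literature. The function names are hypothetical.

```python
import random
from statistics import NormalDist


def draw_finite_polya_tree(levels, c, rng):
    """Draw the probabilities of the 2**levels finest-level partition sets.

    Returns (probs, n_free), where n_free counts the free branch
    probabilities (one per split), which totals 2**levels - 1.
    """
    probs = [1.0]
    n_free = 0
    for j in range(1, levels + 1):
        a = c * j ** 2  # assumed Beta(c*j^2, c*j^2) weights at level j
        next_probs = []
        for p in probs:
            left = rng.betavariate(a, a)  # probability of the left child
            n_free += 1
            next_probs.extend([p * left, p * (1.0 - left)])
        probs = next_probs
    return probs, n_free


def density(x, probs, base=NormalDist()):
    """Density at x of the drawn distribution, centered on `base`."""
    n_sets = len(probs)
    # Finest-level set containing x, located through the centering CDF
    k = min(int(base.cdf(x) * n_sets), n_sets - 1)
    # Each set has base probability 1/n_sets, so rescale the base density
    return n_sets * probs[k] * base.pdf(x)
```

For example, `draw_finite_polya_tree(4, 1.0, random.Random(1))` returns 16 set probabilities governed by 15 free parameters, which illustrates why a finite Polya tree model is parametric yet flexible.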
A random sample
An analogy is to a convention center with
Partition structure of a Polya tree with three levels.
The goal is to produce a data-driven estimate of
Now suppose we return a year later to find that the convention center has expanded and contains a second floor (
The
Similar steps lead to
Notice that level 1 has only one unique parameter (
The key point is that if we can estimate all of the
The distribution
In general, the
Second,
The collection
Once we have selected
To motivate an extension to
Posterior estimates of
SAS 9.3 code to fit the one-sample MFPT model to simulated data from the mixture of two normal distributions described in Appendix
The model is fit using Gibbs sampling with block updating occurring for the groups of parameters that are defined in the
Andre Baron is a coinventor of patents related to sEGFR and cofounder of Tumor Biology Investment Group, Inc., a biotechnology company that holds the rights to several sEGFR patents. Of note, Dr. Baron was not involved in the statistical analyses or interpretation of the statistical results of these analyses.
This work was supported by the National Institutes of Health (grants K07 CA76170, R21 CA82520, and R03 CA82091 to A.T.B.). The authors thank both referees for their helpful and encouraging comments.