Uncertainty Modeling in Dose Response: Bench Testing Environmental Toxicity 0470447508, 9780470447505

A valuable guide to understanding the problem of quantifying uncertainty in dose response relations for toxic substances

372 67 4MB

English Pages 230 [237] Year 2009

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Uncertainty Modeling in Dose Response: Bench Testing Environmental Toxicity 
 0470447508, 9780470447505

Citation preview

UNCERTAINTY MODELING IN DOSE RESPONSE

UNCERTAINTY MODELING IN DOSE RESPONSE Bench Testing Environmental Toxicity Edited by

ROGER M. COOKE

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2009 by John Wiley & Sons, Inc. All rights reserved Published by John Wiley & Sons, Inc., Hoboken, New Jersey Published simultaneously in Canada No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com. Library of Congress Cataloging-in-Publication Data Cooke, Roger M., 1946– Uncertainty modeling in dose response : bench testing environmental toxicity / Roger M. Cooke. p. cm. Includes bibliographical references and index. ISBN 978-0-470-44750-5 (cloth) 1. Environmental toxicology–Mathematical models. 2. Toxicology–Dose-response relationship–Mathematical models. I. Title. RA1226.C66 2009 615.9′02011–dc22 2008052026 Printed in the United States of America 10

9

8

7

6

5

4

3

2

1

CONTENTS

Acknowledgments

ix

Contributors

xi

Introduction

1

Roger M. Cooke and Margaret MacDonell

1 Analysis of Dose–Response Uncertainty Using Benchmark Dose Modeling

17

Jeff Swartout

Comment: The Math/Stats Perspective on Chapter 1: Hard Problems Remain

34

Allan H. Marcus

Comment: EPI/TOX Perspective on Chapter 1: Re-formulating the Issues

37

Jouni T. Tuomisto

Comment: Regulatory/Risk Perspective on Chapter 1: A Good Baseline

42

Weihsueh Chiu

Comment: A Question Dangles

44

David Bussard

Comment: Statistical Test for Statistics-as-Usual Confidence Bands

45

Roger M. Cooke

Response to Comments

47

Jeff Swartout v

vi

2

CONTENTS

Uncertainty Quantification for Dose–Response Models Using Probabilistic Inversion with Isotonic Regression: Bench Test Results

51

Roger M. Cooke

Comment: Math/Stats Perspective on Chapter 2: Agreement and Disagreement

82

Thomas A. Louis

Comment: EPI/TOX Perspective on Chapter 2: What Data Sets Per se Say

87

Lorenz Rhomberg

Comment: Regulatory/Risk Perspective on Chapter 2: Substantial Advances Nourish Hope for Clarity?

97

Rob Goble

Comment: A Weakness in the Approach?

105

Jouni T. Tuomisto

Response to Comments

107

Roger Cooke

3

Uncertainty Modeling in Dose Response Using Nonparametric Bayes: Bench Test Results

111

Lidia Burzala and Thomas A. Mazzuchi

Comment: Math/Stats Perspective on Chapter 3: Nonparametric Bayes

147

Roger M. Cooke

Comment: EPI/TOX View on Nonparametric Bayes: Dosing Precision

150

Chao W. Chen

Comment: Regulator/Risk Perspective on Chapter 3: Failure to Communicate

153

Dale Hattis

Response to Comments

160

Lidia Burzala

4

Quantifying Dose–Response Uncertainty Using Bayesian Model Averaging

165

Melissa Whitney and Louise Ryan

Comment: Math/Stats Perspective on Chapter 4: Bayesian Model Averaging Michael Messner

180

CONTENTS

Comment: EPI/TOX Perspective on Chapter 4: Use of Bayesian Model Averaging for Addressing Uncertainties in Cancer Dose– Response Modeling

vii

183

Margaret Chu

Comment: Regulatorary/Risk Perspective on Chapter 4: Model Averages, Model Amalgams, and Model Choice

185

Adam M. Finkel

Response to Comments

194

Melissa Whitney and Louise Ryan

5

Combining Risks from Several Tumors Using Markov Chain Monte Carlo

197

Leonid Kopylev, John Fox, and Chao Chen

6

Uncertainty in Dose Response from the Perspective of Microbial Risk

207

P. F. M. Teunis

7

Conclusions

217

David Bussard, Peter Preuss, and Paul White

Author Index

225

Subject Index

229

ACKNOWLEDGMENTS

This book would not have been possible without generous support from the Richard Lounsbery Foundation and from Resources for the Future. Further support from the Life Science Research Organization is gratefully acknowledged.

ix

CONTRIBUTORS

Lidia Burzala, Delft University of Technology, The Netherlands, Dept. Mathematics, TU Delft Mekelweg 4, Delft, The Netherlands; L.M.Burzala@ tudelft.nl David Bussard, US Environmental Protection Agency, Mail Code: 8623P, 1200 Pennsylvania Avenue, N.W., Washington, DC 20460; Bussard.David@ epamail.epa.gov Chao W Chen, US EPA, National Center of Environmental Assessment (NCEA), Mail Code: 8623P, 1200 Pennsylvania Avenue, N.W., Washington, DC 20460; [email protected] Weihsueh Chiu, US Environmental Protection Agency, USEPA Headquarters, Ariel Rios Building, 1200 Pennsylvania Avenue, N.W., Mail Code: 8623P, Washington, DC 20460. Margaret Chu, California Air Resources Board, California Environmental Protection Agency, 1001 I Street, P.O. Box 2815, Sacramento, CA 958122815; [email protected] Roger Cooke, Resources for the Future, and Dept. Mathematics, Delft University of Technology, 1616 P. Street, NW, Washington, D.C. 20036; [email protected] Adam M. Finkel, Professor of Environmental and Occupational Health, University of Medicine and Dentistry of New Jersey, School of Public Health, Piscataway, N.J.; [email protected] xi

xii

CONTRIBUTORS

John Fox, US EPA, National Center of Environmental Assessment (NCEA), Mail Code: 8623P, 1200 Pennsylvania Avenue, N.W., Washington, DC 20460; [email protected] Rob Goble, George Perkins Marsh Institute, Clark University, 950 Main Street, Worcester, MA 02476; [email protected] Dale Hattis, George Perkins Marsh Institute, Clark University, 950 Main Street, Worcester, MA 01610 Leonid Kopylev, US EPA, National Center of Environmental Assessment (NCEA), Mail Code: 8623P, 1200 Pennsylvania Avenue, N.W., Washington, DC 20460; [email protected] Thomas A. Louis, Department of Biostatistics, Johns Hopkins, Bloomberg School of Public Health, 615 North Wolfe St., Baltimore, MD 21205; tlouis@ jhsph.edu Margaret MacDonell, Environmental Science Division, S. Cass Avenue 9700, 60439 Argonne Ill. Allan H. Marcus, Office of Research and Development Integrated Risk Information System, National Center for Environmental Assessment, U.S. Environmental Protection Agency, 109 TW Alexander Drive, Durham, NC 27709; [email protected] Thomas A. Mazzuchi, Department of Engineering Management, School of Engineering and Applied Science, The George Washington University, Washington, DC 20052; [email protected] Michael Messner, US EPA, Office of Ground Water and Drinking Water, Ariel Rios Building, 1200 Pennsylvania Avenue, N.W., Mail Code: 4607M, Washington, DC 20460; [email protected] Peter Preuss, US Environmental Protection Agency, 1200 Pennsylvania Avenue, N.W., Mail Code: 8623P, Washington, DC 20460; Preuss.Peter@ epamail.epa.gov Louise Ryan, Department of Biostatistics, Harvard School of Public Health, 655 Huntington Ave., Boston, MA 02115; [email protected] Lorenz Rhomberg, Gradient Corporation, 20 University Road, Cambridge, MA 02138; [email protected] Jeff Swartout, U.S. Environmental Protection Agency, Office of Research and Development, National Center for Environmental Assessment, 26 West Martin Luther King Drive, Cincinnati, OH 45268; swartout.jeff@epamail. epa.gov

CONTRIBUTORS

xiii

PFM Teunis, Department EMI, RIVM-National Institute of Public Health and the Environment, Anthonie van Leeuwenhoeklaan 9, 3721MA Bilthoven, Netherlands; Hubert Department of Global Health, Rollins School of Public Health, Emory University, 1518 Clifton Road, Atlanta, Georgia 30322, USA.; [email protected] Jouni T. Tuomisto, Department of Environmental Health, Kansanterveyslaitos (KTL)—National Public Health Institute, P.O. Box 95, FI-70701 Kuopio, Finland; [email protected] Paul White, US Environmental Protection Agency, 1200 Pennsylvania Avenue, N.W., Mail Code: 8623P, Washington, DC 20460; White.Paul@ epamail.epa.gov Melissa Whitney, Department of Biostatistics, Harvard School of Public Health, 655 Huntington Ave., Boston, MA 02115; [email protected]. edu

INTRODUCTION Roger M. Cooke Resources for the Future, Department of Mathematics, Delft University of Technology

Margaret MacDonell Argonne National Laboratory

Environmental policies and programs are guided by health effect estimates from modeling dose–response data, typically based on animal bioassays. The sources of uncertainty in these predictions are many, ranging from inherent biological processes to the specific calculation approaches and models applied. To address the latter, Resources for the Future organized a workshop on “Uncertainty Modeling in Dose Response: Dealing with Simple Bioassay Data and Where Do We Go from Here?” The workshop was held 22–23 October 2007 in Washington, DC. Notice that the title is not simply “Dose–Response Modeling.” The workshop’s objective was rather to emphasize the central role of uncertainty and thus to reflect a shift in focus that is taking place across a wide spectrum of national and international programs. In the context of pervasive uncertainty, getting the uncertainty right is as important as finding a best estimate. The US Environmental Protection Agency (EPA), Office of Research and Development (ORD), National Center of Environmental Assessment (NCEA), has made improved uncertainty characterization one of the priorities of its integrated risk assessment program. Traditionally EPA programs have made more use of reasonably protective values (deemed unlikely to underpredict risk) than other estimates such as those derived from maximum likelihood estimates. Some EPA statutes call for protection with a margin of Uncertainty Modeling in Dose Response: Bench Testing Environmental Toxicity Edited by Roger M. Cooke Copyright © 2009 by John Wiley & Sons, Inc.

1

2

INTRODUCTION

safety, and a few statutes require explicit balancing of estimated magnitude of risk against estimated cost of reducing that risk. Many regulatory authorities, including EPA, have a tradition of seeking to protect the public through procedures that are grounded in data and reflect some of the uncertainties. The EPA is sometimes asked to capture and quantify more types of uncertainty in estimates of dose response. The US Congress, in the 1994 amendments to the Safe Drinking Water Act, has also asked EPA to answer different questions and to produce “best estimates,” with text suggesting that, if possible, these should be “expected value” estimates, of risk along with lower confidence interval estimates of risk (in addition to upper confidence interval estimates). Getting the uncertainty right means quantifying the uncertainty and subjecting the quantification method to scientific review. Why is this important? Several reports of the National Academy of Science expand on this (NRC, 1983, 1989, 1994, 1996), but a shortlist of reasons might be as follows: •









Scientific integrity demands that we not claim to know more than we know. The natural language is a poor medium for communicating uncertainty; a 1 in 100 chance of winning an office lottery might be called “very slight,” but a 1 in 100 chance of serious birth defects for an approved diet supplement would be considered unacceptably large. Uncertainty quantification enables tracking the accumulation of conservative assumptions and prioritizing uncertainty reduction measures. Optimal decision making under uncertainty requires uncertainty quantification. Without quantification, stakeholders may be left with very different perceptions of the uncertainties, leading to different perceptions of whose interests are being hedged by rule making under uncertainty.

An EPA review of 2002 underscored the importance of uncertainty quantification in dose response: “When the available data are sufficient to meaningfully characterize the distributions of interest, a probabilistic approach would provide results as a distribution rather than as a single measure for the dose– concentration-response” (EPA, 2002, pp. 4–48). The workshop was developed to address this critical need to understand dose–response uncertainty for practical risk applications. The objective was to demonstrate some new methods for quantifying uncertainty in the dose– response relationship via a systematic process, with an emphasis on environmental relevance. The workshop focused on a single part of the larger dose–response uncertainty problem: fitting modeling curves to animal data and using them to make judgments in the range of the observable responses (i.e., not extrapolating far below the observable data). It is important to note that questions such as extrapolating to doses outside the range of the tests, to

WORKSHOP FORMAT

3

humans from animals, and to chronic from shorter duration doses were explicitly excluded from the workshop. Those more complicated extrapolations can be better addressed after the first step—uncertainty in fitting dose-response curves for simple bioassay experiments—is more fully understood.

WORKSHOP FORMAT The workshop format was designed to engage experts in interdisciplinary discussions of approaches for quantifying uncertainty in fitting models to animal toxicity data within the observed range. The aim was to review the state of the art for this first step and promote better quantification for use in integrated uncertainty analysis. Toward this end, a number of experts were invited to ply their craft by testing an uncertainty characterization method of choice with example dose– response data. Results of each modeling bench test were then documented in a technical paper and presented at the workshop, with highlights of the strengths and limitations offered for dose–response and risk assessments. Each set of results were then reviewed by additional experts who were asked to represent three different perspectives: • • •

Epidemiologist/toxicologist Statistician/mathematician Risk analyst/regulator

Comments were invited from other workshop participants, and the modelers were then invited to respond to the critiques. Subjecting the bench test results to critical review not only helped strengthen the explanation of the models, it also provided excellent practical context regarding their uses and limitations. Having different modelers focus on the same data sets led to lively and sharply focused discussions, quite different from abstract deliberations of pros and cons of various methods. The interactive format helped address two common complaints of dose–response modelers: that statisticians do not understand what their models don’t do well, and that they are not sufficiently mindful of biological problems inherent in data analyses. The contrast presented in the approaches of statisticians versus those who model dose-response data for risk applications served to heighten joint awareness of the scientific issues. The format also illustrated that advances in dose–response modeling may come from statisticians who have taken part in interdisciplinary exchanges and are cognizant of the importance of physiological parameters underlying dose–response relationships. It is hoped that the methods applied and attendant discussions will prompt other modelers to ply their wares on the bench test cases, as well as to their own data sets being evaluated for ongoing projects. By summarizing the presenta-

4

INTRODUCTION

tions, critiques, and other discussions within this book, the authors aim to provide a technical resource with the goal of advancing the state of the art for dose–response uncertainty quantification, leading to better-informed decision making.

MODELS TESTED Four modelers rose to the challenge of quantifying uncertainty in the bench test exercises, with each applying a specific method to the problems described in the next section. •







Jeff Swartout (EPA ORD) conducted analyses with the EPA benchmark dose (BMD) approach using the Benchmark Dose Software (BMDS) and using a bootstrapping approach that has fairly similar goals but a different technical approach. Roger Cooke (RFF/TU Delft) deployed a method of probabilistic inversion and isotonic regression (PI-IR), based on methods from the joint US Nuclear Regulatory Commission and European Union uncertainty analyses of accident consequence models for nuclear power plants. Thomas Mazzuchi (George Washington University) and Lidia Burzala (TU Delft) used nonparametric Bayesian (NPB) methods from biostatistics and accelerated life testing. Louise Ryan and Melissa Whitney (Harvard University) used Bayesian model averaging (BMA).

The “storylines” for each approach are summarized below. Details are provided in the subsequent chapters of this book.

Benchmark Dose (BMD) Use the Akaike information criterion (AIC) to choose a “best-fitting” model (after verifying that the best fit is indeed adequate). The software then uses likelihood profile methods to calculate confidence intervals, particularly to derive a confidence limit around the dose that yields a target, typically low, response rate. Similar to the likelihood profile method, the bootstrapping approach also assumes that the best-fit model, with maximum likelihood estimates of its parameters, is true. It then simulates repetitions of the bioassay experiments by sampling from independent binomial distributions for responses at each dose, based on the maximum likelihood parameter estimates of the best model. On each repetition, the bootstrapping approach re-estimates the parameters, thus obtaining a distribution over model para-

MODELS TESTED

5

meter values.1 By sampling this parameter distribution, obtain distribution on predicted probabilities at measured and unmeasured doses. Probabilistic Inversion with Isotonic Regression (PI-IR) Assume that repeated experiments give results from independent binomials, with probabilities estimated directly from the data (not from a model). Derive a distribution of estimated response probabilities at each dose by applying the constraint that probabilities are nondecreasing in dose (isotonic regression). Find a model with the distribution over its parameters that recovers the isotonicized distributions of response probabilities (probabilistic inversion). By sampling this parameter distribution, obtain a distribution on predicted probabilities at measured and unmeasured doses. Nonparametric Bayes (NPB) Choose a prior mean response (potency) curve and precision parameters. These define a “Dirichlet process prior,” giving a prior distribution over all nondecreasing functions of probability as function of dose. Update this prior distribution with the observed data at measured doses. The posterior distribution at measured and unmeasured doses is gotten by Gibbs sampling. Bayesian Model Averaging (BMA) As applied here, this is the hybrid BMA: Choose an initial set of models and estimate the parameters of each model with maximum likelihood. Conditional on each model, estimate the variance in parameter estimates with classical methods. Determine a probability weight for each model using the Bayes information criterion (BIC). Use these weights to average model results. These four models and results of the test applications are described in Chapters 2 through 4. Each chapter’s discussion is supplemented by comments from the critiquing experts and responses from the testing experts that broaden its perspective. Two additional contributions were solicited for the workshop to characterize uncertainty from other dose–response applications and help illuminate issues being evaluated for the chemical test cases. First, Peter Teunis (RIVM and Emory University) explored the concept of dose–response in microbial risk. He illustrates that in microbial risk, the pathogens are often highly infective, and the probability of response can be bounded by the probability of exposure. For example, if someone ingests 10 milliliters of a suspension of 1 1

In the bootstrap method, under the hypothesis that the best model is true, the mean and variance-covariance of the parameter estimates can be calculated. Since these distributions are asymptotically normal, the appropriate normal distributions may also be sampled directly. In Chapter 2, Swartout uses both methods.

6

INTRODUCTION

organism per liter, the probability of infection cannot exceed 1%. This additional constraint can facilitate dose-response modeling. However, hostdependent heterogeneity, among other issues (e.g., secondary infections), introduces new complications. Second, Leonid Kopylev, John Fox, and Chao Chen (EPA NCEA) developed the mathematics for assessing the composite risk of multiple tumors at multiple sites. They provide a method for combining risk of different cancers at different sites using the multistage model. That model is particularly compliant in this regard, and the method might be extended to combining other effects. These two additional contributions expand our awareness of dose–response modeling issues and are summarized in Chapters 5 and 6. Because they were not aimed at solving the four bench test exercises, they were not subject to the interactive review process. Thus no associated critiques are included for that chapter. The current approach to modeling uncertainty in dose response is well illustrated by the EPA BMD method (Crump, 1984; US EPA, 2002, 2004, 2008); for a review of regulatory history, see Dourson and Stara (1983) and others. This approach uses classical statistical methods of uncertainty representation to find a point of departure (POD) for determining a “reference value” that is ultimately used to guide health-based exposure limits for hazardous substances. The POD takes the uncertainty in dose–response into account; a reference value is derived from the POD by applying uncertainty factors accounting for other sources of uncertainty. These include extrapolation from animals to humans, extrapolation from subchronic to chronic doses, and extrapolation to sensitive subpopulations. We note that much effort has been put into developing and critiquing uncertainty factors. Evaluations include those by Abdel-Rahman and Kadry (1995), Baird et al. (1996), Calabrese and Baldwin (1995), Dourson and Stara (1983), Dourson et al. (1992, 1996), Evans and Baird (1998), Gaylor and Kodell (2000), Hattis et al. (1999, 2002), Kodell and Gaylor (1999), Kalberlah et al. (2003), Nessel et al. (1995), Pelekis et al. (2003), Pieters et al. (1998), Renwick and Lazarus (1998), Rhomberg and Wolff (1998), Slob and Pieters (1998), Swartout et al. (1998), and Vermeire et al. (1999). Although these extrapolations and uncertainty factor issues were not the focus of the workshop, they certainly came up in the discussions.

MODEL CRITIQUES The discussants who agreed to present critiques of the models are summarized in Table I.1. The comments offered and the responses of the testing experts are included in the individual chapters for each of the four models tested (Chapters 2 through 5).

7

TEST DATA SETS

TABLE I.1

Discussants of the four modeling approaches applied to test data

Modeling Approach

Benchmark Dose (BMD) Testing expert Jeff Swartout (EPA NCEA)

Probabilistic InversionIsotonic Regression (PI-IR) Roger Cooke (RFF/TU Delft)

Nonparametric Bayes (NPB) Tom Mazzuchi (George Washington) and Lidia Burzala (TU Delft)

Bayesian Model Averaging (BMA) Louise Ryan and Melissa Whitney (Harvard)

Critiquing expert and perspective offered Epidemiologist/ toxicologist

Jouni Tuomisto (National Public Health Institute, KTL)

Lorenz Rhomberg (Gradient)

Chao Chen (EPA NCEA)

Margaret Chu (EPA NCEA)

Statistician/ mathematician

Allan Marcus (EPA NCEA)

Tom Louis (Johns Hopkins)

Roger Cooke (RFF/TU Delft)

Michael Messner (EPA Office of Ground Water and Drinking Water)

Risk analyst/ regulator

Weisueh Chiu (EPA NCEA)

Rob Goble (Clark University)

Dale Hattis (Clark University)

Adam Finkel (University of Medicine and Dentistry of NJ)

Additional commenters

David Bussard (EPA NCEA) Roger Cooke (RFF/TU Delft)

Jouni Tuomisto (KTL)

TEST DATA SETS Four sets of bioassay data excerpted from EPA sources provided the common basis for bench testing the models. Each data set identifies the number of animals exposed and the number responding at each of several dose levels for various experiments. The reporting formats differ somewhat across the four data sets, reflective of common variability in how actual data are provided to statisticians, dose–response modelers, and other risk practitioners.

8

INTRODUCTION

TABLE I.2 Dose 0 21 60

Data set 1: BMD technical guidance example Number of Subjects

Number of Responses

50 49 45

1 15 20

The first data set was taken from BMDS materials available via the EPA website when the exercises were distributed to workshop participants in May 2007.2 Because the BMD approach represents current practice, the following discussion includes a brief illustration of this “baseline” process using these example data. The other three data sets are actual bioassay results of selected chemicals of current interest. These are important to include in the practical bench testing of models for this workshop because very often, real data are less compliant than demonstration data. Their names were fictionalized—frambozadrine, nectorine, and persimonate—to ensure that active policy concerns did not impinge on the scientific discussion. The four test data sets are described below. Data Set 1: Example Data from the BMD Technical Guidance Document The first data set involves three sets of test subjects. The example dose levels, number of subjects, and number of responses are shown in Table I.2. Confronted with these data, the analyst can use any number of statistical packages to select and evaluate a statistical model. The Benchmark Dose Software (BMDS; see US EPA, 2008) is specifically designed for this purpose. We may choose a log-logistic model with slope parameter constrained to be greater than or equal to 1. The BMD output reports this choice as follows: The form of the probability function is: P[response] = background+1-background)/ [1+EXP(-intercept-slope*Log(dose))]; Slope parameter is restricted as slope >= 1. A model is fit using the method of maximum likelihood: parameter values are sought that make the data look “as likely as possible.” If our model predicts a response probability of 50% at dose 21, then the observation of 15 responses out of 49 exposed subjects (30.6%) is quite unlikely. Indeed the probability of seeing 15 or fewer responses if the true response probability were 0.5 and the subjects were randomly sampled is 0.0047. 2 The data set accompanied BMDS Version 1.4.1; the Agency has continued to refine the BMDS and supporting materials since the workshop exercises were distributed, so current online materials contain different example data (see US EPA, 2008).

TEST DATA SETS

9

Note that the log-logistic model has three parameters: background, intercept, and slope. If we fit this model without the slope constraint, the fitted “P(response)” would exactly match the observed response relative frequencies, and the maximum likelihood parameter estimates would be Background = 0.02; Intercept = −2.67429; Slope = 0.587422 Since this model with these parameter values exactly hits the observed frequencies, the residuals are zero. A simple chi-squared test gives the probability (p-value) of falsely rejecting the model based on the size of the residuals. Such a test does not make sense if the residuals are actually zero, the p-value is 1. However, the Akaike Information Criterion (AIC) does make sense. The AIC is defined as −2L + 2k, where L is the log-likelihood of the model, using the maximum likelihood estimates of the parameters, and k is the number of parameters. The log-likelihood in this case is −65.9974, and hence the AIC = 137.995. Low values are better; the AIC punishes models with more parameters. If we restrict the slope to be greater than or equal to one, then the maximum likelihood parameter estimates are Background = 0.02; Intercept = −4.07755; Slope = 1 The slope is constrained to be ≥1, and the maximum likelihood estimate of the slope is the boundary value 1. The BMDS interprets this to mean that the slope-constrained model has only two parameters, background and intercept. If the data had been different (e.g., if there had been 13 responses at dose 21 and 23 responses at dose 60), then the slope would be estimated at 1.04921. We would have again a three-parameter model; the number of parameters is actually a random variable under this interpretation. Be that as it may, on the unaltered dataset, the AIC for the constrained model is 136.907, which is less than the AIC of the unconstrained model (137.995). In fact the constrained log-logistic model has the lowest AIC of all the standard models in the selection list in this case, and thus it emerges as the “best” model. The goodness-of-fit data for the constrained model is shown in Table I.3. The p-value for obtaining residuals at least as large as those found by this model is 0.3352 if the model is correct. Hence the probability of falsely rejecting this model is higher than the traditional rejection level (0.05). Note that the predicted number of responses at dose 21 and 60 are 12.784 and 22.125 respectively, not so different from the altered dataset, leading to the threeparameter model.

10

INTRODUCTION

TABLE I.3 Dose 0 21 60

Goodness of fit for the BMD slope-constrained log-logistic model

Estimated Probability

Expected Response

0.0218 0.2609 0.4917

1.091 12.784 22.125

Observed Response

Sample Size

Scaled Residual

50 49 45

−0.088 0.721 −0.634

1 15 20

Note: Chi square = 0.93, degrees of freedom = 1, p-value = 0.3352.

To compute the benchmark dose, we first choose a benchmark response (BMR) as “excess over background.” Two formulations for computing the excess over background are used. In the extra risk model, BMR = ( P ( BMD ) − P ( 0 ) ) ( 1 − P ( 0 ) ) ,

(I.1)

while in the additional risk model, BMR = P (0) + P ( BMD) .

The extra risk model is used in these exercises. The maximum likelihood estimates for the parameters of the best model are need in (I.1) to find the BMD. Typical choices for the BMR (the response level of interest for a given assessment) are 10%, 5%, and sometimes 1%. To find a point of departure, uncertainty is factored in. From the storylines, we see that each uncertainty modeler obtains a distribution over response probabilities for any dose. Equivalently, this can be expressed as a distribution over the doses that realize any given response probability. Figure I.1 shows this relationship. Given an uncertainty distribution over dose–response relationships, we can fix a given probability of response and consider the distribution of doses that realize that response. The lower five percentile of the distribution for a dose realizing the BMR is called the BMDL, and this is used as a point of departure. Details for calculating the distribution over dose–response relationships and the BMDL differ for each model and are explained in the succeeding chapters. Returning to our slope-constrained log-logistic model, the BMDS finds that BMD = 7.21305 and BMDL = 4.93064. Had we used the unconstrained loglogistic model, we would find BMD = 2.25273, and the software would give us the message “Benchmark dose computation failed. Lower limit includes zero.” This would correspond to a situation like that shown in Figure I.2, in which 5% of the dose–response curves at dose zero are above the BMR. On the altered dataset, with responses (1, 13, 23), both the constrained and the unconstrained log-logistic models have an AIC of 134.861; however, the

11

Probbility of response

TEST DATA SETS

BMR

BMDL

BMD

Dose

Probbility of response

Figure I.1 Distribution over response probabilities: Constrained log-logistic model. Note that 5% of doses realizing the BMR are equal to or below the BMDL; equivalently, 5% of curves at the BMDL are greater than or equal to the BMR.

BMR

BMDL

BMD

Dose

Figure I.2 Distribution over response probabilities: Unconstrained log-logistic model. Note that more than 5% of curves at zero are greater than or equal to the BMR.

12

INTRODUCTION

BMDL of the constrained model is 4.67469, whereas that of the unconstrained model is 0.25667. The difference reflects the fact the uncertainty distribution for the slope factor in the unconstrained case has substantial probability for values below 1, whereas the constrained model must pile up mass at 1. For models with the same AIC, the difference in BMDL is a bit unsettling. Among the BMD model selection suite, another popular form is the Weibull model: P[response] = background+(1-background)× [1-EXP(-slope*(dose)^power)]; Slope parameter is restricted as slope >= 1. With this slope-constrained model, we find an AIC of 138.17. The p-value based on residuals is 0.1322, again too high for rejection. The BMD = 9.28942 and the BMDL = 6.92055. Beyond these “example data” that support BMDS training, the other three sets used for bench testing the modeling approaches are excerpted from actual data for chemicals of interest to current environmental programs. These three data sets are presented below in various formats (see Tables I.4–I.6). Also included are the key uncertainty characterization questions posed.

TABLE I.4 Gender

Data set 2: Rat oral exposure to frambozadrine Dose (mg/kg-day)

Male

Female

TABLE I.5

Number in Trial

Number Expressing Hyperkeratosis

0 1.2 15 82

47 45 44 47

2 6 4 24

0 1.8 21 109

48 49 47 48

3 5 3 33

Data set 3: Rat inhalation exposure to nectorine Exposure Level (ppm)

Observed Lesion

0

10

30

60

Number responding/number in trial Respiratory epithelial adenoma Olfactory epithelial neuroblastoma

0/49 0/49

6/49 0/49

8/48 4/48

15/48 3/48

13

TEST DATA SETS

TABLE I.6

Data set 4: Mouse inhalation exposure to persimonate

Test Subject

Continuous Equivalent Exposure Level (ppm)

Total Metabolism (mg/kg-day)

Survival-Adjusted Tumor Incidence

B6C3F1 male mice

0 18 36

0 27 41

17/49 31/47 41/50

Crj:BDF1 male mice

0 1.8 9.0 45

0 3.4 14 36

13/46 21/49 19/48 40/49

Data Set 2: Frambozadrine—Combine Males and Females? Uncertainty questions: Do we need separate dose–response relations for males and females? Does combining genders alter the uncertainty in response? Data Set 3: Nectorine—Combine Endpoints? The rats in each study were different, and only the indicated endpoint was sought in each study. The summation of risks from multiple tumor sites when tumor formation occurs independently in different organs or cell types is considered superior to calculations based on individual tumor sites alone. Uncertainty question: What is the uncertainty in response as a function of dose for either respiratory epithelial adenoma or olfactory epithelial neuroblastoma? Data Set 4: Persimonate—Combine Studies? Uncertainty questions: Can we combine these studies? Does it affect our uncertainty? The applications and discussions are presented in Chapters 2 through 5, and highlights of the results are given in Chapter 7.

14

INTRODUCTION

REFERENCES Abdel-Rahman M, Kadry AM. 1995. Studies on the use of uncertainty factors in deriving RfDs. Hum Ecol Risk Assess 1(5):614–24. Baird SJS, Cohen JT, Graham JD, Shlyakhter AI, Evans JS. 1996. Noncancer risk assessment: a probabilistic alternative to current practice. Hum Ecol Risk Assess 2:79–102. Calabrese EJ, Baldwin LA. 1995. A toxicological basis to derive generic interspecies uncertainty factors for application in human and ecological risk assessment. Hum Ecol Risk Assess 1(5):555–64. Crump, KS. 1984. A new method for determining allowable daily intakes. Fundam Appl Toxicol 4:854–871. Dourson ML, Felter SP, Robinson D. 1996. Evolution of science-based uncertainty factors in noncancer risk assessment. Regul Toxicol Pharmacol 24:108–20 Dourson ML, Knauf LA, Swartout JC. 1992. On reference dose (RfD) and its underlying toxicity data base. Toxicol Indust Health 8:171–89. Dourson ML, Stara JF. 1983. Regulatory history and experimental support of uncertainty (safety) factors. Regul Toxicol Pharmacol 3:224–38. Evans JS, Baird SJS. 1998. Accounting for missing data in noncancer risk assessment. Hum Ecol Risk Assess 4:291–317. Gaylor DW, Kodell RL. 2000. Percentiles of the product of uncertainty factors for establishing probabilistic risk doses. Risk Anal 20:245–50. Hattis D, Baird S, Goble R. 2002. A straw man proposal for a quantitative definition of the RfD. Drug Chem Toxicol 25(4):403–36. Hattis D, Banati P, Goble R, Burmaster DE. 1999. Human interindividual variability in parameters related to health risks. Risk Anal 19:711–26. Kalberlah F, Schneider K, Schuhmacher-Wolz U. 2003. Uncertainty in toxicological risk assessment for non-carcinogenic health effects. Reg Tox Pharmacol 37:92–104. Kodell RL, Gaylor DW. 1999. Combining uncertainty factors in deriving human exposure levels of noncarcinogenic toxicants. An NY Acad Sci 895:188–95. Lehman AJ, Fitzhugh OG. 1954. 100-Fold margin of safety. Assoc Food Drug Off USQ Bull 18:33–5. Nessel CS, Lewis SC, Staubrer KL, Adgate JL. 1995. Subchronic to chronic exposure extrapolation: toxicological evidence for a reduced uncertainty factor. Hum Ecol Risk Assess 1:516–26. NRC 1983. Risk Assessment in the Federal Government: Managing the Process. Washington DC: National Academy Press. NRC 1989. Improving Risk Communication. Washington DC: National Academy Press. NRC 1994. Science and Judgment in Risk Assessment. Washington DC: National Academy Press. NRC 1996. Understanding Risk: Informing Decisions in a Democratic Society. Washington DC: National Academy Press. Pelekis M, Nicolich M, Gauthier J. 2003. Probabilistic framework for the estimation of the adult and child toxicokinetic interspecies uncertainty factors. Risk Anal 23:1239–55.

REFERENCES

15

Pieters MN, Kramer HJ, Slob W. 1998. Evaluation of the uncertainty factor for subchronic-to-chronic extrapolation: statistical analysis of toxicity data. Regul Toxicol Pharmacol 27:108–11. Renwick AG, Lazarus NT. 1998. Human variability and noncancer risk assessment—an analysis of the default uncertainty factor. Regul Toxicol Pharmacol 27:3–20. Rhomberg LR, Wolff SK. 1998. Empirical scaling of single oral lethal doses across mammalian species based on a large database. Risk Anal 18:741–53. Slob W, Pieters MN. 1998. A probabilistic approach for deriving acceptable human intake limits and human health risks from toxicological studies: general framework. Risk Anal 18:787–9. Swartout JC, Price PS, Dourson ML, Carlson-Lynch HL, Keenan RE. 1998. A probabilistic framework for the reference dose (probabilistic RfD). Risk Anal 18(3):271–81. US EPA. 2002. A review of the reference dose and reference concentration processes. EPA/630/P-02/2002F. US EPA. 2004. An examination of EPA risk assessment principles and practices. EPA/100/B-04/001. US EPA. 2008. Benchmark dose methodology; Benchmark dose software (BMDS). Websites last updated Sept. 30. http://www.epa.gov/NCEA/bmds/bmds_training/ methodology/intro.htm; http://www.epa.gov/ncea/bmds/bmds_training/software/ overp.htm. Vermeire T, Stevenson H, Pieters MN. 1999. Assessment factors for human health risk assessment: a discussion paper. Crit Rev Toxiocol 29(5):439–90.

1 ANALYSIS OF DOSE–RESPONSE UNCERTAINTY USING BENCHMARK DOSE MODELING Jeff Swartout US Environmental Protection Agency, Office of Research and Development, National Center for Environmental Assessment

1.1

INTRODUCTION AND METHODS

I have assessed the uncertainty in modeling the test data sets using the US EPA BMDS (Version 1.4.1) software (US EPA 2007), with a secondary verification using a parametric bootstrap algorithm in S-PLUS® (Version 6.2 for Windows®). I present only the analyses of the (mythical) frambozadrine, nectorine, and persimonate data sets, as they offer the most interesting issues for combining results. The issue of combining response data is first addressed by determining if the individual data sets are similar conceptually. That is, can they be considered to belong to the same population? The statistical aspect of pooling the data is addressed by assessing the net deviance (−2× optimized log-likelihood ratio) between the dose–response model fits to the pooled data set and the individual data sets, assuming a specific dose–response family (Stiteler et al., 1993). The test statistic is Dpool − EDi, where Dpool is the deviance of the pooled data model fit and EDi is the sum of the individual model fit deviances from the same fitted model. Data are pooled by combining all dose groups into one data set, keeping each group intact as well as their controls (i.e., not by summing animals and responders into single groups for the same dose). The test statistic Uncertainty Modeling in Dose Response: Bench Testing Environmental Toxicity Edited by Roger M. Cooke Copyright © 2009 by John Wiley & Sons, Inc.

17

18

BENCHMARK DOSE MODELING

is compared to a chi-squared distribution with degrees of freedom equal to the difference between the sum of the number of individual model parameters minus the number of parameters in the pooled-data model. As a convention, the probability is reported as 1 − pchisq, where pchisq is the cumulative probability of the chi-squared distribution, with values greater than 0.05 accepted as adequate for pooling the data. Uncertainty analysis for the Benchmark Dose procedure was accomplished by obtaining lower 95% confidence bounds for selected benchmark response (BMR) levels, corresponding to the BMDLX, where X is the response percentage in the tested animals. BMRs ranging from 1% to 99% were evaluated. Confidence bounds in BMDS are determined by the profile likelihood method, assuming an asymptotic relationship of the log-likelihood ratio to the chisquared distribution. The parametric bootstrap procedure consists of fitting the selected dose– response model to the raw data by maximum likelihood and treating the fitted response at each dose as the “true” response in the tested animals. Then, by the assumption that the observed response in each dose group can be represented by simple binomial uncertainty,1 a new set of responses is generated based on the fitted probability and sample size (simulating a repeat of the experiment). The selected dose–response model is fit to 1000 bootstrap samples and the fitted parameters are saved. From the resulting parameter distribution, pointwise confidence bounds can be calculated for the entire dose–response curve. Upper and lower bootstrap confidence bounds can next be compared with the corresponding profile likelihood confidence bounds computed in BMDS. Constraints on the parameter space are an issue for use of either of these methods. Standard likelihood-based methods are sometimes found to perform poorly when there are parameter constraints. The most general results assume no constraints (i.e., an “open parameter space”). 1.2

FRAMBOZADRINE DATA ANALYSIS

The frambozadrine data are reproduced in Table 1.1. The questions to be answered are as follows: 1. Do we need separate dose–response relations for males and females? 2. Does combination alter the uncertainty in response? 1.2.1

Frambozadrine BMDS Model-Fitting Results

Table 1.2 shows the results of the model fitting from BMDS. Any model fit with a p-value greater than 0.1 was considered adequate. Goodness of fit was evaluated on the basis of the Akaike Information Criterion (AIC) value, with 1

The parametric bootstrap procedure does not account for over-dispersion (i.e., extra-binomial variance) in the response data.

19

FRAMBOZADRINE DATA ANALYSIS

TABLE 1.1

Frambozadrine dose–response data

Dose (mg/kg-day)

Incidence (Hyperkeratosis)

Total Rats

Percent Response

Male 0 1.2 15 82

47 45 44 47

2 6 4 24

4.3 13 9.1 51

3 5 3 33

6.3 10 6.4 69

Female 0 1.8 21 109

TABLE 1.2

48 49 47 48

BMDS results for frambozadrine

Model

AIC

p-Value

Deviance

BMD10

BMDL10

2.551 2.478 2.478 2.478 2.478

33.7 40.8 42.3 36.9 44.5

12.0 10.3 12.0 13.6 10.8

1.742 0.667 0.667 0.667 0.667

34.4 71.2 82.4 66.7 86.4

21.2 25.0 24.7 24.3 25.2

4.498 3.603 3.380 3.192 4.036

34.1 43.5 44.3 44.3 41.2

22.9 25.7 26.0 25.9 24.6

Male N

Multistage (2 ) Gamma Log-logistic Log-probit Weibull

150.4 152.3 152.3 152.3 152.3

0.28 0.12 0.12 0.12 0.12 Female

N

Multistage (2 ) Gamma Log-logistic Log-probit Weibull

142.4 143.3 143.3 143.3 143.3

0.43 0.41 0.41 0.41 0.41 Pooled data

N

Multistage (2 ) Gamma Log-logistic Log-probit Weibull

289.0 290.1 290.1 289.9 290.5

0.60 0.60 0.48 0.51 0.54

20 1.0

BENCHMARK DOSE MODELING

0.6 0.4 0.0

0.2

Cumulative response

0.8

Females Males Pooled data

1

5

10

50

100

Dose (mg/kg-day)

Figure 1.1

Multistage (second-order) model fits to the frambozadrine data.

lower values indicating better fit. The deviance statistic is given for determination of the statistical propriety of combining the data. BMD and BMDL values are in units of mg/kg-day. On the basis of the AIC values, the best-fitting model for all three data sets is the second-order multistage. The multistage model fits are shown in Figure 1.1. Given the small differences in the AIC values, however, all of the models are virtually equivalent for goodness of fit. BMDL10 values are similar for all models. The multistage (best-fitting), and Weibull (most flexible) models were chosen for uncertainty analysis. Pooling the data across gender for the same species and strain is reasonable conceptually, and is done commonly. The pooled-data test statistics are 0.205 and 0.891 for the multistage and Weibull models, respectively. The chi-squared p-values for the test statistics, given three degrees of freedom, are 0.98 and 0.83 for the multistage and Weibull models respectively, indicating that pooling the data is reasonable. Note, in particular, that the BMDL10 values for the pooled data are higher than those for either of the individual–gender response data sets—almost twice the male response BMDL10. This behavior is of particular significance in defining the point of departure (POD) for deriving a reference dose (RfD). 1.2.2

Frambozadrine Uncertainty Analysis Results

The results of the Benchmark Dose and bootstrap uncertainty analyses of the frambozadrine data for the second-order multistage model is shown in Table 1.3. The BMD, BMDL and ratio (BMDLr) of the BMD to the BMDL are given for selected BMRs. The analogous values (ED, EDL, EDLr) for the bootstrap simulation are shown for comparison. The BMDLr and EDLr values provide a

21

FRAMBOZADRINE DATA ANALYSIS

TABLE 1.3 multistage) BMRa

BMD and bootstrap results for frambozadrine (second-order BMDLb

BMDc

BMDLrd

EDLe

EDf

EDLrg

1.35 6.85 13.8

8.90 20.7 30.3

6.6 3.0 2.2

1.59 7.77 15.6

9.05 21.2 30.9

5.7 2.7 2.0

1.75 8.67 16.9

9.37 21.9 32.0

5.4 2.5 1.9

Male 0.01 0.05 0.10

1.15 5.86 12.0

10.4 23.5 33.7

9.0 4.0 2.8 Female

0.01 0.05 0.10

2.55 11.6 21.2

10.6 24.0 34.4

4.2 2.1 1.6 Pooled data

0.01 0.05 0.10

2.92 12.8 22.9

10.5 23.8 34.1

3.6 1.9 1.5

a

Benchmark response level (fraction responding). Lower 95% confidence bound on the BMD (from BMDS). c Maximum likelihood estimate (from BMDS). d Ratio of BMD to BMDL. e Lower 95% bootstrap confidence bound (bootstrap analogue of BMDL). f Median bootstrap ED estimate (bootstrap analogue of BMD). g Ratio. of ED to EDL. b

quick measure of the relative statistical uncertainty for any given combination of model and response level (BMR). The 5% and 10% BMRs are generally applied in the derivation of RfDs. The 1% BMR results are included to show the behavior of the models at response levels farther away from the data. BMD, BMDL, ED, and EDL values are in units of mg/kg-day. As would be expected, Table 1.3 shows that the uncertainty increases with decreasing BMR, that is, as the prediction moves farther away from the data. The BMDLr values for the male response data are about twice as large as for the female and pooled response data. The bootstrap EDLr values are much closer together than the corresponding BMDLr values. The EDLr values are somewhat larger than the corresponding ones for the BMDLr, except for the male response data, in particular for the BMDLr0.01. Pooling the data reduces the uncertainty at all BMR levels, particularly with respect to the male response data. The 90% bootstrap confidence bounds for the multistage model fit to the pooled frambozadrine data are shown in Figure 1.2. The results of the Benchmark Dose and bootstrap uncertainty analyses of the frambozadrine data for the Weibull model are shown in Table 1.4. Except for the pooled data, the BMDLr values for the Weibull model are larger and more variable than those for the second-order multistage model (compare to Table 1.3). In addition the BMD uncertainty is greater than the

22 1.0

BENCHMARK DOSE MODELING

0.6 0.4 0.0

0.2

Cumulative response

0.8

Multistage fit Median bootstrap response 90% bootstrap confidence bounds

1

5

10

50

100

Dose (mg/kg-day)

Figure 1.2 Multistage model fits to the pooled frambozadrine data with 90% bootstrap confidence bounds. TABLE 1.4

BMD and bootstrap results for frambozadrine (Weibull model)

BMR

BMDL

BMD

BMDLr

EDL

ED

EDLr

1.13 6.19 13.4

20.7 36.0 45.3

18.3 5.8 3.4

4.34 12.2 19.8

27.8 43.7 53.2

6.4 3.6 2.7

4.85 14.2 22.9

15.9 31.1 41.7

3.3 2.2 1.8

Male 0.01 0.05 0.10

0.678 4.61 10.7

19.8 34.7 44.5

29.2 7.5 4.2 Female

0.01 0.05 0.10

5.76 16.1 25.2

68.3 80.4 86.4

11.9 5.5 3.4 Pooled data

0.01 0.05 0.10

5.34 15.4 24.6

15.8 30.7 41.2

3.0 2.0 1.7

bootstrap uncertainty for the individual data sets, with the greatest difference at the lowest BMRs. Another significant difference between the two models is in the central-tendency estimate (BMD or ED) for any given BMR, with Weibull BMD values 1.5 to 3 times greater than multistage BMDs. As for the multistage model, the bootstrap uncertainty is somewhat less than the BMD uncertainty in most cases. BMDL and EDL values are similar, but there is a

NECTORINE DATA ANALYSIS

23

large disparity between the BMDs and EDs for the female response data not seen for either the male response data (Weibull fit) or the multistage model fits to the female data. Figure 1.3 shows the BMDS plots (BMR = 0.05) for these three model fits. The Weibull fit for the female data (Figure 1.3c) is much shallower at the low end than either of the other two fits. The corresponding likelihood profile will also be flat in that dose region, resulting in a large difference between the BMD and BMDL. The primary problem here is the wide gap in female response between the last two doses, rising from essentially background level to over 50%. Large model-specific differences can arise, depending on the relative shape flexibility of the model. Ideally observed responses in the vicinity of the BMR are needed to “anchor” the model fit appropriately (US EPA, 2000). The pooled-data Weibull fit reduces the uncertainty significantly when compared to either of the individual data set fits. 1.2.3

Conclusions for Frambozadrine Uncertainty Analysis

1. We do not need separate dose–response relations for males and females. 2. Combining the response data decreases dose–response uncertainty somewhat for the multistage model and significantly for the Weibull model. Given the lack of response near the BMR for the female response data, however, a BMD analysis might not be appropriate. At the least, highly flexible models such as the Weibull probably should not be used. 1.3

NECTORINE DATA ANALYSIS

The nectorine data are reproduced in Table 1.5. The animals in each study were different and only the indicated endpoint was evaluated in each study (i.e., the tumors are independent). The question to be answered is as follows: What is the uncertainty in response as function of dose for either respiratory epithelial adenoma OR olfactory epithelial neuroblastoma? 1.3.1

Nectorine BMDS Model-Fitting Results

Table 1.6 shows the results of the model fitting from BMDS. Human-equivalent exposures are not evaluated. As uncertainty in animal-to-human scaling is not evaluated, the uncertainty in the human-equivalent inhalation unit risks (IUR) will be identical to that for the rats. Model acceptability and goodness-of-fit criteria are the same as for the frambozadrine analysis. The multistage and Weibull models were fit to the data. The multistage model is the standard default cancer model (US EPA, 2005). As for frambozadrine, the Weibull model was used to show the effect of a more shape-flexible model. The first-order multistage model provides an adequate fit (p > 0.10) to both data sets and is selected as the primary model. Under the EPA guidelines, if

24

BENCHMARK DOSE MODELING

a.

Weibull model with 0.95 confidence level 0.7

Weibull

Fraction affected

0.6 0.5 0.4 0.3 0.2 0.1 0

BMDLBMD 0 10

b.

20

30

BMD 40 Dose

50

60

70

80

Multistage model with 0.95 confidence level 0.8

Multistage

Fraction affected

0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

BMDL 0

BMD 20

c.

40

60 Dose

80

100

Weibull model with 0.95 confidence level 0.8

Weibull

Fraction affected

0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

BMDL 0

20

40

60 Dose

BMD 80

100

Figure 1.3 Frambozadrine BMD05/BMDL05 plots: (a) Male response data, Weibull model; (b) Female response data, second-order multistage model; (c) Female response data, Weibull model.

25

NECTORINE DATA ANALYSIS

TABLE 1.5

Nectorine inhalation dose–response data Concentration (ppm): # Response/# In Trial

Lesion Respiratory epithelial adenoma in rats Olfactory epithelial neuroblastoma in rats

TABLE 1.6

0

10

30

60

0/49 0/49

6/49 0/49

8/48 4/48

15/48 3/48

BMDS results for nectorine

Model

AIC

p-Value

BMD10

BMDL10

Multistage IUR1

Respiratory epithelial adenoma in rats N

Multistage (1 ) Weibull

143.6 143.9

0.44 0.75

15.2 8.71

11.3 0.259

0.00883

Olfactory epithelial neuroblastoma in rats Multistage (1N) Weibull

55.2 57.2

0.42 0.24

70.1 70.4

39.9 39.9

0.00251

0.1/BMDL10 for the multistage model fit in units of (mg/kg-day)−1.

1

a p-value of at least 0.10 is obtained by the first-order fit, higher order models are not used even if the p-values are higher. The Weibull model provides a much better fit for the adenoma endpoint. While the Weibull would not be used for derivation of an IUR, it is useful for uncertainty analysis. The Weibull fit to the neuroblastoma data is essentially identical to the first-order multistage, with a Weibull exponent parameter value of 0.99 (virtually exponential). Assuming independence of the two tumor types, the combined risk of either tumor can be calculated exactly from the binomial risk formula: IUR 1 + [1 − IUR 1 ] × IUR 2, where IUR1 and IUR2 are the two independent IURs. A close approximation can be obtained by simply adding the IURs if the sum is less than 0.1. In this case the combined risk of tumor is 0.011 by either method. BMDS, however, does not estimate confidence limits for the IUR, as it is only an approximation of a 95% upper confidence bound. In addition the probability of the sum of multiple IURs cannot be estimated from the BMDS output. The US EPA cancer guidelines (US EPA, 2005) do not address this issue, which is an active area of investigation. An attempt to address the uncertainty distribution of the combined IURs is found in the following uncertainty analysis. 1.3.2

1.3.2 Nectorine Uncertainty Analysis Results

The results of the Benchmark Dose and bootstrap uncertainty analyses of the nectorine data for the multistage model are shown in Table 1.7. A second-order parameter was fit in the bootstrap simulation to account for bootstrapped responses that could not be fit adequately by the first-order model. The BMD results are strictly first order. The second-order parameter was zero when the second-order model was fit to the raw data. Because the first-order multistage model has only one parameter, the BMDLr values are constant for any given data set and relatively small, indicating low uncertainty. The second-order multistage bootstrap results represent a slight increase in uncertainty at the lower BMR values. Otherwise, the BMDS and bootstrap results are similar.

TABLE 1.7 BMD and bootstrap results for nectorine (multistage model)

BMR     BMDL    BMD     BMDLr   EDL     ED      EDLr

Respiratory epithelial adenoma
0.01    1.08    1.45    1.3     1.15    1.73    1.5
0.05    5.51    7.39    1.3     5.86    8.72    1.5
0.10    11.3    15.2    1.3     12.0    17.3    1.4

Olfactory epithelial neuroblastoma
0.01    3.81    6.69    1.8     4.60    9.76    2.1
0.05    19.4    34.1    1.8     22.9    39.3    1.7
0.10    39.9    70.1    1.8     43.8    63.0    1.4

Table 1.8 shows the results of the Weibull model uncertainty analysis. The BMDS results show extreme uncertainty, particularly for low BMRs and the adenoma endpoint. The bootstrap results are characterized by large variability as well but are much more stable. The Weibull exponent for the fit to the adenoma response data is 0.67, indicating a supralinear response. High BMDLr values are usually associated with supralinear fits. Most two-parameter models that have real support only above zero will exhibit this behavior: the variance increases without bound as supralinearity increases (i.e., as the exponent, or shape parameter, approaches zero).

TABLE 1.8 BMD and bootstrap results for nectorine (Weibull model)

BMR     BMDL          BMD     BMDLr        EDL           ED      EDLr

Respiratory epithelial adenoma
0.01    2.3 x 10^-7   0.191   8.4 x 10^5   4.8 x 10^-5   0.222   4600
0.05    0.0037        2.70    730          0.043         2.92    68
0.10    0.259         8.71    34           1.05          44.9    8.9

Olfactory epithelial neuroblastoma
0.01    3.1 x 10^-28  6.60    2.1 x 10^28  0.094         8.43    89
0.05    2.79          34.1    12           16.3          39.9    2.4
0.10    39.9          70.4    1.8          43.2          64.4    1.5

TABLE 1.9 Bootstrap inhalation unit risk distribution fractiles for nectorine, (mg/kg-day)^-1

Fractile:                            0.01      0.05      0.10      0.25     0.50     0.75     0.90     0.95     0.99
Respiratory epithelial adenoma       0.0033    0.0039    0.0042    0.0049   0.0059   0.0069   0.0079   0.0084   0.0095
Olfactory epithelial neuroblastoma   0.00027   0.00059   0.00080   0.0012   0.0016   0.0019   0.0021   0.0023   0.0027
Combined tumors                      0.0045    0.0052    0.0056    0.0064   0.0074   0.0085   0.0095   0.010    0.011

For this reason, the US EPA recommends constraining the models to avoid supralinearity (US EPA, 2000). However, a large number of dose-response data sets exhibit supralinear behavior when unconstrained models are fit. Thus, although the extreme uncertainty shown for the Weibull analysis of these data should not be taken too seriously, the results serve as a caution when interpreting confidence bounds based on constrained model fits.

IUR distributions for each individual tumor type are approximated from the bootstrapped-response model fits by dividing the BMR (0.1) by the ED10 at each iteration. The combined risk is determined by summing the (MLE) IURs for randomly paired iterations for each tumor type. The results are shown in Table 1.9, which lists the empirical quantiles for each distribution. The values in Table 1.9 are approximate, as only 500 paired bootstrap samples were used in the calculation. Probability density plots of the distributions are shown in Figure 1.4. Note that summing the quantiles directly does not yield the same answer: summing the 95% upper-bound multistage cancer IURs as calculated by BMDS will not give a 95% upper bound, but something farther out in the tail of the distribution. In this case the direct sum of the BMDS IURs, 0.011, is equivalent to a 99th percentile of the bootstrap distribution. The 90% bootstrap confidence interval for the adenoma IUR is 0.0039 to 0.0084 (mg/kg-day)^-1, spanning a 2.2-fold risk interval. The 90% bootstrap confidence interval for the neuroblastoma IUR is 0.00059 to 0.0023 (mg/kg-day)^-1, spanning a 3.9-fold range. The 90% bootstrap confidence interval for the combined IUR is 0.0052 to 0.010 (mg/kg-day)^-1, spanning a 1.9-fold interval.
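The pairing-and-summing step is simple to sketch. In the following minimal Python illustration, the lognormal ED10 arrays are hypothetical stand-ins; in the actual analysis they come from refitting the multistage model to each of the 500 parametric-bootstrap response data sets per tumor type:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bootstrap ED10 samples, centered on the Table 1.6 BMD10s.
ed10_adenoma = rng.lognormal(mean=np.log(15.2), sigma=0.2, size=500)
ed10_neuro = rng.lognormal(mean=np.log(70.1), sigma=0.3, size=500)

bmr = 0.1
iur_adenoma = bmr / ed10_adenoma   # unit risk at each iteration
iur_neuro = bmr / ed10_neuro

# Randomly pair iterations across tumor types and sum the unit risks
# (the exact binomial combination is nearly identical at these values).
iur_combined = iur_adenoma + iur_neuro[rng.permutation(500)]

print(np.quantile(iur_combined, [0.05, 0.50, 0.95]))
```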

1.3.3 Conclusions for Nectorine Uncertainty Analysis

The constant-shape first-order multistage model indicates low uncertainty compared to the shape-flexible Weibull model. Assuming, however, that a mutagenic (i.e., one-hit) mode of action applies, the Weibull model would not be considered for deriving an inhalation unit risk. Therefore, based on the 90% bootstrap confidence intervals, the uncertainty in the combined response is less than that for either of the individual tumor responses.

[Figure 1.4 Bootstrap inhalation unit risk distributions for nectorine: (a) respiratory epithelial adenoma; (b) olfactory epithelial neuroblastoma; (c) combined tumors.]

1.4 PERSIMONATE DATA ANALYSIS

The persimonate data are reproduced in Table 1.10. The questions to be answered are as follows:

1. Can we combine these studies?
2. Does it affect our uncertainty?

1.4.1 Persimonate BMDS Model-Fitting Results

Table 1.11 shows the results of model fitting from BMDS. The analysis is based on the metabolized dose rather than the external exposure levels. Human-equivalent exposures are not evaluated. Model acceptability and goodness-of-fit criteria are the same as for the frambozadrine analysis. Only the first-order (one-parameter) multistage model can be fit to the B6C3F1 data, as there are not enough degrees of freedom to fit any of the other models and still allow for statistical comparison. This model is essentially a quantal-linear, or exponential, model. On the basis of the AIC values, the best-fitting models for the Crj:BDF1 data set are the gamma, log-logistic, log-probit, and Weibull, all with identical fits. The log-logistic and log-probit fit the pooled data best. The multistage model fits are shown in Figure 1.5. Given the small differences in the AIC values, however, all of the two-parameter models are virtually equivalent for goodness of fit. BMDL10 values nevertheless vary by about a factor of two among the better-fitting models for each data set. Pooling the data across strains within a species is reasonable conceptually.

TABLE 1.10 Persimonate inhalation dose-response data

                     Exposure Level   Metabolized Dose   Survival-Adjusted   Percent
                     (ppm)            (mg/kg-day)        Tumor Incidence     Response

B6C3F1 male mice     0                0                  17/49               34.7
                     18               27                 31/47               66.0
                     36               41                 41/50               82.0

Crj:BDF1 male mice   0                0                  13/46               28.3
                     1.8              3.4                21/49               42.9
                     9.0              14                 19/48               39.6
                     45               36                 40/49               81.6

TABLE 1.11 BMDS results for persimonate

Model                       AIC     Deviance   p-Value   BMD10   BMDL10

B6C3F1 male mice
Multistage (first-order)    175.2   0.4714     0.49      3.70    2.71

Crj:BDF1 male mice
Multistage (first-order)    242.5   5.616      0.06      3.54    2.53
Multistage (second-order)   241.6   2.673      0.10      12.2    3.45
Gamma                       241.1   2.215      0.14      16.1    5.06
Log-logistic                241.1   2.211      0.14      16.1    6.82
Log-probit                  241.1   2.216      0.14      15.9    7.93
Weibull                     241.1   2.201      0.14      16.2    4.60

Pooled data
Multistage (first-order)    414.1   6.499      0.26      3.60    2.86
Multistage (second-order)   412.7   3.110      0.54      10.7    3.82
Gamma                       412.5   2.915      0.57      13.3    4.80
Log-logistic                412.3   2.706      0.61      13.8    6.80
Log-probit                  412.3   2.707      0.61      14.3    7.79
Weibull                     412.7   3.103      0.54      11.5    4.27

[Figure 1.5 Multistage (first-order) model fits to the persimonate data.]

The only common model across the data sets is the first-order multistage, so the pooling test is based on the deviance for that model fit, even though the fit is marginal for the Crj:BDF1 data (p = 0.06; the US EPA BMD guidance [US EPA, 2000] suggests rejecting fits with a p-value less than 0.1). The pooled-data test statistic is 0.4116. The chi-squared p-value for the test statistic, given two degrees of freedom, is 0.81, indicating that pooling the data is reasonable. For conducting the uncertainty analysis on the Crj:BDF1 and pooled data sets, the first-order multistage and Weibull models are chosen. The Weibull, as for the frambozadrine and nectorine analyses, is selected for shape flexibility. The first-order multistage, although not the best-fitting model, would be the preferred model under the US EPA cancer risk assessment guidelines (US EPA, 2005) for modeling the pooled response data (if not for the Crj:BDF1 data).
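A minimal sketch of this pooling test, using the first-order multistage deviances from Table 1.11 (scipy assumed available):

```python
from scipy import stats

# First-order multistage deviances from Table 1.11.
dev_separate = 0.4714 + 5.616   # B6C3F1 fit + Crj:BDF1 fit
dev_pooled = 6.499              # single fit to the pooled data

# Extra deviance incurred by forcing one common fit. The two separate
# 2-parameter fits use 4 parameters versus 2 for the pooled fit, so the
# statistic is referred to a chi-squared distribution with 2 df.
test_stat = dev_pooled - dev_separate
p_value = stats.chi2.sf(test_stat, df=2)

print(round(test_stat, 4), round(p_value, 2))   # 0.4116 0.81
```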

1.4.2 Persimonate Uncertainty Analysis Results

The results of the Benchmark Dose and bootstrap uncertainty analyses of the persimonate data for the first-order multistage model are shown in Table 1.12 and Figure 1.6 (pooled data). The BMD and BMDL values are similar across the data sets for any given BMR. The same holds for the ED and EDL values. Because the dose-response model has only one parameter, the BMDLr and EDLr values are constant for any given data set and relatively small, indicating low uncertainty. The results of the Benchmark Dose and bootstrap uncertainty analysis of the persimonate data for the Weibull model are shown in Table 1.13.

TABLE 1.12 BMD and bootstrap results for persimonate (first-order multistage)

BMR     BMDL    BMD     BMDLr   EDL     ED      EDLr

B6C3F1 male mice
0.01    0.259   0.353   1.4     0.249   0.348   1.4
0.05    1.32    1.80    1.4     1.27    1.78    1.4
0.10    2.71    3.70    1.4     2.61    3.65    1.4

Crj:BDF1 male mice
0.01    0.242   0.338   1.4     0.237   0.336   1.4
0.05    1.23    1.73    1.4     1.21    1.72    1.4
0.10    2.53    3.54    1.4     2.49    3.52    1.4

Pooled data
0.01    0.273   0.343   1.3     0.270   0.344   1.3
0.05    1.39    1.75    1.3     1.38    1.76    1.3
0.10    2.86    3.60    1.3     2.83    3.61    1.3

[Figure 1.6 Multistage model fit to the pooled persimonate data with 90% bootstrap confidence bounds.]

TABLE 1.13 BMD and bootstrap results for persimonate (Weibull model)

BMR     BMDL    BMD     BMDLr   EDL     ED      EDLr

Crj:BDF1 male mice
0.01    0.609   7.57    12.4    1.25    7.02    5.6
0.05    2.48    12.8    5.2     3.84    12.2    3.2
0.10    4.60    16.2    3.5     6.37    15.7    2.5

Pooled data
0.01    0.503   3.70    7.4     0.600   3.73    6.2
0.05    2.22    8.15    3.7     2.51    8.17    3.3
0.10    4.27    11.5    2.7     4.68    11.5    2.5

The most notable difference from the multistage model results is the increase in the BMDLr and EDLr values, substantially raising the uncertainty estimate. This result is expected, given the greater flexibility of the Weibull model. All of the BMDs and EDs are increased compared to the multistage model as well, but are consistent with each other (i.e., within the Weibull model results). The BMDLr values are larger than the EDLr values, particularly for the BMR 0.01 results for the Crj:BDF1 response data, similar to the results for the frambozadrine Weibull analysis.

1.4.3 Conclusions for Persimonate Uncertainty Analysis

The two studies can be combined, but combining the response data does not affect the uncertainty for the first-order multistage model. It reduces uncertainty slightly for the Weibull model based on the BMD analysis. The uncertainty is not affected for the Weibull model, however, if it is based on the bootstrap analysis.

1.5 GENERAL CONCLUSIONS

The BMD approach tends to overestimate, with respect to the bootstrap method, the dose-response uncertainty at low BMR values for these particular data sets, primarily for responses relatively far from the lowest observed nonbackground response and for smaller sample sizes. Overall, however, there is not much of a difference between the BMD and bootstrap results for these data. Most of these sample data sets are relatively well behaved, in that most of the model fits are very good and all are linear (exponential) or sublinear. For supralinear data sets (e.g., Weibull exponents < 1), such as the nectorine respiratory epithelial adenoma response data, BMDLr values tend to increase without bound as the supralinearity increases (results not shown). A useful follow-on exercise would be to evaluate different uncertainty analysis techniques on more supralinear response data.

REFERENCES

Stiteler WM, Knauf LA, Hertzberg RC, Schoeny RS. 1993. A statistical test of compatibility of data sets to a common dose-response model. Reg Tox Pharm 18:392-402.

US EPA. 2000. Benchmark Dose Technical Guidance Document. External Review Draft. EPA/630/R-00/001.

US EPA. 2005. Guidelines for Carcinogen Risk Assessment. Risk Assessment Forum, Washington, DC. EPA/630/P-03/001B. www.epa.gov/cancerguidelines.

US EPA. 2007. Benchmark Dose Software (BMDS) version 1.4.1 (last modified 2006). http://www.epa.gov/ncea/bmds.

COMMENT

THE MATH/STATS PERSPECTIVE ON CHAPTER 1: HARD PROBLEMS REMAIN

Allan H. Marcus
Office of Research and Development, Integrated Risk Information System, National Center for Environmental Assessment

INTRODUCTION

It is a pleasure to respond to the request for a math/stat perspective on J. Swartout's model application of the current BMD methodology. Despite the wide popularity that the BMD method currently enjoys, its application in dose-response modeling may pose nonroutine issues. Typical "hard" problems remain:

- How do we deal with steep changes between adjacent data points ("climbing up" or "falling off a cliff") when there are insufficient data between adjacent data values?
- How do we deal with Weibull quantal endpoint models or power function continuous models and supralinear dose response?
- How do we acquire informative estimates of the BMD and BMDL in cases where supralinearity induces uninformative or near-zero BMDLs?

Our models must sometimes address diverse and opposing needs. This will be illustrated below.

SPECIFIC MODELING ISSUES FOR "NECTORINE"

The dose-response data for respiratory epithelial adenomas appear to be somewhat supralinear and may be consistent with several BMDS models.

In particular, should we prefer a better-fitting Weibull model to a more plausible biologically based multistage model? This illustrates the sort of trade-offs that an analyst frequently confronts. On the one hand, we need models that connect with biologically based, or at least biologically inspired, dose-response models, such as multistage and Moolgavkar, Venzon, and Knudson (MVK) models. On the other hand, we need smooth interpolation formulations to simplify BMD/BMDL estimation. On the one hand, we need a large universe of plausible models for sensitivity analyses of methods; on the other hand, we need a large universe of implausible models as well, because data sometimes are contrary to our expectations, as in profoundly nonmonotonic models. Swartout's Table 1.11 for persimonate shows a common situation where many models have AIC values within 1.4 units (most within 0.1) yet BMDL10s differ twofold, and will differ even more with smaller BMRs.

The nectorine example is interesting because it raises questions about a common hard problem, namely supralinearity. The analogous problem occurs for continuous endpoints analyzed using BMDS power function and Hill function models: when the power exponent is less than 1, the BMDLs can be forced toward 0, but they don't have to be. Table 1.14 presents an illustrative calculation. The Weibull dose-response model is

Prob(response at dose d) = Background + (1 - Background) x [1 - exp(-Slope x d^Exponent)].

The maximum likelihood estimate (MLE) of the exponent, and the associated AIC, BMD with BMR = 0.1, and BMDL are shown below.

TABLE 1.14 Respiratory epithelial adenoma: Weibull model (likelihood profile approach)

Exponent       AIC     BMD10   BMDL10
0.25           143.7           0.417
0.35           142.8           1.41
0.45           142             2.80
0.55           141.9           4.37
0.615          141.9   8.7     5.42
0.65           141.9           5.98
0.75           142.1           7.58
0.85           142.5           9.13
1.0            143.6           11.34
MLE (0.615)    143.9   8.7     0.26

If we specify the exponent to be equal to its MLE, then we reduce the number of parameters to be estimated, and the AIC goes down, as expected. Put differently, adding parameters to a model incurs an AIC penalty. We then look at the likelihood profile by varying the fixed value of the exponent. Between 0.55 and 0.65 the AIC does not change; however, the BMDL10 changes from 4.37 to 5.98. If we allow the AIC to vary from 142 to 142.1, which is still a negligible variation, then the BMDL varies from 2.80 to 7.58. This illustrates the instability of the BMDL10 under the Weibull model.
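For reference, AIC = 2k - 2 ln(maximized likelihood), where k is the number of estimated parameters. Fixing the exponent at its MLE leaves the maximized likelihood unchanged but removes one parameter, which is exactly the 2-unit drop visible in Table 1.14. A minimal sketch:

```python
# AIC = 2k - 2*ln(maximized likelihood). Fixing the Weibull exponent at
# its MLE (0.615) leaves the likelihood unchanged but drops one fitted
# parameter, so the AIC falls by exactly 2.
aic_exponent_estimated = 143.9                  # full Weibull fit (Table 1.14)
aic_exponent_fixed = aic_exponent_estimated - 2
print(aic_exponent_fixed)                       # 141.9, as in Table 1.14
```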

CONCLUSION

The choice of model is affected by diverse and sometimes opposing demands. We want models to be biologically plausible and also to have nice mathematical properties. We want models to be smooth and to have a small number of parameters that can be easily estimated, and we want the quantities of interest, in particular the BMDL, to be stable. Mathematical models may have features for which we could, or should, seek physiological justification. Among these one might list the following:

- Monotonicity versus nonmonotonicity. Is the choice of a monotonic model really driven by data, or by convention and convenience?
- Points of inflection. Do these have a biological meaning?
- Rates of change of curvature.
- Uni- versus multimodal behavior. Are there biological arguments underlying these choices?

COMMENT

EPI/TOX PERSPECTIVE ON CHAPTER 1: RE-FORMULATING THE ISSUES

Jouni T. Tuomisto
National Public Health Institute (KTL), Kuopio, Finland

INTRODUCTION

I am grateful for the opportunity to offer some epidemiological/toxicological perspectives on Jeff Swartout's chapter. Swartout is to be applauded for his very clear exposition of the "standard approach" to uncertainty quantification and its role in defining a BMDL. The BMDL is a regulatory reference value that incorporates uncertainty in a transparent statistical manner, and thus constitutes a significant improvement over the older allowable daily intake or reference dose. I offer a few general remarks, and use these to re-formulate some issues regarding the frambozadrine and persimonate data sets.

WHAT IS THE RESEARCH QUESTION?

The first question is, What is the question? Whatever the question, it should pass the so-called clairvoyant test: an entity possessing all possible knowledge should be able to answer it. We may surmise that the research question is: What is the probability of having a response R in an individual of a large population Pop at time t after an exposure pattern Exp?

[Figure 1.7 Ideal data for assessing the probability of response as a function of exposure.]

Before a clairvoyant can answer this question, we should specify:

- What is the outcome measured? What is the magnitude for a positive response?
- What is the target population?
- What is the time of observation?
- What is the exposure pattern?

PROBABILITY OF RESPONSE R

The probability of response will be a function of exposure and will ideally be based on data such as those shown in Figure 1.7. Figure 1.7 shows a cloud of percentage response points from a very large number of studies (each with a given large number of experimental animals). Of course, we seldom if ever have such data at our disposal, but this is the sort of information that a clairvoyant would call up to answer the research question above. The expected response for each exposure is given as the "regression curve" E(R|Exp).

BENCHMARK DOSE FOR R

Current regulatory practice involves computing the BMDL. As shown in Figure 1.8, this is the 5% lower bound of the estimated exposure leading to a benchmark response (BMR) in a specified population. This exposure is estimated from the dose-response curve E(R|Exp).

[Figure 1.8 BMDL for the data in Figure 1.7.]

CRITICAL ASSUMPTIONS

The calculation of the BMDL in a real situation, with limited data, appears to depend on the following critical assumptions:

- The dose-response function E(R|Exp) reflects reality.
- The number of parameters in the function used to characterize E(R|Exp) allows for enough, but not too much, flexibility.

The conclusion from the EPI/TOX viewpoint is fairly straightforward: there is very limited toxicological or epidemiological support for these assumptions. This does NOT mean that these assumptions are unreasonable. It reflects the fact that these assumptions are based on conventions and convenience rather than on epidemiological or toxicological science. The way forward from conventions toward scientifically based functions and parameterisations is difficult but not impossible. First, mechanistic understanding of the causal chain from exposure to response helps us develop plausible functions for describing the causal chain. The chain is usually long and complex, and therefore the evidence for rejecting or accepting a particular function is always limited. However, it can be very useful if even one basic question about, for example, linearity/nonlinearity can be resolved using mechanistic arguments. Second, more work could be done using large data sets of dose responses. Different functions and parameterisations can be tested using these data sets to see whether one function performs better than another on average.

Even if we cannot know whether a particular function and parameterization is a good one for a particular chemical, we can apply functions that perform well in general. This approach must include definitions of what we mean by performance. It can be statistical performance such as informativeness or calibration, but it can also be the value of information produced for decision making. Understanding and operationalizing the assessment of performance is a research branch of its own. Resources could be wisely invested in performance research, as the results could shift the whole realm of dose-response assessment in a useful direction.

FRAMBOZADRINE

The data for frambozadrine are reproduced in Figure 1.9. The specific question to be addressed is: "Do we need separate DR relationships for males and females?" Swartout argues that we do not need to separate the responses of male and female rats. However, the question answers itself if we first impose the clairvoyant test as described above. If the target population Pop is "rats," then we must pool the male and female rats. This is not a statistical question, but a question of the study's goals. The ultimate objective is to understand human dose responses. So the question about pooling or not pooling female and male data is only one step in the chain of reasoning. The possible heterogeneity should not be seen as a hindrance to pooling data, but rather as information about the heterogeneity in the human population. Small variation between female and male rats gives us some confidence that this might be the same for humans.

[Figure 1.9 Frambozadrine data with BMR and BMDL.]

[Figure 1.10 Statistical uncertainty for two mice strains, as reflected in the difference between the BMDL and BMD.]

PERSIMONATE: MICE DATA SETS

The comment about the ultimate objective applies to the persimonate mice data sets too. Two strains have been used, and they show slightly different results related to the shape at low doses. The question we should ask is: Given these mice data sets, what more do we learn about the human dose-response for persimonate? For example, the statistical uncertainty about the BMD is very small for the B6C3F1 strain, shown as a small difference between the BMD and BMDL. However, the different picture from the Crj:BDF1 mice reminds us that we should not think that the uncertainty about the human BMD is reflected in the statistical uncertainty of one or even two studies. The mice studies are able to reduce only a small part of the total uncertainty about the human dose-response of persimonate (see Figure 1.10).

CONCLUSIONS

Three conclusions emerge from this discussion:

- Statistical methods should serve to clarify our understanding of physiological reality.
- We should be very concerned if conventions or statistical expediencies start to replace physiological understanding.
- Research on the performance of dose-response assessment should be done more systematically.

COMMENT

REGULATORY/RISK PERSPECTIVE ON CHAPTER 1: A GOOD BASELINE

Weihsueh Chiu
US Environmental Protection Agency

Bootstrapping may be the most straightforward "extension" of what is standard practice under the current guidelines. It basically proposes two steps:

1. Comparison of the pooled-model and summed individual-model deviances, to assess whether pooling is statistically justified.
2. Use of a parametric bootstrap to calculate confidence bounds, for comparison with the profile likelihood method.

From a regulatory point of view, these practices would probably be relatively easily justified. In addition, step 1 is readily implementable with current software, and step 2 does not seem like a substantial additional computational burden. Thus this seems to be a good "baseline approach" with which to compare the others. I have six comments/questions:

1. The analysis with different empirical dose-response models seems to affirm the efforts and reasoning of the EPA Cancer Guidelines to make the dose-response analysis (in the range of observation) relatively model independent. In particular, the empirical models are most robust in the range of observation; as the BMR declines below that range, the uncertainty goes up, particularly with flexible models such as the Weibull.

Disclaimer: These comments are those of the author, and do not necessarily reflect the views or policies of the US Environmental Protection Agency.

2. Moreover, statistical tests are poor at distinguishing between models. This is the justification for separating dose-response analysis in the range of observation from extrapolation to low doses.

3. Interestingly, the fairly common practice of taking the geometric mean of different slope factors (e.g., males and females for frambozadrine) clearly does not yield the same results as calculating a slope factor after pooling the data. This is true whether one uses the BMD or the BMDL. For the BMDLs this is not surprising, since two tail estimates are combined, and so the combination can be expected to be further out in the tail. However, the same is true of the BMD, which is important. Thus it matters which "pooling" is done first: the data or the slope factors.

4. Interestingly, in epidemiological meta-analysis one often tries to pool as many studies as possible, separating them out only if there is heterogeneity (either statistical or due to strong a priori information). In the frambozadrine and persimonate examples, it was concluded that the studies could be combined. It would be interesting to see visually "how much" deviation is necessary to conclude that a study cannot be pooled.

5. In the examples here of combining data sets, the dose-response shapes were similar. It is not clear to me how to test for and implement BMD models in which some components of the dose responses are similar and some are not. For example, what if the background rates were different, but the additive (or relative) treatment effects were the same? Alternatively, what if the background rates were the same, but the treatment effects were different? How would this compare with a nonlinear mixed-effects approach, namely using tests of heterogeneity to determine whether the treatment effect is fixed or needs an additional random component (related to the previous comment, perhaps)?

6. My final question is more technical, regarding the AIC/deviance measures. Clearly, small differences are of no consequence, but large differences are important. However, how does one determine how "small is small"? Does an "uncertainty analysis" on the AIC/deviance need to be performed? How would one do it? While the use of a BMR in the range of the data should minimize the impact of "errors" in model choice due to "variance in the AIC/deviance," the same AIC/deviance used to decide whether or not to pool data could have a greater impact. A characterization of the "frequentist" properties of using the test statistic for "to pool or not to pool" may be useful, namely to identify the type I and type II error rates.

COMMENT

A QUESTION DANGLES

David Bussard
US Environmental Protection Agency

Roger M. Cooke posed a question (after Jeff Swartout's presentation of bootstrapping results): "Did it trouble people to see the observed data points outside the confidence intervals on the model fit to the data?" (See also Cooke's comment.) I think many users of our standard risk values assume that the uncertainty in modeling the data is captured by using the confidence limit on the BMD (the BMDL). While I have heard statisticians say "the confidence interval reflects the sampling error," I don't recall them noting that the confidence interval does not reflect consistency between the model chosen and the data. If there are circumstances where the chosen model does not fit the data well, or one otherwise thinks the confidence limit on the BMD does not capture the uncertainty, I think it is important to explain that in a useful and clear way to the nonstatistician user.

COMMENT

STATISTICAL TEST FOR STATISTICS-AS-USUAL CONFIDENCE BANDS

Roger M. Cooke
Resources for the Future; Department of Mathematics, Delft University of Technology

Jeff Swartout's admirably clear paper helps focus a key issue in what I call the statistics-as-usual (SAU) approach. Two graphs show 90% confidence bands. In the frambozadrine plot (Figure 1.11), three of the six observed percentage responses fall outside their respective 90% confidence bands. For persimonate (Figure 1.12) this is two of the five. In all, 5 of the 11 observations fall outside their 90% bands; 6 fall inside. Calling "success" the event that the observed percentage falls within its 90% confidence band, and assuming the trials are independent with probability 0.9 of success, the probability of seeing 6 or fewer successes in 11 trials is 0.0028. The normal interpretation of these bands is roughly this: if the model were true, then if we repeated the bioassay experiments, 90% of the estimates would fall within these bands. Perhaps it would be good to have an interpretation that is not framed in terms of a conditional statement with a false antecedent. Of course, a conditional with a false antecedent is always true, but it would be helpful if someone could say what these bounds mean without conditionalizing on the truth of the model.
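The 0.0028 figure is the lower tail of a binomial distribution; a one-line check (scipy assumed available):

```python
from scipy import stats

# P(X <= 6) for X ~ Binomial(n=11, p=0.9): the chance that 6 or fewer
# of the 11 observations fall inside their 90% confidence bands, if the
# bands were correct and the trials were independent.
print(stats.binom.cdf(6, n=11, p=0.9))   # ~0.0028
```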

[Figure 1.11 Multistage model fit to the pooled frambozadrine data with 90% bootstrap confidence bounds.]

[Figure 1.12 Multistage model fit to the pooled persimonate data with 90% bootstrap confidence bounds.]

RESPONSE TO COMMENTS

Jeff Swartout

All four reviewers raised issues that are of particular interest to me. Many of them are important in the context of the underlying methodology and are under active investigation.

Allan Marcus questioned whether we should stick to a rigid policy of restricting the shape parameters in the models (essentially precluding supralinearity) or allow the models to achieve full flexibility. A few years ago I was strongly in favor of the latter, but came to appreciate, with experience, the often unrealistic conclusions (excessive population variability and false high risk) one can reach by allowing such fits. This became especially apparent in using pathogen dose-response data, even though supralinearity is not considered an issue there (low-dose linearity is strictly enforced). Noise, likely systematic, appeared to be having an undue influence on the implied population response variance: the curve is flattened and low-dose risk is high. Also, for the Weibull, both the mean and variance of the distribution rise rapidly without bound for low shape parameter values (starting at about 1.5), so restricting the parameter to be greater than 1 is reasonable on that basis. For the Weibull and gamma models, that means the form reduces to an exponential one. Furthermore, although supralinearity can be apparent in high-dose response data, linearity may still be the case at low-dose exposure (depletion of susceptibles, saturation of metabolic activation). Perhaps we could apply a model used for (discrete particle) pathogen dose response, such as the beta-exponential, to (mass-based) chemical dose-response data. Supralinearity would be excluded and BMDLs probably would be more stable, but with some loss of biological relevance.

Jouni Tuomisto commented that, from an epidemiological perspective, if the target population is "rats," then we must pool the data across all subsets that form the "rat" population, and that this is not a matter of statistical plausibility. I have found, with other data sets, that pooling the data sometimes gives dramatically higher risks at low doses (i.e., in the low-response region) than treating the components as separate populations. I think the problem in these cases is that our standard dose-response models cannot handle well a mixture of two or more distinct subpopulations if they are truly different, as males and females frequently are. Pooling the data violates the assumption of a continuous underlying distribution. Pooling the data prior to fitting in these cases results in the same problem as with supralinearity: the curve gets stretched out and the low-dose risk is high, perhaps artificially. I have addressed this problem by combining the output (usually by resampling methods) after fitting the data sets individually. I am, however, continuing to investigate the phenomenon, primarily in the context of pathogen dose-response data.

Weihsueh Chiu asked about evaluating AIC differences in a more objective fashion. There is in fact a method that has very intuitive properties, which is the evidence ratio. It gives the relative probability that one model or the other is favored, based on the AIC. For example, an AIC difference of 1 favors the model with the lower AIC only by about a factor of 1.6, while an AIC difference of 4 is a relative factor of about 7.
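The evidence ratio for an AIC difference of Delta is exp(Delta/2), which reproduces the factors quoted above; a minimal sketch:

```python
import math

def evidence_ratio(delta_aic):
    """Relative likelihood of the lower-AIC model over its competitor."""
    return math.exp(delta_aic / 2.0)

print(evidence_ratio(1))   # ~1.6
print(evidence_ratio(4))   # ~7.4
```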

COMMENTS ON OTHER ISSUES RAISED DURING THE WORKSHOP

Dave Bussard pointed out that many of the observations for some data sets are outside the bootstrap confidence bounds and asked what that might mean. This is a common phenomenon in dose-response modeling. It could be an indication of extra-binomial variability in the response data, dosing errors (which are not considered in the models), improper model specification, a fundamental limitation of the parametric bootstrap procedure, or just something that is expected with small samples. I will leave that for the theoretical statisticians to resolve. The Benchmark Dose guidance says to reject the fit if any of the standardized residuals are greater than 2, which is not the case for any of the data sets in question. For the purposes of risk assessment, the overall fit of the model and, perhaps, the fit in the range of the BMR are the most important criteria.

Dave Bussard also questioned why the fitted responses, rather than the observed responses, were bootstrapped, and whether the data might then fall within the confidence bounds. The point of fitting a model is that we don't believe the data absolutely. The observed proportions are random realizations of an underlying true response probability, which we model. Furthermore, the observed responses are often nonmonotone, while we assume monotonicity in our dose-response models. A parametric bootstrap is one method for enforcing monotonicity. Bootstrapping the fitted response probabilities simulates repeating the experiment (many times) with the original "true" population. As a follow-up to this issue, I did a preliminary analysis and found that bootstrapping the observations really did not affect the bootstrap bounds much unless the model fit was poor.

2

UNCERTAINTY QUANTIFICATION FOR DOSE-RESPONSE MODELS USING PROBABILISTIC INVERSION WITH ISOTONIC REGRESSION: BENCH TEST RESULTS

Roger M. Cooke
Resources for the Future; Department of Mathematics, Delft University of Technology

2.1 INTRODUCTION

This chapter describes an approach for quantifying uncertainty in the dose-response (DR) relationship[1] to support health risk analyses. I demonstrate uncertainty quantification for bioassay data using the mathematical technique of probabilistic inversion (PI) with isotonic regression (IR). The basic PI technique applied in this chapter was developed in a series of European uncertainty analyses. (Illustrative studies from the EU-USNRC uncertainty analysis of accident consequence models are available online at http://cordis.europa.eu/fp5-euratom/src/lib_docs.htm; the main report is ftp://ftp.cordis.europa.eu/pub/fp5-euratom/docs/eur18826_en.pdf. See also Kraan and Cooke 2000a.) Previous applications concerned atmospheric dispersion, environmental transport, and transport of radionuclides in the body, and were based on structured expert judgment. I transfer these techniques to bioassay data. My focus in this evaluation is on mathematical techniques; the experimental data are not analyzed from a toxicological viewpoint. The analyses should be seen as illustrative rather than definitive. Background on isotonic regression (IR), iterative proportional fitting (IPF), and probabilistic inversion (PI), aimed at the uninitiated, is provided.

In this chapter, observational uncertainty is discussed first, followed by a brief introduction to PI, IPF, and IR. Four cases of simple data are then analyzed to demonstrate how uncertainty in the DR modeling could be done in such simple cases. In reality, many more factors will contribute to the uncertainty, but it is important to first get such simple cases right before attacking more difficult issues. For two cases, the Benchmark Dose (BMD) example and the example chemical nectorine, standard dose-response (DR) models proved suitable. For two other cases, the example chemicals frambozadrine and persimonate, the data suggest a threshold model and a barrier model, respectively, for recovering observational uncertainty. Appendix A provides some results with alternative models and alternative optimization strategies.

[1] Different relations may be seen as instantiations of a general functional form, for example a Taylor expansion of log(dose). Therefore the distinction between "parameter" and "model" uncertainty is more semantic than real, and is not maintained in this chapter.

2.2 TENT POLES

Tent poles for this approach are as follows:

1. Observational Uncertainty. There must be some antecedent uncertainty over observable phenomena that we try to capture via distributions on parameters of DR models. This is the target we are trying to hit. When the target is defined, we can discuss whether this is the right target and whether we got close enough. Without such a target, the debate over how best to quantify uncertainty in DR models is underconstrained.

2. Integrated Uncertainty Analysis. The goal is to support integrated uncertainty analysis, in which potentially large numbers of individuals receive different doses. For this to be feasible, the uncertainty in DR must be captured via a joint distribution over parameters that does not depend on dose.

3. Monotonicity. Barring toxicological insights to the contrary, we will assume that the probability of response cannot decrease as dose increases. It is not uncommon that data at increasing doses show a decreasing response rate, simply as a result of statistical fluctuations. A key feature of this approach is to remove this source of noise with techniques from isotonic regression.

2.3 OBSERVATIONAL UNCERTAINTY

Suppose we give 49 mice a dose of 21 [units] of some toxic substance. We observe a response in 15 of the 49 mice.

[Figure 2.1 Binomial and Bayes noninformative predictive distributions.]

Now suppose we randomly select 49 new mice from the same population and give them dose 21. What is our uncertainty in the number of mice responding in this new experiment? Without making some assumptions, there is no "right" answer to this question; the customary answer would be: "our uncertainty in the number of responses in the second experiment is binomially distributed with parameters (49, 15/49)." In other words, there is a probability of 0.898 of seeing between 9 and 19 responses in the second experiment. A Bayesian with no prior knowledge (all probabilities are equally likely) would update a uniform [0, 1] prior for the probability of response to obtain a Beta(16, 35) posterior[2] for the probability p of response, with expectation 16/51 = 0.314 and variance 0.0038. His "predictive distribution" on the number of responses is obtained as a p ~ Beta(16, 35) mixture of binomial distributions Bin(49, p). He would have approximately a 76% chance of seeing the result of the second experiment between 9 and 19. The two distributions are shown in Figure 2.1.

Since (i) these two distributions tend to be close, relative to the other differences encountered further on, (ii) the distribution Bin(49, 15/49) is much easier to work with, and (iii) the assumption that before the first experiment all probability values are equally likely, regardless of dose, is rather implausible, we will take the binomial distribution to represent the observable uncertainty in this simple case.
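A quick check of the two probabilities quoted above (scipy assumed available; its beta-binomial distribution implements exactly this Beta mixture of binomials):

```python
from scipy import stats

n, k = 49, 15

# Binomial uncertainty: plug in the observed response rate 15/49.
binom = stats.binom(n, k / n)
print(binom.cdf(19) - binom.cdf(8))         # ~0.90

# Bayes predictive: a Beta(16, 35) mixture of Bin(49, p) is the
# beta-binomial distribution with parameters (n=49, a=16, b=35).
betabin = stats.betabinom(n, k + 1, n - k + 1)
print(betabin.cdf(19) - betabin.cdf(8))     # ~0.76
```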

2.3.1 Binomial Uncertainty

In the example from the Benchmark Dose Technical Guidance (2000), shown in Table 2.1, the different doses are far apart, and the respective binomial distributions have little overlap.

[2] To recall, the Beta(16, 35) density is p^15 (1 - p)^34 / B(16, 35), where B(16, 35) = 15! x 34!/50!.

TABLE 2.1 BMD Technical Guidance example

Dose    Number of Subjects    Number of Responses
0       50                    1
21      49                    15
60      45                    20

TABLE 2.2 Combined male-female data for frambozadrine

Dose                       0       1.2     1.8     15      21      82      109
Number exposed             95      45      49      44      47      47      48
Number of responses        5       6       5       4       3       24      33
Percentage of responses    5.26    13.33   10.20   9.09    6.38    51.06   68.75

In such cases, we interpret the observable uncertainty as "binomial uncertainty." For the example above, the observable uncertainty is given by three binomial distributions: Bin(50, 1/50), Bin(49, 15/49), and Bin(45, 20/45). If we take this to represent our uncertainty about the results of repetitions of the three experiments, then our uncertainty modeling should recapture, or recover, this uncertainty.

2.3.2 Isotonic Uncertainty[3]

It may happen that the doses in two experiments are close together, and that the observed percentage response for the lower dose is actually higher than the observed percentage response at the higher dose. Table 2.2 gives an example from the combined frambozadrine data: the percentage response at dose 21 is lower than that at doses 1.2, 1.8, and 15. According to tent pole 3, we know that this decreasing relation is merely due to noise and does not reflect the relation between dose and response. We should therefore adapt our observational uncertainty to reflect the fact that the probability of response is a nondecreasing function of dose. The method for doing this is called isotonic regression, and the maximum likelihood probabilities are found by the pooling adjacent violators (PAV) algorithm (Barlow et al., 1972). We illustrate its use for the data above. Suppose that we draw five samples from the binomial distributions based on the data in Table 2.2, for the lower five doses. For each sample value we estimate the binomial probability (with 3 responses in 95 trials, the estimated probability is 3/95 = 0.0316). The results are shown in Table 2.3.

[3] I am grateful to Eric Cator of TU Delft for suggesting this approach.

TABLE 2.3 Binomial samples for frambozadrine

Dose                   0        1.2      1.8      15       21
Number exposed         95       45       49       44       47
Number of responses    5        6        5        4        3

Samples (numbers of responses, with estimated binomial probabilities):
Sample 1    5         5        6        5        6
            0.0526    0.1111   0.1224   0.1136   0.1277
Sample 2    3         4        5        4        3
            0.0316    0.0889   0.102    0.0909   0.0638
Sample 3    5         7        5        3        3
            0.0526    0.1556   0.102    0.0682   0.0638
Sample 4    5         4        4        4        2
            0.0526    0.0889   0.0816   0.0909   0.0426
Sample 5    6         4        7        6        3
            0.0632    0.0889   0.1429   0.1364   0.0638

TABLE 2.4 PAV algorithm applied to the third sample row of Table 2.3

Dose                    0        1.2      1.8      15       21
Starting probability    0.0526   0.1556   0.102    0.0682   0.0638
i = 2 (PAV)             0.0526   0.1288   0.1288   0.0682   0.0638
i = 3 (PAV)             0.0526   0.1086   0.1086   0.1086   0.0638
i = 4 (PAV)             0.0526   0.0974   0.0974   0.0974   0.0974

For each row of probabilities, the PAV algorithm changes the probabilities to the most likely nondecreasing probabilities in the following way. Move through the row of estimated binomial probabilities in Table 2.3 from left to right. At cell i, if its right neighbor is smaller, average these two numbers and replace cells i and i + 1 with this average. Now move from i to the left; if you find a cell j that is bigger than the current cell i, average all cells from j to i + 1. Table 2.4 shows the successive PAV adjustments to the binomial probabilities in the third row of Table 2.3.
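A minimal sketch of this PAV step (unweighted averaging, as in the description above):

```python
def pav(probs):
    """Pool adjacent violators: return the most likely nondecreasing
    sequence, merging violating neighbors into their average."""
    blocks = []  # each block: [block mean, number of pooled cells]
    for p in probs:
        blocks.append([p, 1])
        # While the newest block is below its left neighbor, merge them.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, n2 = blocks.pop()
            v1, n1 = blocks.pop()
            blocks.append([(v1 * n1 + v2 * n2) / (n1 + n2), n1 + n2])
    return [v for v, n in blocks for _ in range(n)]

# Third sample row of Table 2.3:
print(pav([0.0526, 0.1556, 0.102, 0.0682, 0.0638]))
# -> [0.0526, 0.0974, 0.0974, 0.0974, 0.0974], as in Table 2.4
```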

2.4 STATISTICS AS USUAL (BMD)

Usual statistical methods aim at estimating the parameters of a best-fitting model, where goodness of fit is judged with a criterion like the Akaike information criterion (AIC). A joint distribution on the parameters reflects the sampling fluctuations of the maximum likelihood estimates (MLEs), assuming that the model is true. Suppose that we sample parameter values from their joint distribution (assume this distribution to be joint normal, with means and covariance obtained in the usual way). For each sample vector of parameters, we sample binomial distributions[4] with probabilities computed from the sampled parameters, with doses and numbers of subjects as given in the bioassay data (e.g., Table 2.1). Repeating this procedure, we obtain a mixture of binomial distributions for each bioassay experiment. The statistics-as-usual (SAU) approach would regard these mixture binomial distributions as representing our uncertainty on the numbers of responses in each experiment. We take the best-fitting model, assume it to be true, consider the distribution of the maximum likelihood parameter estimates we would obtain if the experiments were repeated many times, and plug these into the appropriate binomial distributions. In the sequel, the distributions obtained in this way are called BMD distributions, as this approach is compatible with the Benchmark Dose Technical Guidance Document (2000); the best-fitting models and their MLE parameter distributions are obtained with the BMD software.

Do these mixture binomial distributions bear any resemblance to the binomial uncertainty distributions? We illustrate with the BMD data from the bench test. The preferred model in this case is the log-logistic model with beta >= 1:

Prob(dose) = gamma + (1 - gamma) / (1 + e^(-alpha - beta ln(dose))).

Figure 2.2 shows the BMD cumulative distributions on the number of responses for the three dose levels (0, 21, 60), with numbers of experimental subjects (50, 49, 45). bd* denotes the binomial uncertainty and id* the isotonic uncertainty at dose level "*" (these are always step functions). Since the doses are widely spread, bd and id are nearly identical. The BMD uncertainty for 22 or fewer responses in the experiment at dose 60 is about 0.45, whereas the binomial and isotonic uncertainties are about 0.79. Apparently the SAU approach does not recover the binomial or isotonic observational uncertainty.

[Figure 2.2 Binomial (bd), isotonic (id), and BMD uncertainty distributions. On the horizontal axis is the number of animals responding; on the vertical axis, cumulative probability.]

The BMD distributions are mixtures of binomials, and so are the Bayesian predictive distributions. It may therefore be possible for a Bayesian to make a teleological choice of prior distributions that, after updating, would agree with the BMD distributions. Of course, this choice would depend on the dose (contra tent pole 2) and on the preferred model. The practice of retro-choosing one's prior is not unknown in Bayesian statistics, but is frowned upon. Indeed, it comes down to defining the target distribution as whatever it was we hit. We will see that these BMD distributions can differ greatly from the isotonic distributions when the doses are close together. The isotonic distributions are not generally binomial.

The SAU approach is not wrong. It is simply solving a different problem, namely estimating parameters in a preferred model. That is not the same as quantifying observational uncertainty; the latter is the goal of uncertainty analysis, according to tent pole 1. Overall, the best-fitting model was the logistic, and we will concentrate on BMD results with this model. Alternatives for frambozadrine and nectorine are presented in the appendix.

[4] Where possible, we used the normal approximation to the binomial.
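The SAU sampling procedure is easy to sketch. In the following minimal Python illustration, the MLE means and covariance for (alpha, beta) are illustrative placeholders, not fitted values; in practice they come from the BMDS fit:

```python
import numpy as np

rng = np.random.default_rng(1)

def loglogistic(dose, gamma, alpha, beta):
    """Log-logistic response probability with background gamma."""
    return gamma + (1 - gamma) / (1 + np.exp(-alpha - beta * np.log(dose)))

# Illustrative MLE means and covariance for (alpha, beta).
mean = np.array([-5.0, 1.4])
cov = np.array([[0.40, -0.10],
                [-0.10, 0.05]])

n_subjects, dose, gamma = 49, 21.0, 0.02
params = rng.multivariate_normal(mean, cov, size=5000)
probs = np.clip(loglogistic(dose, gamma, params[:, 0], params[:, 1]), 0, 1)

# One binomial draw per sampled parameter vector: a mixture of binomials.
responses = rng.binomial(n_subjects, probs)
```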

2.5 PROBABILISTIC INVERSION

We know what it means to invert a function at a point or a point set. Probabilistic inversion denotes the operation of inverting a function at a distribution or a set of distributions. Given a distribution on the range of a function, we seek a distribution over the domain such that pushing this distribution through the function returns the target distribution on the range (see Kraan and Bedford, 2004). In dose-response uncertainty quantification, we seek a distribution over the parameters of a dose-response model which, when pushed through the DR model, recovers the observational uncertainty. The conclusion from Figure 2.2 is that the BMD distributions are not the inverse of the observational distributions, for the log-logistic model with beta >= 1.

2.5.1 Iterative Proportional Fitting

Applicable tools for PI derive from the iterative proportional fitting (IPF) algorithm (Kruithof, 1937; Deming and Stephan, 1940). A brief description is given below (see, e.g., Du et al., 2006; Kurowicka and Cooke, 2006). In the present context we start with a DR model and with a large sample distribution over the model's parameters. This distribution should be wide enough that its push-forward distribution covers the support of the observable uncertainty distributions.[5] If there are N samples of parameter vectors, each vector has probability 1/N in the sample distribution. We then re-weight the samples such that if we re-sample the N parameter vectors with these weights, we recover the observational distributions. In practice, the observational distributions are specified by a number of quantiles or percentiles. In all the cases reported here, three quantiles of the observational distributions are specified, as close as possible to (5%, 50%, 95%). Technically we are thus inverting the DR model at a set of distributions, namely those satisfying the specified quantiles. Of course, specifying more quantiles would yield better fits in most cases, at the expense of larger sample distributions and longer run times.

The method for finding the weights for weighted resampling is IPF. IPF starts with a distribution over cells in a contingency table and finds the maximum likelihood estimate of the distribution satisfying a number of marginal constraints. Equivalently, it finds the distribution satisfying the constraints which is minimally informative relative to the starting distribution. It does this by iteratively adjusting the joint distribution. The procedure is best explained with a simple example. Table 2.5a shows a starting distribution in a 3 x 3 contingency table. The "constraints" are marginal probabilities which the joint distribution should satisfy. The "results" are the actual marginal distributions. Thus 0.011 + 0.022 + 0.032 = 0.065, and we want the sum of the probabilities in these cells to equal 0.5. In this example, the constraints (0.5, 0.3, 0.2) would correspond to the 50th and 80th percentiles: there is a 50% chance of falling beneath the median, a 30% chance of falling between the median and the 80th percentile, and a 20% chance of exceeding the 80th percentile. In the first step (Table 2.5b), each row is multiplied by a constant so that the row constraints are satisfied. In the second step (Table 2.5c), the columns of Table 2.5b are multiplied by a constant such that the column constraints are satisfied; the row constraints are now violated. The next step (not shown) would resatisfy the row constraints. After 200 such steps we find the distribution in Table 2.5d. Each iteration has the same zero cells as the starting distribution, but the limiting distribution may have more zeros.

To apply this algorithm to the problem at hand, we convert the N samples of parameter vectors and the specified quantiles of the observational distributions into a large contingency table. With 3 specified quantiles, there are 4 interquantile intervals for each observational distribution. If there are 3 doses, there are 4^3 = 64 cells in the contingency table. Let NRi be the number of animals exposed to dose di, and let P(alpha, beta, gamma, ..., di) be the probability of response if the parameter values are (alpha, beta, gamma, ...) and the dose is di.

[5] In the studies reported here, we begin with a wide uniform or loguniform distribution on the parameters, and conditionalize this on a small box containing the values of the observational uncertainty distributions. Each observational uncertainty distribution is based on 500 samples; in other words, on tables like Table 2.3, with 500 sample rows.

TABLE 2.5A Starting distribution for IPF

Cells                     Row result   Row constraint
0.011   0.022   0.032     0.065        0.500
0.054   0.043   0.000     0.097        0.300
0.000   0.000   0.838     0.838        0.200

Column constraint:   0.100   0.200   0.700
Column result:       0.065   0.065   0.870

TABLE 2.5B First iteration, row projection

Cells                     Row result   Row constraint
0.083   0.167   0.250     0.500        0.500
0.167   0.133   0.000     0.300        0.300
0.000   0.000   0.200     0.200        0.200

Column constraint:   0.100   0.200   0.700
Column result:       0.250   0.300   0.450

TABLE 2.5C Second iteration, column projection

Cells                     Row result   Row constraint
0.033   0.111   0.389     0.533        0.500
0.067   0.089   0.000     0.156        0.300
0.000   0.000   0.311     0.311        0.200

Column constraint:   0.100   0.200   0.700
Column result:       0.100   0.200   0.700

TABLE 2.5D Result after 200 iterations

Cells                     Row result   Row constraint
0.000   0.002   0.499     0.501        0.500
0.100   0.198   0.000     0.298        0.300
0.000   0.000   0.201     0.201        0.200

Column constraint:   0.100   0.200   0.700
Column result:       0.100   0.200   0.700
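A compact numerical sketch of the iteration that carries Table 2.5a to Table 2.5d (numpy assumed available):

```python
import numpy as np

# Starting distribution (Table 2.5a) and the target margins.
P = np.array([[0.011, 0.022, 0.032],
              [0.054, 0.043, 0.000],
              [0.000, 0.000, 0.838]])
row_c = np.array([0.5, 0.3, 0.2])
col_c = np.array([0.1, 0.2, 0.7])

for _ in range(200):
    P *= (row_c / P.sum(axis=1))[:, None]   # row projection (Table 2.5b)
    P *= col_c / P.sum(axis=0)              # column projection (Table 2.5c)

print(P.round(3))   # reproduces Table 2.5d
```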

[Figure 2.3 PI distributions for the BMD test, with log-logistic model and β = 1 (orange) and β > 0 (black). The number of responses is on the horizontal axis, and the cumulative probability on the vertical. Shown at doses 0, 21, and 60: bd*, id*, bmd*, id*pi, and id*pi_b=1 distributions.]

The initial probability in each cell is proportional to the number of sample parameter vectors (α, β, γ, …) for which NRi × P(α, β, γ, … , di), i = 1, 2, 3, lands in that cell. IPF is now applied, and the final cell probabilities are proportional to the weights for the sample vectors in that cell.

Csiszar (1975) finally proved [6] that if there is a distribution satisfying the constraints whose zero cells include the zero cells of the starting measure, then IPF converges, and it converges to the minimal information (or maximum likelihood) distribution, relative to the starting measure, which satisfies the constraints. If the starting distribution is not feasible in this sense, then IPF does not converge (Csiszar and Tusnady, 1984). Much work has been devoted to variations of the IPF algorithm with better convergence behavior. An early attempt, termed PARFUM (PARameter Fitting for Uncertain Models; Cooke, 1994), involved simultaneously projecting onto rows and columns and averaging the result. It has recently been shown (Matus, 2007; Vomlel, 1999) that this and similar adaptations always converge, and that PARFUM converges to a solution if the problem is feasible (Du et al., 2006). In fact, geometric averaging seems to have some advantages over arithmetic averaging, but PARFUM is used in the case of infeasibility. In the infeasible case the PARFUM limit is largely, though not completely, determined by the support of the starting measure.

The best-fitting distribution in the sense of AIC is not always the best distribution to use for probabilistic inversion.

[6] There had been many incorrect proofs, and proofs of special cases; see Fienberg (1970), Kullback (1968, 1971), Haberman (1974, 1984), and Ireland and Kullback (1968). More general results involving the "Bregman distance" were published earlier in Russian (Bregman, 1967); for a review see Censor and Lent (1981). A simple sketch of Csiszar's proof is in Csiszar (undated).
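Putting the construction together, the weighting step can be sketched end to end: bin each parameter sample's predicted response counts by the observational quantiles, run IPF on the resulting cell masses, and spread each cell's final mass over its samples as resampling weights. This is a schematic sketch only; the push-forward values and quantile numbers below are placeholders, not the study's data.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

N, n_dose = 5000, 3
# Placeholder push-forward values NR_i * P(params, d_i) for N parameter samples.
pred = np.sort(rng.gamma(2.0, 3.0, (N, n_dose)), axis=1)

# Placeholder (5%, 50%, 95%) quantiles of the observational distributions.
obs_q = np.array([[1.0, 3.0, 6.0],
                  [3.0, 7.0, 12.0],
                  [8.0, 14.0, 22.0]])

# 4 interquantile intervals per dose -> 4^3 = 64 cells.
bins = np.stack([np.searchsorted(obs_q[j], pred[:, j]) for j in range(n_dose)], axis=1)
cell_of = [tuple(b) for b in bins]

cells = dict.fromkeys(product(range(4), repeat=n_dose), 0.0)
for c in cell_of:
    cells[c] += 1.0 / N                      # starting measure from the samples

target = np.array([0.05, 0.45, 0.45, 0.05])  # interval masses implied by (5, 50, 95)%

for _ in range(200):                          # IPF over the three dose margins
    for j in range(n_dose):
        marg = np.zeros(4)
        for c, m in cells.items():
            marg[c[j]] += m
        for c in cells:
            if marg[c[j]] > 0:
                cells[c] *= target[c[j]] / marg[c[j]]

# Resampling weight of each sample: its cell's final mass shared within the cell.
counts = {}
for c in cell_of:
    counts[c] = counts.get(c, 0) + 1
w = np.array([cells[c] / counts[c] for c in cell_of])
w /= w.sum()   # resample the parameter vectors with these weights
```

If some required cell contains no samples (a zero in the starting measure), the constraints may be infeasible and IPF will cycle rather than converge; this is where the PARFUM variant discussed above comes in.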


Figure 2.3 illustrates this for the BMD bioassay data. At each dose level *, it shows the binomial observational uncertainty (bd*), the isotonic observational uncertainty (id*), the BMD distributions from Figure 2.2, and also the result of PI on the log-logistic model with β = 1 (corresponding to the best-fitting model) and PI on the log-logistic model with β > 0. We can see that for PI in the case β = 1, the CDFs fit at the 5th, 50th, and 95th percentiles but not over the rest of the distribution ranges. For the figures throughout this report, binomial distributions (bd*) are black step functions, and isotonic distributions (id*) are gray step functions with smaller steps. Other distributions are indicated with pointers. (Note that while parameters are usually given in Greek letters, the graphics software used does not support this; therefore common letters are used, e.g., the background response parameter γ is rendered as g, and the intercept parameter α is rendered as a.)

2.5.2 Benchmark Dose

Computing the Benchmark Dose for the extra risk model is easy in this case. We solve the equation

BMR = (Prob(BMD) − Prob(0)) / (1 − Prob(0))

for the values BMR = 0.1, 0.05, 0.01. Prob(BMD) is the probability of response at dose BMD, and it depends on the parameters of the DR model, in this case the log-logistic model. Prob(0) is the background probability, and is taken to be the observed percentage response at zero dose. In this case the log-logistic model can be inverted, and we may write

BMR = (1 + exp(−α − β ln(BMD)))^(−1).

Sampling values from (α, β) after PI, we arrive at the values shown in Table 2.6.

TABLE 2.6  BMD Technical Guidance: Benchmark Dose calculations

BMR     Mean        Variance    5% Perc     50% Perc    95% Perc    BMD50/BMDL
0.1     3.12E+00    4.52E+00    8.49E−01    2.53E+00    7.54E+00    2.98E+00
0.05    9.31E−01    4.85E−01    2.10E−01    7.19E−01    2.38E+00    3.42E+00
0.01    6.70E−02    3.78E−03    9.34E−03    4.64E−02    1.91E−01    4.97E+00

2.6 FRAMBOZADRINE

2.6.1 Threshold Model

The data for the frambozadrine case are reproduced below as Table 2.7. We see strong nonmonotonicities for male and female rats: a higher percentage of male rats respond at dose 1.2 than at dose 15, and similarly for female rats at doses 1.8 and 21 mg/kg-day. The logistic model fit best according to the AIC. For the male and female tests, analyzed individually, a standard multistage model with linear parameters gave the best results for PI among the standard models (adding quadratic and cubic terms did not improve the results). This model is

NR ∗ (gamma + (1 − gamma) ∗ (1 − exp(−b1 ∗ dose))).

The results for these two cases are shown in Figure 2.4. Note the differences between the binomial observational uncertainties (bd) and isotonic observational uncertainties (id) for doses 1.2 and 15. The PI distributions are fit to the id distributions, and are closer than the BMD distributions, but the fit is bad for the lower doses. A similar picture emerges for female rats, shown in Figure 2.5.

If dose really does scale as mg/kg-day, and barring countervailing toxicological evidence, we should be able to pool these data. However, the problems of fitting a multistage model, or any other model in the BMD library, then become more severe. The pooled data seem to indicate a response plateau: beneath a certain threshold s, the background probability g applies; above another threshold t, the dose follows a multistage model. The threshold model is

NR ∗ (g + (1 − g) ∗ i1{s, dose, ∞} ∗ (1 − exp(−b1 ∗ max{t, dose} − b2 ∗ max{t, dose}² − b3 ∗ max{t, dose}³)))

TABLE 2.7  Frambozadrine data

          Dose (mg/kg-day)    Total Rats    Hyperkeratosis
Male        0                  47             2
            1.2                45             6
            15                 44             4
            82                 47            24
Female      0                  48             3
            1.8                49             5
            21                 47             3
            109                48            33

[Figure 2.4 Fitting results for frambozadrine data for males: binomial (bd*), isotonic (id*), PI (d*pi), and BMD (bmd*) distributions at doses 0, 1.2, 15, and 82.]

Here i1{X, Y, Z} is a random indicator function, which returns 1 if X ≤ Y ≤ Z and 0 otherwise. The starting distributions for the parameters in this model were

g ∼ uniform[0.005, 0.1]
s ∼ uniform[0, 21]
t ∼ (21 − s)
b1 ∼ uniform[0, 0.01]
b2 ∼ uniform[0, 0.0001]
b3 ∼ loguniform[10^−13, 5 × 10^−6]

These distributions were conditionalized on values falling within a box around the observable uncertainty distributions.
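As a concreteness check, the threshold model and its starting distributions can be coded directly. This is a minimal sketch under stated assumptions: in particular, the reading of "t ∼ (21 − s)" as a uniform draw on [s, 21] is our guess, not given in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def threshold_model(dose, g, s, t, b1, b2, b3):
    # Background g below threshold s; above s, a multistage term in which
    # doses below t are treated as if they were t (response plateau).
    d = np.maximum(t, dose)
    active = np.where(dose >= s, 1.0, 0.0)   # indicator i1{s, dose, inf}
    return g + (1 - g) * active * (1 - np.exp(-b1*d - b2*d**2 - b3*d**3))

# Draws from the starting distributions listed above.
N = 10_000
g  = rng.uniform(0.005, 0.1, N)
s  = rng.uniform(0, 21, N)
t  = s + (21 - s) * rng.uniform(size=N)      # assumption: t uniform on [s, 21]
b1 = rng.uniform(0, 0.01, N)
b2 = rng.uniform(0, 1e-4, N)
b3 = np.exp(rng.uniform(np.log(1e-13), np.log(5e-6), N))  # loguniform

p82 = threshold_model(82.0, g, s, t, b1, b2, b3)  # response uncertainty at dose 82
```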

[Figure 2.5 Fitting results for frambozadrine data for females: bd, id, PI, and BMD distributions at doses 0, 1.8, 21, and 109.]

The results shown in Figure 2.6 indicate a decent fit. Note the strong differences between the bd and id distributions, and note that the BMD distributions are close to neither, except for the higher dose levels. The parameter distributions after PI are shown in Figure 2.7. The lower threshold s concentrates on values near 0, while the higher threshold t concentrates on values near 21. The distributions of b1 and b2 are strongly affected by the PI; the distribution of b3 is less affected, indicating that this parameter is less important to the fitting. Figure 2.8 shows the parameter distributions before PI. The thresholds s and t are most affected by the fitting; b3 is changed very little. The joint distribution of all parameters in the threshold model, after PI, is shown in Figure 2.9 as a cobweb plot. In each sample there is a value for each of the parameters; a jagged line connecting these values thus represents one sample. There are 172 such samples in the picture in Figure 2.9.

[Figure 2.6 Results for fitting frambozadrine, pooled data: bd, id, PI, and BMD distributions at doses 0, 1.2, 1.8, 15, 21, 82, and 109.]

We see strong negative correlations between b1, b2, and b3, as well as the concentration of t toward values near 21. Once we have a joint distribution over the model parameters, we can easily compute the (smoothed) uncertainty distributions for probabilities at arbitrary doses; see Figure 2.10. The distributions tend to get wider as dose increases up to 100. These results show that a good fit to isotonic observational uncertainty is possible with a threshold model. The same may hold for other, more exotic models. Of course, the toxicological plausibility of such models must be judged by toxicologists, not by mathematicians.

[Figure 2.7 Parameter densities for frambozadrine MF after PI.]

2.6.2 Benchmark Dose

Computing the Benchmark Dose was complicated in this case by the fact that, unlike in the previous example, the DR model could not be analytically inverted. We therefore compute, for any value of the parameters, the dose d* that satisfies

BMR = (Prob(d*) − Prob(0)) / (1 − Prob(0)),

where again Prob(0) is taken to be the observed background rate, 0.0526. Consider the cumulative distribution function of d*. If a number dL is the 5th percentile of this distribution, this means that with probability 0.95,

[Figure 2.8 Parameter densities for frambozadrine MF, before PI.]

Prob(dL) < BMR × (1 − Prob(0)) + Prob(0).

Hence, to find the 5% value of d*, we must find the value dL at which BMR × (1 − Prob(0)) + Prob(0) is the 95th percentile of the response probability. This is the BMDL, and we compare it to the value, denoted BMD50, at which BMR × (1 − Prob(0)) + Prob(0) is the median. The results are shown in Table 2.8. Values for BMR = 0.01 were not obtainable, as the 95th percentile of the background parameter g was already larger than 0.01 × (1 − 0.0526) + 0.0526 = 0.062.
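Numerically, the BMDL can be found by root-finding on the 95th percentile of the push-forward response probability. The sketch below assumes the threshold model with a stand-in parameter sample as in the earlier sketch (the real calculation uses the PI-reweighted sample). Note that for non-smooth models the percentile curve can be discontinuous, which is exactly the instability reported for persimonate in Section 2.8.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
N = 5000

# Stand-in sample from the starting distributions (see earlier sketch).
g  = rng.uniform(0.005, 0.1, N)
s  = rng.uniform(0, 21, N)
t  = s + (21 - s) * rng.uniform(size=N)      # assumed reading of t ~ (21 - s)
b1 = rng.uniform(0, 0.01, N)
b2 = rng.uniform(0, 1e-4, N)
b3 = np.exp(rng.uniform(np.log(1e-13), np.log(5e-6), N))

def q95_response(dose):
    d = np.maximum(t, dose)
    p = g + (1 - g) * (dose >= s) * (1 - np.exp(-b1*d - b2*d**2 - b3*d**3))
    return np.percentile(p, 95)

P0, BMR = 0.0526, 0.1
target = BMR * (1 - P0) + P0

# BMDL: the dose whose 95th percentile response probability equals the target.
bmdl = brentq(lambda dose: q95_response(dose) - target, 1e-6, 109.0)
```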


[Figure 2.9 Cobweb plot for parameters for frambozadrine, male and female data: 172 selected samples on the axes g, b1, b2, b3, s, and t.]

[Figure 2.10 Uncertainty distributions for probability of response as a function of dose; p* is the probability of response at dose *.]


TABLE 2.8  Frambozadrine: Benchmark Dose calculations

BMR     BMDL        BMD50       BMD50/BMDL
0.1     1.68E+01    2.35E+01    1.40E+00
0.05    1.30E+00    1.53E+01    1.18E+01
0.01    —           —           —

TABLE 2.9  Nectorine data

                                              Concentration (ppm)
Lesion                                        0       10      30      60
Respiratory epithelial adenoma in rats        0/49    6/49    8/48    15/48
Olfactory epithelial neuroblastoma in rats    0/49    0/49    4/48    3/48

Entries are # response / # in trial.

2.7 NECTORINE

2.7.1 Sample Results

The data for the nectorine case are shown in Table 2.9. As we are unable at present to analyze zero-response data, the data do not support a separate analysis of the neuroblastoma lesions, so we consider only pooling. The text accompanying these data indicates that these two lesions occur independently; therefore the probability of showing either lesion is given by

P(a or n) = Pa + Pn − Pa Pn,

where a and n denote adenoma and neuroblastoma, respectively. These two endpoints are taken to be distinct (with summation of risks from multiple tumor sites, when tumor formation occurs independently in different organs or cell types, considered superior to the calculation of risk from individual tumor sites alone). Under these assumptions, the bioassay for either lesion is shown in Table 2.10.

TABLE 2.10  Combined nectorine data

Dose                 10    30    60
Number of rats       49    48    48
Number responding     6    12    18

The model used for PI was the multistage model with two parameters:

NR ∗ (g + (1 − g) ∗ (1 − exp(−b1 ∗ dose))).
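The independence combination can be checked in a few lines. A small sketch, assuming the per-dose counts of Table 2.9; because the joint term Pa·Pn is small at these response rates, the combined counts of Table 2.10 are close to the simple sums of responders:

```python
adenoma       = {10: (6, 49), 30: (8, 48), 60: (15, 48)}   # (responders, on trial)
neuroblastoma = {10: (0, 49), 30: (4, 48), 60: (3, 48)}

for dose in (10, 30, 60):
    ka, na = adenoma[dose]
    kn, nn = neuroblastoma[dose]
    pa, pn = ka / na, kn / nn
    p_either = pa + pn - pa * pn        # independent lesions
    print(dose, f"P(either) = {p_either:.3f}",
          f"expected responders ~ {p_either * na:.1f}")
```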

[Figure 2.11 Results for fitting nectorine: binomial (b*), isotonic (id*), BMD (bmd*), and PI (d*pi) distributions at doses 10, 30, and 60.]

TABLE 2.11  Nectorine: Benchmark Dose calculations

BMR     Mean        Variance    5%          50%         95%         BMD50/BMDL
0.1     1.72E+01    3.44E+01    1.02E+01    1.58E+01    2.94E+01    1.55E+00
0.05    8.43E+00    8.21E+00    4.98E+00    7.74E+00    1.44E+01    1.55E+00
0.01    1.64E+00    3.12E−01    9.70E−01    1.51E+00    2.80E+00    1.56E+00

The results are shown in Figure 2.11. In this case both PI and BMD yield decent fits.

2.7.2 Benchmark Dose

The calculation of the Benchmark Dose and the ratio BMD50/BMDL proceeded as in the BMD Technical Guidance example. The results are shown in Table 2.11.

2.8 PERSIMONATE: BARRIER MODEL

The data for persimonate are shown in Table 2.12. Here again there is nonmonotonicity. The issues that arise are similar to those with frambozadrine, and the analysis will be confined to the pooled data, provided that the two types of mice can be pooled.


TABLE 2.12  Persimonate data

                                  Continuous          Total Metabolism    Survival-Adjusted
                                  Equivalent Dose     (mg/kg-day)         Tumor Incidence
B6C3F1 male mice, inhalation       0                   0                  17/49
                                  18 ppm              27                  31/47
                                  36 ppm              41                  41/50
Crj:BDF1 male mice, inhalation     0                   0                  13/46
                                   1.8 ppm             3.4                21/49
                                   9.0 ppm            14                  19/48
                                  45 ppm              36                  40/49

The doses used are in units of parts per million. The PI with the standard DR models from the BMD library was unsuccessful in this case. The data suggest that there may be two or three plateaus of response. A plateau might arise if there are successive biological barriers that, when breached, cause increasing probabilities of response: after the first barrier is breached, each additional dose has no effect until the second barrier is breached, and so forth. A barrier model was explored, using two barriers:

NR ∗ (g + (1 − g) ∗ (1 − exp(−b1 ∗ i1{0, dose, t} − b2 ∗ i1{t, dose, s} − b3 ∗ i1{s, dose, ∞}))).

Here t and s are random barriers with t < s, and i1{X, Y, Z} is a random indicator function returning 1 if X ≤ Y ≤ Z and 0 otherwise. The parameters b1, b2, and b3 are random but increasing: b1 < b2 < b3. Thus doses less than or equal to t have coefficient b1, those between t and s have b2, and those greater than or equal to s have b3. In fitting this barrier model, the PI chooses maximum-likelihood weights for resampling the starting distribution, and thus a maximum-likelihood distribution for the barriers and their associated coefficients, based on the starting sample distribution.

The results are shown in Figure 2.12. Note again the pronounced differences between the binomial observational uncertainty and the isotonic observational uncertainty. Also we again see decent agreement between the id and PI distributions, while the BMD distributions agree with neither bd nor id, except at the highest dose (45). The parameter distributions for the barrier model, after PI, are shown in Figure 2.13. Graph (b) shows the distributions of the thresholds t and s; the starting distributions were uniform on [0, 21] and [21, 45], respectively. The distribution of t concentrates near 21 and that of s concentrates between 21 and 36 (there is no dose between 21 and 36).
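The barrier model is a step function of dose, which is what produces the plateaus. A minimal sketch follows; the ranges for the ordered coefficients b1 < b2 < b3 (roughly 0.1 to 2.3, read off the plotted starting densities of Figure 2.14) and for g are our assumptions, as they are not stated exactly in the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def barrier_model(dose, g, t, s, b1, b2, b3):
    # One coefficient per regime: b1 for dose <= t, b2 for t < dose <= s,
    # b3 for dose > s. No dose multiplier: response is flat within a regime.
    b = np.where(dose <= t, b1, np.where(dose <= s, b2, b3))
    return g + (1 - g) * (1 - np.exp(-b))

# Starting distributions: t ~ U[0, 21], s ~ U[21, 45] (from the text);
# b1 < b2 < b3 obtained here by sorting three draws (an assumption).
N = 10_000
t = rng.uniform(0, 21, N)
s = rng.uniform(21, 45, N)
b1, b2, b3 = np.sort(rng.uniform(0.1, 2.3, (3, N)), axis=0)
g = rng.uniform(0.25, 0.39, N)   # assumed range, cf. panel (d) of Figure 2.13

p36 = barrier_model(36.0, g, t, s, b1, b2, b3)   # response uncertainty at dose 36
```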

[Figure 2.12 Results for fitting persimonate, pooled data: binomial (b*), isotonic (id*), BMD (bmd*), and PI (d*pi) distributions at doses 0, 1.8, 9, 18, 36, and 45.]

The cobweb plot, graph (a), shows less interaction between b1, b2, and b3 than was the case in Figure 2.9 for frambozadrine. Figure 2.14 shows the starting densities for b1, b2, and b3; compare these with Figure 2.13c to see the action of PI. The difference between the barrier model for persimonate and the threshold model for frambozadrine can be appreciated by comparing the graphs for uncertainty in probability as a function of dose (Figures 2.10 and 2.15). Whereas Figure 2.10 showed a smooth shift of the probability distributions as dose increased, Figure 2.15 reflects the uncertainty in the positions of the barriers. Doses 0, 5, and 10 fall beneath the first barrier with high probability, and their probabilities of response are similar.

[Figure 2.13 Persimonate parameter distributions after PI, from upper left to bottom right: (a) cobweb plot of all parameters (1600 selected samples), (b) t and s, (c) b1, b2, b3, and (d) g.]

[Figure 2.14 Starting distributions for b1, b2, and b3.]

[Figure 2.15 Persimonate (smoothed) probability of response as a function of dose; p* is the probability of response at dose *.]

The same holds for doses 35, 40, and 45: with high probability they fall above the upper barrier. For intermediate doses, the uncertainty in the probability of response reflects the uncertainty in the positions of s and t. As in the case of frambozadrine, this exercise shows that a good fit to isotonic observable uncertainty is possible; the toxicological plausibility of such models is a matter for toxicologists. The Benchmark Dose was not stable in this case, owing to the form of the DR model: in the region of interest the 95th percentile of the probability of response exhibited discontinuities, making it impossible to find a dose whose 95th percentile was equal to BMR × (1 − Prob(0)) + Prob(0).

2.9 CONCLUDING REMARKS

The quantification of uncertainty in DR models can only be judged if there is some external, observable uncertainty that this quantification should recover. We have used the isotonic uncertainty distributions for this purpose. With PI it is possible to find distributions over model parameters that recover these observable distributions. For the example from the BMD Technical Guidance Document and for nectorine, PI was successful with standard DR models. For frambozadrine and persimonate, it was necessary to introduce threshold and barrier models; decent fits with PI were obtained with these models. Better fits could be obtained by stipulating more than three quantiles of the observable distributions, and perhaps by exploring different models.


The point of this analysis is to show that good fits to observable uncertainty distributions are achievable. The SAU approach, as reflected in the BMD distributions, does not recover observable uncertainty as reflected in either the binomial or the isotonic uncertainty distributions. This is not surprising: the SAU approach aims at quantifying uncertainty in the parameters of a model assumed to be true; it is not aimed at capturing observable uncertainty.

The number of parameters in the threshold and barrier models for frambozadrine and persimonate is larger than in the corresponding BMD logistic models. Is this not "stacking the deck"? A number of remarks address this question:

1. The AIC criterion for goodness of fit punishes rather severely for additional parameters, and this is partly responsible for the discrepancy between the BMD and the observable uncertainties (especially in Figure 2.2).
2. Simply adding more parameters to a monotonic model, like the logistic, probit, or multistage, will not produce better fits for the frambozadrine and persimonate data; it seems more important to get the right model type than to add parameters to a wrong model. We are fitting three quantiles per dose; thus for persimonate there are 3 × 6 = 18 quantiles to be fit, whereas the barrier model has 6 parameters. Bayesian models, by comparison, have many more parameters.
3. The fits are judged, by visual inspection, with respect to the whole distributions, not just the three fitted quantiles.
4. The question of identifying unimportant parameters and finding a parsimonious model within the PI approach is a subject that needs more attention.

Having addressed the problem of capturing observational uncertainty, we can move on to more difficult problems such as extrapolation of animal data to humans, allometric scaling, sensitive subgroups, the use of incomplete data including human data from accidental releases, and extrapolation to low dose. Approaches to these more difficult problems should be grounded in methods that are able to deal with the relatively "clean" problems typified by these bench test exercises. The PI approach is based on antecedently defined observable uncertainty distributions. These may be based on isotonic regression, as was done here, but they may also incorporate distributions from structured expert judgment or from anecdotal incident data. The PI approach may provide a method (Kraan and Cooke, 2000) for attacking the more difficult extrapolation issues identified above.

The calculations and simulations displayed here are based on 500 samples for the isotonic uncertainty distributions, and 10,000 to 20,000 samples for the probabilistic inversion. The processing was done with the uncertainty analysis package UNICORN, which is available free from http://dutiosc.twi.tudelft.nl/~risk/. The probabilistic inversion software used here is a significant improvement over the version currently available at this website, but is not yet ready for distribution. After initial setup, processing time per case is on the order of a few minutes. More time is involved in exploring different models and different starting distributions.

APPENDIX: ALTERNATIVE MODELS AND OPTIMIZATION STRATEGIES

FRAMBOZADRINE LOG PROBIT

For frambozadrine, pooled data, the logistic model had an AIC of 290.561. The log-probit model was slightly better at 290.396. The multistage model with a second-degree polynomial was better still, with an AIC of 289.594, but the standard deviations of the parameters were not calculated by the BMD software. The log-probit model is

Prob(Dose) = γ + (1 − γ) Φ(α + β × ln(Dose)),

where Φ is the cdf of the standard normal distribution. Figure 2.16 compares the observable uncertainty distributions for the log-probit (lprob) and logistic (BMD) models, together with the binomial and isotonic uncertainties. The (log) probit models tend to be wider, a feature noted in many exercises (not reported here).
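For reference, the log-probit response function is straightforward to evaluate; a small sketch with illustrative parameter values (the fitted values are not given in the text):

```python
import numpy as np
from scipy.stats import norm

def log_probit(dose, gamma, alpha, beta):
    # Prob(dose) = gamma + (1 - gamma) * Phi(alpha + beta * ln(dose))
    return gamma + (1 - gamma) * norm.cdf(alpha + beta * np.log(dose))

print(log_probit(np.array([1.2, 15.0, 82.0]), 0.05, -3.0, 0.6))  # illustrative
```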

NECTORINE PROBIT

For the nectorine example, the logistic model had an AIC of 158.354, whereas the probit model had 158.286. Log-logistic and log-probit models were slightly worse. The probit model is

Prob(dose) = Φ(α + β × dose),

where Φ is the standard normal cdf. Figure 2.17 compares the previously obtained results with those of the probit model. As in Figure 2.16, we see that the probit model tends to have higher variance than the others.

OPTIMIZATION STRATEGIES

Probabilistic inversion is an optimization strategy applied to quantiles of the observable uncertainties. As with all optimization strategies in multi-extremal problems, its solution may depend on the starting point.

[Figure 2.16 Logistic (BMD), log-probit (lprob), binomial (bd), and isotonic (id) uncertainties for frambozadrine, male and female, at doses 0, 1.2, 1.8, 15, 21, 82, and 109.]

[Figure 2.17 Nectorine, pooled: binomial (bd), isotonic (id), BMD, probabilistic inversion (pi), and probit distributions at doses 10, 30, and 60.]

Compared to other strategies, it is more labor intensive. A standard optimization routine might work as follows (a sketch of steps 2 and 3 is given after the list):

1. Choose a preferred, smooth model.
2. Sample from the binomial uncertainty distributions and isotonicize this sample (as in Table 2.4).
3. Find values of the parameters for the preferred model that minimize

   Σi (NRi × Pm(di) − NRi × Pis(di))²,

   where NRi is the number of animals given dose di, Pm(di) is the probability of response given by the model at dose di, and Pis(di) is the isotonic probability of response at dose di.
4. Store these parameters, and repeat steps 2 and 3.
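A minimal sketch of steps 2 and 3, using the frambozadrine male data of Table 2.7 and the linear multistage model; the beta draw used for the binomial sampling uncertainty and the optimizer settings are our choices, not specified in the text:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

def pava(y, w):
    # Weighted pool-adjacent-violators: nondecreasing fit to y (isotonicize).
    out = []
    for yi, wi in zip(y, w):
        out.append([float(yi), float(wi), 1])        # [value, weight, count]
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            v2, w2, n2 = out.pop()
            v1, w1, n1 = out.pop()
            out.append([(w1*v1 + w2*v2) / (w1 + w2), w1 + w2, n1 + n2])
    return np.concatenate([[v] * n for v, _, n in out])

doses = np.array([0.0, 1.2, 15.0, 82.0])   # frambozadrine, males (Table 2.7)
n = np.array([47, 45, 44, 47])
k = np.array([2, 6, 4, 24])

def model(d, g, b1):
    return g + (1 - g) * (1 - np.exp(-b1 * d))   # linear multistage

fits = []
for _ in range(200):
    p = rng.beta(k + 1, n - k + 1)               # binomial sampling draw (flat prior)
    p_iso = pava(p, n)                            # step 2: isotonicize
    loss = lambda th: np.sum((n * model(doses, *th) - n * p_iso) ** 2)
    res = minimize(loss, x0=[0.05, 0.01], bounds=[(0, 1), (0, 1)])
    fits.append(res.x)                            # step 4: store
fits = np.array(fits)                             # empirical distribution over (g, b1)
```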

[Figure 2.18 BMD example with binomial (bd*), isotonic (id*), BMD (bmd*), and log-logistic optimization (lglgopt_*) distributions at doses 0, 21, and 60, with β ≥ 1.]

Figure 2.18 compares the results of optimizing [7] the log-logistic model (lglgopt), with β ≥ 1, against the BMD distributions, which are also based on the log-logistic model but use the MLE parameter distributions given by the BMD software. The binomial and isotonic uncertainties are also shown. The optimization results in a somewhat different, but not overwhelmingly better, fit to the isotonic uncertainties.

Whereas the previous example concerned a case where a smooth model (log-logistic) yielded a good fit with PI, the persimonate case did not yield a good fit with any smooth model. The barrier model used in Figure 2.12 is not differentiable. This means that the regularity conditions for the asymptotic normality, and even for the strong consistency, of the maximum likelihood estimators do not hold (Cox and Hinkley, 1974, pp. 281, 288). In Figure 2.19 we show optimization applied to the logistic and log-logistic models with the binomial and isotonic uncertainties. This figure should be compared with Figure 2.12. The log-logistic (β ≥ 0) model has three parameters, whereas the logistic model has two. Figure 2.19 shows that this extra parameter does not significantly improve the fits, and that the attempt to minimize the squared difference to the isotonic uncertainty does not produce better results than simply using the logistic model with MLE parameter distributions (BMD).

[7] The EXCEL solver with default settings was used for all the optimization results.


[Figure 2.19 Persimonate results of Figure 2.12 compared with optimization of the logistic (lgstop*) and log-logistic (llgstop*) models, with binomial (b*) and isotonic (id*) uncertainties, at doses 0, 1.8, 9, 18, 36, and 45.]

REFERENCES

Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD. 1972. Statistical Inference under Order Restrictions. New York: Wiley.
Benchmark Dose Technical Guidance Document. 2000. External Review Draft.
Bregman LM. 1967. The relaxation method to find the common point of convex sets and its applications to the solution of problems in convex programming. USSR Comp Math Math Phys 7:200–17.
Censor Y, Lent A. 1981. An iterative row-action method for interval convex programming. J Optimiz Theory Appl 34(3):321–53.
Cooke R. 1994. Parameter fitting for uncertain models: modeling uncertainty in small models. Reliab Eng Sys Safety 44:89–102.
Cox DR, Hinkley DV. 1974. Theoretical Statistics. London: Chapman and Hall.
Csiszar I. 1975. I-divergence geometry of probability distributions and minimization problems. Ann Probab 3:146–58.
Csiszar I. n.d. Information Theoretic Methods in Probability and Statistics.
Csiszar I, Tusnady G. 1984. Information geometry and alternating minimization procedures. Statist Decis 1:205–37.
Deming WE, Stephan FF. 1940. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann Math Statist 11:427–44.
Du C, Kurowicka D, Cooke RM. 2006. Techniques for generic probabilistic inversion. Comp Statist Data Anal 50:1164–87.
Fienberg SE. 1970. An iterative procedure for estimation in contingency tables. Ann Math Statist 41:907–17.
Haberman SJ. 1974. An Analysis of Frequency Data. Chicago: University of Chicago Press.
Haberman SJ. 1984. Adjustment by minimum discriminant information. Ann Statist 12:971–88.
Ireland CT, Kullback S. 1968. Contingency tables with given marginals. Biometrika 55:179–88.
Kraan B, Bedford T. 2004. Probabilistic inversion of expert judgements in the quantification of model uncertainty. Mgmt Sci.
Kraan B, Cooke RM. 2000. Processing expert judgements in accident consequence modeling. Radiat Prot Dosim 90(3):311–5.
Kraan B, Cooke RM. 2000a. Uncertainty in compartmental models for hazardous materials—a case study. J Hazard Mater 71:253–68.
Kruithof J. 1937. Telefoonverkeersrekening. De Ingen 52(8):E15–25.
Kullback S. 1968. Probability densities with given marginals. Ann Math Statist 39(4):1236–43.
Kullback S. 1971. Marginal homogeneity of multidimensional contingency tables. Ann Math Statist 42(2):594–606.
Kurowicka D, Cooke RM. 2006. Uncertainty Analysis with High Dimensional Dependence Modeling. Hoboken, NJ: Wiley.
Matus F. 2007. On iterated averages of I-projections. Statist Inform.
Vomlel J. 1999. Methods of probabilistic knowledge integration. PhD thesis. Faculty of Electrical Engineering, Czech Technical University, Prague.

COMMENT

MATH/STATS PERSPECTIVE ON CHAPTER 2: AGREEMENT AND DISAGREEMENT

Thomas A. Louis
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health

I would first like to recall Cooke’s starting points, or “tent poles,” and position my remarks relative to them:

TENT POLES

1. Observational Uncertainty: There must be some antecedent uncertainty over observable phenomena that we try to capture via distributions on parameters of dose–response (DR) models. With this I agree.
2. Integrated Uncertainty Analysis: The uncertainty in DR must be captured via a joint distribution over parameters that does not depend on dose. I am in partial agreement; there is an important caveat that no model can be completely global.
3. Monotonicity: Barring toxicological insights to the contrary, we assume that the probability of response cannot decrease as dose increases. Here again, agreement is partial. There may well be toxicological insights to the contrary, such as doses beyond the MTD. In the light of such evidence, would you really want absolutely strict enforcement of monotonicity (see below)?

Before going into details, it is useful to gather points of agreement and disagreement regarding generic issues.


POINTS OF AGREEMENT

1. One needs to compute and report all relevant uncertainties. This includes a priori and a posteriori uncertainties, along with sampling and nonsampling uncertainties.
2. In general, the Bayesian formalism does this best. One needs to use models with sufficient flexibility to honor the data and sufficient constraints to honor biology (or physics or sociology, as the case may be).
3. If you really (really) believe in monotonicity, then enforce it.
4. One needs to be vigorous in model criticism.

POINTS OF DISAGREEMENT

1. Cooke's statistics as usual (SAU) is not my SAU; it is not SAU at all, at least not for WEESs (well-educated and experienced statisticians)! For the situations considered, WEESs would use isotonic regression with an appropriate Bayesian or frequentist uncertainty analysis.
2. Cooke's Bayes is one example of Bayes, but not my Bayes.
3. I have issues with some "tent poles," in particular regarding rigid monotonicity and global uncertainty analysis.
4. Probabilistic inversion could have a role in approximating a fully Bayesian approach in problems that need it, and should serve only that role.
5. Cooke's approach works with discretized or quantal response, but how would it work with a continuous response variable?
6. How would Cooke's approach handle null or complete responses (i.e., no animals or all animals exhibit the response)?

MODEL-BASED UNCERTAINTY AND MODEL UNCERTAINTY

It is important to distinguish model-based uncertainty from model uncertainty. Model-based uncertainty assessment is "easy": if we know a particular model is true, then there are more or less accepted methods for quantifying uncertainty in the model parameters. However, if there are little or no data to inform about models (the likelihood is not informative about model choice), prior uncertainties over the correct model will migrate directly to the posterior distribution. A priori model uncertainty is not substantially changed by updating, and so one can inflate uncertainty with little or no empirical control. This uncertainty propagation operates in "what if" analyses and in Bayesian model averaging.


A LOW-DOSE EXTRAPOLATION EXAMPLE

Examples of the preponderance of model over sampling uncertainty are notorious in the area of low-dose extrapolation. The shape (linear, sub-/superlinear) depends on how the dose is represented in the model (e.g., dose, log(dose)). The shape also depends on the fundamental relation between exposure and dose. It may happen that many models fit the observed data well but give radically different estimated safe doses. An example is given below.

Model Uncertainty: Liver Tumors in Rats Consuming 2AAF

Low Dose Model                    Safe Dose (10^−8 Elevation)
Linear                            10^−5
Multistage*, Weibull*, Gamma*     10^−2
Logit, Probit                     10^−1

*Model fits observed data.

Faced with such underdetermination by the data, ad hoc approaches are often used.

ENFORCING MONOTONICITY: BINOMIAL EXAMPLE

If one really believes in monotonicity, or places great weight on it, then the standard binomial estimates aren't appropriate. Isotonic regression or a Bayesian version thereof would be appropriate:

Pd = pr(Response | Dose = d),  P = (P0, … , Pk) ∼ G,  G(P0 ≤ … ≤ Pk) = 1.

This can be done with an ordered Dirichlet process or a gamma process (see Mazzuchi and Burzala, Chapter 3 of this volume). As a simple example with k = 1, we might put

P0 = e^μ / (1 + e^μ),  P1 = e^(μ+θ) / (1 + e^(μ+θ)),

and require that G(θ > 0) = 1.


What would you do if you had very large sample sizes and still saw nonmonotonicity? How are you going to make the decision to switch gears? Wouldn't it be better to build in a priori a (small) possibility of nonmonotonicity? For example, you could use the following prior: for ε > 0 and small, let

θ = −η1 with probability ε,
θ = η2 with probability 1 − ε,

where ηj is normally distributed with mean μj and standard deviation σj.
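Sampling from this prior is direct. A small sketch with illustrative hyperparameters (ε, μj, and σj are not specified in the comment):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_theta(eps, mu1, sigma1, mu2, sigma2, size):
    # theta = -eta1 with probability eps, theta = eta2 with probability 1 - eps,
    # eta_j ~ Normal(mu_j, sigma_j): a small prior mass on nonmonotonicity.
    flip = rng.random(size) < eps
    return np.where(flip,
                    -rng.normal(mu1, sigma1, size),
                    rng.normal(mu2, sigma2, size))

theta = sample_theta(0.05, 1.0, 0.5, 1.0, 0.5, 10_000)   # illustrative values
print((theta < 0).mean())   # fraction of negative (nonmonotone) draws
```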

PROBABILISTIC INVERSION

PI can be a useful technique in many contexts [8]. However, it is difficult to get it right for the goals Cooke has in mind, especially in complex situations. To see this, consider a basic situation wherein the "obvious" simulation is incorrect and you need a two-stage simulation. For the Gaussian prior and likelihood, under a flat prior, [θ | Y] ∼ N(Y, σ²). Simulating from this distribution understates variation in the probabilistic inversion. You need to (see the sketch below):

1. sample θ from N(Y, σ²),
2. generate a Y*, and
3. repeat 1 and 2.
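A minimal sketch of the contrast, under our reading that Y* is a replicate drawn from the likelihood given the sampled θ:

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, Y = 1.0, 0.0   # observed value and known sd (illustrative)

one_stage = rng.normal(Y, sigma, 100_000)   # draws of theta | Y only

theta = rng.normal(Y, sigma, 100_000)       # step 1: sample theta
y_star = rng.normal(theta, sigma)           # step 2: replicate given theta

print(one_stage.std(), y_star.std())        # ~1.0 versus ~sqrt(2)
```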

TRY FOR BAYES

My overall advice is to use the Bayesian formalism to structure inferences, even if you then have to approximate. However, beware of statistical mischief. The perils of Bayesian model averaging were the subject of a recent discussion over air pollution and health. Koop and Tole [9] show that, using Bayesian model averaging and a broad range of candidate models, there is a great deal of uncertainty about the magnitude of the air pollution effect. Thomas, Jerrett, Kuenzli, Louis, Dominici, Zeger, Schwartz, Burnett, Krewski, and Bates [10] rebut Koop and Tole by showing that the data provide very little information on model choice and that you need to be careful in selecting candidate models and priors.

[8] Poole D, Raftery AE. 2000. Inference for deterministic simulation models: the Bayesian melding approach. JASA 95:1244–55. Yin Y, Louis TA. 2007. Optimal constrained Bayesian updating to match a target output distribution and retain stochastic model structure (submitted).
[9] Measuring the health effects of air pollution: to what extent can we really say that people are dying from bad air? J Environ Econ Mgmt 2004;47:30–54.
[10] "Bayesian model averaging" in time series studies of air pollution and mortality. J Toxicol Environ Health A 2007;70:1–5.


The same caution about focus holds for any frequentist approach to model selection. Statistical "wrappers" can be used to package mischief. There are limits to what we can do. For example, model choice will always have unquantifiable aspects. We can bring in expert opinion/knowledge on mechanism, genomics, discount rates, and so on, but how do we rate the "credibility" of an expert? There are limits to all formalisms.

SUMMARY

Cooke provokes us to do better, and many of his themes are right on the mark. I disagree with others of his themes and with some of his proposals. We do concur that we need to take BMD "to the next level."

COMMENT

EPI/TOX PERSPECTIVE ON CHAPTER 2: WHAT DATA SETS PER SE SAY

Lorenz Rhomberg
Gradient Corporation, Cambridge, MA

Roger Cooke has presented some ideas about the application of a probabilistic inversion method toward characterizing uncertainty in dose–response analysis. His viewpoint is that, once model fitting has been done, the error structure in the fitted model ought to reproduce the sampling uncertainty ("observational uncertainty") at each of the original data points. Most conventional modeling does not do this: the fitted model's characterization of the uncertainty in predicted response at a tested dose generally does not match the original binomial sampling uncertainty. Cooke proposes an approach that first uses isotonic regression to "clean up" the sampling distribution, on the assumption that any apparent deviations from monotonicity are due to sampling error, and then uses probabilistic inversion to constrain the model fitting such that the fitted model's uncertainties in predicted response at the tested doses match the observational uncertainty for the original data points. He shows that such constrained models produce different fits, and different estimates of uncertainty in predictions for untested doses, than do unconstrained models. Moreover, certain model forms labor so much under these constraints (when applied to particular data sets) that satisfactory fits reproducing the sampling error structure cannot be found, an outcome that should serve as evidence that the model form in question is ill suited to describing the data (even if the unconstrained fit is satisfactory by goodness-of-fit tests).

The notion is that this observational uncertainty, the uncertainty about the true value of the dose-specific probability of response that arises owing to the


limited sample of animals, is really the fundamental (and only) expression of uncertainty about responses that comes from the data themselves. How, then, can an uncertainty analysis about dose response, one that purports to inform us about the uncertainty in model predictions for responses at untested doses, claim to be basing its findings on the information in the data set if its characterization of uncertainty in response at the tested doses does not even match the original sampling uncertainty?

Even if one does not entirely subscribe to this point of view, the ideas are thought provoking, and they prompt an interesting exploration of the implicit meaning of fitted models vis-à-vis the data to which they are applied, as well as of the epistemic status of the fitted models as hypotheses about true underlying dose–response relationships.

My charge as a commentator on Cooke's presentation was to examine the "toxicology and epidemiology" aspect in understanding the uncertainty in dose–response analysis as explored in his research. In the context of the workshop, however, we are analyzing hypothetical data sets on hypothetical chemicals. (The toxicity data sets on the hypothetical chemicals frambozadrine, nectorine, and persimonate are real enough, but the names have been changed to protect the innocent.) As a consequence, if one abides by the rules of the workshop, the data are just numbers; they have no toxicological or epidemiological context upon which I can comment. I have chosen to address this quandary in two ways: (1) for illustrative purposes, I imagine a set of further toxicological and epidemiological findings on nectorine, findings that (like the nectorine data) bear some resemblance to those for an actual chemical, and then comment on how this hypothetical context can be brought to bear on dose–response uncertainty characterization; and (2) I examine the question of what a data set per se (stripped of this or any such context) has to say about the uncertainty in dose–response modeling. This second approach is apropos in view of Cooke's thesis that the modeling results should reflect and express the uncertainty inherent in the experimental response data.

Beginning with the first approach, I imagine a set of toxicological and epidemiological findings in addition to the raw data set of doses and response numbers that was provided for the workshop. Suppose that, in addition to the proffered data on nasal tumors in rats, we also know that nectorine:



• did not cause these nasal tumors in mice (but caused lung tumors);
• gives rats (but not mice) marked target-tissue cytotoxicity at the higher bioassay doses;
• is locally metabolized to a reactive compound by a cytochrome that is very active in these nasal tissues, but less so in other rat strains or in mice or in humans;





• is not genotoxic in standard studies, but a reactive metabolite is formed; and
• produces oxidative stress and depletes glutathione.

What do such results say about the dose–response analysis of the data set on nasal tumors in rats at different air concentrations of nectorine as it was provided to workshop attendees? More specifically, what does the external knowledge say about which dose–response shapes, or models, or fitted outcomes are more or less plausible and apparently valid? What does it say about applying any particular dose–response analysis of the tumor data to project potential human risks at low doses?

The additional knowledge raises a number of questions about dosimetry (especially about tissue-specific doses in the nasal tissues and possible nonlinearity of such effective doses vis-à-vis the air concentration) and about the mode of carcinogenic action that led to the nasal lesions. If there are indeed nonlinear pharmacokinetics, would this not affect the shape of the dose–response curve plotted against an external measure of exposure? Would the curvature in such a graph reflect the pharmacokinetic nonlinearity, or is the shape of the curve expected to result only from pharmacodynamic considerations? How do these considerations affect the relative plausibility of different curve shapes and different mathematical models that generate them? How is the mathematical rationale for the dose–response equation that is being fit to the data to be reconciled with different and unknown mixes of contributing factors to the measured outcomes, each with its own actual biological dose–response pattern? Is the carcinogenic process for which the dose–response pattern is being evaluated expected to be nonlinear for biological reasons? Should a threshold for the carcinogenic response be expected, secondary to the local cytotoxicity? In some contexts nonmonotonic dose–response curves are expected, as when there is a hormetic effect; are such possibilities to be considered for these data?

There are two endpoints for nectorine, and they constitute similar but distinct tumors in neighboring but distinct tissues. Are these to be considered independent responses in different endpoints or as related manifestations of a single process? Do they thereby inform one another regarding the dose–response pattern? Could they be pooled into a single measure of nasal tumor response, or would doing so conflate two distinct processes, potentially different in their dose–response patterns? Are the tumors as reported already overly lumped (i.e., combining distinct subtypes that might better have been separated)?

Carcinogenicity data come from chronic bioassays, typically lasting two years, and tumor incidences go up with age. In many cases there are animals lost to chemical-related and/or to chemical-independent causes before their lifetime ability to produce a tumor has been fully tested. How much such intercurrent mortality happened in the nectorine data set, and how were the


animals at risk counted? How would different survival adjustment approaches alter the apparent dose response? Could there have been regressing lesions or stages of carcinogenesis?

My aim is not to try to answer these questions, even hypothetically; indeed, even in real cases most of the questions admit of no definitive answer. But a number of points can be drawn by asking the questions above:

1. Data sets on doses and responses cannot be taken at face value, stripped of their context. They result from a number of consequential decisions about how to measure doses, how to define and measure responses, what to combine and what to separate, what to report and what to consider as inconsequential detail. A data set is not the beginning point of a dose–response analysis but an intermediate point. Once tabulated, however, data tend to take on a conventionalized reality, and properties that are really assumptions about the data, made (or asserted) by the quantitative analysis applied to the numbers, can get mistakenly attributed to the data themselves.
2. The way to address the issues in point 1 above is to keep in mind the context and the broader scientific understanding of the processes the data are meant to measure. Alternative ways of casting the results and of interpreting the meaning of the numbers need to be considered, as well as the sensitivity of the analysis to alternative answers.
3. Uncertainty in the dose–response description obtained by modeling data is then much more than the statistical uncertainty of curve fitting. Which models make plausible and informative descriptors, which curves generated by those models make plausible and informative interpolations and extrapolations from the data-point anchors, and which consequential phenomena might be utterly missed by reference to the test data alone are all informed by considering the larger toxicological and scientific context and all the background knowledge that can be brought to bear.
4. As a consequence, separating the "statistical" issues of dose–response curve-fitting uncertainty from all the other contributors to uncertainty in risk assessment is really artificial. (I say this realizing that in doing so, I am undermining the premise of this workshop.) I am not simply saying that a full uncertainty analysis has to consider issues beyond the simple uncertainty entailed in statistical curve fitting, though of course it does; even the "pure" statistical uncertainty is actually bound up with the larger context, via the dependence of the validity of the statistical assumptions, and of the presumed meaning of the measurements, on the underlying biology.

Can we not acknowledge all of these points and nonetheless find it useful to examine just the statistical curve-fitting issues in characterizing the uncertainty

EPI/TOX PERSPECTIVE ON CHAPTER 2: WHAT DATA SETS PER SE SAY

91

of fitted dose–response curves? More specifically, what does all the foregoing have to do with the evaluation of Cooke's approach using isotonic regression and probabilistic inversion? This brings me to the second of my two approaches to my charge: to examine what the data set per se has to say about the evaluation of dose–response uncertainty.

The impetus for the probabilistic inversion approach is the feeling that the fitted model's uncertainty distribution for its prediction of the response at a given dose should, for the particular cases of the experimental data points, match the experimental uncertainty in measuring those responses. If one wants to accomplish this, then the probabilistic inversion approach is a good way to go about it, and Cooke has demonstrated how to do so. But it is worth exploring whether one should want to have this property, as well as what forcing it entails in terms of constraining the model fit.

Consider first the observational uncertainty associated with each of the data points. For each dose group, we have one outcome (the rate of positives) that actually happened, but we realize that if the experiment were to be repeated many times, there would be variation in the outcome reflecting the limited number of animals tested each time. As noted by Cooke, we generally describe this as binomial variation around the sample mean outcome. It is noteworthy that we are already adding external information; the data offer no evidence that variation would indeed be binomial, and some circumstances, such as heterogeneity among animals, can lead to extrabinomial variation in practice.

We should attend to how we have transformed the question about observational uncertainty, in a sense inverting it. Our one observed outcome constitutes a deviation (large or small, we don't know, though we hope to characterize the probability distribution) away from a true population mean, the value of which we don't know. Our one observation could have resulted from a small upward deviation from a true mean just below the observed outcome, or from a big downward deviation from a true mean well above the observed outcome, or any number of alternatives. Our observational uncertainty question of interest, what we could call question 1, is: In view of our one sample outcome, how likely is it that the unknown true mean probability of response lies at any one of its various possible values? We characterize this uncertainty by asking a related but different question 2: If the sample mean of our one observation were the true mean, how much variation around it would we get, were we to test many more sets of the same number of animals? That is, we treat the likelihood of various deviations from the sample mean as indicative of the likelihood that the true mean lies in each of these spots (and has produced our one sample result as a statistical deviation from this truth). As Cooke shows, as long as the prior is uninformed, a Bayesian approach using the beta binomial reveals that these two questions have nearly the same answer.

The distinction is important, however, because when we are curve fitting, we definitely examine the data points as deviations from the (hypothesized) true mean generated by the model (question 1) and not the model as a


deviation from the sample mean (question 2). In maximum likelihood fitting, we choose the model that, among all those under consideration (i.e., of the same form but with different parameters), would be the most likely to produce the observed data as a set of deviations from it. An infinitely flexible model would maximize the likelihood by going exactly through all the data points, since this would make the observed outcomes the most likely of all possible outcomes. Why do actual maximum likelihood fits not produce fitted curves that exactly hit all the data points? Because the curves are not infinitely flexible. The chosen equation establishes some constraints on the shape and curvature of the model such that, if it is tweaked to be closer to this data point, it must necessarily be farther from that data point, and the maximum likelihood fit represents the optimal compromise. Whether, for this best-fitting model, it is plausible that the observed deviations constitute variations from the modeled means can be tested with goodness-of-fit tests, and inadequately fitting models can be rejected. Such a rejection would be evidence that the observed data reflect some underlying dose–response pattern with features that cannot be captured by the fitted equation; that is, that the sources of the model's constraints on flexibility conflict with the actual processes dictating the shape of the curve.

Of course, different equations produce different curve shapes that miss the data points in different ways, resulting in different maximum likelihoods; that is, they find a different compromise (and a different degree of satisfactoriness of the compromise) in minimizing their failures to hit all the observed dose points. These differences arise as a consequence of the different constraints on shape and curvature entailed in each fitted equation. The important point is that, in choosing a model to fit, we are in fact adding constraints onto the possible interpretation of the dose–response data, and the particular constraints we add vary with the model we choose. Fitting a log-probit model imposes the constraint that the curve must adhere to the possible shapes for cumulative lognormal distributions; fitting a multistage model of degree n constrains the curvature to be no more than the nth power of dose; and so on. Even the monotonicity expectation of Cooke's method, and the use of isotonic regression to impose it, affects the way in which permitted patterns of true group means are viewed. These added constraints are a kind of added information not present in the dose–response data. Even empirical "curve-fitting" models add information about expectations regarding shape; none is purely empirical in the sense of deriving all of its information from the data alone. The decision to use one empirical model over another is the decision to impose the constraints of one over the other, and the justification for this choice is informed not only by fit to data, but also by reference to expectations regarding the general form the dose–response curve ought to have, in view of our larger biological knowledge of the processes being modeled and our experience in modeling other compounds and data sets.

In fitting a model, we are hypothesizing a partitioning of the variance among outcomes into an “explained” component, embodied in the model’s functional form of response at different doses, and a residual “error” component, embodied in the deviations of particular observations around the curve that are presumed to arise because of sampling error. Because of the constraints on curve shape dictated by the model’s equations, data points “borrow” information from one another in the sense that a poorer fit here can be traded for a better fit there, all in service of finding an optimal hypothesis for partitioning overall variance into explained and residual components. In contrast, the original observational uncertainty comprises the raw, unpartitioned variation. None of the variation is “explained,” and the different dose levels associated with each group’s response play no role.

Seen in this way, why shouldn’t the uncertainties around a model’s predictions of the responses at tested doses differ from the unmodified original observational uncertainty? Is this indeed a problem in need of correction by probabilistic inversion? The observational uncertainty—measured in the question 2 sense but of interest in the question 1 sense—represents data cut loose from their context. They don’t have meaning or implications beyond their role as measurements of phenomena as they happened in the experiment. As soon as we want to apply these data to some larger question—including using them to characterize a dose–response curve that will apply to other, untested, doses—then we have changed the nature of the uncertainty around data points. First, we impose the constraints inherent in the models we choose to fit (justified by knowledge that includes information outside the data set itself), and second, we hypothesize a partition of variance into that explained by our proposed dose–response equation and that part that is relegated to statistical error. There is different information in the dose–response model than there is in the data alone, and so the uncertainties really ought to be different.

Where does this added information come from? Some comes from goodness of fit of one model form versus another, in that marked differences in success at explaining variation in outcomes using the fitted models would appear to say something about whether the constraints on dose–response shape actually operating match up with those imposed by each of the alternative models. But this distinction is most telling for discriminating between models that are acceptable and those that are not. Within the realm of adequately fitting models, the information comes from the plausibility of the shapes they hypothesize in view of other scientific information we have about the responses and the potential means by which they might depend on dose. Some of this is general: the expectations about smoothness of the curve, its monotonicity, the plausible degree of abruptness of change with dose, and the expectations about how inflection points might occur. Other parts of the externally motivated expectations are more specific to the chemical and the responses, such as thoughts about mode of action, the possibility of thresholds, and the properties that low-dose extrapolations ought to be presumed to have.

In sum, I am not really convinced that we should expect the uncertainty in model predictions about observed dose points to match the observational uncertainty in the outcomes at those points. Having said this, I think that Cooke’s paper raises an important point about the uncertainty characterization of fitted models that bears some thinking about. As he notes, the uncertainty around a fitted model is generally approached by assuming that the fitted model is true. Setting aside the question of reconstructing the observational uncertainty, is this a good way to go about characterizing the uncertainty in the dose–response relationship? The answer depends on how the uncertainty in the dose response is characterized. For any particular dose level, the uncertainty really is about the still unknown, but now perhaps better estimated, true population mean response probability. (That is, it is a “question 1” type of question.)

Again for a particular dose, a profile likelihood approach is often used. For instance, to get an “upper bound” for the curve at some dose level, the model’s equation is assumed to apply, but one allows alternative, less-than-optimal parameters chosen so that (1) the overall likelihood is sufficiently close to the maximum that the model’s fit to all the data is not significantly worse, and (2) among all sets of alternative parameters satisfying (1), the particular set producing a curve deviating most from the best-fitting curve at the dose in question is chosen. Generally, this curve achieves its deviation above the best-fitting curve at the dose level in question by being below it elsewhere, at other dose levels. (This is best for preserving as much likelihood as possible while deviating as much as possible at the dose level of interest.) That is, the upper bound curve is not general across all dose levels; other dose levels will have upper bounds given by different sets of alternative parameters. In the case of low-dose extrapolation, for models in which the shape of the curve at low doses becomes dominated by a single parameter, it is fortunate that the upper bound curve defined with respect to one low dose will also apply to other low doses (but not to higher doses).

That is, uncertainties in the whole dose–response relationship are not fully addressed by such a procedure in two important ways. First, as just argued, the uncertainties do not apply to the whole curve, just to particular doses, and simply stringing together the results of upper (or lower) bounds at different dose levels produces a net result that mischaracterizes the overall dose response, since all the upper bound values cannot be simultaneously true. Second, uncertainties are contingent on the model equation chosen, and as we have seen, different models impose different constraints on the allowable shapes of fitted curves. That is to say, the statistical uncertainty analysis for a given model does not address “model uncertainty.”

In my oral comments at the workshop, I presented some ideas about how to deal with more qualitative uncertainties, including model uncertainty, that do not lend themselves to statistical description but are often the major sources of uncertainty in a quantitative risk analysis. These comments are best incorporated here through reference to another written treatment of the same ideas (Rhomberg, 2007). Briefly, though, in that piece I advocate an approach that brings information on the uncertainties in hazard identification into the dose–response characterization. For each endpoint that one is proposing as a source of human risk, and for each data set that one puts forward as a risk analyst to be used in estimating such a human risk, it is important to articulate the hypothesized rationale for how and why the observed responses (usually in an animal experiment) are expected to generalize to other settings, including the human exposure context of interest. The expected consequences of this hypothesized basis for applying the results to human risk estimation can then be noted and compared against all of the available data, taking note not only of things that are “consistent with” the hypothesis but also outcomes that are not consistent, at least on their face (e.g., the failure of another rodent species to respond to the same doses). This approach gives a structure to weighing the evidence regarding how compelling any one basis for human risk estimation should be deemed to be, and each alternative model that can be applied to those data can be weighed by how well its fit and its restrictions on allowable shape (which should be articulated to the degree they can be) comport with the evidence at hand.

Thinking about the problem of characterizing dose–response uncertainty in this way emphasizes that in choosing a model, finding its optimal parameter values, accepting its constraints on the possible curve shapes, and invoking the fitted shapes as a basis for projecting expected responses to untested doses, one is in fact making a hypothesis about how to generalize the modeled bioassay results to other settings. The soundness of this generalization—that is, the uncertainty entailed in applying it—must be evaluated in view of the evidence that suggests it, that supports it, or that tends to cast doubt on it. Some of that evidence comes from model fit, but a lot comes from sources external to the data set per se, and indeed from sources external to the toxicology of the particular compound in question. The assertion of a hypothesized model divides the overall variation in outcomes into an “explained” portion and an “unexplained” or “error” portion, and the particular partitioning (and its consequences for descriptions of uncertainty around the fitted curve) inevitably reflects the hypothesis, and not just the original data.

In my view, it is difficult then to cleanly separate the “purely statistical” issue of parameter uncertainty in a fitted model from this larger context. It is not simply that the overall uncertainty needs to be considered as well in a comprehensive evaluation of uncertainty in risk assessment, but that the evidence being brought to bear on even the assessment of parameter uncertainty inevitably includes this external context. The means by which this external context looms over the immediate question about curve fits is via the assumptions and constraints imposed by the chosen equation. Some of these restrictions have consequences for the distribution of uncertainties around the fitted model at particular dose levels, and they reveal themselves in how the model choice alters the distribution of uncertainty at tested doses, away from the original “observational uncertainty” that does not seek to explain, but only to describe, the experimental variation.

In sum, perhaps we ought to expect that the uncertainty around a model’s predictions for the outcome at doses that were tested (and to which the model was fitted) differs from the original observational uncertainty. We should ask: is this difference something that is problematic and in need of fixing by probabilistic inversion?

REFERENCE

Rhomberg LR. 2007. Hypothesis-based weight of evidence: an approach to hazard identification and to uncertainty analysis for quantitative risk assessment. White paper, National Research Council, Committee on Improving Risk Analysis Approaches Used by the US EPA.

COMMENT

REGULATORY/RISK PERSPECTIVE ON CHAPTER 2: SUBSTANTIAL ADVANCES NOURISH HOPE FOR CLARITY?

Rob Goble
Clark University

“Uncertainty” as a freely floating term can mean many things to many different people. It can represent a general state of mind, or, in a particular context with a considerable degree of specification, it can have a reasonably precisely defined meaning. A particularly useful aspect and the intellectual challenge of this workshop are reflected in the structure of the agenda. The organizers have created a forum in which relatively high-powered statistical tools with promise for the assessment of uncertainties in dose–response relationships are examined in three contexts. Quantitative presentations, such as dose–response graphs, can be used to address different issues that represent the concerns in different areas of specialty:

• Biologists, whether they are toxicologists or molecular biologists, will view dose–response graphs as illustrations of the action of biological mechanisms.
• Statisticians will use graphs to illustrate quality of fit and the determination of parameters.
• Risk analysts and regulators will illustrate by graphs the bases for extrapolations to situations of regulatory concern.

These are different, but overlapping, spheres of meaning. Effective analytic approaches will offer insight to all three specialty areas. My charge, in this forum, is to consider Roger Cooke’s presentation of probabilistic inversion methods from a risk analytic/regulatory perspective. It may be helpful to begin by describing key concerns that can be expected within such a perspective.

From the point of view of a regulator, the primary uncertainty of interest is uncertainty in the (extrapolated) risks at the relevant (low) doses for which regulations are supposed to have an effect. In dealing with such uncertainties, the regulators require clarity and transparency in risk and uncertainty estimates, and the estimates they use must be defendable as “good and appropriate science.” Furthermore, whatever the explicit and implicit values embedded in their regulatory authority, the regulators will need to be responsible; they must balance, appropriately, between being too fearful and too complacent.

The risk analysts’ perspective is partly defined by their duty to the regulatory process: their job is to assist regulators, clients who must cope with regulation, and intervenors who urge changes in regulation. The concerns are with:

• good methods that are transparent and address relevant (low-dose) risks; and
• vigilance regarding new phenomena, potential problems, and misconceptions.

In addition risk analysts must be concerned with the foundations of their work. They are interested in the generation of new and relevant knowledge, and that includes moderating between toxicologists and statisticians. The primary premise of risk assessment is the use of available science—the assumption of continuity in nature, so that knowledge gained in one setting can be extended to other settings. In the case at hand, where the risks to be assessed are the risks from an exposure to a toxic chemical, there are very few situations in which reliable observational data are available for the incremental health effects of human exposures of concern. In such situations a risk assessment that interpolates between data points can be used to draw inferences, and one could have quite high confidence in such interpolations. Almost always, however, it will be necessary to project from data obtained under quite different conditions; the kinds of extrapolation that are often needed include (1) projecting probabilities of cancer occurrence from relatively high exposures to levels of exposure too low for there to be reliable direct data; (2) projecting across species, when there are data from animals, but limited or no data for humans; and (3) making inferences about mechanisms/modes of action when there are data that relate to biological mechanisms (in animals and humans) relevant to the induction of cancer. A fourth kind of projection is closely related to the third: (4) projecting across members of groups of chemicals where there are data from some (putatively related) chemicals; even when we don’t have detailed information about mechanisms, we can still draw inferences based on the assumption that there is some similarity in the action of those mechanisms for different chemicals in a putatively related group (Hattis and Goble, 2007). Figure 2.20, taken from Hattis and Goble (2007), illustrates typical projections made in a risk analytic construction of a dose/probability-of-response curve for cancer.

Figure 2.20 Illustration of typical dose response projections made in a risk analysis. Clouds show regions where there may be data, animal or human; extrapolations may be required from animals to humans and from high doses where there are data to low doses whose effects may still be of concern. The illustrated projections are straight lines. (Axes: probability of response per person versus dose.)

It provides a context for reviewing Roger Cooke’s presentation. The figure distinguishes several regions that have significance. The level of response for which there may be risk and regulatory concerns will, in general, be different and much lower in incidence than the level at which there may be data about induced cancers that can be distinguished from the statistical noise in feasible measurements. The “response level of concern” is a risk management judgment and may be made differently in different settings by different managers—depending, among other things, on the difficulty and competing risks of feasible risk abatement options, and the transparency and degree of voluntary choice involved on the part of the people who bear the risk. The challenge of assessment is to make projections from regions (shown as clouds) where empirical data can be obtained to regions of potential concern. There may be human data, but where these are weak or nonexistent, it may be necessary to project animal data—shown in the cloudy region in the back of Figure 2.20—to a region of human dose response (also shown in gray). Such a projection is illustrated by green dotted lines. A projection from the pink region where there may be data to the level of concern is shown as a brown dotted line. The particular projection illustrated is a linear projection assumed to have no threshold; other choices are, of course, possible. Dose/probability-of-response projections in this figure should be understood to lie in the human plane. Also indicated in the figure is the fact that the human dose–response data are often complicated by much more uncertainty in the dose dimension, whereas there is uncertainty in the response dimension for both animal and human information.

The challenge examples for the workshop, as discussed by Cooke, all require projecting from animal data; the uncertainties to be discussed are the uncertainties in interpreting such data. As such they offer a good opportunity to confront directly the question of what uncertainties are of concern. The statistical approaches presented in the conference provide a characterization of dose–response relations in the study animals that can be inferred from the example data, including a characterization of the uncertainty in such inferences. This uncertainty is, presumably, an important ingredient in assessing the uncertainty in the risk analyst’s real concern, low-dose risks to humans; but, as illustrated in the figure, two further and very substantial inferential steps are required: the extrapolation to humans and the extrapolation from high to low doses. Cooke in his presentation, and the workshop papers generally, do not provide much guidance as to how the analysis of uncertainty in describing the animal data is to be carried forward in these further inferences. To my mind this is the major issue raised by the workshop, and I will come back to it.

Cooke begins with three “tent poles” that he uses to define his objectives and constrain the mathematical path he follows. His first “tent pole” is to define the “observational uncertainty” attributable to a data set and to identify the “observational uncertainty” as the “target” for uncertainty analysis. We will come back to the question of whether this is the most useful target, but for now I want to stress that Cooke’s definition and identification of a target are an important contribution to the discussion. As he makes clear in his comparison with uncertainty characterizations typical of “statistics as usual,” the usual operational characterization of uncertainties is contingent on the choice of dose–response function to be fitted and on the quality of the fit to the data; without further specification, the target is thus not well defined. Whatever choice one ultimately decides to make for a target, such ambiguity cannot be desirable and will limit the capability for drawing useful inferences.

Cooke’s second “tent pole” is to assume that uncertainty characterizations should not be explicitly dose dependent. This assumption appears appropriate (except for unlikely cases where there happens to be dose-specific information with direct implications for uncertainty). Not to make the assumption opens up a box full of potential new ambiguities.

Cooke’s third “tent pole” is to assume that dose response is nondecreasing, and to implement that assumption using isotonic regression techniques.


This assumption is questionable in that there are circumstances when one can anticipate nonmonotonic dose response. However, such situations may not have serious practical import, so they could be addressed if needed. More serious is that isotonic regression techniques can introduce biases that may create more difficulties than the problems they are supposed to alleviate. I will come back to these possibilities in discussing the frambozadrine example.

After creating a clearly defined target, Cooke’s next major accomplishment is to present an analytic approach, probabilistic inversion, that enables hitting the target (at least approximately and in most cases). These are two substantial advances, and they offer the hope of bringing some clarity to a discussion that has been marred by lack of definition. Louise Ryan, for instance, urges that we distinguish between “confusion” and “uncertainty”; the latter is the subject of our workshop, and Cooke provides real help in reducing the former.

However, in the limited time I have available, I want to focus instead on concerns particularly germane to risk analysis. The frambozadrine case makes a good starting point. It is interesting to consider the data from different perspectives. Figure 2.21 shows the data putting together the two experiments for male and female rats. From my rather crude risk analytic perspective, this looks like a situation where there is a background effect on the order of 10%, and toxicity can only be distinguished from background at doses substantially greater than 20. The data do not provide strong guidance for choosing the shape of a dose–response curve; even without noting that the two points with high responses come from tests with male rats in one and female rats in the other, the data are consistent with a linear fit through the origin plus background.

Figure 2.21 Fraction of animals responding compared to dose from the frambozadrine example. The data come from experiments with male and female rats in groups of approximately 50. The response observed is hyperkeratosis.

A toxicologist, however, might well expect that hyperkeratosis is more likely to have a threshold dose–response relationship. Before considering the implications of interpreting the data from this perspective, I want to consider Cooke’s third tent pole—monotonicity. The issue is whether the ratcheting of the pooling of adjacent violators (PAV) tool in isotonic regression could introduce a significant bias in the final determinations of probability distributions. My conjecture is that it can—based in part on the strong differences that Cooke shows between his “bd” and “id” distributions. This conjecture is testable (including testing the “significance” aspect of the conjecture). The approach, a Monte Carlo endeavor, would be to generate hypothetical data using a two-parameter fit (background plus linear) to the data. For these hypothetical data, statistics as usual do represent the “observational uncertainty.” The test would be whether “probabilistic inversion” based on the “id” determined “observational uncertainty” would provide an unbiased estimator of the probability distributions that generated the data. (A sketch of the data-generating and pooling steps of such a test appears after the list below.)

Now let’s consider the possibility of a threshold for the nonbackground dose–response relationship. By this assumption, the data tell us something, but not too much. The threshold lies somewhere between a dose of 20 and 80. Other questions arise. If the threshold is toward the lower end of the range, that implies that there is considerable inter-animal variability, and that should be explored; if the threshold is near the upper end of the range, the modest difference between the response fraction at 80 and at 120 requires explanation, perhaps a difference between the sexes or some sort of saturation mechanism. Notice that I have brought in several additional (toxicological rather than statistical) concepts:

• Toxicological reasons for expecting a threshold
• Inter-animal variability
• Saturation
• An independent background
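To make the Monte Carlo endeavor concrete, here is a minimal sketch of its first two steps—generating hypothetical data from a background-plus-linear truth and applying the PAV pooling—so that the pooled estimates can be compared against the generating curve. The doses and group sizes echo the frambozadrine male groups; the background and slope values, the replication count, and all function names are our own illustrative choices, and the full test would go on to run probabilistic inversion on each synthetic data set.

```python
import numpy as np

def pav(y, w):
    """Pool adjacent violators: weighted nondecreasing fit (one common variant)."""
    blocks = []                              # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    return np.concatenate([np.full(c, v) for v, _, c in blocks])

rng = np.random.default_rng(1)
doses = np.array([0.0, 1.2, 15.0, 82.0])
n = np.array([47, 45, 44, 47])
truth = np.clip(0.10 + 0.004 * doses, 0.0, 1.0)   # hypothetical background + linear

reps = 5_000
pooled = np.zeros_like(truth)
for _ in range(reps):
    phat = rng.binomial(n, truth) / n             # "statistics as usual" sample
    pooled += pav(phat, n)
pooled /= reps
print("truth :", truth)
print("pooled:", pooled)   # systematic differences would indicate PAV-induced bias
```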

These all have implications for choosing models to fit. Once again, I want to remind us that a dose–response fit for animal data is only one of the projections needed for a risk estimate expressed as a human dose/probability of response. Figure 2.22 is an illustration modeled after Figure 2.20; it shows typical projections made in a risk analytic construction of a dose/probability-of-response curve for a toxic effect with a threshold. As with Figure 2.20, it provides context for this review by distinguishing regions of regulatory significance and regions with data. The challenge of assessment is to make projections from regions (shown as clouds) where empirical data can be obtained to regions of potential concern. There may be human data, but where these are weak or nonexistent, it may be necessary to project animal data—shown in the cloudy region in the back of the figure—to a region of human dose response (also shown as clouds).

Figure 2.22 Illustration of typical dose response projections made in a risk analysis involving a threshold for response. The clouds show regions where there may be data, animal or human; extrapolations may be required from animals to humans and from high doses where there are data to low doses whose effects may still be of concern. The illustrated projections in this case require additional information: information about human interindividual variability for human low-dose projections and scaling assumptions for animal to human extrapolation. We recommend use of the central animal response point for animal-to-human extrapolation, shown as a solid red line.

Here the projection from the (human) cloudy region where there may be data to the level of concern is shown as an aqua dotted line (with a brown dotted linear line also shown as a reminder). The other critical projections are the fit to the animal data (the subject of this workshop), shown as the curvy (threshold) black line in the animal data, and the projections from the animal to human regions. Here three projections are shown: a high-response projection and a low-response projection shown as two double lines, and a central response projection shown as a solid line. We believe the latter is the least ambiguous and most stable of such projections, and that is the one we recommend.

The principal lesson illustrated in the three figures is that developing a dose–response relationship suitable for use in a risk assessment makes use of a considerable amount of toxicological information beyond a set of experimental data points. Additional information is used in extrapolating from animals to humans and in extrapolating to low doses for human exposure. Some critical elements for human dose/probability of response can’t be derived at all from animal experiments: background incidence of health effects is one such element; human interindividual variability is another. In contrast, there are good toxicological reasons to believe that other critical elements can be extrapolated from animal experiments, the slope of a dose/cancer probability of response, for instance, or the threshold for a dose that overwhelms homeostatic protective measures. Thus toxicology provides an essential context for making the interpretations that go into developing a risk assessment; statisticians and statistical methods also play an essential role, but it is a role that must be specified by that context and by the needs of the risk analyst/regulator. We can summarize those needs as a series of questions. Each has unanswered uncertainty that statisticians can help clarify, and each goes beyond the matters examined in this workshop:

• Is there an effect?
• What is a favored DR model?
  ◦ How good is the evidence for it?
  ◦ What are plausible alternative choices? And what is their relationship to the evidence?
• What are human risks at relevant (low-dose) exposures?
• What is the role of background?
• What is the full package of information that we bring to bear on these questions?

And there is a final question:

• Can toxicologists and statisticians communicate better?

Improving communication is not easy, but one of the real contributions of this workshop has been to encourage improvement. And as Weihsueh Chiu pointed out, the issues here represent a relatively safe area for communication; we now need to move to the more difficult areas suggested by the questions above.

REFERENCE

Hattis D, Goble R. 2007. Uncertainties in risk assessment for carcinogenesis: a road map toward practical improvements. White paper, Clark University, Worcester.

COMMENT

A WEAKNESS IN THE APPROACH?

Jouni T. Tuomisto
National Public Health Institute (KTL), Kuopio, Finland

This is a brief comment on Cooke’s idea that a distribution over model parameters should capture the observable uncertainty in the bioassay data. While the exposition of this idea seems to flow smoothly toward the intended conclusion, I can imagine situations in which this approach would seem to run into trouble.

Imagine that we had bioassay data like that of the BMD Technical Guidance Document, except that the number of exposed animals at dose 21 was not 49 but 49,000, with 15,000 showing the response. The percentage responses are the same, but the very large number of animals in dose group 21 means that the uncertainty in the percentage response will be very small. Indeed, for dose group 21, we can be 90% certain that the number of animals responding on an independent iteration would be between 14,832 and 15,167. That translates to a percentage response between 30.3% and 31%. By contrast, dose group 60, with 45 animals, 20 of which showed the response, leads to 90% confidence that the percentage response on iteration of this experiment lies between 31% and 56%.

Since dose–response models predict percentage response, or equivalently, probability of response, it is difficult to see how a distribution over model parameters could ever capture the “observational uncertainty” created by this lopsided bioassay experiment. It seems that the observational uncertainty as defined by Cooke is limited to situations where the original bioassay design is repeated. Then the probability distribution of a particular dose group is interpreted as the distribution of outcomes that will be observed when the original design is repeated over and over again. It is clearly good to have such an interpretation for the distribution that can be tested and observed.
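The quoted 90% ranges can be checked directly from binomial quantiles; the following is a small verification sketch (the dictionary layout and labels are ours), and up to the discreteness of the distribution it reproduces the intervals above.

```python
from scipy.stats import binom

# Group sizes and observed responders from the hypothetical lopsided bioassay
groups = {"dose 21": (49_000, 15_000), "dose 60": (45, 20)}

for label, (n, k) in groups.items():
    p = k / n                                  # observed percentage response
    lo, hi = binom.ppf(0.05, n, p), binom.ppf(0.95, n, p)
    print(f"{label}: 90% range {lo:.0f}-{hi:.0f} responders "
          f"({lo / n:.1%}-{hi / n:.1%})")
```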


However, the problem is that we should be interested in the physiological and toxicological mechanisms and their realization as a dose–response curve, which can be observed in bioassays. The mechanisms are obviously not dependent on particular dose group sizes, so the uncertainties about the dose–response curve reflecting the mechanisms should not be either. The tricky question is how to define the uncertainty in a way that is independent of group size but still can be observed with a particular group size in a bioassay. The numbers of animals in this example are not meant to be realistic but to draw attention to a weakness in the method Cooke proposes.

RESPONSE TO COMMENTS

Roger M. Cooke
Resources for the Future, Department of Mathematics, Delft University of Technology

Nothing is more rewarding to a scientist than having his/her views seriously studied by highly esteemed colleagues. The reward lies not in seeing one’s views prevail, but in seeing reason prevail—a bit purple perhaps, but this is, after all, our common purpose. As such I feel richly rewarded by the comments of these reviewers. Rhomberg’s first three paragraphs summarized my proposed approach so perfectly that I feared he would agree with me completely—a fear that proved unfounded. Tuomisto’s pesky little example proved even more awkward.

All reviewers accept the value of having a model-independent characterization of observational uncertainty—Rhomberg is a little circumspect in this regard, the others are more emphatic. That is a very significant shift relative to current practice, in which uncertainty characterization is model dependent. I’m taking that to the bank. The issues the reviewers raise are addressed in order of appearance.

TOM LOUIS

The tent pole discussion needs a little clarification. Tent poles 2 and 3 (integrated uncertainty analysis and monotonicity) are points of departure, not rigid principles. They serve to circumscribe the cases for which the isotonic regression–probabilistic inversion (IR-PI) approach should make sense. Monotonicity, for example, is not a physical or biological law. Paracelsus—the father of toxicology—said “the poison is the dose.”11 Ingested to excess, any food will kill, but the dose–response relation is surely not monotonic! If one had biological reasons for doubting monotonicity in the relevant dose regime, then one should not erect this tent pole—but then one should also not restrict attention to monotonic dose–response models. Restricting oneself to monotonic models is implicitly assuming monotonicity. In this case the standard product form of the likelihood function would not reflect our beliefs. Isotonic regression is the appropriate way to impose the monotonicity assumption.

I think the second tent pole is a bit underappreciated. Imagine that we had to do an integrated uncertainty analysis, like those referenced in the introduction of my article. Such studies take account of the uncertainty of a source term, the uncertainty in airborne dispersion models, uncertainty in deposition models, uncertainty in transport through the food chain, uncertainty in exposure, uncertainty in dose response—let’s stop there. Suppose that our uncertainty in dose response is captured by an uncertainty distribution that depends on dose. It might be a mixture of models where the mixing coefficients depend on dose, or a distribution over model parameters that varies with dose, or some combination of these. On each Monte Carlo sample we would sample a source term, a set of dispersion coefficients, a set of coefficients for deposition modeling, a set of transfer coefficients for the transport models, sample a distribution of exposed people, and for each person compute the dose received on that sample. Tent pole 2 assures that we can factor in dose–response uncertainty by sampling a set of parameters from the appropriate distribution of a dose–response model and applying this model, with sampled parameters, to each individual. Hence, on each sample, we apply a monotonic dose–response function. If we sample “favorable parameters,” then these should apply to every exposed individual in the sample, and similarly for “unfavorable parameters.” If we do not capture dose–response uncertainty via a distribution on model parameters (contra tent pole 2), then we could seriously underestimate the overall uncertainty. Suppose that we had a distribution for response probability at each possible dose but did not have these in the form required by tent pole 2. We would have no alternative but to sample a response probability separately for each exposed individual. Not only would this be much more burdensome, it would also be wrong. If the dose–response relation is unfavorable, then it is unfavorable for everyone. Moreover it could easily happen that the dose–response probabilities on a given sample would not be monotonic in dose.

A few points must be conceded. How would this approach work with continuous, as opposed to quantal, response? I don’t see a problem with probabilistic inversion, but characterizing the observational uncertainty might be problematic. How would it handle null or complete responses? Not very well. Again, the problem would lie in characterizing the observational uncertainty in such experiments. This is similar to estimating a binomial probability in such cases; the MLE would be 0 or 1. As for low-dose extrapolation, I believe that statistical analysis of bioassay data sets should be confined to the range of measured doses. In going outside this range, the statistical problem changes into an expert judgment problem.

11 Alle Ding sind Gift, und nichts ohn Gift; allein die Dosis macht, daß ein Ding kein Gift ist. (“All things are poison and nothing is without poison; only the dose permits something not to be poisonous.”)
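The operational content of tent pole 2—draw one dose–response parameter set per Monte Carlo sample and apply that same curve to every exposed individual—can be sketched in a few lines. Everything numerical here is invented for illustration (the logistic curve, the parameter distributions, the exposure model), but it shows why per-individual sampling washes out the uncertainty that shared sampling preserves.

```python
import numpy as np

rng = np.random.default_rng(7)

def curve(dose, a, b):
    # a generic monotonic dose-response form (logistic in log dose); a stand-in
    return 1.0 / (1.0 + np.exp(-(a + b * np.log1p(dose))))

n_samples, n_people = 10_000, 200
doses = rng.lognormal(0.0, 1.0, n_people)        # stand-in exposure model

# Tent pole 2: one parameter draw per sample, applied to all individuals
shared = np.empty(n_samples)
for m in range(n_samples):
    a, b = rng.normal(-2.0, 0.5), rng.normal(1.0, 0.2)
    shared[m] = curve(doses, a, b).sum()          # expected responders this sample

# Contra tent pole 2: an independent draw per individual on each sample
indep = np.empty(n_samples)
for m in range(n_samples):
    a = rng.normal(-2.0, 0.5, n_people)
    b = rng.normal(1.0, 0.2, n_people)
    indep[m] = curve(doses, a, b).sum()

print(shared.std(), indep.std())   # shared sampling shows far wider spread
```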

LORENZ RHOMBERG

Rhomberg eloquently describes the toxicologist’s frustration with an analysis in which the data are shorn of all toxicological information—this is his second way. In the first way he offers abundant contextual information that could alter or disable a purely statistical analysis. My interpretation of the four conclusions drawn from the first way is that quantifying uncertainty in dose response is really an expert judgment problem, not a statistics problem. I agree; in the EU–USNRC joint uncertainty studies, quantifying uncertainty in dose–response relations for ionizing radiation was treated as an expert judgment problem. However—as he points out—this does undermine the whole point of the workshop. Indeed Resources for the Future hosted a workshop on expert judgment in 2006 (see http://www.rff.org/rff/Events/Expert-JudgmentWorkshop.cfm).

With regard to observational uncertainty, Rhomberg remarks that we are really interested in question 1: “In view of our one sample outcome, how likely is it that the unknown true mean probability of response lies at any one of its various possible values?” However, we characterize our uncertainty by asking a different question 2: “If the sample mean of our one observation were the true mean, how much variation around it would we get, were we to test many more sets of the same number of animals?” This is spot on. I would say it a bit differently: If we want to treat the quantification of uncertainty in dose response as a statistical problem (the point of the workshop!), then we must make an assumption to characterize observational uncertainty. The assumption I propose is that the answer to question 2 is also the answer to question 1. The questions are different but the answers (by assumption) are the same. Other postures are also possible. The Bayesian posture depicted in Figure 2.1 of my article gives different answers to the two questions, but the differences are not great. The important thing is that we characterize observational uncertainty in some model-independent fashion in order to see how well models can capture this uncertainty.

Rhomberg then asks: “Why shouldn’t the uncertainties around a model’s predictions of the responses at tested doses differ from the unmodified original observational uncertainty?” They should indeed—if we are certain that the model is true, which we generally are not. One reason for demanding that our dose–response models recover the observational uncertainty is that this gives us more leverage in assessing model adequacy. The degree to which certain model forms labor under these constraints might tell us something. In particular, I would like to know if the threshold and barrier models to which I had to resort have any biological plausibility. Might this not help data sets per se say something?

ROB GOBLE

Goble does an admirable job in characterizing the concerns of the risk analyst/regulator. These concerns kick in at low-dose exposures of humans, very much not the subject of statistical analysis of bioassay data. The extrapolations to low doses and to humans certainly involve different and larger uncertainties than those involved in the analysis of bioassay data. The point I take away from the papers in this workshop, however, is that we as a community haven’t yet sorted out how to do the bioassay uncertainty characterization.

I am pleased that the goal of a model-independent characterization of observational uncertainty is generally accepted. There may be better ways of characterizing this, different targets if you will. Goble, for example, questions whether monotonicity and isotonic regression are always appropriate. I don’t see the potential for bias in the example he sketches. I could imagine a follow-up workshop in which the test data are not taken from real studies, as was the case here, but are concocted to stress the modelers in specific ways. It would be a move from bench testing to stress testing.

Goble’s role as risk analyst makes him particularly sensitive to the demand for clarity and transparency. Uncertainty estimates must be defendable. That is exactly why we need a model-independent characterization of uncertainty. I’m hearing agreement that this is “good and appropriate science.”

JOUNI TUOMISTO

Tuomisto seems to have found a real weakness in the approach—one that I had not anticipated beforehand. In the situation he describes, the PI-IR approach would not work. I don’t know how other approaches would fare, but I suspect that we would all have trouble. This is a good example for the stress testing workshop. What does this mean? For one thing, it means that our observational uncertainty is influenced by how we design our bioassay experiments. With bizarre designs we might get bizarre results. Once we are aware of this, we shouldn’t be surprised but should figure the result into the experimental design. Tuomisto writes: “the tricky question is how to define the [observational] uncertainty in a way that is independent of group size but still can be observed with a particular group size in a bioassay.” I fear that that just isn’t possible. Our uncertainty is inherently bound up with sample size.

3

UNCERTAINTY MODELING IN DOSE RESPONSE USING NONPARAMETRIC BAYES: BENCH TEST RESULTS

Lidia Burzala
Delft University of Technology

Thomas A. Mazzuchi
George Washington University

We illustrate the analysis of uncertainty in dose response using the nonparametric Bayesian model developed by Gelfand and Kuo (1991).1 This model has been used in accelerated life testing, where the probability of failure is assumed to be an increasing function of physical stress. Very roughly, the model starts with a prior distribution over all nondecreasing functions expressing probability of response as a function of dose. The prior is specified by choosing (1) a prior mean response function P0(d), giving the mean probability of response as an increasing function of dose d, and (2) a parameter α controlling the spread of the prior around the mean. α is similar to the “equivalent observation” parameter used in standard binomial-beta updating. We find that setting α equal to the number of experiments in each bioassay, plus one, gives the best results. This prior distribution over all nondecreasing dose–response functions is then updated with the results of the bioassay experimental data. The updating is done with Gibbs sampling, and we obtain, for each dose level, a distribution over the probability of response at that level. This can be used to generate a distribution over the number of responses in each experiment, and this distribution may be compared with the binomial distribution with parameters n*, p*, where n* is the number of subjects given dose d, and p* is the observed percentage responding to dose d. The agreement varies from good to poor. The Gibbs sampling must be done at each dose level to obtain the uncertainty distribution at that dose. The report breaks into two main parts: the first part presents a short description of the nonparametric Bayesian approach; the second part illustrates the analysis of data sets.

1 Earlier proposals are Ferguson (1973), Antoniak (1974), Disch (1981), and Ammann (1984). Mukhopadhyay (2000), Gelfand and Smith (1990), Shaked and Singpurwalla (1990), Ramsey (1972), and Ramgopal, Laud, and Smith (1993) have also contributed significantly to the development of these models.

3.1 NONPARAMETRIC BAYESIAN APPROACH

We present here the nonparametric Bayesian approach to uncertainty modeling in dose response. The Bayesian approach provides the mechanism to combine data with the prior information to produce updated results in the form of posterior distributions, a combination of the prior distribution (derived using prior information) and the likelihood function derived from the data. In our analysis, we assume that the numbers of responses, Si, when ni subjects are exposed to dose level xi, are independently distributed as binomial random variables, Bin(ni, pi), where ni is the number of experimental subjects at dose level xi, and pi = P(xi) is the unknown probability of response at dose level xi, with x1 < x2 < … < xM. Thus the likelihood at S = s is

$$L(p; s) = \prod_{i=1}^{M} \binom{n_i}{s_i} p_i^{s_i} (1 - p_i)^{n_i - s_i}, \qquad (3.1)$$

where p = (p1, … , pM), and M is the number of experiments. The prior is a distribution over the class of right-continuous, nondecreasing functions taking values in [0, 1]. We introduce the Dirichlet process (DP) as a prior distribution on the collection of all probability measures. We define it as follows: a random probability distribution P is generated by the DP if for any partition B1, … , BM+1 of the sample space, the vector of random probabilities P(Bi) follows a Dirichlet distribution: (P(B1), … , P(BM)) ∼ D(αP0(B1), … , αP0(BM)). Note that in order to specify a DP prior, a precision parameter α > 0 and a base distribution P0 are required. Here P0 defines the prior expectation: E[P(Bi)] = P0(Bi); the parameter α controls the strength of belief, or confidence, in the prior. Thus a large (small) value of α means that new observations will have a small (large) effect in updating the prior P0. Indeed α is sometimes interpreted as the “equivalent prior observations.” In our study we assume that P has an ordered Dirichlet distribution, with density at (p1, … , pM) as stated below:

$$\pi_D(p_1, \ldots, p_M) = \frac{\Gamma\left(\sum_{i=1}^{M+1} \alpha\xi_i\right)}{\prod_{i=1}^{M+1} \Gamma(\alpha\xi_i)}\; p_1^{\alpha\xi_1 - 1} (p_2 - p_1)^{\alpha\xi_2 - 1} \cdots (p_M - p_{M-1})^{\alpha\xi_M - 1} (1 - p_M)^{\alpha\xi_{M+1} - 1}, \qquad (3.2)$$

where 0 ≤ p1 ≤ … ≤ pM ≤ 1 and ξi = P0(xi) − P0(xi−1) (i = 1, … , M + 1), with P0(x0) = 0 and P0(xM+1) = 1. When ξi = P0(xi) − P0(xi−1) = const. for all i = 1, … , M and α = 1/ξi, the posterior approaches the maximum likelihood estimate (MLE). Unfortunately the joint posterior distribution, obtained from (3.1) and (3.2), is too complicated to allow an analytical solution, especially for obtaining the marginals. Therefore we introduce one of the most commonly used Markov chain Monte Carlo methods—the Gibbs sampler, which can be used to generate the marginal posterior densities. We follow the investigation of Gelfand and Kuo (1991). The Gibbs sampler requires successively resampling the following conditional distributions:

• p | S, Z1, … , ZM,
• Zi | S, n, p.

The first conditional distribution is an ordered Dirichlet updating:

$$\frac{\Gamma\left(\sum_{j=1}^{M+1} \vartheta_j\right)}{\prod_{j=1}^{M+1} \Gamma(\vartheta_j)} \prod_{j=1}^{M+1} (p_j - p_{j-1})^{\vartheta_j - 1}, \qquad (3.3)$$

where $\vartheta_j = \alpha\big(P_0(x_j) - P_0(x_{j-1})\big) + \sum_{i=1}^{M} Z_{ij}$, with p0 ≡ 0 and pM+1 ≡ 1, and Zij is as defined below. Moreover the conditional density for pi over the set [pi−1, pi+1], denoted by g(pi | S, Z1, … , ZM, pj (j ≠ i)), is a truncated beta(ϑi, ϑi+1); that is, pi ∼ pi−1 + (pi+1 − pi−1)·beta(ϑi, ϑi+1). The second conditional distribution is defined as follows:

$$Z_i = (Z_{i1}, \ldots, Z_{ij}, \ldots, Z_{i,M+1}) \sim \mathrm{mult}(n_i, \lambda), \quad \text{for } i = 1, \ldots, M, \qquad (3.4)$$

where λ = (λ1, … , λj, … , λM+1) with λj = pj − pj−1. The variable Zij denotes, among the ni individuals receiving dosage level xi, the unobserved number of subjects who would have responded to dosage level xj but not to dosage level xj−1. Notice that Zi is a concatenation of two multinomials, Zi = {Zi(1), Zi(2)}, such that Zi(1) = (Zi1, … , Zii) and Zi(2) = (Zi,i+1, … , Zi,M+1), where

$$Z_i(1) \sim \mathrm{mult}\{s_i,\; p_i^{-1}\lambda(1)\} \quad \text{and} \quad Z_i(2) \sim \mathrm{mult}\{n_i - s_i,\; (1 - p_i)^{-1}\lambda(2)\},$$

with λ(1) = (λ1, … , λi) and λ(2) = (λi+1, … , λM+1).

Given the aforementioned conditional distributions, we can perform the Gibbs sampling as follows:

• Specify the initial value p(0) = (p1(0), … , pM(0)).
• For each i = 1, … , M, sample from

$$Z_i^{(0)}(1) \sim \mathrm{mult}\left\{s_i,\; \big(p_i^{(0)}\big)^{-1}\lambda(1)\right\} \quad \text{with} \quad \lambda(1) = \big(p_1^{(0)},\, p_2^{(0)} - p_1^{(0)},\, \ldots,\, p_i^{(0)} - p_{i-1}^{(0)}\big),$$

$$Z_i^{(0)}(2) \sim \mathrm{mult}\left\{n_i - s_i,\; \big(1 - p_i^{(0)}\big)^{-1}\lambda(2)\right\} \quad \text{with} \quad \lambda(2) = \big(p_{i+1}^{(0)} - p_i^{(0)},\, \ldots,\, 1 - p_M^{(0)}\big).$$

• For each i = 1, … , M, sample from

$$p_i^{(1)} \sim p_{i-1}^{(1)} + \big(p_{i+1}^{(0)} - p_{i-1}^{(1)}\big)\,\mathrm{beta}(\vartheta_i, \vartheta_{i+1}).$$

(r ) fˆD ( pi S ) = v−1 ∑ gD ( pi S, Z1(rs), … , ZMs , p(jsr ), j ≠ i ) .

(3.5)

s =1

The posterior mean of pi is estimated using the mean of gD, leading to

ϑ i, s ⎡ ⎧⎛ ⎞ ⎫⎤ ED ( pi S ) = v−1 ∑ ⎢ pi(−r )1, s + ( pi(+r )1, s − pi(−r )1, s ) ⎨⎜ ⎟ ⎬⎥ ⎝ + ϑ ϑ ⎩ i, s i + 1, s ⎠ ⎭ ⎦ s =1 ⎣ v

(3.6)

for i = 1, … , M. As we have mentioned, our sampling procedure is conducted with v parallel replications, each taken to r iterations. Note that the choice of v determines how close our density estimate is to the exact density at the rth iteration, whereas the choice of r determines how close the density estimate at rth

ANALYSIS OF THE DATA SETS

115

iteration is to the actual marginal posterior density. Therefore values for r and v needed to obtain smooth convergent estimates are strictly dependent on the application and the prior information about the problem studied. If we are interested in other dose levels than those in our bioassay experiments, we will have to run the Gibbs sampler for each new dose level. Computations of Benchmark Dose (BMD), the dose for which P(BMD) = BMR(1 − P(0)) + P(0), where P(0) is the background probability (the observed percentage response at zero dose) and BMR = 0.1, 0.05, 0.01, are not easy in the case of the nonparametric Bayesian model. We propose some approximation procedure in order to find BMD. Suppose that P(BMD) is somewhere between the posterior means: E(pi| S) and E(pi+1 | S). The BMD belongs then to the interval (xi; xi+1). Our purpose is to find x* subject to E(p*) = BMR(1 − P(0)) + P(0). According to Gelfand and Kuo, the conditional density estimate for p* (the potency curve at the nonobserved dose x*) is also a truncated beta p* ∼ pi + ( pi + 1 − pi ) beta (ϑ *, ϑ i + 1 − ϑ *) ,

(3.7)

where ϑ* = α(P0(x*) − P0(xi)). Moreover the posterior mean can be calculated from

ϑ * ⎞ ϑ i+1 − ϑ * ϑ* = E ( p* S ) = E ⎛ pi + ( pi + 1 − pi ) E ( pi S ) + E ( pi + 1 S ) . ⎝ ϑ i+1 ⎠ ϑ i+1 ϑ i+1

(3.8)

The procedure below can be executed to find an approximation of BMD: • • • • •

Find i s.t. E(pi | S) < P(BMD) < E(pi+1 | S). Calculate E(P(xi+0.5)) from (1.8). If E(P(xi+0.5)) > P(BMD), then calculate E(P(x*)) where x* = (xi + xi+0.5)/2. If E(P(xi+0.5)) < P(BMD), then calculate E(P(x*)) where x* = (xi+1 + xi+0.5)/2. Repeat the previous step until you will get “enough close” approximation of P(BMD).

The BMD is then associated with the dose at which we calculated the last E(P(x*)).

3.2 ANALYSIS OF THE DATA SETS Below we present four data sets obtained from different experiments on rats and mice. Each data set is analyzed using the nonparametric Bayesian approach with base distribution P0(xi) = i/(M + 1), where M is the number of experiments, and with precision parameter α. In order to compare how far the prior

116

BENCH TEST RESULTS

information is from the data, we present a table that supplies the maximum likelihood estimate (si/ni) and prior guess P0, for each stimulus level. The posterior mean and variance for the pi for α = M + 1 are summarized in the tables. To examine the impact of precision parameter on the results, we present the figures with prior means together with posterior means. Notice that the plots are made by linear interpolations based on the means at each dose level. The main purpose of this report is to quantify uncertainty in our dose– response model. The aim is to analyze the model’s suitability to recover the observational probabilities. We would like to answer the question: What is the uncertainty on the number of experimental subjects responding in each experiment. We assume that the uncertainty on the number of responses in a random repetition of each experiment follows a binomial distribution with the numbers of experimental subjects and the probability of response, which is given by the percentage observed on the experiment. This uncertainty represents the observational uncertainty, which we further call binomial uncertainty. In order to check how the nonparametric Bayesian model recovers this uncertainty, we sample the probabilities of responses at each observational dose level from our model and generate the number of responses from the binomial distribution with the appropriate number of experimental subjects. The results of these calculations can be used to compute the cumulative distribution function on the number of responses (nonparametric Bayesian cdf on the number of responses) as a mixture of binomials. However, it sometimes happens that the percentage of responses at lower dose level is higher than the percentage of response at higher dose level. In such situations we apply the so-called pooling adjacent violators (PAV) algorithm2 in order to create the increasing values of probabilities of response. The results obtained from this algorithm are then used to generate binomial cdf in order to represent the uncertainty on the number of responses (we further call it isotonic uncertainty). The aforementioned distributions on the number of responses are presented in the figures with the following notation: light lines (stairs) indicate the binomial uncertainty and dark lines (stairs) represent the isotonic uncertainty. The nonparametric Bayesian cumulative distributions on the number of responses are plotted by smoothed lines. 3.2.1

Data Set from the BMD Technical Guidance Document3

Description of Data Set The first data set (Table 3.1) consists only three dose levels at which 50, 49, and 45 experimental subjects were observed. At zero dose, one of the 50 experimental subject responded. However, the probability of response at this level is not so high (0.02); therefore it should not

2 3

See Barlow, et al. (1972). See also the discussion in Cooke discussion in chapter 2 of this volume. Available at http://www.epa.gov/ncea/pdfs/bmds/BMD-External_10_13_2000.pdf.

117

ANALYSIS OF THE DATA SETS

TABLE 3.1 Dose

Data set 1 (BMD)

Number of Subjects

0 21 60

50 49 45

TABLE 3.2

Number of Responses 1 15 20

MLE and prior (BMD) Dose Levels

MLE P0

TABLE 3.3

0.0200 0.2500

0.3060 0.5000

0.4440 0.7500

Posterior mean and variance (BMD) Dose Levels

Posterior mean Variance

1.0e-003

0.0389 0.2144

0.3056 0.5805

0.4567 0.7457

cause additional problems. Notice also that at the last dose level we can observe 20 responses from among of 45 experimental subjects. This suggests that data set 1 does not provide information about the whole potency curve. We do not know for example what dose is enough to induce the response of 80% of the experimental subjects. Convergence The Gibbs sampling procedure was implemented for an extensive experimental range of iteration-replication combinations. It was found that stable estimates could be obtained using 1500 replications and 100 iterations of the Gibbs sampling (Tables 3.2 and 3.3). Results Figure 3.1 presents the results from Bayesian model with three values of precision parameter: α = 0.4, 4, 40. Recall that this parameter controls the strength of belief in the prior guess for P0; the figure confirms this role. It can be observed that the results for α = 40 are closest to P0, whereas the posterior obtained from model with α = 4 approaches the MLE. Uncertainty Distributions Figures 3.2, 3.3, and 3.4 present the cumulative distributions on the number of responses for dose levels 0, 21, and 60 with the number of experimental subjects as in Table 3.1. We follow the notation we described in the introduction to this section. Note that the calculations were performed with values of precision parameter, namely: α = 0.4, 4, 40.

Figure 3.1 Prior and posterior mean. [Probability of response versus dose level: prior mean, posterior means for α = 0.4, 4, and 40, and the data.]

Figure 3.2 Nonparametric Bayesian with α = 0.4, and binomial uncertainty distributions. [CDF of the number of responses.]

Conclusions An obvious conclusion follows from the figures above: the Bayesian model with α = 4 recovers the uncertainty on the number of responses best. This should not surprise us, since for this parameter the posterior mean approaches the MLE; for this value of the precision parameter the posterior is driven chiefly by the data.

Figure 3.3 Nonparametric Bayesian with α = 4, and binomial uncertainty distributions. [CDF of the number of responses.]

Figure 3.4 Nonparametric Bayesian with α = 40, and binomial uncertainty distributions. [CDF of the number of responses.]

However, the fit at dose zero is worrisome; it seems that the best fit at dose zero is obtained from the model with α = 40. Table 3.4 presents the results for the BMD. The BMD for BMR = 0.01 cannot be computed, since the target response P(BMD) = BMR(1 − P(0)) + P(0) is lower than the posterior mean at zero dose.
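A minimal sketch of this BMD calculation in Python: it inverts the piecewise-linear posterior mean curve at the extra-risk target. Here we read P(0) as the observed response fraction at dose zero (1/50), an interpretation that approximately reproduces the reported values; the Table 3.3 posterior means are used as input.

    import numpy as np

    def bmd(doses, post_mean, p0, bmr):
        # Extra-risk target: P(BMD) = BMR*(1 - P(0)) + P(0).
        target = bmr * (1 - p0) + p0
        if target < post_mean[0]:
            return None  # target below the curve at dose zero (the BMR = 0.01 case)
        # np.interp inverts the curve because post_mean is nondecreasing.
        return float(np.interp(target, post_mean, doses))

    doses = [0, 21, 60]
    post = [0.0389, 0.3056, 0.4567]  # posterior means, Table 3.3
    p0 = 1 / 50                      # observed response fraction at dose zero
    for bmr in (0.1, 0.05, 0.01):
        print(bmr, bmd(doses, post, p0, bmr))  # roughly 6.2, 2.4, None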

TABLE 3.4 BMD results (BMD)

BMR    BMD
0.1    6.234
0.05   2.460
0.01   —

TABLE 3.5 Data set 2 (frambozadrine)

          Dose (mg/kg-day)   Total Rats   Hyperkeratosis
Male      0                  47           2
          1.2                45           6
          15                 44           4
          82                 47           24
Female    0                  48           3
          1.8                49           5
          21                 47           3
          109                48           33

3.2.2 Frambozadrine

Description of Data Set Table 3.5 presents the data set obtained from experiments on male and female rats. The main purpose of these experiments was to examine the effect of frambozadrine. As we can observe, this data set supplies only partial information about the potency curve (at the highest dose level we observe around 50% responses for males and around 70% responses for females). As with the previous data set, responses at dose level zero are observed in both groups. This suggests that some other factors are responsible for the occurrence of hyperkeratosis. In these studies we assume that as the dose level increases, the probability of response increases as well. Unfortunately, the data are not always monotonic in dose. In the experiments on male rats, we observe that at dose level 1.2 the percentage of response (0.133) is bigger than at the higher dose 15 (0.091). A similar situation occurs for females, where at dose 1.8 the probability of response (0.102) is bigger than the probability of response (0.064) at dose 21. Note that although males are exposed to lower dose levels than females, they react more strongly than females. For example, the probability of response at dose level 1.2 is 0.133 for males, whereas for females the probability at dose level 1.8 is only 0.102. Similar conclusions follow from the experiments at the next dose levels.


Convergence To find suitable parameters for the analysis of data set 2, we performed Gibbs sampling for an extensive experimental range of iteration–replication combinations. We found that stable estimates could be obtained using 1500 replications and 100 iterations of the Gibbs sampler for the data from the experiments on males. For the experiments on females, the estimates become stable with 1000 replications and 100 iterations of the Gibbs sampler. The same parameters are recommended for the analysis of the combination of both data sets.

Results The analysis of data set 2 is presented in the following way: we first analyze the data obtained from experiments on males, then the data from experiments on females, and last, we analyze them jointly, without distinguishing gender. Notice in Figure 3.5 the problems with monotonicity. Because the posterior is based on an increasing prior, it necessarily increases as the dose increases. The results for α = 50 are closest to the prior, whereas for α = 5 the results approach the MLE. Observe that at the dose levels where the violation of monotonicity occurs, we have the worst fit to the data.

Uncertainty Distributions Since the posterior satisfies the assumption of monotonicity, we would like to check how our model recovers the isotonic uncertainty. Figures 3.6 through 3.9 present the cumulative distributions on the number of responses for each dose level, with the numbers of experimental subjects as in Table 3.5. "NPB" denotes the cdf obtained from the nonparametric Bayesian model (red lines), "binun" indicates the binomial uncertainty (green stairs), and "isoun" the isotonic uncertainty (blue stairs). Since the results for α = 5 approach the MLE best, we perform the experiments with this parameter.

TABLE 3.6 MLE and prior (frambozadrine, male)

       Dose Levels
       0        1.2      15       82
MLE    0.0430   0.1330   0.0910   0.5110
P0     0.2000   0.4000   0.6000   0.8000

TABLE 3.7 Posterior mean and variance (frambozadrine, male)

                  Dose Levels
                  0             1.2           15            82
Posterior mean    0.0536        0.1135        0.1554        0.5108
Variance          1.23 × 10⁻⁴   1.02 × 10⁻⁴   1.84 × 10⁻⁴   1.06 × 10⁻³

Figure 3.5 Prior and posterior mean (frambozadrine, male). [Probability of response versus dose level: prior mean, posterior means for α = 0.5, 5, and 50, and the data.]

Figure 3.6 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 0 (frambozadrine, male).

Conclusions Because a violation of monotonicity occurs in the male data set, the pooling adjacent violators algorithm was applied. The differences between the binomial and isotonic uncertainties can be observed at the doses where the problems with monotonicity occur. Note that the fit for dose level 1.2 is closer to the binomial uncertainty,

Figure 3.7 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 1.2 (frambozadrine, male).

Figure 3.8 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 15 (frambozadrine, male).

whereas the fit for the next dose (15) is closer to the isotonic uncertainty. Since the posterior means for these doses are not sufficiently close to the MLE (especially at dose level 15), the binomial fits are not satisfactory. Moreover, our model does not recover the binomial uncertainty at dose level zero very well. At the last dose, the smooth lines track the dark stairs very closely, which is confirmed by the small

Figure 3.9 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 82 (frambozadrine, male).

TABLE 3.8 BMD results (frambozadrine, male)

BMR    BMD
0.1    9.29
0.05   0.71
0.01   0.019

TABLE 3.9 MLE and prior (frambozadrine, female)

       Dose Levels
       0        1.8      21       109
MLE    0.0625   0.1020   0.0640   0.6880
P0     0.2000   0.4000   0.6000   0.8000

difference between the posterior mean and the MLE (compare the results in Tables 3.6 and 3.7). Table 3.8 presents the results for the BMD.

The female results for the dose levels are presented in Tables 3.9 and 3.10. Notice that for the female data the violation of monotonicity also occurs. However, as in the previous case, the results (posterior means) increase as the dose levels increase (Figure 3.10).

Uncertainty Distributions Figures 3.11 through 3.14 present the cumulative distribution functions on the number of responses for each dose level, with the numbers of experimental subjects as in Table 3.5. We perform the calculations with α = 5.

TABLE 3.10 Posterior mean and variance (frambozadrine, female)

                      Dose Levels
                      0        1.8      21       109
Posterior mean        0.0590   0.0949   0.1272   0.6780
Variance (×10⁻³)      0.0941   0.0742   0.1560   0.9965

Figure 3.10 Prior and posterior mean (frambozadrine, female). [Probability of response versus dose level: prior mean, posterior means for α = 0.5, 5, and 50, and the data.]

Conclusions The uncertainty analysis for the female data is similar to the analysis of the male data. From the figures we observe almost the same problems as for the male data: bad fits at the first dose, and violation of monotonicity impairing the ability to recover the uncertainty. Since the model approaches the MLE best at the last dose level, we observe a very good fit at this dose (see Figure 3.14). Table 3.11 presents the results for the BMD.

The combined male and female results are presented in Tables 3.12 and 3.13. Although the impact of gender on the results is evident, we would like to apply the nonparametric Bayesian approach to both data sets without distinguishing gender. The analysis of such a data set can be difficult, since problems with monotonicity occur at almost all doses (see Table 3.12). As we can observe from Figure 3.15, the posterior mean obtained from the model with α = 8 approaches the MLE best. The question that we would like to answer is whether the combination of the data sets from the experiments on males and females alters the uncertainty on the number of responses.

Figure 3.11 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 0 (frambozadrine, female).

Figure 3.12 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 1.8 (frambozadrine, female).

Figure 3.13 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 21 (frambozadrine, female).

Figure 3.14 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 109 (frambozadrine, female).

TABLE 3.11 BMD results (frambozadrine, female)

BMR    BMD
0.1    25.57
0.05   10.35
0.01   0.67

TABLE 3.12 MLE and prior (frambozadrine, male and female)

       Dose Levels
       0        1.2      1.8      15       21       82       109
MLE    0.0530   0.1330   0.1020   0.0910   0.0640   0.5110   0.6870
P0     0.1250   0.2500   0.3750   0.5000   0.6250   0.7500   0.8750

TABLE 3.13 Posterior mean and variance (frambozadrine, male and female)

                      Dose Levels
                      0        1.2      1.8      15       21       82       109
Posterior mean        0.0571   0.1052   0.1300   0.1540   0.1799   0.4969   0.6870
Variance (×10⁻³)      0.0634   0.0348   0.0241   0.0210   0.0551   0.2762   0.2778

Figure 3.15 Prior and posterior mean (frambozadrine, male and female). [Probability of response versus dose level: prior mean, posterior means for α = 0.8, 8, and 80, and the data.]


Uncertainty Distributions Figures 3.16 through 3.22 present the cumulative distributions on the number of responses for each dose level, with the numbers of experimental subjects for the combined data sets. The experiments were performed with α = 8.

Conclusions Analogous to the previous analyses, the best fits are obtained for the doses where we do not have problems with monotonicity. The exception is the zero dose level. Again, we cannot say that the PAV algorithm helps our model to recover the uncertainty. Moreover, the analysis above confirms our conjecture that these data should be considered separately. This conclusion can be confirmed by comparing the results for dose 1.2 (male data) in Figures 3.7 and 3.17, or the results for dose 21 (female data) in Figures 3.13 and 3.20. Table 3.14 presents the results for the BMD.

3.2.3 Nectorine

Description of Data Set Data set 3, presented in Table 3.15, was obtained from experiments on rats. The purpose of these experiments was to check the activity of nectorine with respect to two lesions in rats: respiratory epithelial adenoma and olfactory epithelial neuroblastoma. The following dose levels were examined: 0, 10, 30, and 60 ppm. As can be seen, the data supply only limited information about the potency curve. At the highest dose level we observed only 15 responses out of 48 experimental subjects (a response probability of 0.3125) for the first type of lesion, and only 3 responses out of 48 rats (a response probability of 0.0625) for the second type. Moreover, the data suggest that nectorine is less potent for the second type of lesion.

Figure 3.16 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 0 (frambozadrine, male and female).

Figure 3.17 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 1.2 (frambozadrine, male and female).

Figure 3.18 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 1.8 (frambozadrine, male and female).

Figure 3.19 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 15 (frambozadrine, male and female).

Figure 3.20 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 21 (frambozadrine, male and female).

Figure 3.21 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 82 (frambozadrine, male and female).

Figure 3.22 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 109 (frambozadrine, male and female).


TABLE 3.14 BMD results (frambozadrine, male and female)

BMR    BMD
0.1    10.20
0.05   1.05
0.01   0.15

TABLE 3.15 Data set 3 (nectorine)

                                              Concentration (ppm): # Response/# in Trial
Lesion                                        0      10     30     60
Respiratory epithelial adenoma in rats        0/49   6/49   8/48   15/48
Olfactory epithelial neuroblastoma in rats    0/49   0/49   4/48   3/48

TABLE 3.16 Data set 3 (nectorine)

                                              Concentration (ppm): # Response/# in Trial
Lesion                                        0      10     30     60
Respiratory epithelial adenoma and
olfactory epithelial neuroblastoma in rats    0/49   6/49   12/48  18/48

Indeed, we did not observe any responses at the first two doses, whereas at the next two dose levels only a few responses out of 48 were noted. A worrisome situation occurs at the last two doses: the data show that the percentage response at the smaller dose level (30) is bigger than at the higher dose (60). To sum up, this data set does not give much information about the response of rats to the examined substance, and moreover the data do not satisfy the assumption of monotonicity. Therefore we consider the analysis of both data sets under the assumptions that the lesions occur independently and that the rats in the adenoma experiment were not analyzed for neuroblastomas, and conversely. The bioassay for both types of lesions is presented in Table 3.16.

Convergence Numerous experiments with different values of the parameters required to perform Gibbs sampling confirm that stable estimates could be obtained using 1000 replications and 100 iterations.


Results The olfactory epithelial neuroblastoma and respiratory epithelial adenoma results are presented in Tables 3.17 and 3.18. As we can observe from Figure 3.23, the dose–response curve increases as the dose levels increase. The model approaches the MLE very well for precision parameter α = 5. Therefore we should expect good results in the uncertainty analysis.

Uncertainty Analysis Because a positive number of responses is observed only at the last three dose levels, we plot the cumulative distributions on the number of responses only for these three levels. We present the results from the model with α = 5 in Figure 3.24.

TABLE 3.17 MLE and prior (nectorine, combined)

       Dose Levels
       0        10       30       60
MLE    0.0000   0.1220   0.2500   0.3750
P0     0.2000   0.4000   0.6000   0.8000

TABLE 3.18 Posterior mean and variance (nectorine, combined)

                      Dose Levels
                      0        10       30       60
Posterior mean        0.0191   0.1323   0.2557   0.3850
Variance (×10⁻³)      0.0700   0.2520   0.2530   0.5063

Figure 3.23 Prior and posterior mean (nectorine, combined). [Probability of response versus dose level: prior mean, posterior means for α = 0.5, 5, and 50, and the data.]

Figure 3.24 Nonparametric Bayesian (NPB) with α = 5, and binomial uncertainty distributions (nectorine, combined). [CDF of the number of responses.]

TABLE 3.19 BMD results (nectorine, combined)

BMR    BMD
0.1    7.03
0.05   2.81
0.01   —

Conclusions After pooling the data in Table 3.15, we no longer have a problem with monotonicity. Recall that the posterior mean obtained from the model with α = 5 approaches the MLE very well. Therefore the fits based on this model serve as very good examples of recovering the uncertainty on the number of responses. Table 3.19 presents the results for the BMD. The BMD for BMR = 0.01 cannot be computed, since the target response P(BMD) = BMR(1 − P(0)) + P(0) is lower than the posterior mean at zero dose.

3.2.4 Persimonate

Description of Data Set Table 3.20 presents the data sets from experiments on two types of male mice. Both data sets provide extensive information about the potency curve. A worrisome situation occurs at the zero dose level, where a large number of responses is recorded: for the second type of mice we observe 13 responses out of 46, and for the first type, 17 responses out of 49. Problems of monotonicity are also evident. The percentage response at dose level 3.4 (in mg/kg per day) is higher than at dose level 14 (precisely, 0.429 at dose 3.4 versus 0.396 at dose 14). As in the second data set (frambozadrine), we would like to check whether combining the data affects the uncertainty.


TABLE 3.20 Data set 4 (persimonate)

                                  Continuous   Total Metabolism   Survival Adjusted
                                  Equivalent   (mg/kg-day)        Tumor Incidence
B6C3F1 male mice inhalation       0            0                  17/49
                                  18 ppm       27                 31/47
                                  36 ppm       41                 41/50
Crj:BDF1 male mice inhalation     0            0                  13/46
                                  1.8 ppm      3.4                21/49
                                  9.0 ppm      14                 19/48
                                  45 ppm       36                 40/49

TABLE 3.21 MLE and prior (persimonate, B6C3F1 males)

       Dose Levels
       0        27       41
MLE    0.3470   0.6600   0.8200
P0     0.2500   0.5000   0.7500

TABLE 3.22 Posterior mean and variance (persimonate, B6C3F1 males)

                  Dose Levels
                  0        27       41
Posterior mean    0.3531   0.6489   0.8132
Variance          0.0010   0.0007   0.0006

Convergence We perform the Gibbs sampler with 1500 replications and 100 iterations for the first data set. For the second data set and the combined data, 2000 replications together with 100 iterations are used, since for these values stable estimates are obtained.

Results We first analyze the data obtained from the experiments on the first type of mice, then the data from the second type of mice, and finally we analyze both data sets, without distinguishing between the types of mice. The B6C3F1 male mice inhalation results are presented in Tables 3.21 and 3.22. From Figure 3.25 it follows that the posterior means obtained from the model with α = 4 approach the MLE best.

Uncertainty Analysis Figure 3.26 presents the results for the uncertainty on the number of responses from the model with precision parameter α = 4, at dose levels 0, 27, and 41 mg/kg per day (starting from the left).

Figure 3.25 Prior and posterior mean (persimonate, B6C3F1 males). [Probability of response versus dose level: prior mean, posterior means for α = 0.4, 4, and 40, and the data.]

Figure 3.26 Nonparametric Bayesian (NPB) with α = 4, and binomial (binun) uncertainty distributions (persimonate, B6C3F1 males).

Conclusions Although we do not have problems with monotonicity, and the posterior obtained from the model with α = 4 approaches the MLE, the uncertainty analysis for zero dose does not look promising. However, the model recovers the binomial uncertainty at the remaining dose levels.


TABLE 3.23 BMD (persimonate, B6C3F1 males)

BMR    BMD
0.1    5.48
0.05   2.32
0.01   0.105

TABLE 3.24 MLE and prior (persimonate, Crj:BDF1 males)

       Dose Levels
       0        3.4      14       36
MLE    0.2830   0.4290   0.3960   0.8160
P0     0.2000   0.4000   0.6000   0.8000

TABLE 3.25 Posterior mean and variance (persimonate, Crj:BDF1 males)

                      Dose Levels
                      0        3.4      14       36
Posterior mean        0.2788   0.3884   0.4519   0.8045
Variance (×10⁻³)      0.3562   0.1576   0.2482   0.6116

Table 3.23 presents the results for the BMD. The Crj:BDF1 male mice inhalation results are given in Tables 3.24 and 3.25. Figure 3.27 shows the violation of monotonicity; the problem appears at dose 3.4. As can be observed, the α = 5 posterior approaches the MLE best at the first and last dose levels, whereas at the other two doses the problem with monotonicity occurs.

Uncertainty Analysis We consider our model with precision parameter α = 5. The results are presented in Figures 3.28 through 3.31.

Conclusions From the analysis above, it follows that the model with α = 5 recovers the binomial uncertainty only at the last dose levels. Analogous to the previous data sets, problems with the fit at zero dose occur as well. Table 3.26 presents the results for the BMD.

The combined Crj:BDF1 and B6C3F1 male mice inhalation results are presented in Tables 3.27 and 3.28. Here a worrisome situation arises. From our experience with the other data sets, the best fit to the data (MLE) should be obtained from the model with α = 7. However, it follows from Figure 3.32 that the posterior obtained from the model with α = 0.7 approaches the MLE better for dose levels 3.4, 14, and 36.

Uncertainty Analysis The analysis presented in Figures 3.33 through 3.38 was done in order to check whether the combination of data from different types of mice affects the uncertainty on the number of responses. We performed the analysis for a "compromise value" of α = 5.

Figure 3.27 Prior and posterior mean (persimonate, Crj:BDF1 males). [Probability of response versus dose level: prior mean, posterior means for α = 0.5, 5, and 50, and the data.]

Figure 3.28 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 0 (persimonate, Crj:BDF1 males).

Conclusions As shown in Figure 3.32, good fits are obtained only for doses 27 and 41. Problems with recovering the uncertainty at dose levels 3.4 and 14 can be explained by the violation of monotonicity at these doses.

Figure 3.29 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 3.4 (persimonate, Crj:BDF1 males).

Figure 3.30 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 14 (persimonate, Crj:BDF1 males).

Figure 3.31 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 36 (persimonate, Crj:BDF1 males).

TABLE 3.26 BMD results (persimonate, Crj:BDF1 males)

BMR    BMD
0.1    2.34
0.05   1.17
0.01   0.32

TABLE 3.27 MLE and prior (persimonate, combined)

       Dose Levels
       0        3.4      14       27       36       41
MLE    0.3160   0.4290   0.3960   0.6600   0.8160   0.8200
P0     0.0000   0.2000   0.4000   0.6000   0.8000   1.0000

TABLE 3.28 Posterior mean and variance (persimonate, combined)

                      Dose Levels
                      0        3.4      14       27       36       41
Posterior mean        0.3065   0.3946   0.4528   0.6467   0.7803   0.8389
Variance (×10⁻³)      0.1815   0.0751   0.1070   0.1909   0.0930   0.1092

Figure 3.32 Prior and posterior mean (persimonate, combined). [Probability of response versus dose level: prior mean, posterior means for α = 0.7, 7, and 70, and the data.]

Figure 3.33 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 0 (persimonate, combined).

Figure 3.34 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 3.4 (persimonate, combined).

Figure 3.35 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 14 (persimonate, combined).

Figure 3.36 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 27 (persimonate, combined).

Figure 3.37 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 36 (persimonate, combined).

Note that the MLEs for dose levels 36 and 41 are very close, which can induce such bad fits. If we compare the separate analyses for the two types of mice with the combined analysis presented above, better fits are obtained for the separate data sets. To see this, for the uncertainty analysis at dose level 36 (second type of mice) compare Figures 3.31 and 3.37, and for the first type of mice compare the analyses for the last dose, 41, in Figures 3.26 and 3.38.

Figure 3.38 Nonparametric Bayesian (NPB), isotonic (isoun), and binomial (binun) uncertainty distributions for dose level 41 (persimonate, combined).

TABLE 3.29 BMD results (persimonate, combined)

BMR    BMD
0.1    2.92
0.05   1.59
0.01   0.53

Table 3.29 presents the results for the BMD.

3.3 FINAL REMARKS

From the analyses presented in this chapter, it is clear that the nonparametric Bayesian model recovers the observational uncertainty best when the precision parameter takes the value α = M + 1, where M is the number of experiments. With this choice of α, the Bayesian model is fully specified and not "tweaked" to the data. The posterior mean generally agrees with the MLE. If there is no violation of monotonicity in the data, this value of the precision parameter recovers the uncertainty on the number of responses almost perfectly. The exceptions are the fits at the zero doses. Moreover, violations of monotonicity at some dose levels result in unsatisfactory fits.


This can be confirmed by examining the posterior means. Unfortunately, the PAV algorithm does not improve the model's ability to recover the uncertainty. In our attempt to find stronger methods to "repair" our data, we combined data sets in order to check whether pooling the data alters the uncertainty. For data sets 2 and 4 we concluded that better fits are obtained by way of separate analyses. However, in the case of data set 3, the situation called for pooling the data to recover monotonicity.

REFERENCES

Ammann LP. 1984. Bayesian nonparametric inference for quantal response data. Ann Statist 12:636–45.
Antoniak CE. 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann Statist 2:1152–74.
Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD. 1972. Statistical Inference under Order Restrictions. New York: Wiley.
Burzala L. 2007. Nonparametric Bayesian bioassay. MS thesis. Department of Mathematics, Delft University of Technology.
Disch D. 1981. Bayesian nonparametric inference for effective doses in a quantal response experiment. Biometrics 37:713–22.
Ferguson TS. 1973. A Bayesian analysis of some nonparametric problems. Ann Statist 1:209–30.
Gelfand AE, Kuo L. 1991. Nonparametric Bayesian bioassay including ordered polytomous response. Biometrika 78(3):657–66.
Gelfand AE, Smith AFM. 1990. Sampling-based approaches to calculating marginal densities. J Am Statist Assoc 85:398–409.
Govindarajulu Z. 1988. Statistical Techniques in Bioassay, 2nd ed. Basel: Karger.
Mukhopadhyay S. 2000. Bayesian nonparametric inference on the dose level with specified response rate. Biometrics 56:220–6.
Ramgopal P, Laud PW, Smith AFM. 1993. Nonparametric Bayesian bioassay with prior constraints on the shape of the potency curve. Biometrika 80(3):489–98.
Ramsey FL. 1972. A Bayesian approach to bioassay. Biometrics 28:841–58.
Shaked M, Singpurwalla ND. 1990. A Bayesian approach for quantile and response probability estimation with applications to reliability. Ann Inst Statist Math 42:1–19.

COMMENT: MATH/STATS PERSPECTIVE ON CHAPTER 3: NONPARAMETRIC BAYES

Roger M. Cooke
Resources for the Future; Department of Mathematics, Delft University of Technology

The authors are applauded for bringing nonparametric Bayesian (NPB) methods to bear on the dose–response uncertainty modeling problem. They develop a practical heuristic for choosing the precision parameter α and show that good results are obtained, at least in cases where the percentage responses are monotone in the dose. Like all Bayesian methods, this approach has the virtue of resting on a solid mathematical foundation, in which the assumptions are clearly visible. The prior distribution is a distribution over all increasing potency curves. Normally the choice of prior has to be defended and subjected to a sensitivity analysis, showing the final results' sensitivity to the choice of prior. While choosing and defending a distribution over all increasing potency curves might appear to be a daunting task, the use of the Dirichlet process prior reduces this task to choosing an initial potency curve P0(x) and choosing the precision parameter α. Their choice of P0(x) has nothing particular to recommend it, but the authors are able to show that, with a suitable choice of α, the prior's influence on the final results is not great. The ability to draw these conclusions with simple and transparent arguments is a major strength of this method. The comparison with observable uncertainties as a way of assessing the performance of the NPB model demonstrates that Bayesian methods are amenable to external validation.

The author served as supervisor of Burzala's master's thesis, on which this work is partially based.


MATHEMATICAL ISSUES

However, there are some downsides to this method, from a mathematical point of view:

1. The first is simply the complexity of the model. The exposition in Section 3.1 is very clear, but the material is complex and will tax the patience of nonmathematical readers. The Gibbs sampler relies on numerical experimentation to get stable results. Practicing toxicologists and risk analysts will shy away from methods they cannot easily understand and explain to stakeholders and decision makers. Note that the updating formula (1.3) is like the standard Dirichlet updating of the "equivalent observations" expressed by α, but increased with a term ϑi reflecting the number of animals that would not have died at any dose lower than di but died at di. The terms ϑi are not observed but are "forecast" within the updating process. Developing a simple version of this story, which is mathematically correct and at the same time suitable for general audiences, is a challenge for the future.

2. A second issue concerns the posterior mean potency curve at doses other than those in the bioassay data. According to equation (1.8) this posterior potency curve is a piecewise linear function between the mean potencies at the observed doses. This would seem to lock us into a linear interpolation between observed doses.

3. Under current practice the main regulatory instrument is the BMDL, that is, the 5th percentile of the dose which realizes the benchmark response BMR. It should be possible to give BMDL results, in addition to the BMD results reported by Burzala and Mazzuchi. Also it would be helpful to have a picture of the computational burden.

COMPARISON WITH BMA

Two Bayesian models were deployed in the bench test exercise, Bayesian model averaging (BMA; Whitney and Ryan, Chapter 4 of this volume) and NPB. Test driving the two Bayesian models on common problems affords some interesting comparisons. I focus on three.

Choice of Prior Choice of prior for the NPB, as discussed above, comes down to choosing a prior potency curve P0(x) and a precision parameter α. The choice is clear and can easily be subjected to sensitivity analysis. Choosing a prior in BMA involves first selecting a set of models and then choosing a prior distribution over these models. The first of these choices is probably the more critical. Assessing sensitivity to the choice of prior would seem to be less straightforward than


for NPB; the space of potential models is large. In addition some models may have common elements. An example arises in the case of models 3 and 4 in Whitney and Ryan's Table 4.4; the model-fitted values for the parameters β1, β2, and β3 are shown in parentheses:

    Model 3: β3 + (1 − β3)(1 − exp(−β1 d − β2 d²))   (β1 = 0.6805; β2 = 0.00001; β3 = 0.0251);
    Model 4: β3 + (1 − β3)(1 − exp(−β1 d^β2))        (β1 = 0.6805; β2 = 1.0000; β3 = 0.0251).
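A quick numerical check (a sketch in Python; the dose grid 0–10 is our own choice) makes the near-coincidence of the two fitted curves concrete:

    import numpy as np

    b1, b3 = 0.6805, 0.0251
    d = np.linspace(0.0, 10.0, 101)
    model3 = b3 + (1 - b3) * (1 - np.exp(-b1 * d - 0.00001 * d**2))  # beta2 = 0.00001
    model4 = b3 + (1 - b3) * (1 - np.exp(-b1 * d**1.0))              # beta2 = 1.0000
    print(np.max(np.abs(model3 - model4)))  # about 1e-5 at its largest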

Evidently, for these parameter values, models 3 and 4 are effectively identical:

    Model (3 = 4): β3 + (1 − β3)(1 − exp(−β1 d)).

By giving models 3 and 4 equal prior weight, we effectively double the weight that model (3 = 4) would have had, had it been used instead of models 3 and 4. Alternatively, if we expanded the initial set of models to include model (3 = 4), then this model would get triple weight. Given the small number of models, the decision to retain both models 3 and 4 when they collapse to the same model, and not to include model (3 = 4) ab initio, requires some defense. It seems to me that NPB dodges this particular bullet.

Calculation of BMDL It was noted above that the NPB did not produce values for the BMDL, whereas BMA does. Is this caused by the difficulty of obtaining the posterior density at unmeasured doses? Since the current regulatory context requires BMDLs, NPB has some work outstanding in this regard.

Extrapolation to Low Dose It is my belief that extrapolation outside the range of observed data requires nonstatistical input, in the form of either expert judgment or prior constraints motivated from physical models. Strictly speaking, extrapolation to low dose is outside the charge of this bench test exercise, and all the approaches will have to accommodate extra-statistical information for such extrapolation. I find it easier to imagine how BMA might deal with this than NPB, but this may simply reflect my lack of imagination.

CONCLUSION

NPB is certainly a contender. Its advantages include transparent choice of priors, easy sensitivity analysis with respect to the choice of prior, and a solid mathematical foundation. Its main disadvantages are complexity, the need for numerical routines, and the inability, at least for the time being, to calculate BMDLs.

COMMENT: EPI/TOX VIEW ON NONPARAMETRIC BAYES: DOSING PRECISION

Chao W. Chen
National Center of Environmental Assessment, US Environmental Protection Agency

INTRODUCTION

I appreciate the opportunity to review this thought-provoking work by Burzala and Mazzuchi from the TOX-EPI viewpoint. They propose to analyze uncertainty in the dose–response relationship using nonparametric Bayes over the class of all nondecreasing dose–response functions. The key features of their approach are that a parameter α controlling the spread of the prior around the mean dose response can be defined, and that the best results can be obtained by setting α equal to M + 1, where M is the number of dose levels in a bioassay. Uncertainty on dose response is defined as the model's suitability to recover the observational probability, or equivalently, the uncertainty about the number of experimental subjects responding in each experiment. I will make some general comments on uncertainty in dose–response modeling, and specific comments on the approach proposed to characterize uncertainty about the dose–response relationship.

BIOLOGICAL AND STATISTICAL UNCERTAINTIES

Both biological and statistical uncertainties are important in risk assessments. Burzala and Mazzuchi consider mainly statistical uncertainty at high doses. Since the major uncertainty in risk assessment is the shape of the dose–response curve at low doses, not at doses where bioassays are conducted, it is relevant to ask the question: Is it possible to extend the approach to incorporate biological uncertainties?


The challenge is how the usually incomplete biological information can be incorporated into dose–response modeling, and how the uncertainty is characterized. For instance, it is often argued that a threshold effect exists when a U-shaped dose–response curve is observed. However, observing a U-shaped dose–response curve does not automatically lead to a threshold effect, since the chemically induced effect at low doses can easily be lost in the background noise through statistical and/or biological uncertainties. Assuming that the observed response at dose d is given by f(d) = background rate + P(d), it can be shown that by randomness alone there is a significant probability of observing a "dip" below the background rate when P(d) is small.

Biological uncertainty can also impinge on the analysis of a dose–response model at low doses. For example, suppose that the background rate of tumor incidence is 5%, and further that exposure to a small amount of an agent at dose d prevents 40% of "initiated" cells from progressing to become background tumors for some biological reason such as inhibition or competition. Then the background contributes only 60% of its tumor incidence when exposed. Thus, at low doses, the observed tumor incidence is f(d) = 0.05 × 0.6 + P(d), which can be less than the background rate of 5% when P(d) is small relative to the background rate. So we have a U-shaped dose–response curve, as is often observed in bioassays. However, we can still have P(d) > 0 for all doses d; that is, there is no threshold for the excess dose–response function P(d). In this case the compound would have a therapeutic benefit below certain doses, but only for a population with a high background tumor rate. Confronted with an observed nonmonotonicity in response at low dose, how can we determine whether this is attributable to a biological mechanism or to mere statistical fluctuations? This also raises an issue about the reasonableness of using the isotonic uncertainty (PAV) algorithm when the observed nonmonotonicity is real.
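To make the randomness point concrete, a small simulation (a sketch with illustrative numbers, not data from the chapter) shows how often a dose group with a genuinely elevated response rate can nonetheless appear to dip below background:

    import numpy as np

    rng = np.random.default_rng(1)
    n, background, excess = 50, 0.05, 0.01   # illustrative group size and rates
    trials = 100_000
    control = rng.binomial(n, background, trials)
    exposed = rng.binomial(n, background + excess, trials)
    # Fraction of trials in which the exposed group shows FEWER responses
    # than the control group: an apparent "dip" below the background rate.
    print(np.mean(exposed < control))  # roughly 1 in 3 under these assumptions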

FREQUENTIST VERSUS SUBJECTIVIST INTERPRETATION OF UNCERTAINTY ABOUT RISK PREDICTION AT LOW DOSES

It is timely for risk assessors to consider the Bayesian framework and to take advantage of its flexibility in incorporating biological information, as well as its interpretation of probability as a degree of belief or state of knowledge rather than as a relative frequency. It may not always be easy to interpret a risk number using a frequentist interpretation. What does a "lifetime risk of 10⁻⁷" mean? Should this be interpreted as a limiting relative frequency of occurrence of a specified cause of death in a suitable population? Is this population statistically homogeneous; that is, does any subsample chosen before observing cause of death have the same statistical properties? This, of course, is quite unlikely. If the population is not homogeneous, then the lifetime risk must be interpreted as a limiting relative frequency under "random sampling." What arguments would compel us to regard any given individual as a "random" sample from this population?


Assigning such a limiting frequency to any given individual involves a subjective degree of belief. These questions become more insistent when the risk estimate of p = 10⁻⁷ is derived from a model with low-dose linearity: a default assumption that is presumed to be conservative (health protective), and is thus a subjective judgment.

SPECIFIC REMARKS

Regarding the nonparametric Bayesian model of Burzala and Mazzuchi, it might make sense from a toxicological viewpoint to allow the precision parameter α to depend on dose. For example, we might contemplate using α1 and α2, respectively, for low and high doses. This would give us flexibility in capturing the low-dose uncertainty. This differentiation is desirable because there is increasing evidence that the mode of action observed at high doses may not apply at low doses. If this is not feasible, then "model averaging" would seem to have more practical utility. By considering a class of dose–response functions (e.g., functions with low-dose linearity) to support our prior distribution, we could capture our low-dose uncertainty more transparently.

RESEARCH NEEDS

In conclusion, it is useful to underscore topics for future research that emerge from the discussion of uncertainty in dose response.

1. The major uncertainty concerns low-dose response. The challenge is to incorporate incomplete biological information in estimating risk and characterizing the uncertainty of risk estimates.
2. Comparison of nonparametric Bayes with other methods, such as the isotonic regression method (Cooke), Bayesian procedures, and model averaging (Ryan and Whitney), would be most useful.
3. Models that can incorporate biological information should be developed.

COMMENT: REGULATOR/RISK PERSPECTIVE ON CHAPTER 3: FAILURE TO COMMUNICATE

Dale Hattis
George Perkins Marsh Institute, Clark University

My title, of course, comes from the classic film Cool Hand Luke. In the course of administering a series of terribly brutal beatings, the prison warden advises the prisoner, "I'm afraid what we have here is a failure to communicate." I, however, will be much gentler than any prison warden, but the point should be taken that if elaborate Bayesian mathematical/statistical modeling analysis is to successfully inform health risk assessments, then the modelers must take considerable pains to relate their mathematical formulations to recognizable causal properties of the toxicological processes being represented. Why is this important?



• Past experience and theories of the causal mechanisms of harm are important components of our "prior" information. In contrast to positivist-oriented "frequentist" school statisticians, who want to draw all information for analysis from "the data" in a particular data set, I would think that Bayesians should be especially mindful that causal mechanism information be incorporated into the mathematical structures used for analysis. Only by incorporating a few discrete and identifiable possibilities related to causal mechanism theories can a quantitative analysis shed even meager light on which mechanisms are made more versus less likely by the ever limited available data. If there are multiple plausible causal theories for a particular relationship between a toxicant and a toxic outcome, then there should be some meta-theory incorporating variants structured to represent this mechanistic uncertainty.




• Risk projections for doses outside the range of observations critically depend on mechanistic theories and their quantitative implications.

In this regard I have a particular problem with the "prior" used in the analyses. The only constraint placed on the mathematical functions considered for prediction of the incidence of effects is that the functions must be nondecreasing with increasing dose. Paradoxically this imposes both too little and too much restriction on the functions used, relative to a reasonable set of mechanistically driven hypotheses. There is in a sense too much restriction because there are in fact some respectable mechanistic theories that can produce nonmonotonic dose–response behavior over some ranges of dose (Andersen and Conolly, 1998; Bogen, 1998; Slikker et al., 2004a, 2004b). These include the following:



• Induction of repair processes that can do more good than harm over a specific dose range of the inducing chemical, by reducing or counteracting "background" damage processes caused by other agents and/or general oxidative metabolism.

• Toxicants that mimic signaling molecules (e.g., hormones) often bind to some related receptors with greater affinity than to others. Sometimes binding to one receptor induces an opposite response than binding to another. A switch will occur in the direction of the parameter changes affected by the signaling when a higher affinity receptor approaches saturation and the binding increases with increasing dose. The increased binding predominantly involves the lower affinity type of receptor molecule, resulting in signal-induced changes in the opposite direction at some dose high enough to saturate many of the high-affinity receptors.

This, of course, raises the question of how a mechanistic perspective would help lead one to a different analysis of the specific data sets provided for this workshop. For the first data set I will start by noting that the presentation of the problem is rather unrealistic: we are given no information about either the chemical (e.g., structure, toxic properties of related chemicals) or about the endpoint (which could allow us to draw inferences from toxic mechanisms associated with that endpoint for other chemicals). Even without this routinely available information, what is apparent is that the dose–response relationship is convex upward: the effect flattens out toward a plateau as dose increases to higher levels. The most common mechanism for this kind of flattening is an approach to saturation either of a metabolic activation process or of binding to a receptor. This is usually described by Michaelis–Menten enzyme kinetics:

    Activation rate = (Vmax × Dose)/(Km + Dose).


This type of equation strictly applies to cases where a single molecule of substrate or signaling molecule binds to the active site of the enzyme or receptor. It produces a linear dose response in the limit of low dosage (where the dose is small relative to Km), but an approach to a maximum (Vmax) in the limit of high doses, where Km is small relative to the dose. There is a generalization, called the Hill equation, that approximates the case where more than one binding molecule of a polymeric enzyme or receptor is needed to cause effects:

    Activation rate = (Vmax × Doseⁿ)/(Kmⁿ + Doseⁿ).

While saturating at high doses, this generalization of the equation allows for some upward-turning nonlinearity at lower doses. I would use the former equation (with n implicitly set to 1) unless there were good reason to suspect the upward-turning low-dose nonlinearity reflected in the second equation.

For the second data set, we at least have some good information about the nature of the response. "Hyperkeratosis" is a thickening of the skin, most famously associated with exposure to inorganic arsenic at very high levels. The dose–response data in this case show little or no response over a modest background until very high dose levels are reached. From this I would draw the mechanistic inference that we are most likely dealing with an effect that takes place by overwhelming some set of homeostatic controls. I would therefore adopt a log-probit mathematical form, based on the idea that the individuals in a population are likely to have a lognormal distribution of individual thresholds for manifesting this response, as has been the case for many other examples of classical toxicity (Hattis et al., 1999; Hattis, 2008). If the data were to require a generalization of this basic model to achieve a reasonable fit, I would usually postulate a mixture of two or more lognormal distributions, on the assumption that there might be subgroups (because of genetic differences, past exposures to interacting substances, or pathologies), each of whose population response functions might be described by its own separate lognormal component. The basic reason why lognormals are assumed is that the factors causing individuals to differ tend to be numerous and to exert multiplicative effects on individual susceptibility. Multiplying random variables means that the logarithms of the variable influences on susceptibility add, and the variability distribution for the overall sum of the logs is expected to approach a normal distribution. To model the human risk distribution in this type of case, I would use the animal data only to estimate an uncertainty distribution for the animal ED50. This is in part because the data from genetically uniform animals, raised under tightly restricted environmental conditions and dosed at very close to the same age, are expected to reflect a considerably smaller amount of variability than the out-bred human population. As previously described (Hattis et al., 2002), I would use empirically derived uncertainty distributions for human variability and other considerations now treated as single-point "uncertainty factors" in the standard type of EPA analysis.


For both the third and the fourth data sets it is clear that we are dealing with tumor responses, but with apparent saturation at high doses. The most likely general form is therefore a one- or multistage model (in the usual polynomial form) with a Michaelis–Menten transformation of dose. I recently had occasion to apply just this model to the analysis of tumor data for a chemical with activating metabolism (trichloropropane) in the course of a review of the proposed IRIS analysis for that chemical. In that case I found that Michaelis–Menten parameters for metabolic activation in different organs could be estimated from existing DNA adduct data (La et al., 1995) at two doses (Tables 3.30 and 3.31). This allowed me to transform the administered doses to "low-dose equivalent" doses: the doses that would have been sufficient to induce the observed DNA adducts had there been no approach to Michaelis–Menten saturation of the internal doses, as indicated by the DNA adducts.

TABLE 3.30 DNA adduct levels in relation to dose for different rat organs, and estimated local Km's for saturable metabolism

Organ               Dose      Adducts (μmole/   Standard       Standard   30/3 Adduct   Suggested Km (local,
                    (mg/kg)   mole guanine)     Deviation (a)  Error      Ratio (b)     units of external mg/kg)
Forestomach         3         3.7               —              0.92       —             —
                    30        14.6              —              3.62       3.9           14.6
Glandular stomach   3         3.8               —              0.94       —             —
                    30        20.4              —              5.06       5.4           28.3
Kidney              3         6.6               1.4            0.7        —             —
                    30        38.9              5              2.5        5.9           35.8
Liver               3         5.4               0.7            0.35       —             —
                    30        47.6              21             10.5       8.8           198
Pancreas            3         5.3               1              0.5        —             —
                    30        37.8              12.8           6.4        7.1           64.1
Spleen              3         0.8               0.06           0.03       —             —
                    30        7.1               1.8            0.9        8.9           210
Tongue              3         4                 —              0.99       —             —
                    30        20.4              —              5.06       5.1           25

(a) Cases where no standard deviation is given represent the results of measurements on pooled samples from different animals. Their standard errors are estimated from the square root of the average sum of squares of the coefficients of variation (standard deviation divided by the mean) for organs where there were separate measurements on 4 animals.
(b) This should be 10 if there were no saturation. The fact that it is less than 10 in all 7 cases in this table is very unlikely to have occurred unless there was some saturation of a process in the pathway on the way to the formation of the adducts.


TABLE 3.31 DNA adduct levels in relation to dose for different mouse organs, and estimated local Km's for saturable metabolism

Organ               Dose      Adducts (μmole/   Standard       Standard   60/6 Adduct   Suggested Km (local,
                    (mg/kg)   mole guanine)     Deviation (a)  Error      Ratio (b)     units of external mg/kg)
Forestomach         6         19.8              —              7.38       —             —
                    60        41                —              15.28      2.1           8.1
Glandular stomach   6         28.1              —              10.48      —             —
                    60        208.1             —              77.57      7.4           148.1
Kidney              6         4.4               2.9            1.45       —             —
                    60        32.5              11.3           5.65       7.4           146.6
Liver               6         12.1              4.6            2.3        —             —
                    60        59.3              21.7           10.85      4.9           45.9
Brain               6         0.43              0.11           0.055      —             —
                    60        3                 0.2            0.1        7.0           118.6
Spleen              6         0.61              —              0.23       —             —
                    60        7.8               —              2.91       12.8          Not meaningful
Heart               6         0.38              —              0.14       —             —
                    60        2.4               —              0.89       6.3           86.6
Lung                6         0.77              0.16           0.08       —             —
                    60        5.3               0.2            0.1        6.9           113.3
Testes              6         0.32              0.14           0.07       —             —
                    60        1.2               0.6            0.3        3.8           26.4

(a) Cases where no standard deviation is given represent the results of measurements on pooled samples from different animals. Their standard errors are estimated from the square root of the average sum of squares of the coefficients of variation (standard deviation divided by the mean) for organs where there were separate measurements on 4 animals.
(b) This should be 10 if there were no saturation. The fact that it is less than 10 in 8 of the 9 cases in this table is very unlikely to have occurred unless there was some saturation of a process in the pathway on the way to the formation of the adducts.

The result was a great improvement in the fit of a one-stage model to the data and a two- to threefold upward revision in the tumor risk expected at low doses (Tables 3.32 and 3.33).

In conclusion, mechanistic perspectives can often shed light on appropriate "priors" for model forms. Such "priors" are particularly needed and helpful for analyses of data sets with limited numbers of data points, and where projections of risks are needed outside the range of the direct observations.
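As an illustrative sketch of the dose transformation described above (Python; the two-dose solution for Km follows algebraically from the Michaelis–Menten form, and the helper names are ours, not from the original analysis), with the rat forestomach entries from Table 3.30 as a check:

    def km_from_two_doses(d1, a1, d2, a2):
        # Solve A proportional to d/(Km + d) for Km, given adduct levels
        # a1, a2 measured at doses d1 < d2, with R = a2/a1:
        # Km = d1*d2*(R - 1)/(d2 - R*d1).
        r = a2 / a1
        return d1 * d2 * (r - 1) / (d2 - r * d1)

    def low_dose_equivalent(d, km):
        # Dose that would give the same adduct level under linear
        # (unsaturated) kinetics: Km*d/(Km + d).
        return km * d / (km + d)

    km = km_from_two_doses(3, 3.7, 30, 14.6)   # rat forestomach, Table 3.30
    print(km)                                   # about 14.6 mg/kg
    print(low_dose_equivalent(30, km))          # about 9.8, a third of nominal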


TABLE 3.32 Results of fitting a one-stage carcinogenesis model to the alimentary system tumor data for rats: Administered dose compared with two measures of "low-dose equivalent" estimates of delivered DNA adduct doses

Data Set and Dosimeter                                          P for Fit   MLE q1   UCL q1     BMD10   BMDL10
Full data set, nominal doses                                    4E-04       0.199    Not done   0.529   Not done
Full data set, systemic Km low-dose equivalent doses            0.054       0.262    0.314      0.402   0.336
Full data set, forestomach only Km low-dose equivalent doses    0.46        0.351    0.417      0.300   0.253
Three lower dose points, nominal doses                          0.009       0.240    0.294      0.438   0.358

TABLE 3.33 Results of fitting a one-stage carcinogenesis model to the alimentary system tumor data for mice: Administered dose compared with two measures of “low-dose equivalent” estimates of delivered DNA adduct doses

Data Set and Dosimeter                                          P for Fit   MLE q1   UCL q1     BMD10   BMDL10
Full data set, nominal doses                                    8E-10       0.216    Not done   0.489   Not done
Full data set, systemic Km low-dose equivalent doses            3E-07       0.262    Not done   0.401   Not done
Full data set, forestomach Km low-dose equivalent doses         0.059       0.664    0.823      0.159   0.128
Three lower dose points, nominal doses                          3E-06       0.288    0.294      0.365   0.358
Three point systemic Km low-dose equivalent doses               2E-05       0.324    Not done   0.325   Not done


REFERENCES

Andersen ME, Conolly RB. 1998. Mechanistic modeling of rodent liver tumor promotion at low levels of exposure: An example related to dose–response relationships for 2,3,7,8-tetrachlorodibenzo-p-dioxin. Belle Newsletter 7(2):2–8.
Bogen KT. 1998. Mechanistic model predicts a U-shaped relation of radon exposure to lung cancer risk reflected in combined occupational and U.S. residential data. Belle Newsletter 7(2):9–14.
Hattis D. 2008. Distributional analyses for children’s inhalation risk assessments. J Toxicol Environ Health 71(3):218–26.
Hattis D, Banati P, Goble R, Burmaster D. 1999. Human interindividual variability in parameters related to health risks. Risk Anal 19:705–20.
Hattis D, Baird S, Goble R. 2002. A straw man proposal for a quantitative definition of the RfD. Final technical report, U.S. Environmental Protection Agency STAR grant R825360, Human variability in parameters potentially related to susceptibility for noncancer risks. Full version available on the web at http://www2.clarku.edu/faculty/dhattis; shortened version Drug Chem Toxicol 25:403–36.
La DK, Lilly PD, Andregg RJ, Swenberg JA. 1995. DNA adduct formation in B6C3F1 mice and Fischer-344 rats exposed to 1,2,3-trichloropropane. Carcinogenesis 16:1419–24.
Slikker W Jr, Andersen ME, Bogdanffy MS, Bus JS, Cohen SD, Conolly RB, David RM, Doerrer NG, Dorman DC, Gaylor DW, Hattis D, Rogers JM, Setzer RW, Swenberg JA, Wallace K. 2004a. Dose-dependent transitions in mechanisms of toxicity. Toxicol Appl Pharmacol 201:203–25.
Slikker W Jr, Andersen ME, Bogdanffy MS, Bus JS, Cohen SD, Conolly RB, David RM, Doerrer NG, Dorman DC, Gaylor DW, Hattis D, Rogers JM, Setzer RW, Swenberg JA, Wallace K. 2004b. Dose-dependent transitions in mechanisms of toxicity: Case studies. Toxicol Appl Pharmacol 201:226–94.

RESPONSE TO COMMENTS

Lidia Burzala

We are very grateful for the reviewers’ thoughtful comments. They are addressed in order of appearance.

ROGER COOKE

Issue 1. We agree that the NPB model is complex and could tax the patience of nonmathematical readers. Friendly software enabling the nonmathematical user to get results by only specifying the inputs would help. A good nontechnical narrative could also help; nontechnical users don’t need to understand the details, but they do need to understand the “storyline.”

Issue 2. Formula (3.8) is indeed a linear extrapolation between E[P(xi)] and E[P(xi+1)], and this is a result of the form of the piecewise prior potency function P0(x). Burzala (2007) reviews the work of authors who have proposed convex, concave, and ogive prior potency curves, and these could be employed as well.

Issue 3. The BMDL, that is, the 5th percentile of the dose that realizes the benchmark response (BMR), can be obtained in some cases, but at considerable computational cost. Observe that the posterior mean of P(x), given by (3.6), is the average of the potency curves at dose x determined by the MCMC runs. Therefore the BMDL—the dose that realizes the 5th percentile of the distribution of doses achieving the BMR—can be approximated in an MCMC sample by a dose at which 5% of the sample potency curves are above P(BMD) (see Figure 3.39).

Figure 3.39 BMR, BMD, and BMDL for a set of sample dose–response curves. (The figure plots probability of response against dose, with the BMDL below the BMD. BMDL: 5% of doses realizing the BMR are equal to or below the BMDL; equivalently, 5% of curves at the BMDL are greater than or equal to the BMR.)

The procedure to derive the BMDL is presented below.

1. Find i such that E[P(xi)] < P(BMD) < E[P(xi+1)], where E[P(xi)] and E[P(xi+1)] are defined in the chapter by (3.6).

2. Count how many potency curves at dose xi are above P(BMD). Note that the potency curve based on the sth MCMC run at dose xi is defined as

$$p_{i-1}^{(s)} + \left(p_{i+1}^{(s)} - p_{i-1}^{(s)}\right)\frac{\vartheta_i^{(s)}}{\vartheta_i^{(s)} + \vartheta_{i+1}^{(s)}}.$$

3. If the percentage of curves at dose xi situated above P(BMD) is less than 5%, then choose x* ∈ (xi, xi+1), and count the potency curves at this dose, defined as

$$p_{i}^{(s)} + \left(p_{i+1}^{(s)} - p_{i}^{(s)}\right)\frac{\vartheta^{*}}{\vartheta_{i+1} - \vartheta^{*}},$$

and situated above P(BMD). Repeat this step until you find the dose x* for which 5% of the potency curves from the MCMC runs are above P(BMD) (x* is then the BMDL).


TABLE 3.34 BMDL results for data sets

Data Set                  BMR          BMD      BMDL
First data set            BMR = 0.1    6.23     2.601
                          BMR = 0.05   2.46     —(a)
                          BMR = 0.01   —(b)     —
Second data set—male      BMR = 0.1    9.29     0.816
                          BMR = 0.05   0.71     —
                          BMR = 0.01   0.019    —
Second data set—female    BMR = 0.1    25.57    8.600
                          BMR = 0.05   10.35    0.554
                          BMR = 0.01   0.67     —
Second data set—all       BMR = 0.1    10.20    1.08
                          BMR = 0.05   1.05     0.279
                          BMR = 0.01   0.15     —
Third data set            BMR = 0.1    7.03     4.061
                          BMR = 0.05   2.81     —
                          BMR = 0.01   —        —
Fourth data set—I type    BMR = 0.1    5.48     —
                          BMR = 0.05   2.32     —
                          BMR = 0.01   0.105    —
Fourth data set—II type   BMR = 0.1    2.34     —
                          BMR = 0.05   1.17     —
                          BMR = 0.01   0.32     —
Fourth data set—all       BMR = 0.1    2.92     0.32
                          BMR = 0.05   1.59     —
                          BMR = 0.01   0.53     —

(a) The BMDL is not possible to calculate, since the percentage of potency curves above P(BMD) at dose x = 0 is greater than 5%.
(b) The BMD is not possible to calculate, since P(BMD) < E[P(0)], which suggests that the BMD takes a negative value.

If the percentage of curves at dose xi situated above P(BMD) is greater than 5%, then find the largest k < i such that the percentage of curves at xk above P(BMD) is smaller than 5%. Then choose x* ∈ (xk, xk+1), and count the curves at this dose, defined as

$$p_{k}^{(s)} + \left(p_{k+1}^{(s)} - p_{k}^{(s)}\right)\frac{\vartheta^{*}}{\vartheta_{k+1} - \vartheta^{*}},$$

and situated above P(BMD). Repeat this step as in the previous case. The results for the BMDL are given in Table 3.34. The time required for the computations (in MATLAB) is given in Table 3.35.


TABLE 3.35 Computational burden of DR and BMD, BMDL calculations

Data Set                  Time of DR Calculations   BMR          Time of BMD and BMDL Calculations
First data set            1.54 min                  BMR = 0.1    0.005 s
                                                    BMR = 0.05   —(a)
                                                    BMR = 0.01   —
Second data set—male      2.14 min                  BMR = 0.1    0.003 s
                                                    BMR = 0.05   —
                                                    BMR = 0.01   —
Second data set—female    1.35 min                  BMR = 0.1    0.002 s
                                                    BMR = 0.05   0.002 s
                                                    BMR = 0.01   —
Second data set—all       2.61 min                  BMR = 0.1    0.002 s
                                                    BMR = 0.05   0.0023 s
                                                    BMR = 0.01   —
Third data set            2.84 min                  BMR = 0.1    0.002 s
                                                    BMR = 0.05   —
                                                    BMR = 0.01   —
Fourth data set—I type    1.47 min                  BMR = 0.1    —
                                                    BMR = 0.05   —
                                                    BMR = 0.01   —
Fourth data set—II type   2.71 min                  BMR = 0.1    —
                                                    BMR = 0.05   —
                                                    BMR = 0.01   —
Fourth data set—all       4.61 min                  BMR = 0.1    0.0034 s
                                                    BMR = 0.05   —
                                                    BMR = 0.01   —

(a) Not possible to calculate the BMDL.

CHAO CHEN

Chao Chen suggested that it might make sense from a toxicological viewpoint to allow the precision parameter α to depend on dose. For example, two different parameters α1 and α2 could be used, respectively, for low and high doses. Although this idea is intuitively appealing, it can create mathematical difficulties. If we had attempted to concatenate two Dirichlet processes for high and low doses with precision parameters α1 and α2, then we would have had to ensure that no discontinuities in potency curves arose at the boundary between these two regimes. It is not clear how, or indeed whether, this can be arranged.


DALE HATTIS

Dale Hattis indicated some problems with the constraints on the prior, namely the nondecreasing behavior of prior functions with increasing dose. As he suggested, there are indeed some respectable mechanistic theories that can produce nonmonotonic dose–response behavior over some ranges of dose. The NPB model does not take into account additional information such as the chemical structure or toxic properties of the chemical. It is therefore an appropriate model for the simple cases where only bioassay data are available without any other information.

4 QUANTIFYING DOSE–RESPONSE UNCERTAINTY USING BAYESIAN MODEL AVERAGING

Melissa Whitney and Louise Ryan
Department of Biostatistics, Harvard School of Public Health

4.1 INTRODUCTION

Regulatory agencies such as the US Environmental Protection Agency (EPA) often face the difficult task of establishing environmental standards and regulations in contexts where substantial knowledge and/or data gaps lead to significant uncertainty regarding the right decision. Developing effective strategies for decision making in such settings has become a major thrust for the EPA in recent years. Indeed decision making in the face of uncertainty has been the focus of a number of workshops and National Academy reports. Recently Resources for the Future convened a workshop entitled Uncertainty Modeling in Dose Response, with the goal of addressing the specific question of how uncertainty issues can be appropriately reflected in dose–response modeling. In preparation for the workshop, invited attendees were asked to tackle a variety of interrelated uncertainty issues individually, with the final goal of comparing their various approaches to uncertainty analysis at the workshop.

To explore different methodologies for uncertainty analysis, four bioassay data sets were utilized, each characterizing a unique uncertainty issue. The first data set was taken from the Benchmark Dose Software available through the EPA’s website, and this data set was utilized for its simplicity of only three dose groups and no extraneous covariates. Researchers were then left with the task of choosing the appropriate structure for the model when there are only three groups of subjects, with only 50 animals per group, and a range of 1 to 28 responses per group. The second problem set posed to researchers consisted of two data sets, one for male and another for female rats, each exposed to four (differing) dose levels. To model this data set, researchers had to determine (1) whether separate dose–response curves are needed for each sex and (2) to what extent the decision to combine or report separate estimates affects the overall uncertainty in risk estimation. Similarly a third data set presented results for two separate outcomes, two different types of tumor formation. Researchers had to decide whether to combine these endpoints for dose–response modeling or whether each possible outcome (i.e., type of tumor formed) should be modeled separately. Finally, the last data set consisted of two studies, and researchers were left to determine whether these studies should be combined for purposes of risk estimation. These four data sets were designed to demonstrate some of the recurring dilemmas researchers face, namely what structural forms should be used to model the data and whether data from different experiments, animal types, or outcome types should be combined in dose–response modeling and subsequent risk analysis.

Our strategy for tackling the challenges posed by the workshop data sets was to use Bayesian model averaging (BMA). A number of authors have found the use of BMA to be an effective strategy for addressing uncertainty associated with model choice in a variety of practical settings (Viallefont et al., 2001; Brown et al., 1998; Volinsky et al., 1997; Raftery et al., 1997). BMA is appealing in the sense that quantities of interest are averaged with respect to a set of candidate models, with weights proportional to the posterior probability that each model is correct, given the observed data. Consequently the approach gives more weight to estimates obtained from models that fit the data fairly well, while estimates corresponding to poorly fitting models are downweighted. These properties of BMA were shown by Morales et al. (2006) to be effective for quantifying model uncertainty in a risk assessment setting. Their work was motivated by an analysis of epidemiological data from southwestern Taiwan on cancer risk associated with exposure to arsenic in drinking water. Although there were a number of problems with the arsenic data set, including the fact that it lacked individual-level exposure information and had no information related to important confounders such as smoking behavior, Morales et al. (2000) calculated Benchmark Dose (BMD) estimates (the dose that leads to a specified increase in cancer risk, compared with unexposed subjects) and reported considerable sensitivity of the results to the choice of underlying dose–response model. Depending on whether the dose was modeled linearly or on the log scale, and whether the risk was quantified on an additive or multiplicative scale, estimated BMDs for male lung cancer ranged from 42 to 70 parts per billion (ppb), with nonoverlapping confidence intervals. In contrast, the BMD obtained through model averaging led to an estimate of 60 ppb, with a slightly wider confidence interval than those obtained from individual models, reflecting the additional uncertainty due to model choice (Table 4.1) (Morales et al., 2006).

TABLE 4.1 Bayesian model averaging of arsenic data and resulting averaged BMD estimate

Individual Models      BMD (95% CI)
Add, linear dose       42 (40, 43)
Multi, linear dose     91 (78, 107)
Multi, log dose        70 (60, 83)
Model averaged         60 (47, 74)

In this chapter, we explore the use of BMA as a tool for quantifying uncertainty in the contexts of (1) model structure choice and (2) model covariate selection for any given model structure. Two data analyses are used to illustrate these concepts. Data set I consists of EPA BMD test data for which researchers must determine an appropriate model structure choice and possible dose transformations when fitting the data. Data set II consists of separate data for male and female subjects and poses the model covariate selection question of whether sex should be taken into account when calculating risk due to exposure. We apply multiple techniques for conducting Bayesian model averaging to these two data sets and subsequently use the BMA method to calculate a measure of excess risk due to exposure known as the Benchmark Dose (BMD).

4.1.1 Bayesian Model Averaging for Toxicological Data and Benchmark Dose Estimation

Although the above-mentioned example utilized environmental epidemiological data, the flexible BMA framework can be adapted for quantal response toxicology data analysis. Using the notation of Clyde and George (2004) and Hoeting et al. (1999), let Δ be the quantity we wish to estimate by Bayesian model averaging. In these analyses we are interested in estimating a Benchmark Dose (BMD) (Crump, 1984) that accounts for model selection uncertainty, thus incorporating information from a variety of reasonably fitting dose–response models. To estimate Δ = BMD, we utilize the additive risk definition of the Benchmark Dose when analyzing data sets I and II. Thus the BMD is defined as the dose that increases the probability of an adverse effect from P0 in an unexposed subject to P0 + BMR, where BMR is the benchmark response, a level of response deemed epidemiologically/biologically relevant (Budtz-Jørgensen et al., 2000). For this analysis, we chose BMR = 0.1, consistent with common EPA practice (US EPA, 2004, 2007). We calculated the lower bound of the BMD, the BMDL, using bootstrap methods. In particular, to calculate the BMDL for each model, we generated 1000 data set samples drawn at random under a given model, calculated the BMD

for each sample, and then estimated the BMDL as the lower 5% cutoff value for the BMDs calculated from the 1000 samples.

To carry out BMA to find a model-averaged exposure–response curve and the resulting averaged BMD estimate, we first specify a set of suitable models. We consider K total models, which are typically weighted equally a priori, p(Mk) = 1/K for k = 1, 2, …, K. Next we compute the posterior model probabilities, Pr(Mk | Data), that reflect the likelihood that a model holds “true” given the observed data. The posterior model probability for a given model k can be written as

$$\Pr(M_k \mid \mathrm{Data}) = \frac{p(\mathrm{Data} \mid M_k)\,p(M_k)}{\sum_{j=1}^{K} p(\mathrm{Data} \mid M_j)\,p(M_j)}, \qquad (4.1)$$

where $p(\mathrm{Data} \mid M_k) = \int p(\mathrm{Data} \mid \theta_k, M_k)\,p(\theta_k \mid M_k)\,d\theta_k$, which follows from Bayes’ theorem. In addition to calculating posterior model probabilities for each model, we use classical estimation methods to calculate an estimate of excess risk for each model run, Δ*k. Finally, we average the results using posterior probabilities as weights such that poorly fitting models get downweighted and better-fitting models contribute more strongly to the final, averaged estimate of risk.

After obtaining posterior model probabilities using equation (4.1), Pr(Mk | Data), and using classical estimation procedures to obtain a quantity of interest for each model, Δ*k, we can then use the posterior model probabilities as model weights to obtain the BMA estimates of the unconditional expectation and variance, that is, the averaged estimate of risk over all models examined. The unconditional expected value and variance of Δ given the posterior model probabilities, pk = Pr(Mk | Data), are

$$E(\Delta \mid \mathrm{Data}) = \sum_{k=1}^{K} \Delta_k^{*}\,p_k, \qquad (4.2)$$

where $\Delta_k^{*} = E(\Delta \mid M_k, \mathrm{Data})$, and

$$\mathrm{Var}(\Delta \mid \mathrm{Data}) = \sum_{k=1}^{K} \mathrm{Var}(\Delta_k^{*} \mid M_k, \mathrm{Data})\,p_k + \sum_{k=1}^{K} (\Delta_k^{*} - \bar{\Delta}^{*})^2\,p_k, \qquad (4.3)$$

where K is the total number of models considered and $\bar{\Delta}^{*} = E(\Delta \mid \mathrm{Data})$. The variance formula above separates the variance into two components: the variance due to the estimation procedure for each individual model, $\sum_{k=1}^{K} \mathrm{Var}(\Delta_k^{*} \mid M_k, \mathrm{Data})\,p_k$, and the uncertainty due to model selection, $\sum_{k=1}^{K} (\Delta_k^{*} - \bar{\Delta}^{*})^2\,p_k$ (Hoeting et al., 1999).

To implement Bayesian model averaging, one must calculate the posterior model probabilities, pk = Pr(Mk | Data), for each of the K models using equation (4.1). However, this requires solving an integral that is difficult to calculate in all but the simplest cases, and many posterior distributions for models used in environmental epidemiological studies do not have closed-form solutions (Clyde and George, 2004). Both fully Bayesian methods and frequentist approximations to posterior model probabilities have been utilized to carry out BMA. Fully Bayesian methods include closed-form solutions (which rarely exist when modeling toxicology data) and Markov chain Monte Carlo methods such as reversible jump MCMC, the Carlin–Chib method, stochastic search variable selection, and Gibbs variable selection. The most common frequentist analytical approximation used to estimate posterior model probability is the BIC approximation described below. For data set I, the BIC approximation to the posterior model probability is utilized. For data set II, both the BIC approximation and Gibbs variable selection are employed. These methods are not exhaustive. Researchers must assess the computational feasibility of these various techniques when determining which model-averaging methods are most appropriate for a given data set.
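Given per-model estimates, within-model variances, and posterior weights, equations (4.2) and (4.3) reduce to a few lines of R; the following sketch uses hypothetical inputs for illustration:

bma_combine <- function(est, var_within, post) {
  post <- post / sum(post)            # renormalize in case of rounding
  avg  <- sum(post * est)             # equation (4.2)
  v    <- sum(post * var_within) +    # within-model estimation variance
          sum(post * (est - avg)^2)   # between-model (selection) variance
  list(mean = avg, var = v)           # var follows equation (4.3)
}

# Hypothetical inputs for three candidate models:
bma_combine(est = c(2.3, 2.9, 9.5), var_within = c(0.4, 0.5, 1.2),
            post = c(0.45, 0.35, 0.20))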

4.1.2 The BIC Approximation

Several analytical techniques have been used to approximate the marginal distribution of the data, p(Data | Mk). Raftery (1995) derived the following approximation using Schwarz’s Bayesian information criterion (BIC) (Schwarz, 1978) to approximate the posterior model probabilities: $p(\mathrm{Data} \mid M_k) \propto \exp(-0.5\,\mathrm{BIC}(M_k))$, and thus

$$\Pr(M_k \mid \mathrm{Data}) := \frac{p(M_k)\exp(-0.5\,\mathrm{BIC}(M_k))}{\sum_{j=1}^{K} p(M_j)\exp(-0.5\,\mathrm{BIC}(M_j))},$$

with BIC defined as BIC(Mk) = −2 log(maximized likelihood | Mk) + dim(Mk) log(n), where dim(Mk) is the number of parameters for Mk and n is the sample size. This approximation tends to work well in settings with independent covariates and moderate to large sample sizes (Wasserman, 2000).
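A short R sketch of this approximation (bic_weights is an illustrative helper, not part of any package; equal prior weights are the default):

bic_weights <- function(bic, prior = rep(1 / length(bic), length(bic))) {
  # Subtracting min(bic) before exponentiating avoids numerical underflow
  w <- prior * exp(-0.5 * (bic - min(bic)))
  w / sum(w)
}

bic_weights(c(40.2, 41.9, 44.1))   # hypothetical BIC values for three models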

4.1.3 Simulation Techniques

The most common way to calculate posterior model probabilities is to simulate from the posterior distribution using Markov chain Monte Carlo (MCMC) methods. The simplest simulation techniques, however, are designed for sampling from distributions of parameters with fixed dimensionality. In the model-averaging framework we aim to characterize the posterior distributions of models with different numbers of parameters. Standard MCMC theory does not apply when the dimension of the parameter space is allowed to vary, and model averaging may violate conditions necessary to ensure Markov chain convergence (Carlin and Chib, 1995). Gibbs variable selection (GVS; Dellaportas et al., 2002; George and McCulloch, 1993) addresses the problem of varying sample space dimension. The data set II analysis utilizes GVS, and its implementation is described below in the context of covariate selection.

4.2 DATA SET I ANALYSIS: UNCERTAINTY DUE TO CHOICE OF LINK FUNCTION/DOSE–RESPONSE MODEL

For the analysis of data set I (presented in Table 4.2), we consider a variety of models, most of which are included in EPA’s BMD software, BMDS Version 1.4.1b. We consider only nonzero background models, both for their biological plausibility and because of computational issues with calculating the BMD for zero-background models when P(Response at d = 0) = 0. We fit logit, probit, multistage, and Weibull models, and consider nontransformed dose d, log-transformed dose log(d), additive background effects, adding an effective dose term to the models, and simple empirical models for the data. For each of these models we adopted the common constraints on parameter values suggested by the US EPA Benchmark Dose Software (2007) and Gart et al. (1986). In total, 10 models are fit to the data (Table 4.3). These 10 models can be fitted easily using maximum likelihood, which involves choosing the values of the unknown parameters that maximize

$$ll = \sum_{j=1}^{J} \left\{ r_j \log\left(P(d_j)\right) + (n_j - r_j) \log\left(1 - P(d_j)\right) \right\},$$

where rj is the number of responses observed among the nj subjects exposed at the jth dose level, dj. We used the nonlinear optimizing function nlminb in R (Version 2.6.2) to fit all models.
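As an illustration, the following R sketch fits one of the ten candidates—model 9 of Table 4.3, the simple logistic—to the data set I counts by minimizing the negative log likelihood with nlminb (a sketch of the approach described above, not the authors’ exact code; starting values are arbitrary):

dose <- c(0, 21, 60); n <- c(50, 49, 45); r <- c(1, 20, 28)   # Table 4.2

# Negative of ll for model 9: p(d) = exp(b1 + b2*d) / (1 + exp(b1 + b2*d))
negll <- function(b, resp = r) {
  p <- plogis(b[1] + b[2] * dose)
  -sum(resp * log(p) + (n - resp) * log(1 - p))
}

fit <- nlminb(start = c(-2, 0.05), objective = negll,
              lower = c(-Inf, 1e-8))    # enforces the constraint beta2 > 0
fit$par                                 # maximum likelihood estimates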

TABLE 4.2 Data set I: BMD technical guidance data

Dose   Number of Subjects   Number of Responses
0      50                   1
21     49                   20
60     45                   28


TABLE 4.3 Ten models considered in data set I analysis

      p(d)                                                        Constraints
 1.   β3 + (1 − β3) e^(β1 + β2 log(d)) / (1 + e^(β1 + β2 log(d)))  β2 > 0; 0 < β3 < 1
 2.   β3 + (1 − β3) Φ(β1 + β2 log(d))                              β2 > 0; 0 < β3 < 1
 3.   β3 + (1 − β3) (1 − e^(−β1 d − β2 d²))                        β1, β2 > 0; 0 < β3 < 1
 4.   β3 + (1 − β3) (1 − e^(−β1 d^β2))                             β2 > 1; 0 < β3 < 1
 5.   e^(β1 + β2 log(d + β3)) / (1 + e^(β1 + β2 log(d + β3)))      β3 > 0
 6.   Φ(β1 + β2 log(d + β3))                                       β3 > 0
 7.   1 − e^(−β1 (d + β3) − β2 (d + β3)²)                          β1, β2, β3 > 0
 8.   1 − e^(−β1 (d + β3)^β2)                                      β2 > 1; β3 > 0
 9.   e^(β1 + β2 d) / (1 + e^(β1 + β2 d))                          β2 > 0
10.   Φ(β1 + β2 d)                                                 β2 > 0

Table 4.4 shows the estimated parameters for each of the 10 models, and Figure 4.1 depicts these models graphically. We calculated the posterior probability for each of the 10 models using the BIC approximation to the posterior model probability, as described above (Table 4.4). Both the table and graph demonstrate that similarly well-fitting models have nearly identical posterior model probabilities. Models 1, 2, 5, and 6 provide the best fit to the data, are quite similar in structure, and have the highest shared posterior model probabilities (approximately 0.18 each, or over 0.70 combined posterior probability). The models that fit the data poorly, particularly the simple logistic and probit models, models 9 and 10, have the lowest posterior probabilities of 0.02 and 0.03, respectively. Figure 4.1 also reveals that these models fit the data poorly.

For each model, we then calculated the BMD by solving for x such that p(Dose = x) − p(Dose = 0) = BMR, with BMR = 0.1. Lower limits on x, the BMDLs, were computed via parametric bootstrapping (Table 4.4). Finally, utilizing the posterior model probabilities and individual model BMDs and BMDLs reported in Table 4.4, we calculated a model-averaged BMD of 5.3112 and a corresponding averaged lower limit of 2.9071, using equation (4.2). These averaged values capture and quantify the model selection uncertainty and overall variability observed in the risk estimates and the graphical depictions of the 10 exposure–response curves fit above. Note that the weighted average of the BMDL should be interpreted carefully. From a regulator’s perspective, it would be better to compute a lower limit of the estimated BMDs. This would more naturally be done using an MCMC approach, though frequentist versions would also be possible.
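A sketch of the BMD and bootstrap BMDL calculations for the model 9 example given earlier (dose, n, negll, and fit are assumed from that sketch; this is an illustration, not the authors’ code):

# BMD solves p(x) - p(0) = BMR for the fitted curve
p_hat <- function(d, b) plogis(b[1] + b[2] * d)
bmd <- function(b, bmr = 0.1)
  uniroot(function(x) p_hat(x, b) - p_hat(0, b) - bmr, c(1e-6, 1e4))$root

# Parametric bootstrap: resample counts under the fit, refit, recompute the BMD
set.seed(1)
boot <- replicate(1000, {
  resp <- rbinom(length(n), n, p_hat(dose, fit$par))
  bmd(nlminb(fit$par, negll, resp = resp, lower = c(-Inf, 1e-8))$par)
})
c(BMD = bmd(fit$par), BMDL = unname(quantile(boot, 0.05)))   # BMDL = lower 5% cutoff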

TABLE 4.4 Estimated parameters and estimated posterior probabilities using the BIC approximation: Resulting Benchmark Dose for the 10 models considered (numbered as in Table 4.3), and BMDLs calculated using bootstrap methods

Model   Model Fit (β̂1, β̂2, β̂3)       Pr(Mk | Data)   BMD       BMDL
 1      (−0.2692, 0.5874, 0.0200)     0.1772          2.3406    0.0287
 2      (−0.1685, 0.3612, 0.0200)     0.1772          2.8418    0.0562
 3      (0.6805, 0.00001, 0.0251)     0.0597          9.5423    7.9702
 4      (0.6805, 1.0000, 0.0251)      0.0597          9.5422    8.0666
 5      (−0.2240, 0.5685, 0.0016)     0.1772          2.9295    0.0065
 6      (−0.1413, 0.3525, 0.0044)     0.1772          2.5798    0.0064
 7      (0.6805, 0.00001, 0.0374)     0.0597          9.5423    7.9041
 8      (0.6805, 1.0000, 0.0374)      0.0597          9.5422    7.9854
 9      (−2.2419, 2.2017)             0.0215          22.6390   19.6551
10      (−1.3587, 1.3452)             0.0310          20.9702   18.0580


Figure 4.1 All 10 models, predicted dose–response curves for data set I.

TABLE 4.5 Data set II: Frambozadrine data

         Dose (mg/kg-day)   Number of Rats   Number of Responses, Hyperkeratosis
Male     0                  47               2
         1.2                45               6
         15                 44               4
         82                 47               24
Female   0                  48               3
         1.8                49               5
         21                 47               3
         109                48               33

4.3 DATA SET II ANALYSIS: UNCERTAINTY DUE TO COVARIATE ADJUSTMENT

To analyze data set II (Table 4.5), we use Bayesian model averaging to address whether males and females should be combined when estimating risk due to frambozadrine exposure, or, rather, whether separate dose–response curves are necessary. Whereas the data set I analysis utilized BMA to examine model structure uncertainty, the data set II analysis uses BMA to address model covariate selection uncertainty given a particular model structure, logistic regression. Under a traditional approach to data analysis, researchers would (1) consider fitting two separate models versus pooling the data to form a combined model, (2) examine the model fits, and then (3) draw inferences from the “final” model(s) chosen, ignoring the alternative estimates and the uncertainty arising from the model-selection process. Past research has demonstrated that this model selection process can underestimate true variability and uncertainty and thereby result in overconfident decision making (Draper, 1995).

The goals of the data set II analysis are twofold. First, we address model uncertainty arising from covariate selection by comparing two strategies for implementing model averaging. Second, we use the averaged model to calculate benchmark doses that account for the uncertainty involved in modeling data set II. To carry out Bayesian model averaging, we approach the data set as a covariate selection problem and compare the three (multiple) logistic regression models:

M1: p(Response) = g(β0 + β1 Dose),
M2: p(Response) = g(β0 + β1 Dose + β2 Sex),
M3: p(Response) = g(β0 + β1 Dose + β2 Sex + β3 Dose × Sex).

As depicted above, model 1 (M1) combines males and females to calculate a single dose–response curve, and model 2 (M2) assumes a common dose effect but allows for an additive gender effect. Finally, the interaction model, M3, is the equivalent of fitting two entirely separate dose–response curves for males and females. Under a standard, naive analysis, the three models would be compared using standard model selection criteria, such as the AIC. The model with the lowest AIC would be chosen as the final model from which all inferences would be made. In this case the combined model (M1) would be utilized, with the lowest AIC value of 36.37, with no indication that researchers also considered separate dose–response curves for males versus females (Table 4.6).

Bayesian model averaging can be utilized to combine information across these three models in lieu of traditional model/covariate selection methodology. As with the analysis of data set I, the BIC approximation to the posterior model probability is utilized. In addition to this analytical approximation, we use a simulation technique, Gibbs variable selection, to conduct the model averaging and compare results. Gibbs variable selection proceeds by noting that regression models with various numbers of included/excluded covariates can be written as

$$\mathrm{logit}\left(p(y = 1)\right) = \sum_{j=0}^{3} g(j)\,X_j\,\beta_j,$$

where g(j) = 1 if the jth variable is included in the model, and g(j) = 0 otherwise. The variables X1, X2, and X3 refer to dose, sex, and dose × sex, while X0 is the intercept term, identically equal to 1.
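Before averaging, each candidate can be fitted directly with glm; the sketch below fits M1–M3 to the Table 4.5 counts and converts BIC values into approximate posterior weights using the bic_weights helper sketched earlier. (Absolute AIC/BIC values depend on whether the likelihood is written per animal or per dose group, so they may differ from Table 4.6 by a constant; only the differences between models matter.)

framb <- data.frame(                      # Table 4.5 counts
  dose = c(0, 1.2, 15, 82, 0, 1.8, 21, 109),
  n    = c(47, 45, 44, 47, 48, 49, 47, 48),
  resp = c(2, 6, 4, 24, 3, 5, 3, 33),
  sex  = rep(c("M", "F"), each = 4)
)

m1 <- glm(cbind(resp, n - resp) ~ dose,       binomial, data = framb)  # M1
m2 <- glm(cbind(resp, n - resp) ~ dose + sex, binomial, data = framb)  # M2
m3 <- glm(cbind(resp, n - resp) ~ dose * sex, binomial, data = framb)  # M3

sapply(list(M1 = m1, M2 = m2, M3 = m3), AIC)   # traditional selection by AIC
bic_weights(sapply(list(m1, m2, m3), BIC))     # approximate Pr(Mk | Data)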


TABLE 4.6 Traditional approach to covariate selection versus BMA approach

Logistic regression model—full model: Pr(Response) = g(β0 + β1 Dose + β2 Sex + β12 Dose × Sex)

Model (k)               β0       β1       β2       β12     AIC     pk (GVS)   pk (BIC)
1. All data combined    −2.639   0.0313   0        0       36.37   0.546      0.639
2. Sex effect only      −2.531   0.0315   −0.175   0       38.04   0.443      0.266
3. Separate models      −2.501   0.0309   −0.229   0.001   40.02   0.011      0.095

Note: (1) Akaike information criterion for model selection; (2) estimated posterior model probabilities, pk = Pr(Mk | Data), using Gibbs variable selection; (3) estimated posterior model probabilities using the BIC approximation.

Introducing the variable indicator function g(·) reduces the framework to one of fixed dimensionality. We can then utilize standard simulation techniques to estimate g(·) and θk = (βk, τ) for all models Mk, k = 1, 2, 3. GVS was implemented for the data set II analysis using WinBUGS 1.4, by way of the following framework:

Likelihood:

    Y[i] ~ binom(p[i], n[i]),
    logit(p[i]) = β0 + g(1) × β1 × Dose[i] + g(2) × β2 × Sex[i] + g(3) × β12 × Dose[i] × Sex[i].

(This equation is expressed “dosewise”; in the case of dose zero, where there are both male and female rats, the equation should be read “ratwise.”)

Priors:

    g(j) ~ Bernoulli(0.5) for j = 1, 2, 3,
    βj ~ N(0, τ),
    τ ~ gamma(0.1, 0.1).

Gibbs sampling first samples each variable indicator g(j), then βj, and finally τ. Table 4.6 gives the posterior model probabilities for the three models of interest estimated using the GVS method in WinBUGS. The combined data model had the highest posterior probability (pk = 0.546). The model that allowed for an additive sex effect had posterior probability 0.443, and the model allowing for a sex × dose interaction (i.e., the equivalent of modeling male and female data via two separate models) had a posterior probability of only 0.011.


TABLE 4.7 Comparison of two BMA methods: BIC approximation and MCMC–GVS method

Variable     GVS Method (Fully Bayesian)   BIC Approximation
Dose         Pr(Dose) = 1.0                Pr(Dose) = 1.0
             EV: 0.0317                    EV: 0.03149
             SD: 0.00344                   SD: 0.00389
Sex          Pr(Sex) = 0.454               Pr(Sex) = 0.292
             EV: −0.0311                   EV: −0.0552
             SD: 0.2249                    SD: 0.212
Dose × Sex   Pr(Interaction) = 0.011       Pr(Interaction) = 0.27
             EV: −0.00249                  EV: −0.000218
             SD: 0.249                     SD: 0.00289

Note: EV is $E(\beta \mid \mathrm{Data}) = \sum_{k=1}^{K} \beta_k^{*}\,p_k$, where $\beta_k^{*} = E(\beta \mid M_k, \mathrm{Data})$ and pk is the posterior model probability. SD is the square root of the variance, with the variance defined as above: $\sum_{k=1}^{K} \mathrm{Var}(\beta_k^{*} \mid M_k, \mathrm{Data})\,p_k + \sum_{k=1}^{K} (\beta_k^{*} - \bar{\beta})^2\,p_k$.

Next the BIC approximation was used to estimate the posterior model probabilities. We implemented the BIC approximation method using Chris Volinsky’s R BMA package (function bic.glm). The final column of Table 4.6 gives the estimated posterior model probabilities for the three models of interest estimated using the BIC approximation. As with GVS, the BIC approximation finds that the combined model has the highest posterior probability (0.639). M2 has posterior probability 0.266, and fitting separate models via M3 is least supported by the data, with posterior probability 0.095. A comparison of the GVS method and the BIC approximation model-averaged estimates is reported in Table 4.7, which gives the estimated posterior probabilities for inclusion of variables in the true model, the model-averaged coefficient estimates, and the standard deviations calculated using equations (4.2) and (4.3) for both methods. The GVS and BIC results are quite similar in terms of model-averaged dose estimation and posterior probability of a dose effect, with probability of dose inclusion in the “true” model of approximately 1, and model-averaged coefficient values of 0.0317 and 0.0315 for dose, respectively. Nonetheless, these methods deviate with respect to inclusion of a sex term and a dose–sex interaction term.

4.4 DISCUSSION

We have demonstrated that benchmark dose estimates are highly dependent on the particular dose–response curve calculated. In adopting a Bayesian model averaging framework, we accounted for the additional variability due to choosing a “final” dose–response curve, and we used BMA to provide benchmark doses that more accurately reflect uncertainty in our understanding of the effects of exposure on the occurrence of adverse responses.

For the data set I analysis, model averaging performed well, as the best-fitting models all had high posterior probabilities and thus larger “weight” in the final averaged model estimates of risk due to exposure. In addition the models with very similar structural forms had nearly the same posterior probabilities. The top four models accounted for over 70% of the total posterior probability and had an average BMD value of 2.673. Nonetheless, researchers must be careful in choosing which models to include for model averaging. The overall model-averaged BMD including ill-fitting, lower weighted models was 5.311, revealing the penalty for including model structures ill-suited to the data. Naively averaging over ill-performing models may result in biologically implausible model fits and can lead to nonmonotonic curves for which risk estimates such as the BMDL cannot be obtained. In addition researchers must include plausible constraints for models to ensure that they perform reasonably when fitting the data. Weibull models were particularly unstable absent suitable constraints.

For the data set II analysis, both the BIC approximation method and the GVS method for carrying out BMA performed similarly in terms of calculating posterior model probabilities, namely the “weight” of the evidence in favor of modeling male and female rat data separately. In this case small posterior probabilities for the full logistic regression model with an interaction term (0.011 using GVS and 0.095 using BIC) demonstrate that separate estimates of excess risk due to exposure are probably unnecessary. Nonetheless, the BMA approach is worthwhile because it accounts for the uncertainty arising from the decision-making process of fitting multiple models. This contrasts with the traditional approach, which ignores plausible alternatives when reporting results and thus may result in overconfident inferences regarding risk.

The sensitivity of the model-averaged BMD highlights the importance of careful specification of the prior set of candidate models. In reality, of course, no single model is likely to be “correct.” The best one can hope for is the use of a sensible model that is flexible enough to capture the key features of the data. In terms of specifying the model space for the purpose of toxicological dose–response modeling, the set of models allowed for by the EPA’s Benchmark Dose Software is a good start. It would be of value to expand this set to include additional models, for example, those motivated by biological considerations and perhaps reflecting specific mechanisms of toxicity. A related question concerns specification of the prior weights assigned to each model. We have taken the simplest approach of allowing each model the same a priori weight. However, other approaches are possible. For example, expert opinion could be used to elicit prior model weights.

For data set II, the resulting model-averaged coefficient estimates and posterior probability estimates differ somewhat for the GVS and BIC approximation methods. One must exercise caution in applying the BIC approximation to smaller data sets where covariates are not independent, such as in our data set II analysis, which includes an interaction term (Wasserman, 2000). In addition the BIC approximation exacts a strong penalty for model complexity, giving lower posterior probability to models with an increasing number of parameters. Accordingly researchers must consider which method(s) for implementing BMA are most appropriate for a given data set.

This chapter has addressed uncertainty due to model structure and model covariate selection. There are other sources of uncertainty worthy of exploration for quantal response toxicology data, such as uncertainty due to dose levels where data are sparse or risk estimate uncertainty that may arise from the presence of outliers or lack of monotonicity. As always, researchers must struggle with the question of when a formal uncertainty analysis will be applicable for a particular data set. We have demonstrated that Bayesian model averaging can serve as a useful tool for accounting for several sources of uncertainty. Ongoing work aims to utilize alternative simulation methods to calculate BMDs, BMDLs, and resulting variance estimates for both data sets I and II. In addition we plan to utilize the reversible jump MCMC simulation method for implementing BMA for data set I in order to compare this method with the BIC approximation to the posterior model probability calculated above.

While the examples presented in this chapter illustrate some of the usefulness of BMA techniques for dose–response modeling in the presence of uncertainty, many questions remain. For example, BMA is primarily useful for estimating a reliable dose–response model in settings where there is uncertainty about the model, either in terms of the particular class of dose–response models that should be used or the covariates to be included. Environmental risk assessment, on the other hand, is concerned with protecting the public. It could be argued that the best estimate of the average risk is less important than a reliable characterization of the upper limit of the risk associated with various dose or exposure levels. Exploring extensions of the BMA technology to address this would be worthwhile.

REFERENCES

Brown PJ, Vannucci M, Fearn T. 1998. Multivariate Bayesian variable selection and prediction. J Roy Statist Soc B 60:627–42.
Budtz-Jørgensen E, Grandjean P, Keiding N, White RF, Weihe P. 2000. Benchmark dose calculations of methylmercury-associated neurobehavioural deficits. Toxicol Lett 112–113:193–9.
Clyde M, George EI. 2004. Model uncertainty. Statist Sci 19(1):81–94.
Crump KS. 1984. A new method for determining allowable daily intakes. Fund Appl Toxicol 4:854–71.
Draper D. 1995. Assessment and propagation of model uncertainty. J Roy Statist Soc B 57:45–97.
Gart JJ, Krewski D, Lee PN, Tarone RE, Wahrendorf J. 1986. Statistical Methods in Cancer Research. Volume III: The Design and Analysis of Long-Term Animal Experiments. International Agency for Research on Cancer (IARC) Scientific Publications.
George EI, McCulloch RE. 1993. Variable selection via Gibbs sampling. J Am Statist Assoc 88:881–9.
Hoeting JA, Madigan D, Raftery AE, Volinsky CT. 1999. Bayesian model averaging: A tutorial. Statist Sci 14:382–417.
Morales KH, Ibrahim JG, Chen CJ, Ryan LM. 2006. Bayesian model averaging with application to benchmark dose estimation for arsenic in drinking water. J Am Statist Assoc 101(473):9–17.
Morales KH, Ryan L, Kuo TL, Wu MM, Chen CJ. 2000. Risk of internal cancers from arsenic in drinking water. Environ Health Perspect 108(7):655–61.
Raftery AE. 1995. Bayesian model selection in social research. Sociological Methodology 25:111–63.
Raftery AE, Madigan D, Hoeting JA. 1997. Bayesian model averaging for linear regression models. J Am Statist Assoc 92:179–91.
Schwarz G. 1978. Estimating the dimension of a model. Ann Statist 6(2):461–4.
US EPA. 2000. Benchmark Dose Technical Guidance Document. External Review Draft EPA/630/R-00/001. http://www.epa.gov/ncea/pdfs/bmds/BMD-External_10_13_2000.pdf.
US EPA. 2004. An Examination of EPA Risk Assessment Principles and Practices. EPA/100/B-04/001. Washington, DC: US Environmental Protection Agency.
US EPA. 2007. Benchmark Dose Software (BMDS). Version 1.4.1b. http://www.epa.gov/ncea/bmds.
Viallefont V, Raftery AE, Richardson S. 2001. Variable selection and Bayesian model averaging in epidemiological case-control studies. Statist Med 20:3215–30.
Volinsky C, Madigan D, Raftery AE, Kronmal RA. 1997. Bayesian model averaging in proportional hazard models: Predicting the risk of a stroke. Appl Statist 46:443–8.
Wasserman L. 2000. Bayesian model selection and model averaging. J Math Psychol 44(1):92–107.

COMMENT

MATH/STATS PERSPECTIVE ON CHAPTER 4: BAYESIAN MODEL AVERAGING

Michael Messner
US Environmental Protection Agency, Office of Ground Water and Drinking Water

INTRODUCTION

The views expressed here are my own, and not those of EPA. My job title is Mathematical Statistician, so the “Math/Stats View” seems fitting. That said, my perspective is that of statistical decision analysis. My experience is that data are usually dirtier (more complex) than the examples studied in this workshop. There are typically more explanatory variables and confounders. Moreover the usual errors (precision, bias) are also accompanied by blunders, so we always need to look for and deal with outliers. My own blunders have been due to not considering a sufficiently full set of models in analyzing dose–response data. My expectation is that model averaging with a sufficient set of competing models will suggest a good plan for obtaining new data (value-of-information analysis).

BACKGROUND

Anticipating the study by Whitney and Ryan, I prepared by gathering some general information about Bayesian model averaging (BMA). BMA has a home page (http://www.research.att.com/∼volinsky/bma.html), an R package (http://cran.r-project.org/src/contrib/Descriptions/BMA.html), and a tutorial (http://www.stat.colostate.edu/∼jah/papers/statsci.pdf). A nice, clear paper on case-control studies using BMA is by Viallefont, Raftery, and Richardson (2001; see Viallefont V, Raftery AE, Richardson S. Variable selection and Bayesian model averaging in case-control studies, Statist Med 20:3215–30). I found no guidance for constructing priors to inform BMA. Although some of the software allows prior weights to be entered for the competing models, how does one select the prior weights?

SOME REVIEW COMMENTS

In Chapter 4 Whitney and Ryan provide a nice overview of their approach and findings when applying BMA to two of the workshop’s data sets. Not included, however, are the interesting details that Ryan so often described as “weird” in her workshop presentation. Each of those weird behaviors could have been an opportunity to learn something, and I have to wonder what was learned but not described in the chapter. For one, didn’t the bootstrap sampling (for the first problem) sometimes generate odd-looking (weird) MLE dose–response functions? Did that happen whenever the bootstrap sample failed to include healthy subjects at the very lowest dose or diseased subjects at the highest dose? Something could have been learned here about the bootstrap when applied to small binary data sets. When is the bootstrap appropriate? What advantages or disadvantages does the bootstrap offer, when compared to Markov chain Monte Carlo sampling?

AN IDEA / PROPOSAL

Rather than challenge BMA and other methods with real data (where we don’t know the “correct” model or parameter values), why don’t we generate some artificial data using at least three different model forms (where we select the model and the parameters)? We can create data sets of any size, but most interesting or challenging would be sets large enough to be somewhat informative regarding parameter values, but not so large as to give high confidence about the model. We might use 6 dose levels, 40 subjects each. Without revealing the models, we ask our analysts to analyze the data using different methods (classic, simple Bayes, model averaging, reversible jump MCMC, etc.). We might ask the analysts to provide estimates of benchmark dose in the form of central estimates and 95% confidence or credible intervals, and also to estimate the probability of illness at one or two particular dose levels, again reporting central estimates and 95% intervals. We could then see which methods do best.

One can imagine variations on this theme. Under the usual reading, model uncertainty is epistemic rather than aleatoric; that is, one model generates all the data, but we don’t know which. However, model uncertainty can also be aleatoric; that is, subjects could randomly be assigned different DR models. Could BMA detect or “fit” this aleatoric model uncertainty (or should we call this “variability”)? In the limit, every individual might have his/her own dose–response function, and perhaps none of these is in our limited family of functions. Is our family large enough to approximate any reasonable aggregate dose–response function? With excellent information (data from a huge dose–response study), might we learn that all our models are wrong, but that a mixture or model-average characterization is excellent? Is BMA superior to other methods when “fitting” a case where truth is a mixture of highly variable subject-specific models?

COMMENT

EPI/TOX PERSPECTIVE ON CHAPTER 4: USE OF BAYESIAN MODEL AVERAGING FOR ADDRESSING UNCERTAINTIES IN CANCER DOSE–RESPONSE MODELING

Margaret Chu
California Air Resources Board, California Environmental Protection Agency

Modeling to quantitatively assess human cancer risks from exposures to environmental agents has many uncertainties. What’s the prospect of Bayesian model averaging for reducing or better describing the uncertainties in assessing environmental carcinogenic risks? Cancer is a multitude of diseases. The dominant and most flexible theory is that cancer development is multistage and many factors contribute to its progression. The number of chemicals in commerce is large. Only a small number of these have been tested for their carcinogenic potential. By and large, the two-year bioassay is still the gold standard for identifying carcinogens. But the bioassay is optimized for the detection of the agent’s potential for inducing cancer (hazard identification), not for quantitative determination of the magnitude of carcinogenic risk in humans. Fewer chemicals have human data, and those are mainly from occupational epidemiology studies. To a large extent the approach for estimating human carcinogenic risks is to relate the dose of a suspected carcinogen to tumor incidence on the basis of animal bioassay results.

In its 1983 book, Risk Assessment in the Federal Government: Managing the Process, the NRC talks of pervasive uncertainty in risk assessments, that there is no easy solution, and that more, and better, knowledge of the health impact assessed is needed. The conceptualization of how an agent affects cancer development underpins the modeling. There’s uncertainty in species differences such as physiology, metabolism, pharmacokinetics, and tissue and target organ susceptibilities. Extrapolations using high-dose animal testing to project human cancer risks vary as the biological concept of carcinogenesis varies.

My opinion is that Bayesian model averaging, when used appropriately, is one of many statistical tools that can help illuminate uncertainties in cancer dose–response analysis. BMA has been applied to teasing out critical factors in estimating the risks of patients for cardiovascular diseases—a multifactorial and multistage chronic disease. With new experimental tools there’s an explosion of molecular, cellular, and pathological information regarding cancer development in humans and animals. Data and information on carcinogenesis are also more complex. BMA could potentially be used to tease out the contributions of biological parameters in estimating cancer risks in data-rich situations. BMA could also be used to analyze model uncertainties when more than one model is supported by the data. I think this is an area worth exploring to support the development of a framework for when and how it should be used.

COMMENT

REGULATORY/RISK PERSPECTIVE ON CHAPTER 4: MODEL AVERAGES, MODEL AMALGAMS, AND MODEL CHOICE

Adam M. Finkel
School of Public Health, Department of Environmental and Occupational Health, University of Medicine and Dentistry of New Jersey, Piscataway

The question “What must I do about uncertainty?” should prompt not a quick answer, but this return question: “Are you treating this as an estimation problem or as a decision problem?” Estimators—the people who estimate, that is, not the statistical constructs themselves—see the world differently than “deciders” do, and that is all well and good. Treat an estimation problem as a decider would, and no harm may befall; treat, on the other hand, a decision problem as if it were an estimation problem, and human, environmental, and financial resources may be squandered or worse.

The estimator’s goal is akin to the counterfeiter’s—to produce something that will resemble the original so closely that no one will notice the difference, or perhaps in such a way that glaring differences will occur only rarely. Failing to achieve such a goal can lead to embarrassment but can just as well lead to knowledge. The decision-maker’s goal, on the other hand, is exogenous to the problem itself, and can’t be grafted onto the situation from any generic principles. An accurate (e.g., in the sense of minimizing root-mean-square divergence from the true value) estimator may be a terrible guide to an optimal decision, depending entirely on some raw materials that estimators are trained to ignore—the properties of the available choices and the loss functions associated with any mismatch between the actual consequences of making a choice and the best consequences possible.


The study by Whitney and Ryan demonstrates elegantly how Bayesian model averaging (BMA) can lead to improved estimates, but it does not explore the relationship, if any, between BMA and improved decisions. In part, this reflects the different ways of looking at the world, but in part, this arises from the narrow estimation problems the workshop attendees were all asked to consider. In this response, I will explore the disconnect between estimation and decision for the two problems Whitney and Ryan consider (accounting for multiple plausible functional forms of the dose–response relationship, and accounting for a possible systematic difference between the susceptibility of male and female rodents), and then extend my comments to evaluate a more vexing but arguably more common type of estimation problem (amenable to BMA, but with many pitfalls) that the workshop did not present.

MULTIPLE BENCHMARK DOSE ESTIMATES—BE CAREFUL HOW YOU AVERAGE

Whitney and Ryan’s analysis of the first data set using 10 different dose–response models shows clearly that there exists an “objective” alternative to the expert-driven selection of model weights. Weighting each model by the Bayesian posterior probability that the model is correct given the observed data takes the nonstatistical information out of the process, and that, of course, can be seen either as the fundamental strength or the fundamental weakness of the procedure. To the extent that sound expert judgment gauges which model(s) fit the observed data more or less well because they accurately reflect underlying biological theory, the autonomic nature of the procedure Whitney and Ryan used could be a major disadvantage. Perhaps one priority for future methods development could be the exploration of hybrid processes that integrate statistical fit and biological insights.

Regardless of how the weights are derived, the results Whitney and Ryan present reveal what to me is a problematic implicit choice about how to average model predictions together, one that underscores the disparity between estimation and decision. Some estimators are simply ill-suited to being summarized by their arithmetic average. For example, the ratio of two uncertain quantities is itself uncertain, but if one feels compelled to summarize the resulting pdf via a single statistic, the arithmetic mean of the ratio is a particularly unfortunate choice. Consider two quantities (they could be risks, in units of statistical fatalities per year, or any other quantity): A is distributed lognormally with median 8 and logarithmic standard deviation 1.732 (the square root of 3), and B is also lognormal, with a median of 10 and logarithmic standard deviation of 1. Which risk is larger? Does it help to know that the arithmetic mean of the quantity [B ÷ A] is approximately 9? But then does it trouble you to learn that the arithmetic mean of [A ÷ B] is approximately 6? Is B both nine times larger than A and one-sixth as large as A?(1) No. “On average” (in the sense of the typical case) B is slightly larger than A (and there is about a 55% chance B > A), but large possible values of [B ÷ A] and of [A ÷ B] dominate their respective arithmetic means and make it appear that both ratios are greater than 1 at the same time.

The same pitfall plagues Whitney and Ryan’s calculation of the model-averaged BMDL, which is a summary statistic for a collection of fifth-percentile values taken from 10 different pdfs. Their “model-averaged BMDL” of 2.9071 is the weighted (arithmetic) sum of the 10 values in the rightmost column of their Table 4.4. To a decision-maker, this would imply that human exposure to 2.9071 units of the chemical tested in data set I is “safe” to a high level of assurance—but only by viewing Table 4.4 in its entirety can s/he see that there is a 71% probability (4 times 0.1772) that the actual lower bound on the BMD is between 0.0064 and 0.0562, a factor of 52 to 454 more protective of human health than the weighted arithmetic mean. How can the “best estimate of the lower 5th percentile” be a value 50 to 500 times higher than values that have a 71% chance of being correct? The models with large arithmetic values of the BMDL are squelching the signal from the much more plausible models that say the BMDL needs to be set much lower to protect the population.

There is insufficient information in Whitney and Ryan’s analysis to calculate an averaged BMDL in what I would argue is the more appropriate fashion—by choosing through nested Monte Carlo simulation a BMD value from each model (the choice of which model to use at each iteration arising out of a random draw determined by the model probabilities, and the choice of BMD conditional on this choice arising from a bootstrap procedure), and then reporting the 5th percentile value from a large number of iterations.(2) But an approximate analogue to this procedure is simply to compute the (weighted) geometric mean(3) of the 10 BMDL values in their Table 4.4, which equals 0.103—30-fold lower than the weighted arithmetic mean, and only 2 to 16 times higher than the BMDLs associated with the four models that actually dominate the probability space. I still believe, however, that the decision-maker would be best served by a slightly more complete summary rather than by any single average: “we are unsure which model is correct, but four closely related models that together have about a 70% chance of being correct all suggest that a BMDL of roughly 0.01 is necessary to protect the public, while six less-plausible models all suggest the BMDL could be set at either 8 or 20.”


1. The ratio of two lognormal distributions is itself lognormal, with median equal to the ratio of the two medians and logarithmic standard deviation equal to the square root of the sum of the squares of each logarithmic standard deviation. So [A ÷ B] has median 0.8 and logarithmic standard deviation of 2, whereas [B ÷ A] has median 1.25 and logarithmic standard deviation of 2. Because the arithmetic mean always exceeds the median—by a factor of exp(0.5 × (logarithmic standard deviation)²)—both distributions have an arithmetic mean greater than 1.
2. I made up a simpler example to show how this would be done and to confirm my intuition about the result. Assume there are two equiprobable models: under Model 1, the BMD is 100, distributed lognormally with a logarithmic standard deviation of 1.4, so that the BMDL is 10. Under Model 2, the BMD is 1, with the same standard deviation, so its BMDL is 0.1. The weighted arithmetic average of the two BMDLs is 5.05 ((10 + 0.1) ÷ 2). But if you sample a BMD from each distribution with equal probability, the 5th percentile (of 10,000 Monte Carlo iterations here) is 0.18, a factor of 28 smaller (more protective). The arithmetic mean of 5.05 in fact is a 45th percentile bound, not a "lower bound" in any colloquial sense at all.
3. The weighted geometric mean of N data points x_i with corresponding weights w_i (when the w_i sum to 1, as they do here) is ∏_{i=1}^{N} x_i^{w_i}. The geometric mean allows small values of x to have some influence on the summary statistic, as opposed to the arithmetic mean, which "averages them away."



0.103—30-fold lower than the weighted arithmetic mean, and only 2 to 16 times higher than the BMDLs associated with the four models that actually dominate the probability space. I still believe, however, that the decision-maker would be best served by a slightly more complete summary rather than by any single average: “we are unsure which model is correct, but four closely related models that together have about a 70% chance of being correct all suggest that a BMDL of roughly 0.01 is necessary to protect the public, while six less-plausible models all suggest the BMDL could be set at either 8 or 20.”
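Both the ratio paradox and the nested Monte Carlo alternative are easy to check numerically. The following sketch (mine, not from the original analysis; the sample size and seed are arbitrary) simulates the A/B example from footnote 1 and the two-model BMD example from footnote 2:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# The A/B example: medians 8 and 10, logarithmic SDs sqrt(3) and 1.
A = rng.lognormal(np.log(8.0), np.sqrt(3.0), n)
B = rng.lognormal(np.log(10.0), 1.0, n)
print(np.mean(B / A), np.mean(A / B), np.mean(B > A))  # ~9, ~6, ~0.55

# Footnote 2's two equiprobable models: lognormal BMDs with medians
# 100 and 1 and logarithmic SD 1.4, so the BMDLs are 10 and 0.1.
use_model_2 = rng.random(n) < 0.5
median = np.where(use_model_2, 1.0, 100.0)
bmd = rng.lognormal(np.log(median), 1.4)
print(np.percentile(bmd, 5))  # ~0.17-0.18, versus the arithmetic mean of 5.05
```

The 5th percentile of the model mixture, not the arithmetic mean of the per-model 5th percentiles, is what a "lower bound" ought to mean to a decision-maker.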

MALE/FEMALE "DIFFERENCES"—BE CAREFUL WHETHER YOU AVERAGE

I think Whitney and Ryan would probably agree that this particular hypothetical data set is ill-suited to exploring sex effects in dose response, as the male and female rodents respond so similarly (the observed response in males at a dose of 82 mg/kg-day was 51%, whereas the interpolated response in females at that dose is approximately 50%; both data sets show the same "dip" in response probability between approximately 2 and 20 mg/kg-day). I assume that the very low (0.01) posterior probability for model M2 under the BIC approach is another way of stating the more intuitive finding that any differences between the two dose–response curves are difficult to attribute to anything other than random fluctuations in the observed response (the hypothesis that the two sexes are equally susceptible cannot be rejected). I think it's also important to note that even if male and female rodents appeared to be systematically different, there is no guarantee that male and female humans will exhibit the same (or any such) difference. As in the analysis of data set I, the purging away of biological insight comes at a cost here as well.

If the "frambozadrine" data set did show a significant sex effect, however, and if we had reason to believe this effect was relevant to humans, the contrast between estimation and decision would be particularly stark and instructive here. Suppose that M2 was highly probable, and implied that males are much more susceptible than females to frambozadrine, such that BMD_F = 5 while BMD_M = 1 (and the BMD for the pooled data would be approximately 3); what decision would make sense here? As always, the only way to think about this sensibly is to consider the loss functions, the consequences of the decision. If this was only a thought experiment (with multiple opportunities to run this experiment over many substances), and the estimator was rewarded for accurate prediction rather than optimal decision, then it might well make sense to pool or average the data and act as if the BMD was 3. If there was some fortuitous way to differentially control the exposures of men versus women (perhaps issuing every man a respirator to wear 24/7 that blocked 80% of the ambient frambozadrine from being inhaled?), then setting an exposure limit based on BMD = 5 might make some sense. But in the real world, basing the



exposure limit on BMD = 5 would presumably be less expensive—but would result in adverse health effects among males—compared to basing the limit on BMD = 1 and protecting both sexes. Without some understanding of the marginal costs and marginal benefits of the more health-protective decision, there is no basis for acting on the additional information gained from decoupling the dose–response curves. Moreover, a decision-maker who anticipated difficulty in explaining what s/he would do if the "for women only" exposure standard was optimal on cost–benefit grounds might well wonder why the analyst is providing information that could not realistically be acted upon.

WHEN MODELS REALLY DIVERGE—BE CAREFUL WHAT YOU AVERAGE

This workshop tackled an important, interesting, but ultimately not very vexing question. Fitting variants on the same basic models to the data they are supposed to explain is "model uncertainty lite" (and arguably really fits under the rubric of parameter uncertainty rather than model uncertainty at all) compared to another kind of model uncertainty in dose response: what should be done when a model (including its closely related kin) fits the data adequately but is based on a questionable biological assumption, not resolvable via any analysis of these data? There are many variations of such "fundamental causal uncertainty" in human health risk assessment, but for simplicity, let's consider a paradigm case: a dose–response model explains the animal response data well and yields a significant risk estimate associated with an exposure of regulatory interest, but there is a nagging concern, whose probability could be a few percent, or far more than 50%, that the human risk at this exposure level would be much smaller than this because the interspecies analogy is qualitatively or quantitatively erroneous. The simplest version of this uncertainty problem (but one the regulatory agencies see all the time in no more complex a form than this) is a "face value" risk estimate of X, but with a probability q that the true human risk is exactly zero (a limiting case of the risk being of magnitude Y (0 < Y < X) with probability q). Putting aside the problems of subjectivity in assessing q (for the statistical techniques explored in this volume don't really apply here), the model-averaging dilemma is whether we should think of this risk as having an expected value of pX (where p = [1 − q]), and/or should act based on this expected value. My "hurricane parable"4 that was the subject of some comments at the RFF workshop has been an attempt to show what I've always thought was uncontroversial—that the decision one

4. I first presented this example in Discover magazine ("Who's Exaggerating?" May 1996, pp. 48–54): rival factions of meteorologists contest whether one causal model applies to a hurricane brewing in the Gulf of Mexico (and that it will therefore hit New Orleans) or whether another model applies that will cause it to hit Tampa. The respective assessed probabilities p and q are such that if one averaged the spatial locations of the two model predictions, the "expected landfall" of the hurricane would be at Mobile.



would make if the unknown quantity was exactly (pX + qY) can be very different from (and grossly inferior to) the decision that has the highest value averaged over p and q. I think some of the participants misunderstood my concern: it's not the act of averaging that troubles me (although my comments up to this point do include concern about how the averaging occurs), but the act of averaging the model predictions rather than averaging the pros and cons of different decisions in light of the model uncertainty. It should be obvious, I hope, that (assuming the resources needed to evacuate either city are identical) the decision to evacuate New Orleans will incur costs A with probability q, with A proportional to the number of lives lost if the hurricane hits Tampa, and the decision to evacuate Tampa will incur costs B with probability p, proportional to the damage to New Orleans. If the expected marginal benefit of evacuating New Orleans rather than Tampa, which equals [pB − qA], is greater than zero, then that choice is preferable. Either choice, of course, is preferable to evacuating Mobile, with its expected cost of [pB + qA]. The point is that the real decision problem involves averaging the consequences, not averaging the model predictions.

Perhaps this analogy suffers from the far-fetched nature of what would or wouldn't be averaged. To illustrate the general point that the expected value is meaningful but not necessarily a sensible guide to decision making, consider instead the decision to invade Iraq. Assume that 10,000 troops were deemed necessary to find and secure any WMD that Iraq might have possessed. If the probability that there were such weapons at all was assessed at p = 0.01 (the famous "one percent doctrine" that is the title of a 2006 book by Ron Suskind), would we have been tempted to send a force of 100 (the "expected number needed") soldiers? Again, the real decision problem is to minimize the probability-weighted cost of sending an adequate force for no reason plus the probability-weighted cost of sending no soldiers if the "one percent" assumption was in fact correct. Now it is certainly possible for the decision corresponding to the expected value of the risk to be sensible, or even no worse than the decision with the highest expected value—but this is only a special case where there is a continuous gradation of possible decisions and when all the relevant loss functions are linear. If the cost of sending a particular number of soldiers went down in lockstep with the ability of the reduced force to find WMD, then sending one percent of the troops might be the right response to a one percent chance of finding weapons—but you would still get the same answer here in a more defensible way by averaging the costs and benefits rather than by averaging the risk.
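To make the consequence-averaging arithmetic concrete, here is a minimal sketch with made-up numbers (p, q, A, and B are all hypothetical; the parable does not supply values):

```python
# p = P(the New Orleans model is right), q = P(the Tampa model is right);
# A and B are the losses incurred when the unevacuated city is hit.
p, q = 0.6, 0.4
A, B = 2_000.0, 5_000.0  # arbitrary loss units

expected_cost = {
    "evacuate New Orleans": q * A,          # fails only if the Tampa model is right
    "evacuate Tampa":       p * B,          # fails only if the New Orleans model is right
    "evacuate Mobile":      p * B + q * A,  # fails under either model
}
for decision, cost in sorted(expected_cost.items(), key=lambda kv: kv[1]):
    print(f"{decision}: expected loss {cost:.0f}")
```

Whatever values are plugged in, "evacuate Mobile" is dominated: its expected loss is the sum of the other two decisions' failure terms.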



POSTSCRIPT—ANOTHER WAY TO RESOLVE MODEL UNCERTAINTY

There is another unspoken but very important tension between estimation and decision when it comes to model uncertainty, one that many other fields of analysis have resolved differently from what the thrust of this volume and much of the scholarly writing about model uncertainty in environmental risk assessment recommends. Every field relies on some fundamental assumptions that are not certain to be correct, and to the extent that analyses use these assumptions or models to make predictions or estimates, they are guilty of "censoring some model uncertainty." Certainly the reliance on core beliefs can become slavish, but is it per se wrong to relegate alternative assumptions to explicit or implicit "footnotes" in the analysis? As a corollary to this question, why does human health risk assessment seem to agonize more than most fields about the alternative predictions that would arise if it turned out that a model not included in the analysis was in some sense correct?5 To answer these questions, we have to acknowledge that in theory, "leaving out" any plausible model at any step in an analysis does lead to overconfidence and does open the door to possible misestimation. Yet we also have to acknowledge that in practice, including all plausible models comes with a price tag. Expanding the analysis to admit more models means delaying the estimation and any decision that can follow from it, and one danger is that because there can always be "one more model" with nonzero weight to consider, the finish line of the analysis can recede faster than our progress toward it. Therefore the pioneers of environmental risk assessment recommended a sensible middle ground—a system that takes more interest in alternative models than do many other fields of analysis, but one that tries to open some escape hatches to avoid "paralysis by analysis." Beginning with the landmark Red Book in 1983, the National Academy of Sciences has encouraged EPA and the other agencies that use risk assessment to identify "default" inferences, supported by theory and evidence, that can be presumed reasonable in the absence of information to the contrary. A system based on defaults could resolve difficult choices about fundamental causal and other model uncertainty in any of three ways, depending on the strength of evidence (biological, statistical, or otherwise) supporting one or more alternatives to each default: (1) reaffirm the Agency's confidence in the default for a given risk assessment (when no credible evidence supports an alternative); (2) "depart" from the default in favor of a different model when credible or compelling evidence supports this choice (see below); or (3) present results from more than one model, as an input to sensible decision-making in light of model uncertainty (see above).

5. For a mundane example, when driving through an intersection with the light green, most of us adopt the "default" assumption that a stopped car to our left or right will remain at rest while we cross. The consequences of being wrong about this are huge, and the probability is probably not trivial. Still we do not often take the welfare-increasing step of making eye contact with the other driver(s) before proceeding. For an EPA-relevant example, regulatory economists routinely assess the costs of pollution control under the default assumption that unemployment is zero or negligible, and therefore that investments in pollution control bring no macroeconomic benefits with them. In both cases decision-makers may be unaware that any model uncertainty is being put to the side.



Two subsequent NAS committees (Science and Judgment in Risk Assessment in 1994 and Science and Decisions: Advancing Risk Assessment in 2008) have tried with limited success to convince EPA to refine its treatment of defaults and departures. The two key recommendations in Science and Judgment were as follows: (1) EPA should articulate a philosophy of defaults that would explain how high the evidentiary bar should be before abandoning one. Although the members were divided (see appendixes N-1 and N-2 of the report) as to whether the standard should be relatively permissive (eager to accept alternative models when some evidence supports them) or relatively reluctant to abandon a default unless the evidence supporting an alternative was compelling, the Committee was unified in cautioning that EPA should not continue to adjudicate these controversies in an ad hoc fashion. (2) EPA should further explain, for each specific recurring choice it will face between default and alternative models (e.g., the perennial dilemmas about whether in a given case a positive tumor response in rodents is irrelevant to human risk, or whether allometric interspecies scaling is inferior to a pharmacokinetic model), what specific scientific questions need to be answered in order for an alternative model to meet the hurdle it defined earlier. The recent Science and Decisions committee reaffirmed these two cornerstones of a sound system for model choice, and was able to agree where the previous group could not on the "height of the bar"—the 2008 report recommended that an alternative should supplant a default when evidence showed it to be "clearly superior" (in legal terms, rather more compelling than a 51/49 "preponderance of the evidence" standard but perhaps not as strict as a "beyond a reasonable doubt" standard). This committee also encouraged EPA to present multiple risk characterizations in certain cases where an alternative model was judged less than "clearly superior" compared with the default, but where "full disclosure" would benefit from decision-makers and the public seeing what effect the alternative would have on the risk estimate. Unfortunately, EPA has always resisted these sorts of recommendations to codify its philosophy of defaults and to spell out for researchers what quantity and quality of evidence will be necessary for the Agency to replace a default with an alternative. I have argued ("Too much of the 'Red Book' is still ahead of its time," Hum Ecol Risk Assess 9(5): 1253–71) that this abdication destroys the incentives for conducting research to improve risk assessment, and encourages advocates to put resources into lobbying (either to promote a half-baked alternative or to influence the weight it will be given in a model-averaging exercise) rather than into research. More recently EPA has actually expressed a distaste for its own defaults, as if they were chosen for bureaucratic convenience rather than in light of the decades of theoretical and empirical support many of them have6 (as if the Agency views using a well-established inference that happens to be a default using the colloquial meaning of "defaulting" on

6. In various exceptional cases, for example, adverse effects in rodents are quite probably irrelevant to humans due to profound interspecies differences—but surely the general presumption that effects are conserved across species is where the science leads us, not merely a "convenient truth."



an obligation or doing something reflexively rather than carefully). In its 2005 Guidelines for Carcinogen Risk Assessment (page 1-7), EPA has in effect claimed it has actually evolved beyond the need for defaults:

"Rather than viewing default options as the starting point from which departures may be justified by new scientific information, these cancer guidelines view a critical analysis of all of the available information that is relevant to assessing the carcinogenic risk as the starting point from which a default option may be invoked if needed to address uncertainty or the absence of critical information."

Putting aside the glaring bias in this statement (if the defaults are worthy of their name, they are not "inferences we retreat to because of the absence of information" but rather "inferences we generally endorse on account of the information"), this process is illogical as well. A model that is chosen only after a de novo "analysis of all of the available information" cannot be termed a default any longer but is instead a judgment about the "correct" model, with no presumption that what has been judged correct in past cases has any special claim.7 I urge EPA to rethink the various written statements it has recently made expressing the agency-wide goal of "reducing reliance on defaults." Should we also reduce reliance on Newton's laws or on the assumption that demand curves have constant elasticity? I hope it is obvious that EPA should strive to reduce reliance on unreliable defaults, and not poison the well by trying to overturn reliable ones just to say it has done so. Although this new policy does not necessarily move EPA any closer to an embrace of model averaging (it could simply foreshadow a period in which well-supported defaults are simply abandoned in favor of shaky alternatives, but where only one inference at a time is considered), the promise to "invent the wheel" for each risk assessment suggests a presumption against completing the analysis until model uncertainty is fully explicated. Perhaps we need an "Environmental Estimation Agency" to manage this open-ended process (or perhaps we already have such an institution in the National Institute of Environmental Health Sciences). But we need an environmental decision-making agency as well. I may disagree with specific actions EPA may take, but I expect the Agency to keep its eye on environmental protection as it considers the twin activities of making predictions and decisions under uncertainty.

7. EPA used very similar language in its 2004 staff paper An Examination of EPA's Risk Assessment Principles and Practice (p. 51): "EPA's current practice is to examine all relevant and available data first when performing a risk assessment. When the chemical- and/or site-specific data are unavailable (i.e., when there are data gaps) or insufficient to estimate parameters or resolve paradigms, EPA uses a default assumption in order to continue with the risk assessment. Under this practice EPA invokes defaults only after the data are determined to be not usable at that point in the assessment—this is a different approach from choosing defaults first and then using data to depart from them." The curious use of "invokes" makes the default seem distasteful indeed, and in my view inappropriately pits defaults against data, without realizing that the defaults are often built on an enormous foundation of data, while the alternatives sometimes present data, but data that are incomplete or misleading.

RESPONSE TO COMMENTS

Melissa Whitney and Louise Ryan
Department of Biostatistics, Harvard School of Public Health

We are grateful for the thoughtful comments of the reviewers of our chapter. They raise many important topics that are addressed in order of appearance.

MIKE MESSNER

Mike Messner asks how a BMA user should choose a prior distribution. This is a good question. We have added the following text in the Discussion section of the chapter. This added text also addresses the issue raised by another reviewer about the potential role of expert opinion: "The sensitivity of the model-averaged BMD highlights the importance of careful specification of the prior set of candidate models. In reality, of course, no single model is likely to be 'correct.' The best one can hope for is the use of a sensible model that is flexible enough to capture the key features of the data. In terms of specifying the model space for the purpose of toxicological dose–response modeling, the set of models allowed for by the EPA's Benchmark Dose Software is a good start. It would be of value to expand this set to include, for example, models motivated by biological considerations and perhaps reflecting specific mechanisms of toxicity. A related question concerns specification of the prior weights assigned to each model. We have taken the simplest approach of allowing each model the same a priori weight. However, other approaches are possible. For example, expert opinion could be used to elicit prior model weights."

Messner also mentions "weird bootstrap results" discussed in the workshop. A programming error was the explanation for the "weird" results presented at the workshop. This has been fixed now.



Regarding the question of when and/or whether bootstrapping is preferable to MCMC: the bootstrap method was actually an artifact of the programming error mentioned above. The question is not so much "bootstrap versus Markov chain Monte Carlo sampling" as the frequentist-based approximation (which involves approximating the posterior model probabilities using the BIC model fit criterion) versus a fully Bayesian approach. In the frequentist approach one needs an estimate of the variance of the BMD. We used the bootstrap for this. Other approaches are, of course, possible. We include some discussion of this in the final section of the chapter.

The suggestion of a different type of bench test exercise in which data are generated from a model that is hidden from the modelers is an intriguing one. Modelers could compare their BMDLs with the BMDL from the "true" model used to generate the data. As a variation on this theme, the data could be generated by a mixture of models, thus mixing model uncertainty with variability in the data. These are great ideas for another workshop! EPA could render an enormous service by funding this effort.
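For readers unfamiliar with the BIC approximation mentioned here, the standard form (a generic sketch, not code from the chapter; the BIC values below are invented for illustration) converts per-model BIC scores into approximate posterior model probabilities:

```python
import numpy as np

def bma_weights(bic, prior=None):
    """Approximate posterior model probabilities: w_i is proportional to
    prior_i * exp(-BIC_i / 2), normalized to sum to 1."""
    bic = np.asarray(bic, dtype=float)
    prior = np.ones_like(bic) if prior is None else np.asarray(prior, dtype=float)
    w = prior * np.exp(-(bic - bic.min()) / 2.0)  # shift by the minimum for stability
    return w / w.sum()

print(bma_weights([102.3, 104.1, 110.0]))  # hypothetical BICs for three models
```

Expert-elicited prior weights, as discussed below, would simply replace the equal `prior` vector.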

MARGARET CHU

We are grateful for the supportive suggestions of Margaret Chu, with which we heartily concur. She draws attention to an important feature of BMA, namely that it enables the analysis of model uncertainty when more than one model fits the data. Chu's comments did not seem to point to the need for any specific revisions to the chapter. However, her point about our increasing knowledge of mechanisms is reflected in the comment we have inserted into the discussion regarding the inclusion of biologically based models into the prior model set.

ADAM FINKEL

Adam Finkel has certainly thought long and hard about the issues of averaging and raises numerous provocative issues. Here are our responses.

On the tension between this data-driven weighting and expert elicitation of weights: In our opinion, this tension can be resolved to a certain degree by using the BMA framework. We have taken the simplest approach of weighting all the specified models equally. There is no reason why expert opinion could not be incorporated to allow some variability in those prior weights. The data would still be used to calibrate those weights (in terms of the posterior), but building in expert opinion could be very helpful in terms of giving even greater weight to biologically feasible models (assuming they fit the data well and therefore retain those higher weights a posteriori). We have added a note to this effect in the discussion.

ADAM FINKEL Adam Finkel is certainly thought long and hard about the issues of averaging and raises numerous provocative issues. Here are our response: On the tension between this data-driven weighting and expert elicitation of weights: In our opinion, this tension can be resolved to a certain degree by using the BMA framework. We have taken the simplest approach of weighting all the specified models equally. There is no reason why expert opinion could not be incorporated to allow some variability in those prior weights. The data would still be used to calibrate those weights (in terms of the posterior), but building in expert opinion could be very helpful in terms of giving even greater weight to biologically feasible models (assuming they fit the data well and therefore retain those higher weights a posteriori). We have added a note to this effect in the discussion.

196

RESPONSE TO COMMENTS

What does it mean to have an averaged BMDL? This is a good point indeed. It is quite true that the averaged BMDL that we present has a rather subtle interpretation and is not likely to be what the decision-maker or regulator wants. We clarify this in the text by stating. “Note that the weighted average of the BMDL should be interpreted carefully. From a regulator’s perspective, it would better to compute a lower limit of estimated BMDs. This would more naturally be done using an MCMC approach, though frequentist versions would also be possible.” General comments regarding estimation of “central tendancy” rather than focusing on uncertainty. We add the following to the discussion: “While the examples presented in this chapter illustrate some of the usefulness of BMA techniques for dose–response modeling in the presence of uncertainty, many questions remain. For example, BMA is primarily useful for estimating a reliable dose–response model in settings where there is uncertainty about the model, either in terms of the particular class of dose–response models that should be used or the covariates to be included. Environmental risk assessment, on the other hand, is concerned with protecting the public. It could be argued that the best estimate of the average risk is less important than a reliable characterization of the upper limit of the risk associated with various dose or exposure levels. Exploring extensions of the BMA technology to address this issue would be worthwhile.

5 COMBINING RISKS FROM SEVERAL TUMORS USING MARKOV CHAIN MONTE CARLO Leonid Kopylev, John Fox, and Chao Chen National Center for Environmental Assessment US Environmental Protection Agency

INTRODUCTION In animal bioassays, tumors are often observed at multiple sites. Unit risk estimates calculated on the basis of tumor incidence at only one of these sites may underestimate the carcinogenic potential of a chemical (NRC 1994). Furthermore the National Research Council (NRC, 1994) and Bogen (1990) concluded that an approach based on counts of animals with one or more tumors (counts of “tumor-bearing animals”) would tend to underestimate overall risk when tumors occur independently across sites. On independence of tumors, NRC (1994) stated: “… a general assumption of statistical independence of tumor-type occurrences within animals is not likely to introduce substantial error in assessing carcinogenic potency.” Also application of a single dose–response model to pooled tumor incidences (i.e., counts of tumorbearing animals) does not reflect possible differences in dose–response relationships across sites. Therefore the NRC (1994) and Bogen (1990) concluded that an approach that is based on well-established principles of probability

The views expressed in this article are those of the authors and do not necessarily reflect the views or policies of the US Environmental Protection Agency. Uncertainty Modeling in Dose Response: Bench Testing Environmental Toxicity Edited by Roger M. Cooke Copyright © 2009 by John Wiley & Sons, Inc.

197

198

COMBINING RISKS FROM SEVERAL TUMORS USING MCMC

and statistics should be used to calculate composite risk for multiple tumors. Bogen (1990) also recommended a re-sampling approach, as it provides a distribution of the combined potency. Both NRC (1994) and Guidelines for Carcinogen Risk Assessment (US EPA 2005) recommend that a statistically appropriate upper bound on composite risk be estimated in order to gain some understanding of the uncertainty in the composite risk across multiple tumor sites. This chapter presents a Markov chain Monte Carlo (MCMC) computational approach to calculating the dose associated with a specified composite risk and a lower confidence bound on this dose, after the tumor sites of interest (those believed to be biologically relevant) have been identified and suitable dose–response models (all employing the same dose metric) have been selected for each tumor site. These methods can also be used to calculate a composite risk for a specified dose and the associated upper bound on this risk. For uncertainty characterization, MCMC methods have the advantage of providing information about the full distribution of risk and/or benchmark dose. This distribution, in addition to its utility in generating a confidence bound, provides expected values of risks that are useful for economic analyses. The methods presented here are specific to the multistage model with nonnegative coefficients fitted to tumor incidence counts (i.e., summary data rather than data on individual animals, as in the nectorine example; if data on individual animals are available, other approaches are possible), and they assume that tumors in an animal occur independently across sites. The nectorine example is used to illustrate proposed methodology and compare it with the current approach.

COMBINING RISKS FOR THE MULTISTAGE MODEL The NRC (1994) has described an approach for combining risk estimates across tumor sites based on the multistage model: P(d, θ) = 1 − exp[−(q0 + q1d + q2d2 + … + qkdk)], where θ = (q0, q1, … ), qi ≥ 0, where d ≥ 0 is the dose metric and P(d, θ) is the probability of response at dose d, with parameters θ. For the multistage model and two tumor types, A and B, assume m > k, and PA (d, β A ) = 1 − exp( − ( β0 A + β1 A ∗ d + … + βkA ∗ d k )) , PB (d, β B ) = 1 − exp( − ( β0 B + β1B ∗ d + … + β mB ∗ d m )) . Assuming independence of tumors PAorB (d β A, β B ) = PA (d, β A ) + PB (d, β B ) − PA (d, β A ) ∗ PB (d, β B ),

COMBINING RISKS FOR THE MULTISTAGE MODEL

199

we have, PAorB (d, β A, β B ) = 1 − exp( − [(β0 A + β0 B ) + (β1 A + β1B ) ∗ d + … + (β kA + βkB ) ∗ d k + βk + 1B ∗ d k + 1 + … + β mB ∗ d m ]) . Similarly, for more than two tumor types, the combined tumor model, Pc, is

Pc (d, θ ) = 1 − exp( − [Q0 + Q1d + Q2 d 2 + … + Qk d k ]), where Qi = ∑qij, in which i indexes the model “stages” 1 to k and j indexes the tumor sites, with θ = (Q0, Q1, …, Qk). The benchmark dose method (Crump, 1984) consists of estimating a lower confidence limit for the dose associated with a specified increase in adverse response (i.e., increased risk) above the background level. Extra risk (ER) is a common choice: ER =

P (d, θ ) − P (0, θ ) . 1 − P (0, θ )

The benchmark dose (BMD) for extra risk is the solution of the above equation when the left-hand side is fixed. Statistical inference for chemical risk assessment has mainly emphasized finding confidence limits for the BMD. For two tumors, extra risk is given by ERAorB (d, β A, β B ) =

PAorB (d, β A, β B ) − PAorB (0, β A, β B ) . 1 − PAorB (0, β A, β B )

After simple algebra, ERAorB (d, β A, β B ) = 1 − exp( − [(β1 A + β1B ) ∗ d + … + (βkA + βkB ) ∗ d k + βk + 1B ∗ d k + 1 + … + β mB ∗ d m ]) . BMD for combined risk BMRAorB is (after re-arranging and taking logarithms of both sides) the solution to the polynomial −log (1 − BMRAorB ) = (β1 A + β1B ) ∗ d + … + (βkA + βkB ) ∗ d k + βk + 1B ∗ d k + 1 + … + β mB ∗ d m .

A similar polynomial equation applies when more than two tumors are observed in a bioassay.
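As a concrete illustration of this last step, the sketch below (an illustrative toy, not the chapter's WinBUGS implementation) solves the combined-BMD polynomial numerically for one draw of multistage coefficients; the coefficient values are invented:

```python
import numpy as np

def combined_bmd(bmr, coefs_a, coefs_b):
    """Solve -log(1 - BMR) = sum_i (beta_iA + beta_iB) d^i for the smallest d > 0.

    coefs_a, coefs_b list the non-background coefficients [beta_1, beta_2, ...];
    the background terms cancel out of the extra-risk formula. The shorter
    list is implicitly zero-padded, matching the m > k case in the text.
    """
    m = max(len(coefs_a), len(coefs_b))
    poly = np.zeros(m + 1)                   # poly[i] multiplies d^i
    poly[1:len(coefs_a) + 1] += coefs_a
    poly[1:len(coefs_b) + 1] += coefs_b
    poly[0] = np.log(1.0 - bmr)              # move -log(1 - BMR) across as the constant term
    roots = np.roots(poly[::-1])             # np.roots wants highest power first
    real = roots[np.isreal(roots)].real
    return real[real > 0].min()

# Hypothetical linear coefficients for two tumor sites:
print(combined_bmd(0.10, [0.008], [0.002]))  # ~10.5
```

Applied to every MCMC draw of the coefficients, this yields the posterior distribution of the combined BMD described next.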



BAYESIAN APPROACH

A Bayesian approach for calculating a confidence bound on the BMD for composite risk can be implemented using WinBUGS (Spiegelhalter et al., 2003). This is a freely available software package that can be used to apply MCMC methods (e.g., Smith and Gelfand, 1992; Casella and George, 1992; Chib and Greenberg, 1995; Brooks, 1998; Gilks et al., 1998; Gelman et al., 2004). Gelfand et al. (1992) discusses MCMC methods involving constraints, as in the case of applying the multistage dose–response model to the nectorine data, where maximum likelihood estimates (MLE) for background coefficients for both tumors are on the boundary. The use of MCMC methods (via WinBUGS) to derive a posterior distribution of BMDs for a single multistage model has recently been described by Kopylev et al. (2007). This methodology can be straightforwardly generalized to derive a posterior distribution of BMDs for combined tumor risk across sites, using the approach for composite risk described in the previous section. The mode and 5th percentile of the resulting posterior distribution of the dose for a fixed extra risk provide estimates of the BMD and the BMDL ("lower bound") for composite tumor risk. Similarly, the mean and 95th percentile of the posterior distribution of the composite extra risk provide estimates of the expected extra risk and the upper bound on the extra risk.

EXAMPLE

The data for the nectorine bioassay are given in Table 5.1. Two tumor types were observed, and no information beyond summary data is available. A linear multistage model fits better—based on lowest values of the deviance information criterion (DIC) reported by WinBUGS—than a quadratic or cubic model for both adenoma and neuroblastoma tumors. Similarly, a linear model is preferable for both tumors based on minimizing AIC, per BMDS (US EPA 2006) computations. The WinBUGS results are obtained based on convergence of 3 chains with different initial values, with a burn-in of 50,000 iterations (i.e., the first 50,000 samples discarded) and 50,000 retained simulations each, using WinBUGS 1.4.1. The posterior distribution is obtained from the combined

TABLE 5.1 Nectorine bioassay data

                                       Concentration (ppm)
Tumor Type                            0      10     30     60
Respiratory epithelial adenoma       0/49   6/49   8/48   15/48
Olfactory epithelial neuroblastoma   0/49   0/49   4/48    3/48



150,000 samples from the three chains, thinned by retaining every 10th sample to reduce autocorrelation, resulting in 15,000 samples. A prior that is a mixture of a diffuse continuous distribution on the positive real line and a point mass at 0 was used for the multistage parameters. In WinBUGS the prior was constructed by truncating a high-variance Gaussian prior distribution at −1 and using the "step" function to collect the mass from the interval [−1, 0) as a point mass at 0. In a limited Monte Carlo simulation of the frequentist properties of the posterior, this was a reasonable choice for a range of scenarios we investigated. The posterior distributions of the extra risk at dose 11 ppm and of the BMD10 are shown in Figures 5.1 and 5.2 (11 ppm is chosen because it is close to the combined BMD10). The posterior distribution of the linear coefficient for the neuroblastoma tumor is strongly bimodal, with 40% of its mass at 0 and the rest of the mass continuously distributed. In a linear model, lack of the linear term implies no extra risk; that is, risk is the same at every dose. Since the extra risk for a linear model is a 1:1 function of the linear parameter, there is also bimodality of extra risk for neuroblastoma in Figure 5.1. As the BMD in a linear model is inversely proportional to the linear term, for that 40% of neuroblastoma simulations in which the estimated linear term was zero, the BMD (Figure 5.2) could not be determined (mathematically it is infinite).
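The "step"-function construction of the mixture prior can be mimicked outside WinBUGS; the sketch below (with hypothetical hyperparameters, since the chapter does not report the prior variance) shows the idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_prior_sample(n, sd=10.0):
    """A diffuse Gaussian truncated at -1, with the mass on [-1, 0)
    collected into a point mass at 0, as described in the text."""
    x = rng.normal(0.0, sd, size=3 * n)
    x = x[x >= -1.0][:n]          # crude truncation at -1 by rejection
    return np.maximum(x, 0.0)     # "step": map the negative interval onto 0

draws = mixture_prior_sample(100_000)
print((draws == 0.0).mean())      # fraction of prior mass sitting exactly at 0
```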

[Figure 5.1 appears here: density curves for adenoma, neuroblastoma, and combined tumors; x-axis "Extra risk at dose 11 ppm" (0.00–0.20), y-axis "Density" (0–100).]

Figure 5.1 Distribution of the extra risk at dose 11 ppm for individual and combined tumors for the nectorine example. Note that 40% of the distribution for neuroblastoma extra risk is at 0, since the posterior of the linear parameter for this model has 40% of its mass at 0.



[Figure 5.2 appears here: density curves for adenoma, neuroblastoma, and combined tumors; x-axis "Log10(Dose)" (1–5), y-axis "Density" (0–7), with "B̂" marks on the x-axis.]

Figure 5.2 Distribution of the logarithm of BMD10 for individual and combined tumors in the nectorine example. Letters “B” on the x-axis indicate BMD10s for adenoma and combined tumors. For 40% of neuroblastoma simulations (see Figure 5.1 caption), BMD cannot be determined (is infinite); thus the integral under the distribution for neuroblastoma is 0.6.

TABLE 5.2 Statistics for individual and combined BMD10 for the nectorine example

                             WinBUGS                                      BMDS
                5th Percentile  Posterior    95th Percentile  5th Percentile   MLE     95th Percentile
                (BMDL)          Mode (BMD)   (BMDU)           (BMDL)           (BMD)   (BMDU)
Adenoma            11.06           14.20        20.63            11.32         15.17      22.79
Neuroblastoma      40.63             &            &              39.91         70.11     153.60
Combined            8.69           11.24        20.63             9.58*        12.47        *

Note: & indicates that the BMD and BMDU for neuroblastoma could not be determined for 40% of simulations; * indicates that the lower confidence bound on the BMD is obtained with a BMDS module that is still under development. This software does not calculate a BMDU.

In contrast, the linear parameter for the adenoma model has a unimodal posterior bounded away from zero (and so the extra risk distribution for adenoma (Figure 5.1) is also bounded away from 0); consequently the distribution of BMD and extra risk for combined tumors is well defined. Tables 5.2 and 5.3 show individual and combined BMDs and extra risks calculated by two approaches: WinBUGS and BMDS (US EPA 2006). The lower confidence bound on the combined BMD (Table 5.2) is obtained with an experimental module of BMDS, which is still undergoing testing. That module uses a profile likelihood approach similar to the current approach for an individual



TABLE 5.3 Statistics for individual and combined extra risks for the nectorine example

                             WinBUGS                               BMDS
                5th Percentile   Mean    95th Percentile   5th Percentile   MLE     95th Percentile
Adenoma            0.055         0.076       0.099            0.048         0.072       0.097
Neuroblastoma      0             0.011       0.028            0.007         0.016       0.028
Combined           0.059         0.086       0.135              *           0.088       0.115*

Note: For *, see the caption of Table 5.2.

tumor in BMDS. Also note that for the extra risk WinBUGS provides estimates of the average extra risk, whereas BMDS provides MLE estimates of extra risk. Interestingly, while the BMDLs calculated by the two methods are quite close for individual tumors, there is a difference in the combined BMDL (Table 5.2) and in the upper bound on extra risk (Table 5.3). It is clear from the figures that the risk of adenoma dominates the risk of neuroblastoma. However, even for this example, with markedly unequal risks from the tumors, the extra risk can be substantially greater when risks are combined. For example, Table 5.3 shows that the 95th percentile risk at 11 ppm is about 35% greater for combined risk than for adenoma alone (0.135 versus 0.099, WinBUGS computation). The difference for average risk is less pronounced (0.086 versus 0.076, about 13% larger) but still nontrivial.
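Under the independence assumption, the combined extra risk is just 1 − (1 − ER_A)(1 − ER_B), and the WinBUGS posterior means in Table 5.3 can be sanity-checked against that identity (a rough check only, since posterior means of a product are not exactly the product of posterior means):

```python
def combined_extra_risk(er_a, er_b):
    """Extra risk for two independent tumor sites: 1 - (1 - ER_A)(1 - ER_B)."""
    return 1.0 - (1.0 - er_a) * (1.0 - er_b)

# Posterior means at 11 ppm from Table 5.3:
print(combined_extra_risk(0.076, 0.011))  # ~0.086, matching the combined column
```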

DISCUSSION

In this chapter, an application of Bayesian methods for calculating probability distributions for composite cancer risk estimates is proposed. Advantages of the proposed approach are that the concept can easily be extended to a more general case, such as a Bayesian hierarchical model with covariates, and the computations are easy to implement. As NRC (1994) stated, ignoring issues about combining tumors could lead to underestimation of risk. In this example, with an order-of-magnitude difference between the individual tumors, the underestimation was moderate compared to other uncertainties inherent in extrapolating risk from animals to humans. It is easy to see that in some situations (e.g., more than two tumors or tumors of similar potency), ignoring additional tumors could lead to a serious underestimation of risk from a toxic substance. It is very important to choose an appropriate prior when parameter estimates are either on the boundary of the parameter space or are near the boundary for a finite sample. We chose a prior that is a mixture of a diffuse



continuous prior over the nonnegative real line and a small point mass at 0. In limited Monte Carlo simulations such mixture priors performed reasonably well in a frequentist sense. Using a purely continuous prior over the nonnegative real line could lead to undesirable frequentist properties of the posterior when model parameters are at or near the boundaries of the parameter space (zero for parameters of the multistage model). However, further research is needed to determine the optimal way of allocating prior density between the point mass at 0 and the continuous nonnegative part. The approach presented in this chapter characterizes the statistical uncertainty in risk estimates conditional on the data set and multistage models of specified order. The approach can be straightforwardly extended to models other than the multistage model; Bayesian model selection could also be used. However, the uncertainty discussed in this article is only for a particular dose–response model and does not address uncertainty associated with the selection of a particular model, or a set of models. The approach assumes independence of tumors given dose. Independence may not be true for some types of tumors. Possible next steps include investigation of the consequences of violating the independence assumption, and of ways to evaluate and incorporate dependence in the methodology.

REFERENCES

Bogen KT. 1990. Uncertainty in Environmental Health Risk Assessment. London: Taylor and Francis.
Brooks S. 1998. Markov chain Monte Carlo method and its application. Statistician 47(1):69–100.
Casella G, George E. 1992. Explaining the Gibbs sampler. Am Statistician 46(3):167–74.
Chib S, Greenberg E. 1995. Understanding the Metropolis–Hastings algorithm. Am Statistician 49(4):327–35.
Crump KS. 1984. A new method for determining allowable daily intakes. Fund Appl Toxicol 4:845–71.
Gelfand AE, Smith AFM, Lee T-M. 1992. Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. J Am Statist Assoc 418:523–32.
Gelman A, Carlin JB, Stern HS, Rubin DB. 2004. Bayesian Data Analysis, 2nd ed. London: Chapman and Hall.
Gilks W, Richardson S, Spiegelhalter D. 1998. Markov Chain Monte Carlo in Practice. New York: Chapman and Hall/CRC.
Kopylev L, Chen C, White P. 2007. Towards quantitative uncertainty assessment for cancer risks: central estimates and probability distributions of risk in dose–response modeling. Regul Toxicol Pharmacol 49:203–7.
NRC (National Research Council). 1994. Science and Judgment in Risk Assessment. Washington, DC: National Academy Press.



Smith A, Gelfand A. 1992. Bayesian statistics without tears: a sampling-resampling perspective. Am Statistician 46(2):84–89.
Spiegelhalter D, Thomas A, Best N. 2003. WinBUGS Version 1.4 User Manual. http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/manual14.pdf.
US Environmental Protection Agency. 2005. Guidelines for Carcinogen Risk Assessment. EPA/630/P-03/001F.
US Environmental Protection Agency. 2006. Help Manual for Benchmark Dose Software—Version 1.4. EPA 600/R-00/014F.

6

UNCERTAINTY IN DOSE RESPONSE FROM THE PERSPECTIVE OF MICROBIAL RISK

P. F. M. Teunis
National Institute of Public Health and the Environment, Bilthoven, Netherlands
Emory University, Atlanta, GA

INTRODUCTION

Dose–response assessment for microbial pathogens differs from its toxicological counterpart in two important aspects: the discrete nature of the inoculum (Tillett and Lightfoot, 1995) and the abundance of heterogeneity. The former aspect, particulate inocula, greatly helps dose–response modeling because it limits model space considerably. The second aspect, heterogeneity, may be partly due to the noxious agent itself being a biological organism that often (but not always) coevolved with its host. Data for microbial dose–response assessment usually come from human studies, and as a consequence data sets are often small. Despite scarcity of experimental or observational data, biological variation should not be ignored, even in the simplest problems.

BINOMIAL UNCERTAINTY

Most microbial dose–response studies use binary outcomes. The use of binomial sampling as the basic distribution of uncertainty depends on assumptions





about heterogeneity (variability) of risk. When the experiment is repeated, would we find the same fraction of positives at any given dose? Or would factors like susceptibility have changed, so that the experiment would have a new outcome, a realization from a set of possible probabilities, rather than a fixed probability? Suppose that we have five animals in a given dose group, and three die as a consequence of exposure. Does that mean that any animal has a 3/5 = 60% probability of death at that dose? Or did the three susceptible animals have some factor predisposing them to die, and the exposed population happened to include three of these susceptible animals, and two others who had zero probability of death? Of course, that question cannot be answered when the experiment is conducted only once. This does, however, show a weakness in using binary endpoints in a dose–response study. In such a study heterogeneity cannot be studied because the response is a group response, not an individual one. This can be modeled as a detection problem, with (a combination of) physiological variables scored by an observer, who applies some criterion for deciding whether an animal is positive or not (Slob and Pieters, 1997a, b). Heterogeneity is a fundamental property of living organisms (outside the laboratory) and is important in risk assessment. This problem of identifying heterogeneity may be attacked by choosing graded response variables, or by adding graded covariables influencing the binary outcome. An example is a study where the infectivity of the pathogenic protozoan Cryptosporidium parvum was determined in a human challenge experiment, with individual baseline serum IgG levels as a covariable (Teunis et al., 2002b). Each individual subject received a certain dose and has a specific baseline IgG level and infection status (0 or 1). The model is a modified hit theory model, using a logistic relation to describe the influence of (log) IgG level on the single-hit probability p_m. Details can be found in Teunis et al. (2002b). The resulting dose–response relation has two arguments, the dose and the (log) baseline IgG level (Figure 6.1).
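The fixed-probability versus heterogeneous-susceptibility question can be made concrete with a beta-binomial contrast (a generic illustration, not the Teunis et al. model; the beta parameters are arbitrary apart from matching the observed 60% mean):

```python
import numpy as np
from scipy import stats

n, k = 5, 3                  # three deaths out of five animals
p_hat = k / n
outcomes = np.arange(n + 1)

# Fixed susceptibility: every animal dies with probability 0.6.
print(stats.binom.pmf(outcomes, n, p_hat))

# Heterogeneous susceptibility with the same mean (a/(a+b) = 0.6):
a, b = 1.2, 0.8
print(stats.betabinom.pmf(outcomes, n, a, b))  # fatter tails: 0 and 5 deaths more likely
```

Both models reproduce the observed mean, which is why a single experiment cannot distinguish them.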

EPIDEMIOLOGY AND DOSE RESPONSE

Heterogeneity in experimental or observational data is usually structured hierarchically: sets of studies of different strains of a microbial pathogen, or separate studies of the same pathogen in different populations (e.g., different ages). For this reason mixed models are a natural choice for addressing heterogeneity in risk, especially when studies must be combined (Teunis et al., 2002a). A series of human challenge studies for Cryptosporidium parvum showed that different isolates of the same pathogen species may have strongly diverging dose–response relations. Both the locations (ID50) and shapes of the dose–response relations were different (Figure 6.2). The predictive intervals show



Figure 6.1 Dose–response relation for infection by Cryptosporidium parvum (Iowa isolate) in humans, using baseline anti-Cryptosporidium IgG levels as a covariable. Posterior mode plane, blanketed between 5 and 95% predicted levels.

Figure 6.2 Dose–response relation for infection by Cryptosporidium parvum (5 different isolates) in humans. Shown are fractions infected and posterior mode curves for each of the individual isolates, and a density graph of quantiles of predicted probabilities of infection (spanning a 99% interval).

the uncertainty in the probability of infection for an unspecified isolate, represented by a sample from the joint parameter distribution characterizing infectivity within and between isolates. This distribution may be used to characterize the uncertainty in the infectivity of an unspecified “new” environmental isolate.



Often there is insufficient information (lack of experimental data) to identify parameters, especially for a complicated model with many parameters. In risk assessment, however, we are mainly interested in prediction: the magnitude of the uncertainty in predicted levels of risk, marginalizing over all levels of heterogeneity. Predictions can be made, even in cases where parameters cannot be estimated. The magnitude of (posterior) uncertainty then increases, because of the increase in degrees of freedom of the model. With this in mind, it can even be sensible to deliberately overparameterize a model, to maximize uncertainty in the predicted outcome, or at least to avoid gross underestimation of this uncertainty. An example is the use of epidemiological data for dose–response assessment: data from a single outbreak can be used as a "natural experiment" with a single applied dose and corresponding attack rate. For a few pathogens reports of sufficiently high quality do exist, and the question then is: What model should be used? A model with one parameter can be fitted by ignoring any heterogeneity in infectivity of pathogens and/or susceptibility of hosts. However, it can easily be shown that a two-parameter model allowing for such heterogeneity in host–pathogen interaction produces much more generous uncertainty estimates for the predicted risk (Teunis et al., 2004). Figure 6.3 shows the uncertainty in predicted probabilities of infection as a function of dose, for E. coli O157:H7, treating data from a single outbreak as an experiment with a single applied dose. Near this observed dose the uncertainty is small (binomial uncertainty); above this dose we know very little, except that the infection risk is not smaller than the observed risk (due to the monotonically increasing dose–response relation). But below the observed dose the uncertainty is not very large. The reason is that the observed dose is low (31 cfu), so that a slightly lower dose approaches the level where exposure is becoming unlikely. We consider this a strong argument for limiting infectivity at low doses, for any microbial pathogen (see below).
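The one- versus two-parameter contrast can be sketched as follows (generic dose–response forms, not the fitted Teunis et al. models; the parameter values are placeholders): the exponential model fixes a single infectivity r, while the widely used approximate beta-Poisson form lets host–pathogen heterogeneity flatten the curve.

```python
import numpy as np

def exponential_dr(dose, r):
    """One-parameter single-hit model: P = 1 - exp(-r * dose)."""
    return 1.0 - np.exp(-r * dose)

def beta_poisson_dr(dose, alpha, beta):
    """Two-parameter approximate beta-Poisson model: P = 1 - (1 + dose/beta)**(-alpha)."""
    return 1.0 - (1.0 + dose / beta) ** (-alpha)

dose = 31.0  # cfu, the observed outbreak dose discussed in the text
print(exponential_dr(dose, 0.01))
print(beta_poisson_dr(dose, 0.2, 5.0))
```

Fitting both to a single (dose, attack rate) pair pins the curves together near the observation, but the extra parameter lets the beta-Poisson family spread out much more above and below it, which is the more generous uncertainty the text describes.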

Figure 6.3 Dose–response relation for infection by Escherichia coli O157:H7 in humans. Shown is the fraction observed in an outbreak, and a density graph of quantiles of predicted probabilities of infection (99% interval). From Teunis et al. (2004).



This approach can be taken one step further by considering not a single outbreak but a set of outbreaks by the same pathogen. These are likely to have different dose–response relations, one for every single observed pair of dose and attack rate (affected/exposed) data. A (small) set of studies may thus be approached as alternative realizations of a dose–response model, with different sets of parameters in each observed outbreak but with a joint distribution of all these parameter sets over all studies. A literature study produced a useful set of outbreaks with the required information (Strachan et al., 2005); however, a one-level analysis failed to produce an appropriate model fit. Reanalysis of these data, using a two-level model with variable infectivity within and between outbreaks, produced improved dose–response estimates, with wider predictive intervals (Figures 6.4 and 6.5). The latter analysis also included dose uncertainty due to non-Poisson exposure, as these were foodborne outbreaks with sometimes highly inhomogeneous exposure, by adding an additional parameter into the model (Teunis et al., 2008). The uncertainty in the predicted dose response for E. coli O157:H7 may seem high (Figure 6.5a), but extrapolation to low doses shows that the estimated uncertainty may not be disastrous for risk assessment (Figure 6.5b). This is again a consequence of the infectivity being high, shifting the dose–response relation for infection to doses where exposure limits the infection risk. These analyses were most conveniently done in a Bayesian framework because distributions for heterogeneity at any level can be specified, and their influence on the outcome is easily tested by manipulating prior distributions.

Figure 6.4 Dose–response relation for infection by Escherichia coli O157:H7 in humans, based on 8 different outbreaks. Shown are fractions infected and posterior mode curves for each separate outbreak. From Teunis et al. (2008).



Figure 6.5 Contour graph of percentiles of predicted probabilities of infection by Escherichia coli O157:H7 in humans, corresponding to Figure 6.4. High-dose (range of observations) and low-dose extrapolations are shown.

PARTICULATE INOCULA

Compared to toxicological risk assessment, microbial dose–response assessment is usually based on human dose–response data. Model selection in dose response is aided by the high infectivity of many microbes. Only a few organisms (particles) are required for a considerable (i.e., observable) risk of infection, possibly resulting from evolutionary optimization of pathogens maximizing their chances of propagation in their preferred hosts. As a consequence exposure (to one or more organisms) is an important proxy for infection. This is because the sampling distribution for a sample from a microbial suspension can be verified experimentally, and so used to impose an upper limit on all microbial dose–response models. Infection can never occur with a higher probability than exposure: if the probability of exposure is 0.01 (e.g., because an exposed subject ingested 10 ml of a suspension of one organism/liter), then the probability of infection cannot exceed 1%. For well-mixed (Poisson) suspensions the dose–response relation should be a mixture of exponentials, with the distribution describing the heterogeneity in host–pathogen interaction as the mixing function (Teunis and Havelaar, 2000). The argument above also gives hit theory models a prominent position in microbial risk assessment: infection is conditional on exposure, and exposure is a strong determinant of the shape of microbial dose–response relations. Formulating the microbial dose–response model as a mixture of a (Poisson or other) exposure component with a (beta or other) distribution for the single-hit probability allows interpretation of the latter: this is the probability that any single ingested pathogen survives all host barriers to infection. The distribution of this probability (called p_m for crossing m barriers) is a "generic"



Figure 6.6 Single-hit infectivity (pm) for Escherichia coli O157:H7 in humans. From Teunis et al. (2008).

characteristic (independent of exposure) of a host–pathogen combination. An example is shown for the E. coli O157:H7 model (Figure 6.6). These distributions of p_m show considerable variation, which is not only caused by host factors. Unlike in toxicology, one cannot assume that all viruses or other microparasites have the same properties. In addition to variable host susceptibility, we have to take into account pathogen variation. Some of this variation changes with time because many pathogens cannot survive outside their hosts for an extended period. There are other issues as well: differences between infection and illness endpoints, innate immunity (genetic compatibility of host and pathogen), and dynamic aspects such as transient (acquired) immunity and secondary spread of infection in a population. These are important but not immediately pertinent to the topics of this workshop.
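The exposure ceiling described above drops out of the single-hit machinery directly: for a Poisson inoculum with mean μ, P(exposure) = 1 − exp(−μ), and the exponential model can never exceed it. A minimal sketch (numbers taken from the 10 ml example in the text):

```python
import numpy as np

def single_hit_risk(mean_dose, pm):
    """Exponential single-hit model for a Poisson-distributed inoculum:
    P(infection) = 1 - exp(-pm * mean_dose), with 0 <= pm <= 1."""
    return 1.0 - np.exp(-pm * mean_dose)

mu = 0.01                        # 10 ml of a suspension with 1 organism per liter
print(single_hit_risk(mu, 1.0))  # ~0.00995: the exposure probability itself (pm = 1)
print(single_hit_risk(mu, 0.1))  # any pm < 1 only lowers the risk
```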

VERIFICATION OF HUMAN RISK ESTIMATES

Risk assessment is usually applied in cases where other approaches (e.g., epidemiological studies) are not feasible, often because the expected probabilities of effects are so low that study sizes would have to be impractically large. As a consequence, the outcomes of a risk study are often hard to verify, since alternative means of measuring the predicted effects are not available. Risk studies are therefore speculative: their results cannot be falsified. Yet risk studies are increasingly used to support decision making, in industry and government. It would therefore be useful to have methods, at least in principle, for checking the quantitative outcomes of risk studies. Microbial risk may be in a favorable position for such a "reality check" because there is usually not much cause for doubting causality. Outbreaks of food- or waterborne infectious disease do occur, and the agent is often identified. Dose–response studies, if they exist, are usually based on human challenge, not rodents. On the other hand, much is unknown about pathogen


variability, the importance of secondary transmission, and human susceptibility to illness. In an EU project (Polymod, in the Sixth Framework Programme of the European Union), two aspects of this problem are addressed: (1) the use of outbreak data for dose–response assessment, and (2) the development of serology-based methods for estimating the incidence of infection.

OUTBREAKS VERSUS CHALLENGE STUDIES

Motivation for the former study is the potential bias of human challenge studies: the pathogens must cause relatively mild illness (they are usually propagated under lab conditions, to remove other unwanted harmful microbes or chemicals), and the volunteers must be immunocompetent and in good general condition, so that the study is not likely to do them any serious harm. The converse holds for outbreaks, where pathogens are not adapted to lab conditions and are likely to be from virulent strains. Moreover, outbreaks are detected because of clustering of patients, who may be more susceptible than the general population, at least to developing illness symptoms. It is useful to have data from both of these sources of information because they may be viewed as extremes on a common scale (of pathogen infectivity and pathogenicity, and host susceptibility). Some results of outbreak-based dose response have been mentioned above. It may be noted that outbreaks and experimental challenge studies can sometimes be reconciled, indicating that the bias may not always be serious (Teunis et al., 2005).

SEROLOGY-BASED INCIDENCE ESTIMATES

The development of serology-based incidence estimates is aimed at finding a common scale for measuring infectious disease risk, by risk assessment and by epidemiology. Microbial risk studies are better equipped for estimating infection risks than for estimating illness risks: illness is much more dependent on host factors that are usually unknown, and experimental studies usually result in few illnesses. For many foodborne pathogens a considerable fraction of infected cases remains asymptomatic, and these healthy infected individuals cannot easily be found in epidemiological studies. Infection is, however, known to be often associated with a seroconversion: shortly after exposure there is a rapid increase in serum antibody concentrations, followed by a slow decline that depends on antibody class and pathogen. If the serological response to infection is known for a study population, so that its heterogeneity can be characterized (Versteegh et al., 2005), a cross-sectional population sample (a "snapshot" sample of the population) can be used to estimate times since infection, or the incidence of infection, in the sampled population. A pilot study for a respiratory infection (pertussis) supported claims that silent transmission of this pathogen in the Netherlands exceeds estimates based on reported illnesses by several hundredfold (de Melker et al., 2006). We are now working on improvements of the procedure and on its application to enteric infections; initial results indicate that the incidence of asymptomatic Salmonella infections in Denmark also exceeds the incidence of symptomatic ones by two orders of magnitude (Simonsen et al., 2008).

Suppose that two studies in microbial risk produce two independent estimates of risk, say as estimated probabilities of infection. When these estimates are on a common scale (so that we are not comparing apples and oranges), the question of whether they agree is sensible. Each set of estimates is based on separate data, and the calculations are made with different models, so that the two sets of outcomes may be assumed independent. Both estimates are uncertain, possibly characterized by a Monte Carlo sample of predicted outcomes. The variances (or other measures of spread) of the two estimates need not be equal, as the two methods may have different precision, in part because they are based on different data. The first concern therefore is the similarity of the locations of the two sets of estimates. Various procedures can be used. A very simple method might be useful in practice: for instance, subtract one set of estimates from the other, and use the fraction of positive (or negative) differences as a decision criterion.
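As a toy illustration of this criterion, suppose both estimates are available as Monte Carlo samples; the lognormal distributions below are invented for the example, and any resemblance to real estimates is unintended.

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical Monte Carlo samples of an infection risk from two independent
# studies (e.g., a risk assessment model and a serology-based estimate).
risk_model    = rng.lognormal(mean=np.log(1e-3), sigma=0.6, size=20_000)
risk_serology = rng.lognormal(mean=np.log(2e-3), sigma=1.0, size=20_000)

# Pair the draws at random (the studies are independent) and use the
# fraction of positive differences as the decision criterion.
diff = rng.permutation(risk_model) - rng.permutation(risk_serology)
frac_positive = np.mean(diff > 0)
print(f"fraction of draws with model > serology: {frac_positive:.3f}")

A fraction near 0.5 indicates agreement in location; a fraction near 0 or 1 indicates a systematic difference between the two sets of estimates, regardless of their (possibly unequal) spreads.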


REFERENCES

de Melker HE, Versteegh FGA, Schellekens JFP, Teunis PFM, Kretzschmar MEE. 2006. The incidence of Bordetella pertussis infections estimated in the population from a combination of serological surveys. J Infect 53(2):106–13.
Simonsen J, Strid MA, Molbak K, Krogfelt KA, Linneberg A, Teunis PFM. 2008. Sero-epidemiology as a tool to study the incidence of Salmonella infections in humans. Epidemiol Infect 136(7):895–902.
Slob W, Pieters MN. 1997a. Few large, or many small dose groups? An evaluation of toxicological study designs using computer simulations. Technical report 620 110 006, RIVM, Bilthoven.
Slob W, Pieters MN. 1997b. A probabilistic approach for deriving acceptable human intake limits and human health risks from toxicological studies: general framework. Technical report 620 110 005, RIVM, Bilthoven.
Strachan NJC, Doyle MP, Kasuga F, Rotariu O, Ogden ID. 2005. Dose response modelling of Escherichia coli O157 incorporating data from foodborne and environmental outbreaks. Int J Food Microbiol 103(1):35–47.
Teunis P, van den Brandhof W, Nauta M, Wagenaar J, van den Kerkhof H, van Pelt W. 2005. A reconsideration of the Campylobacter dose–response relation. Epidemiol Infect 133(4):583–92.
Teunis PFM, Chappell CL, Okhuysen PC. 2002a. Cryptosporidium dose response studies: variation between isolates. Risk Anal 22(1):175–83.
Teunis PFM, Chappell CL, Okhuysen PC. 2002b. Cryptosporidium dose response studies: variation between hosts. Risk Anal 22(3):475–85.


Teunis PFM, Havelaar AH. 2000. The beta Poisson model is not a single hit model. Risk Anal 20(4):511–18.
Teunis PFM, Ogden ID, Strachan NJC. 2008. Hierarchical dose response of E. coli O157:H7 from human outbreaks incorporating heterogeneity in exposure. Epidemiol Infect 136(5):761–70.
Teunis PFM, Takumi K, Shinagawa K. 2004. Dose response for infection by Escherichia coli O157:H7 from outbreak data. Risk Anal 24(2):401–8.
Tillett HE, Lightfoot NF. 1995. Quality control in environmental microbiology compared with chemistry: what is homogeneous and what is random? Water Sci Technol 31(5-6):471–7.
Versteegh FGA, Mertens PLJM, de Melker HE, Roord JJ, Schellekens JFP, Teunis PFM. 2005. Age-specific long-term course of IgG antibodies to pertussis toxin after symptomatic infection with Bordetella pertussis. Epidemiol Infect 133(4):737–48.

7 CONCLUSIONS

David Bussard, Peter Preuss, and Paul White
US Environmental Protection Agency

At the heart of the actual quantitative assessment of dose–response uncertainty is the common process of fitting one of several plausible dose–response models to bioassay data to determine a dose corresponding to a selected response level at the lower end of the observable range. This is the central uncertainty issue evaluated by the workshop: for a chosen set of data, how can one fit a dose–response curve, with an understanding of the uncertainty, in the range of the observed data? The results most useful for environmental decisions are values at the lower end of the dose range, including an estimate of the lower confidence limit on the dose that yields a response near the bottom of the observed range, and information about the degree of uncertainty at the lower end of the observable response range. The basic terminology is that the curve is fit to find a benchmark dose, or BMD, that produces a specified response, commonly 5% or 10% above the background level. The lower confidence limit on the BMD is termed the BMDL. The process of synthesizing information to determine the hazard of a chemical and to quantify the dose–response relationship is very complex and involves many other steps, including evaluating study design, concordance, and the relevance of animal data to human risk.
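For concreteness, the following sketch computes a BMD10 by solving for 10% extra risk on a fitted log-logistic curve. The functional form is the standard log-logistic dose–response model; the parameter values are hypothetical and are not estimates for any data set in this volume.

import numpy as np
from scipy.optimize import brentq

def response(dose, g, a, b):
    # log-logistic dose-response: background g plus a logistic term in log dose
    return g + (1 - g) / (1 + np.exp(-a - b * np.log(dose)))

def extra_risk(dose, g, a, b):
    p0 = g  # background response at dose zero
    return (response(dose, g, a, b) - p0) / (1 - p0)

g, a, b = 0.05, -4.0, 1.2   # assumed fitted parameters (illustrative only)
bmr = 0.10                  # benchmark response: 10% extra risk
bmd10 = brentq(lambda d: extra_risk(d, g, a, b) - bmr, 1e-6, 1e6)
print(f"BMD10 = {bmd10:.2f}")

A BMDL would then be obtained by repeating this calculation on a lower confidence curve, for example by profile likelihood or by bootstrap, which is where the approaches compared in this chapter begin to differ.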


A number of commenters have expressed dissatisfaction with the "standard approach" as it often provides a single risk estimate, usually one meant to ensure that the risk is not underestimated. The EPA is evaluating approaches to provide "best estimates" of risk and to characterize more fully the uncertainties in risk estimates and the variability in risks across subpopulations. If by "best estimate" one seeks a value useful for estimating the "expected value" of risk, or an "expected value" estimate of total impact on a population, a challenge is that the expected value should reflect not just the single most likely value but the contribution of the full distribution of possible risk values. This is not an easy task. It involves complex judgments: evaluating a wealth of empirical data, deciding which studies are most relevant, how to extrapolate from diverse animals to humans, and how to extrapolate to lower doses. Much of that judgment is currently reflected in text discussions of the available studies and the underlying science.

This workshop was not meant to explore all these facets of the process, but rather to focus on one relatively well-defined question: Given a bioassay data set, how should we quantify the uncertainty in the dose–response relationship? Even within this restricted format, the results indicate that the scientific jury is still out on ways to improve current practice. That simple insight is perhaps the foremost result of this workshop.

The four modeling approaches deployed on the four bench test problems represent four different philosophies for capturing uncertainty in dose response. The approaches tested in the workshop, formulated in the introduction, are summarized as follows (a numerical sketch of the BIC weighting used in the fourth approach is given below):

1. Benchmark Dose Modeling (BMD). Choose the "best" model, and assess uncertainty assuming this model is true. Supplemental results can compare estimates obtained with different models, and sensitivity analyses can investigate other modeling issues.

2. Probabilistic Inversion with Isotonic Regression (PI-IR). Define model-independent "observational" uncertainty, and look for a model that captures that uncertainty by assuming that the selected model is true and providing a distribution over its parameters.

3. Nonparametric Bayes (NPB). Choose a prior mean response (potency) curve (potentially a "noninformative" prior) and a precision parameter to express prior uncertainty over all increasing dose–response relations, and update this prior distribution with the bioassay data.

4. Bayesian Model Averaging (BMA) (as applied here). Choose an initial set of models, and then estimate the parameters of each model by maximum likelihood. Use classical methods to estimate parameter uncertainty, given the truth of the model. Determine a probability weight for each model using the Bayes information criterion (BIC), and use these weights to average the model results.

The benchmark dose (BMD) model is the current standard approach for estimating doses corresponding to a target response level above background.
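The BIC weighting in the fourth approach is easy to state numerically. The sketch below computes BIC weights from maximized log-likelihoods and averages the BMD10s; all numbers are invented for illustration, and, as discussed below, averaging lower confidence limits in this way is not a substitute for propagating the full weighted distribution.

import numpy as np

# Hypothetical maximized log-likelihoods, parameter counts, and BMD10
# estimates for candidate models fit to the same bioassay data.
models = {
    "log-logistic": (-70.2, 2, 28.0),
    "multistage":   (-70.9, 3, 33.7),
    "weibull":      (-71.4, 2, 25.1),
}
n_obs = 50  # total number of observations

def bic(loglik, k, n):
    return -2.0 * loglik + k * np.log(n)

bics = np.array([bic(ll, k, n_obs) for ll, k, _ in models.values()])
weights = np.exp(-0.5 * (bics - bics.min()))   # BIC approximation to
weights /= weights.sum()                       # posterior model probabilities

bmd_avg = sum(w * bmd for w, (_, _, bmd) in zip(weights, models.values()))
for (name, (_, _, bmd)), w in zip(models.items(), weights):
    print(f"{name:12s} weight {w:.3f}  BMD10 {bmd}")
print(f"model-averaged BMD10 = {bmd_avg:.1f}")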


The other three methods could be considered somewhat "exotic": they are new, do not draw on a wealth of experience, and cannot rely on standardized software (although a number of researchers have been investigating the potential use of Bayesian model averaging for this task). For this workshop, expert modelers were asked to quantify the uncertainty as a function of dose for several example chemicals and data conditions. Although they were not specifically asked to derive benchmark doses, most did so for selected chemicals, notably for the 10% response level (BMD10) and the corresponding lower confidence limit (BMDL10). These estimates, and an indication of whether each approach captures parameter or model uncertainty, or both, are summarized in Table 7.1 and highlighted as follows (a sketch of the isotonic pooling step used by PI-IR follows this list):

• An estimate from a single "best" BMD model captures parameter uncertainty but not model uncertainty, and the uncertainty captured in the parameter distribution is conditional on the model being true. Assessors are advised to examine results from different models to see whether large differences exist; supplemental results can also be presented to show model sensitivity.

• The PI-IR approach captures both model and parameter uncertainty in the sense that the uncertainty captured is model independent and is represented as a distribution over model parameters. We might say that the uncertainty captured is not model dependent, but once captured, it is caged in the parameter space of a model chosen for that specific purpose.

• The NPB approach captures model-independent uncertainty and does not represent it as a distribution over model parameters, but rather as a distribution over response curves.

• The BMA approach as deployed here applies maximum likelihood estimates for the parameters of a chosen set of models and uses the BIC approximation to weight the different models. In this sense it combines classical parameter uncertainty and Bayesian model uncertainty. Weighting the BMDLs from the models is not expected, however, to yield a good estimate of the confidence limits that would result from considering the weighted distribution of uncertain estimates from the set of models as a whole.
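The isotonic step mentioned for PI-IR can be made concrete with a minimal sketch of the pooling-adjacent-violators (PAV) computation: whenever an observed response fraction decreases with dose, adjacent groups are pooled until the fitted responses are nondecreasing. The counts below are invented for the example.

def pav(fractions, weights):
    """Pool adjacent violators: weighted least-squares monotone fit."""
    blocks = [[f, w, 1] for f, w in zip(fractions, weights)]  # value, weight, size
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:          # monotonicity violated
            v0, w0, c0 = blocks[i]
            v1, w1, c1 = blocks[i + 1]
            blocks[i] = [(v0 * w0 + v1 * w1) / (w0 + w1), w0 + w1, c0 + c1]
            del blocks[i + 1]
            i = max(i - 1, 0)                        # re-check the previous block
        else:
            i += 1
    fit = []
    for v, _, c in blocks:
        fit.extend([v] * c)
    return fit

affected = [2, 10, 6, 4, 18]      # hypothetical counts per dose group
exposed  = [50, 50, 50, 50, 50]
print(pav([x / n for x, n in zip(affected, exposed)], exposed))
# -> [0.04, 0.1333, 0.1333, 0.1333, 0.36]: the dip is pooled away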

A first cursory conclusion is that there is a fairly wide spread in BMDL10 values. This is particularly true for the illustrative chemical persimonate. The BMD and NPB approaches differ by almost an order of magnitude, and PI-IR would return a value of zero. For frambozadrine, BMD modeling and PI-IR are in the same ballpark, but NPB is again an order of magnitude lower. This might suggest that the NPB method, which makes no assumptions that restrict attention to one or more model classes, factors in more uncertainty. However, for the first data set, NPB agrees reasonably with the BMD method and PI-IR is much lower.

TABLE 7.1 Summary of model results

Combined results per chemical (BMD10 / BMDL10):

Model (method/measure)                        Frambozadrine      Nectorine                       Persimonate
BMD: software, estimate per BMD guidance      33.7 / 12          a 15.2 | 70.1 / 11.3 | 39.9     3.6 / 2.86
BMD: parametric bootstrap, median             30.3 / 13.8        a 17.3 | 63.0 / 4.6 | 43.8      3.61 / 2.83
PI-IR                                         23.5 / 16.8        15.8 / 10.2                     b / b
NPB: mean                                     7.21 / 4.93        7.03 / 2.81                     2.92 / 0.32
NPB: MCMC mode                                d / d              d / d                           d / d
BMA                                           5.31 c / 2.91 c    a 14.2 | e / 11.06 | 40.63      d / d
Kopylev et al.                                d / d              11.24 / 8.69                    d / d

Uncertainty addressed:

Model            Parameter uncertainty                                Model uncertainty
BMD              Yes: parameter distributions conditional on the      No
                 model being true
PI-IR            Yes: uncertainty expressed as a distribution         Yes: uncertainty captured is not model dependent
                 over parameters
NPB              Yes: prior covers all increasing potency curves      Yes: uncertainty captured is not model dependent
BMA              No: each model uses MLE parameter values             Yes: probabilities for each of the chosen models are used
Kopylev et al.   Yes: diffuse prior with Bayesian updating            No: uses only the multistage model

a BMDLs are given for separate tumor types: adenoma | neuroblastoma; if split, the lower is the combined tumor value.
b The 95th percentile of the background probability was above the benchmark response (BMR).
c BMA over logistic models with dose, dose + gender, and dose + gender + dose–gender interaction.
d Not computed.
e Cannot be determined.


The BMA method restricts attention to a limited class of models. The flexibility of this approach is demonstrated in the data set for frambozadrine, where it is applied in a creative way to answer the question of whether to aggregate male and female rats within the logistic model. Results for the other cases were not reported. The PI-IR method selects a unique model type, based on its ability to capture observational uncertainty. Probability distributions as a function of dose are obtained for all data sets. For the last data set, the uncertainty in the background response rate was too high relative to the response rates at low doses to derive a BMDL10. The BMD approach derives answers without excessive computing time. Table 7.1 does not show the fact, illustrated in Swartout's chapter, that statistically "nearly equivalent" models may give quite different results for the BMD and BMDL.

By focusing on the problem of estimation within the range of the observed data, the workshop temporarily put aside questions of whether or how one can then extrapolate to the lower doses and risks of concern for environmental decision making. Most environmental exposure levels lie below the range typically observable in bioassay or human (typically occupational) epidemiology studies. The ultimate question therefore does involve either low-dose extrapolation or some other way to make decisions keying off the lower end of the observable range of data. The workshop also did not aim to address how best to consider the myriad biological and other scientific judgments involved in vetting various data sets and selecting which to prioritize for model applications. Nevertheless, this was clearly a key issue in discussing the approaches examined: the extent to which biology, chemistry, analytical errors, and other information should inform the statistical approaches used for the chosen data sets.

WHERE MIGHT THE CURRENT METHOD MERIT ENHANCEMENT?

One question is whether the calculation of the confidence limit on the benchmark dose for the low-response level truly captures the full range of even the statistical uncertainty (let alone the kinds of judgments applied to select a data set). If the current approach does not capture enough of the statistical uncertainty, other approaches might be worth exploring. Countervailing considerations as to whether an alternative approach is worth pursuing include whether it:

• works routinely—for example, does it converge to a solution?
• is computationally straightforward or intensive—does it take seconds or days to compute?
• can be applied reasonably easily, or requires many sophisticated judgments in its application?
• produces results that can be transparently explained and seem reasonable.


It is possible that methods that are not yet "suitable for prime time" could nevertheless be helpful in "recalcitrant" cases where the BMD approach is not sufficiently stable, or does not enclose the observed percentage responses in its 90% confidence bands. The first kind of recalcitrance is illustrated in Swartout's results for respiratory adenoma for the example chemical nectorine. The AIC values for the multistage and Weibull models do not differ significantly (143.6 vs. 143.9), yet the BMDL10s in these two cases are 11.3 (multistage) and 0.259 (Weibull) (see Tables 1.7 and 1.8); the short calculation below makes this contrast explicit. The frambozadrine and persimonate cases illustrate the second limitation: the best-fitting model seems to have trouble enclosing the observed response rates within the 90% uncertainty bounds (see Figures 1.2 and 1.6). Some discussants did not necessarily think this was a problem, since there can be sources of error other than the binomial sampling distribution, such as measurement error, that the modeling is not explicitly set up to address. When recalcitrant cases arise, we cannot simply switch to some alternative procedure; rather, we have to exercise judgment to a greater degree. It is at this point that some depth in the repertory of techniques for capturing uncertainty becomes useful. Models that are not yet ready for routine application may nonetheless prove valuable in difficult cases, where the additional analytical or computational burdens are justified. As experience and confidence are gained with other models, they may gradually move into more widely accepted use. Thus, while an alternative might not be a good candidate for replacing a standard method as the default approach, it is worth having available for cases where the default approach has difficulties or where the alternative would have real benefits.
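The near-equivalence can be checked directly from the AIC values quoted above:

import numpy as np

aic = {"multistage": 143.6, "weibull": 143.9}   # values reported above
delta = np.array(list(aic.values())) - min(aic.values())
weights = np.exp(-0.5 * delta)
weights /= weights.sum()
for name, w in zip(aic, weights):
    print(f"{name}: Akaike weight {w:.2f}")     # about 0.54 vs. 0.46
print(f"BMDL10 ratio: {11.3 / 0.259:.0f}")      # about 44

Two models that the data cannot meaningfully distinguish thus deliver lower confidence limits more than a factor of 40 apart.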

HOW COULD WE IMPROVE ON QUANTITATIVE CHARACTERIZATIONS OF UNCERTAINTY?

One criterion discussed in the workshop is whether the estimated confidence limits around the best-fitting model encompass a reasonable amount of the experimental data. If the confidence limits around the estimate do not include the actual observed data, one might conclude that the approach is giving a false sense of precision around the fitted curve and that the true uncertainty is likely broader than represented. At least one participant (PI-IR) started from the observation that this was not always the case and developed a method specifically designed to address that problem. A simple version of this coverage check is sketched below. The current state of knowledge about the complex processes underlying dose response is such that our current approaches might overly constrain possible outcomes by imposing a set of model specifications a priori. One approach examined in the workshop imposes almost no fixed dose–response shape other than monotonicity, has a rigorous mathematical basis, and produces as a result a Bayesian estimate of the probabilities of risk.
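One simple variant of the coverage check, assuming only binomial sampling error, is to ask whether each fitted response probability falls inside an exact (Clopper–Pearson) interval for the corresponding observed fraction; this is a sketch of the idea, not the procedure used by any workshop participant, and the dose groups below are invented.

from scipy.stats import beta

def clopper_pearson(x, n, level=0.90):
    """Exact binomial confidence interval for an observed fraction x/n."""
    lo = beta.ppf((1 - level) / 2, x, n - x + 1) if x > 0 else 0.0
    hi = beta.ppf(1 - (1 - level) / 2, x + 1, n - x) if x < n else 1.0
    return lo, hi

# (affected, total, model-fitted probability) per dose group:
groups = [(0, 50, 0.02), (4, 50, 0.05), (11, 50, 0.18), (21, 50, 0.25)]
for x, n, p_fit in groups:
    lo, hi = clopper_pearson(x, n)
    flag = "ok" if lo <= p_fit <= hi else "outside 90% interval"
    print(f"{x:2d}/{n}: fitted {p_fit:.2f}, interval ({lo:.3f}, {hi:.3f}) -> {flag}")

If several groups flag as outside, the fitted band is too narrow for the data, which is precisely the symptom that motivated the PI-IR construction.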


Related to the question of whether we can reasonably specify a model form a priori is the view that, while we may articulate a reasonable set of biologically plausible models, we do not estimate a good "expected value" of risk, or capture the full range of uncertainty, when we narrow down to the model that "best fits" the data without carrying forward into the results the other model forms that also seemed reasonable and should carry some weight in our estimates of the "best value" or of the uncertainty. A final approach was explored that does start with a constrained set of models but then carries forward some contribution from each reasonable model through a Bayesian averaging approach.

WHAT DID WE LEARN?

A key outcome is that the workshop expanded our modeling repertory, and this alone is an important achievement. We learned that there are workable alternatives to current practice, each with its own strengths and weaknesses.

• The PI-IR approach seems to capture observational uncertainty, and it seems readily applicable to integrated uncertainty analysis. On the other hand, it does not always work with "standard" dose–response models, and the process of finding suitable nonstandard models could be very time-consuming.

• The NPB approach has a solid mathematical foundation, but it is difficult to explain and is somewhat less successful in capturing observational uncertainty. Moreover it is not clear how NPB would work in an integrated uncertainty analysis.

• BMA is very flexible and could easily adapt to integrated uncertainty analysis. It is not clear whether the approximation used in selecting parameters for each model could readily be discharged so as to include full parameter uncertainty. Moreover the initial choice of models might require further argumentation.

Note that all of the "exotic" approaches are based on "research grade" software.

The pros and cons of different models can be debated on a philosophical level as long as the rivers flow and the trees grow. Real progress, however, is made when different approaches are applied to the same problems so that their performance can be compared. Moving the discussion from the philosophical to the applied level may be the greatest return of the workshop, and perhaps other modelers will be provoked to attempt similar analyses on these bench test problems. We need experience with the relatively “clean” problems like the bench test problems, in order to “toughen up” for the really hard problems, including


extrapolating from animals to humans, extrapolating to doses well below those tested in bioassay experiments, dealing with sensitive subpopulations, and accounting for sparse data. Existing techniques for addressing these problems for responses considered to have a threshold (noncarcinogenic effects, and in some cases the cancer endpoint) start with a point of departure or reference value, like a BMDL, and divide this value by "uncertainty factors." The practice of using uncertainty factors dates from 1954 (Lehman and Fitzhugh) and remains controversial. Almost all reviewers stressed that the really hard problem lies in extrapolating beyond "clean example" bioassay data. It is important for policy makers to develop the depth of field to look beyond current methods. One might imagine factoring in more exotic types of data, such as data from allometric scaling studies (per discussant Rhomberg), drug-testing studies on children and other sensitive groups (per discussant Hattis), or even structured expert judgment. How would our present methods for capturing uncertainty deal with these new types of data? If indeed the new methods afford more modeling flexibility, we may cherish the hope that they will be better able to deal with difficult data. We may find that current standard BMD modeling has the greatest difficulty in dealing with exotic data and in moving beyond the era of uncertainty factors. With such reflections we have reached the boundary of what can be concluded from this workshop. Having taken the first step, testing approaches for characterizing the "easy" uncertainty related to fitting curves to observed data, subsequent bench test workshops can focus on the much more complicated set of extrapolations. The ultimate aim is to characterize dose response better, in order to better understand potential health effects at exposure levels environmentally relevant to humans.

AUTHOR INDEX

Abdel-Rahman M. 6, 14 Adgate J.L. 14 Ammann L.P. 111, 146 Andersen M.E. 154, 159 Andregg R.J. 159 Antoniak C.A. 111, 146 Baird S.J.S. 6, 14 Baldwin L.A. 6, 14 Banati P. 14, 159 Barlow R.E. 54, 80, 116, 146 Bartholomew D.J. 80, 116, 146 Bedford T.J. 57, 81 Best N. 205 Bogdanffy M.S. 159 Bogen K.T. 154, 159, 197–198, 204 Bregman L.M. 60, 80 Bremner J.M. 80, 146 Brooks S. 200, 204 Brown P.J. 166, 178 Brunk H.D. 80, 146 Burmaster D.E. 14, 159 Burzala L. 4, 7, 84, 111, 146, 148, 150, 152, 160

Bus J.S. 159 Budtz-Jørgensen E. 167, 178 Calabrese E.J. 6, 14 Carlin J.B. 204 Carlson-Lynch H.L. 15 Casella G. 200, 204 Censor Y. 60, 81 Chappell C.L. 215 Chen C.J. 179 Chib S. 170, 200, 204 Chiu W. 7, 42, 48, 104 Chu M. 7, 183, 195 Clyde M. 167, 169, 178 Cohen J.T. 14 Cohen S.D. 159 Conolly R.B. 154, 159 Cooke R.M. 1, 44–45, 51, 57, 75, 81, 107, 147 Cox D.R. 79, 81 Csiszar I. 60 David R.M. 159 de Melker H.E. 215, 216


Deming W.E. 57, 81 Disch D. 111, 146 Doerrer N.G. 159 Dorman D.C. 159 Dourson M.L. 6, 14–15 Doyle M.P. 215 Draper D. 174, 178 Du C. 57, 60, 81 Evans J.S. 6, 14 Fearn T. 178 Felter S.P. 14 Ferguson T.S. 111, 146 Fienberg S.E. 60, 81 Fitzhugh O.G. 14, 224 Gart J.J. 170, 179 Gauthier J. 14 Gaylor D.W. 6, 14, 159 Gelfand A.E. 111, 113, 146 Gelman A. 200, 204 George E.I. 169–170, 178–179 Gilks W. 200, 204 Goble R. 7, 14, 97–98, 104, 159 Govindarajulu Z. 146 Graham J.D. 14 Grandjean P. 178 Greenberg E. 200, 204 Haberman S.J. 60, 81 Hattis D. 6–7, 14, 98, 104, 153, 155, 159, 164 Havelaar A.H. 212, 216 Hertzberg R.C. 33 Hinkley D.V. 79, 81 Hoeting J.A. 167–168, 179 Ibrahim J.G. 179 Ireland C.T. 60, 81 Kadry A.M. 6, 14 Kalberlah F. 6, 14 Kasuga F. 215 Keenan R.E. 15 Keiding N. 178 Knauf L.A. 14, 33 Kodell R.L. 6, 14 Kopylev L. 6, 197, 200, 204, 220


Kraan B. 51, 75, 81 Kramer H.J. 15 Kretzschmar M.E.E. 215 Krewski D. 85, 179 Krogfelt K.A. 215 Kronmal R.A. 179 Kruithof J. 57, 81 Kullback S. 60, 81 Kuo L. 111, 113–115, 146 Kuo T.L. 179 Kurowicka D. 57, 81 La D.K. 156, 159 Laud P.W. 111, 146 Lazarus N.T. 6, 15 Lee P.N. 179 Lee T.-M. 204 Lehman A.J. 14, 224 Lent A. 60, 81 Lewis S.C. 14 Lightfoot N.F. 207, 216 Lilly P.D. 159 Linneberg A. 215 Louis T.A. 82, 85 Madigan D. 179 Marcus A. 7, 47 Matus F. 60, 81 Mazzuchi T.A. 111, 148, 150, 152 McCulloch R.E. 170, 179 Mertens P.L.J.M. 216 Messner M. 7, 180, 194 Molbak K. 215 Morales K.H. 166–167, 179 Mukhopadhyay S. 111, 146 Nauta M. 215 Nessel C.S. 6, 14 Nicolich M. 14 NRC (National Research Council) 1983. Risk Assessment in the Federal Government: Managing the Process 2, 14, 183 NRC 1989. Improving Risk Communication 2, 14 NRC 1994. Science and Judgment in Risk Assessment 2, 14, 96, 197, 204


NRC 1996. Understanding Risk: Informing Decisions in a Democratic Society 2, 14 Ogden I.D. 215–216 Okhuysen P.C. 215 Pelekis M. 6, 14 Pieters M.N. 6, 15, 208, 215 Poole D. 85 Preuss P. 217 Price P.S. 15 Raftery A.E. 85, 166, 169, 179–181 Ramgopal P. 111, 146 Ramsey F.L. 111, 146 Renwick A.G. 6, 15 Rhomberg L.R. 6, 7, 15, 87, 95–96, 107, 109, 224 Richardson S. 179–181, 204 Robinson D. 14 Rogers J.M. 159 Roord J.J. 216 Rotariu O. 215 Rubin D.B. 204 Ryan L. 4, 7, 101, 148, 152, 165, 179–181, 186, 188, 194 Schellekens J.F.P. 215–216 Schneider K. 14 Schoeny R.S. 33 Schuhmacher-Wolz U. 14 Schwarz G. 169, 179 Setzer R.W. 159 Shaked M. 111, 146 Shinagawa K. 216 Shlyakhter A.I. 14 Simonsen J. 215 Singpurwalla N.D. 111, 146 Slikker W. Jr. 154, 159 Slob W. 6, 15, 208, 215 Smith A.F.M. 111, 146, 200, 204 Spiegelhalter D. 200, 204–205 Stara J.F. 6, 14 Staubrer K.L. 14 Stephan F.F. 57, 81 Stern H.S. 204

Stevenson H. 15 Stiteler W.M. 17, 33 Strachan N.J.C. 211, 215–216 Strid M.A. 215 Swartout J.C. 5–7, 14–15, 17, 37, 40, 47 Swenberg J.A. 159 Takumi K. 216 Tarone R.E. 179 Teunis P.F.M. 207–208, 210–216 Thomas A. 82, 85, 111, 205, 206 Tillett H.E. 207, 216 Tuomisto J. 7, 47 Tusnady G. 60, 81 U.S. EPA 2004. An examination of EPA risk assessment principles and practices 15, 179 U.S. EPA 2005. Guidelines for Carcinogen Risk Assessment 33, 193, 198, 205 U.S. EPA 2006. Help Manual for Benchmark Dose Software—Version 1.4 205 U.S. EPA 2008. Benchmark dose methodology: benchmark dose software (BMDS) 8, 15 van den Brandhof W. 215 van den Kerkhof H. 215 van Pelt W. 215 Vannucci M. 178 Vermeire T. 6, 15 Versteegh F.G.A. 214–216 Viallefont V. 166, 179–181 Volinsky C. 166, 179–180 Vomlel J. 60, 81 Wagenaar J. 215 Wahrendorf J. 179 Wallace K. 159 Wasserman L. 169, 177, 179 Weihe P. 178 White P. 204, 217 White R.F. 178 Whitney M. 4, 7, 148–149, 152, 165, 180–181, 186–188, 194 Wolff S.K. 6, 15 Wu M.M. 179

SUBJECT INDEX

accelerated life testing 4, 111 activation rate 4, 154–155 Akaike Information Criterion (AIC) 4, 9–10, 12, 18–20, 25, 29–30, 35–36, 43, 48, 55, 60, 62, 75–76, 174–175, 200 aleatoric uncertainty 181 allometric scaling 75, 224 Bayes Information Criterion (BIC) 5, 169–170, 172, 174–178, 188, 195, 218–219 Bayesian Model Averaging (BMA) 4–5, 7, 148–149, 166–169, 173, 175–178, 180–182, 184, 186, 194–196, 219–221, 223 Benchmark Dose (BMD) 4, 6–8, 10–12, 20–24, 26, 31–35, 41, 43–44, 52, 55–57, 60–62, 64, 70–71, 74–79, 86, 105, 115–117, 119–120, 124–125, 128–129, 133, 135, 138, 141, 145, 148, 160–163, 166–168, 170–172, 177, 187–189, 194–195, 199–203, 217–222, 224 Benchmark Dose Lower Bound (BMDL) 10–12, 20–24, 26, 31–32, 34–41, 43–44, 61, 67, 69–70, 148–149, 160–163, 167–168, 171–172, 177, 187–188, 195–196, 200, 202–203, 217, 221, 224 benchmark dose software 4, 8, 165, 170, 177, 194 benchmark response 10, 18, 21, 38, 148, 160–161, 167, 220 binomial uncertainty 18, 54, 56, 75, 78, 116, 118–119, 121–123, 135, 137–138, 207, 210 bootstrap 4–5, 17–22, 25–29, 31–33, 42, 44, 46, 167, 171–172, 181, 187, 194, 220 cytotoxicity 88–89 Dirichlet process prior 5, 147 Effective Dose (ED) 20–22, 26, 31–32 Effective Dose Lower Bound (EDL) 20–22, 26, 31–32 epistemic uncertainty 88, 181 extra risk 10, 61, 199–203 frambozadrine 8, 12–13, 17–23, 29, 31–32, 37, 40, 43, 46, 52, 54–55, 57, 62–63, 65–67, 70, 72, 74–77, 88, 101, 120–133, 173, 188, 219, 220–222

benchmark dose software 4, 8, 165, 170, 177, 194 benchmark response 10, 18, 21, 38, 148, 160–161, 167, 220 binomial uncertainty 18, 54, 56, 75, 78, 116, 118–119, 121–123, 135, 137–138, 207, 210 bootstrap 4–5, 17–22, 25–29, 31–33, 42, 44, 46, 167, 171–172, 181, 187, 194, 220 cytotoxicity 88–89 dirichlet process prior 5, 147 Effective Dose (ED) 20–22, 26, 31–32 Effective Dose Lower bound (EDL) 20–22, 26, 31–32 epistemic uncertainty 88, 181 extra risk 10, 61, 199–203 frambozadrine 8, 12–13, 17–23, 29, 31–32, 37, 40, 43, 46, 52, 54–55, 57, 62–63, 65–67, 70, 72, 74–77, 88, 101, 120–133, 173, 188, 219, 220–222

Uncertainty Modeling in Dose Response: Bench Testing Environmental Toxicity Edited by Roger M. Cooke Copyright © 2009 by John Wiley & Sons, Inc.

229

230 gamma model 47 gibbs sampler 113, 115, 121, 136, 148 Gibbs Variable Selection (GVS) 169–170, 174–177 hyperkeratosis 12, 19, 62, 101–102, 120, 155, 173 Inhalation Unit Risks (IUR) 23, 25, 27–28 isotonic regression 4–5, 7, 51–52, 54, 75, 83–84, 87, 91–92, 100–102, 107–108, 110, 152, 218, 220 isotonic uncertainty 74–75, 79, 116, 121–122 Iterative Proportional Fitting (IPF) 52, 57–60 log logistic model 8–11, 56–57, 60–61, 78 log probit model 76, 92 log-logistic model 56–57, 61, 78–79 low-dose extrapolation 84, 94, 103, 108, 212, 221 Markov Chain Monte Carlo (MCMC) 113, 160–161, 169, 171, 176, 178, 181, 195–198, 200, 220 Michaelis-Menten 154, 156–157 model uncertainty 26, 51, 83–84, 94, 166, 174, 181, 189–191, 193, 195, 219 monotonicity 36, 48–49, 52, 82–85, 87, 92–93, 102, 107–108, 110, 121–122, 124–125, 129, 133, 135, 137–138, 144–146, 178, 222 multistage model 20–27, 29, 31–33, 35, 46, 62, 76, 92, 156, 198, 200, 204, 220 nectorine 8, 12–13, 17, 23, 25–28, 31, 33–35, 52, 57, 69–70, 74, 76, 78, 88–89, 129, 132–135, 198, 200–203, 220, 222 non-parametric bayes 111, 150, 152, 218, 220 non-parametric bayesian 112, 115–116, 122–135, 137, 139–145, 147, 152

SUBJECT INDEX

observational uncertainty 52, 54, 57–58, 61, 65, 71, 75, 87, 91, 93–94, 96, 100, 102, 105, 107–110, 116, 145, 218, 221, 223 olfactory epithelial neuroblastoma 12, 23, 25–28, 69, 129, 133–134 parameter uncertainty 95, 189, 218–219 persimonate 13, 17, 29–32, 35, 41, 43, 45, 52, 70, 72–75, 79, 80, 88, 135–145, 219–220, 222 pharmacodynamics 89 pharmacokinetics 89, 184, 192 Point Of Departure (POD) 6, 20 Pooling Adjacent Violators (PAV) 54–55, 102, 116, 129, 146, 151 potency curve 115, 117, 120, 129, 135, 147–148, 160–163, 218, 220 probabilistic inversion 51–52, 57, 60, 75–76, 78, 83, 85, 87, 91, 93, 96–97, 102, 107–108, 218, 220 probit model 76, 92, 171 profile likelihood 18, 42, 94, 202 Reference Dose (RfD) 20 Resources for the Future 1, 45, 51, 107, 109, 147, 165 respiratory epithelial adenoma 12–13, 23, 25–28, 33–35, 69, 129, 133–134, 200 reverse jump MCMC 178, 181 Safe Drinking Water Act 2 S-PLUS 17 U.S. EPA Office of Research and Development (ORD) 1, 4, 17, 34 U.S. EPA National Center of Environmental Assessment (NCEA) 1, 6–7, 150 Weibull model 12, 20–27, 29, 31–33, 35–36, 170, 177, 222 WinBUGS 175, 200–203